I have a Python script that processes tennis match data stored in a Pandas DataFrame (tennis_data_processed). Each row represents a single match from 2010 to 2023, including details about the tournament, match, and the two players involved. There’s also a target variable that indicates whether Player1 won (1) or Player1 lost (0).
I’m attempting to add two new features, player1_h2h and player2_h2h, which represent the head-to-head record of the two players in each match. The idea is to count the number of wins each player has against the other player before the current match.
The code I’ve implemented works well when the target is 1 (indicating Player1 won). However, there’s an issue when the target is 0 (indicating Player1 lost and therefore Player2 won). In this case, the head-to-head record is not updated correctly for the subsequent match: the win is credited to the player who lost.
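One possible source of this kind of mix-up is that target is relative to whichever player sits in the player1_id column, so target == 1 does not identify the same person when the pair appears in swapped order. The sketch below illustrates this on hypothetical toy rows; the `toy` DataFrame and the `winner_id` column are assumptions for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: the same pair appears in both column orders.
toy = pd.DataFrame({
    "player1_id": ["A", "B", "A"],
    "player2_id": ["B", "A", "B"],
    "target":     [1,   1,   0],
})

# Counting target == 1 mixes a win by A (row 0) with a win by B (row 1).
naive_player1_wins = int((toy["target"] == 1).sum())

# Deriving an explicit winner makes the count independent of column order.
toy["winner_id"] = np.where(
    toy["target"] == 1, toy["player1_id"], toy["player2_id"]
)
wins_by_player = toy["winner_id"].value_counts().to_dict()

print(naive_player1_wins)  # 2, although A actually won only once
print(wins_by_player)
```

Here A has one win and B has two, while the naive count of target == 1 rows reports two wins without saying whose they are.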
The table I have before trying to create the h2h features is:
tourney_date | player1_id | player2_id | target |
---|---|---|---|
2012-01-16 | A | B | 1 |
2012-01-16 | C | D | 0 |
2012-03-27 | B | A | 1 |
2012-03-27 | D | C | 0 |
2012-04-29 | A | B | 1 |
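For reproducibility, the table above can be rebuilt as a small DataFrame. This is only a sketch: the name `sample` is hypothetical, and parsing tourney_date with pd.to_datetime is an assumption about how the dates are stored.

```python
import pandas as pd

# Rows copied from the table above. tourney_date is parsed to datetime
# so that date comparisons behave chronologically.
sample = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["2012-01-16", "2012-01-16", "2012-03-27", "2012-03-27", "2012-04-29"]
    ),
    "player1_id": ["A", "C", "B", "D", "A"],
    "player2_id": ["B", "D", "A", "C", "B"],
    "target": [1, 0, 1, 0, 1],
})

print(sample)
```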
The table I want as a result (I’ll show what it should look like for a head-to-head between two specific players, but it should be done for all matches):
tourney_date | player1_id | player2_id | target | player1_h2h | player2_h2h |
---|---|---|---|---|---|
2012-01-16 | A | B | 0 | 0 | 0 |
2012-01-27 | A | B | 0 | 0 | 1 |
2012-03-14 | B | A | 1 | 2 | 0 |
2015-01-20 | A | B | 0 | 0 | 3 |
2020-10-07 | B | A | 1 | 1 | 3 |
2020-10-15 | A | B | 1 | 3 | 2 |
To do this I have the following code:
def calculate_head2head(row, player1_col, player2_col, target_col):
    # Identify player1 and player2 based on the target
    player1_id = row[player1_col] if row[target_col] == 1 else row[player2_col]
    player2_id = row[player2_col] if row[target_col] == 1 else row[player1_col]

    # Identify the player who won in the previous row
    prev_target = 1 - row[target_col]  # Switch 1 to 0 and vice versa
    prev_won_player_id = row[player2_col] if prev_target == 1 else row[player1_col]

    # Filter relevant matches for head-to-head calculation
    matches = tennis_data_processed[
        ((tennis_data_processed[player1_col] == player1_id) & (tennis_data_processed[player2_col] == player2_id)) |
        ((tennis_data_processed[player1_col] == player2_id) & (tennis_data_processed[player2_col] == player1_id))
    ]

    # Count the number of wins for the players
    player1_wins = matches[(matches[target_col] == 1) & (matches['tourney_date'] < row['tourney_date'])].shape[0]
    player2_wins = matches[(matches[target_col] == 0) & (matches['tourney_date'] < row['tourney_date'])].shape[0]

    # Adjust wins if player1 is now in player2 column
    if row[target_col] == 0:
        player1_wins, player2_wins = player2_wins, player1_wins
        prev_won_player_id = player2_id  # Update the previous winner to player2

        prev_matches = tennis_data_processed.loc[
            (tennis_data_processed.index < row.name) &
            (tennis_data_processed[target_col] == prev_target)
        ].sort_values(by='tourney_date', ascending=False)

        if not prev_matches.empty:
            if row['tourney_date'] > prev_matches.iloc[0]['tourney_date']:
                if prev_target == 0:
                    player2_wins += 1
                else:
                    player1_wins += 1

    return player1_wins, player2_wins

# Apply the function row-wise to calculate head-to-head records
tennis_data_processed[['player1_h2h', 'player2_h2h']] = tennis_data_processed.apply(
    lambda row: calculate_head2head(row, 'player1_id', 'player2_id', 'target'),
    axis=1,
    result_type="expand"
)
But with this code the resulting DataFrame is:
tourney_date | player1_id | player2_id | target | player1_h2h | player2_h2h |
---|---|---|---|---|---|
2012-01-16 | A | B | 0 | 0 | 0 |
2012-01-27 | A | B | 0 | 1 | 0 |
2012-03-14 | B | A | 1 | 0 | 2 |
2015-01-20 | A | B | 0 | 2 | 1 |
2020-10-07 | B | A | 1 | 1 | 3 |
2020-10-15 | A | B | 1 | 4 | 1 |
With my code, when the target is 0 (indicating Player1 lost and therefore Player2 won), the head-to-head record is being updated for the player who lost the previous match. When the target is 1, the win is added to the correct player.
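For comparison, here is a minimal sketch of one way the intended feature could be computed: keep a running tally of wins keyed by (winner, loser) and read it before each match is recorded. It is not the code from the question. The function name `add_head2head` and the `sample` data (rows copied from the first table) are hypothetical, and the sketch assumes that rows sharing the same tourney_date can be processed in their existing order, so two matches between the same pair on the same date would see each other's result.

```python
from collections import defaultdict

import pandas as pd


def add_head2head(df, player1_col="player1_id", player2_col="player2_id",
                  target_col="target", date_col="tourney_date"):
    """Return a copy of df with prior head-to-head win counts per row."""
    # Stable sort keeps the existing order for matches on the same date.
    out = df.sort_values(date_col, kind="stable").copy()
    wins = defaultdict(int)  # (winner_id, loser_id) -> wins recorded so far
    p1_h2h, p2_h2h = [], []
    for p1, p2, target in zip(out[player1_col], out[player2_col], out[target_col]):
        # Read the counts as they stood before this match.
        p1_h2h.append(wins[(p1, p2)])
        p2_h2h.append(wins[(p2, p1)])
        # Then credit the actual winner, whichever column they are in.
        if target == 1:
            wins[(p1, p2)] += 1
        else:
            wins[(p2, p1)] += 1
    out["player1_h2h"] = p1_h2h
    out["player2_h2h"] = p2_h2h
    return out.sort_index()  # restore the original row order


# Hypothetical sample: rows copied from the first table in the question.
sample = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["2012-01-16", "2012-01-16", "2012-03-27", "2012-03-27", "2012-04-29"]
    ),
    "player1_id": ["A", "C", "B", "D", "A"],
    "player2_id": ["B", "D", "A", "C", "B"],
    "target": [1, 0, 1, 0, 1],
})

result = add_head2head(sample)
print(result)
```

Because the tally is keyed by the players' identities rather than by column position, no swap is needed when target is 0, and each row is visited once instead of filtering the whole DataFrame per row.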