pandas - Calculating head-to-head records in a DataFrame based on target values in Python

I have a Python script that processes tennis match data stored in a Pandas DataFrame (tennis_data_processed). Each row represents a single match from 2010 to 2023, including details about the tournament, match, and the two players involved. There’s also a target variable that indicates whether Player1 won (1) or Player1 lost (0).

I’m attempting to add two new features, player1_h2h and player2_h2h, which represent the head-to-head record of the two players in each match. The idea is to count the number of wins each player has against the other player before the current match.

The code I’ve implemented works well when the target is 1 (indicating Player1 won). However, there’s an issue when the target is 0 (indicating Player1 lost and therefore Player2 won). In this case, the head-to-head record is not updated correctly for the subsequent match: the win is added to the player who lost.

The table I have before trying to create the h2h features is:

tourney_date	player1_id	player2_id	target
2012-01-16	A	B	1
2012-01-16	C	D	0
2012-03-27	B	A	1
2012-03-27	D	C	0
2012-04-29	A	B	1
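
Since the target is always relative to player1_id, it can help to first make the winner of each row explicit, independent of which column that player sits in. The snippet below is only a small sketch on the sample rows above; the column names winner_id and loser_id are hypothetical helpers, not part of the original data.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the table above
matches = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["2012-01-16", "2012-01-16", "2012-03-27", "2012-03-27", "2012-04-29"]
    ),
    "player1_id": ["A", "C", "B", "D", "A"],
    "player2_id": ["B", "D", "A", "C", "B"],
    "target": [1, 0, 1, 0, 1],
})

# target == 1 means player1 won, target == 0 means player2 won
player1_won = matches["target"] == 1

# Hypothetical helper columns: who actually won and lost each match
matches["winner_id"] = np.where(player1_won, matches["player1_id"], matches["player2_id"])
matches["loser_id"] = np.where(player1_won, matches["player2_id"], matches["player1_id"])

print(matches[["player1_id", "player2_id", "target", "winner_id", "loser_id"]])
```

With the winner stored by id, a win can be credited to the right player regardless of whether that player appears as player1 or player2 in a later match.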

The table I want as a result (I’ll show what it should look like for a head-to-head between two specific players, but it should be done for all matches):

tourney_date	player1_id	player2_id	target	player1_h2h	player2_h2h
2012-01-16	A	B	0	0	0
2012-01-27	A	B	0	0	1
2012-03-14	B	A	1	2	0
2015-01-20	A	B	0	0	3
2020-10-07	B	A	1	1	3
2020-10-15	A	B	1	3	2

To do this I have the following code:

def calculate_head2head(row, player1_col, player2_col, target_col):
    # Identify player1 and player2 based on the target
    player1_id = row[player1_col] if row[target_col] == 1 else row[player2_col]
    player2_id = row[player2_col] if row[target_col] == 1 else row[player1_col]

    # Identify the player who won in the previous row
    prev_target = 1 - row[target_col]  # Switch 1 to 0 and vice versa
    prev_won_player_id = row[player2_col] if prev_target == 1 else row[player1_col]

    # Filter relevant matches for head-to-head calculation
    matches = tennis_data_processed[
        ((tennis_data_processed[player1_col] == player1_id) & (tennis_data_processed[player2_col] == player2_id)) |
        ((tennis_data_processed[player1_col] == player2_id) & (tennis_data_processed[player2_col] == player1_id))
    ]

    # Count the number of wins for the players
    player1_wins = matches[(matches[target_col] == 1) & (matches['tourney_date'] < row['tourney_date'])].shape[0]
    player2_wins = matches[(matches[target_col] == 0) & (matches['tourney_date'] < row['tourney_date'])].shape[0]

    # Adjust wins if player1 is now in player2 column
    if row[target_col] == 0:
        player1_wins, player2_wins = player2_wins, player1_wins
        prev_won_player_id = player2_id  # Update the previous winner to player2

    prev_matches = tennis_data_processed.loc[
        (tennis_data_processed.index < row.name) & 
        (tennis_data_processed[target_col] == prev_target)
    ].sort_values(by='tourney_date', ascending=False)

    if not prev_matches.empty:
        if row['tourney_date'] > prev_matches.iloc[0]['tourney_date']:
            if prev_target == 0:
                player2_wins += 1
            else:
                player1_wins += 1

    return player1_wins, player2_wins

# Apply the function row-wise to calculate head-to-head records
tennis_data_processed[['player1_h2h', 'player2_h2h']] = tennis_data_processed.apply(
    lambda row: calculate_head2head(row, 'player1_id', 'player2_id', 'target'),
    axis=1,
    result_type="expand"
)

But with this code the resulting DataFrame is:

tourney_date	player1_id	player2_id	target	player1_h2h	player2_h2h
2012-01-16	A	B	0	0	0
2012-01-27	A	B	0	1	0
2012-03-14	B	A	1	0	2
2015-01-20	A	B	0	2	1
2020-10-07	B	A	1	1	3
2020-10-15	A	B	1	4	1

With my code, when the target is 0 (indicating Player1 lost and therefore Player2 won), the head-to-head record is updated for the player who lost the previous match. When the target is 1, the win is added to the correct player.
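
For comparison, here is a minimal sketch of a different technique from the function above: a single chronological pass that keeps a running tally of wins keyed by (winner, loser). The names add_h2h and wins are hypothetical. It also assumes that rows sharing the same tourney_date are already in playing order, so it counts all earlier rows rather than only rows with a strictly earlier date.

```python
from collections import defaultdict

import pandas as pd


def add_h2h(df, p1="player1_id", p2="player2_id",
            target="target", date="tourney_date"):
    """Return a copy of df with prior head-to-head win counts per row.

    Hypothetical helper: assumes target == 1 means the player in column
    p1 won, and that rows with equal dates are already in playing order.
    """
    # Stable sort keeps the original order of rows that share a date
    out = df.sort_values(date, kind="stable").copy()

    wins = defaultdict(int)  # (winner_id, loser_id) -> wins so far
    p1_h2h, p2_h2h = [], []

    for a, b, t in zip(out[p1], out[p2], out[target]):
        # Record the tallies as they stood BEFORE this match
        p1_h2h.append(wins[(a, b)])  # wins of player1 over player2
        p2_h2h.append(wins[(b, a)])  # wins of player2 over player1

        # Credit the current match to whoever actually won it
        winner, loser = (a, b) if t == 1 else (b, a)
        wins[(winner, loser)] += 1

    out["player1_h2h"] = p1_h2h
    out["player2_h2h"] = p2_h2h
    return out


# Sample rows from the first table in the question
sample = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["2012-01-16", "2012-01-16", "2012-03-27", "2012-03-27", "2012-04-29"]
    ),
    "player1_id": ["A", "C", "B", "D", "A"],
    "player2_id": ["B", "D", "A", "C", "B"],
    "target": [1, 0, 1, 0, 1],
})

result = add_h2h(sample)
print(result)
```

Because the tally is keyed by player ids rather than by column position, it does not matter whether a player appears as player1 or player2 in a later match. The pass is also linear in the number of rows, whereas filtering the whole DataFrame inside apply repeats that work for every row.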
