Plotting Event Data with Python
It’s been some time since I last posted a tutorial, let alone one in Python, so now seems as good a time as any to get back to it. In this tutorial I am going to run through plotting match events from StatsBomb using Python and Matplotlib. We are going to load the StatsBomb open data set using their Python package and then plot data from a few different scenarios. So let’s get started.
First we need to load the important libraries into Python.
# Read in libraries
import json
from statsbombpy import sb # Used to obtain StatsBomb data.
import pandas as pd # Read and manipulate data.
import numpy as np # Numerical helpers.
from pandas import json_normalize # Flatten nested JSON (top-level import since pandas 1.0).
import matplotlib.pyplot as plt # Plotting data
from mplsoccer.pitch import Pitch # Draw football pitches.
Now that we have the libraries, we can start to call the StatsBomb library for some data. We have a few options for this, but first let’s see what competitions are available to us. From the free datasets, we have the following women’s competitions to look at.
comps = sb.competitions()
comps[comps.competition_gender == 'female']
credentials were not supplied. open data access only
| | competition_id | season_id | country_name | competition_name | competition_gender | season_name | match_updated | match_available |
---|---|---|---|---|---|---|---|---|
15 | 37 | 42 | England | FA Women’s Super League | female | 2019/2020 | 2020-08-12T11:24:04.483090 | 2020-08-12T11:24:04.483090 |
16 | 37 | 4 | England | FA Women’s Super League | female | 2018/2019 | 2020-07-29T05:00 | 2020-07-29T05:00 |
32 | 49 | 3 | United States of America | NWSL | female | 2018 | 2020-07-29T05:00 | 2020-07-29T05:00 |
34 | 72 | 30 | International | Women’s World Cup | female | 2019 | 2020-07-29T05:00 | 2020-07-29T05:00 |
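If you would rather not copy the IDs by hand, the same kind of filter can look them up for you. This is only a minimal sketch: the `comps` frame below is a hand-built stand-in with a few columns copied from the table above (so it runs offline), and `pick_ids` is a hypothetical helper of my own, not part of statsbombpy.

```python
import pandas as pd

# Stand-in for sb.competitions(): a few columns copied from the table above.
comps = pd.DataFrame({
    "competition_id": [37, 37, 49, 72],
    "season_id": [42, 4, 3, 30],
    "competition_name": [
        "FA Women's Super League",
        "FA Women's Super League",
        "NWSL",
        "Women's World Cup",
    ],
    "competition_gender": ["female"] * 4,
    "season_name": ["2019/2020", "2018/2019", "2018", "2019"],
})


def pick_ids(comps: pd.DataFrame, name: str, season: str) -> tuple:
    """Hypothetical helper: return (competition_id, season_id) for one season."""
    mask = (comps["competition_name"] == name) & (comps["season_name"] == season)
    row = comps.loc[mask].iloc[0]  # raises IndexError if nothing matches
    return int(row["competition_id"]), int(row["season_id"])


competition_id, season_id = pick_ids(comps, "FA Women's Super League", "2019/2020")
print(competition_id, season_id)
```

The two values can then be passed straight to `sb.matches(competition_id=..., season_id=...)`.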
Now that we have this, let’s find a single match to pull data from. We will use the 2019/2020 FA Women’s Super League, which has competition_id 37 and season_id 42.
matches = sb.matches(competition_id=37, season_id=42)
matches.head(5)
credentials were not supplied. open data access only
| | match_id | match_date | kick_off | competition | season | home_team | away_team | home_score | away_score | match_status | last_updated | match_week | competition_stage | stadium | referee | data_version | shot_fidelity_version | xy_fidelity_version |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2275054 | 2020-01-05 | 15:00:00.000 | England - FA Women’s Super League | 2019/2020 | Brighton & Hove Albion WFC | Liverpool WFC | 1 | 0 | available | 2020-07-29T05:00 | 11 | Regular Season | NaN | NaN | 1.1.0 | 2 | 2 |
1 | 2275072 | 2020-01-05 | 13:30:00.000 | England - FA Women’s Super League | 2019/2020 | Chelsea FCW | Reading WFC | 3 | 1 | available | 2020-07-29T05:00 | 11 | Regular Season | The Cherry Red Records Stadium | S. Pearson | 1.1.0 | 2 | 2 |
2 | 2275085 | 2020-01-05 | 15:00:00.000 | England - FA Women’s Super League | 2019/2020 | Tottenham Hotspur Women | Manchester City WFC | 1 | 4 | available | 2020-07-29T05:00 | 11 | Regular Season | The Hive Stadium | H. Conley | 1.1.0 | 2 | 2 |
3 | 2275113 | 2020-01-19 | 16:00:00.000 | England - FA Women’s Super League | 2019/2020 | West Ham United LFC | Brighton & Hove Albion WFC | 2 | 1 | available | 2020-07-29T05:00 | 13 | Regular Season | The Rush Green Stadium | Ryan Atkin | 1.1.0 | 2 | 2 |
4 | 19800 | 2019-03-14 | 20:30:00.000 | England - FA Women’s Super League | 2019/2020 | Arsenal WFC | Bristol City WFC | 4 | 0 | available | 2020-08-12T11:24:04.483090 | 1 | Regular Season | Meadow Park | R. Whitton | 1.1.0 | None | None |
We can just use the first match on the list (match_id 2275054) and pull all of its events. For this tutorial, we will pull the event data as a split dataset, meaning the events come back already separated by type, so we can pick out the ones we want to look at. This will allow us to create a few different visuals for this match.
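To make the idea of a split dataset concrete, here is a rough sketch in plain pandas of turning one flat events table into a dictionary of DataFrames keyed by event type. The toy `events` frame and the `split_by_type` helper are assumptions for illustration only; statsbombpy does this for you and uses its own key names, which are not reproduced here.

```python
import pandas as pd

# Toy events table, made up for illustration.
events = pd.DataFrame({
    "type": ["Pass", "Shot", "Pass", "Carry", "Shot"],
    "minute": [0, 3, 4, 5, 8],
    "team": [
        "Brighton & Hove Albion WFC",
        "Liverpool WFC",
        "Liverpool WFC",
        "Liverpool WFC",
        "Brighton & Hove Albion WFC",
    ],
})


def split_by_type(events: pd.DataFrame) -> dict:
    """Hypothetical helper: map each event type to its own DataFrame."""
    return {
        event_type: group.reset_index(drop=True)
        for event_type, group in events.groupby("type")
    }


split_events = split_by_type(events)
print(sorted(split_events))
print(split_events["Shot"])
```

Each value is an ordinary DataFrame, so everything we do with the shots and passes later works the same way on any other event type.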
Shots
The first thing we will plot is shots from a single match. We have the match from above, so now we can pull its events and pick out a specific type of event. First we will take the shots from our eventdata set to create a single shot plot.
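# Call the event API through the statsbombpy package.
# split=True returns a dictionary of DataFrames, one per event type.
# Note: on newer statsbombpy releases you may need to add flatten_attrs=False
# so that nested columns such as 'shot' stay as dicts, which the code below expects.
eventdata = sb.events(match_id=2275054, split=True)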
# Split the shot events from the rest of the data.
shotevents = eventdata['shots']
# Split the location data in to x/y values.
# Location data is provided as a list which is harder to use.
shotevents[['location_x', 'location_y']] = shotevents['location'].apply(pd.Series)
# Define columns we want to keep further down.
shotCols = ['statsbomb_xg', 'end_location_x', 'end_location_y', 'end_location_z']
# Create a function to split list-valued columns into separate values.
# In practice this splits end_location ([x, y, z]) out of the shot column.
def parse_function(data) -> pd.DataFrame:
    df = pd.DataFrame(data)
    dfcolumns = df.columns
    for i in dfcolumns:
        try:
            # StatsBomb stores locations as [x, y, z], so name the new columns in that order.
            df[[str(i) + '_x', str(i) + '_y', str(i) + '_z']] = df[i].apply(pd.Series)
            df = df.drop(i, axis=1)
        except ValueError:
            # Columns that do not expand to exactly three values are left as they are.
            pass
    return df
# Run the data through the parse function and keep the columns above.
shot_df = parse_function(shotevents['shot'].apply(pd.Series))
shot_df = shot_df[shotCols]
# Merge the data together in to one dataframe.
for col in shotCols:
    shotevents[col] = shot_df[col]
shotevents.head(5)
credentials were not supplied. open data access only
| | id | index | period | timestamp | minute | second | type | possession | possession_team | play_pattern | … | shot | match_id | under_pressure | out | location_x | location_y | statsbomb_xg | end_location_x | end_location_y | end_location_z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3a4692e6-631c-47f4-8d34-644531797698 | 115 | 1 | 00:03:37.333 | 3 | 37 | Shot | 10 | Liverpool WFC | From Goal Kick | … | {‘one_on_one’: True, ‘statsbomb_xg’: 0.1886289… | 2275054 | NaN | NaN | 108.9 | 52.3 | 0.188629 | 120.0 | 28.1 | 0.2 |
1 | a49554c0-8b60-4eb0-9949-526cfcb6d54e | 262 | 1 | 00:08:22.408 | 8 | 22 | Shot | 22 | Brighton & Hove Albion WFC | From Throw In | … | {‘statsbomb_xg’: 0.007219963, ‘end_location’: … | 2275054 | NaN | NaN | 86.5 | 56.2 | 0.007220 | 117.8 | 42.1 | 0.2 |
2 | de542aa0-a50e-4318-b006-c4fe6cb23b41 | 642 | 1 | 00:18:49.169 | 18 | 49 | Shot | 48 | Liverpool WFC | From Corner | … | {‘statsbomb_xg’: 0.12033855, ‘end_location’: [… | 2275054 | NaN | NaN | 115.7 | 39.1 | 0.120339 | 120.0 | 38.6 | 4.9 |
3 | 6f812987-8b59-42cc-b699-bc9337b6269a | 705 | 1 | 00:20:56.064 | 20 | 56 | Shot | 52 | Liverpool WFC | From Corner | … | {‘statsbomb_xg’: 0.37038276, ‘end_location’: [… | 2275054 | NaN | NaN | 113.3 | 45.4 | 0.370383 | 120.0 | 45.1 | 0.2 |
4 | b4bd0579-da0a-46a0-9669-776989838113 | 870 | 1 | 00:27:55.377 | 27 | 55 | Shot | 60 | Liverpool WFC | Regular Play | … | {‘statsbomb_xg’: 0.011415341, ‘end_location’: … | 2275054 | NaN | NaN | 93.0 | 21.3 | 0.011415 | 120.0 | 45.0 | 4.5 |
5 rows × 27 columns
With our dataset, we had a few steps to work through to get a clean dataframe. For example, our shot column is a dict, meaning we need to parse out these values before we can use them easily in our pitch plots below.
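As an aside, pandas’ `json_normalize` can do part of this parsing. The sketch below is an alternative under my own assumptions, not the approach used in this tutorial: the toy `shots` frame only imitates the shape of the nested shot column, and its values are illustrative.

```python
import pandas as pd

# Toy stand-in for the nested 'shot' column; values are illustrative only.
shots = pd.DataFrame({
    "shot": [
        {"statsbomb_xg": 0.19, "end_location": [120.0, 28.1, 0.2], "outcome": {"name": "Saved"}},
        {"statsbomb_xg": 0.01, "end_location": [117.8, 42.1], "outcome": {"name": "Blocked"}},
    ]
})

# Flatten the dicts: nested dicts become dotted columns such as 'outcome.name'.
flat = pd.json_normalize(shots["shot"].tolist())
flat.index = shots.index

# json_normalize leaves lists untouched, so expand end_location ourselves.
# Shots recorded without a height get NaN for z.
end = pd.DataFrame(flat["end_location"].tolist(), index=flat.index)
end = end.reindex(columns=range(3))
end.columns = ["end_location_x", "end_location_y", "end_location_z"]

shots = pd.concat([shots, flat[["statsbomb_xg", "outcome.name"]], end], axis=1)
print(shots[["statsbomb_xg", "end_location_x", "end_location_y", "end_location_z"]])
```

The trade-off is that you name the list columns explicitly instead of relying on a try/except to discover them.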
Now that we have our values, we can create our shot plot using the Matplotlib and mplsoccer libraries.
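# Setup the pitch
# Note: newer mplsoccer releases (1.0+) take figsize and tight_layout in pitch.draw()
# rather than in Pitch(), so move those arguments if this call raises a TypeError.
figsize = (16, 8)
pitch = Pitch(figsize=figsize, tight_layout=False, goal_type='box', pitch_color='#aabb97', line_color='white', stripe_color='#c2d59d', stripe=True)
fig, ax = pitch.draw()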
# Store team names
t1name = shotevents.team.iloc[0]
t2name = list(set(shotevents.team.unique()) - set([t1name]))[0]
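# Split data by team
# Copy the slice so the coordinate changes below do not trigger a SettingWithCopyWarning.
team1 = shotevents[shotevents.team == t1name].copy()
# Mirror team 1 so it attacks the opposite goal (the StatsBomb pitch is 120 x 80).
team1['location_x'] = 120 - team1['location_x']
team1['location_y'] = 80 - team1['location_y']
team1['end_location_x'] = 120 - team1['end_location_x']
team1['end_location_y'] = 80 - team1['end_location_y']
team2 = shotevents[shotevents.team == t2name]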
# Plot starting locations
t1 = pitch.scatter(team1.location_x, team1.location_y, s=team1.statsbomb_xg*500, ax=ax, color="red", edgecolors="k", label="LFC")
t2 = pitch.scatter(team2.location_x, team2.location_y, s=team2.statsbomb_xg*500, ax=ax, color="darkblue", edgecolors="k", label="BHA")
# Plot the shot directions
lt1 = pitch.lines(team1.location_x, team1.location_y, team1.end_location_x, team1.end_location_y, ax=ax, alpha=0.2, color="red", comet=True, label="LFC Shot")
lt2 = pitch.lines(team2.location_x, team2.location_y, team2.end_location_x, team2.end_location_y, ax=ax, alpha=0.2, color="blue", comet=True, label="BHA Shot")
# Add a legend and a title to our plot
legend = ax.legend(loc='lower center', labelspacing=1, fontsize=12, ncol=4)
title = ax.set_title(f'Shots of {t1name} vs {t2name}', fontsize = 18)
There we have a nice-looking shot plot, with a line for each shot and the size of each dot scaled by the xG of the shot. This didn’t take much code, and the mplsoccer library really makes the pitch look great.
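Two small transformations sit behind that plot: one team’s coordinates are mirrored so the teams attack opposite goals, and xG is multiplied by 500 to get a marker size. Here is a minimal sketch of both, assuming the 120 x 80 StatsBomb pitch used in the plotting code; `mirror` is a hypothetical helper and the single row is taken from the first shot in the table above.

```python
import pandas as pd

PITCH_LENGTH, PITCH_WIDTH = 120, 80  # StatsBomb pitch dimensions


def mirror(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with start and end points moved to the other end."""
    out = df.copy()  # work on a copy so the original frame is not modified
    for prefix in ("location", "end_location"):
        out[f"{prefix}_x"] = PITCH_LENGTH - out[f"{prefix}_x"]
        out[f"{prefix}_y"] = PITCH_WIDTH - out[f"{prefix}_y"]
    return out


# First shot from the table above.
shots = pd.DataFrame({
    "location_x": [108.9],
    "location_y": [52.3],
    "end_location_x": [120.0],
    "end_location_y": [28.1],
    "statsbomb_xg": [0.188629],
})

mirrored = mirror(shots)
sizes = shots["statsbomb_xg"] * 500  # marker area grows with xG

print(mirrored[["location_x", "location_y", "end_location_x", "end_location_y"]])
print(sizes)
```

Returning a copy keeps the original shot data intact, which is handy if you want to reuse it for another plot.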
Using comet=True also gives each shot a really nice-looking comet-style line that suits the image well. Let’s give passes a go next, using just the lines.
Passes
This time with our pass plot, we will do something slightly different and create a subplot to stack one team on top of the other. This will stop the plot looking crowded with both teams on the same figure. First we need to get our data, so let’s do the same thing as with our shots.
# Split the pass events from the rest of the data.
passevents = eventdata['passes']
# Split the location data in to x/y values.
# Location data is provided as a list which is harder to use.
passevents[['location_x', 'location_y']] = passevents['location'].apply(pd.Series)
# Define columns we want to keep further down.
passCols = ['end_location_x', 'end_location_y', 'outcome_name']
# Create a function to split two-value columns into separate values.
# We rely on it to split end_location ([x, y]) out of the pass column.
def pass_parse_function(data) -> pd.DataFrame:
    df = pd.DataFrame(data)
    dfcolumns = df.columns
    for i in dfcolumns:
        try:
            df[[str(i) + '_x', str(i) + '_y']] = df[i].apply(pd.Series)
        except ValueError:
            # Columns that do not expand to exactly two values are skipped.
            pass
    return df
# Run the data through the parse function.
pass_df = pass_parse_function(passevents['pass'].apply(pd.Series))
# The outcome is a dict with 'id' and 'name'; completed passes have no outcome (NaN).
passoutcomes = pass_df['outcome'].apply(pd.Series)
pass_df['outcome_name'] = passoutcomes['name']
# Merge the columns we want in to one dataframe.
for col in passCols:
    passevents[col] = pass_df[col]
passevents.head(5)
| | id | index | period | timestamp | minute | second | type | possession | possession_team | play_pattern | … | pass | match_id | under_pressure | off_camera | counterpress | location_x | location_y | end_location_x | end_location_y | outcome_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | cb8110ef-c586-479d-8aaf-52d991c1a6da | 5 | 1 | 00:00:00.014 | 0 | 0 | Pass | 2 | Brighton & Hove Albion WFC | From Kick Off | … | {‘recipient’: {‘id’: 22337, ‘name’: ’Maya Le T… | 2275054 | NaN | NaN | NaN | 61.0 | 40.1 | 37.0 | 42.3 | NaN |
1 | 2f58f14d-8cad-4d89-be9c-aa942e9acc32 | 8 | 1 | 00:00:02.664 | 0 | 2 | Pass | 2 | Brighton & Hove Albion WFC | From Kick Off | … | {‘recipient’: {‘id’: 16383, ‘name’: ’Danique K… | 2275054 | NaN | NaN | NaN | 36.2 | 39.7 | 29.6 | 56.2 | NaN |
2 | e3fc9388-b818-49b4-bded-0eb34194cfa6 | 12 | 1 | 00:00:06.966 | 0 | 6 | Pass | 2 | Brighton & Hove Albion WFC | From Kick Off | … | {‘recipient’: {‘id’: 22337, ‘name’: ’Maya Le T… | 2275054 | NaN | NaN | NaN | 21.4 | 58.8 | 19.5 | 34.8 | NaN |
3 | 513cb3e7-e938-4a1a-a163-b598d7f8ed76 | 16 | 1 | 00:00:09.939 | 0 | 9 | Pass | 2 | Brighton & Hove Albion WFC | From Kick Off | … | {‘recipient’: {‘id’: 16400, ‘name’: ’Kayleigh … | 2275054 | NaN | NaN | NaN | 21.2 | 34.2 | 65.5 | 75.7 | Incomplete |
4 | 01634478-ec2a-4fa2-b9ec-5d9064a8e6b6 | 18 | 1 | 00:00:13.524 | 0 | 13 | Pass | 2 | Brighton & Hove Albion WFC | From Kick Off | … | {‘recipient’: {‘id’: 15631, ‘name’: ’Niamh Cha… | 2275054 | NaN | NaN | NaN | 54.6 | 4.4 | 71.1 | 0.1 | Out |
5 rows × 26 columns
Now we have our data, we can create our plot. This time, we are going to build our subplot as the axis and then add our pitch to each subplot. We also need to specify our figure size within the subplot creation so we don’t get a small plot. Let’s see how this turns out.
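# Setup the pitch
# Note: newer mplsoccer releases (1.0+) do not take figsize in Pitch();
# the figure size is already set in plt.subplots() below, so drop it here if needed.
figsize = (25, 16)
pitchpass = Pitch(figsize=figsize, goal_type='box', pitch_color='#aabb97', line_color='white', stripe_color='#c2d59d', stripe=True)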
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=figsize)
# Draw the pass pitch (pitchpass, not the earlier shot pitch) on each subplot.
pitchpass.draw(ax=ax[0])
pitchpass.draw(ax=ax[1])
# Split data by team
passteam1 = passevents[passevents.team == t1name]
passteam2 = passevents[passevents.team == t2name]
# Create a boolean value to filter the data below for
# complete and incomplete passes.
compass = passteam1.outcome_name.isna()
compass2 = passteam2.outcome_name.isna()
# Plot each pass as a line from its start to its end location
t1 = pitchpass.lines(passteam1[compass].location_x, passteam1[compass].location_y, passteam1[compass].end_location_x, passteam1[compass].end_location_y, ax=ax[0], color="gold", label="Completed Passes", comet=True, lw=2, transparent=True)
t1incom = pitchpass.lines(passteam1[~compass].location_x, passteam1[~compass].location_y, passteam1[~compass].end_location_x, passteam1[~compass].end_location_y, ax=ax[0], color="red", label="Incomplete Passes", comet=True, lw=2, transparent=True)
t2 = pitchpass.lines(passteam2[compass2].location_x, passteam2[compass2].location_y, passteam2[compass2].end_location_x, passteam2[compass2].end_location_y, ax=ax[1], color="gold", label="Completed Passes", comet=True, lw=2, transparent=True)
t2incom = pitchpass.lines(passteam2[~compass2].location_x, passteam2[~compass2].location_y, passteam2[~compass2].end_location_x, passteam2[~compass2].end_location_y, ax=ax[1], color="red", label="Incomplete Passes", comet=True, lw=2, transparent=True)
# Add a legend and a title to our plot
legend = ax[0].legend(loc='lower center', labelspacing=1, fontsize=12, ncol=4)
title = ax[0].set_title(f'Passes of {t1name}', fontsize = 18)
# Add a legend and a title to our plot
legend = ax[1].legend(loc='lower center', labelspacing=1, fontsize=12, ncol=4)
title = ax[1].set_title(f'Passes of {t2name}', fontsize = 18)
How good is this? With the comet lines we can see where each pass starts and ends, and the colours make it easy to tell completed passes from incomplete ones.
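The same complete/incomplete split can also be turned into a number. As a rough sketch, and assuming (as in the plotting code) that a missing `outcome_name` means the pass was completed, here is a per-team completion rate. The Brighton rows mirror the five passes shown in the table above; the Liverpool rows are made up for illustration.

```python
import pandas as pd

# Brighton outcomes follow the five rows shown earlier; Liverpool rows are invented.
passes = pd.DataFrame({
    "team": ["Brighton & Hove Albion WFC"] * 5 + ["Liverpool WFC"] * 2,
    "outcome_name": [None, None, None, "Incomplete", "Out", None, "Incomplete"],
})

# A missing outcome means the pass reached a team-mate.
passes["completed"] = passes["outcome_name"].isna()

# Count attempts and completions per team, then convert to a percentage.
summary = passes.groupby("team")["completed"].agg(["size", "sum"])
summary = summary.rename(columns={"size": "attempted", "sum": "completed"})
summary["completion_pct"] = 100 * summary["completed"] / summary["attempted"]

print(summary)
```

A figure like this would sit nicely in each subplot title next to the team name.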
Coming from R, coding these plots feels like it takes a lot more work, but in reality it is very similar, just without the pipe operator. Overall, I have to say I really like how these turned out.
Hope you all enjoyed this tutorial / walkthrough of creating plots using Matplotlib in Python. I had fun creating these and will be looking to use these more in the future.