Reformatting Statsbomb Data in Python

Last updated on Apr 17, 2020 21 min read Python, StatsBomb

For this tutorial, I am going to carry on from where we finished in my first tutorial, which you can find here. In that tutorial, we downloaded and installed the statsbombpy library and ran through the basic calls to download the free data released by Statsbomb.

In this tutorial, I am going to start with the basic call to get the match events from a single game, and start to parse out some of the information embedded within the file we receive from the call. It’s important to note that the basic call in the Statsbomb library parses the JSON file into a “tidy” dataframe. This means we are working with a Pandas dataframe and not with the raw JSON file. I will parse a raw JSON file in a future tutorial.

So let’s get started. First, we need to import the libraries that we are going to use.

# Read in appropriate libraries
from statsbombpy import sb # Statsbomb library to obtain data
import pandas as pd # Used to read in and manipulate data
import numpy as np # Used to help manipulate data

Once we have imported the libraries, we need to call the sb.events function to get the events from a single match. We can then have a look at the result by calling the head function, which prints the first few rows (10 in this case).

To view some of this data, you will need to scroll to the right of the tables presented below. This will apply to all tables in this blog and unfortunately was not something I could adjust.

### Run function to call the events from a single match
## Add match_id here
match = 2275038

## Run the event function using the assigned match from above
match_events = sb.events(match_id = match)
match_events.head(10)

credentials were not supplied. open data access only

	50_50	bad_behaviour	ball_receipt	ball_recovery	block	carry	clearance	counterpress	dribble	duel	duration	foul_committed	foul_won	goalkeeper	half_start	id	index	injury_stoppage	interception	location	match_id	minute	miscontrol	off_camera	out	pass	period	play_pattern	player	position	possession	possession_team	related_events	second	shot	substitution	tactics	team	timestamp	type	under_pressure
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	9579b6c0-b747-4ab7-9aa4-9aff4b852827	1	NaN	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	NaN	0	NaN	NaN	{‘formation’: 41212, ‘lineup’: [{‘player’: {’i…	Reading WFC	00:00:00.000	Starting XI	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	a4a12e95-b01a-4042-879f-45c3d992e969	2	NaN	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	NaN	0	NaN	NaN	{‘formation’: 4231, ‘lineup’: [{‘player’: {’id…	West Ham United LFC	00:00:00.000	Starting XI	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.100000	NaN	NaN	NaN	{‘late_video_start’: True}	da9c5398-dae9-4a3d-b821-fd600b54a55d	3	NaN	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	[035f18f5-8767-475f-b96b-b1548c2fd642]	0	NaN	NaN	NaN	West Ham United LFC	00:00:00.000	Half Start	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.100000	NaN	NaN	NaN	{‘late_video_start’: True}	035f18f5-8767-475f-b96b-b1548c2fd642	4	NaN	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	[da9c5398-dae9-4a3d-b821-fd600b54a55d]	0	NaN	NaN	NaN	Reading WFC	00:00:00.000	Half Start	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	5a4cee02-6737-48e2-918f-724080b37471	1524	NaN	NaN	NaN	2275038	45	NaN	NaN	NaN	NaN	2	Regular Play	NaN	NaN	96	Reading WFC	[f0bd2ba7-a946-4414-b04f-aeeae0928f31]	0	NaN	NaN	NaN	West Ham United LFC	00:00:00.000	Half Start	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	f0bd2ba7-a946-4414-b04f-aeeae0928f31	1525	NaN	NaN	NaN	2275038	45	NaN	NaN	NaN	NaN	2	Regular Play	NaN	NaN	96	Reading WFC	[5a4cee02-6737-48e2-918f-724080b37471]	0	NaN	NaN	NaN	Reading WFC	00:00:00.000	Half Start	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.771676	NaN	NaN	NaN	NaN	c0cc1e3b-af5c-448c-82cc-08e546a72f5b	5	NaN	NaN	[61.0, 40.1]	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 10251, ‘name’: ’Fara Will…	1	From Kick Off	Jade Moore	Center Defensive Midfield	2	Reading WFC	[99c4a406-f3d4-4bd0-b5b0-3a0598ae54dd]	0	NaN	NaN	NaN	Reading WFC	00:00:00.046	Pass	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.831446	NaN	NaN	NaN	NaN	7172fc12-eaf2-4e9c-9f82-16950a04cfa7	8	NaN	NaN	[54.8, 40.5]	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 15725, ‘name’: ’Natasha H…	1	From Kick Off	Fara Williams	Center Attacking Midfield	2	Reading WFC	[91ca5def-0b84-4d2d-9313-68418f3e1b3a]	0	NaN	NaN	NaN	Reading WFC	00:00:00.897	Pass	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.307655	NaN	NaN	NaN	NaN	188815ee-705c-4deb-951b-97348bf7838f	16	NaN	NaN	[33.2, 2.8]	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 18147, ‘name’: ’Kate Long…	1	From Kick Off	Laura Vetterlein	Left Back	2	Reading WFC	[a0d2f369-a161-424e-85a2-419a4fc693da]	8	NaN	NaN	NaN	West Ham United LFC	00:00:08.899	Pass	NaN
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.176197	NaN	NaN	NaN	NaN	91685ff0-0d95-497a-9a8d-af14a6851ef6	25	NaN	NaN	[77.4, 74.1]	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 15725, ‘name’: ’Natasha H…	1	From Kick Off	Fara Williams	Center Attacking Midfield	2	Reading WFC	[dc4d1ac5-1444-438d-8d5b-572b9707048b]	13	NaN	NaN	NaN	Reading WFC	00:00:13.385	Pass	NaN

So we have 41 columns of data, but we can’t see them all, as Pandas hides some of the columns so as not to display too much information on the page. We can, however, print the column headers to see what we have to work with, or we can change some Pandas options to print all 41 columns for us. So let’s change some options so we can also see what the data in these columns looks like.

### Change Pandas options to print max columns
pd.set_option('display.max_columns', None)

### Reprint head of data sorting by the minute column
match_events.sort_values('minute').head(10)

	50_50	bad_behaviour	ball_receipt	ball_recovery	block	carry	clearance	counterpress	dribble	duel	duration	foul_committed	foul_won	goalkeeper	half_start	id	index	injury_stoppage	interception	location	match_id	miscontrol	off_camera	out	pass	period	play_pattern	player	position	possession	possession_team	related_events	second	shot	substitution	tactics	team	timestamp	type	under_pressure
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	9579b6c0-b747-4ab7-9aa4-9aff4b852827	1	NaN	NaN	NaN	2275038	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	NaN	0	NaN	NaN	{‘formation’: 41212, ‘lineup’: [{‘player’: {’i…	Reading WFC	00:00:00.000	Starting XI	NaN
764	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	a0d2f369-a161-424e-85a2-419a4fc693da	18	NaN	NaN	[44.3, 3.7]	2275038	NaN	NaN	NaN	NaN	1	From Kick Off	Kate Longhurst	Left Defensive Midfield	2	Reading WFC	[188815ee-705c-4deb-951b-97348bf7838f, 2103761…	10	NaN	NaN	NaN	West Ham United LFC	00:00:10.207	Ball Receipt*	True
765	NaN	NaN	{‘outcome’: {‘id’: 9, ‘name’: ‘Incomplete’}}	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	dc4d1ac5-1444-438d-8d5b-572b9707048b	26	NaN	NaN	[108.1, 70.0]	2275038	NaN	NaN	NaN	NaN	1	From Kick Off	Natasha Harding	Right Back	2	Reading WFC	[91685ff0-0d95-497a-9a8d-af14a6851ef6]	15	NaN	NaN	NaN	Reading WFC	00:00:15.562	Ball Receipt*	NaN
766	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3bcacf01-dbde-4597-8c43-408c6212da68	28	NaN	NaN	[6.8, 31.6]	2275038	NaN	NaN	NaN	NaN	1	From Free Kick	Anne Moorhouse	Goalkeeper	3	West Ham United LFC	[126b174d-4d51-43a5-9952-4a5657dc93b9]	29	NaN	NaN	NaN	West Ham United LFC	00:00:29.562	Ball Receipt*	NaN
767	NaN	NaN	{‘outcome’: {‘id’: 9, ‘name’: ‘Incomplete’}}	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	f559c7a8-cd68-4256-a0ac-96232523949e	32	NaN	NaN	[43.0, 75.0]	2275038	NaN	NaN	NaN	NaN	1	From Free Kick	Cecilie Redisch Kvamme	Right Back	3	West Ham United LFC	[0f00f803-10bc-42dd-bdde-0123bee8b0c5]	33	NaN	NaN	NaN	West Ham United LFC	00:00:33.215	Ball Receipt*	NaN
768	NaN	NaN	{‘outcome’: {‘id’: 9, ‘name’: ‘Incomplete’}}	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	41477a37-41b3-48ef-842b-203dc3da9db6	35	NaN	NaN	[71.0, 67.8]	2275038	NaN	NaN	NaN	NaN	1	From Throw In	Leanne Kiernan	Center Attacking Midfield	4	West Ham United LFC	[e4045b66-f292-48b2-af77-228460807a6f]	58	NaN	NaN	NaN	West Ham United LFC	00:00:58.621	Ball Receipt*	NaN
1132	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1042ba43-e741-4487-b5c5-6977d53e64b9	1521	NaN	NaN	[35.7, 44.3]	2275038	NaN	NaN	NaN	NaN	1	Regular Play	Kristine Leine	Left Center Back	96	Reading WFC	[b43398f4-1953-4e94-83a0-944a3f72cb77]	0	NaN	NaN	NaN	Reading WFC	00:00:00.212	Ball Receipt*	NaN
1418	NaN	NaN	NaN	NaN	NaN	{‘end_location’: [84.6, 72.5]}	NaN	NaN	NaN	NaN	0.610300	NaN	NaN	NaN	NaN	901ab741-eee0-4ae2-920f-24e23f4da695	11	NaN	NaN	[77.7, 75.4]	2275038	NaN	NaN	NaN	NaN	1	From Kick Off	Natasha Harding	Right Back	2	Reading WFC	[018f464d-055c-459e-b966-3f3dc6f60b19, 91ca5de…	3	NaN	NaN	NaN	Reading WFC	00:00:03.729	Carry	True
1419	NaN	NaN	NaN	NaN	NaN	{‘end_location’: [33.2, 2.8]}	NaN	NaN	NaN	NaN	4.470062	NaN	NaN	NaN	NaN	29dbf328-c64f-4974-a703-14d060348f1d	14	NaN	NaN	[33.2, 7.6]	2275038	NaN	NaN	NaN	NaN	1	From Kick Off	Laura Vetterlein	Left Back	2	Reading WFC	[188815ee-705c-4deb-951b-97348bf7838f, 5b12bc3…	4	NaN	NaN	NaN	West Ham United LFC	00:00:04.429	Carry	True
1420	NaN	NaN	NaN	NaN	NaN	{‘end_location’: [43.1, 4.3]}	NaN	NaN	NaN	NaN	0.543583	NaN	NaN	NaN	NaN	4b94b9e0-f80b-44d1-ab38-4bbbdd549c51	19	NaN	NaN	[44.3, 3.7]	2275038	NaN	NaN	NaN	NaN	1	From Kick Off	Kate Longhurst	Left Defensive Midfield	2	Reading WFC	[21037614-fe96-4eaa-af62-6d661507cc37, 3456b6d…	10	NaN	NaN	NaN	West Ham United LFC	00:00:10.207	Carry	True
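
The other option mentioned above, printing just the column headers, avoids changing any display settings. A minimal sketch, using a small made-up dataframe (toy_events is a hypothetical stand-in for match_events, so nothing needs downloading):

```python
import pandas as pd

# Hypothetical stand-in for match_events, with only three of the columns
toy_events = pd.DataFrame({
    "minute": [0, 0],
    "team": ["Reading WFC", "West Ham United LFC"],
    "type": ["Starting XI", "Starting XI"],
})

# shape is (rows, columns), so the second value is the column count
n_rows, n_cols = toy_events.shape

# columns.tolist() gives the headers as a plain Python list
column_names = toy_events.columns.tolist()

print(n_cols)
print(column_names)
```

Running the same two calls on match_events should report the 41 columns described above.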

Great, now we can see all 41 columns and the data they contain. This should help us with rearranging and parsing out the data we need. The location column, for example, provides an x / y value as a list within the dataframe. This will need to be separated out before we can save this file or use it effectively.

To do this, we are going to use pd.Series. From what I have read online, this is a slower method than using a Numpy tolist method, but it handles NaN values much more easily, which is something we need to deal with in this dataset.

### First copy our dataframe so the original is left untouched
match_events_split = match_events.copy()

### Apply our split renaming columns for when we split the column
match_events_split[['location_x', 'location_y']] = match_events_split['location'].apply(pd.Series)

### Drop our location column as we don't need this anymore
match_events_split = match_events_split.drop('location', axis = 1)

### View the top of our file again to see this worked
match_events_split.head(10)

	50_50	bad_behaviour	ball_receipt	ball_recovery	block	carry	clearance	counterpress	dribble	duel	duration	foul_committed	foul_won	goalkeeper	half_start	id	index	injury_stoppage	interception	match_id	minute	miscontrol	off_camera	out	pass	period	play_pattern	player	position	possession	possession_team	related_events	second	shot	substitution	tactics	team	timestamp	type	under_pressure	location_x	location_y
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	9579b6c0-b747-4ab7-9aa4-9aff4b852827	1	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	NaN	0	NaN	NaN	{‘formation’: 41212, ‘lineup’: [{‘player’: {’i…	Reading WFC	00:00:00.000	Starting XI	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	a4a12e95-b01a-4042-879f-45c3d992e969	2	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	NaN	0	NaN	NaN	{‘formation’: 4231, ‘lineup’: [{‘player’: {’id…	West Ham United LFC	00:00:00.000	Starting XI	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.100000	NaN	NaN	NaN	{‘late_video_start’: True}	da9c5398-dae9-4a3d-b821-fd600b54a55d	3	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	[035f18f5-8767-475f-b96b-b1548c2fd642]	0	NaN	NaN	NaN	West Ham United LFC	00:00:00.000	Half Start	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.100000	NaN	NaN	NaN	{‘late_video_start’: True}	035f18f5-8767-475f-b96b-b1548c2fd642	4	NaN	NaN	2275038	0	NaN	NaN	NaN	NaN	1	Regular Play	NaN	NaN	1	Reading WFC	[da9c5398-dae9-4a3d-b821-fd600b54a55d]	0	NaN	NaN	NaN	Reading WFC	00:00:00.000	Half Start	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	5a4cee02-6737-48e2-918f-724080b37471	1524	NaN	NaN	2275038	45	NaN	NaN	NaN	NaN	2	Regular Play	NaN	NaN	96	Reading WFC	[f0bd2ba7-a946-4414-b04f-aeeae0928f31]	0	NaN	NaN	NaN	West Ham United LFC	00:00:00.000	Half Start	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	f0bd2ba7-a946-4414-b04f-aeeae0928f31	1525	NaN	NaN	2275038	45	NaN	NaN	NaN	NaN	2	Regular Play	NaN	NaN	96	Reading WFC	[5a4cee02-6737-48e2-918f-724080b37471]	0	NaN	NaN	NaN	Reading WFC	00:00:00.000	Half Start	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.771676	NaN	NaN	NaN	NaN	c0cc1e3b-af5c-448c-82cc-08e546a72f5b	5	NaN	NaN	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 10251, ‘name’: ’Fara Will…	1	From Kick Off	Jade Moore	Center Defensive Midfield	2	Reading WFC	[99c4a406-f3d4-4bd0-b5b0-3a0598ae54dd]	0	NaN	NaN	NaN	Reading WFC	00:00:00.046	Pass	NaN	61.0	40.1
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.831446	NaN	NaN	NaN	NaN	7172fc12-eaf2-4e9c-9f82-16950a04cfa7	8	NaN	NaN	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 15725, ‘name’: ’Natasha H…	1	From Kick Off	Fara Williams	Center Attacking Midfield	2	Reading WFC	[91ca5def-0b84-4d2d-9313-68418f3e1b3a]	0	NaN	NaN	NaN	Reading WFC	00:00:00.897	Pass	NaN	54.8	40.5
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.307655	NaN	NaN	NaN	NaN	188815ee-705c-4deb-951b-97348bf7838f	16	NaN	NaN	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 18147, ‘name’: ’Kate Long…	1	From Kick Off	Laura Vetterlein	Left Back	2	Reading WFC	[a0d2f369-a161-424e-85a2-419a4fc693da]	8	NaN	NaN	NaN	West Ham United LFC	00:00:08.899	Pass	NaN	33.2	2.8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2.176197	NaN	NaN	NaN	NaN	91685ff0-0d95-497a-9a8d-af14a6851ef6	25	NaN	NaN	2275038	0	NaN	NaN	NaN	{‘recipient’: {‘id’: 15725, ‘name’: ’Natasha H…	1	From Kick Off	Fara Williams	Center Attacking Midfield	2	Reading WFC	[dc4d1ac5-1444-438d-8d5b-572b9707048b]	13	NaN	NaN	NaN	Reading WFC	00:00:13.385	Pass	NaN	77.4	74.1
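
For comparison, the tolist route can be made to cope with the NaN rows by only converting the rows that actually have a location. This is a sketch on made-up data rather than a benchmarked replacement; the toy_events dataframe and its values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for match_events: one row with no location, two with
toy_events = pd.DataFrame({
    "type": ["Starting XI", "Pass", "Pass"],
    "location": [np.nan, [61.0, 40.1], [54.8, 40.5]],
})

# Boolean mask of the rows that actually carry an [x, y] list
has_location = toy_events["location"].notna()

# Build the x / y columns from those rows only, keeping their row index
coords = pd.DataFrame(
    toy_events.loc[has_location, "location"].tolist(),
    index=toy_events.index[has_location],
    columns=["location_x", "location_y"],
)

# Drop the list column and join the new columns back on by index
toy_events_split = toy_events.drop(columns="location").join(coords)
print(toy_events_split)
```

The join lines the new columns back up by index, so events without a location end up with NaN in both columns.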

Now that we have our locations parsed out, we can turn to the event type columns, such as pass or shot. Each of these also contains a variety of information, such as whether the pass was an assist, who received the pass and how long the pass was. All of this information is provided as a dictionary within the dataframe. If we isolate the pass column and drop all NaN values, this is what we get.

### Select the pass column
pass_data_raw = match_events_split['pass']

### Drop NaN values from our selected column
pass_data_raw.dropna().head()

6     {'recipient': {'id': 10251, 'name': 'Fara Will...
7     {'recipient': {'id': 15725, 'name': 'Natasha H...
8     {'recipient': {'id': 18147, 'name': 'Kate Long...
9     {'recipient': {'id': 15725, 'name': 'Natasha H...
10    {'recipient': {'id': 22027, 'name': 'Anne Moor...
Name: pass, dtype: object

As we can see, there are dictionaries within dictionaries here, and the information provided might be common across multiple event types within this dataset. Let’s see if we can pull anything further out of here and create a nice little dataframe of the information.
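
Before splitting everything, it can help to pull a value or two out by hand to see how the nesting works. A small sketch with made-up entries shaped like the output above (toy_pass_raw is a hypothetical stand-in for pass_data_raw):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pass_data_raw: NaN for non-pass events
toy_pass_raw = pd.Series(
    [
        np.nan,
        {"recipient": {"id": 10251, "name": "Fara Williams"}, "length": 7.106335},
        {"recipient": {"id": 15725, "name": "Natasha Harding"}, "length": 41.742306},
    ],
    name="pass",
)

# Each non-NaN entry is an ordinary Python dictionary
first_pass = toy_pass_raw.dropna().iloc[0]

# Nested values are reached by chaining the keys
recipient_name = first_pass["recipient"]["name"]

# The same kind of lookup applied to every pass, keeping the row index
pass_lengths = toy_pass_raw.dropna().map(lambda p: p["length"])

print(recipient_name)
print(pass_lengths)
```

This works for one or two values, but it quickly becomes tedious for every key, which is why the rest of this tutorial converts the whole column at once.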

First, we can split the pass column into a dataframe of values, rather than keeping everything nested in a single column.

### Convert our pass column in to a dataframe
pass_data = pass_data_raw.apply(pd.Series)

### Filter our dataframe to keep the rows that have pass values
pass_data[pass_data.length >= 0]

	0	aerial_won	angle	assisted_shot_id	body_part	cross	cut_back	deflected	end_location	goal_assist	height	inswinging	length	miscommunication	no_touch	outcome	outswinging	recipient	shot_assist	switch	technique	through_ball	type
6	NaN	NaN	2.900027	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[54.1, 41.8]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	7.106335	NaN	NaN	NaN	NaN	{‘id’: 10251, ‘name’: ‘Fara Williams’}	NaN	NaN	NaN	NaN	{‘id’: 65, ‘name’: ‘Kick Off’}
7	NaN	NaN	0.990103	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[77.7, 75.4]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	41.742306	NaN	NaN	NaN	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN
8	NaN	NaN	0.080904	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[44.3, 3.7]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	11.136427	NaN	NaN	NaN	NaN	{‘id’: 18147, ‘name’: ‘Kate Longhurst’}	NaN	NaN	NaN	NaN	NaN
9	NaN	NaN	-0.132765	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[108.1, 70.0]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	30.972569	NaN	NaN	{‘id’: 76, ‘name’: ‘Pass Offside’}	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	2.101826	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[6.8, 31.6]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	21.918486	NaN	NaN	NaN	NaN	{‘id’: 22027, ‘name’: ‘Anne Moorhouse’}	NaN	NaN	NaN	NaN	{‘id’: 62, ‘name’: ‘Free Kick’}
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
757	NaN	NaN	-1.799721	NaN	NaN	NaN	NaN	NaN	[112.1, 69.7]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	10.575916	NaN	NaN	NaN	NaN	{‘id’: 26570, ‘name’: ‘Amalie Vevle Eikeland’}	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}
758	NaN	NaN	2.310073	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[110.2, 78.1]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	4.601087	NaN	NaN	NaN	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN
759	NaN	NaN	-0.851966	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[112.3, 75.9]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	2.126029	NaN	NaN	{‘id’: 9, ‘name’: ‘Incomplete’}	NaN	NaN	NaN	NaN	NaN	NaN	NaN
760	NaN	NaN	-1.830611	NaN	NaN	NaN	NaN	NaN	[109.5, 72.1]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	8.174350	NaN	NaN	NaN	NaN	{‘id’: 10190, ‘name’: ‘Jade Moore’}	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}
761	NaN	NaN	-1.843406	NaN	NaN	NaN	NaN	NaN	[111.1, 70.7]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	9.656604	NaN	NaN	{‘id’: 77, ‘name’: ‘Unknown’}	NaN	NaN	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}

756 rows × 23 columns

Now we can see what our data actually includes. A few of our columns still contain nested information, with ‘id’ and ‘name’ common within those columns, while we have a list of co-ordinates for our end locations as well. We can convert these columns one by one, but this would be time consuming to do for every variable in this dataset. For example, we can split each column like this:

### Split the height column in to separate columns
pass_data[['0', 'pass_height_id', 'pass_height_name']] = pass_data['height'].apply(pd.Series)

### Filter the dataframe to find our data
pass_data[pass_data.length >= 0]

	0	aerial_won	angle	assisted_shot_id	body_part	cross	cut_back	deflected	end_location	goal_assist	height	inswinging	length	miscommunication	no_touch	outcome	outswinging	recipient	shot_assist	switch	technique	through_ball	type	0	pass_height_id	pass_height_name
6	NaN	NaN	2.900027	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[54.1, 41.8]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	7.106335	NaN	NaN	NaN	NaN	{‘id’: 10251, ‘name’: ‘Fara Williams’}	NaN	NaN	NaN	NaN	{‘id’: 65, ‘name’: ‘Kick Off’}	NaN	1.0	Ground Pass
7	NaN	NaN	0.990103	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[77.7, 75.4]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	41.742306	NaN	NaN	NaN	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN	NaN	3.0	High Pass
8	NaN	NaN	0.080904	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[44.3, 3.7]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	11.136427	NaN	NaN	NaN	NaN	{‘id’: 18147, ‘name’: ‘Kate Longhurst’}	NaN	NaN	NaN	NaN	NaN	NaN	1.0	Ground Pass
9	NaN	NaN	-0.132765	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[108.1, 70.0]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	30.972569	NaN	NaN	{‘id’: 76, ‘name’: ‘Pass Offside’}	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN	NaN	3.0	High Pass
10	NaN	NaN	2.101826	NaN	{‘id’: 38, ‘name’: ‘Left Foot’}	NaN	NaN	NaN	[6.8, 31.6]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	21.918486	NaN	NaN	NaN	NaN	{‘id’: 22027, ‘name’: ‘Anne Moorhouse’}	NaN	NaN	NaN	NaN	{‘id’: 62, ‘name’: ‘Free Kick’}	NaN	1.0	Ground Pass
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
757	NaN	NaN	-1.799721	NaN	NaN	NaN	NaN	NaN	[112.1, 69.7]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	10.575916	NaN	NaN	NaN	NaN	{‘id’: 26570, ‘name’: ‘Amalie Vevle Eikeland’}	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}	NaN	3.0	High Pass
758	NaN	NaN	2.310073	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[110.2, 78.1]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	4.601087	NaN	NaN	NaN	NaN	{‘id’: 15725, ‘name’: ‘Natasha Harding’}	NaN	NaN	NaN	NaN	NaN	NaN	1.0	Ground Pass
759	NaN	NaN	-0.851966	NaN	{‘id’: 40, ‘name’: ‘Right Foot’}	NaN	NaN	NaN	[112.3, 75.9]	NaN	{‘id’: 1, ‘name’: ‘Ground Pass’}	NaN	2.126029	NaN	NaN	{‘id’: 9, ‘name’: ‘Incomplete’}	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1.0	Ground Pass
760	NaN	NaN	-1.830611	NaN	NaN	NaN	NaN	NaN	[109.5, 72.1]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	8.174350	NaN	NaN	NaN	NaN	{‘id’: 10190, ‘name’: ‘Jade Moore’}	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}	NaN	3.0	High Pass
761	NaN	NaN	-1.843406	NaN	NaN	NaN	NaN	NaN	[111.1, 70.7]	NaN	{‘id’: 3, ‘name’: ‘High Pass’}	NaN	9.656604	NaN	NaN	{‘id’: 77, ‘name’: ‘Unknown’}	NaN	NaN	NaN	NaN	NaN	NaN	{‘id’: 67, ‘name’: ‘Throw-in’}	NaN	3.0	High Pass

756 rows × 26 columns

Now that was relatively simple to do, but having to change the values for each column we want to split like this will take a fair amount of time. What we could do instead is write a function that checks each column of the dataframe and applies the split to it.

### Split pass data in to dataframe
pass_data_split = pass_data_raw.apply(pd.Series)

### Create function to loop through columns
### and apply a function to split the column
### in to id and name columns.
def pass_parse_function(data) -> pd.DataFrame:
    
    df = pd.DataFrame(data)
    dfcolumns = df.columns
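    for i in dfcolumns:
        try: 
            ### Columns of id / name values (plus NaN rows) expand to three columns
            df[['0', str(i) + '_id', str(i) + '_name']] = df[i].apply(pd.Series)
            df = df.drop(i, axis = 1)
        except ValueError:
            ### Any other column does not fit the three names, so leave it as it is
            pass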
    
    return df

### Run the function using the split dataframe
pass_df = pass_parse_function(pass_data_split)

### View the data from the function
pass_df[pass_df.length >= 0].head(10)

	0	aerial_won	angle	assisted_shot_id	cross	cut_back	deflected	end_location	goal_assist	inswinging	length	miscommunication	no_touch	outswinging	shot_assist	switch	through_ball	0	body_part_id	body_part_name	height_id	height_name	outcome_id	outcome_name	recipient_id	recipient_name	technique_id	technique_name	type_id	type_name
6	NaN	NaN	2.900027	NaN	NaN	NaN	NaN	[54.1, 41.8]	NaN	NaN	7.106335	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.0	Right Foot	1.0	Ground Pass	NaN	NaN	10251.0	Fara Williams	NaN	NaN	65.0	Kick Off
7	NaN	NaN	0.990103	NaN	NaN	NaN	NaN	[77.7, 75.4]	NaN	NaN	41.742306	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38.0	Left Foot	3.0	High Pass	NaN	NaN	15725.0	Natasha Harding	NaN	NaN	NaN	NaN
8	NaN	NaN	0.080904	NaN	NaN	NaN	NaN	[44.3, 3.7]	NaN	NaN	11.136427	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38.0	Left Foot	1.0	Ground Pass	NaN	NaN	18147.0	Kate Longhurst	NaN	NaN	NaN	NaN
9	NaN	NaN	-0.132765	NaN	NaN	NaN	NaN	[108.1, 70.0]	NaN	NaN	30.972569	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.0	Right Foot	3.0	High Pass	76.0	Pass Offside	15725.0	Natasha Harding	NaN	NaN	NaN	NaN
10	NaN	NaN	2.101826	NaN	NaN	NaN	NaN	[6.8, 31.6]	NaN	NaN	21.918486	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38.0	Left Foot	1.0	Ground Pass	NaN	NaN	22027.0	Anne Moorhouse	NaN	NaN	62.0	Free Kick
11	NaN	NaN	0.760755	NaN	NaN	NaN	NaN	[37.8, 65.3]	NaN	NaN	40.175865	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.0	Right Foot	2.0	Low Pass	9.0	Incomplete	31553.0	Cecilie Redisch Kvamme	NaN	NaN	NaN	NaN
12	NaN	NaN	-0.237374	NaN	NaN	NaN	NaN	[71.9, 74.0]	NaN	NaN	25.515486	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3.0	High Pass	9.0	Incomplete	18146.0	Leanne Kiernan	NaN	NaN	67.0	Throw-in
13	NaN	NaN	-2.696125	NaN	NaN	NaN	NaN	[57.3, 25.3]	NaN	NaN	12.300406	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.0	Right Foot	2.0	Low Pass	9.0	Incomplete	31628.0	Kristine Leine	NaN	NaN	NaN	NaN
14	NaN	NaN	1.010365	NaN	NaN	NaN	NaN	[66.6, 75.5]	NaN	NaN	23.139793	NaN	NaN	NaN	NaN	NaN	NaN	NaN	38.0	Left Foot	1.0	Ground Pass	NaN	NaN	31553.0	Cecilie Redisch Kvamme	NaN	NaN	62.0	Free Kick
15	NaN	NaN	-0.167896	NaN	NaN	NaN	NaN	[71.9, 74.5]	NaN	NaN	5.984146	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.0	Right Foot	1.0	Ground Pass	NaN	NaN	18153.0	Alisha Lehmann	NaN	NaN	NaN	NaN

So there we have a function that parses out our columns and applies a new name to each column. This is far simpler and quicker than trying to write out a single line of code for each column we need to pull details from.

I’m not sure if this is the quickest function for doing this, so if anyone who uses Python is reading this, feel free to let me know if there is a better method of doing it.
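
One candidate worth trying is pd.json_normalize (available under that name in pandas 1.0 and later), which flattens nested dictionaries in a single call. It has not been timed against the function above, so treat this as a sketch on made-up data (toy_pass_raw is a hypothetical stand-in for pass_data_raw):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pass_data_raw, loosely modelled on the output above
toy_pass_raw = pd.Series(
    [
        np.nan,
        {"length": 7.106335,
         "height": {"id": 1, "name": "Ground Pass"},
         "recipient": {"id": 10251, "name": "Fara Williams"}},
        {"length": 2.126029,
         "height": {"id": 1, "name": "Ground Pass"},
         "outcome": {"id": 9, "name": "Incomplete"}},
    ],
    name="pass",
)

# json_normalize expects dictionaries, so drop the NaN rows first
passes = toy_pass_raw.dropna()

# sep="_" names nested keys as height_id, height_name and so on
flat = pd.json_normalize(passes.tolist(), sep="_")

# Restore the original row index so the result can be joined back later
flat.index = passes.index

print(flat.columns.tolist())
```

Keys that are missing from a given pass simply come out as NaN in that row. List values such as end_location would be left as lists, so they would still need splitting separately.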

Hopefully this tutorial provides you with some good information on using Python to look at Statsbomb data. I know there are a few things I can clean up in this process, and hopefully I can learn and show you all of those in the future.
