Importing StatsBomb Data into R

A tutorial on importing StatsBomb data into R using the StatsBombR package.

Since I first wrote this post back in 2019, a few things have changed in how we capture data from the StatsBomb open data sets. The post itself is mostly unchanged, but the code has been updated and will work with newer versions of R, namely 4+. If you have any questions, please reach out and let me know.

In 2018, StatsBomb announced they would provide free data from women’s leagues around the world. So far, they have released data from the National Women’s Soccer League (NWSL, USA) and the FA Women’s Super League (FAWSL, UK), with France Feminine and UEFA Women’s Champions League to follow. This move is incredible and gives analysts such as myself a training ground to begin our path as data analysts, or to hone our skills and improve our analyses.

Accessing the data is pretty easy: you can either download the JSON files through the StatsBomb website, or pull them directly into R / RStudio through their StatsBombR package. For this post, I will show you how I downloaded data from the 2018-19 FAWSL and created simple summary tables from it. I used two R packages, StatsBombR and tidyverse, so make sure these are installed before continuing.

To begin with, we need to load our two packages and then read in the data. StatsBombR has a nice function, “StatsBombFreeEvents”, that downloads free event data for use in R. We first pick out the FAWSL (competition_id 37), find all of its matches, and then download the event data for those matches. After this we can use the “allclean” function from StatsBombR to separate vector columns in the data frame. For example, StatsBomb stores location data as a list (e.g. c(45,62)), and this needs to be separated before we can write the data to a csv file. The “allclean” function does this for us, which is handy.

library(StatsBombR)
library(tidyverse)

### Read in the free competitions and keep only the FAWSL.
comp <- FreeCompetitions() %>%
  filter(competition_id == 37)

### Find all the free matches from the selected competition.
matches <- FreeMatches(comp)

### Find all the free events from the matches found above.
data <- StatsBombFreeEvents(MatchesDF = matches, Parallel = TRUE)

### Clean the data downloaded above.
data <- allclean(data)
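
To make the point about list columns concrete, here is a minimal sketch on a made-up two-row table. It does not use StatsBombR at all; the `events` table and the `location.x` / `location.y` column names are assumptions chosen for illustration, and this is not a reproduction of what “allclean” does internally.

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy events table (invented values): `location` is a list column,
# with one numeric vector of length 2 per row.
events <- tibble(
  id = c("a", "b"),
  location = list(c(45, 62), c(61.5, 40.1))
)

# Pull the first and second element of each vector into plain numeric
# columns, then drop the list column so the table is flat.
events_flat <- events %>%
  mutate(
    location.x = map_dbl(location, 1),
    location.y = map_dbl(location, 2)
  ) %>%
  select(-location)

print(events_flat)
```

Once every column is an ordinary vector like this, the table can be written to a csv file without trouble.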

Now that we have our data, we need to filter the match event data to include only FAWSL games. We can then easily join our match data frame with our event data.

### Filter event data to include only FAWSL data.
data1 <- data %>%
  filter(competition_id == 37)

### Join event and match data by "match_id"
data1 <- left_join(data1, matches, by = "match_id")
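
If the join is unfamiliar, this small sketch shows what `left_join()` does with a shared `match_id` key. The two toy tables and all their values are invented for illustration; the real match table has many more columns.

```r
library(dplyr)
library(tibble)

# Toy event table: three events across two matches (invented values).
toy_events <- tibble(
  match_id = c(1, 1, 2),
  type.name = c("Pass", "Shot", "Pass")
)

# Toy match table: one row per match (invented values).
toy_matches <- tibble(
  match_id = c(1, 2),
  match_date = c("2018-01-01", "2018-01-08")
)

# Every event row is kept, and the match columns are attached to it.
joined <- left_join(toy_events, toy_matches, by = "match_id")

print(joined)
```

The result has one row per event, so the match details are repeated for every event in the same match.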

Now we have one big data frame including all event data from the entire FAWSL 2018-19 season. We can use this data to create team and player tables for data visualizations in Power BI. We can also find the minutes played by each player in each match of the data set, which will be handy for converting to per-minute values.

### Create a team table, selecting team name and id from
### our full data set and creating a list of unique names
### and ids.
TeamNames <- data1 %>%
  select(team.name, team.id) %>%
  unique()
names(TeamNames) <- c("Team Name", "Team ID")

### Create a player table with unique names, removing NA
### values which are found in "set up" rows of our data set.
PlayerNames <- data1 %>%
  select(player.name, player.id, team.name, team.id) %>%
  unique() %>%
  filter(!is.na(player.name))
names(PlayerNames) <- c("Player Name", "Player ID", "Team Name", "Team ID")

### Find the minutes played by each player in each match.
minutes_played <- get.minutesplayed(data)
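
As a rough sketch of how a minutes table can later be used for rate statistics, the example below counts shots per player on toy data and scales them to a per-90-minutes value. The tables, the values, and the `MinutesPlayed` column name are assumptions for illustration; check the column names in your own `minutes_played` output before adapting it.

```r
library(dplyr)
library(tibble)

# Toy event data (invented): the NA row mimics a "set up" row with no player.
toy_events <- tibble(
  player.name = c("Player A", "Player A", NA, "Player B"),
  player.id = c(1, 1, NA, 2),
  type.name = c("Shot", "Shot", "Starting XI", "Shot")
)

# Toy minutes table (invented): one row per player per match.
toy_minutes <- tibble(
  player.id = c(1, 2),
  MinutesPlayed = c(90, 45)
)

# Unique players, dropping rows without a player.
players <- toy_events %>%
  distinct(player.name, player.id) %>%
  filter(!is.na(player.name))

# Total minutes per player across all matches.
total_minutes <- toy_minutes %>%
  group_by(player.id) %>%
  summarise(minutes = sum(MinutesPlayed))

# Shots per player, scaled to a per-90-minutes rate.
shots_per90 <- toy_events %>%
  filter(type.name == "Shot") %>%
  count(player.id, name = "shots") %>%
  left_join(total_minutes, by = "player.id") %>%
  mutate(shots_per90 = shots / minutes * 90)

print(shots_per90)
```

Here both players end up with 2 shots per 90: Player A took 2 shots in 90 minutes, and Player B took 1 shot in 45.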

Now that we have our team and player tables, we can create summary tables for shots and passes. This makes it easier to create specific visualizations in the future. In this step, I will remove columns that were cleaned using the “allclean” function from StatsBombR, since some of these columns contain lists that can’t be written to a csv file.

### Create a shot summary table removing unwanted columns.
ShotsTable <- data1 %>%
  filter(type.name == "Shot") %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame)) %>%
  separate(location, into = c(NA, "x", "y")) %>%
  separate(pass.end_location, into = c(NA, "pass.end.x", "pass.end.y")) %>%
  separate(shot.end_location, into = c(NA, "shot.end.x", "shot.end.y")) %>%
  separate(goalkeeper.end_location, into = c(NA, "GK.end.x", "GK.end.y"))

### Create a pass summary table removing unwanted columns.
PassTable <- data1 %>%
  filter(type.name == "Pass") %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame)) %>%
  separate(location, into = c(NA, "x", "y")) %>%
  separate(pass.end_location, into = c(NA, "pass.end.x", "pass.end.y")) %>%
  separate(shot.end_location, into = c(NA, "shot.end.x", "shot.end.y")) %>%
  separate(goalkeeper.end_location, into = c(NA, "GK.end.x", "GK.end.y"))
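
It may help to see what `separate(..., into = c(NA, "x", "y"))` does on a small example. The sketch below uses invented text values that mimic how a vector such as c(45, 62) looks once it has been converted to a string. By default `separate()` splits on any run of non-alphanumeric characters, so the leading "c" becomes the first piece (dropped by the `NA`), but a decimal point also counts as a separator. The `tidyr::extract()` call with an explicit pattern is one possible alternative for decimal coordinates, offered as an assumption rather than as the approach used in this post.

```r
library(tidyr)
library(tibble)

# Toy data (invented): locations stored as text.
toy <- tibble(
  id = 1:2,
  location = c("c(45, 62)", "c(61.5, 40.1)")
)

# Default separator: any run of non-alphanumeric characters.
# NA in `into` drops the leading "c" piece; extra = "drop" silently
# discards any pieces beyond the ones named in `into`.
by_separate <- separate(toy, location, into = c(NA, "x", "y"), extra = "drop")
# Row 1: x = "45", y = "62"
# Row 2: x = "61", y = "5"   (the decimal point was treated as a separator)

# An explicit pattern keeps the decimals and converts to numeric.
by_extract <- tidyr::extract(
  toy, location,
  into = c("x", "y"),
  regex = "^c\\(([0-9.]+), *([0-9.]+)\\)$",
  convert = TRUE
)
# Row 1: x = 45,   y = 62
# Row 2: x = 61.5, y = 40.1

print(by_separate)
print(by_extract)
```

If your coordinates are whole numbers the two approaches agree; if they contain decimals, it is worth checking the separated values against the original column.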

We have now created our summary tables and can write our data to csv files for later analysis or visualizations. These tables can be used for simple shot or pass analyses, or for visualizations of pitch locations. We will also keep our full data set for more advanced analyses.
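
Before writing a table out, it can be useful to confirm that no list columns remain. The following is a small sketch on invented data; the `toy` table and its column values are assumptions, and it writes to a temporary file rather than to your working directory.

```r
library(dplyr)
library(readr)
library(tibble)

# Toy table (invented): `related_events` is a list column.
toy <- tibble(
  id = 1:2,
  x = c(45, 61.5),
  related_events = list(c("e1", "e2"), "e3")
)

# Keep only the columns that are not lists.
csv_ready <- toy %>%
  select(where(~ !is.list(.x)))

# Write to a temporary file and read it back to check the round trip.
out_file <- tempfile(fileext = ".csv")
write_csv(csv_ready, out_file)
round_trip <- read_csv(out_file, show_col_types = FALSE)

print(round_trip)
```

Dropping list columns this way is blunt; in the post the unwanted columns are removed by name instead, which makes it clearer what is being left out.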

The full code from this tutorial is below and can be used to start your analysis of StatsBomb data. In future tutorials I will use these tables to dive further into this amazing set of free data. Hopefully you will find something of use or interest to you!

library(StatsBombR)
library(tidyverse)

### Download and clean the FAWSL event data.
comp <- FreeCompetitions() %>%
  filter(competition_id == 37)

matches <- FreeMatches(comp)

data <- StatsBombFreeEvents(MatchesDF = matches, Parallel = TRUE)

data <- allclean(data)

### Keep only FAWSL events and find the minutes played.
data1 <- data %>%
  filter(competition_id == 37)
minutes_played <- get.minutesplayed(data)

### Join event and match data by "match_id".
data1 <- left_join(data1, matches, by = "match_id")

### Team and player tables.
TeamNames <- data1 %>%
  select(team.name, team.id) %>%
  unique()
names(TeamNames) <- c("Team Name", "Team ID")

PlayerNames <- data1 %>%
  select(player.name, player.id, team.name, team.id) %>%
  unique() %>%
  filter(!is.na(player.name))
names(PlayerNames) <- c("Player Name", "Player ID", "Team Name", "Team ID")

### Shot, pass and full data tables.
ShotsTable <- data1 %>%
  filter(type.name == "Shot") %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame)) %>%
  separate(location, into = c(NA, "x", "y")) %>%
  separate(pass.end_location, into = c(NA, "pass.end.x", "pass.end.y")) %>%
  separate(shot.end_location, into = c(NA, "shot.end.x", "shot.end.y")) %>%
  separate(goalkeeper.end_location, into = c(NA, "GK.end.x", "GK.end.y"))

PassTable <- data1 %>%
  filter(type.name == "Pass") %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame)) %>%
  separate(location, into = c(NA, "x", "y")) %>%
  separate(pass.end_location, into = c(NA, "pass.end.x", "pass.end.y")) %>%
  separate(shot.end_location, into = c(NA, "shot.end.x", "shot.end.y")) %>%
  separate(goalkeeper.end_location, into = c(NA, "GK.end.x", "GK.end.y"))

FullData <- data1 %>%
  select(-c(related_events, tactics.lineup, shot.freeze_frame)) %>%
  separate(location, into = c(NA, "x", "y")) %>%
  separate(pass.end_location, into = c(NA, "pass.end.x", "pass.end.y")) %>%
  separate(shot.end_location, into = c(NA, "shot.end.x", "shot.end.y")) %>%
  separate(goalkeeper.end_location, into = c(NA, "GK.end.x", "GK.end.y"))

### Write the tables to csv files. Set your working directory
### first (for example with setwd()) so the files land where you want them.
write_csv(minutes_played, "StatsBomb_MinutesPlayed.csv")
write_csv(FullData, "StatsBomb_FullData.csv")
write_csv(PlayerNames, "StatsBomb_PlayerTable.csv")
write_csv(TeamNames, "StatsBomb_TeamTable.csv")
write_csv(ShotsTable, "StatsBomb_ShotSummary.csv")
write_csv(PassTable, "StatsBomb_PassSummary.csv")
write_csv(matches, "StatsBomb_MatchTable.csv")