Data in Sport - Part 1

Data in sport isn’t a new thing, but the volume of data is increasing and it is coming from multiple sources. Each source supplies data in its preferred format, through manual exports or API connections, and obtaining and transforming that data before it can be analysed or reported on can require specific knowledge.

This series of posts therefore aims to walk through an end-to-end data pipeline. I will cover some of the roles typically seen in a data pipeline, their functions, and ultimately how they can improve the quality and usability of your data.


Data Engineering

Data engineering generally refers to the work of integrating and preparing data for analytical use. In sport, these tasks often fall on the same person who reports on or analyses the data. So let’s talk through how you can speed up this process and build an integrated pipeline that extracts, transforms and prepares your data for analysis and reporting.


Automated Extraction

In a typical workflow, manually extracting data through a CSV export can be time-consuming. We can improve the efficiency of this process in a couple of ways. The first requires some coding knowledge in R or Python: the benefit of a coded process is that it is easily repeatable and allows the various steps to be completed in a set order. The second is that data providers often offer an Application Programming Interface (API) for users to interact with. An API allows a user to send a request and receive a response in a specified data format, often a JSON or XML file. This type of data is referred to as semi-structured: it does not have columns and rows, but rather chunks of information that may not always be the same across all chunks.
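As a rough sketch, a coded API request in Python might look like the following. The endpoint, authentication header and match_id parameter are all placeholders for illustration; your provider’s documentation will have the real values.

# A minimal sketch of calling a data provider's API with Python.
# The URL, auth header and parameters below are hypothetical placeholders.
import requests

API_URL = "https://api.example-provider.com/v1/events"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}      # hypothetical auth scheme

response = requests.get(API_URL, headers=HEADERS, params={"match_id": 12345})
response.raise_for_status()      # stop early if the request failed

events = response.json()         # parse the JSON body into Python lists/dicts
print(f"Retrieved {len(events)} events")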

Here is an example of a JSON file, where you can see the structure is not immediately usable and requires some transformation. This is a single event from a StatsBomb JSON output file.


{
  "id": "272ef47d-171f-4395-a156-f4979e7bf352",
  "index": 5,
  "period": 1,
  "timestamp": "00:00:00.100",
  "minute": 0,
  "second": 0,
  "type": { "id": 30, "name": "Pass" },
  "possession": 2,
  "possession_team": { "id": 971, "name": "Chelsea LFC" },
  "play_pattern": { "id": 9, "name": "From Kick Off" },
  "off_camera": false,
  "team": { "id": 971, "name": "Chelsea LFC" },
  "player": { "id": 4647, "name": "So-yun Ji" },
  "position": { "id": 14, "name": "Center Midfield" },
  "location": [ 61.0, 40.0 ],
  "duration": 0.0,
  "related_events": [ "237cac8c-5cb0-4015-9d08-c7df9699a136" ],
  "pass": {
    "recipient": { "id": 4659, "name": "Ramona Bachmann" },
    "length": 3.6055512,
    "angle": -0.98279375,
    "height": { "id": 1, "name": "Ground Pass" },
    "end_location": [ 63.0, 37.0 ],
    "type": { "id": 65, "name": "Kick Off" },
    "body_part": { "id": 40, "name": "Right Foot" }
  }
}


With a coded process, you can run it manually on your local machine, or use cloud-based software that executes the code on a schedule. One example is Apache Airflow, which schedules Python code to perform the required steps to integrate with an API and retrieve the requested data. This is the beginning of a fully automated data pipeline that can speed up your data analysis processes.
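As a minimal sketch of what that scheduling could look like, here is an Airflow 2.x style DAG that runs an extraction function daily. The extract_events function and DAG name are placeholders for your own extraction code and naming.

# A minimal sketch of an Apache Airflow DAG running an extraction task daily.
# The extract_events function is a placeholder for your own API extraction code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_events():
    # Call the provider's API and store the raw JSON (placeholder logic).
    ...


with DAG(
    dag_id="extract_match_events",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id="extract_events",
        python_callable=extract_events,
    )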


Data Transformation

Now that you have your data, we need to make it usable. That’s where your R/Python code becomes even more powerful: the repeatable steps make this process a breeze, and both languages have built-in capabilities for working with semi-structured data. This step doesn’t just mean restructuring data for use in reporting; it could also mean summarising it or adding information from other sources. Once you have transformed the data, what happens next? Where does it go?
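A sketch of that flattening step in Python, using pandas, might look like this. It assumes the events list returned by the earlier request, and the column names (type_name, player_name, pass_length) reflect the StatsBomb-style nesting shown above; other providers will differ.

# Flatten semi-structured event data into a table with pandas.
# Assumes `events` is the list of event dictionaries retrieved earlier.
import pandas as pd

df = pd.json_normalize(events, sep="_")

# Nested fields become columns such as player_name, pass_length, pass_recipient_name.
passes = df[df["type_name"] == "Pass"]

# A simple summary as an example transformation: pass count and average length per player.
pass_summary = (
    passes.groupby("player_name")["pass_length"]
    .agg(["count", "mean"])
    .rename(columns={"count": "passes", "mean": "avg_length"})
)
print(pass_summary.head())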


Data Loading

Following extraction and transformation, we can now load our data into a data store. This might be a large CSV/Excel file or a SQL data warehouse; whatever works best in your situation is where your data will end up. If you have a simple workflow, it will likely be a CSV/Excel file that is used in reporting software (Tableau/Power BI/Excel). In sport, this is probably the most common scenario currently in use.

If we were to take this further, a SQL data warehouse will likely be your endpoint. A data warehouse is often relational, with data stored in tables linked by common fields and identifiers. These databases allow the concept of “big data” to become a reality, with data from various sources combined to create a complete picture of your athlete or team performance.
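Continuing the sketch above, loading the summarised table could be as simple as either of the following two options. The connection string, table name and the pass_summary DataFrame are illustrative placeholders.

# Load the transformed data: a flat file for simple workflows,
# or a relational table for a warehouse-style setup.
from sqlalchemy import create_engine

# Option 1: a flat file picked up by Tableau / Power BI / Excel.
pass_summary.to_csv("pass_summary.csv", index=True)

# Option 2: a table in a SQL database (placeholder connection string).
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
pass_summary.to_sql("pass_summary", engine, if_exists="replace", index=True)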


Living in the Cloud

With the advancement of technology, the common workflow of extract, transform and load is being challenged. Companies like Snowflake provide native support for working with semi-structured data, allowing users to extract and load data before running transformation tasks. This allows the raw data to be kept in a highly compressed format, saving space while letting the power of your “data cloud” be fully utilised, with the pipeline automated from end to end.
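As a rough illustration of that load-first approach, the sketch below uses the snowflake-connector-python package to land raw JSON events in a VARIANT column and then transform them with SQL inside the warehouse. The account details, table and column names are placeholders, not a definitive setup.

# A sketch of an ELT-style workflow with Snowflake: load the raw JSON first,
# then transform it with SQL in the warehouse. Credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="ANALYTICS_WH", database="SPORT", schema="RAW",
)
cur = conn.cursor()

# Raw events are kept as-is in a semi-structured VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (event VARIANT)")

# Transformation happens after loading, using Snowflake's JSON path syntax.
cur.execute("""
    SELECT event:player:name::string AS player,
           event:pass:length::float  AS pass_length
    FROM raw_events
    WHERE event:type:name::string = 'Pass'
""")
print(cur.fetchmany(5))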


Where to Next?

Now that we have our data in a usable format, we can report on it. Stay tuned for the next Data in Sport post to see what reporting options we have!
