Welcome back to the Data in Sport series. This week we are going to look at data management best practices. If you missed the first post, you can find it here. In that post, we talked about data engineering and how you can automate the extraction and transformation of your data sources. This week, we are going to expand on that slightly and talk about data management. In some ways, the two concepts go together, as a lot of the tasks performed during the data engineering phase will translate to the management of your data. So let’s get started.
What does data management require?
When I think of data management, I think of all the tasks required to make sure my data is clean, usable and coherent. For example, here is a list of tasks we will walk through today:
- Removing errors and imputing missing data.
- Applying relevant logic during transformation.
- Adding custom identifiers for transformed information.
- Storing your information in a database of your choosing.
Before we go any further, an understanding of a coding platform such as R or Python will make this process much easier, but it definitely isn’t required. All of these steps can be completed manually in Excel or other software; coding will just automate and speed up the process massively.
Removing errors
With any type of data, there is the possibility of errors. If the data is collected by you, such as training or testing data, then removing errors will be your job. External parties will likely do this themselves, but at times you will need to identify an error and notify the external party before it is fixed, so this process is good to know.
Data Errors
An easy and common error identified by sport scientists, for example, is the max velocity from GPS data. A small spike is possible for a variety of reasons, but it can easily push the velocity of your athlete slightly, or massively, beyond the bounds of what they are capable of. Being able to identify and rectify these issues before the data is stored is a huge benefit. You can add a warning, or some code that removes the maximal value and chooses the next largest value in the report, for example. Alternatively, the warning can prompt you to go back to the source software and remove the error from the dataset completely. These types of errors are also possible with optical tracking data, so be aware and make sure your error checking is in place with any data that deals with velocity traces.
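As a rough sketch of what that check might look like in Python (the velocity trace and the 12.5 m/s ceiling here are made up, so adjust both to your own athletes), you could flag any spike and fall back to the next largest value:

```python
import pandas as pd

# Hypothetical 10 Hz velocity trace for one athlete (m/s)
trace = pd.Series([7.8, 8.1, 8.4, 8.6, 14.9, 8.5, 8.2])  # 14.9 is an obvious spike

VELOCITY_CEILING = 12.5  # assumed upper bound for this squad

max_velocity = trace.max()
if max_velocity > VELOCITY_CEILING:
    # Warn so you can go back to the source software and check the raw trace,
    # or fall back to the next largest believable value in the report
    print(f"Suspicious max velocity: {max_velocity:.1f} m/s")
    max_velocity = trace[trace <= VELOCITY_CEILING].max()

print(f"Max velocity stored: {max_velocity:.1f} m/s")
```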
Identifier Errors
This is a step we will also cover in more detail below, but one issue that is common with manually entered data sets comes from the identifiers attached to your data points. For example, where dates or names are added, they can be entered in many different ways or accidentally misspelled. At this step, you can fix common errors for things like dates, but names can be challenging, as there are so many different combinations of errors that it is hard to cover every possibility. You can add custom identifiers to your data sets, which we cover below.
Imputing missing data
Now the first thing you are probably thinking is: what does imputing actually mean?
Impute
Assign (a value) to something by inference from the value of the products or processes to which it contributes.
In simple terms, we can fill in our missing data using something sensible, such as the average for the session, or a comparison to similar sessions. A really simple method might be to use the average for players in a similar role throughout the session, or within the drill where the data is missing.
A recent sport science article concluded that a simple average will suffice to impute missing data for longitudinal monitoring. However, they only examined RPE data, where a simple mean might be an accurate representation. More research is required to look at different data types, such as GPS tracking, and how to impute missing data in an accurate way.
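As a minimal sketch of that positional-average approach in pandas (the column names and values are hypothetical), the missing value is filled with the mean of the athlete’s positional group for the session:

```python
import pandas as pd
import numpy as np

# Hypothetical session data with one missing distance value
session = pd.DataFrame({
    "athlete":  ["A", "B", "C", "D"],
    "position": ["Forward", "Forward", "Back", "Back"],
    "distance": [5200.0, np.nan, 4800.0, 4650.0],
})

# Impute using the average of players in the same positional group
positional_mean = session.groupby("position")["distance"].transform("mean")
session["distance"] = session["distance"].fillna(positional_mean)

print(session)
```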
Applying relevant logic
This step often applies more in a business setting, but it can also be relevant where custom metrics are generated from the raw data. For example, where match event data only uses text values to identify when a shot occurred, but you want to be able to easily sum them in a summarised data set, you might assign a binary (0,1) flag to each row. This will make it much easier to summarise the data in future steps.
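A quick sketch of that flag in pandas (the event feed here is invented) shows why it helps; once the flag exists, counting shots is a one-line aggregation:

```python
import pandas as pd

# Hypothetical match event feed where events are stored as text
events = pd.DataFrame({
    "minute": [3, 7, 12, 15],
    "event":  ["Pass", "Shot", "Shot", "Tackle"],
})

# Assign a binary flag so shots can be summed directly
events["is_shot"] = (events["event"] == "Shot").astype(int)

# Summarising is now a simple aggregation
print(events["is_shot"].sum())  # 2
```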
Adding custom identifiers
This step can make or break your data, especially with longitudinal data. The ability to identify an athlete or a specific period of time is incredibly important. When it comes to dates, they can be easily reformatted using coding software, with a single step added to reformat them to the required form. When it comes to names, it is hard to cover all the errors that may appear, so one of your best options is to use athlete ID numbers that are unique to a single athlete.
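For the date side of this, a single parsing step is usually enough. Here is a small pandas sketch (the entries are made up) that standardises three different spellings of the same date, with anything unparseable becoming NaT so it can be flagged rather than silently stored:

```python
import pandas as pd

# Hypothetical manual entries where the same date has been typed three ways
dates = pd.Series(["01/08/2022", "2022-08-01", "1 Aug 2022"])

# Parse each entry individually, then store everything in one standard form
parsed = dates.apply(lambda d: pd.to_datetime(d, dayfirst=True, errors="coerce"))
print(parsed)  # all three rows resolve to 2022-08-01
```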
Adding an id to your data
For manually input data, it is easier to add the single ID than it is to spell out an athlete’s full name. This removes the need for spell checking, but also allows you to link the data to an athlete table containing names and other characteristics. It is a simple change to your data management that makes a huge difference to how easy your data is to analyse.
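As a quick sketch of that link (the table contents and column names are hypothetical), manual entries carry only the ID, and names are joined back on from the athlete table when you need them for reporting:

```python
import pandas as pd

# Hypothetical athlete lookup table, maintained once in one place
athletes = pd.DataFrame({
    "athlete_id": [101, 102],
    "name":       ["Jane Smith", "John Doe"],
    "position":   ["Forward", "Back"],
})

# Manually entered testing data only needs the short ID, not the full name
testing = pd.DataFrame({
    "athlete_id": [101, 102, 101],
    "cmj_height": [41.2, 38.5, 42.0],   # hypothetical jump heights (cm)
})

# Join names and other characteristics back on when reporting
report = testing.merge(athletes, on="athlete_id", how="left")
print(report)
```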
Adding an ID in external software
Some software, for GPS or force plates, will allow you to add external IDs that are included in the data export you create. For example, companies like Opta have an ID associated with each unique athlete. This is their data management process, which helps keep their job easy with so many athletes in their database. You can add this ID to your athlete table for easy identification and linking across the relevant datasets you might have. Where this is not possible, you will need to make sure the athlete is named correctly on initial setup, so you can associate the correct ID with your athlete during the transformation steps of the data management process.
Storing your data in a database
The last step is storing your data in a stable location that is both secure and easily accessible by all parties that require it. You have two options available:
- A simple database using Excel files and OneDrive cloud storage.
- Creating a more stable and secure cloud based database using a data warehouse.
Both options have their benefits: one is easy to set up, the other is a little more complicated but offers far greater benefits for longevity and ease of use.
Simple database using Excel and OneDrive
An easy way to create a database is to use Excel files stored in a cloud storage system. A really easy solution is Excel and OneDrive. These products work really well together and also link easily with a reporting tool like Power BI. If you live in a different ecosystem, such as Google, that will also work well and allow you to easily link with Tableau and Power BI.
For this type of database, your best option is to create a backup of your main tables frequently, and also to make sure your folders and files are named appropriately so you can identify where your data is kept. This is a good option regardless, so that you always have offline access to your data if required in the future. These steps can be automated quickly using Python or R, allowing you to save time and start analysing your data sooner.
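As a small sketch of that automation in Python (the folder and file names are placeholders for your own OneDrive structure), a date-stamped copy of the main table is saved to a backup folder each time the script runs:

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical locations; adjust to your own OneDrive folder structure
source = Path("OneDrive/Data/TrainingLoad/training_load.xlsx")
backup_dir = Path("OneDrive/Data/Backups")
backup_dir.mkdir(parents=True, exist_ok=True)

# Copy the main table with the date in the file name so old versions are kept
backup_file = backup_dir / f"training_load_{date.today():%Y%m%d}.xlsx"
shutil.copy2(source, backup_file)
print(f"Backed up to {backup_file}")
```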
Creating a stable database using the Cloud
The other option you have is to utilise the cloud data warehouse options being offered by companies such as Snowflake. This allows you to create stable data structures that will connect to almost anything, whilst also giving you the power to store and analyse greater amounts of data rapidly. The main benefit of a system like this is the ability to scale up and down when required, whilst not having to maintain expensive equipment yourself. Snowflake is also constantly providing updates that make the system easier and more efficient, another benefit of using a cloud provider.
For a cloud-based setup, you will require some different tools, including Azure Blob, AWS S3 or Google Cloud Storage. These options are like OneDrive or Google Drive but with fewer frills and at a lower cost, at roughly $25 USD for 1 TB of storage. The benefit of these systems is cheap backup storage, in addition to what is provided by Snowflake, as well as an easy connection to your cloud data warehouse. You can add a simple step of code in R or Python that will add the data to one of these storage solutions.
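As a rough sketch of that step using AWS S3 and the boto3 library (the file and bucket names are hypothetical, and your AWS credentials are assumed to already be configured on the machine), the upload is a single call:

```python
import boto3

# Assumes AWS credentials are already configured for this machine
s3 = boto3.client("s3")

# Hypothetical file and bucket names
s3.upload_file(
    Filename="training_load_20220801.csv",
    Bucket="my-team-data-backup",
    Key="gps/training_load_20220801.csv",
)
```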
Lastly, the database can be created within Snowflake. A simple connection to your storage provider will automatically bring your data into Snowflake, where you can transform it and create the required views of the data you need. This is all done using SQL code, which is really easy to learn and use.
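To give a flavour of what that looks like, here is a hedged sketch using Snowflake’s Python connector (the connection details, table and view names are all placeholders) to run a small piece of SQL that builds a daily-load view over a raw GPS table:

```python
import snowflake.connector

# Hypothetical connection details; in practice keep credentials out of code
conn = snowflake.connector.connect(
    user="my_user",
    password="my_password",
    account="my_account",
    warehouse="ANALYTICS_WH",
    database="SPORT",
    schema="GPS",
)

# A simple SQL view over the raw table loaded from cloud storage
conn.cursor().execute("""
    CREATE OR REPLACE VIEW daily_load AS
    SELECT athlete_id, session_date, SUM(total_distance) AS total_distance
    FROM raw_gps
    GROUP BY athlete_id, session_date
""")
conn.close()
```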
This post gives you a good outline of some data management best practices, but it is only a starting point, as there is a lot more to learn. Security of data in line with government regulations is huge, especially when it comes to health data. So if you are looking to improve your data management practices, please reach out below and I am happy to chat with you further about your options.
Otherwise, I look forward to seeing you all in the next data for sport post!