Hey everyone,
Welcome back to this weeks R for sport post. This week, we will start working with CSV files in R. I will show you how to read the data into an R environment, manipulate the data and export the data to a new file. Let’s not waste any time and get straight into it!
\[\\[0.1in]\]
Reading a CSV file
Reading a CSV file into R is quite simple, as the tidyverse provides a useful function for this. Using read_csv() will read the contents of your CSV file and convert the data into a tibble for use in R.
# ignore the hashtag when using in your scripts
# read_csv("C:/filelocation/filename.csv")
One thing to be aware of with R is the format of your file locations. Instead of using a “" for subdirectories, R recognises the”/" instead. So make sure you write your directories as follows when you input the full location of your file. See the above location as an example.
To store your data in the R environment, you need to assign a variable name. Generally, a tibble might be named something similar to “data” or “shots_df”, depending on what your CSV data contains. Make sure to name the variable something easy to remember and also reuse in your code.
# ignore the hashtag when using in your scripts
# data <- read_csv("C:/filelocation/filename.csv")
\[\\[0.1in]\]
Filtering your tibble
Once your data is in the environment, you can manipulate the data to use in models or visuals. One of the first things you might want to do is filter your dataset to only include the data you want. For example, this might be to find a particular team, player, or position. Whatever your filter might be, you will want to use the filter function in R. This function takes your dataset and filters based on the expressions you provide. For example, if you want to filter a shot dataset to only include a single team, you would do something similar to below.
# ignore the hashtag when using in your scripts
#data_filter <- data %>%
# filter(team.name == "Arsenal FC")
One of the first things you will notice is that coding looks for double equals signs “==” to indicate team name should equal “Arsenal FC”. To identify the reverse, where something does not equal “Arsenal FC”, you would use “!=”. This differs from what might be familiar in Excel, where “<>” would indicate does not equal.
If we would like to filter by multiple values, we can use an “in” clause with a defined list. This tells the filter function to match values in the list rather than just a single value. We can also add a second expression using “&&” in the filter function. This allows us to filter by team name and match date for example. You can see examples of both filter types below.
# ignore the hashtag when using in your scripts
#data_filter <- data %>%
# filter(team.name in c("Arsenal FC", "Liverpool FC"))
#
#data_filters <- data %>%
# filter(team.name = "Arsenal FC" && match.date == "2020-01-01")
\[\\[0.1in]\]
Selecting specific columns
When using data from some providers, only some columns might apply to your model or visual of interest. Using the select function allows you to choose the columns you want to keep.
# ignore the hashtag when using in your scripts
# Select using a list and column index
#data_select <- data %>%
# filter(team.name in c("Arsenal FC", "Liverpool FC")) %>%
# select(c(1:5))
#
# Select using column names
#data_select <- data %>%
# filter(team.name in c("Arsenal FC", "Liverpool FC")) %>%
# select(c("ID", "team.name", "match.date", "location.x", "location.y"))
The select function has to options when selecting columns. You can use the column index or exact column names in the function. Both options need to be provided in a list using “c()” in your code as above. When you want to select multiple columns that are sequential, the column index provides a quick and easy way of doing this. By using the column name, you can also reorder the columns at the same time, which may be useful if this data will be used outside R for visualisation or to add to a database with strict column ordering.
\[\\[0.1in]\]
Summarising your data
Once you have a filtered data set, you might want to summarise the data to show something different. For example, we might have filtered the data to a single game or a single season and want to view the top goal scorers, or how many goals were scored by a single team. To find that out, we would do the following:
# ignore the hashtag when using in your scripts
# Summarise using an additional filter and group_by
#data_summary <- data_select %>%
# filter(type.name == "Shot") %>%
# group_by(team.name) %>%
# summarise("Shots" = n() )
The above code, will summarise our dataset to only rows where the type.name is equal to shot. We then group_by team names and count the number of rows that exist based on our filters. This will output a table that shows each team and the number of shots they took. You could also add the match date in the group_by statement to add a secondary group, to show shots on each match day. This gives you the ability to output a summary of a single game or multiple games in to a single file.
\[\\[0.1in]\]
Exporting to a new CSV file
Now we have our summarised dataset, we might want to export that for use elsewhere. Within R, you can easily export the data in a single CSV file for use in external software with the write_csv function.
# ignore the hashtag when using in your scripts
# Export summary to a new CSV file
# write_csv(data_summary, "c:/filelocation/folder/filename.csv")
The above code, requires a dataframe or tibble to be supplied, along with a file location including filename. This will export the data provided in to a single CSV file for use elsewhere in your data management systems.
\[\\[0.1in]\]
Video
You can find the full video showing the information discussed above, in a little more detail, below.
As always, hit like on the video and subscribe here for more videos to help you Power Performance Through Data.
Until next time,
Josh