Ted Kwartler

Sports Analytics in Practice with R



What happens to `x` occurs within the curly brackets. In this case, a simple operation `x + 3` overwrites the internal value of `x` and the new value is returned. The function will be an object in the environment and can accept any numeric or integer value. Here, the function is created and then applied to a value of 2. The output is itself assigned to an object, `exampleThree`.

      plus3 <- function(x){
        x <- x + 3
        return(x)
      }
      exampleThree <- plus3(2)
      exampleThree
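As a brief illustration of the claim that `plus3` accepts any numeric or integer value, the sketch below redefines the function and calls it with a few inputs. The specific inputs are chosen here only for demonstration.

```r
# Same function as above, redefined so this chunk runs on its own
plus3 <- function(x){
  x <- x + 3
  return(x)
}

plus3(2L)          # integer input
plus3(2.5)         # decimal (double) input
plus3(c(1, 2, 3))  # a vector: 3 is added to every element

# A character input is not numeric, so the addition fails with an error;
# tryCatch captures the message instead of stopping the script
tryCatch(plus3("two"), error = function(e) conditionMessage(e))
```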

      Of course, functions can be more complex. As an example, the following function is made more dynamic by adding a new parameter called `value`. Now both parameters are required for the function to operate. The `x` value is divided by the `value` input parameter that is passed into the function. Additionally, before the result is returned from the function, the `round` function is applied, further adjusting the result of the preceding division. In the end, for example, the custom function `divideVal` will accept the number 5, divide it by 2, and then round the result so that it returns the value 2.

      divideVal <- function(x, value){
        x <- x / value
        x <- round(x)
        return(x)
      }
      exampleValue <- divideVal(5,2)
      exampleValue
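The result of 2 may look surprising, since 5 / 2 is 2.5. Base R's `round` follows the IEC 60559 convention of rounding a trailing 5 to the even digit, as described on its help page. The sketch below, using input values picked only for illustration, shows the effect.

```r
divideVal <- function(x, value){
  x <- x / value
  x <- round(x)
  return(x)
}

divideVal(5, 2)  # 5 / 2 = 2.5, rounded to the even digit: 2
divideVal(7, 2)  # 7 / 2 = 3.5, rounded to the even digit: 4
divideVal(9, 4)  # 9 / 4 = 2.25, rounded down to 2

# Both parameters are required; omitting `value` raises an error,
# which tryCatch captures as a message instead of stopping the script
tryCatch(divideVal(5), error = function(e) conditionMessage(e))
```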

      Applying R Basics to Real Data

      Let’s reward your laborious work through foundational R coding with something of actual interest: sports data. Like many scripts in this book, this one begins by loading packages. For each of these, you first need to run `install.packages`; assuming that executes without error, the following library calls will specialize R for the task at hand. In this example script using real sports data, the only goal is to obtain the data, manipulate it, and finally plot it.
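A minimal sketch of that one-time setup follows. It only checks which of the four packages used in this script are not yet installed. The variable names `neededPkgs`, `isInstalled`, and `missingPkgs` are hypothetical, and the install line is left commented out because it needs an internet connection and only has to run once.

```r
neededPkgs <- c("RCurl", "ggplot2", "ggthemes", "tidyr")

# requireNamespace returns TRUE when a package is installed and loadable
isInstalled <- vapply(neededPkgs, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- neededPkgs[!isInstalled]
missingPkgs

# Uncomment to install anything that is missing
# install.packages(missingPkgs)
```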

      To begin, call `library(RCurl)`, which is a general network interface client. Functions within this library allow R to make a network connection to download the data. One could instead have data locally in a file, connect to an API or database, or even web scrape the data. In the upcoming code, however, the data are downloaded directly from an online repository. Next, `library(ggplot2)` loads the grammar of graphics namespace with excellent visualization capabilities. The `library(ggthemes)` call loads a convenience library accompanying `ggplot2` for quick, predefined aesthetics. Lastly, the `library(tidyr)` functions are used for tidying data, which is a style of data organization that is efficient if not intuitive. Here, the basic raw data will be rearranged before plotting.

      library(RCurl)
      library(ggplot2)
      library(ggthemes)
      library(tidyr)

      c1Data <- 'https://raw.githubusercontent.com/kwartler/Practical_Sports_Analytics/main/C1_Data/2019-2020%20Dallas%20Player%20Stats.csv'

      Now, to execute a network connection, employ the `getURL` function, which lies within the `RCurl` package. This function simply accepts the string URL address previously defined. Be sure to have the address exactly correct to avoid any errors.

      nbaFile <- getURL(c1Data)

      Finally, the base-R function `read.csv` is used with the downloaded data. The `read.csv` function is widely used because CSV files are ubiquitous. The function can also accept a local file path leading to a hard disk rather than the file downloaded here, but the path must be exactly correct. Spaces, capitalization, and misspellings will result in cryptic and frustrating file-not-found errors. Assuming the web address was correct and the `getURL` function executed without error, the result of this code is a new object called `nbaData`. It is automatically read in as a `data.frame` object.

      nbaData <- read.csv(text = nbaFile)
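The `text` argument tells `read.csv` to parse a character string as if it were the contents of a file, which is why the downloaded string can be passed in directly. The sketch below imitates that step without a network connection. The string `csvText` and its three columns are made up for illustration and are not the actual Dallas data.

```r
# Hypothetical CSV content held in a single string, like the getURL result
csvText <- "PLAYER,TEAM,PTS\nPlayer A,Dal,10.5\nPlayer B,Dal,8.2"

toyData <- read.csv(text = csvText)
class(toyData)  # "data.frame"
toyData
```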

      Unlike a spreadsheet program where you can scroll to any area of the sheet to look at the contents, R holds the data frame as an object which is an abstraction. As a result, it can be difficult to comprehend the loaded data. Thus, it is a best practice to explore the data to learn about its characteristics. In fact, exploratory data analysis, EDA, in itself is a robust field within analytics. The code below only scratches the surface of what is possible.

      To begin this basic EDA, determine the dimensions of the data by applying the `dim` function to the `nbaData` data frame. This will print the total rows and columns for the data frame. Similar to the indexing code, the first number represents the rows and the second the columns.

      dim(nbaData)

      Since data frames have named columns, you may want to know what the column headers are. The base-R function `names` accepts a few types of objects and in this case will print the column names of the basketball data.
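The call for this step presumably takes the form `names(nbaData)`. The sketch below uses a small made-up data frame, `toyData`, so that it runs without the download; its columns and values are hypothetical.

```r
toyData <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C"),
  TEAM   = c("Dal", "Dal", "Dal"),
  PTS    = c(10.5, 8.2, 3.1)
)

dim(toyData)    # rows first, then columns: 3 3
names(toyData)  # column headers: "PLAYER" "TEAM" "PTS"

# With the chapter's data loaded, the equivalent call would be:
# names(nbaData)
```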

      At this point you know the column names and the size of the data loaded in the environment. Another popular way to get familiar with the data is to glimpse a portion of it. This is preferred to calling the entire object in your console. Data frames can often be tens of thousands of rows or more, plus hundreds of columns. If you call a large object directly in the console, your system may lag trying to print that much data as output. Thus, the popular `head` function accepts a data object along with an integer parameter, `n`, representing the number of records to print. Since this function call is not being assigned to an object, the result is printed to the console for review. The default behavior selects six rows, though this can be adjusted for more or fewer observations. When called, the `head` function will print the first `n` rows of the data frame. This is in contrast to the `tail` function, which will print the last `n` rows.

      head(nbaData, n = 6)
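To make the contrast between `head` and `tail` concrete, here is a short sketch on a made-up ten-row data frame. The object `toyData` and its columns are hypothetical stand-ins for `nbaData`.

```r
# Hypothetical data frame with ten rows
toyData <- data.frame(id = 1:10, pts = seq(2, 20, by = 2))

head(toyData)         # default: first six rows
head(toyData, n = 3)  # first three rows (ids 1, 2, 3)
tail(toyData, n = 3)  # last three rows (ids 8, 9, 10)
```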

      You should notice that the column `TEAM` shows “Dal” for all results in the `head` function. To ensure this data set only contains players from the Dallas team you can employ the `table` function specifying the `TEAM` column either by name or by index position. The `table` function merely tallies the levels or values of a column. After running the next code chunk, you see that “Dal” appears 19 times in this data set. Had there been another value in this column, additional tallied information would be presented.

      table(nbaData$TEAM)
      table(nbaData[,2])
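To show what the tally looks like when a second value is present, the sketch below runs `table` on a made-up data frame in which one row carries a different team code. The object `toyData`, the player labels, and the "Hou" code are hypothetical.

```r
# Hypothetical team column with one stray value
toyData <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C", "Player D"),
  TEAM   = c("Dal", "Dal", "Dal", "Hou")
)

teamCounts <- table(toyData$TEAM)
teamCounts           # Dal appears 3 times, Hou once
table(toyData[, 2])  # same tally, selecting the column by position
```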

      Lastly, another basic EDA function is `summary`. The `summary` function can be applied to any object and will return some information determined by the type of object it receives. In the case of a data frame, the `summary` function will examine each column individually. It will denote character columns and, when columns are declared as factors, will tally the different factor levels. Perhaps most important is how `summary` treats numeric columns. For each numeric column, the minimum, first quartile, median, mean, third quartile, and maximum are returned. If missing values are stored as “NA” in a particular column, the function will also tally them. This allows the practitioner to understand each column's range, distribution, averages, and how much of the column contains NA values.

      summary(nbaData)
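The sketch below applies `summary` to a made-up data frame containing one factor column and one numeric column with a missing value, so both behaviors described above are visible. The object `toyData` and its values are hypothetical.

```r
toyData <- data.frame(
  TEAM = factor(c("Dal", "Dal", "Hou", "Dal")),
  PTS  = c(10.5, NA, 3.1, 8.2)
)

summary(toyData)  # TEAM: counts per level; PTS: six statistics plus NA count

# The same statistics for a single numeric column
ptsSummary <- summary(toyData$PTS)
ptsSummary
```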