Ted Kwartler

Sports Analytics in Practice with R



in effect the factor level alone represents specific “meta” information such as the other teams in the conference, and perhaps even some of the team’s schedule. This meta-information is inherited as a pattern within the larger data set, not explicitly defined within the object type. While this may be confusing, it will make sense as the object types and classes move from single values to multiple values later in this chapter. The code below simply creates a single object, `teamA`, with a factor defined as the Eastern conference. The function to declare a value as a factor is simply `as.factor`.

      teamA <- as.factor('Eastern_Conference')
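As a minimal sketch of how a factor level is shared across observations, the snippet below builds a factor from several values instead of one. The `conferences` object and its four values are hypothetical, chosen only for illustration.

```r
# Hypothetical conference labels for four teams (illustrative values only)
conferences <- as.factor(c('Eastern_Conference', 'Western_Conference',
                           'Eastern_Conference', 'Eastern_Conference'))

class(conferences)    # the object class is "factor"
levels(conferences)   # the distinct levels, sorted alphabetically
table(conferences)    # how many teams fall into each level
```

Every team that shares a level implicitly shares whatever that level stands for, which is the pattern described above.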

In addition to factors, the last commonplace variable type is “character.” Character objects represent natural language, for example, text from social media or fan forums that needs to be analyzed. The field of character and string analysis is referred to as Natural Language Processing (NLP). These methods underpin the popular smart speakers and voice assistants, among other everyday technologies such as e-mail spam filters. This book devotes one chapter to gauging fan engagement on a popular forum, so this data type will be covered extensively. However, one chapter covers only the basics of NLP, and much more can be accomplished with additional methods, code, and academic literature. Below is a fictitious social media post from a fan. Character values can be declared with `as.character`, but, as written here, the function is not necessary.

      fanTweet <- "I love baseball"
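A brief sketch of inspecting a character object follows. It reuses the `fanTweet` value from above and relies only on base R functions; the number `123` is an arbitrary illustration of explicit conversion.

```r
fanTweet <- "I love baseball"

class(fanTweet)      # "character"
nchar(fanTweet)      # number of characters in the string, spaces included
toupper(fanTweet)    # an upper-case copy of the string
as.character(123)    # explicit conversion of a number to the string "123"
```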

| Name | Code | Description |
|---|---|---|
| “integer” | `x <- 5L` | A whole number without a decimal point |
| “numeric” | `y <- 5.123` | A floating point number |
| “logical” | `z <- TRUE` or `z <- T` (capital `T` or `F` is acceptable too) | A logical “Boolean” value, either TRUE or FALSE. R will interpret TRUE as 1 and FALSE as 0 for some operations |
| “factor” | `playerPosition <- as.factor("forward")` | A factor is a distinct class often representing non-unique information. The factor classes are referred to as “levels.” Here, a player position is defined as a factor with the level “forward” |
| “character” | `fanComment <- "I love the hot dogs at the stadium"` | Character values, known as strings, represent natural language. Unlike factors, they can be repeating or mutually exclusive. A growing subset of analytics work includes Natural Language Processing (NLP) |
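The types in the table can be checked directly with `class`. The sketch below recreates each example object from the table and also illustrates how R treats TRUE as 1 in arithmetic; the three-element logical vector passed to `sum` is an illustrative assumption.

```r
x <- 5L
y <- 5.123
z <- TRUE
playerPosition <- as.factor("forward")
fanComment <- "I love the hot dogs at the stadium"

class(x)                # "integer"
class(y)                # "numeric"
class(z)                # "logical"
class(playerPosition)   # "factor"
class(fanComment)       # "character"

# Logical values behave as 1 and 0 in arithmetic
z + z                       # TRUE + TRUE gives 2
sum(c(TRUE, FALSE, TRUE))   # counts the TRUE values
```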

Previously, the objects created, such as `xVal` and `i`, each represented a single value. R’s coding environment relies on specific data types and corresponding classes that can be more complex than a single value. For instance, R can create and work with “vectors.” Vectors are merely columns of data, which may be familiar if you’re coming to R from a spreadsheet program. To create a numeric vector, you employ the combine function, `c`. In the following code, a vector of numbers called `xVec` is created. The `xVec` object uses some of the objects previously created along with additional values that are explicitly declared within the `c` function. Each value within the vector is separated by a comma. Once `xVec` is created, calling it in the console returns multiple values, with objects such as `xVal` replaced by their numeric values.

      xVec <- c(xVal, i, newObj, 345, 678)
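The sketch below makes the vector example runnable on its own. The values assigned to `xVal`, `i`, and `newObj` are placeholder assumptions, since those objects are defined earlier in the chapter; substitute the values from your own session.

```r
# Placeholder values (assumptions); the originals are defined earlier in the chapter
xVal   <- 1
i      <- 4
newObj <- 1

xVec <- c(xVal, i, newObj, 345, 678)

xVec           # prints all five values, with the objects replaced by their numbers
length(xVec)   # number of elements in the vector
xVec[4]        # square brackets select a single element, here the fourth
class(xVec)    # "numeric"
```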

Scaling up from a single vector, one way to arrange multiple columns into a single object is `cbind`. The `cbind` function arranges vectors column-wise. Similarly, the `rbind` function stacks vectors as rows. The resulting object is no longer “numeric” or any other type discussed previously, but instead a “matrix.” A matrix arranges data into rows and columns within a single object. This code creates `xMatrix` using `cbind`, simply repeating the previous vector `xVec` to create a second column. Once executed, the `xMatrix` variable is in the environment and, when called, shows a five-row by two-column arrangement of the data in a single object. Calling `class` on the object will return “matrix” (in R 4.0.0 and later, “matrix” “array”).

      xMatrix <- cbind(xVec, xVec)
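To see the difference between column-wise and row-wise binding, the following sketch compares the dimensions of the two results. The values in `xVec` are assumed for illustration, and `xRows` is a hypothetical name introduced here.

```r
xVec <- c(1, 4, 1, 345, 678)   # assumed values for illustration

xMatrix <- cbind(xVec, xVec)   # vectors become columns
xRows   <- rbind(xVec, xVec)   # vectors become rows

dim(xMatrix)        # rows then columns: 5 2
dim(xRows)          # rows then columns: 2 5
is.matrix(xMatrix)  # TRUE
xMatrix[2, 1]       # element in row 2, column 1
```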

| number1 | logical2 | factor3 | string4 |
|---|---|---|---|
| 1 | TRUE | a | string1 |
| 4 | TRUE | b | s2 |
| 1 | FALSE | a | s3 |
| 345 | FALSE | b | s4 |
| 678 | TRUE | b | s5 |

      xDataFrame <- data.frame(number1 = xVec, logical2 = c(T,T,F,F,T), factor3 = as.factor(c('a','b','a','b','b')), string4 = c('string1', 's2', 's3', 's4', 's5'))
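A short sketch of inspecting the data frame follows. The `xVec` values are assumed for illustration, and `stringsAsFactors = FALSE` is written out explicitly (it is the default from R 4.0.0 onward) so the string column stays a character type on older versions too. The `xCoerced` object is a hypothetical name used to contrast the result with a matrix built from the same columns.

```r
xVec <- c(1, 4, 1, 345, 678)   # assumed values for illustration

xDataFrame <- data.frame(number1  = xVec,
                         logical2 = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                         factor3  = as.factor(c('a', 'b', 'a', 'b', 'b')),
                         string4  = c('string1', 's2', 's3', 's4', 's5'),
                         stringsAsFactors = FALSE)

str(xDataFrame)             # compact summary: one line per column with its type
sapply(xDataFrame, class)   # each column keeps its own class
dim(xDataFrame)             # 5 rows, 4 columns

# Binding a numeric and a character column into a matrix forces one common type
xCoerced <- cbind(xDataFrame$number1, xDataFrame$string4)
class(xCoerced[1, 1])       # "character": the numbers were converted to strings
```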

R can employ either a matrix or a data frame to arrange data in rows and columns. In both object types, the columns and rows must be complete. For example, you cannot cleanly `cbind` a vector with three values to another with two values: `cbind` recycles the shorter vector and issues a warning, and `data.frame` stops with an error. Such data is “ragged,” and for matrices or data frames it requires you to fill in the missing cell value with NA. However, some functions require one object class over another. The difference is that a matrix must have all values be of the same data type. For example, each value in all of the columns must be all numeric or all logical. If this is not the case, R will automatically coerce the data to a single common type, typically character, which can cause issues. As a result, most often in this