Lillian Pierson

Data Science For Dummies



the variables have a nonlinear relationship. This curvature occurs because, when variables are related in a nonlinear manner, a given change in the value of x does not always correspond to the same change in the dataset’s y-value.

       Your data is nonnormally distributed.

      To use Spearman Rank to test for correlation between ordinal variables, you’d simply plug the values for your variables into the following formula and calculate the result.

$$\rho = 1 - \frac{6 \sum d^{2}}{n\,(n^{2} - 1)}$$

       ρ = Spearman's rank correlation coefficient

       d = difference between the two ranks of each data point

       n = total number of data points in the data set
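      As a rough illustration of the Spearman rank calculation, the following Python sketch ranks two variables and then computes ρ from the rank differences d and the number of data points n defined above. The function names and the sample values are invented for this example, and the ranking helper assumes that no two values are tied; this simple form of the calculation is exact only in that case.

```python
def rank(values):
    """Return 1-based ranks (smallest value gets rank 1). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = rank(x), rank(y)
    # d is the difference between the two ranks of each data point
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical sample values, made up purely for illustration
watch_time = [1, 2, 3, 4, 5]
viewership = [90, 70, 40, 35, 10]

rho = spearman_rho(watch_time, viewership)
print(rho)  # -1.0: the ranks move in exactly opposite directions
```

      A value near 1 or -1 indicates a strong monotonic relationship between the ranks, and a value near 0 indicates little or none.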


      FIGURE 4-2: An example of a non-linear relationship between watch time and % viewership.

      Any intermediate-level data scientist should have a good understanding of linear algebra and how to do math using matrices. Array and matrix objects are the primary data structures in analytical computing. You need them in order to perform mathematical and statistical operations on large and multidimensional datasets, meaning datasets with many different features to be tracked simultaneously. In this section, you see exactly what is involved in using linear algebra and machine learning methods to reduce a dataset’s dimensionality (in other words, to reduce a dataset’s feature count) without losing the important information the dataset contains. You do this by compressing the features’ information into synthetic variables that you can subsequently use to make predictions or as input into another machine learning model.

      Decomposing data to reduce dimensionality

      

The difference between SVD and PCA is just this: PCA assumes that you are working with a square (n × n) input matrix. If your input matrix is not square, then use SVD instead, because SVD does not make this assumption. PCA is covered in greater detail later in this chapter.

       Compressing sparse matrices: If you have a clean yet sparse dataset, then you don’t want to remove any of the information that the dataset holds, but you do need to compress that information down into a manageable number of variables so that you can use them to make predictions. A handy thing about SVD is that it allows you to set the number of variables, or components, it creates from your original dataset. And if you don’t remove any of those components, then you will reduce the size of your dataset without losing any of its important information. This process is illustrated in Figure 4-3.

       Cleaning and compressing dirty data: In other cases, you can use SVD to do an algorithmic cleanse of a dirty, noisy dataset. In this case, you’d apply SVD to uncover your components and then decide which of them to keep by looking at their variance. The industry standard is that the explained variance of the components you keep should add up to at least 75 percent. This ensures that at least 75 percent of the dataset’s original information has been retained within the components you’ve kept. This process is illustrated in Figure 4-4.

If the sum of the explained variance, or cumulative variance explained (CVE), for the components you keep is less than 95 percent, do not use the components as derived features further downstream in other machine learning models. In this case, the information lost within these derived features will cause the machine learning model to generate inaccurate, unreliable predictions. These derived components are, however, useful as a source for descriptive statistics or for building more general descriptive analytics. In other words, they suit analytics that describe what happened in the past and answer questions like “what happened,” “when,” “how many,” and “where.”
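      To make the CVE check concrete, here is a minimal sketch that uses NumPy to compute the explained variance of each component and then counts how many components are needed to reach a given threshold. The dataset is randomly generated for illustration, the helper name components_needed is hypothetical, and treating each squared singular value of the mean-centered data as that component’s share of variance is an assumption of this sketch rather than a procedure taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy dataset: 100 observations of 6 features that are
# really driven by 2 underlying signals plus a little random noise.
signals = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
data = signals @ mixing + 0.1 * rng.normal(size=(100, 6))

# Center each feature, then decompose.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Share of total variance carried by each component, and the running total.
explained = s ** 2 / np.sum(s ** 2)
cve = np.cumsum(explained)


def components_needed(cve, threshold):
    """Smallest number of leading components whose CVE reaches the threshold."""
    count = int(np.searchsorted(cve, threshold)) + 1
    return min(count, len(cve))


k_descriptive = components_needed(cve, 0.75)  # threshold for keeping components
k_modeling = components_needed(cve, 0.95)     # threshold for downstream models

print("CVE by component count:", np.round(cve, 3))
print("Components for 75 percent:", k_descriptive)
print("Components for 95 percent:", k_modeling)
```

      Because the components are ordered from most to least variance, the count needed for the 95 percent threshold is never smaller than the count needed for the 75 percent threshold.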


      FIGURE 4-3: Applying SVD to compress a sparse, clean dataset.


      FIGURE 4-4: Applying SVD to clean and compress a sparse, dirty dataset.

The lower the CVE, the more you should take your model’s results with a grain of salt.

      

If you remove some components, then when you go to reconstruct your matrix, you'll probably notice that the resulting matrix isn’t an exact match to your original dataset. Worry not! That is the data that remains after much of the information redundancy and noise was filtered out by SVD and removed by you.
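      The effect of removing components can be sketched with NumPy as follows. The small matrix is made up for illustration, and the variable names are assumptions of this example. Note that np.linalg.svd returns the singular values as a one-dimensional array and the right-singular vectors already transposed (called vt here).

```python
import numpy as np

# Hypothetical 5 x 4 data matrix, invented for this example
A = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 5.0, 1.0, 1.0],
])

# Decompose A into its three constituent parts.
u, s, vt = np.linalg.svd(A, full_matrices=False)

# Keeping every component reproduces the original matrix
# (up to floating-point rounding).
A_full = u @ np.diag(s) @ vt

# Keeping only the first k components gives an approximation.
k = 2
A_reduced = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Size of the mismatch between the original and the reduced reconstruction
reconstruction_error = np.linalg.norm(A - A_reduced)

print(np.allclose(A_full, A))        # exact when nothing is removed
print(np.round(A_reduced, 2))        # close to A, but not identical
print(reconstruction_error)
```

      The reduced matrix has the same shape as the original, and the size of the mismatch depends only on the singular values of the components that were dropped.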

      Take a closer look at Figure 4-5:

      A = u * S * vᵀ

       A: This is the matrix that holds all your original data.

       u: This matrix holds the left-singular vectors of A (the eigenvectors of A * Aᵀ), and it holds all the important, nonredundant information about your data’s observations.

       v: This matrix holds the right-singular vectors of A (the eigenvectors of Aᵀ * A). It holds all the important, nonredundant information about your dataset’s features (its columns).

       S: