Lillian Pierson

Data Science For Dummies



about the procedures performed during the compression.


      FIGURE 4-5: You can use SVD to decompose data down to u, S, and V matrices.

      Reducing dimensionality with factor analysis

      Factor analysis is along the same lines as SVD in that it’s a method you can use for filtering out redundant information and noise from your data. An offspring of the psychometrics field, this method was developed to help you derive a root cause in cases where a shared root cause results in shared variance — when a variable’s variance correlates with the variance of other variables in the dataset.

      

A variable’s variance measures how widely its values spread around the mean. The greater a variable’s variance, the more information that variable contains.

      When you find shared variance in your dataset, that means information redundancy is at play. You can use factor analysis or principal component analysis to clear your data of this information redundancy. You see more on principal component analysis in the following section, but for now, focus on factor analysis and the fact that you can use it to compress your dataset’s information into a reduced set of meaningful, non-information-redundant latent variables — meaningful inferred variables that underlie a dataset but are not directly observable.

      Factor analysis makes the following assumptions:

       Your features are metric — numeric variables on which meaningful calculations can be made.

       Your features should be continuous or ordinal (if you’re not sure what ordinal is, refer back to the first class, business class, and economy class analogy in the probability distributions section of this chapter).

       You have more than 100 observations in your dataset and at least 5 observations per feature.

       Your sample is homogeneous.

       The features in your dataset are correlated, with r > 0.3.
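
The sample-size and correlation assumptions in the list above lend themselves to a quick programmatic check. The following is a minimal sketch using NumPy. The function name, the returned keys, and the reading of the correlation rule as "at least one pair of features with |r| above 0.3" are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np

def check_factor_analysis_assumptions(X, min_obs=100, min_obs_per_feature=5, min_r=0.3):
    """Return pass/fail flags for the sample-size and correlation rules of thumb."""
    X = np.asarray(X, dtype=float)
    n_obs, n_features = X.shape
    corr = np.corrcoef(X, rowvar=False)                  # feature-by-feature correlations
    off_diag = np.abs(corr[~np.eye(n_features, dtype=bool)])
    return {
        "enough_observations": n_obs > min_obs,
        "enough_obs_per_feature": n_obs / n_features >= min_obs_per_feature,
        # Assumption: one correlated pair is enough to pass this check.
        "correlated_features": bool((off_diag > min_r).any()),
    }

# Synthetic data: four noisy features driven by one hidden factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = latent @ np.ones((1, 4)) + 0.5 * rng.normal(size=(200, 4))

report = check_factor_analysis_assumptions(X)
print(report)
```

The metric, continuous-or-ordinal, and homogeneity assumptions depend on what the data means, so a check like this can't verify them for you.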

      In factor analysis, you do a regression — a topic covered later in this chapter — on features to uncover underlying latent variables, or factors. You can then use those factors as variables in future analyses, to represent the original dataset from which they’re derived. At its core, factor analysis is the process of fitting a model to prepare a dataset for analysis by reducing its dimensionality and information redundancy.

      Decreasing dimensionality and removing outliers with PCA

      Principal component analysis (PCA) is another dimensionality reduction technique that’s closely related to SVD: This unsupervised statistical method finds relationships between features in your dataset and then transforms and reduces them to a set of non-information-redundant principal components — uncorrelated features that embody and explain the information that’s contained within the dataset (that is, its variance). These components act as a synthetic, refined representation of the dataset, with the information redundancy, noise, and outliers stripped out. You can then use those reduced components as input for your machine learning algorithms to make predictions based on a compressed representation of your data. (For more on outliers, see the “Detecting Outliers” section, later in this chapter.)

      The PCA model makes these two assumptions:

       Multivariate normality (MVN) — or a set of real-valued, correlated, random variables that are each clustered around a mean — is desirable, but not required.

       Variables in the dataset should be continuous.

      Although PCA is like factor analysis, they have two major differences: One difference is that PCA does not regress to find some underlying cause of shared variance, but instead decomposes a dataset to succinctly represent its most important information in a reduced number of features. The other key difference is that, with PCA, the first time you run the model, you don’t specify the number of components to be discovered in the dataset. You let the initial model results tell you how many components to keep, and then you rerun the analysis to extract those features.

      

Similar to the CVE discussion in the SVD part of this chapter, the amount of variance you retain depends on how you’re applying PCA, as well as the data you’re inputting into the model. Breaking it down based on how you’re applying PCA, the following rules of thumb become relevant:

       Used for descriptive analytics: If PCA is being used for descriptive purposes only (for example, when working to build a descriptive avatar of your company’s ideal customer), the CVE can be lower than 95 percent. In this case, you can get away with a CVE as low as 75 to 80 percent.

       Used for diagnostic, predictive or prescriptive analytics: If principal components are meant for downstream models that generate diagnostic, predictive or prescriptive analytics, then CVE should be 95 percent or higher. Just realize that the lower the CVE, the less reliable your model results will be downstream. Each percentage of CVE that’s lost represents a small amount of information from your original dataset that won’t be captured by the principal components.
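
To make the two-pass workflow and the CVE rules of thumb concrete, here is a minimal sketch that runs PCA through NumPy's SVD on synthetic data. The helper names (`pca_cve`, `components_needed`) and the generated dataset are assumptions for illustration. The first pass keeps every component and reports the CVE. The second pass keeps only as many components as the chosen threshold requires.

```python
import numpy as np

def pca_cve(X):
    """Center X, decompose it with SVD, and return the centered data,
    the principal axes, and the cumulative variance explained (CVE)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # share of variance per component
    return Xc, Vt, np.cumsum(explained)

def components_needed(cve, target):
    """Smallest number of components whose CVE reaches the target."""
    k = int(np.searchsorted(cve, target)) + 1
    return min(k, len(cve))

# Synthetic data: six features generated from two hidden factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(300, 6))

# First pass: no component count specified; inspect the CVE.
Xc, Vt, cve = pca_cve(X)
k_descriptive = components_needed(cve, 0.75)   # descriptive rule of thumb
k_predictive = components_needed(cve, 0.95)    # downstream-model rule of thumb

# Second pass: keep only the components needed and project the data onto them.
scores = Xc @ Vt[:k_predictive].T
print(np.round(cve, 3), k_descriptive, k_predictive, scores.shape)
```

Because the SVD already yields every component, the "rerun" here is just a truncation. A library that asks for the component count up front would need a second fit.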

      

When using PCA for outlier detection, plot the principal components on an x-y scatter plot and look for points that sit apart from the main cluster. Those data points are potential outliers that are worth investigating.
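
As a numeric stand-in for the visual inspection just described, the sketch below projects synthetic data onto its first two principal components and flags points that lie far from the center of that plane. The planted outlier, the standardization step, and the distance cutoff of 3.0 are assumptions chosen for illustration, not rules from this chapter.

```python
import numpy as np

# Synthetic data with one planted outlier in row 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[0] = [8.0, 8.0, 8.0, 8.0, 8.0]

# Project the centered data onto the first two principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T            # the x-y coordinates you would plot

# Standardize each component, then measure distance from the origin.
z = scores / scores.std(axis=0)
dist = np.sqrt((z**2).sum(axis=1))

suspects = np.flatnonzero(dist > 3.0)   # assumed cutoff; tune it for your data
print(suspects)
```

Flagged rows are only candidates. You still need to look at each one to decide whether it's an error, a rare event, or a legitimate observation.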

      Life is complicated. We’re often forced to make decisions where several different criteria come into play, and it often seems unclear which criterion should have priority. Mathematicians, being mathematicians, have come up with quantitative approaches that you can use for decision support whenever you have several criteria or alternatives on which to base your decision. You see those approaches in Chapter 3, where I talk about neural networks and deep learning. Another method that fulfills this same decision-support purpose is multiple criteria decision-making (or MCDM, for short).

      Turning to traditional MCDM

      You can use MCDM methods in anything from stock portfolio management to fashion-trend evaluation, from disease outbreak control to land development decision-making. Anywhere you have two or more criteria on which you need to base your decision, you can use MCDM methods to help you evaluate alternatives.

      To use multiple criteria decision-making, the following two assumptions must be satisfied:

       Multiple criteria evaluation: You must have more than one criterion to optimize.

       Zero-sum system: Optimizing with respect to one criterion must come at the sacrifice of at least one other criterion. This means that there must be trade-offs between criteria — to gain with respect to one means losing with respect to at least one other.
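
One way to see whether a set of alternatives actually involves trade-offs is to look for alternatives that no other option beats on every criterion. The sketch below does that in plain Python. The parcel names, the two criteria, and the scores are hypothetical, and higher scores are assumed to be better on both criteria.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(alternatives):
    """Keep the alternatives that no other alternative dominates."""
    return {
        name: scores
        for name, scores in alternatives.items()
        if not any(
            dominates(other, scores)
            for other_name, other in alternatives.items()
            if other_name != name
        )
    }

# Hypothetical land parcels scored on (affordability, accessibility).
parcels = {"A": (9, 2), "B": (6, 6), "C": (3, 9), "D": (2, 2)}

front = non_dominated(parcels)
has_trade_offs = len(front) > 1
print(sorted(front), has_trade_offs)
```

If only one alternative survives, it wins on every criterion and there is nothing to trade off. When several survive, as A, B, and C do here, gaining on one criterion means giving up ground on the other, which is the situation MCDM methods are meant for.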