Lillian Pierson

Data Science For Dummies



target variable must be binary or ordinal. Binary classification assigns a 1 for “yes” and a 0 for “no.”

       Predictive features should be independent of each other.

      Logistic regression requires a greater number of observations than linear regression to produce a reliable result. The rule of thumb is that you should have at least 50 observations per predictive feature if you expect to generate reliable results.

      

Predicting survivors on the Titanic is the classic practice problem for newcomers learning logistic regression. You can practice it and see lots of worked examples of this problem on Kaggle (www.kaggle.com/c/titanic).

      Ordinary least squares (OLS) regression methods

      Ordinary least squares (OLS) is a statistical method that fits a linear regression line to a dataset. With OLS, you do this by squaring the vertical distance values that describe the distances between the data points and the best-fit line, adding up those squared distances, and then adjusting the placement of the best-fit line so that the summed squared distance value is minimized. Use OLS if you want to construct a function that’s a close approximation to your data.
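      To make the squaring-and-summing idea concrete, here's a minimal sketch in plain Python. It covers only the one-feature case, the data values are made up for illustration, and the function names fit_ols and sum_squared_residuals are hypothetical helpers rather than part of any library.

```python
def fit_ols(xs, ys):
    """Return (slope, intercept) of the line minimizing the summed squared vertical distances."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form solution for a single feature
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up the squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_ols(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# Nudging the fitted line should never lower the summed squared distance
nudged = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"fitted line: {best:.3f}, nudged line: {nudged:.3f}")
```

      In practice, you'd probably reach for a statistics library instead of hand-rolling the fit, especially when you have more than one feature.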

      

As always, don’t expect the actual value to be identical to the value predicted by the regression. Predicted values are simply the model’s best estimates of the actual values.

      OLS is particularly useful for fitting a regression line to models containing more than one independent variable. In this way, you can use OLS to estimate the target from dataset features.

      

When you use OLS regression methods to fit a regression with more than one independent variable, two or more of the variables may be interrelated. When two or more independent variables are strongly correlated with each other, this is called multicollinearity. Multicollinearity makes the individual coefficient estimates less reliable, so it’s hard to judge each variable’s effect as a predictor on its own. Luckily, however, multicollinearity doesn’t decrease the predictive reliability of the model as a whole.
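      One simple way to screen for multicollinearity is to compute the pairwise correlation between features and flag the strongly correlated pairs. The following sketch assumes a made-up dataset and an arbitrary cutoff of 0.8; both the cutoff and the pearson helper are illustrative assumptions, not fixed rules.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (spread_a * spread_b)


# Made-up features: the two height columns carry nearly the same information
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 22, 45, 29, 51],
}

THRESHOLD = 0.8  # assumed cutoff for "strongly correlated"

names = list(features)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(features[first], features[second])
        if abs(r) > THRESHOLD:
            flagged.append((first, second))

print(flagged)
```

      A flagged pair is a hint to look closer, not proof of a problem. Pairwise correlation can also miss cases where one feature depends on a combination of several others.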

      Many statistical and machine learning approaches assume that your data has no outliers. Outlier removal is an important part of preparing your data for analysis. In this section, you see a variety of methods you can use to discover outliers in your data.

      Analyzing extreme values

      Outliers are data points with values that are significantly different from the majority of data points in a variable. It’s important to find and remove outliers because, left untreated, they skew the variable’s distribution, make variance appear falsely high, and distort the correlations between variables.

      Outliers fall into the following three categories:

       Point: Point outliers are data points with anomalous values compared to the normal range of values in a feature.

       Contextual: Contextual outliers are data points that are anomalous only within a specific context. To illustrate, if you’re inspecting weather station data from January in Orlando, Florida, and you see a temperature reading of 23 degrees F, this would be quite anomalous because the average temperature there is 70 degrees F in January. But consider if you were looking at data from January at a weather station in Anchorage, Alaska — a temperature reading of 23 degrees F in this context isn’t anomalous at all.

       Collective: These outliers appear near one another, all having similar values that are anomalous compared to the majority of values in the feature.

      You can detect outliers using either a univariate or multivariate approach, as spelled out in the next two sections.

      Detecting outliers with univariate analysis

      Univariate outlier detection is where you look at features in your dataset and inspect them individually for anomalous values. You can choose from two simple methods for doing this:

       Tukey outlier labeling

       Tukey boxplotting

      Tukey boxplotting is an exploratory data analysis technique that’s useful for visualizing the distribution of a numeric variable in terms of its quartiles. As you might guess, the Tukey boxplot was named after its inventor, John Tukey, an American mathematician who did most of his work back in the 1960s and 70s. Tukey outlier labeling refers to labeling data points that lie beyond the minimum and maximum extremes of a boxplot as outliers.

      

Here’s a good rule of thumb:

      a = Q1 – 1.5*IQR

      and

      b = Q3 + 1.5*IQR.

      Here, Q1 and Q3 are the first and third quartiles, and IQR is the interquartile range (Q3 – Q1). If your minimum value is less than a, or your maximum value is greater than b, the variable probably has outliers.
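      Here's a minimal sketch of that rule of thumb using only Python's standard library. The data values are made up, and the choice of the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different values for a and b.

```python
import statistics

# Made-up illustrative values for a single numeric variable
values = [2, 3, 4, 5, 5, 6, 7, 8, 9, 30]

# quantiles() with n=4 returns the three cut points Q1, median, Q3
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

a = q1 - 1.5 * iqr  # lower limit
b = q3 + 1.5 * iqr  # upper limit

# Label anything outside [a, b] as an outlier
outliers = [v for v in values if v < a or v > b]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"a={a}, b={b}, outliers={outliers}")
```

      In this made-up example, the maximum value of 30 is greater than b, so it gets labeled as an outlier.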

Schematic illustration of spotting outliers with a Tukey boxplot.

      Credit: Python for Data Science Essential Training Part 1, LinkedIn.com

      Detecting outliers with multivariate analysis

      Sometimes outliers show up only within combinations of data points from disparate variables. These outliers wreak havoc on machine learning algorithms, so it’s important to detect and remove them. You can use multivariate analysis of outliers to do this. A multivariate approach to outlier detection involves considering two or more variables at a time and inspecting them together for outliers. You can use one of several methods, including:

       A scatter-plot matrix

       Boxplotting

       Density-based spatial clustering of applications with noise (DBSCAN), as discussed in Chapter 5

       Principal component analysis (PCA, as shown in Figure 4-8)
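      To see why a multivariate approach can catch what a univariate one misses, consider this sketch of the noise-labeling idea behind DBSCAN, hand-rolled in plain Python. The function name dbscan_noise, the data, and the eps and min_samples settings are all assumptions chosen for illustration; this isn't a full clustering implementation.

```python
import math


def dbscan_noise(points, eps, min_samples):
    """Return the points that DBSCAN would label as noise.

    A point is a core point if at least min_samples points (itself included)
    lie within eps of it. A point is noise if it is not a core point and has
    no core point within eps.
    """
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_samples}
    return [
        points[i]
        for i, nbrs in enumerate(neighbors)
        if i not in core and not any(j in core for j in nbrs)
    ]


# Made-up data: ten points along a diagonal, plus one point off the pattern.
# The stray point's x value (2) and y value (9) are each unremarkable alone.
points = [(float(i), float(i)) for i in range(1, 11)] + [(2.0, 9.0)]

noise = dbscan_noise(points, eps=1.5, min_samples=3)
print(noise)
```

      Because this approach relies on distances, features measured on very different scales usually need to be rescaled first, and the results can change quite a bit with different eps and min_samples values.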