defined for the following hypotheses, where H0 is the null hypothesis and Ha is the alternative hypothesis:
H0 : there are no outliers in the dataset
Ha : there is exactly one outlier in the dataset.
Grubbs’ test statistic is defined as G = maxi |xi − μ| / σ, where μ and σ denote the mean and the standard deviation of the n values x1, x2, …, xn. The null hypothesis of no outliers is rejected at significance level α if G exceeds the critical value ((n − 1)/√n) · sqrt(t² / (n − 2 + t²)), where t = tα/(2n),n−2 is the upper critical value of the t-distribution with n − 2 degrees of freedom.
Table 2.1 Employee records, including name, age, income, and tax
Example 2.1 Consider Table 2.1, including the name, age, income, and tax of employees. Domain knowledge suggests that the first value t1[age] and the last value t9[age] are outlying age values. We use Grubbs’ test with a significance level α = 0.05 to identify outliers in the age column.
The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. Grubbs’ test statistic is maximized at t9[age]: G = (1000 − 136.78)/323.92 ≈ 2.66, which exceeds the critical value of roughly 2.21 for n = 9 at α = 0.05, so t9[age] is declared an outlier and removed.
Removing t9[age], we are left with 8 age values. The mean of the 8 age values is 28.88, and the standard deviation of the 8 age values is 12.62. Grubbs’ test statistic is now maximized at t1[age]: G = (28.88 − 1)/12.62 ≈ 2.21, which exceeds the critical value of roughly 2.13 for n = 8, so t1[age] is also declared an outlier and removed.
Removing t1[age], we are left with 7 age values. The mean of the 7 age values is 32.86, and the standard deviation of the 7 age values is 6.15. Grubbs’ test statistic now falls below the critical value of roughly 2.02 for n = 7, so no further values are declared outliers and the procedure terminates.
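A minimal sketch of this iterative procedure in Python follows. The concrete age values are hypothetical, chosen to approximately reproduce the summary statistics quoted above, and the critical values are the standard two-sided Grubbs thresholds at α = 0.05 for small n:

```python
from statistics import mean, stdev

# Standard two-sided Grubbs critical values at alpha = 0.05
# (covers only the sample sizes needed for this example).
CRITICAL_05 = {7: 2.020, 8: 2.126, 9: 2.215}

def grubbs_outliers(values, critical=CRITICAL_05):
    """Repeatedly apply Grubbs' test, removing at most one outlier per round."""
    data = list(values)
    outliers = []
    while len(data) in critical:
        mu, sigma = mean(data), stdev(data)
        # The test statistic G is maximized by the value farthest from the mean.
        candidate = max(data, key=lambda x: abs(x - mu))
        if abs(candidate - mu) / sigma <= critical[len(data)]:
            break  # no remaining value is extreme enough
        outliers.append(candidate)
        data.remove(candidate)
    return outliers

# Hypothetical ages with mean 136.78 and standard deviation 323.92, as in Example 2.1
ages = [1, 25, 28, 30, 32, 34, 38, 43, 1000]
print(grubbs_outliers(ages))  # → [1000, 1]
```

Note that the test removes one suspected outlier per round and re-estimates μ and σ, which is why both t9[age] and t1[age] are eventually caught.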
The previous discussion assumes that the data follows an approximately normal distribution. To assess this assumption, several graphical techniques can be used, including the Normal Probability Plot, the Run Sequence Plot, the Histogram, the Box Plot, and the Lag Plot.
Iglewicz and Hoaglin provide an extensive discussion of the outlier tests given above [Iglewicz and Hoaglin 1993]. Barnett and Lewis [1994] provide a book-length treatment of the subject, including additional tests for data that is not normally distributed.
2.2.3 Fitting Distribution: Parametric Approaches
The other type of statistics-based approach first fits a statistical distribution to describe the normal behavior of the given data points, and then applies a statistical inference procedure to determine whether a certain data point belongs to the learned model. Data points that have a low probability according to the learned statistical model are declared to be outliers. In this section, we discuss parametric approaches for fitting a distribution to the data.
Univariate
We first consider univariate outlier detection, for example, for a set of values x1, x2, …, xn that appear in one column of a relational table. Assuming the data follows a normal distribution, fitting the values under a normal distribution essentially means computing the mean μ and the standard deviation σ from the current data points x1, x2, …, xn. Given μ and σ, a simple way to identify outliers is to compute a z-score for every xi, defined as the number of standard deviations xi lies from the mean, namely, z-score(xi) = |xi − μ| / σ. Data values that have a z-score greater than a threshold, for example, three, are declared to be outliers.
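A short sketch of z-score-based flagging, using hypothetical measurement values:

```python
from statistics import mean, stdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [abs(x - mu) / sigma for x in values]

# Hypothetical measurements; a threshold of three flags only the extreme value
data = [9.9, 10.1, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.3, 9.7, 10.0, 30.0]
flagged = [x for x, z in zip(data, z_scores(data)) if z > 3]
print(flagged)  # → [30.0]
```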
Since there might be outliers among x1, x2, …, xn, the estimated μ and σ might be far from their actual values, resulting in outliers being missed, as we show in Example 2.2.
Example 2.2 Consider again the age column in Table 2.1. The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. The procedure that identifies values that are more than 2 standard deviations away from the mean as outliers would mark values that are not in the range of [136.78 − 2 * 323.92, 136.78 + 2 * 323.92] = [–511.06, 784.62]. The last value t9[age] is not in the range, and thus is correctly marked as an outlier. The first value t1[age], however, is in the range and is thus missed.
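With hypothetical ages chosen to match the statistics quoted in Example 2.2, this masking effect is easy to reproduce:

```python
from statistics import mean, stdev

ages = [1, 25, 28, 30, 32, 34, 38, 43, 1000]  # hypothetical; mean 136.78, stdev 323.92
mu, sigma = mean(ages), stdev(ages)
lo, hi = mu - 2 * sigma, mu + 2 * sigma
outliers = [x for x in ages if not lo <= x <= hi]
print(outliers)  # → [1000]; the small outlying age 1 falls inside the range and is masked
```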
This effect is called masking [Hellerstein 2008]; that is, a single data point has shifted the mean and standard deviation so much as to mask other outliers. To mitigate masking, robust statistics are often employed; these can correctly capture important properties of the underlying distribution even when many outliers are present in the data values. Intuitively, the breakdown point of an estimator is the proportion of incorrect data values (e.g., arbitrarily large or small values) the estimator can tolerate before giving an incorrect estimate. The mean and the standard deviation have the lowest possible breakdown point of 0: a single bad value can distort them completely.
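The contrast in breakdown points can be seen with the standard-library `statistics` module (the ages below are hypothetical):

```python
from statistics import mean, median

clean = [25, 28, 30, 32, 34, 38, 43]  # hypothetical ages
dirty = clean + [1000]                # one arbitrarily bad value

print(round(mean(clean), 2), median(clean))  # → 32.86 32
print(round(mean(dirty), 2), median(dirty))  # → 153.75 33.0
```

A single corrupted entry drags the mean from about 33 to about 154, while the median moves only from 32 to 33.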
Robust Univariate Statistics. We now introduce two robust statistics, the median and the median absolute deviation (MAD), which can replace the mean and the standard deviation, respectively. The median of a set of n data points is the data point for which half of the data points are smaller and half are larger; in the case of an even number of data points, the median is the average of the middle two data points. The median, also known as the 50th percentile, is of critical importance in robust statistics, with a breakdown point of 50%: as long as no more than half the data points are outliers, the median will not give an arbitrarily bad result. The median absolute deviation (MAD) is defined as the median of the absolute deviations from the data’s median, namely, MAD = mediani(|xi − medianj (xj)|). Like the median, MAD is a more robust statistic than the standard deviation. In the calculation of the standard deviation, the distances from xi to the mean are squared, so large deviations, which are often caused by outliers, are weighted heavily, while in the calculation of MAD, the deviations of a small number of outliers are irrelevant because MAD