Ihab F. Ilyas

Data Cleaning



defined for the following hypotheses, where H0 is the null hypothesis and Ha is the alternative hypothesis:

      H0 : there are no outliers in the dataset

      Ha : there is exactly one outlier in the dataset.

      Grubbs’ test statistic is defined as $G = \frac{\max_{i=1,\ldots,N} |Y_i - \bar{Y}|}{s}$, with $\bar{Y}$ and $s$ denoting the sample mean and standard deviation, respectively. In other words, Grubbs’ test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation. The above test statistic is the two-sided version of the test. Grubbs’ test can also be used as a one-sided test for determining whether the minimum value is an outlier or the maximum value is an outlier. To test whether the minimum value $Y_{\min}$ is an outlier, the test statistic is defined as $G = \frac{\bar{Y} - Y_{\min}}{s}$; similarly, to test whether the maximum value $Y_{\max}$ is an outlier, the test statistic is defined as $G = \frac{Y_{\max} - \bar{Y}}{s}$. For the two-sided test, the null hypothesis of no outliers is rejected if $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$, with $t_{\alpha/(2N),\,N-2}$ denoting the upper critical value of the t-distribution with N − 2 degrees of freedom and a significance level of α/(2N). For a one-sided test, replace α/(2N) with α/N.
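
      As a minimal sketch of the statistics above, the following Python uses only the standard library. The function names and the sample values are illustrative assumptions, not taken from the text. The t critical value is not computed here; the caller is assumed to look it up (for example, in a statistical table) and pass it in.

```python
import math
import statistics


def grubbs_statistics(values):
    """Return the two-sided, minimum-side, and maximum-side Grubbs statistics."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    two_sided = max(abs(y - mean) for y in values) / s
    g_min = (mean - min(values)) / s  # one-sided: is the minimum an outlier?
    g_max = (max(values) - mean) / s  # one-sided: is the maximum an outlier?
    return two_sided, g_min, g_max


def grubbs_critical_value(n, t_crit):
    """Turn a t critical value into the Grubbs rejection threshold.

    t_crit is assumed to be the upper critical value of the t-distribution
    with n - 2 degrees of freedom at level alpha / (2n) for the two-sided
    test, or alpha / n for a one-sided test.
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Hypothetical sample, chosen only to exercise the functions.
sample = [2.0, 4.0, 4.0, 5.0, 15.0]
two_sided, g_min, g_max = grubbs_statistics(sample)
print(two_sided, g_min, g_max)
```

      The two-sided statistic always equals the larger of the two one-sided statistics; only the significance level used for the critical value differs between the variants.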

[Table 2.1: employee records with name, age, income, and tax; table image not reproduced]

      Example 2.1 Consider Table 2.1, including the name, age, income, and tax of employees. Domain knowledge suggests that the first value t1[age] and the last value t9[age] are outlying age values. We use Grubbs’ test with a significance level α = 0.05 to identify outliers in the age column.

      The mean of the 9 age values is 136.78, and the standard deviation of the 9 age values is 323.92. Grubbs’ test statistic is $G = |t_9[\text{age}] - 136.78| / 323.92$, the largest standardized deviation, which is obtained at t9. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.21. Since G > Gcritical, the null hypothesis is rejected. Therefore, t9[age] is reported as an outlier.

      Removing t9[age], we are left with 8 age values. The mean of the 8 age values is 28.88, and the standard deviation of the 8 age values is 12.62. Grubbs’ test statistic is $G = |t_1[\text{age}] - 28.88| / 12.62$, which is obtained at t1. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.12. Since G > Gcritical, the null hypothesis is rejected. Therefore, t1[age] is reported as an outlier.

      Removing t1[age], we are left with 7 age values. The mean of the 7 age values is 32.86, and the standard deviation of the 7 age values is 6.15. Grubbs’ test statistic is $G = |t_8[\text{age}] - 32.86| / 6.15$, which is obtained at t8. With a significance level α = 0.05, the critical value is computed as Gcritical = 2.01. Since G < Gcritical, the null hypothesis is not rejected, and no further outliers are reported.
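
      The repeated test-and-remove procedure of Example 2.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the ages below are hypothetical (they are not the values of Table 2.1), the function name is invented for the sketch, and the two-sided critical values for N = 9, 8, and 7 at α = 0.05 are the ones quoted in the example rather than computed from the t-distribution.

```python
import statistics

# Two-sided critical values at alpha = 0.05, as quoted in Example 2.1.
CRITICAL = {9: 2.21, 8: 2.12, 7: 2.01}


def grubbs_iterative(values, critical):
    """Repeatedly apply the two-sided Grubbs' test, removing one value per round."""
    remaining = list(values)
    outliers = []
    # Stop when no critical value is available for the current sample size.
    while len(remaining) in critical:
        mean = statistics.mean(remaining)
        s = statistics.stdev(remaining)  # sample standard deviation
        # The suspect is the value with the largest absolute deviation.
        suspect = max(remaining, key=lambda y: abs(y - mean))
        g = abs(suspect - mean) / s
        if g <= critical[len(remaining)]:
            break  # null hypothesis of no outliers is not rejected
        outliers.append(suspect)
        remaining.remove(suspect)
    return outliers, remaining


# Hypothetical ages, NOT the contents of Table 2.1.
ages = [2, 24, 26, 29, 31, 33, 36, 41, 950]
outliers, remaining = grubbs_iterative(ages, CRITICAL)
print(outliers)   # [950, 2]
print(remaining)  # [24, 26, 29, 31, 33, 36, 41]
```

      As in the example, the extreme value is removed first, the second outlier only becomes detectable afterwards, and the third round fails to reject the null hypothesis.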

      The previous discussion assumes that the data follows an approximately normal distribution. To assess this, several graphical techniques can be used, including the Normal Probability Plot, the Run Sequence Plot, the Histogram, the Box Plot, and the Lag Plot.

      Iglewicz and Hoaglin provide an extensive discussion of the outlier tests previously given [Iglewicz and Hoaglin 1993]. Barnett and Lewis [1994] provide a book-length treatment of the subject, including additional tests for data that is not normally distributed.

      The other type of statistics-based approach first fits a statistical distribution to describe the normal behavior of the given data points, and then applies a statistical inference procedure to determine if a certain data point belongs to the learned model. Data points that have a low probability according to the learned statistical model are declared as anomalous outliers. In this section, we discuss parametric approaches for fitting a distribution to the data.

      Univariate

      We first consider univariate outlier detection, for example, for a set of values x1, x2, …, xn that appear in one column of a relational table. Assuming the data follows a normal distribution, fitting a normal distribution to the values essentially means computing the mean μ and the standard deviation σ from the current data points x1, x2, …, xn. Given μ and σ, a simple way to identify outliers is to compute a z-score for every xi, which is defined as the number of standard deviations xi is away from the mean, namely $z\text{-score}(x_i) = |x_i - \mu| / \sigma$. Data values that have a z-score greater than a threshold, for example, three, are declared to be outliers.
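
      A minimal sketch of the z-score rule is shown below. The function name and the column values are illustrative assumptions; the sketch also assumes σ is estimated with the sample standard deviation (`statistics.stdev`), although the population version (`statistics.pstdev`) could be used instead.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    return [
        (i, x)
        for i, x in enumerate(values)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical column of 20 values with one clearly deviating entry.
column = [30, 31, 29, 32, 28, 30, 31, 29, 30, 32,
          28, 31, 29, 30, 30, 31, 29, 30, 30, 95]
print(z_score_outliers(column))  # [(19, 95)]
```

      Note that with a sample standard deviation the largest possible z-score in a sample of n values is (n − 1)/√n, so a threshold of three can never be exceeded when n is 10 or fewer.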

      Since there might be outliers among x1, x2, …, xn, the estimated μ and σ might be far off from their actual values, so that some outliers in the data go undetected, as we show in Example 2.2.

      This effect is called masking [Hellerstein 2008]; that is, a single data point shifts the mean and standard deviation so much that it masks other outliers. To mitigate the effect of masking, robust statistics are often employed, which can correctly capture important properties of the underlying distribution even in the face of many outliers in the data values. Intuitively, the breakdown point of an estimator is the proportion of incorrect data values (e.g., arbitrarily large or small values) an estimator can tolerate before giving an incorrect estimate. The mean and standard deviation have the lowest breakdown point: a single bad value can distort the mean completely.
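
      The masking effect can be reproduced with a small experiment. The sketch below is illustrative only: the data is hypothetical, the function name is invented, and the sample standard deviation is assumed for σ. The value 90 is flagged when it is the only deviating entry, but adding a far more extreme value inflates the mean and standard deviation enough to hide it.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # assumption: sample standard deviation
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical "typical" values clustered around 30.
typical = [30, 31, 29, 32, 28, 30, 31, 29, 30,
           32, 28, 31, 29, 30, 30, 31, 29, 30]

# With a single deviating value, the z-score rule detects it.
print(flag_outliers(typical + [90]))        # [90]

# A second, far more extreme value masks the first one.
print(flag_outliers(typical + [90, 5000]))  # [5000]
```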

      Robust Univariate Statistics. We now introduce two robust statistics, the median and the median absolute deviation (MAD), which can replace the mean and standard deviation, respectively. The median of a set of n data points is the data point for which half of the data points are smaller, and half are larger; in the case of an even number of data points, the median is the average of the middle two data points. The median, also known as the 50th percentile, is of critical importance in robust statistics with a breakdown point of 50%; as long as no more than half the data are outliers, the median will not give an arbitrarily bad result. The median absolute deviation (MAD) is defined as the median of the absolute deviations from the data’s median, namely, $\mathrm{MAD} = \mathrm{median}_i(|x_i - \mathrm{median}_j(x_j)|)$. Similar to the median, MAD is a more robust statistic than the standard deviation. In the calculation of the standard deviation, the distances from xi to the mean are squared, so large deviations, which often are caused by outliers, are weighted heavily, while in the calculation of MAD, the deviations of a small number of outliers are irrelevant because MAD