If all bins have the same frequency, the histogram is called an equi-depth histogram.

      Equi-width histograms can be used to detect outliers: data points that fall into bins with very low frequency are reported as outliers. A major challenge with histogram-based outlier detection is choosing the "right" bin width. If the bins are too narrow, some bins will have low frequency and the normal data points belonging to them will be falsely reported as outliers; if the bins are too wide, outlier data points may be absorbed into bins that also contain normal data points, and the real outliers will be missed. The problem is further exacerbated when adapting histograms to multivariate data, since the frequency of every division diminishes as the number of dimensions increases, as mentioned earlier in the discussion of the curse of dimensionality.

[Figure 2.2: Equi-width histograms of the age column in Table 2.1, with (a) 10 bins and (b) 50 bins.]

      Example 2.5 Figure 2.2 shows two equi-width histograms we built for the age column in Table 2.1 using free statistics software [Wessa 2012]. We report data points residing in bins that contain fewer than three data points as outliers.

      In Figure 2.2(a), the age values are split into 10 equal-width bins, namely (0, 100], (100, 200], …, (900, 1000]. Only the first bin and the last bin contain data points. Specifically, bin (0, 100] contains 8 points and bin (900, 1000] contains 1 point. Therefore, we report t9[age] ∈ (900, 1000] as an outlier.

      In Figure 2.2(b), the age values are split into 50 equal-width bins, namely (0, 20], (20, 40], …, (980, 1000]. Only the first three bins and the last bin contain data points. Specifically, bin (0, 20] contains 1 point, bin (20, 40] contains 6 points, bin (40, 60] contains 1 point, and bin (980, 1000] contains 1 point. Therefore, we report t1[age] ∈ (0, 20], t8[age] ∈ (40, 60], and t9[age] ∈ (980, 1000] as outliers.
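
      The following minimal sketch implements this rule with NumPy. The age values shown are hypothetical placeholders chosen only to match the bin counts described above (Table 2.1 is not reproduced here), and the function name histogram_outliers is ours, not from the text:

```python
import numpy as np

# Hypothetical ages consistent with the bin counts in Example 2.5.
ages = np.array([10, 25, 28, 30, 32, 35, 38, 45, 999])

def histogram_outliers(values, bin_width, lo=0, hi=1000, min_count=3):
    """Report values falling into equi-width bins that hold fewer than min_count points."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(values, bins=edges)
    # Bin index of each value, using right-closed bins (a, b] as in the text.
    bin_idx = np.digitize(values, edges, right=True) - 1
    return values[counts[bin_idx] < min_count]

print(histogram_outliers(ages, bin_width=100))   # 10 bins: flags 999
print(histogram_outliers(ages, bin_width=20))    # 50 bins: flags 10, 45, and 999
```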

      There are usually two ways to generalize a histogram to deal with multivariate data: (1) compute an outlier score for each dimension using a one-dimensional histogram, and then combine the per-dimension scores into an overall outlier score for every data point; or (2) bin every dimension and treat each cell of the resulting multidimensional grid as a division; the number of data points belonging to each division is counted, and data points that belong to divisions with very low frequency counts are reported as outliers. The performance of histogram methods usually deteriorates with an increasing number of dimensions due to the sparsity of the grid structure [Aggarwal 2013]. For example, given a 10-dimensional space with each dimension split into 10 bins, the grid contains 10^10 cells; it would require 10^10 data points for every grid cell to contain one data point in expectation.
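
      To see this sparsity effect concretely, the short sketch below (synthetic uniform data, not from the book) follows approach (2): it bins each dimension into 10 intervals and counts points per grid cell. As the dimensionality grows, almost every occupied cell contains a single point, so a low-frequency rule would flag nearly everything:

```python
import numpy as np

rng = np.random.default_rng(0)
n, bins_per_dim = 10_000, 10

for d in (1, 2, 3, 10):
    X = rng.random((n, d))                      # synthetic points in [0, 1)^d
    cells = (X * bins_per_dim).astype(int)      # per-dimension bin index = grid cell id
    _, counts = np.unique(cells, axis=0, return_counts=True)
    print(f"d={d}: {bins_per_dim**d} cells, {len(counts)} occupied, "
          f"max count in a cell = {counts.max()}")
```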

      Kernel Density Estimation

      KDE [Rosenblatt 1956, Parzen 1962, Gan and Bailis 2017] is a nonparametric way to estimate the pdf f(x) of a random variable X based on a sample of n data points. In this respect it is similar to histograms, but the estimate can be endowed with properties such as smoothness or continuity by using a suitable kernel.

      Let x_1, x_2, …, x_n be an independent and identically distributed sample drawn from some distribution with unknown pdf f(x). KDE estimates f(x) using the estimator \hat{f}_h(x) defined as follows:

\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)

      where K(·) is the kernel and h is a smoothing parameter called the bandwidth. A kernel has to be a non-negative function that integrates to one and has mean zero, namely, \int_{-\infty}^{\infty} K(x)\,dx = 1 and K(x) = K(−x). A kernel with a subscript h is a scaled kernel that satisfies K_h(x) = \frac{1}{h} K\left(\frac{x}{h}\right). Some commonly used kernel functions include the normal, uniform, triangular, biweight, triweight, and Epanechnikov kernels. Just as the choice of bin width influences the result of a histogram, the bandwidth of the kernel is a free parameter that strongly influences the resulting KDE.
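
      The estimator can be transcribed directly into a few lines of code. The following is a minimal sketch (the names kde, normal_kernel, and epanechnikov_kernel are our own, not from the text) showing two of the common kernels listed above:

```python
import numpy as np

def normal_kernel(u):
    # Standard normal density: non-negative, integrates to one, symmetric.
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    # 3/4 (1 - u^2) on [-1, 1], zero elsewhere.
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(x, sample, h, kernel=normal_kernel):
    """Evaluate f_hat_h(x) = (1 / (n h)) * sum_i K((x - x_i) / h) at the points x."""
    sample = np.asarray(sample, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # shape (m, 1) for broadcasting
    return kernel((x - sample) / h).sum(axis=1) / (len(sample) * h)
```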

      Example 2.6 To compare KDEs with histograms, consider 9 data points x_1 = 1, x_2 = 1, x_3 = 2, x_4 = 2, x_5 = 2, x_6 = 4, x_7 = 6, x_8 = 9, and x_9 = 10. Figure 2.3 compares the pdfs obtained from an equi-width histogram with bin width 2 and from three KDEs with different kernel functions and bandwidths. As we can see, the KDEs generally produce much smoother estimates than the histogram. Comparing Figures 2.3(b), 2.3(c), and 2.3(d), we can see that the bandwidth clearly controls the smoothness of the resulting pdf. Given the estimated pdf, any points with low estimated density are flagged as outliers.
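
      As a rough way to reproduce the bandwidth effect in this example, the sketch below evaluates Gaussian-kernel KDEs at the nine points for a few illustrative bandwidths (not necessarily those used in Figure 2.3) and lists the points with the lowest estimated density. It uses scipy.stats.gaussian_kde, whose bandwidth factor is scaled by the sample standard deviation, hence the conversion:

```python
import numpy as np
from scipy.stats import gaussian_kde

data = np.array([1, 1, 2, 2, 2, 4, 6, 9, 10], dtype=float)

for h in (0.5, 1.0, 2.0):                       # illustrative bandwidths
    # gaussian_kde multiplies its factor by the sample std, so divide it out
    # to obtain a Gaussian kernel with standard deviation (bandwidth) h.
    est = gaussian_kde(data, bw_method=h / data.std(ddof=1))
    density = est(data)                         # estimated pdf at each observed point
    lowest = data[np.argsort(density)][:3]      # candidate outliers: lowest-density points
    print(f"h={h}: lowest-density points {lowest}")
```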

      Generalizing KDE from univariate to multivariate data is similar to generalizing histograms from univariate to multivariate data, and is discussed in detail in Simonoff [2012].

[Figure 2.3: pdf estimates for the sample in Example 2.6, using (a) an equi-width histogram with bin width 2 and (b)-(d) KDEs with different kernels and bandwidths.]

      2.3 Distance-Based Outlier Detection

      In contrast to statistics-based outlier detection techniques, distance-based outlier detection techniques do not assume an underlying model that generates the data and are often nonparametric. These approaches define a distance between data points that is used to characterize normal behavior; for example, normal data points should be close to many other data points. Any data point that deviates from such normal behavior is declared an outlier [Knorr and Ng 1998, 1999, Breunig et al. 2000].

      Distance-based outlier detection methods can be further divided into global or local methods depending on the reference population used when determining whether a point is an outlier. A global distance-based outlier detection method determines whether a point is an outlier based on the distance between that data point and all other data points in the dataset. On the other hand, a local method considers the distance between a point and its neighborhood points when determining outliers.

      2.3.1 Global Distance-Based Outlier Detection

      Distance-based methods were first introduced by Knorr and Ng [Knorr and Ng 1998, 1999]. An object O in a dataset I is a distance-based outlier, i.e., a DB(p, D) outlier, if at least a fraction p of the objects in I lie at a distance greater than D from O. DB(p, D) can be seen as a generalization of several existing statistics-based outlier detection definitions. For example, let I consist of observations from a univariate normal distribution N(μ, σ) and let O be a point from I; then the z-score of O is greater than 3 if and only if O is a DB(0.9988, 0.13σ) outlier [Knorr and Ng 1998]. Knorr and Ng proved that statistics-based outlier detection definitions based on other underlying distributions, such as the Student t-distribution and the Poisson distribution, are also equivalent to DB(p, D) outliers with suitable choices of p and D.
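
      A naive quadratic-time sketch of this definition (our own helper, not code from Knorr and Ng) checks, for each object, the fraction of the dataset lying farther than D away:

```python
import numpy as np

def db_outliers(points, p, D):
    """Return objects O such that at least a fraction p of the dataset lies at
    distance greater than D from O, i.e., DB(p, D) outliers."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances; O(n^2), fine for small datasets.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    far_fraction = (dists > D).mean(axis=1)   # O's own distance (0) counts as "not far"
    return points[far_fraction >= p]

# On the hypothetical age values used earlier, the isolated value 999 is the
# only DB(0.8, 100) outlier: more than 80% of the objects lie over 100 away.
ages = np.array([[10], [25], [28], [30], [32], [35], [38], [45], [999]])
print(db_outliers(ages, p=0.8, D=100))        # -> [[999.]]
```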

      The definition of DB(p, D) does not provide a ranking or a score of outliers and requires specifying a distance parameter D, which could be difficult to determine and may involve a trial and error process. There are other distance-based outlier definitions that look at the k nearest