Industrial Data Analytics for Diagnosis and Prognosis. Yong Chen.


cells have the lightest color because any variable has the strongest relationship to itself. From the heatmap in Figure 2.9, we can also see that the two MPG variables (city.mpg and highway.mpg) have strong negative relationships with many of the other numerical variables in the data set.

Figure 2.9 Heatmap of correlation for all numerical variables.

2.2 Summary Statistics

Data visualization is an effective and intuitive representation of the qualitative features of the data. Key characteristics of data can also be quantitatively summarized by numerical statistics. This section introduces common summary statistics for univariate and multivariate data.

2.2.1 Sample Mean, Variance, and Covariance

Sample Mean – Measure of Location

A sample mean, or sample average, provides a measure of location, or central tendency, of a variable in a data set. Consider a univariate data set (a data set with a single variable) that consists of a random sample of n observations x₁, x₂, …, xₙ. The sample mean is simply the ordinary arithmetic average:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$

For a data set yᵢ, i = 1, 2, …, n, obtained by multiplying each xᵢ by a constant a, i.e., yᵢ = axᵢ, it is easy to see that the sample mean is scaled by the same constant:

$\bar{y} = a\bar{x}.$
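As a quick numerical check of this scaling property, the following minimal Python sketch computes a sample mean and confirms that multiplying every observation by a constant multiplies the mean by the same constant. The data values, the constant `a`, and the helper name `sample_mean` are illustrative assumptions, not taken from the text.

```python
def sample_mean(xs):
    """Arithmetic average of a non-empty sequence of numbers."""
    return sum(xs) / len(xs)

x = [2.0, 4.0, 9.0]        # hypothetical observations
a = 2.5                    # hypothetical scaling constant
y = [a * xi for xi in x]   # y_i = a * x_i

x_bar = sample_mean(x)
y_bar = sample_mean(y)

# The mean of the scaled data equals a times the mean of the original data.
assert abs(y_bar - a * x_bar) < 1e-12
```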

Sample Variance – Measure of Spread

The sample variance measures the spread of the data and is defined as

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}.$ (2.1)

The square root of the sample variance, s = √s², is called the sample standard deviation. The sample standard deviation is of the same measurement unit as the observations. For yᵢ = axᵢ, i = 1, 2, …, n, its sample variance is

$s_y^2 = a^2 s^2.$
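The two equivalent forms of the sample variance in (2.1) and the scaling rule can be checked numerically. This is a small sketch with made-up data. The function names `sample_variance` and `sample_variance_shortcut` are hypothetical, and both assume at least two observations.

```python
import math

def sample_variance(xs):
    """Sample variance via squared deviations from the mean (first form of (2.1))."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((xi - x_bar) ** 2 for xi in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Sample variance via the sum of squares (second form of (2.1))."""
    n = len(xs)
    x_bar = sum(xs) / n
    return (sum(xi ** 2 for xi in xs) - n * x_bar ** 2) / (n - 1)

x = [2.0, 4.0, 9.0]        # hypothetical observations
a = 2.5                    # hypothetical scaling constant
y = [a * xi for xi in x]   # y_i = a * x_i

s2_x = sample_variance(x)
s2_y = sample_variance(y)
s_x = math.sqrt(s2_x)      # sample standard deviation, same unit as x

# Both forms of (2.1) agree up to floating-point rounding.
assert math.isclose(s2_x, sample_variance_shortcut(x))
# Scaling the data by a multiplies the variance by a squared.
assert math.isclose(s2_y, a ** 2 * s2_x)
```

The second form needs only running sums, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the first form is usually the safer default.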

Sample Covariance and Correlation – Measure of Linear Association Between Two Variables

If each of the n observations of a data set is measured on two variables x₁ and x₂, let (x₁₁, x₂₁, …, xₙ₁) and (x₁₂, x₂₂, …, xₙ₂) denote the n observations on x₁ and x₂, respectively. The sample covariance of x₁ and x₂ is defined as

$s_{12} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2),$ (2.2)

where x̄₁ and x̄₂ are the sample means of x₁ and x₂, respectively. The value of the sample covariance of two variables reflects the linear association between them. From (2.2), if x₁ and x₂ have a strong positive linear association, they are usually both above their means or both below their means. Consequently, the product (xᵢ₁ − x̄₁)(xᵢ₂ − x̄₂) will typically be positive and their sample covariance will have a large positive value. On the other hand, if x₁ and x₂ have a strong negative linear association, the product (xᵢ₁ − x̄₁)(xᵢ₂ − x̄₂) will typically be negative and their sample covariance will have a large negative value. If y₁ and y₂ are obtained by multiplying each measurement of x₁ and x₂ by a₁ and a₂, respectively, it is easy to see from (2.2) that the sample covariance of y₁ and y₂ is

$s_{12}^{y} = a_1 a_2 s_{12}.$ (2.3)
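The scaling behavior in (2.3) can be illustrated with a short Python sketch. It assumes the sample covariance uses the same n − 1 divisor as the sample variance in (2.1). The data, the factors `a1` and `a2`, and the helper `sample_covariance` are hypothetical.

```python
def sample_covariance(u, v):
    """Sample covariance of two equal-length sequences, with an n - 1 divisor (assumed)."""
    if len(u) != len(v):
        raise ValueError("both variables need the same number of observations")
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (n - 1)

x1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations on the first variable
x2 = [2.0, 4.0, 5.0, 9.0]   # hypothetical observations on the second variable
a1, a2 = 0.5, 10.0          # hypothetical unit-conversion factors

y1 = [a1 * value for value in x1]
y2 = [a2 * value for value in x2]

s12 = sample_covariance(x1, x2)     # positive: the two variables rise together
s12_y = sample_covariance(y1, y2)   # covariance after rescaling both variables

# Rescaling the variables rescales the covariance by a1 * a2, as in (2.3).
assert abs(s12_y - a1 * a2 * s12) < 1e-9
```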

Equation (2.3) says that if the measurements are scaled, for example by changing measurement units, the sample covariance is scaled correspondingly. Because the sample covariance depends on the measurement units, its magnitude alone does not tell us how strong the (linear) association between two variables is. The sample correlation, defined as follows, is a measure of linear association that does not depend on the measurement units or scaling of the variables:

$r_{12} = \frac{s_{12}}{s_1 s_2} = \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{\sqrt{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2 \sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2}},$ (2.4)

where s₁ and s₂ are the sample standard deviations of x₁ and x₂, respectively. The sample correlation ranges between −1 and 1, with values close to 1, −1, and 0 indicating a strong positive linear association, a strong negative linear association, and no linear association, respectively.
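A minimal Python sketch of the right-hand form of (2.4) follows, using made-up data. It checks that rescaling both variables by positive factors leaves the correlation unchanged and that an exactly linear relationship gives a correlation of 1. The helper name `sample_correlation` and all values are assumptions for illustration.

```python
import math

def sample_correlation(u, v):
    """Sample correlation from sums of deviation products (right-hand form of (2.4))."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    s_vv = sum((vi - v_bar) ** 2 for vi in v)
    # Any common divisor (such as n - 1) cancels between numerator and denominator.
    return s_uv / math.sqrt(s_uu * s_vv)

x1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical observations on the first variable
x2 = [2.0, 4.0, 5.0, 9.0]   # hypothetical observations on the second variable

r12 = sample_correlation(x1, x2)

# Rescale both variables by positive factors, as in a change of units.
# A negative factor on one variable would flip the sign of the correlation.
y1 = [0.5 * value for value in x1]
y2 = [10.0 * value for value in x2]
r12_scaled = sample_correlation(y1, y2)

# An exactly linear increasing relationship gives a correlation of 1.
x_line = [3.0 * value + 1.0 for value in x1]
r_line = sample_correlation(x1, x_line)

assert math.isclose(r12, r12_scaled)
assert -1.0 <= r12 <= 1.0
```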

Example 2.2 To illustrate the calculation of summary statistics, we take a random sample of 10 observations, as shown in Table 2.1, from the auto.spec data set on the variables curb.weight, length, and width. We use xᵢ, i = 1, 2, 3, to represent the three variables, in that order:

Table 2.1 A random sample of 10 observations from the auto.spec data set.

x₁ (curb.weight)	x₂ (length)	x₃ (width)
3515	190.9	70.3
2300	168.7	64.0
2800	168.9	65.0
2122	166.3	64.4
2293	169.1	66.0
2765	176.8	64.8
2275	171.7	65.5
1890	159.1	64.2
2926	173.2	66.3
1909	158.8	63.6
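The summary statistics for the sample in Table 2.1 can be computed along the following lines. This is a sketch, not the book's own code. It assumes the columns are curb.weight, length, and width in that order, and that covariances use the n − 1 divisor of (2.1). The names `table`, `means`, `S`, and `R` are hypothetical.

```python
import math

# Table 2.1: each row is (curb.weight, length, width), i.e. (x1, x2, x3).
table = [
    (3515, 190.9, 70.3),
    (2300, 168.7, 64.0),
    (2800, 168.9, 65.0),
    (2122, 166.3, 64.4),
    (2293, 169.1, 66.0),
    (2765, 176.8, 64.8),
    (2275, 171.7, 65.5),
    (1890, 159.1, 64.2),
    (2926, 173.2, 66.3),
    (1909, 158.8, 63.6),
]

names = ["curb.weight", "length", "width"]
columns = [list(col) for col in zip(*table)]   # one list per variable
n = len(table)
p = len(columns)

# Sample mean of each variable.
means = [sum(col) / n for col in columns]

def cov(j, k):
    """Sample covariance between variables j and k (n - 1 divisor, assumed)."""
    return sum(
        (columns[j][i] - means[j]) * (columns[k][i] - means[k]) for i in range(n)
    ) / (n - 1)

# S[j][k] is the sample covariance of variables j and k; the diagonal entries
# S[j][j] are the sample variances.
S = [[cov(j, k) for k in range(p)] for j in range(p)]

# R[j][k] is the sample correlation, following the definition in (2.4).
R = [[S[j][k] / math.sqrt(S[j][j] * S[k][k]) for k in range(p)] for j in range(p)]

for j, name in enumerate(names):
    print(f"{name}: mean = {means[j]:.2f}, variance = {S[j][j]:.2f}")
for j in range(p):
    print("  ".join(f"{R[j][k]: .3f}" for k in range(p)))
```

Collecting the pairwise statistics in the nested lists `S` and `R` keeps the three variables together. Each diagonal entry of `R` is 1 up to rounding, because every variable is perfectly correlated with itself.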
