straight x subscript ij minus straight x with bar on top subscript straight j right parenthesis left parenthesis straight x subscript ik minus straight x with bar on top subscript straight k right parenthesis over denominator straight n minus 1 end fraction."/> (2.5)
The diagonal elements of S, sjj, j = 1,…,p, are the sample variances of the individual variables. When k = j, the sample covariance in (2.5) reduces to sj², the sample variance of the jth variable, so the notations sjj and sj² both denote the sample variance of xj. It also follows from (2.5) that sjk = skj, so the sample covariance matrix S is symmetric. In terms of the observation vectors xi, the sample covariance matrix S can also be written as

$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
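As an illustration, the two ways of forming S described above can be compared numerically. The following is a minimal sketch in plain Python; the data matrix `X` is hypothetical (it is not Table 2.1), and the names `cov_elementwise`, `S_elem`, and `S_outer` are introduced here only for the example.

```python
# Hypothetical data: n = 4 observations (rows) of p = 3 variables (columns).
X = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 4.0],
    [6.0, 2.0, 9.0],
    [8.0, 6.0, 6.0],
]
n, p = len(X), len(X[0])

# Sample mean vector: one mean per variable.
xbar = [sum(row[j] for row in X) / n for j in range(p)]


def cov_elementwise(j, k):
    """s_jk as in (2.5): sum of products of deviations, divided by n - 1."""
    return sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in X) / (n - 1)


# Route 1: fill the matrix element by element.
S_elem = [[cov_elementwise(j, k) for k in range(p)] for j in range(p)]

# Route 2: accumulate the outer products (x_i - xbar)(x_i - xbar)^T
# over the observation vectors and divide by n - 1.
S_outer = [[0.0] * p for _ in range(p)]
for row in X:
    d = [row[j] - xbar[j] for j in range(p)]
    for j in range(p):
        for k in range(p):
            S_outer[j][k] += d[j] * d[k] / (n - 1)

for r in S_elem:
    print([round(v, 4) for v in r])
```

Both routes should agree up to floating-point rounding, and the printed matrix should be symmetric with the sample variances on its diagonal.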
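Similarly, we define the sample correlation matrix as

$$\mathbf{R} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}$$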
The (j, k)th element of R is the sample correlation of the jth and kth variables:

$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}}\,\sqrt{s_{kk}}} = \frac{s_{jk}}{s_j s_k}$$
The sample correlation between a variable and itself is equal to 1. So the diagonal elements of a sample correlation matrix are all equal to 1. The sample correlation matrix R is obviously symmetric since rjk = rkj.
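The properties of R stated above can be checked with a short sketch. It uses plain Python and a hypothetical data matrix (not Table 2.1); the helper names `sample_cov_matrix` and `sample_corr_matrix` are assumptions made for this example. It also rescales one variable, as a change of units would, to show that the covariances change while the correlations do not.

```python
import math

# Hypothetical data: n = 4 observations (rows) of p = 3 variables (columns).
X = [
    [2.0, 1.0, 5.0],
    [4.0, 3.0, 4.0],
    [6.0, 2.0, 9.0],
    [8.0, 6.0, 6.0],
]


def sample_cov_matrix(data):
    """Sample covariance matrix with divisor n - 1."""
    n, p = len(data), len(data[0])
    xbar = [sum(row[j] for row in data) / n for j in range(p)]
    return [
        [
            sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in data) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def sample_corr_matrix(S):
    """r_jk = s_jk / (s_j * s_k), where s_j is the square root of s_jj."""
    p = len(S)
    sd = [math.sqrt(S[j][j]) for j in range(p)]
    return [[S[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]


S = sample_cov_matrix(X)
R = sample_corr_matrix(S)

# Multiply the first variable by 1000 (for example, a change of units).
X_scaled = [[1000.0 * row[0]] + row[1:] for row in X]
S_scaled = sample_cov_matrix(X_scaled)
R_scaled = sample_corr_matrix(S_scaled)

print("s12 before and after rescaling:", S[0][1], S_scaled[0][1])
print("r12 before and after rescaling:", R[0][1], R_scaled[0][1])
```

The diagonal of `R` should equal 1 up to rounding, and `R_scaled` should match `R` even though the first row and column of `S_scaled` differ from those of `S`.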
Example 2.4 Consider the data set in Table 2.1. In Example 2.2, we found that x̄1 = 2479.5 and x̄2 = 170.35. Similarly, we can obtain x̄3 = 65.41. So the mean vector of x = (x1 x2 x3)T is given by

$$\bar{\mathbf{x}} = \begin{pmatrix} 2479.5 \\ 170.35 \\ 65.41 \end{pmatrix}$$
In Example 2.2, we calculated the sample variances, sample covariance, and sample correlation of x1 and x2. Similarly, we can obtain the sample variance of x3 and its sample covariance and correlation with the other two variables as
Note that while s23 is much smaller than s13, r23 is greater than r13, which indicates that the linear association between x2 and x3 is stronger than that between x1 and x3. This shows that the magnitude of a covariance by itself does not indicate how strong the relationship between two variables is, because it depends on the units in which the variables are measured. Combining all the sample variance, covariance, and correlation information, the sample covariance matrix and sample correlation matrix of x = (x1 x2 x3)T can be written as
2.2.3 Linear Combination of Variables
We are often interested in linear combinations of the variables x1, x2,…, xp. For example, two of the variables in the auto_spec data set are city.mpg and highway.mpg. If you expect that 60% of the mileage for a car is on highways and 40% is on local roads, then the average MPG for the car can be estimated as 0.6 × highway.mpg + 0.4 × city.mpg, which is a linear combination of city.mpg and highway.mpg. In general, let c1, c2,…, cp be constants and consider the linear combination of the variables x1, x2,…, xp given by

$$z = c_1 x_1 + c_2 x_2 + \cdots + c_p x_p$$
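For each observation of the data set, the corresponding value of the variable z can be found by

$$z_i = c_1 x_{i1} + c_2 x_{i2} + \cdots + c_p x_{ip} = \mathbf{c}^T \mathbf{x}_i, \quad i = 1, \ldots, n,$$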
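where cT = (c1 c2 … cp). It can be seen that the sample mean of z is

$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i = \frac{1}{n}\sum_{i=1}^{n} \mathbf{c}^T \mathbf{x}_i = \mathbf{c}^T \bar{\mathbf{x}}$$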
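The sample variance of z can be found as

$$s_z^2 = \frac{1}{n-1}\sum_{i=1}^{n} (z_i - \bar{z})^2 = \frac{1}{n-1}\sum_{i=1}^{n} \mathbf{c}^T (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T \mathbf{c} = \mathbf{c}^T \mathbf{S} \mathbf{c} \qquad (2.8)$$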
Because a sample variance is always non-negative, (2.8) gives cT S c ≥ 0 for any c ∈ ℝ^p. Therefore, the sample covariance matrix S is always a positive semidefinite matrix.
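The relations for the mean and variance of a linear combination can be checked numerically. The sketch below uses plain Python with hypothetical (city.mpg, highway.mpg) pairs that are not taken from the auto_spec data set; the weights 0.4 and 0.6 follow the mileage example above, and the helper name `quad_form` is introduced only for this illustration.

```python
# Hypothetical (city.mpg, highway.mpg) pairs; NOT taken from auto_spec.
X = [
    [21.0, 27.0],
    [19.0, 26.0],
    [24.0, 30.0],
    [31.0, 38.0],
    [17.0, 22.0],
]
c = [0.4, 0.6]  # weights on city.mpg and highway.mpg
n, p = len(X), len(X[0])

# Sample mean vector and sample covariance matrix of the original variables.
xbar = [sum(row[j] for row in X) / n for j in range(p)]
S = [
    [
        sum((row[j] - xbar[j]) * (row[k] - xbar[k]) for row in X) / (n - 1)
        for k in range(p)
    ]
    for j in range(p)
]


def quad_form(v, M):
    """Quadratic form v^T M v for a vector v and a square matrix M."""
    return sum(v[j] * M[j][k] * v[k] for j in range(len(v)) for k in range(len(v)))


# Direct route: compute z_i = c^T x_i for each observation, then summarize.
z = [sum(c[j] * row[j] for j in range(p)) for row in X]
z_mean = sum(z) / n
z_var = sum((zi - z_mean) ** 2 for zi in z) / (n - 1)

# Formula route: c^T xbar for the mean and c^T S c for the variance.
mean_from_formula = sum(c[j] * xbar[j] for j in range(p))
var_from_formula = quad_form(c, S)

print("mean of z:", z_mean, "vs c^T xbar:", mean_from_formula)
print("variance of z:", z_var, "vs c^T S c:", var_from_formula)
```

The two routes should agree up to rounding. Evaluating `quad_form(v, S)` for other weight vectors `v` should never give a negative value beyond rounding error, which is the positive semidefinite property in numerical form.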
In general, if we have q linear combinations of x1, x2,…, xp defined