A. Gouveia Oliveira

Biostatistics Decoded



to understand what this quantity really represents. However, the standard deviation is the most popular of all measures of dispersion. Why is that?

      A final remark about the variance. Although the variance is an average, the total sum of squares is divided not by the number of observations as an average should be, but by the number of observations minus 1, that is, by n − 1.

      It does no harm if we use symbols to explain the calculations. The formula for the calculation of the variance of an attribute x is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

      where ∑ (the capital Greek letter sigma, equivalent to “S”) stands for summation and $\bar{x}$ represents the mean of attribute x. So, the expression reads “sum all the squared differences of each value from the overall mean and then divide by the sample size minus 1.”

      Naturally, the formula for the standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

      The reason why we use the n − 1 divisor instead of the n divisor for the sum of squares when we calculate the variance and the standard deviation is that, when we present those quantities, we are implicitly trying to give an estimate of their value in the population. Now, since we use the data from our sample to calculate the variance, the resulting value will on average be smaller than the value of the variance in the population. We say that our result is biased toward a smaller value. What is the explanation for that bias?
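The two formulas above can be worked through directly. The following is a minimal sketch with made-up data values, computing the sample variance and standard deviation with the n − 1 divisor:

```python
import math

# Made-up data for illustration
x = [4.0, 7.0, 6.0, 5.0, 8.0]
n = len(x)
mean = sum(x) / n  # sample mean = 6.0

# Sum of squared differences of each value from the sample mean
sum_of_squares = sum((value - mean) ** 2 for value in x)

variance = sum_of_squares / (n - 1)   # n - 1 divisor, as in the formula above
std_dev = math.sqrt(variance)         # standard deviation is its square root

print(variance, std_dev)
```

Here the sum of squares is 10, so the variance is 10 / 4 = 2.5 and the standard deviation is its square root, about 1.58.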

      Remember that the variance is the average of the squared differences between individual values and the mean. If we calculated the variance by subtracting the individual values from the true mean (the population mean), the result would be unbiased. This is not what we do, though. We subtract the individual values from the sample mean. Since the sample mean is the value that is closest to all the values in the dataset, in the sense that it minimizes the sum of squared differences, individual values are on average closer to the sample mean than to the population mean. Therefore, the value of the sample variance tends to be smaller than the value of the population variance. The variance is a good measure of dispersion of the values observed in a sample, but it is biased as a measure of dispersion of the values in the population from which the sample was taken. However, this bias is easily corrected if the sum of squares is divided by the number of observations minus 1.
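The key fact in this argument, that the sum of squares is smallest when taken around the sample mean, can be checked numerically. A minimal sketch, with made-up data and an arbitrary candidate value of 6.5 standing in for a hypothetical population mean:

```python
# Made-up data for illustration
x = [4.0, 7.0, 6.0, 5.0, 8.0]
sample_mean = sum(x) / len(x)  # 6.0

def sum_of_squares(data, center):
    """Sum of squared differences of each value from a given center."""
    return sum((value - center) ** 2 for value in data)

# The sum of squares around the sample mean is never larger than around
# any other value, so in particular not larger than around the (usually
# unknown) population mean. Hence the bias toward smaller variances.
assert sum_of_squares(x, sample_mean) <= sum_of_squares(x, 6.5)
print(sum_of_squares(x, sample_mean), sum_of_squares(x, 6.5))
```

Whatever candidate value is tried in place of 6.5, the inequality holds, which is exactly why dividing by n understates the population variance.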

[Figure 1.12: the n divisor of the sum of squares. Figure 1.13: the n − 1 divisor of the sum of squares.]

      Using a computer’s random number generator, we obtained random samples of a variable with variance equal to 1. This is the population variance of that variable. Starting with samples of size 2, we obtained 10 000 random samples and computed their sample variances using the n divisor. Next, we computed the average of those 10 000 sample variances and retained the result. We then repeated the procedure with samples of size 3, 4, 5, and so on up to 100.

      The plot of the averaged value of sample variances against sample size is represented by the solid line in Figure 1.12. It can clearly be seen that, regardless of the sample size, the variance computed with the n divisor is on average less than the population variance, and the deviation from the true variance increases as the sample size decreases.

      Now let us repeat the procedure, exactly as before, but this time using the n − 1 divisor. The plot of the average sample variance against sample size is shown in Figure 1.13. The solid line is now exactly over 1, the value of the population variance, for all sample sizes.

      This experiment clearly illustrates that, contrary to the sample variance using the n divisor, the sample variance using the n − 1 divisor is an unbiased estimator of the population variance.
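A small-scale version of this experiment can be reproduced in a few lines. The sketch below uses a single arbitrary sample size of 5 and 100 000 replicates (rather than sweeping sample sizes 2 through 100), drawing from a normal population with variance 1:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def sample_variance(data, divisor):
    """Sum of squared deviations from the sample mean over a chosen divisor."""
    mean = sum(data) / len(data)
    return sum((value - mean) ** 2 for value in data) / divisor

n = 5
replicates = 100_000
total_n, total_n1 = 0.0, 0.0
for _ in range(replicates):
    # Random sample from a population with variance 1
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    total_n += sample_variance(sample, n)        # n divisor
    total_n1 += sample_variance(sample, n - 1)   # n - 1 divisor

avg_n = total_n / replicates    # tends toward (n - 1)/n = 0.8: biased low
avg_n1 = total_n1 / replicates  # tends toward 1.0, the population variance
print(avg_n, avg_n1)
```

The average with the n divisor settles near 0.8, that is, (n − 1)/n of the true variance for n = 5, while the average with the n − 1 divisor settles near 1, in line with Figures 1.12 and 1.13.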

      Degrees of freedom is a central notion in statistics that applies to all problems of estimation of quantities in populations from the observations made on samples. The number of degrees of freedom is the number of values in the calculation of a quantity that are free to vary. The general rule for finding the number of degrees of freedom for any statistic that estimates a quantity in the population is to count the number of independent values used in the calculation minus the number of population quantities that were replaced by sample quantities during the calculation.

      In the calculation of the variance, instead of summing the squared differences of each value from the population mean, we summed the squared differences from the sample mean. We thus replaced a population parameter with a sample statistic and, because of that, lost one degree of freedom. Therefore, the number of degrees of freedom of a sample variance is n − 1.

      As a binary variable is a numeric variable, in addition to calculating a mean, which is called a proportion in binary variables, we can also calculate a variance. The computation is the same as for interval variables: the differences of each observation from the mean are squared, then summed up and divided by the number of observations. With binary variables there is no need to correct the denominator, and the sum of squares is divided by n.
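The computation described above can be sketched with made-up binary data. Note that the variance obtained this way, with the n divisor, works out to p × (1 − p), where p is the proportion of ones:

```python
# Made-up binary data: 7 ones out of 10 observations
x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
n = len(x)
p = sum(x) / n  # the mean of a binary variable is the proportion: 0.7

# Same computation as for interval variables, but with the n divisor
variance = sum((value - p) ** 2 for value in x) / n

# For binary data this equals p * (1 - p)
assert abs(variance - p * (1 - p)) < 1e-12
print(p, variance)
```

With p = 0.7, the variance is 0.7 × 0.3 = 0.21, obtained directly from the proportion without any further correction.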