rel="nofollow" href="#ulink_4e7f0509-010c-5a6e-97ff-1b7ec9c2677f">Figure 1.14.
Figure 1.14 Relationship between a proportion and its variance.
1.11 Properties of Means and Variances
Means and variances have some interesting properties that deserve mention. Knowledge of these properties will be very helpful when analyzing data, and will be required several times in the following sections. Fortunately, they are all intuitive and easy to understand.
With a computer, we generated random numbers between 0 and 1, representing observations from a continuous attribute with uniform distribution, which we will call variable A. This attribute is called a random variable because it can take any value from a set of possible distinct values, each with a given probability. In this case, variable A can take any value from the set of real numbers between 0 and 1, all with equal probability. Hence the probability distribution of variable A is called the uniform distribution.
A second variable, called variable B, with uniform distribution but with values between 0 and 2, was also generated. The distributions of the observed values of those variables are shown in the graphs of Figure 1.15. That type of graph is called a histogram, the name given to the graph of a continuous variable where the values are grouped in bins that are plotted adjacent to each other. Let us now see what happens to the mean and variance when we perform arithmetic operations on a random variable.
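As a sketch of how such data could be simulated, the following uses only Python's standard library; the seed and sample size are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducibility
n = 100_000

# Variable A: uniform on [0, 1); variable B: uniform on [0, 2)
a = [random.random() for _ in range(n)]
b = [random.uniform(0, 2) for _ in range(n)]

# For a uniform variable on [lo, hi], the theoretical mean is
# (lo + hi) / 2 and the theoretical variance is (hi - lo)**2 / 12.
print(statistics.mean(a), statistics.variance(a))  # near 0.5 and 1/12
print(statistics.mean(b), statistics.variance(b))  # near 1.0 and 4/12
```

Plotting `a` and `b` as histograms would reproduce the flat shapes in Figure 1.15.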
When a constant amount is added to, or subtracted from, the values of a random variable, the mean will, respectively, increase or decrease by that amount but the variance will not change. This is illustrated in Figure 1.16 (left graph), which shows the distribution of variable A plus 2. This result is obvious, because, as all values are increased (or decreased) by the same amount, the mean will also increase (or decrease) by that amount and the distance of each value from the mean will thus remain the same, keeping the variance unchanged.
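This property is easy to verify by simulation. A minimal sketch, with an arbitrary seed and sample size:

```python
import random
import statistics

random.seed(1)  # arbitrary seed
a = [random.random() for _ in range(100_000)]  # uniform on [0, 1)
shifted = [x + 2 for x in a]                   # add the constant 2 to every value

# The mean increases by the constant; the variance does not change,
# because every value keeps its distance from the (shifted) mean.
mean_shift = statistics.mean(shifted) - statistics.mean(a)
var_change = statistics.variance(shifted) - statistics.variance(a)
print(mean_shift)  # close to 2.0
print(var_change)  # close to 0.0 (differs only by floating-point rounding)
```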
Figure 1.15 Two random variables with uniform distribution.
Figure 1.16 Properties of means and variances.
When the values of a random variable are multiplied, or divided, by a constant amount, the mean will be, respectively, multiplied or divided by that amount, and the variance will be multiplied or divided by the square of that amount. The standard deviation, therefore, will be multiplied or divided by the same amount as the mean. Figure 1.16 (middle graph) shows the distribution of A multiplied by 2. As an example, consider the attribute height with mean 1.7 m and standard deviation 0.6 m. If we want to convert the heights to centimeters, we multiply all values by 100. The mean will of course now be 170 cm and the standard deviation 60 cm. Thus, the mean was multiplied by 100 and the standard deviation also by 100 (and, therefore, the variance was multiplied by 100²).
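The meters-to-centimeters conversion can be checked on a small hypothetical sample (the numbers below are illustrative, not those in the text):

```python
import statistics

# A small hypothetical sample of heights, in meters
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]
heights_cm = [h * 100 for h in heights_m]  # multiply every value by 100

# The mean and standard deviation scale by 100;
# the variance scales by 100**2.
print(statistics.mean(heights_m), statistics.mean(heights_cm))
print(statistics.stdev(heights_m), statistics.stdev(heights_cm))
print(statistics.variance(heights_m), statistics.variance(heights_cm))
```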
When observations from two independent random variables are added or subtracted, the mean of the resulting variable will be, respectively, the sum or the difference of the means of the two variables. In both cases, however, the variance of the new variable will be the sum of the variances of the two variables. The right graph in Figure 1.16 shows the result of adding variables A and B. The first result is easy to understand, but the second is not so evident, so we will try to show it with an example.
Suppose we have two sets of paper strips of varying length. We take one strip from each set and glue them together at their ends. When we have glued all the pairs of strips, we will end up with strips whose lengths are more variable than in either original set. This is because, in some cases, we added long strips to long strips, making them much longer than average, and short strips to short strips, making them much shorter than average. Therefore, the variation in strip length increased. Now, if instead of adding the two strips of paper we cut a variable amount from each strip, we will at times make large cuts in short strips and small cuts in long strips, again increasing the variation.
Note that this result will not hold if the variables are not independent, that is, if they are correlated. Taking the example above, if we decided to always make large cuts in long strips and small cuts in short strips, we would end up with a smaller variance. If we did it the other way around, the final variance would be much larger than the sum of the two variances.
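Both the independent and the correlated cases can be sketched with a simulation: pairing the values at random reproduces the additive rule, while deliberately pairing long with long (or long with short) mimics the correlated strips. The seed and sample size are arbitrary:

```python
import random
import statistics

random.seed(7)  # arbitrary seed
n = 100_000
a = [random.random() for _ in range(n)]       # uniform on [0, 1)
b = [random.uniform(0, 2) for _ in range(n)]  # uniform on [0, 2)

# Independent pairing: the variance of the sum is (close to)
# the sum of the variances.
s = [x + y for x, y in zip(a, b)]
print(statistics.variance(a) + statistics.variance(b))  # near 5/12
print(statistics.variance(s))                           # also near 5/12

# Correlated pairings, as with the strips of paper: pairing long with
# long inflates the variance; pairing long with short shrinks it.
high_with_high = [x + y for x, y in zip(sorted(a), sorted(b))]
high_with_low = [x + y for x, y in zip(sorted(a), sorted(b, reverse=True))]
print(statistics.variance(high_with_high))  # well above 5/12
print(statistics.variance(high_with_low))   # well below 5/12
```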
Figure 1.17 summarizes the properties of means and variances just described. Means and variances are represented by the Greek letters μ and σ², respectively.
Figure 1.17 Table of mean and variance properties.
1.12 Descriptive Statistics
Measures of central tendency, location, and dispersion may be used to describe a collection of data. These measures are called descriptive statistics and are used to summarize observations on ordinal and interval attributes. Binary attributes are described by the mean; however, since the variance of a binary attribute is completely determined by its mean, it is not customary to present variances of binary attributes, and the usual practice is to present counts instead.
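The reason the variance of a binary attribute adds no information is that, for values coded 0/1 with a proportion p of ones, the population variance is p(1 − p). A minimal check on a hypothetical sample:

```python
import statistics

# Hypothetical binary attribute: 30 positives out of 100 observations
x = [1] * 30 + [0] * 70
p = statistics.mean(x)  # the mean of a 0/1 attribute is the proportion of ones

# The population variance of a binary attribute is p * (1 - p),
# so reporting it adds nothing beyond the mean.
print(statistics.pvariance(x))  # 0.21
print(p * (1 - p))              # also 0.21, up to rounding
```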
Descriptive statistics are used to summarize the observations on a sample and must not, by themselves, be used to infer quantities in the population. From what has been said, it is clear that the first thing to do when evaluating the results of a research study is, therefore, to summarize the data. To do that, we must first identify the scale of measurement of each attribute in the dataset, and then decide which method is best for summarizing the data.
One simple method is the tabulation of the data, whereby, for each study variable, we make a list of all the distinct values found in the dataset and, next to each one, write down the number of times it occurred. This is called the absolute frequency of each value. To improve readability, it is customary to also write down the number of occurrences of each value as a percentage of the total number of values, the relative frequency.
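Such a tabulation can be sketched in a few lines of Python; the data below are a hypothetical sample of a nominal attribute, used only for illustration:

```python
from collections import Counter

# Hypothetical observations on a nominal attribute (blood type)
blood_types = ["A", "O", "B", "O", "A", "O", "AB", "A", "O", "B"]

counts = Counter(blood_types)  # absolute frequency of each value
total = len(blood_types)

for value, absolute in counts.most_common():
    relative = 100 * absolute / total  # relative frequency, as a percentage
    print(f"{value}\t{absolute}\t{relative:.0f}%")
```

Each row of the output corresponds to one row of a table such as the one in Figure 1.18: the value, its absolute frequency, and its relative frequency.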
Figure 1.18 Tabulation of nominal data.