of variables measured on each unit consists of two or more variables, the data set is called a multivariate data set, and a multivariate data set consisting of only two variables is called a bivariate data set. In a multivariate data set, there is usually one variable that is of primary interest to the research question and that is believed to be explained by some of the other variables measured in the study. The variable of primary interest is called the response variable, and the variables believed to cause changes in the response are called explanatory variables or predictor variables. The explanatory variables are often referred to as the input variables, and the response variable is often referred to as the output variable. Furthermore, in a statistical model, the response variable is the variable being modeled; the explanatory variables are the inputs to the model that are believed to cause or explain differences in the response variable. For example, in studying the survival of melanoma patients, the response variable might be Survival Time, which is expected to be influenced by the explanatory variables Age, Gender, Clark’s Stage, and Tumor Size. In this case, a model relating Survival Time to these four explanatory variables might be investigated in the research study.
A multivariate data set often consists of a mixture of qualitative and quantitative variables. For example, in a biomedical study, several variables that are commonly measured are a subject’s age, race, gender, height, and weight. When data have been collected, the multivariate data set is generally stored in a spreadsheet with the columns containing the data on each variable and the rows of the spreadsheet containing the observations on each subject in the study.
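The spreadsheet layout described above can be sketched in a few lines of Python. This is only an illustration: the three records, the column names (`subject_id`, `height_cm`, `weight_kg`), and the set `QUANTITATIVE` are all hypothetical and were invented for the example, not taken from any study.

```python
import csv
import io

# Hypothetical records, invented purely for illustration (not study data).
# Each row is one subject; each column is one variable.
RAW = """subject_id,age,gender,height_cm,weight_kg
1,34,F,162.0,58.5
2,51,M,178.5,83.0
3,27,F,170.2,64.1
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Assumed classification of the columns: these are quantitative,
# everything else (here only gender) is treated as qualitative.
QUANTITATIVE = {"age", "height_cm", "weight_kg"}

# subject_id only labels the unit, so it is not counted as a variable.
variables = [name for name in rows[0] if name != "subject_id"]
kinds = {
    name: "quantitative" if name in QUANTITATIVE else "qualitative"
    for name in variables
}

# Store quantitative values as numbers rather than text.
for row in rows:
    for name in QUANTITATIVE:
        row[name] = float(row[name])

n_units = len(rows)
n_variables = len(variables)
is_multivariate = n_variables >= 2  # two or more variables per unit

print(n_units, "units,", n_variables, "variables")
print(kinds)
```

Keeping one row per subject and one column per variable makes it straightforward to select a single variable (a column) or a single observation (a row) later in an analysis.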
In studying the response variable, it is often the case that subpopulations determined by particular values of the explanatory variables will be important in answering the research questions. In this case, it is critical that the data set include a variable identifying the subpopulation to which each unit belongs. For example, in the National Health and Nutrition Examination Survey (NHANES) study, the distribution of the weight of female children was studied. The response variable in this study was weight, and some of the explanatory variables measured were height, age, and gender. The result of this part of the NHANES study was a distribution of the weights of females over a certain range of ages. The resulting distributions were summarized in the chart given in Figure 2.2, which shows the weight ranges for females at several different ages.
Figure 2.2 Weight-by-age chart for girls in the NHANES study.
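The role of a subpopulation identifier can be illustrated with a minimal sketch in which age plays the part of the identifying variable and weight is the response. The records below are hypothetical values invented for the example; they are not NHANES data, and the helper `summarize_by` is an illustrative name rather than part of any library.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records invented for illustration; NOT NHANES data.
records = [
    {"age": 2, "weight_kg": 12.1},
    {"age": 2, "weight_kg": 11.4},
    {"age": 2, "weight_kg": 13.0},
    {"age": 5, "weight_kg": 18.2},
    {"age": 5, "weight_kg": 17.5},
    {"age": 5, "weight_kg": 19.9},
]


def summarize_by(records, key, response):
    """Group the response values by the subpopulation identifier `key`
    and summarize each subpopulation separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[response])
    return {
        label: {
            "n": len(values),
            "min": min(values),
            "median": median(values),
            "max": max(values),
        }
        for label, values in sorted(groups.items())
    }


summary = summarize_by(records, "age", "weight_kg")
for age, stats in summary.items():
    print(age, stats)
```

Without the `age` column there would be no way to separate the two groups, which is why the identifying variable must be recorded for every unit.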
Example 2.8
In the article “The validity of self-reported weight in US adults: a population based cross-sectional study” published in BMC Public Health (Villanueva, 2001), the author reported the results of a study on the validity of self-reported weight. The data set used in the study was a multivariate data set whose response variable was the difference between the self-reported weight and the actual weight of an individual. The explanatory variables in this study were gender, age, race–ethnicity, highest educational attainment, level of activity, and the individual’s perception of his or her current weight.
2.2 Population Distributions and Parameters
For a well-defined population of units and a variable, say X, the collection of values of X obtained by measuring every unit in the target population is the population associated with the variable X. When multiple variables are recorded, each variable generates its own population. Furthermore, since a variable may take on many different values, an important question concerning the population of values of the variable is “How can the population of values of a variable be described or summarized?” Two approaches can be used to describe the population of values of a variable: (1) describe explicitly how the variable is distributed over its values, or (2) describe a set of characteristics that summarize the distribution of the values in the population.
2.2.1 Distributions
A statistical analysis of a population is centered on how the values of a variable are distributed. The distribution of a variable or population is an explicit description of how the values of the variable are distributed, often expressed in terms of percentages. The distribution of a variable is also called a probability distribution because it describes the probability that each of the possible values of the variable will occur. Moreover, the distribution of a variable is often presented in a table or chart, or modeled with a mathematical equation that explicitly determines the percentage of the population taking on each possible value of the variable. The total percentage in a probability distribution is 100%. The distribution of a qualitative or a discrete variable is generally displayed in a bar chart or in a table, and the distribution of a continuous variable is generally displayed in a graph or represented by a mathematical function.
Example 2.9
The four basic classifications of blood type are O, A, B, and AB. The distribution of blood type, according to the American Red Cross, is given in Table 2.1, and a bar chart representing this distribution is shown in Figure 2.3. Based on the information in Table 2.1, 45% of Americans have type O blood, 40% have type A, 11% have type B, and 4% have type AB blood.
Figure 2.3 A bar chart of the distribution of blood types in the United States.
Table 2.1 The Distribution of Blood Type According to the American Red Cross
Blood Type | Percentage |
---|---|
O | 45% |
A | 40% |
B | 11% |
AB | 4% |
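As a quick check, the distribution in Table 2.1 can be written down and verified to total 100%. The sketch below uses only the percentages in the table; the names `blood_type_pct` and `percent_in` are illustrative choices made for this example.

```python
# Percentages copied from Table 2.1.
blood_type_pct = {"O": 45, "A": 40, "B": 11, "AB": 4}

# The percentages in a probability distribution must total 100%.
total = sum(blood_type_pct.values())
assert total == 100


def percent_in(categories):
    """Percentage of the population falling in any of the given categories."""
    return sum(blood_type_pct[c] for c in categories)


# A text version of the bar chart: one '#' per percentage point.
for blood_type, pct in blood_type_pct.items():
    print(f"{blood_type:>2} {'#' * pct} {pct}%")

print("Type A or B:", percent_in(["A", "B"]), "%")
```

Because the categories do not overlap, the percentage for a group of blood types is simply the sum of the individual percentages.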
Another method of classifying blood types is to represent blood type by both type and Rh factor. A bivariate distribution of blood type for the variables type and Rh factor is given in Table 2.2 and shown in the bar chart in Figure 2.4.
Figure 2.4 A bar chart of the distribution of blood types and Rh factor in the United States.
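A bivariate distribution can be collapsed back to the distribution of a single variable by summing over the other variable. The sketch below does this for the type O and type A rows of Table 2.2 only, and compares the result with the corresponding entries of Table 2.1; the variable names are illustrative choices made for this example.

```python
# Type O and type A rows of Table 2.2, keyed by (type, Rh factor).
# Only these two rows are used in this sketch.
joint_pct = {
    ("O", "+"): 38,
    ("O", "-"): 7,
    ("A", "+"): 34,
    ("A", "-"): 6,
}

# Corresponding entries from Table 2.1.
table_2_1 = {"O": 45, "A": 40}

# Sum over Rh factor to recover the percentage for each blood type.
marginal_pct = {}
for (blood_type, rh_factor), pct in joint_pct.items():
    marginal_pct[blood_type] = marginal_pct.get(blood_type, 0) + pct

for blood_type, pct in marginal_pct.items():
    print(blood_type, pct, "matches Table 2.1:", pct == table_2_1[blood_type])
```

For these two rows the sums over Rh factor, 38% + 7% = 45% and 34% + 6% = 40%, agree with the percentages for types O and A in Table 2.1.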
Table 2.2 The Distribution of Blood Types with Rh Factor
Type | Rh+ | Rh− |
---|---|---|
O | 38% | 7% |
A | 34% | 6% |
B |