Richard J. Rossi

Applied Biostatistics for the Health Sciences



of variables measured on each unit consists of two or more variables, the data set is called a multivariate data set, and a multivariate data set consisting of only two variables is called a bivariate data set. In a multivariate data set, there is usually one variable that is of primary interest to the research question and that is believed to be explained by some of the other variables measured in the study. The variable of primary interest is called the response variable, and the variables believed to cause changes in the response are called explanatory variables or predictor variables. The explanatory variables are often referred to as the input variables and the response variable as the output variable. In a statistical model, the response variable is the variable being modeled, and the explanatory variables are the inputs to the model that are believed to cause or explain differences in the response variable. For example, in studying the survival of melanoma patients, the response variable might be Survival Time, which is expected to be influenced by the explanatory variables Age, Gender, Clark’s Stage, and Tumor Size. In this case, the research study might investigate a model relating Survival Time to these four explanatory variables.

      A multivariate data set often consists of a mixture of qualitative and quantitative variables. For example, in a biomedical study, several variables that are commonly measured are a subject’s age, race, gender, height, and weight. When data have been collected, the multivariate data set is generally stored in a spreadsheet with the columns containing the data on each variable and the rows of the spreadsheet containing the observations on each subject in the study.
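The row-and-column layout described above can be sketched in a few lines of Python. This is only an illustration: the three subjects and all of their values are invented, the units (centimeters, kilograms) are an assumption, and the helper names `variable_kinds` and `column` are hypothetical. Each dictionary plays the role of one spreadsheet row (one subject), and each key plays the role of one column (one variable).

```python
# Hypothetical records: every value below is invented purely for illustration.
# Each dict is one row (a subject); each key is one column (a variable).
subjects = [
    {"age": 34, "race": "White", "gender": "F", "height_cm": 165.0, "weight_kg": 61.5},
    {"age": 51, "race": "Black", "gender": "M", "height_cm": 178.0, "weight_kg": 84.0},
    {"age": 27, "race": "Asian", "gender": "F", "height_cm": 158.5, "weight_kg": 52.3},
]


def variable_kinds(rows):
    """Label a variable quantitative if all of its values are numbers, else qualitative."""
    kinds = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        kinds[name] = "quantitative" if numeric else "qualitative"
    return kinds


def column(rows, name):
    """Return all observations on one variable, i.e. one spreadsheet column."""
    return [row[name] for row in rows]


kinds = variable_kinds(subjects)
ages = column(subjects, "age")
```

Classifying by value type is a simplification. A qualitative variable that has been coded numerically (for example, race recorded as 1, 2, 3) would be mislabeled as quantitative, so in practice the variable type should come from the study's codebook rather than from the stored values.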

      Figure 2.2 Weight-by-age chart for girls in the NHANES study.

       Example 2.8

      In the article “The validity of self-reported weight in US adults: a population based cross-sectional study” published in BMC Public Health (Villanueva, 2001), the author reported the results of a study on the validity of self-reported weight. The data set used in the study was a multivariate data set whose response variable was the difference between an individual’s self-reported weight and actual weight. The explanatory variables in this study were gender, age, race–ethnicity, highest educational attainment, level of activity, and perception of the individual’s current weight.
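As a rough sketch of how such a response variable is formed, the snippet below computes the difference between self-reported and actual weight for two subjects. The subjects, their values, the field names, and the use of kilograms are all assumptions made for illustration. They are not data from the study, and only two of the six explanatory variables are shown.

```python
# Hypothetical subjects: values and field names are invented for illustration
# and are NOT taken from the study. Weights are assumed to be in kilograms.
records = [
    {"gender": "F", "age": 42, "self_reported_kg": 68.0, "measured_kg": 70.5},
    {"gender": "M", "age": 35, "self_reported_kg": 82.0, "measured_kg": 81.0},
]

EXPLANATORY = ("gender", "age")  # a subset of the explanatory variables


def reporting_error(record):
    """Response variable: self-reported weight minus actual (measured) weight."""
    return record["self_reported_kg"] - record["measured_kg"]


# One response value per subject, plus the matching explanatory values.
responses = [reporting_error(r) for r in records]
explanatory = [{name: r[name] for name in EXPLANATORY} for r in records]
```

With this sign convention, a negative response means the subject reported a weight lower than the measured one, and a positive response means the reported weight was higher.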

      2.2 Population Distributions and Parameters

      2.2.1 Distributions

       Example 2.9

      Figure 2.3 A bar chart of the distribution of blood types in the United States.

Blood Type   Percentage
O            45%
A            40%
B            11%
AB           4%
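The blood type percentages above can be checked with a short Python sketch. It uses only the four percentages from the table. The variable names are illustrative, and the check simply confirms that the percentages account for the whole population.

```python
# Percentages of each blood type in the United States, as given in the table.
blood_type_pct = {"O": 45, "A": 40, "B": 11, "AB": 4}

# The categories are exhaustive and non-overlapping, so the percentages
# should add up to 100.
total = sum(blood_type_pct.values())

# The same distribution expressed as proportions between 0 and 1.
proportions = {btype: pct / 100 for btype, pct in blood_type_pct.items()}

# The most common blood type is the category with the largest percentage.
most_common = max(blood_type_pct, key=blood_type_pct.get)
```

Storing the distribution as a mapping from category to percentage mirrors the bar chart: each key corresponds to one bar and each value to its height.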

      Figure 2.4 A bar chart of the distribution of blood types and Rh factor in the United States.



        Rh Factor
Type    +      −
O       38%    7%
A       34%    6%
B