Daniel J. Denis

Applied Univariate, Bivariate, and Multivariate Statistics



as we can assume multivariate normality, we have some idea of how such linear combinations will be distributed.

      As an example of how matrices will be used to develop more complete and general models, consider the multivariate general linear model in matrix form:

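\[
\begin{bmatrix} y_{i=1} \\ y_{i=2} \\ \vdots \\ y_{i=n} \end{bmatrix}
=
\begin{bmatrix} 1 & x_{i=1} \\ 1 & x_{i=2} \\ \vdots & \vdots \\ 1 & x_{i=n} \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]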

      where y_{i=1} to y_{i=n} are observed measurements on some dependent variable, X is the model matrix containing a constant of 1 in the first column to represent the common intercept term (i.e., “common” implying there is one intercept that represents all observations in our data), x_{i=1} to x_{i=n} are observed values on a predictor variable, α is the fixed intercept parameter, β is the slope parameter, which we also assume to be fixed, and ε is a vector of errors ε_1 to ε_n (we use ε here instead of E).
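      As a minimal sketch of the single-response model in matrix form, the R code below builds the model matrix X with a leading column of 1s and solves the least-squares normal equations directly. The choice of iris variables (Sepal.Length as the response, Petal.Length as the predictor) and the use of ordinary least squares are assumptions made purely for illustration; the text above does not commit to a data set or an estimation method.

```r
# Illustrative data (an assumption): one response and one predictor from iris
y <- iris$Sepal.Length
x <- iris$Petal.Length

# Model matrix: a column of 1s for the common intercept, then the predictor
X <- cbind(1, x)

# Least-squares estimates of (alpha, beta): solve (X'X) b = X'y
b <- solve(crossprod(X), crossprod(X, y))
alpha_hat <- b[1]
beta_hat  <- b[2]

# Errors implied by the fitted model: e = y - Xb
e <- y - X %*% b

# Cross-check against R's built-in linear model fit
fit <- lm(y ~ x)
print(cbind(matrix = c(alpha_hat, beta_hat), lm = coef(fit)))
```

      The two columns printed at the end should agree, since lm() fits the same model by least squares.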

      Suppose now we want to add a second response variable. Because of the generality of (2.7), this can be easily accommodated:

equation
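      One way to write the two-response version, extending the single-response notation above, is sketched here. The second subscript j indexing the response variable, and the separate intercept and slope for each response, are notational assumptions of this sketch and may differ from the original display.

```latex
\[
\begin{bmatrix}
y_{i=1,j=1} & y_{i=1,j=2} \\
y_{i=2,j=1} & y_{i=2,j=2} \\
\vdots & \vdots \\
y_{i=n,j=1} & y_{i=n,j=2}
\end{bmatrix}
=
\begin{bmatrix} 1 & x_{i=1} \\ 1 & x_{i=2} \\ \vdots & \vdots \\ 1 & x_{i=n} \end{bmatrix}
\begin{bmatrix} \alpha_1 & \alpha_2 \\ \beta_1 & \beta_2 \end{bmatrix}
+
\begin{bmatrix}
\varepsilon_{1,1} & \varepsilon_{1,2} \\
\varepsilon_{2,1} & \varepsilon_{2,2} \\
\vdots & \vdots \\
\varepsilon_{n,1} & \varepsilon_{n,2}
\end{bmatrix}
\]
```

      The model matrix X is unchanged; only the response, parameter, and error arrays gain a second column.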

      Performing inferential tests to help draw conclusions about population parameters is useful, but ultimately the findings of a statistical analysis should make their way into a graph or other visualization. Data visualization is a field in itself, and with the advent of modern computing power, possibilities exist today that could only be dreamt of in the past. Simple visualizations such as histograms, boxplots, and scatterplots can be useful in depicting findings but also in helping to verify assumptions that underlie the statistical model one is using. For example, since many tests of normality and equality of variances (and covariances) are relatively sensitive to the types of data to which they are applied, researchers will often generate simple plots in order to detect potential gross violations of such assumptions. We feature such techniques throughout the book.
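      As a rough illustration of using simple plots to screen for gross assumption violations, the sketch below draws a histogram, a normal Q-Q plot, and side-by-side boxplots in base R. The iris variables serve only as convenient stand-in data (an assumption of this example); which plots are appropriate depends on the model being fit.

```r
# Stand-in data (an assumption): sepal length by species from iris
x <- iris$Sepal.Length
g <- iris$Species

op <- par(mfrow = c(1, 3))   # three panels side by side

# Histogram: overall shape, skew, possible multimodality
hist(x, main = "Histogram", xlab = "Sepal length")

# Normal Q-Q plot: points far from the line suggest non-normality
qq <- qqnorm(x, main = "Normal Q-Q plot")
qqline(x)

# Boxplots by group: very different box heights hint at unequal variances
boxplot(x ~ g, main = "By group", xlab = "Species", ylab = "Sepal length")

par(op)                      # restore the previous graphics settings

# Numeric companion to the boxplots: sample variance within each group
group_var <- tapply(x, g, var)
print(round(group_var, 3))
```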

      For graphical displays meant to communicate findings (rather than test assumptions), Friendly (2000) puts the field into context:

      Designing good graphics is surely an art, but as surely, it is one that ought to be informed by science ... In this view, an effective graphical display, like good writing, requires an understanding of its purpose – what aspects of the data are to be communicated to the viewer. In writing, we communicate most effectively when we know our audience and tailor the message appropriately. (p. 8)

      In high‐dimensional space, the challenge of graphical approaches is to summarize data into lower dimensions, while still retaining most of the information in the original data. We feature some such plots in later chapters. For a thorough account of data visualization, see datavis.ca (Friendly, 2020). For sophisticated graphics using R, consult Wickham (2009).

      For now, it is useful to briefly review some basic plots with which the reader is likely already familiar.

      

      2.27.1 Box‐and‐Whisker Plots

      The boxplot was a contribution of John Tukey (1977) in the spirit of what is called exploratory data analysis, or “EDA,” which encouraged scientists to spend more of their energy on descriptive techniques instead of focusing exclusively on confirmatory statistical tests. Boxplots of parent heights from Galton's data appear below:

      The boxplot provides what is generally known as a five‐number summary of a distribution. We can obtain most of the numbers we need with the summary function in R:

      > summary(parent)
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        64.00   67.50   68.50   68.31   69.50   73.00

      Recall that the median is the point in the ordered data that divides the data set into two equal parts. The location of the median is computed as (n + 1)/2. In Galton's data, there are 928 observations, and so the median is located at the 464.5th point (i.e., (928 + 1)/2) in the ordered data set, that is, midway between the 464th and 465th ordered values. For parent, this value is equal to 68.50. The first and third quartiles represent the 25th and 75th percentiles and are 67.50 and 69.50, respectively. We can also compute the range as

      > range(parent)
      [1] 64 73
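      The median-location rule and the five-number summary can be checked by hand on a small example. The sketch below uses a hypothetical eight-value sample (not Galton's data) so that the location (n + 1)/2 is fractional, as it is for the 928 observations above.

```r
# Hypothetical small sample with an even number of observations
x <- c(64, 66, 67, 68, 69, 70, 71, 73)
n <- length(x)
sorted <- sort(x)

# Location of the median in the ordered data: (n + 1) / 2
loc <- (n + 1) / 2            # 4.5 here, i.e. halfway between ranks 4 and 5

# A fractional location means averaging the two neighbouring ordered values
med_by_hand <- mean(sorted[c(floor(loc), ceiling(loc))])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(loc)
print(med_by_hand)
print(five)
print(range(x))               # minimum and maximum only
```

      Note that summary() reports these five quantities plus the mean, which is why its output has six entries rather than five.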

      We can also generate boxplots by category. Throughout the book, we use Fisher's iris data (Fisher, 1936) in which flower characteristics such as sepal and petal length are categorized by species of flower. We plot sepal length by species:

      > library(lattice)
      > attach(iris)
      > bwplot(Sepal.Length ~ Species)

      [Figure: boxplots of Sepal.Length for the setosa, versicolor, and virginica species.]
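      As an alternative sketch, the same grouped boxplot can be produced without attach() by passing the data frame through the data argument, which avoids leaving iris columns on the search path. The axis labels and the per-species summary table are additions for illustration, not part of the original example.

```r
library(lattice)  # recommended package that ships with standard R installations

# Same plot as above, but naming the data frame explicitly instead of attach()
p <- bwplot(Sepal.Length ~ Species, data = iris,
            xlab = "Species", ylab = "Sepal length")
print(p)          # lattice plots are drawn when the object is printed

# Base-graphics equivalent
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length")

# Numbers behind each box: Tukey's five-number summary per species
by_species <- sapply(split(iris$Sepal.Length, iris$Species), fivenum)
rownames(by_species) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(by_species)
```

      Each column of by_species gives the values a boxplot is built from for one species, which makes it easy to compare the plotted boxes against the underlying numbers.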