Daniel J. Denis

Applied Univariate, Bivariate, and Multivariate Statistics



section A and Mary is a student in section B. On the final exam for the course, John receives a raw score of 80 out of 100 (i.e., 80%). Mary, on the other hand, earns a score of 70 out of 100 (i.e., 70%). At first glance, it may appear that John was more successful on his final exam. However, raw scores considered in isolation do not allow us to compare each student's performance relative to his or her class distribution. For instance, if the mean in John's class was equal to 85% with a standard deviation of 2, then John's z‐score is:

z = \frac{x - \bar{x}}{s} = \frac{80 - 85}{2} = -2.5

      Suppose that in Mary's class, the mean was equal to 65% also with a standard deviation of 2. Mary's z‐score is thus:

z = \frac{x - \bar{x}}{s} = \frac{70 - 65}{2} = 2.5
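
      Both standardizations are simple arithmetic and can be verified directly at the R console:

      > (80 - 85)/2
      [1] -2.5
      > (70 - 65)/2
      [1] 2.5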

      As we can see, relative to their particular distributions, Mary greatly outperformed John. Assuming each distribution is approximately normal, the density under the curve for a normal distribution with mean 0 and standard deviation of 1 at a score of 2.5 is:

      > dnorm(2.5, 0, 1)
      [1] 0.017528

      where dnorm() returns the density under the curve at 2.5; that is, the value of f(x) at the score of 2.5. What, then, is the probability of scoring 2.5 or greater? To get the cumulative density up to 2.5, we compute:

      > pnorm(2.5, 0, 1)
      [1] 0.9937903

      The probability of scoring 2.5 or greater is then the complement of this cumulative probability:

      > 1 - pnorm(2.5, 0, 1)
      [1] 0.006209665

[Figure: shaded area under the standard normal distribution up to a z‐score of 2.5 standard deviations.]

      We can see then that the percentage of students scoring higher than Mary is approximately 0.6% (i.e., the proportion multiplied by 100). What proportion of students scored better than John in his class? Recall that his z‐score was equal to −2.5. Because the normal distribution is symmetric, the area lying below −2.5 is the same as that lying above 2.5. This means that approximately 99.38% of students scored higher than John. Hence, we see that Mary drastically outperformed her colleague when we consider their scores relative to their classes. Be careful to note that in drawing these conclusions, we had to assume each score (John's and Mary's) came from a normal distribution. The mere fact that we transformed their raw scores to z‐scores in no way normalizes their raw distributions. Standardization standardizes, but it does not normalize.
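
      This symmetry is easy to confirm in R; the lower‐tail area at −2.5 matches the upper‐tail area at 2.5 computed above:

      > pnorm(-2.5, 0, 1)
      [1] 0.006209665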

      One can also easily verify that approximately 68% of cases in a normal distribution lie within −1 and +1 standard deviations, while approximately 95% of cases lie within −2 and +2 standard deviations.
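
      Both figures are easily confirmed in R by subtracting the appropriate cumulative densities:

      > pnorm(1) - pnorm(-1)
      [1] 0.6826895
      > pnorm(2) - pnorm(-2)
      [1] 0.9544997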

      2.1.1 Plotting Normal Distributions

      We can plot normal densities in R by generating a sequence of values between a lower and upper limit on the abscissa and evaluating the density at each value:

      > x <- seq(from = -3, to = +3, length.out = 100)
      > plot(x, dnorm(x))

[Figure: plot of the standard normal density.]
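
      A minor variation plots the density as a connected line with labeled axes, producing a smoother curve (the axis labels here are our own choice):

      > plot(x, dnorm(x), type = "l", xlab = "z", ylab = "f(x)")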

      Distributions (and densities) of a single variable typically go by the name of univariate distributions to distinguish them from distributions of two (bivariate) or more variables (multivariate).

      > install.packages("HistData")
      > library(HistData)
      > attach(Galton)
      > Galton
         parent child
      1    70.5  61.7
      2    68.5  61.7
      3    65.5  61.7
      4    64.5  61.7
      5    64.0  61.7
      6    67.5  62.2
      7    67.5  62.2
      8    67.5  62.2
      9    66.5  62.2
      10   66.5  62.2

      We first install the package using the install.packages function. The library statement then loads HistData into R's search path. From there, we attach the Galton data to insert the object (a dataframe) into the search list, so that its variables (parent, child) can be referenced by name. We generate a histogram of parent height:

      > hist(parent, main = "Histogram of Parent Height")

[Figure: histogram of parent height, with height on the abscissa and frequency on the ordinate.]
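
      As an aside, the same histogram can be produced without attaching the dataframe, which avoids masking other objects on the search path; a minimal equivalent:

      > hist(Galton$parent, main = "Histogram of Parent Height")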

      2.1.2 Binomial Distributions

      The binomial distribution is given by:

p(r) = \binom{n}{r} p^r (1 - p)^{n - r}

      where,

       p(r) is the probability of observing r occurrences out of n possible occurrences,2

       p is the probability of a “success” on any given trial, and

       1 − p is the probability of a failure on any given trial, often simply referred to by “q” (i.e., q = 1 − p).
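
      For example, the probability of observing exactly r = 2 heads in n = 5 flips of a fair coin (p = 0.5) is \binom{5}{2}(0.5)^2(0.5)^3 = 0.3125, which R computes directly via dbinom:

      > dbinom(2, 5, 0.5)
      [1] 0.3125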

      The binomial setting provides an ideal context to demonstrate the essentials of hypothesis‐testing logic, as we will soon see. In a binomial setting, the following conditions must hold:

       The variable under study must be binary in nature. That is, the outcome of the experiment can result in only one of two categories; the outcome categories are mutually exclusive. For instance, the flipping of a coin has this characteristic, because the coin can come up either “head” or “tail” and nothing else (yes, we are ruling out the possibility that it lands on its side, and I think it is safe to do so).

       The probability of a “success” on each trial remains constant (or stationary) from trial to trial. For example, if the probability of head is equal to 0.5 on our first flip, we assume it is also equal to 0.5 on the second, third, fourth flips, and so on.

       Each trial is independent of each other trial. That is, the fact that we get a head on our first