get the following:
We interpret (2.3) as follows:
Over all possible samples, the probability is 0.95 that the range between the lower and upper limits of the interval will include the true mean, μ.
It is very important to note that, in the above statement, μ is not the random variable. What is random is the sample on which the interval is computed. That is, the probability statement is not about μ but rather about samples. The population mean μ is assumed to be fixed. The 95% confidence interval tells us that if we continued to sample repeatedly, and on each sample computed a confidence interval, then in the long run 95% of these intervals would include the true parameter.
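The repeated-sampling interpretation can be checked with a small simulation. The following is a minimal sketch, not part of the original text: the population values (mean 100, standard deviation 15), the sample size, and the function name `coverage` are all illustrative assumptions, and the population standard deviation is treated as known.

```python
import random
import statistics

def coverage(mu=100.0, sigma=15.0, n=25, reps=10_000, z=1.96, seed=0):
    """Fraction of z-based intervals that contain the fixed population mean mu."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5  # standard error of the mean (sigma assumed known)
    hits = 0
    for _ in range(reps):
        # Only the sample changes from one repetition to the next; mu stays fixed.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.fmean(sample)
        if ybar - z * se < mu < ybar + z * se:
            hits += 1
    return hits / reps

print(coverage())
```

The proportion printed should be close to 0.95, and it describes the behavior of the intervals across samples, not the behavior of μ.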
The 99% confidence interval for the mean is likewise given by:
Notice that the only difference between (2.3) and (2.4) is the choice of different critical values on either side of μ (i.e., 1.96 for the 95% interval and 2.58 for the 99% interval).
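For reference, the 95% and 99% intervals discussed above can be written in their standard z-based form as below. This is a sketch in assumed notation, since the symbols are not fixed in this passage: here \bar{y} denotes the sample mean, \sigma the known population standard deviation, and n the sample size.

```latex
\begin{align*}
P\!\left(\bar{y} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{y} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) &= 0.95 \\
P\!\left(\bar{y} - 2.58\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{y} + 2.58\,\frac{\sigma}{\sqrt{n}}\right) &= 0.99
\end{align*}
```

Only the multiplier of the standard error changes between the two lines, which is why the 99% interval is the wider of the two.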
Though of course not very useful, a 100% confidence interval, if constructed, would be defined as:
−∞ < μ < +∞
If you think about it carefully, the 100% confidence interval should make perfect sense. If you would like to be 100% “sure” that the interval will cover the true population mean, then you have to extend your limits to negative and positive infinity; otherwise, you could not be fully confident. Likewise, at the other extreme, a 0% interval would simply have both limits equal to the sample mean, so that the interval collapses to a single point.
That is, if you want to have zero confidence in guessing the location of the population mean, μ, then guess the sample mean itself, since a point estimate from a continuous distribution equals μ exactly with probability zero.
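The link between the confidence level and the width of the interval can be made concrete by computing the critical value for several levels. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `critical_value` and the particular levels shown are illustrative assumptions.

```python
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided standard normal critical value for a confidence level in [0, 1)."""
    if not 0.0 <= confidence < 1.0:
        raise ValueError("confidence must be in [0, 1); a 100% interval is unbounded")
    # Leave (1 - confidence) / 2 in each tail, so the upper cutoff sits at this quantile.
    return NormalDist().inv_cdf((1.0 + confidence) / 2.0)

for level in (0.0, 0.50, 0.95, 0.99, 0.9999):
    print(f"{level:.4f} -> z = {critical_value(level):.3f}")
```

At a level of 0 the critical value is 0, so the interval shrinks to the sample mean; as the level approaches 1 the critical value grows without bound, which corresponds to the infinite limits of the 100% interval.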
2.14 MAXIMUM LIKELIHOOD
When we speak of likelihood, we mean the probability of some sample data or set of observations conditional on some hypothesized parameter or set of parameters (Everitt, 2002). Conditional probability statements such as p(D/H0) can very generally be considered simple examples of likelihoods, where the set of parameters may, in this case, be simply μ and σ². A likelihood function expresses this quantity as a function of the parameter, with the observed data held fixed (see Fox, 2016).
When we speak of maximum‐likelihood estimation, we mean the process of maximizing a likelihood subject to certain parameter conditions. As a simple example, suppose we obtain 8 heads on 10 flips of a presumably fair coin. Our null hypothesis was that the coin is fair, meaning that the probability of heads is p(H) = 0.5. However, our actual obtained result of 8 heads on 10 flips would suggest the true probability of heads to be closer to p(H) = 0.8. Writing θ for the probability of heads, we ask the question:
Which value of θ makes the observed result most likely?
If we only had two choices of θ to select from, 0.5 and 0.8, our answer would have to be 0.8, since this value of the parameter θ makes the sample result of 8 heads out of 10 flips more likely than does 0.5. That is the essence of how maximum‐likelihood estimation works (see Hays, 1994, for a similar example). ML is the most common method of estimating parameters in many models, including factor analysis, path analysis, and structural equation models to be discussed later in the book. There are very good reasons why mathematical statisticians generally approve of maximum likelihood. We summarize some of the most favorable properties of ML estimators.
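The coin example can be worked through numerically. The sketch below assumes the flips are independent, so that the number of heads follows a binomial distribution; the function name `binomial_likelihood` and the grid of candidate values are illustrative choices rather than anything prescribed by the text.

```python
from math import comb

def binomial_likelihood(theta, heads=8, flips=10):
    """Probability of the observed number of heads if P(heads) = theta."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

# Compare the two candidate values from the text.
for theta in (0.5, 0.8):
    print(f"theta = {theta}: likelihood = {binomial_likelihood(theta):.4f}")

# Grid search over 0.00, 0.01, ..., 1.00 for the value that maximizes the likelihood.
grid = [i / 100 for i in range(101)]
theta_hat = max(grid, key=binomial_likelihood)
print("grid maximizer:", theta_hat)
```

The likelihood is about 0.044 at θ = 0.5 and about 0.302 at θ = 0.8, and the grid search also lands on 0.8, the observed proportion of heads.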
Firstly, ML estimators are asymptotically unbiased, which means that bias essentially vanishes as sample size increases without bound (Bollen, 1989). Secondly, ML estimators are consistent and asymptotically efficient, the latter meaning that the estimator has a small asymptotic variance relative to many other estimators. Thirdly, ML estimators are asymptotically normally distributed, meaning that as sample size grows, the sampling distribution of the estimator approaches a normal distribution. Finally, ML estimators possess the invariance property: the ML estimator of a function of a parameter is that function applied to the ML estimator of the parameter (see Casella and Berger, 2002, for details).
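Consistency can be illustrated by simulation. The sketch below is an assumption-laden example rather than a general demonstration: it uses the coin setting with a true θ of 0.8, for which the ML estimate is the sample proportion of heads, and the function name `mle_spread` is hypothetical. Because the sample proportion is unbiased at every sample size, this shows only the shrinking spread of the estimator, not vanishing bias.

```python
import random
import statistics

def mle_spread(theta=0.8, n=10, reps=1000, seed=1):
    """Mean and standard deviation of the ML estimate over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        heads = sum(rng.random() < theta for _ in range(n))
        estimates.append(heads / n)  # ML estimate of theta: the sample proportion
    return statistics.fmean(estimates), statistics.stdev(estimates)

for n in (10, 100, 1000):
    mean, sd = mle_spread(n=n)
    print(f"n = {n:4d}: mean estimate = {mean:.3f}, sd = {sd:.3f}")
```

As n grows, the estimates concentrate more tightly around the true value of θ.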
2.15 AKAIKE'S INFORMATION CRITERION
A measure of model fit commonly used in comparing models that uses the log‐likelihood is Akaike's information criterion, or AIC (Sakamoto, Ishiguro, and Kitagawa, 1986). This is one statistic of the kind generally referred to as penalized likelihood statistics (another is the Bayesian information criterion, or BIC). AIC is defined as:
AIC = −2Lm + 2m
where Lm is the maximized log‐likelihood and m is the number of parameters in the given model. Lower values of AIC indicate a better‐fitting model than do larger values. Recall that, in general, the more parameters a model has, the better it will fit. For example, a model that has a unique parameter for each data point would fit perfectly. This is the so‐called saturated model. AIC jointly considers both the goodness of fit and the number of parameters required to obtain the given fit, essentially “penalizing” for increasing the number of parameters unless they contribute to model fit. Adding one or more parameters to a model may cause −2Lm to decrease (which is a good thing substantively), but if the parameters are not worthwhile, this will be offset by an increase in 2m.
The Bayesian information criterion, or BIC (Schwarz, 1978), is defined as −2Lm + m log(N), where m, as before, is the number of parameters in the model, N is the total number of observations used to fit the model, and log denotes the natural logarithm. Lower values of BIC are also desirable when comparing models. BIC penalizes model complexity more heavily than AIC whenever log(N) exceeds 2, that is, for N of 8 or more. For a comparison of AIC and BIC, see Burnham and Anderson (2011).
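Both criteria are simple to compute once the maximized log‐likelihood is in hand. The sketch below reuses the coin example (8 heads in 10 flips) for a hypothetical comparison that is not in the original text: a fair‐coin model with θ fixed at 0.5 (m = 0) against a model with θ estimated as 0.8 (m = 1). Treating each flip as one observation, so that N = 10, is likewise an assumption, as are the function names.

```python
from math import comb, log

def log_likelihood(theta, heads=8, flips=10):
    """Binomial log-likelihood of the coin data for a given P(heads) = theta."""
    return log(comb(flips, heads)) + heads * log(theta) + (flips - heads) * log(1 - theta)

def aic(loglik, m):
    """AIC = -2 * Lm + 2 * m."""
    return -2.0 * loglik + 2.0 * m

def bic(loglik, m, n):
    """BIC = -2 * Lm + m * log(N), using the natural logarithm."""
    return -2.0 * loglik + m * log(n)

# Hypothetical comparison: theta fixed at 0.5 (no free parameters)
# versus theta estimated from the data as 0.8 (one free parameter).
models = {
    "fixed theta = 0.5": (log_likelihood(0.5), 0),
    "estimated theta = 0.8": (log_likelihood(0.8), 1),
}
for name, (ll, m) in models.items():
    print(f"{name}: AIC = {aic(ll, m):.3f}, BIC = {bic(ll, m, n=10):.3f}")
```

In this small example the model with θ estimated has the lower AIC and the lower BIC, because the gain in log‐likelihood outweighs the penalty for one extra parameter. With N = 10, its BIC is larger than its AIC, reflecting the heavier penalty.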