observed indicator variables will be associated with these factors. This does not mean that we cannot use factor analysis in an exploratory way. Indeed, the entire focus of this text is on exploratory factor analysis. However, it does mean that we should have some sense for what the latent variable structure is likely to be. This translates into having a general sense for the number of factors that we are likely to find (e.g., somewhere between two and four), and how the observed variables would be expected to group together (e.g., items 1, 3, 5, and 8 should be measuring a common construct and thus should group together on a common factor). Without such a preexisting theory about the likely factor structure, we will not be able to ascertain when we have an acceptable factor solution and when we do not. Remember, we are using observed data to determine whether predictions from our factor model are accurate. This means that we need to have a sufficiently well-developed factor model so as to make predictions about what the results should look like. For example, what does theory say about the relationship between depression and sleep disturbance? It says that individuals suffering from depression will experience what for them are unusual sleep patterns. Thus, we would expect depressed individuals to indicate that they are indeed suffering from unusual sleep patterns. In short, having a well-constructed theory about the latent structure that we are expecting to find is crucial if we are to conduct the factor analysis properly and make good sense of the results that it provides to us.
Comparison of Exploratory and Confirmatory Factor Analysis
Factor analysis models, as a whole, exist on a continuum. At one extreme is the purely exploratory model, which incorporates no a priori information, such as the possible number of factors or how indicators are associated with factors. At the other extreme lies a purely confirmatory factor model in which the number of factors, as well as the way in which the observed indicators group onto these factors, is provided by the researcher. These modeling frameworks differ both conceptually and statistically. From a conceptual standpoint, exploratory models are used when the researcher has little or no prior information regarding the expected latent structure underlying a set of observed indicators. For example, if very little prior empirical work has been done with a set of indicators, or there is not much in the way of a theoretical framework for a factor model, then by necessity the researcher would need to engage in an exploratory investigation of the underlying factor structure. In other words, without prior information on which to base the factor analysis, the researcher cannot make any presuppositions regarding what the structure might look like, even with regard to the number of factors underlying the observed indicators. In other situations, there may be a strong theoretical basis upon which a hypothesized latent structure rests, such as when a scale has been developed using well-established theories. However, if very little prior empirical work exists exploring this structure, the researcher may not be able to use a more confirmatory approach and thus would rely on exploratory factor analysis (EFA) to examine several possible factor solutions, which might be limited in terms of the number of latent variables by the theoretical framework upon which the model is based. 
Conceptually, a confirmatory factor analysis (CFA) approach would be used when there is both a strong theoretical expectation regarding the expected factor structure and prior empirical evidence (usually in the form of multiple EFA studies) supporting this structure. In such cases, CFA is used to (a) ascertain how well the hypothesized latent variable model fits the observed data and (b) compare a small number of models with one another in order to identify the one that yields the best fit to the data.
From a statistical perspective, EFA and CFA differ in terms of the constraints that are placed upon the factor structure prior to estimation of the model parameters. With EFA there are few, if any, constraints placed on the model parameters. Observed indicators are typically allowed to have nonzero relationships with all of the factors, and the number of factors is not constrained to be a particular number. Thus, the entire EFA enterprise is concerned with answering the question of how many factors underlie an observed set of indicators, and what structure the relationship between factors and indicators takes. In contrast, CFA models are highly constrained. In most instances, each indicator variable is allowed to be associated with only a single factor, with relationships to all other factors set to 0. Furthermore, the specific factor upon which an indicator is allowed to load is predetermined by the researcher. This is why having a strong theory and prior empirical evidence is crucial to the successful fitting of CFA models. Without such strong prior information, the researcher may have difficulty in properly defining the latent structure, potentially creating a situation in which an improper model is fit to the data. The primary difficulty with fitting an incorrect model is that it may appear to fit the data reasonably well, based on statistical indices, and yet may not be the correct model. Without earlier exploration of the likely latent structure, however, it would not be possible for the researcher to know this. CFA does have the advantage of being a fully determined model, which is not the case with EFA, as we have already discussed. Thus, it is possible to come to more definitive determinations regarding which of several CFA models provides the best fit to a set of data because they can be compared directly using familiar tools such as statistical hypothesis testing. 
Conversely, determining the optimal EFA model for a set of data is often not a straightforward or clear process, as we will see later in the book.
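The statistical contrast described above can be made concrete by writing out which factor loadings each approach leaves free to be estimated. The following is a minimal sketch, not an estimation routine: the function names, the six-indicator, two-factor layout, and the assignment of indicators to factors are all hypothetical and chosen only for illustration. It also ignores the additional identification constraints that software imposes when a model is actually fit.

```python
import numpy as np


def efa_pattern(n_indicators, n_factors):
    """EFA-style pattern: every indicator may load on every factor.

    True marks a loading that is freely estimated.
    """
    return np.ones((n_indicators, n_factors), dtype=bool)


def cfa_pattern(assignment, n_factors):
    """CFA-style pattern: each indicator loads on exactly one factor.

    assignment[i] is the (researcher-specified) factor index for
    indicator i; all other loadings are fixed to 0 (False here).
    """
    pattern = np.zeros((len(assignment), n_factors), dtype=bool)
    pattern[np.arange(len(assignment)), assignment] = True
    return pattern


# Hypothetical example: six indicators, two factors.
# Indicators 1-3 are assigned to factor 0, indicators 4-6 to factor 1.
assignment = [0, 0, 0, 1, 1, 1]

efa_free = efa_pattern(6, 2)
cfa_free = cfa_pattern(assignment, 2)

print("Free loadings under EFA:", int(efa_free.sum()))
print("Free loadings under CFA:", int(cfa_free.sum()))
print(cfa_free.astype(int))
```

In this sketch the EFA pattern leaves all 12 loadings free, whereas the CFA pattern frees only the 6 loadings named by the researcher and fixes the rest to 0. Specifying the assignment is exactly the step that requires strong theory and prior empirical evidence.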
In summary, EFA and CFA sit at opposite ends of a modeling continuum, separated by the amount of prior information and theory available to the researcher. The more such information there is and the stronger the theory, the more appropriate CFA will be. Conversely, the less prior evidence is available and the weaker the theory about the latent structure, the more appropriate EFA will be. Finally, researchers should take care not to use both EFA and CFA on the same set of data. In cases where a small set of CFA models does not fit a set of sample data well, a researcher might use EFA in order to investigate potential alternative models. This is certainly an acceptable approach; however, the same data used to investigate these EFA-based alternatives should not then be used to fit an additional CFA model in order to validate what the exploration has suggested might be optimal. In such cases, the researcher would need to obtain a new sample to which the CFA would be fit in order to investigate the plausibility of the EFA findings. If the same data were used for both analyses, the CFA would likely yield a spuriously good fit, given that the factor structure being tested was itself derived from those data through the EFA.
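The recommendation above is to collect a new sample for the CFA. When that is not feasible and the original sample is large, one commonly used substitute is to split the sample at random before any modeling, reserving one portion for EFA and the other for CFA. The sketch below illustrates only that substitute, not the new-sample approach itself; the function name, the 50/50 split, the seed values, and the simulated 400-by-8 data matrix are hypothetical.

```python
import numpy as np


def split_sample(data, exploratory_fraction=0.5, seed=0):
    """Randomly partition rows into an exploratory and a confirmatory half.

    Returns the two data subsets and the row indices used for each,
    so that no respondent contributes to both analyses.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    shuffled = rng.permutation(n)
    n_explore = int(round(n * exploratory_fraction))
    explore_idx = shuffled[:n_explore]
    confirm_idx = shuffled[n_explore:]
    return data[explore_idx], data[confirm_idx], explore_idx, confirm_idx


# Hypothetical data: 400 respondents, 8 observed indicators.
rng = np.random.default_rng(42)
responses = rng.normal(size=(400, 8))

efa_data, cfa_data, efa_idx, cfa_idx = split_sample(responses)

print("Rows reserved for EFA:", efa_data.shape[0])
print("Rows reserved for CFA:", cfa_data.shape[0])
```

The split must be made once, before the EFA is run, and the confirmatory portion should remain untouched until a candidate model has been chosen. Each half is smaller than the full sample, so this approach is sensible only when both halves are large enough to support their respective analyses.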
EFA and Other Multivariate Data Reduction Techniques
Factor analysis belongs to a larger family of statistical procedures known collectively as data reduction techniques. In general, all data reduction techniques are designed to take a larger set of observed variables and combine them in some way so as to yield a smaller set of variables. The differences among these methods lie in the criteria used to combine the initial set of variables. We discuss the criterion for EFA at some length in Chapter 3, namely the effort to find a factor structure that yields accurate estimates of the covariance matrix of the observed variables using a smaller set of latent variables. Another statistical analysis with the goal of reducing the number of observed variables to a smaller number of unobserved variates is discriminant analysis (DA). DA is used in situations where a researcher has two or more groups in the sample (e.g., treatment and control groups) and would like to gain insights into how the groups differ on a set of measured variables. Rather than examining each variable separately, it is more statistically efficient to consider them collectively, and DA can be used to reduce the number of variables that must be considered. As with EFA, DA uses a heuristic to combine the observed variables with one another into a smaller set of latent variables, which are called discriminant functions. In this case, the algorithm finds the combination(s) that maximize the group mean difference on these functions. The number of possible discriminant functions is the minimum of p and J-1, where p is the number of observed variables and J is the number of groups. The functions resulting from DA are orthogonal to one another, meaning that they reflect different aspects of the shared group variance associated with the observed variables. The discriminant functions in DA can be expressed as follows:
$D_{fi} = w_{f1} x_{1i} + w_{f2} x_{2i} + \cdots + w_{fp} x_{pi}$ (Equation 1.1)
where