predictors and the responses. Using these correlations, the factors are extracted in such a way that they not only explain variation in the X and Y variables, but they also relate the X variables to the Y variables.
As you might suspect, consideration of the entire 6 x 6 correlation matrix without regard to predictors and responses leads directly to PCA. As we have seen in Chapter 3, PCA also exploits the idea of projections to reduce dimensionality, and is often used as an exploratory technique prior to PLS.
Consideration of the 4 x 4 submatrix (the orange elements) leads to a technique called Principal Components Regression, or PCR (Hastie et al. 2001). Here, the dimensionality of the X space is reduced through PCA, and the resulting components are treated as new predictors for each response in Y using MLR. Fitting PCR in JMP requires a two-stage process (Analyze > Multivariate Methods > Principal Components, followed by Analyze > Fit Model). In many instances, though, PLS is a superior choice.
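The two stages of PCR can be sketched outside JMP. The following Python example is a minimal illustration, not the JMP implementation. The function name pcr_fit and the toy data are hypothetical, and it assumes NumPy is available, that there is a single response, and that the predictors are standardized so the PCA is based on their correlation matrix.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """PCR sketch: PCA on standardized X, then least squares on the scores."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = y.mean()
    Z = (X - x_mean) / x_std                      # standardized predictors
    # Rows of Vt are the principal directions (eigenvectors of the
    # correlation matrix of X), ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T                # shape (p, k)
    scores = Z @ loadings                         # shape (n, k)
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    coef = (loadings @ gamma) / x_std             # back on the original X scale
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Hypothetical toy data: 4 predictors and a single response.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=30)     # make two predictors correlated
y = 1.0 + X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

coef_2, intercept_2 = pcr_fit(X, y, n_components=2)   # reduced model
coef_4, intercept_4 = pcr_fit(X, y, n_components=4)   # keeps every component

# Ordinary least squares for comparison.
ols, *_ = np.linalg.lstsq(np.column_stack([np.ones(30), X]), y, rcond=None)
print(np.allclose(coef_4, ols[1:]), np.allclose(intercept_4, ols[0]))
```

When every component is retained, PCR reproduces the MLR fit. The dimension reduction comes only from dropping the later components. Note that the components are chosen without reference to Y, which is one way PCR differs from PLS.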
For completeness, we mention that consideration of the 2 x 2 submatrix associated with the Ys (the blue elements) along with the 4 x 4 submatrix associated with the Xs (the orange elements) leads to Maximum Redundancy Analysis, MRA (van den Wollenberg 1977). This technique is less widely used than PLS, PCA, and PCR.
Why Use PLS?
Consistent with the heritage of PLS, let’s consider a simulation of a simplified example from spectroscopy. In this situation, samples are measured in two ways: Typically, one is quick, inexpensive, and online; the other is slow, expensive, and offline, usually involving a skilled technician and some chemistry. The goal is to build a model that predicts well enough so that only the inexpensive method need be used on subsequent samples, acting as a surrogate for the expensive method.
The online measurement consists of a set of intensities measured at multiple wavelengths or frequencies. These measured intensities serve as the values of X for the sample at hand. To simplify the discussion, we assume that the technician only measures a single quantity, so that (as in our MLR example in Chapter 2) Y is a column vector with the same number of rows as we have samples.
To set up the simulation, run the script SpectralData.jsl by clicking on the correct link in the master journal. This opens a control panel, shown in Figure 4.2.
Figure 4.2: Control Panel for Spectral Data Simulation
Once you obtain the control panel, complete the following steps:
1. Set the Number of Peaks to 3 Peaks.
2. Set the Noise Level in Intensity Measurements slider to 0.02.
3. Leave the Noise in Model slider set to 0.00.
4. Click Run.
This produces a data table with 45 rows containing an ID column, a Response column, and 81 columns representing wavelengths, which are collected in the column group called Predictors. The data table also has a number of saved scripts.
Let’s run the first script, Stack Wavelengths. This script stacks the intensity values so that we can plot the individual spectra. In the data table that the script creates, run the script Individual Spectra. Figure 4.3 shows plots similar to those that you see.
Figure 4.3: Individual Spectra
Note that some samples display two peaks and some three. In fact, the very definition of what is or is not a peak can quickly be called into question with real data, and over the years spectroscopists and chemometricians have developed a plethora of techniques to pre-process spectral data in ways that are reflective of the specific technique and instrument used.
Now run the script Combined Spectra in the stacked data table. This script plots the spectra for all 45 samples against a single set of axes (Figure 4.4). You can click on an individual set of spectral readings in the plot to highlight its trace and the corresponding rows in the data table.
Figure 4.4: Combined Spectra
Our simulation captures the essence of the analysis challenge. We have 81 predictors and 45 rows. A common strategy in such situations is to attempt to extract significant features (such as peak heights, widths, and shapes) and to use this smaller set of features for subsequent modeling. However, in this case we have neither the desire nor the background knowledge to attempt this. Rather, we take the point of view that the intensities in the measured spectrum (the row within X), taken as a whole, provide a fingerprint for that row that we try to relate to the corresponding measured value in Y.
Let’s close the data table Stacked Data and return to the main data table Response and Intensities. Run the script Correlations Between Xs. This script creates a color map that shows the correlations between every pair of predictors using a blue to red color scheme (Figure 4.5). Note that the color scheme is given by the legend to the right of the plot. To see the numerical values of the correlations, click the red triangle next to Multivariate and select Correlations Multivariate.
Figure 4.5: Correlations for Predictors Shown in a Color Map
In the section “The Effect of Correlation among Predictors: A Simulation” in Chapter 2, we investigated the impact of varying the correlation between just two predictors. Here we have 81 predictors, one for each wavelength, resulting in 81*80/2 = 3,240 pairs of predictors. Figure 4.5 gives a pictorial representation of the correlations among all 3,240 pairs.
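The pair count can be checked with a few lines of Python. This is only a sanity check on the arithmetic; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

n_predictors = 81  # one predictor per wavelength

# Count unordered pairs of distinct predictors in two ways.
pairs_by_formula = n_predictors * (n_predictors - 1) // 2
pairs_by_enumeration = sum(1 for _ in combinations(range(n_predictors), 2))

print(pairs_by_formula, pairs_by_enumeration, comb(n_predictors, 2))
```

All three values agree at 3,240, which is the number of cells above the main diagonal of the 81 x 81 correlation matrix.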
The cells on the main diagonal are colored the most intense shade of red, because the correlation of a variable with itself is +1. Away from the diagonal, Figure 4.5 also shows three large blocks of red. These blocks are a consequence of the three peaks that you requested in the simulation: intensities at wavelengths that fall under the same peak tend to rise and fall together across samples. You can experiment by rerunning the simulation with a different number of peaks and other slider settings to see the impact on this correlation structure.
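To see how peaks can give rise to blocks of high correlation, here is a rough Python sketch. It does not reproduce SpectralData.jsl. The wavelength axis, the Gaussian peak shapes, the peak centers and width, and the independent random peak heights are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(-4.0, 4.0, 81)        # hypothetical wavelength axis
centers, width = np.array([-2.0, 0.0, 2.0]), 0.5

# Each sample gets its own random height for each of the three peaks.
heights = rng.uniform(0.5, 1.5, size=(45, 3))
shapes = np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / width) ** 2)  # (3, 81)
X = heights @ shapes + rng.normal(scale=0.02, size=(45, 81))

corr = np.corrcoef(X, rowvar=False)      # 81 x 81 correlation matrix

def at(value):
    """Index of the grid point closest to a wavelength value."""
    return int(np.argmin(np.abs(grid - value)))

same_peak = corr[at(-2.0), at(-1.8)]     # both wavelengths lie under the first peak
print(corr.shape, round(same_peak, 3))
```

In this sketch, two wavelengths under the same peak share the same random height, so their intensities are almost perfectly correlated. That shared variation is what shows up as a red block in a color map such as Figure 4.5.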
Next, in the data table, find and run the script MLR (Fit Model). This attempts to fit a multiple linear regression to Response, using all 81 columns as predictors. The report starts out with a long list of Singularity Details. This report, for our simulated data, is partially shown in Figure 4.6.
Figure 4.6: Partial List of Singularity Details for Multiple Linear Regression Analysis
Here, the X matrix has 81 + 1 = 82 columns (the 81 wavelengths plus a column for the intercept), but X and Y have only 45 rows. Because n < m (using our earlier notation), we should expect MLR to run into trouble. Note that the JMP Fit Model platform does produce some output, though it’s not particularly useful in this case. If you want more details about what JMP is doing here, select Help > Books > Fitting Linear Models and search for “Singularity Details”.
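The difficulty can be illustrated numerically. The Python sketch below uses random numbers as a hypothetical stand-in for the intensities (it is not the simulated table) and assumes NumPy is available. It shows that a 45 x 82 design matrix has rank at most 45, so the least squares coefficients are not uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
n_rows, n_wavelengths = 45, 81

# Hypothetical stand-in for the intensities; not the output of SpectralData.jsl.
intensities = rng.normal(size=(n_rows, n_wavelengths))
X = np.column_stack([np.ones(n_rows), intensities])   # intercept column -> 45 x 82
y = rng.normal(size=n_rows)              # a response made of pure noise

rank = np.linalg.matrix_rank(X)
print(X.shape, rank)

# lstsq returns the minimum-norm solution, one of infinitely many.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ beta, y))          # does the fit reproduce y exactly?
```

Because the rank equals the number of rows, the model can reproduce any response exactly, even pure noise. A fit of this kind says nothing about how well the model would predict new samples.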
Now run the script Partial Least Squares to see a partial least squares report. We cover the report details later on, but for now, notice that there is no mention of singularities. In fact, the Variable Importance Plot (Figure 4.7) assesses the contribution of each of the 81 wavelengths in modeling the response. Because higher Variable Importance for the Projection (VIP) values suggest higher influence, we conclude that wavelengths between about –3.0 and 1.0 have comparatively higher influence than the rest.
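For readers who want to see where VIP values come from, the following Python sketch fits a single-response PLS model with the NIPALS algorithm and computes a commonly used form of the VIP statistic. It is an illustration under stated assumptions, not a description of how JMP computes its report. The function name pls1_vip, the simulated spectra, the choice of two factors, and the rule that the response equals the height of the first peak are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def pls1_vip(X, y, n_components):
    """NIPALS PLS for one response; returns the VIP score of each predictor."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    n, p = X.shape
    W = np.zeros((p, n_components))      # unit-length weight vectors
    ss = np.zeros(n_components)          # variation in y explained per factor
    for a in range(n_components):
        w = Xk.T @ yk                    # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # scores for this factor
        tt = t @ t
        q = (yk @ t) / tt                # y loading
        Xk = Xk - np.outer(t, Xk.T @ t / tt)   # deflate X
        yk = yk - q * t                  # deflate y
        W[:, a], ss[a] = w, q**2 * tt
    return np.sqrt(p * (W**2 @ ss) / ss.sum())

# Hypothetical spectra: 45 samples, 81 wavelengths, three Gaussian peaks.
rng = np.random.default_rng(3)
grid = np.linspace(-4.0, 4.0, 81)
centers, width = np.array([-2.0, 0.0, 2.0]), 0.5
heights = rng.uniform(0.5, 1.5, size=(45, 3))
shapes = np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / width) ** 2)
X = heights @ shapes + rng.normal(scale=0.02, size=(45, 81))
y = heights[:, 0]                        # response driven by the first peak only

vip = pls1_vip(X, y, n_components=2)
top = grid[np.argmax(vip)]               # wavelength with the largest VIP
print(vip.shape, top)
```

The fit goes through even though there are more predictors than rows, because each factor is built from a covariance rather than from a matrix inverse. With this definition, the squared VIP values average to 1 across predictors, which is why values are usually judged relative to a threshold near 1.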
Figure 4.7: PLS Variable Importance Plot