Marie A. Gaudard

Discovering Partial Least Squares with JMP



The data table Spearheads.jmp also contains four scripts that help us perform the PLS analysis quickly. In the later chapters containing examples, we walk through the menu options that enable you to conduct such an analysis. For now, though, the scripts expedite the analysis, letting us focus on the concepts underlying a PLS analysis.

      The first script, Fit Model Launch Window, located in the upper left of the data table as shown in Figure 1.2, enables us to set up the analysis we want. From the red-triangle menu, shown in Figure 1.2, select Run Script. This script runs only if you are using JMP Pro, since it uses the Partial Least Squares personality of Fit Model. If you are using JMP, you can instead select Analyze > Multivariate Methods > Partial Least Squares from the JMP menu bar; you will be able to follow the text with minor modifications.

      Figure 1.2: Running the Script “Fit Model Launch Window”


      This script produces a populated Fit Model launch window (Figure 1.3). The column Tribe is entered as a response, Y, while the 10 columns representing metal composition measurements are entered as Model Effects. Note that the Personality is set to Partial Least Squares. In JMP Pro, you can access this launch window directly by selecting Analyze > Fit Model from the JMP menu bar.

      Below the Personality drop-down menu, shown in Figure 1.3, there are check boxes for Centering and Scaling. As mentioned in the previous section, centering and scaling all variables in a PLS analysis treats them equitably in the analysis. There is also a check box for Standardize X. This option, described in “The Standardize X Option” in Appendix 1, centers and scales columns that are involved in higher-order terms. JMP selects these three options by default.

      Figure 1.3: Populated Fit Model Launch Window

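The effect of the Centering and Scaling options can be sketched in a few lines of code. This is an illustrative sketch, not JMP's implementation: each column has its mean subtracted and is divided by its sample standard deviation, so that every variable enters the analysis on an equitable footing.

```python
# Sketch: center (subtract the mean) and scale (divide by the standard
# deviation) a single column of data, as the Centering and Scaling
# options do for each variable in a PLS analysis.
# Uses the sample standard deviation (divisor n - 1).

def center_and_scale(column):
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]
```

A centered and scaled column has mean 0 and standard deviation 1, regardless of its original units.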

      Clicking Run brings us to the Partial Least Squares Model Launch control panel (Figure 1.4). Here, we can make choices about how we would like to fit the model. Note that we can choose between two fitting algorithms, NIPALS and SIMPLS, which are discussed later. We accept the default settings. (To reproduce the exact analysis shown below, select Set Random Seed from the red triangle menu at the top of the report and enter 111.) Click Go. (You can, instead, run the script PLS Fit to see the report.)

      Figure 1.4: PLS Model Launch Control Panel

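To build some intuition for the NIPALS algorithm mentioned above, here is a minimal sketch of how its first factor is extracted for a single response (the PLS1 case). The function names are my own, and this omits the deflation and multi-factor machinery of JMP's full implementation: the weight vector is proportional to the covariance of each (centered and scaled) predictor with the response, and the scores are the projections of the rows onto that weight vector.

```python
# Minimal sketch of the first NIPALS factor for a single response y
# (PLS1), using plain Python lists. Assumes X's columns and y are
# already centered (and scaled). Illustrative only.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_nipals_factor(X, y):
    """X: list of rows; y: centered response values, one per row."""
    p = len(X[0])
    # Weight vector: covariance of each predictor column with y...
    w = [dot([row[j] for row in X], y) for j in range(p)]
    # ...normalized to unit length.
    norm = sum(wj * wj for wj in w) ** 0.5
    w = [wj / norm for wj in w]
    # Scores: projection of each row onto the weight vector.
    t = [dot(row, w) for row in X]
    return w, t
```

Subsequent factors are extracted the same way after removing (deflating) the part of X explained by the factors already found.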

      This appends three new report sections, as shown in Figure 1.5: Model Comparison Summary, KFold Cross Validation with K=7 and Method=NIPALS, and NIPALS Fit with 3 Factors. Later, we fully explain the various options and report contents, but for now we take the analysis on trust in order to quickly see this example in its entirety. As we discuss later, the Number of Factors is a key aspect of a PLS model. The report in Figure 1.5 shows 3 Factors, but your report might show a different number. This is because the Validation Method of KFold, set as a default in the JMP Pro Model Launch control panel, involves an element of randomness.

      Figure 1.5: Initial PLS Reports

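The element of randomness in KFold validation can be sketched as follows. This is an illustrative sketch, not JMP's implementation: rows are shuffled before being divided into K folds, so the folds, and hence the number of factors selected, can vary from run to run unless a seed is fixed.

```python
# Sketch of K-fold partitioning with a random shuffle. With no seed,
# repeated runs give different folds; fixing the seed (as Set Random
# Seed does in JMP) makes the partition reproducible.
import random

def kfold_indices(n_rows, k, seed=None):
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)                      # the source of run-to-run variation
    return [idx[i::k] for i in range(k)]  # deal rows into k folds
```

Every row lands in exactly one fold, and the same seed always reproduces the same folds.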

      Once you have built a model in JMP, you can save the prediction formula to the table containing the data that were analyzed. We do this for our PLS model. From the red-triangle menu for the NIPALS Fit with 3 Factors report, select Save Columns > Save Prediction Formula (Figure 1.6).

      Figure 1.6: Saving the Prediction Formula


      The saved formula column, Pred Formula Tribe, appears as the last column in the data table. Because we are actually saving a formula, we obtain predicted values for all 19 rows.

      To see how well our PLS model has performed, let’s simulate the arrival of new data using our test set. We would like to remove the Hide and Exclude row states from rows 10-19, and apply them to rows 1-9. You can do this by hand, or by running the script Toggle Hidden/Excluded Rows. To do this by hand, select Rows > Clear Row States, select rows 1-9, right-click in the highlighted area near the row numbers, and select Hide and Exclude. (In versions of JMP prior to JMP 11, select Exclude/Unexclude, and then right-click again and select Hide/Unhide.)

      Now run the script Predicted vs Actual Tribe. For each row, this plots the predicted score for tribal origin on the vertical axis against the actual tribe of origin on the horizontal axis (Figure 1.7).

      Figure 1.7: Predicted versus Actual Tribe for Test Data


      To produce this plot yourself, select Graph > Graph Builder. In the Variables panel, right-click on the modeling type icon to the left of Tribe and select Nominal. (This causes the value labels for Tribe to display.) Drag Tribe to the X area and Pred Formula Tribe to the Y area.

      Note that the predicted values are not exactly +1 or -1, so it makes sense to use a decision boundary (the dotted blue line at the value 0) to separate or classify the scores produced by our model into two groups. You can insert a decision boundary by double-clicking on the vertical axis. This opens the Y Axis Specification window. In the Reference Lines section near the bottom of the window, click Add to add a reference line at 0, and then enter the text Decision Boundary in the Label text box.

      The important finding conveyed by the graph is that our PLS model has performed admirably. The model has correctly classified all ten observations in the test set. All of the observations for “Tribe A” have predicted values below 0 and all those for “Tribe B” have predicted values above 0.
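The decision rule described above can be sketched as a one-line classifier. This is an illustrative sketch, not part of the JMP report: scores below the boundary at 0 are assigned to Tribe A, and scores above it to Tribe B, matching the pattern seen in Figure 1.7.

```python
# Sketch of the classification rule implied by the decision boundary
# at 0: predicted scores below 0 map to Tribe A, scores above 0 to
# Tribe B (the tribes are coded -1 and +1 in the analysis).

def classify(pred_score):
    return "Tribe A" if pred_score < 0 else "Tribe B"
```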

      Our model for the spearhead data was built using only nine spearheads, one less than the number of chemical measurements made. PLS provides an excellent classification model in this case.

      Before exploring PLS in more detail, let’s engage in a quick review of multiple linear regression. This is a common approach to modeling a single variable in Y using a collection of variables, X.
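The model form at the heart of multiple linear regression can be sketched in code. This is an illustrative sketch of the generic model form only; the coefficient values in the example are made up: the prediction is an intercept plus a weighted sum of the predictors.

```python
# Sketch of the multiple linear regression model form: a single
# response is predicted as an intercept b0 plus a weighted sum of
# the predictors, y_hat = b0 + b1*x1 + ... + bp*xp.

def mlr_predict(intercept, coefs, x):
    return intercept + sum(b * xj for b, xj in zip(coefs, x))
```

For example, with intercept 1.0 and coefficients (2.0, 3.0), the predictor values (1.0, 1.0) give a prediction of 6.0. How the coefficients are estimated from data is the subject of the next chapter.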

      2

      A Review of Multiple Linear Regression

       The Cars Example

       Estimating the Coefficients

       Underfitting and Overfitting: A Simulation

       The Effect of Correlation among Predictors: A Simulation

      Consider Figure 2.1, which displays the data table CarsSmall.jmp. You can open this table by clicking on the correct link in the master journal. This data table consists of six rows, corresponding to specific cars of different types, and six variables from the JMP sample data table Cars.jmp.

      Figure 2.1: Data Table CarsSmall.jmp


      The first column, Automobile, is an identifier column. Our goal is to predict miles per gallon (MPG) from the other descriptive variables. So, in this context, the variable MPG is the single variable in Y,