TABLE 3.2 The Testing Set

| Observation Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| x | 8.5 | 9.4 | 5.4 | 11.7 | 6.5 | 10.3 | 12.7 | 11.0 | 15.4 | 2.8 |
| y | 49.4 | 43.0 | 19.3 | 56.4 | 28.3 | 53.7 | 58.1 | 28.7 | 80.7 | 13.6 |
Use the training set to obtain the line of best fit, and the testing set to assess its suitability.
First, read the training set
x_train <- c(11.8, 10.8, 8.6, ..., 8.9)
y_train <- c(31.3, 59.9, 27.6, ..., 38.5)
and the testing set
x_test <- c(8.5, 9.4, 5.4, ..., 2.8)
y_test <- c(49.4, 43.0, 19.3, ..., 13.6)
Then plot the training set to establish whether a linear trend exists.
plot(x_train, y_train, main = "Training Data", font.main = 1)
gives Fig. 3.17.
Figure 3.17 The Scatter of the Training Data
Since Fig. 3.17 shows a linear trend, we obtain the line of best fit with
abline(lm(y_train ~ x_train))
to get Fig. 3.18.
Figure 3.18 The Line of Best Fit for the Training Data
Next, we use the testing data to decide on the suitability of the line.
The coefficients of the line are obtained in R with
lm(formula = y_train ~ x_train)

Coefficients:
(Intercept)      x_train
    -0.9764       4.9959
The estimated values are obtained with
y_est <- -0.9764 + 4.9959 * x_test
round(y_est, 1)
which gives
y_est 41.5 46.0 26.0 57.5 31.5 50.5 62.5 54.0 76.0 13.0
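In practice the coefficients need not be retyped by hand: coef() extracts them from the fitted model, and predict() computes the estimates directly. The sketch below reproduces the estimates from the testing x-values in Table 3.2 using the coefficients reported above; the vector name b is introduced here only for illustration.

```r
# Testing x-values from Table 3.2
x_test <- c(8.5, 9.4, 5.4, 11.7, 6.5, 10.3, 12.7, 11.0, 15.4, 2.8)

# Coefficients reported by lm() above: (Intercept), slope
b <- c(-0.9764, 4.9959)
y_est <- b[1] + b[2] * x_test
round(y_est, 1)
# 41.5 46.0 26.0 57.5 31.5 50.5 62.5 54.0 76.0 13.0

# With the fitted model itself, the same can be done without retyping:
# fit <- lm(y_train ~ x_train)
# y_est <- predict(fit, newdata = data.frame(x_train = x_test))
```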
We now compare these estimated values with the observed values.
y_test 49.4 43.0 19.3 56.4 28.3 53.7 58.1 28.7 80.7 13.6

Then

plot(x_test, y_test, main = "Testing Data", font.main = 1)
abline(lm(y_train ~ x_train))  # plot the line of best fit
segments(x_test, y_test, x_test, y_est)
gives Fig. 3.19. Here, segments plots vertical lines between (x_test, y_test) and (x_test, y_est).
Figure 3.19 shows the observed values together with the line of best fit; the vertical segments mark the differences between the observed and estimated values.
Figure 3.19 Differences Between Observed and Estimated
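The vertical segments in Fig. 3.19 can also be summarized with a single number. One common choice, not used in the text itself, is the mean squared prediction error; a minimal sketch using the testing set from Table 3.2 and the coefficients reported above:

```r
# Testing set from Table 3.2
x_test <- c(8.5, 9.4, 5.4, 11.7, 6.5, 10.3, 12.7, 11.0, 15.4, 2.8)
y_test <- c(49.4, 43.0, 19.3, 56.4, 28.3, 53.7, 58.1, 28.7, 80.7, 13.6)

# Estimates from the line of best fit
y_est <- -0.9764 + 4.9959 * x_test

# Mean squared prediction error: the average squared vertical distance
# between the observed and estimated values
mse <- mean((y_test - y_est)^2)
mse
```

A large contribution to this average from a single observation, as with observation 8 here, is visible in Fig. 3.19 as an unusually long vertical segment.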
The line of best fit is the simplest regression model; it uses just one independent variable for prediction. In real-life situations, more independent variables or other models, such as a quadratic, may be required, but for supervised learning the approach is always the same:
1. Determine if there is a relationship between the dependent variable and the independent variables;
2. Fit the model to the training data;
3. Test the suitability of the model by predicting the y-values in the testing data from the model and comparing the observed and predicted y-values.
The predictions from these models assume that the trend, based on the data analyzed, continues to hold. Should the trend change, for example when a house-pricing model is estimated from data gathered before an economic crash, the predictions will not be valid.
Regression analysis is just one of the many techniques from the area of Probability and Statistics that machine learning invokes. We will encounter more in later chapters. Should you wish to go into this topic more deeply, we recommend the book A First Course in Machine Learning by Rogers and Girolami (2015).
3.7 GRAPHICAL DISPLAYS VERSUS SUMMARY STATISTICS
Before we finish, let us look at a simple, classic