Chapter 4 Training / Test Data

Shmueli (2010), "To Explain or to Predict?": https://www.stat.berkeley.edu/~aldous/157/Papers/shmueli.pdf

Predictive power is assessed using metrics computed from a holdout set or by using cross-validation (Stone, 1974; Geisser, 1975).

"Testing the procedure on the data that gave it birth is almost certain to overestimate performance" (Mosteller and Tukey, 1977).

Let’s extend this idea of training and test splits. Remember, our goal is to build a robust model that better estimates out-of-sample error. We can do this by resampling our data set in a way that lets us learn from the data without fitting it so closely that the model fails to generalize.

We can take a single data set and partition / split it into a number of train / test subsets. We did that in the earlier section, but only once. If we repeat the split a number of times, we can train and evaluate our model more effectively.

What would the RMSE look like if we created, say, K subsets of the data frame, held out each of the K subsets in turn, built a model on the combined remaining subsets, and then tested the model on the holdout? We could then average the RMSE values to get an idea of their variation. The sequence of steps is as follows:

Split the data frame into K groups
For each subset:
   Treat the subset as a "holdout" or test data set
   Combine the remaining subsets into a training data set
   Fit a model on the combined training set
   Evaluate the model on the holdout test set
   Save the evaluation score (e.g. RMSE)

Summarize the evaluation scores (e.g. mean of RMSE)

This is called K-Fold Cross Validation. A general view of the procedure is depicted in the following figure, which shows a 5-fold scenario.

Now, here is the general idea in illustrated form relative to mtcars. Assume we want 4 folds. We would divide the data frame into 4 folds of 8 records each. The first model would be built using Fold 1 as the holdout / test data set, after first combining Folds 2, 3, and 4 into a training set.

The second iteration would then take Fold 2 as the holdout / test data frame and combine Folds 1, 3, and 4 into a training data frame.
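The fold-building code isn't shown here, but a minimal sketch using base R's `sample()` and `split()` looks like the following (the seed is arbitrary, chosen only for reproducibility; the original seed is not shown):

```r
set.seed(123)   # arbitrary seed; not from the text

k <- 4
# Shuffle the 32 row indices of mtcars and deal them into k folds of 8 each
folds <- split(sample(1:nrow(mtcars)), rep(1:k, each = nrow(mtcars) / k))

length(unlist(folds))   # total indices across all folds: 32, i.e. nrow(mtcars)
```

Because the seed differs, the specific indices in each fold will not match the listing below, but the structure (four folds of eight unique row indices) will.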

## [1] 32

Check out the folds to get a better understanding of what is going on. We generated a list that has 4 elements, each of which holds an 8-element vector corresponding to indices of records in the mtcars data frame.

## $`1`
## [1] 22 25  4 23 20 26  8 12
## 
## $`2`
## [1] 18 24 15  9 21 14  3  1
## 
## $`3`
## [1] 29 16 11  2 27 28  6 19
## 
## $`4`
## [1] 13 17 31 32  7 10  5 30

Again, each list element has the indices of eight unique observations from the data frame. We have four folds of eight elements each, for a total of 32 numbers corresponding to row numbers in the mtcars data frame.

##                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Hornet 4 Drive   21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## AMC Javelin      15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Toyota Corolla   33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9        27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Merc 240D        24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 450SE       16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3

4.1 Cross Fold Validation

To implement the cross validation, we will create a processing loop that executes once for each of the 4 folds. During each pass through the loop, we build a model using data combined from all folds except the fold corresponding to the current loop number (e.g. 1, 2, ..., 4).

Once the model is built, we test it on the fold corresponding to the current loop number.

So now we can create some lists to contain the models we make, along with the associated predictions, errors, and computed RMSE. We can then inspect any of the intermediate results after the fact to validate our work or look more closely at any specific result.
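A minimal sketch of that loop follows. The formula `mpg ~ wt` is an assumption for illustration (this section does not state which model is fitted), and the seed is arbitrary:

```r
set.seed(123)   # arbitrary seed; not from the text

k <- 4
folds <- split(sample(1:nrow(mtcars)), rep(1:k, each = nrow(mtcars) / k))

# Lists to hold the per-fold models, predictions, errors, and RMSE
models <- predictions <- errors <- rmse <- vector("list", k)

for (i in 1:k) {
  holdout <- mtcars[folds[[i]], ]    # the current fold is the test set
  train   <- mtcars[-folds[[i]], ]   # all other folds form the training set

  models[[i]]      <- lm(mpg ~ wt, data = train)   # assumed formula
  predictions[[i]] <- predict(models[[i]], newdata = holdout)
  errors[[i]]      <- holdout$mpg - predictions[[i]]
  rmse[[i]]        <- sqrt(mean(errors[[i]]^2))
}

mean(unlist(rmse))   # average RMSE across the folds
sd(unlist(rmse))     # and its spread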

The above list structures allow us to drill down into any aspect of the models and predictions we have made for each of the 4 folds. More importantly, we can see how well the model works against each of the individual holdout / test data sets. In the end, we just want to look at the average RMSE across the folds. This gives us clues as to how well the model might perform against new data.

## [1] 3.046228
## [1] 0.7071975

4.2 Create A Function To Automate Things

Since we have gone to the trouble of creating a loop structure to process the folds, we could easily turn it into a function that automates splitting the data frame across an arbitrary number of folds, just to get an idea of how the RMSE looks for different numbers of folds.

We could even have our function accommodate different formulas if we wanted, but we won’t focus on that right now. You will soon discover that the caret package does these kinds of things for you, but we aren’t quite there yet.
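A sketch of such a function follows. The name `cv_rmse`, the default formula `mpg ~ wt`, and the use of `cut()` to allow fold counts that don't divide the row count evenly are all assumptions for illustration:

```r
# Hypothetical helper: k-fold cross validation returning one RMSE per fold
cv_rmse <- function(df = mtcars, form = mpg ~ wt, k = 4) {
  idx   <- sample(nrow(df))                                  # shuffle row indices
  folds <- split(idx, cut(seq_along(idx), k, labels = FALSE))
  resp  <- as.character(form[[2]])                           # response variable name
  sapply(folds, function(test_idx) {
    fit   <- lm(form, data = df[-test_idx, ])                # train on all other folds
    preds <- predict(fit, newdata = df[test_idx, ])
    sqrt(mean((df[test_idx, resp] - preds)^2))               # RMSE on the holdout
  })
}

set.seed(123)          # arbitrary seed; not from the text
mean(cv_rmse(k = 8))   # average RMSE across 8 folds
```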

Let’s look at the average RMSE across 8 folds.

## [1] 3.047164

## [1] 0.9967001

4.3 Repeated Cross Validation

Since we already have an existing function, we can up the ante by repeating the cross validation. This will provide more data on how the RMSE might be distributed across multiple runs, each of which performs K-Fold Cross Validation. This example repeats a 4-fold cross validation 20 times.
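One way to sketch the repetition is with `replicate()`. The helper below is a self-contained stand-in for the function built in the previous section, and `mpg ~ wt` remains an assumed formula:

```r
set.seed(123)   # arbitrary seed; not from the text

# One 4-fold CV run, returning one RMSE per fold (mpg ~ wt is an assumed formula)
one_cv <- function(k = 4) {
  folds <- split(sample(1:nrow(mtcars)), rep(1:k, each = nrow(mtcars) / k))
  sapply(folds, function(ti) {
    fit <- lm(mpg ~ wt, data = mtcars[-ti, ])
    sqrt(mean((mtcars$mpg[ti] - predict(fit, newdata = mtcars[ti, ]))^2))
  })
}

res <- replicate(20, one_cv(4))   # 4 x 20 matrix: rows = folds, columns = repetitions
summary(as.data.frame(res))       # column-wise summary, one column per repetition
```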

## [1] 4.565091 3.895989 2.514073 2.757429 2.287045 2.680096 1.663681 4.013908

## [1] 2.993835

##        V1              V2              V3              V4       
##  Min.   :1.508   Min.   :1.560   Min.   :1.633   Min.   :1.981  
##  1st Qu.:2.027   1st Qu.:2.535   1st Qu.:2.146   1st Qu.:2.360  
##  Median :2.428   Median :2.736   Median :2.498   Median :2.691  
##  Mean   :3.046   Mean   :3.015   Mean   :2.939   Mean   :3.061  
##  3rd Qu.:4.581   3rd Qu.:3.342   3rd Qu.:4.087   3rd Qu.:3.538  
##  Max.   :4.902   Max.   :4.975   Max.   :4.544   Max.   :5.045  
##        V5              V6              V7              V8       
##  Min.   :1.836   Min.   :1.735   Min.   :1.269   Min.   :1.254  
##  1st Qu.:2.269   1st Qu.:2.310   1st Qu.:2.311   1st Qu.:1.746  
##  Median :2.492   Median :2.637   Median :2.785   Median :3.085  
##  Mean   :2.973   Mean   :2.929   Mean   :2.999   Mean   :2.958  
##  3rd Qu.:3.844   3rd Qu.:3.148   3rd Qu.:3.550   3rd Qu.:3.915  
##  Max.   :4.624   Max.   :5.316   Max.   :5.208   Max.   :4.631  
##        V9             V10              V11             V12       
##  Min.   :1.133   Min.   :0.6545   Min.   :1.254   Min.   :1.445  
##  1st Qu.:2.213   1st Qu.:2.8133   1st Qu.:2.121   1st Qu.:2.198  
##  Median :2.811   Median :3.3927   Median :2.824   Median :2.437  
##  Mean   :2.917   Mean   :3.0968   Mean   :2.958   Mean   :2.987  
##  3rd Qu.:3.617   3rd Qu.:3.6623   3rd Qu.:4.022   3rd Qu.:3.316  
##  Max.   :4.764   Max.   :4.6753   Max.   :4.712   Max.   :6.101  
##       V13             V14              V15             V16       
##  Min.   :1.198   Min.   :0.8574   Min.   :1.766   Min.   :1.114  
##  1st Qu.:2.518   1st Qu.:2.2471   1st Qu.:2.298   1st Qu.:2.122  
##  Median :2.932   Median :2.7995   Median :2.528   Median :2.714  
##  Mean   :3.033   Mean   :2.9685   Mean   :2.984   Mean   :2.923  
##  3rd Qu.:4.020   3rd Qu.:4.0352   3rd Qu.:3.225   3rd Qu.:3.968  
##  Max.   :4.177   Max.   :4.6596   Max.   :5.642   Max.   :4.641  
##       V17             V18              V19             V20       
##  Min.   :1.681   Min.   :0.7252   Min.   :1.862   Min.   :1.593  
##  1st Qu.:2.465   1st Qu.:2.7033   1st Qu.:2.158   1st Qu.:1.967  
##  Median :3.103   Median :2.9494   Median :2.798   Median :2.923  
##  Mean   :3.054   Mean   :2.9983   Mean   :3.054   Mean   :2.984  
##  3rd Qu.:3.904   3rd Qu.:3.8184   3rd Qu.:3.983   3rd Qu.:3.749  
##  Max.   :3.969   Max.   :4.2868   Max.   :4.645   Max.   :5.075

4.4 Bootstrap

An alternative to K-Fold Cross Validation is bootstrap sampling, which produces training data sets the same size as the original data set, although some observations may be repeated because the sampling is done with replacement. The observations that do not appear in a given training set are then used as the test set; these are known as “out-of-bag” samples. We’ll make a function to do bootstrap sampling.
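A sketch of such a function follows; `boot_rmse` is a hypothetical name, `mpg ~ wt` an assumed formula, and the seed arbitrary:

```r
# Hypothetical bootstrap-RMSE helper: sample rows with replacement for training,
# then test on the out-of-bag rows the bootstrap sample missed
boot_rmse <- function(df = mtcars, n = 20) {
  replicate(n, {
    train_idx <- sample(nrow(df), replace = TRUE)        # bootstrap sample, same size as df
    oob       <- setdiff(seq_len(nrow(df)), train_idx)   # out-of-bag rows become the test set
    fit   <- lm(mpg ~ wt, data = df[train_idx, ])        # assumed formula
    preds <- predict(fit, newdata = df[oob, ])
    sqrt(mean((df$mpg[oob] - preds)^2))
  })
}

set.seed(123)          # arbitrary seed; not from the text
summary(boot_rmse())   # distribution of out-of-bag RMSE across 20 bootstrap samples
```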

## [1] 2.441587 3.733133 3.216098 2.544607

## [1] "Summary of Bootstrap RMSE"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.046   2.799   3.197   3.242   3.615   4.749
## [1] "Summary of CV"
##        V1              V2              V3              V4       
##  Min.   :1.508   Min.   :1.560   Min.   :1.633   Min.   :1.981  
##  1st Qu.:2.027   1st Qu.:2.535   1st Qu.:2.146   1st Qu.:2.360  
##  Median :2.428   Median :2.736   Median :2.498   Median :2.691  
##  Mean   :3.046   Mean   :3.015   Mean   :2.939   Mean   :3.061  
##  3rd Qu.:4.581   3rd Qu.:3.342   3rd Qu.:4.087   3rd Qu.:3.538  
##  Max.   :4.902   Max.   :4.975   Max.   :4.544   Max.   :5.045  
##        V5              V6              V7              V8       
##  Min.   :1.836   Min.   :1.735   Min.   :1.269   Min.   :1.254  
##  1st Qu.:2.269   1st Qu.:2.310   1st Qu.:2.311   1st Qu.:1.746  
##  Median :2.492   Median :2.637   Median :2.785   Median :3.085  
##  Mean   :2.973   Mean   :2.929   Mean   :2.999   Mean   :2.958  
##  3rd Qu.:3.844   3rd Qu.:3.148   3rd Qu.:3.550   3rd Qu.:3.915  
##  Max.   :4.624   Max.   :5.316   Max.   :5.208   Max.   :4.631  
##        V9             V10              V11             V12       
##  Min.   :1.133   Min.   :0.6545   Min.   :1.254   Min.   :1.445  
##  1st Qu.:2.213   1st Qu.:2.8133   1st Qu.:2.121   1st Qu.:2.198  
##  Median :2.811   Median :3.3927   Median :2.824   Median :2.437  
##  Mean   :2.917   Mean   :3.0968   Mean   :2.958   Mean   :2.987  
##  3rd Qu.:3.617   3rd Qu.:3.6623   3rd Qu.:4.022   3rd Qu.:3.316  
##  Max.   :4.764   Max.   :4.6753   Max.   :4.712   Max.   :6.101  
##       V13             V14              V15             V16       
##  Min.   :1.198   Min.   :0.8574   Min.   :1.766   Min.   :1.114  
##  1st Qu.:2.518   1st Qu.:2.2471   1st Qu.:2.298   1st Qu.:2.122  
##  Median :2.932   Median :2.7995   Median :2.528   Median :2.714  
##  Mean   :3.033   Mean   :2.9685   Mean   :2.984   Mean   :2.923  
##  3rd Qu.:4.020   3rd Qu.:4.0352   3rd Qu.:3.225   3rd Qu.:3.968  
##  Max.   :4.177   Max.   :4.6596   Max.   :5.642   Max.   :4.641  
##       V17             V18              V19             V20       
##  Min.   :1.681   Min.   :0.7252   Min.   :1.862   Min.   :1.593  
##  1st Qu.:2.465   1st Qu.:2.7033   1st Qu.:2.158   1st Qu.:1.967  
##  Median :3.103   Median :2.9494   Median :2.798   Median :2.923  
##  Mean   :3.054   Mean   :2.9983   Mean   :3.054   Mean   :2.984  
##  3rd Qu.:3.904   3rd Qu.:3.8184   3rd Qu.:3.983   3rd Qu.:3.749  
##  Max.   :3.969   Max.   :4.2868   Max.   :4.645   Max.   :5.075