Chapter 11 Data Pre-Processing

Data rarely arrives in a form that is directly suitable for use with a modeling method. There are a number of considerations, such as how to handle missing data, highly correlated variables, and class imbalances, where some categories are over- or under-represented. Additionally, some variables, also known as “features”, will require transformation or will need to be used to create new variables.

11.1 Types of Pre-Processing

Consider the case where the measured data (the numeric data) might be on different scales (e.g. height vs weight). This might result in the need to scale and center the data. Some methods take this into consideration whereas others do not. Suffice it to say that data prep can be an ongoing process that requires a number of experiments before arriving at the best form of data.

11.2 Missing Values

This is a frequent situation in real life. Think of patients checking in for clinic visits over time. Sometimes they come for their appointments, sometimes they don’t. Sometimes when they do come, their information has changed or some diagnostic test is repeated with a new result which is entered or not. Or whoever maintains the patient database decides to add some new variables to measure for all patients moving forward. This means that all existing patients will have missing values for those new variables. To see how this manifests practically in predictive learning, consider the following version of the mtcars data frame which has some missing values:

11.2.1 Finding Rows with Missing Data

At first glance it looks like all features have valid values, but we can look for missing values which, in R, are indicated by NA. Base R provides a number of commands to do this. First let’s see how many rows there are in the data frame.
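A minimal sketch of the commands involved, assuming the modified data frame is called mtcars_na (the text does not give it a name):

# total number of rows in the data frame with missing values
nrow(mtcars_na)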

## [1] 32

Now let’s count the complete rows, that is, the rows with no missing values at all. Since only 24 of the 32 rows are complete, we have eight rows in the data frame that contain one or more missing values.
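Continuing with the assumed mtcars_na data frame:

# number of rows that have no missing values at all
sum(complete.cases(mtcars_na))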

## [1] 24

11.2.2 Finding Columns With Missing Data

What columns have missing values? Here we leverage the apply family of functions along with the ability to create anonymous functions on the fly. Both R and Python provide this capability. We see that the wt column has three missing values and the carb feature has six missing values.
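A sketch using an anonymous function:

# count the NA values in each column
sapply(mtcars_na, function(x) sum(is.na(x)))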

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    3    0    0    0    0    6

If we actually wanted to see all rows with missing values:
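For example:

# show only the rows that contain at least one NA
mtcars_na[!complete.cases(mtcars_na), ]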

##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 2  21.0   6 160.0 110 3.90    NA 17.02  0  1    4    4
## 9  22.8   4 140.8  95 3.92    NA 22.90  1  0    4    2
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4   NA
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3   NA
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4   NA
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4   NA
## 23 15.2   8 304.0 150 3.15    NA 17.30  0  0    3   NA
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5   NA

We might also consider using the DataExplorer package to help us identify the “missingness” of the data:
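A sketch, assuming the DataExplorer package is installed:

library(DataExplorer)
# plot the percentage of missing values per feature
plot_missing(mtcars_na)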

What do we do with these? It depends on a number of things. How hard is it to get this type of data? If it’s rare information then we probably want to keep as much of the data as possible. In fact, if most of the data in a given row is present, one strategy is to tell whatever modeling method we use to ignore the missing values (e.g. na.rm = TRUE). Some methods might do this by default without you even asking.

However, if we have plenty of data we could just filter out any row from the data frame that contains any missing values. In this case we would lose eight rows of data. This is low stakes data but if this were rare or hard to obtain information then we wouldn’t want to do this.
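A sketch of the filtering approach:

# na.omit() drops every row containing a missing value
nrow(na.omit(mtcars_na))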

## [1] 24

11.2.3 Use the Median Approach

What could we do? Well, we could keep all rows even if they contain NAs and then use imputation methods to supply values for the missing information. There are R packages that do this, but one quick way to do it without going that route is to replace the missing values in the wt column with the median value for the entire column.

Using the median is appropriate when the missing values are of the “missing at random” variety. There might be some bias in the data that introduces a “missing not at random” situation; we’ll look at that case momentarily. Let’s look at a boxplot of the wt feature.

Making the substitution involves the following steps. First we need to find out which row numbers have missing values for the wt feature.
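For example:

# row numbers where the wt column is NA
which(is.na(mtcars_na$wt))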

## [1]  2  9 23

Use this information with the data frame bracket notation to see the rows where the NAs occur for the wt feature:

##     mpg cyl  disp  hp drat wt  qsec vs am gear carb
## 2  21.0   6 160.0 110 3.90 NA 17.02  0  1    4    4
## 9  22.8   4 140.8  95 3.92 NA 22.90  1  0    4    2
## 23 15.2   8 304.0 150 3.15 NA 17.30  0  0    3   NA

Now do the replacement
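A sketch of the substitution; here we work on a copy (the name mtcars_median is an assumption) so the original NAs remain available for the later examples:

mtcars_median <- mtcars_na
# replace the missing wt values with the median of the non-missing values
mtcars_median$wt[is.na(mtcars_median$wt)] <- median(mtcars_median$wt, na.rm = TRUE)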

Verify that the replacement was done successfully. If so, then we should see the value of 3.44 in place of the previous NA values.

##     mpg cyl  disp  hp drat   wt  qsec vs am gear carb
## 2  21.0   6 160.0 110 3.90 3.44 17.02  0  1    4    4
## 9  22.8   4 140.8  95 3.92 3.44 22.90  1  0    4    2
## 23 15.2   8 304.0 150 3.15 3.44 17.30  0  0    3   NA

11.2.4 Package-based Approach

This seems like a lot of work, and maybe it is if your R skills are rusty, although conceptually it is straightforward. The Hmisc package provides an easier way.

The Hmisc package provides an impute function to do the work for us. Check it out. Notice how it flags the rows for which the wt feature was missing.
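A sketch, assuming Hmisc is installed and mtcars_na still contains the NAs:

library(Hmisc)
# impute() defaults to the median and flags the imputed entries with an asterisk
impute(mtcars_na$wt)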

##      1      2      3      4      5      6      7      8      9     10     11 
##  2.620 3.440*  2.320  3.215  3.440  3.460  3.570  3.190 3.440*  3.440  3.440 
##     12     13     14     15     16     17     18     19     20     21     22 
##  4.070  3.730  3.780  5.250  5.424  5.345  2.200  1.615  1.835  2.465  3.520 
##     23     24     25     26     27     28     29     30     31     32 
## 3.440*  3.840  3.845  1.935  2.140  1.513  3.170  2.770  3.570  2.780

To do the replacement is straightforward.
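For example (again working on a copy with an assumed name):

mtcars_hmisc <- mtcars_na
# store the imputed column back into the data frame
mtcars_hmisc$wt <- as.numeric(impute(mtcars_hmisc$wt))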

11.2.5 Using caret

Another imputation approach is to use the K-Nearest Neighbors method to find observations that are similar to the ones that contain missing data. The missing values can then be filled in using information from the most similar observations. We won’t go into the details of that method, choosing rather to use the convenience offered by the caret package.

So if we choose to use caret we can use the preProcess function to signal our intent to use imputation - in this case the K-Nearest Neighbors technique. KNN imputation is particularly useful for dealing with the “not at random” situation where there could be bias in the way missing values occur. This approach looks at similar observations to those containing missing values which means that it will attempt to fill in missing values as a function of a number of variables as opposed to just one.

One subtlety here is that caret requires us to use an alternative to the formula interface (e.g. mpg ~ .) approach when using the train function - at least the version of caret that I am currently using.
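The exact call isn’t shown in the text, but a sketch might look like the following. Note that the output below reports 28 samples, which suggests the author was working with a training subset; here we simply pass the assumed mtcars_na data frame:

library(caret)
impMod <- train(x = mtcars_na[, -1], y = mtcars_na$mpg,
                method = "lm",
                preProcess = c("knnImpute", "center", "scale"))
impMod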

## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
## Linear Regression 
## 
## 28 samples
## 10 predictors
## 
## Pre-processing: nearest neighbor imputation (10), centered (10), scaled (10) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 28, 28, 28, 28, 28, 28, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   4.851577  0.6068996  3.870965
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
## [1] 4.808981

11.3 Scaling

In terms of which methods benefit from (or require) scaling the data prior to use, consider that any method that uses the idea of “distance” will require it. Scaling addresses the situation wherein the size and range of one feature might overshadow another. For example, look at the range of features in the mtcars data frame. Not only are the variables on different scales (e.g. mpg vs weight vs horsepower), but a feature such as displacement might over-influence a distance calculation when compared to qsec.

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Let’s look at an example of why we might need to scale in the first place. Here is some data setup to motivate the general need for scaling.
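The original setup isn’t shown, so here is a small, purely illustrative example (the fake_df data frame and its columns are made up):

# two features on wildly different scales
fake_df <- data.frame(height_cm = c(150, 160, 170, 180),
                      income    = c(20000, 50000, 80000, 110000))
# income dominates the distance calculation...
dist(fake_df)
# ...but after scaling, both features contribute comparably
dist(scale(fake_df))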

The relationships within each of the features do not change, but the scaling makes each feature more directly comparable to the others.

11.3.1 Methods That Benefit From Scaling

The following approaches benefit from scaling:

Linear and non-linear regression, logistic regression, KNN, SVM, neural networks, and clustering algorithms such as k-means. Methods that employ PCA or other dimensionality reduction techniques should also use scaled data.

In R and Python, some of the individual functions might have arguments to activate the scaling as part of the process.

11.3.2 Methods That Do Not Require Scaling

Methods that don’t require scaling (or whose results don’t rely upon it) include rule-based algorithms such as decision trees and, more generally, CART-style methods including Random Forests and Gradient Boosted Trees. Even if you scale the data, the relative relationships within each feature are preserved, so the decisions about where to split a tree won’t be impacted.

11.3.3 How To Scale

You could do your own scaling and centering which might be helpful to understand what is going on. First, when we say “scaling” we typically mean “centering” and “scaling”.

11.3.3.1 Centering

Take a vector (or column) of numeric data and find the mean. Then subtract the computed mean value from each element of the vector / column. We will use the wt column from mtcars as an example.
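For example:

# center the wt column by subtracting its mean
centered_wt <- mtcars$wt - mean(mtcars$wt)
centered_wt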

##  [1] -0.59725 -0.34225 -0.89725 -0.00225  0.22275  0.24275  0.35275 -0.02725
##  [9] -0.06725  0.22275  0.22275  0.85275  0.51275  0.56275  2.03275  2.20675
## [17]  2.12775 -1.01725 -1.60225 -1.38225 -0.75225  0.30275  0.21775  0.62275
## [25]  0.62775 -1.28225 -1.07725 -1.70425 -0.04725 -0.44725  0.35275 -0.43725

11.3.3.2 Scaling

This step involves taking the standard deviation of the vector / column. Then divide each value in the vector / column by this computed standard deviation. We’ll process the centered data from the previous computation.
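Continuing from the centered vector above:

# divide by the standard deviation (centering does not change the sd)
scaled_wt <- centered_wt / sd(mtcars$wt)
scaled_wt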

##  [1] -0.610399567 -0.349785269 -0.917004624 -0.002299538  0.227654255
##  [6]  0.248094592  0.360516446 -0.027849959 -0.068730634  0.227654255
## [11]  0.227654255  0.871524874  0.524039143  0.575139986  2.077504765
## [16]  2.255335698  2.174596366 -1.039646647 -1.637526508 -1.412682800
## [21] -0.768812180  0.309415603  0.222544170  0.636460997  0.641571082
## [26] -1.310481114 -1.100967659 -1.741772228 -0.048290296 -0.457097039
## [31]  0.360516446 -0.446876870

11.3.3.3 Scaling a Data frame

We could continue to do this by hand for every column of the data frame.

However, there is a function in R called scale which will do this for us. It has the added benefit of providing some attributes in the output that we could later use to de-scale the scaled data. This is useful if we build a model with scaled data because the predictions will be in terms of the scaled data which might not make sense to a third party. So you would then need to scale the predictions back into the units of the original data.

Here is how the scale function works:
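For example (scaled_mtcars is an assumed name):

# scale() centers and scales every column and returns a matrix with extra attributes
scaled_mtcars <- scale(mtcars)
scaled_mtcars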

##                             mpg        cyl        disp          hp        drat
## Mazda RX4            0.15088482 -0.1049878 -0.57061982 -0.53509284  0.56751369
## Mazda RX4 Wag        0.15088482 -0.1049878 -0.57061982 -0.53509284  0.56751369
## Datsun 710           0.44954345 -1.2248578 -0.99018209 -0.78304046  0.47399959
## Hornet 4 Drive       0.21725341 -0.1049878  0.22009369 -0.53509284 -0.96611753
## Hornet Sportabout   -0.23073453  1.0148821  1.04308123  0.41294217 -0.83519779
## Valiant             -0.33028740 -0.1049878 -0.04616698 -0.60801861 -1.56460776
## Duster 360          -0.96078893  1.0148821  1.04308123  1.43390296 -0.72298087
## Merc 240D            0.71501778 -1.2248578 -0.67793094 -1.23518023  0.17475447
## Merc 230             0.44954345 -1.2248578 -0.72553512 -0.75387015  0.60491932
## Merc 280            -0.14777380 -0.1049878 -0.50929918 -0.34548584  0.60491932
## Merc 280C           -0.38006384 -0.1049878 -0.50929918 -0.34548584  0.60491932
## Merc 450SE          -0.61235388  1.0148821  0.36371309  0.48586794 -0.98482035
## Merc 450SL          -0.46302456  1.0148821  0.36371309  0.48586794 -0.98482035
## Merc 450SLC         -0.81145962  1.0148821  0.36371309  0.48586794 -0.98482035
## Cadillac Fleetwood  -1.60788262  1.0148821  1.94675381  0.85049680 -1.24665983
## Lincoln Continental -1.60788262  1.0148821  1.84993175  0.99634834 -1.11574009
## Chrysler Imperial   -0.89442035  1.0148821  1.68856165  1.21512565 -0.68557523
## Fiat 128             2.04238943 -1.2248578 -1.22658929 -1.17683962  0.90416444
## Honda Civic          1.71054652 -1.2248578 -1.25079481 -1.38103178  2.49390411
## Toyota Corolla       2.29127162 -1.2248578 -1.28790993 -1.19142477  1.16600392
## Toyota Corona        0.23384555 -1.2248578 -0.89255318 -0.72469984  0.19345729
## Dodge Challenger    -0.76168319  1.0148821  0.70420401  0.04831332 -1.56460776
## AMC Javelin         -0.81145962  1.0148821  0.59124494  0.04831332 -0.83519779
## Camaro Z28          -1.12671039  1.0148821  0.96239618  1.43390296  0.24956575
## Pontiac Firebird    -0.14777380  1.0148821  1.36582144  0.41294217 -0.96611753
## Fiat X1-9            1.19619000 -1.2248578 -1.22416874 -1.17683962  0.90416444
## Porsche 914-2        0.98049211 -1.2248578 -0.89093948 -0.81221077  1.55876313
## Lotus Europa         1.71054652 -1.2248578 -1.09426581 -0.49133738  0.32437703
## Ford Pantera L      -0.71190675  1.0148821  0.97046468  1.71102089  1.16600392
## Ferrari Dino        -0.06481307 -0.1049878 -0.69164740  0.41294217  0.04383473
## Maserati Bora       -0.84464392  1.0148821  0.56703942  2.74656682 -0.10578782
## Volvo 142E           0.21725341 -1.2248578 -0.88529152 -0.54967799  0.96027290
##                               wt        qsec         vs         am       gear
## Mazda RX4           -0.610399567 -0.77716515 -0.8680278  1.1899014  0.4235542
## Mazda RX4 Wag       -0.349785269 -0.46378082 -0.8680278  1.1899014  0.4235542
## Datsun 710          -0.917004624  0.42600682  1.1160357  1.1899014  0.4235542
## Hornet 4 Drive      -0.002299538  0.89048716  1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout    0.227654255 -0.46378082 -0.8680278 -0.8141431 -0.9318192
## Valiant              0.248094592  1.32698675  1.1160357 -0.8141431 -0.9318192
## Duster 360           0.360516446 -1.12412636 -0.8680278 -0.8141431 -0.9318192
## Merc 240D           -0.027849959  1.20387148  1.1160357 -0.8141431  0.4235542
## Merc 230            -0.068730634  2.82675459  1.1160357 -0.8141431  0.4235542
## Merc 280             0.227654255  0.25252621  1.1160357 -0.8141431  0.4235542
## Merc 280C            0.227654255  0.58829513  1.1160357 -0.8141431  0.4235542
## Merc 450SE           0.871524874 -0.25112717 -0.8680278 -0.8141431 -0.9318192
## Merc 450SL           0.524039143 -0.13920420 -0.8680278 -0.8141431 -0.9318192
## Merc 450SLC          0.575139986  0.08464175 -0.8680278 -0.8141431 -0.9318192
## Cadillac Fleetwood   2.077504765  0.07344945 -0.8680278 -0.8141431 -0.9318192
## Lincoln Continental  2.255335698 -0.01608893 -0.8680278 -0.8141431 -0.9318192
## Chrysler Imperial    2.174596366 -0.23993487 -0.8680278 -0.8141431 -0.9318192
## Fiat 128            -1.039646647  0.90727560  1.1160357  1.1899014  0.4235542
## Honda Civic         -1.637526508  0.37564148  1.1160357  1.1899014  0.4235542
## Toyota Corolla      -1.412682800  1.14790999  1.1160357  1.1899014  0.4235542
## Toyota Corona       -0.768812180  1.20946763  1.1160357 -0.8141431 -0.9318192
## Dodge Challenger     0.309415603 -0.54772305 -0.8680278 -0.8141431 -0.9318192
## AMC Javelin          0.222544170 -0.30708866 -0.8680278 -0.8141431 -0.9318192
## Camaro Z28           0.636460997 -1.36476075 -0.8680278 -0.8141431 -0.9318192
## Pontiac Firebird     0.641571082 -0.44699237 -0.8680278 -0.8141431 -0.9318192
## Fiat X1-9           -1.310481114  0.58829513  1.1160357  1.1899014  0.4235542
## Porsche 914-2       -1.100967659 -0.64285758 -0.8680278  1.1899014  1.7789276
## Lotus Europa        -1.741772228 -0.53093460  1.1160357  1.1899014  1.7789276
## Ford Pantera L      -0.048290296 -1.87401028 -0.8680278  1.1899014  1.7789276
## Ferrari Dino        -0.457097039 -1.31439542 -0.8680278  1.1899014  1.7789276
## Maserati Bora        0.360516446 -1.81804880 -0.8680278  1.1899014  1.7789276
## Volvo 142E          -0.446876870  0.42041067  1.1160357  1.1899014  0.4235542
##                           carb
## Mazda RX4            0.7352031
## Mazda RX4 Wag        0.7352031
## Datsun 710          -1.1221521
## Hornet 4 Drive      -1.1221521
## Hornet Sportabout   -0.5030337
## Valiant             -1.1221521
## Duster 360           0.7352031
## Merc 240D           -0.5030337
## Merc 230            -0.5030337
## Merc 280             0.7352031
## Merc 280C            0.7352031
## Merc 450SE           0.1160847
## Merc 450SL           0.1160847
## Merc 450SLC          0.1160847
## Cadillac Fleetwood   0.7352031
## Lincoln Continental  0.7352031
## Chrysler Imperial    0.7352031
## Fiat 128            -1.1221521
## Honda Civic         -0.5030337
## Toyota Corolla      -1.1221521
## Toyota Corona       -1.1221521
## Dodge Challenger    -0.5030337
## AMC Javelin         -0.5030337
## Camaro Z28           0.7352031
## Pontiac Firebird    -0.5030337
## Fiat X1-9           -1.1221521
## Porsche 914-2       -0.5030337
## Lotus Europa        -0.5030337
## Ford Pantera L       0.7352031
## Ferrari Dino         1.9734398
## Maserati Bora        3.2116766
## Volvo 142E          -0.5030337
## attr(,"scaled:center")
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500 
## attr(,"scaled:scale")
##         mpg         cyl        disp          hp        drat          wt 
##   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
##        qsec          vs          am        gear        carb 
##   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000

The last rows make reference to attributes, which can be considered “meta information”. The presence of this information does not prevent you from using the scaled data for other operations. But if you needed to de-scale the data you would need the standard deviation and the mean of each column to reverse the centering and scaling process. That info is stashed in the attributes:
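These can be retrieved with attr(), assuming the scaled_mtcars object from above:

attr(scaled_mtcars, "scaled:scale")    # the column standard deviations
attr(scaled_mtcars, "scaled:center")   # the column means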

##         mpg         cyl        disp          hp        drat          wt 
##   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
##        qsec          vs          am        gear        carb 
##   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

If we wanted to de-scale the data frame that we just scaled, we could do something like this:

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Ugh. What a pain! But if we wanted to use the scaled data to build a model, here is what we would do:
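A sketch, using the scaled_mtcars matrix from above:

# lm() wants a data frame, so convert the scaled matrix first
scaled_df <- as.data.frame(scaled_mtcars)
lm_scaled <- lm(mpg ~ ., data = scaled_df)
# the fitted values are in scaled mpg units
head(predict(lm_scaled), 5)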

##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##         0.4162772         0.3353706         1.0220793         0.1902753 
## Hornet Sportabout 
##        -0.3977454

So how would we re-express this information in terms of the unscaled data? It’s tedious but doable. We leverage the fact that the scaled version of the data has attributes we can use to convert the predictions back to their unscaled form.
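A sketch of the back-transformation using the stored attributes:

sc <- attr(scaled_mtcars, "scaled:scale")["mpg"]
ce <- attr(scaled_mtcars, "scaled:center")["mpg"]
# undo the scaling and then the centering of the predicted mpg values
predict(lm_scaled) * sc + ce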

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            22.59951            22.11189            26.25064            21.23740 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            17.69343            20.38304            14.38626            22.49601 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            24.41909            18.69903            19.19165            14.17216 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            15.59957            15.74222            12.03401            10.93644 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            10.49363            27.77291            29.89674            29.51237 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            23.64310            16.94305            17.73218            13.30602 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            16.69168            28.29347            26.15295            27.63627 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            18.87004            19.69383            13.94112            24.36827

Does this match what we would get had we not scaled the data in the first place?

##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##            22.59951            22.11189            26.25064            21.23740 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##            17.69343            20.38304            14.38626            22.49601 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##            24.41909            18.69903            19.19165            14.17216 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##            15.59957            15.74222            12.03401            10.93644 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##            10.49363            27.77291            29.89674            29.51237 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##            23.64310            16.94305            17.73218            13.30602 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##            16.69168            28.29347            26.15295            27.63627 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            18.87004            19.69383            13.94112            24.36827
## [1] TRUE

11.3.3.4 caret and preprocess

But the caret package will help you do this using the preProcess function. Here we can request that the data be “centered” and “scaled” as part of the model assembly process. We could do this ourselves before the call to the train function, but then we would also have to convert the training and test data ourselves. In the following situation, the data will be centered and scaled, though the returned RMSE will be in terms of the unscaled data.
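The call that produced the output below isn’t shown; a sketch might look like this, assuming the data has already been split into a 28-row training set train_cars and a small test set test_cars (the split itself is not shown in the text):

library(caret)
myLm <- train(x = train_cars[, -1], y = train_cars$mpg,
              method = "lm",
              preProcess = c("center", "scale"))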

To verify that the data is being centered and scaled within the call to the train function, check out the myLm object:

##                .outcome        cyl       disp         hp       drat          wt
## Mazda.RX4          21.0 -0.1224292 -0.5638267 -0.5419377  0.5505338 -0.60292444
## Mazda.RX4.Wag      21.0 -0.1224292 -0.5638267 -0.5419377  0.5505338 -0.32911192
## Datsun.710         22.8 -1.2651012 -0.9997273 -0.7880850  0.4597509 -0.92505682
## Hornet.4.Drive     21.4 -0.1224292  0.2576783 -0.5419377 -0.9383068  0.03597145
## Valiant            18.1 -0.1224292 -0.0189509 -0.6143340 -1.5193177  0.29904623
##                      qsec         vs         am       gear       carb
## Mazda.RX4      -0.7515018 -0.8504201  1.1338934  0.3745127  0.6657138
## Mazda.RX4.Wag  -0.4053341 -0.8504201  1.1338934  0.3745127  0.6657138
## Datsun.710      0.5775349  1.1338934  1.1338934  0.3745127 -1.1381558
## Hornet.4.Drive  1.0906048  1.1338934 -0.8504201 -0.9362817 -1.1381558
## Valiant         1.5727669  1.1338934 -0.8504201 -0.9362817 -1.1381558
##                       mpg        cyl       disp         hp       drat
## Mazda RX4       0.1649448 -0.1224292 -0.5638267 -0.5419377  0.5505338
## Mazda RX4 Wag   0.1649448 -0.1224292 -0.5638267 -0.5419377  0.5505338
## Datsun 710      0.4556166 -1.2651012 -0.9997273 -0.7880850  0.4597509
## Hornet 4 Drive  0.2295385 -0.1224292  0.2576783 -0.5419377 -0.9383068
## Valiant        -0.3033599 -0.1224292 -0.0189509 -0.6143340 -1.5193177
##                         wt       qsec         vs         am       gear
## Mazda RX4      -0.60292444 -0.7515018 -0.8504201  1.1338934  0.3745127
## Mazda RX4 Wag  -0.32911192 -0.4053341 -0.8504201  1.1338934  0.3745127
## Datsun 710     -0.92505682  0.5775349  1.1338934  1.1338934  0.3745127
## Hornet 4 Drive  0.03597145  1.0906048  1.1338934 -0.8504201 -0.9362817
## Valiant         0.29904623  1.5727669  1.1338934 -0.8504201 -0.9362817
##                      carb
## Mazda RX4       0.6657138
## Mazda RX4 Wag   0.6657138
## Datsun 710     -1.1381558
## Hornet 4 Drive -1.1381558
## Valiant        -1.1381558

This is convenient because if we now use the predict function, caret will scale the test data we provide before making predictions. What we get back will be in terms of the uncentered and unscaled predicted variable (mpg), so we do not have to suffer through the reverse scaling process ourselves.
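A sketch, continuing with the assumed test_cars split:

# the test predictors are centered and scaled automatically before prediction
predict(myLm, newdata = test_cars[, -1])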

## Hornet Sportabout          Merc 230 Chrysler Imperial         Fiat X1-9 
##         17.947087         29.175386          8.527641         28.422414

11.3.4 Order of Processing

Note that the order of processing is important. You should center and then scale. Underneath the hood, the centering operation finds the mean of a feature (column) and then subtracts it from each value therein. So the following are equivalent:
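For example, both of the following produce the same centered vector:

mtcars$wt - mean(mtcars$wt)
as.vector(scale(mtcars$wt, center = TRUE, scale = FALSE))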

##  [1] -0.59725 -0.34225 -0.89725 -0.00225  0.22275  0.24275  0.35275 -0.02725
##  [9] -0.06725  0.22275  0.22275  0.85275  0.51275  0.56275  2.03275  2.20675
## [17]  2.12775 -1.01725 -1.60225 -1.38225 -0.75225  0.30275  0.21775  0.62275
## [25]  0.62775 -1.28225 -1.07725 -1.70425 -0.04725 -0.44725  0.35275 -0.43725
##  [1] -0.59725 -0.34225 -0.89725 -0.00225  0.22275  0.24275  0.35275 -0.02725
##  [9] -0.06725  0.22275  0.22275  0.85275  0.51275  0.56275  2.03275  2.20675
## [17]  2.12775 -1.01725 -1.60225 -1.38225 -0.75225  0.30275  0.21775  0.62275
## [25]  0.62775 -1.28225 -1.07725 -1.70425 -0.04725 -0.44725  0.35275 -0.43725

After centering, the scaling is done by dividing the (centered) columns of the data frame by their respective standard deviations. In terms of dealing with missing data and scaling, one should first do imputation followed by centering and scaling. In a call to the train function, this would look like the following. We’ll use our version of mtcars that has missing data.
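Concretely, a sketch of such a call, again with the assumed mtcars_na data frame:

impScaledMod <- train(x = mtcars_na[, -1], y = mtcars_na$mpg,
                      method = "lm",
                      preProcess = c("knnImpute", "center", "scale"))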

11.4 Low Variance Variables

Some variables exhibit low variance and might be nearly constant. Such variables can be detected by using some basic functions in R before you begin to build a model. As an example, we’ll use the mtcars data frame and introduce a low variance variable - actually, we’ll make it a constant.
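A sketch of the setup; the later output shows drat holding the constant value 3, so that appears to be the column that was overwritten (mtcars_const is an assumed name):

mtcars_const <- mtcars
mtcars_const$drat <- 3          # a constant, zero variance feature
var(mtcars_const$drat)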

## [1] 0

What if we use this variable when making a model? We’ll get problems. While this is a contrived example, it is possible to encounter this situation when using k-fold cross validation, where the data is segmented into smaller subsets in which a variable can end up having zero variance.

The caret package has an option to the preProcess argument that allows us to remove such variables so they won’t impact the resulting model. Notice that we remove the near zero variance variables before centering, which happens before scaling. This is the recommended order.

The above now works. There is a subtle consideration at work here. We could pre-process the data with nzv or zv, where the former removes “near” zero variance features and the latter removes constant-valued features. With near zero variance features there is a way to specify a tolerance for deciding whether a feature has near zero variance.

Think of nzv as the broader, tunable filter, whereas zv eliminates only the strictly zero variance (constant) variables. Sometimes you might want to keep near zero variance features around simply because there could be some interesting information therein, in which case zv is the safer choice.

In general we could remove the constant or near zero variance features before passing the data to the train function. caret has a standalone function called nearZeroVar to do this. Well, it doesn’t actually remove the feature, but it will tell us which column(s) exhibit very low variance or have a constant value. In this case it is column number 5, which corresponds to drat. We already knew that.
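For example, with the assumed mtcars_const data frame:

nzv_col <- nearZeroVar(mtcars_const)
nzv_col
# peek at the offending column
mtcars_const[, nzv_col, drop = FALSE]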

## [1] 5
##                     drat
## Mazda RX4              3
## Mazda RX4 Wag          3
## Datsun 710             3
## Hornet 4 Drive         3
## Hornet Sportabout      3
## Valiant                3
## Duster 360             3
## Merc 240D              3
## Merc 230               3
## Merc 280               3
## Merc 280C              3
## Merc 450SE             3
## Merc 450SL             3
## Merc 450SLC            3
## Cadillac Fleetwood     3
## Lincoln Continental    3
## Chrysler Imperial      3
## Fiat 128               3
## Honda Civic            3
## Toyota Corolla         3
## Toyota Corona          3
## Dodge Challenger       3
## AMC Javelin            3
## Camaro Z28             3
## Pontiac Firebird       3
## Fiat X1-9              3
## Porsche 914-2          3
## Lotus Europa           3
## Ford Pantera L         3
## Ferrari Dino           3
## Maserati Bora          3
## Volvo 142E             3

There is a data frame in the caret package which exhibits this behavior in a more organic fashion - that is, the measurements for a number of its features have near zero variance.
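This looks like the blood-brain barrier descriptor data that ships with caret; a hedged sketch of commands that could produce output like that shown below:

library(caret)
data(BloodBrain)                          # loads bbbDescr (208 rows, 134 columns) and logBBB
str(bbbDescr, max.level = 0)              # just the data frame dimensions
nearZeroVar(bbbDescr)                     # indices of the near zero variance columns
names(bbbDescr)[nearZeroVar(bbbDescr)]    # and their names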

## 'data.frame':    208 obs. of  134 variables:
## [1]  3 16 17 22 25 50 60
## [1] "negative"     "peoe_vsa.2.1" "peoe_vsa.3.1" "a_acid"       "vsa_acid"    
## [6] "frac.anion7." "alert"

11.5 PCA - Principal Components Analysis

So one of the problems with data can be what is called multicollinearity where high correlations exist between variables in a data set. Consider the mtcars data frame for example. Let’s assume that we want to predict whether a given car has an automatic transmission (0) or manual (1). We’ll remove other columns from the data frame that represent categorical data so we can focus on the continuous numeric variables.

11.5.1 Identify The Factors

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6

Let’s eliminate cyl, vs, gear, and carb. We’ll also remove the row names since, if we don’t, they will cause problems when we create the biplot of the principal components.
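A sketch (mtcars_small is an assumed name):

# keep the continuous variables plus the am target; drop cyl, vs, gear, and carb
mtcars_small <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec", "am")]
rownames(mtcars_small) <- NULL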

Now, let’s look at the correlation matrix to see if we have highly correlated variables. We do have several correlations that exceed 0.7, which is sufficiently high to consider that they might cause problems when building models. The caret package has a findCorrelation function that can flag predictors exhibiting a correlation above a certain threshold, but we’ll see how PCA can help - so we’ll leave them in for now.

11.5.2 Check For High Correlations

The correlation plot is the graphical equivalent of this command:

##             mpg       disp         hp        drat         wt        qsec
## mpg   1.0000000 -0.8475514 -0.7761684  0.68117191 -0.8676594  0.41868403
## disp -0.8475514  1.0000000  0.7909486 -0.71021393  0.8879799 -0.43369788
## hp   -0.7761684  0.7909486  1.0000000 -0.44875912  0.6587479 -0.70822339
## drat  0.6811719 -0.7102139 -0.4487591  1.00000000 -0.7124406  0.09120476
## wt   -0.8676594  0.8879799  0.6587479 -0.71244065  1.0000000 -0.17471588
## qsec  0.4186840 -0.4336979 -0.7082234  0.09120476 -0.1747159  1.00000000

11.5.3 So Why Use PCA?

There are several functions for doing Principal Components Analysis on this data. But why are we even thinking about PCA? Well, it helps us deal with highly correlated data by reducing the dimensionality of a data set. In the mtcars data frame we don’t have that many variables / columns, but wouldn’t it be nice to transform the data in a way that reduced the number of columns we had to consider while also dealing with the multicollinearity?

This is what PCA can do for us. In reality we are using the eigenvectors of the covariance matrix of the original data. We use them to transform the original data into a reduced number of columns to consider. To get the ball rolling, we’ll use the princomp function.
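A sketch of the PCA; scaled_mtcars_data matches the object name visible in the output below, while the other names are assumptions:

# scale the six numeric predictors, leaving out the am target
scaled_mtcars_data <- scale(mtcars_small[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
pca_mod <- princomp(scaled_mtcars_data)
pca_mod
summary(pca_mod)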

## Call:
## princomp(x = scaled_mtcars_data)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6 
## 2.0140855 1.0546249 0.5682775 0.3866999 0.3477012 0.2243967 
## 
##  6  variables and  32 observations.
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.0140855 1.0546249 0.56827746 0.38669985 0.34770121
## Proportion of Variance 0.6978994 0.1913520 0.05555944 0.02572676 0.02079933
## Cumulative Proportion  0.6978994 0.8892514 0.94481088 0.97053763 0.99133697
##                             Comp.6
## Standard deviation     0.224396671
## Proportion of Variance 0.008663031
## Cumulative Proportion  1.000000000

11.5.4 Check The BiPlot

What we get from this summary is that PC1 accounts for roughly 70% of the variation in the data set. By the time we get to PC3, about 95% of the variation is accounted for. We can also look at something called a biplot that allows us to see which variables in the first two components are influential. This information is also available just by viewing the default return information from the object itself, but the biplot makes it easier to see.

Back to the summary information, we can use something called a screeplot which will help us see how many of the components to use. We don’t actually need the plot, although it can help. Look at the plot and find the “elbow”, which is the point at which the rate of change stabilizes. In the plot below it looks like the elbow is at PC3. Looking at the summary info, if we select three components then we are accounting for about 95% of the variation in the data. If we selected two components then we would have about 89% of the variation accounted for.

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.0140855 1.0546249 0.56827746 0.38669985 0.34770121
## Proportion of Variance 0.6978994 0.1913520 0.05555944 0.02572676 0.02079933
## Cumulative Proportion  0.6978994 0.8892514 0.94481088 0.97053763 0.99133697
##                             Comp.6
## Standard deviation     0.224396671
## Proportion of Variance 0.008663031
## Cumulative Proportion  1.000000000

11.5.5 Check The ScreePlot

Let’s pick two components. We’ll multiply our data set (using matrix multiplication) by the two “loadings” / components that we picked. This will give us our original data in terms of the two principal components.
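A sketch of the projection:

# project the scaled data onto the first two principal component loadings
reduced_mtcars <- scaled_mtcars_data %*% loadings(pca_mod)[, 1:2]
head(reduced_mtcars)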

11.5.6 Use The Transformed Data

So now we’ll use the Naive Bayes method to build a model using the transformed data.
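A sketch using the e1071 package (the package choice and the object names are assumptions):

library(e1071)
nb_df  <- data.frame(reduced_mtcars, am = factor(mtcars_small$am))
nb_pca <- naiveBayes(am ~ ., data = nb_df)
pred   <- predict(nb_pca, nb_df)
table(pred, nb_df$am)
cat("Accuracy for PCA mod: ", mean(pred == nb_df$am), "\n")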

##    
##      0  1
##   0 18  0
##   1  1 13
## Accuracy for PCA mod:  0.96875

Compare this to a model built using the untransformed data:

##    
##      0  1
##   0 16  2
##   1  3 11
## Accuracy for PCA mod:  0.84375

11.5.7 PLS

Partial Least Squares regression is related to PCA although, in classification problems, the latter ignores the variable being predicted. PLS uses information from the variable to be predicted (the class labels) to help maximize the separation of the two classes. Of course, using pls as a method within caret is easy.
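A sketch of what the caret call might look like (the object names are assumptions):

mtcars_small$am <- factor(mtcars_small$am)
plsMod <- train(am ~ ., data = mtcars_small,
                method = "pls",
                preProcess = c("center", "scale"),
                tuneLength = 5)
plsMod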

We already knew that the mtcars data frame would benefit from PCA. Here we see that PLS uses the first three components to arrive at an accuracy of 0.97 on the data set.

11.5.8 Summary

So the advantages of PCA should be clear in this case. We have effectively replaced the original data with a smaller data set while also dealing with the correlation issues. We used only two components. Now, a disadvantage here is that the model built with the transformed data is expressed in terms of the components, which means that the model is less transparent. Perhaps a minor price to pay for better accuracy. In terms of the caret package we can do this using the preProcess function:
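A hedged sketch of how this might look with caret; method = "nb" requires the klaR package, and the object names are assumptions:

pcaNb <- train(x = mtcars_small[, c("mpg", "disp", "hp", "drat", "wt", "qsec")],
               y = factor(mtcars$am),
               method = "nb",
               preProcess = c("center", "scale", "pca"))
predicted <- predict(pcaNb)
table(predicted, factor(mtcars$am))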

##          
## predicted  0  1
##         0 18  0
##         1  1 13

11.6 Order of Pre-Processing

In reality, if we wanted to line up the order in which to pre-process data it would read something like the following:

  • Remove Near Zero Variance Features
  • Do Imputation (knn or median)
  • Center
  • Scale

11.7 Handling Categories

If you haven’t already, you should first read the section on “Levels Of Measurement” to reacquaint yourself with the differences between Nominal, Ordinal, Interval, and Ratio data.

In this section we’ll deal with categories and factors which represent categories (e.g. “male”,“female”,“smoker”,“non-smoker”). These variables, while useful, need to be recoded in a way to make them useful for machine learning methods. In terms of categories, we have nominal and ordinal features with the former being names or labels and the latter being the same except with some notion of order (e.g. “low”,“medium”,“high”).

11.7.1 Examples

In R, you can usually create a factor out of a feature and R will handle it correctly when applying a machine learning method. Under the hood, it turns the factors into dummy variables. Notice how the model creates variables of the type cyl6 and cyl8. Where is cyl4? Well, cyl4 is the baseline level: a record has cyl4 simply when cyl6 and cyl8 are both zero for that record.
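A sketch; mtcars_exampl (the name visible in the output below) is assumed to be a copy of mtcars with cyl converted to a factor:

mtcars_exampl <- mtcars
mtcars_exampl$cyl <- factor(mtcars_exampl$cyl)
lm(mpg ~ ., data = mtcars_exampl)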

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars_exampl)
## 
## Coefficients:
## (Intercept)         cyl6         cyl8         disp           hp         drat  
##    17.81984     -1.66031      1.63744      0.01391     -0.04613      0.02635  
##          wt         qsec           vs           am         gear         carb  
##    -3.80625      0.64696      1.74739      2.61727      0.76403      0.50935

We could have created the dummy variables ourselves, but we didn’t need to do that here. In Python, we generally would. But when we do, it’s best to have one fewer dummy variable than the number of unique values under consideration. If there are n = 3 unique values then we would want 2 dummy variables.

##    mpg cyl.4 cyl.6 cyl.8 disp  hp drat    wt  qsec vs am gear carb
## 1 21.0     0     1     0  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0     0     1     0  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8     1     0     0  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4     0     1     0  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7     0     0     1  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1     0     1     0  225 105 2.76 3.460 20.22  1  0    3    1

But here we would want to use fullRank = TRUE in the call to dummyVars to accomplish the n-1 encoding. The reason we would do this is to avoid collinearity, although in this case it isn’t a problem. We could first check the correlations.
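A sketch of the full rank encoding with caret’s dummyVars (dv_full is an assumed name):

library(caret)
dv_full <- dummyVars(mpg ~ ., data = mtcars_exampl, fullRank = TRUE)
head(data.frame(mpg = mtcars_exampl$mpg,
                predict(dv_full, newdata = mtcars_exampl),
                row.names = NULL))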

##    mpg cyl.6 cyl.8 disp  hp drat    wt  qsec vs am gear carb
## 1 21.0     1     0  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0     1     0  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8     0     0  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4     1     0  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7     0     1  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1     1     0  225 105 2.76 3.460 20.22  1  0    3    1

But again, R will do this for us just by indicating that it is a factor. But let me show you what will happen in another case with respect to the am variable if we do not use the n-1 encoding:

##    qsec vs am.0 am.1 gear
## 1 16.46  0    0    1    4
## 2 17.02  0    0    1    4
## 3 18.61  1    0    1    4
## 4 19.44  1    1    0    3
## 5 17.02  0    1    0    3
## 6 20.22  1    1    0    3

Check the correlations between am.0 and am.1. They are perfectly correlated (am.0 is simply 1 - am.1), which could cause problems in the modeling methods. We could express am.0 as a linear function of am.1. This is why using the n-1 approach helps deal with the collinearity problem.

11.7.2 Admissions Data

For example, read in the following data which relates to admissions data for students applying to an academic program. This comes from the UCLA Statistical Consulting site.

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

Let’s examine this data by determining the number of unique values assumed by each feature. This helps us understand if a variable is a category / factor variable or a continuous quantity. It appears that admit takes on only two distinct values and rank assumes only four which suggests that both might be more of a category than a quantity upon which we could perform lots of calculations.

As it relates to admit, this is a yes / no assessment and there is no inherent order even though it has been encoded as a 0 or a 1. This is something that we might want to predict with a model.

We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
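A sketch (admissions is an assumed name for the data frame):

# number of distinct values taken by each feature, printed as a one-row data frame
data.frame(lapply(admissions, function(x) length(unique(x))))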

##   admit gre gpa rank
## 1     2  26 132    4

You can always use the summary function to get a quick overview of the data. Notice here that admit is more of a binary entity although since R thinks it is just a number it will compute the percentiles for it. Same with the rest of the variables. We probably don’t want this but let’s hold off on doing anything about it for now since most newcomers to Data Analysis will make this mistake. Let’s see what happens.

##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000

So with this data, an interesting problem might be to predict whether an applicant is admitted to the program based on the other variables in the data set. We could pick Logistic Regression for this activity.

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.

Why does this bomb out? Well, the GLM method expects us to be predicting a binary outcome which, to the human eye, the admit variable actually is. However, R doesn’t know this, as it currently thinks of the admit variable as being simply a number, so it thinks we are trying to predict a numeric outcome, hence the warning. To fix this we need to make admit a factor, which is the term used in statistics to refer to categories. So let’s see the effect of doing this. While we are at it, let’s also resummarize the data to see if R treats the admit variable any differently after the conversion to factor has been made.

## 'data.frame':    400 obs. of  4 variables:
##  $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...
## 
##  Now look at the summary
##   0   1 
## 273 127

So now R understands that admit is a category so it chooses to offer a table / count summary of the data as opposed to the summary statistics it would apply for numeric data. So now let’s try again to build a predictive model for admit.
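A sketch of the model fit with caret (the object name is an assumption):

admitMod <- train(admit ~ ., data = admissions,
                  method = "glm", family = "binomial")
admitMod$results$Accuracy
summary(admitMod$finalModel)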

## [1] 0.7049714
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5802  -0.8848  -0.6382   1.1575   2.1732  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.449548   1.132846  -3.045  0.00233 ** 
## gre          0.002294   0.001092   2.101  0.03564 *  
## gpa          0.777014   0.327484   2.373  0.01766 *  
## rank        -0.560031   0.127137  -4.405 1.06e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 459.44  on 396  degrees of freedom
## AIC: 467.44
## 
## Number of Fisher Scoring iterations: 4

11.7.3 Is Rank A Category?

Okay, that was nice, and it appears that all of our predictors are significant, with rank in particular being so. However, looking at the rank variable, it appears to be ordinal rather than interval: the values 1 through 4 convey an order of prestige, but the spacing between them isn’t necessarily meaningful. Since it is more likely ordinal than interval, we should turn it into a factor.
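A sketch of converting rank and refitting:

admissions$rank <- factor(admissions$rank)
str(admissions)
admitMod2 <- train(admit ~ ., data = admissions,
                   method = "glm", family = "binomial")
admitMod2$results$Accuracy
summary(admitMod2$finalModel)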

## 'data.frame':    400 obs. of  4 variables:
##  $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
## [1] 0.6899151
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

This is interesting in that the accuracy from this model is less than the previous one, although the interpretation of rank is more nuanced here in that ranks of 3 and 4 appear to be more significant than the other rank levels. We might take this into consideration when thinking about how to transform the features or modify our prediction formula.

11.7.4 Relationship To One Hot Encoding

Usually when you indicate to R that a variable is a factor, most procedures and functions will know how to handle that when computing various statistics and building models. Other languages might not do this for you, in which case you need to employ an approach called One Hot Encoding. We can do this also in R, but in many (most) cases it is not necessary as long as you explicitly create factors as indicated above. It might be helpful to first see what a one hot encoded version of the admissions data frame might look like:

There are two things to notice here. The first is that the admit variable isn’t present since that is the value we will be predicting. The second thing to notice is that for each “level” of the rank variable there is a corresponding column. Now, there is the concept of a “rank deficient” fit wherein you will generally want one less variable than the number of categories present to avoid perfect collinearity. In this case the above encoding would look like:

So a case where all of the rank features (2, 3, and 4) are 0 corresponds to a rank of 1. In any case, here is how we would handle the situation using the One Hot Encoding approach. First, we’ll read in the data again:

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
## 'data.frame':    400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

Now, since we want to predict admit and we consider rank to be a category, we’ll turn both into factors. As mentioned, this is all that is necessary when using R, but let’s pretend that it isn’t.

## 'data.frame':    400 obs. of  4 variables:
##  $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...

So now we’ll use the dummyVars function from caret to make dummy variables out of our admissions data frame. Since we ultimately want to predict the admit feature, we’ll tell dummyVars that we don’t want it to be split across two columns.
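A sketch using dummyVars (the object names dums and admissions_oh are assumptions):

dums <- dummyVars(admit ~ ., data = admissions, fullRank = TRUE)
dums
admissions_oh <- data.frame(predict(dums, newdata = admissions))
str(admissions_oh)
head(admissions_oh)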

## Dummy Variable Object
## 
## Formula: admit ~ .
## <environment: 0x7f84292c0db0>
## 4 variables, 2 factors
## Variables and levels will be separated by '.'
## A full rank encoding is used
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'admit' is not a factor
## 'data.frame':    400 obs. of  5 variables:
##  $ gre   : num  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa   : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank.2: num  0 0 0 0 0 1 0 1 0 1 ...
##  $ rank.3: num  1 1 0 0 0 0 0 0 1 0 ...
##  $ rank.4: num  0 0 0 1 1 0 0 0 0 0 ...
##   gre  gpa rank.2 rank.3 rank.4
## 1 380 3.61      0      1      0
## 2 660 3.67      0      1      0
## 3 800 4.00      0      0      0
## 4 640 3.19      0      0      1
## 5 520 2.93      0      0      1
## 6 760 3.00      1      0      0

So we could now use this data frame in our modeling attempts as before to see if it makes any difference. We’ll also need to use an alternative calling sequence to the train function which allows us to specify the predicted column (admit) separately from the features we are using to make that prediction.
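A sketch of the x / y interface with the one hot encoded predictors:

ohMod <- train(x = admissions_oh, y = admissions$admit,
               method = "glm", family = "binomial")
ohMod$results$Accuracy
summary(ohMod$finalModel)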

## [1] 0.7024651
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank.2      -0.675443   0.316490  -2.134 0.032829 *  
## rank.3      -1.340204   0.345306  -3.881 0.000104 ***
## rank.4      -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

How does this compare to the model we built earlier where we told R that both admit and rank were factors? It’s pretty much the same.

## [1] 0.6899151
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

11.8 Binning

Sometimes we have data that spans a range of values such that it might make more sense to “bin” or collect the values into “buckets” represented by categories. This is akin to what a histogram does. Look at the PimaIndiansDiabetes data set for an example. Specifically, look at the pregnant feature in the form of a histogram. Most of those surveyed had fewer than 5 pregnancies, although some had considerably more.

We could build our models using the number of pregnancies without attempting to transform the feature in any way. But let’s bin this feature. In R, we can use the cut function to chop a numerical range of data into a specified number of bins, with break points supplied by a function such as quantile. Here that gives us four bins labelled from 1 to 4, with breaks at the 0th, 25th, 50th, 75th, and 100th percentiles.
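A sketch (preg_bins is an assumed name; PimaIndiansDiabetes comes from the mlbench package):

library(mlbench)
data(PimaIndiansDiabetes)
preg_bins <- cut(PimaIndiansDiabetes$pregnant,
                 breaks = quantile(PimaIndiansDiabetes$pregnant),
                 labels = 1:4, include.lowest = TRUE)
preg_bins
str(preg_bins)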

##   [1] 3 1 4 1 1 3 2 4 2 4 3 4 4 1 3 4 1 4 1 1 2 4 4 4 4 4 4 1 4 3 3 2 2 3 4 3 4
##  [38] 4 2 3 2 4 4 4 4 1 1 2 4 4 1 1 3 4 4 1 4 1 1 1 2 4 3 2 4 3 1 2 1 3 2 3 4 3
##  [75] 1 1 4 3 1 2 2 2 4 1 3 2 4 2 4 1 1 3 4 3 2 3 2 1 3 1 1 1 1 1 2 1 1 3 2 1 2
## [112] 4 1 3 4 3 3 3 3 3 1 3 2 3 1 1 2 1 1 1 3 4 2 4 2 2 1 1 1 3 2 3 2 4 3 1 4 2
## [149] 3 2 1 3 4 1 4 4 2 1 2 4 3 4 1 2 1 3 2 3 3 2 3 3 2 1 2 4 3 1 3 3 3 1 1 3 3
## [186] 4 4 1 4 3 2 4 4 4 4 3 1 2 3 3 1 1 1 2 3 3 4 3 1 4 2 1 4 1 4 4 3 3 3 3 1 2
## [223] 4 4 1 1 1 2 3 1 3 3 1 3 2 3 4 1 4 1 1 3 2 3 2 4 4 1 4 1 4 2 2 1 4 1 2 2 1
## [260] 4 2 2 3 2 3 3 1 2 1 2 4 2 2 1 4 2 4 1 3 2 1 4 4 4 2 4 3 1 3 3 1 1 2 1 1 3
## [297] 2 1 4 4 1 2 3 3 2 2 4 1 1 2 3 1 2 2 4 2 2 2 2 3 3 2 1 4 2 1 1 4 2 3 4 2 1
## [334] 4 1 1 1 3 4 4 1 1 1 3 4 4 1 2 2 3 3 3 2 1 2 4 1 4 4 1 3 3 3 3 3 3 3 1 2 1
## [371] 2 1 1 2 2 4 1 1 3 1 1 1 1 1 1 1 3 4 3 2 1 3 1 3 3 2 2 1 2 2 3 3 3 4 3 2 3
## [408] 1 4 1 3 1 1 1 1 2 1 3 1 2 1 2 1 2 4 3 1 1 1 1 2 2 1 2 1 1 4 3 1 3 1 2 3 4
## [445] 3 1 1 1 1 1 1 2 1 2 2 4 1 3 4 4 4 1 4 3 4 1 1 1 4 3 1 1 1 4 3 1 2 4 4 3 2
## [482] 1 3 1 1 1 1 1 3 4 2 2 3 3 2 3 3 2 4 3 2 2 3 4 2 4 1 1 2 4 4 1 4 2 2 2 4 4
## [519] 4 3 2 2 3 4 2 2 1 2 1 1 2 1 1 3 1 3 1 1 1 2 4 2 4 3 1 4 3 3 1 3 1 2 3 1 1
## [556] 4 1 4 4 4 3 1 1 3 1 2 1 3 3 1 2 2 2 2 1 1 3 2 4 2 1 3 4 4 4 1 4 3 2 1 4 2
## [593] 2 2 3 1 1 1 1 1 1 3 1 4 3 1 1 1 1 1 2 2 4 3 4 2 3 2 4 1 2 2 3 1 2 3 1 1 3
## [630] 3 4 1 2 1 4 4 3 2 4 1 1 3 3 3 2 2 1 1 4 1 1 1 3 2 1 2 2 1 4 2 4 1 4 4 3 1
## [667] 3 4 3 4 3 1 4 2 4 3 4 1 2 2 2 1 1 3 3 2 2 1 1 1 4 4 2 4 2 4 2 1 3 3 2 3 1
## [704] 2 3 3 4 2 4 2 2 3 4 1 2 4 2 4 1 3 3 1 1 3 1 3 1 1 2 2 2 4 2 2 2 3 1 4 2 1
## [741] 4 2 1 4 4 4 1 1 2 3 3 1 2 1 4 1 4 1 1 3 2 4 4 4 2 3 1 1
## Levels: 1 2 3 4
##  Factor w/ 4 levels "1","2","3","4": 3 1 4 1 1 3 2 4 2 4 ...

So we could now build a model using this newly binned information. It might not make a big difference, although it might give some insight into the extent to which the number of pregnancies influences the model. More specifically, which bin of the pregnancy variable would be influential, if at all? First, let’s create a model using the untransformed data as found in the PimaIndiansDiabetes data frame.
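A sketch of the baseline model and its variable importance (the object name is an assumption):

pimaMod <- train(diabetes ~ ., data = PimaIndiansDiabetes,
                 method = "glm", family = "binomial")
summary(pimaMod$finalModel)
varImp(pimaMod)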

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5566  -0.7274  -0.4159   0.7267   2.9297  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
## pregnant     0.1231823  0.0320776   3.840 0.000123 ***
## glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
## pressure    -0.0132955  0.0052336  -2.540 0.011072 *  
## triceps      0.0006190  0.0068994   0.090 0.928515    
## insulin     -0.0011917  0.0009012  -1.322 0.186065    
## mass         0.0897010  0.0150876   5.945 2.76e-09 ***
## pedigree     0.9451797  0.2991475   3.160 0.001580 ** 
## age          0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5
## glm variable importance
## 
##          Overall
## glucose   100.00
## mass       62.35
## pregnant   39.93
## pedigree   32.69
## pressure   26.09
## age        16.01
## insulin    13.12
## triceps     0.00

We definitely see that pregnancy is an important feature here although, again, we don’t know which sub-population might have influence, if any, on the overall model. Let’s rerun this analysis on the transformed version of the data frame.
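A sketch of the binned version (pima_binned is an assumed name):

pima_binned <- PimaIndiansDiabetes
pima_binned$pregnant <- preg_bins
pimaMod2 <- train(diabetes ~ ., data = pima_binned,
                  method = "glm", family = "binomial")
summary(pimaMod2$finalModel)
varImp(pimaMod2)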

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5051  -0.7199  -0.4201   0.7282   2.9632  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.2900923  0.7432705 -11.154  < 2e-16 ***
## pregnant2    0.2020865  0.2679622   0.754  0.45075    
## pregnant3    0.4018240  0.2722524   1.476  0.13996    
## pregnant4    1.1252206  0.2909948   3.867  0.00011 ***
## glucose      0.0349990  0.0037156   9.420  < 2e-16 ***
## pressure    -0.0130319  0.0052424  -2.486  0.01292 *  
## triceps      0.0009007  0.0069532   0.130  0.89693    
## insulin     -0.0012053  0.0009026  -1.335  0.18173    
## mass         0.0890646  0.0151221   5.890 3.87e-09 ***
## pedigree     0.9126652  0.2990451   3.052  0.00227 ** 
## age          0.0150488  0.0093797   1.604  0.10863    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 722.27  on 757  degrees of freedom
## AIC: 744.27
## 
## Number of Fisher Scoring iterations: 5
## glm variable importance
## 
##           Overall
## glucose   100.000
## mass       62.004
## pregnant4  40.229
## pedigree   31.457
## pressure   25.364
## age        15.876
## pregnant3  14.493
## insulin    12.981
## pregnant2   6.724
## triceps     0.000

This is interesting in that perhaps it’s the 4th bin of the pregnancy feature that has the major influence on the model. If we look at the distribution of positive cases, we see that more of them come from the 4th bin.