Chapter 6 Feature Importance

Another advantage of using the caret train function is that it provides a method for determining variable importance. This is useful when deciding which features to include in a model. If we summarize a given model, such as our myglm_caret model, we’ll see that some of the predictors are not significant.
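As a quick check, and assuming myglm_caret was fit with train() using method = "glm" as in the earlier chapters, we can summarize the fitted model to inspect the coefficient p-values:

# summarize the glm fit stored by caret to see which coefficients
# have large p-values (assumes myglm_caret exists from earlier)
summary(myglm_caret)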

We could use the varImp function to obtain importance scores based on statistics generated by the specific modeling process itself. For more complex modeling techniques this winds up being very useful, since digging into the model diagnostics yourself can be daunting, although quite instructive.

varImp(myglm_caret)
## glm variable importance
## 
##          Overall
## glucose  100.000
## mass      63.719
## pregnant  44.532
## pedigree  28.932
## pressure  27.731
## age        9.651
## insulin    5.656
## triceps    0.000
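If you would rather not rely on the model’s internal statistics, varImp also offers a model-free, filter-based importance calculation. A minimal sketch, again using the myglm_caret object:

# compute importance from the data alone (for classification this is
# based on per-predictor ROC curve analysis) rather than from the
# fitted model's coefficients
varImp(myglm_caret, useModel = FALSE)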

If you want to see how different models rate the importance of the predictor variables, you can easily plot them side by side.

library(gridExtra)
p1 <- plot(varImp(myglm_caret),main="varImp for glm")
p2 <- plot(varImp(myrf_caret),main="varImp for Rf")
grid.arrange(p1,p2,ncol=2)

6.0.1 Feature Elimination

The caret package also supports “recursive feature elimination” (RFE), which automates the selection of an optimal subset of features. This can be controversial, since automating the selection process can come at the expense of important statistical considerations. However, it remains a tool in the Machine Learning toolbox.

Let’s work through an example of this using caret functions. First, we’ll remove highly correlated predictor variables from consideration. We don’t really have many highly correlated variables; it turns out that “age” is correlated with “pregnant” at a level of 0.54.

# plot_correlation() comes from the DataExplorer package
library(DataExplorer)
plot_correlation(pm[,1:8], type="continuous")

# find attributes whose pairwise correlation exceeds a specified threshold
highlyCorrelated <- findCorrelation(cor(pm[,1:8]),
                                    cutoff=0.5,
                                    names = TRUE)

# print the names of the highly correlated attributes
print(highlyCorrelated)
## [1] "age"

6.0.2 The rfe Function

Let’s apply the RFE method to the Pima Indians Diabetes data set. The algorithm is configured to explore every subset size from 1 to 8 attributes. All 8 attributes are selected in this example, although the plot of performance (ROC) across the different subset sizes shows that just 4 attributes give almost comparable results.

# evaluate each subset with ROC, Sensitivity, and Specificity
# (via twoClassSummary) instead of the default Accuracy
rfFuncs$summary <- twoClassSummary

# random-forest-based selection with 4-fold cross validation
control <- rfeControl(functions=rfFuncs, 
                      method="cv", 
                      number=4)

# run the RFE algorithm
results <- rfe(pm[,1:8], 
               pm[,9], 
               sizes=c(1:8), 
               rfeControl=control,
               metric="ROC")

# summarize the results
print(results)
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (4 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables    ROC  Sens   Spec   ROCSD   SensSD  SpecSD Selected
##          1 0.7244 0.852 0.4179 0.02110 0.008000 0.01219         
##          2 0.7687 0.830 0.5224 0.01022 0.035402 0.04222         
##          3 0.7966 0.834 0.5522 0.01954 0.036000 0.05170         
##          4 0.8067 0.838 0.5746 0.02375 0.007659 0.04478         
##          5 0.8130 0.842 0.5410 0.02846 0.030898 0.06716         
##          6 0.8206 0.844 0.5784 0.03239 0.034871 0.06139         
##          7 0.8202 0.848 0.5448 0.03241 0.041825 0.04640         
##          8 0.8272 0.858 0.5821 0.03646 0.030199 0.03656        *
## 
## The top 5 variables (out of 8):
##    glucose, mass, age, pregnant, pedigree
# list the chosen features
predictors(results)
## [1] "glucose"  "mass"     "age"      "pregnant" "pedigree" "insulin"  "triceps"  "pressure"
# plot the results
plot(results, type=c("g", "o"))
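Since the plot suggests that roughly 4 predictors perform nearly as well as all 8, you could refit a model on just the top-ranked features. A hedged sketch using caret’s train (the object name final_fit and the trainControl settings here are illustrative, not part of the original analysis):

# take the 4 highest-ranked predictors from the RFE results
top4 <- predictors(results)[1:4]

# refit a random forest using only those predictors
final_fit <- train(x = pm[, top4],
                   y = pm[, 9],
                   method = "rf",
                   trControl = trainControl(method = "cv", number = 4))
final_fit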