Chapter 8 Using External ML Frameworks

A number of companies provide easy access to machine learning services, including Google, Amazon, DataRobot, and H2O.ai. In particular, H2O.ai provides frameworks that make machine learning accessible to experts and non-experts alike. The company promotes the idea of “citizen data science,” which seeks to lower barriers to participation in the world of AI. While it has a commercial product, it also provides an open source tool:

H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.

Moreover, H2O provides access to an “Auto ML” service that selects methods appropriate to a given data set. This is useful for jump-starting ideas.

H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.

Better yet, there is an R package called, somewhat unimaginatively, “h2o”, which provides:

R interface for ‘H2O’, the scalable open source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms such as Generalized Linear Models, Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (AutoML).

8.1 H2O In Action

The package must first be installed, which can be done using the install.packages function (or the Packages menu in RStudio). Loading the library is done just as you would any other library.
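A minimal installation sketch (using the package name as published on CRAN):

```r
# One-time installation of the h2o package from CRAN.
# Note that h2o also requires a Java runtime to be installed on the machine.
install.packages("h2o")
```
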

library(h2o)

The goal of using this library is not to replace the methods already available to you in R. Rather, just like the caret package, it provides a uniform interface to a variety of underlying methods, including an “Auto ML” service that picks methods for you. Let’s apply h2o to our work. The underlying h2o architecture uses a “running instance” concept that can be initialized and accessed from R. You initialize it once per interactive session.

h2o.init()

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/wh/z0v5hqgx3dzdfgz47lnbr_3w0000gn/T//RtmpRehHby/h2o_esteban_started_from_r.out
    /var/folders/wh/z0v5hqgx3dzdfgz47lnbr_3w0000gn/T//RtmpRehHby/h2o_esteban_started_from_r.err

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 577 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.26.0.2 
    H2O cluster version age:    5 months and 5 days !!! 
    H2O cluster name:           H2O_started_from_R_esteban_pgj795 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.78 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.3 (2019-03-11) 

Your H2O cluster version is too old (5 months and 5 days)!
Please download and install the latest version from http://h2o.ai/download/

Once the h2o environment has been initialized, work can begin. This takes the form of using R functions provided by the h2o package to read in data and prepare it for use with the various methods. Let’s repeat our earlier work on the Pima Indians diabetes data using h2o functions. Since the pm data frame is already available in the R environment, we can easily import it into h2o.

# Import the pm (Pima Indians diabetes) data frame into h2o
pm_h2o_df <- as.h2o(pm)

# Identify the variable to be predicted
y <- "diabetes"

# Put the predictor names into a vector
x <- setdiff(colnames(pm_h2o_df),y)

8.2 Create Some Models

Now let’s create some training and test data sets. We could do this ourselves using conventional R commands or helper functions from the caret package. However, the h2o package provides its own set of helpers.

splits <- h2o.splitFrame(pm_h2o_df, ratios=0.8, seed=1)
train_h2o <- splits[[1]]
test_h2o  <- splits[[2]]

head(train_h2o)

pregnant glucose pressure triceps insulin mass pedigree age diabetes
415        0     138       60      35     167 34.6    0.534  21      pos
463        8      74       70      40      49 35.3    0.705  39      neg
179        5     143       78       0       0 45.0    0.190  47      neg
526        3      87       60      18       0 21.8    0.444  21      neg
195        8      85       55      20       0 24.4    0.136  42      neg
118        5      78       48       0       0 33.7    0.654  25      neg

Now let’s create a model. We’ll use the Generalized Linear Model function from h2o. It is important to note that this function is implemented within h2o itself. That is, we are not in any way using any existing R packages to do this, nor anything from the caret package. Here we’ll request 4-fold cross-validation as part of the model assembly.

h2o_glm_model <- h2o.glm(y=y, x=x, training_frame=train_h2o, nfolds=4, family="binomial")
summary(h2o_glm_model)


Model Details:
==============

H2OBinomialModel: glm
Model Key:  GLM_model_R_1608222700875_6444 
GLM Model: summary
    family  link                                regularization number_of_predictors_total
1 binomial logit Elastic Net (alpha = 0.5, lambda = 4.842E-4 )                          8
  number_of_active_predictors number_of_iterations     training_frame
1                           8                    4 RTMP_sid_9938_1608

H2OBinomialMetrics: glm
** Reported on training data. **

MSE:  0.1445039
RMSE:  0.3801367
LogLoss:  0.4460641
Mean Per-Class Error:  0.221113
AUC:  0.8553703
AUCPR:  0.7687956
Gini:  0.7107406
R^2:  0.365502
Residual Deviance:  544.1982
AIC:  562.1982

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       neg pos    Error      Rate
neg    306  90 0.227273   =90/396
pos     46 168 0.214953   =46/214
Totals 352 258 0.222951  =136/610

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.322003   0.711864 198
2                       max f2  0.156607   0.809944 282
3                 max f0point5  0.557849   0.730858 129
4                 max accuracy  0.557849   0.796721 129
5                max precision  0.995359   1.000000   0
6                   max recall  0.008453   1.000000 394
7              max specificity  0.995359   1.000000   0
8             max absolute_mcc  0.322003   0.538805 198
9   max min_per_class_accuracy  0.324477   0.772727 196
10 max mean_per_class_accuracy  0.322003   0.778887 198
11                     max tns  0.995359 396.000000   0
12                     max fns  0.995359 213.000000   0
13                     max fps  0.001187 396.000000 399
14                     max tps  0.008453 214.000000 394
15                     max tnr  0.995359   1.000000   0
16                     max fnr  0.995359   0.995327   0
17                     max fpr  0.001187   1.000000 399
18                     max tpr  0.008453   1.000000 394

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

H2OBinomialMetrics: glm
** Reported on cross-validation data. **
** 4-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  0.1482055
RMSE:  0.3849747
LogLoss:  0.4565372
Mean Per-Class Error:  0.2281813
AUC:  0.8476765
AUCPR:  0.7575767
Gini:  0.6953531
R^2:  0.3492487
Residual Deviance:  556.9754
AIC:  572.9754

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
       neg pos    Error      Rate
neg    293 103 0.260101  =103/396
pos     42 172 0.196262   =42/214
Totals 335 275 0.237705  =145/610

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold      value idx
1                       max f1  0.300194   0.703476 214
2                       max f2  0.149506   0.807478 295
3                 max f0point5  0.567272   0.725995 126
4                 max accuracy  0.567272   0.793443 126
5                max precision  0.995091   1.000000   0
6                   max recall  0.006949   1.000000 395
7              max specificity  0.995091   1.000000   0
8             max absolute_mcc  0.422845   0.533300 166
9   max min_per_class_accuracy  0.334315   0.767677 199
10 max mean_per_class_accuracy  0.300194   0.771819 214
11                     max tns  0.995091 396.000000   0
12                     max fns  0.995091 213.000000   0
13                     max fps  0.001183 396.000000 399
14                     max tps  0.006949 214.000000 395
15                     max tnr  0.995091   1.000000   0
16                     max fnr  0.995091   0.995327   0
17                     max fpr  0.001183   1.000000 399
18                     max tpr  0.006949   1.000000 395

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Cross-Validation Metrics Summary: 
                mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
accuracy   0.7910583 0.033798862  0.7621951  0.8396947  0.7852349 0.77710843
auc        0.8504793 0.029461967  0.8479798 0.89006025  0.8450461 0.81883115
aucpr      0.7592555  0.06003577  0.6966476  0.8367924  0.7714088  0.7321732
err       0.20894173 0.033798862 0.23780487 0.16030534  0.2147651 0.22289157
err_count      32.25    8.057088       39.0       21.0       32.0       37.0

---
                        mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
precision         0.67893696  0.07371945  0.5974026  0.7755102  0.6818182 0.66101694
r2                0.35242754 0.066888936 0.32669345 0.45028827  0.3334273 0.29930115
recall             0.7858796  0.06506488  0.8518519  0.7916667  0.8035714  0.6964286
residual_deviance  139.24385    24.11607  149.79475  105.70603   139.6818  161.79282
rmse              0.38351524  0.01814124 0.38561666 0.35723653 0.39543307 0.39577466
specificity        0.7945068  0.06356689  0.7181818  0.8674699  0.7741935  0.8181818

Scoring History: 
            timestamp   duration iterations negative_log_likelihood objective
1 2020-12-17 16:32:16  0.000 sec          0               395.25107   0.64795
2 2020-12-17 16:32:16  0.002 sec          1               282.03640   0.46308
3 2020-12-17 16:32:16  0.002 sec          2               272.59940   0.44793
4 2020-12-17 16:32:16  0.003 sec          3               272.10447   0.44722
5 2020-12-17 16:32:16  0.004 sec          4               272.09909   0.44722

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

  variable relative_importance scaled_importance percentage
1  glucose          1.25985125        1.00000000 0.36643670
2     mass          0.78580357        0.62372726 0.22855656
3 pregnant          0.48159698        0.38226496 0.14007591
4 pressure          0.35020240        0.27797123 0.10185886
5 pedigree          0.29432303        0.23361729 0.08560595
6  insulin          0.11722238        0.09304462 0.03409496
7  triceps          0.08235523        0.06536901 0.02395361
8      age          0.06675943        0.05298993 0.01941746
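The summary above reports training and cross-validation metrics, so a natural next step is to score the held-out test frame. A minimal sketch, assuming the train/test split created earlier in this chapter:

```r
# Evaluate the fitted GLM on the held-out test frame
perf <- h2o.performance(h2o_glm_model, newdata = test_h2o)
h2o.auc(perf)              # area under the ROC curve on the test data
h2o.confusionMatrix(perf)  # confusion matrix at the F1-optimal threshold

# Generate per-row predictions (predicted class plus class probabilities)
preds <- h2o.predict(h2o_glm_model, test_h2o)
head(preds)
```
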

8.3 Saving A Model

You can save any h2o-generated model using the h2o.saveModel() function. You could extract pieces of information from the S4 object, but saving the whole model is easy to do, as is reading it back in.

model_path <- h2o.saveModel(h2o_glm_model,path=getwd(),force=TRUE)

# If you need to load a previously saved model

saved_model <- h2o.loadModel(model_path)
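Assuming the test_h2o frame from earlier in the chapter is still available, the reloaded model can be used just like the original:

```r
# The reloaded model behaves exactly like the original fitted model
h2o.predict(saved_model, test_h2o)
```
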

8.4 Using the Auto ML Feature

Are you curious as to what model might be the “best” for your data? This is a fertile field of research that keeps growing, and some feel it will one day be the dominant technology in ML: a model picking a model. It sounds odd, but that is where the field is heading. Check the current h2o AutoML documentation for more details. For now, most AutoML services use a set of heuristics to examine the data and then find the most appropriate method for building a model. The method implementations currently supported in the open source version include:

- three pre-specified XGBoost GBM (Gradient Boosting Machine) models
- a fixed grid of GLMs
- a default Random Forest (DRF)
- five pre-specified H2O GBMs
- a near-default Deep Neural Net
- an Extremely Randomized Forest (XRT)
- a random grid of XGBoost GBMs
- a random grid of H2O GBMs
- a random grid of Deep Neural Nets

8.5 Launching A Job

Of course, it all begins with specifying a performance metric such as RMSE or the area under a ROC curve. The idea here is that we specify some input, apply transformations, create a train/test pair, and then call the h2o AutoML function.

h2o_auto_pm <- h2o.automl(y = y, x = x,
                      training_frame = train_h2o,
                      leaderboard_frame = test_h2o,
                      max_runtime_secs = 60,
                      seed = 1,
                      sort_metric = "AUC",
                      project_name = "pm2")

Let’s check out the object that is returned. It is an S4 object in R, which means it has “slots” that can be accessed via the “@” operator.

slotNames(h2o_auto_pm)
[1] "project_name"   "leader"         "leaderboard"    "event_log"      "modeling_steps"
[6] "training_info" 

h2o_auto_pm@leaderboard

                                             model_id       auc   logloss     aucpr
1          GBM_grid__1_AutoML_20201217_162611_model_4 0.8043091 0.5049815 0.6544545
2          GBM_grid__1_AutoML_20201217_162823_model_4 0.8043091 0.5049815 0.6544545
3                        XRT_1_AutoML_20201217_162611 0.7987892 0.5119855 0.6036122
4                        XRT_1_AutoML_20201217_162823 0.7987892 0.5119855 0.6036122
5 DeepLearning_grid__3_AutoML_20201217_162823_model_1 0.7964744 0.5162976 0.6133439
6          GBM_grid__1_AutoML_20201217_162611_model_7 0.7962963 0.5119919 0.6594433
  mean_per_class_error      rmse       mse
1            0.2546296 0.4137757 0.1712103
2            0.2546296 0.4137757 0.1712103
3            0.2560541 0.4174395 0.1742558
4            0.2560541 0.4174395 0.1742558
5            0.2508903 0.4162580 0.1732707
6            0.2457265 0.4140577 0.1714438

[80 rows x 7 columns] 
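The top entry on the leaderboard can be pulled out via the leader slot and used like any other h2o model. A minimal sketch, assuming the test_h2o frame from earlier:

```r
# Grab the best model found by AutoML and score the held-out test frame
best_model <- h2o_auto_pm@leader
h2o.predict(best_model, test_h2o)

# The leader can also be saved for later use, just like any other h2o model
h2o.saveModel(best_model, path = getwd(), force = TRUE)
```
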

When you are finished, stop the H2O instance:

h2o.shutdown(prompt=FALSE)