Chapter 12 Using External ML Frameworks
Several companies provide easy access to machine learning services, including Google, Amazon, DataRobot, and H2O.ai. In particular, H2O.ai provides frameworks that make machine learning accessible to experts and non-experts alike. The company promotes the idea of “citizen data science,” which seeks to lower the barriers to participation in the world of AI. While it offers a commercial product, it also provides an open source tool:
H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.
Moreover, H2O provides access to an “Auto ML” service that selects methods appropriate to a given data set. This is useful to help jump start ideas.
H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.
Better yet, there is an R package called, somewhat unimaginatively, “h2o”, which provides:
R interface for ‘H2O’, the scalable open source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms such as Generalized Linear Models, Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (AutoML).
12.1 Using h2o
The package must first be installed, which can be done using the install.packages function (or the Packages menu in RStudio). Loading the library is done just as you would any other library.
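A minimal sketch of the install-and-load step (the CRAN package name is simply "h2o"):

```r
# Install once (from CRAN), then load in each session
install.packages("h2o")
library(h2o)
```

Note that h2o runs on a Java virtual machine under the hood, so a recent Java installation is required before the cluster can be started.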
The goal of using this library is not to replace the methods already available to you in R. Rather, just like the caret package, it seeks to provide a uniform interface to a variety of underlying methods, including an “Auto ML” service that picks methods for you. Let’s apply h2o to our work. The underlying h2o architecture uses a “running instance” concept that can be initialized and accessed from R. You initialize it once per interactive session.
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/var/folders/wh/z0v5hqgx3dzdfgz47lnbr_3w0000gn/T//RtmpRehHby/h2o_esteban_started_from_r.out
/var/folders/wh/z0v5hqgx3dzdfgz47lnbr_3w0000gn/T//RtmpRehHby/h2o_esteban_started_from_r.err
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 577 milliseconds
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.2
H2O cluster version age: 5 months and 5 days !!!
H2O cluster name: H2O_started_from_R_esteban_pgj795
H2O cluster total nodes: 1
H2O cluster total memory: 1.78 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.3 (2019-03-11)
Your H2O cluster version is too old (5 months and 5 days)!
Please download and install the latest version from http://h2o.ai/download/
Once the h2o environment has been initialized, work can begin. This takes the form of using R functions provided by the h2o package to read in data and prepare it for use with various methods. Let’s repeat the regression on mtcars using h2o functions. Since mtcars is already available in the R environment, we can easily import it into h2o.
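The import step can be sketched as follows; as.h2o copies an R data frame into the running H2O instance, returning an H2OFrame handle (the name mtcars_h2o_df matches the object used in the splitting code below):

```r
# Copy the local mtcars data frame into the H2O cluster as an H2OFrame
mtcars_h2o_df <- as.h2o(mtcars)
```

From this point on, operations on mtcars_h2o_df happen inside the H2O instance rather than in R itself.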
12.2 Create Some h2o Models
Now let’s create some training and test data sets. We could do this ourselves using conventional R commands or helper functions from the caret package. However, the h2o package provides its own set of helpers.
splits <- h2o.splitFrame(mtcars_h2o_df, ratios=0.8,seed=1)
train_h2o <- splits[[1]]
test_h2o <- splits[[2]]
train_h2o
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
[29 rows x 11 columns]
Now let’s create a model. We’ll use the Generalized Linear Model function from h2o. It is important to note that this function is implemented within h2o itself. That is, we are not in any way using any existing R packages to do this, nor are we using anything from the caret package. Here we’ll request 4-fold cross-validation as part of the model assembly.
y <- "mpg"                      # response variable
x <- setdiff(names(mtcars), y)  # predictors: all remaining columns
h2o_glm_model <- h2o.glm(y = y, x = x, training_frame = train_h2o, nfolds = 4)
summary(h2o_glm_model)
|===========================================================| 100%
Model Details:
==============
H2ORegressionModel: glm
Model Key: GLM_model_R_1577927955348_1
GLM Model: summary
family link regularization
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.0664 )
number_of_predictors_total number_of_active_predictors
1 10 9
number_of_iterations training_frame
1 1 RTMP_sid_bf87_673
H2ORegressionMetrics: glm
** Reported on training data. **
MSE: 6.185253
RMSE: 2.487017
MAE: 1.940791
RMSLE: 0.1135999
Mean Residual Deviance : 6.185253
R^2 : 0.8392098
Null Deviance :1115.568
Null D.o.F. :28
Residual Deviance :179.3723
Residual D.o.F. :19
AIC :157.1413
H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 4-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 9.520966
RMSE: 3.085606
MAE: 2.478209
RMSLE: 0.1462186
Mean Residual Deviance : 9.520966
R^2 : 0.7524955
Null Deviance :1194.241
Null D.o.F. :28
Residual Deviance :276.108
Residual D.o.F. :19
AIC :169.6498
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid
mae 2.4294133 0.4533141 3.3255308 2.6851656
mean_residual_deviance 9.431254 2.5729601 11.924386 13.098149
mse 9.431254 2.5729601 11.924386 13.098149
null_deviance 298.56015 73.43793 429.7452 322.68143
r2 0.7409388 0.041704282 0.6850852 0.70873845
residual_deviance 69.027 21.894964 107.31948 65.490746
rmse 2.9984744 0.46925735 3.4531705 3.6191366
rmsle 0.1363086 0.026307607 0.19733523 0.13199924
cv_3_valid cv_4_valid
mae 1.6167169 2.0902402
mean_residual_deviance 3.6748443 9.027636
mse 3.6748443 9.027636
null_deviance 139.37894 302.43506
r2 0.83919096 0.73074067
residual_deviance 22.049067 81.24872
rmse 1.9169884 3.0046024
rmsle 0.0983199 0.11758001
Scoring History:
timestamp duration iterations negative_log_likelihood
1 2020-01-01 20:21:13 0.000 sec 0 1115.56759
objective
1 38.46785
Variable Importances: (Extract with `h2o.varimp`)
=================================================
variable relative_importance scaled_importance percentage
1 wt 1.19294625 1.00000000 0.208632891
2 cyl 0.92951526 0.77917615 0.162561772
3 disp 0.78424629 0.65740287 0.137155861
4 hp 0.69294345 0.58086729 0.121188021
5 carb 0.62287613 0.52213261 0.108934035
6 am 0.55736672 0.46721864 0.097477175
7 vs 0.46246830 0.38766902 0.080880507
8 drat 0.45447201 0.38096604 0.079482047
9 gear 0.02108593 0.01767551 0.003687692
10 qsec 0.00000000 0.00000000 0.000000000
Now we can generate predictions from the model on the test set and assess its performance.
(h2o_glm_preds <- h2o.predict(h2o_glm_model,test_h2o))
h2o.performance(h2o_glm_model,test_h2o)
|===========================================================| 100%
predict
1 25.93186
2 20.67344
3 24.38237
[3 rows x 1 column]
H2ORegressionMetrics: glm
MSE: 6.762548
RMSE: 2.60049
MAE: 2.495891
RMSLE: 0.1076563
Mean Residual Deviance : 6.762548
R^2 : -2.052306
Null Deviance :10.87611
Null D.o.F. :2
Residual Deviance :20.28765
Residual D.o.F. :-7
AIC :36.24783
12.3 Saving A Model
You can save any h2o-generated model by using the h2o.saveModel() function. You could extract individual pieces of information from the S4 model object, but saving the entire model is easy to do, as is reading it back in.
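A minimal sketch of the save-and-reload round trip (the directory path here is illustrative; h2o.saveModel returns the full path to the saved model, which can then be passed to h2o.loadModel):

```r
# Save the fitted GLM to disk; the function returns the path it wrote to
model_path <- h2o.saveModel(h2o_glm_model, path = "models", force = TRUE)

# Later (or in another session with a running cluster), read it back in
restored_model <- h2o.loadModel(model_path)
```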
12.4 Using The h2o Auto ML Feature
Are you curious as to which model might be “best” for your data? This is a very fertile field of research that keeps growing, and some feel it will one day be the dominant technology in ML: a model picks a model. It sounds odd, but that is where things are heading. Check the current h2o AutoML documentation for more details. Most AutoML services use a set of heuristics to examine the data and then find the most appropriate method to build a model. The currently supported method implementations in the open source version include:
- three pre-specified XGBoost GBM (Gradient Boosting Machine) models
- a fixed grid of GLMs
- a default Random Forest (DRF)
- five pre-specified H2O GBMs
- a near-default Deep Neural Net
- an Extremely Randomized Forest (XRT)
- a random grid of XGBoost GBMs
- a random grid of H2O GBMs
- a random grid of Deep Neural Nets
12.5 Launching a Job
Of course, it all begins with specifying a performance metric such as RMSE or the area under an ROC curve. The idea is that we specify some input, apply transformations, create a train/test pair, and then call the h2o AutoML function.
h2o_auto_mtcars <- h2o.automl(y = y, x = x,
training_frame = train_h2o,
leaderboard_frame = test_h2o,
max_runtime_secs = 60,
seed = 1,
sort_metric = "RMSE",
project_name = "mtcars")
Let’s check out the object that is returned. It is an S4 object in R, which means it has “slots” that can be accessed via the “@” operator.
slotNames(h2o_auto_mtcars)
h2o_auto_mtcars@leaderboard
[1] "project_name" "leader" "leaderboard" "event_log"
[5] "training_info"
model_id
1 GBM_grid_1_AutoML_20200101_202413_model_53
2 DeepLearning_1_AutoML_20200101_202413
3 XGBoost_grid_1_AutoML_20200101_202413_model_14
4 XGBoost_grid_1_AutoML_20200101_202413_model_5
5 GBM_grid_1_AutoML_20200101_202413_model_52
6 GBM_grid_1_AutoML_20200101_202413_model_16
mean_residual_deviance rmse mse mae rmsle
1 0.2147231 0.4633822 0.2147231 0.4138704 0.02128094
2 0.3734735 0.6111248 0.3734735 0.5076945 0.02559229
3 0.4586288 0.6772213 0.4586288 0.6318582 0.03060372
4 0.4787276 0.6919014 0.4787276 0.5760670 0.03091116
5 0.7571985 0.8701716 0.7571985 0.7831136 0.03729489
6 0.7772694 0.8816289 0.7772694 0.8020415 0.03896093
[89 rows x 6 columns]
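The top-ranked model from the leaderboard is stored in the leader slot and can be used directly for prediction; a minimal sketch:

```r
# Extract the best model found by AutoML and predict on the test frame
best_model <- h2o_auto_mtcars@leader
h2o.predict(best_model, test_h2o)
```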
Finally, when you are finished working, stop the H2O instance.
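Shutting down the running cluster is a single call; passing prompt = FALSE skips the interactive confirmation:

```r
# Shut down the local H2O instance (any unsaved H2OFrames are lost)
h2o.shutdown(prompt = FALSE)
```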