Chapter 12 Using External ML Frameworks

A number of companies provide easy access to machine learning services, including Google, Amazon, DataRobot, and H2O.ai. In particular, H2O.ai provides frameworks that make machine learning accessible to both experts and non-experts. The company promotes the idea of “citizen data science,” which seeks to lower barriers to participation in the world of AI. While it has a commercial product, it also provides an open source tool:

H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.

Moreover, H2O provides access to an “AutoML” service that selects methods appropriate to a given data set. This is useful to help jump-start ideas.

H2O also has an industry leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.

Better yet, there is an R package called, somewhat unimaginatively, “h2o”, which provides:

R interface for ‘H2O’, the scalable open source machine learning platform that offers parallelized implementations of many supervised and unsupervised machine learning algorithms such as Generalized Linear Models, Gradient Boosting Machines (including XGBoost), Random Forests, Deep Neural Networks (Deep Learning), Stacked Ensembles, Naive Bayes, Cox Proportional Hazards, K-Means, PCA, Word2Vec, as well as a fully automatic machine learning algorithm (AutoML).

12.1 Using h2o

The package must first be installed, which can be done using the install.packages function (or the Packages pane in RStudio). Loading the library is done just as you would load any other library.
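A minimal sketch of that step:

install.packages("h2o")   # one-time install from CRAN
library(h2o)              # attach the package for this session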

The goal of using this library is not to replace the methods available to you in R; rather, just like the caret package, it seeks to provide a uniform interface to a variety of underlying methods, including an “AutoML” service that picks methods for you. Let’s apply h2o to our work. The underlying h2o architecture uses a “running instance” concept that can be initialized and accessed from R. You initialize it once per interactive session.
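Initialization is a single call. h2o.init starts (or connects to) a local H2O instance; called with no arguments it uses sensible defaults:

h2o.init()   # start or connect to a local H2O instance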

Once the h2o environment has been initialized, work can begin. This takes the form of using R functions provided by the h2o package to read in data and prepare it for use with the various methods. Let’s repeat the regression on mtcars using h2o functions. Since mtcars is already available in the R environment, we can easily import it into h2o.
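One way to do the import is with as.h2o, which copies an R data frame into the running instance (the name mtcars_h2o is our choice and is reused in the split below):

mtcars_h2o <- as.h2o(mtcars)   # copy the data frame into H2O
h2o.describe(mtcars_h2o)       # quick per-column summary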

12.2 Create Some h2o Models

Now let’s create some training and test data sets. We could do this ourselves using conventional R commands or helper functions from the caret package; however, the h2o package provides its own set of helpers.
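A sketch using h2o.splitFrame, h2o’s splitting helper. The 80/20 ratio and seed are illustrative choices; the names train_h2o and test_h2o match the modeling code below:

splits    <- h2o.splitFrame(mtcars_h2o, ratios = 0.8, seed = 123)  # roughly 80/20
train_h2o <- splits[[1]]   # training frame
test_h2o  <- splits[[2]]   # held-out test frame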

Now let’s create a model. We’ll use the Generalized Linear Model function from h2o. It is important to note that this function is implemented within h2o itself; that is, we are not in any way using existing R packages to do this, nor are we using anything from the caret package. Here we’ll request 4-fold cross-validation as part of the model assembly.

y <- "mpg"                       # response variable
x <- setdiff(names(mtcars), y)   # all other columns as predictors
h2o_glm_model <- h2o.glm(x = x, y = y, training_frame = train_h2o, nfolds = 4)
summary(h2o_glm_model)


 |===========================================================| 100%
Model Details:
==============

H2ORegressionModel: glm
Model Key:  GLM_model_R_1577927955348_1 
GLM Model: summary
    family     link                              regularization
1 gaussian identity Elastic Net (alpha = 0.5, lambda = 1.0664 )
  number_of_predictors_total number_of_active_predictors
1                         10                           9
  number_of_iterations    training_frame
1                    1 RTMP_sid_bf87_673

H2ORegressionMetrics: glm
** Reported on training data. **

MSE:  6.185253
RMSE:  2.487017
MAE:  1.940791
RMSLE:  0.1135999
Mean Residual Deviance :  6.185253
R^2 :  0.8392098
Null Deviance :1115.568
Null D.o.F. :28
Residual Deviance :179.3723
Residual D.o.F. :19
AIC :157.1413



H2ORegressionMetrics: glm
** Reported on cross-validation data. **
** 4-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE:  9.520966
RMSE:  3.085606
MAE:  2.478209
RMSLE:  0.1462186
Mean Residual Deviance :  9.520966
R^2 :  0.7524955
Null Deviance :1194.241
Null D.o.F. :28
Residual Deviance :276.108
Residual D.o.F. :19
AIC :169.6498


Cross-Validation Metrics Summary: 
                            mean          sd cv_1_valid cv_2_valid
mae                    2.4294133   0.4533141  3.3255308  2.6851656
mean_residual_deviance  9.431254   2.5729601  11.924386  13.098149
mse                     9.431254   2.5729601  11.924386  13.098149
null_deviance          298.56015    73.43793   429.7452  322.68143
r2                     0.7409388 0.041704282  0.6850852 0.70873845
residual_deviance         69.027   21.894964  107.31948  65.490746
rmse                   2.9984744  0.46925735  3.4531705  3.6191366
rmsle                  0.1363086 0.026307607 0.19733523 0.13199924
                       cv_3_valid cv_4_valid
mae                     1.6167169  2.0902402
mean_residual_deviance  3.6748443   9.027636
mse                     3.6748443   9.027636
null_deviance           139.37894  302.43506
r2                     0.83919096 0.73074067
residual_deviance       22.049067   81.24872
rmse                    1.9169884  3.0046024
rmsle                   0.0983199 0.11758001

Scoring History: 
            timestamp   duration iterations negative_log_likelihood
1 2020-01-01 20:21:13  0.000 sec          0              1115.56759
  objective
1  38.46785

Variable Importances: (Extract with `h2o.varimp`) 
=================================================

   variable relative_importance scaled_importance  percentage
1        wt          1.19294625        1.00000000 0.208632891
2       cyl          0.92951526        0.77917615 0.162561772
3      disp          0.78424629        0.65740287 0.137155861
4        hp          0.69294345        0.58086729 0.121188021
5      carb          0.62287613        0.52213261 0.108934035
6        am          0.55736672        0.46721864 0.097477175
7        vs          0.46246830        0.38766902 0.080880507
8      drat          0.45447201        0.38096604 0.079482047
9      gear          0.02108593        0.01767551 0.003687692
10     qsec          0.00000000        0.00000000 0.000000000

Now we can use the model object to make predictions against the test set.
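A sketch of scoring the held-out data. h2o.predict returns an H2OFrame of predictions, and h2o.performance reports the same regression metrics shown above, computed on the new data:

predictions <- h2o.predict(h2o_glm_model, newdata = test_h2o)  # predicted mpg values
h2o.performance(h2o_glm_model, newdata = test_h2o)             # test-set metrics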

12.3 Saving A Model

You can save any h2o-generated model by using the h2o.saveModel() function. You could extract pieces of information from the S4 object, but saving the whole model is easy to do - as is reading it back in.
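A minimal sketch, assuming the current working directory is an acceptable destination. The path argument names a directory; h2o.saveModel returns the full path of the saved file, which h2o.loadModel accepts directly:

model_path  <- h2o.saveModel(h2o_glm_model, path = getwd(), force = TRUE)
saved_model <- h2o.loadModel(model_path)   # read the model back in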

12.4 Using The h2o AutoML Feature

Are you curious as to what model might be the “best” for your data? This is a very fertile field of research that keeps growing, and some feel it will one day be the dominant technology in ML - where a model picks a model. It sounds odd, but that is where the field is heading. Check the current h2o AutoML documentation for more details. For now, most AutoML services use a set of heuristics to examine the data and then find the most appropriate method with which to build a model. The method implementations currently supported in the open source version include the following (a short code sketch follows the list):

  • three pre-specified XGBoost GBM (Gradient Boosting Machine) models
  • a fixed grid of GLMs
  • a default Random Forest (DRF)
  • five pre-specified H2O GBMs
  • a near-default Deep Neural Net
  • an Extremely Randomized Forest (XRT)
  • a random grid of XGBoost GBMs
  • a random grid of H2O GBMs
  • a random grid of Deep Neural Nets
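
Here is a minimal sketch of invoking AutoML on the same training frame; the max_models cap and seed are illustrative choices, not defaults:

aml <- h2o.automl(x = x, y = y, training_frame = train_h2o,
                  max_models = 10, seed = 123)
print(aml@leaderboard)     # models ranked by cross-validated performance
best_model <- aml@leader   # the top-ranked model, ready for h2o.predict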