Scalable Automatic Machine Learning: Introducing H2O’s AutoML

Prepared by: Erin LeDell, Navdeep Gill & Ray Peck

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike. The first steps toward simplifying machine learning involved developing simple, unified interfaces to a variety of machine learning algorithms (e.g. H2O).

Although H2O has made it easy for non-experts to experiment with machine learning, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular are notoriously difficult for a non-expert to tune properly. We have designed an easy-to-use interface that automates the process of training a large, diverse selection of candidate models and training a stacked ensemble on the resulting models (which often leads to an even better model). Making its debut in the latest “Preview Release” of H2O, version 3.12.0.1 (aka “Vapnik”), we introduce H2O’s AutoML for Scalable Automatic Machine Learning.

H2O’s AutoML can be used to automate a large part of the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. The user can also specify a performance-metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top-performing model in the AutoML Leaderboard.

AutoML Interface

We provide a simple function that performs a process that would typically require many lines of code. This frees up users to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.

R:

aml <- h2o.automl(x = x, y = y, training_frame = train, 
                  max_runtime_secs = 3600)

Python:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs=3600)
aml.train(x=x, y=y, training_frame=train)
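
To use the metric-based stopping criterion mentioned above instead of (or in addition to) a wall-clock limit, the run can be bounded by early stopping. Here is a minimal Python sketch, assuming the stopping_metric, stopping_tolerance, and stopping_rounds parameters behave as in H2O's other algorithms:

# Sketch: end the run once the stopping metric stops improving
# (early-stopping parameter names assumed to match H2O's usual options)
aml = H2OAutoML(max_runtime_secs=3600,
                stopping_metric="AUC",
                stopping_tolerance=0.001,
                stopping_rounds=3)
aml.train(x=x, y=y, training_frame=train)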

Flow (H2O's Web GUI):

AutoML Leaderboard

Each AutoML run returns a "Leaderboard" of models, ranked by a default performance metric. Here is an example leaderboard for a binary classification task:
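
Once a run completes, the leaderboard can also be inspected programmatically. A minimal Python sketch, assuming the leaderboard is exposed as a property of the AutoML object:

lb = aml.leaderboard  # H2OFrame of models ranked by the default metric
print(lb)             # model ids plus metrics such as AUC and logloss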

More information and full R and Python code examples are available on the H2O 3.12.0.1 AutoML docs page in the H2O User Guide.

Stacked Ensembles and Word2Vec now available in H2O!

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles


H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking or “Super Learning.” This method currently supports regression and binary classification, and multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.

H2O has previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained; however, for new projects we recommend using the native H2O version. Native support for stacking in the H2O backend brings support for ensembles to all of the H2O APIs.

Creating ensembles of H2O models is now dead simple. You simply pass a list of existing H2O model IDs to the Stacked Ensemble function and you are ready to go. This list of models can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performance. Thus, using H2O’s Random Grid Search to generate a collection of random models is a handy way of quickly generating a set of base models for the ensemble (see the sketch after the code examples below).

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)
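
As a concrete illustration of the random-grid approach described above, here is a minimal Python sketch (hyperparameter values are arbitrary; note that base models must keep their cross-validation predictions so the metalearner can be trained):

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

# Illustrative hyperparameter space for the random grid
hyper_params = {"max_depth": [3, 5, 7],
                "learn_rate": [0.01, 0.05, 0.1]}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 10, "seed": 1}

# Base models need cross-validated predictions for stacking
gbm_grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5,
                                       fold_assignment="Modulo",
                                       keep_cross_validation_predictions=True),
    hyper_params=hyper_params,
    search_criteria=search_criteria)
gbm_grid.train(x=x, y=y, training_frame=train)

# Stack the grid's models (H2OStackedEnsembleEstimator as imported above)
ensemble = H2OStackedEnsembleEstimator(base_models=gbm_grid.model_ids)
ensemble.train(x=x, y=y, training_frame=train)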

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!

Word2Vec


H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models that are used to produce word embeddings (a language modeling/feature engineering technique in natural language processing where words or phrases are mapped to vectors of real numbers). The word embeddings can subsequently be used in a machine learning model, for example, GBM. This allows users to utilize text-based data with existing H2O algorithms in a very efficient manner. An R example is available here.
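
For a sense of the workflow in Python, here is a minimal sketch (the tokenized input frame and the transform aggregation step are assumptions for illustration):

from h2o.estimators.word2vec import H2OWord2vecEstimator

# `words` is assumed to be a single-column H2OFrame of tokenized text
w2v = H2OWord2vecEstimator(vec_size=100, epochs=5)
w2v.train(training_frame=words)

# Aggregate word vectors into one embedding per document, which can
# then be fed to a supervised H2O model such as GBM
doc_vecs = w2v.transform(words, aggregate_method="AVERAGE")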

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word’s context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)$$

where $k$ is the size of the training window.
In the skip-gram model, every word $w$ is associated with two vectors, $u_w$ and $v_w$, which are vector representations of $w$ as word and context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is

$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$

where $V$ is the vocabulary size.
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be on the order of millions. To speed up the training of Word2Vec, we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log V)$.
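
To make the cost argument concrete, here is a toy numpy sketch of the full-softmax probability above; note the sum over all $V$ context scores, which is exactly what hierarchical softmax avoids (all arrays are illustrative):

import numpy as np

V, d = 1000, 50            # vocabulary size, embedding dimension
U = np.random.randn(V, d)  # "word" vectors u_w
C = np.random.randn(V, d)  # "context" vectors v_w

def softmax_prob(i, j):
    """p(w_i | w_j): dot products against ALL V rows -> O(V) per word."""
    scores = U @ C[j]          # u_l^T v_{w_j} for every l in the vocabulary
    scores -= scores.max()     # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[i]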

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.

List of New Features:

PUBDEV-2058- Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635- Ability to Select Columns for PDP computation in Flow (with this enhancement, users can select which features/columns to render Partial Dependence Plots for from Flow; R/Python already support this. Known issue PUBDEV-3782: when nbins < the number of categorical levels, the PDP won't compute. Please also see this post.)
PUBDEV-3881- Add PCA Estimator documentation to Python API Docs
PUBDEV-3902- Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739- StackedEnsemble: put ensemble creation into the back end.

List of Improvements:

PUBDEV-3989- Decrease size of h2o.jar
PUBDEV-3257- Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741- StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857- Clean up the generated Python docs
PUBDEV-3895- Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912- Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933- Modify gen_R.py for Stacked Ensemble
PUBDEV-3972- Add Stacked Ensemble code examples to Python docstrings

List of Bugs:

PUBDEV-2464- Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111- R API's h2o.interaction() does not use destination_frame argument
PUBDEV-3694- Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742- StackedEnsemble should work for regression
PUBDEV-3865- h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo
PUBDEV-3883- Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894- Relational operators don't work properly with time columns.
PUBDEV-3966- java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835- Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965- Importing data in python returns error - TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:

PUBDEV-3336- h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740- REST: implement simple ensemble generation API
PUBDEV-3843- Modify R REST API to always return binary data
PUBDEV-3844- Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864- Import files by pattern
PUBDEV-3884- StackedEnsemble: Add to online documentation
PUBDEV-3940- Add Stacked Ensemble code examples to R docs

Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html