New features in H2O 3.18

Wolpert Release (H2O 3.18)

There’s a new major release of H2O and it’s packed with new features and fixes!

We named this release after David Wolpert, who is famous for inventing Stacking (aka Stacked Ensembles). Stacking is a central component in H2O AutoML, so we’re very grateful for his contributions to machine learning! He is also famous for the “No Free Lunch” theorem, which generally states that no single algorithm will be the best in all cases. In other words, there’s no magic bullet. This is precisely why stacking is such a powerful and practical algorithm: you never know in advance whether a Deep Neural Network, a GBM or a Random Forest will be the best algorithm for your problem. When you combine all of these into a stacked ensemble, you can benefit from the strengths of each of these algorithms. You can read more about Dr. Wolpert and his work here.

Distributed XGBoost

The central feature of this release is support for distributed XGBoost, as well as other XGBoost enhancements and bug fixes. We are bringing XGBoost support to more platforms (including older versions of CentOS/Ubuntu) and we now support multi-node XGBoost training (though this feature is still in “beta”).

This release includes a number of XGBoost bug fixes: XGBoost models can now be used after they have been saved to disk and re-loaded into the H2O cluster, and several issues with the XGBoost MOJO have been fixed. With all the improvements to H2O’s XGBoost, we are much closer to adding XGBoost to AutoML, and you can expect to see that in a future release. You can read more about the H2O XGBoost integration in the XGBoost User Guide.
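As a rough illustration of the XGBoost workflow described above, here is a minimal Python sketch that trains an XGBoost model, saves it to disk and loads it back into the cluster. The function name, the hyperparameter values and the `train`, `response` and `model_dir` arguments are assumptions made for this example and do not come from the release notes. The h2o imports sit inside the function so the snippet can be loaded without a running cluster.

```python
# Illustrative hyperparameters only; tune them for your own data.
XGB_PARAMS = {"ntrees": 50, "max_depth": 6, "learn_rate": 0.1, "seed": 1}


def train_and_reload_xgboost(train, response, model_dir):
    """Train XGBoost on an H2OFrame, save the model, and load it back.

    `train` is an H2OFrame, `response` is the name of the target column,
    and `model_dir` is a directory the H2O cluster can write to.
    """
    # Imported here so the snippet loads even where h2o is not installed.
    import h2o
    from h2o.estimators.xgboost import H2OXGBoostEstimator

    # XGBoost is not available on every platform, so check first.
    if not H2OXGBoostEstimator.available():
        raise RuntimeError("XGBoost is not available on this H2O cluster")

    model = H2OXGBoostEstimator(**XGB_PARAMS)
    model.train(y=response, training_frame=train)

    # Binary save followed by a reload into the same cluster.
    saved_path = h2o.save_model(model, path=model_dir, force=True)
    return h2o.load_model(saved_path)
```

A typical call, after `h2o.init()` and loading a frame with `h2o.import_file(...)`, would look like `reloaded = train_and_reload_xgboost(train, "response", "/tmp/models")`, where the column name and path are placeholders.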

AutoML & Stacked Ensembles

One big addition to H2O Automatic Machine Learning (AutoML) is the ability to turn off certain algorithms. By default, H2O AutoML will train Gradient Boosting Machines (GBM), Random Forests (RF), Generalized Linear Models (GLM), Deep Neural Networks (DNN) and Stacked Ensembles. However, sometimes it may be useful to turn off some of those algorithms. In particular, if you have sparse, wide data, you may choose to turn off the tree-based models (GBMs and RFs). Conversely, if tree-based models perform comparatively well on your data, then you may choose to turn off GLMs and DNNs. Keep in mind that Stacked Ensembles benefit from diversity of the set of base learners, so keeping “bad” models may still improve the overall performance of the Stacked Ensembles created by the AutoML run. The new argument is called exclude_algos and you can read more about it in the AutoML User Guide.
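To make the exclude_algos argument concrete, here is a minimal Python sketch for the sparse, wide data case, where the tree-based models are turned off. The function name, the `max_models` and `seed` values and the `train` and `response` arguments are assumptions for illustration. "DRF" is the name H2O uses for its Random Forest implementation.

```python
# Algorithms to skip during the AutoML run (the tree-based models).
EXCLUDED_ALGOS = ["GBM", "DRF"]


def run_automl_without_trees(train, response, max_models=10, seed=1):
    """Run AutoML on an H2OFrame while skipping GBMs and Random Forests.

    `train` is an H2OFrame and `response` is the target column name.
    Returns the AutoML leaderboard.
    """
    # Imported here so the snippet loads even where h2o is not installed.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(
        max_models=max_models,
        seed=seed,
        exclude_algos=EXCLUDED_ALGOS,
    )
    aml.train(y=response, training_frame=train)
    return aml.leaderboard
```

To keep the tree-based models and drop the others instead, the list would be `["GLM", "DeepLearning"]`. Since Stacked Ensembles benefit from diverse base learners, it may be worth comparing the leaderboard with and without the exclusion.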

There are several improvements to the Stacked Ensemble functionality in H2O 3.18. The big new feature is the ability to fully customize the metalearning algorithm. The default metalearner (a GLM with non-negative weights) usually does pretty well; however, you are encouraged to experiment with other algorithms (such as GBM) and various hyperparameter settings. In the next major release, we will add the ability to easily perform a grid search on the hyperparameters of the metalearner algorithm using the standard H2O Grid Search functionality.
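The sketch below shows one way to swap the default metalearner for a GBM using the Python client. It assumes two base models (a GBM and a Random Forest) with illustrative settings; the function name and all hyperparameter values are made up for this example. The base models share the same cross-validation settings and keep their cross-validation predictions, because the metalearner is trained on those predictions.

```python
# Cross-validation settings shared by all base models.
CV_SETTINGS = {
    "nfolds": 5,
    "fold_assignment": "Modulo",
    "keep_cross_validation_predictions": True,
    "seed": 1,
}


def train_stack_with_gbm_metalearner(train, response):
    """Train two base models and stack them with a GBM metalearner.

    `train` is an H2OFrame and `response` is the target column name.
    """
    # Imported here so the snippet loads even where h2o is not installed.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

    gbm = H2OGradientBoostingEstimator(ntrees=50, **CV_SETTINGS)
    gbm.train(y=response, training_frame=train)

    rf = H2ORandomForestEstimator(ntrees=50, **CV_SETTINGS)
    rf.train(y=response, training_frame=train)

    # Use a GBM instead of the default non-negative GLM metalearner.
    ensemble = H2OStackedEnsembleEstimator(
        base_models=[gbm.model_id, rf.model_id],
        metalearner_algorithm="gbm",
        seed=1,
    )
    ensemble.train(y=response, training_frame=train)
    return ensemble
```

Comparing this ensemble against one built with the default metalearner on a held-out test frame is a reasonable way to check whether the change actually helps on your data.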

Highlights

Below is a list of some of the highlights from the 3.18 release. As usual, you can see a list of all the items that went into this release in the Changes.md file in the h2o-3 GitHub repository.

New Features:

  • PUBDEV-4652 – Added support for XGBoost multi-node training in H2O
  • PUBDEV-4980 – Users can now exclude certain algorithms during an AutoML run
  • PUBDEV-5086 – Stacked Ensemble now allows users to pass in a customized metalearner
  • PUBDEV-5224 – Users can now specify a seed parameter in Stacked Ensemble
  • PUBDEV-5204 – GLM: Users can now specify a list of interaction terms to include/exclude
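
For the GLM item above, here is a minimal Python sketch, assuming the feature is exposed through the `interaction_pairs` argument of the GLM estimator. The column names in `INTERACTION_PAIRS`, the function name and the Gaussian family are hypothetical choices for illustration.

```python
# Hypothetical column pairs; replace them with columns from your own frame.
INTERACTION_PAIRS = [("col_a", "col_b"), ("col_a", "col_c")]


def train_glm_with_interactions(train, response, pairs=INTERACTION_PAIRS):
    """Fit a GLM that includes only the listed pairwise interactions.

    `train` is an H2OFrame and `response` is the target column name.
    """
    # Imported here so the snippet loads even where h2o is not installed.
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    glm = H2OGeneralizedLinearEstimator(
        family="gaussian",
        interaction_pairs=list(pairs),
    )
    glm.train(y=response, training_frame=train)
    return glm
```

Listing pairs explicitly keeps the model smaller than computing every pairwise combination of a set of columns.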

Bugs:

  • PUBDEV-4585 – Fixed an issue that caused XGBoost binary save/load to fail
  • PUBDEV-4593 – Fixed an issue that caused a Levenshtein Distance Normalization Error
  • PUBDEV-5133 – In Flow, the scoring history plot is now available for GLM models
  • PUBDEV-5195 – Fixed an issue in XGBoost that caused MOJOs to fail to work without manually adding the Commons Logging dependency
  • PUBDEV-5215 – Users can now specify interactions when running GLM in Flow
  • PUBDEV-5315 – Fixed an issue that caused XGBoost OpenMP to fail on Ubuntu 14.04

Documentation:

  • PUBDEV-5311 – The H2O-3 download site now includes a link to the HTML version of the R documentation

Download here: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html