Sparkling Water 2.2.10 is now available!

Hi Makers!

There are several new features in the latest Sparkling Water. The major new addition is that Sparkling Water documentation is now published as a website, available here (this link is for Spark 2.2).

We have also documented and fixed a few issues with LDAP on Sparkling Water. Exact steps are provided in the documentation.

The bundled H2O was upgraded to 3.18.0.4; the major change it brings is ordinal regression for GLM.

The last major change included in this release is the availability of H2O AutoML and H2O Grid Search for Spark pipelines. Both are now exposed as regular Spark Estimators and can be used within your Spark pipelines. An example PySparkling AutoML script can be found here.

The full changelog can be viewed here.

Sparkling Water is already integrated with Spark 2.3 on master, and the next release will also support this latest Spark version.

Stay tuned!

New features in H2O 3.18

Wolpert Release (H2O 3.18)

There’s a new major release of H2O and it’s packed with new features and fixes!

We named this release after David Wolpert, who is famous for inventing Stacking (aka Stacked Ensembles). Stacking is a central component in H2O AutoML, so we’re very grateful for his contributions to machine learning! He is also famous for the “No Free Lunch” theorem, which, roughly stated, says that no single algorithm will be the best in all cases. In other words, there’s no magic bullet. This is precisely why stacking is such a powerful and practical algorithm: you never know in advance whether a Deep Neural Network, a GBM, or a Random Forest will be the best algorithm for your problem. When you combine all of these into a stacked ensemble, you benefit from the strengths of each of these algorithms. You can read more about Dr. Wolpert and his work here.

Distributed XGBoost

The central feature of this release is support for distributed XGBoost, as well as other XGBoost enhancements and bug fixes. We are bringing XGBoost support to more platforms (including older versions of CentOS/Ubuntu) and we now support multi-node XGBoost training (though this feature is still in “beta”).

There are a number of XGBoost bug fixes, such as the ability to use XGBoost models after they have been saved to disk and re-loaded into the H2O cluster, and fixes to the XGBoost MOJO. With all the improvements to H2O’s XGBoost, we are much closer to adding XGBoost to AutoML, and you can expect to see that in a future release. You can read more about the H2O XGBoost integration in the XGBoost User Guide.

AutoML & Stacked Ensembles

One big addition to H2O Automatic Machine Learning (AutoML) is the ability to turn off certain algorithms. By default, H2O AutoML will train Gradient Boosting Machines (GBM), Random Forests (RF), Generalized Linear Models (GLM), Deep Neural Networks (DNN) and Stacked Ensembles. However, sometimes it may be useful to turn off some of those algorithms. In particular, if you have sparse, wide data, you may choose to turn off the tree-based models (GBMs and RFs). Conversely, if tree-based models perform comparatively well on your data, then you may choose to turn off GLMs and DNNs. Keep in mind that Stacked Ensembles benefit from diversity of the set of base learners, so keeping “bad” models may still improve the overall performance of the Stacked Ensembles created by the AutoML run. The new argument is called exclude_algos and you can read more about it in the AutoML User Guide.

There are several improvements to the Stacked Ensemble functionality in H2O 3.18. The big new feature is the ability to fully customize the metalearning algorithm. The default metalearner (a GLM with non-negative weights) usually does pretty well; however, you are encouraged to experiment with other algorithms (such as GBM) and various hyperparameter settings. In the next major release, we will add the ability to easily perform a grid search over the hyperparameters of the metalearner algorithm using the standard H2O Grid Search functionality.
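
To make the metalearning step concrete, here is a toy pure-Python sketch (not H2O's implementation): it combines two base models' cross-validated predictions with a convex-combination weight chosen to minimize squared error, a crude simplification of the non-negative-weights GLM default. Customizing the metalearner means replacing exactly this combining step with, say, a GBM. All numbers below are made up.

```python
def metalearn_weight(preds_a, preds_b, y, steps=101):
    """Grid-search a convex-combination weight minimizing squared error --
    a crude stand-in for solving a non-negative least squares problem."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps):
        w = i / (steps - 1)
        err = sum((w * a + (1 - w) * b - t) ** 2
                  for a, b, t in zip(preds_a, preds_b, y))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

# Hypothetical out-of-fold (cross-validated) predictions from two base models.
preds_gbm = [0.2, 0.8, 0.4, 0.9]
preds_glm = [0.3, 0.6, 0.5, 0.7]
y_true    = [0.22, 0.76, 0.42, 0.86]

w = metalearn_weight(preds_gbm, preds_glm, y_true)
ensemble = [w * a + (1 - w) * b for a, b in zip(preds_gbm, preds_glm)]
```

Because the grid includes the pure-GBM (w = 1) and pure-GLM (w = 0) endpoints, the combined predictions can never do worse than the better base model on the training fold, which is the intuition behind stacking.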

Highlights

Below is a list of some of the highlights from the 3.18 release. As usual, you can see a list of all the items that went into this release at the Changes.md file in the h2o-3 GitHub repository.

New Features:

  • PUBDEV-4652 – Added support for XGBoost multi-node training in H2O
  • PUBDEV-4980 – Users can now exclude certain algorithms during an AutoML run
  • PUBDEV-5086 – Stacked Ensemble should allow user to pass in a customized metalearner
  • PUBDEV-5224 – Users can now specify a seed parameter in Stacked Ensemble
  • PUBDEV-5204 – GLM: Allow user to specify a list of interactions terms to include/exclude

Bugs:

  • PUBDEV-4585 – Fixed an issue that caused XGBoost binary save/load to fail
  • PUBDEV-4593 – Fixed an issue that caused a Levenshtein Distance Normalization Error
  • PUBDEV-5133 – In Flow, the scoring history plot is now available for GLM models
  • PUBDEV-5195 – Fixed an issue in XGBoost that caused MOJOs to fail to work without manually adding the Commons Logging dependency
  • PUBDEV-5215 – Users can now specify interactions when running GLM in Flow
  • PUBDEV-5315 – Fixed an issue that caused XGBoost OpenMP to fail on Ubuntu 14.04

Documentation:

  • PUBDEV-5311 – The H2O-3 download site now includes a link to the HTML version of the R documentation

Download here: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

Driverless AI Blog

In today’s market, there aren’t enough data scientists to satisfy the growing demand. With businesses moving toward automating processes across the board (everything from HR to marketing), companies are forced to compete for the best data science talent to meet their needs. A report by McKinsey, based on 2018 job market predictions, says: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” H2O’s Driverless AI addresses this gap by democratizing data science and making it accessible to non-experts, while simultaneously increasing the efficiency of expert data scientists. Its point-and-click UI minimizes the complicated legwork that precedes the actual model build.

Driverless AI is designed to take a raw dataset and run it through a proprietary algorithm that automates the data exploration and feature engineering process, which typically takes ~80% of a data scientist’s time. It then auto-tunes model parameters and provides the user with the model that yields the best results. As a result, experienced data scientists spend far less time engineering new features and can focus on drawing actionable insights from the models Driverless AI builds. Lastly, the user can see visualizations generated by the Machine Learning Interpretability (MLI) component of Driverless AI that clarify the model results and the effect of changing variables’ values. The MLI feature removes the black-box nature of machine learning models, providing clear explanations of a model’s results and of how changing feature values will alter them.

Driverless AI is also GPU-enabled, which can result in up to 40x speedups. We demonstrated these GPU-accelerated speedups for machine learning algorithms at GTC in May 2017. We’ve ported XGBoost, GLM, K-Means, and other algorithms to GPUs to achieve significant performance gains. This enables Driverless AI to run thousands of iterations to find the most accurate feature transforms and models.

The automatic nature of Driverless AI leads to increased accuracy. AutoDL engineers new features mechanically, and AutoML finds the right algorithms and tunes them to create the perfect ensemble of models. You can think of it as a Kaggle Grandmaster in a box. To demonstrate the power of Driverless AI, we entered a number of Kaggle contests; the results are below. Out of the box, Driverless AI performed nearly as well as the best Kaggle Grandmasters.

Let’s look at an example: we are going to work with a credit card dataset and predict whether or not a person is going to default on their payment next month based on a set of variables related to their payment history. After simply choosing the variable we are predicting for as well as the number of iterations we’d like to run, we launch our experiment.

As the experiment cycles through iterations, it creates a variable importance chart ranking existing and newly created features by their effect on the model’s accuracy.

In this example, AutoDL creates a feature that represents the cross validation target encoding of the variables sex and education. In other words, if we group everyone who is of the same sex and who has the same level of education in this dataset, the resulting feature would help in predicting whether or not the customer is going to default on their payment next month. Generating features like this one usually takes the majority of a data scientist’s time, but Driverless AI automates this process for the user.
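
The idea behind such a feature can be sketched in a few lines of plain Python (an illustration only, not Driverless AI's implementation; the group labels, fold scheme, and data are made up):

```python
def cv_target_encode(categories, targets, n_folds=2, prior=0.5):
    """Encode each row as the mean target of its category, computed only
    from rows outside the row's own fold to avoid label leakage."""
    n = len(categories)
    folds = [i * n_folds // n for i in range(n)]   # contiguous folds
    encoded = []
    for i, cat in enumerate(categories):
        out_of_fold = [t for j, (c, t) in enumerate(zip(categories, targets))
                       if c == cat and folds[j] != folds[i]]
        encoded.append(sum(out_of_fold) / len(out_of_fold)
                       if out_of_fold else prior)
    return encoded

# Hypothetical sex-by-education groups and default labels (1 = defaulted).
groups = ["F-grad", "M-grad", "F-grad", "M-grad", "F-grad", "M-grad"]
labels = [1, 0, 1, 0, 0, 1]
print(cv_target_encode(groups, labels))
# -> [0.0, 0.5, 0.0, 0.0, 1.0, 0.0]
```

Note that each row's encoding is computed only from rows in other folds, so a row's own label never leaks into its own feature.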

After AutoDL generates new features, we run the updated dataset through AutoML. At this point, Driverless AI builds a series of models using various algorithms and delivers a leaderboard ranking the success of each model. The user can then inspect and choose the model that best fits their needs.

Lastly, we can use the Machine Learning Interpretability feature to get clear and concise explanations of our model results. Four dynamic graphs are generated automatically: K-LIME, Variable Importance, Decision Tree Surrogate, and Partial Dependence Plot. Each one helps the user explore the model output more closely. K-LIME fits one global surrogate GLM on the entire training data, plus numerous local surrogate GLMs on samples drawn from K-Means clusters of the training data; all of the penalized GLM surrogates are trained to model the predictions of the Driverless AI model. Variable Importance measures the effect each variable has on the model’s predictions, while the Partial Dependence Plot shows the effect of changing one variable on the outcome. The Decision Tree Surrogate Model clarifies the Driverless AI model by displaying an approximate flow chart of its decision-making process, along with its most important variables and interactions. Lastly, the Explanations button gives the user a plain-English sentence about how each variable affects the model.
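
The surrogate idea behind these interpretability graphs can be illustrated with a tiny pure-Python sketch: fit a simple linear model to the black-box model's predictions (not the true labels), so the surrogate's slope approximates the black box's behavior within one cluster. Everything here, including the stand-in `black_box` function and the data, is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def black_box(x):
    """Stand-in for the complex Driverless AI model."""
    return 3.0 * x + 1.0 + (0.1 if x > 5 else -0.1)

xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0]     # rows from one K-Means cluster
preds = [black_box(x) for x in xs]      # explain predictions, not labels
slope, intercept = fit_line(xs, preds)
print(slope)   # close to 3: the surrogate recovers the local effect of x
```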

All of these graphs can be used to visualize and debug the Driverless AI model by comparing the displayed decision-process, important variables, and important interactions to known standards, domain knowledge, and reasonable expectations.

Driverless AI streamlines the machine learning workflow for inexperienced and expert users alike. For more information, click here.

Scalable Automatic Machine Learning: Introducing H2O’s AutoML

Prepared by: Erin LeDell, Navdeep Gill & Ray Peck

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts alike. The first steps toward simplifying machine learning involved developing simple, unified interfaces to a variety of machine learning algorithms (e.g. H2O).

Although H2O has made it easy for non-experts to experiment with machine learning, there is still a fair bit of data science knowledge and background required to produce high-performing models. Deep Neural Networks in particular are notoriously difficult for a non-expert to tune properly. We have designed an easy-to-use interface that automates the process of training a large, diverse selection of candidate models and then training a stacked ensemble on the resulting models (which often leads to an even better model). Making its debut in the latest “Preview Release” of H2O, version 3.12.0.1 (aka “Vapnik”), we introduce H2O’s AutoML for Scalable Automatic Machine Learning.

H2O’s AutoML can be used to automate a large part of the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. The user can also use a performance-metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top performer on the AutoML Leaderboard.

AutoML Interface

We provide a simple function that performs a process that would typically require many lines of code. This frees users up to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.

R:

library(h2o)
aml <- h2o.automl(x = x, y = y, training_frame = train,
                  max_runtime_secs = 3600)

Python:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs = 3600)
aml.train(x = x, y = y, training_frame = train)

Flow (H2O's Web GUI):

AutoML Leaderboard

Each AutoML run returns a "Leaderboard" of models, ranked by a default performance metric. Here is an example leaderboard for a binary classification task:

More information, and full R and Python code examples are available on the H2O 3.12.0.1 AutoML docs page in the H2O User Guide.