H2O + Kubeflow/Kubernetes How-To

Today, we are introducing a walkthrough on how to deploy H2O 3 on Kubeflow. Kubeflow is an open source project led by Google that sits on top of the Kubernetes engine. It is designed to alleviate some of the more tedious tasks associated with machine learning. Kubeflow helps orchestrate deployment of apps through the full cycle of development, testing, and production, while allowing for resource scaling as demand increases. H2O 3’s goal is to reduce the time spent by data scientists on time-consuming tasks like designing grid search algorithms and tuning hyperparameters, while also providing an interface that allows newer practitioners an easy foothold into the machine learning space. The integration of H2O and Kubeflow is extremely powerful, as it provides a turn-key solution for easily deployable and highly scalable machine learning applications, with minimal input required from the user.

Getting Started:

  1. Make sure to have kubectl and ksonnet installed on the machine you are using, as we will need both. Kubectl is the Kubernetes command line tool, and ksonnet is an additional command line tool that assists in managing more complex deployments. Ksonnet helps to generate Kubernetes manifests from templates that may contain several parameters and components.
  2. Launch a Kubernetes cluster. This can either be an on-prem deployment of Kubernetes or an on-cloud cluster from Google Kubernetes Engine. Minikube offers a platform for local testing and development running as a virtual machine on your laptop.
  3. Make sure to configure kubectl to work with your Kubernetes cluster.
    a. kubectl cluster-info, will tell you which cluster kubectl is configured to work on at the moment.
    b. Google Kubernetes Engine has a link in the GCP console that will provide the command for properly configuring kubectl.

    c. minikube start, will launch minikube and should automatically configure kubectl. You can verify this by running “minikube status” after launching minikube.

    4. Now we are ready to start our deployment. To begin with, we will initialize a ksonnet application by running the command “ks init <your_app_name>”.
    5. Move into the directory that was created by the previous command using “cd <your_app_name>”. You will see that it has been populated with a couple of directories, as well as files containing some default parameters. You do not need to touch these.
    6. In order to install the Kubeflow components, we add a ksonnet registry to the application. This can be done by running the commands:
    ks registry add kubeflow <location_of_the_registry>
    ks pkg install kubeflow/core
    ks pkg install kubeflow/tf-serving
    ks pkg install kubeflow/tf-job
    ks pkg install kubeflow/h2o3
    a. This will create a registry called “kubeflow” within the ksonnet application using the components found within the specified location.
    b. <location_of_the_registry> is typically a github repo. For this walkthrough, you can use this repo as it has the prebuilt components for both H2O and Kubeflow.
    c. ks pkg install <component_name> will install the components that we will reference when deploying Kubeflow and H2O.

    7. Let’s start with deploying the core Kubeflow components first:
    kubectl create namespace ${NAMESPACE}
    ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
    ks env add cloud
    ks param set kubeflow-core cloud gke --env=cloud
    ks apply ${KF_ENV} -c kubeflow-core
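    a. These commands will create a deployment of the core Kubeflow components. ${NAMESPACE} and ${KF_ENV} are shell variables that you set beforehand; ${KF_ENV} should hold the name of the ksonnet environment added above (“cloud” in this example).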
    b. Note: if you are using minikube, you may want to create an environment named “local” or “minikube” rather than “cloud”, and you can skip the “ks param set …” command.
    c. For GKE: you may need to run this command “kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@email.com” to avoid RBAC permission errors.
    8. Kubeflow is now deployed on our Kubernetes cluster. There are two options for deploying H2O on Kubeflow: through Kubeflow’s JupyterHub Notebook offering, or as a persistent server. Both options accept a docker image containing the necessary packages for running H2O.
    a. You can find the dockerfiles needed for both options here.
    b. Copy the dockerfiles to a local directory and, from that directory, run the command “docker build -t <name_for_docker_image> -f <name_of_dockerfile> .” (the trailing “.” sets the build context to the current directory).
    c. If we are deploying to the cloud, it is a good idea to push the image to a docker container registry like docker hub or google container registry.

Deploy JupyterHub Notebook:
1. The JupyterHub server comes deployed with the core Kubeflow components. Running the command “kubectl get svc -n=${NAMESPACE}” will show us a service running with the name “tf-hub-0”.

2. Use the command: “kubectl port-forward tf-hub-0 8000:8000 --namespace=${NAMESPACE}” to make the exposed port available to your local machine, and open http://localhost:8000 in your browser. Create a username and password when prompted within the browser window, and click “Start My Server”.
3. You will be prompted to designate a docker image to pull, as well as requests for CPUs, memory, and additional resources. Fill in the resource requests as preferred.
Note: We already have the notebook image (“h2o3-kf-notebook:v1”) pushed to GCR. You will want to build your own image using the dockerfiles provided and push it to GCR. The notebook image is fairly large, so it may take some time to download and start.

4. Once the notebook server has properly spawned, you will see the familiar Jupyter Notebook homepage. Create a new Python 3 notebook. The image built from the dockerfiles provided will have all the requisite plugins to run H2O.
5. A basic example of running H2O AutoML would look something like the images below. A sample of the Jupyter Notebook is available in the repo, or you can follow the example from the H2O AutoML documentation:
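
Since the screenshots are not reproduced here, the following is a minimal sketch of what such a notebook cell might contain, based on the public H2O AutoML Python API. The function names run_automl and predictor_columns, the file path, and the target column are hypothetical placeholders rather than code from the repo, and the target is assumed to be categorical (a classification problem).

```python
def predictor_columns(columns, target):
    """Return every column name except the target (pure helper, no H2O needed)."""
    return [c for c in columns if c != target]


def run_automl(train_path, target, max_models=10, seed=1):
    """Train H2O AutoML on a CSV file and return the leaderboard.

    train_path and target are placeholders for your own data.
    """
    # Imported here so that the helper above can be used without H2O installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start (or attach to) an H2O instance inside the notebook pod
    train = h2o.import_file(train_path)
    train[target] = train[target].asfactor()  # assumption: classification target

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(
        x=predictor_columns(train.columns, target),
        y=target,
        training_frame=train,
    )
    return aml.leaderboard
```

In a notebook you would call something like run_automl("<path_to_your_csv>", "<target_column>") and inspect the returned leaderboard.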

Deploy H2O 3 Persistent Server:
1. If we want to deploy H2O 3 as a persistent server, we use the prototype available within the ksonnet registry.
Run the command:
ks prototype use io.ksonnet.pkg.h2o3 h2o3 \
--name h2o3 \
--namespace kubeflow \
--model_server_image <image_name_in_container_registry>
This will create the necessary component for deploying H2O 3 as a persistent server.
2. Finally, deploy the H2O 3 component to the server using this command:
ks apply cloud -c h2o3 -n kubeflow
a. The -c flag specifies the component you wish to deploy, and the -n flag specifies that we deploy the component to the kubeflow namespace.
3. Use “kubectl get deployments” to make sure that the H2O 3 persistent server was deployed properly. “kubectl get pods” will show the name of the pod to which the server was deployed.

4. Additionally, if you run “kubectl get svc -n kubeflow” you will see a service named “h2o3” running with type “LoadBalancer”. If you wait about a minute, the external-ip will change from <pending> to a real IP address.
5. Go to a working directory where you would like to store any Jupyter Notebooks or scripts. At this point you can launch a Jupyter Notebook locally or write a Python script that runs H2O. Make sure your local version of H2O 3 is up to date. You can follow the steps here to install the newest version of H2O 3. By default, docker will build the image using the most current version of H2O 3.
a. Use the External IP address obtained from “kubectl get svc -n kubeflow” and the port 54321 in the h2o.init() command, and you will connect H2O to the cluster running in kubernetes.

b. From here, the steps are the same as in the JupyterHub Notebook above. You can follow the same example steps as are outlined in the AutoML example here.
6. Optionally, you can direct your browser to the exposed IP address with http://<your_ip>:54321. This will launch H2O Flow, which is H2O’s web server offering. H2O Flow provides a notebook-like UI with more point-and-click options than a Jupyter Notebook, which requires an understanding of Python syntax.
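
As a rough illustration of step 5a, the sketch below shows how the external IP and port 54321 might be passed to h2o.init() from a local script. The helper names h2o_server_url and connect_to_h2o are hypothetical, and the IP address in the usage note is a documentation placeholder, not a real cluster.

```python
def h2o_server_url(external_ip, port=54321):
    """Build the URL of the H2O server (also the address of H2O Flow)."""
    return "http://{}:{}".format(external_ip, port)


def connect_to_h2o(external_ip, port=54321):
    """Connect the local H2O Python client to the server running in Kubernetes."""
    # Imported here so the URL helper can be used without H2O installed.
    import h2o

    # ip and port point the client at the remote cluster instead of localhost.
    h2o.init(ip=external_ip, port=port)
    return h2o
```

For example, connect_to_h2o("203.0.113.10") would attempt to attach to a server at the address printed by h2o_server_url("203.0.113.10"). The local client version generally needs to match the server version, which is why the step above asks you to keep your local H2O 3 up to date.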

This walkthrough provides a small window into a high-potential, ongoing project. Currently, the deployment of H2O.ai’s enterprise product Driverless AI on Kubeflow is in progress. At the moment, it is deployable in a similar fashion to the H2O 3 persistent server, and beta work on this can be found within the github repo. Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment.

Please feel free to contact me with any questions via email or LinkedIn.
All files are available here: https://github.com/h2oai/h2o-kubeflow.

Makers in Action: Community, Partners and Team Members at #GTC18

NVIDIA’s GPU Technology Conference (GTC) has been incredible! Folks from all over the world are exploring the latest breakthroughs in self-driving cars, smart cities, healthcare, high performance computing, virtual reality, and more, all propelled by the AI movement.

If you’re attending GTC and would like to see our solutions in action (recently named a leader by Gartner) or chat with one of the H2O.ai Makers, come visit us at booth #725.

We’re excited to be joined at #GTC18 by a number of community members, partners, and H2O.ai team members including:

Venkatesh Ramanathan (Software Engineer, PayPal) and Ajay Gopal (Chief Data Scientist, Deserve) joined speakers from KickView and Memorial Sloan Kettering Cancer Center on the “Deep Learning Institute Executive Workshop” panel.

Arpit Mehta (Data Scientist, Product Owner: Big Data Architectures, BMW Group) presented “Now I C U: Analyzing Data Flow Inside an Autonomous Driving Car” and “Beyond Autonomous Driving: Unleashing Value via Machine Learning Applications in Automotive Industry”.

Supermicro (Booth #111) – Supermicro provides customers around the world with application-optimized server, workstation, blade, storage and GPU systems. Stop by their booth for a demo of Driverless AI. Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment.

MapD (Booth #602) – MapD’s mission is to redefine the limits of scale and speed in big data analytics. Along with us, NVIDIA, and other innovative companies, they are also one of the founding GPU Open Analytics Initiative (GOAI) members. Enjoy a recent post (“How VW Predicts Churn with GPU-Accelerated Machine Learning and Visual Analytics”) by Wamsi Viswanath (Data Scientist, MapD) in which he shares how H2O was used in concert with MapD.

Dell EMC (Booth #815) – Dell EMC enables digital transformation with trusted solutions for the modern data center.

Team Members
– Wednesday, March 28th, 9am – Ashrith Barthur (Security Scientist) presented “Network Security with Machine Learning”.
– Thursday, March 29th, 11am, Room 211A – Jon McKinney (Director of Research) will be presenting “World’s Fastest Machine Learning With GPUs”.
– Thursday, March 29th, 4pm, LL21C – Arno Candel (CTO) will host the “Hands-on with Driverless AI” workshop.

Enjoy GTC!


Director of Community

H2O4GPU now available in R

In September, H2O.ai released a new open source software project for GPU machine learning called H2O4GPU. The initial release (blog post here) included a Python module with a scikit-learn compatible API, which allows it to be used as a drop-in replacement for scikit-learn with support for GPUs on selected (and ever-growing) algorithms. We are proud to announce that the same collection of GPU algorithms is now available in R, and the h2o4gpu R package is available on CRAN.

The R package makes use of RStudio’s reticulate R package for facilitating access to Python libraries through R. Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability and was originally created by RStudio in an effort to bring the TensorFlow Python library into R.

This is exciting news for the R community, as h2o4gpu is the first machine learning package that brings together a diverse collection of supervised and unsupervised GPU-powered algorithms in a unified interface. The initial collection of algorithms includes:

  • Random Forest, Gradient Boosting Machine (GBM), Generalized Linear Models (GLM) with Elastic Net regularization
  • K-Means, Principal Component Analysis (PCA), Truncated SVD

h2o4gpu has a functional interface. This is different from many modeling packages in R (including the h2o package); however, functional interfaces are becoming increasingly popular in the R ecosystem.

Here’s an example of how to specify a Random Forest classification model with a non-default value for the max_depth parameter.

model <- h2o4gpu.random_forest_classifier(max_depth = 10L)

To train the model, you simply pipe the model object to a fit() function which takes the training data as arguments. Once the model is trained, we pipe the model to the predict() function to generate predictions.

Here is a quick demo of how to train, predict and evaluate an H2O4GPU model using the Iris dataset.

Detailed installation instructions and a comprehensive tutorial are available in the package vignette, so we encourage you to visit the vignette to get started.

H2O4GPU is a new project under active development and we are looking for contributors! If you find a bug, please check that we have not already fixed the issue in the bleeding edge version and then check that we do not already have an issue opened for this topic. If not, then please file a new GitHub issue with a reproducible example. 🙏

  • Here is the main GitHub repo. If you like the package, please 🌟 the repo on GitHub!
  • If you’re looking to contribute, check out the CONTRIBUTING.md file.
  • All open issues that are specific to the R package are here.
  • All open issues are here.

Thanks for checking out our new package!

-- Navdeep Gill, Erin LeDell, and Yuan Tang

Come meet the Makers!

NVIDIA’s GPU Technology Conference (GTC) Silicon Valley, March 26-29th, is the premier AI and deep learning event, providing you with training, insights, and direct access to the industry’s best and brightest. It’s where you will see the latest breakthroughs in self-driving cars, smart cities, healthcare, high-performance computing, virtual reality and more, all because of the power of AI. H2O.ai will be there in full force to share how you can immediately gain value and insights from our industry-leading AI and ML platforms. In case you hadn’t heard, H2O.ai was named a leader in the 2018 Gartner Magic Quadrant for Data Science and Machine Learning Platforms. You can get the report here.

Please visit us at booth #725 to see Driverless AI in action and talk to the Makers leading the AI movement! Our sessions will be leading edge talks that you won’t want to miss.

  1. Ashrith Barthur – Network Security with Machine Learning

    Ashrith will speak about modeling different kinds of cyber attacks and building a model that is able to identify these different kinds of attacks using machine learning.

    Room 210F – Wednesday, 28 March, 9 AM to 9:50 AM.

  2. Jonathan McKinney – World’s Fastest Machine Learning with GPUs

    Jonathan will introduce H2O4GPU, a fully featured machine learning library that is optimized for GPUs with a robust Python API that is a drop-in replacement for scikit-learn. He will demonstrate benchmarks for the most common algorithms relevant to enterprise AI and will showcase performance gains as compared to running on CPUs.

    Room 220B – Thursday, March 29, 11 AM to 11:50 AM.

  3. Arno Candel – Hands-on with Driverless AI

    In this lab, Arno will show how to install and start Driverless AI, the automated Kaggle Grandmaster in-a-box software, on a multi-GPU box. He will go through the full end-to-end workflow and showcase how Driverless AI uses the power of GPUs to achieve 40x speedups on algorithms, which in turn allow it to run thousands of iterations and find the best model.

    Room LL21C – Thursday, March 29, 4 PM to 6 PM.

Can’t make it to the event? Schedule a time to talk to one of our makers!

How Driverless AI Prevents Overfitting and Leakage

By Marios Michailidis, Competitive Data Scientist, H2O.ai

In this post, I’ll provide an overview of overfitting, k-fold cross-validation, and leakage. I’ll also explain how Driverless AI avoids overfitting and leakage.

An Introduction to Overfitting

A common pitfall that causes machine learning models to fail when tested in a real-world environment is overfitting. In Driverless AI, we take special measures in various stages of the modeling process to ensure that overfitting is prevented and that all models produced are generalizable and give consistent results on test data.

Before highlighting the different measures used to prevent overfitting within Driverless AI, we should first define overfitting. According to Oxford Dictionaries, it is the production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. Overfitting is often explained in connection with two other terms, namely bias and variance. Consider the following diagram that shows how the training and test errors typically “move” throughout the modeling process.

When an algorithm starts learning from the data, any relationships found are (still) basic and simple. You can see in the top left of the chart that both errors, training (teal) and test (red), are quite high because the model has not yet uncovered much information about the data and makes basic/biased (and quite erroneous) assumptions. These basic/simple assumptions cause high errors due to (high) bias. At this point, the model is not sensitive to fluctuations in the training data because the logic of the model is still very simple. This low sensitivity to fluctuations is often referred to as having low variance. This whole initial stage can be described as underfitting, a state in the modeling process that causes errors because the model is still very basic and does not live up to its potential. This state is not deemed as serious as overfitting because it sits at the beginning of the modeling process, and the modeler generally (often greedily) tries to maximize performance and ends up overdoing it.

As the model keeps learning, both the training error and the test error decrease. That is because the relationships found in the training data are still significant and generalizable. However, after a while, even though the training error keeps decreasing at a fast pace, the test error does not follow. The algorithm has exhausted all significant information within the training data and starts modeling noise. This state, where the training error keeps decreasing but the test error gets worse (increases), is called overfitting: the model is learning more than it should and is sensitive to tiny fluctuations in the training data, hence it is characterized by high variance. Ideally, the learning process needs to stop at the point where the red line is at its lowest or optimal point, somewhere in the middle of the graph. How quickly or slowly this point comes depends on various factors.
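
To make the idea of the optimal stopping point concrete, here is a minimal sketch that picks the iteration with the lowest test error. The error values are made up purely for illustration and the function name best_stopping_point is hypothetical; this is not how Driverless AI implements early stopping.

```python
def best_stopping_point(test_errors):
    """Return (iteration, error) for the iteration with the lowest test error."""
    best_iter = min(range(len(test_errors)), key=test_errors.__getitem__)
    return best_iter, test_errors[best_iter]


# Illustrative (made-up) error curves: the training error keeps falling,
# while the test error falls at first and then rises again.
train_errors = [0.50, 0.35, 0.25, 0.18, 0.12, 0.08, 0.05]
test_errors = [0.52, 0.40, 0.31, 0.27, 0.29, 0.33, 0.38]

stop_at, lowest_test_error = best_stopping_point(test_errors)
print("stop after iteration", stop_at, "with test error", lowest_test_error)
```

Iterations before the selected point correspond to underfitting, and iterations after it correspond to overfitting in the sense described above.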

In order to get the most out of your predictions, it becomes imperative to control overfitting and underfitting. Since Driverless AI thrives on making very predictive models, it has built-in measures to control overfitting and underfitting. These methods will be analyzed in the following sections.

This section includes both existing and upcoming features.

Check similarity of train and test datasets
This step is conditional upon a test dataset being provided, which is not always the case. If a test dataset is provided during the training procedure, various checks take place to ensure that the training set is consistent with the test set. This happens both at a univariate level and at a global level. At a univariate level, the distribution of each variable in the training data is checked against the equivalent one in the test data. If a big difference is detected, the column is omitted and a warning message is displayed.

To illustrate the problem, imagine having a variable that measures the age in years for some customers. If the distribution for train and test data resembles the one below, even a very basic model would start to overfit very quickly.

This happens because the model will assume that a given customer will be between 40 and 85 years old. However, that model is tasked with making predictions for customers who are much younger. This model would most probably fail to generalize well on the test data, and the more the model relies on age, the worse the performance will be on the test data.

To detect this problem, Driverless AI fits models to determine whether certain values of some features tend to appear more in the train data or in the test data. If such relationships are found and deemed significant, the feature is omitted and/or a warning message is displayed to alert the user about this discrepancy.
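
As a simple illustration of a univariate check, the sketch below compares the train and test distributions of the age variable using a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions). This statistic is a stand-in chosen for illustration; the post does not specify which test Driverless AI applies, and the data and the 0.5 threshold are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Evaluate both empirical CDFs at every observed value and keep the largest gap.
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in a + b
    )


# Made-up data mirroring the example: train customers are 40+, test customers are younger.
train_age = [42, 47, 55, 61, 68, 73, 80, 85]
test_age = [18, 21, 25, 29, 33, 36, 41, 44]

gap = ks_statistic(train_age, test_age)
SHIFT_THRESHOLD = 0.5  # hypothetical cut-off for raising a warning
if gap > SHIFT_THRESHOLD:
    print("warning: 'age' is distributed differently in train and test, gap =", gap)
```

A gap close to 0 means the two distributions look alike, and a gap close to 1 means they barely overlap, as in this example.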

K-Fold Cross-Validation

Depending on the accuracy and speed settings selected in Driverless AI before the experiment is run, the training data will be split into multiple train and validation pairs and Driverless AI will try to find that optimal point (mentioned in the top section) for the validation data in order to stop the learning process. This method can be illustrated below for K=4.

This process gets repeated 4 times (if K=4), and K models are run until every part has been used for validation exactly once. The same model parameters are applied to all these runs, and an overall (average) error is estimated over all parts’ predictions to ensure that any model built with such parameters (features, hyperparameters, model types) will generalize well to any (unseen) test dataset.
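
The splitting scheme described above can be sketched as follows. This is a plain, unshuffled k-fold over row indices written for illustration; the function name k_fold_indices is hypothetical, and Driverless AI's actual splitting logic may differ (for example, by shuffling or stratifying).

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs so that each row is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_rows=10, k=4))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

One model would be trained on each train part and scored on the matching validation part, and the K validation errors would then be averaged to give the overall estimate.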

More specifically, based on the average error on all these validation predictions, Driverless AI will determine:

  • When to stop the learning process. Driverless AI primarily uses XGBoost, which requires various iterations to reach the optimal point, and the multiple validation datasets help find the best generalizable point that works well in all validation parts.

  • Which hyperparameter values to change/tune. XGBoost has a long list of parameters which control different elements in the modeling process. For instance, based on performance in the validation data, the best learning rate, maximum depth, leaf size, and growing policy are found and help achieve a better generalization error in test data.

  • Which features or feature transformations to use. Driverless AI’s strong performance comes from its feature engineering pipeline. Typically thousands of features will be explored during the training process, but only a minority of them will be useful eventually. The performance on validation data once again defines which of the generated features are worth keeping and which need to be discarded.

  • Which models to ensemble and which weights to use. Driverless AI uses an exhaustive process to find the best linear weights to combine multiple different models (trained with different parameters) and get better results on unseen data.

Check for IID

A common pitfall that causes overfitting is when a dataset contains temporal elements and the observations (samples) are not Independent and Identically Distributed (IID). With temporal data, a random split can let the model use future data to predict the past rather than the other way around, which produces misleadingly good validation results.

Driverless AI can determine the presence of non-IID data easily if a date field or equivalent is provided. In this instance, it will assess how strong the temporal elements are and whether the target variable’s values change significantly for different periods in the date field. It uses correlation, serial correlation checks and variable binning to determine if and how strong these temporal elements are. It should be noted that this check is extended to the ordering of the dataset too. Sometimes the data itself may be ordered in a way that makes a randomly formed validation unhelpful. For example, the data may be sorted by date/time even though that field is not provided.

If non-IID data is detected, Driverless AI typically switches to a time series based validation mode to get more consistent errors when predicting the test (future) data. The difference from the k-fold scheme (mentioned above) is that the data is split based on date/time, and all train/validation pairs are formed so that train is in the past and validation is in the future. The types of features Driverless AI generates also differ from the IID case. This process also identifies the main entities that will be used for both validation and feature engineering. The entities are the group-by categorical features, such as stores or types of products, that make predictions better if used, for example sales by store and product in the last 30 days.
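
A time-based validation scheme of the kind described above can be sketched as an expanding window over rows sorted by time. The function name time_based_splits and the day numbers are hypothetical, and this is only one possible way to form such pairs; it assumes there are at least n_splits + 1 rows.

```python
def time_based_splits(timestamps, n_splits):
    """Yield (train_idx, valid_idx) pairs where train never comes after validation."""
    # Row positions ordered from oldest to newest.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    block = len(order) // (n_splits + 1)
    for s in range(1, n_splits + 1):
        train = order[: s * block]                   # everything up to the cut-off
        valid = order[s * block : (s + 1) * block]   # the block right after it
        yield train, valid


# Made-up day numbers for eight rows, deliberately not stored in time order.
days = [5, 1, 3, 2, 8, 7, 4, 6]
splits = list(time_based_splits(days, n_splits=3))
for train_idx, valid_idx in splits:
    print("train days:", [days[i] for i in train_idx],
          "valid days:", [days[i] for i in valid_idx])
```

Unlike the random k-fold scheme, each validation block here only contains rows that are later in time than the rows used for training.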

As a summary of this subsection, this check is put in place to ensure that Driverless AI performs feature engineering, selects parameters and ensembles models in a way that best resembles reality when strong temporal elements are detected.

Avoiding Leakage in Feature Engineering

Leakage can be defined as the creation of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions. It has various types and causes. Driverless AI is particularly careful with a type of leakage that can arise when implementing target (mean or likelihood) encoding. On many occasions, these types of features tend to be the most predictive inputs, especially when high cardinality categorical variables (that is, with many distinct values) are present in the data. However, over-reliance on these features can cause a model to overfit quickly. These features may be created differently in different domains: for example, in banking they take the form of weights of evidence, and in time series they are the past mean values of the target variable given certain lags.

A common cause of target encoding failure is predicting an observation/entry whose own target value was included when computing the average value for its category. For example, let’s assume there is a dataset that contains a categorical variable called profession, and various job titles are listed like ‘doctor’, ‘teacher’, ‘policeman’, etc. Among all of these entries, there is only one entry for ‘entrepreneur’, and that entry has an income of $3,000,000,000. When estimating the average income for all ‘entrepreneurs’, it will be $3,000,000,000, as there is only one ‘entrepreneur’. If a new variable is created that measures the average income of a profession, while trying to predict income, it will create the connection that the average salary of an entrepreneur is $3,000,000,000. Given how big this value is, it is likely to make a model over-rely on this connection and predict high values for all entrepreneurs.

This can be referred to as target leakage because it uses the target value directly as a feature. There are various ways to mitigate the impact of this type of leakage, like estimating averages only when there is a significant number of cases for one category. Ideally though, the average values should be computed from data that does not include the target value of the entry being encoded. A way to achieve this is to have another holdout dataset that is used to estimate the mean target values for certain categories and then apply these to the train and validation data. In other words, have a third dataset dedicated to target encoding.

The latter approach suffers from the fact that the model will then need to be built with significantly less data, as some part is surrendered to estimate the averages per category. To counter this, Driverless AI uses CCV (Cross-Cross-Validation), that is, a cross-validation inside a cross-validation. After a train and validation pair has been determined, only the train part undergoes another k-fold procedure, where K-1 parts are used to estimate average target values for the selected categories and those values are applied to the Kth part, until the whole train dataset has its mean (target) values estimated in K batches. The same encodings can be applied to the outer validation dataset by taking an average of all the K folds’ mean values.
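
The inner loop of this idea, often called out-of-fold target encoding, can be sketched as follows. The sketch covers only the encoding of the train part (not the outer validation step), assigns folds by row position, and falls back to the out-of-fold mean when a category is unseen; these choices, the function name, and the data are illustrative assumptions rather than Driverless AI's implementation.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, k):
    """Encode each row with its category's mean target computed on the OTHER folds.

    Assumes k >= 2 and at least k rows, so every fold has out-of-fold data.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(k):
        sums, counts = defaultdict(float), defaultdict(int)
        total, n_out = 0.0, 0
        for i in range(n):
            if i % k != fold:  # rows outside the current fold
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                n_out += 1
        fallback = total / n_out  # used when a category is unseen out-of-fold
        for i in range(n):
            if i % k == fold:  # rows inside the current fold get encoded
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


professions = ["doctor", "teacher", "doctor", "entrepreneur", "doctor", "teacher"]
incomes = [200.0, 60.0, 220.0, 3_000_000_000.0, 180.0, 50.0]  # made-up values

encoded = out_of_fold_target_encode(professions, incomes, k=3)
print(encoded)
```

Because the single entrepreneur's own income is never used to encode that row, the encoded value no longer reveals the $3,000,000,000 target directly.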

The same type of leakage can also be found in metamodeling. For more information on this topic, enjoy this video, in which Kaggle Grandmaster, Mathias Müller, discusses this and other types of leakage, including how they get created and prevented.

With regard to other types of leakage, Driverless AI will display warning messages if some features are strongly correlated with the target, but it typically does not take any additional action by itself unless the correlation is perfect.

Other means to avoid overfitting
There are various other mechanisms, tools or approaches in Driverless AI that help prevent overfitting:

  • Bagging (or Randomized Averaging): When the time comes for final predictions, Driverless AI will run multiple models with slightly different parameters and aggregate the results. This ensures that the final prediction, which is based on multiple models, is not too attached to the original data, precisely because of this introduced randomness and the production of (to some extent) uncorrelated models. In other words, bagging has the ability to reduce the variance without changing the bias.
  • Dimensionality reduction: Although Driverless AI generates many features, it will also employ techniques such as SVD or PCA to limit the features’ expansion when encountering high cardinality categorical features and to simplify the modeling process.

  • Feature pruning: Driverless AI can get a measure of how important a feature is to the model. Features that tend to add very little to the prediction are likely to just be noise and are therefore discarded.
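
The variance-reducing effect of bagging mentioned in the list above can be illustrated with a small simulation. Each "model" here is just the true value plus independent random noise, which is a strong simplifying assumption (real models trained on the same data are only partly uncorrelated), and all names and numbers are made up.

```python
import random
import statistics

TRUE_VALUE = 10.0


def noisy_model_prediction(rng, noise=2.0):
    """One hypothetical model: unbiased, but with a noisy prediction."""
    return TRUE_VALUE + rng.gauss(0.0, noise)


def bagged_prediction(rng, n_models):
    """Average the predictions of several independent noisy models."""
    return statistics.mean(noisy_model_prediction(rng) for _ in range(n_models))


rng = random.Random(0)
single = [noisy_model_prediction(rng) for _ in range(500)]
bagged = [bagged_prediction(rng, n_models=20) for _ in range(500)]

single_var = statistics.variance(single)
bagged_var = statistics.variance(bagged)
print("variance of single models:", single_var)
print("variance of bagged models:", bagged_var)
print("mean of bagged predictions:", statistics.mean(bagged))
```

The averaged predictions stay centred on the same value (the bias is unchanged) while fluctuating much less than the individual models.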

Examples of Consistency

It is no secret that we like to test drive Driverless AI in competitive environments such as Kaggle, Analytics Vidhya, and CrowdANALYTIX to know how it fares compared to some of the best data scientists in the world. However, we are not only interested in accuracy, which is a product of how much time you allow Driverless AI to run and can be configured in the beginning, but also the consistency between validation performance and performance in the test data, in other words the ability to avoid overfitting.

Here is a list of results from public sources that show Driverless AI’s validation performance and performance on the test data drawn with various combinations of accuracy and speed settings. Some of the results may contain combinations of multiple Driverless AI outputs and a metamodeling layer on top of them:

| Name | Metric | Validation Results | Test Results | Final Rank | Source |
| --- | --- | --- | --- | --- | --- |
| Analytics Vidhya - Churn Prediction | auc | 0.689 (rank 14/250) | 0.677 (rank 8/250) | 8/250 | LinkedIn blog |
| Analytics Vidhya - Churn Prediction | auc | 0.69104 (rank 8/250) | 0.67844 (rank 5/250) | 5/250 | Competition website |
| BNP Paribas Cardif Claims Management | log loss | 0.43573 (rank 23/2926) | 0.43316 (rank 18/2926) | | |
| Predicting How Points End in Tennis | log loss | 0.1774 (rank 3/207) | 0.19051 (rank 6/207) | | Competition website |
| Predicting How Points End in Tennis | log loss | 0.1769 (rank 2/207) | 0.19062 (rank 8/207) | 8/207 | Competition website |
| BNP Paribas Cardif Claims Management | log loss | 0.44196 (rank 52/2926) | 0.44137 (rank 60/2926) (fast settings) | | |
| Amazon.com - Employee Access Challenge | auc | 0.91165 (rank 65/1687) | 0.90933 (rank 79/1687) (fast settings) | | |
| New York City Taxi Trip Duration | RMSLE | 0.31017 (rank 11/1257) | 0.31181 (rank 11/1257) | | |
| Analytics Vidhya - McKinsey Analytics | auc | 0.85932 (rank 17/503) | 0.85456 (rank 6/503) | 6/503 | Competition website |

Sparkling Water 2.2.10 is now available!

Hi Makers!

There are several new features in the latest Sparkling Water. The major new addition is that we now publish Sparkling Water documentation as a website which is available here. This link is for Spark 2.2.

We have also documented and fixed a few issues with LDAP on Sparkling Water. Exact steps are provided in the documentation.

The bundled H2O was upgraded, which brings ordinal regression for GLM as the major change.

The last major change included in this release is the availability of the H2O AutoML and H2O Grid Search transformations for Spark pipelines. They are now exposed as regular Spark Estimators and can be used within your Spark pipelines. An example PySparkling AutoML script can be found here.

The full changelog can be viewed here.

Sparkling Water is already integrated with Spark 2.3 on master, and the next release will also be available for this latest Spark version.

Stay tuned!

Congratulations – H2O is a leader in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Congratulations! Thanks to the support of our customer community over the past years, H2O.ai is a leader and one with the most completeness of vision in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms. It is an ecosystem we dedicated a good part of this decade to open up and spring. This is a testimony to the incredibly community-centric maker culture of team H2O and to our relentless support of our customers with beautiful, intelligent products. Our partnership with NVIDIA and IBM helped bring GPUs to machine learning this past year. Our work with Azure, AWS and Google Cloud makes it easy to try, train and deploy AI. Automation of AI pipelines with AI in Driverless AI will help maximize extremely scarce data science talent and bring it to many more enterprises. We will make it cheaper, faster and easier to experiment and build AI products. This is a fast-moving AI space with tectonic shifts and very high product innovation from great players, and even we are only getting started. We seek your partnership to further transform your problems and verticals with AI and to build solutions together.

From the first to our latest investors; our amazing team members, past, present, new and future, and their supportive families; our community of data scientists who attended the first and recent meetups to spread the word; believers and our customers who backed our vision and execution: each and every one of you is part of this incredibly fun journey. Thank you. Gratitude is the word that comes to mind. Your support inspires us to do great things, in the pursuit of magic! (and magic quadrants) 🙂

this will be fun, Sri

New features in H2O 3.18

Wolpert Release (H2O 3.18)

There’s a new major release of H2O and it’s packed with new features and fixes!

We named this release after David Wolpert, who is famous for inventing Stacking (aka Stacked Ensembles). Stacking is a central component in H2O AutoML, so we’re very grateful for his contributions to machine learning! He is also famous for the “No Free Lunch” theorem, which generally states that no single algorithm will be the best in all cases. In other words, there’s no magic bullet. This is precisely why stacking is such a powerful and practical algorithm — you never know in advance if a Deep Neural Network, or GBM or Random Forest will be the best algorithm for your problem. When you combine all of these together into a stacked ensemble, you are guaranteed to benefit from the strengths of each of these algorithms. You can read more about Dr. Wolpert and his work here.

Distributed XGBoost

The central feature of this release is support for distributed XGBoost, as well as other XGBoost enhancements and bug fixes. We are bringing XGBoost support to more platforms (including older versions of CentOS/Ubuntu) and we now support multi-node XGBoost training (though this feature is still in “beta”).

There are a number of XGBoost bug fixes, such as the ability to use XGBoost models after they have been saved to disk and re-loaded into the H2O cluster, and fixes to the XGBoost MOJO. With all the improvements to H2O’s XGBoost, we are much closer to adding XGBoost to AutoML, and you can expect to see that in a future release. You can read more about the H2O XGBoost integration in the XGBoost User Guide.

AutoML & Stacked Ensembles

One big addition to H2O Automatic Machine Learning (AutoML) is the ability to turn off certain algorithms. By default, H2O AutoML will train Gradient Boosting Machines (GBM), Random Forests (RF), Generalized Linear Models (GLM), Deep Neural Networks (DNN) and Stacked Ensembles. However, sometimes it may be useful to turn off some of those algorithms. In particular, if you have sparse, wide data, you may choose to turn off the tree-based models (GBMs and RFs). Conversely, if tree-based models perform comparatively well on your data, then you may choose to turn off GLMs and DNNs. Keep in mind that Stacked Ensembles benefit from diversity of the set of base learners, so keeping “bad” models may still improve the overall performance of the Stacked Ensembles created by the AutoML run. The new argument is called exclude_algos and you can read more about it in the AutoML User Guide.
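
As a rough illustration, the exclude_algos argument could be used from Python as sketched below. The helper names `choose_excluded_algos` and `run_automl`, the file path, and the response column are hypothetical. The algorithm names ("GBM", "DRF", "GLM", "DeepLearning") follow those used by the H2O AutoML documentation, and the H2O calls need a working h2o installation.

```python
def choose_excluded_algos(sparse_wide):
    """Hypothetical helper that encodes the rule of thumb from the text."""
    if sparse_wide:
        # Sparse, wide data: turn off the tree-based models (GBM and Random Forest)
        return ["GBM", "DRF"]
    # Tree-based models do comparatively well: turn off GLMs and deep nets
    return ["GLM", "DeepLearning"]


def run_automl(train_path, response, excluded, max_runtime_secs=60):
    """Run H2O AutoML without the excluded algorithms and return the leaderboard."""
    # Imported lazily so the helper above can be used without h2o installed
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(train_path)
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs,
                    exclude_algos=excluded,
                    seed=1)
    aml.train(y=response, training_frame=train)
    return aml.leaderboard
```

A call might look like `run_automl("train.csv", "response", choose_excluded_algos(sparse_wide=True))`. Stacked Ensembles are left enabled here so that the remaining base learners can still be combined.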

There are several improvements to the Stacked Ensemble functionality in H2O 3.18. The big new feature is the ability to fully customize the metalearning algorithm. The default metalearner (a GLM with non-negative weights) usually does pretty well; however, you are encouraged to experiment with other algorithms (such as GBM) and various hyperparameter settings. In the next major release, we will add the ability to easily perform a grid search on the hyperparameters of the metalearner algorithm using the standard H2O Grid Search functionality.
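
To make the idea of a metalearner with non-negative weights concrete, here is a toy sketch in plain Python. It is not H2O's implementation: it fits a linear combination of base-model predictions by projected gradient descent on squared error, clipping negative weights to zero. The function name, the data, and the optimizer settings are all made up for illustration. In real stacking, the inputs would be cross-validated predictions from the base models.

```python
def fit_nonnegative_weights(level_one, targets, steps=2000, lr=0.01):
    """Fit non-negative combination weights by projected gradient descent.

    level_one: one row per observation, one base-model prediction per column.
    targets:   the true response for each observation.
    """
    n_rows = len(targets)
    n_models = len(level_one[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for row, target in zip(level_one, targets):
            error = sum(w * p for w, p in zip(weights, row)) - target
            for j, p in enumerate(row):
                grad[j] += 2.0 * error * p / n_rows
        # Gradient step, then project back onto the non-negative orthant
        weights = [max(0.0, w - lr * g) for w, g in zip(weights, grad)]
    return weights


targets = [1.0, 2.0, 3.0, 4.0]
# Column 0: a good base model; column 1: a model that is anti-correlated with the target
level_one = [[1.1, 4.0], [1.9, 3.0], [3.2, 2.0], [3.9, 1.0]]

weights = fit_nonnegative_weights(level_one, targets)
print(weights)
```

With this data, an unconstrained least-squares fit would give the second model a small negative weight; the non-negativity constraint drives that weight to zero instead, so the ensemble relies on the good model alone.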


Below is a list of some of the highlights from the 3.18 release. As usual, you can see a list of all the items that went into this release at the Changes.md file in the h2o-3 GitHub repository.

New Features:

  • PUBDEV-4652 – Added support for XGBoost multi-node training in H2O
  • PUBDEV-4980 – Users can now exclude certain algorithms during an AutoML run
  • PUBDEV-5086 – Stacked Ensemble should allow user to pass in a customized metalearner
  • PUBDEV-5224 – Users can now specify a seed parameter in Stacked Ensemble
  • PUBDEV-5204 – GLM: Allow user to specify a list of interaction terms to include/exclude


  • PUBDEV-4585 – Fixed an issue that caused XGBoost binary save/load to fail
  • PUBDEV-4593 – Fixed an issue that caused a Levenshtein Distance Normalization Error
  • PUBDEV-5133 – In Flow, the scoring history plot is now available for GLM models
  • PUBDEV-5195 – Fixed an issue in XGBoost that caused MOJOs to fail to work without manually adding the Commons Logging dependency
  • PUBDEV-5215 – Users can now specify interactions when running GLM in Flow
  • PUBDEV-5315 – Fixed an issue that caused XGBoost OpenMP to fail on Ubuntu 14.04


  • PUBDEV-5311 – The H2O-3 download site now includes a link to the HTML version of the R documentation

Download here: http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

Developing and Operationalizing H2O.ai Models with Azure

This post originally appeared here. It was authored by Daisy Deng, Software Engineer, and Abhinav Mithal, Senior Engineering Manager, at Microsoft.

The focus on machine learning and artificial intelligence has soared over the past few years, even as fast, scalable and reliable ML and AI solutions are increasingly viewed as being vital to business success. H2O.ai has lately been gaining fame in the AI world for its fast in-memory ML algorithms and for easy consumption in production. H2O.ai is designed to provide a fast, scalable, and open source ML platform and it recently added support for deep learning as well. There are many ways to run H2O.ai on Azure. This post provides an overview of how to efficiently develop and operationalize H2O.ai ML models on Azure.

H2O.ai can be deployed in many ways, including on a single node, on a multi-node cluster, in a Hadoop cluster, and in an Apache Spark cluster. H2O.ai is written in Java, so it naturally supports Java APIs. Since the standard Scala backend is a Java VM, H2O.ai also supports the Scala API. It also has rich interfaces for Python and R. The h2o R and h2o Python packages respectively help R and Python users access H2O.ai algorithms and functionality. The R and Python scripts that use the h2o library interact with the H2O clusters using REST API calls.

With the rising popularity of Apache Spark, Sparkling Water was developed to combine H2O functionality with Apache Spark. Sparkling Water provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster. A typical way of using the two together is to do data munging in Apache Spark while running training and scoring in H2O. Apache Spark has built-in support for Python through PySpark, and pysparkling provides bindings between Spark and H2O to run Sparkling Water applications in Python. Sparklyr provides the R interface to Spark, and rsparkling provides bindings between Spark and H2O to run Sparkling Water applications in R.

Table 1 and Figure 1 below show more information about how to run Sparkling Water applications on Spark from R and Python.

Model Development

Data Science Virtual Machine (DSVM) is a great tool with which you can start developing ML models in a single-node environment. H2O.ai comes preinstalled for Python on DSVM. If you use R (on Ubuntu), you can follow the script in our earlier blog post to set up your environment. If you are dealing with large datasets, you may consider using a cluster for development. Below are the two recommended choices for cluster-based development.

Azure HDInsight offers fully-managed clusters that come in many handy configurations. Azure HDInsight allows users to create Spark clusters with H2O.ai with all the dependencies pre-installed. Python users can experiment with it by following the Jupyter notebook examples that come with the cluster. R users can follow our previous post to set up the environment to use RStudio for development. Once you have finished developing and training your model, you can save it for scoring. H2O allows you to save the trained model as a MOJO file. A JAR file, h2o-genmodel.jar, is also generated when the model is saved. This JAR file is needed when you want to load your trained model in Java or Scala code, while Python and R code can directly load the trained model using the H2O API.
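
In Python, the MOJO and the accompanying h2o-genmodel.jar can be fetched with the trained model's `download_mojo` method. The following is a minimal sketch, assuming `model` is an already trained H2O model; the function name `export_for_scoring` and the default output directory are hypothetical.

```python
import os


def export_for_scoring(model, out_dir="scoring_artifacts"):
    """Write the MOJO file and h2o-genmodel.jar for a trained H2O model."""
    os.makedirs(out_dir, exist_ok=True)
    # get_genmodel_jar=True also downloads h2o-genmodel.jar into out_dir
    return model.download_mojo(path=out_dir, get_genmodel_jar=True)
```

The returned path points at the saved MOJO, which a Java or Scala scoring service can then load together with the JAR file.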

If you are looking for low-cost clusters, you can use the Azure Distributed Data Engineering Toolkit (AZTK) to start a Docker-based Spark cluster on top of Azure Batch with low-priority VMs. The cluster created through AZTK is accessible for use in development through SSH or Jupyter notebooks. Compared to Jupyter notebooks on Azure HDInsight clusters, the Jupyter notebook on an AZTK cluster is rudimentary and does not come pre-configured for H2O.ai model development. Users also need to save their development work to external durable storage, because once the AZTK Spark cluster is torn down it cannot be restored.

Table 2 shows a summary of using the three environments for model development.

Batch Scoring and Model Retraining

Batch scoring is also referred to as offline scoring. It usually deals with significant amounts of data and may require a lot of processing time. Retraining deals with model drift, where the model no longer accurately captures the patterns in newer datasets. Batch scoring and model retraining are both considered batch processing, and they can be operationalized in a similar fashion.
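
A retraining job is often triggered by a simple drift check. The sketch below flags drift when a model's AUC on recent data drops noticeably below the AUC measured at training time. The function name, the use of AUC, and the tolerance of 0.02 are illustrative assumptions; an appropriate metric and threshold depend on the use case.

```python
def needs_retraining(baseline_auc, recent_auc, tolerance=0.02):
    """Return True when AUC on recent data falls more than `tolerance` below baseline."""
    return (baseline_auc - recent_auc) > tolerance


# Hypothetical weekly AUC values measured on newly labeled data
baseline = 0.85
weekly_auc = [0.85, 0.84, 0.84, 0.80]
flags = [needs_retraining(baseline, auc) for auc in weekly_auc]
print(flags)
```

Only the last week falls outside the tolerance here, so only that week would trigger a retraining job.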

If you have many parallel tasks, each of which can be handled by a single VM, Azure Batch is a great tool to handle this type of workload. Azure Batch Shipyard provides code-free job configuration and creation on Azure Batch with Docker containers. We can easily include Apache Spark and H2O.ai in the Docker image and use them with Azure Batch Shipyard. In Azure Batch Shipyard, each model retraining or batch scoring run can be configured as a task. This type of job, consisting of several separate tasks, is also known as an “embarrassingly parallel” workload, which is fundamentally different from distributed computing, where communication between tasks is required to complete a job. Interested readers can read more in this wiki.
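
The shape of an embarrassingly parallel job can be sketched locally with Python's standard library. This is only a stand-in: a thread pool plays the role of the Azure Batch pool, each call to the hypothetical `score_partition` function plays the role of one task, and the "model" is a made-up linear formula. The point is that the tasks never communicate with each other.

```python
from concurrent.futures import ThreadPoolExecutor


def score_partition(partition):
    """Stand-in for one batch-scoring task; it depends only on its own input."""
    return [round(0.5 * x + 1.0, 3) for x in partition]


# Each partition would be a separate file or data slice in a real job
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in the same order as the input partitions
    results = list(pool.map(score_partition, partitions))

print(results)
```

Because no task depends on another, adding more workers (or more VMs in a batch pool) shortens the job without any change to the task code.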

If the batch processing job needs a cluster for distributed processing, for example, if the amount of data is large or it’s more cost-effective to use a cluster, you can use AZTK to create a Docker-based Spark cluster. H2O.ai can be easily included in the Docker image, and the process of cluster creation, job submission, and cluster deletion can be automated and triggered by the Azure Function App. However, with this method, users need to configure the cluster and manage container images. If you want a fully-managed cluster with detailed monitoring, an Azure HDInsight cluster is a better choice. Currently we can use the Azure Data Factory Spark Activity to submit batch jobs to the cluster. However, it requires having an HDInsight cluster running all the time, so it’s mostly relevant in use cases with frequent batch processing.

Table 3 shows a comparison of the three ways of running batch processing in Spark where H2O.ai can be easily integrated in each computing environment.

Online Scoring

Online scoring means scoring with a small response time, so it is also referred to as real-time scoring. In general, online scoring deals with a single-point prediction or mini-batch predictions and should use pre-computed cached features when possible. We can load the ML models and the relevant libraries and run scoring in any application. If a microservice architecture is preferred to separate concerns and decouple dependencies, it is recommended to implement online scoring as a web service with a REST API. The web services for scoring with the H2O ML model are usually written in Java, Scala or Python. As we mentioned in the Model Development section, the saved H2O model is in the MOJO format and, together with the model, the h2o-genmodel.jar file is generated. While web services written in Java or Scala can use this JAR file to load the saved model for scoring, web services written in Python can directly call the Python API to load the saved model.
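
A scoring web service of this kind can be sketched with Python's standard library alone. Everything specific here is an assumption made for illustration: the JSON payload shape (`{"rows": [[...], ...]}`), the port, and above all the `predict` function, which is a made-up logistic formula standing in for a loaded H2O model.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHTS = [0.8, -0.4]  # hypothetical coefficients of the stand-in model


def predict(features):
    """Stand-in for a real model: logistic score of a weighted sum."""
    z = sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(body):
    """Turn a raw JSON request body into (HTTP status, response payload)."""
    try:
        rows = json.loads(body)["rows"]
        return 200, {"predictions": [predict(row) for row in rows]}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected JSON of the form {\"rows\": [[...], ...]}"}


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the service; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the service, and a client can then POST a small batch of rows and receive one prediction per row. Keeping the request handling in `handle_request` separates the scoring logic from the HTTP plumbing, which makes it easy to move the same logic into a container or a managed web app.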

Azure provides many choices to host web services.

Azure Web App is an Azure PaaS offering to host web applications. It provides a fully-managed platform which allows users to focus on their application. Recently, Azure Web App Service for Containers, built on Azure Web App on Linux, was released to host containerized web applications. Azure Container Service with Kubernetes (AKS) provides an effortless way to create, configure and manage a cluster of VMs to run containerized applications. Both Azure Web App Service for Containers and Azure Container Service provide great portability and run-environment customization for web applications. Azure Machine Learning (AML) Model Management CLI/API provides an even simpler way to deploy and manage web services on ACS with Kubernetes. We have listed below a comparison of the three Azure services for hosting online scoring in Table 4.

Edge Scoring

Edge scoring means executing scoring on internet-of-things (IoT) devices. With edge scoring, the devices perform analytics and make intelligent decisions once the data is collected, without having to send the data to a central processing center. Edge scoring is important in use cases where data privacy requirements are high or the desired scoring latency is very low.

Enabled by container technology, Azure Machine Learning, together with Azure IoT Edge, provides easy ways to deploy machine learning models to Azure IoT Edge devices. With AML containers, using H2O.ai on the edge takes minimal effort. Check out our recent blog post titled Artificial Intelligence and Machine Learning on the Cutting Edge for more details on how to enable edge intelligence.


In this post, we discussed a developer’s journey for building and deploying H2O.ai-based solutions with Azure services, and covered model development, model retraining, batch scoring and online scoring together with edge scoring. Our AI development journey in this post focused on H2O.ai. However, these learnings are not specific just to H2O.ai and can be applied just as easily to any Spark-based solutions. As more and more frameworks such as TensorFlow and Microsoft Cognitive Toolkit (CNTK) have been enabled to run on Spark, we believe these learnings will become more valuable. Understanding the right product choices based on business and technical needs is fundamental to the success of any project, and we hope the information in this post proves to be useful in your project.

Daisy & Abhinav

Happy Holidays from H2O.ai

Dear Community,

Your intelligence, support and love have been the strength behind an incredible year of growth, product innovation, partnerships, investments and customer wins for H2O and AI in 2017. Thank you for answering our rallying call to democratize AI with our maker culture.

Our mission to make AI ubiquitous is still fresh as dawn and our creativity new as spring. We are only getting started, learning, rising from each fall. H2O and Driverless AI are just the beginnings.

As we look into 2018, we see prolific innovation to make AI accessible to everyone. Simplicity that opens scale. Our focus on making experiments faster, easier and cheaper. We are so happy that you will be the center of our journey. We look forward to delivering many more magical customer experiences.

On behalf of the team and management at H2O, I wish you all a wonderful holiday: deep meaningful time spent with yourself and your loved ones and to come back refreshed for a winning 2018!

Gratitude for your partnership in our beautiful journey – it’s just begun!

this will be fun,

Sri Ambati
CEO & Co-Founder

P.S. #H2OWorld was an amazing experience. I invite you to watch the keynote and more than 40 talks and conversations.