H2O-3 on FfDL: Bringing deep learning and machine learning closer together

This post originally appeared in the IBM Developer blog here.

This post is co-authored by Animesh Singh, Nicholas Png, Tommy Li, and Vinod Iyengar.

Deep learning frameworks like TensorFlow, PyTorch, Caffe, MXNet, and Chainer have reduced the effort and skills needed to train and use deep learning models. But for AI developers and data scientists, it’s still a challenge to set up and use these frameworks in a consistent manner for distributed model training and serving.

The open source Fabric for Deep Learning (FfDL) project provides a consistent way for AI developers and data scientists to use deep learning as a service on Kubernetes and to use Jupyter notebooks to execute distributed deep learning training for models written in any of these frameworks.

Now, FfDL is announcing a new addition that brings together that deep learning training capability with state-of-the-art machine learning methods.

Augment deep learning with best-of-breed machine learning capabilities

For anyone who wants to try machine learning algorithms with FfDL, we are excited to introduce H2O-3 as the newest member of the FfDL stack. H2O-3 is H2O.ai’s open source, in-memory, distributed, and scalable machine learning and predictive analytics platform, which enables you to build machine learning models on big data. It offers an expansive library of algorithms, such as Distributed Random Forests, XGBoost, and Stacked Ensembles, as well as AutoML, a powerful tool for users with less experience in data science and machine learning.

After data cleansing, or “munging,” one of the most fundamental parts of training a powerful and predictive model is properly tuning the model. For example, deep neural networks are notoriously difficult for a non-expert to tune properly. This is where AutoML becomes an extremely valuable tool. It provides an intuitive interface that automates the process of training a large number of candidate models and selecting the highest performing model based on the user’s preferred scoring method.
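To make the selection step concrete, here is a minimal, purely conceptual sketch in Python of what "train many candidates and keep the best one by a scoring method" means. This is not the H2O AutoML API; the candidate models (constant mean and median predictors) and the MAE scorer are toy stand-ins chosen only to illustrate the idea that AutoML automates at scale.

```python
# Conceptual sketch of automated model selection (NOT the H2O AutoML API):
# fit each candidate on training data, score it on a holdout set, and
# keep the best-scoring model.
def automl_select(candidates, train, holdout, score):
    """Fit each candidate on `train`, score on `holdout`, and return
    the best (model, score) pair (lower score is better)."""
    best = None
    for make_model in candidates:
        model = make_model(train)
        s = score(model, holdout)
        if best is None or s < best[1]:
            best = (model, s)
    return best

# Toy candidates: constant predictors using the mean and the median.
def mean_model(train):
    mu = sum(train) / len(train)
    return lambda: mu

def median_model(train):
    mid = sorted(train)[len(train) // 2]
    return lambda: mid

def mae(model, holdout):
    """Mean absolute error of a constant predictor on the holdout set."""
    return sum(abs(model() - y) for y in holdout) / len(holdout)

# The outlier (100) skews the mean, so the median predictor wins here.
model, err = automl_select([mean_model, median_model],
                           [1, 2, 3, 100], [2, 3, 4], mae)
```

A real AutoML system searches over far richer model families and hyperparameters, but the control loop is the same: candidates in, scored leaderboard out.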

In combination with FfDL, H2O-3 makes data science highly accessible to users of all levels of experience. You can simply deploy FfDL to your Kubernetes cluster and submit a training job to FfDL. Behind the scenes, FfDL sets up the H2O-3 environment, runs your training job, and streams the training logs for you to monitor and debug your model. Since FfDL also supports multi-node clusters with H2O-3, you can horizontally scale your H2O-3 training job seamlessly on all your Kubernetes nodes. When model training is complete, you can save your model locally to FfDL or to a cloud object store, where it can be obtained later for serving inference.

Try H2O on FfDL today!

You can find the details on how to train H2O models on FfDL in the open source FfDL readme file and guide. Deploy, use, and extend them with any of the capabilities that you find helpful. We’re waiting for your feedback and pull requests!

How to Frame Your Business Problem for Automatic Machine Learning

Over the last several years, machine learning has become an integral part of many organizations’ decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai has developed a product called Driverless AI that automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, and model explanation. In this post, I will describe Driverless AI, explain how you can properly frame your business problem to get the most out of this automatic machine learning product, and show how automatic machine learning is used to create business value.

What is Driverless AI and what kind of business problems does it solve?

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI is currently targeting business applications like loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)

How do you frame business problems in a data set for Driverless AI?

The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment, or financial transaction. That row must also contain information about what you will be trying to predict using similar data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon described by the provided data set. Driverless AI can handle simple data quality problems, but it currently requires all data for a single predictive model to be in the same data set and that data set must have already undergone standard ETL, cleaning, and normalization routines before being loaded into Driverless AI.

How do you use Driverless AI results to create commercial value?

Commercial value is generated by Driverless AI in a few ways.

● Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using automation and state-of-the-art computing power to accomplish tasks in just minutes or hours that can take humans months.

● Like in many other industries, automation leads to standardization of business processes, enforces best practices, and eventually drives down the cost of delivering the final product – in this case a predictive model.

● Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process. In large organizations, value from predictive modeling is typically realized when a predictive model is moved from a data analysts’ or data scientists’ development environment into a production deployment setting where the model is running on live data, making decisions quickly and automatically that make or save money. Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.

Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

Customer success stories with Driverless AI

PayPal tried Driverless AI on a collusion fraud use case and found that, running for just 2 hours on a laptop, Driverless AI yielded impressive fraud detection accuracy; running on GPU-enhanced hardware, it produced the same accuracy in just 20 minutes. The Driverless AI model was more accurate than PayPal’s existing predictive model, and the system found the same insights in their data that their data scientists did! It also found new features in their data that had not been used before for predictive modeling. For more information about the PayPal use case, click here.

G5, a real estate marketing optimization firm, uses Driverless AI in their Intelligent Marketing Cloud to assist clients in targeted marketing spending for property management. Empowered by Driverless AI technology, marketers can quickly prioritize and convert highly qualified inbound leads from G5’s Intelligent Marketing Cloud platform with 95 percent accuracy for serious purchase intent. To learn more about how G5 uses Driverless AI check out:

How can you try Driverless AI?

Visit: https://www.h2o.ai/driverless-ai/ and download your free 21-day evaluation copy.

We are happy to help you get started installing and using Driverless AI, and here are some resources we’ve put together to help you in that process:

● Installing Driverless AI: https://www.youtube.com/watch?v=swrqej9tFcU

● Launching an Experiment with Driverless AI: https://www.youtube.com/watch?v=bw6CbZu0dKk

● Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf

● Driverless AI Documentation: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html

Time is Money! Automate Your Time-Series Forecasts with Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-world applications, such as sales, weather, stock market, and energy demand forecasting, to name a few. We strongly believe that automation can help our users deliver business value in a timely manner. Therefore, once again we translated our Kaggle Grand Masters’ time-series recipes into our automatic machine learning platform, Driverless AI (version 1.2). This blog post introduces the new time-series functionality with a simple sales forecasting example.

The key features/recipes that make automation possible are:

  • Automatic handling of time groups (e.g. different stores and departments)
  • Robust time-series validation
            – Accounts for gaps and forecast horizon
            – Uses past information only (i.e. no data leakage)
  • Time-series specific feature engineering recipes
            – Date features like day of week, day of month etc.
            – AutoRegressive features like optimal lag and lag-features interaction
            – Different types of exponentially weighted moving averages
            – Aggregation of past information (different time groups and time intervals)
            – Target transformations and differentiation
  • Integration with existing feature engineering functions (recipes and optimization)
  • Automatic pipelines generation (see this blog post)
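Two of the recipes above, lag features and exponentially weighted moving averages, can be sketched by hand in a few lines of Python. This is only an illustration of what the features are; Driverless AI generates, combines, and tunes such features automatically.

```python
# Hand-rolled sketches of two time-series recipes: lag features and an
# exponentially weighted moving average (EWMA).
def lag_feature(series, lag):
    """Shift the series back by `lag` steps; the first `lag` positions
    have no history, so they are None."""
    return [None] * lag + series[:-lag]

def ewma(series, alpha):
    """EWMA with smoothing factor alpha in (0, 1]: each value blends the
    new observation with the running average."""
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

sales = [10, 12, 13, 15, 14]
lag1 = lag_feature(sales, 1)   # -> [None, 10, 12, 13, 15]
smooth = ewma(sales, 0.5)      # -> [10, 11.0, 12.0, 13.5, 13.75]
```

In a real pipeline the lag length, the smoothing factor, and the grouping (per store, per department) would all be searched over automatically.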

A Typical Example: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data:

Data formulated for machine learning:
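Since the raw and formulated tables appear as images in the original post, here is a hypothetical Python sketch of the same reformulation: weekly sales records per (store, department) become rows with date-derived features plus last week's sales, and the current week's sales as the target. The store, department, and sales values are made up for illustration.

```python
# Hypothetical reformulation of raw weekly sales data into supervised
# learning rows: date features + a lag feature, plus a target column.
import datetime

raw = [
    ("Store1", "Dept1", "2012-02-03", 24924.50),
    ("Store1", "Dept1", "2012-02-10", 46039.49),
    ("Store1", "Dept1", "2012-02-17", 41595.55),
]

def formulate(records):
    rows = []
    prev_sales = None  # assumes records are for one (store, dept) group, in date order
    for store, dept, date_str, sales in records:
        d = datetime.date.fromisoformat(date_str)
        rows.append({
            "store": store, "dept": dept,
            "month": d.month,
            "week_of_year": d.isocalendar()[1],
            "lag1_sales": prev_sales,   # last week's sales as a feature
            "target": sales,            # what we want to predict
        })
        prev_sales = sales
    return rows

ml_rows = formulate(raw)
```

Driverless AI performs this kind of formulation for you, including handling multiple time groups at once.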

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant (new feature in version 1.2) will guide you through the journey.

Similar to previous Driverless AI examples, users need to select the dataset for training/test and define the target. For time-series, users need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (like the Walmart Kaggle competition), users can select the column with specific weights for different samples.

If users prefer to use automatic handling of time groups, they can leave the setting for time groups columns as AUTO.

Expert users can define specific time groups and change other settings as shown below.

Once the experiment is finished, users can make new predictions and download the scoring pipeline just like with any other Driverless AI experiment.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,

Bonus fact: The masterminds behind our time-series recipes are Marios Michailidis and Mathias Müller, so internally we call this feature AutoM&M.

H2O.ai and IBM build a Strategic Partnership to bring AI innovation to the market together

We are excited to announce our strategic partnership with IBM, which allows IBM to resell and take H2O Driverless AI to market for businesses worldwide. This partnership makes AI economical: faster, cheaper, and easier to experiment with. H2O Driverless AI and IBM POWER9 GPU systems bring together best-of-breed AI innovation. We have been working with IBM to port Driverless AI, our latest product that addresses the skills gap in data science and trust in AI, to IBM Power Systems. The combination delivers 5X performance for workloads including the new time-series capability in Driverless AI. Please check out the blog by Sumit Gupta, VP of AI, Machine Learning and HPC at IBM Cognitive Systems, for more details on the partnership.

Outstanding performance on IBM POWER9

HPC is the new PC. To handle the increasingly complex workloads of AI, you need an integrated system of software and hardware that is fully optimized for each other. H2O Driverless AI on IBM POWER9 delivers precisely that. IBM POWER9 supports nearly 2.6x more RAM and 9.5x more I/O bandwidth than comparable systems, and it can support up to 6 V100 GPUs on a single system. Driverless AI is built on top of datatable for Python for data ingest and feature engineering, and H2O4GPU for machine learning. We’ve been able to get nearly 2X the data ingest speed and over 50% faster feature engineering. In addition, with the power of GPU-accelerated machine learning, we’re able to deliver nearly 30X speedup on model building. Overall, we’ve been able to accelerate Driverless AI by up to 10X for IID data and up to 5X for time-series data.

The Power of IBM and H2O.ai Solves Customer Challenges

AI will transform businesses as we know it. Companies will leverage AI across multiple business units and establish centers of excellence for AI. From asset price forecasting in capital markets, to supply chain optimization in manufacturing, to personalized insurance policies, no walk of business will be immune to pressures to democratize decision making with AI. IBM and H2O.ai will build the trusted hardware / software co-design needed for enterprises to make that transition. The winners of this partnership are our customers and their stories will be replicated across the ecosystem.

this will be fun, Sri

AI in Healthcare – Redefining Patient & Physician Experiences

Register for the Meetup Here

Patients, physicians, nurses, health administrators and policymakers are beneficiaries of the rapid transformations in health and life sciences. These transformations are being driven by new discoveries (etiology, therapies, and drugs/implants), market reconfiguration and consolidation, a movement to value-based care, and access/affordability considerations. The people and systems that are driving these changes are generating new engagement models, workflows, data, and most importantly, new needs for all participants in the care continuum.

Analytics 1.0 (driven by business intelligence & reporting) for Healthcare as we describe in our book is inadequate to address these transformations. A retrospective understanding of “what happened?” is limited in its usefulness as it only provides for corrective action – usually driven by resource availability. To improve wellness, care outcomes, clinician satisfaction, and patient quality of life, we ought to be leveraging little and big data via Analytics 2.0 & 3.0. This journey will require leveraging machine/deep learning and other AI methods to separate signal from noise, integrate insights into a workflow, address data fidelity, and develop contextually-intelligent agents.

Automating machine learning and deep learning simplifies access to these advanced technologies by the Humans of Healthcare. They are key pre-requisites to create a data-driven, learning Healthcare organization. The net results – better science, improved access & affordability, and evidence-based wellness/care.

Among the many involved in the care continuum, physicians are at the forefront of the coming health sciences revolution. Join our all-physician panel at the H2O offices in Mountain View, CA to hear their expert thoughts and interact with them. Our panel consists of three leading physicians who are also driving clinical innovation using AI in their specialties and organizations:

  1. Dr. Baber Ghauri, Physician Executive and Healthcare Innovator, Trinity Health

  2. Dr. Esther Yu, Professor & Neuroradiologist, UCSF

  3. Dr. Pratik Mukherjee, Professor, and Director of CIND, San Francisco VA

  4. Moderator: Prashant Natarajan, Sr. Dir. AI Apps at H2O.ai and best-selling author/contributor to books on medical informatics & analytics


We look forward to seeing you in person.

-H2O.ai Team

From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks

Introducing Accelerated Automatic Pipelines in H2O Driverless AI

At H2O, we work really hard to make machine learning fast, accurate, and accessible to everyone. With H2O Driverless AI, users can leverage years of world-class Kaggle Grand Master experience and our GPU-accelerated algorithms (H2O4GPU) to produce top-quality predictive models in a fully automatic and timely fashion.

In our most recent release (version 1.1), we are going one step further to streamline the deployment process with MOJO (Model ObJect, Optimized). Inherited from our popular H2O-3 platform, MOJO is a highly optimized, low-latency scoring engine that is easily embeddable in any Java environment. With automatic pipeline generation in Driverless AI, users can go from automatic machine learning to production ready in just a few clicks. This blog post illustrates the usage of MOJO in Driverless AI with a simple example.

Easing the Pain Points in a Machine Learning Workflow

In a typical enterprise machine learning workflow, there are many things that can go wrong due to human errors, bad data science practices, different tools and infrastructure, incompatible code, and a lack of testing, versioning, and communication.

Driverless AI is our solution to ease those pain points in the second half of the workflow (i.e., creative feature engineering, model building, and deployment). We strongly believe that most organizations can benefit from automatic machine learning pipelines. A recent PayPal use-case shows that Driverless AI can help produce top quality predictive models with significant time and cost savings.


With Driverless AI, we are trying to mimic what top data science teams would do when they need to develop a new machine learning pipeline. Below are the four key areas of focus:

1. Exploratory Data Analysis (EDA) with Automatic Visualizations (AutoViz)

AutoViz allows users to gain quick insights from data without the laborious tasks of creating individual plots. It shows users the most interesting graphs automatically based on statistics, and it is designed to work on large datasets efficiently. The mastermind behind AutoViz is our Chief Scientist, Professor Leland Wilkinson of “The Grammar of Graphics” fame.

2. Automatic Feature Engineering and Model Building

We call this part of Driverless AI “Kaggle Grand Masters in a Box”. It is essentially the best data science practices, tricks and creative feature engineering of our Kaggle Grand Masters translated into an artificial intelligence (AI) platform. In other words, it is AI to do AI. On top of that, we make the automatic machine learning process insanely fast on Nvidia GPUs. Our users can benefit from quick turnaround time and top quality predictive models that one would expect from the Kaggle Grand Masters themselves.

3. Machine Learning Interpretability (MLI)

In Driverless AI, we have implemented some of the latest ML interpretation techniques (e.g., LIME, LOCO, ICE, Shapley, PDP, etc.), so our users can go from model building to model interpretation in a seamless fashion. These techniques are crucial for those who must explain their models to regulators or customers. The masterminds behind MLI are my colleagues Patrick Hall, Navdeep Gill, and Mark Chan. Watch their talk about MLI in Driverless AI here.

4. Automatic Pipelines Generation – The Focus of this Blog Post

Model deployment remains one of the most common and complex challenges in data analytics. Inherited from our popular H2O-3 platform, MOJO is a well-tested, robust technology that is being used by our users and customers at enormous scale. Let me illustrate the MOJO usage with a simple example below.

Credit Card Example

Like many other Driverless AI demos that you may have seen before at H2O World or our webinars, I am going to use the credit card dataset from the UCI machine learning repository for the MOJO example. Let me fast-forward the process to the end of a Driverless AI experiment and focus on the new MOJO options. From version 1.1.0, users have the option to build and download MOJO for fast, low-latency scoring. Here is a step-by-step walkthrough:

Step 1: Build a MOJO Scoring Pipeline
After the experiment, click on the newly available option BUILD MOJO SCORING PIPELINE. The build process is automatic and it should be done within a few minutes.

Step 2: Download and Unzip MOJO
Click on DOWNLOAD MOJO SCORING PIPELINE to download mojo.zip. After unzipping the file, you should be able to see a new folder called mojo-pipeline. The pipeline.mojo and mojo2-runtime.jar in the folder are the two main files you need for the MOJO scoring pipeline.

Step 3: Download Driverless AI License
Another key ingredient for the MOJO pipeline is a valid Driverless AI license. You can download the license.sig file (usually in the license folder) from the machine hosting Driverless AI. Put the license file into the mojo-pipeline folder from the previous step.


Optional Step: Install Java 7 or 8
The MOJO scoring pipeline requires Java 8 (or Java 7/8 from version 1.1.2). If you have not installed it, please follow the instructions here.

Step 4: A Simple Test Run
In the mojo-pipeline folder, you will find a small example.csv with some data samples. This dataset can be used for a quick test run. Open the folder in a terminal and then run the following command: bash run_example.sh

Alternatively, run the full command like this:
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

It should return predictions (the probabilities of default payment in this credit card demo) and the time required for scoring each sample. Remember, this scoring pipeline includes everything from complex feature transformations based on Kaggle Grand Masters’ recipes to computing predictions from the final model ensemble. With MOJO, our users have a low-latency scoring engine that can make new predictions in milliseconds.

Step 5: Create Your Own Scoring Service
Users can, of course, define and program their own scoring services. For more information, please go through the Compile and Run the MOJO from Java section in our Driverless AI documentation.
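As a small sketch of scripting the scoring step, the function below assembles the same java invocation shown in Step 4 so it could be handed to subprocess from Python. The function name is ours and the file paths are placeholders; only the java command itself comes from the documentation above.

```python
# Hedged sketch: build the documented MOJO scoring command so a script
# can launch it with subprocess. Paths are placeholders; the command
# shape mirrors the java invocation shown in Step 4.
def build_mojo_command(license_file, runtime_jar, mojo_file, input_csv):
    """Return the argv list for the MOJO scoring runtime."""
    return [
        "java",
        f"-Dai.h2o.mojos.runtime.license.file={license_file}",
        "-cp", runtime_jar,
        "ai.h2o.mojos.ExecuteMojo",
        mojo_file, input_csv,
    ]

cmd = build_mojo_command("license.sig", "mojo2-runtime.jar",
                         "pipeline.mojo", "example.csv")
# To actually score (requires Java, the MOJO files, and a license):
# import subprocess; subprocess.run(cmd, check=True)
```

A real scoring service would wrap this (or the Java API directly) behind an HTTP endpoint rather than shelling out per request.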


This blog post gives a quick overview of the automatic pipelines in Driverless AI. The key benefits for our users are:

  1. Immediate increase in productivity – eliminating time wasted on human errors, incompatible code, debugging, etc.
  2. Production ready in a few clicks – seamless integration of complex feature engineering and scoring engine in one MOJO.
  3. An enterprise-grade, low-latency scoring engine that is easily embeddable in any Java environment.

Don’t take my word for it: sign up for a free 21-day trial and try Driverless AI yourself today.

Until next time,

Note #1: Two years, numerous H2O models, slide decks, events and #360selfies later, I am finally making a return to blogging. I hope you enjoy reading this blog post.

Note #2: H2O is going to Budapest again. Come find me, Erin, and Kuba at the eRum conference from May 14 to 16. I will be delivering the “Automatic and Interpretable Machine Learning in R with H2O and LIME” workshop with a real, multimillion-dollar Moneyball Shiny app.

H2O World coming to NYC


Whether you’re just starting to learn how machine learning and H2O.ai can supercharge your business or you’re a veteran looking for more, we want to invite you to join some of the greatest minds in the field to learn how AI and H2O.ai can transform your business. Our flagship event, H2O World, is back, and it’s going to be bigger than ever! We’re making our way around the world, with our first stop at The New York Academy of Sciences on June 7th.

You’ll get exclusive access to the brains behind open source H2O, H2O Driverless AI, Sparkling Water, MLI, and more! You’ll even be able to get a hands-on tutorial of our revolutionary Driverless AI platform and learn directly from the people implementing H2O.ai’s solutions to solve some of their companies’ toughest problems.

With an eclectic group of speakers, including product managers, data scientists, customer success managers, and more, we’ve got something for everyone! Don’t miss out on a full day of talks and hands-on sessions. Learn how H2O.ai is democratizing machine learning and transforming businesses in industries including healthcare, finance, insurance, and more.

Highlights from last year:

Leah Liebler
Marketing @ H2O

Democratize care with AI — AI to do AI for Healthcare

Very excited to have Prashant Natarajan (@natarpr) join us along with Sanjay Joshi on our vision to change the world of healthcare with AI. Health is wealth. And one worth saving the most. They bring invaluable domain knowledge and context to our cause.

As one of our customers likes to say, healthcare should be optimized for health and outcomes for the ones in need of care. Health / Care, as in health divided by care: how healthy can one be with the least amount of care! We are investing in health because it is the right thing to do over the long term — especially with the convergence of finance, life insurance, and retail toward health. So many opportunities for cross-pollination!

With our strong ecosystem, community and customers’ support, h2o.ai will democratize care with AI — make it faster, cheaper and easier — accessible to all. Machine Learning touches lives — with Domain Scientists on our side, we can accelerate change to the problems that are in the most need. We are fortunate to have the team and culture that allows us to bring great products with high velocity to the marketplace. Stay tuned for Driverless AI for Health, one micro service AI model at a time!

As you feel inspired by the immense opportunities to serve humanity — please join Prashant Natarajan and www.h2o.ai community on our mission!

this will be fun! Sri

Sparkling Water 2.3.0 is now available!

Hi Makers!

We are happy to announce that Sparkling Water now fully supports Spark 2.3 and is available from our download page.

If you are using an older version of Spark, that’s no problem. Even though we suggest upgrading to the latest version possible, we keep the Sparkling Water releases for Spark 2.2 and 2.1 up to date with the latest features wherever we are not limited by Spark.

This release of Sparkling Water also contains several important bug fixes. The three major ones are:

  • Handle nulls properly in H2OMojoModel. In previous versions, running predictions on the H2OMojoModel with null values would fail. We now treat null values as missing values, so prediction no longer fails.

  • We marked the Spark dependencies in our maven packages as provided. This means that we assume that Spark dependencies are always provided by the run-time, which should always be true. This ensures a cleaner and more transparent Sparkling Water environment.

  • In PySparkling, the method as_h2o_frame didn’t issue an alert when passed a wrong input type. This method accepts only Spark DataFrames and RDDs; however, some users passed different types and the method returned silently. Now we fail fast if the user passes a wrong data type to this method.
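
The fail-fast pattern behind that fix can be illustrated in plain Python. This is not the PySparkling source; the function name and the stand-in "supported" types (list and dict instead of DataFrame and RDD) are ours, shown only to contrast raising an error with returning silently.

```python
# Illustration of the fail-fast fix described above (NOT PySparkling
# source): validate the input type up front and raise, instead of
# silently doing nothing on unsupported types.
def as_h2o_frame_checked(data, supported_types=(list, dict)):
    """Stand-in for a converter that accepts only specific input types."""
    if not isinstance(data, supported_types):
        raise TypeError(
            f"Unsupported input type {type(data).__name__}; "
            f"expected one of {[t.__name__ for t in supported_types]}"
        )
    return data  # real code would perform the conversion here

try:
    as_h2o_frame_checked("not a frame")  # wrong type: raises TypeError
except TypeError as e:
    message = str(e)
```

Raising immediately turns a silent no-op into an actionable error message for the user.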

It is also important to mention that Spark 2.3 removed support for Scala 2.10. We’ve done the same in the release for Spark 2.3. Scala 2.10 is still supported in the older Spark versions.

The latest Sparkling Water versions also integrate the latest H2O, which brings several important fixes. The full change log for H2O is available here, and the full Sparkling Water change log can be viewed here.

Thank you!


Senior Software Engineer, Sparkling Water Team

H2O + Kubeflow/Kubernetes How-To

Today, we are introducing a walkthrough on how to deploy H2O 3 on Kubeflow. Kubeflow is an open source project led by Google that sits on top of the Kubernetes engine. It is designed to alleviate some of the more tedious tasks associated with machine learning. Kubeflow helps orchestrate deployment of apps through the full cycle of development, testing, and production, while allowing for resource scaling as demand increases. H2O 3’s goal is to reduce the time spent by data scientists on time-consuming tasks like designing grid search algorithms and tuning hyperparameters, while also providing an interface that allows newer practitioners an easy foothold into the machine learning space. The integration of H2O and Kubeflow is extremely powerful, as it provides a turn-key solution for easily deployable and highly scalable machine learning applications, with minimal input required from the user.

Getting Started:

  1. Make sure to have kubectl and ksonnet installed on the machine you are using, as we will need both. Kubectl is the Kubernetes command line tool, and ksonnet is an additional command line tool that assists in managing more complex deployments. Ksonnet helps to generate Kubernetes manifests from templates that may contain several parameters and components.
  2. Launch a Kubernetes cluster. This can either be an on-prem deployment of Kubernetes or an on-cloud cluster from Google Kubernetes Engine. Minikube offers a platform for local testing and development running as a virtual machine on your laptop.
  3. Make sure to configure kubectl to work with your Kubernetes cluster.
    a. Running kubectl cluster-info will tell you which cluster kubectl is currently configured to work with.
    b. Google Kubernetes Engine has a link in the GCP console that will provide the command for properly configuring kubectl.

    c. Running minikube start will launch Minikube and should automatically configure kubectl. You can verify this by running minikube status after launch.

    4. Now we are ready to start our deployment. To begin with, we will initialize a ksonnet application by running the command “ks init <your_app_name>”.
    5. Move into the directory that was created by the previous command using “cd <your_app_name>”. You will see that it has been populated with a couple of directories, as well as files containing some default parameters. You do not need to touch these.
    6. In order to install the Kubeflow components, we add a ksonnet registry to the application. This can be done by running the commands:
    ks registry add kubeflow <location_of_the_registry>
    ks pkg install kubeflow/core
    ks pkg install kubeflow/tf-serving
    ks pkg install kubeflow/tf-job
    ks pkg install kubeflow/h2o3
    a. This will create a registry called “kubeflow” within the ksonnet application using the components found within the specified location.
    b. <location_of_the_registry> is typically a github repo. For this walkthrough, you can use this repo as it has the prebuilt components for both H2O and Kubeflow.
    c. ks pkg install <component_name> will install the components that we will reference when deploying Kubeflow and H2O.

    7. Let’s start with deploying the core Kubeflow components first:
    kubectl create namespace ${NAMESPACE}
    ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
    ks env add cloud
    ks param set kubeflow-core cloud gke --env=cloud
    ks apply ${KF_ENV} -c kubeflow-core
    a. These commands will create a deployment of the core Kubeflow components.
    b. Note: if you are using minikube, you may want to create an environment named “local” or “minikube” rather than “cloud”, and you can skip the “ks param set …” command.
    c. For GKE: you may need to run the command “kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@email.com” to avoid RBAC permission errors.
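If you are deploying to minikube instead, the sequence in step 7 reduces to the following sketch (the environment name “local” and the namespace “kubeflow” are illustrative choices, not values mandated by the repo):

```shell
# Deploy the core Kubeflow components to a local minikube cluster.
kubectl create namespace kubeflow
ks generate core kubeflow-core --name=kubeflow-core --namespace=kubeflow
ks env add local                  # point a ksonnet environment at minikube
ks apply local -c kubeflow-core   # no "ks param set ... cloud gke" needed locally
```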
    8. Kubeflow is now deployed on our Kubernetes cluster. There are two options for deploying H2O on Kubeflow: through Kubeflow’s JupyterHub Notebook offering, or as a persistent server. Both options accept a Docker image containing the necessary packages for running H2O.
    a. You can find the dockerfiles needed for both options here.
    b. Copy the dockerfiles to a local directory and run the command “docker build -t <name_for_docker_image> -f <name_of_dockerfile> .” (note the trailing dot, which sets the build context to the current directory).
    c. If you are deploying to the cloud, it is a good idea to push the image to a container registry such as Docker Hub or Google Container Registry.
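Steps 8b and 8c might look like the following sketch for Google Container Registry (the project ID, image tag, and dockerfile name are placeholders, not values from the repo):

```shell
# Build the notebook image from the provided dockerfile and push it to GCR.
# "my-project", "h2o3-kf-notebook:v1", and "Dockerfile.notebook" are illustrative.
docker build -t gcr.io/my-project/h2o3-kf-notebook:v1 -f Dockerfile.notebook .
docker push gcr.io/my-project/h2o3-kf-notebook:v1
```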

Deploy JupyterHub Notebook:
1. The JupyterHub server comes deployed with the core Kubeflow components. Running the command “kubectl get svc -n=${NAMESPACE}” will show a service running with the name “tf-hub-0”.

2. Use the command “kubectl port-forward tf-hub-0 8000:8000 --namespace=${NAMESPACE}” to make the exposed port available on your local machine, and open it in your browser. Create a username and password when prompted within the browser window, and click “Start My Server”.
3. You will be prompted to designate a docker image to pull, as well as requests for CPUs, memory, and additional resources. Fill in the resource requests as preferred.
Note: We already have the notebook image (“h2o3-kf-notebook:v1”) pushed to GCR. You will want to build your own image using the dockerfiles provided and push it to GCR. The notebook image is fairly large, so it may take some time to download and start.

4. Once the notebook server has properly spawned, you will see the familiar Jupyter Notebook homepage. Create a new Python 3 notebook. The image built from the dockerfiles provided will have all the requisite plugins to run H2O.
5. A basic example of running H2O AutoML is shown in the sample Jupyter Notebook available in the repo, or you can follow the example from the H2O AutoML documentation.
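A minimal AutoML run looks roughly like the sketch below. It assumes the h2o package is installed and an H2O cluster is reachable (e.g., inside the notebook image built above); the file name and response column are placeholders for your own data, not values from the repo.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # start or attach to an H2O cluster

# Load training data and split out the target column.
# "train.csv" and "response" are hypothetical names.
train = h2o.import_file("train.csv")
y = "response"
x = [c for c in train.columns if c != y]

# Train up to 20 candidate models and rank them on the default metric.
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

print(aml.leaderboard)  # best-performing model listed first
```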

Deploy H2O 3 Persistent Server:
1. If we want to deploy H2O 3 as a persistent server, we use the prototype available within the ksonnet registry.
Run the command:
ks prototype use io.ksonnet.pkg.h2o3 h2o3 \
--name h2o3 \
--namespace kubeflow \
--model_server_image <image_name_in_container_registry>
This will create the necessary component for deploying H2O 3 as a persistent server.
2. Finally, deploy the H2O 3 component to the server using this command:
ks apply cloud -c h2o3 -n kubeflow
a. The -c flag specifies the component you wish to deploy, and the -n flag specifies that the component is deployed to the kubeflow namespace.
3. Use “kubectl get deployments” to make sure that the H2O 3 persistent server was deployed properly. “kubectl get pods” will show the name of the pod to which the server was deployed.

4. Additionally, running “kubectl get svc -n kubeflow” will show a service named “h2o3” of type “LoadBalancer”. After about a minute, the external IP will change from <pending> to a real IP address.
5. Go to a working directory where you would like to store any Jupyter Notebooks or scripts. At this point you can launch a Jupyter Notebook locally or write a Python script that runs H2O. Make sure your local version of H2O 3 is up to date; you can follow the steps here to install the newest version of H2O 3. By default, docker will build the image using the most current version of H2O 3.
    a. Use the external IP address obtained from “kubectl get svc -n kubeflow” and port 54321 in the h2o.init() command, and you will connect H2O to the cluster running in Kubernetes.

b. From here, the steps are the same as in the JupyterHub Notebook above. You can follow the same example steps as are outlined in the AutoML example here.
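Step 5a can be sketched as follows (the IP address is a placeholder for the EXTERNAL-IP that “kubectl get svc -n kubeflow” reports for the “h2o3” service):

```python
import h2o

# Connect the local H2O client to the persistent server in Kubernetes.
# 203.0.113.10 is a placeholder; substitute your LoadBalancer's external IP.
h2o.init(ip="203.0.113.10", port=54321)
```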
6. Optionally, you can direct your browser to the exposed IP address at http://<your_ip>:54321. This will launch H2O Flow, H2O’s web UI. H2O Flow provides a notebook-like interface with more point-and-click options compared to a Jupyter Notebook, which requires an understanding of Python syntax.

This walkthrough provides a small window into a high-potential, ongoing project. Currently, deployment of H2O.ai’s enterprise product, Driverless AI, on Kubeflow is in progress. At the moment, it is deployable in a similar fashion to the H2O 3 persistent server, and beta work on this can be found within the GitHub repo. Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment.

Please feel free to contact me with any questions via email or LinkedIn.
All files are available here: https://github.com/h2oai/h2o-kubeflow.