Automatic Feature Engineering for Text Analytics – The Latest Addition to Our Kaggle Grandmasters’ Recipes

According to Kaggle’s ‘The State of Machine Learning and Data Science’ survey, text data is the second most used data type at work for data scientists. There are a lot of interesting text analytics applications like sentiment prediction, product categorization, document classification and so on.

In the latest version (1.3) of our Driverless AI platform, we have included Natural Language Processing (NLP) recipes for text classification and regression problems. The platform supports both standalone text and text combined with other numerical columns as predictive features. In particular, we have implemented the following recipes and models:

  • Text-specific feature engineering recipes:
    • TFIDF, Frequency of n-grams
    • Truncated SVD
    • Word embeddings
  • Text-specific models to extract features from text:
    • Convolutional neural network models on word embeddings
    • Linear models on TFIDF vectors
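
Driverless AI generates these transformations automatically, but the core idea behind the TF-IDF recipe is easy to sketch. Below is a minimal, illustrative Python implementation (the tokenization and weighting scheme here are simplified stand-ins, not the platform's actual code):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute simple TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = [
    "great flight thanks".split(),
    "flight delayed again".split(),
    "thanks for nothing".split(),
]
vecs = tfidf(docs)
```

Terms that occur in many documents (like "flight" above) receive a lower weight than terms distinctive to one document, which is exactly what makes TF-IDF vectors useful features for a downstream linear model.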

A Typical Example: Sentiment Analysis

Let us illustrate the usage with a classic example of sentiment analysis on tweets, using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test sets with this simple script. We will just use the tweets in the ‘text’ column and the sentiment (positive, negative or neutral) in the ‘airline_sentiment’ column for this demo. Here are some samples from the dataset:
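
For readers who prefer to script the split themselves, here is a minimal sketch using only the standard library (the Tweets.csv filename and the deterministic shuffle are illustrative assumptions, not the exact script linked above):

```python
import csv
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows deterministically and split them into train/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

# Hypothetical usage, assuming the dataset was saved as 'Tweets.csv':
# with open("Tweets.csv") as f:
#     rows = [{"text": r["text"],
#              "airline_sentiment": r["airline_sentiment"]}
#             for r in csv.DictReader(f)]
# train, test = train_test_split(rows)
```

Fixing the seed makes the split reproducible, so the same train/test partition can be reused across experiments.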

Once we have our dataset ready in tabular format, we are all set to use Driverless AI. As with other problems in the Driverless AI setup, we need to choose the dataset and then specify the target column (‘airline_sentiment’).

Since there are other columns in the dataset, we need to click on ‘Dropped Cols’ and then exclude everything but ‘text’ as shown below:

Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to ‘Expert Settings’ and switch on ‘TensorFlow Models’.

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.

Once the experiment is done, users can make new predictions and download the scoring pipeline just like with any other Driverless AI experiment.

Bonus fact #1: The masterminds behind our NLP recipes are Sudalai Rajkumar (aka SRK) and Dmitry Larko.

Bonus fact #2: Don’t want to use the Driverless AI GUI? You can run the same demo using our Python API. See this example notebook.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
SRK and Joe

H2O’s AutoML in Spark

This blog post demonstrates how H2O’s powerful automatic machine learning can be used together with Spark in Sparkling Water.

We show the benefits of Spark and H2O integration, using Spark for data munging tasks and H2O for the modelling phase, with all these steps wrapped inside a Spark pipeline. The integration between Spark and H2O can be seen in the figure below. All the technical details behind this integration are explained in our documentation, which you can access here.

Sparkling Water Architecture

At the end of this blog post, we also show how the generated model can be taken into production using a Spark Streaming application. We use Python and PySparkling for the model training phase and Scala for the deployment example.

For the purpose of this blog, we use the Combined Cycle Power Plant dataset. The goal here is to predict the energy output (in megawatts), given the temperature, ambient pressure, relative humidity and exhaust vacuum values. We will alter the dataset a little for the purposes of this blog post and use only rows where the temperature is higher than 10 degrees Celsius; in other words, we are interested only in plant performance on warmer days.

Obtain Sparkling Water

The first step is to download Sparkling Water. It can be downloaded from our official download page. Please make sure to use the latest Sparkling Water, as H2OAutoML in Sparkling Water is a fairly new feature available only in the latest versions. Once you have downloaded Sparkling Water, please follow the instructions on the PySparkling tab on how to start the PySparkling interpreter.

Download Sparkling Water

Start Sparkling Water

In order to use both Spark and H2O alongside each other, we need to make H2O available inside the Spark cluster. This can be achieved by running the code below.

from pysparkling import * # Import PySparkling
hc = H2OContext.getOrCreate(spark) # Start the H2OContext

This code starts an H2O node on each Spark executor and a special H2O node, called the client node, inside the Spark driver.

Load Data into Sparkling Water

We use Spark to load the data into memory. For that, we can use the line below:

powerplant_df = spark.read.option("inferSchema", "true").csv("powerplant_output.csv", header=True)

This code imports the file from the specified location into the Spark cluster and creates a Spark DataFrame from it. The original data file can be downloaded from here. It is important to set the inferSchema option to true; otherwise Spark won’t try to infer the data types automatically, and all columns will end up with type String. With the option set, the types are correctly inferred.

We will use a portion of this data for training and a portion for demonstrating predictions. We can use the randomSplit method available on the Spark DataFrame to split the data:

splits = powerplant_df.randomSplit([0.8, 0.2], 1)
train = splits[0]
for_predictions = splits[1]

We split the dataset, giving 80% to one split and 20% to the other. The last argument specifies the seed so that the method behaves deterministically. We use the 80% portion for training and the remaining portion for scoring later.

Define the Pipeline with H2O AutoML

Now, we can define the Spark pipeline containing the H2O AutoML. Before we do that, we need to do a few imports so all classes and methods we require are available:

from pysparkling.ml import H2OAutoML
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

And finally, we can start building the pipeline stages. The first pipeline stage is the SQLTransformer, used for selecting the rows where the temperature is higher than 10 degrees Celsius. The SQLTransformer is a powerful Spark transformer that lets us specify any Spark SQL code to be executed on the dataframe passed to it from the pipeline.

temperatureTransformer = SQLTransformer(statement="SELECT * FROM __THIS__ WHERE TemperatureCelcius > 10")

It is important to understand that no code is executed at this stage as we are just defining the stages. We will show how to execute the whole pipeline later in the blog post.

The next pipeline stage is not a transformer but an estimator, and it is used for creating the H2O model with the H2O AutoML algorithm. This estimator is provided by the Sparkling Water library, but we can see that its API is unified with the other Spark pipeline stages.

automlEstimator = H2OAutoML(maxRuntimeSecs=60, predictionCol="HourlyEnergyOutputMW", ratio=0.9)

We have defined the H2OAutoML estimator. The maxRuntimeSecs argument specifies how long we want to run the AutoML algorithm. The predictionCol argument specifies the response column, and the ratio argument specifies what fraction of the dataset is used for training. Here we specified that we want to use 90% of the data for training and 10% for validation.

As we have defined both stages we need, we can define the whole pipeline:

pipeline = Pipeline(stages=[temperatureTransformer, automlEstimator])

And finally, train it on the training dataset we prepared above:

model = pipeline.fit(train)

This call goes through all the pipeline stages and, in the case of estimators, creates a model. So as part of this call, we run the H2O AutoML algorithm and find the best model given the search criteria we specified in the arguments. The model variable contains the whole Spark pipeline model, which also internally contains the model found by AutoML. The H2O model inside is stored in the H2O MOJO format. That means it is independent of the H2O runtime and can therefore be run anywhere without initializing an H2O cluster. For more information about MOJO, please visit the MOJO documentation.

Prediction

We can generate predictions from the returned model simply as:

predicted = model.transform(for_predictions)

This call again goes through all the pipeline stages and, when it hits a stage containing a model, performs the scoring operation.

We can also look at the first few results:

predicted.take(2)

Export Model for Deployment

In the following part of the blog post, we show how to put this model into production. For that, we need to export the model, which can be done simply as:

model.write().overwrite().save("pipeline.model")

This call stores the model into the pipeline.model file. It is also a helpful trick to export the schema of the data. This is especially useful for streaming applications, where it is hard to determine the data types from a single row of input.

We can export the schema as:

with open('schema.json','w') as f:
    f.write(str(powerplant_df.schema.json()))

Deploy the Model

Now we would like to demonstrate how the Spark pipeline with the model found by AutoML can be put into production in a Spark Streaming application. For the deployment, we can start a new Spark application, in Scala or Python, and load the trained pipeline model. The pipeline model contains the H2O AutoML model packaged as a MOJO and is therefore independent of the H2O runtime. We will use Scala to demonstrate the language independence of the exported pipeline.

The deployment consists of several steps:

  • Load the schema from the schema file.
  • Create an input data stream and pass it the schema. The input data stream will point to a directory where new CSV files arrive from different streaming sources. It can also be an online source of streaming data.
  • Load the pipeline from the pipeline file.
  • Create an output data stream. For our purposes, we store the data in a memory sink and also in a Spark SQL table so we can see immediate results.

// Start Spark
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local").getOrCreate()


// Load exported pipeline
import org.apache.spark.ml.PipelineModel
val pipelineModel = PipelineModel.read.load("pipeline.model")


// Load exported schema of input data, marking all fields as nullable
import org.apache.spark.sql.types.{DataType, StructField, StructType}
val schema = StructType(DataType.fromJson(scala.io.Source.fromFile("schema.json").mkString).asInstanceOf[StructType].map {
  case StructField(name, dtype, nullable, metadata) => StructField(name, dtype, true, metadata)
  case rec => rec
})
println(schema)


// Define input stream
val inputDataStream = spark.readStream.schema(schema).csv("/path/to/folder/where/input/data/are/being/generated")


// Apply loaded model
val outputDataStream = pipelineModel.transform(inputDataStream)


// Forward output stream into memory-sink
outputDataStream.writeStream.format("memory").queryName("predictions").start()


// Query results
while(true){
  spark.sql("select * from predictions").show()
  Thread.sleep(5000)
}

Conclusion

This code demonstrates that we can relatively easily put a pipeline model into production. We used Python for model creation and a JVM-based language for deployment. The resulting pipeline model contains the model found by the H2O AutoML algorithm, exported as a MOJO. This means that we don’t need to start H2O or Sparkling Water in the place where we deploy the model (but we do need to ensure that the Sparkling Water dependencies are on the classpath).

H2O-3 on FfDL: Bringing deep learning and machine learning closer together

This post originally appeared in the IBM Developer blog here.

This post is co-authored by Animesh Singh, Nicholas Png, Tommy Li, and Vinod Iyengar.

Deep learning frameworks like TensorFlow, PyTorch, Caffe, MXNet, and Chainer have reduced the effort and skills needed to train and use deep learning models. But for AI developers and data scientists, it’s still a challenge to set up and use these frameworks in a consistent manner for distributed model training and serving.

The open source Fabric for Deep Learning (FfDL) project provides a consistent way for AI developers and data scientists to use deep learning as a service on Kubernetes and to use Jupyter notebooks to execute distributed deep learning training for models written with these multiple frameworks.

Now, FfDL is announcing a new addition that brings together that deep learning training capability with state-of-the-art machine learning methods.

Augment deep learning with best-of-breed machine learning capabilities

For anyone who wants to try machine learning algorithms with FfDL, we are excited to introduce H2O.ai as the newest member of the FfDL stack. H2O-3 is H2O.ai’s open source platform, an in-memory, distributed, and scalable machine learning and predictive analytics platform that enables you to build machine learning models on big data. H2O-3 offers an expansive library of algorithms, such as Distributed Random Forests, XGBoost, and Stacked Ensembles, as well as AutoML, a powerful tool for users with less experience in data science and machine learning.

After data cleansing, or “munging,” one of the most fundamental parts of training a powerful and predictive model is properly tuning the model. For example, deep neural networks are notoriously difficult for a non-expert to tune properly. This is where AutoML becomes an extremely valuable tool. It provides an intuitive interface that automates the process of training a large number of candidate models and selecting the highest performing model based on the user’s preferred scoring method.
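
Conceptually, AutoML's selection step boils down to fitting many candidate configurations and keeping the one that scores best on the preferred metric. Here is a schematic sketch of that loop (the candidate parameters and scoring function below are toy stand-ins, not H2O's actual search strategy):

```python
def automl_select(candidates, fit, score, higher_is_better=True):
    """Fit every candidate configuration and keep the best-scoring model."""
    best_model, best_score = None, None
    for params in candidates:
        model = fit(params)
        s = score(model)
        if best_score is None or (s > best_score) == higher_is_better:
            best_model, best_score = model, s
    return best_model, best_score

# Toy search space and scoring function (stand-ins for a real fit/metric)
candidates = [{"max_depth": d} for d in (2, 4, 8)]
best, best_score = automl_select(
    candidates,
    fit=lambda p: p,                           # pretend "fitting" a model
    score=lambda m: -abs(m["max_depth"] - 4),  # peaks at depth 4
)
```

A real AutoML run replaces the toy fit and score functions with actual model training and cross-validated metrics, and adds ensembling on top, but the select-the-best loop is the same idea.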

In combination with FfDL, H2O-3 makes data science highly accessible to users of all levels of experience. You can simply deploy FfDL to your Kubernetes cluster and submit a training job to FfDL. Behind the scenes, FfDL sets up the H2O-3 environment, runs your training job, and streams the training logs for you to monitor and debug your model. Since FfDL also supports multi-node clusters with H2O-3, you can horizontally scale your H2O-3 training job seamlessly on all your Kubernetes nodes. When model training is complete, you can save your model locally to FfDL or to a cloud object store, where it can be obtained later for serving inference.

Try H2O on FfDL today!

You can find the details on how to train H2O models on FfDL in the open source FfDL readme file and guide. Deploy, use, and extend them with any of the capabilities that you find helpful. We’re waiting for your feedback and pull requests!

How to Frame Your Business Problem for Automatic Machine Learning

Over the last several years, machine learning has become an integral part of many organizations’ decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai has developed a product called Driverless AI that automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, and model explanation. In this post, I will describe Driverless AI, how you can properly frame your business problem to get the most out of this automatic machine learning product, and how automatic machine learning is used to create business value.

What is Driverless AI and what kind of business problems does it solve?

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI is currently targeting business applications like loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)

How do you frame business problems in a data set for Driverless AI?

The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment, or financial transaction. That row must also contain information about what you will be trying to predict using similar data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon described by the provided data set. Driverless AI can handle simple data quality problems, but it currently requires all data for a single predictive model to be in the same data set and that data set must have already undergone standard ETL, cleaning, and normalization routines before being loaded into Driverless AI.

How do you use Driverless AI results to create commercial value?

Commercial value is generated by Driverless AI in a few ways.

● Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using automation and state-of-the-art computing power to accomplish tasks in just minutes or hours that can take humans months.

● Like in many other industries, automation leads to standardization of business processes, enforces best practices, and eventually drives down the cost of delivering the final product – in this case a predictive model.

● Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process. In large organizations, value from predictive modeling is typically realized when a predictive model is moved from a data analysts’ or data scientists’ development environment into a production deployment setting where the model is running on live data, making decisions quickly and automatically that make or save money. Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.

Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

Customer success stories with Driverless AI

PayPal tried Driverless AI on a collusion fraud use case and found that, running on a laptop for just 2 hours, Driverless AI yielded impressive fraud detection accuracy, and that, running on GPU-enhanced hardware, it was able to produce the same accuracy in just 20 minutes. The Driverless AI model was more accurate than PayPal’s existing predictive model, and the Driverless AI system found the same insights in their data that their data scientists did! The system also found new features in their data that had not been used before for predictive modeling. For more information about the PayPal use case, click here.

G5, a real estate marketing optimization firm, uses Driverless AI in their Intelligent Marketing Cloud to assist clients in targeted marketing spending for property management. Empowered by Driverless AI technology, marketers can quickly prioritize and convert highly qualified inbound leads from G5’s Intelligent Marketing Cloud platform with 95 percent accuracy for serious purchase intent. To learn more about how G5 uses Driverless AI check out:
https://www.h2o.ai/g5-h2o-ai-partner-to-deliver-ai-optimization-for-real-estate-marketing/

How can you try Driverless AI?

Visit: https://www.h2o.ai/driverless-ai/ and download your free 21-day evaluation copy.

We are happy to help you get started installing and using Driverless AI, and here are some resources we’ve put together to help you in that process:

● Installing Driverless AI: https://www.youtube.com/watch?v=swrqej9tFcU

● Launching an Experiment with Driverless AI: https://www.youtube.com/watch?v=bw6CbZu0dKk

● Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf

● Driverless AI Documentation: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html

Time is Money! Automate Your Time-Series Forecasts with Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-world applications like sales, weather, stock market, energy demand, just to name a few. We strongly believe that automation can help our users deliver business value in a timely manner. Therefore, once again we translated our Kaggle Grand Masters’ time-series recipes into our automatic machine learning platform Driverless AI (version 1.2). This blog post introduces the new time-series functionality with a simple sales forecasting example.

The key features/recipes that make automation possible are:

  • Automatic handling of time groups (e.g. different stores and departments)
  • Robust time-series validation
            – Accounts for gaps and forecast horizon
            – Uses past information only (i.e. no data leakage)
  • Time-series specific feature engineering recipes
            – Date features like day of week, day of month etc.
            – AutoRegressive features like optimal lag and lag-features interaction
            – Different types of exponentially weighted moving averages
            – Aggregation of past information (different time groups and time intervals)
            – Target transformations and differentiation
  • Integration with existing feature engineering functions (recipes and optimization)
  • Automatic pipelines generation (see this blog post)
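
To make a few of these recipes concrete, here is a minimal sketch of date features, lag features, and an exponentially weighted moving average in plain Python (simplified illustrations of the ideas, not Driverless AI's implementation):

```python
from datetime import date

def date_features(d):
    """Calendar features of the kind produced by date-feature recipes."""
    return {"day_of_week": d.weekday(), "day_of_month": d.day, "month": d.month}

def lag(series, k):
    """Shift the series back by k steps; the first k entries have no history."""
    return [None] * k + list(series[:-k])

def ewma(series, alpha=0.5):
    """Exponentially weighted moving average, seeded with the first value."""
    out, prev = [], series[0]
    for x in series:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

weekly_sales = [10.0, 12.0, 11.0, 15.0]
lag1 = lag(weekly_sales, 1)    # previous period's sales as a feature
smoothed = ewma(weekly_sales)  # smoothed demand signal
```

Crucially, each of these features uses only past values, which is what keeps the validation scheme free of data leakage.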

A Typical Example: Sales Forecasting

Below is a typical example of sales forecasting based on Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data:

Data formulated for machine learning:

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant (a new feature in version 1.2) will guide you through the journey.

Similar to previous Driverless AI examples, users need to select the dataset for training/test and define the target. For time-series, users need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (like the Walmart Kaggle competition), users can select the column with specific weights for different samples.

If users prefer to use automatic handling of time groups, they can leave the setting for time groups columns as AUTO.

Expert users can define specific time groups and change other settings as shown below.

Once the experiment is finished, users can make new predictions and download the scoring pipeline just like with any other Driverless AI experiment.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
Joe

Bonus fact: The masterminds behind our time-series recipes are Marios Michailidis and Mathias Müller so internally we call this feature AutoM&M.

H2O.ai and IBM build a Strategic Partnership to bring AI innovation to the market together

We are excited to announce our strategic partnership with IBM, which allows IBM to resell H2O Driverless AI and take it to market for businesses worldwide. This partnership makes AI economical: faster, cheaper, and easier to experiment with. H2O Driverless AI and IBM POWER9 GPU systems bring together best-of-breed AI innovation. We have been working with IBM to port Driverless AI, our latest product addressing the skills gap in data science and trust in AI, to IBM Power Systems. The combination delivers 5X performance for workloads including the new time-series functionality in Driverless AI. Please check out the blog by Sumit Gupta, VP of AI, Machine Learning and HPC at IBM Cognitive Systems, for more details on the partnership.

Outstanding performance on IBM POWER9

HPC is the new PC. To handle the increasingly complex workloads of AI, you need an integrated system of software and hardware that is fully optimized for each other. H2O Driverless AI on IBM POWER9 delivers precisely that. IBM POWER9 supports nearly 2.6x more RAM and 9.5x more I/O bandwidth than comparable systems. It can also support up to 6 V100 GPUs in a single system. Driverless AI is built on top of datatable for Python for data ingest and feature engineering and H2O4GPU for machine learning. We’ve been able to get nearly 2X the data ingest speed and over 50% faster feature engineering. In addition, with the power of GPU-accelerated machine learning, we’re able to deliver nearly 30X speedup on model building. Overall, we’ve been able to accelerate Driverless AI by up to 10X for IID data and up to 5X for time-series data.

The Power of IBM and H2O.ai Solves Customer Challenges

AI will transform businesses as we know them. Companies will leverage AI across multiple business units and establish centers of excellence for AI. From asset price forecasting in capital markets, to supply chain optimization in manufacturing, to personalized insurance policies, no walk of business will be immune to pressures to democratize decision making with AI. IBM and H2O.ai will build the trusted hardware/software co-design needed for enterprises to make that transition. The winners of this partnership are our customers, and their stories will be replicated across the ecosystem.

this will be fun, Sri

AI in Healthcare – Redefining Patient & Physician Experiences

Register for the Meetup Here

Patients, physicians, nurses, health administrators and policymakers are beneficiaries of the rapid transformations in health and life sciences. These transformations are being driven by new discoveries (etiology, therapies, and drugs/implants), market reconfiguration and consolidation, a movement to value-based care, and access/affordability considerations. The people and systems that are driving these changes are generating new engagement models, workflows, data, and most importantly, new needs for all participants in the care continuum.

Analytics 1.0 (driven by business intelligence & reporting) for Healthcare as we describe in our book is inadequate to address these transformations. A retrospective understanding of “what happened?” is limited in its usefulness as it only provides for corrective action – usually driven by resource availability. To improve wellness, care outcomes, clinician satisfaction, and patient quality of life, we ought to be leveraging little and big data via Analytics 2.0 & 3.0. This journey will require leveraging machine/deep learning and other AI methods to separate signal from noise, integrate insights into a workflow, address data fidelity, and develop contextually-intelligent agents.

Automating machine learning and deep learning simplifies access to these advanced technologies by the Humans of Healthcare. They are key pre-requisites to create a data-driven, learning Healthcare organization. The net results – better science, improved access & affordability, and evidence-based wellness/care.

Among others involved in the care continuum, physicians are at the forefront of the coming health sciences revolution. Join our expert, all-physician panel at the H2O offices in Mountain View, CA to hear their expert thoughts and interact with them. Our panel consists of 3 leading physician leaders who are also driving clinical innovations using AI in their specialties & organizations:

  1. Dr. Baber Ghauri, Physician Executive and Healthcare Innovator, Trinity Health

  2. Dr. Esther Yu, Professor & Neuroradiologist, UCSF

  3. Dr. Pratik Mukherjee, Professor, and Director of CIND, San Francisco VA

  4. Moderator: Prashant Natarajan, Sr. Dir. AI Apps at H2O.ai and best-selling author/contributor to books on medical informatics & analytics

We look forward to seeing you in person.

-H2O.ai Team

From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks

Introducing Accelerated Automatic Pipelines in H2O Driverless AI

At H2O, we work really hard to make machine learning fast, accurate, and accessible to everyone. With H2O Driverless AI, users can leverage years of world-class, Kaggle Grand Masters experience and our GPU-accelerated algorithms (H2O4GPU) to produce top quality predictive models in a fully automatic and timely fashion.

In our most recent release (version 1.1), we are going one step further to streamline the deployment process with MOJO (Model ObJect, Optimized). Inherited from our popular H2O-3 platform, MOJO is a highly optimized, low-latency scoring engine that is easily embeddable in any Java environment. With automatic pipeline generation in Driverless AI, users can go from automatic machine learning to production ready in just a few clicks. This blog post illustrates the usage of MOJO in Driverless AI with a simple example.

Easing the Pain Points in a Machine Learning Workflow

In a typical enterprise machine learning workflow, there are many things that could go wrong due to human errors, bad data science practices, different tools/infrastructure, incompatible code, lack of testing, versioning, communication and so on.

Driverless AI is our solution to ease those pain points in the second half of the workflow (i.e., creative feature engineering, model building, and deployment). We strongly believe that most organizations can benefit from automatic machine learning pipelines. A recent PayPal use-case shows that Driverless AI can help produce top quality predictive models with significant time and cost savings.

ml_workflow

With Driverless AI, we are trying to mimic what top data science teams would do when they need to develop a new machine learning pipeline. Below are the four key areas of focus:

1. Exploratory Data Analysis (EDA) with Automatic Visualizations (AutoViz)

AutoViz allows users to gain quick insights from data without the laborious tasks of creating individual plots. It shows users the most interesting graphs automatically based on statistics, and it is designed to work on large datasets efficiently. The mastermind behind AutoViz is our Chief Scientist, Professor Leland Wilkinson of “The Grammar of Graphics” fame.

2. Automatic Feature Engineering and Model Building

We call this part of Driverless AI “Kaggle Grand Masters in a Box”. It is essentially the best data science practices, tricks and creative feature engineering of our Kaggle Grand Masters translated into an artificial intelligence (AI) platform. In other words, it is AI to do AI. On top of that, we make the automatic machine learning process insanely fast on Nvidia GPUs. Our users can benefit from quick turnaround time and top quality predictive models that one would expect from the Kaggle Grand Masters themselves.

3. Machine Learning Interpretability (MLI)

In Driverless AI, we have implemented some of the latest ML interpretation techniques (e.g., LIME, LOCO, ICE, Shapley, PDP, etc.), so our users can go from model building to model interpretation in a seamless fashion. These techniques are crucial for those who must explain their models to regulators or customers. The masterminds behind MLI are my colleagues Patrick Hall, Navdeep Gill, and Mark Chan. Watch their talk about MLI in Driverless AI here.
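
To give a flavor of one of these techniques, here is a minimal sketch of the ICE idea: hold one row fixed, sweep a single feature across a grid of values, and record the model's prediction at each point (the model and feature names below are hypothetical stand-ins, not Driverless AI's implementation):

```python
def ice_curve(model, row, feature, grid):
    """Individual Conditional Expectation: fix one row, vary a single
    feature across a grid, and record the model's prediction each time."""
    curve = []
    for value in grid:
        probe = dict(row)  # copy so the original row is untouched
        probe[feature] = value
        curve.append((value, model(probe)))
    return curve

# Stand-in model and features (hypothetical, for illustration only)
model = lambda r: 2 * r["credit_limit"] + r["age"]
row = {"credit_limit": 100, "age": 30}
curve = ice_curve(model, row, "credit_limit", [0, 50, 100])
```

Plotting such curves for many individual rows shows how the model's prediction responds to one feature, which is exactly the kind of evidence regulators and customers ask for.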

4. Automatic Pipelines Generation – The Focus of this Blog Post

Model deployment remains one of the most common and complex challenges in data analytics. Inherited from our popular H2O-3 platform, MOJO (Model Object, Optimized) is a well-tested, robust technology that is being used by our users and customers at enormous scale. Let me illustrate the MOJO usage with a simple example below.

Credit Card Example

Like many other Driverless AI demos that you may have seen before at H2O World or our webinars, I am going to use the credit card dataset from the UCI Machine Learning Repository for the MOJO example. Let me fast-forward the process to the end of a Driverless AI experiment and focus on the new MOJO options. From version 1.1.0, users have the option to build and download a MOJO for fast, low-latency scoring. Here is a step-by-step walkthrough:

Step 1: Build a MOJO Scoring Pipeline
After the experiment finishes, click on the newly available BUILD MOJO SCORING PIPELINE option. The build process is automatic and should finish within a few minutes.

Step 2: Download and Unzip MOJO
Click on DOWNLOAD MOJO SCORING PIPELINE to download mojo.zip. After unzipping the file, you should be able to see a new folder called mojo-pipeline. The pipeline.mojo and mojo2-runtime.jar in the folder are the two main files you need for the MOJO scoring pipeline.

Step 3: Download Driverless AI License
Another key ingredient for the MOJO pipeline is a valid Driverless AI license. You can download the license.sig file (usually in the license folder) from the machine hosting Driverless AI. Put the license file into the mojo-pipeline folder from the previous step.


Optional Step: Install Java 7 or 8
The MOJO scoring pipeline requires Java 8 (Java 7 is also supported from version 1.1.2). If you have not installed it, please follow the instructions here.

Step 4: A Simple Test Run
In the mojo-pipeline folder, you will find a small example.csv with some data samples. This dataset can be used for a quick test run. Open the folder in a terminal and then run the following command:
bash run_example.sh

Alternatively, run the full command like this:
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

It should return the predictions (the probabilities of default payment in this credit card demo) and the time required to score each sample. Remember, this scoring pipeline includes everything from the complex feature transformations based on our Kaggle Grandmasters’ recipes to computing predictions from the final model ensemble. With MOJO, our users have a low-latency scoring engine that can make new predictions in milliseconds.
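For repeated use, the setup checks from Steps 2 and 3 and the scoring command from Step 4 can be wrapped in one small shell function. This is only a sketch under the file layout described above (pipeline.mojo, mojo2-runtime.jar, and license.sig together in one folder); the `score_mojo` name is my own, not part of the product.

```shell
# score_mojo: hypothetical convenience wrapper around the MOJO command line.
# Usage: score_mojo [mojo-pipeline-folder] [input-csv]
score_mojo() {
  local dir="${1:-mojo-pipeline}" csv="${2:-example.csv}" f
  # Fail early with a clear message if any required file is missing
  # (Step 2 provides the first two files, Step 3 the license).
  for f in pipeline.mojo mojo2-runtime.jar license.sig; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
  # Same invocation as the full command shown in Step 4.
  (cd "$dir" && java -Dai.h2o.mojos.runtime.license.file=license.sig \
       -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo "$csv")
}
```

Calling `score_mojo mojo-pipeline example.csv` then reproduces the manual test run from Step 4.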

Step 5: Create Your Own Scoring Service
Users can, of course, define and program their own scoring services. For more information, please go through the Compile and Run the MOJO from Java section in our Driverless AI documentation.

Conclusions

This blog post gives a quick overview of the automatic pipelines in Driverless AI. The key benefits for our users are:

  1. Immediate increase in productivity – eliminating time wasted on human errors, incompatible code, debugging, etc.
  2. Production ready in a few clicks – seamless integration of complex feature engineering and scoring engine in one MOJO.
  3. An enterprise-grade, low-latency scoring engine that is easily embeddable in any Java environment.

Don’t take my word for it: sign up for a free 21-day trial and try Driverless AI yourself today.

Until next time,
Joe

Note #1: Two years, numerous H2O models, slide decks, events and #360selfies later, I am finally making a return to blogging. I hope you enjoy reading this blog post.

Note #2: H2O is going to Budapest again. Come find me, Erin, and Kuba at eRum conference from May 14 to 16. I will be delivering the “Automatic and Interpretable Machine Learning in R with H2O and LIME” workshop with a real, multimillion-dollar Moneyball Shiny app.

H2O World coming to NYC

Whether you’re just starting to learn how machine learning and H2O.ai can supercharge your business or you’re a veteran looking for more, we want to invite you to join some of the greatest minds in the field to learn how AI and H2O.ai can transform your business. Our flagship event, H2O World, is back and it’s going to be bigger than ever! We’re making our way around the world, with our first stop at The New York Academy of Sciences on June 7th.

You’ll get exclusive access to the brains behind open source H2O, H2O Driverless AI, Sparkling Water, MLI, and more! You’ll even be able to get a hands-on tutorial of our revolutionary Driverless AI platform and learn directly from the people implementing H2O.ai’s solutions to solve some of their companies’ toughest problems.

With an eclectic group of speakers, including product managers, data scientists, customer success managers, and more, we’ve got something for everyone! Don’t miss out on a full day of talks and hands-on sessions. Learn how H2O.ai is democratizing machine learning and transforming businesses across industries, including healthcare, finance, insurance, and more.

Highlights from last year:

Leah Liebler
Marketing @ H2O

Democratize care with AI — AI to do AI for Healthcare

Very excited to have Prashant Natarajan (@natarpr) join us, along with Sanjay Joshi, in our vision to change the world of healthcare with AI. Health is wealth, and the one most worth saving. They bring invaluable domain knowledge and context to our cause.

As one of our customers likes to say, healthcare should be optimized for health and outcomes for the ones in need of care. Health / Care, as in health divided by care: how healthy can one be with the least amount of care? We are investing in health because it is the right thing to do over the long term, especially with the convergence of Finance, Life Insurance, and Retail towards Health. So many opportunities for cross-pollination!

With our strong ecosystem, community, and customers’ support, H2O.ai will democratize care with AI: make it faster, cheaper, easier, and accessible to all. Machine learning touches lives; with domain scientists on our side, we can accelerate change for the problems most in need. We are fortunate to have the team and culture that allow us to bring great products to the marketplace with high velocity. Stay tuned for Driverless AI for Health, one microservice AI model at a time!

As you feel inspired by the immense opportunities to serve humanity, please join Prashant Natarajan and the www.h2o.ai community on our mission!

This will be fun!
Sri