How to Frame Your Business Problem for Automatic Machine Learning

Over the last several years, machine learning has become an integral part of many organizations’ decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai has developed a product called Driverless AI that automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, and model explanation. In this post, I will describe Driverless AI, how you can properly frame your business problem to get the most out of this automatic machine learning product, and how automatic machine learning is used to create business value.

What is Driverless AI and what kind of business problems does it solve?

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI is currently targeting business applications like loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)

How do you frame business problems in a data set for Driverless AI?

The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment, or financial transaction. That row must also contain information about what you will be trying to predict using similar data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon described by the provided data set. Driverless AI can handle simple data quality problems, but it currently requires all data for a single predictive model to be in the same data set and that data set must have already undergone standard ETL, cleaning, and normalization routines before being loaded into Driverless AI.
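To make the "one entity per row, with a label" requirement concrete, here is a minimal sketch in plain Python. The transaction log, the column names, and the helper one_row_per_customer are all hypothetical and exist only for illustration; this is the kind of preparation you would do before loading data, not something Driverless AI itself provides.

```python
from collections import defaultdict

# Hypothetical raw event log: several rows per customer, so not yet model-ready.
transactions = [
    {"customer_id": "c1", "amount": 20.0, "used_promotion": False},
    {"customer_id": "c1", "amount": 35.5, "used_promotion": True},
    {"customer_id": "c2", "amount": 12.0, "used_promotion": False},
    {"customer_id": "c3", "amount": 50.0, "used_promotion": False},
    {"customer_id": "c3", "amount": 8.0, "used_promotion": False},
]


def one_row_per_customer(events):
    """Collapse an event log into one labeled row per customer."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["customer_id"]].append(event)

    rows = []
    for customer_id in sorted(grouped):
        customer_events = grouped[customer_id]
        amounts = [e["amount"] for e in customer_events]
        rows.append({
            "customer_id": customer_id,
            "n_transactions": len(amounts),
            "total_amount": sum(amounts),
            # Label column: did this customer use the promotion at least once?
            "used_promotion": int(any(e["used_promotion"] for e in customer_events)),
        })
    return rows


labeled_rows = one_row_per_customer(transactions)
for row in labeled_rows:
    print(row)
```

Each resulting row describes exactly one customer and carries the column to be predicted, which is the shape described above.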

How do you use Driverless AI results to create commercial value?

Commercial value is generated by Driverless AI in a few ways.

● Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using automation and state-of-the-art computing power to accomplish tasks in just minutes or hours that can take humans months.

● As in many other industries, automation leads to standardization of business processes, enforces best practices, and eventually drives down the cost of delivering the final product – in this case a predictive model.

● Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process. In large organizations, value from predictive modeling is typically realized when a predictive model is moved from a data analyst’s or data scientist’s development environment into a production deployment setting where the model is running on live data, making decisions quickly and automatically that make or save money. Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.

Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

Customer success stories with Driverless AI

PayPal tried Driverless AI on a collusion fraud use case and found that, running on a laptop for just 2 hours, Driverless AI yielded impressive fraud detection accuracy; running on GPU-enhanced hardware, it produced the same accuracy in just 20 minutes. The Driverless AI model was more accurate than PayPal’s existing predictive model, and the Driverless AI system found the same insights in their data that their data scientists did. The system also found new features in their data that had not been used before for predictive modeling.

G5, a real estate marketing optimization firm, uses Driverless AI in their Intelligent Marketing Cloud to assist clients in targeted marketing spending for property management. Empowered by Driverless AI technology, marketers can quickly prioritize and convert highly qualified inbound leads from G5’s Intelligent Marketing Cloud platform with 95 percent accuracy for serious purchase intent. To learn more about how G5 uses Driverless AI check out:
https://www.h2o.ai/g5-h2o-ai-partner-to-deliver-ai-optimization-for-real-estate-marketing/

How can you try Driverless AI?

Visit: https://www.h2o.ai/driverless-ai/ and download your free 21-day evaluation copy.

We are happy to help you get started installing and using Driverless AI, and here are some resources we’ve put together to help with that process:

● Installing Driverless AI: https://www.youtube.com/watch?v=swrqej9tFcU

● Launching an Experiment with Driverless AI: https://www.youtube.com/watch?v=bw6CbZu0dKk

● Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf

● Driverless AI Documentation: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html

H2O.ai Releases H2O4GPU, the Fastest Collection of GPU Algorithms on the Market, to Expedite Machine Learning in Python

H2O4GPU is an open-source collection of GPU solvers created by H2O.ai. It builds on the easy-to-use scikit-learn Python API and its well-tested CPU-based algorithms. It can be used as a drop-in replacement for scikit-learn with support for GPUs on selected (and ever-growing) algorithms. H2O4GPU inherits all the existing scikit-learn algorithms and falls back to CPU algorithms when the GPU algorithm does not support an important existing scikit-learn class option. It utilizes the efficient parallelism and high throughput of GPUs. Additionally, GPUs allow the user to complete training and inference much faster than possible on ordinary CPUs.

Today, select algorithms are GPU-enabled. These include Gradient Boosting Machines (GBMs), Generalized Linear Models (GLMs), and K-Means Clustering. Using H2O4GPU, users can unlock the power of GPUs through the scikit-learn API that many already use today. In addition to the scikit-learn Python API, an R API is in development.

Here are specific benchmarks from a recent H2O4GPU test:

  • More than 5X faster on GPUs as compared to CPUs
  • Nearly 10X faster on GPUs
  • More than 40X faster on GPUs

“We’re excited to release these lightning-fast H2O4GPU algorithms and continue H2O.ai’s foray into GPU innovation,” said Sri Ambati, co-founder and CEO of H2O.ai. “H2O4GPU democratizes industry-leading speed, accuracy and interpretability for scikit-learn users from all over the globe. This includes enterprise AI users who were previously too busy building models to have time for what really matters: generating revenue.”

“The release of H2O4GPU is an important milestone,” said Jim McHugh, general manager and vice president at NVIDIA. “Delivered as part of an open-source platform it brings the incredible power of acceleration provided by NVIDIA GPUs to widely-used machine learning algorithms that today’s data scientists have come to rely upon.”

H2O4GPU’s release follows the launch of Driverless AI, H2O.ai’s fully automated solution that handles data science operations — data preparation, algorithms, model deployment and more — for any business needing world-class AI capability in a single product. Built by top-ranking Kaggle Grandmasters, Driverless AI is essentially an entire data science team baked into one application.

Following is some information on each GPU enabled algorithm as well as a roadmap.

Generalized Linear Model (GLM)

  • Framework utilizes Proximal Graph Solver (POGS)
  • Solvers include Lasso, Ridge Regression, Logistic Regression, and Elastic Net Regularization
  • Improvements to original implementation of POGS:
    • Full alpha search
    • Cross Validation
    • Early Stopping
    • Added scikit-learn-like API
    • Supports multiple GPUs
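As a rough illustration of what an alpha search explores, the sketch below evaluates an elastic net penalty over a grid of alpha values. It assumes the common glmnet-style parameterization, in which alpha = 0 gives a pure Ridge penalty and alpha = 1 a pure Lasso penalty; the exact objective and scaling used inside H2O4GPU are not given in this post, and the function name and coefficient values are made up for the example.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Penalty = lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


coefficients = [0.5, -2.0, 0.0, 1.5]  # made-up values for illustration

# A real alpha search would refit the model for every alpha on the grid.
# Here we only evaluate the penalty term to show how alpha mixes L1 and L2.
alpha_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
penalties = {
    alpha: elastic_net_penalty(coefficients, lam=0.1, alpha=alpha)
    for alpha in alpha_grid
}

for alpha, penalty in penalties.items():
    print(f"alpha={alpha:.2f} penalty={penalty:.4f}")
```

In practice each alpha would be paired with a path of lambda values, and cross validation would pick the best combination.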

Gradient Boosting Machines (Please check out Rory’s blog on Nvidia Dev Blogs for a more detailed write-up on Gradient Boosted Trees on GPUs)

  • Based on XGBoost
  • Raw floating point data — binned into quantiles
  • Quantiles are stored in compressed form instead of as floats
  • Compressed quantiles are efficiently transferred to GPU
  • Sparsity is handled directly with high GPU efficiency
  • Multi-GPU enabled by sharing rows using NVIDIA NCCL AllReduce
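The binning idea in the list above can be sketched in a few lines of plain Python. This is not XGBoost's implementation (which uses its own quantile sketch and packing scheme); it is a simplified illustration that assumes at most 256 bins so that each value fits in one unsigned byte, and the helper names are hypothetical.

```python
from array import array
from bisect import bisect_right


def quantile_cut_points(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def bin_values(values, cut_points):
    """Map each float to its bin index, stored as one unsigned byte per value."""
    return array("B", (bisect_right(cut_points, v) for v in values))


feature = [0.12, 5.4, 3.3, 9.8, 7.1, 2.2, 4.0, 8.5]  # made-up raw feature column
cuts = quantile_cut_points(feature, n_bins=4)
binned = bin_values(feature, cuts)

bytes_as_floats = array("d", feature).itemsize * len(feature)
bytes_as_bins = binned.itemsize * len(binned)
print("cut points:", cuts)
print("bin indices:", list(binned))
print("bytes as floats:", bytes_as_floats, "bytes as bins:", bytes_as_bins)
```

Storing small integer bin indices rather than 8-byte floats is what makes the data cheaper to move to the GPU.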

k-Means Clustering

  • Based on NVIDIA prototype of k-Means algorithm in CUDA
  • Improvements to original implementation:
    • Significantly faster than scikit-learn implementation (50x) and other GPU implementations (5-10x)
    • Supports multiple GPUs

H2O4GPU combines the power of GPU acceleration with H2O’s parallel implementation of popular algorithms, taking computational performance levels to new heights.

Driverless AI Blog

In today’s market, there aren’t enough data scientists to satisfy the growing demand for people in the field. With many companies moving towards automating processes across their businesses (everything from HR to Marketing), companies are forced to compete for the best data science talent to meet their needs. A report by McKinsey says that based on 2018 job market predictions: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” H2O’s Driverless AI addresses this gap by democratizing data science and making it accessible to non-experts, while simultaneously increasing the efficiency of expert data scientists. Its point-and-click UI minimizes the complicated legwork that precedes the actual model build.

Driverless AI is designed to take a raw dataset and run it through a proprietary algorithm that automates the data exploration/feature engineering process, which typically takes ~80% of a data scientist’s time. It then auto-tunes model parameters and provides the user with the model that yields the best results. Therefore, experienced data scientists are spending far less time engineering new features and can focus on drawing actionable insights from the models Driverless AI builds. Lastly, the user can see visualizations generated by the Machine Learning Interpretability (MLI) component of Driverless AI to clarify the model results and the effect of changing variables’ values. The MLI feature eliminates the black box nature of machine learning models and provides clear and straightforward results from a model as well as how changing features will alter results.

Driverless AI is also GPU-enabled, which can result in speedups of up to 40x. We demonstrated GPU acceleration achieving those speedups for machine learning algorithms at GTC in May 2017. We’ve ported XGBoost, GLM, K-Means and other algorithms to GPUs to achieve significant performance gains. This enables Driverless AI to run thousands of iterations to find the most accurate feature transforms and models.

The automatic nature of Driverless AI leads to increased accuracy. AutoDL engineers new features mechanically, and AutoML finds the right algorithms and tunes them to create the perfect ensemble of models. You can think of it as a Kaggle Grandmaster in a box. To demonstrate the power of Driverless AI, we entered it in a number of Kaggle contests. Out of the box, Driverless AI performed nearly as well as the best Kaggle Grandmasters.

Let’s look at an example: we are going to work with a credit card dataset and predict whether or not a person is going to default on their payment next month based on a set of variables related to their payment history. After simply choosing the variable we are predicting and the number of iterations we’d like to run, we launch our experiment.

As the experiment cycles through iterations, it creates a variable importance chart ranking existing and newly created features by their effect on the model’s accuracy.

In this example, AutoDL creates a feature that represents the cross validation target encoding of the variables sex and education. In other words, if we group everyone who is of the same sex and who has the same level of education in this dataset, the resulting feature would help in predicting whether or not the customer is going to default on their payment next month. Generating features like this one usually takes the majority of a data scientist’s time, but Driverless AI automates this process for the user.
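For readers who have not met cross validation target encoding before, here is a minimal sketch of the general technique in plain Python. Driverless AI's own implementation is proprietary and is not shown in this post, so the function cv_target_encode, the fold assignment, and the tiny dataset are assumptions made purely for illustration.

```python
from collections import defaultdict


def cv_target_encode(keys, targets, n_folds=3):
    """Out-of-fold mean target encoding.

    Each row is encoded with the mean target of rows that share its key in the
    OTHER folds, so a row never sees its own label.
    """
    n = len(keys)
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for i in range(n):
            if fold_of[i] != fold:
                sums[keys[i]] += targets[i]
                counts[keys[i]] += 1
        seen = sum(counts.values())
        # Keys unseen in the other folds fall back to those folds' overall mean.
        fallback = sum(sums.values()) / seen if seen else 0.0
        for i in range(n):
            if fold_of[i] == fold:
                key = keys[i]
                encoded[i] = sums[key] / counts[key] if key in counts else fallback
    return encoded


# Made-up rows: (sex, education, defaulted next month)
rows = [
    ("female", "university", 0),
    ("female", "university", 1),
    ("male", "graduate", 1),
    ("female", "university", 0),
    ("male", "graduate", 1),
    ("male", "high school", 0),
]
keys = [(sex, education) for sex, education, _ in rows]
targets = [default for _, _, default in rows]
encoded = cv_target_encode(keys, targets)
print(encoded)
```

The new column replaces the (sex, education) pair with a number summarizing how often similar customers defaulted, computed without leaking each row's own outcome.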

After AutoDL generates new features, we run the updated dataset through AutoML. At this point, Driverless AI builds a series of models using various algorithms and delivers a leaderboard ranking the success of each model. The user can then inspect and choose the model that best fits their needs.

Lastly, we can use the Machine Learning Interpretability feature to get clear and concise explanations of our model results. Four dynamic graphs are generated automatically: KLIME, Variable Importance, Decision Tree Surrogate Model, and Partial Dependence Plot. Each one helps the user explore the model output more closely. KLIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate GLMs on samples formed from K-Means clusters in the training data. All penalized GLM surrogates are trained to model the predictions of the Driverless AI model. The Variable Importance chart measures the effect that a variable has on the predictions of a model, while the Partial Dependence Plot shows the effect of changing one variable on the outcome. The Decision Tree Surrogate Model clarifies the Driverless AI model by displaying an approximate flow-chart of the complex Driverless AI model’s decision-making process. It also displays the most important variables and the most important interactions in the Driverless AI model. Lastly, the Explanations button gives the user a plain English sentence about how each variable affects the model.

All of these graphs can be used to visualize and debug the Driverless AI model by comparing the displayed decision-process, important variables, and important interactions to known standards, domain knowledge, and reasonable expectations.
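To give a feel for what a Partial Dependence Plot computes, here is a small sketch of the general technique in plain Python. It is not the MLI component's code: toy_model is a made-up stand-in for a trained model, and the feature names and rows are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in every row."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Stand-in for a trained model: returns a made-up default score in [0, 1]."""
    score = 0.1 + 0.2 * row["months_late"] + (0.1 if row["limit"] < 1000 else 0.0)
    return min(score, 1.0)


rows = [
    {"months_late": 0, "limit": 500},
    {"months_late": 1, "limit": 5000},
    {"months_late": 3, "limit": 2000},
]
curve = partial_dependence(toy_model, rows, "months_late", grid=[0, 1, 2])
for value, mean_prediction in curve:
    print(f"months_late={value} mean prediction={mean_prediction:.3f}")
```

Plotting the resulting pairs gives the partial dependence curve: how the average prediction moves as one variable changes while everything else is held as observed.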

Driverless AI streamlines the machine learning workflow for inexperienced and expert users alike.