How to Frame Your Business Problem for Automatic Machine Learning

Over the last several years, machine learning has become an integral part of many organizations' decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai has developed a product called Driverless AI that automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, and model explanation. In this post, I will describe Driverless AI, how you can properly frame your business problem to get the most out of this automatic machine learning product, and how automatic machine learning is used to create business value.

What is Driverless AI and what kind of business problems does it solve?

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI is currently targeting business applications like loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)

How do you frame business problems in a data set for Driverless AI?

The data that is read into Driverless AI must contain one entity per row, like a customer, patient, piece of equipment, or financial transaction. That row must also contain information about what you will be trying to predict using similar data in the future, like whether that customer in the row of data used a promotion, whether that patient was readmitted to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon described by the provided data set. Driverless AI can handle simple data quality problems, but it currently requires all data for a single predictive model to be in the same data set and that data set must have already undergone standard ETL, cleaning, and normalization routines before being loaded into Driverless AI.
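To make the required shape concrete, here is a minimal Python sketch (with hypothetical column names) of "labeled" data for a churn problem: one customer per row, plus the column you want to predict.

import pandas as pd

# Hypothetical "labeled" data for churn prediction: one customer per row,
# plus a "churned" column that Driverless AI would learn to predict.
customers = pd.DataFrame({
    "customer_id":     [101, 102, 103],
    "tenure_months":   [34, 2, 61],
    "monthly_charges": [70.35, 29.90, 99.10],
    "used_promotion":  [True, False, True],
    "churned":         [0, 1, 0],  # the label
})
print(customers)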

How do you use Driverless AI results to create commercial value?

Commercial value is generated by Driverless AI in a few ways.

● Driverless AI empowers data scientists or data analysts to work on projects faster and more efficiently by using automation and state-of-the-art computing power to accomplish tasks in just minutes or hours that can take humans months.

● Like in many other industries, automation leads to standardization of business processes, enforces best practices, and eventually drives down the cost of delivering the final product – in this case a predictive model.

● Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process. In large organizations, value from predictive modeling is typically realized when a predictive model is moved from a data analyst's or data scientist's development environment into a production deployment setting, where the model runs on live data, quickly and automatically making decisions that make or save money. Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.

Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

Customer success stories with Driverless AI

PayPal tried Driverless AI on a collusion-fraud use case and found that, running on a laptop for just 2 hours, Driverless AI yielded impressive fraud detection accuracy; running on GPU-enhanced hardware, it was able to produce the same accuracy in just 20 minutes. The Driverless AI model was more accurate than PayPal's existing predictive model, and the Driverless AI system found the same insights in their data that their data scientists did! The system also found new features in their data that had not been used before for predictive modeling. For more information about the PayPal use case, click here.

G5, a real estate marketing optimization firm, uses Driverless AI in their Intelligent Marketing Cloud to assist clients in targeted marketing spending for property management. Empowered by Driverless AI technology, marketers can quickly prioritize and convert highly qualified inbound leads from G5's Intelligent Marketing Cloud platform with 95 percent accuracy for serious purchase intent. To learn more about how G5 uses Driverless AI, check out:
https://www.h2o.ai/g5-h2o-ai-partner-to-deliver-ai-optimization-for-real-estate-marketing/

How can you try Driverless AI?

Visit: https://www.h2o.ai/driverless-ai/ and download your free 21-day evaluation copy.

We are happy to help you get started installing and using Driverless AI, and here are some resources we've put together to help you in that process:

● Installing Driverless AI: https://www.youtube.com/watch?v=swrqej9tFcU

● Launching an Experiment with Driverless AI: https://www.youtube.com/watch?v=bw6CbZu0dKk

● Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf

● Driverless AI Documentation: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html

Time is Money! Automate Your Time-Series Forecasts with Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics. There are many real-world applications: sales, weather, the stock market, and energy demand, just to name a few. We strongly believe that automation can help our users deliver business value in a timely manner. Therefore, once again we translated our Kaggle Grand Masters' time-series recipes into our automatic machine learning platform, Driverless AI (version 1.2). This blog post introduces the new time-series functionality with a simple sales forecasting example.

The key features/recipes that make automation possible are listed below (a small hand-rolled sketch of the lag and moving-average recipes follows the list):

  • Automatic handling of time groups (e.g. different stores and departments)
  • Robust time-series validation
            – Accounts for gaps and forecast horizon
            – Uses past information only (i.e. no data leakage)
  • Time-series specific feature engineering recipes
            – Date features like day of week, day of month, etc.
            – Autoregressive features like optimal lag and lag-features interaction
            – Different types of exponentially weighted moving averages
            – Aggregation of past information (different time groups and time intervals)
            – Target transformations and differentiation
  • Integration with existing feature engineering functions (recipes and optimization)
  • Automatic pipelines generation (see this blog post)
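Here is a minimal pandas sketch (not Driverless AI's internal code; the data values are illustrative) of two of these recipes: lag features and exponentially weighted moving averages computed per time group, using past information only.

import pandas as pd

# Illustrative weekly sales for two (store, dept) time groups.
df = pd.DataFrame({
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "dept":  [1, 1, 1, 1, 1, 1, 1, 1],
    "week":  pd.to_datetime(["2012-01-06", "2012-01-13",
                             "2012-01-20", "2012-01-27"] * 2),
    "sales": [24924.5, 46039.5, 41595.6, 19403.5,
              35034.1, 29335.5, 27342.7, 31113.4],
}).sort_values(["store", "dept", "week"])

grp = df.groupby(["store", "dept"])["sales"]
df["sales_lag_1"] = grp.shift(1)                    # last week's sales
df["sales_ewma"] = grp.transform(
    lambda s: s.shift(1).ewm(alpha=0.5).mean())     # EWMA of past values only
df["day_of_week"] = df["week"].dt.dayofweek         # simple date feature
print(df)

The shift(1) before every rolling computation is what guarantees "past information only", i.e., no data leakage.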

A Typical Example: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data:

Data formulated for machine learning:

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant (a new feature in version 1.2) will guide you through the journey.

Similar to previous Driverless AI examples, users need to select the datasets for training/test and define the target. For time series, users need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (as in the Walmart Kaggle competition), users can select the column with specific weights for different samples.

If users prefer to use automatic handling of time groups, they can leave the setting for time groups columns as AUTO.

Expert users can define specific time groups and change other settings as shown below.

Once the experiment is finished, users can make new predictions and download the scoring pipeline just like in any other Driverless AI experiment.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
Joe

Bonus fact: The masterminds behind our time-series recipes are Marios Michailidis and Mathias Müller, so internally we call this feature AutoM&M.

From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks

Introducing Accelerated Automatic Pipelines in H2O Driverless AI

At H2O, we work really hard to make machine learning fast, accurate, and accessible to everyone. With H2O Driverless AI, users can leverage years of world-class Kaggle Grand Master experience and our GPU-accelerated algorithms (H2O4GPU) to produce top-quality predictive models in a fully automatic and timely fashion.

In our most recent release (version 1.1), we are going one step further to streamline the deployment process with MOJO (Model ObJect, Optimized). Inherited from our popular H2O-3 platform, MOJO is a highly optimized, low-latency scoring engine that is easily embeddable in any Java environment. With automatic pipeline generation in Driverless AI, users can go from automatic machine learning to production ready in just a few clicks. This blog post illustrates the usage of MOJO in Driverless AI with a simple example.

Easing the Pain Points in a Machine Learning Workflow

In a typical enterprise machine learning workflow, there are many things that can go wrong due to human error, bad data science practices, different tools/infrastructure, incompatible code, and a lack of testing, versioning, and communication.

Driverless AI is our solution to ease those pain points in the second half of the workflow (i.e., creative feature engineering, model building, and deployment). We strongly believe that most organizations can benefit from automatic machine learning pipelines. A recent PayPal use case shows that Driverless AI can help produce top-quality predictive models with significant time and cost savings.


With Driverless AI, we are trying to mimic what top data science teams would do when they need to develop a new machine learning pipeline. Below are the four key areas of focus:

1. Exploratory Data Analysis (EDA) with Automatic Visualizations (AutoViz)

AutoViz allows users to gain quick insights from data without the laborious tasks of creating individual plots. It shows users the most interesting graphs automatically based on statistics, and it is designed to work on large datasets efficiently. The mastermind behind AutoViz is our Chief Scientist, Professor Leland Wilkinson of “The Grammar of Graphics” fame.

2. Automatic Feature Engineering and Model Building

We call this part of Driverless AI “Kaggle Grand Masters in a Box”. It is essentially the best data science practices, tricks, and creative feature engineering of our Kaggle Grand Masters translated into an artificial intelligence (AI) platform. In other words, it is AI to do AI. On top of that, we make the automatic machine learning process insanely fast on NVIDIA GPUs. Our users benefit from quick turnaround time and the top-quality predictive models that one would expect from the Kaggle Grand Masters themselves.

3. Machine Learning Interpretability (MLI)

In Driverless AI, we have implemented some of the latest ML interpretation techniques (e.g., LIME, LOCO, ICE, Shapley values, PDP, etc.), so our users can go from model building to model interpretation in a seamless fashion. These techniques are crucial for those who must explain their models to regulators or customers. The masterminds behind MLI are my colleagues Patrick Hall, Navdeep Gill, and Mark Chan. Watch their talk about MLI in Driverless AI here.
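To give a flavor of what two of these techniques compute, here is a minimal hand-rolled Python sketch of ICE and partial dependence (illustrative only, not Driverless AI's implementation): vary one feature over a grid while holding each row's other features fixed, and record the model's predictions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

# ICE: one curve per row; sweep `feature` over the grid, keep the rest fixed.
ice = np.empty((X.shape[0], grid.size))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, feature] = value
    ice[:, j] = model.predict_proba(X_mod)[:, 1]

pdp = ice.mean(axis=0)  # partial dependence = average of the ICE curves
print(pdp.round(3))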

4. Automatic Pipelines Generation – The Focus of this Blog Post

Model deployment remains one of the most common and complex challenges in data analytics. Inherited from our popular H2O-3 platform, MOJO is a well-tested, robust technology that is being used by our users and customers at enormous scale. Let me illustrate the MOJO usage with a simple example below.

Credit Card Example

Like many other Driverless AI demos that you may have seen at H2O World or in our webinars, I am going to use the credit card dataset from the UCI machine learning repository for the MOJO example. Let me fast-forward to the end of a Driverless AI experiment and focus on the new MOJO options. From version 1.1.0, users have the option to build and download a MOJO for fast, low-latency scoring. Here is a step-by-step walkthrough:

Step 1: Build a MOJO Scoring Pipeline
After the experiment, click on the newly available option BUILD MOJO SCORING PIPELINE. The build process is automatic and should finish within a few minutes.

Step 2: Download and Unzip MOJO
Click on DOWNLOAD MOJO SCORING PIPELINE to download mojo.zip. After unzipping the file, you should be able to see a new folder called mojo-pipeline. The pipeline.mojo and mojo2-runtime.jar in the folder are the two main files you need for the MOJO scoring pipeline.

Step 3: Download Driverless AI License
Another key ingredient for the MOJO pipeline is a valid Driverless AI license. You can download the license.sig file (usually in the license folder) from the machine hosting Driverless AI. Put the license file into the mojo-pipeline folder from the previous step.


Optional Step: Install Java 7 or 8
The MOJO scoring pipeline requires Java 8 (from version 1.1.2, Java 7 or 8). If you have not installed it, please follow the instructions here.

Step 4: A Simple Test Run
In the mojo-pipeline folder, you will find a small example.csv with some data samples. This dataset can be used for a quick test run. Open the folder in a terminal and then run the following command: bash run_example.sh

Alternatively, run the full command like this:
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

It should return the predictions (the probabilities of default payment in this credit card demo) and the time required for scoring each sample. Remember that this scoring pipeline includes everything from complex feature transformations based on Kaggle Grand Masters' recipes to computing predictions from the final model ensemble. With MOJO, our users have a low-latency scoring engine that can make new predictions in milliseconds.

Step 5: Create Your Own Scoring Service
Users can, of course, define and program their own scoring services. For more information, please go through the Compile and Run the MOJO from Java section in our Driverless AI documentation.
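If you work from Python rather than Java, one quick (if unsophisticated) option is to drive the documented command line from another process. Here is a minimal sketch that wraps the exact command shown in Step 4; the paths are assumptions for illustration.

import subprocess

# Score a CSV by shelling out to the MOJO runtime, reusing the Step 4 command.
# The mojo-pipeline paths below are illustrative assumptions.
def score_csv(csv_path, mojo_dir="mojo-pipeline"):
    cmd = [
        "java",
        "-Dai.h2o.mojos.runtime.license.file=%s/license.sig" % mojo_dir,
        "-cp", "%s/mojo2-runtime.jar" % mojo_dir,
        "ai.h2o.mojos.ExecuteMojo",
        "%s/pipeline.mojo" % mojo_dir,
        csv_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout  # the predictions printed by ExecuteMojo

print(score_csv("mojo-pipeline/example.csv"))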

Conclusions

This blog post gives a quick overview of the automatic pipelines in Driverless AI. The key benefits for our users are:

  1. Immediate increase in productivity – eliminating time wasted on human errors, incompatible code, debugging, etc.
  2. Production ready in a few clicks – seamless integration of complex feature engineering and scoring engine in one MOJO.
  3. An enterprise-grade, low-latency scoring engine that is easily embeddable in any Java environment.

Don't take my word for it; sign up for a free 21-day trial and try Driverless AI yourself today.

Until next time,
Joe

Note #1: Two years, numerous H2O models, slide decks, events and #360selfies later, I am finally making a return to blogging. I hope you enjoy reading this blog post.

Note #2: H2O is going to Budapest again. Come find me, Erin, and Kuba at the eRum conference from May 14 to 16. I will be delivering the “Automatic and Interpretable Machine Learning in R with H2O and LIME” workshop with a real, multimillion-dollar Moneyball Shiny app.

Come meet the Makers!

NVIDIA's GPU Technology Conference (GTC) Silicon Valley, March 26-29, is the premier AI and deep learning event, providing you with training, insights, and direct access to the industry's best and brightest. It's where you will see the latest breakthroughs in self-driving cars, smart cities, healthcare, high-performance computing, virtual reality, and more, all because of the power of AI. H2O.ai will be there in full force to share how you can immediately gain value and insights from our industry-leading AI and ML platforms. In case you hadn't heard, H2O.ai was named a leader in the 2018 Gartner Magic Quadrant for Data Science and Machine Learning Platforms. You can get the report here.

Please visit us at booth #725 to see Driverless AI in action and talk to the Makers leading the AI movement! Our sessions will be leading edge talks that you won’t want to miss.

  1. Ashrith Barthur – Network Security with Machine Learning

    Ashrith will speak about modeling different kinds of cyber attacks and building a model that is able to identify these different kinds of attacks using machine learning.

    Room 210F – Wednesday, 28 March, 9 AM to 9:50 AM.

  2. Jonathan McKinney – World’s Fastest Machine Learning with GPUs

    Jonathan will introduce H2O4GPU, a fully featured machine learning library that is optimized for GPUs with a robust Python API that is a drop-in replacement for scikit-learn. He will demonstrate benchmarks for the most common algorithms relevant to enterprise AI and will showcase performance gains as compared to running on CPUs.

    Room 220B – Thursday, March 29, 11 AM to 11:50 AM.

  3. Arno Candel – Hands-on with Driverless AI

    In this lab, Arno will show how to install and start Driverless AI, the automated Kaggle-Grandmaster-in-a-box software, on a multi-GPU box. He will go through the full end-to-end workflow and showcase how Driverless AI uses the power of GPUs to achieve 40x speedups on algorithms, which in turn allows it to run thousands of iterations and find the best model.

    Room LL21C – Thursday, March 29, 4 PM to 6 PM.

Can’t make it to the event? Schedule a time to talk to one of our makers!

How Driverless AI Prevents Overfitting and Leakage

By Marios Michailidis, Competitive Data Scientist, H2O.ai

In this post, I’ll provide an overview of overfitting, k-fold cross-validation, and leakage. I’ll also explain how Driverless AI avoids overfitting and leakage.

An Introduction to Overfitting

A common pitfall that causes machine learning models to fail when tested in a real-world environment is overfitting. In Driverless AI, we take special measures in various stages of the modeling process to ensure that overfitting is prevented and that all models produced are generalizable and give consistent results on test data.

Before highlighting the different measures used to prevent overfitting within Driverless AI, we should first define overfitting. According to Oxford Dictionaries, it is “the production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.” Overfitting is often explained in connection with two other terms, namely bias and variance. Consider the following diagram, which shows how the training and test errors typically “move” throughout the modeling process.

When an algorithm starts learning from the data, any relationships it finds are (still) basic and simple. In the top left of the chart, both errors, training (teal) and test (red), are quite high because the model has not yet uncovered much information about the data and makes basic/biased (and quite erroneous) assumptions. These simple assumptions cause high errors due to (high) bias. At this point, the model is not sensitive to fluctuations in the training data because its logic is still very simple. This low sensitivity to fluctuations is often referred to as having low variance. This whole initial stage can be described as underfitting: a state in the modeling process that causes errors because the model is still very basic and does not live up to its potential. Underfitting is not deemed as serious as overfitting because it sits at the beginning of the modeling process, whereas the modeler (often greedily) tends to keep maximizing performance and ends up overdoing it.

As the model keeps learning, the training error decreases, as does the error on the test data. That is because the relationships found in the training data are still significant and generalizable. However, after a while, even though the training error keeps decreasing at a fast pace, the test error does not follow. The algorithm has exhausted all significant information within the training data and starts modeling noise. This state, where the training error keeps decreasing but the test error gets worse (increases), is called overfitting: the model is learning more than it should and is sensitive to tiny fluctuations in the training data, hence characterized by high variance. Ideally, the learning process should stop at the point where the red line is at its lowest or optimal point, somewhere in the middle of the graph. How quickly or slowly this point comes depends on various factors.
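To make this concrete, here is a small self-contained Python sketch (an illustration, not part of Driverless AI) that reproduces the pattern: as tree depth grows, the training error keeps falling while the test error eventually turns back up.

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine wave: deeper trees eventually fit the noise (overfitting).
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in [1, 2, 4, 8, 16]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print("depth=%2d  train MSE=%.3f  test MSE=%.3f" % (
        depth,
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te))))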

In order to get the most out of your predictions, it becomes imperative to control overfitting and underfitting. Since Driverless AI strives to build very predictive models, it has built-in measures to control both. These measures, covering both existing and upcoming features, are analyzed in the following sections.

Check similarity of train and test datasets
This step is conditional on a test dataset being provided, which is not always the case. If a test dataset is provided during the training procedure, various checks take place to ensure that the training set is consistent with the test set. This happens both at a univariate level and at a global level. At the univariate level, the distribution of each variable in the training data is checked against the equivalent one in the test data. If a big difference is detected, the column is omitted and a warning message is displayed.

To illustrate the problem, imagine having a variable that measures the age in years for some customers. If the distribution for train and test data resembles the one below, even a very basic model would start to overfit very quickly.

This happens because the model will assume that a given customer is between 40 and 85 years old, yet it is tasked with making predictions for customers who are much younger. Such a model would most probably fail to generalize well on the test data, and the more it relies on age, the worse its performance on the test data will be.

To detect this problem, Driverless AI fits models to determine whether certain values of some features tend to appear more in the train or the test data. If such relationships are found and deemed significant, the feature is omitted and/or a warning message is displayed to alert the user to the discrepancy.
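A hand-rolled approximation of this idea, often called adversarial validation (illustrative only; not Driverless AI's actual code), is to label each row by its origin and fit a classifier to separate train from test; features that make the separation easy are the drifting ones.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical train/test frames reproducing the age drift described above.
rng = np.random.RandomState(0)
train = pd.DataFrame({"age": rng.uniform(40, 85, 1000),
                      "income": rng.normal(50000, 10000, 1000)})
test = pd.DataFrame({"age": rng.uniform(18, 40, 1000),   # much younger!
                     "income": rng.normal(50000, 10000, 1000)})

X = pd.concat([train, test], ignore_index=True)
is_test = np.r_[np.zeros(len(train)), np.ones(len(test))]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, is_test, cv=3, scoring="roc_auc").mean()
print("train-vs-test AUC: %.2f" % auc)  # ~0.5 = consistent; near 1.0 = drift

clf.fit(X, is_test)
print(dict(zip(X.columns, clf.feature_importances_.round(2))))  # age dominates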

K-Fold Cross-Validation

Depending on the accuracy and speed settings selected in Driverless AI before the experiment is run, the training data will be split into multiple train and validation pairs, and Driverless AI will try to find that optimal point (mentioned in the section above) on the validation data in order to stop the learning process. This method is illustrated below for K=4.

This process is repeated 4 times (if K=4), training K models so that each part is used for validation exactly once. The same model parameters are applied across all folds, and an overall (average) error is estimated from all parts' predictions to ensure that any model built with such parameters (features, hyperparameters, model types) will generalize well to any (unseen) test dataset.

More specifically, based on the average error across all these validation predictions, Driverless AI will determine the following (a minimal k-fold sketch appears after the list):

  • When to stop the learning process. Driverless AI primarily uses XGBoost, which requires various iterations to reach the optimal point, and the multiple validation datasets help find the globally best generalizable point that works well across all validation parts.

  • Which hyperparameter values to change/tune. XGBoost has a long list of parameters which control different elements in the modeling process. For instance, based on performance in the validation data, the best learning rate, maximum depth, leaf size, and growing policy are found and help achieve a better generalization error in test data.

  • Which features or feature transformations to use. Driverless AI's strong performance comes from its feature engineering pipeline. Typically, thousands of features are explored during the training process, but only a minority of them eventually prove useful. The performance on the validation data, once again, defines which of the generated features are worth keeping and which need to be discarded.

  • Which models to ensemble and which weights to use. Driverless AI uses an exhaustive process to find the best linear weights to combine multiple different models (trained with different parameters) and get better results on unseen data.
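Here is a minimal Python sketch of k-fold validation with early stopping, in the same spirit as the process described above (illustrative only; it uses scikit-learn's gradient boosting, while Driverless AI primarily uses XGBoost).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)

fold_errors, best_iters = [], []
for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    model = GradientBoostingRegressor(n_estimators=300, random_state=0)
    model.fit(X[tr_idx], y[tr_idx])
    # Validation error after each boosting iteration on the held-out part:
    curve = [mean_squared_error(y[val_idx], pred)
             for pred in model.staged_predict(X[val_idx])]
    best_iters.append(int(np.argmin(curve)) + 1)  # the "optimal point"
    fold_errors.append(min(curve))

print("best iteration per fold:", best_iters)
print("average validation MSE: %.1f" % np.mean(fold_errors))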

Check for IID

A common pitfall that causes overfitting is when a dataset contains temporal elements and the observations (samples) are not Independent and Identically Distributed (IID). With temporal elements, a random split can end up using future data to predict past data rather than the other way around, and the resulting predictions are misleading.

Driverless AI can easily determine the presence of non-IID data if a date field (or equivalent) is provided. In this case, it will assess how strong the temporal elements are and whether the target variable's values change significantly across different periods of the date field. It uses correlation, serial-correlation checks, and variable binning to determine whether, and how strongly, these temporal elements are present. It should be noted that this check extends to the ordering of the dataset too. Sometimes the data itself is ordered in a way that a randomly formulated validation split will not help, for example if the data is sorted by date/time even though that field is not provided.

If non-IID data is detected, Driverless AI typically switches to a time-series-based validation mode to get more consistent errors when predicting the (further-out) test data. The difference from the k-fold approach (mentioned above) is that the data is split based on date/time, and all train/validation pairs are formulated so that the train part lies in the past and the validation part in the future. The types of features Driverless AI generates also differ from the IID case. This process additionally identifies the main entities to be used for both validation and feature engineering: the group-by categorical features, such as stores or types of products, that improve predictions when used, e.g., sales by store and product in the last 30 days.

To summarize this subsection: this check is put in place to ensure that, when strong temporal elements are detected, Driverless AI performs feature engineering, selects parameters, and ensembles models in a way that best resembles reality.
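A minimal Python sketch of the key difference (with hypothetical data): instead of random folds, split on the date so that training data always precedes validation data.

import pandas as pd

# Hypothetical weekly sales for two stores, sorted by date.
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="W"),
    "store": [1, 2] * 5,
    "sales": [100, 90, 110, 95, 120, 98, 130, 101, 125, 99],
}).sort_values("date").reset_index(drop=True)

split = int(len(df) * 0.8)                 # last 20% of the timeline
train, valid = df.iloc[:split], df.iloc[split:]
assert train["date"].max() < valid["date"].min()  # no future data in train
print(train.shape, valid.shape)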

Avoiding Leakage in Feature Engineering

Leakage can be defined as the creation of unexpected additional information in the training data that allows a model or machine learning algorithm to make unrealistically good predictions. It has various types and causes. Driverless AI is particularly careful with a type of leakage that can arise when implementing target (mean or likelihood) encoding. On many occasions, these types of features tend to be the most predictive inputs, especially when high-cardinality categorical variables (that is, ones with many distinct values) are present in the data. However, over-reliance on these features can cause a model to overfit quickly. These features may be created differently in different domains; for example, in banking they take the form of weights of evidence, in time series they are the past mean values of the target variable given certain lags, and so on.

A common cause of target-encoding failure is predicting an observation/entry whose own target value was included in the computation of the average for its category. For example, assume a dataset contains a categorical variable called profession with various job titles like 'doctor', 'teacher', 'policeman', etc. Among all the entries, there is only one entry for 'entrepreneur', and that entry has an income of $3,000,000,000. When estimating the average income for all 'entrepreneurs', it will be $3,000,000,000, as there is only one 'entrepreneur'. If a new variable is created that measures the average income of a profession while trying to predict income, it will encode the connection that the average salary of an entrepreneur is $3,000,000,000. Given how big this value is, the model is likely to over-rely on this connection and predict high values for all entrepreneurs.

This can be referred to as target leakage because it uses the target value directly as a feature. There are various ways to mitigate this type of leakage, such as estimating averages only when there is a significant number of cases for a category. Ideally, though, the average values should be computed from data whose target values the encoded entries have not used in any way. One way to achieve this is to use a separate holdout dataset to estimate the mean target values for the categories and then apply them to the train and validation data; in other words, to have a third dataset dedicated to target encoding.

The latter approach suffers from the fact that the model must then be built with significantly less data, as part of it is surrendered to estimating the averages per category. To counter this, Driverless AI uses CCV, or Cross-Cross-Validation: a cross-validation inside a cross-validation. After a train/validation pair has been determined, only the train part undergoes another k-fold procedure, in which K-1 parts are used to estimate the average target values for the selected categories and apply them to the Kth part, until the whole train dataset has had its mean (target) values estimated in K batches. These can be applied at the same time to the outer validation dataset by averaging the mean values across all K folds.
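Here is a hand-rolled Python sketch of out-of-fold target encoding in that spirit (illustrative only; not the exact CCV implementation): each row's encoding is computed from folds that exclude that row, so the lone 'entrepreneur' never sees its own $3B label.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "profession": ["doctor", "teacher", "doctor",
                   "entrepreneur", "teacher", "doctor"],
    "income": [120000, 60000, 130000, 3000000000, 65000, 125000],
})

global_mean = df["income"].mean()
df["profession_te"] = np.nan
for fit_idx, enc_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(df):
    # Category means estimated on the other folds only:
    means = df.iloc[fit_idx].groupby("profession")["income"].mean()
    df.loc[df.index[enc_idx], "profession_te"] = (
        df.iloc[enc_idx]["profession"].map(means).fillna(global_mean).values)
print(df)
# The 'entrepreneur' row gets the global mean, not its own leaked target.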

The same type of leakage can also be found in metamodeling. For more information on this topic, enjoy this video, in which Kaggle Grandmaster Mathias Müller discusses this and other types of leakage, including how they arise and how they can be prevented.

Regarding other types of leakage, Driverless AI will throw warning messages if some features are strongly correlated with the target, but it typically does not take any additional action by itself unless the correlation is perfect.

Other means to avoid overfitting
There are various other mechanisms, tools, and approaches in Driverless AI that help prevent overfitting (a small bagging sketch follows the list):

  • Bagging (or Randomized Averaging): When the time comes for final predictions, Driverless AI will run multiple models with slightly different parameters and aggregate the results. This ensures that the final prediction, being based on multiple models, is not too attached to the original data, precisely because the injected randomness produces (to some extent) uncorrelated models. In other words, bagging has the ability to reduce the variance without changing the bias.
  • Dimensionality reduction: Although Driverless AI generates many features, it also employs techniques such as SVD or PCA to limit the features' expansion when encountering high-cardinality categorical features, simplifying the modeling process.

  • Feature pruning: Driverless AI can get a measure of how important a feature is to the model. Features that tend to add very little to the prediction are likely to just be noise and are therefore discarded.
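Here is a minimal Python sketch of the bagging idea (illustrative only): averaging many slightly different models reduces variance compared to a single deep tree.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)

single = DecisionTreeRegressor(random_state=0)
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)

for name, model in [("single tree", single), ("bagged trees", bagged)]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print("%s: R^2 = %.3f" % (name, score))  # bagging should score higher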

Examples of Consistency

It is no secret that we like to test-drive Driverless AI in competitive environments such as Kaggle, Analytics Vidhya, and CrowdANALYTIX to see how it fares against some of the best data scientists in the world. However, we are interested not only in accuracy, which is a product of how much time you allow Driverless AI to run and can be configured at the beginning, but also in the consistency between validation performance and performance on the test data, in other words, the ability to avoid overfitting.

Here is a list of results from public sources that show Driverless AI's validation performance and performance on the test data, drawn from various combinations of accuracy and speed settings. Some of the results may combine multiple Driverless AI outputs with a metamodeling layer on top of them:

Name | Metric | Validation Results | Test Results | Final Rank | Source
Analytics Vidhya - Churn Prediction | AUC | 0.689 (rank 14/250) | 0.677 (rank 8/250) | 8/250 | LinkedIn blog
Analytics Vidhya - Churn Prediction | AUC | 0.69104 (rank 8/250) | 0.67844 (rank 5/250) | 5/250 | Competition website
BNP Paribas Cardif Claims Management | log loss | 0.43573 (rank 23/2926) | 0.43316 (rank 18/2926) | 18/2926 | Video
Predicting How Points End in Tennis | log loss | 0.1774 (rank 3/207) | 0.19051 (rank 6/207) | 6/207 | Competition website
Predicting How Points End in Tennis | log loss | 0.1769 (rank 2/207) | 0.19062 (rank 8/207) | 8/207 | Competition website
BNP Paribas Cardif Claims Management (fast settings) | log loss | 0.44196 (rank 52/2926) | 0.44137 (rank 60/2926) | 60/2926 | Video
Amazon.com - Employee Access Challenge (fast settings) | AUC | 0.91165 (rank 65/1687) | 0.90933 (rank 79/1687) | 79/1687 | Video
New York City Taxi Trip Duration | RMSLE | 0.31017 (rank 11/1257) | 0.31181 (rank 11/1257) | 11/1257 | Tweet
Analytics Vidhya - McKinsey Analytics | AUC | 0.85932 (rank 17/503) | 0.85456 (rank 6/503) | 6/503 | Competition website

Driverless AI – Introduction, Hands-On Lab and Updates

#H2OWorld was an incredible experience. Thank you to everyone who joined us!

There were so many fascinating conversations and interesting presentations. I’d love to invite you to enjoy the presentations by visiting our YouTube channel.

Over the next few weeks, we’ll be highlighting many of the talks. Today I’m excited to share two presentations focused on Driverless AI – “Introduction and a Look Under the Hood + Hands-On Lab” and “Hands-On Focused on Machine Learning Interpretability”.

Slides available here.

Slides available here.

The response to Driverless AI has been amazing. We’re constantly receiving helpful feedback and making updates.

A few recent updates include:

Version 1.0.11 (December 12 2017)
– Faster multi-GPU training, especially for small data
– Increase default amount of exploration of genetic algorithm for systems with fewer than 4 GPUs
– Improved accuracy of generalization performance estimate for models on small data (< 100k rows)
– Faster abort of experiment
– Improved final ensemble meta-learner
– More robust date parsing

Version 1.0.10 (December 4 2017)
– Tooltips and link to documentation in parameter settings screen
– Faster training for multi-class problems with > 5 classes
– Experiment summary displayed in GUI after experiment finishes
– Python Client Library downloadable from the GUI
– Speedup for Maxwell-based GPUs
– Support for multinomial AUC and Gini scorers
– Add MCC and F1 scorers for binomial and multinomial problems
– Faster abort of experiment

Version 1.0.9 (November 29 2017)
– Support for time column for causal train/validation splits in time-series datasets
– Automatic detection of the time column from temporal correlations in data
– MLI improvements, dedicated page, selection of datasets and models
– Improved final ensemble meta-learner
– Test set score now displayed in experiment listing
– Original response is preserved in exported datasets
– Various bug fixes

Additional release notes can be viewed here:
http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/release_notes.html

If you’d like to learn more about Driverless AI, feel free to explore these helpful links:
– Driverless AI User Guide: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html
– Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf
– Latest Driverless AI Docker Download: https://www.h2o.ai/driverless-ai-download/
– Latest Driverless AI AWS AMI: Search for AMI-id : ami-d8c3b4a2
– Stack Overflow: https://stackoverflow.com/questions/tagged/driverless-ai

Want to try Driverless AI? Send us a note.