Automatic Feature Engineering for Text Analytics – The Latest Addition to Our Kaggle Grandmasters’ Recipes

According to Kaggle’s ‘The State of Machine Learning and Data Science’ survey, text data is the second most used data type at work for data scientists. There are a lot of interesting text analytics applications like sentiment prediction, product categorization, document classification and so on.

In the latest version (1.3) of our Driverless AI platform, we have included Natural Language Processing (NLP) recipes for text classification and regression problems. The platform supports both standalone text and text combined with other numerical columns as predictive features. In particular, we have implemented the following recipes and models:

  • Text-specific feature engineering recipes:
    • TFIDF, Frequency of n-grams
    • Truncated SVD
    • Word embeddings
  • Text-specific models to extract features from text:
    • Convolutional neural network models on word embeddings
    • Linear models on TFIDF vectors
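To make the first recipe concrete, here is a minimal sketch of TFIDF weighting over word n-grams in plain Python. It is not the Driverless AI implementation: the function names, the whitespace tokenizer, the unsmoothed log(N/df) weighting, and the three sample tweets are all assumptions chosen for illustration. Truncated SVD, which would then compress these sparse vectors into a few dense columns, is not shown.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercase word n-grams of length 1..n_max."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_max=2):
    """Return one {n-gram: weight} dict per document."""
    counts = [Counter(ngrams(doc, n_max)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())
    vectors = []
    for count in counts:
        total = sum(count.values())
        vectors.append({
            gram: (k / total) * math.log(n_docs / doc_freq[gram])
            for gram, k in count.items()
        })
    return vectors


# Made-up tweets, for illustration only.
tweets = [
    "flight was late again",
    "great flight and great crew",
    "late bag and no help",
]
vectors = tfidf(tweets)
```

In the second tweet, "great" gets a high weight because it is frequent there and absent elsewhere, while "flight" is down-weighted because it also appears in the first tweet.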

A Typical Example: Sentiment Analysis

Let us illustrate the usage with a classical example of sentiment analysis on tweets using the US Airline Sentiment dataset from Figure Eight’s Data for Everyone library. We can split the dataset into training and test sets with this simple script. We will just use the tweets in the ‘text’ column and the sentiment (positive, negative or neutral) in the ‘airline_sentiment’ column for this demo. Here are some samples from the dataset:
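For readers who want a feel for the split step, a reproducible random split could look like the sketch below. This is not the script referenced above; the function name, the 80/20 ratio, the seed, and the in-memory sample rows are assumptions. Only the column names ‘text’ and ‘airline_sentiment’ come from the dataset description.

```python
import csv
import io
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Tiny in-memory stand-in for the real CSV file; these rows are made up.
sample_csv = io.StringIO(
    "airline_sentiment,text\n"
    "negative,my flight was delayed for three hours\n"
    "positive,thanks for the smooth flight\n"
    "neutral,what time does boarding start\n"
    "negative,lost my bag again\n"
    "positive,great crew on the way home\n"
)
rows = list(csv.DictReader(sample_csv))
train, test = split_rows(rows, test_fraction=0.2)
```

With the real file, the same function would be applied to the rows read by csv.DictReader, and each part written back out with csv.DictWriter before uploading to Driverless AI.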

Once we have our dataset ready in tabular format, we are all set to use Driverless AI. As with other problems in Driverless AI, we need to choose the dataset and then specify the target column (‘airline_sentiment’).

Since there are other columns in the dataset, we need to click on ‘Dropped Cols’ and then exclude everything but ‘text’ as shown below:

Next, we will need to make sure TensorFlow is enabled for the experiment. We can go to ‘Expert Settings’ and switch on ‘TensorFlow Models’.

At this point, we are ready to launch an experiment. Text features will be automatically generated and evaluated during the feature engineering process. Note that some features such as TextCNN rely on TensorFlow models. We recommend using GPU(s) to leverage the power of TensorFlow and accelerate the feature engineering process.

Once the experiment is done, users can make new predictions and download the scoring pipeline just as with any other Driverless AI experiment.

Bonus fact #1: The masterminds behind our NLP recipes are Sudalai Rajkumar (aka SRK) and Dmitry Larko.

Bonus fact #2: Don’t want to use the Driverless AI GUI? You can run the same demo using our Python API. See this example notebook.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
SRK and Joe

Time is Money! Automate Your Time-Series Forecasts with Driverless AI

Time-series forecasting is one of the most common and important tasks in business analytics. Real-world applications include sales, weather, stock market, and energy demand forecasting, to name a few. We strongly believe that automation can help our users deliver business value in a timely manner. Therefore, once again we translated our Kaggle Grand Masters’ time-series recipes into our automatic machine learning platform Driverless AI (version 1.2). This blog post introduces the new time-series functionality with a simple sales forecasting example.

The key features/recipes that make automation possible are:

  • Automatic handling of time groups (e.g. different stores and departments)
  • Robust time-series validation
            – Accounts for gaps and forecast horizon
            – Uses past information only (i.e. no data leakage)
  • Time-series specific feature engineering recipes
            – Date features like day of week, day of month etc.
            – AutoRegressive features like optimal lag and lag-features interaction
            – Different types of exponentially weighted moving averages
            – Aggregation of past information (different time groups and time intervals)
            – Target transformations and differencing
  • Integration with existing feature engineering functions (recipes and optimization)
  • Automatic pipeline generation (see this blog post)
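To illustrate what a validation scheme that accounts for gaps and forecast horizon and uses past information only can mean in practice, here is a minimal rolling-origin sketch. The post does not describe how Driverless AI builds its validation folds, so the function name, its parameters, and the fold layout below are assumptions.

```python
def time_split(n_periods, horizon, gap=0, n_folds=3):
    """Return (train_idx, valid_idx) pairs over time-ordered period indices.

    Each validation window spans `horizon` periods, and the training data
    ends `gap` periods before it, so training never sees the future.
    """
    folds = []
    for fold in range(n_folds):
        valid_end = n_periods - fold * horizon
        valid_start = valid_end - horizon
        train_end = valid_start - gap
        if train_end <= 0:
            break  # not enough history left for another fold
        train_idx = list(range(train_end))
        valid_idx = list(range(valid_start, valid_end))
        folds.append((train_idx, valid_idx))
    return folds[::-1]  # earliest fold first


# 12 weeks of data, forecasting 2 weeks ahead after a 1-week gap.
folds = time_split(n_periods=12, horizon=2, gap=1, n_folds=3)
for train_idx, valid_idx in folds:
    print("train up to", train_idx[-1], "-> validate on", valid_idx)
```

The gap mimics the delay between the last available observation and the first period to be forecast, which is why a plain random split would give overly optimistic scores here.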

A Typical Example: Sales Forecasting

Below is a typical example of sales forecasting based on the Walmart competition on Kaggle. In order to frame it as a machine learning problem, we formulate the historical sales data and additional attributes as shown below:

Raw data:

Data formulated for machine learning:

Once you have your data prepared in tabular format (see raw data above), Driverless AI can formulate it for machine learning and sort out the rest. If this is your very first session, the Driverless AI assistant (new feature in version 1.2) will guide you through the journey.
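As a rough picture of what formulating the data for machine learning involves, the sketch below adds lag and exponentially weighted moving average columns per time group. It is a simplified assumption rather than the actual Driverless AI recipe: the function name, the column names (store, dept, week, sales), the lags, the smoothing factor, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def add_lag_features(rows, group_keys, target, lags=(1, 2), alpha=0.5):
    """Add lag and EWMA columns per time group.

    `rows` must already be sorted by time. Every feature is computed from
    earlier rows of the same group only, so the current target never leaks.
    """
    history = defaultdict(list)  # group -> past target values
    ewma = {}                    # group -> EWMA of past target values
    out = []
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        past = history[group]
        new_row = dict(row)
        for lag in lags:
            has_lag = len(past) >= lag
            new_row[f"{target}_lag{lag}"] = past[-lag] if has_lag else None
        new_row[f"{target}_ewma"] = ewma.get(group)
        out.append(new_row)
        # Update the group state only after this row's features are emitted.
        value = row[target]
        past.append(value)
        if group in ewma:
            ewma[group] = alpha * value + (1 - alpha) * ewma[group]
        else:
            ewma[group] = value
    return out


# Made-up weekly sales for two departments of one store.
sales = [
    {"store": 1, "dept": 1, "week": 1, "sales": 100.0},
    {"store": 1, "dept": 2, "week": 1, "sales": 40.0},
    {"store": 1, "dept": 1, "week": 2, "sales": 120.0},
    {"store": 1, "dept": 2, "week": 2, "sales": 50.0},
    {"store": 1, "dept": 1, "week": 3, "sales": 90.0},
]
features = add_lag_features(sales, group_keys=("store", "dept"), target="sales")
```

Rows at the start of each group have no history, so their lag columns are left empty (None) instead of being filled with a guess.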

Similar to previous Driverless AI examples, users need to select the dataset for training/test and define the target. For time-series, users need to define the time column (by choosing AUTO or selecting the date column manually). If weighted scoring is required (like the Walmart Kaggle competition), users can select the column with specific weights for different samples.
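To show what weighted scoring does to a metric, here is a minimal sketch that assumes a weighted mean absolute error. The function name, the numbers, and the weight of 5 on the last sample are illustrative assumptions, not values taken from Driverless AI or from the competition data.

```python
def weighted_mae(actual, predicted, weights):
    """Mean absolute error where each sample counts `weight` times."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_errors = (
        w * abs(a - p) for a, p, w in zip(actual, predicted, weights)
    )
    return sum(weighted_errors) / total_weight


actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
weights = [1, 1, 5]  # the last sample matters five times as much

score = weighted_mae(actual, predicted, weights)
unweighted = weighted_mae(actual, predicted, [1, 1, 1])
```

Because the largest error falls on the most heavily weighted sample, the weighted score (170 / 7) is noticeably worse than the unweighted one (50 / 3).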

If users prefer automatic handling of time groups, they can leave the time group columns setting as AUTO.

Expert users can define specific time groups and change other settings as shown below.

Once the experiment is finished, users can make new predictions and download the scoring pipeline just as with any other Driverless AI experiment.

Seeing is believing. Try Driverless AI yourself today. Sign up here for a free 21-day trial license.

Until next time,
Joe

Bonus fact: The masterminds behind our time-series recipes are Marios Michailidis and Mathias Müller so internally we call this feature AutoM&M.

From Kaggle Grand Masters’ Recipes to Production Ready in a Few Clicks

Introducing Accelerated Automatic Pipelines in H2O Driverless AI

At H2O, we work really hard to make machine learning fast, accurate, and accessible to everyone. With H2O Driverless AI, users can leverage years of world-class Kaggle Grand Masters’ experience and our GPU-accelerated algorithms (H2O4GPU) to produce top quality predictive models in a fully automatic and timely fashion.

In our most recent release (version 1.1), we are going one step further to streamline the deployment process with MOJO (Model ObJect, Optimized). Inherited from our popular H2O-3 platform, MOJO is a highly optimized, low-latency scoring engine that is easily embeddable in any Java environment. With automatic pipeline generation in Driverless AI, users can go from automatic machine learning to production ready in just a few clicks. This blog post illustrates the usage of MOJO in Driverless AI with a simple example.

Easing the Pain Points in a Machine Learning Workflow

In a typical enterprise machine learning workflow, there are many things that could go wrong due to human errors, bad data science practices, different tools/infrastructure, incompatible code, lack of testing, versioning, communication and so on.

Driverless AI is our solution to ease those pain points in the second half of the workflow (i.e., creative feature engineering, model building, and deployment). We strongly believe that most organizations can benefit from automatic machine learning pipelines. A recent PayPal use-case shows that Driverless AI can help produce top quality predictive models with significant time and cost savings.

With Driverless AI, we are trying to mimic what top data science teams would do when they need to develop a new machine learning pipeline. Below are the four key areas of focus:

1. Exploratory Data Analysis (EDA) with Automatic Visualizations (AutoViz)

AutoViz allows users to gain quick insights from data without the laborious tasks of creating individual plots. It shows users the most interesting graphs automatically based on statistics, and it is designed to work on large datasets efficiently. The mastermind behind AutoViz is our Chief Scientist, Professor Leland Wilkinson of “The Grammar of Graphics” fame.

2. Automatic Feature Engineering and Model Building

We call this part of Driverless AI “Kaggle Grand Masters in a Box”. It is essentially the best data science practices, tricks and creative feature engineering of our Kaggle Grand Masters translated into an artificial intelligence (AI) platform. In other words, it is AI to do AI. On top of that, we make the automatic machine learning process insanely fast on Nvidia GPUs. Our users can benefit from quick turnaround time and top quality predictive models that one would expect from the Kaggle Grand Masters themselves.

3. Machine Learning Interpretability (MLI)

In Driverless AI, we have implemented some of the latest ML interpretation techniques (e.g., LIME, LOCO, ICE, Shapley, PDP), so our users can go from model building to model interpretation in a seamless fashion. These techniques are crucial for those who must explain their models to regulators or customers. The masterminds behind MLI are my colleagues Patrick Hall, Navdeep Gill, and Mark Chan. Watch their talk about MLI in Driverless AI here.
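For readers new to these techniques, the sketch below shows the idea behind two of them, ICE and PDP, in a few lines. It is a generic illustration and not the MLI implementation in Driverless AI; the toy model, the feature names, and the sample rows are all hypothetical.

```python
def ice_curves(model, rows, feature, grid):
    """One curve per row: predictions as `feature` sweeps over `grid`."""
    return [[model({**row, feature: value}) for value in grid] for row in rows]


def partial_dependence(model, rows, feature, grid):
    """PDP: the point-by-point average of the ICE curves."""
    curves = ice_curves(model, rows, feature, grid)
    return [
        sum(curve[i] for curve in curves) / len(curves)
        for i in range(len(grid))
    ]


def toy_model(row):
    """Hypothetical default-risk score, standing in for a real model."""
    bump = 0.2 if row["balance"] > 1000 else 0.0
    return 0.1 * row["late_payments"] + bump


# Made-up customers, for illustration only.
customers = [
    {"late_payments": 0, "balance": 500},
    {"late_payments": 2, "balance": 2000},
]
grid = [0, 1, 2]
ice = ice_curves(toy_model, customers, "late_payments", grid)
pdp = partial_dependence(toy_model, customers, "late_payments", grid)
```

ICE shows how the prediction for each individual row reacts to one feature, while PDP summarizes that reaction across the whole sample.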

4. Automatic Pipeline Generation – The Focus of this Blog Post

Model deployment remains one of the most common and complex challenges in data analytics. Inherited from our popular H2O-3 platform, MOJO is a well-tested, robust technology that is being used by our users and customers at enormous scale. Let me illustrate the MOJO usage with a simple example below.

Credit Card Example

Like many other Driverless AI demos that you may have seen before at H2O World or in our webinars, I am going to use the credit card dataset from the UCI machine learning repository for the MOJO example. Let me fast-forward the process to the end of a Driverless AI experiment and focus on the new MOJO options. Starting with version 1.1.0, users have the option to build and download a MOJO for fast, low-latency scoring. Here is a step-by-step walkthrough:

Step 1: Build a MOJO Scoring Pipeline
After the experiment, click on the newly available option BUILD MOJO SCORING PIPELINE. The build process is automatic and it should be done within a few minutes.

Step 2: Download and Unzip MOJO
Click on DOWNLOAD MOJO SCORING PIPELINE to download mojo.zip. After unzipping the file, you should be able to see a new folder called mojo-pipeline. The pipeline.mojo and mojo2-runtime.jar in the folder are the two main files you need for the MOJO scoring pipeline.

Step 3: Download Driverless AI License
Another key ingredient for the MOJO pipeline is a valid Driverless AI license. You can download the license.sig file (usually in the license folder) from the machine hosting Driverless AI. Put the license file into the mojo-pipeline folder from the previous step.

Optional Step: Install Java 7 or 8
The MOJO scoring pipeline requires Java 8 (or Java 7/8 from version 1.1.2). If you have not installed it, please follow the instructions here.

Step 4: A Simple Test Run
In the mojo-pipeline folder, you will find a small example.csv with some data samples. This dataset can be used for a quick test run. Open the folder in a terminal and then run the following command:
bash run_example.sh

Alternatively, run the full command like this:
java -Dai.h2o.mojos.runtime.license.file=license.sig -cp mojo2-runtime.jar ai.h2o.mojos.ExecuteMojo pipeline.mojo example.csv

It should return predictions (the probabilities of default payment in this credit card demo) and the time required for scoring each sample. Remember this scoring pipeline includes everything from complex feature transformations based on Kaggle Grand Masters’ recipes to computing predictions from the final model ensemble. With MOJO, our users have a low-latency scoring engine that can make new predictions in milliseconds.
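If you would rather trigger the same test run from a script, one option is a thin Python wrapper around the command shown above. The helper names below are hypothetical; the Java class, the license property, and the file names are taken directly from the command in this step, and the wrapper assumes Java and the unzipped mojo-pipeline folder are available.

```python
import subprocess


def mojo_command(pipeline="pipeline.mojo", data="example.csv",
                 license_file="license.sig", runtime_jar="mojo2-runtime.jar"):
    """Build the scoring command from this step as an argument list."""
    return [
        "java",
        f"-Dai.h2o.mojos.runtime.license.file={license_file}",
        "-cp", runtime_jar,
        "ai.h2o.mojos.ExecuteMojo",
        pipeline,
        data,
    ]


def score(cwd="mojo-pipeline", **kwargs):
    """Run the scorer inside the unzipped folder and return its stdout.

    Requires Java plus the MOJO, runtime jar and license file in `cwd`.
    """
    result = subprocess.run(
        mojo_command(**kwargs),
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


command = mojo_command()
print(" ".join(command))
```

Calling score() is left out of the example on purpose, since it only works on a machine where the files from Steps 2 and 3 are in place.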

Step 5: Create Your Own Scoring Service
Users can, of course, define and program their own scoring services. For more information, please go through the Compile and Run the MOJO from Java section in our Driverless AI documentation.

Conclusions

This blog post gives a quick overview of the automatic pipelines in Driverless AI. The key benefits for our users are:

  1. Immediate increase in productivity – eliminating time wasted on human errors, incompatible code, debugging, etc.
  2. Production ready in a few clicks – seamless integration of complex feature engineering and scoring engine in one MOJO.
  3. An enterprise-grade, low-latency scoring engine that is easily embeddable in any Java environment.

Don’t take my word for it: sign up for a free 21-day trial and try Driverless AI yourself today.

Until next time,
Joe

Note #1: Two years, numerous H2O models, slide decks, events and #360selfies later, I am finally making a return to blogging. I hope you enjoy reading this blog post.

Note #2: H2O is going to Budapest again. Come find me, Erin, and Kuba at the eRum conference from May 14 to 16. I will be delivering the “Automatic and Interpretable Machine Learning in R with H2O and LIME” workshop with a real, multimillion-dollar Moneyball Shiny app.