Stacked Ensembles and Word2Vec now available in H2O!

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles


H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking or “Super Learning.” This method currently supports regression and binary classification, and multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.

H2O has previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained; however, for new projects we recommend using the native H2O version. Native support for stacking in the H2O backend brings ensembles to all of the H2O APIs.

Creating ensembles of H2O models is now dead simple: you simply pass a list of existing H2O model IDs to the Stacked Ensemble function and you are ready to go. This list can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performance, so H2O’s Random Grid Search is a handy way of quickly generating a set of base models for the ensemble.

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)
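
For a fuller picture, here is a hedged end-to-end sketch in Python that trains a small random grid of GBMs and stacks it. The train.csv file and “response” column are illustrative placeholders; the key requirement is that all base models share the same number of folds and fold assignment and keep their cross-validated predictions, which the metalearner is trained on:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# Illustrative dataset: any frame with a binary response column works
train = h2o.import_file("train.csv")
y = "response"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()

# Random grid of GBMs. Base models must share nfolds and fold assignment,
# and must keep their cross-validated predictions for the metalearner.
gbm = H2OGradientBoostingEstimator(ntrees=50, nfolds=5,
                                   fold_assignment="Modulo",
                                   keep_cross_validation_predictions=True)
grid = H2OGridSearch(gbm,
                     hyper_params={"max_depth": [3, 5, 9],
                                   "learn_rate": [0.01, 0.05, 0.1]},
                     search_criteria={"strategy": "RandomDiscrete",
                                      "max_models": 6, "seed": 1})
grid.train(x=x, y=y, training_frame=train)

# Stack the whole grid by passing its model ids as the base models
ensemble = H2OStackedEnsembleEstimator(base_models=grid.model_ids)
ensemble.train(x=x, y=y, training_frame=train)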

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!

Word2Vec


H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models used to produce word embeddings: a language-modeling/feature-engineering technique in natural language processing in which words or phrases are mapped to vectors of real numbers. The word embeddings can subsequently be used in a machine learning model, for example a GBM. This allows users to utilize text-based data with current H2O algorithms in a very efficient manner. An R example is available here.
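
For Python users, here is a rough sketch of the analogous workflow using the H2O Python API. Note that the estimator name H2OWord2vecEstimator and the H2OFrame.tokenize method assume a newer h2o release than this one (only the R demo existed at the time of writing), and the text.csv file and its columns are purely illustrative:

import h2o
from h2o.estimators.word2vec import H2OWord2vecEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# Illustrative frame with a string column "text" and a target "label"
df = h2o.import_file("text.csv")

# Break each text into words; NAs in the result mark text boundaries
words = df["text"].ascharacter().tokenize("\\W+")

# Learn 100-dimensional word embeddings
w2v = H2OWord2vecEstimator(vec_size=100, epochs=10)
w2v.train(training_frame=words)

# One embedding per row: average the word vectors of each text
vecs = w2v.transform(words, aggregate_method="AVERAGE")
train = df["label"].asfactor().cbind(vecs)

# Use the embeddings as features in a GBM
gbm = H2OGradientBoostingEstimator()
gbm.train(x=vecs.names, y="label", training_frame=train)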

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word-vector representations that are good at predicting a word’s context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)$$

where $k$ is the size of the training window.
In the skip-gram model, every word $w$ is associated with two vectors, $u_w$ and $v_w$, which are the vector representations of $w$ as a word and as a context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model:

$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$

where $V$ is the vocabulary size.
The skip-gram model with the full softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be in the order of millions. To speed up the training of Word2Vec, we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log V)$.
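
For reference, here is a sketch of hierarchical softmax in Mikolov et al.’s original formulation (on which H2O’s implementation is based): the vocabulary is arranged in a binary Huffman tree, and the softmax over $V$ words is replaced by a product of sigmoid decisions along the path from the root to the leaf of $w_i$

$$p(w_i | w_j) = \prod_{m=1}^{L(w_i)-1} \sigma\left( [\![ n_{m+1}(w_i) = \mathrm{child}(n_m(w_i)) ]\!] \, u_{n_m(w_i)}^{\top} v_{w_j} \right)$$

where $n_m(w_i)$ is the $m$-th node on that path, $L(w_i)$ is the path length, $\mathrm{child}(n)$ denotes a fixed (say, left) child of node $n$, $[\![ x ]\!]$ is $1$ if $x$ is true and $-1$ otherwise, and $\sigma$ is the logistic function. Since the path length is $O(\log V)$, so is the cost of each evaluation.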

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.

List of New Features:

PUBDEV-2058- Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635- Ability to Select Columns for PDP computation in Flow (With this enhancement, users can select which features/columns to render Partial Dependence Plots for from Flow; R and Python support this already. Known issue PUBDEV-3782: when nbins < the number of categorical levels, the PDP won't compute. Please also see this post.)
PUBDEV-3881- Add PCA Estimator documentation to Python API Docs
PUBDEV-3902- Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739- StackedEnsemble: put ensemble creation into the back end.

List of Improvements:

PUBDEV-3989- Decrease size of h2o.jar
PUBDEV-3257- Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741- StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857- Clean up the generated Python docs
PUBDEV-3895- Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912- Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933- Modify gen_R.py for Stacked Ensemble
PUBDEV-3972- Add Stacked Ensemble code examples to Python docstrings

List of Bugs:

PUBDEV-2464- Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111- R API's h2o.interaction() does not use destination_frame argument
PUBDEV-3694- Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742- StackedEnsemble should work for regression
PUBDEV-3865- h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo
PUBDEV-3883- Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894- Relational operators don't work properly with time columns.
PUBDEV-3966- java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835- Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965- Importing data in python returns error - TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:

PUBDEV-3336- h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740- REST: implement simple ensemble generation API
PUBDEV-3843- Modify R REST API to always return binary data
PUBDEV-3844- Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864- Import files by pattern
PUBDEV-3884- StackedEnsemble: Add to online documentation
PUBDEV-3940- Add Stacked Ensemble code examples to R docs

Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html

Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLP

The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?

The solution: Yes! We’ll divide this tutorial into three parts, the first on how to gather the necessary data, the second on data exploration, munging, & feature engineering, and the third on building our model itself. You can find all of our code on GitHub (https://git.io/vPwxr).


Part One: Collecting the Data
Note: We are going to be using Python. For the R version of this process, the concepts translate, and we have some code on Github that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.

We used the Twitter API to collect tweets by both presidential candidates; these would become our dataset. Twitter only lets you access roughly the latest 3,000 tweets from a particular handle, even though it keeps all tweets in its own databases.

The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form, you can access your app and its keys and tokens. Specifically, we’re looking for four things: the client key and secret (called the consumer key and consumer secret) and the resource owner key and secret (called the access token and access token secret).


[Screenshot: the app’s keys and tokens page]
We save this information in JSON format in a separate file

Then, we can use the Python libraries Requests and Pandas to gather the tweets into a DataFrame. We only really care about three things: the author of the Tweet (Donald Trump or Hillary Clinton), the text of the Tweet, and the unique identifier of the Tweet, but we can take in as much other data as we want (for the sake of data exploration, we also included the timestamp of each Tweet).

Once we have all this information, we can output it to a .csv file for further analysis and exploration. 
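
Below is a rough sketch of this collection step, assuming the requests-oauthlib companion library for OAuth signing; the credentials file, key names, and helper function are illustrative, not our exact code:

import json

import pandas as pd
import requests
from requests_oauthlib import OAuth1

# Load the four credentials saved earlier (file and key names are illustrative)
with open("twitter_credentials.json") as f:
    creds = json.load(f)
auth = OAuth1(creds["consumer_key"], creds["consumer_secret"],
              creds["access_token"], creds["access_token_secret"])

URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"

def get_tweets(handle, n_pages=16):
    """Page backwards through a user's timeline via max_id."""
    tweets, max_id = [], None
    for _ in range(n_pages):
        params = {"screen_name": handle, "count": 200, "trim_user": True}
        if max_id is not None:
            params["max_id"] = max_id
        batch = requests.get(URL, auth=auth, params=params).json()
        if not batch:
            break
        tweets.extend(batch)
        max_id = batch[-1]["id"] - 1  # avoid re-fetching the oldest tweet
    return tweets

rows = [{"handle": h, "id": t["id"], "text": t["text"],
         "created_at": t["created_at"]}
        for h in ("realDonaldTrump", "HillaryClinton")
        for t in get_tweets(h)]
pd.DataFrame(rows).to_csv("tweets.csv", index=False)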


Part Two: Data Cleaning and Munging
You can find the notebook for this part as “NLPAnalysis.ipynb” in our GitHub repository: https://git.io/vPwxr.

To fully take advantage of machine learning, we need to add features to this dataset. For example, we might want to take into account the punctuation each Twitter account uses, reasoning that it might help us discriminate between Trump and Clinton. If we count the punctuation marks in each tweet and average across all tweets, we get the following graph:

[Graph: average number of punctuation marks per tweet, by account]
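
Here is a rough sketch of how such a feature can be computed with pandas (the tweets.csv file and its column names are the illustrative ones from Part One):

import string

import pandas as pd

tweets = pd.read_csv("tweets.csv")  # illustrative file from Part One
punct = set(string.punctuation)

# Count punctuation marks per tweet, then average per account
tweets["n_punct"] = tweets["text"].apply(lambda s: sum(ch in punct for ch in s))
print(tweets.groupby("handle")["n_punct"].mean())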

Or perhaps we care about how many hashtags or mentions each account uses:

[Graph: average number of hashtags and mentions per tweet, by account]

With our timestamp data, we can examine Tweets by their Retweet count, over time:


[Graph: tweets by retweet count, over time]
The tall blue skyscraper was Clinton’s “Delete Your Account” Tweet

[Graph: the same retweet counts, on a logarithmic scale]

We can also compare the distribution of Tweets over time. We can see that Clinton tweets more frequently than Trump; this is also why we can reach further back in time for Trump, since there's a hard limit on the number of Tweets we can access per handle.


[Graph: distribution of tweets over time, by account]
The Democratic National Convention was in session from July 25th to the 28th

We can construct heatmaps of when these candidates were posting:


[Heatmap of Trump Tweets, by day and hour]

All this light analysis was useful for intuition, but our real goal is to classify tweets using only their text (including features derived from that text). If we allowed ourselves features like the timestamp, the task would become a lot easier.

We can utilize a process called tokenization, which lets us create features from the words in our text. To understand why this is useful, let’s pretend to only care about the mentions (for example, @h2oai) in each tweet. We would expect that Donald Trump would mention certain people (@GovPenceIN) more than others and certainly different people than Hillary Clinton. Of course, there might be people both parties tweet at (maybe @POTUS). These patterns could be useful in classifying Tweets. 

Now, we can apply that same line of thinking to words. To make sure that we are only including valuable words, we can exclude stop-words, which are filler words such as ‘and’ or ‘the.’ We can also use a metric called term frequency–inverse document frequency (TF-IDF), which scores how important a word is to a document in a collection.
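
Here is a minimal sketch of both ideas with scikit-learn’s TfidfVectorizer (the tweets.csv file and its column names are the illustrative ones from Part One):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = pd.read_csv("tweets.csv")  # illustrative file from Part One

# Tokenize, drop English stop-words, and weight each term by TF-IDF
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets["text"])

# X is a sparse (n_tweets x n_terms) matrix, one column per term
print(X.shape)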

There are also other ways to use and combine NLP. One approach might be sentiment analysis, where we interpret a tweet to be positive or negative. David Robinson did this to show that Trump’s personal tweets are angrier, as opposed to those written by his staff.

Another approach might be to create parse trees that represent sentence structure. Once each tweet has been represented in this format, you can examine metrics such as tree depth or number of nodes, which are measures of the complexity of a sentence. Maybe Trump tweets a lot of clauses, as opposed to full sentences.


Part Three: Building, Training, and Testing the Model
You can find the notebooks for this part as “Python-TF-IDF.ipynb” and “TweetsNLP.flow” in our GitHub repository: https://git.io/vPwxr.

There were a lot of approaches to take, but we decided to keep it simple for now by only using TF-IDF vectorization. The actual code was relatively simple to write thanks to the excellent scikit-learn package alongside NLTK.

We could have also done some further cleaning of the data, such as excluding URLs from our tweet text (right now, strings such as “zy7vpfrsdz” get their own feature column because the vectorizer treats them as words). Skipping this step won’t hurt our model, since each URL token is unique, but cleaning would save space and time. Another strategy could be to stem words, reducing each word to its root (so ‘hearing’ and ‘heard’ would both be coded as ‘hear’).
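
Note that a pure suffix-stripping stemmer (like Porter’s) won’t actually map ‘heard’ to ‘hear’; that example really calls for a lemmatizer. A tiny sketch with NLTK’s WordNetLemmatizer (it assumes the wordnet corpus has been downloaded):

from nltk.stem import WordNetLemmatizer  # first run: nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
# Reduce verb forms to their root, as described above
print(lemmatizer.lemmatize("hearing", pos="v"))  # -> hear
print(lemmatizer.lemmatize("heard", pos="v"))    # -> hear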

Still, our model (created using H2O Flow) produces quite a good result without those improvements. We can use a variety of metrics to confirm this, including the Area Under the Curve (AUC) of the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). A score of 0.5 means the model is no better than flipping a coin, and a score of 1 means the model separates the two classes perfectly.
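
If you’re computing this yourself rather than reading it off Flow, scikit-learn can calculate AUC directly from predicted probabilities; a toy sketch with made-up numbers:

from sklearn.metrics import roc_auc_score

# Made-up values: 1 = Trump, 0 = Clinton; y_score is the model's P(Trump)
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]

# Every Trump tweet outscores every Clinton tweet here, so AUC = 1.0;
# random guessing would hover around 0.5
print(roc_auc_score(y_true, y_score))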


[Graph: ROC curve of the model]
The model curve is blue, while the red curve represents 50–50 guessing

For a more intuitive judgment of our model, we can look at its variable importances (what the model considers to be good discriminators of the data) and see if they make sense:


[Screenshot: variable importances of the model]
Can you guess which words (variables) correspond (are important) to which candidate?

Maybe the next step could be to build an app that takes in text and outputs whether it is more likely to have come from Clinton or Trump. Perhaps we could even consider the Tweets of several politicians, assign them a ‘liberal/conservative’ score, and then build a model to predict whether a Tweet is more conservative or more liberal (important features would maybe include “Benghazi” or “climate change”). Another cool application might be a deep learning model, in the footsteps of @DeepDrumpf.

If this inspired you to create analysis or build models, please let us know! We might want to highlight your project 🎉📈.

A Newbie’s Guide to H2O in Python – Guest Post

This blog was originally posted here

I created this guide to help fellow newbies get their feet wet with H2O, an open-source predictive analytics platform that is fast, powerful, and easy to use. Using a combination of extraordinary math and high-performance parallel processing, H2O allows you to quickly create models for big data. The steps below show you how to download and start analyzing data at high speeds with H2O. After that it’s up to you.

What You’ll Learn

  • How to download H2O (just updated to OS X El Capitan? Then Java too)
  • How to use H2O with IPython Notebook & where to get demo scripts
  • How to teach a computer to recognize handwritten digits with H2O
  • Where to find documentation and community resources

A Delicious Drink of Water — Downloading H2O

(If you don’t feel like reading the long version below just go here)

I recommend downloading the latest release of H2O (which is ‘Bleeding Edge’ as of this moment) because it has the most Python features, but you can also see the other releases here, as well as the software requirements. Okay, let’s get started:

Do you have Java on your computer? Not sure? Here’s how to check:

  • Open your terminal and type in ‘java -version’:

MacBook-Pro:~ username$ java -version


If you don’t have Java you can either click through the pop up dialogue box and make your way to the correct downloadable version, or you can go directly to the Java downloads page here (two-for-one tip: download the Java Development Kit and get the Java Runtime Environment with it).

Now that you have Java (fingers crossed), you can download H2O (I’m assuming you have Python, but if you don’t, consider downloading Anaconda which gives you access to amazing Python packages for data analysis and scientific computing).

You can find the official instructions to Download H2O’s ‘Bleeding Edge’ release here (click on the ‘Install in Python’ tab), or follow below:

  1. Prerequisite: Python 2.7
  2. Type the following in your terminal:

Fellow newbies: don’t type in the ‘MacBook-Pro:~ username$’ part; only type in what’s listed after the ‘$’ (you can get more command line help here).

MacBook-Pro:~ username$ pip install requests
MacBook-Pro:~ username$ pip install tabulate
MacBook-Pro:~ username$ pip install scikit-learn

MacBook-Pro:~ username$ pip uninstall h2o
MacBook-Pro:~ username$ pip install http://h2o-release.s3.amazonaws.com/h2o/master/3250/Python/h2o-3.7.0.3250-py2.py3-none-any.whl

As shown above, if you installed an earlier version of H2O, uninstalling and reinstalling H2O with pip will do the trick.

Let’s Get Interactive — IPython Notebook

If you don’t already have IPython Notebook, you can download it by following these instructions. If you downloaded Anaconda, it comes with IPython Notebook, so you’re set. And here’s a video tutorial on how to use IPython Notebook.

If everything goes as planned, to open IPython Notebook you ‘cd’ to your directory of choice (I chose my Desktop folder) and enter ‘ipython notebook’. (If you’re still new to the command line, learn more about using ‘cd’, which I like to use as a verb, here and here).

MacBook-Pro:~ username$ cd Desktop
MacBook-Pro:Desktop username$ ipython notebook

Random Note: After I updated to OS X El Capitan, the command above didn’t work. For many people, using ‘conda update conda’ and then ‘conda update ipython’ will solve the issue, but in my case I got an SSL error that wouldn’t let me ‘conda update’ anything. I found the solution here, using:

MacBook-Pro:~ username$ conda config --set ssl_verify False
MacBook-Pro:~ username$ conda update requests openssl
MacBook-Pro:~ username$ conda config --set ssl_verify True

Now that you have IPython Notebook, you can play around with some of H2O’s demo notebooks. If you’re new to Github, downloading the demos to your desktop can seem daunting, but don’t worry, it’s easy. Here’s the trick:

  1. Navigate to H2O’s Python Demo Repository
  2. Click on your ‘.ipynb’ demo of choice (let’s do citi_bike_small.ipynb)
  3. Click on ‘Raw’ in the upper right corner, then after the next web page opens, go to ‘File’ on the menu bar and select ‘Save Page As’ (or similar)
  4. Open your terminal, cd to the Downloads folder, or wherever you saved the IPython Notebook, then type ‘ipython notebook citi_bike_small.ipynb’
  5. Now you can go through the demo running each cell individually (click on the cell and press shift + enter)

Classifying Handwritten Digits — Enter a Kaggle Competition

A great way to get a feel for H2O is to test it out on a Kaggle data science competition. Don’t know what Kaggle is? Never enter a Kaggle Competition? That’s totally fine, I’ll give you a script to get your feet wet. If you’re still nervous here’s a great article about how to get started with Kaggle given your previous experience.

Are you excited? Get excited! You are going to teach your computer to recognize HANDWRITTEN DIGITS! (I feel like if you’re still reading at this point, it’s time to let my enthusiasm shine through).

  1. Take a look at Kaggle’s Digit Recognizer Competition
  2. Look at a demo notebook to get started
  3. Download the notebook by clicking on ‘Raw’ and then saving it
  4. Open up and run the notebook to generate a submission csv file
  5. Submit the file for your first submission to Kaggle, then play around with your model parameters and see if you can improve your Kaggle submission score
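
If you’d like a preview of what such a notebook does, here is a rough sketch of training an H2O deep learning model on Kaggle’s Digit Recognizer files (the file paths and parameters are illustrative starting points, not the demo’s exact code):

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Kaggle's train.csv: a "label" column plus 784 pixel columns
train = h2o.import_file("train.csv")
train["label"] = train["label"].asfactor()  # treat digits as classes

model = H2ODeepLearningEstimator(hidden=[128, 128], epochs=10)
model.train(x=train.columns[1:], y="label", training_frame=train)

# Score Kaggle's test.csv; you'd still need to add the ImageId column
# that Kaggle's submission format expects before uploading
test = h2o.import_file("test.csv")
preds = model.predict(test)
h2o.export_file(preds["predict"], "submission.csv", force=True)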

Getting Help — Resources & Documentation