Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLP

The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?

The solution: Yes! We’ll divide this tutorial into three parts: the first on gathering the necessary data, the second on data exploration, munging, and feature engineering, and the third on building the model itself. You can find all of our code on GitHub (https://git.io/vPwxr).


Part One: Collecting the Data
Note: We are going to be using Python. The concepts translate to R as well, and we have some R code on GitHub that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.

We used the Twitter API to collect tweets from both presidential candidates; these would become our dataset. Twitter only lets you access roughly the most recent 3,000 tweets from a given handle, even though it keeps all of the Tweets in its own databases.

The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form you can access your app, and your keys and tokens. Specifically we’re looking for four things: the client key and secret (called consumer key and consumer secret) and the resource owner key and secret (called access token and access token secret).


[Screenshot: the keys and tokens page of our Twitter app]
We save this information in JSON format in a separate file.

Then, we can use the Python libraries Requests and Pandas to gather the tweets into a DataFrame. We only really care about three things: the author of the Tweet (Donald Trump or Hillary Clinton), the text of the Tweet, and the unique identifier of the Tweet, but we can take in as much other data as we want (for the sake of data exploration, we also included the timestamp of each Tweet).

Once we have all this information, we can output it to a .csv file for further analysis and exploration. 
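
Here is a minimal sketch of that collection step, assuming the standard v1.1 statuses/user_timeline endpoint and the requests-oauthlib package; the credentials file layout, paging depth, and column names are our own illustrative choices, not the exact code in the notebook:

import json

import pandas as pd
import requests
from requests_oauthlib import OAuth1

# Load the four keys and tokens we saved earlier (the file layout is our own convention)
with open("twitter_credentials.json") as f:
    creds = json.load(f)

auth = OAuth1(creds["consumer_key"], creds["consumer_secret"],
              creds["access_token"], creds["access_token_secret"])

def get_tweets(handle, pages=16):
    """Page backwards through a user's timeline, up to 200 tweets per request."""
    url = "https://api.twitter.com/1.1/statuses/user_timeline.json"
    params = {"screen_name": handle, "count": 200}
    tweets = []
    for _ in range(pages):
        batch = requests.get(url, auth=auth, params=params).json()
        if not batch:
            break
        tweets.extend(batch)
        params["max_id"] = batch[-1]["id"] - 1  # resume just before the oldest tweet seen
    return pd.DataFrame({"handle": [handle] * len(tweets),
                         "id": [t["id"] for t in tweets],
                         "text": [t["text"] for t in tweets],
                         "created_at": [t["created_at"] for t in tweets]})

tweets = pd.concat([get_tweets("realDonaldTrump"), get_tweets("HillaryClinton")])
tweets.to_csv("tweets.csv", index=False)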


Part Two: Data Cleaning and Munging
You can find the notebook for this part as “NLPAnalysis.ipynb” in our GitHub repository: https://git.io/vPwxr.

To fully take advantage of machine learning, we need to add features to this dataset. For example, we might want to take into account the punctuation that each Twitter account uses, thinking that it might help us discriminate between Trump and Clinton. If we count the number of punctuation symbols in each Tweet and average across all Tweets, we get the following graph:

[Figure: average punctuation symbols per Tweet, by account]

Or perhaps we care about how many hashtags or mentions each account uses:

[Figure: average hashtags and mentions per Tweet, by account]
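
Computing the derived features behind these two charts is straightforward with pandas; a small sketch (the column names come from the illustrative DataFrame built earlier):

import string

punct = set(string.punctuation)

tweets["n_punct"] = tweets["text"].apply(lambda s: sum(ch in punct for ch in s))
tweets["n_hashtags"] = tweets["text"].str.count("#")
tweets["n_mentions"] = tweets["text"].str.count("@")

# Average per handle, mirroring the charts above
print(tweets.groupby("handle")[["n_punct", "n_hashtags", "n_mentions"]].mean())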

With our timestamp data, we can examine Tweets by their Retweet count, over time:


[Figure: Retweet counts over time]
The tall blue skyscraper was Clinton’s “Delete Your Account” Tweet

[Figure: the same Retweet graph, on a logarithmic scale]

We can also compare the distribution of Tweets over time. Clinton tweets more frequently than Trump; this is also why the Tweets we could access for Trump reach further back in time, since there’s a hard cap on the number of Tweets the API returns.


[Figure: Tweet volume over time, by account]
The Democratic National Convention was in session from July 25th to the 28th

We can construct heatmaps of when these candidates were posting:


[Figure: heatmap of Trump Tweets, by day and hour]

All this light analysis was useful for building intuition, but our real goal is to classify Tweets using only their text (and features derived from that text). Including metadata such as the timestamp would make the task a lot easier, but it would defeat the purpose.

We can utilize a process called tokenization, which lets us create features from the words in our text. To understand why this is useful, let’s pretend to only care about the mentions (for example, @h2oai) in each tweet. We would expect that Donald Trump would mention certain people (@GovPenceIN) more than others and certainly different people than Hillary Clinton. Of course, there might be people both parties tweet at (maybe @POTUS). These patterns could be useful in classifying Tweets. 

Now, we can apply that same line of thinking to words. To make sure that we are only including valuable words, we can exclude stop-words, which are filler words such as ‘and’ or ‘the.’ We can also weight words with a metric called term frequency–inverse document frequency (TF-IDF), which measures how important a word is to a document relative to the rest of the corpus.
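
A minimal sketch of this step with scikit-learn’s TfidfVectorizer, reusing the illustrative DataFrame from earlier (the built-in English stop-word list stands in for any list we might curate, and max_features is an arbitrary cap):

from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize each Tweet, drop English stop-words, and weight every remaining term by TF-IDF
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(tweets["text"])  # sparse matrix: one row per Tweet, one column per term
y = (tweets["handle"] == "realDonaldTrump").astype(int)  # 1 = Trump, 0 = Clinton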

There are also other NLP techniques we could use or combine. One approach might be sentiment analysis, where we score a tweet as positive or negative. David Robinson did this to show that Trump’s personal tweets are angrier than those written by his staff.

Another approach might be to create word trees that represent sentence structure. Once each tweet has been represented in this format, you can examine metrics such as tree length or number of nodes, which are measures of the complexity of a sentence. Maybe Trump tweets a lot of clauses, as opposed to full sentences.


Part Three: Building, Training, and Testing the Model
You can find the notebooks for this part as “Python-TF-IDF.ipynb” and “TweetsNLP.flow” in our GitHub repository: https://git.io/vPwxr.

There were many approaches we could have taken, but we decided to keep it simple for now by using only TF-IDF vectorization. The actual code was relatively simple to write thanks to the excellent scikit-learn package, used alongside NLTK.
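
To make the pipeline concrete, here is a rough scikit-learn baseline built on the TF-IDF features sketched earlier; the logistic-regression classifier and the train/test split are our own illustrative choices, not the H2O Flow model described below:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hold out a quarter of the Tweets for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Score the held-out Tweets and report the AUC
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))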

We could also have cleaned the data further, such as excluding URLs from the Tweet text (right now, strings such as “zy7vpfrsdz” get their own feature column because the vectorizer treats them as words). Skipping this step doesn’t hurt the model much, since those URL fragments are unique, but removing them would save space and time. Another strategy could be to stem words, reducing each word to its root (so ‘hearing’ and ‘heard’ would both be coded as ‘hear’).
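
Stemming could be folded into the vectorizer by overriding its tokenizer; a minimal sketch using NLTK’s PorterStemmer (the word_tokenize call assumes NLTK’s tokenizer data has already been downloaded):

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Reduce each token to its stem, e.g. 'hearing' -> 'hear'
    return [stemmer.stem(token) for token in word_tokenize(text)]

stemmed_vectorizer = TfidfVectorizer(tokenizer=stem_tokens)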

Still, our model (created using H2O Flow) produces quite a good result without those improvements. We can use a variety of metrics to confirm this, including the Area Under the Curve (AUC). The AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). A score of 0.5 means that the model is no better than flipping a coin, and a score of 1 means that the model separates the two classes perfectly.


[Figure: ROC curve for the model]
The model curve is blue, while the red curve represents 50–50 guessing

For a more intuitive judgement of our model, we can look at its variable importances (the words the model considers to be good discriminators) and see if they make sense:


[Figure: variable importances of the model]
Can you guess which words (variables) correspond (are important) to which candidate?

Maybe the next step could be to build an app that takes in text and outputs whether the text is more likely to have come from Clinton or Trump. Perhaps we could even consider the Tweets of several politicians, assign them a ‘liberal/conservative’ score, and then build a model to predict whether a Tweet is more conservative or more liberal (important features might include “Benghazi” or “climate change”). Another cool application might be a deep learning model, in the footsteps of @DeepDrumpf.

If this inspired you to create analysis or build models, please let us know! We might want to highlight your project 🎉📈.

sparklyr: R interface for Apache Spark

This post is reposted from RStudio’s announcement of sparklyr, RStudio’s R interface for Apache Spark.

[Figure: sparklyr illustration]

  • Connect to Spark from R. The sparklyr package provides a complete dplyr backend.
  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide interfaces to Spark packages.

Installation

You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr")

You should also install a local version of Spark for development purposes:

library(sparklyr)
spark_install(version = "1.6.2")

To upgrade to the latest version of sparklyr, run the following command and restart your R session:

devtools::install_github("rstudio/sparklyr")

If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).

Connecting to Spark

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark via the spark_connect function:

library(sparklyr)
sc <- spark_connect(master = "local")

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.

For more information on connecting to remote Spark clusters see the Deployment section of the sparklyr website.

Using dplyr

We can now use all of the available dplyr verbs against the tables within the cluster.

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)
## [1] "batting" "flights" "iris"

To start with here’s a simple filtering example:

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## Source:   query [?? x 19]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## 
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      542            540         2      923
## 3   2013     1     1      702            700         2     1058
## 4   2013     1     1      715            713         2      911
## 5   2013     1     1      752            750         2     1025
## 6   2013     1     1      917            915         2     1206
## 7   2013     1     1      932            930         2     1219
## 8   2013     1     1     1028           1026         2     1350
## 9   2013     1     1     1042           1040         2     1325
## 10  2013     1     1     1231           1229         2     1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dbl>

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)

[Figure: ggplot2 plot of flight delays versus distance]

Window Functions

dplyr window functions are also supported, for example:

batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
## Source:   query [?? x 7]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## Groups: playerID
## 
##     playerID yearID teamID     G    AB     R     H
##        <chr>  <int>  <chr> <int> <int> <int> <int>
## 1  abbotpa01   2000    SEA    35     5     1     2
## 2  abbotpa01   2004    PHI    10    11     1     2
## 3  abnersh01   1992    CHA    97   208    21    58
## 4  abnersh01   1990    SDN    91   184    17    45
## 5  abreujo02   2014    CHA   145   556    80   176
## 6  acevejo01   2001    CIN    18    34     1     4
## 7  acevejo01   2004    CIN    39    43     0     2
## 8  adamsbe01   1919    PHI    78   232    14    54
## 9  adamsbe01   1918    PHI    84   227    10    40
## 10 adamsbu01   1945    SLN   140   578    98   169
## # ... with more rows

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

Using SQL

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:

library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Machine Learning

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset, and see if we can predict a car’s fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.

# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit
## Call: ml_linear_regression(., response = "mpg", features = c("wt", "cyl"))
## 
## Coefficients:
## (Intercept)          wt         cyl 
##   37.066699   -2.309504   -1.639546

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.

summary(fit)
## Call: ml_linear_regression(., response = "mpg", features = c("wt", "cyl"))
## 
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6881 -1.0507 -0.4420  0.4757  3.3858 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 37.06670    2.76494 13.4059 2.981e-07 ***
## wt          -2.30950    0.84748 -2.7252   0.02341 *  
## cyl         -1.63955    0.58635 -2.7962   0.02084 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-Squared: 0.8665
## Root Mean Squared Error: 1.799

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines. To learn more see the machine learning section.

Reading and Writing Data

You can read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem of cluster nodes.

temp_csv <- tempfile(fileext = ".csv")
temp_parquet <- tempfile(fileext = ".parquet")
temp_json <- tempfile(fileext = ".json")

spark_write_csv(iris_tbl, temp_csv)
iris_csv_tbl <- spark_read_csv(sc, "iris_csv", temp_csv)

spark_write_parquet(iris_tbl, temp_parquet)
iris_parquet_tbl <- spark_read_parquet(sc, "iris_parquet", temp_parquet)

spark_write_json(iris_tbl, temp_json)
iris_json_tbl <- spark_read_json(sc, "iris_json", temp_json)

src_tbls(sc)
## [1] "batting"      "flights"      "iris"         "iris_csv"    
## [5] "iris_json"    "iris_parquet" "mtcars"

Extensions

The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general purpose cluster computing system there are many potential applications for extensions (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).

Here’s a simple example that wraps a Spark text file line counting function with an R function:

# write a CSV 
tempfile <- tempfile(fileext = ".csv")
write.csv(nycflights13::flights, tempfile, row.names = FALSE, na = "")

# define an R interface to Spark line counting
count_lines <- function(sc, path) {
  spark_context(sc) %>% 
    invoke("textFile", path, 1L) %>% 
      invoke("count")
}

# call spark to count the lines of the CSV
count_lines(sc, tempfile)
## [1] 336777

To learn more about creating extensions see the Extensions section of the sparklyr website.

dplyr Utilities

You can cache a table into memory with:

tbl_cache(sc, "batting")

and unload from memory using:

tbl_uncache(sc, "batting")

Connection Utilities

You can view the Spark web console using the spark_web function:

spark_web(sc)

You can show the log using the spark_log function:

spark_log(sc, n = 10)
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 224
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 223
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 222
## 16/09/24 07:50:59 INFO BlockManagerInfo: Removed broadcast_64_piece0 on localhost:56324 in memory (size: 20.6 KB, free: 483.0 MB)
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 220
## 16/09/24 07:50:59 INFO Executor: Finished task 0.0 in stage 67.0 (TID 117). 2082 bytes result sent to driver
## 16/09/24 07:50:59 INFO TaskSetManager: Finished task 0.0 in stage 67.0 (TID 117) in 122 ms on localhost (1/1)
## 16/09/24 07:50:59 INFO DAGScheduler: ResultStage 67 (count at NativeMethodAccessorImpl.java:-2) finished in 0.122 s
## 16/09/24 07:50:59 INFO TaskSchedulerImpl: Removed TaskSet 67.0, whose tasks have all completed, from pool 
## 16/09/24 07:50:59 INFO DAGScheduler: Job 47 finished: count at NativeMethodAccessorImpl.java:-2, took 0.125238 s

Finally, we disconnect from Spark:

spark_disconnect(sc)

RStudio IDE

The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for:

  • Creating and managing Spark connections
  • Browsing the tables and columns of Spark DataFrames
  • Previewing the first 1,000 rows of Spark DataFrames

Once you’ve installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances:

[Screenshot: the New Connection dialog in the RStudio Spark pane]

Once you’ve connected to Spark you’ll be able to browse the tables contained within the Spark cluster:

[Screenshot: browsing Spark tables from the Spark pane]

The Spark DataFrame preview uses the standard RStudio data viewer:

[Screenshot: previewing a Spark DataFrame in the RStudio data viewer]

The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release.

When is the Best Time to Look for Apartments on Craigslist?

A while ago I was looking for an apartment in San Francisco. There are a lot of problems with finding housing in San Francisco, mostly stemming from the fierce competition. I was checking Craigslist every single day. It still took me (and my girlfriend) a few months to find a place — and we had to sublet for three weeks in between. Thankfully we’re happily housed now but it was quite the journey. Others have talked about their search for SF housing, but I have a few tips myself:

1) While Craigslist continues to be the best resource for finding housing (it’s how I found my current apartment), there are quite a few Facebook groups that may also be useful. My search settled into weekly cycles: I’d send out lots of emails, get a stream of responses, visit 1-2 places per weekday evening, and then get a stream of rejections back. If you do check Craigslist, the best times to check are Tuesday and Wednesday evenings, and then the following mornings, as the following graphic shows.

[Figure: when SF apartments are posted to Craigslist, by day and time]

Data sourced from over 10,000 SF apartment Craigslist postings over the month of September

2) Be prepared to apply to an apartment on the spot. I’ve been burned a few times when I took a day or two to fully think about the location and price, but by the time I applied I was at the back of the line. It really helps to know exactly what you want, and know how to spot it. The good news is that even if you’re not sure of your wants and needs at the beginning of your search, you’ll learn as you visit more and more apartments.

3) Make sure you know where the laundry machines are. I once lived in an apartment where I forgot to ask if they had laundry in the building (they didn’t). The result was that I spent an unanticipated few hours every few weeks cleaning my clothes. It’s not the end of the world, and unfortunately I doubt it would have changed my situation, but it’s still a very important amenity that some people overlook.

Focus

———- Forwarded message ———
From: SriSatish Ambati
Date: Thu, Sep 15, 2016 at 10:17 PM
Subject: changes and all hands tomorrow.
To: team

Team,

Our focus has changed towards fewer, larger deals & deeper engagements with a handful of finance and insurance customers.

We took a hard look at our marketing spend, PR programs and personnel. We let go most of our amazing inside sales talent, and two of our account executives. We are not building a vertical in IoT. In all, nine business folks were affected. No further changes are anticipated or necessary.

These were heroic partners in our journey. I spoke to most all of them today to personally convey the message. Some were with me for a short time, many for years – all great humans who diligently served me and H2O well. I’m grateful for their support and partnership towards my vision. I learnt a lot from each one of them and will not hesitate to assist them in any way possible, personally.

Thank you. heroes in bcc. may you find fulfillment & love in your path. It’s a small world and we will all meet very soon.

Our goal as a startup is to get to the most optimal business unit before nurturing and scaling it for growth. We will work tirelessly to get us to that state.

thank you for your partnership, Sri


culture . code . customer
c o m m u n i t y, h2o.ai,

Distracted Driving

Last week, we started to examine the 7.2% increase in traffic fatalities from 2014 to 2015, the reversal of a near decade-long downward trend. We then broke out the data by various accident classifications, such as “speeding” or “driving with a positive BAC,” and identified those classifications that had the greatest increase. One label that showed promise for improvement was “involving a distracted driver.” According to Pew Research, the number of Americans who own a mobile device has pretty consistently risen over the past decade, as has the number of Americans who own a smartphone. Moreover, apps like Pokemon Go have built-in features that incentivize driving while playing, and these types of augmented reality games are only going to become more common.

The National Highway Traffic Safety Administration (NHTSA) defines distracted driving as “any activity that could divert a person’s attention away from the primary task of driving.” This includes several activities, from texting while driving to using one hand to place a call. The Governors Highway Safety Association (GHSA), an organization that “provides leadership and representation for the states and territories to improve traffic safety,” notes that states collect data on distracted driving in different ways. While most states split up distracted driving into two or three categories, some states use only one category (and other states use as many as 15 categories!). These categories include not just distraction by technology, but also events such as animals in the vehicle or the consumption of food & drink. Distracted driving is also said to be under-reported, because drivers are less likely to admit to using their phone in the event of a crash.

Because of these discrepancies, it’s important to keep in mind that regulations vary from state to state and policy that successfully reduces accidents in one state may not automatically follow to another state. Still, sharing what works and what doesn’t can be important in saving lives, which is one reason why this data is collected and aggregated. So, which states are succeeding at reducing the number of fatalities caused by distracted driving?

[Figure: 2015 distracted-driving fatality rates by state]

In 2015, New Mexico had one of the highest rates of distracted driving fatalities per mile driven. New Mexico Governor Susana Martinez recognized this even back in 2014, signing a bill that banned texting while driving and noting, “Texting while driving is now the leading cause of death for New Mexico’s teen drivers. Most other states have banned the practice of texting while driving.”

[Figure]

Did these laws end up working? Well, maybe. If we look at all crashes (not just fatal ones) in New Mexico from 2005 to 2014, the general trend was downward post-2007, seemingly leveling out during 2014. Unfortunately, 2015 data on the total number of crashes in New Mexico isn’t available, so we aren’t able to examine whether the bill ultimately succeeded in reducing all crashes due to distracted driving.

[Figure]

If we examine only fatalities, we see that while the total number of fatalities decreased in 2015, the 2014 bill doesn’t seem to have actually affected distracted driving fatalities. The bill was signed in March and took effect in July, so its effects had several months to propagate. This ambiguous policy impact isn’t limited to New Mexico either. Economists Rahi Abouk and Scott Adams ran a national study where they discovered that “while the effects are strong for the month immediately following ban imposition, accident levels appear to return toward normal levels in about three months.” Still, New Mexico’s decrease in fatalities bucked the national trend by the greatest amount (we’ll talk about which states experienced the largest increase in fatalities in a separate post).

[Image: a teaser photo for a new NHTSA advertising campaign]

Of course, legislation is only one tactic that can be used to prevent distracted driving. The GHSA notes that (some) states use other tactics such as social media outreach or statewide campaigns. Several states have adopted slogans, which range from Wyoming’s passive “The road is no place for distractions” to Missouri’s more flavorful “U TXT UR NXT, NO DWT.” States sometimes aim these campaigns at specific demographics, such as teens and young adults, who have higher rates of distracted driving (quite a few states also pass legislation that directly targets young people).


This post is the second in our series on traffic fatalities, inspired by a call to action put out by the Department of Transportation. Watch out for another post highlighting a different aspect of the dataset next week. In the meantime, if you have any questions, comments, or suggestions, you can find me at @JayMahabal or email me at jay@h2o.ai.

Introducing H2O Community & Support Portals

At H2O, we enjoy serving our customers and the community, and we take pride in making them successful while using H2O products. Today, we are very excited to announce two great platforms for our customers and for the community to better communicate with H2O. Let’s start with our community first:

[Image: H2O Community badge]

The success of every open source project depends on a vibrant community, and having an active community helps to convert an average product into a successful product. So to maintain our commitment to our H2O community, we are releasing an updated community platform at https://community.h2o.ai. This community platform is available for everyone, whether you are new to machine intelligence or are a seasoned veteran. If you are new to machine intelligence or H2O, you have an opportunity to learn from great minds, and if you are a seasoned industry veteran, you can not only enhance your skillset, you can also help others to achieve success.

Our objective is to develop this community in a way where every community member has the opportunity to establish himself or herself as a technology leader or expert by helping others. Every moment you spend here in the community, either by creating or consuming content, will not only help you to learn more, but will also help to establish your own brand as a respected member of our machine intelligence community. Here are some highlights for our community:

  • The community content is divided into three main sections:
    • Questions
    • Ideas
    • Knowledge Base Articles
  • The content in the above sections is organized into various technology groups called spaces, e.g. Algorithms, H2O, Sparkling Water, Exception, Debugging, Build, etc.

  • Every piece of content needs to belong to a specific space so that experts in that space can provide faster and better responses. A list of all spaces is here.
  • As a visitor, you are welcome to visit every section of the community and learn from posts by community members.
  • Once logged in as a community member using OpenID®, you can ask questions, write knowledge base articles, and propose ideas or feature requests for our products.
  • You are welcome to provide feedback on others’ content by liking a KB article, question, or answer, or simply by up-voting an idea.
  • As you spend more and more time in the community, you will be given higher-level roles for managing and improving your own community.
  • As a logged-in member, every activity earns points toward your reputation, and as you spend more time in the community, you will rank higher among your peers and establish yourself as an expert or a technology leader.
  • Please make sure you read the Guidelines before posting a question.
  • We are working towards making the site more integrated with other social platforms such as Twitter® and Facebook®, as well as adding support for other OpenID providers.

Now let me introduce our updated enterprise support portal:

[Image: H2O Support badge]

H2O has been used by over 60K data scientists since its initial release, and now more than 7,000 organizations worldwide use H2O applications, including H2O-3 and Sparkling Water. To assist our enterprise customers, we have revamped our enterprise support portal, which is available at https://support.h2o.ai. With this new portal, we are able to provide SLA-based, 24×7 support for our enterprise customers. Please visit this page to learn about the H2O enterprise support offering. While this support portal is specially designed to assist our enterprise customers, it is also open to everyone who is using any of the free, open source H2O applications.

You can open a support incident with the H2O support team in one of two ways:

  • Through the Support Portal
    • Please visit the support portal at https://support.h2o.ai and select “NEW SUPPORT INCIDENT”.
    • You don’t need to be logged in to the support portal to open a new incident; however, it is advisable to have an account so that you can monitor the ticket’s progress.
    • You will have an opportunity to set the incident priority: Low, Medium, High, or Urgent.
  • By Email
    • Send an email to support@h2o.ai describing your problem clearly.
    • Please attach (in zipped format) any other information that could help identify the root cause.

When opening a support incident, please provide your H2O version, your Hadoop or Spark version (if applicable), and any logs, stack dumps, or other information that might be helpful for troubleshooting the problem. Whether you are an H2O enterprise customer or just using one of our free, open source H2O applications, both of these venues are open for you to bring your questions or comments. We are listening and are here to help.

We look forward to working with you through our community and support portals.

Avkash Chauhan

H2O Support: Customer focused and Community Driven

Fatal Traffic Accidents Rise in 2015

On Tuesday, August 30th, the National Highway Traffic Safety Administration released its annual dataset of traffic fatalities, asking interested parties to use the dataset to identify the causes of the 7.2% increase in fatalities from 2014 to 2015. As part of H2O.ai‘s vision of using artificial intelligence for the betterment of society, we were excited to tackle this problem.

[Figure]

This post is the first in our series on the Department of Transportation dataset and driving fatalities, which will hopefully culminate in a hackathon in late September, where we’ll invite community members to join forces with the talented engineers and scientists at H2O.ai to find a solution to this problem and prescribe policy changes.


To begin, we started by reading some literature and getting familiar with the data. These documents served as excellent inspiration for possible paths of analysis and guided our thinking. Our introductory investigation was based around asking a series of questions, paving the way for detailed analysis down the road. The dataset includes every (reported) accident along with several labels, from “involving a distracted driver” to “involving a driver aged 15-20.” Even though fatalities as a whole fell during the last ten years, more progress has been made in some areas than others, and comparing 2014 incidents to 2015 incidents can reveal promising openings for policy action.

[Figure]

It’s important to keep in mind that regulations vary from state to state and policy that successfully reduces accidents in one state may not follow to another state. Still, sharing what works and what doesn’t can be important in saving lives, which is one reason why this data is collected and aggregated.

Next week we’ll examine distracted driving, and investigate whether or not the laws that prohibited texting while driving made a difference — and why those laws didn’t continue the downward trend in 2015. We’ll follow that with an investigation into speeding and motorcycle crashes. In the meantime, if you have any questions, comments, or suggestions, you can find me at @JayMahabal or email me at jay@h2o.ai.


Clarification: September 14th, 2016
We shifted graph labels to reduce confusion.

IoT – Take Charge of Your Business and IT Insights Starting at the Edge

Instead of just being hype, the Internet of Things (IoT) is now becoming a reality. Gartner forecasts that 6.4 billion connected devices will be in use worldwide in 2016, with 5.5 million new devices getting connected every day. These devices range from wearables, to sensors in vehicles that can detect surrounding obstacles, to sensors in pipelines that detect their own wear-and-tear. Huge volumes of data are collected from these connected devices, and yet companies struggle to get optimal business and IT outcomes from it.

Why is this the case?
Rule-based data models limit insights. Industry experts have a wealth of knowledge that manually drives business rules, which in turn drive the data models. Many current IoT practices simply run large volumes of data through these rule-based models, but the business insights are limited by what rule-based models allow. Machine Learning/Artificial Intelligence allows new patterns to be found within stored data without human intervention. These new patterns can be applied to data models, allowing new insights to be generated for better business results.
Analytics in the backend data center delays insights. In current IoT practice, data is collected and analyzed in the backend data center (e.g. OLAP/MPP database, Hadoop, etc.). Typically, data models are large and hard to deploy at the edge because IoT edge devices have limited computing resources. The trade-off is that large amounts of data travel miles and miles of distance, unfiltered and un-analyzed, until the backend systems have the bandwidth to process them. This defeats the spirit of getting business insights for agility in real time, not to mention the high cost of data transfer and ingestion in the backend data center.
Lack of security measures at the edge reduces the accuracy of insights. Current IoT practice also secures only the backend, while security threats can be injected from edge devices. Data can be altered and viruses can be injected during the long period of data transfer. How accurate can the insights be when data integrity is not preserved?

The good news is that H2O can help with:
Pattern-based models. H2O detects patterns in the data with distributed Machine Learning algorithms, instead of depending on pre-established rules. It has been proven in many use cases that H2O’s AI engine can find dozens more patterns than humans are able to discover. Patterns can also change over time and H2O models can be continuously retrained to yield more and better insights.
Fast and easy model deployment with small footprint. The H2O Open Source Machine Learning Platform creates data models, with a minimal footprint, that can score events and make predictions in nanoseconds. The data models are Java-based and can be deployed anywhere with a JVM, or even as a web service. Models can easily be deployed at the IoT edge to yield real-time business and IT insights.
Enabling security measures at the edge. AI is particularly adept at finding and establishing patterns, especially when it’s fed huge amounts of data. Security loopholes and threats take on new forms all the time. H2O models can easily adapt as data show new patterns of security threats. Deploying these adaptive models at the edge means that threats can be blocked early on, before they’re able to cause damage throughout the system.

There are many advantages in enabling analytics at the IoT edge. Using H2O will be crucial in this endeavor. Many industry experts are already moving in this direction. What are you waiting for?

H2O + TensorFlow on AWS GPU

TensorFlow on AWS GPU instance
In this tutorial, we show how to set up TensorFlow on an AWS GPU instance and run the H2O TensorFlow deep learning demo.

Pre-requisites:
To get started, request an AWS EC2 instance with GPU support. We used a single g2.2xlarge instance running Ubuntu 14.04. To set up TensorFlow with GPU support, the following software should be installed:

  1. Java 1.8
  2. Python pip
  3. Unzip utility
  4. CUDA Toolkit (>= v7.0)
  5. cuDNN (v4.0)
  6. Bazel (>= v0.2)
  7. TensorFlow (v0.9)

To run the H2O TensorFlow deep learning demo, the following software should be installed:

  1. IPython notebook
  2. Scala
  3. Spark
  4. Sparkling water

Software Installation:
Java:


#To install Java, follow the steps below (type ‘Y’ at the installation prompt):
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update 
sudo apt-get install oracle-java8-installer
#Update JAVA_HOME in ~/.bashrc

#Add JAVA_HOME to PATH: 
export PATH=$PATH:$JAVA_HOME/bin 

# Execute following command to update current session: 
source ~/.bashrc 

#Verify version and path: 
java -version 
echo $JAVA_HOME

Python:


#AWS EC2 instance has Python installed by default. Verify if Python 2.7 is installed already:
python -V 

#Install pip 
sudo apt-get install python-pip 

#Install IPython notebook 
sudo pip install "ipython[notebook]" 

#To run H2O example notebooks, execute following commands: 
sudo pip install requests 
sudo pip install tabulate 

Unzip utility:


#Execute following command to install unzip
sudo apt-get install unzip

Scala:


#Follow the steps below (type ‘Y’ at the installation prompt):
sudo apt-get install scala 

#Update SCALA_HOME in ~/.bashrc and execute following command to update current session: 
source ~/.bashrc 

#Verify version and path: 
scala -version 
echo $SCALA_HOME 

Spark:


#Java and Scala should be installed before installing Spark. 
#Get latest version of Spark binary:
wget http://apache.cs.utah.edu/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz

#Extract the file: 
tar xvzf spark-1.6.1-bin-hadoop2.6.tgz 

#Update SPARK_HOME in ~/.bashrc and execute following command to update current session: 
source ~/.bashrc 

#Add SPARK_HOME to PATH: 
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin 

#Verify the variables: 
echo $SPARK_HOME

Sparkling Water:


#Latest Spark pre-built for Hadoop should be installed and point SPARK_HOME to it:
export SPARK_HOME="/path/to/spark/installation"

#To launch a local Spark cluster with 3 worker nodes with 2 cores and 1g per node, export MASTER variable
export MASTER="local-cluster[3,2,1024]"

#Download and run Sparkling Water
wget http://h2o-release.s3.amazonaws.com/sparkling-water/rel-1.6/5/sparkling-water-1.6.5.zip
unzip sparkling-water-1.6.5.zip
cd sparkling-water-1.6.5
bin/sparkling-shell --conf "spark.executor.memory=1g"

CUDA Toolkit:


#In order to build or run TensorFlow with GPU support, both NVIDIA’s Cuda Toolkit (>= 7.0) and cuDNN (>= v2) need to be installed. 
#To install CUDA toolkit, run:
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1410/x86_64/cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install cuda

cuDNN:

 
#To install cuDNN, download a file named cudnn-7.0-linux-x64-v4.0-prod.tgz after filling NVIDIA questionnaire. 
#You need to transfer it to your EC2 instance’s home directory.
tar -zxf cudnn-7.0-linux-x64-v4.0-prod.tgz &&
rm cudnn-7.0-linux-x64-v4.0-prod.tgz
sudo cp -R cuda/lib64 /usr/local/cuda/lib64 
sudo cp ~/cuda/include/cudnn.h /usr/local/cuda

#Reboot the system 
sudo reboot

#Update environment variables as shown below:
export CUDA_HOME=/usr/local/cuda 
export CUDA_ROOT=/usr/local/cuda 
export PATH=$PATH:$CUDA_ROOT/bin 
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64

Bazel:


#To install Bazel (>= v0.2), run:
sudo apt-get install pkg-config zip g++ zlib1g-dev
wget https://github.com/bazelbuild/bazel/releases/download/0.3.0/bazel-0.3.0-installer-linux-x86_64.sh
chmod +x bazel-0.3.0-installer-linux-x86_64.sh
./bazel-0.3.0-installer-linux-x86_64.sh --user

TensorFlow:


#Download and install TensorFlow:
wget https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl
sudo pip install --upgrade tensorflow-0.9.0rc0-cp27-none-linux_x86_64.whl 

#Configure TF with GPU support enabled using: 
./configure

To build TensorFlow, run:


bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-0.8.0-py2-none-any.whl

Run H2O Tensorflow Deep learning demo:


#Since, we want to open IPython notebook remotely, we will use IP and port option. To start TensorFlow notebook:
cd sparkling-water-1.6.5/ 
IPYTHON_OPTS="notebook --no-browser --ip='*' --port=54321" bin/pysparkling

#Note that the port specified in the above command should be open on the system.
#Open http://PublicIP:8888 in a browser to start the IPython notebook console.
#Click on TensorFlowDeepLearning.ipynb
#Refer to this video for demo details.

#Sample .bashrc contents:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SCALA_HOME=/usr/share/java
export SPARK_HOME=/home/ubuntu/spark-1.6.1-bin-hadoop2.6
export MASTER="local-cluster[3,2,1024]"
export PATH=$PATH:$JAVA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/java-8-oracle/bin:/home/ubuntu/spark-1.6.1-bin-hadoop2.6/bin:/home/ubuntu/spark-1.6.1-bin-hadoop2.6/sbin:/usr/local/cuda/bin:/home/ubuntu/bin
export LD_LIBRARY_PATH=:/usr/local/cuda/lib64

Troubleshooting:
1) ERROR: Getting java.net.UnknownHostException while starting spark-shell
Solution:
Make sure /etc/hosts has entry for hostname.
Eg: 127.0.0.1 hostname

2) ERROR: Getting Could not find .egg-info directory in install record error during IPython installation
Solution:

sudo pip install --upgrade setuptools pip

3) ERROR: Can’t find swig while configuring TF
Solution:

sudo apt-get install swig

4) ERROR: “Ignoring gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5”
Solution:
Specify 3.0 as the Cuda compute capability when prompted while configuring TF.
Please note that each additional compute capability significantly increases your build time and binary size.

5) ERROR: Could not insert ‘nvidia_352’: Unknown symbol in module, or unknown parameter (see dmesg)
Solution:

sudo apt-get install linux-image-extra-virtual

6) ERROR: Cannot find ‘./util/python/python_include’
Solution:

sudo apt-get install python-dev

7) Find Public IP address of system
Solution:

curl http://169.254.169.254/latest/meta-data/public-ipv4

Demo Videos

H2O GBM Tuning Tutorial for R

In this tutorial, we show how to build a well-tuned H2O GBM model for a supervised classification task. We specifically don’t focus on feature engineering and use a small dataset to allow you to reproduce these results in a few minutes on a laptop. This script can be directly transferred to datasets that are hundreds of GBs large and H2O clusters with dozens of compute nodes.

This tutorial is written in R Markdown. You can download the source from H2O’s GitHub repository.

A port to a Python Jupyter Notebook version is available as well.

Installation of the H2O R Package

Either download H2O from H2O.ai’s website or install the latest version of H2O into R with the following R code:

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/8/R")))

Launch an H2O cluster on localhost

library(h2o)
h2o.init(nthreads=-1)
## optional: connect to a running H2O cluster
#h2o.init(ip="mycluster", port=55555) 
Starting H2O JVM and connecting: . Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 248 milliseconds 
    H2O cluster version:        3.8.2.8 
    H2O cluster name:           H2O_started_from_R_arno_wyu958 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.2.2 (2015-08-14)

Import the data into H2O

Everything is scalable and distributed from now on. All processing is done on the fully multi-threaded and distributed H2O Java-based backend and can be scaled to large datasets on large compute clusters.
Here, we use a small public dataset (Titanic), but you can use datasets that are hundreds of GBs large.

## 'path' can point to a local file, hdfs, s3, nfs, Hive, directories, etc.
df <- h2o.importFile(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
dim(df)
head(df)
tail(df)
summary(df,exact_quantiles=TRUE)

## pick a response for the supervised problem
response <- "survived"

## the response variable is an integer, we will turn it into a categorical/factor for binary classification
df[[response]] <- as.factor(df[[response]])           

## use all other columns (except for the name) as predictors
predictors <- setdiff(names(df), c(response, "name")) 
> summary(df,exact_quantiles=TRUE)
 pclass          survived        name sex         age               sibsp            parch           ticket            fare              cabin                 embarked
 Min.   :1.000   Min.   :0.000        male  :843  Min.   : 0.1667   Min.   :0.0000   Min.   :0.000   Min.   :    680   Min.   :  0.000   C23 C25 C27    :   6  S :914  
 1st Qu.:2.000   1st Qu.:0.000        female:466  1st Qu.:21.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:  19950   1st Qu.:  7.896   B57 B59 B63 B66:   5  C :270  
 Median :3.000   Median :0.000                    Median :28.0000   Median :0.0000   Median :0.000   Median : 234604   Median : 14.454   G6             :   5  Q :123  
 Mean   :2.295   Mean   :0.382                    Mean   :29.8811   Mean   :0.4989   Mean   :0.385   Mean   : 249039   Mean   : 33.295   B96 B98        :   4  NA:  2  
 3rd Qu.:3.000   3rd Qu.:1.000                    3rd Qu.:39.0000   3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.: 347468   3rd Qu.: 31.275   C22 C26        :   4          
 Max.   :3.000   Max.   :1.000                    Max.   :80.0000   Max.   :8.0000   Max.   :9.000   Max.   :3101298   Max.   :512.329   C78            :   4          
                                                  NA's   :263                                        NA's   :352       NA's   :1         NA             :1014          
 boat             body            home.dest                
 Min.   : 1.000   Min.   :  1.0   New York  NY        : 64 
 1st Qu.: 5.000   1st Qu.: 72.0   London              : 14 
 Median :10.000   Median :155.0   Montreal  PQ        : 10 
 Mean   : 9.405   Mean   :160.8   Cornwall / Akron  OH:  9 
 3rd Qu.:13.000   3rd Qu.:256.0   Paris  France       :  9 
 Max.   :16.000   Max.   :328.0   Philadelphia  PA    :  8 
 NA's   :911      NA's   :1188    NA                  :564 

From now on, everything is generic and directly applies to most datasets. We assume that all feature engineering is done at this stage and focus on model tuning. For multi-class problems, you can use h2o.logloss() or h2o.confusionMatrix() instead of h2o.auc() and for regression problems, you can use h2o.deviance() or h2o.mse().

Split the data for Machine Learning

We split the data into three pieces: 60% for training, 20% for validation, 20% for final testing.
Here, we use random splitting, but this assumes i.i.d. data. If this is not the case (e.g., when events span across multiple rows or data has a time structure), you’ll have to sample your data non-randomly.

splits <- h2o.splitFrame(
  data = df, 
  ratios = c(0.6,0.2),   ## only need to specify 2 fractions, the 3rd is implied
  destination_frames = c("train.hex", "valid.hex", "test.hex"), seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test  <- splits[[3]]

Establish baseline performance

As the first step, we’ll build some default models to see what accuracy we can expect. Let’s use the AUC metric for this demo, but you can use h2o.logloss and stopping_metric="logloss" as well. It ranges from 0.5 for random models to 1 for perfect models.

The first model is a default GBM, trained on the 60% training split

## We only provide the required parameters, everything else is default
gbm <- h2o.gbm(x = predictors, y = response, training_frame = train)

## Show a detailed model summary
gbm

## Get the AUC on the validation set
h2o.auc(h2o.performance(gbm, newdata = valid)) 

The AUC is over 94%, so this model is highly predictive!

[1] 0.9431953

The second model is another default GBM, but trained on 80% of the data (here, we combine the training and validation splits to get more training data), and cross-validated using 4 folds.
Note that cross-validation takes longer and is not usually done for really large datasets.

## h2o.rbind makes a copy here, so it's better to use splitFrame with `ratios = c(0.8)` instead above
gbm <- h2o.gbm(x = predictors, y = response, training_frame = h2o.rbind(train, valid), nfolds = 4, seed = 0xDECAF)

## Show a detailed summary of the cross validation metrics
## This gives you an idea of the variance between the folds
gbm@model$cross_validation_metrics_summary

## Get the cross-validated AUC by scoring the combined holdout predictions.
## (Instead of taking the average of the metrics across the folds)
h2o.auc(h2o.performance(gbm, xval = TRUE))

We see that the cross-validated performance is similar to the validation set performance:

[1] 0.9403432

Next, we train a GBM with “I feel lucky” parameters.
We’ll use early stopping to automatically tune the number of trees using the validation AUC.
We’ll use a lower learning rate (lower is always better, just takes more trees to converge).
We’ll also use stochastic sampling of rows and columns to (hopefully) improve generalization.

gbm <- h2o.gbm(
  ## standard model parameters
  x = predictors, 
  y = response, 
  training_frame = train, 
  validation_frame = valid,
  
  ## more trees is better if the learning rate is small enough 
  ## here, use "more than enough" trees - we have early stopping
  ntrees = 10000,                                                            
  
  ## smaller learning rate is better (this is a good value for most datasets, but see below for annealing)
  learn_rate=0.01,                                                         
  
  ## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
  stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC", 
  
  ## sample 80% of rows per tree
  sample_rate = 0.8,                                                       

  ## sample 80% of columns per split
  col_sample_rate = 0.8,                                                   

  ## fix a random number generator seed for reproducibility
  seed = 1234,                                                             
  
  ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
  score_tree_interval = 10                                                 
)

## Get the AUC on the validation set
h2o.auc(h2o.performance(gbm, valid = TRUE))

This model doesn’t seem to be much better than the previous models:

[1] 0.939335

For this small dataset, dropping 20% of observations per tree seems too aggressive in terms of adding regularization. For larger datasets, this is usually not a bad idea. But we’ll tune this parameter again in the grid search below, so no worries.

Note: To see what other stopping_metric parameters you can specify, simply pass an invalid option:

gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, stopping_metric = "yada")
Error in .h2o.checkAndUnifyModelParameters(algo = algo, allParams = ALL_PARAMS,  : 
  "stopping_metric" must be in "AUTO", "deviance", "logloss", "MSE", "AUC", 
  "lift_top_group", "r2", "misclassification", but got yada

Hyper-Parameter Search

Next, we’ll do real hyper-parameter optimization to see if we can beat the best AUC so far (around 94%).

The key here is to start tuning some key parameters first (i.e., those that we expect to have the biggest impact on the results). From experience with gradient boosted trees across many datasets, we can state the following “rules”:

  1. Build as many trees (ntrees) as it takes until the validation set error starts increasing.
  2. A lower learning rate (learn_rate) is generally better, but will require more trees. Using learn_rate=0.02 and learn_rate_annealing=0.995 (reduction of learning rate with each additional tree) can help speed up convergence without sacrificing accuracy too much, and is great for hyper-parameter searches. For faster scans, use values of 0.05 and 0.99 instead.
  3. The optimum maximum allowed depth for the trees (max_depth) is data dependent; deeper trees take longer to train, especially at depths greater than 10.
  4. Row and column sampling (sample_rate and col_sample_rate) can improve generalization and lead to lower validation and test set errors. Good general values for large datasets are around 0.7 to 0.8 (sampling 70-80 percent of the data) for both parameters. Column sampling per tree (col_sample_rate_per_tree) can also be tuned. Note that it is multiplicative with col_sample_rate, so setting both parameters to 0.8 results in 64% of columns being considered at any given node to split.
  5. For highly imbalanced classification datasets (e.g., fewer buyers than non-buyers), stratified row sampling based on response class membership can help improve predictive accuracy. It is configured with sample_rate_per_class (array of ratios, one per response class in lexicographic order).
  6. Most other options only have a small impact on the model performance, but are worth tuning with a Random hyper-parameter search nonetheless, if highest performance is critical.
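
To make the arithmetic in points 2 and 4 concrete, here is a quick numeric illustration (plain R; the rates are just the example values mentioned above):

## effective learning rate of the n-th tree with annealing: learn_rate * learn_rate_annealing^(n-1)
learn_rate <- 0.02
learn_rate_annealing <- 0.995
learn_rate * learn_rate_annealing^(500 - 1)    ## ~0.0016 by the 500th tree

## per-tree and per-split column sampling are multiplicative
col_sample_rate_per_tree <- 0.8
col_sample_rate <- 0.8
col_sample_rate_per_tree * col_sample_rate     ## 0.64, i.e. ~64% of columns considered at a given split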

First we want to know what value of max_depth to use, because it has a big impact on model training time and the optimal value depends strongly on the dataset.
We’ll do a quick Cartesian grid search to get a rough idea of good candidate max_depth values. Each model in the grid search will use early stopping to tune the number of trees using the validation set AUC, as before.
We’ll use learning rate annealing to speed up convergence without sacrificing too much accuracy.

## Depth 10 is usually plenty of depth for most datasets, but you never know
hyper_params = list( max_depth = seq(1,29,2) )
#hyper_params = list( max_depth = c(4,6,8,12,16,20) ) ##faster for larger datasets

grid <- h2o.grid(
  ## hyper parameters
  hyper_params = hyper_params,
  
  ## full Cartesian hyper-parameter search
  search_criteria = list(strategy = "Cartesian"),
  
  ## which algorithm to run
  algorithm="gbm",
  
  ## identifier for the grid, to later retrieve it
  grid_id="depth_grid",
  
  ## standard model parameters
  x = predictors, 
  y = response, 
  training_frame = train, 
  validation_frame = valid,
  
  ## more trees is better if the learning rate is small enough 
  ## here, use "more than enough" trees - we have early stopping
  ntrees = 10000,                                                            
  
  ## smaller learning rate is better
  ## since we have learning_rate_annealing, we can afford to start with a bigger learning rate
  learn_rate = 0.05,                                                         
  
  ## learning rate annealing: learning_rate shrinks by 1% after every tree 
  ## (use 1.00 to disable, but then lower the learning_rate)
  learn_rate_annealing = 0.99,                                               
  
  ## sample 80% of rows per tree
  sample_rate = 0.8,                                                       

  ## sample 80% of columns per split
  col_sample_rate = 0.8, 
  
  ## fix a random number generator seed for reproducibility
  seed = 1234,                                                             
  
  ## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "AUC", 
  
  ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
  score_tree_interval = 10                                                
)

## by default, display the grid search results sorted by increasing logloss (since this is a classification task)
grid                                                                       

## sort the grid models by decreasing AUC
sortedGrid <- h2o.getGrid("depth_grid", sort_by="auc", decreasing = TRUE)    
sortedGrid

## find the range of max_depth for the top 5 models
topDepths = sortedGrid@summary_table$max_depth[1:5]                       
minDepth = min(as.numeric(topDepths))
maxDepth = max(as.numeric(topDepths))
> sortedGrid
H2O Grid Details
================

Grid ID: depth_grid 
Used hyper parameters: 
  -  max_depth 
Number of models: 15 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   max_depth           model_ids               auc
1         27 depth_grid_model_13  0.95657931811778
2         25 depth_grid_model_12 0.956353902507749
3         29 depth_grid_model_14 0.956241194702733
4         21 depth_grid_model_10 0.954663285432516
5         19  depth_grid_model_9 0.954494223724993
6         13  depth_grid_model_6 0.954381515919978
7         23 depth_grid_model_11 0.954043392504931
8         11  depth_grid_model_5 0.952183713722175
9         15  depth_grid_model_7 0.951789236404621
10        17  depth_grid_model_8 0.951507466892082
11         9  depth_grid_model_4 0.950436742744435
12         7  depth_grid_model_3 0.946942800788955
13         5  depth_grid_model_2 0.939306846999155
14         3  depth_grid_model_1 0.932713440405748
15         1  depth_grid_model_0  0.92902225979149

It appears that max_depth values of 19 to 29 are best suited for this dataset, which is unusually deep!

> minDepth
[1] 19
> maxDepth
[1] 29

Now that we know a good range for max_depth, we can tune all other parameters in more detail. Since we don’t know what combinations of hyper-parameters will result in the best model, we’ll use random hyper-parameter search to “let the machine get luckier than a best guess of any human”.

hyper_params = list( 
  ## restrict the search to the range of max_depth established above
  max_depth = seq(minDepth,maxDepth,1),                                      
  
  ## search a large space of row sampling rates per tree
  sample_rate = seq(0.2,1,0.01),                                             
  
  ## search a large space of column sampling rates per split
  col_sample_rate = seq(0.2,1,0.01),                                         
  
  ## search a large space of column sampling rates per tree
  col_sample_rate_per_tree = seq(0.2,1,0.01),                                
  
  ## search a large space of how column sampling per split should change as a function of the depth of the split
  col_sample_rate_change_per_level = seq(0.9,1.1,0.01),                      
  
  ## search a large space of the number of min rows in a terminal node
  min_rows = 2^seq(0,log2(nrow(train))-1,1),                                 
  
  ## search a large space of the number of bins for split-finding for continuous and integer columns
  nbins = 2^seq(4,10,1),                                                     
  
  ## search a large space of the number of bins for split-finding for categorical columns
  nbins_cats = 2^seq(4,12,1),                                                
  
  ## search a few minimum required relative error improvement thresholds for a split to happen
  min_split_improvement = c(0,1e-8,1e-6,1e-4),                               
  
  ## try all histogram types (QuantilesGlobal and RoundRobin are good for numeric columns with outliers)
  histogram_type = c("UniformAdaptive","QuantilesGlobal","RoundRobin")       
)

search_criteria = list(
  ## Random grid search
  strategy = "RandomDiscrete",      
  
  ## limit the runtime to 60 minutes
  max_runtime_secs = 3600,         
  
  ## build no more than 100 models
  max_models = 100,                  
  
  ## random number generator seed to make sampling of parameter combinations reproducible
  seed = 1234,                        
  
  ## early stopping once the leaderboard of the top 5 models is converged to 0.1% relative difference
  stopping_rounds = 5,                
  stopping_metric = "AUC",
  stopping_tolerance = 1e-3
)

grid <- h2o.grid(
  ## hyper parameters
  hyper_params = hyper_params,
  
  ## hyper-parameter search configuration (see above)
  search_criteria = search_criteria,
  
  ## which algorithm to run
  algorithm = "gbm",
  
  ## identifier for the grid, to later retrieve it
  grid_id = "final_grid", 
  
  ## standard model parameters
  x = predictors, 
  y = response, 
  training_frame = train, 
  validation_frame = valid,
  
  ## more trees is better if the learning rate is small enough
  ## use "more than enough" trees - we have early stopping
  ntrees = 10000,                                                            
  
  ## smaller learning rate is better
  ## since we have learning_rate_annealing, we can afford to start with a bigger learning rate
  learn_rate = 0.05,                                                         
  
  ## learning rate annealing: learning_rate shrinks by 1% after every tree 
  ## (use 1.00 to disable, but then lower the learning_rate)
  learn_rate_annealing = 0.99,                                               
  
  ## early stopping based on timeout (no model should take more than 1 hour - modify as needed)
  max_runtime_secs = 3600,                                                 
  
  ## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
  stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC", 
  
  ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
  score_tree_interval = 10,                                                
  
  ## base random number generator seed for each model (automatically gets incremented internally for each model)
  seed = 1234                                                             
)

## Sort the grid models by AUC
sortedGrid <- h2o.getGrid("final_grid", sort_by = "auc", decreasing = TRUE)    
sortedGrid

We can see that the best models have even better validation AUCs than our previous best models, so the random grid search was successful!

Hyper-Parameter Search Summary: ordered by decreasing auc
  col_sample_rate col_sample_rate_change_per_level col_sample_rate_per_tree  histogram_type max_depth
1            0.49                             1.04                     0.94 QuantilesGlobal        28
2            0.92                             0.93                     0.56 QuantilesGlobal        27
3            0.35                             1.09                     0.83 QuantilesGlobal        29
4            0.42                             0.98                     0.53 UniformAdaptive        24
5             0.7                             1.02                     0.56 UniformAdaptive        25
  min_rows min_split_improvement nbins nbins_cats sample_rate           model_ids               auc
1        2                     0    32        256        0.86 final_grid_model_68 0.974049027895182
2        4                     0   128        128        0.93 final_grid_model_96 0.971400394477318
3        4                 1e-08    64        128        0.69 final_grid_model_38 0.968864468864469
4        1                 1e-04    64         16        0.69 final_grid_model_55 0.967793744716822
5        2                 1e-08    32        256        0.34 final_grid_model_22 0.966553958861651

We can inspect the best 5 models from the grid search explicitly, and query their validation AUC:

for (i in 1:5) {
  gbm <- h2o.getModel(sortedGrid@model_ids[[i]])
  print(h2o.auc(h2o.performance(gbm, valid = TRUE)))
}
[1] 0.974049
[1] 0.9714004
[1] 0.9688645
[1] 0.9677937
[1] 0.966554

You can also see the results of the grid search in Flow.

Model Inspection and Final Test Set Scoring

Let’s see how well the best model of the grid search (as judged by validation set AUC) does on the held out test set:

gbm <- h2o.getModel(sortedGrid@model_ids[[1]])
print(h2o.auc(h2o.performance(gbm, newdata = test)))

Good news. It does as well on the test set as on the validation set, so it looks like our best GBM model generalizes well to the unseen test set:

[1] 0.9712568

We can inspect the winning model’s parameters:

gbm@parameters
> gbm@parameters
$model_id
[1] "final_grid_model_68"

$training_frame
[1] "train.hex"

$validation_frame
[1] "valid.hex"

$score_tree_interval
[1] 10

$ntrees
[1] 10000

$max_depth
[1] 28

$min_rows
[1] 2

$nbins
[1] 32

$nbins_cats
[1] 256

$stopping_rounds
[1] 5

$stopping_metric
[1] "AUC"

$stopping_tolerance
[1] 1e-04

$max_runtime_secs
[1] 3414.017

$seed
[1] 1234

$learn_rate
[1] 0.05

$learn_rate_annealing
[1] 0.99

$distribution
[1] "bernoulli"

$sample_rate
[1] 0.86

$col_sample_rate
[1] 0.49

$col_sample_rate_change_per_level
[1] 1.04

$col_sample_rate_per_tree
[1] 0.94

$histogram_type
[1] "QuantilesGlobal"

$x
 [1] "pclass"    "sex"       "age"       "sibsp"     "parch"     "ticket"    "fare"      "cabin"    
 [9] "embarked"  "boat"      "body"      "home.dest"

$y
[1] "survived"

Now we can confirm that these parameters are generally sound by building a GBM model on the whole dataset (instead of the 60% training split) and using internal 5-fold cross-validation (re-using all other parameters including the seed):

model <- do.call(h2o.gbm,
        ## update parameters in place
        {
          p <- gbm@parameters
          p$model_id = NULL          ## do not overwrite the original grid model
          p$training_frame = df      ## use the full dataset
          p$validation_frame = NULL  ## no validation frame
          p$nfolds = 5               ## cross-validation
          p
        }
)
model@model$cross_validation_metrics_summary
> model@model$cross_validation_metrics_summary
Cross-Validation Metrics Summary: 
                               mean           sd cv_1_valid  cv_2_valid  cv_3_valid  cv_4_valid cv_5_valid
F0point5                  0.9082877  0.017469764  0.9448819  0.87398374   0.8935743   0.9034908  0.9255079
F1                        0.8978795  0.008511053  0.9099526   0.8820513   0.8989899   0.9119171  0.8864865
F2                        0.8886758  0.016845208  0.8775137  0.89026916   0.9044715  0.92050207  0.8506224
accuracy                  0.9236877  0.004604631 0.92883897   0.9151291  0.92248064  0.93307084  0.9189189
auc                       0.9606385  0.006671454 0.96647465   0.9453869    0.959375  0.97371733 0.95823866
err                     0.076312296  0.004604631 0.07116105 0.084870845  0.07751938  0.06692913 0.08108108
err_count                        20    1.4142135         19          23          20          17         21
lift_top_group            2.6258688  0.099894695  2.3839285   2.8229167    2.632653   2.6736841  2.6161616
logloss                  0.23430987  0.019006629 0.23624699  0.26165685  0.24543843  0.18311584 0.24509121
max_per_class_error      0.11685239  0.025172591 0.14285715 0.104166664 0.091836736  0.07368421 0.17171717
mcc                       0.8390522  0.011380583  0.8559271  0.81602895  0.83621955   0.8582395  0.8288459
mean_per_class_accuracy  0.91654545 0.0070778215   0.918894   0.9107738  0.91970664   0.9317114  0.9016414
mean_per_class_error     0.08345456 0.0070778215 0.08110599 0.089226194 0.080293365  0.06828865 0.09835859
mse                      0.06535896  0.004872401 0.06470373   0.0717801   0.0669676 0.052562267 0.07078109
precision                 0.9159663   0.02743855   0.969697  0.86868685        0.89   0.8979592 0.95348835
r2                        0.7223932  0.021921812  0.7342935  0.68621415   0.7157123   0.7754977 0.70024836
recall                    0.8831476  0.025172591 0.85714287   0.8958333  0.90816325   0.9263158 0.82828283
specificity              0.94994324  0.016345335  0.9806452   0.9257143     0.93125   0.9371069      0.975

Ouch! It looks like we overfit quite a bit on the validation set: the mean AUC across the 5 folds is “only” 96.06% +/- 0.67%, so we cannot always expect AUCs of 97% with these parameters on this dataset. To get a better estimate of model performance, the random hyper-parameter search could have used nfolds = 5 (or 10, or similar) in combination with 80% of the data for training (i.e., not holding out a validation set, only the final test set). However, this would take more time, as nfolds+1 models are built for every set of parameters. A rough sketch of that alternative setup is shown below.
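
A rough sketch of that alternative setup (not run here; splits2 and grid_cv are hypothetical names, and it re-uses the hyper_params and search_criteria defined above):

## 80% for training (with 5-fold cross-validation), 20% held out as the final test set
splits2 <- h2o.splitFrame(df, ratios = 0.8, seed = 1234)
grid_cv <- h2o.grid(
  algorithm = "gbm", grid_id = "cv_grid",
  x = predictors, y = response,
  training_frame = splits2[[1]],
  nfolds = 5,                          ## cross-validation replaces the validation frame
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  ntrees = 10000, learn_rate = 0.05, learn_rate_annealing = 0.99,
  stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
  score_tree_interval = 10, seed = 1234
)
sortedGridCV <- h2o.getGrid("cv_grid", sort_by = "auc", decreasing = TRUE)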

Instead, to save time, let’s just scan through the top 5 models and cross-validate their parameters with nfolds=5 on the entire dataset:

for (i in 1:5) {
  gbm <- h2o.getModel(sortedGrid@model_ids[[i]])
  cvgbm <- do.call(h2o.gbm,
        ## update parameters in place
        {
          p <- gbm@parameters
          p$model_id = NULL          ## do not overwrite the original grid model
          p$training_frame = df      ## use the full dataset
          p$validation_frame = NULL  ## no validation frame
          p$nfolds = 5               ## cross-validation
          p
        }
  )
  print(gbm@model_id)
  print(cvgbm@model$cross_validation_metrics_summary[5,]) ## Pick out the "AUC" row
}
[1] "final_grid_model_68"
Cross-Validation Metrics Summary: 
         mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
auc 0.9606385 0.006671454 0.96647465  0.9453869   0.959375 0.97371733 0.95823866
[1] "final_grid_model_96"
Cross-Validation Metrics Summary: 
          mean           sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
auc 0.96491456 0.0052218214  0.9631913  0.9597024  0.9742985  0.9723933 0.95498735
[1] "final_grid_model_38"
Cross-Validation Metrics Summary: 
         mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
auc 0.9638506 0.004603204 0.96134794  0.9573512   0.971301 0.97192985 0.95732325
[1] "final_grid_model_55"
Cross-Validation Metrics Summary: 
         mean           sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
auc 0.9657447 0.0062724343  0.9562212 0.95428574  0.9686862 0.97490895 0.97462124
[1] "final_grid_model_22"
Cross-Validation Metrics Summary: 
         mean           sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
auc 0.9648925 0.0065437974 0.96633065 0.95285714  0.9557398  0.9736511 0.97588384

The avid reader might have noticed that we just implicitly did further parameter tuning using the “final” test set (which is part of the entire dataset df), which is not good practice – one is not supposed to use the “final” test set more than once. Hence, we’re not going to pick a different “best” model, but we’re just learning about the variance in AUCs. It turns out, for this tiny dataset, that the variance is rather large, which is not surprising.

Keeping the same “best” model, we can make test set predictions as follows:

gbm <- h2o.getModel(sortedGrid@model_ids[[1]])
preds <- h2o.predict(gbm, test)
head(preds)
gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores

Note that the label (survived or not) is predicted as well (in the first predict column), using the threshold that maximizes the F1 score (here: 0.528098) to turn the survival probabilities (p1) into labels. The probability of death (p0) is given for convenience, as it is simply 1-p1.

> head(preds)
  predict         p0         p1
1       0 0.98055935 0.01944065
2       0 0.98051200 0.01948800
3       0 0.81430963 0.18569037
4       1 0.02121241 0.97878759
5       1 0.02528104 0.97471896
6       0 0.92056020 0.07943980
> gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores
Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.528098 0.920792  96
2                       max f2  0.170853 0.926966 113
3                 max f0point5  0.767931 0.959488  90
4                 max accuracy  0.767931 0.941606  90
5                max precision  0.979449 1.000000   0
6                   max recall  0.019425 1.000000 206
7              max specificity  0.979449 1.000000   0
8             max absolute_MCC  0.767931 0.878692  90
9   max min_per_class_accuracy  0.204467 0.928994 109
10 max mean_per_class_accuracy  0.252473 0.932319 106
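
If you want to apply such a threshold yourself, a minimal sketch looks like this (h2o.find_threshold_by_max_metric is assumed to be available in your H2O version):

perf <- h2o.performance(gbm, valid = TRUE)
f1_threshold <- h2o.find_threshold_by_max_metric(perf, "f1")   ## ~0.528098 here
manual_labels <- preds$p1 > f1_threshold                       ## 0/1 labels, same as the predict column
head(manual_labels)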

You can also see the “best” model in more detail in Flow.

The model and the predictions can be saved to file as follows:

h2o.saveModel(gbm, "/tmp/bestModel.csv", force=TRUE)
h2o.exportFile(preds, "/tmp/bestPreds.csv", force=TRUE)
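
A small usage note: h2o.saveModel returns the exact path it wrote to, which can later be passed to h2o.loadModel to restore the binary model:

model_path   <- h2o.saveModel(gbm, "/tmp/bestModel.csv", force = TRUE)
gbm_reloaded <- h2o.loadModel(model_path)
h2o.auc(h2o.performance(gbm_reloaded, newdata = test))   ## same test set AUC as before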

The model can also be exported as a plain old Java object (POJO) for H2O-independent (standalone/Storm/Kafka/UDF) scoring in any Java environment.

h2o.download_pojo(gbm)
/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2016-06-02T17:06:34.382-07:00
  3.9.1.99999
  
  Standalone prediction code with sample test data for GBMModel named final_grid_model_68

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http://172.16.2.75:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http://172.16.2.75:54321/3/Models.java/final_grid_model_68 > final_grid_model_68.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m final_grid_model_68.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;

...
class final_grid_model_68_Tree_0_class_0 {
  static final double score0(double[] data) {
    double pred =      (data[9 /* boat */] <14.003472f ? 
         (!Double.isNaN(data[9]) && data[9 /* boat */] != 12.0f ? 
            0.13087687f : 
             (data[3 /* sibsp */] <7.3529413E-4f ? 
                0.13087687f : 
                0.024317414f)) : 
         (data[5 /* ticket */] <2669.5f ? 
             (data[5 /* ticket */] <2665.5f ? 
                 (data[10 /* body */] <287.5f ? 
                    -0.08224204f : 
                     (data[2 /* age */] <14.2421875f ? 
                        0.13087687f : 
                         (data[4 /* parch */] <4.892368E-4f ? 
                             (data[6 /* fare */] <39.029896f ? 
                                 (data[1 /* sex */] <0.5f ? 
                                     (data[5 /* ticket */] <2659.5f ? 
                                        0.13087687f : 
                                        -0.08224204f) : 
                                    -0.08224204f) : 
                                0.08825309f) : 
                            0.13087687f))) : 
                0.13087687f) : 
             (data[9 /* boat */] <15.5f ? 
                0.13087687f : 
                 (!GenModel.bitSetContains(GRPSPLIT0, 42, data[7 
...
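
Instead of printing the POJO to the console, you can also write it to disk; a sketch (the path argument should be supported by recent H2O versions):

h2o.download_pojo(gbm, path = "/tmp")   ## writes final_grid_model_68.java (and, depending on the version, h2o-genmodel.jar) to /tmp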

Ensembling Techniques

After learning above that the variance of the test set AUC of the top few models was rather large, we might be able to turn this to our advantage by using ensembling techniques. The simplest one is taking the average of the predictions (survival probabilities) of the top k grid search models (here, we use k=10):

prob = NULL
k=10
for (i in 1:k) {
  gbm <- h2o.getModel(sortedGrid@model_ids[[i]])
  if (is.null(prob)) prob = h2o.predict(gbm, test)$p1
  else prob = prob + h2o.predict(gbm, test)$p1
}
prob <- prob/k
head(prob)

We now have a blended probability of survival for each person on the Titanic.

> head(prob)
          p1
1 0.02258923
2 0.01615957
3 0.15837298
4 0.98565663
5 0.98792208
6 0.17941366

We can bring those ensemble predictions into our R session’s memory and use other R packages.

probInR  <- as.vector(prob)
labelInR <- as.vector(as.numeric(test[[response]]))
if (! ("cvAUC" %in% rownames(installed.packages()))) { install.packages("cvAUC") }
library(cvAUC)
cvAUC::AUC(probInR, labelInR)
[1] 0.977534

This simple blended ensemble test set prediction has an even higher AUC than the best single model, but we need to do more validation studies, ideally using cross-validation. We leave this as an exercise for the reader: take the parameters of the top 10 models, retrain them with nfolds=5 on the full dataset, set keep_cross_validation_predictions=TRUE, average the predicted probabilities from each model’s h2o.getFrame(cvgbm@model$cross_validation_holdout_predictions_frame_id), and then score that with cvAUC as shown above (a rough sketch follows).
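
For the curious, here is a rough, untested sketch of that exercise (it assumes keep_cross_validation_predictions is the parameter name in your H2O version and re-uses the do.call pattern from above):

cvprob <- NULL
k <- 10
for (i in 1:k) {
  gbm <- h2o.getModel(sortedGrid@model_ids[[i]])
  cvgbm <- do.call(h2o.gbm, {
    p <- gbm@parameters
    p$model_id <- NULL                               ## do not overwrite the original grid model
    p$training_frame <- df                           ## use the full dataset
    p$validation_frame <- NULL                       ## no validation frame
    p$nfolds <- 5                                    ## cross-validation
    p$keep_cross_validation_predictions <- TRUE      ## keep the holdout predictions
    p
  })
  holdout <- h2o.getFrame(cvgbm@model$cross_validation_holdout_predictions_frame_id)$p1
  cvprob <- if (is.null(cvprob)) holdout else cvprob + holdout
}
cvprob <- cvprob / k
cvAUC::AUC(as.vector(cvprob), as.vector(as.numeric(df[[response]])))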

For more sophisticated ensembling approaches, such as stacking via a superlearner, we refer to the H2O Ensemble github page.

Summary

We learned how to build H2O GBM models for a binary classification task on a small but realistic dataset with numerical and categorical variables, with the goal of maximizing the AUC (which ranges from 0.5 to 1). We first established a baseline with the default model, then carefully tuned the remaining hyper-parameters without “too much” human guess-work. We used both Cartesian and Random hyper-parameter searches to find good models. We were able to get the AUC on a holdout test set from the low-94% range with the default model to the mid-97% range after tuning, and to the high-97% range with a simple ensembling technique known as blending. We performed a simple cross-validation variance analysis, learned that the results were slightly “lucky” due to the specific train/valid/test set splits, and settled on expecting mid-96% AUCs instead.

Note that this script and the findings therein are directly transferable to large datasets on distributed clusters, including Spark/Hadoop environments.

More information can be found at http://www.h2o.ai/docs/.