XGBoost in H2O Machine Learning Platform


The new H2O release brings a shiny new feature – the integration of the powerful XGBoost library into the H2O Machine Learning Platform!

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.
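To make the boosting idea concrete, here is a tiny, self-contained Python sketch (purely conceptual, not how XGBoost itself is implemented): each stage fits a base learner to the residuals of the current prediction, and for brevity the base learner here is reduced to a constant.

```python
# Conceptual sketch of gradient boosting for squared error:
# each stage fits the residuals of the current ensemble.
def boost(y, n_stages=100, learn_rate=0.1):
    pred = [0.0] * len(y)
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        # A real booster would fit a tree to the residuals; here the
        # base learner is the simplest possible one: the mean residual.
        step = sum(residuals) / len(residuals)
        pred = [pi + learn_rate * step for pi in pred]
    return pred

predictions = boost([1.0, 2.0, 3.0])
```

With a constant base learner, all predictions converge toward the target mean; GBDT replaces the constant with per-leaf tree steps, which is where its accuracy comes from.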

By integrating XGBoost into the H2O Machine Learning Platform, we not only enrich the family of provided algorithms with one of the most powerful machine learning algorithms, but we also expose it with all the nice features of H2O – Python and R APIs, the Flow UI, real-time training progress, and MOJO support.


Let’s quickly try to run XGBoost on the HIGGS dataset from Python. The first step is to get the latest H2O and install the Python library. Please follow the instructions on the H2O download page.

The next step is to download the HIGGS training and validation data. We can use sample datasets stored in S3:

wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
# Or use full data: wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv

Now, it is time to start your favorite Python environment and build some XGBoost models.

The first step involves starting H2O on a single-node cluster:

import h2o
h2o.init()

In the next step, we import and prepare data via the H2O API:

train_path = 'higgs_train_imbalance_100k.csv'
test_path = 'higgs_test_imbalance_100k.csv'

df_train = h2o.import_file(train_path)
df_valid = h2o.import_file(test_path)

# Transform the first feature into a categorical feature
df_train[0] = df_train[0].asfactor()
df_valid[0] = df_valid[0].asfactor()

After data preparation, it is time to build an XGBoost model. Let’s try to train 100 trees with a maximum depth of 10:

param = {
      "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)

At this point, we can use the trained model like any other H2O model, and, for example, use it to generate predictions:

prediction = model.predict(df_valid)[:,2]

Or we can open the H2O Flow UI and explore model properties in a nice, user-friendly way:

Or rebuild the model with different training parameters:

Technical Details

The integration of XGBoost into the H2O Machine Learning Platform utilizes the JNI interface of XGBoost and the corresponding native libraries. H2O wraps all JNI calls and exposes them as regular H2O model and model builder APIs.

The implementation itself is based on two separate modules, which enrich the core H2O platform.

The first module, h2o-genmodel-ext-xgboost, extends the h2o-genmodel module and registers an XGBoost-specific MOJO. The module also contains all necessary XGBoost binary libraries. Right now, the module provides libraries for OS X and Linux; however, Windows support is coming soon.

The module can contain multiple libraries for each platform to support different configurations (e.g., with/without GPU/OMP). H2O always tries to load the most powerful one (currently a library with GPU and OMP support). If that fails, the loader tries the next one in the loader chain. For each platform, we always provide an XGBoost library with a minimal configuration (supporting only a single CPU) that serves as a fallback in case no other library can be loaded.
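The loader-chain behavior can be sketched in a few lines of Python (illustrative only; the real implementation is in Java, and the library names below are hypothetical):

```python
def load_first_available(loaders):
    """Try each loader in priority order and return the first that succeeds."""
    for name, loader in loaders:
        try:
            return name, loader()
        except OSError:
            continue  # fall back to the next, less capable library
    raise RuntimeError("no XGBoost backend could be loaded")

def unavailable():
    raise OSError("missing native dependency")

def minimal_cpu_build():
    return object()  # stands in for a loaded native library

# Hypothetical chain: GPU+OMP first, then OMP-only, then the minimal CPU build.
chain = [
    ("xgboost4j_gpu_omp", unavailable),
    ("xgboost4j_omp", unavailable),
    ("xgboost4j_minimal", minimal_cpu_build),
]
backend, library = load_first_available(chain)
# backend == "xgboost4j_minimal"
```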

The second module, h2o-ext-xgboost, contains the actual XGBoost model and model builder code, which communicates with native XGBoost libraries via the JNI API. The module also provides all necessary REST API definitions to expose XGBoost model builder to clients.

Note: To learn more about the H2O modular architecture, please review our H2O Platform Extensibility blog post.


There are several technical limitations of the current implementation that we are trying to resolve; however, it is necessary to mention them. In general, if XGBoost cannot be initialized for any reason (e.g., an unsupported platform), then the algorithm is not exposed via the REST API and is not available to clients. Clients can verify the availability of XGBoost by using the corresponding client API call. For example, in Python:

is_xgboost_available = H2OXGBoostEstimator.available()

The list of limitations includes:

  1. Right now, XGBoost is initialized only for single-node H2O clusters; however, multi-node XGBoost support is coming soon.

  2. The list of supported platforms includes:

     | Platform | Minimal XGBoost | OMP | GPU | Compilation OS |
     |----------|-----------------|-----|-----|----------------|
     | Linux    | yes             | yes | yes | Ubuntu 14.04, g++ 4.7 |
     | OS X     | yes             | no  | no  | OS X 10.11 |
     | Windows  | no              | no  | no  | NA |

    Note: Minimal XGBoost configuration includes support for a single CPU.

  3. Furthermore, because we are using native XGBoost libraries that depend on OS/platform libraries, it is possible that on older operating systems, XGBoost will not be able to find all necessary binary dependencies, and will not be initialized and available.

  4. XGBoost GPU libraries are compiled against CUDA 8, which is a necessary runtime requirement in order to utilize XGBoost GPU support.

Please give H2O XGBoost a chance, try it, and let us know about your experience or suggest improvements via h2ostream!

H2O Platform Extensibility


The latest H2O release introduced several new concepts to improve the extensibility and modularity of the H2O machine learning platform. This blog post will clarify the motivation, explain the design decisions we made, and demonstrate the overall approach for this release.


The H2O Machine Learning platform was designed as a monolithic application. However, a growing H2O community, along with multiple new projects, demanded that we revisit the architecture and make the development of independent H2O extensions easier.

Furthermore, we would like to allow easy integration of third party tools (e.g., XGBoost, TensorFlow) under a common H2O API.


Conceptually, platform modularity and extensibility can be achieved in different ways:

  1. Compile time code composition: A compile time process assembles all necessary code modules together into a resulting deployable application.
  2. Link time composition: An application is composed at start time based on modules provided at JVM classpath.
  3. Runtime composition: An application can be dynamically extended at runtime, new modules can be loaded, or existing modules can be deactivated.

Approach (1) represents the method adopted by the older version of H2O and its h2o.jar assembly process. In this case, all code is compiled and assembled into a single artifact. However, it has several major limitations. Mainly, it requires a predefined list of code components to put into the resulting artifact, and it does not allow developers and the community to create independent extensions.

On the other hand, the last approach (3) is fully dynamic and is adopted by tools like OSGi, Eclipse, or Chrome; it brings the freedom of a fully dynamic environment that users can modify. However, in the context of a machine learning platform, we believe this is not necessary.

Hence, we decided to adopt the second approach (2) to our architecture and provide link time composition of modules.

With this approach, users specify the modules that they are going to use, and the specified modules are registered by H2O core via a JVM capability called Java Service Provider Interface (Java SPI).

Java SPI is a simple JVM service that allows you to register modules implementing a given interface (or extending an abstract class) and then list them at runtime. The modules need to be registered in a so-called service file located in the META-INF/services directory. The service file contains the name of the component implementation. The application can then query all available components (e.g., those given on the classpath or available via a specified classloader) and use them internally via the implemented interface.
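As a rough Python analogy of what Java's ServiceLoader does with such a service file (the "file content" below reuses two Python standard-library classes purely as stand-in implementations):

```python
import importlib

# Stand-in for the content of a META-INF/services file:
# one fully qualified implementation name per line.
service_file = """
json.JSONDecoder
json.JSONEncoder
"""

def load_services(text):
    """Resolve each listed name to a class, mimicking the ServiceLoader lookup."""
    impls = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        module_name, _, class_name = line.rpartition(".")
        module = importlib.import_module(module_name)
        impls.append(getattr(module, class_name))
    return impls

services = [cls() for cls in load_services(service_file)]
```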

From a design perspective, there are several locations in the H2O platform that we want to make extensible:

  • H2O core
  • Parsers
  • Rapids
  • Persistent providers

In this blog post, we would like to focus only on the first two items; however, a similar approach could be or is already adopted for the remaining parts.

Regarding the first item from the list, H2O core extensibility is crucial for adopting new features – for example, to introduce a new watchdog thread that shuts down H2O if a condition is satisfied, or a new public API layer like GRPC. The core modules are marked by the interface water.AbstractH2OExtension, which provides hooks into the H2O platform lifecycle.

The second extension point allows you to extend a provided REST API, which is typically necessary when a new algorithm is introduced and needs to be exposed via REST API. In this case, the extension module needs to implement the interface water.api.RestApiExtension and register the implementation via the file META-INF/services/water.api.RestApiExtension.


We are going to show extensibility on the XGBoost module – a new feature included in the latest version. XGBoost is a gradient boosting library distributed in a native, non-Java form. Our goal is to publish it via the H2O API and use it in the same way as the rest of the H2O algorithms. To realize this, we need to:

  1. Extend the core of H2O with functionality that will load a binary version of XGBoost
  2. Wrap XGBoost into the H2O Java API
  3. Expose the Java API via REST API

To implement the first step, we are going to define a tiny implementation of water.AbstractH2OExtension, which will try to load XGBoost native libraries. The core extension does nothing except signal availability of XGBoost on the current platform (i.e., not all platforms are supported by XGBoost native libraries):

package hex.tree.xgboost;

public class XGBoostExtension extends AbstractH2OExtension {
  public static String NAME = "XGBoost";

  public String getExtensionName() {
    return NAME;
  }

  public boolean isEnabled() {
    try {
      // Attempt to load the native XGBoost libraries here
      return true;
    } catch (Exception e) {
      return false;
    }
  }
}
Now, we need to register the extension via SPI. We create a new file under META-INF/services called water.AbstractH2OExtension, containing the fully qualified name of the implementation:

hex.tree.xgboost.XGBoostExtension

We will not go into details of the second step, which will be described in another blog post, but we will directly implement the last step.

To expose H2O-specific REST API for XGBoost Java API, we need to implement the interface water.api.RestApiExtension. However, in this example
we take a shortcut and reuse existing code infrastructure for registering the algorithm’s REST API exposed via class water.api.AlgoAbstractRegister:

package hex.api.xgboost;

public class RegisterRestApi extends AlgoAbstractRegister {

  public void registerEndPoints(RestApiContext context) {
    XGBoost xgBoostMB = new XGBoost(true);
    // Register XGBoost model builder REST API
    registerModelBuilder(context, xgBoostMB, SchemaServer.getStableVersion());
  }

  public String getName() {
    return "XGBoost";
  }

  public List<String> getRequiredCoreExtensions() {
    return Collections.singletonList(XGBoostExtension.NAME);
  }
}
And again, it is necessary to register the defined class with the SPI subsystem via the file META-INF/services/water.api.RestApiExtension, containing the fully qualified name of the implementation:

hex.api.xgboost.RegisterRestApi

REST API registration requires one more step that involves registration of the used schemas (classes used by the REST API that implement water.api.Schema). This is an annoying step that is necessary right now, but we hope to remove it in the future. Registration of schemas is done in the same way as registration of extensions – it is necessary to list all schemas in the file META-INF/services/water.api.Schema.


From this point, the REST API definition published by XGBoost model builder is visible to clients. We compile the code and bundle it with H2O core code (or put it on the classpath) and run it:

java -cp h2o-ext-xgboost.jar:h2o.jar water.H2OApp

During the start, we should see a boot message that mentions loaded extensions (XGBoost core extension and REST API extension):

INFO: Flow dir: '/Users/michal/h2oflows'
INFO: Cloud of size 1 formed [/]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: XGBoost extension initialized
INFO: Watchdog extension initialized
INFO: Registered 2 core extensions in: 68ms
INFO: Registered H2O core extensions: [XGBoost, Watchdog]
INFO: Found XGBoost backend with library: xgboost4j
INFO: Registered: 160 REST APIs in: 310ms
INFO: Registered REST API extensions: [AutoML, XGBoost, Algos, Core V3, Core V4]
INFO: Registered: 230 schemas in 342ms
INFO: H2O started in 4932ms

The platform also publishes a list of available extensions via a capabilities REST end-point. A client can get the complete list of capabilities via GET <ip:port>/3/Capabilities:

curl http://localhost:54321/3/Capabilities

Or get a list of core extensions (GET <ip:port>/3/Capabilities/Core):

curl http://localhost:54321/3/Capabilities/Core

Or get a list of REST API extensions (GET <ip:port>/3/Capabilities/API):

curl http://localhost:54321/3/Capabilities/API

Note: We do not modularize the R/Python/Flow clients. The client is responsible for configuring itself based on information provided by the backend (e.g., via the Capabilities REST end-point) and for failing gracefully if the user invokes an operation that the backend does not provide.
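A client-side check can then be as simple as parsing the capabilities response; a Python sketch (the JSON shape here is a simplified assumption, not the exact H2O response schema):

```python
import json

# Simplified, assumed shape of a /3/Capabilities/Core response.
response_body = json.dumps({
    "capabilities": [{"name": "XGBoost"}, {"name": "Watchdog"}]
})

def has_capability(body, name):
    """Return True if the named extension appears in the capabilities list."""
    capabilities = json.loads(body)["capabilities"]
    return any(c["name"] == name for c in capabilities)

xgboost_available = has_capability(response_body, "XGBoost")
```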

For more details about the change, please consult the following:


Why We Bought A Happy Diwali Billboard


It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light — the perfect antidote for some of the more negative developments coming out of Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans have made in technology, especially South Asian Americans. Google (Sundar Pichai) and Microsoft (Satya Nadella) — two major forces in the world of AI — are led by Indian Americans. They join other leaders across the technology ecosystem whom we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Connecting to Spark & Sparkling Water from R & RStudio

Sparkling Water offers best-of-breed machine learning for Spark users. Sparkling Water brings all of H2O’s advanced algorithms and capabilities to Spark. This means that you can continue to use H2O from RStudio or any other IDE of your choice. This post will walk you through the steps to get Sparkling Water running from plain R or RStudio.

It works the same way as regular H2O. You just need to call h2o.init() from R with the right parameters, i.e., IP and port.

For example, we start the Sparkling Shell (bin/sparkling-shell) and create an H2OContext:

Now H2OContext is running and H2O’s REST API is exposed on 172.162.223:54321

So we can open RStudio and call h2o.init() (make sure you have the right R H2O package installed):


Let’s now create a Spark DataFrame, publish it as an H2O frame, and access it from R:

This is how you achieve that in sparkling-shell:
val df = sc.parallelize(1 to 100).toDF // creates Spark DataFrame
val hf = h2oContext.asH2OFrame(df) // publishes DataFrame as H2O's Frame


You can see that the name of the published frame is frame_rdd_6. Now let us go to RStudio and list all the available frames via the h2o.ls() function:

Alternatively, you could name the frame during the transformation from Spark to H2O as shown below:

val hf = h2oContext.asH2OFrame(df, "simple.frame")


We can fetch the frame as well, or invoke an R function on it:

Keep hacking!

Thank you, Cliff

Cliff resigned from the Company last week – he is parting on good terms and supports our future success. Cliff and I have worked closely since 2004, so this is a loss for me. It ends an era of prolific work supporting my vision as a partner.

Let’s take this opportunity to congratulate Cliff on his work in helping me build something from nothing. Millions of little things we did together got us this far. (I still remember the U-Haul trip with the earliest furniture in the old building, and Cliff furiously cranking out code while running on Taco Bell & Fiesta Del Mar.) A lot of how I built the Company has to do with maximizing my partnership with Cliff. Lots of wins came out of that, and we’ll cherish them. Like all good things, it came to an end. I only wish him the very best in the future.

Over the past four years, Cliff and the rest of you have helped me build an amazing technology, business, customer and investor team. Your creativity, passion, loyalty, spirited work, grit & determination are the pillars of support and wellspring of life for the Company. I’ll look for strong partners in each one of you as we pick up and continue on building the tremendous opportunity for changing the world with innovation. It’s an amazing responsibility we have been given.

Change is a constant in the life of a startup. While this hurts now, many companies before us have overcome such changes and transitioned smoothly into the next phase. And we shall become stronger from the change and build an even more vibrant community and culture of sharing & participation.

We have an amazing team loyal to the Company, the fullest support of our Community of Customers. And a will to survive & win that will help H2O metamorphose into the next stage in the natural evolution for the Company.
This will be beautiful when built.

Thank you for your company –
in the journey to transform the world, Sri

How to Build a Machine Learning App Using Sparkling Water and Apache Spark

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.

All that hard work on the backend means that users can easily benefit from programming in a uniform environment that combines both H2O and MLlib algorithms. Data scientists using a Cloudera-supported distribution of Spark can easily incorporate an H2O library into their Spark applications. An entry point to the H2O programming world (called H2OContext) is created and allows for the launch of H2O, parallel import of frames into memory, and the use of H2O algorithms. This seamless integration into Spark makes launching a Sparkling Water application as easy as launching a Spark application:

bin/spark-submit --class water.YourSparklingWaterApp --master yarn-client sparkling-water-app-assembly.jar

Setup and Installation

Sparkling Water is certified on Cloudera and certified to work with versions of Spark installations that come prepackaged with each distribution. To install Sparkling Water, navigate to h2o.ai/download and download the version corresponding to the version of Spark available with your Cloudera cluster. Rather than downloading Spark and then distributing on the Cloudera cluster manually, simply set your SPARK_HOME to the spark directory in your opt directory:

$ export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

For ease of use, we are looking into taking advantage of Cloudera Manager and creating distributable H2O and Sparkling Water parcels. This will simplify the management of the various versions of Cloudera, Spark and H2O.


Figure 1 illustrates the concept of technical realization. The application developer implements a Spark application using the Spark API and Sparkling Water library. After submitting the resulting Sparkling Water application into a Spark cluster, the application can create H2OContext, which initializes H2O services on top of Spark nodes. The application can then use any functionality provided by H2O, including its algorithms and interactive UI. H2O uses its own data structure called H2OFrame to represent tabular data, but H2OContext allows H2O to share data with Spark’s RDDs.


Figure 1: Sparkling Water architecture

Figure 2 illustrates the launch sequence of Sparkling Water on a Cloudera cluster. Both Spark and H2O are in-memory processes, and all computation occurs in memory with minimal writing to disk, occurring exclusively when specified by the user. Because all the data used in the modeling process needs to be read into memory, the recommended method of launching Spark and H2O is through YARN, which dynamically allocates available resources. When the job is finished, you can tear down the Sparkling Water cluster and free up resources for other jobs. All Spark and Sparkling Water applications launched with YARN will be tracked and listed in the history server that you can launch on Cloudera Manager.

YARN allocates a container in which to launch the application master. When you launch with yarn-client, the Spark driver runs in the client process, and the application master submits a request to the resource manager to spawn the Spark executor JVMs. Finally, after creating a Sparkling Water cluster, you have access to HDFS to read data into either H2O or Spark.


Figure 2: Sparkling Water on Cloudera [Launching on YARN]

Programming Model

The H2OContext exposes two operators: (1) publishing a Spark RDD as an H2O frame, and (2) publishing an H2O frame as a Spark RDD. The direction from Spark to H2O makes sense when data is prepared with the help of the Spark API and then passed to H2O algorithms:
// ...
val srdd: SchemaRDD = sqlContext.sql("SELECT * FROM ChicagoCrimeTable where Arrest = 'true'")
// Publish the RDD as H2OFrame
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(srdd)
// ...
val dlModel: DeepLearningModel = new DeepLearning().trainModel.get

The opposite direction from H2O Frame to Spark RDD is used in a situation when the user needs to expose H2O’s frames as Spark’s RDDs. For example:
val prediction: H2OFrame = dlModel.score(testFrame)
// ...
// Exposes prediction frame as RDD
val srdd: SchemaRDD = asSchemaRDD(prediction)

The H2O context simplifies the programming model by introducing implicit conversion to hide asSchemaRDD and asH2OFrame calls.

Sparkling Water excels in situations when you need to call advanced machine learning algorithms from an existing Spark workflow. Furthermore, we found that it is the perfect platform for designing and developing smarter machine learning applications. In the rest of this post, we will demonstrate how to use Sparkling Water to create a simple machine learning application that predicts arrest probability for a given crime in Chicago.

Example Application

We’ve seen some incredible applications of Deep Learning with respect to image recognition and machine translation but this specific use case has to do with public safety; in particular, how Deep Learning can be used to fight crime in the forward-thinking cities of San Francisco and Chicago. The cool thing about these two cities (and many others!) is that they are both open data cities, which means anybody can access city data ranging from transportation information to building maintenance records. So if you are a data scientist or thinking about becoming a data scientist, there are publicly available city-specific datasets you can play with. For this example, we looked at the historical crime data from both Chicago and San Francisco and joined this data with other external data, such as weather and socioeconomic factors, using Spark’s SQL context:


Figure 3: Spark + H2O Workflow

We perform the data import, ad-hoc data munging (parsing the date column, for example), and joining of tables by leveraging the power of Spark. We then publish the Spark RDD as an H2O Frame (Fig. 3).

val sc: SparkContext = // ...
implicit val sqlContext = new SQLContext(sc)
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._

val weatherTable = asSchemaRDD(createWeatherTable("hdfs://data/chicagoAllWeather.csv"))
registerRDDAsTable(weatherTable, "chicagoWeather")
// Census data
val censusTable = asSchemaRDD(createCensusTable("hdfs://data/chicagoCensus.csv"))
registerRDDAsTable(censusTable, "chicagoCensus")
// Crime data
val crimeTable  = asSchemaRDD(createCrimeTable("hdfs://data/chicagoCrimes10k.csv", "MM/dd/yyyy hh:mm:ss a", "Etc/UTC"))
registerRDDAsTable(crimeTable, "chicagoCrime")

val crimeWeather = sql("""SELECT a.Year, ..., b.meanTemp, ..., c.PER_CAPITA_INCOME
    |FROM chicagoCrime a
    |JOIN chicagoWeather b
    |ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
    |JOIN chicagoCensus c
    |ON a.Community_Area = c.Community_Area_Number""".stripMargin)

// Publish result as H2O Frame
val crimeWeatherHF: H2OFrame = crimeWeather

// Split data into train and test datasets
val frs = splitFrame(crimeWeatherHF, Array("train.hex", "test.hex"), Array(0.8, 0.2))
val (train, test) = (frs(0), frs(1))

Figures 4 and 5 below include some cool visualizations we made of the joined table using H2O’s Flow as part of Sparkling Water.

Figure 4: San Francisco crime visualizations

Figure 5: Chicago crime visualizations

It is interesting how, in both cities, crime seems to occur most frequently during the winter — a surprising fact given how cold the weather gets in Chicago!

Using H2O Flow, we were able to look at the arrest rates of every category of recorded crimes in Chicago and compare them with the percentage of total crimes each category represents. Some crimes with the highest arrest rates also occur least frequently, and vice versa.

Figure 6: Chicago arrest rates and total % of all crimes by category

Once the data is transformed to an H2O Frame, we train a deep neural network to predict the likelihood of an arrest for a given crime.

def DLModel(train: H2OFrame, test: H2OFrame, response: String,
            epochs: Int = 10, l1: Double = 0.0001, l2: Double = 0.0001,
            activation: Activation = Activation.RectifierWithDropout,
            hidden: Array[Int] = Array(200, 200))
           (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  import hex.deeplearning.DeepLearning
  import hex.deeplearning.DeepLearningModel.DeepLearningParameters

  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = test
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._activation = activation
  dlParams._hidden = hidden

  // Create a job and run it
  val dl = new DeepLearning(dlParams)
  dl.trainModel.get
}

// Build Deep Learning model
val dlModel = DLModel(train, test, 'Arrest)
// Collect model performance metrics and predictions for test data
val (trainMetricsDL, testMetricsDL) = binomialMetrics(dlModel, train, test)

Here is a screenshot of our H2O Deep Learning model being tuned inside Flow and the resulting AUC curve from scoring the trained model against the validation dataset.


Figure 7: Chicago validation data AUC

The last building block of the application is formed by a function which predicts the arrest rate probability for a new crime. The function combines the Spark API to enrich each incoming crime event with census information and H2O’s deep learning model, which scores the event:


def scoreEvent(crime: Crime, model: Model[_, _, _], censusTable: SchemaRDD)
              (implicit sqlContext: SQLContext, h2oContext: H2OContext): Float = {
  import h2oContext._
  import sqlContext._
  // Create a single row table
  val srdd: SchemaRDD = sqlContext.sparkContext.parallelize(Seq(crime))
  // Join table with census data
  val row: DataFrame = censusTable.join(srdd, on = Option('Community_Area === 'Community_Area_Number))
  // Score the event
  val predictTable = model.score(row)
  val probOfArrest = predictTable.vec("true").at(0)
  probOfArrest.toFloat
}

val crimeEvent = Crime("02/08/2015 11:43:58 PM", 1811, "NARCOTICS", "STREET", false, 422, 4, 7, 46, 18)
val arrestProbability = 100 * scoreEvent(crimeEvent, dlModel, censusTable)


Figure 8: Geo-mapped predictions

Because each of the crimes reported comes with latitude-longitude coordinates, we scored our held-out data using the trained model and plotted the predictions on a map of Chicago—specifically, the Downtown district. The color coding corresponds to the model’s prediction for likelihood of an arrest, with red being very likely (X > 0.8) and blue being unlikely (X < 0.2). Smart analytics + resource management = safer streets.
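The color coding amounts to simple thresholding of the predicted arrest probability; in Python terms (the label for the middle band is our own placeholder):

```python
def color_for(prob):
    """Map a predicted arrest probability to the plot color described above."""
    if prob > 0.8:
        return "red"    # arrest very likely
    if prob < 0.2:
        return "blue"   # arrest unlikely
    return "intermediate"

colors = [color_for(p) for p in (0.95, 0.5, 0.1)]
# colors == ["red", "intermediate", "blue"]
```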

Further Reading

If you’re interested in finding out more about Sparkling Water or H2O, please join us at H2O World 2015 in Mountain View, CA. We’ll have a series of great speakers, including Stanford professors Rob Tibshirani and Stephen Boyd; Hilary Mason, the founder of Fast Forward Labs; Erik Huddleston, the CEO of TrendKite; Danqing Zhao, Big Data Director for Macy’s; and Monica Rogati, Equity Partner at Data Collective.

How I used H2O to crunch through a bank’s customer data

This entry was originally posted here

Six months back, I gingerly started exploring a few data science courses. After having successfully completed some of the courses, I was restless. I wanted to try my data hacking skills on some real data (read: Kaggle).

I find that competing in hackathons helps you benchmark yourself against your fellow data fanatics! You suddenly start to realize the enormity of your ignorance. It’s like the data set is talking back to you — “You know nothing, Aakash!”

So when my friend suggested that I take part in a hackathon organized by Zone Startup in collaboration with a large financial institution I jumped at the opportunity!

The problem statement

To develop a propensity model – the client has a couple of use cases where they have not been able to get an 80% response capture in the top 3 deciles or a >3X lift in the top decile, in spite of several iterations. The expectation here was the identification of a new technique or algorithm (apart from logistic regression) that can help the client get the desired results.

What was in the data

We were provided with profile information and CASA & debit card transaction data for over 800k customers. This data was divided into 2 equal parts for training & testing (as provided by the client). We were supposed to find the customers who were most likely to respond to a personal loan offer. These were around 0.xx% of the total number of customers in the data set. A very rare event!

That’s when you fall in love with H2O!

To the uninitiated, H2O is an amazingly fast, scalable machine learning API that you can use to build smarter applications. It’s been used by companies like Cisco & PayPal for predictive analysis. From their own website: “The new version offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for R, Python and Sparkling Water.”

You can check more about this package here or check some use cases on the H2O Youtube channel.

H2O Software Stack

My workflow

The total customer set was equally divided into a training set and a test set. I divided the customers in the training data set with a 75:25 split, so the algorithms were trained on 75% of the customers in the training set and validated on the remaining 25%.
From the debit & CASA transaction data, I extracted some ninety features for all the customers. Adding another 65 features from the profile information, I had a total of ~150 features for each of the 800k customers.

I added a small routine for feature selection. Subsets of the total ~150 features were selected and trained on four algorithms (viz. GBM, GLM, DRF & DLM). I ran 200 models of each algorithm with a different combination of features. The models which gave the best performance in capturing the respondents in the top decile were selected, and a grid search was performed to choose the best parameters for each model. Finally, an ensemble of my best models was used to capture the rare customers who were likely to respond to a loan offer.
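The random feature-subset search can be sketched as follows (a Python sketch with illustrative names and sizes, not the exact code used in the hackathon):

```python
import random

def sample_feature_subsets(features, n_models, subset_size, seed=4241):
    """Draw one random feature subset per candidate model."""
    rng = random.Random(seed)
    return [rng.sample(features, subset_size) for _ in range(n_models)]

all_features = [f"f{i}" for i in range(150)]  # ~150 engineered features
subsets = sample_feature_subsets(all_features, n_models=200, subset_size=40)
```

Each subset would then be handed to a model builder, and the subsets whose models best capture responders in the top decile survive to the grid-search stage.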


This gave me a 5.2x lift against the business-as-usual (BAU) case. The client had given a benchmark of a 3.0x lift in the top decile or more than an 80% capture rate in the top 3 deciles.
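For reference, top-decile lift compares the response rate among the top-scored 10% of customers against the overall response rate; a minimal Python sketch on toy data:

```python
def top_decile_lift(scores, responded):
    """Lift = response rate in the top-scored 10% divided by the overall rate."""
    ranked = sorted(zip(scores, responded), key=lambda t: t[0], reverse=True)
    k = max(1, len(ranked) // 10)
    top_rate = sum(r for _, r in ranked[:k]) / k
    overall_rate = sum(responded) / len(responded)
    return top_rate / overall_rate

# Toy data: 20 customers, and the 2 responders happen to score highest.
scores = [i / 20 for i in range(20, 0, -1)]
responded = [1, 1] + [0] * 18
lift = top_decile_lift(scores, responded)
# Both responders fall in the top decile (2 customers), so the lift is 10.0.
```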


Mishaps & some lessons learned

I had never used top-decile capture as an optimization metric before, so that was a very hard learning experience, especially since I had not clarified it with the organizers until the second day of the hack!

H2O is really fast & powerful! The initial setup took some time, but once it was set up, it’s quite a smooth operator. I was simply blown away by the idea of running hundreds of models to test all my hypotheses. I must have run close to a thousand different models using different feature sets and parameter settings to tune the algorithms.

There were 15 competing teams from various analytics companies, as well as teams from top universities, at the hackathon, and my algorithm was chosen as one of the top 4. The top two prizes were won by teams which used an XGBoost algorithm.

Feedback & Reason for writing this blog

I have spent the last 6-8 months learning about the subtleties of data science. And I feel like I am standing in front of a big ocean. (I don’t think that feeling will change even after a decade of working on data!)

This hackathon was a steep learning experience. It’s one thing to sit up late nights and hack away on your computer to optimize your code, and a totally different skill set to stand before the client and give them a presentation!

However I don’t believe that a 5.5x-5.2x lift over the BAU is the best that we can get using these algorithms. If you have worked on bank data or marketing analytics, I would love to know what you think about the performance of the algorithm. I would certainly love to see if I can get any further boost from it.


A big thanks to the excellent support from H2O! Especially to Jeff G, without whose help I would not have been able to set up a multi-node cluster.