Football Flowers

Why We Bought A Happy Diwali Billboard


It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light: the perfect antidote for some of the more negative developments coming out of Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans have made in technology, especially South Asian Americans. Google (Sundar Pichai) and Microsoft (Satya Nadella), two major forces in the world of AI, are led by Indian Americans. They join other leaders across the technology ecosystem that we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Connecting to Spark & Sparkling Water from R & RStudio

Sparkling Water offers best-of-breed machine learning for Spark users by bringing all of H2O’s advanced algorithms and capabilities to Spark. This means that you can continue to use H2O from RStudio or any other IDE of your choice. This post walks you through the steps to connect to Sparkling Water from plain R or RStudio.

It works the same way as regular H2O: you just need to call h2o.init() from R with the right parameters, i.e. the IP and port.

For example: we start sparkling shell (bin/sparkling-shell) here and create an H2OContext:

Now H2OContext is running and H2O’s REST API is exposed on 172.162.223:54321

So we can open RStudio and call h2o.init() (make sure you have the right R H2O package installed):


Let’s now create a Spark DataFrame, then publish it as H2O frame and access it from R:

This is how you achieve that in sparkling-shell:
val df = sc.parallelize(1 to 100).toDF // creates Spark DataFrame
val hf = h2oContext.asH2OFrame(df) // publishes DataFrame as H2O's Frame


You can see that the name of the published frame is frame_rdd_6. Now let us go to RStudio and list all the available frames via function:

Alternatively you could also name the frame during the transformation from Spark to H2O as shown below:

val hf = h2oContext.asH2OFrame(df, "simple.frame") // replaces h2oContext.asH2OFrame(df)


We can fetch the frame as well, or invoke an R function on it:

Keep hacking!

Thank you, Cliff

Cliff resigned from the Company last week. He is parting on good terms and supports our future success. Cliff and I have worked closely since 2004, so this is a loss for me. It ends an era of prolific work supporting my vision as a partner.

Let’s take this opportunity to congratulate Cliff on his work in helping me build something from nothing. Millions of little things we did together got us this far. (I still remember the U-Haul trip with the earliest furniture in the old building, and Cliff cranking out code furiously, running on Taco Bell & Fiesta Del Mar.) A lot of how I built the Company has to do with maximizing my partnership with Cliff. Lots of wins came out of that and we’ll cherish them. Like all good things, it came to an end. I wish him only the very best in the future.

Over the past four years, Cliff and the rest of you have helped me build an amazing technology, business, customer and investor team. Your creativity, passion, loyalty, spirited work, grit & determination are the pillars of support and wellspring of life for the Company. I’ll look for strong partners in each one of you as we pick up and continue on building the tremendous opportunity for changing the world with innovation. It’s an amazing responsibility we have been given.

Change is a constant in the life of a startup. While this hurts now, many Companies before us have overcome such changes and transitioned smoothly into the next phase. We shall become stronger from the change and build an even more vibrant community and culture of sharing & participation.

We have an amazing team loyal to the Company, the fullest support of our Community of Customers, and a will to survive & win that will help H2O metamorphose into the next stage in the natural evolution of the Company.
This will be beautiful when built.

Thank you for your company –
in the journey to transform the world, Sri

How to Build a Machine Learning App Using Sparkling Water and Apache Spark

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.

All that hard work on the backend means that users can easily benefit from programming in a uniform environment that combines both H2O and MLlib algorithms. Data scientists using a Cloudera-supported distribution of Spark can easily incorporate an H2O library into their Spark application. An entry point to the H2O programming world (called H2OContext) is created and allows for the launch of H2O, parallel import of frames into memory and the use of H2O algorithms. This seamless integration into Spark makes launching a Sparkling Water application as easy as launching a Spark application:

bin/spark-submit --class water.YourSparklingWaterApp --master yarn-client sparkling-water-app-assembly.jar

Setup and Installation

Sparkling Water is certified on Cloudera and certified to work with the versions of Spark that come prepackaged with each distribution. To install Sparkling Water, navigate to the download page and download the version corresponding to the version of Spark available with your Cloudera cluster. Rather than downloading Spark and then distributing it on the Cloudera cluster manually, simply set your SPARK_HOME to the spark directory in your opt directory:

$ export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

For ease of use, we are looking into taking advantage of Cloudera Manager and creating distributable H2O and Sparkling Water parcels. This will simplify the management of the various versions of Cloudera, Spark and H2O.


Figure 1 illustrates the concept of technical realization. The application developer implements a Spark application using the Spark API and Sparkling Water library. After submitting the resulting Sparkling Water application into a Spark cluster, the application can create H2OContext, which initializes H2O services on top of Spark nodes. The application can then use any functionality provided by H2O, including its algorithms and interactive UI. H2O uses its own data structure called H2OFrame to represent tabular data, but H2OContext allows H2O to share data with Spark’s RDDs.


Figure 1: Sparkling Water architecture

Figure 2 illustrates the launch sequence of Sparkling Water on a Cloudera cluster. Both Spark and H2O are in-memory processes: all computation occurs in memory, and data is written to disk only when the user asks for it. Because all the data used in the modeling process needs to be read into memory, the recommended method of launching Spark and H2O is through YARN, which dynamically allocates available resources. When the job is finished, you can tear down the Sparkling Water cluster and free up resources for other jobs. All Spark and Sparkling Water applications launched with YARN will be tracked and listed in the history server that you can launch on Cloudera Manager.

YARN allocates the container in which the application master is launched. When you launch with yarn-client, the Spark driver runs in the client process and the application master submits a request to the resource manager to spawn the Spark executor JVMs. Finally, after creating a Sparkling Water cluster, you have access to HDFS to read data into either H2O or Spark.


Figure 2: Sparkling Water on Cloudera [Launching on YARN]

Programming Model

The H2OContext exposes two operators: (1) publishing a Spark RDD as an H2O Frame and (2) publishing an H2O Frame as a Spark RDD. The direction from Spark to H2O makes sense when data are prepared with the help of the Spark API and passed to H2O algorithms:
// ...
val srdd: SchemaRDD = sqlContext.sql("SELECT * FROM ChicagoCrimeTable where Arrest = 'true'")
// Publish the RDD as H2OFrame
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(srdd)
// ...
val dlModel: DeepLearningModel = new DeepLearning().trainModel.get

The opposite direction from H2O Frame to Spark RDD is used in a situation when the user needs to expose H2O’s frames as Spark’s RDDs. For example:
val prediction: H2OFrame = dlModel.score(testFrame)
// ...
// Exposes prediction frame as RDD
val srdd: SchemaRDD = asSchemaRDD(prediction)

The H2O context simplifies the programming model by introducing implicit conversion to hide asSchemaRDD and asH2OFrame calls.
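To make the idea of hidden conversion calls concrete, here is a minimal, self-contained sketch of how Scala implicit conversions work in general. The types SparkTable and Frame and the functions asFrame, tableToFrame and columnMeans are hypothetical stand-ins invented for this illustration; they are not the Sparkling Water classes, and the real conversions provided by H2OContext may be defined differently.

```scala
import scala.language.implicitConversions

object ImplicitConversionSketch {
  // Hypothetical stand-ins for a row-oriented table and a column-oriented frame
  // (assumptions for illustration, not the real Spark or H2O types)
  final case class SparkTable(rows: Vector[Vector[Double]])
  final case class Frame(columns: Vector[Vector[Double]])

  // Explicit conversion: transpose rows into columns
  def asFrame(t: SparkTable): Frame = Frame(t.rows.transpose)

  // Implicit conversion: lets a SparkTable be passed wherever a Frame is expected
  implicit def tableToFrame(t: SparkTable): Frame = asFrame(t)

  // A consumer that only accepts a Frame
  def columnMeans(f: Frame): Vector[Double] =
    f.columns.map(c => c.sum / c.size)

  def main(args: Array[String]): Unit = {
    val table = SparkTable(Vector(Vector(1.0, 10.0), Vector(3.0, 30.0)))
    val frame: Frame = table       // the compiler inserts tableToFrame(table)
    println(frame.columns.size)    // number of columns after conversion
    println(columnMeans(table))    // the argument is converted implicitly as well
  }
}
```

The call columnMeans(table) compiles even though no conversion appears in the source, which is the same convenience the implicit conversions in the H2O context aim to provide.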

Sparkling Water excels in situations when you need to call advanced machine learning algorithms from an existing Spark workflow. Furthermore, we found that it is the perfect platform for designing and developing smarter machine learning applications. In the rest of this post, we will demonstrate how to use Sparkling Water to create a simple machine learning application that predicts arrest probability for a given crime in Chicago.

Example Application

We’ve seen some incredible applications of Deep Learning with respect to image recognition and machine translation but this specific use case has to do with public safety; in particular, how Deep Learning can be used to fight crime in the forward-thinking cities of San Francisco and Chicago. The cool thing about these two cities (and many others!) is that they are both open data cities, which means anybody can access city data ranging from transportation information to building maintenance records. So if you are a data scientist or thinking about becoming a data scientist, there are publicly available city-specific datasets you can play with. For this example, we looked at the historical crime data from both Chicago and San Francisco and joined this data with other external data, such as weather and socioeconomic factors, using Spark’s SQL context:


Figure 3: Spark + H2O Workflow

We perform the data import, ad-hoc data munging (parsing the date column, for example), and joining of tables by leveraging the power of Spark. We then publish the Spark RDD as an H2O Frame (Fig. 3).

val sc: SparkContext = // ...
implicit val sqlContext = new SQLContext(sc)
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._

val weatherTable = asSchemaRDD(createWeatherTable("hdfs://data/chicagoAllWeather.csv"))
registerRDDAsTable(weatherTable, "chicagoWeather")
// Census data
val censusTable = asSchemaRDD(createCensusTable("hdfs://data/chicagoCensus.csv"))
registerRDDAsTable(censusTable, "chicagoCensus")
// Crime data
val crimeTable  = asSchemaRDD(createCrimeTable("hdfs://data/chicagoCrimes10k.csv", "MM/dd/yyyy hh:mm:ss a", "Etc/UTC"))
registerRDDAsTable(crimeTable, "chicagoCrime")

val crimeWeather = sql("""SELECT a.Year, ..., b.meanTemp, ..., c.PER_CAPITA_INCOME
    |FROM chicagoCrime a
    |JOIN chicagoWeather b
    |ON a.Year = b.year AND a.Month = b.month AND a.Day =
    |JOIN chicagoCensus c
    |ON a.Community_Area = c.Community_Area_Number""".stripMargin)

// Publish result as H2O Frame
val crimeWeatherHF: H2OFrame = crimeWeather

// Split data into train and test datasets
val frs = splitFrame(crimeWeatherHF, Array("train.hex", "test.hex"), Array(0.8, 0.2))
val (train, test) = (frs(0), frs(1))

Figures 4 and 5 below include some cool visualizations we made of the joined table using H2O’s Flow as part of Sparkling Water.

Figure 4: San Francisco crime visualizations

Figure 5: Chicago crime visualizations

Interestingly, in both cities crime seems to occur most frequently during the winter, a surprising fact given how cold the weather gets in Chicago!

Using H2O Flow, we were able to look at the arrest rates of every category of recorded crimes in Chicago and compare them with the percentage of total crimes each category represents. Some crimes with the highest arrest rates also occur least frequently, and vice versa.

Figure 6: Chicago arrest rates and total % of all crimes by category

Once the data is transformed to an H2O Frame, we train a deep neural network to predict the likelihood of an arrest for a given crime.

def DLModel(train: H2OFrame, test: H2OFrame, response: String,
            epochs: Int = 10, l1: Double = 0.0001, l2: Double = 0.0001,
            activation: Activation = Activation.RectifierWithDropout,
            hidden: Array[Int] = Array(200, 200))
           (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  import hex.deeplearning.DeepLearning
  import hex.deeplearning.DeepLearningModel.DeepLearningParameters

  // Collect the training parameters
  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = test
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._activation = activation
  dlParams._hidden = hidden

  // Create a job and block until training finishes
  val dl = new DeepLearning(dlParams)
  val model = dl.trainModel.get
  model
}

// Build Deep Learning model
val dlModel = DLModel(train, test, 'Arrest)
// Collect model performance metrics and predictions for test data
val (trainMetricsDL, testMetricsDL) = binomialMetrics(dlModel, train, test)

Here is a screenshot of our H2O Deep Learning model being tuned inside Flow and the resulting AUC curve from scoring the trained model against the validation dataset.


Figure 7: Chicago validation data AUC

The last building block of the application is formed by a function which predicts the arrest rate probability for a new crime. The function combines the Spark API to enrich each incoming crime event with census information and H2O’s deep learning model, which scores the event:


def scoreEvent(crime: Crime, model: Model[_,_,_], censusTable: SchemaRDD)
              (implicit sqlContext: SQLContext, h2oContext: H2OContext): Float = {
  import h2oContext._
  import sqlContext._
  // Create a single row table
  val srdd:SchemaRDD = sqlContext.sparkContext.parallelize(Seq(crime))
  // Join table with census data
  val row: DataFrame = censusTable.join(srdd, on = Option('Community_Area === 'Community_Area_Number)) //.printSchema
  val predictTable = model.score(row)
  // Probability of the "true" (arrest) class for the single scored row
  val probOfArrest = predictTable.vec("true").at(0)

  probOfArrest.toFloat
}

val crimeEvent = Crime("02/08/2015 11:43:58 PM", 1811, "NARCOTICS", "STREET", false, 422, 4, 7, 46, 18)
val arrestProbability = 100 * scoreEvent(crimeEvent, dlModel, censusTable)


Figure 8: Geo-mapped predictions

Because each of the crimes reported comes with latitude-longitude coordinates, we scored our hold out data using the trained model and plotted the predictions on a map of Chicago—specifically, the Downtown district. The color coding corresponds to the model’s prediction for likelihood of an arrest with red being very likely (X > 0.8) and blue being unlikely (X < 0.2). Smart analytics + resource management = safer streets.
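The color coding described above can be sketched as a small function. The thresholds 0.8 and 0.2 come from the text; the name colorFor and the "gray" bucket for in-between probabilities are assumptions made for this illustration, since the post does not say how mid-range predictions were drawn.

```scala
object ArrestColor {
  /** Map a predicted arrest probability to a map color. */
  def colorFor(probability: Double): String = {
    require(probability >= 0.0 && probability <= 1.0, "probability must be in [0, 1]")
    if (probability > 0.8) "red"        // arrest very likely
    else if (probability < 0.2) "blue"  // arrest unlikely
    else "gray"                         // assumption: neutral color for the middle range
  }

  def main(args: Array[String]): Unit = {
    Seq(0.05, 0.5, 0.93).foreach(p => println(s"$p -> ${colorFor(p)}"))
  }
}
```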

Further Reading

If you’re interested in finding out more about Sparkling Water or H2O please join us at H2O World 2015 in Mountain View, CA. We’ll have a series of great speakers including Stanford Professors Rob Tibshirani and Stephen Boyd, Hilary Mason, the Founder of Fast Forward Labs, Erik Huddleston, the CEO of TrendKite, Danqing Zhao, Big Data Director for Macy’s and Monica Rogati, Equity Partner at Data Collective.

How I used H2O to crunch through a bank’s customer data

This entry was originally posted here

Six months back I gingerly started exploring a few data science courses. After having successfully completed some of the courses I was restless. I wanted to try my data hacking skills on some real data (read: Kaggle).

I find that competing in hackathons helps you benchmark yourself against your fellow data fanatics! You suddenly start to realize the enormity of your ignorance. It’s like the data set is talking back to you: “You know nothing, Aakash!”

So when my friend suggested that I take part in a hackathon organized by Zone Startup in collaboration with a large financial institution I jumped at the opportunity!

The problem statement

To develop a propensity model – The client has a couple of use cases where they have not been able to get 80% response captures in top 3 deciles or >3X lift in the top decile – in spite of several iterations. The expectation here would be identification of any new technique / algorithm (apart from logistic regression), which can help the client get the desired results.

What was in the data

We were provided with profile information and CASA & debit card transaction data of over 800k customers. The client had divided this data into 2 equal parts for training & testing. We were supposed to find the customers who were more likely to respond to a personal loan offer. This was around 0.xx% of the total number of customers in the data set. A very rare event!

That’s when you fall in love with H2O!

To the uninitiated, H2O is an amazingly fast, scalable machine learning API that you can use to build smarter applications. It’s been used by companies like Cisco & PayPal for predictive analysis. From their own website: “The new version offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for R, Python and Sparkling Water.”

You can check more about this package here or check some use cases on the H2O YouTube channel.

H2O Software Stack

My workflow

The total customer set was equally divided into a training set & test set. I divided the customers in the training data set by a 75:25 split. So the algorithms were trained on 75% of the customers in the training set and validated on the remaining 25%.
From the debit & CASA transaction data I extracted some ninety features for each customer. Adding another 65 features from the profile information gave me a total of ~150 features for each of the 800k customers.

I added a small routine for feature selection. Subsets of the total ~150 features were selected and trained on four algorithms (viz. GBM, GLM, DRF & DLM). I ran 200 models of each algorithm, each with a different combination of features. The models that performed best at capturing the respondents in the top decile were selected, and a grid search was performed to choose the best parameters for each of them. Finally, an ensemble of my best models was used to capture the rare customers who are likely to respond to a loan offer.
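A feature selection routine of this kind can be sketched roughly as follows. This is only an illustration of the general idea (draw random feature subsets, score each, keep the best), not the code used in the hackathon; the function bestSubset, the feature names and the toy scoring function are all assumptions. In practice the scorer would train a model on the subset and return its top-decile capture on the validation split.

```scala
import scala.util.Random

object FeatureSubsetSearch {
  /** Draw `trials` random subsets of `k` features and keep the best-scoring one. */
  def bestSubset(features: Vector[String], k: Int, trials: Int, seed: Long)
                (score: Set[String] => Double): (Set[String], Double) = {
    require(k > 0 && k <= features.size, "k must be between 1 and the feature count")
    require(trials > 0, "need at least one trial")
    val rng = new Random(seed)
    // Each candidate is a random k-element subset of the feature names
    val candidates = Vector.fill(trials)(rng.shuffle(features).take(k).toSet)
    candidates
      .map(subset => (subset, score(subset)))
      .reduce((a, b) => if (b._2 > a._2) b else a) // keep the highest score
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical feature names, invented for this example
    val features = Vector("age", "balance", "txn_count", "card_spend", "tenure")
    // Toy scorer: counts how many "useful" features a subset contains
    val useful = Set("balance", "card_spend")
    val (subset, value) =
      bestSubset(features, k = 3, trials = 50, seed = 42L)(s => (s intersect useful).size.toDouble)
    println(s"best subset: $subset, score: $value")
  }
}
```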


This gave me a 5.2x lift against the business-as-usual (BAU) case. The client had given a benchmark of a 3.0x lift in the top decile or more than an 80% capture rate in the top 3 deciles.
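For readers unfamiliar with these metrics, here is a minimal sketch of how capture rate and lift can be computed from scored customers. The names Scored, captureRate, lift and sampleRows are invented for this example, and the toy data is made up; it has nothing to do with the bank's data. It assumes the common definition where lift is the share of responders captured divided by the share of customers contacted.

```scala
object DecileLift {
  /** One scored customer: the model score and whether they actually responded. */
  final case class Scored(score: Double, responded: Boolean)

  /** Share of all responders found in the top `topPercent` percent of customers by score. */
  def captureRate(rows: Seq[Scored], topPercent: Int): Double = {
    require(topPercent > 0 && topPercent <= 100, "topPercent must be in 1..100")
    val totalResponders = rows.count(_.responded)
    require(totalResponders > 0, "need at least one responder")
    val topN = rows.size * topPercent / 100 // integer arithmetic avoids rounding surprises
    val captured = rows.sortWith(_.score > _.score).take(topN).count(_.responded)
    captured.toDouble / totalResponders
  }

  /** Lift: capture rate divided by the fraction of customers contacted. */
  def lift(rows: Seq[Scored], topPercent: Int): Double =
    captureRate(rows, topPercent) * 100.0 / topPercent

  /** Toy data: 20 customers with descending scores, 4 of whom responded. */
  val sampleRows: Seq[Scored] = {
    val responders = Set(0, 1, 5, 12)
    (0 until 20).map(i => Scored(1.0 - i * 0.05, responders.contains(i)))
  }

  def main(args: Array[String]): Unit = {
    println(f"top-decile capture:    ${captureRate(sampleRows, 10)}%.2f")
    println(f"top-decile lift:       ${lift(sampleRows, 10)}%.2f")
    println(f"top-3-deciles capture: ${captureRate(sampleRows, 30)}%.2f")
  }
}
```

On this toy data the top decile (2 customers) contains 2 of the 4 responders, so the capture rate is 0.5 and the lift is 5. The top 3 deciles contain 3 of the 4 responders, a capture rate of 0.75.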


Mishaps & some lessons learned

I had never used top-decile capture as an optimization metric before, so that was a very hard learning experience, since I had not clarified it with the organizers until the second day of the hack!

H2O is really fast & powerful! The initial setup took some time, but once it was set up it was quite a smooth operator. I was simply blown away by the idea of running hundreds of models to test all my hypotheses. I must have run close to a thousand different models using different feature sets and parameter settings to tune the algorithms.

There were 15 competing teams from various analytics companies, as well as teams from top universities, during the hackathon. My algorithm was chosen as one of the top 4. The top two prizes were won by teams which used an XGBoost algorithm.

Feedback & Reason for writing this blog

I have spent the last 6-8 months learning about the subtleties of data science. And I feel like I am standing in front of a big ocean. (I don’t think that feeling will change even after a decade of working on data!)

This hackathon was a steep learning experience. It’s a totally different thing to sit for late nights and hack away on your computer to optimize your code, and it’s a totally different skill-set to stand before the client and give them a presentation!

However I don’t believe that a 5.5x-5.2x lift over the BAU is the best that we can get using these algorithms. If you have worked on bank data or marketing analytics, I would love to know what you think about the performance of the algorithm. I would certainly love to see if I can get any further boost from it.


A big thanks to the excellent support from H2O! Especially to Jeff G, without whose help I would not have been able to set up a multi-node cluster.

The Definitive Performance Tuning Guide for H2O Deep Learning (Ported scripts to H2O-3, results are taken from February’s blog)


This document gives guidelines for performance tuning of H2O Deep Learning, both in terms of speed and accuracy. It is intended for existing users of H2O Deep Learning (which is easy to change if you’re not), as it assumes some familiarity with the parameters and use cases.


This effort was in part motivated by a Deep Learning project that benchmarked against H2O, and we tried to reproduce their results, especially since they claimed that H2O is slower.
Summary of our findings (much more at the end of part 1):
* We are unable to reproduce their findings (our numbers are 9x to 52x faster, on a single node)
* The benchmark was only on raw training throughput and didn’t consider model accuracy (at all)
* We present improved parameter options for higher training speed (more below)
* We present a single-node model that trains faster than their 16-node system (and leads to higher accuracy than with their settings)


An Introduction to Data Science: Meetup Summary Guest Post by Zen Kishimoto

Originally posted on Tek-Tips forums by Zen here

I went to two meetups at H2O, which provides an open source predictive analytics platform. The second meetup was full of participants because its theme was an introduction to data science.

Data science is a new buzzword, and I feel like everyone claims to be a data scientist or something relating to that these days. But other than real data scientists, very few people really understand what a data scientist is or does. Bits and pieces of information are available, but it takes a basic understanding of the subject to exploit such fragmented information. Once you are up to a certain altitude, a series of blogs by Ricky Ho is very informative.

But first things first. There were three speakers at that meetup, but I’ll only elaborate on the first one, who described data science for laymen. The speaker was Dr. Erin LeDell, whose presentation title was Data Science for Non-Data Scientists.

Dr. Erin LeDell

In the following summary of her points, I stay at a bare-bones level so that a total amateur can grasp what data science is all about. Once you get it, I think you can reference other materials for more details. Her presentations and others are available here. The presentation was also videorecorded and is available here.

At the beginning, she introduced three Stanford professors who work closely with H2O:

The first two professors have published many books, and LeDell mentioned that two ebooks on very relevant subjects are available free of charge. You can download the books below:

Data science process

LeDell gave a high-level view of the data science process:

A simple explanation of each step is given here, in my words.

Problem formulation

  • A data scientist studies and researches the problem domain and identifies factors contributing to the analysis.

Collect & process data

  • Relevant data about the identified factors are collected and processed. Data processing includes cleansing data, which means getting rid of corrupt and/or incorrect values and normalizing values. Several data scientists say that 50-80% of their time is spent cleansing data.

Machine learning

  • The most appropriate machine learning algorithm is developed or selected from a pool of well-known algorithms.

Insights & action

  • The results are analyzed for appropriate action.
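As a small illustration of the cleansing step described above (dropping corrupt values and normalizing the rest), here is a minimal sketch. The function names cleanse and normalize and the choice of min-max scaling are assumptions made for this example; the presentation did not prescribe a particular method.

```scala
import scala.util.Try

object CleanseAndNormalize {
  /** Parse raw strings into numbers, dropping anything corrupt, missing, or not finite. */
  def cleanse(raw: Seq[String]): Seq[Double] =
    raw.flatMap(s => Try(s.trim.toDouble).toOption)
       .filter(d => !d.isNaN && !d.isInfinite)

  /** Min-max scale values into [0, 1]; a constant column maps to all zeros. */
  def normalize(xs: Seq[Double]): Seq[Double] =
    if (xs.isEmpty) Seq.empty
    else {
      val lo = xs.reduce((a, b) => math.min(a, b))
      val hi = xs.reduce((a, b) => math.max(a, b))
      if (hi == lo) xs.map(_ => 0.0) else xs.map(x => (x - lo) / (hi - lo))
    }

  def main(args: Array[String]): Unit = {
    val raw = Seq("10", " 20 ", "n/a", "", "NaN", "30")
    val clean = cleanse(raw)      // keeps only the three valid numbers
    println(clean)
    println(normalize(clean))
  }
}
```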

What background does a data scientist need?

This is a question asked by many non–data scientists. I have seen it many times, along with many answers. LeDell answered: mathematics and statistics, programming and database, communication and visualization, and domain knowledge and soft skills.

Drew Conway’s answer is well known. Actually, the second speaker referred to it.

Data Scientist Skills

This diagram is available here.

Machine Learning

LeDell classified machine learning into three categories: regression, classification, and clustering.

These algorithms are well known and documented. In most cases, a data scientist uses an existing algorithm rather than developing one.

Deep and ensemble learning

LeDell introduced two more technologies: deep and ensemble learning.

Deep learning is described as:

“A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)

In ensemble learning, multiple learning algorithms are combined to obtain better predictive performance, at the cost of extra computation time. More details are here.
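One of the simplest forms of ensemble is a majority vote, sketched below. This is only an illustration of the general idea and not the method LeDell presented; the names Classifier, majorityVote and models, and the toy threshold classifiers, are assumptions invented for this example.

```scala
object MajorityVoteEnsemble {
  /** A classifier maps a single numeric input to a yes/no prediction. */
  type Classifier = Double => Boolean

  /** Predict true when more than half of the models predict true. */
  def majorityVote(models: Seq[Classifier])(x: Double): Boolean = {
    require(models.nonEmpty, "need at least one model")
    // Every model is evaluated, which is where the extra computation time comes from
    models.count(m => m(x)) * 2 > models.size
  }

  // Three deliberately different toy classifiers (hypothetical thresholds)
  val models: Seq[Classifier] = Seq(
    (x: Double) => x > 0.3,
    (x: Double) => x > 0.5,
    (x: Double) => x > 0.7
  )

  def main(args: Array[String]): Unit = {
    Seq(0.2, 0.4, 0.6, 0.8).foreach { x =>
      println(s"$x -> ${majorityVote(models)(x)}")
    }
  }
}
```

With these three thresholds, an input of 0.4 gets only one vote out of three and is rejected, while 0.6 gets two votes and is accepted.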

Finally, LeDell gave the following information for more details on the subject.

I skipped some of her discussion here, but I hope this is a good start to understanding what data science is, and that you will dig further into it.

Zen Kishimoto

About Zen Kishimoto

Seasoned research and technology executive with varied functional expertise, including roles as analyst, writer, CTO, VP Engineering, and in general management, sales, and marketing, across diverse high-tech and cleantech industry segments, including software, mobile embedded systems, Web technologies, and networking. Current focus and expertise are in the application of IT to energy, such as smart grid, green IT, building/data center energy efficiency, and cloud computing.