Use H2O.ai on Azure HDInsight

This is a repost from this article on MSDN

We’re hosting a webinar to show you how to use H2O on HDInsight and to answer your questions. Sign up for our upcoming webinar on combining H2O and Azure HDInsight.

We recently announced that H2O and Microsoft Azure HDInsight have integrated to provide Data Scientists with a Leading Combination of Engines for Machine Learning and Deep Learning. Through H2O’s AI platform and its Sparkling Water solution, users can combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark, as well as drive computation from Scala/R/Python and utilize the H2O Flow UI, providing an ideal machine learning platform for application developers.

In this blog, we will provide a detailed step-by-step guide to help you set up the first H2O on HDInsight solution.

Step 1: setting up the environment

The first step is to create an HDInsight cluster with H2O installed. You can either install H2O while provisioning a new HDInsight cluster, or install it on an existing cluster. Please note that, as of today, H2O on HDInsight only works with Spark 2.0 on HDInsight 3.5, which is the default HDInsight version.

For more information on how to create a cluster in HDInsight, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters). For more information on how to install an application on an existing cluster, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-applications).

Please note that we’ve recently streamlined our UI to require fewer clicks, so you now need to click the “Custom” button to install applications on HDInsight.


Step 2: Writing your first H2O application

After installing H2O on HDInsight, you can use the built-in Jupyter notebooks to write your first H2O on HDInsight application. Simply go to https://yourclustername.azurehdinsight.net/jupyter to open the Jupyter Notebook. You will see a folder named “H2O-PySparkling-Examples”.


There are a few examples in the folder, but I recommend starting with the one named “Sentiment_analysis_with_Sparkling_Water.ipynb”. Most of the details on how to use the H2O PySparkling Water APIs are already covered in the notebook itself, so here I will only give a high-level overview.

The first thing you need to do is to configure the environment. Most of the configurations are already taken care of by the system, such as the Flow UI address, the Spark jar location, and the Sparkling Water egg file.

There are three important parameters to configure: the driver memory, the executor memory, and the number of executors. The default values are optimized for the default 4-node cluster, but your cluster size might vary.

Tuning these parameters is outside the scope of this blog, as it is more of a Spark resource-tuning problem. There are a few good reference articles, such as this one.

Note that all Spark applications deployed using a Jupyter Notebook run in “yarn-cluster” deploy mode. This means the Spark driver is allocated on a worker node of the cluster, not on the head nodes.

In this example, we simply allocate 75% of each HDInsight worker node’s memory to the driver and executors (21 GB each) and use 3 executors, since the default HDInsight cluster has 4 worker nodes (3 executors + 1 driver).
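To make the arithmetic concrete, here is a small Python sketch of the sizing rule described above (the 28 GB node size is a hypothetical figure chosen so that 75% of it matches the 21 GB default; it is not an HDInsight specification):

```python
# Sketch of the sizing rule above: grant ~75% of a worker node's memory
# to the Spark process (driver or executor) placed on it.

def container_memory_gb(node_memory_gb, fraction=0.75):
    """Memory to grant a driver or executor on a node of the given size."""
    return int(node_memory_gb * fraction)

node_gb = 28  # hypothetical worker node size; 0.75 * 28 = 21
per_container = container_memory_gb(node_gb)

# Default 4-worker-node cluster: 1 driver + 3 executors, 21 GB each
conf = {"driver_memory_gb": per_container,
        "executor_memory_gb": per_container,
        "num_executors": 3}
print(conf)
```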


Please refer to the Jupyter Notebook tutorial for more information on how to use Jupyter Notebooks on HDInsight.

The second step here is to create an H2O context. Since a default Spark context (called sc) is already configured in the Jupyter Notebook, in H2O we just need to call

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

so H2O can recognize the default spark context.

After executing this line of code, H2O will print out the status, as well as the YARN application it is using.


After this, you can use H2O APIs plus the Spark APIs to write your applications. To learn more about Sparkling Water APIs, refer to the H2O GitHub site here.


This sentiment analysis example has a few steps to analyze the data:

  1. Load data to Spark and H2O frames
  2. Data munging using H2O API
    • Remove columns
    • Refine Time Column into Year/Month/Day/DayOfWeek/Hour columns
  3. Data munging using Spark API
    • Select columns Score, Month, Day, DayOfWeek, Summary
    • Define UDF to transform score (0..5) to binary positive/negative
    • Use TF-IDF to vectorize summary column
  4. Model building using H2O API
    • Use H2O Grid Search to tune hyperparameters
    • Select the best Deep Learning model
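To make step 3 concrete, here is a plain-Python sketch of the kind of UDF it describes (the cutoff of 3 is an assumed threshold, not taken from the notebook; in PySpark the function would be wrapped with `pyspark.sql.functions.udf` before being applied to the Score column):

```python
# Hypothetical sketch of the score-to-sentiment UDF from step 3:
# map a review score in 0..5 to a binary positive/negative label.

def to_sentiment(score):
    """Label scores above 3 as positive, everything else as negative."""
    return "positive" if score > 3 else "negative"

labels = [to_sentiment(s) for s in (1, 3, 5)]
print(labels)  # ['negative', 'negative', 'positive']
```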

Please refer to the Jupyter Notebook for more details.

Step 3: use FLOW UI to monitor the progress and visualize the model

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed while the Spark application/notebook is running.

You can click the available link in the Jupyter Notebook, or you can directly access this URL: https://yourclustername-h2o.apps.azurehdinsight.net/flow/index.html

In this example, we will demonstrate its visualization capabilities. Simply click “Model > List Grid Search Results” (since we are using Grid Search to tune hyperparameters).


Then you can access the 4 grid search results:


And you can view the details of each model. For example, you can visualize the ROC curve as below:


In Jupyter Notebooks, you can also view the performance in text format:


In this blog, we have walked you through the detailed steps to create your first machine learning application with H2O on HDInsight. For more information on H2O, please visit the H2O site; for more information on HDInsight, please visit the HDInsight site.

This blog post is co-authored by Pablo Marin (@pablomarin), Solution Architect at Microsoft.

Why We Bought A Happy Diwali Billboard


It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light, the perfect antidote to some of the more negative developments coming out of Silicon Valley recently. Throw in a polarizing presidential race in which a certain candidate literally wants to build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans, especially South Asian Americans, have made in technology. Google and Microsoft, two major forces in the world of AI, are both led by Indian Americans (Sundar Pichai and Satya Nadella, respectively). They join other leaders across the technology ecosystem whom we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Drink in the Data with H2O at Strata SJ 2016

It’s about to rain data in San Jose when Strata + Hadoop World comes to town March 29 – March 31st.

H2O has a waterfall of action happening at the show. Here’s a rundown of what’s on tap.
Keep it handy so you have less chance of FOMO (fear of missing out).

Hang out with H2O at Booth #1225 to learn more about how machine learning can help transform your business and find us throughout the conference:

Tuesday, March 29th

Wednesday, March 30th

  • 12:45pm – 1:15pm Meet the Makers: The brains and innovation behind the leading machine learning solution is on hand to hack with you
    • #AskArno – Arno Candel, Chief Architect and H2O algorithm expert
    • #RuReady with Matt Dowle, H2O Hacker and author of R data.table
    • #SparkUp with Michal Malohlava, principal developer of Sparkling Water
    • #InWithErin – Erin LeDell, Machine Learning Scientist and H2O ensembles expert
  • 2:40pm – 3:20pm H2O highlighted in An introduction to Transamerica’s product recommendation platform
  • 5:50pm – 6:50pm Booth Crawl. Have a beer on us at Booth #1225
  • 7:00pm – 9:00pm Let it Flow with H2O – Drinks + Data at the Arcadia Lounge. Grab your invite at Booth #1225

Thursday, March 31st

  • 12:45pm – 1:15pm Ask Transamerica. Vishal Bamba and Nitin Prabhu of Transamerica join us at Booth #1225 for Q&A with you!

How to Build a Machine Learning App Using Sparkling Water and Apache Spark

The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.

All that hard work on the backend means that users can easily benefit from programming in a uniform environment that combines both H2O and MLlib algorithms. Data scientists using a Cloudera-supported distribution of Spark can easily incorporate an H2O library into their Spark applications. An entry point to the H2O programming world (called H2OContext) is created and allows for the launch of H2O, parallel import of frames into memory, and the use of H2O algorithms. This seamless integration into Spark makes launching a Sparkling Water application as easy as launching a Spark application:

bin/spark-submit --class water.YourSparklingWaterApp --master yarn-client sparkling-water-app-assembly.jar

Setup and Installation

Sparkling Water is certified on Cloudera and works with the Spark versions that come prepackaged with each distribution. To install Sparkling Water, navigate to h2o.ai/download and download the version corresponding to the version of Spark available on your Cloudera cluster. Rather than downloading Spark and then distributing it on the Cloudera cluster manually, simply set SPARK_HOME to the Spark directory in your opt directory:

$ export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

For ease of use, we are looking into taking advantage of Cloudera Manager and creating distributable H2O and Sparkling Water parcels. This will simplify the management of the various versions of Cloudera, Spark and H2O.


Figure 1 illustrates the concept of technical realization. The application developer implements a Spark application using the Spark API and Sparkling Water library. After submitting the resulting Sparkling Water application into a Spark cluster, the application can create H2OContext, which initializes H2O services on top of Spark nodes. The application can then use any functionality provided by H2O, including its algorithms and interactive UI. H2O uses its own data structure called H2OFrame to represent tabular data, but H2OContext allows H2O to share data with Spark’s RDDs.


Figure 1: Sparkling Water architecture

Figure 2 illustrates the launch sequence of Sparkling Water on a Cloudera cluster. Both Spark and H2O are in-memory processes, and all computation occurs in memory with minimal writing to disk, which happens only when specified by the user. Because all the data used in the modeling process needs to be read into memory, the recommended method of launching Spark and H2O is through YARN, which dynamically allocates available resources. When the job is finished, you can tear down the Sparkling Water cluster and free up resources for other jobs. All Spark and Sparkling Water applications launched with YARN will be tracked and listed in the history server, which you can launch from Cloudera Manager.

YARN allocates a container in which to launch the application master. When you launch with yarn-client, the Spark driver runs in the client process, and the application master submits a request to the Resource Manager to spawn the Spark executor JVMs. Finally, after creating a Sparkling Water cluster, you have access to HDFS to read data into either H2O or Spark.


Figure 2: Sparkling Water on Cloudera [Launching on YARN]

Programming Model

The H2OContext exposes two operators: (1) publishing a Spark RDD as an H2O Frame, and (2) publishing an H2O Frame as a Spark RDD. The direction from Spark to H2O makes sense when data are prepared with the help of the Spark API and then passed to H2O algorithms:
// ...
val srdd: SchemaRDD = sqlContext.sql("SELECT * FROM ChicagoCrimeTable where Arrest = 'true'")
// Publish the RDD as H2OFrame
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(srdd)
// ... configure a DeepLearningParameters instance as dlParams ...
val dlModel: DeepLearningModel = new DeepLearning(dlParams).trainModel.get

The opposite direction, from H2O Frame to Spark RDD, is used when the user needs to expose H2O frames as Spark RDDs. For example:
val prediction: H2OFrame = dlModel.score(testFrame)
// ...
// Exposes prediction frame as RDD
val srdd: SchemaRDD = asSchemaRDD(prediction)

The H2O context simplifies the programming model by introducing implicit conversion to hide asSchemaRDD and asH2OFrame calls.

Sparkling Water excels in situations when you need to call advanced machine learning algorithms from an existing Spark workflow. Furthermore, we found that it is the perfect platform for designing and developing smarter machine learning applications. In the rest of this post, we will demonstrate how to use Sparkling Water to create a simple machine learning application that predicts arrest probability for a given crime in Chicago.

Example Application

We’ve seen some incredible applications of Deep Learning with respect to image recognition and machine translation but this specific use case has to do with public safety; in particular, how Deep Learning can be used to fight crime in the forward-thinking cities of San Francisco and Chicago. The cool thing about these two cities (and many others!) is that they are both open data cities, which means anybody can access city data ranging from transportation information to building maintenance records. So if you are a data scientist or thinking about becoming a data scientist, there are publicly available city-specific datasets you can play with. For this example, we looked at the historical crime data from both Chicago and San Francisco and joined this data with other external data, such as weather and socioeconomic factors, using Spark’s SQL context:


Figure 3: Spark + H2O Workflow

We perform the data import, ad-hoc data munging (parsing the date column, for example), and joining of tables by leveraging the power of Spark. We then publish the Spark RDD as an H2O Frame (Fig. 2).

val sc: SparkContext = // ...
implicit val sqlContext = new SQLContext(sc)
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._

val weatherTable = asSchemaRDD(createWeatherTable("hdfs://data/chicagoAllWeather.csv"))
registerRDDAsTable(weatherTable, "chicagoWeather")
// Census data
val censusTable = asSchemaRDD(createCensusTable("hdfs://data/chicagoCensus.csv"))
registerRDDAsTable(censusTable, "chicagoCensus")
// Crime data
val crimeTable  = asSchemaRDD(createCrimeTable("hdfs://data/chicagoCrimes10k.csv", "MM/dd/yyyy hh:mm:ss a", "Etc/UTC"))
registerRDDAsTable(crimeTable, "chicagoCrime")

val crimeWeather = sql("""SELECT a.Year, ..., b.meanTemp, ..., c.PER_CAPITA_INCOME
    |FROM chicagoCrime a
    |JOIN chicagoWeather b
    |ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
    |JOIN chicagoCensus c
    |ON a.Community_Area = c.Community_Area_Number""".stripMargin)

// Publish result as H2O Frame
val crimeWeatherHF: H2OFrame = crimeWeather

// Split data into train and test datasets
val frs = splitFrame(crimeWeatherHF, Array("train.hex", "test.hex"), Array(0.8, 0.2))
val (train, test) = (frs(0), frs(1))

Figures 4 and 5 below include some cool visualizations we made of the joined table using H2O’s Flow as part of Sparkling Water.


Figure 4: San Francisco crime visualizations


Figure 5: Chicago crime visualizations

Interesting how in both cities crime seems to occur most frequently during the winter, a surprising fact given how cold the weather gets in Chicago!

Using H2O Flow, we were able to look at the arrest rates of every category of recorded crimes in Chicago and compare them with the percentage of total crimes each category represents. Some crimes with the highest arrest rates also occur least frequently, and vice versa.


Figure 6: Chicago arrest rates and total % of all crimes by category

Once the data is transformed to an H2O Frame, we train a deep neural network to predict the likelihood of an arrest for a given crime.

def DLModel(train: H2OFrame, test: H2OFrame, response: String,
            epochs: Int = 10, l1: Double = 0.0001, l2: Double = 0.0001,
            activation: Activation = Activation.RectifierWithDropout,
            hidden: Array[Int] = Array(200, 200))
           (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._
  import hex.deeplearning.DeepLearning
  import hex.deeplearning.DeepLearningModel.DeepLearningParameters

  val dlParams = new DeepLearningParameters()
  dlParams._train = train
  dlParams._valid = test
  dlParams._response_column = response
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._activation = activation
  dlParams._hidden = hidden

  // Create a job and block until the model is built
  val dl = new DeepLearning(dlParams)
  dl.trainModel.get
}

// Build Deep Learning model
val dlModel = DLModel(train, test, 'Arrest)
// Collect model performance metrics and predictions for test data
val (trainMetricsDL, testMetricsDL) = binomialMetrics(dlModel, train, test)

Here is a screenshot of our H2O Deep Learning model being tuned inside Flow and the resulting AUC curve from scoring the trained model against the validation dataset.


Figure 7: Chicago validation data AUC

The last building block of the application is formed by a function which predicts the arrest rate probability for a new crime. The function combines the Spark API to enrich each incoming crime event with census information and H2O’s deep learning model, which scores the event:


def scoreEvent(crime: Crime, model: Model[_, _, _], censusTable: SchemaRDD)
              (implicit sqlContext: SQLContext, h2oContext: H2OContext): Float = {
  import h2oContext._
  import sqlContext._
  // Create a single-row table from the incoming crime event
  val srdd: SchemaRDD = sqlContext.sparkContext.parallelize(Seq(crime))
  // Join the event with census data
  val row: DataFrame = censusTable.join(srdd, on = Option('Community_Area === 'Community_Area_Number))
  val predictTable = model.score(row)
  val probOfArrest = predictTable.vec("true").at(0)
  probOfArrest.toFloat
}

val crimeEvent = Crime("02/08/2015 11:43:58 PM", 1811, "NARCOTICS", "STREET", false, 422, 4, 7, 46, 18)
val arrestProbability = 100 * scoreEvent(crimeEvent, dlModel, censusTable)


Figure 8: Geo-mapped predictions

Because each of the crimes reported comes with latitude-longitude coordinates, we scored our hold out data using the trained model and plotted the predictions on a map of Chicago—specifically, the Downtown district. The color coding corresponds to the model’s prediction for likelihood of an arrest with red being very likely (X > 0.8) and blue being unlikely (X < 0.2). Smart analytics + resource management = safer streets.
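The color coding described above boils down to a simple thresholding rule; here it is as a sketch (the 0.8 and 0.2 cutoffs are the ones quoted in the text, while the intermediate "gray" label is an assumption for the in-between range):

```python
# Sketch of the color-coding rule above: map an arrest probability
# to a plot color for the Chicago map.

def arrest_color(p):
    if p > 0.8:
        return "red"   # arrest very likely
    if p < 0.2:
        return "blue"  # arrest unlikely
    return "gray"      # intermediate

print([arrest_color(p) for p in (0.9, 0.5, 0.1)])  # ['red', 'gray', 'blue']
```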

Further Reading

If you’re interested in finding out more about Sparkling Water or H2O, please join us at H2O World 2015 in Mountain View, CA. We’ll have a series of great speakers, including Stanford Professors Rob Tibshirani and Stephen Boyd; Hilary Mason, the founder of Fast Forward Labs; Erik Huddleston, the CEO of TrendKite; Danqing Zhao, Big Data Director for Macy’s; and Monica Rogati, Equity Partner at Data Collective.

How I used H2O to crunch through a bank’s customer data

This entry was originally posted here

Six months back I gingerly started exploring a few data science courses. After having successfully completed some of the courses I was restless. I wanted to try my data hacking skills on some real data (read kaggle).

I find competing in hackathons helps you benchmark yourself against your fellow data fanatics! You suddenly start to realize the enormity of your ignorance. It’s like the data set is talking back to you: “You know nothing, Aakash!”

So when my friend suggested that I take part in a hackathon organized by Zone Startup in collaboration with a large financial institution I jumped at the opportunity!

The problem statement

To develop a propensity model: the client has a couple of use cases where they have not been able to achieve an 80% response capture in the top 3 deciles or a >3x lift in the top decile, in spite of several iterations. The expectation here is the identification of a new technique/algorithm (apart from logistic regression) that can help the client get the desired results.

What was in the data

We were provided with profile information and CASA & debit card transaction data for over 800k customers. This data was divided into 2 equal parts for training and testing (the split was provided by the client). We were supposed to find the customers most likely to respond to a personal loan offer, around 0.xx% of the total number of customers in the data set. A very rare event!

That’s when you fall in love with H2O!

To the uninitiated, H2O is an amazingly fast, scalable machine learning API that you can use to build smarter applications. It’s been used by companies like Cisco and PayPal for predictive analytics. From their own website: “The new version offers a single integrated and tested platform for enterprise and open-source use, enhanced usability through a web user interface (UI) with embeddable workflows, elegant APIs, and direct integration for R, Python and Sparkling Water.”

You can check more about this package here or check some use cases on the H2O Youtube channel.

H2O Software Stack

My workflow

The total customer set was equally divided into a training set and a test set. I further split the customers in the training data set 75:25, so the algorithms were trained on 75% of the customers in the training set and validated on the remaining 25%.
From the debit & CASA transactional data, I extracted some ninety features for all the customers. Adding another 65 features from the profile information, I had a total of ~150 features for each of the 800k customers.
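As a plain-Python stand-in for the 75:25 split (in H2O this would typically be done with the frame-splitting API, e.g. `h2o.splitFrame`; the seed here is arbitrary):

```python
import random

def split_75_25(customers, seed=42):
    """Shuffle and split a customer list 75:25 into train/validation."""
    rows = list(customers)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * 0.75)
    return rows[:cut], rows[cut:]

train, valid = split_75_25(range(400_000))
print(len(train), len(valid))  # 300000 100000
```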

I added a small routine for feature selection. Subsets of the total ~150 features were selected and trained with four algorithms (viz. GBM, GLM, DRF & DLM). I ran 200 models of each algorithm with different combinations of features. The models that best captured respondents in the top decile were selected, and a grid search was performed to choose the best parameters for each model. Finally, an ensemble of my best models was used to capture the rare customers likely to respond to a loan offer.


This gave me a 5.2x lift over the business-as-usual (BAU) case. The client had set a benchmark of a 3.0x capture in the top decile or more than an 80% capture rate in the top 3 deciles.
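For readers unfamiliar with the metric, the top-decile lift quoted above can be computed like this (a self-contained sketch with made-up toy data, not the hackathon's data):

```python
# Top-decile lift: the response rate among the top 10% highest-scored
# customers, divided by the overall response rate.

def top_decile_lift(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_n = max(1, len(ranked) // 10)
    top_rate = sum(l for _, l in ranked[:top_n]) / top_n
    overall = sum(labels) / len(labels)
    return top_rate / overall

# Toy data: 20 customers, 2 responders, both ranked at the top
scores = [0.99, 0.95] + [0.5 - i * 0.01 for i in range(18)]
labels = [1, 1] + [0] * 18
print(top_decile_lift(scores, labels))  # 10.0
```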


Mishaps & some lessons learned

I had never used top-decile capture as an optimization metric before, so that was a hard learning experience, especially since I had not clarified it with the organizers until the second day of the hack!

H2O is really fast and powerful! The initial setup took some time, but once it was done, it was quite a smooth operator. I was simply blown away by the idea of running hundreds of models to test all my hypotheses. I must have run close to a thousand different models using different feature sets and parameter settings to tune the algorithms.

There were 15 competing teams from various analytics companies, as well as teams from top universities, at the hackathon, and my algorithm was chosen as one of the top 4. The top two prizes were won by teams that used an XGBoost algorithm.

Feedback & Reason for writing this blog

I have spent the last 6-8 months learning about the subtleties of data science. And I feel like I am standing in front of a big ocean. (I don’t think that feeling will change even after a decade of working on data!)

This hackathon was a steep learning experience. It’s a totally different thing to sit for late nights and hack away on your computer to optimize your code, and it’s a totally different skill-set to stand before the client and give them a presentation!

However, I don’t believe that a lift of around 5.2x over BAU is the best we can get from these algorithms. If you have worked on bank data or marketing analytics, I would love to know what you think about the algorithm’s performance. I would certainly love to see if I can get any further boost from it.


A big thanks for the excellent support from H2O, especially Jeff G, without whose help I would not have been able to set up a multi-node cluster.

The Definitive Performance Tuning Guide for H2O Deep Learning (Ported scripts to H2O-3, results are taken from February’s blog)


This document gives guidelines for performance tuning of H2O Deep Learning, both in terms of speed and accuracy. It is intended for existing users of H2O Deep Learning (which is easy to change if you’re not), as it assumes some familiarity with the parameters and use cases.


This effort was in part motivated by a Deep Learning project that benchmarked against H2O; we tried to reproduce their results, especially since they claimed that H2O is slower.
Summary of our findings (much more at the end of part 1):

  • We were unable to reproduce their findings (our numbers are 9x to 52x faster, on a single node)
  • Their benchmark measured only raw training throughput and didn’t consider model accuracy at all
  • We present improved parameter options for higher training speed (more below)
  • We present a single-node model that trains faster than their 16-node system (and leads to higher accuracy than with their settings)

Continue reading

An Introduction to Data Science: Meetup Summary Guest Post by Zen Kishimoto

Originally posted on Tek-Tips forums by Zen here

I went to two meetups at H2O, which provides an open source predictive analytics platform. The second meetup was full of participants because its theme was an introduction to data science.

Data science is a new buzzword, and these days it feels like everyone claims to be a data scientist or something related. But aside from real data scientists, very few people really understand what a data scientist is or does. Bits and pieces of information are available, but it takes a basic understanding of the subject to exploit such fragmented information. Once you reach a certain altitude, a series of blogs by Ricky Ho is very informative.

But first things first. There were three speakers at that meetup, but I’ll only elaborate on the first one, who described data science for laymen. The speaker was Dr. Erin LeDell, whose presentation title was Data Science for Non-Data Scientists.

Dr. Erin LeDell

In the following summary of her points, I stay at a bare-bones level so that a total amateur can grasp what data science is all about. Once you get it, I think you can reference other materials for more details. Her presentations and others are available here. The presentation was also videorecorded and is available here.

At the beginning, she introduced three Stanford professors who work closely with H2O:

The first two professors have published many books, but LeDell mentioned that two ebooks on very relevant subjects are available free of charge. You can download the books below:

Data science process

LeDell gave a high-level view of the data science process:

A simple explanation of each step is given here, in my words.

Problem formulation

  • A data scientist studies and researches the problem domain and identifies factors contributing to the analysis.

Collect & process data

  • Relevant data about the identified factors are collected and processed. Data processing includes cleansing the data: getting rid of corrupt and/or incorrect values and normalizing values. Several data scientists have said that 50-80% of their time is spent cleansing data.

Machine learning

  • The most appropriate machine learning algorithm is developed or selected from a pool of well-known algorithms.

Insights & action

  • The results are analyzed for appropriate action.

What background does a data scientist need?

This is a question asked by many non–data scientists. I have seen it many times, along with many answers. LeDell answered: mathematics and statistics, programming and database, communication and visualization, and domain knowledge and soft skills.

Drew Conway’s answer is well known; in fact, the second speaker referred to it.

Data Scientist Skills

This diagram is available here.

Machine Learning

LeDell classified machine learning into three categories: regression, classification, and clustering.

These algorithms are well known and documented. In most cases, a data scientist uses an existing algorithm rather than developing one.

Deep and ensemble learning

LeDell introduced two more technologies: deep and ensemble learning.

Deep learning is described as:

“A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)

In ensemble learning, multiple learning algorithms are combined to obtain better predictive performance, at the cost of extra computation time. More details are here.
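As a toy illustration of the idea (three hypothetical rule-based "models" combined by majority vote; this is not an H2O API):

```python
# Toy ensemble: combine several hypothetical classifiers by majority
# vote. Each "model" here is just a function returning a prediction.

def majority_vote(models, x):
    votes = [m(x) for m in models]
    return max(set(votes), key=votes.count)

# Three weak rules that disagree on some inputs
models = [lambda x: x > 4, lambda x: x > 6, lambda x: x > 5]
print([majority_vote(models, v) for v in (3, 5, 7)])  # [False, False, True]
```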

Finally, LeDell gave the following information for more details on the subject.

I skipped some of her discussion here, but I hope this is a good start to understanding what data science is, and that you will dig further into it.

Zen Kishimoto

About Zen Kishimoto

Zen is a seasoned research and technology executive with functional expertise spanning analyst, writer, CTO, VP of Engineering, general management, sales, and marketing roles in diverse high-tech and cleantech industry segments, including software, mobile embedded systems, Web technologies, and networking. His current focus and expertise are in applying IT to energy: smart grid, green IT, building/data center energy efficiency, and cloud computing.

Lending Club : Predict Bad Loans to Minimize Loss to Defaulted Accounts

As a sales engineer on the H2O.ai team, I get asked a lot about the value-add of H2O. How do you put a price tag on something that is open source? The answer typically revolves around the use case: if a use case pertains to improving user experience or building apps that improve internal operations, there is no straightforward way to put a dollar value on better experiences. However, if the use case is focused on detecting fraud or maintaining enough supply for the next sales quarter, we can calculate the total amount of money saved by detecting pricey fraudulent cases, or the sales lost due to incorrectly forecasted demand.

The H2O team has built a number of user-facing demonstrations, from our Ask Craig app to predicting flight delays, available in R, Sparkling Water, and Python. Today, we will use Lending Club’s open data to obtain the probability of a loan request defaulting or being charged off. We will build an H2O model, calculate the dollar amount saved by rejecting these loan requests with the model (not including the opportunity cost), and then combine this with the profits lost by rejecting good loans to determine the net amount saved.
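The net-savings arithmetic described above can be sketched as follows; all counts and dollar amounts below are invented for illustration and are not from the Lending Club data:

```python
# Sketch of the net-savings calculation: money saved by rejecting bad
# loans the model catches, minus interest profit lost on good loans the
# model wrongly rejects.

def net_savings(caught_bad, avg_loss, rejected_good, avg_profit):
    return caught_bad * avg_loss - rejected_good * avg_profit

# e.g. 1,000 bad loans caught at $8,000 lost each, versus
# 300 good loans rejected at $2,000 profit each
print(net_savings(1000, 8000, 300, 2000))  # 7400000
```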

Summary of Approved Loan Applicants

The dataset had a total of half a million records from 2007 to 2015, which means that with H2O the data can be processed on your personal computer with an H2O instance with at least 2 GB of memory. The first step is to import the data and create a new column that categorizes each loan as either a good loan or a bad loan (the user has defaulted or the account has been charged off). The following is a code snippet in R:

    print("Start up H2O...")
    conn <- h2o.init(nthreads = -1)

    print("Import approved and rejected loan requests from Lending Club...")
    path   <- "/Users/amy/Desktop/lending_club/loanStats/"
    loanStats <- h2o.importFile(path = path, destination_frame = "LoanStats")

    print("Create bad loan label, this will include charged off and defaulted loans...")
    loanStats$bad_loan <- ifelse(loanStats$loan_status == "Charged Off" | 
                             loanStats$loan_status == "Default" | 
                            loanStats$loan_status == "Does not meet the credit policy.  Status:Charged Off", 
                            1, 0)
    loanStats$bad_loan <- as.factor(loanStats$bad_loan)

    print("Create the applicant's risk score, if the credit score is 0 make it an NA...")
    loanStats$risk_score <- ifelse(loanStats$last_fico_range_low == 0, NA,
                               (loanStats$last_fico_range_high + loanStats$last_fico_range_low)/2)

Credit Score Summaries

In H2O Flow, you can view the distribution of credit scores for good loans vs. bad loans. It is easy to see that holders of bad loans typically have the lowest credit scores, which will be the biggest driving force in predicting whether a loan is good or not. However, we want a model that also takes other features into account, so that loans aren’t automatically cut off at a certain credit-score threshold.


