The Definitive Performance Tuning Guide for H2O Deep Learning (scripts ported to H2O-3; results taken from February’s blog)

Introduction

This document gives guidelines for performance tuning of H2O Deep Learning, both in terms of speed and accuracy. It is intended for existing users of H2O Deep Learning (and it’s easy to become one if you’re not), as it assumes some familiarity with the parameters and use cases.

Motivation

This effort was motivated in part by a Deep Learning project that benchmarked against H2O and claimed that H2O is slower; we tried to reproduce their results.
Summary of our findings (much more at the end of part 1):
* We are unable to reproduce their findings (our numbers are 9x to 52x faster, on a single node)
* The benchmark was only on raw training throughput and didn’t consider model accuracy (at all)
* We present improved parameter options for higher training speed (more below)
* We present a single-node model that trains faster than their 16-node system (and leads to higher accuracy than with their settings)
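
To give a taste of what the guide covers, below is a minimal sketch (in R) of the kinds of knobs that drive H2O Deep Learning training throughput. The frame train and response y are hypothetical placeholders, and the values shown are illustrative starting points rather than the guide's exact recommendations.

    library(h2o)
    h2o.init(nthreads = -1)               # use all available cores

    # Hypothetical training frame "train" with response column "y".
    model <- h2o.deeplearning(
      x = setdiff(names(train), "y"),
      y = "y",
      training_frame = train,
      hidden = c(64, 64),                 # smaller hidden layers train faster
      epochs = 1,                         # one pass is enough for a throughput test
      train_samples_per_iteration = -2,   # auto-tune samples per MapReduce iteration
      score_duty_cycle = 0.05)            # cap time spent on scoring at ~5%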

Read More

An Introduction to Data Science: Meetup Summary Guest Post by Zen Kishimoto

Originally posted on Tek-Tips forums by Zen here

I went to two meetups at H2O, which provides an open source predictive analytics platform. The second meetup was full of participants because its theme was an introduction to data science.

Data science is a new buzzword, and I feel like everyone claims to be a data scientist, or something related to it, these days. But outside of real data scientists, very few people truly understand what a data scientist is or does. Bits and pieces of information are available, but it takes a basic understanding of the subject to exploit such fragmented information. Once you reach a certain altitude, a series of blogs by Ricky Ho is very informative.

But first things first. There were three speakers at that meetup, but I’ll only elaborate on the first one, who described data science for laymen. The speaker was Dr. Erin LeDell, whose presentation title was Data Science for Non-Data Scientists.

Dr. Erin LeDell

In the following summary of her points, I stay at a bare-bones level so that a total amateur can grasp what data science is all about. Once you get it, I think you can reference other materials for more details. Her presentation and the others are available here. The presentation was also recorded on video, which is available here.

At the beginning, she introduced three Stanford professors who work closely with H2O:

The first two professors have published many books, but LeDell mentioned that two ebooks on very relevant subjects are available free of charge. You can download the books below:

Data science process

LeDell gave a high-level view of the data science process:

A simple explanation of each step is given here, in my words.

Problem formulation

  • A data scientist studies and researches the problem domain and identifies factors contributing to the analysis.

Collect & process data

  • Relevant data about the identified factors are collected and processed. Data processing includes cleansing the data, which means getting rid of corrupt and/or incorrect values and normalizing values. Several data scientists have said that 50-80% of their time is spent cleansing data.

Machine learning

  • The most appropriate machine learning algorithm is developed or selected from a pool of well-known algorithms.

Insights & action

  • The results are analyzed for appropriate action.

What background does a data scientist need?

This is a question asked by many non–data scientists. I have seen it many times, along with many answers. LeDell answered: mathematics and statistics, programming and databases, communication and visualization, and domain knowledge and soft skills.

Drew Conway’s answer is well known; in fact, the second speaker referred to it.

Data Scientist Skills

This diagram is available here.

Machine Learning

LeDell classified machine learning into three categories: regression, classification, and clustering.

These algorithms are well known and documented. In most cases, a data scientist uses an existing algorithm rather than developing one.

Deep and ensemble learning

LeDell introduced two more technologies: deep and ensemble learning.

Deep learning is described as:

“A branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, composed of multiple non-linear transformations.” (Wikipedia, 2015)

In ensemble learning, multiple learning algorithms are combined to obtain better predictive performance than any single one alone, at the cost of extra computation time. More details are here.
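
To make the idea concrete, here is a minimal sketch of stacking in current H2O-3 from R; the frame train, predictors x, and response y are hypothetical placeholders, and h2o.stackedEnsemble is just one way to build an ensemble.

    # Train two base learners on identical cross-validation folds, then
    # combine their predictions with a stacked ensemble (sketch only).
    gbm <- h2o.gbm(x = x, y = y, training_frame = train,
                   nfolds = 5, fold_assignment = "Modulo",
                   keep_cross_validation_predictions = TRUE)
    glm <- h2o.glm(x = x, y = y, training_frame = train,
                   nfolds = 5, fold_assignment = "Modulo",
                   keep_cross_validation_predictions = TRUE)
    ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train,
                                    base_models = list(gbm, glm))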

Finally, LeDell gave the following information for more details on the subject.

I skipped some of her discussion here, but I hope this is a good start to understanding what data science is, and that you will dig further into it.

Zen Kishimoto

About Zen Kishimoto

Seasoned research and technology executive with broad functional expertise, including roles as analyst, writer, CTO, and VP of Engineering, along with general management, sales, and marketing, across diverse high-tech and cleantech industry segments including software, mobile embedded systems, Web technologies, and networking. His current focus and expertise are in applying IT to energy: smart grid, green IT, building/data center energy efficiency, and cloud computing.

Lending Club : Predict Bad Loans to Minimize Loss to Defaulted Accounts

As a sales engineer on the H2O.ai team, I get asked a lot about the value add of H2O. How do you put a price tag on something that is open source? The answer typically revolves around the use case: if a use case pertains to improving user experience or building apps that improve internal operations, there is no straightforward way to put a monetary value on better experiences. However, if the use case is focused on detecting fraud or maintaining enough supply for the next sales quarter, we can calculate the total amount of money saved by detecting pricey fraudulent cases or by avoiding sales lost to incorrectly forecasted demand.

The H2O team has built a number of user-facing demonstrations, from our Ask Craig app to predicting flight delays, which are available in R, Sparkling Water, and Python. Today, we will use Lending Club’s open data to obtain the probability of a loan request defaulting or being charged off. We will build an H2O model and calculate the amount of money saved by rejecting these loan requests with the model (not including the opportunity cost), and then combine this with the profits lost by rejecting good loans to determine the net amount saved.

Summary of Approved Loan Applicants

The dataset has a total of half a million records spanning 2007 to 2015, which means that with H2O the data can be processed on your personal computer using an instance with at least 2GB of memory. The first step is to import the data and create a new column that categorizes each loan as either a good loan or a bad loan (the user has defaulted or the account has been charged off). The following is a code snippet for R:

    print("Start up H2O...")
    library(h2o)
    conn <- h2o.init(nthreads = -1)

    print("Import approved and rejected loan requests from Lending Tree...")
    path   <- "/Users/amy/Desktop/lending_club/loanStats/"
    loanStats <- h2o.importFile(path = path, destination_frame = "LoanStats")

    print("Create bad loan label, this will include charged off, defaulted, and late repayments on loans...")
    loanStats$bad_loan <- ifelse(loanStats$loan_status == "Charged Off" | 
                             loanStats$loan_status == "Default" | 
                            loanStats$loan_status == "Does not meet the credit policy.  Status:Charged Off", 
                            1, 0)
    loanStats$bad_loan <- as.factor(loanStats$bad_loan)

    print("Create the applicant's risk score, if the credit score is 0 make it an NA...")
    loanStats$risk_score <- ifelse(loanStats$last_fico_range_low == 0, NA,
                               (loanStats$last_fico_range_high + loanStats$last_fico_range_low)/2)

Credit Score Summaries

In H2O Flow, you can grab the distribution of credit scores for good loans vs. bad loans. It is easy to see that holders of bad loans typically have the lowest credit scores, which will be the biggest driving force in predicting whether a loan is good or not. However, we want a model that also takes other features into account, so that loans aren’t automatically cut off at a certain credit score threshold.

[Figure: credit score distributions for good vs. bad loans]
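
If you prefer R to Flow, a quick sketch of the same comparison (assuming the loanStats frame built above) is to split on the label and histogram the risk score for each group:

    # Compare risk-score distributions for good vs. bad loans (sketch).
    good_loans <- loanStats[loanStats$bad_loan == "0", ]
    bad_loans  <- loanStats[loanStats$bad_loan == "1", ]
    h2o.hist(good_loans$risk_score)
    h2o.hist(bad_loans$risk_score)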

Modeling

Pending review: will update the blog soon.
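
In the meantime, a minimal sketch of the kind of model described in the introduction might look like the following; the predictor list and the choice of GBM are illustrative assumptions, not necessarily what the finished post will use.

    # Sketch only: split the data, train a classifier on bad_loan, and pull
    # out the predicted probability of default / charge-off.
    splits <- h2o.splitFrame(loanStats, ratios = 0.75, seed = 1234)
    train  <- splits[[1]]
    valid  <- splits[[2]]

    predictors <- c("loan_amnt", "term", "annual_inc", "dti", "risk_score")
    model <- h2o.gbm(x = predictors, y = "bad_loan",
                     training_frame = train, validation_frame = valid)
    pred  <- h2o.predict(model, valid)   # column p1 = probability of a bad loan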

Introduction to Data Science using H2O – Chicago

Thank you to Chicago for the great meetup on 29 July 2015. Slides have been posted on GitHub. The links to the sample scripts and data are contained in the slides. If you have any further questions about H2O, please join our Google Group or chat with us on Gitter.

The slides are also available on the H2O Slideshare:

Also, thank you to Serendipity Labs for a great space and location!

Enjoy H2O and let us know about your data science / machine learning journey!

-Hank
@hankroark

useR! Aalborg 2015 conference

The H2O team spent most of the useR! Aalborg 2015 conference at the booth, giving demos and discussing H2O. Amy had a 16-node EC2 cluster running with 8 cores per node, for a total of 128 CPU cores. The demo consisted of loading large files in parallel and then running our distributed machine learning algos in parallel.
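
From the R console, driving a cluster like that looks roughly like the sketch below; the host name, file location, and column choices are hypothetical.

    library(h2o)
    # Connect to the already-running remote cluster rather than starting a local one.
    h2o.init(ip = "ec2-xx-xx-xx-xx.compute.amazonaws.com", port = 54321,
             startH2O = FALSE)

    # Both the import and the algorithm run in parallel across the cluster's cores.
    df  <- h2o.importFile("s3n://my-bucket/big-file.csv")
    fit <- h2o.gbm(x = 2:ncol(df), y = 1, training_frame = df)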

At an R conference, most people wanted to script H2O from R, which is of course built in (as is Python), but we also conveyed the benefits that our user interface, Flow, can provide in this space (even for programmers) by automating and accelerating common tasks. We enjoyed discussing future directions with attendees and bouncing ideas off of them. There is nothing like seeing people’s first reactions to the product, live and in person! As an open source platform, H2O thrives on suggestions and contributions from our community.

All components of H2O are developed in the open on GitHub.

Read More

KFold Cross Validation With H2O-3 and R

This post also explains the solution to a Google Stream question we received.

Note: KFold Cross Validation will be added to H2O-3 as an argument soon

This is a terse guide to building KFold cross-validated models with H2O using the R interface. There's not very much R code needed to get up and running, but it's by no means a one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.
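
As a taste before the full walkthrough, here is a minimal sketch of the manual approach, assuming a hypothetical frame df with response "y" and a vector of predictor names xcols:

    # Assign each row to one of k folds, then train and score once per fold.
    k    <- 5
    fold <- floor(h2o.runif(df, seed = 42) * k)   # fold ids 0 .. k-1

    aucs <- numeric(k)
    for (i in 0:(k - 1)) {
      train <- df[fold != i, ]
      valid <- df[fold == i, ]
      m <- h2o.gbm(x = xcols, y = "y",
                   training_frame = train, validation_frame = valid)
      aucs[i + 1] <- h2o.auc(m, valid = TRUE)
    }
    mean(aucs)   # cross-validated estimate of AUC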

Read More

‘Ask Craig’- Determining Craigslist Job Categories with Sparkling Water, Part 2

This is the second blog in a two-blog series. The first blog covered building these models with Sparkling Water.

The presentation on this application can be downloaded and viewed at Slideshare

In the last blog post we learned how to build a set of H2O and Spark models to predict categories for jobs posted on Craigslist using Sparkling Water.

This blog post will show how to use the models to build a Spark streaming application which scores posted job titles on the fly.

Read More

‘Ask Craig’- Determining Craigslist Job Categories with Sparkling Water

This is the first blog in a two-blog series. The second blog covers turning these models into a Spark streaming application.

The presentation on this application can be downloaded and viewed at Slideshare

One question we often get asked at Meetups or conferences is: “How are you guys different than other open-source machine-learning toolkits? Notably: Spark’s MLlib?” The answer to this question is not “black and white” but actually a shade of “gray”. The best way to showcase the power of Spark’s MLlib library and H2O.ai’s distributed algorithms is to build an app that utilizes both of their strengths in harmony, going end-to-end from data-munging and model building through deployment and scoring on real-time data using Spark Streaming. Enough chit-chat, let’s make it happen!

Read More