H2O + TensorFlow on AWS GPU

TensorFlow on AWS GPU instance
In this tutorial, we show how to set up TensorFlow on an AWS GPU instance and run the H2O TensorFlow deep learning demo.

Prerequisites:
To get started, request an AWS EC2 instance with GPU support. We used a single g2.2xlarge instance running Ubuntu 14.04. To set up TensorFlow with GPU support, the following software should be installed:

  1. Java 1.8
  2. Python pip
  3. Unzip utility
  4. CUDA Toolkit (>= v7.0)
  5. cuDNN (v4.0)
  6. Bazel (>= v0.2)
  7. TensorFlow (v0.9)

To run the H2O TensorFlow deep learning demo, the following software should be installed:

  1. IPython notebook
  2. Scala
  3. Spark
  4. Sparkling water

Software Installation:
Java:

    # To install Java, follow the steps below (type 'Y' at the installation prompt):
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer

    # Update JAVA_HOME in ~/.bashrc and add JAVA_HOME to PATH:
    export PATH=$PATH:$JAVA_HOME/bin

    # Execute the following command to update the current session:
    source ~/.bashrc

    # Verify the version and path:
    java -version
    echo $JAVA_HOME

Python:

    # AWS EC2 instances have Python installed by default. Verify that Python 2.7 is installed:
    python -V

    # Install pip:
    sudo apt-get install python-pip

    # Install IPython notebook:
    sudo pip install "ipython[notebook]"

    # To run the H2O example notebooks, execute the following commands:
    sudo pip...

H2O GBM Tuning Tutorial for R

In this tutorial, we show how to build a well-tuned H2O GBM model for a supervised classification task. We deliberately don’t focus on feature engineering, and we use a small dataset to allow you to reproduce these results in a few minutes on a laptop. The same script can be applied directly to datasets hundreds of gigabytes in size and to H2O clusters with dozens of compute nodes.

This tutorial is written in R Markdown. You can download the source from H2O’s GitHub repository.

A port to a Python Jupyter Notebook is available as well.

Installation of the H2O R Package

Either download H2O from H2O.ai’s website or install the latest version of H2O into R with the following R code:

    # The following two commands remove any previously installed H2O packages for R.
    if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
    if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

    # Next, we download packages that H2O depends on.
    pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
    for (pkg in pkgs) {
      if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
    }

    # Now we download, install and initialize the H2O package for R.
    install.packages("h2o", type="source",
                     repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/8/R")))

Launch an H2O…
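
A minimal sketch of this launch step, assuming the installation above succeeded: load the package and start (or connect to) a local H2O instance.

    # Start or connect to a local H2O instance from R.
    library(h2o)
    h2o.init(nthreads = -1)  # -1 requests all available CPU cores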

Hyperparameter Optimization in H2O: Grid Search, Random Search and the Future

“Good, better, best. Never let it rest. ‘Til your good is better and your better is best.” – St. Jerome

tl;dr

H2O now has random hyperparameter search with time- and metric-based early stopping. Bergstra and Bengio[1] write on p. 281:

Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.

Even smarter means of searching the hyperparameter space are in the pipeline, but for most use cases random search performs just as well.

What Are Hyperparameters?

Nearly all model algorithms used in machine learning have a set of tuning “knobs” which affect how the learning algorithm fits the model to the data. Examples are the regularization settings alpha and lambda for Generalized Linear Modeling or ntrees and max_depth for Gradient Boosted Models. These knobs are called hyperparameters to distinguish them from internal model parameters, such as GLM’s beta coefficients or Deep Learning’s weights, which get learned from the data during the model training process.
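
To make this concrete, the sketch below shows how such knobs are passed to H2O's random grid search from R; the toy frame, column names and hyperparameter values are invented for illustration, not taken from this post.

    # A sketch of random hyperparameter search over GBM "knobs" in H2O's R API.
    library(h2o)
    h2o.init(nthreads = -1)

    # Toy binomial data, purely for illustration.
    df <- as.h2o(data.frame(x1 = runif(1000),
                            x2 = runif(1000),
                            y  = as.factor(sample(c("yes", "no"), 1000, replace = TRUE))))

    hyper_params <- list(max_depth  = c(4, 6, 8),        # GBM tuning knobs
                         learn_rate = c(0.05, 0.1))
    search_criteria <- list(strategy         = "RandomDiscrete",  # random, not exhaustive
                            max_runtime_secs = 60,                # time-based stopping
                            max_models       = 10,
                            seed             = 1234)

    grid <- h2o.grid("gbm", x = c("x1", "x2"), y = "y", training_frame = df,
                     hyper_params = hyper_params, search_criteria = search_criteria)
    h2o.getGrid(grid@grid_id, sort_by = "auc", decreasing = TRUE)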

What Is Hyperparameter Optimization?

The set…

Spam Detection with Sparkling Water and Spark Machine Learning Pipelines

This short post presents the “ham or spam” demo, posted earlier by Michal Malohlava, using our new API in the latest Sparkling Water for Spark 1.6 and earlier versions, which unifies Spark and H2O machine learning pipelines. It shows how to create a simple Spark machine learning pipeline and a model based on the fitted pipeline, which can later be used to predict whether a particular message is spam or not.

Before diving into the demo steps, we would like to provide some details about the new features in the upcoming Sparkling Water 2.0:

  • Support for Apache Spark 2.0 and backwards compatibility with all previous versions.
  • The ability to run Apache Spark and Scala through H2O’s Flow UI.
  • H2O feature improvements and visualizations for MLlib algorithms, including the ability to score feature importance.
  • Visual intelligence for Apache Spark.
  • The ability to build Ensembles using H2O plus MLlib algorithms.
  • The power to export MLlib models as POJOs (Plain Old Java Objects), which can be easily run on commodity hardware.
  • A toolchain for ML pipelines.
  • Debugging support for Spark pipelines.
  • Model and data governance through Steam.
  • Bringing H2O’s powerful data munging capabilities to Apache Spark.
In order to run the code…

    Interview with Carolyn Phillips, Sr. Data Scientist, Neurensic

    During Open Tour Chicago we conducted a series of interviews with data scientists attending the conference. This is the second of a multipart series recapping our conversations.

    Be sure to keep an eye out for updates by checking our website or following us on Twitter @h2oai.

    H2O.ai: How did you become a data scientist?

    Phillips: Until very close to two months ago I was a computational scientist working at Argonne National Laboratory.

    H2O.ai: Okay.

    Phillips: I was working in material science, physics, mathematics, etc., but I was getting bored with that and looking for new opportunities, and I got hired by a startup in Chicago.

    H2O.ai: Yes.

    Phillips: When they hired me they said, “We’re hiring you, and your title is Senior Quantitative Analyst,” but the very first day I showed up, they said, “Scratch that. We’ve changed your title. Your title is now Senior Data Scientist.” And I said, “Yes, all right.” It has senior in it, so I’m okay going with that.

    H2O.ai: Nice. I like it.

    Phillips:…

    Interview with Svetlana Kharlamova, ­Sr. Data Scientist, Grainger

    During Open Tour Chicago we conducted a series of interviews with data scientists attending the conference. This is the first of a multipart series recapping our conversations.

    Be sure to keep an eye out for updates by checking our website or following us on Twitter @h2oai.

    Svetlana Kharlamova

    H2O.ai: How did you become a data scientist?

    Kharlamova: I’m a physicist.

    H2O.ai: Okay.

    Kharlamova: I came here from the academia of physics. I worked for seven years in academia for physics and math, and four years ago I switched to finance to be more of a math person than a physics person.

    H2O.ai: I see.

    Kharlamova: And from finance I came to the data industry. At that time data science was booming.

    H2O.ai: Oh, okay.

    Kharlamova: And I got excited about all the new stuff and technologies coming up, and here I am.

    H2O.ai: Okay, nice. So what business do you work for now?

    Kharlamova: I work for Grainger. We’re focused on equipment distribution, serving as a connector between manufacturing plants,…

    H2O Day at Capital One

    Here at H2O.ai one of our most important partners is Capital One, and we’re proud to have been working with them for over a year. One of the world’s leading financial services providers, Capital One has a strong reputation for being an extremely data and technology-focused organization. That’s why when the Capital One team invited us to their offices in McLean, Virginia for a full day of H2O talks and demos, we were delighted to accept. Many key members of Capital One’s technology team were among the 500+ attendees at the event, including Jeff Chapman, MVP of Shared Technology, Hiren Hiranandani, Lead Software Engineer, Mike Fulkerson, VP of Software Engineering and Adam Wenchel, VP of Data Engineering.

    A major theme throughout the day was “vertical is the new horizontal,” an idea presented by our CEO, Sri Ambati, about how every company is becoming a technology company. Sri pointed out that software is becoming increasingly ubiquitous at organizations at the same time that code is becoming a commodity. Today, the only assets that companies can defend are their community and brand. Airbnb is more valuable than most hospitality companies, despite owning no property,…

    Red herring bites

    At the Bay Area R User Group in February I presented progress in big-join in H2O which is based on the algorithm in R’s data.table package. The presentation had two goals: i) describe one test in great detail so everyone understands what is being tested so they can judge if it is relevant to them or not; and ii) show how it scales with data size and number of nodes.
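
    For readers unfamiliar with the operation being timed, here is a minimal sketch of a keyed high-cardinality join in R's data.table syntax, scaled down to a size that runs on a laptop (the benchmark itself used tables of up to 10 billion rows):

        # A scaled-down sketch of the high-cardinality join being benchmarked.
        library(data.table)
        set.seed(1)
        N <- 1e6                                  # the talk used up to 1e10 rows
        X <- data.table(id = sample(N), x = runif(N), key = "id")
        Y <- data.table(id = sample(N), y = runif(N), key = "id")
        ans <- X[Y]   # keyed join; every id is unique, i.e. high cardinality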

    These were the final two slides:

    [Slide 14 and Slide 15: join benchmark scaling results by data size and number of nodes]

    I left a blank for 1e10 (10 billion high cardinality rows joined with 10 billion high cardinality rows returning 10 billion high cardinality rows) because it didn’t work at that time. Although each node has 256GB RAM (and 32 cores), the 10 billion row test involves joining two 10 billion row tables (each 200GB) and returning a third table (also ~10 billion rows) of 300GB, total 700GB. I was giving 200GB to each of…

    Fast csv writing for R

    R has traditionally been very slow at reading and writing csv files of, say, 1 million rows or more. Getting data into R is often the first task a user needs to do, and if they have a poor experience (either hard to use, or very slow) they are less likely to progress. The data.table package in R solved csv import convenience and speed in 2013 by implementing data.table::fread() in C. The examples at that time went from 30 seconds down to 3 seconds to read 50MB, and from over 1 hour down to 8 minutes to read 20GB. In 2015 Hadley Wickham implemented readr::read_csv() in C++.
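
    A minimal sketch of the import call under discussion (the file name is hypothetical):

        # Fast csv reading with data.table; separator and column types are auto-detected.
        library(data.table)
        DT <- fread("myfile.csv")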

    But what about writing csv files?

    It is often the case that the end goal of work in R is a report, a visualization or a summary of some form. Not anything large. So we don’t really need to export large data, right? It turns out that the combination of speed, expressiveness and advanced abilities of R (particularly data.table) means that it can be faster (so I’m told) to export the data from other environments (e.g. a database), manipulate it in R, and export it back to the database, than it is keeping the data…
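
    The writing counterpart in data.table is fwrite(); a minimal sketch, with a toy table standing in for real data:

        # Fast csv writing with data.table's fwrite().
        library(data.table)
        DT <- data.table(a = 1:5, b = letters[1:5])  # toy table for illustration
        fwrite(DT, "out.csv")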

    Apache Spark and H2O on AWS

    This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

    One of the advantages of public cloud is the ability to experiment and run various workloads without the need to commit to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor offerings is a must. In collaboration with Denis Perevalov (Milliman), we’d like to share some details about one of the most recent and largest big-data projects we’ve worked on: a project with our client, Milliman, to build a machine-learning platform on Amazon Web Services.

    Before we get into the details, let’s introduce Datapipe’s data and analytics consulting team. The goal of our data and analytics team is to help customers with their data processing needs. Our engagements fall into data engineering efforts, where we help customers build data processing pipelines. In addition to that, our team helps clients get a better insight into their existing datasets by engaging with our data science consultants.

    When we first started working…