Apache Spark and H2O on AWS


This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

One of the advantages of public cloud is the ability to experiment and run various workloads without committing to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor’s offerings is a must. In collaboration with Denis Perevalov (Milliman), we’d like to share some details about one of our most recent, and largest, big-data projects: working with our client, Milliman, to build a machine-learning platform on Amazon Web Services.

Before we get into the details, let’s introduce Datapipe’s data and analytics consulting team. The goal of the team is to help customers with their data processing needs. Many of our engagements are data engineering efforts, where we help customers build data processing pipelines. Our data science consultants also help clients gain better insight into their existing datasets.

When we first started working with Milliman, one of their challenges was running compute-intensive machine learning algorithms in a reasonable amount of time on a growing number of datasets. To cope with this challenge, they had to pick a distributed processing framework and, after some investigation, narrowed their options down to two: H2O and Apache Spark. Both frameworks offer distributed computing across multiple nodes, and Milliman was left to decide whether to use their on-premise infrastructure or an alternative. In the end, Amazon Web Services emerged as the ideal choice considering their use case and requirements. To execute on this plan, Milliman engaged our data and analytics consultants to build a scalable and secure Spark and H2O machine learning platform using the AWS services Amazon EMR, S3, IAM, and CloudFormation.

Early in our engagement with Milliman, we identified the following as the high-level and important project goals:

  1. Security: Ability to limit access to confidential data using access control and security policies.
  2. Elasticity and cost: Suggest a cost-efficient yet elastic resource manager to launch Spark and H2O clusters.
  3. Cost visibility: Suggest a strategy to get visibility into AWS costs broken down by workload.
  4. Automation: Provide automated deployment of Apache Spark, H2O clusters, and AWS services.
  5. Interactivity: Provide the ability to interact with Spark and H2O using IPython/Jupyter Notebooks.
  6. Consolidated platform: A single platform to run both H2O and Spark.

Let’s dive into each of these priorities further, with a focus on Milliman’s requirements and how we mapped their requirements to AWS solutions, as well as the details around how we achieved the agreed upon project goals.


Security

Security is the most important part of any cloud application deployment, and it’s a topic we take very seriously when we get involved in client engagements. In this project, one of the security requirements was the ability to control and limit access to datasets hosted on Amazon S3 depending on the environment. For example, Milliman wanted to restrict the access of development and staging clusters to a subset of data and to ensure the production dataset is only available to the production clusters.

To achieve Milliman’s security isolation goal, we took advantage of the AWS Identity and Access Management (IAM) offering. Using IAM we created:

  1. EC2 roles: EC2 roles are assigned to EC2 instances during the EMR launch process. They allow us to limit which subset of the data each instance can or cannot access. In the case of Spark, this enabled us to limit Spark nodes in the development environment to a particular subset of data.
  2. IAM role policies: Each EC2 role requires an IAM policy to define what that role can or cannot do. We created appropriate IAM policies for the development and production environments and assigned each policy to its corresponding IAM role.

AmazonS3 Development Environment

Example IAM Policy:


{
  "Statement": [
    {
      "Resource": [
        ...
      ],
      "Action": ["s3:*"],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}
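To make the environment split concrete, here is a minimal Python sketch that generates one policy of this shape per environment. The bucket names (`example-dev-datasets`, `example-prod-datasets`) and the helper `build_s3_policy` are hypothetical and only illustrate the idea; the resources used in the actual project are not shown in this post.

```python
import json

# Hypothetical bucket names, one per environment (not from the original project).
BUCKETS = {
    "development": "example-dev-datasets",
    "production": "example-prod-datasets",
}


def build_s3_policy(environment: str) -> dict:
    """Build an IAM policy document that allows S3 access to one environment's bucket only."""
    bucket_arn = "arn:aws:s3:::" + BUCKETS[environment]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:*"],
                # The bucket ARN covers bucket-level calls such as listing;
                # the "/*" ARN covers the objects stored in the bucket.
                "Resource": [bucket_arn, bucket_arn + "/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_policy("development"), indent=2))
```

Attaching the development policy to the role used by development clusters means those nodes have no grant for the production bucket, which is the isolation described above.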


Elasticity And Cost

In general, there are two main approaches to deploying Spark or Hadoop platforms on AWS:

  1. Building your platform on top of EC2 using either open-source tools or 3rd party vendor offerings (e.g. downloading and deploying Cloudera CDH on EC2).
  2. Leveraging AWS’ existing managed services such as EMR.

For this project, our recommendation to the client was to use Amazon Elastic MapReduce (EMR), which is a managed Hadoop and Spark offering. This allows us to build Spark or Hadoop clusters in a matter of minutes simply by calling APIs. There are a number of reasons why we suggested Amazon EMR instead of building their platform using 3rd party vendor or open-source offerings:

  1. It is very easy and flexible to get started with EMR. Clients don’t have to fully understand every aspect of deploying distributed systems, such as configuration and node management. Instead, they can focus on a few application-specific aspects of the platform configuration, such as how much memory or CPU should be allocated to the application (Spark).
  2. Automated deployment is part of EMR’s offering. Clients don’t have to build an orchestrated node management framework to automate the deployment of various nodes.
  3. Security is already baked-in, and clients don’t have to create extra logic in their application to take advantage of the security features.
  4. EMR integrates with the AWS spot offering, which is one of the most important features when it comes to controlling the cost of a distributed system.
  5. The cluster can be customized to install and run other frameworks not natively supported by EMR, such as H2O.
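As an illustration of launching a cluster "by calling APIs" with spot capacity, the sketch below builds the arguments for boto3’s `run_job_flow` call. The instance types, node counts, bid price, and the use of the default EMR role names are assumptions made for the example, not the settings used for Milliman. Here only the task group runs on spot, so losing spot nodes does not take HDFS storage with it.

```python
def build_cluster_request(name: str, task_nodes: int, spot_bid_usd: str) -> dict:
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m3.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 2,  # assumed size
                },
                {
                    # Task nodes add compute only, so they are the ones bid on the spot market.
                    "Name": "task-spot",
                    "InstanceRole": "TASK",
                    "Market": "SPOT",
                    "BidPrice": spot_bid_usd,
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": task_nodes,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default EMR role names; a per-environment EC2 role would go in JobFlowRole.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request: dict) -> str:
    """Submit the request to EMR and return the new cluster id (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3 installed

    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch(build_cluster_request("Spark", task_nodes=4, spot_bid_usd="0.10"))` would start a real, billable cluster, so the builder is kept separate to allow inspecting the request first.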


Automation

Automation is an important aspect of any cloud deployment. While it’s easy to build and deploy various applications on top of cloud resources manually or using inconsistent processes, automating the deployment process with consistency is the cornerstone of having a scalable cloud operation. That is especially true for distributed systems where reproducibility of the deployment process is the only way an organization can run and maintain a data processing platform.

As we mentioned earlier, other AWS services such as IAM and S3 were leveraged in this project in addition to Amazon EMR. Milliman also required multiple AWS environments, such as development, staging, and production, each with different security requirements. In order to automate the deployment of the various AWS services while keeping the environments separate, we leveraged Amazon’s CloudFormation scripts.
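One way to keep environments consistent is to generate each environment’s CloudFormation template from the same code. The following is a rough sketch under that assumption: the resource names, the bucket naming scheme `example-<environment>-datasets`, and the function `build_template` are all hypothetical, and the real templates for this project are not shown here.

```python
import json

ENVIRONMENTS = ("development", "staging", "production")


def build_template(environment: str) -> dict:
    """Build a CloudFormation template with an EC2 role scoped to one environment's bucket."""
    if environment not in ENVIRONMENTS:
        raise ValueError("unknown environment: " + environment)
    bucket_arn = "arn:aws:s3:::example-" + environment + "-datasets"  # hypothetical bucket
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "EMR instance role for the " + environment + " environment",
        "Resources": {
            "EmrInstanceRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    # Lets EC2 instances (the EMR nodes) assume this role.
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Effect": "Allow",
                                "Principal": {"Service": ["ec2.amazonaws.com"]},
                                "Action": ["sts:AssumeRole"],
                            }
                        ],
                    },
                    "Policies": [
                        {
                            "PolicyName": environment + "-s3-access",
                            "PolicyDocument": {
                                "Version": "2012-10-17",
                                "Statement": [
                                    {
                                        "Effect": "Allow",
                                        "Action": ["s3:*"],
                                        "Resource": [bucket_arn, bucket_arn + "/*"],
                                    }
                                ],
                            },
                        }
                    ],
                },
            },
            "EmrInstanceProfile": {
                "Type": "AWS::IAM::InstanceProfile",
                "Properties": {"Roles": [{"Ref": "EmrInstanceRole"}]},
            },
        },
    }


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        with open(env + "-roles.json", "w") as handle:
            json.dump(build_template(env), handle, indent=2)
```

Because all three templates come from one function, a change to the role definition is applied to every environment in the same way, while the bucket each role can reach stays different.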

Cost Visibility

One common requirement that is regularly overlooked is the ability to have fine-grained visibility into your cloud costs. Most organizations delay this requirement until they’re further along the path to cloud adoption, which can be a costly mistake. We were glad to hear that tracking AWS costs down to specific workloads was, in fact, part of Milliman’s project requirements.

Fortunately, AWS provides an offering called Cost Allocation Tags that can be tailored to meet a client’s cost visibility requirements. With Cost Allocation Tags, clients can tag their AWS resources with specific keywords that AWS can use to generate a cost report aggregated by the customer’s tags. Specifically for this project, we instructed Milliman to use the tagging feature of EMR to tag each cluster with workload-specific keywords that can later be recognized in AWS’ billing report. For example, the following command line demonstrates how to tag an EMR cluster with workload-specific tags:

aws emr create-cluster --name Spark --release-label emr-4.0.0 --tags Name="Spark_Recommendation_Model_Generator" --applications Name=Hadoop Name=Spark
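When clusters are launched from code rather than from the command line, the same tags need to be expressed in the form the API expects. The sketch below converts a plain dictionary of tags into both forms; the helper names are hypothetical, and the extra `Environment` tag in the usage example is only an illustration.

```python
def to_emr_tags(tags: dict) -> list:
    """Convert {"Name": "..."} into the [{"Key": ..., "Value": ...}] list used for Tags in boto3's run_job_flow."""
    return [{"Key": key, "Value": value} for key, value in sorted(tags.items())]


def to_cli_tags(tags: dict) -> str:
    """Render the same tags as Key="Value" arguments for `aws emr create-cluster --tags`."""
    return " ".join('{}="{}"'.format(key, value) for key, value in sorted(tags.items()))


if __name__ == "__main__":
    workload_tags = {
        "Name": "Spark_Recommendation_Model_Generator",
        "Environment": "development",  # hypothetical extra tag
    }
    print(to_emr_tags(workload_tags))
    print("--tags " + to_cli_tags(workload_tags))
```

Keeping the tags in one dictionary per workload makes it harder for a cluster to be launched without them. Note that user-defined tags have to be activated as cost allocation tags in the billing settings before they show up in cost reports.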




Interactivity

While building automated Spark and H2O clusters using Amazon EMR and CloudFormation is a great start to building a data processing platform, at times developers need an interactive way of working with the platform. In other words, using the command line to work with the data processing platform may work in some use cases, but in other cases UI access is critical for developers to do their job by engaging with the cluster in an interactive fashion.

As part of our data and analytics offering at Datapipe, we help customers pick the right 3rd party tools or vendors for a given requirement. For Milliman, to build an interactive UI that engages with EMR Spark and H2O clusters, we leveraged IPython/Jupyter notebooks. Jupyter provides integration with Hadoop, Spark, and other platforms and allows developers to type their code in a UI and submit it to the cluster for execution in real time. In order to deploy Jupyter notebooks, we leveraged EMR’s bootstrap feature, which allows customers to install custom software on EMR EC2 nodes.

Consolidated platform

Lastly, we needed to build a single platform to host both the H2O and Spark frameworks. While H2O is not a supported platform on EMR, by using Amazon EMR’s bootstrap action feature we were able to install H2O on EMR nodes and avoid creating a separate platform to host H2O. In other words, Milliman now has the ability to launch both Spark and H2O clusters using a single platform.
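Bootstrap actions are declared as part of the cluster launch request. As a sketch of what that could look like with boto3’s `run_job_flow`, the snippet below builds the `BootstrapActions` list for two install scripts. The S3 paths and script names are hypothetical placeholders; the actual scripts used to install Jupyter and H2O are not part of this post.

```python
def bootstrap_action(name: str, script_s3_path: str, args=None) -> dict:
    """Describe one bootstrap action in the shape expected by run_job_flow's BootstrapActions."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,      # shell script stored in S3, run on each node at startup
            "Args": list(args or []),    # optional arguments passed to the script
        },
    }


# Hypothetical script locations, for illustration only.
BOOTSTRAP_ACTIONS = [
    bootstrap_action("Install Jupyter", "s3://example-bootstrap/install_jupyter.sh"),
    bootstrap_action("Install H2O", "s3://example-bootstrap/install_h2o.sh"),
]
```

Passing `BootstrapActions=BOOTSTRAP_ACTIONS` alongside the other launch arguments would run both scripts on the nodes before the applications start, which is how a single cluster can end up with Spark, Jupyter, and H2O together.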


Data processing platforms require various considerations including, but not limited to, security, scalability, cost, interactivity, and automation. Fortunately, by defining a clear set of project objectives and goals, and mapping those objectives to the applicable solutions (in this case AWS offerings), companies can meet their data processing requirements efficiently and effectively. We hope this post has shown an example of how this process can be carried out using AWS offerings. If you have any questions or would like to talk to one of our consultants, please contact us.

In the next set of blog posts we’ll provide some insight into how we use data science and a data-driven approach to gain insight into operational metrics.

H2O World from an Attendee’s Perspective

Data Science is like Rome, and all roads lead to Rome. H2O WORLD is the crossroad, pulling in a confluence of math, statistics, science and computer science and incorporating all avenues of business. From the academic, research oriented models to the business and computer science analytics implementations of those ideas, H2O WORLD informs attendees on H2O’s ability to help users and customers explore their data and produce a prediction or answer a question.

I came to H2O World hoping to gain a better understanding of H2O’s software and of Data Science in general. I thoroughly enjoyed attending the sessions, following along with the demos and playing with H2O myself. Learning from the hackers and Data Scientists about the algorithms and science behind H2O and seeing the community spirit at the Hackathons was enlightening. Listening to the keynote speakers, both women, describe our data-influenced future and hearing the customer’s point of view on how H2O has impacted their work has been inspirational. I especially appreciated learning about the potential influence on scientific and medical research and social issues and H2O’s ability to influence positive change.

Curiosity led me to delve into the world of Data Science and, as a person with a background in science and math, I wasn’t sure how it applied to me. Now I realize that there is virtually no discipline which cannot benefit from the methods of Data Science and that there is great power in asking the right questions and telling a good story. H2O WORLD broadened my horizons and gave me a new perspective on the role of Data Science in the world. Data science can be harnessed as a force for social good where a few people from around the globe can change the world. H2O World 2015 was a great success and I truly enjoyed learning and being there.

A Newbie’s Guide to H2O in Python – Guest Post

This blog was originally posted here

I created this guide to help fellow newbies get their feet wet with H2O, an open-source predictive analytics platform that is fast, powerful, and easy to use. Using a combination of extraordinary math and high-performance parallel processing, H2O allows you to quickly create models for big data. The steps below show you how to download and start analyzing data at high speeds with H2O. After that it’s up to you.

What You’ll Learn

  • How to download H2O (just updated to OS X El Capitan? Then Java too)
  • How to use H2O with IPython Notebook & where to get demo scripts
  • How to teach a computer to recognize handwritten digits with H2O
  • Where to find documentation and community resources

A Delicious Drink of Water — Downloading H2O

(If you don’t feel like reading the long version below just go here)

I recommend downloading the latest release of H2O (which is ‘Bleeding Edge’ as of this moment) because it has the most Python features, but you can also see the other releases here, as well as the software requirements. Okay, let’s get started:

Do you have Java on your computer? Not sure? Here’s how to check:

  • Open your terminal and type in ‘java -version’:

MacBook-Pro:~ username$ java -version


If you don’t have Java you can either click through the pop up dialogue box and make your way to the correct downloadable version, or you can go directly to the Java downloads page here (two-for-one tip: download the Java Development Kit and get the Java Runtime Environment with it).
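If you would rather check from Python, here is a small sketch that does the same thing as the terminal command. It is written for Python 3 (the `capture_output` argument needs 3.7 or later), and the function name `java_version` is just one I made up for this example.

```python
import shutil
import subprocess


def java_version():
    """Return what `java -version` reports, or None if no `java` command is on the PATH."""
    if shutil.which("java") is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    report = java_version()
    print(report if report else "Java not found, time to download it.")
```

On a Mac without Java, the `java` command can still exist as a placeholder that triggers the pop up mentioned above, so a non-empty result is worth reading rather than just trusting.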

Now that you have Java (fingers crossed), you can download H2O (I’m assuming you have Python, but if you don’t, consider downloading Anaconda which gives you access to amazing Python packages for data analysis and scientific computing).

You can find the official instructions to Download H2O’s ‘Bleeding Edge’ release here (click on the ‘Install in Python’ tab), or follow below:

  1. Prerequisite: Python 2.7
  2. Type the following in your terminal:

Fellow newbies: don’t type in the ‘MacBook-Pro:~ username$’ part, only type in what’s listed after the ‘$’ (you can get more command line help here).

MacBook-Pro:~ username$ pip install requests
MacBook-Pro:~ username$ pip install tabulate
MacBook-Pro:~ username$ pip install scikit-learn

MacBook-Pro:~ username$ pip uninstall h2o
MacBook-Pro:~ username$ pip install http://h2o-release.s3.amazonaws.com/h2o/master/3250/Python/h2o-

As shown above, if you installed an earlier version of H2O, uninstalling and reinstalling H2O with pip will do the trick.
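To confirm the install worked, you can try starting H2O from Python. This is a minimal sketch written for Python 3; `h2o.init()` is the standard call for starting or connecting to a local H2O instance, while the helper names `h2o_installed` and `start_h2o` are my own.

```python
import importlib.util


def h2o_installed() -> bool:
    """Check whether the h2o package can be found, without importing it."""
    return importlib.util.find_spec("h2o") is not None


def start_h2o():
    """Start (or connect to) a local H2O instance; this needs Java to be installed."""
    import h2o  # imported here so the check above works even when h2o is missing

    h2o.init()


if __name__ == "__main__":
    if h2o_installed():
        start_h2o()
    else:
        print("h2o is not installed yet, go back to the pip steps above.")
```

If everything is in place, `h2o.init()` prints a summary of the running H2O instance.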

Let’s Get Interactive — IPython Notebook

If you don’t already have IPython Notebook, you can download it following these instructions. If you downloaded Anaconda, it comes with IPython Notebook so you’re set. And here’s a video tutorial on how to use IPython Notebook.

If everything goes as planned, to open IPython Notebook you ‘cd’ to your directory of choice (I chose my Desktop folder) and enter ‘ipython notebook’. (If you’re still new to the command line, learn more about using ‘cd’, which I like to use as a verb, here and here).

MacBook-Pro:~ username$ cd Desktop
MacBook-Pro:Desktop username$ ipython notebook

Random Note: After I updated to OS X El Capitan the command above didn’t work. For many people using ‘conda update conda’ and then ‘conda update ipython’ will solve the issue, but in my case I got an SSL error that wouldn’t let me ‘conda update’ anything. I found the solution here, using:

MacBook-Pro:~ username$ conda config --set ssl_verify False
MacBook-Pro:~ username$ conda update requests openssl
MacBook-Pro:~ username$ conda config --set ssl_verify True

Now that you have IPython Notebook, you can play around with some of H2O’s demo notebooks. If you’re new to Github, however, downloading the demos to your desktop can seem daunting, but don’t worry it’s easy. Here’s the trick:

  1. Navigate to H2O’s Python Demo Repository
  2. Click on your ‘.ipynb’ demo of choice (let’s do citi_bike_small.ipynb)
  3. Click on ‘Raw’ in the upper right corner, then after the next web page opens, go to ‘File’ on the menu bar and select ‘Save Page As’ (or similar)
  4. Open your terminal, cd to the Downloads folder, or wherever you saved the IPython Notebook, then type ‘ipython notebook citi_bike_small.ipynb’
  5. Now you can go through the demo running each cell individually (click on the cell and press shift + enter)

Classifying Handwritten Digits — Enter a Kaggle Competition

A great way to get a feel for H2O is to test it out on a Kaggle data science competition. Don’t know what Kaggle is? Never entered a Kaggle Competition? That’s totally fine, I’ll give you a script to get your feet wet. If you’re still nervous, here’s a great article about how to get started with Kaggle given your previous experience.

Are you excited? Get excited! You are going to teach your computer to recognize HANDWRITTEN DIGITS! (I feel like if you’re still reading at this point, it’s time to let my enthusiasm shine through).

  1. Take a look at Kaggle’s Digit Recognizer Competition
  2. Look at a demo notebook to get started
  3. Download the notebook by clicking on ‘Raw’ and then saving it
  4. Open up and run the notebook to generate a submission csv file
  5. Submit the file for your first submission to Kaggle, then play around with your model parameters and see if you can improve your Kaggle submission score
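If you want a rough idea of what such a notebook does before opening it, here is a sketch of the overall flow. It is not the demo notebook itself: it is written for Python 3 against the current H2O estimator API, the choice of a random forest with 50 trees is my own assumption, and it expects Kaggle’s `train.csv` (with a `label` column) and `test.csv` in the working directory.

```python
import csv


def train_and_predict(train_csv: str, test_csv: str) -> list:
    """Train a random forest on the digit images and return predicted labels for the test set."""
    import h2o
    from h2o.estimators.random_forest import H2ORandomForestEstimator

    h2o.init()
    train = h2o.import_file(train_csv)
    test = h2o.import_file(test_csv)

    # Treat the digit as a category (classification) instead of a number (regression).
    train["label"] = train["label"].asfactor()
    predictors = [column for column in train.columns if column != "label"]

    model = H2ORandomForestEstimator(ntrees=50, seed=1)  # assumed settings, tune these
    model.train(x=predictors, y="label", training_frame=train)

    predictions = model.predict(test)["predict"]
    rows = predictions.as_data_frame(use_pandas=False)  # list of rows, header first
    return [int(row[0].strip('"')) for row in rows[1:]]


def write_submission(labels, path: str) -> None:
    """Write predictions in the ImageId,Label format the Digit Recognizer competition expects."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ImageId", "Label"])
        for image_id, label in enumerate(labels, start=1):
            writer.writerow([image_id, label])


if __name__ == "__main__":
    write_submission(train_and_predict("train.csv", "test.csv"), "submission.csv")
```

Changing the model settings in `train_and_predict` and resubmitting is the "play around with your model parameters" part of step 5.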

Getting Help — Resources & Documentation