Apache Spark and H2O on AWS


This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

One of the advantages of the public cloud is the ability to experiment and run various workloads without having to commit to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor's offerings is a must. In collaboration with Denis Perevalov (Milliman), we'd like to share some details around one of our most recent, and largest, big-data projects: building a machine-learning platform on Amazon Web Services for our client, Milliman.

Before we get into the details, let's introduce Datapipe's data and analytics consulting team, whose goal is to help customers with their data processing needs. Our engagements range from data engineering efforts, where we help customers build data processing pipelines, to data science work, where our consultants help clients gain better insight into their existing datasets.

When we first started working with Milliman, one of their challenges was running compute-intensive machine learning algorithms in a reasonable amount of time on a growing set of datasets. To cope with this challenge, they had to pick a distributed processing framework and, after some investigation, narrowed their options down to two: H2O and Apache Spark. Both frameworks offer distributed computing across multiple nodes, and Milliman was left to decide whether to use their on-premises infrastructure or an alternative. In the end, Amazon Web Services emerged as an ideal choice given their use case and requirements. To execute on this plan, Milliman engaged with our data and analytics consultants to build a scalable and secure Spark and H2O machine learning platform using AWS solutions: Amazon EMR, S3, IAM, and CloudFormation.

Early in our engagement with Milliman, we identified the following as the high-level and important project goals:

  1. Security: Ability to limit access to confidential data using access control and security policies
  2. Elasticity and cost: Suggest a cost-efficient yet elastic resource manager to launch Spark and H2O clusters
  3. Cost visibility: Suggest a strategy to get visibility into AWS costs broken down by workload
  4. Automation: Provide automated deployment of Apache Spark and H2O clusters and the supporting AWS services
  5. Interactivity: Provide the ability to interact with Spark and H2O using IPython/Jupyter Notebooks
  6. Consolidated platform: A single platform to run both H2O and Spark

Let’s dive into each of these priorities further, with a focus on Milliman’s requirements and how we mapped their requirements to AWS solutions, as well as the details around how we achieved the agreed upon project goals.


Security

Security is the most important part of any cloud application deployment, and it’s a topic we take very seriously when we get involved in client engagements. In this project, one of the security requirements was the ability to control and limit access to datasets hosted on Amazon S3 depending on the environment. For example, Milliman wanted to restrict the access of development and staging clusters to a subset of data and to ensure the production dataset is only available to the production clusters.

To achieve Milliman's security isolation goal, we took advantage of the AWS Identity and Access Management (IAM) offering. Using IAM we created:

  1. EC2 roles: EC2 roles are assigned to EC2 instances during the EMR launch process and allow us to limit which subset of the data each instance can access. In the case of Spark, this enabled us to limit Spark nodes in the development environment to a particular subset of the data.
  2. IAM role policies: Each EC2 role requires an IAM policy to define what that role can or cannot do. We created appropriate IAM policies for the development and production environments and assigned each policy to its corresponding IAM role.

[Figure: Amazon S3 development environment]

Example IAM Policy:


{
  "Statement": [
    {
      "Resource": [
        "arn:aws:s3:::parvizexamples/dev/data/*",
        "arn:aws:s3:::parvizexamples/dev/data"
      ],
      "Action": ["s3:*"],
      "Effect": "Allow"
    }
  ],
  "Version": "2012-10-17"
}


Elasticity And Cost

In general, there are two main approaches to deploying Spark or Hadoop platforms on AWS:

  1. Building your own platform using open-source tools or 3rd-party vendor offerings on top of EC2 (e.g., downloading and deploying Cloudera CDH on EC2)
  2. Leveraging AWS' existing managed services such as EMR

For this project, our recommendation to the client was to use Amazon’s Elastic MapReduce, which is a managed Hadoop and Spark offering. This allows us to build Spark or Hadoop clusters in a matter of minutes simply by calling APIs. There are a number of reasons why we suggested Amazon Elastic MapReduce instead of building their platform using 3rd party vendor/open-source offerings:

  1. It is very easy and flexible to get started with EMR. Clients don't have to fully understand every aspect of deploying distributed systems, such as configuration and node management. Instead, they can focus on a few application-specific aspects of the platform configuration, such as how much memory or CPU should be allocated to the application (Spark).
  2. Automated deployment is part of EMR’s offering. Clients don’t have to build an orchestrated node management framework to automate the deployment of various nodes.
  3. Security is already baked-in, and clients don’t have to create extra logic in their application to take advantage of the security features.
  4. Integration with AWS' spot instance offering, which is one of the most important features when it comes to controlling the cost of a distributed system (a sketch follows this list).
  5. The ability to customize the cluster to install and run frameworks not natively supported by EMR, such as H2O.
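
To make the spot integration in point 4 concrete, here is a minimal sketch of launching an EMR Spark cluster with spot-priced core nodes through boto3. The cluster name, instance types, counts, bid price, and role names are illustrative assumptions, not values from the Milliman deployment.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster: an on-demand master node plus core nodes bid on the spot market.
response = emr.run_job_flow(
    Name="spark-dev",                           # illustrative cluster name
    ReleaseLabel="emr-4.0.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",          # the EC2 role discussed in the Security section
    ServiceRole="EMR_DefaultRole",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m3.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m3.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
)
print(response["JobFlowId"])

Keeping the master node on demand while bidding for the worker capacity is a common way to balance cost savings against the risk of spot interruptions.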


Automation

Automation is an important aspect of any cloud deployment. While it’s easy to build and deploy various applications on top of cloud resources manually or using inconsistent processes, automating the deployment process with consistency is the cornerstone of having a scalable cloud operation. That is especially true for distributed systems where reproducibility of the deployment process is the only way an organization can run and maintain a data processing platform.

As we mentioned earlier, in addition to Amazon EMR, other AWS services such as IAM and S3 were leveraged in this project. Milliman also required multiple AWS environments, such as staging, development, and production, each with different security requirements. To automate the deployment of the various AWS services while keeping each environment distinct, we leveraged AWS CloudFormation templates.
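
As a minimal sketch of standing up one of these environments programmatically, the boto3 call below creates a stack from a template. The stack name, template URL, and parameter name are hypothetical placeholders, not the actual Milliman templates.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# One stack per environment; a template parameter selects the environment so the
# template can wire up the matching IAM roles and S3 prefixes.
cfn.create_stack(
    StackName="ml-platform-dev",  # hypothetical stack name
    TemplateURL="https://s3.amazonaws.com/example-bucket/ml-platform.template",  # hypothetical
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "development"}],
    Capabilities=["CAPABILITY_IAM"],  # required when a template creates IAM resources
)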


Cost Visibility

One common requirement that is regularly overlooked is fine-grained visibility into your cloud costs. Most organizations delay this requirement until they're further down the path to cloud adoption, which can be a costly mistake. We were glad to hear that tracking AWS costs down to specific workloads was, in fact, part of Milliman's project requirements.

Fortunately, AWS provides an offering called Cost Allocation Tags that can be tailored to meet a client's cost visibility requirements. With Cost Allocation Tags, clients can tag their AWS resources with specific keywords that AWS then uses to generate a cost report aggregated by those tags. For this project, we instructed Milliman to use the tagging feature of EMR to tag each cluster with workload-specific keywords that can later be recognized in AWS' billing report. For example, the following command line demonstrates how to tag an EMR cluster with workload-specific tags:
aws emr create-cluster --name Spark --release-label emr-4.0.0 --tags Name="Spark_Recommendation_Model_Generator" --applications Name=Hadoop Name=Spark

….

….


Interactivity

While building automated Spark and H2O clusters using AWS EMR and CloudFormation is a great start to building a data processing platform, at times developers need an interactive way of working with the platform. In other words, using the command line to work with a data processing platform may work in some use cases, but in other cases UI access is critical for developers to do their job by engaging with the cluster interactively.

As part of our data & analytics offering at Datapipe, we help customers pick the right 3rd-party tools and vendors for a given requirement. For Milliman, to build an interactive UI that engages with the EMR Spark and H2O clusters, we leveraged IPython/Jupyter notebooks. Jupyter integrates with Hadoop, Spark, and other platforms and allows developers to type their code in a UI and submit it to the cluster for execution in real time. To deploy the Jupyter notebooks, we leveraged EMR's bootstrap feature, which allows customers to install custom software on the EMR EC2 nodes.
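
Once Jupyter is running on the cluster, a notebook cell can attach to the H2O cloud started on the EMR nodes. Here is a minimal sketch, assuming the h2o Python package is available in the notebook environment; the master-node address below is a placeholder.

import h2o

# Point the notebook session at the H2O instance running on the EMR master node
# (the hostname is a placeholder, not a real endpoint).
h2o.init(ip="ec2-xx-xx-xx-xx.compute-1.amazonaws.com", port=54321)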


Consolidated platform

Lastly, we needed to build a single platform to host both the H2O and Spark frameworks. While H2O is not natively supported on EMR, using Amazon EMR's bootstrap action feature we were able to install H2O on the EMR nodes and avoid creating a separate platform to host H2O (a sketch follows below). In other words, Milliman now has the ability to launch both Spark and H2O clusters from a single platform.
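
As a sketch of what the bootstrap-based install could look like through boto3, the list below would be passed as the BootstrapActions argument of the run_job_flow call shown in the Elasticity And Cost section. The S3 script locations are hypothetical placeholders, not the scripts used in the project.

# Hypothetical bootstrap actions; each entry points at an install script staged in S3.
bootstrap_actions = [
    {"Name": "Install H2O",
     "ScriptBootstrapAction": {"Path": "s3://example-bucket/bootstrap/install_h2o.sh"}},
    {"Name": "Install Jupyter",
     "ScriptBootstrapAction": {"Path": "s3://example-bucket/bootstrap/install_jupyter.sh"}},
]

# emr.run_job_flow(..., BootstrapActions=bootstrap_actions,
#                  Tags=[{"Key": "Name", "Value": "Spark_Recommendation_Model_Generator"}], ...)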

Conclusion

Data processing platforms require various considerations, including but not limited to security, scalability, cost, interactivity, and automation. Fortunately, by defining a set of clear project objectives and goals, and mapping those objectives to the applicable solutions (in this case AWS offerings), companies can meet their data processing requirements efficiently and effectively. We hope this post demonstrated an example of how that process can work using AWS offerings. If you have any questions or would like to talk to one of our consultants, please contact us.

In the next set of blog posts we'll provide some insight into how we use data science and a data-driven approach to gain insight into operational metrics.

Drink in the Data with H2O at Strata SJ 2016

It’s about to rain data in San Jose when Strata + Hadoop World comes to town March 29 – March 31st.

H2O has a waterfall of action happening at the show. Here’s a rundown of what’s on tap.
Keep it handy so you have less chance of FOMO (fear of missing out).

Hang out with H2O at Booth #1225 to learn more about how machine learning can help transform your business and find us throughout the conference:

Tuesday, March 29th

Wednesday, March 30th

  • 12:45pm – 1:15pm Meet the Makers: The brains and innovation behind the leading machine learning solution is on hand to hack with you
    • #AskArno – Arno Candel, Chief Architect and H2O algorithm expert
    • #RuReady with Matt Dowle, H2O Hacker and author of R data.table
    • #SparkUp with Michal Malohlava, principal developer of Sparkling Water
    • #InWithErin – Erin LeDell, Machine Learning Scientist and H2O ensembles expert
  • 2:40pm – 3:20pm H2O highlighted in An introduction to Transamerica’s product recommendation platform
  • 5:50pm – 6:50pm Booth Crawl. Have a beer on us at Booth #1225
  • 7:00pm – 9:00pm Let it Flow with H2O – Drinks + Data at the Arcadia Lounge. Grab your invite at Booth #1225

Thursday, March 31st

  • 12:45pm – 1:15pm Ask Transamerica. Vishal Bamba and Nitin Prabhu of Transamerica join us at Booth #1225 for Q&A with you!

The Top 10 Most Watched Videos From H2O World 2015

Now that we’re a few months out from H2O World we wanted to share with you all what the most popular talks were by online viewership. The talks covered a variety of topics from introductions, to in-depth examinations of use cases, to wide-ranging panels.

Introduction to Data Science
Featuring Erin LeDell, Statistician and Machine Learning Scientist, H2O.ai
An introductory talk for people new to the field of data science.

Intro to R, Python, Flow
Featuring Amy Wang, Math Hacker, H2O.ai
A hands-on demonstration of how to run H2O in R and Python and an introduction to the Flow GUI.

Machine Learning at Comcast
Featuring Andrew Leamon, Director of Engineering Analysis, Comcast and Chushi Ren, Software Engineer, Comcast
An inside look at how Comcast leverages machine learning across its business units.

Migrating from Proprietary Analytics Stacks to Open Source H2O
Featuring Fonda Ingram, Technical Manager, H2O.ai
A ten-year SAS veteran explains how to migrate from proprietary software to an open source environment.

Top 10 Data Science Pitfalls
Featuring Mark Landry, Product Manager, H2O.ai
A Kaggle champion offers an overview of ten top pitfalls to avoid when performing data science.

Ensembles
Featuring Erin LeDell, Statistician and Machine Learning Scientist, H2O.ai
Another popular talk from Erin, this time providing an overview specifically of ensemble learning.

Sparkling Water
Featuring Michal Malohlava, Software Engineer, H2O.ai
An introduction to Sparkling Water, H2O’s Spark API, by one of its key architects.

Panel – Competitive Data Science
Featuring Arno Candel, Chief Architect, H2O.ai, Phillip Adkins, Data Scientist, Banjo, Nick Kridler, Data Scientist, Stitch Fix, Mark Landry, Product Manager, H2O.ai, John Park, Principal Data Scientist, Hewlett-Packard Enterprise, Lauren Savage, Data Scientist, AT&T and Guocong Song, Data Scientist, Playground.Global
A panel discussion covering all aspects of competitive data science.

Survey of Available Machine Learning Frameworks
Featuring Brenden Herger, Data Scientist, Capital One
An overview of available machine learning frameworks and an analysis of why teams use specific ones.

Panel – Industrial Data Science – Practitioners’ Perspective
Featuring SriSatish Ambati, CEO & Cofounder, H2O.ai, Xavier Amatriain, VP of Engineering, Quora, Scott Marsh, Research & Development Analyst, Progressive Insurance, Taposh Dutta Roy, Manager, Kaiser Permanente, Nachum Shacham, Principal Data Scientist, PayPal and Daqing Zhao, Director of Advanced Analytics, Macy's.com
A discussion of large data science deployments by the people most familiar with them.

A great selection of talks if we do say so ourselves! Is it too early to start counting the days to H2O World 2016?

H2O World from an Attendee’s Perspective

Data Science is like Rome, and all roads lead to Rome. H2O WORLD is the crossroads, pulling in a confluence of math, statistics, science and computer science and incorporating all avenues of business. From academic, research-oriented models to the business and computer science implementations of those ideas, H2O WORLD informs attendees on H2O's ability to help users and customers explore their data and produce a prediction or answer a question.

I came to H2O World hoping to gain a better understanding of H2O’s software and of Data Science in general. I thoroughly enjoyed attending the sessions, following along with the demos and playing with H2O myself. Learning from the hackers and Data Scientists about the algorithms and science behind H2O and seeing the community spirit at the Hackathons was enlightening. Listening to the keynote speakers, both women, describe our data-influenced future and hearing the customer’s point of view on how H2O has impacted their work has been inspirational. I especially appreciated learning about the potential influence on scientific and medical research and social issues and H2O’s ability to influence positive change.

Curiosity led me to delve into the world of Data Science, and as a person with a background in science and math, I wasn't sure how it applied to me. Now I realize that there is virtually no discipline which cannot benefit from the methods of Data Science and that there is great power in asking the right questions and telling a good story. H2O WORLD broadened my horizons and gave me a new perspective on the role of Data Science in the world. Data science can be harnessed as a force for social good, where a few people from around the globe can change the world. H2O World 2015 was a great success and I truly enjoyed learning and being there.

H2O at ML Conf SF 2015

H2O is ubiquitous, and just like H2O, our team is everywhere! Today we attended the (H2O.ai-sponsored) 2015 Machine Learning Conference in San Francisco. Located at the gorgeous Julia Morgan Ballroom the ML Conference brought together some of the world’s foremost experts on machine learning, including the tireless Xavier Amatriain, VP of Engineering at Quora, fresh off his talk at H2O World. The speaking lineup also included folks from IBM, CMU, Kaggle, Ayasdi, ChaLearn, Google, Netflix, Numenta, Stitch Fix, Ufora, Intel, Walmart Labs, UC Irvine, Skymind, Slack and Baidu. As expected, many of our H2O fans were among attendees, leading to so much traffic at our booth that we ran out of booklets!

Tomorrow we’re off to another two days of fun at the (H2O.ai-sponsored) Open Data Science Conference being held at the luxurious Marriott Waterfront in San Francisco. Looking forward to seeing you there!

Questions? Tweet us @h2oai

Pre-H2O World, Part 2

H2O fans, we have a day of data delights in store for you tomorrow! The first day of H2O World is totally devoted to demos and walkthroughs designed to help YOU get the most out of your data. In fact, we have so many sessions planned that unless you have Hermione's Time Turner, you won't be able to attend them all. So choose wisely! A half-day hackathon will kick off at 9 am and last until 12 pm. At the same time, the Erdos stage will be hosting an introduction to the H2O platform for you newcomers, followed by an explanation of how to install the platform and introductions to data science, R, Python and the Flow UI. A panel on the challenges and pitfalls of data science and a talk on Gradient Boosting Machines (GBM) and Random Forest will follow before we even get to lunch! Last, but certainly not least, the Ramanujan stage will feature a morning packed with an update on the H2O platform for veteran users and an explanation of how to upgrade the software. This will be followed by an overview of the "top 10 data science pitfalls," an update on what's new in R, Python and Flow, and talks on GLM and Python pipelines.

Time is precious, so even the lunch hour gets used at H2O World! Enjoy your meal while you hear from our very own Chief Architect, Arno Candel. Silicon Valley Data Science's Chen Huang will kick off the afternoon at Boole with a talk on how to ask smarter questions to make better business decisions. Boole's afternoon will be graced by talks on Sparkling Water and building smarter applications. Although the morning's talks on the Erdos stage will be a tough act to follow, we'll be making an attempt with a series of awesome talks on Deep Learning, GLM, Ensembles, Sparkling Water and building smart applications, plus a panel on competitive data science! Likewise, the Ramanujan stage will be putting on a strong effort in the afternoon. Ramanujan will feature talks on using H2O with Databricks Cloud, GBM and Random Forest, GLRM, migrating proprietary stacks to open source H2O, Deep Learning and a panel discussion on Smart Applications!

Questions? Tweet us @h2oai #h2oworld

A Newbie’s Guide to H2O in Python – Guest Post

This blog was originally posted here

I created this guide to help fellow newbies get their feet wet with H2O, an open-source predictive analytics platform that is fast, powerful, and easy to use. Using a combination of extraordinary math and high-performance parallel processing, H2O allows you to quickly create models for big data. The steps below show you how to download and start analyzing data at high speeds with H2O. After that it’s up to you.

What You’ll Learn

  • How to download H2O (just updated to OS X El Capitan? Then Java too)
  • How to use H2O with IPython Notebook & where to get demo scripts
  • How to teach a computer to recognize handwritten digits with H2O
  • Where to find documentation and community resources

A Delicious Drink of Water — Downloading H2O

(If you don’t feel like reading the long version below just go here)

I recommend downloading the latest release of H2O (which is 'Bleeding Edge' as of this moment) because it has the most Python features, but you can also see the other releases here, as well as the software requirements. Okay, let's get started:

Do you have Java on your computer? Not sure? Here's how to check:

  • Open your terminal and type in ‘java -version’:

MacBook-Pro:~ username$ java -version


If you don’t have Java you can either click through the pop up dialogue box and make your way to the correct downloadable version, or you can go directly to the Java downloads page here (two-for-one tip: download the Java Development Kit and get the Java Runtime Environment with it).

Now that you have Java (fingers crossed), you can download H2O (I’m assuming you have Python, but if you don’t, consider downloading Anaconda which gives you access to amazing Python packages for data analysis and scientific computing).

You can find the official instructions to Download H2O’s ‘Bleeding Edge’ release here (click on the ‘Install in Python’ tab), or follow below:

  1. Prerequisite: Python 2.7
  2. Type the following in your terminal:

Fellow newbies: don't type in the 'MacBook-Pro:~ username$' part; only type what's listed after the '$' (you can get more command-line help here).

MacBook-Pro:~ username$ pip install requests
MacBook-Pro:~ username$ pip install tabulate
MacBook-Pro:~ username$ pip install scikit-learn

MacBook-Pro:~ username$ pip uninstall h2o
MacBook-Pro:~ username$ pip install http://h2o-release.s3.amazonaws.com/h2o/master/3250/Python/h2o-3.7.0.3250-py2.py3-none-any.whl

As shown above, if you installed an earlier version of H2O, uninstalling and reinstalling H2O with pip will do the trick.
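
A quick way to confirm the install worked is to start a local H2O instance from Python. This is just a sanity check and assumes the pip install above finished without errors:

import h2o

# Starts (or connects to) a local single-node H2O cluster and prints its status.
h2o.init()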

Let’s Get Interactive — IPython Notebook

If you don't already have IPython Notebook, you can download it by following these instructions. If you downloaded Anaconda, it comes with IPython Notebook, so you're set. And here's a video tutorial on how to use IPython Notebook.

If everything goes as planned, to open IPython Notebook you ‘cd’ to your directory of choice (I chose my Desktop folder) and enter ‘ipython notebook’. (If you’re still new to the command line, learn more about using ‘cd’, which I like to use as a verb, here and here).

MacBook-Pro:~ username$ cd Desktop
MacBook-Pro:Desktop username$ ipython notebook

Random Note: After I updated to OS X El Capitan the command above didn’t work. For many people using ‘conda update conda’ and then ‘conda update ipython’ will solve the issue, but in my case I got an SSL error that wouldn’t let me ‘conda update’ anything. I found the solution here, using:

MacBook-Pro:~ username$ conda config --set ssl_verify False
MacBook-Pro:~ username$ conda update requests openssl
MacBook-Pro:~ username$ conda config --set ssl_verify True

Now that you have IPython Notebook, you can play around with some of H2O’s demo notebooks. If you’re new to Github, however, downloading the demos to your desktop can seem daunting, but don’t worry it’s easy. Here’s the trick:

  1. Navigate to H2O’s Python Demo Repository
  2. Click on your '.ipynb' demo of choice (let's do citi_bike_small.ipynb)
  3. Click on ‘Raw’ in the upper right corner, then after the next web page opens, go to ‘File’ on the menu bar and select ‘Save Page As’ (or similar)
  4. Open your terminal, cd to the Downloads folder, or wherever you saved the IPython Notebook, then type ‘ipython notebook citi_bike_small.ipynb’
  5. Now you can go through the demo running each cell individually (click on the cell and press shift + enter)

Classifying Handwritten Digits — Enter a Kaggle Competition

A great way to get a feel for H2O is to test it out on a Kaggle data science competition. Don't know what Kaggle is? Never entered a Kaggle competition? That's totally fine, I'll give you a script to get your feet wet. If you're still nervous, here's a great article about how to get started with Kaggle given your previous experience. (A minimal script sketch also follows the steps below.)

Are you excited? Get excited! You are going to teach your computer to recognize HANDWRITTEN DIGITS! (I feel like if you’re still ready at this point, it’s time to let my enthusiasm shine through).

  1. Take a look at Kaggle’s Digit Recognizer Competition
  2. Look at a demo notebook to get started
  3. Download the notebook by clicking on ‘Raw’ and then saving it
  4. Open up and run the notebook to generate a submission csv file
  5. Submit the file for your first submission to Kaggle, then play around with your model parameters and see if you can improve your Kaggle submission score
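
If you'd rather start from a bare script than the notebook, here is a minimal sketch of the same idea using H2O's deep learning estimator. It assumes you've downloaded Kaggle's train.csv and test.csv into the working directory; the 'label' column name and the model parameters are illustrative, and you'll still need to reshape the predictions into Kaggle's ImageId/Label submission format.

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Load the Kaggle digit recognizer data (paths assume the files sit next to this script).
train = h2o.import_file("train.csv")
test = h2o.import_file("test.csv")

# Treat the digit label as a categorical so H2O trains a classifier, not a regressor.
train["label"] = train["label"].asfactor()

# A small multi-layer network; tune the hidden sizes and epochs to improve your score.
model = H2ODeepLearningEstimator(hidden=[200, 200], epochs=10)
model.train(x=train.columns[1:], y="label", training_frame=train)

# Score the test set and write out the raw predictions for post-processing into a submission.
predictions = model.predict(test)
h2o.export_file(predictions, "predictions.csv")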

Getting Help — Resources & Documentation

Pre-H2O World, Part 1

H2O fans, the H2O.ai team is burning the midnight oil to get H2O World ready for you all. With an audience size twice that of last year's event, we're going to pack the house at the Computer History Museum! This year's event will feature 70+ speakers spread out over 41 talks, 22 training sessions and eight panels during the course of the most exciting three days a data scientist could ask for. These folks are amongst the leading lights in our industry, including Hilary Mason, Monica Rogati and Stanford Professors Stephen Boyd and Rob Tibshirani.

Right now our awesome new QA team members are burning over 1,000 USB sticks filled to the brim with new content. We're especially excited for you all to see use cases from your colleagues across a wide variety of industries, including ad tech, insurance and finance, and from companies like Progressive, Macy's, PayPal and AT&T. Stay tuned for a follow-up outlining all of Monday's events; we've got some surprises in store!

If there’s something YOU want to see at the show Tweet us @H2Oai #h2oworld