Sparkling Water on the Spark-Notebook

This is a guest post from our friends at Kensu.

In the space of enterprise data science, two outstanding scalable technologies are Spark and H2O. Spark is a general-purpose distributed computing framework, and H2O is a highly performant, scalable platform for AI.
Their complementarity is best exploited through Sparkling Water, which combines the best of Spark (its elegant APIs, RDDs, and multi-tenant context) with the best of H2O (its speed, columnar compression, and fully featured machine learning and deep learning algorithms) in an enterprise-ready fashion.

Examples of Sparkling Water pipelines are readily available in the H2O GitHub repository; we have revisited these examples using the Spark-Notebook.

The Spark-Notebook is an open-source notebook (a web-based environment for code editing, execution, and data visualization), focused on Scala and Spark. The Spark-Notebook is part of the Adalog suite, which addresses agility, maintainability, and productivity for data science teams. Adalog offers data scientists a short work cycle to deploy their work to the business reality, and offers managers a set of data governance tools giving a consistent view of the impact of data activities on the market.

This new material allows diving into Sparkling Water in an interactive and dynamic way.

Working with Sparkling Water in the Spark-Notebook provides an ideal platform for agile big data / data science development. Most notably, this gives the data scientist the power to:

  • Write rich documentation alongside the code, improving the capacity to index knowledge
  • Experiment quickly through interactive execution of individual code cells, and share the results of these experiments with colleagues
  • Visualize the data being fed to H2O through an extensive list of widgets and automatic rendering of computation results

Most of the H2O/Sparkling water examples have been ported to the Spark-Notebook and are available in a github repository.

We are focusing here on the Chicago crime dataset example, looking at:

  • How to take advantage of both H2O and Spark-Notebook technologies,
  • How to install the Spark-Notebook,
  • How to use it to deploy H2O jobs on a spark cluster,
  • How to read, transform and join data with Spark,
  • How to render data on a geospatial map,
  • How to apply deep learning or Gradient Boosted Machine (GBM) models using Sparkling Water

Installing the Spark-Notebook:

Installation is very straightforward on a local machine. Follow the steps described in the Spark-Notebook documentation and in a few minutes you will have it working. Please note that Sparkling Water currently works only with Scala 2.11 and Spark 2.0.2 and above.
For larger projects, you may also be interested in reading the documentation on how to connect the notebook to an on-premise or cloud computing cluster.

The Sparkling Water notebooks repo should be cloned in the “notebooks” directory of your Spark-Notebook installation.

Integrating H2O with the Spark-Notebook:

In order to integrate Sparkling Water with the Spark-Notebook, we need to tell the notebook to load the Sparkling Water package and, if required, specify a custom Spark configuration. Spark then automatically distributes the H2O libraries to each of your Spark executors. Declaring the Sparkling Water dependencies pulls in some libraries transitively, so take care to avoid duplicate or conflicting versions of dependencies.
The notebook metadata defines custom dependencies (ai.h2o) and dependencies to exclude (because they are already available, i.e. Spark, Scala, and Jetty). The custom local repo defines where dependencies are stored locally, avoiding re-downloading them each time a notebook is started.

"customLocalRepo": "/tmp/spark-notebook",
"customDeps": [
  "ai.h2o % sparkling-water-core_2.11 % 2.0.2",
  "ai.h2o % sparkling-water-examples_2.11 % 2.0.2",
  "- org.apache.hadoop % hadoop-client %   _",
  "- org.apache.spark  % spark-core_2.11    %   _",
  "- org.apache.spark % spark-mllib_2.11 % _",
  "- org.apache.spark % spark-repl_2.11 % _",
  "- org.scala-lang    %     _         %   _",
  "- org.scoverage     %     _         %   _",
  "- org.eclipse.jetty.aggregate % jetty-servlet % _"
"customSparkConf": {
  "spark.ext.h2o.repl.enabled": "false"

With these dependencies set, we can start using Sparkling Water and initiate an H2O context from within the notebook.
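Initiating the context from a notebook cell is then a one-liner. The sketch below assumes Sparkling Water 2.0.x, with `sparkSession` standing for the session object the notebook exposes:

```scala
import org.apache.spark.h2o._

// Spins up H2O nodes inside the existing Spark executors and returns
// a handle used for converting between Spark and H2O frames.
val h2oContext = H2OContext.getOrCreate(sparkSession)
import h2oContext._
```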

Benchmark example – Chicago Crime Scenes:

As an example, we can revisit the Chicago Crime Sparkling Water demo. The Spark-Notebook we used for this benchmark can be seen in a read-only mode here.

Step 1: The three datasets are loaded as Spark data frames:

  • Chicago weather data: min, max, and mean temperature per day
  • Chicago census data: average poverty, unemployment, education level, and gross income per Chicago Community Area
  • Chicago historical crime data: crime description, date, location, community area, etc., including a flag indicating whether an arrest was made

The three tables are joined using Spark into one big table with location and date as keys. A view of the first entries of the table is generated by the notebook's automatic rendering of tables (see a sample in the table below).
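The join Spark performs can be pictured in plain Python; the column names below are illustrative stand-ins, not the datasets' actual schema:

```python
# Plain-Python sketch of the three-way join Spark performs here.
weather = {"2015-02-08": {"temp_max": 30, "temp_min": 12}}  # keyed by date
census = {"46": {"poverty_rate": 0.29}}                     # keyed by community area
crimes = [{"date": "2015-02-08", "community_area": "46",
           "primary_type": "NARCOTICS", "arrest": True}]

# One row per crime, enriched with the matching weather and census rows.
joined = [{**c, **weather[c["date"]], **census[c["community_area"]]}
          for c in crimes]
```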


Geospatial chart widgets are also available in the Spark-Notebook; for example, here are the first 100 crimes in the table:


Step 2: We can transform the Spark data frame into an H2O frame and randomly split it into training and validation frames containing 80% and 20% of the rows, respectively. This is a memory-to-memory transformation, effectively copying and formatting the data in the Spark data frame into an equivalent representation on the H2O nodes (spawned by Sparkling Water inside the Spark executors).
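H2O performs the 80/20 split natively on the distributed frame, but the idea can be sketched in plain Python as a seeded per-row coin flip (so the ratio is approximate, as it is in H2O's random split):

```python
import random

def split_rows(rows, ratio=0.8, seed=42):
    """Randomly partition rows into (train, valid), with ~ratio going to train."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        (train if rng.random() < ratio else valid).append(row)
    return train, valid

train, valid = split_rows(list(range(10_000)))
```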
We can verify that the frames are loaded into H2O by looking at the H2O Flow UI (available on port 54321 of your Spark-Notebook installation). We can open it by calling "openFlow" in a notebook cell.


Step 3: From the Spark-Notebook, we train two H2O machine learning models on the training H2O frame. For comparison, we construct a Deep Learning MLP model and a Gradient Boosting Machine (GBM) model. Both models use all of the data frame's columns as features: time, weather, location, and neighborhood census data. The models live in the H2O context and are thus visible in the H2O Flow UI. Sparkling Water functions allow us to access them from the SparkContext.

We compare the classification performance of the two models by looking at the area under the curve (AUC) on the validation dataset. The AUC measures the discrimination power of the model, that is, its ability to correctly classify crimes that led to an arrest. The higher, the better.
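AUC has a direct probabilistic reading: it is the chance that a randomly chosen positive example (an arrest) receives a higher score than a randomly chosen negative one. A minimal, library-free sketch of the computation (illustrative only; H2O reports the AUC itself):

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly
    (ties count half). O(n*m) pairwise version, fine for a sketch."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model scoring every positive above every negative has AUC 1.0.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # -> 1.0
```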

The Deep Learning model leads to a 0.89 AUC while the GBM gets to 0.90 AUC. The two models are therefore quite comparable in terms of discrimination power.


Step 4: Finally, the trained models are used to measure the probability of arrest for two specific crimes:

  • A “narcotics” related crime on 02/08/2015 11:43:58 PM in a street of community area “46” in district 4 with FBI code 18.

    The probability of being arrested predicted by the deep learning model is 99.9% and by the GBM is 75.2%.

  • A “deceptive practice” related crime on 02/08/2015 11:00:39 PM in a residence of community area “14” in district 9 with FBI code 11.

    The probability of being arrested predicted by the deep learning model is 1.4% and by the GBM is 12%.

The Spark-Notebook allows for a quick computation and visualization of the results:



Combining Spark and H2O within the Spark-Notebook is a very nice setup for scalable data science. More examples are available in the online viewer. If you are interested in running them, install the Spark-Notebook and look in this repository. From that point on, you are on track for enterprise-ready interactive scalable data science.

Loic Quertenmont,
Data Scientist @

Apache Spark and H2O on AWS


This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

One of the advantages of the public cloud is the ability to experiment and run various workloads without the need to commit to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor's offerings is a must. In collaboration with Denis Perevalov (Milliman), we'd like to share some details about one of the most recent, and largest, big-data projects we've worked on: a project with our client, Milliman, to build a machine-learning platform on Amazon Web Services.

Before we get into the details, let's introduce Datapipe's data and analytics consulting team. The goal of our data and analytics team is to help customers with their data processing needs. Many of our engagements are data engineering efforts, where we help customers build data processing pipelines. Our team also helps clients gain better insight into their existing datasets through engagements with our data science consultants.

When we first started working with Milliman, one of their challenges was running compute-intensive machine learning algorithms in a reasonable amount of time on a growing set of datasets. In order to cope with this challenge, they had to pick a distributed processing framework and, after some investigation, narrowed down their options to two frameworks: H2O and Apache Spark. Both frameworks offer distributed computing across multiple nodes, and Milliman was left to decide whether to use their on-premise infrastructure or an alternative. In the end, Amazon Web Services emerged as an ideal choice given their use case and requirements. To execute on this plan, Milliman engaged our data and analytics consultants to build a scalable and secure Spark and H2O machine learning platform using the AWS solutions Amazon EMR, S3, IAM, and CloudFormation.

Early in our engagement with Milliman, we identified the following as the high-level and important project goals:

  1. Security: Ability to limit access to confidential data using access control and security policies
  2. Elasticity and cost: Suggest a cost efficient and yet elastic resource manager to launch Spark and H2O clusters
  3. Cost visibility: Suggest a strategy to get visibility into AWS cost broken down by a given workload
  4. Automation: Provide automated deployment of Apache Spark, H2O clusters and AWS services.
  5. Interactivity: Provide ability to interact with Spark and H2O using IPython/Jupyter Notebooks
  6. Consolidated platform: A single platform to run both H2O and Spark.

Let’s dive into each of these priorities further, with a focus on Milliman’s requirements and how we mapped their requirements to AWS solutions, as well as the details around how we achieved the agreed upon project goals.


Security

Security is the most important part of any cloud application deployment, and it's a topic we take very seriously when we get involved in client engagements. In this project, one of the security requirements was the ability to control and limit access to datasets hosted on Amazon S3 depending on the environment. For example, Milliman wanted to restrict the access of development and staging clusters to a subset of the data, and to ensure the production dataset is only available to the production clusters.

To achieve Milliman's security isolation goal, we took advantage of AWS Identity and Access Management (IAM). Using IAM, we created:

  1. EC2 roles: EC2 roles are assigned to EC2 instances during the EMR launch process. EC2 roles allow us to limit which subset of the dataset each instance can access. In the case of Spark, this enabled us to limit Spark nodes in the development environment to a particular subset of the data.
  2. IAM role policies: Each EC2 role requires an IAM policy defining what that role can and cannot do. We created appropriate IAM policies for the development and production environments and attached each policy to its corresponding IAM role.

Amazon S3 Development Environment

Example IAM Policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:*"],
      "Resource": [ ... ]
    }
  ]
}

Elasticity And Cost

In general, there are two main approaches in deploying Spark or Hadoop platforms on AWS:

  1. Building your own platform using open-source tools or 3rd-party vendor offerings on top of EC2 (e.g., download and deploy Cloudera CDH on EC2)
  2. Leveraging AWS' existing managed services such as EMR

For this project, our recommendation to the client was to use Amazon Elastic MapReduce (EMR), a managed Hadoop and Spark offering. It allows us to build Spark or Hadoop clusters in a matter of minutes simply by calling APIs. There are a number of reasons why we suggested Amazon Elastic MapReduce instead of building the platform from 3rd-party vendor or open-source offerings:

  1. It is very easy and flexible to get started with EMR. Clients don't have to fully understand every aspect of deploying distributed systems, such as configuration and node management. Instead, they can focus on a few application-specific aspects of the platform configuration, such as how much memory or CPU should be allocated to the application (Spark).
  2. Automated deployment is part of EMR's offering. Clients don't have to build an orchestrated node management framework to automate the deployment of various nodes.
  3. Security is already baked in, and clients don't have to create extra logic in their application to take advantage of the security features.
  4. Integration with AWS' spot instance offering, which is one of the most important features when it comes to controlling the cost of a distributed system.
  5. Ability to customize the cluster to install and run frameworks not natively supported by EMR, such as H2O.


Automation

Automation is an important aspect of any cloud deployment. While it's easy to build and deploy various applications on top of cloud resources manually or through inconsistent processes, automating the deployment process consistently is the cornerstone of a scalable cloud operation. That is especially true for distributed systems, where reproducibility of the deployment process is the only way an organization can run and maintain a data processing platform.

As we mentioned earlier, in addition to Amazon EMR, other AWS services such as IAM and S3 were leveraged in this project. On top of that, Milliman required multiple AWS environments, such as staging, development, and production, with different security requirements. In order to automate the deployment of the various AWS services while keeping each environment distinct, we leveraged Amazon CloudFormation scripts.

Cost Visibility

One common requirement that is regularly overlooked is fine-grained visibility into your cloud costs. Most organizations delay this requirement until they're further down the path to cloud adoption, which can be a costly mistake. We were glad to hear that tracking AWS costs down to specific workloads was, in fact, part of Milliman's project requirements.

Fortunately, AWS provides an offering called Cost Allocation Tags that can be tailored to meet a client's cost visibility requirements. With Cost Allocation Tags, clients can tag their AWS resources with specific keywords that AWS can use to generate a cost report aggregated by the customer's tags. Specifically for this project, we instructed Milliman to use EMR's tagging feature to tag each cluster with workload-specific keywords that can later be recognized in AWS' billing report. For example, the following command line demonstrates how to tag an EMR cluster with workload-specific tags:

aws emr create-cluster --name Spark --release-label emr-4.0.0 --tags Name="Spark_Recommendation_Model_Generator" --applications Name=Hadoop Name=Spark




Interactivity

While building automated Spark and H2O clusters using AWS EMR and CloudFormation is a great start to building a data processing platform, at times developers need an interactive way of working with it. In other words, using the command line to work with a data processing platform may work in some use cases, but in others UI access is critical for developers to do their jobs by engaging with the cluster interactively.

As part of our data & analytics offering at Datapipe, we help customers pick the right 3rd-party tools and vendors for a given requirement. For Milliman, to build an interactive UI that engages with the EMR Spark and H2O clusters, we leveraged IPython/Jupyter notebooks. Jupyter provides integration with Hadoop, Spark, and other platforms, and allows developers to type their code into a UI and submit it to the cluster for execution in real time. In order to deploy Jupyter notebooks, we leveraged EMR's bootstrap feature, which allows customers to install custom software on EMR EC2 nodes.

Consolidated platform

Lastly, we needed a single platform to host both the H2O and Spark frameworks. While H2O is not natively supported on EMR, using Amazon EMR's bootstrap action feature we were able to install H2O on the EMR nodes and avoid creating a separate platform to host it. In other words, Milliman now has the ability to launch both Spark and H2O clusters from a single platform.
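A launch combining EMR's native Spark support with an H2O bootstrap action can be sketched with the AWS CLI as follows; the S3 path to the install script and the cluster names are placeholders, not the project's actual values:

```shell
# Sketch: launch an EMR cluster and install H2O via a bootstrap action.
aws emr create-cluster \
  --name "Spark-H2O" \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Spark \
  --bootstrap-actions Path="s3://your-bucket/install-h2o.sh",Name="InstallH2O" \
  --tags Name="Spark_H2O_Cluster"
```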


Data processing platforms require various considerations, including but not limited to security, scalability, cost, interactivity, and automation. Fortunately, by defining a set of clear project objectives and goals, and mapping those objectives to the applicable solutions (in this case, AWS offerings), companies can meet their data processing requirements efficiently and effectively. We hope this post demonstrated an example of how this process can be achieved using AWS offerings. If you have any questions or would like to talk to one of our consultants, please contact us.

In the next set of blog posts, we'll provide some insight into how we use data science and a data-driven approach to understand operational metrics.

H2O World from an Attendee’s Perspective

Data Science is like Rome, and all roads lead to Rome. H2O WORLD is the crossroad, pulling in a confluence of math, statistics, science and computer science and incorporating all avenues of business. From the academic, research oriented models to the business and computer science analytics implementations of those ideas, H2O WORLD informs attendees on H2O’s ability to help users and customers explore their data and produce a prediction or answer a question.

I came to H2O World hoping to gain a better understanding of H2O's software and of data science in general. I thoroughly enjoyed attending the sessions, following along with the demos, and playing with H2O myself. Learning from the hackers and data scientists about the algorithms and science behind H2O and seeing the community spirit at the hackathons was enlightening. Listening to the keynote speakers, both women, describe our data-influenced future, and hearing customers' points of view on how H2O has impacted their work, was inspirational. I especially appreciated learning about the potential influence on scientific and medical research and social issues, and H2O's ability to drive positive change.

Curiosity led me to delve into the world of data science, and as a person with a background in science and math, I wasn't sure how it applied to me. Now I realize that there is virtually no discipline that cannot benefit from the methods of data science, and that there is great power in asking the right questions and telling a good story. H2O WORLD broadened my horizons and gave me a new perspective on the role of data science in the world. Data science can be harnessed as a force for social good, where a few people from around the globe can change the world. H2O World 2015 was a great success, and I truly enjoyed learning and being there.

A Newbie’s Guide to H2O in Python – Guest Post

This blog was originally posted here

I created this guide to help fellow newbies get their feet wet with H2O, an open-source predictive analytics platform that is fast, powerful, and easy to use. Using a combination of extraordinary math and high-performance parallel processing, H2O allows you to quickly create models for big data. The steps below show you how to download and start analyzing data at high speeds with H2O. After that it’s up to you.

What You’ll Learn

  • How to download H2O (just updated to OS X El Capitan? Then Java too)
  • How to use H2O with IPython Notebook & where to get demo scripts
  • How to teach a computer to recognize handwritten digits with H2O
  • Where to find documentation and community resources

A Delicious Drink of Water — Downloading H2O

(If you don’t feel like reading the long version below just go here)

I recommend downloading the latest release of H2O (which is 'Bleeding Edge' as of this moment) because it has the most Python features, but you can also see the other releases here, as well as the software requirements. Okay, let's get started:

Do you have Java on your computer? Not sure? Here's how to check:

  • Open your terminal and type in ‘java -version’:

MacBook-Pro:~ username$ java -version


If you don’t have Java you can either click through the pop up dialogue box and make your way to the correct downloadable version, or you can go directly to the Java downloads page here (two-for-one tip: download the Java Development Kit and get the Java Runtime Environment with it).

Now that you have Java (fingers crossed), you can download H2O (I’m assuming you have Python, but if you don’t, consider downloading Anaconda which gives you access to amazing Python packages for data analysis and scientific computing).

You can find the official instructions to Download H2O’s ‘Bleeding Edge’ release here (click on the ‘Install in Python’ tab), or follow below:

  1. Prerequisite: Python 2.7
  2. Type the following in your terminal:

Fellow newbies: don't type in the 'MacBook-Pro:~ username$' part; only type what's listed after the '$' (you can get more command line help here).

MacBook-Pro:~ username$ pip install requests
MacBook-Pro:~ username$ pip install tabulate
MacBook-Pro:~ username$ pip install scikit-learn

MacBook-Pro:~ username$ pip uninstall h2o
MacBook-Pro:~ username$ pip install

As shown above, if you installed an earlier version of H2O, uninstalling and reinstalling H2O with pip will do the trick.

Let’s Get Interactive — IPython Notebook

If you don't already have IPython Notebook, you can download it by following these instructions. If you downloaded Anaconda, it comes with IPython Notebook, so you're set. And here's a video tutorial on how to use IPython Notebook.

If everything goes as planned, to open IPython Notebook you ‘cd’ to your directory of choice (I chose my Desktop folder) and enter ‘ipython notebook’. (If you’re still new to the command line, learn more about using ‘cd’, which I like to use as a verb, here and here).

MacBook-Pro:~ username$ cd Desktop
MacBook-Pro:Desktop username$ ipython notebook

Random note: After I updated to OS X El Capitan, the command above didn't work. For many people, running 'conda update conda' and then 'conda update ipython' will solve the issue, but in my case I got an SSL error that wouldn't let me 'conda update' anything. I found the solution here, using:

MacBook-Pro:~ username$ conda config --set ssl_verify False
MacBook-Pro:~ username$ conda update requests openssl
MacBook-Pro:~ username$ conda config --set ssl_verify True

Now that you have IPython Notebook, you can play around with some of H2O's demo notebooks. If you're new to GitHub, downloading the demos to your desktop can seem daunting, but don't worry, it's easy. Here's the trick:

  1. Navigate to H2O’s Python Demo Repository
  2. Click on your '.ipynb' demo of choice (let's do citi_bike_small.ipynb)
  3. Click on ‘Raw’ in the upper right corner, then after the next web page opens, go to ‘File’ on the menu bar and select ‘Save Page As’ (or similar)
  4. Open your terminal, cd to the Downloads folder, or wherever you saved the IPython Notebook, then type ‘ipython notebook citi_bike_small.ipynb’
  5. Now you can go through the demo running each cell individually (click on the cell and press shift + enter)

Classifying Handwritten Digits — Enter a Kaggle Competition

A great way to get a feel for H2O is to test it out on a Kaggle data science competition. Don’t know what Kaggle is? Never enter a Kaggle Competition? That’s totally fine, I’ll give you a script to get your feet wet. If you’re still nervous here’s a great article about how to get started with Kaggle given your previous experience.

Are you excited? Get excited! You are going to teach your computer to recognize HANDWRITTEN DIGITS! (I feel like if you're still reading at this point, it's time to let my enthusiasm shine through).

  1. Take a look at Kaggle’s Digit Recognizer Competition
  2. Look at a demo notebook to get started
  3. Download the notebook by clicking on ‘Raw’ and then saving it
  4. Open up and run the notebook to generate a submission csv file
  5. Submit the file for your first submission to Kaggle, then play around with your model parameters and see if you can improve your Kaggle submission score
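The submission file in step 4 is just an ImageId/Label CSV; here is a stdlib sketch with dummy predictions (in the demo notebook the labels come from the trained H2O model):

```python
import csv
import io

# Sketch: build a digit-recognizer submission in the ImageId,Label
# format Kaggle expects. The predictions are dummy values here.
predictions = [7, 2, 1, 0]  # one predicted digit per test image

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["ImageId", "Label"])
for image_id, label in enumerate(predictions, start=1):
    writer.writerow([image_id, label])

submission = buf.getvalue()
```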

Getting Help — Resources & Documentation