in Community, Guest Posts

Apache Spark and H2O on AWS

Screen-Shot-2016-04-11-at-6.06.29-PM-1

This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

One of the advantages of public cloud is the ability to experiment and run various workloads without the need to commit to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor offerings is a must. In collaboration with Denis Perevalov (Milliman), we’d like to share some details around one of our most recent – and largest – big-data projects we’ve worked on; a project with our client, Milliman, to build a machine-learning platform on Amazon Web Services.

Before we get into the details, let’s introduce Datapipe’s data and analytics consulting team. The goal of our data and analytics team is to help customers with their data processing needs. Our engagements fall into data engineering efforts, where we help customers build data processing pipelines. In addition to that, our team helps clients get a better insight into their existing datasets by engaging with our data science consultants.

When we first started working with Milliman, one of their challenges was running compute intensive machine learning algorithms in a reasonable amount of time on a growing set of datasets. In order to cope with this challenge, they had to pick a distributed processing framework and, after some investigation, narrowed down their options to two frameworks: H2O and Apache Spark. Both frameworks offer distributed computing leveraging multiple nodes and Milliman was left to decide if they would use their on-premise infrastructure or use an alternative option. In the end, Amazon Web Services as an ideal choice considering their use-case and requirements. To execute on this plan, Milliman engaged with our data and analytics consultants to build a scalable and secure Spark and H2O machine learning platform using AWS solutions Amazon EMR, S3, IAM, and Cloudformation‘.

Early in our engagement with Milliman, we identified the following as the high-level and important project goals:

  1. Security: Ability to limit access to confidential data using access control and security policies
  2. Elasticity and cost: Suggest a cost efficient and yet elastic resource manager to launch Spark and H2O clusters
  3. Cost visibility: Suggest a strategy to get visibility into AWS cost broken down by a given workload
  4. Automation: Provide automated deployment of Apache Spark, H2O clusters and AWS services.
  5. Interactivity: Provide ability to interact with Spark and H2O using IPython/Jupyter Notebooks
  6. Consolidated platform: A single platform to run both H2O and Spark.

Let’s dive into each of these priorities further, with a focus on Milliman’s requirements and how we mapped their requirements to AWS solutions, as well as the details around how we achieved the agreed upon project goals.


Security

Security is the most important part of any cloud application deployment, and it’s a topic we take very seriously when we get involved in client engagements. In this project, one of the security requirements was the ability to control and limit access to datasets hosted on Amazon S3 depending on the environment. For example, Milliman wanted to restrict the access of development and staging clusters to a subset of data and to ensure the production dataset is only available to the production clusters.

To achieve Milliman’s security isolation goal, we took advantage of AWS Identity and Access Management offering (IAM). Using IAM we created:

  1. EC2 Roles: EC2 roles get assigned to EC2 instances during EMR launch process. EC2 roles allow us to limit what subset of dataset each instance can or cannot access. In the case of Spark, this enabled us to limit Spark nodes in the development environment to a particular subset of data.
  2. IAM Role policy: Each EC2 role requires an IAM policy to define what that role can or cannot do. We created appropriate IAM policies for the development and production environments and assigned each policy to its corresponding IAM role.

AmazonS3 Development Environment

Example IAM Policy:


{

 “Statement”: [

   {

       “Resource”: [

           “arn:aws:s3:::parvizexamples/dev/data/*”,

           “arn:aws:s3:::parvizexamples/dev/data”

       ],

       “Action”: [“s3:*” ],

       “Effect”: “Allow”

  }


 ],

“Version”: “2012-10-17”

}


Elasticity And Cost

In general, there are two main approaches in deploying Spark or Hadoop platforms on AWS:

  1. Building your platform either using open-source tools or 3rd party vendor offerings on top of EC2 (i.e. download and deploy Cloudera CDH on EC2)
  2. Leverage AWS’ existing managed services such as EMR.

For this project, our recommendation to the client was to use Amazon’s Elastic MapReduce, which is a managed Hadoop and Spark offering. This allows us to build Spark or Hadoop clusters in a matter of minutes simply by calling APIs. There are a number of reasons why we suggested Amazon Elastic MapReduce instead of building their platform using 3rd party vendor/open-source offerings:

  1. It is very easy and flexible to get started with EMR. Clients don’t have to understand fully every aspect of deploying distributed systems such as configuration and node management. Instead, they can focus on a very few application-specific aspect of the platform configurations, such as how much memory or CPU should be allocated to the application (Spark).
  2. Automated deployment is part of EMR’s offering. Clients don’t have to build an orchestrated node management framework to automate the deployment of various nodes.
  3. Security is already baked-in, and clients don’t have to create extra logic in their application to take advantage of the security features.
  4. Integration with AWS spot offering which is one of the most important features when it comes down to controlling the cost of a distributed system.
  5. Ability to customize the cluster to install and run other frameworks not natively supported by EMR such as H2O


Automation

Automation is an important aspect of any cloud deployment. While it’s easy to build and deploy various applications on top of cloud resources manually or using inconsistent processes, automating the deployment process with consistency is the cornerstone of having a scalable cloud operation. That is especially true for distributed systems where reproducibility of the deployment process is the only way an organization can run and maintain a data processing platform.

As we mentioned earlier, in addition to Amazon EMR, other AWS services such as IAM and S3 were leveraged in this project. In addition to multiple AWS services, Milliman required having multiple AWS environments such as staging, development, and production with different security requirements. In order to automate the deployment of various AWS services while keeping the environments unique, we leveraged Amazon’s Cloudformation scripts.


Cost Visibility

One common requirement that regularly is overlooked is the ability to have fine-grained visibility to your cloud costs. Most organizations delay this requirement until they’re further down in their path to cloud adoption which can be a costly mistake. We were glad to hear that it was, in fact, part of Milliman’s project requirement to track AWS costs down to specific workloads.

Fortunately, AWS provides an offering called Cost Allocation Tags that can be tailored towards meeting a client’s cost visibility requirements. With Cost Allocation Tags, clients can tag their AWS resources with specific keywords that AWS can use to generate a cost report aggregated by customer’s tags. Specifically for this project, we instructed Milliman to use tagging feature of EMR to tag each cluster with workload specific keywords that can later be recognized in AWS’ billing report. For example, the following command line demonstrates how to tag an EMR cluster with workload specific tags:
aws emr create-cluster –name Spark –release-label emr-4.0.0 –tags Name=”Spark_Recommendation_Model_Generator” –applications Name=Hadoop Name=Spark

….

….


Interactivity

While building automated Spark and H2O clusters using AWS EMR and Cloudformation is a great start to building a data processing platform, at times developers need an interactive way of working with the platform. In other words, using the command-line to work with data processing platform may work in some use-cases, but in other cases UI access is critical for developers to perform their job by engaging with the cluster in an interactive fashion.

Part of our data & analytics offering at Datapipe, we help customers picking the right 3rd party tools/vendor for a given requirement. For Milliman, to build an interactive UI that engages with EMR Spark and H2O clusters, we leveraged IPython/Jupyter notebooks. Jupyter provides integration with Hadoop, Spark, and other platforms and allows developers to type their code in a UI and submit to the cluster for execution in real time. In order to deploy Jupyter notebooks, we leveraged EMR’s Bootstrap feature that allows customers to install custom software on EMR EC2 nodes.


Consolidated platform

Lastly, we needed to build a single platform to host both H2O and Spark frameworks. While H2O is not a supported platform on EMR, using Amazon EMR Bootstrap action feature, we were able to install H2O on EMR nodes and avoided creating a separate platform to host H2O. In other words, Milliman now has the ability to launch both Spark and H2O clusters using a single platform.

Conclusion

Data processing platforms require various considerations including but not limited to security, scalability, cost, interactivity, and automation. Fortunately by defining a set of clear project objectives and goals, and mapping those objectives to the applicable solutions (in this case AWS offerings), companies can meet their data processing requirements efficiently and effectively. We hope that this post demonstrated an example of how this process can be achieved using AWS offerings. If you have any questions or like to talk to one of our consultants, please contact us.

In the next set of blog posts we’ll provide some insight into how we use data science and data driven approach to gain insight into operational metrics.