How to Frame Your Business Problem for Automatic Machine Learning

Over the last several years, machine learning has become an integral part of many organizations’ decision-making at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai has developed Driverless AI, a product that automates several time-consuming aspects of a typical data science workflow: data visualization, feature engineering, predictive modeling, and model explanation. In this post, I will describe Driverless AI, how you can frame your business problem to get the most out of this automatic machine learning product, and how automatic machine learning is used to create business value.

What is Driverless AI and what kind of business problems does it solve?

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI is currently targeting business applications like loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems.)

How do you frame business problems in a data set for Driverless AI?

The data read into Driverless AI must contain one entity per row, such as a customer, patient, piece of equipment, or financial transaction. Each row must also contain the outcome you will be trying to predict with similar data in the future, such as whether that customer used a promotion, whether that patient was readmitted to the hospital within thirty days of being released, whether that piece of equipment required maintenance, or whether that financial transaction was fraudulent. (In data science speak, Driverless AI requires “labeled” data.) Driverless AI runs through your data many, many times looking for interactions, insights, and business drivers of the phenomenon described by the provided data set. Driverless AI can handle simple data quality problems, but it currently requires that all data for a single predictive model live in the same data set, and that data set must have already undergone standard ETL, cleaning, and normalization routines before being loaded into Driverless AI.
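To make that layout concrete, here is a minimal sketch of a labeled, one-entity-per-row data set for a churn problem. The column names and values are hypothetical and only illustrate the shape of the data described above.

```python
# A minimal sketch of a "labeled" data set: one customer per row plus the
# outcome column to be predicted. All column names and values are hypothetical.
import pandas as pd

churn_data = pd.DataFrame({
    "customer_id":     [101, 102, 103],             # one entity (customer) per row
    "tenure_months":   [24, 3, 60],                  # candidate business drivers
    "monthly_spend":   [79.50, 20.00, 115.25],
    "support_tickets": [1, 4, 0],
    "churned":         ["no", "yes", "no"],          # the label to be predicted
})

# Save as a plain text file of the kind Driverless AI can ingest.
churn_data.to_csv("churn_training_data.csv", index=False)
```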

How do you use Driverless AI results to create commercial value?

Commercial value is generated by Driverless AI in a few ways.

● Driverless AI empowers data scientists and data analysts to work on projects faster and more efficiently, using automation and state-of-the-art computing power to accomplish in minutes or hours tasks that can take humans months.

● As in many other industries, automation leads to standardization of business processes, enforces best practices, and eventually drives down the cost of delivering the final product – in this case a predictive model.

● Driverless AI makes deploying predictive models easy – typically a difficult step in the data science process. In large organizations, value from predictive modeling is typically realized when a predictive model is moved from a data analyst’s or data scientist’s development environment into a production deployment setting, where the model runs on live data and quickly, automatically makes decisions that make or save money. Driverless AI provides both Java- and Python-based technologies to make production deployment simpler.

Moreover, the system was designed with interpretability and transparency in mind. Every prediction made by a Driverless AI model can be explained to business users, so the system is viable even for regulated industries.

Customer success stories with Driverless AI

PayPal tried Driverless AI on a collusion fraud use case and found that, running on a laptop for just 2 hours, Driverless AI yielded impressive fraud detection accuracy; running on GPU-enhanced hardware, it produced the same accuracy in just 20 minutes. The Driverless AI model was more accurate than PayPal’s existing predictive model, and the system found the same insights in their data that their data scientists did! The system also found new features in their data that had not been used before for predictive modeling. For more information about the PayPal use case, click here.

G5, a real estate marketing optimization firm, uses Driverless AI in its Intelligent Marketing Cloud to help clients target marketing spending for property management. Empowered by Driverless AI technology, marketers can quickly prioritize and convert highly qualified inbound leads from G5’s Intelligent Marketing Cloud platform with 95 percent accuracy for serious purchase intent. To learn more about how G5 uses Driverless AI, check out:
https://www.h2o.ai/g5-h2o-ai-partner-to-deliver-ai-optimization-for-real-estate-marketing/

How can you try Driverless AI?

Visit: https://www.h2o.ai/driverless-ai/ and download your free 21-day evaluation copy.

We are happy to help you get started installing and using Driverless AI, and here are some resources we’ve put together to help with that process:

● Installing Driverless AI: https://www.youtube.com/watch?v=swrqej9tFcU

● Launching an Experiment with Driverless AI: https://www.youtube.com/watch?v=bw6CbZu0dKk

● Driverless AI Webinars: https://www.gotostage.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf

● Driverless AI Documentation: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html

Democratize care with AI — AI to do AI for Healthcare

We are very excited to have Prashant Natarajan (@natarpr) join us, along with Sanjay Joshi, in our vision to change the world of healthcare with AI. Health is wealth, and the one most worth saving. They bring invaluable domain knowledge and context to our cause.

As one of our customers likes to say, healthcare should be optimized for health and outcomes for those in need of care. Health / Care, as in health divided by care: how healthy can one be with the least amount of care? We are investing in health because it is the right thing to do over the long term, especially with the convergence of finance, life insurance, and retail toward health. So many opportunities for cross-pollination!

With the support of our strong ecosystem, community, and customers, H2O.ai will democratize care with AI, making it faster, cheaper, easier, and accessible to all. Machine learning touches lives; with domain scientists on our side, we can accelerate change for the problems most in need of it. We are fortunate to have a team and culture that allow us to bring great products to the marketplace with high velocity. Stay tuned for Driverless AI for Health, one microservice AI model at a time!

If you feel inspired by the immense opportunities to serve humanity, please join Prashant Natarajan and the www.h2o.ai community on our mission!

this will be fun! Sri

Developing and Operationalizing H2O.ai Models with Azure

This post originally appeared here. It was authored by Daisy Deng, Software Engineer, and Abhinav Mithal, Senior Engineering Manager, at Microsoft.

The focus on machine learning and artificial intelligence has soared over the past few years, as fast, scalable, and reliable ML and AI solutions are increasingly viewed as vital to business success. H2O.ai has lately been gaining fame in the AI world for its fast in-memory ML algorithms and for easy consumption in production. H2O.ai is designed to provide a fast, scalable, and open-source ML platform, and it recently added support for deep learning as well. There are many ways to run H2O.ai on Azure. This post provides an overview of how to efficiently develop and operationalize H2O.ai ML models on Azure.

H2O.ai can be deployed in many ways, including on a single node, on a multi-node cluster, in a Hadoop cluster, and in an Apache Spark cluster. H2O.ai is written in Java, so it naturally supports Java APIs, and because the standard Scala backend is a Java VM, it also supports a Scala API. It also has rich interfaces for Python and R: the h2o R and h2o Python packages give R and Python users access to H2O.ai algorithms and functionality. R and Python scripts that use the h2o library interact with H2O clusters through REST API calls.
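As a rough illustration of that workflow, the sketch below uses the h2o Python package to connect to a cluster, load a CSV file, and train a model; the file path and column names are placeholder assumptions, not taken from the original post.

```python
# A minimal sketch of the h2o Python package; each call is relayed to the
# running H2O cluster over its REST API. File path and column names are
# placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()                                      # connect to (or start) a local H2O cluster

frame = h2o.import_file("churn_training_data.csv")
frame["churned"] = frame["churned"].asfactor()  # mark the label as categorical

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=["tenure_months", "monthly_spend", "support_tickets"],
            y="churned",
            training_frame=frame)

predictions = model.predict(frame)
print(predictions.head())
```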

With the rising popularity of Apache Spark, Sparkling Water was developed to combine H2O functionality with Apache Spark. Sparkling Water provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster. A typical way of using the two together is to do data munging in Apache Spark while running training and scoring in H2O. Apache Spark has built-in support for Python through PySpark, and pysparkling provides bindings between Spark and H2O to run Sparkling Water applications in Python. Similarly, sparklyr provides the R interface to Spark, and rsparkling provides bindings between Spark and H2O to run Sparkling Water applications in R.
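For Python users, that division of labor might look roughly like the sketch below: munge in Spark, then hand the data to H2O through pysparkling. This is only a sketch assuming a Sparkling Water installation of that era; the file path and column names are placeholders.

```python
# A hedged sketch of Sparkling Water from PySpark: data munging in Spark,
# training in H2O. Assumes pysparkling/Sparkling Water is installed; file
# path and column names are placeholders.
from pyspark.sql import SparkSession
from pysparkling import H2OContext
from h2o.estimators import H2OGradientBoostingEstimator

spark = SparkSession.builder.appName("sparkling-water-sketch").getOrCreate()
hc = H2OContext.getOrCreate(spark)          # launches H2O on the Spark executors

spark_df = spark.read.csv("churn_training_data.csv", header=True, inferSchema=True)
clean_df = spark_df.dropna()                # example Spark-side munging

h2o_frame = hc.as_h2o_frame(clean_df)       # hand the data off to the H2O cluster
h2o_frame["churned"] = h2o_frame["churned"].asfactor()

model = H2OGradientBoostingEstimator()
model.train(y="churned", training_frame=h2o_frame)   # remaining columns are predictors
```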

Table 1 and Figure 1 below show more information about how to run Sparkling Water applications on Spark from R and Python.

Model Development

The Data Science Virtual Machine (DSVM) is a great tool for starting to develop ML models in a single-node environment. H2O.ai comes preinstalled for Python on the DSVM. If you use R (on Ubuntu), you can follow the script in our earlier blog post to set up your environment. If you are dealing with large datasets, you may consider using a cluster for development. Below are the two recommended choices for cluster-based development.

Azure HDInsight offers fully managed clusters that come in many handy configurations. Azure HDInsight allows users to create Spark clusters with H2O.ai and all the dependencies pre-installed. Python users can experiment with it by following the Jupyter notebook examples that come with the cluster, and R users can follow our previous post to set up an environment using RStudio for development. Once development is finished and you’ve trained your model, you can save the trained model for scoring. H2O allows you to save the trained model as a MOJO file, and a JAR file, h2o-genmodel.jar, is also generated when the model is saved. This JAR file is needed when you want to load your trained model from Java or Scala code, while Python and R code can load the trained model directly through the H2O API.
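For reference, here is a short sketch of that save step from Python; the directory paths are placeholders, and `model` is assumed to be an already trained H2O model.

```python
# A short sketch of persisting a trained H2O model; paths are placeholders
# and `model` is assumed to be an already trained H2O estimator.
import h2o

# Binary format: reloadable later from Python or R through the H2O API.
binary_path = h2o.save_model(model, path="/models/churn", force=True)

# MOJO format: a portable artifact for Java/Scala scoring code; also download
# the matching h2o-genmodel.jar needed to load the MOJO on the JVM.
mojo_path = model.download_mojo(path="/models/churn", get_genmodel_jar=True)

# Python (or R) can reload the binary model directly, no JAR required.
reloaded_model = h2o.load_model(binary_path)
```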

If you are looking for low-cost clusters, you can use the Azure Distributed Data Engineering Toolkit (AZTK) to start a Docker-based Spark cluster on top of Azure Batch with low-priority VMs. A cluster created through AZTK is accessible for development through SSH or Jupyter notebooks. Compared to the Jupyter notebooks on Azure HDInsight clusters, however, the AZTK notebook environment is rudimentary and does not come pre-configured for H2O.ai model development. Users also need to save their development work to external durable storage, because once an AZTK Spark cluster is torn down it cannot be restored.

Table 2 shows a summary of using the three environments for model development.

Batch Scoring and Model Retraining

Batch scoring is also referred to as offline scoring. It usually deals with significant amounts of data and may require a lot of processing time. Retraining addresses model drift, where a model no longer accurately captures the patterns in newer data. Batch scoring and model retraining are both forms of batch processing, and they can be operationalized in a similar fashion.

If you have many parallel tasks, each of which can be handled by a single VM, Azure Batch is a great tool for this type of workload. Azure Batch Shipyard provides code-free job configuration and creation on Azure Batch with Docker containers. We can easily include Apache Spark and H2O.ai in the Docker image and use them with Azure Batch Shipyard. In Azure Batch Shipyard, each model retraining or batch scoring run can be configured as a task. This type of job, consisting of several independent tasks, is also known as an “embarrassingly parallel” workload, which is fundamentally different from distributed computing, where communication between tasks is required to complete a job. Interested readers can continue to read more from this wiki.
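To make the per-task unit of work concrete, here is a hedged sketch of the kind of scoring script each containerized task might run; the model path and input/output file names are hypothetical and would come from the task configuration, not from the original post.

```python
# A hedged sketch of a per-task batch scoring script that could run inside a
# Batch Shipyard container. Model path and input/output file names are
# hypothetical; a real task would receive them from its configuration.
import sys
import h2o

def score_file(model_path, input_csv, output_csv):
    h2o.init()
    model = h2o.load_model(model_path)       # trained model saved during development
    data = h2o.import_file(input_csv)        # this task's slice of the batch workload
    predictions = model.predict(data)
    h2o.export_file(predictions, path=output_csv, force=True)

if __name__ == "__main__":
    # e.g. python score.py /models/churn/model batch_001.csv scored_001.csv
    score_file(sys.argv[1], sys.argv[2], sys.argv[3])
```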

If the batch processing job needs a cluster for distributed processing, for example because the amount of data is large or it’s more cost-effective to use a cluster, you can use AZTK to create a Docker-based Spark cluster. H2O.ai can easily be included in the Docker image, and the process of cluster creation, job submission, and cluster deletion can be automated and triggered by an Azure Function App. With this method, however, users need to configure the cluster and manage container images themselves. If you want a fully managed cluster with detailed monitoring, an Azure HDInsight cluster is a better choice. Currently we can use the Azure Data Factory Spark Activity to submit batch jobs to the cluster; however, this requires keeping an HDInsight cluster running all the time, so it is mostly relevant for use cases with frequent batch processing.

Table 3 shows a comparison of the three ways of running batch processing in Spark where H2O.ai can be easily integrated in each computing environment.

Online Scoring

Online scoring means scoring with a short response time, so it is also referred to as real-time scoring. In general, online scoring deals with single-point or mini-batch predictions and should use pre-computed, cached features when possible. We can load the ML models and relevant libraries and run scoring in any application, but if a microservice architecture is preferred to separate concerns and decouple dependencies, we recommend implementing online scoring as a web service with a REST API. Web services that score with an H2O ML model are usually written in Java, Scala, or Python. As mentioned in the Model Development section, the saved H2O model is in the MOJO format, and the h2o-genmodel.jar file is generated along with it. Web services written in Java or Scala can use this JAR file to load the saved model for scoring, while web services written in Python can call the Python API directly to load the saved model.
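As a rough illustration of the Python path, here is a minimal sketch of such a scoring service; Flask, the model path, and the request fields are assumptions made for illustration and are not part of the original post.

```python
# A minimal sketch of a Python online-scoring web service. Flask, the model
# path, and the request fields are assumptions made for illustration only.
import h2o
from flask import Flask, jsonify, request

app = Flask(__name__)

h2o.init()
model = h2o.load_model("/models/churn/model")   # placeholder path to the saved model

@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON object of feature values, e.g. {"tenure_months": 3, ...}
    row = request.get_json()
    frame = h2o.H2OFrame({k: [v] for k, v in row.items()})
    prediction = model.predict(frame).as_data_frame().to_dict("records")[0]
    return jsonify(prediction)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```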

Azure provides many choices to host web services.

Azure Web App is an Azure PaaS offering for hosting web applications. It provides a fully managed platform that allows users to focus on their application. Recently, Azure Web App Service for Containers, built on Azure Web App on Linux, was released to host containerized web applications. Azure Container Service with Kubernetes (AKS) provides an effortless way to create, configure, and manage a cluster of VMs that run containerized applications. Both Azure Web App Service for Containers and Azure Container Service provide great portability and run-environment customization for web applications. The Azure Machine Learning (AML) Model Management CLI/API provides an even simpler way to deploy and manage web services on ACS with Kubernetes. Table 4 below compares these three Azure services for hosting online scoring.

Edge Scoring

Edge scoring means executing scoring on internet-of-things (IoT) devices. With edge scoring, the devices perform analytics and make intelligent decisions as soon as the data is collected, without having to send the data to a central processing center. Edge scoring is important in use cases where data privacy requirements are high or the desired scoring latency is extremely low. Enabled by container technology, Azure Machine Learning, together with Azure IoT Edge, provides an easy way to deploy machine learning models to Azure IoT Edge devices. With AML containers, using H2O.ai on the edge takes minimal effort. Check out our recent blog post titled Artificial Intelligence and Machine Learning on the Cutting Edge for more details on how to enable edge intelligence.

Summary

In this post, we discussed a developer’s journey for building and deploying H2O.ai-based solutions with Azure services, covering model development, model retraining, batch scoring, and online scoring, together with edge scoring. Our AI development journey in this post focused on H2O.ai; however, these learnings are not specific to H2O.ai and can be applied just as easily to other Spark-based solutions. As more and more frameworks such as TensorFlow and Microsoft Cognitive Toolkit (CNTK) are enabled to run on Spark, we believe these learnings will become even more valuable. Understanding the right product choices based on business and technical needs is fundamental to the success of any project, and we hope the information in this post proves useful in yours.

Daisy & Abhinav

Happy Holidays from H2O.ai

Dear Community,

Your intelligence, support and love have been the strength behind an incredible year of growth, product innovation, partnerships, investments and customer wins for H2O and AI in 2017. Thank you for answering our rallying call to democratize AI with our maker culture.

Our mission to make AI ubiquitous is still fresh as dawn and our creativity new as spring. We are only getting started, learning, rising from each fall. H2O and Driverless AI are just the beginnings.

As we look into 2018, we see prolific innovation to make AI accessible to everyone. Simplicity that opens scale. A focus on making experiments faster, easier, and cheaper. We are so happy that you will be at the center of our journey, and we look forward to delivering many more magical customer experiences.

On behalf of the team and management at H2O, I wish you all a wonderful holiday: deep, meaningful time spent with yourself and your loved ones, and a refreshed return for a winning 2018!

Gratitude for your partnership in our beautiful journey – it’s just begun!

this will be fun,


Sri Ambati
CEO & Co-Founder

P.S. #H2OWorld was an amazing experience. I invite you to watch the keynote and more than 40 talks and conversations.

H2O.ai Raises $40 Million to Democratize Artificial Intelligence for the Enterprise

Driverless AI


Series C round led by Wells Fargo and NVIDIA

MOUNTAIN VIEW, CA – November 30, 2017 – H2O.ai, the leading company bringing AI to enterprises, today announced it has completed a $40 million Series C round of funding led by Wells Fargo and NVIDIA with participation from New York Life, Crane Venture Partners, Nexus Venture Partners and Transamerica Ventures, the corporate venture capital fund of Transamerica and Aegon Group. The Series C round brings H2O.ai’s total amount of funding raised to $75 million. The new investment will be used to further democratize advanced machine learning and for global expansion and innovation of Driverless AI, an automated machine learning and pipelining platform that uses “AI to do AI.”

H2O.ai continued its juggernaut growth in 2017 as evidenced by new platforms and partnerships. The company launched Driverless AI, a product that automates AI for non-technical users and introduces visualization and interpretability features that explain the data modeling results in plain English, thus fostering further adoption and trust in artificial intelligence.

H2O.ai has partnered with NVIDIA to democratize machine learning on the NVIDIA GPU compute platform. It has also partnered with IBM, Amazon AWS and Microsoft Azure to bring its best-in-class machine learning platform to other infrastructures and the public cloud.

H2O.ai co-founded the GPU Open Analytics Initiative (GOAI) to create an ecosystem for data developers and researchers to advance data science using GPUs, and has launched H2O4GPU, a collection of the fastest GPU algorithms on the market capable of processing massive amounts of unstructured data up to 40x faster than on traditional CPUs.

“AI is eating both hardware and software,” said Sri Ambati, co-founder and CEO at H2O.ai. “Billions of devices are generating unprecedented amounts of data, which truly calls for distributed machine learning that is ubiquitous and fast. Our focus on automating machine learning makes it easily accessible to large enterprises. Our maker culture fosters deep trust and teamwork with our customers, and our partnerships with vendors across industry verticals bring significant value and growth to our community. It is quite supportive and encouraging to see our partners lead a significant funding round to help H2O.ai deliver on its mission.”

“AI is an incredible force that’s sweeping across the technology landscape,” said Jeff Herbst, vice president of business development at NVIDIA. “H2O.ai is exceptionally well positioned in this field as it pursues its mission to become the world’s leading data science platform for the financial services industry and beyond. Its use of GPU-accelerated AI provides powerful tools for customers, and we look forward to continuing our collaboration with them.”

“It is exhilarating to have backed the H2O.ai journey from day zero: the journey from a PowerPoint to becoming the enterprise AI platform essential for thousands of corporations across the planet,” said Jishnu Bhattacharjee, managing director at Nexus Venture Partners. “AI has arrived, transforming industries as we know them. Exciting scale ahead for H2O, so fasten your seat belts!”

As the leading open-source platform for machine learning, H2O.ai is leveling the playing field in a space where much of the AI innovation and talent is locked up inside major tech titans and thus inaccessible to other enterprises. This is precisely why over 100,000 data scientists, 12,400 organizations and nearly half of the Fortune 500 have embraced H2O.ai’s suite of products that pack the productivity of an elite data science team into a single solution.

“We are delighted to lead H2O.ai’s funding round. We have been following the company’s progress and have been impressed by its high-caliber management team and success in establishing an open-source machine learning platform with wide adoption across many industries. We are excited to support the next phase of their development,” said Basil Darwish, director of strategic investments at Wells Fargo Securities.

Beyond its open source community, H2O.ai is transforming several industry verticals and building strong customer partnerships. Over the past 18 months, the company has worked with PwC to build PwC’s “GL.ai,” a revolutionary bot that uses AI and machine learning to ‘x-ray’ a business and detect anomalies in the general ledger. The product was named the ‘Audit Innovation of the Year‘ by the International Accounting Bulletin in October 2017.

H2O’s signature community conference, H2O World, will take place December 4-5, 2017 at the Computer History Museum in Mountain View, Calif.

About H2O.ai

H2O.ai’s mission is to democratize machine learning through its leading open source software platform. Its flagship product, H2O, empowers enterprise clients to quickly deploy machine learning and predictive analytics to accelerate business transformation for critical applications such as predictive maintenance and operational intelligence. H2O.ai recently launched Driverless AI, the first solution that allows any business — even ones without a team of talented data scientists — to implement AI to solve complex business problems. The product was reviewed and selected as Editor’s Choice in InfoWorld. Customers include Capital One, Progressive Insurance, Comcast, Walgreens and Kaiser Permanente. For more information and to learn more about how H2O.ai is transforming businesses, visit www.h2o.ai.

Contacts

VSC for H2O.ai
Kayla Abbassi
Senior Account Executive
kayla@vscpr.com