New features in H2O 3.18

Wolpert Release (H2O 3.18)

There’s a new major release of H2O and it’s packed with new features and fixes!

We named this release after David Wolpert, who is famous for inventing Stacking (aka Stacked Ensembles). Stacking is a central component in H2O AutoML, so we’re very grateful for his contributions to machine learning! He is also famous for the “No Free Lunch” theorem, which states, roughly, that no single algorithm will be the best in all cases. In other words, there’s no magic bullet. This is precisely why stacking is such a powerful and practical algorithm: you never know in advance whether a Deep Neural Network, a GBM or a Random Forest will be the best algorithm for your problem. When you combine all of these into a stacked ensemble, you can benefit from the strengths of each of these algorithms. You can read more about Dr. Wolpert and his work here.

Distributed XGBoost

The central feature of this release is support for distributed XGBoost, as well as other XGBoost enhancements and bug fixes. We are bringing XGBoost support to more platforms (including older versions of CentOS/Ubuntu) and we now support multi-node XGBoost training (though this feature is still in “beta”).

There are a number of XGBoost bug fixes, such as the ability to use XGBoost models after they have been saved to disk and reloaded into the H2O cluster, and fixes to the XGBoost MOJO. With all the improvements to H2O’s XGBoost, we are much closer to adding XGBoost to AutoML, and you can expect to see that in a future release. You can read more about the H2O XGBoost integration in the XGBoost User Guide.

AutoML & Stacked Ensembles

One big addition to H2O Automatic Machine Learning (AutoML) is the ability to turn off certain algorithms. By default, H2O AutoML will train Gradient Boosting Machines (GBM), Random Forests (RF), Generalized Linear Models (GLM), Deep Neural Networks (DNN) and Stacked Ensembles. However, sometimes it may be useful to turn off some of those algorithms. In particular, if you have sparse, wide data, you may choose to turn off the tree-based models (GBMs and RFs). Conversely, if tree-based models perform comparatively well on your data, then you may choose to turn off GLMs and DNNs. Keep in mind that Stacked Ensembles benefit from diversity of the set of base learners, so keeping “bad” models may still improve the overall performance of the Stacked Ensembles created by the AutoML run. The new argument is called exclude_algos and you can read more about it in the AutoML User Guide.
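In the Python client, excluding algorithms looks roughly like the following. This is a minimal sketch that assumes a running H2O cluster; the file path and the `response` column name are placeholders, not from the original post:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # connect to (or start) a local H2O cluster

# Placeholder dataset; substitute your own frame and response column
train = h2o.import_file("path/to/train.csv")

# Skip the tree-based models, e.g. for sparse, wide data
aml = H2OAutoML(max_models=10,
                exclude_algos=["GBM", "DRF"],
                seed=1)
aml.train(y="response", training_frame=train)
print(aml.leaderboard)
```

Conversely, passing `exclude_algos=["GLM", "DeepLearning"]` would keep only the tree-based models and the Stacked Ensembles.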

There are several improvements to the Stacked Ensemble functionality in H2O 3.18. The big new feature is the ability to fully customize the metalearning algorithm. The default metalearner (a GLM with non-negative weights) usually does pretty well; however, you are encouraged to experiment with other algorithms (such as GBM) and various hyperparameter settings. In the next major release, we will add the ability to easily perform a grid search on the hyperparameters of the metalearner algorithm using the standard H2O Grid Search functionality.
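From Python, choosing a different metalearner looks roughly like this (a sketch, not from the original post: it assumes a running cluster, a training frame `train`, and a list `base_models` of cross-validated base models trained with `keep_cross_validation_predictions=True`; the parameter name follows the H2O Python API):

```python
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# base_models: cross-validated H2O models trained on the same frame
ensemble = H2OStackedEnsembleEstimator(
    base_models=base_models,
    metalearner_algorithm="gbm",  # instead of the default GLM metalearner
    seed=1)
ensemble.train(y="response", training_frame=train)
```

A GBM metalearner can capture non-linear interactions among the base-model predictions that the default non-negative GLM cannot.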


Below is a list of some of the highlights from the 3.18 release. As usual, you can see a list of all the items that went into this release in the h2o-3 GitHub repository.

New Features:

  • PUBDEV-4652 – Added support for XGBoost multi-node training in H2O
  • PUBDEV-4980 – Users can now exclude certain algorithms during an AutoML run
  • PUBDEV-5086 – Stacked Ensemble should allow user to pass in a customized metalearner
  • PUBDEV-5224 – Users can now specify a seed parameter in Stacked Ensemble
  • PUBDEV-5204 – GLM: Allow user to specify a list of interactions terms to include/exclude


Bug Fixes & Improvements:

  • PUBDEV-4585 – Fixed an issue that caused XGBoost binary save/load to fail
  • PUBDEV-4593 – Fixed an issue that caused a Levenshtein Distance Normalization Error
  • PUBDEV-5133 – In Flow, the scoring history plot is now available for GLM models
  • PUBDEV-5195 – Fixed an issue in XGBoost that caused MOJOs to fail to work without manually adding the Commons Logging dependency
  • PUBDEV-5215 – Users can now specify interactions when running GLM in Flow
  • PUBDEV-5315 – Fixed an issue that caused XGBoost OpenMP to fail on Ubuntu 14.04


Docs:

  • PUBDEV-5311 – The H2O-3 download site now includes a link to the HTML version of the R documentation

Download here:

Developing and Operationalizing Models with Azure

This post originally appeared here. It was authored by Daisy Deng, Software Engineer, and Abhinav Mithal, Senior Engineering Manager, at Microsoft.

The focus on machine learning and artificial intelligence has soared over the past few years, even as fast, scalable and reliable ML and AI solutions are increasingly viewed as vital to business success. H2O has lately been gaining fame in the AI world for its fast in-memory ML algorithms and for easy consumption in production. H2O is designed to provide a fast, scalable, and open source ML platform, and it recently added support for deep learning as well. There are many ways to run H2O on Azure. This post provides an overview of how to efficiently develop and operationalize H2O ML models on Azure.

H2O can be deployed in many ways, including on a single node, on a multi-node cluster, in a Hadoop cluster and in an Apache Spark cluster. H2O is written in Java, so it naturally supports Java APIs; since the standard Scala backend is a Java VM, H2O also supports a Scala API. It also has rich interfaces for Python and R. The h2o R and h2o Python packages help R and Python users access H2O algorithms and functionality. R and Python scripts that use the h2o library interact with the H2O clusters using REST API calls.
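For example, a Python script attaches to a running H2O cluster roughly like this (a sketch; the address and port are placeholders for wherever your cluster runs):

```python
import h2o

# The h2o package wraps the REST API: h2o.init() attaches to a running
# cluster at the given address (or starts a local one if none is found)
h2o.init(ip="10.0.0.4", port=54321)

# Every client call is translated into a REST request against the
# cluster, e.g. listing the keys currently held in the cluster:
print(h2o.ls())
```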

With the rising popularity of Apache Spark, Sparkling Water was developed to combine H2O functionality with Apache Spark. Sparkling Water provides a way to launch the H2O service on each Spark executor in the Spark cluster, forming an H2O cluster. A typical way of using the two together is to do data munging in Apache Spark while running training and scoring with H2O. Apache Spark has built-in support for Python through PySpark, and pysparkling provides bindings between Spark and H2O to run Sparkling Water applications in Python. Sparklyr provides the R interface to Spark, and rsparkling provides bindings between Spark and H2O to run Sparkling Water applications in R.
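In Python, launching H2O on the Spark executors via pysparkling looks roughly like this (a sketch assuming a SparkSession named `spark` and the pysparkling package for your Spark version; the data path is a placeholder):

```python
from pysparkling import H2OContext

# Start an H2O node inside each Spark executor, forming an H2O cluster
hc = H2OContext.getOrCreate(spark)

# Munge data with Spark, then hand it to H2O for training/scoring
spark_df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
h2o_frame = hc.as_h2o_frame(spark_df)
```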

Table 1 and Figure 1 below show more information about how to run Sparkling Water applications on Spark from R and Python.

Model Development

The Data Science Virtual Machine (DSVM) is a great tool for starting ML model development in a single-node environment. H2O comes preinstalled for Python on the DSVM. If you use R (on Ubuntu), you can follow the script in our earlier blog post to set up your environment. If you are dealing with large datasets, you may consider using a cluster for development. Below are the two recommended choices for cluster-based development.

Azure HDInsight offers fully-managed clusters that come in many handy configurations. Azure HDInsight allows users to create Spark clusters with H2O, with all the dependencies pre-installed. Python users can experiment with it by following the Jupyter notebook examples that come with the cluster. R users can follow our previous post to set up the environment to use RStudio for development. Once you have finished developing and training your model, you can save it for scoring. H2O allows you to save the trained model as a MOJO file. A JAR file, h2o-genmodel.jar, is also generated when the model is saved. This JAR file is needed when you want to load your trained model in Java or Scala code, while Python and R code can load the trained model directly using the H2O API.
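Exporting the MOJO from Python looks roughly like this (a sketch; `model` stands for any trained H2O model and the output path is a placeholder). Setting `get_genmodel_jar=True` also writes out h2o-genmodel.jar alongside the MOJO:

```python
import h2o

# Export the trained model as a MOJO, together with h2o-genmodel.jar,
# which Java/Scala scoring code needs in order to load the MOJO
mojo_path = model.download_mojo(path="/models", get_genmodel_jar=True)

# In newer H2O releases, Python can reload the MOJO directly
reloaded = h2o.import_mojo(mojo_path)
```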

If you are looking for low-cost clusters, you can use the Azure Distributed Data Engineering Toolkit (AZTK) to start a Docker-based Spark cluster on top of Azure Batch with low-priority VMs. The cluster created through AZTK is accessible for development through SSH or Jupyter notebooks. Compared to the Jupyter notebooks on Azure HDInsight clusters, however, the AZTK notebook environment is rudimentary and does not come pre-configured for model development. Users also need to save their development work to external durable storage, because once an AZTK Spark cluster is torn down it cannot be restored.

Table 2 shows a summary of using the three environments for model development.

Batch Scoring and Model Retraining

Batch scoring is also referred to as offline scoring. It usually deals with significant amounts of data and may require a lot of processing time. Retraining addresses model drift, where the model no longer accurately captures patterns in newer data. Batch scoring and model retraining are both considered batch processing, and they can be operationalized in a similar fashion.

If you have many parallel tasks, each of which can be handled by a single VM, Azure Batch is a great tool to handle this type of workload. Azure Batch Shipyard provides code-free job configuration and creation on Azure Batch with Docker containers. We can easily include Apache Spark and H2O in the Docker image and use them with Azure Batch Shipyard. In Azure Batch Shipyard, each model retraining or batch scoring run can be configured as a task. This type of job, consisting of several independent tasks, is also known as an “embarrassingly parallel” workload, which is fundamentally different from distributed computing, where communication between tasks is required to complete a job. Interested readers can continue to read more from this wiki.
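The embarrassingly parallel pattern can be illustrated in plain Python (a toy sketch with a stand-in scoring function; in Shipyard each task would instead be a container running a real retraining or scoring job):

```python
from concurrent.futures import ThreadPoolExecutor

def score_partition(partition):
    # Each task scores one data partition independently; no
    # communication between tasks is needed, which is what makes
    # the workload "embarrassingly parallel".
    return [x * 2 for x in partition]  # stand-in for real model scoring

partitions = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(score_partition, partitions))

print(results)  # [[2, 4], [6, 8], [10, 12]]
```

Because the tasks never exchange data, the job scales simply by adding workers, which is exactly the property Azure Batch exploits.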

If the batch processing job needs a cluster for distributed processing, for example because the amount of data is large or it’s more cost-effective to use a cluster, you can use AZTK to create a Docker-based Spark cluster. H2O can be easily included in the Docker image, and the process of cluster creation, job submission, and cluster deletion can be automated and triggered by an Azure Function App. However, with this method, users need to configure the cluster and manage container images themselves. If you want a fully-managed cluster with detailed monitoring, an Azure HDInsight cluster is a better choice. Currently we can use the Azure Data Factory Spark Activity to submit batch jobs to the cluster. However, this requires having an HDInsight cluster running all the time, so it’s mostly relevant in use cases with frequent batch processing.

Table 3 shows a comparison of the three ways of running batch processing in Spark; H2O can be easily integrated in each computing environment.

Online Scoring

Online scoring means scoring with a short response time, which is why it is also referred to as real-time scoring. In general, online scoring deals with single-point or mini-batch predictions and should use pre-computed, cached features when possible. We can load the ML models and the relevant libraries and run scoring in any application. If a microservice architecture is preferred, to separate concerns and decouple dependencies, we recommend implementing online scoring as a web service with a REST API. Web services for scoring with an H2O ML model are usually written in Java, Scala or Python. As mentioned in the Model Development section, the saved H2O model is in the MOJO format, and the h2o-genmodel.jar file is generated alongside it. Web services written in Java or Scala can use this JAR file to load the saved model for scoring, while web services written in Python can call the Python API directly to load the saved model.

Azure provides many choices to host web services.

Azure Web App is an Azure PaaS offering for hosting web applications. It provides a fully-managed platform, allowing users to focus on their application. Recently, Azure Web App Service for Containers, built on Azure Web App on Linux, was released to host containerized web applications. Azure Container Service with Kubernetes (AKS) provides an effortless way to create, configure and manage a cluster of VMs to run containerized applications. Both Azure Web App Service for Containers and Azure Container Service provide great portability and run-environment customization for web applications. The Azure Machine Learning (AML) Model Management CLI/API provides an even simpler way to deploy and manage web services on ACS with Kubernetes. Table 4 below compares the three Azure services for hosting online scoring.

Edge Scoring

Edge scoring means executing scoring on internet-of-things (IoT) devices. With edge scoring, the devices perform analytics and make intelligent decisions as the data is collected, without having to send the data to a central processing center. Edge scoring is important in use cases where data privacy requirements are high or the desired scoring latency is extremely low.

Enabled by container technology, Azure Machine Learning, together with Azure IoT Edge, provides easy ways to deploy machine learning models to Azure IoT Edge devices. With AML containers, using H2O on the edge takes minimal effort. Check out our recent blog post titled Artificial Intelligence and Machine Learning on the Cutting Edge for more details on how to enable edge intelligence.


In this post, we discussed a developer’s journey for building and deploying H2O solutions with Azure services, covering model development, model retraining, batch scoring and online scoring, together with edge scoring. Our AI development journey in this post focused on H2O. However, these learnings are not specific to H2O and can be applied just as easily to any Spark-based solution. As more frameworks such as TensorFlow and the Microsoft Cognitive Toolkit (CNTK) are enabled to run on Spark, we believe these learnings will become even more valuable. Understanding the right product choices based on business and technical needs is fundamental to the success of any project, and we hope the information in this post proves to be useful in your project.

Daisy & Abhinav

Happy Holidays from H2O.ai

Dear Community,

Your intelligence, support and love have been the strength behind an incredible year of growth, product innovation, partnerships, investments and customer wins for H2O and AI in 2017. Thank you for answering our rallying call to democratize AI with our maker culture.

Our mission to make AI ubiquitous is still fresh as dawn and our creativity new as spring. We are only getting started, learning, rising from each fall. H2O and Driverless AI are just the beginning.

As we look into 2018, we see prolific innovation to make AI accessible to everyone. Simplicity that opens scale. Our focus on making experiments faster, easier and cheaper. We are so happy that you will be the center of our journey. We look forward to delivering many more magical customer experiences.

On behalf of the team and management at H2O, I wish you all a wonderful holiday: deep, meaningful time spent with yourself and your loved ones. May you come back refreshed for a winning 2018!

Gratitude for your partnership in our beautiful journey – it’s just begun!

this will be fun,

Sri Ambati
CEO & Co-Founder

P.S. #H2OWorld was an amazing experience. I invite you to watch the keynote and more than 40 talks and conversations.

It’s all Water (or should I say H2O) to me!

By Krishna Visvanathan, Co-founder & Partner, Crane Venture Partners

In the career of any venture capitalist, one dreads the “oh shit moment”. For those unfamiliar with this most technical of terms – it is that moment of clarity when a VC, in the immediate aftermath of closing one’s latest investment (often at the first post investment Board meeting), is brought back down to earth with the realisation that the shiny new investment wasn’t quite so shiny after all.

Whilst it’s not the case for every investment of course (exceptions proving the rule and all that), it was still with slight trepidation that I set off for Mountain View, CA on Dec 4th, to attend H2O World and to connect with the Board – to see what customers, ecosystem partners and Board members really thought of the company – just weeks after the completion of H2O.ai’s $40m Series C, in which Crane participated as the sole European investor.

I suspect SriSatish Ambati, Co-Founder & CEO, would probably not have asked me to pen my reflections on my first H2O World had he known this – but you can relax, Sri: I can genuinely say that H2O is one of those exceptions. What I experienced at H2O World surpassed my expectations.

Impressive attendance levels – I was staggered to see over 500 people attending when NIPS was taking place in LA at the very same time. Also pleasing was the number of attendees representing enterprise users already deploying AI (& H2O) for practical use cases to great impact (more on this later).

Open source is not a marketing strategy, it’s a way of life – and when coupled with great product is when users, partners and customers become the best evangelisers and the ecosystem takes on a life of its own. This was perhaps the enduring memory of the conference for me – the vibrancy, zeal, depth and richness of the H2O community. This is the first enterprise startup I’ve been involved in with such a highly developed community – vital to succeeding in open source (alongside a sizeable TAM). Seeing and speaking with H2O’s users on stage, in the coffee areas, at the demo tables eulogising about their uses of H2O’s products filled me with a truly warm glow.

The Data and AI revolution will not be televised – to paraphrase Gil Scott-Heron. It’s here, it’s now, but it’s only just beginning, and it will truly transform every facet of human and corporate existence. From predicting blood usage, thus saving millions of dollars but more importantly saving precious blood bags by dramatically reducing wastage (shout out to Prof. Rob Tibshirani from Stanford), to predicting, detecting and combating fraudsters at PayPal, to predicting sepsis to save lives at a healthcare provider, to credit scoring/lending, AML, KYC and much more at one of the largest credit card companies – these are just the tip of the use-case iceberg of H2O and AI. We learn every day of new users and new use cases from the growing community of over 12,600 enterprises (across many verticals – Finance & Insurance, Healthcare, Automotive, TMT, Retail to name a few) and 130,000 users. Whether you are a startup or a Fortune 1000 enterprise, if AI is not already a part of your corporate vernacular, then good luck!

Software is eating the world, AI is eating software, but Data is feeding both. When Jeff Herbst of NVIDIA said on stage at H2O World, “The next phase of the AI revolution is all about Data”, it was music to our ears at Crane, as we’ve been investing in Data & AI companies for a couple of years now. Data is the true value, and helping enterprises unlock the gold in their data is what Driverless AI (DAI) is all about. Whilst I was fully aware of the potential of DAI, hearing PayPal describe how DAI produced an optimised, feature-rich model/recipe an order of magnitude quicker than traditional modelling practices without DAI was simultaneously mind-blowing and illustrative of the untapped potential. The floodgates will truly open when DAI enables any user to BYOR – bring your own recipe – and to share these recipes with the community.

Interpretability and Visualisation of Data in AI is another key plank, and yet again, H2O is taking the lead. Even for a non-techie, Professor Leland Wilkinson’s illustration of the visualisation capabilities he and H2O have built made me sit up and take notice.

“Democratising AI is not a mission, it’s a duty” – SriSatish Ambati. We are only at the start of the AI revolution, but battle lines are already being drawn between the giants, deploying huge resources, stockpiling talent, building proprietary hardware/infrastructure/platforms, all in the name of harnessing AI and data for their own benefit. AI will transform human existence in profound ways that we are yet to imagine, and making it accessible and executable by the many must therefore be our duty. Whilst we make no apology for our investment in H2O being about generating a financial return for our investors, we are also firmly and proudly committed to H2O’s doctrine of democratising AI.

Unlike Kevin Costner’s, this Waterworld was truly epic. Plaudits go to the entire H2O team for putting on a great event but more so for creating and nurturing such a superb community and for building world class product wrapped in unparalleled customer-centricity. We at Crane are clearly biased but I think you can tell that we are super excited to be part of the H2O team and hope to contribute in some small way to their continued success.

For other blog posts from Crane, please check out Crane-Taking Flight.

H2O4GPU Hands-On Lab (Video) + Updates

Deep learning algorithms have benefited significantly from the recent performance gains of GPUs. However, it has been uncertain whether GPUs can speed up powerful classical machine learning algorithms such as generalized linear modeling, random forests, gradient boosting machines, clustering, and singular value decomposition.

Today I’d love to share another interesting presentation from #H2OWorld focused on H2O4GPU.

H2O4GPU is a GPU-optimized machine learning library with a Python scikit-learn API tailored for enterprise AI. The library includes all the CPU algorithms from scikit-learn and also has selected algorithms that benefit greatly from GPU acceleration.

In the video below, Jon McKinney, Director of Research at H2O.ai, discusses the GPU-optimized machine learning algorithms in H2O4GPU and shows their speed in a suite of benchmarks against scikit-learn run on CPUs.

We’re always receiving helpful feedback from the community and making updates.

Exciting updates to expect in Q1 2018 include:
– Aggregator
– Kalman Filters
– K-nearest neighbors
– Quantiles
– Sort

If you’d like to learn more about H2O4GPU, I invite you to explore these helpful links:
– H2O4GPU Readme:
– Open Source License (Apache V2):

Happy Holidays!


Thank You for an Incredible H2O World

#H2OWorld 2017 was an incredible experience!

It was wonderful to gather with community members from all over the world for more than 50 interesting presentations and so many great conversations.

H2O World kicked off at the Computer History Museum with a keynote by H2O.ai CEO and Co-Founder Sri Ambati on the Maryam-Curie stage.

Sri’s keynote was followed by more than 20 presentations on the first day from community members at innovative organizations like BeeswaxIO, Business Science, Change Healthcare, Comcast, Equifax, NVIDIA, PayPal, Stanford University, Wildbook and many others.

The second day started with a keynote from Professor Rob Tibshirani focused on “An Application of the Lasso in Biomedical data sciences”.

Professor Tibshirani’s keynote was followed by more than 25 presentations from leading organizations including Amazon’s A9, Capital One, Digitalist Group, IBM, MapD, NVIDIA, QQ Trend, Stanford Medicine and more.

I’d love to say thank you to everyone who joined us at H2O World. We are incredibly grateful for your continued encouragement and feedback.

Thank you also to our talented team and Shiloh Events for planning such an amazing event.

Looking forward to the next H2O World!

Happy Holidays,


Director of Community

P.S. Want to share how you’re using H2O.ai products? I’d be thrilled to hear from you! Drop me a note.

Driverless AI – Introduction, Hands-On Lab and Updates

#H2OWorld was an incredible experience. Thank you to everyone who joined us!

There were so many fascinating conversations and interesting presentations. I’d love to invite you to enjoy the presentations by visiting our YouTube channel.

Over the next few weeks, we’ll be highlighting many of the talks. Today I’m excited to share two presentations focused on Driverless AI – “Introduction and a Look Under the Hood + Hands-On Lab” and “Hands-On Focused on Machine Learning Interpretability”.

Slides available here.

Slides available here.

The response to Driverless AI has been amazing. We’re constantly receiving helpful feedback and making updates.

A few recent updates include:

Version 1.0.11 (December 12 2017)
– Faster multi-GPU training, especially for small data
– Increase default amount of exploration of genetic algorithm for systems with fewer than 4 GPUs
– Improved accuracy of generalization performance estimate for models on small data (< 100k rows)
– Faster abort of experiment
– Improved final ensemble meta-learner
– More robust date parsing

Version 1.0.10 (December 4 2017)
– Tooltips and link to documentation in parameter settings screen
– Faster training for multi-class problems with > 5 classes
– Experiment summary displayed in GUI after experiment finishes
– Python Client Library downloadable from the GUI
– Speedup for Maxwell-based GPUs
– Support for multinomial AUC and Gini scorers
– Add MCC and F1 scorers for binomial and multinomial problems
– Faster abort of experiment

Version 1.0.9 (November 29 2017)
– Support for time column for causal train/validation splits in time-series datasets
– Automatic detection of the time column from temporal correlations in data
– MLI improvements, dedicated page, selection of datasets and models
– Improved final ensemble meta-learner
– Test set score now displayed in experiment listing
– Original response is preserved in exported datasets
– Various bug fixes

Additional release notes can be viewed here:

If you’d like to learn more about Driverless AI, feel free to explore these helpful links:
– Driverless AI User Guide:
– Driverless AI Webinars:
– Latest Driverless AI Docker Download:
– Latest Driverless AI AWS AMI: Search for AMI-id : ami-d8c3b4a2
– Stack Overflow:

Want to try Driverless AI? Send us a note.

New versions of H2O-3 and Sparkling Water available

Dear H2O Community,

#H2OWorld is on Monday and we can’t wait to see you there! We’ll also be live streaming the event starting at 9:25am PST. Explore the agenda here.

Today we’re excited to share that new versions of H2O-3 and Sparkling Water are available.

We invite you to download them here:

H2O-3
– MOJOs are now supported for Stacked Ensembles.
– Easily specify the meta-learner algorithm type that Stacked Ensemble should use. This can be AUTO, GLM, GBM, DRF or Deep Learning.
– GBM, DRF now support custom evaluation metrics.
– The AutoML leaderboard now uses cross-validation metrics (new default).
– Multiclass stacking is now supported in AutoML. Removed the check that caused AutoML to skip stacking for multiclass.
– The Aggregator Function is now exposed in the Python/R client.
– Support for Python 3.6.

Detailed changes and bug fixes can be found here:

Sparkling Water 2.0, 2.1, 2.2
– Support for H2O models in Spark Python pipelines.
– Improved handling of sparse vectors in the internal cluster.
– Improved stability of the external cluster deployment mode.
– Includes the latest H2O-3.

Detailed changes and bug fixes can be explored here:
2.2 –
2.1 –
2.0 –

Hope to see you on Monday!

The H2O.ai Team

H2O.ai Raises $40 Million to Democratize Artificial Intelligence for the Enterprise

Driverless AI

Series C round led by Wells Fargo and NVIDIA

MOUNTAIN VIEW, CA – November 30, 2017 – H2O.ai, the leading company bringing AI to enterprises, today announced it has completed a $40 million Series C round of funding led by Wells Fargo and NVIDIA, with participation from New York Life, Crane Venture Partners, Nexus Venture Partners and Transamerica Ventures, the corporate venture capital fund of Transamerica and Aegon Group. The Series C round brings H2O.ai’s total amount of funding raised to $75 million. The new investment will be used to further democratize advanced machine learning and to fund global expansion and innovation of Driverless AI, an automated machine learning and pipelining platform that uses “AI to do AI.”

H2O.ai continued its juggernaut growth in 2017, as evidenced by new platforms and partnerships. The company launched Driverless AI, a product that automates AI for non-technical users and introduces visualization and interpretability features that explain the data modeling results in plain English, thus fostering further adoption of, and trust in, artificial intelligence. H2O.ai has partnered with NVIDIA to democratize machine learning on the NVIDIA GPU compute platform. It has also partnered with IBM, Amazon AWS and Microsoft Azure to bring its best-in-class machine learning platform to other infrastructures and the public cloud. H2O.ai co-founded the GPU Open Analytics Initiative (GOAI) to create an ecosystem for data developers and researchers to advance data science using GPUs, and has launched H2O4GPU, a collection of the fastest GPU algorithms on the market, capable of processing massive amounts of unstructured data up to 40x faster than on traditional CPUs.

“AI is eating both hardware and software,” said Sri Ambati, co-founder and CEO at H2O.ai. “Billions of devices are generating unprecedented amounts of data, which truly calls for distributed machine learning that is ubiquitous and fast. Our focus on automating machine learning makes it easily accessible to large enterprises. Our maker culture fosters deep trust and teamwork with our customers, and our partnerships with vendors across industry verticals bring significant value and growth to our community. It is quite supportive and encouraging to see our partners lead a significant funding round to help H2O.ai deliver on its mission.”

“AI is an incredible force that’s sweeping across the technology landscape,” said Jeff Herbst, vice president of business development at NVIDIA. “H2O.ai is exceptionally well positioned in this field as it pursues its mission to become the world’s leading data science platform for the financial services industry and beyond. Its use of GPU-accelerated AI provides powerful tools for customers, and we look forward to continuing our collaboration with them.”

“It is exhilarating to have backed H2O.ai’s journey from day zero: the journey from a PowerPoint to becoming the enterprise AI platform essential for thousands of corporations across the planet,” said Jishnu Bhattacharjee, managing director at Nexus Venture Partners. “AI has arrived, transforming industries as we know them. Exciting scale ahead for H2O, so fasten your seat belts!”

As the leading open-source platform for machine learning, H2O.ai is leveling the playing field in a space where much of the AI innovation and talent is locked up inside major tech titans and thus inaccessible to other enterprises. This is precisely why over 100,000 data scientists, 12,400 organizations and nearly half of the Fortune 500 have embraced H2O.ai’s suite of products, which pack the productivity of an elite data science team into a single solution.

“We are delighted to lead H2O.ai’s funding round. We have been following the company’s progress and have been impressed by its high-caliber management team and success in establishing an open-source machine learning platform with wide adoption across many industries. We are excited to support the next phase of their development,” said Basil Darwish, director of strategic investments at Wells Fargo Securities.

Beyond its open source community, H2O.ai is transforming several industry verticals and building strong customer partnerships. Over the past 18 months, the company has worked with PwC to build a revolutionary bot that uses AI and machine learning to ‘x-ray’ a business and detect anomalies in the general ledger. The product was named the ‘Audit Innovation of the Year’ by the International Accounting Bulletin in October 2017.

H2O.ai’s signature community conference, H2O World, will take place on December 4-5, 2017 at the Computer History Museum in Mountain View, Calif.

About H2O.ai

H2O.ai’s mission is to democratize machine learning through its leading open source software platform. Its flagship product, H2O, empowers enterprise clients to quickly deploy machine learning and predictive analytics to accelerate business transformation for critical applications such as predictive maintenance and operational intelligence. H2O.ai recently launched Driverless AI, the first solution that allows any business, even one without a team of talented data scientists, to implement AI to solve complex business problems. The product was reviewed and selected as Editor’s Choice in InfoWorld. Customers include Capital One, Progressive Insurance, Comcast, Walgreens and Kaiser Permanente. For more information and to learn more about how H2O.ai is transforming businesses, visit www.h2o.ai.


Media contact: VSC for H2O.ai
Kayla Abbassi
Senior Account Executive

Laying a Strong Foundation for Data Science Work

By William Merchan, CSO

In the past few years, data science has become the cornerstone of enterprise companies’ efforts to understand how to deliver better customer experiences. Even so, when we commissioned Forrester to survey over 200 data-driven businesses last year, only 22% reported they were leveraging big data well enough to get ahead of their competition.

That’s because there’s a big difference between building predictive models and putting them into production effectively. Data science teams need the support of IT from the very beginning to ensure that issues with large-scale data management, governance, and access don’t stand in the way of operationalizing key insights about your customers. However, many enterprise companies are still treating IT involvement as an afterthought, which ultimately delays the timeline for seeing value from their data science efforts.

There are many ways that better IT management can help scale the impact of data science at your organization. Three best practices include using containers for data science environments, managing compute resources effectively, and using modern deployment tooling to put work into production faster. Here’s how it’s done.

1. Use software containers. Adopting containers is one of the most impactful steps you can take toward IT management best practices. These standardized development environments ensure that the hard work your data scientists put into building predictive models won’t go to waste when it’s time to deploy their code. Without a container-based workflow, a data scientist starting a new analysis must either wait for IT to build an environment from scratch or build one themselves from their preferred combination of packages and resources, then wait for everything to install or compile.

Both approaches share two major problems: they don’t scale, and they’re slow. When data scientists are individually responsible for configuring environments as needed, their work isn’t reproducible; code built in one environment might not even run in another. Containers put the power in the hands of IT to standardize environment configuration in advance using images, which are snapshots of containers. Data scientists can launch environments from those images, already vetted by IT, saving a lot of time in the long run.
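To make the image-based workflow concrete, here is a minimal sketch of the kind of Dockerfile an IT team might vet and publish once for reuse; the base image and package versions are illustrative assumptions, not a prescribed stack.

```dockerfile
# Hypothetical standard analysis image, vetted once by IT and reused by the team.
FROM python:3.10-slim

# Pin the package set so every container launched from this image is identical.
RUN pip install --no-cache-dir pandas==1.5.3 scikit-learn==1.2.2

WORKDIR /workspace
CMD ["python"]
```

A data scientist then starts work from the published image (for example, `docker run -it registry.example.com/analysis:1.0`), and anything produced inside the container reproduces anywhere that same image runs.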

2. Provide ample computing power to support your data scientists’ analysis from start to finish. Empowering them to spin up compute resources in the cloud as needed ensures they never get held up by limited computing power. It also eliminates the potential additional cost of maintaining unnecessary nodes. The same idea applies to on-prem data centers. IT must carefully monitor the expansion of data science work and scale resources accordingly. It may seem obvious, but IHS Markit reports that companies not anticipating this need lose approximately $700 billion a year to IT downtime.

3. Put data science work into production right away to start seeing its value sooner. Imagine your data science team has built a recommender system to predict which products a customer is likely to enjoy based on the products they have already purchased. Even if you’re satisfied with the model’s accuracy and have identified some unexpected relationships that should inform your targeting strategies, this information still needs to be integrated into your application or website before it delivers value.

Traditionally, the pipeline that delivers those recommendations to your customers would be built by engineers and require extensive support from IT. The rise of microservices, however, gives data scientists the opportunity to deploy models as APIs that can be integrated directly into an application.
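The model-as-API pattern described above can be sketched with only the Python standard library. Everything here is a hypothetical stand-in: the co-purchase table plays the role of a trained recommender, and the endpoint, port, and payload shape are illustrative assumptions.

```python
# Minimal sketch: exposing a "trained" recommender as a JSON-over-HTTP
# microservice. All names and data are hypothetical stand-ins.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a trained model: co-purchase counts between products.
CO_PURCHASE = {
    "coffee": {"mug": 5, "filter": 3},
    "mug": {"coffee": 5, "coaster": 2},
}

def recommend(purchased, k=2):
    """Score candidate products by co-purchase counts with the basket."""
    scores = {}
    for item in purchased:
        for other, count in CO_PURCHASE.get(item, {}).items():
            if other not in purchased:
                scores[other] = scores.get(other, 0) + count
    # Highest-scoring products first, truncated to the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]

class RecommendHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"purchased": ["coffee"]}.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"recommendations": recommend(body["purchased"])})
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode())

# To serve: HTTPServer(("", 8000), RecommendHandler).serve_forever()
```

Because the scoring logic sits behind a plain HTTP endpoint, the application team integrates it with a single POST request instead of re-implementing the model inside their own pipeline.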

If you’re among the 78% of companies not fully realizing the return on your data science investment, chances are there’s room to improve the IT foundation you’ve laid. To learn more about the next steps, find out how to take an agile approach to data science.

About the Author

William Merchan leads business and corporate development, partner initiatives, and strategy as chief strategy officer. He most recently served as SVP of Strategic Alliances and GM of Dynamic Pricing at MarketShare, where he oversaw global business development and partner relationships, and successfully led the company to a $450 million acquisition by Neustar.