Driverless AI – Introduction, Hands-On Lab and Updates

#H2OWorld was an incredible experience. Thank you to everyone who joined us!

There were so many fascinating conversations and interesting presentations. I’d love to invite you to enjoy the presentations by visiting our YouTube channel.

Over the next few weeks, we’ll be highlighting many of the talks. Today I’m excited to share two presentations focused on Driverless AI – “Introduction and a Look Under the Hood + Hands-On Lab” and “Hands-On Focused on Machine Learning Interpretability”.


The response to Driverless AI has been amazing. We’re constantly receiving helpful feedback and making updates.

A few recent updates include:

Version 1.0.11 (December 12, 2017)
– Faster multi-GPU training, especially for small data
– Increased the default amount of exploration in the genetic algorithm for systems with fewer than 4 GPUs
– Improved accuracy of generalization performance estimate for models on small data (< 100k rows)
– Faster abort of experiment
– Improved final ensemble meta-learner
– More robust date parsing

Version 1.0.10 (December 4, 2017)
– Tooltips and link to documentation in parameter settings screen
– Faster training for multi-class problems with > 5 classes
– Experiment summary displayed in GUI after experiment finishes
– Python Client Library downloadable from the GUI
– Speedup for Maxwell-based GPUs
– Support for multinomial AUC and Gini scorers
– Add MCC and F1 scorers for binomial and multinomial problems
– Faster abort of experiment

Version 1.0.9 (November 29, 2017)
– Support for time column for causal train/validation splits in time-series datasets
– Automatic detection of the time column from temporal correlations in data
– MLI improvements, dedicated page, selection of datasets and models
– Improved final ensemble meta-learner
– Test set score now displayed in experiment listing
– Original response is preserved in exported datasets
– Various bug fixes

Additional release notes can be viewed here:
http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/release_notes.html

If you’d like to learn more about Driverless AI, feel free to explore these helpful links:
– Driverless AI User Guide: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html
– Driverless AI Webinars: https://webinar.com/channel/4a90aa11b48f4a5d8823ec924e7bd8cf
– Latest Driverless AI Docker Download: https://www.h2o.ai/driverless-ai-download/
– Latest Driverless AI AWS AMI: search for AMI ID ami-d8c3b4a2
– Stack Overflow: https://stackoverflow.com/questions/tagged/driverless-ai

Want to try Driverless AI? Send us a note.

New versions of H2O-3 and Sparkling Water available

Dear H2O Community,

#H2OWorld is on Monday and we can’t wait to see you there! We’ll also be live streaming the event starting at 9:25am PST. Explore the agenda here.

Today we’re excited to share that new versions of H2O-3 and Sparkling Water are available.

We invite you to download them here:
https://www.h2o.ai/download/

H2O-3.16
– MOJOs are now supported for Stacked Ensembles.
– Easily specify the meta-learner algorithm type that Stacked Ensemble should use: AUTO, GLM, GBM, DRF or Deep Learning (see the sketch after this list).
– GBM and DRF now support custom evaluation metrics.
– The AutoML leaderboard now uses cross-validation metrics (new default).
– Multiclass stacking is now supported in AutoML. Removed the check that caused AutoML to skip stacking for multiclass.
– The Aggregator Function is now exposed in the Python/R client.
– Support for Python 3.6.
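
Here is a hedged Python sketch of the new meta-learner option (the dataset URL and column names follow H2O's public test data and docs examples; treat them as illustrative):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

h2o.init()

train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/higgs/higgs_train_10k.csv")
y = "response"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()

# Base models must share fold assignments and keep their CV predictions
gbm = H2OGradientBoostingEstimator(nfolds=5, fold_assignment="Modulo",
                                   keep_cross_validation_predictions=True, seed=1)
gbm.train(x=x, y=y, training_frame=train)
drf = H2ORandomForestEstimator(nfolds=5, fold_assignment="Modulo",
                               keep_cross_validation_predictions=True, seed=1)
drf.train(x=x, y=y, training_frame=train)

# New in H2O-3.16: choose the meta-learner algorithm explicitly
ens = H2OStackedEnsembleEstimator(base_models=[gbm.model_id, drf.model_id],
                                  metalearner_algorithm="gbm")
ens.train(x=x, y=y, training_frame=train)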

Detailed changes and bug fixes can be found here:
https://github.com/h2oai/h2o-3/blob/master/Changes.md

Sparkling Water 2.0, 2.1, 2.2
– Support for H2O models in Spark Python pipelines.
– Improved handling of sparse vectors in the internal cluster.
– Improved stability of the external cluster deployment mode.
– Includes latest H2O-3.16.0.2.

Detailed changes and bug fixes can be explored here:
2.2 – https://github.com/h2oai/sparkling-water/blob/rel-2.2/doc/CHANGELOG.rst
2.1 – https://github.com/h2oai/sparkling-water/blob/rel-2.1/doc/CHANGELOG.rst
2.0 – https://github.com/h2oai/sparkling-water/blob/rel-2.0/doc/CHANGELOG.rst

Hope to see you on Monday!

The H2O.ai Team

H2O.ai Raises $40 Million to Democratize Artificial Intelligence for the Enterprise

Driverless AI


Series C round led by Wells Fargo and NVIDIA

MOUNTAIN VIEW, CA – November 30, 2017 – H2O.ai, the leading company bringing AI to enterprises, today announced it has completed a $40 million Series C round of funding led by Wells Fargo and NVIDIA with participation from New York Life, Crane Venture Partners, Nexus Venture Partners and Transamerica Ventures, the corporate venture capital fund of Transamerica and Aegon Group. The Series C round brings H2O.ai’s total amount of funding raised to $75 million. The new investment will be used to further democratize advanced machine learning and for global expansion and innovation of Driverless AI, an automated machine learning and pipelining platform that uses “AI to do AI.”

H2O.ai continued its juggernaut growth in 2017 as evidenced by new platforms and partnerships. The company launched Driverless AI, a product that automates AI for non-technical users and introduces visualization and interpretability features that explain the data modeling results in plain English, thus fostering further adoption and trust in artificial intelligence.

H2O.ai has partnered with NVIDIA to democratize machine learning on the NVIDIA GPU compute platform. It has also partnered with IBM, Amazon AWS and Microsoft Azure to bring its best-in-class machine learning platform to other infrastructures and the public cloud.

H2O.ai co-founded the GPU Open Analytics Initiative (GOAI) to create an ecosystem for data developers and researchers to advance data science using GPUs, and has launched H2O4GPU, a collection of the fastest GPU algorithms on the market capable of processing massive amounts of unstructured data up to 40x faster than on traditional CPUs.

“AI is eating both hardware and software,” said Sri Ambati, co-founder and CEO at H2O.ai. “Billions of devices are generating unprecedented amounts of data, which truly calls for distributed machine learning that is ubiquitous and fast. Our focus on automating machine learning makes it easily accessible to large enterprises. Our maker culture fosters deep trust and teamwork with our customers, and our partnerships with vendors across industry verticals bring significant value and growth to our community. It is quite supportive and encouraging to see our partners lead a significant funding round to help H2O.ai deliver on its mission.”

“AI is an incredible force that’s sweeping across the technology landscape,” said Jeff Herbst, vice president of business development at NVIDIA. “H2O.ai is exceptionally well positioned in this field as it pursues its mission to become the world’s leading data science platform for the financial services industry and beyond. Its use of GPU-accelerated AI provides powerful tools for customers, and we look forward to continuing our collaboration with them.”

“It is exhilarating to have backed the H2O.ai journey from day zero: the journey from a PowerPoint to becoming the enterprise AI platform essential for thousands of corporations across the planet,” said Jishnu Bhattacharjee, managing director at Nexus Venture Partners. “AI has arrived, transforming industries as we know them. Exciting scale ahead for H2O, so fasten your seat belts!”

As the leading open-source platform for machine learning, H2O.ai is leveling the playing field in a space where much of the AI innovation and talent is locked up inside major tech titans and thus inaccessible to other enterprises. This is precisely why over 100,000 data scientists, 12,400 organizations and nearly half of the Fortune 500 have embraced H2O.ai’s suite of products that pack the productivity of an elite data science team into a single solution.

“We are delighted to lead H2O.ai’s funding round. We have been following the company’s progress and have been impressed by its high-caliber management team and success in establishing an open-source machine learning platform with wide adoption across many industries. We are excited to support the next phase of their development,” said Basil Darwish, director of strategic investments at Wells Fargo Securities.

Beyond its open source community, H2O.ai is transforming several industry verticals and building strong customer partnerships. Over the past 18 months, the company has worked with PwC to build PwC’s “GL.ai,” a revolutionary bot that uses AI and machine learning to ‘x-ray’ a business and detect anomalies in the general ledger. The product was named the ‘Audit Innovation of the Year’ by the International Accounting Bulletin in October 2017.

H2O’s signature community conference, H2O World, will take place December 4-5, 2017 at the Computer History Museum in Mountain View, Calif.

About H2O.ai

H2O.ai’s mission is to democratize machine learning through its leading open source software platform. Its flagship product, H2O, empowers enterprise clients to quickly deploy machine learning and predictive analytics to accelerate business transformation for critical applications such as predictive maintenance and operational intelligence. H2O.ai recently launched Driverless AI, the first solution that allows any business — even ones without a team of talented data scientists — to implement AI to solve complex business problems. The product was reviewed and selected as Editor’s Choice in InfoWorld. Customers include Capital One, Progressive Insurance, Comcast, Walgreens and Kaiser Permanente. For more information and to learn more about how H2O.ai is transforming businesses, visit www.h2o.ai.

Contacts

VSC for H2O.ai
Kayla Abbassi
Senior Account Executive
kayla@vscpr.com

Laying a Strong Foundation for Data Science Work

By William Merchan, CSO, DataScience.com

In the past few years, data science has become the cornerstone of enterprise companies’ efforts to understand how to deliver better customer experiences. Even so, when DataScience.com commissioned Forrester to survey over 200 data-driven businesses last year, only 22% reported they were leveraging big data well enough to get ahead of their competition.

That’s because there’s a big difference between building predictive models and putting them into production effectively. Data science teams need the support of IT from the very beginning to ensure that issues with large-scale data management, governance, and access don’t stand in the way of operationalizing key insights about your customers. However, many enterprise companies are still treating IT involvement as an afterthought, which ultimately delays the timeline for seeing value from their data science efforts.

There are many ways that better IT management can help scale the impact of data science at your organization. Three best practices include using containers for data science environments, managing compute resources effectively, and putting work into production faster with the help of tools. Here’s how it’s done.

1. Using software containers is one of the most impactful steps you can take to implement IT management best practices. These standardized development environments ensure that the hard work your data scientists put into building predictive models won’t go to waste when it’s time to deploy their code. Without a container-based workflow, a data scientist starting a new analysis must either wait for IT to build an environment from scratch, or build one themselves using the unique combination of packages and resources they prefer — and then wait for those to install or compile.

There are two major issues associated with both of these approaches: they don’t scale, and they’re slow. When data scientists are individually responsible for configuring environments as needed, their work isn’t reproducible — if it’s used in a different environment, it might not even run. Containers put the power in the hands of IT to standardize environment configuration in advance using images, which are snapshots of containers. Data scientists can launch environments from those images — which have already been vetted by IT — saving a lot of time in the long run.

2. Provide ample computing power to support your data scientists’ analysis from start to finish. Empowering them to spin up compute resources in the cloud as needed ensures they never get held up by limited computing power. It also eliminates the potential additional cost of maintaining unnecessary nodes. The same idea applies to on-prem data centers. IT must carefully monitor the expansion of data science work and scale resources accordingly. It may seem obvious, but IHS Markit reports that companies not anticipating this need lose approximately $700 billion a year to IT downtime.

3. Put data science work into production right away to start seeing its value earlier on. Imagine your data science team has built a recommender system to predict what products a customer is likely to enjoy based on the products he or she has already purchased. Even if you’re satisfied with the model’s accuracy and have identified some unexpected relationships that should inform your targeting strategies, this information still needs to be integrated into your application or website for it to be valuable.

Traditionally, the pipeline that delivers those recommendations to your customers would be built by engineers and require extensive support from IT. The rise of microservices, however, gives data scientists the opportunity to deploy models as APIs that can be integrated directly into an application.

If you’re among the 78% of companies not fully realizing the return on your data science investment, chances are there’s room to improve the IT foundation you’ve laid. To learn more about the next steps, find out how to take an agile approach to data science.

About the Author

William Merchan leads business and corporate development, partner initiatives, and strategy at DataScience.com as chief strategy officer. He most recently served as SVP of Strategic Alliances and GM of Dynamic Pricing at MarketShare, where he oversaw global business development and partner relationships, and successfully led the company to a $450 million acquisition by Neustar.

H2O.ai Releases H2O4GPU, the Fastest Collection of GPU Algorithms on the Market, to Expedite Machine Learning in Python

H2O4GPU is an open-source collection of GPU solvers created by H2O.ai. It builds on the easy-to-use scikit-learn Python API and its well-tested CPU-based algorithms. It can be used as a drop-in replacement for scikit-learn with support for GPUs on selected (and ever-growing) algorithms. H2O4GPU inherits all the existing scikit-learn algorithms and falls back to CPU algorithms when the GPU algorithm does not support an important existing scikit-learn class option. It utilizes the efficient parallelism and high throughput of GPUs. Additionally, GPUs allow the user to complete training and inference much faster than possible on ordinary CPUs.

Today, select algorithms are GPU-enabled. These include Gradient Boosting Machines (GBMs), Generalized Linear Models (GLMs), and k-means clustering. Using H2O4GPU, users can unlock the power of GPUs through the scikit-learn API that many already use today. In addition to the scikit-learn Python API, an R API is in development.
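
For example, here is a minimal sketch of the drop-in pattern (adapted from the project README; the solver falls back to scikit-learn's CPU implementation when a GPU isn't available):

import numpy as np
import h2o4gpu

X = np.array([[1., 1.], [1., 4.], [1., 0.],
              [4., 2.], [4., 4.], [4., 0.]])

# Same constructor arguments and methods as sklearn.cluster.KMeans
model = h2o4gpu.KMeans(n_clusters=2, random_state=1234).fit(X)
print(model.cluster_centers_)
print(model.predict(X))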

Here are specific benchmarks from a recent H2O4GPU test (speedups vary by algorithm):

  • More than 5X faster on GPUs as compared to CPUs
  • Nearly 10X faster on GPUs
  • More than 40X faster on GPUs

“We’re excited to release these lightning-fast H2O4GPU algorithms and continue H2O.ai’s foray into GPU innovation,” said Sri Ambati, co-founder and CEO of H2O.ai. “H2O4GPU democratizes industry-leading speed, accuracy and interpretability for scikit-learn users from all over the globe. This includes enterprise AI users who were previously too busy building models to have time for what really matters: generating revenue.”

“The release of H2O4GPU is an important milestone,” said Jim McHugh, general manager and vice president at NVIDIA. “Delivered as part of an open-source platform it brings the incredible power of acceleration provided by NVIDIA GPUs to widely-used machine learning algorithms that today’s data scientists have come to rely upon.”

H2O4GPU’s release follows the launch of Driverless AI, H2O.ai’s fully automated solution that handles data science operations — data preparation, algorithms, model deployment and more — for any business needing world-class AI capability in a single product. Built by top-ranking Kaggle Grandmasters, Driverless AI is essentially an entire data science team baked into one application.

Following is some information on each GPU-enabled algorithm as well as a roadmap.

Generalized Linear Model (GLM)

  • Framework utilizes the Proximal Operator Graph Solver (POGS)
  • Solvers include Lasso, Ridge Regression, Logistic Regression, and Elastic Net Regularization
  • Improvements to the original implementation of POGS:
    • Full alpha search
    • Cross Validation
    • Early Stopping
    • Added scikit-learn-like API
    • Supports multiple GPUs
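
A hedged sketch of calling the GLM solver through the scikit-learn-style API (constructor arguments shown are the standard scikit-learn ones; GPU-specific options such as a device count are assumed extensions):

import numpy as np
import h2o4gpu

rng = np.random.RandomState(42)
X = rng.randn(1000, 20)
y = X @ rng.randn(20) + 0.1 * rng.randn(1000)

# Elastic net = mix of L1 (lasso) and L2 (ridge) penalties
enet = h2o4gpu.ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.predict(X[:5]))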

Gradient Boosting Machines (please see Rory’s post on the NVIDIA Developer Blog for a more detailed write-up on gradient boosted trees on GPUs)

  • Based on XGBoost
  • Raw floating point data — binned into quantiles
  • Quantiles are stored in compressed form instead of as floats
  • Compressed quantiles are efficiently transferred to GPU
  • Sparsity is handled directly with high GPU efficiency
  • Multi-GPU enabled by sharing rows using NVIDIA NCCL AllReduce
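
A sketch of the XGBoost-backed GBM through the same scikit-learn-style wrapper (class and argument names follow the scikit-learn convention; treat the exact spelling as an assumption):

import numpy as np
import h2o4gpu

rng = np.random.RandomState(7)
X = rng.randn(500, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gbm = h2o4gpu.GradientBoostingClassifier(n_estimators=100, max_depth=6)
gbm.fit(X, y)
print(gbm.predict_proba(X[:5]))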

k-Means Clustering

  • Based on an NVIDIA prototype of the k-means algorithm in CUDA
  • Improvements to the original implementation:
    • Significantly faster than the scikit-learn implementation (50x) and other GPU implementations (5-10x)
    • Supports multiple GPUs

H2O4GPU combines the power of GPU acceleration with H2O’s parallel implementation of popular algorithms, taking computational performance levels to new heights.

To learn more about H2O4GPU, click here; for more information about the math behind each algorithm, click here.

Driverless AI Blog

In today’s market, there aren’t enough data scientists to satisfy the growing demand for people in the field. With many companies moving towards automating processes across their businesses (everything from HR to Marketing), companies are forced to compete for the best data science talent to meet their needs. A report by McKinsey says that based on 2018 job market predictions: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” H2O’s Driverless AI addresses this gap by democratizing data science and making it accessible to non-experts, while simultaneously increasing the efficiency of expert data scientists. Its point-and-click UI minimizes the complicated legwork that precedes the actual model build.

Driverless AI is designed to take a raw dataset and run it through a proprietary algorithm that automates the data exploration/feature engineering process, which typically takes up ~80% of a data scientist’s time. It then auto-tunes model parameters and provides the user with the model that yields the best results. As a result, experienced data scientists spend far less time engineering new features and can focus on drawing actionable insights from the models Driverless AI builds. Lastly, the user can view visualizations generated by the Machine Learning Interpretability (MLI) component of Driverless AI to clarify the model results and the effect of changing variables’ values. The MLI feature eliminates the black-box nature of machine learning models, providing clear and straightforward explanations of a model’s results, as well as of how changing feature values would alter them.

Driverless AI is also GPU-enabled, which can result in up to 40x speedups. We demonstrated GPU acceleration achieving those speedups for machine learning algorithms at GTC in May 2017. We’ve ported XGBoost, GLM, K-Means and other algorithms to GPUs to achieve significant performance gains. This enables Driverless AI to run thousands of iterations to find the most accurate feature transforms and models.

The automatic nature of Driverless AI leads to increased accuracy. AutoDL engineers new features mechanically, and AutoML finds the right algorithms and tunes them to create the perfect ensemble of models. You can think of it as a Kaggle Grandmaster in a box. To demonstrate the power of Driverless AI, we entered a number of Kaggle contests, and out of the box Driverless AI performed nearly as well as the best Kaggle Grandmasters.

Let’s look at an example: we are going to work with a credit card dataset and predict whether or not a person will default on their payment next month, based on a set of variables related to their payment history. After simply choosing the variable we are predicting as well as the number of iterations we’d like to run, we launch our experiment.

As the experiment cycles through iterations, it creates a variable importance chart ranking existing and newly created features by their effect on the model’s accuracy.

In this example, AutoDL creates a feature that represents the cross validation target encoding of the variables sex and education. In other words, if we group everyone who is of the same sex and who has the same level of education in this dataset, the resulting feature would help in predicting whether or not the customer is going to default on their payment next month. Generating features like this one usually takes the majority of a data scientist’s time, but Driverless AI automates this process for the user.
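
To make the idea concrete, here is a rough out-of-fold target-encoding sketch in plain pandas (not Driverless AI's implementation; column names are illustrative):

import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "sex":       ["M", "F", "M", "F", "M", "F", "M", "F"],
    "education": [1, 2, 1, 3, 2, 2, 1, 3],
    "default":   [0, 1, 0, 1, 0, 1, 1, 0],
})

# Encode each (sex, education) group by the mean target of the *other* folds,
# so a row never sees its own label (this is the "cross validation" part).
df["sex_edu_te"] = float("nan")
for tr_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    means = df.iloc[tr_idx].groupby(["sex", "education"])["default"].mean()
    keys = zip(df.iloc[val_idx]["sex"], df.iloc[val_idx]["education"])
    df.loc[df.index[val_idx], "sex_edu_te"] = [means.get(k) for k in keys]

# Pairs unseen in a fold's training part get NaN; fall back to the global mean
df["sex_edu_te"] = df["sex_edu_te"].fillna(df["default"].mean())
print(df)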

After AutoDL generates new features, we run the updated dataset through AutoML. At this point, Driverless AI builds a series of models using various algorithms and delivers a leaderboard ranking the success of each model. The user can then inspect and choose the model that best fits their needs.

Lastly, we can use the Machine Learning Interpretability feature to get clear and concise explanations of our model results. Four dynamic graphs are generated automatically: K-LIME, Variable Importance, Decision Tree Chart, and Partial Dependence Plot. Each one helps the user explore the model output more closely. K-LIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate GLMs on samples formed from k-means clusters in the training data. All penalized GLM surrogates are trained to model the predictions of the Driverless AI model. Variable Importance measures the effect that a variable has on the predictions of a model, while the Partial Dependence Plot shows the effect of changing one variable on the outcome. The Decision Tree Surrogate Model clarifies the Driverless AI model by displaying an approximate flow chart of the complex model’s decision-making process, along with the most important variables and interactions in the Driverless AI model. Lastly, the Explanations button gives the user a plain-English sentence about how each variable affects the model.
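
As a rough illustration of the K-LIME idea (a simplified sketch, not Driverless AI's implementation): fit k-means clusters on the training data, then fit one penalized GLM per cluster to the predictions of the complex model:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
preds = black_box.predict_proba(X)[:, 1]   # the predictions to be explained

# One global surrogate GLM over all training rows
global_surrogate = Ridge(alpha=1.0).fit(X, preds)

# Local surrogate GLMs, one per k-means cluster
clusters = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
local_surrogates = {
    k: Ridge(alpha=1.0).fit(X[clusters == k], preds[clusters == k])
    for k in range(4)
}
# Each local surrogate's coefficients act as per-cluster "reason codes"
print(local_surrogates[0].coef_)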

All of these graphs can be used to visualize and debug the Driverless AI model by comparing the displayed decision-process, important variables, and important interactions to known standards, domain knowledge, and reasonable expectations.

Driverless AI streamlines the machine learning workflow for inexperienced and expert users alike. For more information, click here.

Scalable Automatic Machine Learning: Introducing H2O’s AutoML

Prepared by: Erin LeDell, Navdeep Gill & Ray Peck

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts, alike. The first steps toward simplifying machine learning involved developing simple, unified interfaces to a variety of machine learning algorithms (e.g. H2O).

Although H2O has made it easy for non-experts to experiment with machine learning, there is still a fair bit of knowledge and background in data science required to produce high-performing machine learning models. Deep Neural Networks in particular are notoriously difficult for a non-expert to tune properly. We have designed an easy-to-use interface that automates the process of training a large, diverse selection of candidate models and training a stacked ensemble on the resulting models (which often leads to an even better model). Making its debut in the latest “Preview Release” of H2O, version 3.12.0.1 (aka “Vapnik”), we introduce H2O’s AutoML for Scalable Automatic Machine Learning.

H2O’s AutoML can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. The user can also use a performance-metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top-performing model in the AutoML Leaderboard.

AutoML Interface

We provide a simple function that performs a process that would typically require many lines of code. This frees users up to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering and model deployment.

R:

aml <- h2o.automl(x = x, y = y, training_frame = train, 
                  max_runtime_secs = 3600)

Python:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs = 3600)
aml.train(x = x, y = y, training_frame = train)
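
And a hedged sketch of the metric-based stopping criterion mentioned above (parameter names per the H2O AutoML docs; values are illustrative):

# Stop when the stopping metric stops improving, instead of relying on wall time alone
aml = H2OAutoML(stopping_metric="AUC", stopping_rounds=3,
                stopping_tolerance=0.001, max_runtime_secs=3600)
aml.train(x=x, y=y, training_frame=train)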

Flow (H2O's Web GUI):

AutoML Leaderboard

Each AutoML run returns a "Leaderboard" of models, ranked by a default performance metric. Here is an example leaderboard for a binary classification task:
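
To inspect the leaderboard programmatically, here is a minimal sketch (assuming the `aml` object from the Python example above):

lb = aml.leaderboard       # models ranked by the default metric (AUC for binary tasks)
print(lb.head(rows=10))    # the Stacked Ensemble is typically at the top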

More information, and full R and Python code examples are available on the H2O 3.12.0.1 AutoML docs page in the H2O User Guide.

XGBoost in H2O Machine Learning Platform

The new H2O release 3.10.5.1 brings a shiny new feature – integration of the powerful XGBoost library into the H2O Machine Learning Platform!

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

By integrating XGBoost into the H2O Machine Learning Platform, we not only enrich the family of provided algorithms with one of the most powerful machine learning algorithms available, but we also expose it with all the nice features of H2O – Python and R APIs, the Flow UI, real-time training progress, and MOJO support.

Example

Let’s quickly try to run XGBoost on the HIGGS dataset from Python. The first step is to get the latest H2O and install the Python library. Please follow the instructions on the H2O download page.

The next step is to download the HIGGS training and validation data. We can use sample datasets stored in S3:

wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
# Or use full data: wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv

Now, it is time to start your favorite Python environment and build some XGBoost models.

The first step involves starting H2O on a single-node cluster:

import h2o
h2o.init()

In the next step, we import and prepare data via the H2O API:

train_path = 'higgs_train_imbalance_100k.csv'
test_path = 'higgs_test_imbalance_100k.csv'

df_train = h2o.import_file(train_path)
df_valid = h2o.import_file(test_path)

# Convert the first column (the response) to a categorical
df_train[0] = df_train[0].asfactor()
df_valid[0] = df_valid[0].asfactor()

After data preparation, it is time to build an XGBoost model. Let’s try to train 100 trees with a maximum depth of 10:

param = {
      "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)

At this point, we can use the trained model like any other H2O model, for example, to generate predictions:

prediction = model.predict(df_valid)[:,2]
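
We can also compute validation metrics and export the model; a short sketch, assuming the `model` and `df_valid` objects from above (the MOJO path is illustrative):

# Score the model on the validation frame and report AUC
perf = model.model_performance(df_valid)
print(perf.auc())

# Export the trained model as a MOJO for deployment
model.download_mojo(path="/tmp")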

Or we can open the H2O Flow UI and explore the model’s properties in a nice, user-friendly way:

Or rebuild the model with different training parameters:

Technical Details

The integration of XGBoost into the H2O Machine Learning Platform utilizes the JNI interface of XGBoost and the corresponding native libraries. H2O wraps all JNI calls and exposes them as regular H2O model and model builder APIs.

The implementation itself is based on two separate modules, which enrich the core H2O platform.

The first module, h2o-genmodel-ext-xgboost, extends the module h2o-genmodel and registers an XGBoost-specific MOJO. The module also contains all necessary XGBoost binary libraries. Right now, the module provides libraries for OS X and Linux; however, Windows support is coming soon.

The module can contain multiple libraries for each platform to support different configurations (e.g., with/without GPU/OMP). H2O always tries to load the most powerful one (currently a library with GPU and OMP support). If it fails, the loader tries the next one in a loader chain. For each platform, we always provide an XGBoost library with minimal configuration (supporting only a single CPU) that serves as a fallback in case none of the other libraries can be loaded.

The second module, h2o-ext-xgboost, contains the actual XGBoost model and model builder code, which communicates with native XGBoost libraries via the JNI API. The module also provides all necessary REST API definitions to expose XGBoost model builder to clients.

Note: To learn more about H2O modular architecture, please review our H2O Platform Extensibility blog post.

Limitations

There are several technical limitations of the current implementation that we are trying to resolve; however, it is necessary to mention them. In general, if XGBoost cannot be initialized for any reason (e.g., an unsupported platform), then the algorithm is not exposed via the REST API and is not available to clients. Clients can verify the availability of XGBoost with the corresponding client API call. For example, in Python:

is_xgboost_available = H2OXGBoostEstimator.available()

The list of limitations includes:

  1. Right now, XGBoost is initialized only for single-node H2O clusters; however, multi-node XGBoost support is coming soon.

  2. The list of supported platforms includes:

     Platform | Minimal XGBoost | OMP | GPU | Compilation OS
     ---------|-----------------|-----|-----|-----------------------
     Linux    | yes             | yes | yes | Ubuntu 14.04, g++ 4.7
     OS X     | yes             | no  | no  | OS X 10.11
     Windows  | no              | no  | no  | NA

    Note: Minimal XGBoost configuration includes support for a single CPU.

  3. Furthermore, because we are using native XGBoost libraries that depend on OS/platform libraries, it is possible that on older operating systems, XGBoost will not be able to find all necessary binary dependencies, and will not be initialized and available.

  4. XGBoost GPU libraries are compiled against CUDA 8, which is a necessary runtime requirement in order to utilize XGBoost GPU support.

Please give H2O XGBoost a chance, try it, and let us know your experience or suggest improvements via h2ostream!

H2O Platform Extensibility

The latest H2O release, 3.10.5.1, introduced several new concepts to improve the extensibility and modularity of the H2O machine learning platform. This blog post will clarify the motivation, explain the design decisions we made, and demonstrate the overall approach for this release.

Motivation

The H2O Machine Learning platform was designed as a monolithic application. However, a growing H2O community, along with multiple new projects, demanded that we revisit the architecture and make the development of independent H2O extensions easier.

Furthermore, we would like to allow easy integration of third party tools (e.g., XGBoost, TensorFlow) under a common H2O API.

Design

Conceptually, platform modularity and extensibility can be achieved in different ways:

  1. Compile time code composition: A compile time process assembles all necessary code modules together into a resulting deployable application.
  2. Link time composition: An application is composed at start time based on modules provided on the JVM classpath.
  3. Runtime composition: An application can be dynamically extended at runtime, new modules can be loaded, or existing modules can be deactivated.

The approach (1) represents the method adopted by older versions of H2O and its h2o.jar assembly process. In this case, all code is compiled and assembled into a single artifact. However, it has several major limitations: mainly, it needs a predefined list of code components to put into the resulting artifact, and it does not allow developers and the community to create independent extensions.

On the other hand, the last approach (3) is fully dynamic and is adopted by tools like OSGi, Eclipse, or Chrome; it brings the freedom of a fully dynamic environment that users can modify. However, in the context of a machine learning platform, we believe it is not necessary.

Hence, we decided to adopt the second approach (2) to our architecture and provide link time composition of modules.

With this approach, users specify the modules that they are going to use, and the specified modules are registered by H2O core via a JVM capability called Java Service Provider Interface (Java SPI).

Java SPI is a simple JVM service that allows you to register modules implementing a given interface (or extending an abstract class) and then list them at runtime. The modules need to be registered via a so-called service file located in the META-INF/services directory. The service file contains the name of the component implementation. The application can then query all available components (e.g., those given on the classpath or available via a specified classloader) and use them internally via the implemented interface.

From a design perspective, there are several locations in the H2O platform to make extensible:

  • H2O core
  • REST API
  • Parsers
  • Rapids
  • Persistent providers

In this blog post, we would like to focus only on the first two items; however, a similar approach could be or is already adopted for the remaining parts.

Regarding the first item from the list, H2O core extensibility is crucial for adopting new features – for example, to introduce a new watchdog thread that shuts down H2O if a condition is satisfied, or a new public API layer like gRPC. The core modules are marked by the interface water.AbstractH2OExtension, which provides hooks into the H2O platform lifecycle.

The second extension point allows you to extend a provided REST API, which is typically necessary when a new algorithm is introduced and needs to be exposed via REST API. In this case, the extension module needs to implement the interface water.api.RestApiExtension and register the implementation via the file META-INF/services/water.api.RestApiExtension.

Example

We are going to demonstrate extensibility on the XGBoost module – a new feature included in the latest version. XGBoost is a gradient boosting library distributed in a native, non-Java form. Our goal is to publish it via the H2O API and use it in the same way as the rest of the H2O algorithms. To realize this, we need to:

  1. Extend the core of H2O with functionality that will load a binary version of XGBoost
  2. Wrap XGBoost into the H2O Java API
  3. Expose the Java API via REST API

To implement the first step, we define a tiny implementation of water.AbstractH2OExtension, which tries to load the XGBoost native libraries. The core extension does nothing except signal the availability of XGBoost on the current platform (i.e., not all platforms are supported by XGBoost native libraries):

package hex.tree.xgboost;

public class XGBoostExtension extends AbstractH2OExtension {
  public static String NAME = "XGBoost";

  @Override
  public String getExtensionName() {
    return NAME;
  }

  @Override
  public boolean isEnabled() {
    try {
        ml.dmlc.xgboost4j.java.NativeLibLoader.load();
        return true;
    } catch (Exception e) {
        return false;
    }
  }
}

Now, we need to register the extension via SPI. We create a new file under META-INF/services called water.AbstractH2OExtension with the following content:

hex.tree.xgboost.XGBoostExtension

We will not go into details of the second step, which will be described in another blog post, but we will directly implement the last step.

To expose an H2O-specific REST API for the XGBoost Java API, we need to implement the interface water.api.RestApiExtension. However, in this example we take a shortcut and reuse the existing code infrastructure for registering the algorithm’s REST API, exposed via the class water.api.AlgoAbstractRegister:

package hex.api.xgboost;

public class RegisterRestApi extends AlgoAbstractRegister {

  @Override
  public void registerEndPoints(RestApiContext context) {
    XGBoost xgBoostMB = new XGBoost(true);
    // Register XGBoost model builder REST API
    registerModelBuilder(context, xgBoostMB, SchemaServer.getStableVersion());
  }

  @Override
  public String getName() {
    return "XGBoost";
  }

  @Override
  public List<String> getRequiredCoreExtensions() {
    return Collections.singletonList(XGBoostExtension.NAME);
  }
}

And again, it is necessary to register the defined class with the SPI subsystem via the file META-INF/services/water.api.RestApiExtension:

hex.api.xgboost.RegisterRestApi

REST API registration requires one more step: registration of the schemas used (classes that are used by the REST API and implement water.api.Schema). This is an annoying step that is necessary right now, but we hope to remove it in the future. Registration of schemas is done in the same way as registration of extensions – it is necessary to list all schemas in the file META-INF/services/water.api.Schema:

hex.schemas.XGBoostModelV3
hex.schemas.XGBoostModelV3$XGBoostModelOutputV3
hex.schemas.XGBoostV3
hex.schemas.XGBoostV3$XGBoostParametersV3

From this point, the REST API definition published by the XGBoost model builder is visible to clients. We compile the code, bundle it with the H2O core code (or put it on the classpath), and run it:

java -cp h2o-ext-xgboost.jar:h2o.jar water.H2OApp

During startup, we should see a boot message that mentions the loaded extensions (the XGBoost core extension and the REST API extension):

INFO: Flow dir: '/Users/michal/h2oflows'
INFO: Cloud of size 1 formed [/192.168.1.65:54321]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: XGBoost extension initialized
INFO: Watchdog extension initialized
INFO: Registered 2 core extensions in: 68ms
INFO: Registered H2O core extensions: [XGBoost, Watchdog]
INFO: Found XGBoost backend with library: xgboost4j
INFO: Registered: 160 REST APIs in: 310ms
INFO: Registered REST API extensions: [AutoML, XGBoost, Algos, Core V3, Core V4]
INFO: Registered: 230 schemas in 342ms
INFO: H2O started in 4932ms

The platform also publishes the list of available extensions via a capabilities REST endpoint. A client can get the complete list of capabilities via GET <ip:port>/3/Capabilities:

curl http://localhost:54321/3/Capabilities

Or get a list of core extensions (GET <ip:port>/3/Capabilities/Core):

curl http://localhost:54321/3/Capabilities/Core

Or get a list of REST API extensions (GET <ip:port>/3/Capabilities/API):

curl http://localhost:54321/3/Capabilities/API

Note: We do not modularize the R/Python/Flow clients. The client is responsible for configuring itself based on information provided by the backend (e.g., via the Capabilities REST endpoint) and failing gracefully if the user invokes an operation that is not provided by the backend.

For more details about the change, please consult the following:

H2O announces GPU Open Analytics Initiative with MapD & Continuum

H2O.ai, Continuum Analytics, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications.

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU or copying of in-memory data, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which can then be manipulated by Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU.

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.”

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption – we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform.

Details of the GPU Data Frame can be found at the Initiative’s GitHub repo.