Driverless AI Blog

In today’s market, there aren’t enough data scientists to satisfy the growing demand for people in the field. With many companies moving towards automating processes across their businesses (everything from HR to Marketing), companies are forced to compete for the best data science talent to meet their needs. A report by McKinsey says that based on 2018 job market predictions: “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” H2O’s Driverless AI addresses this gap by democratizing data science and making it accessible to non-experts, while simultaneously increasing the efficiency of expert data scientists. Its point-and-click UI minimizes the complicated legwork that precedes the actual model build.

Driverless AI is designed to take a raw dataset and run it through a proprietary algorithm that automates the data exploration/feature engineering process, which typically takes ~80% of a data scientist’s time. It then auto-tunes model parameters and provides the user with the model that yields the best results. Therefore, experienced data scientists are spending far less time engineering new features and can focus on drawing actionable insights from the models Driverless AI builds. Lastly, the user can see visualizations generated by the Machine Learning Interpretability (MLI) component of Driverless AI to clarify the model results and the effect of changing variables’ values. The MLI feature eliminates the black box nature of machine learning models and provides clear and straightforward results from a model as well as how changing features will alter results.

Driverless AI is also GPU-enabled, which can result in speedups of up to 40x. We demonstrated GPU acceleration achieving those speedups for machine learning algorithms at GTC in May 2017. We’ve ported XGBoost, GLM, K-Means, and other algorithms to GPUs to achieve significant performance gains. This enables Driverless AI to run thousands of iterations to find the most accurate feature transforms and models.

The automatic nature of Driverless AI leads to increased accuracy. AutoDL engineers new features mechanically, and AutoML finds the right algorithms and tunes them to create the perfect ensemble of models. You can think of it as a Kaggle Grandmaster in a box. To demonstrate the power of Driverless AI, we participated in several Kaggle contests; the results are shown below. Out of the box, Driverless AI performed nearly as well as the best Kaggle Grandmasters.

Let’s look at an example: we are going to work with a credit card dataset and predict whether or not a person is going to default on their payment next month based on a set of variables related to their payment history. After simply choosing the variable we are predicting for as well as the number of iterations we’d like to run, we launch our experiment.

As the experiment cycles through iterations, it creates a variable importance chart ranking existing and newly created features by their effect on the model’s accuracy.

In this example, AutoDL creates a feature that represents the cross validation target encoding of the variables sex and education. In other words, if we group everyone who is of the same sex and who has the same level of education in this dataset, the resulting feature would help in predicting whether or not the customer is going to default on their payment next month. Generating features like this one usually takes the majority of a data scientist’s time, but Driverless AI automates this process for the user.
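To make the idea concrete, here is a minimal sketch of cross-validation (out-of-fold) target encoding in plain Python. It is not the Driverless AI implementation, which is proprietary; the function name, the round-robin fold assignment, and the tiny sample rows are assumptions made for illustration. Each row is encoded with the mean default rate of rows in the other folds that share its sex and education values, so a row never sees its own label.

```python
from collections import defaultdict


def cv_target_encode(rows, keys, target, n_folds=5):
    """Return an out-of-fold mean of `target` for each row's group.

    A group is the combination of values in the `keys` columns. Rows whose
    group has no members outside their own fold fall back to the global mean.
    """
    global_mean = sum(row[target] for row in rows) / len(rows)
    # Round-robin fold assignment (an assumption; real tools usually shuffle).
    folds = [i % n_folds for i in range(len(rows))]

    group_sum = defaultdict(float)  # target sum per group, all folds
    group_cnt = defaultdict(int)    # row count per group, all folds
    fold_sum = defaultdict(float)   # target sum per (group, fold)
    fold_cnt = defaultdict(int)     # row count per (group, fold)
    for row, fold in zip(rows, folds):
        group = tuple(row[k] for k in keys)
        group_sum[group] += row[target]
        group_cnt[group] += 1
        fold_sum[(group, fold)] += row[target]
        fold_cnt[(group, fold)] += 1

    encoded = []
    for row, fold in zip(rows, folds):
        group = tuple(row[k] for k in keys)
        # Subtract the row's own fold so its label cannot leak into its feature.
        s = group_sum[group] - fold_sum[(group, fold)]
        c = group_cnt[group] - fold_cnt[(group, fold)]
        encoded.append(s / c if c else global_mean)
    return encoded


# Hypothetical sample data, not taken from the credit card dataset.
example_rows = [
    {"sex": "F", "education": "university", "default": 1},
    {"sex": "F", "education": "university", "default": 0},
    {"sex": "F", "education": "university", "default": 1},
    {"sex": "M", "education": "high school", "default": 0},
]
example_encoding = cv_target_encode(
    example_rows, ["sex", "education"], "default", n_folds=2
)
```

The out-of-fold step is what distinguishes this from a plain group mean: encoding a row with a mean that includes its own target would leak the answer into the feature.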

After AutoDL generates new features, we run the updated dataset through AutoML. At this point, Driverless AI builds a series of models using various algorithms and delivers a leaderboard ranking the success of each model. The user can then inspect and choose the model that best fits their needs.

Lastly, we can use the Machine Learning Interpretability feature to get clear and concise explanations of our model results. Four dynamic graphs are generated automatically: KLIME, Variable Importance, Decision Tree Chart, and Partial Dependence Plot. Each one helps the user explore the model output more closely. KLIME creates one global surrogate GLM on the entire training data and also creates numerous local surrogate GLMs on samples formed from K-Means clusters in the training data. All penalized GLM surrogates are trained to model the predictions of the Driverless AI model. Variable Importance measures the effect that a variable has on the predictions of a model, while the Partial Dependence Plot shows the effect of changing one variable on the outcome. The Decision Tree Surrogate Model clarifies the Driverless AI model by displaying an approximate flow chart of the complex model’s decision-making process. It also displays the most important variables and the most important interactions in the Driverless AI model. Lastly, the Explanations button gives the user a plain-English sentence about how each variable affects the model.

All of these graphs can be used to visualize and debug the Driverless AI model by comparing the displayed decision-process, important variables, and important interactions to known standards, domain knowledge, and reasonable expectations.
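As an illustration of what a partial dependence plot computes, the following is a minimal sketch in plain Python. It is not the MLI implementation; the `toy_model` scoring function and the column names `months_late` and `limit` are hypothetical stand-ins for a trained model and its inputs. For each grid value, the chosen feature is overwritten in every row and the predictions are averaged.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows stay untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Hypothetical default-probability model used only for this example."""
    score = 0.1 + 0.2 * row["months_late"]
    if row["limit"] < 1000:
        score += 0.1
    return min(1.0, score)


sample_rows = [
    {"months_late": 0, "limit": 500},
    {"months_late": 2, "limit": 5000},
]
pd_curve = partial_dependence(toy_model, sample_rows, "months_late", [0, 1, 2])
```

Plotting the second element of each pair against the first gives the partial dependence curve for `months_late`.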

Driverless AI streamlines the machine learning workflow for inexperienced and expert users alike. For more information, click here.

Scalable Automatic Machine Learning: Introducing H2O’s AutoML

Prepared by: Erin LeDell, Navdeep Gill & Ray Peck

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts and experts, alike. The first steps toward simplifying machine learning involved developing simple, unified interfaces to a variety of machine learning algorithms (e.g. H2O).

Although H2O has made it easy for non-experts to experiment with machine learning, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep Neural Networks in particular are notoriously difficult for a non-expert to tune properly. We have designed an easy-to-use interface which automates the process of training a large, diverse selection of candidate models and training a stacked ensemble on the resulting models (which often leads to an even better model). Making its debut in the latest “Preview Release” of H2O, version 3.12.0.1 (aka “Vapnik”), we introduce H2O’s AutoML for Scalable Automatic Machine Learning.

H2O’s AutoML can be used for automating a large part of the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time limit. The user can also use a performance metric-based stopping criterion for the AutoML process rather than a specific time constraint. Stacked Ensembles will be automatically trained on the collection of individual models to produce a highly predictive ensemble model which, in most cases, will be the top-performing model in the AutoML Leaderboard.

AutoML Interface

We provide a simple function that performs a process that would typically require many lines of code. This frees up users to focus on other tasks in the data science pipeline, such as data preprocessing, feature engineering, and model deployment.

R:

aml <- h2o.automl(x = x, y = y, training_frame = train, 
                  max_runtime_secs = 3600)

Python:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_runtime_secs = 3600)
aml.train(x = x, y = y, training_frame = train)

Flow (H2O's Web GUI):

AutoML Leaderboard

Each AutoML run returns a "Leaderboard" of models, ranked by a default performance metric. Here is an example leaderboard for a binary classification task:

More information, and full R and Python code examples are available on the H2O 3.12.0.1 AutoML docs page in the H2O User Guide.

XGBoost in H2O Machine Learning Platform

The new H2O release 3.10.5.1 brings a shiny new feature – integration of the powerful XGBoost library algorithm into H2O Machine Learning Platform!

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

By integrating XGBoost into the H2O Machine Learning platform, we not only enrich the family of provided algorithms by one of the most powerful machine learning algorithms, but we have also exposed it with all the nice features of H2O – Python, R APIs and Flow UI, real-time training progress, and MOJO support.

Example

Let’s quickly try to run XGBoost on the HIGGS dataset from Python. The first step is to get the latest H2O and install the Python library. Please follow the instructions on the H2O download page.

The next step is to download the HIGGS training and validation data. We can use sample datasets stored in S3:

wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
# Or use full data: wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv

Now, it is time to start your favorite Python environment and build some XGBoost models.

The first step involves starting H2O on a single-node cluster:

import h2o
h2o.init()

In the next step, we import and prepare data via the H2O API:

train_path = 'higgs_train_imbalance_100k.csv'
test_path = 'higgs_test_imbalance_100k.csv'

df_train = h2o.import_file(train_path)
df_valid = h2o.import_file(test_path)

# Convert the first column (the response) into a categorical feature
df_train[0] = df_train[0].asfactor()
df_valid[0] = df_valid[0].asfactor()

After data preparation, it is time to build an XGBoost model. Let’s try to train 100 trees with a maximum depth of 10:

param = {
      "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)

At this point we can use the trained model like a normal H2O model, and for example use it to generate predictions:

prediction = model.predict(df_valid)[:,2]

Or we can open the H2O Flow UI and explore model properties in a nice, user-friendly way:

Or rebuild the model with different training parameters:

Technical Details

The integration of XGBoost into the H2O Machine Learning Platform utilizes the JNI interface of XGBoost and the corresponding native libraries. H2O wraps all JNI calls and exposes them as regular H2O model and model builder APIs.

The implementation itself is based on two separate modules that extend the core H2O platform.

The first module, h2o-genmodel-ext-xgboost, extends the module h2o-genmodel and registers an XGBoost-specific MOJO. The module also contains all necessary XGBoost binary libraries. Right now, the module provides libraries for OS X and Linux; Windows support is coming soon.

The module can contain multiple libraries for each platform to support different configurations (e.g., with/without GPU/OMP). H2O always tries to load the most powerful one (currently a library with GPU and OMP support). If it fails, the loader tries the next one in a loader chain. For each platform, we always provide an XGBoost library with a minimal configuration (supporting only a single CPU) that serves as a fallback in case none of the other libraries can be loaded.

The second module, h2o-ext-xgboost, contains the actual XGBoost model and model builder code, which communicates with native XGBoost libraries via the JNI API. The module also provides all necessary REST API definitions to expose XGBoost model builder to clients.

Note: To learn more about the H2O modular architecture, please review our H2O Platform Extensibility blog post.

Limitations

There are several technical limitations of the current implementation that we are trying to resolve. However, it is necessary to mention them. In general, if XGBoost cannot be initialized for any reason (e.g., unsupported platform), then the algorithm is not exposed via the REST API and is not available to clients. Clients can verify the availability of XGBoost by using the corresponding client API call. For example, in Python:

from h2o.estimators import H2OXGBoostEstimator
is_xgboost_available = H2OXGBoostEstimator.available()

The list of limitations includes:

  1. Right now XGBoost is initialized only for single-node H2O clusters; however, multi-node XGBoost support is coming soon.

  2. The list of supported platforms includes:
    Platform   Minimal XGBoost   OMP   GPU   Compilation OS
    Linux      yes               yes   yes   Ubuntu 14.04, g++ 4.7
    OS X       yes               no    no    OS X 10.11
    Windows    no                no    no    NA

    Note: Minimal XGBoost configuration includes support for a single CPU.

  3. Furthermore, because we are using native XGBoost libraries that depend on OS/platform libraries, it is possible that on older operating systems, XGBoost will not be able to find all necessary binary dependencies, and will not be initialized and available.

  4. XGBoost GPU libraries are compiled against CUDA 8, which is a necessary runtime requirement in order to utilize XGBoost GPU support.

Please give H2O XGBoost a chance, try it, and let us know about your experience or suggest improvements via h2ostream!

H2O Platform Extensibility

The latest H2O release, 3.10.5.1, introduced several new concepts to improve extensibility and modularity of the H2O machine learning platform. This blog post will clarify motivation, explain design decisions we made, and demonstrate the overall approach for this release.

Motivation

The H2O Machine Learning platform was designed as a monolithic application. However, a growing H2O community along with multiple new projects demanded that we revisit the architecture and make the development of independent H2O extensions easier.

Furthermore, we would like to allow easy integration of third party tools (e.g., XGBoost, TensorFlow) under a common H2O API.

Design

Conceptually, platform modularity and extensibility can be achieved in different ways:

  1. Compile time code composition: A compile time process assembles all necessary code modules together into a resulting deployable application.
  2. Link time composition: An application is composed at start time based on modules provided at JVM classpath.
  3. Runtime composition: An application can be dynamically extended at runtime, new modules can be loaded, or existing modules can be deactivated.

Approach (1) is the method adopted by older versions of H2O and their h2o.jar assembly process. In this case, all code is compiled and assembled into a single artifact. However, it has several major limitations. Mainly, it needs a predefined list of code components to put into the resulting artifact, and it does not allow developers and the community to create independent extensions.

On the other hand, the last approach (3) is fully dynamic and is adopted by tools like OSGi, Eclipse, or Chrome and brings the freedom of having a fully dynamic environment that users can modify. However, in the context of a machine learning platform, we believe it is not necessary.

Hence, we decided to adopt the second approach (2) to our architecture and provide link time composition of modules.

With this approach, users specify the modules that they are going to use, and the specified modules are registered by H2O core via a JVM capability called Java Service Provider Interface (Java SPI).

Java SPI is a simple JVM service that allows you to register modules that implement a given interface (or extend an abstract class) and then list them at runtime. The modules need to be registered by a so-called service file located in the META-INF/services directory. The service file is named after the interface and contains the name of the component implementation. The application can then query all available components (e.g., those on the classpath or available via a specified classloader) and use them internally via the implemented interface.
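The lookup that Java SPI performs can be mimicked in Python to show the mechanics. The sketch below is an analogy only, not Java SPI and not H2O code: it reads service-file-style text with one fully qualified class name per line, resolves each name to a class, and checks that the class extends the expected base type. The function name and the use of standard-library classes as stand-in "providers" are assumptions for illustration.

```python
import collections
import importlib


def load_providers(service_file_text, base_class):
    """Resolve each class name listed in the text and verify its base type."""
    providers = []
    for raw_line in service_file_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # '#' starts a comment
        if not line:
            continue
        module_name, _, class_name = line.rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        if not issubclass(cls, base_class):
            raise TypeError("%s does not extend %s" % (line, base_class.__name__))
        providers.append(cls)
    return providers


# Stand-in for the content of a service file; `dict` plays the role of the
# interface and two of its standard-library subclasses play the providers.
SERVICE_FILE_TEXT = """
# one implementation per line
collections.OrderedDict
collections.Counter
"""
found_providers = load_providers(SERVICE_FILE_TEXT, dict)
```

In Java, the equivalent lookup is provided by the standard `java.util.ServiceLoader` class, so no hand-written parsing is needed.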

From a design perspective, there are several locations in the H2O platform to make extensible:

  • H2O core
  • REST API
  • Parsers
  • Rapids
  • Persistent providers

In this blog post, we would like to focus only on the first two items; however, a similar approach could be or is already adopted for the remaining parts.

Regarding the first item on the list, H2O core extensibility is crucial for adopting new features: for example, to introduce a new watchdog thread that shuts down H2O if a condition is satisfied, or a new public API layer like GRPC. The core modules are marked by the interface water.AbstractH2OExtension, which provides hooks into the H2O platform lifecycle.

The second extension point allows you to extend a provided REST API, which is typically necessary when a new algorithm is introduced and needs to be exposed via REST API. In this case, the extension module needs to implement the interface water.api.RestApiExtension and register the implementation via the file META-INF/services/water.api.RestApiExtension.

Example

We are going to demonstrate extensibility using the XGBoost module, a new feature included in the latest version. XGBoost is a gradient boosting library distributed in a native non-Java form. Our goal is to publish it via the H2O API and use it in the same way as the rest of the H2O algorithms. To achieve that, we need to:

  1. Extend the core of H2O with functionality that will load a binary version of XGBoost
  2. Wrap XGBoost into the H2O Java API
  3. Expose the Java API via REST API

To implement the first step, we are going to define a tiny implementation of water.AbstractH2OExtension, which will try to load the XGBoost native libraries. The core extension does nothing except signal the availability of XGBoost on the current platform (not all platforms are supported by the XGBoost native libraries):

package hex.tree.xgboost;

import water.AbstractH2OExtension;

public class XGBoostExtension extends AbstractH2OExtension {
  public static String NAME = "XGBoost";

  @Override
  public String getExtensionName() {
    return NAME;
  }

  @Override
  public boolean isEnabled() {
    try {
        ml.dmlc.xgboost4j.java.NativeLibLoader.load();
        return true;
    } catch (Exception e) {
        return false;
    }
  }
}

Now, we need to register the extension via SPI. We create a new file under META-INF/services called water.AbstractH2OExtension with the following content:

hex.tree.xgboost.XGBoostExtension

We will not go into details of the second step, which will be described in another blog post, but we will directly implement the last step.

To expose an H2O-specific REST API for the XGBoost Java API, we need to implement the interface water.api.RestApiExtension. However, in this example we take a shortcut and reuse existing code infrastructure for registering the algorithm’s REST API, exposed via the class water.api.AlgoAbstractRegister:

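package hex.api.xgboost;

import hex.tree.xgboost.XGBoost;
import hex.tree.xgboost.XGBoostExtension;
import water.api.AlgoAbstractRegister;
import water.api.RestApiContext;
import water.api.SchemaServer;

import java.util.Collections;
import java.util.List;

public class RegisterRestApi extends AlgoAbstractRegister {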

  @Override
  public void registerEndPoints(RestApiContext context) {
    XGBoost xgBoostMB = new XGBoost(true);
    // Register XGBoost model builder REST API
    registerModelBuilder(context, xgBoostMB, SchemaServer.getStableVersion());
  }

  @Override
  public String getName() {
    return "XGBoost";
  }

  @Override
  public List<String> getRequiredCoreExtensions() {
    return Collections.singletonList(XGBoostExtension.NAME);
  }
}

And again, it is necessary to register the defined class with the SPI subsystem via the file META-INF/services/water.api.RestApiExtension:

hex.api.xgboost.RegisterRestApi

REST API registration requires one more step: registering the schemas used (classes that are used by the REST API and implement water.api.Schema). This is an annoying step that is necessary right now, but we hope to remove it in the future. Registration of schemas is done in the same way as registration of extensions: it is necessary to list all schemas in the file META-INF/services/water.api.Schema:

hex.schemas.XGBoostModelV3
hex.schemas.XGBoostModelV3$XGBoostModelOutputV3
hex.schemas.XGBoostV3
hex.schemas.XGBoostV3$XGBoostParametersV3

From this point, the REST API definition published by XGBoost model builder is visible to clients. We compile the code and bundle it with H2O core code (or put it on the classpath) and run it:

java -cp h2o-ext-xgboost.jar:h2o.jar water.H2OApp

During the start, we should see a boot message that mentions loaded extensions (XGBoost core extension and REST API extension):

INFO: Flow dir: '/Users/michal/h2oflows'
INFO: Cloud of size 1 formed [/192.168.1.65:54321]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: XGBoost extension initialized
INFO: Watchdog extension initialized
INFO: Registered 2 core extensions in: 68ms
INFO: Registered H2O core extensions: [XGBoost, Watchdog]
INFO: Found XGBoost backend with library: xgboost4j
INFO: Registered: 160 REST APIs in: 310ms
INFO: Registered REST API extensions: [AutoML, XGBoost, Algos, Core V3, Core V4]
INFO: Registered: 230 schemas in 342ms
INFO: H2O started in 4932ms

The platform also publishes a list of available extensions via a capabilities REST end-point. A client can get the complete list of capabilities via GET <ip:port>/3/Capabilities:

curl http://localhost:54321/3/Capabilities

Or get a list of core extensions (GET <ip:port>/3/Capabilities/Core):

curl http://localhost:54321/3/Capabilities/Core

Or get a list of REST API extensions (GET <ip:port>/3/Capabilities/API):

curl http://localhost:54321/3/Capabilities/API

Note: We do not modularize the R/Python/Flow clients. Each client is responsible for configuring itself based on information provided by the backend (e.g., via the Capabilities REST end-point) and should fail gracefully if the user invokes an operation that is not provided by the backend.

For more details about the change, please, consult the following:

H2O announces GPU Open Analytics Initiative with MapD & Continuum

H2O.ai, Continuum Analytics, and MapD Technologies have announced the formation of the GPU Open Analytics Initiative (GOAI) to create common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs. GOAI will foster the development of a data science ecosystem on GPUs by allowing resident applications to interchange data seamlessly and efficiently. BlazingDB, Graphistry and Gunrock from UC Davis led by CUDA Fellow John Owens have joined the founding members to contribute their technical expertise.

The formation of the Initiative comes at a time when analytics and machine learning workloads are increasingly being migrated to GPUs. However, while individually powerful, these workloads have not been able to benefit from the power of end-to-end GPU computing. A common standard will enable intercommunication between the different data applications and speed up the entire workflow, removing latency and decreasing the complexity of data flows between core analytical applications.

At the GPU Technology Conference (GTC), NVIDIA’s annual GPU developers’ conference, the Initiative announced its first project: an open source GPU Data Frame with a corresponding Python API. The GPU Data Frame is a common API that enables efficient interchange of data between processes running on the GPU. End-to-end computation on the GPU avoids transfers back to the CPU and copying of in-memory data, reducing compute time and cost for the high-performance analytics common in artificial intelligence workloads.

Users of the MapD Core database can output the results of a SQL query into the GPU Data Frame, which then can be manipulated by the Continuum Analytics’ Anaconda NumPy-like Python API or used as input into the H2O suite of machine learning algorithms without additional data manipulation. In early internal tests, this approach exhibited order-of-magnitude improvements in processing times compared to passing the data between applications on a CPU.

“The data science and analytics communities are rapidly adopting GPU computing for machine learning and deep learning. However, CPU-based systems still handle tasks like subsetting and preprocessing training data, which creates a significant bottleneck,” said Todd Mostak, CEO and co-founder of MapD Technologies. “The GPU Data Frame makes it easy to run everything from ingestion to preprocessing to training and visualization directly on the GPU. This efficient data interchange will improve performance, encouraging development of ever more sophisticated GPU-based applications.”

“GPU Data Frame relies on the Anaconda platform as the foundational fabric that brings data science technologies together to take full advantage of GPU performance gains,” said Travis Oliphant, co-founder and chief data scientist of Continuum Analytics. “Using NVIDIA’s technology, Anaconda is mobilizing the Open Data Science movement by helping teams avoid the data transfer process between CPUs and GPUs and move nimbly toward their larger business goals. The key to producing this kind of innovation are great partners like H2O and MapD.”

“Truly diverse open source ecosystems are essential for adoption – we are excited to start GOAI for GPUs alongside leaders in data and analytics pipeline to help standardize data formats,” said Sri Ambati, CEO and co-founder of H2O.ai. “GOAI is a call for the community of data developers and researchers to join the movement to speed up analytics and GPU adoption in the enterprise.”

The GPU Open Analytics Initiative is actively welcoming participants who are committed to open source and to GPUs as a computing platform.

Details of the GPU Data Frame can be found at the Initiative’s GitHub repo.

Machine Learning on GPUs

With H2O GPU Edition, H2O.ai seeks to build the fastest artificial intelligence (AI) platform on GPUs. While deep learning has recently taken advantage of the tremendous performance boost provided by GPUs, many machine learning algorithms can benefit from the efficient fine-grained parallelism and high throughput of GPUs. Importantly, GPUs allow one to complete training and inference much faster than possible on ordinary CPUs. In this blog post, we’re excited to share some of our recent developments implementing machine learning on GPUs.

Consider generalized linear models (GLMs), which are highly interpretable models compared to neural network models. As with all models, feature selection is important to control the variance. This is especially true when the number of features is large, i.e. \(p > N\), where \(p\) is the number of features and \(N\) is the number of observations in a data set. The Lasso regularizes least squares with an \(\ell_1\) penalty, simultaneously providing shrinkage and feature selection. However, the Lasso suffers from a few limitations, including an upper bound on variable selection at \(N\) and failure to do grouped feature selection. Elastic net regression overcomes these limitations by introducing an \(\ell_2\) penalty to the regularization [1]. The elastic net loss function is as follows:

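\[ \min_{\beta}\; \frac{1}{2N}\,\lVert y - X\beta \rVert_2^2 + \lambda\left(\alpha\,\lVert \beta \rVert_1 + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2\right) \]

where \(X\) is the feature matrix, \(y\) is the response, \(\beta\) is the coefficient vector, \(\lambda\) specifies the regularization strength, and \(\alpha\) controls the penalty distribution between \(\ell_1\) and \(\ell_2\).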

Multiple GPUs can be used to fit the full regularization path (i.e. \(\lambda\) sweep) for multiple values of \(\alpha\) or \(\lambda\).

Below are the results of computing a grid of elastic net GLMs for eight equally spaced values of \(\alpha\) between (and including) 0 (full \(\ell_2\)) and 1 (full \(\ell_1\); Lasso) across the entire regularization path of 100 \(\lambda\) values with 5-fold cross validation. Effectively, about 4000 models are trained to predict income using the U.S. Census data set (10k features and 45k records).
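The size of this grid, and the role of \(\alpha\) at its two endpoints, can be checked with a short Python sketch. The penalty function assumes the common parameterization \(\lambda\,(\alpha\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\lVert\beta\rVert_2^2)\), and the coefficient vector is a made-up example; neither is taken from the benchmark code.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Eight equally spaced alpha values from 0 (pure L2) to 1 (pure L1, Lasso).
alphas = [i / 7 for i in range(8)]
n_lambdas = 100  # points on the regularization path
n_folds = 5      # cross-validation folds

# One model per (alpha, lambda, fold) combination.
n_models = len(alphas) * n_lambdas * n_folds

# Hypothetical coefficients, used only to show the two extremes of alpha.
beta = [1.0, -2.0]
lasso_penalty = elastic_net_penalty(beta, lam=0.5, alpha=1.0)  # L1 term only
ridge_penalty = elastic_net_penalty(beta, lam=0.5, alpha=0.0)  # L2 term only
```

The product 8 x 100 x 5 accounts for the roughly 4000 models mentioned above.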

Five scenarios are shown, including training with two Dual Intel Xeon E5-2630 v4 CPUs and various numbers of P100 GPUs using the NVIDIA DGX-1. The performance gain of GPU-acceleration is clear, showing greater than 35x speed up with eight P100 GPUs over the two Xeon CPUs.

Similarly, we can apply GPU acceleration to gradient boosting machines (GBM). Here, we utilize multiple GPUs to train separate binary classification GBM models with different depths (i.e. max_depth = [6,8,10,12]) and different observation sample rates (i.e. sample_rate = [0.7, 0.8, 0.9, 1]) using the Higgs dataset (29 features and 1M records). The GBM models were trained under the same computing scenarios as the GLM cases above. Again, we see a substantial speedup of up to 16x when utilizing GPUs.

GPUs enable a quantum leap in machine learning, opening the possibilities to train more models, larger models, and more complex models — all in much shorter times. Iteration cycles can be shortened and delivery of AI within organizations can be scaled with multiple GPU boards with multiple nodes.

The Elastic Net GLM and GBM benchmarks shown above are straightforward implementations that showcase the raw computational gains of GPUs. On top of this, mathematical optimizations in the algorithms could result in even more speedup. Indeed, the H2O CPU-based GLM is sparse-aware when processing the data, and our newly developed H2O CPU-based GLM implements mathematical optimizations that lead it to outperform a naive implementation by a factor of about 10: 320s for the H2O CPU GLM versus 3570s for a naive CPU GLM. The figure below compares the H2O CPU GLM and H2O GPU GLM against other framework implementations. TensorFlow uses stochastic gradient descent with warm start, the H2O CPU version and scikit-learn use a coordinate descent algorithm, and the H2O GPU GLM uses a direct matrix method that is optimal for dense matrices. We welcome improvements to these other frameworks; see http://github.com/h2oai/perf/.

H2O GPU edition captures the benefits of both GPU acceleration and H2O’s mathematical optimizations, taking the performance of AI to a level unparalleled in the space. Our focus on speed, accuracy, and interpretability has produced tremendously positive results. The benchmarks presented in this article demonstrate this, and we will have more benchmark results to present in the near future. For more information about H2O GPU edition, please visit www.h2o.ai/gpu.

[1] H. Zou and T. Hastie. “Regularization and variable selection via the elastic net.” Journal of the Royal Statistical Society: Series B, 67(2):301-320, 2005. https://web.stanford.edu/~hastie/Papers/B67.2%20(2005)%20301-320%20Zou%20&%20Hastie.pdf

The Race for Intelligence: How AI is Eating Hardware – Towards an AI-defined hardware world

With the AI arms race reaching a fever pitch, every data-driven company is (or at least should be) evaluating its approach to AI as a means to make its own datasets as powerful as they can possibly be. In fact, any business that’s not currently thinking about how AI can transform its operations risks falling behind its competitors and missing out on new business opportunities entirely. AI is becoming a requirement. It’s no longer a “nice to have.”

It’s no secret that AI is hot right now. But the sudden surge in its popularity, both in business and the greater tech zeitgeist, is no coincidence. Until recently, the hardware required to compute and process immense complex datasets just didn’t exist. Hardware, until now, has always dictated what software was capable of — in other words, hardware influenced software design.

Not anymore.

The emergence of graphics processing units (GPUs) has fundamentally changed how people think about data. AI is data hungry: the more data you feed your AI, the better it can perform. But this obviously imposes computational requirements, namely substantial memory (storage) and processing power. Today’s GPUs can be as much as 100x faster than CPUs on these workloads, making analysis of massive data sets possible. Now that GPUs are able to process this scale of data, the potential for AI applications is virtually limitless. Previously, the demands of hardware influenced software design. Today, the opposite is true: AI is influencing how hardware is designed and built.

Here are the three macro-level trends enabling AI to eat hardware:

1.) AI is Eating Software

The old paradigm that business intelligence software relies upon rule-based engines no longer applies. Instead, the model has evolved to the point where artificial intelligence software now relies upon statistical data training, or machine learning. As statistical data training grows up, it’s feasting on rules-based engines. However, this transformation requires an immense amount of data to train the cognitive function, and AI is influencing the design of hardware to facilitate the training. AI is not only influencing hardware design, as evidenced by the rise of GPUs, but also eating the traditional rules-based software that has long been the hallmark of business intelligence.

What does this mean in practical terms? It means businesses can now use AI to address specific problems, and in a sense “manufacture” intelligence. For example, creating a human doctor involves roughly 30 years of training, from a child’s birth to when they have completed their residency and gotten their first job. But with AI, we can now create a “doctor” without 30 years of training. On a single chip, encoded with AI, a self-learning “doctor” can be trained in 11 days with petabytes of data. Not only that, you can install this “doctor” in a million places by replicating that chip, so long as there’s a device and connectivity.

This may be an extreme example, but it illustrates just how quickly AI is advancing our ability to learn from data.

2.) The Edge is Becoming More Intelligent

Another major trend supporting AI’s influence over hardware is the democratization of intelligence. In the 1980s, mainframes were the only devices powerful enough to handle large datasets. At the time, nobody could have possibly imagined that an invention like the personal computer would come along and give the computing power of a mainframe to the masses.

Fast forward 30 years, and history is repeating itself. The Internet of Things is making it possible for intelligence to be distributed even further from centralized mainframes, to literally any connected device. Today, tiny sensors have computing power comparable to that of a PC, meaning there will be many more types of devices that can process data. Soon, IoT devices of all sizes will be much more powerful than the smartphone.

This means that intelligence is headed to the edge, away from big, centralized systems like a mainframe. The cloud enables connection between edge and center, so with really smart devices on the edge, information can travel rapidly between any number of devices.

3.) Everything is Dataware

AI constantly seeks data, and business intelligence is actionable only when the AI has a steady diet of data. Thanks to the hardware movement and the shift of intelligence to the edge, there are more points of data collection than ever. However, the hardware movement is not just about collecting and storing data, but rather continuously learning from data and monetizing those insights. In the future, power is at the edge, and over time, the power of the individual device will increase. As those devices continue to process data, the monetization of that data will continue to make the edge more powerful.

AI presents us with a distributed view of the world. Because data is being analyzed on the edge by systems that learn continuously, knowledge is not only increasing at the edge, but flowing back to the center too. Everything is now dataware.

As the demands for data processing power increase across businesses, AI is transforming how enterprises shape their entire data strategy. Software is changing as a result. Gone are the days when rules-based computing was sufficient to analyze the magnitude of available data. Statistical data training is required to handle the load. But CPUs can only handle a fraction of the demand, so the demands of AI are influencing the way that hardware is designed. As hardware becomes more ubiquitous via IoT, intelligence and data are moving to the edge and the balance of power is shifting to the masses.

Use H2O.ai on Azure HDInsight

This is a repost from this article on MSDN

We’re hosting an upcoming webinar to show you how to use H2O on HDInsight and to answer your questions. Sign up for the webinar on combining H2O and Azure HDInsight.

We recently announced that H2O and Microsoft Azure HDInsight have integrated to provide data scientists with a leading combination of engines for machine learning and deep learning. Through H2O’s AI platform and its Sparkling Water solution, users can combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark, as well as drive computation from Scala/R/Python and utilize the H2O Flow UI, providing an ideal machine learning platform for application developers.

In this blog, we will provide a detailed step-by-step guide to help you set up the first H2O on HDInsight solution.

Step 1: setting up the environment

The first step is to create an HDInsight cluster with H2O installed. You can either create an HDInsight cluster and install H2O at provisioning time, or install H2O on an existing cluster. Please note that, as of today, H2O on HDInsight only works with Spark 2.0 on HDInsight 3.5, which is the default version of HDInsight.

For more information on how to create a cluster in HDInsight, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters). For more information on how to install an application on an existing cluster, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-applications).

Please note that we’ve recently updated our UI to require fewer clicks, so you need to click the “custom” button to install applications on HDInsight.

hdi-image1

Step 2: Writing your first H2O application in Jupyter

After installing H2O on HDInsight, you can use the built-in Jupyter notebooks to write your first H2O on HDInsight application. Go to https://yourclustername.azurehdinsight.net/jupyter to open the Jupyter Notebook. You will see a folder named “H2O-PySparkling-Examples”.

hdi-image2

There are a few examples in the folder, but I recommend starting with the one named “Sentiment_analysis_with_Sparkling_Water.ipynb”. Most of the details on how to use the H2O PySparkling Water APIs are already covered in the Notebook itself, so here I will give some high-level overviews.

The first thing you need to do is to configure the environment. Most of the configuration is already taken care of by the system, such as the FLOW UI address, the Spark jar location, the Sparkling Water egg file, etc.

There are three important parameters to configure: the driver memory, the executor memory, and the number of executors. The default values are optimized for the default 4-node cluster, but your cluster size might vary.

Tuning these parameters is outside the scope of this blog, as it is more of a Spark resource tuning problem. There are a few good reference articles such as this one.

Note that all Spark applications deployed using a Jupyter Notebook will have the “yarn-cluster” deploy mode. This means that the Spark driver will be allocated on one of the worker nodes of the cluster, not on the head nodes.

In this example, we simply allocate 75% of an HDInsight worker node’s memory to the driver and to each executor (21 GB each), and use 3 executors, since the default HDInsight cluster size is 4 worker nodes (3 executors + 1 driver).
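The arithmetic behind those three parameters can be sketched in a few lines of Python. This is only an illustration: the function name `spark_resources` is hypothetical, and the 28 GB per worker node is an assumption inferred from 21 GB being 75% of a node. The keys are standard Spark property names; how they are passed to the notebook session is not shown here.

```python
def spark_resources(worker_nodes=4, node_memory_gb=28, memory_fraction=0.75):
    """Derive driver/executor settings, reserving one worker node for the driver.

    node_memory_gb=28 is an assumption (21 GB is 75% of 28 GB).
    """
    if worker_nodes < 2:
        raise ValueError("need one node for the driver and at least one for an executor")
    # Each JVM (driver or executor) gets the same fraction of a node's memory.
    memory_gb = int(node_memory_gb * memory_fraction)
    return {
        "spark.driver.memory": f"{memory_gb}g",
        "spark.executor.memory": f"{memory_gb}g",
        # In yarn-cluster mode the driver occupies a worker node,
        # so the remaining nodes each host one executor.
        "spark.executor.instances": str(worker_nodes - 1),
    }


settings = spark_resources()
```

For a larger cluster, calling `spark_resources(worker_nodes=8)` would keep the same memory per JVM and raise the executor count to 7, under the same one-executor-per-node assumption.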

hdi-image3

Please refer to the Jupyter Notebook tutorial for more information on how to use Jupyter Notebooks on HDInsight.

The second step here is to create an H2O context. Since one default spark context is already configured in the Jupyter Notebook (called sc), in H2O, we just need to call

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

so H2O can recognize the default spark context.

After executing this line of code, H2O will print out the status, as well as the YARN application it is using.

hdi-image4

After this, you can use H2O APIs plus the Spark APIs to write your applications. To learn more about Sparkling Water APIs, refer to the H2O GitHub site here.

hdi-image5

This sentiment analysis example has a few steps to analyze the data:

  1. Load data to Spark and H2O frames
  2. Data munging using H2O API
    • Remove columns
    • Refine Time Column into Year/Month/Day/DayOfWeek/Hour columns
  3. Data munging using Spark API
    • Select columns Score, Month, Day, DayOfWeek, Summary
    • Define UDF to transform score (0..5) to binary positive/negative
    • Use TF-IDF to vectorize summary column
  4. Model building using H2O API
    • Use H2O Grid Search to tune hyper parameters
    • Select the best Deep Learning model
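To make the munging steps above more concrete, here is a minimal plain-Python sketch of three of them: splitting a timestamp into date parts, turning a score into a binary label, and TF-IDF weighting of the summary text. It is not the notebook's code, which uses the H2O and Spark APIs. The function names are hypothetical, the threshold of 3 for a "positive" score is an assumption, and the smoothed IDF form log((N + 1) / (df + 1)) is chosen to match the one Spark MLlib documents.

```python
import math
from collections import Counter
from datetime import datetime, timezone


def refine_time(epoch_seconds):
    """Split a Unix timestamp (UTC) into Year/Month/Day/DayOfWeek/Hour."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return {"Year": t.year, "Month": t.month, "Day": t.day,
            "DayOfWeek": t.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "Hour": t.hour}


def score_to_label(score, threshold=3):
    """Map a 0..5 score to a binary label (threshold is an assumption)."""
    return "negative" if score < threshold else "positive"


def tf_idf(documents):
    """Return one {term: weight} dict per document using raw TF and smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [{term: count * idf[term] for term, count in Counter(tokens).items()}
            for tokens in tokenized]


vectors = tf_idf(["good tea", "bad tea"])
```

A term that appears in every document, such as "tea" here, gets weight zero, which is why TF-IDF pushes the model toward the words that actually distinguish one summary from another.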

Please refer to the Jupyter Notebook for more details.

Step 3: use FLOW UI to monitor the progress and visualize the model

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

The H2O FLOW web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed while the Spark application/Notebook is running.

You can click the available link in the Jupyter Notebook, or you can directly access this URL: https://yourclustername-h2o.apps.azurehdinsight.net/flow/index.html

In this example, we will demonstrate its visualization capabilities. Simply click “Model > List Grid Search Results” (since we are trying to use Grid Search to tune hyper parameters)

hdi-image6

Then you can access the 4 grid search results:

hdi-image7

And you can view the details of each model. For example, you can visualize the ROC curve as below:

hdi-image8

In Jupyter Notebooks, you can also view the performance in text format:

hdi-image9

Summary
In this blog, we have walked you through the detailed steps to create your first H2O application on HDInsight for your machine learning applications. For more information on H2O, please visit the H2O site; for more information on HDInsight, please visit the HDInsight site.

This blog post is co-authored by Pablo Marin (@pablomarin), Solution Architect at Microsoft.

Sparkling Water on the Spark-Notebook

This is a guest post from our friends at Kensu.

In the space of Data Science development in enterprises, two outstanding scalable technologies are Spark and H2O. Spark is a generic distributed computing framework and H2O is a very performant scalable platform for AI.
Their complementarity is best exploited with the use of Sparkling Water. Sparkling Water is the solution to get the best of Spark – its elegant APIs, RDDs, multi-tenant Context and H2O’s speed, columnar-compression and fully-featured Machine Learning and Deep-Learning algorithms in an enterprise ready fashion.

Examples of Sparkling Water pipelines are readily available in the H2O GitHub repository, and we have revisited these examples using the Spark-Notebook.

The Spark-Notebook is an open source notebook (a web-based environment for code editing, execution, and data visualization), focused on Scala and Spark. The Spark-Notebook is part of the Adalog suite of Kensu.io, which addresses agility, maintainability and productivity for data science teams. Adalog offers data scientists a short work cycle for deploying their work to the business reality, and offers managers a set of data governance tools giving a consistent view of the impact of data activities on the market.

This new material allows diving into Sparkling Water in an interactive and dynamic way.

Working with Sparkling Water in the Spark-Notebook provides an ideal platform for agile big data/data science development. Most notably, this gives data scientists the power to:

  • Write rich documentation of their work alongside the code, thus improving the capacity to index knowledge
  • Experiment quickly through interactive execution of individual code cells and share the results of these experiments with colleagues
  • Visualize the data they are feeding H2O through an extensive list of widgets and automatic rendering of computation results

Most of the H2O/Sparkling water examples have been ported to the Spark-Notebook and are available in a github repository.

We are focussing here on the Chicago crime dataset example and looking at:

  • How to take advantage of both H2O and Spark-Notebook technologies,
  • How to install the Spark-Notebook,
  • How to use it to deploy H2O jobs on a spark cluster,
  • How to read, transform and join data with Spark,
  • How to render data on a geospatial map,
  • How to apply deep learning or Gradient Boosted Machine (GBM) models using Sparkling Water

Installing the Spark-Notebook:

Installation is very straightforward on a local machine. Follow the steps described in the Spark-Notebook documentation and in a few minutes, you will have it working. Please note that Sparkling Water currently works only with Scala 2.11 and Spark 2.0.2 and above.
For larger projects, you may also be interested to read the documentation on how to connect the notebook to an on-premise or cloud computing cluster.

The Sparkling Water notebooks repo should be cloned in the “notebooks” directory of your Spark-Notebook installation.

Integrating H2O with the Spark-Notebook:

In order to integrate Sparkling Water with the Spark-Notebook, we need to tell the notebook to load the Sparkling Water package and specify a custom Spark configuration, if required. Spark then automatically distributes the H2O libraries to each of your Spark executors. Declaring Sparkling Water dependencies pulls in some libraries transitively, so take care to avoid duplicates or multiple versions of the same dependency.
The notebook metadata defines custom dependencies (ai.h2o) and dependencies to not include (because they’re already available, i.e. spark, scala and jetty). The custom local repos allow us to define where dependencies are stored locally and thus avoid downloading these each time a notebook is started.

"customLocalRepo": "/tmp/spark-notebook",
"customDeps": [
  "ai.h2o % sparkling-water-core_2.11 % 2.0.2",
  "ai.h2o % sparkling-water-examples_2.11 % 2.0.2",
  "- org.apache.hadoop % hadoop-client %   _",
  "- org.apache.spark  % spark-core_2.11    %   _",
  "- org.apache.spark % spark-mllib_2.11 % _",
  "- org.apache.spark % spark-repl_2.11 % _",
  "- org.scala-lang    %     _         %   _",
  "- org.scoverage     %     _         %   _",
  "- org.eclipse.jetty.aggregate % jetty-servlet % _"
],
"customSparkConf": {
  "spark.ext.h2o.repl.enabled": "false"
},

With these dependencies set, we can start using Sparkling Water and initiate an H2O context from within the notebook.
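As a reading aid for the customDeps entries above, here is a small Python sketch that separates the dependencies to load from those to leave out. It is not part of the Spark-Notebook; the function name is hypothetical, and it assumes that each entry has the form "group % artifact % version", that a leading "-" marks a dependency to exclude, and that "_" acts as a wildcard.

```python
def split_custom_deps(entries):
    """Split 'group % artifact % version' strings into inclusions and exclusions.

    Assumption: a leading '-' marks an exclusion and '_' is a wildcard.
    """
    include, exclude = [], []
    for entry in entries:
        stripped = entry.lstrip()
        is_exclusion = stripped.startswith("-")
        coordinate = stripped.lstrip("-")
        # Whitespace around the '%' separators is purely cosmetic.
        group, artifact, version = (part.strip() for part in coordinate.split("%"))
        (exclude if is_exclusion else include).append((group, artifact, version))
    return include, exclude


include, exclude = split_custom_deps([
    "ai.h2o % sparkling-water-core_2.11 % 2.0.2",
    "- org.apache.spark  % spark-core_2.11    %   _",
    "- org.scala-lang    %     _         %   _",
])
```

Read this way, the metadata loads the two ai.h2o artifacts and leaves out the Hadoop, Spark, Scala, scoverage and Jetty libraries that would otherwise arrive transitively.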

Benchmark example – Chicago Crime Scenes:

As an example, we can revisit the Chicago Crime Sparkling Water demo. The Spark-Notebook we used for this benchmark can be seen in a read-only mode here.

Step 1: Three datasets are loaded as Spark data frames:

  • Chicago weather data : Min, Max and Mean temperature per day
  • Chicago Census data : Average poverty, unemployment, education level and gross income per Chicago Community Area
  • Chicago historical crime data : Crime description, date, location, community area, etc. Also contains a flag telling whether the criminal has been arrested or not.

The three tables are joined using Spark into one big table with location and date as keys. A view of the first entries of the table is generated by the notebook’s automatic rendering of tables (see the sample below).

spark_tables

Geospatial chart widgets are also available in the Spark-Notebook. For example, here are the first 100 crimes in the table:

geospatial

Step 2: We can transform the spark data frame into an H2O Frame and randomly split the H2O Frame into training and validation frames containing 80% and 20% of the rows, respectively. This is a memory to memory transformation, effectively copying and formatting data in the spark data frame into an equivalent representation in the H2O nodes (spawned by Sparkling Water into the spark executors).
We can verify that the frames are loaded into H2O by looking at the H2O Flow UI (available on port 54321 of your spark-notebook installation). We can access it by calling “openFlow” in a notebook cell.

h2oflow

Step 3: From the Spark-Notebook, we train two H2O machine learning models on the training H2O frame. For comparison, we are constructing a Deep Learning MLP model and a Gradient Boosting Machine (GBM) model. Both models are using all the data frame columns as features: time, weather, location, and neighborhood census data. Models are living in the H2O context and thus visible in the H2O flow UI. Sparkling Water functions allow us to access these from the SparkContext.

We compare the classification performance of the two models by looking at the area under the curve (AUC) on the validation dataset. The AUC measures the discrimination power of the model, that is the ability of the model to correctly classify crimes that lead to an arrest or not. The higher, the better.

The Deep Learning model leads to a 0.89 AUC while the GBM gets to 0.90 AUC. The two models are therefore quite comparable in terms of discrimination power.
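For readers who want to see what the AUC measures, here is a minimal Python sketch based on its pairwise interpretation: the AUC is the fraction of (arrest, no-arrest) pairs in which the model gives the higher score to the crime that led to an arrest, with ties counted as half. This is an illustration only, not how H2O computes the metric; the function name is hypothetical and the toy labels and scores are made up for the example.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5).

    labels: 1 for the positive class (e.g. arrest), 0 for the negative class.
    scores: predicted probability of the positive class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
mixed = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The double loop is quadratic in the number of rows, so it is only suitable for small samples.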

Flow2

Step 4: Finally, the trained model is used to measure the probability of arrest for two specific crimes:

  • A “narcotics” related crime on 02/08/2015 11:43:58 PM in a street of community area “46” in district 4 with FBI code 18.

    The probability of being arrested predicted by the deep learning model is 99.9% and by the GBM is 75.2%.

  • A “deceptive practice” related crime on 02/08/2015 11:00:39 PM in a residence of community area “14” in district 9 with FBI code 11.

    The probability of being arrested predicted by the deep learning model is 1.4% and by the GBM is 12%.

The Spark-Notebook allows for a quick computation and visualization of the results:

spark_notebook

Summary

Combining Spark and H2O within the Spark-Notebook is a very nice set-up for scalable data science. More examples are available in the online viewer. If you are interested in running them, install the Spark-Notebook and look in this repository. From that point, you are on track for enterprise-ready, interactive, scalable data science.

Loic Quertenmont,
Data Scientist @ Kensu.io

Stacked Ensembles and Word2Vec now available in H2O!

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles

sz42-6-wheels-lightened

H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking or “Super Learning.” This method currently supports regression and binary classification, and multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.

H2O has previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained. For new projects, however, we’d recommend using the native H2O version. Native support for stacking in the H2O backend brings support for ensembles to all the H2O APIs.

Creating ensembles of H2O models is now dead simple. You simply pass a list of existing H2O model ids to the stacked ensemble function and you are ready to go. This list of models can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performance. Thus, using H2O’s Random Grid Search to generate a collection of random models is a handy way of quickly generating a set of base models for the ensemble.

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!
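For intuition about what stacking does, here is a small self-contained Python sketch of the idea, independent of H2O. Two toy base learners are fit, their cross-validated predictions form the "level-one" data, and a metalearner picks the combination that best fits the held-out predictions. Everything here is an assumption made for illustration: the function names are hypothetical, the base learners are a mean predictor and a one-variable least-squares line, and the metalearner is a simple grid search over a convex weight rather than H2O's own metalearner.

```python
def fit_mean(xs, ys):
    """Base learner 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base learner 2: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def cross_val_predictions(fit, xs, ys, k=5):
    """Predict each row with a model trained on the other k-1 folds."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):  # rows held out in this fold
            preds[i] = model(xs[i])
    return preds


def fit_stack(xs, ys, k=5, steps=100):
    """Return (weight on the line model, ensemble predict function)."""
    cv_line = cross_val_predictions(fit_line, xs, ys, k)
    cv_mean = cross_val_predictions(fit_mean, xs, ys, k)

    def loss(w):  # squared error of the blended held-out predictions
        return sum((w * a + (1 - w) * b - y) ** 2
                   for a, b, y in zip(cv_line, cv_mean, ys))

    weight = min((i / steps for i in range(steps + 1)), key=loss)
    # Refit the base learners on all the data for the final ensemble.
    line_model, mean_model = fit_line(xs, ys), fit_mean(xs, ys)
    return weight, lambda x: weight * line_model(x) + (1 - weight) * mean_model(x)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
weight, predict = fit_stack(xs, ys)
```

Because the toy data is exactly linear, the metalearner puts all of its weight on the line model. Training the metalearner on held-out predictions rather than in-sample ones is what keeps it from simply rewarding the base model that overfits the most.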

Word2Vec

w2v

H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models that are used to produce word embeddings (a language modeling/feature engineering technique in natural language processing where words or phrases are mapped to vectors of real numbers). The word embeddings can subsequently be used in a machine learning model, for example, GBM. This allows users to apply current H2O algorithms to text-based data in a very efficient manner. An R example is available here.

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word’s context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)$$

where $k$ is the size of the training window.
In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are vector representations of $w$ as word and context respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is

$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$

where $V$ is the vocabulary size.
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be on the order of millions. To speed up training of Word2Vec, we used hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log(V))$.
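The two formulas above can be written out directly in Python to show where the cost proportional to $V$ comes from. This is a plain sketch of the full softmax, not H2O's implementation and not hierarchical softmax. The function and variable names are hypothetical, words are represented by integer ids, and the objective skips $j = 0$ and positions that fall outside the sequence, which is the usual convention and is left implicit in the formula.

```python
import math


def softmax_prob(i, j, u_vecs, v_vecs):
    """p(w_i | w_j) = exp(u_i . v_j) / sum_l exp(u_l . v_j)."""
    v_j = v_vecs[j]
    # One dot product per vocabulary word: this loop is the O(V) cost.
    scores = [sum(a * b for a, b in zip(u_l, v_j)) for u_l in u_vecs]
    top = max(scores)  # subtract the max for numerical stability
    denom = sum(math.exp(s - top) for s in scores)
    return math.exp(scores[i] - top) / denom


def average_log_likelihood(word_ids, k, u_vecs, v_vecs):
    """(1/T) * sum over t and over j in [-k, k] of log p(w_{t+j} | w_t)."""
    T = len(word_ids)
    total = 0.0
    for t in range(T):
        for j in range(-k, k + 1):
            if j == 0 or not 0 <= t + j < T:
                continue  # skip the word itself and out-of-range positions
            total += math.log(softmax_prob(word_ids[t + j], word_ids[t], u_vecs, v_vecs))
    return total / T


# Vocabulary of 3 words with 2-dimensional, all-zero vectors:
# every word is then equally likely, so each probability is 1/3.
zeros = [[0.0, 0.0] for _ in range(3)]
objective = average_log_likelihood([0, 1, 2], k=1, u_vecs=zeros, v_vecs=zeros)
```

Hierarchical softmax avoids the loop over the whole vocabulary by arranging the words as leaves of a binary tree and multiplying a handful of binary decisions along the path to the target word, which is where the $O(\log(V))$ cost comes from.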

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.

List of New Features:

PUBDEV-2058- Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635- Ability to Select Columns for PDP computation in Flow (With this enhancement, users will be able to select which features/columns to render Partial Dependence Plots from Flow. (R/Python supported already). Known issue PUBDEV-3782: when nbins < categorical levels, PDP won't compute. Please visit also this post.)
PUBDEV-3881- Add PCA Estimator documentation to Python API Docs
PUBDEV-3902- Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739- StackedEnsemble: put ensemble creation into the back end.

List of Improvements:

PUBDEV-3989- Decrease size of h2o.jar
PUBDEV-3257- Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741- StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857- Clean up the generated Python docs
PUBDEV-3895- Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912- Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933- Modify gen_R.py for Stacked Ensemble
PUBDEV-3972- Add Stacked Ensemble code examples to Python docstrings

List of Bugs:

PUBDEV-2464- Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111- R API's h2o.interaction() does not use destination_frame argument
PUBDEV-3694- Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742- StackedEnsemble should work for regression
PUBDEV-3865- h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo
PUBDEV-3883- Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894- Relational operators don't work properly with time columns.
PUBDEV-3966- java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835- Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965- Importing data in python returns error - TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:

PUBDEV-3336- h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740- REST: implement simple ensemble generation API
PUBDEV-3843- Modify R REST API to always return binary data
PUBDEV-3844- Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864- Import files by pattern
PUBDEV-3884- StackedEnsemble: Add to online documentation
PUBDEV-3940- Add Stacked Ensemble code examples to R docs

Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html