H2O Platform Extensibility

The latest H2O release, 3.10.5.1, introduced several new concepts to improve the extensibility and modularity of the H2O machine learning platform. This blog post clarifies the motivation, explains the design decisions we made, and demonstrates the overall approach for this release.

Motivation

The H2O Machine Learning platform was designed as a monolithic application. However, a growing H2O community and multiple new projects demanded that we revisit the architecture and make the development of independent H2O extensions easier.

Furthermore, we would like to allow easy integration of third party tools (e.g., XGBoost, TensorFlow) under a common H2O API.

Design

Conceptually, platform modularity and extensibility can be achieved in different ways:

  1. Compile-time code composition: A compile-time process assembles all necessary code modules into a resulting deployable application.
  2. Link-time composition: An application is composed at start time from the modules provided on the JVM classpath.
  3. Runtime composition: An application can be dynamically extended at runtime; new modules can be loaded and existing modules can be deactivated.

Approach (1) is the method adopted by older versions of H2O and their h2o.jar assembly process. In this case, all code is compiled and assembled into a single artifact. However, it has several major limitations. Mainly, it requires a predefined list of code components to put into the resulting artifact, and it does not allow developers and the community to create independent extensions.

On the other hand, the last approach (3) is fully dynamic and is adopted by tools like OSGi, Eclipse, or Chrome and brings the freedom of having a fully dynamic environment that users can modify. However, in the context of a machine learning platform, we believe it is not necessary.

Hence, we decided to adopt approach (2) in our architecture and provide link-time composition of modules.

With this approach, users specify the modules that they are going to use, and the specified modules are registered by H2O core via a JVM capability called Java Service Provider Interface (Java SPI).

Java SPI is a simple JVM service that allows you to register modules that implement a given interface (or extend an abstract class) and then list them at runtime. The modules need to be registered by a so-called service file located in the META-INF/services directory. The service file is named after the interface (or abstract class) and contains the names of the component implementations. The application can then query all available components (e.g., those on the classpath or available via a specified classloader) and use them internally via the implemented interface.
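The lookup described above can be sketched with the standard java.util.ServiceLoader class. This is a minimal illustration and not H2O code: the names SpiSketch, Greeter, and EnglishGreeter are hypothetical, and the service file is written to a temporary directory at runtime only to keep the example self-contained. In a real module the file would ship inside the jar under META-INF/services.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class SpiSketch {

  /** Hypothetical extension contract; stands in for an H2O extension type. */
  public interface Greeter {
    String name();
  }

  /** Hypothetical provider; it must be public and have a public no-arg constructor. */
  public static class EnglishGreeter implements Greeter {
    @Override
    public String name() {
      return "english";
    }
  }

  /** Writes a service file into a temp directory and discovers providers through it. */
  public static List<String> discover() throws IOException {
    Path root = Files.createTempDirectory("spi-sketch");
    Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
    // File name: binary name of the service type. Content: one provider class name per line.
    Files.write(services.resolve(Greeter.class.getName()),
        EnglishGreeter.class.getName().getBytes(StandardCharsets.UTF_8));

    List<String> found = new ArrayList<>();
    // The temp directory plays the role of an extension jar added to the classpath.
    try (URLClassLoader loader = new URLClassLoader(
        new URL[] {root.toUri().toURL()}, SpiSketch.class.getClassLoader())) {
      for (Greeter g : ServiceLoader.load(Greeter.class, loader)) {
        found.add(g.name());
      }
    }
    return found;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(discover());
  }
}
```

The host application only knows the Greeter type; which implementations it sees depends entirely on the service files visible to the classloader, which is the property that link-time composition relies on.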

From a design perspective, there are several parts of the H2O platform that can be made extensible:

  • H2O core
  • REST API
  • Parsers
  • Rapids
  • Persistent providers

In this blog post, we would like to focus only on the first two items; however, a similar approach could be or is already adopted for the remaining parts.

Regarding the first item on the list, H2O core extensibility is crucial for adopting new features, for example a new watchdog thread that shuts down H2O if a condition is satisfied, or a new public API layer like GRPC. The core modules are marked by the abstract class water.AbstractH2OExtension, which provides hooks into the H2O platform lifecycle.

The second extension point allows you to extend the provided REST API, which is typically necessary when a new algorithm is introduced and needs to be exposed via the REST API. In this case, the extension module needs to implement the interface water.api.RestApiExtension and register the implementation via the file META-INF/services/water.api.RestApiExtension.

Example

We are going to demonstrate extensibility on the XGBoost module, a new feature included in the latest version. XGBoost is a gradient boosting library distributed in a native, non-Java form. Our goal is to publish it via the H2O API and use it in the same way as the rest of the H2O algorithms. To achieve that, we need to:

  1. Extend the core of H2O with functionality that will load a binary version of XGBoost
  2. Wrap XGBoost into the H2O Java API
  3. Expose the Java API via REST API

To implement the first step, we are going to define a tiny subclass of water.AbstractH2OExtension, which will try to load the XGBoost native libraries. The core extension does nothing except signal the availability of XGBoost on the current platform (not all platforms are supported by the XGBoost native libraries):

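package hex.tree.xgboost;

import water.AbstractH2OExtension;

// Core extension that only reports whether XGBoost is usable on this platform.
public class XGBoostExtension extends AbstractH2OExtension {
  // Name under which the extension is registered; referenced by dependent extensions.
  public static final String NAME = "XGBoost";

  @Override
  public String getExtensionName() {
    return NAME;
  }

  @Override
  public boolean isEnabled() {
    try {
      // Try to load the native XGBoost library bundled with xgboost4j.
      ml.dmlc.xgboost4j.java.NativeLibLoader.load();
      return true;
    } catch (Exception e) {
      // Native library is not available for this platform.
      return false;
    }
  }
}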

Now, we need to register the extension via SPI. We create a new file under META-INF/services called water.AbstractH2OExtension with the following content:

hex.tree.xgboost.XGBoostExtension

We will not go into details of the second step, which will be described in another blog post, but we will directly implement the last step.

To expose an H2O-specific REST API for the XGBoost Java API, we need to implement the interface water.api.RestApiExtension. However, in this example we take a shortcut and reuse the existing infrastructure for registering an algorithm’s REST API, exposed via the class water.api.AlgoAbstractRegister:

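package hex.api.xgboost;

import java.util.Collections;
import java.util.List;

import hex.tree.xgboost.XGBoost;
import hex.tree.xgboost.XGBoostExtension;
import water.api.AlgoAbstractRegister;
import water.api.RestApiContext;
import water.api.SchemaServer;

public class RegisterRestApi extends AlgoAbstractRegister {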

  @Override
  public void registerEndPoints(RestApiContext context) {
    XGBoost xgBoostMB = new XGBoost(true);
    // Register XGBoost model builder REST API
    registerModelBuilder(context, xgBoostMB, SchemaServer.getStableVersion());
  }

  @Override
  public String getName() {
    return "XGBoost";
  }

  @Override
  public List<String> getRequiredCoreExtensions() {
    return Collections.singletonList(XGBoostExtension.NAME);
  }
}

And again, it is necessary to register the defined class with the SPI subsystem via the file META-INF/services/water.api.RestApiExtension:

hex.api.xgboost.RegisterRestApi
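The getRequiredCoreExtensions method above suggests that a REST API extension is only useful when the core extensions it names are enabled. The following is a minimal sketch of how a host could apply such a check, under that assumption. It does not reproduce H2O's actual registration logic: CoreExt, RestExt, and ExtensionGatingSketch are hypothetical stand-ins that mirror only the method names shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtensionGatingSketch {

  /** Hypothetical stand-in for a core extension (not the H2O class). */
  public interface CoreExt {
    String getExtensionName();
    boolean isEnabled();
  }

  /** Hypothetical stand-in for a REST API extension (not the H2O interface). */
  public interface RestExt {
    String getName();
    List<String> getRequiredCoreExtensions();
  }

  /** Returns the names of REST extensions whose required core extensions are all enabled. */
  public static List<String> activeRestExtensions(List<CoreExt> core, List<RestExt> rest) {
    Set<String> enabled = new HashSet<>();
    for (CoreExt c : core) {
      if (c.isEnabled()) {
        enabled.add(c.getExtensionName());
      }
    }
    List<String> active = new ArrayList<>();
    for (RestExt r : rest) {
      // An extension with no requirements is always active.
      if (enabled.containsAll(r.getRequiredCoreExtensions())) {
        active.add(r.getName());
      }
    }
    return active;
  }

  /** Test helper: builds a core extension with a fixed name and state. */
  public static CoreExt core(final String name, final boolean enabled) {
    return new CoreExt() {
      @Override public String getExtensionName() { return name; }
      @Override public boolean isEnabled() { return enabled; }
    };
  }

  /** Test helper: builds a REST extension with a fixed name and requirements. */
  public static RestExt rest(final String name, final String... required) {
    return new RestExt() {
      @Override public String getName() { return name; }
      @Override public List<String> getRequiredCoreExtensions() { return Arrays.asList(required); }
    };
  }

  public static void main(String[] args) {
    List<CoreExt> core = Arrays.asList(core("XGBoost", false), core("Watchdog", true));
    List<RestExt> rest = Arrays.asList(rest("XGBoost", "XGBoost"), rest("Algos"));
    System.out.println(activeRestExtensions(core, rest));
  }
}
```

In this sketch, a platform without XGBoost native libraries would simply not expose the XGBoost REST API, while extensions without requirements stay available.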

REST API registration requires one more step: registration of the schemas used (classes that are used by the REST API and implement water.api.Schema). This is an annoying step that is necessary right now, but we hope to remove it in the future. Registration of schemas is done in the same way as registration of extensions: it is necessary to list all schemas in the file META-INF/services/water.api.Schema:

hex.schemas.XGBoostModelV3
hex.schemas.XGBoostModelV3$XGBoostModelOutputV3
hex.schemas.XGBoostV3
hex.schemas.XGBoostV3$XGBoostParametersV3

From this point, the REST API definition published by the XGBoost model builder is visible to clients. We compile the code, bundle it with the H2O core code (or put it on the classpath), and run it:

java -cp h2o-ext-xgboost.jar:h2o.jar water.H2OApp

During startup, we should see boot messages that mention the loaded extensions (the XGBoost core extension and the REST API extension):

INFO: Flow dir: '/Users/michal/h2oflows'
INFO: Cloud of size 1 formed [/192.168.1.65:54321]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: XGBoost extension initialized
INFO: Watchdog extension initialized
INFO: Registered 2 core extensions in: 68ms
INFO: Registered H2O core extensions: [XGBoost, Watchdog]
INFO: Found XGBoost backend with library: xgboost4j
INFO: Registered: 160 REST APIs in: 310ms
INFO: Registered REST API extensions: [AutoML, XGBoost, Algos, Core V3, Core V4]
INFO: Registered: 230 schemas in 342ms
INFO: H2O started in 4932ms

The platform also publishes a list of available extensions via a capabilities REST endpoint. A client can get the complete list of capabilities via GET <ip:port>/3/Capabilities:

curl http://localhost:54321/3/Capabilities

Or get a list of core extensions (GET <ip:port>/3/Capabilities/Core):

curl http://localhost:54321/3/Capabilities/Core

Or get a list of REST API extensions (GET <ip:port>/3/Capabilities/API):

curl http://localhost:54321/3/Capabilities/API

Note: We do not modularize the R/Python/Flow clients. The client is responsible for configuring itself based on information provided by the backend (e.g., via the Capabilities REST endpoint) and should fail gracefully if the user invokes an operation that the backend does not provide.
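A client-side probe of the capabilities endpoints could look like the sketch below. It uses only the JDK and the endpoint paths named in this post. The class name CapabilitiesClientSketch and the timeouts are assumptions, and the response body is returned as raw text because its structure is not described here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class CapabilitiesClientSketch {

  /** Builds the endpoint URL; kind is "" (all capabilities), "Core", or "API". */
  public static String endpoint(String hostPort, String kind) {
    String base = "http://" + hostPort + "/3/Capabilities";
    return kind.isEmpty() ? base : base + "/" + kind;
  }

  /** Fetches the raw response body; throws IOException on connection errors or non-200 status. */
  public static String fetch(String url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(2000); // assumed timeout values
    conn.setReadTimeout(5000);
    try {
      int status = conn.getResponseCode();
      if (status != 200) {
        throw new IOException("HTTP " + status + " from " + url);
      }
      StringBuilder body = new StringBuilder();
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = in.readLine()) != null) {
          body.append(line).append('\n');
        }
      }
      return body.toString();
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) {
    String url = endpoint("localhost:54321", "Core");
    try {
      System.out.println(fetch(url));
    } catch (IOException e) {
      // Fail gracefully when the backend is missing or does not provide the endpoint.
      System.err.println("Capabilities unavailable at " + url + ": " + e.getMessage());
    }
  }
}
```

A client would typically run such a probe once after connecting and disable the operations whose extensions are not reported.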

For more details about the change, please consult the following: