XGBoost in H2O Machine Learning Platform


The new H2O release 3.10.5.1 brings a shiny new feature – integration of the powerful XGBoost library into the H2O Machine Learning Platform!

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

By integrating XGBoost into the H2O Machine Learning Platform, we not only enrich the family of provided algorithms with one of the most powerful machine learning algorithms available, but we also expose it with all the nice features of H2O – Python and R APIs, the Flow UI, real-time training progress, and MOJO support.

Example

Let’s quickly try to run XGBoost on the HIGGS dataset from Python. The first step is to get the latest H2O and install the Python library. Please follow the instructions on the H2O download page.

The next step is to download the HIGGS training and validation data. We can use sample datasets stored in S3:

wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_train_imbalance_100k.csv
wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_test_imbalance_100k.csv
# Or use full data: wget https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/higgs_head_2M.csv

Now, it is time to start your favorite Python environment and build some XGBoost models.

The first step involves starting H2O as a single-node cluster:

import h2o
h2o.init()

In the next step, we import and prepare data via the H2O API:

train_path = 'higgs_train_imbalance_100k.csv'
test_path = 'higgs_test_imbalance_100k.csv'

df_train = h2o.import_file(train_path)
df_valid = h2o.import_file(test_path)

# Transform the first column (the response) into a categorical column
df_train[0] = df_train[0].asfactor()
df_valid[0] = df_valid[0].asfactor()

After data preparation, it is time to build an XGBoost model. Let’s try to train 100 trees with a maximum depth of 10:

param = {
      "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}

from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)

At this point, we can use the trained model like any other H2O model – for example, to generate predictions:

prediction = model.predict(df_valid)[:, 2]  # keep the predicted probability of class 1
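
We can also look at standard validation metrics, such as AUC, through H2O's model metrics API (a short sketch of further evaluation):

perf = model.model_performance(df_valid)  # metrics computed on the validation frame
print(perf.auc())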

Or we can open the H2O Flow UI and explore the model properties in a nice, user-friendly way:

Or rebuild the model with different training parameters:
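
For instance, from Python we might retrain with a deeper tree and a smaller learning rate (the values below are purely illustrative):

param_alt = dict(param, max_depth=15, learn_rate=0.01)
model_alt = H2OXGBoostEstimator(**param_alt)
model_alt.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train, validation_frame = df_valid)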

Technical Details

The integration of XGBoost into the H2O Machine Learning Platform utilizes the JNI interface of XGBoost and the corresponding native libraries. H2O wraps all JNI calls and exposes them as regular H2O model and model builder APIs.

The implementation itself is based on two separate modules that enrich the core H2O platform.

The first module, h2o-genmodel-ext-xgboost, extends the h2o-genmodel module and registers an XGBoost-specific MOJO. The module also contains all necessary XGBoost binary libraries. Right now, the module provides libraries for OS X and Linux; Windows support is coming soon.

The module can contain multiple libraries for each platform to support different configurations (e.g., with/without GPU/OMP). H2O always tries to load the most powerful one (currently a library with GPU and OMP support). If that fails, the loader tries the next one in the loader chain. For each platform, we always provide an XGBoost library with a minimal configuration (single CPU only) that serves as a fallback in case none of the other libraries can be loaded.

The second module, h2o-ext-xgboost, contains the actual XGBoost model and model builder code, which communicates with the native XGBoost libraries via the JNI API. The module also provides all necessary REST API definitions to expose the XGBoost model builder to clients.

Note: To learn more about H2O modular architecture, please review our H2O Platform Extensibility blog post.

Limitations

There are several technical limitations of the current implementation that we are trying to resolve; however, it is necessary to mention them. In general, if XGBoost cannot be initialized for any reason (e.g., an unsupported platform), then the algorithm is not exposed via the REST API and is not available to clients. Clients can verify the availability of XGBoost by using the corresponding client API call. For example, in Python:

is_xgboost_available = H2OXGBoostEstimator.available()
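
For example, a client script could use this check to fall back to another H2O algorithm when XGBoost is unavailable (a minimal sketch; falling back to H2O's own GBM here is purely illustrative and not part of the H2O API):

from h2o.estimators import H2OXGBoostEstimator, H2OGradientBoostingEstimator

# Prefer XGBoost when the backend provides it, otherwise use H2O's built-in GBM
if H2OXGBoostEstimator.available():
    model = H2OXGBoostEstimator(ntrees=100, max_depth=10, seed=4241)
else:
    model = H2OGradientBoostingEstimator(ntrees=100, max_depth=10, seed=4241)
model.train(x = list(range(1, df_train.shape[1])), y = 0, training_frame = df_train)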

The list of limitations includes:

  1. Right now, XGBoost is initialized only for single-node H2O clusters; however, multi-node XGBoost support is coming soon.

  2. The list of supported platforms includes:

     | Platform | Minimal XGBoost | OMP | GPU | Compilation OS        |
     |----------|-----------------|-----|-----|-----------------------|
     | Linux    | yes             | yes | yes | Ubuntu 14.04, g++ 4.7 |
     | OS X     | yes             | no  | no  | OS X 10.11            |
     | Windows  | no              | no  | no  | NA                    |

     Note: The minimal XGBoost configuration includes support for a single CPU only.

  3. Furthermore, because we are using native XGBoost libraries that depend on OS/platform libraries, it is possible that on older operating systems, XGBoost will not be able to find all necessary binary dependencies, and will not be initialized and available.

  4. XGBoost GPU libraries are compiled against CUDA 8, which is a necessary runtime requirement in order to utilize XGBoost GPU support.

Please give H2O XGBoost a chance, try it, and let us know about your experience or suggest improvements via h2ostream!

H2O Platform Extensibility


The latest H2O release, 3.10.5.1, introduced several new concepts to improve the extensibility and modularity of the H2O Machine Learning Platform. This blog post will clarify the motivation, explain the design decisions we made, and demonstrate the overall approach in this release.

Motivation

The H2O Machine Learning Platform was originally designed as a monolithic application. However, a growing H2O community, along with multiple new projects, demanded that we revisit the architecture and make the development of independent H2O extensions easier.

Furthermore, we would like to allow easy integration of third party tools (e.g., XGBoost, TensorFlow) under a common H2O API.

Design

Conceptually, platform modularity and extensibility can be achieved in different ways:

  1. Compile time code composition: A compile time process assembles all necessary code modules together into a resulting deployable application.
  2. Link time composition: An application is composed at start time based on modules provided on the JVM classpath.
  3. Runtime composition: An application can be dynamically extended at runtime, new modules can be loaded, or existing modules can be deactivated.

Approach (1) represents the method adopted by older versions of H2O and its h2o.jar assembly process. In this case, all code is compiled and assembled into a single artifact. However, it has several major limitations: mainly, it needs a predefined list of code components to put into the resulting artifact, and it does not allow developers and the community to create independent extensions.

On the other hand, the last approach (3) is adopted by tools like OSGi, Eclipse, or Chrome and brings the freedom of a fully dynamic environment that users can modify at runtime. However, in the context of a machine learning platform, we believe this is not necessary.

Hence, we decided to adopt the second approach (2) in our architecture and provide link time composition of modules.

With this approach, users specify the modules that they are going to use, and the specified modules are registered by H2O core via a JVM capability called Java Service Provider Interface (Java SPI).

Java SPI is a simple JVM service that allows you to register modules implementing a given interface (or extending an abstract class) and then list them at runtime. The modules need to be registered via a so-called service file located in the META-INF/services directory. The service file contains the name of the component implementation. The application can then query all available components (e.g., those given on the classpath or available via a specified classloader) and use them internally via the implemented interface.

From a design perspective, there are several locations in the H2O platform worth making extensible:

  • H2O core
  • REST API
  • Parsers
  • Rapids
  • Persistent providers

In this blog post, we would like to focus only on the first two items; however, a similar approach either could be or already has been adopted for the remaining parts.

Regarding the first item from the list, H2O core extensibility is crucial for adopting new features – for example, to introduce a new watchdog thread that shuts down H2O if a condition is satisfied, or a new public API layer like GRPC. The core modules are marked by the interface water.AbstractH2OExtension, which provides hooks into the H2O platform lifecycle.

The second extension point allows you to extend the provided REST API, which is typically necessary when a new algorithm is introduced and needs to be exposed via the REST API. In this case, the extension module needs to implement the interface water.api.RestApiExtension and register the implementation via the file META-INF/services/water.api.RestApiExtension.

Example

We are going to show extensibility on the XGBoost module – a new feature included in the latest version. XGBoost is a gradient boosting library distributed in a native, non-Java form. Our goal is to publish it via the H2O API and use it in the same way as the rest of the H2O algorithms. To realize that, we need to:

  1. Extend the core of H2O with functionality that will load a binary version of XGBoost
  2. Wrap XGBoost into the H2O Java API
  3. Expose the Java API via REST API

To implement the first step, we are going to define a tiny implementation of water.AbstractH2OExtension, which will try to load the XGBoost native libraries. The core extension does nothing except signal the availability of XGBoost on the current platform (i.e., not all platforms are supported by the XGBoost native libraries):

package hex.tree.xgboost;

import water.AbstractH2OExtension;

public class XGBoostExtension extends AbstractH2OExtension {
  public static String NAME = "XGBoost";

  @Override
  public String getExtensionName() {
    return NAME;
  }

  @Override
  public boolean isEnabled() {
    try {
        ml.dmlc.xgboost4j.java.NativeLibLoader.load();
        return true;
    } catch (Exception e) {
        return false;
    }
  }
}

Now, we need to register the extension via SPI. We create a new file under META-INF/services called water.AbstractH2OExtension with the following content:

hex.tree.xgboost.XGBoostExtension

We will not go into the details of the second step, which will be described in another blog post, but will move directly to the last step.

To expose an H2O-specific REST API for the XGBoost Java API, we need to implement the interface water.api.RestApiExtension. However, in this example
we take a shortcut and reuse the existing infrastructure for registering an algorithm’s REST API, exposed via the class water.api.AlgoAbstractRegister:

package hex.api.xgboost;

import java.util.Collections;
import java.util.List;

import hex.tree.xgboost.XGBoost;
import hex.tree.xgboost.XGBoostExtension;
import water.api.AlgoAbstractRegister;
import water.api.RestApiContext;
import water.api.SchemaServer;

public class RegisterRestApi extends AlgoAbstractRegister {

  @Override
  public void registerEndPoints(RestApiContext context) {
    XGBoost xgBoostMB = new XGBoost(true);
    // Register XGBoost model builder REST API
    registerModelBuilder(context, xgBoostMB, SchemaServer.getStableVersion());
  }

  @Override
  public String getName() {
    return "XGBoost";
  }

  @Override
  public List<String> getRequiredCoreExtensions() {
    return Collections.singletonList(XGBoostExtension.NAME);
  }
}

And again, it is necessary to register the defined class with the SPI subsystem via the file META-INF/services/water.api.RestApiExtension:

hex.api.xgboost.RegisterRestApi

REST API registration requires one more step that involves registration of the schemas used (classes that are used by the REST API and implement water.api.Schema). This is an annoying step that is necessary right now, but we hope to remove it in the future. Registration of schemas is done in the same way as registration of extensions – it is necessary to list all schemas in the file META-INF/services/water.api.Schema:

hex.schemas.XGBoostModelV3
hex.schemas.XGBoostModelV3$XGBoostModelOutputV3
hex.schemas.XGBoostV3
hex.schemas.XGBoostV3$XGBoostParametersV3

From this point, the REST API definition published by the XGBoost model builder is visible to clients. We compile the code, bundle it with the H2O core code (or put it on the classpath), and run it:

java -cp h2o-ext-xgboost.jar:h2o.jar water.H2OApp

During startup, we should see a boot message that mentions the loaded extensions (the XGBoost core extension and REST API extension):

INFO: Flow dir: '/Users/michal/h2oflows'
INFO: Cloud of size 1 formed [/192.168.1.65:54321]
INFO: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO: XGBoost extension initialized
INFO: Watchdog extension initialized
INFO: Registered 2 core extensions in: 68ms
INFO: Registered H2O core extensions: [XGBoost, Watchdog]
INFO: Found XGBoost backend with library: xgboost4j
INFO: Registered: 160 REST APIs in: 310ms
INFO: Registered REST API extensions: [AutoML, XGBoost, Algos, Core V3, Core V4]
INFO: Registered: 230 schemas in 342ms
INFO: H2O started in 4932ms

The platform also publishes a list of available extensions via a capabilities REST end-point. A client can get the complete list of capabilities via GET <ip:port>/3/Capabilities:

curl http://localhost:54321/3/Capabilities

Or get a list of core extensions (GET <ip:port>/3/Capabilities/Core):

curl http://localhost:54321/3/Capabilities/Core

Or get a list of REST API extensions (GET <ip:port>/3/Capabilities/API):

curl http://localhost:54321/3/Capabilities/API

Note: We do not modularize the R/Python/Flow clients. The client is responsible for configuring itself based on information provided by the backend (e.g., via the Capabilities REST end-point) and for failing gracefully if the user invokes an operation that the backend does not provide.
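
For illustration, a Python client could perform a crude capability check before enabling XGBoost-specific functionality (a sketch using the Python requests library; it simply looks for the extension name in the raw response body rather than parsing the full JSON schema):

import requests

# Ask the backend which REST API extensions it registered
resp = requests.get("http://localhost:54321/3/Capabilities/API")
xgboost_backend_available = resp.ok and "XGBoost" in resp.text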

For more details about the change, please consult the following:

Connecting to Spark & Sparkling Water from R & RStudio

Sparkling Water offers best-of-breed machine learning for Spark users. Sparkling Water brings all of H2O’s advanced algorithms and capabilities to Spark. This means that you can continue to use H2O from RStudio or any other IDE of your choice. This post will walk you through the steps to get up and running with Sparkling Water from plain R or RStudio.

It works the same way as regular H2O. You just need to call h2o.init() from R with the right parameters, i.e., the IP and port where H2O is running.

For example, we start the Sparkling Shell (bin/sparkling-shell) and create an H2OContext:

[Screenshot: creating an H2OContext in the Sparkling Shell]

Now the H2OContext is running and H2O’s REST API is exposed at 172.162.223:54321.

So we can open RStudio and call h2o.init() (make sure you have the right R H2O package installed):

[Screenshot: calling h2o.init() from RStudio]

Let’s now create a Spark DataFrame, publish it as an H2O Frame, and access it from R.

This is how you achieve that in sparkling-shell:
val df = sc.parallelize(1 to 100).toDF // creates Spark DataFrame
val hf = h2oContext.asH2OFrame(df) // publishes DataFrame as H2O's Frame

[Screenshot: Sparkling Shell output after publishing the frame]

You can see that the name of the published frame is frame_rdd_6. Now let us go to RStudio and list all the available frames via the h2o.ls() function:

Alternatively, you could also name the frame during the transformation from Spark to H2O as shown below:

val hf = h2oContext.asH2OFrame(df, "simple.frame")

[Screenshot: h2o.ls() output in RStudio]

We can fetch the frame as well, or invoke an R function on it:
[Screenshot: fetching the frame and applying an R function in RStudio]

Keep hacking!

Databricks and H2O Make it Rain with Sparkling Water

This blog post was first published on the Databricks blog.

Databricks provides a cloud-based integrated workspace on top of Apache Spark for developers and data scientists. H2O.ai has been an early adopter of Apache Spark and has developed Sparkling Water to seamlessly integrate H2O.ai’s machine learning library on top of Spark.

In this blog, we will demonstrate an integration between the Databricks platform and H2O.ai’s Sparkling Water that provides Databricks users with an additional set of machine learning libraries. The integration allows data scientists to utilize Sparkling Water with Spark in a notebook environment more easily, allowing them to seamlessly combine Spark with H2O and get the best of both worlds.

Let’s begin by preparing a Databricks environment to develop our spam predictor:

The first step is to log into your Databricks account and create a new library containing Sparkling Water. You can use the Maven coordinates of the Sparkling Water package, for example: ai.h2o:sparkling-water-examples_2.10:1.5.6 (this version works with Spark 1.5).

[Screenshot: creating the Sparkling Water library in Databricks]

The next step is to create a new cluster to run the example:

[Screenshot: creating a new cluster in Databricks]

For this version of the Sparkling Water library we will use Spark 1.5. The name of the created cluster is “HamOrSpamCluster” – keep it handy as we will need it later.

The next step is to upload the data. You can use the table import feature to upload the smsData.txt file:

[Screenshot: uploading smsData.txt via table import]

Now the environment is ready and you can create a Databricks notebook; connect it to “HamOrSpamCluster” and start building a predictive model!

The goal of the application is to write a spam detector that uses a trained model to categorize incoming messages.

First look at the data. It contains raw text messages that are labeled as either spam or ham.
For example:

spam +123 Congratulations – in this week’s competition draw u have won the £1450 prize to claim just call 09050002311 b4280703. T&Cs/stop SMS 08718727868. Over 18 only 150
ham Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,6

We need to transform these messages into vectors of numbers and then train a binomial model to predict whether a text message is SPAM or HAM. For the transformation of a message into a vector of numbers, we will use Spark MLlib string tokenization and word-to-vector transformers. We are going to split messages into tokens and use the TF-IDF (term frequency-inverse document frequency) technique to represent the importance of words inside the training data set:

// Representation of a training message
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

case class SMS(target: String, fv: Vector)

// Split messages into tokens, dropping ignored words and characters
def tokenize(data: RDD[String]): RDD[Seq[String]] = {
  val ignoredWords = Seq("the", "a", "", "in", "on", "at", "as", "not", "for")
  val ignoredChars = Seq(',', ':', ';', '/', '<', '>', '"', '.', '(', ')', '?', '-', '\'', '!', '0', '1')

  val texts = data.map(r => {
    var smsText = r.toLowerCase
    for (c <- ignoredChars) {
      smsText = smsText.replace(c, ' ')
    }

    val words = smsText.split(" ").filter(w => !ignoredWords.contains(w) && w.length > 2).distinct

    words.toSeq
  })
  texts
}
import org.apache.spark.mllib.feature._

def buildIDFModel(tokens: RDD[Seq[String]],
                  minDocFreq: Int = 4,
                  hashSpaceSize: Int = 1 << 10): (HashingTF, IDFModel, RDD[Vector]) = {
  // Hash strings into the given space
  val hashingTF = new HashingTF(hashSpaceSize)
  val tf = hashingTF.transform(tokens)
  // Build term frequency-inverse document frequency
  val idfModel = new IDF(minDocFreq = minDocFreq).fit(tf)
  val expandedText = idfModel.transform(tf)
  (hashingTF, idfModel, expandedText)
}

The resulting table will contain the following lines:

spam 0, 0, 0.31, 0.12, ….
ham 0.67, 0, 0, 0, 0, 0.003, 0, 0.1

After this we are free to experiment with different binary classification algorithms in H2O.

To start using H2O, we need to initialize the H2O service by creating an H2OContext:

// Create SQL support
import org.apache.spark.sql._
implicit val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Start H2O services
import org.apache.spark.h2o._
@transient val h2oContext = new H2OContext(sc).start()

H2OContext represents H2O running on top of a Spark cluster. You should see the following output:

[Screenshot: H2OContext status output]

For this demonstration, we will leverage the H2O Deep Learning method:

// Define function which builds a DL model
import org.apache.spark.h2o._
import water.Key
import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningParameters
import _root_.hex.deeplearning.DeepLearningModel

def buildDLModel(train: Frame, valid: Frame,
                 epochs: Int = 10, l1: Double = 0.001, l2: Double = 0.0,
                 hidden: Array[Int] = Array[Int](200, 200))
                (implicit h2oContext: H2OContext): DeepLearningModel = {
  import h2oContext._

  // Build a model
  val dlParams = new DeepLearningParameters()
  dlParams._model_id = Key.make("dlModel.hex")
  dlParams._train = train
  dlParams._valid = valid
  dlParams._response_column = 'target
  dlParams._epochs = epochs
  dlParams._l1 = l1
  dlParams._l2 = l2
  dlParams._hidden = hidden

  // Create a job
  val dl = new DeepLearning(dlParams)
  val dlModel = dl.trainModel.get

  // Compute metrics on both datasets
  dlModel.score(train).delete()
  dlModel.score(valid).delete()

  dlModel
}

Here is the final application:

// Build the application

import org.apache.spark.rdd.RDD
import org.apache.spark.examples.h2o.DemoUtils._
import scala.io.Source

// load both columns from the table
val data = sqlContext.sql("SELECT * FROM smsData")
// Extract response spam or ham
val hamSpam = data.map( r => r(0).toString)
val message = data.map( r => r(1).toString)
// Tokenize message content
val tokens = tokenize(message)
// Build IDF model
var (hashingTF, idfModel, tfidf) = buildIDFModel(tokens)

// Merge response with extracted vectors
val resultRDD: DataFrame = hamSpam.zip(tfidf).map(v => SMS(v._1, v._2)).toDF

// Publish Spark DataFrame as H2OFrame
// This H2OFrame has to be transient because we do not want it to be serialized:
// when calling, for example, sc.parallelize(...), the object we are trying to
// parallelize captures all variables in its surrounding scope, apart from those
// marked as @transient.
@transient val table = h2oContext.asH2OFrame(resultRDD)
println(sc.parallelize(Array(1, 2)))
// Transform target column into categorical
table.replace(table.find("target"), table.vec("target").toCategoricalVec()).remove()
table.update(null)

// Split table into train and validation frames
val keys = Array[String]("train.hex", "valid.hex")
val ratios = Array[Double](0.8)
@transient val frs = split(table, keys, ratios)
@transient val train = frs(0)
@transient val valid = frs(1)
table.delete()

// Build a model
@transient val dlModel = buildDLModel(train, valid)(h2oContext)

And voilà, we have a Deep Learning model ready to detect spam!

At this point, you can explore the quality of the model:

// Collect model metrics and evaluate model quality
import water.app.ModelMetricsSupport
val validMetrics = ModelMetricsSupport.binomialMM(dlModel, valid)
println(validMetrics.auc._auc)


You can also use the H2O Flow UI by clicking on the URL provided when you instantiated the H2O Context.

[Screenshot: H2O Flow UI]

At this point we have everything ready to create a spam detector:

// Create a spam detector - a method which will return SPAM or HAM for a given text message
import water.DKV._

// Spam detector
def isSpam(msg: String,
           modelId: String,
           hashingTF: HashingTF,
           idfModel: IDFModel,
           h2oContext: H2OContext,
           hamThreshold: Double = 0.5): String = {
  val dlModel: DeepLearningModel = water.DKV.getGet(modelId)
  val msgRdd = sc.parallelize(Seq(msg))
  val msgVector: DataFrame = idfModel.transform(
    hashingTF.transform(
      tokenize(msgRdd))).map(v => SMS("?", v)).toDF
  val msgTable: H2OFrame = h2oContext.asH2OFrame(msgVector)
  msgTable.remove(0) // remove first column
  val prediction = dlModel.score(msgTable)
  // println(prediction)
  if (prediction.vecs()(1).at(0) < hamThreshold) "SPAM DETECTED!" else "HAM"
}

The method uses the models we built to transform the incoming text message and provide a prediction – SPAM or HAM. For example:

[Screenshot: example isSpam calls and their predictions]

We’ve shown a fast and easy way to build a spam detector with Databricks and Sparkling Water. To try this out for yourself, register for a free 14-day trial of Databricks and check out the Sparkling Water example in the Databricks Guide.