This blog post also explains the solution to a Google Stream question we received.
Note: K-fold cross-validation will be added to H2O-3 as an argument soon.
This is a terse guide to building K-fold cross-validated models with H2O using the R interface. There's not very much R code needed to get up and running, but it's by no means the one-magic-button method either. This guide is intended for the more “rustic” data scientist who likes to get their hands a bit dirty and build out their own tools.
In about 30 lines of R you'll be able to build the folds, the models, and the predictions! Here's the code in all of its glory:
“Rustic” K-Fold Cross-Validation Code
h2o.kfold <- function(k, training_frame, X, Y, algo.fun, predict.fun, poll=FALSE) {
  # assign every row a fold ID in 1..k
  folds <- 1 + as.numeric(cut(h2o.runif(training_frame), seq(0, 1, 1/k), include.lowest=TRUE))
  print(dim(folds))
  # launch models
  model.futures <- NULL
  for (i in 1L:k) {
    train <- training_frame[folds != i, ]
    if (is.null(model.futures)) model.futures <- list(algo.fun(train, X, Y))
    else model.futures <- c(model.futures, list(algo.fun(train, X, Y)))
  }
  models <- model.futures
  if (poll) {
    for (i in 1L:length(models)) {
      models[[i]] <- h2o.getFutureModel(models[[i]])
    }
  }
  # perform predictions on the holdout data
  preds <- NULL
  for (i in 1L:k) {
    valid <- training_frame[folds == i, ]
    p <- predict.fun(models[[i]], valid)
    if (is.null(preds)) preds <- p
    else preds <- h2o.rbind(preds, p)
  }
  # return the results
  list(models=models, predictions=preds)
}
tl;dr: You can start using this right away. Here are three examples:
Example 1: 5-fold GBM
# 5-fold GBM:
h2o_gbm <- function(training_frame, X, Y) {
  h2o.gbm(x=X,
          y=Y,
          training_frame=training_frame,
          ntree=1,
          max_depth=1,
          learn_rate=0.01,
          future=TRUE) # future = TRUE launches model builds in parallel, careful!
}
kf.gbm <- h2o.kfold(5, fr, X, Y, h2o_gbm, h2o.predict, TRUE) # poll future models
Example 2: 10-fold Deeplearning
# 10-fold Deeplearning:
h2o_dl <- function(training_frame, X, Y) {
  h2o.deeplearning(x=X,
                   y=Y,
                   training_frame=training_frame,
                   hidden=c(200,200,200),
                   activation="RectifierWithDropout",
                   input_dropout_ratio=0.3,
                   hidden_dropout_ratios=c(0.5,0.5,0.5),
                   l1=1e-4) # no future since each DL has high Duty Cycle
}
kf.dl <- h2o.kfold(10, fr, X, Y, h2o_dl, h2o.predict, FALSE) # no future models to poll!
Example 3: 3-fold 1-Many Random Forest
# 1-many binomial models with 3-fold cross-validation:
rf.one_v_many.futures <- function(training_frame, X, Y) {
  keys.to.clean <- NULL
  nclass <- length(h2o.levels(training_frame[,Y]))
  # launch one binomial model per class (class IDs are 0-based)
  model.futures <- lapply(0:(nclass-1), function(CLASS) {
    tr <- h2o.cbind(training_frame, as.factor(as.numeric(training_frame[,Y])==CLASS))
    keys.to.clean <<- c(keys.to.clean, tr@frame_id)
    h2o.randomForest(x=X,
                     y=ncol(tr),
                     training_frame=tr,
                     ntree=50,
                     max_depth=20,
                     future=TRUE)
  })
  # poll the models
  models <- lapply(model.futures, function(MODEL) h2o.getFutureModel(MODEL))
  # some housekeeping
  h2o.rm(keys.to.clean)
  # return the models
  models
}
kf.rf <- h2o.kfold(3, fr, X, Y, rf.one_v_many.futures, ensemble.predict) # ensemble.predict is below
Diving In
Let’s step through what the h2o.kfold method does.
Admittedly the API here is clunky, but it will certainly do the job. I'll leave API munging as an exercise for the reader!
Briefly, the parameters are:

k: the number of folds
training_frame: the dataset to do machine learning on
X: predictor variables
Y: response variable
algo.fun: a fully specified algorithm to perform k-fold cross-validation on
predict.fun: a predict method
poll: if TRUE, then it will attempt to poll future models
In general, predict.fun should be the vanilla h2o.predict method (although more exotic methods are permissible, as hinted at by Example 3 above).
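To make the roles of algo.fun and predict.fun concrete, here is a minimal base-R analog of the same pattern that runs without an H2O cluster. Everything in it (kfold.local, lm.fun, lm.predict, and the use of the built-in mtcars data) is hypothetical and only for illustration; none of it is part of H2O. It also swaps in a balanced, shuffled fold assignment rather than the h2o.runif approach, simply so that no fold ends up empty on a 32-row dataset.

```r
# Base-R analog of the h2o.kfold pattern (illustrative only, no H2O required).
# kfold.local, lm.fun and lm.predict are hypothetical names.
kfold.local <- function(k, data, X, Y, algo.fun, predict.fun) {
  # balanced fold IDs in 1..k, shuffled (a swapped-in technique, not h2o.runif + cut)
  folds <- sample(rep_len(1:k, nrow(data)))
  # one model per fold, trained on every row outside that fold
  models <- lapply(1:k, function(i) algo.fun(data[folds != i, , drop = FALSE], X, Y))
  # predict each held-out fold with the model that never saw it
  preds <- lapply(1:k, function(i) predict.fun(models[[i]], data[folds == i, , drop = FALSE]))
  list(models = models, predictions = unlist(preds), folds = folds)
}

# a "fully specified algorithm": all tuning choices live inside the wrapper
lm.fun <- function(training_frame, X, Y) {
  lm(reformulate(X, response = Y), data = training_frame)
}
lm.predict <- function(model, valid) predict(model, newdata = valid)

set.seed(42)
kf.lm <- kfold.local(5, mtcars, c("wt", "hp"), "mpg", lm.fun, lm.predict)
length(kf.lm$models)       # one model per fold
length(kf.lm$predictions)  # one holdout prediction per row of mtcars
```

As with h2o.kfold, the predictions come back grouped by fold rather than in the original row order, so keep the fold IDs around if you need to line predictions up with the input rows.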
How folds are built:
Many of our R examples make use of h2o.runif to split a dataset into (train, valid, test) tuples:
# some existing dataset
r <- h2o.runif(fr) # builds a vector with one draw from U(0,1) per row of fr
train <- fr[r < 0.7, ]
valid <- fr[(0.7 <= r) & (r < 0.8), ] # R has no chained comparisons, so combine with &
test <- fr[r >= 0.8, ]
We can apply the same thinking to assign fold IDs to each row of our input training data. This is exactly what the first line of h2o.kfold does:
folds <- 1 + as.numeric(cut(h2o.runif(training_frame), seq(0, 1, 1/k), include.lowest=TRUE))
This line performs 3 actions:
1. First it builds a vector filled with uniformly random numbers in [0,1).
2. Next the (extremely useful) `cut` method assigns each random value one of k factor levels.
3. Finally, to get the factor levels as integral identifiers from 1, ..., k, we add 1 after coercing the column to numeric (adding 1 because H2O is 0-based).
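The three steps above can be tried out locally with a base-R sketch like the following. It is only an analog: it uses runif in place of h2o.runif, and because base-R factor codes already start at 1, the "+ 1" from the H2O version is not needed here. The variable names and the use of seq(..., length.out = k + 1) for the break points are my own choices for illustration.

```r
# Base-R sketch of the fold-assignment idea (illustrative; no H2O required).
set.seed(1)
n <- 1000
k <- 5
u <- runif(n)                                  # step 1: one U(0,1) draw per row
bins <- cut(u, seq(0, 1, length.out = k + 1),  # step 2: bucket the draws into k equal-width bins
            include.lowest = TRUE)
folds <- as.integer(bins)                      # step 3: base-R factor codes are already 1..k
table(folds)                                   # fold sizes are roughly, not exactly, n/k
```

Because each row is assigned independently, the folds are only approximately equal in size; that is usually fine for large frames but worth keeping in mind for small ones.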
The remainder of the method is not very interesting, except for the asynchronous launch and polling of models. From the R interface, the algorithm methods may take a special parameter, future=TRUE, to return a model future object, which can be blocked on at a later time (rather than blocking at launch).
Predicting 1-Many Models
Building on the one-versus-many code in Example 3, the predict code should look something like:
ensemble.predict <- function(models, valid_data) {
  probs <- .binomial.predict.helper(models, valid_data)
  p_valid <- h2o:::h2o.which.max(probs[[1]])
  dim(p_valid)
  res <- h2o.cbind(p_valid, probs[[1]][,1])
  dim(res)
  res
}
.binomial.predict.helper <- function(models, data) {
  keys.to.clean <- NULL
  threshes <- NULL
  Y <- ncol(data) # assumes that response is last vec...
  res <- lapply(0L:(length(models)-1L), function(ID) {
    d <- h2o.cbind(data, as.numeric(data[,Y])==ID)
    p <- h2o.performance(models[[ID+1]], d)
    t <- h2o.find_threshold_by_max_metric(p, "f1")
    pred <- h2o.predict(models[[ID+1]], d)
    # keep the class probability only where it clears the threshold
    cp <- ifelse(pred[,3] >= t, pred[,3], 0)
    keys.to.clean <<- c(keys.to.clean, d@frame_id, pred@frame_id)
    threshes <<- c(threshes, t) # <<- so the thresholds survive outside this closure
    dim(cp)
    cp
  })
  res <- h2o.cbind(res)
  print(dim(res))
  h2o.rm(keys.to.clean)
  list(res, threshes)
}
This constructs class probabilities for each of the classes based on a threshold computed over the holdout data from the k-fold cross-validation (it alternatively takes any input vector of thresholds).
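The thresholding-then-voting idea can be seen in isolation with a small base-R sketch. The probability matrix and the per-class thresholds below are made-up numbers, and the variable names are hypothetical; the sketch only mirrors the "zero out below the threshold, then take the max" logic, not the H2O calls themselves.

```r
# Base-R sketch of the thresholded one-versus-many vote (illustrative only).
# probs: one column per class, holding that class's binomial model probability.
probs <- matrix(c(0.90, 0.20, 0.10,
                  0.30, 0.60, 0.55,
                  0.40, 0.35, 0.80),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("c0", "c1", "c2")))
threshes <- c(0.5, 0.5, 0.6)  # made-up per-class thresholds

# zero out any probability that falls below its class threshold
clipped <- ifelse(sweep(probs, 2, threshes, ">="), probs, 0)
# pick the class with the largest surviving probability,
# shifted to 0-based IDs to match the 0..(nclass-1) class IDs in Example 3
predicted <- apply(clipped, 1, which.max) - 1L
predicted  # 0 1 2
```

Note that a row whose probabilities all fall below their thresholds is clipped to all zeros, in which case which.max simply returns the first class; a production version would probably want to handle that case explicitly.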