Running a GLM Model in H2O + R (notes from the hands-on meetup Sept. 26)

This is a walkthrough of running H2O through R. Before you get started you will need three things:

R (a recent version); H2O, which you can get through GitHub (https://github.com/0xdata/h2o) or directly from our website (http://0xdata.com/h2O/); and the h2oWrapper R package, which is the tool that makes H2O talk to R and lets you talk to big data.

We’re going to run H2O from your computer. To do this, open your command line terminal (or wherever you run Java programs from). If you're doing this on a server, you need the same basic elements plus the right permissions; other than that, everything is the same.

Start an instance of H2O:

cd to the directory with the h2o jar:

$ cd <file path to working directory>/h2o-1.7.0.536

Enter the java command:

$ java -Xmx<memory> -jar h2o.jar -name mystats-cloud1

In the command above, replace <memory> with the amount of memory you want to allocate to h2o. If you're not sure, 3 gigs is a good starting place: java -Xmx3g -jar h2o.jar -name mystats-cloud1

Ok. You have an instance of h2o running? Good. Now go to R.

In the R console either change your working directory or be ready to give R an absolute path to the R package. This package makes sure that the version of h2o in R and the version of h2o you are running in your cloud are the same and can talk to each other.

>install.packages("<unzipped h2o directory>/R/h2oWrapper_1.0.tar.gz", repos = NULL, type = "source")

# This installs the dependent packages that you need to run h2o from R:

> h2oWrapper.installDepPkgs()

# Make an R object for your h2o instance (the h2o cloud that you started from your terminal)

> localH2O = h2oWrapper.init(ip = "localhost", port = 54321, startH2O = TRUE, silentUpgrade = FALSE, promptUpgrade = TRUE)

# (you should get a lot of output that ends with: successfully connected to H2O)

# Load the h2o package in R
> require(h2o)

Beginner calls in R for running a GLM model in H2O (you can get help and a full listing of functions and calls by running ??h2o):

> h2o.importFile(<h2o object>, path, key = "", parse = T)
# <h2o object> is the instance of h2o you have running. We called it localH2O.
> h2o.glm(y = y, x = X, data = <your data>, family = "gaussian", nfolds = 2)
# Put your variable names in quotes, and make sure they are all spelled correctly. For sets of X variables, enter x = c("Var 1", "Var 2", etc...).
> h2o.predict(<model to predict on>, <data to predict on>)

# Generate predictions for a new set of data using a GLM model you built in H2O.
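
If you want to see the shape of these calls before connecting to a cluster, here is a minimal sketch using base R's own glm() and predict() on the built-in mtcars data. This is not H2O code, and the data set and column choices are illustrative assumptions. It only mirrors the pattern above: name the response and predictors as quoted strings, fit a model, then predict on data the model has not seen.

```r
# Base-R analogue of the glm -> predict pattern (stats::glm, NOT h2o.glm).
# mtcars and the chosen columns are illustrative assumptions.
y <- "mpg"
X <- c("wt", "hp")

# Hold out the last few rows to play the role of "new data".
train <- mtcars[1:25, ]
test <- mtcars[26:32, ]

# Build the formula mpg ~ wt + hp from the quoted names.
# A misspelled name makes the fit fail, which is why spelling matters.
model_formula <- reformulate(X, response = y)
fit <- glm(model_formula, data = train, family = gaussian())

# One prediction per held-out row.
preds <- predict(fit, newdata = test)
print(head(preds))
```

In H2O the fitting and predicting happen on the cluster rather than in your R session, but the arguments you pass from R follow the same idea.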

 

Here is the code from a demo we ran at a hands-on with the full airlines data set (when I say full I mean ALL observations and all possible variables: 152,360,031 observations). That's a huge amount of data, so this is probably a good time to talk about the difference between H2O in R and plain R. H2O in R is happy to call basic R functions, and you can use H2O with R to your heart's content, but when you run an instance of H2O in R, you're using H2O math, H2O objects and H2O distributed computing. R is acting as the user interface. This has a few implications. The first is that if you want to run really big data through R using H2O, you need to do it on a server. Our compression scheme is amazing, but the full airlines data set won't do well with only 4 gigs of memory.

The second is that you need an instance of H2O running while you work in R. If your instance of H2O dies, or you stop it, H2O in R no longer has a cluster on the other side to talk to, and you will lose your work. I highly recommend saving your R work in an external script editor, so that you don't have to recreate your R code from scratch if a server goes down.
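
To put the memory point in rough numbers, here is a back-of-the-envelope sketch in plain R. It assumes every value is held as an uncompressed 8-byte double and counts only the ten columns used in the GLM demo below (nine predictors plus the response). Both are simplifying assumptions, and H2O's compressed in-cluster footprint will differ, so treat the result as an order of magnitude only.

```r
# Rough size of the modelling columns if held as plain doubles.
n_rows <- 152360031      # observations in the full airlines data set
n_cols <- 10             # assumption: 9 predictors + 1 response from the demo
bytes_per_value <- 8     # assumption: every value stored as a double

total_bytes <- n_rows * n_cols * bytes_per_value
total_gib <- total_bytes / 2^30
cat(sprintf("Roughly %.1f GiB uncompressed\n", total_gib))
```

Even under these generous assumptions the figure is well above 4 gigs, which is why the full data set belongs on a server.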

>> library('h2oWrapper')
>>h2oWrapper.installDepPkgs()

# I found the IP and port by looking at the output from the terminal where I started H2O.

>>ip = '192.168.0.103'
>>port = 54321
 
# establish communication between H2O and R, and keep the connection object
>>localH2O <- h2oWrapper.init(ip, port, startH2O = FALSE, silentUpgrade = F, promptUpgrade = T)
 
# Call the H2O package
>>library(h2o)
 
#import a file
>>airlines <- h2o.importFile(localH2O, "Airline.csv", key="", parse = T, sep = "")
 
#get summary information on the data
>>summary(airlines)
>>system.time(summary(airlines))
 
# specify an H2O GLM model of the family binomial
>>y="IsArrDelayed"
>>X=c("Year", "Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "UniqueCarrier", "Origin", "Dest", 'Distance')
>>airlines.glm <- h2o.glm(y=y, x=X, data=airlines, family = "binomial", nfolds = 1)
 
#Get basic information about your model
>>system.time(airlines.glm)
>>coefs <- airlines.glm@model$coefficients
>>head(coefs[order(abs(coefs), decreasing=T)], 20)
 
#generate a PCA model
>>airlines.pca <- h2o.prcomp(airlines, tol=0, standardize=T)
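
The coefficient-ranking step in the listing above is plain R indexing, so you can try it without a cluster. The sketch below uses a toy named vector whose values are made up for illustration. They are not results from the airlines model.

```r
# Toy named coefficient vector (values are made up for illustration).
coefs <- c(Intercept = 0.10, Distance = -0.02, Month = 0.75,
           DayOfWeek = -1.30, Year = 0.40)

# order(abs(x), decreasing = TRUE) returns the indices that sort the
# coefficients from largest to smallest magnitude, ignoring sign.
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]

# head() keeps the top n entries; here the three largest in magnitude.
top3 <- head(ranked, 3)
print(top3)
```

Ranking by absolute value keeps strongly negative coefficients near the top alongside strongly positive ones, which is usually what you want when scanning for the most influential terms.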

The curious can also open the web browser, point it at the IP and port number specified in your java command output, and watch all of the work you are doing in H2O show up in your jobs list. You can also continue your analysis in the web browser.

Have fun!