The MillionSongs Data Part 1: Bells and Whistles of GLM in H2O

Using the Million Songs Data Set, I want to walk through H2O's GLM tool from beginning to end. Note that the original data are large, so downloading and fiddling with the full data set can be quite painful if you just do it from your desktop; that said, you can find it here. It’s a good opportunity to take a really detailed look at H2O so that you can get the most bump from the trunk (so to speak).

To start, let’s assume you’ve decided that GLM is the method for you. You’ve launched H2O, parsed your data and chosen GLM from the drop down menu under “Model”.

Destination Key – this is an automatically generated key for your model; it will allow you to recall this specific model and all of its details later in your analysis. While H2O will spit a key out for you, you can also specify a model name so that later you can tell which of many models you want to revisit.

Key – this is the .hex key generated when you parsed your data into H2O. If you didn’t save it at the time it’s no biggie. The .hex is named whatever your original data file was named, save for the change in extension. If you begin typing the name of your original file, you will be given the option to tab auto-complete. If you want to find the key yourself, go to the drop down menu “Admin”, select “Jobs”, and find “Parse” under description. The key for your data of interest is given in the “Destination key” field, and is a clickable link that allows you to inspect your data.

Y – Your dependent variable.

X – Once you identify your dependent variable (the value you would like to predict) in the Y field, the X field will auto populate with all possible options (all of your other variables). Select the subset of variables that you would like to use as predictors.

Family – Under family you will see a drop down menu with choices. Each of the four options differs in the assumptions you make about your dependent (Y) variable – the variable you would like to predict. They are explained in some detail below.

Link – Each family is associated with a default link function, which defines the transformation that connects the expected value of Y to the linear combination of the X variables chosen to predict it.

| Family | Default Link | Description and Example |
| --- | --- | --- |
| Gaussian | Identity | Your dependent variable (Y) is quantitative, continuous (or continuous predicted values can be meaningfully interpreted), and expected to be normally distributed. EX: The average length of a song in seconds, or the average purchase price of a product. |
| Binomial | Logit | Your dependent variable takes on two values, traditionally coded as 0 and 1, and follows a binomial distribution. Choose this if you have a categorical Y with two possible outcomes. EX: A customer decides to purchase or not; a song is played or not played. |
| Poisson | Log | Your dependent variable is a count: a quantitative, discrete value that expresses the number of times some event occurred. EX: The number of customers visiting a website over time, or the number of customers visiting a store over distance. |
| Gamma | Inverse | Your dependent variable is a survival measure; that is, you have some measure of the duration of a process for which the outcome is variable. EX: The length of time an individual remains a customer, or the length of time before a particular product feature fails. |
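
To make the family and link pairing a little more concrete, here is a minimal sketch in plain Python (not H2O code) of the four default links listed above and their inverses. The function names `link` and `inverse_link` and the example coefficients are my own illustration, not anything from H2O; the idea is only that the model builds a linear predictor from your X variables and the inverse link turns it back into a predicted mean for Y.

```python
import math

# Default link for each family, as listed in the table above.
DEFAULT_LINKS = {
    "gaussian": "identity",
    "binomial": "logit",
    "poisson": "log",
    "gamma": "inverse",
}


def link(name, mu):
    """Map the expected value of Y (mu) onto the linear-predictor scale."""
    if name == "identity":
        return mu
    if name == "logit":
        return math.log(mu / (1.0 - mu))
    if name == "log":
        return math.log(mu)
    if name == "inverse":
        return 1.0 / mu
    raise ValueError("unknown link: " + name)


def inverse_link(name, eta):
    """Map a linear predictor (eta) back to the expected value of Y."""
    if name == "identity":
        return eta
    if name == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    if name == "log":
        return math.exp(eta)
    if name == "inverse":
        return 1.0 / eta
    raise ValueError("unknown link: " + name)


# Hypothetical intercept and slope, purely for illustration.
eta = 0.3 + 1.2 * 0.5
for family, link_name in DEFAULT_LINKS.items():
    print(family, link_name, inverse_link(link_name, eta))
```

The same linear predictor gives a very different predicted mean depending on the link, which is why the choice of family matters before you ever look at a coefficient.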

Lambda: H2O provides a default value, but this can also be user defined. Lambda is a regularization parameter that is designed to prevent overfitting: it sets the overall strength of the penalty on the coefficients, so larger values shrink the coefficients more. The best value of lambda depends on how closely you need the coefficients to agree across the cross validated models.

Alpha: A user defined regularization tuning parameter that H2O sets to 0.5 by default, but which can take any value between 0 and 1, inclusive. It controls how the penalty is split between the L1 (lasso) and L2 (ridge) terms. An alpha of 1 is the lasso penalty, which can shrink some coefficients all the way to zero; an alpha of 0 is the ridge penalty, which shrinks coefficients without dropping them; values in between mix the two.

Lambda and alpha are distinct in purpose: lambda controls how strongly the coefficients are penalized, and thus how much overfitting is reined in, whereas alpha controls the form of that penalty and therefore how many variables the model overall tends to keep.
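
If it helps to see the two knobs side by side, here is a small sketch of the elastic net penalty in its common textbook form. This is plain Python, not H2O code; `elastic_net_penalty` is a name I made up, and the exact scaling H2O applies internally is an assumption on my part. The coefficients are invented for the example.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Common elastic net form: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    lam   scales the whole penalty (how much regularization).
    alpha mixes the lasso (L1) and ridge (L2) parts (what kind).
    """
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Made-up coefficients, purely for illustration.
coefs = [0.5, -2.0, 0.0]
for alpha in (0.0, 0.5, 1.0):
    print("alpha =", alpha, "penalty =", elastic_net_penalty(coefs, lam=0.1, alpha=alpha))
```

Doubling `lam` doubles the penalty whatever alpha is, while moving alpha from 0 to 1 changes which coefficients the penalty leans on hardest.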

N-Folds: The number of cross validation models you would like H2O to generate. Choosing 10 means that your original data will be split at random into ten subsets, and ten additional models will be fit, each one trained with one subset held out for validation. It’s important to note that the smaller your original data are, the larger the variation you can expect to see in the parameter estimates provided in the cross validation models; for sufficiently small data sets you may want to choose a different evaluative criterion.
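
For a feel of what n-fold cross validation does to your rows, here is a minimal sketch of ordinary k-fold splitting using only the Python standard library. It is not H2O's implementation, and how H2O assigns rows to folds internally is an assumption; `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Return a list of (train_rows, held_out_rows) pairs, one per fold."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    # Deal the shuffled rows into n_folds roughly equal groups.
    folds = [rows[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        held_out = folds[k]
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        splits.append((train, held_out))
    return splits


# A deliberately tiny data set: 25 rows, 10 folds.
splits = k_fold_indices(n_rows=25, n_folds=10)
for train, held_out in splits[:3]:
    print(len(train), "rows to fit on,", len(held_out), "rows held out")
```

With only 25 rows each held-out group has two or three observations, which is exactly the situation where the cross validated estimates bounce around a lot.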

Expert Settings: for the moment I would like to leave expert settings aside, except to note that this is where you go if you would like to standardize your data. In data where there is a substantial difference in the scale of your input variables, standardizing puts the coefficients on a comparable footing and can greatly improve the interpretability of your results.
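
As a rough picture of what standardizing means, here is a sketch that rescales a column to mean 0 and standard deviation 1. It is plain Python rather than H2O code; the column names and values are invented, and whether H2O divides by the population or the sample standard deviation is an assumption I have not checked.

```python
import statistics


def standardize(column):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / sd for value in column]


# Hypothetical inputs on very different scales.
duration_seconds = [180.0, 240.0, 300.0]
loudness = [-0.9, -0.5, -0.1]

print(standardize(duration_seconds))
print(standardize(loudness))
```

After rescaling, both columns are measured in standard deviations, so a coefficient of a given size means the same amount of movement in either variable.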