Use H2O.ai on Azure HDInsight

This is a repost from this article on MSDN

We’re hosting an upcoming webinar to show you how to use H2O on HDInsight and to answer your questions. Sign up for our upcoming webinar on combining H2O and Azure HDInsight.

We recently announced that H2O and Microsoft Azure HDInsight have integrated to provide Data Scientists with a Leading Combination of Engines for Machine Learning and Deep Learning. Through H2O’s AI platform and its Sparkling Water solution, users can combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark, as well as drive computation from Scala/R/Python and utilize the H2O Flow UI, providing an ideal machine learning platform for application developers.

In this blog, we will provide a detailed step-by-step guide to help you set up your first H2O on HDInsight solution.

Step 1: setting up the environment

The first step is to create an HDInsight cluster with H2O installed. You can either install H2O while provisioning a new HDInsight cluster, or install H2O on an existing cluster. Please note that, as of today, H2O on HDInsight only works with Spark 2.0 on HDInsight 3.5, which is the default version of HDInsight.

For more information on how to create a cluster in HDInsight, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters). For more information on how to install an application on an existing cluster, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-applications).

Please note that we’ve recently updated our UI to require fewer clicks, so you need to click the “Custom” button to install applications on HDInsight.

hdi-image1

Step 2: Writing your first H2O on HDInsight application

After installing H2O on HDInsight, you can use the built-in Jupyter notebooks to write your first H2O on HDInsight application. Simply go to https://yourclustername.azurehdinsight.net/jupyter to open the Jupyter Notebook. You will see a folder named “H2O-PySparkling-Examples”.

hdi-image2

There are a few examples in the folder, but I recommend starting with the one named “Sentiment_analysis_with_Sparkling_Water.ipynb”. Most of the details on how to use the H2O PySparkling Water APIs are already covered in the Notebook itself, so here I will give some high-level overviews.

The first thing you need to do is to configure the environment. Most of the configuration is already taken care of by the system, such as the Flow UI address, the Spark jar location, the Sparkling Water egg file, etc.

There are three important parameters to configure: the driver memory, the executor memory, and the number of executors. The default values are optimized for the default 4-node cluster, but your cluster size might vary.

Tuning these parameters is outside the scope of this blog, as it is more of a Spark resource tuning problem. There are a few good reference articles, such as this one.

Note that all Spark applications deployed using a Jupyter Notebook run in “yarn-cluster” deploy mode. This means that the Spark driver is allocated on a worker node of the cluster, not on the head nodes.

In this example, we simply allocate 75% of each HDInsight worker node’s memory to the driver and executors (21 GB each) and use 3 executors, since the default HDInsight cluster has 4 worker nodes (3 executors + 1 driver).
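
If you need to change these values, they can be set in the notebook’s configuration cell. As a minimal sketch (the field names follow the Livy/sparkmagic conventions used by HDInsight Jupyter notebooks; the values shown are only illustrative and should match your own cluster size), such a cell might look like:

%%configure -f
{ "driverMemory": "21G", "executorMemory": "21G", "numExecutors": 3 }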

hdi-image3

Please refer to the Jupyter Notebook tutorial for more information on how to use Jupyter Notebooks on HDInsight.

The second step here is to create an H2O context. Since a default Spark context (called sc) is already configured in the Jupyter Notebook, we just need to call

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

so that H2O can recognize the default Spark context.

After executing this line of code, H2O will print out the status, as well as the YARN application it is using.

hdi-image4

After this, you can use H2O APIs plus the Spark APIs to write your applications. To learn more about Sparkling Water APIs, refer to the H2O GitHub site here.
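
For instance, a common pattern is to move data back and forth between Spark and H2O. Here is a minimal sketch (the CSV path, frame name and the use of inferSchema are illustrative assumptions, not taken from the notebook):

import pysparkling

h2o_context = pysparkling.H2OContext.getOrCreate(sc)

# Load a hypothetical CSV with Spark, hand it to H2O, and convert it back again
reviews_df = spark.read.csv("wasb:///example/data/reviews.csv", header=True, inferSchema=True)
reviews_hf = h2o_context.as_h2o_frame(reviews_df, "reviews")   # Spark DataFrame -> H2OFrame
reviews_back = h2o_context.as_spark_frame(reviews_hf)          # H2OFrame -> Spark DataFrame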

hdi-image5

This sentiment analysis example has a few steps to analyze the data:

  1. Load data to Spark and H2O frames
  2. Data munging using H2O API
    • Remove columns
    • Refine Time Column into Year/Month/Day/DayOfWeek/Hour columns
  3. Data munging using Spark API
    • Select columns Score, Month, Day, DayOfWeek, Summary
    • Define UDF to transform score (0..5) to binary positive/negative
    • Use TF-IDF to vectorize summary column
  4. Model building using H2O API
    • Use H2O Grid Search to tune hyper parameters
    • Select the best Deep Learning model

Please refer to the Jupyter Notebook for more details.
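
To give a flavor of step 4, below is a minimal sketch of an H2O grid search over deep learning models from Python (the hyper-parameter values, and the predictors, train, valid and “Sentiment” names, are placeholders rather than the notebook’s actual settings):

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Hypothetical hyper-parameter grid
hyper_params = {"hidden": [[100, 100], [200, 200]],
                "l1": [0, 1e-4],
                "l2": [0, 1e-4]}

grid = H2OGridSearch(model=H2ODeepLearningEstimator,
                     grid_id="sentiment_dl_grid",
                     hyper_params=hyper_params)
grid.train(x=predictors, y="Sentiment", training_frame=train, validation_frame=valid)

# Pick the model with the best validation AUC
best_model = grid.get_grid(sort_by="auc", decreasing=True).models[0]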

Step 3: use FLOW UI to monitor the progress and visualize the model

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed while the Spark application/notebook is running.

You can click the available link in the Jupyter Notebook, or you can directly access this URL: https://yourclustername-h2o.apps.azurehdinsight.net/flow/index.html

In this example, we will demonstrate its visualization capabilities. Simply click “Model > List Grid Search Results” (since we are using Grid Search to tune hyperparameters).

hdi-image6

Then you can access the 4 grid search results:

hdi-image7

And you can view the details of each model. For example, you can visualize the ROC curve as below:

hdi-image8

In Jupyter Notebooks, you can also view the performance in text format:

hdi-image9

Summary
In this blog, we have walked you through the detailed steps for creating your first H2O application on HDInsight for your machine learning workloads. For more information on H2O, please visit the H2O site; for more information on HDInsight, please visit the HDInsight site.

This blog post is co-authored by Pablo Marin (@pablomarin), Solution Architect at Microsoft.

Sparkling Water on the Spark-Notebook

This is a guest post from our friends at Kensu.

In the space of Data Science development in enterprises, two outstanding scalable technologies are Spark and H2O. Spark is a generic distributed computing framework and H2O is a very performant scalable platform for AI.
Their complementarity is best exploited with the use of Sparkling Water. Sparkling Water is the solution for getting the best of both: Spark’s elegant APIs, RDDs and multi-tenant context, together with H2O’s speed, columnar compression and fully featured machine learning and deep learning algorithms, in an enterprise-ready fashion.

Examples of Sparkling Water pipelines are readily available in the H2O GitHub repository; we have revisited these examples using the Spark-Notebook.

The Spark-Notebook is an open source notebook (a web-based environment for code editing, execution, and data visualization) focused on Scala and Spark. The Spark-Notebook is part of the Adalog suite of Kensu.io, which addresses agility, maintainability and productivity for data science teams. Adalog offers data scientists a short work cycle for deploying their work to the business reality, and offers managers a set of data governance tools giving a consistent view of the impact of data activities on the market.

This new material allows diving into Sparkling Water in an interactive and dynamic way.

Working with Sparkling Water in the Spark-Notebook scaffolds an ideal platform for big data/data science agile development. Most notably, this gives the data scientist the power to:

  • Write rich documentation of their work alongside the code, thus improving the capacity to index knowledge
  • Experiment quickly through interactive execution of individual code cells and share the results of these experiments with colleagues
  • Visualize the data they are feeding H2O through an extensive list of widgets and automatic rendering of computation results

Most of the H2O/Sparkling Water examples have been ported to the Spark-Notebook and are available in a GitHub repository.

We are focussing here on the Chicago crime dataset example and looking at:

  • How to take advantage of both H2O and Spark-Notebook technologies,
  • How to install the Spark-Notebook,
  • How to use it to deploy H2O jobs on a spark cluster,
  • How to read, transform and join data with Spark,
  • How to render data on a geospatial map,
  • How to apply deep learning or Gradient Boosted Machine (GBM) models using Sparkling Water

Installing the Spark-Notebook:

Installation is very straightforward on a local machine. Follow the steps described in the Spark-Notebook documentation and in a few minutes you will have it working. Please note that Sparkling Water currently works only with Scala 2.11 and Spark 2.0.2 and above.
For larger projects, you may also be interested to read the documentation on how to connect the notebook to an on-premise or cloud computing cluster.

The Sparkling Water notebooks repo should be cloned in the “notebooks” directory of your Spark-Notebook installation.

Integrating H2O with the Spark-Notebook:

In order to integrate Sparkling Water with the Spark-Notebook, we need to tell the notebook to load the Sparkling Water package and specify a custom Spark configuration, if required. Spark then automatically distributes the H2O libraries to each of your Spark executors. Declaring the Sparkling Water dependencies pulls in other libraries transitively, so take care to avoid duplication or multiple versions of some dependencies.
The notebook metadata defines the custom dependencies (ai.h2o) and the dependencies not to include (because they are already available, i.e. Spark, Scala and Jetty). The custom local repo lets us define where dependencies are stored locally, and thus avoid downloading them each time a notebook is started.

"customLocalRepo": "/tmp/spark-notebook",
"customDeps": [
  "ai.h2o % sparkling-water-core_2.11 % 2.0.2",
  "ai.h2o % sparkling-water-examples_2.11 % 2.0.2",
  "- org.apache.hadoop % hadoop-client %   _",
  "- org.apache.spark  % spark-core_2.11    %   _",
  "- org.apache.spark % spark-mllib_2.11 % _",
  "- org.apache.spark % spark-repl_2.11 % _",
  "- org.scala-lang    %     _         %   _",
  "- org.scoverage     %     _         %   _",
  "- org.eclipse.jetty.aggregate % jetty-servlet % _"
],
"customSparkConf": {
  "spark.ext.h2o.repl.enabled": "false"
},

With these dependencies set, we can start using Sparkling Water and initiate an H2O context from within the notebook.

Benchmark example – Chicago Crime Scenes:

As an example, we can revisit the Chicago Crime Sparkling Water demo. The Spark-Notebook we used for this benchmark can be seen in a read-only mode here.

Step 1: The three datasets are loaded as Spark data frames:

  • Chicago weather data: min, max and mean temperature per day
  • Chicago Census data: average poverty, unemployment, education level and gross income per Chicago Community Area
  • Chicago historical crime data: crime description, date, location, community area, etc. Also contains a flag telling whether the criminal has been arrested or not.

The three tables are joined using Spark into one big table with location and date as keys. A view of the first entries of the table is generated by the notebook’s automatic rendering of tables (see a sample in the table below).

spark_tables

Geospatial chart widgets are also available in the Spark-Notebook; for example, here are the first 100 crimes in the table:

geospatial

Step 2: We can transform the Spark data frame into an H2O frame and randomly split the H2O frame into training and validation frames containing 80% and 20% of the rows, respectively. This is a memory-to-memory transformation, effectively copying and formatting the data in the Spark data frame into an equivalent representation in the H2O nodes (spawned by Sparkling Water inside the Spark executors).
We can verify that the frames are loaded into H2O by looking at the H2O Flow UI (available on port 54321 of your Spark-Notebook installation). We can access it by calling “openFlow” in a notebook cell.

h2oflow

Step 3: From the Spark-Notebook, we train two H2O machine learning models on the training H2O frame. For comparison, we construct a Deep Learning MLP model and a Gradient Boosting Machine (GBM) model. Both models use all the data frame columns as features: time, weather, location, and neighborhood census data. The models live in the H2O context and are thus visible in the H2O Flow UI. Sparkling Water functions allow us to access them from the SparkContext.
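
The notebook itself is written in Scala; purely as an illustration, a rough equivalent using the H2O Python API might look like the sketch below (the train/valid frames and the “Arrest” response column are assumptions about the prepared data, and the hyper-parameters are arbitrary):

from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# "Arrest" is assumed to be a binary factor column in the joined frame
predictors = [c for c in train.columns if c != "Arrest"]

dl = H2ODeepLearningEstimator(hidden=[200, 200], epochs=10)
dl.train(x=predictors, y="Arrest", training_frame=train, validation_frame=valid)

gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=6)
gbm.train(x=predictors, y="Arrest", training_frame=train, validation_frame=valid)

# Compare validation AUC of the two models
print(dl.model_performance(valid).auc(), gbm.model_performance(valid).auc())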

We compare the classification performance of the two models by looking at the area under the curve (AUC) on the validation dataset. The AUC measures the discrimination power of the model, that is the ability of the model to correctly classify crimes that lead to an arrest or not. The higher, the better.

The Deep Learning model leads to a 0.89 AUC while the GBM gets to 0.90 AUC. The two models are therefore quite comparable in terms of discrimination power.

Flow2

Step 4: Finally, the trained models are used to estimate the probability of arrest for two specific crimes:

  • A “narcotics” related crime on 02/08/2015 11:43:58 PM in a street of community area “46” in district 4 with FBI code 18.

    The probability of being arrested predicted by the deep learning model is 99.9% and by the GBM is 75.2%.

  • A “deceptive practice” related crime on 02/08/2015 11:00:39 PM in a residence of community area “14” in district 9 with FBI code 11.

    The probability of being arrested predicted by the deep learning model is 1.4% and by the GBM is 12%.

The Spark-Notebook allows for a quick computation and visualization of the results:

spark_notebook

Summary

Combining Spark and H2O within the Spark-Notebook is a very nice set-up for scalable data science. More examples are available in the online viewer. If you are interested in running them, install the Spark-Notebook and look in this repository. From that point, you are on track for enterprise-ready interactive scalable data science.

Loic Quertenmont,
Data Scientist @ Kensu.io

Stacked Ensembles and Word2Vec now available in H2O!

Prepared by: Erin LeDell and Navdeep Gill

Stacked Ensembles

sz42-6-wheels-lightened

H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking or “Super Learning.” This method currently supports regression and binary classification, and multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.

H2O has previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained; however, for new projects we’d recommend using the native H2O version. Native support for stacking in the H2O backend brings support for ensembles to all the H2O APIs.

Creating ensembles of H2O models is now dead simple. You simply pass a list of existing H2O model ids to the stacked ensemble function and you are ready to go. This list of models can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performance. Thus, using H2O’s Random Grid Search to generate a collection of random models is a handy way of quickly generating a set of base models for the ensemble.

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!
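
Putting those two ideas together, a minimal Python sketch of the “random grid of GBMs as base models” recipe might look like the following (x, y and train are the same placeholders as in the examples above; the hyper-parameter grid is arbitrary, and the base models need cross-validated predictions in order to be stacked):

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

# Random grid of GBMs to serve as base models
hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1], "sample_rate": [0.7, 0.9]}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 10, "seed": 1}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=50, nfolds=5,
                                                        fold_assignment="Modulo",
                                                        keep_cross_validation_predictions=True),
                     hyper_params=hyper_params,
                     search_criteria=search_criteria)
grid.train(x=x, y=y, training_frame=train)

# base_models accepts a list of model ids, e.g. everything the grid produced
ensemble = H2OStackedEnsembleEstimator(base_models=grid.model_ids)
ensemble.train(x=x, y=y, training_frame=train)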

Word2Vec

w2v

H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models that are used to produce word embeddings (a language modeling/feature engineering technique in natural language processing where words or phrases are mapped to vectors of real numbers). The word embeddings can subsequently be used in a machine learning model, for example GBM. This allows users to utilize text-based data with current H2O algorithms in a very efficient manner. An R example is available here.
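
For readers who prefer Python, a minimal sketch of the same idea is shown below (assuming the Python wrapper is available in your H2O build; “words” stands for a single-column H2OFrame of tokenized strings, and the parameter values are only illustrative):

from h2o.estimators.word2vec import H2OWord2vecEstimator

# "words" is assumed to be a single-column H2OFrame of tokens
w2v = H2OWord2vecEstimator(vec_size=100, epochs=10)
w2v.train(training_frame=words)

# Embed each document by averaging its word vectors, then feed the result into e.g. GBM
doc_vecs = w2v.transform(words, aggregate_method="AVERAGE")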

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word’s context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)$$

where $k$ is the size of the training window.
In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are the vector representations of $w$ as word and context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is

$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$

where $V$ is the vocabulary size.
The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be in the order of millions. To speed up the training of Word2Vec, we use hierarchical softmax, which reduces the complexity of computing $\log p(w_i | w_j)$ to $O(\log(V))$.

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.

List of New Features:

PUBDEV-2058- Implement word2vec in h2o (To use this feature in R, please visit this demo)
PUBDEV-3635- Ability to Select Columns for PDP computation in Flow (With this enhancement, users will be able to select which features/columns to render Partial Dependence Plots for from Flow; R/Python are supported already. Known issue PUBDEV-3782: when nbins < categorical levels, PDP won't compute. Please also visit this post.)
PUBDEV-3881- Add PCA Estimator documentation to Python API Docs
PUBDEV-3902- Documentation: Add information about Azure support to H2O User Guide (Beta)
PUBDEV-3739- StackedEnsemble: put ensemble creation into the back end.

List of Improvements:

PUBDEV-3989- Decrease size of h2o.jar
PUBDEV-3257- Documentation: As a K-Means user, I want to be able to better understand the parameters
PUBDEV-3741- StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models
PUBDEV-3857- Clean up the generated Python docs
PUBDEV-3895- Filter H2OFrame on pandas dates and time (python)
PUBDEV-3912- Provide way to specify context_path via Python/R h2o.init methods
PUBDEV-3933- Modify gen_R.py for Stacked Ensemble
PUBDEV-3972- Add Stacked Ensemble code examples to Python docstrings

List of Bugs:

PUBDEV-2464- Using asfactor() in Python client cannot allocate to a variable
PUBDEV-3111- R API's h2o.interaction() does not use destination_frame argument
PUBDEV-3694- Errors with PCA on wide data for pca_method = GramSVD which is the default
PUBDEV-3742- StackedEnsemble should work for regression
PUBDEV-3865- h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo
PUBDEV-3883- Negative indexing for H2OFrame is buggy in R API
PUBDEV-3894- Relational operators don't work properly with time columns.
PUBDEV-3966- java.lang.AssertionError when using h2o.makeGLMModel
PUBDEV-3835- Standard Errors in GLM: calculating and showing specifically when called
PUBDEV-3965- Importing data in python returns error - TypeError: expected string or bytes-like object
Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks

PUBDEV-3336- h2o.create_frame(): if randomize=True, value param cannot be used
PUBDEV-3740- REST: implement simple ensemble generation API
PUBDEV-3843- Modify R REST API to always return binary data
PUBDEV-3844- Safe GET calls for POJO/MOJO/genmodel
PUBDEV-3864- Import files by pattern
PUBDEV-3884- StackedEnsemble: Add to online documentation
PUBDEV-3940- Add Stacked Ensemble code examples to R docs

Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html

Artificial Intelligence Is Already Deep Inside Your Wallet – Here’s How

Artificial intelligence (AI) is the key for financial service companies and banks to stay ahead of the ever-shifting digital landscape, especially given competition from Google, Apple, Facebook, Amazon and others moving strategically into fintech. AI startups are building data products that not only automate the ingestion of vast amounts of data, but also provide predictive and actionable insights into how people spend and save across digital channels. Financial companies are now the biggest acquirers of such data products, as they can leverage the massive data sets they sit upon to achieve higher profitability and productivity, and operational excellence. Here are the five ways financial service companies are embracing AI today to go even deeper inside your wallet.

Your Bank Knows More About You Than Facebook
Banks and financial service companies today live or die by their ability to differentiate their offering and meet the unique needs of their customers in real-time. Retention is key, and artificial intelligence is already disrupting what it means for financial service companies to “know the customer.” Google, Facebook, Twitter, and other walled gardens already deeply understand this, which is why they are so keen to collect massive amounts of data on their users, even if they don’t have fintech infrastructure yet.

So how does your bank know more about you than Facebook? Using AI platforms, they can bridge customer data across multiple accounts – including bank, credit, loans, social media profiles, and more – and give them a 360-degree view of the customer. Once they have this, predictive applications suggest in real-time the “next best” offer to keep the person happy based on their spending, risk tolerance, investment history, and debt. For example, based on one transaction – a mortgage – financial companies use AI to recommend a checking account to pay for the mortgage, credit cards to buy furniture, home insurance, or even mutual funds that are focused on real estate. Financial services companies can now also predict customer satisfaction and dissatisfaction, allowing them to intercept consumer churn before it happens by offering exclusive deals or promotions before the person gets angry.

Credit “Risk” Is Becoming Competitive Opportunity
A limited amount of data is used for credit risk scoring today, and it’s heavily weighted toward existing credit history, length of credit usage, and payment history. Naturally, this results in many qualified customers – or anyone trying to access credit for the first time – being rejected for loans, credit cards and more. Credit card companies, including Amazon, are realizing there is a big revenue opportunity that is missed by the current credit assessment system. With AI, employment history data, social media data, and shopping and purchasing patterns are used to build a 360-degree view of the credit “opportunity” as opposed to pure risk. Even better, AI data products can provide real-time updates of credit scores based on recent employment status changes or transactions, so that your credit score is not a fixed number but something that evolves. With this capability, banks and financial services companies are finding overlooked or untapped credit opportunities that even the most sophisticated tech company is missing.

Predict the Next DDOS Attack
The distributed denial-of-service (DDOS) attack against Dyn in October brought to the public forefront the scale and severity of cyber attacks. In the financial realm, security breaches and cyber attacks are not only costly, but also have a damaging impact on brand trust and customer loyalty. Experts and analysts agree that such DDOS attacks will become more prevalent in the future, in part because current cybersecurity practices are built upon rules-based systems and require a lot of human intervention. Many of the current cybersecurity solutions in market are focused on detection, as opposed to prevention. They can tell you an attack is happening, but not how to predict one or what to do once it’s discovered.

Leveraging AI platforms, banks, credit card companies, and financial service providers are beginning to predict and prevent such cyber attacks with far greater precision than what’s in use today. Using traffic pattern analysis and traffic pattern prediction, AI data products inspect financial-based traffic in real-time and identify threats based on previous sessions. Effectively, this means that a financial company can shut down harmful connections before they compromise the entire website or server. Importantly, as more data is ingested, the AI data product evolves and gets smarter as the hacker changes its methodology. This takes the notion of prevention to a whole new level, as it anticipates the bad actors’ next move.

Putting an End to Money Laundering
The estimated amount of money laundered globally in one year is 2 to 5 percent of global GDP, or upwards of $2 trillion in USD. Efforts to combat money laundering are never-ending, as criminals find new ways to stay ahead of law enforcement and technology. Customer activity monitoring is currently done through rules-based filtering, in which rigid and inflexible rules are used to determine if something is suspicious. This system not only creates major loopholes and many false positives, but also wastes investigators’ time and increases operational costs. AI platforms can now find patterns that regular thresholds do not detect, and continuously learn and adapt with new data. Because false positives are reduced, investigators then focus on true anti-money laundering activities to create a more efficient, accurate solution, and at the same time reduce operational costs. Suspicious activity reports are finally living up to their name of truly documenting suspicious behavior as opposed to random red flags in a rules-based system.

Biometrics-Based Fraud Detection
Fraudulent credit card activity is one area where artificial intelligence has made great progress in detection and prevention. But there are other interesting applications that are strengthening financial services companies’ overall value proposition. Account origination fraud – where fraudsters open fake accounts using stolen or made-up information – more than doubled in 2015. That’s because there’s no way to prove with absolute certainty that the person on the mobile device is who they say they are. AI technologies are being developed to compare a variety of biometric indicators – such as facial features, iris, fingerprints, and voice – in order to allow banks and financial service companies to confirm the user’s identity in far more secure ways than just a PIN or password. Mastercard, for example, unveiled facial recognition “security checks” for purchases made on mobile phones. Given its potential to protect users’ identities from being stolen or abused, biometrics in the context of banking and financial services may face fewer regulatory hurdles than practices undertaken by Facebook and Google, both of whom have faced class action lawsuits. This is allowing financial services to move much faster in the field of biometrics.

Beyond the Wallet
The tech giants are in an arms race to acquire as many AI and machine learning startups as possible. But the one thing they don’t have yet, and financial services companies do, are massive amounts of financial data. Up until now, financial services companies required a tremendous amount of experience and human judgment in order to analyze this financial data and provide cost-effective, competitive products and services. However, by adopting “out-of-the-box” AI data products that can ingest huge amounts of data, banks and financial services companies are making valuable predictions and insights in real-time that drive revenue and reduce inefficiencies. The five applications above are not simply isolated use cases, but bellwethers of how intimately AI will be directly tied to enterprise-level financial strategy.

Source: paymentsjournal.com

Football Flowers

Start Off 2017 with Our Stanford Advisors

We were very excited to meet with our advisors (Prof. Stephen Boyd, Prof. Rob Tibshirani and Prof. Trevor Hastie) at H2O.AI on Jan 6, 2017.

Our CEO, Sri Ambati, made two great observations at the start of the meeting:

  • First was the hardware trend where hardware companies like Intel/Nvidia/AMD plan to put the various machine learning algorithms into hardware/GPUs.
  • Second was the data trend where more and more datasets are images/text/audio instead of the traditional transactional datasets.  To deal with these new datasets, deep learning seems to be the go-to algorithm.  However, while deep learning might work very well, it was very difficult to explain to business or regulatory professionals how and why it worked.

There were several techniques to get around this problem and make machine learning solutions interpretable to our customers:

  • Patrick Hall pointed out that it is monotonicity, not linearity, that determines interpretability.  He cited a credit scoring system using a constrained neural network: when the input variables were monotonic with respect to the response variable, the system could automatically generate reason codes.
  • One could run both deep learning and simpler algorithms (like GLM, Random Forest, etc.) on the same datasets.  When the performance was similar, we chose the simpler models since they tended to be more interpretable.  These meetings were great learning opportunities for us.
  • Another suggestion is to use a layered approach:
    • Use deep learning to extract a small number of features from a high dimension datasets. 
    • Next, use a simple model that used these extracted features to perform specific tasks. 
    This layered approach could provide a great speedup as well.  Imagine the case where you could reuse feature sets for images/text/speech derived by others on your own datasets; all you would need to do is build your simple model off those feature sets to perform the functions you desire.  In this case, deep learning is the equivalent of PCA for non-linear features.  Prof. Boyd seemed to like GLRM (check out H2O GLRM) for feature extraction as well.
    With this layered approach, there were more system parameters to tune.  Our auto-ML toolbox would be perfect for this!  Go team!

Subsequently, the conversation turned to visualization of datasets.  Patrick Hall brought up the approach of first using clustering to separate the dataset and then applying simple models to each cluster.  This approach is very similar to the hierarchical mixture of experts algorithm described in their Elements of Statistical Learning book.  Basically, you build decision trees from your dataset, then fit linear models at the leaf nodes to perform specific tasks.

Our very own Dr. Wilkinson had built a dataset visualization tool that could summarize a big dataset while maintaining the characteristics of the original dataset (like outliers and others). Totally awesome!

Arno Candel brought up the issue of overfitting and how to detect it during the training process rather than at the end of the training process using the held-out set.  Prof. Boyd mentioned that we should check out Bayesian trees/additive models.

Last Words of Wisdom from our esteemed advisors: Deep learning was powerful, but other algorithms like random forest could beat deep learning depending on the dataset.  Deep learning required big datasets to train.  It worked best with datasets that had some kind of organization in them, like spatial features (in images) and temporal trends (in speech/time series).  Random forest, on the other hand, worked perfectly well with datasets that have no such organization/features.

What is new in Sparkling Water 2.0.3 Release?

This release ships with H2O core 3.10.1.2.

Important Feature:

This architectural change allows Sparkling Water to connect to an existing H2O cluster. The benefit is that we are no longer affected by Spark killing its executors, so the solution should be more stable in environments with many H2O/Spark nodes. We are working on an article on how to use this very important feature in Sparkling Water 2.0.3.

Release notes: https://0xdata.atlassian.net/secure/ReleaseNote.jspa?projectId=12000&version=16601

2.0.3 (2017-01-04)

  • Bug
    • SW-152 – ClassNotFound with spark-submit
    • SW-266 – H2OContext shouldn’t be Serializable
    • SW-276 – ClassLoading issue when running code using SparkSubmit
    • SW-281 – Update sparkling water tests so they use correct frame locking
    • SW-283 – Set spark.sql.warehouse.dir explicitly in tests because of SPARK-17810
    • SW-284 – Fix CraigsListJobTitlesApp to use local file instead of trying to get one from hdfs
    • SW-285 – Disable timeline service also in python integration tests
    • SW-286 – Add missing test in pysparkling for conversion RDD[Double] -> H2OFrame
    • SW-287 – Fix bug in SparkDataFrame converter where key wasn’t random if not specified
    • SW-288 – Improve performance of Dataset tests and call super.afterAll
    • SW-289 – Fix PySparkling numeric handling during conversions
    • SW-290 – Fixes and improvements of task used to extended h2o jars by sparkling-water classes
    • SW-292 – Fix ScalaCodeHandlerTestSuite
  • New Feature
    • SW-178 – Allow external h2o cluster to act as h2o backend in Sparkling Water
  • Improvement
    • SW-282 – Integrate SW with H2O 3.10.1.2 ( Support for external cluster )
    • SW-291 – Use absolute value for random number in sparkling-water in internal backend
    • SW-295 – H2OConf should be parameterized by SparkConf and not by SparkContext

Please visit https://community.h2o.ai to learn more about it, provide feedback and ask for assistance as needed.

@avkashchauhan | @h2oai

Behind the scenes of CRAN

(Just from my point of view as a package maintainer.)

New users of R might not appreciate the full benefit of CRAN and new package maintainers may not appreciate the importance of keeping their packages updated and free of warnings and errors. This is something I only came to realize myself in the last few years so I thought I would write about it, by way of today’s example.

Since data.table was updated on CRAN on 3rd December, it has been passing all-OK. But today I noticed a new warning (converted to error by data.table) on one CRAN machine. This is displayed under the CRAN checks link.

selection_003

Sometimes errors happen for mundane reasons. For example, one of the Windows machines was in error recently because somehow the install became locked. That either fixed itself or someone spent time fixing it (if so, thank you). Today’s issue appears to be different.

I can either look lower down on that page or click the link to see the message.

Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.

I’ve never seen this message before. But given it mentions something about something being deprecated and the flavor name is r-devel it looks likely that it has just been added to R itself in development. On a daily basis all CRAN packages are retested with the very latest commits to R that day. I did a quick scan of the latest commit messages to R but I couldn’t see anything relating to this apparently new warning. Some of the commit messages are long with detail right there in the message. Others are short and use reference numbers that require you to hop to that reference such as “port r71828 from trunk” which prevent fast scanning at times like this. There is more hunting I could do, but for now, let’s see if I get lucky.

The last line of data.table’s test suite output has been refined over the years for my future self, and on the same CRAN page without me needing to go anywhere else, it is showing me today:

5 errors out of 5940 (lastID=1748.4, endian==little, sizeof(long double)==16, sizeof(pointer)==8) in inst/tests/tests.Rraw on Tue Dec 27 18:09:48 2016. Search tests.Rraw for test numbers: 167, 167.1, 167.2, 168, 168.1.

The date and time is included to double check to myself that it really did run on CRAN’s machine recently and I’m not seeing an old stale result that would disappear when simply rerun with latest commits/fixes.

Next I do what I told myself and open tests.Rraw file in my editor and search for "test(167". Immediately, I see this test is within a block of code starting with :

if ("package:ggplot2" %in% search()) {
test(167, ...
}

In data.table we test compatibility of data.table with a bunch of popular packages. These packages are listed in the Suggests field of DESCRIPTION.

Suggests: bit64, knitr, chron, ggplot2 (≥ 0.9.0), plyr, reshape, reshape2, testthat (≥ 0.4), hexbin, fastmatch, nlme, xts, gdata, GenomicRanges, caret, curl, zoo, plm, rmarkdown, parallel

We do not necessarily suggest these packages in the English sense of the verb; i.e., ‘to recommend’. Rather, perhaps a better name for that field would be Optional in the sense that you need those packages installed if you wish to run all tests, documentation and in data.table’s case, compatibility.

Anyway, now that I know that the failing tests are testing compatibility with ggplot2, I’ll click over to ggplot2 and look at its status. I’m hoping I’ll get lucky and ggplot2 is in error too.

selection_005

Indeed it is. And I can see the same message on ggplot2’s CRAN checks page.

Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.

It’s my lucky day. ggplot2 is in error too with the same message. This time, thankfully, the new warning has nothing to do with data.table per se. I got to this point in under 30 seconds! No typing was required to run anything at all. It was all done just by clicking within CRAN’s pages and searching a file. My task is done and I can move on. Thanks to CRAN and the people that run it.

What if data.table or ggplot2 were already in error or warning before R-core made their change? R-core members wouldn’t have seen any status change. If they see no status change for any of the 9,787 CRAN packages then they don’t know for sure it’s ok. All they know is their change didn’t affect any of the passing packages, but they can’t be sure about the packages which are already in error or warning for an unrelated reason. I get more requests from R-core and CRAN maintainers to update data.table than from users of data.table. I’m sorry that I could not find time to update data.table earlier in 2016 than I did (data.table was showing an error for many months).

Regarding today’s warning, it has been caught before it gets to users. You will never be aware it ever happened. Either R-core will revert this change, or ggplot2 will be asked to send an update to CRAN before this change in R is released.

This is one reason why packages need to be on CRAN not just on GitHub. Not just so they are available to users most easily but so they are under the watchful eye of CRAN daily tests on all platforms.

Now that data.table is used by 320 CRAN and Bioconductor packages, I’m experiencing the same (minor in comparison) frustration that R-core maintainers must have been having for many years: package maintainers not keeping their packages clean of errors and warnings, myself included. No matter how insignificant those errors or warnings might appear. Sometimes, as in my case in 2016, I simply haven’t been able to assign time to start the process of releasing to CRAN. I have worked hard to reduce the time it takes to run the checks not covered by R CMD check and this is happening faster now. One aspect of that script is reverse dependency checks; checking packages which use data.table in some way.

The current status() of data.table reverse dependency checks is as follows, using data.table in development on my laptop. These 320 packages themselves often depend or suggest other packages so my local revdep library has 2,108 packages.

> status()
CRAN:
ERROR : 6 : AFM mlr mtconnectR partools quanteda stremr
WARNING : 2 : ie2miscdata PGRdup
NOTE : 69
OK : 155
TOTAL : 232 / 237
RUNNING : 0
NOT STARTED (first 5 of 5) : finch flippant gasfluxes propr rlas

BIOC:
ERROR : 1 : RTCGA
WARNING : 4 : genomation methylPipe RiboProfiling S4Vectors
NOTE : 68
OK : 9
TOTAL : 82 / 83
RUNNING : 0
NOT STARTED (first 5 of 1) : diffloop

Now that Jan Gorecki has joined H2O he has been able to spend some time to automate and improve this. Currently, the result he gets with a docker script is as follows.

> status()
CRAN:
ERROR : 18 : AFM blscrapeR brainGraph checkmate ie2misc lava mlr mtconnectR OptiQuantR panelaggregation partools pcrsim psidR quanteda simcausal stremr strvalidator xgboost
WARNING : 4 : data.table ie2miscdata msmtools PGRdup
NOTE : 72
OK : 141
TOTAL : 235 / 235
RUNNING : 0

BIOC:
ERROR : 20 : CAGEr dada2 facopy flowWorkspace GenomicTuples ggcyto GOTHiC IONiseR LowMACA methylPipe minfi openCyto pepStat PGA phyloseq pwOmics QUALIFIER RTCGA SNPhood TCGAbiolinks
WARNING : 15 : biobroom bsseq Chicago genomation GenomicInteractions iGC ImmuneSpaceR metaX MSnID MSstats paxtoolsr Pviz r3Cseq RiboProfiling scater
NOTE : 27
OK : 3
TOTAL : 65 / 65
RUNNING : 0

So, our next task is to make Jan’s result on docker match mine. I can’t quite remember how I got all these packages to pass locally for me. In some cases I needed to find and install Ubuntu libraries and I tried my best to keep a note of them at the time here. Another case is that lava suggests mets but mets depends on lava. We currently solve chicken-or-egg situations manually, one-by-one. A third example is that permissions of /tmp seem to be different on docker, which at least one package appears to test and depend on. We have tried changing TEMPDIR from /tmp to ~/tmp to solve that and will wait for the rerun to see if that worked. I won’t be surprised if it takes a week of elapsed time to get our results to match. That’s two man-weeks of on-and-off time as we fix, automate the fix and wait to see if the rerun works. And this is work after data.table has already made it to CRAN; to make next time easier and less of a barrier-to-entry to start.

The point is, all this takes time behind the scenes. I’m sure other package maintainers have similar issues and have come up with various solutions. I’m aware of devtools::revdep_check, used it gratefully for some years and thanked Hadley for it in this tweet. But recently I’ve found it more reliable and simpler to run R CMD check at the command line directly using the unix parallel command. Thank you to R-core and CRAN maintainers for keeping CRAN going in 2016. There must be much that nobody knows about. Thank you to the package maintainers that use data.table and have received my emails and fixed their warnings or errors (users will never know that happened). Sorry I myself didn’t keep data.table cleaner, faster. We’re working to improve that going forward.

What is new in H2O latest release 3.10.2.1 (Tutte) ?

Today we released H2O version 3.10.2.1 (Tutte). It’s available on our Downloads page, and release notes can be found here.

sz42-6-wheels-lightened

Photo Credit: https://en.wikipedia.org/wiki/W._T._Tutte

Top enhancements in this release:

GLM MOJO Support: GLM now supports our smaller, faster, more efficient MOJO (Model ObJect, Optimized) format for model publication and deployment (PUBDEV-3664, PUBDEV-3695).
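
As a quick illustration of what this enables, the sketch below trains a small GLM and exports it as a MOJO from Python (the training file and the “label” column are hypothetical placeholders):

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")          # hypothetical training data
y = "label"                                   # hypothetical binary response column
train[y] = train[y].asfactor()
x = [c for c in train.columns if c != y]

glm = H2OGeneralizedLinearEstimator(family="binomial")
glm.train(x=x, y=y, training_frame=train)

# New in this release: export the fitted GLM as a MOJO (plus the genmodel jar) for deployment
mojo_path = glm.download_mojo(path=".", get_genmodel_jar=True)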

ISAX: We actually introduced ISAX (Indexable Symbolic Aggregate ApproXimation) support a couple of releases back, but this version features more improvements and is worth a look. ISAX allows you to represent complex time series patterns using a symbolic notation, reducing the dimensionality of your data and allowing you to run our ML algos or use the index for searching or data analysis. For more information, check out the blog entry here: Indexing 1 billion time series with H2O and ISAX. (PUBDEV-3367, PUBDEV-3377, PUBDEV-3376)

GLM: Improved feature and parameter descriptions for GLM. Next focus will be on improving documentation for the K-Means algorithm (PUBDEV-3695, PUBDEV-3753, PUBDEV-3791).

Quasibinomial support in GLM:
The quasibinomial family is similar to the binomial family except that, where the binomial family only supports 0/1 values for the target, the quasibinomial family allows two arbitrary values. This feature was requested by advanced users of H2O for applications such as implementing their own advanced estimators. (PUBDEV-3482, PUBDEV-3791)
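
For example, a response column holding two arbitrary numeric values can be modeled directly; a minimal sketch (predictors, the “label” column and train are placeholders):

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# The response column may contain two arbitrary numeric values (e.g. -1/+1)
# rather than the 0/1 required by family="binomial".
qb = H2OGeneralizedLinearEstimator(family="quasibinomial")
qb.train(x=predictors, y="label", training_frame=train)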

GBM/DRF high cardinality accuracy improvements: Fixed a bug in the handling of large categorical features (cardinality > 32) that had been there since the first release of H2O-3. Certain such categorical tree split decisions were incorrect, essentially sending observations down the wrong path at any such split point in the decision tree. The error was systematic and consistent between in-H2O and POJO/MOJO scoring, and led to lower training accuracy (and often, to lower validation accuracy). The handling of unseen categorical levels (in training and testing) was also inconsistent, and unseen levels would go left or right without any reason – now they consistently follow the path of a missing value. Generally, models involving high-cardinality categorical features should have improved accuracy now. This change might require re-tuning of model parameters for best results. In particular, the nbins_cats parameter, which controls the number of separable categorical levels at a given split, has a large impact on the amount of memorization of per-level behavior that is possible: higher values generally (over)fit more.
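
If you do need to re-tune, nbins_cats is set directly on the tree-based estimators; a minimal sketch (the value 256 and the predictors/response/train/valid names are placeholders):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Higher nbins_cats lets a single split separate more categorical levels
# (more memorization, more risk of overfitting); lower values group levels together.
gbm = H2OGradientBoostingEstimator(ntrees=100, nbins_cats=256)
gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)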

Direct Download: http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/1/index.html

For details on each PUBDEV-* item, please follow the release note links at the top of this article.

According to VP of Engineering Bill Gallmeister, this release consists of significant work done by his engineering team. For more information on these features and all the other improvements in H2O version 3.10.2.1, review our documentation.

Happy Holidays from the whole H2O team!!

@avkashchauhan (Avkash Chauhan)

Using Sentiment Analysis to Measure Election Surprise

Sentiment Analysis is a powerful Natural Language Processing technique that can be used to compute and quantify the emotions associated with a body of text. One of the reasons that Sentiment Analysis is so powerful is because its results are easy to interpret and can give you a big-picture metric for your dataset.

One recent event that surprised many people was the November 8th US Presidential election. Hillary Clinton, who ended up losing the race, had been given chances of victory ranging from 71.4% (FiveThirtyEight), to 85% (New York Times), to >99% (Princeton Election Consortium).

prediction_comparisons

Credit: New York Times

To measure the shock of this upset, we decided to examine comments made during the announcements of the election results and see whether (and how) the sentiment changed. The sentiment of a comment is measured by how its words correspond to either a negative or positive connotation. A score of ‘0.00’ means the comment is neutral, while a higher score means that the sentiment is more positive (and a negative score implies the comment is negative).

Our dataset is a .csv of all Reddit comments made from 11/8/2016 to 11/10/2016 (UTC), courtesy of /u/Stuck_In_the_Matrix. All times are in EST, and we’ve annotated the timeline (the height of the bars denotes the number of comments posted during that hour):
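
The post does not specify which sentiment scorer was used; as one way to reproduce this kind of hourly aggregation, here is a sketch using NLTK’s VADER (the file name and the body/created_utc/subreddit column names are assumptions based on typical Reddit comment dumps):

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer   # requires nltk.download("vader_lexicon")

comments = pd.read_csv("reddit_comments_2016-11-08_to_10.csv")   # hypothetical file name
sia = SentimentIntensityAnalyzer()

# VADER's compound score is 0.00 for neutral text, >0 for positive, <0 for negative
comments["sentiment"] = comments["body"].astype(str).map(lambda t: sia.polarity_scores(t)["compound"])

# Bucket comments by hour; created_utc is a Unix timestamp (UTC) in the Reddit dumps
comments["hour"] = pd.to_datetime(comments["created_utc"], unit="s").dt.floor("H")

hourly = (comments[comments["subreddit"] == "hillaryclinton"]
          .groupby("hour")["sentiment"]
          .agg(["count", "mean"]))
print(hourly)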

politics_counts2


We examined five political subreddits to gauge their reactions. Our first target was /r/hillaryclinton, Clinton’s primary support base. The number of comments reached its peak starting at around 9pm EST, but the sentiment gradually fell as news came in that Donald Trump was winning more states than expected.

hillaryclinton_counts

/r/hillaryclinton: Number of Comments per Hour

hillaryclinton_sentiment

/r/hillaryclinton: Mean Sentiment Score per Hour

What is interesting is the low number of comments made after the election was called for Donald Trump. I suspect that it may have been a subreddit-wide pause on comments due to concerns about trolls, but I’m not sure; I contacted the moderators but haven’t received a response back yet.

A few other left-leaning subreddits had interesting results as well. While /r/SandersforPresident was closed for the election season following Bernie’s concession, its successor, /r/Political_Revolution, had not closed and experienced declines in comment sentiment as well.

political_revolution_counts

/r/Political_Revolution: Number of Comments per Hour

political_revolution_sentiment

/r/Political_Revolution: Mean Sentiment Score per Hour


On /r/The_Donald (Donald Trump’s base), the results were the opposite. 

the_donald_counts

/r/The_Donald: Number of Comments per Hour

the_donald_sentiment

/r/The_Donald: Mean Sentiment Score per Hour


There are also a few subreddits that are less candidate- or ideology-specific: /r/politics and /r/PoliticalDiscussion. /r/PoliticalDiscussion didn’t seem to show any shift, but /r/politics did seem to become more muted, at least compared to the previous night.

politicaldiscussion_counts

/r/PoliticalDiscussion: Number of Comments per Hour

politicaldiscussion_sentiment

/r/PoliticalDiscussion: Mean Sentiment Score per Hour

politics_sentiment
/r/politics: Mean Sentiment Score per Hour

To recap,

  1. Reddit political subreddits experienced a sizable increase in activity during the election results
  2. Subreddits differed in their reactions to the news along ideological lines, with pro-Trump subreddits having higher positive sentiment than pro-Clinton subreddits

What could be the next steps for this type of analysis?

  1. Can we use these patterns to classify the readership of the comments sections of newspapers as left- or right-leaning?
  2. Can we apply these time-series sentiment analyses to other events, such as sporting events (which also involve two ‘teams’)?
  3. Can we use sentiment analysis to evaluate the long-term health of communities, such as subreddits dedicated to eventually-losing candidates, like Bernie Sanders?