
Start Off 2017 with Our Stanford Advisors

We were very excited to meet with our advisors (Prof. Stephen Boyd, Prof. Rob Tibshirani and Prof. Trevor Hastie) at H2O.AI on Jan 6, 2017.

Our CEO, Sri Ambati, made two great observations at the start of the meeting:

  • First was the hardware trend where hardware companies like Intel/Nvidia/AMD plan to put the various machine learning algorithms into hardware/GPUs.
  • Second was the data trend where more and more datasets are images/texts/audio instead of the traditional transactional datasets.  To deal with these new datasets, deep learning seems to be the go-to algorithm.  However, while deep learning might work very well, it is very difficult to explain to business or regulatory professionals how and why it works.

There were several techniques to get around this problem and make machine learning solutions interpretable to our customers:

  • Patrick Hall pointed out that monotonicity, not linearity, determines the interpretability of a system.  He cited a credit scoring system built on a constrained neural network: when each input variable was monotonic with respect to the response variable, the system could automatically generate reason codes.
  • One could run both deep learning and simpler algorithms (like GLM, Random Forest, etc.) on the same datasets.  When the performance was similar, we chose the simpler models since they tended to be more interpretable. These meetings were great learning opportunities for us.
  • Another suggestion was to use a layered approach:
    • Use deep learning to extract a small number of features from a high-dimensional dataset.
    • Next, use a simple model on these extracted features to perform specific tasks.
    This layered approach could provide a great speedup as well.  Imagine cases where you could use feature sets for images/text/speech that others have derived on your own datasets: all you would need to do is build your simple model off those feature sets to perform the functions you desire.  In this case, deep learning is the equivalent of PCA for non-linear features.  Prof. Boyd seemed to like GLRM (check out H2O GLRM) as well for feature extraction.
    With this layered approach, there are more system parameters to tune.  Our auto-ML toolbox would be perfect for this!  Go team!
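To make the monotonicity point concrete, here is a minimal sketch of how reason codes can fall out of a monotonic model. It is not the constrained neural network that was cited: it assumes a hypothetical additive scorecard, and the feature names, weights and the 0..1 feature scale are all made up for illustration. Because each feature can only push the score in one direction, "points lost relative to the best possible value" is well defined for every feature, and the largest losses become the reason codes.

```python
# Hypothetical monotonic scorecard: the sign of each weight fixes the
# direction in which that feature moves the score (illustrative values only).
WEIGHTS = {"payment_history": 2.0, "utilization": -1.5, "account_age": 0.5}

# Best achievable value per feature on an assumed 0..1 scale.
BEST = {name: (1.0 if w > 0 else 0.0) for name, w in WEIGHTS.items()}


def score(applicant):
    """Additive score: monotonic in every input by construction."""
    return sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)


def reason_codes(applicant, top_n=2):
    """Rank features by how many points they cost versus their best value."""
    lost = {
        name: WEIGHTS[name] * (BEST[name] - applicant[name])
        for name in WEIGHTS
    }
    ranked = sorted(lost, key=lost.get, reverse=True)
    return [name for name in ranked[:top_n] if lost[name] > 0]


applicant = {"payment_history": 0.9, "utilization": 0.8, "account_age": 0.2}
print(score(applicant), reason_codes(applicant))
```

With a non-monotonic model the same feature could help one applicant and hurt another at the same value, so no such simple ranking would be available.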

Subsequently the conversation turned to visualization of datasets.  Patrick Hall brought up the approach of first using clustering to separate the dataset and then applying simple models to each cluster.  This approach is very similar to the hierarchical mixture of experts algorithm described in our advisors’ book, The Elements of Statistical Learning.  Basically, you build decision trees from your dataset, then fit linear models at the leaf nodes to perform specific tasks.

Our very own Dr. Wilkinson had built a dataset visualization tool that could summarize a big dataset while maintaining the characteristics of the original dataset (like outliers and others). Totally awesome!

Arno Candel brought up the issue of overfitting and how to detect it during the training process rather than at the end of the training process using the held-out set.  Prof. Boyd mentioned that we should check out Bayesian trees/additive models.

Last Words of Wisdom from our esteemed advisors: Deep learning is powerful, but other algorithms like random forest can beat deep learning depending on the dataset.  Deep learning requires big datasets to train.  It works best with datasets that have some kind of organization in them, like spatial features (in images) and temporal trends (in speech/time series).  Random forest, on the other hand, works perfectly well with datasets that have no such organization/features.

What is new in Sparkling Water 2.0.3 Release?

This release has H2O core –

Important Feature:

This architectural change allows Sparkling Water to connect to an existing H2O cluster. The benefit is that we are no longer affected by Spark killing its executors, so we should have a more stable solution in environments with lots of H2O/Spark nodes. We are working on an article on how to use this very important feature in Sparkling Water 2.0.3.

Release notes:

2.0.3 (2017-01-04)

  • Bug
    • SW-152 – ClassNotFound with spark-submit
    • SW-266 – H2OContext shouldn’t be Serializable
    • SW-276 – ClassLoading issue when running code using SparkSubmit
    • SW-281 – Update sparkling water tests so they use correct frame locking
    • SW-283 – Set spark.sql.warehouse.dir explicitly in tests because of SPARK-17810
    • SW-284 – Fix CraigsListJobTitlesApp to use local file instead of trying to get one from hdfs
    • SW-285 – Disable timeline service also in python integration tests
    • SW-286 – Add missing test in pysparkling for conversion RDD[Double] -> H2OFrame
    • SW-287 – Fix bug in SparkDataFrame converter where key wasn’t random if not specified
    • SW-288 – Improve performance of Dataset tests and call super.afterAll
    • SW-289 – Fix PySparkling numeric handling during conversions
    • SW-290 – Fixes and improvements of task used to extended h2o jars by sparkling-water classes
    • SW-292 – Fix ScalaCodeHandlerTestSuite
  • New Feature
    • SW-178 – Allow external h2o cluster to act as h2o backend in Sparkling Water
  • Improvement
    • SW-282 – Integrate SW with H2O ( Support for external cluster )
    • SW-291 – Use absolute value for random number in sparkling-water in internal backend
    • SW-295 – H2OConf should be parameterized by SparkConf and not by SparkContext

Please visit to learn more about it, provide feedback and ask for assistance as needed.

@avkashchauhan | @h2oai

Behind the scenes of CRAN

(Just from my point of view as a package maintainer.)

New users of R might not appreciate the full benefit of CRAN and new package maintainers may not appreciate the importance of keeping their packages updated and free of warnings and errors. This is something I only came to realize myself in the last few years so I thought I would write about it, by way of today’s example.

Since data.table was updated on CRAN on 3rd December, it has been passing all-OK. But today I noticed a new warning (converted to error by data.table) on one CRAN machine. This is displayed under the CRAN checks link.


Sometimes errors happen for mundane reasons. For example, one of the Windows machines was in error recently because somehow the install became locked. That either fixed itself or someone spent time fixing it (if so, thank you). Today’s issue appears to be different.

I can either look lower down on that page or click the link to see the message.

Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.

I’ve never seen this message before. But given it mentions something about something being deprecated and the flavor name is r-devel it looks likely that it has just been added to R itself in development. On a daily basis all CRAN packages are retested with the very latest commits to R that day. I did a quick scan of the latest commit messages to R but I couldn’t see anything relating to this apparently new warning. Some of the commit messages are long with detail right there in the message. Others are short and use reference numbers that require you to hop to that reference such as “port r71828 from trunk” which prevent fast scanning at times like this. There is more hunting I could do, but for now, let’s see if I get lucky.

The last line of data.table’s test suite output has been refined over the years for my future self, and on the same CRAN page without me needing to go anywhere else, it is showing me today:

5 errors out of 5940 (lastID=1748.4, endian==little, sizeof(long double)==16, sizeof(pointer)==8) in inst/tests/tests.Rraw on Tue Dec 27 18:09:48 2016. Search tests.Rraw for test numbers: 167, 167.1, 167.2, 168, 168.1.

The date and time is included to double check to myself that it really did run on CRAN’s machine recently and I’m not seeing an old stale result that would disappear when simply rerun with latest commits/fixes.
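The summary line is regular enough to be read by a script as well as by a person. The following is a minimal Python sketch, not part of data.table or CRAN, that pulls the error count and the failing test numbers out of the line quoted above; the function names are hypothetical.

```python
import re

summary = (
    "5 errors out of 5940 (lastID=1748.4, endian==little, "
    "sizeof(long double)==16, sizeof(pointer)==8) in inst/tests/tests.Rraw "
    "on Tue Dec 27 18:09:48 2016. Search tests.Rraw for test numbers: "
    "167, 167.1, 167.2, 168, 168.1."
)


def error_counts(line):
    """Return (errors, total) from the start of the summary line."""
    match = re.match(r"(\d+) errors out of (\d+)", line)
    if match is None:
        raise ValueError("not a test summary line")
    return int(match.group(1)), int(match.group(2))


def failing_tests(line):
    """Return the test numbers listed after 'test numbers:' as strings."""
    _, _, tail = line.partition("test numbers:")
    parts = tail.strip().rstrip(".").split(",")
    return [p.strip() for p in parts if p.strip()]


print(error_counts(summary))
# Strings to search for in tests.Rraw, e.g. "test(167"
print(["test(" + number for number in failing_tests(summary)])
```

The test numbers are kept as strings so that "167.1" is searched for literally rather than being treated as a float.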

Next I do what I told myself and open tests.Rraw file in my editor and search for "test(167". Immediately, I see this test is within a block of code starting with :

if ("package:ggplot2" %in% search()) {
test(167, ...

In data.table we test compatibility of data.table with a bunch of popular packages. These packages are listed in the Suggests field of DESCRIPTION.

Suggests: bit64, knitr, chron, ggplot2 (≥ 0.9.0), plyr, reshape, reshape2, testthat (≥ 0.4), hexbin, fastmatch, nlme, xts, gdata, GenomicRanges, caret, curl, zoo, plm, rmarkdown, parallel

We do not necessarily suggest these packages in the English sense of the verb; i.e., ‘to recommend’. Rather, perhaps a better name for that field would be Optional in the sense that you need those packages installed if you wish to run all tests, documentation and in data.table’s case, compatibility.

Anyway, now that I know that the failing tests are testing compatibility with ggplot2, I’ll click over to ggplot2 and look at its status. I’m hoping I’ll get lucky and ggplot2 is in error too.


Indeed it is. And I can see the same message on ggplot2’s CRAN checks page.

Calling 'structure(NULL, *)' is deprecated, as NULL cannot have attributes.

It’s my lucky day. ggplot2 is in error too with the same message. This time, thankfully, the new warning is therefore nothing to do with data.table per se. I got to this point in under 30 seconds! No typing was required to run anything at all. It was all done just by clicking within CRAN’s pages and searching a file. My task is done and I can move on. Thanks to CRAN and the people that run it.

What if data.table or ggplot2 were already in error or warning before R-core made their change? R-core members wouldn’t have seen any status change. If they see no status change for any of the 9,787 CRAN packages then they don’t know for sure it’s ok. All they know is their change didn’t affect any of the passing packages but they can’t be sure about the packages which are already in error or warning for an unrelated reason. I get more requests from R-core and CRAN maintainers to update data.table than from users of data.table. I’m sorry that I could not find time to update data.table earlier in 2016 than I did (data.table was showing an error for many months).

Regarding today’s warning, it has been caught before it gets to users. You will never be aware it ever happened. Either R-core will revert this change, or ggplot2 will be asked to send an update to CRAN before this change in R is released.

This is one reason why packages need to be on CRAN not just on GitHub. Not just so they are available to users most easily but so they are under the watchful eye of CRAN daily tests on all platforms.

Now that data.table is used by 320 CRAN and Bioconductor packages, I’m experiencing the same (minor in comparison) frustration that R-core maintainers must have been having for many years: package maintainers not keeping their packages clean of errors and warnings, myself included. No matter how insignificant those errors or warnings might appear. Sometimes, as in my case in 2016, I simply haven’t been able to assign time to start the process of releasing to CRAN. I have worked hard to reduce the time it takes to run the checks not covered by R CMD check and this is happening faster now. One aspect of that script is reverse dependency checks; checking packages which use data.table in some way.

The current status() of data.table reverse dependency checks is as follows, using data.table in development on my laptop. These 320 packages themselves often depend on or suggest other packages, so my local revdep library has 2,108 packages.

> status()
ERROR : 6 : AFM mlr mtconnectR partools quanteda stremr
WARNING : 2 : ie2miscdata PGRdup
NOTE : 69
OK : 155
TOTAL : 232 / 237
NOT STARTED (first 5 of 5) : finch flippant gasfluxes propr rlas

WARNING : 4 : genomation methylPipe RiboProfiling S4Vectors
NOTE : 68
OK : 9
TOTAL : 82 / 83
NOT STARTED (first 5 of 1) : diffloop

Now that Jan Gorecki has joined H2O he has been able to spend some time to automate and improve this. Currently, the result he gets with a docker script is as follows.

> status()
ERROR : 18 : AFM blscrapeR brainGraph checkmate ie2misc lava mlr mtconnectR OptiQuantR panelaggregation partools pcrsim psidR quanteda simcausal stremr strvalidator xgboost
WARNING : 4 : data.table ie2miscdata msmtools PGRdup
NOTE : 72
OK : 141
TOTAL : 235 / 235

ERROR : 20 : CAGEr dada2 facopy flowWorkspace GenomicTuples ggcyto GOTHiC IONiseR LowMACA methylPipe minfi openCyto pepStat PGA phyloseq pwOmics QUALIFIER RTCGA SNPhood TCGAbiolinks
WARNING : 15 : biobroom bsseq Chicago genomation GenomicInteractions iGC ImmuneSpaceR metaX MSnID MSstats paxtoolsr Pviz r3Cseq RiboProfiling scater
NOTE : 27
OK : 3
TOTAL : 65 / 65

So, our next task is to make Jan’s result on docker match mine. I can’t quite remember how I got all these packages to pass locally for me. In some cases I needed to find and install Ubuntu libraries and I tried my best to keep a note of them at the time here. Another case is that lava suggests mets but mets depends on lava. We currently solve chicken-or-egg situations manually, one-by-one. A third example is that permissions of /tmp seem to be different on docker, which at least one package appears to test and depend on. We have tried changing TEMPDIR from /tmp to ~/tmp to solve that and will wait for the rerun to see if that worked. I won’t be surprised if it takes a week of elapsed time to get our results to match. That’s two man weeks of on-and-off time as we fix, automate-the-fix and wait to see if the rerun works. And this is work after data.table has already made it to CRAN; to make next time easier and less of a barrier-to-entry to start.
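One way to see how far apart the two runs are is to diff the ERROR lines. The sketch below is plain Python, not part of our revdep script; it parses the first ERROR line from each status() output above and lists the packages that fail on docker but not on my laptop. The helper name is hypothetical.

```python
def parse_status_line(line):
    """Split 'LABEL : N : pkg pkg ...' into (label, count, [pkgs])."""
    label, count, names = (part.strip() for part in line.split(" : ", 2))
    packages = names.split()
    if int(count) != len(packages):
        raise ValueError("count does not match the number of packages listed")
    return label, int(count), packages


laptop = "ERROR : 6 : AFM mlr mtconnectR partools quanteda stremr"
docker = (
    "ERROR : 18 : AFM blscrapeR brainGraph checkmate ie2misc lava mlr "
    "mtconnectR OptiQuantR panelaggregation partools pcrsim psidR quanteda "
    "simcausal stremr strvalidator xgboost"
)

_, _, laptop_errors = parse_status_line(laptop)
_, _, docker_errors = parse_status_line(docker)

# Packages whose failure is specific to one environment.
only_docker = sorted(set(docker_errors) - set(laptop_errors), key=str.lower)
only_laptop = sorted(set(laptop_errors) - set(docker_errors), key=str.lower)

print(len(only_docker), only_docker)
print(only_laptop)
```

Packages that fail in both places are likely genuine reverse dependency problems; the docker-only ones are the candidates for missing system libraries, circular suggests, or /tmp permissions.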

The point is, all this takes time behind the scenes. I’m sure other package maintainers have similar issues and have come up with various solutions. I’m aware of devtools::revdep_check, used it gratefully for some years and thanked Hadley for it in this tweet. But recently I’ve found it more reliable and simpler to run R CMD check at the command line directly using the unix parallel command. Thank you to R-core and CRAN maintainers for keeping CRAN going in 2016. There must be much that nobody knows about. Thank you to the package maintainers that use data.table and have received my emails and fixed their warnings or errors (users will never know that happened). Sorry I myself didn’t keep data.table cleaner, faster. We’re working to improve that going forward.

What is new in H2O latest release (Tutte) ?

Today we released H2O version (Tutte). It’s available on our Downloads page, and release notes can be found here.



Top enhancements in this release:

GLM MOJO Support: GLM now supports our smaller, faster, more efficient MOJO (Model ObJect, Optimized) format for model publication and deployment (PUBDEV-3664, PUBDEV-3695).

ISAX: We actually introduced ISAX (Indexable Symbolic Aggregate ApproXimation) support a couple of releases back, but this version features more improvements and is worth a look. ISAX allows you to represent complex time series patterns using a symbolic notation, reducing the dimensionality of your data and allowing you to run our ML algos or use the index for searching or data analysis. For more information, check out the blog entry here: Indexing 1 billion time series with H2O and ISAX. (PUBDEV-3367, PUBDEV-3377, PUBDEV-3376)

GLM: Improved feature and parameter descriptions for GLM. Next focus will be on improving documentation for the K-Means algorithm (PUBDEV-3695, PUBDEV-3753, PUBDEV-3791).

Quasibinomial support in GLM:
the quasibinomial family is similar to the binomial family except that, where the binomial models only support 0/1 for the values of a target, the quasibinomial family allows for two arbitrary values. This feature was requested by advanced users of H2O for applications such as implementing their own advanced estimators. (PUBDEV-3482, PUBDEV-3791)

GBM/DRF high cardinality accuracy improvements: Fixed a bug in the handling of large categorical features (cardinality > 32) that had been present since the first release of H2O-3. Certain categorical tree split decisions were incorrect, essentially sending observations down the wrong path at any such split point in the decision tree. The error was systematic and consistent between in-H2O and POJO/MOJO, and led to lower training accuracy (and often to lower validation accuracy). The handling of unseen categorical levels (in training and testing) was also inconsistent: unseen levels would go left or right without any reason. Now they consistently follow the path of a missing value. Generally, models involving high-cardinality categorical features should have improved accuracy now. This change might require re-tuning of model parameters for best results, in particular the nbins_cats parameter, which controls the number of separable categorical levels at a given split and has a large impact on the amount of memorization of per-level behavior that is possible: higher values generally (over)fit more.
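The rule for unseen levels can be pictured with a small routing function. This is only an illustrative Python sketch of the behavior described above, not H2O's implementation; the function and argument names are hypothetical.

```python
def go_left(level, left_levels, seen_levels, na_goes_left):
    """Route one categorical value at a tree split.

    Missing values (None) and levels that were not seen when the split
    was built both follow the split's missing-value direction.
    """
    if level is None or level not in seen_levels:
        return na_goes_left
    return level in left_levels


seen = {"red", "green", "blue"}
left = {"red", "blue"}

print(go_left("red", left, seen, na_goes_left=False))     # seen, in left set
print(go_left("green", left, seen, na_goes_left=False))   # seen, not in left set
print(go_left("purple", left, seen, na_goes_left=False))  # unseen: same as None
print(go_left(None, left, seen, na_goes_left=False))      # missing value
```

The point is consistency: an unseen level always takes the same branch as a missing value at that split, in training and in scoring.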

Direct Download:

For details on each PUBDEV-* item, please see the release notes linked at the top of this article.

According to VP of Engineering Bill Gallmeister, this release consists of significant work done by his engineering team. For more information on these features and all the other improvements in H2O version, review our documentation.

Happy Holidays from the whole H2O team!!

@avkashchauhan (Avkash Chauhan)

Using Sentiment Analysis to Measure Election Surprise

Sentiment Analysis is a powerful Natural Language Processing technique that can be used to compute and quantify the emotions associated with a body of text. One of the reasons that Sentiment Analysis is so powerful is because its results are easy to interpret and can give you a big-picture metric for your dataset.

One recent event that surprised many people was the November 8th US Presidential election. Hillary Clinton, who ended up losing the race, had been given chances of victory ranging from 71.4% (FiveThirtyEight), to 85% (New York Times), to >99% (Princeton Election Consortium).


Credit: New York Times

To measure the shock of this upset, we decided to examine comments made during the announcements of the election results and see how (if) the sentiment changed. The sentiment of a comment is measured by how its words correspond to either a negative or positive connotation. A score of ‘0.00’ means the comment is neutral, while a higher score means that the sentiment is more positive (and a negative score implies the comment is negative).
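As a rough illustration of this kind of scoring, here is a minimal lexicon-based sketch in Python. It is not the scorer used for the charts in this post: the tiny word list and its weights are hypothetical, and a real analysis would use a full sentiment lexicon or library.

```python
import re

# Hypothetical mini-lexicon: positive words > 0, negative words < 0.
LEXICON = {
    "great": 1.0, "happy": 0.8, "win": 0.5,
    "lose": -0.5, "sad": -0.8, "terrible": -1.0,
}


def sentiment(comment):
    """Mean lexicon weight over all words; 0.0 means neutral."""
    words = re.findall(r"[a-z']+", comment.lower())
    if not words:
        return 0.0
    return sum(LEXICON.get(word, 0.0) for word in words) / len(words)


for text in ["What a great win", "The polls close at eight", "A terrible, sad night"]:
    print(round(sentiment(text), 3), text)
```

Words that are not in the lexicon count as neutral, so long comments with only a few emotional words end up close to zero.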

Our dataset is a .csv of all Reddit comments made from 11/8/2016 to 11/10/2016 (UTC) and is courtesy of /u/Stuck_In_the_Matrix. All times shown are in EST, and we’ve annotated the timeline (the height of the bars denotes the number of comments posted during that hour):


We examined five political subreddits to gauge their reactions. Our first target was /r/hillaryclinton, Clinton’s primary support base. The number of comments reached a high starting at around 9pm EST, but the sentiment gradually fell as news came in that Donald Trump was winning more states than expected.


/r/hillaryclinton: Number of Comments per Hour


/r/hillaryclinton: Mean Sentiment Score per Hour
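The two per-hour series shown for each subreddit (comment count and mean sentiment) can be built with a simple group-by. The sketch below uses only the Python standard library and assumes each comment has already been reduced to a UTC epoch timestamp and a sentiment score; the sample rows are made up. A fixed UTC-5 offset is used because the dates fall after the switch back to standard time.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))


def hourly_stats(rows):
    """rows: iterable of (epoch_seconds_utc, score) -> {hour: (count, mean)}."""
    buckets = defaultdict(list)
    for created_utc, score in rows:
        local = datetime.fromtimestamp(created_utc, tz=timezone.utc).astimezone(EST)
        buckets[local.strftime("%Y-%m-%d %H:00")].append(score)
    return {
        hour: (len(scores), sum(scores) / len(scores))
        for hour, scores in sorted(buckets.items())
    }


# Made-up comments starting at 02:00 UTC on Nov 9, i.e. 9pm EST on Nov 8.
t0 = datetime(2016, 11, 9, 2, 0, tzinfo=timezone.utc).timestamp()
rows = [(t0, 0.2), (t0 + 600, 0.0), (t0 + 3600, -0.1)]

for hour, (count, mean_score) in hourly_stats(rows).items():
    print(hour, count, round(mean_score, 3))
```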

What is interesting is the low number of comments made after the election was called for Donald Trump. I suspect that it may have been a subreddit-wide pause on comments due to concerns about trolls, but I’m not sure; I contacted the moderators but haven’t received a response back yet.

A few other left-leaning subreddits had interesting results as well. While /r/SandersforPresident was closed for the election season post-Bernie concession, its successor, /r/Political_Revolution, had not closed and experienced declines in comment sentiment as well.


/r/Political_Revolution: Number of Comments per Hour


/r/Political_Revolution: Mean Sentiment Score per Hour

On /r/The_Donald (Donald Trump’s base), the results were the opposite. 


/r/The_Donald: Number of Comments per Hour


/r/The_Donald: Mean Sentiment Score per Hour

There are also a few subreddits that are less candidate- or ideology-specific: /r/politics and /r/PoliticalDiscussion. /r/PoliticalDiscussion didn’t seem to show any shift, but /r/politics did seem to become more muted, at least compared to the previous night.


/r/PoliticalDiscussion: Number of Comments per Hour


/r/PoliticalDiscussion: Mean Sentiment Score per Hour

/r/politics: Mean Sentiment Score per Hour

To recap,

  1. Reddit political subreddits experienced a sizable increase in activity during the election results
  2. Subreddits differed in their reactions to the news along ideological lines, with pro-Trump subreddits having higher positive sentiment than pro-Clinton subreddits

What could be the next steps for this type of analysis?

  1. Can we use these patterns to classify the readership of the comments sections of newspapers as left- or right-leaning?
  2. Can we apply these time-series sentiment analyses to other events, such as sporting events (which also involve two ‘teams’)?
  3. Can we use sentiment analysis to evaluate the long-term health of communities, such as subreddits dedicated to eventually-losing candidates, like Bernie Sanders?

Indexing 1 Billion Time Series with H2O and ISax

At H2O, we have recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation, which means it can represent complex time series patterns using a symbolic notation, thereby reducing the dimensionality of your data. From there you can run H2O’s ML algos or use the index for searching or data analysis. ISax has many uses in a variety of fields including finance, biology and cybersecurity.

Today in this blog we will use H2O to create an ISax index for analytical purposes. We will generate 1 Billion time series of 256 steps on an integer U(-100,100) distribution. Once we have the index we’ll show how you can search for similar patterns using the index.

We’ll show you the steps and you can run along, assuming you have enough hardware and patience. In this example we are using a 9-machine cluster, each machine with 32 cores and 256GB RAM. We’ll create a 1B row synthetic data set and form random walks for more interesting time series patterns. We’ll run ISax and perform the search; the whole process takes ~30 minutes on our cluster.

Raw H2O Frame Creation
In the typical use case, H2O users would be importing time series data from disk. H2O can read from local filesystems, NFS, or distributed systems like Hadoop. H2O cluster file reads are parallelized across the nodes for speed. In our case we’ll be generating a 256 column, 1B row frame. By the way, H2O Dataframes scale better with more rows than with more columns. Each row will be an individual time series. The ISax algo assumes the time series data is row based.

rawdf = h2o.create_frame(cols=256, rows=1000000000, real_fraction=0.0, integer_fraction=1.0, missing_fraction=0.0)


Random Walk
Here we do a row wise cumulative sum to simulate random walks. The .head call triggers the execution graph so we can do a time measurement.

tsdf = rawdf.cumsum(axis=1)
print(tsdf.head())
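For readers without a cluster at hand, the same idea can be tried on a single series in plain Python. This is a small stand-alone sketch, not H2O code: it draws 256 integer steps from U(-100,100) and takes the running sum, which is what the row-wise cumsum does for every row of the frame. The function name and seed are arbitrary.

```python
import random
from itertools import accumulate


def random_walk(steps, low=-100, high=100, seed=None):
    """Return (increments, walk) where walk is the running sum of increments."""
    rng = random.Random(seed)
    increments = [rng.randint(low, high) for _ in range(steps)]
    return increments, list(accumulate(increments))


increments, walk = random_walk(256, seed=42)
print(walk[:5])
```

Each point of the walk is the previous point plus a new random step, so neighbouring values are correlated and the series shows trends, unlike the raw uniform noise.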


Let’s take a quick peek at our time series.



Run ISax
Now we’re ready to run isax and generate the index. The output of this command is another H2O Frame that contains the string representation of the isax word, along with the numeric columns in case you want to run ML algos.
res = tsdf.isax(num_words=20,max_cardinality=10)


This takes 10 minutes, and H2O’s MapReduce framework makes efficient use of all 288 CPU cores.
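To give a feel for what a symbolic word is, here is a minimal sketch of plain SAX on one series in pure Python. It is not H2O's implementation and it leaves out the indexable, multi-resolution part of ISax: it z-normalizes the series, averages it into num_words equal segments (piecewise aggregate approximation), and maps each segment mean to one of cardinality symbols using equiprobable breakpoints of the standard normal distribution. It assumes the series length is divisible by num_words.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, num_words, cardinality):
    """Return one integer symbol (0..cardinality-1) per segment."""
    if len(series) % num_words != 0:
        raise ValueError("series length must be divisible by num_words")
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # Piecewise aggregate approximation: mean of each equal-length segment.
    size = len(z) // num_words
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(num_words)]
    # Breakpoints that cut N(0, 1) into `cardinality` equiprobable regions.
    cuts = [NormalDist().inv_cdf(k / cardinality) for k in range(1, cardinality)]
    return [bisect_right(cuts, value) for value in paa]


rising = list(range(16))
print(sax_word(rising, num_words=4, cardinality=4))
print(sax_word(rising[::-1], num_words=4, cardinality=4))
```

Two series that map to the same word have a similar overall shape, which is what makes the word usable as a search key.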


Now that we have the index done, let’s search for similar time series patterns in our 1B time series data set. Let’s make indexes on the isax result frame and the original time series frame.

res["idx"] = 1
res["idx"] = res["idx"].cumsum(axis=0)
tsdf["idx"] = 1
tsdf["idx"] = tsdf["idx"].cumsum(axis=0)

I’m going to pick the second time series that we plotted (the green “C2” time series).

myidx = res[res["iSax_index"]=="5^20_5^20_7^20_9^20_9^20_9^20_9^20_9^20_8^20_6^20

There are 4342 other time series with the same index in the 1B time series dataframe. Let’s just plot the first 10 and see how similar they look.

mylist = myidx.as_data_frame(use_pandas=True)["idx"][0:10].tolist()
mydf = tsdf[tsdf["idx"].isin(mylist)].as_data_frame(use_pandas=True)


The successful implementation of a fast in memory ISax algo can be attributed to the H2O platform having a highly efficient, easy to code, open source MapReduce framework, and the Rapids api that can deploy your distributed algos to Python or R. In my next blog, I will show how to get started with writing your own MapReduce functions in H2O on structured data by using ISax as an example.


Why We Bought A Happy Diwali Billboard


It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light: the perfect antidote for some of the more negative developments coming out of Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans have made in technology, especially South Asian Americans. Google (Sundar Pichai) and Microsoft (Satya Nadella), two major forces in the world of AI, are led by Indian Americans. They join other leaders across the technology ecosystem that we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLP

The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?

The Solution: Yes! We’ll divide this tutorial into three parts, the first on how to gather the necessary data, the second on data exploration, munging, & feature engineering, and the third on building our model itself. You can find all of our code on GitHub.

Part One: Collecting the Data
Note: We are going to be using Python. For the R version of this process, the concepts translate, and we have some code on GitHub that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository:

We used the Twitter API to collect tweets by both presidential candidates, which would become our dataset. Twitter only lets you access the latest ~3000 or so tweets from a particular handle, even though they keep all the Tweets in their own databases. 

The first step is to create an app on Twitter. After completing the form you can access your app, and your keys and tokens. Specifically we’re looking for four things: the client key and secret (called consumer key and consumer secret) and the resource owner key and secret (called access token and access token secret).

We save this information in JSON format in a separate file.
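Keeping the keys in their own file means they stay out of the notebook and out of version control. Here is a minimal sketch of that pattern using only the standard library; the file name and the four field names are assumptions for illustration and may differ from what the notebook uses. The demo writes placeholder values to a temporary directory so it can run anywhere.

```python
import json
import os
import tempfile

# Assumed field names, mirroring the four values described above.
REQUIRED = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def load_credentials(path):
    """Read the JSON credentials file and check all four fields exist."""
    with open(path) as handle:
        creds = json.load(handle)
    missing = [key for key in REQUIRED if key not in creds]
    if missing:
        raise KeyError("missing credential fields: " + ", ".join(missing))
    return creds


# Demo only: write placeholders to a temp file, then load them back.
path = os.path.join(tempfile.mkdtemp(), "twitter_credentials.json")
with open(path, "w") as handle:
    json.dump({key: "PLACEHOLDER" for key in REQUIRED}, handle, indent=2)

creds = load_credentials(path)
print(sorted(creds))
```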

Then, we can use the Python libraries Requests and Pandas to gather the tweets into a DataFrame. We only really care about three things: the author of the Tweet (Donald Trump or Hillary Clinton), the text of the Tweet, and the unique identifier of the Tweet, but we can take in as much other data as we want (for the sake of data exploration, we also included the timestamp of each Tweet).

Once we have all this information, we can output it to a .csv file for further analysis and exploration. 

Part Two: Data Cleaning and Munging
You can find the notebook for this part as “NLPAnalysis.ipynb” in our GitHub repository:

To fully take advantage of machine learning, we need to add features to this dataset. For example, we might want to take into account the punctuation that each Twitter account uses, thinking that it might be important in helping us discriminate between Trump and Clinton. If we take the amount of punctuation symbols in each Tweet, and take the average across all Tweets, we get the following graph:


Or perhaps we care about how many hashtags or mentions each account uses:
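Features like these can be computed directly from the tweet text. The following is a small stand-alone sketch rather than the code from our notebook; the sample tweets are made up and the simple regular expressions are an assumption about what counts as a hashtag or mention.

```python
import re
import string


def tweet_features(text):
    """Count punctuation symbols, hashtags and mentions in one tweet."""
    return {
        "punctuation": sum(ch in string.punctuation for ch in text),
        "hashtags": len(re.findall(r"#\w+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
    }


def average_feature(tweets, name):
    """Mean of one feature across a list of tweets."""
    return sum(tweet_features(t)[name] for t in tweets) / len(tweets)


tweets = ["Thanks @h2oai! #MachineLearning #AI", "Big day today"]
print(tweet_features(tweets[0]))
print(average_feature(tweets, "punctuation"))
```

Note that "#" and "@" are themselves punctuation characters, so they are included in the punctuation count.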


With our timestamp data, we can examine Tweets by their Retweet count, over time:

The tall blue skyscraper was Clinton’s “Delete Your Account” Tweet


The same graph, on a logarithmic scale

We can also compare the distribution of Tweets over time. We can see that Clinton tweets more frequently than Trump (this is also evidenced by us being able to access older Tweets from Trump, since there’s a hard limit on the number of Tweets we can access).

The Democratic National Convention was in session from July 25th to the 28th

We can construct heatmaps of when these candidates were posting:

Heatmap of Trump Tweets, by day and hour

All this light analysis was useful for intuition, but our real goal is to only use the text of the tweet (including derived features) for our classification. If we included features like the time-stamp, it would become a lot easier.

We can utilize a process called tokenization, which lets us create features from the words in our text. To understand why this is useful, let’s pretend to only care about the mentions (for example, @h2oai) in each tweet. We would expect that Donald Trump would mention certain people (@GovPenceIN) more than others and certainly different people than Hillary Clinton. Of course, there might be people both parties tweet at (maybe @POTUS). These patterns could be useful in classifying Tweets. 

Now, we can apply that same line of thinking to words. To make sure that we are only including valuable words, we can exclude stop-words which are filler words, such as ‘and’ or ‘the.’ We can also use a metric called term frequency – inverse document frequency (TF-IDF) that computes how important a word is to a document. 
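To show what TF-IDF computes, here is a minimal pure-Python sketch using the textbook form tf × log(N / df). It is for intuition only: the stop-word list and sample documents are made up, and library implementations such as Scikit-Learn's use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"and", "the", "a", "to", "of", "is"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stop-words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["make america great", "stronger together and great"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A word that appears in every document ("great" here) gets weight zero, while a word unique to one document gets a positive weight, which is what makes it useful for telling the authors apart.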

There are also other ways to use and combine NLP. One approach might be sentiment analysis, where we interpret a tweet to be positive or negative. David Robinson did this to show that Trump’s personal tweets are angrier, as opposed to those written by his staff.

Another approach might be to create word trees that represent sentence structure. Once each tweet has been represented in this format you can examine metrics such as tree length or number of nodes, which are measures of the complexity of a sentence. Maybe Trump tweets a lot of clauses, as opposed to full sentences.

Part Three: Building, Training, and Testing the Model
You can find the notebooks for this part as “Python-TF-IDF.ipynb” and “TweetsNLP.flow” in our GitHub repository:

There were a lot of approaches to take but we decided to keep it simple for now by only using TF-IDF vectorization. The actual code writing was relatively simple thanks to the excellent Scikit-Learn package alongside NLTK. 

We could have also done some further cleaning of the data, such as excluding urls from our Tweets text (right now, strings such as “zy7vpfrsdz” get their own feature column as the NLTK vectorizer treats them as words). Our not doing this won’t affect our model as the urls are unique, but it might save on space and time. Another strategy could be to stem words, treating words as their root (so ‘hearing’ and ‘heard’ would both be coded as ‘hear’).

Still, our model (created using H2O Flow) produces quite a good result without those improvements. We can use a variety of metrics to confirm this, including the Area Under the Curve (AUC). The AUC is the area under the ROC curve, which plots the True Positive Rate (tpr) against the False Positive Rate (fpr). A score of 0.5 means that the model is equivalent to flipping a coin, and a score of 1 means that the model separates the two classes perfectly.
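One way to build intuition for the AUC is its pairwise interpretation: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way in plain Python. It is not how H2O computes the metric, and the labels and scores are made up.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # every positive outranks every negative
print(auc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1]))  # one of four pairs is misordered
print(auc([1, 0], [0.5, 0.5]))                  # identical scores: coin flip
```

This brute-force version compares every pair, so it is only suitable for small examples.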

The model curve is blue, while the red curve represents 50–50 guessing

For a more intuitive judgement of our model we can look at the variable importances of our model (what the model considers to be good discriminators of the data) and see if they make sense:

Can you guess which words (variables) correspond (are important) to which candidate?

Maybe the next step could be to build an app that will take in text and output if the text is more likely to have come from Clinton or Trump. Perhaps we can even consider the Tweets of several politicians, assign them a ‘liberal/conservative’ score, and then build a model to predict if a Tweet is more conservative or more liberal (important features would maybe include “Benghazi” or “climate change”). Another cool application might be a deep learning model, in the footsteps of @DeepDrumpf.

If this inspired you to create analysis or build models, please let us know! We might want to highlight your project 🎉📈.

sparklyr: R interface for Apache Spark

This post is reposted from RStudio’s announcement of sparklyr, RStudio’s extension for Spark.


  • Connect to Spark from R. The sparklyr package provides a complete dplyr backend.
  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide interfaces to Spark packages.


You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr")

You should also install a local version of Spark for development purposes:

library(sparklyr)
spark_install(version = "1.6.2")

To upgrade to the latest version of sparklyr, run the following command and restart your R session:

devtools::install_github("rstudio/sparklyr")

If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).

Connecting to Spark

You can connect to both local instances of Spark as well as remote Spark clusters. Here we’ll connect to a local instance of Spark via the spark_connect function:

sc <- spark_connect(master = "local")

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.

For more information on connecting to remote Spark clusters see the Deployment section of the sparklyr website.

Using dplyr

We can now use all of the available dplyr verbs against the tables within the cluster.

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
# list the tables now available in the Spark cluster
src_tbls(sc)
## [1] "batting" "flights" "iris"

To start with here’s a simple filtering example:

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## Source:   query [?? x 19]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1   2013     1     1      517            515         2      830
## 2   2013     1     1      542            540         2      923
## 3   2013     1     1      702            700         2     1058
## 4   2013     1     1      715            713         2      911
## 5   2013     1     1      752            750         2     1025
## 6   2013     1     1      917            915         2     1206
## 7   2013     1     1      932            930         2     1219
## 8   2013     1     1     1028           1026         2     1350
## 9   2013     1     1     1042           1040         2     1325
## 10  2013     1     1     1231           1229         2     1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dbl>

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)


Window Functions

dplyr window functions are also supported, for example:

batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
## Source:   query [?? x 7]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
## Groups: playerID
##     playerID yearID teamID     G    AB     R     H
##        <chr>  <int>  <chr> <int> <int> <int> <int>
## 1  abbotpa01   2000    SEA    35     5     1     2
## 2  abbotpa01   2004    PHI    10    11     1     2
## 3  abnersh01   1992    CHA    97   208    21    58
## 4  abnersh01   1990    SDN    91   184    17    45
## 5  abreujo02   2014    CHA   145   556    80   176
## 6  acevejo01   2001    CIN    18    34     1     4
## 7  acevejo01   2004    CIN    39    43     0     2
## 8  adamsbe01   1919    PHI    78   232    14    54
## 9  adamsbe01   1918    PHI    84   227    10    40
## 10 adamsbu01   1945    SLN   140   578    98   169
## # ... with more rows

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

Using SQL

It’s also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:

library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Machine Learning

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset, and see if we can predict a car’s fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.

# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit
## Call: ml_linear_regression(., response = "mpg", features = c("wt", "cyl"))
## Coefficients:
## (Intercept)          wt         cyl 
##   37.066699   -2.309504   -1.639546

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.

summary(fit)
## Call: ml_linear_regression(., response = "mpg", features = c("wt", "cyl"))
## Deviance Residuals::
##     Min      1Q  Median      3Q     Max 
## -2.6881 -1.0507 -0.4420  0.4757  3.3858 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 37.06670    2.76494 13.4059 2.981e-07 ***
## wt          -2.30950    0.84748 -2.7252   0.02341 *  
## cyl         -1.63955    0.58635 -2.7962   0.02084 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## R-Squared: 0.8665
## Root Mean Squared Error: 1.799

Spark machine learning supports a wide array of algorithms and feature transformations and, as illustrated above, it’s easy to chain these functions together with dplyr pipelines. To learn more see the machine learning section.

Reading and Writing Data

You can read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem of cluster nodes.

temp_csv <- tempfile(fileext = ".csv")
temp_parquet <- tempfile(fileext = ".parquet")
temp_json <- tempfile(fileext = ".json")

spark_write_csv(iris_tbl, temp_csv)
iris_csv_tbl <- spark_read_csv(sc, "iris_csv", temp_csv)

spark_write_parquet(iris_tbl, temp_parquet)
iris_parquet_tbl <- spark_read_parquet(sc, "iris_parquet", temp_parquet)

spark_write_json(iris_tbl, temp_json)
iris_json_tbl <- spark_read_json(sc, "iris_json", temp_json)

src_tbls(sc)
## [1] "batting"      "flights"      "iris"         "iris_csv"    
## [5] "iris_json"    "iris_parquet" "mtcars"


Extensions

The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general purpose cluster computing system there are many potential applications for extensions (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).

Here’s a simple example that wraps a Spark text file line counting function with an R function:

# write a CSV 
tempfile <- tempfile(fileext = ".csv")
write.csv(nycflights13::flights, tempfile, row.names = FALSE, na = "")

# define an R interface to Spark line counting
count_lines <- function(sc, path) {
  spark_context(sc) %>% 
    invoke("textFile", path, 1L) %>% 
    invoke("count")
}

# call spark to count the lines of the CSV
count_lines(sc, tempfile)
## [1] 336777

To learn more about creating extensions see the Extensions section of the sparklyr website.

dplyr Utilities

You can cache a table into memory with:

tbl_cache(sc, "batting")

and unload from memory using:

tbl_uncache(sc, "batting")

Connection Utilities

You can view the Spark web console using the spark_web function:

spark_web(sc)

You can show the log using the spark_log function:

spark_log(sc, n = 10)
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 224
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 223
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 222
## 16/09/24 07:50:59 INFO BlockManagerInfo: Removed broadcast_64_piece0 on localhost:56324 in memory (size: 20.6 KB, free: 483.0 MB)
## 16/09/24 07:50:59 INFO ContextCleaner: Cleaned accumulator 220
## 16/09/24 07:50:59 INFO Executor: Finished task 0.0 in stage 67.0 (TID 117). 2082 bytes result sent to driver
## 16/09/24 07:50:59 INFO TaskSetManager: Finished task 0.0 in stage 67.0 (TID 117) in 122 ms on localhost (1/1)
## 16/09/24 07:50:59 INFO DAGScheduler: ResultStage 67 (count at finished in 0.122 s
## 16/09/24 07:50:59 INFO TaskSchedulerImpl: Removed TaskSet 67.0, whose tasks have all completed, from pool 
## 16/09/24 07:50:59 INFO DAGScheduler: Job 47 finished: count at, took 0.125238 s

Finally, we disconnect from Spark:

spark_disconnect(sc)

RStudio IDE

The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for:

  • Creating and managing Spark connections
  • Browsing the tables and columns of Spark DataFrames
  • Previewing the first 1,000 rows of Spark DataFrames

Once you’ve installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances:


Once you’ve connected to Spark you’ll be able to browse the tables contained within the Spark cluster:


The Spark DataFrame preview uses the standard RStudio data viewer:


The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release.