Using Sentiment Analysis to Measure Election Surprise

Sentiment Analysis is a powerful Natural Language Processing technique that can be used to compute and quantify the emotions associated with a body of text. One of the reasons that Sentiment Analysis is so powerful is because its results are easy to interpret and can give you a big-picture metric for your dataset.

One recent event that surprised many people was the November 8th US Presidential election. Hillary Clinton, who ended up losing the race, had been given chances ranging from a 71.4% (FiveThirtyEight), to a 85% (New York Times), to a >99% chance of victory (Princeton Election Consortium).

prediction_comparisons

Credit: New York Times

To measure the shock of this upset, we decided to examine comments made during the announcements of the election results and see how (if) the sentiment changed. The sentiment of a comment is measured by how its words correspond to either a negative or positive connotation. A score of ‘0.00’ means the comment is neutral, while a higher score means that the sentiment is more positive (and a negative score implies the…

Indexing 1 Billion Time Series with H2O and ISax

At H2O, we have recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation, which means it can represent complex time series patterns using a symbolic notation and thereby reducing the dimensionality of your data. From there you can run H2O’s ML algos or use the index for searching or data analysis. ISax has many uses in a variety of fields including finance, biology and cybersecurity.

Today in this blog we will use H2O to create an ISax index for analytical purposes. We will generate 1 Billion time series of 256 steps on an integer U(-100,100) distribution. Once we have the index we’ll show how you can search for similar patterns using the index.

We’ll show you the steps and you can run along, assuming you have enough hardware and patience. In this example we are using a 9 machine cluster, each with 32 cores and 256GB RAM. We’ll create a 1B row synthetic data set and form random walks for more interesting time series patterns. We’ll run ISax and perform the search, the whole process takes ~30 minutes with our cluster.

Raw H2O Frame Creation

Why We Bought A Happy Diwali Billboard

h2o-close-up2

It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light — the perfect antidote for some of the more negative developments coming out of the Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans have made in technology, especially South Asian Americans. The heads of Google (Sundar Pichai) and Microsoft (Satya Nadella) — two major forces in the world of AI — are led by Indian Americans. They join other leaders across the technology ecosystem that we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLP

The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?

The Solution: Yes! We’ll divide this tutorial into three parts, the first on how to gather the necessary data, the second on data exploration, munging, & feature engineering, and the third on building our model itself. You can find all of our code on GitHub (https://git.io/vPwxr).


Part One: Collecting the Data
Note: We are going to be using Python. For the R version of this process, the concepts translate, and we have some code on Github that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.

We used the Twitter API to collect tweets by both presidential candidates, which would become our dataset. Twitter only lets you access the latest ~3000 or so tweets from a particular handle, even though they keep all the Tweets in their own databases. 

The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form you can access your app, and your…

sparklyr: R interface for Apache Spark

This post is reposted from Rstudio’s announcement on sparklyr – Rstudio’s extension for Spark

sparklyr-illustration

  • Connect to Spark from R. The sparklyr package provides a complete dplyr backend.
  • Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide interfaces to Spark packages.

Installation

You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr") 

You should also install a local version of Spark for development purposes:

library(sparklyr) spark_install(version = "1.6.2") 

To upgrade to the latest version of sparklyr, run the following command and restart your r session:

devtools::install_github("rstudio/sparklyr") 

If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).

Connecting to Spark

You can connect to both local instances of Spark as well as remote Spark clusters….

When is the Best Time to Look for Apartments on Craigslist?

A while ago I was looking for an apartment in San Francisco. There are a lot of problems with finding housing in San Francisco, mostly stemming from the fierce competition. I was checking Craigslist every single day. It still took me (and my girlfriend) a few months to find a place — and we had to sublet for three weeks in between. Thankfully we’re happily housed now but it was quite the journey. Others have talked about their search for SF housing, but I have a few tips myself:

1) While Craigslist continues to be the best resource for finding housing (it’s how I found my current apartment), there are quite a few Facebook groups that may also be useful. My experience has been of having weekly cycles, where I send out lots of emails, get a stream of responses, go visit 1-2 places per weekday evening, and then get a stream of rejections back. If you do check Craigslist, the best times to check are Tuesday and Wednesday evenings, and then the following mornings, as the following graphic shows.

screen-shot-2016-09-26-at-2-42-45-pm Read More

Focus

———- Forwarded message ———
From: SriSatish Ambati
Date: Thu, Sep 15, 2016 at 10:17 PM
Subject: changes and all hands tomorrow.
To: team

Team,

Our focus has changed towards larger fewer deals & deeper engagements with handful of finance and insurance customers.

We took a hard look at our marketing spend, pr programs and personnel. We let go most of our amazing inside sales talent. And two of our account executives. We are not building a vertical in IOT. In all nine business folks were affected. No further changes are anticipated or necessary.

These were heroic partners in our journey. I spoke to most all of them today to personally convey the message. Some were with me for a short time, many for years – all great humans who diligently served me and h2o well. I’m grateful for their support and partnership towards my vision. I learnt a lot from each one of them and will not hesitate to assist them any ways possible personally.

Thank you. heroes in bcc. may you find fulfillment & love in your path. It’s a small world and we will all meet very soon.

Our goal as a startup is…

Distracted Driving

Last week, we started to examine the 7.2% increase in traffic fatalities from 2014 to 2015, the reversal of a near decade-long downward trend. We then broke out the data by various accident classifications, such as “speeding” or “driving with a positive BAC,” and identified those classifications that had the greatest increase. One label that showed promise for improvement was “involving a distracted driver.” According to Pew Research, the number of Americans who own a mobile device has pretty consistently risen over the past decade, as has the number of Americans who own a smartphone. Moreover, apps like Pokemon Go have built-in features that incentivize driving while playing, and these types of augmented reality games are only going to become more common.

The National Highway Traffic Safety Association (NHTSA) defines distracted driving as “any activity that could divert a person’s attention away from the primary task of driving.” This includes several activities, from texting while driving, to using one hand to place a call. The Governor’s Highway Safety Administration (GHSA), an organization that “provides leadership and representation for the states and territories to improve traffic safety,” notes that states can even collect data on distracted driving…

Introducing H2O Community & Support Portals

At H2O, we enjoy serving our customers and the community, and we take pride in making them successful while using H2O products. Today, we are very excited to announce two great platforms for our customers and for the community to better communicate with H2O. Let’s start with our community first:

Community Badge

The success of every open source project depends on a vibrant community, and having an active community helps to convert an average product into a successful product. So to maintain our commitment to our H2O community, we are releasing an updated community platform at https://community.h2o.ai. This community platform is available for everyone, whether you are new to machine intelligence or are a seasoned veteran. If you are new to machine intelligence or H2O, you have an opportunity to learn from great minds, and if you are a seasoned industry veteran, you can not only enhance your skillset, you can also help others to achieve success.

Our objective is to develop this community in a way where every community member has the opportunity to establish himself or herself as a technology leader or expert by helping others. Every moment you spend here in the community,…

Fatal Traffic Accidents Rise in 2015

On Tuesday, August 30th, the National Highway Traffic Safety Administration released their annual dataset of traffic fatalities asking interested parties to use the dataset to identify the causes of an increase of 7.2% in fatalities from 2014 to 2015. As part of H2O.ai‘s vision of using artificial intelligence for the betterment of society we were excited to tackle this problem.

screen-shot-2016-09-14-at-2-29-27-pm

This post is the first in our series on the Department of Transportation dataset and driving fatalities which will hopefully culminate in a hackathon in late September, where we’ll invite community members to join forces with the talented engineers and scientists at H2O.ai to find a solution to this problem and prescribe policy changes.


To begin, we started by reading some literature and getting familiar with the data. These documents served as excellent inspiration for possible paths of analysis and guided our thinking. Our introductory investigation was based around asking a series of questions, paving the way for detailed analysis down the road. The dataset includes every (reported) accident along with several labels, from…