Use H2O.ai on Azure HDInsight

This is a repost from this article on MSDN

We’re hosting a webinar to show you how to use H2O on HDInsight and to answer your questions. Sign up for our upcoming webinar on combining H2O and Azure HDInsight.

We recently announced that H2O and Microsoft Azure HDInsight have integrated to provide data scientists with a leading combination of engines for machine learning and deep learning. Through H2O’s AI platform and its Sparkling Water solution, users can combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark, drive computation from Scala, R, or Python, and use the H2O Flow UI, providing an ideal machine learning platform for application developers.

In this blog, we will provide a detailed step-by-step guide to help you set up your first H2O on HDInsight solution.

Step 1: Set up the environment

The first step is to create an HDInsight cluster with H2O installed. You can install H2O while provisioning a new HDInsight cluster, or you can install it on an existing cluster. Please note that, as of today, H2O on HDInsight only works with Spark 2.0 on HDInsight 3.5, which is the default version of HDInsight.

For more information on how to create a cluster in HDInsight, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters). For more information on how to install an application on an existing cluster, please refer to the documentation here (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-applications).

Please note that we’ve recently updated the UI to require fewer clicks, so you need to click the “Custom” button to install applications on HDInsight.


Step 2: Develop your first H2O application in a Jupyter Notebook

After installing H2O on HDInsight, you can use the built-in Jupyter Notebooks to write your first H2O on HDInsight applications. Simply go to https://yourclustername.azurehdinsight.net/jupyter to open Jupyter. You will see a folder named “H2O-PySparkling-Examples”.


There are a few examples in the folder, but I recommend starting with the one named “Sentiment_analysis_with_Sparkling_Water.ipynb”. Most of the details on how to use the H2O PySparkling Water APIs are already covered in the Notebook itself, so here I will give some high-level overviews.

The first thing you need to do is configure the environment. Most of the configuration is already taken care of by the system, such as the Flow UI address, the Spark jar location, the Sparkling Water egg file, etc.

There are three important parameters to configure: the driver memory, the executor memory, and the number of executors. The default values are optimized for the default 4-node cluster, but your cluster size might vary.

Tuning these parameters is outside the scope of this blog, as it is more of a Spark resource tuning problem. There are a few good reference articles, such as this one.

Note that all Spark applications deployed using a Jupyter Notebook run in “yarn-cluster” deploy mode. This means that the Spark driver will be allocated on a worker node of the cluster, not on the head nodes.

In this example, we simply allocate about 75% of each worker node’s memory to the driver and executors (21 GB each) and use 3 executors, since the default HDInsight cluster has 4 worker nodes (3 executors + 1 driver).
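If you need to adjust these values for your own cluster size, HDInsight Jupyter Notebooks expose them through the sparkmagic %%configure cell magic. A minimal sketch, using the values described above (your numbers will likely differ, and the example notebook may already contain an equivalent cell):

%%configure -f
{"driverMemory": "21G", "executorMemory": "21G", "numExecutors": 3}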


Please refer to the Jupyter Notebook tutorial for more information on how to use Jupyter Notebooks on HDInsight.

The second step is to create an H2O context. Since a default Spark context (called sc) is already configured in the Jupyter Notebook, we just need to call

import pysparkling  # makes the H2OContext class available
h2o_context = pysparkling.H2OContext.getOrCreate(sc)

so that H2O can recognize the default Spark context.

After executing this line of code, H2O will print out the status, as well as the YARN application it is using.


After this, you can use H2O APIs plus the Spark APIs to write your applications. To learn more about Sparkling Water APIs, refer to the H2O GitHub site here.
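As a minimal sketch of what mixing the two looks like, PySparkling lets you move data back and forth between Spark DataFrames and H2OFrames (the file path and frame names below are illustrative, not taken from the example notebook):

reviews_df = spark.read.csv("wasb:///example/data/reviews.csv", header=True)  # Spark DataFrame; path is hypothetical
reviews_hf = h2o_context.as_h2o_frame(reviews_df, "reviews")                  # convert to an H2OFrame for H2O algorithms
back_to_spark = h2o_context.as_spark_frame(reviews_hf)                        # and back to a Spark DataFrame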


This sentiment analysis example has a few steps to analyze the data:

  1. Load data to Spark and H2O frames
  2. Data munging using H2O API
    • Remove columns
    • Refine Time Column into Year/Month/Day/DayOfWeek/Hour columns
  3. Data munging using Spark API
    • Select columns Score, Month, Day, DayOfWeek, Summary
    • Define UDF to transform score (0..5) to binary positive/negative
    • Use TF-IDF to vectorize summary column
  4. Model building using H2O API
    • Use H2O Grid Search to tune hyperparameters (a rough sketch follows this list)
    • Select the best Deep Learning model
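For the model-building step, here is a minimal sketch of an H2O grid search over deep learning models. The frame and column names (train_hf, feature_columns, "PositiveReview") are placeholders; the notebook defines its own:

from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

# placeholder names: train_hf is an H2OFrame, feature_columns its predictor columns
hyper_params = {"hidden": [[50, 50], [100, 100]], "l1": [1e-4, 1e-3]}
grid = H2OGridSearch(H2ODeepLearningEstimator(epochs=5), hyper_params=hyper_params)
grid.train(x=feature_columns, y="PositiveReview", training_frame=train_hf)

best_model = grid.get_grid(sort_by="auc", decreasing=True).models[0]  # keep the best model by AUC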

Please refer to the Jupyter Notebook for more details.

Step 3: Use the Flow UI to monitor progress and visualize the model

H2O Flow is an interactive web-based computational user interface where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks. With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work – all within Flow’s browser-based environment. In this blog, we will only focus on its visualization part.

The H2O Flow web service lives in the Spark driver and is routed through the HDInsight gateway, so it can only be accessed while the Spark application/Notebook is running.

You can click the available link in the Jupyter Notebook, or you can directly access this URL: https://yourclustername-h2o.apps.azurehdinsight.net/flow/index.html

In this example, we will demonstrate its visualization capabilities. Simply click “Model > List Grid Search Results” (since we are using Grid Search to tune hyperparameters).


Then you can access the four grid search results.


And you can view the details of each model. For example, you can visualize the ROC curve.


In Jupyter Notebooks, you can also view the model’s performance in text form.


Summary
In this blog, we have walked you through the detailed steps to create your first H2O application on HDInsight for your machine learning workloads. For more information on H2O, please visit the H2O site; for more information on HDInsight, please visit the HDInsight site.

This blog post is co-authored by Pablo Marin (@pablomarin), Solution Architect at Microsoft.

Artificial Intelligence Is Already Deep Inside Your Wallet – Here’s How

Artificial intelligence (AI) is the key for financial service companies and banks to stay ahead of the ever-shifting digital landscape, especially given competition from Google, Apple, Facebook, Amazon and others moving strategically into fintech. AI startups are building data products that not only automate the ingestion of vast amounts of data, but also provide predictive and actionable insights into how people spend and save across digital channels. Financial companies are now the biggest acquirers of such data products, as they can leverage the massive data sets they sit upon to achieve higher profitability and productivity, and operational excellence. Here are the five ways financial service companies are embracing AI today to go even deeper inside your wallet.

Your Bank Knows More About You Than Facebook
Banks and financial service companies today live or die by their ability to differentiate their offering and meet the unique needs of their customers in real-time. Retention is key, and artificial intelligence is already disrupting what it means for financial service companies to “know the customer.” Google, Facebook, Twitter, and other walled gardens already deeply understand this, which is why they are so keen to collect massive amounts of data on their users, even if they don’t have fintech infrastructure yet.

So how does your bank know more about you than Facebook? Using AI platforms, they can bridge customer data across multiple accounts – including bank, credit, loans, social media profiles, and more – and give them a 360-degree view of the customer. Once they have this, predictive applications suggest in real-time the “next best” offer to keep the person happy based on their spending, risk tolerance, investment history, and debt. For example, based on one transaction – a mortgage – financial companies use AI to recommend a checking account to pay for the mortgage, credit cards to buy furniture, home insurance, or even mutual funds that are focused on real estate. Financial services companies can now also predict customer satisfaction and dissatisfaction, allowing them to intercept consumer churn before it happens by offering exclusive deals or promotions before the person gets angry.

Credit “Risk” Is Becoming Competitive Opportunity
A limited amount of data is used for credit risk scoring today, and it’s heavily weighted toward existing credit history, length of credit usage, and payment history. Naturally, this results in many qualified customers – or anyone trying to access credit for the first time – being rejected for loans, credit cards and more. Credit card companies, including Amazon, are realizing there is a big revenue opportunity that is missed by the current credit assessment system. With AI, employment history data, social media data, and shopping and purchasing patterns are used to build a 360-degree view of the credit “opportunity” as opposed to pure risk. Even better, AI data products can provide real-time updates of credit scores based on recent employment status changes or transactions, so that your credit score is not a fixed number but something that evolves. With this capability, banks and financial services companies are finding overlooked or untapped credit opportunities that even the most sophisticated tech company is missing.

Predict the Next DDoS Attack
The distributed denial-of-service (DDoS) attack against Dyn in October brought to the public forefront the scale and severity of cyber attacks. In the financial realm, security breaches and cyber attacks are not only costly, but also have a damaging impact on brand trust and customer loyalty. Experts and analysts agree that such DDoS attacks will become more prevalent in the future, in part because current cybersecurity practices are built upon rules-based systems and require a lot of human intervention. Many of the current cybersecurity solutions on the market are focused on detection, as opposed to prevention. They can tell you an attack is happening, but not how to predict one or what to do once it’s discovered.

Leveraging AI platforms, banks, credit card companies, and financial service providers are beginning to predict and prevent such cyber attacks with far greater precision than what’s in use today. Using traffic pattern analysis and traffic pattern prediction, AI data products inspect financial-based traffic in real-time and identify threats based on previous sessions. Effectively, this means that a financial company can shut down harmful connections before they compromise the entire website or server. Importantly, as more data is ingested, the AI data product evolves and gets smarter as the hacker changes its methodology. This takes the notion of prevention to a whole new level, as it anticipates the bad actors’ next move.

Putting an End to Money Laundering
The estimated amount of money laundered globally in one year is 2 to 5 percent of global GDP, or upwards of $2 trillion USD. Efforts to combat money laundering are never-ending, as criminals find new ways to stay ahead of law enforcement and technology. Customer activity monitoring is currently done through rules-based filtering, in which rigid and inflexible rules are used to determine if something is suspicious. This system not only creates major loopholes and many false positives, but also wastes investigators’ time and increases operational costs. AI platforms can now find patterns that regular thresholds do not detect, and continuously learn and adapt with new data. Because false positives are reduced, investigators can focus on true anti-money laundering activities, creating a more efficient, accurate solution while reducing operational costs. Suspicious activity reports are finally living up to their name of truly documenting suspicious behavior as opposed to random red flags in a rules-based system.

Biometrics-Based Fraud Detection
Fraudulent credit card activity is one area where artificial intelligence has made great progress in detection and prevention. But there are other interesting applications that are strengthening financial services companies’ overall value proposition. Account origination fraud – where fraudsters open fake accounts using stolen or made-up information – more than doubled in 2015. That’s because there’s no way to prove with absolute certainty that the person on the mobile device is who they say they are. AI technologies are being developed to compare a variety of biometric indicators – such as facial features, iris, fingerprints, and voice – in order to allow banks and financial service companies to confirm the user’s identity in far more secure ways than just a PIN or password. Mastercard, for example, unveiled facial recognition “security checks” for purchases made on mobile phones. Given its potential to protect users’ identities from being stolen or abused, biometrics in the context of banking and financial services may face fewer regulatory hurdles than practices undertaken by Facebook and Google, both of whom have faced class action lawsuits. This is allowing financial services to move much faster in the field of biometrics.

Beyond the Wallet
The tech giants are in an arms race to acquire as many AI and machine learning startups as possible. But the one thing they don’t have yet, and financial services companies do, is massive amounts of financial data. Up until now, financial services companies required a tremendous amount of experience and human judgment in order to analyze this financial data and provide cost-effective, competitive products and services. However, by adopting “out-of-the-box” AI data products that can ingest huge amounts of data, banks and financial services companies are generating valuable predictions and insights in real-time that drive revenue and reduce inefficiencies. The five applications above are not simply isolated use cases, but bellwethers of how intimately AI will be directly tied to enterprise-level financial strategy.

Source: paymentsjournal.com

Football Flowers

Using Sentiment Analysis to Measure Election Surprise

Sentiment Analysis is a powerful Natural Language Processing technique that can be used to compute and quantify the emotions associated with a body of text. One of the reasons that Sentiment Analysis is so powerful is because its results are easy to interpret and can give you a big-picture metric for your dataset.

One recent event that surprised many people was the November 8th US Presidential election. Hillary Clinton, who ended up losing the race, had been given chances of victory ranging from 71.4% (FiveThirtyEight), to 85% (New York Times), to more than 99% (Princeton Election Consortium).


Credit: New York Times

To measure the shock of this upset, we decided to examine comments made during the announcement of the election results and see how (and whether) the sentiment changed. The sentiment of a comment is measured by how its words correspond to either a negative or positive connotation. A score of 0.00 means the comment is neutral, a higher score means the sentiment is more positive, and a negative score means the comment is negative.

Our dataset is a .csv of all Reddit comments made from 11/8/2016 to 11/10/2016 (UTC) and is courtesy of /u/Stuck_In_the_Matrix. All times below are in EST, and in the annotated timeline the height of the bars denotes the number of comments posted during that hour.
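The post doesn’t say which scoring tool was used; as a rough illustration of the approach (score each comment, then average per hour), here is a sketch using NLTK’s VADER analyzer with pandas. The file name is hypothetical and the column names follow the usual Reddit comment dumps:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

comments = pd.read_csv("reddit_comments_nov8_10.csv")        # hypothetical file name
comments["created"] = pd.to_datetime(comments["created_utc"], unit="s")
comments["sentiment"] = comments["body"].astype(str).map(
    lambda text: sia.polarity_scores(text)["compound"])      # compound score: 0.0 is neutral

hourly = (comments.set_index("created")
                  .groupby("subreddit")["sentiment"]
                  .resample("1H").mean())                     # mean sentiment per subreddit per hour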



We examined five political subreddits to gauge their reactions. Our first target was /r/hillaryclinton, Clinton’s primary support base. The number of comments peaked starting at around 9pm EST, but the sentiment gradually fell as news came in that Donald Trump was winning more states than expected.


/r/hillaryclinton: Number of Comments per Hour


/r/hillaryclinton: Mean Sentiment Score per Hour

What is interesting is the low number of comments made after the election was called for Donald Trump. I suspect it may have been a subreddit-wide pause on comments due to concerns about trolls, but I’m not sure; I contacted the moderators but haven’t received a response yet.

A few other left-leaning subreddits had interesting results as well. While /r/SandersforPresident was closed for the election season after Bernie Sanders conceded, its successor, /r/Political_Revolution, had not closed and experienced a decline in comment sentiment as well.


/r/Political_Revolution: Number of Comments per Hour


/r/Political_Revolution: Mean Sentiment Score per Hour


On /r/The_Donald (Donald Trump’s base), the results were the opposite. 


/r/The_Donald: Number of Comments per Hour


/r/The_Donald: Mean Sentiment Score per Hour


There are also a few subreddits that are less candidate- or ideology-specific: /r/politics and /r/PoliticalDiscussion. /r/PoliticalDiscussion didn’t seem to show any shift, but /r/politics did seem to become more muted, at least compared to the previous night.


/r/PoliticalDiscussion: Number of Comments per Hour


/r/PoliticalDiscussion: Mean Sentiment Score per Hour

/r/politics: Mean Sentiment Score per Hour

To recap,

  1. Political subreddits experienced a sizable increase in activity as the election results came in
  2. Subreddits differed in their reactions to the news along ideological lines, with pro-Trump subreddits showing more positive sentiment than pro-Clinton subreddits

What could be the next steps for this type of analysis?

  1. Can we use these patterns to classify the readership of the comments sections of newspapers as left- or right-leaning?
  2. Can we apply these time-series sentiment analyses to other events, such as sporting events (which also involve two ‘teams’)?
  3. Can we use sentiment analysis to evaluate the long-term health of communities, such as subreddits dedicated to eventually-losing candidates, like Bernie Sanders?

Why We Bought A Happy Diwali Billboard


It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!

Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light — the perfect antidote for some of the more negative developments coming out of the Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.

Diwali is also a great opportunity to highlight the advancements Asian Americans have made in technology, especially South Asian Americans. Google (Sundar Pichai) and Microsoft (Satya Nadella), two major forces in the world of AI, are led by Indian Americans. They join other leaders across the technology ecosystem whom we also want to recognize broadly.

Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?

Creating a Binary Classifier to Sort Trump vs. Clinton Tweets Using NLP

The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?

The Solution: Yes! We’ll divide this tutorial into three parts, the first on how to gather the necessary data, the second on data exploration, munging, & feature engineering, and the third on building our model itself. You can find all of our code on GitHub (https://git.io/vPwxr).


Part One: Collecting the Data
Note: We are going to be using Python. For the R version of this process, the concepts translate, and we have some code on Github that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.

We used the Twitter API to collect tweets by both presidential candidates, which would become our dataset. Twitter only lets you access the latest 3,000 or so tweets from a particular handle, even though it keeps all the Tweets in its own databases.

The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form you can access your app, and your keys and tokens. Specifically we’re looking for four things: the client key and secret (called consumer key and consumer secret) and the resource owner key and secret (called access token and access token secret).


We save this information in JSON format in a separate file.

Then, we can use the Python libraries Requests and Pandas to gather the tweets into a DataFrame. We only really care about three things: the author of the Tweet (Donald Trump or Hillary Clinton), the text of the Tweet, and the unique identifier of the Tweet, but we can take in as much other data as we want (for the sake of data exploration, we also included the timestamp of each Tweet).
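The notebook in the repo has the full version; as a rough sketch of the approach (paging backwards through the v1.1 user_timeline endpoint with max_id, no error handling), something like the following works. The key names in the credentials file are whatever you chose when saving them:

import json
import pandas as pd
import requests
from requests_oauthlib import OAuth1

with open("twitter_credentials.json") as f:        # the JSON file of keys/tokens saved above
    creds = json.load(f)

auth = OAuth1(creds["consumer_key"], creds["consumer_secret"],
              creds["access_token"], creds["access_token_secret"])

url = "https://api.twitter.com/1.1/statuses/user_timeline.json"
rows, max_id = [], None
for _ in range(16):                                # ~200 tweets per page, up to Twitter's cap on recent tweets
    params = {"screen_name": "realDonaldTrump", "count": 200}
    if max_id:
        params["max_id"] = max_id
    batch = requests.get(url, auth=auth, params=params).json()
    if not batch:
        break
    rows.extend({"id": t["id"], "text": t["text"], "created_at": t["created_at"],
                 "author": "Trump"} for t in batch)
    max_id = batch[-1]["id"] - 1                   # page backwards through the timeline

tweets = pd.DataFrame(rows)                        # repeat the loop for @HillaryClinton and append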

Once we have all this information, we can output it to a .csv file for further analysis and exploration. 


Part Two: Data Cleaning and Munging
You can find the notebook for this part as “NLPAnalysis.ipynb” in our GitHub repository: https://git.io/vPwxr.

To fully take advantage of machine learning, we need to add features to this dataset. For example, we might want to take into account the punctuation that each Twitter account uses, thinking that it might help us discriminate between Trump and Clinton. One such feature is the number of punctuation symbols in each Tweet, averaged across all Tweets from each account.


Or perhaps we care about how many hashtags or mentions each account uses.


With our timestamp data, we can examine Tweets by their Retweet count over time.


The tall blue skyscraper was Clinton’s “Delete Your Account” Tweet

The same graph, on a logarithmic scale

We can also compare the distribution of Tweets over time. We can see that Clinton tweets more frequently than Trump (this is also evidenced by the fact that we were able to reach further back into Trump’s timeline, since there’s a hard limit on the number of Tweets we can access).


The Democratic National Convention was in session from July 25th to the 28th

We can construct heatmaps of when these candidates were posting.


Heatmap of Trump Tweets, by day and hour
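A heatmap like this can be built straight from the timestamps with a pandas pivot table. A minimal sketch, assuming the tweets DataFrame from Part One (the format string matches the created_at field returned by the v1.1 API):

trump = tweets[tweets["author"] == "Trump"].copy()
ts = pd.to_datetime(trump["created_at"], format="%a %b %d %H:%M:%S %z %Y")
heat = (trump.assign(dow=ts.dt.dayofweek, hour=ts.dt.hour)        # 0 = Monday ... 6 = Sunday
             .pivot_table(index="dow", columns="hour", values="id", aggfunc="count")
             .fillna(0))                                          # tweet counts per day-of-week x hour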

All this light analysis was useful for intuition, but our real goal is to use only the text of the tweet (including derived features) for our classification. If we included features like the timestamp, the task would become a lot easier.

We can utilize a process called tokenization, which lets us create features from the words in our text. To understand why this is useful, let’s pretend to only care about the mentions (for example, @h2oai) in each tweet. We would expect that Donald Trump would mention certain people (@GovPenceIN) more than others and certainly different people than Hillary Clinton. Of course, there might be people both parties tweet at (maybe @POTUS). These patterns could be useful in classifying Tweets. 

Now, we can apply that same line of thinking to words. To make sure that we are only including valuable words, we can exclude stop words, which are filler words such as ‘and’ or ‘the.’ We can also use a metric called term frequency-inverse document frequency (TF-IDF) that computes how important a word is to a document.
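A minimal sketch of that vectorization step with scikit-learn, again assuming the tweets DataFrame from Part One (the max_features cap is an arbitrary illustrative choice):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english",   # drop filler words like 'and' or 'the'
                             max_features=5000)
X = vectorizer.fit_transform(tweets["text"])         # sparse TF-IDF matrix: one row per tweet
y = (tweets["author"] == "Trump").astype(int)        # 1 = Trump, 0 = Clinton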

There are also other ways to use and combine NLP. One approach might be sentiment analysis, where we interpret a tweet to be positive or negative. David Robinson did this to show that Trump’s personal tweets are angrier, as opposed to those written by his staff.

Another approach might be to create word trees that represent sentence structure. Once each tweet has been represented in this format you can examine metrics such as tree length or number of nodes, which are measures of the complexity of a sentence. Maybe Trump tweets a lot of clauses, as opposed to full sentences.


Part Three: Building, Training, and Testing the Model
You can find the notebooks for this part as “Python-TF-IDF.ipynb” and “TweetsNLP.flow” in our GitHub repository: https://git.io/vPwxr.

There were a lot of approaches we could have taken, but we decided to keep it simple for now by using only TF-IDF vectorization. The actual code was relatively simple to write thanks to the excellent scikit-learn package alongside NLTK.

We could have also done some further cleaning of the data, such as excluding URLs from our Tweet text (right now, strings such as “zy7vpfrsdz” get their own feature column because the vectorizer treats them as words). Skipping this won’t hurt our model much, since the shortened URLs are unique, but cleaning them out would save space and time. Another strategy could be to stem words, treating each word as its root (so ‘hearing’ and ‘heard’ would both be coded as ‘hear’).

Still, our model (created using H2O Flow) produces quite a good result without those improvements. We can use a variety of metrics to confirm this, including the Area Under the Curve (AUC). The AUC summarizes the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). A score of 0.5 means that the model is no better than flipping a coin, and a score of 1 means that the model separates the two classes perfectly.
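Our model was built in H2O Flow, but the same kind of check can be reproduced in Python. A sketch on the TF-IDF features from the snippet above, using a plain logistic regression as a stand-in classifier:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()                 # any binary classifier works here
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]    # predicted probability of the "Trump" class
print("AUC:", roc_auc_score(y_test, probs))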


The model curve is blue, while the red curve represents 50–50 guessing

For a more intuitive judgment of our model, we can look at its variable importances (what the model considers to be good discriminators of the data) and see if they make sense:


Can you guess which words (variables) correspond (are important) to which candidate?

Maybe the next step could be to build an app that takes in text and outputs whether the text is more likely to have come from Clinton or Trump. Perhaps we could even consider the Tweets of several politicians, assign them a ‘liberal/conservative’ score, and then build a model to predict whether a Tweet is more conservative or more liberal (important features would maybe include “Benghazi” or “climate change”). Another cool application might be a deep learning model, in the footsteps of @DeepDrumpf.

If this inspired you to create analysis or build models, please let us know! We might want to highlight your project 🎉📈.

When is the Best Time to Look for Apartments on Craigslist?

A while ago I was looking for an apartment in San Francisco. There are a lot of problems with finding housing in San Francisco, mostly stemming from the fierce competition. I was checking Craigslist every single day. It still took me (and my girlfriend) a few months to find a place — and we had to sublet for three weeks in between. Thankfully we’re happily housed now but it was quite the journey. Others have talked about their search for SF housing, but I have a few tips myself:

1) While Craigslist continues to be the best resource for finding housing (it’s how I found my current apartment), there are quite a few Facebook groups that may also be useful. My experience settled into weekly cycles: I would send out lots of emails, get a stream of responses, visit one or two places per weekday evening, and then get a stream of rejections back. If you do check Craigslist, the best times to check are Tuesday and Wednesday evenings, and then the following mornings, as the following graphic shows.

Data sourced from over 10,000 SF apartment Craigslist postings over the month of September

2) Be prepared to apply to an apartment on the spot. I’ve been burned a few times when I took a day or two to fully think about the location and price, but by the time I applied I was at the back of the line. It really helps to know exactly what you want, and know how to spot it. The good news is that even if you’re not sure of your wants and needs at the beginning of your search, you’ll learn as you visit more and more apartments.

3) Make sure you know where the laundry machines are. I once lived in an apartment where I forgot to ask if they had laundry in the building (they didn’t). The result was that I spent an unanticipated few hours every few weeks cleaning my clothes. It’s not the end of the world, and I doubt it would have changed my decision, but it’s still a very important amenity that some people overlook.

Distracted Driving

Last week, we started to examine the 7.2% increase in traffic fatalities from 2014 to 2015, the reversal of a near decade-long downward trend. We then broke out the data by various accident classifications, such as “speeding” or “driving with a positive BAC,” and identified those classifications that had the greatest increase. One label that showed promise for improvement was “involving a distracted driver.” According to Pew Research, the number of Americans who own a mobile device has pretty consistently risen over the past decade, as has the number of Americans who own a smartphone. Moreover, apps like Pokemon Go have built-in features that incentivize driving while playing, and these types of augmented reality games are only going to become more common.

The National Highway Traffic Safety Administration (NHTSA) defines distracted driving as “any activity that could divert a person’s attention away from the primary task of driving.” This includes several activities, from texting while driving to using one hand to place a call. The Governors Highway Safety Association (GHSA), an organization that “provides leadership and representation for the states and territories to improve traffic safety,” notes that states collect data on distracted driving in different ways. While most states split distracted driving into two or three categories, some states use only one category (and other states use as many as 15 categories!). These categories include not just distraction by technology, but also events such as animals in the vehicle or the consumption of food and drink. Distracted driving is also said to be under-reported, because drivers are less likely to admit to using their phone in the event of a crash.

Because of these discrepancies, it’s important to keep in mind that regulations vary from state to state, and policy that successfully reduces accidents in one state may not automatically carry over to another. Still, sharing what works and what doesn’t can be important in saving lives, which is one reason why this data is collected and aggregated. So, which states are succeeding at reducing the number of fatalities caused by distracted driving?


In 2015, New Mexico had one of the highest rates of distracted driving fatalities per mile driven. New Mexico Governor Susana Martinez recognized this back in 2014, signing a bill that banned texting while driving and citing, “Texting while driving is now the leading cause of death for New Mexico’s teen drivers. Most other states have banned the practice of texting while driving.”


Did these laws end up working? Well, maybe. If we look at all crashes (not just fatal ones) in New Mexico from 2005 to 2014, the general trend was downward post-2007, seemingly leveling out during 2014. Unfortunately, data from 2015 on the total number of crashes in New Mexico isn’t available, so we aren’t able to examine whether the bill succeeded in reducing all crashes due to distracted driving.


If we examine only fatalities, we see that while the total number of fatalities in New Mexico decreased in 2015, the 2014 bill doesn’t seem to have actually affected distracted driving fatalities. The bill was signed in March and took effect in July, so there were several months for its effects to propagate. This ambiguous policy impact isn’t limited to New Mexico, either. Economists Rahi Abouk and Scott Adams ran a national study in which they found that “while the effects are strong for the month immediately following ban imposition, accident levels appear to return toward normal levels in about three months.” Still, New Mexico’s decrease in fatalities bucked the national trend by the greatest amount (we’ll talk about which states experienced the largest increase in fatalities in a separate post).

A teaser photo for a new NHTSA advertising campaign

Of course, legislation is only one tactic that can be used to prevent distracted driving. The GHSA notes that (some) states use other tactics such as social media outreach or statewide campaigns. Several states have adopted slogans, which range from the passive Wyoming “The road is no place for distractions” to the more flavorful Missouri “U TXT UR NXT, NO DWT.” States sometimes aim these campaigns at specific demographics, such as teens and young adults, who have higher rates of distracted driving (quite a few states also pass legislation that directly targets young people).


This post is the second in our series on traffic fatalities, inspired by a call to action put out by the Department of Transportation. Watch out for another post highlighting a different aspect of the dataset next week. In the meantime if you have any questions or comments or suggestions you can find me at @JayMahabal or email me at jay@h2oai.

Fatal Traffic Accidents Rise in 2015

On Tuesday, August 30th, the National Highway Traffic Safety Administration released its annual dataset of traffic fatalities, asking interested parties to use the dataset to identify the causes of the 7.2% increase in fatalities from 2014 to 2015. As part of H2O.ai‘s vision of using artificial intelligence for the betterment of society, we were excited to tackle this problem.


This post is the first in our series on the Department of Transportation dataset and driving fatalities which will hopefully culminate in a hackathon in late September, where we’ll invite community members to join forces with the talented engineers and scientists at H2O.ai to find a solution to this problem and prescribe policy changes.


To begin, we started by reading some literature and getting familiar with the data. These documents served as excellent inspiration for possible paths of analysis and guided our thinking. Our introductory investigation was based around asking a series of questions, paving the way for detailed analysis down the road. The dataset includes every (reported) accident along with several labels, from “involving a distracted driver” to “involving a driver aged 15-20.” Even though fatalities as a whole fell during the last ten years, more progress has been made in some areas than in others, and comparing 2014 incidents to 2015 incidents can reveal promising openings for policy action.
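As a rough illustration of that kind of year-over-year comparison (the file and column names here are hypothetical stand-ins for however you flatten the NHTSA data, one row per fatal accident with one boolean column per label):

import pandas as pd

accidents = pd.read_csv("fatal_accidents_2014_2015.csv")      # hypothetical flattened extract

labels = ["distracted_driver", "speeding", "positive_bac", "driver_aged_15_20"]
by_year = accidents.groupby("year")[labels].sum()             # fatal accidents per label per year
pct_change = (by_year.loc[2015] - by_year.loc[2014]) / by_year.loc[2014] * 100
print(pct_change.sort_values(ascending=False))                # which labels grew the most in 2015?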


It’s important to keep in mind that regulations vary from state to state, and policy that successfully reduces accidents in one state may not carry over to another. Still, sharing what works and what doesn’t can be important in saving lives, which is one reason why this data is collected and aggregated.

Next week we’ll examine distracted driving, and investigate whether or not the laws that prohibited texting while driving made a difference — and why those laws didn’t continue the downward trend in 2015. We’ll follow that with an investigation into speeding and motorcycle crashes. In the meantime if you have any questions or comments or suggestions you can find me at @JayMahabal or email me at jay@h2oai.


Clarification: September 14th, 2016
We shifted graph labels to reduce confusion.

IoT – Take Charge of Your Business and IT Insights Starting at the Edge

Instead of just being hype, the Internet of Things (IoT) is now becoming a reality. Gartner forecasts that 6.4 billion connected devices will be in use worldwide in 2016, with 5.5 million new devices being connected every day. These devices range from wearables, to sensors in vehicles that can detect surrounding obstacles, to sensors in pipelines that detect their own wear and tear. Huge volumes of data are collected from these connected devices, and yet companies struggle to get optimal business and IT outcomes from that data.

Why is this the case?
  • Rule-based data models limit insights. Industry experts have a wealth of knowledge that manually drives business rules, which in turn drive the data models. Many current IoT practices simply run large volumes of data through these rule-based models, but the business insights are limited by what rule-based models allow. Machine learning and artificial intelligence allow new patterns to be found within stored data without human intervention. These new patterns can be applied to data models, allowing new insights to be generated for better business results.
  • Analytics in the backend data center delays insights. In current IoT practice, data is collected and analyzed in the backend data center (e.g. an OLAP/MPP database, Hadoop, etc.). Typically, data models are large and hard to deploy at the edge because IoT edge devices have limited computing resources. The trade-off is that large amounts of data travel miles and miles, unfiltered and un-analyzed, until the backend systems have the bandwidth to process them. This defeats the spirit of getting business insights in real time for agility, not to mention the high cost of data transfer and ingestion in the backend data center.
  • Lack of security measures at the edge reduces the accuracy of insights. Current IoT practice also only secures the backend, while security threats can be injected from edge devices. Data can be altered and viruses can be injected during the long period of data transfer. How accurate can the insights be when data integrity is not preserved?

The good news is that H2O can help with:
  • Pattern-based models. H2O detects patterns in the data with distributed machine learning algorithms, instead of depending on pre-established rules. It has been proven in many use cases that H2O’s AI engine can find dozens more patterns than humans are able to discover. Patterns can also change over time, and H2O models can be continuously retrained to yield more and better insights.
  • Fast and easy model deployment with a small footprint. The H2O Open Source Machine Learning Platform creates data models, with a minimal footprint, that can score events and make predictions in nanoseconds. The data models are Java-based and can be deployed anywhere with a JVM, or even as a web service, so they can easily be deployed at the IoT edge to yield real-time business and IT insights (a sketch of this export path follows this list).
  • Enabling security measures at the edge. AI is particularly adept at finding and establishing patterns, especially when it’s fed huge amounts of data. Security loopholes and threats take on new forms all the time. H2O models can easily adapt as data show new patterns of security threats. Deploying these adaptive models at the edge means that threats can be blocked early on, before they’re able to cause damage throughout the system.
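As a minimal sketch of that deployment path (the data file and column names are hypothetical), a trained H2O model can be exported from Python as plain Java for use on any JVM at the edge:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()

# hypothetical sensor data with a binary "failure" label
readings = h2o.import_file("sensor_readings.csv")
readings["failure"] = readings["failure"].asfactor()

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(y="failure", training_frame=readings)

# export the model as a POJO plus the h2o-genmodel jar, ready to embed in an
# edge application running on any JVM
h2o.download_pojo(model, path="/tmp/edge_model", get_jar=True)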

There are many advantages in enabling analytics at the IoT edge. Using H2O will be crucial in this endeavor. Many industry experts are already moving in this direction. What are you waiting for?