It’s been a dark year in many ways, so we wanted to lighten things up and celebrate Diwali — the festival of lights!
Diwali is a holiday that celebrates joy, hope, knowledge and all that is full of light: the perfect antidote for some of the more negative developments coming out of Silicon Valley recently. Throw in a polarizing presidential race where a certain candidate wants to literally build a wall around US borders, and it’s clear that inclusivity is as important as ever.
Diwali is also a great opportunity to highlight the advancements Asian Americans, especially South Asian Americans, have made in technology. Google (Sundar Pichai) and Microsoft (Satya Nadella), two major forces in the world of AI, are both led by Indian Americans. They join other leaders across the technology ecosystem whom we also want to recognize.
Today we are open-sourcing Diwali. America embraced Yoga and Chicken Tikka, so why not Diwali too?
The problem: Can we determine if a tweet came from the Donald Trump Twitter account (@realDonaldTrump) or the Hillary Clinton Twitter account (@HillaryClinton) using text analysis and Natural Language Processing (NLP) alone?
The solution: Yes! We’ll divide this tutorial into three parts: the first on how to gather the necessary data, the second on data exploration, munging, and feature engineering, and the third on building the model itself. You can find all of our code on GitHub (https://git.io/vPwxr).
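To make the idea concrete before diving in, here is a minimal sketch of classifying short texts by author from their words alone. It is not the model from our notebooks: it is a small bag-of-words naive Bayes classifier written with the standard library only, and the training strings and the labels `account_a` and `account_b` are made-up placeholders, not real tweets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of texts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the log prior, then add one log likelihood per token.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                score += math.log(
                    (counts[token] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)


# Placeholder strings, NOT real tweets; the labels stand in for two accounts.
train_texts = [
    "rain clouds and thunder tonight",
    "thunder and rain all week",
    "sunny beach and warm sand",
    "warm sunny day at the beach",
]
train_labels = ["account_a", "account_a", "account_b", "account_b"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("more rain and thunder")
```

With real data, the placeholder lists would be replaced by the collected tweets and their source handles, and a held-out set would be needed to judge accuracy.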
Part One: Collecting the Data
Note: We are going to be using Python. If you prefer R, the concepts translate, and we have some code on GitHub that might be helpful. You can find the notebook for this part as “TweetGetter.ipynb” in our GitHub repository: https://git.io/vPwxr.
We used the Twitter API to collect tweets from both presidential candidates; these became our dataset. Twitter only lets you access roughly the latest 3,000 tweets from a particular handle, even though it keeps all tweets in its own databases.
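Because a timeline is returned in pages, collecting everything available means walking backwards from the newest tweet. The sketch below shows that paging loop under some assumptions: `fetch_page` is a hypothetical callable standing in for whatever API client you use, the `page_size` of 200 and `cap` of 3200 are our understanding of the user timeline limits rather than values taken from the notebook, and `make_fake_timeline` is an in-memory stand-in so the example runs without credentials.

```python
def collect_timeline(fetch_page, page_size=200, cap=3200):
    """Walk a timeline from newest to oldest using max_id paging.

    fetch_page(max_id, count) is a hypothetical callable that returns a list
    of tweets (dicts with an integer "id"), newest first, with id <= max_id
    (or the newest tweets when max_id is None).
    """
    tweets = []
    max_id = None
    while len(tweets) < cap:
        remaining = cap - len(tweets)
        page = fetch_page(max_id=max_id, count=min(page_size, remaining))
        if not page:
            break  # no older tweets available
        tweets.extend(page)
        # Next time, ask only for tweets strictly older than the oldest seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets


def make_fake_timeline(n):
    """Build an in-memory stand-in for the API, for demonstration only."""
    store = [{"id": i, "text": f"tweet {i}"} for i in range(n, 0, -1)]

    def fetch_page(max_id=None, count=200):
        eligible = [t for t in store if max_id is None or t["id"] <= max_id]
        return eligible[:count]

    return fetch_page


collected = collect_timeline(make_fake_timeline(450))
```

Subtracting one from the oldest id before the next request avoids fetching the boundary tweet twice, since `max_id` is treated as inclusive here.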
The first step is to create an app on Twitter, which you can do by visiting https://apps.twitter.com/. After completing the form you can access your app, and your…
This post is reposted from RStudio’s announcement of sparklyr, RStudio’s extension for Spark.
- Connect to Spark from R. The sparklyr package provides a complete dplyr backend.
- Filter and aggregate Spark datasets then bring them into R for analysis and visualization.
- Use Spark’s distributed machine learning library from R.
- Create extensions that call the full Spark API and provide interfaces to Spark packages.
You can install the sparklyr package from CRAN as follows:
install.packages("sparklyr")
You should also install a local version of Spark for development purposes:
library(sparklyr)
spark_install(version = "1.6.2")
To upgrade to the latest version of sparklyr, run the following command and restart your R session:
devtools::install_github("rstudio/sparklyr")
If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).
Connecting to Spark
You can connect to both local instances of Spark as well as remote Spark clusters….