Indexing 1 Billion Time Series with H2O and ISax

At H2O, we have recently debuted a new feature called ISax that works on time series data in an H2O Dataframe. ISax stands for Indexable Symbolic Aggregate ApproXimation: it represents complex time series patterns with a compact symbolic notation, thereby reducing the dimensionality of your data. From there you can run H2O's ML algos on the result, or use the index for searching and data analysis. ISax has many uses in a variety of fields, including finance, biology, and cybersecurity.
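
To make the idea concrete, here is a toy sketch of the underlying SAX step in plain NumPy/SciPy (our own illustration, not H2O's implementation): z-normalize the series, average it into a few segments, and map each segment mean to a symbol using standard-normal breakpoints. The ISax words we'll see below additionally tag each symbol with its cardinality, e.g. 5^20.

import numpy as np
from scipy.stats import norm

def sax_word(ts, num_words=4, cardinality=4):
    # Toy SAX, for intuition only: z-normalize, piecewise-aggregate into
    # num_words segments, then bin each segment mean into one of
    # `cardinality` symbols using standard-normal breakpoints.
    ts = (ts - ts.mean()) / ts.std()
    paa = np.array([seg.mean() for seg in np.array_split(ts, num_words)])
    breakpoints = norm.ppf(np.linspace(0, 1, cardinality + 1)[1:-1])
    return "_".join(str(s) for s in np.searchsorted(breakpoints, paa))

print(sax_word(np.random.randn(256).cumsum()))  # e.g. '0_1_3_2'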

In this blog we will use H2O to create an ISax index for analytical purposes. We will generate 1 billion time series of 256 steps each, with values drawn from an integer U(-100, 100) distribution. Once we have the index, we'll show how you can search for similar patterns using it.

We’ll show you the steps so you can follow along, assuming you have enough hardware and patience. In this example we are using a 9-machine cluster, each node with 32 cores and 256GB RAM. We’ll create a 1B-row synthetic data set and turn it into random walks for more interesting time series patterns. We’ll then run ISax and perform the search; the whole process takes ~30 minutes on our cluster.

Raw H2O Frame Creation
In the typical use case, H2O users import time series data from disk. H2O can read from local filesystems, NFS, or distributed systems like Hadoop, and cluster file reads are parallelized across the nodes for speed. In our case we’ll generate a 256-column, 1B-row frame. Note that H2O Dataframes scale better with more rows than with more columns. Each row will be an individual time series; the ISax algo assumes the time series data is row-based.

import h2o
h2o.init()

# 1B rows x 256 integer columns; values are uniform in [-100, 100]
rawdf = h2o.create_frame(cols=256, rows=1000000000, real_fraction=0.0,
                         integer_fraction=1.0, missing_fraction=0.0)
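
For reference, had the data already been sitting on disk or HDFS, the equivalent step would be a parallel import instead (the path below is just a placeholder; in this post we stick with the synthetic frame):

# Hypothetical path; H2O distributes the parse across all cluster nodes
rawdf = h2o.import_file("hdfs://namenode/data/timeseries_1b.csv")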


Random Walk
Here we do a row-wise cumulative sum to simulate random walks. The .head() call forces execution of the lazy expression graph, so we can measure the time.

# Row-wise cumulative sum turns the i.i.d. steps into random walks
tsdf = rawdf.cumsum(axis=1)
print(tsdf.head())


Let's take a quick peek at our time series:
tsdf[0:2, :].transpose().as_data_frame(use_pandas=True).plot()


Run ISax
Now we’re ready to run ISax and generate the index. The output of this command is another H2O Frame that contains the string representation of each ISax word, along with the numeric columns, in case you want to run ML algos on them.
res = tsdf.isax(num_words=20, max_cardinality=10)
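
Since the result frame keeps the numeric columns alongside the word, you could hand it straight to an H2O algo. As a hypothetical example (we don't do this here), you might cluster similar shapes with K-means:

from h2o.estimators import H2OKMeansEstimator

# Assumption: every column except the string word column is numeric
numeric_cols = [c for c in res.columns if c != "iSax_index"]
km = H2OKMeansEstimator(k=10)
km.train(x=numeric_cols, training_frame=res)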


The run takes about 10 minutes, and H2O’s MapReduce framework makes efficient use of all 288 CPU cores.


Now that the index is built, let's search for similar time series patterns in our 1B-series data set. First, let's add a row index to both the ISax result frame and the original time series frame.

res["idx"] =1
res["idx"] = res["idx"].cumsum(axis=0)
tsdf["idx"] = 1
tsdf["idx"] = tsdf["idx"].cumsum(axis=0)

I'm going to pick the second time series that we plotted (the green “C2” series).
myidx = res[res["iSax_index"]=="5^20_5^20_7^20_9^20_9^20_9^20_9^20_9^20_8^20_6^20
_4^20_3^20_2^20_1^20_1^20_0^20_0^20_0^20_0^20_0^20"]["idx"]
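
If you'd rather not paste the word by hand, the same lookup can be done programmatically (a hypothetical alternative, assuming the second plotted series sits at row index 1):

word = res[1, "iSax_index"].flatten()  # 1x1 H2OFrame -> Python scalar
myidx = res[res["iSax_index"] == word]["idx"]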

There are 4342 other time series with the same index in the 1B-series dataframe. Let's plot the first 10 and see how similar they look.

mylist = myidx.as_data_frame(use_pandas=True)["idx"][0:10].tolist()
mydf = tsdf[tsdf["idx"].isin(mylist)].as_data_frame(use_pandas=True)
# Plot only the 256 time-step columns, excluding the idx column
mydf.iloc[:, 0:256].transpose().plot(figsize=(20, 10))


The successful implementation of a fast, in-memory ISax algo can be attributed to the H2O platform: a highly efficient, easy-to-code, open source MapReduce framework, plus the Rapids API, which exposes your distributed algos to Python and R. In my next blog, I will show how to get started writing your own MapReduce functions in H2O on structured data, using ISax as an example.

References
https://www.quora.com/MLconf-2015-Seattle-How-does-the-symbolic-aggregate-approximation-SAX-technique-work

http://cs.gmu.edu/~jessica/SAX_DAMI_preprint.pdf

IoT – Take Charge of Your Business and IT Insights Starting at the Edge

Instead of just being hype, the Internet of Things (IoT) is now becoming a reality. Gartner forecasts that 6.4 billion connected devices will be in use worldwide in 2016, with 5.5 million new devices connected every day. These devices range from wearables, to sensors in vehicles that can detect surrounding obstacles, to sensors in pipelines that detect their own wear and tear. Huge volumes of data are collected from these connected devices, and yet companies struggle to get optimal business and IT outcomes from this data.

Why is this the case?
Rule-based data models limit insights. Industry experts have a wealth of knowledge that manually drives business rules, which in turn drive the data models. Many current IoT practices simply run large volumes of data through these rule-based models, so the business insights are limited to what rule-based models allow. Machine Learning/Artificial Intelligence can find new patterns in stored data without human intervention; these new patterns can then be applied to data models, generating new insights for better business results.
Analytics in the backend data center delays insights. In current IoT practice, data is collected and analyzed in the backend data center (e.g., an OLAP/MPP database, Hadoop, etc.). Data models are typically large and hard to deploy at the edge because IoT edge devices have limited computing resources. The trade-off is that large amounts of data travel miles and miles, unfiltered and unanalyzed, until the backend systems have the bandwidth to process them. This defeats the spirit of getting business insights in real time, not to mention the high cost of data transfer and ingestion in the backend data center.
Lack of security measures at the edge reduces the accuracy of insights. Current IoT practice also secures only the backend, while security threats can be injected from edge devices. Data can be altered and viruses injected during the long data transfer. How accurate can the insights be when data integrity is not preserved?

The good news is that H2O can help with:
Pattern-based models. H2O detects patterns in the data with distributed Machine Learning algorithms instead of depending on pre-established rules. In many use cases, H2O's AI engine has found dozens more patterns than humans were able to discover. Patterns also change over time, and H2O models can be continuously retrained to yield more and better insights.
Fast and easy model deployment with a small footprint. The H2O Open Source Machine Learning Platform creates data models with a minimal footprint that can score events and make predictions in nanoseconds. The models are Java-based and can be deployed anywhere with a JVM, or even as a web service, so they can easily run at the IoT edge to yield real-time business and IT insights (see the sketch after this list).
Enabling security measures at the edge. AI is particularly adept at finding and establishing patterns, especially when it’s fed huge amounts of data. Security loopholes and threats take on new forms all the time. H2O models can easily adapt as data show new patterns of security threats. Deploying these adaptive models at the edge means that threats can be blocked early on, before they’re able to cause damage throughout the system.
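
To make the small-footprint deployment point concrete, here is a minimal sketch of training a model and exporting it as a portable Java artifact for edge scoring. The file name and columns are hypothetical, and we pick a GBM and H2O's MOJO export as one concrete route:

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Hypothetical sensor data; the last column is the label to predict
df = h2o.import_file("sensor_readings.csv")
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=df.columns[:-1], y=df.columns[-1], training_frame=df)

# Export a self-contained MOJO that any JVM can score with,
# e.g. on an edge gateway, without a running H2O cluster
print(model.download_mojo(path="."))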

There are many advantages to enabling analytics at the IoT edge, and H2O can be a crucial part of that endeavor. Many industry experts are already moving in this direction. What are you waiting for?