Interview with Carolyn Phillips, Sr. Data Scientist, Neurensic

During Open Tour Chicago we conducted a series of interviews with data scientists attending the conference. This is the second of a multipart series recapping our conversations.

Be sure to keep an eye out for updates by checking our website or following us on Twitter @h2oai.


H2O.ai: How did you become a data scientist?

Phillips: Until very close to two months ago I was a computational scientist working at Argonne National Laboratory.

H2O.ai: Okay.

Phillips: I was working in material science, physics, mathematics, etc., but I was getting bored with that and looking for new opportunities, and I got hired by a startup in Chicago.

H2O.ai: Yes.

Phillips: When they hired me they said, “We’re hiring you, and your title is Senior Quantitative Analyst,” but the very first day I showed up, they said, “Scratch that. We’ve changed your title. Your title is now Senior Data Scientist.” And I said, “Yes, all right.” It has senior in it, so I’m okay going with that.

H2O.ai: Nice. I like it.

Phillips:…

Interview with Svetlana Kharlamova, ­Sr. Data Scientist, Grainger

During Open Tour Chicago we conducted a series of interviews with data scientists attending the conference. This is the first of a multipart series recapping our conversations.

Be sure to keep an eye out for updates by checking our website or following us on Twitter @h2oai.

Svetlana Kharlamova

H2O.ai: How did you become a data scientist?

Kharlamova: I’m a physicist.

H2O.ai: Okay.

Kharlamova: I came from academia, in physics. I worked for seven years in academic physics and math, and four years ago I switched to finance to be more of a math person than a physics person.

H2O.ai: I see.

Kharlamova: And from finance I came to the data industry. At that time data science was booming.

H2O.ai: Oh, okay.

Kharlamova: And I got excited about all the new stuff and technologies coming up, and here I am.

H2O.ai: Okay, nice. So what business do you work for now?

Kharlamova: I work for Grainger. We’re focused on equipment distribution, serving as a connector between manufacturing plants,…

H2O Day at Capital One

Here at H2O.ai one of our most important partners is Capital One, and we’re proud to have been working with them for over a year. One of the world’s leading financial services providers, Capital One has a strong reputation for being an extremely data- and technology-focused organization. That’s why, when the Capital One team invited us to their offices in McLean, Virginia for a full day of H2O talks and demos, we were delighted to accept. Many key members of Capital One’s technology team were among the 500+ attendees at the event, including Jeff Chapman, MVP of Shared Technology, Hiren Hiranandani, Lead Software Engineer, Mike Fulkerson, VP of Software Engineering, and Adam Wenchel, VP of Data Engineering.

A major theme throughout the day was “vertical is the new horizontal,” an idea presented by our CEO Sri Ambati about how every company is becoming a technology company. Sri pointed out that software is becoming increasingly ubiquitous at organizations at the same time that code is becoming a commodity. Today, the only assets that companies can defend are their community and brand. Airbnb is more valuable than most hospitality companies, despite owning no property,…

Red herring bites

At the Bay Area R User Group in February I presented progress on the big-join in H2O, which is based on the algorithm in R’s data.table package. The presentation had two goals: i) describe one test in great detail so everyone understands what is being tested and can judge whether it is relevant to them or not; and ii) show how it scales with data size and number of nodes.

These were the final two slides:

[Slides 14 and 15: scaling results by data size and number of nodes]

I left a blank for 1e10 (10 billion high-cardinality rows joined with 10 billion high-cardinality rows, returning 10 billion high-cardinality rows) because it didn’t work at that time. Although each node has 256GB RAM (and 32 cores), the 10 billion row test involves joining two 10 billion row tables (each 200GB) and returning a third table (also ~10 billion rows) of 300GB, 700GB in total. I was giving 200GB to each of…
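For orientation, here is a minimal, scaled-down sketch of the same kind of high-cardinality keyed join in R’s data.table, the algorithm the H2O big-join is based on. The table size N and column names are illustrative placeholders, not the benchmark’s actual setup:

    library(data.table)

    # Scaled-down illustration only: the benchmark used up to 1e10 rows per table.
    N <- 1e6
    X <- data.table(id = sample(N), x = rnorm(N))   # high-cardinality key: a permutation of 1:N
    Y <- data.table(id = sample(N), y = rnorm(N))

    setkey(X, id)
    setkey(Y, id)

    ans <- X[Y]     # keyed join: one row per matching key
    nrow(ans)       # ~N rows, the analogue of the 10-billion-row result table

In H2O itself, the analogous operation on H2OFrames is exposed from R as h2o.merge().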

Fast csv writing for R

R has traditionally been very slow at reading and writing csv files of, say, 1 million rows or more. Getting data into R is often the first task a user needs to do, and if they have a poor experience (either hard to use, or very slow) they are less likely to progress. The data.table package in R solved csv import convenience and speed in 2013 by implementing data.table::fread() in C. The examples at that time went from 30 seconds down to 3 seconds to read 50MB, and from over 1 hour down to 8 minutes to read 20GB. In 2015 Hadley Wickham implemented readr::read_csv() in C++.
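As a rough illustration of the reading comparison (the file name is a placeholder, and no timings are implied beyond those quoted above):

    library(data.table)

    f <- "big.csv"   # placeholder: any large csv on disk

    system.time(df1 <- read.csv(f, stringsAsFactors = FALSE))   # base R reader
    system.time(dt1 <- fread(f))                                # data.table's C reader
    system.time(df2 <- readr::read_csv(f))                      # readr's C++ reader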

But what about writing csv files?

It is often the case that the end goal of work in R is a report, a visualization or a summary of some form. Not anything large. So we don’t really need to export large data, right? It turns out that the combination of speed, expressiveness and advanced abilities of R (particularly data.table) means that it can be faster (so I’m told) to export the data from other environments (e.g. a database), manipulate it in R, and export it back to the database, than it is keeping the data…
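For the writing side, one concrete point of comparison is data.table::fwrite() against base write.csv(). The sketch below is purely illustrative (made-up data and file names), not a benchmark from this post:

    library(data.table)

    dt <- data.table(id = 1:1e6,
                     x  = rnorm(1e6),
                     g  = sample(letters, 1e6, replace = TRUE))

    system.time(write.csv(dt, "out_base.csv", row.names = FALSE))   # base R writer
    system.time(fwrite(dt, "out_fwrite.csv"))                       # data.table's C csv writer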

Apache Spark and H2O on AWS


This is a guest post re-published with permission from our friends at Datapipe. The original lives here.

One of the advantages of public cloud is the ability to experiment and run various workloads without the need to commit to purchasing hardware. However, to meet your data processing needs, a well-defined mapping between your objectives and the cloud vendor offerings is a must. In collaboration with Denis Perevalov (Milliman), we’d like to share some details about one of the most recent – and largest – big-data projects we’ve worked on: a project with our client, Milliman, to build a machine-learning platform on Amazon Web Services.

Before we get into the details, let’s introduce Datapipe’s data and analytics consulting team. The goal of our data and analytics team is to help customers with their data processing needs. Our engagements typically fall into data engineering efforts, where we help customers build data processing pipelines. In addition, our team helps clients gain better insight into their existing datasets by engaging with our data science consultants.

When we first started working…

Connecting to Spark & Sparkling Water from R & Rstudio

Sparkling Water offers best-of-breed machine learning for Spark users, bringing all of H2O’s advanced algorithms and capabilities to Spark. This means that you can continue to use H2O from RStudio or any other IDE of your choice. This post will walk you through the steps to get up and running with Spark from plain R or RStudio.

It works the same way as regular H2O: you just need to call h2o.init() from R with the right parameters, i.e. the IP and port.

For example, we start the sparkling shell (bin/sparkling-shell) and create an H2OContext:

[Screenshot: sparkling-shell session creating an H2OContext]

Now the H2OContext is running and H2O’s REST API is exposed at 172.162.223:54321.

So we can open RStudio and call h2o.init() (make sure you have the right R H2O package installed):

[Screenshot: calling h2o.init() from RStudio]
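In code form, the RStudio side amounts to something like the following; the IP and port are placeholders and must match whatever your H2OContext reports:

    library(h2o)

    # Use the address that sparkling-shell printed when the H2OContext started.
    h2o_ip   <- "172.162.223"   # placeholder
    h2o_port <- 54321

    h2o.init(ip = h2o_ip, port = h2o_port, startH2O = FALSE)   # attach to the existing cloud rather than starting a new one
    h2o.clusterInfo()                                          # confirm the connection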

Let’s now create a Spark DataFrame, publish it as an H2O frame, and access it from R:

This is how you achieve that in sparkling-shell:
val df =...
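The sparkling-shell snippet is truncated here, but once the frame has been published, the R side is just a lookup by frame id. A minimal sketch, where the id "df" is an assumption rather than something shown above:

    h2o.ls()                    # list the frame ids registered in the H2O cloud
    hf <- h2o.getFrame("df")    # "df" is an assumed id; use the one h2o.ls() reports
    head(hf)                    # the published data is now an H2OFrame in R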

Drink in the Data with H2O at Strata SJ 2016

It’s about to rain data in San Jose when Strata + Hadoop World comes to town March 29 – March 31st.

H2O has a waterfall of action happening at the show. Here’s a rundown of what’s on tap.
Keep it handy so you have less chance of FOMO (fear of missing out).

Hang out with H2O at Booth #1225 to learn more about how machine learning can help transform your business and find us throughout the conference:

Tuesday, March 29th

Wednesday, March 30th

  • 12:45pm – 1:15pm Meet the Makers: The brains and innovation behind the leading machine learning solution are on hand to hack with you
    • #AskArno – Arno Candel, Chief Architect and H2O algorithm expert
    • #RuReady with Matt Dowle, H2O Hacker and author of R…

Thank you, Cliff

Cliff resigned from the Company last week. He is parting on good terms and supports our success in the future. Cliff and I have worked closely since 2004, so this is a loss for me. It ends an era of prolific work supporting my vision as a partner.

Let’s take this opportunity to congratulate Cliff on his work in helping me build something from nothing. Millions of little things we did together got us this far. (I still remember the U-Haul trip with the earliest furniture into the old building, and Cliff cranking out code furiously, running on Taco Bell & Fiesta Del Mar.) A lot of how I built the Company has to do with maximizing my partnership with Cliff. Lots of wins came out of that, and we’ll cherish them. Like all good things, it came to an end. I only wish him the very best in the future.

Over the past four years, Cliff and the rest of you have helped me build an amazing technology, business, customer and investor team. Your creativity, passion, loyalty, spirited work, grit & determination are the pillars of support and wellspring of life for the Company. I’ll look for strong partners in each one of…