Vertical Is The New Horizontal

Every vertical company needs a digital arm. Software is eating the vertical. Value has shifted from asset-heavy to asset-light: Airbnb is more valuable than Hyatt or Hilton, and Uber more than Tesla. Code itself, however, is a commodity, depreciating at a pace faster than Moore's law. People, community, and brand are the true assets of a company. Data products in the defense of communities is the path. Nurturing community love with beautiful data products should be the goal of every enterprise. Open source, AI, and cloud become the delivery mechanisms for digital transformation. To be part of this movement, companies have to make software, not just consume it. They need to review their data assets and build innovative products and data alliances in the service of their communities and brands. Data is a team sport and needs careful nurturing of teamwork. Seeding and enabling a data-driven culture that thrives on asking good questions, and answering them even approximately, is key.

Brand is eternal. Reputation is ever so easy to lose and needs the fullest attention of the executive. Data can be a great partner in the defense of hard-earned customer trust. Remember to add feedback loops that capture user interactions and create net-new data. Networks lead to scale. Platforms need ecosystems: raise a forest, not just a tree. Leverage and facilitate external interactions, not just internal product focus.

Examples abound: Amazon, an online e-commerce company that went on to build a whole series of horizontal technology businesses, is the gold standard.

Democratize care with AI — AI to do AI for Healthcare

Very excited to have Prashant Natarajan (@natarpr) join us, along with Sanjay Joshi, on our vision to change the world of healthcare with AI. Health is wealth, and the one most worth saving. They bring invaluable domain knowledge and context to our cause.

As one of our customers likes to say, healthcare should be optimized for health and outcomes for those in need of care. Health / Care, as in health divided by care: how healthy can one be with the least amount of care? We are investing in health because it is the right thing to do over the long term, especially with the convergence of finance, life insurance, and retail toward health. So many opportunities for cross-pollination!

With the support of our strong ecosystem, community, and customers, H2O.ai will democratize care with AI: make it faster, cheaper, easier, and accessible to all. Machine learning touches lives, and with domain scientists on our side, we can accelerate change for the problems most in need of it. We are fortunate to have a team and culture that allow us to bring great products to the marketplace with high velocity. Stay tuned for Driverless AI for Health, one microservice AI model at a time!

As you feel inspired by the immense opportunities to serve humanity, please join Prashant Natarajan and the www.h2o.ai community on our mission!

This will be fun! Sri

Sparkling Water 2.3.0 is now available!

Hi Makers!

We are happy to announce that Sparkling Water now fully supports Spark 2.3 and is available from our download page.

If you are using an older version of Spark, that's no problem. Although we suggest upgrading to the latest version where possible, we keep the Sparkling Water releases for Spark 2.2 and 2.1 up to date with the latest fixes wherever Spark itself does not limit us.

This release of Sparkling Water also contains several important bug fixes. The three major fixes are:

  • H2OMojoModel now handles nulls properly. In previous versions, running predictions on an H2OMojoModel with null values would fail. Nulls are now treated as missing values, so predictions no longer fail.

  • We marked the Spark dependencies in our Maven packages as provided. This means we assume that Spark dependencies are always provided by the runtime, which should always be true. This ensures a cleaner and more transparent Sparkling Water environment.

  • In PySparkling, the as_h2o_frame method did not raise an alert when passed a wrong input type. This method accepts only Spark DataFrames and RDDs; however, some users passed in other types, and the method ended silently. We now fail fast if the user passes a wrong data type to this method.
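As a minimal usage sketch of the corrected behavior (assuming a running Spark session with PySparkling available; the toy DataFrame below is illustrative, not from the release notes):

from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sw-example").getOrCreate()
hc = H2OContext.getOrCreate(spark)

# Spark DataFrames and RDDs are the only supported input types.
df = spark.createDataFrame([(1.0, "a"), (2.0, None)], ["x", "category"])
frame = hc.as_h2o_frame(df)    # works: Spark DataFrame -> H2OFrame

# Passing anything else (e.g., a pandas DataFrame or a plain list)
# now raises an error instead of ending silently.
# hc.as_h2o_frame([1, 2, 3])   # fails fast on a wrong input type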

It is also important to mention that Spark 2.3 removed support for Scala 2.10. We’ve done the same in the release for Spark 2.3. Scala 2.10 is still supported in the older Spark versions.

The latest Sparkling Water versions also integrate with H2O 3.18.0.5, which brings several important fixes. The full change log for H2O 3.18.0.5 is available here, and the full Sparkling Water change log can be viewed here.

Thank you!

Kuba

Senior Software Engineer, Sparkling Water Team

H2O + Kubeflow/Kubernetes How-To

Today, we are introducing a walkthrough on how to deploy H2O 3 on Kubeflow. Kubeflow is an open source project led by Google that sits on top of the Kubernetes engine and is designed to alleviate some of the more tedious tasks associated with machine learning. Kubeflow helps orchestrate the deployment of apps through the full cycle of development, testing, and production, while allowing for resource scaling as demand increases. H2O 3 aims to reduce the time data scientists spend on tedious tasks like designing grid search algorithms and tuning hyperparameters, while also providing an interface that gives newer practitioners an easy foothold in the machine learning space. The integration of H2O and Kubeflow is extremely powerful, as it provides a turn-key solution for easily deployable and highly scalable machine learning applications, with minimal input required from the user.

Getting Started:

  1. Make sure you have kubectl and ksonnet installed on the machine you are using, as we will need both. Kubectl is the Kubernetes command-line tool, and ksonnet is an additional command-line tool that assists in managing more complex deployments. Ksonnet helps generate Kubernetes manifests from templates that may contain several parameters and components.
  2. Launch a Kubernetes cluster. This can be either an on-premises deployment of Kubernetes or a cloud cluster on Google Kubernetes Engine. Minikube offers a platform for local testing and development, running as a virtual machine on your laptop.
  3. Make sure to configure kubectl to work with your Kubernetes cluster.
    a. "kubectl cluster-info" will tell you which cluster kubectl is currently configured to work with.
    b. Google Kubernetes Engine has a link in the GCP console that will provide the command for properly configuring kubectl.
    c. "minikube start" will launch Minikube and should automatically configure kubectl. You can verify this by running "minikube status" after launching Minikube.
  4. Now we are ready to start our deployment. To begin, we initialize a ksonnet application by running the command "ks init <your_app_name>".
  5. Move into the directory created by the previous command using "cd <your_app_name>". You will see that it has been populated with a couple of directories, as well as files containing some default parameters. You do not need to touch these.
  6. In order to install the Kubeflow components, we add a ksonnet registry to the application. This can be done by running the commands:
    ks registry add kubeflow <location_of_the_registry>
    ks pkg install kubeflow/core
    ks pkg install kubeflow/tf-serving
    ks pkg install kubeflow/tf-job
    ks pkg install kubeflow/h2o3
    a. This will create a registry called "kubeflow" within the ksonnet application, using the components found at the specified location.
    b. <location_of_the_registry> is typically a GitHub repo. For this walkthrough, you can use this repo, as it has prebuilt components for both H2O and Kubeflow.
    c. "ks pkg install <component_name>" installs the components that we will reference when deploying Kubeflow and H2O.

  7. Let's start by deploying the core Kubeflow components:
    NAMESPACE=kubeflow
    kubectl create namespace ${NAMESPACE}
    ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
    ks env add cloud
    ks param set kubeflow-core cloud gke --env=cloud
    KF_ENV=cloud
    ks apply ${KF_ENV} -c kubeflow-core
    a. These commands will create a deployment of the core Kubeflow components.
    b. Note: if you are using minikube, you may want to create an environment named “local” or “minikube” rather than “cloud”, and you can skip the “ks param set …” command.
    c. For GKE: you may need to run the command "kubectl create clusterrolebinding default-admin --clusterrole=cluster-admin --user=your-user@email.com" to avoid RBAC permission errors.
  8. Kubeflow is now deployed on our Kubernetes cluster. There are two options for deploying H2O on Kubeflow: through Kubeflow's JupyterHub notebook offering, or as a persistent server. Both options use a docker image containing the necessary packages for running H2O.
    a. You can find the dockerfiles needed for both options here.
    b. Copy the dockerfiles to a local directory and run the command "docker build -t <name_for_docker_image> -f <name_of_dockerfile> ." (note the trailing dot, which sets the build context).
    c. If we are deploying to the cloud, it is a good idea to push the image to a container registry like Docker Hub or Google Container Registry.

Deploy JupyterHub Notebook:
1. The JupyterHub server comes deployed with the core Kubeflow components. Running the command "kubectl get svc -n=${NAMESPACE}" will show a service running with the name "tf-hub-0".

2. Use the command "kubectl port-forward tf-hub-0 8000:8000 --namespace=${NAMESPACE}" to make the exposed port available on your local machine, and open http://127.0.0.1:8000 in your browser. Create a username and password when prompted within the browser window, and click "Start My Server".
3. You will be prompted to designate a docker image to pull, as well as requests for CPUs, memory, and additional resources. Fill in the resource requests as preferred.
Note: We already have the notebook image ("h2o3-kf-notebook:v1") pushed to GCR. You will want to build your own image using the dockerfiles provided and push it to GCR. The notebook image is fairly large, so it may take some time to download and start.

4. Once the notebook server has properly spawned, you will see the familiar Jupyter Notebook homepage. Create a new Python 3 notebook. The image built from the provided dockerfiles will have all the requisite plugins to run H2O.
5. A basic example of running H2O AutoML looks something like the sketch below. A sample Jupyter Notebook is available in the repo, or you can follow the example from the H2O AutoML documentation:
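A minimal sketch of such a notebook (the dataset URL and column names are illustrative, mirroring the public H2O AutoML documentation example; substitute your own data):

import h2o
from h2o.automl import H2OAutoML

# Inside the Kubeflow notebook, the H2O cluster runs locally in the pod.
h2o.init()

train = h2o.import_file(
    "https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
y = "response"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()            # binary classification target

# Train models for up to two minutes and keep the best ones.
aml = H2OAutoML(max_runtime_secs=120, seed=1)
aml.train(x=x, y=y, training_frame=train)

print(aml.leaderboard.head())             # ranked models
preds = aml.leader.predict(train)         # predictions from the best model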

Deploy H2O 3 Persistent Server:
1. If we want to deploy H2O 3 as a persistent server, we use the prototype available within the ksonnet registry.
Run the command:
ks prototype use io.ksonnet.pkg.h2o3 h2o3 \
--name h2o3 \
--namespace kubeflow \
--model_server_image <image_name_in_container_registry>
This will create the necessary component for deploying H2O 3 as a persistent server.
2. Finally, deploy the H2O 3 component to the server using this command:
ks apply cloud -c h2o3 -n kubeflow
a. The -c flag specifies the component you wish to deploy, and the -n flag specifies that we deploy the component to the kubeflow namespace.
3. Use “kubectl get deployments” to make sure that the H2O 3 persistent server was deployed properly. “kubectl get pods” will show the name of the pod to which the server was deployed.

4. Additionally, if you run "kubectl get svc -n kubeflow", you will see a service named "h2o3" of type "LoadBalancer". If you wait about a minute, the external IP will change from <pending> to a real IP address.
5. Go to a working directory where you would like to store any Jupyter Notebooks or scripts. At this point you can launch a Jupyter Notebook locally or write a Python script that runs H2O. Make sure your local version of H2O 3 is up to date; you can follow the steps here to install the newest version. By default, docker builds the image using the most current version of H2O 3.
a. Use the external IP address obtained from "kubectl get svc -n kubeflow" and port 54321 in the h2o.init() command to connect H2O to the cluster running in Kubernetes (see the sketch after this list).

b. From here, the steps are the same as in the JupyterHub notebook above. You can follow the example steps outlined in the AutoML example here.
6. Optionally, you can point your browser at the exposed IP address: http://<your_ip>:54321. This launches H2O Flow, H2O's web UI. H2O Flow provides a notebook-like interface with more point-and-click options than a Jupyter Notebook, which requires an understanding of Python syntax.
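A minimal connection sketch for step 5 (the IP address below is a placeholder; use the external IP reported by kubectl):

import h2o

# External IP of the "h2o3" LoadBalancer service, as shown by
# "kubectl get svc -n kubeflow"; 203.0.113.10 is illustrative.
h2o.init(ip="203.0.113.10", port=54321)

# From here, the workflow is identical to the notebook example above:
# import data, run H2OAutoML, and inspect the leaderboard.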

This walkthrough provides a small window into a high-potential, ongoing project. Deployment of H2O.ai's enterprise product Driverless AI on Kubeflow is currently in progress. At the moment, it is deployable in a similar fashion to the H2O 3 persistent server, and beta work can be found in the GitHub repo. Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment.

Please feel free to contact me with any questions via email or LinkedIn.
All files are available here: https://github.com/h2oai/h2o-kubeflow.

Makers in Action: Community, Partners and Team Members at #GTC18

NVIDIA’s GPU Technology Conference (GTC) has been incredible! Folks from all over the world are exploring the latest breakthroughs in self-driving cars, smart cities, healthcare, high performance computing, virtual reality, and more, all propelled by the AI movement.

If you’re attending GTC and would like to see our solutions in action (recently named a leader by Gartner) or chat with one of the H2O.ai Makers, come visit us at booth #725.

We’re excited to be joined at #GTC18 by a number of community members, partners, and H2O.ai team members including:

Community
Venkatesh Ramanathan (Software Engineer, PayPal) and Ajay Gopal (Chief Data Scientist, Deserve) joined speakers from KickView and Memorial Sloan Kettering Cancer Center on the “Deep Learning Institute Executive Workshop” panel.

Arpit Mehta (Data Scientist, Product Owner: Big Data Architectures, BMW Group) presented “Now I C U: Analyzing Data Flow Inside an Autonomous Driving Car” and “Beyond Autonomous Driving: Unleashing Value via Machine Learning Applications in Automotive Industry”.

Partners
Supermicro (Booth #111) – Supermicro provides customers around the world with application-optimized server, workstation, blade, storage and GPU systems. Stop by their booth for a demo of Driverless AI. Driverless AI speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment.

MapD (Booth #602) – MapD’s mission is to redefine the limits of scale and speed in big data analytics. Along with us, NVIDIA, and other innovative companies, they are also one of the founding GPU Open Analytics Initiative (GOAI) members. Enjoy a recent post (“How VW Predicts Churn with GPU-Accelerated Machine Learning and Visual Analytics”) by Wamsi Viswanath (Data Scientist, MapD) in which he shares how H2O was used in concert with MapD.

Dell EMC (Booth #815) – Dell EMC enables digital transformation with trusted solutions for the modern data center.

Team Members
– Wednesday, March 28th, 9am – Ashrith Barthur (Security Scientist) presented “Network Security with Machine Learning”.
– Thursday, March 29th, 11am, Room 211A – Jon McKinney (Director of Research) will be presenting “World’s Fastest Machine Learning With GPUs”.
– Thursday, March 29th, 4pm, LL21C – Arno Candel (CTO) will host the “Hands-on with Driverless AI” workshop.

Enjoy GTC!

Rosalie

Director of Community

H2O4GPU now available in R

In September, H2O.ai released a new open source software project for GPU machine learning called H2O4GPU. The initial release (blog post here) included a Python module with a scikit-learn-compatible API, which allows it to be used as a drop-in replacement for scikit-learn, with GPU support for a selected (and ever-growing) set of algorithms. We are proud to announce that the same collection of GPU algorithms is now available in R, and the h2o4gpu R package is available on CRAN.

The R package makes use of RStudio's reticulate package to facilitate access to Python libraries from R. Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability; it was originally created by RStudio to bring the TensorFlow Python library into R.

This is exciting news for the R community, as h2o4gpu is the first machine learning package that brings together a diverse collection of supervised and unsupervised GPU-powered algorithms in a unified interface. The initial collection of algorithms includes:

  • Random Forest, Gradient Boosting Machine (GBM), Generalized Linear Models (GLM) with Elastic Net regularization
  • K-Means, Principal Component Analysis (PCA), Truncated SVD


h2o4gpu has a functional interface. This is different from many modeling packages in R (including the h2o package); however, functional interfaces are becoming increasingly popular in the R ecosystem.

Here’s an example of how to specify a Random Forest classification model with a non-default value for the max_depth parameter:

model <- h2o4gpu.random_forest_classifier(max_depth = 10L)

To train the model, you simply pipe the model object into the fit() function, which takes the training data as arguments. Once the model is trained, you pipe it into the predict() function to generate predictions.

Here is a quick demo of how to train, predict, and evaluate an H2O4GPU model using the Iris dataset.
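The R demo itself lives in the vignette; as a rough sketch of the same workflow through the underlying Python module that reticulate wraps (assuming h2o4gpu's scikit-learn-style Python API; names mirror the R call above but are not taken from this post):

import h2o4gpu
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The estimator that h2o4gpu.random_forest_classifier() wraps in R.
model = h2o4gpu.RandomForestClassifier(max_depth=10)
model.fit(X_train, y_train)         # trains on the GPU when available

preds = model.predict(X_test)
print(accuracy_score(y_test, preds))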

Detailed installation instructions and a comprehensive tutorial are available in the package vignette, so we encourage you to visit the vignette to get started.

H2O4GPU is a new project under active development, and we are looking for contributors! If you find a bug, please check that we have not already fixed it in the bleeding-edge version, then check whether an issue is already open for the topic. If not, please file a new GitHub issue with a reproducible example. 🙏

  • Here is the main GitHub repo. If you like the package, please 🌟 the repo on GitHub!
  • If you’re looking to contribute, check out the CONTRIBUTING.md file.
  • All open issues that are specific to the R package are here.
  • All open issues are here.

Thanks for checking out our new package!

-- Navdeep Gill, Erin LeDell, and Yuan Tang

Come meet the Makers!

NVIDIA’s GPU Technology Conference (GTC) Silicon Valley, March 26-29, is the premier AI and deep learning event, providing training, insights, and direct access to the industry’s best and brightest. It’s where you will see the latest breakthroughs in self-driving cars, smart cities, healthcare, high-performance computing, virtual reality, and more, all powered by AI. H2O.ai will be there in full force to share how you can immediately gain value and insights from our industry-leading AI and ML platforms. In case you hadn’t heard, H2O.ai was named a Leader in the 2018 Gartner Magic Quadrant for Data Science and Machine Learning Platforms. You can get the report here.

Please visit us at booth #725 to see Driverless AI in action and talk to the Makers leading the AI movement! Our sessions will be leading edge talks that you won’t want to miss.

  1. Ashrith Barthur – Network Security with Machine Learning

    Ashrith will speak about modeling different kinds of cyber attacks and building a model that is able to identify these different kinds of attacks using machine learning.

    Room 210F – Wednesday, 28 March, 9 AM to 9:50 AM.

  2. Jonathan McKinney – World’s Fastest Machine Learning with GPUs

Jonathan will introduce H2O4GPU, a fully featured machine learning library that is optimized for GPUs, with a robust Python API that is a drop-in replacement for scikit-learn. He will demonstrate benchmarks for the most common algorithms relevant to enterprise AI and showcase performance gains compared to running on CPUs.

    Room 220B – Thursday, March 29, 11 AM to 11:50 AM.

  3. Arno Candel – Hands-on with Driverless AI

In this lab, Arno will show how to install and start Driverless AI, the automated Kaggle-Grandmaster-in-a-box software, on a multi-GPU box. He will go through the full end-to-end workflow and showcase how Driverless AI uses the power of GPUs to achieve 40x speedups on algorithms, which in turn allows it to run thousands of iterations and find the best model.

    Room LL21C – Thursday, March 29, 4 PM to 6 PM.

Can’t make it to the event? Schedule a time to talk to one of our makers!

How Driverless AI Prevents Overfitting and Leakage

By Marios Michailidis, Competitive Data Scientist, H2O.ai

In this post, I’ll provide an overview of overfitting, k-fold cross-validation, and leakage. I’ll also explain how Driverless AI avoids overfitting and leakage.

An Introduction to Overfitting

A common pitfall that causes machine learning models to fail when tested in a real-world environment is overfitting. In Driverless AI, we take special measures in various stages of the modeling process to ensure that overfitting is prevented and that all models produced are generalizable and give consistent results on test data.

Before highlighting the different measures used to prevent overfitting within Driverless AI, we should first define overfitting. According to Oxford Dictionaries, it is "the production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably." Overfitting is often explained together with two other terms, bias and variance. Consider the typical learning-curve diagram, which shows how the training and test errors "move" throughout the modeling process.

When an algorithm starts learning from the data, any relationships it finds are (still) basic and simple. In the top left of the chart, both errors, training (teal) and test (red), are quite high, because the model has not yet uncovered much information about the data and makes basic, biased (and quite erroneous) assumptions. These simple assumptions cause high errors due to (high) bias. At this point, the model is not sensitive to fluctuations in the training data because its logic is still very simple; this low sensitivity to fluctuations is often referred to as low variance. This whole initial stage can be described as underfitting, a state in the modeling process that causes errors because the model is still very basic and does not live up to its potential. Underfitting is not deemed as serious as overfitting, because it sits at the beginning of the modeling process, whereas overfitting arises when the modeler (often greedily) tries to maximize performance and ends up overdoing it.

As the model keeps learning, the training error decreases, as does the error on the test data, because the relationships found in the training data are still significant and generalizable. After a while, however, even though the training error keeps decreasing at a fast pace, the test error does not follow: the algorithm has exhausted all significant information in the training data and starts modeling noise. This state, where the training error keeps decreasing but the test error is getting worse (increasing), is called overfitting. The model is learning more than it should and is sensitive to tiny fluctuations in the training data, hence characterized by high variance. Ideally, the learning process should stop at the point where the red line reaches its lowest (optimal) point, somewhere in the middle of the graph. How quickly or slowly this point arrives depends on various factors.
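To make these curves concrete, here is a small, self-contained sketch (purely illustrative, not Driverless AI code): as polynomial degree grows, training error keeps falling while test error eventually rises.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 40)
X_train, y_train = X[::2], y[::2]    # even points for training
X_test, y_test = X[1::2], y[1::2]    # odd points held out

for degree in (1, 3, 9, 15):         # growing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(degree, train_mse, test_mse)   # train falls; test is U-shaped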

In order to get the most out of your predictions, it is imperative to control both overfitting and underfitting. Since Driverless AI aims to build highly predictive models, it has built-in measures to control both. These measures are analyzed in the following sections.

This section includes both existing and upcoming features.

Check similarity of train and test datasets
This step is conditional on a test dataset being provided, which is not always the case. If a test dataset is provided during the training procedure, various checks take place to ensure that the training set is consistent with the test set, both at a univariate level and at a global level. At the univariate level, the distribution of each variable in the training data is checked against the equivalent one in the test data. If a big difference is detected, the column is omitted and a warning message is displayed.

To illustrate the problem, imagine having a variable that measures customers' age in years, with very different distributions in the train and test data. Even a very basic model would start to overfit very quickly.

Suppose the training ages span 40 to 85: the model will assume that any given customer is between 40 and 85 years old, yet it is then tasked with making predictions for customers who are much younger. Such a model would most probably fail to generalize well on the test data, and the more it relies on age, the worse its performance on the test data will be.

To detect this problem, Driverless AI fits models to determine whether certain values of some features tend to appear more in the train or in the test data. If such relationships are found and deemed significant, the feature is omitted and/or a warning message alerts the user to the discrepancy.
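A rough sketch of the idea (an "adversarial validation" style check, not Driverless AI's actual implementation): train a classifier to tell train rows from test rows; an AUC near 0.5 means the distributions match, while an AUC near 1.0 flags a shifted column.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_auc(train, test, col):
    """AUC of a model distinguishing train from test using one feature."""
    X = pd.concat([train[[col]], test[[col]]]).to_numpy()
    y = np.r_[np.zeros(len(train)), np.ones(len(test))]  # 0=train, 1=test
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()

# The age example above: 40-85 in train, much younger in test.
train = pd.DataFrame({"age": np.random.randint(40, 85, 1000)})
test = pd.DataFrame({"age": np.random.randint(18, 30, 1000)})
print(drift_auc(train, test, "age"))   # close to 1.0 -> omit and warn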

K-Fold Cross-Validation

Depending on the accuracy and speed settings selected in Driverless AI before the experiment is run, the training data is split into multiple train and validation pairs, and Driverless AI tries to find the optimal point (mentioned in the top section) on the validation data at which to stop the learning process. Consider the case of K=4.

This process is repeated four times (for K=4), running K models until every part has been validated exactly once. The same model parameters are applied in all these runs, and an overall (average) error is estimated across all parts' predictions, to ensure that any model built with those parameters (features, hyperparameters, model types) will generalize well to any (unseen) test dataset.

More specifically, based on the average error across all these validation predictions, Driverless AI determines:

  • When to stop the learning process. Driverless AI primarily uses XGBoost, which requires a number of iterations to reach the optimal point; the multiple validation datasets help find the globally best, generalizable stopping point that works well across all validation parts (a minimal sketch follows this list).

  • Which hyperparameter values to change or tune. XGBoost has a long list of parameters that control different elements of the modeling process. For instance, based on performance on the validation data, the best learning rate, maximum depth, leaf size, and growing policy are found, helping achieve a better generalization error on test data.

  • Which features or feature transformations to use. Driverless AI's strong performance comes from its feature engineering pipeline. Typically, thousands of features are explored during the training process, but only a minority of them end up being useful. The performance on the validation data, once again, defines which of the generated features are worth keeping and which should be discarded.

  • Which models to ensemble and which weights to use. Driverless AI uses an exhaustive process to find the best linear weights for combining multiple different models (trained with different parameters), obtaining better results on unseen data.
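As a minimal sketch of the stopping-point search described in the first bullet (standard k-fold with XGBoost early stopping; illustrative, not Driverless AI internals):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
params = {"max_depth": 6, "eta": 0.1, "objective": "reg:squarederror"}

best_rounds, scores = [], []
for train_idx, valid_idx in KFold(n_splits=4, shuffle=True,
                                  random_state=0).split(X):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dvalid = xgb.DMatrix(X[valid_idx], label=y[valid_idx])
    booster = xgb.train(params, dtrain, num_boost_round=1000,
                        evals=[(dvalid, "valid")],
                        early_stopping_rounds=50, verbose_eval=False)
    best_rounds.append(booster.best_iteration)   # fold's stopping point
    scores.append(booster.best_score)            # fold's validation RMSE

# Averaging across folds gives a generalizable iteration count and
# an error estimate that is not tied to a single lucky split.
print(int(np.mean(best_rounds)), np.mean(scores))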

Check for IID

A common pitfall that causes overfitting is when a dataset contains temporal elements and the observations (samples) are not independent and identically distributed (IID). With temporal elements, consider the danger of having future data predict past data rather than the other way around; such a setup produces misleading predictions.

Driverless AI can determine the presence of non-IID data easily if a date field (or equivalent) is provided. In this case, it assesses how strong the temporal elements are and whether the target variable's values change significantly across different periods of the date field. It uses correlation, serial-correlation checks, and variable binning to determine whether, and how strongly, these temporal elements are present. Note that this check also extends to the ordering of the dataset: sometimes the data itself is ordered in a way that makes a randomly formulated validation unhelpful, for example when the data is sorted by date/time even though no date field is provided.

If non-IID data is detected, Driverless AI typically switches to a time-series-based validation mode to get more consistent errors when predicting the (further-in-time) test data. The difference from the k-fold scheme above is that the data is split based on date/time, and all train/validation pairs are formulated so that the train part lies in the past and the validation part in the future. The types of features Driverless AI generates are also different from the IID case. This process additionally identifies the main entities to be used for both validation and feature engineering: the group-by categorical features, such as store or product type, that improve predictions when used, for example, sales by store and product over the last 30 days.
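A bare-bones sketch of time-based splitting (illustrative only; assumes a pandas DataFrame with a date column):

import pandas as pd

def time_based_splits(df, date_col, n_splits=3):
    """Yield (train, valid) pairs where train lies strictly in the past."""
    df = df.sort_values(date_col).reset_index(drop=True)
    fold_size = len(df) // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = df.iloc[:k * fold_size]                     # past
        valid = df.iloc[k * fold_size:(k + 1) * fold_size]  # future
        yield train, valid

# Every validation window starts after its training window ends.
sales = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=400),
                      "units": range(400)})
for train, valid in time_based_splits(sales, "date"):
    print(train["date"].max().date(), "<", valid["date"].min().date())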

To summarize this subsection: this check is in place so that, when strong temporal elements are detected, Driverless AI performs feature engineering, selects parameters, and ensembles models in a way that best resembles reality.

Avoiding Leakage in Feature Engineering

Leakage can be defined as the creation of unexpected additional information in the training data that allows a model or machine learning algorithm to make unrealistically good predictions. It has various types and causes. Driverless AI is particularly careful with a type of leakage that can arise when implementing target (mean or likelihood) encoding. On many occasions, these types of features tend to be the most predictive inputs, especially when high-cardinality categorical variables (that is, variables with many distinct values) are present in the data. However, over-reliance on these features can cause a model to overfit quickly. These features may be created differently in different domains: in banking they take the form of weights of evidence; in time series they are past mean values of the target variable at certain lags; and so on.

A common cause of target-encoding failure is predicting an observation whose own target value was included when computing the average for its category. For example, assume a dataset contains a categorical variable called profession, listing job titles such as 'doctor', 'teacher', and 'policeman'. Among all the entries there is only one 'entrepreneur', and that entry has an income of $3,000,000,000. The estimated average income of all 'entrepreneurs' is therefore $3,000,000,000, since there is only one. If a new feature measuring the average income per profession is created while trying to predict income, the model will learn that the average salary of an entrepreneur is $3,000,000,000. Given how big this value is, the model is likely to over-rely on this connection and predict huge values for all entrepreneurs.

This can be referred to as target leakage, because it uses the target value directly as a feature. There are various ways to mitigate this type of leakage, such as estimating averages only when a category has a significant number of cases. Ideally, though, the average values should be computed from data whose target values were in no way used to build the entries' features. One way to achieve this is to keep a separate holdout dataset used only to estimate the mean target values per category, and then apply those means to the train and validation data; in other words, a third dataset dedicated to target encoding.

The latter approach suffers from the fact that the model must then be built with significantly less data, as part of it is surrendered to estimating per-category averages. To counter this, Driverless AI uses CCV, or cross-cross-validation: a cross-validation inside a cross-validation. After a train/validation pair is determined, the train part alone undergoes another k-fold procedure, in which K-1 parts are used to estimate average target values for the selected categories and those averages are applied to the Kth part, until the whole train dataset has its mean (target) values estimated in K batches. The averages can also be applied to the outer validation dataset by taking the mean of all K folds' values.
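A condensed sketch of the inner loop (generic out-of-fold target encoding in the spirit of CCV; not Driverless AI's exact implementation). Note how the lone entrepreneur from the example above receives the global mean rather than its own extreme income:

import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train, cat, target, k=5):
    """Encode `cat` with target means computed only on the other K-1 folds,
    so no row ever sees an average that includes its own target value."""
    encoded = pd.Series(index=train.index, dtype=float)
    prior = train[target].mean()          # fallback for unseen categories
    for fit_idx, enc_idx in KFold(n_splits=k, shuffle=True,
                                  random_state=0).split(train):
        means = train.iloc[fit_idx].groupby(cat)[target].mean()
        encoded.iloc[enc_idx] = train.iloc[enc_idx][cat].map(means).to_numpy()
    return encoded.fillna(prior)

train = pd.DataFrame({
    "profession": ["doctor", "teacher", "doctor", "entrepreneur", "teacher"],
    "income": [120_000, 60_000, 150_000, 3_000_000_000, 65_000],
})
train["profession_te"] = oof_target_encode(train, "profession", "income")
print(train)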

The same type of leakage can also be found in metamodeling. For more information on this topic, enjoy this video, in which Kaggle Grandmaster Mathias Müller discusses this and other types of leakage, including how they arise and how to prevent them.

As for other types of leakage, Driverless AI will throw warning messages if some features are strongly correlated with the target, but it typically does not take any additional action by itself unless the correlation is perfect.

Other means to avoid overfitting
There are various other mechanisms, tools, and approaches in Driverless AI that help prevent overfitting:

  • Bagging (or randomized averaging): When the time comes for final predictions, Driverless AI runs multiple models with slightly different parameters and aggregates the results. Because this injected randomness produces (to some extent) uncorrelated models, the final prediction, based on multiple models, is not too attached to the original data. In other words, bagging can reduce variance without changing bias (a minimal sketch follows this list).
  • Dimensionality reduction: Although Driverless AI generates many features, it also employs techniques such as SVD and PCA to limit feature expansion when encountering high-cardinality categorical features, simplifying the modeling process.

  • Feature pruning: Driverless AI measures how important each feature is to the model. Features that add very little to the prediction are likely to be noise and are therefore discarded.
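A toy sketch of the bagging idea from the first bullet (averaging models trained with different seeds; illustrative, not the product's internals):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=20,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

preds = []
for seed in range(5):
    # Slightly different models: a new seed plus randomized row subsampling.
    model = GradientBoostingRegressor(subsample=0.8, random_state=seed)
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

single_mse = mean_squared_error(y_test, preds[0])
bagged_mse = mean_squared_error(y_test, np.mean(preds, axis=0))
print(single_mse, bagged_mse)   # the average of the models usually wins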

Examples of Consistency

It is no secret that we like to test-drive Driverless AI in competitive environments such as Kaggle, Analytics Vidhya, and CrowdANALYTIX to see how it fares against some of the best data scientists in the world. However, we are interested not only in accuracy, which is a product of how much time you allow Driverless AI to run (configurable at the start), but also in the consistency between validation performance and performance on the test data: in other words, the ability to avoid overfitting.

Here is a list of results from public sources showing Driverless AI's validation and test performance under various combinations of accuracy and speed settings. Some of the results combine multiple Driverless AI outputs with a metamodeling layer on top:

Name                                                   | Metric   | Validation Results     | Test Results           | Final Rank | Source
Analytics Vidhya - Churn Prediction                    | AUC      | 0.689 (rank 14/250)    | 0.677 (rank 8/250)     | 8/250      | LinkedIn blog
Analytics Vidhya - Churn Prediction                    | AUC      | 0.69104 (rank 8/250)   | 0.67844 (rank 5/250)   | 5/250      | Competition website
BNP Paribas Cardif Claims Management                   | log loss | 0.43573 (rank 23/2926) | 0.43316 (rank 18/2926) | 18/2926    | Video
Predicting How Points End in Tennis                    | log loss | 0.1774 (rank 3/207)    | 0.19051 (rank 6/207)   | 6/207      | Competition website
Predicting How Points End in Tennis                    | log loss | 0.1769 (rank 2/207)    | 0.19062 (rank 8/207)   | 8/207      | Competition website
BNP Paribas Cardif Claims Management (fast settings)   | log loss | 0.44196 (rank 52/2926) | 0.44137 (rank 60/2926) | 60/2926    | Video
Amazon.com - Employee Access Challenge (fast settings) | AUC      | 0.91165 (rank 65/1687) | 0.90933 (rank 79/1687) | 79/1687    | Video
New York City Taxi Trip Duration                       | RMSLE    | 0.31017 (rank 11/1257) | 0.31181 (rank 11/1257) | 11/1257    | Tweet
Analytics Vidhya - McKinsey Analytics                  | AUC      | 0.85932 (rank 17/503)  | 0.85456 (rank 6/503)   | 6/503      | Competition website

Sparkling Water 2.2.10 is now available!

Hi Makers!

There are several new features in the latest Sparkling Water. The major new addition is that we now publish the Sparkling Water documentation as a website, available here (this link is for Spark 2.2).

We have also documented and fixed a few issues with LDAP on Sparkling Water. Exact steps are provided in the documentation.

The bundled H2O was upgraded to 3.18.0.4, which brings ordinal regression for GLM as the major change.

The last major change included in this release is the availability of the H2O AutoML and H2O Grid Search transformations for Spark pipelines. They are now exposed as regular Spark Estimators and can be used within your Spark pipelines. An example PySparkling AutoML script can be found here, and a minimal sketch follows.
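A rough sketch of what such a pipeline stage can look like (class and parameter names are assumptions based on the PySparkling API of this era; check the linked script and the documentation for exact signatures):

from pyspark.sql import SparkSession
from pysparkling import H2OContext
from pysparkling.ml import H2OAutoML   # exposed as a regular Spark Estimator

spark = SparkSession.builder.appName("automl-pipeline").getOrCreate()
hc = H2OContext.getOrCreate(spark)

# Illustrative training data with a label column named "label".
train = spark.read.csv("train.csv", header=True, inferSchema=True)

automl = H2OAutoML(maxRuntimeSecs=300, predictionCol="label")
model = automl.fit(train)          # behaves like any Spark Estimator
scored = model.transform(train)    # adds prediction columns
scored.show()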

The full changelog can be viewed here.

Sparkling Water is already integrated with Spark 2.3 on master, and the next release will also target this latest Spark version.

Stay tuned!

Congratulations – H2O is a leader in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms

Congratulations! Thanks to the support of our customer community over the past years, H2O.ai is a Leader, and the one with the most completeness of vision, in the Gartner Magic Quadrant for Data Science and Machine Learning Platforms. This is an ecosystem we dedicated a good part of this decade to opening up and growing, and a testimony to the incredibly community-centric maker culture of team H2O and our relentless support of our customers with beautiful, intelligent products. Our partnerships with NVIDIA and IBM helped bring GPUs to machine learning this past year, and our work with Azure, AWS, and Google Cloud makes it easy to try, train, and deploy AI. Automation of AI pipelines with AI in Driverless AI will help maximize extremely scarce data science talent and bring it to many more enterprises. We will make it cheaper, faster, and easier to experiment and build AI products. This is a fast-moving AI space with tectonic shifts and very high product innovation from great players, and even we are only getting started. We seek your partnership to further transform your problems and verticals with AI and build solutions together.

From our first investors to our latest; our amazing team members, past, present, new, and future, and their supportive families; our community of data scientists who attended the first and most recent meetups to spread the word; our believers; and our customers who backed our vision and execution: each and every one of you is part of this incredibly fun journey. Thank you. Gratitude is the word that comes to mind. Your support inspires us to do great things, in the pursuit of magic (and magic quadrants)! 🙂

This will be fun, Sri