We were very excited to meet with our advisors (Prof. Stephen Boyd, Prof. Rob Tibshirani and Prof. Trevor Hastie) at H2O.AI on Jan 6, 2017.
Our CEO, Sri Ambati, made two great observations at the start of the meeting:
- First was a hardware trend: companies like Intel/Nvidia/AMD plan to build machine learning algorithms directly into their hardware/GPUs.
- Second was a data trend: more and more datasets are images/text/audio rather than traditional transactional data. To deal with these new datasets, deep learning seems to be the go-to algorithm. However, while deep learning might work very well, it was often very difficult to explain to business or regulatory professionals how and why it worked.
There were several techniques to get around this problem and make machine learning solutions interpretable to our customers:
- Patrick Hall pointed out that monotonicity, not linearity, determines the interpretability of a system. He cited a credit scoring system built on a constrained neural network: because each input variable was monotonically related to the response variable, the system could automatically generate reason codes.
- One could run both deep learning and simpler algorithms (like GLM, Random Forest, etc.) on the same dataset. When the performances were similar, we chose the simpler model, since simpler models tended to be more interpretable. These meetings were great learning opportunities for us.
- Another suggestion is to use a layered approach:
- Use deep learning to extract a small number of features from a high-dimensional dataset.
- Next, train a simple model on these extracted features to perform the specific task.
- This layered approach could provide a great speedup as well. Imagine reusing feature sets for images/text/speech that others had already derived from their datasets: all you needed to do was build your simple model on top of those feature sets to perform the functions you desired. In this sense, deep learning is the equivalent of PCA for non-linear features. Prof. Boyd seemed to like GLRM (check out H2O GLRM) for feature extraction as well.
- With this layered approach, there were more system parameters to tune. Our auto-ML toolbox would be perfect for this! Go team!
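Patrick Hall's monotonicity point above can be made concrete with a toy sketch. The weights and feature names below are hypothetical stand-ins, not the actual constrained neural network from the meeting; the key property is that the score is non-decreasing in every input, which is what makes the reason codes trustworthy:

```python
# Hypothetical weights for a toy credit model; in the real system these
# would come from the constrained network.
WEIGHTS = {"income": 0.5, "years_employed": 0.3, "credit_history": 0.2}

def score(features):
    """Toy credit score: monotonically non-decreasing in every input."""
    return sum(WEIGHTS[k] * v for k, v in features.items())

def reason_codes(applicant, best):
    """Monotonicity makes reason codes easy: the features with the largest
    weighted gap to their best attainable value are guaranteed to be the
    ones pulling the score down the most."""
    gaps = {k: WEIGHTS[k] * (best[k] - v) for k, v in applicant.items()}
    return sorted(gaps, key=gaps.get, reverse=True)

applicant = {"income": 0.2, "years_employed": 0.9, "credit_history": 0.5}
best = {k: 1.0 for k in applicant}
print(reason_codes(applicant, best))  # ['income', 'credit_history', 'years_employed']
```

Without monotonicity, a large gap on one feature could raise or lower the score depending on the other inputs, and this simple ranking would no longer be a valid explanation.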
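The choose-the-simpler-model heuristic above can also be sketched in a few lines. The model names, AUC-like scores, and tolerance are illustrative values, not results from the meeting:

```python
# Sketch: prefer the simplest model whose score is close to the best one.

def pick_model(results, tolerance=0.01):
    """results: list of (name, complexity_rank, score), lower rank = simpler.
    Return the simplest model whose score is within `tolerance` of the best."""
    best = max(s for _, _, s in results)
    eligible = [r for r in results if best - r[2] <= tolerance]
    return min(eligible, key=lambda r: r[1])[0]

results = [("GLM", 1, 0.912), ("Random Forest", 2, 0.915), ("Deep Learning", 3, 0.918)]
print(pick_model(results))  # GLM: within 0.01 of the best score, and simplest
```

Tightening the tolerance shifts the choice back toward the best-scoring model, which is the trade-off between interpretability and raw performance in one knob.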
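The layered approach itself can be sketched as a two-stage pipeline. Here a frozen random projection stands in for a pre-trained deep network (or GLRM) feature extractor; nothing below is H2O API code, and all dimensions and weights are made up for illustration:

```python
import random

random.seed(0)
HIGH_DIM, LOW_DIM = 20, 3

# Stage 1: a "pre-trained", frozen extractor mapping 20 raw inputs to 3 features.
projection = [[random.gauss(0, 1) for _ in range(HIGH_DIM)]
              for _ in range(LOW_DIM)]

def extract_features(x):
    """Reduce a high-dimensional input to a small, reusable feature set."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in projection]

def simple_model(feats, weights=(1.0, -0.5, 0.25)):
    """Stage 2: a simple (here, linear) model that only ever sees the 3 features."""
    return sum(w * f for w, f in zip(weights, feats))

x = [random.random() for _ in range(HIGH_DIM)]
feats = extract_features(x)
print(len(feats))  # 3 -- the downstream model never touches the 20 raw inputs
```

Only the small second-stage model needs to be refit per task, which is where the speedup from reusing someone else's extracted features comes from.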
Subsequently the conversation turned to visualization of datasets. Patrick Hall brought up the approach of first using clustering to separate the dataset and then applying a simple model to each cluster. This approach is very similar to the hierarchical mixture of experts algorithm described in our advisors' book, The Elements of Statistical Learning: build decision trees from your dataset, then fit linear models at the leaf nodes to perform specific tasks.
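A minimal sketch of this cluster-then-model idea: points are assigned to the nearest of two fixed centroids (standing in for a real clustering step), and a separate one-variable least-squares line is fit inside each cluster. The data and centroids are made up for illustration:

```python
def fit_line(pts):
    """Closed-form simple linear regression on (x, y) pairs."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    return slope, my - slope * mx

def cluster_then_fit(pts, centroids):
    """Assign each point to its nearest centroid, then fit a line per cluster."""
    clusters = {c: [] for c in centroids}
    for x, y in pts:
        nearest = min(centroids, key=lambda c: abs(x - c))
        clusters[nearest].append((x, y))
    return {c: fit_line(members) for c, members in clusters.items()}

# Two regimes a single global line would miss: y = 2x for small x,
# y = 30 - x for large x.
data = [(x, 2 * x) for x in range(0, 10)] + [(x, 30 - x) for x in range(20, 30)]
models = cluster_then_fit(data, centroids=(5, 25))
print(models[5][0], models[25][0])  # slopes near 2 and -1
```

Each local model stays a plain, explainable line, while the cluster assignment carries the non-linearity, which is exactly the appeal of the mixture-of-experts structure.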
Our very own Dr. Wilkinson had built a dataset visualization tool that could summarize a big dataset while maintaining the characteristics of the original data (like outliers). Totally awesome!
Arno Candel brought up the issue of overfitting and how to detect it during the training process, rather than at the end of training using the held-out set. Prof. Boyd mentioned that we should check out Bayesian trees/additive models.
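One common way to catch overfitting while training is still running is to watch a held-out validation loss every epoch and stop once it has failed to improve for a few epochs. This is a generic early-stopping sketch, not the Bayesian approach Prof. Boyd mentioned, and the loss numbers are made up:

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the index of the best validation loss seen before `patience`
    consecutive non-improving epochs trigger a stop."""
    best, best_epoch, bad = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, bad = loss, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break  # validation loss has turned upward: overfitting
    return best_epoch

# Training loss would keep falling here, but validation turns up after epoch 3.
val = [0.90, 0.70, 0.55, 0.50, 0.53, 0.58, 0.66]
print(early_stop_epoch(val))  # 3 -- the epoch with the lowest validation loss
```

The point is that the held-out set is consulted throughout training, so the divergence between training and validation loss is caught the moment it starts rather than discovered after the fact.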
Last words of wisdom from our esteemed advisors: deep learning was powerful, but other algorithms like random forest could beat it depending on the dataset. Deep learning required big datasets to train, and it worked best with datasets that had some kind of internal organization, like spatial features (in images) or temporal trends (in speech/time series). Random forest, on the other hand, worked perfectly well on datasets with no such structure.