in Uncategorized

Machine Learning for Adtech

Characteristics of advertising data:

  • tens of thousands of columns or more (top 100k or 1 m sites)
  • high collinearity factors: eg demographics, with a strong correlation between eg income and education
  • collinearity: sports fans follow nfl + espn + bleacher report + fox sports; users of ravelry also shop etsy. ┬áThose features are certainly not independent so naive bayes fails.
  • high cardinality factors: eg bluekai style registration data
  • thousands to tens of thousands of campaigns
  • billions of rows; the us pop is 330mm times mean devices (laptop + phone + work) times cookie churn
  • because cookies churn, you have partial row duplication
  • extremely sparse

In addition, adtech machine learning systems have a few other quirks.

A typical campaign has a $100 cost per action (cpa) target and ads cost between $.1 and $5 per thousand (a 10 cent to $5 cpm). Assuming your business needs a 50% margin, you can show between 10 thousand and 500 thousand ads per conversion. Therefore your business is sensitive to missing out on converters and insensitive to showing an ad to a non-converter. This type of system requires very high recall.

Attempting to model hundred million or billion row datasets generally is out of the capacity of anything but linear systems such as logistic regression.  Nonlinear systems such as gradient boosting are often known to produce better business results, are more capable of picking out the tiny fraction of converters, and are more robust to collinearity but can't be run on adtech sized datasets. In a linear only world, better model performance comes through careful feature engineering. Modeling speed / flexibility thus increases the capacity of your modeling scientists and engineers to experiment and allows you to move along the bias-variance tradeoff without using more complex models.

Further, even linear systems will often have to be trained with small batch or stochastic gradient descent, either because better minimization algorithms are too expensive or because it's too difficult to scale learning across multiple machines. This is undesirable because of additional tuning parameters that your results will often be sensitive to. And even with sgd, you will often be required to sample; creating good samples is a difficult problem in it's own right. A typical use case would subsample non-converters and take all the converters to create workable dataset sizes and to help alleviate class imbalance. This alone is hard to get right: one trap is if your positive / negative time periods don't strictly coincide, you can accidentally find a popular but transient event such as the death of a movie star showing up only in the positive or negative converters and becoming a strong but meaningless signal. Another common trap is without careful sampling of the negative converters the negative set will have much shorter histories than the converters; in this case, your model may well give a small positive weight to every feature and what you've really built is a system that detects users with lots of features present.

More next week where I'll discuss ways to decrease model creating time, increase modeling scientist throughput, and begin to look at running more complex models.