## Data Inference: Drinking With the Dude

Data Science is the art of combining computer science, statistics, mathematics, and any number of other fields, then applying that combination to a pile of data to extract some kind of added meaning.

At least, this is what it means to me. I use the word art here on purpose; we take complicated formulas and algorithms and apply them to our data – that’s the science bit. However, there’s another side to it, which comes from looking at the data, understanding its origins, and finding the right approach to the problem at hand – this requires a sharp eye and keen observation, not unlike a painter trying to capture subtle facial expressions while composing a portrait.

Tapad’s Data Science team has one ongoing project that involves taking nodes (devices) on our device graph and coming up with various ways to score the edges that connect those nodes. This helps Tapad link various devices that are often seen together or that have similar traits. Our data - which are representative of people and society - are noisy and biased, so understanding where the information came from helps us build a cleaner model.

With this in mind, Tapad’s data team spends a lot of time inspecting inherent bias in our data. One source of bias is what we call platform bias. Platform bias is something we all inherently understand: you don’t use your phone the same way you use your tablet or the same way you’d use your desktop machine. Every device has its purpose. Perhaps you are more likely to look for directions on your phone than on your laptop. Or maybe you only visit the mobile version or app version of your favorite website when on the go, but opt for the desktop version when at home. Naive approaches to similarity scoring would view these behavior patterns as indicative of inherent differences between these devices.

Another source of bias comes from data we receive from various partners. These data may represent consumer segments or various other pieces of data associated with that device. Data can only be obtained when the device can be identified by both parties. This often means that a device in our graph will have data from some of our partners, but not all. This is a problem because the absence of a piece of data doesn’t necessarily imply that that device should not be labelled with it. Instead, it may indicate that we did not establish a common identity with the source of that piece of data.

Let’s use a concrete example.

Say we are on a polling committee and we want to determine the correlation between people’s favorite films and their favorite cocktails. We don’t want to ask people too many questions, so we split the questions into two surveys: favorite movie and favorite cocktail.

We give 500 people just the movie survey and another 500 just the cocktail one. We then give 50 people both surveys (so we’ve given 1100 surveys to 1050 people, since 50 of them got two surveys). Maybe when we give out our movie survey to the first 500 people, 25 of them (5%) say that their favorite movie is the Big Lebowski. Likewise, in our cocktail-only survey, 75 (15%) say that their favorite cocktail is a White Russian. So far, in the “just favorite movie” pool we find 5% like the Big Lebowski, whereas in our “just favorite cocktail” pool we find 15% prefer White Russians.

Now let’s look at our 50 person pool. Let’s say in that pool 10 of them said their favorite movie was the Big Lebowski, 8 of whom also said their favorite cocktail was a White Russian (for simplicity let’s say that no one said their favorite cocktail was a White Russian but their favorite movie was something other than the Big Lebowski.)

Let’s put this all in table form:

| Survey | Total | Favorite is White Russian | Favorite is the Big Lebowski |
|---|---|---|---|
| Movie only | 500 | Unknown | 25 (5%) |
| Cocktail only | 500 | 75 (15%) | Unknown |
| Movie + cocktail | 50 | 8 (16%) | 10 (20%) |
| Totals | 1050 | 75 + 8 + Unknown | 25 + Unknown + 10 |

If we look at the set where people took both surveys, we see that when the Big Lebowski is someone’s favorite movie (10 people), their favorite cocktail is a White Russian 80% of the time (8 of those 10 people). This is a conditional probability: the condition is that the person already said they liked the Big Lebowski. In contrast, every time we see that a White Russian is someone’s favorite cocktail, we also see that their favorite movie is the Big Lebowski. So that conditional probability is 100% – everyone said their favorite film was the Big Lebowski if we already know that they love White Russians. Note how these two probabilities are not the same! Conditional probabilities are tricky beasts!
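To make the counting concrete, here is a short Python sketch that computes both conditional probabilities from the 50-person pool (the individual records are invented, but they match the counts above):

```python
# Rebuild the 50-person "both surveys" pool from the counts in the text:
# 10 Big Lebowski fans, 8 of whom favor White Russians, and no White
# Russian fans whose favorite movie is something else.
pool = (
    [("Big Lebowski", "White Russian")] * 8
    + [("Big Lebowski", "Other drink")] * 2
    + [("Other movie", "Other drink")] * 40
)

def conditional(records, given, then):
    """P(then | given): among records satisfying `given`, the fraction satisfying `then`."""
    matching = [r for r in records if given(r)]
    return sum(then(r) for r in matching) / len(matching)

# P(White Russian | Big Lebowski) = 8 / 10
p_wr_given_bl = conditional(pool,
                            lambda r: r[0] == "Big Lebowski",
                            lambda r: r[1] == "White Russian")
# P(Big Lebowski | White Russian) = 8 / 8
p_bl_given_wr = conditional(pool,
                            lambda r: r[1] == "White Russian",
                            lambda r: r[0] == "Big Lebowski")
print(p_wr_given_bl, p_bl_given_wr)  # 0.8 1.0
```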

In fact, these conditional probabilities are all related to the famous Bayes’ Theorem (less complicated than it sounds!).

In math speak we could write:

P(A | B) = P(B | A) × P(A) / P(B)

In human speak we would say, “The probability of A given B equals the probability of B given A times the probability of A divided by the probability of B”.

In terms of our little equation (over JUST the subset who took both surveys!), let Big Lebowski = BL and White Russian = WR:

P(BL | WR) = P(WR | BL) × P(BL) / P(WR)

Let’s do the math. The three terms on the right-hand side equal:

- The probability that the favorite drink is a WR, given that the favorite movie is BL: P(WR | BL) = 8/10 = 0.8 (80% as stated before, but in decimal form now)
- The probability that the favorite movie is BL: P(BL) = 10/50 = 0.2
- The probability that the favorite drink is a WR: P(WR) = 8/50 = 0.16

Putting it all together:

P(BL | WR) = 0.8 × 0.2 / 0.16 = 1.0

So the conditional probability that someone’s favorite movie is the Big Lebowski given that they adore White Russians is 100%. Exactly what we reasoned quite simply earlier! Pretty cool.
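That arithmetic is easy to check in a few lines of Python; using exact fractions sidesteps floating-point rounding:

```python
from fractions import Fraction

# The three right-hand-side terms from the text.
p_wr_given_bl = Fraction(8, 10)   # P(WR | BL) = 0.8
p_bl = Fraction(10, 50)           # P(BL)      = 0.2
p_wr = Fraction(8, 50)            # P(WR)      = 0.16

# Bayes' Theorem: P(BL | WR) = P(WR | BL) * P(BL) / P(WR)
p_bl_given_wr = p_wr_given_bl * p_bl / p_wr
print(p_bl_given_wr)  # 1
```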

We should probably contrast it with what is called the joint probability:

P(A and B) = P(A | B) × P(B)

What does this mean? It’s the probability of observing A and B simultaneously over the set at large (not restricted to a set where B or A is already true). In our case, out of the 50 surveys we see that 8 out of 50 surveys came back with both true. So that makes 16%. We’d say that the joint probability that people like White Russians and the Big Lebowski is 16%. Check for yourself with our above probabilities and the previous formula.
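As a quick check in Python, the chain-rule form and the direct count from the 50 surveys agree:

```python
import math

p_wr_given_bl = 8 / 10            # P(WR | BL)
p_bl = 10 / 50                    # P(BL)

p_joint = p_wr_given_bl * p_bl    # joint probability via the formula
p_direct = 8 / 50                 # 8 of the 50 surveys had both answers

print(math.isclose(p_joint, p_direct))  # True
```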

We can use this data as a Rosetta Stone. We can go back to our original data for the 500 people who took just the cocktail survey and the others who took just the movie survey. From our Rosetta Stone we can infer that 80% of the people who said their favorite movie was the Big Lebowski are probably interested in imbibing a delicious Kahlua-tinged milky beverage, whereas 100% of the people who love that drink are itching to go bowling with The Dude.

We can use that to (probabilistically) infer the missing data – even though we never actually asked the two isolated groups these questions! So we would say that for each of the 25 people (5%) who said their favorite movie is the Big Lebowski, there is an 80% chance that their favorite cocktail is a White Russian, whereas there is a (nearly) 100% chance that the people who took only the cocktail survey and said their favorite drink was a White Russian (75 of them!) would also rate the Big Lebowski as their favorite film.

If we wanted to show an ad campaign for the next big coffee-tinged liquor, we might target the original 75 people who took only the beverage survey and said their favorite beverage was a White Russian, plus the 8 who took both and said the same. We could then also target the 25 people who said that they preferred the Big Lebowski – even though we have no direct data on their drink preferences. Our probabilities suggest that about 20 of those 25 (80% of them) will be White Russian fans. All told, we thus expect about 75 + 8 + 20 = 103 people (out of the 1050 people in our total, representative population) to be coffee-based cocktail fans.
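The back-of-the-envelope targeting estimate works out like this (the counts come straight from the surveys above):

```python
cocktail_only_wr = 75     # cocktail-only takers who said White Russian
both_surveys_wr = 8       # both-survey takers who said White Russian
movie_only_bl = 25        # movie-only takers who said Big Lebowski (drink unknown)
p_wr_given_bl = 0.8       # inferred from the Rosetta Stone pool

inferred_wr = round(p_wr_given_bl * movie_only_bl)   # about 20 of the 25
total_targets = cocktail_only_wr + both_surveys_wr + inferred_wr
print(total_targets)  # 103
```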

Now imagine we replace surveys with data syncs from our partners. Our Rosetta Stone is now all the times we see data from multiple partners on the same devices. Our survey questions are the specific pieces of data (perhaps some inferred demographic information, or use of a specific mobile app) associated with those partners. This is pretty close to what we do, but with thousands of pieces of data! In a giant matrix! We use this to probabilistically infer the missing values for partners we do not see on a device (like what a user’s favorite movie is likely to be when we only have data on their favorite cocktail). This allows us to use relatively sparse data to extract a richer picture of how devices are related.
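Here is a deliberately simplified sketch of that idea. The attribute names, the dict-per-device representation, and the single-attribute inference are all invented for illustration – the real system works over thousands of attributes at once – but it shows the shape of the computation: estimate a conditional probability from devices where both partners synced, then fill in the gap where only one did.

```python
# Each device maps attribute -> True when some partner reported it.
# A missing key means "that partner never synced", NOT "false".
devices = [
    {"likes_big_lebowski": True, "likes_white_russian": True},
    {"likes_big_lebowski": True, "likes_white_russian": True},
    {"likes_big_lebowski": True},   # cocktail partner never synced
    {"likes_white_russian": True},  # movie partner never synced
]

def conditional_prob(devs, given, target):
    """Estimate P(target | given) from devices where BOTH attributes synced."""
    both = [d for d in devs if given in d and target in d]
    positives = [d for d in both if d[given]]
    if not positives:
        return None
    return sum(1 for d in positives if d[target]) / len(positives)

def infer_missing(devs, given, target):
    """Attach a probabilistic guess for `target` where only `given` is observed."""
    p = conditional_prob(devs, given, target)
    for d in devs:
        if d.get(given) and target not in d:
            d[target + "_prob"] = p   # inferred, not observed
    return devs

infer_missing(devices, "likes_big_lebowski", "likes_white_russian")
print(devices[2])  # {'likes_big_lebowski': True, 'likes_white_russian_prob': 1.0}
```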