Spooking Out on Some Scary TV
Here at Tapad, a big role for our data science and engineering teams is to distill useful information from all the data that we work with. I was recently tasked with analyzing a month’s worth of Smart TV viewership data.
The project yielded some really interesting findings on common viewer behaviors, which you can check out below. First though, let’s look at how we came to these findings.
For advertisers partnering with Tapad, we can learn when their TV commercial was shown on a specific television, what network showed it and what content it was shown during. We store this data on HDFS (Hadoop Distributed File System).
Since we’re looking for the similarity between audiences of different TV shows, we can use the Jaccard index, or the number of TVs that are common to both shows, divided by the total number of unique TVs found in either show. Our output should be every possible content pairing with a number, between zero and one, to indicate the similarity of the two shows’ audiences. For instance, a sample row of our output could be:
| Content A | Content B | Similarity |
| --- | --- | --- |
| Desperate Housewives | Jersey Shore | 0.25 |

This would indicate that a quarter of the TVs that watched either “Desperate Housewives” or “Jersey Shore” watched both shows.
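The Jaccard index itself is a one-liner over sets. Here is a minimal Python sketch (the TV identifiers are made up for illustration; the real audiences live in HDFS, not in memory):

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0  # avoid dividing by zero for two empty audiences
    return len(a & b) / len(a | b)

# Hypothetical audiences: one TV in common, four unique TVs overall.
housewives = {"tv1", "tv2"}
jersey_shore = {"tv2", "tv3", "tv4"}

print(jaccard(housewives, jersey_shore))  # 1 common / 4 unique = 0.25
```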
We turn to Scalding, a Tapad staple, to read in the viewership data, analyze it, and emit the desired similarity scores. In Scalding, data processing occurs in a pipe, which can be thought of as a “distributed unordered list that may or may not yet have been materialized in memory or disk” (https://twitter.github.io/scalding/index.html#com.twitter.scalding.typed.TypedPipe). Pipes are manipulated similarly to regular Scala lists, with a stream of tuples representing the input and output of each pipe.
Initially, I planned to find audience similarity for the networks on which a particular commercial aired, leading me to naively assume that the number of combinations would be small, around ten or less. So, for every network on which the commercial aired, I looped through every other relevant network, calculating in a pipe the number of TVs in the intersection of the two networks’ audiences divided by the number of TVs in the union. I then combined all of the pipes (one for every pair) into one large pipe and wrote the result to a file. In rough pseudocode, we have:
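The original Scalding snippet isn’t reproduced here, but the shape of the naive plan can be sketched in plain Python, where each per-pair “pipe” is just a list and the audience sets and network names are invented:

```python
from itertools import combinations

# Invented sample data: network -> set of TVs that saw the commercial there.
audiences = {
    "Network A": {"tv1", "tv2", "tv3"},
    "Network B": {"tv2", "tv3", "tv4"},
    "Network C": {"tv5"},
}

# One tiny "pipe" per network pair, each holding a single result row.
pair_pipes = []
for net_a, net_b in combinations(audiences, 2):
    a, b = audiences[net_a], audiences[net_b]
    score = len(a & b) / len(a | b)  # Jaccard index for this pair
    pair_pipes.append([(net_a, net_b, score)])

# Merge every per-pair pipe into one output. In Scalding, this merge of
# thousands of pipes is the step that blew up.
merged = [row for pipe in pair_pipes for row in pipe]
```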
This algorithm is correct and runs successfully for small numbers of networks, but unfortunately, for any real-life number of networks (or contents, of which we have over 2,000!), this program crashed miserably with a StackOverflowError when merging the pipes. This demanded an alternative, more scalable solution, so I turned to another engineer for advice.
Together, we formulated a new, more scalable algorithm, armed with the fresh insight that the number of distinct TVs in the union of two shows’ audiences is the number of TVs who watched show A plus the number of TVs who watched show B minus the number of TVs who watched both shows. In other words,
|A∪B| = |A| + |B| - |A∩B|

so the similarity index

|A∩B| / |A∪B|

becomes

|A∩B| / (|A| + |B| - |A∩B|).
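A quick numeric check of the inclusion-exclusion identity, and of the two equivalent forms of the similarity index, on small sample sets:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = len(a | b)   # 5
inter = len(a & b)   # 2

# |A ∪ B| = |A| + |B| - |A ∩ B|
assert union == len(a) + len(b) - inter

# Both forms of the similarity index agree.
direct = inter / union
rewritten = inter / (len(a) + len(b) - inter)
assert direct == rewritten
```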
Underneath the Hood of the New Algorithm
In our renovated algorithm, we construct a pipe with a (TV, content) tuple for every commercial aired, where content can also mean network. Grouping by TV and calling toSet yields a pipe mapping each TV to the set of contents it has viewed (technically, commercials during those contents):
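In plain Python, that grouping step looks roughly like this (the (TV, content) rows are invented; in Scalding this is a groupBy followed by toSet):

```python
from collections import defaultdict

# One (tv, content) row per commercial airing, invented for illustration.
airings = [
    ("tv1", "Buffy the Vampire Slayer"),
    ("tv1", "The Dead Files"),
    ("tv1", "Buffy the Vampire Slayer"),  # duplicate airings collapse in the set
    ("tv2", "The Dead Files"),
]

# Group by TV into the set of contents that TV has viewed.
contents_by_tv = defaultdict(set)
for tv, content in airings:
    contents_by_tv[tv].add(content)
```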
We take this pipe, flatMap each tuple to each device’s set of contents, group by content and then count the size of each grouping, i.e. the number of devices that have seen each content. These are the |A| and |B| audience sizes that we need.
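The same flatMap-and-count step, sketched locally in Python (sample data invented):

```python
from collections import Counter

# TV -> set of contents viewed, as produced by the previous step.
contents_by_tv = {
    "tv1": {"Buffy the Vampire Slayer", "The Dead Files"},
    "tv2": {"The Dead Files"},
}

# Flatten each TV's set back out to single contents, then count how many
# distinct TVs saw each content: the |A| and |B| audience sizes.
audience_size = Counter(
    content for contents in contents_by_tv.values() for content in contents
)
```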
Now, we need to calculate |A∩B| for every pair of contents. We can reuse our initial pipe mapping each TV to the set of contents it watched. For every TV’s set of contents, we form all combinations of two contents and emit those pairs. Then, we simply need to count the number of times each pair occurs across all TVs’ content sets to find the number of TVs in the intersection of the two shows’ audiences. We just group by content pair and count the size of each grouping:
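A Python sketch of the pair expansion and counting, with each pair sorted so that (A, B) and (B, A) land in the same group (sample data invented):

```python
from collections import Counter
from itertools import combinations

# TV -> set of contents viewed, reused from the first pipe.
contents_by_tv = {
    "tv1": {"Buffy the Vampire Slayer", "The Dead Files", "Tosh.0"},
    "tv2": {"Buffy the Vampire Slayer", "The Dead Files"},
}

# Emit every two-content combination per TV, then count each pair across
# all TVs: the count is |A ∩ B| for that pair of contents.
intersection_size = Counter(
    pair
    for contents in contents_by_tv.values()
    for pair in combinations(sorted(contents), 2)
)
```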
All that’s left is putting the pieces together. We use join to bring all of the set sizes onto one reducer. With the content audience sizes and audience intersection sizes together at last, we can calculate the similarity scores.
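An end-to-end sketch of the final join in plain Python, combining the audience-size and intersection-size counts into Jaccard scores via the rewritten formula (content names are just placeholders):

```python
from collections import Counter
from itertools import combinations

# Invented sample data: TV -> set of contents viewed.
contents_by_tv = {
    "tv1": {"A", "B"},
    "tv2": {"A", "B"},
    "tv3": {"A"},
    "tv4": {"B", "C"},
}

# |A|, |B|, ... per content, and |A ∩ B| per content pair.
audience_size = Counter(c for s in contents_by_tv.values() for c in s)
intersection_size = Counter(
    pair for s in contents_by_tv.values() for pair in combinations(sorted(s), 2)
)

# Join the two and apply |A∩B| / (|A| + |B| - |A∩B|).
similarity = {
    (a, b): n / (audience_size[a] + audience_size[b] - n)
    for (a, b), n in intersection_size.items()
}
```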
The Results Are In
Now, the results. Check out the table below.
In column 1, you see a selection of scary TV shows and films watched in the last week. In column 2, you see the content that was most often watched by the same audience.
- Some of the comparisons make perfect sense – Buffy the Vampire Slayer fans also dig The Dead Files…
- Some of the comparisons are more startling – fans of the movie Sinister are also big fans of Tosh.0. Wah?
Interested in learning more about Tapad data science and engineering? Connect with us at firstname.lastname@example.org.
| Scary show or film | Most watched by the same audience |
| --- | --- |
| Evil Twins | Married at First Sight |
| Paranormal Activity 2 | What on Earth? |
| Buffy the Vampire Slayer | The Dead Files |
| Abraham Lincoln vs. Zombies | Cults: Dangerous Devotion |
| Paranormal Activity | The Fifth Element |
| Cults: Dangerous Devotion | The Middle |
| Abraham Lincoln vs. Zombies | Gunsmoke |
| Evil Twins | The Pack |
| Paranormal Witness | While You Were Sleeping |
| The Dead Files | Tosh.0 |
| Paranormal Witness | The Day After Tomorrow |
| Ghostly Encounters | NBC 10 News Today at 6:00a |
| Buffy the Vampire Slayer | Raising Hope |
| Paranormal Witness | The Hobbit: An Unexpected Journey |
| Cults: Dangerous Devotion | Paranormal Activity |
| Ghost Adventures | Parks and Recreation |