Data Pipeline at Tapad
At Tapad, we deal with a lot of data. Our data pipeline is an essential component of our tech stack that allows us to extract value from the multiple terabytes of data that we push through it every day. Recently, Dag Liodden shared a few details about the data pipeline with Pete Soderling of Hakka Labs. Tapad’s section is copied below; see the post itself for insights from other companies.
Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:
“- All ingested data flows through a message queue in a pub-sub fashion (we use Kafka and push multiple TB of data through it every hour)
- All data is encoded with a consistent denormalized schema that supports schema evolution (we use Avro and Protocol Buffers)
- Most of our data stores are updated in real-time from processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica and the raw events, often enriched with data from our Aerospike cluster, is stored in HDFS)
- Advanced analytics and data science computation is typically executed on the denormalized data in HDFS
- The real-time updates can always be reproduced through offline batch jobs over the HDFS stored data. We strive to make our computation logic so that it can be run in-stream and in batch MR-mode without any modification”
He notes that the last point allows them to retroactively change their streaming computation at will and then backfill the other data stores with updated projections.
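To make the idea of one computation that runs both in-stream and in batch more concrete, here is a minimal Python sketch. It is not Tapad's code: the event shape, the per-device counting logic, and the names `update`, `run_stream`, and `run_batch` are all hypothetical, and plain in-memory lists stand in for the message queue and for HDFS. The point is only that both modes call the same per-event function, so replaying the stored raw events reproduces the projection that the stream built up.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical event shape: a dict carrying a device identifier.
Event = Dict[str, str]


def update(counts: Counter, event: Event) -> Counter:
    """Shared per-event logic (illustrative only): count events per device."""
    counts[event["device_id"]] += 1
    return counts


def run_stream(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Streaming mode: apply `update` as each event arrives and emit a snapshot."""
    counts: Counter = Counter()
    for event in events:
        update(counts, event)
        yield dict(counts)


def run_batch(stored_events: Iterable[Event]) -> Dict[str, int]:
    """Batch mode: replay the stored raw events through the same `update`."""
    counts: Counter = Counter()
    for event in stored_events:
        update(counts, event)
    return dict(counts)


# An in-memory list stands in for both the queue and the stored raw events.
raw_events = [
    {"device_id": "a"},
    {"device_id": "b"},
    {"device_id": "a"},
]

stream_final = list(run_stream(raw_events))[-1]  # last snapshot seen by the stream
batch_result = run_batch(raw_events)             # backfill from stored events
```

Under these assumptions, changing `update` and re-running `run_batch` over the stored events is the backfill step: the recomputed projection can then replace what the stream had written to the other stores.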
Dag also explains the “why” behind their use of multiple types of data technologies on the storage side. Each has its own particular “sweet spot” that makes it attractive to them:
“- Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.
- Aerospike: Extremely fast random access read/write performance by key (we have 3.2 billion keys and 4TB of replicated data), cross data center replication and high availability, but very limited querying capabilities.
- Cassandra: Medium random access read/write performance, atomic counters and a data model that lends itself well to time-series storage. Flexible consistency model and cross data center replication.
- HDFS: High throughput and cheap storage.
- Vertica: Fast and powerful ad-hoc querying capabilities for interactive analysis and high availability, but no support for nested data structures or multi-valued attributes. Storage-based pricing makes us limit the amount of data we put here.”