Aerospike at Tapad - How We Scale

At Tapad, we deal with high-throughput, low-latency systems. Our bidder processes billions of requests per day. The simplified request/response cycle is: receive an HTTP request from a supplier, parse its content, look up related device information, match the enriched request against any number of collections of targeting criteria (each a “tactic”) and collect an internal bid from each matching tactic, run an internal auction, and send the bid that wins the internal auction back to the supplier as the response.
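
To make the cycle concrete, here is a minimal sketch of that flow in Scala. Every name in it (BidRequest, Tactic, lookupDevice, and so on) is an illustrative stand-in rather than our actual code, and the real bidder is considerably more involved:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Illustrative types only; not Tapad's actual model.
case class BidRequest(supplierId: String, deviceId: String, payload: Array[Byte])
case class Device(ids: Seq[String], attributes: Map[String, String])
case class EnrichedRequest(request: BidRequest, device: Option[Device])
case class Tactic(id: String,
                  matches: EnrichedRequest => Boolean,
                  bid: EnrichedRequest => Option[BigDecimal])
case class BidResponse(tacticId: String, price: BigDecimal)

class Bidder(lookupDevice: String => Future[Option[Device]], tactics: Seq[Tactic])
            (implicit ec: ExecutionContext) {

  def handle(request: BidRequest): Future[Option[BidResponse]] =
    // Look up related device information (the Aerospike step described below).
    lookupDevice(request.deviceId).map { device =>
      val enriched = EnrichedRequest(request, device)
      // Match the enriched request against each tactic and collect internal bids.
      val internalBids = for {
        tactic <- tactics if tactic.matches(enriched)
        price  <- tactic.bid(enriched)
      } yield BidResponse(tactic.id, price)
      // Internal auction: the highest internal bid wins and becomes the response.
      if (internalBids.isEmpty) None else Some(internalBids.maxBy(_.price))
    }
}
```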

Billions of records are inserted into the database over the course of normal operation, and many of them naturally become stale and ultimately useless over time, at which point they are deleted. All keys are held in memory, all records live on SSDs and are replicated once (one primary copy plus one replica), and records are never cached by the database.

Looking up related device information quickly is a critical step. There is too much data to cache in the main process, so we leverage Aerospike as a key-value store. The key is the device id, and the value is either a serialized device record (protocol buffers) or a pointer to another id. A single device has multiple ids and we must be able to look up the device by any of them, but Aerospike does not support multiple keys pointing to the same physical record because of the way indexing works; this is why some values are pointers that must be followed to reach the actual record. Each record has a primary copy on a partition on one node in the cluster and a secondary copy on a different node. If there are multiple outstanding reads for the same key, the client will warn via a “hot key” error. If a node goes down, the cluster automatically starts rebalancing, which has a noticeable impact on read and write performance.
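
To illustrate the lookup path, here is a rough sketch of reading a device by any of its ids with the Aerospike Java client from Scala, following pointer records until the real record is found. The namespace, set, and bin names ("devices", "ptr", "data") are assumptions made for the example, not our actual schema:

```scala
import com.aerospike.client.{AerospikeClient, Key}
import com.aerospike.client.policy.Policy

// Hypothetical schema: a "ptr" bin holds another id to chase, a "data" bin
// holds the serialized protocol buffer for the device.
class DeviceStore(client: AerospikeClient,
                  namespace: String = "devices",
                  set: String = "device") {
  private val readPolicy = new Policy()

  /** Resolve an id to the serialized device record, following at most a few
    * levels of pointer records. */
  @annotation.tailrec
  final def lookup(id: String, hops: Int = 3): Option[Array[Byte]] =
    if (hops <= 0) None
    else {
      val record = client.get(readPolicy, new Key(namespace, set, id))
      if (record == null) None
      else Option(record.getString("ptr")) match {
        case Some(primaryId) => lookup(primaryId, hops - 1) // pointer record: chase it
        case None            => Option(record.getValue("data")).map(_.asInstanceOf[Array[Byte]])
      }
    }
}
```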

Each namespace is configured independently; for ease of use we keep essentially everything in one giant namespace. This is easier to scale because capacity can be added to an existing namespace simply by bringing up a new node and pointing it at the cluster, whereas adding or removing a namespace requires a cluster restart, which is not an attractive option. The nodes are assumed to be homogeneous, so a node with more storage available than the others cannot make use of the extra space until all the other nodes have the same storage capacity.
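
For reference, a namespace stanza in aerospike.conf that reflects the layout described so far (primary index in memory, records on SSD, one replica, eviction below a high watermark) could look roughly like the sketch below. The namespace name, device path, sizes, and TTL are placeholders, not our production values:

```
namespace devices {
    replication-factor 2          # primary copy plus one replica
    memory-size 60G               # primary index (all keys) held in RAM
    default-ttl 30d               # used when a write does not set a TTL
    high-water-disk-pct 60        # eviction threshold, discussed below

    storage-engine device {
        device /dev/sdb           # records live on SSD
        data-in-memory false      # records are never cached in RAM
        write-block-size 128K
    }
}
```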

Two data management features that we make heavy use of are expiration and eviction. Each record has a TTL, and Aerospike automatically deletes the record when that expiration time is reached. Eviction is a strategy to accelerate expiration when necessary: if the database is running out of space because of a surge in new writes, the oldest records are deleted to make room for new ones. Fresh data is more valuable than old data, so this is exactly the desired behavior. It works best when TTLs are well distributed. Be sure to set a TTL; otherwise the record defaults to “no expiration” and the database will be unable to manage its deletion automatically.
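
In client code, this just means always populating the expiration field of the write policy. A sketch with the Java client from Scala, again with made-up namespace, set, and bin names:

```scala
import com.aerospike.client.{AerospikeClient, Bin, Key}
import com.aerospike.client.policy.WritePolicy

object DeviceWriter {
  def put(client: AerospikeClient, id: String, serializedDevice: Array[Byte]): Unit = {
    val policy = new WritePolicy()
    // TTL in seconds; 0 falls back to the namespace default-ttl, -1 means "never expire".
    policy.expiration = 14 * 24 * 60 * 60
    client.put(policy, new Key("devices", "device", id), new Bin("data", serializedDevice))
  }
}
```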

The “high watermark” is the threshold at which eviction kicks in. The recommended setting is 50%; we ran at 70% successfully for a long time and dynamically decreased it to 60% to mitigate a situation where we were running short on capacity. The change let Aerospike clear out old data faster so the system could continue operating.
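
The watermark can be adjusted on a running cluster; a dynamic configuration call along the lines of the one below (the namespace name is a placeholder) is how such a change can be applied without a restart:

```
asinfo -v "set-config:context=namespace;id=devices;high-water-disk-pct=60"
```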

We have a simple Scala wrapper around the Java client, which provides blocking and non-blocking modes.
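
A minimal sketch of what such a wrapper can look like is below. Our actual wrapper differs; in particular, the non-blocking variant here simply runs the blocking call on a separate execution context rather than using the client's asynchronous API:

```scala
import com.aerospike.client.{AerospikeClient, Key, Record}
import com.aerospike.client.policy.Policy
import scala.concurrent.{ExecutionContext, Future}

class AerospikeScalaClient(underlying: AerospikeClient, namespace: String)
                          (implicit blockingEc: ExecutionContext) {
  private val readPolicy = new Policy()

  /** Blocking mode: returns once the server has responded. */
  def get(set: String, key: String): Option[Record] =
    Option(underlying.get(readPolicy, new Key(namespace, set, key)))

  /** Non-blocking mode: completes a Future off the caller's thread. */
  def getAsync(set: String, key: String): Future[Option[Record]] =
    Future(get(set, key))(blockingEc)
}
```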

Video of the talk (thanks to Hakka Labs for helping with A/V):