Key-Value Abstraction at Netflix | Biweekly Engineering - Episode 38

How Netflix built its key-value system as an abstraction layer

We software engineers love abstractions. And so do Netflix engineers!

This episode features an article from the Netflix Technology Blog where we learn how the teams there built a system as an abstraction: a unified layer for storing key-value data on top of multiple storage engines.

I particularly enjoyed learning about the design and the design choices discussed in the article. So make sure to take your time and read it, or at least read this episode!

Majestic Flåm in Norway

Netflix’s Key-Value Data Abstraction

KV Data Model

The key-value abstraction is a simple data structure:

HashMap<String, SortedMap<Bytes, Bytes>>

The benefit of such a two-dimensional map is that it can also model flat key-value data as well as sets:

Flat key-value: id → {"" → value} // essentially it's id -> value
Set: id → {key → ""} // inside an id, it becomes a set
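
To make this concrete, here is a minimal Java sketch of the two-level structure and the two degenerate encodings. The record ids and keys are my own illustrations, not from the article:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class KvModelSketch {
    // Outer map: record id -> inner sorted map of item key -> item value.
    // byte[] has no natural ordering in Java, so keys sort lexicographically.
    static final Map<String, SortedMap<byte[], byte[]>> store = new HashMap<>();

    static SortedMap<byte[], byte[]> itemsOf(String id) {
        return store.computeIfAbsent(id, k -> new TreeMap<>(Arrays::compare));
    }

    public static void main(String[] args) {
        // Flat key-value: a single entry under the empty key, i.e. id -> value.
        itemsOf("user:42").put(new byte[0], "value".getBytes());

        // Set: the inner keys are the members; the values stay empty.
        itemsOf("genres:42").put("action".getBytes(), new byte[0]);
        itemsOf("genres:42").put("drama".getBytes(), new byte[0]);
    }
}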


Each item also has a simple structure:

message Item (   
  Bytes    key,
  Bytes    value,
  Metadata metadata,
  Integer  chunk
)

An item is basically an element of the inner sorted map in the data structure.

The fundamental benefit of this KV abstraction is that it works with multiple key-value storage systems, such as Apache Cassandra, DynamoDB, and RocksDB!

Design Choices

APIs

Four basic APIs are supported:

  • GetItems: Read items for a record, with selection and predicate support. The response is paginated.

  • PutItems: Insert or update items under a record.

  • DeleteItems: Delete items under a record.

  • Beyond single record: The system also supports multi-item and multi-record APIs like MutateItems and ScanItems. Unlike the APIs above, these span more than a single record. A rough interface sketch follows this list.
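
To visualize the API surface, here is a rough Java sketch. The article names the APIs but not their exact signatures, so the shapes below are assumptions of mine, not Netflix's real interface:

import java.util.List;

// Illustrative request/response shapes, assumed for this sketch.
record Item(byte[] key, byte[] value) {}
record IdempotencyToken(long generationTimeMillis, String token) {}
record ItemPage(List<Item> items, String nextPageToken) {}

interface KeyValueService {
    // GetItems: read items for one record; responses are paginated.
    ItemPage getItems(String namespace, String id, String pageToken);

    // PutItems: insert or update items under one record, idempotently.
    void putItems(String namespace, String id, List<Item> items, IdempotencyToken token);

    // DeleteItems: delete items under one record, idempotently.
    void deleteItems(String namespace, String id, List<byte[]> keys, IdempotencyToken token);

    // ScanItems: iterate over many records in a namespace.
    // MutateItems (multi-record writes) omitted for brevity.
    ItemPage scanItems(String namespace, String pageToken);
}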

Idempotency

Each PutItems and DeleteItems operation comes with an idempotency token:

message IdempotencyToken (
  Timestamp generation_time,
  String    token
)

The IdempotencyToken is not generated in the KV abstraction layer; instead, each client generates its own and attaches it to API requests.

This token ensures that mutation operations like put and delete keep data integrity intact. The token field is a 128-bit UUID, while the generation_time field is a timestamp used to order requests and support the last-write-wins mechanism in databases like Cassandra.
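
Here is a small Java sketch of how a client might generate such a token. The field names mirror the message above, but the construction logic is my assumption, not the article's:

import java.time.Instant;
import java.util.UUID;

public class IdempotencyTokens {
    record IdempotencyToken(Instant generationTime, String token) {}

    static IdempotencyToken newToken() {
        return new IdempotencyToken(
                Instant.now(),                  // generation_time: orders writes (LWW)
                UUID.randomUUID().toString());  // token: random 128-bit UUID
    }
}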

But what is last-write-wins? Let’s explain.

Last write wins (LWW) in Cassandra

In Apache Cassandra, the "last-write-wins" (LWW) semantic is a conflict resolution strategy used when multiple updates occur to the same data in a distributed system. Cassandra relies on timestamps to determine which write is the "latest" and thus the one that persists. Let’s understand this with an example.

Scenario

Imagine a Cassandra database with a table called users that stores user information, including their email address. The table is defined as follows:

CREATE TABLE users (
    user_id text PRIMARY KEY,
    email text
);

Now, suppose two clients (Client A and Client B) are updating the email for the same user_id = 'john_doe' concurrently in a distributed Cassandra cluster.

  1. Client A updates the email to "john.a@example.com" at timestamp 2025-03-22 10:00:00.123 UTC.

  2. Client B updates the email to "john.b@example.com" at timestamp 2025-03-22 10:00:00.456 UTC.

How Cassandra handles it
  • Each write operation in Cassandra includes a timestamp (usually generated by the client or coordinator node).

  • When these updates are sent to the Cassandra cluster, the nodes replicate the data across the system.

  • If there’s a conflict (i.e., two different values for the same user_id), Cassandra uses the last-write-wins rule: it compares the timestamps of the writes and keeps the value with the latest timestamp.

In this case:

  • Client A’s write has a timestamp of 10:00:00.123.

  • Client B’s write has a timestamp of 10:00:00.456.

Since 10:00:00.456 is later than 10:00:00.123, Cassandra resolves the conflict by keeping Client B’s update. The final value stored for john_doe’s email will be "john.b@example.com".
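
Here is a toy Java version of the LWW merge rule, just to make the resolution step explicit (Cassandra breaks exact timestamp ties by comparing the cell values, which this sketch skips):

public class LwwDemo {
    record Write(String value, long timestampMicros) {}

    // Keep whichever write carries the higher timestamp.
    static Write lww(Write a, Write b) {
        return b.timestampMicros() > a.timestampMicros() ? b : a;
    }

    public static void main(String[] args) {
        Write clientA = new Write("john.a@example.com", 123_000L); // 10:00:00.123
        Write clientB = new Write("john.b@example.com", 456_000L); // 10:00:00.456
        System.out.println(lww(clientA, clientB).value()); // -> john.b@example.com
    }
}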

Issue with timestamp

The timestamp is critical here. If clocks are not synchronized across clients (e.g., due to clock skew), an earlier "real-world" write could overwrite a later one if its timestamp is higher.

Netflix has measured clock skew on AWS EC2 machines. Their conclusion is mentioned in the blog:

Although clock-based token generation can suffer from clock skew, our tests on EC2 Nitro instances show drift is minimal (under 1 millisecond).

Note that clocks are inherently unreliable in distributed systems. While I understand that on the happy path the clock drift is low on EC2 machines, I strongly suspect the drift occasionally gets much wider, even if rarely. That means LWW is not 100% bulletproof. This is why we have other mechanisms like quorum-based reads and writes, vector clocks, Conflict-Free Replicated Data Types (CRDTs), etc.

I hope to discuss conflict resolution mechanisms in distributed systems in a separate episode.

Chunking

Key-value stores have by-design limits on how much data can be stored under a single key or partition. This is why, for larger blobs of data, chunking is used: if a key has a large value, the data is split into chunks and written to a separate chunk storage, while the primary database keeps the metadata necessary to retrieve the chunks efficiently.
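
Here is an illustrative Java sketch of the splitting step. The 64 KiB chunk size and the metadata layout are my assumptions, not the article's actual scheme:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkingSketch {
    static final int CHUNK_SIZE = 64 * 1024; // assumed chunk size

    // Split a large value into fixed-size chunks for a separate chunk store.
    static List<byte[]> split(byte[] value) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < value.length; offset += CHUNK_SIZE) {
            int end = Math.min(offset + CHUNK_SIZE, value.length);
            chunks.add(Arrays.copyOfRange(value, offset, end));
        }
        return chunks; // the primary record would keep chunks.size() and pointers
    }
}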

Adaptive Pagination

An interesting design choice discussed in the article is adaptive pagination.

Firstly, Netflix decided to use byte-based pagination instead of item- or row-count-based pagination. The size of items can vary: some items can be 1 KiB whereas others can be 1 MiB. So count-based pagination would risk violating the latency SLO.

Databases like DynamoDB and Cassandra only support count-based pagination. As a result, Netflix had to take a smart approach:

  • Based on a static limit, fetch rows from the underlying storage

  • If the result is below the byte limit in the pagination config, fetch more data to meet the limit

  • If more data is fetched than the limit allows, discard the extra items

  • Generate a pagination token accordingly and send it to the client so that it can request the next page using the token

While this approach works, it is wasteful. This is where adaptive pagination comes into the picture: estimate the byte limit for a page based on the average item size in a namespace. This sets a dynamic page limit in bytes, which reduces the number of additional reads and discarded excess results.
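
Here is a Java sketch of the idea. The Fetcher interface, the 2 MiB budget, and the moving-average update are hypothetical stand-ins, not the article's actual implementation:

import java.util.ArrayList;
import java.util.List;

public class AdaptivePager {
    interface Fetcher { List<byte[]> fetchRows(int rowLimit); } // count-based storage read

    long pageByteBudget = 2 * 1024 * 1024; // bytes allowed per page (assumed)
    double avgItemBytes = 4096;            // running average item size (assumed)

    List<byte[]> nextPage(Fetcher fetcher) {
        // Turn the byte budget into an estimated row count, not a static limit.
        int rowLimit = Math.max(1, (int) (pageByteBudget / avgItemBytes));
        List<byte[]> page = new ArrayList<>();
        long used = 0;
        for (byte[] row : fetcher.fetchRows(rowLimit)) {
            if (used + row.length > pageByteBudget) break; // discard overshoot
            page.add(row);
            used += row.length;
        }
        if (!page.isEmpty()) { // refine the estimate for the next page
            avgItemBytes = 0.9 * avgItemBytes + 0.1 * ((double) used / page.size());
        }
        return page; // a real implementation would also emit a pagination token
    }
}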

KV Usages at Netflix

Key-value is a widely used data access and design pattern in distributed systems. The article shares some common KV abstraction use cases at Netflix:

  • Video streaming metadata

  • User profiles

  • Real-time analytics

So this is where we end this episode today. I hope you learned something new.

Thanks a lot for reading and see you in the next one! 🌊 
