How Datadog Built Indexing for Metrics | Biweekly Engineering - Episode 43

From on-demand indexing to index it all!

Datadog is one of the leaders in observability-as-a-service platforms. They deal with data at massive scale - think of trillions of events flowing through their system every single day. Systems at this scale require serious engineering investment to keep queries efficient and performant.

Today, we get a little peek into how Datadog handles its timeseries data. Timeseries data is the backbone of all its observability products, as it is for any other observability platform. We get to learn how they changed their timeseries indexing - from a legacy on-demand solution to a pre-calculated index. This article is a great example of evolving a system from an adaptive design to one built for predictable, massive scale.

Let’s read it!

The Prophet’s Mosque in Madinah, KSA

What is a timeseries, again?

A time series is a sequence of data points indexed, ordered, or graphed in time. Essentially, it's any data set where the order matters because it reflects how a quantity changes over time.

We can think of two primary characteristics of timeseries data:

  • Ordered by time: The critical feature of a timeseries is that the chronological sequence of the observations is meaningful. You can't shuffle the data without losing crucial information.

  • Sequential dependence: Adjacent data points are often dependent on each other. For instance, the temperature today is usually related to the temperature yesterday.

Example

Consider a server monitoring system logging the CPU utilization of a web server every minute.

Timestamp              CPU Utilization (%)
2025-11-26 10:00:00    25
2025-11-26 10:01:00    28
2025-11-26 10:02:00    35
2025-11-26 10:03:00    33
2025-11-26 10:04:00    41

This table represents a time series. Analyzing this data can reveal trends (for example, peak usage hours) and anomalies (like sudden spikes in usage). The value 35% is not just a number; it is the CPU usage at 10:02:00. The timestamp gives the data its meaning.
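
To make this concrete in code, here is a minimal Rust sketch (my own, not from the article; the representation and the jump threshold are arbitrary) that stores the series above as ordered (timestamp, value) pairs and compares adjacent points to flag sudden jumps:

```rust
fn main() {
    // The CPU-utilization series from the table above, ordered by time.
    let series: Vec<(&str, f64)> = vec![
        ("2025-11-26 10:00:00", 25.0),
        ("2025-11-26 10:01:00", 28.0),
        ("2025-11-26 10:02:00", 35.0),
        ("2025-11-26 10:03:00", 33.0),
        ("2025-11-26 10:04:00", 41.0),
    ];

    // Sequential dependence in action: each point is compared to its
    // predecessor. The 5-point jump threshold is arbitrary, just to
    // illustrate a crude spike check.
    for window in series.windows(2) {
        let (prev_ts, prev) = window[0];
        let (ts, curr) = window[1];
        if curr - prev > 5.0 {
            println!("jump between {prev_ts} and {ts}: {prev}% -> {curr}%");
        }
    }
}
```

Shuffle the rows and the comparison becomes meaningless - which is exactly the ordering property described above.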

Timeseries indexing at Datadog

The First Generation

The original service was built around an adaptive index. It was written in Go and used embedded databases: SQLite for storing metadata and the query log, and RocksDB for the high-volume index storage.

The system's indexing was based on usage:

  • It was driven by a query_log that recorded slow queries. A background process analyzed the selectivity of these queries (the ratio of scanned IDs to returned IDs).

  • It automatically created a specialized index (a materialized view) in the Indexes DB for highly selective, frequently run queries. This converted a complex, multi-step search into a single key-value lookup. A rough sketch of this decision follows right after this list.
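
To make that decision concrete, here is a rough Rust sketch (my own reconstruction; the 1% selectivity threshold, the minimum run count, and the query-log schema are assumptions, not Datadog's actual values):

```rust
use std::collections::HashMap;

// Hypothetical shape of a query-log entry; the article only says the log
// recorded slow queries and that selectivity compares scanned vs. returned IDs.
struct QueryLogEntry {
    query: String,
    scanned_ids: u64,  // timeseries IDs the engine had to scan
    returned_ids: u64, // timeseries IDs that actually matched
}

// Pick the logged queries that deserve a dedicated, materialized index:
// highly selective (returns a tiny fraction of what it scans) and frequent.
fn queries_to_index(log: &[QueryLogEntry], run_counts: &HashMap<String, u64>) -> Vec<String> {
    log.iter()
        .filter(|e| e.scanned_ids > 0)
        .filter(|e| (e.returned_ids as f64) / (e.scanned_ids as f64) < 0.01) // selective
        .filter(|e| run_counts.get(&e.query).copied().unwrap_or(0) >= 10)    // frequent
        .map(|e| e.query.clone())
        .collect()
}

fn main() {
    let log = vec![
        QueryLogEntry { query: "cpu.usage{service:web,env:prod}".into(), scanned_ids: 500_000, returned_ids: 120 },
        QueryLogEntry { query: "cpu.usage{env:prod}".into(), scanned_ids: 500_000, returned_ids: 400_000 },
    ];
    let runs = HashMap::from([
        ("cpu.usage{service:web,env:prod}".to_string(), 42u64),
        ("cpu.usage{env:prod}".to_string(), 42u64),
    ]);
    // Only the selective query earns a materialized view.
    println!("{:?}", queries_to_index(&log, &runs));
}
```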

This usage-driven approach was highly space-efficient, since not every tag was indexed. However, it was not suitable for user-facing applications. Unpredictable user queries often missed the specialized indexes, forcing the engine into the equivalent of a slow full-data scan. This led to query timeouts and heavy operational toil for engineers, who had to manually create or remove indexes.

In this earlier design, the core metadata was split across RocksDB components (both appear in the sketch after this list):

  • Metrics DB: Mapped a metric name to the complete list of timeseries IDs for that metric.

  • Tagsets DB: Mapped a timeseries ID to its full set of tags, which was essential for later grouping logic.
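
To see why a missed index hurt so much, here is a toy in-memory stand-in for the two databases (my own sketch, with HashMaps in place of RocksDB) showing the scan-and-filter path a query fell back to when no specialized index matched:

```rust
use std::collections::{HashMap, HashSet};

struct MetricsDb(HashMap<String, Vec<u64>>);     // metric name -> all timeseries IDs
struct TagsetsDb(HashMap<u64, HashSet<String>>); // timeseries ID -> its full tag set

// Without a matching specialized index, the engine walks every timeseries of
// the metric and checks its tags one by one - effectively a full scan.
fn query_without_index(
    metrics: &MetricsDb,
    tagsets: &TagsetsDb,
    metric: &str,
    filters: &[&str],
) -> Vec<u64> {
    metrics.0.get(metric).into_iter().flatten()
        .filter(|id| {
            tagsets.0.get(*id)
                .map_or(false, |tags| filters.iter().all(|f| tags.contains(*f)))
        })
        .copied()
        .collect()
}

fn main() {
    let metrics = MetricsDb(HashMap::from([("cpu.usage".to_string(), vec![1, 2, 3])]));
    let tagsets = TagsetsDb(HashMap::from([
        (1, HashSet::from(["service:web".to_string(), "env:prod".to_string()])),
        (2, HashSet::from(["service:db".to_string(), "env:prod".to_string()])),
        (3, HashSet::from(["service:web".to_string(), "env:staging".to_string()])),
    ]));
    let ids = query_without_index(&metrics, &tagsets, "cpu.usage", &["service:web", "env:prod"]);
    println!("{ids:?}"); // [1]
}
```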

The Next-Gen Inverted Index

To solve the unpredictability, Datadog made a fundamental shift to an exhaustive indexing strategy inspired by the inverted index used in search engines.

  • Exhaustive Indexing: They stopped trying to guess what to index and started unconditionally indexing every tag. This meant an increase in storage and writes - known as write amplification - but delivered consistent query latency.

  • Predictable Query Flow: The Indexes DB now maps a composite key metric-name:tag to the list of matching timeseries IDs. A query is resolved by looking up the individual filter tags and computing the set intersection of the resulting ID lists. This ensures every query has a fast index path, as sketched below.
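
Here is a minimal sketch of that flow (my own simplification in Rust; the composite-key string format and the in-memory maps are illustrative assumptions, not Datadog's actual schema):

```rust
use std::collections::{HashMap, HashSet};

/// Inverted index: "metric-name:tag" -> set of matching timeseries IDs.
struct InvertedIndex(HashMap<String, HashSet<u64>>);

impl InvertedIndex {
    /// Index every tag of a timeseries unconditionally (write amplification:
    /// one index entry per tag, in exchange for a predictable read path).
    fn index_timeseries(&mut self, metric: &str, id: u64, tags: &[&str]) {
        for tag in tags {
            self.0.entry(format!("{metric}:{tag}")).or_default().insert(id);
        }
    }

    /// Resolve a query by looking up each filter tag and intersecting the
    /// resulting ID sets - every query takes this same fast path.
    fn query(&self, metric: &str, filters: &[&str]) -> HashSet<u64> {
        let mut sets = filters.iter()
            .map(|tag| self.0.get(&format!("{metric}:{tag}")).cloned().unwrap_or_default());
        let first = sets.next().unwrap_or_default();
        sets.fold(first, |acc, s| acc.intersection(&s).copied().collect())
    }
}

fn main() {
    let mut idx = InvertedIndex(HashMap::new());
    idx.index_timeseries("cpu.usage", 1, &["service:web", "env:prod"]);
    idx.index_timeseries("cpu.usage", 2, &["service:db", "env:prod"]);
    idx.index_timeseries("cpu.usage", 3, &["service:web", "env:staging"]);
    println!("{:?}", idx.query("cpu.usage", &["service:web", "env:prod"])); // {1}
}
```

Because every tag is indexed unconditionally at write time, the read path never degenerates into a scan; the cost moves to the write side as the write amplification mentioned above.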

Why index every tag?

When I was reading about the architecture of the older system, an obvious question came to mind: why not index every tag?

I am not sure why Datadog initially decided not to index every tag and resolve queries with set intersections. Indexing every tag is a reasonable approach because storage has become much cheaper over the years, and modern hardware can handle tons of data without breaking a sweat. So yeah, it is much simpler to index every tag and resolve queries accordingly.

Scaling with Sharding and Rust

To handle the massive increase in write volume and ensure query parallelism, they introduced two key architectural changes:

  • Intranode Sharding: To overcome the single-core bottleneck of the original Go service, they partitioned their RocksDB instances within a single node. For instance, they settled on running eight shards on a 32-core machine. This allowed them to execute query fetches from all shards in parallel and merge the results, providing an approximately 8x performance boost. A toy sketch of this fan-out appears after this list.

  • Switch to Rust: The service was completely rewritten in Rust to optimize for hardware efficiency and safety.

    • Rust provided superior CPU efficiency and predictable performance by avoiding the overhead of the Go runtime's Garbage Collection (GC).

    • Its strict ownership model and compile-time checks were vital for preventing data races and memory bugs in a high-concurrency, distributed service, enhancing reliability.
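
To illustrate the intranode sharding idea, here is a toy Rust sketch (my own; it uses in-memory maps and std threads instead of RocksDB instances and the real query engine): timeseries IDs are hashed to one of eight shards on write, and a query fans out to every shard in parallel before merging the partial results.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};
use std::sync::Arc;
use std::thread;

const NUM_SHARDS: usize = 8; // the article mentions eight shards on a 32-core node

/// One shard of the index: "metric:tag" -> matching timeseries IDs.
type Shard = HashMap<String, HashSet<u64>>;

/// Route a timeseries ID to a shard so writes spread across all shards.
fn shard_for(id: u64) -> usize {
    let mut h = DefaultHasher::new();
    id.hash(&mut h);
    (h.finish() % NUM_SHARDS as u64) as usize
}

fn main() {
    // Build the shards (in the real service each shard is its own RocksDB instance).
    let mut shards: Vec<Shard> = vec![HashMap::new(); NUM_SHARDS];
    let input = [
        (1u64, ["service:web", "env:prod"]),
        (2, ["service:web", "env:prod"]),
        (3, ["service:db", "env:prod"]),
    ];
    for (id, tags) in input {
        for tag in tags {
            shards[shard_for(id)].entry(format!("cpu.usage:{tag}")).or_default().insert(id);
        }
    }
    let shards: Arc<Vec<Shard>> = Arc::new(shards);

    // Query: fetch from every shard in parallel, then merge the partial results.
    let key = "cpu.usage:service:web".to_string();
    let handles: Vec<_> = (0..NUM_SHARDS).map(|i| {
        let shards = Arc::clone(&shards);
        let key = key.clone();
        thread::spawn(move || shards[i].get(&key).cloned().unwrap_or_default())
    }).collect();

    let mut merged: HashSet<u64> = HashSet::new();
    for h in handles {
        merged.extend(h.join().unwrap());
    }
    println!("{merged:?}"); // IDs 1 and 2 (order may vary)
}
```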

A fundamental takeaway from this post is how to design a high-throughput metrics platform - it could easily come up in a system design interview. Have a read, make sure to check out the past episodes, and don’t forget to share Biweekly Engineering with your friends, family, and network!

Until the next one, bye! 👋 
