9 min read
Published on 01/26/2020
Last updated on 03/21/2024
Prometheus' local storage is limited by single nodes in its scalability and durability. Instead of trying to solve clustered storage in Prometheus itself, Prometheus provides a set of interfaces that allow integration with remote storage systems. Therefore, long-term storage of Prometheus metrics is left to third parties. There are several projects out there with differing approaches:
M3DB
M3DB was developed primarily for collecting high volumes of monitoring time series data, then distributing the storage of that data in a horizontally scalable manner that efficiently leverages the hardware at its disposal. This is useful because time series data that is read infrequently is not kept in memory.
Cortex
Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
Thanos
Thanos is an open source, highly available Prometheus setup with long-term storage and querying capabilities.
All of these projects may fit specific use cases, but none of them is a silver bullet. The following benefits made us decide to go with Thanos:
- Stateless components
- Stable StoreAPI between services
- Historic metrics and the whole state are persisted to object storage
- Last but not least, it is a CNCF project
Thanos components
We open sourced our Thanos Operator to automate Thanos management on Kubernetes. Check out our introduction blog post.
Thanos is built from a handful of components, each with a dedicated role within the architecture. The easiest way to gain a basic understanding of how these work is to take a quick look at the responsibilities assigned to each one.
- Sidecar
  - Serves as a sidecar container alongside Prometheus
  - Uploads Prometheus chunks to object storage
  - Supports object stores like S3, Google Cloud Storage, Azure Storage, and more
  - The Prometheus operator, which we also use in our integrated monitoring service, injects this sidecar transparently
- Store
  - Retrieves chunks from object storage to provide long-term metrics for Query
  - Supports time-based partitioning
  - Supports label-based partitioning
- Compact
  - Creates downsampled summaries of chunks to expedite queries over long time ranges
  - Has three levels: raw chunks contain all samples, while 5m and 1h chunks contain samples aggregated over five-minute and one-hour windows, respectively
- Query
  - Is the entry point for PromQL queries
  - Deduplicates results from different sources
  - Supports partial responses
- Rule
  - Is a simplified version of Prometheus that does not require a sidecar and does not scrape metrics; it evaluates its rule expressions against a query endpoint
  - Writes results back to disk in the Prometheus 2.0 storage format
  - Participates in the system as a store node: it exposes the StoreAPI and uploads its generated TSDB blocks to an object store
- Bucket
  - Inspects data in object storage buckets
How does downsampling work?
You should keep in mind that the goal of downsampling in Thanos is not to save disk space. It provides a way to quickly evaluate queries with large time intervals, like months or years. Let's take a look at how this works. There are three levels of granularity:
- raw --- the raw scraped metrics from Prometheus
- 5m --- chunks compacted into five-minute aggregates
- 1h --- chunks compacted into one-hour aggregates
In fact, downsampling doesn't save you any space; instead, it adds two new blocks for each raw block. These are slightly smaller than, or close to, the size of the raw blocks. This means that downsampling slightly increases the amount of storage space used, but it provides a massive performance and bandwidth advantage when querying long intervals.
Functions like min and max can simply use their corresponding pre-aggregated attributes, but it is also possible to calculate more complex functions from the aggregated values, like avg from count and sum.
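To make the aggregation idea concrete, here is a minimal sketch (illustrative only, not Thanos' actual code) of how an average can be derived from stored per-window count and sum aggregates; the `DownsampledChunk` type and function names are assumptions for this example:

```python
# Hypothetical sketch: each downsampled window keeps a few aggregates,
# and avg is derived from sum/count rather than stored explicitly.
from dataclasses import dataclass

@dataclass
class DownsampledChunk:
    count: int   # number of raw samples in the window
    sum: float   # sum of raw sample values in the window
    min: float
    max: float

def window_avg(chunk: DownsampledChunk) -> float:
    # avg is not stored; it is computed from the stored aggregates
    return chunk.sum / chunk.count

def range_avg(chunks: list) -> float:
    # averaging across windows must weight by count,
    # not average the per-window averages
    total = sum(c.sum for c in chunks)
    samples = sum(c.count for c in chunks)
    return total / samples

chunks = [DownsampledChunk(count=20, sum=100.0, min=2.0, max=9.0),
          DownsampledChunk(count=10, sum=80.0, min=5.0, max=11.0)]
print(window_avg(chunks[0]))  # 5.0
print(range_avg(chunks))      # 6.0
```

Note that `range_avg` weights each window by its sample count; naively averaging the two window averages would give the wrong result when windows hold different numbers of samples.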
So how do these chunks help with our queries? For the purposes of comparison, the following table demonstrates queries made on raw and downsampled data.

| Query range | Samples for 1,000 series | Decompression latency | Fetched chunks size |
|---|---|---|---|
| 30m | ~120,000 | ~5ms | ~160KB |
| 1d | ~6 million | ~240ms | ~8MB |
| 30d (raw) | ~170 million | ~7s | ~240MB |
| 30d (downsampled) | ~8 million | ~300ms | ~9MB |
| 1y (raw) | ~2 billion | ~80s | ~2GB |
| 1y (downsampled) | ~8 million | ~300ms | ~9MB |
Metric query flow
The following diagram shows the "life of a query":
- The PromQL query is posted to the Querier
- It interprets the query and applies a pre-filter
- The query fans out its request to stores, prometheuses, or other queriers on the basis of labels and time-range requirements
- The Query component only sends and receives StoreAPI messages
- After it has collected all the responses, it merges and deduplicates them (if enabled)
- It then sends the resulting series back to the user
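The pre-filter and fan-out steps can be sketched roughly as follows. This is a simplified illustration under assumed names (`StoreInfo`, `matching_stores` are not Thanos' internal API): each StoreAPI source advertises a time range and external labels, and the Querier only contacts sources that could match the query.

```python
# Hedged sketch of the Querier's pre-filter: only fan out to StoreAPI
# sources whose advertised time range and labels can match the query.
from dataclasses import dataclass, field

@dataclass
class StoreInfo:
    name: str
    min_time: int  # advertised minimum timestamp (ms)
    max_time: int  # advertised maximum timestamp (ms)
    labels: dict = field(default_factory=dict)  # external labels

def matching_stores(stores, start, end, label_selector):
    selected = []
    for s in stores:
        # skip stores whose advertised window doesn't overlap the range
        if s.max_time < start or s.min_time > end:
            continue
        # skip stores whose external labels contradict the selector
        # (a label absent from the store's label set is not a conflict)
        if any(s.labels.get(k, v) != v for k, v in label_selector.items()):
            continue
        selected.append(s.name)
    return selected

stores = [
    StoreInfo("store-recent", 900, 1000, {"region": "eu"}),
    StoreInfo("store-old", 0, 500, {"region": "eu"}),
    StoreInfo("store-us", 0, 1000, {"region": "us"}),
]
print(matching_stores(stores, 800, 1000, {"region": "eu"}))
# ['store-recent']
```

In the real system this filtering is what keeps a fan-out cheap: sources that cannot possibly hold matching series are never contacted.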
Time-based partitioning
By default, Thanos' Store Gateway looks at all of the data in the object store and returns it based on the query's time range. But if we have a lot of data, we can scale it horizontally. Our first and most obvious option is time-based partitioning. All StoreAPI sources advertise the minimum and maximum times available, along with the labels that pertain to reachable series. Using parameters, we can narrow the scope of each partition, making it smaller and balancing the load. These parameters accept relative durations as well as concrete dates.
An example setup with three Store servers might look like this:
- A: --max-time=-6w
- B: --min-time=-8w and --max-time=-2w
- C: --min-time=-3w
As you can see, you can also set overlapping ranges to improve redundancy; the Thanos Querier deals with overlapping time series by merging them together.
Note: filtering is done at the level of chunks, so Thanos Store might still return samples that fall outside of --min-time and --max-time.
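The three-server setup above can be checked with a small overlap calculation. This is purely illustrative (times are expressed in weeks relative to now, which is not how Thanos represents timestamps internally):

```python
# Illustrative sketch of the three-store partitioning example, with
# times in weeks relative to now (negative = past, 0 = now).
def serving_stores(query_start_w, query_end_w):
    # (name, min_time, max_time); None means unbounded on that side
    stores = [("A", None, -6), ("B", -8, -2), ("C", -3, None)]
    hit = []
    for name, lo, hi in stores:
        lo = lo if lo is not None else float("-inf")
        hi = hi if hi is not None else float("inf")
        # the store serves the query if the two ranges overlap
        if hi >= query_start_w and lo <= query_end_w:
            hit.append(name)
    return hit

# Data between 7 and 5 weeks ago is held redundantly by A and B:
print(serving_stores(-7, -5))  # ['A', 'B']
# The last week is served by C alone:
print(serving_stores(-1, 0))   # ['C']
```

The overlap between A and B (weeks -8 to -6) is the redundancy mentioned above: both stores answer for that window, and the Querier merges their series.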
Label-based partitioning
Label-based partitioning is similar to time-based partitioning, but instead of using time as a sharding key, we use labels. These labels come from Prometheus' external labels or are explicitly set on Thanos components. The relabel configuration is identical to Prometheus' relabel configuration. We can see how this works in the following example:
```yaml
- action: keep
  regex: "eu.*"
  source_labels:
  - region
```
Such a configuration means that the component in question will only match metrics whose region label starts with the eu prefix.
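The semantics of the keep action can be sketched in a few lines of Python. This is a simplified model, not Prometheus' implementation; note that Prometheus-style relabeling joins the source label values with `;` and anchors the regex on both ends:

```python
# Simplified model of a relabel "keep" action on external labels.
import re

def keep(labels: dict, source_labels, regex: str) -> bool:
    # join the source label values with ';' (Prometheus convention)
    value = ";".join(labels.get(l, "") for l in source_labels)
    # the regex is fully anchored, so "eu.*" must match the whole value
    return re.fullmatch(regex, value) is not None

print(keep({"region": "eu-west-1"}, ["region"], "eu.*"))  # True
print(keep({"region": "us-east-1"}, ["region"], "eu.*"))  # False
```

Because the regex is anchored, a store configured with `eu.*` will never match a `us-east-1` series, even though `eu` appears nowhere near the start of the pattern match boundary for it.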
How does deduplication work?
For a detailed explanation, please read what the official documentation has to say on this topic. It is typical for identical Prometheus servers to be set up as HA pairs. This approach eliminates the problems that arise when a single Prometheus instance fails. However, to make Prometheus querying seamless, Thanos provides query-time deduplication. To make this possible, we only need to set up one or more replica labels on the sidecar component, and the query component does the rest. Let's take a look at how this is handled in the Thanos documentation.
An example of single replica labels:
Prometheus + sidecar “A”: cluster=1,env=2,replica=A
Prometheus + sidecar “B”: cluster=1,env=2,replica=B
Prometheus + sidecar “A” in different cluster: cluster=2,env=2,replica=A
An example query looks like this: up{job="prometheus",env="2"}
. With deduplication the results are:
up{job="prometheus",env="2",cluster="1"} 1
up{job="prometheus",env="2",cluster="2"} 1
Without deduplication the result looks like this:
up{job="prometheus",env="2",cluster="1",replica="A"} 1
up{job="prometheus",env="2",cluster="1",replica="B"} 1
up{job="prometheus",env="2",cluster="2",replica="A"} 1
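The label-set logic behind this collapse can be sketched as follows. This is a deliberately simplified model (Thanos' actual deduplication works at the sample level with a penalty-based algorithm; the `deduplicate` function here is an illustration, not its API):

```python
# Simplified model of query-time deduplication: series that differ
# only in the configured replica labels collapse into one result.
def deduplicate(series, replica_labels=("replica",)):
    seen = {}
    for labels, value in series:
        # a series' identity is its label set minus the replica labels
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k not in replica_labels))
        seen.setdefault(key, value)  # keep the first replica's result
    return [(dict(key), value) for key, value in seen.items()]

series = [
    ({"job": "prometheus", "env": "2", "cluster": "1", "replica": "A"}, 1),
    ({"job": "prometheus", "env": "2", "cluster": "1", "replica": "B"}, 1),
    ({"job": "prometheus", "env": "2", "cluster": "2", "replica": "A"}, 1),
]
print(len(deduplicate(series)))  # 2
```

The two cluster="1" replicas collapse into a single series, while the cluster="2" series stays separate, matching the deduplicated output shown above.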
One Eye
As you can see, Thanos is a powerful tool that allows you to build highly available, multi-cluster monitoring systems. However, several difficulties naturally arise when creating a production-ready version of such a system:
- Configuring Thanos service discovery for multiple clusters
- Sharding Prometheus instances
- Configuration of transport layer TLS and certificates
- Proper scaling of resources (how many metrics, what kind of resources are needed, etc.)
- Setting up Label/Time range shards
- Keeping your configuration up to date and propagating changes to clusters
- Adding a caching layer
While this project is still very much under way (we'll be releasing it late Q1), feel free to set up your Thanos infrastructure using the highly popular Thanos Helm chart maintained by Banzai Cloud.