query: distributed query mode is too slow when the number of external labels is high (~1M) #8597

> [!important]
> **TBD**: I'll add more details a bit later: reproduction steps, our topology, configs, etc. For now, this is just an umbrella issue for a few PRs.
>
> We've already briefly discussed this with @MichaHoffmann, and found the root cause.





**Thanos, Prometheus and Golang version used**:



Thanos: `v0.39.2` (an internal fork with a few patches, mostly irrelevant to the `query` component).

**Object Storage Provider**:

**What happened**:

I've noticed a significant performance degradation when running a global Thanos querier in "distributed" mode (`--query.mode=distributed`) compared to "local" mode, in an environment with a very large number of external labels (`~1-2` million).

A simple instant query like the one below takes `~30ms` in local mode vs `~2-3s` in distributed mode:

```promql
sum by (cluster, job) (up{
    cluster="prod",
})
```
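For anyone who wants to compare the two modes on their own setup, here is a minimal sketch of how the latency could be measured. It is not the exact method used for the numbers above. It assumes each querier exposes the Prometheus-compatible `/api/v1/query` HTTP endpoint; the helper names (`buildInstantQueryURL`, `timeInstantQuery`) and the base URLs passed on the command line are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"strings"
	"time"
)

// The instant query from this issue.
const query = `sum by (cluster, job) (up{cluster="prod"})`

// buildInstantQueryURL (hypothetical helper) builds the instant query URL
// for a querier base URL, assuming the Prometheus-compatible HTTP API.
func buildInstantQueryURL(baseURL, promQL string) (string, error) {
	u, err := url.Parse(strings.TrimRight(baseURL, "/") + "/api/v1/query")
	if err != nil {
		return "", err
	}
	u.RawQuery = url.Values{"query": {promQL}}.Encode()
	return u.String(), nil
}

// timeInstantQuery (hypothetical helper) sends one instant query and returns
// the wall-clock time until the full response body has been read.
func timeInstantQuery(client *http.Client, baseURL, promQL string) (time.Duration, error) {
	target, err := buildInstantQueryURL(baseURL, promQL)
	if err != nil {
		return 0, err
	}
	start := time.Now()
	resp, err := client.Get(target)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	elapsed := time.Since(start)
	if resp.StatusCode != http.StatusOK {
		return elapsed, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return elapsed, nil
}

func main() {
	// Each argument is the base URL of one querier to measure.
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: go run main.go <querier-base-url>...")
		return
	}
	client := &http.Client{Timeout: 30 * time.Second}
	for _, base := range os.Args[1:] {
		elapsed, err := timeInstantQuery(client, base, query)
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", base, err)
			continue
		}
		fmt.Printf("%s: %s\n", base, elapsed)
	}
}
```

For example, `go run main.go http://querier-local:10902 http://querier-distributed:10902`, where the two hostnames and the port are placeholders for one querier running in local mode and one running with `--query.mode=distributed`. A single request is noisy, so repeating the run a few times gives a fairer comparison.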

**What you expected to happen**:

**How to reproduce it (as minimally and precisely as possible)**:

**Full logs to relevant components**:



**Anything else we need to know**:



**PRs**:
1. https://github.com/thanos-io/thanos/pull/8598
2. https://github.com/thanos-io/thanos/pull/8599 - this change dropped the latency (in our case) from ~3s down to ~40ms.
3. https://github.com/thanos-io/promql-engine/pull/680
4. https://github.com/thanos-io/thanos/pull/8653

---

CC: @MichaHoffmann, @SuperQ
