query: distributed query mode is too slow when the number of external labels is high (~1kk) #8597

@SuperPaintman

Important

TBD: I'll add more details a bit later: reproduction steps, our topology, configs, etc. For now, this is just an umbrella issue for a few PRs.

We've already briefly discussed this with @MichaHoffmann, and found the root cause.

Thanos, Prometheus and Golang version used:

Thanos: v0.39.2 (an internal fork with a few patches, mostly irrelevant to the query component).

Object Storage Provider:

What happened:

I've noticed a significant performance degradation when running a global Thanos querier in "distributed" mode (--query.mode=distributed) compared to "local" mode, in an environment with a very large number of external labels (~1-2 million).

A simple instant query like the one below takes ~30ms in local mode vs ~2-3s in distributed mode:

sum by (cluster, job) (up{
    cluster="prod",
})

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

Anything else we need to know:

PRs:

  1. query: cache engines in remoteEndpoints to reuse computed MinT / MaxT / LabelSets values across Engines() calls #8598
  2. WIP: query: prune TSDBInfos in query.remoteEndpoints.Engines() #8599 - this change dropped the latency (in our case) from ~3s down to ~40ms.
  3. Change RemoteEndpoints interface to support remote engines pruning promql-engine#680
  4. query: prepare remoteEndpoints for remote engine pruning #8653

CC: @MichaHoffmann, @SuperQ
