
Configurable consumer batch timeout for Kafka EventBus sensors #3984

@kaio6fellipe


Is your feature request related to a problem? Please describe.

When using Kafka as the EventBus, the sensor consumer batches messages with a hardcoded 1-second timeout (Batch(100, 1*time.Second, ...)). This introduces significant latency between an event being published and a trigger firing — up to 1 second per internal topic hop (event → trigger → action), resulting in 3-5 seconds of total latency even for a single event at low volume.

In our environment (3-broker Strimzi cluster across 3 GCP zones, mTLS, min.insync.replicas=2), we observed ~4.0 seconds between eventsource.publish and sensor.trigger through distributed tracing. The batch timer accounted for ~2.5 seconds of that total, with the remaining time spent on Kafka transactional commits.

This latency is acceptable for high-throughput batch workloads but problematic for latency-sensitive use cases like webhook-driven workflows, real-time notifications, and interactive event-driven pipelines.

Describe the solution you'd like

Add a configurable consumerBatchMaxWait field to the Kafka EventBus spec that controls the maximum time the sensor consumer waits to fill a batch before processing. The field should:

  1. Accept a Go duration string (e.g., 1s, 500ms, 100ms) or 0 to disable batching entirely
  2. Default to 1s (preserving current behavior when not set)
  3. When set to 0, process messages individually in real-time without batching
  4. Be configurable at the EventBus level (default for all sensors) and overridable per Sensor via an eventBusConsumerBatchMaxWait field in the Sensor spec

Example EventBus configuration:

apiVersion: argoproj.io/v1alpha1
kind: EventBus
metadata:
  name: default
spec:
  kafka:
    url: kafka:9092
    consumerBatchMaxWait: "100ms"

Example Sensor-level override:

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: latency-sensitive-sensor
spec:
  eventBusConsumerBatchMaxWait: "0"
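The precedence between the two examples above (Sensor value first, then EventBus value, then the 1s default) might be resolved along these lines. The struct types here are trimmed-down, hypothetical stand-ins rather than the actual Argo Events API types, and treating an empty string as "unset" is an assumption of this sketch.

```go
package main

import "fmt"

// KafkaBus and SensorSpec are hypothetical, minimal stand-ins that carry
// only the fields proposed in this issue.
type KafkaBus struct {
	ConsumerBatchMaxWait string // e.g. "100ms"; empty means unset
}

type SensorSpec struct {
	EventBusConsumerBatchMaxWait string // e.g. "0"; empty means unset
}

// resolveBatchMaxWait returns the raw value that should apply to a sensor:
// the Sensor-level override if set, else the EventBus-level value, else "1s".
// Note that "0" is a non-empty value, so a Sensor can explicitly disable
// batching even when the EventBus sets a non-zero wait.
func resolveBatchMaxWait(bus KafkaBus, sensor SensorSpec) string {
	if sensor.EventBusConsumerBatchMaxWait != "" {
		return sensor.EventBusConsumerBatchMaxWait
	}
	if bus.ConsumerBatchMaxWait != "" {
		return bus.ConsumerBatchMaxWait
	}
	return "1s"
}

func main() {
	bus := KafkaBus{ConsumerBatchMaxWait: "100ms"}
	fmt.Println(resolveBatchMaxWait(bus, SensorSpec{}))
	fmt.Println(resolveBatchMaxWait(bus, SensorSpec{EventBusConsumerBatchMaxWait: "0"}))
	fmt.Println(resolveBatchMaxWait(KafkaBus{}, SensorSpec{}))
}
```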

Describe alternatives you've considered

  • Hardcode a lower timeout (e.g., 100ms): This would improve latency for everyone but would reduce throughput for high-volume use cases, where larger batches are more efficient. A configurable approach lets users choose the right trade-off.
  • Remove batching entirely: While this provides the best latency, it eliminates the transaction amortization benefit. Making it opt-in via "0" is safer.
  • Use JetStream instead of Kafka: JetStream already processes messages one at a time (no batching), but switching event buses is not always feasible for teams that depend on Kafka's ecosystem, horizontal scaling, and exactly-once semantics.

Additional context

Benchmarks from a 3-broker Strimzi Kafka cluster (Kafka 4.1.1, 3 GCP zones, mTLS, min.insync.replicas=2, transaction.state.log.replication.factor=3):

Configuration         Event-to-trigger latency
Default (1s batch)    ~3500ms

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.


Labels: enhancement (New feature or request)
