Skip to content

[Bug Report]: Consumer deadlocks if there is a partition rebalance while paused #681

@johnyk87

Description

@johnyk87

Prerequisites

  • I have searched issues to ensure it has not already been reported

Description

When a consumer is paused, and the Consume poll is being executed by a heartbeat task, a partition rebalance will cause a deadlock trying to stop the consumer flow manager.

Steps to reproduce

A sample to reproduce this behavior is available in branch https://github.com/johnyk87/kafkaflow/tree/revoke-when-paused-deadlock.

  1. Start an instance of the custom KafkaFlow.Sample (this custom sample introduces logic to force the heartbeat task to acquire the consumer semaphore).
  2. Wait for the log message Please start a second instance of this application to force a partition rebalance., which should occur shortly after the log KafkaFlow: Kafka consumer 'PrintConsoleConsumer' was paused.
  3. Start a second instance of the application to force a partition rebalance.
  4. First instance will block after the log KafkaFlow: Flow manager waiting for heartbeat task to finish.

Expected behavior

The consumer should not deadlock on partition rebalance when a heartbeat task is running.

Actual behavior

If the Consume call is being executed by the heartbeat task:

  • The partition revoked handler will attempt to stop the consumer flow manager;
  • The flow manager tries to stop and wait for the heartbeat task to complete;
  • The heartbeat task will be blocked waiting for the Consume call to return;
  • The Consume call will be blocked waiting for the partition revoked handler to terminate.

This behavior occurs because Kafka executes the partition assign/revoke handlers in the context of a Consume call, hence a Consume won't return until any triggered handler returns.

Please also note that the deadlock only occurs if the heartbeat task is running, and not all paused consumers will have a heartbeat task running. The heartbeat task shares a semaphore with the main consumer flow, and the heartbeat logic only executes if its task is able to acquire the lock when it is first started shortly after the pause request.

A heartbeat task can acquire the consumer semaphore naturally if the consumer pause request happens while the consumer is trying to enqueue a message to the worker pool, which can easily happen when the target worker's buffer is full.

KafkaFlow version

v4.1.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions