Prerequisites
Description
When a consumer is paused, and the Consume poll is being executed by a heartbeat task, a partition rebalance will cause a deadlock trying to stop the consumer flow manager.
Steps to reproduce
A sample to reproduce this behavior is available in branch https://github.com/johnyk87/kafkaflow/tree/revoke-when-paused-deadlock.
- Start an instance of the custom
KafkaFlow.Sample (this custom sample introduces logic to force the heartbeat task to acquire the consumer semaphore).
- Wait for the log message
Please start a second instance of this application to force a partition rebalance., which should occur shortly after the log KafkaFlow: Kafka consumer 'PrintConsoleConsumer' was paused.
- Start a second instance of the application to force a partition rebalance.
- First instance will block after the log
KafkaFlow: Flow manager waiting for heartbeat task to finish.
Expected behavior
The consumer should not deadlock on partition rebalance when a heartbeat task is running.
Actual behavior
If the Consume call is being executed by the heartbeat task:
- The partition revoked handler will attempt to stop the consumer flow manager;
- The flow manager tries to stop and wait for the heartbeat task to complete;
- The heartbeat task will be blocked waiting for the
Consume call to return;
- The
Consume call will be blocked waiting for the partition revoked handler to terminate.
This behavior occurs because Kafka executes the partition assign/revoke handlers in the context of a Consume call, hence a Consume won't return until any triggered handler returns.
Please also note that the deadlock only occurs if the heartbeat task is running, and not all paused consumers will have a heartbeat task running. The heartbeat task shares a semaphore with the main consumer flow, and the heartbeat logic only executes if its task is able to acquire the lock when it is first started shortly after the pause request.
A heartbeat task can acquire the consumer semaphore naturally if the consumer pause request happens while the consumer is trying to enqueue a message to the worker pool, which can easily happen when the target worker's buffer is full.
KafkaFlow version
v4.1.0
Prerequisites
Description
When a consumer is paused, and the
Consumepoll is being executed by a heartbeat task, a partition rebalance will cause a deadlock trying to stop the consumer flow manager.Steps to reproduce
A sample to reproduce this behavior is available in branch https://github.com/johnyk87/kafkaflow/tree/revoke-when-paused-deadlock.
KafkaFlow.Sample(this custom sample introduces logic to force the heartbeat task to acquire the consumer semaphore).Please start a second instance of this application to force a partition rebalance., which should occur shortly after the logKafkaFlow: Kafka consumer 'PrintConsoleConsumer' was paused.KafkaFlow: Flow manager waiting for heartbeat task to finish.Expected behavior
The consumer should not deadlock on partition rebalance when a heartbeat task is running.
Actual behavior
If the
Consumecall is being executed by the heartbeat task:Consumecall to return;Consumecall will be blocked waiting for the partition revoked handler to terminate.This behavior occurs because Kafka executes the partition assign/revoke handlers in the context of a
Consumecall, hence aConsumewon't return until any triggered handler returns.Please also note that the deadlock only occurs if the heartbeat task is running, and not all paused consumers will have a heartbeat task running. The heartbeat task shares a semaphore with the main consumer flow, and the heartbeat logic only executes if its task is able to acquire the lock when it is first started shortly after the pause request.
A heartbeat task can acquire the consumer semaphore naturally if the consumer pause request happens while the consumer is trying to enqueue a message to the worker pool, which can easily happen when the target worker's buffer is full.
KafkaFlow version
v4.1.0