fix(framework): Wait for Control API channel readiness #7387
Pull request overview
This PR makes `flwr run` more reliable against transient Control API submission failures: it waits for the gRPC Control channel to become ready before submitting a run to SuperLink. This helps during load-balanced rolling restarts.
Changes:
- Add a readiness wait for the Control API gRPC channel prior to issuing `StartRun`.
- Introduce configurable timeout/interval constants and a shared “SuperLink unavailable” message for readiness failures.
- Add unit tests covering the retry-until-ready and timeout-failure behaviors.
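
The readiness wait described in the list above can be pictured as a bounded polling loop. The following is a minimal sketch, not the PR's code, which is not shown in this conversation: `wait_until_ready`, `READY_TIMEOUT_S`, `READY_INTERVAL_S`, and the injected `is_ready` callable are all hypothetical names and values. With `grpcio`, a comparable blocking wait is available as `grpc.channel_ready_future(channel).result(timeout=...)`, which raises `grpc.FutureTimeoutError` on timeout.

```python
import time
from typing import Callable

# Hypothetical values; the PR's actual constants are not shown here.
READY_TIMEOUT_S = 30.0
READY_INTERVAL_S = 0.5


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = READY_TIMEOUT_S,
    interval: float = READY_INTERVAL_S,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_ready` until it returns True or `timeout` seconds elapse.

    Returns True if the channel became ready, False on timeout.
    `clock` and `sleep` are injectable so tests can run without waiting.
    """
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

Injecting `clock` and `sleep` is one way to cover both the retry-until-ready and the timeout paths in unit tests without real delays.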
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| framework/py/flwr/cli/run/run.py | Adds channel readiness waiting logic and related constants before StartRun. |
| framework/py/flwr/cli/run/run_test.py | Adds tests for readiness wait success and timeout error paths. |
💡 Codex Review
flower/framework/py/flwr/cli/utils.py
Lines 400 to 402 in 31b7800
Commands using cli_output_control_stub (for example framework/py/flwr/cli/federation/invite/accept.py:54 and framework/py/flwr/cli/federation/simulation_config.py:145) still receive a ControlStub immediately from this shared helper, so during the same rolling SuperLink restart window their first Control API RPC can still fail with UNAVAILABLE instead of waiting for channel readiness. Add the readiness wait here, or move it into the channel initialization path, so all Control API CLI commands get the intended behavior consistently.
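
One way to read this suggestion is that the shared helper should perform the wait itself, so that no caller can obtain a stub for a channel that is not ready. The sketch below illustrates that shape under stated assumptions: `stub_with_readiness`, `ControlUnavailable`, and the three injected callables are hypothetical and do not reflect the real signature of `cli_output_control_stub`.

```python
from typing import Callable, TypeVar

StubT = TypeVar("StubT")


class ControlUnavailable(RuntimeError):
    """Hypothetical error for a channel that never became ready."""


def stub_with_readiness(
    open_channel: Callable[[], object],
    wait_ready: Callable[[object], bool],
    make_stub: Callable[[object], StubT],
) -> StubT:
    """Create a stub only after the channel reports ready.

    Every command that goes through this helper gets the same
    wait-before-first-RPC behavior without repeating it per command.
    """
    channel = open_channel()
    if not wait_ready(channel):
        # No RPC has been sent at this point, so failing here is safe.
        raise ControlUnavailable("Connection to the SuperLink is unavailable")
    return make_stub(channel)
```

Placing the wait in the shared path trades a small delay on every command for consistent behavior during a restart window.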
What changed
CLI commands that use the Control API now wait for the gRPC Control channel to become ready before sending the first RPC.
This currently covers:
- `flwr run` / `StartRun`
- `flwr stop` / `StopRun`
- `flwr ls` / `ListRuns`
- `flwr log` / initial `StreamLogs`

Why
In HA testing, one Control submission failed with:
Connection to the SuperLink is unavailable

The failure happened during a rolling SuperLink restart window. Every run that was accepted completed, so this was a Control API availability issue, not a ServerApp execution issue.
The fix intentionally does not retry RPCs after they have been sent. Without idempotency for state-changing Control API operations, retrying after a lost response could create duplicate or ambiguous effects.
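
The distinction between waiting before a send and retrying after one can be sketched as follows. This is an illustration of the stated policy, not the PR's implementation; `send_once_when_ready`, `NotSent`, and both callables are hypothetical names.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class NotSent(RuntimeError):
    """Hypothetical error: the request never left the client."""


def send_once_when_ready(wait_ready: Callable[[], bool], send: Callable[[], T]) -> T:
    """Wait for readiness, then issue the RPC exactly once.

    Waiting is safe to repeat because nothing has been sent yet.
    Once `send` is called, any failure propagates to the caller
    instead of being retried, since the server may already have
    applied a state-changing request whose response was lost.
    """
    if not wait_ready():
        raise NotSent("Connection to the SuperLink is unavailable")
    return send()  # deliberately no retry loop around this call
```

A caller that sees `NotSent` knows the operation had no effect, while any error raised by `send` leaves the outcome ambiguous and is surfaced as-is.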
Validation
- `uv run --project framework black --check framework/py/flwr/cli/run/run.py framework/py/flwr/cli/stop.py framework/py/flwr/cli/ls.py framework/py/flwr/cli/log.py framework/py/flwr/cli/utils.py framework/py/flwr/cli/utils_test.py`
- `uv run --project framework ruff check framework/py/flwr/cli/run/run.py framework/py/flwr/cli/stop.py framework/py/flwr/cli/ls.py framework/py/flwr/cli/log.py framework/py/flwr/cli/utils.py framework/py/flwr/cli/utils_test.py`
- `uv run --project framework python -m pytest framework/py/flwr/cli/utils_test.py`

HA validation: