
fix(framework): Wait for Control API channel readiness #7387

Open
psfoley wants to merge 8 commits into main from fix/control-api-ready-wait

@psfoley psfoley commented Jun 15, 2026

Member

What changed

CLI commands that use the Control API now wait for the gRPC Control channel to become ready before sending the first RPC.

This currently covers:

  • flwr run / StartRun
  • flwr stop / StopRun
  • flwr ls / ListRuns
  • flwr log / initial StreamLogs
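A minimal sketch of the wait-before-first-RPC idea described above, using only the standard library. The names `wait_until_ready`, `ChannelNotReadyError`, and the default timeout and interval values are hypothetical and not taken from the PR. In gRPC Python the readiness check itself could be built on `grpc.channel_ready_future(channel).result(timeout=...)`; whether the PR does so is not stated here.

```python
import time
from typing import Callable


class ChannelNotReadyError(Exception):
    """Raised when readiness is not reached before the deadline (hypothetical name)."""


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout: float = 30.0,   # assumed default, not the PR's constant
    interval: float = 0.5,   # assumed default, not the PR's constant
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll `is_ready` until it returns True or `timeout` elapses.

    Returns the number of polls performed. No RPC is sent here, so nothing
    can be duplicated on the server while waiting.
    """
    deadline = clock() + timeout
    polls = 0
    while True:
        polls += 1
        if is_ready():
            return polls
        remaining = deadline - clock()
        if remaining <= 0:
            raise ChannelNotReadyError("Connection to the SuperLink is unavailable")
        # Never sleep past the deadline.
        sleep(min(interval, remaining))
```

Injecting `sleep` and `clock` keeps the function testable without real delays; a CLI command would call it once, then issue its first RPC only after it returns.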

Why

In HA testing, one Control submission failed with:

Connection to the SuperLink is unavailable

The failure happened during a rolling SuperLink restart window. Every run that was accepted completed, so this was a Control API availability issue, not a ServerApp execution issue.

The fix intentionally does not retry RPCs after they have been sent. Without idempotency for state-changing Control API operations, retrying after a lost response could create duplicate or ambiguous effects.
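The duplicate-effect risk can be illustrated with a toy simulation. `FakeControlServer` and `submit_with_blind_retry` are hypothetical stand-ins, not Flower code; the sketch only assumes that a state-changing call takes effect on the server before its response is lost.

```python
import itertools


class FakeControlServer:
    """Toy server with a non-idempotent start_run (illustration only)."""

    def __init__(self):
        self.runs = []
        self._ids = itertools.count(1)

    def start_run(self, drop_response=False):
        run_id = next(self._ids)
        self.runs.append(run_id)  # the state change happens before the reply
        if drop_response:
            # The client never learns that the run was created.
            raise ConnectionError("response lost after the run was created")
        return run_id


def submit_with_blind_retry(server, drop_first_response):
    """Retry after a failure without knowing whether the first call took effect."""
    try:
        return server.start_run(drop_response=drop_first_response)
    except ConnectionError:
        return server.start_run()


server = FakeControlServer()
returned_id = submit_with_blind_retry(server, drop_first_response=True)
# The client sees a single run ID, but the server now holds two runs.
print(returned_id, server.runs)
```

Waiting for readiness before the first send avoids this case because nothing has reached the server yet when the wait fails.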

Validation

  • uv run --project framework black --check framework/py/flwr/cli/run/run.py framework/py/flwr/cli/stop.py framework/py/flwr/cli/ls.py framework/py/flwr/cli/log.py framework/py/flwr/cli/utils.py framework/py/flwr/cli/utils_test.py
  • uv run --project framework ruff check framework/py/flwr/cli/run/run.py framework/py/flwr/cli/stop.py framework/py/flwr/cli/ls.py framework/py/flwr/cli/log.py framework/py/flwr/cli/utils.py framework/py/flwr/cli/utils_test.py
  • uv run --project framework python -m pytest framework/py/flwr/cli/utils_test.py

HA validation:

  • 5 SuperLinks
  • 5 SuperExecs
  • 50 SuperNodes
  • 10 federations with 5 SuperNodes each
  • 200 submitted runs
  • rolling SuperLink restarts during submission
  • result: 200/200 submissions succeeded, 200/200 runs completed, 0 submit errors, 0 failed runs

Copilot AI review requested due to automatic review settings June 15, 2026 22:12
@psfoley psfoley changed the title Wait for Control API channel readiness fix(framework): Wait for Control API channel readiness Jun 15, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR improves flwr run reliability against transient Control API submission failures by waiting for the gRPC Control channel to become ready before attempting to submit a run to SuperLink (helpful during load-balanced rolling restarts).

Changes:

  • Add a readiness wait for the Control API gRPC channel prior to issuing StartRun.
  • Introduce configurable timeout/interval constants and a shared “SuperLink unavailable” message for readiness failures.
  • Add unit tests covering the retry-until-ready and timeout-failure behaviors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| framework/py/flwr/cli/run/run.py | Adds channel readiness waiting logic and related constants before StartRun. |
| framework/py/flwr/cli/run/run_test.py | Adds tests for readiness wait success and timeout error paths. |

Comment thread framework/py/flwr/cli/run/run.py
Comment thread framework/py/flwr/cli/run/run.py Outdated
Comment thread framework/py/flwr/cli/run/run.py Outdated
@psfoley psfoley force-pushed the fix/control-api-ready-wait branch from 243f943 to 6438f27 Compare June 15, 2026 22:21
@github-actions github-actions Bot added the Maintainer Used to determine what PRs (mainly) come from Flower maintainers. label Jun 15, 2026
@psfoley psfoley marked this pull request as ready for review June 15, 2026 23:57

@panh99 panh99 left a comment

Member

LGTM!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

channel = init_channel_from_connection(superlink_connection)
try:
    yield ControlStub(channel), is_json

P2: Wait in the shared Control API stub helper

Commands using cli_output_control_stub (for example framework/py/flwr/cli/federation/invite/accept.py:54 and framework/py/flwr/cli/federation/simulation_config.py:145) still receive a ControlStub immediately from this shared helper, so during the same rolling SuperLink restart window their first Control API RPC can still fail with UNAVAILABLE instead of waiting for channel readiness. Add the readiness wait here, or move it into the channel initialization path, so all Control API CLI commands get the intended behavior consistently.
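One way the suggestion could look, sketched with the standard library only. `ready_channel`, `FakeChannel`, `open_channel`, and `wait_ready` are hypothetical names; the real helper yields `ControlStub(channel), is_json` and its surrounding code is not shown in this thread, so this illustrates only the placement of the wait.

```python
from contextlib import contextmanager


class FakeChannel:
    """Stand-in for a gRPC channel; records whether it was closed."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


@contextmanager
def ready_channel(open_channel, wait_ready):
    """Yield a channel only after `wait_ready(channel)` has returned.

    `wait_ready` is expected to block until the channel is usable or raise.
    The wait sits inside the try block so the channel is closed on failure too.
    """
    channel = open_channel()
    try:
        wait_ready(channel)
        yield channel
    finally:
        channel.close()


events = []
with ready_channel(FakeChannel, lambda ch: events.append("wait")) as ch:
    events.append("rpc")  # every caller's first RPC happens after the wait
print(events, ch.closed)
```

Placing the wait in the shared helper means commands that obtain their stub through it would not need a per-command readiness call.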


3 participants