Skip to content

Pipeline the firmware queue: concurrent compile + upload lanes#1131

Merged
bdraco merged 35 commits into
mainfrom
firmware-queue-pipeline
Jun 2, 2026
Merged

Pipeline the firmware queue: concurrent compile + upload lanes#1131
bdraco merged 35 commits into
mainfrom
firmware-queue-pipeline

Conversation

@bdraco

@bdraco bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member

What does this implement/fix?

Pipelines the firmware job queue so a network-bound upload runs concurrently with a CPU-bound compile, per esphome org discussion #3702. Before this, the queue ran one job at a time and install was a fused esphome run, so a slow upload (for example ~9 minutes over Thread, CPU idle) blocked the next device's compile.

The queue now has two concurrent single-worker lanes: a compile lane (CPU; compile, clean, reset, rename) and an upload lane (network). firmware/install is a two-job chain, a COMPILE job plus a dependent local UPLOAD job (FirmwareJob.depends_on); the upload is held off its lane until the compile succeeds, then runs on the upload lane while the compile lane is free for the next device. install_bulk queues the same chain per device, so a multi-device install pipelines (device B's compile overlaps device A's upload) instead of running serially. A cancelled or failed compile cascades to cancel the held upload, so a cancelled build never flashes; remote installs use the same chain since a remote compile already materialises artifacts locally, and the receiver advertises compile-lane idleness so an uploading receiver still accepts delegated compiles.

Lane state lives on FirmwareState as the shared source of truth (lane_for / dependency_satisfied / place_on_lane); the install chain is committed atomically (one rename-lock check, then place + announce both jobs synchronously, one persist) so a concurrent rename can't strand a half-queued pair. _run_queue spawns the two lane consumers via the eager-task helper and drains both on any exit. Because the lanes run concurrently, build-tree-mutating ops cancel conflicting work: a per-device clean cancels that device's in-flight chain, and reset_build_env (clean-all) cancels every active job on both lanes before wiping. The cancel cascade is persisted so a restart doesn't re-cancel dependents.

Tests cover the chain, the cancel cascade (unit + real-runner e2e), lane concurrency e2e, cross-lane verify-chip, persistence re-queue by lane, supersede/clean/reset cancellation, and the compile-idle scheduler signal; firmware suite green with 100% patch coverage on the changed controllers. CLAUDE.md, docs/ARCHITECTURE.md, and docs/API.md are updated.

Related issue or feature (if applicable):

  • fixes

Types of changes

  • Bugfix (non-breaking change which fixes an issue) — bugfix
  • New feature (non-breaking change which adds functionality) — new-feature
  • Enhancement to an existing feature — enhancement
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) — breaking-change
  • Refactor (no behaviour change) — refactor
  • Documentation only — docs
  • Maintenance / chore — maintenance
  • CI / workflow change — ci
  • Dependencies bump — dependencies

Frontend coordination

Frontend impact: firmware/install now produces two job rows (a compile and a dependent upload), and two firmware jobs can be RUNNING at once. The events are all standard JOB_*; the frontend must render concurrent RUNNING rows and the held/queued upload (new FirmwareJob.depends_on field). The compile→upload progress reset at the seam already matches the existing remote-install behavior.

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable.
  • components.index.json / definitions/components/*.json have not been hand-edited (regenerate via script/sync_components.py if a sync is needed).
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

bdraco added 10 commits June 1, 2026 19:21
Add a Lane dataclass (queue + current_job + current_process) and
compile_lane/upload_lane on FirmwareState, with transitional proxies so
existing single-lane call sites keep working. Add FirmwareJob.depends_on
for job chaining and retire JobType.INSTALL for new jobs (kept for
deserialisation). QueueStatus stays a 3-field tuple; per-lane status will
come from a separate accessor. No behavior change yet — both runner paths
still flow through the compile lane.

Plan: compile_upload_queue_split_plan.md
Two lane consumers (compile/upload) run concurrently; install splits into a
COMPILE job + a dependent local UPLOAD job (depends_on), so a network flash
no longer blocks the next compile. Remote install uses the same chain since
a remote compile already materialises artifacts locally. Lane-aware
finalize/cancel/terminate; dependency hook releases or cancels held
dependents; persistence routes restored jobs by lane with order-independent
dependency resolution. lane_for/dependency_satisfied live on FirmwareState
(shared, DRY). Tests still need updating to the new signatures.

WIP checkpoint; see compile_upload_queue_split_plan.md.
Lane-aware tracked_subprocess/terminate/execute_job calls, supersede
exclude_job_ids, lane-aware enqueue tracing. Remaining red is behaviour
change (install chain, verify-on-upload) to be rewritten next.
Rewrite test_install for the COMPILE+UPLOAD chain (port lands on the upload
half; compile head returned). Cancelling a QUEUED compile now cascades to
its held upload (jobs.cancel runs release_dependents) so a cancelled build
never flashes the device — the #3702 cancel guard. Tests pin
cancel/fail-cascade and success-release-to-upload-lane.
Thread the lane through verify_chip (real bug: the esptool path was being
passed as the lane); event-driven _run_until_terminal (no polling) that
waits for the whole chain; cancel-during-verify tests drive a bare upload
(verify is upload-phase); chip-mismatch[install] asserts on the upload job;
lane-aware terminate/execute stubs. _run_queue now cancels AND drains both
lane tasks on shutdown (a bare gather left a lane mid-cancel). Adds an e2e
asserting a cancelled compile never flashes (the held upload cascades to
CANCELLED, no esptool spawn).
Update CLAUDE.md, docs/ARCHITECTURE.md, and the controller docstrings from
'one job at a time' to the compile/upload lane model + install chain.
A receiver only compiles for offloaders, so it broadcasts compile-lane
status (compile_queue_status) not the aggregate — an uploading receiver
keeps a free compile lane and must still accept delegated compiles, else
offloaders silently fall back to LOCAL.
@bdraco bdraco added the enhancement Improvement to an existing feature label Jun 2, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 2, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 27 untouched benchmarks
⏩ 1 skipped benchmark1


Comparing firmware-queue-pipeline (81eaf77) with main (62a275c)2

Open in CodSpeed

Footnotes

  1. 1 benchmark was skipped, so the baseline result was used instead. If it was deleted from the codebase, click here and archive it to remove it from the performance reports.

  2. No successful run was found on main (aa0c4fa) during the generation of this report, so 62a275c was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@codecov-commenter

codecov-commenter commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.38%. Comparing base (62a275c) to head (81eaf77).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1131      +/-   ##
==========================================
- Coverage   99.39%   99.38%   -0.01%     
==========================================
  Files         209      209              
  Lines       15268    15376     +108     
==========================================
+ Hits        15175    15282     +107     
- Misses         93       94       +1     
Flag Coverage Δ
py3.12 99.36% <100.00%> (-0.01%) ⬇️
py3.14 99.38% <100.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...home_device_builder/controllers/firmware/_state.py 100.00% <100.00%> (ø)
...sphome_device_builder/controllers/firmware/bulk.py 100.00% <100.00%> (ø)
...phome_device_builder/controllers/firmware/clean.py 100.00% <ø> (ø)
esphome_device_builder/controllers/firmware/cli.py 100.00% <100.00%> (ø)
..._device_builder/controllers/firmware/controller.py 100.00% <100.00%> (ø)
...e_device_builder/controllers/firmware/factories.py 100.00% <100.00%> (ø)
...sphome_device_builder/controllers/firmware/jobs.py 100.00% <100.00%> (ø)
...e_device_builder/controllers/firmware/lifecycle.py 100.00% <100.00%> (ø)
...device_builder/controllers/firmware/persistence.py 100.00% <100.00%> (ø)
...vice_builder/controllers/firmware/remote_runner.py 100.00% <100.00%> (ø)
... and 3 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Scope note: parallel use of multiple remote build servers is a follow-up (not in this PR)

This PR pipelines compile vs. upload — each lane is a single worker, run concurrently — so a slow upload no longer blocks the next compile.

It does not make multiple remote build servers run at once. The compile lane processes one job at a time, and a REMOTE compile occupies that lane for its whole duration (it awaits the receiver's compile + artifact download). So with N paired servers and N queued installs, the servers are still used one at a time, not in parallel — the scheduler picks one receiver per job (oldest-idle, then oldest-otherwise); it doesn't fan out concurrently.

Follow-up: allow the compile lane to dispatch multiple remote compiles in parallel (bounded by the number of idle receivers), since remote compiles are I/O-bound waits rather than local CPU — e.g. a small remote-dispatch pool on the compile lane, or a dedicated remote lane with N concurrent in-flight dispatches keyed on idle receivers. This composes cleanly on top of the lane model here and is intentionally out of scope for this PR (which keeps each lane single-worker).

@esphbot

esphbot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

PR Review — Pipeline the firmware queue: concurrent compile + upload lanes

Solid, carefully-engineered change that has already absorbed many review rounds. The two-lane model is coherent: lane_for routes UPLOAD to the network lane and everything else to the compile lane; _await_build_gate uses a correct clear→check→wait condition pattern with a single waiter, so no lost-wakeup; enqueue_install_chain commits the COMPILE+UPLOAD pair atomically (one rename-lock check, synchronous place+announce, single supersede+persist) which closes the half-queued-pair window; and the cancel cascade (release_dependents) is now persisted on the QUEUED-cancel path. The rmtree-vs-upload corruption concern is genuinely closed because cancel()terminate_subtree_with_grace awaits the subprocess's actual death (SIGTERM grace then SIGKILL) before returning, so a cancelled upload's process is gone before RESET's wipe runs, and new uploads park at the build gate. The reset_build_env orphan-on-re-raise case is handled by the except BaseException rollback that pops the just-created RESET job. Restart routing is two-pass so a dependent resolves regardless of on-disk order, and _restore_to_lane cancels (rather than strands) a dependent whose prerequisite is gone. Persistence dedup correctly moved to (configuration, job_type) so both install logs survive. Docs (CLAUDE.md, ARCHITECTURE.md, API.md) and the frontend-coordination note are updated; 100% patch coverage with real-runner e2e tests for the cascade and lane concurrency. I found no blocking correctness, security, or resource issues. One non-blocking suggestion about a defensive branch's test reachability. Merge-ready.


🟢 Suggestions

1. Same-config CLEAN branch of upload_blocked is largely defensive (`esphome_device_builder/controllers/firmware/_state.py`, L76-84)

The JobType.CLEAN and other.configuration == job.configuration branch is good defensive correctness, but worth noting it is rarely the live path: a firmware/clean enqueue runs through _enqueue with supersede=True, so it cancels any same-configuration UPLOAD already on the upload lane rather than coexisting with it, and an install chain whose UPLOAD is in flight is likewise superseded by a same-config clean. The branch that actually fires in practice is the RESET_BUILD_ENV one (empty configuration → never superseded by config-scoped jobs → a later upload parks behind it). No change required — the CLEAN guard is correct and cheap — but if the e2e suite doesn't already exercise an upload genuinely parking behind a same-config CLEAN (as opposed to being superseded by it), a targeted test would pin that this branch is reachable and not dead code.

if other.job_type is JobType.CLEAN and other.configuration == job.configuration:
    return True

Checklist

  • No command/shell injection from user input
  • No hardcoded secrets
  • Error paths swallow only benign races (global cancel re-raises RuntimeError)
  • No resource/subprocess leaks (lane slots cleared in finally; tracked_subprocess terminates on cancel)
  • Race conditions in concurrent lanes addressed (build gate, awaited process death, atomic chain commit)
  • Restart/persistence correctness (two-pass restore, dependent cancel-or-hold, cascade persisted)
  • Tests verify observable behavior, not source text
  • Untested branches covered — suggestion #1
  • Docs updated to match behavior

Summary

Solid, carefully-engineered change that has already absorbed many review rounds. The two-lane model is coherent: lane_for routes UPLOAD to the network lane and everything else to the compile lane; _await_build_gate uses a correct clear→check→wait condition pattern with a single waiter, so no lost-wakeup; enqueue_install_chain commits the COMPILE+UPLOAD pair atomically (one rename-lock check, synchronous place+announce, single supersede+persist) which closes the half-queued-pair window; and the cancel cascade (release_dependents) is now persisted on the QUEUED-cancel path. The rmtree-vs-upload corruption concern is genuinely closed because cancel()terminate_subtree_with_grace awaits the subprocess's actual death (SIGTERM grace then SIGKILL) before returning, so a cancelled upload's process is gone before RESET's wipe runs, and new uploads park at the build gate. The reset_build_env orphan-on-re-raise case is handled by the except BaseException rollback that pops the just-created RESET job. Restart routing is two-pass so a dependent resolves regardless of on-disk order, and _restore_to_lane cancels (rather than strands) a dependent whose prerequisite is gone. Persistence dedup correctly moved to (configuration, job_type) so both install logs survive. Docs (CLAUDE.md, ARCHITECTURE.md, API.md) and the frontend-coordination note are updated; 100% patch coverage with real-runner e2e tests for the cascade and lane concurrency. I found no blocking correctness, security, or resource issues. One non-blocking suggestion about a defensive branch's test reachability. Merge-ready.


Automated review by Kōan8467cfc
a340ea7
55b614f
f79fd1b
e844b3c
0ba74b4
133fd50
059145b
e331850
3adcddd
362aae4
9374161
7f5f691
3536def
5189ba1
9685a50
17eae0f
70a9708
4a63c7a
c19aba0
d9c169f
b0d3078
5ba417b
359cd1a
00325ec
a63661b
3f42cc0
86e3402
35a8d3c
46bb16a
20045cb
364850c
654576c
dfb833b

bdraco added 2 commits June 1, 2026 20:53
DRY + simplification pass on the queue-pipeline branch:

- Hoist the duplicated ``wire_real_queue`` / ``run_until_terminal`` /
  ``upload_of`` e2e helpers into the firmware conftest; the three test
  files import them aliased to their existing ``_``-prefixed names so the
  call sites are untouched. The unified ``run_until_terminal`` settles on
  the whole chain being terminal, correct for both single-job and the
  install COMPILE-then-UPLOAD chain.
- Remove the test-only ``state.queue`` / ``current_job`` /
  ``current_process`` proxies; production is fully lane-aware, so migrate
  the test call sites to ``state.compile_lane.*`` and delete the shims so
  ``state.current_job`` can't read as a half-truth that hides an in-flight
  upload.
- Revert the single-element ``_runner_tasks`` list back to the original
  ``_runner_task``; ``_run_queue`` is one wrapper task.
- Use the ``with_queue`` fixture in the bulk-install rename-lock test so
  its queue ``put_nowait`` is sync, clearing the AsyncMock warning.
It was a working scratchpad (status/TODO scaffolding), not a durable
deliverable; the design lives in the PR description. Repo root is for
README.md and CLAUDE.md only.
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Thanks for the review. Addressed in 9374161 and 7f5f691:

  • Repo-root scratch note: removed compile_upload_queue_split_plan.md; the design lives in the PR description and the working copy is kept out of the tree, so nothing rots after merge.
  • Back-compat shims: removed the state.queue / current_job / current_process proxies and migrated the test call sites to state.compile_lane.*. Production was already fully lane-aware, so the shims only survived for tests; with them gone, state.current_job can no longer read as a half-truth that hides an in-flight upload.
  • Single-element task list: reverted _runner_tasks back to the original _runner_task, since _run_queue is one wrapper task that spawns the two lane consumers.

Folded in a small DRY pass while here: hoisted the duplicated wire_real_queue / run_until_terminal / upload_of e2e helpers into the firmware conftest, and switched the bulk-install rename-lock test to the with_queue fixture so its put_nowait is sync, clearing an AsyncMock warning. Full firmware suite is green (463 passed, 2 skipped).

- Add ``test_lane_concurrency_e2e.py``: drive both real lane consumers
  against real subprocesses, park an upload on the upload lane, and pin
  that a compile for another device runs to COMPLETED while the upload is
  still RUNNING. This is the headline #3702 behavior; a regression that
  re-serialised the queue trips the timeout.
- Cover the persistence two-pass restore branches: a held UPLOAD whose
  COMPILE is still QUEUED resumes held (not on its lane); one whose
  prerequisite is gone resumes CANCELLED.
- Cover ``compile_queue_status``: an uploading receiver still advertises
  compile-lane idle so offloaders keep delegating compiles.
- Hoist the ``StubDevices`` / ``wire_devices`` test helper into the
  firmware conftest (now used by the chip-verify and concurrency files).

Brings firmware/persistence.py and firmware/controller.py to 100% patch
coverage.
Switch _run_queue's except-CancelledError to a try/finally so that any
lane consumer error (not just shutdown cancellation) cancels and drains
the sibling lane before the error propagates, rather than leaving it
orphaned mid-flight. Add a regression test.

Fix the held-branch comment in factories.enqueue to reference the public
lifecycle.release_dependents (no leading underscore).
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Addressed in 5189ba1:

  • Stale comment: fixed the held-branch comment in factories.enqueue to reference the public lifecycle.release_dependents (no leading underscore).
  • MEDIUM 1 (_run_queue only handling CancelledError): switched the except asyncio.CancelledError to a try/finally, so any lane consumer error (not just shutdown cancellation) now cancels and drains the sibling lane before the original error propagates, rather than orphaning it mid-flight. Added a regression test (test_run_queue_cancels_sibling_lane_when_one_raises).

On MEDIUM 2 (dependency_satisfied could strand a dependent): this is already covered by the callers, so I left it as-is. dependency_satisfied is deliberately narrow, but no caller can strand a dependent:

  • At enqueue, the chain creates the COMPILE before the dependent UPLOAD, so the prerequisite is always present.
  • At restart, persistence._restore_to_lane explicitly disambiguates: a prerequisite that is already COMPLETED enqueues the dependent directly; one still active holds it; one missing or terminal-but-not-completed cancels it with an error. So a pruned or already-finished prerequisite can't leave a dependent QUEUED forever.
  • At runtime, release_dependents fires on the prerequisite's terminal transition (every terminal path goes through finalize_terminal) synchronously, before any prune_history can remove the prerequisite, and active jobs are never pruned.

Full firmware suite green (468 passed, 2 skipped).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the firmware job system to run compile and upload work in two concurrent single-worker lanes, enabling a slow network upload to overlap with the next device’s CPU-bound compile. It also changes firmware/install to enqueue a COMPILE job plus a dependent UPLOAD job (FirmwareJob.depends_on) and updates remote-build signaling so receivers advertise compile-lane availability even while uploading.

Changes:

  • Introduce FirmwareState lanes (compile_lane + upload_lane) and run both lane consumers concurrently.
  • Split firmware/install into a COMPILE head job plus a dependent UPLOAD job held until the prerequisite completes.
  • Extend persistence + tests to cover dependency behavior, cancel cascade safety, lane concurrency, and compile-lane idleness reporting.

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
esphome_device_builder/controllers/firmware/_state.py Adds Lane and lane routing/dependency helpers (lane_for, dependency_satisfied).
esphome_device_builder/controllers/firmware/controller.py Runs both lanes concurrently; adds lane/aggregate queue snapshots; install now enqueues the chain.
esphome_device_builder/controllers/firmware/runner.py Converts single-queue runner into per-lane runner + lane-scoped subprocess tracking.
esphome_device_builder/controllers/firmware/lifecycle.py Clears the correct lane slot and releases/cancels dependent jobs on prerequisite completion.
esphome_device_builder/controllers/firmware/factories.py Enqueue routes by lane and introduces enqueue_install_chain; updates supersede signature.
esphome_device_builder/controllers/firmware/jobs.py Makes cancel lane-aware and cancels held dependents when cancelling prerequisites.
esphome_device_builder/controllers/firmware/persistence.py Restores active jobs into lanes and handles held/cancellable dependents on restart.
esphome_device_builder/controllers/firmware/cli.py Makes verify-chip use lane-scoped tracked subprocess registration.
esphome_device_builder/controllers/firmware/remote_runner.py Updates tracked subprocess registration to be lane-scoped.
esphome_device_builder/controllers/remote_build/peer_link_sessions.py Uses compile-lane status for receiver scheduling signals to offloaders.
esphome_device_builder/models/firmware.py Adds depends_on to FirmwareJob; updates queue-status docs and INSTALL commentary.
tests/controllers/firmware/conftest.py Centralizes shared e2e helpers and updates queue wiring for lanes and dependency tests.
tests/controllers/firmware/test_install.py Updates install expectations to COMPILE+dependent UPLOAD chain; adds cascade/release tests.
tests/controllers/firmware/test_persistence.py Adds restart behaviors for held uploads and orphan prerequisites; updates queue assertions.
tests/controllers/firmware/test_queue_status.py Adds compile-lane-only status snapshot used for remote-build scheduler.
tests/controllers/firmware/test_lane_concurrency_e2e.py New end-to-end proof that upload and compile lanes overlap.
tests/controllers/firmware/test_verify_chip_e2e.py Adjusts verify-chip/cancel tests for lane-scoped subprocess tracking and install chain.
tests/controllers/firmware/test_execute_job_e2e.py Reuses shared runner helpers; updates process tracking expectations.
tests/controllers/firmware/test_remote_runner.py Updates remote-runner tests for lane-scoped execution/termination tracking.
tests/controllers/firmware/test_supersede.py Updates state wiring to lane-scoped current_job.
tests/controllers/firmware/test_stop.py Updates termination calls to pass the lane.
tests/controllers/firmware/test_stop_windows.py Updates termination calls to pass the lane on Windows.
tests/controllers/firmware/test_rename_lock.py Updates controller factory usage for queue wiring.
tests/controllers/firmware/test_clean.py Updates state wiring to lane-scoped current_job.
tests/controllers/firmware/test_cancel.py Updates cancel tests for lane-scoped current_job and process tracking.
tests/controllers/firmware/test_branches_coverage.py Updates runner/terminate coverage tests for lane scoping.
tests/conftest.py Adds remote-build firmware stub for compile_queue_status.
docs/ARCHITECTURE.md Documents the two-lane queue and install chain behavior.
docs/API.md Documents firmware/install returning the COMPILE job and describing the dependent UPLOAD.
CLAUDE.md Updates repo orientation notes to reflect the two-lane firmware queue and install chaining.

Comment thread esphome_device_builder/controllers/firmware/factories.py Outdated
Comment thread esphome_device_builder/models/firmware.py Outdated
Comment thread tests/controllers/firmware/conftest.py Outdated
@esphbot

esphbot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@bdraco: Addressed in 5189ba1:

Your MEDIUM-1 (_run_queue try/finally drain) and MEDIUM-2 (dependency_satisfied strand) responses both hold up — the restart/enqueue/runtime paths each keep a dependent from being stranded, and the synchronous release_dependents on the prerequisite's terminal transition fires before any prune.

However, Copilot's earlier supersede comment is still open and is a real bug after this change. supersede_active_jobs wraps each cancel() in with suppress(ValueError, RuntimeError), but the new cascade makes cancel() raise a CommandError that isn't in that tuple:

  • Double-install device X while compile1 is still QUEUED → to_cancel = [compile1, upload1] (compile created before its dependent, so it sorts first in state.jobs).
  • cancel(compile1) (QUEUED→CANCELLED) now calls release_dependents, which cancels upload1 since the compile didn't COMPLETE.
  • The loop then reaches upload1, which is already CANCELLED → cancel() hits the terminal branch and raise CommandError(INVALID_ARGS, "Cannot cancel a cancelled job").

CommandError subclasses Exception (helpers/api.py:21), not ValueError/RuntimeError, so it escapes the suppress and propagates out of install(), aborting the rest of the supersede and the trailing _persist_jobs. Suggest either adding CommandError to the suppress tuple or re-checking state.jobs[job_id].status in _ACTIVE_JOB_STATUSES before each cancel (the cascade is now an expected mid-loop terminal transition, not just a tolerated race). A regression test with two install chains on the same configuration (first compile still QUEUED) would pin it.

Re-installing a config supersedes the prior COMPILE + UPLOAD chain:
cancelling the prior compile cascades to cancel its held upload, so when
the supersede loop reaches that already-terminal upload, cancel() raises
CommandError ("Cannot cancel a cancelled job"). CommandError is not a
ValueError/RuntimeError, so the existing suppress missed it and the new
install raised; add it to the suppress and pin it with a regression test.

Also correct the JobType.INSTALL comment (persisted INSTALL jobs still run
as a fused `esphome run` via the runner + CLI; they are not migrated to a
chain on load), and update the capture_enqueue_order docstring to put_nowait.
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Addressed in 9685a50:

  • supersede_active_jobs cascade double-cancel (real bug): confirmed. Re-installing a config supersedes the prior COMPILE + UPLOAD chain; cancelling the prior compile cascades to cancel its held upload, so when the loop reaches that already-terminal upload, cancel raises CommandError("Cannot cancel a cancelled job"). CommandError is a bare Exception, not a ValueError/RuntimeError, so the existing suppress missed it and the second install would have raised. Added CommandError to the suppress (the job is already in the state we wanted) and pinned it with test_reinstalling_supersedes_prior_chain_without_raising.
  • JobType.INSTALL comment: corrected. There is no on-load migration; persisted INSTALL jobs are re-queued and run as a fused esphome run via the runner + CLI. The comment now says INSTALL is retained so older persisted jobs still deserialise and run, and that new installs enqueue a COMPILE + dependent UPLOAD instead.
  • capture_enqueue_order docstring: updated to put_nowait (the enqueue path is synchronous now).

On the related "collapse async to sync" cleanup: I audited the firmware package for async defs that lost their last awaitable to the put_nowait switch — there are none. enqueue / enqueue_install_chain / supersede_active_jobs / load_jobs / cancel all still await (persist, supersede, executor, terminate); release_dependents and persistence._restore_to_lane were already sync. The only awaitless async defs are the pre-existing read-only API getters jobs.get_jobs / get_job, which predate this PR and are unrelated to the queue change.

Full firmware suite green (469 passed, 2 skipped).

@bdraco bdraco marked this pull request as ready for review June 2, 2026 03:38
Use create_eager_task instead of bare asyncio.create_task for the two
run_lane consumers, matching how the rest of the codebase spawns tasks
(eager_start runs each up to its first await synchronously).
@bdraco bdraco requested a review from Copilot June 2, 2026 03:45
@bdraco bdraco requested review from Copilot and removed request for Copilot June 2, 2026 03:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Comment thread esphome_device_builder/controllers/firmware/jobs.py Outdated
Comment thread esphome_device_builder/controllers/firmware/_state.py
bdraco added 3 commits June 1, 2026 23:01
A clean is the user asking for a fresh build, so per-device clean now cancels
any in-flight compile / upload / install for that configuration (via the
existing _enqueue supersede; an install's compile cascades to its held upload)
instead of rejecting it.

reset_build_env (clean-all) wipes the whole build tree, which — with the
compile and upload lanes now running concurrently — can race a live build on
either lane. It cancels every active job before queueing. Adds
factories.cancel_all_active_jobs (sharing the cancel loop with supersede).
Document that queue_status_snapshot's queue_depth excludes held dependent
uploads (review note), correct the remote_runner lane comment to match
lane_for, and keep the new clean/reset/cancel docstrings terse per CLAUDE.md.
The QUEUED-cancel path persisted the cancelled job before release_dependents
cascaded to its held upload, so the upload's CANCELLED status never reached
disk and a restart re-cancelled it every boot. release_dependents now reports
whether it acted; the cancel path re-persists when it did.
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

On the two Copilot discussions:

  • r3338567551 (QUEUED-cancel cascade not persisted) — fixed in a63661b. The QUEUED-cancel path persisted the cancelled job before release_dependents cascaded to its held upload, so the upload's CANCELLED status never reached disk and a restart re-cancelled it every boot. release_dependents now returns whether it acted, and the cancel path re-persists (+ prune) when it did. Pinned by a test asserting two persists on a chain cancel.

  • r3338567568 (reset can run concurrently with an upload) — partly addressed in 359cd1a: reset_build_env now calls cancel_all_active_jobs to cancel every in-flight job on both lanes before queueing the wipe, and per-device clean supersedes the device's chain. That closes the existing-job race. The residual is a newly-started upload during a clean/reset (a standalone firmware/upload lands on the upload lane and could run while the wipe touches the same artifacts) — a deliberate, user-triggered window rather than an automatic collision. Making clean/reset fully exclusive (hold/block new jobs for the affected scope until they finish) is the robust fix; I've left it as a follow-up rather than add an exclusive-gate mechanism in this PR. Flagging for a call on whether to do it here.

Document in API.md that per-device clean cancels the device's in-flight build
and reset_build_env cancels every job on both lanes (and fix the now-per-type
history-dedup note). Log when _restore_to_lane cancels a restored dependent
whose prerequisite is missing/terminal, so a pruned-but-succeeded prereq is
diagnosable rather than a silent cancel.
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Thanks — status on each:

  • 🟡 Add section config API and user preferences #1 (QUEUED-cancel cascade not persisted) — fixed in a63661b (landed after this review ran). release_dependents now returns whether it acted, and the QUEUED-cancel branch re-prunes + re-persists when it did, so the cascade-cancelled upload's CANCELLED status reaches disk. Pinned by a test asserting two persists on a chain cancel.

  • 🟡 Fix be when i want the catalog for a specific device #2 (clean now supersedes instead of rejecting — undocumented reversal) — intentional (requested: a clean is the user asking for a fresh build, and with the two lanes a held/running upload could otherwise read artifacts the clean wipes). Now documented in 3f42cc0: docs/API.md's firmware/clean row, the "one active job per device" section, and the reset section all call out the cancel semantics; the PR description was also updated. No required frontend change — the cancellations fire standard JOB_CANCELLED (handled generically; there was no clean-reject error path to remove). A pre-cancel confirmation is a possible optional UX add.

  • 🟡 MEDIUM (dependency_satisfied collapsed boolean) — runtime can't strand a held upload: release_dependents fires synchronously on the compile's terminal (placing or cancelling the upload) before any prune_history, so a succeeded compile is never pruned out from under a still-held upload. On restart, _restore_to_lane deterministically cancels a dependent whose prereq is missing/terminal (with an error) rather than holding it — and 3f42cc0 adds a log there so a pruned-but-succeeded prereq is diagnosable, not silent.

  • 🟢 Handle hint and errors in yaml #3 (dead aggregate queue_status_snapshot) — retained intentionally as the public aggregate accessor (idle only when both lanes are). The offloader path correctly uses the compile-lane variant (compile_queue_status) by design; the aggregate is still exercised by the slot-release-before-fire ordering tests as a sync probe. Happy to drop it if you'd prefer no unused accessor.

Firmware suite green (470 passed, 2 skipped).

@esphbot

esphbot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@bdraco: On the two Copilot discussions:

On r3338567568 (reset/clean can run concurrently with an upload) — you asked for a call on whether to close this in-PR, so here's a recommendation.

Your cancel_all_active_jobs change fully closes the existing-job race, and it's worth stating why so it isn't second-guessed later: cancel() on a RUNNING upload awaits terminate_subtree_with_grace, so the esptool/upload subprocess is confirmed dead before reset_build_env enqueues the wipe, and the wipe serializes behind the cancelled job's finalisation on the compile lane. No live flash survives into the clean-all.

The residual is narrower than "reset is just another compile-lane job" though — it's specifically the upload lane being free after the cancel sweep. Once RESET_BUILD_ENV is sitting on the compile lane, a fresh firmware/upload, or the held UPLOAD half of an unrelated install whose COMPILE just completed and got released, lands on the upload lane and runs concurrently with the rmtree of .esphome/build/. esphome upload streams firmware.bin out of that tree, so an rmtree mid-read can flash a truncated binary — strictly worse than a failed build. Per-device clean has the same shape vs a fresh same-config upload, smaller blast radius.

My suggestion is to land a minimal exclusive gate in this PR rather than defer, because the failure mode is a corrupt device flash rather than a retryable error. A full third 'exclusive' lane is overkill; a reset_pending flag (set when a RESET_BUILD_ENV is QUEUED/RUNNING, checked in run_lane before get() so the upload lane parks new work until it clears) closes it for the global case, and holding same-configuration jobs off their lanes while a per-device CLEAN is active closes the per-device case — both reuse the held-dependent machinery you already have. If you'd rather keep this PR scoped to the pipelining and ship the gate as an immediate follow-up, that's defensible given how narrow the trigger window is, but it shouldn't ride to a release unguarded.

On r3338567551 — the return-value-driven re-persist on the QUEUED-cancel cascade is the right fix; nothing to add.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 2 comments.

Comment thread docs/API.md Outdated
Comment thread esphome_device_builder/controllers/firmware/factories.py Outdated
bdraco added 2 commits June 1, 2026 23:37
Close the cross-lane race: a clean/reset on the compile lane rmtree's build
artifacts that an upload on the upload lane reads, so a newly-started upload
could flash a truncated binary mid-wipe. The upload lane now holds an upload
(run_lane._await_build_gate) while a RESET_BUILD_ENV (any) or a same-config
CLEAN is active, releasing it when the op finalizes (build_gate, set on every
terminal). Compile-lane jobs serialize behind the clean/reset already, so only
uploads need the gate.

Also hoist FirmwareState.active_jobs() (queued/running iterator) and use it in
upload_blocked, the supersede/cancel-all loop, and the rename-lock check.
cancel_all_active_jobs (the global cancel for reset_build_env) now re-raises a
RuntimeError from cancel — a RUNNING job it couldn't terminate would race the
clean-all wipe — while a per-configuration supersede keeps swallowing it as a
benign state-out-of-sync. Also fix the "one job runs at a time" API.md note to
describe the two concurrent lanes.
@bdraco

bdraco commented Jun 2, 2026

Copy link
Copy Markdown
Member Author

Addressed:

  • 🟡 Add section config API and user preferences #1 (reset/clean wipe races a newly-started upload) — implemented the exclusive gate in 86e3402. The upload lane now holds an upload (runner._await_build_gate) while a RESET_BUILD_ENV (any) or a same-configuration CLEAN is active — FirmwareState.upload_blocked is the predicate, woken by build_gate which is set on every job terminal. Compile-lane jobs serialize behind the clean/reset already, so only uploads need the gate. Unit + run_lane integration tests added.
  • Copilot (_cancel_active_jobs swallows RuntimeError) — fixed in 35a8d3c. The global cancel (cancel_all_active_jobs, for reset) now re-raises RuntimeError/ValueError (a RUNNING job it couldn't terminate would race the wipe), while per-configuration supersede keeps swallowing it. Test added.
  • Copilot (stale "one job runs at a time" doc) — fixed in 35a8d3c; the queue note now describes the two concurrent lanes.
  • 🟡 MEDIUM (dependency_satisfied) — runtime can't strand (release-before-prune); restart cancels a missing-prereq dependent deterministically, now with a log.

Also hoisted FirmwareState.active_jobs() and reused it across upload_blocked, the supersede/cancel-all loop, and the rename-lock check. Firmware suite green (473 passed).

bdraco added 3 commits June 1, 2026 23:48
The aggregate (idle only when both lanes are) had no production caller once
the peer-link broadcast switched to compile_queue_status. Drop it; the
slot-release-ordering tests + stubs now read compile_queue_status (the compile
lane is what they actually exercise), and the QueueStatus docstring drops the
aggregate framing. Leaves two queue-status accessors: lane_status (one lane)
and compile_queue_status (the compile lane the offloader keys on).
run_lane's post-gate cancelled-skip — an upload cancelled while held behind a
clean/reset — had no test. Cancel a held upload, finish the reset, and verify a
fresh upload runs while the cancelled one is skipped.
@esphbot

esphbot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

@bdraco: Addressed:

One follow-up on the gate that's adjacent to this work: upload_blocked keys off any active RESET_BUILD_ENV in active_jobs(), and reset_build_env creates that RESET job before cancel_all_active_jobs. Since _cancel_active_jobs now re-raises RuntimeError/ValueError for the global case (the right call — don't wipe over a job you couldn't confirm dead), a re-raise unwinds before _enqueue, leaving the just-created RESET job stranded QUEUED in state.jobs. Because it never reaches a terminal, build_gate never fires for it, so every later UPLOAD parks at _await_build_gate indefinitely — the gate's own stuck case — and the orphan also persists onto disk and gets placed straight onto the compile lane by _restore_to_lane on restart, executing an unintended clean-all. enqueue_install_chain already established the rollback shape (pop the created job, re-raise); mirroring it in reset_build_env (or creating the RESET job only after the sweep returns) closes it. Filed as an inline comment on controller.py with the suggested diff. Low-probability trigger, but the failure mode (wedged upload lane + destructive boot-time wipe) argues for a clean rollback.

@bdraco bdraco merged commit f257142 into main Jun 2, 2026
16 checks passed
@bdraco bdraco deleted the firmware-queue-pipeline branch June 2, 2026 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improvement to an existing feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants