Pipeline the firmware queue: concurrent compile + upload lanes #1131
Add a Lane dataclass (queue + current_job + current_process) and compile_lane/upload_lane on FirmwareState, with transitional proxies so existing single-lane call sites keep working. Add FirmwareJob.depends_on for job chaining and retire JobType.INSTALL for new jobs (kept for deserialisation). QueueStatus stays a 3-field tuple; per-lane status will come from a separate accessor. No behavior change yet — both runner paths still flow through the compile lane. Plan: compile_upload_queue_split_plan.md
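A minimal sketch of the lane shape this commit describes, for readers who have not opened `_state.py`. Only `Lane`, its three fields, `compile_lane`/`upload_lane`, `FirmwareJob.depends_on`, and `lane_for` are named in this PR; the field types, the `name` field, and the trimmed-down `JobType` are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class JobType(Enum):
    # Trimmed, hypothetical subset of the real enum.
    COMPILE = "compile"
    UPLOAD = "upload"
    CLEAN = "clean"


@dataclass
class FirmwareJob:
    job_id: str
    job_type: JobType
    depends_on: Optional[str] = None  # id of the prerequisite job, if any


@dataclass
class Lane:
    name: str  # assumption: a label for tracing, not named in the PR
    queue: asyncio.Queue = field(default_factory=asyncio.Queue)
    current_job: Optional[FirmwareJob] = None
    current_process: Optional[object] = None  # e.g. a subprocess handle


@dataclass
class FirmwareState:
    compile_lane: Lane = field(default_factory=lambda: Lane("compile"))
    upload_lane: Lane = field(default_factory=lambda: Lane("upload"))

    def lane_for(self, job: FirmwareJob) -> Lane:
        # Uploads are network-bound; everything else shares the compile lane.
        if job.job_type is JobType.UPLOAD:
            return self.upload_lane
        return self.compile_lane
```

Keeping `current_job` and `current_process` on the lane rather than on the state is what lets cancel/terminate target one lane without touching the other.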
Two lane consumers (compile/upload) run concurrently; install splits into a COMPILE job + a dependent local UPLOAD job (depends_on), so a network flash no longer blocks the next compile. Remote install uses the same chain since a remote compile already materialises artifacts locally. Lane-aware finalize/cancel/terminate; dependency hook releases or cancels held dependents; persistence routes restored jobs by lane with order-independent dependency resolution. lane_for/dependency_satisfied live on FirmwareState (shared, DRY). Tests still need updating to the new signatures. WIP checkpoint; see compile_upload_queue_split_plan.md.
Lane-aware tracked_subprocess/terminate/execute_job calls, supersede exclude_job_ids, lane-aware enqueue tracing. Remaining red is behaviour change (install chain, verify-on-upload) to be rewritten next.
Rewrite test_install for the COMPILE+UPLOAD chain (port lands on the upload half; compile head returned). Cancelling a QUEUED compile now cascades to its held upload (jobs.cancel runs release_dependents) so a cancelled build never flashes the device — the #3702 cancel guard. Tests pin cancel/fail-cascade and success-release-to-upload-lane.
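The cascade described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `lifecycle.py`: the `Job`/`Status` types, the plain lists standing in for the held set and the upload lane, and the function signature are all hypothetical. Only the name `release_dependents` and the release-or-cancel behaviour come from the PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    QUEUED = "queued"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    job_id: str
    status: Status = Status.QUEUED
    depends_on: Optional[str] = None


def release_dependents(finished: Job, held: list, upload_queue: list) -> bool:
    """Release or cancel jobs held on `finished`; return True if any job was touched."""
    acted = False
    for job in [j for j in held if j.depends_on == finished.job_id]:
        held.remove(job)
        if finished.status is Status.COMPLETED:
            upload_queue.append(job)  # prerequisite succeeded: hand to the upload lane
        else:
            job.status = Status.CANCELLED  # failed/cancelled build must never flash
        acted = True
    return acted
```

Because the held upload never sits on the upload lane before its compile finishes, cancelling the compile cannot race an upload consumer that has already picked the job up.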
Thread the lane through verify_chip (real bug: the esptool path was being passed as the lane); event-driven _run_until_terminal (no polling) that waits for the whole chain; cancel-during-verify tests drive a bare upload (verify is upload-phase); chip-mismatch[install] asserts on the upload job; lane-aware terminate/execute stubs. _run_queue now cancels AND drains both lane tasks on shutdown (a bare gather left a lane mid-cancel). Adds an e2e asserting a cancelled compile never flashes (the held upload cascades to CANCELLED, no esptool spawn).
Update CLAUDE.md, docs/ARCHITECTURE.md, and the controller docstrings from 'one job at a time' to the compile/upload lane model + install chain.
A receiver only compiles for offloaders, so it broadcasts compile-lane status (compile_queue_status) not the aggregate — an uploading receiver keeps a free compile lane and must still accept delegated compiles, else offloaders silently fall back to LOCAL.
Merging this PR will not alter performance
|
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff, main vs. #1131:

| Metric | main | #1131 | +/- |
|---|---|---|---|
| Coverage | 99.39% | 99.38% | -0.01% |
| Files | 209 | 209 | |
| Lines | 15268 | 15376 | +108 |
| Hits | 15175 | 15282 | +107 |
| Misses | 93 | 94 | +1 |
|
Scope note: parallel use of multiple remote build servers is a follow-up (not in this PR)

This PR pipelines compile vs. upload (each lane is a single worker, and the two run concurrently), so a slow upload no longer blocks the next compile. It does not make multiple remote build servers run at once. The compile lane processes one job at a time, and a REMOTE compile occupies that lane for its whole duration (it awaits the receiver's compile + artifact download). So with N paired servers and N queued installs, the servers are still used one at a time, not in parallel: the scheduler picks one receiver per job (oldest-idle, then oldest-otherwise); it doesn't fan out concurrently.

Follow-up: allow the compile lane to dispatch multiple remote compiles in parallel (bounded by the number of idle receivers), since remote compiles are I/O-bound waits rather than local CPU, e.g. a small remote-dispatch pool on the compile lane, or a dedicated remote lane with N concurrent in-flight dispatches keyed on idle receivers. This composes cleanly on top of the lane model here and is intentionally out of scope for this PR (which keeps each lane single-worker).
|
PR Review - Pipeline the firmware queue: concurrent compile + upload lanes
Solid, carefully-engineered change that has already absorbed many review rounds. The two-lane model is coherent:
🟢 Suggestions
1. Same-config CLEAN branch of upload_blocked is largely defensive (`esphome_device_builder/controllers/firmware/_state.py`, L76-84)
The
Checklist
Automated review by Kōan 8467cfc |
DRY + simplification pass on the queue-pipeline branch:
- Hoist the duplicated ``wire_real_queue`` / ``run_until_terminal`` / ``upload_of`` e2e helpers into the firmware conftest; the three test files import them aliased to their existing ``_``-prefixed names so the call sites are untouched. The unified ``run_until_terminal`` settles on the whole chain being terminal, correct for both single-job and the install COMPILE-then-UPLOAD chain.
- Remove the test-only ``state.queue`` / ``current_job`` / ``current_process`` proxies; production is fully lane-aware, so migrate the test call sites to ``state.compile_lane.*`` and delete the shims so ``state.current_job`` can't read as a half-truth that hides an in-flight upload.
- Revert the single-element ``_runner_tasks`` list back to the original ``_runner_task``; ``_run_queue`` is one wrapper task.
- Use the ``with_queue`` fixture in the bulk-install rename-lock test so its queue ``put_nowait`` is sync, clearing the AsyncMock warning.
It was a working scratchpad (status/TODO scaffolding), not a durable deliverable; the design lives in the PR description. Repo root is for README.md and CLAUDE.md only.
|
Thanks for the review. Addressed in 9374161 and 7f5f691:
Folded in a small DRY pass while here: hoisted the duplicated |
- Add ``test_lane_concurrency_e2e.py``: drive both real lane consumers against real subprocesses, park an upload on the upload lane, and pin that a compile for another device runs to COMPLETED while the upload is still RUNNING. This is the headline #3702 behavior; a regression that re-serialised the queue trips the timeout.
- Cover the persistence two-pass restore branches: a held UPLOAD whose COMPILE is still QUEUED resumes held (not on its lane); one whose prerequisite is gone resumes CANCELLED.
- Cover ``compile_queue_status``: an uploading receiver still advertises compile-lane idle so offloaders keep delegating compiles.
- Hoist the ``StubDevices`` / ``wire_devices`` test helper into the firmware conftest (now used by the chip-verify and concurrency files).

Brings firmware/persistence.py and firmware/controller.py to 100% patch coverage.
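The two-pass restore branches covered above could look something like the sketch below. It is illustrative only: the dict-shaped jobs, the string statuses, and the `restore` function are hypothetical, and treating a COMPLETED prerequisite as "release to the lane" is an assumption rather than something the commit states. Pass 1 indexes every restored job, so pass 2 can resolve a dependency even when the dependent was persisted before its prerequisite.

```python
def restore(jobs: list) -> dict:
    """Route restored jobs; each job is a dict with id/type/status/depends_on keys."""
    by_id = {j["id"]: j for j in jobs}  # pass 1: index everything first
    out = {"compile": [], "upload": [], "held": [], "cancelled": []}
    for job in jobs:  # pass 2: resolve dependencies regardless of list order
        lane = "upload" if job["type"] == "UPLOAD" else "compile"
        dep_id = job.get("depends_on")
        if dep_id is None:
            out[lane].append(job["id"])
            continue
        dep = by_id.get(dep_id)
        if dep is None or dep["status"] in ("FAILED", "CANCELLED"):
            out["cancelled"].append(job["id"])  # prerequisite gone: never flash
        elif dep["status"] == "COMPLETED":
            out[lane].append(job["id"])  # assumption: satisfied dependency is released
        else:
            out["held"].append(job["id"])  # prerequisite still queued/running
    return out
```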
Switch _run_queue's except-CancelledError to a try/finally so that any lane consumer error (not just shutdown cancellation) cancels and drains the sibling lane before the error propagates, rather than leaving it orphaned mid-flight. Add a regression test. Fix the held-branch comment in factories.enqueue to reference the public lifecycle.release_dependents (no leading underscore).
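For reviewers who want the shape of that change without opening the diff, here is a minimal asyncio sketch of the try/finally pattern. It is an approximation under assumptions: `run_queue` and its varargs signature are hypothetical, and it uses plain `asyncio.ensure_future` rather than the project's eager-task helper.

```python
import asyncio


async def run_queue(*lane_consumers):
    """Run lane consumers together; on any exit, cancel and drain all of them."""
    tasks = [asyncio.ensure_future(c) for c in lane_consumers]
    try:
        await asyncio.gather(*tasks)
    finally:
        # Reached on shutdown cancellation and on a consumer error alike.
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
        # return_exceptions=True so draining never masks the original error.
        await asyncio.gather(*tasks, return_exceptions=True)
```

A bare `gather` propagates the first error but leaves the sibling task running; the `finally` block is what guarantees the sibling has fully unwound before the error reaches the caller.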
|
Addressed in 5189ba1:
On MEDIUM 2 (
Full firmware suite green (468 passed, 2 skipped). |
Pull request overview
This PR updates the firmware job system to run compile and upload work in two concurrent single-worker lanes, enabling a slow network upload to overlap with the next device’s CPU-bound compile. It also changes firmware/install to enqueue a COMPILE job plus a dependent UPLOAD job (FirmwareJob.depends_on) and updates remote-build signaling so receivers advertise compile-lane availability even while uploading.
Changes:
- Introduce `FirmwareState` lanes (`compile_lane` + `upload_lane`) and run both lane consumers concurrently.
- Split `firmware/install` into a COMPILE head job plus a dependent UPLOAD job held until the prerequisite completes.
- Extend persistence + tests to cover dependency behavior, cancel cascade safety, lane concurrency, and compile-lane idleness reporting.
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| esphome_device_builder/controllers/firmware/_state.py | Adds Lane and lane routing/dependency helpers (lane_for, dependency_satisfied). |
| esphome_device_builder/controllers/firmware/controller.py | Runs both lanes concurrently; adds lane/aggregate queue snapshots; install now enqueues the chain. |
| esphome_device_builder/controllers/firmware/runner.py | Converts single-queue runner into per-lane runner + lane-scoped subprocess tracking. |
| esphome_device_builder/controllers/firmware/lifecycle.py | Clears the correct lane slot and releases/cancels dependent jobs on prerequisite completion. |
| esphome_device_builder/controllers/firmware/factories.py | Enqueue routes by lane and introduces enqueue_install_chain; updates supersede signature. |
| esphome_device_builder/controllers/firmware/jobs.py | Makes cancel lane-aware and cancels held dependents when cancelling prerequisites. |
| esphome_device_builder/controllers/firmware/persistence.py | Restores active jobs into lanes and handles held/cancellable dependents on restart. |
| esphome_device_builder/controllers/firmware/cli.py | Makes verify-chip use lane-scoped tracked subprocess registration. |
| esphome_device_builder/controllers/firmware/remote_runner.py | Updates tracked subprocess registration to be lane-scoped. |
| esphome_device_builder/controllers/remote_build/peer_link_sessions.py | Uses compile-lane status for receiver scheduling signals to offloaders. |
| esphome_device_builder/models/firmware.py | Adds depends_on to FirmwareJob; updates queue-status docs and INSTALL commentary. |
| tests/controllers/firmware/conftest.py | Centralizes shared e2e helpers and updates queue wiring for lanes and dependency tests. |
| tests/controllers/firmware/test_install.py | Updates install expectations to COMPILE+dependent UPLOAD chain; adds cascade/release tests. |
| tests/controllers/firmware/test_persistence.py | Adds restart behaviors for held uploads and orphan prerequisites; updates queue assertions. |
| tests/controllers/firmware/test_queue_status.py | Adds compile-lane-only status snapshot used for remote-build scheduler. |
| tests/controllers/firmware/test_lane_concurrency_e2e.py | New end-to-end proof that upload and compile lanes overlap. |
| tests/controllers/firmware/test_verify_chip_e2e.py | Adjusts verify-chip/cancel tests for lane-scoped subprocess tracking and install chain. |
| tests/controllers/firmware/test_execute_job_e2e.py | Reuses shared runner helpers; updates process tracking expectations. |
| tests/controllers/firmware/test_remote_runner.py | Updates remote-runner tests for lane-scoped execution/termination tracking. |
| tests/controllers/firmware/test_supersede.py | Updates state wiring to lane-scoped current_job. |
| tests/controllers/firmware/test_stop.py | Updates termination calls to pass the lane. |
| tests/controllers/firmware/test_stop_windows.py | Updates termination calls to pass the lane on Windows. |
| tests/controllers/firmware/test_rename_lock.py | Updates controller factory usage for queue wiring. |
| tests/controllers/firmware/test_clean.py | Updates state wiring to lane-scoped current_job. |
| tests/controllers/firmware/test_cancel.py | Updates cancel tests for lane-scoped current_job and process tracking. |
| tests/controllers/firmware/test_branches_coverage.py | Updates runner/terminate coverage tests for lane scoping. |
| tests/conftest.py | Adds remote-build firmware stub for compile_queue_status. |
| docs/ARCHITECTURE.md | Documents the two-lane queue and install chain behavior. |
| docs/API.md | Documents firmware/install returning the COMPILE job and describing the dependent UPLOAD. |
| CLAUDE.md | Updates repo orientation notes to reflect the two-lane firmware queue and install chaining. |
|
Your MEDIUM-1 ( However, Copilot's earlier supersede comment is still open and is a real bug after this change.
|
Re-installing a config supersedes the prior COMPILE + UPLOAD chain:
cancelling the prior compile cascades to cancel its held upload, so when
the supersede loop reaches that already-terminal upload, cancel() raises
CommandError ("Cannot cancel a cancelled job"). CommandError is not a
ValueError/RuntimeError, so the existing suppress missed it and the new
install raised; add it to the suppress and pin it with a regression test.
Also correct the JobType.INSTALL comment (persisted INSTALL jobs still run
as a fused `esphome run` via the runner + CLI; they are not migrated to a
chain on load), and update the capture_enqueue_order docstring to put_nowait.
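A compact reproduction of the failure mode and the fix, as a sketch only. The `CommandError` class here is a local stand-in for the project's exception, and the dict-shaped jobs and the `cancel`/`supersede` signatures are hypothetical; the point is that the cascade makes the second `cancel` call hit an already-terminal job.

```python
import contextlib


class CommandError(Exception):
    """Stand-in for the project's CommandError (not a ValueError/RuntimeError)."""


def cancel(job: dict, jobs: list) -> None:
    if job["status"] == "CANCELLED":
        raise CommandError("Cannot cancel a cancelled job")
    job["status"] = "CANCELLED"
    for other in jobs:  # cascade to held dependents
        if other.get("depends_on") == job["id"] and other["status"] == "QUEUED":
            other["status"] = "CANCELLED"


def supersede(config: str, jobs: list) -> None:
    for job in [j for j in jobs if j["config"] == config]:
        # The cascade may already have cancelled this job; that is benign here.
        with contextlib.suppress(ValueError, RuntimeError, CommandError):
            cancel(job, jobs)
```

Without `CommandError` in the suppress tuple, the loop raises on the held upload and the new install never gets queued.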
|
Addressed in 9685a50:
On the related "collapse async to sync" cleanup: I audited the firmware package for Full firmware suite green (469 passed, 2 skipped). |
Use create_eager_task instead of bare asyncio.create_task for the two run_lane consumers, matching how the rest of the codebase spawns tasks (eager_start runs each up to its first await synchronously).
A clean is the user asking for a fresh build, so per-device clean now cancels any in-flight compile / upload / install for that configuration (via the existing _enqueue supersede; an install's compile cascades to its held upload) instead of rejecting it. reset_build_env (clean-all) wipes the whole build tree, which — with the compile and upload lanes now running concurrently — can race a live build on either lane. It cancels every active job before queueing. Adds factories.cancel_all_active_jobs (sharing the cancel loop with supersede).
Document that queue_status_snapshot's queue_depth excludes held dependent uploads (review note), correct the remote_runner lane comment to match lane_for, and keep the new clean/reset/cancel docstrings terse per CLAUDE.md.
The QUEUED-cancel path persisted the cancelled job before release_dependents cascaded to its held upload, so the upload's CANCELLED status never reached disk and a restart re-cancelled it every boot. release_dependents now reports whether it acted; the cancel path re-persists when it did.
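The ordering bug and the return-value-driven fix can be sketched like this. The function names other than `release_dependents` are hypothetical, jobs are plain dicts, and `persist` is an injected callback standing in for the real persistence layer; this sketch also covers only the cancel half of `release_dependents`.

```python
from typing import Callable


def release_dependents(finished: dict, jobs: list) -> bool:
    """Cancel jobs held on a cancelled prerequisite; report whether anything changed."""
    acted = False
    for job in jobs:
        if job.get("depends_on") == finished["id"] and job["status"] == "QUEUED":
            job["status"] = "CANCELLED"
            acted = True
    return acted


def cancel_queued(job: dict, jobs: list, persist: Callable[[], None]) -> None:
    job["status"] = "CANCELLED"
    persist()  # first write: the cancelled prerequisite
    if release_dependents(job, jobs):
        persist()  # second write: captures the cascaded CANCELLED dependents
```

Gating the second write on the return value keeps the common no-dependent cancel at a single persist.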
|
On the two Copilot discussions:
|
Document in API.md that per-device clean cancels the device's in-flight build and reset_build_env cancels every job on both lanes (and fix the now-per-type history-dedup note). Log when _restore_to_lane cancels a restored dependent whose prerequisite is missing/terminal, so a pruned-but-succeeded prereq is diagnosable rather than a silent cancel.
|
Thanks — status on each:
Firmware suite green (470 passed, 2 skipped). |
On r3338567568 (reset/clean can run concurrently with an upload) — you asked for a call on whether to close this in-PR, so here's a recommendation. Your The residual is narrower than "reset is just another compile-lane job" though — it's specifically the upload lane being free after the cancel sweep. Once RESET_BUILD_ENV is sitting on the compile lane, a fresh My suggestion is to land a minimal exclusive gate in this PR rather than defer, because the failure mode is a corrupt device flash rather than a retryable error. A full third 'exclusive' lane is overkill; a On r3338567551 — the return-value-driven re-persist on the QUEUED-cancel cascade is the right fix; nothing to add. |
Close the cross-lane race: a clean/reset on the compile lane rmtree's build artifacts that an upload on the upload lane reads, so a newly-started upload could flash a truncated binary mid-wipe. The upload lane now holds an upload (run_lane._await_build_gate) while a RESET_BUILD_ENV (any) or a same-config CLEAN is active, releasing it when the op finalizes (build_gate, set on every terminal). Compile-lane jobs serialize behind the clean/reset already, so only uploads need the gate. Also hoist FirmwareState.active_jobs() (queued/running iterator) and use it in upload_blocked, the supersede/cancel-all loop, and the rename-lock check.
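One way to picture the gate, assuming an `asyncio.Event` that is set on every terminal job. The `BuildGate` class, the tuple representation of active jobs, and the method signatures are hypothetical; `upload_blocked`, `build_gate`, and the blocking rule (any RESET_BUILD_ENV, or a CLEAN for the same configuration) follow the commit message.

```python
import asyncio


class BuildGate:
    def __init__(self):
        self.active = []  # (job_type, configuration) of queued/running jobs
        self.build_gate = asyncio.Event()  # set whenever a job reaches a terminal state

    def upload_blocked(self, configuration: str) -> bool:
        return any(
            jt == "RESET_BUILD_ENV" or (jt == "CLEAN" and cfg == configuration)
            for jt, cfg in self.active
        )

    async def await_build_gate(self, configuration: str) -> None:
        # Re-check after every wake-up: another blocking op may still be active.
        while self.upload_blocked(configuration):
            self.build_gate.clear()
            await self.build_gate.wait()

    def finalize(self, job) -> None:
        self.active.remove(job)
        self.build_gate.set()
```

Only the upload consumer calls `await_build_gate`; compile-lane jobs already serialize behind the clean/reset by sharing its lane.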
cancel_all_active_jobs (the global cancel for reset_build_env) now re-raises a RuntimeError from cancel — a RUNNING job it couldn't terminate would race the clean-all wipe — while a per-configuration supersede keeps swallowing it as a benign state-out-of-sync. Also fix the "one job runs at a time" API.md note to describe the two concurrent lanes.
|
Addressed:
Also hoisted |
The aggregate (idle only when both lanes are) had no production caller once the peer-link broadcast switched to compile_queue_status. Drop it; the slot-release-ordering tests + stubs now read compile_queue_status (the compile lane is what they actually exercise), and the QueueStatus docstring drops the aggregate framing. Leaves two queue-status accessors: lane_status (one lane) and compile_queue_status (the compile lane the offloader keys on).
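To make the remaining two accessors concrete, here is a rough sketch. `LaneSnapshot` and its two fields are hypothetical and are not the real 3-field `QueueStatus`; the dict of `(current_job, queued)` pairs is likewise an assumption. The sketch only shows why a busy upload lane does not affect what the offloader sees.

```python
from typing import NamedTuple, Optional


class LaneSnapshot(NamedTuple):
    # Hypothetical fields; the real QueueStatus tuple is not reproduced here.
    idle: bool
    queue_depth: int


def lane_status(current_job: Optional[str], queued: list) -> LaneSnapshot:
    """Status of one lane: idle only when nothing is running or waiting."""
    return LaneSnapshot(idle=current_job is None and not queued, queue_depth=len(queued))


def compile_queue_status(lanes: dict) -> LaneSnapshot:
    """Status of the compile lane only, which is what the offloader keys on."""
    current, queued = lanes["compile"]
    return lane_status(current, queued)
```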
run_lane's post-gate cancelled-skip — an upload cancelled while held behind a clean/reset — had no test. Cancel a held upload, finish the reset, and verify a fresh upload runs while the cancelled one is skipped.
One follow-up on the gate that's adjacent to this work: |
What does this implement/fix?
Pipelines the firmware job queue so a network-bound upload runs concurrently with a CPU-bound compile, per esphome org discussion #3702. Before this, the queue ran one job at a time and `install` was a fused `esphome run`, so a slow upload (for example ~9 minutes over Thread, CPU idle) blocked the next device's compile.

The queue now has two concurrent single-worker lanes: a compile lane (CPU; compile, clean, reset, rename) and an upload lane (network). `firmware/install` is a two-job chain, a COMPILE job plus a dependent local UPLOAD job (`FirmwareJob.depends_on`); the upload is held off its lane until the compile succeeds, then runs on the upload lane while the compile lane is free for the next device. `install_bulk` queues the same chain per device, so a multi-device install pipelines (device B's compile overlaps device A's upload) instead of running serially. A cancelled or failed compile cascades to cancel the held upload, so a cancelled build never flashes; remote installs use the same chain since a remote compile already materialises artifacts locally, and the receiver advertises compile-lane idleness so an uploading receiver still accepts delegated compiles.

Lane state lives on `FirmwareState` as the shared source of truth (`lane_for` / `dependency_satisfied` / `place_on_lane`); the install chain is committed atomically (one rename-lock check, then place + announce both jobs synchronously, one persist) so a concurrent rename can't strand a half-queued pair. `_run_queue` spawns the two lane consumers via the eager-task helper and drains both on any exit. Because the lanes run concurrently, build-tree-mutating ops cancel conflicting work: a per-device clean cancels that device's in-flight chain, and `reset_build_env` (clean-all) cancels every active job on both lanes before wiping. The cancel cascade is persisted so a restart doesn't re-cancel dependents.

Tests cover the chain, the cancel cascade (unit + real-runner e2e), lane concurrency e2e, cross-lane verify-chip, persistence re-queue by lane, supersede/clean/reset cancellation, and the compile-idle scheduler signal; firmware suite green with 100% patch coverage on the changed controllers. CLAUDE.md, docs/ARCHITECTURE.md, and docs/API.md are updated.
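The atomic commit of the install chain can be illustrated with the sketch below. It is not the real `enqueue_install_chain`: the signature, the dict-shaped jobs, the list-backed queue and held set, and the injected `persist`/`rename_locked` callbacks are all assumptions. What it mirrors from the description is one rename-lock check, synchronous placement of both jobs, a single persist, the port living on the upload half, and the COMPILE head being returned.

```python
import uuid
from typing import Callable


def enqueue_install_chain(
    configuration: str,
    port: str,
    compile_queue: list,
    held: list,
    persist: Callable[[], None],
    rename_locked: Callable[[str], bool] = lambda _cfg: False,
) -> dict:
    """Queue COMPILE + dependent UPLOAD as one unit; return the COMPILE head."""
    if rename_locked(configuration):  # one rename-lock check for the whole pair
        raise RuntimeError(f"{configuration} is being renamed")
    compile_job = {"id": uuid.uuid4().hex, "type": "COMPILE",
                   "configuration": configuration, "depends_on": None}
    upload_job = {"id": uuid.uuid4().hex, "type": "UPLOAD",
                  "configuration": configuration, "port": port,
                  "depends_on": compile_job["id"]}
    # No await between these steps, so nothing can observe a half-queued pair.
    compile_queue.append(compile_job)
    held.append(upload_job)  # kept off the upload lane until the compile succeeds
    persist()  # single write covering both jobs
    return compile_job
```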
Related issue or feature (if applicable):
Types of changes
- bugfix
- new-feature
- enhancement
- breaking-change
- refactor
- docs
- maintenance
- ci
- dependencies

Frontend coordination
Frontend impact: `firmware/install` now produces two job rows (a compile and a dependent upload), and two firmware jobs can be RUNNING at once. The events are all standard `JOB_*`; the frontend must render concurrent RUNNING rows and the held/queued upload (new `FirmwareJob.depends_on` field). The compile→upload progress reset at the seam already matches the existing remote-install behavior.

Checklist
- `ruff`, `codespell`, yaml/json/python checks).
- `tests/` where applicable.
- `components.index.json` / `definitions/components/*.json` have not been hand-edited (regenerate via `script/sync_components.py` if a sync is needed).
- `docs/ARCHITECTURE.md` and/or `docs/API.md`.