Mission artifact corruption can trigger replay/resume churn because scheduling trusts mutable features.json #974

@davidgeib89-art

Summary

We hit an expensive mission-state integrity failure in Droid / Factory missions.

This was not a product-code regression.
It was a harness/state issue: a late artifact-writing failure left features.json malformed or inconsistent, and resume logic relied on that mutable file heavily enough to risk replay and broad resume of already completed work.

The practical result was that a small late-stage artifact failure created unnecessary orchestration churn, token waste, and operator confusion near the end of a mostly completed mission.

What happened

Mission shape:

  • large multi-step coding mission
  • orchestrator + workers + validators
  • mission artifacts included features.json and validation-state.json

Observed sequence:

  1. core product work was already largely complete
  2. a late proof/follow-up/finalization phase was running
  3. an artifact-writing step failed
    • malformed JSON / broken artifact write
    • later we also saw synthesis-writing failures
  4. features.json ended up malformed or inconsistent
  5. resume/start logic trusted features.json strongly enough that replay/broad-resume behavior followed
  6. this created unnecessary re-planning / re-validation risk instead of a narrow repair-only path

Important clarification

This was not a true runtime completed -> pending regression for already completed product work.

In our case:

  • some follow-up validation state had genuinely failed or not completed yet
  • later manual artifact/state repair attempted to finalize the mission state
  • then malformed/inconsistent features.json made scheduling unreliable

So the problem was not “successful product work vanished”.

The problem was:

  • mutable mission summary state
  • artifact corruption
  • scheduler trusting that corrupted summary strongly enough to cause replay/resume churn

Why this is costly

A late artifact error should be cheap to recover from.

Instead, it can currently cause:

  • replay risk for already completed work
  • broad milestone resume behavior
  • repeated re-reading / re-planning / re-validation
  • large token waste
  • lower operator trust in resume semantics

This is especially painful near the end of long-running missions.

Verified behavior from our incident

  • validation-state.json preserved behavioral truth more reliably than features.json
  • features.json was the effective scheduling truth for resume/start behavior
  • malformed/inconsistent features.json directly interfered with scheduling
  • broad replay/resume risk came from artifact state, not from a real product regression

Requested changes

1. Make mission artifact writes transactional

For files like:

  • features.json
  • validation-state.json
  • synthesis artifacts that affect scheduling

Please use:

  1. write temp file
  2. parse temp file
  3. schema-validate temp file
  4. atomically replace original only on success

If validation fails, keep the old file untouched.
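
The four steps above could be sketched roughly as follows. This is only an illustration, not the harness's actual code: `write_artifact_atomically` and `validate_features` are hypothetical names, and the schema check is a stand-in because the real features.json schema is not described in this issue.

```python
import json
import os
import tempfile


def validate_features(doc):
    """Stand-in schema check (assumption: a top-level 'features' list of objects)."""
    if not isinstance(doc, dict) or not isinstance(doc.get("features"), list):
        raise ValueError("expected an object with a 'features' list")
    for feature in doc["features"]:
        if not isinstance(feature, dict) or "id" not in feature or "status" not in feature:
            raise ValueError("each feature needs 'id' and 'status'")


def write_artifact_atomically(path, text, validate):
    """Replace `path` with `text` only if the text parses as JSON and validates."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays
    # on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # Step 1: write the temp file and flush it to disk.
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Step 2: parse what actually landed on disk.
        with open(tmp_path, encoding="utf-8") as tmp:
            doc = json.load(tmp)
        # Step 3: schema-validate the parsed document.
        validate(doc)
        # Step 4: atomically replace the original only after both checks pass.
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure leaves the old file untouched; only the temp file is removed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this shape, a truncated or schema-invalid payload raises `ValueError` (`json.JSONDecodeError` is a subclass) and the previous features.json stays in place.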

2. Do not use mutable features.json as sole scheduling truth

Please add a more durable execution truth, such as:

  • append-only event log
  • durable runner state
  • immutable feature lifecycle history

Examples of durable facts:

  • feature started
  • feature completed
  • feature failed
  • validator completed
  • validator failed
  • worker session ids
  • milestone transitions

features.json should be a projection/summary, not the only source of scheduling truth.
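
As a rough sketch of what "projection" could mean here, the summary can be rebuilt from an append-only JSON Lines log. Everything in this example is assumed for illustration: the event names, the `feature_id` field, the status strings, and the helper functions are hypothetical and not taken from the actual mission format.

```python
import json

# Hypothetical event types mapped to the feature status they imply.
STATUS_FOR_EVENT = {
    "feature_started": "in_progress",
    "feature_completed": "completed",
    "feature_failed": "failed",
}


def append_event(log_path, event):
    """Append one event as a single JSON line; earlier lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(log_path):
    """Read all events back in the order they were appended."""
    events = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if line:
                events.append(json.loads(line))
    return events


def project_features(events):
    """Rebuild the features summary from the log; the last relevant event wins."""
    status = {}
    for event in events:
        new_status = STATUS_FOR_EVENT.get(event["type"])
        if new_status is not None:
            status[event["feature_id"]] = new_status
    return {
        "features": [
            {"id": feature_id, "status": feature_status}
            for feature_id, feature_status in sorted(status.items())
        ]
    }
```

If features.json is ever damaged, `project_features(read_events(...))` can regenerate it, so a broken summary file does not change what the scheduler believes has completed.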

3. Enforce monotonic state transitions

A successfully completed runtime step should not be silently treated as pending again just because a mutable artifact was damaged or rewritten.

At minimum:

  • completed -> pending should require explicit repair mode / reason
  • corruption should not implicitly broaden pending work
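
One possible shape for such a guard is shown below. The status names and the `repair_reason` parameter are assumptions made for this sketch; the point is only that completed -> pending is rejected unless the caller supplies an explicit reason.

```python
class IllegalTransition(Exception):
    """Raised when a status change would silently move a feature backwards."""


# Hypothetical forward transitions; 'completed' has no implicit successor.
ALLOWED = {
    "pending": {"in_progress"},
    "in_progress": {"completed", "failed"},
    "failed": {"in_progress"},
    "completed": set(),
}


def transition(current, new, repair_reason=None):
    """Return the new status if the change is allowed, otherwise raise."""
    if new == current:
        return current
    if new in ALLOWED.get(current, set()):
        return new
    # Reopening completed work is only possible in explicit repair mode.
    if current == "completed" and new == "pending" and repair_reason:
        return new
    raise IllegalTransition(f"{current} -> {new} requires an explicit repair reason")
```

A scheduler that routes every status update through a function like this cannot reopen completed work merely because a summary file was rewritten.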

4. Fail closed on artifact corruption

If features.json is malformed or inconsistent:

  • do not broaden replay
  • do not recompute pending work from partial state
  • do not milestone-resume broadly

Instead:

  • enter repair mode
  • require artifact repair first
  • then resume only the truly pending step
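
A fail-closed loader might look roughly like this. `RepairRequired`, the set of known statuses, and the assumed layout of features.json (a top-level `features` list of objects with `id` and `status`) are all hypothetical.

```python
import json


class RepairRequired(Exception):
    """Signals that scheduling must stop until the artifact is repaired."""


KNOWN_STATUSES = {"pending", "in_progress", "completed", "failed"}


def load_features_or_fail_closed(path):
    """Return the feature list, or raise RepairRequired on any corruption."""
    try:
        with open(path, encoding="utf-8") as handle:
            doc = json.load(handle)
    except (OSError, ValueError) as exc:
        raise RepairRequired(f"{path} is unreadable or malformed: {exc}") from exc

    features = doc.get("features") if isinstance(doc, dict) else None
    if not isinstance(features, list):
        raise RepairRequired(f"{path} has no 'features' list")

    seen = set()
    for feature in features:
        if not isinstance(feature, dict):
            raise RepairRequired(f"{path} contains a non-object feature entry")
        feature_id, status = feature.get("id"), feature.get("status")
        if not isinstance(feature_id, str) or feature_id in seen:
            raise RepairRequired(f"{path} has a missing or duplicate feature id")
        if not isinstance(status, str) or status not in KNOWN_STATUSES:
            raise RepairRequired(f"{path} has an unknown status for {feature_id}")
        seen.add(feature_id)
    return features


def pending_feature_ids(path):
    """Compute pending work only from a fully valid file, never from partial state."""
    return [f["id"] for f in load_features_or_fail_closed(path) if f["status"] == "pending"]
```

The key property is that corruption produces an exception rather than a larger pending set: there is no code path that turns a half-readable file into a list of work to schedule.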

5. Distinguish implementation failures from finalization/synthesis failures

A late failure while writing synthesis/finalization artifacts should not be treated like unfinished product implementation.

Please distinguish:

  • implementation/product step failures
  • finalization/bookkeeping/synthesis failures

Finalization failures should trigger:

  • repair-in-place
  • not replay of completed product work
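
To make the distinction concrete, here is a minimal sketch of classifying a failure by the phase it occurred in and choosing a recovery action from that. The phase names, enums, and functions are invented for this example and do not describe the current harness.

```python
from enum import Enum


class FailureKind(Enum):
    IMPLEMENTATION = "implementation"
    FINALIZATION = "finalization"


class Recovery(Enum):
    RETRY_STEP = "retry_step"            # re-run the failed product step only
    REPAIR_IN_PLACE = "repair_in_place"  # fix the artifact, keep completed work


# Hypothetical names for late bookkeeping phases.
FINALIZATION_PHASES = {"finalization", "bookkeeping", "synthesis"}


def classify_failure(phase):
    """Map the phase in which a step failed to a failure kind."""
    if phase in FINALIZATION_PHASES:
        return FailureKind.FINALIZATION
    return FailureKind.IMPLEMENTATION


def recovery_for(kind):
    """Finalization failures never reopen completed product work."""
    if kind is FailureKind.FINALIZATION:
        return Recovery.REPAIR_IN_PLACE
    return Recovery.RETRY_STEP
```

Under this split, a failed synthesis write maps to `Recovery.REPAIR_IN_PLACE`, while a failed coding step maps to a retry of that step alone.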

6. Cross-check resume state against stronger evidence

Before resuming from mission artifacts, cross-check:

  • durable execution history
  • worker handoffs
  • validator handoffs
  • validation-state.json
  • current features.json

If they disagree:

  • stop
  • enter repair mode
  • do not schedule blindly from the mutable summary file
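
The cross-check could be sketched as below, assuming each source can be reduced to a mapping from feature id to status. That reduction, the source names, and both functions are assumptions for illustration. A feature missing from one source is deliberately treated as a disagreement.

```python
def find_disagreements(sources):
    """Compare per-source views of each feature.

    `sources` maps a source name to a {feature_id: status} dict.
    Returns {feature_id: {source_name: status_or_None}} for every feature
    on which the sources do not all agree.
    """
    feature_ids = set()
    for statuses in sources.values():
        feature_ids.update(statuses)

    disagreements = {}
    for feature_id in sorted(feature_ids):
        views = {name: statuses.get(feature_id) for name, statuses in sources.items()}
        if len(set(views.values())) > 1:
            disagreements[feature_id] = views
    return disagreements


def decide_resume(sources):
    """Stop and enter repair mode on any disagreement; otherwise allow resume."""
    disagreements = find_disagreements(sources)
    if disagreements:
        return {"mode": "repair", "disagreements": disagreements}
    return {"mode": "resume", "disagreements": {}}
```

For example, if a hypothetical event log says a feature is completed while features.json says pending, `decide_resume` returns repair mode and reports both views instead of scheduling from the summary file.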

Minimum acceptance criteria

A single malformed mission artifact should not cause replay of completed work.

Specifically:

  • completed runtime work remains recoverable even if summary files are damaged
  • artifact corruption blocks scheduling instead of widening it
  • finalization/synthesis failures do not reopen completed product work
  • resume can recover from handoffs / durable state rather than only from features.json

Additional note

We also observed that operators currently need to be extremely explicit with the orchestrator to avoid broad resume behavior after artifact issues.

That is workable as a temporary mitigation, but the core issue appears to be harness/state integrity rather than prompt quality.

If helpful

I can provide a more structured incident report with:

  • timeline
  • verified facts
  • root cause
  • contributing factors
  • corrective actions
