Mission artifact corruption can trigger replay/resume churn because scheduling trusts mutable `features.json`

## Summary

We hit an expensive mission-state integrity failure in Droid / Factory missions.

This was not a product-code regression.  
It was a harness/state issue: a late artifact-writing failure left `features.json` malformed or inconsistent, and resume logic relied on that mutable file heavily enough to risk replay and broad resume.

The practical result was that a small late-stage artifact failure created unnecessary orchestration churn, token waste, and operator confusion near the end of a mostly completed mission.

## What happened

Mission shape:
- large multi-step coding mission
- orchestrator + workers + validators
- mission artifacts included `features.json` and `validation-state.json`

Observed sequence:
1. core product work was already largely complete
2. a late proof/follow-up/finalization phase was running
3. an artifact-writing step failed
   - malformed JSON / broken artifact write
   - later we also saw synthesis-writing failures
4. `features.json` ended up malformed or inconsistent
5. resume/start logic trusted `features.json` strongly enough that replay/broad-resume behavior followed
6. this created unnecessary re-planning / re-validation risk instead of a narrow repair-only path

## Important clarification

This was **not** a true runtime `completed -> pending` regression for already completed product work.

In our case:
- some follow-up validation state had genuinely failed or not completed yet
- later manual artifact/state repair attempted to finalize the mission state
- then malformed/inconsistent `features.json` made scheduling unreliable

So the problem was not “successful product work vanished”.

The problem was:
- mutable mission summary state
- artifact corruption
- scheduler trusting that corrupted summary strongly enough to cause replay/resume churn

## Why this is costly

A late artifact error should be cheap to recover from.

Instead, it can currently cause:
- replay risk for already completed work
- broad milestone resume behavior
- repeated re-reading / re-planning / re-validation
- large token waste
- lower operator trust in resume semantics

This is especially painful near the end of long-running missions.

## Verified behavior from our incident

- `validation-state.json` preserved behavioral truth more reliably than `features.json`
- `features.json` was the effective scheduling truth for resume/start behavior
- malformed/inconsistent `features.json` directly interfered with scheduling
- broad replay/resume risk came from artifact state, not from a real product regression

## Requested changes

### 1. Make mission artifact writes transactional

For files like:
- `features.json`
- `validation-state.json`
- synthesis artifacts that affect scheduling

Please use:
1. write temp file
2. parse temp file
3. schema-validate temp file
4. atomically replace original only on success

If validation fails, keep the old file untouched.
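
A minimal sketch of the four steps above, in Python for illustration only. The function names, the `validate` callback, and the assumed shape of `features.json` (an object with a `features` list of `{id, status}` entries) are hypothetical; we do not know the harness's real schema or language.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable


def write_artifact_atomically(
    path: Path,
    data: Any,
    validate: Callable[[Any], None],
) -> None:
    """Replace `path` only if the new content parses and validates.

    `validate` is a hypothetical schema check that raises on bad input.
    """
    # 1. Write to a temp file in the same directory, so the final
    #    rename never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        # 2. Parse what actually landed on disk.
        with open(tmp_name, encoding="utf-8") as tmp:
            parsed = json.load(tmp)
        # 3. Schema-validate the parsed content.
        validate(parsed)
        # 4. Replace the original only after every check passed.
        os.replace(tmp_name, path)
    except BaseException:
        # Any failure leaves the previous artifact untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def validate_features(doc: Any) -> None:
    """Illustrative check against an assumed `features.json` shape."""
    if not isinstance(doc, dict) or not isinstance(doc.get("features"), list):
        raise ValueError("expected an object with a 'features' list")
    allowed = {"pending", "in_progress", "completed", "failed"}
    for feature in doc["features"]:
        if not isinstance(feature, dict) or feature.get("status") not in allowed:
            raise ValueError(f"invalid feature entry: {feature!r}")
```

With this shape, a failed serialization, parse, or validation raises before `os.replace` runs, so readers only ever see the old file or a fully validated new one.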

### 2. Do not use mutable `features.json` as sole scheduling truth

Please add a more durable execution truth, such as:
- append-only event log
- durable runner state
- immutable feature lifecycle history

Examples of durable facts:
- feature started
- feature completed
- feature failed
- validator completed
- validator failed
- worker session ids
- milestone transitions

`features.json` should be a projection/summary, not the only source of scheduling truth.
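
To illustrate the projection idea, here is a rough Python sketch of an append-only event log from which a feature summary is rebuilt. The JSON Lines layout, the event type names, and the `feature_id` field are assumptions made for the example, not the harness's actual format.

```python
import json
from pathlib import Path
from typing import Dict, Iterator

# Hypothetical event types mapped to the feature status they imply.
STATUS_BY_EVENT = {
    "feature_started": "in_progress",
    "feature_completed": "completed",
    "feature_failed": "failed",
}


def append_event(log_path: Path, event: dict) -> None:
    """Append one JSON object per line; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(log_path: Path) -> Iterator[dict]:
    """Yield events in the order they were recorded."""
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if line.strip():
                yield json.loads(line)


def project_features(log_path: Path) -> Dict[str, str]:
    """Rebuild the feature summary purely from the log.

    Events that carry no feature status (for example validator events)
    are skipped by this projection.
    """
    statuses: Dict[str, str] = {}
    for event in read_events(log_path):
        status = STATUS_BY_EVENT.get(event["type"])
        if status is not None:
            statuses[event["feature_id"]] = status
    return statuses
```

Because the summary is derived, a damaged `features.json` could be regenerated from the log instead of being treated as lost scheduling state.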

### 3. Enforce monotonic state transitions

A successfully completed runtime step should not be silently treated as pending again just because a mutable artifact was damaged or rewritten.

At minimum:
- `completed -> pending` should require explicit repair mode / reason
- corruption should not implicitly broaden pending work
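
One way to express this guard, sketched in Python. The status names and the `repair_reason` parameter are hypothetical; the point is only that reopening completed work needs an explicit, recorded reason.

```python
from typing import Optional

# Assumed forward-only lifecycle; "completed" has no implicit successors.
ALLOWED_TRANSITIONS = {
    "pending": {"in_progress"},
    "in_progress": {"completed", "failed"},
    "failed": {"in_progress"},
    "completed": set(),
}


class IllegalTransition(Exception):
    """Raised when a status change would silently move work backwards."""


def apply_transition(
    current: str, new: str, repair_reason: Optional[str] = None
) -> str:
    """Return the new status, or raise if the change is not allowed."""
    if new == current:
        return current
    if new in ALLOWED_TRANSITIONS[current]:
        return new
    # Reopening completed work is possible, but only with a stated reason.
    if current == "completed" and new == "pending" and repair_reason:
        return new
    raise IllegalTransition(f"{current} -> {new} requires explicit repair mode")
```

For example, `apply_transition("completed", "pending")` raises, while the same call with `repair_reason="operator requested rerun"` succeeds.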

### 4. Fail closed on artifact corruption

If `features.json` is malformed or inconsistent:
- do not broaden replay
- do not recompute pending work from partial state
- do not milestone-resume broadly

Instead:
- enter repair mode
- require artifact repair first
- then resume only the truly pending step
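
A small Python sketch of the fail-closed behavior we have in mind. `RepairRequired`, the function names, and the assumed `features.json` layout (a `features` list of `{id, status}` objects) are illustrative only.

```python
import json
from pathlib import Path
from typing import List

VALID_STATUSES = {"pending", "in_progress", "completed", "failed"}


class RepairRequired(Exception):
    """Scheduling must stop until the artifact has been repaired."""


def load_features_fail_closed(path: Path) -> List[dict]:
    """Return the feature list, or raise instead of guessing from partial state."""
    try:
        doc = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError subclass.
        raise RepairRequired(f"{path.name} is unreadable or malformed") from exc
    features = doc.get("features") if isinstance(doc, dict) else None
    if not isinstance(features, list):
        raise RepairRequired(f"{path.name} has no 'features' list")
    seen = set()
    for feature in features:
        if not isinstance(feature, dict) or feature.get("status") not in VALID_STATUSES:
            raise RepairRequired(f"{path.name} has an invalid feature entry")
        feature_id = feature.get("id")
        if not isinstance(feature_id, str) or feature_id in seen:
            raise RepairRequired(f"{path.name} has a missing or duplicate id")
        seen.add(feature_id)
    return features


def pending_feature_ids(path: Path) -> List[str]:
    """Compute pending work only from a fully valid artifact."""
    return [f["id"] for f in load_features_fail_closed(path) if f["status"] == "pending"]
```

The scheduler would catch `RepairRequired` and enter repair mode, so a truncated file yields no pending work rather than a wider set.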

### 5. Distinguish implementation failures from finalization/synthesis failures

A late failure while writing synthesis/finalization artifacts should not be treated like unfinished product implementation.

Please distinguish:
- implementation/product step failures
- finalization/bookkeeping/synthesis failures

Finalization failures should trigger repair-in-place, not replay of completed product work.
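
As a sketch of the distinction, the Python below tags each step with a kind and derives the recovery scope from it. The enum names, the `Step` type, and the example step ids are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class StepKind(Enum):
    IMPLEMENTATION = "implementation"   # product code changes
    FINALIZATION = "finalization"       # bookkeeping, synthesis, summaries


@dataclass(frozen=True)
class Step:
    step_id: str
    kind: StepKind


def steps_to_rerun(failed: Step, completed: List[Step]) -> List[str]:
    """Return the step ids that recovery is allowed to touch.

    Completed steps are never added to the result: for both kinds only
    the failed step itself is retried. The kind decides how it is retried
    (full worker run vs. rewriting an artifact in place), which a real
    harness would dispatch on.
    """
    completed_ids = {step.step_id for step in completed}
    if failed.step_id in completed_ids:
        raise ValueError("a step cannot be both failed and completed")
    return [failed.step_id]


def recovery_mode(failed: Step) -> str:
    """Map the failure kind to a recovery mode."""
    if failed.kind is StepKind.FINALIZATION:
        return "repair-in-place"
    return "retry-step"
```

For instance, a failed `Step("write-synthesis", StepKind.FINALIZATION)` maps to `"repair-in-place"` and leaves every completed implementation step out of the rerun set.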

### 6. Cross-check resume state against stronger evidence

Before resuming from mission artifacts, cross-check:
- durable execution history
- worker handoffs
- validator handoffs
- `validation-state.json`
- current `features.json`

If they disagree:
- stop
- enter repair mode
- do not schedule blindly from the mutable summary file
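
The cross-check could look roughly like this Python sketch. It assumes each evidence source has already been reduced to a `{feature_id: status}` mapping, and that a source with no entry for a feature expresses no opinion about it; both are assumptions of the example.

```python
from typing import Dict, List, Tuple

# Source name -> {feature_id: status}, e.g. "features.json" -> {...}
Sources = Dict[str, Dict[str, str]]


def find_disagreements(sources: Sources) -> List[str]:
    """List every feature whose status differs between sources."""
    problems: List[str] = []
    feature_ids = sorted({fid for view in sources.values() for fid in view})
    for fid in feature_ids:
        seen = {name: view[fid] for name, view in sources.items() if fid in view}
        if len(set(seen.values())) > 1:
            detail = ", ".join(f"{name}={status}" for name, status in sorted(seen.items()))
            problems.append(f"{fid}: {detail}")
    return problems


def resume_decision(sources: Sources) -> Tuple[str, List[str]]:
    """Return ("resume", []) only when all sources agree, else ("repair", problems)."""
    problems = find_disagreements(sources)
    if problems:
        return ("repair", problems)
    return ("resume", [])
```

Any disagreement stops scheduling and surfaces the conflicting entries, so the operator repairs a specific artifact instead of the orchestrator resuming from the summary file alone.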

## Minimum acceptance criteria

A single malformed mission artifact should **not** cause replay of completed work.

Specifically:
- completed runtime work remains recoverable even if summary files are damaged
- artifact corruption blocks scheduling instead of widening it
- finalization/synthesis failures do not reopen completed product work
- resume can recover from handoffs / durable state rather than only from `features.json`

## Additional note

We also observed that operators currently need to be extremely explicit with the orchestrator to avoid broad resume behavior after artifact issues.

That is workable as a temporary mitigation, but the core issue appears to be harness/state integrity rather than prompt quality.

## If helpful

I can provide a more structured incident report with:
- timeline
- verified facts
- root cause
- contributing factors
- corrective actions


Mission artifact corruption can trigger replay/resume churn because scheduling trusts mutable `features.json` #974
