Skip to content

Commit 7bab056

Browse files
doquanghuyclaude
andauthored
feat(workflows): add continue_on_error step field for non-halting failures (#2663)
* feat(workflows): add continue_on_error step field Adds an optional `continue_on_error: bool` field on every step. When set to `true` and the step fails, the engine records the result (`exit_code`, `stderr` on `steps.<id>.output` plus `status` as a sibling key on `steps.<id>`) and continues to the next sibling step instead of halting the run. Downstream `if`, `switch`, or `gate` steps can then branch on `{{ steps.<id>.output.exit_code }}` to route the recovery path. Engine details -------------- `WorkflowEngine._execute_steps` now consults the step config when a step returns `StepStatus.FAILED`: - Gate aborts (`output.aborted`) always halt the run — operator decisions take precedence over the flag. - Otherwise, if `continue_on_error` is the literal `True`, log a `step_continue_on_error` event and proceed to the next sibling. The runtime check uses identity comparison (`is True`) rather than truthiness, so truthy non-bool values like the string `"true"` cannot silently change run semantics even if a caller bypasses `validate_workflow()`. - Otherwise, behave as before: log `step_failed`, set `RunStatus.FAILED`, and return. Validation ---------- `_validate_steps` rejects non-bool values for `continue_on_error`. Coerced strings like `"true"` are not accepted so authoring mistakes surface at validation time rather than silently changing run semantics. Tests ----- `TestContinueOnError` in `tests/test_workflows.py` (8 tests): - `test_undeclared_failure_halts_run` — default halt behaviour. - `test_declared_and_fired_continues_run` — flag + fail → continue. - `test_declared_but_step_succeeded_is_noop` — flag + success → no-op. - `test_if_branch_routes_around_failure` — end-to-end recovery. - `test_gate_abort_still_halts_with_continue_on_error` — abort always halts. - `test_validation_rejects_non_bool_continue_on_error` — `"true"` rejected at validation. - `test_validation_accepts_bool_continue_on_error` — `true`/`false` pass cleanly. - `test_engine_ignores_truthy_non_bool_continue_on_error` — defense-in-depth: engine ignores string `"true"` even when validation is bypassed. Rebased onto current upstream/main (post #2664 merge); the new `TestContinueOnError` class sits immediately after upstream's `TestContextRunId` so the two feature suites coexist cleanly. Closes #2591. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(workflows): restore runtime context section, clarify gate prompt Two Copilot findings on d0b9e00: 1. The `### Runtime Context` documentation for `{{ context.* }}` was lost during the rebase onto current main (the squash dropped the anchor where #2664 had added it). Restored under `## Expressions` so users can find `context.run_id` semantics and examples. 2. The continue_on_error example gate had message "Retry or skip?" but used the default `options: [approve, reject]` with `on_reject: skip`, which implied an automatic retry path that gates do not provide. Reworded the message to match the actual approve/reject semantics and added an explicit note that retry requires either custom gate options + downstream branching or a wrapper loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(workflows): clarify continue_on_error scope — returned FAILED only Copilot finding on d0b9e00: The README's "Error Handling" intro implied `continue_on_error` covers "any other runtime error raised during step execution", but the engine only consults the flag when a step returns `StepResult(status=FAILED, ...)`. Exceptions raised out of `step_impl.execute()` propagate to `WorkflowEngine.execute()`, where the catch-all logs `workflow_failed` and re-raises — the step result is never recorded, and the flag is never consulted. Audited the whole PR diff for the same overclaim: 1. workflows/README.md — main fix. Reworded the Error Handling intro to "any step that returns StepResult(status=FAILED, ...)" and promoted the parenthetical structural-validation note into the Notes block. Added a new "Scope: returned failures only" note that names the exception path explicitly and tells step authors how to bring the flag into scope for exceptional code (catch internally and return FAILED with the failure encoded in `output`). 2. tests/test_workflows.py — section comment used "when an executable step fails", same ambiguity. Tightened to "when a step returns StepResult(status=FAILED, ...)" and added a sentence calling out that unhandled exceptions are out of scope. 3. src/specify_cli/workflows/engine.py — already correct ("any step that returns FAILED" in the validator comment; "lets the pipeline route around the failure" in the execute path). No change. Engine semantics and test bodies are unchanged. Docs-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(workflows): clarify on_reject:skip semantics — engine returns COMPLETED, not auto-skip Copilot finding on b8982a7: The README example's gate message said "reject to skip the rest of this branch", and the explanatory paragraph claimed [approve, reject] map to "continue" vs "skip the rest of this branch". The engine does not implement automatic branch-skipping. `on_reject: skip` returns `StepStatus.COMPLETED` (gate/__init__.py:65-66); the next sibling step runs unconditionally unless the author wires a downstream `if` reading `{{ steps.<gate-id>.output.choice }}`. Two fixes: 1. Restructured the YAML example so it actually demonstrates the manual-branching pattern: added a `recover` if-step after the gate that conditions on `steps.review.output.choice == 'approve'`. Now the example shows the real workflow author's responsibility instead of implying the engine does it. 2. Replaced the trailing paragraph with three precise notes: - both gate options return COMPLETED; `on_reject: skip` controls abort behaviour only, not sibling-skipping - all three `on_reject` values enumerated with their actual engine semantics (FAILED+aborted / COMPLETED / PAUSED) - the original retry-loop guidance retained as the third bullet Updated the gate message in the example to match — "reject to leave the failure recorded and move on" instead of "reject to skip the rest of this branch". Audited the whole PR diff for the same overclaim: no other instance. Engine semantics, validation, and test bodies are unchanged. Docs-only. 161/161 tests/test_workflows.py pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(workflows): clarify gate's role — surfaces, doesn't programmatically branch Audit follow-up to 393ac6b — three sites repeated the same minor overclaim about gates being one of the "branch on it" step types alongside `if` and `switch`: 1. workflows/README.md (the "downstream `if`, `switch`, or `gate` steps can branch on it" sentence introducing the example) 2. engine.py:236 (validator inline comment) 3. engine.py:657 (execute-path inline comment) A `gate` step does not have a `condition` or `expression` field — it only evaluates expressions for `message` and `show_file` (gate/__init__.py:29,36). Programmatic branching happens in `if`/`switch`; a gate surfaces the value to a human operator via message interpolation, and the operator's choice is recorded in `output.choice` for a *subsequent* `if`/`switch` to route on. Reworded all three sites consistently: "a downstream `if` or `switch` can branch on it (or a `gate` can surface it to the operator via message interpolation)". The README example already demonstrates this distinction — the gate carries `{{ }}` template variables in its message and the `recover` if-step downstream is what actually branches on the choice. Engine semantics, validation, and test bodies are unchanged. Docs-only on the README; comment-only on engine.py. 161/161 tests/test_workflows.py pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(workflows): use qualified StepStatus.* instead of bare FAILED/COMPLETED/PAUSED Three Copilot inline comments on workflows/README.md lines 226, 282, 288 flagged that ``StepResult(status=FAILED, ...)`` is not valid Python — ``StepResult.status`` is a ``StepStatus`` enum value, so the documented form should be ``StepStatus.FAILED``. Audited the whole PR diff for the same shorthand. The bare unqualified form appears in three files added/modified by this PR: 1. workflows/README.md (6 sites) — three ``StepResult(status=FAILED, ...)`` parentheticals, plus the on_reject Notes bullet listing the three step statuses (``FAILED``, ``COMPLETED``, ``PAUSED``). 2. tests/test_workflows.py (4 sites) — section header for TestContinueOnError, two test-method docstrings, one inline comment about a gate's TTY-fallback behaviour. 3. src/specify_cli/workflows/engine.py (1 site) — the validator inline comment added in d0b9e00 said "returns FAILED" where the engine code itself uses ``StepStatus.FAILED``. All 11 sites normalised to the qualified ``StepStatus.<name>`` form so the docs / test docstrings / inline comments match what readers will actually find in the engine code and the tests. Engine semantics, validation, and test bodies are unchanged. 161/161 tests/test_workflows.py pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7c558ab commit 7bab056

3 files changed

Lines changed: 434 additions & 4 deletions

File tree

src/specify_cli/workflows/engine.py

Lines changed: 57 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -232,6 +232,22 @@ def _validate_steps(
232232
step_errors = step_impl.validate(step_config)
233233
errors.extend(step_errors)
234234

235+
# Validate optional `continue_on_error` field. The engine honours
236+
# this on any step that returns StepStatus.FAILED so the pipeline can route
237+
# around the failure via a downstream `if` or `switch` (or a
238+
# `gate` that surfaces the failure to the operator via message
239+
# interpolation). The field must be a literal boolean —
240+
# coercion from truthy strings is deliberately not supported so
241+
# authoring mistakes surface at validation time rather than
242+
# silently changing run semantics.
243+
if "continue_on_error" in step_config:
244+
coe = step_config["continue_on_error"]
245+
if not isinstance(coe, bool):
246+
errors.append(
247+
f"Step {step_id!r}: 'continue_on_error' must be a "
248+
f"boolean, got {type(coe).__name__}."
249+
)
250+
235251
# Recursively validate nested steps
236252
for nested_key in ("then", "else", "steps"):
237253
nested = step_config.get(nested_key)
@@ -629,7 +645,10 @@ def _execute_steps(
629645

630646
# Handle failures
631647
if result.status == StepStatus.FAILED:
632-
# Gate abort (output.aborted) maps to ABORTED status
648+
# Gate abort (output.aborted) maps to ABORTED status.
649+
# Aborts are deliberate operator decisions, so
650+
# `continue_on_error` does NOT override them — that flag
651+
# is for transient/expected step failures only.
633652
if result.output.get("aborted"):
634653
state.status = RunStatus.ABORTED
635654
state.append_log(
@@ -638,15 +657,49 @@ def _execute_steps(
638657
"step_id": step_id,
639658
}
640659
)
641-
else:
642-
state.status = RunStatus.FAILED
660+
state.save()
661+
return
662+
663+
# `continue_on_error: true` lets the pipeline route
664+
# around the failure instead of halting. The step
665+
# result (including exit_code, stderr, status) is
666+
# still recorded so a downstream `if` or `switch`
667+
# can branch on it (or a `gate` can surface it to the
668+
# operator via message interpolation). Log a single,
669+
# unambiguous event per failure resolution — either
670+
# the run continued past it, or it halted.
671+
#
672+
# Use identity comparison (`is True`) rather than
673+
# truthiness so that only a literal boolean enables
674+
# the behaviour, even if validation was skipped.
675+
# Validation rejects non-bool values at parse time,
676+
# but `WorkflowEngine.execute()` does not auto-validate
677+
# (see `WorkflowEngine.load_workflow`, whose docstring
678+
# explicitly notes "not yet validated; call
679+
# `validate_workflow()` or `engine.validate()`
680+
# separately"), so a caller passing an unvalidated
681+
# definition could otherwise see truthy non-bool
682+
# values like the string `"true"` silently change
683+
# run semantics.
684+
if step_config.get("continue_on_error") is True:
643685
state.append_log(
644686
{
645-
"event": "step_failed",
687+
"event": "step_continue_on_error",
646688
"step_id": step_id,
647689
"error": result.error,
648690
}
649691
)
692+
state.save()
693+
continue
694+
695+
state.status = RunStatus.FAILED
696+
state.append_log(
697+
{
698+
"event": "step_failed",
699+
"step_id": step_id,
700+
"error": result.error,
701+
}
702+
)
650703
state.save()
651704
return
652705

tests/test_workflows.py

Lines changed: 300 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2379,6 +2379,306 @@ def test_run_id_arg_takes_precedence_over_env_override(self, project_dir, monkey
23792379
assert state.step_results["stamp"]["output"]["stdout"].strip() == "explicit-456"
23802380

23812381

2382+
# ===== continue_on_error Tests =====
2383+
#
2384+
# Locks the contract documented in workflows/README.md "Error Handling"
2385+
# section: when a step returns `StepResult(status=StepStatus.FAILED, ...)` and
2386+
# `continue_on_error: true` is declared, the engine records the step's
2387+
# `output` (with `exit_code` and `stderr` from the failure) and its
2388+
# `status` (sibling key on `steps.<id>`, not nested under `output`)
2389+
# and continues to the next sibling step instead of halting the run.
2390+
# Gate aborts (`output.aborted`) still halt regardless of the flag.
2391+
# Unhandled exceptions raised out of `step_impl.execute()` are out of
2392+
# scope for this flag — they propagate to `WorkflowEngine.execute()`
2393+
# and abort the run.
2394+
2395+
2396+
class TestContinueOnError:
2397+
"""Test the `continue_on_error` step-level field."""
2398+
2399+
def test_undeclared_failure_halts_run(self, project_dir):
2400+
"""Default behaviour (no `continue_on_error`): a failing step
2401+
halts the workflow run with `status == StepStatus.FAILED`.
2402+
2403+
Locks the byte-equivalent default — workflows that do not
2404+
declare the flag must behave exactly as before this feature.
2405+
"""
2406+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2407+
from specify_cli.workflows.base import RunStatus
2408+
2409+
definition = WorkflowDefinition.from_string("""
2410+
schema_version: "1.0"
2411+
workflow:
2412+
id: "halt-on-fail"
2413+
name: "Halt On Fail"
2414+
version: "1.0.0"
2415+
steps:
2416+
- id: fail-step
2417+
type: shell
2418+
run: "exit 7"
2419+
- id: after
2420+
type: shell
2421+
run: "echo should-not-run"
2422+
""")
2423+
engine = WorkflowEngine(project_dir)
2424+
state = engine.execute(definition)
2425+
2426+
assert state.status == RunStatus.FAILED
2427+
assert "fail-step" in state.step_results
2428+
assert state.step_results["fail-step"]["output"]["exit_code"] == 7
2429+
# Subsequent step never executes when the flag is absent.
2430+
assert "after" not in state.step_results
2431+
2432+
def test_declared_and_fired_continues_run(self, project_dir):
2433+
"""`continue_on_error: true` + failing step: the run keeps
2434+
going, the failed step's result is recorded, and the
2435+
downstream step runs.
2436+
"""
2437+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2438+
from specify_cli.workflows.base import RunStatus
2439+
2440+
definition = WorkflowDefinition.from_string("""
2441+
schema_version: "1.0"
2442+
workflow:
2443+
id: "continue-past-fail"
2444+
name: "Continue Past Fail"
2445+
version: "1.0.0"
2446+
steps:
2447+
- id: flaky-step
2448+
type: shell
2449+
run: "exit 42"
2450+
continue_on_error: true
2451+
- id: after
2452+
type: shell
2453+
run: "echo did-run"
2454+
""")
2455+
engine = WorkflowEngine(project_dir)
2456+
state = engine.execute(definition)
2457+
2458+
assert state.status == RunStatus.COMPLETED
2459+
# Failed step's exit_code is preserved so downstream branching
2460+
# can inspect it.
2461+
assert state.step_results["flaky-step"]["output"]["exit_code"] == 42
2462+
assert state.step_results["flaky-step"]["status"] == "failed"
2463+
# Downstream step ran successfully.
2464+
assert state.step_results["after"]["output"]["exit_code"] == 0
2465+
2466+
def test_declared_but_step_succeeded_is_noop(self, project_dir):
2467+
"""`continue_on_error: true` on a step that succeeds is a
2468+
no-op — the flag only changes behaviour on StepStatus.FAILED status.
2469+
"""
2470+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2471+
from specify_cli.workflows.base import RunStatus
2472+
2473+
definition = WorkflowDefinition.from_string("""
2474+
schema_version: "1.0"
2475+
workflow:
2476+
id: "flag-but-success"
2477+
name: "Flag But Success"
2478+
version: "1.0.0"
2479+
steps:
2480+
- id: ok-step
2481+
type: shell
2482+
run: "echo ok"
2483+
continue_on_error: true
2484+
- id: after
2485+
type: shell
2486+
run: "echo done"
2487+
""")
2488+
engine = WorkflowEngine(project_dir)
2489+
state = engine.execute(definition)
2490+
2491+
assert state.status == RunStatus.COMPLETED
2492+
assert state.step_results["ok-step"]["status"] == "completed"
2493+
assert state.step_results["ok-step"]["output"]["exit_code"] == 0
2494+
assert state.step_results["after"]["output"]["exit_code"] == 0
2495+
2496+
def test_if_branch_routes_around_failure(self, project_dir):
2497+
"""End-to-end: `continue_on_error` + `if` cleanly routes around
2498+
a failure. The recovery branch runs; the success branch does
2499+
not.
2500+
2501+
Mirrors the canonical usage pattern from the original feature
2502+
discussion in issue #2591.
2503+
"""
2504+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2505+
from specify_cli.workflows.base import RunStatus
2506+
2507+
definition = WorkflowDefinition.from_string("""
2508+
schema_version: "1.0"
2509+
workflow:
2510+
id: "route-around"
2511+
name: "Route Around Failure"
2512+
version: "1.0.0"
2513+
steps:
2514+
- id: heavy-thing
2515+
type: shell
2516+
run: "exit 1"
2517+
continue_on_error: true
2518+
- id: check-result
2519+
type: if
2520+
condition: "{{ steps.heavy-thing.output.exit_code != 0 }}"
2521+
then:
2522+
- id: recovery
2523+
type: shell
2524+
run: "echo recovery-ran"
2525+
else:
2526+
- id: happy-path
2527+
type: shell
2528+
run: "echo happy-path-ran"
2529+
""")
2530+
engine = WorkflowEngine(project_dir)
2531+
state = engine.execute(definition)
2532+
2533+
assert state.status == RunStatus.COMPLETED
2534+
assert "recovery" in state.step_results
2535+
assert "happy-path" not in state.step_results
2536+
2537+
def test_gate_abort_still_halts_with_continue_on_error(
2538+
self, project_dir, monkeypatch
2539+
):
2540+
"""`continue_on_error` does NOT override a deliberate gate
2541+
abort. `output.aborted` always halts the run with
2542+
`status == ABORTED`.
2543+
2544+
Aborts are explicit operator decisions; continue_on_error
2545+
is for transient/expected step failures only.
2546+
"""
2547+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2548+
from specify_cli.workflows.base import RunStatus
2549+
from specify_cli.workflows.steps.gate import GateStep
2550+
from specify_cli.workflows.steps import gate as gate_module
2551+
2552+
# Force the gate step into interactive mode and feed a "reject"
2553+
# choice so the abort path actually runs in the test env
2554+
# (default behaviour returns StepStatus.PAUSED when stdin is not a TTY).
2555+
# Swap sys.stdin itself for a stub: setattr on the real
2556+
# TextIOWrapper's `isatty` method is not assignable under some
2557+
# runners (e.g. pytest with capture disabled).
2558+
class _TTYStdin:
2559+
def isatty(self) -> bool:
2560+
return True
2561+
2562+
monkeypatch.setattr(gate_module.sys, "stdin", _TTYStdin())
2563+
monkeypatch.setattr(
2564+
GateStep, "_prompt", staticmethod(lambda _msg, _opts: "reject")
2565+
)
2566+
2567+
definition = WorkflowDefinition.from_string("""
2568+
schema_version: "1.0"
2569+
workflow:
2570+
id: "gate-abort-halts"
2571+
name: "Gate Abort Halts"
2572+
version: "1.0.0"
2573+
steps:
2574+
- id: gate-step
2575+
type: gate
2576+
message: "Approve?"
2577+
options: [approve, reject]
2578+
on_reject: abort
2579+
continue_on_error: true
2580+
- id: should-not-run
2581+
type: shell
2582+
run: "echo nope"
2583+
""")
2584+
engine = WorkflowEngine(project_dir)
2585+
state = engine.execute(definition)
2586+
2587+
assert state.status == RunStatus.ABORTED
2588+
assert "should-not-run" not in state.step_results
2589+
2590+
def test_validation_rejects_non_bool_continue_on_error(self):
2591+
"""`continue_on_error` must be a literal boolean; coerced
2592+
strings like `"true"` are rejected at validation time so
2593+
authoring mistakes surface before execution.
2594+
"""
2595+
from specify_cli.workflows.engine import (
2596+
WorkflowDefinition,
2597+
validate_workflow,
2598+
)
2599+
2600+
definition = WorkflowDefinition.from_string("""
2601+
schema_version: "1.0"
2602+
workflow:
2603+
id: "bad-coe"
2604+
name: "Bad COE"
2605+
version: "1.0.0"
2606+
steps:
2607+
- id: step-one
2608+
type: shell
2609+
run: "true"
2610+
continue_on_error: "true"
2611+
""")
2612+
errors = validate_workflow(definition)
2613+
assert any(
2614+
"continue_on_error" in e and "boolean" in e for e in errors
2615+
), errors
2616+
2617+
def test_validation_accepts_bool_continue_on_error(self):
2618+
"""Boolean values pass validation cleanly."""
2619+
from specify_cli.workflows.engine import (
2620+
WorkflowDefinition,
2621+
validate_workflow,
2622+
)
2623+
2624+
for value in (True, False):
2625+
yaml_value = "true" if value else "false"
2626+
definition = WorkflowDefinition.from_string(f"""
2627+
schema_version: "1.0"
2628+
workflow:
2629+
id: "good-coe"
2630+
name: "Good COE"
2631+
version: "1.0.0"
2632+
steps:
2633+
- id: step-one
2634+
type: shell
2635+
run: "true"
2636+
continue_on_error: {yaml_value}
2637+
""")
2638+
errors = validate_workflow(definition)
2639+
assert errors == [], errors
2640+
2641+
def test_engine_ignores_truthy_non_bool_continue_on_error(self, project_dir):
2642+
"""Defense-in-depth: even if a caller bypasses
2643+
`validate_workflow()` and feeds the engine a definition with
2644+
`continue_on_error: "true"` (a string), the engine must NOT
2645+
honour the flag — only a literal boolean enables the
2646+
behaviour. `WorkflowEngine.execute()` does not auto-validate
2647+
(the `WorkflowEngine.load_workflow` docstring explicitly
2648+
notes the definition is "not yet validated; call
2649+
`validate_workflow()` or `engine.validate()` separately"),
2650+
so the engine guards against truthy non-bool values itself
2651+
via an identity check rather than truthiness.
2652+
"""
2653+
from specify_cli.workflows.engine import WorkflowDefinition, WorkflowEngine
2654+
from specify_cli.workflows.base import RunStatus
2655+
2656+
# Bypass `validate_workflow()` — execute() is what would
2657+
# be called by a caller that skipped validation.
2658+
definition = WorkflowDefinition.from_string("""
2659+
schema_version: "1.0"
2660+
workflow:
2661+
id: "string-coe"
2662+
name: "String COE"
2663+
version: "1.0.0"
2664+
steps:
2665+
- id: fail-step
2666+
type: shell
2667+
run: "exit 1"
2668+
continue_on_error: "true"
2669+
- id: should-not-run
2670+
type: shell
2671+
run: "echo should-not-run"
2672+
""")
2673+
engine = WorkflowEngine(project_dir)
2674+
state = engine.execute(definition)
2675+
2676+
# String "true" is truthy but not a literal boolean, so the
2677+
# engine must treat the step as a halting failure.
2678+
assert state.status == RunStatus.FAILED
2679+
assert "should-not-run" not in state.step_results
2680+
2681+
23822682
# ===== State Persistence Tests =====
23832683

23842684
class TestRunState:

0 commit comments

Comments
 (0)