docs(power): clarify replay safety and step semantics#5
Update the durable-functions power guidance to distinguish deterministic orchestration code from non-atomic durable operation bodies.
- generalize replay-safety guidance across steps, waits, and concurrent branches
- document logger replay caveats and note that `context.logger` can wrap an existing logger
- correct `StepSemantics` defaults and examples for TypeScript and Python
- document the at-most-once-per-retry fallback for non-idempotent steps with retries disabled
FYI, I also have another branch ready (based on these changes), which reduces the power size by ~50%.
@singledigit can you use this updated power to verify whether it would have caught the bugs in your scanner function: https://github.com/singledigit/durable-function-video-scanner/tree/main
@yaythomas can we prioritize this one to address some issues with the power?
| ## Rule 2: Durable Operation Bodies Are Not Guaranteed To Be Atomic | ||
| **Functions passed to durable context APIs must assume the operation is not guaranteed to be atomic with respect to external side effects, and may be re-attempted before the durable runtime has fully recorded the result.** |
- What about the at-most-once guarantee?
- What does "durable context APIs" mean? Methods on the `DurableContext`, or the durable handler?
- "Functions" means something specific in coding; strictly speaking, Java doesn't have functions.
- Style: avoid the passive voice.
Suggestion: Code in a durable operation must assume that it could re-run on replay, unless it is in a step with an AT MOST ONCE execution guarantee. This means that external side effects caused by such code could execute more than once.
| ### What This Means | ||
| - Non-deterministic computation inside a durable operation body is acceptable because the result can be checkpointed |
This is not quite why it's "acceptable".
Suggestion:
Once code inside a durable step completes, it saves to a checkpoint, and on subsequent replays the operation returns the saved result. In this way, the result of non-deterministic code becomes deterministic on replay: the non-deterministic code does not re-run, and the durable execution framework uses the checkpoint result instead.
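The checkpoint behavior described in this comment can be sketched with a toy model. This is not the real SDK: `CheckpointStore`, `runStep`, and `orchestrate` are hypothetical names, and the sketch is synchronous for brevity, whereas real durable operations are asynchronous.

```typescript
// Toy model of checkpoint-and-replay; NOT the real durable functions SDK.
type CheckpointStore = Map<string, unknown>;

// Hypothetical helper: run `body` once, save its result, and return the
// saved result on every later replay without calling `body` again.
function runStep<T>(store: CheckpointStore, name: string, body: () => T): T {
  if (store.has(name)) {
    return store.get(name) as T; // replay: reuse the checkpointed result
  }
  const result = body(); // first execution: the body actually runs
  store.set(name, result);
  return result;
}

// Orchestration code that is re-run from the top on every replay.
function orchestrate(store: CheckpointStore): number {
  return runStep(store, "pick-number", () => Math.random());
}

const store: CheckpointStore = new Map();
const firstRun = orchestrate(store);
const replayed = orchestrate(store); // same value as firstRun
```

`Math.random()` is non-deterministic, yet `replayed` equals `firstRun` because the second pass never calls the body; it reads the saved result.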
| ### What This Means | ||
| - Non-deterministic computation inside a durable operation body is acceptable because the result can be checkpointed | ||
| - External side effects started from that body should still be safe under re-attempt whenever possible |
Re-entrancy is, strictly speaking, the term for what this sentence describes.
But the general thrust of some of the added copy is idempotency: same result, no duplicate effects, running it N times has the same effect as running it once.
In general, yes, idempotency is a good design pattern to follow here.
However, part of the point of checkpointing is to provide that idempotency. So this sentence recommends, with "should", avoiding something durable functions provide as a key feature with AT MOST ONCE: deterministic checkpointing when wrapping non-idempotent code.
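For the case where the body itself is made idempotent, here is a minimal sketch of "stable identity". The key comes only from durable input, so a re-attempted body addresses the same external work. `PaymentService`, `chargeKey`, and the field names are hypothetical and belong to no SDK.

```typescript
// Illustrative only: all names here are hypothetical.
interface OrderInput {
  orderId: string;
  amountCents: number;
}

// Fake external system that deduplicates requests by idempotency key.
class PaymentService {
  private readonly seen = new Map<string, string>();
  chargeCount = 0;

  charge(idempotencyKey: string, amountCents: number): string {
    const existing = this.seen.get(idempotencyKey);
    if (existing !== undefined) {
      return existing; // duplicate request: no second charge
    }
    this.chargeCount += 1;
    const receipt = `receipt-${this.chargeCount}-${amountCents}`;
    this.seen.set(idempotencyKey, receipt);
    return receipt;
  }
}

// Stable key: derived only from the durable input, so every re-attempt of
// the same step produces the same key. A key built from Date.now() or a
// fresh UUID inside the body would differ on each attempt.
function chargeKey(input: OrderInput): string {
  return `charge:${input.orderId}`;
}

const payments = new PaymentService();
const order: OrderInput = { orderId: "order-123", amountCents: 4200 };

// Simulate the step body running twice (for example, after a failed checkpoint).
const firstReceipt = payments.charge(chargeKey(order), order.amountCents);
const secondReceipt = payments.charge(chargeKey(order), order.amountCents);
```

Both calls return the same receipt and `chargeCount` stays at 1, which is the "running it N times has the same effect as running it once" property.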
| - Non-deterministic computation inside a durable operation body is acceptable because the result can be checkpointed | ||
| - External side effects started from that body should still be safe under re-attempt whenever possible | ||
| - If the side effect needs an identifier for idempotency, derive it from durable inputs/state or generate it once from durable state and reuse it | ||
| - If a **step** cannot be made idempotent and duplicate execution is unacceptable, use `StepSemantics.AtMostOncePerRetry` (TypeScript) or `StepSemantics.AT_MOST_ONCE_PER_RETRY` (Python) with retries disabled so the behavior is effectively zero-or-once rather than more than once |
Arguably the step is idempotent once it checkpoints; the inside of the step isn't.
Also, now that we have Java, we should probably avoid listing each language (TypeScript vs Python) every time, and instead reference the general concept and point to a single source of truth.
| 1. **Write handler** with durable operations | ||
| 2. **Test locally** with `LocalDurableTestRunner` | ||
| 3. **Validate replay rules** (no non-deterministic code outside steps) | ||
| 3. **Validate replay rules** (determinism outside durable operations; stable identity and idempotent side effects inside durable operation bodies) |
What about determinism outside of durable operations? The first sentence is negative (i.e., don't do this) and the second is positive (do this), but this is not clear from the text.
Is there a replay section this can link to instead?
| 7. **Choose correct semantics** (AT_LEAST_ONCE vs AT_MOST_ONCE) | ||
| 7. **Choose correct semantics** (`AtLeastOncePerRetry` vs `AtMostOncePerRetry`) | ||
| 8. **Use stable identity for external work** - derive identifiers from durable inputs/state, not `Date.now()`, randomness, or fresh UUIDs created inside the step body | ||
| 9. **Use `AtMostOncePerRetry` with zero retries for non-idempotent steps** when duplicate execution is unacceptable and you can accept zero-or-once behavior |
This also repeats text. Presumably, for a power/agent's use, we can keep it DRY?
Use `AtMostOncePerRetry` with retries disabled for steps with non-idempotent external side effects: the step will execute at most once, and if it fails it will not be retried.
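The difference between the two semantics can be sketched with a toy model. It assumes that at-most-once records a "started" marker before running the body; the real runtime's checkpoint protocol may differ. `attemptStep`, `StepRecord`, and `sideEffectCount` are hypothetical names, not SDK APIs.

```typescript
// Toy model only; NOT the real durable functions SDK.
type Semantics = "at-least-once" | "at-most-once";

interface StepRecord {
  started: boolean;
  result?: string;
}

function attemptStep(
  log: Map<string, StepRecord>,
  name: string,
  semantics: Semantics,
  body: () => string,
  crashBeforeResultSaved: boolean,
): string {
  const record = log.get(name) ?? { started: false };
  if (record.result !== undefined) return record.result; // already checkpointed
  if (semantics === "at-most-once" && record.started) {
    // A previous attempt began but never recorded a result. Re-running could
    // duplicate the side effect, so fail instead (retries disabled).
    throw new Error(`${name} was started but never completed`);
  }
  log.set(name, { started: true }); // record the start before running the body
  const result = body();
  if (crashBeforeResultSaved) {
    throw new Error("simulated crash after the side effect, before the checkpoint");
  }
  log.set(name, { started: true, result });
  return result;
}

// Run one attempt that crashes before the checkpoint, then replay once.
function sideEffectCount(semantics: Semantics): number {
  const log = new Map<string, StepRecord>();
  let emailsSent = 0;
  const sendEmail = () => { emailsSent += 1; return "sent"; };
  try { attemptStep(log, "notify", semantics, sendEmail, true); } catch { /* crash */ }
  try { attemptStep(log, "notify", semantics, sendEmail, false); } catch { /* step fails */ }
  return emailsSent;
}

const atLeastOnce = sideEffectCount("at-least-once"); // 2: the body re-ran on replay
const atMostOnce = sideEffectCount("at-most-once"); // 1: the replay failed instead
```

In this model the "zero" in zero-or-once corresponds to a crash after the start marker is recorded but before the body runs: the replay still refuses to execute the body, so the side effect never happens.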
| Wait for external systems to respond (human approval, webhook, async job): | ||
| The submitter function passed to `waitForCallback(...)` and the check function passed to `waitForCondition(...)` are durable operation bodies. They are not guaranteed to be atomic with respect to external side effects, so if they start or address external work, use stable identity and idempotent behavior. See [replay-model-rules.md](replay-model-rules.md). |
Well, they are specifically a step. "Durable operation body" is a new concept introduced in this PR to capture any given durable operation that has a step inside it.
As this stands, it's unclear what it means to be a "durable operation body".
Similar concerns re the "stable identity & idempotent behaviour" I outlined above.
| 2. **Cannot nest durable operations** - use `runInChildContext` to group operations | ||
| 3. **Closure mutations are lost on replay** - return values from steps | ||
| 4. **Side effects outside steps repeat** - use `context.logger` (replay-aware) | ||
| 1. **All non-deterministic code outside durable operations MUST be moved into durable operations** (`context.step`, `waitForCallback`, `waitForCondition`, `parallel`/`map` branches) |
Passive voice; prefer active.
| 3. **Closure mutations are lost on replay** - return values from steps | ||
| 4. **Side effects outside steps repeat** - use `context.logger` (replay-aware) | ||
| 1. **All non-deterministic code outside durable operations MUST be moved into durable operations** (`context.step`, `waitForCallback`, `waitForCondition`, `parallel`/`map` branches) | ||
| 2. **Durable operation bodies are not guaranteed to be atomic** - prefer stable identity and idempotent behavior for external side effects; for non-idempotent steps, consider at-most-once-per-retry semantics with zero retries |
This repeats, yet again, content from before.
It also refers to "durable operation bodies" but then describes semantics that are only available on a step.
A step body may run successfully but fail to checkpoint, causing it to re-execute on replay. For steps where duplicate execution is unacceptable, use `AtMostOncePerRetry` with retries disabled.
| 3. **Closure mutations that won't persist**: Variables mutated inside steps are NOT preserved across replays — return values from steps instead | ||
| 4. **Side effects outside steps that repeat on replay**: Use `context.logger` for logging (it is replay-aware and deduplicates automatically) | ||
| 1. **Non-deterministic code outside durable operations**: `Date.now()`, `Math.random()`, UUID generation, API calls, database queries must all be inside durable operations | ||
| 2. **Non-atomic durable operation bodies**: Functions passed to `context.step()`, `waitForCallback()`, `waitForCondition()`, and `parallel()`/`map()` branches may be re-attempted before persistence is fully committed — prefer stable identity and idempotent external effects; for non-idempotent steps, use at-most-once-per-retry semantics with zero retries when duplicate execution is unacceptable |
This idea is getting repeated a lot. Can we consolidate?
If keeping this, maybe something like:
Durable operation bodies are not atomic: Code passed to `context.step()`, `waitForCallback()`, `waitForCondition()`, and `parallel()`/`map()` branches can succeed but fail to checkpoint, so the runtime may re-execute them on replay. Prefer idempotent external side effects with stable identity. For steps where duplicate execution is unacceptable, use `AtMostOncePerRetry` with retries disabled.
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.