
docs(power): clarify replay safety and step semantics#5

Open
embano1 wants to merge 2 commits into aws:main from embano1:codex/fix-power-replay-semantics-docs

@embano1 embano1 commented Mar 8, 2026

Update the durable-functions power guidance to distinguish deterministic orchestration code from non-atomic durable operation bodies.

  • generalize replay-safety guidance across steps, waits, and concurrent branches
  • document logger replay caveats and note that context.logger can wrap an existing logger
  • correct StepSemantics defaults and examples for TypeScript and Python
  • document the at-most-once-per-retry fallback for non-idempotent steps with retries disabled

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@embano1 embano1 requested review from bfreiberg and yaythomas March 8, 2026 09:20

embano1 commented Mar 8, 2026

FYI, I also have another branch ready (based on these changes), which reduces the power size by ~50%.


embano1 commented Mar 8, 2026

@singledigit can you use this updated power to verify if it would have caught the bugs in your scanner function: https://github.com/singledigit/durable-function-video-scanner/tree/main

Scanner fix: generate stable Transcribe/Rekognition submission identifiers once from durable state, not from wall-clock time inside the callback submitter. Default: derive them from scanId plus a deterministic suffix, or generate them in an earlier durable step and reuse them in the submitter.
Scanner fix: replace handler-level logger.* calls with context.logger (and childContext.logger / submitter ctx.logger where available) for all logging that can occur during replay.
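
A minimal sketch of the first scanner fix, assuming the job name only needs to be a pure function of `scanId`. The helper `stableJobName`, the `JobKind` type, and the allowed-character set are hypothetical and not taken from the scanner repository:

```typescript
// Hypothetical helper: build an external job name from durable input only.
type JobKind = "transcribe" | "rekognition";

function stableJobName(scanId: string, kind: JobKind): string {
  // Replace characters outside a conservative safe set. The set is an
  // assumption; check the target service's naming rules before relying on it.
  const safeId = scanId.replace(/[^0-9a-zA-Z._-]/g, "-");
  // No wall-clock time or randomness: the same input yields the same name.
  return `${safeId}-${kind}`;
}

const firstAttemptName = stableJobName("scan/123", "transcribe");
const reattemptName = stableJobName("scan/123", "transcribe");
console.log(firstAttemptName, firstAttemptName === reattemptName);
// prints "scan-123-transcribe true"
```

Because the name depends only on `scanId`, a re-attempted submitter addresses the same external job instead of starting a second one.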


embano1 commented Mar 28, 2026

@yaythomas can we prioritize this one to address some issues with the power?


## Rule 2: Durable Operation Bodies Are Not Guaranteed To Be Atomic

**Functions passed to durable context APIs must assume the operation is not guaranteed to be atomic with respect to external side effects, and may be re-attempted before the durable runtime has fully recorded the result.**
Contributor commented:

  1. What about the at-most-once guarantee?

  2. What does "durable context APIs" mean? Methods on the DurableContext, or the durable handler?

  3. "Functions" means something specific in code; strictly speaking, Java doesn't have functions.

  4. Style: avoid the passive voice.

Suggestion: Code in a durable operation must assume that it could re-run on replay, unless it is in a step with an AT MOST ONCE execution guarantee. This means that external side effects caused by such code could execute more than once.


### What This Means

- Non-deterministic computation inside a durable operation body is acceptable because the result can be checkpointed
Contributor commented:

This is not quite why it's "acceptable".

Suggestion:
Once code inside a durable step completes, the runtime saves its result to a checkpoint, and on subsequent replays the operation returns the saved result. The result of non-deterministic code therefore becomes deterministic on replay: the non-deterministic code does not re-run, and the durable execution framework uses the checkpointed result instead.
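
The checkpoint-and-replay behavior described in this comment can be sketched with a toy model. `ToyContext` below is a hypothetical in-memory stand-in, not the durable execution SDK; it only illustrates why a step's result is stable on replay:

```typescript
// Toy model of checkpoint-and-replay. NOT the real SDK; names are hypothetical.
class ToyContext {
  // Checkpoints outlive a single invocation; keyed by step name.
  private readonly checkpoints: Map<string, unknown>;

  constructor(checkpoints: Map<string, unknown>) {
    this.checkpoints = checkpoints;
  }

  step<T>(name: string, body: () => T): T {
    if (this.checkpoints.has(name)) {
      // Replay: return the saved result without running the body again.
      return this.checkpoints.get(name) as T;
    }
    const result = body();
    this.checkpoints.set(name, result);
    return result;
  }
}

// Non-deterministic code is wrapped in a step.
function handler(ctx: ToyContext): number {
  return ctx.step("pick-number", () => Math.random());
}

const store = new Map<string, unknown>();
const firstRun = handler(new ToyContext(store));
const replayRun = handler(new ToyContext(store)); // same store, so this is a replay
console.log(firstRun === replayRun); // prints "true"
```

The second invocation never calls `Math.random()`; it reads the checkpointed value, which is what makes the handler deterministic from the orchestration's point of view.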

- External side effects started from that body should still be safe under re-attempt whenever possible
Contributor commented:

Strictly speaking, re-entrancy is the term for what this sentence describes.

But the general thrust of some of the added copy is idempotency: same result, no duplicate effects, running it N times has the same effect as running it once.

In general, yes, idempotency is a good design pattern to follow here.

However, part of the point of checkpointing is to provide that idempotency. With its "should", this sentence recommends against relying on a key feature that durable functions provide with AT MOST ONCE: deterministic checkpointing around non-idempotent code.

- If the side effect needs an identifier for idempotency, derive it from durable inputs/state or generate it once from durable state and reuse it
- If a **step** cannot be made idempotent and duplicate execution is unacceptable, use `StepSemantics.AtMostOncePerRetry` (TypeScript) or `StepSemantics.AT_MOST_ONCE_PER_RETRY` (Python) with retries disabled so the behavior is effectively zero-or-once rather than more than once
Contributor commented:

Arguably the step is idempotent once it checkpoints.

The inside of the step isn't.

Also, now that we have Java, we should probably avoid listing each language (TypeScript vs Python) every time, and instead reference the general concept and point to a single source of truth.
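
To make the difference between the two step semantics concrete, here is a toy simulation. It assumes that at-most-once semantics record a "started" marker before the body runs and that at-least-once semantics record only the result. `ToyRuntime`, its `failCheckpoint` flag, and the error messages are hypothetical and do not reflect the SDK's actual implementation:

```typescript
type Semantics = "AtLeastOncePerRetry" | "AtMostOncePerRetry";

interface StepRecord {
  started: boolean;
  done: boolean;
  result?: string;
}

// Toy runtime with retries disabled. NOT the real SDK.
class ToyRuntime {
  readonly records = new Map<string, StepRecord>();

  // failCheckpoint simulates a crash after the body ran but before the
  // result was persisted.
  step(name: string, semantics: Semantics, body: () => string, failCheckpoint = false): string {
    const record = this.records.get(name) ?? { started: false, done: false };
    this.records.set(name, record);
    if (record.done) return record.result as string; // replay of a completed step
    if (semantics === "AtMostOncePerRetry") {
      if (record.started) {
        // An earlier attempt started but never recorded a result:
        // fail instead of running the body a second time.
        throw new Error(`step ${name} was interrupted; not re-running`);
      }
      record.started = true; // persisted before the body runs
    }
    const result = body();
    if (failCheckpoint) throw new Error("crash before result checkpoint");
    record.done = true;
    record.result = result;
    return result;
  }
}

function simulate(semantics: Semantics): number {
  const runtime = new ToyRuntime();
  let charges = 0; // stands in for a non-idempotent external side effect
  const charge = () => { charges += 1; return "charged"; };
  try { runtime.step("charge", semantics, charge, true); } catch { /* first invocation crashes */ }
  try { runtime.step("charge", semantics, charge); } catch { /* replay may refuse to re-run */ }
  return charges;
}

console.log(simulate("AtLeastOncePerRetry")); // prints 2
console.log(simulate("AtMostOncePerRetry")); // prints 1
```

In this model the at-most-once variant trades a duplicate side effect for a failed step. A crash between the "started" marker and the body would leave the side effect at zero executions, which is the "zero-or-once" behavior the diff describes.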

1. **Write handler** with durable operations
2. **Test locally** with `LocalDurableTestRunner`
3. **Validate replay rules** (no non-deterministic code outside steps)
3. **Validate replay rules** (determinism outside durable operations; stable identity and idempotent side effects inside durable operation bodies)
Contributor commented:

What about determinism outside of durable operations? The first part is negative (i.e. don't do this) and the second is positive (do this), but this is not clear from the text.

Is there a replay section this can link to instead?

7. **Choose correct semantics** (AT_LEAST_ONCE vs AT_MOST_ONCE)
7. **Choose correct semantics** (`AtLeastOncePerRetry` vs `AtMostOncePerRetry`)
8. **Use stable identity for external work** - derive identifiers from durable inputs/state, not `Date.now()`, randomness, or fresh UUIDs created inside the step body
9. **Use `AtMostOncePerRetry` with zero retries for non-idempotent steps** when duplicate execution is unacceptable and you can accept zero-or-once behavior
Contributor commented:

This also repeats text. Presumably, for a power/agent's use, we can keep this DRY?

Suggestion: Use AtMostOncePerRetry with retries disabled for steps with non-idempotent external side effects. The step will execute at most once, and if it fails it will not be retried.


Wait for external systems to respond (human approval, webhook, async job):

The submitter function passed to `waitForCallback(...)` and the check function passed to `waitForCondition(...)` are durable operation bodies. They are not guaranteed to be atomic with respect to external side effects, so if they start or address external work, use stable identity and idempotent behavior. See [replay-model-rules.md](replay-model-rules.md).
Contributor commented:

Well, they are specifically a step. "Durable operation body" is a new concept introduced in this PR to capture any durable operation that has a step inside it.

As this stands, it's unclear what it means to be a "durable operation body".

Similar concerns re the "stable identity & idempotent behaviour" wording I outlined above.
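
Whatever term the docs settle on, the effect of a re-attempted submitter can be sketched as follows. `FakeJobService` is a hypothetical in-memory stand-in for an external system that deduplicates on a client-supplied job name, and the `callbackId` parameter is an assumption about what a submitter receives:

```typescript
// Hypothetical external service that deduplicates on job name.
class FakeJobService {
  readonly jobs = new Map<string, { callbackId: string }>();

  startJob(jobName: string, callbackId: string): "created" | "already-exists" {
    if (this.jobs.has(jobName)) return "already-exists";
    this.jobs.set(jobName, { callbackId });
    return "created";
  }
}

// Submitter with stable identity: the name comes from durable input only.
function submitStable(service: FakeJobService, scanId: string, callbackId: string): void {
  service.startJob(`${scanId}-transcribe`, callbackId);
}

// Anti-pattern: the suffix changes on every attempt. A counter stands in
// for Date.now() or a fresh UUID so the demo stays deterministic.
let attemptCounter = 0;
function submitUnstable(service: FakeJobService, scanId: string, callbackId: string): void {
  attemptCounter += 1;
  service.startJob(`${scanId}-transcribe-${attemptCounter}`, callbackId);
}

const stableService = new FakeJobService();
submitStable(stableService, "scan-123", "cb-1");
submitStable(stableService, "scan-123", "cb-1"); // simulated re-attempt

const unstableService = new FakeJobService();
submitUnstable(unstableService, "scan-123", "cb-1");
submitUnstable(unstableService, "scan-123", "cb-1"); // simulated re-attempt

console.log(stableService.jobs.size, unstableService.jobs.size); // prints "1 2"
```

With the stable name, the second attempt is a no-op on the service side; with the changing name, the re-attempt starts a duplicate job.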

2. **Cannot nest durable operations** - use `runInChildContext` to group operations
3. **Closure mutations are lost on replay** - return values from steps
4. **Side effects outside steps repeat** - use `context.logger` (replay-aware)
1. **All non-deterministic code outside durable operations MUST be moved into durable operations** (`context.step`, `waitForCallback`, `waitForCondition`, `parallel`/`map` branches)
Contributor commented:

passive voice, prefer active

2. **Durable operation bodies are not guaranteed to be atomic** - prefer stable identity and idempotent behavior for external side effects; for non-idempotent steps, consider at-most-once-per-retry semantics with zero retries
Contributor commented:

This repeats content from earlier yet again.

It also refers to "durable operation bodies" but then describes semantics that are only available on a step.

Suggestion: A step body may run successfully but fail to checkpoint, causing it to re-execute on replay. For steps where duplicate execution is unacceptable, use AtMostOncePerRetry with retries disabled.

3. **Closure mutations that won't persist**: Variables mutated inside steps are NOT preserved across replays — return values from steps instead
4. **Side effects outside steps that repeat on replay**: Use `context.logger` for logging (it is replay-aware and deduplicates automatically)
1. **Non-deterministic code outside durable operations**: `Date.now()`, `Math.random()`, UUID generation, API calls, database queries must all be inside durable operations
2. **Non-atomic durable operation bodies**: Functions passed to `context.step()`, `waitForCallback()`, `waitForCondition()`, and `parallel()`/`map()` branches may be re-attempted before persistence is fully committed — prefer stable identity and idempotent external effects; for non-idempotent steps, use at-most-once-per-retry semantics with zero retries when duplicate execution is unacceptable
Contributor commented:

This idea is getting repeated a lot. Can we consolidate?

If keeping this, maybe something like:

Durable operation bodies are not atomic: Code passed to context.step(),
waitForCallback(), waitForCondition(), and parallel()/map() branches can
succeed but fail to checkpoint, so the runtime may re-execute them on replay.
Prefer idempotent external side effects with stable identity. For steps where
duplicate execution is unacceptable, use AtMostOncePerRetry with retries
disabled.
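
The "closure mutations are lost on replay" rule quoted in the diff above can be illustrated with the same kind of toy model. `ReplayContext` is a hypothetical in-memory stand-in, not the SDK; the point is only that a replayed step skips its body, so anything the body did to outer variables does not happen again:

```typescript
// Toy replay model. NOT the real SDK; names are hypothetical.
class ReplayContext {
  private readonly checkpoints: Map<string, unknown>;

  constructor(checkpoints: Map<string, unknown>) {
    this.checkpoints = checkpoints;
  }

  step<T>(name: string, body: () => T): T {
    if (this.checkpoints.has(name)) {
      return this.checkpoints.get(name) as T; // body is skipped on replay
    }
    const result = body();
    this.checkpoints.set(name, result);
    return result;
  }
}

// Anti-pattern: the step mutates a variable captured from the handler.
function mutatingHandler(ctx: ReplayContext): number {
  let total = 0;
  ctx.step("add", () => { total += 5; });
  return total;
}

// Preferred: the step returns the value and the handler uses the result.
function returningHandler(ctx: ReplayContext): number {
  const total = ctx.step("add", () => 5);
  return total;
}

const mutatingStore = new Map<string, unknown>();
const mutatingFirst = mutatingHandler(new ReplayContext(mutatingStore));
const mutatingReplay = mutatingHandler(new ReplayContext(mutatingStore));

const returningStore = new Map<string, unknown>();
const returningFirst = returningHandler(new ReplayContext(returningStore));
const returningReplay = returningHandler(new ReplayContext(returningStore));

console.log(mutatingFirst, mutatingReplay); // prints "5 0"
console.log(returningFirst, returningReplay); // prints "5 5"
```

The mutating handler gives a different answer on replay because the increment lives in a body that no longer runs, while the returning handler reads the checkpointed value both times.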
