fix: restore all checkpoint state data during resume (#4519) #4526

Open
4yDX3906 wants to merge 1 commit into alibaba:main from 4yDX3906:fix/issue-4519

Conversation

4yDX3906 (Contributor) commented Apr 8, 2026

Summary

Fixes #4519.

Problem

When a node stores data in OverAllState and the graph then suspends on an interrupt, the next node cannot read that data after resume(). The cause is OverAllState.input(), which is called during resume to restore checkpoint state: it processes only keys that have a registered KeyStrategy and silently drops all others.

In contrast, during normal execution, updateState() uses a default REPLACE strategy for unregistered keys, so nodes can write to arbitrary keys successfully. This inconsistency causes checkpoint state data loss on resume.
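The mismatch can be illustrated with a toy model. This is a minimal sketch, not the actual OverAllState code: `ToyState`, `restoreFiltered`, and the key names are hypothetical, and it assumes strategies are held in a map from key to merge function, as the description above implies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for OverAllState; not the real Spring AI Alibaba class.
class ToyState {
    // Assumed default strategy: the new value replaces the old one.
    static final BinaryOperator<Object> REPLACE = (oldValue, newValue) -> newValue;

    final Map<String, BinaryOperator<Object>> strategies = new HashMap<>();
    final Map<String, Object> data = new HashMap<>();

    // Normal execution path: unregistered keys fall back to REPLACE.
    void updateState(Map<String, Object> partial) {
        partial.forEach((key, value) -> {
            BinaryOperator<Object> strategy = strategies.getOrDefault(key, REPLACE);
            data.put(key, strategy.apply(data.get(key), value));
        });
    }

    // Resume path before the fix: keys without a registered strategy are skipped.
    void restoreFiltered(Map<String, Object> checkpoint) {
        checkpoint.forEach((key, value) -> {
            BinaryOperator<Object> strategy = strategies.get(key);
            if (strategy != null) {
                data.put(key, strategy.apply(data.get(key), value));
            }
        });
    }

    public static void main(String[] args) {
        ToyState running = new ToyState();
        running.strategies.put("messages", REPLACE);
        // "nodeAResult" has no registered strategy but is still written.
        running.updateState(Map.of("messages", "hi", "nodeAResult", 42));

        ToyState resumed = new ToyState();
        resumed.strategies.put("messages", REPLACE);
        resumed.restoreFiltered(running.data);

        System.out.println(running.data.containsKey("nodeAResult")); // prints true
        System.out.println(resumed.data.containsKey("nodeAResult")); // prints false
    }
}
```

In this model the write succeeds during normal execution, while the same key disappears once the state is rebuilt through the filtered restore path.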

Solution

Modified OverAllState.input() to use the default REPLACE strategy for keys without a registered KeyStrategy, making it consistent with updateState(). This ensures all checkpoint state data is properly restored when resuming from suspension.
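The shape of the change can be sketched as a default lookup in place of a filter. The code below is a hypothetical model rather than the patched source: `PatchedToyState`, `restoreAll`, and the `KEEP_EXISTING` strategy are invented for illustration, and only the fallback-to-REPLACE behavior is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical model of the patched restore path; not the actual OverAllState source.
class PatchedToyState {
    // Assumed default strategy: the new value replaces the old one.
    static final BinaryOperator<Object> REPLACE = (oldValue, newValue) -> newValue;

    // Illustrative registered strategy: keep the value already present, if any.
    static final BinaryOperator<Object> KEEP_EXISTING =
            (oldValue, newValue) -> oldValue != null ? oldValue : newValue;

    final Map<String, BinaryOperator<Object>> strategies = new HashMap<>();
    final Map<String, Object> data = new HashMap<>();

    // Every checkpoint key is processed; unregistered keys use REPLACE.
    void restoreAll(Map<String, Object> checkpoint) {
        checkpoint.forEach((key, value) -> {
            BinaryOperator<Object> strategy = strategies.getOrDefault(key, REPLACE);
            data.put(key, strategy.apply(data.get(key), value));
        });
    }

    public static void main(String[] args) {
        PatchedToyState state = new PatchedToyState();
        state.strategies.put("config", KEEP_EXISTING);
        state.data.put("config", "local");

        state.restoreAll(Map.of("config", "fromCheckpoint", "nodeAResult", 42));

        System.out.println(state.data.get("config"));      // prints local
        System.out.println(state.data.get("nodeAResult")); // prints 42
    }
}
```

Registered keys keep their own strategy, so the fallback only changes behavior for keys that were previously dropped.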

Changes

  • OverAllState.java: Changed input() to process all keys (not just those with registered strategies), using KeyStrategy.REPLACE as the default fallback
  • InterruptionTest.java: Added resumeShouldPreserveCheckpointState test that reproduces the exact scenario from the issue

Testing

  • New test resumeShouldPreserveCheckpointState passes; it verifies that Node B can read Node A's state data after resume
  • All 10 existing InterruptionTest cases pass
  • All 58 related graph-core tests pass (StateGraphMemorySaverTest, StateGraphTest, TimeTravelTest, etc.)
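The scenario the new test targets can be pictured in miniature as follows. This sketch does not use the real graph-core API or reproduce InterruptionTest: the class, the node functions, and the key names are all hypothetical, and nodes are modeled as plain functions over a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical miniature of the resume scenario; it does not use the real graph-core API.
class ResumeScenarioSketch {
    // Node A writes a key that has no registered strategy.
    static final Function<Map<String, Object>, Map<String, Object>> NODE_A =
            state -> Map.of("nodeAResult", "computed-by-A");

    // Node B reads what Node A wrote and records it.
    static final Function<Map<String, Object>, Map<String, Object>> NODE_B =
            state -> Map.of("nodeBSaw", String.valueOf(state.get("nodeAResult")));

    // Restores every checkpoint key into a fresh state (assumed replace-by-default).
    static Map<String, Object> restore(Map<String, Object> checkpoint) {
        return new HashMap<>(checkpoint);
    }

    static Map<String, Object> runWithInterrupt() {
        Map<String, Object> state = new HashMap<>();
        state.putAll(NODE_A.apply(state));

        // Interrupt: persist an immutable copy of the state as the checkpoint.
        Map<String, Object> checkpoint = Map.copyOf(state);

        // Resume: rebuild the state from the checkpoint, then continue with Node B.
        Map<String, Object> resumed = restore(checkpoint);
        resumed.putAll(NODE_B.apply(resumed));
        return resumed;
    }

    public static void main(String[] args) {
        System.out.println(runWithInterrupt().get("nodeBSaw")); // prints computed-by-A
    }
}
```

If the restore step dropped `nodeAResult`, Node B would record the string `null` instead, which is the kind of failure the issue reports.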

🤖 Generated with [Qoder](https://qoder.com)

The OverAllState.input() method was filtering out keys that didn't have
a registered KeyStrategy, causing checkpoint state data to be lost when
resuming from suspension. This fix makes input() consistent with
updateState() by using the default REPLACE strategy for unregistered
keys, ensuring all checkpoint state data is properly restored on resume.

Closes alibaba#4519

🤖 Generated with [Qoder](https://qoder.com)
github-actions bot added the area/graph (SAA Graph module) label Apr 8, 2026

Labels

area/graph (SAA Graph module)

Development

Successfully merging this pull request may close these issues.

[BUG] When resume() restores execution, the checkpoint's state is not passed to the nodes

1 participant