fix(gnoland): recover validator changes after node restart by omarsy · Pull Request #5469 · gnolang/gno

omarsy · 2026-04-09T08:59:37Z

Summary

The EndBlocker relies on an in-memory event collector to trigger validator set queries via GetChanges(). On node restart, the collector is empty (events from the last block before shutdown are lost), so the EndBlocker never queries the realm — validator changes are permanently lost from the consensus layer.
Fix: use a firstBlock flag in the EndBlocker closure so the very first invocation after startup always queries the VM, bypassing the empty collector check. On genesis (height 0) it skips the query since there's nothing to recover.
Adds a txtar integration test that adds a validator via GovDAO proposal, restarts the node, and verifies the validator appears in the consensus set.

Details

The fix is contained in gno.land/pkg/gnoland/app.go (EndBlocker function):

firstBlock := true

return func(ctx sdk.Context, _ abci.RequestEndBlock) abci.ResponseEndBlock {
    // ...
    if firstBlock {
        firstBlock = false
        collector.getEvents() // drain any accumulated events
        if app.LastBlockHeight() == 0 {
            return abci.ResponseEndBlock{} // genesis — nothing to recover
        }
    } else if len(collector.getEvents()) == 0 {
        return abci.ResponseEndBlock{}
    }
    // ... VM query + apply validator changes ...
}

All validator logic stays in one place (the EndBlocker). The collector optimization is preserved for all blocks after the first.

Test plan

restart_validators.txtar — passes with fix, fails on master (confirmed)
restart_nonval.txtar / restart_missing_type.txtar — existing restart tests pass
TestEndBlocker — all existing subtests pass
CI passes

🤖 Generated with Claude Code

This PR was developed with AI assistance. The approach and code were reviewed by a human contributor.

Gno2D2 · 2026-04-09T09:00:17Z

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):

IGNORE the bot requirements for this PR (force green CI check)

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)
🟢 Pending initial approval by a review team member, or review from tech-staff

☑️ Contributor Actions:

Fix any issues flagged by automated checks.
Follow the Contributor Checklist to ensure your PR is ready for review.
- Add new tests, or document why they are unnecessary.
- Provide clear examples/screenshots, if necessary.
- Update documentation, if required.
- Ensure no breaking changes, or include BREAKING CHANGE notes.
- Link related issues/PRs, where applicable.

☑️ Reviewer Actions:

Complete manual checks for the PR, including the guidelines and additional checks if applicable.

📚 Resources:

Debug

Automated Checks

Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: omarsy/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Pending initial approval by a review team member, or review from tech-staff

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 Not (🔴 Pull request author is a member of the team: tech-staff)

Then

🟢 Requirement satisfied
└── 🟢 If
    ├── 🟢 Condition
    │   └── 🟢 Or
    │       ├── 🔴 At least one of these user(s) reviewed the pull request: [aronpark1007 davd-gzl jefft0 notJoon omarsy MikaelVallenet] (with state "APPROVED")
    │       ├── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request
    │       └── 🔴 This pull request is a draft
    └── 🟢 Then
        └── 🟢 Not (🔴 This label is applied to pull request: review/triage-pending)

Manual Checks

**IGNORE** the bot requirements for this PR (force green CI check)
If
🟢 Condition met
└── 🟢 On every pull request
Can be checked by

Any user with comment edit permission

codecov · 2026-04-09T09:01:51Z

Codecov Report

❌ Patch coverage is 53.33333% with 35 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
gno.land/pkg/integration/testscript_gnoland.go	48.52%	28 Missing and 7 partials ⚠️

📢 Thoughts on this report? Let us know!

The EndBlocker relies on an in-memory event collector to decide when to query GetChanges() from r/sys/validators/v2. On restart, the collector is empty — events from the last block before shutdown are lost — so the EndBlocker never queries the realm and validator changes are permanently lost from the consensus layer. Fix: after loading the latest version in NewAppWithOptions, seed the collector with a synthetic event when LastBlockHeight > 0 (i.e. not a fresh genesis). This causes the first EndBlocker after restart to query GetChanges(), picking up any pending validator changes. Includes an integration test that adds a validator via GovDAO, stops the node, restarts it, and verifies the validator appears in the consensus set after a few blocks.

thehowl

Would it be possible to actually just try to fetch the events from block-1?

We might even make it work on the current block (changing the current behaviour whereby currently, the updates in one block are executed in the block AFTER current).

Here's what Claude responded when I asked:

The answer is yes, and there's actually a meaningful reason why this would be more correct. Let me show the key sequence from execution.go:
execBlockOnProxyApp()          ← EndBlocker runs HERE (inside ABCI EndBlock)
SaveABCIResponses(N, ...)      ← persisted to disk (includes DeliverTxs + their events)
SaveState(N, ...)              ← TM state saved
fireEvents(evsw, block, ...)   ← EventTx fired HERE — what the collector listens to
The critical insight is that fireEvents — which is what populates the in-memory collector — runs after SaveABCIResponses. So the events that get lost on restart are already sitting in the TM state DB, verbatim, in ABCIResponses.DeliverTxs[i].Events. This is exactly the data validatorEventFilter would inspect.

The check would look like: load ABCIResponses(lastHeight) from the state DB, scan through the DeliverTxs, apply validatorEventFilter logic (look for chain.Event with PkgPath == valRealm and type ValidatorAdded/ValidatorRemoved). If found, proceed with the VM query; if not, skip — same as normal operation.

Advantages over the firstBlock flag:

More precise: doesn't unconditionally query the VM on every restart, only when block N actually had validator events

Works correctly for the crash-before-fireEvents case too (not just clean shutdown)

No special state in the closure, no changed semantics for the empty-collector path

Extends cleanly: survives N consecutive restarts without accumulating errors

Complications that would need solving:

The EndBlocker currently has no access to the TM state DB. The endBlockerApp interface would need a new method like LoadABCIResponses(height int64) (*state.ABCIResponses, error), or the state DB would need to be passed as a parameter — this touches the TM/app layer boundary.

The events in ABCIResponses.DeliverTxs[i].Events are stored as amino-encoded []abci.Event (an interface slice). After deserialization, the type assertion ev.(chain.Event) in validatorEventFilter needs amino to have the concrete chain.Event type registered — not guaranteed to work out of the box.

A simpler variant that avoids the amino concern: instead of decoding the events, check ABCIResponses(N).EndBlock.ValidatorUpdates. If the previous block's EndBlocker already returned validator updates, you know block N-1's events were processed; what you actually need to know is whether block N itself fired events. But EndBlock.ValidatorUpdates for height N is the result of processing N-1's events, so this doesn't help directly.

In short: the ABCIResponses approach is the "right" fix conceptually — it reads the persistent record of what actually happened rather than using a stateful flag — but it requires more invasive changes to plumb the state DB access into the EndBlocker. The firstBlock flag is a smaller, self-contained fix that trades precision (one unnecessary VM query per restart) for simplicity.

IMO this makes sense if we consider this also allows us to get rid of the collector logic.

omarsy · 2026-04-19T18:38:06Z

Would it be possible to actually just try to fetch the events from block-1?

I was leaning toward wiping the collector too. My first thought was to just remove it entirely without adding any gate at all (same shape as Alt 3 in the ADR — remove collector, always call GetChanges). I benched what that would cost per block:

Approach	Per-block cost
Current collector / `firstBlock` (steady-state)	~3 ns, 0 allocs
No gate — real VM query (lower bound)	~86 µs, 53 KB

That's already ~4 orders of magnitude more per block, and it's a lower bound — I couldn't deploy the real r/sys/validators/v2 in a microbench (too many transitive deps), so I used a stub realm. Real GetChanges walks an AVL tree, so it's strictly higher. That alone changed my mind about dropping the gate.

Then I benched your proposal to see if the ABCIResponses-based gate could replace the collector cheaply enough to be worth the change. It came in at ~50 µs, 39 KB, 1k allocs on a medium block (10 tx × 5 ev) — same ballpark as no gate at all. Amino decode dominates (I also ran a decode-only variant skipping the DB read — same numbers), so page-caching doesn't help. So the gate doesn't really save us from the VM-query cost, while still adding TM/app-boundary plumbing and the amino-registration question you flagged.

My preference: keep the narrow firstBlock fix here and open a follow-up if we want to revisit the collector design. WDYT?

The existing `sleep 1s` in restart_validators.txtar races with block production under CI load. Replace it by extending the existing `gnorpc validators` command with a `-wait ADDR` flag (and optional `-timeout`) that polls the validator-set RPC until the given address appears, with a 30s default timeout. The wait returns in ~100ms in practice (one poll interval after the trigger tx's block commits) instead of a hardcoded 1s. Also removes the now-unused `sleep` testscript command — it was only used in this one file, and encouraging time-based waits in integration tests is a footgun. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thehowl · 2026-04-20T16:03:16Z

I was leaning toward wiping the collector too. My first thought was to just remove it entirely without adding any gate at all (same shape as Alt 3 in the ADR — remove collector, always call GetChanges). I benched what that would cost per block:

Actually, I was proposing something else: I made an alternative PR: #5556 (works on top of this one, added you as co-author)

github-project-automation Bot added this to 🧙‍♂️Gno.land development Apr 9, 2026

github-project-automation Bot moved this to Triage in 🧙‍♂️Gno.land development Apr 9, 2026

github-project-automation Bot added this to 💪 Bounties & Worx Apr 9, 2026

github-actions Bot assigned omarsy Apr 9, 2026

github-actions Bot added 📦 🤖 gnovm Issues or PRs gnovm related 📦 ⛰️ gno.land Issues or PRs gno.land package related 📄 top-level-md labels Apr 9, 2026

Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 9, 2026

omarsy marked this pull request as draft April 9, 2026 09:06

Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 9, 2026

omarsy force-pushed the fix/endblock-restart-validator-changes branch 9 times, most recently from 17efe63 to 9cc1373 Compare April 10, 2026 15:43

omarsy force-pushed the fix/endblock-restart-validator-changes branch from 9cc1373 to 27b2d60 Compare April 10, 2026 15:55

Merge branch 'master' into fix/endblock-restart-validator-changes

f78578c

omarsy marked this pull request as ready for review April 10, 2026 16:41

Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 10, 2026

omarsy requested review from a team, ltzmaxwell and thehowl April 10, 2026 16:42

davd-gzl self-requested a review April 10, 2026 18:35

thehowl reviewed Apr 14, 2026

View reviewed changes

Comment thread gno.land/pkg/integration/testdata/restart_validators.txtar Outdated

Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 14, 2026

thehowl reviewed Apr 14, 2026

View reviewed changes

omarsy force-pushed the fix/endblock-restart-validator-changes branch from f7ae6f8 to d7d9c4d Compare April 19, 2026 20:11

thehowl mentioned this pull request Apr 20, 2026

fix(gnoland): detect validator changes via EndTxHook for same-block processing #5556

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gnoland): recover validator changes after node restart#5469

fix(gnoland): recover validator changes after node restart#5469
omarsy wants to merge 3 commits intognolang:masterfrom
omarsy:fix/endblock-restart-validator-changes

omarsy commented Apr 9, 2026 •

edited

Loading

Uh oh!

Gno2D2 commented Apr 9, 2026 •

edited

Loading

✅ Automated Checks (for Contributors):

☑️ Contributor Actions:

☑️ Reviewer Actions:

📚 Resources:

If

Then

If

Then

If

Can be checked by

Uh oh!

codecov Bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

thehowl left a comment •

edited

Loading

Uh oh!

omarsy commented Apr 19, 2026

Uh oh!

thehowl commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

omarsy commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Test plan

Uh oh!

Gno2D2 commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🛠 PR Checks Summary

Manual Checks (for Reviewers):

✅ Automated Checks (for Contributors):

☑️ Contributor Actions:

☑️ Reviewer Actions:

📚 Resources:

If

Then

If

Then

If

Can be checked by

Uh oh!

codecov Bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

thehowl left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

omarsy commented Apr 19, 2026

Uh oh!

thehowl commented Apr 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

omarsy commented Apr 9, 2026 •

edited

Loading

Gno2D2 commented Apr 9, 2026 •

edited

Loading

codecov Bot commented Apr 9, 2026 •

edited

Loading

thehowl left a comment •

edited

Loading