Skip to content

fix(gnoland): recover validator changes after node restart#5469

Open
omarsy wants to merge 3 commits intognolang:masterfrom
omarsy:fix/endblock-restart-validator-changes
Open

fix(gnoland): recover validator changes after node restart#5469
omarsy wants to merge 3 commits intognolang:masterfrom
omarsy:fix/endblock-restart-validator-changes

Conversation

@omarsy
Copy link
Copy Markdown
Member

@omarsy omarsy commented Apr 9, 2026

Summary

  • The EndBlocker relies on an in-memory event collector to trigger validator set queries via GetChanges(). On node restart, the collector is empty (events from the last block before shutdown are lost), so the EndBlocker never queries the realm — validator changes are permanently lost from the consensus layer.
  • Fix: use a firstBlock flag in the EndBlocker closure so the very first invocation after startup always queries the VM, bypassing the empty collector check. On genesis (height 0) it skips the query since there's nothing to recover.
  • Adds a txtar integration test that adds a validator via GovDAO proposal, restarts the node, and verifies the validator appears in the consensus set.

Details

The fix is contained in gno.land/pkg/gnoland/app.go (EndBlocker function):

firstBlock := true

return func(ctx sdk.Context, _ abci.RequestEndBlock) abci.ResponseEndBlock {
    // ...
    if firstBlock {
        firstBlock = false
        collector.getEvents() // drain any accumulated events
        if app.LastBlockHeight() == 0 {
            return abci.ResponseEndBlock{} // genesis — nothing to recover
        }
    } else if len(collector.getEvents()) == 0 {
        return abci.ResponseEndBlock{}
    }
    // ... VM query + apply validator changes ...
}

All validator logic stays in one place (the EndBlocker). The collector optimization is preserved for all blocks after the first.

Test plan

  • restart_validators.txtar — passes with fix, fails on master (confirmed)
  • restart_nonval.txtar / restart_missing_type.txtar — existing restart tests pass
  • TestEndBlocker — all existing subtests pass
  • CI passes

🤖 Generated with Claude Code

This PR was developed with AI assistance. The approach and code were reviewed by a human contributor.

@github-actions github-actions Bot added 📦 🤖 gnovm Issues or PRs gnovm related 📦 ⛰️ gno.land Issues or PRs gno.land package related 📄 top-level-md labels Apr 9, 2026
@Gno2D2 Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 9, 2026
@Gno2D2
Copy link
Copy Markdown
Collaborator

Gno2D2 commented Apr 9, 2026

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)
Read More

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)
🟢 Pending initial approval by a review team member, or review from tech-staff

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: omarsy/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Pending initial approval by a review team member, or review from tech-staff

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 Not (🔴 Pull request author is a member of the team: tech-staff)

Then

🟢 Requirement satisfied
└── 🟢 If
    ├── 🟢 Condition
    │   └── 🟢 Or
    │       ├── 🔴 At least one of these user(s) reviewed the pull request: [aronpark1007 davd-gzl jefft0 notJoon omarsy MikaelVallenet] (with state "APPROVED")
    │       ├── 🟢 At least 1 user(s) of the team tech-staff reviewed pull request
    │       └── 🔴 This pull request is a draft
    └── 🟢 Then
        └── 🟢 Not (🔴 This label is applied to pull request: review/triage-pending)

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 53.33333% with 35 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
gno.land/pkg/integration/testscript_gnoland.go 48.52% 28 Missing and 7 partials ⚠️

📢 Thoughts on this report? Let us know!

@omarsy omarsy marked this pull request as draft April 9, 2026 09:06
@Gno2D2 Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 9, 2026
@omarsy omarsy force-pushed the fix/endblock-restart-validator-changes branch 9 times, most recently from 17efe63 to 9cc1373 Compare April 10, 2026 15:43
The EndBlocker relies on an in-memory event collector to decide when
to query GetChanges() from r/sys/validators/v2. On restart, the
collector is empty — events from the last block before shutdown are
lost — so the EndBlocker never queries the realm and validator
changes are permanently lost from the consensus layer.

Fix: after loading the latest version in NewAppWithOptions, seed the
collector with a synthetic event when LastBlockHeight > 0 (i.e. not
a fresh genesis). This causes the first EndBlocker after restart to
query GetChanges(), picking up any pending validator changes.

Includes an integration test that adds a validator via GovDAO, stops
the node, restarts it, and verifies the validator appears in the
consensus set after a few blocks.
@omarsy omarsy force-pushed the fix/endblock-restart-validator-changes branch from 9cc1373 to 27b2d60 Compare April 10, 2026 15:55
@omarsy omarsy marked this pull request as ready for review April 10, 2026 16:41
@Gno2D2 Gno2D2 added the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 10, 2026
@omarsy omarsy requested review from a team, ltzmaxwell and thehowl April 10, 2026 16:42
@davd-gzl davd-gzl self-requested a review April 10, 2026 18:35
Comment thread gno.land/pkg/integration/testdata/restart_validators.txtar Outdated
@Gno2D2 Gno2D2 removed the review/triage-pending PRs opened by external contributors that are waiting for the 1st review label Apr 14, 2026
Copy link
Copy Markdown
Member

@thehowl thehowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to actually just try to fetch the events from block-1?

We might even make it work on the current block (changing the current behaviour whereby currently, the updates in one block are executed in the block AFTER current).

Here's what Claude responded when I asked:

The answer is yes, and there's actually a meaningful reason why this would be more correct. Let me show the key sequence from execution.go:

execBlockOnProxyApp()          ← EndBlocker runs HERE (inside ABCI EndBlock)
SaveABCIResponses(N, ...)      ← persisted to disk (includes DeliverTxs + their events)
SaveState(N, ...)              ← TM state saved
fireEvents(evsw, block, ...)   ← EventTx fired HERE — what the collector listens to

The critical insight is that fireEvents — which is what populates the in-memory collector — runs after SaveABCIResponses. So the events that get lost on restart are already sitting in the TM state DB, verbatim, in ABCIResponses.DeliverTxs[i].Events. This is exactly the data validatorEventFilter would inspect.

The check would look like: load ABCIResponses(lastHeight) from the state DB, scan through the DeliverTxs, apply validatorEventFilter logic (look for chain.Event with PkgPath == valRealm and type ValidatorAdded/ValidatorRemoved). If found, proceed with the VM query; if not, skip — same as normal operation.

Advantages over the firstBlock flag:

  • More precise: doesn't unconditionally query the VM on every restart, only when block N actually had validator events
  • Works correctly for the crash-before-fireEvents case too (not just clean shutdown)
  • No special state in the closure, no changed semantics for the empty-collector path
  • Extends cleanly: survives N consecutive restarts without accumulating errors

Complications that would need solving:

  1. The EndBlocker currently has no access to the TM state DB. The endBlockerApp interface would need a new method like LoadABCIResponses(height int64) (*state.ABCIResponses, error), or the state DB would need to be passed as a parameter — this touches the TM/app layer boundary.
  2. The events in ABCIResponses.DeliverTxs[i].Events are stored as amino-encoded []abci.Event (an interface slice). After deserialization, the type assertion ev.(chain.Event) in validatorEventFilter needs amino to have the concrete chain.Event type registered — not guaranteed to work out of the box.

A simpler variant that avoids the amino concern: instead of decoding the events, check ABCIResponses(N).EndBlock.ValidatorUpdates. If the previous block's EndBlocker already returned validator updates, you know block N-1's events were processed; what you actually need to know is whether block N itself fired events. But EndBlock.ValidatorUpdates for height N is the result of processing N-1's events, so this doesn't help directly.

In short: the ABCIResponses approach is the "right" fix conceptually — it reads the persistent record of what actually happened rather than using a stateful flag — but it requires more invasive changes to plumb the state DB access into the EndBlocker. The firstBlock flag is a smaller, self-contained fix that trades precision (one unnecessary VM query per restart) for simplicity.

IMO this makes sense if we consider this also allows us to get rid of the collector logic.

@omarsy
Copy link
Copy Markdown
Member Author

omarsy commented Apr 19, 2026

Would it be possible to actually just try to fetch the events from block-1?

I was leaning toward wiping the collector too. My first thought was to just remove it entirely without adding any gate at all (same shape as Alt 3 in the ADR — remove collector, always call GetChanges). I benched what that would cost per block:

Approach Per-block cost
Current collector / firstBlock (steady-state) ~3 ns, 0 allocs
No gate — real VM query (lower bound) ~86 µs, 53 KB

That's already ~4 orders of magnitude more per block, and it's a lower bound — I couldn't deploy the real r/sys/validators/v2 in a microbench (too many transitive deps), so I used a stub realm. Real GetChanges walks an AVL tree, so it's strictly higher. That alone changed my mind about dropping the gate.

Then I benched your proposal to see if the ABCIResponses-based gate could replace the collector cheaply enough to be worth the change. It came in at ~50 µs, 39 KB, 1k allocs on a medium block (10 tx × 5 ev) — same ballpark as no gate at all. Amino decode dominates (I also ran a decode-only variant skipping the DB read — same numbers), so page-caching doesn't help. So the gate doesn't really save us from the VM-query cost, while still adding TM/app-boundary plumbing and the amino-registration question you flagged.

My preference: keep the narrow firstBlock fix here and open a follow-up if we want to revisit the collector design. WDYT?

The existing `sleep 1s` in restart_validators.txtar races with block
production under CI load. Replace it by extending the existing
`gnorpc validators` command with a `-wait ADDR` flag (and optional
`-timeout`) that polls the validator-set RPC until the given address
appears, with a 30s default timeout.

The wait returns in ~100ms in practice (one poll interval after the
trigger tx's block commits) instead of a hardcoded 1s.

Also removes the now-unused `sleep` testscript command — it was only
used in this one file, and encouraging time-based waits in integration
tests is a footgun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thehowl
Copy link
Copy Markdown
Member

thehowl commented Apr 20, 2026

I was leaning toward wiping the collector too. My first thought was to just remove it entirely without adding any gate at all (same shape as Alt 3 in the ADR — remove collector, always call GetChanges). I benched what that would cost per block:

Actually, I was proposing something else: I made an alternative PR: #5556 (works on top of this one, added you as co-author)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

📦 ⛰️ gno.land Issues or PRs gno.land package related 📦 🤖 gnovm Issues or PRs gnovm related 📄 top-level-md

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants