fix(gnoland): recover validator changes after node restart #5469
omarsy wants to merge 3 commits into gnolang:master
Conversation
omarsy force-pushed from 17efe63 to 9cc1373
The EndBlocker relies on an in-memory event collector to decide when to query `GetChanges()` from `r/sys/validators/v2`. On restart, the collector is empty — events from the last block before shutdown are lost — so the EndBlocker never queries the realm and validator changes are permanently lost from the consensus layer.

Fix: after loading the latest version in `NewAppWithOptions`, seed the collector with a synthetic event when `LastBlockHeight > 0` (i.e. not a fresh genesis). This causes the first EndBlocker after restart to query `GetChanges()`, picking up any pending validator changes.

Includes an integration test that adds a validator via GovDAO, stops the node, restarts it, and verifies the validator appears in the consensus set after a few blocks.
omarsy force-pushed from 9cc1373 to 27b2d60
Would it be possible to actually just try to fetch the events from block-1?
We might even make it work on the current block (changing the current behaviour, whereby the updates in one block are executed in the block after it).
Here's what Claude responded when I asked:
The answer is yes, and there's actually a meaningful reason why this would be more correct. Let me show the key sequence from `execution.go`:

1. `execBlockOnProxyApp()` — EndBlocker runs HERE (inside ABCI EndBlock)
2. `SaveABCIResponses(N, ...)` — persisted to disk (includes DeliverTxs + their events)
3. `SaveState(N, ...)` — TM state saved
4. `fireEvents(evsw, block, ...)` — EventTx fired HERE, which is what the collector listens to

The critical insight is that `fireEvents` — which is what populates the in-memory collector — runs after `SaveABCIResponses`. So the events that get lost on restart are already sitting in the TM state DB, verbatim, in `ABCIResponses.DeliverTxs[i].Events`. This is exactly the data `validatorEventFilter` would inspect.

The check would look like: load `ABCIResponses(lastHeight)` from the state DB, scan through the `DeliverTxs`, apply the `validatorEventFilter` logic (look for a `chain.Event` with `PkgPath == valRealm` and type `ValidatorAdded`/`ValidatorRemoved`). If found, proceed with the VM query; if not, skip — same as normal operation.

Advantages over the `firstBlock` flag:

- More precise: doesn't unconditionally query the VM on every restart, only when block N actually had validator events
- Works correctly for the crash-before-`fireEvents` case too (not just clean shutdown)
- No special state in the closure, no changed semantics for the empty-collector path
- Extends cleanly: survives N consecutive restarts without accumulating errors

Complications that would need solving:

- The `EndBlocker` currently has no access to the TM state DB. The `endBlockerApp` interface would need a new method like `LoadABCIResponses(height int64) (*state.ABCIResponses, error)`, or the state DB would need to be passed as a parameter — this touches the TM/app layer boundary.
- The events in `ABCIResponses.DeliverTxs[i].Events` are stored as amino-encoded `[]abci.Event` (an interface slice). After deserialization, the type assertion `ev.(chain.Event)` in `validatorEventFilter` needs amino to have the concrete `chain.Event` type registered — not guaranteed to work out of the box.

A simpler variant that avoids the amino concern: instead of decoding the events, check `ABCIResponses(N).EndBlock.ValidatorUpdates`. If the previous block's EndBlocker already returned validator updates, you know block N-1's events were processed; but what you actually need to know is whether block N itself fired events, and `EndBlock.ValidatorUpdates` for height N is the result of processing N-1's events — so this doesn't help directly.

In short: the ABCIResponses approach is the "right" fix conceptually — it reads the persistent record of what actually happened rather than using a stateful flag — but it requires more invasive changes to plumb state DB access into the EndBlocker. The `firstBlock` flag is a smaller, self-contained fix that trades precision (one unnecessary VM query per restart) for simplicity.
IMO this makes sense if we consider this also allows us to get rid of the collector logic.
I was leaning toward wiping the collector too. My first thought was to just remove it entirely without adding any gate at all (same shape as Alt 3 in the ADR — remove the collector, always call `GetChanges()` every block).

That's already ~4 orders of magnitude more per block, and it's a lower bound — I couldn't deploy the real realm in the bench setup.

Then I benched your proposal to see if the ABCIResponses-based gate could replace the collector cheaply enough to be worth the change. It came in at ~50 µs, 39 KB, 1k allocs on a medium block (10 tx × 5 ev) — same ballpark as no gate at all. Amino decode dominates (I also ran a decode-only variant skipping the DB read — same numbers), so page-caching doesn't help.

So the gate doesn't really save us from the VM-query cost, while still adding TM/app-boundary plumbing and the amino-registration question you flagged. My preference: keep the narrow `firstBlock` fix.
The existing `sleep 1s` in restart_validators.txtar races with block production under CI load. Replace it by extending the existing `gnorpc validators` command with a `-wait ADDR` flag (and optional `-timeout`) that polls the validator-set RPC until the given address appears, with a 30s default timeout. The wait returns in ~100ms in practice (one poll interval after the trigger tx's block commits) instead of a hardcoded 1s.

Also removes the now-unused `sleep` testscript command — it was only used in this one file, and encouraging time-based waits in integration tests is a footgun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
omarsy force-pushed from f7ae6f8 to d7d9c4d
Actually, I was proposing something else. I made an alternative PR: #5556 (it works on top of this one; added you as co-author).
Summary

- The EndBlocker relies on an in-memory event collector to decide when to query `GetChanges()`. On node restart, the collector is empty (events from the last block before shutdown are lost), so the EndBlocker never queries the realm — validator changes are permanently lost from the consensus layer.
- Fix: a `firstBlock` flag in the EndBlocker closure so the very first invocation after startup always queries the VM, bypassing the empty-collector check. On genesis (height 0) it skips the query since there's nothing to recover.

Details

The fix is contained in `gno.land/pkg/gnoland/app.go` (the `EndBlocker` function). All validator logic stays in one place (the EndBlocker). The collector optimization is preserved for all blocks after the first.
Test plan

- `restart_validators.txtar` — passes with fix, fails on master (confirmed)
- `restart_nonval.txtar` / `restart_missing_type.txtar` — existing restart tests pass
- `TestEndBlocker` — all existing subtests pass

🤖 Generated with Claude Code
This PR was developed with AI assistance. The approach and code were reviewed by a human contributor.