-
Notifications
You must be signed in to change notification settings - Fork 454
fix(gnoland): recover validator changes after node restart #5469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
omarsy
wants to merge
3
commits into
gnolang:master
Choose a base branch
from
omarsy:fix/endblock-restart-validator-changes
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 2 commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,109 @@ | ||
| # ADR: Recover Validator Changes After Node Restart | ||
|
|
||
| ## Context | ||
|
|
||
| The `EndBlocker` in `gno.land/pkg/gnoland/app.go` relies on an in-memory event | ||
| collector to decide whether to query the VM for validator set changes. The | ||
| collector listens on the `EventSwitch` for `validatorUpdate` events fired during | ||
| transaction execution. When events are present, the `EndBlocker` calls | ||
| `GetChanges(from, to)` on `r/sys/validators/v2` and forwards the resulting | ||
| updates to Tendermint2's consensus layer. | ||
|
|
||
| **The bug:** on node restart, the in-memory event collector is empty. Events | ||
| from the last block before shutdown are lost. The `EndBlocker` sees an empty | ||
| collector, returns early, and never queries `GetChanges` — validator changes | ||
| that were committed to the realm but not yet applied to consensus are | ||
| permanently lost. | ||
|
|
||
| This was confirmed by an integration test: a validator added via GovDAO proposal | ||
| and verified in the realm (`IsValidator` returns `true`) disappears from the | ||
| consensus set after a restart. | ||
|
|
||
| ## Decision | ||
|
|
||
| Use a `firstBlock` flag inside the `EndBlocker` closure. On the very first | ||
| invocation after startup, the EndBlocker bypasses the collector check and always | ||
| queries the VM for pending validator changes. On all subsequent blocks, the | ||
| collector gate applies as before. | ||
|
|
||
| ```go | ||
| firstBlock := true | ||
|
|
||
| return func(ctx sdk.Context, _ abci.RequestEndBlock) abci.ResponseEndBlock { | ||
| // ... auth/gas price logic ... | ||
|
|
||
| if firstBlock { | ||
| firstBlock = false | ||
| collector.getEvents() // drain any accumulated events | ||
| if app.LastBlockHeight() == 0 { | ||
| return abci.ResponseEndBlock{} // genesis — nothing to recover | ||
| } | ||
| } else if len(collector.getEvents()) == 0 { | ||
| return abci.ResponseEndBlock{} | ||
| } | ||
|
|
||
| // ... VM query + apply validator changes ... | ||
| } | ||
| ``` | ||
|
|
||
| This keeps all validator logic in one place (the EndBlocker) rather than | ||
| splitting it between init and EndBlocker. The `firstBlock` bool is safe because | ||
| the EndBlocker runs single-threaded in the ABCI consensus flow. | ||
|
|
||
| The collector itself is kept as a performance optimization — it avoids a VM | ||
| query on every block when no validator changes occurred. | ||
|
|
||
| ## Alternatives Considered | ||
|
|
||
| ### 1. Query VM at init time and seed the collector | ||
|
|
||
| After `vmk.Initialize()`, query the VM for pending changes and pre-populate | ||
| the collector with a synthetic event so the EndBlocker picks it up naturally. | ||
|
|
||
| **Rejected:** splits validator logic across init and EndBlocker. Also requires | ||
| handling the case where the validators realm isn't deployed (test nodes), adding | ||
| error-handling complexity to init. | ||
|
|
||
| ### 2. Fire a synthetic event via `evsw.FireEvent` | ||
|
|
||
| Seed the collector by firing a fake `validatorUpdate` event on the `EventSwitch` | ||
| after restart. | ||
|
|
||
| **Rejected:** `FireEvent` dispatches to all listeners, not just the collector. | ||
| Unknown listeners could react to a synthetic event in unexpected ways, creating | ||
| subtle bugs. | ||
|
|
||
| ### 3. Always query the VM (remove the collector) | ||
|
|
||
| Remove the event collector entirely and call `GetChanges` in every `EndBlocker`. | ||
|
|
||
| **Rejected after benchmarking:** the collector early-return path costs ~13ns/0 | ||
| allocs per block, while a VM query costs ~830ns/8 allocs (mock) and | ||
| significantly more with a real VM. Over thousands of blocks with no validator | ||
| changes, the collector avoids substantial overhead. | ||
|
|
||
| ### 4. Persist the collector to disk | ||
|
|
||
| Save collector state before shutdown and restore on restart. | ||
|
|
||
| **Rejected:** adds persistence complexity for a problem that only manifests at | ||
| the boundary between the last pre-shutdown block and the first post-restart | ||
| block. A one-time unconditional query on the first block is simpler. | ||
|
|
||
| ## Consequences | ||
|
|
||
| - **Positive:** validator changes committed to the realm are always applied to | ||
| consensus, even across restarts. | ||
| - **Positive:** all validator change logic stays in the EndBlocker — no init-time | ||
| special cases. | ||
| - **Positive:** the event collector optimization is preserved for normal | ||
| operation (all blocks after the first). | ||
| - **Positive:** on genesis (height 0), the first EndBlocker still early-returns | ||
| — no wasted VM query. | ||
| - **Trade-off:** if `GetChanges` or the validators realm is unavailable on the | ||
| first block after restart, the EndBlocker logs an error and continues (same as | ||
| any other block). No panic, no silent data loss. | ||
| - **Testing:** a txtar integration test (`restart_validators.txtar`) verifies | ||
| the full flow: add validator via GovDAO, restart, confirm validator appears in | ||
| the consensus set. The test also required adding `gnorpc validators` and | ||
| `sleep` testscript commands. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
74 changes: 74 additions & 0 deletions
74
gno.land/pkg/integration/testdata/restart_validators.txtar
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| # Test that validator changes committed to the realm survive a node restart. | ||
| # | ||
| # The bug: the EndBlocker relies on an in-memory event collector to trigger | ||
| # validator set queries. On restart, the collector is empty, so the EndBlocker | ||
| # never queries GetChanges — validator changes stuck in the realm are never | ||
| # applied to consensus. | ||
|
|
||
| loadpkg gno.land/r/gov/dao/v3/init | ||
| loadpkg gno.land/r/gov/dao | ||
| loadpkg gno.land/r/gnops/valopers | ||
| loadpkg gno.land/r/gnops/valopers/proposal | ||
|
|
||
| gnoland start | ||
|
|
||
| # Init GovDAO with test1 as member | ||
| gnokey maketx run -gas-fee 100000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/init_govdao.gno | ||
| stdout OK! | ||
|
|
||
| # Register a valoper | ||
| gnokey maketx call -pkgpath gno.land/r/gnops/valopers -func Register -gas-fee 1000000ugnot -gas-wanted 30000000 -send 20000000ugnot -args myval -args 'Test validator' -args on-prem -args g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr -args gpub1pgfj7ard9eg82cjtv4u4xetrwqer2dntxyfzxz3pqg0lte6srklm3tuyja9489n3dsnx4wcadq43wrwnz6nln8s7lf9uyptc3nm -broadcast -chainid=tendermint_test test1 | ||
| stdout OK! | ||
|
|
||
| # Create proposal, vote YES, execute — adds validator to the realm | ||
| gnokey maketx run -gas-fee 1000000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/add_validator.gno | ||
| stdout OK! | ||
|
|
||
| # Verify validator is in realm before restart | ||
| gnokey query vm/qeval --data "gno.land/r/sys/validators/v2.IsValidator(address(\"g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr\"))" | ||
| stdout 'true' | ||
|
|
||
| # Restart the node — in-memory event collector is lost | ||
| gnoland restart | ||
|
|
||
| # Advance one block so EndBlocker picks up the pending change | ||
| gnokey maketx run -gas-fee 1000000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/trigger.gno | ||
| stdout OK! | ||
|
|
||
| # Wait for the validator update to propagate | ||
| sleep 1s | ||
|
|
||
| # Verify the new validator is in the consensus set after restart | ||
| gnorpc validators | ||
| stdout 'g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr' | ||
|
|
||
| -- run/init_govdao.gno -- | ||
| package main | ||
|
|
||
| import i "gno.land/r/gov/dao/v3/init" | ||
|
|
||
| func main() { | ||
| i.InitWithUsers(address("g1jg8mtutu9khhfwc4nxmuhcpftf0pajdhfvsqf5")) | ||
| } | ||
|
|
||
| -- run/add_validator.gno -- | ||
| package main | ||
|
|
||
| import ( | ||
| "gno.land/r/gnops/valopers/proposal" | ||
| "gno.land/r/gov/dao" | ||
| ) | ||
|
|
||
| func main() { | ||
| pr := proposal.NewValidatorProposalRequest(cross, address("g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr")) | ||
| pid := dao.MustCreateProposal(cross, pr) | ||
| dao.MustVoteOnProposalSimple(cross, int64(pid), "YES") | ||
| dao.ExecuteProposal(cross, pid) | ||
| } | ||
|
|
||
| -- run/trigger.gno -- | ||
| package main | ||
|
|
||
| func main() { | ||
| println("trigger block") | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.