Changes from 2 commits
109 changes: 109 additions & 0 deletions gno.land/adr/pr5469_restart_validator_changes.md
@@ -0,0 +1,109 @@
# ADR: Recover Validator Changes After Node Restart

## Context

The `EndBlocker` in `gno.land/pkg/gnoland/app.go` relies on an in-memory event
collector to decide whether to query the VM for validator set changes. The
collector listens on the `EventSwitch` for `validatorUpdate` events fired during
transaction execution. When events are present, the `EndBlocker` calls
`GetChanges(from, to)` on `r/sys/validators/v2` and forwards the resulting
updates to Tendermint2's consensus layer.
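
The collector described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual gnoland implementation: the `collector`/`getEvents` names follow the ADR's usage, while `Event`, `onEvent`, and the filter signature are assumptions for the sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a minimal stand-in for the chain events carried on the EventSwitch.
type Event struct{ Type string }

// collector buffers events matching a filter; getEvents drains the buffer.
type collector struct {
	mu     sync.Mutex
	buf    []Event
	filter func(Event) bool
}

func newCollector(filter func(Event) bool) *collector {
	return &collector{filter: filter}
}

// onEvent is the listener callback registered on the event switch.
func (c *collector) onEvent(e Event) {
	if !c.filter(e) {
		return
	}
	c.mu.Lock()
	c.buf = append(c.buf, e)
	c.mu.Unlock()
}

// getEvents returns buffered events and clears the buffer, so each
// EndBlocker invocation sees only events fired since the previous drain.
func (c *collector) getEvents() []Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.buf
	c.buf = nil
	return out
}

func main() {
	c := newCollector(func(e Event) bool { return e.Type == "validatorUpdate" })
	c.onEvent(Event{Type: "validatorUpdate"})
	c.onEvent(Event{Type: "other"}) // filtered out
	fmt.Println(len(c.getEvents()))
	fmt.Println(len(c.getEvents()))
}
```

The second drain returns nothing, which is exactly the state a freshly restarted node is in: the buffer is in memory only, so events from the last pre-shutdown block are gone.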

**The bug:** on node restart, the in-memory event collector is empty. Events
from the last block before shutdown are lost. The `EndBlocker` sees an empty
collector, returns early, and never queries `GetChanges` — validator changes
that were committed to the realm but not yet applied to consensus are
permanently lost.

This was confirmed by an integration test: a validator added via GovDAO proposal
and verified in the realm (`IsValidator` returns `true`) disappears from the
consensus set after a restart.

## Decision

Use a `firstBlock` flag inside the `EndBlocker` closure. On the very first
invocation after startup, the EndBlocker bypasses the collector check and always
queries the VM for pending validator changes. On all subsequent blocks, the
collector gate applies as before.

```go
firstBlock := true

return func(ctx sdk.Context, _ abci.RequestEndBlock) abci.ResponseEndBlock {
// ... auth/gas price logic ...

if firstBlock {
firstBlock = false
collector.getEvents() // drain any accumulated events
if app.LastBlockHeight() == 0 {
return abci.ResponseEndBlock{} // genesis — nothing to recover
}
} else if len(collector.getEvents()) == 0 {
return abci.ResponseEndBlock{}
}

// ... VM query + apply validator changes ...
}
```

This keeps all validator logic in one place (the EndBlocker) rather than
splitting it between init and EndBlocker. The `firstBlock` bool is safe because
the EndBlocker runs single-threaded in the ABCI consensus flow.

The collector itself is kept as a performance optimization — it avoids a VM
query on every block when no validator changes occurred.

## Alternatives Considered

### 1. Query VM at init time and seed the collector

After `vmk.Initialize()`, query the VM for pending changes and pre-populate
the collector with a synthetic event so the EndBlocker picks it up naturally.

**Rejected:** splits validator logic across init and EndBlocker. Also requires
handling the case where the validators realm isn't deployed (test nodes), adding
error-handling complexity to init.

### 2. Fire a synthetic event via `evsw.FireEvent`

Seed the collector by firing a fake `validatorUpdate` event on the `EventSwitch`
after restart.

**Rejected:** `FireEvent` dispatches to all listeners, not just the collector.
Unknown listeners could react to a synthetic event in unexpected ways, creating
subtle bugs.
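
The broadcast behavior that motivates this rejection can be shown with a toy event switch (an illustrative sketch, not Tendermint2's `EventSwitch`; all names here are assumed): firing dispatches to every registered listener, so a synthetic event cannot be scoped to the collector alone.

```go
package main

import "fmt"

// eventSwitch is a minimal broadcast dispatcher: FireEvent reaches
// every registered listener, with no way to target just one.
type eventSwitch struct {
	listeners map[string]func(string)
}

func newEventSwitch() *eventSwitch {
	return &eventSwitch{listeners: make(map[string]func(string))}
}

func (es *eventSwitch) AddListener(name string, fn func(string)) {
	es.listeners[name] = fn
}

func (es *eventSwitch) FireEvent(ev string) {
	for _, fn := range es.listeners {
		fn(ev)
	}
}

func main() {
	es := newEventSwitch()
	es.AddListener("collector", func(ev string) { fmt.Println("collector saw", ev) })
	es.AddListener("indexer", func(ev string) { fmt.Println("indexer also saw", ev) })
	// A synthetic validatorUpdate meant only for the collector is
	// delivered to the indexer too.
	es.FireEvent("validatorUpdate")
}
```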

### 3. Always query the VM (remove the collector)

Remove the event collector entirely and call `GetChanges` in every `EndBlocker`.

**Rejected after benchmarking:** the collector early-return path costs ~13ns/0
allocs per block, while a VM query costs ~830ns/8 allocs (mock) and
significantly more with a real VM. Over thousands of blocks with no validator
changes, the collector avoids substantial overhead.
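
A micro-benchmark of this shape could be reproduced roughly as below. This is an illustrative sketch only: `earlyReturn` and `mockQuery` are stand-ins invented here, not the PR's benchmark code, and absolute numbers will differ by machine.

```go
package main

import (
	"fmt"
	"testing"
)

// events is empty in the common case: no validator changes this block.
var events []string

// earlyReturn models the collector gate: a length check and return.
func earlyReturn() bool { return len(events) == 0 }

// mockQuery models an unconditional VM call: it builds a request string
// and allocates a result slice, standing in for GetChanges on a mock VM.
func mockQuery() []string {
	_ = fmt.Sprintf("GetChanges(%d,%d)", 1, 2)
	return make([]string, 0, 8)
}

func main() {
	gate := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = earlyReturn()
		}
	})
	query := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = mockQuery()
		}
	})
	fmt.Println("gate: ", gate)
	fmt.Println("query:", query)
}
```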

### 4. Persist the collector to disk

Save collector state before shutdown and restore on restart.

**Rejected:** adds persistence complexity for a problem that only manifests at
the boundary between the last pre-shutdown block and the first post-restart
block. A one-time unconditional query on the first block is simpler.

## Consequences

- **Positive:** validator changes committed to the realm are always applied to
consensus, even across restarts.
- **Positive:** all validator change logic stays in the EndBlocker — no init-time
special cases.
- **Positive:** the event collector optimization is preserved for normal
operation (all blocks after the first).
- **Positive:** on genesis (height 0), the first EndBlocker still early-returns
— no wasted VM query.
- **Trade-off:** if `GetChanges` or the validators realm is unavailable on the
first block after restart, the EndBlocker logs an error and continues (same as
any other block). No panic, no silent data loss.
- **Testing:** a txtar integration test (`restart_validators.txtar`) verifies
the full flow: add validator via GovDAO, restart, confirm validator appears in
the consensus set. The test also required adding `gnorpc validators` and
`sleep` testscript commands.
19 changes: 16 additions & 3 deletions gno.land/pkg/gnoland/app.go
@@ -459,6 +459,11 @@ func EndBlocker(
ctx sdk.Context,
req abci.RequestEndBlock,
) abci.ResponseEndBlock {
// On restart the in-memory event collector is empty — events from the
// last block before shutdown are lost. firstBlock ensures the first
// EndBlocker always queries the VM for pending validator changes.
firstBlock := true

return func(ctx sdk.Context, _ abci.RequestEndBlock) abci.ResponseEndBlock {
// set the auth params value in the ctx. The EndBlocker will use InitialGasPrice in
// the params to calculate the updated gas price.
@@ -469,9 +474,17 @@
auth.EndBlocker(ctx, gpk)
}

// Check if there was a valset change
if len(collector.getEvents()) == 0 {
// No valset updates
// Check if there was a valset change.
// On the very first block, skip this check — the collector may be
// empty after a restart even though changes are pending in the realm.
if firstBlock {
firstBlock = false
collector.getEvents() // drain any accumulated events
// On genesis (height 0) there are no pending changes to recover.
if app.LastBlockHeight() == 0 {
return abci.ResponseEndBlock{}
}
} else if len(collector.getEvents()) == 0 {
return abci.ResponseEndBlock{}
}

33 changes: 27 additions & 6 deletions gno.land/pkg/gnoland/app_test.go
@@ -576,8 +576,12 @@ func TestEndBlocker(t *testing.T) {
// Fire a GnoVM event
mockEventSwitch.FireEvent(chain.Event{})

mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}

// Create the EndBlocker
eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)

// Run the EndBlocker
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
@@ -623,8 +627,12 @@ func TestEndBlocker(t *testing.T) {
// Fire a GnoVM event
mockEventSwitch.FireEvent(chain.Event{})

mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}

// Create the EndBlocker
eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)

// Run the EndBlocker
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
@@ -695,8 +703,12 @@ func TestEndBlocker(t *testing.T) {

mockEventSwitch.FireEvent(txEvent)

mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}

// Create the EndBlocker
eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)

// Run the EndBlocker
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
@@ -770,7 +782,10 @@ func TestEndBlocker(t *testing.T) {
c := newCollector[validatorUpdate](mockEventSwitch, validatorEventFilter)
mockEventSwitch.FireEvent(txEvent)

eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
Validator: &abci.ValidatorParams{
PubKeyTypeURLs: []string{"/tm.PubKeySecp256k1"},
@@ -835,7 +850,10 @@ func TestEndBlocker(t *testing.T) {

c := newCollector[validatorUpdate](mockEventSwitch, validatorEventFilter)
mockEventSwitch.FireEvent(txEvent)
eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
Validator: &abci.ValidatorParams{
PubKeyTypeURLs: []string{"/tm.PubKeySecp256k1"},
@@ -890,7 +908,10 @@ func TestEndBlocker(t *testing.T) {

c := newCollector[validatorUpdate](mockEventSwitch, validatorEventFilter)
mockEventSwitch.FireEvent(txEvent)
eb := EndBlocker(c, nil, nil, mockVMKeeper, &mockEndBlockerApp{})
mockApp := &mockEndBlockerApp{
lastBlockHeightFn: func() int64 { return 1 },
}
eb := EndBlocker(c, nil, nil, mockVMKeeper, mockApp)
res := eb(sdk.Context{}.WithConsensusParams(&abci.ConsensusParams{
Validator: &abci.ValidatorParams{
PubKeyTypeURLs: []string{"/tm.PubKeyEd25519"},
74 changes: 74 additions & 0 deletions gno.land/pkg/integration/testdata/restart_validators.txtar
@@ -0,0 +1,74 @@
# Test that validator changes committed to the realm survive a node restart.
#
# The bug: the EndBlocker relies on an in-memory event collector to trigger
# validator set queries. On restart, the collector is empty, so the EndBlocker
# never queries GetChanges — validator changes stuck in the realm are never
# applied to consensus.

loadpkg gno.land/r/gov/dao/v3/init
loadpkg gno.land/r/gov/dao
loadpkg gno.land/r/gnops/valopers
loadpkg gno.land/r/gnops/valopers/proposal

gnoland start

# Init GovDAO with test1 as member
gnokey maketx run -gas-fee 100000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/init_govdao.gno
stdout OK!

# Register a valoper
gnokey maketx call -pkgpath gno.land/r/gnops/valopers -func Register -gas-fee 1000000ugnot -gas-wanted 30000000 -send 20000000ugnot -args myval -args 'Test validator' -args on-prem -args g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr -args gpub1pgfj7ard9eg82cjtv4u4xetrwqer2dntxyfzxz3pqg0lte6srklm3tuyja9489n3dsnx4wcadq43wrwnz6nln8s7lf9uyptc3nm -broadcast -chainid=tendermint_test test1
stdout OK!

# Create proposal, vote YES, execute — adds validator to the realm
gnokey maketx run -gas-fee 1000000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/add_validator.gno
stdout OK!

# Verify validator is in realm before restart
gnokey query vm/qeval --data "gno.land/r/sys/validators/v2.IsValidator(address(\"g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr\"))"
stdout 'true'

# Restart the node — in-memory event collector is lost
gnoland restart

# Advance one block so EndBlocker picks up the pending change
gnokey maketx run -gas-fee 1000000ugnot -gas-wanted 95000000 -broadcast -chainid=tendermint_test test1 $WORK/run/trigger.gno
stdout OK!

# Wait for the validator update to propagate
sleep 1s

# Verify the new validator is in the consensus set after restart
gnorpc validators
stdout 'g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr'

-- run/init_govdao.gno --
package main

import i "gno.land/r/gov/dao/v3/init"

func main() {
i.InitWithUsers(address("g1jg8mtutu9khhfwc4nxmuhcpftf0pajdhfvsqf5"))
}

-- run/add_validator.gno --
package main

import (
"gno.land/r/gnops/valopers/proposal"
"gno.land/r/gov/dao"
)

func main() {
pr := proposal.NewValidatorProposalRequest(cross, address("g1td0cgmt9uz7kq4hcv7fkkwvp3z4lq4dsewffwr"))
pid := dao.MustCreateProposal(cross, pr)
dao.MustVoteOnProposalSimple(cross, int64(pid), "YES")
dao.ExecuteProposal(cross, pid)
}

-- run/trigger.gno --
package main

func main() {
println("trigger block")
}
62 changes: 62 additions & 0 deletions gno.land/pkg/integration/testscript_gnoland.go
@@ -228,6 +228,8 @@ func SetupGnolandTestscript(t *testing.T, p *testscript.Params) error {
cmds := map[string]func(ts *testscript.TestScript, neg bool, args []string){
"gnoland": gnolandCmd(t, nodesManager, gnoRootDir),
"gnokey": gnokeyCmd(nodesManager),
"gnorpc": gnorpcCmd(nodesManager),
"sleep": sleepCmd(),
"adduser": adduserCmd(nodesManager),
"adduserfrom": adduserfromCmd(nodesManager),
"patchpkg": patchpkgCmd(),
@@ -875,6 +877,48 @@ func getNodeSID(ts *testscript.TestScript) string {
return ts.Getenv("SID")
}

// gnorpcCmd provides RPC query access to the running gnoland node.
// Usage: gnorpc validators
func gnorpcCmd(nodes *NodesManager) func(ts *testscript.TestScript, neg bool, args []string) {
return func(ts *testscript.TestScript, neg bool, args []string) {
if len(args) == 0 {
ts.Fatalf("gnorpc requires a subcommand; supported: validators")
}

sid := getNodeSID(ts)
n, ok := nodes.Get(sid)
if !ok {
ts.Fatalf("gnorpc: node not running")
}

raddr := n.Address()
if raddr == "" {
ts.Fatalf("gnorpc: node has no address")
}

rpcClient, err := rpcclient.NewHTTPClient(raddr)
if err != nil {
tsValidateError(ts, "gnorpc", neg, err)
return
}

switch args[0] {
case "validators":
res, err := rpcClient.Validators(context.Background(), nil)
if err != nil {
tsValidateError(ts, "gnorpc", neg, err)
return
}

for _, v := range res.Validators {
fmt.Fprintf(ts.Stdout(), "%s power=%d\n", v.Address, v.VotingPower)
}
default:
ts.Fatalf("gnorpc: unknown subcommand %q", args[0])
}
}
}

func inputCmd() func(ts *testscript.TestScript, neg bool, args []string) {
return func(ts *testscript.TestScript, neg bool, args []string) {
if neg {
@@ -897,6 +941,24 @@ func inputCmd() func(ts *testscript.TestScript, neg bool, args []string) {
}
}

// sleepCmd pauses execution for the given duration.
// Usage: sleep 1s
func sleepCmd() func(ts *testscript.TestScript, neg bool, args []string) {
return func(ts *testscript.TestScript, neg bool, args []string) {
if neg {
ts.Fatalf("sleep does not support negation")
}
if len(args) != 1 {
ts.Fatalf("sleep requires exactly one argument (e.g., 1s, 500ms)")
}
d, err := time.ParseDuration(args[0])
if err != nil {
ts.Fatalf("sleep: invalid duration %q: %v", args[0], err)
}
time.Sleep(d)
}
}

func tsValidateError(ts *testscript.TestScript, cmd string, neg bool, err error) {
if err != nil {
fmt.Fprintf(ts.Stderr(), "%q error: %+v\n", cmd, err)