Skip to content

fix(gnoland): detect validator changes via EndTxHook for same-block processing#5556

Open
thehowl wants to merge 6 commits intognolang:masterfrom
thehowl:dev/morgan/endblock-same-block-validators
Open

fix(gnoland): detect validator changes via EndTxHook for same-block processing#5556
thehowl wants to merge 6 commits intognolang:masterfrom
thehowl:dev/morgan/endblock-same-block-validators

Conversation

@thehowl
Copy link
Copy Markdown
Member

@thehowl thehowl commented Apr 20, 2026

Fixes the validator-changes-lost-on-restart bug (originally reported in #5469) with a cleaner approach: instead of an in-memory event collector that listened post-commit, we hook into EndTxHook which fires synchronously during DeliverTx.

What changes

  • EndTxHook scans each delivered tx's ABCI events for ValidatorAdded/ValidatorRemoved from r/sys/validators/v2. If any are found, a hasValEvent flag is set.
  • EndBlocker checks the flag (and resets it). If set, it calls GetChanges(req.Height, req.Height) — using the current block height rather than the previously-committed height.
  • deliverState context holds all uncommitted writes from this block's txs, so QueryEval sees the validator changes immediately, in the same block they were made.
  • Restart safety: because EndBlock(N) now returns ValidatorUpdates containing the changes from block N, Tendermint persists them in its own state DB. On restart, Tendermint restores the validator set from that DB — no special recovery code needed.
  • Deleted events.go: the generic collector[T] type is gone entirely.

Why this is better than the firstBlock flag (from #5469)

The firstBlock approach would query GetChanges(lastHeight, lastHeight) on the first block after restart, which queries for changes already applied to consensus. The EndTxHook approach eliminates the one-block delay entirely and makes restarts inherently correct without any special-case logic.

ADR

gno.land/adr/pr5469_restart_validator_changes.md

Testing

  • All TestEndBlocker unit tests pass.
  • restart_validators.txtar integration test: adds a validator via GovDAO, restarts the node, then uses gnorpc validators -wait to confirm the validator appears in the consensus set.

Note

This PR includes the gnorpc testscript command from #5469 (by @omarsy), which was needed for the integration test. If #5469 is merged first, that part can be dropped.

Generated with Claude Code — AI assistance disclosed per AGENTS.md.

@Gno2D2
Copy link
Copy Markdown
Collaborator

Gno2D2 commented Apr 20, 2026

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)
Read More

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: thehowl/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 54.21687% with 38 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
gno.land/pkg/integration/testscript_gnoland.go 48.57% 30 Missing and 6 partials ⚠️
tm2/pkg/sdk/baseapp.go 60.00% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

…rocessing

Replace the in-memory event collector (which listened on EventSwitch
post-commit, causing a one-block delay and losing state on restart) with
an EndTxHook that scans ABCI events synchronously during DeliverTx.

The EndBlocker now calls GetChanges(req.Height, req.Height) using the
current block height, querying changes from the deliver-state context
which reflects all uncommitted txs in the block. Because EndBlock(N)
returns ValidatorUpdates that Tendermint persists in its own state DB,
the validator set survives restarts without any special recovery logic.

Deletes the generic collector[T] / events.go infrastructure entirely.

Adds a gnorpc testscript command and a txtar integration test that
confirms validators added via GovDAO survive a node restart.

Co-authored-by: Omar Sy <syomar.pro@gmail.com>
@thehowl thehowl force-pushed the dev/morgan/endblock-same-block-validators branch from 4fcf79c to d7d62c6 Compare April 20, 2026 15:45
… closure flag

Add BlockEvents() []abci.Event to BaseApp: accumulated from successful
DeliverTx calls, reset at BeginBlock. The EndBlocker now reads block
events directly via the endBlockerApp interface rather than relying on
a shared bool captured in a closure between EndTxHook and EndBlocker.

This removes the hasValEvent flag and the EndTxHook validator-detection
logic, replacing it with a clean pull model. The CommitGnoTransactionStore
call remains in EndTxHook as before.

Co-authored-by: Omar Sy <syomar.pro@gmail.com>
@github-actions github-actions Bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Apr 20, 2026
@thehowl thehowl requested review from omarsy and tbruyelle and removed request for omarsy April 21, 2026 12:49
@omarsy
Copy link
Copy Markdown
Member

omarsy commented Apr 21, 2026

I'd assumed applying the consensus change at N+1 was a protocol requirement. Re-reading, it looks like the N+1 delay was a side effect of the EventSwitch collector firing post-commit, not a deliberate constraint — happy to be corrected if I'm misreading.

If that's right, this PR still lives in the same substrate #5488 flags: realm write → event → scan → cross-realm QueryEval → regex-parse. #5469, #5556, and the crash that prompted #5488 each fix a different bug in that pipeline, which made me wonder whether tx-level handling (as #5488 proposes) makes the whole class go away.

Possible sketch (may be wrong)

Native function callable only from r/sys/validators/v2, backed by a ValidatorKeeper on a standard SDK KVStore:

func setValidator(ctx, addr, pubKey, power) {
    assertCaller(ctx, valRealm)
    if err := valKeeper.Validate(ctx, update); err != nil {
        panic(err) // caught by runTx → tx fails, node stays up
    }
    valKeeper.QueueUpdate(ctx, update)
}

Validation would run inside the tx's panic-recovery scope, so an invalid change fails that tx rather than crashing every node at updateState() (the fix #5488 is asking for). Queued updates ride the SDK's standard store semantics — no explicit commit hook like the VM store needs. EndBlock drains:

return abci.ResponseEndBlock{ValidatorUpdates: valKeeper.Drain(ctx)}

WDYT ?

@thehowl
Copy link
Copy Markdown
Member Author

thehowl commented Apr 22, 2026

I'd assumed applying the consensus change at N+1 was a protocol requirement. Re-reading, it looks like the N+1 delay was a side effect of the EventSwitch collector firing post-commit, not a deliberate constraint — happy to be corrected if I'm misreading.

Yes, that's my understanding too. It was done that way because of how eventswitch worked, but really there was no need for the event switch all along.

#5469, #5556, and the crash that prompted #5488 each fix a different bug in that pipeline, which made me wonder whether tx-level handling (as #5488 proposes) makes the whole class go away.

here's my understanding, and what prompted me to propose this PR:

  • a validator change can be dropped by a node if it is stopped after committing block N, because after restart the event collector in the N+1 endblocker would be empty, while normally it would have the events of N

The proposal you make seems to me to add complexity while I don't understand your criticism of this PR's approach

@omarsy
Copy link
Copy Markdown
Member

omarsy commented Apr 22, 2026

The proposal you make seems to me to add complexity while I don't understand your criticism of this PR's approach

It is more of a proposal. My point was that since we're already refactoring this, it seemed like a good moment to also address #5488. The approach I suggested would fix both.

@thehowl
Copy link
Copy Markdown
Member Author

thehowl commented Apr 22, 2026

Looks like #5485 was already attempting to move in the direction of using parameters

@moul
Copy link
Copy Markdown
Member

moul commented Apr 23, 2026

#5485 is the way :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants