Skip to content

refactor(tm2/rpc): move rpc/core globals into per-node Environment#5574

Open
thehowl wants to merge 4 commits intognolang:masterfrom
thehowl:dev/morgan/rpccore-environment
Open

refactor(tm2/rpc): move rpc/core globals into per-node Environment#5574
thehowl wants to merge 4 commits intognolang:masterfrom
thehowl:dev/morgan/rpccore-environment

Conversation

@thehowl
Copy link
Copy Markdown
Member

@thehowl thehowl commented Apr 22, 2026

Stacked on top of #5561 — that PR adds defensive scaffolding around the txDispatcher shutdown race; this one deletes the scaffolding and fixes the underlying design. Review only the top commit (`4be1a4a51`). If #5561 lands first, this rebases to just that one commit.

Summary

Replaces the 13 package-level globals in `tm2/pkg/bft/rpc/core` with a per-node `*core.Environment`. Every RPC handler becomes a method on `*Environment`; `Node` owns one Environment per instance. Eliminates the process-wide singleton design that made two in-process nodes clobber each other's state.

Why

`rpc/core` was architected as a singleton: `pipe.go` held `stateDB`, `blockStore`, `consensusState`, `mempool`, `evsw`, `gTxDispatcher`, … as package vars; `Node.configureRPC()` wrote them via 13 `Set*` calls. Running two nodes in one process — `INMEMORY_TS=1` integration tests, direct in-process tests using `TestingInMemoryNode` with `t.Parallel`, multiple `gnodev` instances — clobbered those globals: whoever called `Set*` last owned every handler. A `BroadcastTxCommit` on node A could route through node B's dispatcher.

#5561 added a defensive `Stop()` and `SetEventSwitch` guard to mitigate the specific shutdown-race flake this caused in CI. This PR removes the design that made the flake possible. The scaffolding from #5561 is deleted because the race it patched can no longer occur: `txDispatcher` is now a field on `Environment` with a deterministic lifecycle tied to the Node, not a global that outlives `Node.Stop()`.

Design

  • `Environment` struct (`tm2/pkg/bft/rpc/core/pipe.go`) holds the 13 dep fields plus the internal `txDispatcher`. `Config` is a value field (not pointer) to preserve today's copy-by-value semantics for `TimeoutBroadcastTxCommit`.
  • Lifecycle: `(*Environment).Start()` creates and starts the `txDispatcher` iff `EventSwitch != nil`; idempotent. `(*Environment).Stop()` tears it down. `Node.OnStart` calls `env.Start()`; `Node.OnStop` calls `env.Stop()` before `n.evsw.Stop()` so the dispatcher exits via its own Quit channel.
  • Routes: `(*Environment).Routes(unsafe bool) map[string]*rpc.RPCFunc` returns a fresh route map bound to this Environment's methods. `rpcserver.NewRPCFunc` uses `reflect.ValueOf(f)`; method values satisfy that like any free function.
  • Local client: `client.NewLocal(env *core.Environment) *Local`. Two `contribs/gnodev` callers updated (`pkg/dev/node.go` binds in `Reset`, `accounts.go` receives the devnode).
  • Handler signatures unchanged: first argument is still `*rpctypes.Context`, so the JSON-RPC wire shape is identical.

Name-collision note: the `Environment.Consensus` field (interface type `Consensus`) is named `Consensus` rather than `ConsensusState` to avoid colliding with the `ConsensusState` handler method.

Tests

  • `tm2/pkg/bft/rpc/core/environment_test.go` (new):
    • `TestEnvironment_IsolatedBetweenInstances`: two Environments with different mock `BlockStore`s hammered in parallel; asserts no state aliasing. Would fail under the old globals design.
    • `TestEnvironment_StartStopIdempotent`: Start/Stop safe to call repeatedly, including on an Environment without an EventSwitch.
    • `TestEnvironment_BroadcastTxCommitRejectsUnstarted`: returns a clear error instead of nil-derefing.
  • Existing `mempool_test.go` (from fix(tm2/rpc): don't panic when txDispatcher subscription closes on shutdown #5561) still exercises `txDispatcher` directly — unchanged.
  • `blocks_test`, `status_test`, `tx_test`: converted to Environment literals.

Test plan

  • `go test ./tm2/pkg/bft/rpc/core/... -race -short -count=3`
  • `go test ./tm2/pkg/bft/node/... -short -timeout=3m`
  • `go test ./gno.land/cmd/... ./gno.land/pkg/gnoland/... -short -timeout=5m`
  • `go test ./gno.land/pkg/integration/... -run TestMalformedTypeURL -count=3`
  • `go test ./contribs/gnodev/... -short -timeout=3m`
  • CI: full matrix
  • Manual: `INMEMORY_TS=true make -C gno.land test` under parallel load

Diff shape

19 files changed, +593 / -1626 (big net-negative because the large per-handler HTTP example blocks in the old `mempool.go`, `blocks.go`, `consensus.go`, `status.go`, `net.go` got trimmed — those docs belong in a generated swagger file, not in source comments).

thehowl added 3 commits April 21, 2026 15:24
…utdown

The txDispatcher's listenRoutine used to panic with "txDispatcher
subscription unexpectedly closed" whenever its events subscription channel
was closed. That channel is closed legitimately by
events.SubscribeFilteredOn during event-switch shutdown: when the listener
callback's unbuffered send would block and evsw.Quit() has already fired,
the callback takes the <-evsw.Quit() branch and close(ch)es the
subscription.

This hit integration test CI as a flake (panic: txDispatcher subscription
unexpectedly closed) because Node.OnStop stops the event switch without
first stopping the package-level gTxDispatcher, and eventSwitch.FireEvent
has no stop check — so a late EventTx fired during teardown can still
invoke the listener callback and race evsw.Quit() to close the channel.

Treat the closed subscription as a clean shutdown signal: stop the
dispatcher and return. Additionally stop gTxDispatcher in Node.OnStop
before evsw.Stop so the goroutine exits via its own Quit channel in the
normal case, and guard SetEventSwitch against leaking a previous
dispatcher when the package globals are re-initialized.

Regression tests exercise both the directly-closed-subscription path and
the full event-switch shutdown race.
rpc/core was a process-wide singleton: 13 package-level variables
(stateDB, blockStore, consensusState, mempool, evsw, gTxDispatcher, …)
populated once by Node.configureRPC via Set* functions. Running two
nodes in one process (INMEMORY_TS integration tests, in-process tests
using TestingInMemoryNode with t.Parallel, multiple gnodev instances)
clobbered these globals: whoever called Set* last owned every handler.
RPC calls on node A could route through node B's state.

This replaces the globals with a per-node *core.Environment that owns
all the state the 23 RPC handlers read. Each handler becomes a method
on *Environment; Routes becomes (*Environment).Routes(unsafe bool)
which builds a route map from bound method values. rpcserver.NewRPCFunc
accepts method values unchanged — it uses reflect.ValueOf on any func
interface, and bound methods satisfy that just like package functions.

Node owns one Environment (built in configureRPC, Started in OnStart,
Stopped in OnStop). The fix PR's defensive Stop/SetEventSwitch guard
code is deleted — the race it patched can no longer occur because the
txDispatcher is a field on Environment with a deterministic lifecycle
tied to the Node, not a global that survives across Node stops.

client.NewLocal now takes an *Environment (Node.RPCEnvironment()
returns it). Two gnodev callers updated accordingly.

Test files in rpc/core (blocks_test, status_test, tx_test) construct
Environment literals with mock fields instead of calling Set*. The
mempool_test regression test from gnolang#5561 is unchanged — it still
exercises txDispatcher directly.

Adds environment_test.go with:
  - TestEnvironment_IsolatedBetweenInstances: two Environments with
    different mock BlockStores hammered concurrently; asserts no
    state aliasing. Would fail loudly under the old globals model.
  - TestEnvironment_StartStopIdempotent: Start/Stop are safe to call
    repeatedly and on Environments without an EventSwitch.
  - TestEnvironment_BroadcastTxCommitRejectsUnstarted: BroadcastTxCommit
    returns a clear error instead of nil-derefing when called on an
    Environment that has no dispatcher.
@Gno2D2
Copy link
Copy Markdown
Collaborator

Gno2D2 commented Apr 22, 2026

🛠 PR Checks Summary

All Automated Checks passed. ✅

Manual Checks (for Reviewers):
  • IGNORE the bot requirements for this PR (force green CI check)
Read More

🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers.

✅ Automated Checks (for Contributors):

🟢 Maintainers must be able to edit this pull request (more info)

☑️ Contributor Actions:
  1. Fix any issues flagged by automated checks.
  2. Follow the Contributor Checklist to ensure your PR is ready for review.
    • Add new tests, or document why they are unnecessary.
    • Provide clear examples/screenshots, if necessary.
    • Update documentation, if required.
    • Ensure no breaking changes, or include BREAKING CHANGE notes.
    • Link related issues/PRs, where applicable.
☑️ Reviewer Actions:
  1. Complete manual checks for the PR, including the guidelines and additional checks if applicable.
📚 Resources:
Debug
Automated Checks
Maintainers must be able to edit this pull request (more info)

If

🟢 Condition met
└── 🟢 And
    ├── 🟢 The base branch matches this pattern: ^master$
    └── 🟢 The pull request was created from a fork (head branch repo: thehowl/gno)

Then

🟢 Requirement satisfied
└── 🟢 Maintainer can modify this pull request

Manual Checks
**IGNORE** the bot requirements for this PR (force green CI check)

If

🟢 Condition met
└── 🟢 On every pull request

Can be checked by

  • Any user with comment edit permission

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

- Environment.Stop() is now a one-way door: after Stop(), Start() panics
  rather than silently succeeding and creating a second live dispatcher.
  A stopped bool guards this; Stop() checks it for idempotency so calling
  Stop twice is still safe.
- Environment.Stop() panics (instead of discarding) any error from
  txDispatcher.Stop(), consistent with newTxDispatcher's own panic-on-error.
- Rename heightMismatchErr → heightMismatchError to satisfy the errname lint
  rule; include want/got values in the error message.
- Add t.Parallel() to TestTxHandler subtests (tparallel lint); the parent
  already called t.Parallel() after the globals removal, so subtests must
  follow.
@thehowl thehowl marked this pull request as ready for review April 22, 2026 16:36
@thehowl
Copy link
Copy Markdown
Member Author

thehowl commented Apr 22, 2026

cc/ @omarsy, this should also fix the issue about the globals in testing we mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants