refactor(tm2/rpc): move rpc/core globals into per-node Environment#5574
Open
thehowl wants to merge 4 commits intognolang:masterfrom
Open
refactor(tm2/rpc): move rpc/core globals into per-node Environment#5574thehowl wants to merge 4 commits intognolang:masterfrom
thehowl wants to merge 4 commits intognolang:masterfrom
Conversation
…utdown The txDispatcher's listenRoutine used to panic with "txDispatcher subscription unexpectedly closed" whenever its events subscription channel was closed. That channel is closed legitimately by events.SubscribeFilteredOn during event-switch shutdown: when the listener callback's unbuffered send would block and evsw.Quit() has already fired, the callback takes the <-evsw.Quit() branch and close(ch)es the subscription. This hit integration test CI as a flake (panic: txDispatcher subscription unexpectedly closed) because Node.OnStop stops the event switch without first stopping the package-level gTxDispatcher, and eventSwitch.FireEvent has no stop check — so a late EventTx fired during teardown can still invoke the listener callback and race evsw.Quit() to close the channel. Treat the closed subscription as a clean shutdown signal: stop the dispatcher and return. Additionally stop gTxDispatcher in Node.OnStop before evsw.Stop so the goroutine exits via its own Quit channel in the normal case, and guard SetEventSwitch against leaking a previous dispatcher when the package globals are re-initialized. Regression tests exercise both the directly-closed-subscription path and the full event-switch shutdown race.
rpc/core was a process-wide singleton: 13 package-level variables (stateDB, blockStore, consensusState, mempool, evsw, gTxDispatcher, …) populated once by Node.configureRPC via Set* functions. Running two nodes in one process (INMEMORY_TS integration tests, in-process tests using TestingInMemoryNode with t.Parallel, multiple gnodev instances) clobbered these globals: whoever called Set* last owned every handler. RPC calls on node A could route through node B's state. This replaces the globals with a per-node *core.Environment that owns all the state the 23 RPC handlers read. Each handler becomes a method on *Environment; Routes becomes (*Environment).Routes(unsafe bool) which builds a route map from bound method values. rpcserver.NewRPCFunc accepts method values unchanged — it uses reflect.ValueOf on any func interface, and bound methods satisfy that just like package functions. Node owns one Environment (built in configureRPC, Started in OnStart, Stopped in OnStop). The fix PR's defensive Stop/SetEventSwitch guard code is deleted — the race it patched can no longer occur because the txDispatcher is a field on Environment with a deterministic lifecycle tied to the Node, not a global that survives across Node stops. client.NewLocal now takes an *Environment (Node.RPCEnvironment() returns it). Two gnodev callers updated accordingly. Test files in rpc/core (blocks_test, status_test, tx_test) construct Environment literals with mock fields instead of calling Set*. The mempool_test regression test from gnolang#5561 is unchanged — it still exercises txDispatcher directly. Adds environment_test.go with: - TestEnvironment_IsolatedBetweenInstances: two Environments with different mock BlockStores hammered concurrently; asserts no state aliasing. Would fail loudly under the old globals model. - TestEnvironment_StartStopIdempotent: Start/Stop are safe to call repeatedly and on Environments without an EventSwitch. - TestEnvironment_BroadcastTxCommitRejectsUnstarted: BroadcastTxCommit returns a clear error instead of nil-derefing when called on an Environment that has no dispatcher.
Collaborator
🛠 PR Checks SummaryAll Automated Checks passed. ✅ Manual Checks (for Reviewers):
Read More🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers. ✅ Automated Checks (for Contributors):🟢 Maintainers must be able to edit this pull request (more info) ☑️ Contributor Actions:
☑️ Reviewer Actions:
📚 Resources:Debug
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
- Environment.Stop() is now a one-way door: after Stop(), Start() panics rather than silently succeeding and creating a second live dispatcher. A stopped bool guards this; Stop() checks it for idempotency so calling Stop twice is still safe. - Environment.Stop() panics (instead of discarding) any error from txDispatcher.Stop(), consistent with newTxDispatcher's own panic-on-error. - Rename heightMismatchErr → heightMismatchError to satisfy the errname lint rule; include want/got values in the error message. - Add t.Parallel() to TestTxHandler subtests (tparallel lint); the parent already called t.Parallel() after the globals removal, so subtests must follow.
Member
Author
|
cc/ @omarsy, this should also fix the issue about the globals in testing we mentioned. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on top of #5561 — that PR adds defensive scaffolding around the
txDispatchershutdown race; this one deletes the scaffolding and fixes the underlying design. Review only the top commit (`4be1a4a51`). If #5561 lands first, this rebases to just that one commit.Summary
Replaces the 13 package-level globals in `tm2/pkg/bft/rpc/core` with a per-node `*core.Environment`. Every RPC handler becomes a method on `*Environment`; `Node` owns one Environment per instance. Eliminates the process-wide singleton design that made two in-process nodes clobber each other's state.
Why
`rpc/core` was architected as a singleton: `pipe.go` held `stateDB`, `blockStore`, `consensusState`, `mempool`, `evsw`, `gTxDispatcher`, … as package vars; `Node.configureRPC()` wrote them via 13 `Set*` calls. Running two nodes in one process — `INMEMORY_TS=1` integration tests, direct in-process tests using `TestingInMemoryNode` with `t.Parallel`, multiple `gnodev` instances — clobbered those globals: whoever called `Set*` last owned every handler. A `BroadcastTxCommit` on node A could route through node B's dispatcher.
#5561 added a defensive `Stop()` and `SetEventSwitch` guard to mitigate the specific shutdown-race flake this caused in CI. This PR removes the design that made the flake possible. The scaffolding from #5561 is deleted because the race it patched can no longer occur: `txDispatcher` is now a field on `Environment` with a deterministic lifecycle tied to the Node, not a global that outlives `Node.Stop()`.
Design
Name-collision note: the `Environment.Consensus` field (interface type `Consensus`) is named `Consensus` rather than `ConsensusState` to avoid colliding with the `ConsensusState` handler method.
Tests
Test plan
Diff shape
19 files changed, +593 / -1626 (big net-negative because the large per-handler HTTP example blocks in the old `mempool.go`, `blocks.go`, `consensus.go`, `status.go`, `net.go` got trimmed — those docs belong in a generated swagger file, not in source comments).