Skip to content

fix(connection): reload onRequest in onProcess to avoid dead loop#425

Open
014-code wants to merge 1 commit into
cloudwego:mainfrom
014-code:fix/onprocess-stale-handler-dead-loop
Open

fix(connection): reload onRequest in onProcess to avoid dead loop#425
014-code wants to merge 1 commit into
cloudwego:mainfrom
014-code:fix/onprocess-stale-handler-dead-loop

Conversation

@014-code

@014-code 014-code commented Jul 3, 2026

Copy link
Copy Markdown

What type of PR is this?

  • Bug fix
  • Feature
  • Refactor
  • Optimization
  • Documentation
  • Test
  • Other

What this PR does / why we need it:

Fixes the dead-loop described in #421. When SetOnRequest is called from inside the OnConnect callback (or from inside a previous OnRequest call), the task goroutine spawned by onProcess keeps using the stale handler snapshot captured as a closure parameter. If the old handler (typically a no-op placeholder) fails to consume the input buffer, the for loop at the START: label never breaks, the processing lock is never released, and the goroutine spins forever, effectively killing the connection.

Root cause

In connection_onevent.go::onConnect, the handler is loaded once from the atomic before onProcess is called:

onRequest, _ := c.onRequestCallback.Load().(OnRequest) // snapshot
c.onProcess(onConnect, onRequest)

Inside onProcess, the snapshot is captured by the task closure and reused:

START:
    if onRequest != nil && c.Reader().Len() > 0 {
        _ = onRequest(c.ctx, c) // stale
    }
    for {
        ...
        _ = onRequest(c.ctx, c) // stale
    }

When the user calls conn.SetOnRequest(newHandler) from inside the OnConnect callback, the snapshot is no longer current, but the task continues to invoke the old one. A no-op placeholder never consumes the buffered data, c.Reader().Len() > 0 stays true, the loop spins, and the lock is never released.

Fix

Reload the latest handler from c.onRequestCallback at the entry of START: and after each onRequest call inside the processing loop, so any handler swap is picked up immediately. No API change, no signature change, ~8 lines added.

START:
    onRequest, _ = c.onRequestCallback.Load().(OnRequest)
    if onRequest != nil && c.Reader().Len() > 0 {
        _ = onRequest(c.ctx, c)
    }
    for {
        ...
        _ = onRequest(c.ctx, c)
        onRequest, _ = c.onRequestCallback.Load().(OnRequest) // pick up SetOnRequest
    }

Trigger conditions

The bug was reproducible when all of the following held:

  1. A placeholder OnRequest was installed via NewEventLoop(...) or an early SetOnRequest.
  2. OnConnect was configured via WithOnConnect.
  3. OnConnect called conn.SetOnRequest(realHandler) to swap the placeholder.
  4. TCP data was already buffered in inputBuffer (or arrived) before the task reached START: — common under load, or when OnConnect did any non-trivial work (handshake, RPC endpoint construction, ~10ms+ latency).

Impact when triggered

  • The wrong (old, no-op) handler is invoked.
  • The for loop never breaks → goroutine dead loop, 100% CPU.
  • The processing lock is never released → the connection becomes a black hole; all subsequent bytes are dropped.

Side benefit

The same re-read in the for-loop body also fixes a symmetric case where SetOnRequest is called from inside a regular OnRequest handler, not just from OnConnect. The original code had the same staleness problem on that path too; the reload now handles both uniformly.

Performance

  • One extra atomic.Value.Load per START: entry and per for-loop iteration. Cost is a single pointer read with atomic semantics (~1ns on amd64), negligible compared to the user OnRequest call (µs–ms).
  • No new heap allocation, no new lock, no false sharing.

Which issue(s) this PR fixes:

Fixes #421

Checklist

  • Tests pass locally (go build ./..., go vet ./... on linux/amd64)
  • Code is formatted with gofmt
  • Follows the project's Conventional Commits style
  • Added/updated tests (recommend porting the reproducer in onProcess 死循环风险 #421 into connection_test.go and adding a -race run to CI)
  • Documentation is up to date

… "When SetOnRequest is called from inside the onConnect callback (or from

inside a previous onRequest call) the task goroutine spawned by
onProcess kept using the stale handler snapshot captured as a closure
parameter. If the old handler (typically a no-op placeholder) failed to
consume the input buffer, the for-loop at the START label would never
break, the processing lock would never be released, and the goroutine
would spin forever, effectively killing the connection.

This change reloads the latest handler from c.onRequestCallback at the
entry of START and at the end of each onRequest call inside the
processing loop, so any handler swap made by the user is picked up
immediately.

Closes cloudwego#421
@014-code 014-code requested review from a team as code owners July 3, 2026 04:58
@CLAassistant

CLAassistant commented Jul 3, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

onProcess 死循环风险

2 participants