Skip to content

Fix isClosing data race#418

Open
dnerdy wants to merge 4 commits into
mainfrom
fix-websocket-client-close-data-race
Open

Fix isClosing data race#418
dnerdy wants to merge 4 commits into
mainfrom
fix-websocket-client-close-data-race

Conversation

@dnerdy
Copy link
Copy Markdown
Contributor

@dnerdy dnerdy commented May 20, 2026

This PR fixes a data race related to the webSocketClient isClosing field. Before this change the field was set in the Close method while holding a lock but read in the listenWebSocket method without holding a lock. The point of the isClosing flag was to prevent sending on errChan after close. Instead of having a hidden dependecy between a flag and the errChan channel state, this PR replaces isClosing with a flag named exitListenWebSocket, and when the flag is set the listenWebSocket closes errChan itself. listenWebSocket is the only goroutine that writes to errChan, so there's no longer any possibility of writing on a closed channel -- the listenWebSocket goroutine now effectively "owns" errChan.

I have:

  • Written a clear PR title and description (above)
  • Signed the Khan Academy CLA
  • Added tests covering my changes, if applicable
  • Included a link to the issue fixed, if applicable
  • Included documentation, for new features
  • Added an entry to the changelog

dnerdy added 4 commits May 19, 2026 17:03
This PR fixes the deadlock introduced in #402. The deadlock was occurring
because the forwardWebSocketData function was holding a lock on the
subscription map (in the listenWebSocket goroutine) while Unsubscribe was
blocked attempting to aquire the same lock. The forwardWebSocketData function
was also blocked attempting to send data on the interfaceChan channel, which
had no readers in the test being run.

After fixing the locking (by making sure the lock isn't held when
forwardDataFunc is called), a race condition cropped up between sending on
interfaceChan (in forwardWebSocketData) and closing interfaceChan in
Unsubscribe. This was fixed by making the listenWebSocket goroutine the "owner"
of the interfaceChan channel -- it is now the only goroutine that sends on and
closes the channel. Other goroutines singal interfaceChan should be closed by
setting `_hasBeenUnsubscribed` on the channel, which is protected by a
short-held lock.

Note that there's a similar data race involving isClosing on webSocketClient.
isClosing is set when Close is called (while holding a lock) but read in
listenWebSocket without a lock. A separate PR will fix this race (so the
`-race` flag can be added when running tests).
This PR fixes a data race related to the webSocketClient isClosing field.
Before this change the field was set in the Close method while holding a lock
but read in the listenWebSocket method without holding a lock. The point of the
isClosing flag was to prevent sending on errChan after close. Instead of having
a hidden dependecy between a flag and the errChan channel state, this PR
replaces isClosing with a flag named exitListenWebSocket, and when the flag is
set the listenWebSocket closes errChan itself. listenWebSocket is the only
goroutine that writes to errChan, so there's no longer any possibility of
writing on a closed channel -- the listenWebSocket goroutine now effectively
"owns" errChan.
Comment thread graphql/websocket.go
conn WSConn
connParams map[string]interface{}
Dialer Dialer
conn WSConn
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Struct fields were reordered due to the following lint error:

graphql/websocket.go:46:22: fieldalignment: struct with 104 pointer bytes could be 80 (govet)

@dnerdy dnerdy mentioned this pull request May 21, 2026
6 tasks
Copy link
Copy Markdown
Collaborator

@benjaminjkraft benjaminjkraft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for tracking this down!

Comment thread graphql/websocket.go
w.exitListenWebSocketMu.Lock()
if w.exitListenWebSocket {
close(w.errChan)
return
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to unlock here?

benjaminjkraft pushed a commit that referenced this pull request May 23, 2026
This PR fixes the deadlock introduced in #402. The deadlock was
occurring because the forwardWebSocketData function was holding a lock
on the subscription map (in the listenWebSocket goroutine) while
Unsubscribe was blocked attempting to aquire the same lock. The
forwardWebSocketData function was also blocked attempting to send data
on the interfaceChan channel, which had no readers in the test being
run.

After fixing the locking (by making sure the lock isn't held when
forwardDataFunc is called), a race condition cropped up between sending
on interfaceChan (in forwardWebSocketData) and closing interfaceChan in
Unsubscribe. This was fixed by making the listenWebSocket goroutine the
"owner" of the interfaceChan channel -- it is now the only goroutine
that sends on and closes the channel. Other goroutines singal
interfaceChan should be closed by setting `_hasBeenUnsubscribed` on the
subscription, which is protected by a short-held lock.

Note that there's a similar data race involving isClosing on
webSocketClient. isClosing is set when Close is called (while holding a
lock) but read in listenWebSocket without a lock. #418 fixes this race
(so the `-race` flag can be added when running tests).

<!--
Thanks for your contribution! Check out the
[contributing
docs](https://github.com/Khan/genqlient/blob/main/docs/CONTRIBUTING.md)
for more on contributing to genqlient.
-->



I have:
- [x] Written a clear PR title and description (above)
- [x] Signed the [Khan Academy CLA](https://www.khanacademy.org/r/cla)
- [ ] Added tests covering my changes, if applicable
- [ ] Included a link to the issue fixed, if applicable
- [x] Included documentation, for new features
- [ ] Added an entry to the changelog
Base automatically changed from fix-402-deadlock to main May 23, 2026 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants