
testing: Improve lsp shutdown logic & address test race conditions#1943

Open
charlieegan3 wants to merge 1 commit into open-policy-agent:main from charlieegan3:server-test-shutdown-refactor

Conversation

@charlieegan3
Contributor

I have been trying to make test suite flakes less common. I am not 100% sure I have caught all the issues, but these changes make the flakes and races appear much less often in my local testing. The main changes:

  • Add proper shutdown with a timeout and worker synchronization, and refactor the server test helper to manage shutdown of server instances centrally. This addresses issues where workers continued writing to pipes after they had been closed.
  • Address aggregate data races (ast.Object sharing) by taking defensive copies when setting aggregate data, since ast.Object.Insert() performs non-atomic map/slice operations.
  • Address channel blocking issues and timeouts, e.g. in TestTemplateWorkerRaceCondition.
  • Address config access races by locking when accessing workspaceRootURI.
  • Address a fixer state mutation race by creating a local copy of OPAFmtOpts before mutation.

I have been testing with:

bash -c 'i=1; while go test -race ./... -count=1; do echo "Completed iteration $i"; i=$((i+1)); done'

@charlieegan3 charlieegan3 force-pushed the server-test-shutdown-refactor branch from 9bce75a to cca87df on April 9, 2026 15:54
@charlieegan3 charlieegan3 changed the title from "Improve server shutdown logic address test race conditions" to "testing: Improve server shutdown logic address test race conditions" on Apr 9, 2026
@charlieegan3 charlieegan3 changed the title from "testing: Improve server shutdown logic address test race conditions" to "testing: Improve lsp shutdown logic & address test race conditions" on Apr 9, 2026
// Shutdown waits for all worker goroutines to complete. The context can be
// used to set a timeout or cancel the wait if workers take too long to exit.
// The context passed to workers should be cancelled before calling this method.
func (l *LanguageServer) Shutdown(ctx context.Context) error {
Contributor Author


might be a better name for this...


continue
}
func (l *LanguageServer) StartDiagnosticsWorker(ctx context.Context) {
Contributor Author


Signed-off-by: Charlie Egan <charlie_egan@apple.com>
@charlieegan3 charlieegan3 force-pushed the server-test-shutdown-refactor branch from cca87df to bb184a5 on April 9, 2026 16:06
Member

@anderseknert anderseknert left a comment


Some great work here Charlie! 👏

Dropped a few comments, but mainly the aggregate copying I'm worried about. Besides that, huge improvement!

// startWorkspaceJobRouter routes workspace linting jobs with rate limiting.
// It listens on l.lintWorkspaceJobs and forwards to workspaceLintRuns,
// implementing backpressure for aggregate-only reports to prevent performance degradation.
func startWorkspaceJobRouter(ctx context.Context, l *LanguageServer, workspaceLintRuns chan<- lintWorkspaceJob) {
Member


"WorkspaceJobRouter" is so generic I'd have to look up what this does. Could we name it something that includes lint/linting?

// violations on character changes. Since these happen so
// frequently, we stop adding to the channel if there are
// already jobs queued, to preserve performance
if job.AggregateReportOnly && len(workspaceLintRuns) > 10/2 {
Member


> 10/2

🤨
What do these numbers represent? And why not write it as 5?

// if there are no parsed modules in the cache, then there is
// no need to run the aggregate report. This can happen if the
// server is very slow to start up.
if len(l.cache.GetAllModules()) == 0 {
Member


May be worthy of a debug log event

@@ -0,0 +1,176 @@
package lsp
Member


Very nice to move these out!

}

for fileURI := range l.cache.GetAllFiles() {
l.sendFileDiagnostics(ctx, fileURI)
Member


I suppose this isn't new code, but what's the difference between updating diagnostics — like we just did above — and sending diagnostics like we do here? 🤔

t.Fatalf("timed out waiting for file diagnostics to be sent")
}
}
waitForDiagnostics(t, receivedMessages, mainRegoFileURI, []string{"opa-fmt"}, timeout)
Member


Much better!

// scenario.
func TestLanguageServerMultipleFiles(t *testing.T) {
// TODO: this test has been flaky and we need to skip it until we have time to look deeper into why
t.Skip()
Member


Opening up the sarcophagus!

// Return a defensive copy to prevent a time-of-check-time-of-use (TOCTOU) race.
// Without this copy, callers could access the returned object while another
// goroutine calls Set(), which mutates co.o via Insert().
return co.o.Copy()
Member


This will be incredibly expensive for large sets of aggregated data. Have you run any benchmarks on this? I get why not doing it could cause a race, but perhaps we can find a way to ensure we avoid it that doesn't involve deep-copying all aggregates. Without having looked into the details, I wonder if we could use the OPA store for this instead of a custom cache. It's made for AST objects after all, and ensures safety through transactions 🤔
