testing: Improve lsp shutdown logic & address test race conditions #1943
charlieegan3 wants to merge 1 commit into open-policy-agent:main
Conversation
// Shutdown waits for all worker goroutines to complete. The context can be
// used to set a timeout or cancel the wait if workers take too long to exit.
// The context passed to workers should be cancelled before calling this method.
func (l *LanguageServer) Shutdown(ctx context.Context) error {
might be a better name for this...
        continue
    }

func (l *LanguageServer) StartDiagnosticsWorker(ctx context.Context) {
I have been trying to make test suite flakes less common. I am not 100% sure I have caught all the issues, but these changes make the flakes and races appear much less often in my local testing. The main things done:

- Add proper shutdown with timeout and worker synchronization. Refactor the server test helper to manage the shutdown of server instances centrally. This helps address issues where workers were continuing to write to pipes after closing.
- Address aggregate data races (ast.Object sharing) by using defensive copies when setting aggregate data. ast.Object.Insert() contains non-atomic map/slice operations.
- Address channel blocking issues and timeouts, e.g. in TestTemplateWorkerRaceCondition.
- Fix config access races by locking when accessing workspaceRootURI.
- Fix a fixer state mutation race by creating a local copy of OPAFmtOpts before mutation.

Signed-off-by: Charlie Egan <charlie_egan@apple.com>
anderseknert
left a comment
Some great work here Charlie! 👏
Dropped a few comments, but mainly the aggregate copying I'm worried about. Besides that, huge improvement!
// startWorkspaceJobRouter routes workspace linting jobs with rate limiting.
// It listens on l.lintWorkspaceJobs and forwards to workspaceLintRuns,
// implementing backpressure for aggregate-only reports to prevent performance degradation.
func startWorkspaceJobRouter(ctx context.Context, l *LanguageServer, workspaceLintRuns chan<- lintWorkspaceJob) {
"WorkspaceJobRouter" is so generic I'd have to look up what this does. Could we name it something that includes lint/linting?
// violations on character changes. Since these happen so
// frequently, we stop adding to the channel if there are
// already jobs queued, to preserve performance
if job.AggregateReportOnly && len(workspaceLintRuns) > 10/2 {
> 10/2
🤨
What do these numbers represent? And why not write it as 5?
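The backpressure idea under discussion can be expressed with the threshold tied to the channel's capacity instead of a bare `10/2`. This is an illustrative sketch, not the PR's code: `lintJob` and `tryEnqueue` are hypothetical names, and it assumes the queue is a buffered channel whose capacity defines "half full".

```go
package main

import "fmt"

// lintJob is a hypothetical stand-in for the server's lint job type.
type lintJob struct {
	AggregateReportOnly bool
}

// tryEnqueue applies backpressure: aggregate-only jobs are dropped once the
// queue is more than half full, written as cap(queue)/2 so the relationship
// to the channel's capacity is explicit.
func tryEnqueue(queue chan lintJob, job lintJob) bool {
	if job.AggregateReportOnly && len(queue) > cap(queue)/2 {
		return false // drop; an already-queued run will refresh aggregates anyway
	}
	select {
	case queue <- job:
		return true
	default:
		return false // queue completely full: drop rather than block the caller
	}
}

func main() {
	queue := make(chan lintJob, 10)
	for i := 0; i < 6; i++ {
		queue <- lintJob{}
	}
	// Six of ten slots used, so aggregate-only jobs are rejected.
	fmt.Println(tryEnqueue(queue, lintJob{AggregateReportOnly: true}))
	// A full-report job is still accepted.
	fmt.Println(tryEnqueue(queue, lintJob{AggregateReportOnly: false}))
}
```

Using `len()` on a buffered channel is a cheap, racy-but-safe heuristic here: the count may be stale by the time the send happens, but the worst case is one extra or one fewer aggregate run, which is acceptable for a rate limiter.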
// if there are no parsed modules in the cache, then there is
// no need to run the aggregate report. This can happen if the
// server is very slow to start up.
if len(l.cache.GetAllModules()) == 0 {
May be worthy of a debug log event
@@ -0,0 +1,176 @@
package lsp
Very nice to move these out!
}

for fileURI := range l.cache.GetAllFiles() {
    l.sendFileDiagnostics(ctx, fileURI)
I suppose this isn't new code, but what's the difference between updating diagnostics — like we just did above — and sending diagnostics like we do here? 🤔
        t.Fatalf("timed out waiting for file diagnostics to be sent")
    }
}

waitForDiagnostics(t, receivedMessages, mainRegoFileURI, []string{"opa-fmt"}, timeout)
// scenario.
func TestLanguageServerMultipleFiles(t *testing.T) {
    // TODO: this test has been flaky and we need to skip it until we have time to look deeper into why
    t.Skip()
Opening up the sarcophagus!
// Return a defensive copy to prevent a time-of-check-time-of-use (TOCTOU) race.
// Without this copy, callers could access the returned object while another
// goroutine calls Set(), which mutates co.o via Insert().
return co.o.Copy()
This will be incredibly expensive for large sets of aggregated data. Have you run any benchmarks on this? I get why not doing it could cause a race, but perhaps we can find a way to ensure we avoid it that doesn't involve deep-copying all aggregates. Without having looked into the details, I wonder if we could use the OPA store for this instead of a custom cache. It's made for AST objects after all, and ensures safety through transactions 🤔
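The copy-on-read approach being weighed here can be sketched with a plain map standing in for `ast.Object`. This is an illustration of the pattern, not the PR's code: `concurrentObject` and its methods are hypothetical names, and a real implementation would call `ast.Object.Copy()` where this sketch copies a map.

```go
package main

import (
	"fmt"
	"sync"
)

// concurrentObject sketches the copy-on-read pattern, using a plain map as a
// stand-in for ast.Object. All names here are illustrative.
type concurrentObject struct {
	mu sync.RWMutex
	o  map[string]string
}

// Set mutates the shared object under the write lock.
func (c *concurrentObject) Set(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.o == nil {
		c.o = make(map[string]string)
	}
	c.o[k] = v
}

// Get returns a defensive copy so callers can never observe a concurrent
// Set. The O(n) copy per read is exactly the cost the review comment
// questions for large aggregate sets.
func (c *concurrentObject) Get() map[string]string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cp := make(map[string]string, len(c.o))
	for k, v := range c.o {
		cp[k] = v
	}
	return cp
}

func main() {
	co := &concurrentObject{}
	co.Set("pkg", "data.example")

	// Mutating the snapshot does not affect the shared object.
	snapshot := co.Get()
	snapshot["pkg"] = "mutated locally"
	fmt.Println(co.Get()["pkg"])
}
```

The trade-off the reviewer raises is real: the lock alone already prevents the data race, and the deep copy only protects callers who hold the result across later writes. Alternatives such as immutable snapshots swapped atomically, or a transactional store as suggested, avoid paying the copy on every read.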
I have been testing with: