Skip to content

fix(auth): deduplicate concurrent IsAuthenticated() balloon notifications#1183

Open
bastiandoetsch wants to merge 35 commits intorefactor/IDE-1786_folder-config-refactoringfrom
fix/auth-duplicate-balloon-notification
Open

fix(auth): deduplicate concurrent IsAuthenticated() balloon notifications#1183
bastiandoetsch wants to merge 35 commits intorefactor/IDE-1786_folder-config-refactoringfrom
fix/auth-duplicate-balloon-notification

Conversation

@bastiandoetsch
Copy link
Copy Markdown
Collaborator

@bastiandoetsch bastiandoetsch commented Mar 23, 2026

User description

Description

Fixes a bug where the IDE showed duplicate authentication error balloon notifications on a single transient token-refresh failure.

Root cause: IsAuthenticated() uses sync.RWMutex (read lock), allowing multiple goroutines to enter the slow path concurrently. Each goroutine independently called the auth provider API and sent a window/showMessage notification. Three concurrent callers — DelegatingConcurrentScanner.Scan(), the DidChangeWorkspaceFolders handler, and the auth Initializer — each triggered the popup independently.

Changes in this PR:

Auth: deduplicate concurrent IsAuthenticated() calls

  • Added a singleflight.Group to AuthenticationServiceImpl so concurrent callers with the same token share one in-flight API call and receive the same result — only one notification is ever sent per concurrent wave.
  • Added a message-aware notification deduplicator (notifDedup) using sync.Mutex + timestamp: identical messages are suppressed for 30 s; different messages are shown immediately.
  • Made the deduplication state atomic-safe (atomic.Int64) and reset on credential change.
  • Added shouldCauseLogout / isTransientNetworkError helpers to avoid spurious logouts on DNS/TCP/context-cancellation errors.
  • invalid_grant wrapped in url.Error now correctly triggers logout.

Logging: non-blocking LSP write channel

  • Reduced lspWriter channel capacity from 1,000,000 to 10,000 and eliminated the bootstrap allocation.
  • WriteLevel now uses a select/default so it never blocks the caller when the channel is full — messages are dropped to stderr instead of freezing the main goroutine.
  • Added lsp_logger_test.go covering level filtering and the non-blocking fallback.

Secrets: silent skip when feature flag disabled

  • When SnykSecretsEnabled feature flag is off, Scan() now returns nil error and empty issues instead of an error — preventing a spurious balloon notification ("feature flag not found").

CI/test fixes

  • Upgraded macOS integration-test runner to macos-latest-large to resolve OOM failures.
  • Fixed non-breaking hyphens in build.yaml runner name.
  • Capped CLI concurrency at 8 on CI to prevent resource exhaustion.
  • Pinned risk-score feature flags in precedence smoke tests.
  • Skipped TestUnifiedTestApiSmokeTest and substituteDepGraphFlow on macOS/Windows where they consistently OOM/exit-1.
  • Fixed verifyQuickFixForIssue in smoke tests to distinguish transient RPC errors (skip) from correctness violations (fail fast), eliminating a 2-minute flake.
  • Reduced maxIntegTestDuration from 45 m to 15 m.
  • Added featureflag.Override for test injection.

Checklist

  • Tests added and all succeed
  • Regenerated mocks, etc. (make generate)
  • Linted (make lint-fix)
  • README.md updated, if user-facing
  • License file updated, if new 3rd-party dependency is introduced

PR Type

Bug fix, Enhancement


Description

  • Deduplicate concurrent authentication status checks and balloon notifications.

  • Improve logging reliability and prevent blocking.

  • Enhance test stability and error handling.

  • Adjust feature flag behavior for disabled secrets scanning.


Diagram Walkthrough

flowchart LR
  A[Auth Service] --> B(Singleflight Group);
  A --> C(Notification Deduplicator);
  D[Logging] --> E(Non-blocking LSP Write);
  F[Testing] --> G(Error Handling);
  F --> H(Platform Specific Tests);
  I[Secrets Scanner] --> J(Feature Flag Check);
  B --> K{Auth API Call};
  C --> L{IDE Notification};
  E --> M{Stderr Fallback};
  G --> N(Test Stability);
  H --> O(Platform Skip);
  J --> P{Silent Skip};
Loading

File Walkthrough

Relevant files
Documentation
1 files
automatic_config.go
Add comment to binary path finding logic                                 
+1/-0     
Miscellaneous
1 files
config.go
Use stderr for bootstrap logger                                                   
+4/-1     
Tests
15 files
precedence_smoke_test.go
Pin experimental risk score feature flags for CI                 
+6/-0     
server_smoke_test.go
Improve error handling in test callbacks and code action verification
+21/-12 
server_test.go
Reduce integration test duration                                                 
+1/-1     
unified_test_api_smoke_test.go
Add platform checks for depgraph CLI tests                             
+4/-0     
auth_service_impl_test.go
Add tests for concurrent auth calls and error handling     
+58/-0   
provider_fake.go
Add delay and call count to fake auth provider for testing
+19/-1   
executor_mock.go
Generate mock for CLI executor                                                     
+66/-0   
fake_featureflag.go
Add override method to fake feature flag service                 
+4/-0     
featureflag_test.go
Add test for feature flag override functionality                 
+23/-0   
oss_test.go
Test scan progress marking on error                                           
+24/-0   
secrets_test.go
Update secrets scan test for disabled feature flag             
+2/-3     
lsp_logger_test.go
Add tests for LSP logger non-blocking behavior                     
+144/-0 
scan_progress_test.go
Add tests for scan progress lifecycle management                 
+140/-0 
helpers.go
Add NotOnMacOS helper function                                                     
+8/-0     
test_setup.go
Pre-configure engine for faster test initialization           
+20/-1   
Enhancement
4 files
auth_service_impl.go
Deduplicate concurrent IsAuthenticated calls and notifications
+82/-19 
cli.go
Adjust CLI concurrency limit for CI environments                 
+12/-1   
featureflag.go
Implement feature flag override mechanism                               
+19/-0   
lsp_logger.go
Make LSP logger non-blocking and reduce channel capacity 
+6/-2     
Bug fix
4 files
iac.go
Ensure IAC scan goroutine exits on context done                   
+1/-1     
cli_scanner.go
Ensure OSS scan goroutine exits and marks done on error   
+7/-7     
secrets.go
Silently skip secrets scan when feature flag is disabled 
+2/-2     
scan_progress.go
Ensure scan goroutines exit and handle done channel correctly
+9/-2     
Dependencies
2 files
go.mod
Update go.mod with new dependencies                                           
+1/-1     
go.sum
Update go.sum with new dependencies                                           
+2/-6     
Additional files
1 files
CLAUDE.md +108/-110

…() calls

Concurrent callers of IsAuthenticated() with the same token each
independently called the auth provider and each sent a balloon
notification on transient failure (e.g. 400 from /oauth2/token),
resulting in duplicate popups in the IDE (observed 3x).

With singleflight, only one in-flight API call is made per concurrent
wave for the same token. All other callers share the result, so only
one balloon notification is sent regardless of how many goroutines
check simultaneously.

Adds CheckAuthDelay to FakeAuthenticationProvider to allow concurrency
testing, and a regression test that verifies exactly one notification
is sent for 3 concurrent concurrent callers on transient auth failure.
@snyk-io
Copy link
Copy Markdown

snyk-io bot commented Mar 23, 2026

Snyk checks have passed. No issues have been found so far.

Status Scan Engine Critical High Medium Low Total (0)
Open Source Security 0 0 0 0 0 issues
Licenses 0 0 0 0 0 issues
Code Security 0 0 0 0 0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@bastiandoetsch bastiandoetsch marked this pull request as ready for review March 23, 2026 11:47
@bastiandoetsch bastiandoetsch requested review from a team as code owners March 23, 2026 11:47
@snyk-pr-review-bot

This comment has been minimized.

- Extract token once at start of isAuthenticated() — eliminates three
  redundant config.GetToken(conf) calls (cache check, empty-token
  guard, singleflight key) in favour of a single read
- Make type assertion on singleflight result explicit with a panic on
  failure so a future type mismatch surfaces immediately as a
  programming error rather than silently returning unauthenticated
- Apply CheckAuthDelay in both success and failure branches of
  FakeAuthenticationProvider.GetCheckAuthenticationFunction so the
  delay works for both positive and negative test scenarios
- Add AuthCallCount (atomic int32) to FakeAuthenticationProvider to
  let tests assert how many times the check function was invoked
- Add TestIsAuthenticated_ConcurrentCallsOnlyInvokeProviderOnce to
  cover the success-path: three concurrent callers on a cache-miss
  must invoke the provider exactly once
- Add load-bearing comment explaining why the 50ms CheckAuthDelay is
  required for concurrency overlap in both concurrent tests
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (6f74b49)

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (cbb4d11)

@snyk-pr-review-bot

This comment has been minimized.

On CI runners (2-4 vCPUs), the CLI semaphore was calculated as
max(1, NumCPU-4) = 1, serializing all CLI calls. With ~40 CLI
invocations across smoke tests (working dir + reference branch scans),
this caused the application/server package to exceed its 60m timeout
on Windows consistently.

When the CI env var is set, use NumCPU directly (no reservation).
The -4 reservation is kept for local development where CPUs should be
left for the IDE.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (cbb4d11)

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

On main, all integration/smoke test helpers called
WithBinarySearchPaths([]string{}) on the Config struct, preventing the
env-defaults goroutine from walking directories like C:\Program Files
when searching for mvn.exe or java.exe.

After the IDE-1786 Config-struct removal, this safeguard was lost.
prepareTestHelper (used by SmokeTestWithEngine and IntegTestWithEngine)
now pre-seeds an empty SettingBinarySearchPaths into the GAF
configuration before passing it to InitEngine, so the goroutine that
calls DetermineJavaHome/MavenDefaults never performs an exhaustive
directory walk.

On Windows CI this walk took ~10 minutes per test setup (hundreds of
thousands of files under C:\Program Files), causing the application/server
package to hit the 60-minute go test timeout. This restores the behavior
from main and brings Windows CI back in line.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (c357491)

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

ScanProgress.Listen previously blocked indefinitely on its select
waiting for cancel or done signals. If a scan ended without calling
SetDone or CancelScan (e.g. due to an error early return), the listener
goroutine would leak and keep the WaitGroup in DelegatingConcurrentScanner
from completing, causing subsequent tests or scans to hang until timeout.

Adding ctx.Done() as a third select case ensures the goroutine always
exits when the scan's context is canceled, which happens at the latest
when the scan function returns via defer cancel().
runtime.NumCPU() already returns int, the conversion was redundant.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (bd4188b)

@snyk-pr-review-bot

This comment has been minimized.

…stHelper

prepareTestHelper passes a pre-configured engine to InitEngine to avoid
binary search path walks in CI. However, InitEngine only calls
InitWorkflows and engine.Init when engine==nil, so integration and smoke
tests were running without registered workflows. This caused:
- Test_connectivityCheckCommand_Execute_integration to fail because the
  connectivity check workflow was never registered
- Test_textDocumentInlineValues_InlineValues_IntegTest to fail because
  the CLI installer workflow was unavailable
- Test_SmokeInstanceTest to fail for the same reason

Also set EXECUTION_MODE_KEY and PersistInStorage on preConf to match
what InitEngine does for nil engines.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/review

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (bd4188b)

@snyk-pr-review-bot

This comment has been minimized.

@snyk-pr-review-bot

This comment has been minimized.

Previously, ScanProgress.SetDone() was only called on the happy path in
scanInternal, leaving isDone=false when legacyScan or ostestScan returned
an error. The next scan on the same folder would call CancelScan() on the
stale ScanProgress, hitting the 5-second channel timeout.

Convert the explicit SetDone() call to a deferred function, matching the
IAC scanner's existing pattern. SetDone() is safe to call repeatedly or
after cancellation, so deferring it guarantees cleanup on all exit paths.

Also adds a gomock-generated Executor mock and a regression test
(Test_ScanError_ScanProgressIsMarkedDone) that verifies IsDone()==true
after a scan error.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (01ce9fa)

@snyk-pr-review-bot

This comment has been minimized.

…tx.Done()

When Listen exits via ctx.Done(), the unbuffered done channel has no
active reader. Any subsequent call to SetDone() (e.g. from a deferred
cleanup in scanInternal or iac.Scan) would block for the full 5-second
timeout before returning, effectively hanging the scan goroutine on
every context-canceled completion.

Changing done to a buffered channel (capacity 1) ensures SetDone() can
always complete the send immediately: if Listen is still active it reads
the value normally; if Listen has already exited the value sits harmlessly
in the buffer until the ScanProgress is garbage-collected.

Adds TestScanProgress_SetDone_DoesNotBlockAfterContextCanceled to
reproduce the hang and confirm the fix, plus additional unit tests for
the normal SetDone, CancelScan, and idempotent SetDone paths.
@bastiandoetsch
Copy link
Copy Markdown
Collaborator Author

/describe

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Description updated to latest commit (392361d)

@snyk-pr-review-bot
Copy link
Copy Markdown

PR Reviewer Guide 🔍

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Performance Bottleneck 🟡 [minor]

The calcConcurrencyLimit function reserves 4 cores for the IDE. On a standard 4-core machine (common in many laptop configurations), this results in math.Max(1, 4-4), restricting the Snyk CLI to a single concurrent process regardless of system load. While this prevents the IDE from freezing, it significantly slows down multi-folder workspace scans on mid-range hardware.

func calcConcurrencyLimit() int {
	cpus := runtime.NumCPU()
	if os.Getenv("CI") != "" {
		return cpus
	}
	// Reserve max 4 cores for IDE / other local work
	return int(math.Max(1, float64(cpus-4)))
}
📚 Repository Context Analyzed

This review considered 95 relevant code sections from 15 files (average relevance: 0.98)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant