Improve test suite reliability with randomization, TSan, and flaky test fixes #2823
Draft
Enable randomExecutionOrdering on all 20 test schemes to surface order-dependent test failures. Add Thread Sanitizer to DatadogInternal, DatadogRUM, DatadogLogs, and DatadogTrace (iOS + tvOS) where it was previously missing.
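To illustrate the kind of failure that randomized ordering surfaces, here is a minimal sketch that is not taken from the SDK. Two hypothetical test bodies share state, and one of them passes only if the other ran first. `SharedCache`, `runSuite`, `testPopulate`, and `testRead` are illustrative names, not identifiers from this PR.

```swift
/// Hypothetical state that outlives a single test (the hidden dependency).
final class SharedCache {
    var values: [String: Int] = [:]
}

/// Runs the named tests in the given order against fresh shared state
/// and reports whether all of them passed.
func runSuite(order: [String]) -> Bool {
    let cache = SharedCache()
    let tests: [String: () -> Bool] = [
        // Writes state and leaves it behind for whoever runs next.
        "testPopulate": { cache.values["answer"] = 42; return true },
        // Passes only if testPopulate already ran.
        "testRead": { cache.values["answer"] == 42 },
    ]
    return order.allSatisfy { tests[$0]?() ?? false }
}

let declaredOrderPasses = runSuite(order: ["testPopulate", "testRead"])
let reversedOrderPasses = runSuite(order: ["testRead", "testPopulate"])
```

With a fixed order the dependency stays invisible; once the order is shuffled, `testRead` can run first and fail, which is the signal that the test needs its own setup.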
Increase timeouts for tests that use real threading, timers, or concurrent dispatch where 0.1s ceilings are too tight for CI:
- AppHangsWatchdogThreadTests: raise threshold from 0.1s to 0.5s, widen wait multiplier from 10x to 15x, use Constants.tolerance + CI padding for duration assertions
- Profiling concurrency tests: 0.1s to 2.0s for concurrentPerform waits
- AppStateManagerTests: 0.1s to 2.0s for async data store operations
- DisplayLinkerTests: wait(during:) from 0.1s to 0.25s for CADisplayLink
- VitalInfoSamplerTests: use GreaterThanOrEqual for sample count
- AppHangsMonitoringTests: raise threshold and hang duration
- WatchdogTerminationsMonitoringTests: add 10s deadline to polling loop
- ViewHitchesTests: increase wait for frame hitch generation
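The deadline added to the polling loop can be sketched as follows. This is a minimal illustration under assumptions, not the code from WatchdogTerminationsMonitoringTests: the `poll` helper, its parameters, and the 10 ms interval are hypothetical.

```swift
import Foundation

/// Polls `condition` until it returns true or `timeout` elapses.
/// Returns whether the condition was met before the deadline.
func poll(timeout: TimeInterval,
          interval: TimeInterval = 0.01,
          until condition: () -> Bool) -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        if condition() { return true }
        Thread.sleep(forTimeInterval: interval)
    }
    // The deadline passed: the loop ends instead of spinning forever.
    return false
}

// The condition becomes true on the third check, long before the 10 s ceiling.
var attempts = 0
let ready = poll(timeout: 10) {
    attempts += 1
    return attempts >= 3
}
```

An unbounded loop would hang the whole test run if the condition never became true; with a deadline, the helper returns `false` and the test can fail with a clear message.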
…acktraceTests
URLSessionTaskStateSwizzlerTests: wrap interceptedStates and interceptionCount in thread-safe types (ThreadSafeStates, ThreadSafeCounter) to fix data races from concurrent URLSession callbacks. Replace Thread.sleep(1) with expectation-based waiting.
KSCrashBacktraceTests: fix testGenerateBacktraceForBackgroundThread by busy-spinning the background thread inside a @inline(never) user code function so the backtrace captures user binary image frames. Restores the assertion that DatadogCrashReportingTests appears in binary images.
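One possible shape for such a thread-safe wrapper is sketched below. The type name matches the one mentioned in the commit, but the body is an assumption: this version guards a plain `Int` with an `NSLock`, and the PR's actual implementation may differ.

```swift
import Foundation

/// Hypothetical lock-guarded counter; the PR's own ThreadSafeCounter may differ.
final class ThreadSafeCounter {
    private let lock = NSLock()
    private var count = 0

    /// Increments the counter while holding the lock.
    func increment() {
        lock.lock()
        defer { lock.unlock() }
        count += 1
    }

    /// Reads the current value while holding the lock.
    var value: Int {
        lock.lock()
        defer { lock.unlock() }
        return count
    }
}

let counter = ThreadSafeCounter()
// Simulate callbacks arriving on several threads at once.
DispatchQueue.concurrentPerform(iterations: 1_000) { _ in
    counter.increment()
}
```

Because every read and write goes through the same lock, no increment is lost and Thread Sanitizer has no unsynchronized access to report for this counter.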
22814b2 to efb6319
Each test called span.setActive() which enters an os_activity scope, but never called span.finish() which leaves it. With randomized test execution, accumulated nested os_activity scopes corrupted the activity hierarchy, causing getActiveSpan() to return nil in subsequent tests.
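The fix pattern, pairing every `setActive()` with a `finish()`, can be sketched with a stand-in type. `FakeSpan` and its `openScopes` counter are hypothetical; they only model the enter/leave bookkeeping and do not use `os_activity` or the SDK's span type.

```swift
/// Hypothetical stand-in for a span with enter/leave semantics.
final class FakeSpan {
    /// Process-wide count of scopes that were entered but not yet left.
    static var openScopes = 0

    func setActive() { FakeSpan.openScopes += 1 }
    func finish() { FakeSpan.openScopes -= 1 }
}

/// Leaky pattern: the scope entered here is still open when the test ends.
func leakyTest() {
    let span = FakeSpan()
    span.setActive()
}

/// Balanced pattern: `defer` leaves the scope on every exit path.
func balancedTest() {
    let span = FakeSpan()
    span.setActive()
    defer { span.finish() }
    // Assertions on the active span would go here.
}
```

With the balanced version, each test leaves the process-wide state as it found it, so the result no longer depends on which tests ran earlier.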
What and why?
The iOS SDK test suite has been experiencing intermittent CI failures (~50% pass rate locally with `make test-ios-all`). This PR addresses the root causes of flakiness and adds infrastructure to detect order-dependent and thread-unsafe tests earlier.
How?
1. Enable test randomization across all 20 schemes
Adds `randomExecutionOrdering = "YES"` to every test scheme to surface order-dependent test failures that were hidden by deterministic execution.
2. Extend Thread Sanitizer coverage
Enables TSan on DatadogInternal, DatadogRUM, DatadogLogs, and DatadogTrace (iOS + tvOS) where it was previously missing. TSan was already enabled on DatadogCore, DatadogCrashReporting, and IntegrationTests.
3. Fix timing-sensitive tests
Tests with timeouts too tight for CI environments:
- `AppHangsWatchdogThreadTests`: threshold 0.1s→0.5s, wait multiplier 10x→15x, duration tolerance uses `Constants.tolerance + ciPadding`
- Profiling concurrency tests (`CTorProfiler`, `MachSamplingProfiler`, `SafeRead`, `AppLaunchProfiler`): `timeout: 0.1` → `2.0` for `concurrentPerform` waits
- `AppStateManagerTests`: `0.1` → `2.0` for async data store operations
- `DisplayLinkerTests`: `wait(during:)` 0.1s→0.25s for CADisplayLink callbacks
- `VitalInfoSamplerTests`: `XCTAssertEqual(sampleCount, 2)` → `XCTAssertGreaterThanOrEqual` (timer scheduling can produce extra samples)
- `WatchdogTerminationsMonitoringTests`: added 10s deadline to unbounded polling loop
- `ViewHitchesIntegrationTests`: increased wait for frame hitch generation
4. Fix data races in URLSessionTaskStateSwizzlerTests
`interceptedStates` (plain Array) and `interceptionCount` (plain Int) were mutated from concurrent URLSession delegate callbacks, a genuine data race. Wrapped in `ThreadSafeStates` and `ThreadSafeCounter`. Also replaced `Thread.sleep(1)` with expectation-based waiting.
5. Fix flaky KSCrashBacktraceTests
`testGenerateBacktraceForBackgroundThread` was asserting that `DatadogCrashReportingTests` and `Foundation` appear in binary images, but the background thread was blocked on `semaphore.wait()` (only system frames on stack). Fixed by busy-spinning the thread inside a `@inline(never)` function in the test module, so user code frames are on the stack at capture time. Restores the user image assertion.
Local performance impact (10-run average):
The +16s overhead comes entirely from TSan instrumentation on the 4 newly-enabled modules. Timeout changes have near-zero wall-clock impact (they're ceilings, not delays).
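The "ceilings, not delays" point can be checked with a small sketch. The `timedWait` helper and the 10 ms workload are assumptions made for illustration and are not part of the PR.

```swift
import Foundation

/// Waits for `work` to signal completion, for at most `timeout` seconds.
/// Returns the elapsed time and whether the signal arrived in time.
func timedWait(timeout: TimeInterval,
               work: @escaping (DispatchSemaphore) -> Void)
    -> (elapsed: TimeInterval, completed: Bool) {
    let done = DispatchSemaphore(value: 0)
    let start = Date()
    DispatchQueue.global().async { work(done) }
    // Returns as soon as the semaphore is signaled, not after `timeout`.
    let result = done.wait(timeout: .now() + timeout)
    return (Date().timeIntervalSince(start), result == .success)
}

// The work takes roughly 10 ms; the 2.0 s timeout is only an upper bound.
let outcome = timedWait(timeout: 2.0) { done in
    Thread.sleep(forTimeInterval: 0.01)
    done.signal()
}
```

A passing test spends only as long as the work itself takes; the larger timeout is paid in full only when the test was going to fail anyway.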
Review checklist
- Add CHANGELOG entry for user facing changes: N/A, internal test infrastructure only
- Add Objective-C interface for public APIs: N/A, no public API changes
- Run `make api-surface`: N/A, no API changes