[WINA-2079] Fix ETW HTTP test hang with readiness signal and receive timeout#52038
jack0x2 wants to merge 2 commits
…timeout

TestEtwTransactions could hang an entire TestVMSuite run for the full 5-minute Go test timeout when an expected ETW HttpService event never arrived: the receiver goroutine read from etw.DataChannel with no timeout, and the suite waited for the provider to start with a fixed 10s sleep.

Mirror the CWS Windows probe's ETWReady pattern (pkg/security/probe):

- EtwInterface now closes an etwReady channel on the first event received and exposes ETWReady(), so callers can wait for the provider to be live instead of sleeping a fixed duration.
- TestEtwTransactions waits on ETWReady (bounded, non-fatal) before firing requests, replacing the unreliable 10s sleep.
- executeRequestForTest's receiver now uses a per-batch idle timeout so a missing event fails that subtest fast with a clear message instead of deadlocking the whole binary.
@codex review
Codex Review: Didn't find any major issues. Keep them coming!
The test host can't be assumed to have incidental HTTP traffic, so the ETWReady signal alone might never fire (or the wait would always burn the fallback). Drive our own warmup requests to the local IIS site until the provider delivers its first event, then drain those warmup transactions so they don't bleed into the first subtest's counts.
@codex review
Files inventory check summary
File checks results against ancestor 874f8ee4:
Results for datadog-agent_7.81.0~devel.git.373.a744861.pipeline.117853933-1_amd64.deb: No change detected
What does this PR do?
Fixes a hang in the Windows-only `TestEtwTransactions` (NPM USM ETW HTTP) test that could time out an entire `sysprobe-functional/TestVMSuite` run for the full 5 minutes when an expected ETW `HttpService` event failed to arrive.

Changes, mirroring the existing CWS Windows probe `ETWReady` pattern (`pkg/security/probe/probe_windows.go` + `WaitForETWReady` in `pkg/security/tests/module_tester_windows.go`):

- `EtwInterface` now closes an `etwReady` channel on the first ETW event received and exposes `ETWReady() <-chan struct{}`, so callers can wait for the provider to be live instead of relying on a fixed sleep.
- `TestEtwTransactions` waits for the provider to come alive before firing the real test requests, via a new `waitForETWReady` helper, replacing the unreliable `time.Sleep(10 * time.Second)`.
- Because the test host can't be assumed to have incidental HTTP traffic that would produce a first event, `waitForETWReady` drives its own warmup requests to the local IIS site until `ETWReady` fires, then drains those warmup transactions so they don't bleed into the first subtest's counts.
- `executeRequestForTest`'s receiver goroutine now uses a per-batch idle timeout on the `DataChannel` receive, so a genuinely missing event fails that subtest fast with a clear message instead of deadlocking the whole test binary.
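As a rough illustration of the close-on-first-event idea described above, here is a minimal sketch. The `readySignal` type, `markReady`, and `isReady` are hypothetical names made up for this example; only the `ETWReady() <-chan struct{}` signature comes from the description, and the real `EtwInterface` fields may be laid out differently.

```go
package main

import (
	"fmt"
	"sync"
)

// readySignal is a hypothetical stand-in for the readiness state that the PR
// adds to EtwInterface. It is not the actual implementation.
type readySignal struct {
	once  sync.Once
	ready chan struct{}
}

func newReadySignal() *readySignal {
	return &readySignal{ready: make(chan struct{})}
}

// markReady is intended to be called from the event callback for every event.
// sync.Once makes sure the channel is closed exactly once, so later events
// cannot trigger a "close of closed channel" panic.
func (r *readySignal) markReady() {
	r.once.Do(func() { close(r.ready) })
}

// ETWReady returns a receive-only channel that is closed after the first event.
func (r *readySignal) ETWReady() <-chan struct{} {
	return r.ready
}

// isReady reports, without blocking, whether the first event has been seen.
func (r *readySignal) isReady() bool {
	select {
	case <-r.ready:
		return true
	default:
		return false
	}
}

func main() {
	r := newReadySignal()
	fmt.Println("ready before first event:", r.isReady())
	r.markReady() // first event closes the channel
	r.markReady() // later events are no-ops
	fmt.Println("ready after first event:", r.isReady())
}
```

Closing the channel, rather than sending on it, lets any number of waiters observe readiness, including ones that start waiting after the first event has already arrived.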
Motivation
WINA-2079: `TestVMSuite` intermittently timed out after 5m. The CI goroutine dumps showed the receiver goroutine blocked on `<-etw.DataChannel` (no timeout) while the poller sat idle with nothing to send, i.e. the expected ETW event never materialized for a given request. Because the read had no timeout, a single missing event hung the entire suite until the global Go test alarm fired. The failing subtest varied across runs (`Test_default_site_ipv6_bad_path`, `Test_path_limit_one_over`, `Test_path_limit_at_boundary`), confirming a generic missing-event/timeout issue rather than a defect tied to any one request path.
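The blocked receive described above can be bounded with a `select` on the data channel and a timer. The following is a minimal sketch of that technique, not the PR's code: `receiveBatch`, its parameters, and the string element type are assumptions for illustration, and it interprets "idle timeout" as a limit on the gap between consecutive events.

```go
package main

import (
	"fmt"
	"time"
)

// receiveBatch (hypothetical helper) reads want items from ch. A fresh timer
// is armed for every receive, so idle bounds the wait for each event rather
// than the duration of the whole batch.
func receiveBatch(ch <-chan string, want int, idle time.Duration) ([]string, error) {
	got := make([]string, 0, want)
	for len(got) < want {
		select {
		case item, ok := <-ch:
			if !ok {
				return got, fmt.Errorf("channel closed after %d of %d events", len(got), want)
			}
			got = append(got, item)
		case <-time.After(idle):
			// Fail with a specific message instead of blocking forever.
			return got, fmt.Errorf("timed out after %v waiting for event %d of %d",
				idle, len(got)+1, want)
		}
	}
	return got, nil
}

func main() {
	ch := make(chan string, 4)
	ch <- "GET /"
	ch <- "GET /missing"

	// Two events are queued but three are expected, so the third wait times out.
	got, err := receiveBatch(ch, 3, 50*time.Millisecond)
	fmt.Println("received:", len(got), "error:", err)
}
```

A test would typically turn the returned error into a subtest failure, so one missing event affects only that subtest instead of running into the global test timeout.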
Describe how you validated your changes
This code is gated behind `//go:build windows && npm` and requires a real Windows host with IIS + a live ETW `HttpService` provider, so it cannot be built or run locally on macOS/Linux. Validation is via CI:

- `gofmt` clean on all changed files.
- The `sysprobe-functional` Windows E2E job (`TestVMSuite`) exercises `TestEtwTransactions` end-to-end. This job runs on `main`/release branches; it may need `qa/rc-required` to run against this PR (see Additional Notes).

Expected behavior after this change: startup is gated on the provider's first real event (driven by warmup traffic) rather than a blind 10s sleep, and on a missing ETW event the affected subtest fails within ~60s with a clear message instead of hanging the suite for 5 minutes.
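To make the warmup-then-drain startup concrete, here is a minimal sketch under stated assumptions. `warmUpUntilReady` and `drain` are hypothetical helpers, the provider is simulated with a closure, and the intervals are arbitrary; the PR's actual `waitForETWReady` may be structured quite differently.

```go
package main

import (
	"fmt"
	"time"
)

// warmUpUntilReady (hypothetical) calls fire once per interval until ready is
// closed or limit elapses. It returns false on timeout so the caller can
// treat a late provider as non-fatal.
func warmUpUntilReady(ready <-chan struct{}, fire func(), interval, limit time.Duration) bool {
	deadline := time.After(limit)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		fire()
		select {
		case <-ready:
			return true
		case <-deadline:
			return false
		case <-ticker.C:
			// Not ready yet: send another warmup request.
		}
	}
}

// drain (hypothetical) discards queued items until the channel stays quiet
// for the given period, and returns how many items were dropped.
func drain(ch <-chan string, quiet time.Duration) int {
	n := 0
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				return n
			}
			n++
		case <-time.After(quiet):
			return n
		}
	}
}

func main() {
	ready := make(chan struct{})
	events := make(chan string, 16)
	calls := 0
	// Simulated provider: every request yields an event, and the provider is
	// considered live once the third request has been seen.
	fire := func() {
		calls++
		events <- fmt.Sprintf("warmup-%d", calls)
		if calls == 3 {
			close(ready)
		}
	}

	ok := warmUpUntilReady(ready, fire, 10*time.Millisecond, time.Second)
	dropped := drain(events, 50*time.Millisecond)
	fmt.Println("ready:", ok, "warmup requests:", calls, "drained:", dropped)
}
```

Draining before the first subtest matters because the warmup transactions travel through the same channel as the real ones and would otherwise inflate that subtest's counts.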
Additional Notes
- This PR makes no user-facing behavior changes, so it carries `changelog/no-changelog`.
- The `sysprobe-functional` Windows E2E coverage does not run on PR branches by default. Adding `qa/rc-required` (or triggering the sysprobe-functional Windows job) is recommended to actually validate the fix before merge.
- The `etw_http_service.go` trailing-space sentinel (`len(path)+1 < maxRequestFragmentBytes`) was investigated and deliberately left unchanged: it's a harmless classification quirk and flipping `<` → `<=` would write past the end of the buffer. It is not the cause of WINA-2079.