Skip to content

[WINA-2079] Fix ETW HTTP test hang with readiness signal and receive timeout#52038

Draft
jack0x2 wants to merge 2 commits into
mainfrom
jack.phillips/wina-2079-etw-ready-timeout
Draft

[WINA-2079] Fix ETW HTTP test hang with readiness signal and receive timeout#52038
jack0x2 wants to merge 2 commits into
mainfrom
jack.phillips/wina-2079-etw-ready-timeout

Conversation

@jack0x2

@jack0x2 jack0x2 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes a hang in the Windows-only TestEtwTransactions (NPM USM ETW HTTP) test
that could time out an entire sysprobe-functional/TestVMSuite run for the full
5 minutes when an expected ETW HttpService event failed to arrive.

Changes, mirroring the existing CWS Windows probe ETWReady pattern
(pkg/security/probe/probe_windows.go + WaitForETWReady in
pkg/security/tests/module_tester_windows.go):

  • EtwInterface now closes an etwReady channel on the first ETW event
    received and exposes ETWReady() <-chan struct{}, so callers can wait for the
    provider to be live instead of relying on a fixed sleep.
  • TestEtwTransactions waits for the provider to come alive before firing the
    real test requests, via a new waitForETWReady helper, replacing the
    unreliable time.Sleep(10 * time.Second).
  • Because the test host has no incidental HTTP traffic to trigger that first
    event, waitForETWReady drives its own warmup requests to the local IIS site
    until ETWReady fires, then drains those warmup transactions so they don't
    bleed into the first subtest's counts.
  • executeRequestForTest's receiver goroutine now uses a per-batch idle timeout
    on the DataChannel receive, so a genuinely missing event fails that subtest
    fast with a clear message instead of deadlocking the whole test binary.

Motivation

WINA-2079TestVMSuite
intermittently timed out after 5m. The CI goroutine dumps showed the receiver
goroutine blocked on <-etw.DataChannel (no timeout) while the poller sat idle
with nothing to send, i.e. the expected ETW event never materialized for a
given request. Because the read had no timeout, a single missing event hung the
entire suite until the global Go test alarm fired. The failing subtest varied
across runs (Test_default_site_ipv6_bad_path, Test_path_limit_one_over,
Test_path_limit_at_boundary), confirming a generic missing-event/timeout issue
rather than a defect tied to any one request path.

Describe how you validated your changes

This code is gated behind //go:build windows && npm and requires a real
Windows host with IIS + a live ETW HttpService provider, so it cannot be built
or run locally on macOS/Linux. Validation is via CI:

  • gofmt clean on all changed files.
  • The sysprobe-functional Windows E2E job (TestVMSuite) exercises
    TestEtwTransactions end-to-end. This job runs on main/release branches; it
    may need qa/rc-required to run against this PR — see Additional Notes.

Expected behavior after this change: startup is gated on the provider's first
real event (driven by warmup traffic) rather than a blind 10s sleep, and on a
missing ETW event the affected subtest fails within ~60s with a clear message
instead of hanging the suite for 5 minutes.

Additional Notes

  • This is a test-infrastructure fix plus a small internal accessor; no
    user-facing behavior changes, so it carries changelog/no-changelog.
  • Per the repo's branch-conditional CI guidance, the relevant Windows E2E
    coverage does not run on PR branches by default. Adding qa/rc-required (or
    triggering the sysprobe-functional Windows job) is recommended to actually
    validate the fix before merge.
  • Note: the off-by-one in etw_http_service.go trailing-space sentinel
    (len(path)+1 < maxRequestFragmentBytes) was investigated and deliberately
    left unchanged — it's a harmless classification quirk and flipping <<=
    would write past the end of the buffer. It is not the cause of WINA-2079.

…timeout

TestEtwTransactions could hang an entire TestVMSuite run for the full
5-minute Go test timeout when an expected ETW HttpService event never
arrived: the receiver goroutine read from etw.DataChannel with no timeout,
and the suite waited for the provider to start with a fixed 10s sleep.

Mirror the CWS Windows probe's ETWReady pattern (pkg/security/probe):
- EtwInterface now closes an etwReady channel on the first event received
  and exposes ETWReady(), so callers can wait for the provider to be live
  instead of sleeping a fixed duration.
- TestEtwTransactions waits on ETWReady (bounded, non-fatal) before firing
  requests, replacing the unreliable 10s sleep.
- executeRequestForTest's receiver now uses a per-batch idle timeout so a
  missing event fails that subtest fast with a clear message instead of
  deadlocking the whole binary.
@jack0x2 jack0x2 added the changelog/no-changelog No changelog entry needed label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@dd-octo-sts dd-octo-sts Bot added internal Identify a non-fork PR team/windows-products labels Jun 9, 2026
@github-actions github-actions Bot added the short review PR is simple enough to be reviewed quickly label Jun 9, 2026
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/datadog-agent | single_machine_performance-regression_detector-merge_base_check   View in Datadog   GitLab

Label analysis | skip-qa-check   View in Datadog   GitHub Actions

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 50.48% (-0.12%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: a744861 | Docs | Datadog PR Page | Give us feedback!

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

The test host can't be assumed to have incidental HTTP traffic, so waiting
on ETWReady alone could never fire (or always burn the fallback). Drive our
own warmup requests to the local IIS site until the provider delivers its
first event, then drain those warmup transactions so they don't bleed into
the first subtest's counts.
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@codex review

@dd-octo-sts

dd-octo-sts Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Files inventory check summary

File checks results against ancestor 874f8ee4:

Results for datadog-agent_7.81.0~devel.git.373.a744861.pipeline.117853933-1_amd64.deb:

No change detected

@dd-octo-sts

dd-octo-sts Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 874f8ee
📊 Static Quality Gates Dashboard
🔗 SQG Job
SOME SIZE DELTAS ARE N/A (ANCESTOR METRICS NOT YET AVAILABLE). RETRY JOB

Successful checks

Info

Quality gate Change Size (prev → curr → max)
agent_deb_amd64 N/A N/A → 746.464 → 750.800
agent_deb_amd64_fips N/A N/A → 704.119 → 704.300
agent_heroku_amd64 N/A N/A → 310.788 → 314.260
agent_rpm_amd64 N/A N/A → 746.447 → 750.770
agent_rpm_amd64_fips N/A N/A → 704.102 → 704.300
agent_rpm_arm64 N/A N/A → 724.029 → 724.500
agent_rpm_arm64_fips N/A N/A → 684.783 → 684.870
agent_suse_amd64 N/A N/A → 746.447 → 750.770
agent_suse_amd64_fips N/A N/A → 704.102 → 704.300
agent_suse_arm64 N/A N/A → 724.029 → 724.500
agent_suse_arm64_fips N/A N/A → 684.783 → 684.870
docker_agent_amd64 N/A N/A → 806.600 → 806.840
docker_agent_arm64 N/A N/A → 809.007 → 810.120
docker_agent_jmx_amd64 N/A N/A → 997.541 → 997.600
docker_agent_jmx_arm64 N/A N/A → 988.601 → 989.800
docker_cluster_agent_amd64 N/A N/A → 207.245 → 207.710
docker_cluster_agent_arm64 N/A N/A → 221.206 → 221.210
docker_cws_instrumentation_amd64 N/A N/A → 7.154 → 7.190
docker_cws_instrumentation_arm64 N/A N/A → 6.689 → 6.920
docker_dogstatsd_amd64 N/A N/A → 39.515 → 39.560
docker_dogstatsd_arm64 N/A N/A → 37.690 → 38.080
docker_host_profiler_amd64 N/A N/A → 302.144 → 315.880
docker_host_profiler_arm64 N/A N/A → 313.658 → 327.470
dogstatsd_deb_amd64 N/A N/A → 30.174 → 30.790
dogstatsd_deb_arm64 N/A N/A → 28.296 → 29.290
dogstatsd_rpm_amd64 N/A N/A → 30.174 → 30.790
dogstatsd_suse_amd64 N/A N/A → 30.174 → 30.790
iot_agent_deb_amd64 N/A N/A → 44.472 → 45.230
iot_agent_deb_arm64 N/A N/A → 41.437 → 42.800
iot_agent_deb_armhf N/A N/A → 42.150 → 42.960
iot_agent_rpm_amd64 N/A N/A → 44.473 → 45.230
iot_agent_suse_amd64 N/A N/A → 44.472 → 45.230

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/no-changelog No changelog entry needed component/system-probe internal Identify a non-fork PR short review PR is simple enough to be reviewed quickly team/windows-products

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant