fix(policy): prevent policy callback deadlocks and teardown leaks#118

Closed
rvatkar wants to merge 1 commit into main from rvatkar/fix/DCGM-3746-6904-policy-callback-deadlock

rvatkar commented Apr 22, 2026

Fixes DCGM-3746 and DCGM-6904.

Both issues came from the same root cause: ViolationRegistration could block while DCGM held internal locks.

Summary

  • make ViolationRegistration non-blocking by dropping on full per-condition queues and tracking drops with droppedPolicyViolations
  • sample drop logs at powers of two so overload does not turn the callback path into log I/O
  • start fan-in before dcgmPolicyRegister_v2 and make fan-in sends cancellation-aware so registration and teardown cannot hang on a full user-facing channel
  • increase the per-condition buffer from 1 to 16 to absorb short bursts
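The first two bullets can be illustrated with a minimal Go sketch. It is not the code from this PR: `trySend` and `isPowerOfTwo` are hypothetical helper names, and `PolicyViolation` is reduced to a stand-in type with an assumed field. Only `droppedPolicyViolations` and `policyChannelBuffer` are names taken from the description.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// PolicyViolation is a stand-in for the real type; the field is an assumption.
type PolicyViolation struct {
	Condition string
}

// policyChannelBuffer mirrors the per-condition buffer size named in the PR.
const policyChannelBuffer = 16

// droppedPolicyViolations counts violations dropped because a queue was full.
var droppedPolicyViolations atomic.Uint64

// isPowerOfTwo reports whether n is 1, 2, 4, 8, ...
func isPowerOfTwo(n uint64) bool {
	return n != 0 && n&(n-1) == 0
}

// trySend (hypothetical helper) never blocks. It enqueues v if the channel
// has room; otherwise it counts the drop and reports whether this particular
// drop should be logged, which is true only when the running drop count is a
// power of two.
func trySend(ch chan<- PolicyViolation, v PolicyViolation) (sent, shouldLog bool) {
	select {
	case ch <- v:
		return true, false
	default:
		n := droppedPolicyViolations.Add(1)
		return false, isPowerOfTwo(n)
	}
}

func main() {
	ch := make(chan PolicyViolation, policyChannelBuffer)
	logged := 0
	// Nothing reads from ch, so every send after the buffer fills is dropped.
	for i := 0; i < policyChannelBuffer+10; i++ {
		if _, shouldLog := trySend(ch, PolicyViolation{Condition: "xid"}); shouldLog {
			logged++
		}
	}
	fmt.Println(len(ch), droppedPolicyViolations.Load(), logged)
}
```

With no reader, the buffer fills at 16 entries, the remaining 10 sends are dropped, and only drops 1, 2, 4, and 8 would be logged, so log volume grows logarithmically with the number of drops.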

Why policyChannelBuffer = 16

  • DCGM_MAX_XID_INFO = 10 (from dcgm_structs.h) is the upstream design ceiling for in-flight XID events per callback, so the buffer must be >= 10 to absorb a worst-case synchronous burst.
  • 16 is the next power of two above 10, giving small headroom for scheduler jitter without straying far from what DCGM itself considers the realistic ceiling.
  • Memory cost is negligible: 7 conditions x 16 slots x ~64 bytes per PolicyViolation ~= 7 KB total. A larger size (e.g. 64) would be hard to defend in review, since it brings no measurable benefit.
  • If droppedPolicyViolations ever becomes non-zero in production, that is the signal to revisit the size -- not a guess.
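The sizing arithmetic above can be checked with a short Go sketch. The constants restate figures from this description (the 64-byte size is the rough per-violation estimate, not a measured value), and `nextPowerOfTwo` is a hypothetical helper written for illustration.

```go
package main

import "fmt"

const (
	maxXidInfo           = 10 // DCGM_MAX_XID_INFO, as quoted from dcgm_structs.h above
	conditionCount       = 7  // number of policy conditions, per the description
	approxViolationBytes = 64 // rough estimate from the description, not measured
)

// nextPowerOfTwo returns the smallest power of two that is >= n (for n >= 1).
func nextPowerOfTwo(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}

func main() {
	buffer := nextPowerOfTwo(maxXidInfo)
	totalBytes := conditionCount * buffer * approxViolationBytes
	fmt.Printf("buffer=%d total=%d bytes (%.1f KiB)\n",
		buffer, totalBytes, float64(totalBytes)/1024)
}
```

This yields a buffer of 16 and a total of 7168 bytes, which matches the ~7 KB figure.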

Tests

Added deterministic regression coverage for:

  • bounded callback latency on full per-condition channels
  • exact drop accounting
  • recovery after a dropped violation
  • fan-in cancellation while blocked on output
  • registerPolicy pre-registration failure cleanup
  • normal lifecycle cleanup on caller cancellation
  • registration-failure cleanup for both registerPolicy and registerPolicyOnly

rvatkar requested a review from nccurry April 22, 2026 05:40


rvatkar commented Apr 28, 2026

Closing this PR, as #119 handles it.

rvatkar closed this Apr 28, 2026