
fix(policy): Fix DCGM policy listener fanout and preserve policy thresholds #119

Open
rvatkar wants to merge 2 commits into main from rvatkar/bugfix/DCGM-6957-policy-listener-dispatcher

Conversation


@rvatkar rvatkar commented Apr 28, 2026

Summary

Fix DCGM policy listener callback handling by replacing shared global callback channels with a per-subscriber dispatcher. This prevents listener fanout races, blocked callbacks, premature cleanup, and policy threshold clobbering, while keeping the public listener API signatures unchanged.

Covers:

Changes

  • Add an internal policy dispatcher with per-subscriber channels, group-aware routing, and DCGM userData registration IDs.
  • Fan out matching policy violations to all interested subscribers instead of competing on shared condition channels.
  • Use non-blocking delivery so slow consumers cannot block DCGM callback threads.
  • Preserve existing policy thresholds for Listen* APIs by merging missing conditions instead of resetting the whole policy to defaults.
  • Narrow default policy setup fallback to DCGM_ST_INSUFFICIENT_SIZE and DCGM_ST_NOT_CONFIGURED; propagate other policy-read errors.
  • Keep unregister bookkeeping until dcgmPolicyUnregister succeeds or DCGM reports no live registration can remain.
  • Expose PolicyViolationDropCount() for slow-consumer drop visibility.
  • Update C callback bridge to pass DCGM userData.
  • Add dispatcher, lifecycle, rollback, race, threshold-preservation, and error-contract tests.

Validation

  • go test ./pkg/dcgm -count=1
  • go test -race ./pkg/dcgm -run "TestPolicyDispatcherDeliverDuringUnsubscribeRace|TestPolicyDispatcherRollbackSubscription|TestPolicyDispatcherClearRegistrationKeepsStateOnUnregisterFailure|TestPolicyDispatcherSlowConsumerDrop|TestConcurrentPolicySubscribers|TestPolicyListenerLifecycle" -count=1 -v
  • go test ./pkg/dcgm -run "TestPolicyErrors|TestSetPolicyAndWatchViolations|TestSetAndGetPolicy|TestSetAndGetMultiplePolicies" -count=1 -v
  • go build ./...
  • go vet ./...

dcgm-exporter L4 Smoke Test

Built dcgm-exporter from /home/rvatkar/code/dcgm-exporter-dev against the local fixed go-dcgm branch using a temporary go.work replace.

  1. Confirm exporter uses local go-dcgm:

    cd /home/rvatkar/code/dcgm-exporter-dev
    GOWORK=/tmp/dcgm-exporter-gowork/go.work go list -m -f "{{.Path}} {{.Version}} {{.Dir}}" github.com/NVIDIA/go-dcgm
    

Expected: module resolves to /home/rvatkar/code/community/go-dcgm.

  2. Build exporter:

    GOWORK=/tmp/dcgm-exporter-gowork/go.work make binary

  3. Run exporter on L4 host with reduced L4 counters:

    /tmp/dcgm-exporter-after \
      -f /tmp/dcgm-exporter-l4-counters.csv \
      -a :9401 \
      --enable-dcgm-log \
      --dcgm-log-level DEBUG

  4. Scrape metrics:

    curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after.metrics

    grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_SM_CLOCK|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_XID_ERRORS" /tmp/dcgm-exporter-after.metrics

  5. Repeated scrape soak:

    for i in $(seq 1 30); do
      curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after-$i.metrics || exit 1
      sleep 10
    done

Result:

  • Exporter started successfully on L4.
  • Repeated scrapes succeeded.
  • L4 metrics such as DCGM_FI_DEV_SM_CLOCK, DCGM_FI_DEV_GPU_UTIL, and DCGM_FI_DEV_FB_USED were present.
  • No callback panic, deadlock, or exporter crash observed.
  • XID injection was visible when exporter connected to the same hostengine using --remote-hostengine-info localhost:5555.

Notes

No dcgm-exporter code change is required, but downstream consumers need to update their github.com/NVIDIA/go-dcgm dependency to pick up this fix.

Comment thread on pkg/dcgm/policy.go
typ: policyFieldTypeBool,
value: policyBoolValue,
// unsubscribe closes one listener and unregisters DCGM callbacks when it was the last one.
func (d *policyDispatcher) unsubscribe(subID uint64) {
Collaborator

unsubscribe should unregister DCGM callbacks per group/condition as soon as they become unused, instead of waiting for the final process-wide subscriber. After removing a subscription, compute the remaining condition mask for that group and unregister only registeredByGroup[group] &^ stillNeeded; keep clearing local registration state only after dcgmPolicyUnregister succeeds.

Add tests for:

  • Same group: subscriber A watches Xid, subscriber B watches Thermal; removing A unregisters only Xid.
  • Different groups: removing the last subscriber for group A unregisters group A even while group B remains.
  • Combined registration: A watches Xid,Thermal, B watches Thermal; removing A unregisters only Xid and preserves Thermal bookkeeping.

Collaborator Author

Thanks, fixed. Unsubscribe now computes the remaining subscriber mask for the removed subscriber's group and calls dcgmPolicyUnregister only for the group/condition bits that became unused. It no longer waits for all local subscribers to exit.

I also added regression coverage for:

  • same group, different conditions
  • different groups with the same condition
  • combined registration where only one condition becomes unused


@nccurry nccurry left a comment


Just one small thing, otherwise LGTM
