fix(policy): Fix DCGM policy listener fanout and preserve policy thresholds #119
Conversation
```go
typ:   policyFieldTypeBool,
value: policyBoolValue,
```

```go
// unsubscribe closes one listener and unregisters DCGM callbacks when it was the last one.
func (d *policyDispatcher) unsubscribe(subID uint64) {
```
`unsubscribe` should unregister DCGM callbacks per group/condition as soon as they become unused, instead of waiting for the final process-wide subscriber. After removing a subscription, compute the remaining condition mask for that group and unregister only `registeredByGroup[group] &^ stillNeeded`; clear local registration state only after `dcgmPolicyUnregister` succeeds.
Add tests for:
- Same group: subscriber A watches Xid, subscriber B watches Thermal; removing A unregisters only Xid.
- Different groups: removing the last subscriber for group A unregisters group A even while group B remains.
- Combined registration: A watches Xid,Thermal, B watches Thermal; removing A unregisters only Xid and preserves Thermal bookkeeping.
Thanks, fixed. Unsubscribe now computes the remaining subscriber mask for the removed subscriber’s group and calls `dcgmPolicyUnregister` only for the group/condition bits that became unused. It no longer waits for all local subscribers to exit.
I also added regression coverage for:
- same group, different conditions
- different groups with the same condition
- combined registration where only one condition becomes unused
nccurry left a comment
Just one small thing, otherwise LGTM
Summary
Fix DCGM policy listener callback handling by replacing shared global callback channels with a per-subscriber dispatcher. This prevents listener fanout races, callback blocking, premature cleanup, and policy threshold clobbering while keeping public listener API signatures unchanged.
Changes
- Key per-subscriber dispatch state by `userData` registration IDs.
- Preserve existing thresholds in the `Listen*` APIs by merging missing conditions instead of resetting the whole policy to defaults.
- Tolerate `DCGM_ST_INSUFFICIENT_SIZE` and `DCGM_ST_NOT_CONFIGURED` when reading the current policy; propagate other policy-read errors.
- Clear local registration state only after `dcgmPolicyUnregister` succeeds or DCGM reports no live registration can remain.
- Add `PolicyViolationDropCount()` for slow-consumer drop visibility.
- Route DCGM callbacks back to the owning subscriber via `userData`.
Validation
```
go test ./pkg/dcgm -count=1
go test -race ./pkg/dcgm -run "TestPolicyDispatcherDeliverDuringUnsubscribeRace|TestPolicyDispatcherRollbackSubscription|TestPolicyDispatcherClearRegistrationKeepsStateOnUnregisterFailure|TestPolicyDispatcherSlowConsumerDrop|TestConcurrentPolicySubscribers|TestPolicyListenerLifecycle" -count=1 -v
go test ./pkg/dcgm -run "TestPolicyErrors|TestSetPolicyAndWatchViolations|TestSetAndGetPolicy|TestSetAndGetMultiplePolicies" -count=1 -v
go build ./...
go vet ./...
```
dcgm-exporter L4 Smoke Test
Built dcgm-exporter from /home/rvatkar/code/dcgm-exporter-dev against the local fixed go-dcgm branch using a temporary go.work replace.

Confirm exporter uses local go-dcgm:

Expected: module resolves to /home/rvatkar/code/community/go-dcgm.
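For reference, a temporary go.work of the kind described might look like the following; the exact file content and Go version line are assumptions, and only the paths come from the text above:

```
go 1.22

use /home/rvatkar/code/dcgm-exporter-dev

replace github.com/NVIDIA/go-dcgm => /home/rvatkar/code/community/go-dcgm
```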
Build exporter:

```
GOWORK=/tmp/dcgm-exporter-gowork/go.work make binary
```
Run exporter on L4 host with reduced L4 counters:
```
/tmp/dcgm-exporter-after \
  -f /tmp/dcgm-exporter-l4-counters.csv \
  -a :9401 \
  --enable-dcgm-log \
  --dcgm-log-level DEBUG
```
Scrape metrics:
```
curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after.metrics
grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_SM_CLOCK|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_XID_ERRORS" /tmp/dcgm-exporter-after.metrics
```
Repeated scrape soak:
```
for i in $(seq 1 30); do
  curl -sf http://127.0.0.1:9401/metrics >/tmp/dcgm-exporter-after-$i.metrics || exit 1
  sleep 10
done
```
Result:
Notes
No dcgm-exporter code change is required, but downstream consumers need to update their github.com/NVIDIA/go-dcgm dependency to pick up this fix.