netvsp & net_mana - instrument VTL0 & host calls at risk of hanging #3170
ben-zen wants to merge 5 commits into microsoft:main
Pull request overview
This PR improves observability around NetVSP and MANA endpoint operations that can take a long time or potentially hang, by adding tracing spans with relevant identifiers (instance/vport/VF IDs). This helps correlate stalls in async shutdown/reconfiguration paths when debugging misbehaving VMs or host/VTL0 interactions.
Changes:
- Add an `info_span` around NetVSP coordinator endpoint shutdown.
- Add `info_span` instrumentation around MANA endpoint teardown steps (RX disable, vport resource destruction, queue drain wait).
- Add `info_span` instrumentation around Underhill VTL0 VF offer/remove notification calls, plus an explicit log when VTL0 bus isn’t present.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vm/devices/net/netvsp/src/lib.rs | Instruments coordinator endpoint stop().await with an info span including instance_id. |
| vm/devices/net/net_mana/src/lib.rs | Adds tracing::Instrument and spans around endpoint stop/cleanup operations with vport_id. |
| openhcl/underhill_core/src/emuplat/netvsp.rs | Adds spans around VTL0 VF removal notice and VF offer call to better pinpoint stalls and add context. |
erfrimod left a comment
Added a comment to trace one more thing, but otherwise ready to go. :)
take Brian's better comment. Co-authored-by: Brian Perkins <Brian-Perkins@users.noreply.github.com>
    } else {
        tracing::info!(
            vtl2_vfid,
            vtl0_vfid = vtl0_vfid_from_bus_control(&self.vtl0_bus_control),
        tracing::info!(
            vtl2_vfid,
            %vtl0_bus_control,
            "Ignoring VTL0 device request from guest"
        );
    }
The else branch has a malformed tracing::info! invocation (a second tracing::info! starts before the first is closed), which will not compile. Collapse this into a single tracing::info! call that includes all desired fields (vtl2_vfid, vtl0_vfid, %vtl0_bus_control) and the message, and ensure the macro invocation is properly closed.
    self.coordinator
        .task_mut()
        .endpoint
        .stop()
        .instrument(tracing::info_span!(
            "stopping coordinator endpoint",
            instance_id = %self.instance_id,
        ))
        .await;
This uses `.instrument(...)`, which requires the `tracing::Instrument` trait to be in scope for method resolution. If this module doesn’t already have `use tracing::Instrument;`, the build will fail; add that import alongside the other `use` statements (or use UFCS).
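The scoping rule behind this comment can be sketched with the standard library alone. The `spans` module and its `Tagged` trait below are hypothetical stand-ins for an extension trait such as `tracing::Instrument`; they are not part of this PR or of the `tracing` crate. With the real trait, the fully qualified form would be `tracing::Instrument::instrument(future, span)`.

```rust
mod spans {
    // Hypothetical stand-in for an extension trait like `tracing::Instrument`.
    pub trait Tagged: Sized {
        // Pairs a value with a label, loosely mirroring how `.instrument(span)`
        // wraps a future together with its span.
        fn tagged(self, label: &'static str) -> (Self, &'static str) {
            (self, label)
        }
    }

    // Blanket impl: every sized type gains `.tagged(...)`.
    impl<T> Tagged for T {}
}

pub fn method_syntax() -> (u32, &'static str) {
    // Method-call syntax only resolves because the trait is imported here.
    // Removing this `use` makes `5u32.tagged(..)` a compile error.
    use spans::Tagged;
    5u32.tagged("stopping endpoint")
}

pub fn ufcs_syntax() -> (u32, &'static str) {
    // Fully qualified call: names the trait in the path, so no `use` is needed.
    spans::Tagged::tagged(5u32, "stopping endpoint")
}
```

Both functions return the same value; the only difference is whether the trait has to be imported at the call site.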
    if let Err(err) = self
        .send_vf_state_change_notifications()
        .instrument(tracing::info_span!(
            "sending VTL0 VF removal notice",
            vtl2_vfid,
            vtl0_bus = %bus_control))
        .await
This uses `.instrument(...)`, which requires the `tracing::Instrument` trait to be in scope for method resolution. If this module doesn’t already have `use tracing::Instrument;`, compilation will fail; add the import near the top-level `use` list (or switch to UFCS to avoid relying on the trait being in scope).
Several async operations with long or no timeouts are at risk of hanging the NetVSP code. Wrapping those operations in spans makes them more visible in traces and adds context for tracking down the offending call or identifying a misbehaving VM.