feat: add multi-GPU support for CUDA attach #559

Open

yunwei37 wants to merge 5 commits into master from feat/multi-gpu-attach

Conversation

@yunwei37 (Member)

Summary

  • Add gpu_device_manager that enumerates all CUDA devices at init, caches per-device SM architectures, and provides device-aware lookup APIs
  • Per-device PTX compilation, module loading, and patched kernel tracking for correct multi-GPU launch interception
  • Device-aware run_attach_entry_on_gpu() with new device_ordinal parameter (backward compatible, defaults to auto-detect)
  • Multi-device CUDAContext support in runtime with init_multi_gpu_contexts()
  • Fix cuCtxCreate calls for CUDA 13 compatibility (cuCtxCreate_v4 4-parameter signature)
  • Add gpu_device_manager unit tests
  • Add multi-GPU vector addition example (example/gpu/multi-gpu/)
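The PR description names `gpu_device_manager` and `device_count()` but not the full interface; below is a minimal, hypothetical sketch of a manager of this shape. Real enumeration would use the CUDA driver API (`cuDeviceGetCount`/`cuDeviceGet`); here the enumerator is injected so the caching structure can be exercised without a GPU, and `device_entry`/`sm_arch()` are illustrative names.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a gpu_device_manager-style cache. The real class
// enumerates devices with the CUDA driver API at init time; enumeration is
// injected here so the structure runs without a GPU.
struct device_entry {
	int ordinal;         // CUDA device ordinal
	std::string sm_arch; // cached SM architecture, e.g. "sm_103"
};

class gpu_device_manager {
public:
	// enumerate() yields one entry per device that probed successfully,
	// so devices that fail cuDeviceGet never appear in devices_.
	explicit gpu_device_manager(
		std::function<std::vector<device_entry>()> enumerate)
		: devices_(enumerate())
	{
	}

	// Report the number of usable devices actually cached, not the raw
	// count the driver reported (see fix C1 in a later commit).
	size_t device_count() const { return devices_.size(); }

	// Device-aware lookup: per-device SM arch for PTX compilation.
	const std::string &sm_arch(int ordinal) const
	{
		for (const auto &d : devices_)
			if (d.ordinal == ordinal)
				return d.sm_arch;
		throw std::out_of_range("unknown device ordinal");
	}

private:
	std::vector<device_entry> devices_;
};
```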

Test plan

  • All 131 assertions in 23 test cases pass (including new gpu_device_manager tests)
  • Full project build succeeds with -DBPFTIME_ENABLE_CUDA_ATTACH=ON
  • Multi-GPU example verified on 8x NVIDIA B300 SXM6 AC (sm_103)
  • Single-GPU backward compatibility preserved (all new parameters default to device 0)
  • BPFTIME_SM_ARCH env var override works across all devices

🤖 Generated with Claude Code

Add gpu_device_manager that enumerates all CUDA devices at init time,
caches per-device SM architectures, and provides device-aware APIs.
Key changes:
- Per-device SM arch detection and PTX compilation
- Per-device module loading with separate module pools
- Per-device patched kernel tracking for correct launch interception
- Device-aware run_attach_entry_on_gpu() with device_ordinal parameter
- Multi-device CUDAContext support in runtime
- Fix cuCtxCreate calls for CUDA 13 compatibility
- Add gpu_device_manager unit tests (tested on 8x NVIDIA B300 SXM6)
- Add multi-GPU vector addition example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yusheng and others added 3 commits March 11, 2026 15:14
… monitoring

Add per-GPU block timing using gridDim.x (helper 508) to identify which GPU
each block belongs to from inside the GPU. Update README to emphasize bpftime's
unique value: zero-modification black-box monitoring, cross-GPU shared maps
via UVA, and programmable GPU-internal probes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CUDA 12.x uses cuCtxCreate_v2 (3 args) while CUDA 13+ uses
cuCtxCreate_v4 (4 args). Add #if CUDA_VERSION >= 13000 guard.
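The guard described above can be sketched as follows. The `cuCtxCreate_v2`/`cuCtxCreate_v4` names are real CUDA driver symbols; the small arg-count helper exists only so the branch condition can be checked without a CUDA toolkit installed.

```cpp
// Sketch of the version guard. In real code the branch is taken at compile
// time against the toolkit's CUDA_VERSION macro:
//
//   #if CUDA_VERSION >= 13000
//       // CUDA 13+: cuCtxCreate resolves to cuCtxCreate_v4, which takes an
//       // extra CUctxCreateParams* before the flags.
//       cuCtxCreate(&ctx, /*ctxCreateParams=*/nullptr, flags, dev);
//   #else
//       // CUDA 12.x: three-argument cuCtxCreate_v2.
//       cuCtxCreate(&ctx, flags, dev);
//   #endif
//
// Pure helper mirroring the same condition. CUDA_VERSION encodes the
// toolkit as 1000 * major + 10 * minor, so 12.9 is 12090 and 13.0 is 13000.
inline int ctx_create_arg_count(int cuda_version)
{
	return cuda_version >= 13000 ? 4 : 3;
}
```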

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add helper 512 (bpf_get_device_ordinal): reads per-module deviceOrdinal
  constant set by bpftime during CUmodule loading. Each GPU gets its own
  CUmodule with a unique ordinal, providing reliable GPU identification
  from inside eBPF probes regardless of workload distribution.

- Fix C1: device_count() now returns devices_.size() instead of the raw
  CUDA count, avoiding mismatch when cuDeviceGet fails for some devices.

- Fix C2: run_attach_entry_on_gpu now uses per-device shared_mem_device_ptr
  instead of always using the primary device's pointer.

- Fix start_ts collision: Use compound key (device_ordinal << 20 | block_id)
  to prevent cross-GPU block_id collisions in shared maps.
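The compound key is plain bit packing; a sketch (the function name is illustrative, while the 20-bit block field matches the shift quoted above):

```cpp
#include <cstdint>

// Pack (device_ordinal, block_id) into one map key so two GPUs running the
// same block_id no longer collide in shared maps. The low 20 bits hold
// block ids up to 1048575; the ordinal occupies the bits above.
inline uint64_t make_block_key(uint32_t device_ordinal, uint32_t block_id)
{
	return (static_cast<uint64_t>(device_ordinal) << 20) |
	       (block_id & 0xFFFFFu);
}
```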

- Update example to use device ordinal instead of gridDim.x for per-GPU
  stats, making it work with any workload distribution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yunwei37 yunwei37 requested review from Officeyutong and Sy0307 March 18, 2026 17:14
…cy (#560)

* feat: skip compile/load of unmodified PTX in CUDA attach path

Optimize GPU attach latency by only compiling and loading PTX files
that were actually modified by the pass pipeline. Previously all
extracted PTX files flowed through compile+load even when unmodified.

On a llama.cpp 1B workload (RTX 5090, CUDA 12.9):
- First cold fatbin: 27.9s → 5.6s (-80%)
- PTX compile: 21.2s → 0.17s (-99.2%)
- Fatbins 3..48 mean: 210ms → 123ms (-41.5%)

Changes:
- nv_attach_impl.cpp: rename should_add_trampoline to ptx_modified
  for clarity; propagate per-PTX modified flag from pass pipeline
- nv_attach_fatbin_record.cpp: filter patched PTX to only modified
  entries before compile_ptxs() and module-load loop; add thread-safe
  cache access for ptx_pool and module_pool; add per-fatbin load_mutex;
  add timing instrumentation for extract/patch/compile/load stages
- nv_attach_impl_frida_setup.cpp: add extract timing; downgrade
  expected "symbol not in patched module" messages to DEBUG
- nv_attach_impl.hpp: add module_pool_mutex_ and ptx_pool_mutex_
- nv_attach_fatbin_record.hpp: add load_mutex for per-fatbin
  serialization
- vm/compat/CMakeLists.txt: enable llvm-vm when CUDA attach is on
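The thread-safe pool access added here can follow the usual mutex-guarded get-or-insert pattern; below is an illustrative sketch in that spirit (the pool type and names are assumptions, and the real members are `module_pool`/`ptx_pool` with their `_mutex_` counterparts; for brevity the builder runs under the lock, which a production version may avoid for slow builds).

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative mutex-guarded cache: concurrent fatbin loads race on shared
// pools, so every lookup-or-insert happens under a lock. build() is only
// invoked when the key is absent.
template <typename V>
class guarded_pool {
public:
	template <typename Build>
	V &get_or_build(const std::string &key, Build build)
	{
		std::lock_guard<std::mutex> guard(mutex_);
		auto it = pool_.find(key);
		if (it == pool_.end())
			it = pool_.emplace(key, build()).first;
		return it->second;
	}

private:
	std::mutex mutex_;
	std::unordered_map<std::string, V> pool_;
};
```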

Closes #552

* feat: add pass-internal early return for non-target PTX to reduce patch latency

Add ptx_may_contain_target_kernel() fast-path check in ptxpass_core
that skips heavy PTX parsing when the target kernel name is absent.
Applied at both pass level (kprobe_entry, kretprobe, kprobe_memcapture)
and framework level (hack_fatbin JSON serialization skip).
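Such a fast path can be as simple as a substring probe; a hedged sketch follows (the real `ptxpass_core` check may search mangled names or normalize whitespace):

```cpp
#include <string_view>

// Cheap pre-filter: if the target kernel's name never occurs anywhere in
// the PTX text, the expensive parse/patch pipeline cannot match it either,
// so the pass can return the input unmodified immediately.
inline bool ptx_may_contain_target_kernel(std::string_view ptx,
					  std::string_view kernel_name)
{
	return !kernel_name.empty() &&
	       ptx.find(kernel_name) != std::string_view::npos;
}
```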

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin patch: 5428ms -> 487ms (-91%)
- Cold first fatbin total: 5605ms -> 661ms (-88%)
- Combined with PR1: 27.9s -> 0.66s (-97.6%)

Also hoists eBPF instruction word packing out of the inner kernel loop
and adds a unit test for the new helper.

* refactor: separate PTX from JSON in pass interface to eliminate serialization overhead

Change process_input() ABI to receive PTX as raw pointer instead of
JSON-encoded. Meta-only JSON now contains kernel name, eBPF instructions,
and map symbols (~100 bytes vs 50-200KB per PTX).
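The split interface can be sketched as below. The parameter order follows the signature quoted in this commit, but the return convention and the pass-through body are assumptions; a real pass would parse the metadata JSON and patch the PTX.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the split process_input ABI: PTX travels as a raw
// buffer, and only the small metadata (kernel name, eBPF instructions, map
// symbols) is JSON. This stub passes the PTX through unmodified, which is
// what a pass does when the target kernel is absent.
extern "C" int process_input(const char *ptx, size_t ptx_len,
			     const char *meta_json, size_t meta_len,
			     char *out, size_t out_len)
{
	(void)meta_json;
	(void)meta_len;
	if (out_len < ptx_len) // caller sizes out from ptx_len, not 1 GB
		return -1;
	std::memcpy(out, ptx, ptx_len);
	return static_cast<int>(ptx_len); // bytes written
}
```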

Key changes:
- New process_input(ptx, ptx_len, meta_json, meta_len, out, out_len)
- Cache key uses sha256(raw_ptx) + kernel + sha256(ebpf) instead of
  sha256(full_json)
- Remove framework-level ptx_may_contain_target_kernel() pre-filter
  (no longer needed since pass calls are now cheap)
- Replace the fixed 1 GB per-call buffer allocation with one sized from
  the PTX length
- Update all 3 passes, unit tests, and README-passes.md

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin: 661ms -> 275ms (-58%)
- Cold patch: 487ms -> 106ms (-78%)
- Combined from baseline: 27.9s -> 0.275s (-99.0%, 102x)

* chore: revert unnecessary changes from optimization PR

- Revert README-passes.md: docs-only change not required for the optimization
- Revert ebpf_inst_words hoist: minor micro-optimization churn, not needed

Keeps the diff minimal and focused on the actual latency optimization.

* chore: revert unnecessary string_view and formatting changes

Revert non-essential refactoring to minimize the PR diff:
- Restore const std::string& signatures for validate_input,
  contains_entry_function, contains_ret_instruction,
  validate_ptx_version, find_kernel_body, pass_runtime_request_from_string
- Remove ptx_owned workaround in find_kernel_body that was only
  needed for the string_view conversion
- Restore original indentation for effective_module_pool ternary
- Restore original formatting for lambda capture and ebpf_inst_words

New function ptx_may_contain_target_kernel keeps string_view
since it's a new addition, not a signature change.

* chore: revert remaining unnecessary changes to minimize PR diff

* Revert "chore: revert remaining unnecessary changes to minimize PR diff"

This reverts commit a5bab96.

* chore: revert non-essential comment, log message, and include changes

- Restore original SPDLOG_WARN level and messages in frida_setup
- Restore original comments in core.hpp
- Remove unnecessary #include <cstddef>
- Restore original validate_input declaration formatting

* chore: simplify PR by inlining trivial helper functions

Remove unnecessary abstractions that wrapped 1-2 line operations:
- Inline runtime_request_ptx_view(), populate_runtime_request_ptx(),
  ptx_may_contain_target_kernel() at call sites
- Inline build_patch_cache_key() and estimate_pass_output_buffer_size()
- Restore NLOHMANN_DEFINE macros for RuntimeResponse (remove custom
  to_json/from_json that only omitted output_ptx for unmodified)
- Remove ptx_may_contain_target_kernel from core.cpp (inlined in passes)
- Update tests to match simplified serialization

* chore: restore original variable names and remove unnecessary defensive code

- Restore should_add_trampoline variable name (rename was unnecessary churn)
- Restore original SPDLOG_DEBUG message text
- Remove defensive empty-PTX error check (not required for optimization)

* chore: change GPU attach timing logs from INFO to DEBUG

---------

Co-authored-by: LinuxDev9002 <linuxdev8883@example.com>