feat: add multi-GPU support for CUDA attach #559

Open

yunwei37 wants to merge 5 commits into master from feat/multi-gpu-attach

Conversation

@yunwei37 (Member)

Summary

  • Add gpu_device_manager that enumerates all CUDA devices at init, caches per-device SM architectures, and provides device-aware lookup APIs
  • Per-device PTX compilation, module loading, and patched kernel tracking for correct multi-GPU launch interception
  • Device-aware run_attach_entry_on_gpu() with new device_ordinal parameter (backward compatible, defaults to auto-detect)
  • Multi-device CUDAContext support in runtime with init_multi_gpu_contexts()
  • Fix cuCtxCreate calls for CUDA 13 compatibility (cuCtxCreate_v4 4-parameter signature)
  • Add gpu_device_manager unit tests
  • Add multi-GPU vector addition example (example/gpu/multi-gpu/)
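The PR description names `gpu_device_manager` and `device_count()` but not the full interface; below is a minimal, hypothetical sketch of a manager of this shape. Real enumeration would use the CUDA driver API (`cuDeviceGetCount`/`cuDeviceGet`); here the enumerator is injected so the caching structure can be exercised without a GPU, and `device_entry`/`sm_arch()` are illustrative names.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a gpu_device_manager-style cache. The real class
// enumerates devices with the CUDA driver API at init time; enumeration is
// injected here so the structure runs without a GPU.
struct device_entry {
	int ordinal;         // CUDA device ordinal
	std::string sm_arch; // cached SM architecture, e.g. "sm_103"
};

class gpu_device_manager {
public:
	// enumerate() yields one entry per device that probed successfully,
	// so devices that fail cuDeviceGet never appear in devices_.
	explicit gpu_device_manager(
		std::function<std::vector<device_entry>()> enumerate)
		: devices_(enumerate())
	{
	}

	// Report the number of usable devices actually cached, not the raw
	// count the driver reported (see fix C1 in a later commit).
	size_t device_count() const { return devices_.size(); }

	// Device-aware lookup: per-device SM arch for PTX compilation.
	const std::string &sm_arch(int ordinal) const
	{
		for (const auto &d : devices_)
			if (d.ordinal == ordinal)
				return d.sm_arch;
		throw std::out_of_range("unknown device ordinal");
	}

private:
	std::vector<device_entry> devices_;
};
```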

Test plan

  • All 131 assertions in 23 test cases pass (including new gpu_device_manager tests)
  • Full project build succeeds with -DBPFTIME_ENABLE_CUDA_ATTACH=ON
  • Multi-GPU example verified on 8x NVIDIA B300 SXM6 AC (sm_103)
  • Single-GPU backward compatibility preserved (all new parameters default to device 0)
  • BPFTIME_SM_ARCH env var override works across all devices

🤖 Generated with Claude Code

Add gpu_device_manager that enumerates all CUDA devices at init time,
caches per-device SM architectures, and provides device-aware APIs.
Key changes:
- Per-device SM arch detection and PTX compilation
- Per-device module loading with separate module pools
- Per-device patched kernel tracking for correct launch interception
- Device-aware run_attach_entry_on_gpu() with device_ordinal parameter
- Multi-device CUDAContext support in runtime
- Fix cuCtxCreate calls for CUDA 13 compatibility
- Add gpu_device_manager unit tests (tested on 8x NVIDIA B300 SXM6)
- Add multi-GPU vector addition example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yusheng and others added 3 commits March 11, 2026 15:14
… monitoring

Add per-GPU block timing using gridDim.x (helper 508) to identify which GPU
each block belongs to from inside the GPU. Update README to emphasize bpftime's
unique value: zero-modification black-box monitoring, cross-GPU shared maps
via UVA, and programmable GPU-internal probes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CUDA 12.x uses cuCtxCreate_v2 (3 args) while CUDA 13+ uses
cuCtxCreate_v4 (4 args). Add #if CUDA_VERSION >= 13000 guard.
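The guard described above can be sketched as follows. The `cuCtxCreate_v2`/`cuCtxCreate_v4` names are real CUDA driver symbols; the small arg-count helper exists only so the branch condition can be checked without a CUDA toolkit installed.

```cpp
// Sketch of the version guard. In real code the branch is taken at compile
// time against the toolkit's CUDA_VERSION macro:
//
//   #if CUDA_VERSION >= 13000
//       // CUDA 13+: cuCtxCreate resolves to cuCtxCreate_v4, which takes an
//       // extra CUctxCreateParams* before the flags.
//       cuCtxCreate(&ctx, /*ctxCreateParams=*/nullptr, flags, dev);
//   #else
//       // CUDA 12.x: three-argument cuCtxCreate_v2.
//       cuCtxCreate(&ctx, flags, dev);
//   #endif
//
// Pure helper mirroring the same condition. CUDA_VERSION encodes the
// toolkit as 1000 * major + 10 * minor, so 12.9 is 12090 and 13.0 is 13000.
inline int ctx_create_arg_count(int cuda_version)
{
	return cuda_version >= 13000 ? 4 : 3;
}
```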

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add helper 512 (bpf_get_device_ordinal): reads per-module deviceOrdinal
  constant set by bpftime during CUmodule loading. Each GPU gets its own
  CUmodule with a unique ordinal, providing reliable GPU identification
  from inside eBPF probes regardless of workload distribution.

- Fix C1: device_count() now returns devices_.size() instead of the raw
  CUDA count, avoiding mismatch when cuDeviceGet fails for some devices.

- Fix C2: run_attach_entry_on_gpu now uses per-device shared_mem_device_ptr
  instead of always using the primary device's pointer.

- Fix start_ts collision: Use compound key (device_ordinal << 20 | block_id)
  to prevent cross-GPU block_id collisions in shared maps.
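The compound key is plain bit packing; a sketch (the function name is illustrative, while the 20-bit block field matches the shift quoted above):

```cpp
#include <cstdint>

// Pack (device_ordinal, block_id) into one map key so two GPUs running the
// same block_id no longer collide in shared maps. The low 20 bits hold
// block ids up to 1048575; the ordinal occupies the bits above.
inline uint64_t make_block_key(uint32_t device_ordinal, uint32_t block_id)
{
	return (static_cast<uint64_t>(device_ordinal) << 20) |
	       (block_id & 0xFFFFFu);
}
```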

- Update example to use device ordinal instead of gridDim.x for per-GPU
  stats, making it work with any workload distribution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yunwei37 yunwei37 requested review from Officeyutong and Sy0307 March 18, 2026 17:14
…cy (#560)

* feat: skip compile/load of unmodified PTX in CUDA attach path

Optimize GPU attach latency by only compiling and loading PTX files
that were actually modified by the pass pipeline. Previously all
extracted PTX files flowed through compile+load even when unmodified.

On a llama.cpp 1B workload (RTX 5090, CUDA 12.9):
- First cold fatbin: 27.9s → 5.6s (-80%)
- PTX compile: 21.2s → 0.17s (-99.2%)
- Fatbins 3..48 mean: 210ms → 123ms (-41.5%)

Changes:
- nv_attach_impl.cpp: rename should_add_trampoline to ptx_modified
  for clarity; propagate per-PTX modified flag from pass pipeline
- nv_attach_fatbin_record.cpp: filter patched PTX to only modified
  entries before compile_ptxs() and module-load loop; add thread-safe
  cache access for ptx_pool and module_pool; add per-fatbin load_mutex;
  add timing instrumentation for extract/patch/compile/load stages
- nv_attach_impl_frida_setup.cpp: add extract timing; downgrade
  expected "symbol not in patched module" messages to DEBUG
- nv_attach_impl.hpp: add module_pool_mutex_ and ptx_pool_mutex_
- nv_attach_fatbin_record.hpp: add load_mutex for per-fatbin
  serialization
- vm/compat/CMakeLists.txt: enable llvm-vm when CUDA attach is on
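The thread-safe pool access added here can follow the usual mutex-guarded get-or-insert pattern; below is an illustrative sketch in that spirit (the pool type and names are assumptions, and the real members are `module_pool`/`ptx_pool` with their `_mutex_` counterparts; for brevity the builder runs under the lock, which a production version may avoid for slow builds).

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative mutex-guarded cache: concurrent fatbin loads race on shared
// pools, so every lookup-or-insert happens under a lock. build() is only
// invoked when the key is absent.
template <typename V>
class guarded_pool {
public:
	template <typename Build>
	V &get_or_build(const std::string &key, Build build)
	{
		std::lock_guard<std::mutex> guard(mutex_);
		auto it = pool_.find(key);
		if (it == pool_.end())
			it = pool_.emplace(key, build()).first;
		return it->second;
	}

private:
	std::mutex mutex_;
	std::unordered_map<std::string, V> pool_;
};
```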

Closes #552

* feat: add pass-internal early return for non-target PTX to reduce patch latency

Add ptx_may_contain_target_kernel() fast-path check in ptxpass_core
that skips heavy PTX parsing when the target kernel name is absent.
Applied at both pass level (kprobe_entry, kretprobe, kprobe_memcapture)
and framework level (hack_fatbin JSON serialization skip).
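Such a fast path can be as simple as a substring probe; a hedged sketch follows (the real `ptxpass_core` check may search mangled names or normalize whitespace):

```cpp
#include <string_view>

// Cheap pre-filter: if the target kernel's name never occurs anywhere in
// the PTX text, the expensive parse/patch pipeline cannot match it either,
// so the pass can return the input unmodified immediately.
inline bool ptx_may_contain_target_kernel(std::string_view ptx,
					  std::string_view kernel_name)
{
	return !kernel_name.empty() &&
	       ptx.find(kernel_name) != std::string_view::npos;
}
```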

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin patch: 5428ms -> 487ms (-91%)
- Cold first fatbin total: 5605ms -> 661ms (-88%)
- Combined with PR1: 27.9s -> 0.66s (-97.6%)

Also hoists eBPF instruction word packing out of the inner kernel loop
and adds a unit test for the new helper.

* refactor: separate PTX from JSON in pass interface to eliminate serialization overhead

Change process_input() ABI to receive PTX as raw pointer instead of
JSON-encoded. Meta-only JSON now contains kernel name, eBPF instructions,
and map symbols (~100 bytes vs 50-200KB per PTX).
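The split interface can be sketched as below. The parameter order follows the signature quoted in this commit, but the return convention and the pass-through body are assumptions; a real pass would parse the metadata JSON and patch the PTX.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch of the split process_input ABI: PTX travels as a raw
// buffer, and only the small metadata (kernel name, eBPF instructions, map
// symbols) is JSON. This stub passes the PTX through unmodified, which is
// what a pass does when the target kernel is absent.
extern "C" int process_input(const char *ptx, size_t ptx_len,
			     const char *meta_json, size_t meta_len,
			     char *out, size_t out_len)
{
	(void)meta_json;
	(void)meta_len;
	if (out_len < ptx_len) // caller sizes out from ptx_len, not 1 GB
		return -1;
	std::memcpy(out, ptx, ptx_len);
	return static_cast<int>(ptx_len); // bytes written
}
```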

Key changes:
- New process_input(ptx, ptx_len, meta_json, meta_len, out, out_len)
- Cache key uses sha256(raw_ptx) + kernel + sha256(ebpf) instead of
  sha256(full_json)
- Remove framework-level ptx_may_contain_target_kernel() pre-filter
  (no longer needed since pass calls are now cheap)
- Replace the fixed 1 GB per-call buffer allocation with one sized from
  the PTX length
- Update all 3 passes, unit tests, and README-passes.md

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin: 661ms -> 275ms (-58%)
- Cold patch: 487ms -> 106ms (-78%)
- Combined from baseline: 27.9s -> 0.275s (-99.0%, 102x)

* chore: revert unnecessary changes from optimization PR

- Revert README-passes.md: docs-only change not required for the optimization
- Revert ebpf_inst_words hoist: minor micro-optimization churn, not needed

Keeps the diff minimal and focused on the actual latency optimization.

* chore: revert unnecessary string_view and formatting changes

Revert non-essential refactoring to minimize the PR diff:
- Restore const std::string& signatures for validate_input,
  contains_entry_function, contains_ret_instruction,
  validate_ptx_version, find_kernel_body, pass_runtime_request_from_string
- Remove ptx_owned workaround in find_kernel_body that was only
  needed for the string_view conversion
- Restore original indentation for effective_module_pool ternary
- Restore original formatting for lambda capture and ebpf_inst_words

New function ptx_may_contain_target_kernel keeps string_view
since it's a new addition, not a signature change.

* chore: revert remaining unnecessary changes to minimize PR diff

* Revert "chore: revert remaining unnecessary changes to minimize PR diff"

This reverts commit a5bab96.

* chore: revert non-essential comment, log message, and include changes

- Restore original SPDLOG_WARN level and messages in frida_setup
- Restore original comments in core.hpp
- Remove unnecessary #include <cstddef>
- Restore original validate_input declaration formatting

* chore: simplify PR by inlining trivial helper functions

Remove unnecessary abstractions that wrapped 1-2 line operations:
- Inline runtime_request_ptx_view(), populate_runtime_request_ptx(),
  ptx_may_contain_target_kernel() at call sites
- Inline build_patch_cache_key() and estimate_pass_output_buffer_size()
- Restore NLOHMANN_DEFINE macros for RuntimeResponse (remove custom
  to_json/from_json that only omitted output_ptx for unmodified)
- Remove ptx_may_contain_target_kernel from core.cpp (inlined in passes)
- Update tests to match simplified serialization

* chore: restore original variable names and remove unnecessary defensive code

- Restore should_add_trampoline variable name (rename was unnecessary churn)
- Restore original SPDLOG_DEBUG message text
- Remove defensive empty-PTX error check (not required for optimization)

* chore: change GPU attach timing logs from INFO to DEBUG

---------

Co-authored-by: LinuxDev9002 <linuxdev8883@example.com>