Add gpu_device_manager that enumerates all CUDA devices at init time, caches per-device SM architectures, and provides device-aware APIs.

Key changes:
- Per-device SM arch detection and PTX compilation
- Per-device module loading with separate module pools
- Per-device patched kernel tracking for correct launch interception
- Device-aware run_attach_entry_on_gpu() with device_ordinal parameter
- Multi-device CUDAContext support in runtime
- Fix cuCtxCreate calls for CUDA 13 compatibility
- Add gpu_device_manager unit tests (tested on 8x NVIDIA B300 SXM6)
- Add multi-GPU vector addition example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… monitoring

Add per-GPU block timing using gridDim.x (helper 508) to identify which GPU each block belongs to from inside the GPU. Update README to emphasize bpftime's unique value: zero-modification black-box monitoring, cross-GPU shared maps via UVA, and programmable GPU-internal probes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CUDA 12.x uses cuCtxCreate_v2 (3 args) while CUDA 13+ uses cuCtxCreate_v4 (4 args). Add a #if CUDA_VERSION >= 13000 guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add helper 512 (bpf_get_device_ordinal): reads a per-module deviceOrdinal constant set by bpftime during CUmodule loading. Each GPU gets its own CUmodule with a unique ordinal, providing reliable GPU identification from inside eBPF probes regardless of workload distribution.
- Fix C1: device_count() now returns devices_.size() instead of the raw CUDA count, avoiding a mismatch when cuDeviceGet fails for some devices.
- Fix C2: run_attach_entry_on_gpu now uses the per-device shared_mem_device_ptr instead of always using the primary device's pointer.
- Fix start_ts collision: use a compound key (device_ordinal << 20 | block_id) to prevent cross-GPU block_id collisions in shared maps.
- Update the example to use the device ordinal instead of gridDim.x for per-GPU stats, making it work with any workload distribution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cy (#560)

* feat: skip compile/load of unmodified PTX in CUDA attach path

Optimize GPU attach latency by compiling and loading only the PTX files that were actually modified by the pass pipeline. Previously, all extracted PTX files flowed through compile+load even when unmodified.

On a llama.cpp 1B workload (RTX 5090, CUDA 12.9):
- First cold fatbin: 27.9s → 5.6s (-80%)
- PTX compile: 21.2s → 0.17s (-99.2%)
- Fatbins 3..48 mean: 210ms → 123ms (-41.5%)

Changes:
- nv_attach_impl.cpp: rename should_add_trampoline to ptx_modified for clarity; propagate the per-PTX modified flag from the pass pipeline
- nv_attach_fatbin_record.cpp: filter patched PTX to only modified entries before compile_ptxs() and the module-load loop; add thread-safe cache access for ptx_pool and module_pool; add a per-fatbin load_mutex; add timing instrumentation for the extract/patch/compile/load stages
- nv_attach_impl_frida_setup.cpp: add extract timing; downgrade expected "symbol not in patched module" messages to DEBUG
- nv_attach_impl.hpp: add module_pool_mutex_ and ptx_pool_mutex_
- nv_attach_fatbin_record.hpp: add load_mutex for per-fatbin serialization
- vm/compat/CMakeLists.txt: enable llvm-vm when CUDA attach is on

Closes #552

* feat: add pass-internal early return for non-target PTX to reduce patch latency

Add a ptx_may_contain_target_kernel() fast-path check in ptxpass_core that skips heavy PTX parsing when the target kernel name is absent. Applied at both the pass level (kprobe_entry, kretprobe, kprobe_memcapture) and the framework level (hack_fatbin JSON serialization skip).

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin patch: 5428ms -> 487ms (-91%)
- Cold first fatbin total: 5605ms -> 661ms (-88%)
- Combined with PR1: 27.9s -> 0.66s (-97.6%)

Also hoists eBPF instruction word packing out of the inner kernel loop and adds a unit test for the new helper.
* refactor: separate PTX from JSON in pass interface to eliminate serialization overhead

Change the process_input() ABI to receive PTX as a raw pointer instead of JSON-encoded. The meta-only JSON now contains the kernel name, eBPF instructions, and map symbols (~100 bytes vs 50-200KB per PTX).

Key changes:
- New process_input(ptx, ptx_len, meta_json, meta_len, out, out_len)
- Cache key uses sha256(raw_ptx) + kernel + sha256(ebpf) instead of sha256(full_json)
- Remove the framework-level ptx_may_contain_target_kernel() pre-filter (no longer needed since pass calls are now cheap)
- Fix the 1GB per-call buffer allocation; size from the PTX length instead
- Update all 3 passes, unit tests, and README-passes.md

On llama.cpp 1B (RTX 5090, CUDA 12.9):
- Cold first fatbin: 661ms -> 275ms (-58%)
- Cold patch: 487ms -> 106ms (-78%)
- Combined from baseline: 27.9s -> 0.275s (-99.0%, 102x)

* chore: revert unnecessary changes from optimization PR

- Revert README-passes.md: docs-only change not required for the optimization
- Revert the ebpf_inst_words hoist: minor micro-optimization churn, not needed

Keeps the diff minimal and focused on the actual latency optimization.

* chore: revert unnecessary string_view and formatting changes

Revert non-essential refactoring to minimize the PR diff:
- Restore const std::string& signatures for validate_input, contains_entry_function, contains_ret_instruction, validate_ptx_version, find_kernel_body, pass_runtime_request_from_string
- Remove the ptx_owned workaround in find_kernel_body that was only needed for the string_view conversion
- Restore the original indentation for the effective_module_pool ternary
- Restore the original formatting for the lambda capture and ebpf_inst_words

The new function ptx_may_contain_target_kernel keeps string_view since it is a new addition, not a signature change.

* chore: revert remaining unnecessary changes to minimize PR diff

* Revert "chore: revert remaining unnecessary changes to minimize PR diff"

This reverts commit a5bab96.
* chore: revert non-essential comment, log message, and include changes

- Restore the original SPDLOG_WARN level and messages in frida_setup
- Restore the original comments in core.hpp
- Remove the unnecessary #include <cstddef>
- Restore the original validate_input declaration formatting

* chore: simplify PR by inlining trivial helper functions

Remove unnecessary abstractions that wrapped 1-2 line operations:
- Inline runtime_request_ptx_view(), populate_runtime_request_ptx(), and ptx_may_contain_target_kernel() at call sites
- Inline build_patch_cache_key() and estimate_pass_output_buffer_size()
- Restore the NLOHMANN_DEFINE macros for RuntimeResponse (remove the custom to_json/from_json that only omitted output_ptx for unmodified PTX)
- Remove ptx_may_contain_target_kernel from core.cpp (inlined in the passes)
- Update tests to match the simplified serialization

* chore: restore original variable names and remove unnecessary defensive code

- Restore the should_add_trampoline variable name (the rename was unnecessary churn)
- Restore the original SPDLOG_DEBUG message text
- Remove the defensive empty-PTX error check (not required for the optimization)

* chore: change GPU attach timing logs from INFO to DEBUG

---------

Co-authored-by: LinuxDev9002 <linuxdev8883@example.com>
Summary
- Add gpu_device_manager that enumerates all CUDA devices at init, caches per-device SM architectures, and provides device-aware lookup APIs
- Update run_attach_entry_on_gpu() with a new device_ordinal parameter (backward compatible, defaults to auto-detect)
- Multi-device CUDAContext support in runtime with init_multi_gpu_contexts()
- Fix cuCtxCreate calls for CUDA 13 compatibility (cuCtxCreate_v4 4-parameter signature)
- Add gpu_device_manager unit tests
- Add a multi-GPU example (example/gpu/multi-gpu/)

Test plan
- gpu_device_manager tests
- Build with -DBPFTIME_ENABLE_CUDA_ATTACH=ON
- Verify the BPFTIME_SM_ARCH env var override works across all devices

🤖 Generated with Claude Code