Conversation
Pull request overview
This PR adds a new GPU prefetch example demonstrating CUDA unified memory (UVM) prefetching capabilities with eBPF tracing support. However, the implementation appears incomplete with several critical issues including incorrect documentation (copied from threadhist), kernel tracing mismatches, and unused code artifacts.
Key changes:
- Adds a `seq_prefetch_kernel` CUDA kernel showcasing batch-based memory prefetching with `prefetch.global.L2` instructions
- Implements BPF helper function 0x509 (`bpf_prefetch_l2`) in the GPU trampoline infrastructure
- Includes a BPF program for kernel execution tracing (though it targets the wrong kernel)
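For orientation, a GPU-side L2 prefetch helper of this kind is usually a thin inline-assembly wrapper around the PTX `prefetch.global.L2` instruction. The sketch below is illustrative only; the name `prefetch_l2` mirrors the helper described in this PR, but the exact signature in the PR may differ:

```cuda
// Illustrative sketch, not the exact code under review: a device-side
// wrapper around the PTX prefetch.global.L2 instruction.
__device__ __forceinline__ void prefetch_l2(const void *ptr)
{
	// Ask the memory system to bring the cache line containing `ptr`
	// into L2. This is purely a performance hint: it cannot fault and
	// has no effect on program correctness.
	asm volatile("prefetch.global.L2 [%0];" ::"l"(ptr));
}
```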
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| example/gpu/prefetch/prefetch_example.cu | CUDA application demonstrating UVM prefetching with batch-based memory access patterns; contains Chinese comments and unused parameters |
| example/gpu/prefetch/prefetch.c | Userspace BPF loader for tracing GPU kernel executions; appears to be boilerplate from threadhist example |
| example/gpu/prefetch/prefetch.bpf.c | eBPF program for GPU kernel tracing; incorrectly targets vectorAdd instead of seq_prefetch_kernel |
| example/gpu/prefetch/README.md | Documentation entirely copied from threadhist without updates; completely incorrect for this example |
| example/gpu/prefetch/Makefile | Build configuration for prefetch example; references non-existent vec_add target |
| example/gpu/prefetch/.gitignore | Git ignore patterns; includes incorrect references to threadhist and vec_add |
| attach/nv_attach_impl/trampoline_ptx.h | PTX assembly for prefetch helper function 0x509 |
| attach/nv_attach_impl/trampoline/default_trampoline.cu | Implements GPU-side prefetch helper and adds unused GPU_HASH_MAP constant |
example/gpu/prefetch/README.md (outdated)
# threadhist - GPU Thread Execution Histogram

## Overview

When you launch a CUDA kernel with multiple threads, you expect each thread to do roughly equal work. But what if thread 0 executes the kernel 200,000 times while thread 4 only runs 150,000 times? That's a 25% workload imbalance that's silently degrading your GPU performance.

`threadhist` reveals these hidden load imbalances by counting how many times each GPU thread actually executes your kernel. Unlike traditional profilers that show aggregate metrics, this tool exposes per-thread execution patterns that directly impact performance.

## Understanding GPU Thread Execution

When you launch a CUDA kernel like `vectorAdd<<<1, 5>>>()`, you're creating 5 threads (threadIdx.x = 0 through 4) that should process your data in parallel. In an ideal world, all 5 threads would execute the kernel the same number of times and do equal work.

However, several factors can cause **load imbalance**:

- **Grid-stride loops**: The last thread might process fewer elements because the array size doesn't divide evenly
- **Conditional branches**: Some threads might skip work based on their index or data values
- **Early exits**: Threads might return early when they hit boundary conditions
- **Poor work distribution**: The algorithm itself might assign unequal work to different threads
These imbalances mean some threads are busy while others are idle, wasting precious GPU compute capacity. But because they happen inside the kernel, traditional profiling tools can't see them; they only report that "the kernel took X milliseconds."

## What This Tool Captures

`threadhist` uses a GPU array map to maintain a per-thread counter. Every time a thread exits the kernel, it increments its counter by 1. The userspace program periodically reads these counters and displays a histogram showing exactly how many times each thread has executed.

## Why This Matters

### The Grid-Stride Workload Imbalance

You're processing a 1 million element array with a grid-stride loop. Your kernel launches with 5 threads, and you expect each to process about 200,000 elements. After running `threadhist` for a few seconds, you see:

```
Thread 0: 210432
Thread 1: 210432
Thread 2: 210432
Thread 3: 210432
Thread 4: 158304   (Only 75% of the work!)
```

What's happening? Your kernel is being invoked repeatedly in a loop, and each invocation processes a chunk of data. Due to how your grid-stride loop is written, thread 4 consistently finishes early because the remaining elements divide unevenly. While threads 0-3 continue working, thread 4 sits idle.

**The fix**: Adjust your thread block configuration or restructure the grid-stride loop to distribute the boundary work more evenly. After optimization, all threads show similar counts, indicating a balanced workload.

### The Conditional Branch Mystery

You're running a kernel that processes data with some conditional logic. The histogram reveals:

```
Thread 0: 195423
Thread 1: 195423
Thread 2: 98156    (50% fewer executions!)
Thread 3: 195423
Thread 4: 195423
```

Thread 2 is executing significantly less often than the others. Looking at your code, you discover there's a conditional that causes thread 2 to exit early in certain cases:

```cuda
if (threadIdx.x == 2 && someCondition()) {
	return; // Early exit
}
```

This pattern means thread 2 is doing half the work, but the other threads in the warp have to wait for it during the iterations where it does execute. This is **warp divergence** causing serialization, and the idle time from thread 2's early exits wastes GPU cycles.

**The insight**: Either remove the branch to make all threads follow the same path, or restructure your data so this condition doesn't correlate with specific thread indices.
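One way to picture the "remove the branch" option is to keep control flow uniform and predicate only the final store. This is a hypothetical sketch; `someCondition()` is assumed from the snippet above and `compute()` stands in for the kernel's real work:

```cuda
__device__ bool someCondition(); // assumed from the snippet above
__device__ float compute(float v); // stands in for the kernel's real work

__global__ void process(float *data)
{
	int i = threadIdx.x;
	bool skip = (i == 2) && someCondition();
	// Every thread runs the same instructions, so the warp never
	// serializes on a divergent early return...
	float result = compute(data[i]);
	// ...and only the store is predicated on the condition.
	if (!skip)
		data[i] = result;
}
```

The tradeoff is that thread 2 now spends cycles computing a result it may discard, in exchange for avoiding divergence across the whole warp.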
### Detecting Completely Idle Threads

In a more extreme case, you might see:

```
Thread 0: 187234
Thread 1: 187234
Thread 2: 187234
Thread 3: 0   (Never executed!)
Thread 4: 0   (Never executed!)
```

Threads 3 and 4 aren't executing at all! This indicates a bug in your kernel launch configuration or grid-stride logic. Perhaps your workload size only requires 3 threads, but you're launching 5, wasting GPU resources. Or maybe there's a bug where certain thread indices never enter the main processing loop.

**The action**: Adjust your kernel launch parameters to match actual workload requirements, or fix the loop logic to ensure all threads participate.

## Building

```bash
# From bpftime root directory
make -C example/gpu/threadhist
```

Requirements:
- bpftime built with CUDA support (`-DBPFTIME_ENABLE_CUDA_ATTACH=1`)
- CUDA toolkit installed

## Running

### Terminal 1: Start the histogram collector
```bash
BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/syscall-server/libbpftime-syscall-server.so \
  example/gpu/threadhist/threadhist
```

### Terminal 2: Run your CUDA application
```bash
BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/agent/libbpftime-agent.so example/gpu/threadhist/vec_add
```

Or trace any CUDA application:
```bash
LD_PRELOAD=build/runtime/agent/libbpftime-agent.so \
  ./your_cuda_app
```

## Example Output

```
12:34:56
Thread 0: 210432
Thread 1: 210432
Thread 2: 210432
Thread 3: 210432
Thread 4: 158304
```

The timestamp shows when the snapshot was taken, followed by the total execution count for each thread since the program started.

## Use Cases

### 1. Optimizing Thread Block Configuration

You're experimenting with different block sizes. By running `threadhist` with various configurations, you can quickly see which configuration produces the most balanced workload distribution, maximizing GPU utilization.

### 2. Validating Grid-Stride Loop Implementations

After implementing or modifying a grid-stride loop, verify that all threads are executing roughly equally. Large discrepancies indicate the loop isn't distributing work evenly.

### 3. Detecting Algorithmic Imbalances

Some algorithms inherently create load imbalance (e.g., processing a sparse matrix where some threads have many elements and others few). The histogram quantifies this imbalance, helping you decide whether to redesign the algorithm or accept the tradeoff.

### 4. Debugging Thread Launch Issues

If threads show zero executions, you've caught a bug in your launch configuration or kernel logic before it becomes a production issue.

## How It Works

1. Attaches an eBPF kretprobe to the target CUDA kernel function
2. On kernel exit, each GPU thread increments its counter in the GPU array map: `*cnt += 1`
3. The GPU array map allocates per-thread storage automatically (one `u64` per thread)
4. Userspace periodically reads the entire array and prints the histogram
5. Counters accumulate over time, showing total executions since program start
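The counting path (steps 1-3) can be pictured as a small eBPF program. This is a hedged sketch modeled on the description above: the map declaration follows ordinary libbpf conventions, and the way the per-thread index reaches the program is an assumption, since the exact bpftime GPU helpers are not shown here.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* One u64 counter per GPU thread; in bpftime this array is backed by
 * GPU-side storage. max_entries here is illustrative. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u64);
} exec_counts SEC(".maps");

SEC("kretprobe/_Z9vectorAddPKfS0_Pf") /* step 1: attach at kernel exit */
int on_kernel_exit(void *ctx)
{
	__u32 tid = 0; /* assumption: the real program obtains the
			* per-thread index from a bpftime helper */
	__u64 *cnt = bpf_map_lookup_elem(&exec_counts, &tid);
	if (cnt)
		*cnt += 1; /* step 2: count this thread's exit */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```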
## Code Structure

- **`threadhist.bpf.c`**: eBPF program running on the GPU at kernel exit, incrementing per-thread counters
- **`threadhist.c`**: Userspace loader that reads and displays the histogram
- **`vec_add.cu`**: Example CUDA application for testing

## Limitations

- Shows cumulative counts since program start (not per-second rates)
- Fixed thread count (hardcoded to 7 threads in the example, line 87 of `threadhist.c`)
- Only tracks kernel exits (doesn't show per-invocation timing)
- Requires knowing the kernel function symbol name

## Customization

To trace a different kernel, modify the SEC annotation in `threadhist.bpf.c`:

```c
SEC("kretprobe/_Z9vectorAddPKfS0_Pf") // Current: vectorAdd(const float*, const float*, float*)
```

Find your kernel's mangled name with:
```bash
cuobjdump -symbols your_app | grep your_kernel_name
```

To monitor more threads, change the thread count parameter in `threadhist.c:87`:
```c
print_stat(skel, 32); // Monitor 32 threads instead of 7
```

## Interpreting Results

**Perfectly balanced** (all threads ±5%):
```
Thread 0: 200000
Thread 1: 199876
Thread 2: 200124
```
✓ Excellent - GPU resources fully utilized

**Slight imbalance** (10-20% variance):
```
Thread 0: 200000
Thread 1: 180000
```
⚠ Acceptable for complex algorithms, but investigate if possible

**Severe imbalance** (>25% variance):
```
Thread 0: 200000
Thread 4: 120000
```
❌ Performance problem - restructure workload distribution

**Idle threads** (zero counts):
```
Thread 3: 0
```
❌ Bug or misconfiguration - fix immediately

## Troubleshooting

**All threads show zero**: The eBPF program didn't attach. Check that the kernel name matches exactly (including C++ mangling).

**Counts don't match expectations**: Ensure you're measuring the correct kernel. Use `cuobjdump` to verify the symbol name.

**Output doesn't update**: The application might not be invoking the kernel. Check that both processes are running and communicating via shared memory.
The entire README.md file appears to be copied from the threadhist example without being updated for the prefetch example. The documentation describes threadhist functionality (thread execution histograms) but this directory contains a prefetch example. The README should be rewritten to describe:
- What the prefetch example demonstrates (GPU memory prefetching with CUDA unified memory)
- How the seq_prefetch_kernel works
- The purpose of the prefetch.bpf.c eBPF program
- How to build and run the prefetch_example
- The relationship between the BPF tracing and the CUDA prefetch operations
All references to "threadhist", thread execution counts, and load balancing should be replaced with content relevant to the prefetch functionality.
```cpp
	size_t stride_elems,
	int prefetch_distance_pages)
```
The parameter prefetch_distance_pages is declared in the function signature and passed to the kernel, but it's never actually used inside the seq_prefetch_kernel. The kernel uses hard-coded values BATCH_SIZE = 4 and PREFETCH_BATCHES = 2 instead. Either:
- Remove the unused parameter from the function signature and kernel call, or
- Use the parameter to configure the prefetch distance as originally intended
This appears to be dead code left over from an earlier implementation.
Suggested change:
```diff
-	size_t stride_elems,
-	int prefetch_distance_pages)
+	size_t stride_elems)
```
```make
.PHONY: clean
clean:
	$(call msg,CLEAN)
	$(Q)rm -rf $(OUTPUT) $(APPS) vec_add
```
In line 81, the Makefile clean target removes vec_add, but this file is never built by this Makefile. Looking at the all target (line 69), only prefetch and prefetch_example are built. The reference to vec_add should be removed from the clean target.
Suggested change:
```diff
-	$(Q)rm -rf $(OUTPUT) $(APPS) vec_add
+	$(Q)rm -rf $(OUTPUT) $(APPS)
```
```cpp
// extern "C" __noinline__ __device__ uint64_t _bpf_helper_ext_0510(
// 	uint64_t time_to_sleep, uint64_t, uint64_t, uint64_t, uint64_t)
// {
// 	asm volatile("nanosleep.u32 %0;" ::"r"((uint32_t)time_to_sleep));
// 	return 0;
// }
```
The commented-out code for _bpf_helper_ext_0510 (nanosleep helper) should either be:
- Removed if it's not needed for this PR, or
- Uncommented and properly implemented if it's intended to be part of the prefetch functionality
Dead/commented code that isn't actively being used should be removed to maintain code cleanliness unless there's a specific reason to keep it (which should be documented with a comment explaining why).
Suggested change:
```diff
-// extern "C" __noinline__ __device__ uint64_t _bpf_helper_ext_0510(
-// 	uint64_t time_to_sleep, uint64_t, uint64_t, uint64_t, uint64_t)
-// {
-// 	asm volatile("nanosleep.u32 %0;" ::"r"((uint32_t)time_to_sleep));
-// 	return 0;
-// }
```
```cpp
const int BPF_MAP_TYPE_GPU_HASH_MAP = 1501; // non-per-thread, single-copy
					    // shared hashmap
```
The constant BPF_MAP_TYPE_GPU_HASH_MAP is defined but never used in this file or the changes shown. If this is part of the prefetch functionality, it should be used; otherwise, it should be removed. If it's being added for future use, that should be documented with a comment explaining its purpose.
Suggested change:
```diff
-const int BPF_MAP_TYPE_GPU_HASH_MAP = 1501; // non-per-thread, single-copy
-					    // shared hashmap
+// Reserved for future use: GPU hash map type for non-per-thread,
+// single-copy shared hashmap.
+const int BPF_MAP_TYPE_GPU_HASH_MAP = 1501;
```
```cpp
// Prefetch distance: how many pages ahead to prefetch
// Start with 4 pages ahead
int prefetch_distance_pages = 4;
```
The variable prefetch_distance_pages is initialized to 4 but never used, as noted earlier in the kernel. This initialization on lines 236-238 along with the comment "Start with 4 pages ahead" suggests it was intended to be configurable, but the functionality was never completed. Either remove this unused variable or complete the implementation to actually use it.
```
/threadhist
/.output
/victim*
/vec_add.cpp
/vec_add
/vec_add-new.cpp
/prefetch
```
The .gitignore file references files that don't match the prefetch example. Specifically:
- Line 1: `/threadhist` should likely be `/prefetch` (the actual binary name per Makefile line 24)
- Lines 4-5: `/vec_add.cpp` and `/vec_add` don't appear to be built by this example

The .gitignore should be updated to reflect the actual build artifacts of the prefetch example.
Suggested change:
```diff
-/threadhist
-/.output
-/victim*
-/vec_add.cpp
-/vec_add
-/vec_add-new.cpp
-/prefetch
+/prefetch
+/.output
+/victim*
```
```cuda
// Page batch size processed per iteration
const int BATCH_SIZE = 4;
// Prefetch window size (in batches)
const int PREFETCH_BATCHES = 2;

for (int c = 0; c < chunks_per_thread; ++c) {
	size_t chunk_id = (size_t)tid * chunks_per_thread + c;
	size_t chunk_start = chunk_id * chunk_elems;
	size_t chunk_end = min(chunk_start + chunk_elems, N);

	if (chunk_start >= N)
		continue;

	size_t chunk_size = chunk_end - chunk_start;
	size_t pages_in_chunk =
		(chunk_size + elems_per_page - 1) / elems_per_page;
	size_t batches_in_chunk =
		(pages_in_chunk + BATCH_SIZE - 1) / BATCH_SIZE;

	for (int b = 0;
	     b < PREFETCH_BATCHES && b < (int)batches_in_chunk; ++b) {
#pragma unroll
		for (int p = 0; p < BATCH_SIZE; ++p) {
			size_t page = b * BATCH_SIZE + p;
			if (page < pages_in_chunk) {
				size_t prefetch_elem =
					chunk_start +
					page * elems_per_page;
				if (prefetch_elem < N) {
					prefetch_l2(
						&input[prefetch_elem]);
					prefetch_l2(
						&output[prefetch_elem]);
				}
			}
		}
	}

	// Process batch by batch
	for (size_t batch_idx = 0; batch_idx < batches_in_chunk;
	     ++batch_idx) {
		// Prefetch future batches
		size_t prefetch_batch = batch_idx + PREFETCH_BATCHES;
		if (prefetch_batch < batches_in_chunk) {
#pragma unroll
			for (int p = 0; p < BATCH_SIZE; ++p) {
				size_t page =
					prefetch_batch * BATCH_SIZE + p;
				if (page < pages_in_chunk) {
					size_t prefetch_elem =
						chunk_start +
						page * elems_per_page;
					if (prefetch_elem < N) {
						prefetch_l2(
							&input[prefetch_elem]);
						prefetch_l2(
							&output[prefetch_elem]);
					}
				}
			}
		}

		// Process all pages in the current batch
```
These comments are in Chinese and should be translated to English for consistency with the rest of the codebase. The comments say:
- Line 104: "每次处理的页面批次大小" (Page batch size processed each time)
- Line 106: "预取窗口大小(批次数)" (Prefetch window size (number of batches))
- Line 142: "按批次处理" (Process by batch)
- Line 145: "预取未来批次" (Prefetch future batches)
- Line 166: "处理当前批次的所有页面" (Process all pages in the current batch)
```cpp
inline void run_seq_device_prefetch(size_t total_working_set,
				    const std::string &mode,
				    size_t stride_bytes, int iterations,
				    std::vector<float> &runtimes,
				    KernelResult &result)
```
The function name run_seq_device_prefetch is misleading. The function explicitly checks that mode is NOT "device" (line 203-206) and only works with UVM (Unified Virtual Memory) modes. The function should be renamed to something like run_seq_uvm_prefetch or run_uvm_device_prefetch to accurately reflect that it only supports UVM modes, not regular device memory.
```
}
// .globl _bpf_helper_ext_0509  // -- Begin function _bpf_helper_ext_0509
```
There's an inconsistency in the PTX code structure. Line 1549 has a closing brace } that appears to close a function that wasn't opened in the visible diff. This suggests the PTX assembly might have formatting issues. The comment on line 1550 indicates this is the beginning of _bpf_helper_ext_0509, but it's placed after what looks like a closing brace from a previous function. While this may be intentional based on the full file context, it's worth verifying that the PTX assembly structure is correct and properly formatted.
Signed-off-by: Officeyutong <yt.xyxx@gmail.com>
Please try to use Copilot to summarize your PR. You don't need to fill in all the info below; it can simply help give you a checklist.
Description
Fixes # (issue)
Type of change
How Has This Been Tested?
Test Configuration:
Checklist