Skip to content
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,4 +14,9 @@
**/load

compile_commands.json
.cache/
.cache/

# AI Agents
.vscode/
.cursor/
.claude/
143 changes: 143 additions & 0 deletions .planning/codebase/ARCHITECTURE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,143 @@
# Architecture

**Analysis Date:** 2026-03-25

## Pattern Overview

**Overall:** Linux Kernel Module with User-Space API

**Key Characteristics:**
- Character device driver interface with standard file operations (open, read, write, ioctl, poll, mmap, release)
- Platform driver (rce_stream, rce_memmap) and PCI driver (data_dev) architectures
- Hardware abstraction layer via `hardware_functions` structure for board-specific operations
- Shared common library for DMA buffer management, queues, and common functions
- Cross-platform support for x86_64 (data_dev with NVIDIA GPU) and ARM (RCE)

## Layers

**Application Layer:**
- Purpose: User-space applications for testing, benchmarking, and data transfer
- Location: `common/app/`, `data_dev/app/`, `rce_stream/app/`
- Contains: C++ test applications (dmaLoopTest, dmaRead, dmaWrite, test)
- Depends on: User-space driver library (`include/DmaDriver.h`, `include/AxisDriver.h`)
- Used by: Test infrastructure and integration scripts

**User-Space Driver Library:**
- Purpose: User-space API for kernel driver communication
- Location: `include/` (DmaDriver.h, AxisDriver.h, AxiVersion.h, GpuAsync.h)
- Contains: ioctl command definitions, data structures (DmaWriteData, DmaReadData), inline helper functions
- Pattern: Header-only library with conditional compilation for kernel vs user space

**Kernel Driver Core (Common):**
- Purpose: Shared DMA infrastructure and buffer management
- Location: `common/driver/`
- Contains:
- `dma_common.c/h`: Core DMA device initialization, file operations, buffer management
- `dma_buffer.c/h`: Buffer allocation, tracking, and queue management
- `axis_gen1.c/h`, `axis_gen2.c/h`: AXI stream protocol implementations
- `axi_version.c/h`: AXI version register access
- Depends on: Linux kernel APIs (pci, platform_driver, dma, irq)

**Hardware Abstraction Layer:**
- Purpose: Board-specific hardware operations
- Location: Per-board driver directories
- Contains: Hardware-specific implementations of `hardware_functions` structure
- Pattern: Driver core calls hwFunc->init, hwFunc->sendBuffer, hwFunc->irq, etc.

**Hardware Layer:**
- Purpose: Physical PCIe devices and FPGA AXI interfaces
- Data_dev: TID-AIR PCIe cards with optional NVIDIA GPU Direct RDMA
- RCE Stream: RCE AXI stream DMA controller
- RCE Memmap: RCE memory mapping interface

## Data Flow

**DMA Write Flow (Software -> Firmware):**

1. User calls `dmaWrite()` or `dmaWriteIndex()` from application
2. `Dma_Write()` kernel function receives write request
3. Buffer retrieved from transmit queue (`dmaQueuePop(&(dev->tq))`)
4. Data copied from user space to DMA buffer
5. Hardware-specific `sendBuffer()` called (e.g., `AxisG1_SendBuffer`)
6. Buffer handle written to hardware TX FIFOs
7. Interrupt handler (`AxisG1_Irq`) receives completion status
8. Buffer returned to queue for reuse

**DMA Read Flow (Firmware -> Software):**

1. Hardware receives data via AXI stream interface
2. Interrupt handler (`AxisG1_Irq`) reads from RX FIFOs
3. Buffer located in buffer list (`dmaFindBufferList`)
4. Buffer populated with size, flags, destination, error info
5. Buffer pushed to descriptor's RX queue (`dmaRxBuffer`)
6. User calls `dmaRead()` or `dmaReadIndex()`
7. `Dma_Read()` pops from descriptor queue and copies to user space
8. Buffer returned to hardware RX pool

**State Management:**
- Buffer ownership tracked per-descriptor via `userHas` pointer
- Hardware queue state managed via `inHw`, `inQ` flags on each buffer
- Destination masking via `destMask` to reserve lanes for specific consumers

## Key Abstractions

**DmaDevice:**
- Purpose: Represents a single DMA-capable device instance
- Location: `common/driver/dma_common.h`
- Contains: PCI device, register mappings, buffer lists, configuration, hardware functions
- Pattern: Global array `gDmaDevices[]` with count tracking

**DmaDesc (Descriptor):**
- Purpose: Per-file-descriptor state for DMA operations
- Location: `common/driver/dma_common.h`
- Contains: Destination mask, RX queue, async notification queue
- Pattern: Allocated in `open()`, freed in `release()`

**hardware_functions:**
- Purpose: Board-specific operation callbacks
- Location: `common/driver/dma_common.h`
- Fields: irq, init, enable, clear, irqEnable, retRxBuffer, sendBuffer, command, seqShow
- Pattern: Each driver defines its own implementation structure

**DmaBuffer:**
- Purpose: Managed memory region for DMA transfers
- Location: `common/driver/dma_buffer.h`
- Contains: Virtual/physical addresses, size, index, ownership tracking
- Pattern: Allocated in bulk during device initialization

## Entry Points

**PCI Probe (data_dev):**
- Location: `data_dev/driver/src/data_dev_top.c:DataDev_Probe`
- Triggers: PCI device enumeration finds matching device
- Responsibilities: Allocate DmaDevice, map registers, initialize buffers, register char device

**Platform Probe (rce_stream):**
- Location: `rce_stream/driver/src/rce_top.c:Rce_Probe`
- Triggers: Device tree match for "axi_stream_dma" compatible device
- Responsibilities: Same as PCI probe, uses of_platform API

**File Operations:**
- Location: `common/driver/dma_common.c:Dma_Open`, `Dma_Close`, `Dma_Read`, `Dma_Write`, `Dma_Ioctl`
- Triggers: User-space system calls on `/dev/datadev_*`
- Responsibilities: Request handling, buffer management, ioctl command dispatch

## Error Handling

**Strategy:** Return negative error codes to userspace; log errors via `dev_err()`, warnings via `dev_warn()`

**Patterns:**
- Memory allocation failures return `-ENOMEM`
- Invalid parameters return `-EINVAL` (or `-1` in legacy code)
- Hardware errors logged with descriptive messages
- Buffer errors tracked via error bits in `DmaReadData.error`

## Cross-Cutting Concerns

**Logging:** Uses kernel `printk` macros (`dev_info`, `dev_warn`, `dev_err`) with device pointers
**Validation:** Buffer sizes checked against `cfgSize`; addresses validated against mapped regions
**Synchronization:** Spinlocks for hardware write, command, and mask operations; prevents race conditions

---

*Architecture analysis: 2026-03-25*
172 changes: 172 additions & 0 deletions .planning/codebase/CONCERNS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,172 @@
# Codebase Concerns

**Analysis Date:** 2026-03-25

## Tech Debt

### Empty Function Implementations in RCE_HP Buffers Driver
- **Issue:** Several hardware function implementations in `rce_hp_buffers/driver/src/rce_hp.c` are empty stubs that return nothing or zero values without actually performing hardware operations:
- `RceHp_RetRxBuffer()` - Empty function (line 89)
- `RceHp_SendBuffer()` - Returns 0 without actually sending (line 92-94)
- `RceHp_Command()` - Returns 0 without any command handling (line 97-99)
- `RceHp_SeqShow()` - Empty function (line 101)
- `RceHp_IrqEnable()` - Empty function (line 85)
- **Files:** `rce_hp_buffers/driver/src/rce_hp.c`, `rce_hp_buffers/driver/src/rce_hp.h`
- **Impact:** The RCE_HP driver is non-functional for actual hardware operations. It cannot return buffers, send buffers, execute commands, or show device status.
- **Fix approach:** Either implement the hardware-specific operations for the RCE_HP card or remove this driver if it's obsolete. If the hardware is not present, add a clear comment explaining this is a placeholder.

### Kernel Version Compatibility Macros
- **Issue:** Multiple places use kernel version checks with complex conditional compilation that may not be fully maintained:
- `dma_common.c` lines 48-50, 66-108, 1270-1276, 252-256, 270-276, 371-374, 433-446
- `axis_gen1.c` line 115-135
- `rce_memmap.c` line 115-121
- **Files:** `common/driver/dma_common.c`, `common/driver/axis_gen1.c`, `rce_memmap/driver/src/rce_map.c`
- **Impact:** As newer kernels are released, the version-specific code paths may become outdated or incorrect, leading to build failures or runtime issues on modern kernels.
- **Fix approach:** Consider using feature detection (via `#ifdef`) rather than version checks where possible, or establish a clear minimum supported kernel version and remove old compatibility code.

### RHEL Release Version Detection
- **Issue:** `RHEL_RELEASE_VERSION` macro is not always defined (line 32-34 of `dma_common.c`), requiring a fallback definition. This suggests inconsistent kernel header availability across build environments.
- **Files:** `common/driver/dma_common.c`
- **Impact:** Build failures or incorrect behavior on RHEL/CentOS systems where the macro is missing.
- **Fix approach:** Implement proper feature detection instead of relying on RHEL release version macros.

### Memory Mapping Without Locking in RCE Map
- **Issue:** `rce_map.c` has commented-out `request_mem_region` calls (lines 125-130, 213-218) that would prevent conflicts with other drivers claiming the same memory regions.
- **Files:** `rce_memmap/driver/src/rce_map.c`
- **Impact:** Potential memory conflicts if multiple drivers attempt to map the same address space.
- **Fix approach:** Re-enable the memory region requests or add a comment explaining why they were disabled and what mitigates the risk.

### Missing Error Handling in DMA Buffer Allocation
- **Issue:** `dma_buffer.c` line 135-144: When `dma_alloc_coherent` or `kmalloc` fails, the error is logged but the cleanup path may not fully handle partially allocated resources.
- **Files:** `common/driver/dma_buffer.c`
- **Impact:** Potential memory leaks or inconsistent state during allocation failures.
- **Fix approach:** Review and ensure all cleanup paths properly release all allocated resources in reverse order of allocation.

## Known Bugs

### Unused Debug Code in DMA Common
- **Symptoms:** The `dma_common.c` file contains commented-out debug output code in `Dma_SeqShow` (lines 1384-1416, 1429-1462) that calculates and would display min/max/avg buffer usage statistics.
- **Files:** `common/driver/dma_common.c`
- **Trigger:** Always present but disabled.
- **Workaround:** Code is commented out and not executed.

### Race Condition in Mask Adjustment
- **Symptoms:** The `Dma_SetMaskBytes` function in `dma_common.c` (line 1490) checks if `desc->destMask` is zero to prevent multiple calls, but this check is not atomic with the mask setting operation.
- **Files:** `common/driver/dma_common.c`
- **Trigger:** Concurrent calls to `Dma_SetMaskBytes` with the same descriptor.
- **Workaround:** The function returns -1 on subsequent calls, preventing accidental double-configuration.

## Security Considerations

### No Input Validation on User pointers in DMA Commands
- **Risk:** Functions like `Dma_Read`, `Dma_Write`, and `Dma_Ioctl` use `copy_from_user` and `copy_to_user` but do not always validate the user pointer ranges before accessing them.
- **Files:** `common/driver/dma_common.c` (lines 633, 735, 984, 1000, 1549, 1585)
- **Current mitigation:** Basic size checks are performed (e.g., checking if `count` is a multiple of `sizeof(struct DmaReadData)`).
- **Recommendations:** Add validation that user pointers are within acceptable address ranges and that structures pointed to are fully readable.

### DMA Memory Access via mmap
- **Risk:** The `Dma_Mmap` function (lines 1149-1234) allows mapping of both buffer memory and register space into user space. Register space mapping (lines 1209-1233) could potentially expose hardware control registers to user space.
- **Files:** `common/driver/dma_common.c`
- **Current mitigation:** There is a check to ensure the mapped offset is within valid range (line 1215-1219), but the comment "Bad map range" suggests this may not be exhaustive.
- **Recommendations:** Consider whether register mapping is truly required for user applications or if it should be restricted.

## Performance Bottlenecks

### Inefficient Buffer Search in Unsorted Lists
- **Problem:** `dmaFindBufferList` in `dma_buffer.c` (lines 341-370) performs linear search O(n) when the sorted list is unavailable (when `list->sorted == NULL`).
- **Files:** `common/driver/dma_buffer.c`
- **Cause:** When using streaming buffer mode (`BUFF_STREAM`), the sorted list is disabled (see `dma_buffer.c` line 84).
- **Improvement path:** Consider alternative data structures like hash tables for unsorted lookups, or enable sorting even for streaming mode with a different comparison key.

### High Buffer Count Queue Operations
- **Problem:** The queue implementation uses a fixed maximum of 100,000 buffers per list (`BUFFERS_PER_LIST` at line 44 of `dma_buffer.h`), but with potentially 1024 TX + 1024 RX buffers per device, and multiple devices, the queue operations can become slow.
- **Files:** `common/driver/dma_buffer.c`, `common/driver/dma_common.c`
- **Cause:** Circular queue with sub-list segmentation adds complexity to push/pop operations.
- **Improvement path:** Profile actual queue operation time and consider using lockless queue implementations for high-throughput scenarios.

## Fragile Areas

### GPU Async Memory Management
- **Files:** `common/driver/gpu_async.c`, `include/GpuAsync.h`, `include/GpuAsyncUser.h`
- **Why fragile:**
- Uses NVIDIA's proprietary P2P API (`nvidia_p2p_get_pages`, `nvidia_p2p_dma_map_pages`) which may change across driver versions
- Version-specific register handling (lines 240-263 in `gpu_async.c`) with hardcoded offsets
- The `Gpu_AddNvidia` function (lines 139-307) has complex memory alignment and mapping logic with multiple failure points
- **Safe modification:** Before modifying, verify against the current NVIDIA Open GPU Kernel Modules documentation. Consider adding runtime version detection for NVIDIA P2P API changes.
- **Test coverage gaps:** No unit tests for the GPU async path. The `data_dev/app/src/rdmaTest.cu` file exists but is CUDA-specific and may not cover all error paths.

### AXI Version 1 vs Version 2 Hardware Detection
- **Files:** `common/driver/axis_gen1.c`, `common/driver/axis_gen2.c`, `rce_stream/driver/src/rce_top.c` (lines 178-192)
- **Why fragile:** The detection logic in `rce_top.c` (lines 178-192) writes a test value to a register and reads it back to distinguish between Gen1 and Gen2. This write to hardware during probe could potentially affect device state if interrupted.
- **Safe modification:** Ensure the test write/read is atomic or protected by appropriate locks. Consider reading a known-version register instead of probing with writes.
- **Test coverage gaps:** No automated tests verify the Gen1/Gen2 detection logic works correctly with simulated hardware.

### workqueue Management in AxisG2
- **Files:** `common/driver/axis_gen2.c` (lines 542-585, 609-660)
- **Why fragile:** The `AxisG2_Enable` and `AxisG2_Clear` functions manage a workqueue with multiple state transitions (`wqEnable`, delayed work, cancel, flush, destroy). The order of operations is critical.
- **Safe modification:** The workqueue must be destroyed only after all pending work is flushed and the `wqEnable` flag is cleared. Any change to this order could cause use-after-free or deadlocks.
- **Test coverage gaps:** No tests verify proper cleanup when the driver is unloaded while active work is pending.

## Scaling Limits

### Maximum Device Limit
- **Current capacity:** `MAX_DMA_DEVICES` is set to 4 in `rce_stream/driver/src/rce_top.h` (line 32)
- **Limit:** Hard-coded array size in `gDmaDevices[]` in both `rce_stream/driver/src/rce_top.c` and `data_dev/driver/src/data_dev_top.c`
- **Where it breaks:** Adding more devices than `MAX_DMA_DEVICES` will cause overflow in the device array.
- **Scaling path:** Increase `MAX_DMA_DEVICES` and ensure all related arrays (like `RceDevNames`) are updated accordingly. Consider dynamic allocation if device count varies significantly.

### Buffer Memory Limits
- **Current capacity:** Default 1024 TX + 1024 RX buffers at 128KB each = ~256MB per device
- **Limit:** Contiguous DMA buffer allocation may fail for large buffer counts or large buffer sizes on fragmented memory systems
- **Where it breaks:** `dma_alloc_coherent` may fail silently (returns NULL) for large requests, causing initialization failure
- **Scaling path:** Consider using non-contiguous buffer allocation (`dma_alloc_noncontiguous`) for large buffers, or implement buffer pooling.

## Dependencies at Risk

### NVIDIA P2P API
- **Risk:** The GPU async driver depends on NVIDIA's `nv-p2p.h` header and P2P functions that are not part of the mainline kernel API.
- **Impact:** Build failures or runtime issues if:
- NVIDIA drivers are not installed
- NVIDIA driver version changes the P2P API
- Building on systems without NVIDIA GPU support
- **Migration plan:** The `DATA_GPU` compile-time flag (used in `data_dev_top.c` lines 36-39, 285-288) provides some isolation. Consider making GPU support more modular (separate kernel module) or providing a software-only fallback for testing.

### RHEL Release Version Macro
- **Risk:** `RHEL_RELEASE_VERSION` is conditionally defined (line 32-34 of `dma_common.c`) because it's not available in all kernel builds.
- **Impact:** Version-specific code paths may not work correctly on RHEL-based distributions where the macro is missing.
- **Migration plan:** Replace RHEL version checks with feature detection using `#ifdef` checks for specific API availability.

## Missing Critical Features

### No Memory Barrier Documentation
- **Problem:** The code uses `writel` and `ioread32` for register access but does not document or enforce memory barrier requirements. Some operations may be reordered by the CPU or bus.
- **Files:** All driver files that access hardware registers
- **Blocks:** Performance optimization without risking correctness. Without clear barrier documentation, developers may add unnecessary barriers or miss required ones.

### No Test Infrastructure
- **Problem:** The repository contains application test files (`data_dev/app/src/test.cpp`, `common/app/dmaLoopTest.cpp`, etc.) but there's no automated test framework.
- **Blocks:** Regression testing, CI/CD validation of driver changes, verifying bug fixes.

## Test Coverage Gaps

### GPU Async Code Paths
- **What's not tested:** All GPU-specific code paths in `common/driver/gpu_async.c` are not covered by automated tests. The `data_dev/app/src/rdmaTest.cu` is a CUDA application, not a unit test.
- **Files:** `common/driver/gpu_async.c`, `include/GpuAsync.h`, `include/GpuAsyncUser.h`
- **Risk:** Changes to GPU code may break functionality without detection until runtime on systems with NVIDIA GPUs.
- **Priority:** High - GPU support is a key feature; bugs here affect users requiring GPU direct memory access.

### Error Path Testing
- **What's not tested:** Error handling paths (allocation failures, hardware register read/write failures, user pointer validation failures) are not tested.
- **Files:** All driver files
- **Risk:** Driver may crash or leak resources when errors occur.
- **Priority:** High - error paths are critical for stability.

### Multi-Device Testing
- **What's not tested:** Interactions between multiple DMA devices are not tested. The `MAX_DMA_DEVICES` limit is 4, but there are no tests that exercise multiple devices simultaneously.
- **Files:** `common/driver/dma_common.c`, `rce_stream/driver/src/rce_top.c`
- **Risk:** Race conditions or resource conflicts may only appear with multiple devices.
- **Priority:** Medium.

---

*Concerns audit: 2026-03-25*
Loading
Loading