Skip to content

Releases: facebookexperimental/CUTracer

CUTracer v0.3.0 Release Notes

10 Jun 03:38

Choose a tag to compare

🎉 Minor Release — 49 commits since v0.2.1

CUTracer v0.3.0 (2026-04-23 → 2026-06-09) is a substantial minor release that focuses on three big themes: (1) a new high-performance RapidJSON trace serializer that becomes the default and removes ~16% of host-CPU overhead, (2) Blackwell-class race detection — a new TMEM-lifetime detector for the analyze data-race command plus warpgroup-targeted random-delay injection that can finally reproduce SM100 flash-attention dealloc races, and (3) a reusable cuda-gdb hang-analysis backend that samples live CUDA hangs and feeds structured evidence into the AI reasoner pipeline. The release also completes the NVBit 1.8 TMA migration, ships a unified AI all command behind a clean Reasoner protocol, expands schemas to cover every record type the writer emits, and migrates the repository from facebookresearch to facebookexperimental with the wiki now living under docs/.


✨ Highlights

  • RapidJSON trace serializer (default) — Removes ~16% host CPU and ~8% malloc churn from the per-record write path; byte-identical output to the legacy nlohmann path for reg/mem/opcode/mem_value records.
  • TMEM Lifetime Detector D3 — New detector in analyze data-race that finds Blackwell SM100 TMEM dealloc-vs-LDTM races, with prototype-matching 256-vs-0 verdicts on the case-15 baseline/fixed pair.
  • Warpgroup-targeted random delay--delay-warpgroup-ids / --delay-warp-mask / --delay-enable-prob flags let random_delay stall an entire warp or warpgroup as a single scheduling unit, enabling reproduction of cross-warpgroup races that per-thread delays could not.
  • cuda-gdb live hang analyzer — Reusable cutracer.debugger Python backend samples PC offsets, SASS context, register operands, HGMMA/TMA memory evidence, and effective-blocker inference from a live hung CUDA process.
  • NVBit 1.8 TMA pipeline — 3-diff migration to first-class nvbit_parse_tma_transfer_info(); trace now carries structured tensor metadata (dim, dtype, strides, swizzle, SMEM dst/src/mbar addresses) instead of opaque parameter handles.
  • Unified AI all command — All --ai CLI paths re-routed through a new Reasoner Protocol; new cross-domain all command runs deadlock + data-race reasoning in one invocation.
  • Repository migrated to facebookexperimental — Wiki content moved into docs/ with auto-sync workflow; CI gains H100 lanes (meta-triton + upstream Triton nightly) and retires the legacy T4 GitHub-hosted workflow.

🏎️ Capture-Side Serializer Overhaul (RapidJSON)

A capture CPU profile attributed ~16% of host CPU to per-record nlohmann::json DOM serialization plus ~8% malloc churn from per-record std::stringstream hex formatting. This release ports the entire hot path to a streaming rapidjson::Writer and makes it the default.

  • Opt-in RapidJSON path (D107298645) — Adds streaming Writer into a reused thread_local StringBuffer for reg_trace / mem_addr_trace / mem_value_trace / opcode_only; gated behind CUTRACER_JSON_ENGINE={nlohmann,rapidjson,ab} with an ab oracle that semantically compares both engines per record and reports per-type match/mismatch counters at writer teardown.
  • Unit-test the RapidJSON serializer — Extracted to its own translation unit with cxx goldens, exercising all four record types.
  • Port tma_trace (D107320461) — Removes the last nlohmann fallback in the per-record path; semantically identical on 2,304 TMA descriptors.
  • RapidJSON becomes the default (D107323672) — Removes the CUTRACER_JSON_ENGINE switch, the A/B comparison machinery, build_nlohmann_line, and the entire per-record nlohmann serializer family (~390 lines net). nlohmann is retained only for the one-time kernel_metadata header.
  • Heap-corruption fix in TraceWriter (#225) — Serialize public mutators with std::mutex to prevent libstdc++ _M_mutate heap double-free that surfaced as "JSON decode error - unexpected character" at ~7% per kernel-run on aarch64. Replaces a misleading "CRITICAL FIX" std::move dance that did not actually fix the race.
  • OSS install fix (#224) — install_third_party.sh now downloads rapidjson headers so the OSS CI build succeeds.

🏗️ Blackwell Race Detection

TMEM Lifetime Detector D3

Final diff of the case-15 TMEM dealloc-race stack lands a third detector in the analyze data-race registry alongside DataRaceRAWDetector. The new cutracer/analyze/fb/data_race/tmem_lifetime/detector.py module reconciles cross-warp ARRIVE sets per TCGEN05.DEALLOC and emits one Finding(severity=ERROR, category=TMEM_LIFETIME) for every (CTA, warp) that LDTM-reads TMEM but never ARRIVEs on the guarding mbarrier before deallocation. Three-phase pipeline (static guard scan → per-CTA event classification → per-DEALLOC window reconciliation) matches the prototype's 256-vs-0 PASS on the case-15 baseline / D105385008-fixed trace pair. Hopper traces yield no DEALLOCs and short-circuit cleanly via the existing decoder.is_tmem_dealloc / is_tmem_read / is_mbarrier_arrive dispatch.

Warpgroup-Targeted Random Delay (5-diff stack)

Per-thread random delay washes out at the warp boundary (warps are lock-step, so the effective stall is max(thread_delays)), and existing CTA-local modes operate on clusters, not warp ranges within a single CTA. This stack teaches random_delay to stall an entire warp or warpgroup as a single scheduling unit — the exact timing distribution needed to reproduce SM100 flash-attention TMEM dealloc races.

  • Device function + host wrapper — New instrument_delay_warpgroup(pred, delay_ns, warp_mask) in inject_funcs.cu and instrument_warpgroup_delay_injection host wrapper. Computes CTA-local warp id from threadIdx/blockDim, tests warp_mask bit, and __nanosleeps the entire warp uniformly via the same 1ms chunked loop as instrument_delay_random_cluster.
  • CLI flags + env varsCUTRACER_DELAY_WARP_MASK (hex/oct/dec via new get_var_uint32_auto parser; existing get_var_uint32 silently dropped 0x prefixes via atoll) and CUTRACER_DELAY_WARPGROUP_ID (integer warpgroup index, resolves to 0xF << (id * 4) on the host).
  • Dispatcher routing + --delay-enable-prob — Routes warp targeting through the existing delay dispatcher in cutracer.cu and adds a new probability gate that bypasses the 50/50 PC gate for tighter control over injection density.
  • Persist warp targeting in delay config JSON — Replay mode (--delay-load-path) now round-trips warp/warpgroup targeting for deterministic reproduction.

Periodic Hang Snapshots

Adds periodic-snapshot sampling so the analyzer can correlate evolving warp state across multiple sample windows rather than relying on a single point-in-time capture.

Blackwell Test Coverage

  • Blackwell FP8 GEMM E2E test — Permanent E2E test exercising the upstream Blackwell TLX UTCQMMA path with real FP8 data.
  • Block-scaled UTC*MMA fixture — New Blackwell mxfp8 / mxfp4 GEMM fixture for block-scaled tensor-core coverage.
  • Local-only Buck test for Blackwell FA-WS data race reproducer — Captures the case-15 reproducer as a permanent regression guard (local-only because of GPU requirements).
  • UTC*MMA A=tmem fix — Replaces the positional guard with a per-operand find("gdesc") loop so URx+1 is correctly pushed as the high half of the B gdesc; header rewritten to cite Blackwell ISA slot semantics.

🔍 Live Hang Analysis (cuda-gdb backend)

A new reusable cutracer.debugger Python package wraps cuda-gdb to extract structured evidence from a live hung CUDA process. Designed to feed the AI reasoner pipeline.

  • Reusable backend (#212) — Pure Python backend with no gdb runtime dependency in the base layer; provides the foundation for the follow-up cuda-gdb command.
  • Preserve cuda-gdb warp slot identity — Parses cuda-gdb Wp separately from logical warp id; CUTracer parity needs logical warp id (first_active_threadIdx.x // 32) while cuda-gdb focus commands need the physical Wp slot. Both are now serialized in debugger opcode records.
  • PC offsets and SASS context — Records runtime PC, kernel-relative offset, and a small disassembly window for each sampled warp.
  • SASS register operands — Reads scalar and uniform registers named by the sampled SASS instruction and serializes captured values; failures are recorded as explicit register-read errors so reports never invent values.
  • Effective PC evidence — Distinguishes the instruction cuda-gdb stopped on from the effective blocking instruction reported to CUTracer analysis (post-barrier rule).
  • Effective-blocker inference — Small cuda-gdb disassembly parser implements the immediate-previous-instruction rule for recovering the effective blocking PC when cuda-gdb stops on a known unsafe post-barrier instruction.
  • HGMMA and TMA memory evidence — Captures Hopper GMMA and TMA-related memory state from the live process for the reasoner.
  • Dynamic cuda-gdb AI evidence reporting — Evidence stream surfaced directly in --ai reports.

🧠 AI Reasoner Refactor

  • Unified all command + Reasoner protocol (D103870397) — All --ai paths re-routed through the new Reasoner Protocol. New cross-domain cutracer analyze all --ai runs deadlock + data-race reasoning together. Centralizes LLM-client construction and report composition.
  • AIDeadlockAnalyzer retired — No longer imported from production code; per-reasoner tests (test_deadlock_reasoner.py, test_data_race_reasoner.py, test_unified_reasoner.py) cover the --ai paths end-to-end via fake LLM clients.
  • Dedup — Removed redundant base deadlock_command and dead tritonparse fallback shim.

📡 TMA Pipeline (NVBit 1.8 Migration)

Three-diff stack migrates TMA tracing off non-public ISA semantics (URa/...

Read more

CUTracer v0.2.1 Release 🎉

23 Apr 03:41

Choose a tag to compare

CUTracer v0.2.1 Release Notes

🎉 Patch Release — 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.


✨ Highlights

  • PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
  • Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
  • Cluster-Level Delay Mode — New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
  • Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
  • Custom Delay Patterns — New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
  • NVBit 1.8 TMA Fix — Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

  • Backtrace refactor (#202) — Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
  • CapturedTraceback module (#203) — New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)) — zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
  • Dynamic mode selection (#204) — New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
  • Auto-GIL acquisition — New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site
# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.

  • Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
  • Zero overhead when disabled (default)
  • Query integration — The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
  • Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis
# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

  • Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
  • cluster_seed stored in delay config JSON for deterministic replay
  • Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
  • One-time-per-kernel diagnostic log of runtime cluster dimensions
  • Automatic warning when used against non-cluster-launched kernels (no-op detection)
# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

  • Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
  • Wildcard --delay-patterns "*" matches all instructions, injecting delay on a random 50% subset of ALL SASS instructions
  • Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
  • Plumbed through CUTRACER_DELAY_PATTERNS environment variable
# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

  • Instruction table — Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
  • Schema updatekernel_metadata.schema.json updated with the instructions property definition
  • Per-record SASS removalj["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work

🐛 Bug Fixes

  • NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
  • Kernel hash always 0xcompute_kernel_checksum() was only called inside instrumentation path; now computed at metadata creation time via idempotent ensure_kernel_checksum() helper
  • Log truncation on SIGTERM (#206) — Added flush_log_files() before raise(SIGTERM) in deadlock detection, kernel timeout, and no-data timeout termination paths
  • Cluster warning noise (#210) — Gated [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also fixed legacy MEMTRACE: prefix to CUTracer:
  • Kernel events writer lifetime — Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
  • Python callstack lineno — Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside PyErr_Occurred() block)
  • Kernel events NDJSON enforcement — Force NDJSON format for kernel events writer regardless of CUTRACER_TRACE_FORMAT, with warning when text mode is active

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

📁 Configuration Changes

Updated Environment Variables

Variable Change Description
CUTRACER_CPU_CALLSTACK Extended Now accepts: auto, pytorch, backtrace, auto_gil, 0, 1 (was: 0/1 only)
CUTRACER_KERNEL_EVENTS New Kernel events recording mode: 0 (disabled, default), dedup, full, nostack
CUTRACER_DELAY_MODE Extended New cluster value (alongside existing random/fixed)
CUTRACER_DELAY_PATTERNS New Comma-separated SASS instruction substrings for delay injection; "*" for all instructions

🔒 Security & CI

  • CodeQL fixes (#209) — Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI ...
Read more

CUTracer v0.2.0 Release

08 Apr 23:14

Choose a tag to compare

🎉 Major Release — 114 commits since v0.1.0

CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.


✨ Highlights

  • Blackwell (SM100) GPU Support — Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
  • Unified CLI — New cutracer trace subcommand replaces manual CUDA_INJECTION64_PATH setup
  • Data Race Reducer — DDMin bisection algorithm to automatically find minimal race-triggering configurations
  • NVBit 1.8 — Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for <<<>>> kernel launch deadlocks
  • CPU Call Stack Capture — Per-kernel-launch host-side stack traces for debugging
  • Kernel Timeouts & Safety Limits — Configurable execution timeout and trace file size limits

🏗️ Blackwell GPU Support

Full tracing support for NVIDIA Blackwell architecture:

  • UTC*MMA tensor core instructions — Trace Blackwell's new warp-group MMA operations (#161)
  • UTMAREDG tracing — Support for TMA reduction instructions (#162)
  • UTMALDG decoder — Decode TMA load descriptor parameters
  • TMA descriptor tracing — Capture and decode TMA descriptor fields for tile configuration analysis (#155)
  • TMA descriptor decoding in SASS — Extract descriptor parameters from cubin SASS output
  • Tensor memory delay injection — Extend random delay to TMA instructions for data race detection (#189)

🖥️ Unified CLI

The CLI has been completely revamped with a unified cutracer entry point:

cutracer trace — Run and Trace

# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py

cutracer query — Query Trace Data

# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress

cutracer analyze — Analyze Traces

# Warp execution summary
cutracer analyze warp-summary trace.ndjson

cutracer reduce — Minimize Race Configs

# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py

cutracer sass — SASS Extraction

# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin

🔍 Data Race Detection Enhancements

DDMin Bisection Reducer (#187)

Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:

  • Exponentially faster than brute-force elimination
  • Produces minimal reproducible configurations
  • Integrated via cutracer reduce CLI command

Per-Thread Random Delay Mode (#186)

  • New CUTRACER_DELAY_MODE=per_thread for thread-level delay granularity
  • Better coverage for detecting fine-grained data races

Delay Config Mutator (#145)

  • Programmatic API for manipulating delay configurations
  • Enables automated delay sweep workflows

⏱️ Reliability & Safety

  • Kernel execution timeout (CUTRACER_KERNEL_TIMEOUT_S) — Kill kernels that exceed a time limit (#169)
  • No-data timeout — Detect silent hangs when no trace data is produced
  • Trace file size limit (CUTRACER_TRACE_SIZE_LIMIT_MB) — Prevent runaway disk usage (#169)
  • Periodic flush — TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
  • Configurable channel buffer size (CUTRACER_CHANNEL_RECORDS) — Tune buffer for hang debugging scenarios
  • Fix <<<>>> deadlock — Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
  • Fix CUDA graph handling — Prevent graph build/capture phase from prematurely executing per-launch side effects
  • Fix trace overwrite — Trace file write mode changed from append to overwrite across runs

🔧 Instrumentation Improvements

  • Instruction category system — Conditional instrumentation based on instruction categories (#134)
  • IPOINT configuration — Configure instrumentation points via environment variables (#183)
  • Register indices in trace output — CPU-side static mapping of register operands (#143)
  • opcode_only trace writing — Lightweight opcode-only mode now writes structured trace output
  • Auto-enable cubin dump — Cubin dump auto-enabled when instrumentation is active (#191)
  • Kernel checksum — Robust delay config replay using kernel binary checksums (#133, #141)
  • CPU call stack capture — Host-side stack trace for each kernel launch (#172)
  • Re-execute compiled kernel — Ensure trace captures both compilation and execution runs

📁 Configuration Changes

Renamed Environment Variables

Old New
TRACE_FORMAT_NDJSON CUTRACER_TRACE_FORMAT
CUTRACER_TRACE_OUTPUT_DIR CUTRACER_OUTPUT_DIR

CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).

New Environment Variables

Variable Description Default
CUTRACER_KERNEL_TIMEOUT_S Kernel execution timeout in seconds (disabled)
CUTRACER_TRACE_SIZE_LIMIT_MB Max trace file size in MB (unlimited)
CUTRACER_CHANNEL_RECORDS Channel buffer record count (default)
CUTRACER_CPU_CALLSTACK Enable CPU call stack capture 0
CUTRACER_DELAY_MODE Delay mode (uniform/per_thread) uniform
CUTRACER_OUTPUT_DIR Unified output directory for all artifacts .
CUTRACER_IPOINT Instrumentation point configuration (default)

🔄 Dependency Updates

  • NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
  • nlohmann/json: Updated default to 3.12.0
  • Python: CI updated to Python 3.13
  • GitHub Actions: Updated to latest versions
  • JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse _json_compat
  • Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)

🐍 Python Package Improvements

  • CLP archive support — Dump and read CLP compressed log archives (#118, #148)
  • Unified logger module — Consistent logging across all Python modules
  • Schema validation — Migrated trace validation into the cutracer Python module (#154)
  • Query enhancements — Hex filters, --all-lines flag, NDJSON output, --output, --compress (#136)
  • Multi-condition AND filter — Filter by multiple fields simultaneously (#139)
  • JSON list value filters — Support cta=[0,0,0] style filter expressions
  • KernelConfig abstraction — Clean API for trace metadata
  • TraceWriter metadatawrite_metadata() and kernel_metadata event support (#153)
  • Truncated trace detection — Detect and handle truncated trace files gracefully
  • GB200 aarch64 support — Installation scripts updated for GB200 platforms (#159, #173)

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit 1.8 requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package
  • NVBit 1.8: Bundled (auto-downloaded during build)

⚠️ Breaking Changes

  • TRACE_FORMAT_NDJSON renamed to CUTRACER_TRACE_FORMAT (#192)
  • CUTRACER_TRACE_OUTPUT_DIR renamed to CUTRACER_OUTPUT_DIR (#167)
  • CLI entry point unified to cutracer (replaces cutraceross)
  • --all flag renamed to --all-lines (#157)
  • analysis module renamed to query (#135)
  • pc field in trace output changed to hex string format (#137)

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.


📄 License

  • MIT License — Meta Platforms, Inc. contributions
  • BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.


🔗 Links

CUTracer v0.1.0 Release 🎉

06 Feb 03:01

Choose a tag to compare

🎉 Initial Public Release

CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.


✨ Highlights

  • Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
  • GPU Hang Detection - Automatic deadlock identification with process termination
  • Data Race Detection - Random delay injection with deterministic replay support
  • Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
  • Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
  • Python Analysis Toolkit - Available on PyPI: pip install cutracer

🔧 Instrumentation Modes

Mode Environment Variable Description
opcode_only CUTRACER_INSTRUMENT=opcode_only Lightweight instruction counting
reg_trace CUTRACER_INSTRUMENT=reg_trace Register value tracing (R/UREG support)
mem_addr_trace CUTRACER_INSTRUMENT=mem_addr_trace Memory address tracing
mem_value_trace CUTRACER_INSTRUMENT=mem_value_trace Memory address + value tracing (Global/Shared/Local)
random_delay CUTRACER_INSTRUMENT=random_delay Delay injection for race detection

Multiple modes can be enabled simultaneously with comma-separated values.


📊 Built-in Analyses

Instruction Histogram (proton_instr_histogram)

  • Clock-delimited per-warp instruction counting
  • CSV output: warp_id,region_id,instruction,count
  • Integration with Triton Proton for IPC calculation
  • Requires kernels to emit clock instructions (e.g., Triton pl.scope())
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

Deadlock/Hang Detection (deadlock_detection)

  • Detects warps stuck in stable PC loops
  • Automatic SIGTERM → SIGKILL termination sequence
  • Detailed warp state logging for post-mortem analysis
  • Auto-enables reg_trace instrumentation
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (random_delay)

  • Injects delays at synchronization points to expose timing-dependent races
  • Target SASS patterns:
    • SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
    • SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
    • UTMALDG.2D (TMA load)
    • WARPGROUP.DEPBAR.LE (MMA wait)

Deterministic Replay Support:

  • CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
  • CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
  • Workflow: Discover race with random delays → Reproduce exactly with saved config
# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

📁 Output Formats

Mode Extension Description
0 .log Human-readable text format
1 (default) .ndjson.zst NDJSON + Zstd compressed
2 .ndjson NDJSON uncompressed

Set via TRACE_FORMAT_NDJSON environment variable.

Compression level configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).


🐍 Python Package

Available on PyPI:

pip install cutracer

Features

Validation:

  • JSON syntax and schema validation
  • Text format validation
  • Cross-format consistency checking
  • Transparent Zstd compression handling

Analysis:

  • TraceReader: Stream trace records from NDJSON files
  • StreamingGrouper: Memory-efficient grouped analysis
  • WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
  • Multi-format output: table, JSON, CSV

CLI Tools:

# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count

⚙️ Configuration Reference

Variable Description Default
CUTRACER_INSTRUMENT Instrumentation modes (comma-separated) (none)
CUTRACER_ANALYSIS Analysis types (comma-separated) (none)
KERNEL_FILTERS Kernel name filters (substring match) (none)
INSTR_BEGIN / INSTR_END Instruction index range filter 0 / UINT32_MAX
TRACE_FORMAT_NDJSON Output format (0/1/2) 1
CUTRACER_ZSTD_LEVEL Zstd compression level 22
CUTRACER_DELAY_NS Delay value in nanoseconds 0 (disabled)
CUTRACER_DELAY_DUMP_PATH Export delay config JSON (none)
CUTRACER_DELAY_LOAD_PATH Load delay config for replay (none)
TOOL_VERBOSE Verbosity level (0/1/2) 0
CUTRACER_DUMP_CUBIN Dump cubin files 0

📋 Requirements

  • CUDA Toolkit: Aligned with NVBit requirements
  • libzstd: Required for trace compression
  • Python 3.10+: For Python package

⚠️ API Stability Notice

This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.

Known Limitations

  • Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
  • Nested regions not supported for instruction histogram analysis

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.


📄 License

  • MIT License - Meta Platforms, Inc. contributions
  • BSD-3-Clause License - NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.


📚 Documentation

Full documentation is available in the Wiki.


🔗 Links