10 Jun 03:38

FindHao

dde45fd

Latest

🎉 Minor Release — 49 commits since v0.2.1

CUTracer v0.3.0 (2026-04-23 → 2026-06-09) is a substantial minor release that focuses on three big themes: (1) a new high-performance RapidJSON trace serializer that becomes the default and removes ~16% of host-CPU overhead, (2) Blackwell-class race detection — a new TMEM-lifetime detector for the analyze data-race command plus warpgroup-targeted random-delay injection that can finally reproduce SM100 flash-attention dealloc races, and (3) a reusable cuda-gdb hang-analysis backend that samples live CUDA hangs and feeds structured evidence into the AI reasoner pipeline. The release also completes the NVBit 1.8 TMA migration, ships a unified AI all command behind a clean Reasoner protocol, expands schemas to cover every record type the writer emits, and migrates the repository from facebookresearch to facebookexperimental with the wiki now living under docs/.

✨ Highlights

RapidJSON trace serializer (default) — Removes ~16% host CPU and ~8% malloc churn from the per-record write path; byte-identical output to the legacy nlohmann path for reg/mem/opcode/mem_value records.
TMEM Lifetime Detector D3 — New detector in analyze data-race that finds Blackwell SM100 TMEM dealloc-vs-LDTM races, with prototype-matching 256-vs-0 verdicts on the case-15 baseline/fixed pair.
Warpgroup-targeted random delay — --delay-warpgroup-ids / --delay-warp-mask / --delay-enable-prob flags let random_delay stall an entire warp or warpgroup as a single scheduling unit, enabling reproduction of cross-warpgroup races that per-thread delays could not.
cuda-gdb live hang analyzer — Reusable cutracer.debugger Python backend samples PC offsets, SASS context, register operands, HGMMA/TMA memory evidence, and effective-blocker inference from a live hung CUDA process.
NVBit 1.8 TMA pipeline — 3-diff migration to first-class nvbit_parse_tma_transfer_info(); trace now carries structured tensor metadata (dim, dtype, strides, swizzle, SMEM dst/src/mbar addresses) instead of opaque parameter handles.
Unified AI all command — All --ai CLI paths re-routed through a new Reasoner Protocol; new cross-domain all command runs deadlock + data-race reasoning in one invocation.
Repository migrated to facebookexperimental — Wiki content moved into docs/ with auto-sync workflow; CI gains H100 lanes (meta-triton + upstream Triton nightly) and retires the legacy T4 GitHub-hosted workflow.

🏎️ Capture-Side Serializer Overhaul (RapidJSON)

A capture CPU profile attributed ~16% of host CPU to per-record nlohmann::json DOM serialization plus ~8% malloc churn from per-record std::stringstream hex formatting. This release ports the entire hot path to a streaming rapidjson::Writer and makes it the default.

Opt-in RapidJSON path (D107298645) — Adds streaming Writer into a reused thread_local StringBuffer for reg_trace / mem_addr_trace / mem_value_trace / opcode_only; gated behind CUTRACER_JSON_ENGINE={nlohmann,rapidjson,ab} with an ab oracle that semantically compares both engines per record and reports per-type match/mismatch counters at writer teardown.
Unit-test the RapidJSON serializer — Extracted to its own translation unit with cxx goldens, exercising all four record types.
Port tma_trace (D107320461) — Removes the last nlohmann fallback in the per-record path; semantically identical on 2,304 TMA descriptors.
RapidJSON becomes the default (D107323672) — Removes the CUTRACER_JSON_ENGINE switch, the A/B comparison machinery, build_nlohmann_line, and the entire per-record nlohmann serializer family (~390 lines net). nlohmann is retained only for the one-time kernel_metadata header.
Heap-corruption fix in TraceWriter (#225) — Serialize public mutators with std::mutex to prevent libstdc++ _M_mutate heap double-free that surfaced as "JSON decode error - unexpected character" at ~7% per kernel-run on aarch64. Replaces a misleading "CRITICAL FIX" std::move dance that did not actually fix the race.
OSS install fix (#224) — install_third_party.sh now downloads rapidjson headers so the OSS CI build succeeds.

🏗️ Blackwell Race Detection

TMEM Lifetime Detector D3

Final diff of the case-15 TMEM dealloc-race stack lands a third detector in the analyze data-race registry alongside DataRaceRAWDetector. The new cutracer/analyze/fb/data_race/tmem_lifetime/detector.py module reconciles cross-warp ARRIVE sets per TCGEN05.DEALLOC and emits one Finding(severity=ERROR, category=TMEM_LIFETIME) for every (CTA, warp) that LDTM-reads TMEM but never ARRIVEs on the guarding mbarrier before deallocation. Three-phase pipeline (static guard scan → per-CTA event classification → per-DEALLOC window reconciliation) matches the prototype's 256-vs-0 PASS on the case-15 baseline / D105385008-fixed trace pair. Hopper traces yield no DEALLOCs and short-circuit cleanly via the existing decoder.is_tmem_dealloc / is_tmem_read / is_mbarrier_arrive dispatch.

Warpgroup-Targeted Random Delay (5-diff stack)

Per-thread random delay washes out at the warp boundary (warps are lock-step, so the effective stall is max(thread_delays)), and existing CTA-local modes operate on clusters, not warp ranges within a single CTA. This stack teaches random_delay to stall an entire warp or warpgroup as a single scheduling unit — the exact timing distribution needed to reproduce SM100 flash-attention TMEM dealloc races.

Device function + host wrapper — New instrument_delay_warpgroup(pred, delay_ns, warp_mask) in inject_funcs.cu and instrument_warpgroup_delay_injection host wrapper. Computes CTA-local warp id from threadIdx/blockDim, tests warp_mask bit, and __nanosleeps the entire warp uniformly via the same 1ms chunked loop as instrument_delay_random_cluster.
CLI flags + env vars — CUTRACER_DELAY_WARP_MASK (hex/oct/dec via new get_var_uint32_auto parser; existing get_var_uint32 silently dropped 0x prefixes via atoll) and CUTRACER_DELAY_WARPGROUP_ID (integer warpgroup index, resolves to 0xF << (id * 4) on the host).
Dispatcher routing + --delay-enable-prob — Routes warp targeting through the existing delay dispatcher in cutracer.cu and adds a new probability gate that bypasses the 50/50 PC gate for tighter control over injection density.
Persist warp targeting in delay config JSON — Replay mode (--delay-load-path) now round-trips warp/warpgroup targeting for deterministic reproduction.

Periodic Hang Snapshots

Adds periodic-snapshot sampling so the analyzer can correlate evolving warp state across multiple sample windows rather than relying on a single point-in-time capture.

Blackwell Test Coverage

Blackwell FP8 GEMM E2E test — Permanent E2E test exercising the upstream Blackwell TLX UTCQMMA path with real FP8 data.
Block-scaled UTC*MMA fixture — New Blackwell mxfp8 / mxfp4 GEMM fixture for block-scaled tensor-core coverage.
Local-only Buck test for Blackwell FA-WS data race reproducer — Captures the case-15 reproducer as a permanent regression guard (local-only because of GPU requirements).
UTC*MMA A=tmem fix — Replaces the positional guard with a per-operand find("gdesc") loop so URx+1 is correctly pushed as the high half of the B gdesc; header rewritten to cite Blackwell ISA slot semantics.

🔍 Live Hang Analysis (cuda-gdb backend)

A new reusable cutracer.debugger Python package wraps cuda-gdb to extract structured evidence from a live hung CUDA process. Designed to feed the AI reasoner pipeline.

Reusable backend (#212) — Pure Python backend with no gdb runtime dependency in the base layer; provides the foundation for the follow-up cuda-gdb command.
Preserve cuda-gdb warp slot identity — Parses cuda-gdb Wp separately from logical warp id; CUTracer parity needs logical warp id (first_active_threadIdx.x // 32) while cuda-gdb focus commands need the physical Wp slot. Both are now serialized in debugger opcode records.
PC offsets and SASS context — Records runtime PC, kernel-relative offset, and a small disassembly window for each sampled warp.
SASS register operands — Reads scalar and uniform registers named by the sampled SASS instruction and serializes captured values; failures are recorded as explicit register-read errors so reports never invent values.
Effective PC evidence — Distinguishes the instruction cuda-gdb stopped on from the effective blocking instruction reported to CUTracer analysis (post-barrier rule).
Effective-blocker inference — Small cuda-gdb disassembly parser implements the immediate-previous-instruction rule for recovering the effective blocking PC when cuda-gdb stops on a known unsafe post-barrier instruction.
HGMMA and TMA memory evidence — Captures Hopper GMMA and TMA-related memory state from the live process for the reasoner.
Dynamic cuda-gdb AI evidence reporting — Evidence stream surfaced directly in --ai reports.

🧠 AI Reasoner Refactor

Unified all command + Reasoner protocol (D103870397) — All --ai paths re-routed through the new Reasoner Protocol. New cross-domain cutracer analyze all --ai runs deadlock + data-race reasoning together. Centralizes LLM-client construction and report composition.
AIDeadlockAnalyzer retired — No longer imported from production code; per-reasoner tests (test_deadlock_reasoner.py, test_data_race_reasoner.py, test_unified_reasoner.py) cover the --ai paths end-to-end via fake LLM clients.
Dedup — Removed redundant base deadlock_command and dead tritonparse fallback shim.

📡 TMA Pipeline (NVBit 1.8 Migration)

Three-diff stack migrates TMA tracing off non-public ISA semantics (URa/...

Assets 2

23 Apr 03:41

FindHao

v0.2.1

0ab27d5

CUTracer v0.2.1 Release 🎉

CUTracer v0.2.1 Release Notes

🎉 Patch Release — 24 commits since v0.2.0

This release brings PyTorch-native Python callstack capture, a new kernel events recording system, cluster-level delay injection for inter-CTA race detection, significant trace format optimizations via instruction table embedding, a critical NVBit 1.8 TMA compatibility fix, and several CI/security hardening improvements. Date range: 2026-04-08 to 2026-04-22.

✨ Highlights

PyTorch CapturedTraceback Integration — Full Python callstack capture via PyTorch's CapturedTraceback API with zero compile-time dependencies, replacing unsymbolized C++ backtrace frames with readable Python call sites
Kernel Events Recording — New structured NDJSON logging of all kernel launches (not just instrumented ones) with callstack deduplication for minimal file overhead
Cluster-Level Delay Mode — New --delay-mode cluster that delays one random CTA per cluster to expose missing inter-CTA synchronization bugs
Instruction Table Embedding — Per-instruction static table with SASS binary encoding embedded in kernel_metadata, eliminating redundant per-record SASS strings from JSON traces
Custom Delay Patterns — New --delay-patterns flag for injecting delays at arbitrary SASS instruction types, including a "*" wildcard mode
NVBit 1.8 TMA Fix — Critical fix for TMA operand extraction regression introduced by NVBit 1.8's new TMA_PARAM_HANDLE operand type

🧠 PyTorch CapturedTraceback Callstack Capture

A three-part series replaces the existing backtrace()-based C++ callstack capture with PyTorch's CapturedTraceback API, producing readable Python call stacks that show the actual user code and PyTorch layers that launched a kernel.

Backtrace refactor (#202) — Extract capture_cpu_callstack() into capture_cpu_callstack_backtrace() to prepare for alternative backends
CapturedTraceback module (#203) — New python_callstack.cpp (~360 lines) that dynamically resolves Python C API functions via dlsym(dlopen(NULL)) — zero compile-time Python/PyTorch dependencies. Calls CapturedTraceback.extract().summary() when the current thread holds the GIL
Dynamic mode selection (#204) — New CpuCallstackMode enum (AUTO/PYTORCH/BACKTRACE/DISABLED) replacing the old boolean flag. AUTO (default) tries PyTorch first, falls back to backtrace(). New cpu_callstack_source field in kernel_metadata JSON output
Auto-GIL acquisition — New auto_gil mode that re-acquires the GIL via PyGILState_Ensure() for Triton kernels where __triton_launcher releases the GIL before cuLaunchKernelEx. Safe because the Python frame chain is frozen at the Triton launch() call site

# Use PyTorch callstacks (auto-detected by default)
cutracer trace --cpu-callstack auto -- python my_triton_kernel.py

# Force auto_gil mode for Triton kernels
cutracer trace --cpu-callstack auto_gil -- python my_triton_kernel.py

📋 Kernel Events Recording

New structured logging of every kernel launch to a dedicated NDJSON file (cutracer_kernel_events_*.ndjson), independent of instrumentation. Key innovation: callstack deduplication using FNV-1a hashing — each unique Python callstack is emitted once as a callstack_def record, with subsequent launches referencing it by callstack_id.

Three modes: dedup (recommended), full (inline callstack per launch), nostack (metadata only)
Zero overhead when disabled (default)
Query integration — The query command now handles kernel events files seamlessly: callstack_def records are cached and resolved, kernel_launch records get a caller field injected with the innermost call site frame
Recommended query pattern: --group-by kernel_checksum --count for launch frequency analysis

# Record kernel events with callstack dedup
cutracer trace --kernel-events dedup -- python my_app.py

# Query kernel launch counts
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

🔀 Cluster-Level Random Delay Mode

New --delay-mode cluster (#207) that delays only one randomly-selected CTA within each cluster while other CTAs proceed at normal speed (~430 lines). This creates timing asymmetry between CTAs in the same cluster, exposing missing inter-CTA synchronization bugs that --delay-mode random (intra-CTA) would not catch.

Uses cluster_ctaid/cluster_nctaid PTX registers for CTA selection within each cluster
cluster_seed stored in delay config JSON for deterministic replay
Host-side cluster dimension detection via CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION (including CUDA graph captures), with fallback to cuFuncGetAttribute
One-time-per-kernel diagnostic log of runtime cluster dimensions
Automatic warning when used against non-cluster-launched kernels (no-op detection)

# Expose inter-CTA synchronization bugs
cutracer trace -i random_delay -a random_delay \
    --delay-mode cluster --delay-ns 10000 \
    --kernel-filters my_cluster_kernel -- python test.py

🎯 Custom Delay Patterns

New --delay-patterns flag (#211) that lets users specify arbitrary SASS instruction substrings for delay injection, overriding the built-in DELAY_INJECTION_PATTERNS list. This enables targeted testing of specific instruction types without modifying source code.

Comma-separated patterns: --delay-patterns "UTMALDG,UTMASTG"
Wildcard --delay-patterns "*" matches all instructions, injecting delay on a random 50% subset of ALL SASS instructions
Also adds SYNCS.EXCH (mbarrier init) to the built-in delay injection patterns
Plumbed through CUTRACER_DELAY_PATTERNS environment variable

# Delay only TMA instructions
cutracer trace -i random_delay --delay-ns 5000 \
    --delay-patterns "UTMALDG,UTMASTG" -- python test.py

# Stress-test: random delay on 50% of ALL instructions
cutracer trace -i random_delay --delay-ns 1000 \
    --delay-patterns "*" -- python test.py

📦 Instruction Table Embedding & Trace Format Optimization

Three-part series that embeds a per-instruction static table in kernel_metadata and eliminates redundant per-record SASS strings from JSON output:

Instruction table — Each kernel_metadata record now includes an instructions array indexed by opcode_id, containing SASS disassembly, binary encoding (via NVBit 1.8 getSassBinary() API), register indices, and uniform register indices
Schema update — kernel_metadata.schema.json updated with the instructions property definition
Per-record SASS removal — j["sass"] serialization removed from JSON output (text mode unchanged). The Python TraceReader caches the instruction table and injects sass into records on read via opcode_id lookup. Backward compatible: old traces with inline SASS still work

🐛 Bug Fixes

NVBit 1.8 TMA operand extraction regression — NVBit 1.8 changed UTMALDG/UTMASTG/UTMAREDG from two separate MREF operands to a single TMA_PARAM_HANDLE operand. CUTracer's operand loop had no handler, silently dropping all UR information and breaking tma_trace, data-race, tma, mma, and dataflow analysis commands. Added a manual TMA_PARAM_HANDLE decoder plus unhandled-operand debug logging for future NVBit changes
Kernel hash always 0x — compute_kernel_checksum() was only called inside instrumentation path; now computed at metadata creation time via idempotent ensure_kernel_checksum() helper
Log truncation on SIGTERM (#206) — Added flush_log_files() before raise(SIGTERM) in deadlock detection, kernel timeout, and no-data timeout termination paths
Cluster warning noise (#210) — Gated [CLUSTER] warning by should_instrument so non-matching kernels don't clutter stderr; also fixed legacy MEMTRACE: prefix to CUTracer:
Kernel events writer lifetime — Fixed multi-context crash where first context teardown destroyed the shared writer; now cleaned up only when all contexts are gone
Python callstack lineno — Clamped negative PyFrame_GetLineNumber() return values unconditionally (previously only inside PyErr_Occurred() block)
Kernel events NDJSON enforcement — Force NDJSON format for kernel events writer regardless of CUTRACER_TRACE_FORMAT, with warning when text mode is active

🖥️ CLI Changes

New Flags

# Callstack capture mode (replaces boolean --cpu-callstack 0/1)
cutracer trace --cpu-callstack auto|pytorch|backtrace|auto_gil|0|1

# Kernel events recording
cutracer trace --kernel-events dedup|full|nostack

# Cluster delay mode
cutracer trace --delay-mode cluster  # (alongside existing random/fixed)

# Custom delay patterns
cutracer trace --delay-patterns "PATTERN1,PATTERN2"  # or "*" for all

Query Command Updates

# Query kernel events files (auto-detected by file naming)
cutracer query cutracer_kernel_events_*.ndjson --group-by kernel_checksum --count

📁 Configuration Changes

Updated Environment Variables

Variable	Change	Description
`CUTRACER_CPU_CALLSTACK`	Extended	Now accepts: `auto`, `pytorch`, `backtrace`, `auto_gil`, `0`, `1` (was: `0`/`1` only)
`CUTRACER_KERNEL_EVENTS`	New	Kernel events recording mode: `0` (disabled, default), `dedup`, `full`, `nostack`
`CUTRACER_DELAY_MODE`	Extended	New `cluster` value (alongside existing `random`/`fixed`)
`CUTRACER_DELAY_PATTERNS`	New	Comma-separated SASS instruction substrings for delay injection; `"*"` for all instructions

🔒 Security & CI

CodeQL fixes (#209) — Replaced fopen("w") with open(O_CREAT|O_WRONLY|O_TRUNC, 0644) + fdopen() for explicit file permissions; added permissions: contents: read to CI ...

Assets 2

08 Apr 23:14

FindHao

v0.2.0

e9b0b56

CUTracer v0.2.0 Release

🎉 Major Release — 114 commits since v0.1.0

CUTracer v0.2.0 brings Blackwell GPU support, a unified CLI experience, advanced data race reduction, and significant improvements to trace infrastructure.

✨ Highlights

Blackwell (SM100) GPU Support — Tensor core instruction tracing for UTC*MMA, UTMALDG, UTMAREDG, and TMA descriptors
Unified CLI — New cutracer trace subcommand replaces manual CUDA_INJECTION64_PATH setup
Data Race Reducer — DDMin bisection algorithm to automatically find minimal race-triggering configurations
NVBit 1.8 — Updated from NVBit 1.7.7.1 to 1.8, with a critical fix for <<<>>> kernel launch deadlocks
CPU Call Stack Capture — Per-kernel-launch host-side stack traces for debugging
Kernel Timeouts & Safety Limits — Configurable execution timeout and trace file size limits

🏗️ Blackwell GPU Support

Full tracing support for NVIDIA Blackwell architecture:

UTC*MMA tensor core instructions — Trace Blackwell's new warp-group MMA operations (#161)
UTMAREDG tracing — Support for TMA reduction instructions (#162)
UTMALDG decoder — Decode TMA load descriptor parameters
TMA descriptor tracing — Capture and decode TMA descriptor fields for tile configuration analysis (#155)
TMA descriptor decoding in SASS — Extract descriptor parameters from cubin SASS output
Tensor memory delay injection — Extend random delay to TMA instructions for data race detection (#189)

🖥️ Unified CLI

The CLI has been completely revamped with a unified cutracer entry point:

`cutracer trace` — Run and Trace

# Trace a CUDA application (replaces manual CUDA_INJECTION64_PATH setup)
cutracer trace --instrument opcode_only -- python my_kernel.py

# With cubin dump and output directory
cutracer trace --instrument reg_trace --dump-cubin --output-dir ./traces -- python my_kernel.py

# Shell-style environment variable passthrough
cutracer trace CUTRACER_DELAY_NS=1000 -- python my_kernel.py

`cutracer query` — Query Trace Data

# Filter and query traces
cutracer query trace.ndjson --filter "warp=24"
cutracer query trace.ndjson --filter "cta=[0,0,0],opcode=LDG"  # Multi-condition AND filter
cutracer query trace.ndjson --output result.ndjson --compress

`cutracer analyze` — Analyze Traces

# Warp execution summary
cutracer analyze warp-summary trace.ndjson

`cutracer reduce` — Minimize Race Configs

# Find minimal delay configuration that triggers a race
cutracer reduce --config delay_config.json -- python my_kernel.py

`cutracer sass` — SASS Extraction

# Extract SASS from cubin files
cutracer sass --cubin kernel.cubin

🔍 Data Race Detection Enhancements

DDMin Bisection Reducer (#187)

Automatically reduce a delay configuration to the minimal set of delay points that still trigger a data race, using the delta debugging (ddmin) algorithm:

Exponentially faster than brute-force elimination
Produces minimal reproducible configurations
Integrated via cutracer reduce CLI command

Per-Thread Random Delay Mode (#186)

New CUTRACER_DELAY_MODE=per_thread for thread-level delay granularity
Better coverage for detecting fine-grained data races

Delay Config Mutator (#145)

Programmatic API for manipulating delay configurations
Enables automated delay sweep workflows

⏱️ Reliability & Safety

Kernel execution timeout (CUTRACER_KERNEL_TIMEOUT_S) — Kill kernels that exceed a time limit (#169)
No-data timeout — Detect silent hangs when no trace data is produced
Trace file size limit (CUTRACER_TRACE_SIZE_LIMIT_MB) — Prevent runaway disk usage (#169)
Periodic flush — TraceWriter and log files flush periodically during kernel hangs, ensuring data is available for post-mortem analysis
Configurable channel buffer size (CUTRACER_CHANNEL_RECORDS) — Tune buffer for hang debugging scenarios
Fix <<<>>> deadlock — Preload flush_channel via fatbin + NVBit tool API to eliminate kernel launch deadlocks (#199)
Fix CUDA graph handling — Prevent graph build/capture phase from prematurely executing per-launch side effects
Fix trace overwrite — Trace file write mode changed from append to overwrite across runs

🔧 Instrumentation Improvements

Instruction category system — Conditional instrumentation based on instruction categories (#134)
IPOINT configuration — Configure instrumentation points via environment variables (#183)
Register indices in trace output — CPU-side static mapping of register operands (#143)
opcode_only trace writing — Lightweight opcode-only mode now writes structured trace output
Auto-enable cubin dump — Cubin dump auto-enabled when instrumentation is active (#191)
Kernel checksum — Robust delay config replay using kernel binary checksums (#133, #141)
CPU call stack capture — Host-side stack trace for each kernel launch (#172)
Re-execute compiled kernel — Ensure trace captures both compilation and execution runs

📁 Configuration Changes

Renamed Environment Variables

Old	New
`TRACE_FORMAT_NDJSON`	`CUTRACER_TRACE_FORMAT`
`CUTRACER_TRACE_OUTPUT_DIR`	`CUTRACER_OUTPUT_DIR`

CUTRACER_TRACE_FORMAT now also accepts string names (e.g., ndjson_zst, ndjson, log) in addition to numeric values (#193).

New Environment Variables

Variable	Description	Default
`CUTRACER_KERNEL_TIMEOUT_S`	Kernel execution timeout in seconds	(disabled)
`CUTRACER_TRACE_SIZE_LIMIT_MB`	Max trace file size in MB	(unlimited)
`CUTRACER_CHANNEL_RECORDS`	Channel buffer record count	(default)
`CUTRACER_CPU_CALLSTACK`	Enable CPU call stack capture	0
`CUTRACER_DELAY_MODE`	Delay mode (`uniform`/`per_thread`)	`uniform`
`CUTRACER_OUTPUT_DIR`	Unified output directory for all artifacts	`.`
`CUTRACER_IPOINT`	Instrumentation point configuration	(default)

🔄 Dependency Updates

NVBit: 1.7.7.1 → 1.7.7.3 → 1.8 (#164, #198)
nlohmann/json: Updated default to 3.12.0
Python: CI updated to Python 3.13
GitHub Actions: Updated to latest versions
JSON parsing: Migrated to orjson for faster JSON I/O via tritonparse _json_compat
Daily NVBit update check: Automated GitHub Action to detect upstream NVBit releases (#163)

🐍 Python Package Improvements

CLP archive support — Dump and read CLP compressed log archives (#118, #148)
Unified logger module — Consistent logging across all Python modules
Schema validation — Migrated trace validation into the cutracer Python module (#154)
Query enhancements — Hex filters, --all-lines flag, NDJSON output, --output, --compress (#136)
Multi-condition AND filter — Filter by multiple fields simultaneously (#139)
JSON list value filters — Support cta=[0,0,0] style filter expressions
KernelConfig abstraction — Clean API for trace metadata
TraceWriter metadata — write_metadata() and kernel_metadata event support (#153)
Truncated trace detection — Detect and handle truncated trace files gracefully
GB200 aarch64 support — Installation scripts updated for GB200 platforms (#159, #173)

📋 Requirements

CUDA Toolkit: Aligned with NVBit 1.8 requirements
libzstd: Required for trace compression
Python 3.10+: For Python package
NVBit 1.8: Bundled (auto-downloaded during build)

⚠️ Breaking Changes

TRACE_FORMAT_NDJSON renamed to CUTRACER_TRACE_FORMAT (#192)
CUTRACER_TRACE_OUTPUT_DIR renamed to CUTRACER_OUTPUT_DIR (#167)
CLI entry point unified to cutracer (replaces cutraceross)
--all flag renamed to --all-lines (#157)
analysis module renamed to query (#135)
pc field in trace output changed to hex string format (#137)

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework and the v1.8 release.

📄 License

MIT License — Meta Platforms, Inc. contributions
BSD-3-Clause License — NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.

📚 Documentation

Full documentation is available in the Wiki.

🔗 Links

Repository: https://github.com/facebookresearch/CUTracer
PyPI: https://pypi.org/project/cutracer/
Issues: https://github.com/facebookresearch/CUTracer/issues

Assets 2

06 Feb 03:01

FindHao

v0.1.0

34e5ed1

CUTracer v0.1.0 Release 🎉

🎉 Initial Public Release

CUTracer is an NVBit-based CUDA binary instrumentation tool for GPU kernel analysis and debugging. It enables runtime-level insights without requiring application recompilation.

✨ Highlights

Zero-modification runtime injection - Attach to any CUDA application via CUDA_INJECTION64_PATH
GPU Hang Detection - Automatic deadlock identification with process termination
Data Race Detection - Random delay injection with deterministic replay support
Triton/Proton Integration - Per-warp instruction histograms with IPC calculation
Efficient Trace Compression - NDJSON + Zstd (~92% space savings)
Python Analysis Toolkit - Available on PyPI: pip install cutracer

🔧 Instrumentation Modes

Mode	Environment Variable	Description
`opcode_only`	`CUTRACER_INSTRUMENT=opcode_only`	Lightweight instruction counting
`reg_trace`	`CUTRACER_INSTRUMENT=reg_trace`	Register value tracing (R/UREG support)
`mem_addr_trace`	`CUTRACER_INSTRUMENT=mem_addr_trace`	Memory address tracing
`mem_value_trace`	`CUTRACER_INSTRUMENT=mem_value_trace`	Memory address + value tracing (Global/Shared/Local)
`random_delay`	`CUTRACER_INSTRUMENT=random_delay`	Delay injection for race detection

Multiple modes can be enabled simultaneously with comma-separated values.

📊 Built-in Analyses

Instruction Histogram (`proton_instr_histogram`)

Clock-delimited per-warp instruction counting
CSV output: warp_id,region_id,instruction,count
Integration with Triton Proton for IPC calculation
Requires kernels to emit clock instructions (e.g., Triton pl.scope())

CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

Deadlock/Hang Detection (`deadlock_detection`)

Detects warps stuck in stable PC loops
Automatic SIGTERM → SIGKILL termination sequence
Detailed warp state logging for post-mortem analysis
Auto-enables reg_trace instrumentation

CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py

Data Race Detection (`random_delay`)

Injects delays at synchronization points to expose timing-dependent races
Target SASS patterns:
- SYNCS.PHASECHK.TRANS64.TRYWAIT (mbarrier try_wait)
- SYNCS.ARRIVE.TRANS64.RED.A1T0 (mbarrier arrive)
- UTMALDG.2D (TMA load)
- WARPGROUP.DEPBAR.LE (MMA wait)

Deterministic Replay Support:

CUTRACER_DELAY_DUMP_PATH: Export delay config JSON for recording
CUTRACER_DELAY_LOAD_PATH: Load delay config JSON for exact replay
Workflow: Discover race with random delays → Reproduce exactly with saved config

# Record mode
CUTRACER_DELAY_NS=1000 \
CUTRACER_DELAY_DUMP_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

# Replay mode (deterministic reproduction)
CUTRACER_DELAY_LOAD_PATH=./delay_config.json \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python your_kernel.py

📁 Output Formats

Mode	Extension	Description
0	`.log`	Human-readable text format
1 (default)	`.ndjson.zst`	NDJSON + Zstd compressed
2	`.ndjson`	NDJSON uncompressed

Set via TRACE_FORMAT_NDJSON environment variable.

Compression level configurable via CUTRACER_ZSTD_LEVEL (1-22, default: 22).

🐍 Python Package

Available on PyPI:

pip install cutracer

Features

Validation:

JSON syntax and schema validation
Text format validation
Cross-format consistency checking
Transparent Zstd compression handling

Analysis:

TraceReader: Stream trace records from NDJSON files
StreamingGrouper: Memory-efficient grouped analysis
WarpSummary: Warp execution status for hang analysis (completed/in-progress/missing)
Multi-format output: table, JSON, CSV

CLI Tools:

# Validate trace files
cutraceross validate trace.ndjson
cutraceross validate trace.ndjson.zst --verbose

# Analyze trace data
cutraceross analyze trace.ndjson --head 20
cutraceross analyze trace.ndjson --filter "warp=24"
cutraceross analyze trace.ndjson --group-by warp --count

⚙️ Configuration Reference

Variable	Description	Default
`CUTRACER_INSTRUMENT`	Instrumentation modes (comma-separated)	(none)
`CUTRACER_ANALYSIS`	Analysis types (comma-separated)	(none)
`KERNEL_FILTERS`	Kernel name filters (substring match)	(none)
`INSTR_BEGIN` / `INSTR_END`	Instruction index range filter	0 / UINT32_MAX
`TRACE_FORMAT_NDJSON`	Output format (0/1/2)	1
`CUTRACER_ZSTD_LEVEL`	Zstd compression level	22
`CUTRACER_DELAY_NS`	Delay value in nanoseconds	0 (disabled)
`CUTRACER_DELAY_DUMP_PATH`	Export delay config JSON	(none)
`CUTRACER_DELAY_LOAD_PATH`	Load delay config for replay	(none)
`TOOL_VERBOSE`	Verbosity level (0/1/2)	0
`CUTRACER_DUMP_CUBIN`	Dump cubin files	0

📋 Requirements

CUDA Toolkit: Aligned with NVBit requirements
libzstd: Required for trace compression
Python 3.10+: For Python package

⚠️ API Stability Notice

This is the initial public release (v0.1.0). APIs and configuration options may change in future versions as we gather feedback and iterate on the design.

Known Limitations

Instruction histogram requires clock instruction boundaries (e.g., Triton pl.scope())
Nested regions not supported for instruction histogram analysis

🙏 Acknowledgments

CUTracer is built on NVBit by NVIDIA Research. We thank the NVBit team for their excellent binary instrumentation framework.

📄 License

MIT License - Meta Platforms, Inc. contributions
BSD-3-Clause License - NVIDIA NVBit components

See LICENSE and LICENSE-BSD for details.

📚 Documentation

Full documentation is available in the Wiki.

🔗 Links

Repository: https://github.com/facebookresearch/CUTracer
PyPI: https://pypi.org/project/cutracer/
Issues: https://github.com/facebookresearch/CUTracer/issues

Assets 2

Releases: facebookexperimental/CUTracer

CUTracer v0.3.0 Release Notes

🎉 Minor Release — 49 commits since v0.2.1

✨ Highlights

🏎️ Capture-Side Serializer Overhaul (RapidJSON)

🏗️ Blackwell Race Detection

TMEM Lifetime Detector D3

Warpgroup-Targeted Random Delay (5-diff stack)

Periodic Hang Snapshots

Blackwell Test Coverage

🔍 Live Hang Analysis (cuda-gdb backend)

🧠 AI Reasoner Refactor

📡 TMA Pipeline (NVBit 1.8 Migration)

Uh oh!