Pure CLI in / JSON contract out; no GUI required.
Agent-friendly profile-query CLI family. JSON by default, with CSV/table projections where useful. One command answers one question. VeloQ is designed for coding agents and scripts that need GPU profile evidence without opening a GUI.
VeloQ covers three profile sources today — Nsight Systems (timeline
traces), Nsight Compute (kernel reports), and experimental
PyTorch/Kineto Chrome traces — through a single binary with a shared
envelope and a pluggable ProfileSource trait. The PyTorch/Kineto source
covers the Perfetto-style Chrome trace shape used by PyTorch profiler.
- 17 NSys verbs, including timeline analysis, static SVG figures, kernel overlap, NCU handoff, prep/cache helpers, and schema.
- 11 NCU verbs: summary, launches, inspect, metrics, disasm, ranges, graphs, sources, source-metrics, warp-stalls, and schema.
- 10 experimental PyTorch verbs: summary, search, inspect, stats, correlate, timeline, slices, collectives, prep, and schema.
- Five root meta verbs: info, sources, clean, recipes, and self-update.
JSON output uses one v1 envelope on stdout. List responses use
canonical data.rows[] with a stable per-row key; NSys trace
responses also carry top-level trace_span for per-second normalization.
Errors use the same envelope shape and a non-zero exit code.
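As a rough illustration of the per-second normalization mentioned above, the sketch below divides a per-row total by trace_span.span_ns. The row field name total_ns is a hypothetical placeholder, not a documented VeloQ column; the real field names come from the schema verb for the command you ran.

```python
def per_second(envelope: dict, field: str = "total_ns") -> dict:
    """Map each row key to its `field` value per second of trace span.

    `field` is a hypothetical column name used only for illustration.
    """
    span = envelope.get("trace_span")
    if not span or not span.get("span_ns"):
        # trace_span is omitted when the source has no trace-wide window.
        raise ValueError("envelope has no usable trace_span")
    seconds = span["span_ns"] / 1e9
    return {row["key"]: row[field] / seconds for row in envelope["data"]["rows"]}


# Hand-written sample shaped like the v1 envelope; not real VeloQ output.
sample = {
    "schema": "v1",
    "trace_span": {"origin_ns": 0, "span_ns": 2_000_000_000},
    "data": {"rows": [{"key": "kernel|gemm", "total_ns": 500_000_000}]},
}
rates = per_second(sample)  # ns of kernel time per second of trace
```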
NSys traces are read through nsys export -t parquetdir. The minimum
required nsys version is 2024.6 (the release that introduced
parquetdir as an export --type). All VeloQ-generated products live under one
<report>.veloq/ artifact root; the NSys parquet cache is its
parquetdir/ child with ctime invalidation.
For a GPU-profile question an agent usually reaches for one of three
interfaces. VeloQ focuses on the agent-facing axes: a stable typed
contract, token economy, and scriptability. It does not replace
nsys or ncu; it reads their exported evidence.
| | Nsight GUI | Raw nsys/ncu text in context | Hand-rolled SQLite + jq | VeloQ |
|---|---|---|---|---|
| Scriptable / one-shot | ✗ | ~ ad hoc | ✓ | ✓ |
| Token-efficient for an agent | n/a | ✗ broad dumps | ~ | ✓ shaped rows + truncation signals |
| Stable typed contract | ✗ | ✗ free text | ✗ schema you own | ✓ versioned JSON envelope |
| Cross-capture diffable | ✗ | ✗ | ~ | ✓ stable per-row key |
| Zero setup per query | ✓ | ✓ | ✗ | ✓ |
Use the Nsight GUI for interactive timeline exploration or one-off visual inspection. Use VeloQ for programmatic, repeatable, agent- or script-driven querying.
For Linux and macOS, the install script is the shortest path: it installs
both the veloq binary and the bundled Agent Skills.
# Linux x86_64 / aarch64 and macOS x86_64 / arm64
curl -fsSL https://raw.githubusercontent.com/lucifer1004/veloq/main/scripts/install.sh | bash

Installs the veloq binary under ~/.local/bin and the Agent Skills
for profile analysis (nsys-profile-analysis, ncu-profile-analysis,
pytorch-profile-analysis) under
~/.agents/skills/. Pass --no-skills to install just the
binary, or --no-binary to refresh the skills when you manage the
VeloQ CLI separately. The skills are VeloQ-backed: they can be
installed separately, but profile evidence extraction still requires a
veloq binary on PATH. --bin-dir <path> overrides the binary
install location.
For Windows, use cargo binstall veloq below or grab
veloq-x86_64-windows.exe from the
Releases page directly.
If you use cargo-binstall,
install the prebuilt veloq binary from the GitHub release:
cargo binstall veloq

cargo binstall installs only the executable. To fetch the bundled
skills from the latest release without replacing the binstall-managed
VeloQ binary, run:
veloq self-update --no-binary

Use --skills-dir <path> on that second command to install skills under a
non-default root such as .claude/skills/.
veloq self-update --no-binary --skills-dir .claude

VeloQ ships Codex plugin metadata under .codex-plugin/ and a local
marketplace under .agents/plugins/. From a VeloQ checkout:
codex plugin marketplace add .
codex plugin add veloq@veloq

The plugin install handles the Agent Skills only. Those skills require
the VeloQ CLI for evidence extraction, so install the veloq binary
separately via cargo binstall veloq or scripts/install.sh --no-skills.
The repo's canonical Agent Skills source lives under .agents/skills/.
The legacy .claude/skills path is kept as a compatibility alias.
VeloQ ships a one-plugin marketplace listing under
.claude-plugin/. Users running Claude Code's plugin manager can:
/plugin marketplace add https://github.com/lucifer1004/veloq.git
/plugin install veloq@veloq
This uses the same Agent Skills through the Claude-specific plugin
metadata under .claude-plugin/.
veloq self-update # binary AND bundled Agent Skills
veloq self-update --check # is a newer release out? (JSON)
veloq self-update --no-skills # binary only
veloq self-update --no-binary # Agent Skills only; keep your binary manager
veloq self-update --skills-dir .claude # install skills to .claude/skills/

self-update pulls the latest GitHub release. By default it replaces the
running binary and refreshes the bundled Agent Skills, removing stale
skill files from earlier installs. Skills go to ~/.agents/skills/ by
default; --skills-dir <path> or VELOQ_SKILLS_DIR selects another
root such as project-local .agents or .claude. Passing either the
root or the final skills/ directory works. --check reports
update_available without changing files. All modes emit the standard
envelope on stdout.
If the binary was installed with cargo-binstall and you want
cargo-binstall to remain the binary manager, use
veloq self-update --no-binary to refresh Agent Skills only, and use
cargo binstall again for binary updates.
These examples assume veloq is on PATH (via one of the install
methods above). For contributors building from source, see
Build from source — the binary lands at
target/release/veloq.
# ── NSys (timeline) — hoisted to the top level and also available as `veloq nsys ...`
# Summarize a trace
veloq summary path/to/trace.nsys-rep
veloq nsys summary path/to/trace.nsys-rep
# Top kernels by total time
veloq stats path/to/trace.nsys-rep --limit 10
# Aggregate attributable kernels by full NVTX hierarchy path
veloq stats path/to/trace.nsys-rep --type kernel --group-by nvtx-path
# Human-friendly comfy-table view
veloq stats path/to/trace.nsys-rep --limit 10 --format table
# Find kernels by name. On large traces, --name-regex prunes the scan
# before name resolution and runs several times faster than the
# equivalent --name '*...*' glob (identical results).
veloq search path/to/trace.nsys-rep --type kernel --name-regex 'gemm' --sort duration:desc --limit 10
# Export a bounded timeline window as a report-ready SVG artifact.
# The JSON row returns the SVG path relative to <trace>.veloq/; resolved
# tracks carry roles such as group, summary, detail, and annotation.
veloq viz timeline path/to/trace.nsys-rep --from @100000000 --to @120000000
# Highlight the top kernel names in that window while preserving the
# base event-type legend; metadata lands in data.auxiliary.resolved_highlights.
veloq viz timeline path/to/trace.nsys-rep --from @100000000 --to @120000000 --highlight-kernels top=3,scope=name

[Figure: example viz timeline SVG artifact with top-kernel highlights.]

# Discover canonical workflows (nvtx-breakdown, gpu-idle-audit,
# timeline-figure-report, memcpy-asymmetry, cold-kernel-hotspot, ...)
veloq recipes
veloq recipes nvtx-breakdown
# GPU performance-counter samples (needs --gpu-metrics-devices at capture time)
veloq metrics path/to/trace.nsys-rep --type gpu --limit 8 --sort=mean:desc
# Same data as a 50ms time series
veloq metrics path/to/trace.nsys-rep --type gpu --counter '*Throughput*' --bucket 50ms
# NIC performance-counter samples (needs --nic-metrics=lf or =hf at capture time)
veloq metrics path/to/trace.nsys-rep --type nic --counter 'IB: Bytes*' --bucket 50ms
# CPU hotspot (needs --sample=process-tree at capture time)
veloq metrics path/to/trace.nsys-rep --type cpu-sampling --limit 20
# Per-thread breakdown
veloq metrics path/to/trace.nsys-rep --type cpu-sampling --group-by tid
# Drill: full callchain for one sample
veloq inspect path/to/trace.nsys-rep cpu_sample:1234
# Generate an Nsight Compute rerun command for a selected NSys kernel event
veloq nsys ncu-command path/to/trace.nsys-rep kernel:1234
veloq nsys ncu-command path/to/trace.nsys-rep kernel:1234 --print | bash
# ── NCU (kernel reports) — namespaced under `ncu`
# Slim overview (launch-derived totals + NCU-version session)
veloq ncu summary path/to/report.ncu-rep
veloq ncu summary --format csv path/to/report.ncu-rep
# List launches; drill in for full per-launch metrics / rules
veloq ncu launches path/to/report.ncu-rep --kernel '*gemm*'
veloq ncu inspect path/to/report.ncu-rep --row-id launch:0
# Cross-launch metric projection (long form by default; jq-friendly diff shape)
veloq ncu metrics path/to/report.ncu-rep --counter 'sm__*active*'
# Per-launch SASS / PTX / source-line correlation (cached per cubin)
veloq ncu disasm path/to/report.ncu-rep --row-id launch:0 \
| jq '.data.rows[0] | {function_name, instruction_count: (.instructions|length)}'
# Per-source-line warp-stall-reason histogram (from timed_warp_samples)
veloq ncu warp-stalls path/to/report.ncu-rep --row-id launch:0
# Other list verbs
veloq ncu sources path/to/report.ncu-rep
veloq ncu ranges path/to/report.ncu-rep
veloq ncu schema launches
# ── PyTorch/Kineto (Chrome traces) — namespaced under `pytorch`
veloq pytorch summary path/to/worker0.pt.trace.json
veloq pytorch search path/to/worker0.pt.trace.json --type kernel --is-comm
veloq pytorch correlate path/to/worker0.pt.trace.json kernel:91
veloq pytorch slices path/to/worker0.pt.trace.json --aggregate --group-by step
veloq pytorch stats path/to/worker0.pt.trace.json --type comm --group-by comm-kind,rank
veloq pytorch collectives path/to/worker0.pt.trace.json
veloq pytorch schema search
# ── Meta verbs
veloq sources
veloq info path/to/file.ncu-rep
veloq schema metrics

# The repo pins Rust 1.89.0 via rust-toolchain.toml.
cargo build --release -p veloq
# Binary lands at target/release/veloq — either invoke it via the
# full path or run `cp target/release/veloq ~/.local/bin/` to put
# it on PATH manually.
./target/release/veloq --help

Heads-up: nsys's GPU/NIC/CPU-sample/SCHED buffers can silently drop data on long captures. Every
metrics response carries coverage + per-type trust signals at data.auxiliary.common; veloq metrics --type <gpu|nic|cpu-sampling|cpu-sched> --help lists them. Read coverage before quoting numbers.
The first command on a new .nsys-rep runs nsys export -t parquetdir,
caching <trace>.nsys-rep.veloq/parquetdir/<TABLE>.parquet for reuse;
passing that generated parquetdir/ back resolves to the owning
.nsys-rep, so sidecars stay under one artifact root. veloq prep <trace> exports upfront and reports registered sidecar readiness in
data.rows[]; veloq clean <trace> removes the generated products
for one report.
Every successful JSON call returns the source-qualified v1 envelope:
{
"schema": "v1",
"source": { "kind": "nsys", "version": "v2" },
"command": "nsys.stats",
"trace": { "kind": "nsys", "path": "trace.nsys-rep" },
"trace_span": { "origin_ns": 0, "span_ns": 12345000000 },
"data": {
"count": 50,
"total_matched": 1234,
"rows": [{ "key": "kernel|...|dev:0|stream:7", "...": "..." }]
}
}

- schema: envelope-format version. Bumps on every breaking envelope-shape change.
- source.kind: which profile backend produced the response ("nsys", "ncu", "pytorch", or "veloq" for meta verbs).
- source.version: per-source wire-format version. Bumps independently from the envelope when the source's payload shapes change. Currently NSys reports v2 (v1 introduced the NVTX domain dimension on stats --group-by nvtx-path rows; v2 makes prep and prep --status canonical list responses where data.rows[] carries registered sidecar readiness keyed as sidecar|<sidecar-id>) and NCU reports v1 (the ncu_report-native wire: inspect carries no section catalog and summary.auxiliary.session keeps only the NCU version; each ncu inspect metric's metric_type / metric_subtype / rollup is the ncu_report enum name such as "counter" rather than the integer 1, with the raw integer kept alongside as *_code). PyTorch reports v0: it is experimental, but documented response fields, schema-target inventories, row ids/keys, command ids, and output-mode semantics are still part of the versioned source contract.
- command: qualified as <source>.<verb> for source verbs (nsys.stats, ncu.summary), or just <verb> for meta verbs (info, sources, clean).
- trace.kind: mirrors the producing source.kind (or the detected source kind for veloq info). Omitted entirely for trace-less verbs (sources, schema, ncu.schema).
- trace_span: primary-execution (origin_ns, span_ns) window. Agents normalize totals by span_ns to get per-second rates without a separate summary call. Omitted when the source does not provide a trace-wide window, and on meta verbs that don't read a trace.
- data.rows[]: canonical primary list on every list-shaped verb. Each row carries a key: string composed from its identifying axes (e.g. "kernel:1234", "bucket|0..1000000", "slice|step_42|@1234567") so agents can INDEX(.data.rows[]; .key) across two captures and diff by key. Non-primary data lives under data.auxiliary.
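A minimal sketch of the diff-by-key idea, assuming two envelopes from the same verb run on two captures. The row keys and the count field in the sample are made up for illustration; only data.rows[] and key come from the envelope description.

```python
def index_by_key(envelope: dict) -> dict:
    """Index the primary list by its stable per-row key."""
    return {row["key"]: row for row in envelope["data"]["rows"]}


def diff_by_key(before: dict, after: dict, field: str) -> dict:
    """Return {key: (old, new)} for rows whose `field` differs.

    A row missing from one capture shows up as None on that side.
    """
    old_rows, new_rows = index_by_key(before), index_by_key(after)
    changes = {}
    for key in sorted(old_rows.keys() | new_rows.keys()):
        old = old_rows.get(key, {}).get(field)
        new = new_rows.get(key, {}).get(field)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical rows; `count` is an illustrative field name.
capture_a = {"data": {"rows": [
    {"key": "kernel|gemm|dev:0|stream:7", "count": 3},
    {"key": "kernel|relu|dev:0|stream:7", "count": 5},
]}}
capture_b = {"data": {"rows": [
    {"key": "kernel|gemm|dev:0|stream:7", "count": 3},
    {"key": "kernel|softmax|dev:0|stream:7", "count": 7},
]}}
changes = diff_by_key(capture_a, capture_b, "count")
```

Unchanged rows drop out of the result, so the output stays small even when the captures are large.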
Stability. The JSON envelope and the per-source versions are VeloQ's
public contract. Additive fields are non-breaking and keep the version; any
breaking shape change bumps schema (ENVELOPE_VERSION) or the affected
source.version and lands a CHANGELOG entry. The crate's 0.x Cargo
version is independent of the wire version — pin behavior to the
envelope/source versions, not the crate version.
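One way a script might pin to the wire versions rather than the crate version is sketched below. The expected versions are the ones this README states today (schema v1, NSys v2, NCU v1, PyTorch v0); the helper name and the decision to return a list of problems are illustrative choices, not part of VeloQ.

```python
EXPECTED_SCHEMA = "v1"
EXPECTED_SOURCE_VERSIONS = {"nsys": "v2", "ncu": "v1", "pytorch": "v0"}


def contract_problems(envelope: dict) -> list:
    """Return human-readable mismatches; an empty list means the pin holds."""
    problems = []
    schema = envelope.get("schema")
    if schema != EXPECTED_SCHEMA:
        problems.append(f"schema is {schema!r}, expected {EXPECTED_SCHEMA!r}")
    source = envelope.get("source")
    if source:  # CLI-level parse failures omit `source`
        expected = EXPECTED_SOURCE_VERSIONS.get(source.get("kind"))
        if expected is not None and source.get("version") != expected:
            problems.append(
                f"{source['kind']} source is {source.get('version')!r}, "
                f"expected {expected!r}"
            )
    return problems


ok = contract_problems(
    {"schema": "v1", "source": {"kind": "nsys", "version": "v2"}}
)
drifted = contract_problems(
    {"schema": "v1", "source": {"kind": "nsys", "version": "v3"}}
)
```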
Errors share the same shape, with data replaced by error:
{
"schema": "v1",
"source": { "kind": "nsys", "version": "v2" },
"command": "nsys.stats",
"trace": { "kind": "nsys", "path": "trace.nsys-rep" },
"error": {
"message": "invalid --from `1s`: must pair with --to",
"chain": ["resolving --from/--to"]
}
}

CLI-level parse failures (unknown flag, bad subcommand) omit source,
command, and trace. --help / --version print clap's native
usage text unchanged.
Exception: veloq nsys ncu-command --print intentionally writes a
raw shell script on stdout for piping, and writes failures to stderr
without a JSON envelope.
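A caller could fold the success and error shapes into one helper, roughly as follows. This is a sketch under the assumptions stated in the comments: it expects the JSON envelope on stdout, so it does not apply to ncu-command --print, and the VeloqError class is a name invented here.

```python
import json


class VeloqError(RuntimeError):
    """Raised when an envelope carries `error` instead of `data`."""


def unwrap(stdout_text: str) -> dict:
    """Parse one envelope and return its `data`, or raise VeloqError."""
    envelope = json.loads(stdout_text)
    error = envelope.get("error")
    if error is not None:
        # CLI-level parse failures omit `command`, so fall back to a label.
        where = envelope.get("command", "<cli>")
        chain = " <- ".join(error.get("chain", []))
        suffix = f" ({chain})" if chain else ""
        raise VeloqError(f"{where}: {error['message']}{suffix}")
    return envelope["data"]


# Samples modeled on the envelopes shown above.
OK_STDOUT = '{"schema": "v1", "command": "nsys.stats", "data": {"count": 0, "rows": []}}'
ERROR_STDOUT = json.dumps({
    "schema": "v1",
    "command": "nsys.stats",
    "error": {
        "message": "invalid --from `1s`: must pair with --to",
        "chain": ["resolving --from/--to"],
    },
})
```

Checking the exit code as well is still worthwhile, since errors also set a non-zero status.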
| Command | Purpose |
|---|---|
| summary | Overview: version, capabilities, per-table, primary vs full span |
| stats | Aggregation across kernel/memcpy/memset/sync/runtime/osrt/graph/nvtx by name + composable axes |
| search | Filter events → list of row_ids plus headline columns |
| inspect | Full per-kind details for one or more row_ids |
| correlate | CPU↔GPU causal chain for a row_id |
| ncu-command | Generate a native ncu rerun command for one selected kernel event |
| gaps | GPU idle bubbles. Default --scope device is cross-stream (no phantom gaps from idle peer streams); --scope stream for per-stream starvation; --scope trace for multi-GPU rig idle |
| timeline | Time-bucketed GPU activity (busy ns + per-kind breakdown per bucket) |
| viz timeline | Export a bounded NSys timeline window as an SVG artifact with resolved track roles, placement provenance, render metadata, and label counters |
| concurrency | Kernel/transfer overlap: per-device union vs sum busy time, peak concurrency, per-stream (incl. same-stream PDL) + compute/copy overlap. Extraction-only (ratios in jq) |
| graph-replays | CUDA Graph replay decomposition: per-replay GPU work keyed by (device, context, correlationId), across both --cuda-graph-trace=graph and =node captures |
| slices | Per-NVTX-range CPU bounds + attributed GPU work |
| hardware | CPU / GPU / NIC inventory from the trace's TARGET_INFO_* tables |
| metrics | GPU/NIC PM counters, CPU IP samples, or CPU scheduler events: hotspot summary, time series, callchain via inspect |
| prep | Build the Parquet cache + registered sidecars eagerly; --status reports sidecar readiness without building |
| correlation-stats | Build/load the correlation index and report counts |
| schema <target> | Strict JSON Schema for one NSys verb's response |

Every NSys command above can also be invoked as veloq nsys <command> ...; the top-level form is kept as the default-source shorthand.
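For readers unfamiliar with the "union vs sum busy time" and "peak concurrency" terms in the concurrency row, the sketch below computes them with a sweep over interval endpoints. It illustrates the arithmetic only and is not VeloQ's implementation; the half-open (start_ns, end_ns) tuples are an assumed input shape.

```python
def busy_stats(intervals):
    """Compute sum, union, and peak overlap for half-open (start, end) ns spans."""
    total = sum(end - start for start, end in intervals)
    # +1 at each start, -1 at each end. Tuples sort ends before starts at the
    # same timestamp, so back-to-back spans do not count as overlapping.
    events = sorted(
        [(start, 1) for start, _ in intervals]
        + [(end, -1) for _, end in intervals]
    )
    union = peak = active = 0
    previous = None
    for timestamp, delta in events:
        if active > 0:
            union += timestamp - previous  # time covered by at least one span
        active += delta
        peak = max(peak, active)
        previous = timestamp
    return {"sum_ns": total, "union_ns": union, "peak": peak}


# Two overlapping spans plus one isolated span on the same device.
stats = busy_stats([(0, 10), (5, 15), (20, 30)])
```

Here the sum exceeds the union by the 5 ns during which two spans ran concurrently, and the uncovered stretch between 15 and 20 is what a gaps-style query would report as idle.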
NCU verbs share a <trace>.veloq/ncu-native.json.gz sidecar built on
first use; subsequent calls deserialise it instead of re-ingesting the
report.
All NCU detail verbs accept --format json|csv|table; tabular
output mirrors the JSON data.rows[], one row per output line
(nested objects become dotted-key columns, BTreeMap fields like
counters expand to one column per resolved counter name). ncu schema is JSON-only.
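The dotted-key projection described above can be pictured with a small flattening function. This is an illustrative sketch, not VeloQ's code: the row fields are hypothetical, and the exact column names VeloQ emits (for counters in particular) may differ.

```python
def flatten(row: dict, prefix: str = "") -> dict:
    """Turn nested objects into dotted-key columns, one level at a time."""
    flat = {}
    for name, value in row.items():
        column = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


# Hypothetical row; `grid` and its fields are illustrative names.
columns = flatten({"key": "launch:0", "grid": {"x": 128, "y": 1}})
```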
| Command | Formats | Purpose |
|---|---|---|
| ncu summary | json / csv / table | Slim overview: one launch-derived totals row + degraded session (NCU version only). --format csv\|table renders the totals + session as a section,key,value projection. |
| ncu launches | json / csv / table | List CUDA kernel launches as headline rows (launch:<idx>); filters: --kernel '<glob>', --nvtx-range '<glob>', --grid WxHxD, --block WxHxD, --limit |
| ncu inspect | json / csv / table | Full per-launch payload (full metric list with placement-tagged instances + rules + recovered identity scalars) for one or more --row-id launch:<idx>; malformed, unsupported-kind, and out-of-range ids return not_found rows so partial batches survive |
| ncu metrics | json / csv / table | Cross-launch metric projection. Default long form (one row per (launch, counter)); --per-launch for wide form (BTreeMap counters expand to one column per name) |
| ncu disasm | json / csv / table | SASS / PTX / source-index correlation for the cubin one launch ran out of (cubin extracted from the report, cached per-cubin under <report>.veloq/disasm/); tabular emits one row per SASS instruction with denormalised kernel identity |
| ncu source-metrics | json / csv / table | Per-source-line / per-SASS / per-file NCU counter attribution. Joins per-PC metric instances with DWARF source-line attribution; --by line\|sass\|file. See veloq recipes source-line-hotspots for the canonical invocation. |
| ncu warp-stalls | json / csv / table | Per-source-line warp-stall-reason histogram from timed_warp_samples (the raw warp-state stream); --by line\|sass\|reason, --file '<glob>'. Raw sample counts + not_issued; jq for percentages. |
| ncu ranges | json / csv / table | List range workloads (--replay-mode range) |
| ncu graphs | json / csv / table | List CUDA-graph workloads (--graph-profiling graph) |
| ncu sources | json / csv / table | Per-cubin source metadata (cuda_sm_name, embedded_source_file_count, has_disasm), one row per launch's cubin |
| ncu schema <target> | json | Strict JSON Schema for one NCU response. Targets are the response field inventory: summary \| launches \| inspect \| metrics \| disasm \| ranges \| graphs \| sources \| source-metrics \| warp-stalls |

NCU drill verbs other than inspect may return handled diagnostic
errors for malformed, unsupported-kind, or out-of-range launch row ids.
PyTorch is an experimental source.version = "v0" source for Kineto
Chrome trace files (.pt.trace.json / .pt.trace.json.gz). Directory
inputs and cross-rank collective skew are planned, not shipped in v0.
When one trace file contains multiple rank values, rank-scoped commands
(search, stats, timeline, slices, and collectives) require
--rank <n> or --all-ranks. inspect and correlate operate on
explicit row ids and are not rank-scope gated.
CUDA device ids are rank-local in multi-rank traces, and stream ids are
device-local: use --rank <n> --device <id> --stream <id> for a fixed
stream, or project parent axes with --group-by rank,device,stream for
comparison.
The PyTorch source reuses the general VeloQ verbs instead
of adding parallel steps, memory, or comm commands; communication
questions use --type comm, --is-comm, grouping axes, slices, and the
source-specific collectives verb.
| Command | Formats | Purpose |
|---|---|---|
| pytorch summary | json / csv / table | Trace inventory, capabilities, active devices, rank/worker inference, versions, capture flags |
| pytorch search | json / csv / table | Typed event refs; filters include --type, name glob/regex, duration, time, rank, device, stream, step |
| pytorch inspect | json / csv / table | Raw args, typed args, parent/children, step/Python context, and correlation/flow links for one or more row ids |
| pytorch stats | json / csv / table | Duration/count aggregation by name,type,step,rank,device,stream,shape,comm-kind,python-context,python-path |
| pytorch correlate | json / csv / table | CPU op / annotation / runtime / driver / GPU activity causal chain for one or more row ids |
| pytorch timeline | json / csv / table | Time buckets with CPU, GPU, communication, and per-type time |
| pytorch slices | json / csv / table | ProfilerStep and user annotation range instances or aggregates |
| pytorch collectives | json / csv / table | Single-trace communication groups with CPU/NCCL evidence row ids and link/ordinal confidence |
| pytorch prep | json / csv / table | Build or inspect PyTorch sidecars under <input>.veloq/pytorch/ |
| pytorch schema <target> | json | Strict JSON Schema for one PyTorch response; schema targets are the response field inventory |

| Command | Purpose |
|---|---|
| info <trace> | First-touch trace map: source kind, filesystem facts, capability bitmap, plus (on a cached parquetdir) device/process inventory, NVTX domains + top paths, and applicable_recipes filtered by trace shape. Sub-100ms on a parquetdir; basics-only on a cold .nsys-rep with a meta.next_steps hint pointing at veloq prep. |
| recipes [<id>] | List or show registered workflow recipes (run veloq recipes for the catalog, veloq recipes <id> for one). |
| sources | Registered sources and their wire-format versions |
| clean <trace> | Remove the <trace>.veloq/ artifact root generated by VeloQ |
| self-update | Update the binary and bundled Agent Skills from the latest GitHub release (--check / --no-skills / --no-binary / --skills-dir) |

Per-verb flag detail, response shape, sort keys, and examples live
in veloq <verb> --help (which is projected from the same
JsonSchema derive as the response, so it can't drift).
NSys's NVTX_EVENTS table records CPU-side range timestamps only;
GPU work is reached by walking correlationId from NVTX → runtime API → kernel/memcpy/memset with (device, context) disambiguation
from TARGET_INFO_CUDA_CONTEXT_INFO. VeloQ does this walk in SQL for
stats --nvtx/search --nvtx/slices and in a pre-built index
(<trace>.veloq/correlation.bin) for correlate.
The same walk runs in reverse for inspect (default-on) and
search --with-nvtx (opt-in batched): given a kernel / memcpy /
memset / sync row_id, VeloQ surfaces nvtx_context: { range_id, name, depth, iter_index } for the innermost enclosing NVTX range.
iter_index is the 0-based ordinal among same-(global_tid, domain_id, name) repeats — answers "which step did this kernel
belong to" without a second jq pass.
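As an illustration of how nvtx_context might be consumed, the sketch below buckets rows by iter_index for one range name. It assumes each row exposes nvtx_context as a top-level field with the name and iter_index members listed above; the sample rows are invented.

```python
from collections import defaultdict


def kernels_by_step(rows, range_name):
    """Group row keys by the iter_index of their innermost NVTX range."""
    steps = defaultdict(list)
    for row in rows:
        context = row.get("nvtx_context")
        # Rows with no enclosing range, or a different range, are skipped.
        if context and context["name"] == range_name:
            steps[context["iter_index"]].append(row["key"])
    return dict(sorted(steps.items()))


# Invented rows; only the nvtx_context member names come from the text above.
rows = [
    {"key": "kernel:10", "nvtx_context": {"range_id": 1, "name": "step", "depth": 0, "iter_index": 0}},
    {"key": "kernel:11", "nvtx_context": {"range_id": 1, "name": "step", "depth": 0, "iter_index": 0}},
    {"key": "kernel:12", "nvtx_context": {"range_id": 2, "name": "step", "depth": 0, "iter_index": 1}},
    {"key": "kernel:13", "nvtx_context": None},
]
grouped = kernels_by_step(rows, "step")
```

Because iter_index counts repeats per (global_tid, domain_id, name), a multi-threaded trace may need the thread or domain added to the grouping key.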
For nested NVTX, veloq stats T --group-by nvtx-path and
veloq slices T --aggregate --group-by path group by the full
slash-joined hierarchy path, so repeated leaf
names under different parents remain distinct. inspect T nvtx:N
also includes path, parent_row_id, and parent_name when the
NVTX tree can be built.
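To see why the full path keeps repeated leaf names apart, consider this small sketch that joins names along parent links. The {id: (name, parent_id)} mapping is an assumed representation chosen for the example, not VeloQ's internal structure.

```python
def nvtx_path(range_id, ranges):
    """Build the slash-joined hierarchy path for one range.

    `ranges` maps id -> (name, parent_id or None).
    """
    parts = []
    current = range_id
    while current is not None:
        name, parent = ranges[current]
        parts.append(name)
        current = parent
    return "/".join(reversed(parts))


# Two ranges named "forward" under different parents.
ranges = {
    1: ("train", None),
    2: ("eval", None),
    3: ("forward", 1),
    4: ("forward", 2),
}
paths = [nvtx_path(3, ranges), nvtx_path(4, ranges)]
```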
| Source | Extensions | Notes |
|---|---|---|
| NSys | .nsys-rep | Primary path; exported via nsys export -t parquetdir on first use |
| NSys | <stem>_pqtdir/ | Pre-exported parquetdir; opened directly |
| NSys | <trace>.veloq/parquetdir/ | Generated alias for the owning .nsys-rep; not a separate source |
| NCU | .ncu-rep | Nsight Compute kernel report (ingested via NVIDIA's ncu_report API at prep time; no vendored proto schemas) |
| PyTorch | .pt.trace.json | PyTorch/Kineto Chrome trace JSON |
| PyTorch | .pt.trace.json.gz | Gzipped PyTorch/Kineto Chrome trace JSON |

veloq info <trace> reports which source claims the file based on
the same detect() heuristic the dispatcher uses, so an agent can
probe a path without having to maintain its own extension list.
NCU ingestion runs NVIDIA's ncu_report Python API at prep time only;
query-time is NCU-free and the generated <report>.veloq/ sidecar is
portable across Linux/macOS/Windows. VeloQ auto-discovers the Nsight
Compute install (extras/python, or the macOS app bundle's
Contents/MacOS/python). For a non-standard location, set
VELOQ_NCU_REPORT_DIR to the directory containing ncu_report.py, and/or
VELOQ_PYTHON to the interpreter to run the helper with.
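A script that launches VeloQ for NCU prep could set these overrides per call instead of exporting them globally. The sketch below only builds the environment mapping; the helper name and the example paths are hypothetical.

```python
import os


def ncu_prep_env(report_dir=None, python=None, base=None):
    """Return an environment mapping with the NCU discovery overrides applied."""
    env = dict(os.environ if base is None else base)
    if report_dir:
        env["VELOQ_NCU_REPORT_DIR"] = report_dir  # directory holding ncu_report.py
    if python:
        env["VELOQ_PYTHON"] = python  # interpreter used to run the helper
    return env


# Hypothetical locations; substitute the paths from your own install.
env = ncu_prep_env(
    report_dir="/opt/nsight-compute/extras/python",
    python="/usr/bin/python3",
    base={},
)
# Pass the result as subprocess.run([...], env=env) when invoking veloq.
```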
MIT