Skip to content

feat(skills): unify TPC-H runner via performance_test.py --nsys-profile#882

Open
bwyogatama wants to merge 2 commits into
sirius-db:devfrom
bwyogatama:worktree-consolidate-skills
Open

feat(skills): unify TPC-H runner via performance_test.py --nsys-profile#882
bwyogatama wants to merge 2 commits into
sirius-db:devfrom
bwyogatama:worktree-consolidate-skills

Conversation

@bwyogatama

@bwyogatama bwyogatama commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

Make test/tpch_performance/performance_test.py the single canonical TPC-H runner shared by the benchmark, profile-analyzer, and optimization-advisor skills. Previously, profile-analyzer recommended a separate set of shell runners (profile_tpch_nsys.sh, run_tpch_parquet.sh, benchmark_and_validate.sh) for the same workload — duplicating logic and using a different query source. It also pointed users at flags/scripts that don't exist (--db-path, profile_tpch_nsys_duckdb_native.sh).

After this PR all three skills point at the same Python runner. Profiling goes through nsys_report.sh, which now delegates to performance_test.py --nsys-profile and flattens per-query outputs so the existing nsys_analyze.sh / nsys_compare.sh tooling keeps working unchanged.

What changed

  • test/tpch_performance/performance_test.py (+310 lines)
    • New --nsys-profile flag: subprocess-per-query DuckDB CLI wrapped in nsys profile with --capture-range=cudaProfilerApi.
    • Helpers _build_views_sql (factored from open_connection), _build_nsys_temp_sql, run_nsys_profile.
    • Per-query outputs under <bench>/sirius/q<N>/{nsys.nsys-rep, nsys.sqlite, nsys.sql, timings.csv, log_dir/}.
    • New --query-timeout flag (default 90s, only for --nsys-profile).
    • metadata.json gains nsys_profile field.
    • Reuses existing queries.py, pin SQL, parquet resolution, and csv/runtimes.csv format.
  • test/tpch_performance/nsys_report.sh
    • Phase 2 delegates to pixi run python performance_test.py --nsys-profile instead of profile_tpch_nsys.sh, then flattens <bench>/sirius/q<N>/{nsys.sqlite, nsys.nsys-rep, timings.csv, nsys_stdout.txt} into the legacy flat profiles/q<N>.<ext> layout so Phase 3/4 (analyze + metric extraction) work unchanged.
    • Adds --parquet-dir (default \$PROJECT_DIR/test_datasets/tpch_parquet_sf<SF>).
    • Removes --cache-level (set scan_cache_level in the Sirius YAML config instead).
  • .claude/skills/profile-analyzer/SKILL.md
    • Drops broken --db-path and profile_tpch_nsys_duckdb_native.sh references.
    • Workflow A Step 2 (non-profiled timing) now uses performance_test.py (the same runner the benchmark skill uses).
    • Reframes NVTX domain numbering as discovered from each profile at analysis time (not Sirius-registered).
    • Drops hardcoded "30-40 CUDA streams" (configurable via pipeline.num_threads).
    • Tweaks "24 on Ada Lovelace" to note nsys_analyze.sh reads maxBlocksPerSm dynamically.
  • .claude/skills/optimization-advisor/SKILL.md
    • Three call sites (Profiling Overhead Warning, Validating Optimizations, Output Format reminder) now reference performance_test.py instead of run_tpch_parquet.sh / benchmark_and_validate.sh.
  • test/tpch_performance/CLAUDE.md
    • Running Benchmarks section restructured around performance_test.py (covers --engine, --iterations, --mode, --queries, --pin, --validation, --nsys-profile, --query-timeout, etc.).
    • Profiling section now points at performance_test.py --nsys-profile as the primary entry; nsys_report.sh documented as the orchestrator.
    • Key Files table restructured: performance_test.py promoted to top; legacy shell runners demoted to a "Legacy (kept for CI)" group at the bottom.

Out of scope

  • Legacy shell runners (benchmark_and_validate.sh, run_tpch_parquet.sh, profile_tpch_nsys.sh, run_tpch_legacy.sh, etc.) stay in the tree because .github/workflows/test.yml:156 and .ai-helper/commands.yaml:34/44/78 still call them. Those callers should be migrated to performance_test.py in a follow-up PR before the shell runners are deleted.

Follow-up fixes landed on this branch

  • 6923e78 fix(skills): pass -unsigned to DuckDB CLI in --nsys-profile mode — caught while smoke-testing nsys_report.sh end-to-end. Without -unsigned, the DuckDB CLI rejects the locally-built unsigned Sirius extension. The Python runner already handles this via allow_unsigned_extensions=true in open_connection; mirror that for the subprocess.

Test plan

  • Static checks passpython3 -c \"import ast; ast.parse(open('test/tpch_performance/performance_test.py').read())\" and bash -n test/tpch_performance/nsys_report.sh both clean.
  • Pre-commit cleanblack, codespell, end-of-file-fixer, trim trailing whitespace, smartquote, debug statements, etc. all green on the changed files (verified by the commit hook running on both c7b93f2 and 6923e78). black auto-reformatted performance_test.py once during the initial commit; subsequent runs are clean.
  • --nsys-profile argparse contract verified — rejected with the expected error for: --engine cpu, --engine both, --validation, --mode sequential, --duckdb-profiling. --help lists both new flags.
  • Non-profile baseline (CPU side) unchangedperformance_test.py --input ~/sirius/test_datasets/tpch_parquet_sf1 --engine cpu --queries 1,3,6 --iterations 1 produces the documented layout: config.yml, metadata.json (with new nsys_profile: false field), csv/runtimes.csv with the expected engine,query,iteration,runtime_s rows, duckdb/q<N>/result.txt.
  • New --nsys-profile mode — code path — flag validation passes, run_nsys_profile is invoked, nsys.sql is generated with the exact layout from profile_tpch_nsys.sh:170-208 (views → optional pin → cold iter → profiler_start → hot iter(s) → profiler_stop → COPY timings to per-query timings.csv), per-query subdir (sirius/q<N>/{nsys.sql, nsys_stdout.txt, log_dir/}) created, subprocess spawned and exit-code handled (NaN rows recorded in runtimes.csv on failure).
  • End-to-end orchestrator — code pathnsys_report.sh --sf 1 --parquet-dir ... --output-dir ... --iterations 2 1 3 6 runs all phases: PARQUET_DIR resolution from --sf, pixi run python performance_test.py --nsys-profile invocation with the correct flag set, <REPORT_DIR>/profiles_tree/sirius/q<N>/ populated by the Python runner, [Phase 2] Flattening per-query outputs into <REPORT_DIR>/profiles/ step reached.
  • Successful GPU run end-to-end — produces sirius/q<N>/nsys.{nsys-rep, sqlite} plus a report.md / summary.json / metadata.json via nsys_report.sh. Pending — needs to be run on a host with the NVIDIA driver loaded outside the verification sandbox.
  • Existing analyze/compare tools unchanged on the new outputsnsys_analyze.sh reports/<ts>/profiles/ and nsys_compare.sh reports/<a> reports/<b>. Pending — depends on the GPU run above producing real .sqlite files.
  • CI green — the legacy benchmark_and_validate.sh 1 SF1 step continues to work in .github/workflows/test.yml. Pending — will be reflected by the GitHub Actions check on this PR.

🤖 Generated with Claude Code

bwyogatama and others added 2 commits June 4, 2026 17:56
Make performance_test.py the single canonical TPC-H runner shared by
the benchmark, profile-analyzer, and optimization-advisor skills.

- Add --nsys-profile to performance_test.py: subprocess-per-query
  DuckDB CLI wrapped in `nsys profile`, producing per-query
  .nsys-rep + .sqlite under <bench>/sirius/q<N>/. Reuses existing
  parquet-view setup, queries.py, and pin SQL.
- nsys_report.sh now delegates Phase 2 to performance_test.py
  --nsys-profile and flattens per-query outputs into profiles/ so
  nsys_analyze.sh and Phase 4 metric extraction continue to work
  unchanged. Adds --parquet-dir; removes --cache-level (use Sirius
  YAML config instead).
- profile-analyzer SKILL.md: drop broken --db-path /
  profile_tpch_nsys_duckdb_native.sh references; Workflow A Step 2
  now uses performance_test.py; reframe NVTX domain numbering as
  discovered at analysis time; drop hardcoded "30-40 CUDA streams"
  and "24 on Ada Lovelace" claims.
- optimization-advisor SKILL.md: three call sites now reference
  performance_test.py instead of run_tpch_parquet.sh /
  benchmark_and_validate.sh.
- test/tpch_performance/CLAUDE.md: restructure Running Benchmarks
  around performance_test.py; demote shell runners to a "Legacy
  (kept for CI)" group in Key Files.

Legacy shell runners (benchmark_and_validate.sh, run_tpch_parquet.sh,
profile_tpch_nsys.sh, etc.) stay in tree for backward compat with
.github/workflows/test.yml and .ai-helper/commands.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without -unsigned, the DuckDB CLI rejects locally-built Sirius
extensions with "signature is either missing or invalid and unsigned
extensions are disabled by configuration". The Python runner already
handles this via `allow_unsigned_extensions=true` in the duckdb.connect
config (open_connection at performance_test.py:271); mirror that for
the subprocess so nsys profiling works against an unsigned dev build.

Caught while smoke-testing nsys_report.sh end-to-end against the local
build/release/extension/sirius/sirius.duckdb_extension.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants