Benchmark script to wrap dtest by tameware · Pull Request #202 · dds-bridge/dds

tameware · 2026-06-19T18:12:01Z

Sample output:

$ ./benchmark.sh 
DDS dtest benchmark
===================
branch:      /Users/adamw/src/dds/bazel-bin/library/tests/dtest
hands dir:   /Users/adamw/src/dds/hands
max_deals:   100
files:       list100.txt list10.txt list1.txt
git branch:  benchmark
repeats:     1

solver file              ver  user_ms   sys_ms   avg_user  ratio run
------ ------------- ------- -------- -------- ---------- ------ ---
solve  list100.txt    branch      284     2428       2.84   8.55 1/1
solve  list10.txt     branch       47      133       4.70   2.83 1/1
solve  list1.txt      branch       15       15      15.00   1.00 1/1
calc   list100.txt    branch     1457    14900      14.57  10.23 1/1
calc   list10.txt     branch      237     1178      23.70   4.97 1/1
calc   list1.txt      branch       45      149      45.00   3.31 1/1

Completed 6 runs (6 expected).

Compare against a binary built from another branch or from v2.9. Comparing against a branch could be automated; for now I build dtest in the other branch, rename the binary, and move it to dds/.. (the parent of the checkout).

$ ./benchmark.sh --build --compare ../dtest_solver_context_reuse --max-deals 100
Building //library/tests:dtest...
Starting local Bazel server (9.1.0 Homebrew) and connecting to it...
INFO: Analyzed target //library/tests:dtest (126 packages loaded, 1424 targets configured).
INFO: Found 1 target...
Target //library/tests:dtest up-to-date:
  bazel-bin/library/tests/dtest
INFO: Elapsed time: 3.827s, Critical Path: 0.02s
INFO: 1 process: 96 action cache hit, 1 internal.
INFO: Build completed successfully, 1 total action
DDS dtest benchmark
===================
branch:      /Users/adamw/src/dds/bazel-bin/library/tests/dtest
compare:     ../dtest_solver_context_reuse
run order:   branch, compare
hands dir:   /Users/adamw/src/dds/hands
max_deals:   100
files:       list100.txt list10.txt list1.txt
git branch:  benchmark
repeats:     1

solver file              ver  user_ms   sys_ms   avg_user  ratio run
------ ------------- ------- -------- -------- ---------- ------ ---
solve  list100.txt    branch      300     2417       3.00   8.06 1/1
solve  list100.txt   compare      246     2450       2.46   9.96 1/1
solve  list10.txt     branch       47      127       4.70   2.70 1/1
solve  list10.txt    compare       50      134       5.00   2.68 1/1
solve  list1.txt      branch       15       15      15.00   1.00 1/1
solve  list1.txt     compare       16       16      16.00   1.00 1/1
calc   list100.txt    branch     1443    14842      14.43  10.29 1/1
calc   list100.txt   compare     1408    14731      14.08  10.46 1/1
calc   list10.txt     branch      240     1123      24.00   4.68 1/1
calc   list10.txt    compare      241     1173      24.10   4.87 1/1
calc   list1.txt      branch       46      147      46.00   3.20 1/1
calc   list1.txt     compare       45      147      45.00   3.27 1/1

Summary (branch vs compare, avg user ms; cmp/branch > 1 => branch faster)
==============================================================================
solver file           compare_avg   branch_avg cmp/branch note           
------ ------------- ------------ ------------ ---------- ---------------
solve  list100.txt           2.46         3.00      0.82x compare faster 
solve  list10.txt            5.00         4.70      1.06x branch faster  
solve  list1.txt            16.00        15.00      1.07x branch faster  
calc   list100.txt          14.08        14.43      0.98x compare faster 
calc   list10.txt           24.10        24.00      1.00x branch faster  
calc   list1.txt            45.00        46.00      0.98x compare faster 

Completed 12 runs (12 expected).

Copilot

Pull request overview

Adds a new benchmark.sh helper script to benchmark library/tests/dtest across solver modes and hand-file sizes, with optional building and binary-to-binary comparison.

Changes:

Introduces a Bash benchmark runner that iterates solver/file combinations and prints per-run timing rows.
Adds optional --build, --compare, --repeats, and --max-deals controls plus -- pass-through args to dtest.
Adds an aggregated comparison summary when a second binary is provided.

Run solve before calc and largest hand files first to reduce warmup bias, aggregate summary on avg_user per hand, and add --reverse to run compare before branch when comparing two binaries. Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

zzcgumn

Interesting idea. I am not sure about adding this to the CI run, as we don't yet have an idea of how large a performance regression we can tolerate.

I did a couple of comparisons like this manually while refactoring from 2.9 to 3.0. The main problem I found was methods that had been inlined before but were not inlined in my release candidate code.

tameware · 2026-06-21T18:56:24Z

I had not considered adding this to the CI run. I like the idea! Something simpler could work for CI: wrap dtest and fail if the runtime for any of solve list1/10/100 or calc list1/10/100 becomes slower by more than a threshold. There's a fair bit of random variation from run to run, though, and some systematic variation that depends on run order. On my Mac, the current version speeds up by more than 10% when the machine is "warm", in particular when solve runs after many calc runs.
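A rough sketch of what such a CI gate might look like follows. It is not part of benchmark.sh: the function name exceeds_threshold, the 10% default, and the timings are all hypothetical, and the stored baseline is assumed to come from an earlier run on the same machine.

```shell
# Hypothetical CI gate: succeed (exit status 0) when the current timing is
# slower than the baseline by more than a relative threshold in percent.
# Arguments: baseline_ms current_ms [threshold_percent]
exceeds_threshold() {
    awk -v base="$1" -v cur="$2" -v pct="${3:-10}" 'BEGIN {
        # A regression is flagged only when cur > base * (1 + pct/100).
        exit !(cur > base * (1 + pct / 100))
    }'
}

# Example with made-up numbers: 3.40 ms against a 3.00 ms baseline.
if exceeds_threshold 3.00 3.40 10; then
    echo "regression: slower than baseline by more than 10%"
fi
```

Given the warm-machine effect of more than 10% described above, a threshold this tight would probably produce false alarms; a real gate would likely need several repeats and a wider margin.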

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

tameware and others added 11 commits June 17, 2026 22:45

Initial commit

5935062

Do not hard-code the path for binary that's not in the repo

1c6a1b2

make dtest2 optional

d8c6a3b

Added max-deals parameter

1c01ef3

Add --build option to benchmark.sh for dtest.

afb842e

Run bazel build //library/tests:dtest before benchmarking when --build is passed; DRY_RUN prints the build command without executing it. Co-authored-by: Cursor <cursoragent@cursor.com>

Pass dtest options via -- in benchmark.sh.

53119b4

Arguments after -- are forwarded to every dtest invocation (e.g. thread count and -r); benchmark -n remains the repeat count before --. Co-authored-by: Cursor <cursoragent@cursor.com>
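The pass-through described in this commit can be sketched as below. This is an illustration under assumptions, not the code from benchmark.sh: parse_args is a hypothetical name, the repeat option is spelled --repeats as it is after the later rename, and the forwarded values are examples only (the thread confirms that dtest accepts -n and -r, not what they mean).

```shell
# Hypothetical parser: options before "--" belong to the benchmark wrapper,
# everything after it stays in "$@" for forwarding to dtest.
parse_args() {
    repeats=1
    while [ $# -gt 0 ]; do
        case "$1" in
            --repeats) repeats=$2; shift 2 ;;
            --)        shift; break ;;
            *)         echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
    # Stand-in for the real invocation: show what would be forwarded.
    echo "repeats=$repeats forwarded: $*"
}

parse_args --repeats 3 -- -n 4 -r   # prints: repeats=3 forwarded: -n 4 -r
```

Stopping at the first "--" is what lets the wrapper and dtest use the same option letter without ambiguity.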

Rename benchmark repeat flag to --repeats and improve dry-run output.

2602a99

Avoids clashing with dtest -n; DRY_RUN now prints commands only without fake timing rows or summary. Co-authored-by: Cursor <cursoragent@cursor.com>

Fix benchmark.sh help text and max-deals error message.

e3077c2

Clarify default vs env overrides for --repeats and --max-deals; error now says 10^n <= N to match filtering. Co-authored-by: Cursor <cursoragent@cursor.com>
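The 10^n <= N rule mentioned here can be illustrated with a small filter. It is a sketch under assumptions: select_files is a hypothetical name, the files are assumed to be named list<10^n>.txt as in the sample output, and the upper bound of 1000 is chosen only for illustration.

```shell
# Print the hand files whose size 10^n does not exceed max_deals,
# largest first. An empty result corresponds to the error case.
select_files() {
    max_deals=$1
    files=""
    for size in 1000 100 10 1; do
        if [ "$size" -le "$max_deals" ]; then
            files="$files list$size.txt"
        fi
    done
    # Unquoted on purpose so the leading space is dropped.
    echo $files
}

select_files 100   # prints: list100.txt list10.txt list1.txt
select_files 50    # prints: list10.txt list1.txt
```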

Rename benchmark binaries to branch/compare and fix column alignment.

32fba00

Use --branch/--compare (and BRANCH/COMPARE env) instead of dtest1/dtest2; widen ver/file columns and align summary output. Co-authored-by: Cursor <cursoragent@cursor.com>

Align benchmark summary speedup and note columns.

5d508c1

Format speedup as a fixed-width string so the trailing x stays in-column; give note a 15-char field. Co-authored-by: Cursor <cursoragent@cursor.com>

Harden benchmark.sh validation, parsing, and compare summary.

69af21e

Validate repeats, skip binary checks in dry-run, run branch before compare, and tighten dtest timing parse so the cmp/branch summary is easier to read. Co-authored-by: Cursor <cursoragent@cursor.com>
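Validating the repeat count might look roughly like this. The helper name is_positive_int is hypothetical and the check is one possible approach in POSIX shell, not necessarily the one used in benchmark.sh.

```shell
# Return success only for a positive decimal integer such as 1 or 25.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    # Strip leading zeros, so "0" and "000" become empty and are rejected.
    stripped=$(printf '%s' "$1" | sed 's/^0*//')
    [ -n "$stripped" ]
}

for candidate in 3 0 -1 2x; do
    if is_positive_int "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: rejected"
    fi
done
```

Only the first candidate is accepted; rejecting bad values before any dtest run avoids a loop that silently executes zero times.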

tameware requested a review from Copilot June 19, 2026 18:12

Copilot started reviewing on behalf of tameware June 19, 2026 18:12

Copilot AI reviewed Jun 19, 2026

Copilot AI mentioned this pull request Jun 19, 2026

benchmark.sh: apply PR #202 review-thread fixes for parsing, warnings, and portability #203

Closed

tameware and others added 2 commits June 19, 2026 15:46

Address PR review comments in benchmark.sh.

6555a88

Use a portable mktemp template, normalize dtest "zero" timings, warn only when user/sys are missing, and label the compare summary as avg user ms. Co-authored-by: Cursor <cursoragent@cursor.com>
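One common way to make a mktemp call portable is to pass an explicit template, sketched below. The benchmark prefix and the variable name tmpfile are assumptions; the thread does not show the template the script actually uses.

```shell
# Create a scratch file with an explicit template. Passing a full path that
# ends in XXXXXX behaves the same with GNU and BSD (macOS) mktemp, whereas
# the meaning of "-t" differs between the two.
tmpfile=$(mktemp "${TMPDIR:-/tmp}/benchmark.XXXXXX") || exit 1
# Remove the file when the script exits, on success or failure.
trap 'rm -f "$tmpfile"' EXIT

echo "captured dtest output would go here" > "$tmpfile"
wc -l < "$tmpfile"
```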

tameware requested a review from Copilot June 19, 2026 22:18

Copilot started reviewing on behalf of tameware June 19, 2026 22:18

Copilot AI reviewed Jun 19, 2026

tameware self-assigned this Jun 19, 2026

tameware requested a review from zzcgumn June 19, 2026 22:30

Derive avg_user in benchmark when dtest omits the avg line.

aa3ddc7

Parse Number of hands and compute user/hands when Avg user time is missing (e.g. zero user time), replacing the broken Copilot autofix. Co-authored-by: Cursor <cursoragent@cursor.com>
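The fallback can be sketched as a small helper. The name avg_user and its argument order are hypothetical, and extracting the values from dtest's text output is left out because the exact line format is not shown in this thread.

```shell
# Arguments: user_ms hands [reported_avg]
# Prints reported_avg when dtest supplied one, else user_ms / hands.
avg_user() {
    if [ -n "${3:-}" ]; then
        printf '%s\n' "$3"
    else
        awk -v u="$1" -v h="$2" 'BEGIN { printf "%.2f\n", u / h }'
    fi
}

avg_user 284 100        # derived: prints 2.84
avg_user 0 1            # zero user time still yields a number: prints 0.00
avg_user 47 10 4.70     # reported value passed through: prints 4.70
```

The first call reproduces the avg_user value of the solve list100.txt row in the sample output (284 ms over 100 hands).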

tameware force-pushed the benchmark branch from ecdec25 to aa3ddc7 Compare June 21, 2026 14:16

tameware marked this pull request as ready for review June 21, 2026 15:03

tameware requested a review from Copilot June 21, 2026 15:03

Copilot started reviewing on behalf of tameware June 21, 2026 15:04

Copilot AI reviewed Jun 21, 2026

tameware and others added 2 commits June 21, 2026 11:58

Fix benchmark summary ratio when branch_avg is zero.

5626ea7

Show n/a/equal for 0/0, inf/branch faster when compare is nonzero and branch is zero, and keep 0.00x/compare faster when only compare is zero. Co-authored-by: Cursor <cursoragent@cursor.com>

Interleave branch and compare runs per repeat in benchmark.

713186d

Run branch then compare (or reverse) for each repeat before advancing, so paired timings see the same CPU warmth instead of all branch runs first. Co-authored-by: Cursor <cursoragent@cursor.com>
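The interleaved ordering can be pictured with a loop that only prints the schedule. This is a sketch: run_order is a hypothetical name, the second argument stands in for --reverse, and no binary is actually executed.

```shell
# Print the order in which binaries would run for one solver/file pair.
# Arguments: repeats [reverse]; reverse=1 mimics --reverse.
run_order() {
    repeats=$1
    first=branch second=compare
    if [ "${2:-0}" -eq 1 ]; then
        first=compare second=branch
    fi
    i=1
    while [ "$i" -le "$repeats" ]; do
        # Both binaries run back to back within the same repeat.
        echo "$first $i/$repeats"
        echo "$second $i/$repeats"
        i=$((i + 1))
    done
}

run_order 2
```

With two repeats this prints branch 1/2, compare 1/2, branch 2/2, compare 2/2, so each pair of timings is taken under similar machine conditions.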

tameware requested a review from Copilot June 21, 2026 16:29

Copilot started reviewing on behalf of tameware June 21, 2026 16:29

Copilot AI reviewed Jun 21, 2026

zzcgumn approved these changes Jun 21, 2026

tameware and others added 7 commits June 21, 2026 15:11

Revert "Fix benchmark summary ratio when branch_avg is zero."

083016d

This reverts commit 5626ea7.

Simplify benchmark cmp/branch ratio when averages are positive.

9a989bd

Assume summary avg_user values are always positive and compute u2/u1 directly instead of guarding a zero branch average. Co-authored-by: Cursor <cursoragent@cursor.com>

Show summary only by default when benchmarking with --compare.

7096445

Add --details to opt into per-run timing rows; branch-only runs still print the full table as before. Co-authored-by: Cursor <cursoragent@cursor.com>

Show transient per-run progress during compare benchmarks.

7293c3e

On a tty, print per-run timing rows while --compare runs then erase them (including the table header) before the summary; use --details to keep rows. Co-authored-by: Cursor <cursoragent@cursor.com>

Drop cmp/branch interpretation from benchmark summary header.

aa99cbd

The ratio column is self-explanatory without the inline note. Co-authored-by: Cursor <cursoragent@cursor.com>

Add epsilon tolerance for equal benchmark comparisons.

9bb879e

Default 0.5% (--epsilon / EPSILON) marks branch and compare as equal when avg user times differ by less than that relative threshold. Co-authored-by: Cursor <cursoragent@cursor.com>

Document --epsilon in benchmark.sh usage examples.

9e9284b

Show the compare-mode tolerance flag in the header and help examples. Co-authored-by: Cursor <cursoragent@cursor.com>
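The --epsilon tolerance added in 9bb879e could work along the following lines. This is a sketch under assumptions: classify is a hypothetical name, the difference is taken relative to the branch average (the script may normalize differently), and both averages are assumed positive, as 9a989bd states.

```shell
# Classify a branch/compare pair of avg user times (ms).
# Arguments: branch_avg compare_avg [epsilon_percent]
classify() {
    awk -v b="$1" -v c="$2" -v eps="${3:-0.5}" 'BEGIN {
        diff = c - b
        if (diff < 0) diff = -diff
        if (diff / b * 100 < eps) print "equal"
        else if (c > b)           print "branch faster"
        else                      print "compare faster"
    }'
}

classify 24.00 24.10   # about 0.42% apart: prints equal
classify 3.00 2.46     # 18% apart: prints compare faster
```

Under this rule the calc list10.txt row from the sample summary (24.10 vs 24.00), which was labelled branch faster at 1.00x, would be reported as equal.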

tameware requested a review from Copilot June 21, 2026 20:36

Copilot started reviewing on behalf of tameware June 21, 2026 20:36

Copilot AI reviewed Jun 21, 2026

tameware requested a review from Copilot June 21, 2026 20:43

Copilot started reviewing on behalf of tameware June 21, 2026 20:44

Copilot AI reviewed Jun 21, 2026

tameware and others added 2 commits June 21, 2026 17:07

Note TestTimer ms truncation in benchmark summary comment.

429ec25

Document that zero or skewed averages can come from per-interval int ms rounding and point to accumulating microseconds in TestTimer.cpp. Co-authored-by: Cursor <cursoragent@cursor.com>

Clarify benchmark comment on zero timing from ms truncation.

118d524

Say rounding to zero rather than a rounding error. Co-authored-by: Cursor <cursoragent@cursor.com>
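The effect described in these two commits can be reproduced with plain integer arithmetic. The snippet uses shell arithmetic as a stand-in for the C++ timer; the interval lengths are invented and the actual code in TestTimer.cpp is not shown in this thread.

```shell
# Ten hypothetical intervals of 900 microseconds each (9 ms in total).
intervals="900 900 900 900 900 900 900 900 900 900"

per_interval_ms=0   # truncate each interval to whole ms, then add
total_us=0          # add microseconds, truncate once at the end
for us in $intervals; do
    per_interval_ms=$((per_interval_ms + us / 1000))
    total_us=$((total_us + us))
done

echo "per-interval truncation: ${per_interval_ms} ms"      # 0 ms
echo "accumulated microseconds: $((total_us / 1000)) ms"   # 9 ms
```

Every interval shorter than one millisecond contributes nothing in the first scheme, which is how a run with real work can still report zero user time.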

tameware merged commit 3127749 into dds-bridge:develop Jun 21, 2026
4 checks passed
