feat(metrics): stdlib rank-correlation primitives (Spearman, Kendall, bootstrap CI) by Denis-hamon · Pull Request #54 · Denis-hamon/world-model-eval-lab

Denis-hamon · 2026-06-14T15:06:20Z

Tier 2.1.a — the no-GPU foundation for the offline→downstream keystone (does a cheap planner-free offline metric predict control usefulness?). Pure stdlib, no compute, core wmel.metrics only.

What it adds

spearman_rho — Pearson on average ranks (tie-safe).
kendall_tau — tau-b, tie-corrected, O(n²); robust at very small n.
bootstrap_correlation_ci → CorrelationResult: paired percentile bootstrap (resamples cell pairs jointly, like paired_bootstrap_gap_ci; skips degenerate resamples; deterministic given seed). All three primitives are rank-based by design: there are only a handful of cells, the relationship is expected to be monotone rather than linear, and the two scales are not comparable.
Exported from the package; _rankdata/_pearson private helpers.

Why now

T2.2 (n=150) and T2.3 (pixels) are GPU-gated, but this primitive layer is the planned no-GPU first step of the keystone and is independently useful stats. The remaining keystone pieces — offline_metrics.py (M1 1-step MSE, M2 k-step divergence, M3 action-counterfactual sensitivity, M4 score-to-go error) and correlate.py — need the CPU MLP-variant sweep + the GPU box for the TD-MPC2 cells.

Verification

13 tests, all closed-form: Spearman 0.8 and Kendall 4/6 on [1,2,3,4] vs [1,3,2,4]; tie-corrected tau-b 0.5 on [1,1,2] vs [1,2,2]; average-rank ties; perfect-correlation tight CI; determinism; degenerate + input-validation guards.
Adversarial review: GO, zero must-fix — all three hand-derivable values re-derived and cross-checked against SciPy; paired resampling, percentile convention, determinism, and edge guards confirmed. Two doc nits applied (CI is conditional on non-degenerate resamples; n_boot field = valid count).
Full suite passing; core stays stdlib-only; no emoji; no new git tag.
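
The three hand-derivable values can be re-derived as follows. This is a sketch using the textbook no-ties Spearman formula and the standard tau-b definition, not a transcript of the review; the tau-b inputs [1,1,2] vs [1,2,2] are taken from the commit message, and the symbols C, D, n_0, n_1, n_2 are notation introduced here.

```latex
% Spearman on x = [1,2,3,4], y = [1,3,2,4]: rank differences d = (0,-1,1,0)
\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}
     = 1 - \frac{6 \cdot 2}{4 \cdot 15} = 0.8

% Kendall on the same data: 6 pairs, only the pair (2,3) vs (3,2) is discordant
\tau = \frac{C - D}{n_0} = \frac{5 - 1}{6} = \frac{4}{6}

% Tau-b on x = [1,1,2], y = [1,2,2]:
% n_0 = 3 pairs, C = 1, D = 0, n_1 = 1 pair tied in x, n_2 = 1 pair tied in y
\tau_b = \frac{C - D}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
       = \frac{1}{\sqrt{2 \cdot 2}} = 0.5
```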

🤖 Generated with Claude Code

… bootstrap CI)

The no-GPU foundation for the offline-metric vs downstream-performance keystone (T2.1.a): correlate a cheap planner-free offline metric against CPG / success across (model, env, planner) cells and report the strength with an honest interval. Pure stdlib, no compute.

- src/wmel/metrics.py: spearman_rho (Pearson on average ranks, tie-safe), kendall_tau (tau-b, tie-corrected, O(n^2)), bootstrap_correlation_ci (paired percentile bootstrap; resamples cell pairs jointly; skips degenerate resamples; deterministic given seed), CorrelationResult dataclass, plus private _rankdata / _pearson. Exported from the package.
- tests/test_correlation.py: closed-form checks (Spearman 0.8 and Kendall 4/6 on [1,2,3,4] vs [1,3,2,4]; tie-corrected tau-b 0.5 on [1,1,2] vs [1,2,2]; average-rank ties), perfect-correlation tight CI, determinism, degenerate and input-validation guards.

Rank-based by design (small n, monotone-not-linear, incomparable scales); the bootstrap CI is documented as conditional on non-degenerate resamples.

Adversarial review GO, zero must-fix (all three hand-derivable values re-derived and cross-checked vs SciPy; paired resampling, percentile convention matching paired_bootstrap_gap_ci, determinism, and edge guards all confirmed).

Full suite passing, core stdlib-only. No new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Denis-hamon merged commit e3c7128 into main Jun 14, 2026
4 checks passed

Denis-hamon mentioned this pull request Jun 14, 2026

experiments: offline->downstream correlation, CPU proof-of-concept (keystone) #55 (Merged)

Denis-hamon merged 1 commit into main from feat-correlation-primitives
