feat(metrics): stdlib rank-correlation primitives (Spearman, Kendall, bootstrap CI) #54
Merged
… bootstrap CI)

The no-GPU foundation for the offline-metric vs downstream-performance keystone (T2.1.a): correlate a cheap planner-free offline metric against CPG / success across (model, env, planner) cells and report the strength with an honest interval. Pure stdlib, no compute.

- src/wmel/metrics.py: spearman_rho (Pearson on average ranks, tie-safe), kendall_tau (tau-b, tie-corrected, O(n^2)), bootstrap_correlation_ci (paired percentile bootstrap; resamples cell pairs jointly; skips degenerate resamples; deterministic given seed), CorrelationResult dataclass, plus private _rankdata / _pearson. Exported from the package.
- tests/test_correlation.py: closed-form checks (Spearman 0.8 and Kendall 4/6 on [1,2,3,4] vs [1,3,2,4]; tie-corrected tau-b 0.5 on [1,1,2] vs [1,2,2]; average-rank ties), perfect-correlation tight CI, determinism, degenerate and input-validation guards.

Rank-based by design (small n, monotone-not-linear, incomparable scales); the bootstrap CI is documented as conditional on non-degenerate resamples.

Adversarial review GO, zero must-fix (all three hand-derivable values re-derived and cross-checked vs SciPy; paired resampling, percentile convention matching paired_bootstrap_gap_ci, determinism, and edge guards all confirmed).

Full suite passing, core stdlib-only. No new git tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
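For anyone who wants to re-derive the closed-form values by hand, here is a minimal stdlib sketch of the two rank statistics named in the commit message. It is not the code in src/wmel/metrics.py: the function bodies, the error handling, and the helper names `rankdata` and `pearson` are assumptions for illustration. Only the definitions (Pearson on average ranks, tie-corrected tau-b) and the expected values come from the commit message.

```python
from math import sqrt


def rankdata(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation; undefined (ValueError) for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant input")
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as Pearson on average ranks, which handles ties."""
    return pearson(rankdata(x), rankdata(y))


def kendall_tau_b(x, y):
    """Tau-b over all pairs, O(n^2); ties shrink the denominator."""
    conc = disc = only_x = only_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes nothing
            if dx == 0:
                only_x += 1  # tied in x only
            elif dy == 0:
                only_y += 1  # tied in y only
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = sqrt((conc + disc + only_x) * (conc + disc + only_y))
    if denom == 0:
        raise ValueError("tau-b undefined: no untied pairs")
    return (conc - disc) / denom


print(spearman_rho([1, 2, 3, 4], [1, 3, 2, 4]))   # 0.8
print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))  # 4/6, about 0.667
print(kendall_tau_b([1, 1, 2], [1, 2, 2]))        # 0.5
```

In the first example five of the six pairs are concordant and one is discordant, giving (5 - 1) / 6. In the tied example one pair is concordant, one is tied only in x and one only in y, giving 1 / sqrt(2 * 2) = 0.5.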
Tier 2.1.a: the no-GPU foundation for the offline→downstream keystone (does a cheap planner-free offline metric predict control usefulness?). Pure stdlib, no compute, core `wmel.metrics` only.

What it adds

- `spearman_rho`: Pearson on average ranks (tie-safe).
- `kendall_tau`: tau-b, tie-corrected, O(n²); robust at very small n.
- `bootstrap_correlation_ci` → `CorrelationResult`: paired percentile bootstrap (resamples cell pairs jointly, like `paired_bootstrap_gap_ci`; skips degenerate resamples; deterministic given seed). Rank-based by design: a handful of cells, monotone-not-linear, incomparable scales.
- `_rankdata` / `_pearson` private helpers.

Why now
T2.2 (n=150) and T2.3 (pixels) are GPU-gated, but this primitive layer is the planned no-GPU first step of the keystone, and the statistics are useful on their own. The remaining keystone pieces, `offline_metrics.py` (M1 1-step MSE, M2 k-step divergence, M3 action-counterfactual sensitivity, M4 score-to-go error) and `correlate.py`, need the CPU MLP-variant sweep plus the GPU box for the TD-MPC2 cells.

Verification
- Closed-form checks: Spearman 0.8 and Kendall 4/6 on `[1,2,3,4]` vs `[1,3,2,4]`; tie-corrected tau-b 0.5; average-rank ties; perfect-correlation tight CI; determinism; degenerate + input-validation guards.
- … (`n_boot` field = valid count).

🤖 Generated with Claude Code
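The paired percentile bootstrap checked above can be sketched as follows. This is a hedged illustration, not `bootstrap_correlation_ci` itself: the name `bootstrap_ci_sketch`, its parameters (`stat`, `n_boot`, `alpha`, `seed`), the tuple return value, and the floor-index percentile rule are all assumptions, and the real percentile convention (the one shared with `paired_bootstrap_gap_ci`) may differ. The block carries its own compact tau-b so that it runs stand-alone.

```python
import random
from math import sqrt


def kendall_tau_b(x, y):
    """Tau-b over all pairs; raises ValueError when it is undefined."""
    conc = disc = only_x = only_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                only_x += 1
            elif dy == 0:
                only_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = sqrt((conc + disc + only_x) * (conc + disc + only_y))
    if denom == 0:
        raise ValueError("degenerate sample: correlation undefined")
    return (conc - disc) / denom


def bootstrap_ci_sketch(x, y, stat=kendall_tau_b, n_boot=2000, alpha=0.05, seed=0):
    """Return (lo, hi, valid_count) for a paired percentile bootstrap."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 cells")
    rng = random.Random(seed)  # local RNG: the result depends only on the seed
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # One index list drives both samples, so each (x, y) cell stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(stat([x[i] for i in idx], [y[i] for i in idx]))
        except ValueError:
            continue  # degenerate resample (e.g. one cell drawn n times): skip
    if not estimates:
        raise ValueError("every resample was degenerate")
    estimates.sort()
    last = len(estimates) - 1
    # Assumed percentile rule: floor of the fractional index into the sorted list.
    lo = estimates[int((alpha / 2) * last)]
    hi = estimates[int((1 - alpha / 2) * last)]
    return lo, hi, len(estimates)


# Perfectly monotone toy input: every non-degenerate resample has tau-b = 1.
x = [1, 2, 3, 4, 5, 6]
y = [2 * v for v in x]
lo, hi, valid = bootstrap_ci_sketch(x, y, n_boot=500, seed=1)
print(lo, hi, valid)  # lo and hi are both 1.0; valid is at most 500
```

Because degenerate resamples are dropped rather than redrawn, the interval is conditional on the resample being non-degenerate. Returning the valid count alongside the bounds makes that visible to the caller.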