Add regression checks for AI Credits math in Go and JavaScript#38638
Add regression checks for AI Credits math in Go and JavaScript#38638Copilot wants to merge 3 commits into
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #38638 does not have the 'implementation' label and has only 24 new lines of code in business logic directories (threshold is 100). No custom config present. |
|
🧪 Test Quality Sentinel completed test quality analysis. |
There was a problem hiding this comment.
Pull request overview
This PR adds regression tests that validate AI Credits (AIC) inference-cost arithmetic and Copilot provider/model normalization behavior using a fixed set of token-usage samples sourced from a specific GitHub Actions run artifact.
Changes:
- Added a Go unit test (
TestComputeModelInferenceAICLiveCopilotData) that asserts AIC results for thecopilotprovider and dated modelgpt-5.4-2026-03-05match expected live-derived values. - Added a JavaScript unit test that asserts the same live-derived samples match
computeInferenceAICbehavior, including dated-model resolution and provider aliasing.
Show a summary per file
| File | Description |
|---|---|
| pkg/cli/model_costs_test.go | Adds a table-driven Go regression test verifying AIC math for Copilot usage samples and dated model resolution. |
| actions/setup/js/model_costs.test.cjs | Adds a Vitest regression test verifying the JS AIC computation matches the same Copilot usage samples and alias/model resolution behavior. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 2/2 changed files
- Comments generated: 0
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
@copilot Add tests that involve cache read |
🧪 Test Quality Sentinel Report
📊 Metrics & Test Classification (2 tests analyzed)
Test Classification Details
Language SupportTests analyzed:
|
There was a problem hiding this comment.
✅ Test Quality Sentinel: 60/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%). Both new tests are behavioral design tests that validate AI Credits math against live Copilot API data. Minor suggestions: add assertion messages to the Go assert.InDelta call and consider adding edge-case coverage (cache-read tokens, unknown model) in a follow-up.
There was a problem hiding this comment.
Skills-Based Review 🧠
Applied /tdd — no blocking issues, but 6 observations worth addressing before this anchors future arithmetic regression suites.
📋 Key Themes & Highlights
Key Themes
- Test-fixture inconsistency (Go vs JS): The JS test pins exact pricing in a fixture; the Go test reads from the embedded
data/models.json. Ifgpt-5.4prices are ever updated, the Go wantAIC values go stale silently. - Debuggability: Generic test names (
request_1–request_4) and aforloop in JS make failure triage harder than it needs to be — a one-line naming change and switching toit.each()would give precise failing-sample context. - Precision asymmetry:
toBeCloseTo(9)in JS is 2× stricter thanInDelta(1e-9)in Go, which subtly undermines the cross-runtime parity guarantee. - Coverage gap:
cache_readpricing is present in both fixtures but no sample exercises a non-zerocacheReadTokens— the cache billing path is unvalidated.
Positive Highlights
- ✅ Excellent idea to anchor regression tests against live run artifacts with exact URLs — future readers know exactly where the expected values came from.
- ✅ Four samples with varying input/output ratios provide good arithmetic diversity.
- ✅ Both runtimes covering the same samples is a strong parity signal.
- ✅ Dated model resolution (
gpt-5.4-2026-03-05 → gpt-5.4) and provider alias (copilot → github-copilot) are both exercised, which was the stated goal.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 308 AIC · ⌖ 34.6 AIC
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0) |
There was a problem hiding this comment.
[/tdd] Go test relies on the embedded production data/models.json, unlike the JS counterpart which injects an explicit fixture — if gpt-5.4 pricing is updated in the catalog, these wantAIC values become stale and the test fails with a misleading arithmetic delta rather than a clear "pricing changed" signal.
💡 Suggestion: use a fixture like the JS test
The JS test uses writeModelsFixture({ "github-copilot": { models: { "gpt-5.4": { cost: { ... } } } } }) to pin the exact prices it is testing. Doing the same in Go (via os.Setenv("GH_AW_MODELS_JSON_PATH", ...)) would make this test a pure arithmetic regression rather than a combined pricing-plus-arithmetic check.
This also means any future price update to gpt-5.4 in data/models.json will require updating wantAIC here — a non-obvious maintenance burden if the relationship is not documented.
| outputTokens int | ||
| wantAIC float64 | ||
| }{ | ||
| {name: "request_1", inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335}, |
There was a problem hiding this comment.
[/tdd] Test case names request_1–request_4 do not read as specifications — a failing test run just says TestComputeModelInferenceAICLiveCopilotData/request_3 with no hint of what makes that case distinctive.
💡 Suggested naming
Encode the distinguishing characteristic in the name. For example:
{name: "gpt54_20k_input_1k_output", inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335},
{name: "gpt54_24k_input_15k_output", inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415},
{name: "gpt54_54k_input_4k_output", inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025},
{name: "gpt54_66k_input_low_output", inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735},This lets a failing test message immediately surface the input range, making it faster to correlate back to the live artifact.
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0) | ||
| assert.InDelta(t, tt.wantAIC, got, 1e-9) |
There was a problem hiding this comment.
[/tdd] assert.InDelta is called without a failure message — every other assertion in this file includes a diagnostic string. Without it, a failure just shows the numeric delta, leaving the reader to infer which field or sample was wrong.
💡 Add a failure message
assert.InDelta(t, tt.wantAIC, got, 1e-9,
"AIC mismatch for %s (input=%d, output=%d)", tt.name, tt.inputTokens, tt.outputTokens)| { inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735 }, | ||
| ]; | ||
|
|
||
| for (const sample of samples) { |
There was a problem hiding this comment.
[/tdd] A plain for loop over samples means the test fails on the first bad value and does not identify which sample caused it — expect errors will just show "at line 171" with no indication of which token-count pair is implicated.
💡 Use it.each() for per-sample test IDs
const samples = [
{ inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335 },
{ inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415 },
{ inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025 },
{ inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735 },
];
it.each(samples)(
"gpt-5.4 live: $inputTokens in / $outputTokens out → $wantAIC AIC",
async ({ inputTokens, outputTokens, wantAIC }) => {
const { computeInferenceAIC } = await import("./model_costs.cjs");
const got = computeInferenceAIC({ provider: "copilot", model: "gpt-5.4-2026-03-05",
inputTokens, outputTokens, cacheReadTokens: 0, cacheWriteTokens: 0 });
expect(got).toBeCloseTo(wantAIC, 9);
}
);Each sample becomes a named subtest, surfacing exactly which input data failed.
| cacheReadTokens: 0, | ||
| cacheWriteTokens: 0, | ||
| }); | ||
| expect(got).toBeCloseTo(sample.wantAIC, 9); |
There was a problem hiding this comment.
[/tdd] Precision asymmetry with the Go counterpart: toBeCloseTo(value, 9) checks |got − want| < 5e-10, but the Go test uses assert.InDelta(..., 1e-9) which allows up to 1e-9. A JS floating-point computation differing by 6e-10 would fail JS and pass Go — quietly undermining the cross-runtime parity guarantee these tests are meant to provide.
💡 Align tolerances
Choose one tolerance and use it consistently:
- JS stricter: change Go to
assert.InDelta(t, tt.wantAIC, got, 5e-10, ...) - Go looser: change JS to
toBeCloseTo(sample.wantAIC, 8)(which checks within5e-9)
Since the values are exact floating-point arithmetic with the given prices, 1e-9 or tighter should be achievable on both sides.
| cost: { | ||
| input: "0.0000025", | ||
| output: "0.000015", | ||
| cache_read: "0.00000025", |
There was a problem hiding this comment.
[/tdd] The fixture includes cache_read: "0.00000025" but every sample passes cacheReadTokens: 0 — the cache-hit cost path is never exercised. This is a coverage gap in a test explicitly designed as an arithmetic regression.
💡 Add a cache-read sample
Derive a wantAIC that includes cache read tokens (e.g., from the live artifact if a cache hit is present, or synthetically with a known token count):
// synthetic: 10000 cache-read tokens at 0.00000025 = 0.0025 USD = 0.25 AIC
{ inputTokens: 10000, outputTokens: 0, cacheReadTokens: 10000, wantAIC: /* input+cache */ }Without this, a bug that zeroes out cache-read billing would pass all four existing assertions.
There was a problem hiding this comment.
Non-blocking observations on the new live-data tests
Two test-only files changed, adding live-billing regression tests for gpt-5.4-2026-03-05. The math checks out (verified manually), the model-alias resolution is correct, and the JS test is properly fixture-isolated. Two non-blocking observations inline.
📋 Review summary
What was checked
- Arithmetic correctness of all four
wantAICvalues against the fixture/embedded pricing ✓ - Model fuzzy-match path:
gpt-5.4-2026-03-05→gpt-5.4via longest-prefix logic ✓ - Provider alias:
"copilot"→"github-copilot"✓ cacheWriteTokens=0/cacheReadTokens=0edge cases ✓
Issues found
- (medium) Go test couples
wantAICvalues to livemodels.jsonpricing — no fixture isolation - (low) JS test uses
precision=9vs.precision=6everywhere else in the file
Neither is blocking.
🔎 Code quality review by PR Code Quality Reviewer · ⌖ 13.7 AIC
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0) |
There was a problem hiding this comment.
Go test is not fixture-isolated: the wantAIC values are coupled to the live models.json embedded at compile time, not to a pinned fixture.
💡 Why this matters and how to fix it
The JS counterpart writes its own fixture with writeModelsFixture, so its expectations are independent of models.json. This Go test does not — it relies on gpt-5.4 pricing in the embedded data/models.json matching the hardcoded expected values exactly.
When gpt-5.4 pricing is updated in models.json (corrected rates, model deprecation, etc.), this test fails with a numeric mismatch that has no obvious connection to the pricing file. The failure message will show something like expected 6.5335 but got 7.2100, leaving the developer wondering why a live-data test broke after an unrelated JSON change.
Consider using an environment-variable override analogous to GH_AW_MODELS_JSON_PATH (the JS side) or temporarily swapping the embedded bytes, e.g.:
// write a temp models.json with the known gpt-5.4 prices
dir := t.TempDir()
modelsPath := filepath.Join(dir, "models.json")
// ... write fixture ...
t.Setenv("GH_AW_MODELS_JSON_PATH", modelsPath)If the Go implementation has no equivalent override, the test comment should at least say which prices it depends on and point to data/models.json so the coupling is explicit.
| cacheReadTokens: 0, | ||
| cacheWriteTokens: 0, | ||
| }); | ||
| expect(got).toBeCloseTo(sample.wantAIC, 9); |
There was a problem hiding this comment.
Inconsistent toBeCloseTo precision: this test uses precision=9 (tolerance 5e-10), 1000× tighter than the precision=6 used everywhere else in this file.
💡 Details
The two existing toBeCloseTo calls use precision=6:
- line 56:
expect(aic).toBeCloseTo(0.54825, 6) - line 104:
expect(aicViaEmbedded).toBeCloseTo(aicViaProvider, 6)
The new test uses precision=9. The arithmetic here is straightforward (a few multiplications + additions on ~20k–65k token counts), so there is no correctness reason to demand sub-nanosecond precision. Using a tighter-than-needed tolerance:
- Creates an unexplained inconsistency that will confuse the next person writing a test in this file.
- Could cause spurious failures if a future refactor introduces a legitimate intermediate rounding step.
precision=6 (or even 4) is sufficient to verify the billing arithmetic is correct.
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { | ||
| got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0) |
There was a problem hiding this comment.
Implicit prefix-match resolution is not documented and silently breakable: gpt-5.4-2026-03-05 resolves to gpt-5.4 today via longest-HasPrefix, but nothing in the test documents this or guards against it changing.
💡 Details
findModelPricing selects the catalog entry with the longest matching prefix of the normalized model ID. If data/models.json gains a new entry like gpt-5.4-2026 in the github-copilot provider, it becomes the new longest prefix for gpt-5.4-2026-03-05 and silently overrides the pricing. The wantAIC values would then be wrong, but the test would fail for a completely opaque reason.
Minimally, add a comment stating which pricing record is expected (gpt-5.4, input=2.5e-6, output=1.5e-5) and how the wantAIC values were derived. Better yet, this is another reason to add fixture isolation: a test-local models.json with only gpt-5.4 removes the ambiguity entirely.
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Added cache-read coverage in both runtimes:
Addressed in commits |
This PR strengthens confidence in AI Credits computation by validating both Go and JavaScript pricing paths against real Copilot usage data from run
27355745225/ job80833046748. It targets arithmetic correctness, Copilot model-price mapping behavior, and cache-read billing coverage.Scope: AI Credits computation parity
Go: live arithmetic assertions
TestComputeModelInferenceAICLiveCopilotDatainpkg/cli/model_costs_test.go.computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", ...)againstai_credits_this_responsevalues from the live artifact.JavaScript: live arithmetic assertions
matches live Copilot billing arithmetic for gpt-5.4 dated modelinactions/setup/js/model_costs.test.cjs.computeInferenceAIC({ provider: "copilot", model: "gpt-5.4-2026-03-05", ... })against the same live values.Cache-read regression coverage (Go + JavaScript)
TestComputeModelInferenceAICCopilotCacheReadinpkg/cli/model_costs_test.go.includes cache read pricing for copilot gpt-5.4 dated modelinactions/setup/js/model_costs.test.cjs.cache_readpath), with expected0.56AIC forinput=1000,output=200,cacheRead=400.Pricing-model resolution behavior covered
gpt-5.4-2026-03-05→gpt-5.4pricing entry) and Copilot provider alias handling (copilot→github-copilot) in both paths.