Skip to content

Add regression checks for AI Credits math in Go and JavaScript#38638

Closed
Copilot wants to merge 3 commits into
mainfrom
copilot/review-ai-credits-functions
Closed

Add regression checks for AI Credits math in Go and JavaScript#38638
Copilot wants to merge 3 commits into
mainfrom
copilot/review-ai-credits-functions

Conversation

Copilot AI commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

This PR strengthens confidence in AI Credits computation by validating both Go and JavaScript pricing paths against real Copilot usage data from run 27355745225 / job 80833046748. It targets arithmetic correctness, Copilot model-price mapping behavior, and cache-read billing coverage.

  • Scope: AI Credits computation parity

    • Added regression coverage in both runtimes for the same live token-usage samples.
    • Ensures both implementations compute identical AIC outcomes for equivalent inputs.
  • Go: live arithmetic assertions

    • Added TestComputeModelInferenceAICLiveCopilotData in pkg/cli/model_costs_test.go.
    • Verifies computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", ...) against ai_credits_this_response values from the live artifact.
  • JavaScript: live arithmetic assertions

    • Added matches live Copilot billing arithmetic for gpt-5.4 dated model in actions/setup/js/model_costs.test.cjs.
    • Verifies computeInferenceAIC({ provider: "copilot", model: "gpt-5.4-2026-03-05", ... }) against the same live values.
  • Cache-read regression coverage (Go + JavaScript)

    • Added TestComputeModelInferenceAICCopilotCacheRead in pkg/cli/model_costs_test.go.
    • Added includes cache read pricing for copilot gpt-5.4 dated model in actions/setup/js/model_costs.test.cjs.
    • Verifies cache-read token pricing is included correctly for Copilot (cache_read path), with expected 0.56 AIC for input=1000, output=200, cacheRead=400.
  • Pricing-model resolution behavior covered

    • Validates dated model resolution (gpt-5.4-2026-03-05gpt-5.4 pricing entry) and Copilot provider alias handling (copilotgithub-copilot) in both paths.
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", 20314, 970, 0, 0, 0)
// expected from live token_usage.jsonl ai_credits_this_response
// want: 6.5335

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title test: verify AI credits arithmetic against live Copilot usage data Add live-data regression checks for AI Credits math in Go and JavaScript Jun 11, 2026
Copilot AI requested a review from pelikhan June 11, 2026 15:42
@pelikhan pelikhan marked this pull request as ready for review June 11, 2026 15:44
Copilot AI review requested due to automatic review settings June 11, 2026 15:44
@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Design Decision Gate 🏗️ completed the design decision gate check.

No ADR enforcement needed: PR #38638 does not have the 'implementation' label and has only 24 new lines of code in business logic directories (threshold is 100). No custom config present.

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel completed test quality analysis.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds regression tests that validate AI Credits (AIC) inference-cost arithmetic and Copilot provider/model normalization behavior using a fixed set of token-usage samples sourced from a specific GitHub Actions run artifact.

Changes:

  • Added a Go unit test (TestComputeModelInferenceAICLiveCopilotData) that asserts AIC results for the copilot provider and dated model gpt-5.4-2026-03-05 match expected live-derived values.
  • Added a JavaScript unit test that asserts the same live-derived samples match computeInferenceAIC behavior, including dated-model resolution and provider aliasing.
Show a summary per file
File Description
pkg/cli/model_costs_test.go Adds a table-driven Go regression test verifying AIC math for Copilot usage samples and dated model resolution.
actions/setup/js/model_costs.test.cjs Adds a Vitest regression test verifying the JS AIC computation matches the same Copilot usage samples and alias/model resolution behavior.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 0

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PR Code Quality Reviewer completed the code quality review.

@pelikhan

Copy link
Copy Markdown
Collaborator

@copilot Add tests that involve cache read

@github-actions

Copy link
Copy Markdown
Contributor

🧪 Test Quality Sentinel Report

⚠️ Test Quality Score: 60/100 — Acceptable, with suggestions

Analyzed 2 test(s): 2 design, 0 implementation, 0 critical guideline violations. Both tests validate billing arithmetic against live Copilot API data.

📊 Metrics & Test Classification (2 tests analyzed)
Metric Value
New/modified tests analyzed 2
✅ Design tests (behavioral contracts) 2 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 0 (0%)
Duplicate test clusters 0
Test inflation detected Yes (test-only PR: +63 test lines, 0 production lines added)
🚨 Coding-guideline violations 1 (missing assertion message in Go)

Test Classification Details

Test File Classification Issues Detected
TestComputeModelInferenceAICLiveCopilotData (4 table rows) pkg/cli/model_costs_test.go:65 ✅ Design No error/edge cases; missing assertion message on assert.InDelta
it("matches live Copilot billing arithmetic for gpt-5.4 dated model") actions/setup/js/model_costs.test.cjs:135 ✅ Design No error/edge cases

Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 1 test function (table-driven, 4 rows) — unit (//go:build !integration)
  • 🟨 JavaScript (*.test.cjs): 1 test — vitest
⚠️ Flagged Tests — Requires Review (2 issue(s))

⚠️ TestComputeModelInferenceAICLiveCopilotData (pkg/cli/model_costs_test.go:84)

Classification: Design test (behavioral contract)
Issue 1 — Missing assertion message (guideline violation): assert.InDelta(t, tt.wantAIC, got, 1e-9) has no descriptive message argument. When this fails, the output is cryptic.
Suggested improvement: Add a message for context:

assert.InDelta(t, tt.wantAIC, got, 1e-9, "AIC for %s should match live billing data", tt.name)

Issue 2 — Happy-path only: All 4 table rows use valid positive token counts. No error paths (e.g., unknown model, zero tokens) are covered.
Suggested improvement: Add an expectError column or a separate test for invalid inputs such as an unrecognised model name.


⚠️ it("matches live Copilot billing arithmetic for gpt-5.4 dated model") (actions/setup/js/model_costs.test.cjs:135)

Classification: Design test (behavioral contract)
Issue — Happy-path only: All 4 samples use well-formed inputs. No edge cases (e.g., unknown provider/model alias, cacheReadTokens > 0) are tested.
Suggested improvement: Add one sample with cacheReadTokens > 0 to verify the cache-read pricing path, and one with an unknown model to confirm graceful fallback behaviour.

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). No critical guideline violations detected (no mock libraries, build tags present). The missing assertion message and lack of error-case coverage are noted but do not block merge — address in a follow-up for improved debuggability.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §27359120423

🧪 Test quality analysis by Test Quality Sentinel · 152.5 AIC · ⌖ 20.4 AIC ·

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 60/100. Test quality is acceptable — 0% of new tests are implementation tests (threshold: 30%). Both new tests are behavioral design tests that validate AI Credits math against live Copilot API data. Minor suggestions: add assertion messages to the Go assert.InDelta call and consider adding edge-case coverage (cache-read tokens, unknown model) in a follow-up.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skills-Based Review 🧠

Applied /tdd — no blocking issues, but 6 observations worth addressing before this anchors future arithmetic regression suites.

📋 Key Themes & Highlights

Key Themes

  • Test-fixture inconsistency (Go vs JS): The JS test pins exact pricing in a fixture; the Go test reads from the embedded data/models.json. If gpt-5.4 prices are ever updated, the Go wantAIC values go stale silently.
  • Debuggability: Generic test names (request_1request_4) and a for loop in JS make failure triage harder than it needs to be — a one-line naming change and switching to it.each() would give precise failing-sample context.
  • Precision asymmetry: toBeCloseTo(9) in JS is 2× stricter than InDelta(1e-9) in Go, which subtly undermines the cross-runtime parity guarantee.
  • Coverage gap: cache_read pricing is present in both fixtures but no sample exercises a non-zero cacheReadTokens — the cache billing path is unvalidated.

Positive Highlights

  • ✅ Excellent idea to anchor regression tests against live run artifacts with exact URLs — future readers know exactly where the expected values came from.
  • ✅ Four samples with varying input/output ratios provide good arithmetic diversity.
  • ✅ Both runtimes covering the same samples is a strong parity signal.
  • ✅ Dated model resolution (gpt-5.4-2026-03-05 → gpt-5.4) and provider alias (copilot → github-copilot) are both exercised, which was the stated goal.

🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · 308 AIC · ⌖ 34.6 AIC


for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Go test relies on the embedded production data/models.json, unlike the JS counterpart which injects an explicit fixture — if gpt-5.4 pricing is updated in the catalog, these wantAIC values become stale and the test fails with a misleading arithmetic delta rather than a clear "pricing changed" signal.

💡 Suggestion: use a fixture like the JS test

The JS test uses writeModelsFixture({ "github-copilot": { models: { "gpt-5.4": { cost: { ... } } } } }) to pin the exact prices it is testing. Doing the same in Go (via os.Setenv("GH_AW_MODELS_JSON_PATH", ...)) would make this test a pure arithmetic regression rather than a combined pricing-plus-arithmetic check.

This also means any future price update to gpt-5.4 in data/models.json will require updating wantAIC here — a non-obvious maintenance burden if the relationship is not documented.

outputTokens int
wantAIC float64
}{
{name: "request_1", inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Test case names request_1request_4 do not read as specifications — a failing test run just says TestComputeModelInferenceAICLiveCopilotData/request_3 with no hint of what makes that case distinctive.

💡 Suggested naming

Encode the distinguishing characteristic in the name. For example:

{name: "gpt54_20k_input_1k_output",  inputTokens: 20314, outputTokens: 970,  wantAIC: 6.5335},
{name: "gpt54_24k_input_15k_output", inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415},
{name: "gpt54_54k_input_4k_output",  inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025},
{name: "gpt54_66k_input_low_output",  inputTokens: 65750, outputTokens: 224,  wantAIC: 16.7735},

This lets a failing test message immediately surface the input range, making it faster to correlate back to the live artifact.

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0)
assert.InDelta(t, tt.wantAIC, got, 1e-9)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] assert.InDelta is called without a failure message — every other assertion in this file includes a diagnostic string. Without it, a failure just shows the numeric delta, leaving the reader to infer which field or sample was wrong.

💡 Add a failure message
assert.InDelta(t, tt.wantAIC, got, 1e-9,
  "AIC mismatch for %s (input=%d, output=%d)", tt.name, tt.inputTokens, tt.outputTokens)

{ inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735 },
];

for (const sample of samples) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] A plain for loop over samples means the test fails on the first bad value and does not identify which sample caused it — expect errors will just show "at line 171" with no indication of which token-count pair is implicated.

💡 Use it.each() for per-sample test IDs
const samples = [
  { inputTokens: 20314, outputTokens: 970,  wantAIC: 6.5335 },
  { inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415 },
  { inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025 },
  { inputTokens: 65750, outputTokens: 224,  wantAIC: 16.7735 },
];

it.each(samples)(
  "gpt-5.4 live: $inputTokens in / $outputTokens out → $wantAIC AIC",
  async ({ inputTokens, outputTokens, wantAIC }) => {
    const { computeInferenceAIC } = await import("./model_costs.cjs");
    const got = computeInferenceAIC({ provider: "copilot", model: "gpt-5.4-2026-03-05",
      inputTokens, outputTokens, cacheReadTokens: 0, cacheWriteTokens: 0 });
    expect(got).toBeCloseTo(wantAIC, 9);
  }
);

Each sample becomes a named subtest, surfacing exactly which input data failed.

cacheReadTokens: 0,
cacheWriteTokens: 0,
});
expect(got).toBeCloseTo(sample.wantAIC, 9);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Precision asymmetry with the Go counterpart: toBeCloseTo(value, 9) checks |got − want| < 5e-10, but the Go test uses assert.InDelta(..., 1e-9) which allows up to 1e-9. A JS floating-point computation differing by 6e-10 would fail JS and pass Go — quietly undermining the cross-runtime parity guarantee these tests are meant to provide.

💡 Align tolerances

Choose one tolerance and use it consistently:

  • JS stricter: change Go to assert.InDelta(t, tt.wantAIC, got, 5e-10, ...)
  • Go looser: change JS to toBeCloseTo(sample.wantAIC, 8) (which checks within 5e-9)

Since the values are exact floating-point arithmetic with the given prices, 1e-9 or tighter should be achievable on both sides.

cost: {
input: "0.0000025",
output: "0.000015",
cache_read: "0.00000025",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The fixture includes cache_read: "0.00000025" but every sample passes cacheReadTokens: 0 — the cache-hit cost path is never exercised. This is a coverage gap in a test explicitly designed as an arithmetic regression.

💡 Add a cache-read sample

Derive a wantAIC that includes cache read tokens (e.g., from the live artifact if a cache hit is present, or synthetically with a known token count):

// synthetic: 10000 cache-read tokens at 0.00000025 = 0.0025 USD = 0.25 AIC
{ inputTokens: 10000, outputTokens: 0, cacheReadTokens: 10000, wantAIC: /* input+cache */ }

Without this, a bug that zeroes out cache-read billing would pass all four existing assertions.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking observations on the new live-data tests

Two test-only files changed, adding live-billing regression tests for gpt-5.4-2026-03-05. The math checks out (verified manually), the model-alias resolution is correct, and the JS test is properly fixture-isolated. Two non-blocking observations inline.

📋 Review summary

What was checked

  • Arithmetic correctness of all four wantAIC values against the fixture/embedded pricing ✓
  • Model fuzzy-match path: gpt-5.4-2026-03-05gpt-5.4 via longest-prefix logic ✓
  • Provider alias: "copilot""github-copilot"
  • cacheWriteTokens=0 / cacheReadTokens=0 edge cases ✓

Issues found

  1. (medium) Go test couples wantAIC values to live models.json pricing — no fixture isolation
  2. (low) JS test uses precision=9 vs. precision=6 everywhere else in the file

Neither is blocking.

🔎 Code quality review by PR Code Quality Reviewer · ⌖ 13.7 AIC


for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go test is not fixture-isolated: the wantAIC values are coupled to the live models.json embedded at compile time, not to a pinned fixture.

💡 Why this matters and how to fix it

The JS counterpart writes its own fixture with writeModelsFixture, so its expectations are independent of models.json. This Go test does not — it relies on gpt-5.4 pricing in the embedded data/models.json matching the hardcoded expected values exactly.

When gpt-5.4 pricing is updated in models.json (corrected rates, model deprecation, etc.), this test fails with a numeric mismatch that has no obvious connection to the pricing file. The failure message will show something like expected 6.5335 but got 7.2100, leaving the developer wondering why a live-data test broke after an unrelated JSON change.

Consider using an environment-variable override analogous to GH_AW_MODELS_JSON_PATH (the JS side) or temporarily swapping the embedded bytes, e.g.:

// write a temp models.json with the known gpt-5.4 prices
dir := t.TempDir()
modelsPath := filepath.Join(dir, "models.json")
// ... write fixture ...
t.Setenv("GH_AW_MODELS_JSON_PATH", modelsPath)

If the Go implementation has no equivalent override, the test comment should at least say which prices it depends on and point to data/models.json so the coupling is explicit.

cacheReadTokens: 0,
cacheWriteTokens: 0,
});
expect(got).toBeCloseTo(sample.wantAIC, 9);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent toBeCloseTo precision: this test uses precision=9 (tolerance 5e-10), 1000× tighter than the precision=6 used everywhere else in this file.

💡 Details

The two existing toBeCloseTo calls use precision=6:

  • line 56: expect(aic).toBeCloseTo(0.54825, 6)
  • line 104: expect(aicViaEmbedded).toBeCloseTo(aicViaProvider, 6)

The new test uses precision=9. The arithmetic here is straightforward (a few multiplications + additions on ~20k–65k token counts), so there is no correctness reason to demand sub-nanosecond precision. Using a tighter-than-needed tolerance:

  1. Creates an unexplained inconsistency that will confuse the next person writing a test in this file.
  2. Could cause spurious failures if a future refactor introduces a legitimate intermediate rounding step.

precision=6 (or even 4) is sufficient to verify the billing arithmetic is correct.


for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit prefix-match resolution is not documented and silently breakable: gpt-5.4-2026-03-05 resolves to gpt-5.4 today via longest-HasPrefix, but nothing in the test documents this or guards against it changing.

💡 Details

findModelPricing selects the catalog entry with the longest matching prefix of the normalized model ID. If data/models.json gains a new entry like gpt-5.4-2026 in the github-copilot provider, it becomes the new longest prefix for gpt-5.4-2026-03-05 and silently overrides the pricing. The wantAIC values would then be wrong, but the test would fail for a completely opaque reason.

Minimally, add a comment stating which pricing record is expected (gpt-5.4, input=2.5e-6, output=1.5e-5) and how the wantAIC values were derived. Better yet, this is another reason to add fixture isolation: a test-local models.json with only gpt-5.4 removes the ambiguity entirely.

Copilot AI and others added 2 commits June 11, 2026 16:05
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@copilot Add tests that involve cache read

Added cache-read coverage in both runtimes:

  • Go: TestComputeModelInferenceAICCopilotCacheRead in pkg/cli/model_costs_test.go
  • JS: includes cache read pricing for copilot gpt-5.4 dated model in actions/setup/js/model_costs.test.cjs

Addressed in commits 311e5d8 and c111ec0.

Copilot AI changed the title Add live-data regression checks for AI Credits math in Go and JavaScript Add regression checks for AI Credits math in Go and JavaScript Jun 11, 2026
@pelikhan pelikhan closed this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants