Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 67 additions & 0 deletions actions/setup/js/model_costs.test.cjs
Original file line number Diff line number Diff line change
Expand Up @@ -132,4 +132,71 @@ describe("model_costs.cjs", () => {
});
expect(aic).toBeGreaterThan(0);
});

it("matches live Copilot billing arithmetic for gpt-5.4 dated model", async () => {
// Live sample from:
// https://github.com/github/gh-aw/actions/runs/27355745225/job/80833046748
// usage artifact: agent/token_usage.jsonl
writeModelsFixture({
"github-copilot": {
models: {
"gpt-5.4": {
cost: {
input: "0.0000025",
output: "0.000015",
cache_read: "0.00000025",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] The fixture includes cache_read: "0.00000025" but every sample passes cacheReadTokens: 0 — the cache-hit cost path is never exercised. This is a coverage gap in a test explicitly designed as an arithmetic regression.

💡 Add a cache-read sample

Derive a wantAIC that includes cache read tokens (e.g., from the live artifact if a cache hit is present, or synthetically with a known token count):

// synthetic: 10000 cache-read tokens at 0.00000025 = 0.0025 USD = 0.25 AIC
{ inputTokens: 10000, outputTokens: 0, cacheReadTokens: 10000, wantAIC: /* input+cache */ }

Without this, a bug that zeroes out cache-read billing would pass all four existing assertions.

},
},
},
},
});

const { computeInferenceAIC } = await import("./model_costs.cjs");
const samples = [
{ inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335 },
{ inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415 },
{ inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025 },
{ inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735 },
];

for (const sample of samples) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] A plain for loop over samples means the test fails on the first bad value and does not identify which sample caused it — expect errors will just show "at line 171" with no indication of which token-count pair is implicated.

💡 Use it.each() for per-sample test IDs
const samples = [
  { inputTokens: 20314, outputTokens: 970,  wantAIC: 6.5335 },
  { inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415 },
  { inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025 },
  { inputTokens: 65750, outputTokens: 224,  wantAIC: 16.7735 },
];

it.each(samples)(
  "gpt-5.4 live: $inputTokens in / $outputTokens out → $wantAIC AIC",
  async ({ inputTokens, outputTokens, wantAIC }) => {
    const { computeInferenceAIC } = await import("./model_costs.cjs");
    const got = computeInferenceAIC({ provider: "copilot", model: "gpt-5.4-2026-03-05",
      inputTokens, outputTokens, cacheReadTokens: 0, cacheWriteTokens: 0 });
    expect(got).toBeCloseTo(wantAIC, 9);
  }
);

Each sample becomes a named subtest, surfacing exactly which input data failed.

const got = computeInferenceAIC({
provider: "copilot",
model: "gpt-5.4-2026-03-05",
inputTokens: sample.inputTokens,
outputTokens: sample.outputTokens,
cacheReadTokens: 0,
cacheWriteTokens: 0,
});
expect(got).toBeCloseTo(sample.wantAIC, 9);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Precision asymmetry with the Go counterpart: toBeCloseTo(value, 9) checks |got − want| < 5e-10, but the Go test uses assert.InDelta(..., 1e-9) which allows up to 1e-9. A JS floating-point computation differing by 6e-10 would fail JS and pass Go — quietly undermining the cross-runtime parity guarantee these tests are meant to provide.

💡 Align tolerances

Choose one tolerance and use it consistently:

  • JS stricter: change Go to assert.InDelta(t, tt.wantAIC, got, 5e-10, ...)
  • Go looser: change JS to toBeCloseTo(sample.wantAIC, 8) (which checks within 5e-9)

Since the values are exact floating-point arithmetic with the given prices, 1e-9 or tighter should be achievable on both sides.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent toBeCloseTo precision: this test uses precision=9 (tolerance 5e-10), 1000× tighter than the precision=6 used everywhere else in this file.

💡 Details

The two existing toBeCloseTo calls use precision=6:

  • line 56: expect(aic).toBeCloseTo(0.54825, 6)
  • line 104: expect(aicViaEmbedded).toBeCloseTo(aicViaProvider, 6)

The new test uses precision=9. The arithmetic here is straightforward (a few multiplications + additions on ~20k–65k token counts), so there is no correctness reason to demand sub-nanosecond precision. Using a tighter-than-needed tolerance:

  1. Creates an unexplained inconsistency that will confuse the next person writing a test in this file.
  2. Could cause spurious failures if a future refactor introduces a legitimate intermediate rounding step.

precision=6 (or even 4) is sufficient to verify the billing arithmetic is correct.

}
});

it("includes cache read pricing for copilot gpt-5.4 dated model", async () => {
writeModelsFixture({
"github-copilot": {
models: {
"gpt-5.4": {
cost: {
input: "0.0000025",
output: "0.000015",
cache_read: "0.00000025",
},
},
},
},
});

const { computeInferenceAIC } = await import("./model_costs.cjs");
// expected USD: 1000*0.0000025 + 200*0.000015 + 400*0.00000025 = 0.0056 → 0.56 AIC
const got = computeInferenceAIC({
provider: "copilot",
model: "gpt-5.4-2026-03-05",
inputTokens: 1000,
outputTokens: 200,
cacheReadTokens: 400,
cacheWriteTokens: 0,
});
expect(got).toBeCloseTo(0.56, 9);
});
});
32 changes: 32 additions & 0 deletions pkg/cli/model_costs_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -61,3 +61,35 @@ func TestComputeModelInferenceAICCopilotAlias(t *testing.T) {
assert.Greater(t, aicViaCopilot, 0.0, "copilot provider alias should produce non-zero AIC")
assert.InDelta(t, aicViaGitHubCopilot, aicViaCopilot, 1e-9, "copilot and github-copilot should yield identical AIC")
}

func TestComputeModelInferenceAICLiveCopilotData(t *testing.T) {
// Live sample from:
// https://github.com/github/gh-aw/actions/runs/27355745225/job/80833046748
// usage artifact: agent/token_usage.jsonl
tests := []struct {
name string
inputTokens int
outputTokens int
wantAIC float64
}{
{name: "request_1", inputTokens: 20314, outputTokens: 970, wantAIC: 6.5335},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Test case names request_1request_4 do not read as specifications — a failing test run just says TestComputeModelInferenceAICLiveCopilotData/request_3 with no hint of what makes that case distinctive.

💡 Suggested naming

Encode the distinguishing characteristic in the name. For example:

{name: "gpt54_20k_input_1k_output",  inputTokens: 20314, outputTokens: 970,  wantAIC: 6.5335},
{name: "gpt54_24k_input_15k_output", inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415},
{name: "gpt54_54k_input_4k_output",  inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025},
{name: "gpt54_66k_input_low_output",  inputTokens: 65750, outputTokens: 224,  wantAIC: 16.7735},

This lets a failing test message immediately surface the input range, making it faster to correlate back to the live artifact.

{name: "request_2", inputTokens: 24162, outputTokens: 1534, wantAIC: 8.3415},
{name: "request_3", inputTokens: 54255, outputTokens: 4411, wantAIC: 20.18025},
{name: "request_4", inputTokens: 65750, outputTokens: 224, wantAIC: 16.7735},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", tt.inputTokens, tt.outputTokens, 0, 0, 0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] Go test relies on the embedded production data/models.json, unlike the JS counterpart which injects an explicit fixture — if gpt-5.4 pricing is updated in the catalog, these wantAIC values become stale and the test fails with a misleading arithmetic delta rather than a clear "pricing changed" signal.

💡 Suggestion: use a fixture like the JS test

The JS test uses writeModelsFixture({ "github-copilot": { models: { "gpt-5.4": { cost: { ... } } } } }) to pin the exact prices it is testing. Doing the same in Go (via os.Setenv("GH_AW_MODELS_JSON_PATH", ...)) would make this test a pure arithmetic regression rather than a combined pricing-plus-arithmetic check.

This also means any future price update to gpt-5.4 in data/models.json will require updating wantAIC here — a non-obvious maintenance burden if the relationship is not documented.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go test is not fixture-isolated: the wantAIC values are coupled to the live models.json embedded at compile time, not to a pinned fixture.

💡 Why this matters and how to fix it

The JS counterpart writes its own fixture with writeModelsFixture, so its expectations are independent of models.json. This Go test does not — it relies on gpt-5.4 pricing in the embedded data/models.json matching the hardcoded expected values exactly.

When gpt-5.4 pricing is updated in models.json (corrected rates, model deprecation, etc.), this test fails with a numeric mismatch that has no obvious connection to the pricing file. The failure message will show something like expected 6.5335 but got 7.2100, leaving the developer wondering why a live-data test broke after an unrelated JSON change.

Consider using an environment-variable override analogous to GH_AW_MODELS_JSON_PATH (the JS side) or temporarily swapping the embedded bytes, e.g.:

// write a temp models.json with the known gpt-5.4 prices
dir := t.TempDir()
modelsPath := filepath.Join(dir, "models.json")
// ... write fixture ...
t.Setenv("GH_AW_MODELS_JSON_PATH", modelsPath)

If the Go implementation has no equivalent override, the test comment should at least say which prices it depends on and point to data/models.json so the coupling is explicit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit prefix-match resolution is not documented and silently breakable: gpt-5.4-2026-03-05 resolves to gpt-5.4 today via longest-HasPrefix, but nothing in the test documents this or guards against it changing.

💡 Details

findModelPricing selects the catalog entry with the longest matching prefix of the normalized model ID. If data/models.json gains a new entry like gpt-5.4-2026 in the github-copilot provider, it becomes the new longest prefix for gpt-5.4-2026-03-05 and silently overrides the pricing. The wantAIC values would then be wrong, but the test would fail for a completely opaque reason.

Minimally, add a comment stating which pricing record is expected (gpt-5.4, input=2.5e-6, output=1.5e-5) and how the wantAIC values were derived. Better yet, this is another reason to add fixture isolation: a test-local models.json with only gpt-5.4 removes the ambiguity entirely.

assert.InDelta(t, tt.wantAIC, got, 1e-9)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[/tdd] assert.InDelta is called without a failure message — every other assertion in this file includes a diagnostic string. Without it, a failure just shows the numeric delta, leaving the reader to infer which field or sample was wrong.

💡 Add a failure message
assert.InDelta(t, tt.wantAIC, got, 1e-9,
  "AIC mismatch for %s (input=%d, output=%d)", tt.name, tt.inputTokens, tt.outputTokens)

})
}
}

func TestComputeModelInferenceAICCopilotCacheRead(t *testing.T) {
// gpt-5.4 pricing in data/models.json:
// input=0.0000025, output=0.000015, cache_read=0.00000025 (USD/token)
// expected USD: 1000*0.0000025 + 200*0.000015 + 400*0.00000025 = 0.0056 → 0.56 AIC
got := computeModelInferenceAIC("copilot", "gpt-5.4-2026-03-05", 1000, 200, 400, 0, 0)
assert.InDelta(t, 0.56, got, 1e-9)
}