How to author, organize, and maintain runners that exercise a built
artifact against an external spec corpus (tc39/test262, WPT, future
spec suites). This is the fleet's canonical layout; see the
running-test262 skill for invocation specifics.

packages/<pkg>/
  test/
    fixtures/<corpus>/                      # 1. Sparse-checkout submodule
    scripts/<corpus>-<scope>-runner.mts     # 2. Thin CLI entry
    scripts/<corpus>/                       # Modular guts:
      types.mts                             # Result / Test / Summary
      parser.mts                            # Frontmatter parser
      classifier.mts                        # Pure: result + allowlist → bucket
      harness.mts                           # Compose harness + walk corpus
      executor.mts                          # Spawn + collect + retry
      report.mts                            # Format summary
    integration/<corpus>-<scope>.test.mts   # 3. Vitest wrapper (gate)
    unit/<corpus>-<scope>.test.mts          # 4. Vitest tests of pure modules
  <corpus>-config/<corpus>.allowlist        # 5. Out-of-band allowlist file

package.json scripts:
  "<corpus>:<scope>": "node test/scripts/<corpus>-<scope>-runner.mts"

External corpora live at test/fixtures/<corpus>/, NOT upstream/.
Build-time submodules use upstream/; test-time corpora use
test/fixtures/. The distinction signals whether bumping the
submodule affects shipped artifacts. See
../fleet/untracked-by-default.md for adjacent rules on vendored trees.
Conformance corpora are large but our runners exercise narrow
subtrees. Add a sparse-checkout = <patterns> field to .gitmodules
and use scripts/git-partial-submodule.mts clone <path> for fresh
checkouts. Vanilla git submodule update ignores the field; the
fleet utility reads it.
Examples:
# .gitmodules
[submodule "packages/node-smol-builder/test/fixtures/wpt/streams"]
path = packages/node-smol-builder/test/fixtures/wpt/streams
url = https://github.com/web-platform-tests/wpt.git
sparse-checkout = streams/
[submodule "packages/temporal-infra/test/fixtures/test262"]
path = packages/temporal-infra/test/fixtures/test262
url = https://github.com/tc39/test262.git
sparse-checkout = test/built-ins/Temporal/ test/intl402/Temporal/ harness/

Requires git ≥ 2.27 (for --filter + --sparse on git clone).
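
To illustrate how a tool might read the custom field, here is a minimal sketch that extracts sparse-checkout patterns from .gitmodules text. It is not the code of scripts/git-partial-submodule.mts: the function name parseGitmodules, the SubmoduleEntry shape, and the assumption that patterns are whitespace-separated (as the examples above suggest) are all hypothetical.

```typescript
// Hypothetical sketch; not the fleet utility's actual implementation.
interface SubmoduleEntry {
  name: string
  path?: string
  url?: string
  // Assumption: patterns are separated by whitespace on a single line.
  sparseCheckout: string[]
}

function parseGitmodules(text: string): SubmoduleEntry[] {
  const entries: SubmoduleEntry[] = []
  let current: SubmoduleEntry | undefined
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim()
    // Skip blank lines and git-config style comments.
    if (!line || line.startsWith('#') || line.startsWith(';')) continue
    const header = /^\[submodule "(.+)"\]$/.exec(line)
    if (header) {
      current = { name: header[1]!, sparseCheckout: [] }
      entries.push(current)
      continue
    }
    const kv = /^([A-Za-z][A-Za-z0-9-]*)\s*=\s*(.*)$/.exec(line)
    if (!kv || !current) continue
    const key = kv[1]!.toLowerCase()
    const value = kv[2]!
    if (key === 'path') current.path = value
    else if (key === 'url') current.url = value
    else if (key === 'sparse-checkout') {
      current.sparseCheckout = value.split(/\s+/).filter(Boolean)
    }
  }
  return entries
}
```

A caller could then hand each entry's sparseCheckout patterns to whatever performs the partial clone; entries without the field come back with an empty list.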
The CLI entry (<corpus>-<scope>-runner.mts) stays under ~60 lines:
it parses argv, resolves the binary, and calls the harness/executor
modules. Everything else lives in the sibling <corpus>/ directory,
broken into ~6 modules. The split lets each piece have a single
reason to change AND lets the pure modules be unit-tested in
isolation.
Canonical module set:
| Module | Responsibility |
|---|---|
| types.mts | Result, Test, Summary, TestCase types |
| parser.mts | Frontmatter / metadata parsing |
| classifier.mts | Pure: (result, allowlist) → "expected" / "unexpected" / "now-passing" |
| harness.mts | Compose harness JS, walk corpus, filter |
| executor.mts | Spawn subprocesses, collect output, retry |
| report.mts | Format human-readable summary, exit-code policy |

The classifier is the highest-value module to extract — get the result-bucketing logic wrong and the runner silently masks regressions. Keep it pure (no I/O, no globals).
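
A pure classifier along these lines might look like the sketch below. The type names (Outcome, TestResult, Bucket) and the exact-match Set lookup are assumptions for illustration; the real classifier.mts may key and match differently.

```typescript
// Hypothetical shapes; the fleet's types.mts may differ.
type Outcome = 'pass' | 'fail'
type Bucket = 'expected' | 'unexpected' | 'now-passing'

interface TestResult {
  // Assumption: a stable key such as a corpus-relative path.
  id: string
  outcome: Outcome
}

// Pure function: no I/O, no globals, same inputs always give the same bucket.
function classify(
  result: TestResult,
  allowlist: ReadonlySet<string>,
): Bucket {
  const allowed = allowlist.has(result.id)
  if (result.outcome === 'fail') {
    // Allowlisted failures are expected; anything else is a regression.
    return allowed ? 'expected' : 'unexpected'
  }
  // A passing test that is still allowlisted signals a stale entry.
  return allowed ? 'now-passing' : 'expected'
}
```

Because the function takes everything it needs as arguments, all four combinations of outcome and allowlist membership can be checked without a binary or a corpus checkout.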
A ~20-line .test.mts under test/integration/ that:
- Resolves the built binary (returns undefined if no build exists).
- Computes skipIf from that.
- Inside describe.skipIf(...), has one it() that spawns the runner subprocess and asserts exit code 0.

// test/integration/<corpus>-<scope>.test.mts
import path from 'node:path'
import { fileURLToPath } from 'node:url'

import { describe, expect, it } from 'vitest'

import { spawn } from '@socketsecurity/lib-stable/spawn/spawn'

import { resolveFinalBinary } from '../helpers/binary.mts'

const __dirname = path.dirname(fileURLToPath(import.meta.url))
const RUNNER = path.resolve(
  __dirname,
  '..',
  'scripts',
  '<corpus>-<scope>-runner.mts',
)

// Skip the whole suite when no built binary is available.
const skipTests = !resolveFinalBinary()
// 45-minute ceiling for the full corpus run.
const TIMEOUT_MS = 45 * 60 * 1000

describe.skipIf(skipTests)('<corpus> <scope> conformance', () => {
  it(
    'no unexpected failures vs allowlist',
    async () => {
      // The runner prints its own report; only the exit code is asserted.
      const result = await spawn('node', [RUNNER], { stdio: 'inherit' })
      expect(result.code).toBe(0)
    },
    TIMEOUT_MS,
  )
})

This is what brings the gate into pnpm test. Without it, the runner
is a manual ritual the dev has to remember.
A .test.mts under test/unit/ covering the classifier exhaustively.
At minimum: every transition (success/failure × allowed/disallowed),
the stale-allowlist case (an allowlisted test that now passes), and
prefix-match edge cases.
These tests do NOT spawn subprocesses, do NOT walk the corpus, and do NOT need the built binary. Pure logic only. They catch the highest-severity bug class (silent regression masking) without needing the expensive infrastructure.
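
As an illustration of the prefix-match edge cases, the sketch below runs a small table of cases against a matcher. The matching rule (an entry ending in "/" is a directory prefix, anything else must match exactly) and the name isAllowlisted are assumptions, not the fleet's documented semantics, and the paths are made up. It uses node:assert so it runs standalone; in the repo the same table would sit inside a Vitest it.each.

```typescript
import { strictEqual } from 'node:assert'

// Assumed rule: entries ending in "/" match as directory prefixes,
// all other entries must equal the test path exactly.
function isAllowlisted(
  testPath: string,
  entries: readonly string[],
): boolean {
  return entries.some(entry =>
    entry.endsWith('/') ? testPath.startsWith(entry) : testPath === entry,
  )
}

// [test path, allowlist entries, expected result]; all paths are hypothetical.
const cases: Array<[string, string[], boolean]> = [
  // A directory prefix covers files beneath it.
  ['suite/Now/a.js', ['suite/Now/'], true],
  // A sibling directory sharing a name prefix must NOT match.
  ['suite/NowX/a.js', ['suite/Now/'], false],
  // Exact entries match only the identical path.
  ['suite/b.js', ['suite/b.js'], true],
  ['suite/b.js.map', ['suite/b.js'], false],
  // An empty allowlist matches nothing.
  ['suite/b.js', [], false],
]

for (const [testPath, entries, expected] of cases) {
  strictEqual(isAllowlisted(testPath, entries), expected, testPath)
}
```

The second case is the one most worth keeping: a matcher that compares raw string prefixes without the trailing separator would silently allowlist a neighbouring directory.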
Either path-keyed or feature-keyed depending on what the runner exercises:
- Path-keyed: <file> (<scenario>), one per line, with comment rationale. Suitable for narrow subset runs (temporal-infra Temporal subset, WPT streams). Allow only failures that can be justified.
- Feature-keyed: TC39 feature name (decorators, import-source). Suitable for broad parser conformance where the set of unimplemented features is well-defined (ultrathink/acorn parsers). Makes it hard to sneak a parser bug past the allowlist.

Never inline a Map literal in the runner source. The diff becomes unreviewable, the allowlist mixes with logic, and PRs that touch the runner accidentally pull in allowlist changes.
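
Loading a path-keyed allowlist from its own file can be sketched as follows. The source only says "<file> (<scenario>), one per line, with comment rationale"; the "#" comment syntax, the optional scenario, and the names parseAllowlist and AllowlistEntry are assumptions made for this example.

```typescript
// Hypothetical entry shape for a path-keyed allowlist.
interface AllowlistEntry {
  file: string
  scenario?: string
}

function parseAllowlist(text: string): AllowlistEntry[] {
  const entries: AllowlistEntry[] = []
  for (const rawLine of text.split(/\r?\n/)) {
    // Assumption: "#" starts a rationale comment, either on its own line
    // or after whitespace at the end of an entry.
    const line = rawLine.replace(/(^|\s)#.*$/, '').trim()
    if (!line) continue
    // "<file> (<scenario>)" with the parenthesised scenario optional.
    const match = /^(.*?)\s*\(([^()]*)\)$/.exec(line)
    if (match) {
      entries.push({ file: match[1]!, scenario: match[2]! })
    } else {
      entries.push({ file: line })
    }
  }
  return entries
}
```

The runner would read the file once at startup (for example with fs.readFileSync) and pass the parsed entries to the classifier, so allowlist changes show up as diffs to a data file rather than to runner logic.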
Use this checklist:
- Submodule at test/fixtures/<corpus>/ with sparse-checkout declared in .gitmodules.
- Runner skeleton at test/scripts/<corpus>-<scope>-runner.mts that imports from test/scripts/<corpus>/{parser,classifier,harness,executor,report}.mts.
- Allowlist file at <corpus>-config/<corpus>.allowlist (path- or feature-keyed).
- Vitest integration wrapper at test/integration/<corpus>-<scope>.test.mts.
- Vitest unit tests at test/unit/<corpus>-<scope>.test.mts covering at minimum the classifier.
- package.json script: "<corpus>:<scope>": "node test/scripts/<corpus>-<scope>-runner.mts".

The runner should always exit non-zero on (a) unexpected failure (test not in allowlist that failed), or (b) stale allowlist (test in allowlist that now passes — a drift signal that needs cleanup, not silent acceptance).
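
The exit-code rule above can be expressed as a small pure function over the classified buckets. This is a sketch under assumed names (Summary, summarize, exitCodeFor); the actual report.mts may structure its summary differently.

```typescript
type Bucket = 'expected' | 'unexpected' | 'now-passing'

// Hypothetical summary shape: one counter per bucket.
interface Summary {
  expected: number
  unexpected: number
  nowPassing: number
}

function summarize(buckets: readonly Bucket[]): Summary {
  const summary: Summary = { expected: 0, unexpected: 0, nowPassing: 0 }
  for (const bucket of buckets) {
    if (bucket === 'unexpected') summary.unexpected += 1
    else if (bucket === 'now-passing') summary.nowPassing += 1
    else summary.expected += 1
  }
  return summary
}

// Non-zero on (a) any unexpected failure or (b) any stale allowlist entry.
function exitCodeFor(summary: Summary): 0 | 1 {
  return summary.unexpected > 0 || summary.nowPassing > 0 ? 1 : 0
}
```

The CLI entry could then set process.exitCode = exitCodeFor(summarize(buckets)) after printing the report, which keeps the policy testable without spawning anything.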
As of 2026-05, the closest-to-canonical implementations in the fleet:
- socket-btm/packages/temporal-infra/test/scripts/test262-temporal-runner.mts: best module split + unit-tested classifier.
- socket-btm/packages/node-smol-builder/test/scripts/wpt-streams-runner.mts: best integration wrapper shape.

When in doubt, mirror temporal-infra's test262/ subdirectory split.
- Inline EXPECTED_FAILURES Map in the runner source. Move it to an external allowlist file.
- Single 500+ line monolith. Split into the canonical 6 modules the first time you touch it.
- Vitest wrapper that runs the corpus inline as test.each(files). Each file is too granular for vitest's reporter and breaks allowlist classification semantics. Spawn the runner as a subprocess and check exit code; the runner's own report is the human-readable output.
- Test-time submodule under upstream/. That path is reserved for build-time submodules. Move conformance corpora to test/fixtures/<corpus>/.
- Full-tree submodule when only a subset is exercised. Use sparse-checkout.

- .claude/skills/running-test262/SKILL.md: how to invoke runners per repo.
- untracked-by-default.md: adjacent rules for vendored / build-copied trees.
- parser-comments.md: lock-step comment conventions for cross-language parser ports (relevant when a single package has multiple language lanes, each with its own runner).