Harness Drill

A from-first-principles curriculum in harness engineering for agentic systems. Build every pattern by hand before adopting frameworks. The repo is the artifact of the journey — modules in order, each with reference code and notes.

Foundations — read these first

Doc	Purpose
`docs/treatise.md`	The conceptual map — eight parts, from "why understanding the bare model call is non-negotiable" through state, modality, planes, frameworks, and the two-axis (modality × cognition) view of the curriculum. Read this to know why the modules are in the order they are.
`docs/curriculum.md`	The 15-module path — what every module teaches, the new lesson at each seam, the three-layer code organization (modules / strings / agents), the eight deeper territories (multi-modal seams, MCP, etc.), and a status table for what's built vs. on deck. Read this to know what's next.

Layout

.
├── level_1_modules/         per-module reference implementations
│   └── module_01_bare_call/ (read this first)
├── level_2_strings/         cross-layer integrations (added when needed)
├── level_3_agents/          purpose-built agents (added when needed)
└── common/                  shared building blocks (added when patterns stabilize)

Per-module isolation makes lessons legible — diff Module 4 against Module 3 and the change is the lesson. Strings prove composition; purpose-built agents prove the muscle holds when the problem is yours.

Setup

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env          # then fill in ANTHROPIC_API_KEY

Workflow

Code is authored on branch claude/plan-task-dxiRu, pushed to GitHub, and synced to the execution environment via git. To run a module:

git fetch origin
git checkout claude/plan-task-dxiRu
git pull
python level_1_modules/module_01_bare_call/bare_call.py "your question"

Modules

#	Module	Concept
1	`module_01_bare_call`	The single model invocation. Stateless. No tools.
1b	`module_01b_bilateral`	Split read from act — separate parser and composer models, two tiers each (Anthropic only). Prototype for the bilateral piece of LIMBIC.
1c	`module_01c_bilateral_x`	Bilateral across three providers (Anthropic, OpenAI, Google). Adapter layer makes provider differences visible — the API-drift tax LIMBIC will amortize.
1d	`module_01d_modality`	Bilateral with image input. Parser sees the image once, composer reads only the text IR. S3-backed asset fetch via `assets.py`.
1e	`module_01e_audio`	Bilateral with audio input. Anthropic excluded from parser slots (capability matrix). First place where modality routing is forced, not optional.
1f	`module_01f_video`	Bilateral with video input. Single-provider parser (Gemini-only); composer is unconstrained. Modality forces parser; bilateral keeps composer free.
1g	`module_01g_audio_out`	Bilateral with audio output. Composer must be OpenAI; capability matrix splits into input/output halves. Audio bytes uploaded to `s3://harness-eng/outputs/`.
1h	`module_01h_modality_matrix`	Closes the modality matrix — any input modality (text/image/audio/video) × either output (text/audio). Dynamic `--all` generates valid configs per cell.
1i	`module_01i_image_out`	Bilateral with image output. Composer is a diffusion-transformer (gpt-image-1, gemini-2.5-flash-image); Anthropic excluded. IR becomes modality-shaped (subject, composition, style, lighting, aspect, negatives, safety). Cost arithmetic inverts — parser is a rounding error against per-image cost.
1j	`module_01j_video_out`	Bilateral with video output, async. Composer is Sora or Veo (submit/poll/fetch); Anthropic excluded. Refusal is a typed terminal state (`completed` / `failed` / `rejected`). Cost moves from cents to dollars per clip.
1k	`module_01k_image_edit`	Asset-conditioned image output. Closes `(image, image)` (true edit via `images.edit` / Gemini in-context edit) plus `(audio, image)` and `(video, image)` (parser translates asset to image brief). Two-path dispatch — edit when you can, translate when you must.
1l	`module_01l_video_edit`	Asset-conditioned video output. Closes `(image, video)` (image conditioning via Sora `input_reference` / Veo `image=`) plus `(audio, video)` and `(video, video)` (parser translates asset to video brief). Reuses 1j's async state machine. Closes the matrix at all 16 cells with no deferrals.

Up next: Module 2 — The Conversation Loop. See docs/curriculum.md for the full 15-module path and the status table.

Subsequent modules are built on demand, in order.

Routing intuitions — when to modulate, when not to

The bilateral split is not free. A parser stage adds latency, input tokens, and coordination cost. Across modules 1–1g, recurring patterns have emerged that any future LIMBIC router must encode — including the patterns where bilateral loses and the right move is no modulation at all.

Where bilateral pays:

Hard-to-parse input — image, audio, video, or ambiguous text where the parser does real interpretive work.
Cheap-but-modality-capable parser → deep-prose composer — for example google-fast → anthropic-deep. Asymmetric advantage at near-baseline cost.
Pre-distilling Gemini Pro composer work — bilateral with a non-Gemini parser routinely cuts cost vs. Gemini Pro baseline, because the IR pre-distills work Pro would otherwise burn thinking tokens on (observed in 1c, 1d, 1f).

Where bilateral is overhead — LIMBIC must decline to modulate:

Trivial prompts (e.g., "what is 2+2") — baseline single-call always wins; the bilateral overhead exceeds any quality lift.
gpt-4o-mini baseline on simple factual extracts — so cheap that almost no bilateral can compete on cost (e.g., 1d's $0.000049 baseline vs. $0.003 for the cheapest bilateral on the same prompt).
Parser slot = composer slot — pure overhead, no asymmetric advantage. Just route as baseline.

The discipline: LIMBIC's first decision is whether to modulate at all. "Decline to modulate, dispatch single-call baseline" must be a first-class routing outcome — not a fallback after a failed bilateral.

The perception / intellect / expression triad

The bilateral seam maps cleanly to a 3-faculty cognitive model that's sharper for routing than the 5-faculty taxonomy in docs/benchmark-lens.md (which is sharper for grading, but conflates routing concerns):

Faculty	Pipeline location	Tier choice driven by
Perception	parser stage	how hard the input is to understand (modality, noise, ambiguity, length)
Intellect	wherever reasoning happens (often composer)	depth of reasoning required (factual recall vs. novel inference)
Expression	composer stage	output modality and prose quality required

Each faculty has its own (modality × tier) Goldilocks zone:

Perception: fast tier when input is clean (terse text, simple chart); deep tier when input is noisy/ambiguous (long context, hard-to-read audio, multi-scene video).
Intellect: fast tier on retrieval-shaped questions; deep tier on novel reasoning, multi-step inference, or any task where a thinking model's hidden tokens earn their cost.
Expression: fast tier on terse factual replies; deep tier on long-form prose, persuasive arguments, or spoken delivery (audio output where pacing and word choice matter).

LIMBIC's per-call routing decomposes into three orthogonal Goldilocks lookups against this triad. Deeper Territory 6 (evaluation frameworks) is what turns the lookups from heuristic to data-driven.

Coverage map (modality × routing space)

Discrete dispatchable configurations across the bilateral × tier × modality space:

                       OUTPUT
              text   audio   image   video
            ┌────────────────────────────────┐
text        │  ✅    ✅      ✅      ✅      │
INPUT image │  ✅    ✅      ✅      ✅      │
      audio │  ✅    ✅      ✅      ✅      │
      video │  ✅    ✅      ✅      ✅      │
            └────────────────────────────────┘

16 modality cells (4 input × 4 output) × full bilateral expansion
                                        (capability-filtered)

Modules 1 → 1h closed the text and audio output halves (8 cells).
Module 1i closes (text → image).
Module 1j closes (text → video) and proves the async-job primitive.
Module 1k closes (image|audio|video → image) — edit when you can,
  translate when you must.
Module 1l closes (image|audio|video → video) — image conditioning on
  Sora/Veo, parser-translate for non-image asset inputs.

All 16 cells are now covered with no deferrals.

After 1l the modality plane is solved. The remaining LIMBIC work is on axes orthogonal to modality: evaluation frameworks (Deeper Territory 6), telemetry/observability (Module 11), rule-based router (LIMBIC L2.1 forward-design), LIMBIC v0 (L3.1).

See docs/limbic-design.md for the full LIMBIC sketch and docs/limbic-image-video-generative.md for the design grounding behind 1i and 1j (why image and video are not the same machine as text on the other side of the wire).

Design notes

Doc	Purpose
`docs/treatise.md`	Conceptual map of the curriculum — eight parts. Read first to understand why the modules are ordered this way.
`docs/curriculum.md`	The 15-module path, three-layer code organization, deeper territories, and the build status table.
`docs/limbic-design.md`	Future-project sketch — multi-axis dynamic router (direction × faculty × modality × cost). Module 1b is the prototype of its bilateral axis.
`docs/measurement-seam.md`	Philosophical lens — frozen-weight LLM inference as a quantum-mechanical measurement act. The structural fact that makes the harness/researcher/inference role split inevitable.
`docs/limbic-image-video-generative.md`	Design grounding for 1i and 1j — why generative image/video output is a different operator family, why control flow goes async, and why cost units shift from micro-cents to dollars per clip.
`docs/seam-parameters.md`	Architectural lock — seam parameters (temperature, seed, voice, CFG, reasoning_effort, …) are dispatcher growth, not new curriculum modules.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Harness Drill

Foundations — read these first

Layout

Setup

Workflow

Modules

Routing intuitions — when to modulate, when not to

The perception / intellect / expression triad

Coverage map (modality × routing space)

Design notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
common		common
docs		docs
level_1_modules		level_1_modules
level_2_strings		level_2_strings
level_3_agents		level_3_agents
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Harness Drill

Foundations — read these first

Layout

Setup

Workflow

Modules

Routing intuitions — when to modulate, when not to

The perception / intellect / expression triad

Coverage map (modality × routing space)

Design notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages