Evolving Agents in the Dark

Retrospective Harness Optimization (RHO) via Self-Preference

TL;DR: AI agents rely on a harness of skills, tools, and workflows to solve complex tasks. RHO improves that harness without any ground-truth labels or validation set; it learns purely from the agent's own past trajectories. A single retrospective pass lifts the SWE-Bench Pro pass rate from 59% to 78%.

Read the story behind RHO and dynamic workflows in the blog post (Chinese version).

⚡ Run it on your own projects

Agent	One line to try	Notes
Claude Code	Paste as a prompt: `Run the workflow at https://raw.githubusercontent.com/wbopan/retro-harness/main/.claude/workflows/retrospection.js on this project`	Dynamic workflow, plug-and-play on the project you're in. Recommended.
Codex CLI	`curl -fsSLO https://raw.githubusercontent.com/wbopan/retro-harness/main/codex/retrospection.py && python3 retrospection.py`	Stdlib-only orchestrator over `codex exec` — the same cycle on your `AGENTS.md` + skills.
This repo	`git clone https://github.com/wbopan/retro-harness && cd retro-harness && uv sync && uv run rho evolve --dataset locomo:data/locomo10.json --rounds 1`	Used to reproduce our results, for research purposes.

The Claude Code and Codex CLI one-liners both mine the sessions you have already accumulated in that project, diagnose recurring failures, and evolve the agent's persistent harness (CLAUDE.md / auto-memory / scripts, or AGENTS.md / skills). An update is applied only when the agent's own pairwise self-preference favors it. Details: Retrospection on Claude Code · codex/retrospection.py.

What is RHO?

Most harness-optimization methods (prompt optimization, skill/tool synthesis, agent search) iterate against a labeled validation set. In real deployments such labels are expensive or impossible to collect — but a deployed agent continuously produces a rich stream of unlabeled trajectories.

RHO turns those trajectories into harness improvements with no external grading, in three stages:

Coreset Selection — pick a small, difficulty-diverse subset of past tasks with a determinantal point process (DPP).
Group Rollout — re-solve each coreset task G times in parallel, then extract two label-free diagnostic signals: self-validation (within a trajectory) and self-consistency (across parallel trajectories).
Harness Proposal — sample N candidate harness edits and keep the one whose rollouts are most preferred by the agent's own pairwise self-preference.
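
To make the Harness Proposal stage concrete, here is a minimal sketch of label-free selection. It assumes a hypothetical `prefer(new, old)` judge that returns a score in [-1, 1], where a positive value means the new trajectory is preferred. In RHO that judgement comes from the agent itself; the toy judge below is only a stand-in. The names `select_harness`, `toy_prefer`, and `require_positive` are illustrative and are not taken from the repository's API.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence

# Hypothetical judge signature: prefer(new, old) -> score in [-1, 1].
Judge = Callable[[str, str], float]


def select_harness(
    candidates: Dict[str, Sequence[str]],
    baseline: Sequence[str],
    prefer: Judge,
    require_positive: bool = True,
) -> Optional[str]:
    """Return the candidate whose rollouts are most preferred over the baseline."""
    scores = {
        # Mean pairwise preference of each candidate rollout vs. the baseline rollout.
        name: mean([prefer(new, old) for new, old in zip(rollouts, baseline)])
        for name, rollouts in candidates.items()
    }
    best = max(scores, key=scores.get)
    # Optional gate (assumption): reject the winner unless it beats the baseline on average.
    if require_positive and scores[best] <= 0:
        return None
    return best


def toy_prefer(new: str, old: str) -> float:
    """Stand-in judge: prefers the shorter trajectory. A real judge would be an agent call."""
    return 1.0 if len(new) < len(old) else -1.0


baseline = ["step step step fix", "step step fix"]
candidates = {
    "h1": ["step step fix", "step step step step fix"],  # one win, one loss
    "h2": ["step fix", "fix"],                            # two wins
}
winner = select_harness(candidates, baseline, toy_prefer)
print(winner)
```

Under these toy inputs the second candidate wins both comparisons and is selected. No ground-truth label enters the decision, only relative preferences between trajectories.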

Results

Held-out pass rate after a single optimization round (Codex + GPT-5.5), versus feedback-free baselines that operate under the same agent-call budget:

Method	Harness surface	SWE-Bench Pro	Terminal-Bench 2	GAIA-2
Vanilla Codex	—	0.59	0.71	0.29
Dynamic Cheatsheet	Skills	0.62 (+0.03)	0.73 (+0.02)	0.30 (+0.01)
ReasoningBank	Memory	0.61 (+0.02)	0.73 (+0.02)	0.28 (−0.01)
Sleep-time Compute	Memory	0.64 (+0.05)	0.73 (+0.02)	0.32 (+0.03)
RHO (ours)	Skills + Tools	0.78 (+0.19)	0.76 (+0.05)	0.37 (+0.08)

RHO also surpasses Meta-Harness, a validation-feedback optimizer, at a matched single-round budget (0.78 vs 0.62 on SWE-Bench Pro) — without ever touching ground-truth labels.

Install

The project uses uv.

git clone https://github.com/wbopan/retro-harness.git
cd retro-harness
uv sync                       # core dependencies
uv sync --extra swebench-pro  # plus the extra for the dataset you want to run

RHO drives the Codex CLI as its base agent. Point it at a model backend by copying a config from configs/ (e.g. configs/codex.chatgpt-default.toml) and passing it via --codex-config.

Quickstart

# Run one retrospective optimization round on a dataset's trajectory split,
# then grade the winning harness on the held-out split.
uv run rho evolve \
  --dataset locomo:data/locomo10.json \
  --rounds 1 \
  --codex-config configs/codex.chatgpt-default.toml

# Solve a single task with a given harness
uv run rho solve --dataset <ds> --task <id> --harness <dir> --run-dir runs/demo

# Browse runs (prompts, completions, trajectories, harness diffs) in a web UI
uv run rho ui

Every run persists prompts, completions, trajectories, diagnoses, candidate harnesses, harness diffs, configs, scores, and held-out reports under runs/<timestamp>-<dataset>/. See the full command reference in docs/cli-help.md.

Repository layout

src/rho/
├── loop.py            # the RHO evolution loop (select → rollout → propose)
├── protocols.py       # typing.Protocol interfaces (Dataset, Harness, Task, TrajectoryStore, …)
├── selection/         # coreset selection (DPP, coverage, difficulty)
├── strategies/        # harness-proposal strategies + feedback-free baselines
├── orchestrators/     # solve / group-rollout orchestration
├── datasets/          # SWE-Bench Pro, Terminal-Bench 2, GAIA-2, LOCOMO loaders
├── reasoningbank/     # ReasoningBank baseline
├── meta_harness/      # Meta-Harness (validation-feedback) baseline
└── stores/            # trajectory + harness stores
configs/               # Codex CLI backend configs
scripts/               # figure-building & analysis scripts
webui/                 # run-browser frontend
tests/                 # hermetic + real-agent end-to-end tests
.claude/workflows/
└── retrospection.js   # RHO as a Claude Code dynamic workflow (see below)
codex/
└── retrospection.py   # RHO over `codex exec` for Codex CLI users (stdlib-only)

Implementations are decoupled behind typing.Protocol so components (selectors, strategies, datasets, agents) can be swapped for ablations.
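
As an illustration of that decoupling, the sketch below shows how `typing.Protocol` lets two unrelated classes satisfy the same interface without inheriting from it. The `Selector` protocol and both implementations are hypothetical; the actual interfaces live in src/rho/protocols.py and may look different.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class Selector(Protocol):
    """Hypothetical interface: pick k task indices given difficulty scores."""

    def select(self, difficulties: Sequence[float], k: int) -> List[int]: ...


class HardestFirst:
    """Picks the k most difficult tasks."""

    def select(self, difficulties: Sequence[float], k: int) -> List[int]:
        order = sorted(range(len(difficulties)), key=lambda i: -difficulties[i])
        return order[:k]


class FirstK:
    """Ablation baseline: picks the first k tasks regardless of difficulty."""

    def select(self, difficulties: Sequence[float], k: int) -> List[int]:
        return list(range(min(k, len(difficulties))))


def run_round(selector: Selector, difficulties: Sequence[float], k: int) -> List[int]:
    # The loop depends only on the protocol, so selectors can be swapped freely.
    return selector.select(difficulties, k)


scores = [0.2, 0.9, 0.5]
print(run_round(HardestFirst(), scores, 2))
print(run_round(FirstK(), scores, 2))
```

Neither class subclasses `Selector`; structural typing is enough, which is what makes swapping a component for an ablation a one-line change.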

Retrospection: try RHO on your own Claude Code projects

.claude/workflows/retrospection.js packages the paper's method as a single Claude Code dynamic workflow that evolves the harness Claude Code natively exposes — your project's CLAUDE.md, its auto-memory directory, and helper scripts — using only the session transcripts you have already accumulated. No labels, no validation set, no benchmark: the trajectories are your own past sessions.

One run is one retrospection cycle (≈40 agents, well under the 1,000-agent cap):

Bootstrap — locate the project's transcripts (~/.claude/projects/<slug>/*.jsonl, including worktree sessions) and snapshot the current harness h₀.
Digest — parallel agents summarize past sessions into difficulty scores + task fingerprints (the paper's LLM judge).
Coreset — plain-JS greedy MAP on the paper's DPP kernel L = diag(r)·S·diag(r) (Jaccard fingerprint kernel, same θ trade-off). Similar sessions are grouped so the diagnoser can recover self-consistency across them; singletons fall back to validation-only diagnosis.
Diagnose — self-validation + cross-session self-consistency, producing severity-weighted, task-agnostic improvement directions.
Optimize — N independent candidate harnesses, staged outside the working tree.
Probe & select — replayable past tasks are re-attempted under each candidate in isolated worktrees; pairwise self-preference scores the fresh trajectory against the original session. The winner is applied only if its mean score is positive, with a full backup first.
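
The Coreset step above can be sketched in plain Python. This is a naive greedy MAP over the kernel L = diag(r)·S·diag(r) with a Jaccard similarity S, written for readability rather than speed. It is not the workflow's actual code: the function names are hypothetical, and the θ weighting is left out because this README does not spell out how θ enters the kernel.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_kernel(r: Sequence[float], fingerprints: Sequence[Set[str]]) -> List[List[float]]:
    """L[i][j] = r[i] * S[i][j] * r[j], i.e. diag(r) . S . diag(r)."""
    n = len(r)
    return [[r[i] * jaccard(fingerprints[i], fingerprints[j]) * r[j] for j in range(n)]
            for i in range(n)]


def det(m: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda row: abs(a[row][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for row in range(c + 1, n):
            f = a[row][c] / a[c][c]
            for j in range(c, n):
                a[row][j] -= f * a[c][j]
    return d


def greedy_map(L: List[List[float]], k: int) -> List[int]:
    """Greedily add the item that maximizes det(L restricted to the selected set)."""
    selected: List[int] = []
    while len(selected) < k:
        best, best_det = None, 0.0
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[p][q] for q in idx] for p in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining item is redundant with the selection
            break
        selected.append(best)
    return selected


difficulty = [0.9, 0.8, 0.7]
fingerprints = [{"git", "rebase", "conflict"}, {"git", "rebase"}, {"docker", "build"}]
print(greedy_map(build_kernel(difficulty, fingerprints), k=2))  # [0, 2]
```

In this toy input the second session is harder than the third but overlaps heavily with the first, so the determinant favors the easier, more diverse session. That is the difficulty/diversity trade-off the DPP kernel encodes.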

Usage — copy the file into a project's .claude/workflows/ (or ~/.claude/workflows/ for all projects), then in Claude Code:

/retrospection

or target another project / override knobs via args:

{ projectDir: "/path/to/project",  // default: current project
  model: "opus",                   // default: session model
  k: 8,                            // coreset size (paper: 10)
  n: 2,                            // candidate harnesses (paper: 3)
  probes: 4,                       // self-preference probe tasks
  maxSessions: 36, theta: 0.7,     // DPP difficulty/diversity trade-off
  apply: true }                    // false = stage the winner, don't touch live files

Every cycle persists its artifacts (digests, diagnoses, candidates, probe trajectories, scores, report.md, and a backup/ of the pre-apply harness) under ~/.claude/rho-runs/<timestamp>-<project>/. Re-running the command later is the next evolution round — the harness keeps learning from whatever real sessions you accumulate in between.

Codex CLI variant

codex/retrospection.py runs the same cycle for Codex CLI users — a single stdlib-only Python file orchestrating parallel codex exec subprocesses. The mapping differs only in what the native harness is:

Trajectories come from Codex's rollout store (~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl), filtered to the target project via each rollout's session_meta.cwd.
Harness = AGENTS.md (kept lean — Codex caps combined project docs at 32 KB) + .agents/skills/*/SKILL.md (Codex's persistent knowledge units, the analog of auto-memory) + helper scripts.
Structured stages (digest / diagnose / score) use codex exec --output-schema; probes run in git worktrees with the candidate harness materialized inside, so Codex loads it natively.
All orchestration calls run --ephemeral (they never enter the session store, so a later cycle can't mine its own machinery) with the experimental memories feature disabled.
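
The project filter in the first point can be sketched as follows. This is not the code in codex/retrospection.py. It assumes each rollout file contains a JSON line of the form `{"type": "session_meta", "payload": {"cwd": ...}}`; that record shape is an assumption about Codex's rollout format, and the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Optional


def rollout_cwd(path: Path) -> Optional[str]:
    """Return the working directory recorded in a rollout, if any.

    Assumes a JSONL record shaped like
    {"type": "session_meta", "payload": {"cwd": "/path"}} (not verified here).
    """
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines instead of aborting the scan
            if isinstance(rec, dict) and rec.get("type") == "session_meta":
                payload = rec.get("payload")
                return payload.get("cwd") if isinstance(payload, dict) else None
    return None


def sessions_for_project(sessions_root: Path, project: Path) -> List[Path]:
    """List rollout files under YYYY/MM/DD/ whose recorded cwd is the project."""
    target = project.resolve()
    hits = []
    for path in sorted(sessions_root.glob("*/*/*/rollout-*.jsonl")):
        cwd = rollout_cwd(path)
        if cwd and Path(cwd).resolve() == target:
            hits.append(path)
    return hits
```

A call such as `sessions_for_project(Path.home() / ".codex" / "sessions", Path.cwd())` would then list the rollouts recorded for the current project, roughly what `--dry-run` is described as reporting.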

python3 codex/retrospection.py --dry-run            # list the sessions it would mine
python3 codex/retrospection.py                      # one cycle on the current project
python3 codex/retrospection.py --project ~/my/app \
    --model gpt-5.5 --n 2 --probes 4 --no-apply     # stage the winner without touching live files

Citation

@article{pan2026rho,
  title   = {Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference},
  author  = {Pan, Wenbo and Liu, Shujie and Lin, Chin-Yew and Zeng, Jingying and Tang, Xianfeng and Zhou, Xiangyang and Lu, Yan and Jia, Xiaohua},
  journal = {arXiv preprint arXiv:2606.05922},
  year    = {2026}
}

License

Released under the MIT License.
