LLM Coordination Harness

llm-coordination-harness is a reproducible measurement rig for hidden coordination variables in multi-agent LLM systems under fixed billed-token budgets.

This repository is intentionally positioned as:

an eval / harness project
a measurement-first research artifact
a negative-results / methods result

This repository is not positioned as:

a generic swarm framework
a production routing layer
a claim that a universal coordination law has already been proved

What It Measures

The harness extracts and logs:

F: critical-fact survival fidelity through the graph
rho: shared-error correlation under no communication
B: propagation balance over edge fact-survival ratios
C: fan-in pressure from incoming peer-token load vs quota

The important property is that these variables are recomputed from logs offline, rather than living only in process memory.
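As a rough illustration of what "recomputed from logs offline" can look like, the sketch below derives all four variables from a small JSONL log. Everything in it is an assumption made for the example: the event schema (`type`, `src`, `dst`, `facts_sent`, `facts_received`, `tokens`, `facts`), the function names, and the specific formulas (min/max ratio for B, max node load over quota for C, mean pairwise Pearson correlation for rho). The harness's actual definitions live in the repository code and may differ.

```python
import json
from itertools import combinations
from math import sqrt

def parse_log(text):
    """Parse a JSONL log (one JSON event per line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def fact_fidelity(events, critical_facts):
    """F (assumed form): share of critical facts present in the final answer."""
    final = [e for e in events if e["type"] == "final"][-1]
    return len(set(final["facts"]) & set(critical_facts)) / len(critical_facts)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def shared_error_correlation(errors_by_agent):
    """rho (assumed form): mean pairwise correlation of 0/1 error indicators
    collected from agents that ran without communicating."""
    pairs = list(combinations(sorted(errors_by_agent), 2))
    total = sum(pearson(errors_by_agent[a], errors_by_agent[b]) for a, b in pairs)
    return total / len(pairs)

def propagation_balance(events):
    """B (assumed form): min/max of per-edge fact-survival ratios (1.0 = even)."""
    ratios = [
        e["facts_received"] / e["facts_sent"]
        for e in events
        if e["type"] == "edge" and e["facts_sent"] > 0
    ]
    return min(ratios) / max(ratios) if ratios and max(ratios) > 0 else 0.0

def fan_in_pressure(events, quota):
    """C (assumed form): largest incoming peer-token load on a node, over quota."""
    load = {}
    for e in events:
        if e["type"] == "edge":
            load[e["dst"]] = load.get(e["dst"], 0) + e["tokens"]
    return max(load.values()) / quota if load else 0.0

# Toy log with a hypothetical schema; these are not values from the gold runs.
LOG = """
{"type": "edge", "src": "a", "dst": "hub", "facts_sent": 4, "facts_received": 4, "tokens": 60}
{"type": "edge", "src": "b", "dst": "hub", "facts_sent": 4, "facts_received": 2, "tokens": 60}
{"type": "final", "facts": ["f1", "f2", "f3"]}
"""

if __name__ == "__main__":
    events = parse_log(LOG)
    print("F  :", fact_fidelity(events, ["f1", "f2", "f3", "f4"]))
    print("rho:", shared_error_correlation({"a": [1, 0, 1, 0], "b": [1, 0, 0, 0]}))
    print("B  :", propagation_balance(events))
    print("C  :", fan_in_pressure(events, quota=96))
```

Because every function takes parsed log events (or per-agent error vectors) as input, the same numbers can be regenerated long after the run has finished, which is the property this section cares about.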

Golden Artifacts

The outputs/ directory is intentionally frozen to the two gold runs:

Frozen configs:

Visuals

Feature importance from the offline predictor analysis:

Topology penalty on MA-FT at budget 96:

Topology delta (Balanced Tree - Star) on MA-FT at budget 96:

Held-out predictor AUROC after excluding budget == 0 from training:

P0b attack score delta vs clean baseline:

P0b infection spread:

P0b attack success rate:

Headline Results

Clean Phase (P0a)

The calibrated clean run is here:

Main outcome:

topology-sensitive coordination failures are real
repaired F and B move with those failures
the current v1 held-out predictor still does not pass the clean gate

This is a valid scientific result: the measurements hold up even though the predictor does not pass its gate.

Stress Phase (P0b)

The attack run is here:

batch_index.json

Main outcome:

attack spread is measurable
star can behave as a zero-quarantine topology
structures that degrade useful coordination may also weakly attenuate malicious propagation

This is mechanistically interesting, but still not enough to claim a general attack-robustness law.

Why This Repo Matters

The core question is:

At fixed orchestration and fixed billed budget, do F, rho, B, and C explain transitions between help, saturation, and collapse better than heuristic predictors that mostly exploit size and token-count shortcuts?
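One way to make "better than heuristic predictors" concrete is to compare held-out AUROC for a single transition label, for example collapse vs. not collapse. The sketch below is a minimal, dependency-free AUROC based on pairwise ranking. The function name and the toy labels and scores are made up for illustration; they are not results from this repository, and the repo's own predictor analysis may be implemented differently.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 0/1, scores are any real numbers.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Toy held-out set: 1 = collapse, 0 = no collapse (illustrative only).
    labels = [1, 1, 0, 0, 1, 0]
    scores_a = [0.9, 0.7, 0.2, 0.4, 0.6, 0.3]  # hypothetical predictor A
    scores_b = [0.5, 0.5, 0.5, 0.2, 0.9, 0.5]  # hypothetical predictor B
    print("A:", auroc(labels, scores_a))
    print("B:", auroc(labels, scores_b))
```

Scoring both predictors on the same held-out rows keeps the comparison about ranking quality rather than about which model saw more data.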

The current answer is nuanced:

the measurement system works
the hidden coordination variables are real and mechanistically meaningful
the predictor still fails the intended clean gate

That combination of positive measurement result and negative claim result is exactly the kind of outcome this repo is meant to preserve.

OpenRouter Discipline

Two modes exist:

research_strict
dev_convenience

Research runs require:

exact model pinning
explicit provider pinning
no openrouter/auto
no provider fallback
route / pricing / snapshot logging
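The research-run requirements above can be checked mechanically before a run starts. The following is a minimal sketch of such a gate, not the harness's actual validator. The config keys (`model`, `provider.order`, `provider.allow_fallbacks`, `log_fields`) are assumptions, loosely modeled on OpenRouter's provider-routing options, and the model and provider names are placeholders.

```python
# Log fields a strict run is assumed to record (hypothetical names).
REQUIRED_LOG_FIELDS = {"route", "pricing", "snapshot"}

def strict_violations(cfg):
    """Return a list of reasons why cfg is not acceptable for research_strict."""
    problems = []

    model = cfg.get("model", "")
    if not model or model == "openrouter/auto":
        problems.append("model must be pinned explicitly (no openrouter/auto)")

    provider = cfg.get("provider", {})
    if len(provider.get("order", [])) != 1:
        problems.append("exactly one provider must be pinned")
    # Treat a missing flag as "fallback allowed" so that omission fails closed.
    if provider.get("allow_fallbacks", True):
        problems.append("provider fallback must be disabled")

    missing = REQUIRED_LOG_FIELDS - set(cfg.get("log_fields", []))
    if missing:
        problems.append("missing log fields: " + ", ".join(sorted(missing)))

    return problems

if __name__ == "__main__":
    strict_cfg = {
        "mode": "research_strict",
        "model": "example-vendor/example-model",  # placeholder slug
        "provider": {"order": ["example-provider"], "allow_fallbacks": False},
        "log_fields": ["route", "pricing", "snapshot"],
    }
    loose_cfg = {"mode": "dev_convenience", "model": "openrouter/auto"}
    print(strict_violations(strict_cfg))
    print(strict_violations(loose_cfg))
```

Returning a list of violations instead of raising on the first one makes it easy to log every reason a config was rejected alongside the run metadata.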
