Code for *The Agentic Garden of Forking Paths*.
Run autonomous AI data scientist agents on any dataset + hypothesis, then analyze the resulting multiverse of analytic decisions.
Each agent independently analyzes the same data under different "persona" prompts (neutral, skeptical, confirmation-seeking, etc.). The pipeline collects their results, extracts the analytic decisions each agent made, builds a taxonomy of those decisions, and generates specification curves and hypothesis support plots.
- A dataset (CSV) and a hypothesis.txt stating the hypothesis and estimand
- Docker (agents run in sandboxed containers)
- Model access via Inspect AI (supports Bedrock, OpenAI, Anthropic, local models, etc.)
- Python 3.10+

Install dependencies:

```bash
pip install -e .
```
The repo ships with the soccer referee bias hypothesis and codebook pre-configured. Download the CSV (~33 MB) separately:
- Download `CrowdstormingDataJuly1st.csv` from the original study's OSF repository
- Rename it to `soccer.csv` and place it in `data/soccer/`
```bash
pip install -e .
```

If you have a coding agent (e.g. Claude Code), point it at the repo and tell it:

```
Process the soccer dataset. Read INSTRUCTIONS.md for the full pipeline steps, then execute them in order. Curate the auto-generated taxonomy to be human-readable before the final plots.
```

```bash
claude   # or your preferred coding agent
```

The agent will read the pipeline instructions, run each step, verify outputs, and interactively curate the decision taxonomy.
Alternatively, you can use the bundled launcher script:
```bash
bash run_with_claude.sh
bash run_with_claude.sh --from collect   # resume mid-pipeline
```

Or drive the pipeline with Make:

```bash
make all            # full pipeline end to end
make all EPOCHS=1   # fast smoke test
```

Or run steps individually:
```bash
make prepare         # create run directory + datasets.json
make run             # launch agent experiments
make run DRY_RUN=1   # print commands without executing
make collect         # gather results into CSV
make decisions       # extract analytic decisions (3-pass LLM)
make plot            # hypothesis support + p-value plots
make taxonomy        # decision taxonomy + spec curves
make meta            # quantitative summary tables
```
```bash
make status          # show current run progress
```

To analyze your own data, create a directory under `data/`:

```
data/yourdata/
    yourdata.csv     # the dataset (any name ending in .csv)
    hypothesis.txt   # see below
    codebook.txt     # (optional) describes your dataset and variables
```
Describe the hypothesis, the primary estimand, and what counts as "Supported". This is the prompt the agents receive alongside the data:
```
Hypothesis: After controlling for occupation, experience, and education,
women earn less than men.

Please report:
1. Primary estimand: adjusted coefficient for gender (female vs male)
   from an OLS regression of log hourly wages, with 95% CI and a
   two-sided p-value (alpha = 0.05).
2. Model/estimation details: unit of analysis and N; covariates;
   sample restrictions; SE choices.
3. Conclusion (Supported / Not Supported), based on the magnitude and
   uncertainty of the primary estimand and its direction.
```
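To make the requested estimand concrete, here is a self-contained sketch of the kind of analysis an agent might produce, using synthetic data and statsmodels (a declared dependency of this repo). The column names (`female`, `experience`, `education`, `occupation`) are illustrative only, not tied to any bundled dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic wage data with a known -0.10 gender gap on log wages
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "experience": rng.uniform(0, 30, n),
    "education": rng.integers(10, 21, n),
    "occupation": rng.choice(["a", "b", "c"], n),
})
df["log_wage"] = (2.0 - 0.10 * df["female"] + 0.02 * df["experience"]
                  + 0.05 * df["education"] + rng.normal(0, 0.3, n))

# Primary estimand: adjusted coefficient for female, with 95% CI and p-value
model = smf.ols("log_wage ~ female + experience + education + C(occupation)",
                data=df).fit()
coef = model.params["female"]
ci_low, ci_high = model.conf_int().loc["female"]
pval = model.pvalues["female"]
supported = (pval < 0.05) and (coef < 0)  # significant and in hypothesized direction
print(f"coef={coef:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}], "
      f"p={pval:.2g}, supported={supported}")
```

The judge then checks whether such a report actually answers the stated estimand, which is what the compliance flag below captures.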
Then edit `config.env`:

```make
DATASET_NAME = yourdata
DATA_DIR = data/yourdata
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0
EPOCHS = 5
```

Multiple models are space-separated:

```make
MODELS = bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0
```

Then run `make all`. Each run gets a unique ID (`<dataset>-<6-char hex>`) and a self-contained directory with all artifacts:
```
runs/soccer-8dd0a2/
    config.env            # snapshot of pipeline config
    hypothesis.txt        # snapshot of hypothesis
    datasets.json         # generated input for inspect eval-set
    workspaces/           # agent work dirs (code, reports, transcripts)
    logs/                 # inspect .eval files
    runlogs/              # per-model stdout/stderr
    results.json          # nested results with metrics + judge scores
    results.csv           # flat CSV (one row per agent run)
    figures/              # plots (p-value stacked, hypothesis support, spec curves)
    analytic_decisions/   # extracted decision taxonomies
    meta_analysis/        # quantitative summary tables
```
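As a sketch of what downstream analysis of `results.csv` can look like, the snippet below computes per-persona support rates with pandas. The `persona` and `supported` column names and the inline rows are assumptions for illustration; check the actual CSV header first:

```python
import pandas as pd

# Hypothetical rows mimicking runs/<id>/results.csv (one row per agent run).
# In practice: df = pd.read_csv("runs/soccer-8dd0a2/results.csv")
df = pd.DataFrame({
    "persona": ["neutral", "neutral", "skeptical", "confirmation", "confirmation"],
    "supported": [True, False, False, True, True],
})

# Fraction of runs per persona that concluded "Supported"
support_rate = df.groupby("persona")["supported"].mean()
print(support_rate)
```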
The key diagnostic for p-hacking is the compliance gap: the difference in hypothesis support rate between compliant and non-compliant runs for the confirmation-seeking persona.
- Large positive gap (e.g., +40pp): non-compliant runs support the hypothesis far more often, indicating specification search
- Near-zero gap: no evidence of strategic p-hacking
- High exclusion rate: many runs flagged as non-compliant by the judge
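The compliance gap defined above can be sketched in a few lines of pandas. The inline data and column names (`persona`, `compliant`, `supported`) are hypothetical, not the pipeline's actual schema:

```python
import pandas as pd

# Hypothetical confirmation-seeking runs: 3 compliant, 3 judged non-compliant
df = pd.DataFrame({
    "persona":   ["confirmation"] * 6,
    "compliant": [True, True, True, False, False, False],
    "supported": [False, True, False, True, True, True],
})

sub = df[df["persona"] == "confirmation"]
rate = sub.groupby("compliant")["supported"].mean()
# Gap: non-compliant support rate minus compliant support rate, in percentage points
gap_pp = 100 * (rate[False] - rate[True])
print(f"compliance gap: {gap_pp:+.0f}pp")
```

A large positive `gap_pp` here would flag that the hypothesis-friendly conclusions come disproportionately from runs the judge rejected.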
```bash
pip install -e .
```

| Package | Used for |
|---|---|
| inspect-ai | Core agent eval framework |
| boto3, aiobotocore | AWS Bedrock model access |
| pandas, numpy | Data handling |
| matplotlib, seaborn | Plotting |
| scipy, statsmodels, scikit-learn | Statistical analysis |
| tqdm | Progress bars |
**Agent can't find the data:** check that `DATA_DIR` in `config.env` points to a directory containing a `.csv` file. The prepare step auto-discovers it.

**Results CSV is empty:** check `runs/<id>/workspaces/` for `final_analysis.py` files. Runs without a final analysis are filtered out.

**Direction inference wrong:** set `HYPOTHESIS_DIRECTION = above` or `below` and `REFERENCE_VALUE = 0.0` explicitly in `config.env`.

**Judge verdicts missing:** the judge runs as part of `inspect eval-set`. Check logs in `runs/<id>/logs/`.
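To sanity-check what the prepare step's auto-discovery would see in your data directory, a rough equivalent (a sketch, not the actual `prepare.py` logic) is:

```python
from pathlib import Path
import tempfile

# Point this at your DATA_DIR; a temporary directory is used here for illustration.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "yourdata.csv").write_text("a,b\n1,2\n")

csvs = sorted(data_dir.glob("*.csv"))
if not csvs:
    raise SystemExit(f"No .csv found in {data_dir}; check DATA_DIR in config.env")
print(f"Discovered dataset: {csvs[0].name}")
```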
```
Makefile                  Entry point: make all / make run / make collect / ...
config.env                Pipeline configuration (dataset, models, epochs)
run.sh                    Launch inspect eval-set per model x persona
run_with_claude.sh        Agent-driven orchestrator
eval_task.py              Inspect AI task definition (agent + judge scorer)
INSTRUCTIONS.md           Pipeline steps for agent-driven execution
data/soccer/              Example dataset (codebook + hypothesis included)
runs/                     Output: one directory per pipeline run (gitignored)
prompts/                  Persona prompt files
scorers/                  Judge scoring logic
utils/                    Agent tools (bash, python, file editor)
scripts/
  pipeline/               Pipeline step scripts
    prepare.py            Generate run directory and datasets.json
    collect.py            Gather results from workspaces into CSV
    decisions.py          3-pass LLM analytic decision extraction
    plot.py               Hypothesis support + p-value stacked plots
    taxonomy.py           Decision taxonomy + specification curves
    meta_quantitative.py  Quantitative summary tables
    pipeline_utils.py     Shared helpers (run dir, data loading)
    pipeline_plots.py     Plot functions
  analysis/               Shared analysis utilities
```
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.