Critique-Learning: Memory-Augmented LLM Agents

This repository is the artifact for the paper listed below. It reproduces the main accuracy tables, the train-size and reflection-parts ablations, the token-usage analysis for reasoning models, and the suggestibility analysis, either from the shipped logs or by re-running the pipeline.


Paper

  • Title: Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
  • Authors: Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka
  • Venue: ACM CAIS 2026
  • DOI: TBD
  • BibTeX:
    @inproceedings{hassell2026learning,
      title     = {Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation},
      author    = {Hassell, Jackson and Zhang, Dan and Kim, Hannah and Mitchell, Tom and Hruschka, Estevam},
      booktitle = {Proceedings of the ACM Conference on AI and Agentic Systems (CAIS)},
      year      = {2026},
    }

Overview

The pipeline teaches a frozen LLM to solve a new classification task by having it critique its own answers on a labeled training set and then reusing those critiques at test time. Concretely, for each training question the performance agent (PA) produces an initial prediction; the critic agent (CA), given the ground-truth label, produces a structured critique (correct_answer / local_reason / global_reason, corresponding to the paper's Assertion / Rationale / Reflection). These critiques form two kinds of memory:

  • Episodic memory — each critique is stored alongside its training example. At inference, the top-K most similar training examples (FAISS over sentence-transformer embeddings) are retrieved and shown to the PA along with their critiques.
  • Semantic memory — all critiques are distilled into a single piece of task-level advice by one additional LLM call, and that advice is prepended to the test prompt.
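
A minimal sketch of this train/inference loop, using illustrative helper names (call_llm, retrieve) rather than the repo's actual API; the real implementation is the CritiquePipeline in pipeline/pipeline.py:

```python
def train(examples, call_llm):
    """Build both memories from the labeled training set."""
    episodic = []
    for question, label in examples:
        pred = call_llm(f"Question: {question}\nAnswer:")        # PA's initial prediction
        critique = call_llm(                                     # CA sees the gold label
            f"Question: {question}\nPrediction: {pred}\nGold label: {label}\n"
            "Critique as JSON with correct_answer / local_reason / global_reason:"
        )
        episodic.append((question, label, critique))             # episodic memory
    # Semantic memory: one extra call distills all critiques into task-level advice.
    semantic = call_llm(
        "Distill these critiques into one piece of task-level advice:\n"
        + "\n".join(c for _, _, c in episodic)
    )
    return episodic, semantic

def answer(question, episodic, semantic, retrieve, call_llm, k=5):
    """EP+SEM_CRIT inference: distilled advice plus K retrieved critiqued neighbors."""
    shots = retrieve(question, episodic, k)                      # FAISS top-K by similarity
    few_shot = "\n".join(f"Q: {q}\nA: {a}\nCritique: {c}" for q, a, c in shots)
    return call_llm(f"{semantic}\n{few_shot}\nQuestion: {question}\nAnswer:")
```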

The paper compares five strategies, which are named identically in the repo:

| Paper / repo name | How it conditions the PA at test time | Column suffix in logs/ |
| --- | --- | --- |
| zero_shot | no memory, raw prompt | "" |
| EP_LABEL | RAG-retrieved (x, y) pairs, K=5 | _fewshot_k5_rag_no_reflections |
| EP_CRIT | RAG-retrieved (x, y, critique), K=5 | _fewshot_k5_rag_reflections |
| SEM_CRIT | distilled summary of all critiques | _summary_reflections |
| EP+SEM_CRIT | both | _fewshot_k5_rag_reflections_summary |

An additional consistency experiment (suffix _consistency_*) implements the paper's suggestibility metric: the PA is told either the correct answer or a deliberately wrong one, then re-asked the question.
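
A sketch of the probe; the exact prompt wording here is an assumption, not the repo's template (see prompt_library.py for the real one):

```python
def consistency_prompt(question: str, told_answer: str) -> str:
    """Re-ask a question while asserting an answer. `told_answer` is the gold
    label in the truth condition and a deliberately wrong label in the lying
    condition. Wording is illustrative only."""
    return f"{question}\nNote: the correct answer is {told_answer}.\nNow answer the question."
```

--section suggestibility then reports accuracy under both conditions alongside the paper's S statistic (Table 4).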


Datasets

The BSD 3-Clause license in the top-level LICENSE covers this repository's code only. The splits under data_samples/ are transformations of the following upstream datasets and remain subject to their upstream licenses.

All datasets used in this artifact are listed, with their sources and license information, in data_samples/README.md; please refer to the respective upstream sources for detailed licensing terms.


Artifact contents

| Path | Role |
| --- | --- |
| pipeline/ | Library code. pipeline.py is the stateful CritiquePipeline; prompt_library.py holds all prompts/templates; rag_utils.py is the FAISS + sentence-transformer retrieval wrapper; utils.py holds scoring/metric helpers. |
| scripts/run_all_experiments.py | One-shot driver that reproduces every result in logs/ (main pipeline + vary-K + train-size + reflection-parts ablations) into a single timestamped output directory. Calls the four scripts below in sequence. |
| scripts/run_pipeline.py | Runs the full pipeline (train → EP_LABEL, EP_CRIT, SEM_CRIT, EP+SEM_CRIT, consistency) for one or more datasets × models. Produces the main-results logs. |
| scripts/run_ablation_train_size.py | Reuses shipped critiques and re-runs EP_CRIT / EP+SEM_CRIT at 25% / 50% / 75% train sizes. Restricted to Steam Pref + Multi-Cond. Ranking, matching the paper's Table 6. |
| scripts/run_ablation_reflection_parts.py | Re-runs EP_CRIT with reflections restricted to local_only / global_only / full. |
| scripts/run_ablation_vary_k.py | Sweeps the few-shot K value (supporting analysis for the K=5 choice; gpt-4o-mini only in the paper). |
| scripts/reproduce_paper_tables.py | Prints every table and headline number cited in the paper (accuracy, aggregate gain, token usage, detailed cost analysis, suggestibility, vary-K, train-size, reflection-parts, per-user). |
| scripts/analyze_logs.py | Main accuracy table + aggregate gain + token usage + timeouts + suggestibility + summary, on any single main_results directory. |
| scripts/analyze_logs_ablations.py | Train-size and reflection-parts ablation tables. |
| scripts/create_preference_data_samples.py | Regenerates the Anime/Book/Movie/Steam preference splits from raw sources. |
| scripts/create_multichoice_ranking_samples.py | Regenerates the Multi-Condition Ranking split. |
| data_samples/ | Shipped train/test splits used in the paper. Preference datasets ship as ten independent directories per domain ({anime,book,movie,steam}_sample_{1..10}, one per user); non-preference datasets are single splits. |
| logs/main_results/ | Shipped outputs of run_pipeline.py for the six paper models × seven datasets. |
| logs/train_size_ablation/ | Shipped train-size ablation outputs. |
| logs/reflection_parts_ablation/ | Shipped reflection-parts ablation outputs. |
| requirements.txt | Pinned Python dependencies. |
| LICENSE | BSD 3-Clause License. |

Environment

  • OS: developed and tested on Ubuntu 22.04.2
  • Python: 3.10.16 (see requirements.txt for exact package pins).
  • Hardware: No local LLMs are loaded — all inference is routed to hosted APIs. The only local model is the sentence-transformer embedder used for RAG retrieval (default blevlabs/stella_en_v5, ~1.5B params), so the host needs to be able to run a model of that size locally. CPU works; torch will pick up a GPU automatically if one is available, but there is no GPU-only code path.
  • API access needed to re-run the pipeline:
    • OpenAI — for gpt-4o-mini-2024-07-18 and gpt-5-2025-08-07.
    • Fireworks.ai (or another OpenAI-compatible gateway serving the same model IDs) — for gpt-oss-20b, llama-4-scout, qwen3-235b-a22b-instruct-2507, and qwen3-vl-235b-a22b-thinking.
    • No API access is required for the analysis-only path (running analyze_logs*.py against the shipped logs/).

The pipeline routes per model automatically: any model whose name contains qwen3, llama, gpt-oss, or accounts/ is sent to Fireworks (and the Fireworks accounts/fireworks/models/ prefix is added if missing); all other models go to OpenAI. You can set both providers at once and a single run_pipeline.py invocation can mix models across the two. The relevant env vars (see .env.example):

# Required for OpenAI models (gpt-4o-mini, gpt-5)
export OPENAI_API_KEY="sk-..."
# Optional; defaults to OpenAI's public endpoint
export OPENAI_BASE_URL=

# Required for Fireworks-routed models
export FIREWORKS_API_KEY="fw_..."
export FIREWORKS_BASE_URL="https://api.fireworks.ai/inference/v1"
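
A sketch of the routing rule just described (illustrative only; the actual dispatch lives in the pipeline code):

```python
def route(model: str) -> tuple[str, str]:
    """Map a bare model name to (provider, final model ID). Sketch, not the repo's code."""
    if any(marker in model for marker in ("qwen3", "llama", "gpt-oss", "accounts/")):
        # Fireworks expects fully-qualified IDs; add the prefix if it is missing.
        if not model.startswith("accounts/"):
            model = f"accounts/fireworks/models/{model}"
        return "fireworks", model
    return "openai", model
```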

Installation

git clone <repo-url>
cd critique-learning
python3.10 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

On first use the sentence-transformers embedder (blevlabs/stella_en_v5, ~1.5B params) will be downloaded and cached by huggingface_hub.
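
Conceptually, retrieval then works like this. The sketch below paraphrases what rag_utils.py does; variable names are illustrative, and trust_remote_code=True is an assumption (stella models typically ship custom modeling code):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("blevlabs/stella_en_v5", trust_remote_code=True)

train_prompts = ["example training question 1", "example training question 2"]
embs = embedder.encode(train_prompts, normalize_embeddings=True)
index = faiss.IndexFlatIP(embs.shape[1])      # inner product == cosine on normalized vectors
index.add(np.asarray(embs, dtype="float32"))

query = embedder.encode(["a test question"], normalize_embeddings=True)
k = min(5, len(train_prompts))                # paper default is K=5
_, top_k = index.search(np.asarray(query, dtype="float32"), k)
print(top_k[0])                               # indices of the most similar training examples
```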


Minimal working example

Verifies that the pipeline runs end-to-end on one cheap model and one small dataset (25-question train split, 25-question test split):

export OPENAI_API_KEY=...   # OpenAI key
python scripts/run_pipeline.py \
    --dataset data_samples/anime_sample_1 \
    --model gpt-4o-mini-2024-07-18

This produces logs/<timestamp>/gpt-4o-mini-2024-07-18/anime_sample_1/{train,test}.csv and a time_stats.json.
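
Roughly, with file roles paraphrased from the descriptions elsewhere in this README:

```text
logs/<timestamp>/
└── gpt-4o-mini-2024-07-18/
    └── anime_sample_1/
        ├── train.csv         # training-set predictions and critiques
        ├── test.csv          # one response column per strategy suffix
        └── time_stats.json   # per-run timing stats
```

Then to visualize results: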

python scripts/analyze_logs.py logs/<timestamp>

Expected: under 10 minutes of wall-clock time and a few cents of API spend on gpt-4o-mini.

If you also want the ablations on the same one-dataset slice, use the wrapper instead:

python scripts/run_all_experiments.py \
    --models gpt-4o-mini-2024-07-18 \
    --ablation-models gpt-4o-mini-2024-07-18 \
    --filter anime_sample_1
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>

Reproducing the paper

The shipped logs live under logs/main_results/, logs/train_size_ablation/, and logs/reflection_parts_ablation/. By default every reproduction command writes to a separate timestamped directory logs/<wallclock>/... so the shipped artifacts are never overwritten.

The six models used in the paper:

gpt-4o-mini-2024-07-18          (non-reasoning, OpenAI)
llama-4-scout                   (non-reasoning, Fireworks)
qwen3-235b-a22b-instruct-2507   (non-reasoning, Fireworks)
gpt-5-2025-08-07                (reasoning, OpenAI)
gpt-oss-20b                     (reasoning, Fireworks)
qwen3-vl-235b-a22b-thinking     (reasoning, Fireworks)

The main-results stage runs all six models; the train-size and reflection-parts ablations run only gpt-4o-mini-2024-07-18, gpt-5-2025-08-07, and gpt-oss-20b (matching the paper's Tables 6–7), and vary-K runs only on gpt-4o-mini-2024-07-18 (Table 8). Override with --models, --ablation-models, and --vary-k-models respectively.

Datasets used in the paper (under data_samples/):

multi_condition_ranking_multichoice       # Multi-Cond. Ranking
nfcorpus_short_questions                  # NFCorpus
pubmed                                    # PubMed
{anime,book,movie,steam}_sample_{1..10}   # preference tasks, averaged across 10 users

Display the paper's results (no API calls)

Runs against the shipped logs/ and reproduces every table and headline number cited in the paper:

# Every table to stdout
python scripts/reproduce_paper_tables.py

# Just one section (one of: accuracy, aggregate, tokens, cost, suggestibility,
# vary_k, train_size, refl_parts, per_user, summary)
python scripts/reproduce_paper_tables.py --section accuracy

# Same, against your own re-run (see the next section)
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>

The lighter-weight per-section analyzers can also be pointed at any log directory produced by the runners:

python scripts/analyze_logs.py logs/main_results            # accuracy + tokens + suggestibility
python scripts/analyze_logs_ablations.py logs               # train-size + reflection-parts

Reproduce everything from scratch

One command runs the four stages — main pipeline, vary-K, train-size ablation, reflection-parts ablation — into a single timestamped output directory:

export OPENAI_API_KEY=sk-...
export FIREWORKS_API_KEY=fw_...
python scripts/run_all_experiments.py
# -> writes logs/<wallclock>/{main_results,train_size_ablation,reflection_parts_ablation}/

# Analyze the run
python scripts/reproduce_paper_tables.py --logs logs/<wallclock>

This requires approximately 400 hours of wall-clock time and roughly $600 USD across both providers (see Resource use below for a detailed breakdown). The wrapper is idempotent: re-running skips any (model × dataset) pair whose test.csv already has every expected column. Pass --force to override.
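
A rough sketch of that skip rule, with a hypothetical helper name; the column names are inferred from the strategy suffixes in the Overview and the responses* columns described under Unusual behavior, so treat them as assumptions:

```python
import os
import pandas as pd

# Column suffixes written by the five strategies (from the table in the Overview).
EXPECTED_SUFFIXES = [
    "", "_fewshot_k5_rag_no_reflections", "_fewshot_k5_rag_reflections",
    "_summary_reflections", "_fewshot_k5_rag_reflections_summary",
]

def already_done(run_dir: str) -> bool:
    """Skip a (model x dataset) pair iff its test.csv has every expected column."""
    path = os.path.join(run_dir, "test.csv")
    if not os.path.exists(path):
        return False
    cols = set(pd.read_csv(path, nrows=0).columns)   # read the header only
    return all(f"responses{suffix}" in cols for suffix in EXPECTED_SUFFIXES)
```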

Common variations:

# Pin the output directory (otherwise wallclock is used)
python scripts/run_all_experiments.py --output-dir logs/2026-04-24_rerun

# Smoke test on the cheapest model and one dataset
python scripts/run_all_experiments.py \
    --models gpt-4o-mini-2024-07-18 \
    --filter anime_sample_1

# Only the preference tasks, all six models
python scripts/run_all_experiments.py --filter anime book movie steam

Reproduce only some experiments

Each stage has a --skip-* flag and a per-stage model list. Stages 2-4 read their critiques from a main_results/ directory; if you skip stage 1 you must point at an existing one (the shipped logs work fine).

# Just the main accuracy + suggestibility stage
python scripts/run_all_experiments.py \
    --skip-vary-k --skip-train-size --skip-refl-parts

# Just the train-size ablation against the shipped main_results
python scripts/run_all_experiments.py \
    --skip-main --skip-vary-k --skip-refl-parts \
    --source-main-results-dir logs/main_results

# One model, one ablation
python scripts/run_all_experiments.py \
    --skip-main --skip-vary-k --skip-refl-parts \
    --ablation-models gpt-4o-mini-2024-07-18 \
    --source-main-results-dir logs/main_results

The four runners can also be invoked directly if you want full control — see python scripts/<runner>.py --help. The relevant flags:

python scripts/run_pipeline.py \
    --dataset data_samples/anime_sample_1 \
    --model gpt-4o-mini-2024-07-18 \
    --output-dir logs/my_run                   # writes logs/my_run/<model>/<ds>/

# --source-dir: where to read critiques from
python scripts/run_ablation_train_size.py \
    --model gpt-4o-mini-2024-07-18 \
    --source-dir logs/main_results \
    --output-dir logs/my_run/train_size_ablation

python scripts/run_ablation_reflection_parts.py \
    --model gpt-4o-mini-2024-07-18 \
    --source-dir logs/main_results \
    --output-dir logs/my_run/reflection_parts_ablation

# --log: existing run directory; appends K=1/3/10 columns to its test.csv
python scripts/run_ablation_vary_k.py \
    --log logs/my_run/<model>/<dataset> \
    --model gpt-4o-mini-2024-07-18

Mix-and-match: analyze reproduced + shipped together

If you only re-ran some stages, the analyzers can pull each component from a different directory. For example, reproduce the train-size ablation locally but read main_results from the shipped logs:

python scripts/reproduce_paper_tables.py \
    --main-results-dir logs/main_results \
    --train-size-dir   logs/<wallclock>/train_size_ablation \
    --refl-parts-dir   logs/reflection_parts_ablation

python scripts/analyze_logs_ablations.py logs \
    --train-size-dir logs/<wallclock>/train_size_ablation

What each artifact produces

| Paper artifact | Stage that produces it | Section in reproduce_paper_tables.py |
| --- | --- | --- |
| Table 1 (per-model accuracy; subtables 1a non-reasoning, 1b reasoning) | stage 1 (main pipeline) | --section accuracy |
| Table 2 (aggregate gain: paired 95% CIs and sign-test p-values) | stage 1 | --section aggregate |
| Table 3 (output-token usage for reasoning models) | stage 1 | --section tokens |
| Table 4 (suggestibility: truth/lying accuracy and S) | stage 1 (consistency step) | --section suggestibility |
| Table 5 (detailed cost analysis: training tokens and break-even points) | stage 1 | --section cost |
| Table 6 (train-size ablation) | stage 3 (run_ablation_train_size.py) | --section train_size |
| Table 7 (reflection-parts ablation) | stage 4 (run_ablation_reflection_parts.py) | --section refl_parts |
| Table 8 (vary-K, gpt-4o-mini) | stage 2 (run_ablation_vary_k.py) | --section vary_k |
| Table 9 (timeout rate: token-limit-exceeded rate for reasoning models) | stage 1 | --section tokens (printed after the token-usage table) |
| Tables 10–13 (per-user preference accuracy, one table per domain) | stage 1 | --section per_user |
| Narrative numbers from §4.3–4.5 (mean gains, token reductions, Spearman) | stage 1 | --section summary |

Resource use

Wall-clock and token usage vary substantially with model and dataset. Exact per-run numbers are in time_stats.json and the usage* columns of each test.csv in logs/. Rough orientation from the shipped runs:

  • gpt-4o-mini, full pipeline on one preference dataset user (~25 train + 25 test): under 30 minutes, ~$0.20 API spend.
  • Full paper reproduction from scratch:
    • Main results only (6 models × 43 datasets, i.e. 3 single-split datasets plus 4 preference domains × 10 users): ~350 hours, ~$400
    • With all ablations: ~400 hours, ~$600 total

Per-stage token usage (one-time train + per-question inference) and the break-even point of each critique method against EP_LABEL are printed by python scripts/reproduce_paper_tables.py --section cost.
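
One plausible reading of the break-even computation, stated as an assumption (run --section cost for the authoritative numbers): a critique method pays a one-time training token cost but may spend fewer tokens per test question than EP_LABEL, and the break-even point is the number of questions after which those savings cover the training cost.

```python
def break_even_questions(train_tokens: float,
                         per_q_tokens: float,
                         ep_label_per_q_tokens: float) -> float:
    """Questions needed before a method's per-question savings over EP_LABEL
    pay back its one-time training cost (sketch, not the repo's code)."""
    savings = ep_label_per_q_tokens - per_q_tokens
    return float("inf") if savings <= 0 else train_tokens / savings
```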


Unusual behavior to expect

  • Rows containing "Error: token limit exceeded." will appear in thinking-model outputs (gpt-5, gpt-oss-20b, qwen3-vl-235b-a22b-thinking). These are not parsing errors: the model ran past the 8192-token budget without emitting its JSON answer. They are intentionally counted as wrong in the accuracy calculation; reducing them is itself one of the reported benefits of critique-based methods (see the paper's §Critique-based Memory for Reasoning Models).
  • Parse failures return a random wrong answer by design (see parse_response, sketched after this list). This prevents malformed JSON from silently boosting accuracy. The responses* column preserves the raw output for inspection.
  • First run builds the FAISS + sentence-transformer index. The first call to pipeline.train(..., train_rag=True) downloads the embedding model and builds the index over the training prompts. Subsequent runs reuse the cached model.
  • Exact LLM responses may vary between runs due to provider-side non-determinism, even with temperature set to zero. Re-run accuracies may therefore differ slightly from the shipped logs, particularly on small datasets (such as individual preference-dataset users).
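
A sketch of the parse-failure fallback described above. The real function is parse_response in the pipeline code; the signature and label handling here are assumptions:

```python
import json
import random

def parse_response(raw: str, labels: list[str], gold: str) -> str:
    """Return the model's JSON answer if it parses to a valid label; otherwise
    return a random *wrong* label so malformed output can never inflate accuracy."""
    try:
        answer = json.loads(raw)["answer"]
        if answer in labels:
            return answer
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return random.choice([label for label in labels if label != gold])
```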

Regenerating data_samples/ from raw sources

The shipped data_samples/ is the paper's exact split and is what every experiment in this repo consumes. See data_samples/README.md for the upstream source and exact transformation applied to each dataset.
