Artifact for the paper listed below. Reproduces the main accuracy tables, the train-size and reflection-parts ablations, the token-usage analysis for reasoning models, and the suggestibility analysis, all from the shipped logs or by re-running the pipeline.
- Title: Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
- Authors: Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka
- Venue: ACM CAIS 2026
- DOI: TBD
- BibTeX:
```bibtex
@inproceedings{hassell2026learning,
  title     = {Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation},
  author    = {Hassell, Jackson and Zhang, Dan and Kim, Hannah and Mitchell, Tom and Hruschka, Estevam},
  booktitle = {Proceedings of the ACM Conference on AI and Agentic Systems (CAIS)},
  year      = {2026}
}
```
The pipeline teaches a frozen LLM to solve a new classification task by having
it critique its own answers on a labeled training set and then reusing
those critiques at test time. Concretely, for each training question the
performance agent (PA) produces an initial prediction; the critic agent
(CA), given the ground-truth label, produces a structured critique
(correct_answer / local_reason / global_reason, corresponding to the
paper's Assertion / Rationale / Reflection). These critiques form two kinds
of memory:
- Episodic memory — each critique is stored alongside its training example. At inference, the top-K most similar training examples (FAISS over sentence-transformer embeddings) are retrieved and shown to the PA along with their critiques.
- Semantic memory — all critiques are distilled into a single piece of task-level advice by one additional LLM call, and that advice is prepended to the test prompt.
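For orientation, here is a minimal sketch of how the two memories could be built and queried. The helper names, prompt wording, and the `embedder`/`llm` callables are hypothetical; the shipped implementation lives in `pipeline/pipeline.py` and `pipeline/rag_utils.py`.

```python
# Minimal sketch of the two memory types described above. Names and the prompt
# are hypothetical; see pipeline/pipeline.py and pipeline/rag_utils.py for the
# actual logic.
import faiss
import numpy as np

def build_episodic_index(train_questions, embedder):
    """Embed the training questions and index them for top-K retrieval."""
    vecs = np.asarray(embedder.encode(train_questions, normalize_embeddings=True),
                      dtype="float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine similarity via inner product
    index.add(vecs)
    return index

def retrieve_episodic(index, embedder, test_question, train_records, k=5):
    """Return the K most similar (question, label, critique) training records."""
    q = np.asarray(embedder.encode([test_question], normalize_embeddings=True),
                   dtype="float32")
    _, ids = index.search(q, k)
    return [train_records[i] for i in ids[0]]

def distill_semantic_advice(critiques, llm):
    """One extra LLM call that compresses every critique into task-level advice."""
    prompt = ("Summarize the following critiques into general advice for this task:\n"
              + "\n".join(critiques))
    return llm(prompt)
```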
The paper compares five strategies, which are named identically in the repo:
| Paper / repo name | How it conditions the PA at test time | Column suffix in logs/ |
|---|---|---|
| `zero_shot` | no memory, raw prompt | `""` |
| `EP_LABEL` | RAG-retrieved (x, y) pairs, K=5 | `_fewshot_k5_rag_no_reflections` |
| `EP_CRIT` | RAG-retrieved (x, y, critique), K=5 | `_fewshot_k5_rag_reflections` |
| `SEM_CRIT` | distilled summary of all critiques | `_summary_reflections` |
| `EP+SEM_CRIT` | both | `_fewshot_k5_rag_reflections_summary` |
An additional consistency experiment (suffix _consistency_*) implements the
paper's suggestibility metric: the PA is told the correct — or a
deliberately wrong — answer and re-asked the question.
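For orientation, a rough sketch of how truth/lying accuracies from a consistency run could be turned into a suggestibility gap. The column names below are assumptions based on the `_consistency_*` suffix convention, and the paper's exact definition of S may differ; `reproduce_paper_tables.py --section suggestibility` is the authoritative path.

```python
# Rough sketch only: column names are assumptions, and the paper defines S
# precisely; this just illustrates the truth-vs-lie comparison.
import pandas as pd

def consistency_accuracies(test_csv_path: str) -> dict:
    df = pd.read_csv(test_csv_path)
    truth_acc = (df["answer_consistency_truth"] == df["label"]).mean()
    lying_acc = (df["answer_consistency_lie"] == df["label"]).mean()
    # One natural suggestibility measure: how much accuracy drops when the PA
    # is deliberately misled, relative to being told the correct answer.
    return {"truth_acc": truth_acc,
            "lying_acc": lying_acc,
            "suggestibility_gap": truth_acc - lying_acc}
```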
The BSD 3-Clause license in the top-level LICENSE covers this
repository's code only. The splits under data_samples/ are transformations
of the following upstream datasets and remain subject to their upstream
licenses.
All datasets used in this artifact are listed below with their sources and license information.
- PubMedQA. Source: `qiaojin/PubMedQA` on Hugging Face. License: MIT.
- NFCorpus. Source: Heidelberg StatNLP NFCorpus. License: free for academic use per the upstream page; for any other use of the embedded NutritionFacts.org data, consult the NutritionFacts.org Terms of Service and contact Dr. Michael Greger directly.
- Multi-Condition Ranking. Source: `megagonlabs/MCR`. License: BSD 3-Clause.
- Anime Recommendations Database. Source: Kaggle, `CooperUnion/anime-recommendations-database`. License: CC0 1.0.
- Book Recommendation Dataset. Source: Kaggle, `arashnic/book-recommendation-dataset`. License: CC0 1.0.
- Movie Recommendation System. Source: Kaggle, `parasharmanas/movie-recommendation-system`. License: ODbL 1.0.
- Steam Video Games. Source: Kaggle, `tamber/steam-video-games`. License: ODbL 1.0.
Please refer to the respective sources for detailed licensing terms.
For more dataset details, see `data_samples/README.md`.
| Path | Role |
|---|---|
| `pipeline/` | Library code. `pipeline.py` is the stateful `CritiquePipeline`. `prompt_library.py` holds all prompts/templates. `rag_utils.py` is the FAISS + sentence-transformer retrieval wrapper. `utils.py` holds scoring/metric helpers. |
| `scripts/run_all_experiments.py` | One-shot driver that reproduces every result in `logs/` (main pipeline + vary-K + train-size + reflection-parts ablations) into a single timestamped output directory. Calls the four scripts below in sequence. |
| `scripts/run_pipeline.py` | Runs the full pipeline (train → EP_LABEL, EP_CRIT, SEM_CRIT, EP+SEM_CRIT, consistency) for one or more datasets × models. Produces the main-results logs. |
| `scripts/run_ablation_train_size.py` | Reuses shipped critiques and re-runs EP_CRIT / EP+SEM_CRIT at 25% / 50% / 75% train sizes. Restricted to Steam Pref + Multi-Cond. Ranking, matching the paper's Table 6. |
| `scripts/run_ablation_reflection_parts.py` | Re-runs EP_CRIT with reflections restricted to local_only / global_only / full. |
| `scripts/run_ablation_vary_k.py` | Sweeps the few-shot K value (supporting analysis for the K=5 choice; gpt-4o-mini only in the paper). |
| `scripts/reproduce_paper_tables.py` | Prints every table and headline number cited in the paper (accuracy, aggregate gain, token usage, detailed cost analysis, suggestibility, vary-K, train-size, reflection-parts, per-user). |
| `scripts/analyze_logs.py` | Main accuracy table + aggregate gain + token usage + timeouts + suggestibility + summary, on any single main_results directory. |
| `scripts/analyze_logs_ablations.py` | Train-size and reflection-parts ablation tables. |
| `scripts/create_preference_data_samples.py` | Regenerates the Anime/Book/Movie/Steam preference splits from raw sources. |
| `scripts/create_multichoice_ranking_samples.py` | Regenerates the Multi-Condition Ranking split. |
| `data_samples/` | Shipped train/test splits used in the paper. Preference datasets ship as ten independent `{anime,book,movie,steam}_sample_{1..10}` directories (one per user); non-preference datasets are single splits. |
| `logs/main_results/` | Shipped outputs of `run_pipeline.py` for the six paper models × seven datasets. |
| `logs/train_size_ablation/` | Shipped train-size ablation outputs. |
| `logs/reflection_parts_ablation/` | Shipped reflection-parts ablation outputs. |
| `requirements.txt` | Pinned Python dependencies. |
| `LICENSE` | BSD 3-Clause License. |
- OS: developed and tested on Ubuntu 22.04.2.
- Python: 3.10.16 (see `requirements.txt` for exact package pins).
- Hardware: no local LLMs are loaded; all inference is routed to hosted APIs. The only local model is the sentence-transformer embedder used for RAG retrieval (default `blevlabs/stella_en_v5`, ~1.5B params), so the host needs to be able to run a model of that size locally. CPU works; `torch` will pick up a GPU automatically if one is available, but there is no GPU-only code path.
- API access needed to re-run the pipeline:
  - OpenAI, for `gpt-4o-mini-2024-07-18` and `gpt-5-2025-08-07`.
  - Fireworks.ai (or another OpenAI-compatible gateway serving the same model IDs), for `gpt-oss-20b`, `llama-4-scout`, `qwen3-235b-a22b-instruct-2507`, and `qwen3-vl-235b-a22b-thinking`.
  - No API access is required for the analysis-only path (running `analyze_logs*.py` against the shipped `logs/`).
The pipeline routes per model automatically: any model whose name contains
`qwen3`, `llama`, `gpt-oss`, or `accounts/` is sent to Fireworks (and
the Fireworks `accounts/fireworks/models/` prefix is added if missing); all
other models go to OpenAI. You can set both providers at once, and a single
`run_pipeline.py` invocation can mix models across the two. The relevant env
vars (see `.env.example`):
```bash
# Required for OpenAI models (gpt-4o-mini, gpt-5)
export OPENAI_API_KEY="sk-..."
# Optional; defaults to OpenAI's public endpoint
export OPENAI_BASE_URL=
# Required for Fireworks-routed models
export FIREWORKS_API_KEY="fw_..."
export FIREWORKS_BASE_URL="https://api.fireworks.ai/inference/v1"
```

To install:

```bash
git clone <repo-url>
cd critique-learning
python3.10 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

On first use the sentence-transformers embedder (`blevlabs/stella_en_v5`, ~1.5B params) will be downloaded and cached by `huggingface_hub`.
Verifies that the pipeline runs end-to-end on one cheap model and one small dataset (25-question train split, 25-question test split):
```bash
export OPENAI_API_KEY=...   # OpenAI key
python scripts/run_pipeline.py \
  --dataset data_samples/anime_sample_1 \
  --model gpt-4o-mini-2024-07-18
```

This produces `logs/<timestamp>/gpt-4o-mini-2024-07-18/anime_sample_1/{train,test}.csv` and a `time_stats.json`. Then to visualize results:

```bash
python scripts/analyze_logs.py logs/<timestamp>
```

Expected: under 10 minutes of wall-clock time and a few cents of API spend on gpt-4o-mini.
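A quick way to sanity-check the smoke-test output (illustrative only; the per-strategy column names follow the suffix table above and may differ slightly, as may the `time_stats.json` layout):

```python
# Illustrative sanity check of the smoke-test output.
import json
import pandas as pd

run_dir = "logs/<timestamp>/gpt-4o-mini-2024-07-18/anime_sample_1"
test = pd.read_csv(f"{run_dir}/test.csv")
print(test.columns.tolist())  # expect one prediction column per strategy suffix
with open(f"{run_dir}/time_stats.json") as f:
    print(json.load(f))       # per-run timing info
```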
If you also want the ablations on the same one-dataset slice, use the wrapper instead:
```bash
python scripts/run_all_experiments.py \
  --models gpt-4o-mini-2024-07-18 \
  --ablation-models gpt-4o-mini-2024-07-18 \
  --filter anime_sample_1
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>
```

The shipped logs live under `logs/main_results/`,
logs/train_size_ablation/, and logs/reflection_parts_ablation/. By
default every reproduction command writes to a separate timestamped
directory logs/<wallclock>/... so the shipped artifacts are never
overwritten.
The six models used in the paper:
- `gpt-4o-mini-2024-07-18` (non-reasoning, OpenAI)
- `llama-4-scout` (non-reasoning, Fireworks)
- `qwen3-235b-a22b-instruct-2507` (non-reasoning, Fireworks)
- `gpt-5-2025-08-07` (reasoning, OpenAI)
- `gpt-oss-20b` (reasoning, Fireworks)
- `qwen3-vl-235b-a22b-thinking` (reasoning, Fireworks)
The main-results stage runs all six models; the train-size and
reflection-parts ablations run only gpt-4o-mini-2024-07-18, gpt-5-2025-08-07,
and gpt-oss-20b (matching the paper's Tables 6–7), and vary-K runs only on
gpt-4o-mini-2024-07-18 (Table 8). Override with --models,
--ablation-models, and --vary-k-models respectively.
Datasets used in the paper (under data_samples/):
```text
multi_condition_ranking_multichoice       # Multi-Cond. Ranking
nfcorpus_short_questions                  # NFCorpus
pubmed                                    # PubMed
{anime,book,movie,steam}_sample_{1..10}   # preference tasks, averaged across 10 users
```
Runs against the shipped logs/ and reproduces every table and headline number cited in the paper:
```bash
# Every table to stdout
python scripts/reproduce_paper_tables.py

# Just one section (one of: accuracy, aggregate, tokens, cost, suggestibility,
# vary_k, train_size, refl_parts, per_user, summary)
python scripts/reproduce_paper_tables.py --section accuracy

# Same, against your own re-run (see the next section)
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>
```

The lighter-weight per-section analyzers can also be pointed at any log directory produced by the runners:

```bash
python scripts/analyze_logs.py logs/main_results   # accuracy + tokens + suggestibility
python scripts/analyze_logs_ablations.py logs      # train-size + reflection-parts
```

One command runs the four stages — main pipeline, vary-K, train-size ablation, reflection-parts ablation — into a single timestamped output directory:
```bash
export OPENAI_API_KEY=sk-...
export FIREWORKS_API_KEY=fw_...
python scripts/run_all_experiments.py
# -> writes logs/<wallclock>/{main_results,train_size_ablation,reflection_parts_ablation}/

# Analyze the run
python scripts/reproduce_paper_tables.py --logs logs/<wallclock>
```

This requires approximately 400 hours of wall-clock time and $600 USD across both providers (see Resource use for a detailed breakdown). The
wrapper is idempotent: re-running skips any (model × dataset) whose
test.csv already has every expected column. Pass --force to override.
Common variations:
```bash
# Pin the output directory (otherwise wallclock is used)
python scripts/run_all_experiments.py --output-dir logs/2026-04-24_rerun

# Smoke test on the cheapest model and one dataset
python scripts/run_all_experiments.py \
  --models gpt-4o-mini-2024-07-18 \
  --filter anime_sample_1

# Only the preference tasks, all six models
python scripts/run_all_experiments.py --filter anime book movie steam
```

Each stage has a `--skip-*` flag and a per-stage model list. Stages 2-4
read their critiques from a main_results/ directory; if you skip stage 1
you must point at an existing one (the shipped logs work fine).
```bash
# Just the main accuracy + suggestibility stage
python scripts/run_all_experiments.py \
  --skip-vary-k --skip-train-size --skip-refl-parts

# Just the train-size ablation against the shipped main_results
python scripts/run_all_experiments.py \
  --skip-main --skip-vary-k --skip-refl-parts \
  --source-main-results-dir logs/main_results

# One model, one ablation
python scripts/run_all_experiments.py \
  --skip-main --skip-vary-k --skip-refl-parts \
  --ablation-models gpt-4o-mini-2024-07-18 \
  --source-main-results-dir logs/main_results
```

The four runners can also be invoked directly if you want full control — see `python scripts/<runner>.py --help`. The relevant flags:
```bash
python scripts/run_pipeline.py \
  --dataset data_samples/anime_sample_1 \
  --model gpt-4o-mini-2024-07-18 \
  --output-dir logs/my_run              # writes logs/my_run/<model>/<ds>/

python scripts/run_ablation_train_size.py \
  --model gpt-4o-mini-2024-07-18 \
  --source-dir logs/main_results \      # where to read critiques from
  --output-dir logs/my_run/train_size_ablation

python scripts/run_ablation_reflection_parts.py \
  --model gpt-4o-mini-2024-07-18 \
  --source-dir logs/main_results \
  --output-dir logs/my_run/reflection_parts_ablation

python scripts/run_ablation_vary_k.py \
  --log logs/my_run/<model>/<dataset> \ # appends K=1/3/10 columns to its test.csv
  --model gpt-4o-mini-2024-07-18
```

If you only re-ran some stages, the analyzers can pull each component from a different directory. For example, reproduce the train-size ablation locally but read main_results from the shipped logs:
```bash
python scripts/reproduce_paper_tables.py \
  --main-results-dir logs/main_results \
  --train-size-dir logs/<wallclock>/train_size_ablation \
  --refl-parts-dir logs/reflection_parts_ablation

python scripts/analyze_logs_ablations.py logs \
  --train-size-dir logs/<wallclock>/train_size_ablation
```

| Paper artifact | Stage that produces it | Section in reproduce_paper_tables.py |
|---|---|---|
| Table 1 (per-model accuracy; subtables 1a non-reasoning, 1b reasoning) | stage 1 (main pipeline) | --section accuracy |
| Table 2 (aggregate gain): paired 95% CIs and sign-test p-values | stage 1 | --section aggregate |
| Table 3: output-token usage for reasoning models | stage 1 | --section tokens |
| Table 4 (suggestibility): truth/lying accuracy and S | stage 1 (consistency step) | --section suggestibility |
| Table 5 (detailed cost analysis): training tokens and break-even points | stage 1 | --section cost |
| Table 6 (train-size ablation) | stage 3 (run_ablation_train_size.py) | --section train_size |
| Table 7 (reflection-parts ablation) | stage 4 (run_ablation_reflection_parts.py) | --section refl_parts |
| Table 8 (vary-K, gpt-4o-mini) | stage 2 (run_ablation_vary_k.py) | --section vary_k |
| Table 9 (timeout rate): token-limit-exceeded rate for reasoning models | stage 1 | --section tokens (printed after token usage table) |
| Tables 10–13 (per-user preference accuracy, one table per domain) | stage 1 | --section per_user |
| Narrative numbers from §4.3–4.5 (mean gains, token reductions, Spearman) | stage 1 | --section summary |
Wall-clock and token usage vary substantially with model and dataset. Exact
per-run numbers are in time_stats.json and the usage* columns of each
test.csv in logs/. Rough orientation from the shipped runs:
- gpt-4o-mini, full pipeline on one preference dataset user (~25 train + 25 test): under 30 minutes, ~$0.20 API spend.
- Full paper reproduction from scratch:
  - Main results only (6 models × 43 datasets, i.e. 3 single-split datasets plus 4 preference domains × 10 users): ~350 hours, ~$400.
  - With all ablations: ~400 hours, ~$600 total.
Per-stage token usage (one-time train + per-question inference) and the
break-even point of each critique method against EP_LABEL are printed by
python scripts/reproduce_paper_tables.py --section cost.
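For intuition, one plausible way to think about a break-even point is shown below; the real numbers and definition come from the `--section cost` output, and this amortization formula is an assumption, not the paper's exact accounting.

```python
# Hypothetical back-of-the-envelope break-even sketch: a critique method pays a
# one-time training-token cost and may change per-question inference cost
# relative to EP_LABEL. Not the paper's exact definition.
def break_even_questions(extra_train_tokens: float,
                         inference_tokens_ep_label: float,
                         inference_tokens_critique: float) -> float:
    """Test questions needed before the one-time training cost is offset
    by per-question inference savings over EP_LABEL."""
    per_question_saving = inference_tokens_ep_label - inference_tokens_critique
    if per_question_saving <= 0:
        return float("inf")  # never breaks even on tokens alone
    return extra_train_tokens / per_question_saving
```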
- `Error: token limit exceeded.` rows will appear in thinking-model outputs (`gpt-5`, `gpt-oss-20b`, `qwen3-vl-235b-a22b-thinking`). These are not parsing errors: the model ran past the 8192-token budget without emitting its JSON answer. They are intentionally counted as wrong in the accuracy calculation; reducing them is itself one of the reported benefits of critique-based methods (see the paper's section "Critique-based Memory for Reasoning Models").
- Parse failures return a random wrong answer by design (see `parse_response`, sketched after this list). This prevents malformed JSON from silently boosting accuracy. The `responses*` column preserves the raw output for inspection.
- The first run builds the FAISS + sentence-transformer index. The first call to `pipeline.train(..., train_rag=True)` downloads the embedding model and builds the index over the training prompts. Subsequent runs reuse the cached model.
- Exact LLM responses may vary between runs even with the temperature set to zero, because hosted-API inference is not fully deterministic. Re-run accuracies may therefore differ slightly, particularly for small datasets (such as individual preference-dataset users).
The shipped `data_samples/` contains the paper's exact splits and is what every experiment in this repo consumes. See `data_samples/README.md` for the upstream source and the exact transformation applied to each dataset.