Artifact for the paper listed below. Reproduces the main accuracy tables, the train-size and reflection-parts ablations, the token-usage analysis for reasoning models, and the suggestibility analysis, all from the shipped logs or by re-running the pipeline.
- Title: Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
- Authors: Jackson Hassell, Dan Zhang, Hannah Kim, Tom Mitchell, Estevam Hruschka
- Venue: ACM CAIS 2026
- DOI: TBD
- BibTeX:
```bibtex
@inproceedings{hassell2026learning,
  title     = {Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation},
  author    = {Hassell, Jackson and Zhang, Dan and Kim, Hannah and Mitchell, Tom and Hruschka, Estevam},
  booktitle = {Proceedings of the ACM Conference on AI and Agentic Systems (CAIS)},
  year      = {2026}
}
```
The pipeline teaches a frozen LLM to solve a new classification task by having
it critique its own answers on a labeled training set and then reusing
those critiques at test time. Concretely, for each training question the
performance agent (PA) produces an initial prediction; the critic agent
(CA), given the ground-truth label, produces a structured critique
(correct_answer / local_reason / global_reason, corresponding to the
paper's Assertion / Rationale / Reflection). These critiques form two kinds
of memory:
- Episodic memory — each critique is stored alongside its training example. At inference, the top-K most similar training examples (FAISS over sentence-transformer embeddings) are retrieved and shown to the PA along with their critiques.
- Semantic memory — all critiques are distilled into a single piece of task-level advice by one additional LLM call, and that advice is prepended to the test prompt.
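For orientation, here is a minimal sketch of how the two memories could be built and queried. The helper names, prompt wording, and the `embedder`/`llm` callables are hypothetical; the shipped implementation lives in `pipeline/pipeline.py` and `pipeline/rag_utils.py`.

```python
# Minimal sketch of the two memory types described above. Names and the prompt
# are hypothetical; see pipeline/pipeline.py and pipeline/rag_utils.py for the
# actual logic.
import faiss
import numpy as np

def build_episodic_index(train_questions, embedder):
    """Embed the training questions and index them for top-K retrieval."""
    vecs = np.asarray(embedder.encode(train_questions, normalize_embeddings=True),
                      dtype="float32")
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine similarity via inner product
    index.add(vecs)
    return index

def retrieve_episodic(index, embedder, test_question, train_records, k=5):
    """Return the K most similar (question, label, critique) training records."""
    q = np.asarray(embedder.encode([test_question], normalize_embeddings=True),
                   dtype="float32")
    _, ids = index.search(q, k)
    return [train_records[i] for i in ids[0]]

def distill_semantic_advice(critiques, llm):
    """One extra LLM call that compresses every critique into task-level advice."""
    prompt = ("Summarize the following critiques into general advice for this task:\n"
              + "\n".join(critiques))
    return llm(prompt)
```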
The paper compares five strategies, which are named identically in the repo:
| Paper / repo name | How it conditions the PA at test time | Column suffix in logs/ |
|---|---|---|
| `zero_shot` | no memory, raw prompt | `""` |
| `EP_LABEL` | RAG-retrieved (x, y) pairs, K=5 | `_fewshot_k5_rag_no_reflections` |
| `EP_CRIT` | RAG-retrieved (x, y, critique), K=5 | `_fewshot_k5_rag_reflections` |
| `SEM_CRIT` | distilled summary of all critiques | `_summary_reflections` |
| `EP+SEM_CRIT` | both | `_fewshot_k5_rag_reflections_summary` |
An additional consistency experiment (suffix _consistency_*) implements the
paper's suggestibility metric: the PA is told the correct — or a
deliberately wrong — answer and re-asked the question.
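For orientation, a rough sketch of how truth/lying accuracies from a consistency run could be turned into a suggestibility gap. The column names below are assumptions based on the `_consistency_*` suffix convention, and the paper's exact definition of S may differ; `reproduce_paper_tables.py --section suggestibility` is the authoritative path.

```python
# Rough sketch only: column names are assumptions, and the paper defines S
# precisely; this just illustrates the truth-vs-lie comparison.
import pandas as pd

def consistency_accuracies(test_csv_path: str) -> dict:
    df = pd.read_csv(test_csv_path)
    truth_acc = (df["answer_consistency_truth"] == df["label"]).mean()
    lying_acc = (df["answer_consistency_lie"] == df["label"]).mean()
    # One natural suggestibility measure: how much accuracy drops when the PA
    # is deliberately misled, relative to being told the correct answer.
    return {"truth_acc": truth_acc,
            "lying_acc": lying_acc,
            "suggestibility_gap": truth_acc - lying_acc}
```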
The BSD 3-Clause license in the top-level LICENSE covers this
repository's code only. The splits under data_samples/ are transformations
of the following upstream datasets and remain subject to their upstream
licenses.
All datasets used in this artifact are listed below with their sources and license information.
- PubMedQA. Source: `qiaojin/PubMedQA` on Hugging Face. License: MIT.
- NFCorpus. Source: Heidelberg StatNLP NFCorpus. License: free for academic use per the upstream page; for any other use of the embedded NutritionFacts.org data, consult the NutritionFacts.org Terms of Service and contact Dr. Michael Greger directly.
- Multi-Condition Ranking. Source: `megagonlabs/MCR`. License: BSD 3-Clause.
- Anime Recommendations Database. Source: Kaggle, `CooperUnion/anime-recommendations-database`. License: CC0 1.0.
- Book Recommendation Dataset. Source: Kaggle, `arashnic/book-recommendation-dataset`. License: CC0 1.0.
- Movie Recommendation System. Source: Kaggle, `parasharmanas/movie-recommendation-system`. License: ODbL 1.0.
- Steam Video Games. Source: Kaggle, `tamber/steam-video-games`. License: ODbL 1.0.
Please refer to the respective sources for detailed licensing terms.
For more dataset details, see `data_samples/README.md`.
| Path | Role |
|---|---|
| `pipeline/` | Library code. `pipeline.py` is the stateful `CritiquePipeline`. `prompt_library.py` holds all prompts/templates. `rag_utils.py` is the FAISS + sentence-transformer retrieval wrapper. `utils.py` holds scoring/metric helpers. |
| `scripts/run_all_experiments.py` | One-shot driver that reproduces every result in `logs/` (main pipeline + vary-K + train-size + reflection-parts ablations) into a single timestamped output directory. Calls the four scripts below in sequence. |
| `scripts/run_pipeline.py` | Runs the full pipeline (train → EP_LABEL, EP_CRIT, SEM_CRIT, EP+SEM_CRIT, consistency) for one or more datasets × models. Produces the main-results logs. |
| `scripts/run_ablation_train_size.py` | Reuses shipped critiques and re-runs EP_CRIT / EP+SEM_CRIT at 25% / 50% / 75% train sizes. Restricted to Steam Pref + Multi-Cond. Ranking, matching the paper's Table 6. |
| `scripts/run_ablation_reflection_parts.py` | Re-runs EP_CRIT with reflections restricted to local_only / global_only / full. |
| `scripts/run_ablation_vary_k.py` | Sweeps the few-shot K value (supporting analysis for the K=5 choice; gpt-4o-mini only in the paper). |
| `scripts/reproduce_paper_tables.py` | Prints every table and headline number cited in the paper (accuracy, aggregate gain, token usage, detailed cost analysis, suggestibility, vary-K, train-size, reflection-parts, per-user). |
| `scripts/analyze_logs.py` | Main accuracy table + aggregate gain + token usage + timeouts + suggestibility + summary, on any single main_results directory. |
| `scripts/analyze_logs_ablations.py` | Train-size and reflection-parts ablation tables. |
| `scripts/create_preference_data_samples.py` | Regenerates the Anime/Book/Movie/Steam preference splits from raw sources. |
| `scripts/create_multichoice_ranking_samples.py` | Regenerates the Multi-Condition Ranking split. |
| `data_samples/` | Shipped train/test splits used in the paper. Preference datasets ship as ten independent `{anime,book,movie,steam}_sample_{1..10}` directories (one per user); non-preference datasets are single splits. |
| `logs/main_results/` | Shipped outputs of `run_pipeline.py` for the six paper models × seven datasets. |
| `logs/train_size_ablation/` | Shipped train-size ablation outputs. |
| `logs/reflection_parts_ablation/` | Shipped reflection-parts ablation outputs. |
| `requirements.txt` | Pinned Python dependencies. |
| `LICENSE` | BSD 3-Clause License. |
- OS: developed and tested on Ubuntu 22.04.2.
- Python: 3.10.16 (see `requirements.txt` for exact package pins).
- Hardware: no local LLMs are loaded; all inference is routed to hosted APIs. The only local model is the sentence-transformer embedder used for RAG retrieval (default `blevlabs/stella_en_v5`, ~1.5B params), so the host needs to be able to run a model of that size locally. CPU works; `torch` will pick up a GPU automatically if one is available, but there is no GPU-only code path.
- API access needed to re-run the pipeline:
  - OpenAI, for `gpt-4o-mini-2024-07-18` and `gpt-5-2025-08-07`.
  - Fireworks.ai (or another OpenAI-compatible gateway serving the same model IDs), for `gpt-oss-20b`, `llama-4-scout`, `qwen3-235b-a22b-instruct-2507`, and `qwen3-vl-235b-a22b-thinking`.
  - No API access is required for the analysis-only path (running `analyze_logs*.py` against the shipped `logs/`).
The pipeline routes per model automatically: any model whose name contains
`qwen3`, `llama`, `gpt-oss`, or `accounts/` is sent to Fireworks (and
the Fireworks `accounts/fireworks/models/` prefix is added if missing); all
other models go to OpenAI. You can set both providers at once, and a single
`run_pipeline.py` invocation can mix models across the two. The relevant env
vars (see `.env.example`):
```bash
# Required for OpenAI models (gpt-4o-mini, gpt-5)
export OPENAI_API_KEY="sk-..."
# Optional; defaults to OpenAI's public endpoint
export OPENAI_BASE_URL=
# Required for Fireworks-routed models
export FIREWORKS_API_KEY="fw_..."
export FIREWORKS_BASE_URL="https://api.fireworks.ai/inference/v1"
```

To install:

```bash
git clone <repo-url>
cd critique-learning
python3.10 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

On first use the sentence-transformers embedder (`blevlabs/stella_en_v5`, ~1.5B params) will be downloaded and cached by `huggingface_hub`.
Verifies that the pipeline runs end-to-end on one cheap model and one small dataset (25-question train split, 25-question test split):
```bash
export OPENAI_API_KEY=...   # OpenAI key
python scripts/run_pipeline.py \
  --dataset data_samples/anime_sample_1 \
  --model gpt-4o-mini-2024-07-18
```

This produces `logs/<timestamp>/gpt-4o-mini-2024-07-18/anime_sample_1/{train,test}.csv` and a `time_stats.json`. Then to visualize results:

```bash
python scripts/analyze_logs.py logs/<timestamp>
```

Expected: under 10 minutes of wall-clock time and a few cents of API spend on gpt-4o-mini.
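A quick way to sanity-check the smoke-test output (illustrative only; the per-strategy column names follow the suffix table above and may differ slightly, as may the `time_stats.json` layout):

```python
# Illustrative sanity check of the smoke-test output.
import json
import pandas as pd

run_dir = "logs/<timestamp>/gpt-4o-mini-2024-07-18/anime_sample_1"
test = pd.read_csv(f"{run_dir}/test.csv")
print(test.columns.tolist())  # expect one prediction column per strategy suffix
with open(f"{run_dir}/time_stats.json") as f:
    print(json.load(f))       # per-run timing info
```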
If you also want the ablations on the same one-dataset slice, use the wrapper instead:
```bash
python scripts/run_all_experiments.py \
  --models gpt-4o-mini-2024-07-18 \
  --ablation-models gpt-4o-mini-2024-07-18 \
  --filter anime_sample_1
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>
```

The shipped logs live under `logs/main_results/`,
logs/train_size_ablation/, and logs/reflection_parts_ablation/. By
default every reproduction command writes to a separate timestamped
directory logs/<wallclock>/... so the shipped artifacts are never
overwritten.
The six models used in the paper:
- `gpt-4o-mini-2024-07-18` (non-reasoning, OpenAI)
- `llama-4-scout` (non-reasoning, Fireworks)
- `qwen3-235b-a22b-instruct-2507` (non-reasoning, Fireworks)
- `gpt-5-2025-08-07` (reasoning, OpenAI)
- `gpt-oss-20b` (reasoning, Fireworks)
- `qwen3-vl-235b-a22b-thinking` (reasoning, Fireworks)
The main-results stage runs all six models; the train-size and
reflection-parts ablations run only gpt-4o-mini-2024-07-18, gpt-5-2025-08-07,
and gpt-oss-20b (matching the paper's Tables 6–7), and vary-K runs only on
gpt-4o-mini-2024-07-18 (Table 8). Override with --models,
--ablation-models, and --vary-k-models respectively.
Datasets used in the paper (under data_samples/):
```text
multi_condition_ranking_multichoice       # Multi-Cond. Ranking
nfcorpus_short_questions                  # NFCorpus
pubmed                                    # PubMed
{anime,book,movie,steam}_sample_{1..10}   # preference tasks, averaged across 10 users
```
Runs against the shipped logs/ and reproduces every table and headline number cited in the paper:
```bash
# Every table to stdout
python scripts/reproduce_paper_tables.py

# Just one section (one of: accuracy, aggregate, tokens, cost, suggestibility,
# vary_k, train_size, refl_parts, per_user, summary)
python scripts/reproduce_paper_tables.py --section accuracy

# Same, against your own re-run (see the next section)
python scripts/reproduce_paper_tables.py --logs logs/<timestamp>
```

The lighter-weight per-section analyzers can also be pointed at any log directory produced by the runners:

```bash
python scripts/analyze_logs.py logs/main_results   # accuracy + tokens + suggestibility
python scripts/analyze_logs_ablations.py logs      # train-size + reflection-parts
```

One command runs the four stages — main pipeline, vary-K, train-size ablation, reflection-parts ablation — into a single timestamped output directory:
```bash
export OPENAI_API_KEY=sk-...
export FIREWORKS_API_KEY=fw_...
python scripts/run_all_experiments.py
# -> writes logs/<wallclock>/{main_results,train_size_ablation,reflection_parts_ablation}/

# Analyze the run
python scripts/reproduce_paper_tables.py --logs logs/<wallclock>
```

This requires approximately 400 hours of wall-clock time and $600 USD across both providers (see Resource use for a detailed breakdown). The
wrapper is idempotent: re-running skips any (model × dataset) whose
test.csv already has every expected column. Pass --force to override.
Common variations:
```bash
# Pin the output directory (otherwise wallclock is used)
python scripts/run_all_experiments.py --output-dir logs/2026-04-24_rerun

# Smoke test on the cheapest model and one dataset
python scripts/run_all_experiments.py \
  --models gpt-4o-mini-2024-07-18 \
  --filter anime_sample_1

# Only the preference tasks, all six models
python scripts/run_all_experiments.py --filter anime book movie steam
```

Each stage has a `--skip-*` flag and a per-stage model list. Stages 2-4
read their critiques from a main_results/ directory; if you skip stage 1
you must point at an existing one (the shipped logs work fine).
```bash
# Just the main accuracy + suggestibility stage
python scripts/run_all_experiments.py \
  --skip-vary-k --skip-train-size --skip-refl-parts

# Just the train-size ablation against the shipped main_results
python scripts/run_all_experiments.py \
  --skip-main --skip-vary-k --skip-refl-parts \
  --source-main-results-dir logs/main_results

# One model, one ablation
python scripts/run_all_experiments.py \
  --skip-main --skip-vary-k --skip-refl-parts \
  --ablation-models gpt-4o-mini-2024-07-18 \
  --source-main-results-dir logs/main_results
```

The four runners can also be invoked directly if you want full control — see `python scripts/<runner>.py --help`. The relevant flags:
```bash
python scripts/run_pipeline.py \
  --dataset data_samples/anime_sample_1 \
  --model gpt-4o-mini-2024-07-18 \
  --output-dir logs/my_run              # writes logs/my_run/<model>/<ds>/

python scripts/run_ablation_train_size.py \
  --model gpt-4o-mini-2024-07-18 \
  --source-dir logs/main_results \      # where to read critiques from
  --output-dir logs/my_run/train_size_ablation

python scripts/run_ablation_reflection_parts.py \
  --model gpt-4o-mini-2024-07-18 \
  --source-dir logs/main_results \
  --output-dir logs/my_run/reflection_parts_ablation

python scripts/run_ablation_vary_k.py \
  --log logs/my_run/<model>/<dataset> \ # appends K=1/3/10 columns to its test.csv
  --model gpt-4o-mini-2024-07-18
```

If you only re-ran some stages, the analyzers can pull each component from a different directory. For example, reproduce the train-size ablation locally but read main_results from the shipped logs:
```bash
python scripts/reproduce_paper_tables.py \
  --main-results-dir logs/main_results \
  --train-size-dir logs/<wallclock>/train_size_ablation \
  --refl-parts-dir logs/reflection_parts_ablation

python scripts/analyze_logs_ablations.py logs \
  --train-size-dir logs/<wallclock>/train_size_ablation
```

| Paper artifact | Stage that produces it | Section in reproduce_paper_tables.py |
|---|---|---|
| Table 1 (per-model accuracy; subtables 1a non-reasoning, 1b reasoning) | stage 1 (main pipeline) | --section accuracy |
| Table 2 (aggregate gain): paired 95% CIs and sign-test p-values | stage 1 | --section aggregate |
| Table 3: output-token usage for reasoning models | stage 1 | --section tokens |
| Table 4 (suggestibility): truth/lying accuracy and S | stage 1 (consistency step) | --section suggestibility |
| Table 5 (detailed cost analysis): training tokens and break-even points | stage 1 | --section cost |
| Table 6 (train-size ablation) | stage 3 (run_ablation_train_size.py) | --section train_size |
| Table 7 (reflection-parts ablation) | stage 4 (run_ablation_reflection_parts.py) | --section refl_parts |
| Table 8 (vary-K, gpt-4o-mini) | stage 2 (run_ablation_vary_k.py) | --section vary_k |
| Table 9 (timeout rate): token-limit-exceeded rate for reasoning models | stage 1 | --section tokens (printed after token usage table) |
| Tables 10–13 (per-user preference accuracy, one table per domain) | stage 1 | --section per_user |
| Narrative numbers from §4.3–4.5 (mean gains, token reductions, Spearman) | stage 1 | --section summary |
Wall-clock and token usage vary substantially with model and dataset. Exact
per-run numbers are in time_stats.json and the usage* columns of each
test.csv in logs/. Rough orientation from the shipped runs:
- gpt-4o-mini, full pipeline on one preference dataset user (~25 train + 25 test): under 30 minutes, ~$0.20 API spend.
- Full paper reproduction from scratch:
  - Main results only (6 models × 43 datasets, i.e. 3 single-split datasets plus 4 preference domains × 10 users): ~350 hours, ~$400.
  - With all ablations: ~400 hours, ~$600 total.
Per-stage token usage (one-time train + per-question inference) and the
break-even point of each critique method against EP_LABEL are printed by
python scripts/reproduce_paper_tables.py --section cost.
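For intuition, one plausible way to think about a break-even point is shown below; the real numbers and definition come from the `--section cost` output, and this amortization formula is an assumption, not the paper's exact accounting.

```python
# Hypothetical back-of-the-envelope break-even sketch: a critique method pays a
# one-time training-token cost and may change per-question inference cost
# relative to EP_LABEL. Not the paper's exact definition.
def break_even_questions(extra_train_tokens: float,
                         inference_tokens_ep_label: float,
                         inference_tokens_critique: float) -> float:
    """Test questions needed before the one-time training cost is offset
    by per-question inference savings over EP_LABEL."""
    per_question_saving = inference_tokens_ep_label - inference_tokens_critique
    if per_question_saving <= 0:
        return float("inf")  # never breaks even on tokens alone
    return extra_train_tokens / per_question_saving
```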
- `Error: token limit exceeded.` rows will appear in thinking-model outputs (`gpt-5`, `gpt-oss-20b`, `qwen3-vl-235b-a22b-thinking`). These are not parsing errors: the model ran past the 8192-token budget without emitting its JSON answer. They are intentionally counted as wrong in the accuracy calculation; reducing them is itself one of the reported benefits of critique-based methods (see the paper's section "Critique-based Memory for Reasoning Models").
- Parse failures return a random wrong answer by design (see `parse_response`, sketched after this list). This prevents malformed JSON from silently boosting accuracy. The `responses*` column preserves the raw output for inspection.
- The first run builds the FAISS + sentence-transformer index. The first call to `pipeline.train(..., train_rag=True)` downloads the embedding model and builds the index over the training prompts. Subsequent runs reuse the cached model.
- Exact LLM responses may vary between runs even with the temperature set to zero, because hosted-API inference is not fully deterministic. Re-run accuracies may therefore differ slightly, particularly for small datasets (such as individual preference-dataset users).
The shipped `data_samples/` contains the paper's exact splits and is what every experiment in this repo consumes. See `data_samples/README.md` for the upstream source and the exact transformation applied to each dataset.