snumprlab/lwe


Becoming Experienced Judges: Selective Test-Time Learning for Evaluators (EACL 2026)

Official implementation of the paper "Becoming Experienced Judges: Selective Test-Time Learning for Evaluators" (EACL 2026 [Short-Oral]).

| Method | Description | Module |
| --- | --- | --- |
| Vanilla | Fixed evaluation prompt. | methods/vanilla.py |
| Sample-Specific Prompt (SSP) | Static meta-prompt for all samples. Per-sample generated rubric, then judge. | methods/ssp.py |
| 🌟 LWE | Evolving meta-prompt and per-sample generated rubric. | methods/lwe.py |
| 🌟🌟 Selective LWE | Vanilla consistency check first -> LWE pipeline only on inconsistent samples. | methods/selective_lwe.py |
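
The routing idea behind Selective LWE can be sketched as follows. This is a minimal illustration, not the code in methods/selective_lwe.py. It assumes the consistency check means judging each pair in both presentation orders and comparing the verdicts; `vanilla_judge` and `lwe_judge` are hypothetical callables, and the meta-prompt evolution that LWE performs is omitted.

```python
from typing import Callable, Dict

Sample = Dict[str, str]
Judge = Callable[[Sample], str]  # hypothetical: returns "Output1" or "Output2"


def swap(sample: Sample) -> Sample:
    """Return a copy with the two candidate responses exchanged."""
    swapped = dict(sample)
    swapped["Output1"], swapped["Output2"] = sample["Output2"], sample["Output1"]
    return swapped


def unswap(pred: str) -> str:
    """Map a verdict on the swapped pair back to the original labels."""
    return "Output2" if pred == "Output1" else "Output1"


def selective_judge(sample: Sample, vanilla_judge: Judge, lwe_judge: Judge) -> Dict[str, object]:
    """Run the cheap judge in both orders; escalate only if the verdicts disagree."""
    pred = vanilla_judge(sample)
    swap_pred = unswap(vanilla_judge(swap(sample)))
    consistent = pred == swap_pred
    if not consistent:
        # Assumed behavior: inconsistent samples go to the heavier LWE pipeline.
        pred = lwe_judge(sample)
    return {"pred": pred, "consistent": consistent, "used_lwe": not consistent}


if __name__ == "__main__":
    sample = {"Text": "q", "Output1": "short", "Output2": "a much longer answer"}
    position_biased = lambda s: "Output1"  # always prefers the first slot
    prefers_longer = lambda s: "Output1" if len(s["Output1"]) >= len(s["Output2"]) else "Output2"
    print(selective_judge(sample, position_biased, prefers_longer))
```

A position-biased judge disagrees with itself once the pair is swapped, so the sample is escalated; a judge whose verdict survives the swap is trusted as-is, which is what keeps the expensive pipeline off the easy samples.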

Prompts are in the prompts/ directory.

Setup

cd lwe
pip install -r requirements.txt

Set API keys (add to .env or export directly):

export OPENAI_API_KEY=...       # for GPT models
export GEMINI_API_KEY=...       # for Gemini models
export ANTHROPIC_API_KEY=...    # for Claude models

Data Preparation

python data/scripts/download_vlrewardbench.py --output_dir data/vlrewardbench

Saves data/vlrewardbench/vlrewardbench_test.jsonl and images under data/vlrewardbench/images/.

python data/scripts/download_mmrewardbench.py --output_dir data/mmrewardbench

Creates data/mmrewardbench/mmrewardbench_test.jsonl, containing a random sample of 1000 examples (excluding Hateful Memes), and saves images to data/mmrewardbench/images/.

Options:

# Reproduce the exact 1000 examples used in the paper (recommended)
python data/scripts/download_mmrewardbench.py --paper_ids data/scripts/mmrewardbench_paper_ids.json

# Keep all examples (no subsampling)
python data/scripts/download_mmrewardbench.py --no_subsample

# Custom sample size
python data/scripts/download_mmrewardbench.py --sample_size 500

Data format

Each row in the JSONL must contain:

| Field | Type | Description |
| --- | --- | --- |
| ID | str | Unique sample identifier |
| Text | str | Question / instruction |
| Output1 | str | First candidate response |
| Output2 | str | Second candidate response |
| Better | str | "Output1" or "Output2": the ground-truth preferred response |
| Image | str or null | Path to image file, or null for text-only |

To run on your own data, provide a JSONL file in the format above and point dataset.data_path in your config YAML to it.
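
As a rough sketch of preparing such a file, the snippet below validates rows against the fields listed above and writes them as JSONL. The helper names `validate_row` and `write_jsonl` are hypothetical and not part of this repository, and the sketch assumes the `Image` key must be present even when its value is null.

```python
import json

REQUIRED_FIELDS = ("ID", "Text", "Output1", "Output2", "Better", "Image")


def validate_row(row: dict) -> None:
    """Raise ValueError if a row does not match the documented format."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if row["Better"] not in ("Output1", "Output2"):
        raise ValueError("Better must be 'Output1' or 'Output2'")
    if row["Image"] is not None and not isinstance(row["Image"], str):
        raise ValueError("Image must be a path string or null")


def write_jsonl(rows, path: str) -> None:
    """Validate each row and write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            validate_row(row)
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    rows = [
        {
            "ID": "custom-0001",
            "Text": "Which caption describes the image more accurately?",
            "Output1": "A dog running on a beach.",
            "Output2": "A cat sleeping on a sofa.",
            "Better": "Output1",
            "Image": "images/custom-0001.jpg",  # illustrative path
        },
        {
            "ID": "custom-0002",
            "Text": "Summarize the paragraph in one sentence.",
            "Output1": "A short summary.",
            "Output2": "An off-topic reply.",
            "Better": "Output1",
            "Image": None,  # text-only sample, serialized as null
        },
    ]
    write_jsonl(rows, "my_dataset.jsonl")
```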

Run

Run from the lwe/ directory.

GPT

python judge.py --config configs/gpt_vanilla_vl.yaml # method: vanilla, dataset: VL-Rewardbench
python judge.py --config configs/gpt_vanilla_mm.yaml # method: vanilla, dataset: Multimodal Rewardbench
python judge.py --config configs/gpt_ssp_vl.yaml
python judge.py --config configs/gpt_lwe_vl.yaml
python judge.py --config configs/gpt_selective_lwe_vl.yaml

Gemini

python judge.py --config configs/gemini_selective_lwe.yaml

Claude

python judge.py --config configs/claude_selective_lwe.yaml

You can also override the method, seed, or model set in your YAML config directly from the command line:

python judge.py --config configs/gemini_lwe.yaml --method selective_lwe --seed 42 --model gemini-2.5-flash
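
One common way to implement this kind of precedence is to let a command-line value win only when the flag was actually passed. The sketch below illustrates that pattern and is not the logic in judge.py: the flat `method` / `seed` / `model` config keys and both function names are assumptions made for the example.

```python
import argparse


def parse_overrides(argv):
    """Parse the flags shown above; unset overrides stay None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True)
    parser.add_argument("--method", default=None)
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--model", default=None)
    return parser.parse_args(argv)


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of config where explicitly passed flags replace YAML values."""
    merged = dict(config)
    for key in ("method", "seed", "model"):  # assumed top-level keys
        value = getattr(args, key)
        if value is not None:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Stand-in for a loaded YAML config; the values are illustrative.
    config = {"method": "lwe", "seed": 0, "model": "gemini-2.5-flash"}
    args = parse_overrides(
        ["--config", "configs/gemini_lwe.yaml", "--method", "selective_lwe", "--seed", "42"]
    )
    print(apply_overrides(config, args))
```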

Outputs

Results are written under out_dir (default runs/), one subdirectory per run:

runs/<method>_<model>_<timestamp>/
  config.yaml                  — copy of the run config
  <dataset>.jsonl              — per-sample results
  cumulative_metrics.jsonl     — accuracy / consistency after each batch
  meta_prompt_v0_initial.txt   — (LWE / Selective-LWE) initial meta-prompt
  meta_prompt_final.txt        — (LWE / Selective-LWE) final meta-prompt
  meta_prompt_snapshots/       — (LWE / Selective-LWE) per-batch snapshots

Per-sample fields include pred, acc, swap_pred, swap_acc, consistency, pair_acc where applicable.
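
To get run-level numbers from the per-sample file, something like the following may help. It is a sketch under assumptions: it treats the listed fields as numeric or boolean values, skips rows where a field is absent or null, and the `summarize` helper is hypothetical rather than part of utils/.

```python
import json
from collections import defaultdict


def summarize(path: str, fields=("acc", "swap_acc", "consistency", "pair_acc")) -> dict:
    """Average each metric over the rows of a per-sample results JSONL file."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            for field in fields:
                value = row.get(field)
                if value is None:
                    continue  # field not applicable for this sample or method
                totals[field] += float(value)
                counts[field] += 1
    # Only report metrics that appeared in at least one row.
    return {field: totals[field] / counts[field] for field in fields if counts[field]}


# Example call; replace the placeholders with an actual run directory and dataset name:
# print(summarize("runs/<method>_<model>_<timestamp>/<dataset>.jsonl"))
```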

Code Structure

lwe/
  judge.py                     # entrypoint
  configs/                     # example YAMLs (vanilla/ssp/lwe/selective_lwe × gpt/gemini/claude)
  data/
    scripts/
      download_vlrewardbench.py  # HuggingFace download: MMInstruction/VL-RewardBench
      download_mmrewardbench.py  # HuggingFace download: syhuggingface/multimodal_rewardbench
    vlrewardbench/             # downloaded data
    mmrewardbench/             # downloaded data
  models/
    base.py                    # BaseModel interface
    gpt.py                     # OpenAI GPT
    gemini.py                  # Google Gemini
    claude.py                  # Anthropic Claude
  methods/
    vanilla.py
    ssp.py
    lwe.py
    selective_lwe.py
  prompts/
    vanilla.py                 # vanilla judge prompt
    lwe_prompts.py             # SSP / LWE prompt templates
  utils/
    dataset.py                 # PairwiseDataset + dataloader
    utils.py
  requirements.txt

Citation

If you find this work useful, please cite:

@inproceedings{jwa-etal-2026-becoming,
  title = {Becoming Experienced Judges: Selective Test-Time Learning for Evaluators},
  author = {Jwa, Seungyeon and Ahn, Daechul and Kim, Reokyoung and Kang, Dongyeop and Choi, Jonghyun},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)},
  year = {2026},
  address = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2026.eacl-short.50/},
  pages = {697--721}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

All datasets and codebases are used strictly for non-commercial academic research in compliance with their respective licenses. Users of this repository are responsible for ensuring compliance with the original dataset licenses.
