Skip to content

multimodal-art-projection/Encyclo-K

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

6 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

๐ŸŒ Homepage | ๐Ÿค— Dataset | ๐Ÿ“– ArXiv | ๐Ÿ† Leaderboard

This repository contains the dataset and evaluation code for the paper "Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements".


๐Ÿ”” Introduction

Encyclo-K Overview

Existing benchmarks curate evaluation data at the question levelโ€”collecting questions from textbooks or websites. This paradigm suffers from three key limitations: vulnerability to data contamination, restriction to single-concept assessment, and reliance on costly domain expert annotation. We introduce Encyclo-K, which rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them.

We compose statements into questions via random sampling, yielding two key properties: (1) Dynamicโ€”the combinatorial space is too vast to memorize, with experiments across multiple random seeds confirming stable model rankings; (2) Multi-statementโ€”each question requires joint comprehension of multiple statements, posing greater challenges than single-concept questions.

Key Features

  • Dynamic Evaluation: We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. The combinatorial space is too vast to memorize, enabling reliable periodic dataset refresh.
  • Multi-Statement Comprehension: Each question aggregates 8-10 statements for comprehensive multi-knowledge assessment, going beyond what single-statement questions can probe.
  • Cost-Effective Annotation: Annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs.
  • Contamination Resistance: Even if individual statements appear in training data, their compositions form a combinatorial space too vast to memorize.

Question Distribution

The dataset comprises 5,038 questions across 11 disciplines, 44 fields, and 62 subfields. The disciplinary distribution is proportional to statement ratios: Science has the most questions (1,242, 24.7%), while Philosophy has the fewest (61, 1.2%). Each question contains 8โ€“10 statements, 4โ€“8 options, and 2โ€“4 combinations.

Question Distribution

โš™๏ธ Installation

To install the required packages, run:

# Clone the repository
git clone <repository-url>
cd Encyclo-K

# Install dependencies
pip install -r requirements.txt

๐Ÿง  Inference

You can directly perform inference on selected models using the following command:

export PYTHONPATH=$(pwd)

# Local model inference
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode zero-shot --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --batch_size <BATCH_SIZE> --use_accel --index <INDEX> --world_size <WORLD_SIZE>

# API model inference
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode zero-shot --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --num_worker <NUM_WORKERS> --index <INDEX> --world_size <WORLD_SIZE>

Quick Start Examples

export PYTHONPATH=$(pwd)

# Local chat model with vLLM acceleration
python infer/infer.py --config config/config_default.yaml --split encyclo-k_all --mode zero-shot --model_name DeepSeek-V3-0324 --output_dir results/encyclo-k_all --batch_size 32 --use_accel --index 0 --world_size 1

# API reasoning model
python infer/infer.py --config config/config_reasoning_models.yaml --split encyclo-k_all --mode zero-shot --model_name DeepSeek-R1 --output_dir results/encyclo-k_all --num_worker 32 --index 0 --world_size 1

๐Ÿ“ Notes

  • If inference is unexpectedly interrupted, a temporary file .jsonl.tmp will be saved. You can directly rerun the command to resume from the last checkpoint.
  • After inference is complete, check the response field in the saved JSONL file. If it contains an error field, you can rerun the command to re-infer failed samples.

๐Ÿ› ๏ธ Run Custom Model

  1. --model_name: This parameter must align with the filenames in the infer/models directory.
  2. Adding a Custom Model: Update the configuration in __init__.py to include the new model

โญ Evaluation

After completing inference, proceed with answer parsing and evaluation:

export PYTHONPATH=$(pwd)

# Evaluate results
python eval/eval.py --evaluate_all --excel_output --json_output --output_dir results/encyclo-k_all --save_dir results_with_status/encyclo-k_all --split encyclo-k_all

Evaluation Parameters

Parameter Description
--evaluate_all Evaluate all result files in the output directory
--model_name Specific model(s) to evaluate
--split Data split name
--mode Evaluation modes (default: zero-shot, five-shot)
--output_dir Directory containing inference results
--save_dir Directory to save evaluation results with status
--excel_output Generate Excel report with detailed results
--json_output Generate JSON file with detailed results

Output Files

  • Excel Report: Contains accuracy, error rate, and miss rate across disciplines, fields, and subfields
  • JSON Results: Detailed results including per-category performance
  • JSONL with Status: Original data augmented with extracted_answer and status fields

๐Ÿ“Š Benchmark Characteristics

Key Findings

Multi-statement Challenge Random Seed Stability

1. Multi-Statement Comprehensive Assessment

Each question aggregates 8โ€“10 knowledge statements, requiring models to jointly comprehend multiple knowledge points rather than isolated factual recall. This design introduces significant cognitive complexity beyond simple statement-level verification.

2. Dynamic Question Generation

Encyclo-K supports dynamic question generation by varying random seeds that control statement selection and combination. Model rankings remain highly consistent across different question sets, confirming that the combinatorial design creates a vast question space resistant to memorization-based shortcuts. This enables periodic dataset refresh to prevent overfitting.

๐Ÿ“ˆ Experimental Results

We evaluate 50+ LLMs on Encyclo-K. The benchmark poses substantial challenges with strong discriminative power:

Model Type Best Model Accuracy Range
Chat Qwen3-235B-A22B-Instruct 50.40% 9.71% โ€“ 50.40%
Reasoning OpenAI-GPT-5.1-high 62.07% 16.04% โ€“ 62.07%

๐Ÿ‘‰ For complete leaderboard and more model results, please visit our Homepage.

๐Ÿ› ๏ธ Dataset Maintenance

Despite multiple rounds of manual review, there may still be a small number of errors in the dataset. If you find any, please paste the question_id and statement index to the Issues page, and we will make the corresponding corrections. Our team is committed to long-term maintenance of this dataset to ensure its quality!

๐Ÿ“š Citation

If you find Encyclo-K useful in your research, please cite our paper:

@article{liang2025encyclo0k0,
  title   = {Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements},
  author  = {Yiming Liang and Yizhi Li and Yantao Du and Ge Zhang and Jiayi Zhou and Yuchen Wu and Yinzhu Piao and Denghui Cao and Tong Sun and Ziniu Li and Li Du and Bo Lei and Jiaheng Liu and Chenghua Lin and Zhaoxiang Zhang and Wenhao Huang and Jiajun Zhang},
  year    = {2025},
  journal = {arXiv preprint arXiv: 2512.24867}
}

๐Ÿค Acknowledgements

We thank the contributors and the open-source community for their valuable support. The evaluation pipeline of this project is built upon several excellent open-source benchmarks, including MMLU, MMLU-Pro, GPQA, and SuperGPQA.

๐Ÿ“œ License

This project is licensed under the terms specified in the LICENSE file.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors