Homepage | Dataset | ArXiv | Leaderboard
This repository contains the dataset and evaluation code for the paper "Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements".
Existing benchmarks curate evaluation data at the question level: collecting questions from textbooks or websites. This paradigm suffers from three key limitations: vulnerability to data contamination, restriction to single-concept assessment, and reliance on costly domain expert annotation. We introduce Encyclo-K, which rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them.
We compose statements into questions via random sampling, yielding two key properties: (1) Dynamic: the combinatorial space is too vast to memorize, with experiments across multiple random seeds confirming stable model rankings; (2) Multi-statement: each question requires joint comprehension of multiple statements, posing greater challenges than single-concept questions.
- Dynamic Evaluation: We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. The combinatorial space is too vast to memorize, enabling reliable periodic dataset refresh.
- Multi-Statement Comprehension: Each question aggregates 8–10 statements for comprehensive multi-knowledge assessment, going beyond what single-statement questions can probe.
- Cost-Effective Annotation: Annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs.
- Contamination Resistance: Even if individual statements appear in training data, their compositions form a combinatorial space too vast to memorize.
The dataset comprises 5,038 questions across 11 disciplines, 44 fields, and 62 subfields. The disciplinary distribution is proportional to statement ratios: Science has the most questions (1,242, 24.7%), while Philosophy has the fewest (61, 1.2%). Each question contains 8–10 statements, 4–8 options, and 2–4 combinations.
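To get a feel for why composed questions resist memorization, a back-of-the-envelope calculation shows how quickly the number of possible 8–10-statement combinations explodes. The pool size below is purely illustrative, not the benchmark's actual statement count:

```python
import math

# Hypothetical pool of standalone knowledge statements; 5000 is
# illustrative only -- the real pool size is not stated here.
pool_size = 5000

# Each Encyclo-K question samples 8-10 statements from the pool.
combinations = sum(math.comb(pool_size, k) for k in range(8, 11))

# Even before counting option orderings, the space is astronomically
# large -- far beyond what any training corpus could enumerate.
print(f"{combinations:.3e}")
```

Even with a modest pool, the count exceeds 10^25 combinations, which is the sense in which the combinatorial space is "too vast to memorize."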
To install the required packages, run:
```shell
# Clone the repository
git clone <repository-url>
cd Encyclo-K

# Install dependencies
pip install -r requirements.txt
```

You can directly perform inference on selected models using the following commands:
```shell
export PYTHONPATH=$(pwd)

# Local model inference
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode zero-shot --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --batch_size <BATCH_SIZE> --use_accel --index <INDEX> --world_size <WORLD_SIZE>

# API model inference
python infer/infer.py --config <CONFIG_PATH> --split <TASKS> --mode zero-shot --model_name <MODEL_NAME> --output_dir <OUTPUT_DIR> --num_worker <NUM_WORKERS> --index <INDEX> --world_size <WORLD_SIZE>
```

For example:

```shell
export PYTHONPATH=$(pwd)

# Local chat model with vLLM acceleration
python infer/infer.py --config config/config_default.yaml --split encyclo-k_all --mode zero-shot --model_name DeepSeek-V3-0324 --output_dir results/encyclo-k_all --batch_size 32 --use_accel --index 0 --world_size 1

# API reasoning model
python infer/infer.py --config config/config_reasoning_models.yaml --split encyclo-k_all --mode zero-shot --model_name DeepSeek-R1 --output_dir results/encyclo-k_all --num_worker 32 --index 0 --world_size 1
```

- If inference is unexpectedly interrupted, a temporary `.jsonl.tmp` file will be saved. You can directly rerun the command to resume from the last checkpoint.
- After inference is complete, check the `response` field in the saved JSONL file. If it contains an `error` field, you can rerun the command to re-infer the failed samples.
- `--model_name`: This parameter must match the filenames in the `infer/models` directory.
- Adding a custom model: update the configuration in `__init__.py` to include the new model.
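As a rough sketch of what registering a model in `infer/models/__init__.py` might look like: the registry name, loader signatures, and return values below are assumptions for illustration, not the repository's actual API, so check the existing entries in that file for the real pattern:

```python
# Hypothetical sketch of a model registry; the real names and loader
# signatures in infer/models/__init__.py may differ.

def load_deepseek_v3(config):
    """Placeholder loader: build and return a client/engine for this model."""
    return {"model": "DeepSeek-V3-0324", "config": config}

def load_my_custom_model(config):
    """Loader for a newly added model, keyed by its --model_name value."""
    return {"model": "My-Custom-Model", "config": config}

# The CLI's --model_name argument is looked up against keys like these.
MODEL_REGISTRY = {
    "DeepSeek-V3-0324": load_deepseek_v3,
    "My-Custom-Model": load_my_custom_model,
}

def get_model(model_name, config=None):
    """Resolve a --model_name string to an instantiated model."""
    if model_name not in MODEL_REGISTRY:
        raise KeyError(f"Unknown model: {model_name}")
    return MODEL_REGISTRY[model_name](config)
```

The key point is that the string passed to `--model_name` must resolve to an entry alongside the existing model files, which is why it has to match the filenames in `infer/models`.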
After completing inference, proceed with answer parsing and evaluation:
```shell
export PYTHONPATH=$(pwd)

# Evaluate results
python eval/eval.py --evaluate_all --excel_output --json_output --output_dir results/encyclo-k_all --save_dir results_with_status/encyclo-k_all --split encyclo-k_all
```

| Parameter | Description |
|---|---|
| `--evaluate_all` | Evaluate all result files in the output directory |
| `--model_name` | Specific model(s) to evaluate |
| `--split` | Data split name |
| `--mode` | Evaluation modes (default: zero-shot, five-shot) |
| `--output_dir` | Directory containing inference results |
| `--save_dir` | Directory to save evaluation results with status |
| `--excel_output` | Generate Excel report with detailed results |
| `--json_output` | Generate JSON file with detailed results |
- Excel Report: Contains accuracy, error rate, and miss rate across disciplines, fields, and subfields
- JSON Results: Detailed results including per-category performance
- JSONL with Status: Original data augmented with `extracted_answer` and `status` fields
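A minimal sketch of consuming the status-augmented JSONL follows. Only `extracted_answer` and `status` are documented above; the `question_id` field, the status values (`correct`/`error`/`miss`), and the file layout are assumptions here, chosen to mirror the accuracy/error/miss metrics in the Excel report:

```python
import io
import json

# Simulated lines from a results_with_status JSONL file; in practice,
# open the file written by eval/eval.py. Status values are assumed.
sample = io.StringIO(
    '{"question_id": "q1", "extracted_answer": "AB", "status": "correct"}\n'
    '{"question_id": "q2", "extracted_answer": "C", "status": "error"}\n'
    '{"question_id": "q3", "extracted_answer": null, "status": "miss"}\n'
)

records = [json.loads(line) for line in sample]
total = len(records)
acc = sum(r["status"] == "correct" for r in records) / total
miss = sum(r["status"] == "miss" for r in records) / total
print(f"accuracy={acc:.2%} miss_rate={miss:.2%}")
```

This kind of one-record-per-line parsing is also handy for spot-checking individual failures before rerunning inference.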
Each question aggregates 8–10 knowledge statements, requiring models to jointly comprehend multiple knowledge points rather than perform isolated factual recall. This design introduces significant cognitive complexity beyond simple statement-level verification.
Encyclo-K supports dynamic question generation by varying random seeds that control statement selection and combination. Model rankings remain highly consistent across different question sets, confirming that the combinatorial design creates a vast question space resistant to memorization-based shortcuts. This enables periodic dataset refresh to prevent overfitting.
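The seed-controlled composition described above can be sketched as follows. The statement pool, sample sizes, and function shape are illustrative simplifications, not the repository's actual generation code:

```python
import random

# Illustrative statement pool; the real benchmark draws textbook-derived
# statements from within a subfield.
statements = [f"statement_{i}" for i in range(1000)]

def compose_question(seed, k_min=8, k_max=10):
    """Deterministically sample 8-10 statements for one question."""
    rng = random.Random(seed)          # seed fixes the entire draw
    k = rng.randint(k_min, k_max)      # vary statement count per question
    return rng.sample(statements, k)   # sample without replacement

# Same seed reproduces the same question; a new seed yields a fresh,
# never-before-seen composition from the same statement pool.
print(compose_question(seed=42)[:3])
```

Because the seed fully determines each question, a dataset refresh is just a new batch of seeds, while reruns with old seeds stay exactly reproducible.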
We evaluate 50+ LLMs on Encyclo-K. The benchmark poses substantial challenges with strong discriminative power:
| Model Type | Best Model | Accuracy | Range |
|---|---|---|---|
| Chat | Qwen3-235B-A22B-Instruct | 50.40% | 9.71% – 50.40% |
| Reasoning | OpenAI-GPT-5.1-high | 62.07% | 16.04% – 62.07% |
For the complete leaderboard and more model results, please visit our Homepage.
Despite multiple rounds of manual review, a small number of errors may remain in the dataset. If you find any, please post the `question_id` and statement index on the Issues page, and we will make the corresponding corrections. Our team is committed to the long-term maintenance of this dataset to ensure its quality!
If you find Encyclo-K useful in your research, please cite our paper:
@article{liang2025encyclo0k0,
title = {Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements},
author = {Yiming Liang and Yizhi Li and Yantao Du and Ge Zhang and Jiayi Zhou and Yuchen Wu and Yinzhu Piao and Denghui Cao and Tong Sun and Ziniu Li and Li Du and Bo Lei and Jiaheng Liu and Chenghua Lin and Zhaoxiang Zhang and Wenhao Huang and Jiajun Zhang},
year = {2025},
  journal = {arXiv preprint arXiv:2512.24867}
}

We thank the contributors and the open-source community for their valuable support. The evaluation pipeline of this project is built upon several excellent open-source benchmarks, including MMLU, MMLU-Pro, GPQA, and SuperGPQA.
This project is licensed under the terms specified in the LICENSE file.



