FidBench introduces a new evaluation framework to address the gap between reported scores and practical utility in Large Language Model (LLM) compression. We dispense with proxy metrics like perplexity and curated benchmarks, and instead directly measure a compressed model's generative faithfulness to its uncompressed counterpart on real-world user queries.
- 🎯 Direct Faithfulness Measurement: Instead of relying on proxy metrics (e.g., PPL, MMLU), FidBench directly evaluates how well a compressed model replicates the original model's generative behavior.
- 💡 Conditional Generation Accuracy (CGA): A novel metric that employs a teacher-forcing paradigm to assess next-token prediction accuracy, effectively avoiding the cascading errors that confound traditional text-similarity measures.
- 🌍 Real-World Data: The evaluation is grounded in a dataset of diverse, open-ended user queries sourced from ShareGPT, reflecting practical use cases rather than synthetic benchmarks.
- 🔬 Granular Analysis: The framework supports fine-grained analysis by categorizing prompts into domains (e.g., Code, Math, Law) and stratifying them by context length (up to 24K tokens).
- 🔄 Transparent & Reproducible: The benchmark code is open-sourced to promote transparent and reproducible progress in the field of LLM compression.
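The teacher-forcing idea behind CGA can be sketched as follows. This is a minimal illustration, not FidBench's actual implementation; `predict_next` is a hypothetical stand-in for the compressed model's greedy next-token prediction:

```python
def conditional_generation_accuracy(prompt, reference, predict_next):
    """Teacher-forced next-token accuracy of a compressed model.

    prompt:       prompt token ids
    reference:    continuation token ids produced by the ORIGINAL model
    predict_next: callable mapping a prefix of token ids to the
                  compressed model's greedy next-token prediction
    """
    if not reference:
        return 1.0
    matches = 0
    for i, ref_token in enumerate(reference):
        # Teacher forcing: always condition on the original model's
        # prefix, so an early mismatch cannot cascade into later steps.
        prefix = prompt + reference[:i]
        if predict_next(prefix) == ref_token:
            matches += 1
    return matches / len(reference)
```

Under this sketch, a perfectly faithful compressed model scores 1.0, while free-running text-similarity measures would penalize every token after a single early divergence.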
The following table summarizes the average Conditional Generation Accuracy (CGA) scores for various compression methods applied to the Qwen2.5-Instruct model family. A higher CGA score indicates greater faithfulness to the original model.
| Compression Method | Qwen2.5-7B | Qwen2.5-14B | Qwen2.5-32B |
|---|---|---|---|
| **Low-Precision Attn** | | | |
| SageAttention | 0.986 | 0.987 | 0.990 |
| Top-10% Sparse Attn | 0.943 | 0.950 | 0.960 |
| FlashAttention FP8 | 0.580 | 0.965 | 0.977 |
| **INT4 Quantization** | | | |
| GPTQ | 0.921 | 0.931 | 0.947 |
| AWQ | 0.909 | 0.909 | 0.939 |
| **Pruning (50%)** | | | |
| SparseGPT | 0.820 | 0.788 | N/A |
| Wanda | 0.780 | 0.795 | 0.835 |
| **KV Cache Dropping** | | | |
| SnapKV | 0.600 | 0.275 | 0.181 |
| H2O | 0.571 | 0.497 | 0.535 |
This guide outlines the steps to evaluate compressed models using the FidBench framework.
- Environment: Set up the necessary environments for the compression methods you wish to evaluate. See the Environment Setup guide for detailed instructions.
- Configs: Ensure your model configuration files (`.json`) are correctly placed within the `runs/` directory.
To calculate the Conditional Generation Accuracy (CGA), you must first generate outputs from both the baseline (uncompressed) model and the compressed models. The `pred.sh` script handles this process. It takes one or more compression methods as arguments.
```bash
# Usage: ./pred.sh <method1> [<method2>] [...]
# Example for evaluating AWQ and GPTQ
./pred.sh awq gptq
```

This script will iterate through all base models (`qwen2.5-7b`, `qwen2.5-14b`, `qwen2.5-32b`) and run predictions for the baseline and each specified compression method.
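The iteration the script performs can be pictured as the following run enumeration. This is an illustrative sketch, not the script itself; `"baseline"` denotes the uncompressed model:

```python
MODELS = ["qwen2.5-7b", "qwen2.5-14b", "qwen2.5-32b"]

def build_runs(methods):
    """Enumerate (model, method) prediction runs: each base model is
    evaluated with the uncompressed baseline plus every requested method."""
    return [(model, method)
            for model in MODELS
            for method in ["baseline", *methods]]
```

For example, requesting `awq` and `gptq` yields 9 runs: 3 models × (baseline + 2 methods).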
For comparison, you can also evaluate the models using traditional proxy metrics like Perplexity and standard QA benchmarks (e.g., MMLU).
1. Perplexity (PPL)
Use the `perplexity.sh` script to calculate the Perplexity score on the WikiText-2 dataset.
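As a refresher, perplexity is the exponentiated average negative log-likelihood a model assigns to each token of a sequence; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity over a sequence, given the per-token log-probabilities
    (natural log) that the model assigned to the observed tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

For instance, a model that assigns probability 0.5 to every observed token has perplexity 2.0. Note that perplexity is a property of one model in isolation, which is precisely why it can miss divergence from the uncompressed model that CGA captures.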
```bash
# Usage: ./perplexity.sh <method1> [<method2>] [...]
# Example for evaluating AWQ, GPTQ, and SnapKV
./perplexity.sh awq gptq snapkv
```

2. QA Benchmarks (lm-eval)
Use the `lmeval.sh` script to run downstream QA benchmark evaluations.
```bash
# Usage: ./lmeval.sh <method1> [<method2>] [...]
# Example for evaluating SparseGPT and Wanda
./lmeval.sh sparsegpt wanda
```

For more detailed information, please refer to the following documents:
- 📄 Adding a New Model: A guide on how to integrate and evaluate a new compressed model within the framework.
- 📄 Dataset Overview: An overview of the dataset structure, categories, and data sources.
- 📄 Environment Setup: Instructions for setting up the required environment and dependencies for each compression method.