FidBench introduces a new evaluation framework to address the gap between reported scores and practical utility in Large Language Model (LLM) compression. We dispense with proxy metrics like perplexity and curated benchmarks, and instead directly measure a compressed model's generative faithfulness to its uncompressed counterpart on real-world user queries.
- 🎯 Direct Faithfulness Measurement: Instead of relying on proxy metrics (e.g., PPL, MMLU), FidBench directly evaluates how well a compressed model replicates the original model's generative behavior.
- 💡 Conditional Generation Accuracy (CGA): A novel metric that employs a teacher-forcing paradigm to assess next-token prediction accuracy, effectively avoiding the cascading errors that confound traditional text-similarity measures.
- 🌍 Real-World Data: The evaluation is grounded in a dataset of diverse, open-ended user queries sourced from ShareGPT, reflecting practical use cases rather than synthetic benchmarks.
- 🔬 Granular Analysis: The framework supports fine-grained analysis by categorizing prompts into domains (e.g., Code, Math, Law) and stratifying them by context length (up to 24K tokens).
- 🔄 Transparent & Reproducible: The benchmark code is open-sourced to promote transparent and reproducible progress in the field of LLM compression.
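The teacher-forcing idea behind CGA can be sketched as follows. This is a minimal illustration, not FidBench's actual implementation; `predict_next` is a hypothetical stand-in for the compressed model's greedy next-token prediction:

```python
def conditional_generation_accuracy(prompt, reference, predict_next):
    """Teacher-forced next-token accuracy of a compressed model.

    prompt:       prompt token ids
    reference:    continuation token ids produced by the ORIGINAL model
    predict_next: callable mapping a prefix of token ids to the
                  compressed model's greedy next-token prediction
    """
    if not reference:
        return 1.0
    matches = 0
    for i, ref_token in enumerate(reference):
        # Teacher forcing: always condition on the original model's
        # prefix, so an early mismatch cannot cascade into later steps.
        prefix = prompt + reference[:i]
        if predict_next(prefix) == ref_token:
            matches += 1
    return matches / len(reference)
```

Under this sketch, a perfectly faithful compressed model scores 1.0, while free-running text-similarity measures would penalize every token after a single early divergence.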
The following table summarizes the average Conditional Generation Accuracy (CGA) scores for various compression methods applied to the Qwen2.5-Instruct model family. A higher CGA score indicates greater faithfulness to the original model.
| Compression Method | Qwen2.5-7B | Qwen2.5-14B | Qwen2.5-32B |
|---|---|---|---|
| **Low-Precision Attn** | | | |
| SageAttention | 0.986 | 0.987 | 0.990 |
| Top-10% Sparse Attn | 0.943 | 0.950 | 0.960 |
| FlashAttention FP8 | 0.580 | 0.965 | 0.977 |
| **INT4 Quantization** | | | |
| GPTQ | 0.921 | 0.931 | 0.947 |
| AWQ | 0.909 | 0.909 | 0.939 |
| **Pruning (50%)** | | | |
| SparseGPT | 0.820 | 0.788 | N/A |
| Wanda | 0.780 | 0.795 | 0.835 |
| **KV Cache Dropping** | | | |
| SnapKV | 0.600 | 0.275 | 0.181 |
| H2O | 0.571 | 0.497 | 0.535 |
This guide outlines the steps to evaluate compressed models using the FidBench framework.
- Environment: Set up the necessary environments for the compression methods you wish to evaluate. See the Environment Setup guide for detailed instructions.
- Configs: Ensure your model configuration files (`.json`) are correctly placed within the `runs/` directory.
To calculate the Conditional Generation Accuracy (CGA), you must first generate outputs from both the baseline (uncompressed) model and the compressed models. The `pred.sh` script handles this process. It takes one or more compression methods as arguments.
```bash
# Usage: ./pred.sh <method1> [<method2>] [...]
# Example for evaluating AWQ and GPTQ
./pred.sh awq gptq
```

This script will iterate through all base models (`qwen2.5-7b`, `qwen2.5-14b`, `qwen2.5-32b`) and run predictions for the baseline and each specified compression method.
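The iteration the script performs can be pictured as the following run enumeration. This is an illustrative sketch, not the script itself; `"baseline"` denotes the uncompressed model:

```python
MODELS = ["qwen2.5-7b", "qwen2.5-14b", "qwen2.5-32b"]

def build_runs(methods):
    """Enumerate (model, method) prediction runs: each base model is
    evaluated with the uncompressed baseline plus every requested method."""
    return [(model, method)
            for model in MODELS
            for method in ["baseline", *methods]]
```

For example, requesting `awq` and `gptq` yields 9 runs: 3 models × (baseline + 2 methods).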
For comparison, you can also evaluate the models using traditional proxy metrics like Perplexity and standard QA benchmarks (e.g., MMLU).
1. Perplexity (PPL)
Use the `perplexity.sh` script to calculate the Perplexity score on the WikiText-2 dataset.
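As a refresher, perplexity is the exponentiated average negative log-likelihood a model assigns to each token of a sequence; a minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity over a sequence, given the per-token log-probabilities
    (natural log) that the model assigned to the observed tokens."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

For instance, a model that assigns probability 0.5 to every observed token has perplexity 2.0. Note that perplexity is a property of one model in isolation, which is precisely why it can miss divergence from the uncompressed model that CGA captures.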
```bash
# Usage: ./perplexity.sh <method1> [<method2>] [...]
# Example for evaluating AWQ, GPTQ, and SnapKV
./perplexity.sh awq gptq snapkv
```

2. QA Benchmarks (lm-eval)
Use the `lmeval.sh` script to run downstream QA benchmark evaluations.
```bash
# Usage: ./lmeval.sh <method1> [<method2>] [...]
# Example for evaluating SparseGPT and Wanda
./lmeval.sh sparsegpt wanda
```

For more detailed information, please refer to the following documents:
- 📄 Adding a New Model: A guide on how to integrate and evaluate a new compressed model within the framework.
- 📄 Dataset Overview: An overview of the dataset structure, categories, and data sources.
- 📄 Environment Setup: Instructions for setting up the required environment and dependencies for each compression method.