
# 📊 FidBench: A New Evaluation Framework for LLM Compression

FidBench introduces a new evaluation framework to address the gap between reported scores and practical utility in Large Language Model (LLM) compression. We dispense with proxy metrics like perplexity and curated benchmarks, and instead directly measure a compressed model's generative faithfulness to its uncompressed counterpart on real-world user queries.

## ✨ Core Features

- 🎯 **Direct Faithfulness Measurement:** Instead of relying on proxy metrics (e.g., PPL, MMLU), FidBench directly evaluates how well a compressed model replicates the original model's generative behavior.
- 💡 **Conditional Generation Accuracy (CGA):** A novel metric that employs a teacher-forcing paradigm to assess next-token prediction accuracy, effectively avoiding the cascading errors that confound traditional text-similarity measures (see the formalization just after this list).
- 🌍 **Real-World Data:** The evaluation is grounded in a dataset of diverse, open-ended user queries sourced from ShareGPT, reflecting practical use cases rather than synthetic benchmarks.
- 🔬 **Granular Analysis:** The framework supports fine-grained analysis by categorizing prompts into domains (e.g., Code, Math, Law) and stratifying them by context length (up to 24K tokens).
- 🔄 **Transparent & Reproducible:** The benchmark code is open-sourced to promote transparent and reproducible progress in LLM compression.
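As a sketch, one natural formalization of CGA consistent with the teacher-forcing description above (the paper's exact definition governs): given a prompt $x$ and a reference continuation $y_{1..T}$ produced by the uncompressed model, the compressed model is conditioned on the reference prefix at every step and scored on greedy next-token agreement:

$$
\mathrm{CGA} \;=\; \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\big[\arg\max_{v}\, p_{\text{comp}}(v \mid x,\, y_{<t}) = y_t\big]
$$

where $p_{\text{comp}}$ is the compressed model's next-token distribution. Because each position is conditioned on the reference prefix rather than on the compressed model's own sampled output, an early mismatch cannot cascade into every later position.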

## 🏆 Benchmark Results

The following table summarizes the average Conditional Generation Accuracy (CGA) scores for various compression methods applied to the Qwen2.5-Instruct model family. A higher CGA score indicates greater faithfulness to the original model.

| Compression Method | Qwen2.5-7B | Qwen2.5-14B | Qwen2.5-32B |
|---|---|---|---|
| **Low-Precision Attn** | | | |
| SageAttention | 0.986 | 0.987 | 0.990 |
| Top-10% Sparse Attn | 0.943 | 0.950 | 0.960 |
| FlashAttention FP8 | 0.580 | 0.965 | 0.977 |
| **INT4 Quantization** | | | |
| GPTQ | 0.921 | 0.931 | 0.947 |
| AWQ | 0.909 | 0.909 | 0.939 |
| **Pruning (50%)** | | | |
| SparseGPT | 0.820 | 0.788 | N/A |
| Wanda | 0.780 | 0.795 | 0.835 |
| **KV Cache Dropping** | | | |
| SnapKV | 0.600 | 0.275 | 0.181 |
| H2O | 0.571 | 0.497 | 0.535 |

## 🚀 How to Use

This guide outlines the steps to evaluate compressed models using the FidBench framework.

### Prerequisites

  1. Environment: Set up the necessary environments for the compression methods you wish to evaluate. See the Environment Setup guide for detailed instructions.
  2. Configs: Ensure your model configuration files (.json) are correctly placed within the runs/ directory; a hypothetical example follows.
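For orientation only, a config might look like the snippet below. Every field name here is purely illustrative; the actual schema is defined by the configs shipped in runs/.

```json
{
  "_note": "illustrative only; see the real configs in runs/ for the actual schema",
  "model": "qwen2.5-7b",
  "method": "awq",
  "bits": 4
}
```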

### Evaluate with CGA

To calculate the Conditional Generation Accuracy (CGA), you must first generate outputs from both the baseline (uncompressed) model and the compressed models. The pred.sh script handles this process. It takes one or more compression methods as arguments.

```bash
# Usage: ./pred.sh <method1> [<method2>] [...]
# Example for evaluating AWQ and GPTQ
./pred.sh awq gptq
```

This script will iterate through all base models (qwen2.5-7b, qwen2.5-14b, qwen2.5-32b) and run predictions for the baseline and each specified compression method.
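To make the mechanics concrete, below is a minimal, self-contained sketch of the teacher-forced comparison that CGA performs on a single prompt. This is not the FidBench implementation: the model IDs, the 128-token budget, and greedy decoding are illustrative assumptions, and pred.sh remains the supported entry point.

```python
# Sketch of teacher-forced next-token agreement between a baseline and a
# compressed model. Illustrative only -- swap compressed_id for an actual
# compressed checkpoint; pred.sh is the supported pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

baseline_id = "Qwen/Qwen2.5-7B-Instruct"    # uncompressed reference
compressed_id = "Qwen/Qwen2.5-7B-Instruct"  # placeholder for the compressed variant

tok = AutoTokenizer.from_pretrained(baseline_id)
baseline = AutoModelForCausalLM.from_pretrained(
    baseline_id, torch_dtype=torch.bfloat16, device_map="auto")
compressed = AutoModelForCausalLM.from_pretrained(
    compressed_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Explain the difference between a mutex and a semaphore."
inputs = tok(prompt, return_tensors="pt").to(baseline.device)
prompt_len = inputs["input_ids"].shape[1]

# 1) Greedy reference continuation from the uncompressed model.
with torch.no_grad():
    ref_ids = baseline.generate(**inputs, max_new_tokens=128, do_sample=False)

# 2) Teacher-force the reference sequence through the compressed model and
#    check, at each generated position, whether its argmax prediction
#    reproduces the reference token.
with torch.no_grad():
    logits = compressed(ref_ids.to(compressed.device)).logits

preds = logits[:, prompt_len - 1 : -1, :].argmax(dim=-1)  # predicts tokens prompt_len..end
targets = ref_ids[:, prompt_len:].to(preds.device)
cga = (preds == targets).float().mean().item()
print(f"per-prompt CGA: {cga:.3f}")
```

Averaging this per-prompt agreement over the full query set yields a corpus-level score like those in the table above.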

### Optional: Evaluate with Proxy Metrics

For comparison, you can also evaluate the models using traditional proxy metrics like Perplexity and standard QA benchmarks (e.g., MMLU).

#### 1. Perplexity (PPL)

Use the perplexity.sh script to calculate the Perplexity score on the WikiText-2 dataset.

```bash
# Usage: ./perplexity.sh <method1> [<method2>] [...]
# Example for evaluating AWQ, GPTQ, and SnapKV
./perplexity.sh awq gptq snapkv
```
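For reference, perplexity is the exponentiated average negative log-likelihood the model assigns to the evaluation tokens (lower is better):

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$

Note that PPL measures how well the compressed model fits WikiText-2, not how closely it tracks the uncompressed model, which is exactly the gap CGA is designed to expose.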

#### 2. QA Benchmarks (lm-eval)

Use the lmeval.sh script to run downstream QA benchmark evaluations.

```bash
# Usage: ./lmeval.sh <method1> [<method2>] [...]
# Example for evaluating SparseGPT and Wanda
./lmeval.sh sparsegpt wanda
```
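If lmeval.sh wraps EleutherAI's lm-evaluation-harness (an assumption; check the script for the actual backend), a roughly equivalent direct call looks like this, with the task list and batch size as placeholders rather than FidBench defaults:

```python
# Assumes lmeval.sh delegates to EleutherAI's lm-evaluation-harness
# (pip install lm-eval); the task list and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])
```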

## 📚 Documentation

For more detailed information, please refer to the following documents:

- 📄 **Adding a New Model:** A guide on how to integrate and evaluate a new compressed model within the framework.
- 📄 **Dataset Overview:** An overview of the dataset structure, categories, and data sources.
- 📄 **Environment Setup:** Instructions for setting up the required environment and dependencies for each compression method.
