This directory contains the notebooks for Chapter 5. After mastering Depth Pruning in the previous chapter, we now delve into a more precise surgery: Width Pruning.
In this chapter, you will learn to surgically reduce the size of the MLP modules, a critical component that consumes a large number of parameters in modern models like Llama, Gemma, or Mistral. Instead of removing entire blocks, we will select and eliminate individual neurons within the GLU architecture, creating lighter, faster, and more energy-efficient models.
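To make the idea of removing individual neurons from a GLU block concrete, here is a minimal sketch on a toy module. `ToyGLUMLP`, `prune_glu_neurons`, and the importance score (the largest absolute weight in each neuron's `gate_proj` and `up_proj` rows) are assumptions made for illustration; the notebooks may score neurons differently. The point it shows is that the three projections must be sliced together: rows of `gate_proj` and `up_proj`, and the matching columns of `down_proj`.

```python
import torch
import torch.nn as nn


class ToyGLUMLP(nn.Module):
    """Toy stand-in for a Llama-style gated MLP (hypothetical, for illustration)."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gated = nn.functional.silu(self.gate_proj(x)) * self.up_proj(x)
        return self.down_proj(gated)


def prune_glu_neurons(mlp: ToyGLUMLP, keep: int) -> ToyGLUMLP:
    """Return a narrower copy of `mlp` that keeps the `keep` highest-scoring neurons."""
    # Assumed importance score: one value per intermediate neuron, taken from
    # the largest absolute weight in its gate_proj row and its up_proj row.
    score = (
        mlp.gate_proj.weight.abs().max(dim=1).values
        + mlp.up_proj.weight.abs().max(dim=1).values
    )
    # Sort the surviving indices so the kept neurons stay in their original order.
    idx = torch.topk(score, keep).indices.sort().values

    pruned = ToyGLUMLP(mlp.gate_proj.in_features, keep)
    with torch.no_grad():
        # nn.Linear stores weights as (out_features, in_features), so a neuron
        # is a row of gate_proj/up_proj and a column of down_proj.
        pruned.gate_proj.weight.copy_(mlp.gate_proj.weight[idx])
        pruned.up_proj.weight.copy_(mlp.up_proj.weight[idx])
        pruned.down_proj.weight.copy_(mlp.down_proj.weight[:, idx])
    return pruned


mlp = ToyGLUMLP(hidden_size=8, intermediate_size=32)   # expansion ratio 4
smaller = prune_glu_neurons(mlp, keep=16)               # expansion ratio 2
print(smaller.gate_proj.weight.shape, smaller.down_proj.weight.shape)
```

The block's input and output sizes are unchanged, so the pruned module can replace the original without touching the rest of the model.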
By the end of this chapter, you will understand that width pruning doesn't just reduce the model's size; it fundamentally alters its behavior. You will learn to use this technique to create smaller models that, paradoxically, can become better at specific tasks, like following instructions, by eliminating the "noise" from general-knowledge neurons.
- LLM: `meta-llama/Llama-3.2-1B`
  - Dataset: N/A (Data-free static pruning, evaluated on the `GSM8K`, `IFEval`, and `TruthfulQA` benchmarks)
  - Description: This notebook implements static width pruning based on weight magnitude for the GLU architecture. It surgically reduces the MLP expansion ratio and analyzes the trade-offs in reasoning, instruction following, and truthfulness.
- LLM: `meta-llama/Llama-3.2-1B`
  - Dataset: `wikitext` (`wikitext-2-raw-v1`) and `sms_spam`
  - Description: This notebook demonstrates data-driven width pruning by capturing the activations of the `down_proj` layers using PyTorch hooks. It creates two specialized models calibrated on different datasets to evaluate domain-specific specialization.
- LLM: `meta-llama/Llama-3.2-1B`
  - Dataset: `wikitext` (`wikitext-2-raw-v1`) and `sms_spam`
  - Description: A bonus notebook that tests a 20% statically pruned model on the WikiText-2 and SMS Spam datasets for cross-evaluation insights.
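
The activation capture used for data-driven pruning can be sketched with PyTorch forward hooks. This is a minimal illustration on a toy model, not the notebook's code: `ToyBlock` and `collect_down_proj_stats` are hypothetical names, and accumulating the absolute value of each `down_proj` input per neuron is an assumed scoring choice.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Toy residual block with a gated MLP (hypothetical, for illustration)."""

    def __init__(self, hidden: int = 8, intermediate: int = 32):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gated = nn.functional.silu(self.gate_proj(x)) * self.up_proj(x)
        return x + self.down_proj(gated)


def collect_down_proj_stats(model: nn.Module, batches) -> dict:
    """Sum |activation| per intermediate neuron at the input of each down_proj."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # inputs[0] has shape (..., intermediate_size); flatten batch and sequence dims.
            x = inputs[0].detach()
            flat = x.reshape(-1, x.shape[-1])
            stats[name] = stats.get(name, 0) + flat.abs().sum(dim=0)
        return hook

    for name, module in model.named_modules():
        if name.endswith("down_proj") and isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            for batch in batches:
                model(batch)
    finally:
        # Always detach the hooks so the model is left unchanged.
        for handle in handles:
            handle.remove()
    return stats


model = nn.Sequential(ToyBlock(), ToyBlock())
calibration = [torch.randn(4, 5, 8) for _ in range(3)]  # stand-in for tokenized text
stats = collect_down_proj_stats(model, calibration)
print({name: tuple(s.shape) for name, s in stats.items()})
```

Running the same collection on two different calibration sets yields two different score vectors per layer, which is what lets the pruned models specialize toward one dataset or the other.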