Structural techniques for efficient models
Rearchitecting LLMs addresses the growing need for AI professionals who understand how LLMs work at a fundamental level—professionals who can create hyper-efficient models tailored to specific data and tasks rather than relying on one-size-fits-all solutions.
This is the official repository for the book Rearchitecting LLMs - Structural techniques for efficient models. Although the notebooks include explanations so they can be understood and run individually, the best experience is through the book, which provides more details on the experiments, decisions, papers, and technologies used.
The industry is shifting away from generic, closed-source models toward open-source alternatives that offer better stability, data privacy, lower operational costs, and competitive differentiation that proprietary APIs cannot provide. However, this transition faces a critical bottleneck: a shortage of engineers equipped with the deep architectural knowledge required to optimize these models effectively.
The book teaches optimization techniques for transforming large pre-trained models into efficient Small Language Models (SLMs). These methodologies—including depth pruning, width pruning in GLU architectures, and knowledge distillation—are similar to approaches used by companies like Nvidia (Minitron family) and Mistral (Ministral family) to create production-ready model families.
Beyond these foundational techniques, the book introduces original methodologies like Fair Pruning (bias-aware optimization) and Adaptive Attention Bypass (dynamic inference), combining industry best practices with cutting-edge research. You'll learn to apply all these techniques to open-source models like Llama, Gemma, and Qwen, with hands-on notebooks that run on Google Colab's free tier.
- Surgically optimize model architectures through depth and width pruning
- Recover lost knowledge using targeted distillation techniques
- Specialize models for your specific domain and use case
- Measure and validate every optimization decision
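To make the first bullet concrete, here is a minimal sketch of how blocks could be ranked for depth pruning. It assumes one common importance metric (1 minus the cosine similarity between a block's input and output hidden states), which is not necessarily the metric the book uses. The helper names `block_importance` and `select_blocks_to_prune` are hypothetical, and the hidden states are toy lists rather than real activations.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def block_importance(block_inputs, block_outputs):
    """Score one block as 1 - mean cosine similarity over tokens.

    A score near 0 means the block barely changes the direction of its
    input, which makes it a candidate for removal.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(block_inputs, block_outputs)]
    return 1.0 - sum(sims) / len(sims)

def select_blocks_to_prune(scores, num_to_remove):
    """Return the indices of the lowest-scoring blocks, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_to_remove])

# Toy hidden states: 2 tokens, hidden size 2, for 3 hypothetical blocks.
# In a real model each block has its own input (the previous block's
# output), typically captured with forward hooks; the same input is
# reused here only to keep the example short.
hidden_in = [[1.0, 0.0], [0.0, 1.0]]
hidden_out_per_block = [
    [[0.0, 1.0], [1.0, 0.0]],  # block 0: rotates every token by 90 degrees
    [[2.0, 0.0], [0.0, 3.0]],  # block 1: only rescales, direction unchanged
    [[1.0, 1.0], [1.0, 1.0]],  # block 2: moderate change
]

scores = [block_importance(hidden_in, out) for out in hidden_out_per_block]
to_prune = select_blocks_to_prune(scores, num_to_remove=1)
print("importance scores:", scores)
print("blocks to prune:", to_prune)
```

Block 1 only rescales its input, so it gets the lowest score and is the one selected. Whatever metric is used, the pruned model still has to be evaluated, which is the point of the last bullet above.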
The Rearchitecting Pipeline. The domain-specific dataset guides the calibration of the base model, informs structural optimization decisions, and drives the final specialization through LoRA fine-tuning. A general dataset supports Knowledge Recovery, ensuring the pruned model retains broad capabilities before domain-specific specialization. This dual approach optimizes each phase for the project's specific objectives.
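The Knowledge Recovery phase is covered in the book through distillation. As a rough illustration only, the sketch below implements the classic softened-logits distillation loss: KL divergence between teacher and student distributions at temperature T, scaled by T². This is an assumption about the general technique, not a reproduction of the loss used in the notebooks, and the logits are made-up values.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lp) * (lp - lq)
        for lp, lq in zip(log_p_teacher, log_p_student)
    )
    return kl * temperature ** 2

# Made-up logits over a 3-token vocabulary for a single position.
teacher = [4.0, 1.0, 0.0]
student_close = [3.5, 1.2, 0.1]   # roughly agrees with the teacher
student_far = [0.0, 1.0, 4.0]     # prefers a different token

loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
print(loss_same, loss_close, loss_far)
```

The loss is zero when the pruned student matches the teacher exactly and grows as their distributions diverge. In a real training loop it would be averaged over tokens and often combined with the standard next-token loss.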
| | Chapter | What you'll learn |
|---|---|---|
| PART 1: FOUNDATIONS | ||
| ✅ | 1 · Why rearchitecting LLMs matters | The case for specialized models over generic LLMs |
| ✅ | 2 · An end-to-end rearchitecting project | Full pipeline: prune and recover |
| ✅ | 3 · A blueprint to modern transformers | GLU architectures, attention, and model internals |
| PART 2: HANDS-ON OPTIMIZATION | ||
| ✅ | 4 · Building smaller and faster LLMs with depth pruning | Block removal, capturing block importance with Python hooks, evaluating pruning |
| ✅ | 5 · Shaping model architectures via width pruning | GLU neuron selection, data-driven pruning strategies |
| ✅ | 6 · Knowledge recovery through distillation | Recovering capability after structural compression |
| ✅ | 7 · Model specialization | LoRA / DoRA fine-tuning and quantization for domain tasks |
| ✅ | 8 · Attention Optimization | KV cache, attention bypass, inference acceleration |
| 🔜 | 9 · Dynamic Pruning for Adaptive Inference | Early exit mechanisms that stop inference before the final layer |
| PART 3: BEYOND THE BLACK BOX | ||
| 🔜 | 10 · Exploring the Black Box | Activation analysis and behavioral interpretability |
| 🔜 | 11 · Optimizing While Eliminating Biases | Fair pruning: removing demographic bias at neuron level |
| 🔜 | 12 · Capstone Project | End-to-end: replacing API calls with a specialized SLM |
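As a taste of the GLU neuron selection listed for chapter 5, the following sketch prunes a toy GLU MLP. It assumes the usual layout of three projections (gate, up, down) stored like linear-layer weights, and it scores neurons with a simple static magnitude heuristic as a stand-in. A data-driven strategy would instead score neurons from activations on a calibration dataset. The function names are hypothetical and the weights are toy values.

```python
def neuron_scores(gate_w, up_w):
    """Score each intermediate neuron by the largest absolute weight
    in its gate row plus the largest in its up row (a simple heuristic)."""
    return [
        max(abs(w) for w in g_row) + max(abs(w) for w in u_row)
        for g_row, u_row in zip(gate_w, up_w)
    ]

def prune_glu(gate_w, up_w, down_w, keep_ratio):
    """Keep the top-scoring neurons consistently across all three matrices.

    gate_w, up_w: [intermediate][hidden]  (one row per neuron)
    down_w:       [hidden][intermediate]  (one column per neuron)
    """
    scores = neuron_scores(gate_w, up_w)
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:num_keep])  # preserve original neuron order
    new_gate = [gate_w[i] for i in keep]
    new_up = [up_w[i] for i in keep]
    new_down = [[row[i] for i in keep] for row in down_w]
    return new_gate, new_up, new_down, keep

# Toy MLP: hidden size 2, intermediate size 4.
gate_w = [[0.9, -0.1], [0.01, 0.02], [-0.5, 0.4], [0.03, -0.01]]
up_w = [[0.8, 0.2], [0.02, -0.01], [0.3, -0.6], [0.01, 0.02]]
down_w = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

new_gate, new_up, new_down, keep = prune_glu(gate_w, up_w, down_w, keep_ratio=0.5)
print("kept neurons:", keep)
print("new down projection:", new_down)
```

The key design point is that a neuron must be removed from all three matrices at once: its row in the gate and up projections and its column in the down projection. Removing it from only one of them would leave the shapes inconsistent.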
Start experimenting interactively.
This NotebookLM space contains all the research papers, chapter notebooks, and optiPfair guides in a conversational format. Think of it as an AI-powered technical assistant for the book that helps you become an LLM architect.
What you can do:
- Ask specific questions: "How does depth pruning work?" or "How many layers can I remove from a 70B model?"
- Get code snippets: "Show me the code to reduce the GLU expansion of Llama3"
- Explore techniques: Query any pruning, distillation, or optimization method
- Troubleshoot: Get help understanding implementation details from the notebooks
Perfect for:
- Quick reference while coding
- Understanding paper implementations
- Exploring techniques before diving into chapters
- Clarifying concepts on the go
💡 Pro tip: Use NotebookLM for quick queries and experimentation. For structured, in-depth learning, the book remains your best companion.
Stop being a mere user. It's time to become an architect.
Every chapter in this book includes a specific Hands-on Lab designed to push your understanding of model architecture. We use GitHub Discussions as our active laboratory to share metrics, architectural insights, doubts, and even Out-of-Memory (OOM) errors.
→ Explore all Hands-on Labs Discussions
If you find these techniques useful, consider:
- ⭐ Starring this repo to stay updated
- 🔄 Sharing it with your team
- 💬 Opening Discussions with your questions
If you find this repository or the techniques described in the book useful, please cite it as follows:
APA: Martra, P. (2026). Rearchitecting LLMs: Structural techniques for efficient models. Manning Publications. ISBN 9781633434332.
BibTeX:
@book{martra2026rearchitecting,
title={Rearchitecting LLMs: Structural techniques for efficient models},
author={Martra, Pere},
isbn={9781633434332},
year={2026},
publisher={Manning Publications},
url={https://www.manning.com/books/rearchitecting-llms},
note={Manning Early Access Program (MEAP)}
}