Modern LLM Notebook

Build modern LLMs from scratch through 26 runnable Jupyter Notebooks.

Case-based LLM learning: course map -> concrete notebook -> runnable experiment.

[Screenshot: Modern LLM Notebook English home page]

Start from the full bilingual course map, then choose a focused path through foundations, training, inference, frontiers, and production topics.

[Screenshot: Modern LLM Notebook English notebook reader]

Each case keeps the learning loop visible: intuition, hand calculation, implementation, experiment, outline navigation, and one-click Colab access.

Overview

Modern LLM Notebook is a hands-on course for building modern LLM systems from the ground up in PyTorch. Instead of treating the model as a black box, you implement the core pieces yourself: tokenizers, embeddings, attention, Transformer blocks, training loops, MoE, LoRA, RLHF, decoding, KV Cache, long context, VLMs, evaluation, distillation, and on-policy distillation.

The repository ships with a full English notebook mirror under notebooks-en/. The web viewer supports language switching from the home page and the notebook sidebar (or via ?lang=en in the URL), so both the curriculum and the browsing experience stay bilingual end to end.

The project is designed as an educational reference implementation. It is not a model zoo, not a production serving framework, and not a wrapper around hosted APIs. Its purpose is to make the internal machinery of LLMs legible to engineers who want to reason from first principles.

Each notebook follows the same learning contract:

intuition -> hand calculation -> implementation -> experiment

That contract matters. It is not enough to know that BPE merges frequent pairs or that a KV Cache speeds up generation. A reader should be able to trace the numbers, write the minimal code, and explain why the behavior appears.
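
As a small illustration of what "trace the numbers" can look like for BPE, here is a minimal sketch of a single merge step in plain Python. The toy corpus, its frequencies, and the helper names `count_pairs` and `merge_pair` are assumptions made for this example. They are not taken from the notebooks, whose tokenizer code may be organized differently.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (illustrative counts): each word is split into characters.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)         # most frequent adjacent pair
corpus_after = merge_pair(corpus, best)  # one learned merge rule applied

print(best, pairs[best])
# ('u', 'g') 20
```

By hand: `('u', 'g')` appears in "hug" (10), "pug" (5), and "hugs" (5), for a total of 20, which beats `('p', 'u')` at 17. Repeating the count-and-merge step grows the vocabulary one rule at a time.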

What You Will Build

By the end, you will have implemented a compact version of the systems that power modern LLMs:

| Stage | You build | Why it matters |
|---|---|---|
| Text to tokens | Character, word, and BPE tokenizers | See exactly how raw text becomes model input |
| Tokens to vectors | Token embeddings and position encodings | Understand what the model can compute over |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer blocks, Mini-GPT | Reconstruct the core forward pass |
| Training system | Cross-Entropy, batching, gradient flow, scaling-law intuition | Connect loss curves to real model behavior |
| Adaptation | LoRA, continued pretraining, reward modeling, PPO/DPO-style objectives | Learn how base models become useful assistants |
| Inference system | Sampling, beam search, KV Cache, speculative decoding | Understand why serving is a systems problem |
| Frontiers | Long context, CoT experiments, VLM patch embeddings and cross-attention | Turn newer papers into small runnable examples |
| Production loop | Evaluation, win-rate matrices, distillation, OPD | Measure, compress, and improve model behavior |

raw text -> tokens -> embeddings -> attention -> Transformer -> Mini-GPT
         -> training -> alignment -> inference -> evaluation -> distillation

Why This Project

LLM education often falls into two extremes.

Some resources are mathematically precise but difficult to enter: they introduce formulas before the reader understands the problem being solved. Other resources are easy to run but heavily abstracted: the important ideas disappear behind a library call.

Modern LLM Notebook takes the middle path. It treats modern LLMs as systems that can be decomposed, tested, and rebuilt piece by piece. The goal is not to replace papers or production libraries. The goal is to give you the mental model needed to read those papers and use those libraries with judgment.

Use this project if you want to:

  • Understand the data flow from raw text to logits.
  • Build a small GPT-style model without treating the architecture as a black box.
  • See how training objectives, data quality, and scaling laws connect.
  • Learn why inference systems need KV Cache, batching, memory planning, and speculative decoding.
  • Connect recent research topics such as MoE, long context, CoT, VLMs, RLHF, DPO, and distillation back to small runnable examples.

What Is Included

| Area | Topics | Reference implementations |
|---|---|---|
| Foundations | Tokenization, BPE, embeddings, position encoding | CharTokenizer, WordTokenizer, BPETokenizer, TokenEmbedding |
| Transformer core | Self-Attention, Multi-Head Attention, Transformer block | MultiHeadAttention, TransformerBlock, MiniGPT |
| GPT-2 to modern models | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA, MoE | RMSNorm, SwiGLU, RoPE, GroupedQueryAttention, MultiHeadLatentAttention, MoELayer |
| Training | Loss, optimization, scaling laws, data engineering, MTP, FIM | Training loop, gradient accumulation, MinHash deduplication, Multi-Token Prediction, Fill-in-the-Middle |
| Adaptation and alignment | LoRA, reward modeling, PPO, DPO | LoraLinear, reward model loss, PPO clip, DPO loss |
| Inference | Sampling, beam search, KV Cache, speculative decoding | Top-k, Top-p, beam search, AttentionWithKVCache |
| Frontiers | Long context, reasoning traces, VLM, Sliding Window Attention | RoPE extrapolation, Self-Consistency, Cross-Attention, Sliding Window mask |
| Production concepts | Evaluation, distillation, on-policy distillation | Win-rate matrices, soft labels, KL estimators |
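
To give a feel for the scale of these reference implementations, here is a hedged sketch of RMSNorm in plain Python. It is not the repository's `RMSNorm` class, which is a PyTorch module. The function name `rms_norm`, the `eps` default, and the input vector are assumptions chosen so the arithmetic can be checked by hand.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide a vector by its root mean square, then apply a per-feature gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]

# Hand calculation: mean square = (9 + 16 + 0 + 0) / 4 = 6.25, so RMS = 2.5.
x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, gain=[1.0] * 4)
print([round(v, 4) for v in y])
# [1.2, -1.6, 0.0, 0.0]
```

Unlike LayerNorm, this does not subtract the mean. Only the scale of the vector is normalized, which is why the zeros stay at zero.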

What This Project Is Not

This repository intentionally avoids several things so the learning path stays clear:

  • It is not a production LLM framework.
  • It is not optimized for maximum throughput or distributed training.
  • It does not provide pretrained model weights.
  • It does not use transformers as a shortcut for core implementations.
  • It does not assume the reader already knows the terminology.

Some dependencies such as transformers and datasets may appear in the environment for comparison or utility work, but the teaching path keeps the core algorithms explicit.

Quick Start

Python notebooks

git clone https://github.com/walkinglabs/modern-llm-notebook.git
cd modern-llm-notebook

# Create an isolated Python environment instead of installing into the system Python.
python3 -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m ipykernel install --user \
  --name modern-llm-notebook \
  --display-name "Python (modern-llm-notebook)"

jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb

If the shell reports "jupyter: command not found", the virtual environment is probably not active. Activate it again:

source .venv/bin/activate

Or call Jupyter directly from the environment:

.venv/bin/jupyter notebook notebooks-en/part1-foundation/01-tokenizer-basics.ipynb

Language note:

  • Chinese notebooks live in notebooks/
  • English notebooks live in notebooks-en/ (complete 26/26 translation coverage)

Recommended environment:

  • Python 3.9+
  • PyTorch 2.0+
  • NumPy, Matplotlib, Jupyter
  • 16 GB RAM

Most notebooks run on CPU. Larger training experiments are easier with a GPU.
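
If it helps to check the recommended environment before opening a notebook, the following sketch uses only the standard library. The helper `check_environment` and the list of import names are assumptions for this example, not a script shipped with the repository. It only tests whether a package can be found, so it does not verify the PyTorch 2.0+ requirement.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # from the recommended environment
PACKAGES = ["torch", "numpy", "matplotlib", "jupyter"]  # assumed import names

def check_environment(packages=PACKAGES, min_python=MIN_PYTHON):
    """Return (python_ok, {package: found}) without importing the packages."""
    python_ok = sys.version_info[:2] >= min_python
    found = {name: importlib.util.find_spec(name) is not None for name in packages}
    return python_ok, found

python_ok, found = check_environment()
version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version}: {'ok' if python_ok else 'too old'}")
for name, present in found.items():
    print(f"{name}: {'found' if present else 'missing'}")
```

Run it with the virtual environment active. A "missing" line usually means `python -m pip install -r requirements.txt` was run in a different interpreter.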

Web viewer

The repository also includes a React / Vite reader for a course-like browsing experience. The reader imports the .ipynb files directly and renders them in the browser, so no separate generated copy of the content is kept for the web.

npm install
npm run dev

Build and preview the static site:

npm run build
npm run preview

Executing notebooks in restricted environments

Some sandboxed environments disallow opening local sockets, which breaks the standard Jupyter kernel protocol (and tools like nbclient / nbconvert --execute). For those cases we ship a no-kernel executor that runs code cells via plain Python and writes outputs back into the English notebooks:

python scripts/execute_notebooks_en_no_kernel.py
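
The idea behind running cells without a kernel can be sketched in a few lines. This is an illustration only and makes no claim about how `scripts/execute_notebooks_en_no_kernel.py` is written. The function `run_code_cells` and the in-memory `demo` notebook are hypothetical. The sketch captures only printed output, so IPython magics, rich display objects, and the value of a cell's last expression are ignored.

```python
import contextlib
import io
import json

def run_code_cells(notebook):
    """Run code cells in order in one shared namespace and store captured stdout."""
    namespace = {"__name__": "__main__"}
    for cell in notebook["cells"]:
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        code = "".join(source) if isinstance(source, list) else source
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(compile(code, "<cell>", "exec"), namespace)
        text = buffer.getvalue()
        cell["outputs"] = (
            [{"output_type": "stream", "name": "stdout", "text": text}] if text else []
        )
    return notebook

# A tiny in-memory notebook that follows the nbformat 4 JSON layout.
demo = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["x = 2 + 3\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print(x * 2)"]},
    ],
}

executed = run_code_cells(demo)
serialized = json.dumps(executed, indent=1)  # what would be written back to disk
print(executed["cells"][2]["outputs"][0]["text"], end="")
# 10
```

Because no kernel process is started, nothing has to open a local socket. The trade-off is that cells run inside the calling Python process with no isolation.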

Project Status

| Area | Status |
|---|---|
| Chinese notebooks | Complete 32/32 (added 29-MLA, 30-inference-systems, 31-linear-attention, 32-sparse-attention) |
| English notebooks | Complete 26/26 with executed outputs; renumber pending |
| Web reader | React / Vite app with language switching |
| Static site | Published through GitHub Pages |
| Quality checks | English coverage, syntax, output-language checks, and web build |
| Next focus | CS336/CME295-inspired depth, smoother writing, reproducible pretraining, and stronger eval benchmarks |

Near-Term Roadmap

  1. Incorporate more material inspired by CS336 and CME295, especially around data, training, systems, and evaluation.
  2. Polish the flow of the existing notebooks so the explanations read more naturally from intuition to code.
  3. Add a reproducible 0-to-1 pretraining workflow inspired by SmolLM, from data preparation to a small trained model.
  4. Make the eval benchmark chapter more detailed, including benchmark design, metrics, judge prompts, result aggregation, and failure analysis.

Curriculum

The curriculum is organized as five parts and 26 self-contained notebooks.

Modern LLM Notebook
│
├── Part 1: Foundation
│   ├── Tokenizer basics
│   ├── BPE tokenizer
│   ├── Embedding and position encoding
│   ├── Attention and Transformer block
│   ├── Mini-GPT
│   └── BERT encoder
│
├── Part 2: Training
│   ├── From GPT-2 to modern models
│   ├── Model config
│   ├── Mixture of Experts
│   ├── Training and loss
│   ├── Scaling laws
│   ├── Data engineering
│   ├── LoRA
│   ├── Mid-training and continued pretraining
│   └── RLHF alignment
│
├── Part 3: Inference
│   ├── Generation
│   ├── Inference acceleration
│   └── Speculative decoding
│
├── Part 4: Frontiers
│   ├── Long context
│   ├── CoT and thinking
│   └── Vision-language models
│
└── Part 5: Production
    ├── Evaluation
    ├── Distillation
    ├── On-policy distillation
    └── vLLM & SGLang deployment

Each notebook is designed to be runnable on its own. You can follow the full sequence or jump to a topic without depending on hidden runtime state from earlier notebooks.

Notebook Index

Part 1: Foundation

| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 01 | Tokenizer Basics | Why do models need tokenizers? | Character and word tokenizers |
| 02 | BPE Tokenizer | How does BPE learn a vocabulary? | Merge rules, encode, decode |
| 03 | Embedding | How do IDs become vectors? | Token embedding, distributed representation |
| 04 | Position Encoding | How does the model know word order? | Sinusoidal encoding, input assembly |
| 05 | Attention & Transformer Block | How does attention move information? | MHA, residuals, normalization |
| 06 | Mini-GPT | How does a GPT-style model fit together? | Decoder-only model, LM head |
| 07 | BERT Encoder | Why can encoder-only models read bidirectionally? | MiniBERT, MLM head |
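
For the sinusoidal encoding listed under notebook 04, a minimal sketch in plain Python follows. It uses the formulation from "Attention Is All You Need" with base 10000. The function name `sinusoidal_encoding` and the tiny table size are assumptions for illustration, and the notebook's own implementation may be vectorized differently.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Build a num_positions x d_model table: sin on even indices, cos on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Indices 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=4, d_model=4)
print(pe[0])
# [0.0, 1.0, 0.0, 1.0]
```

Position 0 is easy to verify by hand because sin(0) = 0 and cos(0) = 1 at every frequency. For position 1, the first pair uses angle 1 and the second pair uses angle 1/100, so later dimensions change more slowly with position.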

Part 2: Training

| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 08 | From GPT-2 to Modern Models | What changed architecturally after GPT-2? | RMSNorm, SwiGLU, RoPE, GQA, QK-Norm, MLA |
| 09 | Model Config | What does each field in a real config.json mean? | vocab_size, hidden_size, layers, heads |
| 10 | Mixture of Experts | How does sparse expert routing work? | Router gate, top-k experts, aux-free load balancing |
| 11 | Training & Loss | How does a language model learn from prediction errors? | Training loop, loss, gradients, Multi-Token Prediction |
| 12 | Scaling Laws | How do model size, data, and compute trade off? | FLOPs estimates, Chinchilla intuition |
| 13 | Distributed Training | How do we shard memory and compute across GPUs? | DDP, ZeRO Stage 1/2/3, FSDP, DeepSpeed, Accelerate |
| 14 | Data Engineering | Why does data quality dominate model behavior? | Cleaning, filtering, MinHash, FIM |
| 15 | LoRA | Why does low-rank adaptation work? | LoraLinear, merge for inference |
| 16 | Mid-Training & CPT | How does continued pretraining adapt a model? | Data mixing, loss observation |
| 17 | RLHF Alignment | How do preference signals become objectives? | Reward model, PPO, DPO |
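
As one example of how a preference signal becomes an objective, here is a hedged sketch of the DPO loss for a single preference pair, written in plain Python rather than PyTorch. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values are assumptions for illustration. The inputs stand for summed sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin gap)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log(2), about 0.6931.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy raised the chosen answer and lowered the rejected one: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

# Policy moved the wrong way: higher loss.
worse = dpo_loss(-12.0, -10.0, -10.0, -12.0)

print(round(baseline, 4), improved < baseline < worse)
# 0.6931 True
```

No reward model appears in the computation. The log-probability ratio against the reference plays that role, which is the main practical difference from the PPO path.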

Part 3: Inference

| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 17 | Generation | How do decoding strategies change model behavior? | Greedy, top-k, top-p, beam search |
| 18 | Inference Acceleration | Why is generation memory-bound? | KV Cache, FlashAttention, PagedAttention |
| 19 | Speculative Decoding | How can a small model accelerate a large one? | Draft-then-verify acceptance |
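
The KV Cache idea can be sketched without any framework: keep the keys and values of earlier tokens, and at each step add only the newest pair. The class `KVCache`, the helper `attend`, and the toy vectors below are assumptions for this example, not the repository's `AttentionWithKVCache`. In a real model the query, key, and value vectors come from learned projections of the hidden state, and skipping those projections for old tokens is where the savings appear.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

class KVCache:
    """Holds the keys and values of tokens that were already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) vectors per token, made up for illustration.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the full key and value lists from scratch at every step.
recomputed_outputs = []
for t in range(len(tokens)):
    prefix = tokens[: t + 1]
    keys = [k for _, k, _ in prefix]
    values = [v for _, _, v in prefix]
    recomputed_outputs.append(attend(tokens[t][0], keys, values))
```

Both paths produce the same outputs, since causal attention at step t only ever reads keys and values from positions up to t. The cache trades memory for compute, which is part of why generation tends to be limited by memory rather than arithmetic.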

Part 4: Frontiers

| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 20 | Long Context | How do models extend beyond their training context length? | RoPE extrapolation, YaRN, Sliding Window Attention |
| 21 | CoT & Thinking | Why can reasoning traces improve answers? | Self-Consistency, reward design |
| 22 | Vision-Language Models | How does visual information enter a language model? | Patch embedding, cross-attention |
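
A Sliding Window Attention mask is small enough to print and inspect. In the sketch below, the function name `sliding_window_mask` is hypothetical, and the convention that the window counts the current token is an assumption. Some implementations define the window size differently.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Each row allows at most `window` keys, so attention cost per token stays constant as the sequence grows instead of scaling with its length.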

Part 5: Production

| # | Notebook | Primary question | Implementation focus |
|---|---|---|---|
| 23 | Evaluation | How do we tell whether a model is better? | Win-rate matrices, RAGAS, judge metrics |
| 24 | Distillation | How does a small model learn from a large one? | Soft labels, temperature, logit distillation |
| 25 | On-Policy Distillation | How can distillation reduce exposure bias? | OPSD, KL estimator taxonomy |
| 26 | LLM Deployment | How do you turn a trained model into a callable service? | vLLM, SGLang, custom architecture registration |
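
Soft labels and temperature can be demonstrated with a few logits. The sketch below computes a temperature-softened KL divergence from teacher to student, scaled by T squared as in the classic knowledge distillation formulation. The function names, logit values, and temperature are assumptions for illustration, and the notebook's loss may also mix in a hard-label term.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.0]
sharp = softmax(teacher, temperature=1.0)  # close to one-hot
soft = softmax(teacher, temperature=4.0)   # spreads mass onto the other classes

matched = distillation_loss(teacher, teacher)             # zero: same distribution
mismatched = distillation_loss(teacher, [0.0, 1.0, 4.0])  # positive
```

Raising the temperature lowers the top probability and exposes how the teacher ranks the remaining classes. That ranking is the extra signal a student does not get from hard labels.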

Quality Bar

The repository follows a small set of standards to keep the notebooks useful as learning material:

  • Concepts are introduced by motivation before notation.
  • New terminology is defined before it is used heavily.
  • Core algorithms include at least one concrete hand calculation or toy example.
  • Code cells are kept small and observable.
  • Randomized experiments use fixed seeds where appropriate.
  • Each notebook is self-contained and does not rely on variables from previous notebooks.
  • Markdown explanations are written for patient beginners, while the code remains close to the real algorithmic structure.

Papers and Systems

The course connects implementation details to influential papers and production systems:

| Paper or system | Concepts covered |
|---|---|
| Attention Is All You Need | Multi-Head Attention, position encoding |
| BERT | Encoder-only models, masked language modeling |
| LLaMA | RMSNorm, SwiGLU, RoPE, Pre-Norm |
| DeepSeek-V2 / DeepSeek-V3 | MLA, Multi-Token Prediction, aux-free MoE load balancing |
| Mixtral / Qwen3 | Sliding Window Attention, MoE with shared experts |
| Scaling Laws / Chinchilla | Parameter, data, and compute trade-offs |
| LoRA | Low-rank adaptation |
| RLHF / PPO / DPO | Preference alignment |
| Code Llama / DeepSeek-Coder | Fill-in-the-Middle (FIM) |
| FlashAttention / vLLM | Inference acceleration and memory management |
| Speculative Decoding | Draft-then-verify generation |
| RoPE / YaRN | Long-context extrapolation |
| Chain-of-Thought | Reasoning traces and Self-Consistency |
| Flamingo / LLaVA | Vision-language models |
| Knowledge Distillation / OPD | Compression and distillation |

Repository Structure

modern-llm-notebook/
├── notebooks/           # Chinese source notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── notebooks-en/        # English mirror notebooks
│   ├── part1-foundation/
│   ├── part2-training/
│   ├── part3-inference/
│   ├── part4-frontiers/
│   └── part5-production/
├── external/            # Upstream references (e.g. karpathy nanoGPT/minGPT)
├── karpathy_models.py   # Thin import wrapper used by a few notebooks
├── web/                 # React / Vite web viewer
├── docs/                # Static site build output
├── scripts/             # Notebook conversion and execution scripts
├── requirements.txt
├── package.json
├── README.md
└── README-CN.md

Contributing

Contributions are welcome when they improve clarity, correctness, or coverage.

Good contributions include:

  • Fixing incorrect explanations, broken cells, or outdated APIs.
  • Improving hand-calculation sections and visualizations.
  • Adding focused exercises with assertions.
  • Translating or improving bilingual documentation.
  • Proposing new notebooks for important model architectures or training methods.

Please read CONTRIBUTING.md before opening a pull request.

Citation

If Modern LLM Notebook helps your research or work, please cite:

@misc{modern-llm-notebook,
  title   = {Modern LLM Notebook: Build Modern LLMs from Scratch},
  author  = {WalkingLabs},
  year    = {2025},
  url     = {https://github.com/walkinglabs/modern-llm-notebook},
  note    = {GitHub repository, accessed 2026}
}

License

This project is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Built for engineers who want to understand LLM systems from the inside.
Maintained by walkinglabs.
