A production-style implementation of nanoGPT: a GPT-2 language model for training and text generation in PyTorch.
## Features
- Full GPT-2 architecture with Flash Attention
- Training from scratch, resuming from checkpoints, or fine-tuning pretrained GPT-2 weights
- Distributed Data Parallel (DDP) for multi-GPU training
- Mixed precision training (bfloat16/float16)
- Cosine learning rate schedule with warmup
- Gradient accumulation and gradient clipping
- Weights & Biases logging and artifact tracking
- Hydra configuration management
- Dataset preparation for Shakespeare (BPE + char-level) and OpenWebText
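The cosine learning-rate schedule with warmup listed above can be sketched as follows. This is a minimal sketch: the function name and the defaults for `max_lr`, `min_lr`, `warmup_iters`, and `lr_decay_iters` are illustrative, not the repo's actual config values.

```python
import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=600000):
    """Cosine decay from max_lr to min_lr after a linear warmup.

    Illustrative sketch; parameter names and defaults may differ
    from the repo's actual config keys.
    """
    if it < warmup_iters:
        # linear warmup from 0 up to max_lr
        return max_lr * it / warmup_iters
    if it > lr_decay_iters:
        # after decay ends, hold at the floor
        return min_lr
    # cosine interpolation between max_lr and min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)
```

The schedule is piecewise: linear ramp, cosine decay, then a constant floor, which is the shape the trainer applies per iteration before each optimizer step.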
## Setup

Install uv:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Install dependencies:

```shell
uv sync
```

## Quick start: character-level Shakespeare

Prepare the dataset:
```shell
uv run python data/shakespeare_char/prepare.py
```

Train:
```shell
uv run python train.py --config-name train_shakespeare_char
```

Sample from the trained model:
```shell
uv run python sample.py --out_dir=out-shakespeare-char
```

## GPT-2 on OpenWebText

Prepare the dataset (~54GB download):
```shell
uv run python data/openwebtext/prepare.py
```

Train on 8x A100 GPUs:
```shell
torchrun --standalone --nproc_per_node=8 train.py --config-name train_gpt2
```

## Max OWT (429M)

A maxed-out 429M-parameter model (30 layers, 16 heads, 1024-dim embeddings) designed to fit on a single RTX 4070 12GB GPU. It trains for ~2 epochs (~551k iterations, ~14 days).
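The 429M figure is consistent with the architecture numbers above. A back-of-the-envelope count, weights only (biases and LayerNorm gains add well under 0.5M), assuming nanoGPT's conventions of a vocab size padded to 50304, tied input/output embeddings counted once, and position embeddings excluded from the reported total:

```python
# Rough parameter count for the 429M config:
# 30 layers, 16 heads, 1024-dim embeddings.
n_layer, d, vocab = 30, 1024, 50304

attn = 4 * d * d          # qkv projection (3*d*d) + output projection (d*d)
mlp = 8 * d * d           # up-projection d->4d plus down-projection 4d->d
per_block = attn + mlp    # 12 * d^2 weights per transformer block

total = n_layer * per_block + vocab * d   # blocks + token embeddings
print(f"{total / 1e6:.1f}M parameters")   # ~429.0M
```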
Prepare the dataset (~54GB download, if not already done):
```shell
uv run python data/openwebtext/prepare.py
```

Train from scratch:
```shell
uv run python train.py --config-name train_max_owt
```

Resume from checkpoint (loads out-max-owt/ckpt.pt):
```shell
uv run python train.py --config-name train_max_owt init_from=resume
```

The W&B run ID is saved in the checkpoint, so resuming automatically continues the same W&B run. To override with a different run ID:
```shell
uv run python train.py --config-name train_max_owt init_from=resume wandb_run_id=<RUN_ID>
```

Sample from the trained model:
```shell
uv run python sample.py --out_dir=out-max-owt
```

## Fine-tuning on Shakespeare

Prepare the Shakespeare dataset (BPE tokenized):
```shell
uv run python data/shakespeare/prepare.py
```

Fine-tune:
```shell
uv run python train.py --config-name finetune_shakespeare
```

## Sampling

```shell
# From checkpoint
uv run python sample.py --out_dir=out-shakespeare-char

# From pretrained GPT-2
uv run python sample.py --init_from=gpt2-xl

# With custom prompt
uv run python sample.py --init_from=gpt2 --start="To be or not to be"
```

## Configuration

Training configs are in configs/ and use Hydra.
Override any parameter from the command line:
```shell
uv run python train.py --config-name train_shakespeare_char model.dropout=0.1
```

Available configs:
- `config.yaml`: default GPT-2 (124M) on OpenWebText
- `train_shakespeare_char.yaml`: character-level Shakespeare (small, fast)
- `train_gpt2.yaml`: full GPT-2 training on OpenWebText
- `train_max_owt.yaml`: max GPT-2 (429M) on OpenWebText for an RTX 4070 12GB
- `finetune_shakespeare.yaml`: fine-tune GPT-2-XL on Shakespeare
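To picture what a dotted override like `model.dropout=0.1` does, here is a minimal pure-Python sketch of override resolution. This is not Hydra itself (Hydra additionally handles type-aware conversion, interpolation, and config composition), and `apply_override` is a hypothetical helper written only for illustration:

```python
# Illustrative sketch: how a dotted override such as "model.dropout=0.1"
# updates a nested config. Hydra's real override grammar is far richer.
def apply_override(cfg: dict, override: str) -> None:
    key, value = override.split("=", 1)
    *path, leaf = key.split(".")
    node = cfg
    for part in path:
        node = node[part]          # walk down to the parent of the leaf key
    # naive literal parsing: try int, then float, else keep the string
    try:
        node[leaf] = int(value)
    except ValueError:
        try:
            node[leaf] = float(value)
        except ValueError:
            node[leaf] = value

cfg = {"model": {"dropout": 0.0, "n_layer": 6}}
apply_override(cfg, "model.dropout=0.1")
# cfg["model"]["dropout"] is now 0.1; every other key is untouched
```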