A project for fine-tuning language models with Direct Preference Optimization (DPO) using the Hugging Face `trl`, `peft`, and `transformers` libraries. DPO is a simpler, more stable alternative to RLHF: it optimizes the model directly on chosen/rejected response pairs instead of training a separate reward model.
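The core idea can be made concrete with the per-example DPO loss. The sketch below is plain Python for illustration only (the project itself relies on `trl`'s `DPOTrainer`); `dpo_loss` and its argument names are hypothetical:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy prefers each response
    # than the reference model does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the chosen-vs-rejected margin via a logistic loss.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen response more than the reference does:
print(round(dpo_loss(-10.0, -12.0, -11.0, -11.0), 4))  # → 0.5981
```

With a zero margin the loss is exactly `log(2) ≈ 0.693`; driving the margin up pushes the loss toward zero, which is how `beta` controls preference strength.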
- DPO Training: Efficient training with the `trl` library's `DPOTrainer` class
- QLoRA Integration: Memory-efficient 4-bit quantization for GPU training
- Modular Architecture: Clean separation of data processing, model initialization, and training logic
- Automatic Model Management: Handles both policy (trainable) and reference (frozen) models automatically
- Flexible CLI: Easy parameter configuration via `argparse`
- Interactive Inference: Test your trained model with a simple chat interface
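For the QLoRA integration above, a typical setup pairs a 4-bit `BitsAndBytesConfig` with a `LoraConfig`. The values below are plausible defaults shown for illustration, not necessarily the ones used in `train.py`:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization of the frozen base weights (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the standard QLoRA dtype
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

# Small trainable LoRA adapters on top of the quantized model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```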
- Python 3.8+
- CUDA-capable GPU (recommended for QLoRA)
- 8GB+ RAM (16GB+ recommended)
```bash
# Clone the repository
git clone https://github.com/AbdulSametTurkmenoglu/dpo.git
cd dpo

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\activate
```
```bash
# Install dependencies
pip install -r requirements.txt
```

Basic training with default settings (TinyLlama + UltraFeedback dataset):

```bash
python train.py
```

Customized training parameters:

```bash
python train.py --num_samples 1000 --epochs 2 --beta 0.15 --output_dir dpo_model_v2
```

Training on CPU/MPS (without quantization):

```bash
python train.py --no_quantization
```

List all available training arguments:

```bash
python train.py --help
```

Key parameters:

- `--model_name`: Base model to fine-tune (default: `TinyLlama/TinyLlama-1.1B-Chat-v1.0`)
- `--dataset_name`: Hugging Face dataset (default: `argilla/ultrafeedback-binarized-preferences-cleaned`)
- `--num_samples`: Number of training samples (default: 500)
- `--epochs`: Training epochs (default: 1)
- `--beta`: DPO beta parameter controlling preference strength (default: 0.1)
- `--output_dir`: Directory to save the trained model (default: `dpo_tinyllama_model`)
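A minimal `argparse` setup mirroring the documented flags and defaults might look like this (the authoritative definitions live in `train.py`; `build_parser` is a hypothetical helper name):

```python
import argparse

def build_parser():
    # Flags and defaults taken from the README's parameter list.
    p = argparse.ArgumentParser(description="DPO fine-tuning")
    p.add_argument("--model_name", default="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    p.add_argument("--dataset_name",
                   default="argilla/ultrafeedback-binarized-preferences-cleaned")
    p.add_argument("--num_samples", type=int, default=500)
    p.add_argument("--epochs", type=int, default=1)
    p.add_argument("--beta", type=float, default=0.1)
    p.add_argument("--output_dir", default="dpo_tinyllama_model")
    p.add_argument("--no_quantization", action="store_true")
    return p

args = build_parser().parse_args(["--num_samples", "1000", "--beta", "0.15"])
print(args.num_samples, args.beta)  # → 1000 0.15
```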
Interactive chat with the trained model:

```bash
python inference.py
```

Custom model path:

```bash
python inference.py --base_model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --adapter_path "dpo_model_v2"
```

The inference script loads the base model and applies your trained LoRA adapter on top of it.
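The load path described above (base model, then LoRA adapter on top) can be sketched with `peft`'s `PeftModel`. Paths are the documented defaults and the prompt is illustrative; adjust both to your run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
adapter_path = "dpo_tinyllama_model"   # directory written by train.py

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter_path)  # apply the LoRA adapter
model.eval()

prompt = "Explain DPO in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```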
- Data Loading: Loads preference pairs (chosen/rejected responses) from HuggingFace datasets
- Model Setup: Initializes base model with optional QLoRA quantization
- DPO Training: Trains the model to prefer chosen responses over rejected ones
- Adapter Saving: Saves only the trained LoRA adapters (memory efficient)
- Inference: Loads base model + adapters for text generation
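The data-loading step above can be illustrated with a small helper that filters raw records into the `prompt`/`chosen`/`rejected` fields `DPOTrainer` expects. `build_preference_pairs` is a hypothetical name, and the flat field layout assumes the cleaned UltraFeedback schema; verify against the actual dataset:

```python
def build_preference_pairs(records, num_samples=500):
    """Select up to num_samples usable preference pairs for DPO."""
    pairs = []
    for rec in records[:num_samples]:
        # A pair where both responses are identical carries no
        # preference signal, so skip it.
        if rec["chosen"] == rec["rejected"]:
            continue
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": rec["chosen"],
            "rejected": rec["rejected"],
        })
    return pairs

sample = [
    {"prompt": "2+2?", "chosen": "4", "rejected": "5"},
    {"prompt": "Capital of France?", "chosen": "Paris", "rejected": "Paris"},
]
print(len(build_preference_pairs(sample)))  # → 1
```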