Summary
Write a tutorial (notebook or markdown guide) showing how to fine-tune ModelSpanExtractor on custom domain data using the existing training pipeline in verbatim_rag/extractor_models/.
Motivation
The verbatim-rag-modern-bert-v2 model was trained on scientific papers, medical literature, financial tables, and other domains, but users in niche domains (legal contracts, internal documentation, code output) may want a specialized extractor. The training code exists but is undocumented for external users.
Scope
- Data format: how to structure {question, document, gold_spans[]} pairs as a training dataset
- Training: python verbatim_rag/extractor_models/train.py --data_path data/ --output_dir output/
- Evaluation: word-F1 on a held-out dev set
- Loading the fine-tuned model: ModelSpanExtractor(model_path="./output/checkpoint-best")
Either a Jupyter notebook in examples/ or a markdown guide at docs/fine_tuning.md is acceptable.
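For the evaluation step, the tutorial could include a small reference implementation of word-F1 so readers can sanity-check dev-set scores by hand. The sketch below is an assumption about the metric, not the repository's own code: it uses the common SQuAD-style bag-of-words definition (lowercased, whitespace-tokenized, all spans for an example pooled together). The function name word_f1 is hypothetical.

```python
from collections import Counter


def word_f1(predicted_spans, gold_spans):
    """Bag-of-words F1 between predicted and gold spans for one example.

    Assumption: tokens are lowercased and split on whitespace, and all spans
    of an example are pooled. The project's metric may tokenize differently.
    """
    pred_tokens = Counter(" ".join(predicted_spans).lower().split())
    gold_tokens = Counter(" ".join(gold_spans).lower().split())

    # Both empty: the model correctly extracted nothing.
    if not pred_tokens and not gold_tokens:
        return 1.0

    # Multiset intersection counts each shared word at most min(pred, gold) times.
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(gold_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

A dev-set score would then be the mean of word_f1 over all held-out examples.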
Notes
- The training script CLI already exists at verbatim_rag/extractor_models/train.py
- QADataset, QAModel, and Trainer are in verbatim_rag/extractor_models/
- A small synthetic dataset example would make the tutorial self-contained
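As a starting point for the synthetic dataset, something like the sketch below might work. It builds a few {question, document, gold_spans[]} records and checks that every gold span appears verbatim in its document. The on-disk layout (one JSON object per line), the helper names, and the example contents are all assumptions made for illustration; the format that QADataset actually expects is exactly what the tutorial should document.

```python
import json
from pathlib import Path

# Hypothetical synthetic examples in a legal-contract flavour.
EXAMPLES = [
    {
        "question": "When is rent due?",
        "document": "The tenant shall pay rent on the first day of each month. "
                    "Late payments incur a fee of 5 percent.",
        "gold_spans": ["The tenant shall pay rent on the first day of each month."],
    },
    {
        "question": "What is the late fee?",
        "document": "The tenant shall pay rent on the first day of each month. "
                    "Late payments incur a fee of 5 percent.",
        "gold_spans": ["Late payments incur a fee of 5 percent."],
    },
]


def is_valid(example):
    """Return True if every gold span is an exact substring of the document."""
    return all(span in example["document"] for span in example["gold_spans"])


def write_jsonl(examples, path):
    """Write examples as JSON lines (assumed format), rejecting invalid spans."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for example in examples:
            if not is_valid(example):
                raise ValueError(f"gold span not found in document: {example['question']}")
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")
    return path
```

Calling write_jsonl(EXAMPLES, "data/train.jsonl") would place a file under the data/ directory passed as --data_path above; the file name train.jsonl is a placeholder, not something the training script is known to require.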