Fine-tuning tutorial: train ModelSpanExtractor on a custom domain #30

Description

@adaamko

Summary

Write a tutorial (notebook or markdown guide) showing how to fine-tune ModelSpanExtractor on custom domain data using the existing training pipeline in verbatim_rag/extractor_models/.

Motivation

The verbatim-rag-modern-bert-v2 model was trained on scientific papers, medical literature, financial tables, and other domains, but users in niche domains (legal contracts, internal documentation, code output) may want a specialized extractor. The training code exists but is undocumented for external users.

Scope

  1. Data format: how to structure {question, document, gold_spans[]} pairs as a training dataset
  2. Training: python verbatim_rag/extractor_models/train.py --data_path data/ --output_dir output/
  3. Evaluation: word-F1 on a held-out dev set
  4. Loading the fine-tuned model: ModelSpanExtractor(model_path="./output/checkpoint-best")
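For the evaluation step, one common way to define word-level F1 is bag-of-words overlap between predicted and gold spans, similar to SQuAD-style scoring. The sketch below is a minimal, self-contained version. The function name `word_f1`, the whitespace tokenization, and the lowercasing are assumptions for illustration and may differ from what the training pipeline actually computes.

```python
from collections import Counter


def word_f1(predicted_spans, gold_spans):
    """Bag-of-words F1 between predicted and gold spans (hypothetical helper).

    Both arguments are lists of strings. Tokens are lowercased and split on
    whitespace, which is an assumption, not the pipeline's confirmed behavior.
    """
    pred_tokens = " ".join(predicted_spans).lower().split()
    gold_tokens = " ".join(gold_spans).lower().split()

    # Both empty: the extractor correctly returned nothing.
    if not pred_tokens and not gold_tokens:
        return 1.0
    # Exactly one side empty: no overlap is possible.
    if not pred_tokens or not gold_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = word_f1(["the term is five years"], ["five years"])
    print(f"word-F1: {score:.3f}")
```

Averaging this score over every example in the held-out dev set gives a single number for comparing a fine-tuned checkpoint against the base model.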

Either a Jupyter notebook in examples/ or a markdown guide at docs/fine_tuning.md is acceptable.

Notes

  • The training script CLI already exists at verbatim_rag/extractor_models/train.py
  • QADataset, QAModel, and Trainer are in verbatim_rag/extractor_models/
  • A small synthetic dataset example would make the tutorial self-contained

Labels: documentation, enhancement
