Hi, I’m trying to fine-tune the Surya OCR model using my own dataset on Hugging Face:
https://huggingface.co/datasets/Dark6273/surya-persian-number
This dataset has the same structure as the official example dataset (datalab-to/ocr_finetune_example), with two fields per example:
```python
{'image': Image(mode=None, decode=True),
 'text': Value('string')}
```
I tested fine-tuning in two scenarios:
- Using the official example dataset (datalab-to/ocr_finetune_example) — fine-tuning runs without the error.
- Using my custom dataset (Dark6273/surya-persian-number) — training fails immediately, before any steps run, with:

```
ValueError: `input_ids`, `position_ids`, and `cache_position` **must** be specified.
For prefill, you must provide either (`image_tiles` and `grid_thw`) or `image_embeddings`.
```
This error suggests the model is not receiving the processed image inputs it expects, and it is raised before the first training step.

Steps I followed:
```bash
python surya/scripts/finetune_ocr.py \
    --output_dir surya_finetune \
    --dataset_name Dark6273/surya-persian-number \
    --per_device_train_batch_size 64 \
    --gradient_checkpointing true \
    --max_sequence_length 1024
```
I verified that the dataset loads correctly with `datasets.load_dataset()` and that the column names match the expected ones.
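One thing I have been trying to rule out is the image mode: the schema reports `Image(mode=None, ...)`, so individual images may decode as `L`, `P`, or `RGBA` rather than the `RGB` that vision processors typically expect. This is a minimal sketch of the normalization I tested; the `map` call at the end is hypothetical and assumes a loaded `datasets.Dataset`:

```python
from PIL import Image


def to_rgb(image: Image.Image) -> Image.Image:
    """Normalize any PIL image mode (L, P, RGBA, ...) to RGB."""
    return image if image.mode == "RGB" else image.convert("RGB")


# Hypothetical usage on a loaded dataset (not run here):
# ds = load_dataset("Dark6273/surya-persian-number", split="train")
# ds = ds.map(lambda ex: {"image": to_rgb(ex["image"])})

# Quick sanity check on a synthetic grayscale image:
gray = Image.new("L", (32, 32), color=128)
print(to_rgb(gray).mode)  # RGB
```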
- Is there any additional preprocessing or specific dataset schema required for custom datasets that differ from the example?
- Does the example dataset itself need transformation that the script is currently doing implicitly?
- Could this be related to language or character encoding differences (Persian digits/RTL text)?
- Is this a known limitation or a bug in the current fine-tuning pipeline?
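On the encoding question above: Persian digits are distinct Unicode codepoints (U+06F0 to U+06F9), separate from ASCII `0-9` and from the Arabic-Indic block, so a tokenizer coverage gap is at least conceivable — though I would not expect that to surface as this particular prefill error. A quick check of the codepoints involved:

```python
# Persian (Extended Arabic-Indic) digits occupy U+06F0..U+06F9,
# distinct from ASCII 0-9 and from Arabic-Indic U+0660..U+0669.
persian_digits = "۰۱۲۳۴۵۶۷۸۹"
print([hex(ord(c)) for c in persian_digits])
```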
Thanks in advance for any guidance!