Error during OCR fine-tuning on custom dataset (Dark6273/surya-persian-number), works on example dataset #482

@Dark6273

Hi, I’m trying to fine-tune the Surya OCR model using my own dataset on Hugging Face:
https://huggingface.co/datasets/Dark6273/surya-persian-number

This dataset has the same structure as the official example dataset (datalab-to/ocr_finetune_example), with two fields per example:

{'image': Image(mode=None, decode=True),
 'text': Value('string')}
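To go beyond comparing the two schemas by eye, a column-level diff can be sketched as below. The helper name `diff_schemas` is hypothetical and not part of Surya or `datasets`. It takes any mapping of column name to feature type and compares the string form of each type. The `split="train"` in the usage comment is an assumption about how both datasets are organized.

```python
def diff_schemas(reference, candidate):
    """Compare two column -> feature-type mappings by their string form.

    Hypothetical helper for illustration only.
    """
    ref = {name: str(ftype) for name, ftype in reference.items()}
    cand = {name: str(ftype) for name, ftype in candidate.items()}
    shared = ref.keys() & cand.keys()
    return {
        "missing": sorted(ref.keys() - cand.keys()),   # columns only in the reference
        "extra": sorted(cand.keys() - ref.keys()),     # columns only in the candidate
        "type_mismatch": sorted(k for k in shared if ref[k] != cand[k]),
    }


# Possible usage with the `datasets` library (needs network access; the
# split name "train" is an assumption):
#
#   from datasets import load_dataset
#   ref = load_dataset("datalab-to/ocr_finetune_example", split="train").features
#   mine = load_dataset("Dark6273/surya-persian-number", split="train").features
#   print(diff_schemas(ref, mine))
```

An empty result in all three lists would only show that the columns agree. It says nothing about the contents of individual rows.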

I tested fine-tuning in two scenarios:

  1. Using the official example dataset (datalab-to/ocr_finetune_example): fine-tuning runs without the error.
  2. Using my custom dataset (Dark6273/surya-persian-number): training fails immediately with the following error:

ValueError: `input_ids`, `position_ids`, and `cache_position` **must** be specified.
For prefill, you must provide either (`image_tiles` and `grid_thw`) or `image_embeddings`.

This error suggests that the model is not receiving one or more of the processed inputs it expects (the token inputs, or the image inputs needed for prefill), and it occurs before any training steps begin.

Steps I followed

python surya/scripts/finetune_ocr.py \
  --output_dir surya_finetune \
  --dataset_name Dark6273/surya-persian-number \
  --per_device_train_batch_size 64 \
  --gradient_checkpointing true \
  --max_sequence_length 1024

I verified that the dataset loads correctly using datasets.load_dataset() and that the columns match the expected names.
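Matching column names do not rule out problems in individual rows, so a row-level check may also be worth running. The sketch below uses a hypothetical helper, `find_bad_rows`, which is not part of Surya. It flags rows whose text is empty or not a string, and rows whose image is missing or has no positive width and height. It assumes images expose a PIL-style `size` tuple of `(width, height)`. Whether any such row actually triggers the error above is not established here.

```python
def find_bad_rows(rows, image_key="image", text_key="text"):
    """Return (index, reason) pairs for rows that look unusable for training.

    Hypothetical helper; assumes each row is a dict and that images expose
    a PIL-style `size` attribute of the form (width, height).
    """
    problems = []
    for index, row in enumerate(rows):
        text = row.get(text_key)
        if not isinstance(text, str) or not text.strip():
            problems.append((index, "empty or non-string text"))

        image = row.get(image_key)
        if image is None:
            problems.append((index, "missing image"))
            continue

        size = getattr(image, "size", None)
        if not isinstance(size, tuple) or len(size) != 2 or min(size) <= 0:
            problems.append((index, "image has no usable size"))
    return problems


# Possible usage (needs network access; split name "train" is an assumption):
#
#   from datasets import load_dataset
#   ds = load_dataset("Dark6273/surya-persian-number", split="train")
#   print(find_bad_rows(ds))
```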

Questions

  1. Is there any additional preprocessing or specific dataset schema required for custom datasets that differ from the example?
  2. Does the example dataset itself need transformation that the script is currently doing implicitly?
  3. Could this be related to language or character encoding differences (Persian digits/RTL text)?
  4. Is this a known limitation or a bug in the current fine-tuning pipeline?
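For question 3, one way to see which digit code points the labels actually contain is sketched below. The helper `classify_digits` is hypothetical. It counts ASCII digits (U+0030 to U+0039), Arabic-Indic digits (U+0660 to U+0669), and Persian, or Extended Arabic-Indic, digits (U+06F0 to U+06F9), and lumps every other non-whitespace character into "other". It only describes the labels and makes no claim about what the Surya tokenizer supports.

```python
from collections import Counter


def classify_digits(text):
    """Count characters in `text` by digit block (hypothetical helper)."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        if 0x0030 <= code_point <= 0x0039:
            counts["ascii"] += 1
        elif 0x0660 <= code_point <= 0x0669:
            counts["arabic_indic"] += 1
        elif 0x06F0 <= code_point <= 0x06F9:
            counts["persian"] += 1
        elif not ch.isspace():
            counts["other"] += 1
    return counts


# Possible usage: sum the counters over every label in the dataset to check
# whether the labels mix digit blocks or contain unexpected characters.
#
#   total = Counter()
#   for row in ds:
#       total += classify_digits(row["text"])
```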

Thanks in advance for any guidance!

Labels: bug: breaking (crashes, errors, anything that stops execution or is runtime-breaking)