Hi, I’m trying to fine-tune the Surya OCR model using my own dataset on Hugging Face:
https://huggingface.co/datasets/Dark6273/surya-persian-number
This dataset has the same structure as the official example dataset (datalab-to/ocr_finetune_example), with two fields per example:
```python
{'image': Image(mode=None, decode=True),
 'text': Value('string')}
```
I tested fine-tuning in two scenarios:
- Using the official example dataset (datalab-to/ocr_finetune_example) — fine-tuning runs without the error.
- Using my custom dataset (Dark6273/surya-persian-number) — training fails immediately, before any steps run, with:

```
ValueError: `input_ids`, `position_ids`, and `cache_position` **must** be specified.
For prefill, you must provide either (`image_tiles` and `grid_thw`) or `image_embeddings`.
```
This error suggests the model is not receiving the processed image inputs it expects, and it is raised before the first training step.

Steps I followed:
```bash
python surya/scripts/finetune_ocr.py \
    --output_dir surya_finetune \
    --dataset_name Dark6273/surya-persian-number \
    --per_device_train_batch_size 64 \
    --gradient_checkpointing true \
    --max_sequence_length 1024
```
I verified that the dataset loads correctly with `datasets.load_dataset()` and that the column names match the expected ones.
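One thing I have been trying to rule out is the image mode: the schema reports `Image(mode=None, ...)`, so individual images may decode as `L`, `P`, or `RGBA` rather than the `RGB` that vision processors typically expect. This is a minimal sketch of the normalization I tested; the `map` call at the end is hypothetical and assumes a loaded `datasets.Dataset`:

```python
from PIL import Image


def to_rgb(image: Image.Image) -> Image.Image:
    """Normalize any PIL image mode (L, P, RGBA, ...) to RGB."""
    return image if image.mode == "RGB" else image.convert("RGB")


# Hypothetical usage on a loaded dataset (not run here):
# ds = load_dataset("Dark6273/surya-persian-number", split="train")
# ds = ds.map(lambda ex: {"image": to_rgb(ex["image"])})

# Quick sanity check on a synthetic grayscale image:
gray = Image.new("L", (32, 32), color=128)
print(to_rgb(gray).mode)  # RGB
```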
- Is there any additional preprocessing or specific dataset schema required for custom datasets that differ from the example?
- Does the example dataset itself need transformation that the script is currently doing implicitly?
- Could this be related to language or character encoding differences (Persian digits/RTL text)?
- Is this a known limitation or a bug in the current fine-tuning pipeline?
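On the encoding question above: Persian digits are distinct Unicode codepoints (U+06F0 to U+06F9), separate from ASCII `0-9` and from the Arabic-Indic block, so a tokenizer coverage gap is at least conceivable — though I would not expect that to surface as this particular prefill error. A quick check of the codepoints involved:

```python
# Persian (Extended Arabic-Indic) digits occupy U+06F0..U+06F9,
# distinct from ASCII 0-9 and from Arabic-Indic U+0660..U+0669.
persian_digits = "۰۱۲۳۴۵۶۷۸۹"
print([hex(ord(c)) for c in persian_digits])
```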
Thanks in advance for any guidance!