## Bug Description

`surya-ocr` crashes with `torch.AcceleratorError: index out of bounds` during layout recognition and OCR text recognition on Apple Silicon (MPS backend) when processing longer PDFs (60+ pages).

The crash occurs in `surya/common/surya/encoder/__init__.py:438`:

```python
def unpack_qkv_with_mask(self, q, k, v, cu_seqlens):
    seq_lengths = cu_seqlens[1:] - cu_seqlens[:-1]
    max_seq_len = seq_lengths.max().item()  # ← crash here
```

```
torch.AcceleratorError: index 7984 is out of bounds: 2, range 0 to 2184
```
## Root Cause

This is a PyTorch MPS backend bug: the `.max()` kernel produces an incorrect index for certain tensor shapes. The tensor itself is small (just the per-batch sequence lengths), but the MPS kernel hits an edge case.
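For reference, the failing operation in isolation (values are illustrative, not from the actual crash; on CPU this always works, the bug only surfaces on the `mps` device):

```python
import torch

# Cumulative sequence lengths, as built for variable-length attention.
cu_seqlens = torch.tensor([0, 512, 1536, 2184], dtype=torch.int32)

# Per-sequence lengths: differences of adjacent cumulative offsets.
seq_lengths = cu_seqlens[1:] - cu_seqlens[:-1]  # tensor([512, 1024, 648])

# This .max() is the call that trips the buggy MPS kernel.
max_seq_len = seq_lengths.max().item()
print(max_seq_len)  # 1024
```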
## Environment
- surya-ocr: 0.17.1
- PyTorch: 2.7.0
- macOS: Darwin 25.3.0 (Apple M4 Pro)
- Python: 3.13
## Suggested Fix

**Option 1:** Use an alternative tensor op that may route through a different MPS kernel:

```python
max_seq_len = torch.amax(seq_lengths, dim=0).item()
```

**Option 2:** A surgical `.cpu()` on just this tiny tensor (negligible perf impact):

```python
max_seq_len = seq_lengths.cpu().max().item()
```
The heavy attention/matmul computations all stay on MPS. Only this small indexing operation (tens of elements) moves to CPU, so performance cost is effectively zero.
## Workarounds Tested

| Workaround | Result |
| --- | --- |
| `TORCH_DEVICE=cpu` | ✅ Works but ~10x slower, unusable for long docs |
| `PYTORCH_ENABLE_MPS_FALLBACK=1` | ❌ Doesn't help (MPS "supports" `.max()`, it's just buggy) |
| Smaller batch via `--page_range` | ⚠️ Reduces frequency but doesn't eliminate the crash |
## Related