## Bug Description

`surya-ocr` crashes with `torch.AcceleratorError: index out of bounds` during layout recognition and OCR text recognition on Apple Silicon (MPS backend) when processing longer PDFs (60+ pages).

The crash occurs in `surya/common/surya/encoder/__init__.py:438`:

```python
def unpack_qkv_with_mask(self, q, k, v, cu_seqlens):
    seq_lengths = cu_seqlens[1:] - cu_seqlens[:-1]
    max_seq_len = seq_lengths.max().item()  # ← crash here
```

```
torch.AcceleratorError: index 7984 is out of bounds: 2, range 0 to 2184
```
## Root Cause

This is a PyTorch MPS backend bug: the `.max()` kernel produces an incorrect index for certain tensor shapes. The tensor itself is small (just the per-batch sequence lengths), but the MPS kernel hits an edge case.
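For reference, the failing operation in isolation (values are illustrative, not from the actual crash; on CPU this always works, the bug only surfaces on the `mps` device):

```python
import torch

# Cumulative sequence lengths, as built for variable-length attention.
cu_seqlens = torch.tensor([0, 512, 1536, 2184], dtype=torch.int32)

# Per-sequence lengths: differences of adjacent cumulative offsets.
seq_lengths = cu_seqlens[1:] - cu_seqlens[:-1]  # tensor([512, 1024, 648])

# This .max() is the call that trips the buggy MPS kernel.
max_seq_len = seq_lengths.max().item()
print(max_seq_len)  # 1024
```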
## Environment
- surya-ocr: 0.17.1
- PyTorch: 2.7.0
- macOS: Darwin 25.3.0 (Apple M4 Pro)
- Python: 3.13
## Suggested Fix

**Option 1:** Use an alternative tensor op that may route through a different MPS kernel:

```python
max_seq_len = torch.amax(seq_lengths, dim=0).item()
```

**Option 2:** A surgical `.cpu()` on just this tiny tensor (negligible perf impact):

```python
max_seq_len = seq_lengths.cpu().max().item()
```
The heavy attention/matmul computations all stay on MPS. Only this small indexing operation (tens of elements) moves to CPU, so performance cost is effectively zero.
## Workarounds Tested

| Workaround | Result |
| --- | --- |
| `TORCH_DEVICE=cpu` | ✅ Works but ~10x slower, unusable for long docs |
| `PYTORCH_ENABLE_MPS_FALLBACK=1` | ❌ Doesn't help (MPS "supports" `.max()`, it's just buggy) |
| Smaller batch via `--page_range` | ⚠️ Reduces frequency but doesn't eliminate the crash |
## Related