SIGSEGV in mlx::core::metal::Device::end_encoding() during LoRA training on M5 Pro #14

@barrettj

Description

Environment

  • pmetal: 0.4.0 (cargo install)
  • macOS: 26.4 (25E243)
  • Hardware: Apple M5 Pro, 48GB
  • MLX (Python, for reference): 0.31.0

Reproduction

pmetal train \
  --model Qwen/Qwen3-0.6B \
  --dataset forms-train.jsonl \
  --eval-dataset forms-val.jsonl \
  --lora-r 16 --lora-alpha 16 \
  --learning-rate 1e-4 --epochs 2 --batch-size 8 \
  --lr-schedule cosine --warmup-steps 100 \
  --no-sequence-packing \
  --output /tmp/pmetal-test

Dataset is instruction/output format JSONL, ~4,600 records, short sequences (~50-400 tokens).
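For reference, each line of the dataset is a single JSON object with instruction and output fields; a representative record (contents invented purely for illustration) might look like:

```json
{"instruction": "Extract the applicant's full name from the form text below.\n\nName: Jane Doe\nDOB: 1990-01-01", "output": "Jane Doe"}
```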

Behavior

  • Training starts fine, throughput ~1,300 tokens/s at step 10-50
  • Throughput degrades to ~800 tokens/s by step 60-100
  • SIGSEGV (null pointer dereference) at step 100, immediately after eval + checkpoint write
  • Exit code 139

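Exit code 139 is the shell's encoding of death by signal: 128 plus the signal number, and SIGSEGV is signal 11. This is easy to confirm in isolation:

```shell
# A subshell sends SIGSEGV to itself; the parent shell reports 128 + 11 = 139.
sh -c 'kill -s SEGV $$'; echo "exit code: $?"
```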
Crash log (from macOS DiagnosticReports)

Exception: EXC_BAD_ACCESS (SIGSEGV)
Subtype: KERN_INVALID_ADDRESS at 0x0000000000000000

Faulting thread stack:
  pmetal :: mlx::core::metal::Device::end_encoding(int)
  pmetal :: mlx::core::gpu::eval(mlx::core::array&)
  pmetal :: mlx::core::eval_impl(std::__1::vector<mlx::core::array, ...>, bool)
  pmetal :: mlx::core::eval(std::__1::vector<mlx::core::array, ...>)
  pmetal :: mlx_eval
  pmetal :: pmetal_mlx_rs::transforms::eval
  pmetal :: pmetal_trainer::training_loop::TrainingLoop::accumulate_gradients
  pmetal :: pmetal_trainer::training_loop::run_metal_fused::...::run_metal_fused
  pmetal :: pmetal_trainer::orchestrator::run_lora_path

Root cause analysis

pmetal-mlx-sys 0.2.4 bundles MLX v0.30.6 via FetchContent in mlx-c/CMakeLists.txt:

GIT_TAG v0.30.6
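The surrounding declaration in mlx-c/CMakeLists.txt is roughly of this shape (a sketch; the exact repository URL and accompanying options may differ):

```cmake
include(FetchContent)

FetchContent_Declare(
  mlx
  GIT_REPOSITORY "https://github.com/ml-explore/mlx.git"
  GIT_TAG v0.30.6   # pinned MLX version; predates the 0.31.0 fixes
)
FetchContent_MakeAvailable(mlx)
```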

MLX 0.31.0 includes two relevant fixes:

The M5 Pro (Apple10 Pro gen 17) is new hardware that likely exposes these bugs more readily than older chips.

Suggested fix

Bump the MLX dependency in pmetal-mlx-sys from v0.30.6 to ≥v0.31.0. The workspace currently pins pmetal-mlx-sys = "=0.2.4" in the root Cargo.toml.
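Concretely, the change in the vendored CMakeLists would be a one-line tag bump (sketch, assuming MLX's usual vX.Y.Z tag naming):

```diff
 # mlx-c/CMakeLists.txt in pmetal-mlx-sys
-GIT_TAG v0.30.6
+GIT_TAG v0.31.0
```

A new pmetal-mlx-sys release would then be needed, with the `=0.2.4` pin in the root Cargo.toml bumped to match.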

Note: the upstream mlx-rs repo has already upgraded to MLX-C 0.4.0 (commit 1deb45a).

Additional notes

  • The same training config completes successfully with mlx-lm (which uses MLX 0.31.0 via Python) in 545s on the same machine
  • With sequence packing enabled (i.e. --no-sequence-packing omitted), throughput was very low (~125 tok/s) before we disabled it, though that may be a separate issue with batch sizing

Metadata

Labels: bug (Something isn't working)