SIGSEGV in mlx::core::metal::Device::end_encoding() during LoRA training on M5 Pro #14

@barrettj

Description

Environment

  • pmetal: 0.4.0 (cargo install)
  • macOS: 26.4 (25E243)
  • Hardware: Apple M5 Pro, 48GB
  • MLX (Python, for reference): 0.31.0

Reproduction

pmetal train \
  --model Qwen/Qwen3-0.6B \
  --dataset forms-train.jsonl \
  --eval-dataset forms-val.jsonl \
  --lora-r 16 --lora-alpha 16 \
  --learning-rate 1e-4 --epochs 2 --batch-size 8 \
  --lr-schedule cosine --warmup-steps 100 \
  --no-sequence-packing \
  --output /tmp/pmetal-test

Dataset is instruction/output format JSONL, ~4,600 records, short sequences (~50-400 tokens).
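For reference, each line of the dataset is a single JSON object with instruction and output fields; a representative record (contents invented purely for illustration) might look like:

```json
{"instruction": "Extract the applicant's full name from the form text below.\n\nName: Jane Doe\nDOB: 1990-01-01", "output": "Jane Doe"}
```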

Behavior

  • Training starts fine, throughput ~1,300 tokens/s at step 10-50
  • Throughput degrades to ~800 tokens/s by step 60-100
  • SIGSEGV (null pointer dereference) at step 100, immediately after eval + checkpoint write
  • Exit code 139

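Exit code 139 is the shell's encoding of death by signal: 128 plus the signal number, and SIGSEGV is signal 11. This is easy to confirm in isolation:

```shell
# A subshell sends SIGSEGV to itself; the parent shell reports 128 + 11 = 139.
sh -c 'kill -s SEGV $$'; echo "exit code: $?"
```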
Crash log (from macOS DiagnosticReports)

Exception: EXC_BAD_ACCESS (SIGSEGV)
Subtype: KERN_INVALID_ADDRESS at 0x0000000000000000

Faulting thread stack:
  pmetal :: mlx::core::metal::Device::end_encoding(int)
  pmetal :: mlx::core::gpu::eval(mlx::core::array&)
  pmetal :: mlx::core::eval_impl(std::__1::vector<mlx::core::array, ...>, bool)
  pmetal :: mlx::core::eval(std::__1::vector<mlx::core::array, ...>)
  pmetal :: mlx_eval
  pmetal :: pmetal_mlx_rs::transforms::eval
  pmetal :: pmetal_trainer::training_loop::TrainingLoop::accumulate_gradients
  pmetal :: pmetal_trainer::training_loop::run_metal_fused::...::run_metal_fused
  pmetal :: pmetal_trainer::orchestrator::run_lora_path

Root cause analysis

pmetal-mlx-sys 0.2.4 bundles MLX v0.30.6 via FetchContent in mlx-c/CMakeLists.txt:

GIT_TAG v0.30.6
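The surrounding declaration in mlx-c/CMakeLists.txt is roughly of this shape (a sketch; the exact repository URL and accompanying options may differ):

```cmake
include(FetchContent)

FetchContent_Declare(
  mlx
  GIT_REPOSITORY "https://github.com/ml-explore/mlx.git"
  GIT_TAG v0.30.6   # pinned MLX version; predates the 0.31.0 fixes
)
FetchContent_MakeAvailable(mlx)
```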

MLX 0.31.0 includes two relevant fixes:

The M5 Pro (Apple10 Pro gen 17) is new hardware that likely exposes these bugs more readily than older chips.

Suggested fix

Bump the MLX dependency in pmetal-mlx-sys from v0.30.6 to ≥v0.31.0. The workspace currently pins pmetal-mlx-sys = "=0.2.4" in the root Cargo.toml.
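Concretely, the change in the vendored CMakeLists would be a one-line tag bump (sketch, assuming MLX's usual vX.Y.Z tag naming):

```diff
 # mlx-c/CMakeLists.txt in pmetal-mlx-sys
-GIT_TAG v0.30.6
+GIT_TAG v0.31.0
```

A new pmetal-mlx-sys release would then be needed, with the `=0.2.4` pin in the root Cargo.toml bumped to match.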

Note: the upstream mlx-rs repo has already upgraded to MLX-C 0.4.0 (commit 1deb45a).

Additional notes

  • The same training config completes successfully with mlx-lm (which uses MLX 0.31.0 via Python) in 545s on the same machine
  • With sequence packing enabled (i.e. --no-sequence-packing omitted), throughput was very low (~125 tok/s) before we disabled it, though that may be a separate issue with batch sizing

Metadata

Labels: bug (Something isn't working)