Environment
- pMetal: 0.4.0 (cargo install)
- macOS: 26.4 (25E243)
- Hardware: Apple M5 Pro, 48GB
- MLX (Python, for reference): 0.31.0
Reproduction
pmetal train \
--model Qwen/Qwen3-0.6B \
--dataset forms-train.jsonl \
--eval-dataset forms-val.jsonl \
--lora-r 16 --lora-alpha 16 \
--learning-rate 1e-4 --epochs 2 --batch-size 8 \
--lr-schedule cosine --warmup-steps 100 \
--no-sequence-packing \
--output /tmp/pmetal-test
Dataset is instruction/output format JSONL, ~4,600 records, short sequences (~50-400 tokens).
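Since forms-train.jsonl is not attached, a stand-in dataset of similar shape may help with attempting a reproduction. The sketch below is only an assumption about that shape: the field names "instruction" and "output" are inferred from the description above, the word counts are a loose proxy for the ~50-400 token range, and the file name synthetic-train.jsonl is hypothetical. It has not been confirmed that synthetic data triggers the crash.

```python
import json
import random


def make_records(n=4600, seed=0):
    """Build n stand-in records with 'instruction' and 'output' keys.

    The key names are an assumption based on the "instruction/output format"
    description; adjust them if the trainer expects something else.
    """
    rng = random.Random(seed)
    records = []
    for i in range(n):
        # Word count is a rough proxy for token count, not an exact match.
        n_words = rng.randint(40, 300)
        output = " ".join(f"field{rng.randint(0, 999)}" for _ in range(n_words))
        records.append({"instruction": f"Fill in form {i}.", "output": output})
    return records


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


if __name__ == "__main__":
    # Hypothetical output path; pass it to --dataset in place of forms-train.jsonl.
    write_jsonl("synthetic-train.jsonl", make_records())
```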
Behavior
- Training starts fine, with throughput around 1,300 tokens/s over steps 10-50
- Throughput degrades to around 800 tokens/s over steps 60-100
- SIGSEGV (null pointer dereference) at step 100, immediately after the eval pass and checkpoint write
- Exit code 139
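Exit code 139 is consistent with the SIGSEGV in the crash log: shells report a process killed by signal N as 128 + N, and SIGSEGV is signal 11. A minimal sketch of that decoding, using only the Python standard library (the function name describe_exit_code is illustrative):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell exit status to a signal name when it follows the 128+N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # Above 128, but not a signal number known on this platform.
            return f"exit {code} (no matching signal)"
    return f"exit {code}"


print(describe_exit_code(139))  # prints SIGSEGV
```

Note that Python's subprocess module reports the same death as returncode -11 rather than 139; the 128 + N form is what the shell shows.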
Crash log (from macOS DiagnosticReports)
Exception: EXC_BAD_ACCESS (SIGSEGV)
Subtype: KERN_INVALID_ADDRESS at 0x0000000000000000
Faulting thread stack:
pmetal :: mlx::core::metal::Device::end_encoding(int)
pmetal :: mlx::core::gpu::eval(mlx::core::array&)
pmetal :: mlx::core::eval_impl(std::__1::vector<mlx::core::array, ...>, bool)
pmetal :: mlx::core::eval(std::__1::vector<mlx::core::array, ...>)
pmetal :: mlx_eval
pmetal :: pmetal_mlx_rs::transforms::eval
pmetal :: pmetal_trainer::training_loop::TrainingLoop::accumulate_gradients
pmetal :: pmetal_trainer::training_loop::run_metal_fused::...::run_metal_fused
pmetal :: pmetal_trainer::orchestrator::run_lora_path
Root cause analysis
pmetal-mlx-sys 0.2.4 bundles MLX v0.30.6 via FetchContent in mlx-c/CMakeLists.txt:
MLX 0.31.0 includes two relevant fixes:
The M5 Pro (Apple10 Pro gen 17) is new hardware that likely exposes these bugs more readily than older chips.
Suggested fix
Bump the MLX dependency in pmetal-mlx-sys from v0.30.6 to ≥v0.31.0. The workspace currently pins pmetal-mlx-sys = "=0.2.4" in the root Cargo.toml.
Note: the upstream mlx-rs repo has already upgraded to MLX-C 0.4.0 (commit 1deb45a).
Additional notes
- The same training config completes successfully with mlx-lm (which uses MLX 0.31.0 via Python) in 545s on the same machine
- Sequence packing mode (--no-sequence-packing omitted) also showed very low throughput (125 tok/s) before we disabled it, though that may be a separate issue with batch sizing