Exported Llama 1B transformer with static 128 sequence length tries to allocate 10Gb on iOS18 causing OOM

## 🐞Describing the bug

I'm roughly following [this guide](https://machinelearning.apple.com/research/core-ml-on-device-llama) on LLM exporting. I adjusted the input names so the model can be used with this [HF demo](https://github.com/huggingface/swift-chat). I also added W8A8 quantization to reduce its size. The saved model takes 1.4 GB. By my estimate, all the activations shouldn't take more than 1 GB. Nevertheless, calling `predict` on an iPhone or iPad (both on iOS 18: an M1 iPad Pro and an iPhone 15 Pro, each with ~8 GB of RAM) OOMs, with the memory profiler showing a 10 GB malloc call inside a Metal dispatch call. The same happens with the `.mlpackage` GUI benchmarking feature. It also uses a lot of RAM (~12 GB) when benchmarking on iOS15 using that feature.
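For reference, here is a back-of-envelope sketch of the kind of activation estimate mentioned above. Every model dimension in it is an assumption (Llama-3.2-1B-like values, fp16 activations), not something read from the converted model, and the helper names `Config` and `activation_bytes` are hypothetical. It deliberately overestimates by assuming every layer's intermediates stay alive at once.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # seq_len comes from the issue title; everything else is ASSUMED
    # (Llama-3.2-1B-like dimensions), not taken from the exported model.
    seq_len: int = 128
    hidden: int = 2048
    layers: int = 16
    heads: int = 32
    kv_heads: int = 8
    head_dim: int = 64
    intermediate: int = 8192
    vocab: int = 128256
    bytes_per_elem: int = 2  # assumed fp16 activations


def activation_bytes(cfg: Config) -> int:
    """Pessimistic activation footprint of one forward pass, in bytes."""
    s = cfg.seq_len
    per_layer = (
        s * cfg.hidden                          # residual stream / layer input
        + s * cfg.heads * cfg.head_dim          # Q projection
        + 2 * s * cfg.kv_heads * cfg.head_dim   # K and V projections
        + cfg.heads * s * s                     # attention score matrix
        + s * cfg.heads * cfg.head_dim          # attention output
        + 3 * s * cfg.intermediate              # MLP gate, up, and their product
    )
    logits = s * cfg.vocab                      # full-sequence logits
    # Assume nothing is freed between layers (worst case).
    return (cfg.layers * per_layer + logits) * cfg.bytes_per_elem


if __name__ == "__main__":
    total = activation_bytes(Config())
    print(f"estimated activations: {total / 2**20:.1f} MiB")
```

Under these assumptions the total stays well below 1 GB, so a 10 GB allocation cannot be explained by activations of this shape alone.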

## To Reproduce
- [Conversion script](https://gist.github.com/BlackSamorez/3e379a408e23eb91fa8d7dfb01e76252)
- [HF demo](https://github.com/huggingface/swift-chat)
- Memory profiler trace (I don't know how to copy it out of that app)

<img width="2471" height="884" alt="Image" src="https://github.com/user-attachments/assets/a19d08de-2460-4d42-9b2b-e68e86ebb0ab" />

## System environment (please complete the following information):
```
Name: torch
Version: 2.8.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /opt/miniconda3/envs/executorch/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: executorch, timm, torchaudio, torchdata, torchsr, torchvision
---
Name: transformers
Version: 4.47.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /opt/miniconda3/envs/executorch/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
---
Name: coremltools
Version: 9.0b1
Summary: Community Tools for Core ML
Home-page: https://github.com/apple/coremltools
Author: Apple Inc.
Author-email: coremltools@apple.com
License: BSD
Location: /opt/miniconda3/envs/executorch/lib/python3.11/site-packages
Editable project location: /opt/miniconda3/envs/executorch/lib/python3.11/site-packages
Requires: attrs, cattrs, numpy, packaging, protobuf, pyaml, sympy, tqdm
Required-by: executorch
```

Converted on `macOS 15.6.1 Apple M3 Pro 18Gb`.
Ran on: 
 - `macOS 15.6.1 Apple M3 Pro 18Gb`.
 - `iPadOS 18.6.2 iPad Pro M1`
 - `iOS 18.6.2 iPhone 15 Pro`
