
Commit 93aa3c4

mergennachin and claude committed
Use MmapUseMlockIgnoreErrors to pin model pages in RAM
On memory-constrained machines (e.g. 32 GB of RAM running a 21 GB model), plain Mmap causes pages to be evicted and re-faulted constantly, which limits decode to 0.2 tok/s. MmapUseMlockIgnoreErrors calls mlock() on the pages after mmap, pinning the weights in RAM once they are loaded. If mlock() fails because it would exceed the system limit, the error is ignored and loading proceeds without pinning. Measured 4.7 tok/s decode on an M3 Pro with 32 GB (a 24x speedup).

Co-authored-by: Claude <noreply@anthropic.com>
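The "mmap, then mlock, ignore mlock errors" mechanism described above can be sketched with plain POSIX calls. This is a minimal sketch under stated assumptions, not ExecuTorch's actual loader: `MappedFile`, `map_and_try_lock`, and `unmap_file` are hypothetical names. It maps a file read-only, attempts `mlock()` on the whole range, and records a failure instead of propagating it, so the caller still gets a usable (but unpinned) mapping.

```cpp
// Hypothetical sketch of "mmap + mlock, ignore mlock errors" using POSIX calls.
// Not ExecuTorch's implementation; all names here are illustrative.
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFile {
  void* data = nullptr;  // start of the read-only mapping, nullptr on failure
  size_t size = 0;       // mapping length in bytes
  bool locked = false;   // true only if mlock() succeeded
};

// Maps `path` read-only and tries to pin its pages in RAM.
// An mlock() failure (e.g. the locked-memory limit is too small) is ignored:
// the mapping is still returned, but its pages may be evicted under pressure.
MappedFile map_and_try_lock(const char* path) {
  MappedFile out;
  int fd = open(path, O_RDONLY);
  if (fd < 0) return out;
  struct stat st;
  if (fstat(fd, &st) != 0 || st.st_size <= 0) {
    close(fd);
    return out;
  }
  const size_t size = static_cast<size_t>(st.st_size);
  void* p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping remains valid after the descriptor is closed
  if (p == MAP_FAILED) return out;
  out.data = p;
  out.size = size;
  // On success, mlock() guarantees the pages are resident when it returns,
  // so they are not faulted in (and possibly evicted again) during decode.
  out.locked = (mlock(p, size) == 0);
  return out;
}

// Releases the mapping; munlock() is optional because munmap() drops locks too.
void unmap_file(MappedFile& f) {
  if (f.data == nullptr) return;
  assert(f.size > 0);  // a non-null mapping always has a positive length
  if (f.locked) munlock(f.data, f.size);
  munmap(f.data, f.size);
  f = MappedFile{};
}
```

The design choice mirrors the commit's trade-off: a failed `mlock()` degrades to ordinary demand-paged `mmap` rather than aborting the load, and `locked` lets a caller log whether pinning actually took effect.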
1 parent 9c23685 commit 93aa3c4

1 file changed

Lines changed: 1 addition & 1 deletion


examples/models/gemma4_31b/main.cpp

@@ -155,7 +155,7 @@ int main(int argc, char** argv) {
   auto module = std::make_unique<Module>(
       FLAGS_model_path,
       data_files,
-      Module::LoadMode::Mmap,
+      Module::LoadMode::MmapUseMlockIgnoreErrors,
       /*event_tracer=*/nullptr,
       /*memory_allocator=*/nullptr,
       /*temp_allocator=*/nullptr,
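Because the new load mode silently continues when `mlock()` exceeds the system limit, it can be useful to check that limit up front. The sketch below is a hypothetical helper (`memlock_limit_allows` is not part of ExecuTorch) that compares a byte count with the soft `RLIMIT_MEMLOCK` limit. It assumes Linux-style semantics: processes with `CAP_IPC_LOCK` bypass this limit, memory the process has already locked counts against it, and other platforms such as macOS may enforce different wired-memory limits, so a `true` result is a hint, not a guarantee.

```cpp
// Hypothetical helper: does `bytes` fit under the soft RLIMIT_MEMLOCK limit?
// Linux-oriented assumption; privileged processes and other OSes may differ.
#include <cassert>
#include <cstdint>
#include <sys/resource.h>

bool memlock_limit_allows(std::uint64_t bytes) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) return false;  // limit unknown
  if (lim.rlim_cur == RLIM_INFINITY) return true;           // no limit set
  return bytes <= static_cast<std::uint64_t>(lim.rlim_cur);
}
```

For example, if `memlock_limit_allows(21ull << 30)` returns `false`, a 21 GiB model loaded with `MmapUseMlockIgnoreErrors` would likely end up mapped but unpinned, i.e. behaving much like plain `Mmap`.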
