The current implementation with --fit on uses dynamic approximation to estimate how many layers llama-server can fit into the GPU, while leaving a default buffer of roughly 1 GB unused. This forces more layers onto the CPU.
In situations where (1) VRAM is tight (e.g. 16 GB) and (2) the GPU is used purely for compute rather than graphics (the nvidia_drm module is not loaded), setting --fit-target to 128 or 256 will either put all layers on the GPU, or in the best case fit all layers plus the configured KV cache into the GPU. This yields at least a 10x difference in prompt processing (pp) and token generation (tg) performance.
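As a concrete illustration, the difference comes down to the invocation below. This is a sketch based on the description above: the model path is a placeholder, and the assumption that --fit-target values of 128/256 are a reserved-buffer size in MiB is an interpretation, not confirmed behavior.

```shell
# Default behaviour: --fit on keeps ~1 GB of VRAM as headroom,
# which on a 16 GB card can push several layers onto the CPU.
llama-server -m ./model.gguf --fit on

# Headless compute-only GPU (nvidia_drm not loaded): shrink the
# reserved buffer so all layers, and ideally the KV cache too,
# fit into VRAM.
llama-server -m ./model.gguf --fit on --fit-target 256
```

On a display-attached GPU the larger default buffer exists for a reason, so the smaller target is only safe when nothing else is consuming VRAM.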