[Feature] Add an Unsloth Studio UI setting to tune the llama-server --fit-target flag and squeeze out extra performance. #4857

@xyehya

Description

The current implementation runs llama-server with --fit on, which dynamically estimates how many layers fit on the GPU while leaving a default buffer of around 1GB of VRAM unused. That buffer forces more layers onto the CPU than necessary.
In situations where 1) VRAM is tight (e.g. 16GB) and 2) the GPU is used purely for compute and not graphics (the nvidia_drm module is not loaded), setting --fit-target to 128 or 256 will either 1) place all layers on the GPU or, in the best case, 2) fit all layers plus the configured KV cache on the GPU. The result is at least a 10x performance difference in prompt processing (pp) and generation (pg).
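To make the request concrete, here is a minimal sketch of how a UI value could be passed through to the llama-server command line. It is not Unsloth Studio's actual code: the function name `build_llama_server_args`, the parameter `fit_target_mib`, and the file name `model.gguf` are hypothetical, and the sketch assumes the --fit-target value is expressed in MiB (consistent with the 128/256 values versus the ~1GB default mentioned above).

```python
from typing import List, Optional


def build_llama_server_args(
    model_path: str, fit_target_mib: Optional[int] = None
) -> List[str]:
    """Build a llama-server argument list (hypothetical helper).

    When fit_target_mib is None, --fit-target is omitted so llama-server
    keeps its own default buffer. Otherwise the UI value is passed through.
    """
    args = ["llama-server", "-m", model_path, "--fit", "on"]
    if fit_target_mib is not None:
        # Assumption: the flag takes a non-negative amount of memory in MiB.
        if fit_target_mib < 0:
            raise ValueError("fit_target_mib must be non-negative")
        args += ["--fit-target", str(fit_target_mib)]
    return args


if __name__ == "__main__":
    # Example: a compute-only GPU where a 128 MiB buffer is enough.
    print(" ".join(build_llama_server_args("model.gguf", fit_target_mib=128)))
```

Leaving the value unset keeps today's behavior, so the setting could be opt-in. Under the MiB assumption, going from the ~1GB default to 128 would hand roughly 900MB of VRAM back to layers and KV cache.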
