The current implementation with --fit on uses dynamic approximation to estimate how many layers llama-server can fit into the GPU, while leaving a default buffer of roughly 1 GB unused. This forces more layers onto the CPU.
In situations where (1) VRAM is tight (e.g. 16 GB) and (2) the GPU is used purely for compute rather than graphics (the nvidia_drm module is not loaded), setting --fit-target to 128 or 256 will either put all layers on the GPU, or in the best case fit all layers plus the configured KV cache into the GPU. This yields at least a 10x difference in prompt processing (pp) and token generation (tg) performance.
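As a concrete illustration, the difference comes down to the invocation below. This is a sketch based on the description above: the model path is a placeholder, and the assumption that --fit-target values of 128/256 are a reserved-buffer size in MiB is an interpretation, not confirmed behavior.

```shell
# Default behaviour: --fit on keeps ~1 GB of VRAM as headroom,
# which on a 16 GB card can push several layers onto the CPU.
llama-server -m ./model.gguf --fit on

# Headless compute-only GPU (nvidia_drm not loaded): shrink the
# reserved buffer so all layers, and ideally the KV cache too,
# fit into VRAM.
llama-server -m ./model.gguf --fit on --fit-target 256
```

On a display-attached GPU the larger default buffer exists for a reason, so the smaller target is only safe when nothing else is consuming VRAM.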