Add GGUF fit-target control and wire to llama-server `--fit-target` (#4882)

aiSynergy37 wants to merge 2 commits into `unslothai:main`
Conversation
Expose `fit_target` from Chat settings through the API and GGUF load/status responses, and pass `--fit-target` when `--fit` is active. Includes backend regression coverage.

Fixes unslothai#4857
Hey team, I tried chat-loading a local GGUF file for unsloth gemma-4, which Studio found automatically under the finetunes list in Models. On first load it used the full gemma-3-31b context length by default with a bf16 KV cache, offloaded to CPU, and chat worked fine. After I tweaked the KV cache to q8_0 and reduced the context size to 100k, the reload command in the terminal automatically fell back to an 8096 context with q8_0: it does not respect the context set from the UI, but instead falls back to a safe KV configuration that fits in VRAM. (When cranking the context above that, it shows a warning about spilling to RAM, yet even clicking Apply doesn't force the UI-set value.) I think fixing that issue will have dependencies on this PR.
Summary
- Add `fit_target` to the inference load API request/response and the status response for GGUF
- Thread `fit_target` through the backend GGUF load path into `LlamaCppBackend.load_model(...)`
- When `--fit on` is used, pass `--fit-target <value>` to llama-server
- Add a `Fit Target` selector in Chat Settings (GGUF) with presets: Auto / 64 / 128 / 256 / 512
- Report the applied value in `/api/inference/status`

Why
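The flag-wiring step can be sketched roughly as follows; `build_server_args` and its parameters are illustrative names, not the actual Studio code, and the `--fit`/`--fit-target` flags are taken from the PR description:

```python
def build_server_args(model_path, ctx_size, fit=False, fit_target=None):
    """Assemble a llama-server command line (sketch).

    --fit-target is only appended when fit mode is active and the UI
    selected an explicit preset; the "Auto" preset maps to None, in
    which case the flag is omitted and llama-server picks the default.
    """
    args = ["llama-server", "--model", model_path, "--ctx-size", str(ctx_size)]
    if fit:
        args += ["--fit", "on"]
        if fit_target is not None:
            args += ["--fit-target", str(fit_target)]
    return args
```

For example, `build_server_args("model.gguf", 8192, fit=True, fit_target=128)` would include `--fit-target 128`, while the same call without `fit_target` would leave the flag out entirely.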
Issue #4857 asks for a Studio UI control to tune llama-server `--fit-target` for tight-VRAM cases where the default fit behavior leaves too much GPU memory unused.

Validation
- `pytest studio/backend/tests/test_native_context_length.py -v` (38 passed)
- `python -m compileall studio/backend/core/inference/llama_cpp.py studio/backend/models/inference.py studio/backend/routes/inference.py`
- Frontend type-check skipped: `tsc` is unavailable (frontend deps/tools not installed)

Fixes #4857
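To illustrate the round-trip described in the summary, here is a hedged sketch of what the load request and status response shapes could look like; the field names mirror the PR description, but the exact payload structure is an assumption, not the verified Studio schema:

```python
# Hypothetical load-request payload; "fit_target" carries the UI preset
# (None for Auto, or one of 64 / 128 / 256 / 512).
load_request = {
    "model": "gemma-3-31b-it.gguf",
    "backend": "llama.cpp",
    "fit": True,
    "fit_target": 128,
}

def status_response(request):
    """Echo the applied fit_target back (as /api/inference/status would)
    so the UI can confirm the setting actually took effect."""
    return {
        "loaded": True,
        # None when the Auto preset was used and the flag was omitted
        "fit_target": request.get("fit_target"),
    }
```

Surfacing the value in the status response is what lets the regression tests (and the UI) assert that the selected preset was not silently dropped on reload.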