
Add GGUF fit-target control and wire to llama-server --fit-target#4882

Open
aiSynergy37 wants to merge 2 commits into unslothai:main from aiSynergy37:feat/gguf-fit-target-ui-4857

Conversation

@aiSynergy37

Summary

  • Add an optional fit_target field to the inference load API request/response and the status response for GGUF
  • Wire fit_target through the backend GGUF load path into LlamaCppBackend.load_model(...)
  • When --fit on is used, pass --fit-target <value> to llama-server
  • Expose a Fit Target control in Chat Settings (GGUF) with presets: Auto / 64 / 128 / 256 / 512
  • Persist the loaded fit-target in the runtime store and restore it from /api/inference/status
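The wiring described above could be sketched as follows. This is a minimal illustration, not the PR's actual code: only LlamaCppBackend.load_model, the --fit on / --fit-target flags, and the fit_target parameter come from the PR description; the helper name and argument layout are assumptions.

```python
# Sketch: appending --fit-target to the llama-server command line.
# build_server_args and fit_enabled are hypothetical names; the
# --fit on / --fit-target flags come from the PR summary.
from typing import List, Optional


def build_server_args(
    model_path: str,
    fit_enabled: bool,
    fit_target: Optional[int] = None,
) -> List[str]:
    args = ["llama-server", "--model", model_path]
    if fit_enabled:
        args += ["--fit", "on"]
        # Only forward an explicit target; omitting it preserves the
        # server's default ("Auto" in the UI) fit behavior.
        if fit_target is not None:
            args += ["--fit-target", str(fit_target)]
    return args
```

With this shape, the "Auto" preset simply sends no fit_target, and the flag is never emitted unless fit mode itself is on.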

Why

Issue #4857 asks for a Studio UI control to tune llama-server --fit-target in tight-VRAM cases where the default fit behavior leaves too much GPU memory unused.

Validation

  • pytest studio/backend/tests/test_native_context_length.py -v (38 passed)
  • python -m compileall studio/backend/core/inference/llama_cpp.py studio/backend/models/inference.py studio/backend/routes/inference.py
  • Frontend typecheck could not run in this environment because tsc is unavailable (frontend deps/tools not installed).

Fixes #4857

Expose fit_target from Chat Settings through the API and the GGUF load/status responses, and pass --fit-target when --fit is active. Includes backend regression coverage.

Fixes unslothai#4857
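The request/status shape implied above can be sketched with plain dataclasses. Only the fit_target field itself comes from the PR; the model names and other fields here are hypothetical stand-ins for the actual Pydantic models in studio/backend/models/inference.py.

```python
# Sketch: an optional fit_target on the load request, echoed back in the
# status response so the UI can restore it after a reload.
# GGUFLoadRequest / InferenceStatus are hypothetical names.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GGUFLoadRequest:
    model_path: str
    fit_target: Optional[int] = None  # None maps to the "Auto" preset


@dataclass
class InferenceStatus:
    loaded: bool
    fit_target: Optional[int] = None  # persisted so the UI can restore it


def status_after_load(req: GGUFLoadRequest) -> InferenceStatus:
    # Carry the loaded fit-target into the status payload, mirroring
    # what /api/inference/status would report.
    return InferenceStatus(loaded=True, fit_target=req.fit_target)
```

Keeping the field optional on both sides means older clients that never send fit_target see no behavior change.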

@xyehya

xyehya commented Apr 7, 2026

Hey team,
I was testing a fresh installation of the latest release to check whether the context size UI control has been fixed, but the behavior is not consistent.

I tried chat-loading a local GGUF file for unsloth gemma-4, which was found automatically under the finetunes list in models.

First load --> full gemma-3-31b context length by default + bf16 KV cache --> CPU offload --> chat works fine.

Then I tweaked the KV cache to q8_0 and reduced the context size to 100k --> the reload command in the terminal automatically falls back to 8096 context and q8_0 --> it's not respecting the context set from the UI, but rather falling back to a safe KV configuration to fit in VRAM. (When cranking the context above that, it displays a warning about spilling to RAM, yet even clicking Apply doesn't force the value set in the UI.)

I think there will be dependencies between applying this PR and fixing that issue.



Development

Successfully merging this pull request may close these issues.

[Feature] Add Unsloth Studio UI value to tune the llama-server --fit-target flag for squeezed extra performance.
