
Add GGUF fit-target control and wire to llama-server --fit-target#4882

Open
aiSynergy37 wants to merge 2 commits into unslothai:main from aiSynergy37:feat/gguf-fit-target-ui-4857

Conversation

@aiSynergy37

Summary

  • Add an optional fit_target field to the inference load API request/response and the status response for GGUF
  • Wire fit_target through the backend GGUF load path into LlamaCppBackend.load_model(...)
  • When --fit on is used, pass --fit-target <value> to llama-server
  • Expose a Fit Target control in Chat Settings (GGUF) with presets: Auto / 64 / 128 / 256 / 512
  • Persist the loaded fit-target in the runtime store and restore it from /api/inference/status
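The wiring described above could be sketched as follows. This is a minimal illustration, not the PR's actual code: only LlamaCppBackend.load_model, the --fit on / --fit-target flags, and the fit_target parameter come from the PR description; the helper name and argument layout are assumptions.

```python
# Sketch: appending --fit-target to the llama-server command line.
# build_server_args and fit_enabled are hypothetical names; the
# --fit on / --fit-target flags come from the PR summary.
from typing import List, Optional


def build_server_args(
    model_path: str,
    fit_enabled: bool,
    fit_target: Optional[int] = None,
) -> List[str]:
    args = ["llama-server", "--model", model_path]
    if fit_enabled:
        args += ["--fit", "on"]
        # Only forward an explicit target; omitting it preserves the
        # server's default ("Auto" in the UI) fit behavior.
        if fit_target is not None:
            args += ["--fit-target", str(fit_target)]
    return args
```

With this shape, the "Auto" preset simply sends no fit_target, and the flag is never emitted unless fit mode itself is on.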

Why

Issue #4857 asks for a Studio UI control to tune llama-server --fit-target in tight-VRAM cases where the default fit behavior leaves too much GPU memory unused.

Validation

  • pytest studio/backend/tests/test_native_context_length.py -v (38 passed)
  • python -m compileall studio/backend/core/inference/llama_cpp.py studio/backend/models/inference.py studio/backend/routes/inference.py
  • Frontend typecheck could not run in this environment because tsc is unavailable (frontend deps/tools not installed).

Fixes #4857

Expose fit_target from Chat Settings through the API and the GGUF load/status responses, and pass --fit-target when --fit is active. Includes backend regression coverage.

Fixes unslothai#4857
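The request/status shape implied above can be sketched with plain dataclasses. Only the fit_target field itself comes from the PR; the model names and other fields here are hypothetical stand-ins for the actual Pydantic models in studio/backend/models/inference.py.

```python
# Sketch: an optional fit_target on the load request, echoed back in the
# status response so the UI can restore it after a reload.
# GGUFLoadRequest / InferenceStatus are hypothetical names.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GGUFLoadRequest:
    model_path: str
    fit_target: Optional[int] = None  # None maps to the "Auto" preset


@dataclass
class InferenceStatus:
    loaded: bool
    fit_target: Optional[int] = None  # persisted so the UI can restore it


def status_after_load(req: GGUFLoadRequest) -> InferenceStatus:
    # Carry the loaded fit-target into the status payload, mirroring
    # what /api/inference/status would report.
    return InferenceStatus(loaded=True, fit_target=req.fit_target)
```

Keeping the field optional on both sides means older clients that never send fit_target see no behavior change.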

@xyehya

xyehya commented Apr 7, 2026

Hey team,
I was testing a fresh installation of the latest release to check whether the context size UI control has been fixed, but the behavior is not consistent.

I tried chat-loading a local GGUF file for unsloth gemma-4, which was found automatically under the finetunes list in models.

First load --> full gemma-3-31b context length by default + bf16 KV cache --> CPU offload --> chat works fine.

Then I tweaked the KV cache to q8_0 and reduced the context size to 100k --> the reload command in the terminal automatically falls back to 8096 context and q8_0 --> it's not respecting the context set from the UI, but rather falling back to a safe KV configuration to fit in VRAM. (When cranking the context above that, it displays a warning about spilling to RAM, yet even clicking Apply doesn't force the value set in the UI.)

I think there will be dependencies between applying this PR and fixing that issue.



Development

Successfully merging this pull request may close these issues.

[Feature] Add Unsloth Studio UI value to tune the llama-server --fit-target flag for squeezed extra performance.
