Releases: oobabooga/textgen
v4.5.2
Changes
- The project has been renamed to TextGen! The GitHub URL is now github.com/oobabooga/textgen.
- Logits display improvements (#7486). Thanks, @wiger3.
- UI: Add sky-blue color for quoted text in light mode (#7473). Thanks, @Th-Underscore.
- Reduce VRAM peak in prompt logprobs forward pass.
Bug fixes
- Fix Gemma-4 tool calling: handle double quotes and newline chars in arguments (#7477). Thanks, @mamei16.
- Fix chat scroll getting stuck on thinking blocks (#7485).
- Prevent Tool Icon SVG Shrinking When Tool Calls Are Long (#7488). Thanks, @mamei16.
- Fix: wrong chat deleted when selection changes before confirm (#7483). Thanks, @lawrence3699.
- Fix bos/eos tokens not being set for models without a chat template. Defaults are now reset before reading model metadata.
- Fix duplicate BOS token being prepended in ExLlamav3.
- Fix version metadata not syncing on Continue (#7492).
- Fix `row_split` not working with ik_llama.cpp: `--split-mode row` is now converted to `--split-mode graph` (#7489).
- Fix "Start reply with" crash (#7497). 🆕 - v4.5.1.
- Fix tool responses with Gemma 4 template (#7498). 🆕 - v4.5.1.
- UI: Fix consecutive thinking blocks rendering with Gemma 4. 🆕 - v4.5.1.
- Fix bos/eos tokens being overwritten after GGUF metadata sets them (#7496). 🆕 - v4.5.2
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@5d14e5d
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@47986f0
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (774 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (209 MB) | — |
| AMD (ROCm 7.2) | Download (517 MB) | — |
| CPU only | Download (191 MB) | Download (192 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (710 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (225 MB) | — |
| AMD (ROCm 7.2) | Download (330 MB) | — |
| CPU only | Download (207 MB) | Download (218 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (182 MB) |
| Intel (x86_64) | Download (188 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/              <-- shared by both installs
v4.5.1
Updated to v4.5.2
https://github.com/oobabooga/textgen/releases/tag/v4.5.2
v4.5
Updated to v4.5.2
https://github.com/oobabooga/textgen/releases/tag/v4.5.2
v4.4 - MCP server support!
Changes
- MCP server support: Use remote MCP servers from the UI. Just add one server URL per line in the new "MCP servers" field in the Chat tab and send a message. Tools will be discovered automatically and used alongside local tools. [Tutorial]
- Several UI improvements, further modernizing the theme:
- Improve hover menu appearance in the Chat tab.
- Improve scrollbar styling (thinner, more rounded).
- Improve message text contrast and heading colors.
- Improve message action icon visibility in light mode.
- Make blockquote, table, and hr borders more subtle and consistent.
- Improve accordion outline styling.
- Reduce empty space between chat input and message contents.
- Hide spin buttons on all sliders (these looked ugly on Windows).
- Show filename tooltip on file attachments in the chat input.
- Add Windows + ROCm portable builds.
- Image generation: Embed metadata in API responses. PNG images returned by the API now include generation settings (model, seed, dimensions, steps, CFG scale, sampler) in the file metadata.
- API: Add `instruction_template` and `instruction_template_str` parameters to the model load endpoint.
- API: Remove the deprecated `settings` parameter from the model load endpoint.
- Move the `cpu-moe` checkbox to extra flags (no longer needed now that `--fit` exists).
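As an illustration of the new load-endpoint parameters, here is a minimal sketch of a request body. Only `instruction_template` and `instruction_template_str` come from this release; the `model_name` field and the helper itself are illustrative assumptions:

```python
# Hypothetical sketch: build a model-load request body using the new
# parameters. Pass either a named template or a full template string.
def build_load_request(model_name, instruction_template=None, instruction_template_str=None):
    body = {"model_name": model_name}  # assumed field name, for illustration
    if instruction_template is not None:
        body["instruction_template"] = instruction_template          # named template
    if instruction_template_str is not None:
        body["instruction_template_str"] = instruction_template_str  # inline template string
    return body
```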
Bug fixes
- Fix inline LaTeX rendering: `$...$` expressions are now protected from being parsed as markdown (#7423).
- Fix crash when truncating prompts with tool call messages.
- Fix "address already in use" on server restart (Linux/macOS).
- Fix GPT-OSS reasoning tags briefly leaking into streamed output between thinking and tool calls.
- Fix tool call check sometimes truncating visible text at end of generation.
- Fix image generation failing with Flash Attention 2 errors by defaulting attention to SDPA.
- Fix loader args leaking between sequential API model loads.
- Fix IPv6 address formatting in the API.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@d0a6dfe
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@67fc9c5 (adds Gemma 4 support)
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (777 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (698 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (207 MB) | — |
| AMD (ROCm 7.2) | Download (516 MB) | — |
| CPU only | Download (191 MB) | Download (192 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (761 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (712 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (223 MB) | — |
| AMD (ROCm 7.2) | Download (329 MB) | — |
| CPU only | Download (207 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (181 MB) |
| Intel (x86_64) | Download (187 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/              <-- shared by both installs
v4.3.3 - Gemma 4 support!
Changes
- Gemma 4 support with tool-calling in the API and UI. 🆕 - v4.3.1.
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new `textgen-portable-ik` portable builds and a new `--ik` flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add echo + logprobs for `/v1/completions`. The completions endpoint now supports the `echo` and `logprobs` parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new `top_logprobs_ids` field.
- Further optimize my custom gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect `torch_dtype` from model config instead of always forcing bfloat16/float16. The `--bf16` flag still works as an override.
- Remove the obsolete `models/config.yaml` file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
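To illustrate the new completions parameters, a minimal request-body sketch. The helper is hypothetical; `echo` and `logprobs` follow the OpenAI completions spec:

```python
# Sketch of a /v1/completions request body using the new echo and
# logprobs parameters. The top_logprobs_ids field described above
# appears in the response, not the request.
def build_completions_request(prompt, logprobs=5, echo=True, max_tokens=16):
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "echo": echo,          # include prompt tokens (with their logprobs) in the output
        "logprobs": logprobs,  # number of top alternatives to return per token
    }
```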
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Sanitize filenames in all prompt file operations (CWE-22). Thanks, @ffulbtech. 🆕 - v4.3.3.
- Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
Bug fixes
- Fix `--idle-timeout` failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (`<|return|>` vs `<|result|>`).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix `ban_eos_token` not working for ExLlamav3. EOS is now suppressed at the logit level.
- Fix "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕 - v4.3.2.
- Fix Dropdown/Radio/CheckboxGroup crash when choices list is empty. 🆕 - v4.3.3.
- Fix API crash when parsing tool calls from non-dict JSON model output. 🆕 - v4.3.3.
- Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕 - v4.3.2.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@277ff5f
  - Adds Gemma-4 support
  - Adds improved KV cache quantization via activations rotation, based on TurboQuant ggml-org/llama.cpp#21038
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@d557d6c
- Update ExLlamaV3 to 0.0.28
- Update transformers to 5.5
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (681 MB) | Download (1.17 GB) |
| AMD/Intel (Vulkan) | Download (191 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (175 MB) | Download (175 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (753 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (706 MB) | Download (1.2 GB) |
| AMD/Intel (Vulkan) | Download (217 MB) | — |
| AMD (ROCm 7.2) | Download (323 MB) | — |
| CPU only | Download (201 MB) | Download (211 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (173 MB) |
| Intel (x86_64) | Download (179 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/              <-- shared by both installs
v4.3.2
v4.3.1
v4.3
v4.2
| Before | After |
|---|---|
| (screenshot) | (screenshot) |
Changes
- Anthropic-compatible API: A new `/v1/messages` endpoint lets you connect Claude Code, Cursor, and other Anthropic API clients. Supports system messages, content blocks, tool use, tool results, image inputs, and thinking blocks. To use with Claude Code: `ANTHROPIC_BASE_URL=http://127.0.0.1:5000 claude`.
- Updated UI theme: New colors, borders, and button styles across light and dark modes.
- `--extra-flags` now supports literal flags: You can now pass flags directly, e.g. `--extra-flags "--rpc 192.168.1.100:50052 --jinja"`. The old `key=value` format is still accepted for backwards compatibility.
- Training
  - Enable `gradient_checkpointing` by default for lower VRAM usage during training.
  - Remove the arbitrary `higher_rank_limit` parameter.
  - Reorganize the training UI.
- Strip thinking blocks before tool-call parsing to prevent false-positive tool call detection from `<think>` content.
- Move the OpenAI-compatible API from `extensions/openai` to `modules/api`. The old `--extensions openai` flag is still accepted as an alias for `--api`.
- Set `top_p=0.95` as the default sampling parameter for API requests.
- Remove 52 obsolete instruction templates from 2023 (Airoboros, Baichuan, Guanaco, Koala, Vicuna v0, MOSS, etc.).
- Reduce portable build sizes by using a stripped Python distribution.
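For reference, a minimal sketch of a request body for the new `/v1/messages` endpoint. Field names follow the Anthropic Messages API; the placeholder `model` value and the helper are assumptions, since the server answers with whatever model is loaded:

```python
# Sketch of an Anthropic-style Messages API request body for the new
# /v1/messages endpoint. max_tokens is required by the Messages spec,
# and the system prompt goes in a top-level field rather than the
# messages list.
def build_messages_request(user_text, system=None, max_tokens=256):
    body = {
        "model": "local",          # placeholder: the loaded model is used
        "max_tokens": max_tokens,  # required by the Messages API
        "messages": [{"role": "user", "content": user_text}],
    }
    if system is not None:
        body["system"] = system    # top-level system prompt, per the spec
    return body
```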
Bug fixes
- Fix prompt corruption when continuing a chat with context truncation (#7439). Thanks, @Phrosty1.
- Fix multi-turn thinking block corruption for Kimi models.
- Fix AMD installer failing to resolve ROCm triton dependency.
- Fix the `--share` feature in the Gradio fork.
- Fix `--extra-flags` breaking short long-form-only flags like `--rpc`.
- Fix the instruction template delete dialog not appearing.
- Fix file handle leaks and redundant re-reads in model metadata loading (#7422). Thanks, @alvinttang.
- Fix superboogav2 broken delete endpoint (#6010). Thanks, @Raunak-Kumar7.
- Fix leading spaces in post-reasoning `content` in API responses.
- Fix Cloudflare tunnel retry logic raising after the first failed attempt instead of exhausting retries.
- Fix `OPENEDAI_DEBUG=0` being treated as truthy.
- Fix mutable default argument in LogitsBiasProcessor (#7426). Thanks, @Jah-yee.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/3fc6f1aed172602790e9088b57786109438c2466
- Update ExLlamaV3 to 0.0.26
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/              <-- shared by both installs
v4.1.1
Changes
- Tool-calling in the UI!: Models can now call custom functions during chat. Each tool is a single `.py` file in `user_data/tools`, and five examples are provided: `web_search`, `fetch_webpage`, `calculate`, `get_datetime`, and `roll_dice`. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
- Replace `html2text` with `trafilatura` for extracting text from web pages, significantly reducing boilerplate like navigation bars and saving tokens in agentic tool-calling loops.
- OpenAI API improvements:
  - Rewrite `logprobs` support for full spec compliance across llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
  - Add a `reasoning_content` field for thinking blocks in both streaming and non-streaming chat completions. Thinking blocks now go exclusively in this field, and `content` only shows the post-thinking reply, even when tool calls are present.
  - Add `tool_choice` support and fix the `tool_calls` response format for strict spec compliance.
  - Put mid-conversation system messages in the correct positions in the prompt instead of collapsing all system messages at the top.
  - Add support for the `developer` role, which is mapped to `system`.
  - Add `max_completion_tokens` as an alias for `max_tokens`.
  - Include `/v1` in the API URL printed to the terminal since that's what most clients expect.
  - Make the `/v1/models` endpoint show only the currently loaded model.
  - Add `stream_options` support with `include_usage` for streaming responses.
  - Return `finish_reason: tool_calls` when tool calls are detected.
  - Several other spec compliance improvements after careful auditing.
- llama.cpp
  - Set `ctx-size` to `0` (auto) by default. Note: this only works when `--gpu-layers` is also set to `-1`, which is the default value. When using other loaders, 0 maps to 8192.
  - Reduce the `--fit-target` default from 1024 MiB to 512 MiB.
  - Use `--fit-ctx 8192` to set 8192 as the minimum acceptable ctx size for `--fit on` (llama.cpp uses 4096 by default).
  - Make `logit_bias` and `logprobs` functional in API calls.
  - Add missing `custom_token_bans` parameter in the UI.
- ExLlamaV3
  - Add native `logit_bias` and `logprobs` support.
  - Load the vision model and the draft model before the main model so memory auto-splitting accounts for them.
- New default preset: "Top-P" (`top_p: 0.95`), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
- Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends `<think>` to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
- Incognito chat: This option has been added next to the existing "New chat" button. Incognito chats are temporary, live in RAM, and are never saved to disk.
- Optimize chat streaming performance by updating the DOM only once per animation frame.
- Increase the `ctx-size` slider maximum to 1M tokens in the UI, with a step of 1024.
- Add a new drag-and-drop UI component for reordering "Sampler priority" items.
- Make all chat styles consistent with the instruct style in spacings, line heights, etc., improving the quality and consistency of those styles.
- Remove the gradio import in `--nowebui` mode, saving some 0.5-0.8 seconds on startup.
- Force-exit the webui on repeated Ctrl+C.
- Improve the `--multi-user` warning to make the known limitations transparent.
- Remove the rope scaling parameters (`alpha_value`, `rope_freq_base`, `compress_pos_emb`). Models now have 128k+ context, and those parameters are from the 4096-context era; they can still be passed to llama.cpp through `--extra-flags` if needed.
- Optimize wheel downloads in the one-click installer to only download wheels that actually changed between updates. Previously all wheels would get downloaded if at least one of them had changed.
- Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
- Security: server-side file save roots, image URL SSRF protection, extension allowlist (new in 4.1.1).
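A minimal sketch tying two of these additions together: requesting a stream with `include_usage`, and splitting `reasoning_content` from `content` in a returned message. The helper functions are illustrative, not part of the project; field names follow the OpenAI chat-completions spec:

```python
# Sketch of a streaming chat-completions request body. With
# stream_options.include_usage set, a final usage chunk is appended
# to the stream.
def build_chat_request(messages, stream=True):
    body = {
        "messages": messages,
        "stream": stream,
        "logprobs": True,   # request token-by-token logprobs
        "top_logprobs": 5,  # top alternatives per token
    }
    if stream:
        body["stream_options"] = {"include_usage": True}
    return body

def split_reply(message):
    # Thinking blocks arrive exclusively in reasoning_content;
    # content holds only the post-thinking reply.
    return message.get("reasoning_content", ""), message.get("content", "")
```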
Bug fixes
- Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
- Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
- Fix passing `adaptive-p` to llama-server.
- Fix `truncation_length` not propagating correctly when `ctx_size` is set to auto (0).
- Fix dark theme using light theme syntax highlighting.
- Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
- Fix the OpenAI API server not respecting `--listen-host`.
- Fix a crash loading the MiniMax-M2.5 jinja template.
- Fix `reasoning_effort` not appearing in the UI for ExLlamaV3.
- Fix ExLlamaV3 draft cache size to match main cache.
- Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
- Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/67a2209fabe2e3498d458561933d5380655085d2
- Update ExLlamaV3 to 0.0.25
- Update diffusers to 0.37
- Update AMD ROCm from 6.4 to 7.2
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/ <-- shared by both installs
