Releases: oobabooga/textgen

v4.5.2

15 Apr 20:19
841aded

Changes

Bug fixes

  • Fix Gemma-4 tool calling: handle double quotes and newline chars in arguments (#7477). Thanks, @mamei16.
  • Fix chat scroll getting stuck on thinking blocks (#7485).
  • Prevent the tool icon SVG from shrinking when tool calls are long (#7488). Thanks, @mamei16.
  • Fix: wrong chat deleted when selection changes before confirm (#7483). Thanks, @lawrence3699.
  • Fix bos/eos tokens not being set for models without a chat template. Defaults are now reset before reading model metadata.
  • Fix duplicate BOS token being prepended in ExLlamav3.
  • Fix version metadata not syncing on Continue (#7492).
  • Fix row_split not working with ik_llama.cpp — --split-mode row is now converted to --split-mode graph (#7489).
  • Fix "Start reply with" crash (#7497). 🆕 - v4.5.1.
  • Fix tool responses with Gemma 4 template (#7498). 🆕 - v4.5.1.
  • UI: Fix consecutive thinking blocks rendering with Gemma 4. 🆕 - v4.5.1.
  • Fix bos/eos tokens being overwritten after GGUF metadata sets them (#7496). 🆕 - v4.5.2.
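
The Gemma-4 argument fix above concerns escaping. As a general illustration of the failure mode (not the project's actual patch), double quotes and newline characters inside tool-call arguments must survive a JSON round trip:

```python
# Illustration of the escaping requirement, not the project's actual fix:
# quotes and newlines in tool-call arguments must survive serialization.
import json

args = {"query": 'He said "hello"\nand left'}
encoded = json.dumps(args)      # quotes become \" and newlines become \n
decoded = json.loads(encoded)

assert decoded["query"] == 'He said "hello"\nand left'
```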

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Note

NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.

ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.

Windows

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (774 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (209 MB) | N/A |
| AMD (ROCm 7.2) | Download (517 MB) | N/A |
| CPU only | Download (191 MB) | Download (192 MB) |

Linux

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (710 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (225 MB) | N/A |
| AMD (ROCm 7.2) | Download (330 MB) | N/A |
| CPU only | Download (207 MB) | Download (218 MB) |

macOS

| Architecture | llama.cpp |
| --- | --- |
| Apple Silicon (arm64) | Download (182 MB) |
| Intel (x86_64) | Download (188 MB) |

Updating a portable install:

  1. Download and extract the latest version.
  2. Copy the user_data folder from your existing install into the new one, replacing the default one. All your settings and models will carry over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs

v4.5.1

15 Apr 18:38
7bae526

v4.5

15 Apr 03:50
eeb3288

v4.4 - MCP server support!

07 Apr 00:56
9dcf574

Changes

  • MCP server support: Use remote MCP servers from the UI. Just add one server URL per line in the new "MCP servers" field in the Chat tab and send a message. Tools will be discovered automatically and used alongside local tools. [Tutorial]
  • Several UI improvements, further modernizing the theme:
    • Improve hover menu appearance in the Chat tab.
    • Improve scrollbar styling (thinner, more rounded).
    • Improve message text contrast and heading colors.
    • Improve message action icon visibility in light mode.
    • Make blockquote, table, and hr borders more subtle and consistent.
    • Improve accordion outline styling.
    • Reduce empty space between chat input and message contents.
    • Hide spin buttons on all sliders (these looked ugly on Windows).
    • Show filename tooltip on file attachments in the chat input.
  • Add Windows + ROCm portable builds.
  • Image generation: Embed metadata in API responses. PNG images returned by the API now include generation settings (model, seed, dimensions, steps, CFG scale, sampler) in the file metadata.
  • API: Add instruction_template and instruction_template_str parameters in the model load endpoint.
  • API: Remove the deprecated settings parameter from the model load endpoint.
  • Move the cpu-moe checkbox to extra flags (no longer needed now that --fit exists).
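
The "MCP servers" field described above takes one server URL per line. A minimal hypothetical example of the field's contents (these URLs are placeholders, not real servers):

```
https://mcp.example.com/sse
https://tools.example.org/mcp
```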

Bug fixes

  • Fix inline LaTeX rendering: $...$ expressions are now protected from being parsed as markdown (#7423).
  • Fix crash when truncating prompts with tool call messages.
  • Fix "address already in use" on server restart (Linux/macOS).
  • Fix GPT-OSS reasoning tags briefly leaking into streamed output between thinking and tool calls.
  • Fix tool call check sometimes truncating visible text at end of generation.
  • Fix image generation failing with Flash Attention 2 errors by defaulting attention to SDPA.
  • Fix loader args leaking between sequential API model loads.
  • Fix IPv6 address formatting in the API.

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Note

NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.

ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.

Windows

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (777 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (698 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (207 MB) | N/A |
| AMD (ROCm 7.2) | Download (516 MB) | N/A |
| CPU only | Download (191 MB) | Download (192 MB) |

Linux

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (761 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (712 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (223 MB) | N/A |
| AMD (ROCm 7.2) | Download (329 MB) | N/A |
| CPU only | Download (207 MB) | Download (217 MB) |

macOS

| Architecture | llama.cpp |
| --- | --- |
| Apple Silicon (arm64) | Download (181 MB) |
| Intel (x86_64) | Download (187 MB) |

Updating a portable install:

  1. Download and extract the latest version.
  2. Copy the user_data folder from your existing install into the new one, replacing the default one. All your settings and models will carry over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs

v4.3.3 - Gemma 4 support!

04 Apr 00:05
62e67ad

Changes

  • Gemma 4 support with tool-calling in the API and UI. 🆕 - v4.3.1.
  • ik_llama.cpp support: Add ik_llama.cpp as a new backend, available through new textgen-portable-ik portable builds and a new --ik flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants; it adds new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
  • API: Add echo + logprobs for /v1/completions. The completions endpoint now supports the echo and logprobs parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new top_logprobs_ids field.
  • Further optimize my custom gradio fork, saving up to 50 ms per UI event (button click, etc).
  • Transformers: Autodetect torch_dtype from model config instead of always forcing bfloat16/float16. The --bf16 flag still works as an override.
  • Remove the obsolete models/config.yaml file. Instruction templates are now detected from model metadata instead of filename patterns.
  • Rename "truncation length" to "context length" in the terminal log message.
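
A sketch of a /v1/completions request using the new echo and logprobs parameters described above. The prompt and parameter values are placeholders, and the actual HTTP call is left commented because it assumes a server running at the default address:

```python
# Sketch of a /v1/completions request with echo + logprobs.
# Prompt and parameter values are placeholders.
import json
import urllib.request

payload = {
    "prompt": "The capital of France is",
    "max_tokens": 8,
    "echo": True,      # return logprobs for the prompt tokens too
    "logprobs": 1,     # top-1 log probability per token
}

req = urllib.request.Request(
    "http://127.0.0.1:5000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"content-type": "application/json"},
)
# Uncomment to send against a running server with a model loaded:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["logprobs"])
```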

Security

  • Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
  • Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
  • Sanitize filenames in all prompt file operations (CWE-22). Thanks, @ffulbtech. 🆕 - v4.3.3.
  • Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.

Bug fixes

  • Fix --idle-timeout failing on encode/decode requests and not tracking parallel generation properly.
  • Fix stopping string detection for chromadb/context-1 (<|return|> vs <|result|>).
  • Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
  • Fix ban_eos_token not working for ExLlamav3. EOS is now suppressed at the logit level.
  • Fix "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕 - v4.3.2.
  • Fix Dropdown/Radio/CheckboxGroup crash when choices list is empty. 🆕 - v4.3.3.
  • Fix API crash when parsing tool calls from non-dict JSON model output. 🆕 - v4.3.3.
  • Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕 - v4.3.2.

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Note

NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.

ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.

Windows

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (681 MB) | Download (1.17 GB) |
| AMD/Intel (Vulkan) | Download (191 MB) | N/A |
| AMD (ROCm 7.2) | Download (499 MB) | N/A |
| CPU only | Download (175 MB) | Download (175 MB) |

Linux

| GPU/Platform | llama.cpp | ik_llama.cpp |
| --- | --- | --- |
| NVIDIA (CUDA 12.4) | Download (753 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (706 MB) | Download (1.2 GB) |
| AMD/Intel (Vulkan) | Download (217 MB) | N/A |
| AMD (ROCm 7.2) | Download (323 MB) | N/A |
| CPU only | Download (201 MB) | Download (211 MB) |

macOS

| Architecture | llama.cpp |
| --- | --- |
| Apple Silicon (arm64) | Download (173 MB) |
| Intel (x86_64) | Download (179 MB) |

Updating a portable install:

  1. Download and extract the latest version.
  2. Copy the user_data folder from your existing install into the new one, replacing the default one. All your settings and models will carry over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs

v4.3.2

03 Apr 17:08
0050a33

v4.3.1

03 Apr 03:54
b11379f

v4.3

03 Apr 01:22
9374a4e

v4.2

24 Mar 19:39
dd9d254

[Before/after screenshots of the updated UI theme]

Changes

  • Anthropic-compatible API: A new /v1/messages endpoint lets you connect Claude Code, Cursor, and other Anthropic API clients. Supports system messages, content blocks, tool use, tool results, image inputs, and thinking blocks. To use with Claude Code: ANTHROPIC_BASE_URL=http://127.0.0.1:5000 claude.
  • Updated UI theme: New colors, borders, and button styles across light and dark modes.
  • --extra-flags now supports literal flags: You can now pass flags directly, e.g. --extra-flags "--rpc 192.168.1.100:50052 --jinja". The old key=value format is still accepted for backwards compatibility.
  • Training
    • Enable gradient_checkpointing by default for lower VRAM usage during training.
    • Remove the arbitrary higher_rank_limit parameter.
    • Reorganize the training UI.
  • Strip thinking blocks before tool-call parsing to prevent false-positive tool call detection from <think> content.
  • Move the OpenAI-compatible API from extensions/openai to modules/api. The old --extensions openai flag is still accepted as an alias for --api.
  • Set top_p=0.95 as the default sampling parameter for API requests.
  • Remove 52 obsolete instruction templates from 2023 (Airoboros, Baichuan, Guanaco, Koala, Vicuna v0, MOSS, etc.).
  • Reduce portable build sizes by using a stripped Python distribution.
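
As a sketch of calling the new /v1/messages endpoint described above: the payload follows the standard Anthropic Messages shape, the model name and prompt are placeholders, and the HTTP call is left commented because it assumes a server running at the default address:

```python
# Minimal sketch of a request to the Anthropic-compatible endpoint.
import json
import urllib.request

payload = {
    "model": "local-model",   # placeholder name
    "max_tokens": 256,
    "system": "You are a helpful assistant.",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://127.0.0.1:5000/v1/messages",
    data=json.dumps(payload).encode("utf-8"),
    headers={"content-type": "application/json"},
)
# Uncomment to send against a running server with a model loaded:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["content"][0]["text"])
```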

Bug fixes

  • Fix prompt corruption when continuing a chat with context truncation (#7439). Thanks, @Phrosty1.
  • Fix multi-turn thinking block corruption for Kimi models.
  • Fix AMD installer failing to resolve ROCm triton dependency.
  • Fix the --share feature in the Gradio fork.
  • Fix --extra-flags breaking short long-form-only flags like --rpc.
  • Fix the instruction template delete dialog not appearing.
  • Fix file handle leaks and redundant re-reads in model metadata loading (#7422). Thanks, @alvinttang.
  • Fix superboogav2 broken delete endpoint (#6010). Thanks, @Raunak-Kumar7.
  • Fix leading spaces in post-reasoning content in API responses.
  • Fix Cloudflare tunnel retry logic raising after the first failed attempt instead of exhausting retries.
  • Fix OPENEDAI_DEBUG=0 being treated as truthy.
  • Fix mutable default argument in LogitsBiasProcessor (#7426). Thanks, @Jah-yee.

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Which version to download:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda13.1, or cuda12.4 if you have older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • AMD GPU (ROCm): Use rocm builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel: Use macos-x86_64.

Updating a portable install:

  1. Download and extract the latest version.
  2. Copy the user_data folder from your existing install into the new one, replacing the default one. All your settings and models will carry over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs

v4.1.1

18 Mar 05:33

Changes

  • Tool-calling in the UI!: Models can now call custom functions during chat. Each tool is a single .py file in user_data/tools, and five examples are provided: web_search, fetch_webpage, calculate, get_datetime, and roll_dice. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
  • Replace html2text with trafilatura for extracting text from web pages, reducing boilerplate like navigation bars significantly and saving tokens in agentic tool-calling loops.
  • OpenAI API improvements:
    • Rewrite logprobs support for full spec compliance across llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
    • Add a reasoning_content field for thinking blocks in both streaming and non-streaming chat completions. Now thinking blocks go exclusively in this field, and content only shows the post-thinking reply, even when tool calls are present.
    • Add tool_choice support and fix the tool_calls response format for strict spec compliance.
    • Put mid-conversation system messages in the correct positions in the prompt instead of collapsing all system messages at the top.
    • Add support for the developer role, which is mapped to system.
    • Add max_completion_tokens as an alias for max_tokens.
    • Include /v1 in the API URL printed to the terminal since that's what most clients expect.
    • Make the /v1/models endpoint show only the currently loaded model.
    • Add stream_options support with include_usage for streaming responses.
    • Return finish_reason: tool_calls when tool calls are detected.
    • Several other spec compliance improvements after careful auditing.
  • llama.cpp
    • Set ctx-size to 0 (auto) by default. Note: this only works when --gpu-layers is also set to -1, which is the default value. When using other loaders, 0 maps to 8192.
    • Reduce the --fit-target default from 1024 MiB to 512 MiB.
    • Use --fit-ctx 8192 to set 8192 as the minimum acceptable ctx size for --fit on (llama.cpp uses 4096 by default).
    • Make logit_bias and logprobs functional in API calls.
    • Add missing custom_token_bans parameter in the UI.
  • ExLlamaV3
    • Add native logit_bias and logprobs support.
    • Load the vision model and the draft model before the main model so memory auto-splitting accounts for them.
  • New default preset: "Top-P" (top_p: 0.95), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
  • Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends <think> to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
  • Incognito chat: This option has been added next to the existing "New chat" button. Incognito chats are temporary, live in RAM and are never saved to disk.
  • Optimize chat streaming performance by updating the DOM only once per animation frame.
  • Increase the ctx-size slider maximum to 1M tokens in the UI, with a step size of 1024.
  • Add a new drag-and-drop UI component for reordering "Sampler priority" items.
  • Make all chat styles consistent with the instruct style in spacing, line heights, etc., improving their overall quality and consistency.
  • Remove the gradio import in --nowebui mode, saving some 0.5-0.8 seconds on startup.
  • Force-exit the webui on repeated Ctrl+C.
  • Improve the --multi-user warning to make the known limitations transparent.
  • Remove the rope scaling parameters (alpha_value, rope_freq_base, compress_pos_emb). Models now have 128k+ context, and those parameters are from the 4096 context era; the parameters can still be passed to llama.cpp through --extra-flags if needed.
  • Optimize wheel downloads in the one-click installer to download only the wheels that actually changed between updates. Previously, all wheels were downloaded if at least one of them had changed.
  • Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
  • Security: enforce server-side file save roots, protect image URL fetching against SSRF, and add an extension allowlist (new in 4.1.1).
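
Each tool described above is a single .py file in user_data/tools. As a hypothetical sketch of what such a file might contain (the exact function signature and metadata textgen expects are assumptions here; see the tutorial linked above for the real interface), a calculator-style tool could look like this:

```python
# user_data/tools/calculate.py -- hypothetical sketch; the exact interface
# textgen expects from tool files may differ from this.
import ast
import operator

# Whitelisted operators so arbitrary expressions cannot execute code.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def _eval(node):
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    """Safely evaluate a basic arithmetic expression and return the result."""
    return str(_eval(ast.parse(expression, mode="eval").body))
```

Presumably the model would see the tool's name, arguments, and docstring, pick the argument values, and the returned string would show up in the collapsible accordion described above.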

Bug fixes

  • Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
  • Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
  • Fix passing adaptive-p to llama-server.
  • Fix truncation_length not propagating correctly when ctx_size is set to auto (0).
  • Fix dark theme using light theme syntax highlighting.
  • Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
  • Fix the OpenAI API server not respecting --listen-host.
  • Fix a crash loading the MiniMax-M2.5 jinja template.
  • Fix reasoning_effort not appearing in the UI for ExLlamaV3.
  • Fix ExLlamaV3 draft cache size to match main cache.
  • Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
  • Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Which version to download:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda13.1, or cuda12.4 if you have older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • AMD GPU (ROCm): Use rocm builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel: Use macos-x86_64.

Updating a portable install:

  1. Download and extract the latest version.
  2. Copy the user_data folder from your existing install into the new one, replacing the default one. All your settings and models will carry over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs