Releases: huggingface/transformers.js
4.0.0
🚀 Transformers.js v4
We're excited to announce that Transformers.js v4 is now available on NPM! After a year of development (we started in March 2025 🤯), we're finally ready for you to use it.
npm i @huggingface/transformers

Links: YouTube Video, Blog Post, Demo Collection
New WebGPU backend
The biggest change is undoubtedly the adoption of a new WebGPU Runtime, completely rewritten in C++. We've worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures.
In addition to better operator support (for performance, accuracy, and coverage), this new WebGPU runtime allows the same transformers.js code to be used across a wide variety of JavaScript environments, including browsers, server-side runtimes, and desktop applications. That's right, you can now run WebGPU-accelerated models directly in Node, Bun, and Deno!
We've proven that it's possible to run state-of-the-art AI models 100% locally in the browser, and now we're focused on performance: making these models run as fast as possible, even in resource-constrained environments. This required completely rethinking our export strategy, especially for large language models: we re-implement models operation by operation, leveraging specialized ONNX Runtime Contrib Operators such as com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, and com.microsoft.QMoE to maximize performance.
For example, by adopting the com.microsoft.MultiHeadAttention operator, we achieved a ~4x speedup for BERT-based embedding models.
- ONNX Runtime improvements by @xenova in #1306
- Transformers.js V4: Native WebGPU EP, repo restructuring, and more! by @xenova in #1382
New models
Thanks to our new export strategy and ONNX Runtime's expanding support for custom operators, we've been able to add many new models and architectures to Transformers.js v4. These include popular models like GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM. Many of these required us to implement support for advanced architectural patterns, including Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE). Perhaps most importantly, these models are all compatible with WebGPU, allowing users to run them directly in the browser or server-side JavaScript environments with hardware acceleration. We've released several Transformers.js v4 demos so far... and we'll continue to release more!
Additionally, we've added support for larger models exceeding 8B parameters. In our tests, we've been able to run GPT-OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.
- Add support for Apertus by @nico-martin in #1465
- Add support for FalconH1 by @xenova in #1502
- Add support for Cohere's Tiny Aya models by @xenova in #1529
- Add support for AFMoE by @xenova in #1542
- Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) by @xenova in #1551
- Add support for Qwen2 MoE, Qwen3 MoE, Qwen3 Next, Qwen3-VL MoE, and Olmo Hybrid by @xenova in #1562
- Add support for EuroBERT by @xenova in #1583
- Add support for LightOnOCR and GLM-OCR by @xenova in #1582
- Add support for Nemotron-H by @xenova in #1585
- Add support for DeepSeek-v3 by @xenova in #1586
- Add support for mistral4 by @xenova in #1587
- Add support for GLM-MoE-DSA by @xenova in #1588
- Add support for Chatterbox by @xenova in #1592
- Add support for Cohere ASR by @xenova in #1610
- Add support for SolarOpen and CHMv2 models by @xenova in #1593
- Voxtral Realtime, LFM2-VL, Granite Speech, and modeling type refactoring by @xenova in #1569
- Add support for Gemma3 VLM architecture by @xenova in #1601
New features
ModelRegistry
The new ModelRegistry API is designed for production workflows. It provides explicit visibility into pipeline assets before loading anything: list required files with get_pipeline_files, inspect per-file metadata with get_file_metadata (quite useful to calculate total download size), check cache status with is_pipeline_cached, and clear cached artifacts with clear_pipeline_cache. You can also query available precision types for a model with get_available_dtypes. Based on this new API, progress_callback now includes a progress_total event, making it easy to render end-to-end loading progress without manually aggregating per-file updates.
See `ModelRegistry` examples
import { ModelRegistry, pipeline } from "@huggingface/transformers";
const modelId = "onnx-community/all-MiniLM-L6-v2-ONNX";
const modelOptions = { dtype: "fp32" };
const files = await ModelRegistry.get_pipeline_files(
"feature-extraction",
modelId,
modelOptions
);
// ['config.json', 'onnx/model.onnx', ..., 'tokenizer_config.json']
const metadata = await Promise.all(
files.map(file => ModelRegistry.get_file_metadata(modelId, file))
);
const downloadSize = metadata.reduce((total, item) => total + item.size, 0);
const cached = await ModelRegistry.is_pipeline_cached(
"feature-extraction",
modelId,
modelOptions
);
const dtypes = await ModelRegistry.get_available_dtypes(modelId);
// ['fp32', 'fp16', 'q4', 'q4f16']
if (cached) {
await ModelRegistry.clear_pipeline_cache(
"feature-extraction",
modelId,
modelOptions
);
}
const pipe = await pipeline(
"feature-extraction",
modelId,
{
progress_callback: e => {
if (e.status === "progress_total") {
console.log(`${Math.round(e.progress)}%`);
}
},
}
);

New Environment Settings
We also added new environment controls for model loading. env.useWasmCache enables caching of WASM runtime files (when cache storage is available), allowing applications to work fully offline after the initial load.
env.fetch lets you provide a custom fetch implementation for use cases such as authenticated model access, custom headers, and abortable requests.
See env examples
import { env } from "@huggingface/transformers";
env.useWasmCache = true;
env.fetch = (url, options) =>
fetch(url, {
...options,
headers: {
...options?.headers,
Authorization: `Bearer ${MY_TOKEN}`,
},
  });

Improved Logging Controls
Finally, logging is easier to manage in real-world deployments. ONNX Runtime WebGPU warnings are now hidden by default, and you can set explicit verbosity levels for both Transformers.js and ONNX Runtime. This update, also driven by community feedback, keeps console output focused on actionable signals rather than low-value noise.
See `logLevel` example
import { env, LogLevel } from "@huggingface/transformers";
// LogLevel.DEBUG
// LogLevel.INFO
// LogLevel.WARNING
// LogLevel.ERROR
// LogLevel.NONE
env.logLevel = LogLevel.WARNING;

- [v4] added wasm cache by @nico-martin in #1471
- V4 cache wasm file blob fix by @nico-martin in #1489
- [v4] suppress console.error while creating InferenceSession by @nico-martin in #1468
- feat: add configurable log levels via env.logLevel by @taronsung in #1507
- [v4] Improve download progress tracking (model cache registry and define which files will be loaded for pipelines) by @nico-martin in #1511
- Add support for seedable random number generation by @xenova in https://github.com/huggingface/transformers.j...
3.8.1
3.8.0
🚀 Transformers.js v3.8 — SAM2, SAM3, EdgeTAM, Supertonic TTS
- Add support for EdgeTAM in #1454
- Add support for Supertonic TTS in #1459

Example:

import { pipeline } from '@huggingface/transformers';

const tts = await pipeline('text-to-speech', 'onnx-community/Supertonic-TTS-ONNX');

const input_text = 'This is really cool!';
const audio = await tts(input_text, {
  speaker_embeddings: 'https://huggingface.co/onnx-community/Supertonic-TTS-ONNX/resolve/main/voices/F1.bin',
});
await audio.save('output.wav');

- Add support for SAM2 and SAM3 (Tracker) in #1461
- Remove Metaspace add_prefix_space logic in #1451
- ImageProcessor preprocess uses image_std for fill value by @NathanKolbas in #1455
New Contributors
- @NathanKolbas made their first contribution in #1455
Full Changelog: 3.7.6...3.8.0
3.7.6
What's new?
- Fix issue when temperature=0 and do_sample=true by @nico-martin in #1431
- Fix type errors by @nico-martin in #1436
- Add support for NanoChat in #1441
- Add support for Parakeet CTC in #1440
New Contributors
- @nico-martin made their first contribution in #1431
Full Changelog: 3.7.5...3.7.6
3.7.5
3.7.4
3.7.3
What's new?
- Unify inference chains in #1399
- Fix progress tracking bug by @kukudixiaoming in #1405
- Add support for MobileLLM-R1 (llama4_text) in #1412
- Add support for VaultGemma in #1413
New Contributors
- @kukudixiaoming made their first contribution in #1405
Full Changelog: 3.7.2...3.7.3
3.7.2
What's new?
- Add support for DINOv3 in #1390
See here for the full list of supported models.
Example: Compute image embeddings
import { pipeline } from '@huggingface/transformers';

const image_feature_extractor = await pipeline(
  'image-feature-extraction',
  'onnx-community/dinov3-vits16-pretrain-lvd1689m-ONNX',
);

const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
const features = await image_feature_extractor(url);
console.log(features);
Try it out using our online demo:
dinov3.mp4
Full Changelog: 3.7.1...3.7.2
3.7.1
3.7.0
🚀 Transformers.js v3.7 — Voxtral, LFM2, ModernBERT Decoder
🤖 New models
This update adds support for 3 new architectures:
Voxtral
Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding. ONNX weights for Voxtral-Mini-3B-2507 can be found here. Learn more about Voxtral in the release blog post.
Try it out with our online demo:
Voxtral.WebGPU.demo.mp4
Example: Audio transcription
import { VoxtralForConditionalGeneration, VoxtralProcessor, TextStreamer, read_audio } from "@huggingface/transformers";
// Load the processor and model
const model_id = "onnx-community/Voxtral-Mini-3B-2507-ONNX";
const processor = await VoxtralProcessor.from_pretrained(model_id);
const model = await VoxtralForConditionalGeneration.from_pretrained(
model_id,
{
dtype: {
embed_tokens: "fp16", // "fp32", "fp16", "q8", "q4"
audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
decoder_model_merged: "q4", // "q4", "q4f16"
},
device: "webgpu",
},
);
// Prepare the conversation
const conversation = [
{
"role": "user",
"content": [
{ "type": "audio" },
{ "type": "text", "text": "lang:en [TRANSCRIBE]" },
],
}
];
const text = processor.apply_chat_template(conversation, { tokenize: false });
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
const inputs = await processor(text, audio);
// Generate the response
const generated_ids = await model.generate({
...inputs,
max_new_tokens: 256,
streamer: new TextStreamer(processor.tokenizer, { skip_special_tokens: true, skip_prompt: true }),
});
// Decode the generated tokens
const new_tokens = generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]);
const generated_texts = processor.batch_decode(
new_tokens,
{ skip_special_tokens: true },
);
console.log(generated_texts[0]);
// I have a dream that one day this nation will rise up and live out the true meaning of its creed.

LFM2
LFM2 is a new generation of hybrid models developed by Liquid AI, specifically designed for edge AI and on-device deployment. It sets a new standard in terms of quality, speed, and memory efficiency.
The models, which we have converted to ONNX, come in three different sizes: 350M, 700M, and 1.2B parameters.
Example: Text-generation with LFM2-350M:
import { pipeline, TextStreamer } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/LFM2-350M-ONNX",
{ dtype: "q4" },
);
// Define the list of messages
const messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is the capital of France?" },
];
// Generate a response
const output = await generator(messages, {
max_new_tokens: 512,
do_sample: false,
streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true }),
});
console.log(output[0].generated_text.at(-1).content);
// The capital of France is Paris. It is a vibrant city known for its historical landmarks, art, fashion, and gastronomy.

ModernBERT Decoder
These models form part of the Ettin suite: the first collection of paired encoder-only and decoder-only models trained with identical data, architecture, and training recipes. Ettin enables fair comparisons between encoder and decoder architectures across multiple scales, providing state-of-the-art performance for open-data models in their respective size categories.
The list of supported models can be found here.
import { pipeline, TextStreamer } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline(
"text-generation",
"onnx-community/ettin-decoder-150m-ONNX",
{ dtype: "fp32" },
);
// Generate a response
const text = "Q: What is the capital of France?\nA:";
const output = await generator(text, {
max_new_tokens: 128,
streamer: new TextStreamer(generator.tokenizer, { skip_prompt: true, skip_special_tokens: true }),
});
console.log(output[0].generated_text);

Added in #1371.
🛠️ Other improvements
- Add special tokens in text-generation pipeline if tokenizer requires in #1370
Full Changelog: 3.6.3...3.7.0

