Add gpt-oss model (OpenAI MoE: gpt-oss-20b / 120b) to candle-transformers

## Why (WeaverTools consumer context)

WeaverTools needs gpt-oss on the **native / safetensors** candle path for its primary goal of **model-family interchangeability**: develop/train an agent on **gpt-oss-20b** (cheap dev tokens) and deploy that *same agent* on **gpt-oss-120b** for production. GGUF is **not** an option here: the 120B target uses NVLink model-parallelism, whereas the GGUF path splits the model across the PCI bus (the reason PR #3525 `enable_peer_access` exists). candle-transformers currently has **no gpt-oss model**.

## Current state

A **WIP scaffold exists on `feat/gpt-oss`** (`candle-transformers/src/models/gpt_oss.rs`). It **compiles** and captures the architecture and the standard assembly, but it **cannot yet load the real checkpoint and is not numerically correct**; see Definition of done.

## Architecture (reference)

- **MoE**, top-4 routing (20b: 32 experts; 120b: 128). Linear router *with bias*.
- **Attention sinks**: a learnable per-head logit is concatenated to the pre-softmax scores and its column is dropped after the softmax, so it absorbs probability mass but adds nothing to the value sum (SDPA-incompatible; eager path).
- **Alternating sliding-window / full attention** (window 128), starting sliding.
- **YaRN RoPE** (theta=150000, 131k context).
- **Clamped-SwiGLU experts**: `(up + 1) * (gate * sigmoid(1.702 * gate))`, with `gate` clamped from above at `+limit` (no lower bound), `up` clamped to `[-limit, +limit]`, and `limit = 7.0`. Fused `gate_up_proj` + `down_proj`, both with bias.
- RMSNorm; `attention_bias = true`.
- gpt-oss-20b config: hidden 2880, intermediate 2880, 24 layers, 64 heads, 8 KV heads, head_dim 64, vocab 201088, eps 1e-5.
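As a reading aid for the two least standard items above, here is a minimal scalar sketch in plain Rust (std only, no candle types). The function names `clamped_swiglu` and `softmax_with_sink` are hypothetical and are not taken from the scaffold; the constants come from the list above. It assumes the gate clamp is upper-only and that the sink column is simply discarded after normalisation. Treat it as a reference for checking tensor code, not as the implementation.

```rust
/// Constants quoted in the architecture notes.
const ALPHA: f32 = 1.702;
const LIMIT: f32 = 7.0;

/// Clamped SwiGLU for a single (gate, up) pair.
/// Assumption: `gate` is clamped from above only, `up` on both sides.
fn clamped_swiglu(gate: f32, up: f32) -> f32 {
    let gate = gate.min(LIMIT);
    let up = up.clamp(-LIMIT, LIMIT);
    // gate * sigmoid(ALPHA * gate)
    let glu = gate * (1.0 / (1.0 + (-ALPHA * gate).exp()));
    (up + 1.0) * glu
}

/// Softmax over one row of attention scores with one extra sink logit.
/// The sink takes part in the normalisation, but only the probabilities
/// of the real keys are returned, so they sum to less than 1.
fn softmax_with_sink(scores: &[f32], sink: f32) -> Vec<f32> {
    // Subtract the row maximum (sink included) for numerical stability.
    let max = scores.iter().copied().fold(sink, f32::max);
    let sink_exp = (sink - max).exp();
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = sink_exp + exps.iter().sum::<f32>();
    // Dropping the sink column: it has no value vector to weight.
    exps.iter().map(|e| e / denom).collect()
}
```

Because the returned row no longer sums to 1, a fused SDPA kernel that normalises over the real keys only would produce different weights, which is why the list above marks sinks as requiring the eager path.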

## Definition of done

1. **Fused + MXFP4 expert weights** (`experts.gate_up_proj[_bias]`, `experts.down_proj[_bias]`, 3D) — adapt loading from the scaffold's per-expert `Linear`s (dequantize to bf16 or handle MXFP4).
2. **YaRN** rope scaling wired (scaffold has a plain-RoPE stub).
3. **Sliding-window mask** per-layer (scaffold flags layer type but doesn't apply the window).
4. **KV cache** (offset-aware) for generation.
5. **Numerical parity** vs HF `modeling_gpt_oss` on gpt-oss-20b (acceptance test).
6. Loads `openai/gpt-oss-20b` safetensors end-to-end + generates coherent text.
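Items 3 and 4 interact: once a KV cache is in play, the mask has to be built from absolute positions rather than from indices within the current chunk. The sketch below shows one way to express that in plain Rust (std only). The names `is_sliding_layer`, `layer_window` and `attention_mask` are hypothetical. It assumes layers alternate starting with sliding at index 0, that the window of 128 counts the query position itself, and that the cache holds every past key (no eviction for sliding layers).

```rust
const SLIDING_WINDOW: usize = 128;

/// Assumption: even layer indices are sliding-window, odd ones full attention.
fn is_sliding_layer(layer_idx: usize) -> bool {
    layer_idx % 2 == 0
}

/// Window to apply for a given layer, or `None` for full attention.
fn layer_window(layer_idx: usize) -> Option<usize> {
    if is_sliding_layer(layer_idx) {
        Some(SLIDING_WINDOW)
    } else {
        None
    }
}

/// Boolean mask of shape [q_len][offset + q_len]; `true` means "may attend".
/// `offset` is the number of tokens already in the KV cache, so query row `i`
/// sits at absolute position `offset + i`.
fn attention_mask(q_len: usize, offset: usize, window: Option<usize>) -> Vec<Vec<bool>> {
    let kv_len = offset + q_len;
    (0..q_len)
        .map(|i| {
            let q_pos = offset + i;
            (0..kv_len)
                .map(|j| {
                    // Causal check first, so `q_pos - j` cannot underflow.
                    j <= q_pos && window.map_or(true, |w| q_pos - j < w)
                })
                .collect()
        })
        .collect()
}
```

During single-token decoding this reduces to `attention_mask(1, offset, layer_window(layer_idx))`, a single row over the cached keys. Whether the window includes the query position should be confirmed against the HF reference as part of the parity test in item 5.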

## Workflow

`feat/gpt-oss` -> fork self-PR to `main` (CodeRabbit pre-flight) -> upstream PR to huggingface/candle -> merge into `integration` -> report new integration SHA back to WeaverTools.

## References

- https://huggingface.co/openai/gpt-oss-20b
- https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_oss/modeling_gpt_oss.py
- https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_oss/configuration_gpt_oss.py
