Why (WeaverTools consumer context)
WeaverTools needs gpt-oss on the native / safetensors candle path for its primary goal of model-family interchangeability: develop/train an agent on gpt-oss-20b (cheap dev tokens) and deploy that same agent on gpt-oss-120b for production. This needs the safetensors/native path — not GGUF — because the 120B target uses NVLink model-parallelism and GGUF splits across the PCI bus (the reason PR huggingface#3525 enable_peer_access exists). candle-transformers currently has no gpt-oss model.
Current state
A WIP scaffold exists on feat/gpt-oss (candle-transformers/src/models/gpt_oss.rs) — it compiles and captures the architecture + standard assembly. It is not yet load- or numerically-correct; see Definition of Done.
Architecture (reference)
- MoE, top-4 routing (20b: 32 experts; 120b: 128). Linear router with bias.
- Attention sinks — per-head learnable logit concatenated to pre-softmax scores, then dropped from the value sum (SDPA-incompatible; eager path).
- Alternating sliding-window / full attention (window 128), starting sliding.
- YaRN RoPE (theta=150000, 131k context).
- Clamped-SwiGLU experts: (up + 1) * (gate * sigmoid(1.702 * gate)), with gate clamped from above at +limit and up clamped to [-limit, +limit], limit = 7.0. Fused gate_up_proj + down_proj, both with bias.
- RMSNorm; attention_bias = true.
- gpt-oss-20b config: hidden 2880, intermediate 2880, 24 layers, 64 heads, 8 KV heads, head_dim 64, vocab 201088, eps 1e-5.
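The two least standard pieces above, the clamped-SwiGLU activation and the attention-sink softmax, can be sketched on plain scalars and slices. This is a minimal illustration in std-only Rust, not code from the scaffold or from candle; the function names clamped_swiglu and softmax_with_sink are hypothetical, and the sketch assumes the sink is a single extra logit per head that joins the softmax denominator but has no value vector.

```rust
// Illustrative scalar sketch; names are hypothetical, not the scaffold's API.

/// Clamped SwiGLU for one (gate, up) pair.
/// The issue quotes alpha = 1.702 and limit = 7.0.
fn clamped_swiglu(gate: f32, up: f32, alpha: f32, limit: f32) -> f32 {
    // gate is clamped from above only; up is clamped on both sides.
    let gate = gate.min(limit);
    let up = up.clamp(-limit, limit);
    let sigmoid = 1.0 / (1.0 + (-alpha * gate).exp());
    (up + 1.0) * (gate * sigmoid)
}

/// Softmax over one query row with a per-head sink logit appended.
/// Returns probabilities for the real keys only. They sum to less than 1
/// because the sink absorbs part of the mass and contributes no value.
fn softmax_with_sink(scores: &[f32], sink: f32) -> Vec<f32> {
    // Subtract the running max (sink included) for numerical stability.
    let max = scores.iter().copied().fold(sink, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    // The sink term appears in the denominator but not in the output.
    let denom: f32 = exps.iter().sum::<f32>() + (sink - max).exp();
    exps.iter().map(|e| e / denom).collect()
}
```

With two equal scores and a sink of the same magnitude, each real key receives one third of the mass and the remaining third is discarded with the sink. This is why a fused SDPA kernel that normalizes over the real keys only cannot be used directly.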
Definition of done
- Fused + MXFP4 expert weights (experts.gate_up_proj[_bias], experts.down_proj[_bias], 3D): adapt loading from the scaffold's per-expert Linears (dequantize to bf16 or handle MXFP4).
- YaRN rope scaling wired (scaffold has a plain-RoPE stub).
- Sliding-window mask per-layer (scaffold flags layer type but doesn't apply the window).
- KV cache (offset-aware) for generation.
- Numerical parity vs HF modeling_gpt_oss on gpt-oss-20b (acceptance test).
- Loads openai/gpt-oss-20b safetensors end-to-end + generates coherent text.
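The sliding-window and KV-cache items interact: once a cache is in play, the mask has to be built from absolute positions rather than from indices within the current chunk. The following is a minimal std-only Rust sketch of that logic, under stated assumptions: the names is_sliding_layer and attention_mask are hypothetical, even layer indices are taken to be the sliding ones (consistent with "starting sliding"), the cache is assumed to retain every past key, and a window of W is assumed to let a query see itself plus the W - 1 preceding keys.

```rust
// Illustrative sketch; names and indexing conventions are assumptions.

/// Layers alternate sliding / full attention, starting with sliding,
/// so even layer indices are treated as sliding-window layers here.
fn is_sliding_layer(layer_idx: usize) -> bool {
    layer_idx % 2 == 0
}

/// Boolean mask where mask[i][j] is true when query i may attend to key j.
/// `offset` is the number of tokens already in the KV cache, so queries sit
/// at absolute positions offset..offset + q_len and keys cover
/// 0..offset + q_len. `window` is None for full-attention layers.
fn attention_mask(q_len: usize, offset: usize, window: Option<usize>) -> Vec<Vec<bool>> {
    let kv_len = offset + q_len;
    (0..q_len)
        .map(|i| {
            let q_pos = offset + i;
            (0..kv_len)
                .map(|j| {
                    // Causal: a query never sees keys after its own position.
                    let causal = j <= q_pos;
                    // Window: keep keys with q_pos - j < w, written as an
                    // addition to avoid unsigned underflow.
                    let in_window = match window {
                        Some(w) => j + w > q_pos,
                        None => true,
                    };
                    causal && in_window
                })
                .collect()
        })
        .collect()
}
```

For a single decode step with four cached tokens and a window of 2, attention_mask(1, 4, Some(2)) allows only keys 3 and 4. A real implementation would likely turn this into an additive tensor mask and might trim the cache of sliding layers to the window; both are outside this sketch.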
Workflow
feat/gpt-oss -> fork self-PR to main (CodeRabbit pre-flight) -> upstream PR to huggingface/candle -> merge into integration -> report new integration SHA back to WeaverTools.
References