Add gpt-oss model (OpenAI MoE: gpt-oss-20b / 120b) to candle-transformers #5

@toddwbucy


Why (WeaverTools consumer context)

WeaverTools needs gpt-oss on the native safetensors candle path to support its primary goal, model-family interchangeability: develop and train an agent on gpt-oss-20b (cheap dev tokens), then deploy that same agent on gpt-oss-120b for production. The native safetensors path is required rather than GGUF because the 120B target uses NVLink model-parallelism, whereas GGUF splits across the PCI bus (the reason PR huggingface#3525 enable_peer_access exists). candle-transformers currently has no gpt-oss model.

Current state

A WIP scaffold exists on feat/gpt-oss (candle-transformers/src/models/gpt_oss.rs). It compiles and captures the architecture and standard assembly, but it does not yet load the real checkpoint and is not yet numerically correct; see Definition of done.

Architecture (reference)

  • MoE, top-4 routing (20b: 32 experts; 120b: 128). Linear router with bias.
  • Attention sinks — per-head learnable logit concatenated to pre-softmax scores, then dropped from the value sum (SDPA-incompatible; eager path).
  • Alternating sliding-window / full attention (window 128), starting sliding.
  • YaRN RoPE (theta=150000, 131k context).
  • Clamped-SwiGLU experts: (up + 1) * (gate * sigmoid(1.702 * gate)), with gate clamped from above at +limit (no lower bound) and up clamped to [-limit, +limit], limit = 7.0. Fused gate_up_proj + down_proj, both with bias.
  • RMSNorm; attention_bias = true.
  • gpt-oss-20b config: hidden 2880, intermediate 2880, 24 layers, 64 heads, 8 KV heads, head_dim 64, vocab 201088, eps 1e-5.
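
To make two of the less common pieces above concrete, here is a minimal scalar sketch of the clamped-SwiGLU activation and of a softmax with an attention sink. It uses plain f32 and Vec rather than candle tensors so it stays dependency-free; the function names clamped_swiglu and softmax_with_sink are hypothetical and do not come from the scaffold. It reflects the formulas listed above, not the actual gpt_oss.rs code.

```rust
/// Clamped SwiGLU for one (gate, up) pair:
/// (up + 1) * (gate * sigmoid(alpha * gate)).
/// `gate` is clamped from above only; `up` is clamped to [-limit, limit].
fn clamped_swiglu(gate: f32, up: f32, alpha: f32, limit: f32) -> f32 {
    let gate = gate.min(limit);
    let up = up.clamp(-limit, limit);
    let glu = gate * (1.0 / (1.0 + (-alpha * gate).exp()));
    (up + 1.0) * glu
}

/// Softmax over one query row of attention scores with a per-head sink logit.
/// The sink contributes to the normalisation, then its probability is dropped,
/// so the returned weights cover only the real keys and sum to less than 1.
fn softmax_with_sink(scores: &[f32], sink: f32) -> Vec<f32> {
    // Subtract the row maximum (sink included) for numerical stability.
    let max = scores.iter().copied().fold(sink, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum::<f32>() + (sink - max).exp();
    exps.iter().map(|e| e / denom).collect()
}
```

With alpha = 1.702 and limit = 7.0, clamped_swiglu(100.0, 100.0, 1.702, 7.0) gives the same result as clamped_swiglu(7.0, 7.0, 1.702, 7.0), since both inputs saturate at the limit. The sink version is why a fused SDPA kernel cannot be used as-is: the attention weights for a row intentionally do not sum to 1.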

Definition of done

  1. Fused + MXFP4 expert weights (experts.gate_up_proj[_bias], experts.down_proj[_bias], 3D) — adapt loading from the scaffold's per-expert Linears (dequantize to bf16 or handle MXFP4).
  2. YaRN rope scaling wired (scaffold has a plain-RoPE stub).
  3. Sliding-window mask per-layer (scaffold flags layer type but doesn't apply the window).
  4. KV cache (offset-aware) for generation.
  5. Numerical parity vs HF modeling_gpt_oss on gpt-oss-20b (acceptance test).
  6. Loads openai/gpt-oss-20b safetensors end-to-end + generates coherent text.
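
Items 3 and 4 interact: once a KV cache is in place, the sliding-window mask has to be built from absolute positions, not from indices within the current chunk. The sketch below illustrates one way to do that with plain Vec instead of candle tensors. The names is_sliding_layer and causal_mask are hypothetical, and two conventions are assumptions to check against HF modeling_gpt_oss: that even layer indices are the sliding ones, and that the window counts the current token (a key is visible when q_pos - k_pos < window).

```rust
/// True when `layer_idx` uses sliding-window attention.
/// Assumption: layers alternate and layer 0 is a sliding layer.
fn is_sliding_layer(layer_idx: usize) -> bool {
    layer_idx % 2 == 0
}

/// Additive attention mask of shape [q_len][offset + q_len].
/// `offset` is the number of tokens already held in the KV cache, so query
/// row `i` sits at absolute position `offset + i`.
/// `window = None` gives full causal attention; `Some(w)` additionally hides
/// keys that are `w` or more positions behind the query.
fn causal_mask(q_len: usize, offset: usize, window: Option<usize>) -> Vec<Vec<f32>> {
    let kv_len = offset + q_len;
    (0..q_len)
        .map(|i| {
            let q_pos = offset + i;
            (0..kv_len)
                .map(|j| {
                    let future = j > q_pos;
                    // Written as j + w <= q_pos to avoid usize underflow.
                    let too_old = window.map_or(false, |w| j + w <= q_pos);
                    if future || too_old { f32::NEG_INFINITY } else { 0.0 }
                })
                .collect()
        })
        .collect()
}
```

A per-layer caller would pass Some(128) when is_sliding_layer(layer_idx) is true and None otherwise. During single-token decoding q_len is 1 and offset is the cache length, so the mask reduces to one row that hides everything older than the window.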

Workflow

feat/gpt-oss -> fork self-PR to main (CodeRabbit pre-flight) -> upstream PR to huggingface/candle -> merge into integration -> report new integration SHA back to WeaverTools.
