Pipeline-as-Config: structural model dispatch (#2114) #2210
justinchuby wants to merge 13 commits into
Logits::Get() only converted the model's raw logits to float32 when the output type was Float16. For BFloat16 the conversion was skipped and the subsequent WrapTensor<float> reinterpreted the raw 2-byte bf16 values as 4-byte float32, corrupting every logit (wrong argmax, incoherent generation). The identical model in Float16 worked correctly.
Treat BFloat16 the same as Float16 in both the fp32 staging-buffer allocation and the Cast to float32. Add an on-device CUDA bf16->f32 cast (LaunchBf16ToFp32) so the conversion does not fall back to a host round-trip; the CPU Cast path already supported bf16->f32.
Verified on a bf16 decoder (vocab 262144): first-token argmax now matches the Float16 / HuggingFace reference and generation is identical to fp16.
Fixes microsoft#2202
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <11205048+justinchuby@users.noreply.github.com>
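To make the failure mode above concrete, here is a minimal host-side sketch of a bf16 -> fp32 widening. It relies only on the fact that bfloat16 is the upper 16 bits of an IEEE-754 float32. `Bf16ToFp32` and `StageLogitsAsFp32` are hypothetical names for illustration; this is not the PR's `LaunchBf16ToFp32` kernel or the actual `Logits::Get()` code.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of
// a float32, so widening is a 16-bit left shift plus a bit-preserving copy.
inline float Bf16ToFp32(uint16_t bits) {
  const uint32_t widened = static_cast<uint32_t>(bits) << 16;
  float value;
  std::memcpy(&value, &widened, sizeof(value));
  return value;
}

// Hypothetical staging helper (an assumption, not the PR's code): converts
// element by element into a separate fp32 buffer instead of reinterpreting
// the 2-byte storage as 4-byte floats.
inline std::vector<float> StageLogitsAsFp32(const std::vector<uint16_t>& bf16_logits) {
  std::vector<float> staged;
  staged.reserve(bf16_logits.size());
  for (uint16_t bits : bf16_logits) staged.push_back(Bf16ToFp32(bits));
  return staged;
}
```

Reinterpreting the same buffer as `float` would instead fuse two adjacent bf16 logits into one 4-byte value and halve the element count, which is consistent with the corrupted argmax described above.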
Implements the Pipeline-as-Config redesign (issue microsoft#2114, PR1-5) by generalizing the existing DecoderOnlyPipeline executor rather than introducing new classes ("refactor not rewrite").
- Config v2 schema: Config::version + Config::Pipeline; SAX v2 parse; TranslateV1ToPipeline / LowerPipelineToModel; pipeline_presets.
- Structural CreatePipeline() delegated from CreateModel when a pipeline is present; legacy model.type dispatch preserved as ClassifyLegacyRoute oracle.
- PipelineFlow (init/step/final phases, explicit dataflow, DFS cycle and 10-stage guards) + Finalize hook.
- Plugin escape hatch: opaque C-ABI plugin_api.h + plugin_loader, gated behind USE_GENAI_PLUGINS (OFF by default).
- ClassifyStructuralRoute replaces model.type dispatch for CP2/CP3/CP4, guarded by a zero-regression gate asserting structural == legacy for every in-tree fixture.
Backward compatibility: model_type.h predicates retained (live non-dispatch callers); all 14 checked-in genai_config.json fixtures route identically. The qwen2-5-vl-pipeline gate fixture surfaced and fixed a real divergence: has_vision now also recognizes vision.pipeline[].
Tests: unit_tests 76 passed / 0 failed / 21 skipped (env-gated); new PipelineConfig/Flow/Dispatch/PluginLoader suites green; gpt2 + lfm2 e2e pass.
WIP src/models/kv_cache.cpp intentionally excluded.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
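The DFS cycle and 10-stage guards mentioned for PipelineFlow can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the PR's implementation: the `Wire` struct here, `ValidateFlow`, `HasCycleFrom`, and the error strings are hypothetical, and only the stage limit of 10 comes from the commit message.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dataflow edge; the PR's real type may differ.
struct Wire {
  std::string from;  // producing stage
  std::string to;    // consuming stage
};

constexpr std::size_t kMaxStages = 10;  // the "10-stage guard"

enum class Mark { kUnvisited, kInProgress, kDone };
using Adjacency = std::unordered_map<std::string, std::vector<std::string>>;

// Depth-first search; reaching a node that is still in progress is a back
// edge, which means the dataflow graph contains a cycle.
inline bool HasCycleFrom(const std::string& node, const Adjacency& adj,
                         std::unordered_map<std::string, Mark>& marks) {
  marks[node] = Mark::kInProgress;
  const auto it = adj.find(node);
  if (it != adj.end()) {
    for (const std::string& next : it->second) {
      const Mark mark = marks[next];  // unseen names default to kUnvisited
      if (mark == Mark::kInProgress) return true;
      if (mark == Mark::kUnvisited && HasCycleFrom(next, adj, marks)) return true;
    }
  }
  marks[node] = Mark::kDone;
  return false;
}

// Throws if the flow declares too many stages or its wires form a cycle.
inline void ValidateFlow(const std::vector<std::string>& stages,
                         const std::vector<Wire>& wires) {
  if (stages.size() > kMaxStages) throw std::runtime_error("too many stages");
  Adjacency adj;
  std::unordered_map<std::string, Mark> marks;
  for (const std::string& stage : stages) marks[stage] = Mark::kUnvisited;
  for (const Wire& wire : wires) adj[wire.from].push_back(wire.to);
  for (const std::string& stage : stages) {
    if (marks[stage] == Mark::kUnvisited && HasCycleFrom(stage, adj, marks)) {
      throw std::runtime_error("dataflow cycle");
    }
  }
}
```

A three-colour DFS is used here because it rejects cycles in a single pass over stages and wires; whether the PR uses the same marking scheme is not stated.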
Adds examples/pipeline-config/ demonstrating the v2 schema, each verified
to parse, lower, and route against the current build:
- 01-preset-decoder: preset usage (autoregressive-decoder) -> Gpt/DecoderOnly
- 02-explicit-encoder-decoder: explicit multi-stage dataflow (init/step,
  cross_attention_from, frozen cross_cache) -> Whisper
- 03-vlm-per-image: loop:"per_image" + mrope_3d vision pipeline -> MultiModal
- 04-plugin-escape-hatch: plugin opaque-handle shape (doc-only; needs
  USE_GENAI_PLUGINS=ON)
- 05-v1-to-v2: legacy v1 gpt2 config beside its v2 equivalent
- README.md: walkthrough of the version field, v1->v2 migration,
  presets, flow phases + guardrails, plugin opt-in
Verification: test/pipeline_config_tests.cpp gains ExamplePipelineConfigs.*
(4 tests) that load examples 1/2/3/5 from EXAMPLES_PATH and assert parse +
lower + ClassifyStructuralRoute. 4/4 pass; PipelineConfig/Dispatch suites
11/11, no regressions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oft#2114)
Extends examples/pipeline-config/ with two more verified, parseable demos:
- 06-multimodal-single-pass: gemma4-style tri-modal (text + vision + audio): vision -> image_features, speech -> audio_features, embedding merges both -> inputs_embeds, decoder consumes. Single-pass (no per-image loop, no mRoPE). Routes -> MultiModal.
- 07-prefill-decode: decoder.pipeline[] split into a prefill stage (run_on_prompt:true/run_on_token_gen:false) and a decode stage (run_on_prompt:false/run_on_token_gen:true), sharing KV via past_present_share_buffer. TranslateV1ToPipeline derives prefill->init, decode->step. Routes -> DecoderOnlyPipeline.
README gains tri-modal vs per-image (06 vs 03) and prefill/decode sections.
ExamplePipelineConfigs gains MultiModalSinglePass and PrefillDecodeSplit, asserting parse + lower + route + stage flags/sessions.
Reviewed by Rusty (APPROVE-WITH-NITS, no blocking issues); 17/17 example/config/dispatch tests pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR-A of the v2.1 speculative-decoding groundwork. Adds the KV-cache rollback prerequisite that speculative decoding's accept/reject step will build on, without introducing any new schema.
- DecoderOnlyPipelineState::RewindTo override: drains outstanding async partial KV-cache updates, then rewinds position inputs, key-value cache, and recurrent state (mirrors DecoderOnly_State::RewindTo).
- generators.cpp: remove "decoder-pipeline" from the RewindToLength throw list; whisper/phi3v/lfm2 still throw.
- New CAPITests.RewindDecoderPipelineFp32CAPI with a tiny self-consistent causal decoder-pipeline fixture proving token-for-token identical continuation after RewindTo (KV + position state truly rolled back).
Build green; *Rewind* + model/CAPI/pipeline suites: 46 passed, 0 failed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
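The property the new rewind test checks (identical continuation after a rollback) can be illustrated with a toy model. Everything below is hypothetical: `ToyDecoderState` stands in for real decoder state by treating the token history as the "KV cache" and deriving the next token from it deterministically. It shares no code with `DecoderOnlyPipelineState::RewindTo`.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for decoder state: the "cache" is the token history, and the
// next token is a deterministic function of everything cached so far.
class ToyDecoderState {
 public:
  // Appends one token to the cache and returns the greedy next token.
  int32_t Step(int32_t token) {
    cache_.push_back(token);
    uint32_t hash = 17;
    for (int32_t cached : cache_) hash = hash * 31u + static_cast<uint32_t>(cached);
    return static_cast<int32_t>(hash % kVocab);
  }

  // Drops every cached position at or beyond new_length.
  void RewindTo(std::size_t new_length) {
    if (new_length > cache_.size()) throw std::out_of_range("cannot rewind forward");
    cache_.resize(new_length);
  }

  std::size_t Length() const { return cache_.size(); }

 private:
  static constexpr uint32_t kVocab = 1000;
  std::vector<int32_t> cache_;
};

// Runs `steps` greedy steps starting from `token`; returns the new tokens.
inline std::vector<int32_t> Generate(ToyDecoderState& state, int32_t token, int steps) {
  std::vector<int32_t> generated;
  for (int i = 0; i < steps; ++i) {
    token = state.Step(token);
    generated.push_back(token);
  }
  return generated;
}
```

If any per-position state survived the rewind (a stale cache entry or an un-rewound position counter), the second generation would diverge from the first, which is the failure the fixture in the commit is designed to catch.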
…framework
Design doc for evolving the Pipeline-as-Config schema (v2.0) toward
native support for speculative decoding and modern inference
optimizations, without breaking v2.0 (both remain version: 2; the
speculative/strategy block presence is the discriminator).
Covers: speculative flow strategy, multi-session draft/target roles,
KV-cache rollback/checkpoint (PR-A, landed), variable tokens/step with
token-tree attention, intermediate hidden-state dataflow edges
(EAGLE/MTP), an ordered logit-processor/sampler chain, a runtime-vs-
build-time feature namespace, and a controller-plugin escape hatch.
Includes a dependency-ordered PR plan (PR-A -> PR-B -> {PR-C, PR-D};
PR-E/PR-F independent). Reviewed by Livingston (APPROVE-WITH-NITS);
citation nits addressed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…or (PR-B)
PR-B of the v2.1 speculative-decoding groundwork, stacked on PR-A's
KV-cache rollback. Adds the `speculative` flow strategy and multi-session
draft/target roles to the v2.1 schema, plus a working vanilla draft-target
executor. Both v2.0 and v2.1 stay version: 2; the strategy/roles block
presence is the discriminator.
- config.{h,cpp}: parse `roles` and `strategy` (speculative) blocks with
nested draft/ngram/verify/tree, block-presence gated.
- src/models/speculative_decoder.{h,cpp}: SpeculativeDecoder composes a
target and a draft Generator (each its own session + KV cache). Draft
proposes K tokens greedily; target verifies all K in one forward pass;
accept the longest matching prefix; commit target's greedy argmax (incl.
bonus); roll back both roles via RewindToLength (PR-A). Output is
token-for-token identical to plain greedy on the target.
- Logits::GetAll + State::GetRawLogits virtual + pipeline override +
Generator::GetRawLogits expose full [batch,seq,vocab] logits for
single-pass verify (all additive; normal decoding unperturbed).
- New SpeculativeDecodingTests (3) on real tiny target/draft fixtures:
schema parse, greedy==baseline with a distinct draft (reject+rewind),
and multi-token advance when draft==target.
Non-greedy acceptance, token-tree verify, and non-draft_model producers
(ngram/EAGLE) are parsed but throw — deferred to PR-C/PR-D.
Build green; SpeculativeDecodingTests + Pipeline + CAPI suites:
51 passed, 0 failed. Reviewed by Livingston (APPROVE-WITH-NITS).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
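The accept/commit rule described above (longest matching prefix, then one extra target token) can be sketched as a pure function. `VerifyGreedy` and `VerifyResult` are hypothetical names, and the sketch assumes the target's greedy argmax at each of the K+1 verified positions has already been extracted from the full logits; it does not reproduce SpeculativeDecoder itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct VerifyResult {
  std::size_t accepted;            // number of draft tokens that matched
  std::vector<int32_t> committed;  // tokens to append to the output
};

// draft:         the K tokens proposed by the draft model.
// target_argmax: the target's greedy pick at K+1 positions; entry i is what
//                the target emits after the prefix plus draft[0..i).
inline VerifyResult VerifyGreedy(const std::vector<int32_t>& draft,
                                 const std::vector<int32_t>& target_argmax) {
  if (target_argmax.size() != draft.size() + 1) {
    throw std::invalid_argument("expected K+1 target positions");
  }
  VerifyResult result{0, {}};
  // Accept the longest prefix on which draft and target agree.
  while (result.accepted < draft.size() &&
         draft[result.accepted] == target_argmax[result.accepted]) {
    result.committed.push_back(target_argmax[result.accepted]);
    ++result.accepted;
  }
  // Always commit one more target token: the correction at the first
  // mismatch, or the bonus token when all K draft tokens were accepted.
  result.committed.push_back(target_argmax[result.accepted]);
  return result;
}
```

Because every committed token is the target's own argmax, the output matches plain greedy decoding on the target regardless of draft quality; the draft only changes how many tokens are committed per target forward pass. Both roles would then be rewound to the committed length.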
PR-C of the v2.1 speculative-decoding groundwork, stacked on PR-B.
Adds the EAGLE/MTP foundation (intermediate hidden-state edges) and
upgrades the PR-B tree stub to a verified linear-K fallback.
- Hidden-state edges (real, fully verified): State::GetHiddenStates
virtual (additive, empty default) + DecoderOnlyPipelineState override
reading the configured `decoder.outputs.hidden_states` intermediate
activation from the ortvalue store, kept device-resident, cast
fp16/bf16 -> fp32, normalized to [batch,seq,hidden]; Generator
forwarder; schema parses the edge name and a dataflow wire
(target.hidden_states -> eagle_draft.prev_hidden). This is the
EAGLE/MTP prerequisite (draft consuming target hidden states).
- Token tree: medusa_choices now parsed; the executor degrades to a
verified linear-K chain instead of throwing. Output stays greedy-
equivalent (§10 invariant holds under tree verify).
True tree attention is deferred with a code-grounded reason: the
runtime PositionInputs only builds a 1D padding mask
(attention_mask_shape_ is {batch,seq}) and causal masking is hardcoded
in-graph via Trilu, so a per-(query,key) tree mask requires a model-side
[batch,1,q,kv] mask input (a build-time graph change). See design §11.
Build green; SpeculativeDecodingTests + Pipeline + CAPI suites:
57 passed, 0 failed. 6 new PR-C tests; PR-A/PR-B unaffected.
Reviewed by Livingston (APPROVE).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
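To show what a per-(query,key) tree mask would contain, and why a linear chain needs nothing beyond ordinary causal masking, here is a small sketch that builds the mask from parent pointers. `BuildTreeMask` and the parent-index encoding are assumptions made for illustration; the PR defers tree attention and does not define this function.

```cpp
#include <cstddef>
#include <vector>

// parent[i] is the index of draft-tree node i's parent, or -1 for a root.
// Assumes nodes are listed so that parent[i] < i.
// Returns mask[q][k] == true when query node q may attend to key node k,
// i.e. when k is q itself or one of q's ancestors.
inline std::vector<std::vector<bool>> BuildTreeMask(const std::vector<int>& parent) {
  const std::size_t n = parent.size();
  std::vector<std::vector<bool>> mask(n, std::vector<bool>(n, false));
  for (std::size_t q = 0; q < n; ++q) {
    // Walk from q up to its root, marking every node on the path.
    for (int k = static_cast<int>(q); k >= 0; k = parent[static_cast<std::size_t>(k)]) {
      mask[q][static_cast<std::size_t>(k)] = true;
    }
  }
  return mask;
}
```

For a chain such as `{-1, 0, 1}` the result is exactly the lower-triangular causal mask, which is why the linear-K fallback can reuse in-graph causal masking. For a branching tree such as `{-1, 0, 0, 1}`, sibling branches must not see each other, and that cannot be expressed by a 1D padding mask.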
PR-D of the v2.1 groundwork, stacked on PR-C. Generalizes the single
llguidance hook into a declarative, composable ordered chain of logit
processors applied before sampling, fully backward-compatible.
- Schema: search.logits_processors[] of typed ops (repetition_penalty,
min_length, logit_bias, grammar, temperature, top_k, top_p, sample).
Block-presence gated; version stays 2; unknown ops/keys throw.
- src/logits_processor_chain.{h,cpp}: LogitsProcessorOp interface +
LogitsProcessorChain. repetition_penalty/min_length delegate to the
existing Search scoring kernels; logit_bias is an in-place transform;
grammar adapts the existing ConstrainedLogitsProcessor verbatim (incl
Reset); temperature/top_k/top_p are realized by the existing fused
sampler so numerics never diverge; sample is the terminal op.
- Back-compat: logits_chain_ is built ONLY when logits_processors is
non-empty; otherwise the legacy guidance+sampling path runs byte-for-
byte unchanged. Guarded by BackCompatDefaultMatchesLegacy.
Deferred (flagged): combine (contrastive/CFG) needs multi-session
logits; grammar e2e needs USE_GUIDANCE=ON (op throws clearly otherwise);
speculative-path integration; scalar sampler ops are realized by the
terminal fused sampler regardless of relative position.
Build green; LogitsChain + Sampling + Speculative + Pipeline + CAPI
suites: 77 passed, 0 failed. Reviewed by Livingston (APPROVE-WITH-NITS).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
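An ordered chain of logit processors can be sketched as a list of in-place transforms applied in declaration order before a terminal selection step. The sketch below is illustrative only: `LogitsOp`, `LogitBias`, `RepetitionPenalty`, and `RunChain` are hypothetical, greedy argmax stands in for the terminal `sample` op, and the penalty uses the commonly seen divide-positive / multiply-negative form, which may not match the existing Search kernels the PR delegates to.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical op signature: mutate logits in place, given tokens so far.
using LogitsOp =
    std::function<void(std::vector<float>& logits, const std::vector<int>& history)>;

// Adds a fixed bias to selected token ids.
inline LogitsOp LogitBias(std::unordered_map<int, float> bias) {
  return [bias = std::move(bias)](std::vector<float>& logits, const std::vector<int>&) {
    for (const auto& entry : bias) {
      if (entry.first < 0) continue;
      const std::size_t token = static_cast<std::size_t>(entry.first);
      if (token < logits.size()) logits[token] += entry.second;
    }
  };
}

// Penalizes each previously generated token once (assumed common form).
inline LogitsOp RepetitionPenalty(float penalty) {
  return [penalty](std::vector<float>& logits, const std::vector<int>& history) {
    std::vector<bool> seen(logits.size(), false);
    for (int id : history) {
      if (id < 0) continue;
      const std::size_t token = static_cast<std::size_t>(id);
      if (token >= logits.size() || seen[token]) continue;
      seen[token] = true;
      float& logit = logits[token];
      logit = logit > 0.0f ? logit / penalty : logit * penalty;
    }
  };
}

// Applies the ops in order, then picks the argmax as a stand-in terminal op.
inline int RunChain(const std::vector<LogitsOp>& chain, std::vector<float> logits,
                    const std::vector<int>& history) {
  for (const LogitsOp& op : chain) op(logits, history);
  return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```

An empty chain leaves the logits untouched, mirroring the back-compat rule that the chain is only built when `logits_processors` is non-empty.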
PR-E of the v2.1 groundwork, stacked on PR-D. Adds the bucket-C escape
hatch: a plugin that drives a custom generation loop (e.g. Lookahead
Jacobi n-gram pools, nested cascades) that cannot be expressed as a
static DAG. The ABI exposes existing step primitives only; it adds no
new engine behavior.
- Schema: pipeline.controller {library, entry_point, config} (optional,
block-presence gated; version stays 2; absent => no behavior change).
- C ABI (plugin_api.h): OgaDecodeController / OgaDecodeStepContext vtable
exposing token append/get, forward step, logits (PR-B), hidden states
(PR-C), rewind (PR-A), EOS/length queries.
- Host dispatch (controller_host.{h,cpp}, ungated): when a controller is
configured, GenerateNextToken delegates to controller->Step(); the
plugin calls back into the Generator's existing primitives via the
vtable. controller_ is null otherwise; legacy path byte-for-byte
unchanged.
- Loader: real dlopen is USE_GENAI_PLUGINS-gated; disabled builds throw a
clear "rebuild with USE_GENAI_PLUGINS=ON" error (no silent skip).
- Fix: AppendAcceptedTokens commits via the SelectTop greedy path instead
of Search::AppendTokens, which called ResetDone() and wrote one-past
the sequence row at max_length (heap corruption). SelectTop guards with
if(!done_) and preserves termination.
Real external .so load is build-gated (this build is USE_GENAI_PLUGINS=
OFF); the primitive surface is proven by an in-tree stub controller that
reproduces plain greedy token-for-token through the vtable only.
Build green; Controller + LogitsChain + Speculative + Pipeline + CAPI
suites: 68 passed, 0 failed (7/7 ControllerHookTests). PR-A/B/C/D green.
Reviewed by Livingston (APPROVE-WITH-NITS).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Final PR-F of the v2.1 groundwork, stacked on PR-E. Adds a clear
session_options namespace separating runtime-toggleable features from
build-time graph properties, following the "declared, never synthesized"
principle.
- Schema: session_options.runtime.* (kv_cache{dtype,quant}, paging,
prefix_cache, sliding_window, chunked_prefill, precision) for features
schedulable at load/session time with no graph change; and
session_options.build_requires.* (attention, quantization, extra_heads)
for properties baked into the exported ONNX graph. Both std::optional,
block-presence gated, version stays 2.
- Validation (ValidateSessionOptionsFeatures): unknown enums throw, and
cross-namespace misuse throws a clear namespaced error (e.g. a build
quant token like awq in runtime.kv_cache.dtype points the user at
build_requires.quantization, and vice versa).
- Back-compat: SessionOptions_Element previously had no OnObject (nested
objects threw); adding runtime/build_requires is strictly additive --
scalar keys still route to config_entries unchanged and every other
nested object key still throws. Validator is a no-op when both absent.
- Plumbing point in CreateSessionOptionsFromConfig is a guarded warning
only: NO runtime feature is applied and build_requires is never acted
upon (declared, never synthesized). Per-feature numeric runtime effects
(KV dtype/quant, paging, prefix cache, chunked prefill) are deferred
per design section 9.
Build green; RuntimeFeatureNamespace + Config + Speculative + LogitsChain
+ ControllerHook + Pipeline + CAPI suites: 74 passed, 0 failed
(7/7 RuntimeFeatureNamespaceTests). PR-A..PR-E all green.
Reviewed by Livingston (APPROVE).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
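The cross-namespace check can be sketched as a validator that distinguishes "unknown value" from "valid value, wrong namespace". This is a hypothetical illustration, not `ValidateSessionOptionsFeatures`: the token sets below are assumed for the example (only `awq` appears in the commit message), and the error wording is invented.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Assumed token sets for illustration only; the PR's accepted enums are not
// listed in the commit message.
inline const std::set<std::string>& RuntimeKvDtypes() {
  static const std::set<std::string> kValues = {"fp16", "fp8", "int8"};
  return kValues;
}

inline const std::set<std::string>& BuildQuantTokens() {
  static const std::set<std::string> kValues = {"awq", "gptq"};
  return kValues;
}

// Validates session_options.runtime.kv_cache.dtype. A build-time quantization
// token gets an error that names the namespace where it belongs.
inline void ValidateRuntimeKvCacheDtype(const std::string& value) {
  if (RuntimeKvDtypes().count(value) != 0) return;
  if (BuildQuantTokens().count(value) != 0) {
    throw std::invalid_argument(
        "session_options.runtime.kv_cache.dtype: '" + value +
        "' is a build-time quantization; declare it under "
        "session_options.build_requires.quantization");
  }
  throw std::invalid_argument(
      "session_options.runtime.kv_cache.dtype: unknown value '" + value + "'");
}
```

Separating the two failure cases keeps the "declared, never synthesized" split visible to the user: a misplaced token is redirected rather than silently accepted or rejected as a typo.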
A concise, decision-oriented RFC complementing the detailed v2.1 design (linked as the deep appendix). Frames the speculative-decoding + inference-optimization work for a team meeting: TL;DR, motivation, goals/non-goals, a capability-status table, six key decisions to discuss (each with options, tradeoffs, and a recommendation), per-PR prototype evidence with honest deferrals and code-grounded reasons, open questions, and the phased PR plan (PR-A..PR-F, all landed on this branch).
Reviewed by Livingston for factual accuracy (APPROVE-WITH-NITS): all commit SHAs, cited test names, and load-bearing code-facts verified against the tree; citation nits fixed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tions
Expand the v2.1 discussion RFC so a single team review covers:
- NEW v2.0 base schema (Pipeline-as-Config) as a discussion item: version:2
schema, structural/block-presence routing (CreateModel->CreatePipeline->
ClassifyStructuralRoute) replacing model_type dispatch, the Wire{from,to}
dataflow concept, v1->v2 migration (TranslateV1ToPipeline / example 05),
and six v2.0-level key decisions framed for discussion.
- NEW audio-to-audio (speech-to-speech) forward-compat section: v2.1 schema
is an additive superset; audio-out wiring is expressible without a v2.0
break, with executor-level constraints (single int32 token stream + single
vocab_size, text-only sink) framed as open discussion items.
Sections renumbered; intra-RFC cross-references fixed. Reviewed by Livingston
(APPROVE-WITH-NITS, all code-fact citations verified accurate); nits fixed.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review from a model-builder / CP6 perspective
I went through this with a focus on the producer↔consumer contract (this PR is the C++ config consumer; the Python model builder is the producer that would emit these configs). I traced schema → parser → router → executor against the diff. Net takeaway: today the v2 …
Critical
Major
Docs / clarity (would mislead a producer)
Nice work
Questions
Summary
Implements the Pipeline-as-Config redesign from #2114 (PR1–5) by generalizing the existing `DecoderOnlyPipeline` executor rather than introducing the greenfield classes the issue sketched: "refactor not rewrite." The proposed `PipelineExecutor` / `MultiSessionPipeline` already effectively exist as `DecoderOnlyPipelineState` / `DecoderOnlyPipelineModel`.
Addresses #2114.
What's included
- `Config::version` + `Config::Pipeline`; SAX v2 parse; `TranslateV1ToPipeline` / `LowerPipelineToModel`; `pipeline_presets.*`
- `CreatePipeline()` delegated from `CreateModel` when a pipeline is present; legacy `model.type` dispatch preserved as the `ClassifyLegacyRoute` oracle; v2 tokens/generation/metadata lowering
- `PipelineFlow` (init/step/final phases, explicit `dataflow[]`, DFS cycle detection, 10-stage guard) + `Finalize` hook
- `plugin_api.h` + `plugin_loader`, gated behind `USE_GENAI_PLUGINS` (OFF by default)
- `ClassifyStructuralRoute` replaces `model.type` dispatch for CP2/CP3/CP4, guarded by a zero-regression gate asserting `structural == legacy` for every in-tree fixture
Backward compatibility
- `model_type.h` predicates are retained: every one still has a live non-dispatch caller (v1→v2 translator, generators, kv_cache, RNNT context-length guard, legacy oracle). Only the dispatch role was removed. This is shrink-in-role, not the issue's "delete the file."
- All 14 checked-in `genai_config.json` fixtures route identically (`PipelineDispatchTests` gate).
- The `qwen2-5-vl-pipeline` gate fixture surfaced and fixed a real divergence: `has_vision` now recognizes the `vision.pipeline[]` shape (not just `vision.filename`).
Testing
- `unit_tests`: 76 passed / 0 failed / 21 skipped (skips are env-gated: TensorRT-RTX, StreamingASR, Parakeet).
- `PipelineConfigTests`, `PipelineFlowTests`, `PluginLoaderTests`, `PipelineDispatchTests`.
- … (`USE_GENAI_PLUGINS=1`) compiles.
Not yet exercised end-to-end (reviewer attention welcome)
- `RunStage` init/step loop and `Finalize` final-stage path: no in-tree generic `decoder.pipeline[]` VLM fixture with ONNX weights; behavior-preservation rests on the dispatch-equivalence gate + reasoning, not a running multi-session VLM.
Descoped to v2.1+ (per maintainer comments in #2114)
TTS `single_pass`, diffusion `denoising`, RNNT loop strategies as plugin, `when: "final"` vocoder, `repeat` / `counter`. Encoder-decoder remains Whisper/Marian (routed structurally, not unified into the executor).
🤖 Draft: opening for early architectural review given the e2e caveats above.