ORT GenAI Architectural Redesign: Pipeline-as-Config
A proposal to make onnxruntime-genai truly model-agnostic
Authors: Architecture Team (Architect, Product Manager, Radical Thinker)
Date: 2026-05-02
Status: Draft — for review before GitHub issue creation
Terminology note: The flow[].when values were renamed from once/prompt/always to init/step/final for cross-paradigm clarity (autoregressive, denoising, single-pass all use the same three phases). The canonical definitions are in Section 3.2. Some earlier-drafted examples may use the original names.
Executive Summary
onnxruntime-genai is 90% model-agnostic today — but a hardcoded string registry blocks every new model. 21 of 32 recognized model types share identical runtime code (DecoderOnly_Model). The KV cache auto-discovers its layout from ONNX tensor names. The generation loop knows nothing about model architecture. The only thing preventing ANY new model from working is a C++ whitelist that maps model_type strings to implementation classes.
We propose Full-Stack Declarative Inference — replacing string-based dispatch with a declarative pipeline configuration where preprocessing, orchestration, and generation are ALL expressed as JSON config, running on 6+ execution providers. Instead of the runtime knowing about "Llama" or "Qwen" or "Gemma," it knows about pipelines — sequences of ONNX session invocations with configurable data flow, state management, and execution ordering.
The result: zero model-specific C++ code in ORT GenAI, ever again. New models are supported entirely by the export tool (mobius/Olive) generating ONNX graphs + pipeline configs. The runtime becomes a stable platform that only changes for performance improvements and new features, never for new models.
The Pitch
The only inference runtime where adding a new model is a JSON file, not a code change — and it runs on every platform.
Design Principle
Detect, don't declare. The runtime infers model category from structural signals (which ONNX sessions exist, what I/O signatures they have), not from string dispatch. Config fields express behavioral choices that genuinely can't be inferred from structure. model.type becomes metadata for humans, not a dispatch key for machines.
The One-Sentence Version
The runtime should know HOW to run pipelines (KV cache, generation loop, sampling), not WHAT models it's running — the "what" comes entirely from the ONNX graph + pipeline config.
1. The Problem Today
1.1 The Six Coupling Points
Adding a new model type to ORT GenAI currently requires C++ source changes in up to 6 locations:
| # | Coupling Point | File | Lines | What Changes |
| --- | --- | --- | --- | --- |
| CP1 | Model type whitelist | src/models/model_type.h | 16-61 | Add string to static array |
| CP2 | Model factory dispatch | src/models/model.cpp | 820-842 | Add if-else branch |
| CP3 | Position input strategy | src/models/position_inputs.cpp | 928-938 | Add model_type check |
| CP4 | Vision state factory | src/models/multi_modal.cpp | 565-572 | Add model_type check |
| CP5 | Multimodal processor factory | src/models/model.cpp | 915-933 | Add factory entry |
| CP6 | Python model builder | src/python/py/models/builders/ | Various | Add Python builder file |
For a standard decoder-only LLM, only CP1 is required (adding one string). But that one string requires a C++ PR, code review, CI pipeline, and a new release. The bottleneck isn't engineering complexity — it's release process overhead for a trivial change.
1.2 The False Complexity
The codebase has 8 C++ model classes and 32 recognized model_type strings. But strip away the legacy, and there are only 3 genuinely different runtime behaviors:
| Runtime Behavior | C++ Classes | Model Types | Actually Different? |
| --- | --- | --- | --- |
| Decoder-only autoregressive | DecoderOnly_Model, Gpt_Model | 22 types | No: GPT-2 differs only in KV cache format (config-detectable) |
| Multi-session (VLM/multimodal) | MultiModalLanguageModel, Qwen2_5_VL_PipelineModel | 8 types | Partially: vision invocation strategy varies, but is config-expressible |
| Encoder-decoder | WhisperModel, MarianModel | 2 types | No: both are encoder→cross-attention-decoder |
| RNNT streaming ASR | NemotronSpeechModel | 1 type | Yes: fundamentally different decoding loop |
| Pipeline (QNN multi-stage) | DecoderOnlyPipelineModel | 1 type | Deployment variant, not architectural difference |
3 runtime patterns, not 32 model types. The model_type string is doing almost zero useful work.
1.3 What Users Experience
Four personas are blocked by this architecture:
- The Model Builder (mobius dev): "I built a perfect ONNX model + config, but ORT GenAI rejects it because it doesn't know my model_type string."
- The Deployer (ML engineer): "I have to wait for a new ORT GenAI release just to use a new model. My alternative is forking the runtime."
- The Fine-Tuner (researcher): "My model is architecturally identical to Llama but has a custom model_type. ORT GenAI won't load it."
- The ORT GenAI Maintainer (MSFT): "Every new HuggingFace model = a C++ PR. We're bottlenecked on model support."
2. Competitive Analysis
How Other Runtimes Handle Extensibility
| Runtime | Pattern | Extensible Without Source Changes? | Adding a New Model |
| --- | --- | --- | --- |
| vLLM | Python dict registry + lazy import | ✅ Yes (register_model() API) | 1 registry line + 1 Python file |
| SGLang | AST-based filesystem discovery | ✅ Yes (drop .py file in directory) | 0-1 config lines + 1 module file |
| llama.cpp | C++ enum dispatch (like ORT GenAI) | ❌ No (requires recompilation) | 1 enum + 100-300 LOC |
| ORT GenAI | C++ string whitelist dispatch | ❌ No (requires recompilation) | 1 string + PR + release cycle |
| ORT GenAI (proposed) | Declarative pipeline config | ✅ Yes (JSON config only) | 0 code lines + 1 JSON config |
ORT GenAI's Unique Advantage
ORT GenAI has something no other runtime has: the ONNX model IS the computation. vLLM and SGLang require model-specific Python classes that implement forward() with PyTorch ops. llama.cpp requires model-specific C++ code that implements attention, MLP, and normalization. ORT GenAI delegates ALL computation to ONNX Runtime — it never touches model internals.
This means ORT GenAI's extensibility problem is fundamentally simpler. It doesn't need a plugin system for model computation (the ONNX graph handles that). It only needs extensibility for orchestration — which sessions to run, in what order, how to manage state between steps. And orchestration is naturally expressed as configuration.
Why Pipeline-as-Config Is Better Than GGUF (llama.cpp)
1. GGUF bundles computation with metadata. We separate them.
GGUF's model file contains weights + architecture metadata. The runtime reads the metadata and builds a compute graph at load time. This means the runtime must understand every architecture's compute pattern — which attention variant, which normalization, which MLP structure. When a model adds dual head_dim or KV sharing, llama.cpp needs new C++ code to interpret those metadata keys and build the right compute graph.
Our ONNX model IS the precompiled compute graph. The runtime never interprets architecture details — it just runs session.Run(). The pipeline config only describes ORCHESTRATION (which sessions, what order, what state), not COMPUTATION:
GGUF: metadata → [runtime builds graph] → execution
Ours: ONNX graph (prebuilt) + pipeline config → [runtime orchestrates] → execution
The runtime never needs to know what's inside the model. GGUF's runtime does.
2. Multi-EP deployment is impossible with GGUF.
GGUF models run on llama.cpp's own backends (CPU, CUDA, Metal, Vulkan). You can't take a GGUF and run it on DirectML, QNN (Qualcomm NPU), OpenVINO, or WebGPU without porting the entire backend.
ONNX + pipeline config runs on ANY ORT execution provider. The same model + config deploys to cloud GPU (CUDA EP), Windows laptop (DML EP), Qualcomm mobile (QNN EP), Intel hardware (OpenVINO EP), browser (WebGPU EP), and CPU. One model, one config, six+ deployment targets.
3. Graph-level optimization at export time.
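ONNX models go through ORT's graph optimization pipeline: constant folding, op fusion, layout optimization, EP-specific transformations. These passes run ONCE, either offline as part of export (ORT can serialize the optimized graph to disk) or at session creation, and they produce an optimized execution plan before the first inference call. GGUF has no equivalent export-time stage: the file carries weights and metadata, and the compute graph only exists once the runtime has built it.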
Why Pipeline-as-Config Is Better Than vLLM
1. vLLM requires Python code for every model. We require JSON.
Adding a model to vLLM means writing a Python class with forward(), weight loading, attention implementation — typically 200-500 lines of PyTorch code. Even with register_model(), someone must WRITE that code.
Our approach: the export tool (mobius) generates the ONNX graph + pipeline config. The runtime needs ZERO new code. The complexity lives in the exporter (which already understands the model), not the runtime.
2. vLLM is CUDA-first in production.
vLLM's custom CUDA kernels (PagedAttention, FlashAttention) are what make it fast, and NVIDIA GPUs are its primary target. Other backends such as AMD ROCm are second-class, and DirectML, QNN, and WebGPU are not covered, so reaching Intel, Qualcomm, or browser targets means new kernel work. ORT's execution providers handle hardware abstraction transparently.
3. vLLM couples computation and orchestration.
vLLM's model classes implement both the forward pass AND orchestration logic (KV cache management, attention patterns). Our architecture cleanly separates: ONNX model = computation, pipeline config = orchestration, ORT = execution.
Where Competitors Are Better (Honest Assessment)
| They're better at | Why | Our path to parity |
| --- | --- | --- |
| GGUF: Single-file distribution | One .gguf file vs our model dir | ONNX metadata embedding (research direction) |
| GGUF: Quantization simplicity | Q4_K_M is one flag | Olive pipeline (more steps, but more flexible) |
| vLLM: Serving features | Continuous batching, speculative decoding, prefix caching | ORT GenAI engine mode (growing) |
| vLLM: Community velocity | 200+ models, rapid community PRs | Pipeline-as-config FIXES this by enabling the same velocity |
| Both: No export step | Load HF weights directly | We require an export step (mobius build) |
The Core Competitive Insight: Compile at Export Time
This is our unique structural advantage that neither competitor can replicate:
We move complexity from RUNTIME to EXPORT TIME.
- GGUF: Runtime builds the compute graph (complexity at runtime)
- vLLM: Runtime runs model-specific Python code (complexity at runtime)
- Ours: Export tool builds the compute graph AND generates the orchestration config (complexity at export time). Runtime is generic.
Why this matters:
- Export runs ONCE; inference runs millions of times. Put the intelligence where it runs once.
- Export has access to the full HuggingFace model — Python code, config, architecture details. It can make perfect decisions. The runtime shouldn't need this information.
- The export tool (mobius) is Python — easy to extend. The runtime is C++ — hard to change. Our architecture puts extensibility in the easy-to-change layer.
- Export-time optimization — graph optimization, quantization, EP-specific tuning all happen before deployment. The runtime gets a pre-optimized artifact.
This is the 'compiler vs interpreter' advantage. GGUF and vLLM are interpreters — they process model definitions at runtime. We're a compiler — we process model definitions once at export and produce an optimized artifact that a simple, generic runtime executes.
What We Do That NEITHER Competitor Can
Three capabilities that pipeline-as-config delivers that no competitor matches:
1. Multi-Session Declarative Pipelines. GGUF has flat key-value metadata — no concept of multi-model pipelines. vLLM can do multi-model through Python code, but each topology requires a new class. Pipeline-as-Config's flow[] + dataflow[] declaratively express ANY multi-session topology — VLMs, speech models, multimodal with vision+audio+decoder — all as JSON without new code.
2. Hardware-Agnostic Model Artifacts. GGUF models are tied to llama.cpp's backend ecosystem. vLLM is CUDA-first (AMD ROCm second-class, no DirectML/QNN/WebGPU). ONNX + pipeline config is a hardware-agnostic artifact — the same files deploy on CPU, CUDA, DirectML, QNN (Qualcomm NPU), OpenVINO (Intel), and WebGPU. Write once, deploy on 6+ hardware targets. Even more powerfully, different sessions in the same pipeline can run on DIFFERENT execution providers — e.g., vision encoder on CPU while the decoder runs on GPU, or vision on NPU while decoder runs on GPU:
"sessions": {
"vision": {"file": "vision/model.onnx", "execution_provider": "QNNExecutionProvider"},
"decoder": {"file": "decoder/model.onnx", "execution_provider": "CUDAExecutionProvider"}
}
This heterogeneous hardware deployment — different EPs per session in a single pipeline — is something neither GGUF nor vLLM can express at all.
3. Truly Model-Agnostic Runtime. GGUF's runtime interprets architecture metadata to build compute graphs — it must understand every model's attention pattern, normalization, MLP structure. vLLM's runtime runs model-specific Python forward() code. Our runtime executes a declared pipeline — it understands ZERO model architecture. The runtime has no decisions to make.
The Complete Competitive Matrix
| Capability | GGUF | vLLM | Pipeline-as-Config |
| --- | --- | --- | --- |
| New LLM without runtime changes | ❌ | ❌ (needs Python class) | ✅ (JSON config) |
| Multi-session pipelines (VLM) | ❌ (no concept) | ⚠️ (Python code) | ✅ (declarative flow) |
| Deploy same model on 6+ HW targets | ❌ | ❌ | ✅ (ORT execution providers) |
| Heterogeneous HW per session | ❌ | ❌ | ✅ (vision on NPU, decoder on GPU) |
| Model-agnostic runtime | ❌ | ❌ | ✅ |
| Self-describing model artifacts | ✅ (GGUF metadata) | ❌ (needs Python) | ✅ (ONNX + pipeline JSON) |
| Declarative preprocessing | ❌ | ❌ | ✅ (ort-extensions JSON) |
| No Python dependency at inference | ✅ | ❌ | ✅ |
| C++ only deployment (edge/embedded) | ✅ | ❌ | ✅ |
| Extensibility without recompilation | ❌ | ✅ (Python) | ✅ (JSON + plugin .so) |
Pipeline-as-Config is the only approach that checks ALL boxes.
3. The Architecture: Pipeline-as-Config
3.1 Core Concept
Replace model-type dispatch with a declarative pipeline configuration. The runtime becomes a generic pipeline executor that:
- Loads whatever ONNX sessions the config declares
- Executes them in the order the config specifies
- Wires outputs→inputs using explicit dataflow declarations
- Manages state (KV cache, position IDs) per config-driven strategies
- Generates tokens using the standard (fully generic) generation loop
┌─────────────────────────────────────────────────┐
│ genai_config.json v2 │
│ │
│ pipeline.extends: "autoregressive-decoder" │
│ pipeline.sessions: {name → file} │
│ pipeline.flow: [{run, when, loop}] │
│ pipeline.dataflow: [{from, to}] │
│ pipeline.state: {kv_cache, position_ids} │
│ pipeline.plugin: "libcustom.so" (optional) │
│ │
│ tokens: {pad, eos, bos} │
│ generation: {max_length, sampling, stop} │
│ metadata: {model_type, source} (human-only) │
└──────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Pipeline Factory (structural) │
│ │
│ No string dispatch. Config structure drives: │
│ ┌─ DecoderPipeline (single session) │
│ ├─ MultiSessionPipeline (2+ sessions) │
│ ├─ EncoderDecoderPipeline (cross-attention) │
│ └─ PluginPipeline (custom shared library) │
└──────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Generic Pipeline Executor │
│ │
│ Interprets flow[] declaratively: │
│ - when: init | step | final │
│ - loop: batched | per_image │
│ │
│ State management (config-driven): │
│ - KV cache (auto / separate / combined) │
│ - Position IDs (auto / default / mrope_3d) │
│ - Sliding window (from config) │
│ │
│ Generation loop (fully generic, unchanged): │
│ - Sampling, beam search, EOS detection │
│ - Streaming output │
└─────────────────────────────────────────────────┘
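The structural factory in the diagram can be sketched in a few lines. This is an illustration in Python rather than the runtime's C++, and it operates on an already-expanded pipeline object. The function name select_pipeline_kind and the rule that a state.cross_cache entry signals an encoder-decoder pipeline are assumptions made for this example, not part of the proposal.

```python
from typing import Any


def select_pipeline_kind(pipeline: dict[str, Any]) -> str:
    """Pick a pipeline class from config structure alone (no model_type lookup)."""
    # A declared plugin always wins: the config has opted out of the built-ins.
    if "plugin" in pipeline:
        return "PluginPipeline"
    sessions = pipeline.get("sessions", {})
    if not sessions:
        raise ValueError("pipeline.sessions must declare at least one session")
    # ASSUMPTION: a cross-attention cache in state marks an encoder-decoder model.
    if "cross_cache" in pipeline.get("state", {}):
        return "EncoderDecoderPipeline"
    # Otherwise the session count is the only signal that is needed.
    return "DecoderPipeline" if len(sessions) == 1 else "MultiSessionPipeline"
```

Note that metadata.model_type is never consulted; the same function handles a Llama-style and a Qwen-style decoder identically.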
3.2 The flow Array — Execution Ordering
The flow array declares which sessions run, when, and how:
"flow": [
  {"run": "vision", "when": "init", "loop": "per_image"},
  {"run": "embedding", "when": "init"},
  {"run": "decoder", "when": "step"}
]
Lifecycle phases (fixed vocabulary — not Turing-complete):
- when: "init" runs before the main generation loop (encoders, embedding projectors, preprocessing). Execution order follows the flow[] array. Covers what earlier drafts called once and prompt.
- when: "step" runs every iteration of the generation loop (decoder in autoregressive, UNet in denoising).
- when: "final" runs after the generation loop completes (vocoder in TTS, VAE decoder in diffusion).
These three phases map cleanly across all generation paradigms:
| Phase | Autoregressive | Denoising (Diffusion) | Single-Pass |
| --- | --- | --- | --- |
| init | Vision encoder, embedding, prompt processing | Text encoder, latent init | All sessions |
| step | Decoder (each token) | UNet (each denoising step) | N/A (no loop) |
| final | (none) | VAE decoder → image | (none) |
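The three phases can be read as a small driver loop. The sketch below is a simplified Python illustration, not the proposed C++ executor: run_session and should_stop are hypothetical callbacks standing in for session.Run() and for EOS or step-count detection.

```python
from typing import Callable


def run_pipeline(flow: list[dict], run_session: Callable[[str], None],
                 should_stop: Callable[[int], bool], max_steps: int) -> list[str]:
    """Run flow stages phase by phase and return a trace of what ran."""
    trace: list[str] = []

    def run_phase(phase: str) -> None:
        # Within a phase, stages run in flow[] order.
        for stage in flow:
            if stage["when"] == phase:
                run_session(stage["run"])
                trace.append(f"{phase}:{stage['run']}")

    run_phase("init")
    # Single-pass pipelines declare no "step" stages, so the loop is skipped.
    if any(stage["when"] == "step" for stage in flow):
        for iteration in range(max_steps):
            run_phase("step")
            if should_stop(iteration):
                break
    run_phase("final")
    return trace
```

The same driver covers a decoder loop (stop on EOS), a denoising loop (stop after a fixed number of steps), and a single-pass model (no step stages at all).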
Loop modes (fixed vocabulary):
- loop: "batched" passes all inputs at once (default)
- loop: "per_image" iterates over inputs individually (Qwen VL, Pixtral)
Guardrails:
- Fixed vocabulary for when and loop: no arbitrary conditions or iterations
- Maximum 10 flow stages, which prevents pathological configs
- No if/else: anything that needs conditional logic uses the plugin API
- Cycle detection in dataflow at load time
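The first two guardrails are cheap to enforce at load time. The following Python sketch shows one way a validator could look; the function name, the error-message wording, and the extra check that every run value names a declared session are assumptions for illustration.

```python
ALLOWED_WHEN = {"init", "step", "final"}
ALLOWED_LOOP = {"batched", "per_image"}
MAX_FLOW_STAGES = 10


def validate_flow(flow: list[dict], session_names: set[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if len(flow) > MAX_FLOW_STAGES:
        errors.append(f"flow has {len(flow)} stages; the maximum is {MAX_FLOW_STAGES}")
    for index, stage in enumerate(flow):
        if stage.get("run") not in session_names:
            errors.append(f"flow[{index}]: unknown session {stage.get('run')!r}")
        if stage.get("when") not in ALLOWED_WHEN:
            errors.append(f"flow[{index}]: invalid when {stage.get('when')!r}")
        # "batched" is the documented default when loop is omitted.
        if stage.get("loop", "batched") not in ALLOWED_LOOP:
            errors.append(f"flow[{index}]: invalid loop {stage.get('loop')!r}")
    return errors
```

Collecting all errors instead of raising on the first one lets an export tool report every problem in a config in a single pass.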
3.3 The dataflow Array — Session Wiring
Optional. Declares how outputs from one session feed into inputs of another:
"dataflow": [
  {"from": "vision.image_features", "to": "embedding.image_features"},
  {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
]
When omitted, the runtime auto-matches by tensor name (output name in session A matches input name in session B). When explicit, overrides auto-matching for cases where tensor names differ.
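Auto-matching with explicit overrides, plus the load-time cycle check, could be sketched as follows. This Python version assumes session names contain no dots (so "session.tensor" splits on the first dot) and treats more than one same-named producer as an error; both rules are assumptions of the sketch, as are the function names.

```python
def resolve_wires(outputs, inputs, explicit=()):
    """Map each (session, input) to the (session, output) that feeds it.

    outputs / inputs: {session_name: [tensor_name, ...]} as read from each
    ONNX session's signature. explicit: the optional dataflow[] entries.
    """
    wires = {}
    # Explicit entries take priority over name matching.
    for wire in explicit:
        src_session, src_tensor = wire["from"].split(".", 1)
        dst_session, dst_tensor = wire["to"].split(".", 1)
        wires[(dst_session, dst_tensor)] = (src_session, src_tensor)
    # Auto-match the rest: an input is fed by another session's same-named output.
    for dst_session, names in inputs.items():
        for tensor in names:
            if (dst_session, tensor) in wires:
                continue
            producers = [s for s, outs in outputs.items()
                         if s != dst_session and tensor in outs]
            if len(producers) > 1:
                raise ValueError(f"ambiguous producer for {dst_session}.{tensor}: {producers}")
            if producers:
                wires[(dst_session, tensor)] = (producers[0], tensor)
    return wires


def has_cycle(wires):
    """Kahn's algorithm over the session graph implied by the wires."""
    edges = {(src[0], dst[0]) for dst, src in wires.items()}
    nodes = {name for edge in edges for name in edge}
    indegree = dict.fromkeys(nodes, 0)
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    # Any node never reaching indegree 0 sits on a cycle.
    return visited != len(nodes)
```

Inputs with no producer (for example pixel_values or input_ids) stay unwired and are expected to come from the caller.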
3.4 The state Object — KV Cache & Position Strategy
"state": {
"kv_cache": {
"format": "auto",
"past_key_pattern": "past_key_values.{layer}.key",
"present_key_pattern": "present.{layer}.key",
"past_value_pattern": "past_key_values.{layer}.value",
"present_value_pattern": "present.{layer}.value"
},
"position_ids": {
"strategy": "auto",
"input_name": "position_ids"
}
}
KV cache formats:
- "auto": introspect ONNX session I/O to detect format (default)
- "separate": standard past_key_values.{layer}.key / present.{layer}.key
- "combined": GPT-2 style past_{layer} / present_{layer}
- Name patterns are optional overrides when auto-detection fails
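Format detection from tensor names can be sketched with two anchored patterns. In this Python illustration the input names would come from the ONNX session's signature; the helper names and the idea of passing override patterns as a format-to-pattern mapping are assumptions, not the existing DefaultKeyValueCache logic.

```python
import re

# Default name patterns for the two documented formats.
DEFAULT_PATTERNS = {
    "separate": "past_key_values.{layer}.key",
    "combined": "past_{layer}",
}


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a name pattern with one {layer} placeholder into an anchored regex."""
    prefix, suffix = pattern.split("{layer}")
    return re.compile("^" + re.escape(prefix) + r"(\d+)" + re.escape(suffix) + "$")


def detect_kv_format(input_names, patterns=DEFAULT_PATTERNS) -> str:
    """Return the first format whose pattern matches any session input name."""
    for kv_format, pattern in patterns.items():
        regex = pattern_to_regex(pattern)
        if any(regex.match(name) for name in input_names):
            return kv_format
    raise ValueError("could not detect KV cache format; set name patterns explicitly")
```

A model with non-standard names would pass its own patterns, for example detect_kv_format(names, {"separate": "cache.{layer}.k"}).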
Position ID strategies:
- "auto": introspect position_ids input shape: rank 2 → default 1D, rank 3 → mRoPE 3D
- "default": standard 1D position IDs
- "mrope_3d": 3-dimensional mRoPE (temporal, height, width)
- "windowed": sliding window position tracking
3.5 The extends Mechanism — Preset Inheritance
Built-in presets eliminate boilerplate for common patterns:
{"pipeline": {"extends": "autoregressive-decoder"}}
Built-in presets:
| Preset Name | What It Expands To |
| --- | --- |
| autoregressive-decoder | Single decoder session, default KV cache, default position IDs, flow: [{run: decoder, when: step}] |
| vision-language | Vision + embedding + decoder sessions, batched vision, default KV cache |
| encoder-decoder | Encoder (init) + decoder (step), cross-attention KV cache |
| speech-language | Speech encoder + embedding + decoder |
Presets are resolved at load time — the runtime sees a fully expanded config. Overrides replace preset defaults:
{
  "pipeline": {
    "extends": "vision-language",
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "state": {
      "position_ids": {"strategy": "mrope_3d", "grid_source": "vision.image_grid_thw"}
    }
  }
}
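Preset resolution can be sketched as a merge performed once at load time. The Python below assumes that objects merge key by key while arrays such as flow are replaced wholesale; that reading is inferred from the override example (which keeps the preset's kv_cache while changing position_ids) and is not stated by the proposal. The preset contents are illustrative only.

```python
import copy

# Illustrative preset contents only; the real built-in definitions may differ.
PRESETS = {
    "vision-language": {
        "flow": [
            {"run": "vision", "when": "init", "loop": "batched"},
            {"run": "embedding", "when": "init"},
            {"run": "decoder", "when": "step"},
        ],
        "state": {
            "kv_cache": {"format": "auto"},
            "position_ids": {"strategy": "auto"},
        },
    },
}


def merge(base: dict, override: dict) -> dict:
    """Objects merge recursively; arrays and scalars are replaced outright."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def expand_pipeline(pipeline: dict) -> dict:
    """Resolve extends so that later stages see a fully expanded config."""
    overrides = {k: v for k, v in pipeline.items() if k != "extends"}
    preset_name = pipeline.get("extends")
    if preset_name is None:
        return copy.deepcopy(overrides)
    if preset_name not in PRESETS:
        raise ValueError(f"unknown preset: {preset_name!r}")
    return merge(PRESETS[preset_name], overrides)
```

Deep copies keep the built-in preset table immutable, so one model's overrides can never leak into the next config that extends the same preset.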
3.6 The Plugin API — Escape Hatch
For genuinely novel architectures that can't be expressed as standard pipelines (RNNT, SSM/Mamba, diffusion):
{
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    }
  }
}
C++ plugin interface:
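// extern "C" gives the entry point an unmangled name that the runtime can resolve
// with dlsym/GetProcAddress, so plugins can be compiled separately from the runtime.
// Note: the std:: smart pointers in this signature are C++ types, so as written the
// plugin must share a compatible toolchain and standard library with the runtime;
// a fully stable C ABI would need opaque handles instead.
extern "C" {
  std::shared_ptr<Pipeline> CreateRnntPipeline(
      OrtEnv& env, std::unique_ptr<Config> config);
}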
The plugin registers a Pipeline factory, not a Model factory — keeping the abstraction consistent. Plugins extend the pipeline type system for the ~1% of models that can't fit the declarative config.
3.7 Preprocessing: Image, Audio, and Variable Input Shapes
Preprocessing is NOT the pipeline executor's job. It transforms raw inputs (pixels, audio waveforms) into model-ready tensors. This happens BEFORE the pipeline runs and is handled by a separate, config-driven preprocessing layer.
The Architecture Boundary
Raw Input (images, audio, text)
│
▼
┌─────────────────────────────┐
│ Preprocessing Layer │
│ (ort-extensions) │
│ │
│ image_processor.json ────┤──→ pixel_values, image_sizes, grid_thw
│ audio_processor.json ────┤──→ audio_features, audio_sizes
│ tokenizer.json ────┤──→ input_ids, attention_mask
│ │
│ Config-driven. │
│ Zero model-specific C++. │
└──────────────┬──────────────┘
│ model-ready tensors
▼
┌─────────────────────────────┐
│ Pipeline Executor │
│ (this proposal) │
└─────────────────────────────┘
Image Preprocessing — ort-extensions image_processor.json
Each VLM ships an image_processor.json that declares its preprocessing pipeline:
{
  "image_processor_type": "Qwen2VLImageProcessor",
  "resample": "bicubic",
  "do_resize": true,
  "size": {"min_pixels": 3136, "max_pixels": 12845056},
  "do_rescale": true,
  "rescale_factor": 0.00392156862745098,
  "do_normalize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "patch_size": 14,
  "merge_size": 2
}
ort-extensions loads this JSON and executes the preprocessing pipeline using its own C++ ops. No model-type dispatch needed. Different VLMs (Phi3v at 336×336, Qwen2.5-VL with dynamic resolution, Pixtral with variable per-image sizes) all use the same mechanism — they just ship different image_processor.json configs.
The C++ preprocessors (PhiImageProcessor, QwenImageProcessor, GemmaImageProcessor, Mistral3ImageProcessor) become legacy. New models use ort-extensions exclusively. This is already the direction — mobius already generates image_processor.json for all VLMs.
Audio Preprocessing — audio_processor.json
Same pattern for speech models:
{
  "audio_processor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "chunk_length": 30,
  "n_fft": 400
}
For multimodal models needing both image and audio (Phi4mm):
"preprocessing": {
  "image": {"config": "image_processor.json"},
  "audio": {"config": "audio_processor.json"}
}
Variable Input Shapes
Different models handle input sizes differently. The pipeline config + preprocessor config handle all cases:
| Pattern | Example | How It's Handled |
| --- | --- | --- |
| Fixed-size | Phi3v (336×336 images) | image_processor.json resizes to target. ONNX model has static shapes. |
| Dynamic-size | Qwen2.5-VL (arbitrary resolution) | image_processor.json does dynamic resize + patch extraction. ONNX model has dynamic shapes. Pipeline executor passes tensors as-is. |
| Per-image variable | Pixtral (each image different resolution) | Preprocessor zero-pads to max(H)×max(W), provides image_sizes[N, 2]. Pipeline flow uses loop: per_image + dynamic_shape for per-image slicing. |
For the per-image variable resolution case:
{"run": "vision", "when": "init", "loop": "per_image",
 "loop_over": "pixel_values",
 "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
The executor slices pixel_values[i, :, :H_i, :W_i] where H_i and W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
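The slicing step can be shown with plain nested lists standing in for tensors. This Python sketch assumes an NCHW layout, so apply_to_dims [2, 3] refers to height and width; the function name is hypothetical and a real executor would slice device tensors without copying.

```python
def iter_unpadded_images(pixel_values, image_sizes):
    """Yield each image cropped back to its true size.

    pixel_values: nested lists shaped [N][C][H_max][W_max] (zero-padded).
    image_sizes:  nested lists shaped [N][2] holding (H_i, W_i) per image.
    """
    for image, (height, width) in zip(pixel_values, image_sizes):
        # Equivalent to pixel_values[i, :, :H_i, :W_i] in tensor notation.
        yield [[row[:width] for row in channel[:height]] for channel in image]
```

Each yielded crop would then be passed to one invocation of the vision session, which is what loop: per_image declares.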
Pipeline Config Reference
The pipeline config references preprocessing configs without embedding their details:
"preprocessing": {
  "image": {"config": "image_processor.json", "format": "ort-extensions"},
  "audio": {"config": "audio_processor.json", "format": "ort-extensions"}
}
The format field future-proofs beyond ort-extensions — today it's the only value, but it enables alternative preprocessing backends (e.g., a pure ONNX preprocessing graph) without schema changes.
This clean boundary means: preprocessing is fully described by its own config files (already supported by ort-extensions), and the pipeline executor receives model-ready tensors without knowing how they were produced. Note that half the pipeline-as-config vision is already shipped and working in production via ort-extensions — we're completing the other half for inference orchestration.
Every layer of the stack is config-driven. Zero model-specific C++ anywhere.
3.8 Advanced KV Cache Patterns (Shared Cache, Dual Head Dim)
Some models have non-uniform KV cache layouts. Gemma4 is the most complex example:
- Dual head_dim: Local (sliding_attention) layers use head_dim=256; global (full_attention) layers use global_head_dim=512
- KV sharing: The last num_kv_shared_layers layers reuse K/V from earlier layers and have NO independent cache entries
- Mixed window sizes: Sliding window layers have bounded cache; full attention layers have unbounded cache
These are all handled by auto-detection — no model-specific code needed.
ORT GenAI's DefaultKeyValueCache already supports:
- Sparse layer indices (kv_layer_indices_): Auto-discovered by scanning which past_key_values.{N}.key inputs exist in the ONNX session. If the model only has layers 0-25 (skipping 26-33 due to KV sharing), the cache allocates only 26 entries.
- Per-layer shapes (layer_shapes_): Each layer can have a different [batch, heads, seq_len, head_dim] shape, auto-discovered from the ONNX session output shapes.
- Per-layer sliding window: Configurable which layers use bounded cache vs unbounded.
The pipeline config for a Gemma4-style model:
"state": {
  "kv_cache": {
    "format": "auto",
    "sliding_window": {
      "window_size": 4096,
      "layers": [0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25]
    }
  }
}
format: auto handles the dual head_dim and sparse layers automatically. The sliding_window.layers array (already supported by ORT GenAI today) specifies which layers use bounded cache. The export tool (mobius) builds the ONNX model with cache I/O only for non-shared layers, so the runtime never needs to know about KV sharing — it's implicit in the graph structure.
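How sparse layer discovery and the sliding_window.layers list combine can be sketched as below. This Python illustration is not the DefaultKeyValueCache implementation; the function name and the returned mapping (layer index to window size, with None meaning unbounded) are assumptions chosen for clarity.

```python
import re

_PAST_KEY = re.compile(r"^past_key_values\.(\d+)\.key$")


def plan_kv_cache(input_names, sliding_window=None):
    """Return {layer_index: window_size or None} for layers that own a cache.

    Layers without a past_key_values.{N}.key input (for example KV-shared
    layers) simply never appear, so no entry is allocated for them.
    """
    layers = []
    for name in input_names:
        match = _PAST_KEY.match(name)
        if match:
            layers.append(int(match.group(1)))
    bounded = set(sliding_window["layers"]) if sliding_window else set()
    window = sliding_window["window_size"] if sliding_window else None
    # None marks an unbounded (full attention) cache for that layer.
    return {layer: (window if layer in bounded else None) for layer in sorted(layers)}
```

Because the plan is keyed by whatever indices the graph exposes, KV sharing needs no config field: a shared layer has no cache input, so it never enters the plan.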
3.9 Preprocessor↔Model Shape Alignment
Problem: Different models expect different image sizes and preprocessing. How do we ensure the preprocessor output matches the model's expectations without model-specific code?
Answer: Co-generation. The export tool (mobius) generates BOTH the ONNX model and its image_processor.json from the same HuggingFace config. They're guaranteed to be aligned because they share a single source of truth:
HuggingFace Config
├── → ONNX model (expects specific input shapes)
└── → image_processor.json (produces those exact shapes)
For additional safety, the pipeline config can include optional shape validation:
"preprocessing": {
  "image": {
    "config": "image_processor.json",
    "format": "ort-extensions",
    "expected_outputs": {
      "pixel_values": {"rank": 4, "dtype": "float32"},
      "image_grid_thw": {"rank": 2, "dtype": "int64"}
    }
  }
}
The expected_outputs field enables load-time validation: verify that the preprocessor config produces tensors compatible with the model's inputs before running inference. This catches mismatches at load time rather than inference time.
If someone provides the wrong preprocessor: ONNX Runtime throws a shape mismatch error at session.Run(), which is already a clear, debuggable failure. The optional validation catches it earlier.
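A load-time check against expected_outputs might look like the following Python sketch. It compares the declared rank and dtype with the model's input signature, passed in as plain data; the function name and the (shape, dtype) tuple representation are assumptions for the example.

```python
def check_expected_outputs(expected: dict, model_inputs: dict) -> list[str]:
    """Compare preprocessing.expected_outputs with a session's declared inputs.

    expected:     {tensor_name: {"rank": int, "dtype": str}}
    model_inputs: {tensor_name: (shape_tuple, dtype_str)} from the ONNX session.
    """
    errors = []
    for name, spec in expected.items():
        if name not in model_inputs:
            errors.append(f"{name}: model declares no such input")
            continue
        shape, dtype = model_inputs[name]
        if "rank" in spec and len(shape) != spec["rank"]:
            errors.append(f"{name}: expected rank {spec['rank']}, model has rank {len(shape)}")
        if "dtype" in spec and dtype != spec["dtype"]:
            errors.append(f"{name}: expected dtype {spec['dtype']}, model has {dtype}")
    return errors
```

Only rank and dtype are compared, so dynamic dimensions in the model's shape do not cause false mismatches.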
3.10 The metadata Section
model_type lives here — as documentation, not dispatch:
"metadata": {
"model_type": "qwen2_5_vl",
"architecture": "Qwen2_5VLForConditionalGeneration",
"source": "mobius",
"export_version": "0.5.0"
}
Used for: logging, telemetry, debugging, human readability. Ignored by: all dispatch and runtime logic.
4. Concrete Schema Examples
4.1 Decoder-Only LLM (Minimal, 10 lines)
{
  "version": 2,
  "pipeline": {
    "extends": "autoregressive-decoder",
    "sessions": {"decoder": {"file": "model.onnx"}}
  },
  "tokens": {"eos": [151645], "pad": 0},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2", "source": "mobius"}
}
4.2 Vision-Language Model (Qwen2.5-VL style, 33 lines)
{
  "version": 2,
  "pipeline": {
    "extends": "vision-language",
    "sessions": {
      "vision": {"file": "vision_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {
        "strategy": "mrope_3d",
        "grid_source": "vision.image_grid_thw"
      }
    },
    "preprocessing": {
      "image": {"config": "image_processor.json"}
    }
  },
  "tokens": {"eos": [151645], "pad": 0, "image_token": 151655},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2_5_vl", "source": "mobius"}
}
4.3 Encoder-Decoder (Whisper style)
{
  "version": 2,
  "pipeline": {
    "extends": "encoder-decoder",
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "encoder", "when": "init"},
      {"run": "decoder", "when": "step", "cross_attention_from": "encoder"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "cross_cache": {"source": "encoder", "frozen": true}
    }
  },
  "tokens": {"eos": [50257], "pad": 50257, "decoder_start": 50258},
  "generation": {"max_length": 448},
  "metadata": {"model_type": "whisper", "source": "mobius"}
}
4.4 Multimodal (Vision + Audio — Phi4mm style)
{
  "version": 2,
  "pipeline": {
    "sessions": {
      "vision": {"file": "vision_encoder/model.onnx"},
      "speech": {"file": "audio_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision", "when": "init", "loop": "batched"},
      {"run": "speech", "when": "init", "loop": "batched"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "speech.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {"strategy": "default"}
    }
  },
  "tokens": {"eos": [32007], "pad": 32000},
  "generation": {"max_length": 4096},
  "metadata": {"model_type": "phi4mm", "source": "mobius"}
}
4.5 Novel Architecture via Plugin (RNNT)
{
  "version": 2,
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    },
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "predictor": {"file": "predictor/model.onnx"},
      "joiner": {"file": "joiner/model.onnx"}
    }
  },
  "metadata": {"model_type": "nemotron_speech", "source": "mobius"}
}
5. Implementation Plan
Overview
This is a refactor, not a rewrite. The generation loop, search/sampling, tokenizer, KV cache internals, and all language bindings remain unchanged. We're replacing the model dispatch layer with a pipeline dispatch layer.
Net code change estimate: +800 lines added, -2000 lines deleted. The codebase gets smaller.
PR 1: Config Schema v2 Parser + Backward Compatibility (~300 LOC)
Files changed:
- src/config.h: add Pipeline struct with sessions, flow, dataflow, state, extends fields
- src/config.cpp: parse v2 schema; add v1→v2 translator that converts old-format configs to pipeline format
- New: src/pipeline_presets.h (built-in preset definitions: autoregressive-decoder, vision-language, encoder-decoder, speech-language)
Logic:
// In Config constructor:
if (json.contains("version") && json["version"] == 2) {
  ParsePipelineConfig(json);  // New v2 path
} else {
  ParseLegacyConfig(json);    // Existing v1 path
  TranslateV1ToV2();          // Convert to pipeline format internally
}
Backward compatibility guarantee: Every existing genai_config.json produces an identical internal Pipeline struct after translation. The v1→v2 translator maps:
model.type + model_type.h classification → appropriate preset
model.decoder.inputs/outputs → state.kv_cache patterns
model.vision/speech/embedding sections → sessions + flow + dataflow
Tests: All existing config tests pass unchanged. New tests for v2 parsing, preset resolution, extends override logic.
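The extends override logic mentioned above can be sketched as a preset lookup followed by a field-by-field merge in which user values win. This is a minimal sketch under stated assumptions: the flattened dotted-key `Fields` map, `BuiltinPresets()`, and `ResolvePreset()` are hypothetical names, and the preset field values are illustrative rather than the real preset contents.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical flattened view of a pipeline config: dotted key -> value.
using Fields = std::map<std::string, std::string>;

// Built-in presets. The names come from the proposal; the field values
// are illustrative placeholders, not the real preset definitions.
inline const std::map<std::string, Fields>& BuiltinPresets() {
  static const std::map<std::string, Fields> presets = {
      {"autoregressive-decoder",
       {{"state.kv_cache.format", "auto"},
        {"state.position_ids.strategy", "default"}}},
      {"vision-language",
       {{"state.kv_cache.format", "auto"},
        {"state.position_ids.strategy", "default"},
        {"flow.vision.loop", "batched"}}},
  };
  return presets;
}

// Start from the preset named by `extends` (if any), then overlay the
// user's fields so that explicit config always overrides the preset.
inline Fields ResolvePreset(const std::optional<std::string>& extends,
                            const Fields& user) {
  Fields resolved;
  if (extends) {
    auto it = BuiltinPresets().find(*extends);
    if (it == BuiltinPresets().end()) {
      throw std::invalid_argument("Unknown pipeline preset \"" + *extends + "\"");
    }
    resolved = it->second;
  }
  for (const auto& entry : user) {
    resolved[entry.first] = entry.second;  // user value replaces preset value
  }
  return resolved;
}
```

A config that extends vision-language and sets only state.position_ids.strategy would keep every other preset field unchanged, which is what lets most configs stay short.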
PR 2: PipelineExecutor Class (~350 LOC)
Files changed:
- New: src/models/pipeline_executor.h - PipelineExecutor class definition
- New: src/models/pipeline_executor.cpp - Implementation
src/models/model.cpp — Replace CreateModel() with CreatePipeline() using structural detection
The core class:
class PipelineExecutor : public State {
public:
PipelineExecutor(std::unique_ptr<Config> config, OrtEnv& env);
DeviceSpan<float> RunStep(int total_length, DeviceSpan<int32_t>& next_tokens,
DeviceSpan<int32_t> next_indices) override;
private:
// Loaded from config
std::map<std::string, std::unique_ptr<OrtSession>> sessions_;
std::vector<FlowStep> prompt_flow_; // Steps where when != "always"
std::vector<FlowStep> decode_flow_; // Steps where when == "always"
std::vector<DataflowWire> dataflow_;
// State (auto-detected or config-driven)
std::unique_ptr<KeyValueCache> kv_cache_;
std::unique_ptr<PositionStrategy> position_ids_;
DefaultInputIDs input_ids_{*this};
Logits logits_{*this};
// Data flow between sessions
std::map<std::string, std::unique_ptr<OrtValue>> intermediates_;
bool is_prompt_{true};
void WireInputs(const FlowStep& step);
void WireOutputs(const FlowStep& step);
void RunFlowStep(const FlowStep& step, bool graph_capture);
};
Structural detection in CreatePipeline() (replaces CreateModel()):
std::shared_ptr<Model> CreatePipeline(OrtEnv& env, std::unique_ptr<Config> config) {
auto& pipeline = config->pipeline;
// Plugin escape hatch
if (pipeline.plugin.has_value()) {
return LoadPluginPipeline(pipeline.plugin.value(), std::move(config), env);
}
// Structural detection — no string dispatch
bool has_encoder_with_cross_attn = HasCrossAttentionFlow(pipeline.flow);
bool has_multiple_sessions = pipeline.sessions.size() > 1;
if (has_encoder_with_cross_attn) {
return std::make_shared<EncoderDecoderPipeline>(std::move(config), env);
}
if (has_multiple_sessions) {
return std::make_shared<MultiSessionPipeline>(std::move(config), env);
}
return std::make_shared<DecoderPipeline>(std::move(config), env);
}
PR 3: Flow Interpreter + Dataflow Wiring (~200 LOC)
Files changed:
- New: src/models/flow_interpreter.h/.cpp - Interprets flow[] and dataflow[]
src/models/pipeline_executor.cpp — Uses flow interpreter
Key logic:
void PipelineExecutor::RunFlowStep(const FlowStep& step, bool graph_capture) {
auto& session = sessions_.at(step.session_name);  // at(): throws on an undeclared session instead of inserting a null entry
if (step.loop == LoopMode::PerImage) {
// Per-image loop: iterate over input tensor's batch dimension
auto input_slices = SliceTensorDim0(GetInput(step, step.loop_over));
std::vector<OrtValue> output_parts;
for (auto& slice : input_slices) {
BindSlicedInput(step, slice);
session->Run();
output_parts.push_back(CaptureOutput(step));
}
intermediates_[step.output_key] = ConcatenateDim0(output_parts);
} else {
// Standard batched execution
WireInputs(step);
session->Run(graph_capture);
WireOutputs(step);
}
}
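The per-image branch relies on two helpers, a split along dimension 0 and a concatenation along dimension 0. A minimal sketch of both follows, assuming a plain `Tensor` struct as a stand-in for OrtValue and a callback in place of session->Run(); `SliceDim0`, `ConcatDim0`, and `RunPerItem` are hypothetical names, not existing ORT GenAI functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-in for an OrtValue: a dense row-major float tensor.
struct Tensor {
  std::vector<int64_t> shape;
  std::vector<float> data;
};

// Split a tensor into shape[0] slices, each with leading dimension 1.
std::vector<Tensor> SliceDim0(const Tensor& t) {
  if (t.shape.empty() || t.shape[0] <= 0) {
    throw std::invalid_argument("tensor needs a non-empty leading dimension");
  }
  const auto count = static_cast<std::ptrdiff_t>(t.shape[0]);
  const auto stride = static_cast<std::ptrdiff_t>(t.data.size()) / count;
  std::vector<Tensor> slices;
  for (std::ptrdiff_t i = 0; i < count; ++i) {
    Tensor slice;
    slice.shape = t.shape;
    slice.shape[0] = 1;
    slice.data.assign(t.data.begin() + i * stride,
                      t.data.begin() + (i + 1) * stride);
    slices.push_back(std::move(slice));
  }
  return slices;
}

// Concatenate along dimension 0; all trailing dimensions must match.
Tensor ConcatDim0(const std::vector<Tensor>& parts) {
  if (parts.empty() || parts.front().shape.empty()) {
    throw std::invalid_argument("nothing to concatenate");
  }
  Tensor out;
  out.shape = parts.front().shape;
  out.shape[0] = 0;
  for (const Tensor& part : parts) {
    if (part.shape.size() != out.shape.size() ||
        !std::equal(part.shape.begin() + 1, part.shape.end(),
                    out.shape.begin() + 1)) {
      throw std::invalid_argument("trailing dimensions differ");
    }
    out.shape[0] += part.shape[0];
    out.data.insert(out.data.end(), part.data.begin(), part.data.end());
  }
  return out;
}

// Per-item loop: run the session stand-in once per slice, then stitch
// the per-slice outputs back into one tensor.
Tensor RunPerItem(const Tensor& input,
                  const std::function<Tensor(const Tensor&)>& run_session) {
  std::vector<Tensor> outputs;
  for (const Tensor& slice : SliceDim0(input)) {
    outputs.push_back(run_session(slice));
  }
  return ConcatDim0(outputs);
}
```

Because concatenation only requires matching trailing dimensions, each per-image run may return a different number of rows, which is the case for variable-resolution vision encoders.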
Dataflow wiring:
void PipelineExecutor::WireInputs(const FlowStep& step) {
for (auto& wire : dataflow_) {
if (wire.to_session == step.session_name) {
// Wire output from previous session to input of this session
auto& source = intermediates_.at(wire.from_key);  // at(): fail loudly if the producing session has not run yet
BindInput(step, wire.to_input_name, source);
}
}
}
PR 4: Plugin API (~100 LOC)
Files changed:
- New: src/models/plugin_api.h - Stable C ABI for pipeline plugins
- New: src/models/plugin_loader.cpp - Dynamic library loading (dlopen/LoadLibrary)
Interface:
// plugin_api.h — stable ABI, ships with ORT GenAI headers
extern "C" {
typedef std::shared_ptr<Model> (*PipelineFactoryFn)(
OrtEnv& env, std::unique_ptr<Config> config);
}
// In plugin .so/.dll:
extern "C" {
std::shared_ptr<Model> CreateRnntPipeline(
OrtEnv& env, std::unique_ptr<Config> config) {
return std::make_shared<RnntPipeline>(std::move(config), env);
}
}
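One caveat on the interface above: std::shared_ptr, std::unique_ptr, and C++ references have no layout guaranteed across compilers or standard library versions, so a factory that passes them across the library boundary is only stable when host and plugin are built with the same toolchain. A minimal sketch of a C-compatible variant follows, using an opaque handle and a table of function pointers. Every name here (`GenaiPipeline`, `GenaiPluginApi`, `GetGenaiPluginApi`, `GENAI_PLUGIN_ABI_VERSION`) is hypothetical, and in a real build the host would obtain the entry point through dlopen/dlsym or LoadLibrary/GetProcAddress rather than a direct call.

```cpp
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical ABI version the host checks before using a plugin.
#define GENAI_PLUGIN_ABI_VERSION 1u

extern "C" {
// Opaque handle: the host never sees the plugin's C++ types.
typedef struct GenaiPipeline GenaiPipeline;

// Function table exposed by the plugin; only C types cross the boundary.
typedef struct GenaiPluginApi {
  uint32_t abi_version;
  GenaiPipeline* (*create)(const char* config_json);
  int32_t (*run_step)(GenaiPipeline* pipeline);
  void (*destroy)(GenaiPipeline* pipeline);
} GenaiPluginApi;
}

// ---- Plugin side (would live in the .so/.dll) ----
struct GenaiPipeline {
  std::string config;
  int32_t steps_run;
};

extern "C" {
static GenaiPipeline* ExampleCreate(const char* config_json) {
  return new GenaiPipeline{config_json != nullptr ? config_json : "", 0};
}
static int32_t ExampleRunStep(GenaiPipeline* pipeline) {
  return ++pipeline->steps_run;  // placeholder for real per-step work
}
static void ExampleDestroy(GenaiPipeline* pipeline) { delete pipeline; }

const GenaiPluginApi* GetGenaiPluginApi() {
  static const GenaiPluginApi api = {GENAI_PLUGIN_ABI_VERSION, ExampleCreate,
                                     ExampleRunStep, ExampleDestroy};
  return &api;
}
}

// ---- Host side ----
// The deleter routes destruction back into the plugin, so memory is
// always freed by the module that allocated it.
struct PipelineDeleter {
  const GenaiPluginApi* api;
  void operator()(GenaiPipeline* pipeline) const { api->destroy(pipeline); }
};
using PipelineHandle = std::unique_ptr<GenaiPipeline, PipelineDeleter>;

// Returns an empty handle when the plugin's ABI version does not match.
PipelineHandle LoadPipeline(const GenaiPluginApi* api, const char* config_json) {
  if (api == nullptr || api->abi_version != GENAI_PLUGIN_ABI_VERSION) {
    return PipelineHandle(nullptr, PipelineDeleter{api});
  }
  return PipelineHandle(api->create(config_json), PipelineDeleter{api});
}
```

The host can still wrap the handle in whatever C++ class it likes; the point of the sketch is that smart pointers stay on one side of the boundary.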
PR 5: Delete Model-Type Dispatch (~-1500 LOC)
Files deleted:
src/models/model_type.h — The entire file
Files simplified:
src/models/model.cpp — Remove CreateModel() if-else chain, replace with CreatePipeline()
src/models/position_inputs.cpp — Remove IsQwenVLFamily() check; position strategy comes from config
src/models/multi_modal.cpp — Remove CreateVisionState() model_type dispatch; vision loop mode comes from config flow
Files eventually deprecated (kept for v1 compat, removed in future release):
src/models/gpt.h/cpp — Absorbed into generic pipeline with kv_cache.format: combined
- Per-model C++ preprocessors (phi_image_processor, gemma_image_processor, etc.) - Replaced by ort-extensions image_processor.json
Implementation Summary
| PR | Description | LOC Added | LOC Deleted | Net |
|---|---|---|---|---|
| PR 1 | Config v2 parser + v1 translator | +300 | -0 | +300 |
| PR 2 | PipelineExecutor classes | +350 | -0 | +350 |
| PR 3 | Flow interpreter + dataflow | +200 | -0 | +200 |
| PR 4 | Plugin API | +100 | -0 | +100 |
| PR 5 | Delete model_type dispatch | +0 | -1500 | -1500 |
| Total | | +950 | -1500 | -550 |
The codebase shrinks by ~550 lines while gaining full model-agnostic extensibility.
6. Compatibility Matrix
| Model Scenario | Today | After PR 1-2 | After PR 1-5 |
|---|---|---|---|
| Existing Llama/Phi/Gemma (v1 config) | ✅ Works | ✅ Works (v1→v2 translator) | ✅ Works (translator) |
| New decoder-only LLM (unknown type) | ❌ Rejected by whitelist | ✅ 7-line v2 config | ✅ 7-line v2 config |
| Custom fine-tune with custom model_type | ❌ Rejected by whitelist | ✅ extends preset | ✅ extends preset |
| New VLM family | ❌ Needs new C++ class + processor | ✅ ~25-line v2 config | ✅ ~25-line v2 config |
| Qwen2.5-VL (3D mRoPE, per-image vision) | ✅ Hardcoded | ✅ v2 config with position_strategy + loop | ✅ Config-driven |
| Pixtral/Mistral3 (variable resolution) | ✅ Hardcoded | ✅ v2 config with per_image loop + dynamic_shape | ✅ Config-driven |
| Whisper (encoder-decoder) | ✅ Hardcoded | ✅ v2 config with encoder-decoder preset | ✅ Config-driven |
| GPT-2 (combined KV cache) | ✅ Hardcoded (separate class) | ✅ v2 config with kv_cache.format: combined | ✅ Config-driven |
| Mamba/SSM (recurrent, no KV) | ❌ Not supported | ⚠️ Needs state.type: recurrent | ✅ Config-driven |
| RNNT (non-autoregressive) | ✅ Hardcoded | ✅ Plugin .so | ✅ Plugin .so |
| Novel architecture (unknown future) | ❌ Major C++ work | ✅ Plugin .so, zero runtime changes | ✅ Plugin .so |
| Phi4mm (vision + audio) | ✅ Hardcoded | ✅ v2 config with 4 sessions | ✅ Config-driven |
7. Technical Feasibility
7.1 CUDA Graph Capture
Concern: CUDA graphs require identical session topology and buffer shapes between captures and replays. Does a generic pipeline executor break this?
Answer: No. The executor pre-computes a "decode flow" (steps where when: "step") at init time. During token generation, only the decode flow runs — this is a fixed, repeatable sequence identical to what the current DecoderOnly_State::Run() does. CUDA graph capture applies to this fixed sequence:
bool graph_capture = !is_prompt_ && params_->use_graph_capture
&& input_ids_.GetShape()[1] == 1;
// Only the decode_flow_ steps run — topology is fixed
for (auto& step : decode_flow_) {
RunFlowStep(step, graph_capture);
}
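The "pre-computes a decode flow at init time" step can be sketched as a one-time partition of flow[] by phase. This is a minimal sketch, assuming the canonical init/step/final names and assuming any legacy names have already been normalized by the config parser; `FlowStep` is reduced to two fields, and `PhasedFlow` and `PartitionFlow` are hypothetical names.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Reduced flow step: just the session to run and its phase.
struct FlowStep {
  std::string session_name;
  std::string when;  // "init", "step", or "final"
};

// Flow steps grouped by phase, in their original declaration order.
struct PhasedFlow {
  std::vector<FlowStep> init_steps;    // run once, before generation
  std::vector<FlowStep> decode_steps;  // run on every generation step
  std::vector<FlowStep> final_steps;   // run once, after the loop ends
};

// Partition once at load time so the per-token path is a fixed sequence.
PhasedFlow PartitionFlow(const std::vector<FlowStep>& flow) {
  PhasedFlow phased;
  for (const FlowStep& step : flow) {
    if (step.when == "init") {
      phased.init_steps.push_back(step);
    } else if (step.when == "step") {
      phased.decode_steps.push_back(step);
    } else if (step.when == "final") {
      phased.final_steps.push_back(step);
    } else {
      throw std::invalid_argument("Unknown flow phase \"" + step.when +
                                  "\" for session \"" + step.session_name + "\"");
    }
  }
  return phased;
}
```

Since `decode_steps` never changes after load, the sequence replayed during token generation has a fixed topology, which is the property graph capture depends on.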
7.2 Memory Pre-allocation
Concern: The current code pre-allocates KV cache buffers based on model dimensions. Can a generic executor do this without model-specific knowledge?
Answer: Yes. KV cache dimensions come from config (decoder.num_hidden_layers, decoder.num_key_value_heads, decoder.head_size) or are discoverable from ONNX session output shapes at init time. The current DefaultKeyValueCache already auto-discovers layer count by pattern-matching present tensor names in the session. A generic executor uses the same mechanism — zero model-specific knowledge needed.
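As an illustration of discovery by tensor name, the sketch below counts layers by matching session output names against a pattern. It is not the DefaultKeyValueCache implementation: the `present.<N>.key` naming convention and the `CountKvLayers` helper are assumptions made for the example.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// Count distinct layer indices among output names such as "present.0.key".
// Assumption: key tensors follow the "present.<N>.key" naming convention.
int CountKvLayers(const std::vector<std::string>& output_names) {
  static const std::regex pattern(R"(present\.(\d+)\.key)");
  std::set<int> layers;
  for (const std::string& name : output_names) {
    std::smatch match;
    if (std::regex_match(name, match, pattern)) {
      layers.insert(std::stoi(match[1].str()));
    }
  }
  return static_cast<int>(layers.size());
}
```

With the layer count recovered this way, and head count and head size read from config or from the output shapes, buffers can be sized before the first step without any per-model code.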
7.3 Performance Overhead
Concern: Does the generic pipeline add overhead vs hand-optimized model classes?
Answer: Negligible. The overhead is:
- One for loop over flow_ steps per generation step (typically 1 step for LLMs)
- One map lookup per dataflow wire per step
- These are nanosecond-scale operations vs millisecond-scale ONNX session runs
The hot path — session.Run() + KV cache management — is identical to the current code. The generation loop, search/sampling, and tokenizer are completely unchanged.
7.4 Config Validation
Invalid configs must produce clear errors at load time, not runtime crashes:
| Error | Message |
|---|---|
| Session referenced in flow but not declared | Flow step references session "vision" but no such session is declared in pipeline.sessions |
| Dataflow references non-existent tensor | Dataflow wire references output "image_features" but session "vision" has no such output (available outputs: hidden_states, pooler_output) |
| Unknown position strategy | Unknown position_ids strategy "my_custom". Valid options: auto, default, mrope_3d, windowed |
| Cycle in dataflow | Circular dependency detected in dataflow: vision → embedding → decoder → vision |
| Unknown preset | Unknown pipeline preset "my-preset". Built-in presets: autoregressive-decoder, vision-language, encoder-decoder, speech-language |
| Missing required field | Pipeline config requires at least one session. Add "sessions": {"decoder": {"file": "model.onnx"}} |
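Two of the load-time checks in the table can be sketched as follows: the undeclared-session check and the dataflow cycle check, the latter as a depth-first search over session-to-session wires. The `Wire` struct and both function names are hypothetical. Note that this sketch treats a wire from a session to itself as a cycle, so the self-referential wire proposed for v2.1 would need an explicit exemption.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Reduced dataflow wire: only the producing and consuming sessions.
struct Wire {
  std::string from_session;
  std::string to_session;
};

// One error message per flow step that names an undeclared session.
std::vector<std::string> CheckFlowSessions(
    const std::set<std::string>& declared,
    const std::vector<std::string>& flow_runs) {
  std::vector<std::string> errors;
  for (const std::string& name : flow_runs) {
    if (declared.count(name) == 0) {
      errors.push_back("Flow step references session \"" + name +
                       "\" but no such session is declared in pipeline.sessions");
    }
  }
  return errors;
}

// Depth-first search with three node states; reaching a node that is
// still in progress means the wires form a cycle.
bool HasDataflowCycle(const std::vector<Wire>& wires) {
  std::map<std::string, std::vector<std::string>> edges;
  for (const Wire& wire : wires) {
    edges[wire.from_session].push_back(wire.to_session);
    edges[wire.to_session];  // make sure sink-only sessions are nodes too
  }
  enum class Mark { Unvisited, InProgress, Done };
  std::map<std::string, Mark> marks;
  std::function<bool(const std::string&)> visit =
      [&](const std::string& node) -> bool {
    Mark& mark = marks[node];  // value-initialized to Unvisited
    if (mark == Mark::InProgress) return true;
    if (mark == Mark::Done) return false;
    mark = Mark::InProgress;
    for (const std::string& next : edges.at(node)) {
      if (visit(next)) return true;
    }
    mark = Mark::Done;
    return false;
  };
  for (const auto& entry : edges) {
    if (visit(entry.first)) return true;
  }
  return false;
}
```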
7.5 The per_image Loop for Vision
QwenVisionState and PixtralVisionState loop over images individually with different slicing strategies. The flow interpreter handles this generically:
{"run": "vision", "when": "prompt", "loop": "per_image",
"loop_over": "pixel_values"}
For Pixtral's variable-resolution cropping (per-image height/width from image_sizes):
{"run": "vision", "when": "prompt", "loop": "per_image",
"loop_over": "pixel_values",
"dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
The executor slices pixel_values[i, :, :H_i, :W_i] where H_i, W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
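The crop described above can be sketched on a dense row-major buffer. This is a minimal sketch assuming an NCHW layout for pixel_values; the `Image4D` struct and `CropImage` function are hypothetical stand-ins for the real tensor type and executor code.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Dense row-major NCHW tensor (stand-in for the pixel_values OrtValue).
struct Image4D {
  int64_t n, c, h, w;
  std::vector<float> data;
};

// Copy pixel_values[i, :, :crop_h, :crop_w] into a dense buffer of
// shape [1, C, crop_h, crop_w].
std::vector<float> CropImage(const Image4D& t, int64_t i,
                             int64_t crop_h, int64_t crop_w) {
  if (i < 0 || i >= t.n || crop_h < 0 || crop_h > t.h ||
      crop_w < 0 || crop_w > t.w) {
    throw std::out_of_range("crop does not fit inside the source tensor");
  }
  std::vector<float> out;
  out.reserve(static_cast<std::size_t>(t.c * crop_h * crop_w));
  for (int64_t ch = 0; ch < t.c; ++ch) {
    for (int64_t y = 0; y < crop_h; ++y) {
      // Offset of element [i, ch, y, 0] in the row-major source buffer.
      const int64_t row = ((i * t.c + ch) * t.h + y) * t.w;
      out.insert(out.end(), t.data.begin() + row,
                 t.data.begin() + row + crop_w);
    }
  }
  return out;
}
```

In the config above, `crop_h` and `crop_w` would come from image_sizes[i], and apply_to_dims: [2, 3] names the two dimensions being cropped.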
8. The Pitch to the ORT GenAI Team
Framing
"Your runtime is already 90% model-agnostic. We're proposing you formalize what's already true — and eliminate the last 10% of model-specific code."
Today, 21 of 32 model types share identical C++ code. The generation loop doesn't know what model it's running. The KV cache auto-discovers its own layout. The only thing preventing any new model from working is a string whitelist that adds no value.
We're not asking you to change your architecture — we're asking you to recognize that your architecture has already evolved past the model_type dispatch layer. The pipeline config makes the implicit explicit.
Value Proposition
| For | Today | With Pipeline-as-Config |
|---|---|---|
| ORT GenAI team | Bottlenecked on model support PRs | Never writes model-specific code again |
| Model builders (mobius/Olive) | Must coordinate with runtime team for every new model | Ship independently: generate config, done |
| ML engineers | Wait for runtime releases | New models work immediately |
| The ecosystem | ORT GenAI lags behind HuggingFace model zoo | ORT GenAI supports any ONNX model by design |
The Key Selling Point
This REDUCES ORT GenAI's maintenance burden. The team goes from "we must ship a PR for every new HuggingFace model" to "we maintain a stable pipeline runtime." New model support becomes the exporter's responsibility (mobius/Olive), not the runtime's.
"You build the engine. We build the cars."
9. Risk Analysis
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Performance regression for existing models | Low | High | Benchmark all 32 model types before/after. The hot path is identical. |
| Config complexity deters users | Medium | Medium | Presets with extends reduce 90% of configs to 7 lines. JSON Schema for IDE support. |
| Edge cases in flow interpreter | Medium | Medium | Comprehensive test matrix covering all 32 model types. Validation at load time. |
| ORT GenAI team rejects the proposal | Medium | High | Start with the blacklist inversion (5 lines) to build trust. Present the full vision as an RFC. |
| Plugin ABI stability across versions | Low | Medium | Version the plugin API. Keep it minimal (1 factory function). |
| v1→v2 translator has subtle bugs | Medium | Medium | The translator is tested against every existing genai_config.json in the test suite. |
Immediate Bridge (While Building the Future)
While the pipeline-as-config architecture is implemented, mobius can unblock users TODAY:
- For unregistered LLM model_types: emit "type": "decoder" + "original_model_type": "<real_type>" in genai_config.json ("decoder" is in the current whitelist and routes to DecoderOnly_Model)
- When pipeline-as-config ships, switch to v2 format with the real model_type in metadata
10. The mobius Role: Pipeline Compiler
mobius already knows everything needed to generate complete pipeline configs:
| What mobius knows | How it maps to pipeline config |
|---|---|
| Model architecture (decoder-only, VLM, enc-dec) | Which preset to extend |
| Number and type of ONNX sessions | pipeline.sessions |
| Vision invocation pattern (batched vs per-image) | flow[].loop |
| Position embedding strategy (1D, 3D mRoPE) | state.position_ids.strategy |
| KV cache format (separate, combined) | state.kv_cache.format |
| All I/O tensor names | state.kv_cache.*_pattern, dataflow[] |
| Token IDs, generation params | tokens, generation |
Implementation in mobius: Extend the existing _write_genai_config() function to emit v2 format alongside (or instead of) v1. The pipeline config is generated from the same model metadata that already drives ONNX graph construction.
11. Research Direction: Self-Contained Generation Graphs
As a long-term research direction (not part of the core proposal), we explored embedding generation logic inside the ONNX graph itself. Microsoft's existing com.microsoft.BeamSearch and com.microsoft.GreedySearch contrib ops prove this is technically possible.
Viable for: Offline batch inference, edge deployment, WebAssembly
Not viable for: Interactive serving (streaming, continuous batching, speculative decoding — all require host-side coordination)
Potential approach: Small set of generation-specific custom ops (GenerationKVCacheUpdate, SampleTopP) that the runtime provides as efficient primitives, while the ONNX graph carries the generation logic. Worth exploring for simple deployment scenarios but not the primary architecture.
12. Beyond Autoregressive: TTS, Diffusion, and Multimodal Audio
12.1 The Question
Pipeline-as-Config is designed around autoregressive token generation. But the model ecosystem includes fundamentally different generation patterns:
- TTS (text-to-speech): Text → mel spectrogram → audio waveform (multi-stage, often non-autoregressive)
- Diffusion (image generation): Iterative denoising loop with fixed step count, noise scheduling, no token sampling
- Audio+text multimodal: Mixed modality inputs (audio + text → text), structurally similar to VLMs
Can flow[]/dataflow[]/state{} express these? Or is the schema inherently autoregressive?
12.2 The Honest Assessment
| Model Type | Schema Expressive? | Runtime Can Execute? | What's Missing |
|---|---|---|---|
| Audio+text multimodal (Phi4mm, speech-language) | ✅ Yes | ✅ Yes | Nothing: structurally identical to VLMs |
| Encoder-decoder (Whisper, Marian) | ✅ Yes | ✅ Yes | Nothing: already supported |
| Autoregressive TTS (Bark, VALL-E) | ✅ Yes | ✅ Yes | Add when: "final" for vocoder post-processing |
| Non-autoregressive TTS (VITS, FastSpeech2) | ✅ Yes | ❌ No | Sequential executor + non-token output |
| Diffusion (SD, Flux, DiT) | ⚠️ Topology yes | ❌ No | Iterative executor, scheduler state, latent init, non-token output |
Key insight: the flow[]/dataflow[] schema is MORE GENERAL than the current runtime. It can already describe these topologies. The bottleneck is the C++ Generator, which assumes autoregressive token generation.
12.3 Concrete Config Examples
Audio+text multimodal (works today with pipeline-as-config):
{
"pipeline": {
"extends": "speech-language",
"sessions": {
"audio_encoder": {"file": "audio_encoder.onnx"},
"embedding": {"file": "embedding.onnx"},
"decoder": {"file": "decoder.onnx"}
},
"flow": [
{"run": "audio_encoder", "when": "init"},
{"run": "embedding", "when": "init"},
{"run": "decoder", "when": "step"}
],
"dataflow": [
{"from": "audio_encoder.audio_features", "to": "embedding.audio_features"},
{"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
]
},
"generation": {"loop": "autoregressive", "max_length": 4096}
}
Whisper and Nemotron Speech are already this pattern. No schema changes needed.
Autoregressive TTS (Bark — works with minor extension):
{
"pipeline": {
"sessions": {
"decoder": {"file": "decoder.onnx"},
"vocoder": {"file": "vocoder.onnx"}
},
"flow": [
{"run": "decoder", "when": "step"},
{"run": "vocoder", "when": "final"}
],
"dataflow": [
{"from": "decoder.audio_tokens", "to": "vocoder.input_ids"}
]
},
"generation": {"loop": "autoregressive", "max_length": 2048}
}
New: when: "final" — runs after the generation loop completes (post-processing). Trivial to add.
Non-autoregressive TTS (VITS — needs sequential executor):
{
"pipeline": {
"sessions": {
"text_encoder": {"file": "text_encoder.onnx"},
"duration_predictor": {"file": "duration.onnx"},
"mel_decoder": {"file": "mel_decoder.onnx"},
"vocoder": {"file": "vocoder.onnx"}
},
"flow": [
{"run": "text_encoder", "when": "init"},
{"run": "duration_predictor", "when": "init"},
{"run": "mel_decoder", "when": "init"},
{"run": "vocoder", "when": "init"}
],
"output": {"session": "vocoder", "name": "audio_waveform"}
},
"generation": {"loop": "single_pass"}
}
New: "loop": "single_pass" — no generation loop, run all flow steps once, return output tensor. Requires a SequentialExecutor (~100 LOC).
Complex TTS with inner loops (Qwen3 TTS — needs flow step extensions):
Qwen3 TTS is a 4-model pipeline: embedding → talker → code_predictor → speaker_encoder. The talker IS autoregressive (KV cache, logits), but within each generation step, the code_predictor runs 14 times in an inner loop with a step counter:
{
"pipeline": {
"sessions": {
"embedding": {"file": "embedding.onnx"},
"talker": {"file": "talker.onnx"},
"code_predictor": {"file": "code_predictor.onnx"},
"speaker_encoder": {"file": "speaker_encoder.onnx", "optional": true}
},
"flow": [
{"run": "speaker_encoder", "when": "init", "optional": true},
{"run": "embedding", "when": "init"},
{"run": "talker", "when": "step"},
{"run": "code_predictor", "when": "step", "repeat": 14, "counter": "step_index"}
],
"dataflow": [
{"from": "embedding.text_embeds", "to": "talker.inputs_embeds"},
{"from": "talker.last_hidden_state", "to": "code_predictor.inputs_embeds"},
{"from": "code_predictor.codec_embeddings", "to": "code_predictor.inputs_embeds"}
]
},
"generation": {"loop": "autoregressive", "max_length": 2048}
}
New concepts: repeat: N on a flow step (inner loop within each generation step), counter field (provides a step index input), and self-referential dataflow (code_predictor output feeds back into itself). These are v2.1 extensions. Until then, the plugin escape hatch covers complex TTS.
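The repeat/counter semantics can be sketched as an inner loop that passes the iteration index to the session and feeds each output back as the next input. This is a sketch under one stated assumption about wire precedence: the first iteration reads the upstream wire (talker.last_hidden_state) and later iterations read the step's own previous output (code_predictor.codec_embeddings). `RepeatStep` and `RunRepeated` are hypothetical names.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// A flow step with "repeat": N and a counter input.
struct RepeatStep {
  int repeat = 1;
  // Session stand-in: (inputs_embeds, step_index) -> output embeddings.
  std::function<std::vector<float>(const std::vector<float>&, int64_t)> run;
};

// Run the step `repeat` times within one generation step and collect
// every iteration's output.
// Assumption: iteration 0 consumes the upstream wire; later iterations
// consume the step's own previous output (the self-referential wire).
std::vector<std::vector<float>> RunRepeated(const RepeatStep& step,
                                            const std::vector<float>& upstream) {
  std::vector<std::vector<float>> outputs;
  std::vector<float> input = upstream;
  for (int64_t index = 0; index < step.repeat; ++index) {
    outputs.push_back(step.run(input, index));
    input = outputs.back();
  }
  return outputs;
}
```

For the Qwen3 TTS example, `repeat` would be 14 and the collected outputs would correspond to the codec codes produced for one talker step.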
Diffusion (Stable Diffusion, Flux) — in scope for schema design, out of scope for v2.0 implementation:
Diffusion has a fundamentally different generation loop: fixed N-step denoising with noise scheduling, classifier-free guidance (conditional UNet double-call), and non-neural scheduler math between iterations. The flow[]/dataflow[] schema can express the session topology using the same init/step/final phases:
{
"pipeline": {
"sessions": {
"text_encoder": {"file": "text_encoder.onnx"},
"unet": {"file": "unet.onnx"},
"vae_decoder": {"file": "vae_decoder.onnx"}
},
"flow": [
{"run": "text_encoder", "when": "init"},
{"run": "unet", "when": "step"},
{"run": "vae_decoder", "when": "final"}
],
"dataflow": [
{"from": "text_encoder.text_embeddings", "to": "unet.encoder_hidden_states"},
{"from": "unet.noise_pred", "to": "vae_decoder.latent_sample"}
]
},
"generation": {
"loop": "denoising",
"num_steps": 50,
"scheduler": "euler_discrete",
"guidance_scale": 7.5
}
}
Note how init/step/final maps naturally: text_encoder = init, unet = step (each denoising iteration), vae_decoder = final (after loop). The same three phases work for autoregressive AND denoising — no schema fork needed.
The denoising loop itself requires host-side C++ logic (scheduler.step, CFG interpolation); expressing that logic declaratively would require a Turing-complete config language. The right approach is a dedicated DenoisingExecutor in C++ that implements the denoising loop, analogous to how the autoregressive loop is C++ today. The config parameterizes it; the C++ implements it. The loop skeleton is ~300 LOC, but the full scheduler zoo (Euler, DDPM, DDIM, DPM-Solver, LCM, Flow Matching) plus classifier-free guidance and ControlNet support is ~1000-1500 LOC of implementation complexity. The pipeline executor and schema don't change: scheduler: "euler_discrete" is just a string that selects a C++ implementation.
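To make the host-side portion concrete, here is a minimal sketch of a denoising loop skeleton with classifier-free guidance. It is a simplification under stated assumptions: the model is an epsilon-prediction network, the update is a plain Euler step over a decreasing sigma schedule, and the input scaling that a full Euler scheduler applies is omitted. `NoiseModel`, `ApplyGuidance`, `EulerStep`, and `Denoise` are hypothetical names, and the callback stands in for the unet session run.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Latents = std::vector<float>;

// Stand-in for the unet session: (latents, sigma, conditional?) -> noise.
using NoiseModel = std::function<Latents(const Latents&, float, bool)>;

// Classifier-free guidance: uncond + scale * (cond - uncond).
Latents ApplyGuidance(const Latents& uncond, const Latents& cond, float scale) {
  Latents out(uncond.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = uncond[i] + scale * (cond[i] - uncond[i]);
  }
  return out;
}

// Simplified Euler update for epsilon-prediction:
// x <- x + (sigma_next - sigma) * eps.
void EulerStep(Latents& x, const Latents& eps, float sigma, float sigma_next) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] += (sigma_next - sigma) * eps[i];
  }
}

// Fixed-length loop: one step per adjacent pair in the sigma schedule.
// Each iteration calls the model twice (unconditional and conditional).
Latents Denoise(Latents x, const std::vector<float>& sigmas,
                float guidance_scale, const NoiseModel& unet) {
  for (std::size_t i = 0; i + 1 < sigmas.size(); ++i) {
    const Latents uncond = unet(x, sigmas[i], false);
    const Latents cond = unet(x, sigmas[i], true);
    EulerStep(x, ApplyGuidance(uncond, cond, guidance_scale),
              sigmas[i], sigmas[i + 1]);
  }
  return x;  // denoised latents, the input to the final-phase decoder
}
```

In this sketch the value handed to the final phase is the loop's latent state after the last scheduler update rather than a raw model output, so the latents behave as executor state in the same way the KV cache does for the autoregressive loop.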
Implementation is deferred — diffusion users have different tooling (ComfyUI, diffusers), different serving patterns (no streaming, batch-oriented), and ORT already has separate diffusion pipeline support. But the schema design explicitly accommodates diffusion so no breaking changes are needed when the executor is added.
12.4 The Scoping Decision
Pipeline-as-Config v2.0 implements autoregressive generation. The schema designs for all generation paradigms. This is a deliberate split: ship what matters now, design so future work is additive.
| Pattern | v2.0 Implementation | v2.1+ Implementation | Schema Support |
|---|---|---|---|
| LLM (decoder-only) | ✅ Ship | n/a | ✅ Designed |
| VLM (vision+language) | ✅ Ship | n/a | ✅ Designed |
| Encoder-decoder (Whisper, Marian) | ✅ Ship | n/a | ✅ Designed |
| Speech-language (audio+text→text) | ✅ Ship | n/a | ✅ Designed |
| Simple TTS (AR + vocoder) | ⚠️ Plugin | when: "final" | ✅ Designed |
| Complex TTS (Qwen3-style inner loops) | ⚠️ Plugin | repeat + counter | ✅ Designed |
| Non-autoregressive TTS (VITS) | ⚠️ Plugin | loop: "single_pass" | ✅ Designed |
| Diffusion (SD, Flux, DiT) | ⚠️ Plugin | loop: "denoising" | ✅ Designed |
| Exotic (RNNT, custom) | ⚠️ Plugin | Plugin | ✅ Plugin escape hatch |
The pitch: "The v2 schema supports any generation paradigm — autoregressive, denoising, single-pass. v2.0 ships the autoregressive executor. Adding a new paradigm = one C++ executor class. Adding a new model within any paradigm = zero code."
This prevents the "only works for LLMs" objection (the schema designs for everything) while keeping v2.0 scope tight (ship quality over breadth).
12.5 The Architectural Pattern: Pluggable Loop Strategies
The generation loop is a layer ABOVE the pipeline executor:
┌─────────────────────────────┐
│ Loop Strategy │ ← autoregressive | denoising | single_pass
│ (generation.loop) │
├─────────────────────────────┤
│ Pipeline Executor │ ← flow[], dataflow[], state{} — UNCHANGED
│ (FlowInterpreter) │
├─────────────────────────────┤
│ ONNX Sessions │ ← The actual computation — UNCHANGED
└─────────────────────────────┘
Each loop strategy is independent:
| Loop Strategy | When It Runs | Termination | State Between Steps | LOC Estimate |
|---|---|---|---|---|
| autoregressive | Token-by-token | EOS or max_length | KV cache, positions | Existing (~800 LOC) |
| single_pass | All steps once | After one pass | None | ~100 LOC |
| denoising | Fixed N iterations | After N steps | Latents, scheduler | ~300 LOC loop + ~1000 LOC schedulers |
Adding a new loop strategy never touches the pipeline executor or existing loop strategies. Pure addition.
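The layering can be sketched as a small strategy interface selected by the generation.loop string. This is a minimal sketch, not the proposed class hierarchy: `PipelineRunner` reduces the pipeline executor to three callbacks, and `LoopStrategy`, `AutoregressiveLoop`, `SinglePassLoop`, and `MakeLoopStrategy` are hypothetical names.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Reduced executor surface: one callback per flow phase.
struct PipelineRunner {
  std::function<void()> run_init;
  std::function<bool()> run_step;  // returns true when generation should stop
  std::function<void()> run_final;
};

// A loop strategy decides how often the step phase runs.
class LoopStrategy {
 public:
  virtual ~LoopStrategy() = default;
  // Returns the number of step-phase iterations that were executed.
  virtual int Run(PipelineRunner& pipeline, int max_steps) = 0;
};

// Token-by-token: stop on the step's own signal (e.g. EOS) or max_steps.
class AutoregressiveLoop : public LoopStrategy {
 public:
  int Run(PipelineRunner& pipeline, int max_steps) override {
    pipeline.run_init();
    int steps = 0;
    while (steps < max_steps) {
      ++steps;
      if (pipeline.run_step()) break;
    }
    pipeline.run_final();
    return steps;
  }
};

// Single pass: no step phase at all.
class SinglePassLoop : public LoopStrategy {
 public:
  int Run(PipelineRunner& pipeline, int /*max_steps*/) override {
    pipeline.run_init();
    pipeline.run_final();
    return 0;
  }
};

// The generation.loop string only selects an implementation.
std::unique_ptr<LoopStrategy> MakeLoopStrategy(const std::string& name) {
  if (name == "autoregressive") return std::make_unique<AutoregressiveLoop>();
  if (name == "single_pass") return std::make_unique<SinglePassLoop>();
  throw std::invalid_argument("Unknown generation.loop \"" + name + "\"");
}
```

Adding a denoising strategy in this shape would mean one more subclass and one more branch in the factory, with `PipelineRunner` left untouched.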
12.6 Competitive Advantage (Strengthened)
This analysis actually STRENGTHENS the competitive story:
- llama.cpp: GGUF has no concept of denoising loops, multi-session pipelines, or post-processing stages. Diffusion support would require a fundamentally new runtime.
- vLLM: Each diffusion architecture needs its own Python pipeline class. They're doing this (diffusion support is recent), but it's per-model Python code.
- Pipeline-as-Config: Add ONE loop strategy to the runtime → EVERY model of that type works via config. One DenoisingExecutor enables Stable Diffusion, Flux, DiT, SDXL, ControlNet, all expressed as JSON with different session topologies.
The compiler advantage applies across modalities: add one loop strategy to the "compiled runtime" → unlimited models of that type. With "interpreter" runtimes (vLLM, llama.cpp), every model needs its own code.
12.7 Implementation Roadmap
| Phase | What | Status |
|---|---|---|
| Phase 1 (v2.0) | Autoregressive (decoder-only, VLM, encoder-decoder, speech-language) | PRs 1-5 (in progress) |
| Phase 2 (v2.1) | when: "final" for post-processing (enables AR TTS with vocoder) | Trivial addition to FlowInterpreter |
| Phase 2 (v2.1) | repeat + counter on flow steps (enables complex TTS like Qwen3) | ~50 LOC FlowInterpreter extension |
| Phase 3 (v2.1) | loop: "single_pass" + SequentialExecutor (enables non-AR TTS, embeddings) | ~100 LOC new executor |
| Phase 4 (future) | loop: "denoising" + DenoisingExecutor (enables diffusion) | ~300 LOC loop skeleton + ~1000 LOC scheduler implementations |
v2.0 scope: Generative language models (autoregressive token generation). Covers ~95% of current ORT GenAI model zoo.
v2.1 scope: TTS extensions (when: "final", repeat/counter, loop: "single_pass"). Additive, no breaking changes.
Architecture accommodates: Diffusion via pluggable loop strategy. Out of v2.0 scope (different product, different users), but architecturally consistent. The plugin escape hatch covers all exotic patterns in the meantime.
13. Summary
What Changes
| Component | Before | After |
|---|---|---|
| Model dispatch | 32-string whitelist → 8 C++ classes | Structural detection → 3 pipeline classes + plugin |
| Adding a new LLM | C++ PR + release cycle | 7-line JSON config |
| Adding a new VLM | New C++ class + processor + factory entries | ~25-line JSON config |
| Config format | Implicit schema tied to C++ structs | Explicit v2 schema with presets, versioned |
| model_type | Dispatch key | Human-readable metadata |
| Code size | ~4000 LOC in model dispatch | ~2500 LOC in pipeline executor (-1500 LOC) |
| Extension mechanism | Fork the C++ runtime | JSON config or plugin .so |
What Stays the Same
- Generation loop (Generator, Search, Sampling) — fully generic for autoregressive; extensible via pluggable loop strategies for diffusion/TTS (Section 12)
- KV cache internals — auto-detection mechanism preserved
- Tokenizer — unchanged
- C/Python/C#/Java/ObjC API surface — unchanged
- ONNX Runtime session management — unchanged
- All existing models — backward compatible via v1→v2 translator
The Vision (2-Year Horizon)
ORT GenAI becomes a generic pipeline runtime — the ONNX equivalent of what Kubernetes is for container orchestration. Models describe their pipeline declaratively. The runtime executes it generically. No model-specific code. No release bottlenecks. Any ONNX model that follows standard I/O conventions runs automatically.
Zero model-specific C++ code in ORT GenAI, ever again.
The One-Sentence Version
The runtime should know HOW to run pipelines (KV cache, generation loop, sampling), not WHAT models it's running — the "what" comes entirely from the ONNX graph + pipeline config.
1. The Problem Today
1.1 The Six Coupling Points
Adding a new model type to ORT GenAI currently requires C++ source changes in up to 6 locations:
src/models/model_type.hsrc/models/model.cppsrc/models/position_inputs.cppsrc/models/multi_modal.cppsrc/models/model.cppsrc/python/py/models/builders/For a standard decoder-only LLM, only CP1 is required (adding one string). But that one string requires a C++ PR, code review, CI pipeline, and a new release. The bottleneck isn't engineering complexity — it's release process overhead for a trivial change.
1.2 The False Complexity
The codebase has 8 C++ model classes and 32 recognized model_type strings. But strip away the legacy, and there are only 3 genuinely different runtime behaviors:
DecoderOnly_Model,Gpt_ModelMultiModalLanguageModel,Qwen2_5_VL_PipelineModelWhisperModel,MarianModelNemotronSpeechModelDecoderOnlyPipelineModel3 runtime patterns, not 32 model types. The model_type string is doing almost zero useful work.
1.3 What Users Experience
Four personas are blocked by this architecture:
2. Competitive Analysis
How Other Runtimes Handle Extensibility
register_model()APIORT GenAI's Unique Advantage
ORT GenAI has something no other runtime has: the ONNX model IS the computation. vLLM and SGLang require model-specific Python classes that implement
forward()with PyTorch ops. llama.cpp requires model-specific C++ code that implements attention, MLP, and normalization. ORT GenAI delegates ALL computation to ONNX Runtime — it never touches model internals.This means ORT GenAI's extensibility problem is fundamentally simpler. It doesn't need a plugin system for model computation (the ONNX graph handles that). It only needs extensibility for orchestration — which sessions to run, in what order, how to manage state between steps. And orchestration is naturally expressed as configuration.
Why Pipeline-as-Config Is Better Than GGUF (llama.cpp)
1. GGUF bundles computation with metadata. We separate them.
GGUF's model file contains weights + architecture metadata. The runtime reads the metadata and builds a compute graph at load time. This means the runtime must understand every architecture's compute pattern — which attention variant, which normalization, which MLP structure. When a model adds dual head_dim or KV sharing, llama.cpp needs new C++ code to interpret those metadata keys and build the right compute graph.
Our ONNX model IS the precompiled compute graph. The runtime never interprets architecture details — it just runs
session.Run(). The pipeline config only describes ORCHESTRATION (which sessions, what order, what state), not COMPUTATION:The runtime never needs to know what's inside the model. GGUF's runtime does.
2. Multi-EP deployment is impossible with GGUF.
GGUF models run on llama.cpp's own backends (CPU, CUDA, Metal, Vulkan). You can't take a GGUF and run it on DirectML, QNN (Qualcomm NPU), OpenVINO, or WebGPU without porting the entire backend.
ONNX + pipeline config runs on ANY ORT execution provider. The same model + config deploys to cloud GPU (CUDA EP), Windows laptop (DML EP), Qualcomm mobile (QNN EP), Intel hardware (OpenVINO EP), browser (WebGPU EP), and CPU. One model, one config, six+ deployment targets.
3. Graph-level optimization at export time.
ONNX models go through ORT's graph optimization pipeline: constant folding, op fusion, layout optimization, EP-specific transformations. These run ONCE, either offline (ORT can save the optimized model to disk) or when the session is created, and produce an optimized execution plan before the first inference. GGUF's runtime-built graphs can't do this: the graph is constructed and executed simultaneously.
Why Pipeline-as-Config Is Better Than vLLM
1. vLLM requires Python code for every model. We require JSON.
Adding a model to vLLM means writing a Python class with forward(), weight loading, and an attention implementation, typically 200-500 lines of PyTorch code. Even with register_model(), someone must WRITE that code.
Our approach: the export tool (mobius) generates the ONNX graph + pipeline config. The runtime needs ZERO new code. The complexity lives in the exporter (which already understands the model), not the runtime.
2. vLLM is CUDA-only for production.
vLLM's custom CUDA kernels (PagedAttention, FlashAttention) are what make it fast. But they only work on NVIDIA GPUs. Running vLLM on AMD, Intel, Qualcomm, or in a browser requires rewriting those kernels. ORT's execution providers handle hardware abstraction transparently.
3. vLLM couples computation and orchestration.
vLLM's model classes implement both the forward pass AND orchestration logic (KV cache management, attention patterns). Our architecture cleanly separates: ONNX model = computation, pipeline config = orchestration, ORT = execution.
Where Competitors Are Better (Honest Assessment)
Q4_K_M is one flag
The Core Competitive Insight: Compile at Export Time
This is our unique structural advantage that neither competitor can replicate:
We move complexity from RUNTIME to EXPORT TIME.
Why this matters:
This is the 'compiler vs interpreter' advantage. GGUF and vLLM are interpreters — they process model definitions at runtime. We're a compiler — we process model definitions once at export and produce an optimized artifact that a simple, generic runtime executes.
What We Do That NEITHER Competitor Can
Three capabilities that pipeline-as-config delivers that no competitor matches:
1. Multi-Session Declarative Pipelines. GGUF has flat key-value metadata, with no concept of multi-model pipelines. vLLM can do multi-model through Python code, but each topology requires a new class. Pipeline-as-Config's flow[] + dataflow[] declaratively express ANY multi-session topology (VLMs, speech models, multimodal with vision+audio+decoder), all as JSON without new code.
2. Hardware-Agnostic Model Artifacts. GGUF models are tied to llama.cpp's backend ecosystem. vLLM is CUDA-first (AMD ROCm second-class, no DirectML/QNN/WebGPU). ONNX + pipeline config is a hardware-agnostic artifact: the same files deploy on CPU, CUDA, DirectML, QNN (Qualcomm NPU), OpenVINO (Intel), and WebGPU. Write once, deploy on 6+ hardware targets. Even more powerfully, different sessions in the same pipeline can run on DIFFERENT execution providers, e.g., vision encoder on CPU while the decoder runs on GPU, or vision on NPU while decoder runs on GPU.
This heterogeneous hardware deployment — different EPs per session in a single pipeline — is something neither GGUF nor vLLM can express at all.
3. Truly Model-Agnostic Runtime. GGUF's runtime interprets architecture metadata to build compute graphs, so it must understand every model's attention pattern, normalization, and MLP structure. vLLM's runtime runs model-specific Python forward() code. Our runtime executes a declared pipeline and understands ZERO model architecture. The runtime has no decisions to make.
The Complete Competitive Matrix
Pipeline-as-Config is the only approach that checks ALL boxes.
3. The Architecture: Pipeline-as-Config
3.1 Core Concept
Replace model-type dispatch with a declarative pipeline configuration. The runtime becomes a generic pipeline executor that:
3.2 The flow Array: Execution Ordering
The flow array declares which sessions run, when, and how.
Lifecycle phases (fixed vocabulary, not Turing-complete):
- when: "init" runs before the main generation loop (encoders, embedding projectors, preprocessing). Execution order follows the flow[] array. Covers what earlier drafts called once and prompt.
- when: "step" runs every iteration of the generation loop (decoder in autoregressive, UNet in denoising).
- when: "final" runs after the generation loop completes (vocoder in TTS, VAE decoder in diffusion).
These three phases map cleanly across all generation paradigms:
init | step | final
Loop modes (fixed vocabulary):
- loop: "batched" passes all inputs at once (default).
- loop: "per_image" iterates over inputs individually (Qwen VL, Pixtral).
Guardrails:
- when and loop: no arbitrary conditions or iterations
- if/else: anything that needs conditional logic uses the plugin API
3.3 The dataflow Array: Session Wiring
Optional. Declares how outputs from one session feed into inputs of another:
When omitted, the runtime auto-matches by tensor name (output name in session A matches input name in session B). When explicit, overrides auto-matching for cases where tensor names differ.
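The matching rule above can be sketched in a few lines. This is a minimal Python illustration, not the proposed C++ implementation: SessionIO and resolve_wiring are hypothetical names, and the tie-break (the nearest earlier producer wins) is an assumption the proposal does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionIO:
    """Input/output tensor names of one ONNX session (hypothetical container)."""
    inputs: frozenset
    outputs: frozenset


def resolve_wiring(sessions, order, dataflow=()):
    """Map (consumer, input_name) -> (producer, output_name).

    sessions: dict of session name -> SessionIO
    order:    session names in flow[] order; a consumer only sees earlier producers
    dataflow: explicit wires, e.g. {"from": "vision.feats", "to": "embedding.image_features"}
    """
    wiring = {}
    # Pass 1: auto-match an input to an earlier output with the same tensor name.
    # Assumption: if several earlier sessions produce it, the nearest one wins.
    for idx, consumer in enumerate(order):
        for name in sessions[consumer].inputs:
            for producer in order[:idx]:
                if name in sessions[producer].outputs:
                    wiring[(consumer, name)] = (producer, name)
    # Pass 2: explicit wires override auto-matching and are validated eagerly.
    for wire in dataflow:
        src_sess, src_name = wire["from"].split(".", 1)
        dst_sess, dst_name = wire["to"].split(".", 1)
        if src_name not in sessions[src_sess].outputs:
            raise ValueError(f'session "{src_sess}" has no output "{src_name}"')
        if dst_name not in sessions[dst_sess].inputs:
            raise ValueError(f'session "{dst_sess}" has no input "{dst_name}"')
        wiring[(dst_sess, dst_name)] = (src_sess, src_name)
    return wiring
```

Inputs that stay unmatched (for example pixel_values on the vision session) are the ones the caller or the preprocessing layer has to supply.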
3.4 The state Object: KV Cache & Position Strategy
KV cache formats:
- "auto": introspect ONNX session I/O to detect format (default)
- "separate": standard past_key_values.{layer}.key / present.{layer}.key
- "combined": GPT-2 style past_{layer} / present_{layer}
Position ID strategies:
- "auto": introspect the position_ids input shape: rank 2 → default 1D, rank 3 → mRoPE 3D
- "default": standard 1D position IDs
- "mrope_3d": 3-dimensional mRoPE (temporal, height, width)
- "windowed": sliding window position tracking
3.5 The extends Mechanism: Preset Inheritance
Built-in presets eliminate boilerplate for common patterns:
{"pipeline": {"extends": "autoregressive-decoder"}}
Built-in presets:
- autoregressive-decoder
- vision-language
- encoder-decoder
- speech-language
Presets are resolved at load time, so the runtime sees a fully expanded config. Overrides replace preset defaults:
{
  "pipeline": {
    "extends": "vision-language",
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "state": {
      "position_ids": {"strategy": "mrope_3d", "grid_source": "vision.image_grid_thw"}
    }
  }
}
3.6 The Plugin API: Escape Hatch
For genuinely novel architectures that can't be expressed as standard pipelines (RNNT, SSM/Mamba, diffusion):
{
  "pipeline": {
    "plugin": {"library": "libgenai_rnnt.so", "entry_point": "CreateRnntPipeline"}
  }
}
C++ plugin interface:
The plugin registers a Pipeline factory, not a Model factory — keeping the abstraction consistent. Plugins extend the pipeline type system for the ~1% of models that can't fit the declarative config.
3.7 Preprocessing: Image, Audio, and Variable Input Shapes
Preprocessing is NOT the pipeline executor's job. It transforms raw inputs (pixels, audio waveforms) into model-ready tensors. This happens BEFORE the pipeline runs and is handled by a separate, config-driven preprocessing layer.
The Architecture Boundary
Image Preprocessing: ort-extensions image_processor.json
Each VLM ships an image_processor.json that declares its preprocessing pipeline:
{
  "image_processor_type": "Qwen2VLImageProcessor",
  "resample": "bicubic",
  "do_resize": true,
  "size": {"min_pixels": 3136, "max_pixels": 12845056},
  "do_rescale": true,
  "rescale_factor": 0.00392156862745098,
  "do_normalize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "patch_size": 14,
  "merge_size": 2
}
ort-extensions loads this JSON and executes the preprocessing pipeline using its own C++ ops. No model-type dispatch needed. Different VLMs (Phi3v at 336×336, Qwen2.5-VL with dynamic resolution, Pixtral with variable per-image sizes) all use the same mechanism; they just ship different image_processor.json configs.
The C++ preprocessors (PhiImageProcessor, QwenImageProcessor, GemmaImageProcessor, Mistral3ImageProcessor) become legacy. New models use ort-extensions exclusively. This is already the direction: mobius already generates image_processor.json for all VLMs.
Audio Preprocessing: audio_processor.json
Same pattern for speech models:
{
  "audio_processor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "chunk_length": 30,
  "n_fft": 400
}
For multimodal models needing both image and audio (Phi4mm):
Variable Input Shapes
Different models handle input sizes differently. The pipeline config + preprocessor config handle all cases:
- Fixed size (e.g. Phi3v): image_processor.json resizes to target. ONNX model has static shapes.
- Dynamic resolution (e.g. Qwen2.5-VL): image_processor.json does dynamic resize + patch extraction. ONNX model has dynamic shapes. Pipeline executor passes tensors as-is.
- Variable per-image sizes (e.g. Pixtral): per-image sizes are carried in image_sizes[N, 2]. Pipeline flow uses loop: per_image + dynamic_shape for per-image slicing.
For the per-image variable resolution case:
{"run": "vision", "when": "init", "loop": "per_image", "loop_over": "pixel_values", "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
The executor slices pixel_values[i, :, :H_i, :W_i] where H_i and W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
Pipeline Config Reference
The pipeline config references preprocessing configs without embedding their details:
The format field future-proofs beyond ort-extensions: today it's the only value, but it enables alternative preprocessing backends (e.g., a pure ONNX preprocessing graph) without schema changes.
This clean boundary means: preprocessing is fully described by its own config files (already supported by ort-extensions), and the pipeline executor receives model-ready tensors without knowing how they were produced. Note that half the pipeline-as-config vision is already shipped and working in production via ort-extensions; we're completing the other half for inference orchestration.
Every layer of the stack is config-driven. Zero model-specific C++ anywhere.
3.8 Advanced KV Cache Patterns (Shared Cache, Dual Head Dim)
Some models have non-uniform KV cache layouts. Gemma4 is the most complex example:
- … head_dim=256; global (full_attention) layers use global_head_dim=512
- … num_kv_shared_layers layers reuse K/V from earlier layers and have NO independent cache entries
These are all handled by auto-detection, with no model-specific code needed.
ORT GenAI's DefaultKeyValueCache already supports:
- … (kv_layer_indices_): Auto-discovered by scanning which past_key_values.{N}.key inputs exist in the ONNX session. If the model only has layers 0-25 (skipping 26-33 due to KV sharing), the cache allocates only 26 entries.
- … (layer_shapes_): Each layer can have a different [batch, heads, seq_len, head_dim] shape, auto-discovered from the ONNX session output shapes.
The pipeline config for a Gemma4-style model:
format: auto handles the dual head_dim and sparse layers automatically. The sliding_window.layers array (already supported by ORT GenAI today) specifies which layers use bounded cache. The export tool (mobius) builds the ONNX model with cache I/O only for non-shared layers, so the runtime never needs to know about KV sharing; it's implicit in the graph structure.
3.9 Preprocessor↔Model Shape Alignment
Problem: Different models expect different image sizes and preprocessing. How do we ensure the preprocessor output matches the model's expectations without model-specific code?
Answer: Co-generation. The export tool (mobius) generates BOTH the ONNX model and its image_processor.json from the same HuggingFace config. They stay aligned because they share a single source of truth:
For additional safety, the pipeline config can include optional shape validation:
The expected_outputs field enables load-time validation: verify that the preprocessor config produces tensors compatible with the model's inputs before running inference. This catches mismatches at load time rather than inference time.
If someone provides the wrong preprocessor, ONNX Runtime throws a shape mismatch error at session.Run(), which is already a clear, debuggable failure. The optional validation catches it earlier.
3.10 The metadata Section
model_type lives here, as documentation, not dispatch:
Used for: logging, telemetry, debugging, human readability. Ignored by: all dispatch and runtime logic.
4. Concrete Schema Examples
4.1 Decoder-Only LLM (Minimal — 7 lines)
{
  "version": 2,
  "pipeline": {"extends": "autoregressive-decoder", "sessions": {"decoder": {"file": "model.onnx"}}},
  "tokens": {"eos": [151645], "pad": 0},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2", "source": "mobius"}
}
4.2 Vision-Language Model (Qwen2.5-VL style, about 25 lines)
{
  "version": 2,
  "pipeline": {
    "extends": "vision-language",
    "sessions": {
      "vision": {"file": "vision_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {"strategy": "mrope_3d", "grid_source": "vision.image_grid_thw"}
    },
    "preprocessing": {"image": {"config": "image_processor.json"}}
  },
  "tokens": {"eos": [151645], "pad": 0, "image_token": 151655},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2_5_vl", "source": "mobius"}
}
4.3 Encoder-Decoder (Whisper style)
{
  "version": 2,
  "pipeline": {
    "extends": "encoder-decoder",
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "encoder", "when": "init"},
      {"run": "decoder", "when": "step", "cross_attention_from": "encoder"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "cross_cache": {"source": "encoder", "frozen": true}
    }
  },
  "tokens": {"eos": [50257], "pad": 50257, "decoder_start": 50258},
  "generation": {"max_length": 448},
  "metadata": {"model_type": "whisper", "source": "mobius"}
}
4.4 Multimodal (Vision + Audio, Phi4mm style)
{
  "version": 2,
  "pipeline": {
    "sessions": {
      "vision": {"file": "vision_encoder/model.onnx"},
      "speech": {"file": "audio_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision", "when": "init", "loop": "batched"},
      {"run": "speech", "when": "init", "loop": "batched"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "speech.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {"strategy": "default"}
    }
  },
  "tokens": {"eos": [32007], "pad": 32000},
  "generation": {"max_length": 4096},
  "metadata": {"model_type": "phi4mm", "source": "mobius"}
}
4.5 Novel Architecture via Plugin (RNNT)
{
  "version": 2,
  "pipeline": {
    "plugin": {"library": "libgenai_rnnt.so", "entry_point": "CreateRnntPipeline"},
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "predictor": {"file": "predictor/model.onnx"},
      "joiner": {"file": "joiner/model.onnx"}
    }
  },
  "metadata": {"model_type": "nemotron_speech", "source": "mobius"}
}
5. Implementation Plan
Overview
This is a refactor, not a rewrite. The generation loop, search/sampling, tokenizer, KV cache internals, and all language bindings remain unchanged. We're replacing the model dispatch layer with a pipeline dispatch layer.
Net code change estimate: about +950 lines added and -1500 lines deleted across the five PRs below (net roughly -550). The codebase gets smaller.
PR 1: Config Schema v2 Parser + Backward Compatibility (~300 LOC)
Files changed:
- src/config.h: Add Pipeline struct with sessions, flow, dataflow, state, extends fields
- src/config.cpp: Parse v2 schema; add v1→v2 translator that converts old-format configs to pipeline format
- src/pipeline_presets.h: Built-in preset definitions (autoregressive-decoder, vision-language, encoder-decoder, speech-language)
Logic:
Backward compatibility guarantee: Every existing genai_config.json produces an identical internal Pipeline struct after translation. The v1→v2 translator maps:
- model.type + model_type.h classification → appropriate preset
- model.decoder.inputs/outputs → state.kv_cache patterns
- model.vision/speech/embedding sections → sessions + flow + dataflow
Tests: All existing config tests pass unchanged. New tests for v2 parsing, preset resolution, extends override logic.
PR 2: PipelineExecutor Class (~350 LOC)
Files changed:
- src/models/pipeline_executor.h: PipelineExecutor class definition
- src/models/pipeline_executor.cpp: Implementation
- src/models/model.cpp: Replace CreateModel() with CreatePipeline() using structural detection
The core class:
Structural detection in CreatePipeline() (replaces CreateModel()):
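As an illustration only, the "auto" rules from Section 3.4 and the structural signals named in the Design Principle can be sketched in Python (the runtime itself is C++). The KV-cache name patterns and the rank rule for position_ids come from Section 3.4; the preset-from-session-names rule in detect_preset is an assumption made for this sketch, as are all function names.

```python
import re

# Name patterns from Section 3.4 ("separate" and "combined" KV cache formats).
SEPARATE_KEY = re.compile(r"^past_key_values\.(\d+)\.key$")
COMBINED = re.compile(r"^past_(\d+)$")


def _layers(pattern, names):
    """Sorted layer indices of all names that match the pattern."""
    return sorted(int(m.group(1)) for m in map(pattern.match, names) if m)


def detect_kv_cache(input_names):
    """Return (format, layer_indices); (None, []) if no cache inputs exist.

    Only layers that actually appear as inputs are reported, so a graph that
    omits shared layers yields a sparse index list without special handling.
    """
    for fmt, pattern in (("separate", SEPARATE_KEY), ("combined", COMBINED)):
        layers = _layers(pattern, input_names)
        if layers:
            return fmt, layers
    return None, []


def detect_position_strategy(position_ids_rank):
    """Rank 2 -> 1D position IDs, rank 3 -> 3D mRoPE (Section 3.4)."""
    if position_ids_rank == 2:
        return "default"
    if position_ids_rank == 3:
        return "mrope_3d"
    raise ValueError(f"unsupported position_ids rank: {position_ids_rank}")


def detect_preset(session_names):
    """Hypothetical rule: pick a built-in preset from which sessions exist.

    Returns None when no preset fits and the config must spell out flow[].
    """
    names = set(session_names)
    if {"vision", "speech"} <= names:
        return None  # e.g. the Phi4mm-style config in 4.4 declares flow[] itself
    if "vision" in names:
        return "vision-language"
    if "speech" in names:
        return "speech-language"
    if "encoder" in names:
        return "encoder-decoder"
    if names == {"decoder"}:
        return "autoregressive-decoder"
    return None
```

With a decoder that exposes cache inputs only for layers 0-25, detect_kv_cache reports 26 layers, which matches the sparse-layer behaviour described in Section 3.8.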
PR 3: Flow Interpreter + Dataflow Wiring (~200 LOC)
Files changed:
- src/models/flow_interpreter.h/.cpp: Interprets flow[] and dataflow[]
- src/models/pipeline_executor.cpp: Uses flow interpreter
Key logic:
Dataflow wiring:
PR 4: Plugin API (~100 LOC)
Files changed:
- src/models/plugin_api.h: Stable C ABI for pipeline plugins
- src/models/plugin_loader.cpp: Dynamic library loading (dlopen/LoadLibrary)
Interface:
PR 5: Delete Model-Type Dispatch (~-1500 LOC)
Files deleted:
- src/models/model_type.h: The entire file
Files simplified:
- src/models/model.cpp: Remove CreateModel() if-else chain, replace with CreatePipeline()
- src/models/position_inputs.cpp: Remove IsQwenVLFamily() check; position strategy comes from config
- src/models/multi_modal.cpp: Remove CreateVisionState() model_type dispatch; vision loop mode comes from config flow
Files eventually deprecated (kept for v1 compat, removed in future release):
- src/models/gpt.h/cpp: Absorbed into generic pipeline with kv_cache.format: combined
- … image_processor.json
Implementation Summary
The codebase shrinks by ~550 lines while gaining full model-agnostic extensibility.
6. Compatibility Matrix
7. Technical Feasibility
7.1 CUDA Graph Capture
Concern: CUDA graphs require identical session topology and buffer shapes between captures and replays. Does a generic pipeline executor break this?
Answer: No. The executor pre-computes a "decode flow" (steps where when: "step") at init time. During token generation, only the decode flow runs. This is a fixed, repeatable sequence identical to what the current DecoderOnly_State::Run() does, and CUDA graph capture applies to this fixed sequence.
7.2 Memory Pre-allocation
Concern: The current code pre-allocates KV cache buffers based on model dimensions. Can a generic executor do this without model-specific knowledge?
Answer: Yes. KV cache dimensions come from config (decoder.num_hidden_layers, decoder.num_key_value_heads, decoder.head_size) or are discoverable from ONNX session output shapes at init time. The current DefaultKeyValueCache already auto-discovers layer count by pattern-matching present tensor names in the session. A generic executor uses the same mechanism, so zero model-specific knowledge is needed.
7.3 Performance Overhead
Concern: Does the generic pipeline add overhead vs hand-optimized model classes?
Answer: Negligible. The overhead is:
- for loop over flow_steps per generation step (typically 1 step for LLMs)
The hot path (session.Run() + KV cache management) is identical to the current code. The generation loop, search/sampling, and tokenizer are completely unchanged.
7.4 Config Validation
Invalid configs must produce clear errors at load time, not runtime crashes:
- Flow step references session "vision" but no such session is declared in pipeline.sessions
- Dataflow wire references output "image_features" but session "vision" has no such output (available outputs: hidden_states, pooler_output)
- Unknown position_ids strategy "my_custom". Valid options: auto, default, mrope_3d, windowed
- Circular dependency detected in dataflow: vision → embedding → decoder → vision
- Unknown pipeline preset "my-preset". Built-in presets: autoregressive-decoder, vision-language, encoder-decoder, speech-language
- Pipeline config requires at least one session. Add "sessions": {"decoder": {"file": "model.onnx"}}
7.5 The per_image Loop for Vision
QwenVisionState and PixtralVisionState loop over images individually with different slicing strategies. The flow interpreter handles this generically:
{"run": "vision", "when": "init", "loop": "per_image", "loop_over": "pixel_values"}
For Pixtral's variable-resolution cropping (per-image height/width from image_sizes):
{"run": "vision", "when": "init", "loop": "per_image", "loop_over": "pixel_values", "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
The executor slices pixel_values[i, :, :H_i, :W_i] where H_i, W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
8. The Pitch to the ORT GenAI Team
Framing
Value Proposition
The Key Selling Point
This REDUCES ORT GenAI's maintenance burden. The team goes from "we must ship a PR for every new HuggingFace model" to "we maintain a stable pipeline runtime." New model support becomes the exporter's responsibility (mobius/Olive), not the runtime's.
"You build the engine. We build the cars."
9. Risk Analysis
extends reduce 90% of configs to 7 lines. JSON Schema for IDE support.
Immediate Bridge (While Building the Future)
While the pipeline-as-config architecture is implemented, mobius can unblock users TODAY:
- "type": "decoder" + "original_model_type": "<real_type>" in genai_config.json
- "decoder" is in the current whitelist and routes to DecoderOnly_Model
10. The mobius Role: Pipeline Compiler
mobius already knows everything needed to generate complete pipeline configs:
Config fields covered: extends; pipeline.sessions; flow[].loop; state.position_ids.strategy; state.kv_cache.format; state.kv_cache.*_pattern, dataflow[]; tokens, generation
Implementation in mobius: Extend the existing _write_genai_config() function to emit v2 format alongside (or instead of) v1. The pipeline config is generated from the same model metadata that already drives ONNX graph construction.
11. Research Direction: Self-Contained Generation Graphs
As a long-term research direction (not part of the core proposal), we explored embedding generation logic inside the ONNX graph itself. Microsoft's existing com.microsoft.BeamSearch and com.microsoft.GreedySearch contrib ops prove this is technically possible.
Viable for: Offline batch inference, edge deployment, WebAssembly
Not viable for: Interactive serving (streaming, continuous batching, speculative decoding — all require host-side coordination)
Potential approach: Small set of generation-specific custom ops (GenerationKVCacheUpdate, SampleTopP) that the runtime provides as efficient primitives, while the ONNX graph carries the generation logic. Worth exploring for simple deployment scenarios but not the primary architecture.
12. Beyond Autoregressive: TTS, Diffusion, and Multimodal Audio
12.1 The Question
Pipeline-as-Config is designed around autoregressive token generation. But the model ecosystem includes fundamentally different generation patterns:
Can flow[]/dataflow[]/state{} express these? Or is the schema inherently autoregressive?
12.2 The Honest Assessment
when: "final" for vocoder post-processing
Key insight: the flow[]/dataflow[] schema is MORE GENERAL than the current runtime. It can already describe these topologies. The bottleneck is the C++ Generator, which assumes autoregressive token generation.
12.3 Concrete Config Examples
Audio+text multimodal (works today with pipeline-as-config):
{
  "pipeline": {
    "extends": "speech-language",
    "sessions": {
      "audio_encoder": {"file": "audio_encoder.onnx"},
      "embedding": {"file": "embedding.onnx"},
      "decoder": {"file": "decoder.onnx"}
    },
    "flow": [
      {"run": "audio_encoder", "when": "init"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "audio_encoder.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 4096}
}
Whisper and Nemotron Speech are already this pattern. No schema changes needed.
Autoregressive TTS (Bark — works with minor extension):
{
  "pipeline": {
    "sessions": {
      "decoder": {"file": "decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "decoder", "when": "step"},
      {"run": "vocoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "decoder.audio_tokens", "to": "vocoder.input_ids"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}
New: when: "final" runs after the generation loop completes (post-processing). Trivial to add.
Non-autoregressive TTS (VITS, needs sequential executor):
{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "duration_predictor": {"file": "duration.onnx"},
      "mel_decoder": {"file": "mel_decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "duration_predictor", "when": "init"},
      {"run": "mel_decoder", "when": "init"},
      {"run": "vocoder", "when": "init"}
    ],
    "output": {"session": "vocoder", "name": "audio_waveform"}
  },
  "generation": {"loop": "single_pass"}
}
New: "loop": "single_pass" means no generation loop: run all flow steps once, return the output tensor. Requires a SequentialExecutor (~100 LOC).
Complex TTS with inner loops (Qwen3 TTS, needs flow step extensions):
Qwen3 TTS is a 4-model pipeline: embedding → talker → code_predictor → speaker_encoder. The talker IS autoregressive (KV cache, logits), but within each generation step, the code_predictor runs 14 times in an inner loop with a step counter:
{
  "pipeline": {
    "sessions": {
      "embedding": {"file": "embedding.onnx"},
      "talker": {"file": "talker.onnx"},
      "code_predictor": {"file": "code_predictor.onnx"},
      "speaker_encoder": {"file": "speaker_encoder.onnx", "optional": true}
    },
    "flow": [
      {"run": "speaker_encoder", "when": "init", "optional": true},
      {"run": "embedding", "when": "init"},
      {"run": "talker", "when": "step"},
      {"run": "code_predictor", "when": "step", "repeat": 14, "counter": "step_index"}
    ],
    "dataflow": [
      {"from": "embedding.text_embeds", "to": "talker.inputs_embeds"},
      {"from": "talker.last_hidden_state", "to": "code_predictor.inputs_embeds"},
      {"from": "code_predictor.codec_embeddings", "to": "code_predictor.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}
New concepts: repeat: N on a flow step (inner loop within each generation step), a counter field (provides a step index input), and self-referential dataflow (code_predictor output feeds back into itself). These are v2.1 extensions. Until then, the plugin escape hatch covers complex TTS.
Diffusion (Stable Diffusion, Flux), in scope for schema design, out of scope for v2.0 implementation:
Diffusion has a fundamentally different generation loop: fixed N-step denoising with noise scheduling, classifier-free guidance (conditional UNet double-call), and non-neural scheduler math between iterations. The flow[]/dataflow[] schema can express the session topology using the same init/step/final phases:
{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "unet": {"file": "unet.onnx"},
      "vae_decoder": {"file": "vae_decoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "unet", "when": "step"},
      {"run": "vae_decoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "text_encoder.text_embeddings", "to": "unet.encoder_hidden_states"},
      {"from": "unet.noise_pred", "to": "vae_decoder.latent_sample"}
    ]
  },
  "generation": {"loop": "denoising", "num_steps": 50, "scheduler": "euler_discrete", "guidance_scale": 7.5}
}
Note how init/step/final maps naturally: text_encoder = init, unet = step (each denoising iteration), vae_decoder = final (after loop). The same three phases work for autoregressive AND denoising, so no schema fork is needed.
The denoising loop itself requires host-side C++ logic (scheduler.step, CFG interpolation) that would be Turing-complete if expressed declaratively. The right approach: a dedicated DenoisingExecutor in C++ that implements the denoising loop, analogous to how the autoregressive loop is C++ today. The config parameterizes it; the C++ implements it. The loop skeleton is ~300 LOC, but the full scheduler zoo (Euler, DDPM, DDIM, DPM-Solver, LCM, Flow Matching) plus classifier-free guidance and ControlNet support is ~1000-1500 LOC of implementation complexity. The pipeline executor and schema don't change: scheduler: "euler_discrete" is just a string that selects a C++ implementation.
Implementation is deferred: diffusion users have different tooling (ComfyUI, diffusers), different serving patterns (no streaming, batch-oriented), and ORT already has separate diffusion pipeline support. But the schema design explicitly accommodates diffusion so no breaking changes are needed when the executor is added.
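To make the host-side logic mentioned above concrete (scheduler.step and CFG interpolation), here is a deliberately simplified Python sketch. It is an assumption-laden illustration, not the proposed DenoisingExecutor: run_unet stands in for session.Run(), latents are flat lists of floats, the sigma schedule is supplied by the caller, and the Euler step omits the input scaling that production schedulers apply.

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def euler_step(sample, noise_pred, sigma, sigma_next):
    """One simplified Euler update: x_next = x + (sigma_next - sigma) * eps."""
    dt = sigma_next - sigma
    return [x + dt * e for x, e in zip(sample, noise_pred)]


def denoise(run_unet, latents, sigmas, cond, uncond, guidance_scale):
    """Fixed N-step denoising loop, N = len(sigmas) - 1.

    run_unet(latents, sigma, embedding) -> noise prediction. It is called
    twice per step (unconditional and conditional), which is the "UNet
    double-call" that classifier-free guidance requires.
    """
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        eps_uncond = run_unet(latents, sigma, uncond)
        eps_cond = run_unet(latents, sigma, cond)
        eps = cfg_combine(eps_uncond, eps_cond, guidance_scale)
        latents = euler_step(latents, eps, sigma, sigma_next)
    return latents
```

None of this is expressible in flow[] or dataflow[], which is why the config only names the scheduler and the executor owns the loop.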
12.4 The Scoping Decision
Pipeline-as-Config v2.0 implements autoregressive generation. The schema designs for all generation paradigms. This is a deliberate split: ship what matters now, design so future work is additive.
Schema features covered: when: "final", repeat + counter, loop: "single_pass", loop: "denoising"
The pitch: "The v2 schema supports any generation paradigm: autoregressive, denoising, single-pass. v2.0 ships the autoregressive executor. Adding a new paradigm = one C++ executor class. Adding a new model within any paradigm = zero code."
This prevents the "only works for LLMs" objection (the schema designs for everything) while keeping v2.0 scope tight (ship quality over breadth).
12.5 The Architectural Pattern: Pluggable Loop Strategies
The generation loop is a layer ABOVE the pipeline executor:
Each loop strategy is independent:
autoregressive | single_pass | denoising
Adding a new loop strategy never touches the pipeline executor or existing loop strategies. Pure addition.
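The layering can be sketched as a small registry keyed by generation.loop. This Python sketch is illustrative only: the registry, the run_session and is_done callbacks, and the default of "autoregressive" when loop is omitted are assumptions, and real strategies would also handle sampling, KV cache updates, and scheduler math.

```python
LOOP_STRATEGIES = {}


def loop_strategy(name):
    """Register a loop strategy under the name used in generation.loop."""
    def register(fn):
        LOOP_STRATEGIES[name] = fn
        return fn
    return register


def phase(flow, when):
    """Session names for one lifecycle phase, in flow[] order."""
    return [step["run"] for step in flow if step["when"] == when]


@loop_strategy("autoregressive")
def autoregressive(flow, generation, run_session, is_done):
    for name in phase(flow, "init"):
        run_session(name)
    for _ in range(generation["max_length"]):
        for name in phase(flow, "step"):
            run_session(name)
        if is_done():  # e.g. an EOS token was sampled
            break
    for name in phase(flow, "final"):
        run_session(name)


@loop_strategy("single_pass")
def single_pass(flow, generation, run_session, is_done):
    for step in flow:  # no loop: every step runs exactly once
        run_session(step["run"])


@loop_strategy("denoising")
def denoising(flow, generation, run_session, is_done):
    for name in phase(flow, "init"):
        run_session(name)
    for _ in range(generation["num_steps"]):
        for name in phase(flow, "step"):
            run_session(name)
    for name in phase(flow, "final"):
        run_session(name)


def generate(config, run_session, is_done=lambda: False):
    generation = config.get("generation", {})
    loop = generation.get("loop", "autoregressive")  # assumed default
    if loop not in LOOP_STRATEGIES:
        raise ValueError(f'Unknown generation loop "{loop}"')
    LOOP_STRATEGIES[loop](config["pipeline"]["flow"], generation, run_session, is_done)
```

Registering a fourth strategy is one more decorated function; generate() and the existing strategies are not edited, which is the "pure addition" property claimed above.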
12.6 Competitive Advantage (Strengthened)
This analysis actually STRENGTHENS the competitive story:
DenoisingExecutor enables Stable Diffusion, Flux, DiT, SDXL, ControlNet, all expressed as JSON with different session topologies.
The compiler advantage applies across modalities: add one loop strategy to the "compiled runtime" → unlimited models of that type. With "interpreter" runtimes (vLLM, llama.cpp), every model needs its own code.
12.7 Implementation Roadmap
- when: "final" for post-processing (enables AR TTS with vocoder)
- repeat + counter on flow steps (enables complex TTS like Qwen3)
- loop: "single_pass" + SequentialExecutor (enables non-AR TTS, embeddings)
- loop: "denoising" + DenoisingExecutor (enables diffusion)
v2.0 scope: Generative language models (autoregressive token generation). Covers ~95% of current ORT GenAI model zoo.
v2.1 scope: TTS extensions (when: "final", repeat/counter, loop: "single_pass"). Additive, no breaking changes.
Architecture accommodates: Diffusion via pluggable loop strategy. Out of v2.0 scope (different product, different users), but architecturally consistent. The plugin escape hatch covers all exotic patterns in the meantime.
13. Summary
What Changes
What Stays the Same
The Vision (2-Year Horizon)
ORT GenAI becomes a generic pipeline runtime — the ONNX equivalent of what Kubernetes is for container orchestration. Models describe their pipeline declaratively. The runtime executes it generically. No model-specific code. No release bottlenecks. Any ONNX model that follows standard I/O conventions runs automatically.
Zero model-specific C++ code in ORT GenAI, ever again.