Copilot design of Pipeline-as-Config #2114

@justinchuby

ORT GenAI Architectural Redesign: Pipeline-as-Config

A proposal to make onnxruntime-genai truly model-agnostic

Authors: Architecture Team (Architect, Product Manager, Radical Thinker)
Date: 2026-05-02
Status: Draft — for review before GitHub issue creation

Terminology note: The flow[].when values were renamed from once/prompt/always to init/step/final for cross-paradigm clarity (autoregressive, denoising, single-pass all use the same three phases). The canonical definitions are in Section 3.2. Some earlier-drafted examples may use the original names.


Executive Summary

onnxruntime-genai is 90% model-agnostic today — but a hardcoded string registry blocks every new model. 21 of 32 recognized model types share identical runtime code (DecoderOnly_Model). The KV cache auto-discovers its layout from ONNX tensor names. The generation loop knows nothing about model architecture. The only thing preventing ANY new model from working is a C++ whitelist that maps model_type strings to implementation classes.

We propose Full-Stack Declarative Inference — replacing string-based dispatch with a declarative pipeline configuration where preprocessing, orchestration, and generation are ALL expressed as JSON config, running on 6+ execution providers. Instead of the runtime knowing about "Llama" or "Qwen" or "Gemma," it knows about pipelines — sequences of ONNX session invocations with configurable data flow, state management, and execution ordering.

The result: zero model-specific C++ code in ORT GenAI, ever again. New models are supported entirely by the export tool (mobius/Olive) generating ONNX graphs + pipeline configs. The runtime becomes a stable platform that only changes for performance improvements and new features, never for new models.

The Pitch

The only inference runtime where adding a new model is a JSON file, not a code change — and it runs on every platform.

Design Principle

Detect, don't declare. The runtime infers model category from structural signals (which ONNX sessions exist, what I/O signatures they have), not from string dispatch. Config fields express behavioral choices that genuinely can't be inferred from structure. model.type becomes metadata for humans, not a dispatch key for machines.

The One-Sentence Version

The runtime should know HOW to run pipelines (KV cache, generation loop, sampling), not WHAT models it's running — the "what" comes entirely from the ONNX graph + pipeline config.


1. The Problem Today

1.1 The Six Coupling Points

Adding a new model type to ORT GenAI currently requires source changes in up to 6 locations (five in C++, one in the Python model builder):

| # | Coupling Point | File | Lines | What Changes |
|---|---|---|---|---|
| CP1 | Model type whitelist | src/models/model_type.h | 16-61 | Add string to static array |
| CP2 | Model factory dispatch | src/models/model.cpp | 820-842 | Add if-else branch |
| CP3 | Position input strategy | src/models/position_inputs.cpp | 928-938 | Add model_type check |
| CP4 | Vision state factory | src/models/multi_modal.cpp | 565-572 | Add model_type check |
| CP5 | Multimodal processor factory | src/models/model.cpp | 915-933 | Add factory entry |
| CP6 | Python model builder | src/python/py/models/builders/ | Various | Add Python builder file |

For a standard decoder-only LLM, only CP1 is required (adding one string). But that one string requires a C++ PR, code review, CI pipeline, and a new release. The bottleneck isn't engineering complexity — it's release process overhead for a trivial change.

1.2 The False Complexity

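The codebase has 8 C++ model classes and 32 recognized model_type strings. But strip away the legacy, and there are only 3 common runtime behaviors (decoder-only, multi-session, encoder-decoder), plus one genuinely different outlier (RNNT) and one deployment variant (QNN pipeline):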

| Runtime Behavior | C++ Classes | Model Types | Actually Different? |
|---|---|---|---|
| Decoder-only autoregressive | DecoderOnly_Model, Gpt_Model | 22 types | No. GPT-2 differs only in KV cache format (config-detectable) |
| Multi-session (VLM/multimodal) | MultiModalLanguageModel, Qwen2_5_VL_PipelineModel | 8 types | Partially. Vision invocation strategy varies, but is config-expressible |
| Encoder-decoder | WhisperModel, MarianModel | 2 types | No. Both are encoder→cross-attention-decoder |
| RNNT streaming ASR | NemotronSpeechModel | 1 type | Yes. Fundamentally different decoding loop |
| Pipeline (QNN multi-stage) | DecoderOnlyPipelineModel | 1 type | Deployment variant, not architectural difference |

3 runtime patterns, not 32 model types. The model_type string is doing almost zero useful work.

1.3 What Users Experience

Four personas are blocked by this architecture:

  1. The Model Builder (mobius dev): "I built a perfect ONNX model + config, but ORT GenAI rejects it because it doesn't know my model_type string."
  2. The Deployer (ML engineer): "I have to wait for a new ORT GenAI release just to use a new model. My alternative is forking the runtime."
  3. The Fine-Tuner (researcher): "My model is architecturally identical to Llama but has a custom model_type. ORT GenAI won't load it."
  4. The ORT GenAI Maintainer (MSFT): "Every new HuggingFace model = a C++ PR. We're bottlenecked on model support."

2. Competitive Analysis

How Other Runtimes Handle Extensibility

| Runtime | Pattern | Extensible Without Source Changes? | Adding a New Model |
|---|---|---|---|
| vLLM | Python dict registry + lazy import | ✅ Yes, via register_model() API | 1 registry line + 1 Python file |
| SGLang | AST-based filesystem discovery | ✅ Yes, drop .py file in directory | 0-1 config lines + 1 module file |
| llama.cpp | C++ enum dispatch (like ORT GenAI) | ❌ No, requires recompilation | 1 enum + 100-300 LOC |
| ORT GenAI | C++ string whitelist dispatch | ❌ No, requires recompilation | 1 string + PR + release cycle |
| ORT GenAI (proposed) | Declarative pipeline config | ✅ Yes, JSON config only | 0 code lines + 1 JSON config |

ORT GenAI's Unique Advantage

ORT GenAI has something no other runtime has: the ONNX model IS the computation. vLLM and SGLang require model-specific Python classes that implement forward() with PyTorch ops. llama.cpp requires model-specific C++ code that implements attention, MLP, and normalization. ORT GenAI delegates ALL computation to ONNX Runtime — it never touches model internals.

This means ORT GenAI's extensibility problem is fundamentally simpler. It doesn't need a plugin system for model computation (the ONNX graph handles that). It only needs extensibility for orchestration — which sessions to run, in what order, how to manage state between steps. And orchestration is naturally expressed as configuration.

Why Pipeline-as-Config Is Better Than GGUF (llama.cpp)

1. GGUF bundles computation with metadata. We separate them.

GGUF's model file contains weights + architecture metadata. The runtime reads the metadata and builds a compute graph at load time. This means the runtime must understand every architecture's compute pattern — which attention variant, which normalization, which MLP structure. When a model adds dual head_dim or KV sharing, llama.cpp needs new C++ code to interpret those metadata keys and build the right compute graph.

Our ONNX model IS the precompiled compute graph. The runtime never interprets architecture details — it just runs session.Run(). The pipeline config only describes ORCHESTRATION (which sessions, what order, what state), not COMPUTATION:

GGUF:    metadata → [runtime builds graph] → execution
Ours:    ONNX graph (prebuilt) + pipeline config → [runtime orchestrates] → execution

The runtime never needs to know what's inside the model. GGUF's runtime does.

2. Multi-EP deployment is impossible with GGUF.

GGUF models run on llama.cpp's own backends (CPU, CUDA, Metal, Vulkan). You can't take a GGUF and run it on DirectML, QNN (Qualcomm NPU), OpenVINO, or WebGPU without porting the entire backend.

ONNX + pipeline config runs on ANY ORT execution provider. The same model + config deploys to cloud GPU (CUDA EP), Windows laptop (DML EP), Qualcomm mobile (QNN EP), Intel hardware (OpenVINO EP), browser (WebGPU EP), and CPU. One model, one config, six+ deployment targets.

3. Graph-level optimization at export time.

ONNX models go through ORT's graph optimization pipeline: constant folding, op fusion, layout optimization, EP-specific transformations. These passes run once, either offline at export time or when the session is created, and produce an optimized execution plan that every subsequent inference call reuses. llama.cpp constructs its compute graph inside the runtime, so it cannot rely on an exporter to apply such passes ahead of time.

Why Pipeline-as-Config Is Better Than vLLM

1. vLLM requires Python code for every model. We require JSON.

Adding a model to vLLM means writing a Python class with forward(), weight loading, attention implementation — typically 200-500 lines of PyTorch code. Even with register_model(), someone must WRITE that code.

Our approach: the export tool (mobius) generates the ONNX graph + pipeline config. The runtime needs ZERO new code. The complexity lives in the exporter (which already understands the model), not the runtime.

2. vLLM is CUDA-first for production.

vLLM's custom CUDA kernels (PagedAttention, FlashAttention) are what make it fast, and its best-optimized paths target NVIDIA GPUs. Support for other hardware such as AMD ROCm is secondary, and targets like Qualcomm NPUs or the browser would require new kernels or backends. ORT's execution providers handle hardware abstraction transparently.

3. vLLM couples computation and orchestration.

vLLM's model classes implement both the forward pass AND orchestration logic (KV cache management, attention patterns). Our architecture cleanly separates: ONNX model = computation, pipeline config = orchestration, ORT = execution.

Where Competitors Are Better (Honest Assessment)

| They're better at | Why | Our path to parity |
|---|---|---|
| GGUF: Single-file distribution | One .gguf file vs our model dir | ONNX metadata embedding (research direction) |
| GGUF: Quantization simplicity | Q4_K_M is one flag | Olive pipeline (more steps, but more flexible) |
| vLLM: Serving features | Continuous batching, speculative decoding, prefix caching | ORT GenAI engine mode (growing) |
| vLLM: Community velocity | 200+ models, rapid community PRs | Pipeline-as-config FIXES this by enabling the same velocity |
| Both: No export step | Load HF weights directly | We require an export step (mobius build) |

The Core Competitive Insight: Compile at Export Time

This is our unique structural advantage that neither competitor can replicate:

We move complexity from RUNTIME to EXPORT TIME.

  • GGUF: Runtime builds the compute graph (complexity at runtime)
  • vLLM: Runtime runs model-specific Python code (complexity at runtime)
  • Ours: Export tool builds the compute graph AND generates the orchestration config (complexity at export time). Runtime is generic.

Why this matters:

  1. Export runs ONCE; inference runs millions of times. Put the intelligence where it runs once.
  2. Export has access to the full HuggingFace model — Python code, config, architecture details. It can make perfect decisions. The runtime shouldn't need this information.
  3. The export tool (mobius) is Python — easy to extend. The runtime is C++ — hard to change. Our architecture puts extensibility in the easy-to-change layer.
  4. Export-time optimization — graph optimization, quantization, EP-specific tuning all happen before deployment. The runtime gets a pre-optimized artifact.

This is the 'compiler vs interpreter' advantage. GGUF and vLLM are interpreters — they process model definitions at runtime. We're a compiler — we process model definitions once at export and produce an optimized artifact that a simple, generic runtime executes.

What We Do That NEITHER Competitor Can

Three capabilities that pipeline-as-config delivers that no competitor matches:

1. Multi-Session Declarative Pipelines. GGUF has flat key-value metadata — no concept of multi-model pipelines. vLLM can do multi-model through Python code, but each topology requires a new class. Pipeline-as-Config's flow[] + dataflow[] declaratively express ANY multi-session topology — VLMs, speech models, multimodal with vision+audio+decoder — all as JSON without new code.

2. Hardware-Agnostic Model Artifacts. GGUF models are tied to llama.cpp's backend ecosystem. vLLM is CUDA-first (AMD ROCm second-class, no DirectML/QNN/WebGPU). ONNX + pipeline config is a hardware-agnostic artifact — the same files deploy on CPU, CUDA, DirectML, QNN (Qualcomm NPU), OpenVINO (Intel), and WebGPU. Write once, deploy on 6+ hardware targets. Even more powerfully, different sessions in the same pipeline can run on DIFFERENT execution providers — e.g., vision encoder on CPU while the decoder runs on GPU, or vision on NPU while decoder runs on GPU:

"sessions": {
  "vision":  {"file": "vision/model.onnx", "execution_provider": "QNNExecutionProvider"},
  "decoder": {"file": "decoder/model.onnx", "execution_provider": "CUDAExecutionProvider"}
}

This heterogeneous hardware deployment — different EPs per session in a single pipeline — is something neither GGUF nor vLLM can express at all.

3. Truly Model-Agnostic Runtime. GGUF's runtime interprets architecture metadata to build compute graphs — it must understand every model's attention pattern, normalization, MLP structure. vLLM's runtime runs model-specific Python forward() code. Our runtime executes a declared pipeline — it understands ZERO model architecture. The runtime has no decisions to make.

The Complete Competitive Matrix

| Capability | GGUF | vLLM | Pipeline-as-Config |
|---|---|---|---|
| New LLM without runtime changes | ❌ | ❌ (needs Python class) | ✅ (JSON config) |
| Multi-session pipelines (VLM) | ❌ (no concept) | ⚠️ (Python code) | ✅ (declarative flow) |
| Deploy same model on 6+ HW targets | ❌ | ❌ | ✅ (ORT execution providers) |
| Heterogeneous HW per session | ❌ | ❌ | ✅ (vision on NPU, decoder on GPU) |
| Model-agnostic runtime | ❌ | ❌ | ✅ |
| Self-describing model artifacts | ✅ (GGUF metadata) | ❌ (needs Python) | ✅ (ONNX + pipeline JSON) |
| Declarative preprocessing | | | ✅ (ort-extensions JSON) |
| No Python dependency at inference | ✅ | ❌ | ✅ |
| C++ only deployment (edge/embedded) | ✅ | ❌ | ✅ |
| Extensibility without recompilation | ❌ | ✅ (Python) | ✅ (JSON + plugin .so) |

Pipeline-as-Config is the only approach that checks ALL boxes.


3. The Architecture: Pipeline-as-Config

3.1 Core Concept

Replace model-type dispatch with a declarative pipeline configuration. The runtime becomes a generic pipeline executor that:

  1. Loads whatever ONNX sessions the config declares
  2. Executes them in the order the config specifies
  3. Wires outputs→inputs using explicit dataflow declarations
  4. Manages state (KV cache, position IDs) per config-driven strategies
  5. Generates tokens using the standard (fully generic) generation loop
┌─────────────────────────────────────────────────┐
│              genai_config.json v2               │
│                                                  │
│  pipeline.extends: "autoregressive-decoder"      │
│  pipeline.sessions: {name → file}                │
│  pipeline.flow: [{run, when, loop}]              │
│  pipeline.dataflow: [{from, to}]                 │
│  pipeline.state: {kv_cache, position_ids}        │
│  pipeline.plugin: "libcustom.so" (optional)      │
│                                                  │
│  tokens: {pad, eos, bos}                         │
│  generation: {max_length, sampling, stop}        │
│  metadata: {model_type, source} (human-only)     │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│          Pipeline Factory (structural)           │
│                                                  │
│  No string dispatch. Config structure drives:    │
│  ┌─ DecoderPipeline (single session)             │
│  ├─ MultiSessionPipeline (2+ sessions)           │
│  ├─ EncoderDecoderPipeline (cross-attention)     │
│  └─ PluginPipeline (custom shared library)       │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│         Generic Pipeline Executor                │
│                                                  │
│  Interprets flow[] declaratively:                │
│  - when: init | step | final                     │
│  - loop: batched | per_image                     │
│                                                  │
│  State management (config-driven):               │
│  - KV cache (auto / separate / combined)         │
│  - Position IDs (auto / default / mrope_3d)      │
│  - Sliding window (from config)                  │
│                                                  │
│  Generation loop (fully generic, unchanged):     │
│  - Sampling, beam search, EOS detection          │
│  - Streaming output                              │
└─────────────────────────────────────────────────┘
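The structural selection in the Pipeline Factory box can be sketched in a few lines. This is an illustrative Python sketch, not the proposed C++ implementation: the function name `select_pipeline_kind`, the order of the checks, and the use of `state.cross_cache` as the cross-attention signal are all assumptions.

```python
def select_pipeline_kind(pipeline: dict) -> str:
    """Pick a pipeline class from config structure alone (no model_type lookup).

    Hypothetical rules, checked in order:
      1. a "plugin" entry            -> PluginPipeline
      2. a cross-attention cache     -> EncoderDecoderPipeline
      3. two or more sessions        -> MultiSessionPipeline
      4. otherwise                   -> DecoderPipeline
    """
    if "plugin" in pipeline:
        return "PluginPipeline"
    if "cross_cache" in pipeline.get("state", {}):
        return "EncoderDecoderPipeline"
    if len(pipeline.get("sessions", {})) >= 2:
        return "MultiSessionPipeline"
    return "DecoderPipeline"


print(select_pipeline_kind({"sessions": {"decoder": {"file": "model.onnx"}}}))
```

Note that `metadata.model_type` never appears in the function: the decision depends only on which keys and sessions the config declares.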

3.2 The flow Array — Execution Ordering

The flow array declares which sessions run, when, and how:

"flow": [
  {"run": "vision",    "when": "init", "loop": "per_image"},
  {"run": "embedding", "when": "init"},
  {"run": "decoder",   "when": "step"}
]

Lifecycle phases (fixed vocabulary — not Turing-complete):

  • when: "init" — run before the main generation loop (encoders, embedding projectors, preprocessing). Execution order follows the flow[] array. Covers what earlier drafts called once and prompt.
  • when: "step" — run every iteration of the generation loop (decoder in autoregressive, UNet in denoising)
  • when: "final" — run after the generation loop completes (vocoder in TTS, VAE decoder in diffusion)

These three phases map cleanly across all generation paradigms:

| Phase | Autoregressive | Denoising (Diffusion) | Single-Pass |
|---|---|---|---|
| init | Vision encoder, embedding, prompt processing | Text encoder, latent init | All sessions |
| step | Decoder (each token) | UNet (each denoising step) | N/A (no loop) |
| final | | VAE decoder → image | |
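To make the three phases concrete, here is a minimal Python sketch of an executor that groups flow stages by phase. It is a simulation under stated assumptions: sessions are stand-in callables, the loop runs a fixed `num_steps` instead of stopping on EOS or a scheduler, and `run_pipeline` is a hypothetical name.

```python
from typing import Callable, Dict, List


def run_pipeline(flow: List[dict],
                 sessions: Dict[str, Callable[[int], None]],
                 num_steps: int) -> List[str]:
    """Run flow stages grouped by lifecycle phase and return an execution trace."""
    trace: List[str] = []

    def run_phase(phase: str, step: int) -> None:
        # Within a phase, stages run in the order they appear in flow[].
        for stage in flow:
            if stage["when"] == phase:
                sessions[stage["run"]](step)
                trace.append(f"{phase}:{stage['run']}")

    run_phase("init", 0)
    for step in range(num_steps):  # a real loop would stop on EOS or a scheduler
        run_phase("step", step)
    run_phase("final", num_steps)
    return trace


flow = [
    {"run": "vision", "when": "init"},
    {"run": "embedding", "when": "init"},
    {"run": "decoder", "when": "step"},
]
sessions = {name: (lambda step: None) for name in ("vision", "embedding", "decoder")}
print(run_pipeline(flow, sessions, num_steps=2))
# ['init:vision', 'init:embedding', 'step:decoder', 'step:decoder']
```

The same loop serves a diffusion pipeline unchanged: only the flow entries differ (text encoder at init, UNet at step, VAE decoder at final).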

Loop modes (fixed vocabulary):

  • loop: "batched" — pass all inputs at once (default)
  • loop: "per_image" — iterate over inputs individually (Qwen VL, Pixtral)

Guardrails:

  • Fixed vocabulary for when and loop — no arbitrary conditions or iterations
  • Maximum 10 flow stages — prevents pathological configs
  • No if/else — anything that needs conditional logic uses the plugin API
  • Cycle detection in dataflow at load time
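The guardrails above could be enforced at load time roughly as follows. This Python sketch is illustrative only: `validate_flow` and `has_dataflow_cycle` are hypothetical names, and cycle detection is done at session granularity (the part of each endpoint before the first dot), which is an assumption about how the check would be scoped.

```python
ALLOWED_WHEN = {"init", "step", "final"}
ALLOWED_LOOP = {"batched", "per_image"}
MAX_FLOW_STAGES = 10


def validate_flow(flow, sessions):
    """Raise ValueError if flow[] breaks one of the guardrails."""
    if len(flow) > MAX_FLOW_STAGES:
        raise ValueError(f"flow has {len(flow)} stages; maximum is {MAX_FLOW_STAGES}")
    for index, stage in enumerate(flow):
        if stage.get("run") not in sessions:
            raise ValueError(f"flow[{index}]: unknown session {stage.get('run')!r}")
        if stage.get("when") not in ALLOWED_WHEN:
            raise ValueError(f"flow[{index}]: invalid when {stage.get('when')!r}")
        if stage.get("loop", "batched") not in ALLOWED_LOOP:
            raise ValueError(f"flow[{index}]: invalid loop {stage.get('loop')!r}")


def has_dataflow_cycle(dataflow):
    """Return True if the session-level dataflow graph contains a cycle."""
    edges = {}
    for edge in dataflow:
        src = edge["from"].split(".", 1)[0]
        dst = edge["to"].split(".", 1)[0]
        edges.setdefault(src, set()).add(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(edges))
```

Because the vocabulary is closed, validation is a set-membership test; nothing in the config needs to be evaluated or interpreted.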

3.3 The dataflow Array — Session Wiring

Optional. Declares how outputs from one session feed into inputs of another:

"dataflow": [
  {"from": "vision.image_features",     "to": "embedding.image_features"},
  {"from": "embedding.inputs_embeds",   "to": "decoder.inputs_embeds"}
]

When omitted, the runtime auto-matches by tensor name (an output name in session A that matches an input name in session B). When present, explicit entries override auto-matching, which covers cases where tensor names differ.
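A minimal sketch of that resolution order, in Python for brevity. The `session_io` structure and the function name `resolve_dataflow` are hypothetical, and two details are assumptions rather than part of the proposal: sessions are matched in declaration order, and when several earlier sessions produce the same tensor name the latest producer wins.

```python
def resolve_dataflow(session_io, explicit=()):
    """Build the input wiring for a pipeline.

    session_io: {session: {"inputs": [...], "outputs": [...]}} in flow order.
    explicit:   dataflow entries such as {"from": "vision.x", "to": "embedding.y"}.
    Returns {(dst_session, input_name): (src_session, output_name)}.
    """
    wiring = {}
    names = list(session_io)

    # Auto-match: an output of an earlier session feeds a same-named input
    # of any later session.
    for i, src in enumerate(names):
        for dst in names[i + 1:]:
            for tensor in session_io[src]["outputs"]:
                if tensor in session_io[dst]["inputs"]:
                    wiring[(dst, tensor)] = (src, tensor)

    # Explicit entries are applied last, so they override auto-matching.
    for edge in explicit:
        src, out_name = edge["from"].split(".", 1)
        dst, in_name = edge["to"].split(".", 1)
        wiring[(dst, in_name)] = (src, out_name)
    return wiring


session_io = {
    "vision": {"inputs": ["pixel_values"], "outputs": ["image_features"]},
    "embedding": {"inputs": ["input_ids", "image_features"], "outputs": ["inputs_embeds"]},
    "decoder": {"inputs": ["inputs_embeds"], "outputs": ["logits"]},
}
print(resolve_dataflow(session_io))
```

Inputs that end up with no entry in the result (here `pixel_values` and `input_ids`) are the ones the caller or the preprocessing layer must supply.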

3.4 The state Object — KV Cache & Position Strategy

"state": {
  "kv_cache": {
    "format": "auto",
    "past_key_pattern": "past_key_values.{layer}.key",
    "present_key_pattern": "present.{layer}.key",
    "past_value_pattern": "past_key_values.{layer}.value",
    "present_value_pattern": "present.{layer}.value"
  },
  "position_ids": {
    "strategy": "auto",
    "input_name": "position_ids"
  }
}

KV cache formats:

  • "auto" — introspect ONNX session I/O to detect format (default)
  • "separate" — standard past_key_values.{layer}.key / present.{layer}.key
  • "combined" — GPT-2 style past_{layer} / present_{layer}
  • Name patterns are optional overrides when auto-detection fails
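As a rough illustration of `"auto"` detection and of the `{layer}` name patterns, the Python sketch below classifies a session's input names. It assumes the two naming schemes listed above are the only ones to recognize; `detect_kv_format` and `pattern_to_regex` are hypothetical helper names, and returning `"none"` for a session without cache inputs is an assumption.

```python
import re

SEPARATE = re.compile(r"^past_key_values\.(\d+)\.(key|value)$")
COMBINED = re.compile(r"^past_(\d+)$")


def detect_kv_format(input_names):
    """Classify the KV cache layout from ONNX session input names."""
    if any(SEPARATE.match(name) for name in input_names):
        return "separate"
    if any(COMBINED.match(name) for name in input_names):
        return "combined"
    return "none"  # no cache inputs found (assumed fallback)


def pattern_to_regex(pattern):
    """Turn 'past_key_values.{layer}.key' into a regex capturing the layer index."""
    escaped = re.escape(pattern).replace(re.escape("{layer}"), r"(\d+)")
    return re.compile("^" + escaped + "$")


print(detect_kv_format(["input_ids", "past_key_values.0.key", "past_key_values.0.value"]))
print(pattern_to_regex("present.{layer}.key").match("present.7.key").group(1))
```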

Position ID strategies:

  • "auto" — introspect position_ids input shape: rank 2 → default 1D, rank 3 → mRoPE 3D
  • "default" — standard 1D position IDs
  • "mrope_3d" — 3-dimensional mRoPE (temporal, height, width)
  • "windowed" — sliding window position tracking
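The `"auto"` rule for position IDs reduces to a rank check. A small Python sketch, assuming input shapes are available as lists that may mix integers and symbolic dimension names; `detect_position_strategy` is a hypothetical name and the `"none"` result for models without a `position_ids` input is an assumption.

```python
def detect_position_strategy(input_shapes, input_name="position_ids"):
    """Infer the position ID strategy from the rank of the position_ids input."""
    shape = input_shapes.get(input_name)
    if shape is None:
        return "none"  # the model takes no position_ids input (assumed fallback)
    if len(shape) == 2:
        return "default"   # [batch, sequence]
    if len(shape) == 3:
        return "mrope_3d"  # [3, batch, sequence]: temporal, height, width
    raise ValueError(f"unsupported {input_name} rank: {len(shape)}")


print(detect_position_strategy({"position_ids": ["batch", "seq"]}))
print(detect_position_strategy({"position_ids": [3, "batch", "seq"]}))
```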

3.5 The extends Mechanism — Preset Inheritance

Built-in presets eliminate boilerplate for common patterns:

{"pipeline": {"extends": "autoregressive-decoder"}}

Built-in presets:

| Preset Name | What It Expands To |
|---|---|
| autoregressive-decoder | Single decoder session, default KV cache, default position IDs, flow: [{run: decoder, when: step}] |
| vision-language | Vision + embedding + decoder sessions, batched vision, default KV cache |
| encoder-decoder | Encoder (init) + decoder (step), cross-attention KV cache |
| speech-language | Speech encoder + embedding + decoder |

Presets are resolved at load time — the runtime sees a fully expanded config. Overrides replace preset defaults:

{
  "pipeline": {
    "extends": "vision-language",
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "state": {
      "position_ids": {"strategy": "mrope_3d", "grid_source": "vision.image_grid_thw"}
    }
  }
}
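Preset resolution can be sketched as a recursive merge. The Python below is an illustration under assumptions: the preset body shown is a guess at what `autoregressive-decoder` expands to, and the merge rule (objects merge key by key, while arrays such as `flow` and scalars are replaced wholesale) is one plausible reading of "overrides replace preset defaults", not a confirmed specification.

```python
import copy

# Hypothetical preset contents, for illustration only.
PRESETS = {
    "autoregressive-decoder": {
        "flow": [{"run": "decoder", "when": "step"}],
        "state": {
            "kv_cache": {"format": "auto"},
            "position_ids": {"strategy": "auto"},
        },
    },
}


def merge(base, override):
    """Objects merge key by key; arrays and scalars in the override replace the base."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve_extends(pipeline):
    """Expand pipeline.extends into a full config with no 'extends' key left."""
    name = pipeline.get("extends")
    if name is None:
        return copy.deepcopy(pipeline)
    if name not in PRESETS:
        raise KeyError(f"unknown preset: {name}")
    overrides = {k: v for k, v in pipeline.items() if k != "extends"}
    return merge(PRESETS[name], overrides)


resolved = resolve_extends({
    "extends": "autoregressive-decoder",
    "sessions": {"decoder": {"file": "model.onnx"}},
    "state": {"position_ids": {"strategy": "mrope_3d"}},
})
print(resolved["state"])
```

With this rule, overriding only `state.position_ids` keeps the preset's `kv_cache` settings, while supplying a `flow` array replaces the preset's flow entirely.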

3.6 The Plugin API — Escape Hatch

For genuinely novel architectures that can't be expressed as standard pipelines (RNNT, SSM/Mamba, diffusion):

{
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    }
  }
}

C++ plugin interface:

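// Stable C ABI: plugins are compiled separately from the runtime, so only
// C-compatible types cross the boundary (no std::shared_ptr or std::unique_ptr).
// GenAiPipeline and DestroyRnntPipeline are illustrative names.
extern "C" {
  typedef struct GenAiPipeline GenAiPipeline;  // opaque handle owned by the plugin
  GenAiPipeline* CreateRnntPipeline(const OrtEnv* env, const char* config_json);
  void DestroyRnntPipeline(GenAiPipeline* pipeline);
}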

The plugin registers a Pipeline factory, not a Model factory — keeping the abstraction consistent. Plugins extend the pipeline type system for the ~1% of models that can't fit the declarative config.

3.7 Preprocessing: Image, Audio, and Variable Input Shapes

Preprocessing is NOT the pipeline executor's job. It transforms raw inputs (pixels, audio waveforms) into model-ready tensors. This happens BEFORE the pipeline runs and is handled by a separate, config-driven preprocessing layer.

The Architecture Boundary

Raw Input (images, audio, text)
        │
        ▼
┌─────────────────────────────┐
│   Preprocessing Layer       │
│   (ort-extensions)          │
│                             │
│   image_processor.json  ────┤──→ pixel_values, image_sizes, grid_thw
│   audio_processor.json  ────┤──→ audio_features, audio_sizes
│   tokenizer.json        ────┤──→ input_ids, attention_mask
│                             │
│   Config-driven.            │
│   Zero model-specific C++.  │
└──────────────┬──────────────┘
               │ model-ready tensors
               ▼
┌─────────────────────────────┐
│   Pipeline Executor         │
│   (this proposal)           │
└─────────────────────────────┘

Image Preprocessing — ort-extensions image_processor.json

Each VLM ships an image_processor.json that declares its preprocessing pipeline:

{
  "image_processor_type": "Qwen2VLImageProcessor",
  "resample": "bicubic",
  "do_resize": true,
  "size": {"min_pixels": 3136, "max_pixels": 12845056},
  "do_rescale": true,
  "rescale_factor": 0.00392156862745098,
  "do_normalize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "patch_size": 14,
  "merge_size": 2
}

ort-extensions loads this JSON and executes the preprocessing pipeline using its own C++ ops. No model-type dispatch needed. Different VLMs (Phi3v at 336×336, Qwen2.5-VL with dynamic resolution, Pixtral with variable per-image sizes) all use the same mechanism — they just ship different image_processor.json configs.
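The arithmetic behind `do_rescale` and `do_normalize` is simple enough to show on a single pixel. This Python sketch only illustrates how the config fields combine (rescale first, then per-channel normalization); it is not ort-extensions code, it ignores resizing and patch extraction, and `normalize_pixel` is a hypothetical name.

```python
def normalize_pixel(rgb, config):
    """Apply do_rescale and do_normalize to one RGB pixel given as 0-255 values."""
    values = [float(v) for v in rgb]
    if config.get("do_rescale"):
        # rescale_factor 0.00392156862745098 is 1/255, mapping 0-255 to 0-1.
        values = [v * config["rescale_factor"] for v in values]
    if config.get("do_normalize"):
        values = [
            (v - mean) / std
            for v, mean, std in zip(values, config["image_mean"], config["image_std"])
        ]
    return values


config = {
    "do_rescale": True,
    "rescale_factor": 0.00392156862745098,
    "do_normalize": True,
    "image_mean": [0.48145466, 0.4578275, 0.40821073],
    "image_std": [0.26862954, 0.26130258, 0.27577711],
}
print(normalize_pixel((255, 255, 255), config))
```

Two models that differ only in mean, std, or rescale factor need different JSON values, not different code.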

The C++ preprocessors (PhiImageProcessor, QwenImageProcessor, GemmaImageProcessor, Mistral3ImageProcessor) become legacy. New models use ort-extensions exclusively. This is the existing direction: mobius already generates image_processor.json for all VLMs.

Audio Preprocessing — audio_processor.json

Same pattern for speech models:

{
  "audio_processor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "chunk_length": 30,
  "n_fft": 400
}
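These fields also determine the tensor shape the encoder receives. The Python sketch below works through the arithmetic, assuming every chunk is padded or truncated to `chunk_length` seconds and that one feature frame is produced per hop; `mel_feature_shape` is a hypothetical helper.

```python
def mel_feature_shape(config):
    """Return (mel_bins, frames) for one padded audio chunk."""
    samples = config["chunk_length"] * config["sampling_rate"]  # 30 s * 16000 Hz
    frames = samples // config["hop_length"]                    # one frame per hop
    return (config["feature_size"], frames)


audio_config = {
    "feature_size": 128,
    "sampling_rate": 16000,
    "hop_length": 160,
    "chunk_length": 30,
    "n_fft": 400,
}
print(mel_feature_shape(audio_config))  # (128, 3000)
```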

For multimodal models needing both image and audio (Phi4mm):

"preprocessing": {
  "image": {"config": "image_processor.json"},
  "audio": {"config": "audio_processor.json"}
}

Variable Input Shapes

Different models handle input sizes differently. The pipeline config + preprocessor config handle all cases:

| Pattern | Example | How It's Handled |
|---|---|---|
| Fixed-size | Phi3v (336×336 images) | image_processor.json resizes to target. ONNX model has static shapes. |
| Dynamic-size | Qwen2.5-VL (arbitrary resolution) | image_processor.json does dynamic resize + patch extraction. ONNX model has dynamic shapes. Pipeline executor passes tensors as-is. |
| Per-image variable | Pixtral (each image different resolution) | Preprocessor zero-pads to max(H)×max(W), provides image_sizes[N, 2]. Pipeline flow uses loop: per_image + dynamic_shape for per-image slicing. |

For the per-image variable resolution case:

{"run": "vision", "when": "init", "loop": "per_image",
 "loop_over": "pixel_values",
 "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}

The executor slices pixel_values[i, :, :H_i, :W_i] where H_i and W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
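The slicing rule can be shown with plain nested lists. This Python sketch stands in for the generic loop; it assumes an `[N][C][H_max][W_max]` layout with `image_sizes` given as `(H_i, W_i)` pairs, and `slice_per_image` is a hypothetical name. A real executor would slice tensors rather than Python lists.

```python
def slice_per_image(pixel_values, image_sizes):
    """Yield each image cropped back to its true size, dropping zero padding.

    pixel_values: nested lists shaped [N][C][H_max][W_max]
    image_sizes:  [N][2] holding (H_i, W_i) per image
    """
    for image, (height, width) in zip(pixel_values, image_sizes):
        # Equivalent to pixel_values[i, :, :H_i, :W_i] on a tensor.
        yield [[row[:width] for row in channel[:height]] for channel in image]


padded = [
    [[[1, 2, 3], [4, 5, 6]]],  # image 0: 1 channel, true size 2x3
    [[[7, 8, 0], [0, 0, 0]]],  # image 1: 1 channel, true size 1x2, zero-padded
]
sizes = [(2, 3), (1, 2)]
for cropped in slice_per_image(padded, sizes):
    print(cropped)
```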

Pipeline Config Reference

The pipeline config references preprocessing configs without embedding their details:

"preprocessing": {
  "image": {"config": "image_processor.json", "format": "ort-extensions"},
  "audio": {"config": "audio_processor.json", "format": "ort-extensions"}
}

The format field future-proofs beyond ort-extensions — today it's the only value, but it enables alternative preprocessing backends (e.g., a pure ONNX preprocessing graph) without schema changes.

This clean boundary means: preprocessing is fully described by its own config files (already supported by ort-extensions), and the pipeline executor receives model-ready tensors without knowing how they were produced. Note that half the pipeline-as-config vision is already shipped and working in production via ort-extensions — we're completing the other half for inference orchestration.

Every layer of the stack is config-driven. Zero model-specific C++ anywhere.

3.8 Advanced KV Cache Patterns (Shared Cache, Dual Head Dim)

Some models have non-uniform KV cache layouts. Gemma4 is the most complex example:

  • Dual head_dim: Local (sliding_attention) layers use head_dim=256; global (full_attention) layers use global_head_dim=512
  • KV sharing: The last num_kv_shared_layers layers reuse K/V from earlier layers and have NO independent cache entries
  • Mixed window sizes: Sliding window layers have bounded cache; full attention layers have unbounded cache

These are all handled by auto-detection — no model-specific code needed.

ORT GenAI's DefaultKeyValueCache already supports:

  • Sparse layer indices (kv_layer_indices_): Auto-discovered by scanning which past_key_values.{N}.key inputs exist in the ONNX session. If the model only has layers 0-25 (skipping 26-33 due to KV sharing), the cache allocates only 26 entries.
  • Per-layer shapes (layer_shapes_): Each layer can have a different [batch, heads, seq_len, head_dim] shape, auto-discovered from the ONNX session output shapes.
  • Per-layer sliding window: Configurable which layers use bounded cache vs unbounded.

The pipeline config for a Gemma4-style model:

"state": {
  "kv_cache": {
    "format": "auto",
    "sliding_window": {
      "window_size": 4096,
      "layers": [0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25]
    }
  }
}

format: auto handles the dual head_dim and sparse layers automatically. The sliding_window.layers array (already supported by ORT GenAI today) specifies which layers use bounded cache. The export tool (mobius) builds the ONNX model with cache I/O only for non-shared layers, so the runtime never needs to know about KV sharing — it's implicit in the graph structure.
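A rough Python sketch of how sparse layer discovery and the `sliding_window.layers` list could combine into a per-layer cache plan. It mirrors the behavior described above rather than the actual `DefaultKeyValueCache` code; `discover_kv_layers` and `plan_cache` are hypothetical names, and `None` is used here to mean an unbounded cache.

```python
import re

KEY_INPUT = re.compile(r"^past_key_values\.(\d+)\.key$")


def discover_kv_layers(input_names):
    """Return the sorted layer indices that have a cache input in the ONNX session."""
    layers = []
    for name in input_names:
        match = KEY_INPUT.match(name)
        if match:
            layers.append(int(match.group(1)))
    return sorted(layers)


def plan_cache(input_names, sliding_window=None):
    """Map each cached layer to its window size (None means unbounded)."""
    windowed = set(sliding_window["layers"]) if sliding_window else set()
    size = sliding_window["window_size"] if sliding_window else None
    return {
        layer: (size if layer in windowed else None)
        for layer in discover_kv_layers(input_names)
    }


# Layers 3 and 4 share K/V with earlier layers, so the graph has no inputs for them.
names = ["input_ids"] + [
    f"past_key_values.{i}.{kind}" for i in (0, 1, 2, 5) for kind in ("key", "value")
]
print(plan_cache(names, {"window_size": 4096, "layers": [0, 1]}))
```

Layers absent from the graph never appear in the plan, which is how KV sharing stays invisible to the runtime.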

3.9 Preprocessor↔Model Shape Alignment

Problem: Different models expect different image sizes and preprocessing. How do we ensure the preprocessor output matches the model's expectations without model-specific code?

Answer: Co-generation. The export tool (mobius) generates BOTH the ONNX model and its image_processor.json from the same HuggingFace config. They're guaranteed to be aligned because they share a single source of truth:

HuggingFace Config
├── → ONNX model (expects specific input shapes)
└── → image_processor.json (produces those exact shapes)

For additional safety, the pipeline config can include optional shape validation:

"preprocessing": {
  "image": {
    "config": "image_processor.json",
    "format": "ort-extensions",
    "expected_outputs": {
      "pixel_values": {"rank": 4, "dtype": "float32"},
      "image_grid_thw": {"rank": 2, "dtype": "int64"}
    }
  }
}

The expected_outputs field enables load-time validation: verify that the preprocessor config produces tensors compatible with the model's inputs before running inference. This catches mismatches at load time rather than inference time.

If someone provides the wrong preprocessor, ONNX Runtime throws a shape mismatch error at session.Run(), which is already a clear, debuggable failure. The optional validation catches it earlier.
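The optional check could look like the following Python sketch. It assumes the runtime can obtain a name-to-descriptor map of the tensors the preprocessor produces (the `actual` argument, whose structure is hypothetical) and compares only the `rank` and `dtype` fields shown in `expected_outputs`.

```python
def check_expected_outputs(expected, actual):
    """Compare declared preprocessor outputs with observed tensor descriptors.

    expected: {"name": {"rank": int, "dtype": str}} from the pipeline config
    actual:   {"name": {"shape": [...], "dtype": str}} (hypothetical descriptor map)
    Returns a list of human-readable problems; an empty list means compatible.
    """
    problems = []
    for name, spec in expected.items():
        tensor = actual.get(name)
        if tensor is None:
            problems.append(f"{name}: not produced by the preprocessor")
            continue
        if "rank" in spec and len(tensor["shape"]) != spec["rank"]:
            problems.append(f"{name}: rank {len(tensor['shape'])}, expected {spec['rank']}")
        if "dtype" in spec and tensor["dtype"] != spec["dtype"]:
            problems.append(f"{name}: dtype {tensor['dtype']}, expected {spec['dtype']}")
    return problems


expected = {
    "pixel_values": {"rank": 4, "dtype": "float32"},
    "image_grid_thw": {"rank": 2, "dtype": "int64"},
}
actual = {"pixel_values": {"shape": [1, 3, 336, 336], "dtype": "float16"}}
for problem in check_expected_outputs(expected, actual):
    print(problem)
```

Returning every problem at once, rather than failing on the first, gives the model builder a complete report in a single load attempt.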

3.10 The metadata Section

model_type lives here — as documentation, not dispatch:

"metadata": {
  "model_type": "qwen2_5_vl",
  "architecture": "Qwen2_5VLForConditionalGeneration",
  "source": "mobius",
  "export_version": "0.5.0"
}

Used for: logging, telemetry, debugging, human readability. Ignored by: all dispatch and runtime logic.


4. Concrete Schema Examples

4.1 Decoder-Only LLM (Minimal, 10 lines)

{
  "version": 2,
  "pipeline": {
    "extends": "autoregressive-decoder",
    "sessions": {"decoder": {"file": "model.onnx"}}
  },
  "tokens": {"eos": [151645], "pad": 0},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2", "source": "mobius"}
}

4.2 Vision-Language Model (Qwen2.5-VL style, 33 lines)

{
  "version": 2,
  "pipeline": {
    "extends": "vision-language",
    "sessions": {
      "vision":    {"file": "vision_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder":   {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision",    "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder",   "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features",   "to": "embedding.image_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {
        "strategy": "mrope_3d",
        "grid_source": "vision.image_grid_thw"
      }
    },
    "preprocessing": {
      "image": {"config": "image_processor.json"}
    }
  },
  "tokens": {"eos": [151645], "pad": 0, "image_token": 151655},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2_5_vl", "source": "mobius"}
}

4.3 Encoder-Decoder (Whisper style)

{
  "version": 2,
  "pipeline": {
    "extends": "encoder-decoder",
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "encoder", "when": "init"},
      {"run": "decoder", "when": "step", "cross_attention_from": "encoder"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "cross_cache": {"source": "encoder", "frozen": true}
    }
  },
  "tokens": {"eos": [50257], "pad": 50257, "decoder_start": 50258},
  "generation": {"max_length": 448},
  "metadata": {"model_type": "whisper", "source": "mobius"}
}

4.4 Multimodal (Vision + Audio — Phi4mm style)

{
  "version": 2,
  "pipeline": {
    "sessions": {
      "vision":    {"file": "vision_encoder/model.onnx"},
      "speech":    {"file": "audio_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder":   {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision",    "when": "init", "loop": "batched"},
      {"run": "speech",    "when": "init", "loop": "batched"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder",   "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "speech.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {"strategy": "default"}
    }
  },
  "tokens": {"eos": [32007], "pad": 32000},
  "generation": {"max_length": 4096},
  "metadata": {"model_type": "phi4mm", "source": "mobius"}
}

4.5 Novel Architecture via Plugin (RNNT)

{
  "version": 2,
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    },
    "sessions": {
      "encoder":  {"file": "encoder/model.onnx"},
      "predictor": {"file": "predictor/model.onnx"},
      "joiner":   {"file": "joiner/model.onnx"}
    }
  },
  "metadata": {"model_type": "nemotron_speech", "source": "mobius"}
}

5. Implementation Plan

Overview

This is a refactor, not a rewrite. The generation loop, search/sampling, tokenizer, KV cache internals, and all language bindings remain unchanged. We're replacing the model dispatch layer with a pipeline dispatch layer.

Net code change estimate: about +950 lines added and -1500 lines deleted, for a net of roughly -550 (per-PR breakdown in the Implementation Summary table). The codebase gets smaller.

PR 1: Config Schema v2 Parser + Backward Compatibility (~300 LOC)

Files changed:

  • src/config.h — Add Pipeline struct with sessions, flow, dataflow, state, extends fields
  • src/config.cpp — Parse v2 schema; add v1→v2 translator that converts old-format configs to pipeline format
  • New: src/pipeline_presets.h — Built-in preset definitions (autoregressive-decoder, vision-language, encoder-decoder, speech-language)

Logic:

// In Config constructor:
if (json.contains("version") && json["version"] == 2) {
  ParsePipelineConfig(json);  // New v2 path
} else {
  ParseLegacyConfig(json);    // Existing v1 path
  TranslateV1ToV2();          // Convert to pipeline format internally
}

Backward compatibility guarantee: Every existing genai_config.json produces an identical internal Pipeline struct after translation. The v1→v2 translator maps:

  • model.type + model_type.h classification → appropriate preset
  • model.decoder.inputs/outputs → state.kv_cache patterns
  • model.vision/speech/embedding sections → sessions + flow + dataflow

Tests: All existing config tests pass unchanged. New tests for v2 parsing, preset resolution, extends override logic.

PR 2: PipelineExecutor Class (~350 LOC)

Files changed:

  • New: src/models/pipeline_executor.h — PipelineExecutor class definition
  • New: src/models/pipeline_executor.cpp — Implementation
  • src/models/model.cpp — Replace CreateModel() with CreatePipeline() using structural detection

The core class:

class PipelineExecutor : public State {
public:
  PipelineExecutor(std::unique_ptr<Config> config, OrtEnv& env);
  
  DeviceSpan<float> RunStep(int total_length, DeviceSpan<int32_t>& next_tokens,
                            DeviceSpan<int32_t> next_indices) override;
  
private:
  // Loaded from config
  std::map<std::string, std::unique_ptr<OrtSession>> sessions_;
  std::vector<FlowStep> prompt_flow_;   // Steps where when != "always"
  std::vector<FlowStep> decode_flow_;   // Steps where when == "always"
  std::vector<DataflowWire> dataflow_;
  
  // State (auto-detected or config-driven)
  std::unique_ptr<KeyValueCache> kv_cache_;
  std::unique_ptr<PositionStrategy> position_ids_;
  DefaultInputIDs input_ids_{*this};
  Logits logits_{*this};
  
  // Data flow between sessions
  std::map<std::string, std::unique_ptr<OrtValue>> intermediates_;
  
  bool is_prompt_{true};
  
  void WireInputs(const FlowStep& step);
  void WireOutputs(const FlowStep& step);
  void RunFlowStep(const FlowStep& step, bool graph_capture);
};

Structural detection in CreatePipeline() (replaces CreateModel()):

std::shared_ptr<Model> CreatePipeline(OrtEnv& env, std::unique_ptr<Config> config) {
  auto& pipeline = config->pipeline;
  
  // Plugin escape hatch
  if (pipeline.plugin.has_value()) {
    return LoadPluginPipeline(pipeline.plugin.value(), std::move(config), env);
  }
  
  // Structural detection — no string dispatch
  bool has_encoder_with_cross_attn = HasCrossAttentionFlow(pipeline.flow);
  bool has_multiple_sessions = pipeline.sessions.size() > 1;
  
  if (has_encoder_with_cross_attn) {
    return std::make_shared<EncoderDecoderPipeline>(std::move(config), env);
  }
  if (has_multiple_sessions) {
    return std::make_shared<MultiSessionPipeline>(std::move(config), env);
  }
  return std::make_shared<DecoderPipeline>(std::move(config), env);
}
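As a rough illustration of the structural checks above, the sketch below classifies a pipeline from its flow alone. FlowStepInfo, PipelineKind, and ClassifyPipeline are hypothetical names invented for this example; the only assumption taken from the proposal is that a cross-attention step carries a non-empty cross_attention_from field, as in the Whisper config in Section 4.3.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of one flow[] entry.
struct FlowStepInfo {
  std::string run;                   // session name, e.g. "decoder"
  std::string cross_attention_from;  // empty when the step has no cross-attention source
};

// Hypothetical labels for the three pipeline classes named in CreatePipeline().
enum class PipelineKind { Decoder, MultiSession, EncoderDecoder };

// True if any flow step declares a cross-attention source.
bool HasCrossAttentionFlow(const std::vector<FlowStepInfo>& flow) {
  for (const auto& step : flow) {
    if (!step.cross_attention_from.empty()) return true;
  }
  return false;
}

// Same branch order as CreatePipeline(): cross-attention first, then session count.
PipelineKind ClassifyPipeline(std::size_t num_sessions,
                              const std::vector<FlowStepInfo>& flow) {
  if (HasCrossAttentionFlow(flow)) return PipelineKind::EncoderDecoder;
  if (num_sessions > 1) return PipelineKind::MultiSession;
  return PipelineKind::Decoder;
}
```

The order of the checks matters: an encoder-decoder pipeline also has more than one session, so the cross-attention test has to run first.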

PR 3: Flow Interpreter + Dataflow Wiring (~200 LOC)

Files changed:

  • New: src/models/flow_interpreter.h/.cpp — Interprets flow[] and dataflow[]
  • src/models/pipeline_executor.cpp — Uses flow interpreter

Key logic:

void PipelineExecutor::RunFlowStep(const FlowStep& step, bool graph_capture) {
  // at() throws on an undeclared session instead of inserting a null entry
  auto& session = sessions_.at(step.session_name);
  
  if (step.loop == LoopMode::PerImage) {
    // Per-image loop: iterate over input tensor's batch dimension
    auto input_slices = SliceTensorDim0(GetInput(step, step.loop_over));
    // Held by pointer, matching the value type of intermediates_
    std::vector<std::unique_ptr<OrtValue>> output_parts;
    for (auto& slice : input_slices) {
      BindSlicedInput(step, slice);
      session->Run();
      output_parts.push_back(CaptureOutput(step));
    }
    intermediates_[step.output_key] = ConcatenateDim0(output_parts);
  } else {
    // Standard batched execution
    WireInputs(step);
    session->Run(graph_capture);
    WireOutputs(step);
  }
}

Dataflow wiring:

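void PipelineExecutor::WireInputs(const FlowStep& step) {
  for (auto& wire : dataflow_) {
    if (wire.to_session != step.session_name) continue;
    // Bind the upstream session's captured output to this session's input.
    // find() is used instead of operator[], which would silently insert a
    // null entry if the producing step has not run yet.
    auto it = intermediates_.find(wire.from_key);
    if (it == intermediates_.end()) {
      throw std::runtime_error("Dataflow source not available: " + wire.from_key);
    }
    BindInput(step, wire.to_input_name, it->second);
  }
}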
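The dataflow entries in the configs name each side of a wire as a single string such as "vision.image_features". A minimal sketch of how such a string might be split into the session and tensor parts follows. Endpoint and ParseEndpoint are hypothetical names, and splitting at the first dot is an assumption: it presumes session names contain no dots, while tensor names may.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper type: one side of a dataflow wire.
struct Endpoint {
  std::string session;  // e.g. "vision"
  std::string tensor;   // e.g. "image_features"
};

// Splits "session.tensor" at the first '.'.
// Throws if the dot is missing or either half is empty.
Endpoint ParseEndpoint(const std::string& spec) {
  const std::string::size_type dot = spec.find('.');
  if (dot == std::string::npos || dot == 0 || dot + 1 == spec.size()) {
    throw std::invalid_argument(
        "Dataflow endpoint must look like \"session.tensor\", got \"" + spec + "\"");
  }
  return Endpoint{spec.substr(0, dot), spec.substr(dot + 1)};
}
```

Splitting at the first dot rather than the last keeps dotted tensor names intact, so "decoder.present.0.key" would resolve to session "decoder" and tensor "present.0.key".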

PR 4: Plugin API (~100 LOC)

Files changed:

  • New: src/models/plugin_api.h — Stable C ABI for pipeline plugins
  • New: src/models/plugin_loader.cpp — Dynamic library loading (dlopen/LoadLibrary)

Interface:

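// plugin_api.h: ships with ORT GenAI headers
// Caveat: std::shared_ptr, std::unique_ptr, and C++ references in this
// signature tie a plugin to the host's compiler and standard library.
// A strictly stable C ABI would exchange opaque handles instead.
extern "C" {
  typedef std::shared_ptr<Model> (*PipelineFactoryFn)(
    OrtEnv& env, std::unique_ptr<Config> config);
}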

// In plugin .so/.dll:
extern "C" {
  std::shared_ptr<Model> CreateRnntPipeline(
    OrtEnv& env, std::unique_ptr<Config> config) {
    return std::make_shared<RnntPipeline>(std::move(config), env);
  }
}

PR 5: Delete Model-Type Dispatch (~-1500 LOC)

Files deleted:

  • src/models/model_type.h — The entire file

Files simplified:

  • src/models/model.cpp — Remove CreateModel() if-else chain, replace with CreatePipeline()
  • src/models/position_inputs.cpp — Remove IsQwenVLFamily() check; position strategy comes from config
  • src/models/multi_modal.cpp — Remove CreateVisionState() model_type dispatch; vision loop mode comes from config flow

Files eventually deprecated (kept for v1 compat, removed in future release):

  • src/models/gpt.h/cpp — Absorbed into generic pipeline with kv_cache.format: combined
  • Per-model C++ preprocessors (phi_image_processor, gemma_image_processor, etc.) — Replaced by ort-extensions image_processor.json

Implementation Summary

| PR | Description | LOC Added | LOC Deleted | Net |
|---|---|---|---|---|
| PR 1 | Config v2 parser + v1 translator | +300 | -0 | +300 |
| PR 2 | PipelineExecutor classes | +350 | -0 | +350 |
| PR 3 | Flow interpreter + dataflow | +200 | -0 | +200 |
| PR 4 | Plugin API | +100 | -0 | +100 |
| PR 5 | Delete model_type dispatch | +0 | -1500 | -1500 |
| Total | | +950 | -1500 | -550 |

The codebase shrinks by ~550 lines while gaining full model-agnostic extensibility.


6. Compatibility Matrix

| Model Scenario | Today | After PR 1-2 | After PR 1-5 |
|---|---|---|---|
| Existing Llama/Phi/Gemma (v1 config) | ✅ Works | ✅ Works (v1→v2 translator) | ✅ Works (translator) |
| New decoder-only LLM (unknown type) | ❌ Rejected by whitelist | ✅ 7-line v2 config | ✅ 7-line v2 config |
| Custom fine-tune with custom model_type | ❌ Rejected by whitelist | ✅ extends preset | ✅ extends preset |
| New VLM family | ❌ Needs new C++ class + processor | ✅ ~25-line v2 config | ✅ ~25-line v2 config |
| Qwen2.5-VL (3D mRoPE, per-image vision) | ✅ Hardcoded | ✅ v2 config with position_strategy + loop | ✅ Config-driven |
| Pixtral/Mistral3 (variable resolution) | ✅ Hardcoded | ✅ v2 config with per_image loop + dynamic_shape | ✅ Config-driven |
| Whisper (encoder-decoder) | ✅ Hardcoded | ✅ v2 config with encoder-decoder preset | ✅ Config-driven |
| GPT-2 (combined KV cache) | ✅ Hardcoded (separate class) | ✅ v2 config with kv_cache.format: combined | ✅ Config-driven |
| Mamba/SSM (recurrent, no KV) | ❌ Not supported | ⚠️ Needs state.type: recurrent | ✅ Config-driven |
| RNNT (non-autoregressive) | ✅ Hardcoded | ✅ Plugin .so | ✅ Plugin .so |
| Novel architecture (unknown future) | ❌ Major C++ work | ✅ Plugin .so, zero runtime changes | ✅ Plugin .so |
| Phi4mm (vision + audio) | ✅ Hardcoded | ✅ v2 config with 4 sessions | ✅ Config-driven |

7. Technical Feasibility

7.1 CUDA Graph Capture

Concern: CUDA graphs require identical session topology and buffer shapes between captures and replays. Does a generic pipeline executor break this?

Answer: No. The executor pre-computes a "decode flow" (steps where when: "step") at init time. During token generation, only the decode flow runs — this is a fixed, repeatable sequence identical to what the current DecoderOnly_State::Run() does. CUDA graph capture applies to this fixed sequence:

bool graph_capture = !is_prompt_ && params_->use_graph_capture 
                     && input_ids_.GetShape()[1] == 1;
// Only the decode_flow_ steps run — topology is fixed
for (auto& step : decode_flow_) {
  RunFlowStep(step, graph_capture);
}
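The "pre-computed at init time" part can be pictured as a one-time partition of flow[] by phase. The sketch below is an assumption-laden illustration, not the proposed implementation: FlowStepSpec, PhasedFlow, and PartitionFlow are hypothetical names, and it uses the init/step/final phase names from the terminology note rather than the earlier once/prompt/always names.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified flow[] entry.
struct FlowStepSpec {
  std::string session;  // value of "run"
  std::string when;     // "init", "step", or "final"
};

// Flow steps grouped by phase; built once when the config is loaded.
struct PhasedFlow {
  std::vector<FlowStepSpec> init;         // run once before the loop
  std::vector<FlowStepSpec> step;         // the fixed per-token "decode flow"
  std::vector<FlowStepSpec> final_steps;  // run once after the loop
};

// Groups steps by phase, preserving config order within each phase.
// Unknown phase names are rejected at load time.
PhasedFlow PartitionFlow(const std::vector<FlowStepSpec>& flow) {
  PhasedFlow out;
  for (const auto& s : flow) {
    if (s.when == "init") {
      out.init.push_back(s);
    } else if (s.when == "step") {
      out.step.push_back(s);
    } else if (s.when == "final") {
      out.final_steps.push_back(s);
    } else {
      throw std::invalid_argument("Unknown flow phase \"" + s.when + "\"");
    }
  }
  return out;
}
```

Because the step list is fixed after this partition, every decode iteration replays the same sequence of sessions, which is the property graph capture depends on.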

7.2 Memory Pre-allocation

Concern: The current code pre-allocates KV cache buffers based on model dimensions. Can a generic executor do this without model-specific knowledge?

Answer: Yes. KV cache dimensions come from config (decoder.num_hidden_layers, decoder.num_key_value_heads, decoder.head_size) or are discoverable from ONNX session output shapes at init time. The current DefaultKeyValueCache already auto-discovers layer count by pattern-matching present tensor names in the session. A generic executor uses the same mechanism — zero model-specific knowledge needed.
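To make the name-based discovery concrete, here is a minimal sketch of counting layers from session output names. It assumes the outputs follow a "present.<N>.key" / "present.<N>.value" naming scheme; the helper names ParsePresentLayer and CountKvLayers are hypothetical and do not correspond to existing functions in the codebase.

```cpp
#include <cassert>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Returns true and sets *layer if `name` is exactly "present.<N>.key"
// or "present.<N>.value" (assumed naming scheme).
bool ParsePresentLayer(const std::string& name, int* layer) {
  const int length = static_cast<int>(name.size());
  int consumed = 0;
  // %n stores the number of characters consumed so far; it is left at 0
  // when the literal suffix after %d fails to match.
  if (std::sscanf(name.c_str(), "present.%d.key%n", layer, &consumed) == 1 &&
      consumed == length) {
    return *layer >= 0;
  }
  consumed = 0;
  if (std::sscanf(name.c_str(), "present.%d.value%n", layer, &consumed) == 1 &&
      consumed == length) {
    return *layer >= 0;
  }
  return false;
}

// Counts distinct layer indices among the session's output names.
int CountKvLayers(const std::vector<std::string>& output_names) {
  std::set<int> layers;
  for (const auto& name : output_names) {
    int layer = -1;
    if (ParsePresentLayer(name, &layer)) layers.insert(layer);
  }
  return static_cast<int>(layers.size());
}
```

Non-cache outputs such as logits simply fail the match and are ignored, so the same routine works regardless of what else the session emits.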

7.3 Performance Overhead

Concern: Does the generic pipeline add overhead vs hand-optimized model classes?

Answer: Negligible. The overhead is:

  • One for loop over flow_ steps per generation step (typically 1 step for LLMs)
  • One map lookup per dataflow wire per step
  • These are nanosecond-scale operations vs millisecond-scale ONNX session runs

The hot path — session.Run() + KV cache management — is identical to the current code. The generation loop, search/sampling, and tokenizer are completely unchanged.

7.4 Config Validation

Invalid configs must produce clear errors at load time, not runtime crashes:

| Error | Message |
|---|---|
| Session referenced in flow but not declared | Flow step references session "vision" but no such session is declared in pipeline.sessions |
| Dataflow references non-existent tensor | Dataflow wire references output "image_features" but session "vision" has no such output (available outputs: hidden_states, pooler_output) |
| Unknown position strategy | Unknown position_ids strategy "my_custom". Valid options: auto, default, mrope_3d, windowed |
| Cycle in dataflow | Circular dependency detected in dataflow: vision → embedding → decoder → vision |
| Unknown preset | Unknown pipeline preset "my-preset". Built-in presets: autoregressive-decoder, vision-language, encoder-decoder, speech-language |
| Missing required field | Pipeline config requires at least one session. Add "sessions": {"decoder": {"file": "model.onnx"}} |
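The cycle check is the least obvious of these validations, so a sketch may help. The code below runs a depth-first search over session-level edges and returns the offending path as text. All names (Edges, Adjacency, Mark, VisitSession, FindDataflowCycle) are hypothetical, and reducing each dataflow wire to a (from_session, to_session) pair is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical session-level view of dataflow[]: (from_session, to_session).
using Edges = std::vector<std::pair<std::string, std::string>>;
using Adjacency = std::map<std::string, std::vector<std::string>>;

enum class Mark { Unvisited, InProgress, Done };

// Depth-first visit. On finding a back edge, leaves the walk in `path`
// (ending with the repeated node) and returns true.
bool VisitSession(const std::string& node, const Adjacency& adj,
                  std::map<std::string, Mark>& marks,
                  std::vector<std::string>& path) {
  marks[node] = Mark::InProgress;
  path.push_back(node);
  const auto it = adj.find(node);
  if (it != adj.end()) {
    for (const auto& next : it->second) {
      const Mark state = marks[next];  // new entries default to Unvisited
      if (state == Mark::InProgress) {
        path.push_back(next);
        return true;
      }
      if (state == Mark::Unvisited && VisitSession(next, adj, marks, path)) {
        return true;
      }
    }
  }
  marks[node] = Mark::Done;
  path.pop_back();
  return false;
}

// Returns "" if the graph is acyclic, otherwise a path like "a -> b -> a".
std::string FindDataflowCycle(const Edges& edges) {
  Adjacency adj;
  for (const auto& edge : edges) adj[edge.first].push_back(edge.second);
  std::map<std::string, Mark> marks;
  for (const auto& entry : adj) {
    if (marks[entry.first] != Mark::Unvisited) continue;
    std::vector<std::string> path;
    if (!VisitSession(entry.first, adj, marks, path)) continue;
    // Drop any lead-in so the message starts at the repeated node.
    std::size_t start = 0;
    while (path[start] != path.back()) ++start;
    std::string message;
    for (std::size_t i = start; i < path.size(); ++i) {
      if (i > start) message += " -> ";
      message += path[i];
    }
    return message;
  }
  return "";
}
```

A check like this would also flag a wire from a session back to itself, so a self-referential wire such as the code_predictor feedback in the Qwen3 TTS example of Section 12.3 would need an explicit exemption.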

7.5 The per_image Loop for Vision

QwenVisionState and PixtralVisionState loop over images individually with different slicing strategies. The flow interpreter handles this generically:

{"run": "vision", "when": "prompt", "loop": "per_image",
 "loop_over": "pixel_values"}

For Pixtral's variable-resolution cropping (per-image height/width from image_sizes):

{"run": "vision", "when": "prompt", "loop": "per_image",
 "loop_over": "pixel_values",
 "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}

The executor slices pixel_values[i, :, :H_i, :W_i] where H_i, W_i come from image_sizes[i]. This is ~15 lines of generic loop code, not a model-specific class.
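A minimal sketch of that slice on a dense, padded NCHW float buffer is shown below. CropImage and its parameter layout are hypothetical; a real executor would operate on OrtValue tensors and device memory rather than std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Copies pixel_values[i, :, :h, :w] out of a dense NCHW buffer whose padded
// per-image shape is (channels, height, width). Hypothetical helper.
std::vector<float> CropImage(const std::vector<float>& pixel_values,
                             std::size_t channels, std::size_t height,
                             std::size_t width, std::size_t i,
                             std::size_t h, std::size_t w) {
  if (h > height || w > width) {
    throw std::out_of_range("crop exceeds padded image size");
  }
  const std::size_t image_stride = channels * height * width;
  if ((i + 1) * image_stride > pixel_values.size()) {
    throw std::out_of_range("image index out of range");
  }
  std::vector<float> out;
  out.reserve(channels * h * w);
  for (std::size_t c = 0; c < channels; ++c) {
    for (std::size_t y = 0; y < h; ++y) {
      // Start of row y in channel c of image i.
      const float* row =
          pixel_values.data() + i * image_stride + (c * height + y) * width;
      out.insert(out.end(), row, row + w);  // keep only the first w columns
    }
  }
  return out;
}
```

In the per-image loop, h and w for image i would be read from image_sizes[i], matching the dynamic_shape entry that applies to dims 2 and 3.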


8. The Pitch to the ORT GenAI Team

Framing

"Your runtime is already 90% model-agnostic. We're proposing you formalize what's already true — and eliminate the last 10% of model-specific code."

Today, 21 of 32 model types share identical C++ code. The generation loop doesn't know what model it's running. The KV cache auto-discovers its own layout. The only thing preventing any new model from working is a string whitelist that adds no value.

We're not asking you to change your architecture — we're asking you to recognize that your architecture has already evolved past the model_type dispatch layer. The pipeline config makes the implicit explicit.

Value Proposition

| For | Today | With Pipeline-as-Config |
|---|---|---|
| ORT GenAI team | Bottlenecked on model support PRs | Never writes model-specific code again |
| Model builders (mobius/Olive) | Must coordinate with runtime team for every new model | Ship independently: generate config, done |
| ML engineers | Wait for runtime releases | New models work immediately |
| The ecosystem | ORT GenAI lags behind HuggingFace model zoo | ORT GenAI supports any ONNX model by design |

The Key Selling Point

This REDUCES ORT GenAI's maintenance burden. The team goes from "we must ship a PR for every new HuggingFace model" to "we maintain a stable pipeline runtime." New model support becomes the exporter's responsibility (mobius/Olive), not the runtime's.

"You build the engine. We build the cars."


9. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Performance regression for existing models | Low | High | Benchmark all 32 model types before/after. The hot path is identical. |
| Config complexity deters users | Medium | Medium | Presets with extends reduce 90% of configs to 7 lines. JSON Schema for IDE support. |
| Edge cases in flow interpreter | Medium | Medium | Comprehensive test matrix covering all 32 model types. Validation at load time. |
| ORT GenAI team rejects the proposal | Medium | High | Start with the blacklist inversion (5 lines) to build trust. Present the full vision as an RFC. |
| Plugin ABI stability across versions | Low | Medium | Version the plugin API. Keep it minimal (1 factory function). |
| v1→v2 translator has subtle bugs | Medium | Medium | The translator is tested against every existing genai_config.json in the test suite. |

Immediate Bridge (While Building the Future)

While the pipeline-as-config architecture is implemented, mobius can unblock users TODAY:

  1. For unregistered LLM model_types: emit "type": "decoder" + "original_model_type": "<real_type>" in genai_config.json
  2. "decoder" is in the current whitelist and routes to DecoderOnly_Model
  3. When pipeline-as-config ships, switch to v2 format with the real model_type in metadata

10. The mobius Role: Pipeline Compiler

mobius already knows everything needed to generate complete pipeline configs:

| What mobius knows | How it maps to pipeline config |
|---|---|
| Model architecture (decoder-only, VLM, enc-dec) | Which preset to extend |
| Number and type of ONNX sessions | pipeline.sessions |
| Vision invocation pattern (batched vs per-image) | flow[].loop |
| Position embedding strategy (1D, 3D mRoPE) | state.position_ids.strategy |
| KV cache format (separate, combined) | state.kv_cache.format |
| All I/O tensor names | state.kv_cache.*_pattern, dataflow[] |
| Token IDs, generation params | tokens, generation |

Implementation in mobius: Extend the existing _write_genai_config() function to emit v2 format alongside (or instead of) v1. The pipeline config is generated from the same model metadata that already drives ONNX graph construction.


11. Research Direction: Self-Contained Generation Graphs

As a long-term research direction (not part of the core proposal), we explored embedding generation logic inside the ONNX graph itself. Microsoft's existing com.microsoft.BeamSearch and com.microsoft.GreedySearch contrib ops prove this is technically possible.

Viable for: Offline batch inference, edge deployment, WebAssembly

Not viable for: Interactive serving (streaming, continuous batching, speculative decoding — all require host-side coordination)

Potential approach: Small set of generation-specific custom ops (GenerationKVCacheUpdate, SampleTopP) that the runtime provides as efficient primitives, while the ONNX graph carries the generation logic. Worth exploring for simple deployment scenarios but not the primary architecture.


12. Beyond Autoregressive: TTS, Diffusion, and Multimodal Audio

12.1 The Question

Pipeline-as-Config is designed around autoregressive token generation. But the model ecosystem includes fundamentally different generation patterns:

  • TTS (text-to-speech): Text → mel spectrogram → audio waveform (multi-stage, often non-autoregressive)
  • Diffusion (image generation): Iterative denoising loop with fixed step count, noise scheduling, no token sampling
  • Audio+text multimodal: Mixed modality inputs (audio + text → text), structurally similar to VLMs

Can flow[]/dataflow[]/state{} express these? Or is the schema inherently autoregressive?

12.2 The Honest Assessment

| Model Type | Schema Expressive? | Runtime Can Execute? | What's Missing |
|---|---|---|---|
| Audio+text multimodal (Phi4mm, speech-language) | ✅ Yes | ✅ Yes | Nothing (structurally identical to VLMs) |
| Encoder-decoder (Whisper, Marian) | ✅ Yes | ✅ Yes | Nothing (already supported) |
| Autoregressive TTS (Bark, VALL-E) | ✅ Yes | ✅ Yes | Add when: "final" for vocoder post-processing |
| Non-autoregressive TTS (VITS, FastSpeech2) | ✅ Yes | ❌ No | Sequential executor + non-token output |
| Diffusion (SD, Flux, DiT) | ⚠️ Topology yes | ❌ No | Iterative executor, scheduler state, latent init, non-token output |

Key insight: the flow[]/dataflow[] schema is MORE GENERAL than the current runtime. It can already describe these topologies. The bottleneck is the C++ Generator, which assumes autoregressive token generation.

12.3 Concrete Config Examples

Audio+text multimodal (works today with pipeline-as-config):

{
  "pipeline": {
    "extends": "speech-language",
    "sessions": {
      "audio_encoder": {"file": "audio_encoder.onnx"},
      "embedding": {"file": "embedding.onnx"},
      "decoder": {"file": "decoder.onnx"}
    },
    "flow": [
      {"run": "audio_encoder", "when": "init"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "audio_encoder.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 4096}
}

Whisper and Nemotron Speech are already this pattern. No schema changes needed.

Autoregressive TTS (Bark — works with minor extension):

{
  "pipeline": {
    "sessions": {
      "decoder": {"file": "decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "decoder", "when": "step"},
      {"run": "vocoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "decoder.audio_tokens", "to": "vocoder.input_ids"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}

New: when: "final" — runs after the generation loop completes (post-processing). Trivial to add.

Non-autoregressive TTS (VITS — needs sequential executor):

{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "duration_predictor": {"file": "duration.onnx"},
      "mel_decoder": {"file": "mel_decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "duration_predictor", "when": "init"},
      {"run": "mel_decoder", "when": "init"},
      {"run": "vocoder", "when": "init"}
    ],
    "output": {"session": "vocoder", "name": "audio_waveform"}
  },
  "generation": {"loop": "single_pass"}
}

New: "loop": "single_pass" — no generation loop, run all flow steps once, return output tensor. Requires a SequentialExecutor (~100 LOC).

Complex TTS with inner loops (Qwen3 TTS — needs flow step extensions):

Qwen3 TTS is a 4-model pipeline: embedding → talker → code_predictor → speaker_encoder. The talker IS autoregressive (KV cache, logits), but within each generation step, the code_predictor runs 14 times in an inner loop with a step counter:

{
  "pipeline": {
    "sessions": {
      "embedding": {"file": "embedding.onnx"},
      "talker": {"file": "talker.onnx"},
      "code_predictor": {"file": "code_predictor.onnx"},
      "speaker_encoder": {"file": "speaker_encoder.onnx", "optional": true}
    },
    "flow": [
      {"run": "speaker_encoder", "when": "init", "optional": true},
      {"run": "embedding", "when": "init"},
      {"run": "talker", "when": "step"},
      {"run": "code_predictor", "when": "step", "repeat": 14, "counter": "step_index"}
    ],
    "dataflow": [
      {"from": "embedding.text_embeds", "to": "talker.inputs_embeds"},
      {"from": "talker.last_hidden_state", "to": "code_predictor.inputs_embeds"},
      {"from": "code_predictor.codec_embeddings", "to": "code_predictor.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}

New concepts: repeat: N on a flow step (inner loop within each generation step), counter field (provides a step index input), and self-referential dataflow (code_predictor output feeds back into itself). These are v2.1 extensions. Until then, the plugin escape hatch covers complex TTS.
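A small sketch of how repeat and counter might behave inside the flow interpreter: the step runs N times per generation step and each run receives its iteration index. InnerLoopStep and RunWithRepeat are hypothetical names, and the zero-based counter is an assumption, since the proposal does not state where counting starts.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical view of a flow step that carries "repeat": N.
struct InnerLoopStep {
  std::string session;  // e.g. "code_predictor"
  int repeat;           // number of inner iterations per generation step
};

// Runs the step `repeat` times, passing the iteration index that the
// "counter" field would expose to the session as an input (0-based here).
void RunWithRepeat(
    const InnerLoopStep& step,
    const std::function<void(const std::string&, int)>& run_once) {
  for (int index = 0; index < step.repeat; ++index) {
    run_once(step.session, index);
  }
}
```

With the Qwen3 TTS config above, one generation step would call the code_predictor session 14 times, with the self-referential wire feeding each run's output into the next run's input.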

Diffusion (Stable Diffusion, Flux) — in scope for schema design, out of scope for v2.0 implementation:

Diffusion has a fundamentally different generation loop: fixed N-step denoising with noise scheduling, classifier-free guidance (conditional UNet double-call), and non-neural scheduler math between iterations. The flow[]/dataflow[] schema can express the session topology using the same init/step/final phases:

{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "unet": {"file": "unet.onnx"},
      "vae_decoder": {"file": "vae_decoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "unet", "when": "step"},
      {"run": "vae_decoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "text_encoder.text_embeddings", "to": "unet.encoder_hidden_states"},
      {"from": "unet.noise_pred", "to": "vae_decoder.latent_sample"}
    ]
  },
  "generation": {
    "loop": "denoising",
    "num_steps": 50,
    "scheduler": "euler_discrete",
    "guidance_scale": 7.5
  }
}

Note how init/step/final maps naturally: text_encoder = init, unet = step (each denoising iteration), vae_decoder = final (after loop). The same three phases work for autoregressive AND denoising — no schema fork needed.

The denoising loop itself requires host-side C++ logic (scheduler.step, CFG interpolation); expressing that logic declaratively would require a Turing-complete config language. The right approach is a dedicated DenoisingExecutor in C++ that implements the denoising loop, analogous to how the autoregressive loop is C++ today. The config parameterizes it; the C++ implements it. The loop skeleton is ~300 LOC, but the full scheduler zoo (Euler, DDPM, DDIM, DPM-Solver, LCM, Flow Matching) plus classifier-free guidance and ControlNet support adds ~1000-1500 LOC. The pipeline executor and schema don't change: scheduler: "euler_discrete" is just a string that selects a C++ implementation.
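To show the kind of host-side math involved, the sketch below implements the two per-iteration operations named above in their simplest form: classifier-free guidance and a first-order Euler update. It assumes an epsilon-predicting (noise-predicting) model and a sigma-parameterized schedule, and it omits input scaling and other details a production scheduler would need. ApplyGuidance and EulerStep are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Classifier-free guidance: uncond + scale * (cond - uncond), elementwise.
std::vector<float> ApplyGuidance(const std::vector<float>& uncond,
                                 const std::vector<float>& cond,
                                 float guidance_scale) {
  if (uncond.size() != cond.size()) {
    throw std::invalid_argument("guidance inputs differ in size");
  }
  std::vector<float> out(uncond.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = uncond[i] + guidance_scale * (cond[i] - uncond[i]);
  }
  return out;
}

// One Euler step for an epsilon-predicting model:
// latents_next = latents + (sigma_next - sigma) * noise_pred.
std::vector<float> EulerStep(const std::vector<float>& latents,
                             const std::vector<float>& noise_pred,
                             float sigma, float sigma_next) {
  if (latents.size() != noise_pred.size()) {
    throw std::invalid_argument("latents and noise_pred differ in size");
  }
  std::vector<float> out(latents.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = latents[i] + (sigma_next - sigma) * noise_pred[i];
  }
  return out;
}
```

In a loop built on these, each iteration would run the unet step for the conditional and unconditional inputs, combine them with ApplyGuidance, and pass the result to EulerStep. The value carried between iterations, and eventually decoded, is the updated latents rather than the raw noise prediction.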

Implementation is deferred — diffusion users have different tooling (ComfyUI, diffusers), different serving patterns (no streaming, batch-oriented), and ORT already has separate diffusion pipeline support. But the schema design explicitly accommodates diffusion so no breaking changes are needed when the executor is added.

12.4 The Scoping Decision

Pipeline-as-Config v2.0 implements autoregressive generation. The schema designs for all generation paradigms. This is a deliberate split: ship what matters now, design so future work is additive.

| Pattern | v2.0 Implementation | v2.1+ Implementation | Schema Support |
|---|---|---|---|
| LLM (decoder-only) | ✅ Ship | | ✅ Designed |
| VLM (vision+language) | ✅ Ship | | ✅ Designed |
| Encoder-decoder (Whisper, Marian) | ✅ Ship | | ✅ Designed |
| Speech-language (audio+text→text) | ✅ Ship | | ✅ Designed |
| Simple TTS (AR + vocoder) | ⚠️ Plugin | when: "final" | ✅ Designed |
| Complex TTS (Qwen3-style inner loops) | ⚠️ Plugin | repeat + counter | ✅ Designed |
| Non-autoregressive TTS (VITS) | ⚠️ Plugin | loop: "single_pass" | ✅ Designed |
| Diffusion (SD, Flux, DiT) | ⚠️ Plugin | loop: "denoising" | ✅ Designed |
| Exotic (RNNT, custom) | ⚠️ Plugin | Plugin | ✅ Plugin escape hatch |

The pitch: "The v2 schema supports any generation paradigm — autoregressive, denoising, single-pass. v2.0 ships the autoregressive executor. Adding a new paradigm = one C++ executor class. Adding a new model within any paradigm = zero code."

This prevents the "only works for LLMs" objection (the schema designs for everything) while keeping v2.0 scope tight (ship quality over breadth).

12.5 The Architectural Pattern: Pluggable Loop Strategies

The generation loop is a layer ABOVE the pipeline executor:

┌─────────────────────────────┐
│  Loop Strategy              │  ← autoregressive | denoising | single_pass
│  (generation.loop)          │
├─────────────────────────────┤
│  Pipeline Executor          │  ← flow[], dataflow[], state{} — UNCHANGED
│  (FlowInterpreter)         │
├─────────────────────────────┤
│  ONNX Sessions              │  ← The actual computation — UNCHANGED
└─────────────────────────────┘

Each loop strategy is independent:

| Loop Strategy | When It Runs | Termination | State Between Steps | LOC Estimate |
|---|---|---|---|---|
| autoregressive | Token-by-token | EOS or max_length | KV cache, positions | Existing (~800 LOC) |
| single_pass | All steps once | After one pass | None | ~100 LOC |
| denoising | Fixed N iterations | After N steps | Latents, scheduler | ~300 LOC loop + ~1000 LOC schedulers |

Adding a new loop strategy never touches the pipeline executor or existing loop strategies. Pure addition.
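One way to picture this layering is a small strategy interface selected by the generation.loop string. The sketch below is illustrative only: PhaseRunner stands in for the pipeline executor, and LoopStrategy, the three loop classes, and MakeLoopStrategy are hypothetical names rather than proposed API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the pipeline executor: one callback per phase.
struct PhaseRunner {
  std::function<void()> run_init;   // flow steps with when: "init"
  std::function<bool()> run_step;   // when: "step"; returns true to stop (e.g. EOS)
  std::function<void()> run_final;  // flow steps with when: "final"
};

class LoopStrategy {
 public:
  virtual ~LoopStrategy() = default;
  virtual void Generate(PhaseRunner& pipeline) = 0;
};

// Token-by-token loop: stops on EOS or after max_length steps.
class AutoregressiveLoop : public LoopStrategy {
 public:
  explicit AutoregressiveLoop(int max_length) : max_length_(max_length) {}
  void Generate(PhaseRunner& pipeline) override {
    pipeline.run_init();
    for (int i = 0; i < max_length_; ++i) {
      if (pipeline.run_step()) break;
    }
    pipeline.run_final();
  }
 private:
  int max_length_;
};

// Fixed-count loop: always runs exactly num_steps iterations.
class DenoisingLoop : public LoopStrategy {
 public:
  explicit DenoisingLoop(int num_steps) : num_steps_(num_steps) {}
  void Generate(PhaseRunner& pipeline) override {
    pipeline.run_init();
    for (int i = 0; i < num_steps_; ++i) pipeline.run_step();
    pipeline.run_final();
  }
 private:
  int num_steps_;
};

// No loop: every session runs once.
class SinglePassLoop : public LoopStrategy {
 public:
  void Generate(PhaseRunner& pipeline) override {
    pipeline.run_init();
    pipeline.run_final();
  }
};

// Maps the generation.loop string to a strategy; `limit` is max_length or num_steps.
std::unique_ptr<LoopStrategy> MakeLoopStrategy(const std::string& name, int limit) {
  if (name == "autoregressive") return std::make_unique<AutoregressiveLoop>(limit);
  if (name == "denoising") return std::make_unique<DenoisingLoop>(limit);
  if (name == "single_pass") return std::make_unique<SinglePassLoop>();
  throw std::invalid_argument("Unknown generation.loop: " + name);
}
```

Each strategy only decides how often the step phase runs and when to stop. What a step actually does stays in the pipeline executor, which is why adding a strategy would not touch the others.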

12.6 Competitive Advantage (Strengthened)

This analysis actually STRENGTHENS the competitive story:

  • llama.cpp: GGUF has no concept of denoising loops, multi-session pipelines, or post-processing stages. Diffusion support would require a fundamentally new runtime.
  • vLLM: Each diffusion architecture needs its own Python pipeline class. They're doing this (diffusion support is recent), but it's per-model Python code.
  • Pipeline-as-Config: Add ONE loop strategy to the runtime → EVERY model of that type works via config. One DenoisingExecutor enables Stable Diffusion, Flux, DiT, SDXL, ControlNet — all expressed as JSON with different session topologies.

The compiler advantage applies across modalities: add one loop strategy to the "compiled runtime" → unlimited models of that type. With "interpreter" runtimes (vLLM, llama.cpp), every model needs its own code.

12.7 Implementation Roadmap

| Phase | What | Status |
|---|---|---|
| Phase 1 (v2.0) | Autoregressive (decoder-only, VLM, encoder-decoder, speech-language) | PRs 1-5 (in progress) |
| Phase 2 (v2.1) | when: "final" for post-processing (enables AR TTS with vocoder) | Trivial addition to FlowInterpreter |
| Phase 2 (v2.1) | repeat + counter on flow steps (enables complex TTS like Qwen3) | ~50 LOC FlowInterpreter extension |
| Phase 3 (v2.1) | loop: "single_pass" + SequentialExecutor (enables non-AR TTS, embeddings) | ~100 LOC new executor |
| Phase 4 (future) | loop: "denoising" + DenoisingExecutor (enables diffusion) | ~300 LOC loop skeleton + ~1000 LOC scheduler implementations |

v2.0 scope: Generative language models (autoregressive token generation). Covers ~95% of current ORT GenAI model zoo.

v2.1 scope: TTS extensions (when: "final", repeat/counter, loop: "single_pass"). Additive, no breaking changes.

Architecture accommodates: Diffusion via pluggable loop strategy. Out of v2.0 scope (different product, different users), but architecturally consistent. The plugin escape hatch covers all exotic patterns in the meantime.


13. Summary

What Changes

| Component | Before | After |
|---|---|---|
| Model dispatch | 32-string whitelist → 8 C++ classes | Structural detection → 3 pipeline classes + plugin |
| Adding a new LLM | C++ PR + release cycle | 7-line JSON config |
| Adding a new VLM | New C++ class + processor + factory entries | ~25-line JSON config |
| Config format | Implicit schema tied to C++ structs | Explicit v2 schema with presets, versioned |
| model_type | Dispatch key | Human-readable metadata |
| Code size | ~4000 LOC in model dispatch | ~2500 LOC in pipeline executor (-1500 LOC) |
| Extension mechanism | Fork the C++ runtime | JSON config or plugin .so |

What Stays the Same

  • Generation loop (Generator, Search, Sampling) — fully generic for autoregressive; extensible via pluggable loop strategies for diffusion/TTS (Section 12)
  • KV cache internals — auto-detection mechanism preserved
  • Tokenizer — unchanged
  • C/Python/C#/Java/ObjC API surface — unchanged
  • ONNX Runtime session management — unchanged
  • All existing models — backward compatible via v1→v2 translator

The Vision (2-Year Horizon)

ORT GenAI becomes a generic pipeline runtime — the ONNX equivalent of what Kubernetes is for container orchestration. Models describe their pipeline declaratively. The runtime executes it generically. No model-specific code. No release bottlenecks. Any ONNX model that follows standard I/O conventions runs automatically.

Zero model-specific C++ code in ORT GenAI, ever again.
