Copilot design of Pipeline-as-Config

# ORT GenAI Architectural Redesign: Pipeline-as-Config

**A proposal to make onnxruntime-genai truly model-agnostic**

**Authors:** Architecture Team (Architect, Product Manager, Radical Thinker)
**Date:** 2026-05-02
**Status:** Draft — for review before GitHub issue creation

> **Terminology note:** The `flow[].when` values were renamed from `once`/`prompt`/`always` to `init`/`step`/`final` for cross-paradigm clarity (autoregressive, denoising, single-pass all use the same three phases). The mapping is not one-to-one: `once` and `prompt` both become `init`, `always` becomes `step`, and `final` is new. The canonical definitions are in Section 3.2. Some earlier-drafted examples may use the original names.

---

## Executive Summary

**onnxruntime-genai is 90% model-agnostic today — but a hardcoded string registry blocks every new model.** 21 of 32 recognized model types share identical runtime code (`DecoderOnly_Model`). The KV cache auto-discovers its layout from ONNX tensor names. The generation loop knows nothing about model architecture. The only thing preventing ANY new model from working is a C++ whitelist that maps model_type strings to implementation classes.

**We propose Full-Stack Declarative Inference** — replacing string-based dispatch with a declarative pipeline configuration where preprocessing, orchestration, and generation are ALL expressed as JSON config, running on 6+ execution providers. Instead of the runtime knowing about "Llama" or "Qwen" or "Gemma," it knows about *pipelines* — sequences of ONNX session invocations with configurable data flow, state management, and execution ordering.

**The result: zero model-specific C++ code in ORT GenAI, ever again.** New models are supported entirely by the export tool (mobius/Olive) generating ONNX graphs + pipeline configs. The runtime becomes a stable platform that only changes for performance improvements and new features, never for new models.

### The Pitch

> **The only inference runtime where adding a new model is a JSON file, not a code change — and it runs on every platform.**

### Design Principle

> **Detect, don't declare.** The runtime infers model category from structural signals (which ONNX sessions exist, what I/O signatures they have), not from string dispatch. Config fields express behavioral choices that genuinely can't be inferred from structure. `model.type` becomes metadata for humans, not a dispatch key for machines.

### The One-Sentence Version

**The runtime should know HOW to run pipelines (KV cache, generation loop, sampling), not WHAT models it's running — the "what" comes entirely from the ONNX graph + pipeline config.**

---

## 1. The Problem Today

### 1.1 The Six Coupling Points

Adding a new model type to ORT GenAI currently requires C++ source changes in up to 6 locations:

| # | Coupling Point | File | Lines | What Changes |
|---|---------------|------|-------|-------------|
| CP1 | Model type whitelist | `src/models/model_type.h` | 16-61 | Add string to static array |
| CP2 | Model factory dispatch | `src/models/model.cpp` | 820-842 | Add if-else branch |
| CP3 | Position input strategy | `src/models/position_inputs.cpp` | 928-938 | Add model_type check |
| CP4 | Vision state factory | `src/models/multi_modal.cpp` | 565-572 | Add model_type check |
| CP5 | Multimodal processor factory | `src/models/model.cpp` | 915-933 | Add factory entry |
| CP6 | Python model builder | `src/python/py/models/builders/` | Various | Add Python builder file |

For a standard decoder-only LLM, only CP1 is required (adding one string). But that one string requires a C++ PR, code review, CI pipeline, and a new release. **The bottleneck isn't engineering complexity — it's release process overhead for a trivial change.**

### 1.2 The False Complexity

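The codebase has 8 C++ model classes and 32 recognized model_type strings. But strip away the legacy, and only **3 runtime patterns** (decoder-only, multi-session, encoder-decoder) cover nearly all of them. The last two rows of the table are the one real outlier (RNNT streaming ASR) and one deployment variant (QNN multi-stage pipeline):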

| Runtime Behavior | C++ Classes | Model Types | Actually Different? |
|-----------------|-------------|-------------|-------------------|
| Decoder-only autoregressive | `DecoderOnly_Model`, `Gpt_Model` | 22 types | No — GPT-2 differs only in KV cache format (config-detectable) |
| Multi-session (VLM/multimodal) | `MultiModalLanguageModel`, `Qwen2_5_VL_PipelineModel` | 8 types | Partially — vision invocation strategy varies, but is config-expressible |
| Encoder-decoder | `WhisperModel`, `MarianModel` | 2 types | No — both are encoder→cross-attention-decoder |
| RNNT streaming ASR | `NemotronSpeechModel` | 1 type | Yes — fundamentally different decoding loop |
| Pipeline (QNN multi-stage) | `DecoderOnlyPipelineModel` | 1 type | Deployment variant, not architectural difference |

**3 runtime patterns, not 32 model types.** The model_type string is doing almost zero useful work.

### 1.3 What Users Experience

Four personas are blocked by this architecture:

1. **The Model Builder** (mobius dev): "I built a perfect ONNX model + config, but ORT GenAI rejects it because it doesn't know my model_type string."
2. **The Deployer** (ML engineer): "I have to wait for a new ORT GenAI release just to use a new model. My alternative is forking the runtime."
3. **The Fine-Tuner** (researcher): "My model is architecturally identical to Llama but has a custom model_type. ORT GenAI won't load it."
4. **The ORT GenAI Maintainer** (MSFT): "Every new HuggingFace model = a C++ PR. We're bottlenecked on model support."

---

## 2. Competitive Analysis

### How Other Runtimes Handle Extensibility

| Runtime | Pattern | Extensible Without Source Changes? | Adding a New Model |
|---------|---------|-----------------------------------|--------------------|
| **vLLM** | Python dict registry + lazy import | ✅ Yes — `register_model()` API | 1 registry line + 1 Python file |
| **SGLang** | AST-based filesystem discovery | ✅ Yes — drop .py file in directory | 0-1 config lines + 1 module file |
| **llama.cpp** | C++ enum dispatch (like ORT GenAI) | ❌ No — requires recompilation | 1 enum + 100-300 LOC |
| **ORT GenAI** | C++ string whitelist dispatch | ❌ No — requires recompilation | 1 string + PR + release cycle |
| **ORT GenAI (proposed)** | Declarative pipeline config | ✅ Yes — JSON config only | 0 code lines + 1 JSON config |

### ORT GenAI's Unique Advantage

ORT GenAI has something no other runtime has: **the ONNX model IS the computation.** vLLM and SGLang require model-specific Python classes that implement `forward()` with PyTorch ops. llama.cpp requires model-specific C++ code that implements attention, MLP, and normalization. ORT GenAI delegates ALL computation to ONNX Runtime — it never touches model internals.

**This means ORT GenAI's extensibility problem is fundamentally simpler.** It doesn't need a plugin system for model computation (the ONNX graph handles that). It only needs extensibility for *orchestration* — which sessions to run, in what order, how to manage state between steps. And orchestration is naturally expressed as configuration.

### Why Pipeline-as-Config Is Better Than GGUF (llama.cpp)

**1. GGUF bundles computation with metadata. We separate them.**

GGUF's model file contains weights + architecture metadata. The runtime reads the metadata and *builds a compute graph at load time*. This means the runtime must understand every architecture's compute pattern — which attention variant, which normalization, which MLP structure. When a model adds dual head_dim or KV sharing, llama.cpp needs new C++ code to interpret those metadata keys and build the right compute graph.

Our ONNX model IS the precompiled compute graph. The runtime never interprets architecture details — it just runs `session.Run()`. The pipeline config only describes ORCHESTRATION (which sessions, what order, what state), not COMPUTATION:

```
GGUF:    metadata → [runtime builds graph] → execution
Ours:    ONNX graph (prebuilt) + pipeline config → [runtime orchestrates] → execution
```

**The runtime never needs to know what's inside the model.** GGUF's runtime does.

**2. Multi-EP deployment is impossible with GGUF.**

GGUF models run on llama.cpp's own backends (CPU, CUDA, Metal, Vulkan). You can't take a GGUF and run it on DirectML, QNN (Qualcomm NPU), OpenVINO, or WebGPU without porting the entire backend.

ONNX + pipeline config runs on ANY ORT execution provider. The same model + config deploys to cloud GPU (CUDA EP), Windows laptop (DML EP), Qualcomm mobile (QNN EP), Intel hardware (OpenVINO EP), browser (WebGPU EP), and CPU. **One model, one config, six+ deployment targets.**

**3. Graph-level optimization at export time.**

ONNX models go through ORT's graph optimization pipeline: constant folding, op fusion, layout optimization, EP-specific transformations. These run ONCE, either offline as part of export or at session creation, and produce an optimized execution plan before the first inference call. GGUF's graph is built by the runtime at load time, so there is no export-time artifact to optimize ahead of deployment.

### Why Pipeline-as-Config Is Better Than vLLM

**1. vLLM requires Python code for every model. We require JSON.**

Adding a model to vLLM means writing a Python class with `forward()`, weight loading, attention implementation — typically 200-500 lines of PyTorch code. Even with `register_model()`, someone must WRITE that code.

Our approach: the export tool (mobius) generates the ONNX graph + pipeline config. The runtime needs ZERO new code. The complexity lives in the exporter (which already understands the model), not the runtime.

**2. vLLM is CUDA-first.**

vLLM's custom CUDA kernels (PagedAttention, FlashAttention) are what make it fast, and they target NVIDIA GPUs first. AMD ROCm support is second-class, and targets such as DirectML, QNN, or WebGPU are not covered; reaching them means rewriting those kernels. ORT's execution providers handle hardware abstraction transparently.

**3. vLLM couples computation and orchestration.**

vLLM's model classes implement both the forward pass AND orchestration logic (KV cache management, attention patterns). Our architecture cleanly separates: ONNX model = computation, pipeline config = orchestration, ORT = execution.

### Where Competitors Are Better (Honest Assessment)

| They're better at | Why | Our path to parity |
|-------------------|-----|--------------------| 
| **GGUF: Single-file distribution** | One .gguf file vs our model dir | ONNX metadata embedding (research direction) |
| **GGUF: Quantization simplicity** | `Q4_K_M` is one flag | Olive pipeline (more steps, but more flexible) |
| **vLLM: Serving features** | Continuous batching, speculative decoding, prefix caching | ORT GenAI engine mode (growing) |
| **vLLM: Community velocity** | 200+ models, rapid community PRs | Pipeline-as-config FIXES this — enables same velocity |
| **Both: No export step** | Load HF weights directly | We require an export step (mobius build) |

### The Core Competitive Insight: Compile at Export Time

This is our unique structural advantage that neither competitor can replicate:

**We move complexity from RUNTIME to EXPORT TIME.**

- **GGUF:** Runtime builds the compute graph (complexity at runtime)
- **vLLM:** Runtime runs model-specific Python code (complexity at runtime)
- **Ours:** Export tool builds the compute graph AND generates the orchestration config (complexity at export time). Runtime is generic.

Why this matters:
1. **Export runs ONCE; inference runs millions of times.** Put the intelligence where it runs once.
2. **Export has access to the full HuggingFace model** — Python code, config, architecture details. It can make perfect decisions. The runtime shouldn't need this information.
3. **The export tool (mobius) is Python** — easy to extend. The runtime is C++ — hard to change. Our architecture puts extensibility in the easy-to-change layer.
4. **Export-time optimization** — graph optimization, quantization, EP-specific tuning all happen before deployment. The runtime gets a pre-optimized artifact.

**This is the 'compiler vs interpreter' advantage.** GGUF and vLLM are interpreters — they process model definitions at runtime. We're a compiler — we process model definitions once at export and produce an optimized artifact that a simple, generic runtime executes.

### What We Do That NEITHER Competitor Can

Three capabilities that pipeline-as-config delivers that no competitor matches:

**1. Multi-Session Declarative Pipelines.** GGUF has flat key-value metadata — no concept of multi-model pipelines. vLLM can do multi-model through Python code, but each topology requires a new class. Pipeline-as-Config's `flow[]` + `dataflow[]` declaratively express ANY multi-session topology — VLMs, speech models, multimodal with vision+audio+decoder — all as JSON without new code.

**2. Hardware-Agnostic Model Artifacts.** GGUF models are tied to llama.cpp's backend ecosystem. vLLM is CUDA-first (AMD ROCm second-class, no DirectML/QNN/WebGPU). ONNX + pipeline config is a hardware-agnostic artifact — the same files deploy on CPU, CUDA, DirectML, QNN (Qualcomm NPU), OpenVINO (Intel), and WebGPU. **Write once, deploy on 6+ hardware targets.** Even more powerfully, different sessions in the same pipeline can run on DIFFERENT execution providers — e.g., vision encoder on CPU while the decoder runs on GPU, or vision on NPU while decoder runs on GPU:

```json
"sessions": {
  "vision":  {"file": "vision/model.onnx", "execution_provider": "QNNExecutionProvider"},
  "decoder": {"file": "decoder/model.onnx", "execution_provider": "CUDAExecutionProvider"}
}
```

This heterogeneous hardware deployment — different EPs per session in a single pipeline — is something neither GGUF nor vLLM can express at all.
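
A loader could turn each session entry into an ordered provider list for ONNX Runtime. In the sketch below, `providers_for` and the CPU fallback policy are assumptions made for illustration; only the `execution_provider` key comes from the config above. The returned list has the shape that the `providers` argument of `onnxruntime.InferenceSession` accepts.

```python
def providers_for(session_cfg, fallback="CPUExecutionProvider"):
    """Return an ordered provider list for one session entry.

    Assumption: the runtime appends a CPU fallback after the preferred
    provider. The proposal does not specify a fallback policy.
    """
    preferred = session_cfg.get("execution_provider", fallback)
    if preferred == fallback:
        return [fallback]
    return [preferred, fallback]


sessions = {
    "vision": {"file": "vision/model.onnx", "execution_provider": "QNNExecutionProvider"},
    "decoder": {"file": "decoder/model.onnx", "execution_provider": "CUDAExecutionProvider"},
}
provider_plan = {name: providers_for(cfg) for name, cfg in sessions.items()}
```

Each session gets its own list, so nothing in the plan forces the whole pipeline onto one device.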

**3. Truly Model-Agnostic Runtime.** GGUF's runtime interprets architecture metadata to build compute graphs — it must understand every model's attention pattern, normalization, MLP structure. vLLM's runtime runs model-specific Python `forward()` code. Our runtime executes a declared pipeline — it understands ZERO model architecture. **The runtime has no decisions to make.**

### The Complete Competitive Matrix

| Capability | GGUF | vLLM | Pipeline-as-Config |
|-----------|------|------|--------------------| 
| New LLM without runtime changes | ❌ | ❌ (needs Python class) | ✅ (JSON config) |
| Multi-session pipelines (VLM) | ❌ (no concept) | ⚠️ (Python code) | ✅ (declarative flow) |
| Deploy same model on 6+ HW targets | ❌ | ❌ | ✅ (ORT execution providers) |
| Heterogeneous HW per session | ❌ | ❌ | ✅ (vision on NPU, decoder on GPU) |
| Model-agnostic runtime | ❌ | ❌ | ✅ |
| Self-describing model artifacts | ✅ (GGUF metadata) | ❌ (needs Python) | ✅ (ONNX + pipeline JSON) |
| Declarative preprocessing | ❌ | ❌ | ✅ (ort-extensions JSON) |
| No Python dependency at inference | ✅ | ❌ | ✅ |
| C++ only deployment (edge/embedded) | ✅ | ❌ | ✅ |
| Extensibility without recompilation | ❌ | ✅ (Python) | ✅ (JSON + plugin .so) |

**Pipeline-as-Config is the only approach that checks ALL boxes.**

---

## 3. The Architecture: Pipeline-as-Config

### 3.1 Core Concept

Replace model-type dispatch with a declarative pipeline configuration. The runtime becomes a generic pipeline executor that:

1. **Loads** whatever ONNX sessions the config declares
2. **Executes** them in the order the config specifies
3. **Wires** outputs→inputs using explicit dataflow declarations
4. **Manages** state (KV cache, position IDs) per config-driven strategies
5. **Generates** tokens using the standard (fully generic) generation loop

```
┌─────────────────────────────────────────────────┐
│              genai_config.json v2               │
│                                                  │
│  pipeline.extends: "autoregressive-decoder"      │
│  pipeline.sessions: {name → file}                │
│  pipeline.flow: [{run, when, loop}]              │
│  pipeline.dataflow: [{from, to}]                 │
│  pipeline.state: {kv_cache, position_ids}        │
│  pipeline.plugin: "libcustom.so" (optional)      │
│                                                  │
│  tokens: {pad, eos, bos}                         │
│  generation: {max_length, sampling, stop}        │
│  metadata: {model_type, source} (human-only)     │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│          Pipeline Factory (structural)           │
│                                                  │
│  No string dispatch. Config structure drives:    │
│  ┌─ DecoderPipeline (single session)             │
│  ├─ MultiSessionPipeline (2+ sessions)           │
│  ├─ EncoderDecoderPipeline (cross-attention)     │
│  └─ PluginPipeline (custom shared library)       │
└──────────────────┬──────────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────────┐
│         Generic Pipeline Executor                │
│                                                  │
│  Interprets flow[] declaratively:                │
│  - when: init | step | final                     │
│  - loop: batched | per_image                     │
│                                                  │
│  State management (config-driven):               │
│  - KV cache (auto / separate / combined)         │
│  - Position IDs (auto / default / mrope_3d)      │
│  - Sliding window (from config)                  │
│                                                  │
│  Generation loop (fully generic, unchanged):     │
│  - Sampling, beam search, EOS detection          │
│  - Streaming output                              │
└─────────────────────────────────────────────────┘
```
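
The structural dispatch in the Pipeline Factory box could be sketched as follows. The selection rules here (plugin first, then a `cross_cache` state entry, then session count) and the function name `select_pipeline_kind` are assumptions for illustration; the proposal names the four pipeline kinds but does not fix the exact signals. The function expects a fully expanded `pipeline` object, after any `extends` preset has been resolved.

```python
def select_pipeline_kind(pipeline: dict) -> str:
    """Pick a pipeline kind from config structure alone (no model_type lookup)."""
    if "plugin" in pipeline:
        return "PluginPipeline"
    # Assumption: a cross_cache entry in state marks an encoder-decoder pipeline.
    if "cross_cache" in pipeline.get("state", {}):
        return "EncoderDecoderPipeline"
    if len(pipeline.get("sessions", {})) >= 2:
        return "MultiSessionPipeline"
    return "DecoderPipeline"
```

Note that `metadata.model_type` is never read: two configs with different model types but the same structure select the same pipeline kind.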

### 3.2 The `flow` Array — Execution Ordering

The `flow` array declares which sessions run, when, and how:

```json
"flow": [
  {"run": "vision",    "when": "init", "loop": "per_image"},
  {"run": "embedding", "when": "init"},
  {"run": "decoder",   "when": "step"}
]
```

**Lifecycle phases** (fixed vocabulary — not Turing-complete):
- `when: "init"` — run before the main generation loop (encoders, embedding projectors, preprocessing). Execution order follows the flow[] array. Covers what earlier drafts called `once` and `prompt`.
- `when: "step"` — run every iteration of the generation loop (decoder in autoregressive, UNet in denoising)
- `when: "final"` — run after the generation loop completes (vocoder in TTS, VAE decoder in diffusion)

These three phases map cleanly across all generation paradigms:

| Phase | Autoregressive | Denoising (Diffusion) | Single-Pass |
|---|---|---|---|
| `init` | Vision encoder, embedding, prompt processing | Text encoder, latent init | All sessions |
| `step` | Decoder (each token) | UNet (each denoising step) | N/A (no loop) |
| `final` | — | VAE decoder → image | — |
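
A minimal sketch of how an executor might interpret the three phases, assuming hypothetical callbacks `run_stage` (invokes one session) and `should_stop` (EOS or similar stop condition). Neither name comes from the proposal, and loop modes are left to `run_stage`.

```python
def run_pipeline(flow, run_stage, should_stop, max_steps):
    """Run init stages once, step stages every iteration, final stages once."""
    by_phase = {"init": [], "step": [], "final": []}
    for stage in flow:  # flow[] order is preserved within each phase
        by_phase[stage["when"]].append(stage)

    for stage in by_phase["init"]:
        run_stage(stage)

    steps = 0
    # Single-pass pipelines declare no step stages, so the loop is skipped.
    while by_phase["step"] and steps < max_steps and not should_stop():
        for stage in by_phase["step"]:
            run_stage(stage)
        steps += 1

    for stage in by_phase["final"]:
        run_stage(stage)
    return steps
```

The same function covers all three columns of the table: only the contents of `flow` differ between an autoregressive decoder, a denoising loop, and a single-pass model.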

**Loop modes** (fixed vocabulary):
- `loop: "batched"` — pass all inputs at once (default)
- `loop: "per_image"` — iterate over inputs individually (Qwen VL, Pixtral)

**Guardrails:**
- Fixed vocabulary for `when` and `loop` — no arbitrary conditions or iterations
- Maximum 10 flow stages — prevents pathological configs
- No `if/else` — anything that needs conditional logic uses the plugin API
- Cycle detection in dataflow at load time
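
The guardrails above are cheap to enforce at load time. The following is a sketch under assumptions: `validate_pipeline` is a hypothetical name, and cycles are checked at session granularity (the part of each `from`/`to` endpoint before the first dot), which the proposal does not spell out.

```python
ALLOWED_WHEN = {"init", "step", "final"}
ALLOWED_LOOP = {"batched", "per_image"}
MAX_FLOW_STAGES = 10


def validate_pipeline(flow, dataflow):
    """Raise ValueError if flow or dataflow violates the guardrails."""
    if len(flow) > MAX_FLOW_STAGES:
        raise ValueError(f"flow has {len(flow)} stages, maximum is {MAX_FLOW_STAGES}")
    for stage in flow:
        if stage["when"] not in ALLOWED_WHEN:
            raise ValueError(f"unknown when value: {stage['when']!r}")
        if stage.get("loop", "batched") not in ALLOWED_LOOP:
            raise ValueError(f"unknown loop value: {stage['loop']!r}")

    # Build a session-level graph from "session.tensor" endpoints.
    edges = {}
    for link in dataflow:
        src = link["from"].split(".", 1)[0]
        dst = link["to"].split(".", 1)[0]
        edges.setdefault(src, set()).add(dst)

    visiting, done = set(), set()

    def visit(node):
        # Depth-first search; meeting a node that is still open means a cycle.
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"dataflow cycle through session {node!r}")
        visiting.add(node)
        for nxt in edges.get(node, ()):
            visit(nxt)
        visiting.discard(node)
        done.add(node)

    for node in list(edges):
        visit(node)
```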

### 3.3 The `dataflow` Array — Session Wiring

Optional. Declares how outputs from one session feed into inputs of another:

```json
"dataflow": [
  {"from": "vision.image_features",     "to": "embedding.image_features"},
  {"from": "embedding.inputs_embeds",   "to": "decoder.inputs_embeds"}
]
```

When omitted, the runtime auto-matches by tensor name (an output name in session A matches an input name in session B). Explicit entries override auto-matching for cases where tensor names differ.
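
Name-based matching with explicit overrides could be sketched as below. The `session_io` structure, the function name `resolve_dataflow`, and the tie-break rule (the most recent earlier session in flow order wins) are assumptions for illustration, not part of the proposal.

```python
def resolve_dataflow(session_io, flow_order, explicit=()):
    """Map each consumer input ("session.tensor") to the output that feeds it.

    session_io: {session_name: {"inputs": [...], "outputs": [...]}}
    flow_order: session names in flow[] order
    explicit:   dataflow entries ({"from": ..., "to": ...}) that override matches
    """
    wiring = {}
    for i, consumer in enumerate(flow_order):
        for tensor in session_io[consumer]["inputs"]:
            # Later producers overwrite earlier ones, so the closest one wins.
            for producer in flow_order[:i]:
                if tensor in session_io[producer]["outputs"]:
                    wiring[f"{consumer}.{tensor}"] = f"{producer}.{tensor}"
    for link in explicit:
        wiring[link["to"]] = link["from"]
    return wiring
```

Inputs that no earlier session produces (for example `input_ids` or `pixel_values`) stay unwired here; they would be supplied by preprocessing.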

### 3.4 The `state` Object — KV Cache & Position Strategy

```json
"state": {
  "kv_cache": {
    "format": "auto",
    "past_key_pattern": "past_key_values.{layer}.key",
    "present_key_pattern": "present.{layer}.key",
    "past_value_pattern": "past_key_values.{layer}.value",
    "present_value_pattern": "present.{layer}.value"
  },
  "position_ids": {
    "strategy": "auto",
    "input_name": "position_ids"
  }
}
```

**KV cache formats:**
- `"auto"` — introspect ONNX session I/O to detect format (default)
- `"separate"` — standard `past_key_values.{layer}.key` / `present.{layer}.key`
- `"combined"` — GPT-2 style `past_{layer}` / `present_{layer}`
- Name patterns are optional overrides when auto-detection fails

**Position ID strategies:**
- `"auto"` — introspect position_ids input shape: rank 2 → default 1D, rank 3 → mRoPE 3D
- `"default"` — standard 1D position IDs
- `"mrope_3d"` — 3-dimensional mRoPE (temporal, height, width)
- `"windowed"` — sliding window position tracking
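
Both `"auto"` modes reduce to simple checks on session metadata. A sketch, assuming the input names and the `position_ids` rank have already been read from the ONNX session; the function names are hypothetical, and the regular expressions encode only the two naming schemes listed above.

```python
import re

SEPARATE_KEY = re.compile(r"^past_key_values\.(\d+)\.key$")
COMBINED = re.compile(r"^past_(\d+)$")


def detect_kv_format(input_names):
    """Classify the KV cache layout from the session's input names."""
    if any(SEPARATE_KEY.match(name) for name in input_names):
        return "separate"
    if any(COMBINED.match(name) for name in input_names):
        return "combined"
    raise ValueError("no KV cache inputs recognized; set the name patterns explicitly")


def detect_position_strategy(position_ids_rank):
    """Rank 2 means standard 1D position IDs, rank 3 means 3D mRoPE."""
    if position_ids_rank == 2:
        return "default"
    if position_ids_rank == 3:
        return "mrope_3d"
    raise ValueError(f"unexpected position_ids rank: {position_ids_rank}")
```

When detection fails, the error points the user at the optional name patterns rather than guessing a layout.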

### 3.5 The `extends` Mechanism — Preset Inheritance

Built-in presets eliminate boilerplate for common patterns:

```json
{"pipeline": {"extends": "autoregressive-decoder"}}
```

**Built-in presets:**

| Preset Name | What It Expands To |
|-------------|-------------------|
| `autoregressive-decoder` | Single decoder session, default KV cache, default position IDs, flow: [{run: decoder, when: step}] |
| `vision-language` | Vision + embedding + decoder sessions, batched vision, default KV cache |
| `encoder-decoder` | Encoder (init) + decoder (step), cross-attention KV cache |
| `speech-language` | Speech encoder + embedding + decoder |

Presets are resolved at load time — the runtime sees a fully expanded config. Overrides replace preset defaults:

```json
{
  "pipeline": {
    "extends": "vision-language",
    "flow": [
      {"run": "vision", "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "state": {
      "position_ids": {"strategy": "mrope_3d", "grid_source": "vision.image_grid_thw"}
    }
  }
}
```
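
Preset resolution could work roughly as follows. The merge rule (nested objects merge key by key, arrays and scalars are replaced wholesale) is an assumption: the proposal says only that overrides replace preset defaults. The `PRESETS` contents are a placeholder paraphrase of the table above, and the function names are hypothetical.

```python
import copy

# Placeholder preset body; the real expansion is defined by the runtime.
PRESETS = {
    "autoregressive-decoder": {
        "flow": [{"run": "decoder", "when": "step"}],
        "state": {
            "kv_cache": {"format": "auto"},
            "position_ids": {"strategy": "default"},
        },
    },
}


def deep_merge(base, override):
    """Merge override into a copy of base; dicts merge, everything else replaces."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_preset(pipeline):
    """Expand pipeline.extends so the executor sees a complete config."""
    overrides = {k: v for k, v in pipeline.items() if k != "extends"}
    name = pipeline.get("extends")
    if name is None:
        return copy.deepcopy(overrides)
    if name not in PRESETS:
        raise ValueError(f"unknown preset {name!r}")
    return deep_merge(PRESETS[name], overrides)
```

Under this rule, overriding `state.position_ids` leaves the preset's `state.kv_cache` intact, while supplying `flow` replaces the preset's whole flow array.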

### 3.6 The Plugin API — Escape Hatch

For genuinely novel architectures that can't be expressed as standard pipelines (RNNT, SSM/Mamba, diffusion):

```json
{
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    }
  }
}
```

C++ plugin interface:
```cpp
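#include <memory>  // std::shared_ptr, std::unique_ptr

// Entry point resolved by name when the runtime loads the plugin library.
// Plugins are compiled separately from the runtime. Note that extern "C" only
// disables name mangling: the signature still uses C++ types, so the plugin
// must be built with a toolchain and standard library compatible with the
// runtime. A fully stable C ABI would need opaque handles instead.
extern "C" {
  std::shared_ptr<Pipeline> CreateRnntPipeline(
    OrtEnv& env, std::unique_ptr<Config> config);
}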
```

The plugin registers a Pipeline factory, not a Model factory — keeping the abstraction consistent. Plugins extend the pipeline type system for the ~1% of models that can't fit the declarative config.

### 3.7 Preprocessing: Image, Audio, and Variable Input Shapes

**Preprocessing is NOT the pipeline executor's job.** It transforms raw inputs (pixels, audio waveforms) into model-ready tensors. This happens BEFORE the pipeline runs and is handled by a separate, config-driven preprocessing layer.

#### The Architecture Boundary

```
Raw Input (images, audio, text)
        │
        ▼
┌─────────────────────────────┐
│   Preprocessing Layer       │
│   (ort-extensions)          │
│                             │
│   image_processor.json  ────┤──→ pixel_values, image_sizes, grid_thw
│   audio_processor.json  ────┤──→ audio_features, audio_sizes
│   tokenizer.json        ────┤──→ input_ids, attention_mask
│                             │
│   Config-driven.            │
│   Zero model-specific C++.  │
└──────────────┬──────────────┘
               │ model-ready tensors
               ▼
┌─────────────────────────────┐
│   Pipeline Executor         │
│   (this proposal)           │
└─────────────────────────────┘
```

#### Image Preprocessing — ort-extensions `image_processor.json`

Each VLM ships an `image_processor.json` that declares its preprocessing pipeline:

```json
{
  "image_processor_type": "Qwen2VLImageProcessor",
  "resample": "bicubic",
  "do_resize": true,
  "size": {"min_pixels": 3136, "max_pixels": 12845056},
  "do_rescale": true,
  "rescale_factor": 0.00392156862745098,
  "do_normalize": true,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "patch_size": 14,
  "merge_size": 2
}
```

ort-extensions loads this JSON and executes the preprocessing pipeline using its own C++ ops. **No model-type dispatch needed.** Different VLMs (Phi3v at 336×336, Qwen2.5-VL with dynamic resolution, Pixtral with variable per-image sizes) all use the same mechanism — they just ship different `image_processor.json` configs.

The C++ preprocessors (`PhiImageProcessor`, `QwenImageProcessor`, `GemmaImageProcessor`, `Mistral3ImageProcessor`) become legacy. New models use ort-extensions exclusively. This is already the direction: mobius generates `image_processor.json` for all VLMs.

#### Audio Preprocessing — `audio_processor.json`

Same pattern for speech models:

```json
{
  "audio_processor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "sampling_rate": 16000,
  "hop_length": 160,
  "chunk_length": 30,
  "n_fft": 400
}
```

For multimodal models needing both image and audio (Phi4mm):

```json
"preprocessing": {
  "image": {"config": "image_processor.json"},
  "audio": {"config": "audio_processor.json"}
}
```

#### Variable Input Shapes

Different models handle input sizes differently. The pipeline config + preprocessor config handle all cases:

| Pattern | Example | How It's Handled |
|---------|---------|-----------------|
| **Fixed-size** | Phi3v (336×336 images) | `image_processor.json` resizes to target. ONNX model has static shapes. |
| **Dynamic-size** | Qwen2.5-VL (arbitrary resolution) | `image_processor.json` does dynamic resize + patch extraction. ONNX model has dynamic shapes. Pipeline executor passes tensors as-is. |
| **Per-image variable** | Pixtral (each image different resolution) | Preprocessor zero-pads to max(H)×max(W), provides `image_sizes[N, 2]`. Pipeline flow uses `loop: per_image` + `dynamic_shape` for per-image slicing. |

For the per-image variable resolution case:
```json
{"run": "vision", "when": "init", "loop": "per_image",
 "loop_over": "pixel_values",
 "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
```

The executor slices `pixel_values[i, :, :H_i, :W_i]` where H_i and W_i come from `image_sizes[i]`. This is ~15 lines of generic loop code, not a model-specific class.
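
The per-image slicing loop can be sketched with plain nested lists standing in for tensors. The function name is hypothetical, and the sliced dimensions are hard-coded to the last two axes, matching `apply_to_dims: [2, 3]` for an `[N, C, H, W]` layout.

```python
def slice_per_image(pixel_values, image_sizes):
    """Yield each image cropped back to its true height and width.

    pixel_values: nested lists shaped [N][C][H_max][W_max] (zero-padded)
    image_sizes:  nested lists shaped [N][2], holding (H_i, W_i) per image
    """
    for image, (h, w) in zip(pixel_values, image_sizes):
        # Equivalent to pixel_values[i, :, :H_i, :W_i] on a real tensor.
        yield [[row[:w] for row in channel[:h]] for channel in image]
```

Each yielded crop would then be passed to one run of the vision session.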

#### Pipeline Config Reference

The pipeline config references preprocessing configs without embedding their details:

```json
"preprocessing": {
  "image": {"config": "image_processor.json", "format": "ort-extensions"},
  "audio": {"config": "audio_processor.json", "format": "ort-extensions"}
}
```

The `format` field future-proofs beyond ort-extensions — today it's the only value, but it enables alternative preprocessing backends (e.g., a pure ONNX preprocessing graph) without schema changes.

This clean boundary means: preprocessing is fully described by its own config files (already supported by ort-extensions), and the pipeline executor receives model-ready tensors without knowing how they were produced. Note that **half the pipeline-as-config vision is already shipped and working in production** via ort-extensions — we're completing the other half for inference orchestration.

**Every layer of the stack is config-driven. Zero model-specific C++ anywhere.**

### 3.8 Advanced KV Cache Patterns (Shared Cache, Dual Head Dim)

Some models have non-uniform KV cache layouts. Gemma4 is the most complex example:

- **Dual head_dim:** Local (sliding_attention) layers use `head_dim=256`; global (full_attention) layers use `global_head_dim=512`
- **KV sharing:** The last `num_kv_shared_layers` layers reuse K/V from earlier layers and have NO independent cache entries
- **Mixed window sizes:** Sliding window layers have bounded cache; full attention layers have unbounded cache

**These are all handled by auto-detection — no model-specific code needed.**

ORT GenAI's `DefaultKeyValueCache` already supports:
- **Sparse layer indices** (`kv_layer_indices_`): Auto-discovered by scanning which `past_key_values.{N}.key` inputs exist in the ONNX session. If the model only has layers 0-25 (skipping 26-33 due to KV sharing), the cache allocates only 26 entries.
- **Per-layer shapes** (`layer_shapes_`): Each layer can have a different `[batch, heads, seq_len, head_dim]` shape, auto-discovered from the ONNX session output shapes.
- **Per-layer sliding window**: Configurable which layers use bounded cache vs unbounded.

The pipeline config for a Gemma4-style model:

```json
"state": {
  "kv_cache": {
    "format": "auto",
    "sliding_window": {
      "window_size": 4096,
      "layers": [0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25]
    }
  }
}
```

`format: auto` handles the dual head_dim and sparse layers automatically. The `sliding_window.layers` array (already supported by ORT GenAI today) specifies which layers use bounded cache. The export tool (mobius) builds the ONNX model with cache I/O only for non-shared layers, so the runtime never needs to know about KV sharing — it's implicit in the graph structure.
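
To illustrate why KV sharing needs no runtime awareness, here is a sketch of layer discovery from input names combined with the `sliding_window` config. It is not the `DefaultKeyValueCache` implementation: `plan_kv_cache` is a hypothetical name, per-layer shapes are omitted, and only the `past_key_values.{layer}.key` naming is handled.

```python
import re

KEY_INPUT = re.compile(r"^past_key_values\.(\d+)\.key$")


def plan_kv_cache(input_names, sliding_window=None):
    """Map each layer that has cache inputs to its window size (None = unbounded)."""
    matches = (KEY_INPUT.match(name) for name in input_names)
    layers = sorted(int(m.group(1)) for m in matches if m)
    windowed = set(sliding_window["layers"]) if sliding_window else set()
    plan = {}
    for layer in layers:
        plan[layer] = sliding_window["window_size"] if layer in windowed else None
    return plan
```

Layers without cache inputs in the graph never appear in the plan, so shared layers cost no cache entries.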

### 3.9 Preprocessor↔Model Shape Alignment

**Problem:** Different models expect different image sizes and preprocessing. How do we ensure the preprocessor output matches the model's expectations without model-specific code?

**Answer: Co-generation.** The export tool (mobius) generates BOTH the ONNX model and its `image_processor.json` from the same HuggingFace config. They're guaranteed to be aligned because they share a single source of truth:

```
HuggingFace Config
├── → ONNX model (expects specific input shapes)
└── → image_processor.json (produces those exact shapes)
```

For additional safety, the pipeline config can include optional shape validation:

```json
"preprocessing": {
  "image": {
    "config": "image_processor.json",
    "format": "ort-extensions",
    "expected_outputs": {
      "pixel_values": {"rank": 4, "dtype": "float32"},
      "image_grid_thw": {"rank": 2, "dtype": "int64"}
    }
  }
}
```

The `expected_outputs` field enables load-time validation: verify that the preprocessor config produces tensors compatible with the model's inputs before running inference. This catches mismatches at load time rather than inference time.

**If someone provides the wrong preprocessor:** ONNX Runtime throws a shape mismatch error at `session.Run()`, which is already a clear, debuggable failure. The optional validation catches it earlier.
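
One way to implement the load-time check is to compare `expected_outputs` against the model's declared inputs. In this sketch, `check_expected_outputs` is a hypothetical name and `model_inputs` is assumed to be a name-to-spec mapping already extracted from the ONNX session metadata.

```python
def check_expected_outputs(expected_outputs, model_inputs):
    """Return a list of mismatch messages; an empty list means compatible.

    Both arguments map tensor name -> {"rank": int, "dtype": str}.
    """
    problems = []
    for name, spec in expected_outputs.items():
        actual = model_inputs.get(name)
        if actual is None:
            problems.append(f"{name}: model has no such input")
            continue
        for field in ("rank", "dtype"):
            if field in spec and spec[field] != actual.get(field):
                problems.append(
                    f"{name}: {field} is {spec[field]} in config, "
                    f"{actual.get(field)} in model"
                )
    return problems
```

Returning every mismatch at once, rather than raising on the first, lets a single load attempt report all problems.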

### 3.10 The `metadata` Section

model_type lives here — as documentation, not dispatch:

```json
"metadata": {
  "model_type": "qwen2_5_vl",
  "architecture": "Qwen2_5VLForConditionalGeneration",
  "source": "mobius",
  "export_version": "0.5.0"
}
```

Used for: logging, telemetry, debugging, human readability. Ignored by: all dispatch and runtime logic.

---

## 4. Concrete Schema Examples

### 4.1 Decoder-Only LLM (Minimal, 10 lines)

```json
{
  "version": 2,
  "pipeline": {
    "extends": "autoregressive-decoder",
    "sessions": {"decoder": {"file": "model.onnx"}}
  },
  "tokens": {"eos": [151645], "pad": 0},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2", "source": "mobius"}
}
```

### 4.2 Vision-Language Model (Qwen2.5-VL style, 33 lines)

```json
{
  "version": 2,
  "pipeline": {
    "extends": "vision-language",
    "sessions": {
      "vision":    {"file": "vision_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder":   {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision",    "when": "init", "loop": "per_image"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder",   "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features",   "to": "embedding.image_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {
        "strategy": "mrope_3d",
        "grid_source": "vision.image_grid_thw"
      }
    },
    "preprocessing": {
      "image": {"config": "image_processor.json"}
    }
  },
  "tokens": {"eos": [151645], "pad": 0, "image_token": 151655},
  "generation": {"max_length": 4096, "sampling": {"temperature": 0.7}},
  "metadata": {"model_type": "qwen2_5_vl", "source": "mobius"}
}
```

### 4.3 Encoder-Decoder (Whisper style)

```json
{
  "version": 2,
  "pipeline": {
    "extends": "encoder-decoder",
    "sessions": {
      "encoder": {"file": "encoder/model.onnx"},
      "decoder": {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "encoder", "when": "init"},
      {"run": "decoder", "when": "step", "cross_attention_from": "encoder"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "cross_cache": {"source": "encoder", "frozen": true}
    }
  },
  "tokens": {"eos": [50257], "pad": 50257, "decoder_start": 50258},
  "generation": {"max_length": 448},
  "metadata": {"model_type": "whisper", "source": "mobius"}
}
```

### 4.4 Multimodal (Vision + Audio — Phi4mm style)

```json
{
  "version": 2,
  "pipeline": {
    "sessions": {
      "vision":    {"file": "vision_encoder/model.onnx"},
      "speech":    {"file": "audio_encoder/model.onnx"},
      "embedding": {"file": "embedding/model.onnx"},
      "decoder":   {"file": "decoder/model.onnx"}
    },
    "flow": [
      {"run": "vision",    "when": "init", "loop": "batched"},
      {"run": "speech",    "when": "init", "loop": "batched"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder",   "when": "step"}
    ],
    "dataflow": [
      {"from": "vision.image_features", "to": "embedding.image_features"},
      {"from": "speech.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ],
    "state": {
      "kv_cache": {"format": "auto"},
      "position_ids": {"strategy": "default"}
    }
  },
  "tokens": {"eos": [32007], "pad": 32000},
  "generation": {"max_length": 4096},
  "metadata": {"model_type": "phi4mm", "source": "mobius"}
}
```

### 4.5 Novel Architecture via Plugin (RNNT)

```json
{
  "version": 2,
  "pipeline": {
    "plugin": {
      "library": "libgenai_rnnt.so",
      "entry_point": "CreateRnntPipeline"
    },
    "sessions": {
      "encoder":  {"file": "encoder/model.onnx"},
      "predictor": {"file": "predictor/model.onnx"},
      "joiner":   {"file": "joiner/model.onnx"}
    }
  },
  "metadata": {"model_type": "nemotron_speech", "source": "mobius"}
}
```

---

## 5. Implementation Plan

### Overview

This is a **refactor, not a rewrite.** The generation loop, search/sampling, tokenizer, KV cache internals, and all language bindings remain unchanged. We're replacing the model dispatch layer with a pipeline dispatch layer.

**Net code change estimate: +950 lines added, -1500 lines deleted** (the per-PR figures are totaled in the Implementation Summary table at the end of this section). The codebase gets smaller.

### PR 1: Config Schema v2 Parser + Backward Compatibility (~300 LOC)

**Files changed:**
- `src/config.h` — Add `Pipeline` struct with `sessions`, `flow`, `dataflow`, `state`, `extends` fields
- `src/config.cpp` — Parse v2 schema; add v1→v2 translator that converts old-format configs to pipeline format
- New: `src/pipeline_presets.h` — Built-in preset definitions (autoregressive-decoder, vision-language, encoder-decoder, speech-language)

**Logic:**
```cpp
// In Config constructor:
if (json.contains("version") && json["version"] == 2) {
  ParsePipelineConfig(json);  // New v2 path
} else {
  ParseLegacyConfig(json);    // Existing v1 path
  TranslateV1ToV2();          // Convert to pipeline format internally
}
```

**Backward compatibility guarantee:** Every existing genai_config.json translates to the same internal `Pipeline` struct that an equivalent hand-written v2 config would produce. The v1→v2 translator maps:
- `model.type` + `model_type.h` classification → appropriate preset
- `model.decoder.inputs/outputs` → `state.kv_cache` patterns
- `model.vision/speech/embedding` sections → `sessions` + `flow` + `dataflow`

**Tests:** All existing config tests pass unchanged. New tests for v2 parsing, preset resolution, `extends` override logic.

### PR 2: PipelineExecutor Class (~350 LOC)

**Files changed:**
- New: `src/models/pipeline_executor.h` — PipelineExecutor class definition
- New: `src/models/pipeline_executor.cpp` — Implementation
- `src/models/model.cpp` — Replace `CreateModel()` with `CreatePipeline()` using structural detection

**The core class:**
```cpp
class PipelineExecutor : public State {
public:
  PipelineExecutor(std::unique_ptr<Config> config, OrtEnv& env);
  
  DeviceSpan<float> RunStep(int total_length, DeviceSpan<int32_t>& next_tokens,
                            DeviceSpan<int32_t> next_indices) override;
  
private:
  // Loaded from config
  std::map<std::string, std::unique_ptr<OrtSession>> sessions_;
  std::vector<FlowStep> prompt_flow_;   // Steps where when == "init" (run once, during prompt processing)
  std::vector<FlowStep> decode_flow_;   // Steps where when == "step" (run on every generation step)
  std::vector<DataflowWire> dataflow_;
  
  // State (auto-detected or config-driven)
  std::unique_ptr<KeyValueCache> kv_cache_;
  std::unique_ptr<PositionStrategy> position_ids_;
  DefaultInputIDs input_ids_{*this};
  Logits logits_{*this};
  
  // Data flow between sessions
  std::map<std::string, std::unique_ptr<OrtValue>> intermediates_;
  
  bool is_prompt_{true};
  
  void WireInputs(const FlowStep& step);
  void WireOutputs(const FlowStep& step);
  void RunFlowStep(const FlowStep& step, bool graph_capture);
};
```

**Structural detection in CreatePipeline() (replaces CreateModel()):**
```cpp
std::shared_ptr<Model> CreatePipeline(OrtEnv& env, std::unique_ptr<Config> config) {
  auto& pipeline = config->pipeline;
  
  // Plugin escape hatch
  if (pipeline.plugin.has_value()) {
    return LoadPluginPipeline(pipeline.plugin.value(), std::move(config), env);
  }
  
  // Structural detection — no string dispatch
  bool has_encoder_with_cross_attn = HasCrossAttentionFlow(pipeline.flow);
  bool has_multiple_sessions = pipeline.sessions.size() > 1;
  
  if (has_encoder_with_cross_attn) {
    return std::make_shared<EncoderDecoderPipeline>(std::move(config), env);
  }
  if (has_multiple_sessions) {
    return std::make_shared<MultiSessionPipeline>(std::move(config), env);
  }
  return std::make_shared<DecoderPipeline>(std::move(config), env);
}
```
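
`HasCrossAttentionFlow` is called above but not defined in this proposal. A minimal sketch of what it could look like, assuming a `FlowStep` that stores the optional `cross_attention_from` key used in the Whisper example (the struct layout here is hypothetical and is not the actual ORT GenAI type):

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Hypothetical parsed flow entry; only the fields needed for detection.
struct FlowStep {
  std::string session_name;
  // Set when the config step carries "cross_attention_from": "<session>".
  std::optional<std::string> cross_attention_from;
};

// True if any flow step declares a cross-attention source. The decision is
// based purely on config structure, never on a model_type string.
inline bool HasCrossAttentionFlow(const std::vector<FlowStep>& flow) {
  return std::any_of(flow.begin(), flow.end(), [](const FlowStep& step) {
    return step.cross_attention_from.has_value();
  });
}
```

With this shape, the Whisper config in Section 4.3 would route to the encoder-decoder pipeline, while a decoder-only config would fall through to the later checks.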

### PR 3: Flow Interpreter + Dataflow Wiring (~200 LOC)

**Files changed:**
- New: `src/models/flow_interpreter.h/.cpp` — Interprets `flow[]` and `dataflow[]`
- `src/models/pipeline_executor.cpp` — Uses flow interpreter

**Key logic:**
```cpp
void PipelineExecutor::RunFlowStep(const FlowStep& step, bool graph_capture) {
  // at() throws for an undeclared session; operator[] would silently insert a null entry
  auto& session = sessions_.at(step.session_name);
  
  if (step.loop == LoopMode::PerImage) {
    // Per-image loop: iterate over the input tensor's leading (per-image) dimension
    auto input_slices = SliceTensorDim0(GetInput(step, step.loop_over));
    // OrtValue is an opaque handle, so parts are held by pointer rather than by value
    std::vector<std::unique_ptr<OrtValue>> output_parts;
    for (auto& slice : input_slices) {
      BindSlicedInput(step, slice);
      session->Run();
      output_parts.push_back(CaptureOutput(step));
    }
    intermediates_[step.output_key] = ConcatenateDim0(output_parts);
  } else {
    // Standard batched execution
    WireInputs(step);
    session->Run(graph_capture);
    WireOutputs(step);
  }
}
```

**Dataflow wiring:**
```cpp
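void PipelineExecutor::WireInputs(const FlowStep& step) {
  for (auto& wire : dataflow_) {
    if (wire.to_session == step.session_name) {
      // Wire output from a previous session to an input of this session.
      // find() avoids operator[] inserting a null entry if the producer has not run yet.
      auto source = intermediates_.find(wire.from_key);
      if (source == intermediates_.end()) {
        throw std::runtime_error("Dataflow source not produced yet: " + wire.from_key);
      }
      BindInput(step, wire.to_input_name, source->second);
    }
  }
}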
```
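
The wiring code above assumes each `dataflow[]` entry has already been split into session and tensor names. A small sketch of that parsing step, assuming endpoints always have the form `session.tensor` and that session names never contain a dot (`ParseWire`, `SplitEndpoint`, and the exact `DataflowWire` fields are illustrative names, not part of the existing codebase):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical wire record; to_session, to_input_name and from_key mirror
// the fields used by WireInputs.
struct DataflowWire {
  std::string from_key;       // full "session.tensor" key into intermediates_
  std::string from_session;
  std::string from_tensor;
  std::string to_session;
  std::string to_input_name;
};

// Splits "session.tensor" at the FIRST dot, so tensor names may themselves
// contain dots. Throws on malformed endpoints.
inline void SplitEndpoint(const std::string& endpoint,
                          std::string& session, std::string& tensor) {
  const auto dot = endpoint.find('.');
  if (dot == std::string::npos || dot == 0 || dot + 1 == endpoint.size()) {
    throw std::invalid_argument("Dataflow endpoint \"" + endpoint +
                                "\" must have the form session.tensor");
  }
  session = endpoint.substr(0, dot);
  tensor = endpoint.substr(dot + 1);
}

// Builds one wire from the "from" and "to" strings of a dataflow[] entry.
inline DataflowWire ParseWire(const std::string& from, const std::string& to) {
  DataflowWire wire;
  wire.from_key = from;
  SplitEndpoint(from, wire.from_session, wire.from_tensor);
  SplitEndpoint(to, wire.to_session, wire.to_input_name);
  return wire;
}
```

Doing this once at load time keeps string handling out of the per-step path, which then only performs map lookups.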

### PR 4: Plugin API (~100 LOC)

**Files changed:**
- New: `src/models/plugin_api.h` — Stable C ABI for pipeline plugins
- New: `src/models/plugin_loader.cpp` — Dynamic library loading (dlopen/LoadLibrary)

**Interface:**
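```cpp
// plugin_api.h: plugin entry-point contract, ships with ORT GenAI headers.
// Caveat: extern "C" only disables name mangling so the loader can resolve the
// factory by name. std::shared_ptr and std::unique_ptr still cross the boundary,
// so a plugin must be built with the same compiler and standard library as
// ORT GenAI. A fully stable C ABI would pass opaque handles instead.
extern "C" {
  typedef std::shared_ptr<Model> (*PipelineFactoryFn)(
    OrtEnv& env, std::unique_ptr<Config> config);
}

// In plugin .so/.dll: the symbol name must match pipeline.plugin.entry_point.
extern "C" {
  std::shared_ptr<Model> CreateRnntPipeline(
    OrtEnv& env, std::unique_ptr<Config> config) {
    return std::make_shared<RnntPipeline>(std::move(config), env);
  }
}
```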

### PR 5: Delete Model-Type Dispatch (~-1500 LOC)

**Files deleted:**
- `src/models/model_type.h` — The entire file

**Files simplified:**
- `src/models/model.cpp` — Remove `CreateModel()` if-else chain, replace with `CreatePipeline()`
- `src/models/position_inputs.cpp` — Remove `IsQwenVLFamily()` check; position strategy comes from config
- `src/models/multi_modal.cpp` — Remove `CreateVisionState()` model_type dispatch; vision loop mode comes from config flow

**Files eventually deprecated** (kept for v1 compat, removed in future release):
- `src/models/gpt.h/cpp` — Absorbed into generic pipeline with `kv_cache.format: combined`
- Per-model C++ preprocessors (phi_image_processor, gemma_image_processor, etc.) — Replaced by ort-extensions `image_processor.json`

### Implementation Summary

| PR | Description | LOC Added | LOC Deleted | Net |
|----|-------------|-----------|-------------|-----|
| PR 1 | Config v2 parser + v1 translator | +300 | -0 | +300 |
| PR 2 | PipelineExecutor classes | +350 | -0 | +350 |
| PR 3 | Flow interpreter + dataflow | +200 | -0 | +200 |
| PR 4 | Plugin API | +100 | -0 | +100 |
| PR 5 | Delete model_type dispatch | +0 | -1500 | -1500 |
| **Total** | | **+950** | **-1500** | **-550** |

**The codebase shrinks by ~550 lines while gaining full model-agnostic extensibility.**

---

## 6. Compatibility Matrix

| Model Scenario | Today | After PR 1-2 | After PR 1-5 |
|---------------|-------|-------------|--------------|
| **Existing Llama/Phi/Gemma (v1 config)** | ✅ Works | ✅ Works (v1→v2 translator) | ✅ Works (translator) |
| **New decoder-only LLM (unknown type)** | ❌ Rejected by whitelist | ✅ 7-line v2 config | ✅ 7-line v2 config |
| **Custom fine-tune with custom model_type** | ❌ Rejected by whitelist | ✅ extends preset | ✅ extends preset |
| **New VLM family** | ❌ Needs new C++ class + processor | ✅ ~25-line v2 config | ✅ ~25-line v2 config |
| **Qwen2.5-VL (3D mRoPE, per-image vision)** | ✅ Hardcoded | ✅ v2 config with position_strategy + loop | ✅ Config-driven |
| **Pixtral/Mistral3 (variable resolution)** | ✅ Hardcoded | ✅ v2 config with per_image loop + dynamic_shape | ✅ Config-driven |
| **Whisper (encoder-decoder)** | ✅ Hardcoded | ✅ v2 config with encoder-decoder preset | ✅ Config-driven |
| **GPT-2 (combined KV cache)** | ✅ Hardcoded (separate class) | ✅ v2 config with kv_cache.format: combined | ✅ Config-driven |
| **Mamba/SSM (recurrent, no KV)** | ❌ Not supported | ⚠️ Needs state.type: recurrent | ✅ Config-driven |
| **RNNT (non-autoregressive)** | ✅ Hardcoded | ✅ Plugin .so | ✅ Plugin .so |
| **Novel architecture (unknown future)** | ❌ Major C++ work | ✅ Plugin .so, zero runtime changes | ✅ Plugin .so |
| **Phi4mm (vision + audio)** | ✅ Hardcoded | ✅ v2 config with 4 sessions | ✅ Config-driven |

---

## 7. Technical Feasibility

### 7.1 CUDA Graph Capture

**Concern:** CUDA graphs require identical session topology and buffer shapes between captures and replays. Does a generic pipeline executor break this?

**Answer: No.** The executor pre-computes a "decode flow" (steps where `when: "step"`) at init time. During token generation, only the decode flow runs — this is a fixed, repeatable sequence identical to what the current `DecoderOnly_State::Run()` does. CUDA graph capture applies to this fixed sequence:

```cpp
bool graph_capture = !is_prompt_ && params_->use_graph_capture 
                     && input_ids_.GetShape()[1] == 1;
// Only the decode_flow_ steps run — topology is fixed
for (auto& step : decode_flow_) {
  RunFlowStep(step, graph_capture);
}
```
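
The "pre-computes a decode flow at init time" step can be sketched as a one-time partition of `flow[]` by phase. This is an illustration under assumptions: `Phase`, `PhasedFlow`, `ParsePhase`, and `PartitionFlow` are hypothetical names, and only the `init`/`step`/`final` spellings are accepted.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

enum class Phase { Init, Step, Final };

// Hypothetical parsed flow entry: which session to run and in which phase.
struct FlowStep {
  std::string session_name;
  Phase when;
};

struct PhasedFlow {
  std::vector<FlowStep> init_flow;   // runs once, before the generation loop
  std::vector<FlowStep> step_flow;   // runs on every generation step (the decode flow)
  std::vector<FlowStep> final_flow;  // runs once, after the loop ends
};

// Maps the config string to a phase; unknown values fail at load time.
inline Phase ParsePhase(const std::string& when) {
  if (when == "init") return Phase::Init;
  if (when == "step") return Phase::Step;
  if (when == "final") return Phase::Final;
  throw std::invalid_argument("Unknown flow phase \"" + when + "\"");
}

// Splits flow[] by phase once, preserving declaration order within each phase.
inline PhasedFlow PartitionFlow(const std::vector<FlowStep>& flow) {
  PhasedFlow phased;
  for (const auto& step : flow) {
    switch (step.when) {
      case Phase::Init:  phased.init_flow.push_back(step);  break;
      case Phase::Step:  phased.step_flow.push_back(step);  break;
      case Phase::Final: phased.final_flow.push_back(step); break;
    }
  }
  return phased;
}
```

Because `step_flow` is fixed after load, the sequence replayed during token generation never changes shape, which is the property CUDA graph capture relies on.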

### 7.2 Memory Pre-allocation

**Concern:** The current code pre-allocates KV cache buffers based on model dimensions. Can a generic executor do this without model-specific knowledge?

**Answer: Yes.** KV cache dimensions come from config (`decoder.num_hidden_layers`, `decoder.num_key_value_heads`, `decoder.head_size`) or are discoverable from ONNX session output shapes at init time. The current `DefaultKeyValueCache` already auto-discovers layer count by pattern-matching present tensor names in the session. A generic executor uses the same mechanism — zero model-specific knowledge needed.

### 7.3 Performance Overhead

**Concern:** Does the generic pipeline add overhead vs hand-optimized model classes?

**Answer: Negligible.** The overhead is:
- One `for` loop over the `decode_flow_` steps per generation step (typically 1 step for LLMs)
- One map lookup per dataflow wire per step
- These are nanosecond-scale operations vs millisecond-scale ONNX session runs

The hot path — `session.Run()` + KV cache management — is identical to the current code. The generation loop, search/sampling, and tokenizer are completely unchanged.

### 7.4 Config Validation

Invalid configs must produce clear errors at load time, not runtime crashes:

| Error | Message |
|-------|---------|
| Session referenced in flow but not declared | `Flow step references session "vision" but no such session is declared in pipeline.sessions` |
| Dataflow references non-existent tensor | `Dataflow wire references output "image_features" but session "vision" has no such output (available outputs: hidden_states, pooler_output)` |
| Unknown position strategy | `Unknown position_ids strategy "my_custom". Valid options: auto, default, mrope_3d, windowed` |
| Cycle in dataflow | `Circular dependency detected in dataflow: vision → embedding → decoder → vision` |
| Unknown preset | `Unknown pipeline preset "my-preset". Built-in presets: autoregressive-decoder, vision-language, encoder-decoder, speech-language` |
| Missing required field | `Pipeline config requires at least one session. Add "sessions": {"decoder": {"file": "model.onnx"}}` |
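
Three of the checks in the table (missing sessions, undeclared flow references, dataflow cycles) can be sketched without touching ONNX Runtime at all. The code below is a simplified illustration: `PipelineSpec` and `ValidatePipeline` are hypothetical names, wires are reduced to session-to-session edges, and the messages are shortened. The cycle check uses Kahn's algorithm, so a self-referential wire (as in the v2.1 Qwen3 TTS example) would be reported as a cycle unless explicitly exempted.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified view of a parsed pipeline config.
struct PipelineSpec {
  std::set<std::string> sessions;                          // declared session names
  std::vector<std::string> flow;                           // session run by each flow step
  std::vector<std::pair<std::string, std::string>> wires;  // (from_session, to_session)
};

// Returns every problem found so that all errors can be reported together.
inline std::vector<std::string> ValidatePipeline(const PipelineSpec& spec) {
  std::vector<std::string> errors;
  if (spec.sessions.empty()) {
    errors.push_back("Pipeline config requires at least one session.");
  }
  for (const auto& name : spec.flow) {
    if (spec.sessions.count(name) == 0) {
      errors.push_back("Flow step references session \"" + name +
                       "\" but no such session is declared in pipeline.sessions");
    }
  }

  // Kahn's algorithm: repeatedly remove sessions with no unprocessed
  // producers; anything left over sits on a cycle.
  std::map<std::string, std::set<std::string>> consumers;
  std::map<std::string, int> pending_inputs;
  for (const auto& wire : spec.wires) {
    pending_inputs.emplace(wire.first, 0);
    pending_inputs.emplace(wire.second, 0);
    if (consumers[wire.first].insert(wire.second).second) {
      ++pending_inputs[wire.second];
    }
  }
  std::vector<std::string> ready;
  for (const auto& entry : pending_inputs) {
    if (entry.second == 0) ready.push_back(entry.first);
  }
  std::size_t processed = 0;
  while (!ready.empty()) {
    const std::string session = ready.back();
    ready.pop_back();
    ++processed;
    for (const auto& consumer : consumers[session]) {
      if (--pending_inputs[consumer] == 0) ready.push_back(consumer);
    }
  }
  if (processed != pending_inputs.size()) {
    errors.push_back("Circular dependency detected in dataflow");
  }
  return errors;
}
```

Collecting all errors instead of throwing on the first one lets a config author fix several mistakes per load attempt.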

### 7.5 The `per_image` Loop for Vision

QwenVisionState and PixtralVisionState loop over images individually with different slicing strategies. The flow interpreter handles this generically:

```json
{"run": "vision", "when": "init", "loop": "per_image",
 "loop_over": "pixel_values"}
```

For Pixtral's variable-resolution cropping (per-image height/width from `image_sizes`):
```json
{"run": "vision", "when": "init", "loop": "per_image",
 "loop_over": "pixel_values",
 "dynamic_shape": {"source": "image_sizes", "apply_to_dims": [2, 3]}}
```

The executor slices `pixel_values[i, :, :H_i, :W_i]` where `H_i, W_i` come from `image_sizes[i]`. This is ~15 lines of generic loop code, not a model-specific class.
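
To make the slicing concrete, here is a host-side sketch of extracting `pixel_values[i, :, :H_i, :W_i]` from a dense buffer. It assumes a row-major `[N, C, H, W]` float layout held in a `std::vector<float>`; a real executor would operate on `OrtValue` tensors, possibly on device. `ImageBatchShape` and `CropImage` are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Shape of a dense, row-major [N, C, H, W] float buffer.
struct ImageBatchShape {
  std::size_t n, c, h, w;
};

// Copies pixel_values[i, :, :h_i, :w_i] into a new contiguous buffer of
// shape [C, h_i, w_i].
inline std::vector<float> CropImage(const std::vector<float>& pixel_values,
                                    const ImageBatchShape& shape,
                                    std::size_t i, std::size_t h_i, std::size_t w_i) {
  if (pixel_values.size() != shape.n * shape.c * shape.h * shape.w) {
    throw std::invalid_argument("pixel_values size does not match shape");
  }
  if (i >= shape.n || h_i > shape.h || w_i > shape.w) {
    throw std::out_of_range("crop exceeds the padded image bounds");
  }
  std::vector<float> crop;
  crop.reserve(shape.c * h_i * w_i);
  const float* image = pixel_values.data() + i * shape.c * shape.h * shape.w;
  for (std::size_t c = 0; c < shape.c; ++c) {
    for (std::size_t y = 0; y < h_i; ++y) {
      // Start of row y in channel c; keep only the first w_i columns.
      const float* row = image + (c * shape.h + y) * shape.w;
      crop.insert(crop.end(), row, row + w_i);
    }
  }
  return crop;
}
```

The per-image height and width would come from the tensor named in `dynamic_shape.source`, and `apply_to_dims: [2, 3]` corresponds to the `h_i` and `w_i` arguments here.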

---

## 8. The Pitch to the ORT GenAI Team

### Framing

> **"Your runtime is already 90% model-agnostic. We're proposing you formalize what's already true — and eliminate the last 10% of model-specific code."**
>
> Today, 21 of 32 model types share identical C++ code. The generation loop doesn't know what model it's running. The KV cache auto-discovers its own layout. The only thing preventing any new model from working is a string whitelist that adds no value.
>
> We're not asking you to change your architecture — we're asking you to recognize that your architecture has already evolved past the model_type dispatch layer. The pipeline config makes the implicit explicit.

### Value Proposition

| For | Today | With Pipeline-as-Config |
|-----|-------|------------------------|
| **ORT GenAI team** | Bottlenecked on model support PRs | Never writes model-specific code again |
| **Model builders (mobius/Olive)** | Must coordinate with runtime team for every new model | Ship independently — generate config, done |
| **ML engineers** | Wait for runtime releases | New models work immediately |
| **The ecosystem** | ORT GenAI lags behind HuggingFace model zoo | ORT GenAI supports any ONNX model by design |

### The Key Selling Point

**This REDUCES ORT GenAI's maintenance burden.** The team goes from "we must ship a PR for every new HuggingFace model" to "we maintain a stable pipeline runtime." New model support becomes the exporter's responsibility (mobius/Olive), not the runtime's.

**"You build the engine. We build the cars."**

---

## 9. Risk Analysis

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|-----------|
| **Performance regression** for existing models | Low | High | Benchmark all 32 model types before/after. The hot path is identical. |
| **Config complexity** deters users | Medium | Medium | Presets with `extends` reduce 90% of configs to 7 lines. JSON Schema for IDE support. |
| **Edge cases** in flow interpreter | Medium | Medium | Comprehensive test matrix covering all 32 model types. Validation at load time. |
| **ORT GenAI team** rejects the proposal | Medium | High | Start with the blacklist inversion (5 lines) to build trust. Present the full vision as an RFC. |
| **Plugin ABI stability** across versions | Low | Medium | Version the plugin API. Keep it minimal (1 factory function). |
| **v1→v2 translator** has subtle bugs | Medium | Medium | The translator is tested against every existing genai_config.json in the test suite. |

### Immediate Bridge (While Building the Future)

While the pipeline-as-config architecture is implemented, mobius can unblock users TODAY:

1. For unregistered LLM model_types: emit `"type": "decoder"` + `"original_model_type": "<real_type>"` in genai_config.json
2. `"decoder"` is in the current whitelist and routes to `DecoderOnly_Model`
3. When pipeline-as-config ships, switch to v2 format with the real model_type in metadata

---

## 10. The mobius Role: Pipeline Compiler

mobius already knows everything needed to generate complete pipeline configs:

| What mobius knows | How it maps to pipeline config |
|------------------|-------------------------------|
| Model architecture (decoder-only, VLM, enc-dec) | Which preset to `extend` |
| Number and type of ONNX sessions | `pipeline.sessions` |
| Vision invocation pattern (batched vs per-image) | `flow[].loop` |
| Position embedding strategy (1D, 3D mRoPE) | `state.position_ids.strategy` |
| KV cache format (separate, combined) | `state.kv_cache.format` |
| All I/O tensor names | `state.kv_cache.*_pattern`, `dataflow[]` |
| Token IDs, generation params | `tokens`, `generation` |

**Implementation in mobius:** Extend the existing `_write_genai_config()` function to emit v2 format alongside (or instead of) v1. The pipeline config is generated from the same model metadata that already drives ONNX graph construction.

---

## 11. Research Direction: Self-Contained Generation Graphs

As a long-term research direction (not part of the core proposal), we explored embedding generation logic inside the ONNX graph itself. Microsoft's existing `com.microsoft.BeamSearch` and `com.microsoft.GreedySearch` contrib ops prove this is technically possible.

**Viable for:** Offline batch inference, edge deployment, WebAssembly

**Not viable for:** Interactive serving (streaming, continuous batching, speculative decoding — all require host-side coordination)

**Potential approach:** Small set of generation-specific custom ops (`GenerationKVCacheUpdate`, `SampleTopP`) that the runtime provides as efficient primitives, while the ONNX graph carries the generation logic. Worth exploring for simple deployment scenarios but not the primary architecture.

---

## 12. Beyond Autoregressive: TTS, Diffusion, and Multimodal Audio

### 12.1 The Question

Pipeline-as-Config is designed around autoregressive token generation. But the model ecosystem includes fundamentally different generation patterns:

- **TTS (text-to-speech):** Text → mel spectrogram → audio waveform (multi-stage, often non-autoregressive)
- **Diffusion (image generation):** Iterative denoising loop with fixed step count, noise scheduling, no token sampling
- **Audio+text multimodal:** Mixed modality inputs (audio + text → text), structurally similar to VLMs

Can `flow[]`/`dataflow[]`/`state{}` express these? Or is the schema inherently autoregressive?

### 12.2 The Honest Assessment

| Model Type | Schema Expressive? | Runtime Can Execute? | What's Missing |
|---|---|---|---|
| Audio+text multimodal (Phi4mm, speech-language) | ✅ Yes | ✅ Yes | Nothing — structurally identical to VLMs |
| Encoder-decoder (Whisper, Marian) | ✅ Yes | ✅ Yes | Nothing — already supported |
| Autoregressive TTS (Bark, VALL-E) | ✅ Yes | ✅ Yes | Add `when: "final"` for vocoder post-processing |
| Non-autoregressive TTS (VITS, FastSpeech2) | ✅ Yes | ❌ No | Sequential executor + non-token output |
| Diffusion (SD, Flux, DiT) | ⚠️ Topology yes | ❌ No | Iterative executor, scheduler state, latent init, non-token output |

**Key insight: the `flow[]`/`dataflow[]` schema is MORE GENERAL than the current runtime.** It can already describe these topologies. The bottleneck is the C++ `Generator`, which assumes autoregressive token generation.

### 12.3 Concrete Config Examples

**Audio+text multimodal (works today with pipeline-as-config):**

```json
{
  "pipeline": {
    "extends": "speech-language",
    "sessions": {
      "audio_encoder": {"file": "audio_encoder.onnx"},
      "embedding": {"file": "embedding.onnx"},
      "decoder": {"file": "decoder.onnx"}
    },
    "flow": [
      {"run": "audio_encoder", "when": "init"},
      {"run": "embedding", "when": "init"},
      {"run": "decoder", "when": "step"}
    ],
    "dataflow": [
      {"from": "audio_encoder.audio_features", "to": "embedding.audio_features"},
      {"from": "embedding.inputs_embeds", "to": "decoder.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 4096}
}
```

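Phi4mm-style speech-language models already follow this pattern (Whisper uses the encoder-decoder preset, and Nemotron Speech is RNNT and goes through the plugin path). No schema changes needed.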

**Autoregressive TTS (Bark — works with minor extension):**

```json
{
  "pipeline": {
    "sessions": {
      "decoder": {"file": "decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "decoder", "when": "step"},
      {"run": "vocoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "decoder.audio_tokens", "to": "vocoder.input_ids"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}
```

New: `when: "final"` — runs after the generation loop completes (post-processing). Trivial to add.

**Non-autoregressive TTS (VITS — needs sequential executor):**

```json
{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "duration_predictor": {"file": "duration.onnx"},
      "mel_decoder": {"file": "mel_decoder.onnx"},
      "vocoder": {"file": "vocoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "duration_predictor", "when": "init"},
      {"run": "mel_decoder", "when": "init"},
      {"run": "vocoder", "when": "init"}
    ],
    "output": {"session": "vocoder", "name": "audio_waveform"}
  },
  "generation": {"loop": "single_pass"}
}
```

New: `"loop": "single_pass"` — no generation loop, run all flow steps once, return output tensor. Requires a `SequentialExecutor` (~100 LOC).

**Complex TTS with inner loops (Qwen3 TTS — needs flow step extensions):**

Qwen3 TTS is a four-model pipeline (speaker_encoder, embedding, talker, code_predictor). The talker is autoregressive (KV cache, logits), but within each generation step the code_predictor runs 14 times in an inner loop driven by a step counter:

```json
{
  "pipeline": {
    "sessions": {
      "embedding": {"file": "embedding.onnx"},
      "talker": {"file": "talker.onnx"},
      "code_predictor": {"file": "code_predictor.onnx"},
      "speaker_encoder": {"file": "speaker_encoder.onnx", "optional": true}
    },
    "flow": [
      {"run": "speaker_encoder", "when": "init", "optional": true},
      {"run": "embedding", "when": "init"},
      {"run": "talker", "when": "step"},
      {"run": "code_predictor", "when": "step", "repeat": 14, "counter": "step_index"}
    ],
    "dataflow": [
      {"from": "embedding.text_embeds", "to": "talker.inputs_embeds"},
      {"from": "talker.last_hidden_state", "to": "code_predictor.inputs_embeds"},
      {"from": "code_predictor.codec_embeddings", "to": "code_predictor.inputs_embeds"}
    ]
  },
  "generation": {"loop": "autoregressive", "max_length": 2048}
}
```

New concepts: `repeat: N` on a flow step (inner loop within each generation step), `counter` field (provides a step index input), and self-referential dataflow (code_predictor output feeds back into itself). These are v2.1 extensions. Until then, the plugin escape hatch covers complex TTS.

**Diffusion (Stable Diffusion, Flux) — in scope for schema design, out of scope for v2.0 implementation:**

Diffusion has a fundamentally different generation loop: fixed N-step denoising with noise scheduling, classifier-free guidance (conditional UNet double-call), and non-neural scheduler math between iterations. The `flow[]`/`dataflow[]` schema can express the session topology using the same `init/step/final` phases:

```json
{
  "pipeline": {
    "sessions": {
      "text_encoder": {"file": "text_encoder.onnx"},
      "unet": {"file": "unet.onnx"},
      "vae_decoder": {"file": "vae_decoder.onnx"}
    },
    "flow": [
      {"run": "text_encoder", "when": "init"},
      {"run": "unet", "when": "step"},
      {"run": "vae_decoder", "when": "final"}
    ],
    "dataflow": [
      {"from": "text_encoder.text_embeddings", "to": "unet.encoder_hidden_states"},
      {"from": "unet.noise_pred", "to": "vae_decoder.latent_sample"}
    ]
  },
  "generation": {
    "loop": "denoising",
    "num_steps": 50,
    "scheduler": "euler_discrete",
    "guidance_scale": 7.5
  }
}
```

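Note how `init/step/final` maps naturally: text_encoder = init, unet = step (each denoising iteration), vae_decoder = final (after the loop). The same three phases work for autoregressive and denoising loops, so no schema fork is needed. One simplification in the example: the wire from `unet.noise_pred` to `vae_decoder.latent_sample` stands in for the latents that the scheduler updates on every step. The VAE decoder consumes the final latents, not the raw noise prediction.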

The denoising loop itself requires host-side C++ logic (scheduler.step, CFG interpolation); expressing it declaratively would require a Turing-complete config language. The right approach is a dedicated `DenoisingExecutor` in C++ that implements the denoising loop, analogous to how the autoregressive loop is C++ today. The config parameterizes it; the C++ implements it. The loop skeleton is ~300 LOC, but the full scheduler zoo (Euler, DDPM, DDIM, DPM-Solver, LCM, Flow Matching) plus classifier-free guidance and ControlNet support adds roughly 1000-1500 LOC. The pipeline executor and schema don't change: `scheduler: "euler_discrete"` is just a string that selects a C++ implementation.

**Implementation is deferred** — diffusion users have different tooling (ComfyUI, diffusers), different serving patterns (no streaming, batch-oriented), and ORT already has separate diffusion pipeline support. But the schema design explicitly accommodates diffusion so no breaking changes are needed when the executor is added.
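
For a sense of what the host-side math between UNet calls looks like, here is a sketch of classifier-free guidance followed by one Euler update. It assumes an epsilon-predicting model, a noise level schedule expressed as sigmas, and latents flattened into a `std::vector<float>`. `ApplyGuidance` and `EulerStep` are hypothetical names; a production scheduler would also handle input scaling, timestep spacing, and other prediction types.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Classifier-free guidance: uncond + scale * (cond - uncond), elementwise.
inline std::vector<float> ApplyGuidance(const std::vector<float>& uncond,
                                        const std::vector<float>& cond,
                                        float guidance_scale) {
  if (uncond.size() != cond.size()) {
    throw std::invalid_argument("guidance inputs must have the same size");
  }
  std::vector<float> guided(uncond.size());
  for (std::size_t k = 0; k < uncond.size(); ++k) {
    guided[k] = uncond[k] + guidance_scale * (cond[k] - uncond[k]);
  }
  return guided;
}

// One Euler step for an epsilon-predicting model:
//   latents_next = latents + (sigma_next - sigma) * noise_pred
// sigma_next < sigma while denoising, so the update removes predicted noise.
inline void EulerStep(std::vector<float>& latents,
                      const std::vector<float>& noise_pred,
                      float sigma, float sigma_next) {
  if (latents.size() != noise_pred.size()) {
    throw std::invalid_argument("latents and noise_pred must have the same size");
  }
  const float dt = sigma_next - sigma;
  for (std::size_t k = 0; k < latents.size(); ++k) {
    latents[k] += dt * noise_pred[k];
  }
}
```

Each denoising iteration would run the `unet` session twice (conditional and unconditional), combine the two outputs with `ApplyGuidance`, then call `EulerStep`. Only the `guidance_scale` and scheduler name come from config.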

### 12.4 The Scoping Decision

Pipeline-as-Config v2.0 **implements** autoregressive generation. The schema **designs for** all generation paradigms. This is a deliberate split: ship what matters now, design so future work is additive.

| Pattern | v2.0 Implementation | v2.1+ Implementation | Schema Support |
|---|---|---|---|
| LLM (decoder-only) | ✅ Ship | — | ✅ Designed |
| VLM (vision+language) | ✅ Ship | — | ✅ Designed |
| Encoder-decoder (Whisper, Marian) | ✅ Ship | — | ✅ Designed |
| Speech-language (audio+text→text) | ✅ Ship | — | ✅ Designed |
| Simple TTS (AR + vocoder) | ⚠️ Plugin | `when: "final"` | ✅ Designed |
| Complex TTS (Qwen3-style inner loops) | ⚠️ Plugin | `repeat` + `counter` | ✅ Designed |
| Non-autoregressive TTS (VITS) | ⚠️ Plugin | `loop: "single_pass"` | ✅ Designed |
| Diffusion (SD, Flux, DiT) | ⚠️ Plugin | `loop: "denoising"` | ✅ Designed |
| Exotic (RNNT, custom) | ⚠️ Plugin | Plugin | ✅ Plugin escape hatch |

**The pitch:** "The v2 schema supports any generation paradigm — autoregressive, denoising, single-pass. v2.0 ships the autoregressive executor. Adding a new paradigm = one C++ executor class. Adding a new model within any paradigm = zero code."

This prevents the "only works for LLMs" objection (the schema designs for everything) while keeping v2.0 scope tight (ship quality over breadth).

### 12.5 The Architectural Pattern: Pluggable Loop Strategies

The generation loop is a layer ABOVE the pipeline executor:

```
┌─────────────────────────────┐
│  Loop Strategy              │  ← autoregressive | denoising | single_pass
│  (generation.loop)          │
├─────────────────────────────┤
│  Pipeline Executor          │  ← flow[], dataflow[], state{} — UNCHANGED
│  (FlowInterpreter)         │
├─────────────────────────────┤
│  ONNX Sessions              │  ← The actual computation — UNCHANGED
└─────────────────────────────┘
```

Each loop strategy is independent:

| Loop Strategy | When It Runs | Termination | State Between Steps | LOC Estimate |
|---|---|---|---|---|
| `autoregressive` | Token-by-token | EOS or max_length | KV cache, positions | Existing (~800 LOC) |
| `single_pass` | All steps once | After one pass | None | ~100 LOC |
| `denoising` | Fixed N iterations | After N steps | Latents, scheduler | ~300 LOC loop + ~1000 LOC schedulers |

Adding a new loop strategy never touches the pipeline executor or existing loop strategies. Pure addition.
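
One way to picture the layering is an interface that owns loop control and delegates all session work to the pipeline executor. The sketch below is illustrative only: `PhaseRunner`, `LoopStrategy`, and `MakeLoopStrategy` are hypothetical names, the executor is reduced to three callbacks, and the autoregressive cap is a plain step count rather than the real `max_length` semantics.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical callbacks standing in for the pipeline executor's phases.
struct PhaseRunner {
  std::function<void()> run_init;   // runs all when == "init" steps
  std::function<bool()> run_step;   // runs all when == "step" steps; true means stop (e.g. EOS)
  std::function<void()> run_final;  // runs all when == "final" steps
};

class LoopStrategy {
 public:
  virtual ~LoopStrategy() = default;
  // Drives one generation and returns how many step iterations ran.
  virtual int Run(const PhaseRunner& phases) const = 0;
};

class AutoregressiveLoop : public LoopStrategy {
 public:
  explicit AutoregressiveLoop(int max_steps) : max_steps_(max_steps) {}
  int Run(const PhaseRunner& phases) const override {
    phases.run_init();
    int steps = 0;
    while (steps < max_steps_) {
      ++steps;
      if (phases.run_step()) break;  // stop early on EOS
    }
    phases.run_final();
    return steps;
  }
 private:
  int max_steps_;
};

class SinglePassLoop : public LoopStrategy {
 public:
  int Run(const PhaseRunner& phases) const override {
    phases.run_init();   // every session runs exactly once
    phases.run_final();
    return 0;
  }
};

class DenoisingLoop : public LoopStrategy {
 public:
  explicit DenoisingLoop(int num_steps) : num_steps_(num_steps) {}
  int Run(const PhaseRunner& phases) const override {
    phases.run_init();
    for (int i = 0; i < num_steps_; ++i) phases.run_step();  // fixed count, no early stop
    phases.run_final();
    return num_steps_;
  }
 private:
  int num_steps_;
};

// Selects a strategy from the generation.loop string.
inline std::unique_ptr<LoopStrategy> MakeLoopStrategy(const std::string& loop, int limit) {
  if (loop == "autoregressive") return std::make_unique<AutoregressiveLoop>(limit);
  if (loop == "single_pass") return std::make_unique<SinglePassLoop>();
  if (loop == "denoising") return std::make_unique<DenoisingLoop>(limit);
  throw std::invalid_argument("Unknown generation.loop \"" + loop + "\"");
}
```

In this shape, a new strategy is one more subclass plus one more branch in the factory, and the existing strategies are not modified.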

### 12.6 Competitive Advantage (Strengthened)

This analysis actually STRENGTHENS the competitive story:

- **llama.cpp:** GGUF has no concept of denoising loops, multi-session pipelines, or post-processing stages. Diffusion support would require a fundamentally new runtime.
- **vLLM:** Each diffusion architecture needs its own Python pipeline class. They're doing this (diffusion support is recent), but it's per-model Python code.
- **Pipeline-as-Config:** Add ONE loop strategy to the runtime → EVERY model of that type works via config. One `DenoisingExecutor` enables Stable Diffusion, Flux, DiT, SDXL, ControlNet — all expressed as JSON with different session topologies.

**The compiler advantage applies across modalities:** add one loop strategy to the "compiled runtime" → unlimited models of that type. With "interpreter" runtimes (vLLM, llama.cpp), every model needs its own code.

### 12.7 Implementation Roadmap

| Phase | What | Status |
|---|---|---|
| Phase 1 (v2.0) | Autoregressive (decoder-only, VLM, encoder-decoder, speech-language) | PRs 1-5 (proposed) |
| Phase 2 (v2.1) | `when: "final"` for post-processing (enables AR TTS with vocoder) | Trivial addition to FlowInterpreter |
| Phase 2 (v2.1) | `repeat` + `counter` on flow steps (enables complex TTS like Qwen3) | ~50 LOC FlowInterpreter extension |
| Phase 3 (v2.1) | `loop: "single_pass"` + SequentialExecutor (enables non-AR TTS, embeddings) | ~100 LOC new executor |
| Phase 4 (future) | `loop: "denoising"` + DenoisingExecutor (enables diffusion) | ~300 LOC loop skeleton + ~1000 LOC scheduler implementations |

**v2.0 scope:** Generative language models (autoregressive token generation). Covers ~95% of current ORT GenAI model zoo.

**v2.1 scope:** TTS extensions (`when: "final"`, `repeat`/`counter`, `loop: "single_pass"`). Additive, no breaking changes.

**Architecture accommodates:** Diffusion via pluggable loop strategy. Out of v2.0 scope (different product, different users), but architecturally consistent. The plugin escape hatch covers all exotic patterns in the meantime.

---

## 13. Summary

### What Changes

| Component | Before | After |
|-----------|--------|-------|
| Model dispatch | 32-string whitelist → 8 C++ classes | Structural detection → 3 pipeline classes + plugin |
| Adding a new LLM | C++ PR + release cycle | 7-line JSON config |
| Adding a new VLM | New C++ class + processor + factory entries | ~25-line JSON config |
| Config format | Implicit schema tied to C++ structs | Explicit v2 schema with presets, versioned |
| model_type | Dispatch key | Human-readable metadata |
| Code size | ~4000 LOC in model dispatch | ~3450 LOC after adding the pipeline executor and deleting dispatch (+950, -1500, **net -550 LOC**) |
| Extension mechanism | Fork the C++ runtime | JSON config or plugin .so |

### What Stays the Same

- Generation loop (Generator, Search, Sampling) — fully generic for autoregressive; extensible via pluggable loop strategies for diffusion/TTS (Section 12)
- KV cache internals — auto-detection mechanism preserved
- Tokenizer — unchanged
- C/Python/C#/Java/ObjC API surface — unchanged
- ONNX Runtime session management — unchanged
- All existing models — backward compatible via v1→v2 translator
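As a rough illustration of the backward-compatibility claim, a v1→v2 translator could be as small as the sketch below. The v1 input follows the existing `genai_config.json` layout (`model.type`, `model.<session>.filename`). The v2 output layout, the `translate_v1_to_v2` name, and the type-to-preset table are assumptions for illustration, not the schema defined in this proposal.

```python
from typing import Any, Dict

# Assumed mapping from a few v1 model types to v2 presets (illustrative only).
PRESET_FOR_TYPE = {
    "whisper": "encoder-decoder",
    "marian": "encoder-decoder",
}
DEFAULT_PRESET = "autoregressive-decoder"

# v1 sections that may carry an ONNX file (assumed list, not exhaustive).
SESSION_SECTIONS = ("encoder", "vision", "speech", "embedding", "decoder")


def translate_v1_to_v2(v1: Dict[str, Any]) -> Dict[str, Any]:
    """Build a hypothetical v2 pipeline config from a v1 genai_config dict."""
    model = v1["model"]
    model_type = model.get("type", "unknown")

    # Each v1 section with a filename becomes one declared session.
    sessions = {}
    for name in SESSION_SECTIONS:
        section = model.get(name)
        if isinstance(section, dict) and "filename" in section:
            sessions[name] = {"file": section["filename"]}
    if not sessions:
        raise ValueError("v1 config declares no ONNX sessions")

    return {
        "version": 2,
        "extends": PRESET_FOR_TYPE.get(model_type, DEFAULT_PRESET),
        "pipeline": {"sessions": sessions},
        # model_type survives only as human-readable metadata.
        "metadata": {"model_type": model_type},
    }


v2 = translate_v1_to_v2(
    {"model": {"type": "llama", "decoder": {"filename": "model.onnx"}}}
)
```

Note that `model_type` is consulted only to pick a preset for legacy configs and is otherwise carried along as metadata, which matches its new role in the table above.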

### The Vision (2-Year Horizon)

ORT GenAI becomes a **generic pipeline runtime**: for ONNX inference, what Kubernetes is for container orchestration. Models describe their pipeline declaratively, and the runtime executes it generically. No model-specific code. No release bottlenecks. Any ONNX model that follows standard I/O conventions runs automatically.

**Zero model-specific C++ code in ORT GenAI, ever again.**

