Path of the UOp: The 22-Stage Codegen Pipeline

A UOp starts as a high-level tensor expression. By the time it reaches the hardware, it has been transformed through 22 distinct stages—each with a specific purpose, each building on the last. This chapter traces that journey.

The pipeline is a proven design for tensor compilation. Understanding it means understanding how tensor expressions become machine code.

How to Read This Chapter

If you're not a compiler engineer, this chapter might seem intimidating. Here's what you need to understand before diving in.

Key Concepts

UOp (Micro-Operation)

Think of it as a node in a flowchart representing one computation
Example: ADD(a, b) means "add a and b"

Pattern

A find-and-replace rule for code structures (not text)
Example: "If you see ADD(x, 0), replace with x"
Patterns fire repeatedly until no more matches (fixpoint)

Range

A loop iteration: RANGE(0..10) means "for i from 0 to 10"

AxisType

What kind of loop is this?
- Global: Parallel across GPU blocks / CPU threads
- Local: Parallel within a workgroup
- Reduce: Accumulator (sum, max, etc.)
- Loop: Sequential iteration

Stage

One transformation pass through the code
Patterns fire until fixpoint, then move to the next stage

Reading Strategy

First pass: Read just the "What This Does" and "Why This Matters" sections
Second pass: Look at the diagrams and examples
Third pass (if you want details): Read the pattern descriptions

Questions to Ask

For each stage, ask:

What does this stage accomplish? (High-level goal)
Why do we need this stage? (Motivation)
What would go wrong without it? (Consequences)

Overview

The 22 stages fall into four phases:

Tensor Expression
       │
       ▼
┌─────────────────────────────────────┐
│ RANGEIFY (Stages 1-7)               │
│ Movement ops → Explicit loops       │
│                                     │
│ [Make iteration explicit,           │
│  optimize ranges]                   │
└─────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│ EXPANDER (Stages 8-10)              │
│ UNROLL/UPCAST → Explicit vectors    │
│                                     │
│ [Expand optimization primitives]    │
└─────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│ DEVECTORIZER (Stages 11-15)         │
│ Vector ops → Scalar code            │
│                                     │
│ [Lower to hardware-specific ops]    │
└─────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────┐
│ LINEARIZER (Stages 16-22)           │
│ IR → Linear instruction sequence    │
│                                     │
│ [Serialize to executable code]      │
└─────────────────────────────────────┘
       │
       ▼
  Machine Code

Each stage applies pattern-based rewrites. Patterns fire until fixpoint, then the next stage begins.

Additional Passes

Several passes run between the numbered stages and don't have their own stage number:

Pass	Between Stages	Purpose
`linearize_multi_index`	Before Stage 8	Flatten multi-dimensional indices to linear offsets
`pm_bool_devectorize`	14–15	Handle boolean vector patterns
`pm_reduce_devectorize`	14–15	Handle vector reductions (K-vec, bool, horizontal)
`merge_sibling_ends`	14–15	Merge adjacent END operations
`pm_float_decomp`	Post-opt	Decompose floating-point operations
`bool_storage_patterns`	Post-opt	Convert bool ↔ uint8 for memory operations

How to Read This Chapter​

Key Concepts​

Reading Strategy​

Questions to Ask​

Overview​

Additional Passes​