Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Op Bestiary: A Field Guide to UOp Operations

When debugging Morok IR dumps, you’ll encounter operations that aren’t obvious from their names. This chapter documents non-trivial operations with signatures, field explanations, and examples.

What’s covered: Operations that require explanation—loop control, reductions, memory operations, kernel structure, vectorization, tensor cores.

What’s NOT covered: Trivial ALU operations (Add, Mul, Sqrt, etc.) that work exactly as you’d expect.


Loop Control: RANGE and END

RANGE — Loop Scope Opener

#![allow(unused)]
fn main() {
Range {
    end: Arc<UOp>,           // loop bound (exclusive)
    axis_id: AxisId,         // identifier for deduplication
    axis_type: AxisType,     // scheduling behavior
}
}

Fields:

FieldTypePurpose
endArc<UOp>Upper bound (exclusive), typically a CONST
axis_idAxisIdUnrenumbered(n) before kernel splitting, Renumbered(n) after
axis_typeAxisTypeDetermines how the loop is scheduled (see below)

AxisType Hierarchy:

TypePriorityGPU MappingPurpose
Outer-2Kernel boundary marker
Loop-1for loopSequential iteration
Global0blockIdxGrid parallelism
Thread0thread poolCPU parallelism
Warp1warp/wavefrontSub-group parallelism
Local2threadIdxWorkgroup parallelism
GroupReduce2shared memoryTwo-stage reduction
Upcast3SIMDVectorization
Reduce4accumulatorReduction dimension
Unroll5unrolledLoop unrolling

Priority determines loop nesting order—lower values are outer loops.

Example:

RANGE(end=128, axis_id=R0, type=Global)
└── CONST(128) : Index

END — Loop Scope Closer

#![allow(unused)]
fn main() {
End {
    computation: Arc<UOp>,              // value computed inside loop
    ranges: SmallVec<[Arc<UOp>; 4]>,    // ranges being closed
}
}

END closes one or more RANGE scopes and removes them from the active set. Multiple ranges can be closed simultaneously.

Example:

END
├── STORE(...)           — computation
├── RANGE(R0, Global)    — first range closed
└── RANGE(R1, Local)     — second range closed

Reduction: REDUCE vs REDUCE_AXIS

Two operations with similar names serve different purposes.

REDUCE_AXIS — Tensor Dimension Reduction (High-Level)

#![allow(unused)]
fn main() {
ReduceAxis {
    src: Arc<UOp>,           // input tensor
    reduce_op: ReduceOp,     // Add, Mul, Max, Min
    axes: Vec<usize>,        // axes to reduce
}
}

Used before rangeify. Operates on tensor dimensions like NumPy’s .sum(axis=0).

Example:

REDUCE_AXIS(Add, axes=[1])
└── BUFFER[10, 20] : Float32

This reduces a [10, 20] tensor to [10] by summing along axis 1.

REDUCE — Range Iteration Reduction (Low-Level)

#![allow(unused)]
fn main() {
Reduce {
    src: Arc<UOp>,                      // value to accumulate
    ranges: SmallVec<[Arc<UOp>; 4]>,    // ranges being reduced
    reduce_op: ReduceOp,                // Add, Mul, Max, Min
}
}

Used after rangeify. Accumulates values across RANGE iterations and closes the specified ranges.

ReduceOp Variants:

OpIdentityOperationTinygrad
Add0acc + value
Mul1acc * value
Max-∞max(acc, value)
Min+∞min(acc, value)Morok-only

Compatibility: Tinygrad’s spec restricts REDUCE_AXIS to {Add, Mul, Max}. Morok extends this with Min.

Example:

REDUCE(Add)
├── MUL                      — value to accumulate
│   ├── LOAD(A, ...)
│   └── LOAD(B, ...)
└── RANGE(R2, Reduce)        — range being reduced
    └── CONST(64)

ALLREDUCE — Cross-Device Reduction

#![allow(unused)]
fn main() {
AllReduce {
    src: Arc<UOp>,           // local partial result
    device: Arc<UOp>,        // device specification
    reduce_op: ReduceOp,     // reduction operation
}
}

Performs distributed reduction across multiple devices. Used for multi-GPU training.


Buffer Operations

BUFFER — Buffer Declaration

#![allow(unused)]
fn main() {
Buffer {
    unique: Arc<UOp>,        // UNIQUE op for identity
    device: Arc<UOp>,        // DEVICE op
    size: usize,             // total element count
}
}

Declares a buffer for tensor storage. The unique field ensures distinct buffers even with identical size/device.

BUFFERIZE — Materialization Marker

#![allow(unused)]
fn main() {
Bufferize {
    compute: Arc<UOp>,                  // computation to materialize
    ranges: SmallVec<[Arc<UOp>; 4]>,    // output dimensions
    opts: BufferizeOpts,                // address space, device
}
}

Marks where computation should materialize to memory. Triggers kernel splitting.

BufferizeOpts:

FieldTypePurpose
deviceOption<DeviceSpec>Target device, None for local
addrspaceAddrSpaceGlobal (device) or Local (shared)

Example:

BUFFERIZE(opts={addrspace=Global})
├── REDUCE(Add, ...)         — computation
├── RANGE(R0, Global)        — output dim 0
└── RANGE(R1, Global)        — output dim 1

INDEX — Multi-Dimensional Buffer Access

#![allow(unused)]
fn main() {
Index {
    buffer: Arc<UOp>,                   // BUFFER or DEFINE_GLOBAL
    indices: SmallVec<[Arc<UOp>; 4]>,   // index per dimension
    gate: Option<Arc<UOp>>,             // optional predicate
}
}

Computes memory address from multi-dimensional indices. Returns element dtype (not pointer).

Example:

INDEX : Float32
├── DEFINE_GLOBAL(0)
├── RANGE(R0, Global)        — index for dim 0
├── RANGE(R1, Loop)          — index for dim 1
└── MUL(...)                 — index for dim 2

POINTER_INDEX — Low-Level Pointer Arithmetic

#![allow(unused)]
fn main() {
PointerIndex {
    ptr: Arc<UOp>,           // base pointer
    offset: Arc<UOp>,        // byte offset
}
}

Direct pointer arithmetic. Used after linearization when indices are flattened.

Compatibility: Tinygrad uses INDEX with a ptr=True flag instead of a separate operation.

LOAD — Memory Read

#![allow(unused)]
fn main() {
Load {
    buffer: Arc<UOp>,        // buffer or pointer
    index: Arc<UOp>,         // INDEX op
}
}

Read value from buffer at index. For gated loads, use an INDEX with a gate (INDEX has an optional gate field).

Example:

LOAD : Float32
├── DEFINE_GLOBAL(1)
└── INDEX
    ├── DEFINE_GLOBAL(1)
    ├── RANGE(R0)
    └── RANGE(R2)

STORE — Memory Write

#![allow(unused)]
fn main() {
Store {
    buffer: Arc<UOp>,                   // output buffer
    index: Arc<UOp>,                    // INDEX op
    value: Arc<UOp>,                    // value to write
    ranges: SmallVec<[Arc<UOp>; 4]>,    // ranges being closed
}
}

Write value to buffer. STORE closes the specified ranges, which represent output iteration dimensions. The ranges field is used for output upcasting: when a Range(Upcast) is included, it becomes UNROLL during expansion, then contracted via CONTRACT.

For gated stores, use an INDEX with a gate (INDEX has an optional gate field).

Compatibility: Morok’s STORE has an explicit index field (sources: buffer=0, index=1, value=2, ranges=3+). Tinygrad’s STORE combines buffer and value differently (range_start=2).

Example:

STORE
├── DEFINE_GLOBAL(0)         — output buffer
├── INDEX[R0, R1]            — write address
├── REDUCE(Add, ...)         — value
├── RANGE(R0, Global)        — output dim 0 (closed)
└── RANGE(R1, Global)        — output dim 1 (closed)

Kernel Structure

KERNEL — Kernel Wrapper

#![allow(unused)]
fn main() {
Kernel {
    sources: SmallVec<[Arc<UOp>; 4]>,   // arguments
    ast: Arc<UOp>,                       // computation (usually SINK)
}
}

Wraps a complete kernel for code generation. Sources are kernel arguments (DefineGlobal, DefineLocal, DefineVar).

Example:

KERNEL
├── DEFINE_GLOBAL(0)         — output buffer arg
├── DEFINE_GLOBAL(1)         — input A arg
├── DEFINE_GLOBAL(2)         — input B arg
└── SINK                     — computation
    └── STORE(...)

SINK — Multiple Root Collector

#![allow(unused)]
fn main() {
Sink {
    sources: SmallVec<[Arc<UOp>; 4]>,
}
}

Collects multiple outputs into a single root. Every kernel’s ast is typically a SINK containing STORE operations.

Example:

SINK
├── STORE(output_0, ...)
├── STORE(output_1, ...)
└── STORE(output_2, ...)

AFTER — Dependency Marker

#![allow(unused)]
fn main() {
After {
    passthrough: Arc<UOp>,              // value that flows through
    deps: SmallVec<[Arc<UOp>; 4]>,      // operations that must complete
}
}

Expresses execution dependencies between kernels without data dependency. The passthrough value is returned unchanged, but only after all deps complete.

Example:

SINK
├── AFTER
│   ├── DEFINE_GLOBAL(0)     — passthrough (buffer reference)
│   └── KERNEL(...)          — must complete first
└── KERNEL(...)              — can use buffer after AFTER

BARRIER — Synchronization Fence

#![allow(unused)]
fn main() {
Barrier {
    src: Arc<UOp>,                      // value passing through
    deps: SmallVec<[Arc<UOp>; 4]>,      // operations to wait for
}
}

GPU workgroup synchronization. Ensures all threads in a workgroup reach the barrier before continuing.


Vector Operations

VECTORIZE — Create Vector from Scalars

#![allow(unused)]
fn main() {
Vectorize {
    elements: SmallVec<[Arc<UOp>; 4]>,
}
}

Combines N scalar values into a vector of size N. All elements must have the same base dtype.

Example:

VECTORIZE : <4 x Float32>
├── CONST(1.0)
├── CONST(2.0)
├── CONST(3.0)
└── CONST(4.0)

GEP — Get Element Pointer (Vector Extract)

#![allow(unused)]
fn main() {
Gep {
    vector: Arc<UOp>,        // source vector
    indices: Vec<usize>,     // positions to extract
}
}

Extracts elements from a vector:

  • Single index → scalar
  • Multiple indices → smaller vector

Example:

GEP([0, 2]) : <2 x Float32>
└── VECTORIZE : <4 x Float32>
    └── ...

VConst — Vector Constant

#![allow(unused)]
fn main() {
VConst {
    values: Vec<ConstValue>,
}
}

Vector of compile-time constants. More efficient than VECTORIZE of CONST nodes.

CAT — Concatenate Vectors

#![allow(unused)]
fn main() {
Cat {
    sources: SmallVec<[Arc<UOp>; 4]>,
}
}

Concatenates vectors into a larger vector. Output vcount = sum of input vcounts.

Example:

CAT : <8 x Float32>
├── VECTORIZE : <4 x Float32>
└── VECTORIZE : <4 x Float32>

PtrCat — Concatenate Pointers

#![allow(unused)]
fn main() {
PtrCat {
    sources: SmallVec<[Arc<UOp>; 4]>,
}
}

Groups memory accesses for vectorized load/store. Used by the devectorizer pass.


Expansion: UNROLL and CONTRACT

UNROLL — Expand Computation Across Iterations

#![allow(unused)]
fn main() {
Unroll {
    src: Arc<UOp>,                       // computation to expand
    unroll_axes: Vec<(usize, usize)>,    // (axis_index, factor) pairs
}
}

Creates multiple versions of computation for different iteration values. Used for loop unrolling optimization.

Example: UNROLL(unroll_axes=[(0, 4)]) expands computation 4 times with different index values.

CONTRACT — Collapse Unrolled Values to Vector

#![allow(unused)]
fn main() {
Contract {
    src: Arc<UOp>,                       // unrolled computation
    upcast_ranges: Vec<(usize, usize)>,  // (axis_index, factor) pairs
}
}

The inverse of UNROLL—collects expanded scalar values into a vector. Output vector size = product of factors.

Example:

CONTRACT(upcast_ranges=[(0, 4)]) : <4 x Float32>
└── UNROLL(unroll_axes=[(0, 4)])
    └── LOAD(...)

This pattern vectorizes a load: expand 4 iterations, then pack results into a 4-element vector.


Tensor Cores: WMMA

WMMA — Warp Matrix Multiply-Accumulate

#![allow(unused)]
fn main() {
Wmma {
    a: Arc<UOp>,             // matrix A fragment
    b: Arc<UOp>,             // matrix B fragment
    c: Arc<UOp>,             // accumulator C fragment
    metadata: WmmaMetadata,  // hardware configuration
}
}

Hardware tensor core operation: D = A × B + C. Requires specific matrix shapes and data layouts.

WmmaMetadata Fields:

FieldTypePurpose
nameStringInstruction name (e.g., "__hmma...")
dims(N, M, K)Matrix dimensions (e.g., (16, 16, 16))
dtype_inDTypeInput matrix precision (e.g., Float16)
dtype_outDTypeOutput precision (e.g., Float32)
deviceStringTarget device string
threadsusizeThreads per warp (typically 32)
upcast_axesVec<(usize, usize)>Vectorization for output
reduce_axesVec<(usize, usize)>Contraction axes

Example:

WMMA(dims=(16, 16, 16), dtype_in=Float16, dtype_out=Float32)
├── A fragment : <8 x Float16>
├── B fragment : <8 x Float16>
└── C accumulator : <8 x Float32>

Control Flow

IF / ENDIF — Conditional Execution

#![allow(unused)]
fn main() {
If {
    condition: Arc<UOp>,                // boolean predicate
    body: SmallVec<[Arc<UOp>; 4]>,      // operations to execute
}

EndIf {
    if_op: Arc<UOp>,         // corresponding IF op
}
}

Execute body only when condition is true. Used for boundary checks and sparse operations.

Example:

IF
├── LT(idx, bound)           — condition (src[0])
├── STORE(...)               — body[0]
└── STORE(...)               — body[1]

ENDIF
└── IF(...)                  — references IF op

Definition Operations

DEFINE_GLOBAL — Device Memory Argument

#![allow(unused)]
fn main() {
DefineGlobal(usize)          // argument index
}

Kernel argument for device (global) memory. Index refers to position in kernel argument list.

DEFINE_LOCAL — Shared Memory Allocation

#![allow(unused)]
fn main() {
DefineLocal(usize)           // local memory index
}

GPU shared memory (LDS) allocation. Visible within a workgroup.

DEFINE_VAR — Symbolic Runtime Variable

#![allow(unused)]
fn main() {
DefineVar {
    name: String,            // variable name
    min_val: i64,            // minimum bound
    max_val: i64,            // maximum bound
}
}

Runtime variable with known bounds. Used for dynamic shapes where bounds are known.

Example:

DEFINE_VAR(name="batch_size", min=1, max=128) : Index

DEFINE_REG — Register Allocation

#![allow(unused)]
fn main() {
DefineReg {
    size: usize,             // register size
}
}

Allocates a register for intermediate storage. Used in code generation.

BIND — Variable Binding

#![allow(unused)]
fn main() {
Bind {
    var: Arc<UOp>,           // DEFINE_VAR
    value: Arc<UOp>,         // concrete value
}
}

Binds a symbolic variable to a concrete value at runtime.


Special Operations

SPECIAL — Hardware-Provided Values

#![allow(unused)]
fn main() {
Special {
    end: Arc<UOp>,           // upper bound for this dimension
    name: String,            // e.g., "blockIdx.x", "threadIdx.y"
}
}

Accesses hardware-provided values (thread/block indices). Not a loop—the hardware provides the value directly.

Example:

SPECIAL(name="blockIdx.x", end=128) : Index
└── CONST(128)

UNIQUE — Identity Marker

#![allow(unused)]
fn main() {
Unique(usize)                // unique identifier
}

Creates a unique identity for buffer disambiguation. Two buffers with different UNIQUE values are distinct even if otherwise identical.

DEVICE — Device Specification

#![allow(unused)]
fn main() {
Device(DeviceSpec)           // device specification
}

Specifies target device for computation.


Movement Operations

High-level tensor shape transformations. These are converted to explicit INDEX operations during rangeify.

OperationSignaturePurpose
Reshape{ src, new_shape }Change shape, same elements
Permute{ src, axes: Vec<usize> }Transpose/reorder axes
Expand{ src, new_shape }Broadcast to larger shape
Pad{ src, begin_pads, end_pads }Add padding
Shrink{ src, begins, ends }Extract sub-region
Flip{ src, axes: Vec<bool> }Reverse along axes

Example: RESHAPE

RESHAPE(new_shape=[6, 4]) : Shape[6, 4]
├── BUFFER[2, 3, 4] : Float32
└── CONST([6, 4]) : Shape

Quick Reference

By Category

CategoryOperations
Loop ControlRANGE, END
ReductionREDUCE_AXIS, REDUCE, ALLREDUCE
MemoryBUFFER, BUFFERIZE, INDEX, POINTER_INDEX, LOAD, STORE
KernelKERNEL, SINK, AFTER, BARRIER
VectorVECTORIZE, GEP, VCONST, CAT, PTRCAT
ExpansionUNROLL, CONTRACT
HardwareWMMA, SPECIAL
ControlIF, ENDIF
DefinitionDEFINE_GLOBAL, DEFINE_LOCAL, DEFINE_VAR, DEFINE_REG, BIND, UNIQUE, DEVICE
MovementRESHAPE, PERMUTE, EXPAND, PAD, SHRINK, FLIP
ALUUnary(...), Binary(...), Ternary(...), Cast, BitCast

Range-Ending Operations

Operations that close RANGE scopes (remove ranges from active set):

OperationRange Start Index
BUFFERIZE1 (compute=0, ranges=1+)
REDUCE1 (src=0, ranges=1+)
STORE3 (buffer=0, index=1, value=2, ranges=3+)
WMMA3 (a=0, b=1, c=2)
END1 (computation=0, ranges=1+)

Expandable Operations

Operations that propagate UNROLL through the computation graph:

  • ALU: Unary, Binary, Ternary
  • Type: Cast, BitCast
  • Vector: Gep, Vectorize
  • Memory: Load, Store, Index, PointerIndex
  • Control: Reduce, End, After
  • Buffer: Bufferize
  • Hardware: Wmma