From Tensor to Machine Code
In most ML frameworks, computation happens immediately. Write a + b in PyTorch and it runs right away—the GPU crunches the numbers whether or not you ever use the result. This eager execution is simple to understand, but it leaves optimization opportunities on the table. How can a compiler optimize a computation it hasn't seen yet?
Morok takes the opposite approach: lazy evaluation. When you write a.try_add(&b)?, nothing computes. Morok builds a graph describing what to compute, not when. The magic happens when you call realize()—that single method triggers the entire compilation pipeline, from high-level tensor operations down to JIT-compiled machine code.
This chapter traces that journey.
tensor.realize()
│
▼
┌─────────────────────────────────────────────────────────┐
│ LAZY GRAPH │
│ Tensor ops build UOp DAG (no computation yet) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ RANGEIFY │
│ Movement ops → explicit RANGE loops │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ KERNEL SPLITTING │
│ Split at STORE boundaries → multiple KERNELs │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ OPTIMIZATION & CODEGEN │
│ Heuristics/beam → LLVM IR → JIT compile │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ EXECUTION │
│ Parallel kernel launch → result buffer │
└─────────────────────────────────────────────────────────┘
Each box is a distinct phase. Let's walk through them.
Lazy Evaluation: Building the Graph
A Tensor in Morok is surprisingly lightweight:
pub struct Tensor {
entry: Arc<TensorEntry>, // Computation graph
buffer: Option<Arc<Buffer>>, // Materialized data (if any)
}
The entry holds a TensorEntry containing the UOp graph—the computation this tensor represents. The buffer is optional: lazy tensors don't have one, only realized tensors do.
Three Ways to Create Tensors
1. Input tensors — buffer allocated immediately:
let a = Tensor::from_slice([1.0f32, 2.0, 3.0]);
// `a.buffer` = Some(Arc<Buffer>) with actual data
When you create a tensor from data, Morok allocates device memory and copies your bytes. The UOp graph contains a BUFFER node pointing to this allocation.
2. Lazy operations — no buffer, only graph:
let b = a.try_add(&a)?; // b.buffer = None
let c = b.try_mul(&a)?; // c.buffer = None
Arithmetic operations don't compute anything. They build a UOp graph: Binary(Add, a.uop, a.uop). The tensor exists purely as a description of future work.
3. Movement operations — shares the original buffer:
let d = a.try_reshape(&[1, 3])?; // d.buffer = same as a.buffer
Reshape, permute, and similar operations create new views of existing data. The buffer is shared; only the UOp graph changes to describe the new indexing.
The Global Registry
Morok maintains two global maps (lock-free, thread-safe):
| Map | Key → Value | Purpose |
|---|---|---|
| TENSORS | tensor_id → Weak<TensorEntry> | Track all tensors for graph substitution |
| BUFFERS | uop_id → Arc<Buffer> | Find buffers during scheduling |
This registry enables a critical feature: global graph substitution. When an optimization transforms a UOp, all tensors referencing that UOp automatically see the updated version. No stale references, no manual updates.
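A conceptual sketch of those two maps as Rust types may help. The real registry uses lock-free concurrent maps (the papaya crate) rather than a plain HashMap, and these type bodies are placeholders:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Weak};

struct TensorEntry {/* holds the UOp graph for one tensor */}
struct Buffer {/* a device allocation */}

// Conceptual shape of the registry. Weak entries let dropped tensors be
// cleaned up instead of pinning their graphs forever; the buffer map lets
// the scheduler find materialized data by UOp id.
struct Registry {
    tensors: HashMap<u64, Weak<TensorEntry>>, // TENSORS: tensor_id -> entry
    buffers: HashMap<u64, Arc<Buffer>>,       // BUFFERS: uop_id -> buffer
}
```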
Hash Consing in Action
Because UOps use hash consing (content-based deduplication), identical computations share memory:
let x = a.try_add(&b)?;
let y = a.try_add(&b)?;
// x.uop() and y.uop() point to the SAME Arc<UOp>
This matters for caching: when we compile kernels, we cache by UOp ID. Hash consing means identical computations automatically hit the cache, even if constructed separately.
Rangeify: Making Loops Explicit
When you write tensor.reshape([2, 3]).expand([4, 2, 3]).sum(axis=0), those movement operations (reshape, expand) are high-level descriptions. To generate actual loops, we need explicit iteration structure.
Rangeify transforms movement operations into RANGE loops and INDEX arithmetic. The entry point is rangeify() in schedule/src/rangeify/transforms.rs.
The Rangeify Pipeline
Rangeify isn't a single transformation—it's a multi-stage pipeline:
| Stage | Purpose |
|---|---|
| 0. Range Assignment | Create RANGE UOps for each tensor dimension |
| 1. Early Movement Ops | Clean up movement operations before range assignment |
| 2. Load Collapse | Eliminate REDUCE operations via range-independent detection |
| 3. Split Ranges | Split ranges with modulo, flatten ranges |
| 4. Initial Symbolic | Algebraic simplification, constant folding |
| 5. Simplify Ranges | Merge adjacent ranges with cost analysis |
| 6. Split Store | Split graph at STORE boundaries |
| 7. Apply Opts | Optimization search (beam or heuristic) |
| Mega-pass | Symbolic + reduce + buffer folding + buffer removal + reduction simplification |
The mega-pass combines multiple symbolic and structural optimizations into a single fixpoint loop. Per-kernel passes then run in apply_pre_optimization().
Each pass uses pattern-based rewriting (see the Pattern Engine chapter). Patterns fire until no more match, then the next pass begins.
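The driving loop for each pass looks roughly like this—a generic sketch, not Morok's actual PatternMatcher API (see the Pattern Engine chapter for the real thing):

```rust
// Apply a rewrite function until the graph stops changing, i.e. until no
// pattern fires anymore; the caller then moves on to the next pass.
fn run_to_fixpoint<G, F>(mut graph: G, rewrite: F) -> G
where
    G: PartialEq,
    F: Fn(&G) -> G,
{
    loop {
        let next = rewrite(&graph);
        if next == graph {
            return graph; // fixpoint: no pattern matched
        }
        graph = next;
    }
}
```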
Before and After
Consider this tensor expression:
Before: BUFFER.reshape([2, 3]).expand([4, 2, 3]).sum(axis=0)
After rangeify, movement ops become explicit index computations:
After:
STORE
├── INDEX[RANGE(0..2), RANGE(0..3)] — index (src[0])
├── REDUCE(Add) — value (src[1])
│ ├── LOAD
│ │ └── INDEX[RANGE(0..4), RANGE(0..2), RANGE(0..3)]
│ └── RANGE(0..4, Reduce)
├── RANGE(0..2, Global) — output dim 0 (range)
└── RANGE(0..3, Global) — output dim 1 (range)
The EXPAND became a RANGE(0..4) that doesn't affect the buffer index—broadcasting. The RESHAPE became different index arithmetic. The SUM became REDUCE(Add) with the first range marked as Reduce type.
Movement → Index Arithmetic
Each movement operation has a specific transformation:
| Operation | Transformation |
|---|---|
| RESHAPE | Flatten/unflatten index expressions |
| PERMUTE | Reorder dimensions in INDEX |
| EXPAND | Index becomes 0 (or range doesn't affect index) |
| PAD | WHERE(in_bounds, LOAD, pad_value) |
| SHRINK | Offset adjustment in INDEX |
| FLIP | size - 1 - index |
After rangeify, there are no more movement ops—just arithmetic operations on indices.
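For intuition, here is what a few of those rules mean as plain index arithmetic (illustrative scalar code, not library code):

```rust
// RESHAPE [6] -> [2, 3]: the new (i, j) index flattens back to the original
// row-major offset into the same buffer.
fn reshape_index(i: usize, j: usize) -> usize {
    i * 3 + j
}

// EXPAND along a broadcast dimension: the broadcast index contributes
// nothing to the buffer offset, so every value of `b` reads the same element.
fn expand_index(b: usize, i: usize) -> usize {
    let _ = b; // broadcast dimension ignored
    i
}

// FLIP on a dimension of size n: read the mirrored element.
fn flip_index(i: usize, n: usize) -> usize {
    n - 1 - i
}
```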
Kernel Splitting: Finding the Boundaries
A computation graph might have multiple outputs, or intermediate values that need materialization. Kernel splitting identifies these boundaries and creates separate kernels.
The entry point is run_kernel_split_pipeline() in schedule/src/rangeify/kernel.rs.
Kernel Splitting Pipeline
The splitting proceeds through several coordinated steps:
Step 1: BUFFERIZE → STORE
BUFFERIZE nodes mark where values should materialize. pm_add_buffers_patterns() converts them to explicit STORE operations:
Before: BUFFERIZE(computation, ranges)
After: END(STORE(INDEX(...), computation), ranges)
The END wrapper captures which ranges scope this store. Buffers are allocated and assigned IDs during this phase.
Step 2: Split stores into kernels
split_all_stores() and split_store() split the graph at STORE boundaries, creating separate kernels. Buffer numbering is assigned via the LocalAddBufferContext.dg counter during splitting.
Before: END(STORE(...), ranges)
After: KERNEL(SINK(STORE(...)), ranges, buffer_list)
The KERNEL node wraps everything: the computation (as a SINK), the iteration ranges, and the list of buffers this kernel reads and writes.
Step 3: Fix assignments
fix_assign() maps each buffer_id to the kernel that writes it and builds the dependency graph.
Tracking Dependencies
When one kernel's output feeds another kernel's input, we need dependency tracking:
- fix_assign() maps each buffer_id to the kernel that writes it and builds the dependency graph
- When kernel B reads a buffer written by kernel A, B depends on A
- Dependencies appear as AFTER nodes in the IR, ensuring kernels execute in valid order
Buffer Numbering
Buffer numbering is handled by the LocalAddBufferContext.dg counter in split_store(). DEFINE_GLOBAL indices are assigned during the split process in pattern-match order—no separate renumbering pass is needed.
Schedule Creation: Preparing for Execution
Once kernels are split, we need to schedule them: determine execution order, allocate buffers, and prepare for compilation.
create_schedule() in tensor/src/schedule.rs produces a Vec<ScheduleItem>:
pub struct ScheduleItem {
pub kernel: Arc<UOp>, // KERNEL wrapper
pub ast: Arc<UOp>, // Inner computation (for codegen)
pub buffers: Vec<Buffer>, // Device buffers
pub buffer_uop_ids: Vec<u64>, // UOp IDs for registry cleanup
pub fixedvars: HashMap<String, i64>, // Bound iteration variables
pub bound_ranges: Vec<BoundRange>, // Ranges needing expansion
pub source_buffers: HashMap<u64, Arc<UOp>>, // DEFINE_GLOBAL -> BUFFER mapping
pub dependencies: Vec<u64>, // Producer kernel IDs
pub alias_registered_ids: Vec<u64>, // Alias UOp IDs for cleanup
}
Buffer Allocation Strategy
- Input buffers: Already allocated (from Tensor::from_slice)
- Intermediate buffers: Allocated during scheduling (for kernel outputs that feed other kernels)
- Output buffer: Allocated and registered with the final tensor
Parallel Group Analysis
Not all kernels need sequential execution. Independent kernels can run in parallel:
Kernel A (writes buf0)
Kernel B (writes buf1) ─── no dependency ─── can run in parallel
Kernel C (reads buf0, buf1) ─── depends on A and B
The scheduler uses Kahn's algorithm to find parallel groups:
- Build the kernel dependency DAG
- Find all kernels with no incoming edges → Group 1
- Remove Group 1, repeat → Group 2, etc.
Each group's kernels execute in parallel, then the next group starts.
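A compact sketch of that grouping (hypothetical types; the real scheduler works over kernel UOps and AFTER edges):

```rust
// Level-by-level Kahn's algorithm: `deps[k]` lists the kernels that kernel k
// depends on. Each returned group contains kernels whose dependencies are all
// satisfied, so its members can be launched in parallel.
fn parallel_groups(deps: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let n = deps.len();
    let mut indegree = vec![0usize; n];
    let mut dependents = vec![Vec::<usize>::new(); n];
    for (k, ds) in deps.iter().enumerate() {
        indegree[k] = ds.len();
        for &d in ds {
            dependents[d].push(k);
        }
    }
    let mut groups = Vec::new();
    let mut ready: Vec<usize> = (0..n).filter(|&k| indegree[k] == 0).collect();
    while !ready.is_empty() {
        let mut next = Vec::new();
        for &k in &ready {
            for &d in &dependents[k] {
                indegree[d] -= 1;
                if indegree[d] == 0 {
                    next.push(d);
                }
            }
        }
        groups.push(ready);
        ready = next;
    }
    groups
}
```

With the three kernels above (deps = [[], [], [0, 1]]), this returns [[0, 1], [2]]: A and B launch together, then C.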
Code Generation: From UOp to LLVM IR
With kernels scheduled, we generate actual code. The primary backend is LLVM:
| Backend | Compile Speed | Output Quality | Use Case |
|---|---|---|---|
| LLVM | Slower | Highly optimized | Production |
The Renderer trait abstracts code generation:
pub trait Renderer {
fn render(&self, uop: &Arc<UOp>, name: Option<&str>) -> Result<RenderedKernel>;
fn backend_name(&self) -> &str;
fn decompositor(&self) -> Option<TypedPatternMatcher<()>>;
}
LLVM CPU Renderer
The LLVM renderer (codegen/src/llvm/cpu/) traverses the UOp graph and emits LLVM IR:
define void @kernel_0(ptr noalias align 32 %buf0, ptr noalias align 32 %buf1) #0 {
entry:
br label %loop_0
loop_0:
%i = phi i32 [ 0, %entry ], [ %i.next, %loop_0 ]
; ... computation ...
%i.next = add nsw i32 %i, 1
%cond = icmp slt i32 %i.next, 128
br i1 %cond, label %loop_0, label %exit
exit:
ret void
}
Each buffer is a direct ptr noalias align 32 parameter — no indirection through an args array. Symbolic variables (for dynamic shapes) and thread IDs are passed as additional typed parameters (e.g. i32 %N).
Post-Optimization Passes
Before code generation, roughly fifteen pattern-based passes clean up the IR; the most important include:
| Pass | Purpose |
|---|---|
| pm_add_loads | Wrap INDEX operations in LOAD |
| pre_expand | Convert UNROLL/UPCAST ranges to explicit operations |
| devectorize | Group contiguous memory accesses |
| pm_reduce_devectorize | Handle vector reductions (K-vec, bool, horizontal) |
| pm_bool_devectorize | Handle boolean vector patterns |
| merge_sibling_ends | Merge adjacent END operations |
| pm_fma_decomposition | Convert a*b+c to fused multiply-add (for backends that support it) |
| pm_float_decomp | Decompose floating-point operations |
| bool_storage_patterns | Convert bool ↔ uint8 for memory operations |
These passes transform the optimized AST into a form suitable for code generation. The result is clean, vectorized code with proper memory access patterns.
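As a flavor of what one of these passes does, here is the shape of the a*b + c → FMA rewrite on a toy expression type. This is illustrative only; Morok's passes are pattern-matcher rules over UOps, not this enum:

```rust
#[derive(Clone)]
enum Expr {
    Var(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Fma(Box<Expr>, Box<Expr>, Box<Expr>), // fused multiply-add: a * b + c
}

// Match ADD(MUL(a, b), c) and rewrite it to FMA(a, b, c). On backends with
// hardware FMA this saves an instruction and fuses the two roundings into one.
fn fuse_mul_add(e: &Expr) -> Option<Expr> {
    if let Expr::Add(lhs, c) = e {
        if let Expr::Mul(a, b) = lhs.as_ref() {
            return Some(Expr::Fma(a.clone(), b.clone(), c.clone()));
        }
    }
    None
}
```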
Backend Support
Morok supports multiple code generation backends:
| Backend | Output | Status |
|---|---|---|
| LLVM | Native machine code | Primary (CPU) |
| C | C source code | Available |
| MLIR | MLIR dialect | Available |
Execution: Running the Kernels
Code generation produces LLVM IR strings. Execution involves JIT compilation and kernel launch.
The ExecutionPlan
prepare() (single tensor) or prepare_batch() (multiple tensors) builds an ExecutionPlan:
pub struct ExecutionPlan {
kernels: Vec<PreparedKernel>, // Compiled kernels (topological order)
buffers: Vec<Buffer>,
ast_to_buffer: HashMap<u64, usize>, // AST id -> buffer index mapping
output_buffer_indices: Vec<usize>, // Indices of output buffers (multi-output)
}
Plans support multiple outputs via realize_batch() / prepare_batch(). When several tensors share subgraphs, batch scheduling lets the compiler share kernels across outputs.
Key methods:
| Method | Purpose |
|---|---|
| output_buffer_at(i) | Get the i-th output buffer (matches SINK source order) |
| num_outputs() | Number of output buffers in this plan |
| execute_with_vars(var_vals) | Re-execute with different symbolic variable values (no recompilation) |
The plan is reusable: compile once, execute many times with different data.
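A hedged sketch of that reuse, with names and signatures inferred from the table above (the real API may differ; "N" is a hypothetical symbolic dimension):

```rust
let plan = prepare(&tensor)?;                 // compile once

for n in [128i64, 256, 512] {
    let mut vars = std::collections::HashMap::new();
    vars.insert("N".to_string(), n);
    plan.execute_with_vars(&vars)?;           // re-execute, no recompilation
    let _out = plan.output_buffer_at(0);      // first output, per SINK order
}
```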
JIT Compilation
The LLVM runtime (runtime/src/llvm.rs) compiles IR to machine code:
- Parse the LLVM IR string into a module
- Verify the module is well-formed
- Optimize with LLVM's O3 pass pipeline
- JIT compile to native machine code
- Cache by (AST ID, device) for reuse
// Simplified JIT flow
let module = Module::parse_ir(context, ir_string)?;
module.verify()?;
pass_manager.run(&module); // O3 optimization
let function = execution_engine.get_function::<KernelFn>(&name)?;
// Cache: (ast_id, device) → function
Kernel Execution
With kernels compiled, execution iterates through kernels in topological order, respecting dependencies:
for kernel in &plan.kernels {
// Dependencies tracked per-kernel via kernel.dependencies
kernel.execute(buffers);
}
Kernels carry their own device specification, so a plan can span multiple devices.
Kernel Caching
Hash consing makes kernel caching highly effective:
- Key: (UOp ID, device string)
- Storage: lock-free HashMap (papaya crate)
- Hit rate: high, because identical computations share UOp IDs
When you compute the same expression twice, the second call hits the cache—no recompilation.
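A minimal sketch of that cache (illustrative types, using a plain HashMap rather than the lock-free map Morok actually uses):

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct CompiledKernel {/* JIT-compiled machine code handle */}

// Cache keyed by the hash-consed UOp id plus the device string. Two
// separately constructed but identical expressions share the same Arc<UOp>,
// hence the same id, hence the same cache entry.
struct KernelCache {
    map: HashMap<(u64, String), Arc<CompiledKernel>>,
}

impl KernelCache {
    fn get_or_compile(
        &mut self,
        uop_id: u64,
        device: &str,
        compile: impl FnOnce() -> CompiledKernel,
    ) -> Arc<CompiledKernel> {
        self.map
            .entry((uop_id, device.to_string()))
            .or_insert_with(|| Arc::new(compile()))
            .clone()
    }
}
```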
Worked Example: Matrix Multiply
Let's trace C = A @ B through the entire pipeline. Assume 4×4 matrices.
Stage 1: Lazy Graph Construction
let a = Tensor::from_slice(a_data); // Input buffer allocated
let b = Tensor::from_slice(b_data); // Input buffer allocated
let c = a.matmul(&b); // Graph built, no computation
At this point, c is a lazy tensor with this UOp graph:
REDUCE_AXIS(Add, axis=2)
└── MUL
├── EXPAND(A, [4, 4, 4]) — A: [4, 4] → [4, 1, 4] → [4, 4, 4]
└── EXPAND(B, [4, 4, 4]) — B: [4, 4] → [1, 4, 4] → [4, 4, 4]
Stage 2: Rangeify
Movement ops become explicit loops:
STORE
├── INDEX[BUFFER(C), RANGE(i, 0..4), RANGE(j, 0..4)] — index
├── REDUCE(Add) — value
│ ├── MUL
│ │ ├── LOAD(A)
│ │ │ └── INDEX[BUFFER(A), RANGE(i), RANGE(k, 0..4, Reduce)]
│ │ └── LOAD(B)
│ │ └── INDEX[BUFFER(B), RANGE(k), RANGE(j)]
│ └── RANGE(k, Reduce)
├── RANGE(i, Global) — output dim 0
└── RANGE(j, Global) — output dim 1
The i and j ranges are output dimensions. The k range is the reduction (contracted) dimension.
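In plain Rust, the rangeified graph above corresponds to this loop nest (illustrative; Morok emits the same structure as LLVM IR, not Rust):

```rust
// C[i][j] = sum over k of A[i][k] * B[k][j], for 4x4 row-major matrices.
fn matmul_4x4(a: &[f32; 16], b: &[f32; 16], c: &mut [f32; 16]) {
    for i in 0..4 {                 // RANGE(i, Global)
        for j in 0..4 {             // RANGE(j, Global)
            let mut acc = 0.0f32;   // REDUCE(Add) accumulator
            for k in 0..4 {         // RANGE(k, Reduce)
                acc += a[i * 4 + k] * b[k * 4 + j];
            }
            c[i * 4 + j] = acc;     // STORE through INDEX[BUFFER(C), i, j]
        }
    }
}
```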
Stage 3: Kernel Splitting
Single STORE → single KERNEL:
KERNEL
├── SINK(STORE(...))
├── ranges: [i: 0..4, j: 0..4]
└── buffers: [C (output), A (input), B (input)]
Stage 4: Schedule
One ScheduleItem with:
- kernel: The KERNEL UOp
- ast: The inner SINK/STORE
- buffers: [C, A, B]
- dependencies: [] (no prior kernels)
Stage 5: Optimization
Heuristic optimizer applies:
- Vectorization: UPCAST j dimension by 4
- Loop ordering: Ensure good cache behavior
Stage 6: Code Generation
Generated LLVM IR (simplified):
define void @matmul(ptr noalias align 32 %C, ptr noalias align 32 %A, ptr noalias align 32 %B) #0 {
entry:
br label %loop_i
loop_i:
%i = phi i64 [ 0, %entry ], [ %i.next, %loop_i.end ]
br label %loop_j
loop_j:
%j = phi i64 [ 0, %loop_i ], [ %j.next, %loop_k.end ]
br label %loop_k
loop_k:
%k = phi i64 [ 0, %loop_j ], [ %k.next, %loop_k ]
%acc = phi float [ 0.0, %loop_j ], [ %acc.new, %loop_k ] ; accumulator carried across k iterations
%a_val = load float, ptr ... ; A[i, k]
%b_val = load float, ptr ... ; B[k, j]
%prod = fmul float %a_val, %b_val
%acc.new = fadd float %acc, %prod
%k.next = add i64 %k, 1
%k.cond = icmp slt i64 %k.next, 4
br i1 %k.cond, label %loop_k, label %loop_k.end
loop_k.end:
store float %acc.new, ptr ... ; C[i, j]
; ... continue j, i loops
}
Stage 7: Execution
- JIT compile the LLVM IR
- Execute: kernel([C_ptr, A_ptr, B_ptr], [])
- Result is in the C buffer
Total: one function call, result ready.
Comparison: How Other Frameworks Execute
| Aspect | PyTorch | JAX | TVM | Morok |
|---|---|---|---|---|
| Evaluation | Eager (immediate) | Traced (jit decorator) | Lazy (te.compute) | Lazy (realize) |
| Graph capture | torch.compile | jax.jit trace | Explicit schedule | Implicit via ops |
| Compilation | TorchInductor | XLA backend | Auto-scheduler | Pattern + beam |
| Caching | Per-graph hash | Per-trace | Per-schedule | Per-AST (hash consing) |
| Parallelism | DataParallel/DDP | pmap/pjit | Parallel schedule | Parallel groups |
PyTorch: Eager by default, torch.compile for optimization. TorchInductor generates Triton or C++ code.
JAX: Functional transformations (jit, grad, vmap) trace computations. XLA compiles to optimized kernels.
TVM: Explicit separation of computation and schedule. Auto-scheduler searches for good schedules.
Morok: Fully lazy—nothing executes until realize(). Hash consing provides automatic caching. Pattern-based optimization with optional beam search for production quality.
The Deeper Insight
The pipeline embodies several design principles:
Lazy evaluation enables global optimization. By deferring computation, we see the entire graph before generating code. No local decision limits global optimization.
Explicit loops enable hardware-specific scheduling. Movement ops are convenient abstractions, but GPUs need loops. Rangeify bridges the gap.
Hash consing makes caching automatic. Identical computations share pointers, so cache keys are trivial. No complex graph hashing needed.
Separation of concerns keeps each stage simple. Rangeify doesn't know about LLVM. Code generation doesn't know about tensor semantics. Each stage does one thing well.
The result: a compilation pipeline that's both powerful and maintainable. From tensor.realize() to machine code, every step is visible, debuggable, and extensible.