Phase 1: Rangeify

Goal: Transform high-level movement operations into explicit loop structures and optimize ranges.

Stage 1: Early Movement Ops

Stage at a Glance

Goal: Clean up movement operations before range assignment Key Patterns: Movement on INDEX, movement through wrappers, nested INDEX simplification Impact: Prevents missed optimizations later in the pipeline

What This Does: This stage cleans up movement operations by pushing index manipulations into places where they're actually needed. Think of it as organizing your desk before filing papers—move instructions closer to where the data is used.

Why This Matters: Movement operations (RESHAPE, PERMUTE, etc.) are convenient abstractions, but the hardware needs concrete index calculations. By cleaning them up early, we ensure patterns in later stages can match correctly.

Pattern: pm_mops + pm_syntactic_sugar (bottom-up)

Pattern	Transformation	Visual	Location
Movement on INDEX	Apply movement to index expressions	`INDEX(PERMUTE(arr), [i, j]) → INDEX(arr, [j, i])`	`movement_op_patterns()`
Movement through AFTER	Move RESHAPE through timing wrapper (Tinygrad-specific)	`AFTER(RESHAPE(x, arg), [dep1, dep2]) → RESHAPE(AFTER(x, [dep2]), arg)`	Tinygrad only
Movement through END	Unwrap movement from END wrapper (Tinygrad-specific)	`END(RESHAPE(x), ranges) → END(x, ranges)`	Tinygrad only
Nested INDEX simplification	Remove redundant nested INDEX (Morok)	`INDEX(INDEX(ptr, [i]), [i]) → INDEX(ptr, [i])`	`movement_op_patterns()`
Nested INDEX concat	Flatten nested INDEX for PtrDType	`INDEX(INDEX(ptr, i), j) → INDEX(ptr, i, j)`	`pm_syntactic_sugar`

Why bottom-up? Child nodes must be clean before parents can match. Movement ops nest deeply; cleaning from bottom prevents missed patterns.

Note: Tinygrad and Morok have different approaches here. Tinygrad moves movement ops through wrappers (AFTER, END) because it re-applies movement ops during bufferization. Morok removes movement ops entirely by transforming indices during bufferization, so AFTER/END patterns are not needed.

Morok: movement_op_patterns() in rangeify/patterns.rs

Stage 2: Load Collapse

Stage at a Glance

Goal: Eliminate REDUCE operations by detecting range-independent computation Key Patterns: Bounded sum, gated load collapse, general reduce elimination Impact: Converts loop iterations to arithmetic operations

What This Does: Eliminates REDUCE operations by recognizing when the computation can be done without iteration. Uses range-independent computation detection and symbolic simplification.

Why This Matters: Reducing iterations to arithmetic operations eliminates loop overhead. Instead of running a loop 1000 times, compute the answer directly.

Pattern: pm_load_collapse

// Before: Sum with bounds check
sum(1 for k in 0..64 if k >= length)

// After: Compute count directly (NO LOOP!)
count = clamp(64 - length, 0, 64)

The mechanism works by:

Identifying subexpressions that don't depend on the REDUCE range
Creating DEFINE_VAR for those subexpressions (treats as loop-invariant)
Substituting the range with DEFINE_VAR and running symbolic simplification
If the simplified expression has no more ranges, the REDUCE is eliminated

Note: WHERE movement through INDEX (pm_move_where_on_load) is a separate optimization that places conditionals before loads to skip memory accesses, but it doesn't eliminate REDUCE operations.

Morok: pm_load_collapse() in rangeify/patterns.rs

Stage 3: Split Ranges

Stage at a Glance

Goal: Enable better optimization through divmod decomposition Key Patterns: Split ranges with modulo, flatten ranges Impact: Inner ranges can vectorize, outer can parallelize

What This Does: Handles modulo patterns by splitting a range into outer and inner components.

Why This Matters: Splitting ranges is like dividing a large task among team members. If you have 12 items and each person does 4, you get 3 people × 4 items. Inner loops (one person's 4 items) can be fast; outer loops (3 people) can run in parallel.

Pattern: pm_split_ranges + pm_flatten_range

Before:  RANGE(end=12) % 4  // One loop with modulo (slow)
             ↓ [Split into outer × inner]
After:   RANGE(end=3) * 4 + RANGE(end=4)
            ↑outer        ↑inner
            Parallel      Sequential

This enables:

Inner ranges can vectorize (SIMD)
Outer ranges can parallelize (GPU blocks / CPU threads)

pm_flatten_range merges nested ranges on REDUCE/STORE/END when beneficial.

Context: Requires dictionary context (ctx={}) to track substitutions at SINK.

Note: The split only applies when end % mod == 0 (divisibility check).

Morok: pm_split_ranges() + pm_flatten_range() in rangeify/transforms.rs

Stage 4: Initial Symbolic

Stage at a Glance

Goal: Simplify expressions using algebra rules Key Patterns: Constant folding, identity removal, div-mod recombine Impact: Eliminates expensive operations, reduces code size

What This Does: Applies 100+ constant folding and algebraic simplification rules.

Why This Matters: Computers are fast at simple math. Dividing and taking remainders are slow operations. This stage uses algebra rules to eliminate slow operations whenever possible.

Pattern: symbolic() + pm_flatten_range

Note: symbolic() is a subset of sym used at Stage 8. It includes algebraic rules but omits later-stage patterns.

Constant folding:

ADD(CONST(2), CONST(3)) → CONST(5)
MUL(x, CONST(1)) → x
ADD(x, CONST(0)) → x

Div-mod recombination:

(x / c) * c + (x % c) → x

Why? Computes the same value as x but with 3 operations instead of 1. This pattern finds and removes the redundancy (common in stride calculations).

Boolean algebra:

x AND x → x
x OR FALSE → x
NOT(NOT(x)) → x

Additional categories:

Identity removal (self-folding, redundant operations)
Comparison simplification
Cast optimization
GEP pushing (move address calculations through ALUs)
Where folding (combine WHERE with same conditions)
Reduce mul chain (move multiplications outside reduce)

Morok: symbolic() in symbolic/patterns.rs

Stage 5: Simplify Ranges

Stage at a Glance

Goal: Merge adjacent ranges to reduce loop overhead Key Patterns: Range merging with cost analysis Impact: Fewer loops = less overhead

What This Does: Merges adjacent ranges when profitable.

Why This Matters: Merging ranges is like combining multiple small trips into one big one. Instead of going to the store 4 times for 4 items, go once for all 4 items. Saves the overhead of starting and stopping.

Pattern: pm_flatten_range() + pm_simplify_ranges()

// Before: two separate ranges
RANGE(0..4), RANGE(0..8)

// After: merged (if compatible)
RANGE(0..32)

Merge criteria:

Axis types must be compatible (both output, both reduce, etc.)
REDUCE scope must remain consistent
Cost-based: Accept only if divmod operation count does not increase

The compiler only merges if it saves operations. Merging might require division/modulo to recalculate indices. If that costs more than it saves, merge is skipped.

Morok: simplify_merge_adjacent() in rangeify/transforms.rs

Stage 6: Split Store

Stage at a Glance

Goal: Split graph at STORE boundaries into separate kernels Key Function: split_all_stores() + split_store() Impact: Enables per-kernel optimization

What This Does: Splits the UOp graph at STORE boundaries, creating separate kernels for each output.

Why This Matters: After bufferization, the graph may contain multiple STORE operations. Each STORE becomes its own kernel with its own set of buffers, ranges, and dependencies.

Function: run_kernel_split_pipeline() in schedule/src/rangeify/kernel.rs

Before splitting, a pm_flatten_range pre-pass runs on the full graph once (bottom-up). This merges nested ranges across all kernels in a single traversal, avoiding redundant per-kernel work on overlapping subgraphs. The pre-pass is a key compilation speed optimization — without it, each kernel's split_store would re-traverse shared subgraphs independently.

After the pre-pass, split_all_stores splits at STORE boundaries, then fix_assign handles buffer numbering (via LocalAddBufferContext.dg counter) and dependency tracking.

Stage 7: Apply Opts

Stage at a Glance

Goal: Find optimal combination of vectorization, unrolling, memory usage Key Algorithm: Beam search or heuristics Impact: Can significantly improve performance

What This Does: The optimization search—either beam search or heuristic—explores different combinations of optimization actions.

Why This Matters: The compiler tries different combinations of optimizations (vectorize here? unroll there?) and picks the fastest. Finding the right combination can make code 10x faster.

Function: apply_opts(sink, renderer)

Optimization actions:

Action	Effect	Hardware Target
TC	Enable tensor core usage	NVIDIA GPUs
UPCAST	Vectorize a dimension	All (SIMD)
LOCAL	Use local/shared memory	GPU (LDS) / CPU (L1)
UNROLL	Unroll a loop dimension	All (avoid loop overhead)
GROUP	Group operations for cache	All
GROUPTOP	Group for reduce ops	GPU tensor cores
THREAD	Thread-based parallelism	CPU
NOLOCALS	Disable local memory usage	All (constraint, prevents further LOCAL actions)
SWAP	Swap range assignments	All (try different tiling)
PADTO	Pad for alignment	All (memory alignment)

Optimization Search Explained:

The compiler searches for the best combination:

Heuristic mode (BEAM=0): Fast hand-coded optimization patterns, no compilation
Beam search (BEAM≥1): Compiles and runs candidates to measure actual performance

Optimization Search:
├── Heuristic mode (BEAM=0): Hand-coded optimizations
└── Beam search (BEAM≥1):
    ├── Generate all possible actions (~162 base actions, workload-dependent)
    ├── Apply to all top-K candidates in parallel
    ├── Filter based on constraints
    ├── Compile and run each candidate → Measure actual time
    └── Pick fastest

Note: NOLOCALS is a constraint that sets dont_use_locals = True, preventing further LOCAL actions and affecting shared memory usage decisions.

Morok: optimizer/mod.rs, optimizer/opts.rs

Stage 1: Early Movement Ops​

Stage 2: Load Collapse​

Stage 3: Split Ranges​

Stage 4: Initial Symbolic​

Stage 5: Simplify Ranges​

Stage 6: Split Store​

Stage 7: Apply Opts​