Kernel Optimization Search

After algebraic simplification, each kernel needs scheduling decisions: how to tile loops, where to parallelize, whether to use tensor cores. Morok offers two strategies: fast heuristics and thorough beam search.

This runs at Stage 7 of the codegen pipeline.

Tinygrad source: tinygrad/codegen/opt/. Morok source: schedule/src/optimizer/.

The Action Space

Optimization transforms loop structures by changing axis types. Each action modifies one range:

Action	Effect	Hardware Target
UPCAST(axis, amount)	Vectorize a dimension (SIMD)	All
UNROLL(axis, amount)	Unroll a loop dimension	All
LOCAL(axis, amount)	Use GPU shared memory	GPU (LDS) / CPU (L1)
GROUP(axis, amount)	Two-stage reduction	All
GROUPTOP(axis, amount)	Grouped reduction for tensor cores	GPU
THREAD(axis, amount)	CPU thread-based parallelism	CPU
SWAP(axis1, axis2)	Reorder global dimensions	All
PADTO(axis, amount)	Pad for alignment	All
NOLOCALS	Disable local memory	All (constraint)
TC	Enable tensor core usage	NVIDIA GPUs

The total action space is ~162 base actions (varies with kernel structure and available parallelism).

Heuristics (Default)

The heuristic optimizer applies optimizations in a fixed order (simplified pseudocode):

// Pseudocode — simplified from optimizer/heuristics.rs
fn hand_coded_optimizations(scheduler: &mut Scheduler) {
    // 1. Tensor cores (if matmul pattern detected)
    if let Some(tc) = detect_tensor_core_pattern(scheduler) {
        apply_tensor_core(scheduler, tc);
        return;  // TC handles everything
    }

    // 2. Grouped reductions (two-stage for large reductions)
    apply_grouped_reduction_if_needed(scheduler);

    // 3. Vectorization (UPCAST output dimensions)
    apply_upcast(scheduler, 4);

    // 4. GPU local memory (workgroup dimensions)
    apply_local_dims(scheduler);

    // 5. CPU threading
    apply_threading(scheduler);
}

Pros: Fast (~50ms per kernel), predictable, no hardware measurement needed.

Cons: May miss optimization opportunities, fixed heuristics don't adapt to workload.

Beam Search (Optional)

For production workloads, beam search finds better schedules by compiling and timing candidates (simplified pseudocode):

// Pseudocode — simplified from optimizer/beam.rs
// Actual API: beam_search_cached(scheduler, config, compile_and_time) -> Result<BeamResult>
fn beam_search(scheduler: Scheduler, config: BeamConfig) -> Scheduler {
    let mut beam = vec![scheduler];
    let deadline = Instant::now() + config.time_limit;

    while Instant::now() < deadline {
        let mut candidates = vec![];

        for state in &beam {
            for action in generate_actions(state) {
                if let Ok(next) = state.apply(action) {
                    candidates.push(next);
                }
            }
        }

        // Compile and time each candidate
        let timed: Vec<_> = candidates.par_iter()
            .map(|c| (c, measure_kernel_time(c)))
            .collect();

        // Keep top K by execution time
        beam = timed.into_iter()
            .sorted_by_key(|(_, time)| *time)
            .take(config.beam_width)
            .map(|(c, _)| c)
            .collect();
    }

    beam.into_iter().next().unwrap()
}

Pros: Finds near-optimal schedules, adapts to hardware.

Cons: Minutes per kernel (but results are cached by AST hash).

Configuration

# Disable optimization (debugging)
MOROK_NOOPT=1 cargo run

# Enable beam search with width 8
BEAM=8 cargo run

Or programmatically:

let config = PrepareConfig::builder()
    .strategy(OptStrategy::Beam { width: 8 })
    .build();

tensor.realize_with(config)?;

Comparison: How Other Compilers Optimize

Aspect	XLA	TVM/Ansor	Triton	Morok
Philosophy	Fixed heuristics	Search-based	Programmer-guided	Pattern-based
Fusion	Conservative rules	Tile-and-fuse	Block-level	Graph rewriting
Auto-tuning	None	Evolutionary + cost model	Grid search	Beam search
Tuning cost	0	Hours	Minutes	Minutes (cached)
Flexibility	Low	High	Medium	High
Transparency	Low (C++ passes)	Medium (Python)	Medium (DSL)	High (declarative patterns)

XLA uses fixed heuristics for fusion decisions. Safe and predictable, but leaves performance on the table. Fusion rules are hard-coded in C++.

TVM/Ansor separates what to compute from how to compute it. Ansor uses evolutionary search with a learned cost model. Best-in-class performance, but tuning takes hours per model.

Triton exposes a Python-like DSL for blocked algorithms. Good balance of control and automation, but requires GPU programming expertise.

Morok expresses optimizations as composable patterns. Beam search adds auto-tuning when needed, with results cached by AST hash for reuse.

The Action Space​

Heuristics (Default)​

Beam Search (Optional)​

Configuration​

Comparison: How Other Compilers Optimize​

The Action Space

Heuristics (Default)

Beam Search (Optional)

Configuration

Comparison: How Other Compilers Optimize