JIT Compiler

Most ML compilers either link an entire LLVM toolchain into the binary — adding hundreds of megabytes of dependencies — or write temporary files to disk and dlopen the result. Morok does neither.

When a kernel needs to execute, Morok pipes the generated source through clang on stdin, receives a relocatable ELF object on stdout, parses it in-process, copies the machine code into an anonymous memory mapping, applies relocations, flips the page permissions to executable, and calls the function pointer directly. The whole process happens in memory — no temp files touch the disk, no shared libraries are loaded, and no LLVM installation is required beyond clang on the PATH.

This chapter describes how the CPU JIT loader works. GPU backends (CUDA, Metal, etc.) use their respective driver APIs for compilation and dispatch, and will be documented separately as they are added. The higher-level graph wrapper API that compiles a model graph once and replays it many times is documented in JIT Graphs.

Pipeline

C source / LLVM IR
       │
       ▼
 clang -c (stdin → stdout)
       │
       ▼
  ELF .o bytes (in memory)
       │
       ▼
 Parse sections (object crate)
       │
       ▼
 Anonymous mmap + copy sections
       │
       ▼
 Apply relocations (arch-specific)
       │
       ▼
 mprotect(PROT_READ | PROT_EXEC)
       │
       ▼
 Flush I-cache (non-x86_64)
       │
       ▼
 Call function pointer via libffi

Both the Clang backend (C source via -x c) and the LLVM backend (LLVM IR text via -x ir) share this loader. The only difference is the clang input language flag.

:::tip Fallback mode For debugging or platforms where the custom ELF loader doesn't work, the dlopen-fallback Cargo feature switches to a traditional pipeline: clang -shared writes a .so to a temp directory, which is loaded via dlopen. This is slower (disk I/O + dynamic linker overhead) but more portable. :::

Supported Architectures

Architecture	Target triple	Compile flag	I-cache	Notes
x86_64	`x86_64-none-unknown-elf`	`-march=native`	Coherent	AMD64, Intel 64
aarch64	`aarch64-none-unknown-elf`	`-march=native`	`__clear_cache`	Apple Silicon, Ampere, Graviton
riscv64	`riscv64-none-unknown-elf`	`-march=rv64gc`	`__clear_cache`	RV64I + M + A + F + D + C extensions
loongarch64	`loongarch64-none-unknown-elf`	`-march=native`	`__clear_cache`	Loongson 3A5000+
ppc64le	`powerpc64le-none-unknown-elf`	`-mcpu=native`	`__clear_cache`	ELFv2 ABI, little-endian only

Architecture detection is automatic via std::env::consts::ARCH at runtime — no compile-time feature flags needed.

Relocation Support

The loader implements a minimal ELF relocator for each architecture. It handles the relocation types that clang -c -O2 actually emits for small, self-contained compute kernels — not a full linker.

x86_64 — PC-relative (R_X86_64_PC32, PLT32, GOTPCRELX, REX_GOTPCRELX), absolute 32/64-bit (R_X86_64_32, 32S, 64).

aarch64 — 26-bit branches (CALL26, JUMP26) with automatic veneer generation when the target exceeds ±128 MiB, page-relative ADRP (ADR_PREL_PG_HI21), 12-bit page offsets with access-size shifts (ADD_ABS_LO12_NC, LDST8/16/32/64/128_ABS_LO12_NC).

riscv64 — Call pairs (CALL, CALL_PLT), PC-relative split addressing with state tracking (PCREL_HI20 + PCREL_LO12_I/S), absolute (HI20, LO12_I/S), branches (BRANCH, JAL), data (32, 64). Linker relaxation hints (RELAX) are skipped.

loongarch64 — 26-bit branches (B26), page-aligned split addressing (PCALA_HI20, PCALA_LO12), data (32, 64). Linker relaxation hints (RELAX) are skipped.

ppc64le — 24-bit branches (REL24), TOC-relative addressing with .TOC. symbol lookup (TOC16_HA, TOC16_LO, TOC16_LO_DS, TOC16, TOC16_HI), PC-relative (REL32), absolute (ADDR32, ADDR64).

Compilation Flags

The loader compiles with a bare-metal target to produce clean, self-contained ELF objects with no runtime dependencies:

Flag	C backend	LLVM IR backend	Purpose
`-c`	yes	yes	Compile only (no linking)
`-O2`	yes	yes	Optimization level
`-march=native`	yes	yes	Use host CPU features
`-fPIC`	yes	yes	Position-independent code
`-ffreestanding`	yes	no	No hosted environment assumed
`-fno-math-errno`	yes	yes	Math builtins don't set errno
`-fno-stack-protector`	yes	yes	No stack canary overhead
`-nostdlib`	yes	no	No standard library
`-fno-ident`	yes	no	Suppress `.comment` section
`--target=<arch>-none-unknown-elf`	yes	yes	Bare-metal ELF target
`-ffixed-x18`	aarch64 macOS/Win	aarch64 macOS/Win	Reserve platform register
`-funroll-loops`	no	yes	Aggressive loop unrolling
`-fvectorize`	no	yes	Loop vectorization
`-fslp-vectorize`	no	yes	SLP (straight-line) vectorization

The C backend uses __builtin_* functions (e.g. __builtin_sqrtf, __builtin_fmaf) instead of #include <math.h>, so -ffreestanding -nostdlib works without losing math support — these are compiler intrinsics that lower to hardware instructions directly.

External Symbol Resolution

If clang emits a call to an external function (rare — most math is handled by builtins), the loader resolves it via dlsym(RTLD_DEFAULT, name) at load time. This covers cases like memcpy or platform-specific libm symbols that clang might emit instead of inlining.

Branch Veneers (aarch64)

On aarch64, CALL26/JUMP26 relocations encode a PC-relative offset in 26 bits, giving a range of ±128 MiB. On macOS with ASLR, the anonymous mmap region is typically ~2 GB away from system libraries like libm — far beyond this range.

When the loader detects an out-of-range CALL26/JUMP26, it emits a veneer (branch trampoline) in a reserved area at the end of the mmap:

LDR X16, [PC, #8]   // load 64-bit target address
BR  X16              // indirect branch
.quad <address>      // full 64-bit address

Veneers are pre-scanned (counted before mmap allocation) and deduplicated — if multiple call sites reference the same external symbol, they share a single veneer.

Platform Register (aarch64)

On macOS and Windows ARM, register x18 is reserved as the platform register. Since we compile with --target=aarch64-none-unknown-elf (bare-metal), the compiler would normally treat x18 as a free GPR. The -ffixed-x18 flag prevents this, avoiding crashes when JIT code runs in a macOS/Windows process.

Instruction Cache Coherence

On x86_64, the instruction and data caches are coherent — writing machine code to memory and jumping to it works without extra steps. On all other architectures, the loader calls __clear_cache(start, end) after mprotect to ensure the instruction cache sees the new code.

Pipeline​

Supported Architectures​

Relocation Support​

Compilation Flags​

External Symbol Resolution​

Branch Veneers (aarch64)​

Platform Register (aarch64)​

Instruction Cache Coherence​