
JIT Compiler

Most ML compilers either link an entire LLVM toolchain into the binary — adding hundreds of megabytes of dependencies — or write temporary files to disk and dlopen the result. Morok does neither.

When a kernel needs to execute, Morok pipes the generated source through clang on stdin, receives a relocatable ELF object on stdout, parses it in-process, copies the machine code into an anonymous memory mapping, applies relocations, flips the page permissions to executable, and calls the function pointer directly. The whole process happens in memory — no temp files touch the disk, no shared libraries are loaded, and no LLVM installation is required beyond clang on the PATH.

This chapter describes how the CPU JIT loader works. GPU backends (CUDA, Metal, etc.) use their respective driver APIs for compilation and dispatch, and will be documented separately as they are added. The higher-level graph wrapper API that compiles a model graph once and replays it many times is documented in JIT Graphs.

Pipeline

```text
C source / LLVM IR
  → clang -c (stdin → stdout)
  → ELF .o bytes (in memory)
  → parse sections (object crate)
  → anonymous mmap + copy sections
  → apply relocations (arch-specific)
  → mprotect(PROT_READ | PROT_EXEC)
  → flush I-cache (non-x86_64)
  → call function pointer via libffi
```

Both the Clang backend (C source via -x c) and the LLVM backend (LLVM IR text via -x ir) share this loader. The only difference is the clang input language flag.

:::tip Fallback mode
For debugging, or on platforms where the custom ELF loader doesn't work, the `dlopen-fallback` Cargo feature switches to a traditional pipeline: `clang -shared` writes a `.so` to a temp directory, which is loaded via `dlopen`. This is slower (disk I/O plus dynamic linker overhead) but more portable.
:::

Supported Architectures

| Architecture | Target triple | Compile flag | I-cache | Notes |
|---|---|---|---|---|
| x86_64 | `x86_64-none-unknown-elf` | `-march=native` | Coherent | AMD64, Intel 64 |
| aarch64 | `aarch64-none-unknown-elf` | `-march=native` | `__clear_cache` | Apple Silicon, Ampere, Graviton |
| riscv64 | `riscv64-none-unknown-elf` | `-march=rv64gc` | `__clear_cache` | RV64I + M + A + F + D + C extensions |
| loongarch64 | `loongarch64-none-unknown-elf` | `-march=native` | `__clear_cache` | Loongson 3A5000+ |
| ppc64le | `powerpc64le-none-unknown-elf` | `-mcpu=native` | `__clear_cache` | ELFv2 ABI, little-endian only |

Architecture detection is automatic via std::env::consts::ARCH at runtime — no compile-time feature flags needed.

Relocation Support

The loader implements a minimal ELF relocator for each architecture. It handles the relocation types that clang -c -O2 actually emits for small, self-contained compute kernels — not a full linker.

- **x86_64** — PC-relative (`R_X86_64_PC32`, `PLT32`, `GOTPCRELX`, `REX_GOTPCRELX`), absolute 32/64-bit (`R_X86_64_32`, `32S`, `64`).

- **aarch64** — 26-bit branches (`CALL26`, `JUMP26`) with automatic veneer generation when the target exceeds ±128 MiB, page-relative ADRP (`ADR_PREL_PG_HI21`), 12-bit page offsets with access-size shifts (`ADD_ABS_LO12_NC`, `LDST8/16/32/64/128_ABS_LO12_NC`).

- **riscv64** — Call pairs (`CALL`, `CALL_PLT`), PC-relative split addressing with state tracking (`PCREL_HI20` + `PCREL_LO12_I/S`), absolute (`HI20`, `LO12_I/S`), branches (`BRANCH`, `JAL`), data (`32`, `64`). Linker relaxation hints (`RELAX`) are skipped.

- **loongarch64** — 26-bit branches (`B26`), page-aligned split addressing (`PCALA_HI20`, `PCALA_LO12`), data (`32`, `64`). Linker relaxation hints (`RELAX`) are skipped.

- **ppc64le** — 24-bit branches (`REL24`), TOC-relative addressing with `.TOC.` symbol lookup (`TOC16_HA`, `TOC16_LO`, `TOC16_LO_DS`, `TOC16`, `TOC16_HI`), PC-relative (`REL32`), absolute (`ADDR32`, `ADDR64`).

Compilation Flags

The loader compiles with a bare-metal target to produce clean, self-contained ELF objects with no runtime dependencies:

| Flag | C backend | LLVM IR backend | Purpose |
|---|---|---|---|
| `-c` | yes | yes | Compile only (no linking) |
| `-O2` | yes | yes | Optimization level |
| `-march=native` | yes | yes | Use host CPU features |
| `-fPIC` | yes | yes | Position-independent code |
| `-ffreestanding` | yes | no | No hosted environment assumed |
| `-fno-math-errno` | yes | yes | Math builtins don't set errno |
| `-fno-stack-protector` | yes | yes | No stack canary overhead |
| `-nostdlib` | yes | no | No standard library |
| `-fno-ident` | yes | no | Suppress `.comment` section |
| `--target=<arch>-none-unknown-elf` | yes | yes | Bare-metal ELF target |
| `-ffixed-x18` | aarch64 macOS/Win | aarch64 macOS/Win | Reserve platform register |
| `-funroll-loops` | no | yes | Aggressive loop unrolling |
| `-fvectorize` | no | yes | Loop vectorization |
| `-fslp-vectorize` | no | yes | SLP (straight-line) vectorization |

The C backend uses __builtin_* functions (e.g. __builtin_sqrtf, __builtin_fmaf) instead of #include <math.h>, so -ffreestanding -nostdlib works without losing math support — these are compiler intrinsics that lower to hardware instructions directly.

External Symbol Resolution

If clang emits a call to an external function (rare — most math is handled by builtins), the loader resolves it via dlsym(RTLD_DEFAULT, name) at load time. This covers cases like memcpy or platform-specific libm symbols that clang might emit instead of inlining.

Branch Veneers (aarch64)

On aarch64, CALL26/JUMP26 relocations encode a PC-relative offset in 26 bits, giving a range of ±128 MiB. On macOS with ASLR, the anonymous mmap region is typically ~2 GB away from system libraries like libm — far beyond this range.

When the loader detects an out-of-range CALL26/JUMP26, it emits a veneer (branch trampoline) in a reserved area at the end of the mmap:

```asm
LDR  X16, [PC, #8]   // load 64-bit target address
BR   X16             // indirect branch
.quad <address>      // full 64-bit address
```

Veneers are pre-scanned (counted before mmap allocation) and deduplicated — if multiple call sites reference the same external symbol, they share a single veneer.

Platform Register (aarch64)

On macOS and Windows ARM, register x18 is reserved as the platform register. Since we compile with --target=aarch64-none-unknown-elf (bare-metal), the compiler would normally treat x18 as a free GPR. The -ffixed-x18 flag prevents this, avoiding crashes when JIT code runs in a macOS/Windows process.

Instruction Cache Coherence

On x86_64, the instruction and data caches are coherent — writing machine code to memory and jumping to it works without extra steps. On all other architectures, the loader calls __clear_cache(start, end) after mprotect to ensure the instruction cache sees the new code.