// black metal kernel — episode 04 of 08

SIMD memory operations:
asm! on the critical path

// kernel programming in rust — zero-cost abstractions — no gc — no mercy

AVX2 movnti sfence align(32) cache-bypass

// 04 — simd memory: avoiding cache pollution

clearing or copying large contiguous memory regions in the kernel is a bottleneck that shapes system latency. a standard `memset` writes through the cache, filling L1/L2 with transient data it will never read back and evicting hot instructions and working-set lines. doing this correctly means bypassing the cache entirely.

on x86-64, non-temporal stores (`movnti` for 64-bit general-purpose registers, `vmovntdq` for 256-bit AVX registers) write directly to main memory. they are destructive to performance if you read the data back immediately, but for zeroing fresh page allocations or flushing buffers they leave the cache hierarchy undisturbed. combining rust's inline `asm!` with the layout control of `#[repr(align(32))]` satisfies the strict alignment the AVX form demands.

#[repr(C, align(32))]
pub struct BlackAlignedBuffer {
    black_data: [u8; 4096],
}

impl BlackAlignedBuffer {
    pub fn black_zero_nt(&mut self) {
        let black_ptr = self.black_data.as_mut_ptr();
        // one vmovntdq covers 32 bytes; 4096 / 32 = 128 stores
        let black_iters = self.black_data.len() / 32;

        unsafe {
            core::arch::asm!(
                "vpxor ymm0, ymm0, ymm0",
                "2:",
                "vmovntdq ymmword ptr [{0}], ymm0",
                "add {0}, 32",
                "dec {1}",
                "jnz 2b",
                "sfence",
                inout(reg) black_ptr => _,
                inout(reg) black_iters => _,
                out("ymm0") _,
                // no `preserves_flags`: `dec` and `jnz` read and write rflags
                options(nostack)
            );
        }
    }
}
// expanded — AVX2, non-temporal stores, and alignment

a standard write instruction pushes data into the CPU's L1 cache. if the cache is full, existing lines are evicted. when a kernel zeroes a 4096-byte page before handing it to a user process, that write evicts up to 64 lines (4096 bytes) of potentially useful data from L1. because the kernel never intends to read that newly zeroed page itself, caching it is pure cache pollution.

the `vmovntdq` (move packed integers using non-temporal hint) instruction solves this. "non-temporal" hints to the processor that the data will not be accessed again soon. the CPU streams the 32-byte `ymm` register through its write-combining buffers straight to memory, bypassing the L1/L2/L3 cache hierarchy entirely. the cache survives intact.
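the same loop can be written without raw `asm!` by using the `core::arch` intrinsics, which compile down to `vpxor`/`vmovntdq`/`sfence`. a minimal userspace sketch, assuming an x86-64 target; the `BlackPage` type and function name are illustrative, not from the series:

```rust
use std::arch::x86_64::{__m256i, _mm256_setzero_si256, _mm256_stream_si256, _mm_sfence};

#[repr(C, align(32))]
pub struct BlackPage(pub [u8; 4096]);

// safety: caller must verify avx2 support at runtime before calling
#[target_feature(enable = "avx2")]
pub unsafe fn black_zero_nt_intrinsics(page: &mut BlackPage) {
    let zero = _mm256_setzero_si256();           // vpxor ymm, ymm, ymm
    let mut p = page.0.as_mut_ptr() as *mut __m256i;
    for _ in 0..page.0.len() / 32 {
        _mm256_stream_si256(p, zero);            // vmovntdq [p], ymm
        p = p.add(1);
    }
    _mm_sfence();                                // drain write-combining buffers
}

fn main() {
    let mut page = BlackPage([0xAA; 4096]);
    if std::is_x86_feature_detected!("avx2") {
        unsafe { black_zero_nt_intrinsics(&mut page) };
    } else {
        page.0.fill(0); // cached fallback on non-avx2 hardware
    }
    assert!(page.0.iter().all(|&b| b == 0));
    println!("page zeroed");
}
```

the intrinsic version lets the compiler handle register allocation and flag clobbers, at the cost of trusting it not to replace the stream stores with cached ones.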

to safely use `vmovntdq`, the destination memory address must be 32-byte aligned. if it is not, the CPU raises a general protection fault (#GP). we use rust's `#[repr(C, align(32))]` to guarantee that every instance of `BlackAlignedBuffer` is naturally aligned in memory, satisfying the hardware's strict demands at compile time.
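the alignment guarantee is checkable, both at runtime on a live pointer and at compile time through the type. a small sketch (the `BlackAligned` name is illustrative):

```rust
#[repr(C, align(32))]
struct BlackAligned([u8; 64]);

fn main() {
    let buf = BlackAligned([0; 64]);
    // align(32) guarantees the low five address bits are zero
    assert_eq!(buf.0.as_ptr() as usize % 32, 0);
    // the guarantee is part of the type, visible at compile time
    assert_eq!(core::mem::align_of::<BlackAligned>(), 32);
    println!("aligned");
}
```

a `debug_assert!` on the pointer is cheap insurance in code paths where the buffer arrives through a raw pointer and the type-level guarantee has been erased.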

the `sfence` (store fence) instruction at the end is absolutely mandatory. because non-temporal stores bypass the normal strongly-ordered store path, they are weakly ordered. `sfence` drains the CPU's write-combining buffers, guaranteeing every non-temporal store before it is globally visible before any store that follows it. without it, another core could observe a later write before the zeroed data.
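the classic failure mode is the producer/consumer publish pattern: stream the data, then set a flag. a minimal userspace sketch of why the fence sits between the two, using the scalar `_mm_stream_si32` intrinsic (the `BLACK_READY` flag is illustrative):

```rust
use std::arch::x86_64::{_mm_sfence, _mm_stream_si32};
use std::sync::atomic::{AtomicBool, Ordering};

static BLACK_READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let mut slot: i32 = 0;
    unsafe {
        // non-temporal store: weakly ordered, may sit in a write-combining buffer
        _mm_stream_si32(&mut slot, 42);
        // force the streamed data globally visible before any later store...
        _mm_sfence();
    }
    // ...including the flag another core would poll to find the data
    BLACK_READY.store(true, Ordering::Release);
    assert_eq!(slot, 42);
    assert!(BLACK_READY.load(Ordering::Acquire));
}
```

without the fence, `Release` ordering on the flag is not enough: it orders ordinary stores, but the streamed write can still be sitting in a write-combining buffer when the flag lands.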

// 04 / 08 — black_ptr owns its truth — BLACK0X80