// black metal kernel — episode 02 of 08
custom allocator:
owning the heap from birth
// kernel programming in rust — zero-cost abstractions — no gc — no mercy
// tags: GlobalAlloc · AtomicUsize · bump alloc · lock-free · align(4096) · SMP
// 02 — custom allocator: owning the heap from birth
the standard heap allocator is a comfortable lie built on decades of compromises — it does not know your access patterns, it does not know your alignment requirements, it does not know that you will allocate ten thousand 64-byte objects and never free them until shutdown. in the kernel context, ignorance is the antagonist of performance. you write the allocator that fits the workload exactly. lock-free O(1) allocation means replacing the global heap lock with atomic operations, so no core ever stalls waiting for another to release the allocator when several grab memory simultaneously.
by implementing rust's `GlobalAlloc` trait, we map the entire language's dynamic allocation features — `Box`, `Vec`, `String` — directly to our custom memory management policy. our bump allocator demonstrates brutal efficiency. it never frees. it simply increments a pointer forward, relying on atomic compare-and-swap (CAS) to stay entirely thread-safe without ever acquiring a lock.
// GlobalAlloc is stable — no feature gate needed
use core::alloc::{GlobalAlloc, Layout};
use core::ptr::{self, addr_of_mut};
use core::sync::atomic::{AtomicUsize, Ordering};

const BLACK_HEAP_SIZE: usize = 1024 * 1024 * 4; // 4 MiB arena

// page-aligned backing store for the whole heap
#[repr(align(4096))]
struct BlackHeap([u8; BLACK_HEAP_SIZE]);

#[allow(non_upper_case_globals)]
static mut black_heap: BlackHeap = BlackHeap([0u8; BLACK_HEAP_SIZE]);
#[allow(non_upper_case_globals)]
static black_offset: AtomicUsize = AtomicUsize::new(0);

pub struct BlackBumpAllocator;

unsafe impl GlobalAlloc for BlackBumpAllocator {
    unsafe fn alloc(&self, black_layout: Layout) -> *mut u8 {
        let black_align = black_layout.align();
        let black_size = black_layout.size();
        loop {
            let black_current = black_offset.load(Ordering::Relaxed);
            // round up to the next multiple of black_align (always a power of two)
            let black_aligned = (black_current + black_align - 1) & !(black_align - 1);
            let black_next = black_aligned + black_size;
            if black_next > BLACK_HEAP_SIZE {
                return ptr::null_mut(); // arena exhausted
            }
            if black_offset
                .compare_exchange_weak(
                    black_current,
                    black_next,
                    Ordering::AcqRel,
                    Ordering::Relaxed,
                )
                .is_ok()
            {
                // raw pointer via addr_of_mut! — never take a reference to a static mut
                return addr_of_mut!(black_heap.0).cast::<u8>().add(black_aligned);
            }
        }
    }

    unsafe fn dealloc(&self, _black_ptr: *mut u8, _black_layout: Layout) {}
}

#[global_allocator]
#[allow(non_upper_case_globals)]
static black_alloc: BlackBumpAllocator = BlackBumpAllocator;
// dealloc is intentionally empty — this is a bump allocator — if your workload never frees or frees everything at once this is O(1) allocation with zero fragmentation and zero lock contention on the fast path — the compare_exchange loop makes it safe for SMP without a spinlock
// expanded — bump allocation, alignment math, and lock-free SMP safety
a general-purpose allocator like ptmalloc or jemalloc maintains segregated free lists, coalesces adjacent free blocks, embeds metadata headers immediately before each allocation, and handles fragmentation across thousands of allocation sizes and lifetimes. all of that is correctness machinery for the general case. in a kernel init path or a real-time interrupt handler, you do not need the general case and you cannot afford its overhead. the bump allocator treats its single contiguous block of memory as if it were inexhaustible. it only moves forward, taking the next chunk of RAM without regard for what came before. with no free lists to maintain, there is nothing to traverse.
a bump allocator reduces heap allocation to a single atomic integer update. the entire heap is one contiguous array — BlackHeap — sitting in static memory. black_offset is a cursor pointing to the next free byte. to allocate N bytes: round the cursor up to the required alignment, advance it by N, and return the aligned position as the allocation address. O(1). no free list traversal. no metadata. no fragmentation, because the cursor never moves backward.
the alignment calculation (current + align - 1) & !(align - 1) is the classic power-of-two round-up. it works because all alignment values in rust's Layout are guaranteed to be powers of two. adding align - 1 pushes the value past the next boundary unless it is already aligned, and masking with !(align - 1) snaps it back down to that boundary. the result is the smallest aligned address greater than or equal to the current offset. no branches — just an add and a mask.
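the round-up in isolation, as a sketch — `align_up` is an illustrative name, not a function from the allocator above:

```rust
// classic power-of-two round-up, extracted for inspection
fn align_up(addr: usize, align: usize) -> usize {
    // assumes align is a power of two, as core::alloc::Layout guarantees
    (addr + align - 1) & !(align - 1)
}

fn main() {
    assert_eq!(align_up(13, 8), 16);     // 13 rounds up to the next multiple of 8
    assert_eq!(align_up(16, 8), 16);     // already aligned: unchanged
    assert_eq!(align_up(1, 4096), 4096); // one byte past a page boundary costs a whole page
    assert_eq!(align_up(0, 64), 0);      // zero is aligned to everything
}
```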
making this correct for multiple CPU cores simultaneously is what the compare_exchange_weak loop achieves. two cores might read the same black_current value at the same time. both compute their black_next, but only one can win the CAS — the atomic exchange succeeds only if the value at the address still equals black_current at the moment of exchange. the other core's CAS fails because the value changed between its load and its exchange, so it retries with the fresh value. no two cores ever receive the same memory region. the weak variant may also fail spuriously on LL/SC architectures like ARM, which is harmless inside a retry loop and compiles to cheaper code there. this is a lock-free algorithm: at least one core always completes its allocation in a bounded number of steps, with no mutex, no spinlock, and no blocking.
dealloc is a no-op by design. the bump allocator is appropriate for workloads with one of two shapes: either nothing is freed until the entire arena is reset at once (per-boot init, per-request arenas that live and die together), or the working set is known at design time and the arena is sized exactly to fit it. individual frees in a bump allocator are semantically meaningless. for dynamic reclamation, you layer an object pool or buddy-allocator system over the bump structure.
// 02 / 08 — black_ptr owns its truth — BLACK0X80