Skip to content

PixelML/luce-megakernel

Repository files navigation

Luce Megakernel

The first megakernel for hybrid DeltaNet/Attention LLMs.
All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch.
1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 2x the throughput.

Blog post · Benchmarks · Discord · lucebox.com


                        Prefill      Decode      tok/J
Megakernel (RTX 3090)   37,800       413         1.87  @220W
llama.cpp  (RTX 3090)   11,247       267         0.76
Apple M5 Max               -         229         1.76

The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.

Why this exists

Conventional wisdom says NVIDIA GPUs are fast but power hungry, and Apple Silicon is slower but efficient. On paper, that checks out: llama.cpp on an RTX 3090 gets 267 tok/s at 350W (0.76 tok/J), while an M5 Max gets 229 tok/s at ~130W (1.76 tok/J). NVIDIA is faster, but 2.3x worse on efficiency. Qwen 3.5-0.8B uses a hybrid DeltaNet + Attention architecture (linear attention interleaved with standard attention). No fused kernel existed for this pattern. This is the first.

Inspired by Hazy Research's megakernel work on Llama-1B, we asked: can the same idea work for hybrid DeltaNet/Attention models on consumer GPUs?

We thought the problem was never the hardware. The RTX 3090 has 936 GB/s memory bandwidth and 142 TFLOPS FP16 compute. Extracting only 267 tok/s from that is a software problem.

The culprit: ~100 kernel launches per token. Each layer boundary returns control to the CPU, dispatches the next kernel, re-fetches weights from global memory, and synchronizes threads. For 24 layers, those microseconds add up, and each one burns power doing nothing useful.

So we fused everything into one kernel.

Results

Decode and Prefill (Qwen 3.5-0.8B)

Method Prefill pp520 (tok/s) Decode tg128 (tok/s)
Megakernel 37,800 413
llama.cpp BF16 11,247 267
PyTorch HuggingFace 7,578 108

3.4x faster prefill, 1.55x faster decode, 3.8x faster than PyTorch. Same hardware, same model, same weights.

Energy Efficiency (DVFS Power Sweep)

Power Limit Clock Draw tok/s tok/J vs Stock
420W (stock) 1980 MHz 314W 433 1.38 baseline
300W 1935 MHz 299W 432 1.44 99.8% speed, 5% less power
220W 1635 MHz 220W 411 1.87 95% speed, 30% less power
150W 405 MHz 150W 194 1.29 too aggressive

Sweet spot at 220W: 95% of the speed, 30% less power. The curve is nonlinear, tight execution converts directly into saved watts until you starve the GPU too aggressively.

The Comparison That Shouldn't Exist

Metric RTX 3090 (llama.cpp) M5 Max RTX 3090 (Megakernel @220W)
tok/s 267 229 411
Power 350W ~130W 220W
tok/J 0.76 1.76 1.87
GPU price ~$700 $2,499+ (system) ~$700

A $700 GPU from 2020, power-limited to 220W, matches Apple's latest chip on efficiency while delivering 1.8x the throughput.

How it works

A single persistent CUDA kernel processes the entire Qwen 3.5-0.8B forward pass in one dispatch. No CPU round-trips between layers.

Architecture: Qwen 3.5-0.8B is a hybrid model, 18 DeltaNet layers (linear attention with learned recurrence) and 6 full attention layers, in a 3:1 ratio. DeltaNet scales linearly with context length vs. quadratic for standard attention. It's an emerging pattern in next-gen models (Qwen3-Next, Kimi Linear), but no framework had optimized kernels for it.

Kernel specs:

  • 82 blocks, 512 threads, all SMs on the RTX 3090 kept occupied
  • BF16 weights and activations, FP32 accumulation where it matters
  • DeltaNet recurrence via warp-cooperative state updates in F32 registers
  • Full attention with online softmax (fused QKV, RoPE, causal mask, output projection)
  • Cooperative grid sync between layers instead of kernel launches (zero inter-layer overhead)
  • KV cache updates in-kernel
  • Weights loaded directly from HuggingFace

What traditional frameworks do: launch ~100 separate kernels per token, each one paying the cost of CPU dispatch, weight re-fetch, and thread synchronization. The megakernel eliminates all of that.

Why DeltaNet matters

Standard transformers have years of kernel optimization: FlashAttention, PagedAttention, continuous batching. Hybrid DeltaNet/Attention architectures are newer, and the kernel ecosystem is immature:

  • MLX: no native DeltaNet kernels
  • llama.cpp: generic DeltaNet support, no fusion
  • vLLM/SGLang: Triton kernels via flash-linear-attention, but no megakernel fusion

As more models go hybrid (and they will, because linear attention scales better), what you run them on matters less than how you run them. When you write a kernel that actually uses what the GPU offers, tensor cores, shared memory, cooperative grid launches, register-resident state, a five-year-old GPU matches Apple's latest chip.

Lessons from building this

grid.sync() inside loops will deadlock silently. We tried synchronizing all blocks within the per-token DeltaNet recurrence loop. No error message, just a hang. The fix: synchronize between layers, not within them.

Register pressure kills performance quietly. We attempted S_TILE=16 for more instruction-level parallelism. Silent crash, no CUDA error, registers spilled to local memory, performance collapsed. S_TILE=8 was the sweet spot.

The power curve is nonlinear. 420W to 300W lost almost nothing (99.8% speed). 300W to 220W lost marginally (95%). 220W to 150W collapsed to 45%. The megakernel's tight execution means the GPU hits its compute ceiling before its power ceiling, until you starve it too aggressively.

Quick start

git clone https://github.com/Luce-Org/luce-megakernel
cd luce-megakernel
pip install -e .
python bench_pp_tg.py    # runs pp520 tg128, prints tok/s and tok/J

Requirements:

  • NVIDIA GPU (Ampere+), tested on RTX 3090
  • CUDA 12+
  • PyTorch 2.0+
  • ~1.5 GB VRAM for BF16 weights

Optional: Set a power limit to find your GPU's sweet spot:

sudo nvidia-smi -pl 220    # or whatever your target wattage

Files

File Description
kernel.cu Decode megakernel, all 24 layers in one dispatch
prefill.cu Prefill (cuBLAS + standalone kernels)
torch_bindings.cpp PyTorch C++ bindings
model.py Weight loading + decoder
setup.py Build configuration
bench_pp_tg.py Benchmark (prefill + decode)
RESULTS.md Full benchmark results and DVFS sweep

Scope and limitations

This is a research proof-of-concept, not a production inference server.

  • Batch size 1 only. This targets single-user local inference (the llama.cpp/Ollama use case), not multi-tenant serving. If you need batched throughput, use vLLM or SGLang.
  • Single model, single architecture. The kernel is hand-written for Qwen 3.5-0.8B's specific layer pattern (18 DeltaNet + 6 Attention). It does not generalize to other models without rewriting.
  • BF16 only. No quantization support (GGUF/GPTQ/AWQ). We benchmark at BF16 to isolate kernel-level efficiency from quantization tradeoffs.
  • 0.8B parameters. This is a small model. Megakernel fusion benefits shrink as model size grows and compute begins to dominate over launch overhead. We chose 0.8B because it's the first hybrid DeltaNet model available, not because it's representative of all workloads.
  • Power methodology. Efficiency numbers measure accelerator power only (NVML for NVIDIA, powermetrics for Apple), following Hazy Research's Intelligence Per Watt methodology. Total system draw is higher for both platforms.
  • Correctness. The benchmark includes an end-to-end correctness check comparing megakernel output against a reference decode path. See bench_pp_tg.py.

The goal is to demonstrate that architecture-specific kernel fusion eliminates a real efficiency gap on consumer hardware, and to do it in the open so others can reproduce, critique, and extend the work.

Files

Questions, ideas, or want to see what others are building? Join the Luce Discord.

Citation

If you use this work in your research:

@software{luce_megakernel_2026,
  title  = {Luce Megakernel: Fused Forward Pass for Hybrid DeltaNet/Attention LLMs},
  author = {Luce},
  url    = {https://github.com/Luce-Org/luce-megakernel},
  year   = {2026}
}

Why llama.cpp as a baseline?

llama.cpp is the most widely used local inference engine. It's what most people actually run on consumer GPUs. We also include PyTorch HuggingFace numbers (3.8x slower) for a second reference point. This is not a critique of llama.cpp, it's an excellent project. The comparison shows what architecture-specific optimization can unlock on top of a generic framework.

Why an RTX 3090?

Deliberately chosen as the "worst case" for NVIDIA: a 2020 GPU, widely dismissed as power-hungry, available for ~$900-1,000 used. If the software gap is real on old hardware, it's even larger on newer cards.

Community

Questions, ideas, or want to see what others are building? Join the Luce Discord.


MIT · Lucebox

Built with Claude

Inspired by AlpinDale/qwen_megakernel, Infatoshi/MegaQwen

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors