Luce Megakernel

The first megakernel for hybrid DeltaNet/Attention LLMs.
All 24 layers of Qwen 3.5-0.8B in a single CUDA dispatch.
1.87 tok/J on a 2020 GPU, matching Apple's latest silicon at 2x the throughput.

Blog post · Benchmarks · Discord · lucebox.com

                        Prefill      Decode      tok/J
Megakernel (RTX 3090)   37,800       413         1.87  @220W
llama.cpp  (RTX 3090)   11,247       267         0.76
Apple M5 Max               -         229         1.76

The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.

Why this exists

Conventional wisdom says NVIDIA GPUs are fast but power hungry, and Apple Silicon is slower but efficient. On paper, that checks out: llama.cpp on an RTX 3090 gets 267 tok/s at 350W (0.76 tok/J), while an M5 Max gets 229 tok/s at ~130W (1.76 tok/J). NVIDIA is faster, but 2.3x worse on efficiency. Qwen 3.5-0.8B uses a hybrid DeltaNet + Attention architecture (linear attention interleaved with standard attention). No fused kernel existed for this pattern. This is the first.

Inspired by Hazy Research's megakernel work on Llama-1B, we asked: can the same idea work for hybrid DeltaNet/Attention models on consumer GPUs?

We thought the problem was never the hardware. The RTX 3090 has 936 GB/s memory bandwidth and 142 TFLOPS FP16 compute. Extracting only 267 tok/s from that is a software problem.

The culprit: ~100 kernel launches per token. Each layer boundary returns control to the CPU, dispatches the next kernel, re-fetches weights from global memory, and synchronizes threads. For 24 layers, those microseconds add up, and each one burns power doing nothing useful.

So we fused everything into one kernel.

Results

Decode and Prefill (Qwen 3.5-0.8B)

Method	Prefill pp520 (tok/s)	Decode tg128 (tok/s)
Megakernel	37,800	413
llama.cpp BF16	11,247	267
PyTorch HuggingFace	7,578	108

3.4x faster prefill, 1.55x faster decode, 3.8x faster than PyTorch. Same hardware, same model, same weights.

Energy Efficiency (DVFS Power Sweep)

Power Limit	Clock	Draw	tok/s	tok/J	vs Stock
420W (stock)	1980 MHz	314W	433	1.38	baseline
300W	1935 MHz	299W	432	1.44	99.8% speed, 5% less power
220W	1635 MHz	220W	411	1.87	95% speed, 30% less power
150W	405 MHz	150W	194	1.29	too aggressive

Sweet spot at 220W: 95% of the speed, 30% less power. The curve is nonlinear, tight execution converts directly into saved watts until you starve the GPU too aggressively.

The Comparison That Shouldn't Exist

Metric	RTX 3090 (llama.cpp)	M5 Max	RTX 3090 (Megakernel @220W)
tok/s	267	229	411
Power	350W	~130W	220W
tok/J	0.76	1.76	1.87
GPU price	~$700	$2,499+ (system)	~$700

A $700 GPU from 2020, power-limited to 220W, matches Apple's latest chip on efficiency while delivering 1.8x the throughput.

How it works

A single persistent CUDA kernel processes the entire Qwen 3.5-0.8B forward pass in one dispatch. No CPU round-trips between layers.

Architecture: Qwen 3.5-0.8B is a hybrid model, 18 DeltaNet layers (linear attention with learned recurrence) and 6 full attention layers, in a 3:1 ratio. DeltaNet scales linearly with context length vs. quadratic for standard attention. It's an emerging pattern in next-gen models (Qwen3-Next, Kimi Linear), but no framework had optimized kernels for it.

Kernel specs:

82 blocks, 512 threads, all SMs on the RTX 3090 kept occupied
BF16 weights and activations, FP32 accumulation where it matters
DeltaNet recurrence via warp-cooperative state updates in F32 registers
Full attention with online softmax (fused QKV, RoPE, causal mask, output projection)
Cooperative grid sync between layers instead of kernel launches (zero inter-layer overhead)
KV cache updates in-kernel
Weights loaded directly from HuggingFace

What traditional frameworks do: launch ~100 separate kernels per token, each one paying the cost of CPU dispatch, weight re-fetch, and thread synchronization. The megakernel eliminates all of that.

Why DeltaNet matters

Standard transformers have years of kernel optimization: FlashAttention, PagedAttention, continuous batching. Hybrid DeltaNet/Attention architectures are newer, and the kernel ecosystem is immature:

MLX: no native DeltaNet kernels
llama.cpp: generic DeltaNet support, no fusion
vLLM/SGLang: Triton kernels via flash-linear-attention, but no megakernel fusion

As more models go hybrid (and they will, because linear attention scales better), what you run them on matters less than how you run them. When you write a kernel that actually uses what the GPU offers, tensor cores, shared memory, cooperative grid launches, register-resident state, a five-year-old GPU matches Apple's latest chip.

Lessons from building this

grid.sync() inside loops will deadlock silently. We tried synchronizing all blocks within the per-token DeltaNet recurrence loop. No error message, just a hang. The fix: synchronize between layers, not within them.

Register pressure kills performance quietly. We attempted S_TILE=16 for more instruction-level parallelism. Silent crash, no CUDA error, registers spilled to local memory, performance collapsed. S_TILE=8 was the sweet spot.

The power curve is nonlinear. 420W to 300W lost almost nothing (99.8% speed). 300W to 220W lost marginally (95%). 220W to 150W collapsed to 45%. The megakernel's tight execution means the GPU hits its compute ceiling before its power ceiling, until you starve it too aggressively.

Quick start

git clone https://github.com/Luce-Org/luce-megakernel
cd luce-megakernel
pip install -e .
python bench_pp_tg.py    # runs pp520 tg128, prints tok/s and tok/J

Requirements:

NVIDIA GPU (Ampere+), tested on RTX 3090
CUDA 12+
PyTorch 2.0+
~1.5 GB VRAM for BF16 weights

Optional: Set a power limit to find your GPU's sweet spot:

sudo nvidia-smi -pl 220    # or whatever your target wattage

Files

File	Description
`kernel.cu`	Decode megakernel, all 24 layers in one dispatch
`prefill.cu`	Prefill (cuBLAS + standalone kernels)
`torch_bindings.cpp`	PyTorch C++ bindings
`model.py`	Weight loading + decoder
`setup.py`	Build configuration
`bench_pp_tg.py`	Benchmark (prefill + decode)
`RESULTS.md`	Full benchmark results and DVFS sweep

Scope and limitations

This is a research proof-of-concept, not a production inference server.

Batch size 1 only. This targets single-user local inference (the llama.cpp/Ollama use case), not multi-tenant serving. If you need batched throughput, use vLLM or SGLang.
Single model, single architecture. The kernel is hand-written for Qwen 3.5-0.8B's specific layer pattern (18 DeltaNet + 6 Attention). It does not generalize to other models without rewriting.
BF16 only. No quantization support (GGUF/GPTQ/AWQ). We benchmark at BF16 to isolate kernel-level efficiency from quantization tradeoffs.
0.8B parameters. This is a small model. Megakernel fusion benefits shrink as model size grows and compute begins to dominate over launch overhead. We chose 0.8B because it's the first hybrid DeltaNet model available, not because it's representative of all workloads.
Power methodology. Efficiency numbers measure accelerator power only (NVML for NVIDIA, powermetrics for Apple), following Hazy Research's Intelligence Per Watt methodology. Total system draw is higher for both platforms.
Correctness. The benchmark includes an end-to-end correctness check comparing megakernel output against a reference decode path. See bench_pp_tg.py.

The goal is to demonstrate that architecture-specific kernel fusion eliminates a real efficiency gap on consumer hardware, and to do it in the open so others can reproduce, critique, and extend the work.

Files

Questions, ideas, or want to see what others are building? Join the Luce Discord.

Citation

If you use this work in your research:

@software{luce_megakernel_2026,
  title  = {Luce Megakernel: Fused Forward Pass for Hybrid DeltaNet/Attention LLMs},
  author = {Luce},
  url    = {https://github.com/Luce-Org/luce-megakernel},
  year   = {2026}
}

Why llama.cpp as a baseline?

llama.cpp is the most widely used local inference engine. It's what most people actually run on consumer GPUs. We also include PyTorch HuggingFace numbers (3.8x slower) for a second reference point. This is not a critique of llama.cpp, it's an excellent project. The comparison shows what architecture-specific optimization can unlock on top of a generic framework.

Why an RTX 3090?

Deliberately chosen as the "worst case" for NVIDIA: a 2020 GPU, widely dismissed as power-hungry, available for ~$900-1,000 used. If the software gap is real on old hardware, it's even larger on newer cards.

Community

Questions, ideas, or want to see what others are building? Join the Luce Discord.

MIT · Lucebox

Built with Claude

Inspired by AlpinDale/qwen_megakernel, Infatoshi/MegaQwen

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
HANDOFF.md		HANDOFF.md
README.md		README.md
RESULTS.md		RESULTS.md
bench.py		bench.py
bench_pp_tg.py		bench_pp_tg.py
final_bench.py		final_bench.py
hero.png		hero.png
kernel.cu		kernel.cu
model.py		model.py
prefill.cu		prefill.cu
setup.py		setup.py
torch_bindings.cpp		torch_bindings.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Luce Megakernel

Why this exists

Results

Decode and Prefill (Qwen 3.5-0.8B)

Energy Efficiency (DVFS Power Sweep)

The Comparison That Shouldn't Exist

How it works

Why DeltaNet matters

Lessons from building this

Quick start

Files

Scope and limitations

Files

Citation

Why llama.cpp as a baseline?

Why an RTX 3090?

Community

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Luce Megakernel

Why this exists

Results

Decode and Prefill (Qwen 3.5-0.8B)

Energy Efficiency (DVFS Power Sweep)

The Comparison That Shouldn't Exist

How it works

Why DeltaNet matters

Lessons from building this

Quick start

Files

Scope and limitations

Files

Citation

Why llama.cpp as a baseline?

Why an RTX 3090?

Community

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages