Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)#1486

Open
AlirezaAlampour wants to merge 1 commit into openai:main from AlirezaAlampour:submission/clean-pr

Conversation

@AlirezaAlampour

Non-Record Submission: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon

Track: non_record_16mb
Hardware: NVIDIA DGX Spark (1× GB10 Blackwell, 128GB unified memory)
Artifact size: ~8.8 MB (int8+zlib)

Results

Seed   val_bpb (roundtrip)   Steps   Artifact
42     1.6642                319     8,772,205 B
314    1.6694                321     8,785,874 B
999    1.6632                319     8,777,249 B
Mean   1.6656
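For reference, bits-per-byte can be derived from mean token-level cross-entropy as sketched below. This is a hedged illustration of the metric, not the exact accounting in train_gpt.py; the helper name and signature are hypothetical.

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) over a split to bits per byte.

    Hypothetical helper: total nats across the split, converted to bits,
    divided by the raw byte count of the evaluated text.
    """
    total_bits = mean_ce_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

With a BPE vocab, n_tokens differs from n_bytes, which is why BPB and per-token loss are not interchangeable.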

Techniques

  • U-Net skip connections with learned skip weights and per-block residual mixing
  • 9 transformer layers, d=512, 8 heads, 4 KV heads (GQA)
  • LeakyReLU² activation (negative_slope=0.5, squared)
  • Muon optimizer with Newton-Schulz orthogonalization + Adam for embeddings
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging
  • zlib compression on serialized checkpoint
  • Logit softcap (tanh, cap=30), RoPE, tied embeddings
  • SentencePiece BPE vocab 1024, seq_len 1024
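The LeakyReLU² activation in the list above is simple to sketch. This is a numpy stand-in for the `F.leaky_relu(...).square()` call the community review cites at line 723; the function name here is illustrative.

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    # LeakyReLU with slope 0.5 on the negative side, then an elementwise
    # square -- an always-nonnegative MLP activation in which negative
    # pre-activations still contribute, with a gentler response.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Compared with the ReLU² used in some speedrun baselines, the nonzero negative slope keeps gradient flowing through negative pre-activations.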

What makes this interesting

This submission was developed from a zero-ML background, using AI-assisted development across Claude, GPT, and Gemini. All training ran on a single NVIDIA DGX Spark (consumer Blackwell GB10 GPU with unified memory) — no H100 access. The DGX Spark runs at ~15s/step vs ~87ms/step on 8×H100, limiting us to ~320 steps in the 80-minute wallclock window.

The journey: baseline 9.69 BPB → hyperparameter tuning → LeakyReLU² → U-Net skip connections → QAT → EMA → 1.66 BPB across 30+ experiments over two weeks.

Files

  • train_gpt.py — self-contained training + eval (1279 lines)
  • submission.json — metadata and 3-seed results
  • README.md — full technique description
  • requirements.txt — dependencies
  • train_seed{42,314,999}.log — training logs

…park

9L/512d U-Net transformer with int8 per-row quantization, leaky ReLU squared
MLP, Muon optimizer, and EMA. Trained on single GB10 Blackwell GPU (DGX Spark)
with 80 min wallclock cap (~319 steps).

Results (int8+zlib roundtrip BPB):
  seed 42:  1.6642 | seed 314: 1.6694 | seed 999: 1.6632
  Artifact: ~8.77 MB (well under 16 MB limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

This submission is a pure-neural entry. No disqualifying patterns were found.

Checked patterns:

  1. N-gram family bug (target XOR'd into hash lookup key): No hash tables, XOR operations, or n-gram structures exist anywhere in the file. The model is a standard transformer (GPT) with GQA attention, RoPE, RMSNorm, and a UNet-style skip-connection layout between encoder/decoder halves.

  2. Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first): No test-time training of any kind is present. val_tokens is only ever read inside eval_val() (lines 235–294), which runs in torch.inference_mode() with model.eval() — no gradients, no optimizer steps, no backward passes on validation data.

  3. Score-first-per-chunk TTT (PR Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean) #1413 pattern): Not present. There is no chunking of val_tokens for adaptation, no is_last_chunk guard, and no online weight updates during evaluation.

  4. HOLD / Scored-region SLOT: No scored-region slot or special region bypass detected.

What is present (all legal):

  • Standard Muon + Adam optimizer split for training (lines 979–1003)
  • QAT via CastedLinear._qat_state with optional late-QAT trigger (lines 1125–1127) — this is training-time quantization aware training, not inference-time adaptation
  • EMA support (lines 1078–1081, 1155–1159) — standard exponential moving average on training weights
  • Post-training quantization: int8 (per-row) or int6+LZMA (lines 1199–1270) — applied after training ends, before final eval roundtrip
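The post-training int8 path described above can be sketched as per-row absmax quantization plus zlib. This is an illustrative numpy sketch, not the PR's actual serializer (lines 1199–1270); function names are hypothetical.

```python
import zlib
import numpy as np

def quantize_rows_int8(w: np.ndarray):
    # One fp32 scale per row (absmax / 127); weights rounded to int8.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_rows(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_rows_int8(w)
blob = zlib.compress(q.tobytes() + s.tobytes())  # what lands in the artifact
w_hat = dequantize_rows(q, s)                    # what eval actually sees
```

The reported val_bpb is a roundtrip number: the model is scored on the dequantized weights `w_hat`, not the fp32 training weights, so the artifact and the score are consistent.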

The architecture is a UNet-shaped transformer (encoder layers store skip connections, decoder layers consume them in reverse) with LeakyReLU^2 MLP activation (line 723: F.leaky_relu(...).square()), Int8/Int6 QAT, and Muon optimizer. All training happens exclusively on train tokens; validation is inference-only.
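For readers unfamiliar with Muon, its orthogonalization step can be sketched as the quintic Newton-Schulz iteration from the public Muon reference implementation (coefficients 3.4445, -4.7750, 2.0315). This is an illustration of the technique, not the PR's exact code.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5,
                                eps: float = 1e-7) -> np.ndarray:
    # Approximately orthogonalize a gradient matrix: normalize so all
    # singular values are <= 1, then iterate an odd quintic polynomial
    # that pushes every singular value toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix A small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

These coefficients trade exact convergence for a steep slope near zero, so even tiny singular values are inflated toward 1 within five steps; the output's singular values land near (but not exactly at) 1.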

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit performed by an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
