Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark) #1486
AlirezaAlampour wants to merge 1 commit into openai:main
Conversation
…park 9L/512d U-Net transformer with int8 per-row quantization, LeakyReLU² MLP activation, Muon optimizer, and EMA. Trained on a single GB10 Blackwell GPU (DGX Spark) under an 80-minute wallclock cap (~319 steps).

Results (int8+zlib roundtrip BPB): seed 42: 1.6642 | seed 314: 1.6694 | seed 999: 1.6632

Artifact: ~8.77 MB (well under the 16 MB limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
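The int8-plus-zlib roundtrip used for artifact sizing can be sketched as below. This is a minimal NumPy sketch under assumptions: `quantize_per_row` and `dequantize` are illustrative names, not functions from train_gpt.py, and the actual packing of scales in the submission may differ.

```python
import zlib
import numpy as np

def quantize_per_row(w):
    """Quantize a 2-D float matrix to int8 with one max-abs scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    """Reconstruct an approximate float matrix from int8 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)

q, scale = quantize_per_row(w)
blob = zlib.compress(q.tobytes(), level=9)  # compressed payload for this tensor
w_hat = dequantize(q, scale)

print(f"fp32: {w.nbytes} B, int8+zlib: {len(blob) + scale.nbytes} B")
print(f"max roundtrip error: {np.abs(w - w_hat).max():.4f}")
```

The roundtrip error per element is bounded by half a quantization step, which is what makes evaluating the dequantized weights (rather than the training weights) the honest way to report BPB.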
Community Review

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache. This is a pure-neural entry; no disqualifying patterns were found.

Checked patterns:
What is present (all legal):
The architecture is a U-Net-shaped transformer (encoder layers store skip connections, decoder layers consume them in reverse) with LeakyReLU² MLP activation (line 723: …).

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
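The skip wiring the review describes (encoder layers store activations, decoder layers consume them in reverse) can be sketched minimally. The `layer` function here is a placeholder residual map, not the submission's actual transformer block:

```python
import numpy as np

def layer(x, w):
    # Placeholder for a transformer block: a residual nonlinear map.
    return x + np.tanh(x @ w)

def unet_transformer(x, enc_weights, dec_weights):
    """Encoder layers push their outputs onto a stack of skips;
    decoder layers pop them in reverse (last-in, first-out) and
    add them back before running their own block."""
    skips = []
    for w in enc_weights:
        x = layer(x, w)
        skips.append(x)        # store skip connection
    for w in dec_weights:
        x = x + skips.pop()    # consume skips in reverse order
        x = layer(x, w)
    return x

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
enc = [0.1 * rng.normal(size=(d, d)) for _ in range(4)]
dec = [0.1 * rng.normal(size=(d, d)) for _ in range(4)]
y = unet_transformer(x, enc, dec)
```

The LIFO pairing is what makes the shape "U-Net-like": the first encoder layer's output reaches the last decoder layer, giving shallow features a short path to the output.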
Non-Record Submission: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon
Track: non_record_16mb
Hardware: NVIDIA DGX Spark (1× GB10 Blackwell, 128GB unified memory)
Artifact size: ~8.8 MB (int8+zlib)
Results

Int8+zlib roundtrip BPB over three seeds: 1.6642 (seed 42) | 1.6694 (seed 314) | 1.6632 (seed 999); mean 1.6656.
Techniques

- 9L/512d U-Net transformer (encoder skips consumed by decoder in reverse)
- Int8 per-row quantization-aware training
- LeakyReLU² MLP activation
- Muon optimizer
- EMA of weights
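One plausible reading of the LeakyReLU² activation, sketched below in NumPy. This is an assumption: the exact form in train_gpt.py may differ (for example, a plain square of the LeakyReLU output without the sign flip on the negative branch):

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    """Square the LeakyReLU output, keeping the sign of the negative
    branch so negative inputs are not folded onto the positives."""
    y = np.where(x >= 0.0, x, slope * x)
    return np.where(x >= 0.0, y * y, -(y * y))

print(leaky_relu_squared(np.array([3.0, -10.0])))  # 9.0 and -0.01
```

Squared-ReLU-family activations are a common speedrun choice because the square sharpens the positive branch while staying cheap; the leaky negative branch keeps a small gradient alive for negative pre-activations.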
What makes this interesting
This was developed from a zero-ML background using AI-assisted development across Claude, GPT, and Gemini. All training ran on a single NVIDIA DGX Spark (consumer Blackwell GB10 GPU with unified memory); no H100 access. The DGX Spark runs at ~15 s/step vs ~87 ms/step on 8×H100, limiting us to ~320 steps in the 80-minute wallclock window.
The journey: baseline 9.69 BPB → hyperparameter tuning → LeakyReLU² → U-Net skip connections → QAT → EMA → 1.66 BPB across 30+ experiments over two weeks.
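The EMA step in that progression admits a standard sketch; the decay value below is illustrative, and typically it is the EMA copy, not the raw training weights, that gets quantized and evaluated:

```python
import numpy as np

class WeightEMA:
    """Exponential moving average of model weights."""

    def __init__(self, params, decay=0.999):  # decay value is illustrative
        self.decay = decay
        self.shadow = [p.copy() for p in params]

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * params, in place
        for s, p in zip(self.shadow, params):
            s *= self.decay
            s += (1.0 - self.decay) * p
```

At ~320 steps the effective averaging window matters: with decay 0.999 the EMA barely moves, so a short-horizon run would plausibly use a smaller decay than a full-length one.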
Files
- `train_gpt.py` — self-contained training + eval (1279 lines)
- `submission.json` — metadata and 3-seed results
- `README.md` — full technique description
- `requirements.txt` — dependencies
- `train_seed{42,314,999}.log` — training logs