Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)#1486

Open
AlirezaAlampour wants to merge 1 commit into openai:main from AlirezaAlampour:submission/clean-pr

Conversation

@AlirezaAlampour

Non-Record Submission: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon

Track: non_record_16mb
Hardware: NVIDIA DGX Spark (1× GB10 Blackwell, 128GB unified memory)
Artifact size: ~8.8 MB (int8+zlib)

Results

Seed   val_bpb (roundtrip)   Steps   Artifact
42     1.6642                319     8,772,205 B
314    1.6694                321     8,785,874 B
999    1.6632                319     8,777,249 B
Mean   1.6656
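For reference, bits-per-byte can be derived from mean token-level cross-entropy as sketched below. This is a hedged illustration of the metric, not the exact accounting in train_gpt.py; the helper name and signature are hypothetical.

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) over a split to bits per byte.

    Hypothetical helper: total nats across the split, converted to bits,
    divided by the raw byte count of the evaluated text.
    """
    total_bits = mean_ce_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

With a BPE vocab, n_tokens differs from n_bytes, which is why BPB and per-token loss are not interchangeable.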

Techniques

  • U-Net skip connections with learned skip weights and per-block residual mixing
  • 9 transformer layers, d=512, 8 heads, 4 KV heads (GQA)
  • LeakyReLU² activation (negative_slope=0.5, squared)
  • Muon optimizer with Newton-Schulz orthogonalization + Adam for embeddings
  • Int8 per-row QAT with straight-through estimator
  • EMA weight averaging
  • zlib compression on serialized checkpoint
  • Logit softcap (tanh, cap=30), RoPE, tied embeddings
  • SentencePiece BPE vocab 1024, seq_len 1024
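The LeakyReLU² activation in the list above is simple to sketch. This is a numpy stand-in for the `F.leaky_relu(...).square()` call the community review cites at line 723; the function name here is illustrative.

```python
import numpy as np

def leaky_relu_squared(x: np.ndarray, negative_slope: float = 0.5) -> np.ndarray:
    # LeakyReLU with slope 0.5 on the negative side, then an elementwise
    # square -- an always-nonnegative MLP activation in which negative
    # pre-activations still contribute, with a gentler response.
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y
```

Compared with the ReLU² used in some speedrun baselines, the nonzero negative slope keeps gradient flowing through negative pre-activations.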

What makes this interesting

This submission was developed from a zero-ML background, using AI-assisted development across Claude, GPT, and Gemini. All training ran on a single NVIDIA DGX Spark (consumer Blackwell GB10 GPU with unified memory) — no H100 access. The DGX Spark runs at ~15s/step vs ~87ms/step on 8×H100, limiting us to ~320 steps in the 80-minute wallclock window.

The journey: baseline 9.69 BPB → hyperparameter tuning → LeakyReLU² → U-Net skip connections → QAT → EMA → 1.66 BPB across 30+ experiments over two weeks.

Files

  • train_gpt.py — self-contained training + eval (1279 lines)
  • submission.json — metadata and 3-seed results
  • README.md — full technique description
  • requirements.txt — dependencies
  • train_seed{42,314,999}.log — training logs

…park

9L/512d U-Net transformer with int8 per-row quantization, leaky ReLU squared
MLP, Muon optimizer, and EMA. Trained on single GB10 Blackwell GPU (DGX Spark)
with 80 min wallclock cap (~319 steps).

Results (int8+zlib roundtrip BPB):
  seed 42:  1.6642 | seed 314: 1.6694 | seed 999: 1.6632
  Artifact: ~8.77 MB (well under 16 MB limit)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Non-Record: U-Net Transformer + Int8 QAT + LeakyReLU² + Muon — 1.6656 BPB (DGX Spark)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

This submission is a pure-neural entry. No disqualifying patterns were found.

Checked patterns:

  1. N-gram family bug (target XOR'd into hash lookup key): No hash tables, XOR operations, or n-gram structures exist anywhere in the file. The model is a standard transformer (GPT) with GQA attention, RoPE, RMSNorm, and a UNet-style skip-connection layout between encoder/decoder halves.

  2. Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first): No test-time training of any kind is present. val_tokens is only ever read inside eval_val() (lines 235–294), which runs in torch.inference_mode() with model.eval() — no gradients, no optimizer steps, no backward passes on validation data.

  3. Score-first-per-chunk TTT (PR Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean) #1413 pattern): Not present. There is no chunking of val_tokens for adaptation, no is_last_chunk guard, and no online weight updates during evaluation.

  4. HOLD / Scored-region SLOT: No scored-region slot or special region bypass detected.

What is present (all legal):

  • Standard Muon + Adam optimizer split for training (lines 979–1003)
  • QAT via CastedLinear._qat_state with optional late-QAT trigger (lines 1125–1127) — this is training-time quantization aware training, not inference-time adaptation
  • EMA support (lines 1078–1081, 1155–1159) — standard exponential moving average on training weights
  • Post-training quantization: int8 (per-row) or int6+LZMA (lines 1199–1270) — applied after training ends, before final eval roundtrip
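The post-training int8 path described above can be sketched as per-row absmax quantization plus zlib. This is an illustrative numpy sketch, not the PR's actual serializer (lines 1199–1270); function names are hypothetical.

```python
import zlib
import numpy as np

def quantize_rows_int8(w: np.ndarray):
    # One fp32 scale per row (absmax / 127); weights rounded to int8.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_rows(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_rows_int8(w)
blob = zlib.compress(q.tobytes() + s.tobytes())  # what lands in the artifact
w_hat = dequantize_rows(q, s)                    # what eval actually sees
```

The reported val_bpb is a roundtrip number: the model is scored on the dequantized weights `w_hat`, not the fp32 training weights, so the artifact and the score are consistent.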

The architecture is a UNet-shaped transformer (encoder layers store skip connections, decoder layers consume them in reverse) with LeakyReLU^2 MLP activation (line 723: F.leaky_relu(...).square()), Int8/Int6 QAT, and Muon optimizer. All training happens exclusively on train tokens; validation is inference-only.
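For readers unfamiliar with Muon, its orthogonalization step can be sketched as the quintic Newton-Schulz iteration from the public Muon reference implementation (coefficients 3.4445, -4.7750, 2.0315). This is an illustration of the technique, not the PR's exact code.

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5,
                                eps: float = 1e-7) -> np.ndarray:
    # Approximately orthogonalize a gradient matrix: normalize so all
    # singular values are <= 1, then iterate an odd quintic polynomial
    # that pushes every singular value toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix A small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

These coefficients trade exact convergence for a steep slope near zero, so even tiny singular values are inflated toward 1 within five steps; the output's singular values land near (but not exactly at) 1.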

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit performed by an LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
