Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100) #1481
Cayton-Tech wants to merge 3 commits into openai:main from
**Community Review — Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)**

**Compliance: LOOKS CLEAN** — pure-neural submission, no TTT/SLOT/n-gram-cache.

**Summary**

PR #1481 (

**Checks Performed**

- **N-gram / hash XOR bug** — Not present. No hash tables, XOR operations, or n-gram lookup structures exist anywhere in the file. The grep for
- **Pre-Quant TTT (multi-epoch AdamW on val_tokens)** — Not present.
- **Score-first-per-chunk TTT (PR #1413 pattern)** — Not present. No chunked inference loop, no
- **Scored-region SLOT** — Not present. No deferred or held scoring regions.

**What the submission actually does**

The key innovation is a low-rank embedding bottleneck (lines 669–671):

```python
self.embed_bottleneck = 64
self.tok_emb = nn.Embedding(vocab_size, self.embed_bottleneck)
self.embed_proj = nn.Linear(self.embed_bottleneck, model_dim, bias=False)
```

This replaces the standard full-rank embedding. Additional architectural features: U-Net style skip connections between encoder/decoder halves (lines 706–715), per-layer

The training loop (lines 977–1059) reads exclusively from

**Verdict: LOOKS CLEAN.** Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via an LLM agent (Sonnet) reviewing the full `train_gpt.py` source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
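For intuition on what the bottleneck buys, here is a minimal parameter-accounting sketch of the factorisation in the snippet above. `vocab_size = 1024` and `rank = 64` come from the PR; `model_dim = 768` is an assumed illustrative value, since the actual hidden size is not quoted in this thread.

```python
# Parameter count: standard full-rank embedding vs. the ALBERT-style
# factorisation (Embedding into a rank-r space, then a Linear up-projection).
vocab_size, model_dim, rank = 1024, 768, 64

full_params = vocab_size * model_dim                    # nn.Embedding(V, H)
factored_params = vocab_size * rank + rank * model_dim  # nn.Embedding(V, r) + nn.Linear(r, H)

print(full_params)      # 786432
print(factored_params)  # 114688
print(f"{full_params / factored_params:.2f}x smaller")  # 6.86x smaller
```

The saving scales with `V*H / (r*(V+H))`, which is why the technique was designed for vocabularies orders of magnitude larger than 1024 tokens.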
Ablation study testing ALBERT-style low-rank embedding factorisation (Lan et al., ICLR 2020) at the 16MB / 1024-token vocabulary scale. Tested rank=64 and rank=128 bottlenecks against the unmodified baseline on 1×H100 SXM.
Finding: negative result — low-rank embedding factorisation does not improve BPB at a 1024-token vocabulary. The vocabulary is too small for meaningful embedding redundancy to exist. Full ablation table, analysis, and implementation notes (including a fix for a tied-embedding dimension mismatch) are in the README.
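The negative result is about BPB, not parameter count: the factorisation does shrink the embedding at any rank below the break-even point `r < V*H / (V + H)`. A small hedged sketch, again assuming an illustrative `model_dim = 768` (not stated in this thread), contrasts the benchmark's 1024-token vocabulary with a GPT-2-scale one:

```python
def break_even_rank(vocab_size: int, model_dim: int) -> float:
    """Largest bottleneck rank at which V*r + r*H < V*H, i.e. the
    factorised embedding still has fewer parameters than the full one."""
    return vocab_size * model_dim / (vocab_size + model_dim)

# 1024-token vocabulary (this benchmark): break-even rank ~439,
# so both tested ranks (64, 128) save parameters comfortably.
print(f"{break_even_rank(1024, 768):.1f}")   # 438.9

# GPT-2-scale vocabulary, for contrast: break-even rank ~756.
print(f"{break_even_rank(50257, 768):.1f}")  # 756.4
```

Both tested ranks sit well under break-even, so the ablation's failure to improve BPB supports the redundancy explanation rather than a parameter-budget one.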