
Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)#1481

Open
Cayton-Tech wants to merge 3 commits into openai:main from Cayton-Tech:main

Conversation

@Cayton-Tech

Ablation study testing ALBERT-style low-rank embedding factorisation (Lan et al., ICLR 2020) at the 16MB / 1024-token vocabulary scale. Tested rank=64 and rank=128 bottlenecks against the unmodified baseline on 1×H100 SXM.

Finding: Negative result — low-rank embedding does not improve BPB at 1024-token vocabulary. The vocabulary is too small for meaningful embedding redundancy to exist. Full ablation table, analysis, and implementation notes (including a tied embedding dimension mismatch fix) in the README.

@MatoTeziTanka

Community Review — Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1481 (2026-03-29_LowRankEmbed_CaytonTech) is a clean pure-neural submission. No illegal patterns were found.

Checks Performed

N-gram / hash XOR bug

Not present. No hash tables, XOR operations, or n-gram lookup structures exist anywhere in the file; a grep for hash, xor, ngram, and n.gram returned zero hits.

Pre-Quant TTT (multi-epoch AdamW on val_tokens)

Not present. val_tokens is used exclusively inside eval_val() (lines 219–278) under torch.inference_mode() with model.eval(). No optimizer touches val_tokens, no gradients are computed on it.
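For reference, the read-only validation pattern described above (val_tokens only touched under torch.inference_mode() with model.eval(), no optimizer, no gradients) looks roughly like this. This is a minimal sketch, not the submission's actual eval_val() — the chunking, loss reduction, and block size are assumptions:

```python
import torch
import torch.nn.functional as F

def eval_val(model, val_tokens, block=256):
    # Read-only validation: no optimizer ever sees val_tokens,
    # and inference_mode guarantees no gradients are recorded.
    model.eval()
    total_loss, n = 0.0, 0
    with torch.inference_mode():
        for i in range(0, val_tokens.numel() - 1, block):
            x = val_tokens[i:i + block].unsqueeze(0)       # inputs
            y = val_tokens[i + 1:i + 1 + block].unsqueeze(0)  # next-token targets
            if x.size(1) != y.size(1):
                break  # drop the ragged tail chunk
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * y.numel()
            n += y.numel()
    return total_loss / n  # mean NLL in nats; divide by ln(2) for bits
```

Any TTT violation would have to mutate model parameters inside this loop, which the audit confirms does not happen.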

Score-first-per-chunk TTT (PR #1413 pattern)

Not present. No chunked inference loop, no is_last_chunk guard, no TTT of any kind.

Scored-region SLOT

Not present. No deferred or held scoring regions.

What the submission actually does

The key innovation is a low-rank embedding bottleneck (lines 669–671):

self.embed_bottleneck = 64                                          # factorization rank r
self.tok_emb = nn.Embedding(vocab_size, self.embed_bottleneck)      # vocab_size × r table
self.embed_proj = nn.Linear(self.embed_bottleneck, model_dim, bias=False)  # r × model_dim up-projection

This replaces the standard vocab_size × model_dim embedding with a rank-64 factorization (vocab_size × 64 + 64 × model_dim), dramatically reducing parameter count for the embedding table. The embed_proj weight is reused as the tied LM head projection (line 721: F.linear(x, self.embed_proj.weight.t())).
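Put together, the factorized embedding plus tied head can be sketched as a standalone module. This is an illustrative reconstruction from the quoted lines, not the submission's code; the class name, method names, and the final tok_emb-tied projection to vocab logits are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankEmbedding(nn.Module):
    """ALBERT-style rank-r embedding factorization with a tied LM head."""

    def __init__(self, vocab_size: int, model_dim: int, rank: int = 64):
        super().__init__()
        # vocab_size × r + r × model_dim parameters instead of vocab_size × model_dim
        self.tok_emb = nn.Embedding(vocab_size, rank)
        self.embed_proj = nn.Linear(rank, model_dim, bias=False)

    def embed(self, idx: torch.Tensor) -> torch.Tensor:
        # token ids -> rank-r codes -> model_dim activations
        return self.embed_proj(self.tok_emb(idx))

    def logits(self, x: torch.Tensor) -> torch.Tensor:
        # Tied head: reuse embed_proj to map model_dim back down to rank r
        # (this is the F.linear(x, self.embed_proj.weight.t()) step the review cites),
        # then score against the embedding table to get vocab logits.
        h = F.linear(x, self.embed_proj.weight.t())  # (..., rank)
        return F.linear(h, self.tok_emb.weight)      # (..., vocab_size)
```

With the PR's vocab_size=1024 and a hypothetical model_dim=768, rank=64 gives 1024·64 + 64·768 = 114,688 embedding parameters versus 786,432 for the full table, roughly a 6.9× reduction.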

Additional architectural features: U-Net style skip connections between encoder/decoder halves (lines 706–715), per-layer resid_mix gating, GQA attention, logit softcapping, and a Muon + Adam optimizer split. All are standard, legal architectural choices.
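Of the features listed above, logit softcapping is compact enough to show; a one-line sketch (the cap value 15.0 is an assumption, not taken from the submission):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly bound logits to (-cap, cap); near-identity for small values,
    # saturating for large ones, which keeps extreme logits from dominating.
    return cap * torch.tanh(logits / cap)
```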

The training loop (lines 977–1059) reads exclusively from train_files, never from val_files during gradient computation. Validation is read-only inference.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

