
Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)#1481

Open
Cayton-Tech wants to merge 3 commits into openai:main from Cayton-Tech:main

Conversation

@Cayton-Tech

Ablation study testing ALBERT-style low-rank embedding factorisation (Lan et al., ICLR 2020) at the 16MB / 1024-token vocabulary scale. Tested rank=64 and rank=128 bottlenecks against the unmodified baseline on 1×H100 SXM.

Finding: Negative result — low-rank embedding does not improve BPB at 1024-token vocabulary. The vocabulary is too small for meaningful embedding redundancy to exist. Full ablation table, analysis, and implementation notes (including a tied embedding dimension mismatch fix) in the README.

@MatoTeziTanka

Community Review — Non-record: ALBERT-Style Low-Rank Embedding Factorisation (ablation study, 1×H100)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Summary

PR #1481 (2026-03-29_LowRankEmbed_CaytonTech) is a clean pure-neural submission. No illegal patterns were found.

Checks Performed

N-gram / hash XOR bug

Not present. No hash tables, XOR operations, or n-gram lookup structures exist anywhere in the file; a grep for hash, xor, ngram, and n.gram returned zero hits.

Pre-Quant TTT (multi-epoch AdamW on val_tokens)

Not present. val_tokens is used exclusively inside eval_val() (lines 219–278) under torch.inference_mode() with model.eval(). No optimizer touches val_tokens, no gradients are computed on it.
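For reference, the read-only validation pattern described above (val_tokens only touched under torch.inference_mode() with model.eval(), no optimizer, no gradients) looks roughly like this. This is a minimal sketch, not the submission's actual eval_val() — the chunking, loss reduction, and block size are assumptions:

```python
import torch
import torch.nn.functional as F

def eval_val(model, val_tokens, block=256):
    # Read-only validation: no optimizer ever sees val_tokens,
    # and inference_mode guarantees no gradients are recorded.
    model.eval()
    total_loss, n = 0.0, 0
    with torch.inference_mode():
        for i in range(0, val_tokens.numel() - 1, block):
            x = val_tokens[i:i + block].unsqueeze(0)       # inputs
            y = val_tokens[i + 1:i + 1 + block].unsqueeze(0)  # next-token targets
            if x.size(1) != y.size(1):
                break  # drop the ragged tail chunk
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            total_loss += loss.item() * y.numel()
            n += y.numel()
    return total_loss / n  # mean NLL in nats; divide by ln(2) for bits
```

Any TTT violation would have to mutate model parameters inside this loop, which the audit confirms does not happen.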

Score-first-per-chunk TTT (PR #1413 pattern)

Not present. No chunked inference loop, no is_last_chunk guard, no TTT of any kind.

Scored-region SLOT

Not present. No deferred or held scoring regions.

What the submission actually does

The key innovation is a low-rank embedding bottleneck (lines 669–671):

self.embed_bottleneck = 64                                          # factorization rank r
self.tok_emb = nn.Embedding(vocab_size, self.embed_bottleneck)      # vocab_size × r table
self.embed_proj = nn.Linear(self.embed_bottleneck, model_dim, bias=False)  # r × model_dim up-projection

This replaces the standard vocab_size × model_dim embedding with a rank-64 factorization (vocab_size × 64 + 64 × model_dim), dramatically reducing parameter count for the embedding table. The embed_proj weight is reused as the tied LM head projection (line 721: F.linear(x, self.embed_proj.weight.t())).
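Put together, the factorized embedding plus tied head can be sketched as a standalone module. This is an illustrative reconstruction from the quoted lines, not the submission's code; the class name, method names, and the final tok_emb-tied projection to vocab logits are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankEmbedding(nn.Module):
    """ALBERT-style rank-r embedding factorization with a tied LM head."""

    def __init__(self, vocab_size: int, model_dim: int, rank: int = 64):
        super().__init__()
        # vocab_size × r + r × model_dim parameters instead of vocab_size × model_dim
        self.tok_emb = nn.Embedding(vocab_size, rank)
        self.embed_proj = nn.Linear(rank, model_dim, bias=False)

    def embed(self, idx: torch.Tensor) -> torch.Tensor:
        # token ids -> rank-r codes -> model_dim activations
        return self.embed_proj(self.tok_emb(idx))

    def logits(self, x: torch.Tensor) -> torch.Tensor:
        # Tied head: reuse embed_proj to map model_dim back down to rank r
        # (this is the F.linear(x, self.embed_proj.weight.t()) step the review cites),
        # then score against the embedding table to get vocab logits.
        h = F.linear(x, self.embed_proj.weight.t())  # (..., rank)
        return F.linear(h, self.tok_emb.weight)      # (..., vocab_size)
```

With the PR's vocab_size=1024 and a hypothetical model_dim=768, rank=64 gives 1024·64 + 64·768 = 114,688 embedding parameters versus 786,432 for the full table, roughly a 6.9× reduction.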

Additional architectural features: U-Net style skip connections between encoder/decoder halves (lines 706–715), per-layer resid_mix gating, GQA attention, logit softcapping, and a Muon + Adam optimizer split. All are standard, legal architectural choices.
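Of the features listed above, logit softcapping is compact enough to show; a one-line sketch (the cap value 15.0 is an assumption, not taken from the submission):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly bound logits to (-cap, cap); near-identity for small values,
    # saturating for large ones, which keeps extreme logits from dominating.
    return cap * torch.tanh(logits / cap)
```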

The training loop (lines 977–1059) reads exclusively from train_files, never from val_files during gradient computation. Validation is read-only inference.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

