ICU: per-item zstd compression of libicudata#237
Conversation
Preview Builds
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a Node.js compression tool and Docker build steps to repack ICU common-data with per-item zstd compression, a hot-item whitelist, and a weak runtime decompression hook applied by patching ICU's udata loader. ChangesICU data compression build and runtime
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
icu/compress-data.ts (1)
305-308: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueConsider documenting the +4 compression threshold.
The
+ 4inz.length + 4 < raw.lengthserves as a minimum savings margin before choosing compression. A brief inline comment would help future maintainers understand this heuristic (e.g., decompression overhead, frame metadata, or simply a margin to ensure worthwhile savings).📝 Suggested documentation
if (raw.length >= 64 && !isHot(bare)) { const z = compressFile(path, dictPath, ZSTD_LEVEL, tmpOut); + // Require at least 4 bytes savings to justify runtime decompression cost. if (z.length + 4 < raw.length) { body = z; comp++; } else kept++; } else kept++;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@icu/compress-data.ts` around lines 305 - 308, The comparison using "z.length + 4 < raw.length" in the compress decision (inside the branch that calls compressFile(path, dictPath, ZSTD_LEVEL, tmpOut)) is unclear; add a short inline comment next to that condition explaining the "+4" margin (e.g., to account for decompression/frame overhead and ensure net size savings) so maintainers understand why a compressed result must be at least 4 bytes smaller before assigning body = z and incrementing comp (otherwise kept++). Keep the comment concise and reference compressFile, z.length, raw.length and the variables body/comp/kept so readers can locate the logic easily.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@icu/compress-data.ts`:
- Around line 305-308: The comparison using "z.length + 4 < raw.length" in the
compress decision (inside the branch that calls compressFile(path, dictPath,
ZSTD_LEVEL, tmpOut)) is unclear; add a short inline comment next to that
condition explaining the "+4" margin (e.g., to account for decompression/frame
overhead and ensure net size savings) so maintainers understand why a compressed
result must be at least 4 bytes smaller before assigning body = z and
incrementing comp (otherwise kept++). Keep the comment concise and reference
compressFile, z.length, raw.length and the variables body/comp/kept so readers
can locate the logic easily.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 25a31d03-0356-4f46-9be6-d3261748404f
📒 Files selected for processing (1)
icu/compress-data.ts
|
Done in $(git rev-parse --short HEAD) — named |
Runtime side of oven-sh/WebKit#237 (now merged as `782504c968e2`). That change repacks ICU's `libicudata.a` with most items per-item zstd-compressed and patches `udata.cpp` to call `bun_icu_maybe_decompress` between TOC lookup and header validation. This PR provides that function, the test coverage, and bumps `WEBKIT_VERSION`. ## What's in this PR - **`src/jsc/bindings/bun_icu_decompress.cpp`** — `Bun::ICUDecompressor` singleton (`LazyNeverDestroyed` + `std::call_once`, `WTF::Lock`/`HashMap`). The hook checks `ZSTD_MAGICNUMBER` and returns immediately for raw items; zstd frames are decompressed once with the shared `DDict` into 16-aligned mimalloc and cached by `.rodata` address. Linux-only (`#if OS(LINUX)`). - **`test/js/web/intl/`** — 31 tests / 5,902 assertions. Snapshot tests for 12 locales × every `Intl.*` API (captured against unmodified ICU); structural sweep over all locales × 5 trees; plus `icu-locales.txt` (the full locale list). - **`scripts/build/deps/webkit.ts`** — `WEBKIT_VERSION` → `782504c968e2`; fix `prebuiltDestDir` cache key for `autobuild-*` tags. ## Runtime cost Nothing is removed — same ICU code, same data, just stored compressed and decoded on first read, then cached for the process lifetime. | | Before | After | | |---|--:|--:|--:| | **Binary size** (linux-x64, stripped) | 83.8 MB | 72.2 MB | **−11.5 MB** | | First `new Intl.NumberFormat("ru", {style:"unit"})` | 1.00 ms | 1.30 ms | +0.30 ms once | | First i18n library init for one non-en locale | 0.59 ms | 0.82 ms | +0.23 ms once | | First time every Intl API × every locale is used | 196.5 ms | 218.0 ms | +21.4 ms once | | Any of the above, second time onward | 185.9 ms | 184.0 ms | −1.9 ms | `Intl.*("en")`, `Date.toString()`, URL parsing, `String.normalize`, regex `\p{}` are unchanged (kept uncompressed). All `Intl` outputs byte-identical — the snapshots are the unmodified-ICU reference. No popular i18n package iterates more than the app's own locale at runtime ([luxon](https://github.com/moment/luxon/blob/master/src/impl/locale.js), [@formatjs/intl](https://formatjs.github.io/docs/intl/), i18next, dayjs, date-fns all checked); the every-locale row is the absolute upper bound, not a realistic workload. *30 fresh processes per binary, median; baseline = `autobuild-0d85951a`, this PR = `autobuild-782504c968e2`.* ## Memory Total RSS is flat. Up to ~11 MB shifts from evictable `.rodata` to pinned heap as items are decompressed — reachable only via `DisplayNames`/`unit`/`timeZoneName`/`currencyDisplay:"name"` across many locales. ## Scope Linux glibc + musl only. Windows (`build-icu.ps1`, ICU 73.2), FreeBSD, Android Dockerfiles, and macOS (system `libicucore`) are untouched. The hook compiles everywhere but is dead code where the prebuilt's `udata.cpp` isn't patched. --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Compresses ICU's five display-name trees (
curr/ lang/ region/ unit/ zone/, non-en) per-item with zstd and adds a two-line hook inudata.cppso Bun decompresses on first lookup. Everything else — collation, segmentation, locale format patterns, properties, normalization, tz rules — stays raw, soIntl.Collator/Segmenter/DateTimeFormat/NumberFormat(default),Date, URL IDNA,String.normalize, regex\p{}pay zero decompression in any locale.What changes
icu/udata-decompress-hook.patch— applied after extracting the ICU tarball. Adds a weakextern "C"call between TOC lookup andcheckDataItem; null in ICU's own tools, defined by Bun at link time.icu/compress-data.ts— runs after the existingicupkgfilter. Uses ICU's ownicupkg -l/-xto read the package (no manual format parsing), trains a 128 KB zstd dictionary, compresses each item not inicu/hot-items.txtwith thezstdCLI, writes the package back (UDataOffsetTOC— the one hand-rolled bit, sinceicupkg -arejects non-ICU item bodies), and emitslibicudata.awith the package + dict as.rodatasymbols. Node stdlib +util.parseArgs; runs under Node's native type-stripping.icu/hot-items.txt— everything except non-en display-name items. 1,655 raw / 2,115 compressed; largest compressed item 79 KB.Dockerfile,Dockerfile.musl— installzstd+ Node, apply the patch, run the repacker.Bun-side companion
oven-sh/bun#31200 —
Bun::ICUDecompressorsingleton (the hook) +test/js/web/intl/(30 tests, ~5,900 assertions: snapshots for 12 locales × everyIntlAPI captured against unmodifiedlibicudata.a, plus an exhaustive sweep that loads every compressed item). All pass identically on baseline, compressed-release, and compressed-LTO builds.Measured (real Bun release binaries, hyperfine 50 runs)
bun --versionnew Date().toString()Intl.DateTimeFormat("ja")Intl.Collator("zh")Intl.Segmenter("zh", word)Intl.DisplayNames("ko").of("US")NumberFormat("ru", {style:"unit"})DateTimeFormat("ja", {timeZoneName:"long"})All
Intloutputs byte-identical to baseline (中文\|分词\|测试,미국,1.234,56 €,5 км, …). Regressions are first-call-only, then cached.Memory (
/proc/self/status, 102 locales × 5 trees)RssAnon(pinned)RssFile(evictable)Total resident is flat; up to ~11.3 MB shifts evictable→pinned, reachable only via
DisplayNames/unit/timeZoneName/currencyDisplay:"name"across all ~500 locales. No refcounting (would re-decompress on GC churn).Linux-only / not in this PR
Dockerfile(glibc) andDockerfile.muslonly.build-icu.ps1(Windows, also still on ICU 73.2),Dockerfile.android,Dockerfile.freebsdare untouched; macOS uses systemlibicucore. LTO validated (weak symbol resolves; baseline DCE's the unreachable hook).