ICU: per-item zstd compression of libicudata#237

Merged
dylan-conway merged 11 commits into main from claude/icu-compress-data
May 23, 2026

@dylan-conway
Member

@dylan-conway dylan-conway commented May 22, 2026

Compresses ICU's five display-name trees (curr/ lang/ region/ unit/ zone/, non-en) per-item with zstd and adds a two-line hook in udata.cpp so Bun decompresses on first lookup. Everything else — collation, segmentation, locale format patterns, properties, normalization, tz rules — stays raw, so Intl.Collator/Segmenter/DateTimeFormat/NumberFormat (default), Date, URL IDNA, String.normalize, regex \p{} pay zero decompression in any locale.

What changes

  • icu/udata-decompress-hook.patch — applied after extracting the ICU tarball. Adds a weak extern "C" call between TOC lookup and checkDataItem; null in ICU's own tools, defined by Bun at link time.
  • icu/compress-data.ts — runs after the existing icupkg filter. Uses ICU's own icupkg -l/-x to read the package (no manual format parsing), trains a 128 KB zstd dictionary, compresses each item not in icu/hot-items.txt with the zstd CLI, writes the package back (UDataOffsetTOC — the one hand-rolled bit, since icupkg -a rejects non-ICU item bodies), and emits libicudata.a with the package + dict as .rodata symbols. Node stdlib + util.parseArgs; runs under Node's native type-stripping.
  • icu/hot-items.txt — everything except non-en display-name items. 1,655 raw / 2,115 compressed; largest compressed item 79 KB.
  • Dockerfile, Dockerfile.musl — install zstd + Node, apply the patch, run the repacker.
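
The "writes the package back" step above can be sketched roughly as follows. This is a minimal illustration, not the code in icu/compress-data.ts: it assumes the little-endian layout implied by the `icudt75l` name, assumes TOC offsets are relative to the start of the TOC, and leaves out the DataHeader that precedes the TOC in a real .dat file. `PackItem` and `buildOffsetToc` are hypothetical names.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical item shape: a package-relative name plus its raw or compressed body.
interface PackItem {
  name: string;
  body: Buffer;
}

const ALIGN = 16;
const alignUp = (n: number): number => (n + ALIGN - 1) & ~(ALIGN - 1);

// Layout: uint32 count, count x { uint32 nameOffset, uint32 dataOffset },
// NUL-terminated names, then item bodies on 16-byte boundaries.
// Assumption: every offset is relative to the start of this buffer (the TOC).
function buildOffsetToc(items: PackItem[]): Buffer {
  // Lookup is a binary search over names, so entries are sorted bytewise.
  const sorted = [...items].sort((a, b) =>
    Buffer.compare(Buffer.from(a.name, "latin1"), Buffer.from(b.name, "latin1")),
  );

  const tocSize = 4 + sorted.length * 8;
  const nameOffsets: number[] = [];
  let cursor = tocSize;
  for (const item of sorted) {
    nameOffsets.push(cursor);
    cursor += Buffer.byteLength(item.name, "latin1") + 1; // trailing NUL
  }

  const dataOffsets: number[] = [];
  for (const item of sorted) {
    cursor = alignUp(cursor);
    dataOffsets.push(cursor);
    cursor += item.body.length;
  }

  // Buffer.alloc zero-fills, which supplies the NUL terminators and padding.
  const out = Buffer.alloc(alignUp(cursor));
  out.writeUInt32LE(sorted.length, 0);
  sorted.forEach((item, i) => {
    out.writeUInt32LE(nameOffsets[i], 4 + i * 8);
    out.writeUInt32LE(dataOffsets[i], 8 + i * 8);
    out.write(item.name, nameOffsets[i], "latin1");
    item.body.copy(out, dataOffsets[i]);
  });
  return out;
}
```

Because a compressed body is opaque bytes at a TOC offset, a layout like this does not care whether an item is a raw ICU resource or a zstd frame; only the loader hook has to tell them apart.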

Bun-side companion

oven-sh/bun#31200: Bun::ICUDecompressor singleton (the hook) + test/js/web/intl/ (30 tests, ~5,900 assertions: snapshots for 12 locales × every Intl API captured against unmodified libicudata.a, plus an exhaustive sweep that loads every compressed item). All pass identically on baseline, compressed-release, and compressed-LTO builds.

Measured (real Bun release binaries, hyperfine 50 runs)

| | baseline | with this | Δ |
|---|--:|--:|---|
| Stripped binary | 83,741,640 B | 75,385,800 B | −8.4 MB (same under LTO) |
| bun --version | 0.50 ms | 0.51 ms | noise |
| new Date().toString() | 6.9 ms | 6.3 ms | noise |
| Intl.DateTimeFormat("ja") | 6.1 ms | 6.1 ms | 0 |
| Intl.Collator("zh") | 5.3 ms | 5.3 ms | 0 |
| Intl.Segmenter("zh", word) | 6.6 ms | 6.6 ms | 0 |
| Intl.DisplayNames("ko").of("US") | 5.7 ms | 5.8 ms | +0.1 ms |
| NumberFormat("ru", {style:"unit"}) | 5.9 ms | 6.1 ms | +0.2 ms (worst case) |
| DateTimeFormat("ja", {timeZoneName:"long"}) | 6.1 ms | 6.3 ms | +0.2 ms |

All Intl outputs byte-identical to baseline (中文|分词|测试, 미국, 1.234,56 €, 5 км, …). Regressions are first-call-only, then cached.

Memory (/proc/self/status, 102 locales × 5 trees)

| | baseline Δ | with this Δ | diff |
|---|--:|--:|--:|
| RSS total | +23.9 MB | +24.2 MB | +0.3 MB |
| RssAnon (pinned) | +8.3 MB | +16.5 MB | +8.2 MB |
| RssFile (evictable) | +15.6 MB | +7.7 MB | −7.9 MB |

Total resident is flat; up to ~11.3 MB shifts evictable→pinned, reachable only via DisplayNames/unit/timeZoneName/currencyDisplay:"name" across all ~500 locales. No refcounting (would re-decompress on GC churn).

Linux-only / not in this PR

Dockerfile (glibc) and Dockerfile.musl only. build-icu.ps1 (Windows, also still on ICU 73.2), Dockerfile.android, Dockerfile.freebsd are untouched; macOS uses system libicucore. LTO validated (weak symbol resolves; baseline DCE's the unreachable hook).

@github-actions

github-actions Bot commented May 22, 2026

Preview Builds

| Commit | Release | Date |
|---|---|---|
| a863d5bb | autobuild-preview-pr-237-a863d5bb | 2026-05-23 03:08:15 UTC |
| 83b6a12f | autobuild-preview-pr-237-83b6a12f | 2026-05-22 22:44:33 UTC |
| e8ce5e56 | autobuild-preview-pr-237-e8ce5e56 | 2026-05-22 20:37:28 UTC |
| ddf38a1d | autobuild-preview-pr-237-ddf38a1d | 2026-05-22 19:28:21 UTC |
| cbfdad14 | autobuild-preview-pr-237-cbfdad14 | 2026-05-22 06:11:36 UTC |
| 33ac802c | autobuild-preview-pr-237-33ac802c | 2026-05-22 05:26:32 UTC |
| 72c07745 | autobuild-preview-pr-237-72c07745 | 2026-05-22 02:07:25 UTC |

@dylan-conway dylan-conway marked this pull request as ready for review May 22, 2026 05:19
@coderabbitai

coderabbitai Bot commented May 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a Node.js compression tool and Docker build steps to repack ICU common-data with per-item zstd compression, a hot-item whitelist, and a weak runtime decompression hook applied by patching ICU's udata loader.

Changes

ICU data compression build and runtime

| Layer / File(s) | Summary |
|---|---|
| Build environment dependencies<br>Dockerfile, Dockerfile.musl | Adds zstd and xz-utils to apt deps, installs Node.js in the Dockerfile, and adds nodejs/patch to musl build prerequisites so the image can run the compression tool and apply source patches. |
| ICU data compression tool<br>icu/compress-data.ts | Adds a Node.js/TypeScript CLI that extracts ICU .dat packages via icupkg, trains a zstd dictionary, conditionally compresses items into per-item zstd frames, rebuilds the ICU package binary (TOC, name pool, 16-byte alignment), verifies integrity, emits a rebuilt .dat, compiles an object embedding the .dat and dict, and archives to .a. |
| Hot items configuration<br>icu/hot-items.txt | Adds glob rules marking ICU items that must remain uncompressed (hot); limits compression to the Intl.DisplayNames trees (curr/, lang/, region/, unit/, zone/) and excludes */pool.res and other specified patterns. |
| Runtime decompression hook<br>icu/udata-decompress-hook.patch | Patches source/common/udata.cpp to declare a weak bun_icu_maybe_decompress symbol and, when present, pass per-item DataHeader through it to potentially decompress zstd-framed data before header checks. |
| Docker build orchestration<br>Dockerfile, Dockerfile.musl | Stages local icu/ assets into images, applies the udata patch during ICU build, generates filtered icudt75l.dat, and runs node --experimental-strip-types /icu-bun/compress-data.ts (skipping /icu-bun/hot-items.txt) to produce the final libicudata.a. |
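
As a rough illustration of the hot-item glob rules summarized above, matching could look like the sketch below. The actual syntax of icu/hot-items.txt is not shown on this page, so the `#` comment handling, the restriction to `*` and `?` wildcards, and the sample patterns in the usage note are all assumptions; `globToRegExp`, `parseHotList`, and `makeIsHot` are hypothetical names.

```typescript
// Convert a simple glob into an anchored RegExp.
// Assumption: only `*` (any run of characters, including `/`) and `?` are special.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${pattern}$`);
}

// Assumption: one glob per line, blank lines and `#` comments ignored.
function parseHotList(text: string): RegExp[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"))
    .map(globToRegExp);
}

// Returns a predicate that reports whether an item must stay uncompressed.
function makeIsHot(rules: RegExp[]): (item: string) => boolean {
  return (item) => rules.some((rule) => rule.test(item));
}
```

With a hypothetical list such as `*/pool.res`, `*/en.res`, and `coll/*`, the predicate would keep `lang/pool.res` and `coll/zh.res` raw while leaving `lang/ja.res` eligible for compression.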
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: per-item zstd compression of ICU's libicudata library, which is the core objective of this pull request. |
| Description check | ✅ Passed | The description provides comprehensive context about the changes, implementation details, performance measurements, and companion work, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
icu/compress-data.ts (1)

305-308: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider documenting the +4 compression threshold.

The + 4 in z.length + 4 < raw.length serves as a minimum savings margin before choosing compression. A brief inline comment would help future maintainers understand this heuristic (e.g., decompression overhead, frame metadata, or simply a margin to ensure worthwhile savings).

📝 Suggested documentation
     if (raw.length >= 64 && !isHot(bare)) {
       const z = compressFile(path, dictPath, ZSTD_LEVEL, tmpOut);
+      // Require at least 4 bytes savings to justify runtime decompression cost.
       if (z.length + 4 < raw.length) { body = z; comp++; } else kept++;
     } else kept++;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@icu/compress-data.ts` around lines 305 - 308, The comparison using "z.length
+ 4 < raw.length" in the compress decision (inside the branch that calls
compressFile(path, dictPath, ZSTD_LEVEL, tmpOut)) is unclear; add a short inline
comment next to that condition explaining the "+4" margin (e.g., to account for
decompression/frame overhead and ensure net size savings) so maintainers
understand why a compressed result must be at least 4 bytes smaller before
assigning body = z and incrementing comp (otherwise kept++). Keep the comment
concise and reference compressFile, z.length, raw.length and the variables
body/comp/kept so readers can locate the logic easily.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 25a31d03-0356-4f46-9be6-d3261748404f

📥 Commits

Reviewing files that changed from the base of the PR and between ddf38a1 and 14e5db2.

📒 Files selected for processing (1)
  • icu/compress-data.ts

@dylan-conway
Member Author

Done in $(git rev-parse --short HEAD) — named MIN_COMPRESS_BYTES/MIN_SAVINGS_BYTES with a comment.
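
For readers following along, the named thresholds might look something like this sketch. The values 64 and 4 come from the condition quoted in the review above; the `chooseBody` helper and its injected `compress` callback are hypothetical, standing in for the `compressFile` call and the `body`/`comp`/`kept` bookkeeping in the real script.

```typescript
import { Buffer } from "node:buffer";

// Items smaller than this are left raw (the reviewed code checks `raw.length >= 64`).
const MIN_COMPRESS_BYTES = 64;
// The compressed body must be smaller than the raw one by more than this margin
// (the `+ 4` in `z.length + 4 < raw.length`).
const MIN_SAVINGS_BYTES = 4;

interface Choice {
  body: Buffer;
  compressed: boolean;
}

// Hypothetical helper: picks the stored body for one item. `compress` is injected
// so the decision logic can be exercised without the zstd CLI.
function chooseBody(raw: Buffer, hot: boolean, compress: (raw: Buffer) => Buffer): Choice {
  if (raw.length < MIN_COMPRESS_BYTES || hot) {
    return { body: raw, compressed: false };
  }
  const z = compress(raw);
  return z.length + MIN_SAVINGS_BYTES < raw.length
    ? { body: z, compressed: true }
    : { body: raw, compressed: false };
}
```

Note that the strict `<` means a saving of exactly 4 bytes is not enough; the compressed body has to win by at least 5.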

dylan-conway added a commit to oven-sh/bun that referenced this pull request May 23, 2026
@dylan-conway dylan-conway merged commit 782504c into main May 23, 2026
43 checks passed
dylan-conway added a commit to oven-sh/bun that referenced this pull request May 23, 2026
Jarred-Sumner pushed a commit to oven-sh/bun that referenced this pull request May 23, 2026
Runtime side of oven-sh/WebKit#237 (now merged as `782504c968e2`). That
change repacks ICU's `libicudata.a` with most items per-item
zstd-compressed and patches `udata.cpp` to call
`bun_icu_maybe_decompress` between TOC lookup and header validation.
This PR provides that function, the test coverage, and bumps
`WEBKIT_VERSION`.

## What's in this PR

- **`src/jsc/bindings/bun_icu_decompress.cpp`** — `Bun::ICUDecompressor`
singleton (`LazyNeverDestroyed` + `std::call_once`,
`WTF::Lock`/`HashMap`). The hook checks `ZSTD_MAGICNUMBER` and returns
immediately for raw items; zstd frames are decompressed once with the
shared `DDict` into 16-aligned mimalloc and cached by `.rodata` address.
Linux-only (`#if OS(LINUX)`).
- **`test/js/web/intl/`** — 31 tests / 5,902 assertions. Snapshot tests
for 12 locales × every `Intl.*` API (captured against unmodified ICU);
structural sweep over all locales × 5 trees; plus `icu-locales.txt` (the
full locale list).
- **`scripts/build/deps/webkit.ts`** — `WEBKIT_VERSION` →
`782504c968e2`; fix `prebuiltDestDir` cache key for `autobuild-*` tags.
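
The hook's fast path and cache described above can be modeled in a few lines. The real implementation is C++ in `bun_icu_decompress.cpp`; this TypeScript sketch only mirrors the logic (magic check, decompress once, cache by location) and is not a port. `ItemCache` and the injected `decompress` callback are hypothetical, and a numeric offset stands in for the `.rodata` address used as the cache key.

```typescript
import { Buffer } from "node:buffer";

// zstd frame magic number; it is stored little-endian, so a frame starts
// with the bytes 28 B5 2F FD.
const ZSTD_MAGICNUMBER = 0xfd2fb528;

function isZstdFrame(item: Buffer): boolean {
  return item.length >= 4 && item.readUInt32LE(0) === ZSTD_MAGICNUMBER;
}

// Hypothetical model of the decompress-once cache.
class ItemCache {
  #decompress: (frame: Buffer) => Buffer;
  #cache = new Map<number, Buffer>();

  constructor(decompress: (frame: Buffer) => Buffer) {
    this.#decompress = decompress;
  }

  // Raw items are returned untouched; zstd frames are decoded on first lookup
  // and served from the cache afterwards.
  lookup(offset: number, item: Buffer): Buffer {
    if (!isZstdFrame(item)) return item;
    let decoded = this.#cache.get(offset);
    if (decoded === undefined) {
      decoded = this.#decompress(item);
      this.#cache.set(offset, decoded);
    }
    return decoded;
  }
}
```

The magic check should be unambiguous in practice: as far as I know, a raw ICU item begins with a 16-bit header size followed by the bytes DA 27, which cannot be mistaken for the zstd prefix.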

## Runtime cost

Nothing is removed — same ICU code, same data, just stored compressed
and decoded on first read, then cached for the process lifetime.

| | Before | After | |
|---|--:|--:|--:|
| **Binary size** (linux-x64, stripped) | 83.8 MB | 72.2 MB | **−11.5 MB** |
| First `new Intl.NumberFormat("ru", {style:"unit"})` | 1.00 ms | 1.30 ms | +0.30 ms once |
| First i18n library init for one non-en locale | 0.59 ms | 0.82 ms | +0.23 ms once |
| First time every Intl API × every locale is used | 196.5 ms | 218.0 ms | +21.4 ms once |
| Any of the above, second time onward | 185.9 ms | 184.0 ms | −1.9 ms |

`Intl.*("en")`, `Date.toString()`, URL parsing, `String.normalize`,
regex `\p{}` are unchanged (kept uncompressed). All `Intl` outputs
byte-identical — the snapshots are the unmodified-ICU reference.

No popular i18n package iterates more than the app's own locale at
runtime
([luxon](https://github.com/moment/luxon/blob/master/src/impl/locale.js),
[@formatjs/intl](https://formatjs.github.io/docs/intl/), i18next, dayjs,
date-fns all checked); the every-locale row is the absolute upper bound,
not a realistic workload.

*30 fresh processes per binary, median; baseline = `autobuild-0d85951a`,
this PR = `autobuild-782504c968e2`.*

## Memory

Total RSS is flat. Up to ~11 MB shifts from evictable `.rodata` to
pinned heap as items are decompressed — reachable only via
`DisplayNames`/`unit`/`timeZoneName`/`currencyDisplay:"name"` across
many locales.

## Scope

Linux glibc + musl only. Windows (`build-icu.ps1`, ICU 73.2), FreeBSD,
Android Dockerfiles, and macOS (system `libicucore`) are untouched. The
hook compiles everywhere but is dead code where the prebuilt's
`udata.cpp` isn't patched.

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>