zeemo: split slot buffer (4 KiB inline + on-demand 16 KiB big_buf) #729
Merged
…ig_buf)

Each connection now keeps a small 4 KiB inline buffer for small responses (baseline, pipelined, limited-conn). A per-worker static pool of 16 KiB big_bufs is acquired on demand the first time a /json/ request arrives on a connection and released to the pool when the connection closes.

JSON responses bypass the inline buffer entirely and use big_buf, and are never batched with non-JSON responses in a single send. Non-JSON profiles never touch the big_pool pages; they stay zero-fill BSS, so the baseline RSS drops sharply.

Local benchmark-lite on 8-core OrbStack (lite mode; relative changes matter, absolute numbers don't reflect bare metal):

  baseline      665k → 752k req/s (+13%), 14 → 8 MiB (-43%)
  pipelined     6.73M → 7.91M (+17%), 14 → 8 MiB (-43%)
  limited-conn  420k → 486k (+16%), 25 → 13 MiB (-48%)
  json          145k → 165k (+14%), 20 → 27 MiB (+35%, big_buf adds)

drainAndSend now picks the destination buffer per request: if a JSON request appears with non-JSON bytes already queued inline, the inline batch flushes first (without consuming the JSON request) and the JSON dispatch happens on the next drain pass after send completes. Partial-send tracking moves from "write_buf[off..len]" to a free-form send_ptr stored on the slot, so it works regardless of which buffer was the source.

All 20 local validation checks pass.
Contributor (Author):
/benchmark -f zeemo --save
Contributor:
👋
Contributor:
Benchmark Results
Framework:
Full log
MDA2AV approved these changes May 18, 2026
MDA2AV pushed a commit that referenced this pull request May 18, 2026
Two memory-bonus changes bundled:

1. **Parser internals trimmed.** parser.buf 4 KiB → 2 KiB (pipelined batch of 16 × ~80 B headers fits with headroom), parser.body 4 KiB → 512 B (validation sends ≤4-byte bodies; gcannon's baseline POSTs are short integers). Slot drops from ~12 KiB to ~6.6 KiB. No RPS impact expected; buffers are still page-aligned, just narrower.
2. **Static [128]Slot array → fd-indexed dynamic `*Slot`.** Each accept mmaps a fresh Slot via `std.heap.page_allocator`; close munmaps it, returning pages to the kernel. user_data encoding switches from `(op<<56)|slot_idx` to `(op<<32)|fd`; lookup table is `[MAX_FD=4096]?*Slot` BSS, sparsely touched.

Goal: limited-conn churn no longer accumulates page residency on freed slots, and the BSS reservation for unused slot capacity goes to zero.

Local OrbStack lite-bench shows -25 to -54% memory across all profiles with -10 to -19% local RPS. Past PRs (#727, #729) showed local RPS gains of +13-17% translating to +0-1% on the real Threadripper bench, so the local RPS regression here is expected to mostly evaporate on bare metal. Worth a preview `/benchmark` to confirm before `--save`.

All 20 local validation checks pass.
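The `(op<<32)|fd` encoding and the fd-indexed lookup table from the commit message above can be modelled in a few lines. This is a rough Python sketch, not zeemo's code: the op codes, the function names, and the dict standing in for a `Slot` are all hypothetical, and it assumes the fd is non-negative and below `MAX_FD`.

```python
MAX_FD = 4096            # table size quoted in the commit message
FD_MASK = (1 << 32) - 1  # low 32 bits carry the fd
U64_MASK = (1 << 64) - 1

OP_RECV = 1  # hypothetical op codes, for illustration only
OP_SEND = 2

# Models the `[MAX_FD]?*Slot` table: None means "no slot for this fd".
slots = [None] * MAX_FD


def pack_user_data(op: int, fd: int) -> int:
    """Put op in the bits above 32 and fd in the low 32 bits."""
    if not 0 <= fd < MAX_FD:
        raise ValueError("fd outside the lookup table")
    return ((op << 32) | fd) & U64_MASK


def unpack_user_data(user_data: int) -> tuple:
    """Return (op, fd) from a packed 64-bit value."""
    return user_data >> 32, user_data & FD_MASK


def on_accept(fd: int) -> int:
    """Create a slot for the new fd and return user_data for its first recv."""
    slots[fd] = {"fd": fd}  # stand-in for a freshly mapped Slot
    return pack_user_data(OP_RECV, fd)


def on_close(fd: int) -> None:
    """Drop the slot so the table entry is empty again."""
    slots[fd] = None
```

On a completion, `unpack_user_data` yields the fd, and `slots[fd]` yields the slot directly, with no separate slot index to keep track of.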
Description
Follow-up to #727 (now merged). Targets the memory bonus in the composite score.
After #727 zeemo sits at 184 MiB on baseline-4096 while h2o is 73 MiB. The composite uses sqrt(rps)/memMB, so the ~2.5× memory gap more than halves our per-profile score relative to h2o's footprint, even when we lead on raw RPS. This PR shrinks the per-connection memory footprint by splitting the write buffer into two:

- A 4 KiB inline buffer in `Slot`, sized for the small-response profiles (baseline, pipelined, limited-conn). Pipelined batches concatenate here.
- A per-worker static pool of 16 KiB big_bufs; one is acquired on demand the first time a `/json/` request arrives on a connection and released to the pool on close.

JSON responses bypass the inline buffer entirely and use big_buf, and are never batched with non-JSON responses in a single send. Non-JSON profiles never touch the big_pool pages; they stay zero-fill BSS, so the baseline RSS drops sharply without affecting JSON.
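The acquire-on-first-JSON, release-on-close lifecycle can be sketched as a small model. This is a Python illustration only, not zeemo's implementation: `BigBufPool`, `Conn`, `buffer_for`, the pool capacity, and the `/baseline` path are all hypothetical, and the source does not say what happens when the pool is exhausted, so the sketch simply raises.

```python
from typing import Optional

INLINE_SIZE = 4 * 1024    # per-connection inline buffer (from the PR)
BIG_BUF_SIZE = 16 * 1024  # pooled buffer for JSON responses (from the PR)
POOL_CAPACITY = 64        # assumed per-worker capacity, for illustration


class BigBufPool:
    """Fixed set of big buffers, handed out by index."""

    def __init__(self, capacity: int = POOL_CAPACITY) -> None:
        self.free = list(range(capacity))  # indices not currently in use
        self.touched = set()               # indices ever handed out

    def acquire(self) -> int:
        if not self.free:
            # Exhaustion policy is not described in the PR; assumption here.
            raise RuntimeError("big_buf pool exhausted")
        idx = self.free.pop()
        self.touched.add(idx)
        return idx

    def release(self, idx: int) -> None:
        self.free.append(idx)


class Conn:
    """One connection: always has an inline buffer, maybe a big_buf."""

    def __init__(self, pool: BigBufPool) -> None:
        self.pool = pool
        self.big_buf: Optional[int] = None  # acquired lazily

    def buffer_for(self, path: str) -> str:
        """Pick the destination buffer for a response to `path`."""
        if path.startswith("/json/"):
            if self.big_buf is None:
                self.big_buf = self.pool.acquire()
            return "big_buf"
        return "inline"

    def close(self) -> None:
        if self.big_buf is not None:
            self.pool.release(self.big_buf)
            self.big_buf = None
```

A connection that only ever serves non-JSON paths never calls `acquire`, so `pool.touched` stays empty. That is the model's analogue of the big_pool pages remaining untouched zero-fill BSS.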
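`drainAndSend` picks the destination per request; if a JSON request shows up with inline bytes already queued, the inline batch flushes first (without consuming the JSON request) and JSON dispatch happens on the next drain pass after send completes. Partial-send tracking moved from `write_buf[off..len]` to a free-form `send_ptr` stored on the slot.

Local benchmark-lite on 8-core OrbStack
Absolute numbers aren't bare-metal-comparable; relative changes show the design trade-off.

| Profile | RPS (before → after) | Memory (before → after) |
| --- | --- | --- |
| baseline | 665k → 752k req/s (+13%) | 14 → 8 MiB (-43%) |
| pipelined | 6.73M → 7.91M req/s (+17%) | 14 → 8 MiB (-43%) |
| limited-conn | 420k → 486k req/s (+16%) | 25 → 13 MiB (-48%) |
| json | 145k → 165k req/s (+14%) | 20 → 27 MiB (+35%) |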
RPS goes up across the board because the smaller slot footprint fits better in L1/L2. JSON memory goes up because a JSON connection now carries both the 4 KiB inline page and a 16 KiB big_buf; for the bench (4096 conns all JSON) that's ~64 active big_bufs × 16 KiB per worker.
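To see what these changes mean for the score, the sqrt(rps)/memMB formula can be applied to the local lite-bench figures quoted in this PR. This is a back-of-the-envelope Python sketch: it assumes the MiB figures are what the formula takes as memMB, and the function names are made up for illustration.

```python
import math


def composite(rps: float, mem_mib: float) -> float:
    """Per-profile score as described in this PR: sqrt(rps) / memMB."""
    return math.sqrt(rps) / mem_mib


# Local lite-bench figures from this PR: (req/s, MiB) before and after.
LOCAL = {
    "baseline": ((665_000, 14), (752_000, 8)),
    "pipelined": ((6_730_000, 14), (7_910_000, 8)),
    "limited-conn": ((420_000, 25), (486_000, 13)),
    "json": ((145_000, 20), (165_000, 27)),
}


def score_ratio(profile: str) -> float:
    """After-score divided by before-score for one profile."""
    (rps0, mem0), (rps1, mem1) = LOCAL[profile]
    return composite(rps1, mem1) / composite(rps0, mem0)


if __name__ == "__main__":
    for name in LOCAL:
        print(f"{name}: x{score_ratio(name):.2f}")
```

Under this formula the three non-JSON profiles improve by roughly 1.9x to 2.1x locally, while the json profile's ratio falls below 1, because its memory grows faster than the square root of its RPS.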
All 20 local validation checks pass.
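The per-request buffer choice in `drainAndSend` and the buffer-agnostic partial-send tracking described earlier can be modelled as follows. This is a simplified Python sketch under stated assumptions: `drain_pass`, `SendCursor`, and the `(kind, response)` queue are hypothetical, and the 4 KiB inline capacity limit is ignored.

```python
from collections import deque
from typing import Deque, Optional, Tuple


def drain_pass(pending: Deque[Tuple[str, bytes]]) -> Optional[Tuple[str, bytes]]:
    """Build one send from queued (kind, response) pairs.

    Non-JSON responses are batched into the inline buffer. A JSON
    response is sent alone from big_buf. If a JSON request is reached
    while inline bytes are queued, the inline batch is returned first
    and the JSON request stays in `pending` for the next pass.
    """
    inline = bytearray()
    while pending:
        kind, response = pending[0]
        if kind == "json":
            if inline:
                break  # flush inline first; JSON request not consumed
            pending.popleft()
            return ("big_buf", response)
        pending.popleft()
        inline += response
    return ("inline", bytes(inline)) if inline else None


class SendCursor:
    """Tracks unsent bytes without caring which buffer they came from."""

    def __init__(self, data: bytes) -> None:
        self.view = memoryview(data)  # stands in for send_ptr + length

    def advance(self, sent: int) -> None:
        """Record that `sent` bytes were written by a partial send."""
        self.view = self.view[sent:]

    def done(self) -> bool:
        return len(self.view) == 0
```

Because the cursor only holds a view of the remaining bytes, the same resume logic applies whether the send started from the inline buffer or from a big_buf.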
PR Commands (comment to trigger; requires collaborator approval):
/benchmark -f zeemo
/benchmark -f zeemo --save
Source: https://github.com/skylightis666/zeemo