
zeemo: split slot buffer (4 KiB inline + on-demand 16 KiB big_buf) #729

Merged
MDA2AV merged 2 commits into MDA2AV:main from skylightis666:zeemo-dual-buffer
May 18, 2026

@skylightis666
Contributor

Description

Follow-up to #727 (now merged). Targets the memory bonus in the composite score.

After #727, zeemo sits at 184 MiB on baseline-4096 while h2o is at 73 MiB, about 2.5× the memory. The composite uses sqrt(rps)/memMB, so the memory gap roughly halves our per-profile score even when we lead on raw RPS. This PR shrinks the per-connection memory footprint by splitting the write buffer into two:

  • 4 KiB inline buffer in every Slot — sized for the small-response profiles (baseline, pipelined, limited-conn). Pipelined batches concatenate here.
  • 16 KiB big_buf drawn from a per-worker static pool, acquired on demand the first time a /json/ request arrives on a connection and released to the pool on close.

JSON responses bypass the inline buffer entirely and use big_buf, and are never batched with non-JSON responses in a single send. Non-JSON profiles never touch the big_pool pages, which stay untouched zero-fill BSS and so never become resident; baseline RSS drops sharply without affecting the JSON path.
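
The inline-buffer / pooled-big_buf split described above can be modelled roughly as follows. This is a minimal Python sketch for illustration only, not zeemo's code: `BigBufPool`, `Slot.buffer_for`, the pool capacity of 64 and the `/baseline` path are hypothetical names and values, and the `created` counter merely stands in for pool pages being touched for the first time.

```python
INLINE_SIZE = 4 * 1024    # per-slot inline buffer (size taken from the PR)
BIG_BUF_SIZE = 16 * 1024  # pooled buffer for JSON responses (size taken from the PR)


class BigBufPool:
    """Hypothetical per-worker pool: a buffer is only materialised on first use."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.free: list[bytearray] = []  # released buffers, ready for reuse
        self.created = 0                 # buffers materialised so far

    def acquire(self) -> bytearray:
        if self.free:
            return self.free.pop()
        if self.created == self.capacity:
            raise RuntimeError("big_buf pool exhausted")
        self.created += 1
        return bytearray(BIG_BUF_SIZE)

    def release(self, buf: bytearray) -> None:
        self.free.append(buf)


class Slot:
    """One connection: always has an inline buffer, big_buf only on demand."""

    def __init__(self, pool: BigBufPool) -> None:
        self.pool = pool
        self.inline = bytearray(INLINE_SIZE)
        self.big_buf: bytearray | None = None

    def buffer_for(self, path: str) -> bytearray:
        # JSON requests bypass the inline buffer; the first one acquires big_buf.
        if path.startswith("/json/"):
            if self.big_buf is None:
                self.big_buf = self.pool.acquire()
            return self.big_buf
        return self.inline

    def close(self) -> None:
        # The big_buf goes back to the pool when the connection closes.
        if self.big_buf is not None:
            self.pool.release(self.big_buf)
            self.big_buf = None


pool = BigBufPool(capacity=64)
plain, json_conn = Slot(pool), Slot(pool)
plain.buffer_for("/baseline")     # inline only, pool untouched
json_conn.buffer_for("/json/1")   # first JSON request acquires a big_buf
json_conn.buffer_for("/json/2")   # later JSON requests reuse it
created_after_requests = pool.created
json_conn.close()
```

In this model a connection that never sees a `/json/` request never causes a pool buffer to be created, which is the property the description relies on for the lower baseline RSS.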

drainAndSend picks the destination buffer per request. If a JSON request shows up with inline bytes already queued, the inline batch flushes first (without consuming the JSON request), and the JSON dispatch happens on the next drain pass after the send completes. Partial-send tracking moved from write_buf[off..len] to a free-form send_ptr stored on the slot, so it works regardless of which buffer was the source.
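
The flush-first rule and the buffer-agnostic send cursor can be sketched as below. This is an illustrative Python model under stated assumptions, not the actual drainAndSend: `drain`, `SendCursor` and the `respond` callback are hypothetical, and the sketch ignores the 4 KiB capacity limit of the inline buffer.

```python
from collections import deque


def drain(requests: deque, inline: bytearray, respond):
    """Pick the next send: ("inline", bytes), ("big", bytes), or None when idle.

    Non-JSON responses are batched into `inline`. A JSON request is only
    consumed when `inline` is empty; otherwise the batch is flushed first
    and the JSON request stays queued for the next drain pass.
    """
    while requests:
        path = requests[0]              # peek without consuming
        if path.startswith("/json/"):
            if inline:
                break                   # flush the inline batch first
            requests.popleft()
            return ("big", respond(path))
        requests.popleft()
        inline.extend(respond(path))
    if inline:
        payload = bytes(inline)
        inline.clear()
        return ("inline", payload)
    return None


class SendCursor:
    """Stand-in for send_ptr: tracks the unsent tail of whichever buffer was used."""

    def __init__(self, payload: bytes) -> None:
        self.view = memoryview(payload)

    def advance(self, sent: int) -> bool:
        """Record `sent` bytes as written; return True once nothing is left."""
        self.view = self.view[sent:]
        return len(self.view) == 0


def respond(path: str) -> bytes:
    return f"<{path}>".encode()         # placeholder response body


queue = deque(["/a", "/b", "/json/1", "/c"])
inline_buf = bytearray()
sends = []
while (step := drain(queue, inline_buf, respond)) is not None:
    sends.append(step)

cursor = SendCursor(sends[0][1])
done_after_partial = cursor.advance(3)                # short write of 3 bytes
done_after_rest = cursor.advance(len(cursor.view))    # remainder goes out
```

Because the cursor only holds a view of the bytes still to be written, the same partial-send logic applies whether the payload came from the inline buffer or from big_buf.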

Local benchmark-lite on 8-core OrbStack

Absolute numbers aren't bare-metal-comparable; relative changes show the design trade-off.

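| profile | before RPS | after RPS | Δ RPS | before mem | after mem | Δ mem |
|---|---|---|---|---|---|---|
| baseline | 665k | 752k | +13% | 14 MiB | 8 MiB | −43% |
| pipelined | 6.73M | 7.91M | +17% | 14 MiB | 8 MiB | −43% |
| limited-conn | 420k | 486k | +16% | 25 MiB | 13 MiB | −48% |
| json | 145k | 165k | +14% | 20 MiB | 27 MiB | +35% |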

RPS goes up across the board, most likely because the smaller slot footprint fits better in L1/L2. JSON memory goes up because a JSON connection now carries both the 4 KiB inline buffer and a 16 KiB big_buf; for the bench (4096 conns, all JSON) that's ~64 active big_bufs × 16 KiB, about 1 MiB, per worker.
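
To see how the local numbers above would move the composite, here is a small worked calculation. It assumes the score is exactly sqrt(rps)/memMB as quoted in the description, with no further normalisation or weighting, and uses only the rounded figures from the table; the function and variable names are made up for this sketch.

```python
import math


def score(rps: float, mem_mib: float) -> float:
    """Composite as quoted in the description: sqrt(rps) / memMB."""
    return math.sqrt(rps) / mem_mib


# (before, after) pairs of (RPS, MiB) from the local benchmark-lite table.
local = {
    "baseline":     ((665_000, 14), (752_000, 8)),
    "pipelined":    ((6_730_000, 14), (7_910_000, 8)),
    "limited-conn": ((420_000, 25), (486_000, 13)),
    "json":         ((145_000, 20), (165_000, 27)),
}

# Ratio > 1 means the change improves the per-profile score.
ratios = {
    name: score(*after) / score(*before)
    for name, (before, after) in local.items()
}

for name, ratio in ratios.items():
    print(f"{name:<13} score x{ratio:.2f}")

# Per-worker big_buf footprint quoted above: ~64 active buffers of 16 KiB each.
big_buf_kib_per_worker = 64 * 16
print(f"big_buf per worker: {big_buf_kib_per_worker} KiB")
```

Under that assumption the memory term dominates: the three small-response profiles gain on both factors, while json gains RPS but loses more on memory in the local run.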

All 20 local validation checks pass.

PR Commands — comment to trigger (requires collaborator approval):

| Command | Description |
|---|---|
| `/benchmark -f zeemo` | Preview run |
| `/benchmark -f zeemo --save` | Run and save results |

Source: https://github.com/skylightis666/zeemo

…ig_buf)

Each connection now keeps a small 4 KiB inline buffer for small responses
(baseline, pipelined, limited-conn). A per-worker static pool of 16 KiB
big_bufs is acquired on demand the first time a /json/ request arrives
on a connection and released to the pool when the connection closes.

JSON responses bypass the inline buffer entirely and use big_buf, and are
never batched with non-JSON responses in a single send. Non-JSON profiles
never touch the big_pool pages — they stay zero-fill BSS, so the baseline
RSS drops sharply.

Local benchmark-lite on 8-core OrbStack (lite mode, relative changes
matter, absolute numbers don't reflect bare-metal):

  baseline      665k → 752k req/s  (+13%), 14 → 8 MiB (-43%)
  pipelined     6.73M → 7.91M     (+17%), 14 → 8 MiB (-43%)
  limited-conn  420k → 486k       (+16%), 25 → 13 MiB (-48%)
  json          145k → 165k       (+14%), 20 → 27 MiB (+35%, big_buf adds)

drainAndSend now picks the destination buffer per request: if a JSON
request appears with non-JSON bytes already queued inline, the inline
batch flushes first (without consuming the JSON request) and the JSON
dispatch happens on the next drain pass after send completes. Partial-
send tracking moves from "write_buf[off..len]" to a free-form send_ptr
stored on the slot, so it works regardless of which buffer was the
source.

All 20 local validation checks pass.
@skylightis666
Contributor Author

/benchmark -f zeemo --save

@github-actions
Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

@github-actions
Contributor

Benchmark Results

Framework: zeemo | Test: all tests

| Test | Conn | RPS | CPU | Mem | Δ RPS | Δ Mem |
|---|---|---|---|---|---|---|
| baseline | 512 | 4,120,722 | 6267.1% | 69MiB | +0.3% | -8.0% |
| baseline | 4096 | 4,435,413 | 6406.3% | 130MiB | +0.2% | -29.3% |
| pipelined | 512 | 48,445,534 | 6523.3% | 69MiB | ~0% | -6.8% |
| pipelined | 4096 | 49,927,385 | 6413.2% | 124MiB | -0.2% | -28.3% |
| limited-conn | 512 | 2,635,070 | 5467.8% | 88MiB | +0.2% | -22.8% |
| limited-conn | 4096 | 2,617,676 | 5554.3% | 178MiB | +0.4% | -29.9% |
| json | 4096 | 2,374,963 | 6403.3% | 257MiB | +1.0% | -0.4% |

Full log
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.74ms   1.96ms   2.32ms

  13088191 requests in 5.00s, 13088383 responses
  Throughput: 2.62M req/s
  Bandwidth:  164.69MB/s
  Status codes: 2xx=13088383, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13089113 / 13088383 responses (100.0%)
  Reconnects: 1308298
  Per-template: 4362849,4362780,4362710
  Per-template-ok: 4362849,4362780,4362710
[info] CPU 5554.3% | Mem 178MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.75ms   1.99ms   2.51ms

  13071741 requests in 5.00s, 13071236 responses
  Throughput: 2.61M req/s
  Bandwidth:  164.48MB/s
  Status codes: 2xx=13071236, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13071222 / 13071236 responses (100.0%)
  Reconnects: 1306419
  Per-template: 4357173,4357010,4357039
  Per-template-ok: 4357173,4357010,4357039
[info] CPU 5698.6% | Mem 178MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.75ms   1.99ms   2.37ms

  13021514 requests in 5.00s, 13021338 responses
  Throughput: 2.60M req/s
  Bandwidth:  163.87MB/s
  Status codes: 2xx=13021338, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13021292 / 13021338 responses (100.0%)
  Reconnects: 1302905
  Per-template: 4340353,4340641,4340298
  Per-template-ok: 4340353,4340641,4340297
[info] CPU 5523.0% | Mem 184MiB

=== Best: 2617676 req/s (CPU: 5554.3%, Mem: 178MiB) ===
[info] input BW: 202.21MB/s (avg template: 81 bytes)
[info] saved results/limited-conn/4096/zeemo.json
httparena-bench-zeemo
httparena-bench-zeemo

==============================================
=== zeemo / json / 4096c (tool=gcannon) ===
==============================================
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    694us    403us   1.85ms   2.70ms   3.38ms

  11814932 requests in 5.00s, 11813461 responses
  Throughput: 2.36M req/s
  Bandwidth:  7.91GB/s
  Status codes: 2xx=11813461, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11815665 / 11813461 responses (100.0%)
  Reconnects: 474810
  Per-template: 1681186,1685519,1689325,1692122,1692915,1689308,1682984
  Per-template-ok: 1681186,1685519,1689325,1692122,1692915,1689308,1682984
[info] CPU 5918.1% | Mem 242MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    704us    397us   1.93ms   2.74ms   3.37ms

  11876867 requests in 5.00s, 11874819 responses
  Throughput: 2.37M req/s
  Bandwidth:  7.95GB/s
  Status codes: 2xx=11874819, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11874762 / 11874819 responses (100.0%)
  Reconnects: 477786
  Per-template: 1690273,1692794,1697283,1700980,1701871,1698293,1693268
  Per-template-ok: 1690273,1692794,1697283,1700980,1701871,1698293,1693268
[info] CPU 6403.3% | Mem 257MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    700us    381us   1.91ms   2.71ms   3.31ms

  11850176 requests in 5.00s, 11848052 responses
  Throughput: 2.37M req/s
  Bandwidth:  7.93GB/s
  Status codes: 2xx=11848052, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11847931 / 11848052 responses (100.0%)
  Reconnects: 476805
  Per-template: 1686264,1691037,1694261,1696453,1696614,1694800,1688502
  Per-template-ok: 1686264,1691037,1694261,1696453,1696614,1694800,1688502
[info] CPU 5937.0% | Mem 260MiB

=== Best: 2374963 req/s (CPU: 6403.3%, Mem: 257MiB) ===
[info] input BW: 113.25MB/s (avg template: 50 bytes)
[info] saved results/json/4096/zeemo.json
httparena-bench-zeemo
httparena-bench-zeemo
[info] skip: zeemo does not subscribe to json-comp
[info] skip: zeemo does not subscribe to json-tls
[info] skip: zeemo does not subscribe to upload
[info] skip: zeemo does not subscribe to api-4
[info] skip: zeemo does not subscribe to api-16
[info] skip: zeemo does not subscribe to static
[info] skip: zeemo does not subscribe to async-db
[info] skip: zeemo does not subscribe to crud
[info] skip: zeemo does not subscribe to fortunes
[info] skip: zeemo does not subscribe to baseline-h2
[info] skip: zeemo does not subscribe to static-h2
[info] skip: zeemo does not subscribe to baseline-h2c
[info] skip: zeemo does not subscribe to json-h2c
[info] skip: zeemo does not subscribe to baseline-h3
[info] skip: zeemo does not subscribe to static-h3
[info] skip: zeemo does not subscribe to gateway-64
[info] skip: zeemo does not subscribe to gateway-h3
[info] skip: zeemo does not subscribe to production-stack
[info] skip: zeemo does not subscribe to unary-grpc
[info] skip: zeemo does not subscribe to unary-grpc-tls
[info] skip: zeemo does not subscribe to stream-grpc
[info] skip: zeemo does not subscribe to stream-grpc-tls
[info] skip: zeemo does not subscribe to echo-ws
[info] skip: zeemo does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
[info] restoring loopback MTU to 65536
[info] restoring CPU governor → powersave

@MDA2AV MDA2AV merged commit 1d6b3d1 into MDA2AV:main May 18, 2026
MDA2AV pushed a commit that referenced this pull request May 18, 2026
Two memory-bonus changes bundled:

1. **Parser internals trimmed.** parser.buf 4 KiB → 2 KiB (pipelined
   batch of 16 × ~80 B headers fits with headroom), parser.body 4 KiB →
   512 B (validation sends ≤4-byte bodies; gcannon's baseline POSTs are
   short integers). Slot drops from ~12 KiB to ~6.6 KiB. No RPS impact
   expected — buffers are still page-aligned, just narrower.

2. **Static [128]Slot array → fd-indexed dynamic `*Slot`.** Each accept
   mmaps a fresh Slot via `std.heap.page_allocator`; close munmaps it,
   returning pages to the kernel. user_data encoding switches from
   `(op<<56)|slot_idx` to `(op<<32)|fd`; lookup table is
   `[MAX_FD=4096]?*Slot` BSS, sparsely touched.

   Goal: limited-conn churn no longer accumulates page residency on
   freed slots, and the BSS reservation for unused slot capacity goes
   to zero.

Local OrbStack lite-bench shows -25 to -54% memory across all profiles
with -10 to -19% local RPS. Past PRs (#727, #729) showed local RPS
gains of +13-17% translating to +0-1% on the real Threadripper bench,
so the local RPS regression here is expected to mostly evaporate on
bare metal. Worth a preview `/benchmark` to confirm before `--save`.

All 20 local validation checks pass.