zeemo: split slot buffer (4 KiB inline + on-demand 16 KiB big_buf) by skylightis666 · Pull Request #729 · MDA2AV/HttpArena

skylightis666 · 2026-05-18T15:02:20Z

Description

Follow-up to #727 (now merged). Targets the memory bonus in the composite score.

After #727 zeemo sits at 184 MiB on baseline-4096 while h2o is 73 MiB. The composite uses sqrt(rps)/memMB, so the memory gap roughly halves our per-profile score even when we lead on raw RPS. This PR shrinks the per-connection memory footprint by splitting the write buffer into two:
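To make the effect of the memory term concrete, here is a minimal sketch of the score formula quoted above. The memory figures (184 MiB and 73 MiB) come from the description; the shared RPS value is a placeholder assumption, and `composite` is a hypothetical helper name, not part of the HttpArena codebase.

```python
import math

def composite(rps: float, mem_mib: float) -> float:
    """Per-profile score as described in the PR: sqrt(rps) / memMB."""
    return math.sqrt(rps) / mem_mib

# Assumption: both servers reach the same RPS, so only memory differs.
rps = 4_400_000
zeemo = composite(rps, 184)   # zeemo memory on baseline-4096 after #727
h2o = composite(rps, 73)      # h2o memory on the same profile

ratio = zeemo / h2o                 # reduces to 73 / 184 at equal RPS
rps_lead_needed = (184 / 73) ** 2   # RPS multiple needed to cancel the memory gap

print(f"{ratio:.2f}")            # 0.40
print(f"{rps_lead_needed:.2f}")  # 6.35
```

Because RPS enters under a square root, a throughput lead recovers little of the gap, which is why the PR goes after memory rather than RPS.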

- 4 KiB inline buffer in every Slot, sized for the small-response profiles (baseline, pipelined, limited-conn). Pipelined batches concatenate here.
- 16 KiB big_buf drawn from a per-worker static pool, acquired on demand the first time a /json/ request arrives on a connection and released to the pool on close.
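The acquire-on-demand and release-on-close lifecycle of the two buffers can be modelled roughly as follows. This is an illustrative Python sketch, not zeemo's actual code: `BigPool`, `Slot.big_buf`, and `Slot.close` are hypothetical names, and lazy `bytearray` allocation stands in for BSS pages that stay untouched until first use.

```python
INLINE_SIZE = 4 * 1024   # inline buffer carried by every slot
BIG_SIZE = 16 * 1024     # on-demand buffer for JSON responses

class BigPool:
    """Fixed-capacity per-worker pool of big buffers with an index free list."""
    def __init__(self, capacity):
        self.bufs = [None] * capacity                  # None = page never touched
        self.free = list(range(capacity - 1, -1, -1))  # indices available to hand out

    def acquire(self):
        if not self.free:
            return None                                # pool exhausted
        idx = self.free.pop()
        if self.bufs[idx] is None:
            self.bufs[idx] = bytearray(BIG_SIZE)       # first touch allocates
        return idx

    def release(self, idx):
        self.free.append(idx)

class Slot:
    def __init__(self):
        self.inline = bytearray(INLINE_SIZE)
        self.big_idx = None                            # no big_buf until a JSON request

    def big_buf(self, pool):
        """Return this connection's big_buf, acquiring one on first use."""
        if self.big_idx is None:
            self.big_idx = pool.acquire()
        return None if self.big_idx is None else pool.bufs[self.big_idx]

    def close(self, pool):
        if self.big_idx is not None:
            pool.release(self.big_idx)
            self.big_idx = None

pool = BigPool(capacity=2)
slot = Slot()
touched_before = sum(b is not None for b in pool.bufs)  # 0: non-JSON traffic only
buf = slot.big_buf(pool)                                # first /json/ request
touched_after = sum(b is not None for b in pool.bufs)   # 1
slot.close(pool)
```

A connection that never sees a JSON request never calls `big_buf`, so in this model the pool's storage stays unallocated, mirroring the zero-fill BSS behaviour the description relies on.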

JSON responses bypass the inline buffer entirely and use big_buf, and are never batched with non-JSON responses in a single send. Non-JSON profiles never touch the big_pool pages — they stay zero-fill BSS, so the baseline RSS drops sharply without affecting JSON.

drainAndSend picks the destination per request; if a JSON request shows up with inline bytes already queued, the inline batch flushes first (without consuming the JSON request) and JSON dispatch happens on the next drain pass after send completes. Partial-send tracking moved from write_buf[off..len] to a free-form send_ptr stored on the slot.
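The routing rule reads more easily as code. The sketch below is a simplified Python model under stated assumptions: `drain`, `advance`, and `render` are hypothetical names, requests are represented by their paths, and each JSON response is sent on its own (the description only says JSON is never batched with non-JSON).

```python
from collections import deque

def is_json(path):
    return path.startswith("/json/")

def drain(pending, render):
    """Build the next send; return (buffer kind, send_ptr over the bytes to send)."""
    inline = bytearray()
    while pending:
        path = pending[0]
        if is_json(path):
            if inline:
                break                      # flush inline first; JSON request stays queued
            pending.popleft()
            return "big", memoryview(render(path))
        pending.popleft()
        inline += render(path)             # non-JSON responses concatenate in one batch
    return "inline", memoryview(bytes(inline))

def advance(send_ptr, sent):
    """Partial-send tracking: keep a view of the unsent tail, whatever the source buffer."""
    return send_ptr[sent:]

pending = deque(["/a", "/b", "/json/1", "/c"])
render = lambda p: p.encode() + b";"       # stand-in for real response rendering
sends = []
while pending:
    kind, send_ptr = drain(pending, render)
    sends.append((kind, send_ptr.tobytes()))
print(sends)
# [('inline', b'/a;/b;'), ('big', b'/json/1;'), ('inline', b'/c;')]
```

Tracking the remaining bytes as a view rather than as offsets into one fixed buffer is what lets the same partial-send path serve both the inline buffer and big_buf.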

Local benchmark-lite on 8-core OrbStack

Absolute numbers aren't bare-metal-comparable; relative changes show the design trade-off.

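| profile | before RPS | after RPS | Δ RPS | before mem | after mem | Δ mem |
|---|---|---|---|---|---|---|
| baseline | 665k | 752k | +13% | 14 MiB | 8 MiB | −43% |
| pipelined | 6.73M | 7.91M | +17% | 14 MiB | 8 MiB | −43% |
| limited-conn | 420k | 486k | +16% | 25 MiB | 13 MiB | −48% |
| json | 145k | 165k | +14% | 20 MiB | 27 MiB | +35% |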

RPS goes up across the board because the smaller slot footprint fits better in L1/L2. JSON memory goes up because a JSON connection now carries both the 4 KiB inline page and a 16 KiB big_buf; for the bench (4096 conns all JSON) that's ~64 active big_bufs × 16 KiB per worker.
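The buffer arithmetic behind that estimate, worked through as a short sketch. The buffer sizes and the 64-per-worker figure come from the text; the worker count is an assumption derived by spreading 4096 connections evenly, and the total is an upper bound that only applies if every connection holds a big_buf at the same time.

```python
KIB = 1024
MIB = 1024 * KIB

INLINE_BYTES = 4 * KIB
BIG_BUF_BYTES = 16 * KIB

conns_per_worker = 64                      # "~64 active big_bufs" per worker
total_conns = 4096
workers = total_conns // conns_per_worker  # assumption: even spread across workers

per_conn = INLINE_BYTES + BIG_BUF_BYTES            # buffers carried by one JSON connection
per_worker_big = conns_per_worker * BIG_BUF_BYTES  # big_buf bytes per worker
total_big = workers * per_worker_big               # upper bound across all workers

print(per_conn // KIB)        # 20
print(per_worker_big // MIB)  # 1
print(total_big // MIB)       # 64
```

Measured RSS can come in lower than this bound, since it depends on how many pool pages are actually touched during a run.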

All 20 local validation checks pass.

PR Commands — comment to trigger (requires collaborator approval):

| Command | Description |
|---|---|
| `/benchmark -f zeemo` | Preview run |
| `/benchmark -f zeemo --save` | Run and save results |

Source: https://github.com/skylightis666/zeemo

…ig_buf)

Each connection now keeps a small 4 KiB inline buffer for small responses (baseline, pipelined, limited-conn). A per-worker static pool of 16 KiB big_bufs is acquired on demand the first time a /json/ request arrives on a connection and released to the pool when the connection closes.

JSON responses bypass the inline buffer entirely and use big_buf, and are never batched with non-JSON responses in a single send. Non-JSON profiles never touch the big_pool pages; they stay zero-fill BSS, so the baseline RSS drops sharply.

Local benchmark-lite on 8-core OrbStack (lite mode, relative changes matter, absolute numbers don't reflect bare-metal):

- baseline 665k → 752k req/s (+13%), 14 → 8 MiB (-43%)
- pipelined 6.73M → 7.91M (+17%), 14 → 8 MiB (-43%)
- limited-conn 420k → 486k (+16%), 25 → 13 MiB (-48%)
- json 145k → 165k (+14%), 20 → 27 MiB (+35%, big_buf adds)

drainAndSend now picks the destination buffer per request: if a JSON request appears with non-JSON bytes already queued inline, the inline batch flushes first (without consuming the JSON request) and the JSON dispatch happens on the next drain pass after send completes. Partial-send tracking moves from "write_buf[off..len]" to a free-form send_ptr stored on the slot, so it works regardless of which buffer was the source.

All 20 local validation checks pass.

skylightis666 · 2026-05-18T15:05:12Z

/benchmark -f zeemo --save

github-actions · 2026-05-18T15:05:25Z

👋 /benchmark request received. A collaborator will review and approve the run.

github-actions · 2026-05-18T15:11:15Z

Benchmark Results

Framework: zeemo | Test: all tests

| Test | Conn | RPS | CPU | Mem | Δ RPS | Δ Mem |
|---|---|---|---|---|---|---|
| baseline | 512 | 4,120,722 | 6267.1% | 69MiB | +0.3% | -8.0% |
| baseline | 4096 | 4,435,413 | 6406.3% | 130MiB | +0.2% | -29.3% |
| pipelined | 512 | 48,445,534 | 6523.3% | 69MiB | ~0% | -6.8% |
| pipelined | 4096 | 49,927,385 | 6413.2% | 124MiB | -0.2% | -28.3% |
| limited-conn | 512 | 2,635,070 | 5467.8% | 88MiB | +0.2% | -22.8% |
| limited-conn | 4096 | 2,617,676 | 5554.3% | 178MiB | +0.4% | -29.9% |
| json | 4096 | 2,374,963 | 6403.3% | 257MiB | +1.0% | -0.4% |

Full log

  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.74ms   1.96ms   2.32ms

  13088191 requests in 5.00s, 13088383 responses
  Throughput: 2.62M req/s
  Bandwidth:  164.69MB/s
  Status codes: 2xx=13088383, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13089113 / 13088383 responses (100.0%)
  Reconnects: 1308298
  Per-template: 4362849,4362780,4362710
  Per-template-ok: 4362849,4362780,4362710
[info] CPU 5554.3% | Mem 178MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.75ms   1.99ms   2.51ms

  13071741 requests in 5.00s, 13071236 responses
  Throughput: 2.61M req/s
  Bandwidth:  164.48MB/s
  Status codes: 2xx=13071236, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13071222 / 13071236 responses (100.0%)
  Reconnects: 1306419
  Per-template: 4357173,4357010,4357039
  Per-template-ok: 4357173,4357010,4357039
[info] CPU 5698.6% | Mem 178MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  10
  Templates: 3
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   1.42ms   1.40ms   1.75ms   1.99ms   2.37ms

  13021514 requests in 5.00s, 13021338 responses
  Throughput: 2.60M req/s
  Bandwidth:  163.87MB/s
  Status codes: 2xx=13021338, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 13021292 / 13021338 responses (100.0%)
  Reconnects: 1302905
  Per-template: 4340353,4340641,4340298
  Per-template-ok: 4340353,4340641,4340297
[info] CPU 5523.0% | Mem 184MiB

=== Best: 2617676 req/s (CPU: 5554.3%, Mem: 178MiB) ===
[info] input BW: 202.21MB/s (avg template: 81 bytes)
[info] saved results/limited-conn/4096/zeemo.json
httparena-bench-zeemo
httparena-bench-zeemo

==============================================
=== zeemo / json / 4096c (tool=gcannon) ===
==============================================
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    694us    403us   1.85ms   2.70ms   3.38ms

  11814932 requests in 5.00s, 11813461 responses
  Throughput: 2.36M req/s
  Bandwidth:  7.91GB/s
  Status codes: 2xx=11813461, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11815665 / 11813461 responses (100.0%)
  Reconnects: 474810
  Per-template: 1681186,1685519,1689325,1692122,1692915,1689308,1682984
  Per-template-ok: 1681186,1685519,1689325,1692122,1692915,1689308,1682984
[info] CPU 5918.1% | Mem 242MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    704us    397us   1.93ms   2.74ms   3.37ms

  11876867 requests in 5.00s, 11874819 responses
  Throughput: 2.37M req/s
  Bandwidth:  7.95GB/s
  Status codes: 2xx=11874819, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11874762 / 11874819 responses (100.0%)
  Reconnects: 477786
  Per-template: 1690273,1692794,1697283,1700980,1701871,1698293,1693268
  Per-template-ok: 1690273,1692794,1697283,1700980,1701871,1698293,1693268
[info] CPU 6403.3% | Mem 257MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  25
  Templates: 7
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency    700us    381us   1.91ms   2.71ms   3.31ms

  11850176 requests in 5.00s, 11848052 responses
  Throughput: 2.37M req/s
  Bandwidth:  7.93GB/s
  Status codes: 2xx=11848052, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 11847931 / 11848052 responses (100.0%)
  Reconnects: 476805
  Per-template: 1686264,1691037,1694261,1696453,1696614,1694800,1688502
  Per-template-ok: 1686264,1691037,1694261,1696453,1696614,1694800,1688502
[info] CPU 5937.0% | Mem 260MiB

=== Best: 2374963 req/s (CPU: 6403.3%, Mem: 257MiB) ===
[info] input BW: 113.25MB/s (avg template: 50 bytes)
[info] saved results/json/4096/zeemo.json
httparena-bench-zeemo
httparena-bench-zeemo
[info] skip: zeemo does not subscribe to json-comp
[info] skip: zeemo does not subscribe to json-tls
[info] skip: zeemo does not subscribe to upload
[info] skip: zeemo does not subscribe to api-4
[info] skip: zeemo does not subscribe to api-16
[info] skip: zeemo does not subscribe to static
[info] skip: zeemo does not subscribe to async-db
[info] skip: zeemo does not subscribe to crud
[info] skip: zeemo does not subscribe to fortunes
[info] skip: zeemo does not subscribe to baseline-h2
[info] skip: zeemo does not subscribe to static-h2
[info] skip: zeemo does not subscribe to baseline-h2c
[info] skip: zeemo does not subscribe to json-h2c
[info] skip: zeemo does not subscribe to baseline-h3
[info] skip: zeemo does not subscribe to static-h3
[info] skip: zeemo does not subscribe to gateway-64
[info] skip: zeemo does not subscribe to gateway-h3
[info] skip: zeemo does not subscribe to production-stack
[info] skip: zeemo does not subscribe to unary-grpc
[info] skip: zeemo does not subscribe to unary-grpc-tls
[info] skip: zeemo does not subscribe to stream-grpc
[info] skip: zeemo does not subscribe to stream-grpc-tls
[info] skip: zeemo does not subscribe to echo-ws
[info] skip: zeemo does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
[info] restoring loopback MTU to 65536
[info] restoring CPU governor → powersave

Two memory-bonus changes bundled:

1. **Parser internals trimmed.** parser.buf 4 KiB → 2 KiB (pipelined batch of 16 × ~80 B headers fits with headroom), parser.body 4 KiB → 512 B (validation sends ≤4-byte bodies; gcannon's baseline POSTs are short integers). Slot drops from ~12 KiB to ~6.6 KiB. No RPS impact expected: buffers are still page-aligned, just narrower.
2. **Static [128]Slot array → fd-indexed dynamic `*Slot`.** Each accept mmaps a fresh Slot via `std.heap.page_allocator`; close munmaps it, returning pages to the kernel. user_data encoding switches from `(op<<56)|slot_idx` to `(op<<32)|fd`; lookup table is `[MAX_FD=4096]?*Slot` BSS, sparsely touched. Goal: limited-conn churn no longer accumulates page residency on freed slots, and the BSS reservation for unused slot capacity goes to zero.

Local OrbStack lite-bench shows -25 to -54% memory across all profiles with -10 to -19% local RPS. Past PRs (#727, #729) showed local RPS gains of +13-17% translating to +0-1% on the real Threadripper bench, so the local RPS regression here is expected to mostly evaporate on bare metal. Worth a preview `/benchmark` to confirm before `--save`.

All 20 local validation checks pass.
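The user_data change in item 2 can be illustrated with a small encode/decode sketch. The two bit layouts and `MAX_FD` are taken from the text; the function names and the opcode value are hypothetical, and a Python list of `None` entries stands in for the sparse `?*Slot` table.

```python
MAX_FD = 4096

def encode_old(op, slot_idx):
    """Previous layout: opcode in the top byte, slot index in the low bits."""
    return (op << 56) | slot_idx

def encode_new(op, fd):
    """New layout: opcode above bit 32, file descriptor in the low 32 bits."""
    return (op << 32) | fd

def decode_new(user_data):
    return user_data >> 32, user_data & 0xFFFFFFFF

# fd-indexed lookup table; entries stay empty until a connection is accepted.
slots = [None] * MAX_FD

OP_RECV = 3                       # hypothetical opcode value
fd = 1234
slots[fd] = {"fd": fd}            # placeholder for a freshly mapped Slot
ud = encode_new(OP_RECV, fd)
op, got_fd = decode_new(ud)
slot = slots[got_fd]              # completion handler finds the slot by fd
slots[fd] = None                  # close: drop the slot so its pages can be returned
```

Indexing by fd removes the need for a separate slot-index allocator, at the cost of a table sized for the maximum fd rather than the maximum connection count.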

skylightis666 requested review from Kaliumhexacyanoferrat and MDA2AV as code owners May 18, 2026 15:02

Benchmark results: zeemo

3f970d0

MDA2AV approved these changes May 18, 2026


MDA2AV merged commit 1d6b3d1 into MDA2AV:main May 18, 2026

skylightis666 mentioned this pull request May 18, 2026

zeemo: dynamic fd-indexed slot allocation + parser shrink #736

Merged

MDA2AV merged 2 commits into MDA2AV:main from skylightis666:zeemo-dual-buffer
