fix: heap-allocate thread-local random buffer to reduce thread creation overhead by johnathan79717 · Pull Request #20488 · AztecProtocol/aztec-packages

johnathan79717 · 2026-02-13T13:55:53Z

Summary

Moves the 1 MiB thread_local random buffer in engine.cpp from inline TLS (.tbss) to heap allocation on first use
Reduces per-thread TLS footprint from ~1 MiB to 16 bytes, dropping thread pool creation time from ~18 ms to ~1.6 ms for 31 threads
Fixes UltraHonk verification being slower at 32 cores than at 8 cores

Root cause

The RandomBufferWrapper struct had a uint8_t buffer[1 << 20] inline array declared thread_local. This placed 1 MiB in the ELF .tbss section, meaning every pthread_create had to allocate and zero-initialize 1 MiB of TLS. With 31 worker threads, this added ~18 ms to thread pool creation — which dominated verification time at high core counts.
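The change can be sketched as follows. This is a minimal illustration, not the code from engine.cpp: the names `InlineRandomBuffer`, `HeapRandomBuffer`, `thread_random_buffer`, and `RANDOM_BUFFER_SIZE` are hypothetical, and the pointer-plus-offset layout is only one plausible arrangement consistent with the 16-byte figure on a 64-bit target.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical names throughout; the real struct in engine.cpp may differ.
constexpr std::size_t RANDOM_BUFFER_SIZE = std::size_t{1} << 20; // 1 MiB

// Before: the array is part of the object itself. Declaring an instance
// thread_local puts the full 1 MiB into the TLS block of every thread.
struct InlineRandomBuffer {
    std::uint8_t buffer[RANDOM_BUFFER_SIZE];
    std::size_t offset;
};

// After: only a pointer and an offset live in TLS. The 1 MiB is allocated
// on the heap the first time a given thread asks for it.
struct HeapRandomBuffer {
    std::unique_ptr<std::uint8_t[]> buffer;
    std::size_t offset = 0;

    std::uint8_t* data()
    {
        if (!buffer) {
            // make_unique<T[]>(n) value-initializes, so the bytes start zeroed,
            // matching the zero-filled state a .tbss array would have had.
            buffer = std::make_unique<std::uint8_t[]>(RANDOM_BUFFER_SIZE);
        }
        return buffer.get();
    }
};

// Accessor for the per-thread instance. Threads that never call data()
// never pay for the 1 MiB allocation.
inline HeapRandomBuffer& thread_random_buffer()
{
    thread_local HeapRandomBuffer wrapper;
    return wrapper;
}
```

With this shape, worker threads that never draw random bytes carry only the small wrapper in TLS, which is why thread creation no longer scales with the buffer size.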

Benchmark results (parity_base circuit)

With IPA (noir-rollup):

Cores	Before	After
8	63 ms	59 ms
32	72 ms	51 ms
64	87 ms	53 ms

Without IPA (noir-recursive-no-zk):

Cores	Before	After
8	24 ms	19 ms
32	41 ms	25 ms
64	63 ms	25 ms

Remaining overhead at high core counts

After this fix there is still a small overhead at 64 cores vs 32 cores for IPA, and at 32/64 cores vs 8 cores for non-IPA. Two sources:

Thread creation still costs ~1.6 ms for 31 threads and ~3 ms for 63 threads. This is the baseline cost of pthread_create, the remaining ~3 KiB of TLS, and stack mapping, paid once on the first parallel_for call.
notify_all() wakes all sleeping workers regardless of how much work there is. For the non-IPA 66-point MSM, all 31 (or 63) threads wake up, acquire the mutex, find no work left, and go back to sleep. This mutex contention scales linearly with thread count.

Follow-up improvements that could close this gap:

Lazy pool growth: only create worker threads on demand in start_tasks(), so a 66-point MSM never spawns 31 threads
Selective notify_one: wake only as many workers as there are iterations, avoiding spurious wakeups
MSM thread capping: cap thread count in batch_multi_scalar_mul based on total scalar count
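Two of these follow-ups, selective wakeups and thread capping, might look roughly like the sketch below. None of it is taken from the barretenberg thread pool: `workers_to_wake`, `wake_workers`, `capped_thread_count`, and the `min_scalars_per_thread` threshold are all hypothetical, and the sketch assumes each iteration can keep at most one worker busy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: never wake more workers than there are iterations.
inline std::size_t workers_to_wake(std::size_t num_iterations, std::size_t num_workers)
{
    return std::min(num_iterations, num_workers);
}

// Works with std::condition_variable or any type that exposes
// notify_one() and notify_all().
template <typename CondVar>
void wake_workers(CondVar& cv, std::size_t num_iterations, std::size_t num_workers)
{
    const std::size_t count = workers_to_wake(num_iterations, num_workers);
    if (count == num_workers) {
        cv.notify_all(); // every worker has something to do
        return;
    }
    for (std::size_t i = 0; i < count; ++i) {
        cv.notify_one(); // the remaining workers stay asleep
    }
}

// Hypothetical cap for an MSM: use one thread per min_scalars_per_thread
// scalars, at least one thread, and never more than are available.
inline std::size_t capped_thread_count(std::size_t total_scalars,
                                       std::size_t available_threads,
                                       std::size_t min_scalars_per_thread)
{
    assert(min_scalars_per_thread > 0);
    const std::size_t useful = std::max<std::size_t>(1, total_scalars / min_scalars_per_thread);
    return std::min(useful, std::max<std::size_t>(1, available_threads));
}
```

For example, with an illustrative threshold of 64 scalars per thread, `capped_thread_count(66, 32, 64)` yields a single thread, so a 66-point MSM would not touch the worker pool at all. The right threshold would have to come from benchmarking.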

Test plan

Existing CI tests pass (no functional change — buffer is allocated identically, just on heap instead of TLS)
Verify .tbss section shrinks from ~1 MiB to ~3 KiB via readelf -S bin/bb

Resolves AztecProtocol/barretenberg#1624


johnathan79717 wants to merge 1 commit into merge-train/barretenberg from jh/fix-tls-thread-overhead
