fix: heap-allocate thread-local random buffer to reduce thread creation overhead #20488

Open
johnathan79717 wants to merge 1 commit into merge-train/barretenberg from jh/fix-tls-thread-overhead

johnathan79717 commented Feb 13, 2026

Summary

  • Moves the 1 MiB thread_local random buffer in engine.cpp from inline TLS (.tbss) to heap allocation on first use
  • Reduces the buffer's per-thread TLS footprint from ~1 MiB to 16 bytes (total TLS drops to ~3 KiB), cutting thread pool creation time from ~18 ms to ~1.6 ms for 31 threads
  • Fixes UltraHonk verification being slower at 32 cores than at 8 cores

Root cause

The RandomBufferWrapper struct held an inline uint8_t buffer[1 << 20] array in thread_local storage. This placed 1 MiB in the ELF .tbss section, so every pthread_create had to allocate and zero-initialize 1 MiB of TLS. With 31 worker threads, this added ~18 ms to thread pool creation, which dominated verification time at high core counts.
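The difference between the two layouts can be sketched as follows. This is a minimal illustration, not the code from engine.cpp: InlineRandomBuffer, HeapRandomBuffer, local_random_buffer and RANDOM_BUFFER_SIZE are hypothetical names, and the real struct may carry other fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical names throughout; the real wrapper in engine.cpp may differ.
constexpr std::size_t RANDOM_BUFFER_SIZE = std::size_t{1} << 20; // 1 MiB

// Before: the array is part of the object itself, so declaring the object
// thread_local puts the full 1 MiB into every thread's TLS block.
struct InlineRandomBuffer {
    std::uint8_t buffer[RANDOM_BUFFER_SIZE];
};

// After: only a pointer and an offset live in TLS. The 1 MiB is allocated on
// the heap the first time a given thread asks for it.
struct HeapRandomBuffer {
    std::unique_ptr<std::uint8_t[]> buffer;
    std::ptrdiff_t offset = -1;

    std::uint8_t* data() {
        if (!buffer) {
            // make_unique value-initializes the array, so it starts zeroed.
            buffer = std::make_unique<std::uint8_t[]>(RANDOM_BUFFER_SIZE);
        }
        assert(buffer != nullptr); // allocated from here on
        return buffer.get();
    }
};

// Each thread gets its own small wrapper; threads that never call data()
// never pay for the 1 MiB allocation.
inline HeapRandomBuffer& local_random_buffer() {
    static thread_local HeapRandomBuffer wrapper;
    return wrapper;
}
```

On a typical 64-bit target the heap-backed wrapper is a pointer plus an offset, which lines up with the 16-byte figure quoted in the summary. Worker threads that never request random bytes skip the allocation entirely.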

Benchmark results (parity_base circuit)

With IPA (noir-rollup):

  Cores | Before | After
  8     | 63 ms  | 59 ms
  32    | 72 ms  | 51 ms
  64    | 87 ms  | 53 ms

Without IPA (noir-recursive-no-zk):

  Cores | Before | After
  8     | 24 ms  | 19 ms
  32    | 41 ms  | 25 ms
  64    | 63 ms  | 25 ms

Remaining overhead at high core counts

After this fix there is still a small overhead at 64 cores vs 32 cores for IPA, and at 32/64 cores vs 8 cores for non-IPA. Two sources:

  1. Thread creation still costs ~1.6 ms for 31 threads and ~3 ms for 63 threads. This is the baseline cost of pthread_create, the remaining TLS (~3 KiB), and stack mapping, paid once on the first parallel_for call.

  2. notify_all() wakes all sleeping workers regardless of how much work there is. For the non-IPA 66-point MSM, all 31 (or 63) threads wake up, acquire the mutex, find no work left, and go back to sleep. This mutex contention scales linearly with thread count.

Follow-up improvements that could close this gap:

  • Lazy pool growth: only create worker threads on demand in start_tasks(), so a 66-point MSM never spawns 31 threads
  • Selective notify_one: wake only as many workers as there are iterations, avoiding spurious wakeups
  • MSM thread capping: cap thread count in batch_multi_scalar_mul based on total scalar count
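Two of these follow-ups reduce to small sizing decisions, sketched below. Everything here is an assumption for illustration: workers_to_wake, wake_selectively, capped_thread_count and the min_points_per_thread threshold are hypothetical and are not part of the existing thread pool or of batch_multi_scalar_mul.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>

// Hypothetical helper: never wake more workers than there are iterations.
inline std::size_t workers_to_wake(std::size_t num_iterations, std::size_t num_workers) {
    return std::min(num_iterations, num_workers);
}

// Selective wake-up: call notify_one() once per useful worker instead of a
// single notify_all(). Returns how many notifications were issued.
inline std::size_t wake_selectively(std::condition_variable& cv,
                                    std::size_t num_iterations,
                                    std::size_t num_workers) {
    const std::size_t n = workers_to_wake(num_iterations, num_workers);
    for (std::size_t i = 0; i < n; ++i) {
        cv.notify_one();
    }
    return n;
}

// MSM thread capping: use at most one thread per min_points_per_thread
// scalars, at least one thread, and never more than max_threads.
// The threshold value is a tuning assumption, not a measured constant.
inline std::size_t capped_thread_count(std::size_t num_points,
                                       std::size_t max_threads,
                                       std::size_t min_points_per_thread) {
    assert(max_threads > 0 && min_points_per_thread > 0);
    const std::size_t useful =
        std::max<std::size_t>(1, num_points / min_points_per_thread);
    return std::min(useful, max_threads);
}
```

With a hypothetical threshold of 64 points per thread, capped_thread_count(66, 32, 64) evaluates to 1, so a 66-point MSM would run on the calling thread alone. A suitable threshold would have to come from benchmarks.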

Test plan

  • Existing CI tests pass (no functional change: the buffer is used identically, just allocated on the heap instead of in TLS)
  • Verify .tbss section shrinks from ~1 MiB to ~3 KiB via readelf -S bin/bb

Resolves AztecProtocol/barretenberg#1624

…on overhead

The 1 MiB `thread_local` random buffer in `engine.cpp` was stored inline
in the TLS segment (.tbss), causing every `pthread_create` to allocate and
zero-initialize 1 MiB per thread. With 31 worker threads this added ~18 ms
of overhead to thread pool creation — making UltraHonk verification slower
at 32 cores than at 8 cores.

Moving the buffer to heap allocation on first use reduces the TLS footprint
from ~1 MiB to 16 bytes per thread, dropping thread pool creation from
~18 ms to ~1.6 ms for 31 threads.

Benchmark results (parity_base circuit, noir-rollup verifier target):

  Cores | Before | After
  8     | 63 ms  | 59 ms
  32    | 72 ms  | 51 ms
  64    | 87 ms  | 53 ms

Resolves AztecProtocol/barretenberg#1624