
Profiling, benchmarking instrumentation and optimization plan #322

Open
prk-Jr wants to merge 7 commits into main from feat/optimize-ts

Conversation


prk-Jr (Collaborator) commented Feb 18, 2026

Summary

  • Add OPTIMIZATION.md — WASM guest profiling results, CPU breakdown, and phased optimization plan
  • Add scripts/profile.sh for WASM guest profiling via fastly compute serve --profile-guest (Firefox Profiler-compatible output)
  • Add scripts/benchmark.sh for TTFB analysis, cold start detection, endpoint latency breakdown, and load testing with --save/--compare support
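
The TTFB numbers presumably come from curl's -w write-out timings, which report seconds. A minimal sketch of the seconds-to-milliseconds conversion step (the helper name and sample value below are illustrative, not taken from benchmark.sh):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the TTFB conversion step. In the real script the
# seconds value would come from curl's write-out variables, e.g.:
#   curl -s -o /dev/null -w '%{time_starttransfer}' "$BASE_URL/"
to_ms() {
  awk -v s="$1" 'BEGIN { printf "%.2f\n", s * 1000 }'
}

to_ms 0.10583   # 105.83
```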

Test plan

  • cargo test passes (no code changes, tooling and docs only)
  • Run ./scripts/profile.sh and verify flame graph output in benchmark-results/profiles/
  • Run ./scripts/benchmark.sh --quick against local Viceroy and verify output
  • Team review of OPTIMIZATION.md — especially "Decisions Needed" section

Introduce RequestTimer for per-request phase tracking (init, backend,
process, total) exposed via Server-Timing response headers. Add
benchmark tooling with --profile mode for collecting timing data.
Document phased optimization plan covering streaming architecture,
code-level fixes, and open design questions for team review.
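
For reference, the Server-Timing phases described above can be inspected from the shell. The header value below is a hypothetical shape based on the four phases the commit message names (init, backend, process, total); the exact metric names RequestTimer emits are an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical Server-Timing value; phase names mirror the four RequestTimer
# phases, but the exact emitted format is an assumption. A live check could
# use: curl -sD - -o /dev/null "$BASE_URL/" | grep -i '^server-timing'
header='init;dur=1.2, backend;dur=88.4, process;dur=12.7, total;dur=102.3'

# Extract the backend phase duration in milliseconds
backend_ms=$(printf '%s' "$header" | grep -o 'backend;dur=[0-9.]*' | cut -d= -f2)
echo "$backend_ms"   # 88.4
```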

aram356 (Collaborator) commented Feb 18, 2026

@prk-Jr Have you seen this? https://www.fastly.com/blog/profiling-fastly-compute-applications

I don't think we need to make code changes to profile.

RequestTimer and Server-Timing header were premature — WASM guest
profiling via profile.sh gives better per-function visibility without
runtime overhead. Also strips dead --profile mode from benchmark.sh.
build.rs already resolves trusted-server.toml + env vars at compile time
and embeds the result. Replace Settings::from_toml() with direct
toml::from_str() to skip the config crate pipeline on every request.
Profiling confirms: ~5-8% → ~3.3% CPU per request.
prk-Jr linked an issue Feb 19, 2026 that may be closed by this pull request
- OPTIMIZATION.md: profiling results, CPU breakdown, phased optimization
  plan covering streaming fixes, config crate elimination, and
  stream_to_client() architecture
- scripts/profile.sh: WASM guest profiling via --profile-guest with
  Firefox Profiler-compatible output
- scripts/benchmark.sh: TTFB analysis, cold start detection, endpoint
  latency breakdown, and load testing with save/compare support
prk-Jr changed the title from "Add Server-Timing instrumentation and optimization plan" to "Profiling, benchmarking instrumentation and optimization plan" on Feb 19, 2026
aram356 marked this pull request as ready for review February 23, 2026 16:44

ChristianPavilonis (Collaborator) left a comment


Benchmark Script Review

Ran the benchmarks locally against golf.com (with TRUSTED_SERVER__PUBLISHER__ORIGIN_URL=https://golf.com) and confirmed the full HTML processing pipeline is exercised. The tooling is well-structured — a few suggestions to make branch comparisons more reliable:

1. run_endpoint_latency output is silently swallowed

The timed_curl function both printfs timing data to stdout and returns raw JSON. But the call sites redirect stdout to /dev/null:

# scripts/benchmark.sh:178, 183, 203
timed_curl GET "$BASE_URL/" "GET / (first)" > /dev/null

This discards the formatted output along with the JSON. In the full benchmark run, the "COLD START ANALYSIS" and "ENDPOINT LATENCY (WARM)" sections print headers but no data rows.

Suggested fix: Separate the display from the return value. Either write the formatted output to stderr (>&2) so it survives the redirect, or restructure timed_curl to only return JSON and have the caller format it:

# Option A: print to stderr
printf "  %-40s  HTTP %s  TTFB: %8.2f ms  Total: %8.2f ms  Size: %s bytes\n" \
    "$label" "$code" "$ttfb_ms" "$total_ms" "$size" >&2

2. --compare does a text diff — not actionable for numeric data

compare_results runs diff -u on raw output files. Since every timing number differs between runs (even on the same branch), this produces a wall of changes with no signal. It's impossible to tell whether a 3ms difference in P50 is a real improvement or noise.

Suggested fix: Emit a machine-readable summary (JSON or key=value) alongside the human-readable output, then have --compare parse and report percentage deltas on key metrics. Even a minimal version would help:

# At the end of run_all, emit a parseable summary:
echo "---METRICS---"
echo "publisher_proxy_p50=$p50"
echo "publisher_proxy_p95=$p95"
echo "publisher_proxy_rps=$rps"
echo "static_js_p50=$static_p50"
# ...

# In compare_results, parse both files and compute deltas:
# publisher_proxy_p50: 105.8ms -> 98.2ms (-7.2%)
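
A hedged sketch of that parsing step, assuming the key=value summary format above (the function name, file handling, and sample numbers are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical compare step: parse two key=value metric dumps and print
# percentage deltas. The file contents below stand in for two saved runs.
baseline=$(mktemp); candidate=$(mktemp)
printf 'publisher_proxy_p50=105.8\npublisher_proxy_rps=92.4\n' > "$baseline"
printf 'publisher_proxy_p50=98.2\npublisher_proxy_rps=101.1\n' > "$candidate"

compare_metrics() {
  # Load the first file into old[], then for each matching key in the
  # second file compute (new - old) / old * 100
  awk -F= 'NR==FNR { old[$1] = $2; next }
           ($1 in old) {
             delta = ($2 - old[$1]) / old[$1] * 100
             printf "%s: %s -> %s (%+.1f%%)\n", $1, old[$1], $2, delta
           }' "$1" "$2"
}

compare_metrics "$baseline" "$candidate"
rm -f "$baseline" "$candidate"
```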

3. No warmup phase before load tests

The hey runs in run_load_test start measuring immediately. The first handful of requests pay connection pool setup and potential JIT/cache warming costs, which skew the averages and inflate the "Slowest" number.

Suggested fix: Add a throwaway warmup burst before each measured run:

run_load_test() {
    # ...
    local warmup_requests=50

    # Warmup (discarded)
    hey -n "$warmup_requests" -c "$concurrency" -t 30 "$BASE_URL/" > /dev/null 2>&1

    # Measured run
    echo -e "${BOLD}GET / (publisher proxy) - ${total_requests} requests, ${concurrency} concurrent${RESET}"
    hey -n "$total_requests" -c "$concurrency" -t 30 "$BASE_URL/" 2>&1 | \
        grep -E "(Requests/sec|Total:|Slowest:|Fastest:|Average:|requests done)|Status code|Latency distribution" -A 20
    # ...
}

4. Single-run variance makes branch comparison unreliable

One hey run is noisy — P50 can swing 5-10% between identical runs on the same build due to OS scheduling, network jitter (for the publisher proxy path), and GC pressure. A single data point can't distinguish a real 5% improvement from noise.

Suggested fix: Run each load test 3 times and report mean ± stddev for the key metrics. This doesn't need to be fancy — even running hey in a loop, grepping out the Requests/sec and percentile lines, and averaging them would significantly improve confidence:

run_load_test_with_repeats() {
    local repeats=3
    local rps_values=()

    for r in $(seq 1 "$repeats"); do
        # warmup
        hey -n 50 -c "$concurrency" -t 30 "$url" > /dev/null 2>&1
        # measured
        local output
        output=$(hey -n "$total_requests" -c "$concurrency" -t 30 "$url" 2>&1)
        local rps
        rps=$(echo "$output" | grep "Requests/sec" | awk '{print $2}')
        rps_values+=("$rps")
        echo "  Run $r: $rps req/sec"
    done
    # compute mean/stddev from rps_values...
}

This triples the benchmark runtime but makes the comparison actually trustworthy. A --quick mode could keep the single-run behavior for smoke testing.
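
The elided mean/stddev step can be a few lines of awk. A minimal sketch, assuming the values are collected as in the loop above (the helper name and sample numbers are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "compute mean/stddev from rps_values". Uses the
# population variance identity E[x^2] - E[x]^2; the three sample values
# stand in for three hey runs.
mean_stddev() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      printf "mean=%.1f stddev=%.1f\n", mean, sqrt(sumsq / n - mean * mean)
    }'
}

rps_values=(98.2 101.5 99.3)
mean_stddev "${rps_values[@]}"   # mean=99.7 stddev=1.4
```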


Development

Successfully merging this pull request may close these issues.

Document and investigate metrics for request and response
