
Profiling, benchmarking instrumentation and optimization plan #322

Open
prk-Jr wants to merge 7 commits into main from feat/optimize-ts

Conversation


prk-Jr (Collaborator) commented Feb 18, 2026

Summary

  • Add OPTIMIZATION.md — WASM guest profiling results, CPU breakdown, and phased optimization plan
  • Add scripts/profile.sh for WASM guest profiling via fastly compute serve --profile-guest (Firefox Profiler-compatible output)
  • Add scripts/benchmark.sh for TTFB analysis, cold start detection, endpoint latency breakdown, and load testing with --save/--compare support
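
The TTFB numbers presumably come from curl's -w write-out timings, which report seconds. A minimal sketch of the seconds-to-milliseconds conversion step (the helper name and sample value below are illustrative, not taken from benchmark.sh):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the TTFB conversion step. In the real script the
# seconds value would come from curl's write-out variables, e.g.:
#   curl -s -o /dev/null -w '%{time_starttransfer}' "$BASE_URL/"
to_ms() {
  awk -v s="$1" 'BEGIN { printf "%.2f\n", s * 1000 }'
}

to_ms 0.10583   # 105.83
```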

Test plan

  • cargo test passes (no code changes, tooling and docs only)
  • Run ./scripts/profile.sh and verify flame graph output in benchmark-results/profiles/
  • Run ./scripts/benchmark.sh --quick against local Viceroy and verify output
  • Team review of OPTIMIZATION.md — especially "Decisions Needed" section

Introduce RequestTimer for per-request phase tracking (init, backend,
process, total) exposed via Server-Timing response headers. Add
benchmark tooling with --profile mode for collecting timing data.
Document phased optimization plan covering streaming architecture,
code-level fixes, and open design questions for team review.
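
For reference, the Server-Timing phases described above can be inspected from the shell. The header value below is a hypothetical shape based on the four phases the commit message names (init, backend, process, total); the exact metric names RequestTimer emits are an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical Server-Timing value; phase names mirror the four RequestTimer
# phases, but the exact emitted format is an assumption. A live check could
# use: curl -sD - -o /dev/null "$BASE_URL/" | grep -i '^server-timing'
header='init;dur=1.2, backend;dur=88.4, process;dur=12.7, total;dur=102.3'

# Extract the backend phase duration in milliseconds
backend_ms=$(printf '%s' "$header" | grep -o 'backend;dur=[0-9.]*' | cut -d= -f2)
echo "$backend_ms"   # 88.4
```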

aram356 (Collaborator) commented Feb 18, 2026

@prk-Jr Have you seen this? https://www.fastly.com/blog/profiling-fastly-compute-applications

I don't think we need to make code changes to profile.

RequestTimer and Server-Timing header were premature — WASM guest
profiling via profile.sh gives better per-function visibility without
runtime overhead. Also strips dead --profile mode from benchmark.sh.
build.rs already resolves trusted-server.toml + env vars at compile time
and embeds the result. Replace Settings::from_toml() with direct
toml::from_str() to skip the config crate pipeline on every request.
Profiling confirms: ~5-8% → ~3.3% CPU per request.
prk-Jr linked an issue Feb 19, 2026 that may be closed by this pull request
- OPTIMIZATION.md: profiling results, CPU breakdown, phased optimization
  plan covering streaming fixes, config crate elimination, and
  stream_to_client() architecture
- scripts/profile.sh: WASM guest profiling via --profile-guest with
  Firefox Profiler-compatible output
- scripts/benchmark.sh: TTFB analysis, cold start detection, endpoint
  latency breakdown, and load testing with save/compare support
prk-Jr changed the title from "Add Server-Timing instrumentation and optimization plan" to "Profiling, benchmarking instrumentation and optimization plan" on Feb 19, 2026
aram356 marked this pull request as ready for review February 23, 2026 16:44

ChristianPavilonis (Collaborator) left a comment


Benchmark Script Review

Ran the benchmarks locally against golf.com (with TRUSTED_SERVER__PUBLISHER__ORIGIN_URL=https://golf.com) and confirmed the full HTML processing pipeline is exercised. The tooling is well-structured — a few suggestions to make branch comparisons more reliable:

1. run_endpoint_latency output is silently swallowed

The timed_curl function both printfs timing data to stdout and returns raw JSON. But the call sites redirect stdout to /dev/null:

# scripts/benchmark.sh:178, 183, 203
timed_curl GET "$BASE_URL/" "GET / (first)" > /dev/null

This discards the formatted output along with the JSON. In the full benchmark run, the "COLD START ANALYSIS" and "ENDPOINT LATENCY (WARM)" sections print headers but no data rows.

Suggested fix: Separate the display from the return value. Either write the formatted output to stderr (>&2) so it survives the redirect, or restructure timed_curl to only return JSON and have the caller format it:

# Option A: print to stderr
printf "  %-40s  HTTP %s  TTFB: %8.2f ms  Total: %8.2f ms  Size: %s bytes\n" \
    "$label" "$code" "$ttfb_ms" "$total_ms" "$size" >&2

2. --compare does a text diff — not actionable for numeric data

compare_results runs diff -u on raw output files. Since every timing number differs between runs (even on the same branch), this produces a wall of changes with no signal. It's impossible to tell whether a 3ms difference in P50 is a real improvement or noise.

Suggested fix: Emit a machine-readable summary (JSON or key=value) alongside the human-readable output, then have --compare parse and report percentage deltas on key metrics. Even a minimal version would help:

# At the end of run_all, emit a parseable summary:
echo "---METRICS---"
echo "publisher_proxy_p50=$p50"
echo "publisher_proxy_p95=$p95"
echo "publisher_proxy_rps=$rps"
echo "static_js_p50=$static_p50"
# ...

# In compare_results, parse both files and compute deltas:
# publisher_proxy_p50: 105.8ms -> 98.2ms (-7.2%)
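
A hedged sketch of that parsing step, assuming the key=value summary format above (the function name, file handling, and sample numbers are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical compare step: parse two key=value metric dumps and print
# percentage deltas. The file contents below stand in for two saved runs.
baseline=$(mktemp); candidate=$(mktemp)
printf 'publisher_proxy_p50=105.8\npublisher_proxy_rps=92.4\n' > "$baseline"
printf 'publisher_proxy_p50=98.2\npublisher_proxy_rps=101.1\n' > "$candidate"

compare_metrics() {
  # Load the first file into old[], then for each matching key in the
  # second file compute (new - old) / old * 100
  awk -F= 'NR==FNR { old[$1] = $2; next }
           ($1 in old) {
             delta = ($2 - old[$1]) / old[$1] * 100
             printf "%s: %s -> %s (%+.1f%%)\n", $1, old[$1], $2, delta
           }' "$1" "$2"
}

compare_metrics "$baseline" "$candidate"
rm -f "$baseline" "$candidate"
```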

3. No warmup phase before load tests

The hey runs in run_load_test start measuring immediately. The first handful of requests pay connection pool setup and potential JIT/cache warming costs, which skew the averages and inflate the "Slowest" number.

Suggested fix: Add a throwaway warmup burst before each measured run:

run_load_test() {
    # ...
    local warmup_requests=50

    # Warmup (discarded)
    hey -n "$warmup_requests" -c "$concurrency" -t 30 "$BASE_URL/" > /dev/null 2>&1

    # Measured run
    echo -e "${BOLD}GET / (publisher proxy) - ${total_requests} requests, ${concurrency} concurrent${RESET}"
    hey -n "$total_requests" -c "$concurrency" -t 30 "$BASE_URL/" 2>&1 | \
        grep -E "(Requests/sec|Total:|Slowest:|Fastest:|Average:|requests done)|Status code|Latency distribution" -A 20
    # ...
}

4. Single-run variance makes branch comparison unreliable

One hey run is noisy — P50 can swing 5-10% between identical runs on the same build due to OS scheduling, network jitter (for the publisher proxy path), and GC pressure. A single data point can't distinguish a real 5% improvement from noise.

Suggested fix: Run each load test 3 times and report mean ± stddev for the key metrics. This doesn't need to be fancy — even running hey in a loop, grepping out the Requests/sec and percentile lines, and averaging them would significantly improve confidence:

run_load_test_with_repeats() {
    local repeats=3
    local rps_values=()

    for r in $(seq 1 "$repeats"); do
        # warmup
        hey -n 50 -c "$concurrency" -t 30 "$url" > /dev/null 2>&1
        # measured
        local output
        output=$(hey -n "$total_requests" -c "$concurrency" -t 30 "$url" 2>&1)
        local rps
        rps=$(echo "$output" | grep "Requests/sec" | awk '{print $2}')
        rps_values+=("$rps")
        echo "  Run $r: $rps req/sec"
    done
    # compute mean/stddev from rps_values...
}

This triples the benchmark runtime but makes the comparison actually trustworthy. A --quick mode could keep the single-run behavior for smoke testing.
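
The elided mean/stddev step can be a few lines of awk. A minimal sketch, assuming the values are collected as in the loop above (the helper name and sample numbers are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "compute mean/stddev from rps_values". Uses the
# population variance identity E[x^2] - E[x]^2; the three sample values
# stand in for three hey runs.
mean_stddev() {
  printf '%s\n' "$@" | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      printf "mean=%.1f stddev=%.1f\n", mean, sqrt(sumsq / n - mean * mean)
    }'
}

rps_values=(98.2 101.5 99.3)
mean_stddev "${rps_values[@]}"   # mean=99.7 stddev=1.4
```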


Development

Successfully merging this pull request may close these issues.

Document and investigate metrics for request and response
