Profiling, benchmarking instrumentation and optimization plan #322
Conversation
Introduce RequestTimer for per-request phase tracking (init, backend, process, total) exposed via Server-Timing response headers. Add benchmark tooling with --profile mode for collecting timing data. Document phased optimization plan covering streaming architecture, code-level fixes, and open design questions for team review.
@prk-Jr Have you seen this? https://www.fastly.com/blog/profiling-fastly-compute-applications I don't think we need to make changes to profile
RequestTimer and Server-Timing header were premature — WASM guest profiling via profile.sh gives better per-function visibility without runtime overhead. Also strips dead --profile mode from benchmark.sh.
build.rs already resolves trusted-server.toml + env vars at compile time and embeds the result. Replace Settings::from_toml() with direct toml::from_str() to skip the config crate pipeline on every request. Profiling confirms: ~5-8% → ~3.3% CPU per request.
- OPTIMIZATION.md: profiling results, CPU breakdown, phased optimization plan covering streaming fixes, config crate elimination, and stream_to_client() architecture
- scripts/profile.sh: WASM guest profiling via --profile-guest with Firefox Profiler-compatible output
- scripts/benchmark.sh: TTFB analysis, cold start detection, endpoint latency breakdown, and load testing with save/compare support
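For reference, a minimal sketch of the kind of guest-profiling wrapper described above; this is not the actual contents of scripts/profile.sh, and the request count, startup wait, and default 127.0.0.1:7676 address are assumptions:

```bash
#!/usr/bin/env bash
# Sketch only: run Viceroy with guest profiling enabled, push some traffic
# through the Wasm guest, then stop the server so the profile is written out.
# The resulting profile can be inspected at https://profiler.firefox.com.
set -euo pipefail

fastly compute serve --profile-guest &     # flag used by scripts/profile.sh per this PR
server_pid=$!
trap 'kill "$server_pid" 2>/dev/null || true' EXIT

sleep 10                                   # assumed: enough time for build + startup
for _ in $(seq 1 25); do                   # assumed request count
    curl -sf -o /dev/null "http://127.0.0.1:7676/" || true
done

kill "$server_pid"
wait "$server_pid" 2>/dev/null || true
# Where the guest profile lands varies by CLI version; copy it under
# benchmark-results/profiles/ to keep it alongside benchmark runs.
```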
ChristianPavilonis left a comment
Benchmark Script Review
Ran the benchmarks locally against golf.com (with TRUSTED_SERVER__PUBLISHER__ORIGIN_URL=https://golf.com) and confirmed the full HTML processing pipeline is exercised. The tooling is well-structured — a few suggestions to make branch comparisons more reliable:
1. run_endpoint_latency output is silently swallowed
The timed_curl function both printfs timing data to stdout and returns raw JSON. But the call sites redirect stdout to /dev/null:
```bash
# scripts/benchmark.sh:178, 183, 203
timed_curl GET "$BASE_URL/" "GET / (first)" > /dev/null
```

This discards the formatted output along with the JSON. In the full benchmark run, the "COLD START ANALYSIS" and "ENDPOINT LATENCY (WARM)" sections print headers but no data rows.
Suggested fix: Separate the display from the return value. Either write the formatted output to stderr (>&2) so it survives the redirect, or restructure timed_curl to only return JSON and have the caller format it:
```bash
# Option A: print to stderr
printf " %-40s HTTP %s TTFB: %8.2f ms Total: %8.2f ms Size: %s bytes\n" \
    "$label" "$code" "$ttfb_ms" "$total_ms" "$size" >&2
```

2. --compare does a text diff — not actionable for numeric data
compare_results runs diff -u on raw output files. Since every timing number differs between runs (even on the same branch), this produces a wall of changes with no signal. It's impossible to tell whether a 3ms difference in P50 is a real improvement or noise.
Suggested fix: Emit a machine-readable summary (JSON or key=value) alongside the human-readable output, then have --compare parse and report percentage deltas on key metrics. Even a minimal version would help:
```bash
# At the end of run_all, emit a parseable summary:
echo "---METRICS---"
echo "publisher_proxy_p50=$p50"
echo "publisher_proxy_p95=$p95"
echo "publisher_proxy_rps=$rps"
echo "static_js_p50=$static_p50"
# ...

# In compare_results, parse both files and compute deltas:
# publisher_proxy_p50: 105.8ms -> 98.2ms (-7.2%)
```
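And the comparison side could look roughly like this; compare_metrics and the metric keys are illustrative names for this sketch, not functions that exist in the script today:

```bash
# Sketch: pull key=value metrics out of two saved result files and print % deltas.
compare_metrics() {
    local before="$1" after="$2" key old new
    for key in publisher_proxy_p50 publisher_proxy_p95 publisher_proxy_rps static_js_p50; do
        old=$(sed -n "s/^${key}=//p" "$before" | tail -n 1)
        new=$(sed -n "s/^${key}=//p" "$after" | tail -n 1)
        [ -n "$old" ] && [ -n "$new" ] || continue
        awk -v k="$key" -v o="$old" -v n="$new" \
            'BEGIN { printf "  %s: %s -> %s (%+.1f%%)\n", k, o, n, (n - o) / o * 100 }'
    done
}
```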
3. No warmup phase before load tests

The hey runs in run_load_test start measuring immediately. The first handful of requests pay connection pool setup and potential JIT/cache warming costs, which skew the averages and inflate the "Slowest" number.
Suggested fix: Add a throwaway warmup burst before each measured run:
```bash
run_load_test() {
    # ...
    local warmup_requests=50

    # Warmup (discarded)
    hey -n "$warmup_requests" -c "$concurrency" -t 30 "$BASE_URL/" > /dev/null 2>&1

    # Measured run
    echo -e "${BOLD}GET / (publisher proxy) - ${total_requests} requests, ${concurrency} concurrent${RESET}"
    hey -n "$total_requests" -c "$concurrency" -t 30 "$BASE_URL/" 2>&1 | \
        grep -E "(Requests/sec|Total:|Slowest:|Fastest:|Average:|requests done)|Status code|Latency distribution" -A 20
    # ...
}
```

4. Single-run variance makes branch comparison unreliable
One hey run is noisy — P50 can swing 5-10% between identical runs on the same build due to OS scheduling, network jitter (for the publisher proxy path), and GC pressure. A single data point can't distinguish a real 5% improvement from noise.
Suggested fix: Run each load test 3 times and report mean ± stddev for the key metrics. This doesn't need to be fancy — even running hey in a loop, grepping out the Requests/sec and percentile lines, and averaging them would significantly improve confidence:
```bash
run_load_test_with_repeats() {
    local repeats=3
    local rps_values=()
    for r in $(seq 1 "$repeats"); do
        # warmup
        hey -n 50 -c "$concurrency" -t 30 "$url" > /dev/null 2>&1
        # measured
        local output
        output=$(hey -n "$total_requests" -c "$concurrency" -t 30 "$url" 2>&1)
        local rps
        rps=$(echo "$output" | grep "Requests/sec" | awk '{print $2}')
        rps_values+=("$rps")
        echo " Run $r: $rps req/sec"
    done
    # compute mean/stddev from rps_values...
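    # (a sketch of that step: assumes every run produced a numeric value;
    # awk is used because bash arithmetic is integer-only)
    printf '%s\n' "${rps_values[@]}" | awk '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            mean = sum / n
            stddev = sqrt(sumsq / n - mean * mean)
            printf "  Requests/sec: mean %.1f +/- %.1f (n=%d)\n", mean, stddev, n
        }'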
}
```

This triples the benchmark runtime but makes the comparison actually trustworthy. A --quick mode could keep the single-run behavior for smoke testing.
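One hypothetical way to wire that in, assuming the existing --quick handling sets a variable such as QUICK (the names here are placeholders):

```bash
# Hypothetical: inside run_load_test_with_repeats, let --quick keep a single run
local repeats=3
if [ "${QUICK:-0}" = "1" ]; then
    repeats=1
fi
```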
Summary
- `OPTIMIZATION.md` — WASM guest profiling results, CPU breakdown, and phased optimization plan.
- `scripts/profile.sh` for WASM guest profiling via `fastly compute serve --profile-guest` (Firefox Profiler-compatible output)
- `scripts/benchmark.sh` for TTFB analysis, cold start detection, endpoint latency breakdown, and load testing with `--save`/`--compare` support

Test plan
- `cargo test` passes (no code changes, tooling and docs only)
- `./scripts/profile.sh` and verify flame graph output in `benchmark-results/profiles/`
- `./scripts/benchmark.sh --quick` against local Viceroy and verify output