feat(fc): drain virtio-balloon free-page-hinting before pause by ValentaTomas · Pull Request #2552 · e2b-dev/infra

ValentaTomas · 2026-05-04T00:05:41Z

Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.

Balloon install is gated by free-page-hinting-install (bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Reclaimed pages emit UFFD_EVENT_REMOVE, already tracked by the parent FPR work.

Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.

Operator must wait for the kernel FPH race fix to roll out before enabling free-page-hinting-timeout-ms in prod.

cursor · 2026-05-04T00:05:46Z

PR Summary

Medium Risk
Touches the VM pause/snapshot hot path and adds new Firecracker balloon API calls gated by feature flags; issues here could impact snapshot latency or reliability even if the feature is disabled by default.

Overview
This change adds a pre-pause step that can trigger and wait for Firecracker virtio-balloon free-page-hinting to complete (with exponential-backoff polling), gated by Firecracker version support plus new LaunchDarkly flags and a per-sandbox LD context that includes kernel/FC versions.

It extends balloon device installation/config to support FreePageHinting, adds client helpers for start_balloon_hinting/describe_balloon_hinting (including a workaround for Firecracker returning an “unexpected success” 204), and starts accumulating balloon counters from the metrics FIFO with a new flush-and-wait helper used by a new resume-build -fph-bench mode and scripts/bench-fph.sh.

Potential issues: the drain is best-effort but still runs on the snapshot path when the timeout flag is enabled; the polling loop is timer-based and could add latency under load, and the metrics “flush-and-wait” spins until a pointer changes which may time out if the reader stalls or FC stops emitting balloon fields.

Reviewed by Cursor Bugbot for commit 291933b.

linear-code · 2026-05-06T20:34:02Z

ENG-3664 Add free page hinting before pause

codecov · 2026-05-06T20:35:02Z

❌ 10 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2621 | 10 | 2611 | 5 |

View the full list of 10 ❄️ flaky test(s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.74% (Passed 165 times, Failed 399 times)

Stack Traces | 0.36s run time

=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.36s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.93% (Passed 173 times, Failed 577 times)

Stack Traces | 42.4s run time

=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command dig in sandbox i2vp28pnyv1ed4x7aa5s0
--- FAIL: TestUpdateNetworkConfig (42.35s)

github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.37% (Passed 167 times, Failed 571 times)

Stack Traces | 3.73s run time

=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1359}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivcza2lg1vw9dd2rf2tbj
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1361}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 14 May 2026 00:15:25 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.73s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 57.08% (Passed 297 times, Failed 395 times)

Stack Traces | 0s run time

=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 63.70% (Passed 163 times, Failed 286 times)

Stack Traces | 9.12s run time

=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
Executing command python in sandbox i3ps0rvyus13btzlpt2g9
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (9.12s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 59.06% (Passed 165 times, Failed 238 times)

Stack Traces | 7.67s run time

=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox il6a0o8a7d1jdbu0eqbpc
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.67s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 65.17% (Passed 163 times, Failed 305 times)

Stack Traces | 8.8s run time

=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox ioy0ozw1yx12rzem9ev4x
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
Executing command python in sandbox ioh4k1guj8g0w3j8ewwtp
--- FAIL: TestBindLocalhost/bind_::1 (8.80s)

github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 65.02% (Passed 163 times, Failed 303 times)

Stack Traces | 8.18s run time

=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1267}}
Executing command python in sandbox ixcfftk86as30dczjqox5
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (8.18s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.67% (Passed 173 times, Failed 346 times)

Stack Traces | 78.1s run time

=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (78.11s)

github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.59% (Passed 163 times, Failed 340 times)

Stack Traces | 23.4s run time

=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1265}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 191 MB\nFree memory before tmpfs mount: 793 MB\nMemory to use in integrity test (80% of free, min 64MB): 634 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"634+0 records in\n634+0 records out\n664797184 bytes (665 MB, 634 MiB) copied, 3.27622 s, 203 MB/s\n\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=634\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.24\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.28\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ize (kbytes): 2688\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 14\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 827 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i4p6lku8f16wl5hcuszou
Executing command bash in sandbox i4p6lku8f16wl5hcuszou (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1281}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"c8bad58ea8b6f0f7cae6e78c9766413c78313735962d6233503661b71ed30214\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i4p6lku8f16wl5hcuszou
Executing command bash in sandbox i4p6lku8f16wl5hcuszou (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1284}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i4p6lku8f16wl5hcuszou: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (23.39s)

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: FPR conflicts with hugepages
- Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.

Or push these changes by commenting:

@cursor push 7c518d0d3e

Preview (7c518d0d3e)

diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}
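The decision in the patch above can be isolated into a small function for illustration. This is a sketch under assumptions: `isGTE` is a simplified stand-in for `utils.IsGTEVersion` (whose real behaviour is not shown in this thread), and `resolveFPR` is a hypothetical name. As in the diff, an explicit `--free-page-reporting` value still wins, even when hugepages is enabled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "vMAJOR.MINOR.PATCH" into three integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("invalid version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// isGTE is a simplified, hypothetical stand-in for utils.IsGTEVersion.
func isGTE(version, minimum string) (bool, error) {
	a, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	b, err := parseVersion(minimum)
	if err != nil {
		return false, err
	}
	for i := range a {
		if a[i] != b[i] {
			return a[i] > b[i], nil
		}
	}
	return true, nil
}

// resolveFPR mirrors the patched logic: an explicit setting overrides the
// default; otherwise FPR is on for FC v1.14+ unless hugepages is enabled.
func resolveFPR(explicit *bool, fcVersion string, hugePages bool) bool {
	if explicit != nil {
		return *explicit
	}
	versionOnly, _, _ := strings.Cut(fcVersion, "_")
	supported, err := isGTE(versionOnly, "v1.14.0")
	if err != nil {
		return false
	}
	return !hugePages && supported
}

func main() {
	on := true
	fmt.Println(resolveFPR(nil, "v1.14.1_abc", false))
	fmt.Println(resolveFPR(nil, "v1.14.1_abc", true))
	fmt.Println(resolveFPR(nil, "v1.12.0_abc", false))
	fmt.Println(resolveFPR(&on, "v1.14.1_abc", true))
}
```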

ValentaTomas · 2026-05-07T06:28:33Z

Waiting for the merge of #2541, but otherwise should be ready.

ValentaTomas · 2026-05-07T06:29:09Z

Before enabling in prod we need to deploy the kernel fix though.

qodo-code-review · 2026-05-07T06:38:50Z

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. ~~FPH kernel gate disables~~ ✓ Resolved 🐞

Description

MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.

Code

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[R10-18]

+// MinFreePageHintingKernelVersion is the minimum guest kernel version that
+// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
+const MinFreePageHintingKernelVersion = "999.0.0"
+
+func kernelSupportsFreePageHinting(kernelVersion string) bool {
+	v := strings.TrimPrefix(kernelVersion, "vmlinux-")
+	ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)
+
+	return ok

Evidence
The kernel gate compares the guest kernel version against 999.0.0, which will fail for real kernel
versions (e.g. the repo default vmlinux-6.1.158), causing freePageHinting to be false when
configuring the balloon. Firecracker’s API reports 400 when hinting wasn’t enabled at device
configuration time; DrainBalloon treats that specific 400 as “not configured” and returns nil,
making the pre-pause drain ineffective.
packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
packages/shared/pkg/featureflags/flags.go[244-247]
packages/shared/pkg/fc/client/operations/start_balloon_hinting_responses.go[110-114]
packages/orchestrator/pkg/sandbox/fc/process.go[734-740]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]

2. FPH override no-op online 🐞

Description

resume-build’s -fph-timeout-ms calls featureflags.NewIntFlag(), which only updates the offline test
datasource, not a live LaunchDarkly environment. When LAUNCH_DARKLY_API_KEY is set,
NewClientWithLogLevel uses a real LaunchDarkly client and the override is ignored, so the CLI flag
does not do what its help text claims.

Code

packages/orchestrator/cmd/resume-build/main.go[R76-82]

+	fphTimeoutMs := flag.Int("fph-timeout-ms", 0, "override free-page-hinting-timeout-ms LD flag (0 = use LD default)")
+
	flag.Parse()

+	if *fphTimeoutMs > 0 {
+		featureflags.NewIntFlag("free-page-hinting-timeout-ms", *fphTimeoutMs)
+	}

Evidence
The CLI override is implemented by calling NewIntFlag(), which mutates the in-process ldtestdata
(offline) store. The featureflags client switches to a real LaunchDarkly client whenever
LAUNCH_DARKLY_API_KEY is set, so changes to the offline store won’t affect evaluation in that mode.
packages/orchestrator/cmd/resume-build/main.go[76-82]
packages/shared/pkg/featureflags/flags.go[147-152]
packages/shared/pkg/featureflags/client.go[19-23]
packages/shared/pkg/featureflags/client.go[71-86]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`-fph-timeout-ms` currently only affects the offline LaunchDarkly test datasource; when a real LaunchDarkly client is in use, the override is ignored.

### Issue Context
The flag help text says it “overrides free-page-hinting-timeout-ms LD flag”, so it should deterministically control the drain timeout in resume-build regardless of whether LaunchDarkly is configured.

### Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[76-82]
- packages/shared/pkg/featureflags/flags.go[147-152]
- packages/shared/pkg/featureflags/client.go[19-23]
- packages/shared/pkg/featureflags/client.go[71-86]
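One way to make the override deterministic, sketched under assumptions, is to resolve the CLI value before consulting any flag backend. `intFlagSource`, `staticSource`, and `resolveTimeoutMs` are hypothetical names for this example and are not part of the featureflags package.

```go
package main

import "fmt"

// intFlagSource is a hypothetical abstraction over the flag backend
// (offline test data or a live LaunchDarkly client).
type intFlagSource interface {
	IntFlag(key string, fallback int) int
}

// staticSource is a trivial in-memory backend used for the example.
type staticSource map[string]int

func (s staticSource) IntFlag(key string, fallback int) int {
	if v, ok := s[key]; ok {
		return v
	}
	return fallback
}

// resolveTimeoutMs returns the CLI override when it is set (> 0) and only
// falls back to the flag backend otherwise, so the result does not depend on
// which backend is active.
func resolveTimeoutMs(cliOverride int, src intFlagSource) int {
	if cliOverride > 0 {
		return cliOverride
	}
	return src.IntFlag("free-page-hinting-timeout-ms", 0)
}

func main() {
	live := staticSource{"free-page-hinting-timeout-ms": 250}
	fmt.Println(resolveTimeoutMs(0, live))
	fmt.Println(resolveTimeoutMs(1000, live))
	fmt.Println(resolveTimeoutMs(0, staticSource{}))
}
```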

Adds an opt-in pre-pause step that runs `sync`, `drop_caches`, `compact_memory`, and `fstrim -av` on the live VM via envd's Process service to shrink the memfile/rootfs diff.

Each step is wrapped in `timeout -s KILL` with its own cap, so a stuck step (most realistically a slow `sync` on a large dirty backlog) cannot starve the rest, and a killed step does not abort the chain (`;`-separated, not `&&`).

Pausing FC is unaffected by an in-flight guest `sync` we time out: FC only drains in-flight virtio I/O before completing the pause; any unflushed dirty pages stay in the memfile snapshot and converge on resume. Per-step timeouts trade reclaim payoff, never correctness: `drop_caches` is documented non-destructive, `fstrim` consults FS allocation metadata not pagecache, and a partial `compact_memory` is just less-compacted.

Disabled by default: the LD flag's null default leaves every step at 0 (skipped). Missing keys, zero, negative, and wrong-type values all collapse to "skip". The orchestrator skips the envd call entirely when the chain is empty. The outer `Connect-Timeout-Ms` is the sum of per-step caps plus a small slack.

Single LD flag, one rule per cohort:

- `guest-pause-reclaim` (JSON): per-step caps in milliseconds keyed by step name, evaluated against sandbox / team / template LD contexts so targeting is configured in LaunchDarkly.

Example value:

```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```

`resume-build` exposes `-reclaim` to inject the example values into the offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base image), but is independent of it and of FPH (#2552). Split out from #2550.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55d213b1bb

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
- Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.

Or push these changes by commenting:

@cursor push ed7f7d6038

Preview (ed7f7d6038)

diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
 	}
 
 	backoff := 5 * time.Millisecond
-	sawBump := false
+	// If hostBefore is already freePageHintDone, we're starting from a
+	// previously completed cycle. In this case, if the new cycle completes
+	// before the first poll, host will remain at freePageHintDone and we'd
+	// miss the bump. Initialize sawBump=true so any observation of
+	// host==freePageHintDone signals completion.
+	sawBump := hostBefore == freePageHintDone
 	for {
 		select {
 		case <-ctx.Done():
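The effect of seeding `sawBump` can be shown with a reduced model of the poll loop. This is a reconstruction for illustration only: the surrounding loop in process.go is not shown in this thread, and the constant values and the `completes` function are assumptions.

```go
package main

import "fmt"

// Host command values are assumptions for this sketch; the real constants
// live in the orchestrator's process.go.
const (
	hostStop         uint32 = 0
	freePageHintDone uint32 = 1
	hostRunning      uint32 = 2
)

// completes replays a sequence of polled host values and reports whether the
// drain would be treated as finished. seedFromBefore selects the patched
// behaviour (sawBump starts true when hostBefore is already DONE).
func completes(hostBefore uint32, polls []uint32, seedFromBefore bool) bool {
	sawBump := seedFromBefore && hostBefore == freePageHintDone
	for _, host := range polls {
		if host != hostBefore {
			sawBump = true // the host value moved since the cycle started
		}
		if sawBump && host == freePageHintDone {
			return true
		}
	}
	return false // the real loop would run until the timeout
}

func main() {
	// A fast cycle finishes before the first poll, so every poll reads DONE.
	fast := []uint32{freePageHintDone, freePageHintDone}
	fmt.Println(completes(freePageHintDone, fast, false))
	fmt.Println(completes(freePageHintDone, fast, true))

	// A normal cycle is seen running first and is unaffected by the seed.
	normal := []uint32{hostRunning, freePageHintDone}
	fmt.Println(completes(hostStop, normal, false))
}
```

In this model the seeded variant cannot tell a stale DONE from a fresh one, which is the trade-off the fix accepts to avoid a spurious timeout.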

ValentaTomas · 2026-05-11T22:42:45Z

@cla-bot check

The issue in the drain logic has been addressed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2bfac48d5

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Drain balloon metrics read after paused VM may fail
- Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.

Or push these changes by commenting:

@cursor push 25f80a5ef2

Preview (25f80a5ef2)

diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
 	newMeta := origMeta
 	newMeta.Template.BuildID = buildID
 
+	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
 	pauseStart := time.Now()
 	snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
 	pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
 	}
 	defer snapshot.Close(context.WithoutCancel(ctx))
 
-	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
 	upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
 	if err != nil {
 		return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}

Reviewed by Cursor Bugbot for commit fab97ff.

e2b-request-same-site-reviewers Bot assigned djeebus May 4, 2026

cursor Bot reviewed May 4, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

ValentaTomas force-pushed the feat/sandbox-pause-fph branch from e8bd708 to bf00edc Compare May 4, 2026 00:35

This was referenced May 4, 2026

feat(uffd,fc): balloon free-page-hinting + envd reclaim on pause #2550

Closed

feat(sandbox): pre-pause guest reclaim via envd #2551

Merged

cursor Bot reviewed May 4, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

ValentaTomas force-pushed the feat/sandbox-pause-fph branch 4 times, most recently from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55

ValentaTomas unassigned djeebus May 4, 2026

ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch 2 times, most recently from 920e8ec to 7f22709 Compare May 5, 2026 08:19

cla-bot Bot added the cla-signed label May 6, 2026

cursor Bot reviewed May 6, 2026

Comment thread packages/orchestrator/cmd/create-build/main.go Outdated

ValentaTomas marked this pull request as ready for review May 7, 2026 06:28

ValentaTomas requested review from dobrac and jakubno as code owners May 7, 2026 06:28

e2b-request-same-site-reviewers Bot assigned levb May 7, 2026

ValentaTomas unassigned levb May 7, 2026

qodo-code-review Bot reviewed May 7, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/fph_gates.go Outdated

claude Bot reviewed May 7, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

ValentaTomas removed the request for review from dobrac May 8, 2026 08:48

ValentaTomas requested review from bchalios and kalyazin and removed request for jakubno May 8, 2026 08:48

Base automatically changed from feat/uffd-fc-free-page-reporting-integration to main May 8, 2026 23:42

ValentaTomas enabled auto-merge (squash) May 9, 2026 22:19

kalyazin previously requested changes May 11, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

chatgpt-codex-connector Bot reviewed May 11, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

cursor Bot reviewed May 11, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

ValentaTomas disabled auto-merge May 11, 2026 22:42

kalyazin reviewed May 12, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

ValentaTomas mentioned this pull request May 13, 2026

feat(metrics): record per-snapshot dirty/empty/total bytes at creation #2649

Merged

ValentaTomas enabled auto-merge (squash) May 13, 2026 23:09

cursor Bot reviewed May 13, 2026

Comment thread iac/modules/job-otel-collector/configs/otel-collector.yaml

chatgpt-codex-connector Bot reviewed May 13, 2026

Comment thread packages/orchestrator/pkg/sandbox/fc/drain_balloon_test.go

ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 2eedd3f to 86be69e Compare May 13, 2026 23:50

chore(fph): trim bench tooling for review

fab97ff

cursor Bot reviewed May 14, 2026

Comment thread packages/orchestrator/cmd/resume-build/fph_bench.go

fix(fph-bench): satisfy nlreturn lint

291933b

ValentaTomas requested a review from kalyazin May 14, 2026 00:20
