
feat(fc): drain virtio-balloon free-page-hinting before pause#2552

Open
ValentaTomas wants to merge 3 commits into main from feat/sandbox-pause-fph

Conversation

@ValentaTomas
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.

Balloon install is gated by free-page-hinting-install (bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Reclaimed pages emit UFFD_EVENT_REMOVE, already tracked by the parent FPR work.

Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.

Operators must wait for the kernel FPH race fix to roll out before enabling free-page-hinting-timeout-ms in prod.

@cursor

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Touches the VM pause/snapshot hot path and adds new Firecracker balloon API calls gated by feature flags; issues here could impact snapshot latency or reliability even if the feature is disabled by default.

Overview
This change adds a pre-pause step that can trigger and wait for Firecracker virtio-balloon free-page-hinting to complete (with exponential-backoff polling), gated by Firecracker version support plus new LaunchDarkly flags and a per-sandbox LD context that includes kernel/FC versions.

It extends balloon device installation/config to support FreePageHinting, adds client helpers for start_balloon_hinting/describe_balloon_hinting (including a workaround for Firecracker returning an “unexpected success” 204), and starts accumulating balloon counters from the metrics FIFO with a new flush-and-wait helper used by a new resume-build -fph-bench mode and scripts/bench-fph.sh.

Potential issues: the drain is best-effort but still runs on the snapshot path when the timeout flag is enabled; the timer-based polling loop could add latency under load; and the metrics “flush-and-wait” spins until a snapshot pointer changes, which may time out if the reader stalls or FC stops emitting balloon fields.

Reviewed by Cursor Bugbot for commit 291933b. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch 4 times, most recently from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch 2 times, most recently from 920e8ec to 7f22709 Compare May 5, 2026 08:19
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026

@codecov

codecov Bot commented May 6, 2026

❌ 10 Tests Failed:

Tests completed: 2621 | Failed: 10 | Passed: 2611 | Skipped: 5
View the full list of 10 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.74% (Passed 165 times, Failed 399 times)

Stack Traces | 0.36s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (0.36s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.93% (Passed 173 times, Failed 577 times)

Stack Traces | 42.4s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command dig in sandbox i2vp28pnyv1ed4x7aa5s0
--- FAIL: TestUpdateNetworkConfig (42.35s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.37% (Passed 167 times, Failed 571 times)

Stack Traces | 3.73s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1359}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivcza2lg1vw9dd2rf2tbj
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1361}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 14 May 2026 00:15:25 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i2b63vq3jht9l2kyznjdi
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.73s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 57.08% (Passed 297 times, Failed 395 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 63.70% (Passed 163 times, Failed 286 times)

Stack Traces | 9.12s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
Executing command python in sandbox i3ps0rvyus13btzlpt2g9
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (9.12s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 59.06% (Passed 165 times, Failed 238 times)

Stack Traces | 7.67s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox il6a0o8a7d1jdbu0eqbpc
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.67s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 65.17% (Passed 163 times, Failed 305 times)

Stack Traces | 8.8s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox ioy0ozw1yx12rzem9ev4x
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
Executing command python in sandbox ioh4k1guj8g0w3j8ewwtp
--- FAIL: TestBindLocalhost/bind_::1 (8.80s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 65.02% (Passed 163 times, Failed 303 times)

Stack Traces | 8.18s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1267}}
Executing command python in sandbox ixcfftk86as30dczjqox5
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (8.18s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.67% (Passed 173 times, Failed 346 times)

Stack Traces | 78.1s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (78.11s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.59% (Passed 163 times, Failed 340 times)

Stack Traces | 23.4s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1265}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 191 MB\nFree memory before tmpfs mount: 793 MB\nMemory to use in integrity test (80% of free, min 64MB): 634 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"634+0 records in\n634+0 records out\n664797184 bytes (665 MB, 634 MiB) copied, 3.27622 s, 203 MB/s\n\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=634\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.24\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.28\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ize (kbytes): 2688\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 14\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 827 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i4p6lku8f16wl5hcuszou
Executing command bash in sandbox i4p6lku8f16wl5hcuszou (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1281}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"c8bad58ea8b6f0f7cae6e78c9766413c78313735962d6233503661b71ed30214\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i4p6lku8f16wl5hcuszou
Executing command bash in sandbox i4p6lku8f16wl5hcuszou (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1284}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i4p6lku8f16wl5hcuszou: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (23.39s)



@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: FPR conflicts with hugepages
    • Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.


Or push these changes by commenting:

@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}


Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
@ValentaTomas
Member Author

Waiting for the merge of #2541, but otherwise should be ready.

@ValentaTomas ValentaTomas marked this pull request as ready for review May 7, 2026 06:28
@ValentaTomas
Member Author

Before enabling in prod we need to deploy the kernel fix though.

@qodo-code-review

qodo-code-review Bot commented May 7, 2026

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. FPH kernel gate disables ✓ Resolved 🐞
Description
MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Code

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[R10-18]

+// MinFreePageHintingKernelVersion is the minimum guest kernel version that
+// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
+const MinFreePageHintingKernelVersion = "999.0.0"
+
+func kernelSupportsFreePageHinting(kernelVersion string) bool {
+	v := strings.TrimPrefix(kernelVersion, "vmlinux-")
+	ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)
+
+	return ok
+}
Evidence
The kernel gate compares the guest kernel version against 999.0.0, which will fail for real kernel
versions (e.g. the repo default vmlinux-6.1.158), causing freePageHinting to be false when
configuring the balloon. Firecracker’s API reports 400 when hinting wasn’t enabled at device
configuration time; DrainBalloon treats that specific 400 as “not configured” and returns nil,
making the pre-pause drain ineffective.

packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
packages/shared/pkg/featureflags/flags.go[244-247]
packages/shared/pkg/fc/client/operations/start_balloon_hinting_responses.go[110-114]
packages/orchestrator/pkg/sandbox/fc/process.go[734-740]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]




Remediation recommended

2. FPH override no-op online 🐞
Description
resume-build’s -fph-timeout-ms calls featureflags.NewIntFlag(), which only updates the offline test
datasource, not a live LaunchDarkly environment. When LAUNCH_DARKLY_API_KEY is set,
NewClientWithLogLevel uses a real LaunchDarkly client and the override is ignored, so the CLI flag
does not do what its help text claims.
Code

packages/orchestrator/cmd/resume-build/main.go[R76-82]

+	fphTimeoutMs := flag.Int("fph-timeout-ms", 0, "override free-page-hinting-timeout-ms LD flag (0 = use LD default)")
+
	flag.Parse()

+	if *fphTimeoutMs > 0 {
+		featureflags.NewIntFlag("free-page-hinting-timeout-ms", *fphTimeoutMs)
+	}
Evidence
The CLI override is implemented by calling NewIntFlag(), which mutates the in-process ldtestdata
(offline) store. The featureflags client switches to a real LaunchDarkly client whenever
LAUNCH_DARKLY_API_KEY is set, so changes to the offline store won’t affect evaluation in that mode.

packages/orchestrator/cmd/resume-build/main.go[76-82]
packages/shared/pkg/featureflags/flags.go[147-152]
packages/shared/pkg/featureflags/client.go[19-23]
packages/shared/pkg/featureflags/client.go[71-86]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`-fph-timeout-ms` currently only affects the offline LaunchDarkly test datasource; when a real LaunchDarkly client is in use, the override is ignored.

### Issue Context
The flag help text says it “overrides free-page-hinting-timeout-ms LD flag”, so it should deterministically control the drain timeout in resume-build regardless of whether LaunchDarkly is configured.

### Fix Focus Areas
- packages/orchestrator/cmd/resume-build/main.go[76-82]
- packages/shared/pkg/featureflags/flags.go[147-152]
- packages/shared/pkg/featureflags/client.go[19-23]
- packages/shared/pkg/featureflags/client.go[71-86]




Comment thread packages/orchestrator/pkg/sandbox/fc/fph_gates.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
ValentaTomas added a commit that referenced this pull request May 8, 2026
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).

Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.

Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.

Single LD flag, one rule per cohort:

- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.

Example value:

```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```

`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.

Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
@ValentaTomas ValentaTomas removed the request for review from dobrac May 8, 2026 08:48
@ValentaTomas ValentaTomas requested review from bchalios and kalyazin and removed request for jakubno May 8, 2026 08:48
Base automatically changed from feat/uffd-fc-free-page-reporting-integration to main May 8, 2026 23:42
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 9, 2026 22:19
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55d213b1bb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
    • Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.


Or push these changes by commenting:

@cursor push ed7f7d6038
Preview (ed7f7d6038)
diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
 	}
 
 	backoff := 5 * time.Millisecond
-	sawBump := false
+	// If hostBefore is already freePageHintDone, we're starting from a
+	// previously completed cycle. In this case, if the new cycle completes
+	// before the first poll, host will remain at freePageHintDone and we'd
+	// miss the bump. Initialize sawBump=true so any observation of
+	// host==freePageHintDone signals completion.
+	sawBump := hostBefore == freePageHintDone
 	for {
 		select {
 		case <-ctx.Done():


Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas
Member Author

@cla-bot check

@ValentaTomas ValentaTomas disabled auto-merge May 11, 2026 22:42
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated
@kalyazin kalyazin dismissed their stale review May 12, 2026 10:19

the issue in the drain logic has been addressed

@ValentaTomas ValentaTomas enabled auto-merge (squash) May 13, 2026 23:09
Comment thread iac/modules/job-otel-collector/configs/otel-collector.yaml

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2bfac48d5


Comment thread packages/orchestrator/pkg/sandbox/fc/drain_balloon_test.go
Drains virtio-balloon free-page-hinting before pause so snapshots don't
capture pages the guest already considers free. Balloon install gated by
free-page-hinting-install (bool LD flag); kernel-side eligibility targeted
via the LD context (kernel/FC version). On pause we call
start_balloon_hinting(acknowledge_on_stop=true) and poll
describe_balloon_hinting until host_cmd == DONE, gated by
free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled).

Hot path: post-pause we trigger an FC metrics flush but don't wait for
the reader, trading per-pause counter precision for pause latency.

Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for
offline FPR vs FPR+FPH comparison.
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from 2eedd3f to 86be69e Compare May 13, 2026 23:50

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Drain balloon metrics read after paused VM may fail
    • Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.


Or push these changes by commenting:

@cursor push 25f80a5ef2
Preview (25f80a5ef2)
diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
 	newMeta := origMeta
 	newMeta.Template.BuildID = buildID
 
+	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
 	pauseStart := time.Now()
 	snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
 	pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
 	}
 	defer snapshot.Close(context.WithoutCancel(ctx))
 
-	balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
 	upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
 	if err != nil {
 		return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}


Reviewed by Cursor Bugbot for commit fab97ff.

Comment thread packages/orchestrator/cmd/resume-build/fph_bench.go
@ValentaTomas ValentaTomas requested a review from kalyazin May 14, 2026 00:20
AdaAibaby pushed a commit to AdaAibaby/infra that referenced this pull request May 14, 2026
