feat(fc): drain virtio-balloon free-page-hinting before pause#2552
feat(fc): drain virtio-balloon free-page-hinting before pause#2552ValentaTomas wants to merge 3 commits into
Conversation
PR SummaryMedium Risk Overview It extends balloon device installation/config to support Potential issues: the drain is best-effort but still runs on the snapshot path when the timeout flag is enabled; the polling loop is timer-based and could add latency under load, and the metrics “flush-and-wait” spins until a pointer changes which may time out if the reader stalls or FC stops emitting balloon fields. Reviewed by Cursor Bugbot for commit 291933b. Bugbot is set up for automated code reviews on this repo. Configure here. |
e8bd708 to
bf00edc
Compare
f4e3ab0 to
7619cc9
Compare
920e8ec to
7f22709
Compare
❌ 10 Tests Failed:
View the full list of 10 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: FPR conflicts with hugepages
- Added
!hugePagescondition to FPR auto-enable logic, matching the server build path's conflict prevention.
- Added
Or push these changes by commenting:
@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
})
}
- // Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+ // Default FPR on for FC v1.14+ unless hugepages is enabled.
+ // Firecracker rejects balloon (free-page-reporting) together with hugepages.
var fprEnabled bool
if freePageReporting != nil {
fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
versionOnly, _, _ := strings.Cut(fcVersion, "_")
supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
if err == nil {
- fprEnabled = supported
+ fprEnabled = !hugePages && supported
}
}You can send follow-ups to the cloud agent here.
|
Waiting for the merge of #2541, but otherwise should be ready. |
|
Before enabling in prod we need to deploy the kernel fix though. |
Code Review by Qodo
1.
|
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).
Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.
Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.
Single LD flag, one rule per cohort:
- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.
Example value:
```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```
`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.
Pairs cleanly with #2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (#2552). Split out from
#2550.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55d213b1bb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Drain poll can miss fast cycle when hostBefore equals freePageHintDone
- Initialized sawBump to true when hostBefore equals freePageHintDone so fast-completing cycles are correctly detected as successful instead of timing out.
Or push these changes by commenting:
@cursor push ed7f7d6038
Preview (ed7f7d6038)
diff --git a/packages/orchestrator/pkg/sandbox/fc/process.go b/packages/orchestrator/pkg/sandbox/fc/process.go
--- a/packages/orchestrator/pkg/sandbox/fc/process.go
+++ b/packages/orchestrator/pkg/sandbox/fc/process.go
@@ -772,7 +772,12 @@
}
backoff := 5 * time.Millisecond
- sawBump := false
+ // If hostBefore is already freePageHintDone, we're starting from a
+ // previously completed cycle. In this case, if the new cycle completes
+ // before the first poll, host will remain at freePageHintDone and we'd
+ // miss the bump. Initialize sawBump=true so any observation of
+ // host==freePageHintDone signals completion.
+ sawBump := hostBefore == freePageHintDone
for {
select {
case <-ctx.Done():You can send follow-ups to the cloud agent here.
|
@cla-bot check |
the issue in the drain logic has been addressed
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c2bfac48d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free. Balloon install gated by free-page-hinting-install (bool LD flag); kernel-side eligibility targeted via the LD context (kernel/FC version). On pause we call start_balloon_hinting(acknowledge_on_stop=true) and poll describe_balloon_hinting until host_cmd == DONE, gated by free-page-hinting-timeout-ms (int LD flag, ms; 0 = disabled). Hot path: post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. Includes cmd/resume-build -fph-bench and scripts/bench-fph.sh for offline FPR vs FPR+FPH comparison.
2eedd3f to
86be69e
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Drain balloon metrics read after paused VM may fail
- Moved FlushAndReadBalloonMetrics call before Pause to avoid timeout when reading metrics from paused VM.
Or push these changes by commenting:
@cursor push 25f80a5ef2
Preview (25f80a5ef2)
diff --git a/packages/orchestrator/cmd/resume-build/fph_bench.go b/packages/orchestrator/cmd/resume-build/fph_bench.go
--- a/packages/orchestrator/cmd/resume-build/fph_bench.go
+++ b/packages/orchestrator/cmd/resume-build/fph_bench.go
@@ -139,6 +139,8 @@
newMeta := origMeta
newMeta.Template.BuildID = buildID
+ balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
+
pauseStart := time.Now()
snapshot, err := sbx.Pause(ctx, newMeta, sandbox.SnapshotUseCasePause)
pauseDur := time.Since(pauseStart)
@@ -147,8 +149,6 @@
}
defer snapshot.Close(context.WithoutCancel(ctx))
- balloon, _ := sbx.FlushAndReadBalloonMetrics(ctx)
-
upload, err := sandbox.NewUpload(ctx, nil, snapshot, r.storage, storage.CompressConfig{}, nil, "", nil)
if err != nil {
return fphBenchSample{pause: pauseDur, err: fmt.Errorf("upload prepare: %w", err)}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit fab97ff. Configure here.
Adds an opt-in pre-pause step that runs `sync`, `drop_caches`,
`compact_memory`, and `fstrim -av` on the live VM via envd's Process
service to shrink the memfile/rootfs diff. Each step is wrapped in
`timeout -s KILL` with its own cap, so a stuck step (most realistically
a slow `sync` on a large dirty backlog) cannot starve the rest — and a
killed step does not abort the chain (`;`-separated, not `&&`).
Pausing FC is unaffected by an in-flight guest `sync` we time out: FC
only drains in-flight virtio I/O before completing the pause; any
unflushed dirty pages stay in the memfile snapshot and converge on
resume. Per-step timeouts trade reclaim payoff, never correctness —
`drop_caches` is documented non-destructive, `fstrim` consults FS
allocation metadata not pagecache, and a partial `compact_memory` is
just less-compacted.
Disabled by default — the LD flag's null default leaves every step at 0
(skipped). Missing keys, zero, negative, and wrong-type values all
collapse to "skip". The orchestrator skips the envd call entirely when
the chain is empty. The outer `Connect-Timeout-Ms` is the sum of
per-step caps plus a small slack.
Single LD flag, one rule per cohort:
- `guest-pause-reclaim` (JSON) — per-step caps in milliseconds keyed by
step name, evaluated against sandbox / team / template LD contexts so
targeting is configured in LaunchDarkly.
Example value:
```json
{"sync":500,"drop_caches":200,"compact_memory":1000,"fstrim":500}
```
`resume-build` exposes `-reclaim` to inject the example values into the
offline LD store for local testing.
Pairs cleanly with e2b-dev#2553 (disable proactive compaction in the guest base
image), but is independent of it and of FPH (e2b-dev#2552). Split out from
e2b-dev#2550.


Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free.
Balloon install is gated by
free-page-hinting-install(bool LD flag); kernel-side eligibility is targeted via the LD context (kernel/FC version). On pause we callstart_balloon_hinting(acknowledge_on_stop=true)and polldescribe_balloon_hintinguntilhost_cmd == DONE, gated byfree-page-hinting-timeout-ms(int LD flag, ms; 0 = disabled). Reclaimed pages emitUFFD_EVENT_REMOVE, already tracked by the parent FPR work.Hot path is kept minimal: post-drain and post-pause we trigger an FC metrics flush but don't wait for the reader, trading per-pause counter precision for pause latency. System-level FPH activity is observable via the periodic 5 s metrics flush.
Includes
cmd/resume-build -fph-benchandscripts/bench-fph.shfor offline FPR vs FPR+FPH comparison.Operator must wait for the kernel FPH race fix to roll out before enabling
free-page-hinting-timeout-msin prod.