🐛 Drain SDK idle-callback queue before flushing in e2e tests #4569

Open
thomas-lebeau wants to merge 2 commits into main from thomas.lebeau/fix-microfrontend-fetch-xhr-flake

@thomas-lebeau
Collaborator

Motivation

E2E tests (notably the microfrontend fetch/XHR scenarios) were flaky because the SDK defers some bookkeeping — particularly resource event emission — onto an idle-callback task queue. The previous waitForRequests helper just slept 200ms via setTimeout, which was not always enough for those deferred tasks to run before the test asserted on captured events.

Changes

  • Replace the fixed 200ms setTimeout in waitForRequests with two requestIdleCallback round-trips to drain the SDK's idle-callback queue (including tasks enqueued by the first batch).
  • Keep a 500ms watchdog as a fallback for environments where requestIdleCallback is throttled or unavailable (e.g. the empty /flush page during teardown).
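
The two changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the helper from the diff: `IdleScheduler`, `waitForIdleQueue` and `WATCHDOG_DELAY_MS` are hypothetical names, and the scheduler is injected so the sketch also runs outside a browser. It relies on idle callbacks being run in the order they were requested.

```typescript
// Hypothetical scheduler shape, injected so the sketch does not depend on `window`.
interface IdleScheduler {
  requestIdleCallback?: (callback: () => void) => unknown
  setTimeout: (callback: () => void, delayMs: number) => unknown
}

const WATCHDOG_DELAY_MS = 500 // fallback delay quoted in the PR description

function waitForIdleQueue(scheduler: IdleScheduler): Promise<void> {
  return new Promise<void>((resolve) => {
    // Watchdog: resolve anyway if idle callbacks are throttled or unavailable.
    scheduler.setTimeout(() => resolve(), WATCHDOG_DELAY_MS)

    const requestIdle = scheduler.requestIdleCallback
    if (!requestIdle) {
      return // only the watchdog can resolve the promise
    }

    // First round-trip: queued behind idle callbacks that were already requested.
    // Second round-trip: queued behind callbacks requested while the first batch ran.
    requestIdle(() => requestIdle(() => resolve()))
  })
}
```

In the browser, the scheduler would wrap `window.requestIdleCallback` and `window.setTimeout`. Whichever of the watchdog or the idle-callback chain fires first settles the promise; the later call has no effect.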

Test instructions

  • Run the E2E suite: yarn test:e2e
  • Re-run microfrontend fetch/XHR scenarios multiple times to confirm the flake no longer reproduces.

Checklist

  • Tested locally
  • Tested on staging
  • Added unit tests for this change.
  • Added e2e/integration tests for this change.
  • Updated documentation and/or relevant AGENTS.md file

The SDK defers resource event emission onto a `requestIdleCallback`-backed
task queue (with a 1s timeout). The previous 200ms in-page setTimeout in
`waitForRequests` could return before that idle callback fired, leaving
events in the queue when `pagehide` triggered `sendBeacon` — so they were
silently dropped from the captured intake registry.

Replace the fixed wait with two `requestIdleCallback` round-trips (with a
500ms watchdog for pages where rIC is throttled or unavailable, e.g. the
empty /flush page used during teardown). This eliminates the residual
flake exposed by tests firing two back-to-back `page.click` calls (e.g.
microfrontend.scenario.ts fetch/xhr/feature-operation tests), without
requiring per-test workarounds.

Verified with 50× repeats of the full microfrontend suite (1050/1050 pass).
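
The timing argument in the commit message can be made concrete with a toy model. The sketch below is a simplification and not SDK code: it assumes a deferred task runs either when the page first becomes idle or when the idle-callback timeout expires, whichever comes first. The function names and the example idle times (50ms, 350ms) are hypothetical; the 200ms and 1s figures come from the commit message.

```typescript
// Toy model (not SDK code): when does a task deferred via requestIdleCallback run?
// `firstIdleAtMs` is undefined when the page never reports an idle period.
function deferredTaskRunsAt(firstIdleAtMs: number | undefined, idleTimeoutMs: number): number {
  return firstIdleAtMs === undefined ? idleTimeoutMs : Math.min(firstIdleAtMs, idleTimeoutMs)
}

// A fixed wait misses the task if the wait ends before the task has run.
function fixedWaitMissesTask(
  fixedWaitMs: number,
  firstIdleAtMs: number | undefined,
  idleTimeoutMs: number
): boolean {
  return deferredTaskRunsAt(firstIdleAtMs, idleTimeoutMs) > fixedWaitMs
}

const SDK_IDLE_TIMEOUT_MS = 1000 // the "1s timeout" from the commit message
const OLD_FIXED_WAIT_MS = 200 // the previous in-page setTimeout delay

// Page idle after 50ms: the task has already run when the 200ms wait ends.
console.log(fixedWaitMissesTask(OLD_FIXED_WAIT_MS, 50, SDK_IDLE_TIMEOUT_MS)) // false
// Page busy until 350ms: the wait ends first and the event is still queued.
console.log(fixedWaitMissesTask(OLD_FIXED_WAIT_MS, 350, SDK_IDLE_TIMEOUT_MS)) // true
```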
@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented May 6, 2026

Bundles Sizes Evolution

| 📦 Bundle Name | Base Size | Local Size | 𝚫 | 𝚫% | Status |
|---|---|---|---|---|---|
| Rum | 179.65 KiB | 179.65 KiB | 0 B | 0.00% | |
| Rum Profiler | 6.17 KiB | 6.17 KiB | 0 B | 0.00% | |
| Rum Recorder | 27.03 KiB | 27.03 KiB | 0 B | 0.00% | |
| Logs | 56.78 KiB | 56.78 KiB | 0 B | 0.00% | |
| Rum Slim | 135.50 KiB | 135.50 KiB | 0 B | 0.00% | |
| Worker | 23.63 KiB | 23.63 KiB | 0 B | 0.00% | |

🚀 CPU Performance

| Action Name | Base CPU Time (ms) | Local CPU Time (ms) | 𝚫% |
|---|---|---|---|
| RUM - add global context | 0.0041 | 0.0042 | +2.44% |
| RUM - add action | 0.014 | 0.0148 | +5.71% |
| RUM - add error | 0.0139 | 0.0123 | -11.51% |
| RUM - add timing | 0.0029 | 0.0027 | -6.90% |
| RUM - start view | 0.0128 | 0.012 | -6.25% |
| RUM - start/stop session replay recording | 0.0008 | 0.0008 | 0.00% |
| Logs - log message | 0.0159 | 0.014 | -11.95% |

🧠 Memory Performance

| Action Name | Base Memory Consumption | Local Memory Consumption | 𝚫 |
|---|---|---|---|
| RUM - add global context | 30.45 KiB | 31.54 KiB | +1.09 KiB |
| RUM - add action | 57.61 KiB | 55.82 KiB | -1.79 KiB |
| RUM - add timing | 32.41 KiB | 32.03 KiB | -387 B |
| RUM - add error | 63.46 KiB | 62.82 KiB | -655 B |
| RUM - start/stop session replay recording | 31.90 KiB | 30.98 KiB | -951 B |
| RUM - start view | 486.17 KiB | 486.00 KiB | -175 B |
| Logs - log message | 105.80 KiB | 107.35 KiB | +1.55 KiB |

🔗 RealWorld

@datadog-official

datadog-official Bot commented May 6, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 77.01% (+0.00%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fa10f7c

Comment on lines +24 to +30
    let done = false
    const finish = () => {
      if (!done) {
        done = true
        resolve()
      }
    }
Member

suggestion: this is unnecessary. Just use resolve instead of finish; it doesn't matter if resolve is called twice.

Promise resolve() is idempotent, so the manual `done` flag and `finish`
wrapper are unnecessary — call resolve() directly from both the watchdog
timeout and the idle-callback chain.
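
A small standalone check of that behaviour, independent of the test helper: once a promise is settled, later calls to its `resolve` function are ignored. `settleTwice` is a hypothetical name used only for this illustration.

```typescript
// Calls resolve twice, the way a watchdog and an idle-callback chain both might.
function settleTwice(): Promise<string> {
  return new Promise<string>((resolve) => {
    resolve('idle-callback') // first call settles the promise
    resolve('watchdog') // no effect: the promise is already settled
  })
}

settleTwice().then((winner) => console.log(winner)) // prints "idle-callback"
```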
@thomas-lebeau thomas-lebeau marked this pull request as ready for review May 7, 2026 12:35
@thomas-lebeau thomas-lebeau requested a review from a team as a code owner May 7, 2026 12:35

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa10f7ccd0

// first batch of tasks are also processed. A 500ms watchdog covers pages where
// requestIdleCallback is throttled or unavailable (e.g. the empty /flush page during
// teardown).
setTimeout(resolve, 500)

P2: Keep the watchdog behind the SDK idle timeout

When requestIdleCallback is throttled, this watchdog can resolve flushEvents() before the SDK's own task queue is forced to run: packages/core/src/tools/taskQueue.ts schedules SDK work with IDLE_CALLBACK_TIMEOUT = ONE_SECOND, but this fallback proceeds after 500 ms. In that throttled/no-idle context the helper can still call waitForServersIdle() while resource-emission tasks remain queued, so the flake this path is meant to cover can still reproduce; the fallback needs to wait at least as long as the SDK timeout or force the same queue to run.
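
One way to read this suggestion is to derive the fallback delay from the SDK timeout instead of hard-coding it. The sketch below is only an illustration: `ONE_SECOND` and `IDLE_CALLBACK_TIMEOUT` are redeclared locally with the values quoted in the comment (whether the test helper can import them from the SDK is not stated here), and `WATCHDOG_MARGIN_MS`, `SAFE_WATCHDOG_DELAY_MS` and `watchdogOutlastsSdkTimeout` are hypothetical names.

```typescript
// Values quoted in the review comment; redeclared here so the sketch is self-contained.
const ONE_SECOND = 1000
const IDLE_CALLBACK_TIMEOUT = ONE_SECOND

// Hypothetical safety margin so the watchdog fires after the SDK's forced run.
const WATCHDOG_MARGIN_MS = 100
const SAFE_WATCHDOG_DELAY_MS = IDLE_CALLBACK_TIMEOUT + WATCHDOG_MARGIN_MS

// True when a watchdog delay cannot expire before the SDK idle-callback timeout.
function watchdogOutlastsSdkTimeout(watchdogDelayMs: number): boolean {
  return watchdogDelayMs >= IDLE_CALLBACK_TIMEOUT
}

console.log(watchdogOutlastsSdkTimeout(500)) // false: the delay used in this PR
console.log(watchdogOutlastsSdkTimeout(SAFE_WATCHDOG_DELAY_MS)) // true
```

The trade-off is that a longer watchdog also lengthens the wait on pages where `requestIdleCallback` never fires, which is why forcing the queue to run is offered as the alternative.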
