Skip to content

🐛 make sure we don't postpone batch indefinitely when upserting a view#4583

Merged
thomas-lebeau merged 1 commit into
mainfrom
thomas.lebeau/fix-flush-controller-upsert
May 8, 2026
Merged

🐛 make sure we don't postpone batch indefinitely when upserting a view#4583
thomas-lebeau merged 1 commit into
mainfrom
thomas.lebeau/fix-flush-controller-upsert

Conversation

@thomas-lebeau
Copy link
Copy Markdown
Collaborator

Motivation

Backport of v7 commit 0fd15958a (originally squashed into #4462 on the v7 branch).

FlushController had a bug where the duration_limit timer could be repeatedly cancelled and rescheduled when a batch contained a single upserted message. Because Batch.addOrUpdate called remove(key) then push(...) on every upsert with the same key, currentMessagesCount dropped to 0 between the two calls, which triggered cancelDurationLimitTimeout(). The subsequent push then rescheduled the timer from scratch. As long as the upsert cadence (e.g. throttled view updates every 3s) was shorter than FLUSH_DURATION_LIMIT (30s), the timer never fired.

In practice this happened most often for view events being upserted with no accompanying resource/action/error events in the same batch — a common state for idle / backgrounded tabs and quickly-closed pages. The buffered view event was only flushed when something else triggered it (page exit, session expire, messages_limit, bytes_limit). When the page died before any of those ran (tab discarded, OS reclaim, crash, mobile background kill), the view event was lost entirely, silently under-reporting view durations in RUM.

Observed impact in our own dashboards after the v7 fix landed: total view count rose ~17.5% and view.time_spent p90/p95/p99 jumped sharply (p95 ~+114%) while p50/p75 were flat or slightly lower — the signature of recovering previously-dropped views at both tails of the distribution, not of inflating individual durations.

Changes

  • Drop notifyAfterRemoveMessage from FlushController entirely.
  • In Batch.push, when upserting a key that already has a buffered message, call notifyAfterAddMessage(bytesDiff) instead of remove-then-add. This adjusts only the byte count and never touches currentMessagesCount or the duration_limit timer.
  • Update specs and mockFlushController accordingly.

This is a straight cherry-pick from v7 — no behavioral divergence between the two branches in this area.

Test instructions

  • `yarn test:unit --spec packages/core/src/transport/flushController.spec.ts --spec packages/core/src/transport/batch.spec.ts` — 41 tests pass.
  • `yarn typecheck` and `yarn lint` clean.

Checklist

  • Tested locally
  • Tested on staging
  • Added unit tests for this change.
  • Added e2e/integration tests for this change.
  • Updated documentation and/or relevant AGENTS.md file

This commit simplifies FlushController so we don't have a 'notify
remove' step that can clear the duration timeout if the batch becomes
empty.

Instead, when we upsert a message, we call notifyAfterAddMessage with
the bytes difference between the old and new message. This reuses the
existing adjustment mechanism without adding API surface to
FlushController.

Backport from v7 (commit 0fd1595, originally part of #4462).

When a batch contained exactly one upserted message (e.g. a view
event with no accompanying resource/action/error events), each
subsequent upsert called notifyAfterRemoveMessage which decremented
currentMessagesCount to 0 and cancelled the duration_limit timeout.
The next push rescheduled it from scratch. As long as the upsert
cadence (e.g. throttled view updates) was shorter than
FLUSH_DURATION_LIMIT (30s), the timer never fired and the batch was
only flushed when something else triggered it (page exit, session
expire, messages_limit, bytes_limit). When the page died before any
of those fired, the view event was lost entirely - silently
under-reporting view durations in RUM.
@thomas-lebeau thomas-lebeau marked this pull request as ready for review May 8, 2026 12:09
@thomas-lebeau thomas-lebeau requested a review from a team as a code owner May 8, 2026 12:09
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c0c74ff488

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

}
upsertBuffer[key] = serializedMessage
flushController.notifyAfterAddMessage()
flushController.notifyAfterAddMessage(bytesDiff)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep page-exit upserts under the beacon limit

When an existing upserted view is replaced during PAGE_MAY_EXIT and the replacement pushes the batch over RECOMMENDED_REQUEST_BYTES_LIMIT, this call flushes only after the new view has been stored in upsertBuffer, so the near-full batch and final view update are sent as one over-limit sendOnExit payload. I checked startRum.ts, where page-exit preparation emits the final view update, and httpRequest.ts lines 109-124, where over-limit exit payloads skip sendBeacon and fall back to fetch; this regresses the near-full action + final view scenario documented in test/e2e/scenario/transport.scenario.ts, where the fetch can be cancelled on unload. Previously the remove-then-notifyBeforeAddMessage path flushed the near-full batch before adding the replacement view, keeping both exit requests beacon-eligible.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codex address that feedback

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Git / PR actions

  • Committed changes on the current branch with message: 🐛 preserve beacon-friendly flush behavior for upsert updates.
  • Created a follow-up PR message via make_pr with the same title and a body summarizing the delta on top of the existing PR.

Testing

  • ⚠️ yarn test:unit --spec packages/core/src/transport/flushController.spec.ts --spec packages/core/src/transport/batch.spec.ts (failed due to environment limitation: ChromeHeadless binary is not available).

View task →

@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da Bot commented May 8, 2026

Bundles Sizes Evolution

📦 Bundle Name Base Size Local Size 𝚫 𝚫% Status
Rum 179.65 KiB 179.52 KiB -129 B -0.07%
Rum Profiler 6.17 KiB 6.17 KiB 0 B 0.00%
Rum Recorder 27.03 KiB 27.03 KiB 0 B 0.00%
Logs 56.78 KiB 56.65 KiB -129 B -0.22%
Rum Slim 135.50 KiB 135.38 KiB -129 B -0.09%
Worker 23.63 KiB 23.63 KiB 0 B 0.00%
🚀 CPU Performance

Pending...

🧠 Memory Performance

Error fetching base memory performance

🔗 RealWorld

@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented May 8, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 83.33%
Overall Coverage: 77.03% (+0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: c0c74ff | Docs | Datadog PR Page | Give us feedback!

@thomas-lebeau thomas-lebeau force-pushed the thomas.lebeau/fix-flush-controller-upsert branch from d7cc7d6 to 1dc0d8d Compare May 8, 2026 13:05
* limit flush can happen before the replacement is committed.
*/
notifyBeforeUpdateMessage(estimatedMessageBytesCountDiff: number) {
if (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's an edge case here. If the byte diff pushes the batch over the limit, the flush resets counters to 0 and empties upsertBuffer, but then in push() the new message goes into the empty buffer with currentMessagesCount still at 0 and no timer scheduled. notifyBeforeAddMessage handles this correctly because it re-registers the message after the flush. I think this could happen in practice with a batch close to 16 KiB (a long stack error would do it). What do you think?

@thomas-lebeau thomas-lebeau force-pushed the thomas.lebeau/fix-flush-controller-upsert branch from 1dc0d8d to c0c74ff Compare May 8, 2026 13:18
@thomas-lebeau thomas-lebeau merged commit 34d30ab into main May 8, 2026
21 of 24 checks passed
@thomas-lebeau thomas-lebeau deleted the thomas.lebeau/fix-flush-controller-upsert branch May 8, 2026 13:42
@github-actions github-actions Bot locked and limited conversation to collaborators May 8, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants