🐛 make sure we don't postpone batch indefinitely when upserting a view #4583
Conversation
This commit simplifies `FlushController` so we don't have a 'notify remove' step that can clear the duration timeout if the batch becomes empty. Instead, when we upsert a message, we call `notifyAfterAddMessage` with the bytes difference between the old and new message. This reuses the existing adjustment mechanism without adding API surface to `FlushController`. Backport from v7 (commit 0fd1595, originally part of #4462).

When a batch contained exactly one upserted message (e.g. a view event with no accompanying resource/action/error events), each subsequent upsert called `notifyAfterRemoveMessage`, which decremented `currentMessagesCount` to 0 and cancelled the duration_limit timeout. The next `push` rescheduled it from scratch. As long as the upsert cadence (e.g. throttled view updates) was shorter than `FLUSH_DURATION_LIMIT` (30s), the timer never fired and the batch was only flushed when something else triggered it (page exit, session expire, `messages_limit`, `bytes_limit`). When the page died before any of those fired, the view event was lost entirely - silently under-reporting view durations in RUM.
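The accounting described above can be sketched with a small illustrative model. This is not the SDK's actual code: the class, method bodies, and the `upsert` helper are simplified stand-ins for `FlushController` and the upsert path in `Batch`, chosen to show why reusing the byte-adjustment call for replacements never disturbs the duration timer.

```typescript
// Illustrative model (not the real SDK implementation).
// notifyBeforeAddMessage counts a brand-new message and arms the duration
// timer when the batch becomes non-empty; notifyAfterAddMessage only adjusts
// the byte count, which is why reusing it for upsert replacements never
// cancels or reschedules the timer.
class SketchFlushController {
  messagesCount = 0
  bytesCount = 0
  durationTimerArmed = false

  notifyBeforeAddMessage(estimatedBytesCount: number) {
    this.bytesCount += estimatedBytesCount
    this.messagesCount += 1
    if (this.messagesCount === 1) {
      // First message in the batch: start the duration_limit countdown.
      this.durationTimerArmed = true
    }
  }

  notifyAfterAddMessage(bytesCountDiff: number) {
    // Byte-count adjustment only: count and timer are untouched.
    this.bytesCount += bytesCountDiff
  }
}

const controller = new SketchFlushController()
const upsertBuffer: Record<string, string> = {}

function upsert(key: string, serialized: string) {
  const previous = upsertBuffer[key]
  upsertBuffer[key] = serialized
  if (previous === undefined) {
    controller.notifyBeforeAddMessage(serialized.length)
  } else {
    // Replacement: report only the size difference.
    controller.notifyAfterAddMessage(serialized.length - previous.length)
  }
}

upsert('view-1', 'view event v1')
upsert('view-1', 'view event v1, now with a longer payload')
```

After both calls the model holds one message whose byte count matches the latest serialized payload, and the timer armed by the first `upsert` is still armed.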
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0c74ff488
```diff
   }
   upsertBuffer[key] = serializedMessage
-  flushController.notifyAfterAddMessage()
+  flushController.notifyAfterAddMessage(bytesDiff)
```
Keep page-exit upserts under the beacon limit
When an existing upserted view is replaced during `PAGE_MAY_EXIT` and the replacement pushes the batch over `RECOMMENDED_REQUEST_BYTES_LIMIT`, this call flushes only after the new view has been stored in `upsertBuffer`, so the near-full batch and final view update are sent as one over-limit `sendOnExit` payload. I checked `startRum.ts`, where page-exit preparation emits the final view update, and `httpRequest.ts` lines 109-124, where over-limit exit payloads skip `sendBeacon` and fall back to `fetch`; this regresses the near-full action + final view scenario documented in `test/e2e/scenario/transport.scenario.ts`, where the fetch can be cancelled on unload. Previously, the remove-then-`notifyBeforeAddMessage` path flushed the near-full batch before adding the replacement view, keeping both exit requests beacon-eligible.
Summary
- Addressed the review feedback by making upsert replacements run a pre-update bytes-limit check before the replacement is committed, via `flushController.notifyBeforeUpdateMessage(bytesDiff)` in `Batch.push`. This preserves the “flush-before-replacement” behavior needed to keep page-exit payloads beacon-friendly when a replacement would cross the recommended size limit. (`packages/core/src/transport/batch.ts` L32-L44)
- Added a new `FlushController` API, `notifyBeforeUpdateMessage`, that can trigger a bytes-limit flush for positive replacement diffs without incrementing the message count or touching duration-limit scheduling, so it does not reintroduce the indefinite-postpone issue fixed by the original PR. (`packages/core/src/transport/flushController.ts` L124-L156)
- Updated batch unit tests to assert the new update-path callback is used during same-key upsert replacement. (`packages/core/src/transport/batch.spec.ts` L71-L90)
- Added a flush controller unit test covering bytes-limit notification before message update/replacement. (`packages/core/src/transport/flushController.spec.ts` L144-L151)
- Extended the mock flush controller with the new method so transport tests remain type-accurate and stable. (`packages/core/test/emulate/mockFlushController.ts` L13-L27)
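A minimal sketch of the pre-update check described in the first bullet. Everything here is illustrative: the `BatchState` shape, the standalone `flush` helper, and the 16 KiB constant are assumptions standing in for the real `FlushController` internals, which carry more state.

```typescript
// Hedged sketch of the notifyBeforeUpdateMessage idea: before a replacement
// is committed, flush if a positive byte delta would cross the bytes limit,
// without ever touching the message count or duration-limit scheduling.
const BYTES_LIMIT = 16 * 1024 // illustrative stand-in for the recommended request bytes limit

interface BatchState {
  messagesCount: number
  bytesCount: number
  flushCount: number
}

function flush(state: BatchState) {
  state.flushCount += 1
  state.messagesCount = 0
  state.bytesCount = 0
}

function notifyBeforeUpdateMessage(state: BatchState, bytesCountDiff: number) {
  // Shrinking or equal-size replacements can never cross the limit,
  // and an empty batch has nothing worth flushing.
  if (bytesCountDiff > 0 && state.messagesCount > 0 && state.bytesCount + bytesCountDiff >= BYTES_LIMIT) {
    flush(state)
  }
}

// A batch sitting just under the limit is flushed before a growing
// replacement is committed, keeping both payloads beacon-sized.
const nearFull: BatchState = { messagesCount: 12, bytesCount: BYTES_LIMIT - 100, flushCount: 0 }
notifyBeforeUpdateMessage(nearFull, 500)
```

The check runs before the replacement is stored, which is what restores the “flush-before-replacement” ordering; note it deliberately does not re-register the replacement afterwards, which is the subject of the edge-case discussion further down.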
Git / PR actions
- Committed changes on the current branch with message: 🐛 preserve beacon-friendly flush behavior for upsert updates.
- Created a follow-up PR message via `make_pr` with the same title and a body summarizing the delta on top of the existing PR.
Testing
⚠️ `yarn test:unit --spec packages/core/src/transport/flushController.spec.ts --spec packages/core/src/transport/batch.spec.ts` (failed due to environment limitation: `ChromeHeadless` binary is not available).
Force-pushed from d7cc7d6 to 1dc0d8d
```ts
 * limit flush can happen before the replacement is committed.
 */
notifyBeforeUpdateMessage(estimatedMessageBytesCountDiff: number) {
  if (
```
I think there's an edge case here. If the byte diff pushes the batch over the limit, the flush resets counters to 0 and empties `upsertBuffer`, but then in `push()` the new message goes into the empty buffer with `currentMessagesCount` still at 0 and no timer scheduled. `notifyBeforeAddMessage` handles this correctly because it re-registers the message after the flush. I think this could happen in practice with a batch close to 16 KiB (a long stack error would do it). What do you think?
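For comparison, the safe add path the comment refers to can be sketched as follows. This is an illustrative, self-contained model, not the SDK's code; the constant and state shape are assumptions. The key point is that the incoming message's count and bytes are registered after the potential flush, so a flush can never leave buffered content behind with zeroed counters.

```typescript
// Illustrative sketch of why a "check, maybe flush, then re-register" add
// path is safe where a bare pre-update flush is not: the incoming message is
// always counted after the flush, so the batch cannot end up non-empty while
// its counters say it is empty.
const BYTES_LIMIT = 16 * 1024 // stand-in value

const state = { messagesCount: 0, bytesCount: 0, flushCount: 0 }

function flush() {
  state.flushCount += 1
  state.messagesCount = 0
  state.bytesCount = 0
}

function notifyBeforeAddMessage(estimatedBytesCount: number) {
  if (state.bytesCount + estimatedBytesCount >= BYTES_LIMIT) {
    flush()
  }
  // Re-register after the potential flush: the incoming message is always
  // reflected in the counters that drive timer scheduling.
  state.messagesCount += 1
  state.bytesCount += estimatedBytesCount
}

// Fill the batch close to the limit, then add a large message: the near-full
// batch is flushed, and the new message still ends up properly counted.
notifyBeforeAddMessage(BYTES_LIMIT - 10)
notifyBeforeAddMessage(1024)
```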
Force-pushed from 1dc0d8d to c0c74ff
Motivation
Backport of v7 commit 0fd15958a (originally squashed into #4462 on the v7 branch).

`FlushController` had a bug where the duration_limit timer could be repeatedly cancelled and rescheduled when a batch contained a single upserted message. Because `Batch.addOrUpdate` called `remove(key)` then `push(...)` on every upsert with the same key, `currentMessagesCount` dropped to `0` between the two calls, which triggered `cancelDurationLimitTimeout()`. The subsequent `push` then rescheduled the timer from scratch. As long as the upsert cadence (e.g. throttled view updates every 3s) was shorter than `FLUSH_DURATION_LIMIT` (30s), the timer never fired.

In practice this happened most often for view events being upserted with no accompanying resource/action/error events in the same batch — a common state for idle / backgrounded tabs and quickly-closed pages. The buffered view event was only flushed when something else triggered it (page exit, session expire, `messages_limit`, `bytes_limit`). When the page died before any of those ran (tab discarded, OS reclaim, crash, mobile background kill), the view event was lost entirely, silently under-reporting view durations in RUM.

Observed impact in our own dashboards after the v7 fix landed: total view count rose ~17.5% and `view.time_spent` p90/p95/p99 jumped sharply (p95 ~+114%) while p50/p75 were flat or slightly lower — the signature of recovering previously-dropped views at both tails of the distribution, not of inflating individual durations.

Changes

- Remove `notifyAfterRemoveMessage` from `FlushController` entirely.
- In `Batch.push`, when upserting a key that already has a buffered message, call `notifyAfterAddMessage(bytesDiff)` instead of remove-then-add. This adjusts only the byte count and never touches `currentMessagesCount` or the duration_limit timer.
- Update `mockFlushController` accordingly.

This is a straight cherry-pick from v7 — no behavioral divergence between the two branches in this area.
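The before/after timer behavior can be illustrated with a small fake-clock simulation. This is a sketch, not SDK code: the 30 s limit and 3 s cadence come from the description above, while the function and everything else in it are made up for illustration.

```typescript
// Simulates when the duration_limit timer first fires under a steady upsert
// cadence. rearmOnUpsert = true models the old remove-then-push behavior,
// where every upsert cancelled and rescheduled the 30 s timer; false models
// the fixed behavior, where only the first push arms it.
const FLUSH_DURATION_LIMIT = 30_000
const UPSERT_INTERVAL = 3_000 // e.g. throttled view updates

function simulateFirstFlush(rearmOnUpsert: boolean): number | undefined {
  let deadline: number | undefined
  // Upsert the same view every 3 s for a 2-minute window.
  for (let now = 0; now <= 120_000; now += UPSERT_INTERVAL) {
    if (deadline !== undefined && now >= deadline) {
      return deadline // the duration_limit timer fired
    }
    if (deadline === undefined || rearmOnUpsert) {
      deadline = now + FLUSH_DURATION_LIMIT
    }
  }
  return undefined // never fired within the window
}
```

With `rearmOnUpsert` true the deadline keeps moving ahead of the clock and the function returns `undefined` (the indefinite postpone); with it false, the timer armed by the first push fires at 30 000 ms.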
Test instructions
Checklist