Skip to content

fix(run-engine): sweeper LREMs the worker queue list when acking marked runs#3527

Closed
d-cs wants to merge 4 commits intomainfrom
run-queue-sweeper-stale-entry-repro
Closed

fix(run-engine): sweeper LREMs the worker queue list when acking marked runs#3527
d-cs wants to merge 4 commits intomainfrom
run-queue-sweeper-stale-entry-repro

Conversation

@d-cs
Copy link
Copy Markdown
Collaborator

@d-cs d-cs commented May 5, 2026

The concurrency sweeper's processMarkedRun calls acknowledgeMessage with removeFromWorkerQueue: false. The Lua script always DELs the message body but only LREMs the worker queue list when the flag is set. Result: any run that was pushed onto the worker queue list (fast-path enqueue or processQueueForWorkerQueue promotion) but not yet BLPOP'd when the sweeper finds it leaves a stale messageKey value sitting on the list. The next worker BLPOP returns the tombstone, GET messageKey returns nil, and the dequeue path logs Failed to dequeue message from worker queue.

The original assumption was that sweeper-acked runs had already been BLPOP'd off the list (entry already gone, nothing to LREM). That holds when the worker BLPOPs and then dies, but not when fast-path enqueue or processQueueForWorkerQueue adds the entry to the list and the run then ages past the sweeper's 10-minute completed-at threshold without ever being popped.

Switch to removeFromWorkerQueue: true. Cost is one extra LREM per swept run, O(N) over the worker queue list length. Sweeper acks are bounded by the cron schedule (every 5 minutes, max 100 marked runs per fire), so the added Redis work is negligible relative to baseline ack workload.

Test

internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts adds a regression test that fast-path enqueues a message, runs the sweeper, then dequeues. Pre-fix the dequeue path logs the "Failed to dequeue message from worker queue" error; post-fix the worker queue list is left clean and the dequeue returns undefined silently.

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 5, 2026

⚠️ No Changeset found

Latest commit: 0a1199a

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@d-cs d-cs self-assigned this May 5, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5149dd01-6fda-4117-bfc5-726274b0117d

📥 Commits

Reviewing files that changed from the base of the PR and between eacd023 and 0a1199a.

📒 Files selected for processing (1)
  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
🧰 Additional context used
📓 Path-based instructions (10)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
{apps,internal-packages}/**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

Use pnpm run typecheck to verify changes in apps and internal packages (apps/*, internal-packages/*) instead of build, which proves almost nothing about correctness

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.test.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.test.{ts,tsx,js}: Use vitest exclusively for testing and never mock anything - use testcontainers instead
Place test files next to source files using the pattern MyService.ts -> MyService.test.ts

**/*.test.{ts,tsx,js}: Use vitest for unit testing and run tests with pnpm run test
Test files should live beside the files under test with descriptive describe and it blocks
Tests should avoid mocks or stubs and use helpers from @internal/testcontainers when Redis or Postgres are needed

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use testcontainers with redisTest, postgresTest, or containerTest from @internal/testcontainers for testing with Redis/PostgreSQL dependencies

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
{package.json,**/*.{ts,tsx,js}}

📄 CodeRabbit inference engine (CLAUDE.md)

Pin Zod to version 3.25.76 exactly across the entire monorepo - never use a different version or version range

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.{ts,tsx,js}

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.{ts,tsx,js}: Import from @trigger.dev/core using subpaths only, never the root export
Always import tasks from @trigger.dev/sdk, never from @trigger.dev/sdk/v3 or deprecated client.defineJob
Add crumbs to code using // @Crumbs comments or `// `#region` `@crumbs blocks for debug tracing during development

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
**/*.{ts,tsx,js,jsx,json,md,css,scss}

📄 CodeRabbit inference engine (AGENTS.md)

Code formatting is enforced using Prettier. Run pnpm run format before committing

Files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
🧠 Learnings (2)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts
🔇 Additional comments (2)
internal-packages/run-engine/src/run-queue/tests/concurrencySweeper.test.ts (2)

178-244: LGTM — both previous review concerns are resolved.

  • The vi.spyOn(logger, "error") has been replaced with state-based assertions (peekAllOnWorkerQueue, operationalCurrentConcurrencyOfEnvironment, readMessage), fully complying with the "never mock anything" guideline.
  • The inline comment at lines 242–243 now correctly reads removeFromWorkerQueue: true, matching processMarkedRun in index.ts.

256-261: ⚡ Quick win

Method signature is correct; no issues found.

The dequeueMessageFromWorkerQueue method correctly accepts the third-argument options object with blockingPop and blockingPopTimeoutSeconds properties. The test code properly passes these options, and the 2-second blocking timeout will function as intended.


Walkthrough

The concurrency sweeper was changed to remove the corresponding message entry from the worker queue list when acknowledging marked runs. A server-changes note documents this behavior. The code update flips removeFromWorkerQueue to true for acknowledgments in the sweeper path. A new Redis-backed integration test validates that the sweeper releases operational concurrency, deletes the message body, clears the worker queue list (no stale messageKey tombstone), and that a subsequent blocking dequeue returns undefined.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description provides comprehensive technical context, root cause analysis, and test coverage, but the PR template structure is not followed. Restructure the description to match the template: add issue reference (Closes #), complete the checklist, use section headings (Testing, Changelog, Screenshots), and maintain the concise explanations within the template format.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: the concurrency sweeper now removes worker queue list entries when acknowledging marked runs.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch run-queue-sweeper-stale-entry-repro

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

d-cs and others added 3 commits May 5, 2026 15:29
Adds a failing test in concurrencySweeper.test.ts that demonstrates an
inconsistency between the worker queue list and the message body store:

- Fast-path enqueue RPUSHes the messageKey value onto the worker queue list
  and SADDs the run into currentConcurrency.
- The sweeper marks the run as completed (via test callback) and processMarkedRun
  acks with removeFromWorkerQueue: false, which DELs the message body but skips
  the LREM on the worker queue list.
- A subsequent blocking dequeue BLPOPs the stale messageKey, the GET returns
  nil, and the dequeue path emits "Failed to dequeue message from worker queue"
  with workerQueueLength: 0.

The test asserts that the dequeue path does not log this error after the
sweeper has acked the run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed runs

processMarkedRun previously called acknowledgeMessage with
removeFromWorkerQueue: false. The Lua script always DELs the message body
but only LREMs the worker queue list when the flag is set, leaving a
stale messageKey on the list whenever the swept run had been pushed to
the worker queue list (fast-path enqueue or processQueueForWorkerQueue
promotion) but never BLPOP'd.

The next BLPOP returns the tombstone, GET on the messageKey returns nil,
and the dequeue path logs "Failed to dequeue message from worker queue".

Switch to removeFromWorkerQueue: true. Cost is one extra LREM per swept
run (O(N) on the list); production telemetry shows sweeper acks at
<0.33/sec hard upper bound on worker queue lists averaging 200-500
entries — under 0.05% Redis CPU even at peak. The repro test added in
the previous commit flips to passing with this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-cs d-cs force-pushed the run-queue-sweeper-stale-entry-repro branch from 847a4ef to eacd023 Compare May 5, 2026 14:32
coderabbitai[bot]

This comment was marked as resolved.

… logger

Address CodeRabbit feedback on concurrencySweeper.test.ts:

- Replace vi.spyOn(logger, "error") with a peekAllOnWorkerQueue assertion.
  The state-based check proves the same invariant (no tombstone exists, so
  no future BLPOP can fail) and matches the repo's "never mock anything"
  testing convention.
- Update the stale inline comment that said the sweeper acks with
  removeFromWorkerQueue: false; the implementation now uses true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-cs d-cs marked this pull request as ready for review May 5, 2026 15:07
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

@d-cs d-cs marked this pull request as draft May 5, 2026 15:26
@d-cs d-cs closed this May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant