Skip to content

Commit 5022769

Browse files
d-csclaude
andauthored
fix(webapp): retry on version collision when initializing a deployment (#3610)
Concurrent `POST /api/v1/deployments` requests for the same environment race on the `WorkerDeployment(environmentId, version)` unique constraint. Both requests read the same latest deployment via `findFirst`, compute the same next version via `calculateNextBuildVersion`, and both attempt `prisma.workerDeployment.create()` — one wins, the other crashes with Prisma `P2002`. The bug is a classic TOCTOU between the version read and the version write; it's been latent since the version-assignment logic was first added but only fires when two deploys land within milliseconds of each other (CI matrices, retried CLI calls, webhook-triggered redeploys). ## Approach Extracts the version assignment + create into a small helper `createDeploymentWithNextVersion` (`apps/webapp/app/v3/services/initializeDeployment/createDeploymentWithNextVersion.server.ts`). The helper retries on `P2002 (environmentId, version)` up to 5 times with randomised 5–50ms jitter so N concurrent racers don't loop in lockstep. Each attempt re-reads the latest version, recomputes via `calculateNextBuildVersion`, and re-runs the caller's `buildData` callback so version-dependent fields (image ref tag, friendlyId) are always consistent with the version actually persisted. A `logger.warn` fires per collision so the retry rate is observable in production logs. When retries are exhausted, the helper throws a dedicated `DeploymentVersionCollisionError` carrying `environmentId`, `attempts`, and `lastAttemptedVersion`, with the original `PrismaClientKnownRequestError` attached as `cause`. Sentry walks the `cause` chain natively, so contention exhaustion shows up as a distinguishable wrapper exception linked to the underlying `P2002` rather than a generic unique-constraint violation that looks identical to every other duplicate-key bug. The behavioural change is limited to "catch P2002 and retry instead of crashing." The image ref computation stays inside the builder callback (same call site as before the refactor), so ECR / non-ECR behaviour, S2 stream creation order, and all downstream side effects are unchanged. ## Non-goals - No new database migrations, no schema changes, no isolation-level / locking changes. A serialisable transaction or advisory lock would also fix this; retry-on-conflict is the smaller change that keeps the existing version-allocation logic intact. - Does not touch the analogous `calculateNextBuildVersion` call in `createBackgroundWorker.server.ts`, which likely has the same race shape against `BackgroundWorker`'s unique constraint — flagged as a follow-up. ## Test plan - [x] `pnpm run typecheck --filter webapp` passes (no new errors in the modified files). - [x] Three real-Postgres tests in `apps/webapp/test/createDeploymentWithNextVersion.test.ts` via `containerTest`: - 5 concurrent calls all produce distinct, persistable versions (`Set(versions).size === concurrency`). The naive read-then-create version of the helper fails this test with the exact same `P2002` seen in production; the retry version passes. - Non-`P2002` errors raised from the `buildData` callback propagate immediately without retry, builder invoked exactly once. - With `maxRetries: 0`, concurrent racers surface the wrapped `DeploymentVersionCollisionError` (not a raw `P2002`); `environmentId`, `attempts`, `lastAttemptedVersion` are populated and `error.cause.code === "P2002"`. - [x] Existing `apps/webapp/test/getDeploymentImageRef.test.ts` still green (the file was untouched in the final diff). ## Follow-ups (not in this PR) - `createBackgroundWorker.server.ts` likely has the same TOCTOU shape against its background-worker version unique constraint — should use the same helper. - Sentry visibility check: confirm `error.cause` chain renders as a linked exception in the Sentry UI when the wrapped error fires (requires a sandboxed triggering of the exhaustion path). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a97365d commit 5022769

4 files changed

Lines changed: 311 additions & 74 deletions

File tree

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
area: webapp
3+
type: fix
4+
---
5+
6+
Retry on unique-constraint collisions when assigning the next worker deployment version so concurrent deploys to the same environment no longer fail with P2002.

apps/webapp/app/v3/services/initializeDeployment.server.ts

Lines changed: 73 additions & 74 deletions
Original file line numberDiff line numberDiff line change
@@ -9,13 +9,13 @@ import { type AuthenticatedEnvironment } from "~/services/apiAuth.server";
99
import { logger } from "~/services/logger.server";
1010
import { generateFriendlyId } from "../friendlyIdentifiers";
1111
import { createRemoteImageBuild, remoteBuildsEnabled } from "../remoteImageBuilder.server";
12-
import { calculateNextBuildVersion } from "../utils/calculateNextBuildVersion";
1312
import { BaseService, ServiceValidationError } from "./baseService.server";
1413
import { TimeoutDeploymentService } from "./timeoutDeployment.server";
1514
import { getDeploymentImageRef } from "../getDeploymentImageRef.server";
1615
import { tryCatch } from "@trigger.dev/core";
1716
import { getRegistryConfig } from "../registryConfig.server";
1817
import { DeploymentService } from "./deployment.server";
18+
import { createDeploymentWithNextVersion } from "./initializeDeployment/createDeploymentWithNextVersion.server";
1919
import { errAsync } from "neverthrow";
2020

2121
const nanoid = customAlphabet("1234567890abcdefghijklmnopqrstuvwxyz", 8);
@@ -97,18 +97,6 @@ export class InitializeDeploymentService extends BaseService {
9797
});
9898
}
9999

100-
const latestDeployment = await this._prisma.workerDeployment.findFirst({
101-
where: {
102-
environmentId: environment.id,
103-
},
104-
orderBy: {
105-
createdAt: "desc",
106-
},
107-
take: 1,
108-
});
109-
110-
const nextVersion = calculateNextBuildVersion(latestDeployment?.version);
111-
112100
if (payload.selfHosted && remoteBuildsEnabled()) {
113101
throw new ServiceValidationError(
114102
"Self-hosted deployments are not supported on this instance"
@@ -146,30 +134,6 @@ export class InitializeDeploymentService extends BaseService {
146134

147135
const deploymentShortCode = nanoid(8);
148136

149-
const [imageRefError, imageRefResult] = await tryCatch(
150-
getDeploymentImageRef({
151-
registry: registryConfig,
152-
projectRef: environment.project.externalRef,
153-
nextVersion,
154-
environmentType: environment.type,
155-
deploymentShortCode,
156-
})
157-
);
158-
159-
if (imageRefError) {
160-
logger.error("Failed to get deployment image ref", {
161-
environmentId: environment.id,
162-
projectId: environment.projectId,
163-
version: nextVersion,
164-
triggeredById: triggeredBy?.id,
165-
type: payload.type,
166-
cause: imageRefError.message,
167-
});
168-
throw new ServiceValidationError("Failed to get deployment image ref");
169-
}
170-
171-
const { imageRef, isEcr, repoCreated } = imageRefResult;
172-
173137
// We keep using `BUILDING` as the initial status if not explicitly set
174138
// to avoid changing the behavior for deployments not created in the build server.
175139
// Native builds always start in the `PENDING` status.
@@ -208,20 +172,6 @@ export class InitializeDeploymentService extends BaseService {
208172
}
209173
: undefined;
210174

211-
logger.debug("Creating deployment", {
212-
environmentId: environment.id,
213-
projectId: environment.projectId,
214-
version: nextVersion,
215-
triggeredById: triggeredBy?.id,
216-
type: payload.type,
217-
imageRef,
218-
isEcr,
219-
repoCreated,
220-
initialStatus,
221-
artifactKey: payload.isNativeBuild ? payload.artifactKey : undefined,
222-
isNativeBuild: payload.isNativeBuild,
223-
});
224-
225175
const buildServerMetadata: BuildServerMetadata | undefined =
226176
payload.isNativeBuild || payload.buildId
227177
? {
@@ -238,28 +188,77 @@ export class InitializeDeploymentService extends BaseService {
238188
}
239189
: undefined;
240190

241-
const deployment = await this._prisma.workerDeployment.create({
242-
data: {
243-
friendlyId: generateFriendlyId("deployment"),
244-
contentHash: payload.contentHash,
245-
shortCode: deploymentShortCode,
246-
version: nextVersion,
247-
status: initialStatus,
248-
environmentId: environment.id,
249-
projectId: environment.projectId,
250-
externalBuildData,
251-
buildServerMetadata,
252-
triggeredById: triggeredBy?.id,
253-
type: payload.type,
254-
imageReference: imageRef,
255-
imagePlatform: env.DEPLOY_IMAGE_PLATFORM,
256-
git: payload.gitMeta ?? undefined,
257-
commitSHA: payload.gitMeta?.commitSha ?? undefined,
258-
runtime: payload.runtime ?? undefined,
259-
triggeredVia: payload.triggeredVia ?? undefined,
260-
startedAt: initialStatus === "BUILDING" ? new Date() : undefined,
261-
},
262-
});
191+
// Concurrent deploys to the same environment race on the
192+
// `(environmentId, version)` unique constraint. The helper retries on
193+
// P2002, recomputing the version (and re-running the image ref call so
194+
// the persisted imageReference always matches the persisted version)
195+
// each attempt.
196+
const deployment = await createDeploymentWithNextVersion(
197+
this._prisma,
198+
environment.id,
199+
async (nextVersion) => {
200+
const [imageRefError, imageRefResult] = await tryCatch(
201+
getDeploymentImageRef({
202+
registry: registryConfig,
203+
projectRef: environment.project.externalRef,
204+
nextVersion,
205+
environmentType: environment.type,
206+
deploymentShortCode,
207+
})
208+
);
209+
210+
if (imageRefError) {
211+
logger.error("Failed to get deployment image ref", {
212+
environmentId: environment.id,
213+
projectId: environment.projectId,
214+
version: nextVersion,
215+
triggeredById: triggeredBy?.id,
216+
type: payload.type,
217+
cause: imageRefError.message,
218+
});
219+
throw new ServiceValidationError("Failed to get deployment image ref");
220+
}
221+
222+
const { imageRef, isEcr, repoCreated } = imageRefResult;
223+
224+
logger.debug("Creating deployment", {
225+
environmentId: environment.id,
226+
projectId: environment.projectId,
227+
version: nextVersion,
228+
triggeredById: triggeredBy?.id,
229+
type: payload.type,
230+
imageRef,
231+
isEcr,
232+
repoCreated,
233+
initialStatus,
234+
artifactKey: payload.isNativeBuild ? payload.artifactKey : undefined,
235+
isNativeBuild: payload.isNativeBuild,
236+
});
237+
238+
return {
239+
// Regenerated per attempt: each attempt is a fresh `create` that
240+
// must satisfy `WorkerDeployment.friendlyId @unique`, so reusing a
241+
// friendlyId across retries would risk a spurious P2002 on
242+
// friendlyId instead of the version collision we're retrying.
243+
friendlyId: generateFriendlyId("deployment"),
244+
contentHash: payload.contentHash,
245+
shortCode: deploymentShortCode,
246+
status: initialStatus,
247+
projectId: environment.projectId,
248+
externalBuildData,
249+
buildServerMetadata,
250+
triggeredById: triggeredBy?.id,
251+
type: payload.type,
252+
imageReference: imageRef,
253+
imagePlatform: env.DEPLOY_IMAGE_PLATFORM,
254+
git: payload.gitMeta ?? undefined,
255+
commitSHA: payload.gitMeta?.commitSha ?? undefined,
256+
runtime: payload.runtime ?? undefined,
257+
triggeredVia: payload.triggeredVia ?? undefined,
258+
startedAt: initialStatus === "BUILDING" ? new Date() : undefined,
259+
};
260+
}
261+
);
263262

264263
const timeoutMs =
265264
deployment.status === "PENDING" ? env.DEPLOY_QUEUE_TIMEOUT_MS : env.DEPLOY_TIMEOUT_MS;
@@ -309,7 +308,7 @@ export class InitializeDeploymentService extends BaseService {
309308

310309
return {
311310
deployment,
312-
imageRef,
311+
imageRef: deployment.imageReference ?? "",
313312
eventStream,
314313
};
315314
});
Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
import {
2+
isUniqueConstraintError,
3+
type Prisma,
4+
type PrismaClientOrTransaction,
5+
type WorkerDeployment,
6+
} from "@trigger.dev/database";
7+
import { setTimeout as sleep } from "node:timers/promises";
8+
import { logger } from "~/services/logger.server";
9+
import { calculateNextBuildVersion } from "../../utils/calculateNextBuildVersion";
10+
11+
export type CreateDeploymentData = Omit<
12+
Prisma.WorkerDeploymentUncheckedCreateInput,
13+
"version" | "environmentId"
14+
>;
15+
16+
export type CreateDeploymentWithNextVersionOptions = {
17+
maxRetries?: number;
18+
jitterMs?: { min: number; max: number };
19+
};
20+
21+
const DEFAULT_MAX_RETRIES = 5;
22+
const DEFAULT_JITTER_MS = { min: 5, max: 50 };
23+
24+
export class DeploymentVersionCollisionError extends Error {
25+
readonly name = "DeploymentVersionCollisionError";
26+
readonly environmentId: string;
27+
readonly attempts: number;
28+
readonly lastAttemptedVersion: string;
29+
30+
constructor(args: {
31+
environmentId: string;
32+
attempts: number;
33+
lastAttemptedVersion: string;
34+
cause: unknown;
35+
}) {
36+
super(
37+
`Failed to allocate a unique worker deployment version for environment ${args.environmentId} after ${args.attempts} attempt(s); last tried "${args.lastAttemptedVersion}"`,
38+
{ cause: args.cause }
39+
);
40+
this.environmentId = args.environmentId;
41+
this.attempts = args.attempts;
42+
this.lastAttemptedVersion = args.lastAttemptedVersion;
43+
}
44+
}
45+
46+
export async function createDeploymentWithNextVersion(
47+
prisma: PrismaClientOrTransaction,
48+
environmentId: string,
49+
buildData: (nextVersion: string) => CreateDeploymentData | Promise<CreateDeploymentData>,
50+
options: CreateDeploymentWithNextVersionOptions = {}
51+
): Promise<WorkerDeployment> {
52+
const maxRetries = options.maxRetries ?? DEFAULT_MAX_RETRIES;
53+
const jitterMs = options.jitterMs ?? DEFAULT_JITTER_MS;
54+
55+
let lastError: unknown;
56+
let lastVersion = "";
57+
58+
for (let attempt = 0; attempt <= maxRetries; attempt++) {
59+
const latest = await prisma.workerDeployment.findFirst({
60+
where: { environmentId },
61+
orderBy: { createdAt: "desc" },
62+
take: 1,
63+
});
64+
65+
const version = calculateNextBuildVersion(latest?.version);
66+
lastVersion = version;
67+
const data = await buildData(version);
68+
69+
try {
70+
return await prisma.workerDeployment.create({
71+
data: { ...data, environmentId, version },
72+
});
73+
} catch (error) {
74+
if (!isUniqueConstraintError(error, ["environmentId", "version"])) {
75+
throw error;
76+
}
77+
78+
lastError = error;
79+
logger.warn("Worker deployment version collided, retrying", {
80+
environmentId,
81+
attempt: attempt + 1,
82+
maxRetries,
83+
attemptedVersion: version,
84+
});
85+
86+
// Randomised backoff so N concurrent racers don't loop in lockstep into the
87+
// same collision again.
88+
const delay = jitterMs.min + Math.random() * (jitterMs.max - jitterMs.min);
89+
await sleep(delay);
90+
}
91+
}
92+
93+
throw new DeploymentVersionCollisionError({
94+
environmentId,
95+
attempts: maxRetries + 1,
96+
lastAttemptedVersion: lastVersion,
97+
cause: lastError,
98+
});
99+
}

0 commit comments

Comments
 (0)