Prevent failed instance retries by ciaranbor · Pull Request #1763 · exo-explore/exo

ciaranbor · 2026-03-20T14:59:09Z

Motivation

Currently, when a runner fails, the master retries the instance. Most of the time, this results in a loop of repeated failures. Retries need backoff and a cap.

Changes

src/exo/worker/main.py: Before creating a runner, check an exponential backoff timer per instance. After EXO_MAX_INSTANCE_RETRIES failures, send DeleteInstance to permanently remove the instance. Record attempts on Shutdown; reset on InstanceDeleted.
src/exo/utils/keyed_backoff.py: Add attempts() method to query the retry count.
src/exo/shared/constants.py: Add EXO_MAX_INSTANCE_RETRIES = 3.
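The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `InstanceRetryGate`, its method names, and the string return values are hypothetical, and the doubling factor is an assumption. Only `EXO_MAX_INSTANCE_RETRIES = 3`, the 2s base, the 30s cap, and the Shutdown / InstanceDeleted / DeleteInstance behaviour come from the description.

```python
import time
from typing import Callable, Dict

EXO_MAX_INSTANCE_RETRIES = 3  # value stated in the PR description


class InstanceRetryGate:
    """Hypothetical sketch; not the actual worker or KeyedBackoff code."""

    def __init__(
        self,
        base: float = 2.0,
        cap: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._base = base
        self._cap = cap
        self._clock = clock
        self._attempts: Dict[str, int] = {}
        self._ready_at: Dict[str, float] = {}

    def attempts(self, instance_id: str) -> int:
        """Number of failed attempts recorded for this instance."""
        return self._attempts.get(instance_id, 0)

    def on_shutdown(self, instance_id: str) -> None:
        """Record a failed attempt and push back the next allowed retry."""
        n = self.attempts(instance_id) + 1
        self._attempts[instance_id] = n
        # Assumed schedule: delay doubles per failure, clamped to the cap.
        delay = min(self._cap, self._base * 2 ** (n - 1))
        self._ready_at[instance_id] = self._clock() + delay

    def on_instance_deleted(self, instance_id: str) -> None:
        """Forget all state so a fresh placement starts clean."""
        self._attempts.pop(instance_id, None)
        self._ready_at.pop(instance_id, None)

    def plan(self, instance_id: str) -> str:
        """Decide what the worker should do for this instance right now."""
        if self.attempts(instance_id) >= EXO_MAX_INSTANCE_RETRIES:
            return "DeleteInstance"
        if self._clock() < self._ready_at.get(instance_id, float("-inf")):
            return "Wait"
        return "CreateRunner"
```

Injecting the clock keeps the sketch testable without sleeping; the real worker may structure this quite differently.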

Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding exponential delay (2s base, 30s cap) between retries. After 3 failures the worker sends DeleteInstance, stopping retries entirely. The backoff resets when the instance is deleted, so a fresh placement starts clean.
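To make the delay schedule concrete, here is a small sketch of the arithmetic. It assumes the delay doubles with each failure (the description says only "exponential"), and the function name `backoff_delay` is hypothetical rather than part of exo.

```python
def backoff_delay(failures: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Seconds to wait before the retry that follows `failures` failed attempts.

    Assumes doubling per failure; base and cap are the values from the PR text.
    """
    if failures < 1:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


# Delays after the 1st through 5th failure under the doubling assumption.
schedule = [backoff_delay(n) for n in range(1, 6)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 30.0]
```

Under this assumption, with a limit of 3 failures the worker would wait about 2s and then 4s before giving up, so the 30s cap would only come into play with a larger retry limit.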

AlexCheema · 2026-03-20T20:18:22Z

I agree with limiting the number of retries with exponential backoff, but we should not remove retries altogether. Often I have run into situations where the instance fails the first time then succeeds after a retry (particularly common with RDMA).

rltakashige

Cool!

ciaranbor force-pushed the ciaran/remove-instance-retry branch 6 times, most recently from 2d32a29 to 384f830 March 26, 2026 17:08

ciaranbor added 2 commits April 1, 2026 13:03

Prevent failed instance retries

b912f21

Use exponential backoff for instance retries

d58c6b0

Evanev7 force-pushed the ciaran/remove-instance-retry branch from 384f830 to 383ed31 April 1, 2026 12:20

improve keyed backoff

7b26e19

Evanev7 force-pushed the ciaran/remove-instance-retry branch from 383ed31 to 7b26e19 April 1, 2026 15:00

Evanev7 mentioned this pull request Apr 1, 2026

prevent some crash loops #1827

Merged

rltakashige approved these changes Apr 1, 2026

rltakashige merged commit eb6ae9f into main Apr 1, 2026
6 checks passed

rltakashige deleted the ciaran/remove-instance-retry branch April 1, 2026 20:03

Evanev7 added a commit that referenced this pull request Apr 9, 2026

prevent some crash loops (#1827)

f2e6b1e

extension to #1763 that prevents crash looping in some common scenarios.

rltakashige merged 3 commits into main from ciaran/remove-instance-retry
