Prevent failed instance retries #1763

Merged
rltakashige merged 3 commits into main from ciaran/remove-instance-retry
Apr 1, 2026

Conversation

@ciaranbor
Member

@ciaranbor ciaranbor commented Mar 20, 2026

Motivation

Currently, when a runner fails, the master retries the instance. Most of the time, this results in a loop of repeated failures. Retries need backoff and a cap.

Changes

  • src/exo/worker/main.py: Before creating a runner, check an exponential backoff timer per instance. After EXO_MAX_INSTANCE_RETRIES failures, send DeleteInstance to permanently remove the instance. Record attempts on Shutdown; reset on InstanceDeleted.
  • src/exo/utils/keyed_backoff.py: Add an attempts() method to query the retry count.
  • src/exo/shared/constants.py: Add EXO_MAX_INSTANCE_RETRIES = 3.
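
For readers unfamiliar with the pattern, here is a minimal sketch of a per-key exponential backoff that exposes an attempt count. It is not the code in src/exo/utils/keyed_backoff.py: the class name `KeyedBackoffSketch`, the methods `record_attempt`, `should_proceed` and `reset`, and the doubling schedule are all assumptions made for illustration. Only `attempts()`, the 2s base and the 30s cap come from this description.

```python
import time
from typing import Callable, Dict, Hashable


class KeyedBackoffSketch:
    """Hypothetical per-key exponential backoff (illustration only, not exo's class)."""

    def __init__(
        self,
        base: float = 2.0,   # base delay in seconds, as stated in the PR
        cap: float = 30.0,   # maximum delay in seconds, as stated in the PR
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._base = base
        self._cap = cap
        self._clock = clock
        self._attempts: Dict[Hashable, int] = {}
        self._ready_at: Dict[Hashable, float] = {}

    def record_attempt(self, key: Hashable) -> None:
        """Count one more failure for `key` and push back its next allowed time."""
        n = self._attempts.get(key, 0) + 1
        self._attempts[key] = n
        # Assumed schedule: the delay doubles per failure and is clamped at the cap.
        delay = min(self._base * 2 ** (n - 1), self._cap)
        self._ready_at[key] = self._clock() + delay

    def should_proceed(self, key: Hashable) -> bool:
        """True if `key` has never failed or its backoff window has elapsed."""
        if key not in self._ready_at:
            return True
        return self._clock() >= self._ready_at[key]

    def attempts(self, key: Hashable) -> int:
        """Number of failures recorded for `key` since the last reset."""
        return self._attempts.get(key, 0)

    def reset(self, key: Hashable) -> None:
        """Forget all state for `key`, e.g. once the instance is deleted."""
        self._attempts.pop(key, None)
        self._ready_at.pop(key, None)
```

The clock is injectable so the timing can be tested without sleeping.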

Why It Works

The worker gates CreateRunner tasks behind a KeyedBackoff, adding exponential delay (2s base, 30s cap) between retries. After 3 failures the worker sends DeleteInstance, stopping retries entirely. The backoff resets when the instance is deleted, so a fresh placement starts clean.
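
The gating decision described above can be sketched as a pure function. This is an illustration under stated assumptions, not the logic in src/exo/worker/main.py: `Action`, `retry_delay` and `next_action` are hypothetical names, the delay is assumed to double per failure, and "after 3 failures" is read as `failures >= EXO_MAX_INSTANCE_RETRIES`.

```python
from enum import Enum

EXO_MAX_INSTANCE_RETRIES = 3  # value given in the PR description
BASE_DELAY_S = 2.0            # base delay given in the PR description
MAX_DELAY_S = 30.0            # delay cap given in the PR description


class Action(Enum):
    """Hypothetical outcomes of the worker's gate (names are illustrative)."""
    WAIT = "wait"
    CREATE_RUNNER = "create_runner"
    DELETE_INSTANCE = "delete_instance"


def retry_delay(failures: int) -> float:
    """Seconds to wait after `failures` failures (assumed doubling schedule)."""
    if failures <= 0:
        return 0.0
    return min(BASE_DELAY_S * 2 ** (failures - 1), MAX_DELAY_S)


def next_action(failures: int, seconds_since_last_failure: float) -> Action:
    """Decide what to do for one instance given its failure history."""
    # Give up permanently once the retry budget is spent.
    if failures >= EXO_MAX_INSTANCE_RETRIES:
        return Action.DELETE_INSTANCE
    # Otherwise hold off until the backoff window has elapsed.
    if seconds_since_last_failure < retry_delay(failures):
        return Action.WAIT
    return Action.CREATE_RUNNER
```

For example, `next_action(1, 1.0)` yields `Action.WAIT` because the first retry is held for 2 seconds, while `next_action(3, 100.0)` yields `Action.DELETE_INSTANCE` regardless of elapsed time.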

@AlexCheema
Contributor

I agree with limiting the number of retries with exponential backoff, but we should not remove retries altogether. Often I have run into situations where the instance fails the first time then succeeds after a retry (particularly common with RDMA).

@ciaranbor ciaranbor force-pushed the ciaran/remove-instance-retry branch 6 times, most recently from 2d32a29 to 384f830 Compare March 26, 2026 17:08
@Evanev7 Evanev7 force-pushed the ciaran/remove-instance-retry branch from 384f830 to 383ed31 Compare April 1, 2026 12:20
@Evanev7 Evanev7 force-pushed the ciaran/remove-instance-retry branch from 383ed31 to 7b26e19 Compare April 1, 2026 15:00
@Evanev7 Evanev7 mentioned this pull request Apr 1, 2026
Collaborator

@rltakashige rltakashige left a comment

Cool!

@rltakashige rltakashige merged commit eb6ae9f into main Apr 1, 2026
6 checks passed
@rltakashige rltakashige deleted the ciaran/remove-instance-retry branch April 1, 2026 20:03
Evanev7 added a commit that referenced this pull request Apr 9, 2026
extension to #1763 that prevents crash looping in some common scenarios.

4 participants