Parallel catchup V2 retry logic can lead to stuck mission

### What version are you using?

Latest code from the `main` branch

### What did you do?

1. Run parallel catchup v2.
2. Some jobs caused the catchup worker pods to run out of memory (OOM). Those jobs didn't finish and were left in the in-progress queue.
3. Eventually the job monitor saw jobs in the in-progress queue while all workers were down.
4. The retry logic moved the jobs that hit OOM from the in-progress queue back to the job queue.
5. The catchup worker pods picked up the jobs and moved them to the in-progress queue.
6. The jobs hit OOM again and did not finish.
7. We're back at step 3, in an infinite loop.

### What did you expect to see?

Perhaps we should fail the mission after a certain number of failures?
We could also check the catchup workers for OOM events. If those happen, there is a possibility that some ranges will never finish.
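
To make the two suggestions concrete, here is a minimal sketch, not the actual job monitor code. It assumes the queues can be modelled as in-memory deques and that pod status is available as a dict shaped like the Kubernetes Pod JSON. The names `requeue_stale_jobs`, `was_oom_killed`, `MissionFailed`, and `MAX_FAILURES` are hypothetical; only the `OOMKilled` termination reason comes from Kubernetes itself.

```python
from collections import deque

MAX_FAILURES = 3  # hypothetical cap; the right value is a judgment call


class MissionFailed(Exception):
    """Raised when a job has been requeued too many times."""


def requeue_stale_jobs(job_queue, in_progress, failures, max_failures=MAX_FAILURES):
    """Move every job from in_progress back to job_queue.

    Each move counts as one failure for that job. Once a job reaches
    max_failures, raise MissionFailed instead of requeueing it again.
    """
    while in_progress:
        job = in_progress.popleft()
        failures[job] = failures.get(job, 0) + 1
        if failures[job] >= max_failures:
            raise MissionFailed(f"job {job!r} failed {failures[job]} times")
        job_queue.append(job)


def was_oom_killed(pod):
    """Return True if any container in the pod was terminated with OOMKilled.

    `pod` is assumed to be a dict shaped like the Kubernetes Pod JSON; both the
    current state and the last state of each container are inspected.
    """
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for container_status in statuses:
        for key in ("state", "lastState"):
            terminated = (container_status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                return True
    return False
```

With a cap like this, the second or third pass through step 3 would surface as a failed mission rather than another silent requeue, and the OOM check could be used to fail faster or to report which ranges are unlikely to ever finish.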

### What did you see instead?

The mission was stuck in a retry loop.
