What version are you using?
Latest code from the main branch
What did you do?
- Run parallel catchup v2
- Some jobs caused the catchup worker pods to OOM. The jobs didn't finish and were left in the in-progress queue
- Eventually the job monitor saw jobs in the in-progress queue while all workers were down
- Retry logic moved the OOMed jobs from the in-progress queue back to the job queue
- The jobs were picked up by the catchup worker pods and moved to the in-progress queue
- The jobs OOMed again and did not finish
- We're back to step 3 (the job monitor sees stalled jobs), in an infinite loop
What did you expect to see?
Perhaps we should fail the mission after a certain number of failures?
We could also check the catchup workers for OOM events. If those occur, some ranges may never finish.
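To illustrate the first suggestion, here is a minimal sketch of a retry cap, using in-memory queues instead of the real job and in-progress queues. All names (`MAX_RETRIES`, `MissionFailed`, `requeue_stalled_jobs`, the `range-1` job id) are hypothetical and not taken from the existing job monitor; the limit of 3 is an arbitrary placeholder.

```python
from collections import deque

MAX_RETRIES = 3  # hypothetical limit, not a value from the project


class MissionFailed(Exception):
    """Raised when a job has been retried more than the allowed number of times."""


def requeue_stalled_jobs(job_queue, in_progress, retry_counts, max_retries=MAX_RETRIES):
    """Move stalled jobs back to the job queue, counting each retry per job.

    Raises MissionFailed instead of requeueing once a job exceeds max_retries.
    """
    while in_progress:
        job = in_progress[0]  # peek first so the job is not lost if we raise
        attempts = retry_counts.get(job, 0) + 1
        if attempts > max_retries:
            raise MissionFailed(f"job {job!r} exceeded {max_retries} retries")
        retry_counts[job] = attempts
        job_queue.append(in_progress.popleft())


def simulate(max_retries=MAX_RETRIES):
    """Simulate a job that OOMs on every attempt; return how many requeues happened."""
    job_queue = deque()
    in_progress = deque(["range-1"])  # hypothetical job id left behind by an OOMed worker
    retry_counts = {}
    requeues = 0
    try:
        while True:
            requeue_stalled_jobs(job_queue, in_progress, retry_counts, max_retries)
            requeues += 1
            # A worker picks the job up again and OOMs, leaving it in progress.
            in_progress.append(job_queue.popleft())
    except MissionFailed:
        return requeues
```

With a cap like this, `simulate()` stops after a bounded number of requeues instead of looping forever. In a real monitor the retry counts would need to live somewhere that survives a monitor restart.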
What did you see instead?
The mission was stuck in a retry loop.