fix(jobs): finalize async job when 0 images need processing#1191
Conversation
✅ Deploy Preview for antenna-ssec canceled.
|
✅ Deploy Preview for antenna-preview canceled.
|
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe change enhances job completion handling in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
When an async_api job finds 0 images to process (all already processed by the pipeline), queue_images_to_nats sets stages to SUCCESS but MLJob.run() returns without updating the job status. The job stays STARTED forever since no async results will arrive to trigger the completion check in update_job_progress. Check job.progress.is_complete() after NATS dispatch and finalize immediately if all stages are already done. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8c85bc2 to
af9b41d
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a stuck-status edge case for async_api ML jobs where collect_images() returns no work (e.g., everything already processed), ensuring the job is finalized instead of remaining in STARTED indefinitely.
Changes:
- After NATS dispatch for
ASYNC_API, checksjob.progress.is_complete()and immediately finalizes the job when no async results will arrive. - Sets job-level
SUCCESSandfinished_atin this early-completion path.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
ami/jobs/models.py (1)
481-489: Good fix for the zero-images edge case.The logic correctly handles the case where all stages are already complete after queuing to NATS. This prevents jobs from being stuck in STARTED when no async results will arrive.
One minor observation: the status check only considers
JobState.FAILURE, butJobState.failed_states()(line 88-89) also includesREVOKEDandUNKNOWN. For consistency, consider using the helper method:if job.progress.is_complete(): - has_failure = any(s.status == JobState.FAILURE for s in job.progress.stages) + has_failure = any(s.status in JobState.failed_states() for s in job.progress.stages) job.update_status(JobState.FAILURE if has_failure else JobState.SUCCESS)This aligns with how failure states are defined elsewhere in the codebase. That said, for the 0-images scenario where stages are set to SUCCESS by
queue_images_to_nats, this is unlikely to matter in practice.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ami/jobs/models.py` around lines 481 - 489, The status check when finalizing jobs uses a hardcoded check for JobState.FAILURE; change it to use the canonical failure set by calling JobState.failed_states() so it detects REVOKED/UNKNOWN too — i.e., replace the has_failure computation (currently using any(s.status == JobState.FAILURE for s in job.progress.stages)) with an inclusion check against JobState.failed_states(); leave the surrounding calls to job.update_status(...), job.finished_at = datetime.datetime.now(), and job.save() unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@ami/jobs/models.py`:
- Around line 481-489: The status check when finalizing jobs uses a hardcoded
check for JobState.FAILURE; change it to use the canonical failure set by
calling JobState.failed_states() so it detects REVOKED/UNKNOWN too — i.e.,
replace the has_failure computation (currently using any(s.status ==
JobState.FAILURE for s in job.progress.stages)) with an inclusion check against
JobState.failed_states(); leave the surrounding calls to job.update_status(...),
job.finished_at = datetime.datetime.now(), and job.save() unchanged.
Code reviewFound 1 issue:
Lines 482 to 492 in af9b41d 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
- Add cleanup_async_job_if_needed() call to avoid stale Redis/NATS keys - Use update_status(save=False) to avoid redundant double save - Use JobState.failed_states() instead of hardcoded FAILURE check Co-Authored-By: Claude <noreply@anthropic.com>
Verifies that an async_api job with 0 images to process transitions to SUCCESS with finished_at set, rather than staying stuck at STARTED. Co-Authored-By: Claude <noreply@anthropic.com>
Summary
async_apijob finds 0 images to process (all already processed by the pipeline),queue_images_to_natscorrectly sets all stages to SUCCESS, butMLJob.run()returns without updating the job-level statusSTARTEDforever since no async results will arrive to trigger the completion check inupdate_job_progressjob.progress.is_complete()after NATS dispatch and finalize immediately if all stages are already doneFound during demo environment 500/504 investigation. Jobs 18 and 26 on demo both hit this — 0 images to process, stages all SUCCESS, but job stuck at STARTED.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes