fix(jobs): finalize async job when 0 images need processing by mihow · Pull Request #1191 · RolnickLab/antenna

mihow · 2026-03-25T23:31:44Z

Summary

When an async_api job finds 0 images to process (all already processed by the pipeline), queue_images_to_nats correctly sets all stages to SUCCESS, but MLJob.run() returns without updating the job-level status
The job stays STARTED forever since no async results will arrive to trigger the completion check in update_job_progress
Fix: check job.progress.is_complete() after NATS dispatch and finalize immediately if all stages are already done

Found during demo environment 500/504 investigation. Jobs 18 and 26 on demo both hit this — 0 images to process, stages all SUCCESS, but job stuck at STARTED.

Test plan

Run an async_api job where all images have already been processed by the pipeline
Verify job status transitions to SUCCESS (not stuck at STARTED)
Verify normal async jobs with images still complete via the NATS result pipeline

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed job completion handling for edge cases in async operations, particularly when no images are provided for processing. Jobs now properly determine their final status based on stage results and finalize immediately rather than depending on delayed external callbacks.

netlify · 2026-03-25T23:31:50Z

✅ Deploy Preview for antenna-ssec canceled.

Name	Link
🔨 Latest commit	`4d3a7a5`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-ssec/deploys/69c476bafdea130008baceff

netlify · 2026-03-25T23:31:53Z

✅ Deploy Preview for antenna-preview canceled.

Name	Link
🔨 Latest commit	`4d3a7a5`
🔍 Latest deploy log	https://app.netlify.com/projects/antenna-preview/deploys/69c476ba7227e20008d10ce3

coderabbitai · 2026-03-25T23:32:00Z

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 43 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b82721f3-9916-4fa5-bfee-48b61b43a946

📥 Commits

Reviewing files that changed from the base of the PR and between af9b41d and 4d3a7a5.

📒 Files selected for processing (2)

ami/jobs/models.py
ami/jobs/tests/test_jobs.py

📝 Walkthrough

Walkthrough

The change enhances job completion handling in MLJob.run to address the case where zero images are queued (e.g., job.progress.is_complete() is already true). It derives the final job status by examining per-stage statuses, sets the job state to FAILURE or SUCCESS accordingly, records the completion timestamp, and persists the job immediately instead of relying on asynchronous callbacks.

Changes

Cohort / File(s)	Summary
Job Completion Logic `ami/jobs/models.py`	Added handling for complete jobs with zero images: derives final job status from per-stage failures, sets `FAILURE` or `SUCCESS`, records `finished_at`, and persists immediately.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

PSv2 cleanup: use is_complete() and dispatch_mode in job progress handler #1125: Adjusts job completion handling using job.progress.is_complete() and related finalization behavior.

Poem

🐰 A job complete with naught to do,
No more waiting for callbacks true,
We check the stages, count the toll,
Set success or failure, whole and whole! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main fix: finalizing async jobs when there are zero images to process, which matches the primary change in the changeset.
Description check	✅ Passed	The description addresses the main issue and includes summary, problem context, and solution, but omits several template sections including detailed test instructions, deployment notes, and the completion checklist.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/async-job-zero-images-completion

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

When an async_api job finds 0 images to process (all already processed by the pipeline), queue_images_to_nats sets stages to SUCCESS but MLJob.run() returns without updating the job status. The job stays STARTED forever since no async results will arrive to trigger the completion check in update_job_progress. Check job.progress.is_complete() after NATS dispatch and finalize immediately if all stages are already done. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes a stuck-status edge case for async_api ML jobs where collect_images() returns no work (e.g., everything already processed), ensuring the job is finalized instead of remaining in STARTED indefinitely.

Changes:

After NATS dispatch for ASYNC_API, checks job.progress.is_complete() and immediately finalizes the job when no async results will arrive.
Sets job-level SUCCESS and finished_at in this early-completion path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ami/jobs/models.py

coderabbitai

🧹 Nitpick comments (1)

ami/jobs/models.py (1)
481-489: Good fix for the zero-images edge case.

The logic correctly handles the case where all stages are already complete after queuing to NATS. This prevents jobs from being stuck in STARTED when no async results will arrive.

One minor observation: the status check only considers JobState.FAILURE, but JobState.failed_states() (line 88-89) also includes REVOKED and UNKNOWN. For consistency, consider using the helper method:
             if job.progress.is_complete():
-                has_failure = any(s.status == JobState.FAILURE for s in job.progress.stages)
+                has_failure = any(s.status in JobState.failed_states() for s in job.progress.stages)
                 job.update_status(JobState.FAILURE if has_failure else JobState.SUCCESS)
This aligns with how failure states are defined elsewhere in the codebase. That said, for the 0-images scenario where stages are set to SUCCESS by queue_images_to_nats, this is unlikely to matter in practice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/models.py` around lines 481 - 489, The status check when finalizing
jobs uses a hardcoded check for JobState.FAILURE; change it to use the canonical
failure set by calling JobState.failed_states() so it detects REVOKED/UNKNOWN
too — i.e., replace the has_failure computation (currently using any(s.status ==
JobState.FAILURE for s in job.progress.stages)) with an inclusion check against
JobState.failed_states(); leave the surrounding calls to job.update_status(...),
job.finished_at = datetime.datetime.now(), and job.save() unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ami/jobs/models.py`:
- Around line 481-489: The status check when finalizing jobs uses a hardcoded
check for JobState.FAILURE; change it to use the canonical failure set by
calling JobState.failed_states() so it detects REVOKED/UNKNOWN too — i.e.,
replace the has_failure computation (currently using any(s.status ==
JobState.FAILURE for s in job.progress.stages)) with an inclusion check against
JobState.failed_states(); leave the surrounding calls to job.update_status(...),
job.finished_at = datetime.datetime.now(), and job.save() unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bd631b2-fb3f-4a41-a7c2-1cdc159e79ca

📥 Commits

Reviewing files that changed from the base of the PR and between 5af9ce7 and af9b41d.

📒 Files selected for processing (1)

ami/jobs/models.py

mihow · 2026-03-25T23:47:47Z

Code review

Found 1 issue:

Missing cleanup_async_job_if_needed(job) call in the new early-completion path. Every other finalization path in the codebase calls cleanup (_fail_job, _update_job_progress, check_stale_jobs, update_job_failure, Job.cancel). When queue_images_to_nats runs, it calls state_manager.initialize_job(image_ids) which creates Redis keys for progress tracking. In the zero-images case, no NATS results arrive, so the progress handler (which normally triggers cleanup) never fires. The function is already imported at line 18 of models.py.

antenna/ami/jobs/models.py

Lines 482 to 492 in af9b41d

    
               # finalize the job now since no async results will arrive to trigger completion. 
        
               # Derive status from stages rather than assuming SUCCESS — a stage could be 
        
               # in FAILURE if some images were skipped during collection. 
        
               if job.progress.is_complete(): 
        
                   has_failure = any(s.status == JobState.FAILURE for s in job.progress.stages) 
        
                   job.update_status(JobState.FAILURE if has_failure else JobState.SUCCESS) 
        
                   job.finished_at = datetime.datetime.now() 
        
                   job.save() 
        
           else: 
        
               cls.process_images(job, images)

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

- Add cleanup_async_job_if_needed() call to avoid stale Redis/NATS keys - Use update_status(save=False) to avoid redundant double save - Use JobState.failed_states() instead of hardcoded FAILURE check Co-Authored-By: Claude <noreply@anthropic.com>

Verifies that an async_api job with 0 images to process transitions to SUCCESS with finished_at set, rather than staying stuck at STARTED. Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 25, 2026 23:31

Copilot started reviewing on behalf of mihow March 25, 2026 23:32 View session

mihow force-pushed the fix/async-job-zero-images-completion branch from 8c85bc2 to af9b41d Compare March 25, 2026 23:33

Copilot AI reviewed Mar 25, 2026

View reviewed changes

ami/jobs/models.py Outdated Show resolved Hide resolved

ami/jobs/models.py Outdated Show resolved Hide resolved

ami/jobs/models.py Outdated Show resolved Hide resolved

ami/jobs/models.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Mar 25, 2026

View reviewed changes

mihow and others added 2 commits March 25, 2026 16:49

test: add test for async job completion with zero images

4d3a7a5

Verifies that an async_api job with 0 images to process transitions to SUCCESS with finished_at set, rather than staying stuck at STARTED. Co-Authored-By: Claude <noreply@anthropic.com>

mihow merged commit 76d3b62 into main Mar 26, 2026
7 checks passed

mihow deleted the fix/async-job-zero-images-completion branch March 26, 2026 00:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(jobs): finalize async job when 0 images need processing#1191

fix(jobs): finalize async job when 0 images need processing#1191
mihow merged 3 commits intomainfrom
fix/async-job-zero-images-completion

mihow commented Mar 25, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

netlify bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

netlify bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Mar 25, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

mihow commented Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mihow commented Mar 25, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

netlify bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-ssec canceled.

Uh oh!

netlify bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for antenna-preview canceled.

Uh oh!

coderabbitai bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

mihow commented Mar 25, 2026

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mihow commented Mar 25, 2026 •

edited by coderabbitai bot

Loading

netlify bot commented Mar 25, 2026 •

edited

Loading

netlify bot commented Mar 25, 2026 •

edited

Loading

coderabbitai bot commented Mar 25, 2026 •

edited

Loading