Skip to content

fix(jobs): finalize async job when 0 images need processing#1191

Merged
mihow merged 3 commits intomainfrom
fix/async-job-zero-images-completion
Mar 26, 2026
Merged

fix(jobs): finalize async job when 0 images need processing#1191
mihow merged 3 commits intomainfrom
fix/async-job-zero-images-completion

Conversation

@mihow
Copy link
Collaborator

@mihow mihow commented Mar 25, 2026

Summary

  • When an async_api job finds 0 images to process (all already processed by the pipeline), queue_images_to_nats correctly sets all stages to SUCCESS, but MLJob.run() returns without updating the job-level status
  • The job stays STARTED forever since no async results will arrive to trigger the completion check in update_job_progress
  • Fix: check job.progress.is_complete() after NATS dispatch and finalize immediately if all stages are already done

Found during demo environment 500/504 investigation. Jobs 18 and 26 on demo both hit this — 0 images to process, stages all SUCCESS, but job stuck at STARTED.

Test plan

  • Run an async_api job where all images have already been processed by the pipeline
  • Verify job status transitions to SUCCESS (not stuck at STARTED)
  • Verify normal async jobs with images still complete via the NATS result pipeline

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed job completion handling for edge cases in async operations, particularly when no images are provided for processing. Jobs now properly determine their final status based on stage results and finalize immediately rather than depending on delayed external callbacks.

Copilot AI review requested due to automatic review settings March 25, 2026 23:31
@netlify
Copy link

netlify bot commented Mar 25, 2026

Deploy Preview for antenna-ssec canceled.

Name Link
🔨 Latest commit 4d3a7a5
🔍 Latest deploy log https://app.netlify.com/projects/antenna-ssec/deploys/69c476bafdea130008baceff

@netlify
Copy link

netlify bot commented Mar 25, 2026

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit 4d3a7a5
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/69c476ba7227e20008d10ce3

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 25, 2026

Warning

Rate limit exceeded

@mihow has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 43 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b82721f3-9916-4fa5-bfee-48b61b43a946

📥 Commits

Reviewing files that changed from the base of the PR and between af9b41d and 4d3a7a5.

📒 Files selected for processing (2)
  • ami/jobs/models.py
  • ami/jobs/tests/test_jobs.py
📝 Walkthrough

Walkthrough

The change enhances job completion handling in MLJob.run to address the case where zero images are queued (e.g., job.progress.is_complete() is already true). It derives the final job status by examining per-stage statuses, sets the job state to FAILURE or SUCCESS accordingly, records the completion timestamp, and persists the job immediately instead of relying on asynchronous callbacks.

Changes

Cohort / File(s) Summary
Job Completion Logic
ami/jobs/models.py
Added handling for complete jobs with zero images: derives final job status from per-stage failures, sets FAILURE or SUCCESS, records finished_at, and persists immediately.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Poem

🐰 A job complete with naught to do,
No more waiting for callbacks true,
We check the stages, count the toll,
Set success or failure, whole and whole! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main fix: finalizing async jobs when there are zero images to process, which matches the primary change in the changeset.
Description check ✅ Passed The description addresses the main issue and includes summary, problem context, and solution, but omits several template sections including detailed test instructions, deployment notes, and the completion checklist.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/async-job-zero-images-completion

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

When an async_api job finds 0 images to process (all already processed
by the pipeline), queue_images_to_nats sets stages to SUCCESS but
MLJob.run() returns without updating the job status. The job stays
STARTED forever since no async results will arrive to trigger the
completion check in update_job_progress.

Check job.progress.is_complete() after NATS dispatch and finalize
immediately if all stages are already done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mihow mihow force-pushed the fix/async-job-zero-images-completion branch from 8c85bc2 to af9b41d Compare March 25, 2026 23:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a stuck-status edge case for async_api ML jobs where collect_images() returns no work (e.g., everything already processed), ensuring the job is finalized instead of remaining in STARTED indefinitely.

Changes:

  • After NATS dispatch for ASYNC_API, checks job.progress.is_complete() and immediately finalizes the job when no async results will arrive.
  • Sets job-level SUCCESS and finished_at in this early-completion path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ami/jobs/models.py (1)

481-489: Good fix for the zero-images edge case.

The logic correctly handles the case where all stages are already complete after queuing to NATS. This prevents jobs from being stuck in STARTED when no async results will arrive.

One minor observation: the status check only considers JobState.FAILURE, but JobState.failed_states() (line 88-89) also includes REVOKED and UNKNOWN. For consistency, consider using the helper method:

             if job.progress.is_complete():
-                has_failure = any(s.status == JobState.FAILURE for s in job.progress.stages)
+                has_failure = any(s.status in JobState.failed_states() for s in job.progress.stages)
                 job.update_status(JobState.FAILURE if has_failure else JobState.SUCCESS)

This aligns with how failure states are defined elsewhere in the codebase. That said, for the 0-images scenario where stages are set to SUCCESS by queue_images_to_nats, this is unlikely to matter in practice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/jobs/models.py` around lines 481 - 489, The status check when finalizing
jobs uses a hardcoded check for JobState.FAILURE; change it to use the canonical
failure set by calling JobState.failed_states() so it detects REVOKED/UNKNOWN
too — i.e., replace the has_failure computation (currently using any(s.status ==
JobState.FAILURE for s in job.progress.stages)) with an inclusion check against
JobState.failed_states(); leave the surrounding calls to job.update_status(...),
job.finished_at = datetime.datetime.now(), and job.save() unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ami/jobs/models.py`:
- Around line 481-489: The status check when finalizing jobs uses a hardcoded
check for JobState.FAILURE; change it to use the canonical failure set by
calling JobState.failed_states() so it detects REVOKED/UNKNOWN too — i.e.,
replace the has_failure computation (currently using any(s.status ==
JobState.FAILURE for s in job.progress.stages)) with an inclusion check against
JobState.failed_states(); leave the surrounding calls to job.update_status(...),
job.finished_at = datetime.datetime.now(), and job.save() unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bd631b2-fb3f-4a41-a7c2-1cdc159e79ca

📥 Commits

Reviewing files that changed from the base of the PR and between 5af9ce7 and af9b41d.

📒 Files selected for processing (1)
  • ami/jobs/models.py

@mihow
Copy link
Collaborator Author

mihow commented Mar 25, 2026

Code review

Found 1 issue:

  1. Missing cleanup_async_job_if_needed(job) call in the new early-completion path. Every other finalization path in the codebase calls cleanup (_fail_job, _update_job_progress, check_stale_jobs, update_job_failure, Job.cancel). When queue_images_to_nats runs, it calls state_manager.initialize_job(image_ids) which creates Redis keys for progress tracking. In the zero-images case, no NATS results arrive, so the progress handler (which normally triggers cleanup) never fires. The function is already imported at line 18 of models.py.

antenna/ami/jobs/models.py

Lines 482 to 492 in af9b41d

# finalize the job now since no async results will arrive to trigger completion.
# Derive status from stages rather than assuming SUCCESS — a stage could be
# in FAILURE if some images were skipped during collection.
if job.progress.is_complete():
has_failure = any(s.status == JobState.FAILURE for s in job.progress.stages)
job.update_status(JobState.FAILURE if has_failure else JobState.SUCCESS)
job.finished_at = datetime.datetime.now()
job.save()
else:
cls.process_images(job, images)

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

mihow and others added 2 commits March 25, 2026 16:49
- Add cleanup_async_job_if_needed() call to avoid stale Redis/NATS keys
- Use update_status(save=False) to avoid redundant double save
- Use JobState.failed_states() instead of hardcoded FAILURE check

Co-Authored-By: Claude <noreply@anthropic.com>
Verifies that an async_api job with 0 images to process transitions
to SUCCESS with finished_at set, rather than staying stuck at STARTED.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow merged commit 76d3b62 into main Mar 26, 2026
7 checks passed
@mihow mihow deleted the fix/async-job-zero-images-completion branch March 26, 2026 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants