
PSv2: Show online / offline status for async processing services #1146

Open
mihow wants to merge 19 commits into main from issue-1122-rename-last-checked

Conversation


@mihow mihow commented Feb 21, 2026

Summary

Closes #1122 & #1166

  • Renames last_checked → last_seen, last_checked_live → last_seen_live, last_checked_latency → last_seen_latency on the ProcessingService model
  • Adds mark_seen() method for async/pull-mode services to record liveness when they register pipelines
  • Updates the pipeline registration endpoint (POST /api/v2/projects/{id}/pipelines/) to call mark_seen(live=True) after successful registration
  • Adds heartbeats on the task-fetch and result-submit endpoints (_mark_pipeline_pull_services_seen), scoped to the job's project
  • Periodic beat task marks stale async services offline (async-first, before sync checks)
  • Exposes is_async computed property on the ProcessingService serializer
  • Fixes UnboundLocalError in select_processing_service() when all online services have no latency data (e.g. async services)
  • Updates all backend references (serializers, pipeline queryset, tasks)
  • Updates all frontend references (TypeScript models, table columns, dialog, language strings)
  • Async services show gray "Unknown" status in the processing services list (since per-service status is not yet scoped)
  • Async pipelines are treated as selectable in the pipeline picker (not disabled like offline sync pipelines)
  • Includes Django migration using RenameField (data-preserving, no data loss)

The naming better reflects the semantic difference between:

  • Sync/push services (with endpoint_url): Antenna actively checks the service via the periodic Celery Beat task, updating last_seen/last_seen_live from the health check response
  • Async/pull services (without endpoint_url): Workers report in by registering pipelines and polling for tasks, and we record when we last heard from them via mark_seen() and heartbeat updates

Test plan

  • New unit tests for mark_seen() method (live=True and live=False)
  • New integration test verifying pipeline registration updates last_seen/last_seen_live
  • Existing TestProcessingServiceAPI tests pass with renamed fields
  • Existing TestPipeline and TestProjectPipelinesAPI tests pass
  • Verify UI columns show "Last seen" instead of "Last checked"
  • Verify processing service detail dialog shows "Last seen" label
  • Verify async services show "Unknown" (gray) status in processing services list
  • Verify async pipelines are selectable in session "Process Now" picker

Follow-up items

  • Reconcile pipeline pickers: The job creation form uses a generic EntityPicker (name only, no status indicator) while the session "Process Now" view uses a custom PipelinesPicker with status dots and per-pipeline disabling. These should be reconciled — either extract PipelinesPicker into a shared component or enhance EntityPicker to support optional status rendering.
  • Scope heartbeat to individual service: The heartbeat currently marks all async services on a pipeline+project as live when any one worker polls. Once app-token auth lands (Handle processing_service_name parameters in requests from workers #1117), add an X-Service-ID header so the heartbeat targets the specific calling service. This would allow upgrading the "Unknown" status for async services to real ONLINE/OFFLINE.
  • Mock network call in test: test_get_status_updates_last_seen_for_sync_service hits a real network address — should be mocked for faster CI.
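The heartbeat-scoping follow-up above can be sketched as a simplified, framework-free version of _mark_pipeline_pull_services_seen() with the project filter applied. The dict shape and the project_ids key are assumptions for illustration; only the field names come from the PR:

```python
from datetime import datetime, timezone


def mark_pipeline_pull_services_seen(services: list[dict], job_project_id: int) -> int:
    """Heartbeat every async (pull-mode) service on a job's pipeline, but only
    those linked to the job's project rather than across all projects.
    Returns the number of services marked seen."""
    now = datetime.now(timezone.utc)
    marked = 0
    for svc in services:
        if svc.get("endpoint_url"):
            continue  # sync services are covered by the periodic health check
        if job_project_id not in svc.get("project_ids", []):
            continue  # scope the heartbeat to the calling job's project
        svc["last_seen"] = now
        svc["last_seen_live"] = True
        marked += 1
    return marked
```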


netlify bot commented Feb 21, 2026

Deploy Preview for antenna-ssec ready!

🔨 Latest commit: 203ad5d
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-ssec/deploys/69c2032ffd77d00008e65141
😎 Deploy Preview: https://deploy-preview-1146--antenna-ssec.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.


netlify bot commented Feb 21, 2026

Deploy Preview for antenna-preview ready!

🔨 Latest commit: 203ad5d
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/69c2032f7be6d20008abc94f
😎 Deploy Preview: https://deploy-preview-1146--antenna-preview.netlify.app
Lighthouse
1 paths audited
Performance: 76 (🟢 up 10 from production)
Accessibility: 89 (🟢 up 9 from production)
Best Practices: 92 (🔴 down 8 from production)
SEO: 100 (🟢 up 8 from production)
PWA: 80 (no change from production)



coderabbitai bot commented Feb 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Renamed ProcessingService monitoring fields from last_checked* to last_seen*; added is_async and mark_seen() to record heartbeats for pull-mode services; updated models, serializers, views, tasks, migrations, tests, UI, and docs to use last_seen* and emit heartbeats on pipeline registration and job endpoints.

Changes

Cohort / File(s) — Summary

  • Database Schema & Migration (.agents/DATABASE_SCHEMA.md, ami/ml/migrations/0027_rename_last_checked_to_last_seen.py): Replace last_checked* fields with last_seen (datetime), last_seen_live, and last_seen_latency; add a migration renaming the three fields.
  • ProcessingService model & manager (ami/ml/models/processing_service.py): Add ProcessingServiceQuerySet/manager with async_services()/sync_services(); rename last_checked* → last_seen*; add the is_async property and mark_seen(); update get_status()/get_pipeline_configs() to use last_seen*.
  • Pipeline logic (ami/ml/models/pipeline.py): Switch online filtering and latency selection to last_seen_live and last_seen_latency; adjust fallback and logging.
  • Views & Job endpoints (ami/ml/views.py, ami/jobs/views.py): Call processing_service.mark_seen(live=True) after pipeline registration; add _mark_pipeline_pull_services_seen(job) and invoke it from the job tasks/results endpoints to heartbeat pull-mode services.
  • Background task (beat) (ami/ml/tasks.py): Expire stale async (pull) services using last_seen* first, then sequentially check sync services; increase task time limits/expiry; improve exception logging.
  • Serializers & API (ami/ml/serializers.py): Expose last_seen, last_seen_live, and is_async in serializers; remove last_checked* fields.
  • Backend tests (ami/ml/tests.py): Add tests for mark_seen() and pipeline-registration side effects on last_seen; update status tests to assert the new server_live/last_seen semantics.
  • Frontend models (ui/src/data-services/models/processing-service.ts, ui/src/data-services/models/pipeline.ts): Rename getters to lastSeen/lastSeenLive, add isAsync, adjust endpointUrl typing and most-recent-timestamp logic.
  • Frontend UI & i18n (ui/src/pages/.../*.tsx, ui/src/utils/language.ts): Rename the label key and strings from "Last checked" to "Last seen"; update columns, sort fields, and displayed values to use the lastSeen fields.
  • Documentation (.agents/AGENTS.md): Document the rename and the new mark_seen() heartbeat behavior for async (pull-mode) services.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant JobsView as "Jobs View"
  participant PSModel as "ProcessingService"
  participant DB as "Database"

  Client->>JobsView: POST /jobs/{id}/tasks or /results
  JobsView->>PSModel: _mark_pipeline_pull_services_seen(job)
  PSModel->>PSModel: for each async service -> mark_seen(live=True)
  PSModel->>DB: UPDATE processing_service SET last_seen=now(), last_seen_live=TRUE
  DB-->>PSModel: OK
  PSModel-->>JobsView: done
  JobsView-->>Client: continue tasks/results flow

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • annavik

Poem

🐰 I hopped in quick to tap a beat,
Seen, not checked — a gentler heartbeat.
Pipelines wink, workers nod "I'm here",
Timestamps brighten, statuses clear.
Little rabbit prances: "All systems cheer!"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 74.19%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title clearly and concisely summarizes the main change: renaming fields and adding liveness tracking for async processing services to show their online/offline status.
  • Description check — ✅ Passed: the PR description is comprehensive, well-structured, and follows the template with all required sections: Summary (including issue closure), detailed list of changes, test plan, and deployment/follow-up notes.



@mihow mihow force-pushed the issue-1122-rename-last-checked branch 2 times, most recently from 77dd024 to 2c029f8 on February 27, 2026 at 02:16
@mihow mihow marked this pull request as ready for review February 27, 2026 02:27
Copilot AI review requested due to automatic review settings February 27, 2026 02:27
@mihow mihow changed the title from "Rename ProcessingService last_checked fields to last_seen" to "PSv2: Show online / offline status for async processing services" on Feb 27, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
ui/src/pages/project/processing-services/processing-services-columns.tsx (1)

54-59: Consider using the translation key for consistency.

The status details use a hardcoded 'Last seen ' string, while other parts of the codebase (e.g., processing-service-details-dialog.tsx at line 81) use translate(STRING.FIELD_LABEL_LAST_SEEN). For i18n consistency, consider using the translation function here as well.

Suggested change
     renderCell: (item: ProcessingService) => (
       <StatusTableCell
         color={item.status.color}
-        details={'Last seen ' + item.lastSeen}
+        details={translate(STRING.FIELD_LABEL_LAST_SEEN) + ' ' + item.lastSeen}
         label={item.status.label}
       />
     ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/pages/project/processing-services/processing-services-columns.tsx`
around lines 54 - 59, Replace the hardcoded "Last seen " prefix in the
StatusTableCell details with the i18n translation key usage; locate the
StatusTableCell usage where details is built (details={'Last seen ' +
item.lastSeen}) and change it to use translate(STRING.FIELD_LABEL_LAST_SEEN)
concatenated or interpolated with item.lastSeen (e.g.,
translate(STRING.FIELD_LABEL_LAST_SEEN) + item.lastSeen) so it matches other
components like processing-service-details-dialog.tsx and uses the same
STRING.FIELD_LABEL_LAST_SEEN key.
ami/ml/models/pipeline.py (1)

1072-1092: Field renames look correct, but note pre-existing edge case.

The field references are correctly updated to last_seen_live and last_seen_latency. However, there's a pre-existing edge case: if all online processing services have last_seen_latency as None, processing_service_lowest_latency will never be assigned, causing an UnboundLocalError at line 1088.

This isn't introduced by this PR, but consider addressing it as a follow-up if async/pull-mode services may not have latency values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/ml/models/pipeline.py` around lines 1072 - 1092, The loop can leave
processing_service_lowest_latency unassigned when every online
processing_service has last_seen_latency == None; after iterating
processing_services check if processing_service_lowest_latency is still None and
if so pick a fallback online service (e.g., the first processing_service where
last_seen_live is True) and set lowest_latency to None or a sentinel, then log
that latency is unknown for the chosen service using task_logger (include
pipeline_name and the selected processing_service) before returning it; update
references to processing_services, processing_service_lowest_latency,
last_seen_live, last_seen_latency, pipeline_name, and task_logger in this fix.
ami/ml/tasks.py (1)

121-123: Use logger.exception() to preserve traceback when service health checks fail.

At line 122, logger.error() logs only the exception message, losing the full stack trace. When debugging intermittent polling failures in production, the traceback is essential. Use logger.exception() instead, which is designed for exception handlers and automatically includes the full context.

♻️ Proposed refactor
        except Exception as e:
-            logger.error(f"Error checking service {service}: {e}")
+            logger.exception("Error checking service %s", service)
            continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/ml/tasks.py` around lines 121 - 123, Replace the logger.error call inside
the exception handler that catches service health-check failures so the full
traceback is preserved: in the except Exception as e: block where
logger.error(f"Error checking service {service}: {e}") is used (in
ami/ml/tasks.py, referencing logger and the service variable), call
logger.exception with a clear message (e.g., "Error checking service {service}")
so the stack trace is included, then continue as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/jobs/views.py`:
- Around line 49-52: The bulk update on
job.pipeline.processing_services.async_services().update(...) incorrectly
updates services across all projects; restrict the update to services linked to
the current job's project by adding an M2M filter on the ProcessingService
projects relation (e.g., replace async_services().update(...) with
async_services().filter(projects=job.project).update(...) or
.filter(projects__in=[job.project]).update(...) as appropriate), ensuring you
reference the existing job.pipeline and job.project objects so only services
associated with this job's project are marked last_seen/last_seen_live.

In `@ui/src/data-services/models/processing-service.ts`:
- Around line 53-54: The endpointUrl getter currently returns
this._processingService.endpoint_url using nullish coalescing, which leaves
empty string values intact; update the endpointUrl getter to normalize both null
and "" (and optionally whitespace-only strings) to undefined by reading const v
= this._processingService.endpoint_url and returning undefined when v is null or
v.trim() === "" (otherwise return v) so the UI treats empty endpoints as
async/pull-mode; modify the endpointUrl getter implementation accordingly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b01f01b and b5309ec.

📒 Files selected for processing (16)
  • .agents/AGENTS.md
  • .agents/DATABASE_SCHEMA.md
  • ami/jobs/views.py
  • ami/ml/migrations/0027_rename_last_checked_to_last_seen.py
  • ami/ml/models/pipeline.py
  • ami/ml/models/processing_service.py
  • ami/ml/serializers.py
  • ami/ml/tasks.py
  • ami/ml/tests.py
  • ami/ml/views.py
  • ui/src/data-services/models/pipeline.ts
  • ui/src/data-services/models/processing-service.ts
  • ui/src/pages/processing-service-details/processing-service-details-dialog.tsx
  • ui/src/pages/project/pipelines/pipelines-columns.tsx
  • ui/src/pages/project/processing-services/processing-services-columns.tsx
  • ui/src/utils/language.ts


Copilot AI left a comment


Pull request overview

Renames ProcessingService.last_checked* fields to last_seen* across backend and UI, and adds heartbeat-style liveness tracking for pull-mode (async/no-endpoint) processing services.

Changes:

  • Backend: rename model fields + add mark_seen() and async/sync query helpers; update polling task + endpoints to update last_seen.
  • Frontend: rename model accessors/columns/labels from “last checked” to “last seen”.
  • Data migration: add Django RenameField migration to preserve existing data.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Show a summary per file

  • ami/ml/models/processing_service.py — Renames fields, adds mark_seen(), and changes get_status() semantics for pull-mode services.
  • ami/ml/tasks.py — Updates the periodic liveness checker to poll sync services and mark stale async services offline.
  • ami/ml/views.py — Marks the service seen after successful async pipeline registration.
  • ami/jobs/views.py — Adds heartbeat updates on the async worker task-fetch/result-submit endpoints.
  • ami/ml/serializers.py — Renames serialized fields to last_seen/last_seen_live.
  • ami/ml/models/pipeline.py — Updates queryset filters and comments to use last_seen*.
  • ami/ml/migrations/0027_rename_last_checked_to_last_seen.py — Data-preserving field renames via RenameField.
  • ami/ml/tests.py — Adds unit/integration tests for mark_seen() and pipeline registration.
  • ui/src/data-services/models/processing-service.ts — Renames getters to lastSeen* and makes endpointUrl nullable-safe.
  • ui/src/data-services/models/pipeline.ts — Renames computed fields to processingServicesOnlineLastSeen.
  • ui/src/pages/project/processing-services/processing-services-columns.tsx — Updates the status details display to use lastSeen.
  • ui/src/pages/project/pipelines/pipelines-columns.tsx — Renames the "last checked" column to "last seen".
  • ui/src/pages/processing-service-details/processing-service-details-dialog.tsx — Updates label/value bindings to lastSeen.
  • ui/src/utils/language.ts — Adds the FIELD_LABEL_LAST_SEEN string and English translation.
  • .agents/DATABASE_SCHEMA.md / .agents/AGENTS.md — Updates documentation references to last_seen_live and the pull-mode heartbeat.



mihow commented Feb 27, 2026

Code review

Fixed 3 issues (including a critical one); 2 more need your input. Responses to existing CodeRabbit/Copilot comments are inline below.


Fixed in commit 671da1a:

  1. ImportError on PROCESSING_SERVICE_LAST_SEEN_MAX — ami/ml/tasks.py:112 imports it from ami.ml.models, but the constant is defined in the submodule and not re-exported from __init__.py. This would crash the check_processing_services_online Celery beat task on every run, meaning async services would never be marked offline. Fixed by importing from ami.ml.models.processing_service directly.

    antenna/ami/ml/tasks.py

    Lines 110 to 113 in b5309ec

    import datetime
    from ami.ml.models import PROCESSING_SERVICE_LAST_SEEN_MAX, ProcessingService

  2. Null last_seen producing epoch timestamp — processingServicesOnlineLastSeen in pipeline.ts pushed new Date(null).getTime() (= 0, i.e. Jan 1 1970) into the last_seen_times array without filtering, causing Math.max(...) to silently return an ancient timestamp for any pipeline with an async service that had never checked in. Fixed by filtering out null values; returns undefined when no valid timestamps exist.

    const last_seen_times = []
    for (const processingService of processingServices) {
      last_seen_times.push(new Date(processingService.last_seen).getTime())
    }
    return getFormatedDateTimeString({
      date: new Date(Math.max(...last_seen_times)),
    })

  3. "Last seen undefined" in status column — (Copilot comment 2862206485) Valid. Fixed with the suggested guard.


Needs your input:

  1. pipelines_online returns all pipelines even when service is offline — In the async branch of get_status(), pipelines_online=pipeline_names is always populated regardless of is_live. When the heartbeat expires and is_live=False, the response still reports all pipelines as online. Sync services only populate pipelines_online when actually live. Suggest pipelines_online=pipeline_names if is_live else [] for consistency.

    self.save(update_fields=["last_seen_live"])
    pipeline_names = list(self.pipelines.values_list("name", flat=True))
    return ProcessingServiceStatusResponse(
        timestamp=self.last_seen or datetime.datetime.now(),
        request_successful=is_live,
        server_live=is_live,
        pipelines_online=pipeline_names,
        pipeline_configs=[],
        endpoint_url=None,
        latency=0.0,
    )

  2. Heartbeat marks all async services on a pipeline (Copilot/CodeRabbit comment 2862206476) — _mark_pipeline_pull_services_seen() does a bulk update on all async services linked to the pipeline, not just the caller. The docstring acknowledges this ("A future application-token scheme…"). This seems acceptable for now given the single-service-per-pipeline deployment model, but worth tracking.
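The consistency suggestion in item 1 can be shown as a tiny sketch. This is a simplified stand-in for the async branch of get_status(), reduced to the two fields under discussion; the function name and dict payload are illustrative only:

```python
def async_status_payload(is_live: bool, pipeline_names: list[str]) -> dict:
    """Only report pipelines as online when the heartbeat says the async
    service is live, matching how sync services populate pipelines_online."""
    return {
        "server_live": is_live,
        "pipelines_online": pipeline_names if is_live else [],
    }
```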


Responding to other comments:

  • CodeRabbit: i18n 'Last seen ' string — Valid nitpick. translate(STRING.FIELD_LABEL_LAST_SEEN) would be consistent with the details dialog. Worth a follow-up.

  • Copilot: test makes real network call (2862206503) — get_status(timeout=1) will still hit retry logic in create_session. Worth mocking, but not blocking.

  • Copilot: server_live contract change (2862206515) — The prior behavior returned server_live=None with an error string for no-endpoint services. Now it returns server_live=False with no error. If any client checks server_live is None to detect "no endpoint configured", that would break. The UI only checks truthiness so it's fine there, but worth noting for external API consumers.

🤖 Generated with Claude Code



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
ui/src/data-services/models/processing-service.ts (1)

86-88: Consider guarding against null last_seen_live.

The backend model defines last_seen_live = models.BooleanField(null=True), meaning it can be null. The return type boolean doesn't reflect this possibility. While the current status getter handles falsy values gracefully (treating null as OFFLINE), direct consumers of lastSeenLive may expect a definite boolean.

♻️ Optional: coerce to boolean
   get lastSeenLive(): boolean {
-    return this._processingService.last_seen_live
+    return this._processingService.last_seen_live ?? false
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/data-services/models/processing-service.ts` around lines 86 - 88, The
getter lastSeenLive currently declares a boolean return but directly returns
_processingService.last_seen_live which can be null; change the getter signature
to return boolean | null and keep returning
this._processingService.last_seen_live (or, if you prefer to coerce, return
!!this._processingService.last_seen_live and keep boolean) so consumers and the
type system reflect the backend's nullable models.BooleanField; update any
callers if you choose the nullable variant.
ami/ml/tests.py (1)

192-204: Consider mocking the network call for deterministic tests.

This test makes a real network request to a nonexistent host, which depends on DNS resolution and socket timeouts. While it works, mocking requests or the underlying HTTP call would make this test faster and more reliable in CI environments with restricted network access.

Example mock approach
from unittest.mock import patch

def test_get_status_updates_last_seen_for_sync_service(self):
    """Test that get_status() updates last_seen fields for sync services."""
    service = ProcessingService.objects.create(
        name="Sync Service", 
        endpoint_url="http://nonexistent-host:9999"
    )
    service.projects.add(self.project)

    with patch('requests.Session.get') as mock_get:
        mock_get.side_effect = Exception("Connection refused")
        service.get_status(timeout=1)
    
    service.refresh_from_db()
    # assertions...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/ml/tests.py` around lines 192 - 204, Replace the real network call in
test_get_status_updates_last_seen_for_sync_service with a deterministic mock:
patch requests.Session.get (or the exact HTTP call used by
ProcessingService.get_status) to raise an exception or return a controlled
response, then call service.get_status(timeout=1), refresh_from_db and assert
last_seen, last_seen_live and last_seen_latency; this ensures the test for
ProcessingService.get_status runs fast and reliably without performing
DNS/socket I/O.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5309ec and cb368ac.

📒 Files selected for processing (7)
  • ami/jobs/views.py
  • ami/ml/serializers.py
  • ami/ml/tasks.py
  • ami/ml/tests.py
  • ui/src/data-services/models/pipeline.ts
  • ui/src/data-services/models/processing-service.ts
  • ui/src/pages/project/processing-services/processing-services-columns.tsx
🚧 Files skipped from review as they are similar to previous changes (3)
  • ui/src/pages/project/processing-services/processing-services-columns.tsx
  • ami/jobs/views.py
  • ui/src/data-services/models/pipeline.ts

@mihow
Copy link
Collaborator Author

mihow commented Feb 27, 2026

🧹 Nitpick comments (2)

ui/src/data-services/models/processing-service.ts (1)> 86-88: Consider guarding against null last_seen_live.

The backend model defines last_seen_live = models.BooleanField(null=True), meaning it can be null. The return type boolean doesn't reflect this possibility. While the current status getter handles falsy values gracefully (treating null as OFFLINE), direct consumers of lastSeenLive may expect a definite boolean.

♻️ Optional: coerce to boolean

   get lastSeenLive(): boolean {
-    return this._processingService.last_seen_live
+    return this._processingService.last_seen_live ?? false
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@ui/src/data-services/models/processing-service.ts` around lines 86 - 88, The
getter lastSeenLive currently declares a boolean return but directly returns
_processingService.last_seen_live which can be null; change the getter signature
to return boolean | null and keep returning
this._processingService.last_seen_live (or, if you prefer to coerce, return
!!this._processingService.last_seen_live and keep boolean) so consumers and the
type system reflect the backend's nullable models.BooleanField; update any
callers if you choose the nullable variant.

ami/ml/tests.py (1)> 192-204: Consider mocking the network call for deterministic tests.

This test makes a real network request to a nonexistent host, which depends on DNS resolution and socket timeouts. While it works, mocking requests or the underlying HTTP call would make this test faster and more reliable in CI environments with restricted network access.

Example mock approach

from unittest.mock import patch

def test_get_status_updates_last_seen_for_sync_service(self):
    """Test that get_status() updates last_seen fields for sync services."""
    service = ProcessingService.objects.create(
        name="Sync Service",
        endpoint_url="http://nonexistent-host:9999",
    )
    service.projects.add(self.project)

    with patch("requests.Session.get") as mock_get:
        mock_get.side_effect = Exception("Connection refused")
        service.get_status(timeout=1)

    service.refresh_from_db()
    # Assert that last_seen, last_seen_live, and last_seen_latency
    # reflect the failed check.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@ami/ml/tests.py` around lines 192 - 204, Replace the real network call in
test_get_status_updates_last_seen_for_sync_service with a deterministic mock:
patch requests.Session.get (or the exact HTTP call used by
ProcessingService.get_status) to raise an exception or return a controlled
response, then call service.get_status(timeout=1), refresh_from_db and assert
last_seen, last_seen_live and last_seen_latency; this ensures the test for
ProcessingService.get_status runs fast and reliably without performing
DNS/socket I/O.

I agree that a boolean with a default of null is confusing. I think I chose that to represent "not checked" or "unknown", so the frontend could show a blank instead of Offline. But I think False would also be a fine simplification.

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
ami/ml/tasks.py (1)

98-101: Minor: Comment arithmetic may be slightly off.

The comment states worst case is "8 + 2 + 8 + 4 + 8 = 30s" but with retries=3 and backoff_factor=2, there are 4 total attempts (initial + 3 retries). The actual sum shown equals 30s, but if there are 4 attempts at 8s each plus backoff delays (0s, 2s, 4s), worst case would be closer to 38s. This doesn't affect functionality—just a documentation nit.
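The corrected worst case is easy to check with a quick calculation. A sketch, assuming the retries=3, backoff_factor=2, and 8s-per-request values discussed above (the exact urllib3 Retry configuration in the codebase is an assumption here):

```python
# Worst-case wall time for one unreachable sync service: 4 attempts
# (initial + 3 retries), each hitting the full 8s timeout, plus the
# backoff sleeps between retries noted in the review (0s, 2s, 4s).
TIMEOUT_S = 8
RETRIES = 3

attempts = 1 + RETRIES          # initial request + 3 retries = 4
backoff_delays = [0, 2, 4]      # sleeps before retries 1, 2, 3
worst_case = attempts * TIMEOUT_S + sum(backoff_delays)
print(worst_case)  # 38
```

The 30s figure in the original comment sums only attempt and backoff fragments; counting all four attempts gives 32s + 6s of backoff = 38s.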

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ami/ml/tasks.py` around lines 98 - 101, The inline comment for
_BEAT_STATUS_TIMEOUT miscalculates the worst-case total; update the comment to
state there are 4 attempts (initial + 3 retries) so the total is 4 * 8s = 32s
plus backoff delays (0s + 2s + 4s = 6s) => ~38s; edit the comment above
_BEAT_STATUS_TIMEOUT to reflect this corrected arithmetic and briefly note the
calculation (attempts * timeout + backoff delays).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb368ac and a159ba3.

📒 Files selected for processing (1)
  • ami/ml/tasks.py

Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ami/ml/models/processing_service.py`:
- Around line 219-225: The status response currently lists all registered
pipelines even when the worker is not live; change the pipelines_online passed
into ProcessingServiceStatusResponse to be pipeline_names only when is_live is
True, otherwise pass an empty list (i.e., compute pipeline_names from
self.pipelines.values_list("name", flat=True) but use it conditionally based on
is_live) so that ProcessingServiceStatusResponse(timestamp=self.last_seen or
datetime.datetime.now(), request_successful=is_live, server_live=is_live,
pipelines_online=..., pipeline_configs=[]) accurately reflects an offline
worker.

In `@ami/ml/tasks.py`:
- Around line 103-105: The offline-sweep timing is too slow relative to
PROCESSING_SERVICE_LAST_SEEN_MAX, causing last_seen_live to remain stale; either
make the sweep cadence (_BEAT_TASK_EXPIRES) no greater than
PROCESSING_SERVICE_LAST_SEEN_MAX or stop relying on a persisted last_seen_live
flag and derive liveness from the stored last_seen timestamp at read time;
update references to _BEAT_TASK_EXPIRES and any logic that sets/reads
last_seen_live (also in the block around the code at the 126-134 region) to use
the chosen approach so worker liveness reflects the heartbeat window.
- Around line 98-100: The periodic beat task iterates
ProcessingService.objects.sync_services() but assumes a fixed time_limit=150
while each unreachable sync service can cost ~38s (4 attempts × ~8s + backoff),
so with >3 services the task may exceed the hard limit and skip later services;
fix by computing the required budget from the queryset size (e.g., required_time
= num_services * per_service_worst_case + margin) and set time_limit
accordingly, or refactor to fan out work per service (enqueue a per-service sync
task) or process services in bounded chunks to ensure no single beat exceeds
time_limit; update the code references where time_limit=150 and the periodic
beat/sync loop that consumes ProcessingService.objects.sync_services() to
implement one of these solutions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b2a39962-a3ef-4415-9c84-ca842d37a99c

📥 Commits

Reviewing files that changed from the base of the PR and between a159ba3 and fdfb2ec.

📒 Files selected for processing (5)
  • ami/ml/models/pipeline.py
  • ami/ml/models/processing_service.py
  • ami/ml/tasks.py
  • ui/src/data-services/models/processing-service.ts
  • ui/src/pages/project/processing-services/processing-services-columns.tsx
✅ Files skipped from review due to trivial changes (1)
  • ui/src/data-services/models/processing-service.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • ui/src/pages/project/processing-services/processing-services-columns.tsx
  • ami/ml/models/pipeline.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
ui/src/data-services/models/processing-service.ts (1)

88-90: Consider preserving last_seen_live = null as “unknown” instead of coercing to false.

Current logic collapses null/undefined to OFFLINE, which can hide “not yet observed” state for endpoint-based services. If tri-state behavior is desired, keep nullability and map it to UNKNOWN in status.

♻️ Proposed refactor
-  get lastSeenLive(): boolean {
-    return this._processingService.last_seen_live ?? false
+  get lastSeenLive(): boolean | null | undefined {
+    return this._processingService.last_seen_live
   }

@@
-    const status_code = this.lastSeenLive ? 'ONLINE' : 'OFFLINE'
+    if (this.lastSeenLive == null) {
+      return ProcessingService.getStatusInfo('UNKNOWN')
+    }
+    const status_code = this.lastSeenLive ? 'ONLINE' : 'OFFLINE'

Please confirm intended product semantics for endpoint services with last_seen_live = null (offline vs unknown).

Also applies to: 105-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui/src/data-services/models/processing-service.ts` around lines 88 - 90, The
getter lastSeenLive currently coerces this._processingService.last_seen_live to
false, losing a nullable "unknown" state; change it to preserve null/undefined
(i.e., return this._processingService.last_seen_live as boolean |
null/undefined) and update whatever status-mapping logic (the status calculation
that consumes lastSeenLive) to explicitly map null/undefined to an UNKNOWN state
rather than OFFLINE; also apply the same null-preserving change to the other
occurrences that currently coerce last_seen_live to false (the similar getters
at the other two spots noted).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ui/src/data-services/models/processing-service.ts`:
- Around line 60-62: The isAsync getter currently returns
this._processingService.is_async which may be undefined and violates the boolean
return type; update the getter (isAsync) to defensively coerce the value to a
boolean (for example using `?? false` or `Boolean(...)`) when reading
this._processingService.is_async so it always returns a boolean similar to the
defensive pattern used in lastSeenLive.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a0a436ff-a7e2-48b4-abf4-d08780cdf3fb

📥 Commits

Reviewing files that changed from the base of the PR and between fdfb2ec and 857f5d3.

📒 Files selected for processing (1)
  • ui/src/data-services/models/processing-service.ts

mihow and others added 13 commits March 23, 2026 19:04
…1122)

Rename fields to better reflect the semantic difference between sync and
async processing service status tracking:
- last_checked → last_seen
- last_checked_live → last_seen_live
- last_checked_latency → last_seen_latency

For sync services with endpoint URLs, fields are updated by the periodic
status checker. For async/pull-mode services, a new mark_seen() method
is called when the service registers pipelines, recording that we heard
from it.

Also updates all references in serializers, pipeline queryset, views,
frontend models, columns, dialog, and language strings.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Add structured queryset methods and a heartbeat mechanism so async
(pull-mode) processing services stay in sync with their actual liveness.

ProcessingService:
- New ProcessingServiceQuerySet with async_services() / sync_services()
  methods — single canonical filter for endpoint_url null-or-empty, used
  everywhere instead of ad-hoc Q expressions
- is_async property (derived from endpoint_url, no DB column)
- Docstrings reference Job.dispatch_mode ASYNC_API / SYNC_API for context

Liveness tracking:
- PROCESSING_SERVICE_LAST_SEEN_MAX = 60s constant (12× the worker's 5s
  poll interval) — async services are considered offline after this
- check_processing_services_online task now handles both modes:
  sync → active /readyz poll; async → bulk mark stale via async_services()
- _mark_pipeline_pull_services_seen() helper in jobs/views.py: single bulk
  UPDATE via job.pipeline.processing_services.async_services(), called at
  the top of both /jobs/{id}/tasks/ and /jobs/{id}/result/ so every worker
  poll cycle refreshes last_seen without needing a separate registration

Async job cleanup (from carlosg/redisatomic):
- Rename _cleanup_job_if_needed → cleanup_async_job_if_needed and export
  it so Job.cancel() can call it directly without a local import
- JobLogHandler: refresh_from_db before appending to avoid last-writer-
  wins race across concurrent worker processes
- Job.logger: update existing handler's job reference instead of always
  adding a new handler (process-level singleton leak fix)

Co-Authored-By: Claude <noreply@anthropic.com>
- PROCESSING_SERVICE_LAST_SEEN_MAX = 60s constant (12x the worker's 5s
  poll interval) used by check_processing_services_online to expire stale
  async service heartbeats
- get_status() pull-mode branch: derives server_live from staleness check,
  populates pipelines_online from registered pipelines, uses
  `not self.endpoint_url` to also handle empty-string endpoints
- endpointUrl getter: returns undefined instead of stringified "null" so
  async services show a blank cell in the endpoint column

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix ImportError: import PROCESSING_SERVICE_LAST_SEEN_MAX directly from
  ami.ml.models.processing_service (not re-exported from ami.ml.models)
- Fix null last_seen causing epoch timestamp in processingServicesOnlineLastSeen
  getter — filter out null values before Math.max
- Fix "Last seen undefined" rendered in status column when lastSeen is undefined

Co-Authored-By: Claude <noreply@anthropic.com>
Move the async service stale-check to the top of check_processing_services_online
so it always runs, even if a slow sync service check hits the time limit.

Reduce the per-request timeout for the beat task from 90s (designed for cold-start
waits) to 8s — if a sync service is starting up it will recover on the next cycle.
Raise soft_time_limit/time_limit accordingly to give the sync loop room to complete
(worst case ~30s per service with retries).

Co-Authored-By: Claude <noreply@anthropic.com>
Async services now derive liveness from heartbeats rather than returning
an error message. Update assertions: server_live=False (not None) when
no heartbeat has been received, and remove error message checks.

Co-Authored-By: Claude <noreply@anthropic.com>
Filter async services by project when marking them as seen, preventing
cross-project contamination when a pipeline is shared across projects.
Clarify in the docstring that this still marks all async services on the
pipeline within the project, not the individual caller, until
application-token auth (PR #1117) is available.

Co-Authored-By: Claude <noreply@anthropic.com>
Add is_async to ProcessingServiceSerializer and
ProcessingServiceNestedSerializer so the frontend can distinguish
pull-mode from push-mode services. Also normalize empty endpoint_url
strings to undefined in the FE model for consistency with the backend.

Co-Authored-By: Claude <noreply@anthropic.com>
…le copies

- Run async stale-check before the sync loop so it always executes
  regardless of how long sync checks take
- Reduce per-request timeout for the beat task from 90s (designed for
  cold-start waits) to 8s — a slow or unreachable service just waits
  for the next 5-minute cycle
- Add expires=240s so copies queued during a worker outage are discarded
  when the worker returns; only the most recent firing runs

Co-Authored-By: Claude <noreply@anthropic.com>
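The expires behavior described in this commit can be expressed in a Celery beat schedule entry like the following. This is a hypothetical sketch: the task path and 5-minute schedule are illustrative assumptions; only the `expires=240` option is taken from the commit message.

```python
# Hypothetical Celery beat entry; task path and schedule are assumed
# for illustration, the expires option matches the commit above.
CELERY_BEAT_SCHEDULE = {
    "check-processing-services-online": {
        "task": "ami.ml.tasks.check_processing_services_online",
        "schedule": 300.0,  # every 5 minutes
        "options": {
            # Copies queued while the worker was down expire after 240s,
            # so only the most recent firing runs once it comes back.
            "expires": 240,
        },
    },
}
```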
- Coerce `last_seen_live` null to false in the getter to match the boolean
  return type (backend field is nullable).
- Use `translate(STRING.FIELD_LABEL_LAST_SEEN)` in the status column instead
  of a hardcoded English string, matching the pattern in the details dialog.

Co-Authored-By: Claude <noreply@anthropic.com>
…ogging

- Fix potential UnboundLocalError in select_processing_service() when all
  online services have last_seen_latency=None (e.g. async/pull-mode services).
  Falls back to the first online service and logs that no latency data exists.
- Use logger.exception() instead of logger.error() in the beat task exception
  handler to preserve the full traceback for debugging.
- Fix comment arithmetic for worst-case timeout calculation.

Co-Authored-By: Claude <noreply@anthropic.com>
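The UnboundLocalError fix described in this commit amounts to an explicit fallback when no online service has latency data. A simplified sketch of the selection logic — plain dicts stand in for ProcessingService rows, and the body is illustrative, not the actual implementation:

```python
def select_processing_service(services):
    """Pick the online service with the lowest recorded latency, falling
    back to the first online service when none has latency data (e.g.
    async/pull-mode services, which are never actively timed)."""
    online = [s for s in services if s.get("last_seen_live")]
    if not online:
        return None
    timed = [s for s in online if s.get("last_seen_latency") is not None]
    if not timed:
        # Previously this path left the chosen-service variable unbound,
        # raising UnboundLocalError; now fall back explicitly.
        return online[0]
    return min(timed, key=lambda s: s["last_seen_latency"])
```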
mihow and others added 2 commits March 23, 2026 19:04
Async/pull-mode services (no endpoint URL) now show a gray "Unknown"
status indicator instead of the misleading ONLINE/OFFLINE based on
heartbeat data that applies to all services on the pipeline. The
pipeline-level indicator in the dropdown continues to work as before.

Co-Authored-By: Claude <noreply@anthropic.com>
Async/pull-mode services (no endpoint URL) are now treated as "online"
in the pipeline selector since their status cannot be verified via active
polling. This prevents the "Process Now" dropdown from incorrectly
disabling async pipelines when no recent heartbeat exists.

Co-Authored-By: Claude <noreply@anthropic.com>
@mihow mihow force-pushed the issue-1122-rename-last-checked branch from 857f5d3 to 770de40 Compare March 24, 2026 02:09
mihow and others added 4 commits March 23, 2026 19:17
…hecks

Use the explicit is_async field from the backend serializer rather than
inferring async status from the absence of endpoint_url.

Co-Authored-By: Claude <noreply@anthropic.com>
- Return empty pipelines_online list when async service is offline,
  avoiding contradictory payload (server_live=False + non-empty pipelines).
- Add ?? false to isAsync getter to match defensive pattern used by
  lastSeenLive.

Co-Authored-By: Claude <noreply@anthropic.com>
…s list

The write-only project field was declared on the serializer but not
included in Meta.fields, causing an AssertionError at runtime.
Missed during rebase conflict resolution.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Update how the status of processing services are checked & reported

2 participants