API: Add failurePolicy to TaskSpawnerSpec for per-work-item circuit breaking across TTL boundaries #889

Description

🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

Add a failurePolicy field to TaskSpawnerSpec that tracks per-work-item failure counts across TTL cleanup boundaries and stops recreating tasks for items that persistently fail. Currently, when ttlSecondsAfterFinished is set and a task fails, the spawner unconditionally recreates the task on the next discovery cycle — burning tokens and compute indefinitely on items the agent cannot fix.

Problem

The infinite retry loop

The spawner deduplication logic in cmd/kelos-spawner/main.go:260-305 works by listing all existing tasks and checking if a task already exists for each discovered work item:

existingTaskMap := make(map[string]*kelosv1alpha1.Task)
// ...
for _, item := range items {
    taskName := fmt.Sprintf("%s-%s", ts.Name, item.ID)
    existing, found := existingTaskMap[taskName]
    if !found {
        newItems = append(newItems, item)  // treated as new → will create task
        continue
    }
    // ... retrigger logic for items with TriggerTime ...
}

When a task fails and TTL deletes it, the spawner no longer finds it in existingTaskMap. The work item (e.g., a GitHub issue matching the label filter) is still discovered, and the spawner creates a brand-new task. This task fails again, TTL deletes it, and the cycle repeats indefinitely.

The TTLSecondsAfterFinished field's own doc comment acknowledges this design:

If set, spawned Tasks will be automatically deleted after the given number of seconds once they reach a terminal phase, allowing TaskSpawner to create a new Task.
api/v1alpha1/taskspawner_types.go:456-458

This behavior is intentional and useful for cron-based spawners and re-triggerable workflows, but it creates a silent cost leak for polling-based spawners when an item consistently fails.

When does this happen?

The loop occurs when ALL of these conditions are true:

  1. Polling-based source (githubIssues, githubPullRequests, Jira) — the same item is rediscovered every cycle
  2. No trigger comment (commentPolicy.triggerComment not set) — items are discovered by label/state/assignee alone, not by human action
  3. TTL configured (ttlSecondsAfterFinished > 0) — failed tasks are cleaned up, erasing the spawner's memory of the failure
  4. Item persistently fails — the underlying problem (malformed input, permissions, unsupported codebase, etc.) isn't fixable by the agent

The retry interval is approximately TTL + pollInterval. With typical values (TTL: 3600s, pollInterval: 5m), this means one failed task per ~65 minutes per stuck item, running indefinitely until a human notices.

Concrete scenario

A TaskSpawner watches issues labeled agent-ready:

  1. Issue #42 ("Install build-essential in claude-code container") is labeled agent-ready but describes a task the agent cannot perform
  2. Spawner creates my-spawner-42 → agent runs → fails (cost: $0.50 in tokens)
  3. After 1 hour, TTL deletes my-spawner-42
  4. Next poll: issue #42 still has the agent-ready label → spawner creates my-spawner-42 again → fails → TTL deletes → repeat
  5. After 24 hours: 22 failed tasks × $0.50 = $11 wasted on a single unfixable issue

With 10 stuck items, this becomes $110/day. With expensive models (Opus), costs scale further.

What existing proposals don't cover

| Proposal | Scope | Gap |
| --- | --- | --- |
| #730 (retryStrategy) | Retries within a single task lifecycle before it reaches Failed phase | Does not prevent the spawner from creating entirely new tasks for the same item after TTL cleanup |
| #765 (obsolescencePolicy) | Cancels stale running tasks that are no longer relevant | Does not track or act on historical failures of completed-and-deleted tasks |
| #788 / #624 (costBudget) | Caps total spending across all tasks spawned by a TaskSpawner | Does not distinguish productive spending from wasted retries on a single stuck item; one stuck item could consume the entire budget |
| #749 (onCompletion hooks) | Sends outbound notifications on task completion | Notifies but does not prevent recreation; no feedback loop to the spawner |

The missing primitive is per-work-item failure memory that survives task deletion.

Proposed API

New field on TaskSpawnerSpec

// FailurePolicy configures how the spawner handles work items whose tasks
// repeatedly fail. When set, the spawner tracks consecutive failure counts
// per work item and stops creating new tasks after the limit is reached.
// This prevents infinite retry loops when TTL cleanup removes failed tasks.
// +optional
FailurePolicy *FailurePolicy `json:"failurePolicy,omitempty"`

FailurePolicy type

// FailurePolicy controls per-work-item circuit breaking for the spawner.
type FailurePolicy struct {
    // MaxRetriesPerItem is the maximum number of consecutive failed tasks
    // the spawner will create for a single work item before skipping it.
    // A value of 0 means unlimited retries (current behavior). A value of 1
    // means the spawner creates at most one task and never retries on failure.
    // The counter resets when a task for the item succeeds.
    //
    // Example: maxRetriesPerItem=3 means the spawner creates up to 3 tasks.
    // If all 3 fail, the item is skipped on subsequent cycles until manually reset.
    // +kubebuilder:validation:Minimum=0
    // +kubebuilder:default=0
    // +optional
    MaxRetriesPerItem int32 `json:"maxRetriesPerItem,omitempty"`

    // ResetOnChange causes the failure counter to reset when the work item's
    // content changes (e.g., issue body edited, new comments added). This allows
    // automatic retries after a human updates the item to address the failure.
    // Defaults to false.
    // +optional
    // +kubebuilder:default=false
    ResetOnChange *bool `json:"resetOnChange,omitempty"`
}
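The nil-policy and zero-value semantics can be captured in one small resolution helper. This is a sketch with a standalone copy of the type above; the helper name is illustrative, not existing code:

```go
package main

import "fmt"

// FailurePolicy mirrors the proposed API type above (sketch copy).
type FailurePolicy struct {
	MaxRetriesPerItem int32
	ResetOnChange     *bool
}

// effectiveMaxRetries resolves the spec into something the spawner can act
// on: a nil policy and an explicit 0 both mean "unlimited" (current behavior).
func effectiveMaxRetries(p *FailurePolicy) (limit int32, limited bool) {
	if p == nil || p.MaxRetriesPerItem == 0 {
		return 0, false
	}
	return p.MaxRetriesPerItem, true
}

func main() {
	_, limited := effectiveMaxRetries(nil)
	fmt.Println(limited) // false: omitting failurePolicy keeps today's unlimited retries

	limit, limited := effectiveMaxRetries(&FailurePolicy{MaxRetriesPerItem: 3})
	fmt.Println(limit, limited) // 3 true
}
```

Collapsing nil and 0 into the same meaning keeps the default backward compatible without a webhook-set default.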

Status tracking

Add a failedItems map to TaskSpawnerStatus for persistence across spawner restarts:

type TaskSpawnerStatus struct {
    // ... existing fields ...

    // FailedItems tracks consecutive failure counts per work item ID.
    // Entries are pruned when the item is no longer discovered by the source
    // or when the failure counter resets (successful task or manual reset).
    // +optional
    FailedItems map[string]FailedItemStatus `json:"failedItems,omitempty"`
}

// FailedItemStatus tracks the failure history of a single work item.
type FailedItemStatus struct {
    // ConsecutiveFailures is the number of consecutive task failures for this item.
    ConsecutiveFailures int32 `json:"consecutiveFailures"`

    // LastFailureTime is when the most recent task failure was recorded.
    LastFailureTime metav1.Time `json:"lastFailureTime"`

    // ContentHash is an opaque hash of the work item content at the time of
    // last task creation. Used by resetOnChange to detect item updates.
    // +optional
    ContentHash string `json:"contentHash,omitempty"`
}
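One way ContentHash could be derived, as a sketch: hash exactly the fields whose edits should re-arm retries. The workItem struct here is a hypothetical stand-in, not the actual source type.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// workItem is a hypothetical stand-in for a discovered source item.
type workItem struct {
	Title        string
	Body         string
	CommentCount int
}

// contentHash fingerprints the fields whose edits should reset the failure
// counter when resetOnChange is true. NUL separators avoid field-boundary
// ambiguity between Title and Body.
func contentHash(it workItem) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%s\x00%d", it.Title, it.Body, it.CommentCount)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := workItem{Title: "Install build-essential", Body: "original", CommentCount: 1}
	b := a
	b.Body = "original, plus clarification from a human"
	fmt.Println(contentHash(a) == contentHash(a)) // true: deterministic
	fmt.Println(contentHash(a) == contentHash(b)) // false: an edit re-arms retries
}
```

Including the comment count means new comments also count as "change", matching the resetOnChange doc comment above.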

Example configuration

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: issue-worker
spec:
  when:
    githubIssues:
      labels: ["agent-ready"]
  failurePolicy:
    maxRetriesPerItem: 3
    resetOnChange: true
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef:
        name: anthropic-key
    workspaceRef:
      name: my-workspace
    ttlSecondsAfterFinished: 3600
    promptTemplate: "Fix issue #{{.Number}}: {{.Title}}\n\n{{.Body}}"

With this config:

  • Issue #42 fails 3 times → spawner skips it on future cycles
  • If someone edits the issue body (e.g., adds clarification), resetOnChange: true resets the counter → spawner retries
  • If someone posts a /kelos pick-up comment (when commentPolicy.triggerComment is set), the comment-based TriggerTime retrigger works independently of failurePolicy

Implementation sketch

Spawner cycle changes (cmd/kelos-spawner/main.go)

In runCycleWithSourceCore, after building the existing task map:

// 1. Update failure tracker from completed tasks
if ts.Status.FailedItems == nil {
    ts.Status.FailedItems = make(map[string]kelosv1alpha1.FailedItemStatus)
}
for _, t := range existingTaskList.Items {
    itemID := strings.TrimPrefix(t.Name, ts.Name+"-")
    switch t.Status.Phase {
    case kelosv1alpha1.TaskPhaseFailed:
        if t.Status.CompletionTime == nil {
            continue // terminal but no completion timestamp recorded yet
        }
        tracker := ts.Status.FailedItems[itemID]
        if tracker.LastFailureTime.Before(t.Status.CompletionTime) {
            tracker.ConsecutiveFailures++
            tracker.LastFailureTime = *t.Status.CompletionTime
            ts.Status.FailedItems[itemID] = tracker
        }
    case kelosv1alpha1.TaskPhaseSucceeded:
        delete(ts.Status.FailedItems, itemID) // reset on success
    }
}

// 2. Before creating a task, check the circuit breaker
maxRetries := int32(0)
if ts.Spec.FailurePolicy != nil {
    maxRetries = ts.Spec.FailurePolicy.MaxRetriesPerItem
}
if maxRetries > 0 {
    tracker := ts.Status.FailedItems[item.ID]
    if tracker.ConsecutiveFailures >= maxRetries {
        log.Info("Skipping item: max retries exceeded",
            "item", item.ID, "failures", tracker.ConsecutiveFailures)
        continue
    }
}

// 3. Prune stale entries (items no longer discovered)
discoveredIDs := make(map[string]bool)
for _, item := range items {
    discoveredIDs[item.ID] = true
}
for id := range ts.Status.FailedItems {
    if !discoveredIDs[id] {
        delete(ts.Status.FailedItems, id)
    }
}
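The resetOnChange path would slot in just before the circuit-breaker check in step 2: if the item's current content hash differs from the stored one, clear the counter instead of skipping. A sketch with simplified stand-in types (the real code would use the status types above):

```go
package main

import "fmt"

// failedItemStatus is a simplified stand-in for FailedItemStatus above.
type failedItemStatus struct {
	ConsecutiveFailures int32
	ContentHash         string
}

// maybeReset clears the failure counter when the item's content changed
// since the last task creation, so edited items are retried automatically.
func maybeReset(failed map[string]failedItemStatus, itemID, currentHash string, resetOnChange bool) {
	tracker, ok := failed[itemID]
	if !ok || !resetOnChange {
		return
	}
	if tracker.ContentHash != "" && tracker.ContentHash != currentHash {
		delete(failed, itemID) // content changed: re-arm retries
	}
}

func main() {
	failed := map[string]failedItemStatus{
		"42": {ConsecutiveFailures: 3, ContentHash: "abc"},
	}
	maybeReset(failed, "42", "abc", true) // unchanged content: stays circuit-broken
	fmt.Println(len(failed))              // 1
	maybeReset(failed, "42", "def", true) // edited content: counter cleared
	fmt.Println(len(failed))              // 0
}
```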

Status size management

The failedItems map is bounded by the number of currently discovered items that have failed. Entries are pruned when items are no longer discovered (issue closed, label removed, etc.). For a spawner watching 100 issues where 10 have hit the retry limit, the map has 10 entries (~1KB in status). This is well within Kubernetes object size limits.

Observability

Add a Prometheus counter and a status condition:

  • Metric: kelos_spawner_items_circuit_broken_total{spawner, namespace} — incremented each time an item is skipped due to maxRetriesPerItem
  • Condition: ItemsCircuitBroken with status True when any item has exceeded the retry limit, including a message listing the affected item IDs
status:
  conditions:
    - type: ItemsCircuitBroken
      status: "True"
      reason: MaxRetriesExceeded
      message: "3 items skipped due to max retries: 42, 87, 103"
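The condition message shown above could be assembled by a small helper (hypothetical; the function name is illustrative). Sorting the IDs keeps the message stable across cycles so the condition doesn't churn; note that IDs are opaque strings, so the order is lexicographic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// circuitBrokenMessage summarizes which item IDs exceeded the retry limit,
// in sorted order so repeated reconciles produce an identical message.
func circuitBrokenMessage(itemIDs []string) string {
	ids := append([]string(nil), itemIDs...) // copy: don't reorder the caller's slice
	sort.Strings(ids)
	return fmt.Sprintf("%d items skipped due to max retries: %s",
		len(ids), strings.Join(ids, ", "))
}

func main() {
	fmt.Println(circuitBrokenMessage([]string{"42", "87", "103"}))
	// → 3 items skipped due to max retries: 103, 42, 87
}
```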

Manual reset

Operators can reset a stuck item by:

  1. Editing the work item (when resetOnChange: true) — triggers content hash mismatch → counter resets
  2. Posting a trigger comment (when commentPolicy.triggerComment is set) — TriggerTime-based retrigger bypasses the failure tracker
  3. Patching the status directly: kubectl patch taskspawner my-spawner --subresource=status --type=json -p '[{"op":"remove","path":"/status/failedItems/42"}]'

Backward compatibility

  • failurePolicy is optional; when omitted, behavior is unchanged (unlimited retries, matching current behavior)
  • No changes to existing CRD fields or spawner logic for users who don't configure it
  • The failedItems status field is additive; existing controllers ignore unknown status fields

/kind feature
