API: Add failurePolicy to TaskSpawnerSpec for per-work-item circuit breaking across TTL boundaries #889

Description

🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

Add a failurePolicy field to TaskSpawnerSpec that tracks per-work-item failure counts across TTL cleanup boundaries and stops recreating tasks for items that persistently fail. Currently, when ttlSecondsAfterFinished is set and a task fails, the spawner unconditionally recreates the task on the next discovery cycle — burning tokens and compute indefinitely on items the agent cannot fix.

Problem

The infinite retry loop

The spawner deduplication logic in cmd/kelos-spawner/main.go:260-305 works by listing all existing tasks and checking if a task already exists for each discovered work item:

existingTaskMap := make(map[string]*kelosv1alpha1.Task)
// ...
for _, item := range items {
    taskName := fmt.Sprintf("%s-%s", ts.Name, item.ID)
    existing, found := existingTaskMap[taskName]
    if !found {
        newItems = append(newItems, item)  // treated as new → will create task
        continue
    }
    // ... retrigger logic for items with TriggerTime ...
}

When a task fails and TTL deletes it, the spawner no longer finds it in existingTaskMap. The work item (e.g., a GitHub issue matching the label filter) is still discovered, and the spawner creates a brand-new task. This task fails again, TTL deletes it, and the cycle repeats indefinitely.

The TTLSecondsAfterFinished field's own doc comment acknowledges this design:

If set, spawned Tasks will be automatically deleted after the given number of seconds once they reach a terminal phase, allowing TaskSpawner to create a new Task.
api/v1alpha1/taskspawner_types.go:456-458

This behavior is intentional and useful for cron-based spawners and re-triggerable workflows, but it creates a silent cost leak for polling-based spawners when an item consistently fails.

When does this happen?

The loop occurs when ALL of these conditions are true:

  1. Polling-based source (githubIssues, githubPullRequests, Jira) — the same item is rediscovered every cycle
  2. No trigger comment (commentPolicy.triggerComment not set) — items are discovered by label/state/assignee alone, not by human action
  3. TTL configured (ttlSecondsAfterFinished > 0) — failed tasks are cleaned up, erasing the spawner's memory of the failure
  4. Item persistently fails — the underlying problem (malformed input, permissions, unsupported codebase, etc.) isn't fixable by the agent

The retry interval is approximately TTL + pollInterval. With typical values (TTL: 3600s, pollInterval: 5m), this means one failed task per ~65 minutes per stuck item, running indefinitely until a human notices.

Concrete scenario

A TaskSpawner watches issues labeled agent-ready:

  1. Issue #42 ("Install build-essential in claude-code container") is labeled agent-ready but describes a task the agent cannot perform
  2. Spawner creates my-spawner-42 → agent runs → fails (cost: $0.50 in tokens)
  3. After 1 hour, TTL deletes my-spawner-42
  4. Next poll: issue #42 still has the agent-ready label → spawner creates my-spawner-42 again → fails → TTL deletes → repeat
  5. After 24 hours: 22 failed tasks × $0.50 = $11 wasted on a single unfixable issue

With 10 stuck items, this becomes $110/day. With expensive models (Opus), costs scale further.

What existing proposals don't cover

| Proposal | Scope | Gap |
| --- | --- | --- |
| #730 (retryStrategy) | Retries within a single task lifecycle before it reaches Failed phase | Does not prevent the spawner from creating entirely new tasks for the same item after TTL cleanup |
| #765 (obsolescencePolicy) | Cancels stale running tasks that are no longer relevant | Does not track or act on historical failures of completed-and-deleted tasks |
| #788 / #624 (costBudget) | Caps total spending across all tasks spawned by a TaskSpawner | Does not distinguish productive spending from wasted retries on a single stuck item; one stuck item could consume the entire budget |
| #749 (onCompletion hooks) | Sends outbound notifications on task completion | Notifies but does not prevent recreation; no feedback loop to the spawner |

The missing primitive is per-work-item failure memory that survives task deletion.

Proposed API

New field on TaskSpawnerSpec

// FailurePolicy configures how the spawner handles work items whose tasks
// repeatedly fail. When set, the spawner tracks consecutive failure counts
// per work item and stops creating new tasks after the limit is reached.
// This prevents infinite retry loops when TTL cleanup removes failed tasks.
// +optional
FailurePolicy *FailurePolicy `json:"failurePolicy,omitempty"`

FailurePolicy type

// FailurePolicy controls per-work-item circuit breaking for the spawner.
type FailurePolicy struct {
    // MaxRetriesPerItem is the maximum number of consecutive failed tasks
    // the spawner will create for a single work item before skipping it.
    // A value of 0 means unlimited retries (current behavior). A value of 1
    // means the spawner creates at most one task and never retries on failure.
    // The counter resets when a task for the item succeeds.
    //
    // Example: maxRetriesPerItem=3 means the spawner creates up to 3 tasks.
    // If all 3 fail, the item is skipped on subsequent cycles until manually reset.
    // +kubebuilder:validation:Minimum=0
    // +kubebuilder:default=0
    // +optional
    MaxRetriesPerItem int32 `json:"maxRetriesPerItem,omitempty"`

    // ResetOnChange causes the failure counter to reset when the work item's
    // content changes (e.g., issue body edited, new comments added). This allows
    // automatic retries after a human updates the item to address the failure.
    // Defaults to false.
    // +optional
    // +kubebuilder:default=false
    ResetOnChange *bool `json:"resetOnChange,omitempty"`
}
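The nil-policy and zero-value semantics can be captured in one small resolution helper. This is a sketch with a standalone copy of the type above; the helper name is illustrative, not existing code:

```go
package main

import "fmt"

// FailurePolicy mirrors the proposed API type above (sketch copy).
type FailurePolicy struct {
	MaxRetriesPerItem int32
	ResetOnChange     *bool
}

// effectiveMaxRetries resolves the spec into something the spawner can act
// on: a nil policy and an explicit 0 both mean "unlimited" (current behavior).
func effectiveMaxRetries(p *FailurePolicy) (limit int32, limited bool) {
	if p == nil || p.MaxRetriesPerItem == 0 {
		return 0, false
	}
	return p.MaxRetriesPerItem, true
}

func main() {
	_, limited := effectiveMaxRetries(nil)
	fmt.Println(limited) // false: omitting failurePolicy keeps today's unlimited retries

	limit, limited := effectiveMaxRetries(&FailurePolicy{MaxRetriesPerItem: 3})
	fmt.Println(limit, limited) // 3 true
}
```

Collapsing nil and 0 into the same meaning keeps the default backward compatible without a webhook-set default.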

Status tracking

Add a failedItems map to TaskSpawnerStatus for persistence across spawner restarts:

type TaskSpawnerStatus struct {
    // ... existing fields ...

    // FailedItems tracks consecutive failure counts per work item ID.
    // Entries are pruned when the item is no longer discovered by the source
    // or when the failure counter resets (successful task or manual reset).
    // +optional
    FailedItems map[string]FailedItemStatus `json:"failedItems,omitempty"`
}

// FailedItemStatus tracks the failure history of a single work item.
type FailedItemStatus struct {
    // ConsecutiveFailures is the number of consecutive task failures for this item.
    ConsecutiveFailures int32 `json:"consecutiveFailures"`

    // LastFailureTime is when the most recent task failure was recorded.
    LastFailureTime metav1.Time `json:"lastFailureTime"`

    // ContentHash is an opaque hash of the work item content at the time of
    // last task creation. Used by resetOnChange to detect item updates.
    // +optional
    ContentHash string `json:"contentHash,omitempty"`
}
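One way ContentHash could be derived, as a sketch: hash exactly the fields whose edits should re-arm retries. The workItem struct here is a hypothetical stand-in, not the actual source type.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// workItem is a hypothetical stand-in for a discovered source item.
type workItem struct {
	Title        string
	Body         string
	CommentCount int
}

// contentHash fingerprints the fields whose edits should reset the failure
// counter when resetOnChange is true. NUL separators avoid field-boundary
// ambiguity between Title and Body.
func contentHash(it workItem) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%s\x00%d", it.Title, it.Body, it.CommentCount)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := workItem{Title: "Install build-essential", Body: "original", CommentCount: 1}
	b := a
	b.Body = "original, plus clarification from a human"
	fmt.Println(contentHash(a) == contentHash(a)) // true: deterministic
	fmt.Println(contentHash(a) == contentHash(b)) // false: an edit re-arms retries
}
```

Including the comment count means new comments also count as "change", matching the resetOnChange doc comment above.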

Example configuration

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: issue-worker
spec:
  when:
    githubIssues:
      labels: ["agent-ready"]
  failurePolicy:
    maxRetriesPerItem: 3
    resetOnChange: true
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef:
        name: anthropic-key
    workspaceRef:
      name: my-workspace
    ttlSecondsAfterFinished: 3600
    promptTemplate: "Fix issue #{{.Number}}: {{.Title}}\n\n{{.Body}}"

With this config:

  • Issue #42 fails 3 times → spawner skips it on future cycles
  • If someone edits the issue body (e.g., adds clarification), resetOnChange: true resets the counter → spawner retries
  • If someone posts a /kelos pick-up comment (when commentPolicy.triggerComment is set), the comment-based TriggerTime retrigger works independently of failurePolicy

Implementation sketch

Spawner cycle changes (cmd/kelos-spawner/main.go)

In runCycleWithSourceCore, after building the existing task map:

// 1. Update failure tracker from completed tasks
if ts.Status.FailedItems == nil {
    ts.Status.FailedItems = make(map[string]kelosv1alpha1.FailedItemStatus)
}
for _, t := range existingTaskList.Items {
    itemID := strings.TrimPrefix(t.Name, ts.Name+"-")
    switch t.Status.Phase {
    case kelosv1alpha1.TaskPhaseFailed:
        if t.Status.CompletionTime == nil {
            continue // terminal but no completion timestamp recorded yet
        }
        tracker := ts.Status.FailedItems[itemID]
        if tracker.LastFailureTime.Before(t.Status.CompletionTime) {
            tracker.ConsecutiveFailures++
            tracker.LastFailureTime = *t.Status.CompletionTime
            ts.Status.FailedItems[itemID] = tracker
        }
    case kelosv1alpha1.TaskPhaseSucceeded:
        delete(ts.Status.FailedItems, itemID) // reset on success
    }
}

// 2. Before creating a task, check the circuit breaker
maxRetries := int32(0)
if ts.Spec.FailurePolicy != nil {
    maxRetries = ts.Spec.FailurePolicy.MaxRetriesPerItem
}
if maxRetries > 0 {
    tracker := ts.Status.FailedItems[item.ID]
    if tracker.ConsecutiveFailures >= maxRetries {
        log.Info("Skipping item: max retries exceeded",
            "item", item.ID, "failures", tracker.ConsecutiveFailures)
        continue
    }
}

// 3. Prune stale entries (items no longer discovered)
discoveredIDs := make(map[string]bool)
for _, item := range items {
    discoveredIDs[item.ID] = true
}
for id := range ts.Status.FailedItems {
    if !discoveredIDs[id] {
        delete(ts.Status.FailedItems, id)
    }
}
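The resetOnChange path would slot in just before the circuit-breaker check in step 2: if the item's current content hash differs from the stored one, clear the counter instead of skipping. A sketch with simplified stand-in types (the real code would use the status types above):

```go
package main

import "fmt"

// failedItemStatus is a simplified stand-in for FailedItemStatus above.
type failedItemStatus struct {
	ConsecutiveFailures int32
	ContentHash         string
}

// maybeReset clears the failure counter when the item's content changed
// since the last task creation, so edited items are retried automatically.
func maybeReset(failed map[string]failedItemStatus, itemID, currentHash string, resetOnChange bool) {
	tracker, ok := failed[itemID]
	if !ok || !resetOnChange {
		return
	}
	if tracker.ContentHash != "" && tracker.ContentHash != currentHash {
		delete(failed, itemID) // content changed: re-arm retries
	}
}

func main() {
	failed := map[string]failedItemStatus{
		"42": {ConsecutiveFailures: 3, ContentHash: "abc"},
	}
	maybeReset(failed, "42", "abc", true) // unchanged content: stays circuit-broken
	fmt.Println(len(failed))              // 1
	maybeReset(failed, "42", "def", true) // edited content: counter cleared
	fmt.Println(len(failed))              // 0
}
```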

Status size management

The failedItems map is bounded by the number of currently discovered items that have failed. Entries are pruned when items are no longer discovered (issue closed, label removed, etc.). For a spawner watching 100 issues where 10 have hit the retry limit, the map has 10 entries (~1KB in status). This is well within Kubernetes object size limits.

Observability

Add a Prometheus counter and a status condition:

  • Metric: kelos_spawner_items_circuit_broken_total{spawner, namespace} — incremented each time an item is skipped due to maxRetriesPerItem
  • Condition: ItemsCircuitBroken with status True when any item has exceeded the retry limit, including a message listing the affected item IDs
status:
  conditions:
    - type: ItemsCircuitBroken
      status: "True"
      reason: MaxRetriesExceeded
      message: "3 items skipped due to max retries: 42, 87, 103"
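The condition message shown above could be assembled by a small helper (hypothetical; the function name is illustrative). Sorting the IDs keeps the message stable across cycles so the condition doesn't churn; note that IDs are opaque strings, so the order is lexicographic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// circuitBrokenMessage summarizes which item IDs exceeded the retry limit,
// in sorted order so repeated reconciles produce an identical message.
func circuitBrokenMessage(itemIDs []string) string {
	ids := append([]string(nil), itemIDs...) // copy: don't reorder the caller's slice
	sort.Strings(ids)
	return fmt.Sprintf("%d items skipped due to max retries: %s",
		len(ids), strings.Join(ids, ", "))
}

func main() {
	fmt.Println(circuitBrokenMessage([]string{"42", "87", "103"}))
	// → 3 items skipped due to max retries: 103, 42, 87
}
```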

Manual reset

Operators can reset a stuck item by:

  1. Editing the work item (when resetOnChange: true) — triggers content hash mismatch → counter resets
  2. Posting a trigger comment (when commentPolicy.triggerComment is set) — TriggerTime-based retrigger bypasses the failure tracker
  3. Patching the status directly: kubectl patch taskspawner my-spawner --subresource=status --type=json -p '[{"op":"remove","path":"/status/failedItems/42"}]'

Backward compatibility

  • failurePolicy is optional; when omitted, behavior is unchanged (unlimited retries, matching current behavior)
  • No changes to existing CRD fields or spawner logic for users who don't configure it
  • The failedItems status field is additive; existing controllers ignore unknown status fields

/kind feature
