Add a failurePolicy field to TaskSpawnerSpec that tracks per-work-item failure counts across TTL cleanup boundaries and stops recreating tasks for items that persistently fail. Currently, when ttlSecondsAfterFinished is set and a task fails, the spawner unconditionally recreates the task on the next discovery cycle — burning tokens and compute indefinitely on items the agent cannot fix.
Problem
The infinite retry loop
The spawner deduplication logic in cmd/kelos-spawner/main.go:260-305 works by listing all existing tasks and checking if a task already exists for each discovered work item:
```go
existingTaskMap := make(map[string]*kelosv1alpha1.Task)
// ...
for _, item := range items {
    taskName := fmt.Sprintf("%s-%s", ts.Name, item.ID)
    existing, found := existingTaskMap[taskName]
    if !found {
        newItems = append(newItems, item) // treated as new → will create task
        continue
    }
    // ... retrigger logic for items with TriggerTime ...
}
```
When a task fails and TTL deletes it, the spawner no longer finds it in existingTaskMap. The work item (e.g., a GitHub issue matching the label filter) is still discovered, and the spawner creates a brand-new task. This task fails again, TTL deletes it, and the cycle repeats indefinitely.
The TTLSecondsAfterFinished field's own doc comment acknowledges this design:
If set, spawned Tasks will be automatically deleted after the given number of seconds once they reach a terminal phase, allowing TaskSpawner to create a new Task.
— api/v1alpha1/taskspawner_types.go:456-458
This is intentionally useful for cron-based spawners and re-triggerable workflows, but creates a silent cost leak for polling-based spawners when an item consistently fails.
When does this happen?
The loop occurs when ALL of these conditions are true:
- Polling-based source (githubIssues, githubPullRequests, Jira) — the same item is rediscovered every cycle
- No trigger comment (commentPolicy.triggerComment not set) — items are discovered by label/state/assignee alone, not by human action
- TTL configured (ttlSecondsAfterFinished > 0) — failed tasks are cleaned up, erasing the spawner's memory of the failure
- Item persistently fails — the underlying problem (malformed input, permissions, unsupported codebase, etc.) isn't fixable by the agent
The retry interval is approximately TTL + pollInterval. With typical values (TTL: 3600s, pollInterval: 5m), this means one failed task per ~65 minutes per stuck item, running indefinitely until a human notices.
Concrete scenario
A TaskSpawner watches issues labeled agent-ready. An issue carries the label but describes a task the agent cannot perform:
1. The spawner creates task my-spawner-42 → the agent runs → the task fails (cost: $0.50 in tokens)
2. TTL deletes my-spawner-42
3. The issue still has the agent-ready label → the spawner creates my-spawner-42 again → it fails → TTL deletes it → repeat
With 10 stuck items, this becomes $110/day. With expensive models (Opus), costs scale further.
What existing proposals don't cover
Failure notifications alert a human but do not prevent recreation; there is no feedback loop to the spawner.
The missing primitive is per-work-item failure memory that survives task deletion.
Proposed API
New field on TaskSpawnerSpec
```go
// FailurePolicy configures how the spawner handles work items whose tasks
// repeatedly fail. When set, the spawner tracks consecutive failure counts
// per work item and stops creating new tasks after the limit is reached.
// This prevents infinite retry loops when TTL cleanup removes failed tasks.
// +optional
FailurePolicy *FailurePolicy `json:"failurePolicy,omitempty"`
```
FailurePolicy type
```go
// FailurePolicy controls per-work-item circuit breaking for the spawner.
type FailurePolicy struct {
    // MaxRetriesPerItem is the maximum number of consecutive failed tasks
    // the spawner will create for a single work item before skipping it.
    // A value of 0 means unlimited retries (current behavior). A value of 1
    // means the spawner creates at most one task and never retries on failure.
    // The counter resets when a task for the item succeeds.
    //
    // Example: maxRetriesPerItem=3 means the spawner creates up to 3 tasks.
    // If all 3 fail, the item is skipped on subsequent cycles until manually reset.
    // +kubebuilder:validation:Minimum=0
    // +kubebuilder:default=0
    // +optional
    MaxRetriesPerItem int32 `json:"maxRetriesPerItem,omitempty"`

    // ResetOnChange causes the failure counter to reset when the work item's
    // content changes (e.g., issue body edited, new comments added). This allows
    // automatic retries after a human updates the item to address the failure.
    // Defaults to false.
    // +optional
    // +kubebuilder:default=false
    ResetOnChange *bool `json:"resetOnChange,omitempty"`
}
```
Status tracking
Add a failedItems map to TaskSpawnerStatus for persistence across spawner restarts:
```go
type TaskSpawnerStatus struct {
    // ... existing fields ...

    // FailedItems tracks consecutive failure counts per work item ID.
    // Entries are pruned when the item is no longer discovered by the source
    // or when the failure counter resets (successful task or manual reset).
    // +optional
    FailedItems map[string]FailedItemStatus `json:"failedItems,omitempty"`
}

// FailedItemStatus tracks the failure history of a single work item.
type FailedItemStatus struct {
    // ConsecutiveFailures is the number of consecutive task failures for this item.
    ConsecutiveFailures int32 `json:"consecutiveFailures"`

    // LastFailureTime is when the most recent task failure was recorded.
    LastFailureTime metav1.Time `json:"lastFailureTime"`

    // ContentHash is an opaque hash of the work item content at the time of
    // last task creation. Used by resetOnChange to detect item updates.
    // +optional
    ContentHash string `json:"contentHash,omitempty"`
}
```
- If someone edits the issue body (e.g., adds clarification), resetOnChange: true resets the counter and the spawner retries
- If someone posts a /kelos pick-up comment (when commentPolicy.triggerComment is set), the comment-based TriggerTime retrigger works independently of failurePolicy
Implementation sketch
Spawner cycle changes (cmd/kelos-spawner/main.go)
In runCycleWithSourceCore, after building the existing task map:
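The original sketch was lost in extraction; the idea is to consult the failure counter before treating a missing task as new work, and to update the counter when spawned tasks reach a terminal phase. A minimal, self-contained illustration with simplified stand-ins for the API types (shouldSkip and recordResult are names invented here, not from the codebase):

```go
package main

import "fmt"

// Simplified stand-ins for the proposed API types (illustration only).
type FailurePolicy struct{ MaxRetriesPerItem int32 }
type FailedItemStatus struct{ ConsecutiveFailures int32 }

// shouldSkip reports whether the spawner should skip creating a task for a
// work item: a limit of 0 means unlimited retries (current behavior); otherwise
// the item is skipped once its consecutive-failure count reaches the limit.
func shouldSkip(p *FailurePolicy, failed map[string]FailedItemStatus, itemID string) bool {
	if p == nil || p.MaxRetriesPerItem == 0 {
		return false
	}
	fi, tracked := failed[itemID]
	return tracked && fi.ConsecutiveFailures >= p.MaxRetriesPerItem
}

// recordResult updates the counter when a spawned task reaches a terminal
// phase: a failure increments it, a success clears the entry entirely.
func recordResult(failed map[string]FailedItemStatus, itemID string, succeeded bool) {
	if succeeded {
		delete(failed, itemID)
		return
	}
	fi := failed[itemID]
	fi.ConsecutiveFailures++
	failed[itemID] = fi
}

func main() {
	p := &FailurePolicy{MaxRetriesPerItem: 3}
	failed := map[string]FailedItemStatus{}
	for i := 0; i < 3; i++ {
		recordResult(failed, "42", false) // three consecutive failures
	}
	fmt.Println(shouldSkip(p, failed, "42")) // circuit broken
	recordResult(failed, "42", true)         // a success resets the item
	fmt.Println(shouldSkip(p, failed, "42"))
}
```

Running it prints true then false: three consecutive failures trip the breaker, and a single success clears the tracking entry.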
Status size management
The failedItems map is bounded by the number of currently discovered items that have failed. Entries are pruned when items are no longer discovered (issue closed, label removed, etc.). For a spawner watching 100 issues where 10 have hit the retry limit, the map has 10 entries (~1KB in status). This is well within Kubernetes object size limits.
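The pruning described above can be a simple set-difference pass each cycle. A sketch, again with a simplified stand-in for the status entry type (pruneFailedItems is a name invented here):

```go
package main

import "fmt"

// Simplified stand-in for the proposed status entry type.
type FailedItemStatus struct{ ConsecutiveFailures int32 }

// pruneFailedItems drops tracking entries for work items the source no longer
// returns (issue closed, label removed, etc.), keeping the status map bounded
// by the currently discovered item set.
func pruneFailedItems(failed map[string]FailedItemStatus, discoveredIDs []string) {
	seen := make(map[string]struct{}, len(discoveredIDs))
	for _, id := range discoveredIDs {
		seen[id] = struct{}{}
	}
	for id := range failed {
		if _, ok := seen[id]; !ok {
			delete(failed, id)
		}
	}
}

func main() {
	failed := map[string]FailedItemStatus{
		"42": {ConsecutiveFailures: 3},
		"87": {ConsecutiveFailures: 2},
	}
	pruneFailedItems(failed, []string{"42"}) // issue 87 is no longer discovered
	fmt.Println(len(failed))                 // only the entry for 42 remains
}
```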
Observability
Add a Prometheus counter and a status condition:
- Metric: kelos_spawner_items_circuit_broken_total{spawner, namespace} — incremented each time an item is skipped due to maxRetriesPerItem
- Condition: ItemsCircuitBroken with status True when any item has exceeded the retry limit, including a message listing the affected item IDs
```yaml
status:
  conditions:
  - type: ItemsCircuitBroken
    status: "True"
    reason: MaxRetriesExceeded
    message: "3 items skipped due to max retries: 42, 87, 103"
```
Manual reset
Operators can reset a stuck item by:
- Editing the work item (when resetOnChange: true) — triggers a content hash mismatch, which resets the counter
- Posting a trigger comment (when commentPolicy.triggerComment is set) — the TriggerTime-based retrigger bypasses the failure tracker
- Patching the status directly: kubectl patch taskspawner my-spawner --subresource=status --type=json -p '[{"op":"remove","path":"/status/failedItems/42"}]'
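The content-hash mechanics behind resetOnChange could be as simple as hashing the item fields that count as "content". A sketch (which fields participate, e.g. title, body, or comment count, is an open design choice; contentHash is a name invented here):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash derives an opaque, deterministic hash of a work item's content.
// This sketch hashes title and body, joined by a NUL separator so that moving
// text across the field boundary still changes the hash.
func contentHash(title, body string) string {
	h := sha256.New()
	h.Write([]byte(title))
	h.Write([]byte{0})
	h.Write([]byte(body))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := contentHash("Fix flaky test", "It fails on CI")
	after := contentHash("Fix flaky test", "It fails on CI\n\nUpdate: added the stack trace")
	// An edit to the body changes the hash, so resetOnChange would reset the
	// failure counter and let the spawner retry the item.
	fmt.Println(before != after)
}
```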
Backward compatibility
- failurePolicy is optional; when omitted, behavior is unchanged (unlimited retries, matching current behavior)
- No changes to existing CRD fields or spawner logic for users who don't configure it
- The failedItems status field is additive; existing controllers ignore unknown status fields
🤖 Kelos Strategist Agent @gjkim42
Area: New CRDs & API Extensions
/kind feature