🤖 Kelos Strategist Agent @gjkim42
Area: New CRDs & API Extensions
Summary
TaskSpawner has governance controls for how many tasks run (maxConcurrency, maxTotalTasks) and whether spawning is active at all (suspend), but no control over when tasks are created. This proposal adds schedulingPolicy to TaskSpawnerSpec, enabling declarative time-based constraints — recurring active windows (e.g., business hours only) and fixed blackout periods (e.g., release freezes) — so that task creation respects organizational schedules without manual intervention.
Problem
1. No time-awareness in task creation decisions
The spawner's discovery loop (cmd/kelos-spawner/main.go:322-395) creates tasks immediately when items are discovered, with only two gatekeepers: maxConcurrency and maxTotalTasks. There is no concept of "now is not a good time to create tasks."
This matters because autonomous agents create PRs, push branches, and post comments — actions with real impact on team workflows. Creating them at the wrong time introduces friction rather than reducing it.
2. suspend is manual and error-prone
The suspend field (api/v1alpha1/taskspawner_types.go:560-561) is a binary toggle that requires manual intervention:
spec:
  suspend: true  # someone must remember to set this
Real-world scheduling needs are recurring and predictable:
- "Only run during business hours" (every weekday, 9am-6pm)
- "Pause during the release freeze" (April 10-12)
- "Don't create tasks on weekends"
Using suspend for these requires external automation (a CronJob that patches the TaskSpawner), custom RBAC, and coordination across multiple spawners. If someone forgets to unsuspend after a freeze, agents silently stop working.
3. Off-hours task creation wastes money and attention
Agent tasks cost money (API tokens, compute) and produce artifacts (PRs, branches) that need human review. When agents run at 3am:
- PRs created overnight accumulate without review, creating a morning backlog
- Token spend occurs when no one is available to act on results
- Failed tasks go unnoticed for hours
- CI resources compete with overnight batch jobs
Teams running Kelos in production report wanting to align agent activity with their working hours, but the only option today is to build external tooling around suspend.
4. Release freezes require coordinated manual action
Before a release cut, platform teams typically announce a "code freeze" — no new PRs except critical fixes. With Kelos, this means:
- Identify all active TaskSpawners
- Set suspend: true on each one
- Remember to unsuspend after the release
- Hope no one creates a new unsuspended TaskSpawner during the freeze
This is the kind of operational burden that the Kubernetes ecosystem handles declaratively, for example with PodDisruptionBudgets for voluntary disruptions and with maintenance windows in managed offerings. Kelos should provide a declarative equivalent.
Proposed API
Add a schedulingPolicy field to TaskSpawnerSpec:
type TaskSpawnerSpec struct {
	When         When         `json:"when"`
	TaskTemplate TaskTemplate `json:"taskTemplate"`
	// ... existing fields ...

	// SchedulingPolicy defines time-based constraints on task creation.
	// Discovery continues on schedule to keep status current, but task
	// creation is deferred until the policy allows it. When unset, tasks
	// are created immediately upon discovery (current behavior).
	// +optional
	SchedulingPolicy *SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}
// SchedulingPolicy controls when a TaskSpawner is allowed to create new Tasks.
type SchedulingPolicy struct {
	// ActiveWindows restricts task creation to specific recurring time windows.
	// Tasks are only created when the current time falls within at least one
	// window (OR semantics). If empty, no recurring time restriction is applied.
	// +optional
	ActiveWindows []TimeWindow `json:"activeWindows,omitempty"`

	// BlackoutWindows defines fixed periods where task creation is suspended,
	// regardless of activeWindows. Useful for release freezes, maintenance
	// periods, or planned downtime. Blackout takes precedence over active windows.
	// +optional
	BlackoutWindows []BlackoutWindow `json:"blackoutWindows,omitempty"`
}

// TimeWindow defines a recurring time window.
type TimeWindow struct {
	// Days restricts to specific days of the week. If empty, all days match.
	// +kubebuilder:validation:items:Enum=monday;tuesday;wednesday;thursday;friday;saturday;sunday
	// +optional
	Days []string `json:"days,omitempty"`

	// StartTime is the daily start time in HH:MM format (24-hour clock).
	// Required when endTime is set.
	// +kubebuilder:validation:Pattern=`^\d{2}:\d{2}$`
	// +optional
	StartTime string `json:"startTime,omitempty"`

	// EndTime is the daily end time in HH:MM format (24-hour clock).
	// Required when startTime is set. If endTime < startTime, the window
	// wraps past midnight (e.g., startTime: "22:00", endTime: "06:00").
	// +kubebuilder:validation:Pattern=`^\d{2}:\d{2}$`
	// +optional
	EndTime string `json:"endTime,omitempty"`

	// Timezone is the IANA timezone name (e.g., "America/New_York", "Europe/London").
	// Defaults to "UTC".
	// +kubebuilder:default="UTC"
	// +optional
	Timezone string `json:"timezone,omitempty"`
}

// BlackoutWindow defines a fixed time period where task creation is suspended.
type BlackoutWindow struct {
	// Start is the beginning of the blackout period (RFC3339 format).
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Format=date-time
	Start string `json:"start"`

	// End is the end of the blackout period (RFC3339 format).
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Format=date-time
	End string `json:"end"`

	// Reason is a human-readable explanation for the blackout, surfaced
	// in status conditions and events.
	// +optional
	Reason string `json:"reason,omitempty"`
}
Evaluation logic
The scheduling check evaluates as:
allowed = !inBlackout(now) && (noActiveWindows || inAnyActiveWindow(now))
Pseudocode:
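func (p *SchedulingPolicy) IsTaskCreationAllowed(now time.Time) (bool, string) {
	// Blackout windows take absolute precedence.
	for _, bw := range p.BlackoutWindows {
		// Start and End are RFC3339 strings in the API type, so parse them first.
		start, errStart := time.Parse(time.RFC3339, bw.Start)
		end, errEnd := time.Parse(time.RFC3339, bw.End)
		if errStart != nil || errEnd != nil {
			// Malformed window: skipped here. CRD validation should reject it,
			// and an implementation may prefer to fail closed instead.
			continue
		}
		// Half-open interval [start, end): the start instant is inside the blackout.
		if !now.Before(start) && now.Before(end) {
			return false, fmt.Sprintf("In blackout window until %s: %s", bw.End, bw.Reason)
		}
	}
	// If no active windows are defined, allow (default open).
	if len(p.ActiveWindows) == 0 {
		return true, ""
	}
	// Check whether the current time falls within any active window.
	for _, aw := range p.ActiveWindows {
		loc, err := time.LoadLocation(aw.Timezone)
		if err != nil {
			continue // unknown timezone: this window never matches
		}
		localNow := now.In(loc)
		if aw.matchesDay(localNow) && aw.matchesTime(localNow) {
			return true, ""
		}
	}
	return false, "Outside all active windows"
}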
Example configurations
1. Business hours only (US Eastern)
Agents only create tasks during working hours, when engineers are available to review:
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: bug-fixer
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "America/New_York"
  when:
    githubIssues:
      labels: [bug, priority/important-soon]
      pollInterval: 5m
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Fix the following bug and open a PR:
      Issue #{{.Number}}: {{.Title}}
      {{.Body}}
    branch: "fix-{{.Number}}"
  maxConcurrency: 3
Effect: Issues are discovered continuously (for up-to-date status), but tasks are only created Mon-Fri 9am-6pm ET. A bug filed at 11pm on a weeknight is picked up at 9am the next morning; one filed on Friday night or over the weekend waits until 9am Monday.
2. Release freeze with business hours
Combines recurring windows with a fixed blackout for an upcoming release:
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: feature-builder
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "Europe/London"
    blackoutWindows:
      - start: "2026-04-10T00:00:00Z"
        end: "2026-04-12T23:59:59Z"
        reason: "v2.0 release freeze"
  when:
    githubIssues:
      labels: [kind/feature]
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Implement: {{.Title}}
      {{.Body}}
    branch: "feature-{{.Number}}"
  maxConcurrency: 2
Effect: Normal business hours operation, but completely silent during the April 10-12 release freeze.
3. Off-peak compute optimization
Run cost-heavy agent tasks during off-peak hours when cluster resources are cheaper:
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: weekly-refactor
spec:
  schedulingPolicy:
    activeWindows:
      - days: [saturday, sunday]
        startTime: "02:00"
        endTime: "08:00"
        timezone: "UTC"
  when:
    cron:
      schedule: "0 2 * * 6"  # Trigger at 2am Saturday
  taskTemplate:
    type: claude-code
    model: claude-opus-4-20250514
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Perform a comprehensive code quality review...
  maxConcurrency: 5
4. Multi-timezone team coverage
Active during business hours across two offices:
schedulingPolicy:
  activeWindows:
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "America/New_York"
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "Asia/Tokyo"
Effect: Tasks are created whenever either office is open (OR semantics), maximizing coverage while ensuring someone is available to review.
Implementation approach
Spawner changes (cmd/kelos-spawner/main.go)
The scheduling check is inserted in the runCycleWithSource function after discovery/deduplication but before the task creation loop (between the current lines ~310 and ~323):
// After deduplication, before task creation loop:
if ts.Spec.SchedulingPolicy != nil {
	allowed, reason := ts.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
	if !allowed {
		log.Info("Scheduling policy restricts task creation", "reason", reason)
		// Still update status with discovery count (items are known),
		// but skip task creation
		// ... update status with SchedulingRestricted condition ...
		return nil
	}
}
Discovery continues normally — items are counted in status.totalDiscovered so operators know work is queued. On the next poll cycle within an active window, items will be re-discovered and tasks created.
Webhook handler changes (internal/webhook/handler.go)
The createTask method in the webhook handler checks the scheduling policy before creating a task:
func (h *WebhookHandler) createTask(ctx context.Context, spawner *v1alpha1.TaskSpawner, ...) error {
	if spawner.Spec.SchedulingPolicy != nil {
		allowed, reason := spawner.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
		if !allowed {
			log.Info("Scheduling policy restricts webhook task creation",
				"reason", reason, "spawner", spawner.Name)
			return nil // Accept webhook but skip task creation
		}
	}
	// ... existing task creation logic ...
}
Note for webhook sources: Unlike polling sources, which re-discover items on the next cycle, webhook events are ephemeral. A webhook received while the policy restricts creation (during a blackout or outside every active window) is dropped and will not be replayed. This is acceptable for most webhook use cases, because a later event for the same item that arrives once a window is open will create the task, but it should be clearly documented.
Status reporting
Add a SchedulingRestricted condition to communicate the scheduling state:
status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "OutsideActiveWindow"
      message: "Task creation paused; next active window: Mon 09:00 America/New_York"
      lastTransitionTime: "2026-04-05T23:00:00Z"
Or during a blackout:
status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "InBlackoutWindow"
      message: "Task creation paused until 2026-04-12T23:59:59Z: v2.0 release freeze"
The condition transitions to status: "False" when the window opens, providing a clear audit trail.
Scope estimate
- Types: ~60 lines in api/v1alpha1/taskspawner_types.go
- Evaluation logic: ~80 lines in a new internal/scheduling/policy.go
- Spawner integration: ~15 lines in cmd/kelos-spawner/main.go
- Webhook integration: ~10 lines in internal/webhook/handler.go
- Status condition: ~20 lines in spawner status update
- Tests: ~200 lines (time zone edge cases, midnight wrapping, blackout precedence)
- CRD regeneration: make update
- Total: ~400 lines including tests
Design decisions
Why on TaskSpawnerSpec, not a separate CRD?
A cluster-wide SchedulingPolicy CRD was considered but rejected for this proposal:
- Per-spawner policies are simpler and cover 90% of use cases
- Different spawners often need different schedules (bug-fix agents should run during business hours; dependency-update agents should run off-peak)
- A cluster-wide CRD can be added later as a higher-level governance layer that references per-spawner policies
Why "discovery continues, creation pauses" instead of "spawner stops entirely"?
Continuing discovery during restricted windows keeps status.totalDiscovered current, so operators can see queued work via kelos get taskspawners. It also means that status already reflects the backlog when a window opens, and the first poll cycle inside the window re-discovers those items and creates their tasks.
Why not reuse CronJob scheduling?
Kubernetes CronJobs define when to run, not when to pause. The scheduling policy here is the inverse — it defines time constraints on an always-running spawner. The two are complementary: a cron spawner with schedule: "0 * * * *" (every hour) could have schedulingPolicy to restrict which of those hours actually create tasks.
Webhook "drop" behavior
Webhook events received during restricted windows are acknowledged (HTTP 200) but don't create tasks. This avoids the complexity of a durable event queue while being consistent with how maxConcurrency already drops webhook events when the limit is reached (handler.go:311-318). For systems where event durability matters, the source system's retry mechanism or a separate event store should be used.
Backward compatibility
- Purely additive: new optional schedulingPolicy field on TaskSpawnerSpec
- When unset (default), behavior is identical to today: tasks are created immediately
- No changes to existing CRD fields, controllers, or webhook behavior when the policy is unset
- XValidation on TaskSpawnerSpec does not need updating (scheduling policy is orthogonal to source type)
- Existing spawners continue working without modification
Relationship to existing proposals
| Issue | Relationship |
|---|---|
| suspend field (built-in) | Complementary. suspend is a manual kill switch; schedulingPolicy is a declarative, automatic time constraint. Both can coexist; suspend: true overrides any scheduling policy. |
| #788 (costBudget) | Complementary. Cost budget limits total spending; scheduling policy limits when spending occurs. A spawner can have both: "max $50/day, only during business hours." |
| #765 (Cancelled phase / obsolescencePolicy) | Complementary. Obsolescence cancels stale running tasks; scheduling prevents new task creation during restricted windows. |
| #889 (failurePolicy) | Orthogonal. Failure policy governs what happens when tasks fail; scheduling governs when tasks are created. |
References
- TaskSpawnerSpec: api/v1alpha1/taskspawner_types.go:531-571
- suspend field: api/v1alpha1/taskspawner_types.go:560-561
- Task creation loop: cmd/kelos-spawner/main.go:322-395
- Status condition pattern: cmd/kelos-spawner/main.go:413-429
- Webhook concurrency drop precedent: internal/webhook/handler.go:311-318
- Go time.LoadLocation: used for IANA timezone parsing (stdlib, no external dependency)
/kind feature