API: Add schedulingPolicy to TaskSpawner for time-based task creation control with active windows and blackout periods #907

🤖 **Kelos Strategist Agent** @gjkim42

## Area: New CRDs & API Extensions

## Summary

TaskSpawner has governance controls for *how many* tasks run (`maxConcurrency`, `maxTotalTasks`) and whether spawning is active at all (`suspend`), but no control over *when* tasks are created. This proposal adds `schedulingPolicy` to `TaskSpawnerSpec`, enabling declarative time-based constraints — recurring active windows (e.g., business hours only) and fixed blackout periods (e.g., release freezes) — so that task creation respects organizational schedules without manual intervention.

## Problem

### 1. No time-awareness in task creation decisions

The spawner's discovery loop (`cmd/kelos-spawner/main.go:322-395`) creates tasks immediately when items are discovered, with only two gatekeepers: `maxConcurrency` and `maxTotalTasks`. There is no concept of "now is not a good time to create tasks."

This matters because autonomous agents create PRs, push branches, and post comments — actions with real impact on team workflows. Creating them at the wrong time introduces friction rather than reducing it.

### 2. `suspend` is manual and error-prone

The `suspend` field (`api/v1alpha1/taskspawner_types.go:560-561`) is a binary toggle that requires manual intervention:

```yaml
spec:
  suspend: true   # someone must remember to set this
```

Real-world scheduling needs are recurring and predictable:
- "Only run during business hours" (every weekday, 9am-6pm)
- "Pause during the release freeze" (April 10-12)
- "Don't create tasks on weekends"

Using `suspend` for these requires external automation (a CronJob that patches the TaskSpawner), custom RBAC, and coordination across multiple spawners. If someone forgets to unsuspend after a freeze, agents silently stop working.

### 3. Off-hours task creation wastes money and attention

Agent tasks cost money (API tokens, compute) and produce artifacts (PRs, branches) that need human review. When agents run at 3am:
- PRs created overnight accumulate without review, creating a morning backlog
- Token spend occurs when no one is available to act on results
- Failed tasks go unnoticed for hours
- CI resources compete with overnight batch jobs

Teams running Kelos in production report wanting to align agent activity with their working hours, but the only option today is to build external tooling around `suspend`.

### 4. Release freezes require coordinated manual action

Before a release cut, platform teams typically announce a "code freeze" — no new PRs except critical fixes. With Kelos, this means:
1. Identify all active TaskSpawners
2. Set `suspend: true` on each one
3. Remember to unsuspend after the release
4. Hope no one creates a new unsuspended TaskSpawner during the freeze

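Kubernetes handles comparable operational constraints declaratively (for example, PodDisruptionBudgets for voluntary disruptions, and the maintenance windows offered by managed services such as GKE). Kelos should provide a declarative equivalent for task creation.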

## Proposed API

Add a `schedulingPolicy` field to `TaskSpawnerSpec`:

```go
type TaskSpawnerSpec struct {
    When             When             `json:"when"`
    TaskTemplate     TaskTemplate     `json:"taskTemplate"`
    // ... existing fields ...

    // SchedulingPolicy defines time-based constraints on task creation.
    // Discovery continues on schedule to keep status current, but task
    // creation is deferred until the policy allows it. When unset, tasks
    // are created immediately upon discovery (current behavior).
    // +optional
    SchedulingPolicy *SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}

// SchedulingPolicy controls when a TaskSpawner is allowed to create new Tasks.
type SchedulingPolicy struct {
    // ActiveWindows restricts task creation to specific recurring time windows.
    // Tasks are only created when the current time falls within at least one
    // window (OR semantics). If empty, no recurring time restriction is applied.
    // +optional
    ActiveWindows []TimeWindow `json:"activeWindows,omitempty"`

    // BlackoutWindows defines fixed periods where task creation is suspended,
    // regardless of activeWindows. Useful for release freezes, maintenance
    // periods, or planned downtime. Blackout takes precedence over active windows.
    // +optional
    BlackoutWindows []BlackoutWindow `json:"blackoutWindows,omitempty"`
}

// TimeWindow defines a recurring time window.
type TimeWindow struct {
    // Days restricts to specific days of the week. If empty, all days match.
    // +kubebuilder:validation:items:Enum=monday;tuesday;wednesday;thursday;friday;saturday;sunday
    // +optional
    Days []string `json:"days,omitempty"`

    // StartTime is the daily start time in HH:MM format (24-hour clock).
    // Required when endTime is set.
    // +kubebuilder:validation:Pattern=`^([01]\d|2[0-3]):[0-5]\d$`
    // +optional
    StartTime string `json:"startTime,omitempty"`

    // EndTime is the daily end time in HH:MM format (24-hour clock).
    // Required when startTime is set. If endTime < startTime, the window
    // wraps past midnight (e.g., startTime: "22:00", endTime: "06:00").
    // +kubebuilder:validation:Pattern=`^([01]\d|2[0-3]):[0-5]\d$`
    // +optional
    EndTime string `json:"endTime,omitempty"`

    // Timezone is the IANA timezone name (e.g., "America/New_York", "Europe/London").
    // Defaults to "UTC".
    // +kubebuilder:default="UTC"
    // +optional
    Timezone string `json:"timezone,omitempty"`
}

// BlackoutWindow defines a fixed time period where task creation is suspended.
type BlackoutWindow struct {
    // Start is the beginning of the blackout period (RFC3339 format).
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Format=date-time
    Start string `json:"start"`

    // End is the end of the blackout period (RFC3339 format).
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Format=date-time
    End string `json:"end"`

    // Reason is a human-readable explanation for the blackout, surfaced
    // in status conditions and events.
    // +optional
    Reason string `json:"reason,omitempty"`
}
```
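Some of the constraints described in the field comments (start and end set together, a loadable timezone, a blackout that ends after it starts) cannot be expressed by the `HH:MM` pattern and `date-time` format alone. The following is a minimal sketch of how they might be checked in Go. The helper names `validateTimeWindow` and `validateBlackoutWindow` are hypothetical, the types are trimmed local copies of the proposed ones, and whether such checks belong in a webhook, in CEL rules, or in the spawner is left open.

```go
package main

import (
    "errors"
    "fmt"
    "regexp"
    "time"
    _ "time/tzdata" // embeds the IANA database so LoadLocation works without system tzdata
)

// Trimmed local copies of the proposed API types (illustration only).
type TimeWindow struct {
    StartTime string
    EndTime   string
    Timezone  string
}

type BlackoutWindow struct {
    Start, End, Reason string
}

// hhmm accepts 00:00 through 23:59 only.
var hhmm = regexp.MustCompile(`^([01]\d|2[0-3]):[0-5]\d$`)

// validateTimeWindow is a hypothetical helper, not part of the proposal.
func validateTimeWindow(w TimeWindow) error {
    if (w.StartTime == "") != (w.EndTime == "") {
        return errors.New("startTime and endTime must be set together")
    }
    if w.StartTime != "" && (!hhmm.MatchString(w.StartTime) || !hhmm.MatchString(w.EndTime)) {
        return errors.New("startTime and endTime must be valid HH:MM values")
    }
    // An empty name loads as UTC, which matches the proposed default.
    if _, err := time.LoadLocation(w.Timezone); err != nil {
        return fmt.Errorf("invalid timezone %q: %w", w.Timezone, err)
    }
    return nil
}

// validateBlackoutWindow is a hypothetical helper, not part of the proposal.
func validateBlackoutWindow(b BlackoutWindow) error {
    start, err := time.Parse(time.RFC3339, b.Start)
    if err != nil {
        return fmt.Errorf("invalid start: %w", err)
    }
    end, err := time.Parse(time.RFC3339, b.End)
    if err != nil {
        return fmt.Errorf("invalid end: %w", err)
    }
    if !end.After(start) {
        return errors.New("end must be after start")
    }
    return nil
}

func main() {
    fmt.Println(validateTimeWindow(TimeWindow{StartTime: "09:00", Timezone: "UTC"}))
    fmt.Println(validateBlackoutWindow(BlackoutWindow{
        Start: "2026-04-12T00:00:00Z", End: "2026-04-10T00:00:00Z",
    }))
}
```

Both calls in `main` print an error: the first window sets `startTime` without `endTime`, and the second blackout ends before it starts.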

## Evaluation logic

The scheduling check evaluates as:

```
allowed = !inBlackout(now) && (noActiveWindows || inAnyActiveWindow(now))
```

Pseudocode:
```go
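// Requires the "fmt" and "time" packages.
func (p *SchedulingPolicy) IsTaskCreationAllowed(now time.Time) (bool, string) {
    // Blackout windows take absolute precedence.
    for _, bw := range p.BlackoutWindows {
        // Start and End are RFC3339 strings on the API type, so parse them first.
        start, errStart := time.Parse(time.RFC3339, bw.Start)
        end, errEnd := time.Parse(time.RFC3339, bw.End)
        if errStart != nil || errEnd != nil {
            // Assumption: malformed windows are rejected at admission time.
            // If one slips through, fail closed rather than ignore a freeze.
            return false, fmt.Sprintf("Invalid blackout window %q to %q", bw.Start, bw.End)
        }
        // Half-open interval [start, end), so the start instant itself is blocked.
        if !now.Before(start) && now.Before(end) {
            return false, fmt.Sprintf("In blackout window until %s: %s", bw.End, bw.Reason)
        }
    }

    // If no active windows are defined, allow (default open).
    if len(p.ActiveWindows) == 0 {
        return true, ""
    }

    // Check whether the current time falls within any active window.
    for _, aw := range p.ActiveWindows {
        loc, err := time.LoadLocation(aw.Timezone)
        if err != nil {
            continue // unknown timezone: this window cannot match
        }
        localNow := now.In(loc)

        if aw.matchesDay(localNow) && aw.matchesTime(localNow) {
            return true, ""
        }
    }

    return false, "Outside all active windows"
}
```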
```

## Example configurations

### 1. Business hours only (US Eastern)

Agents only create tasks during working hours, when engineers are available to review:

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: bug-fixer
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "America/New_York"
  when:
    githubIssues:
      labels: [bug, priority/important-soon]
      pollInterval: 5m
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Fix the following bug and open a PR:
      Issue #{{.Number}}: {{.Title}}
      {{.Body}}
    branch: "fix-{{.Number}}"
  maxConcurrency: 3
```

**Effect**: Issues are discovered continuously (for up-to-date status), but tasks are only created Mon-Fri 9am-6pm ET. A bug filed at 11pm on a weeknight is picked up by the first poll cycle after 9am on the next business day.

### 2. Release freeze with business hours

Combines recurring windows with a fixed blackout for an upcoming release:

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: feature-builder
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "Europe/London"
    blackoutWindows:
      - start: "2026-04-10T00:00:00Z"
        end: "2026-04-12T23:59:59Z"
        reason: "v2.0 release freeze"
  when:
    githubIssues:
      labels: [kind/feature]
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Implement: {{.Title}}
      {{.Body}}
    branch: "feature-{{.Number}}"
  maxConcurrency: 2
```

**Effect**: Normal business-hours operation, but no tasks are created during the April 10-12 release freeze. Note that the blackout bounds are given in UTC, while the active window uses London time.

### 3. Off-peak compute optimization

Run cost-heavy agent tasks during off-peak hours when cluster resources are cheaper:

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: weekly-refactor
spec:
  schedulingPolicy:
    activeWindows:
      - days: [saturday, sunday]
        startTime: "02:00"
        endTime: "08:00"
        timezone: "UTC"
  when:
    cron:
      schedule: "0 2 * * 6"  # Trigger at 2am Saturday
  taskTemplate:
    type: claude-code
    model: claude-opus-4-20250514
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Perform a comprehensive code quality review...
  maxConcurrency: 5
```

### 4. Multi-timezone team coverage

Active during business hours across two offices:

```yaml
schedulingPolicy:
  activeWindows:
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "America/New_York"
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "Asia/Tokyo"
```

**Effect**: Tasks are created whenever either office is open (OR semantics), maximizing coverage while ensuring someone is available to review.

## Implementation approach

### Spawner changes (`cmd/kelos-spawner/main.go`)

The scheduling check is inserted in the `runCycleWithSource` function after discovery/deduplication but before the task creation loop (between the current lines ~310 and ~323):

```go
// After deduplication, before task creation loop:
if ts.Spec.SchedulingPolicy != nil {
    allowed, reason := ts.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
    if !allowed {
        log.Info("Scheduling policy restricts task creation", "reason", reason)
        // Still update status with discovery count (items are known),
        // but skip task creation
        // ... update status with SchedulingRestricted condition ...
        return nil
    }
}
```

Discovery continues normally — items are counted in `status.totalDiscovered` so operators know work is queued. On the next poll cycle within an active window, items will be re-discovered and tasks created.

### Webhook handler changes (`internal/webhook/handler.go`)

The `createTask` method in the webhook handler checks the scheduling policy before creating a task:

```go
func (h *WebhookHandler) createTask(ctx context.Context, spawner *v1alpha1.TaskSpawner, ...) error {
    if spawner.Spec.SchedulingPolicy != nil {
        allowed, reason := spawner.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
        if !allowed {
            log.Info("Scheduling policy restricts webhook task creation",
                "reason", reason, "spawner", spawner.Name)
            return nil // Accept webhook but skip task creation
        }
    }
    // ... existing task creation logic ...
}
```

**Note for webhook sources**: Unlike polling sources that re-discover items, webhook events are ephemeral. A webhook received while the policy restricts creation (outside every active window or inside a blackout) is dropped and will not be replayed. This is acceptable for most webhook use cases, because a later event for the same item creates a task if it arrives while creation is allowed, but the behavior should be clearly documented.

### Status reporting

Add a `SchedulingRestricted` condition to communicate the scheduling state:

```yaml
status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "OutsideActiveWindow"
      message: "Task creation paused — next active window: Mon 09:00 America/New_York"
      lastTransitionTime: "2026-04-05T23:00:00Z"
```

Or during a blackout:

```yaml
status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "InBlackoutWindow"
      message: "Task creation paused until 2026-04-12T23:59:59Z: v2.0 release freeze"
```

The condition transitions to `status: "False"` when the window opens, providing a clear audit trail.

### Scope estimate

- **Types**: ~60 lines in `api/v1alpha1/taskspawner_types.go`
- **Evaluation logic**: ~80 lines in a new `internal/scheduling/policy.go`
- **Spawner integration**: ~15 lines in `cmd/kelos-spawner/main.go`
- **Webhook integration**: ~10 lines in `internal/webhook/handler.go`
- **Status condition**: ~20 lines in spawner status update
- **Tests**: ~200 lines (time zone edge cases, midnight wrapping, blackout precedence)
- **CRD regeneration**: `make update`
- **Total**: ~400 lines including tests

## Design decisions

### Why on TaskSpawnerSpec, not a separate CRD?

A cluster-wide `SchedulingPolicy` CRD was considered but rejected for this proposal:
- Per-spawner policies are simpler and cover the most common use cases
- Different spawners often need different schedules (bug-fix agents should run during business hours; dependency-update agents should run off-peak)
- A cluster-wide CRD can be added later as a higher-level governance layer that references per-spawner policies

### Why "discovery continues, creation pauses" instead of "spawner stops entirely"?

Continuing discovery during restricted windows keeps `status.totalDiscovered` current, so operators can see queued work via `kelos get taskspawners`. It also means no catch-up logic is needed: the first poll cycle that falls inside an active window re-discovers the pending items and creates tasks for them, so the delay after a window opens is at most one `pollInterval`.

### Why not reuse CronJob scheduling?

Kubernetes CronJobs define *when to run*, not *when to pause*. The scheduling policy here is the inverse — it defines time constraints on an always-running spawner. The two are complementary: a cron spawner with `schedule: "0 * * * *"` (every hour) could have `schedulingPolicy` to restrict which of those hours actually create tasks.

### Webhook "drop" behavior

Webhook events received during restricted windows are acknowledged (HTTP 200) but don't create tasks. This avoids the complexity of a durable event queue while being consistent with how `maxConcurrency` already drops webhook events when the limit is reached (`handler.go:311-318`). For systems where event durability matters, the source system's retry mechanism or a separate event store should be used.

## Backward compatibility

- Purely additive: new optional `schedulingPolicy` field on `TaskSpawnerSpec`
- When unset (default), behavior is identical to today — tasks created immediately
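- No behavior change for existing resources: the TaskSpawner CRD gains one optional field, and the spawner and webhook handler behave as before unless `schedulingPolicy` is set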
- XValidation on `TaskSpawnerSpec` does not need updating (scheduling policy is orthogonal to source type)
- Existing spawners continue working without modification

## Relationship to existing proposals

| Issue | Relationship |
|-------|-------------|
| `suspend` field (built-in) | Complementary. `suspend` is a manual kill switch; `schedulingPolicy` is a declarative, automatic time constraint. Both can coexist — `suspend: true` overrides any scheduling policy. |
| #788 (costBudget) | Complementary. Cost budget limits total spending; scheduling policy limits when spending occurs. A spawner can have both: "max $50/day, only during business hours." |
| #765 (Cancelled phase / obsolescencePolicy) | Complementary. Obsolescence cancels stale *running* tasks; scheduling prevents *new* task creation during restricted windows. |
| #889 (failurePolicy) | Orthogonal. Failure policy governs what happens when tasks fail; scheduling governs when tasks are created. |

## References

- `TaskSpawnerSpec`: `api/v1alpha1/taskspawner_types.go:531-571`
- `suspend` field: `api/v1alpha1/taskspawner_types.go:560-561`
- Task creation loop: `cmd/kelos-spawner/main.go:322-395`
- Status condition pattern: `cmd/kelos-spawner/main.go:413-429`
- Webhook concurrency drop precedent: `internal/webhook/handler.go:311-318`
- Go `time.LoadLocation`: used for IANA timezone parsing (stdlib, no external dependency)

/kind feature
