API: Add schedulingPolicy to TaskSpawner for time-based task creation control with active windows and blackout periods #907

@kelos-bot

🤖 Kelos Strategist Agent @gjkim42

Area: New CRDs & API Extensions

Summary

TaskSpawner has governance controls for how many tasks run (maxConcurrency, maxTotalTasks) and whether spawning is active at all (suspend), but no control over when tasks are created. This proposal adds schedulingPolicy to TaskSpawnerSpec, enabling declarative time-based constraints — recurring active windows (e.g., business hours only) and fixed blackout periods (e.g., release freezes) — so that task creation respects organizational schedules without manual intervention.

Problem

1. No time-awareness in task creation decisions

The spawner's discovery loop (cmd/kelos-spawner/main.go:322-395) creates tasks immediately when items are discovered, with only two gatekeepers: maxConcurrency and maxTotalTasks. There is no concept of "now is not a good time to create tasks."

This matters because autonomous agents create PRs, push branches, and post comments — actions with real impact on team workflows. Creating them at the wrong time introduces friction rather than reducing it.

2. suspend is manual and error-prone

The suspend field (api/v1alpha1/taskspawner_types.go:560-561) is a binary toggle that requires manual intervention:

spec:
  suspend: true   # someone must remember to set this

Real-world scheduling needs are recurring and predictable:

  • "Only run during business hours" (every weekday, 9am-6pm)
  • "Pause during the release freeze" (April 10-12)
  • "Don't create tasks on weekends"

Using suspend for these requires external automation (a CronJob that patches the TaskSpawner), custom RBAC, and coordination across multiple spawners. If someone forgets to unsuspend after a freeze, agents silently stay idle.

3. Off-hours task creation wastes money and attention

Agent tasks cost money (API tokens, compute) and produce artifacts (PRs, branches) that need human review. When agents run at 3am:

  • PRs created overnight accumulate without review, creating a morning backlog
  • Token spend occurs when no one is available to act on results
  • Failed tasks go unnoticed for hours
  • CI resources compete with overnight batch jobs

Teams running Kelos in production report wanting to align agent activity with their working hours, but the only option today is to build external tooling around suspend.

4. Release freezes require coordinated manual action

Before a release cut, platform teams typically announce a "code freeze" — no new PRs except critical fixes. With Kelos, this means:

  1. Identify all active TaskSpawners
  2. Set suspend: true on each one
  3. Remember to unsuspend after the release
  4. Hope no one creates a new unsuspended TaskSpawner during the freeze

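This is the kind of operational burden that Kubernetes handles declaratively elsewhere, for example with PodDisruptionBudgets for voluntary disruptions, and that managed platforms handle with maintenance windows. Kelos should provide a declarative equivalent.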

Proposed API

Add a schedulingPolicy field to TaskSpawnerSpec:

type TaskSpawnerSpec struct {
    When             When             `json:"when"`
    TaskTemplate     TaskTemplate     `json:"taskTemplate"`
    // ... existing fields ...

    // SchedulingPolicy defines time-based constraints on task creation.
    // Discovery continues on schedule to keep status current, but task
    // creation is deferred until the policy allows it. When unset, tasks
    // are created immediately upon discovery (current behavior).
    // +optional
    SchedulingPolicy *SchedulingPolicy `json:"schedulingPolicy,omitempty"`
}

// SchedulingPolicy controls when a TaskSpawner is allowed to create new Tasks.
type SchedulingPolicy struct {
    // ActiveWindows restricts task creation to specific recurring time windows.
    // Tasks are only created when the current time falls within at least one
    // window (OR semantics). If empty, no recurring time restriction is applied.
    // +optional
    ActiveWindows []TimeWindow `json:"activeWindows,omitempty"`

    // BlackoutWindows defines fixed periods where task creation is suspended,
    // regardless of activeWindows. Useful for release freezes, maintenance
    // periods, or planned downtime. Blackout takes precedence over active windows.
    // +optional
    BlackoutWindows []BlackoutWindow `json:"blackoutWindows,omitempty"`
}

// TimeWindow defines a recurring time window.
type TimeWindow struct {
    // Days restricts to specific days of the week. If empty, all days match.
    // +kubebuilder:validation:items:Enum=monday;tuesday;wednesday;thursday;friday;saturday;sunday
    // +optional
    Days []string `json:"days,omitempty"`

    // StartTime is the daily start time in HH:MM format (24-hour clock).
    // Required when endTime is set.
    // +kubebuilder:validation:Pattern=`^([01]\d|2[0-3]):[0-5]\d$`
    // +optional
    StartTime string `json:"startTime,omitempty"`

    // EndTime is the daily end time in HH:MM format (24-hour clock).
    // Required when startTime is set. If endTime < startTime, the window
    // wraps past midnight (e.g., startTime: "22:00", endTime: "06:00").
    // +kubebuilder:validation:Pattern=`^([01]\d|2[0-3]):[0-5]\d$`
    // +optional
    EndTime string `json:"endTime,omitempty"`

    // Timezone is the IANA timezone name (e.g., "America/New_York", "Europe/London").
    // Defaults to "UTC".
    // +kubebuilder:default="UTC"
    // +optional
    Timezone string `json:"timezone,omitempty"`
}

// BlackoutWindow defines a fixed time period where task creation is suspended.
type BlackoutWindow struct {
    // Start is the beginning of the blackout period (RFC3339 format).
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Format=date-time
    Start string `json:"start"`

    // End is the end of the blackout period (RFC3339 format).
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Format=date-time
    End string `json:"end"`

    // Reason is a human-readable explanation for the blackout, surfaced
    // in status conditions and events.
    // +optional
    Reason string `json:"reason,omitempty"`
}
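
The "required together" rule for startTime and endTime spans two fields, so a regex alone cannot express it. A minimal sketch of the checks follows, assuming a hypothetical validateTimeWindow helper and a trimmed copy of TimeWindow. Neither the helper name nor its placement is part of this proposal, and the same rule could equally be written as a CEL XValidation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// TimeWindow is a trimmed copy of the proposed API type (fields only).
type TimeWindow struct {
	Days      []string
	StartTime string
	EndTime   string
	Timezone  string
}

// validateTimeWindow is a hypothetical helper, not part of the proposal.
func validateTimeWindow(w TimeWindow) error {
	// startTime and endTime must be set together or not at all.
	if (w.StartTime == "") != (w.EndTime == "") {
		return errors.New("startTime and endTime must be set together")
	}
	for _, v := range []string{w.StartTime, w.EndTime} {
		if v == "" {
			continue
		}
		// "15:04" is Go's reference layout for a 24-hour clock time;
		// out-of-range values such as "25:00" are rejected.
		if _, err := time.Parse("15:04", v); err != nil {
			return fmt.Errorf("invalid clock time %q: %w", v, err)
		}
	}
	// An empty name resolves to UTC, which matches the documented default.
	if _, err := time.LoadLocation(w.Timezone); err != nil {
		return fmt.Errorf("invalid timezone %q: %w", w.Timezone, err)
	}
	return nil
}

func main() {
	ok := TimeWindow{StartTime: "09:00", EndTime: "18:00", Timezone: "UTC"}
	half := TimeWindow{StartTime: "09:00", Timezone: "UTC"}
	fmt.Println(validateTimeWindow(ok))
	fmt.Println(validateTimeWindow(half))
}
```

Running checks like these at admission time would surface a bad timezone or clock value when the TaskSpawner is applied, rather than during a later discovery cycle.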

Evaluation logic

The scheduling check evaluates as:

allowed = !inBlackout(now) && (noActiveWindows || inAnyActiveWindow(now))

Pseudocode:

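func (p *SchedulingPolicy) IsTaskCreationAllowed(now time.Time) (bool, string) {
    // Blackout windows take absolute precedence.
    for _, bw := range p.BlackoutWindows {
        // Start and End are RFC3339 strings in the API type, so parse them first.
        start, errStart := time.Parse(time.RFC3339, bw.Start)
        end, errEnd := time.Parse(time.RFC3339, bw.End)
        if errStart != nil || errEnd != nil {
            continue // malformed window; assumed to be rejected earlier by CRD validation
        }
        // Half-open interval [start, end), so the start instant itself is blocked.
        if !now.Before(start) && now.Before(end) {
            return false, fmt.Sprintf("In blackout window until %s: %s", bw.End, bw.Reason)
        }
    }

    // If no active windows are defined, allow (default open).
    if len(p.ActiveWindows) == 0 {
        return true, ""
    }

    // Check whether the current time falls within any active window.
    for _, aw := range p.ActiveWindows {
        loc, err := time.LoadLocation(aw.Timezone) // "" resolves to UTC
        if err != nil {
            continue // unknown timezone: this window can never match
        }
        localNow := now.In(loc)

        if aw.matchesDay(localNow) && aw.matchesTime(localNow) {
            return true, ""
        }
    }

    return false, "Outside all active windows"
}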
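
The pseudocode calls matchesDay and matchesTime without defining them. One possible shape is sketched below on a trimmed TimeWindow. Two behaviors here are assumptions rather than decisions made in this proposal: the end time is exclusive, and the day check uses the local calendar day of the evaluated instant. Under that second assumption, a window of friday 22:00 to 06:00 needs saturday listed as well to cover the hours after midnight. The at helper exists only for the example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// TimeWindow is a trimmed copy of the proposed type (timezone handled by the caller).
type TimeWindow struct {
	Days      []string
	StartTime string // "HH:MM"
	EndTime   string // "HH:MM"
}

// minutesOfDay converts "HH:MM" to minutes since midnight; ok is false on bad input.
func minutesOfDay(s string) (minutes int, ok bool) {
	t, err := time.Parse("15:04", s)
	if err != nil {
		return 0, false
	}
	return t.Hour()*60 + t.Minute(), true
}

// matchesDay reports whether t's weekday is listed; an empty list matches every day.
func (w TimeWindow) matchesDay(t time.Time) bool {
	if len(w.Days) == 0 {
		return true
	}
	today := strings.ToLower(t.Weekday().String())
	for _, d := range w.Days {
		if d == today {
			return true
		}
	}
	return false
}

// matchesTime reports whether t's clock time lies in [StartTime, EndTime).
func (w TimeWindow) matchesTime(t time.Time) bool {
	if w.StartTime == "" && w.EndTime == "" {
		return true // no time-of-day restriction
	}
	start, okStart := minutesOfDay(w.StartTime)
	end, okEnd := minutesOfDay(w.EndTime)
	if !okStart || !okEnd {
		return false
	}
	now := t.Hour()*60 + t.Minute()
	if start <= end {
		return now >= start && now < end
	}
	// endTime < startTime: the window wraps past midnight.
	return now >= start || now < end
}

// at parses an RFC3339 timestamp for the example and panics on bad input.
func at(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	night := TimeWindow{StartTime: "22:00", EndTime: "06:00"}
	fmt.Println(night.matchesTime(at("2026-04-06T23:15:00Z"))) // inside the wrapped window
	fmt.Println(night.matchesTime(at("2026-04-06T12:00:00Z"))) // outside it
}
```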

Example configurations

1. Business hours only (US Eastern)

Agents only create tasks during working hours, when engineers are available to review:

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: bug-fixer
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "America/New_York"
  when:
    githubIssues:
      labels: [bug, priority/important-soon]
      pollInterval: 5m
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Fix the following bug and open a PR:
      Issue #{{.Number}}: {{.Title}}
      {{.Body}}
    branch: "fix-{{.Number}}"
  maxConcurrency: 3

Effect: Issues are discovered continuously (for up-to-date status), but tasks are only created Mon-Fri 9am-6pm ET. A bug filed at 11pm is picked up by the first poll after 9am on the next weekday.

2. Release freeze with business hours

Combines recurring windows with a fixed blackout for an upcoming release:

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: feature-builder
spec:
  schedulingPolicy:
    activeWindows:
      - days: [monday, tuesday, wednesday, thursday, friday]
        startTime: "09:00"
        endTime: "18:00"
        timezone: "Europe/London"
    blackoutWindows:
      - start: "2026-04-10T00:00:00Z"
        end: "2026-04-12T23:59:59Z"
        reason: "v2.0 release freeze"
  when:
    githubIssues:
      labels: [kind/feature]
  taskTemplate:
    type: claude-code
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Implement: {{.Title}}
      {{.Body}}
    branch: "feature-{{.Number}}"
  maxConcurrency: 2

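Effect: Normal business hours operation, but completely silent during the April 10-12 release freeze. The blackout bounds are written in UTC, independent of the Europe/London timezone on the active window.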

3. Off-peak compute optimization

Run cost-heavy agent tasks during off-peak hours when cluster resources are cheaper:

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: weekly-refactor
spec:
  schedulingPolicy:
    activeWindows:
      - days: [saturday, sunday]
        startTime: "02:00"
        endTime: "08:00"
        timezone: "UTC"
  when:
    cron:
      schedule: "0 2 * * 6"  # Trigger at 2am Saturday
  taskTemplate:
    type: claude-code
    model: claude-opus-4-20250514
    workspaceRef:
      name: my-app
    credentials:
      type: oauth
      secretRef:
        name: claude-creds
    promptTemplate: |
      Perform a comprehensive code quality review...
  maxConcurrency: 5

4. Multi-timezone team coverage

Active during business hours across two offices:

schedulingPolicy:
  activeWindows:
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "America/New_York"
    - days: [monday, tuesday, wednesday, thursday, friday]
      startTime: "09:00"
      endTime: "18:00"
      timezone: "Asia/Tokyo"

Effect: Tasks are created whenever either office is open (OR semantics), maximizing coverage while ensuring someone is available to review.
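
To make the OR semantics concrete, the sketch below lists which UTC hours of one sample day are covered by the two offices. It uses a simplified, hypothetical office type instead of the proposed TimeWindow, and it assumes fixed offsets (New York at UTC-4, Tokyo at UTC+9, which holds on the sample date) so that it does not depend on a timezone database.

```go
package main

import (
	"fmt"
	"time"
)

// office is a simplified, hypothetical stand-in for one activeWindows entry
// (weekdays only, whole-hour bounds).
type office struct {
	loc        *time.Location
	start, end int // local hours, interval [start, end)
}

// open reports whether t falls on a weekday within the office's local hours.
func (o office) open(t time.Time) bool {
	l := t.In(o.loc)
	wd := l.Weekday()
	return wd != time.Saturday && wd != time.Sunday && l.Hour() >= o.start && l.Hour() < o.end
}

// Fixed offsets are an assumption of this sketch, valid for the sample date only.
var offices = []office{
	{time.FixedZone("EDT", -4*3600), 9, 18},
	{time.FixedZone("JST", 9*3600), 9, 18},
}

// coveredHoursUTC returns the UTC hours of Wednesday 2026-04-08 in which
// at least one office is open (OR semantics).
func coveredHoursUTC() []int {
	var hours []int
	for h := 0; h < 24; h++ {
		t := time.Date(2026, time.April, 8, h, 0, 0, 0, time.UTC)
		for _, o := range offices {
			if o.open(t) {
				hours = append(hours, h)
				break
			}
		}
	}
	return hours
}

func main() {
	fmt.Println(coveredHoursUTC())
}
```

On that date the two windows cover 00:00 to 09:00 UTC (Tokyo) and 13:00 to 22:00 UTC (New York), which leaves gaps from 09:00 to 13:00 and from 22:00 to midnight UTC in which no tasks are created.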

Implementation approach

Spawner changes (cmd/kelos-spawner/main.go)

The scheduling check is inserted in the runCycleWithSource function after discovery/deduplication but before the task creation loop (between the current lines ~310 and ~323):

// After deduplication, before task creation loop:
if ts.Spec.SchedulingPolicy != nil {
    allowed, reason := ts.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
    if !allowed {
        log.Info("Scheduling policy restricts task creation", "reason", reason)
        // Still update status with discovery count (items are known),
        // but skip task creation
        // ... update status with SchedulingRestricted condition ...
        return nil
    }
}

Discovery continues normally — items are counted in status.totalDiscovered so operators know work is queued. On the next poll cycle within an active window, items will be re-discovered and tasks created.

Webhook handler changes (internal/webhook/handler.go)

The createTask method in the webhook handler checks the scheduling policy before creating a task:

func (h *WebhookHandler) createTask(ctx context.Context, spawner *v1alpha1.TaskSpawner, ...) error {
    if spawner.Spec.SchedulingPolicy != nil {
        allowed, reason := spawner.Spec.SchedulingPolicy.IsTaskCreationAllowed(time.Now())
        if !allowed {
            log.Info("Scheduling policy restricts webhook task creation",
                "reason", reason, "spawner", spawner.Name)
            return nil // Accept webhook but skip task creation
        }
    }
    // ... existing task creation logic ...
}

Note for webhook sources: Unlike polling sources that re-discover items, webhook events are ephemeral. A webhook received while task creation is restricted (during a blackout or outside every active window) is dropped and will not be replayed. This is acceptable for most webhook use cases, because a later event for the same item that arrives once the window is open creates the task, but it should be clearly documented.
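
The acknowledge-but-skip behavior can be exercised in isolation. The sketch below is a hypothetical stand-in for the real handler: newWebhookHandler, its two callbacks, and deliver are illustration-only names, and the policy check is injected as a function so the example needs nothing beyond the standard library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newWebhookHandler is a hypothetical stand-in for the real webhook handler.
// allowed plays the role of IsTaskCreationAllowed; create plays the role of
// the existing task creation logic.
func newWebhookHandler(allowed func() (bool, string), create func()) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if ok, reason := allowed(); !ok {
			// Acknowledge the event so the sender does not retry, but create nothing.
			fmt.Println("skipping task creation:", reason)
			w.WriteHeader(http.StatusOK)
			return
		}
		create()
		w.WriteHeader(http.StatusOK)
	})
}

// deliver sends one fake event and reports the HTTP status and tasks created.
func deliver(allowed bool) (status int, created int) {
	h := newWebhookHandler(
		func() (bool, string) { return allowed, "InBlackoutWindow" },
		func() { created++ },
	)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/webhook", nil))
	return rec.Code, created
}

func main() {
	status, created := deliver(false)
	fmt.Println(status, created)
}
```

From the sender's side both outcomes look identical, which is why the drop has to be visible through logs or the spawner's status rather than through the HTTP response.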

Status reporting

Add a SchedulingRestricted condition to communicate the scheduling state:

status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "OutsideActiveWindow"
      message: "Task creation paused — next active window: Mon 09:00 America/New_York"
      lastTransitionTime: "2026-04-05T23:00:00Z"

Or during a blackout:

status:
  conditions:
    - type: SchedulingRestricted
      status: "True"
      reason: "InBlackoutWindow"
      message: "Task creation paused until 2026-04-12T23:59:59Z: v2.0 release freeze"

The condition transitions to status: "False" when the window opens, providing a clear audit trail.
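
The "next active window" text in the condition message has to be computed from the policy. A minimal sketch of that computation for a single window follows. Window is a simplified, hypothetical form of TimeWindow with its strings already parsed, and nextWindowStart, describeNext, and businessUTC are illustration-only names. Unlike the proposed API, an empty Days map here matches no day.

```go
package main

import (
	"fmt"
	"time"
)

// Window is a simplified, hypothetical form of the proposed TimeWindow.
type Window struct {
	Days      map[time.Weekday]bool
	StartHour int
	StartMin  int
	Loc       *time.Location
}

// nextWindowStart returns the first window start strictly after now.
func nextWindowStart(w Window, now time.Time) (time.Time, bool) {
	local := now.In(w.Loc)
	// Scanning today plus the next 7 days visits every weekday at least once.
	for i := 0; i <= 7; i++ {
		day := local.AddDate(0, 0, i)
		cand := time.Date(day.Year(), day.Month(), day.Day(), w.StartHour, w.StartMin, 0, 0, w.Loc)
		if cand.After(now) && w.Days[cand.Weekday()] {
			return cand, true
		}
	}
	return time.Time{}, false
}

// businessUTC is a Mon-Fri window starting at 09:00 UTC, used for the example.
var businessUTC = Window{
	Days: map[time.Weekday]bool{
		time.Monday: true, time.Tuesday: true, time.Wednesday: true,
		time.Thursday: true, time.Friday: true,
	},
	StartHour: 9,
	Loc:       time.UTC,
}

// describeNext formats the next start in the style of the condition message.
func describeNext(w Window, nowRFC3339 string) string {
	now, err := time.Parse(time.RFC3339, nowRFC3339)
	if err != nil {
		return "invalid time"
	}
	next, ok := nextWindowStart(w, now)
	if !ok {
		return "no upcoming window"
	}
	return next.Format("Mon 15:04") + " " + w.Loc.String()
}

func main() {
	// Sunday night: the next start is Monday morning.
	fmt.Println(describeNext(businessUTC, "2026-04-05T23:00:00Z"))
}
```

With several active windows, the message would use the earliest result across them, and a blackout that overlaps that start would push the real resume time later. The sketch leaves both cases out.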

Scope estimate

  • Types: ~60 lines in api/v1alpha1/taskspawner_types.go
  • Evaluation logic: ~80 lines in a new internal/scheduling/policy.go
  • Spawner integration: ~15 lines in cmd/kelos-spawner/main.go
  • Webhook integration: ~10 lines in internal/webhook/handler.go
  • Status condition: ~20 lines in spawner status update
  • Tests: ~200 lines (time zone edge cases, midnight wrapping, blackout precedence)
  • CRD regeneration: make update
  • Total: ~400 lines including tests

Design decisions

Why on TaskSpawnerSpec, not a separate CRD?

A cluster-wide SchedulingPolicy CRD was considered but rejected for this proposal:

  • Per-spawner policies are simpler and cover 90% of use cases
  • Different spawners often need different schedules (bug-fix agents should run during business hours; dependency-update agents should run off-peak)
  • A cluster-wide CRD can be added later as a higher-level governance layer that references per-spawner policies

Why "discovery continues, creation pauses" instead of "spawner stops entirely"?

Continuing discovery during restricted windows keeps status.totalDiscovered current, so operators can see queued work via kelos get taskspawners. It also means the spawner is already polling when a window opens, so the first poll cycle inside the window creates tasks without a restart or a manual unsuspend.

Why not reuse CronJob scheduling?

Kubernetes CronJobs define when to run, not when to pause. The scheduling policy here is the inverse — it defines time constraints on an always-running spawner. The two are complementary: a cron spawner with schedule: "0 * * * *" (every hour) could have schedulingPolicy to restrict which of those hours actually create tasks.

Webhook "drop" behavior

Webhook events received during restricted windows are acknowledged (HTTP 200) but don't create tasks. This avoids the complexity of a durable event queue while being consistent with how maxConcurrency already drops webhook events when the limit is reached (handler.go:311-318). For systems where event durability matters, the source system's retry mechanism or a separate event store should be used.

Backward compatibility

  • Purely additive: new optional schedulingPolicy field on TaskSpawnerSpec
  • When unset (default), behavior is identical to today — tasks created immediately
  • No changes to existing CRD fields, and no change to controller or webhook behavior when schedulingPolicy is unset
  • XValidation on TaskSpawnerSpec does not need updating (scheduling policy is orthogonal to source type)
  • Existing spawners continue working without modification

Relationship to existing proposals

| Issue | Relationship |
| --- | --- |
| suspend field (built-in) | Complementary. suspend is a manual kill switch; schedulingPolicy is a declarative, automatic time constraint. Both can coexist; suspend: true overrides any scheduling policy. |
| #788 (costBudget) | Complementary. Cost budget limits total spending; scheduling policy limits when spending occurs. A spawner can have both: "max $50/day, only during business hours." |
| #765 (Cancelled phase / obsolescencePolicy) | Complementary. Obsolescence cancels stale running tasks; scheduling prevents new task creation during restricted windows. |
| #889 (failurePolicy) | Orthogonal. Failure policy governs what happens when tasks fail; scheduling governs when tasks are created. |
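
The precedence between suspend and the scheduling policy can be written as a single boolean rule. The sketch below layers suspend on top of the evaluation formula from this proposal. creationAllowed is a hypothetical helper for illustration, not an existing function in the codebase.

```go
package main

import "fmt"

// creationAllowed is a hypothetical helper: suspend wins over everything,
// then blackout windows, then active windows (default open when none are set).
func creationAllowed(suspend, inBlackout, hasActiveWindows, inAnyActiveWindow bool) bool {
	if suspend {
		return false
	}
	return !inBlackout && (!hasActiveWindows || inAnyActiveWindow)
}

func main() {
	fmt.Println(creationAllowed(true, false, true, true))   // suspended: false
	fmt.Println(creationAllowed(false, true, true, true))   // in blackout: false
	fmt.Println(creationAllowed(false, false, true, false)) // outside windows: false
	fmt.Println(creationAllowed(false, false, false, false)) // no policy limits: true
}
```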

References

  • TaskSpawnerSpec: api/v1alpha1/taskspawner_types.go:531-571
  • suspend field: api/v1alpha1/taskspawner_types.go:560-561
  • Task creation loop: cmd/kelos-spawner/main.go:322-395
  • Status condition pattern: cmd/kelos-spawner/main.go:413-429
  • Webhook concurrency drop precedent: internal/webhook/handler.go:311-318
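  • Go time.LoadLocation: used for IANA timezone parsing (stdlib, no external dependency); it needs a timezone database at runtime, either shipped in the container image or embedded by importing time/tzdata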

/kind feature
