Skip to content

Implement per-group workloops and queue-aware claiming #1289

@stuartc

Description

@stuartc

Summary

Replace the worker's single workloop with independent per-group workloops.
Each slot group gets its own workloop that claims with its queue preferences,
tracks its own capacity, and backs off independently.

Background

Currently the worker has one workloop that calls claim() with
{ demand: 1, worker_name }. After this change, each slot group runs its own
workloop that sends { demand: 1, worker_name, queues: [...] }.

A worker configured with --queues "fast_lane:1 manual,*:4" will have:

  • One workloop claiming ["fast_lane"] with 1 slot
  • One workloop claiming ["manual", "*"] with 4 slots

Per-group capacity tracking

Each slot group independently checks its own capacity:

if (group.activeRuns + group.pendingClaims >= group.maxSlots) → stop workloop

The global app.workflows map continues to track all active runs. The
per-group tracking determines which group's workloop to resume when a run
completes.

When a run is claimed by a slot group, it is associated with that group for
its lifetime. On completion, the group's activeRuns decrements and its
workloop resumes if it was at capacity.

Claim protocol

Each workloop sends its slot group's queue preference in the claim message:

channel.push("claim", {
  demand: 1,
  worker_name: "pod-1",
  queues: ["fast_lane"]           // pinned slot
})

channel.push("claim", {
  demand: 1,
  worker_name: "pod-1",
  queues: ["manual", "*"]         // preference slot
})

Lightning interprets this as:

  • No * → filter mode: only return runs from the named queues
  • With * → preference mode: all runs, ordered by preference position

If Lightning hasn't been updated yet, it will ignore the queues field and
return runs as before (Phoenix channels ignore unknown payload keys).

work-available handling

When Lightning pushes work-available, attempt to claim for each slot
group that has free capacity
. This means a single work-available event
may trigger multiple claim requests (one per group with free slots).

This is acceptable because:

  • Most deployments will have 2-3 slot groups at most
  • Claim requests are lightweight channel pushes
  • Lightning handles concurrent claims via FOR UPDATE SKIP LOCKED

Join payload

Update the channel join to include queue configuration:

channel.join("worker:queue", {
  capacity: 5,
  queues: { "fast_lane": 1, "manual,*": 4 }  // informational
})

The queues map is informational — Lightning doesn't use it for routing in
this release. It enables future presence-based queue tracking. capacity
remains the total for backward compatibility.

Backoff considerations

Fast-lane slots with strict pinning (["fast_lane"], no wildcard) will
frequently get empty responses when there's no sync work. The existing backoff
(1s min → 10s max) handles this fine because:

  • work-available pushes bypass backoff for immediate claiming
  • A 1-10s backoff on an idle fast-lane slot has negligible cost

Things to watch out for

  • The refactor of server.ts / workloop management is the most complex part —
    the single workloop assumption is likely baked into several places
  • Make sure resumeWorkloop only resumes the correct group's workloop, not all
  • The run-to-group association needs to survive the full run lifecycle
  • Test the edge case where a fast_lane group is idle (getting empty claims)
    while the default group is fully busy — they should not interfere with each
    other

Testing

  • Each slot group runs its own independent workloop
  • A slot group at capacity stops its workloop
  • Run completion resumes the correct group's workloop (not all groups)
  • Claim messages include the correct queues array per slot group
  • work-available triggers claims on all groups with free slots
  • Fast_lane group backs off independently when getting empty responses
  • Default group claims normally while fast_lane is idle
  • Join payload includes queues map
  • Backward compat: single manual,*:5 group behaves like old capacity-5

Metadata

Metadata

Assignees

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions