Skip to content

fix(proxy): stop pool self-failover spam and operator-park stranding#44

Merged
nnemirovsky merged 1 commit into
mainfrom
fix/pool-failover-stranding-and-self-failover-spam
May 17, 2026
Merged

fix(proxy): stop pool self-failover spam and operator-park stranding#44
nnemirovsky merged 1 commit into
mainfrom
fix/pool-failover-stranding-and-self-failover-spam

Conversation

@nnemirovsky
Copy link
Copy Markdown
Owner

What

Live operation of the credential pool surfaced two real failover defects (not v0.16.0-shippable as-is). Both are fixed here with fail-before/pass-after tests.

Symptom (observed in production)

Telegram showed six identical pool openai_pool failed over openai_oauth_2 -> openai_oauth_2 (429) notices and the agent hard-failed (API call failed after 3 retries: HTTP 429) when the active pooled member was exhausted and the only other member was parked.

Root causes

  1. Operator-park stranding. sluice pool rotate parks the displaced member with reason "manual rotate" (a 300s cooldown). That member is healthy, just operator-deprioritized. When the rotated-to member then 429s, every member is cooling, and ResolveActive's all-cooled degrade picked the soonest-recovering by time — the genuinely-failed member (60s 429 TTL < 300s park) — so it degraded back onto the exhausted account and self-looped.
  2. Self-failover spam. With no distinct target, handlePoolFailover emitted a meaningless from -> from cred_failover plus one audit row + one Telegram notice per agent retry.

Fixes

  • vault.ManualRotateReason constant (shared with cmd/sluice rotate). ResolveActive now prefers an operator-parked-but-healthy member over a genuinely-failed one when every member is cooling — a pool rotate onto an exhausted member now fails over to the healthy peer instead of self-looping.
  • handlePoolFailover detects to == from as pool exhaustion: distinct pool_exhausted audit action + an honest "pool exhausted" operator notice instead of a self-referential cred_failover. FailoverEvent.Exhausted carries the distinction; the cooldown is still persisted.
  • shouldEmitPoolNotice dedups identical (pool,from,to,tag) signals within a 30s window so a retry storm yields one row, not N.

Normal position-order selection still skips a manual-rotate-parked member (rotate semantics unchanged); the preference only applies to the all-cooled degrade/failover path.

Tests

  • vault: degrade prefers manual-rotate-parked-but-healthy over genuinely-failed-soonest; unchanged when no manual-rotate park exists.
  • proxy: to==from emits pool_exhausted (not cred_failover), Exhausted=true, and the retry storm collapses to exactly one audit row + one onFailover; a real failover to an operator-parked-but-healthy peer is cred_failover with Exhausted=false.

go build/go vet/go vet -tags=e2e/golangci-lint clean; full internal/proxy + internal/vault + cmd suites (1230 tests) + -race subset green.

Not included / follow-up

The third live-found issue — sluice's pooled token-host phantom expansion corrupting in-container device_code OAuth logins (codex --device-auth 400s because the grant hits the pool's shared auth.openai.com/oauth/token) — is a separate subsystem and tracked separately; it does not gate this failover fix.

No release is cut by this PR (per request: hold until re-verified live on the deployment).

A pooled member that returned 429/401 while every other member was
cooling degraded back to itself: handlePoolFailover emitted a
meaningless "X -> X" cred_failover plus one Telegram notice + audit
row per agent retry (observed live as six identical "failed over
X -> X (429)" notices), and the agent hard-failed because there was
no real failover target.

Root causes:
- ResolveActive's all-cooled degrade picked the soonest-recovering
  member by time. `sluice pool rotate` parks the displaced member with
  reason "manual rotate" (300s) -- that member is healthy, just
  operator-deprioritized -- while a genuine 429 cools for 60s. So a
  rotate onto an exhausted account degraded back to the exhausted
  account instead of the parked-but-healthy peer, self-looping.
- No dedup: an agent retry storm produced one notice/audit per request.

Fixes:
- vault.ManualRotateReason constant (shared with cmd/sluice rotate).
  ResolveActive now prefers an operator-parked-but-healthy member over
  a genuinely-failed one when every member is cooling, so a rotate onto
  an exhausted member fails over to the healthy peer.
- handlePoolFailover detects to==from (no distinct target) as pool
  EXHAUSTION: emits the distinct pool_exhausted audit action and an
  honest "pool exhausted" operator notice instead of a self-referential
  cred_failover. FailoverEvent.Exhausted carries the distinction.
- shouldEmitPoolNotice dedups identical (pool,from,to,tag) signals
  within a 30s window so a retry storm yields one row, not N.

Tests: fail-before/pass-after for the degrade preference, the
pool-exhausted self-failover suppression + dedup, and the real
failover to an operator-parked-but-healthy peer.
@nnemirovsky nnemirovsky merged commit 70fcff3 into main May 17, 2026
6 checks passed
@nnemirovsky nnemirovsky deleted the fix/pool-failover-stranding-and-self-failover-spam branch May 17, 2026 02:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant