51 commits
9a687c2
feat(channel): coalesce duplicate approval prompts by dest:port
nnemirovsky May 15, 2026
e7bde0e
wip(proxy): checkpoint persist-once + store work before respawn
nnemirovsky May 15, 2026
b76a0f3
test(proxy): assert concurrent same-target asks coalesce to one prompt
nnemirovsky May 15, 2026
4d3b326
feat(store): add credential pool + health schema and store API
nnemirovsky May 15, 2026
5fb5657
feat(vault): add PoolResolver pool->active-member chokepoint
nnemirovsky May 15, 2026
a1a93e3
feat(cli): add pool subcommands and credential/pool namespace guards
nnemirovsky May 15, 2026
800d36f
feat(proxy): wire PoolResolver into server, addon and reloadAll
nnemirovsky May 15, 2026
b68cd7c
feat(mcp): opt MCP tool calls out of approval coalescing
nnemirovsky May 15, 2026
185a382
wip(telegram): checkpoint final-count edit before respawn
nnemirovsky May 15, 2026
2556b0f
merge: approval-coalescing work
nnemirovsky May 15, 2026
4ccea7b
merge: credential-pool-failover work
nnemirovsky May 15, 2026
a3602d6
fix(telegram): use coalesced-count label in resolve edit body
nnemirovsky May 15, 2026
d2f5360
docs(plans): convert tasks/phases to exec checkboxes; mark completed …
nnemirovsky May 15, 2026
0b0554b
feat(channel): re-confirm broker coalescing tests green
nnemirovsky May 15, 2026
8948f2b
feat(store): idempotent approval-rule persist
nnemirovsky May 15, 2026
8e265fd
test(proxy): confirm full suite green after coalescing merge
nnemirovsky May 15, 2026
2d945de
feat(telegram): render coalesced count on resolve and cancel edits
nnemirovsky May 15, 2026
5aff5ff
docs(plans): complete approval-coalescing; move to completed
nnemirovsky May 15, 2026
ef05ee9
test(store): confirm credential-pool Phase 0 green post-merge
nnemirovsky May 15, 2026
864420c
feat(proxy): pool phantom indirection + R1 attribution + R3 stable JWT
nnemirovsky May 15, 2026
1187d7a
feat(proxy): auto-failover on 429/401 for credential pools
nnemirovsky May 15, 2026
54a64bd
docs(plans): complete credential-pool-failover; move to completed
nnemirovsky May 15, 2026
3a4104d
fix(proxy): address comprehensive review findings
nnemirovsky May 15, 2026
8823362
fix(proxy): correct token-endpoint failover member attribution + hard…
nnemirovsky May 16, 2026
22de0ce
style: satisfy golangci-lint (errorlint, QF1002, unparam)
nnemirovsky May 16, 2026
fe64664
test(e2e): pool failover + approval coalescing end-to-end
nnemirovsky May 16, 2026
62d4704
fix(proxy): address Copilot review (per-request failover attribution,…
nnemirovsky May 16, 2026
52689ff
fix: address Copilot re-review (failover callback registration, cred-…
nnemirovsky May 16, 2026
0176068
fix(proxy): split-host pool OAuth refresh attribution + protocol-awar…
nnemirovsky May 16, 2026
914bc09
fix(proxy): classify token-endpoint 403 invalid_grant as auth-failure…
nnemirovsky May 16, 2026
a6b24e8
fix(vault): monotonic cooldown (in-memory + durable); doc pool phanto…
nnemirovsky May 16, 2026
b4218a6
fix(proxy): fail-closed unattributed pooled refresh; broker resolve/d…
nnemirovsky May 16, 2026
a560bfd
fix(cli): reject pool removal while bindings still reference the pool
nnemirovsky May 16, 2026
661e51e
fix(store): enforce namespace + pool-member + health-row invariants a…
nnemirovsky May 16, 2026
dcaa6f3
fix(proxy): pool-namespace covered-set; shared-health prune for non-m…
nnemirovsky May 16, 2026
509ca44
fix(store): guard CAS credential-rollback for pool members + health c…
nnemirovsky May 16, 2026
fcb2fbf
fix(store): RemovePool health cleanup; reject live-pool-member meta d…
nnemirovsky May 16, 2026
9ac52f1
fix(proxy): API-host failover requires per-flow pool-usage evidence (…
nnemirovsky May 16, 2026
b2daf4a
fix(cred): validate vault before store removal; cap coalesced subscri…
nnemirovsky May 16, 2026
c9bacaf
fix(store): failover health write no-ops for non-pool-member (atomic;…
nnemirovsky May 16, 2026
e2a2696
fix: guard pool-rotate health write; atomic REST cred removal; gate M…
nnemirovsky May 16, 2026
a0a5322
fix(proxy): free flow tag after Response; tag token-host flow only on…
nnemirovsky May 16, 2026
ae45808
fix: RemoveCredentialFully cleans health on partial-cleanup finish; e…
nnemirovsky May 16, 2026
cf11d86
fix: pool+epoch-scoped health guards; CancelAll lost-wakeup; pooled-J…
nnemirovsky May 16, 2026
d893823
fix(proxy): plain-cred refresh attribution on shared token URL; pool-…
nnemirovsky May 16, 2026
19a55f3
fix: R3-stable QUIC pool phantom; pool+epoch in refresh attribution; …
nnemirovsky May 16, 2026
04a83a6
fix(store): binding creation requires live credential or pool; CLAUDE…
nnemirovsky May 16, 2026
1c47635
test(telegram): deterministically sync cancel-edit assertion (deflake…
nnemirovsky May 16, 2026
0db59db
fix: Telegram cred_meta on all adds; QUIC R3 comments; typed 409-vs-5…
nnemirovsky May 16, 2026
fb64ef2
fix(telegram): roll back vault secret if cred-add metadata/binding fa…
nnemirovsky May 16, 2026
573b931
fix(telegram): roll back credential_meta (CAS) as well as vault on en…
nnemirovsky May 16, 2026
42 changes: 41 additions & 1 deletion CLAUDE.md

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions README.md
@@ -286,6 +286,31 @@ github_pat static api.github.com

**Supported response formats:** Both `application/json` and `application/x-www-form-urlencoded` token responses per RFC 6749.

## Credential Pools

A credential pool lets the single phantom identity the agent sees be backed by **N real OAuth credentials**; sluice automatically fails over to the next member when the upstream rejects the active one. Primary use case: two OpenAI Codex OAuth accounts driven by one agent, so quota exhaustion on one account transparently rolls onto the other. The agent always holds one pool-scoped phantom pair that is byte-stable across member switches. The **access** phantom is a synthetic JWT (HS256, `sub: sluice-pool:<name>`, `iss: sluice-phantom`, far-future `exp`) that is byte-identical for a given pool regardless of which member is active, so a cross-member failover never changes the access token the agent holds. The **refresh** phantom is the static string `SLUICE_PHANTOM:<pool>.refresh`. Sluice maps the pair to the currently active member's real tokens at injection time and persists refreshed tokens back to the member that issued them.

```bash
sluice pool create <name> --members credA,credB[,credC] [--strategy failover]
sluice pool list
sluice pool status <name>
sluice pool rotate <name> # operator override: force next member
sluice pool remove <name>
```

Members are existing OAuth credentials (static credentials are rejected). Member order is the failover order.

**Auto-failover behavior:**

- HTTP 429, or 403 with a quota-exhaustion body -> the active member is treated as rate-limited and cooled down for **60s**.
- HTTP 401, or a token-endpoint body of `invalid_grant` / `invalid_token` -> the active member's token is treated as rejected and the member is cooled down for **300s**.
- 2xx, 5xx, and any other status -> no-op (a server-side error is not evidence the account is exhausted).
- The active-member switch is **synchronous**: the cooldown is recorded in memory before the response returns, so the very next request injects the next member. The durable store write only reconciles for restarts.
- **No in-flight retry**: the triggering request still returns its own upstream error to the agent; the agent's own retry resolves to the freshly-activated next member.
- Every failover emits a `cred_failover` audit event (`Reason = "<pool>:<from>-><to>:<tag>"`) and a best-effort Telegram notice.
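
The rules in the list above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not sluice's actual code: `classify`, `pool`, `observe`, the tag strings `rate_limited` / `auth_failure`, the `"quota"` substring check, and the "first member not in cooldown is active" selection are all hypothetical. Only the status codes, the 60s/300s cooldowns, and the `Reason` format come from the text.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// Cooldowns as stated in the README text.
const (
	rateLimitCooldown = 60 * time.Second
	authFailCooldown  = 300 * time.Second
)

// classify maps an upstream response to a failover tag and a cooldown.
// The tag strings and the "quota" substring check are hypothetical.
func classify(status int, tokenEndpoint bool, body string) (string, time.Duration) {
	clientErr := status >= 400 && status < 500
	switch {
	case status == http.StatusTooManyRequests,
		status == http.StatusForbidden && strings.Contains(body, "quota"):
		return "rate_limited", rateLimitCooldown
	case status == http.StatusUnauthorized,
		tokenEndpoint && clientErr &&
			(strings.Contains(body, "invalid_grant") || strings.Contains(body, "invalid_token")):
		return "auth_failure", authFailCooldown
	default:
		// 2xx, 5xx and anything else: not evidence the account is exhausted.
		return "", 0
	}
}

// pool is a hypothetical in-memory view of a credential pool.
type pool struct {
	name     string
	members  []string             // member order is the failover order
	cooldown map[string]time.Time // member -> cooled down until
}

// active returns the first member that is not cooling down at now.
// Falling back to the first member when all are cooling is an assumption.
func (p *pool) active(now time.Time) string {
	for _, m := range p.members {
		if !now.Before(p.cooldown[m]) {
			return m
		}
	}
	return p.members[0]
}

// observe records the cooldown synchronously and returns the audit Reason
// ("<pool>:<from>-><to>:<tag>"), or "" when no failover is triggered.
// The triggering request is not retried here.
func (p *pool) observe(now time.Time, status int, tokenEndpoint bool, body string) string {
	tag, d := classify(status, tokenEndpoint, body)
	if tag == "" {
		return ""
	}
	from := p.active(now)
	p.cooldown[from] = now.Add(d)
	to := p.active(now)
	return fmt.Sprintf("%s:%s->%s:%s", p.name, from, to, tag)
}

// scenario replays a few responses against a two-member pool.
func scenario() []string {
	p := &pool{name: "codex", members: []string{"credA", "credB"}, cooldown: map[string]time.Time{}}
	t0 := time.Unix(0, 0)
	return []string{
		p.observe(t0, 500, false, ""),      // server error: no-op
		p.observe(t0, 429, false, ""),      // credA is cooled down
		p.active(t0.Add(30 * time.Second)), // next request uses credB
	}
}

func main() {
	for _, line := range scenario() {
		fmt.Printf("%q\n", line)
	}
}
```

Because the cooldown is written before `observe` returns, the very next call to `active` already resolves to the next member, which mirrors the synchronous switch described above.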

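The phantom access token is **byte-identical across a member switch** (pooled OAuth credentials use a pool-keyed synthetic JWT re-sign), so the token the agent holds never changes during a rollover; the only thing the agent sees is the upstream error on the triggering request.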
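
To make the byte-stability claim concrete, here is a minimal sketch of a pool-stable HS256 phantom built with the Go standard library. It is not sluice's implementation: `poolPhantomJWT`, `refreshPhantom`, the `claims` struct, the fixed `farFutureExp` value, and the way the signing secret is supplied are assumptions made for illustration. The point it shows is that nothing about the active member enters the token, and the expiry is a constant rather than derived from the current time, so the output depends only on the pool name and the key.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// claims holds the fields named in the README. Struct field order gives
// json.Marshal a deterministic byte layout.
type claims struct {
	Sub string `json:"sub"`
	Iss string `json:"iss"`
	Exp int64  `json:"exp"`
}

// farFutureExp is a hypothetical fixed expiry (2100-01-01T00:00:00Z).
// It must be a constant: deriving it from time.Now would change the bytes.
const farFutureExp int64 = 4102444800

// poolPhantomJWT builds a compact HS256 JWT whose bytes depend only on
// the secret and the pool name, never on the active member.
func poolPhantomJWT(secret []byte, poolName string) (string, error) {
	enc := base64.RawURLEncoding
	header := enc.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	payload, err := json.Marshal(claims{
		Sub: "sluice-pool:" + poolName,
		Iss: "sluice-phantom",
		Exp: farFutureExp,
	})
	if err != nil {
		return "", err
	}
	signingInput := header + "." + enc.EncodeToString(payload)
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

// refreshPhantom returns the static refresh phantom for a pool.
func refreshPhantom(poolName string) string {
	return "SLUICE_PHANTOM:" + poolName + ".refresh"
}

func main() {
	secret := []byte("example-only-secret") // placeholder key for the demo
	before, err := poolPhantomJWT(secret, "codex")
	if err != nil {
		panic(err)
	}
	// Simulate re-issuing the phantom after a member switch.
	after, err := poolPhantomJWT(secret, "codex")
	if err != nil {
		panic(err)
	}
	fmt.Println("stable across switch:", before == after)
	fmt.Println(refreshPhantom("codex"))
}
```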

## Approval Channels

Sluice broadcasts "ask" verdicts to all configured approval channels. The first channel to respond wins. Other channels get a cancellation notice.
10 changes: 10 additions & 0 deletions cmd/sluice/binding_test.go
@@ -27,6 +27,16 @@ func setupBindingDB(t *testing.T) string {
if err != nil {
t.Fatalf("create test DB: %v", err)
}
// Seed the credentials these tests bind to. AddBinding /
// AddRuleAndBinding now require the referenced credential (or pool) to
// exist, mirroring the real flow where "sluice cred add" creates the
// credential before "sluice binding add" binds to it. Without this, the
// binding-CLI tests would be exercising an impossible state.
for _, c := range []string{"mycred", "cred_a", "cred_b"} {
if err := db.AddCredentialMeta(c, "static", ""); err != nil {
t.Fatalf("seed credential meta %q: %v", c, err)
}
}
_ = db.Close()
return dbPath
}
139 changes: 86 additions & 53 deletions cmd/sluice/cred.go
@@ -227,6 +227,15 @@ func handleCredAdd(args []string) error {
}
defer func() { _ = db.Close() }()

// Namespace mutual-exclusion: a credential must not shadow a pool. Pool
// and credential names share one namespace so a bound destination
// resolves unambiguously to either a pool or a plain credential.
if exists, perr := db.PoolExists(name); perr != nil {
return fmt.Errorf("check pool name collision: %w", perr)
} else if exists {
return fmt.Errorf("name %q is already a credential pool; pool and credential names share one namespace", name)
}
Comment thread
nnemirovsky marked this conversation as resolved.

// Inputs validated and DB is open. Now persist the credential.
vs, err := openVaultStore(*dbPath)
if err != nil {
@@ -553,13 +562,85 @@ func handleCredRemove(args []string) error {
}
name := fs.Arg(0)

// Removal order (Finding 1, round-13 + Finding 3, round-9):
//
// 1. Open/validate the vault store FIRST (no delete yet -- just
// confirm it opens). If the configured backend cannot be opened
// (e.g. a non-age provider unsupported by the CLI), abort BEFORE
// any metadata is removed. Doing the store removal first and then
// discovering the vault is unopenable would leave credential_meta
// gone while the vault secret + bindings/rules are orphaned.
//
// 2. Run the store-layer pool-membership gate (RemoveCredentialFully).
// This is the atomic, fail-closed guard: it refuses inside its own
// transaction if the credential is still a live pool member,
// closing the TOCTOU window where a separate pre-check passes and a
// concurrent caller then creates a pool with this credential before
// the vault secret is deleted.
//
// 3. Only call vs.Remove AFTER that gate succeeds. If the gate
// refuses, the vault secret is left untouched and no window exists
// where the secret is gone but credential_pool_members still
// references it.
//
// (1) precedes (2) so an unopenable vault aborts before any metadata is
// removed; (2) still precedes (3) so the store gate always runs before
// the actual secret delete.
vs, err := openVaultStore(*dbPath)
if err != nil {
return err
}

-// Remove from vault. If already gone (previous partial cleanup),
-// continue to DB cleanup so stale rules/bindings can be removed.
// Only consult/mutate the DB if it already exists (do not create it as
// a side effect of a removal).
dbExists := false
if _, statErr := os.Stat(*dbPath); statErr == nil {
dbExists = true
} else if !os.IsNotExist(statErr) {
return fmt.Errorf("access database %q for credential removal of %q (refusing to remove; a pool member may otherwise be orphaned): %w", *dbPath, name, statErr)
}

var db *store.Store
if dbExists {
var derr error
db, derr = store.New(*dbPath)
if derr != nil {
// Fail closed: the DB exists but cannot be opened, so the
// pool-membership gate cannot run. Proceeding to delete the
// vault secret would orphan any credential_pool_members row
// pointing at this now-missing credential -- exactly what the
// gate prevents. Refuse the removal instead.
return fmt.Errorf("open database %q to check pool membership for %q (refusing to remove; a pool member may otherwise be orphaned): %w", *dbPath, name, derr)
}
defer func() { _ = db.Close() }()

// GATE + atomic store cleanup (Finding 2, round-15). This MUST run
// before the vault delete. RemoveCredentialFully runs the
// fail-closed pool-member guard AND deletes credential_meta,
// credential_health, all bindings on the credential, and all
// auto-created rules in ONE transaction. If the credential is still
// a live pool member (or any store delete fails) it returns an
// error with NOTHING removed and the vault secret below is never
// touched — no partially-deleted-credential window.
metaDeleted, rmBindings, rmRules, rmErr := db.RemoveCredentialFully(name)
if rmErr != nil {
return fmt.Errorf("remove credential store state for %q (refusing to delete the vault secret so the credential is not partially deleted): %w", name, rmErr)
}
if metaDeleted {
fmt.Printf("removed credential metadata for %q\n", name)
}
if rmRules > 0 {
fmt.Printf("removed %d auto-created rule(s) for credential %q\n", rmRules, name)
}
if rmBindings > 0 {
fmt.Printf("removed %d binding(s) for %q\n", rmBindings, name)
}
}
Comment thread
nnemirovsky marked this conversation as resolved.

// Store removal already succeeded (or the DB does not exist). Now it is
// safe to delete the vault secret. If the secret is already gone (previous
// partial cleanup), that is not an error: any store state was already
// cleaned up above.
if err := vs.Remove(name); err != nil {
if !os.IsNotExist(err) {
return fmt.Errorf("remove: %w", err)
@@ -569,57 +650,9 @@ func handleCredRemove(args []string) error {
fmt.Printf("credential %q removed\n", name)
}

-// Clean up associated bindings and auto-created rules. Only open the
-// store if the DB file exists to avoid creating it as a side effect of
-// a credential removal.
-if _, statErr := os.Stat(*dbPath); statErr != nil {
-if !os.IsNotExist(statErr) {
-log.Printf("warning: cannot access database %q for cleanup: %v (stale rules/bindings may remain)", *dbPath, statErr)
-}
-return nil
-}
-
-db, err := store.New(*dbPath)
-if err != nil {
-log.Printf("warning: could not open database %q for cleanup: %v (stale rules/bindings may remain)", *dbPath, err)
-return nil
-}
-defer func() { _ = db.Close() }()
-
-// Remove rules tagged either by "sluice cred add --destination"
-// (cred-add:<name>) or by "sluice binding add" (binding-add:<name>).
-// Both paths may have produced rules associated with this credential,
-// and failing to clean up either set leaves orphan allow rules in
-// the store.
-var total int64
-for _, src := range []string{
-store.CredAddSourcePrefix + name,
-store.BindingAddSourcePrefix + name,
-} {
-n, rmErr := db.RemoveRulesBySource(src)
-if rmErr != nil {
-log.Printf("warning: failed to remove rules with source %q for credential %q: %v", src, name, rmErr)
-continue
-}
-total += n
-}
-if total > 0 {
-fmt.Printf("removed %d auto-created rule(s) for credential %q\n", total, name)
-}
-removed, rmBindErr := db.RemoveBindingsByCredential(name)
-if rmBindErr != nil {
-log.Printf("warning: failed to remove bindings for %q: %v", name, rmBindErr)
-} else if removed > 0 {
-fmt.Printf("removed %d binding(s) for %q\n", removed, name)
-}
-
-// Remove credential metadata (type, token_url).
-metaDeleted, rmMetaErr := db.RemoveCredentialMeta(name)
-if rmMetaErr != nil {
-log.Printf("warning: failed to remove credential meta for %q: %v", name, rmMetaErr)
-} else if metaDeleted {
-fmt.Printf("removed credential metadata for %q\n", name)
-}
// Bindings and auto-created rules were already removed atomically with
// credential_meta + health by RemoveCredentialFully above, before the
// vault secret was deleted. Nothing left to clean up here.
return nil
}
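
The removal order spelled out in the `handleCredRemove` comment (open the vault, run the store gate, then delete the secret) can be reduced to a small stand-alone sketch. Everything here is hypothetical scaffolding: the `vault` and `store` interfaces, `memVault`, `memStore`, `RemoveFully`, and `removeCredential` are illustrative stand-ins, not the real `vault`/`store` package APIs. It only demonstrates the ordering property: a refusal by the store gate leaves the vault secret untouched.

```go
package main

import (
	"errors"
	"fmt"
)

// vault and store are hypothetical stand-ins for the real backends.
type vault interface {
	Remove(name string) error
}

type store interface {
	// RemoveFully refuses (removing nothing) if name is a live pool member.
	RemoveFully(name string) error
}

var errLivePoolMember = errors.New("credential is a live pool member")

// memVault is an in-memory vault used only for this sketch.
type memVault struct{ secrets map[string]bool }

func (v *memVault) Remove(name string) error {
	delete(v.secrets, name)
	return nil
}

// memStore is an in-memory store used only for this sketch.
type memStore struct {
	meta        map[string]bool
	poolMembers map[string]bool
}

func (s *memStore) RemoveFully(name string) error {
	if s.poolMembers[name] {
		return errLivePoolMember // fail closed: nothing removed
	}
	delete(s.meta, name)
	return nil
}

// removeCredential applies the three steps in order:
//  1. open the vault (abort before touching any metadata if it fails),
//  2. run the store gate and cleanup,
//  3. delete the vault secret only after the gate succeeded.
func removeCredential(openVault func() (vault, error), s store, name string) error {
	v, err := openVault()
	if err != nil {
		return fmt.Errorf("open vault: %w", err)
	}
	if err := s.RemoveFully(name); err != nil {
		return fmt.Errorf("store gate refused %q: %w", name, err)
	}
	return v.Remove(name)
}

func main() {
	v := &memVault{secrets: map[string]bool{"credA": true, "credB": true}}
	s := &memStore{
		meta:        map[string]bool{"credA": true, "credB": true},
		poolMembers: map[string]bool{"credA": true},
	}
	open := func() (vault, error) { return v, nil }

	fmt.Println("credA:", removeCredential(open, s, "credA"))
	fmt.Println("credA secret still present:", v.secrets["credA"])
	fmt.Println("credB:", removeCredential(open, s, "credB"))
	fmt.Println("credB secret still present:", v.secrets["credB"])
}
```

Running the gate and the metadata delete as one store call is what closes the check-then-act window the comment describes; a separate pre-check followed by a delete would not give the same guarantee.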
