From 4aa4f547c2025b737f8b7c9106e66fec184ad11e Mon Sep 17 00:00:00 2001 From: bdchatham Date: Sun, 26 Apr 2026 09:24:03 -0700 Subject: [PATCH 1/2] docs: design for graceful upgrade-halt contract MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds a first-draft issue + design under docs/upgrade-shutdown-contract/ covering an exit-code and HaltIntent contract for seid expected halts (operator-action-required vs genuine crash). Captures the cross-repo coordination story (sei-config = contract, sei-chain = producer, sei-k8s-controller = consumer) and the explicit case for a dedicated /halt_intent route over extending /status. No code yet — first draft for review and iteration. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/upgrade-shutdown-contract/DESIGN.md | 216 +++++++++++++++++++++++ docs/upgrade-shutdown-contract/ISSUE.md | 53 ++++++ 2 files changed, 269 insertions(+) create mode 100644 docs/upgrade-shutdown-contract/DESIGN.md create mode 100644 docs/upgrade-shutdown-contract/ISSUE.md diff --git a/docs/upgrade-shutdown-contract/DESIGN.md b/docs/upgrade-shutdown-contract/DESIGN.md new file mode 100644 index 0000000..9cae7e2 --- /dev/null +++ b/docs/upgrade-shutdown-contract/DESIGN.md @@ -0,0 +1,216 @@ +# Upgrade shutdown contract — design + +> Companion to [ISSUE.md](./ISSUE.md). This doc covers the design intent for a graceful, distinguishable shutdown signal when seid detects it is running an incompatible binary for an on-chain upgrade. The deliverable in *this* PR is only the sei-config-side contract; the producer (sei-chain) and consumer (sei-k8s-controller) changes are described here so reviewers can sanity-check the shape, but they ship as separate PRs. + +## Background + +When the x/upgrade module's `BeginBlocker` decides the running binary cannot proceed, it calls `panic()`. Three panic sites in `sei-cosmos/x/upgrade/abci.go` (lines 44, 99, 119). All exit with Go's panic code (2) and dump a stack trace; the relevant message is also raw-written to stderr for cosmovisor regex compat. + +For comparison, **the existing `--halt-height` operator path is already graceful.** `BaseApp.halt()` at `sei-cosmos/baseapp/abci.go:288-321` does *not* panic — it sends `SIGINT` to itself, which the existing shutdown defer in `server/start.go:459-484` catches and tears everything down cleanly. The upgrade halt is the outlier; this design proposes bringing it in line. + +## Goals + +- A stable, machine-readable signal that distinguishes "operator action required" halts from genuine crashes. +- The signal is observable both **post-mortem** (process exit code) and, optionally, **live** (status endpoint, while the process is still running) so a control plane can fetch *why* the node is no longer producing blocks. +- Sidecar-optional. External operators (validators on systemd, humans with curl) get the same primitive. +- Backwards-compatible default. Today's "the process exits when it can't proceed" behavior is preserved; the live-status mode is opt-in via a flag. + +## Non-goals + +- Replace cosmovisor or systemd integration paths. +- Define a unified node-status aggregator API. That's an explicit *Tier-2 future*; see [Architecture](#architecture-two-tier). +- Solve binary-image lookup, restart policy, image swap, or CRD modeling — those belong to the consumer (sei-k8s-controller). +- Modify the existing `upgrade-info.json` artifact (cosmovisor's contract). It continues to be written exactly as today; we do not extend its schema. + +## Architecture (two tier) + +**Tier 1 — seid (dataplane primitive).** Each halt cause has: +- A distinct **exit code** in the range 70–79. +- A structured **`HaltIntent` value** populated before exit, exposed via a discrete `/halt_intent` route on the existing sei-tendermint RPC server (port 26657). + +The HaltIntent shape is owned by sei-config. Every consumer — sidecar, systemd unit, controller, human curl — reads the same primitive. This keeps tier 1 universal and predictable. + +**Tier 2 — node-side sidecar (control-plane adapter, future, not in scope).** A sidecar process running alongside seid in the same Pod (for the controller deployment topology) polls Tier-1 primitives and exposes an opinionated, aggregated `/status` to the control plane. This is where seid's surface-area volatility and tech debt are massaged into a stable control-plane API. **Not part of this PR; mentioned only so reviewers understand why Tier-1 stays minimal.** + +The two-tier split is what reconciles "synergistic with the sidecar" with "doesn't require the sidecar." Tier 1 is sidecar-agnostic; the sidecar is purely a convenience layer. + +--- + +## ⚠️ Decision: dedicated `/halt_intent` route, NOT extending `/status` + +This is the most contested call in the design. Reviewers will land here first — calling it out explicitly with the case for and against. Push back hard on this section if you disagree. + +### TL;DR + +Add a **new dedicated route `/halt_intent`** on the sei-tendermint RPC server. Do **not** extend the existing `/status` response to include a `halt_intent` field. + +### Why this is non-obvious + +`/status` is the idiomatic place a control-plane fetches a dataplane process's state. Most reviewers' first instinct — and the original direction of this design — was to extend `/status` with an optional `halt_intent` field that's null when healthy. The "new endpoint" feels like a smell, especially since both routes live on a surface (Tendermint RPC) that we want to deprecate eventually. + +### The case for extending `/status` (rejected) + +1. **Idiomatic.** Control-plane→dataplane state polling lives on `/status`. One endpoint, one response, one parse. +2. **Discoverability.** Sidecars and operators already poll `/status`; a new route is one more thing to know. +3. **No new surface to deprecate later.** If we eventually retire Tendermint RPC entirely, every additional route there is a chip we have to migrate. +4. **Upstream-merge tax** *(does not apply for Sei — see below)*. In a typical Cosmos fork, modifying upstream-defined types like `ResultStatus` means re-merging on every CometBFT minor release. **For Sei this argument is moot:** Sei never merges sei-tendermint changes back to upstream; the fork is fully owned. We can modify `ResultStatus` freely. + +### Why we still recommend a dedicated route (failure-isolation) + +Of the arguments above, only #4 was decisive in the original analysis, and Sei's never-merge-upstream posture neutralizes it. So why still go with a dedicated route? + +**Failure-domain isolation.** This is the strongest single argument and the one most reviewers don't anticipate. + +`/status` is the most heavily polled endpoint in the entire Cosmos universe — Prometheus exporters, Tenderduty, validator dashboards, block explorers, load-balancer health checks, the sei-k8s-controller's existing health probes. Coupling halt-intent population to `/status`'s critical path means any bug in the halt-intent reader — lock contention with the consensus shutdown path, a nil-deref on `PlanName`, a serialization edge case — takes down sync-status visibility for the entire monitoring fleet. + +Crucially, that breakage would happen **exactly during the upgrade window** when operators most need `/status` to be boring and reliable. We would be introducing a new failure mode at the worst possible time. + +A dedicated `/halt_intent` handler with its own mutex and its own `defer recover()` keeps the blast radius scoped: a bug in halt-intent code can break `/halt_intent` only, and `/status` keeps reporting sync state to everyone polling it. + +### Secondary argument (deprecation) + +The "we want to deprecate Tendermint RPC eventually" point cuts the *opposite* way from initial intuition. A discrete route is trivially shimmed, proxied, or replaced by whatever succeeds Tendermint RPC. A nested field inside `ResultStatus` is welded to the struct's lifetime — a future replacement has to keep emitting the same field at the same nesting depth, or migrate every consumer in lockstep. The dedicated route gives us more freedom, not less. + +### Counterarguments and how we address them + +- *"A new endpoint is a smell."* Real, but the smell is preferable to the failure-isolation cost. The endpoint is small (one method, one response shape, one mutex) and serves a single purpose. It is not a kitchen-sink endpoint waiting to grow. +- *"Discoverability."* The Tier-2 sidecar (when it ships) knows about `/halt_intent` natively and is the controller's primary path. For systemd/curl operators, discoverability is solved by docs. We deliberately do **not** advertise `/halt_intent` via `/status`'s `node_info.other` map — that re-couples the two routes' release lifecycles, which the failure-isolation argument is trying to decouple. +- *"What about the deprecation."* See above. A dedicated route is *easier* to migrate, not harder. + +### Open question (defer) + +If a future contributor argues that the failure-isolation risk is overstated — i.e. demonstrates that the halt-intent reader can be made unconditionally safe with no shared lock with consensus — the right move is *still* a dedicated route on first ship, and revisit consolidation later. Easier to merge two routes than split one. + +--- + +## sei-config contract (v1 deliverable in this PR) + +A new file (proposed name: `exitcodes.go`): + +```go +package seiconfig + +import "time" + +// ShutdownReason identifies why seid halted. The numeric values are stable +// process exit codes; see the ExitCode* constants below. Values are append-only — +// renumbering any constant is a backwards-incompatible change. +type ShutdownReason int + +const ( + ShutdownReasonUnknown ShutdownReason = 0 + ShutdownReasonUpgradeRequired ShutdownReason = 70 + ShutdownReasonBinaryTooNew ShutdownReason = 71 + ShutdownReasonDowngradeDetected ShutdownReason = 72 + // 73-79 reserved for future upgrade-related graceful halts. + // 80-89 reserved for non-upgrade operator-action halts (e.g. halt-height). +) + +// Exit codes used by seid to signal graceful halts to process supervisors. +// Identical numeric values to the corresponding ShutdownReason; duplicated as +// distinct constants because process supervisors see exit codes while in-process +// code uses the typed enum. +const ( + ExitCodeUpgradeRequired = 70 + ExitCodeBinaryTooNew = 71 + ExitCodeDowngradeDetected = 72 +) + +// HaltIntent is the structured signal seid serves on /halt_intent and that +// describes the reason for an in-progress or pending graceful halt. +// +// JSON tags are part of the wire contract and must not be renamed without +// coordinated consumer updates. +type HaltIntent struct { + Reason ShutdownReason `json:"reason"` + ReasonString string `json:"reason_string"` // human-readable, fixed phrasing per Reason + PlanName string `json:"plan_name,omitempty"` + Height int64 `json:"height,omitempty"` + Info string `json:"info,omitempty"` + AnnouncedAt time.Time `json:"announced_at"` +} + +// ParseExitCode maps a process exit code to a ShutdownReason. Returns +// (ShutdownReasonUnknown, false) for codes outside the graceful range. +func ParseExitCode(code int) (ShutdownReason, bool) { ... } + +// String returns a stable lowercase token suitable for logs and the +// HaltIntent.ReasonString field. +func (r ShutdownReason) String() string { ... } +``` + +That's the entire library surface. Plus a unit test exercising every constant through `ParseExitCode` and round-tripping `HaltIntent` through `encoding/json`. + +Notable omissions, all deliberate: +- **No `UpgradeInfo` reader.** The on-disk `upgrade-info.json` shape is owned by `x/upgrade/keeper` in sei-cosmos. Duplicating it here would create a drift hazard with no payoff — sei-config doesn't need it; the controller can vendor a 30-line struct itself if it ever does. +- **No file IO.** sei-config does not write or read `halt-intent.json`. The single source of truth is the live endpoint. +- **No constructor for the SIGINT/exit helper.** The producer-side helper (`gracefulHalt`) lives next to its callsites in sei-cosmos, since it needs the upgrade keeper for `DumpUpgradeInfoWithInfoToDisk`. It references `seiconfig.ExitCodeUpgradeRequired` etc. The library does not export a `GracefulHalt(...)` function. + +## Producer side (sei-chain — out of scope for this PR, sketched) + +In `sei-cosmos/x/upgrade/abci.go`, replace the three `panic()` calls with calls to a new helper `gracefulHalt(reason, intent)` that: + +1. Sets the HaltIntent on a process-global holder (an `RWMutex`-protected struct in a small new package, importable by both `x/upgrade` and the `/halt_intent` HTTP handler in `sei-tendermint`). +2. Performs the existing logging + raw stderr write + (for the line-119 path) `upgrade-info.json` write. **No change to those existing artifacts.** +3. Sends `SIGINT` to self, mirroring the existing `BaseApp.halt()` pattern. +4. The shutdown defer in `sei-cosmos/server/start.go:459-484` is extended to read the holder, look up the matching exit code via `seiconfig.ParseExitCode`, and call `os.Exit(N)` with the distinct code instead of returning normally. + +This reuses the existing graceful-shutdown plumbing rather than inventing a parallel one. It is a much smaller change than it would first appear. + +### Default vs stay-alive mode + +A new boolean flag, default `false`, gates the live-endpoint behavior: + +- **Off (default — equivalent to today):** `gracefulHalt` triggers the SIGINT shutdown path immediately. Process exits with the distinct exit code. Status servers go down with consensus. No surprise for current operators; the only observable change vs. today is a clean exit code (70/71/72) instead of panic exit code 2 with a stack trace. +- **On (stay-alive):** `gracefulHalt` populates the holder, stops consensus, and **leaves the four servers running indefinitely**. Process exits only on external `SIGTERM`/`SIGINT`. The supervisor (sidecar, controller, human) drives termination after observing the halt intent and taking action. + +Stay-alive mode requires splitting the shared `goCtx` in `sei-cosmos/server/start.go` so consensus and the servers have separate cancellation. This is real surgery but contained to that file. + +A flag name is deliberately not specified in this design — the producer PR can pick something like `--halt-stay-alive` or `--halt-stays-running`. Naming bikeshed deferred. + +## /halt_intent route (also out of scope for this PR, sketched) + +A new Tendermint RPC route on `sei-tendermint`, registered alongside the existing Sei-only `/lag_status`. Path: `/halt_intent`. Always reachable while the RPC server is up. + +### Response + +- Always **`200 OK`** with the HaltIntent JSON. +- When healthy: `{"reason": 0, "reason_string": "healthy", "announced_at": "..."}`. +- When halted: the populated HaltIntent. + +**Not 204. Not 404.** A 404 is indistinguishable from "older seid doesn't have this route" and breaks consumer version-probing. We always answer 200 with a populated `reason` field. + +### Concurrency contract + +- HaltIntent state is held under an `RWMutex` in the holder package. +- The producer (BeginBlocker → gracefulHalt) writes once under `Lock()` and never mutates the value after. It is safe to publish a pointer once and never replace it. +- The handler reads under `RLock()`. It copies a snapshot out of the lock and serializes outside. +- The handler has a `defer recover()` so a bug in halt-intent serialization cannot bring down the broader Tendermint RPC server (failure-isolation principle from the [decision section](#-decision-dedicated-halt_intent-route-not-extending-status)). + +### Discoverability + +The Tier-2 sidecar knows about `/halt_intent` natively. For non-sidecar consumers (systemd, curl), the route is documented in seid's reference docs. We do **not** advertise it via `/status`'s `node_info.other` field — that would re-couple the two routes' lifecycles, which contradicts the failure-isolation rationale for splitting them. + +## What's NOT in this PR (or this design's scope) + +- The producer-side sei-chain change. Tracked as a follow-up issue (must be filed before this PR merges). +- The consumer-side sei-k8s-controller wiring. Tracked as a follow-up issue (same). +- The Tier-2 sidecar `/status` aggregator. Future work. +- Halt-intent on EVM RPC, Admin gRPC, or Cosmos gRPC surfaces. Tier 1 has one canonical surface (Tendermint RPC). Multiple surfaces re-introduce the version-coordination problem the dedicated route is designed to avoid. +- A persisted `halt-intent.json` file. The live endpoint is the single source of truth; on-disk state can desync. +- A termination-log JSON write to `/dev/termination-log`. Was considered (kubelet surfaces it via `lastState.terminated.message`); deferred — adds another contract surface for the controller follow-up to absorb without solving anything that the live endpoint doesn't. +- Replacing all three panic paths atomically. They can land sequentially in the producer PR. The line-119 path (most common case — operator forgot to upgrade) is the natural first target. + +## Open questions + +1. **Cosmovisor coexistence.** If validator pods run cosmovisor inside the container and we ship stay-alive mode, does cosmovisor swallow our exit code via its own restart loop? For controller-managed pods the controller likely bypasses cosmovisor entirely; needs confirming when the producer PR lands. +2. **Holder location.** The HaltIntent holder needs to be readable by the Tendermint RPC handler (in `sei-tendermint`) and writable by the upgrade keeper (in `sei-cosmos`). Concretely: a small new package both import? Or hang it off an existing shared type? Producer-side detail; defer to that PR. +3. **Image-mapping source for sei-k8s-controller.** When the controller observes `Reason=70 PlanName=v6`, where does it look up "v6 → ghcr.io/sei/seid:v6.0.1"? CRD field, ConfigMap, annotation? Out of scope here; flagged for the consumer follow-up. + +## Cross-repo coordination + +- **This PR (sei-config):** ships the contract. +- **Follow-up sei-chain issue/PR:** implements the producer (graceful-halt helper, stay-alive mode flag, `/halt_intent` route on sei-tendermint). +- **Follow-up sei-k8s-controller issue/PR:** implements the consumer (poll `/halt_intent`, branch on `ShutdownReason`, drive image swap or page). + +The sei-config PR must not merge until both follow-up issues are filed and linked. The producer must not merge until consumers can pin the matching sei-config version. The `/halt_intent` route can ship in sei-tendermint independently of the panic-replacement, but they should ship together to avoid releasing a route that always reports "healthy" for halt conditions a binary cannot otherwise survive. diff --git a/docs/upgrade-shutdown-contract/ISSUE.md b/docs/upgrade-shutdown-contract/ISSUE.md new file mode 100644 index 0000000..d3f4b2f --- /dev/null +++ b/docs/upgrade-shutdown-contract/ISSUE.md @@ -0,0 +1,53 @@ +# Define exit-code and halt-intent contract for seid expected halts + +## Problem + +When seid hits a governance-scheduled upgrade height and lacks a registered handler for the plan name (i.e. an operator has not upgraded the binary), it calls Go `panic()`. There are three such panic paths in `sei-cosmos/x/upgrade/abci.go`: + +| Line | Reason | Writes `upgrade-info.json` | +|---|---|---| +| 44 | Downgrade detected — last applied plan has no handler in this binary | no | +| 99 | Binary too new — has handler whose height has not arrived (non-minor upgrade) | no | +| 119 (`panicUpgradeNeeded`) | At/past upgrade height, no handler for plan | yes | + +All three exit with Go's panic exit code (2). The panic also dumps a stack trace to stderr that obscures the relevant message, though the message is also raw-written to stderr explicitly to keep cosmovisor's regex scanner working (see `abci.go:114-116`). + +To any process supervisor — Kubernetes (`lastState.terminated.exitCode`), systemd, our own sei-k8s-controller — these halts are indistinguishable from a genuine crash (nil deref, OOM, state-machine bug). The only structured signal today is a stderr regex. + +## Customer & job-to-be-done + +**Primary: sei-k8s-controller.** The controller is the supervisor we own end-to-end. When seid stops, the controller needs a machine-readable signal of *why* — to decide whether to restart in place, swap the container image, page on-call, or update the `SeiNode` CRD's status. Today it cannot tell "operator forgot to upgrade" from "state-machine bug" without log-scraping, which is brittle and racy. + +**Secondary: validator operators on systemd or cosmovisor.** They have a cosmovisor regex workaround today; this gives them a cleaner, exit-code-based contract. + +## Scope (v1) + +**In scope** for this issue / PR: + +- A new file in sei-config defining the contract: `ShutdownReason` enum, exit-code constants (70/71/72 with 73-79 reserved, 80-89 reserved for future non-upgrade graceful halts), `HaltIntent` struct (typed and JSON-serializable), `ParseExitCode` helper. ~30-50 lines plus a unit test. +- The companion design document (`DESIGN.md`) covering: producer-side change shape in sei-chain, opt-in stay-alive mode, the new `/halt_intent` route on sei-tendermint RPC, and cross-repo coordination. + +**Out of scope:** + +- The producer-side change in sei-chain (separate issue/PR). +- The consumer-side wiring in sei-k8s-controller (separate issue/PR). +- A node-side sidecar with an aggregated `/status` adapter (Tier-2 future work). +- Persisted halt-intent files, termination-log JSON, halt-intent on EVM/Admin/Cosmos-gRPC surfaces. +- Replacing all three panic paths in one go — they can land sequentially. + +## "Done" criteria + +- New file in sei-config (e.g. `exitcodes.go`) with constants, `ShutdownReason`, `HaltIntent`, `ParseExitCode`, and a `String()` method on the enum. +- A unit test exercising every constant through `ParseExitCode` plus round-tripping `HaltIntent` through JSON. +- All exported symbols carry godoc that reads as a stable contract, not implementation notes. +- Numeric exit codes documented to be append-only — renumbering is a major-version break. +- Follow-up issue filed on sei-chain (producer) referenced in this PR. +- Follow-up issue filed on sei-k8s-controller (consumer) referenced in this PR. + +## Note on placement + +This admittedly stretches sei-config's "leaf config library" charter (`CLAUDE.md`: zero panics, no IO outside its own concerns, minimal deps). The proposed addition is constants + a typed wire shape + a pure parser — within charter. The real risk is shipping the contract before either the producer or any consumer is ready, which leaves dead code on `main`. + +**Mitigation:** this PR does not merge until the two follow-up issues (sei-chain producer, sei-k8s-controller consumer) are filed and linked. + +**Long-term:** if more cross-process protocol constants accumulate, they may want a dedicated home (e.g., a `sei-protocol/sei-runtime` package). For v1 the existing import graph wins — all three consumers (`seid`, `seictl`, `sei-k8s-controller`) already import sei-config. From 3ee1047cc58bdf412c67e293aca3b34782957527 Mon Sep 17 00:00:00 2001 From: bdchatham Date: Sun, 26 Apr 2026 09:58:44 -0700 Subject: [PATCH 2/2] docs: flip endpoint decision to extend /status MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Previous draft recommended a dedicated /halt_intent route on failure-domain-isolation grounds. PR review pushed back: the "monitoring fleet polling /status" premise doesn't hold for Sei's actual stack. Audit confirmed: every /status consumer in the workspace is integration tests or thin orchestration (sei-tendermint/rpc/test/helpers.go, sei-cosmos/contrib/localnet_liveness.sh, networks/remote/integration.sh, autobahn explicitly avoids it). No production code in seid polls /status. No Prometheus/Grafana/Tenderduty config in workspace. Cosmovisor scans stderr. What's left of the failure-isolation concern is defensive programming — solvable with defer recover() in the /status handler's halt-intent population path, not requiring a separate endpoint. Updated: - Architecture (Tier 1) describes the field, not the route - Decision section rewritten with the audit findings and concession - Endpoint section retitled and rewritten to describe the field - Cross-repo coordination updated - ISSUE.md scope line updated Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/upgrade-shutdown-contract/DESIGN.md | 87 ++++++++++++------------ docs/upgrade-shutdown-contract/ISSUE.md | 2 +- 2 files changed, 45 insertions(+), 44 deletions(-) diff --git a/docs/upgrade-shutdown-contract/DESIGN.md b/docs/upgrade-shutdown-contract/DESIGN.md index 9cae7e2..9b3ab74 100644 --- a/docs/upgrade-shutdown-contract/DESIGN.md +++ b/docs/upgrade-shutdown-contract/DESIGN.md @@ -26,7 +26,7 @@ For comparison, **the existing `--halt-height` operator path is already graceful **Tier 1 — seid (dataplane primitive).** Each halt cause has: - A distinct **exit code** in the range 70–79. -- A structured **`HaltIntent` value** populated before exit, exposed via a discrete `/halt_intent` route on the existing sei-tendermint RPC server (port 26657). +- A structured **`HaltIntent` value** populated before exit, exposed as an optional `halt_intent` field on the existing `/status` response served by the sei-tendermint RPC server (port 26657). The HaltIntent shape is owned by sei-config. Every consumer — sidecar, systemd unit, controller, human curl — reads the same primitive. This keeps tier 1 universal and predictable. @@ -36,50 +36,50 @@ The two-tier split is what reconciles "synergistic with the sidecar" with "doesn --- -## ⚠️ Decision: dedicated `/halt_intent` route, NOT extending `/status` +## ⚠️ Decision: extend `/status` with a `halt_intent` field, not a dedicated route -This is the most contested call in the design. Reviewers will land here first — calling it out explicitly with the case for and against. Push back hard on this section if you disagree. +This is the most contested call in the design. The recommendation flipped during PR review after an empirical audit invalidated the original premise. Reviewers should land here first. ### TL;DR -Add a **new dedicated route `/halt_intent`** on the sei-tendermint RPC server. Do **not** extend the existing `/status` response to include a `halt_intent` field. +Add an **optional `halt_intent` field** to the existing `/status` response on the sei-tendermint RPC server. The field is omitted when the node is healthy and populated with a `HaltIntent` value when a graceful halt is in progress. **Do not** add a dedicated `/halt_intent` route. -### Why this is non-obvious - -`/status` is the idiomatic place a control-plane fetches a dataplane process's state. Most reviewers' first instinct — and the original direction of this design — was to extend `/status` with an optional `halt_intent` field that's null when healthy. The "new endpoint" feels like a smell, especially since both routes live on a surface (Tendermint RPC) that we want to deprecate eventually. - -### The case for extending `/status` (rejected) +### The case for extending `/status` 1. **Idiomatic.** Control-plane→dataplane state polling lives on `/status`. One endpoint, one response, one parse. -2. **Discoverability.** Sidecars and operators already poll `/status`; a new route is one more thing to know. -3. **No new surface to deprecate later.** If we eventually retire Tendermint RPC entirely, every additional route there is a chip we have to migrate. -4. **Upstream-merge tax** *(does not apply for Sei — see below)*. In a typical Cosmos fork, modifying upstream-defined types like `ResultStatus` means re-merging on every CometBFT minor release. **For Sei this argument is moot:** Sei never merges sei-tendermint changes back to upstream; the fork is fully owned. We can modify `ResultStatus` freely. +2. **Discoverability.** Sidecars and operators already poll `/status`; a new route is one more thing to know about and version-gate. +3. **Minimizes surface area on a doomed endpoint.** Sei intends to deprecate Tendermint RPC eventually. Every additional route there is a chip that has to be migrated. Extending an existing field set is no migration cost beyond what the route itself already costs. +4. **Additive JSON is non-breaking.** Adding an optional field to a JSON response is non-breaking unless consumers do strict-schema validation (`additionalProperties: false`), which is rare. Per the audit (below), no Sei consumer does this. +5. **Upstream-merge tax does not apply.** In a typical Cosmos fork, modifying `ResultStatus` would mean re-merging on every CometBFT minor release. Sei never merges sei-tendermint changes back upstream, so this is not a cost. -### Why we still recommend a dedicated route (failure-isolation) +### What we initially argued and why it didn't hold -Of the arguments above, only #4 was decisive in the original analysis, and Sei's never-merge-upstream posture neutralizes it. So why still go with a dedicated route? +The first draft of this design recommended a dedicated `/halt_intent` route on **failure-domain isolation** grounds: `/status` is "the most heavily polled endpoint in the entire Cosmos universe — Prometheus exporters, Tenderduty, validator dashboards, block explorers, load-balancer health checks." Coupling halt-intent population to `/status` would risk a halt-intent bug taking down sync-status visibility for the entire monitoring fleet during the upgrade window. -**Failure-domain isolation.** This is the strongest single argument and the one most reviewers don't anticipate. +That argument was load-bearing on a premise that does not hold for Sei's actual deployment stack. -`/status` is the most heavily polled endpoint in the entire Cosmos universe — Prometheus exporters, Tenderduty, validator dashboards, block explorers, load-balancer health checks, the sei-k8s-controller's existing health probes. Coupling halt-intent population to `/status`'s critical path means any bug in the halt-intent reader — lock contention with the consensus shutdown path, a nil-deref on `PlanName`, a serialization edge case — takes down sync-status visibility for the entire monitoring fleet. +### What the audit found -Crucially, that breakage would happen **exactly during the upgrade window** when operators most need `/status` to be boring and reliable. We would be introducing a new failure mode at the worst possible time. +A thorough audit of `/status` consumers in the workspace (`sei-chain` + vendored `sei-cosmos` + `sei-tendermint`) returned only: -A dedicated `/halt_intent` handler with its own mutex and its own `defer recover()` keeps the blast radius scoped: a bug in halt-intent code can break `/halt_intent` only, and `/status` keeps reporting sync state to everyone polling it. +- `sei-tendermint/rpc/test/helpers.go:37` — `waitForRPC()` test bootstrap +- `sei-cosmos/contrib/localnet_liveness.sh:32` — liveness loop in a localnet script +- `sei-tendermint/networks/remote/integration.sh` — e2e setup curls +- `integration_test/autobahn/autobahn_test.go:71-73` — explicitly *avoids* `/status` (uses `/abci_info` instead) -### Secondary argument (deprecation) +**No production code in seid polls `/status`.** No Prometheus, Grafana, or Tenderduty configuration is wired to it anywhere in the workspace. Cosmovisor scans stderr, not RPC. The "monitoring fleet" the failure-isolation argument was protecting does not exist for Sei. -The "we want to deprecate Tendermint RPC eventually" point cuts the *opposite* way from initial intuition. A discrete route is trivially shimmed, proxied, or replaced by whatever succeeds Tendermint RPC. A nested field inside `ResultStatus` is welded to the struct's lifetime — a future replacement has to keep emitting the same field at the same nesting depth, or migrate every consumer in lockstep. The dedicated route gives us more freedom, not less. +The single known unknown is `sei-k8s-controller`, which is in a separate repo not present in this workspace. The team's stated belief is that production `/status` consumption is "strictly integration tests and thin orchestration systems," which extends to the controller's usage. We accept this characterization for now; the consumer follow-up PR can verify. -### Counterarguments and how we address them +### What's left of the failure-isolation concern -- *"A new endpoint is a smell."* Real, but the smell is preferable to the failure-isolation cost. The endpoint is small (one method, one response shape, one mutex) and serves a single purpose. It is not a kitchen-sink endpoint waiting to grow. -- *"Discoverability."* The Tier-2 sidecar (when it ships) knows about `/halt_intent` natively and is the controller's primary path. For systemd/curl operators, discoverability is solved by docs. We deliberately do **not** advertise `/halt_intent` via `/status`'s `node_info.other` map — that re-couples the two routes' release lifecycles, which the failure-isolation argument is trying to decouple. -- *"What about the deprecation."* See above. A dedicated route is *easier* to migrate, not harder. +Only **defensive programming**: a bug in halt-intent population (lock contention, nil-deref on `PlanName`, serialization edge case) shouldn't be able to crash the `/status` handler. This is solvable cheaply — see the [concurrency contract](#concurrency-contract) below — and does not require a separate endpoint. -### Open question (defer) +### Counterarguments and remaining concerns -If a future contributor argues that the failure-isolation risk is overstated — i.e. demonstrates that the halt-intent reader can be made unconditionally safe with no shared lock with consensus — the right move is *still* a dedicated route on first ship, and revisit consolidation later. Easier to merge two routes than split one. +- *"A nested field welded to `ResultStatus`'s lifetime is harder to migrate when we deprecate Tendermint RPC."* Mitigated by the additivity of JSON: a successor surface can simply re-emit the same field shape, or migrate consumers field-by-field. The shape is owned by sei-config and stable independent of the route that serves it. +- *"What if a future production consumer (Prometheus exporter, third-party explorer) starts polling `/status` and is sensitive to schema changes?"* The new field is additive and `omitempty`. A consumer that doesn't know about `halt_intent` sees the same shape it sees today. A consumer that *does* know about it gets the signal for free without a separate poll. +- *"Should we keep a fallback to a dedicated route if the defensive-programming approach proves insufficient?"* No — easier to add a route later than to retire one. Ship the simpler design first. --- @@ -116,7 +116,7 @@ const ( ExitCodeDowngradeDetected = 72 ) -// HaltIntent is the structured signal seid serves on /halt_intent and that +// HaltIntent is the structured signal seid populates on /status and that // describes the reason for an in-progress or pending graceful halt. // // JSON tags are part of the wire contract and must not be renamed without @@ -150,7 +150,7 @@ Notable omissions, all deliberate: In `sei-cosmos/x/upgrade/abci.go`, replace the three `panic()` calls with calls to a new helper `gracefulHalt(reason, intent)` that: -1. Sets the HaltIntent on a process-global holder (an `RWMutex`-protected struct in a small new package, importable by both `x/upgrade` and the `/halt_intent` HTTP handler in `sei-tendermint`). +1. Sets the HaltIntent on a process-global holder (an `RWMutex`-protected struct in a small new package, importable by both `x/upgrade` and the `/status` handler in `sei-tendermint`). 2. Performs the existing logging + raw stderr write + (for the line-119 path) `upgrade-info.json` write. **No change to those existing artifacts.** 3. Sends `SIGINT` to self, mirroring the existing `BaseApp.halt()` pattern. 4. The shutdown defer in `sei-cosmos/server/start.go:459-484` is extended to read the holder, look up the matching exit code via `seiconfig.ParseExitCode`, and call `os.Exit(N)` with the distinct code instead of returning normally. @@ -168,28 +168,29 @@ Stay-alive mode requires splitting the shared `goCtx` in `sei-cosmos/server/star A flag name is deliberately not specified in this design — the producer PR can pick something like `--halt-stay-alive` or `--halt-stays-running`. Naming bikeshed deferred. -## /halt_intent route (also out of scope for this PR, sketched) +## `halt_intent` field on `/status` (also out of scope for this PR, sketched) -A new Tendermint RPC route on `sei-tendermint`, registered alongside the existing Sei-only `/lag_status`. Path: `/halt_intent`. Always reachable while the RPC server is up. +The existing `/status` handler in `sei-tendermint` is extended to populate an optional `halt_intent` field on its response, sourced from the in-process holder. The handler change is contained — a few lines reading from the holder and adding the field to the existing `ResultStatus`-shaped response. -### Response +### Response shape -- Always **`200 OK`** with the HaltIntent JSON. -- When healthy: `{"reason": 0, "reason_string": "healthy", "announced_at": "..."}`. -- When halted: the populated HaltIntent. +- **Healthy node:** `halt_intent` is omitted (or null, depending on JSON-encoding choice — `omitempty` recommended). Existing `/status` consumers see no change. +- **Halted node:** `halt_intent` is populated with the full `HaltIntent` value defined in sei-config. -**Not 204. Not 404.** A 404 is indistinguishable from "older seid doesn't have this route" and breaks consumer version-probing. We always answer 200 with a populated `reason` field. +This means a `/status` consumer that has never heard of `halt_intent` continues to work; one that knows the field gets the halt signal for free without polling a second endpoint. ### Concurrency contract -- HaltIntent state is held under an `RWMutex` in the holder package. -- The producer (BeginBlocker → gracefulHalt) writes once under `Lock()` and never mutates the value after. It is safe to publish a pointer once and never replace it. -- The handler reads under `RLock()`. It copies a snapshot out of the lock and serializes outside. -- The handler has a `defer recover()` so a bug in halt-intent serialization cannot bring down the broader Tendermint RPC server (failure-isolation principle from the [decision section](#-decision-dedicated-halt_intent-route-not-extending-status)). +This is the only piece of the original design that survives unchanged — it's the cheap defensive programming that absorbs what's left of the failure-isolation concern. + +- HaltIntent state is held under an `RWMutex` in a small holder package importable by both `x/upgrade` (writer) and `sei-tendermint` (reader). +- The producer (BeginBlocker → `gracefulHalt`) writes once under `Lock()` and never mutates the value after. It is safe to publish a pointer once and never replace it. +- The reader (the `/status` handler's halt-intent lookup) takes `RLock()`, copies a snapshot out of the lock, serializes outside. +- **The halt-intent population path inside the `/status` handler must `defer recover()`** so that a bug in halt-intent code (nil-deref, serialization panic, lock misuse) cannot 500 the broader `/status` response. On recover, log and emit `halt_intent: null`. `/status` continues to serve sync state. ### Discoverability -The Tier-2 sidecar knows about `/halt_intent` natively. For non-sidecar consumers (systemd, curl), the route is documented in seid's reference docs. We do **not** advertise it via `/status`'s `node_info.other` field — that would re-couple the two routes' lifecycles, which contradicts the failure-isolation rationale for splitting them. +`/status` is already the canonical place to ask seid "what's your state?". No new advertisement, no `node_info.other` entry, no version-gating dance. Consumers that know the field use it; consumers that don't keep working as before. ## What's NOT in this PR (or this design's scope) @@ -210,7 +211,7 @@ The Tier-2 sidecar knows about `/halt_intent` natively. For non-sidecar consumer ## Cross-repo coordination - **This PR (sei-config):** ships the contract. -- **Follow-up sei-chain issue/PR:** implements the producer (graceful-halt helper, stay-alive mode flag, `/halt_intent` route on sei-tendermint). -- **Follow-up sei-k8s-controller issue/PR:** implements the consumer (poll `/halt_intent`, branch on `ShutdownReason`, drive image swap or page). +- **Follow-up sei-chain issue/PR:** implements the producer (graceful-halt helper, stay-alive mode flag, `halt_intent` field on the `/status` response in sei-tendermint). +- **Follow-up sei-k8s-controller issue/PR:** implements the consumer (poll `/status`, branch on the `halt_intent` field's `reason`, drive image swap or page). -The sei-config PR must not merge until both follow-up issues are filed and linked. The producer must not merge until consumers can pin the matching sei-config version. The `/halt_intent` route can ship in sei-tendermint independently of the panic-replacement, but they should ship together to avoid releasing a route that always reports "healthy" for halt conditions a binary cannot otherwise survive. +The sei-config PR must not merge until both follow-up issues are filed and linked. The producer must not merge until consumers can pin the matching sei-config version. The `/status` field extension can ship in sei-tendermint independently of the panic-replacement, but they should ship together to avoid releasing a field that is always absent because the producer never populates it. diff --git a/docs/upgrade-shutdown-contract/ISSUE.md b/docs/upgrade-shutdown-contract/ISSUE.md index d3f4b2f..b805812 100644 --- a/docs/upgrade-shutdown-contract/ISSUE.md +++ b/docs/upgrade-shutdown-contract/ISSUE.md @@ -25,7 +25,7 @@ To any process supervisor — Kubernetes (`lastState.terminated.exitCode`), syst **In scope** for this issue / PR: - A new file in sei-config defining the contract: `ShutdownReason` enum, exit-code constants (70/71/72 with 73-79 reserved, 80-89 reserved for future non-upgrade graceful halts), `HaltIntent` struct (typed and JSON-serializable), `ParseExitCode` helper. ~30-50 lines plus a unit test. -- The companion design document (`DESIGN.md`) covering: producer-side change shape in sei-chain, opt-in stay-alive mode, the new `/halt_intent` route on sei-tendermint RPC, and cross-repo coordination. +- The companion design document (`DESIGN.md`) covering: producer-side change shape in sei-chain, opt-in stay-alive mode, the new optional `halt_intent` field on the existing `/status` response served by sei-tendermint RPC, and cross-repo coordination. **Out of scope:**