Consumer-side tracking issue for the cross-repo upgrade-shutdown contract. Contract is defined in sei-protocol/sei-config — see sei-config#9 and the design doc in sei-config PR#8. Producer-side work tracked at sei-protocol/sei-chain.
Problem
When seid stops, the controller currently has no machine-readable signal of why it stopped. It cannot distinguish "operator forgot to upgrade the binary" from "state-machine bug" without log-scraping, which is brittle and racy.
The cross-repo contract being shipped via sei-config + sei-chain provides:
- Distinct process exit codes (70/71/72) for graceful halt reasons.
- An optional `halt_intent` field on seid's `/status` response carrying a structured `HaltIntent` (`ShutdownReason`, `PlanName`, `Height`, `Info`, `AnnouncedAt`).
- An opt-in seid flag `--halt-stay-alive` that keeps the process running with `/status` serving after consensus halts, so the controller can poll the live signal before the pod terminates.
This issue tracks the controller-side wiring to consume that signal.
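For orientation, a minimal sketch of the consumer-visible shape, using the field names listed above. The canonical definitions ship in sei-protocol/sei-config and may differ, including the JSON tags and the numeric encoding shown here:

```go
// Placeholder sketch of the contract types as this controller would see
// them. Treat every name and tag below as an assumption until the
// sei-config version ships and we pin to it.
package seiconfig

import "time"

type ShutdownReason int

// Values mirror the process exit codes for readability; the real contract
// may encode reasons and exit codes separately (the ExitCode* constants).
const (
	ShutdownReasonUpgradeRequired   ShutdownReason = 70 // operator never staged the new binary
	ShutdownReasonBinaryTooNew      ShutdownReason = 71 // binary is ahead of chain state
	ShutdownReasonDowngradeDetected ShutdownReason = 72 // state ahead of binary: botched rollback
)

// HaltIntent is the structured payload surfaced as halt_intent on /status.
type HaltIntent struct {
	ShutdownReason ShutdownReason `json:"shutdown_reason"`
	PlanName       string         `json:"plan_name"`
	Height         int64          `json:"height"`
	Info           string         `json:"info"`
	AnnouncedAt    time.Time      `json:"announced_at"`
}
```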
Scope
- Import sei-config to pick up `ShutdownReason`, the `ExitCode*` constants, `HaltIntent`, and `ParseExitCode`. Pin to the matching version once the contract ships.
- Poll seid's `/status` endpoint for the `halt_intent` field. When present and non-null, parse it into the typed `HaltIntent`.
- Branch on `ShutdownReason` (see the supervisor sketch after this list):
  - `ShutdownReasonUpgradeRequired` (70): operator forgot to upgrade. Look up the image mapping for `PlanName` (see open question below). If a mapping exists, patch the Pod template image and recreate. If it is missing, set `status.conditions[Upgrading]=Unknown` with `reason=ImageMappingMissing`, emit a Kubernetes Event, and page on-call.
  - `ShutdownReasonBinaryTooNew` (71): we shipped a too-new binary. Page immediately. Roll back the image to the CRD's previous `status.runningImage`. Do not auto-advance.
  - `ShutdownReasonDowngradeDetected` (72): state is ahead of the binary by a completed upgrade, i.e. a botched rollback. Page immediately. Do not restart with the same image; do not auto-pick a newer image. Human required.
- Fallback signal: process exit code. For pods where `--halt-stay-alive` is off (or where the controller missed the live `/status` window), read `status.containerStatuses[*].lastState.terminated.exitCode` and apply the same branching via `seiconfig.ParseExitCode`.
- Pod-template guidance. Controller-managed pods should set `restartPolicy: Never` (so the kubelet doesn't crash-loop a binary that is guaranteed to exit again) and either disable cosmovisor or reconcile its presence with the new contract; see open question below.
"Done" criteria
- Controller imports the shipped sei-config version and uses the typed enum throughout the supervisor branch.
- `/status` poller wired to read `halt_intent`, with a graceful-degradation path when the field is absent (older seid version): fall back to the exit-code path.
- Each `ShutdownReason` has a documented controller behavior path; emergency paging is integrated for 71/72.
- Image-mapping lookup mechanism designed and implemented (see open question).
- Integration test simulating each halt scenario end-to-end (seid populates `halt_intent` → controller observes → correct action taken); a test-table sketch follows this list.
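One possible shape for that end-to-end test, as a table-driven Go sketch. `TestHaltScenarios` and the expected-action strings are invented here to show the scenario matrix, not the real harness:

```go
package supervisor_test

import "testing"

// Hypothetical scenario table for the end-to-end halt test. Each entry pins
// one ShutdownReason to the controller action the Scope section mandates;
// the fake-seid and controller plumbing is deliberately elided.
func TestHaltScenarios(t *testing.T) {
	cases := []struct {
		name       string
		exitCode   int
		wantAction string
	}{
		{"upgrade required, mapping present", 70, "patch image and recreate"},
		{"upgrade required, mapping missing", 70, "condition Unknown + Event + page"},
		{"binary too new", 71, "page + roll back to status.runningImage"},
		{"downgrade detected", 72, "page + hold for human"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			t.Skip("sketch only: wire up fake seid and controller harness here")
		})
	}
}
```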
Open questions
- Image mapping source. When the controller observes `Reason=70 PlanName="v6"`, where does it look up `"v6" → ghcr.io/sei/seid:v6.0.1`? A CRD field on `SeiNode`? A ConfigMap? An annotation? The pre-stage decision needs a Coral-style design. (One candidate shape is sketched after this list.)
- Cosmovisor coexistence. If the pod's container also runs cosmovisor, cosmovisor's own `os.Exit(0)` after a binary swap may mask our distinct exit codes from the kubelet. Either disable cosmovisor in controller-managed pods, or define how the two coexist.
- Stay-alive grace window. What's the maximum duration the controller will let a stay-alive pod linger before forcing termination? Is this a CRD field, a controller-wide config, or always supervisor-driven?
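To make the image-mapping question concrete, one candidate shape and nothing more: a plan-name-to-image map on the `SeiNode` spec. All field names are invented for illustration; a ConfigMap or annotation scheme remains equally on the table:

```go
// Hypothetical CRD fragment: one way to resolve PlanName -> image without
// leaving the SeiNode object. Invented field names, not a design decision.
package v1alpha1

type SeiNodeSpec struct {
	// UpgradeImages maps an upgrade plan name to the image to roll to,
	// e.g. "v6" -> "ghcr.io/sei/seid:v6.0.1".
	UpgradeImages map[string]string `json:"upgradeImages,omitempty"`
}

type SeiNodeStatus struct {
	// RunningImage records the last image known to run cleanly; the
	// 71 (BinaryTooNew) branch rolls back to this value.
	RunningImage string `json:"runningImage,omitempty"`
}
```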
References
- sei-protocol/sei-config#9: contract definition.
- sei-protocol/sei-config PR#8: design doc.
- sei-protocol/sei-chain: producer-side work.