Consume seid /status halt_intent for upgrade-driven supervisor decisions #134

@bdchatham

Description

Consumer-side tracking issue for the cross-repo upgrade-shutdown contract. Contract is defined in sei-protocol/sei-config — see sei-config#9 and the design doc in sei-config PR#8. Producer-side work tracked at sei-protocol/sei-chain.

Problem

When seid stops, the controller currently has no machine-readable signal explaining why it stopped. It cannot distinguish "operator forgot to upgrade the binary" from "state-machine bug" without log-scraping, which is brittle and racy.

The cross-repo contract being shipped via sei-config + sei-chain provides:

  • Distinct process exit codes (70/71/72) for graceful halt reasons.
  • An optional halt_intent field on seid's /status response carrying a structured HaltIntent (ShutdownReason, PlanName, Height, Info, AnnouncedAt).
  • An opt-in seid flag --halt-stay-alive that keeps the process running with /status serving after consensus halts, so the controller can poll the live signal before the pod terminates.

This issue tracks the controller-side wiring to consume that signal.

Scope

  1. Import sei-config to pick up ShutdownReason, ExitCode* constants, HaltIntent, ParseExitCode. Pin to the matching version once the contract ships.

  2. Poll seid's /status endpoint for the halt_intent field. When present and non-null, parse into the typed HaltIntent.

  3. Branch on ShutdownReason:

    • ShutdownReasonUpgradeRequired (70): operator forgot to upgrade. Look up image mapping for PlanName (see open question below). If mapping exists, patch the Pod template image and recreate. If missing, set status.conditions[Upgrading]=Unknown, reason=ImageMappingMissing, emit a Kubernetes Event, and page on-call.
    • ShutdownReasonBinaryTooNew (71): we shipped a too-new binary. Page immediately. Roll back the image to the CRD's previous status.runningImage. Do not auto-advance.
    • ShutdownReasonDowngradeDetected (72): state is ahead of binary by a completed upgrade — botched rollback. Page immediately. Do not restart with same image; do not auto-pick a newer image. Human required.
  4. Fallback signal: process exit code. For pods where --halt-stay-alive is off (or where the controller missed the live /status window), read status.containerStatuses[*].lastState.terminated.exitCode and apply the same branching via seiconfig.ParseExitCode.

  5. Pod-template guidance. Controller-managed pods should set restartPolicy: Never (so the kubelet doesn't crash-loop a binary that cannot make progress) and either disable cosmovisor or reconcile its presence with the new contract — see open question below.
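The pod-template guidance in step 5 might look like this fragment (illustrative only: the container name, image tag, and command are assumptions; only restartPolicy: Never and --halt-stay-alive come from this issue):

```yaml
spec:
  restartPolicy: Never                   # let the controller, not the kubelet, decide on restart
  containers:
    - name: seid                         # hypothetical container name
      image: ghcr.io/sei/seid:v5.9.0     # hypothetical pinned image
      command: ["seid", "start", "--halt-stay-alive"]
```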

"Done" criteria

  • Controller imports the shipped sei-config version and uses the typed enum throughout the supervisor branch.
  • /status poller wired to read halt_intent, with a graceful-degradation path if the field is absent (older seid version) — fall back to exit-code path.
  • Each ShutdownReason has a documented controller behavior path; emergency paging integrated for 71/72.
  • Image-mapping lookup mechanism designed and implemented (see open question).
  • Integration test simulating each halt scenario end-to-end (seid populates halt_intent → controller observes → correct action taken).

Open questions

  1. Image mapping source. When the controller observes Reason=70 PlanName="v6", where does it look up "v6" → ghcr.io/sei/seid:v6.0.1? CRD field on SeiNode? ConfigMap? Annotation? Pre-stage decision needs a Coral-style design.
  2. Cosmovisor coexistence. If the pod's container also runs cosmovisor, cosmovisor's own os.Exit(0) after a binary swap may mask our distinct exit codes from kubelet. Either disable cosmovisor in controller-managed pods, or define how the two coexist.
  3. Stay-alive grace window. What's the maximum duration the controller will let a stay-alive pod linger before forcing termination? Is this a CRD field, a controller-wide config, or always supervisor-driven?
