
Replace x/upgrade panics with graceful halt path emitting structured halt_intent #3311

@bdchatham


Producer-side tracking issue for the cross-repo upgrade-shutdown contract. Contract is defined in sei-protocol/sei-config — see sei-config#9 and the design doc in sei-config PR#8.

Problem

sei-cosmos/x/upgrade/abci.go calls panic() in three places when the running binary cannot proceed past an on-chain upgrade plan:

| Line | Reason | Writes upgrade-info.json |
|------|--------|--------------------------|
| 44 | Downgrade detected: last applied plan has no handler in this binary | no |
| 99 | Binary too new: has handler whose height has not arrived (non-minor upgrade) | no |
| 119 (panicUpgradeNeeded) | At/past upgrade height, no handler for plan | yes |

All three exit with Go's panic exit code (2). Process supervisors (k8s, systemd, sei-k8s-controller) cannot tell these halts apart from a genuine crash without parsing stderr, which is brittle and racy.

For comparison, the existing --halt-height operator path (BaseApp.halt() at sei-cosmos/baseapp/abci.go:288-321) is already graceful: it sends SIGINT to itself and the existing shutdown defer in server/start.go:459-484 handles teardown cleanly. The upgrade halt is the outlier.

Scope

Implement the producer side of the contract defined in sei-config:

  1. Replace the three panics with a new gracefulHalt(reason, intent) helper that:

    • Sets a HaltIntent value on a process-global holder (RWMutex-protected; new small package importable by both x/upgrade and sei-tendermint's /status handler).
    • Performs the existing logging + raw stderr write + (line-119 path only) upgrade-info.json write — these artifacts continue exactly as today for cosmovisor regex compat.
    • Sends SIGINT to self, mirroring BaseApp.halt().
    • Separately, the shutdown defer in sei-cosmos/server/start.go is extended to look up the holder, call seiconfig.ParseExitCode(reason), and call os.Exit(N) with the matching distinct code (70/71/72). This replaces the implicit panic-exit-2 with a deterministic, supervisor-readable exit code.
  2. Extend the /status handler in sei-tendermint to populate an optional halt_intent field on its ResultStatus response, sourced from the holder. The field has omitempty semantics: absent when healthy, populated when halted. The halt-intent population path must defer recover() so that a halt-intent bug cannot turn the broader /status response into an HTTP 500.

  3. New flag --halt-stay-alive (or similar; bikeshed welcome). Default false preserves today's behavior (graceful exit with distinct code). When true, gracefulHalt populates the holder, stops consensus, and leaves the four servers (Tendermint RPC, Cosmos gRPC, Cosmos REST, EVM RPC) running indefinitely. Process exits only on external SIGTERM/SIGINT. Requires splitting the shared goCtx in sei-cosmos/server/start.go so consensus and the servers have separate cancellation. Contained surgery to that one file.

  4. Import sei-config for ShutdownReason, ExitCode* constants, HaltIntent, ParseExitCode. The library is the single source of truth for the contract shape.

"Done" criteria

  • All three panic sites in sei-cosmos/x/upgrade/abci.go replaced with calls to gracefulHalt. Can land sequentially — line-119 path (most common: operator forgot to upgrade) is the natural first target.
  • New holder package, with the lock contract documented (write-once under Lock(), reads under RLock() with snapshot copy outside the lock).
  • /status response extended with the halt_intent field; existing /status consumers (tests, scripts) keep working.
  • --halt-stay-alive flag plumbed through, with stay-alive behavior verified end-to-end (consensus stops, servers stay up, /status returns the populated halt_intent, exit only on external signal).
  • Distinct exit codes (70/71/72) verified via integration test.
  • upgrade-info.json continues to be written exactly as today on the line-119 path (no schema change).
  • sei-config dependency pinned at the version that ships the contract.

Out of scope

  • Persisted halt-intent.json file. The live /status field is the single source of truth.
  • Termination-log JSON write to /dev/termination-log.
  • Halt-intent on EVM RPC, Admin gRPC, or Cosmos gRPC surfaces. Tier 1 has one canonical surface.
  • The consumer side in sei-k8s-controller — separate tracking issue.

Open questions

  1. Cosmovisor coexistence. If validator pods run cosmovisor inside the container and we ship stay-alive mode, does cosmovisor swallow our exit code via its own restart loop? May require pod-template guidance for the controller-managed deployment topology.
  2. Holder package home. Lives in sei-cosmos so both x/upgrade (writer) and sei-tendermint (reader) can import without a circular dep? Or in a tiny separate repo? Pick during implementation.
  3. Flag name. --halt-stay-alive / --halt-stays-running / something else. Bikeshed.
