
feat(stack): pluggable backend system with native k3s support #135

Open
bussyjd wants to merge 20 commits into main from feature/k3s-backend

Conversation

bussyjd (Collaborator) commented Feb 6, 2026

Summary

  • Introduces a Backend interface that abstracts cluster lifecycle, enabling both k3d (default) and native k3s backends (a rough sketch of the interface follows this list)
  • Native k3s is a prerequisite for TEE/Confidential Computing — k3d cannot provide the direct hardware access needed for AMD SEV-SNP, Intel TDX, or GPU TEE workloads
  • Fixes pre-existing helmfile template issues (eRPC secretEnv type mismatch, obol-frontend escaped quotes, .Values.* unavailable during gotmpl first-pass rendering)
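
For orientation, here is a minimal sketch of a Backend interface carrying the method set named in this PR; the exact signatures (context parameters, return types) are assumptions, not taken from the code:

```go
package stack

import "context"

// Backend abstracts cluster lifecycle so the CLI can drive either the
// Docker-based k3d backend (the default) or a native k3s process.
// The method set mirrors the PR description; signatures are assumed.
type Backend interface {
	// Init prepares configuration and the on-disk layout for the cluster.
	Init(ctx context.Context) error
	// Up starts the cluster and blocks until the API server is reachable.
	Up(ctx context.Context) error
	// Down stops the cluster but keeps its data.
	Down(ctx context.Context) error
	// Destroy stops the cluster and removes all of its state.
	Destroy(ctx context.Context) error
	// IsRunning reports whether the cluster is currently up.
	IsRunning(ctx context.Context) (bool, error)
	// DataDir returns the directory holding cluster state.
	DataDir() string
}
```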

What changed

| Area | Change |
| --- | --- |
| Backend interface | New `Backend` interface with `Init`, `Up`, `Down`, `Destroy`, `IsRunning`, and `DataDir`; k3d logic extracted into `K3dBackend`, new `K3sBackend` added |
| k3s process management | PID tracking, `sudo kill -0` liveness checks, process group signals, `k3s-killall.sh` cleanup, API server readiness polling (sketched after this table) |
| Helmfile templates | `helmfile.yaml` → `helmfile.yaml.gotmpl`; env vars replace `.Values.*` references; `KUBECONFIG` propagated to hooks |
| eRPC values | `secretEnv` changed from a nested map to `{}`; the secret is injected via `extraEnv` with `valueFrom.secretKeyRef` |
| obol-frontend values | Replaced `{{ printf \"...\" }}` with direct interpolation and single-quoted `env` calls |
| Tests | 26 unit tests (backend selection, PID parsing, config, Init templates) plus 10 integration test scenarios behind `//go:build integration` |
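
The API server readiness polling mentioned in the k3s process management row could look roughly like this; the endpoint, port, and relaxed TLS handling are assumptions for illustration, not the PR's actual implementation:

```go
package stack

import (
	"context"
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// waitForAPIServer polls the k3s API server's /readyz endpoint until it
// answers 200 or the context expires. The real backend may instead use
// the generated kubeconfig and client certificates.
func waitForAPIServer(ctx context.Context) error {
	client := &http.Client{
		Timeout:   2 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		resp, err := client.Get("https://127.0.0.1:6443/readyz")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // API server is serving and reports ready
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("k3s API server not ready: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}
```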

Test results

  • Unit tests: 26/26 pass with -race
  • K3s integration: 32/33 flow tests pass (11 scenarios: init, up, kubectl, idempotent, down, restart, purge)
  • Helmfile deploy: All 10 releases succeed on k3s (base, reloader, monitoring, gateway-api-crds, traefik, cloudflared, erpc, erpc-httproute, obol-frontend, obol-frontend-httproute)

Test plan

  • Unit tests pass (`go test -race ./internal/stack/`); integration scenarios are gated behind a build tag (sketched after this list)
  • K3s: `stack init --backend k3s` → `stack up` → full helmfile deploy
  • K3s: `stack down` → `stack up` restart cycle
  • K3s: `stack purge --force` full cleanup
  • K3d: backward compatibility (blocked by local Docker/kernel issue, not code-related)
  • Network install on k3s backend
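
The integration scenarios are compiled only under the //go:build integration constraint mentioned above, so the plain unit-test run stays fast. A minimal sketch of that gating (the test name and body are illustrative):

```go
//go:build integration

package stack_test

import "testing"

// Compiled only when run with the integration tag, e.g.:
//   go test -race -tags integration ./internal/stack/
func TestK3sLifecycle(t *testing.T) {
	t.Log("integration-only: exercises init, up, down, and purge against a real backend")
}
```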

Closes #134

bussyjd and others added 18 commits January 12, 2026 12:26

Update dependency versions to latest stable releases:
- kubectl: 1.31.0 → 1.35.0
- helm: 3.19.1 → 3.19.4
- helmfile: 1.2.2 → 1.2.3
- k9s: 0.32.5 → 0.50.18
- helm-diff: 3.9.11 → 3.14.1

k3d remains at 5.8.3 (already current).

Replace nginx-ingress controller with Traefik 38.0.2 using Kubernetes
Gateway API for routing. This addresses the nginx-ingress deprecation
(end of maintenance March 2026).

Changes:
- Remove --disable=traefik from k3d config to use k3s built-in Traefik
- Replace nginx-ingress helm release with Traefik 38.0.2 in infrastructure
- Configure Gateway API provider with cross-namespace routing support
- Add GatewayClass and Gateway resources via Traefik helm chart
- Convert all Ingress resources to HTTPRoute format:
  - eRPC: /rpc path routing
  - obol-frontend: / path routing
  - ethereum: /execution and /beacon path routing with URL rewrite
  - aztec: namespace-based path routing with URL rewrite
  - helios: namespace-based path routing with URL rewrite
- Disable legacy Ingress in service helm values

Closes #125

Add Cloudflare Tunnel integration to expose obol-stack services publicly
without port forwarding or static IPs. Uses quick tunnel mode for MVP.

Changes:
- Add cloudflared Helm chart (internal/embed/infrastructure/cloudflared/)
- Add tunnel management package (internal/tunnel/)
- Add CLI commands: obol tunnel status/restart/logs
- Integrate cloudflared into infrastructure helmfile

The tunnel deploys automatically with `obol stack up` and provides a
random trycloudflare.com URL accessible via `obol tunnel status`.

Future: Named tunnel support for persistent URLs (obol tunnel login)

Update documentation to reflect the upgraded dependency versions
in obolup.sh. This keeps the documentation in sync with the actual
pinned versions used by the bootstrap installer.

Introduce a Backend interface that abstracts cluster lifecycle management,
enabling both k3d (Docker-based, default) and k3s (native bare-metal) backends.
This is a prerequisite for TEE/Confidential Computing workloads which require
direct hardware access that k3d cannot provide.

Changes:
- Add Backend interface (Init, Up, Down, Destroy, IsRunning, DataDir)
- Extract k3d logic into K3dBackend with backward-compatible fallback
- Add K3sBackend with sudo process management, PID tracking, and
  API server readiness checks
- Convert helmfile.yaml to helmfile.yaml.gotmpl using env vars instead of .Values references (fixes first-pass template rendering; the env-var invocation is sketched after this commit message)
- Fix eRPC secretEnv type mismatch (map vs string for b64enc)
- Fix obol-frontend escaped quotes in gotmpl expressions
- Add KUBECONFIG env var to helmfile command for hook compatibility
- Add 26 unit tests and 10 integration test scenarios

Closes #134
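
Since the .gotmpl now reads its inputs from the environment, the CLI has to export them (plus KUBECONFIG for hook compatibility) when it shells out to helmfile, as referenced in the commit message above. A rough sketch of that invocation; STACK_NETWORK and STACK_PUBLIC_DOMAIN appear in the reviewed template below, while the function name and remaining details are illustrative:

```go
package stack

import (
	"fmt"
	"os"
	"os/exec"
)

// runHelmfile applies the given helmfile with the values the .gotmpl
// reads via env, since .Values.* is unavailable during the first
// rendering pass. KUBECONFIG is exported so hooks talk to the right cluster.
func runHelmfile(helmfilePath, kubeconfig, network, publicDomain string) error {
	cmd := exec.Command("helmfile", "--file", helmfilePath, "apply")
	cmd.Env = append(os.Environ(),
		"KUBECONFIG="+kubeconfig,
		"STACK_NETWORK="+network,
		"STACK_PUBLIC_DOMAIN="+publicDomain,
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("helmfile apply failed: %w", err)
	}
	return nil
}
```
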

Adds a Claude Code skill (`/test-backend`) with bash scripts that
exercise the full backend lifecycle: init, up, kubectl, down, restart,
and purge for both k3d and k3s backends.

Comment on lines 1 to 10
- {{- $network := .Values.network -}}
+ {{- $network := env "STACK_NETWORK" | default "mainnet" -}}
{{- $publicDomain := env "STACK_PUBLIC_DOMAIN" | default "obol.stack" -}}
{{- $chainId := 1 -}} {{/* Default: mainnet */}}
{{- if eq $network "hoodi" -}}
{{- $chainId = 560048 -}}
{{- else if eq $network "sepolia" -}}
{{- $chainId = 11155111 -}}
{{- else if ne $network "mainnet" -}}
{{- fail (printf "Unknown network: %s. Supported networks: mainnet, hoodi, sepolia" $network) -}}
{{- end -}}
A contributor commented:

Why do we have to choose only one L1? Can't we have all of these wired up? (Well, hoodi and mainnet are what we host on the DV Labs side.)

Comment on lines 20 to 21
- name: stakater
  url: https://stakater.github.io/stakater-charts
A contributor commented:

what is this for?

The k3s Down() method was using kill -TERM with a negative PID (process
group kill), which could kill unrelated system processes like
systemd-logind sharing the same process group as the sudo wrapper. This
caused the entire desktop session to crash.

Changes:
- Kill only the specific sudo/k3s process, not the process group (see the sketch after this commit message)
- Remove unused Setpgid/syscall since we no longer use process groups
- Add containerd-shim cleanup fallback for binary-only k3s installs
- Add 600s helm timeout for kube-prometheus-stack deployment
- Disable admission webhook pre-install hooks that timeout on fresh k3s
- Fix flaky test: replace fixed sleep with polling loop for API shutdown
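
The fix hinges on signalling a positive PID (one process) rather than a negative PID (the whole process group). A minimal sketch of that behaviour, assuming the backend tracks the sudo-wrapped k3s PID; function and variable names are illustrative:

```go
package stack

import (
	"os/exec"
	"strconv"
)

// stopK3s signals only the tracked k3s process, never its process group.
// A group kill (kill -TERM -<pid>) was what took down unrelated processes
// sharing the group, as described above.
func stopK3s(pid int) error {
	// "kill -0" checks liveness without delivering a signal; run via sudo
	// because k3s runs as root.
	if err := exec.Command("sudo", "kill", "-0", strconv.Itoa(pid)).Run(); err != nil {
		return nil // process is already gone, nothing to do
	}
	// Positive PID: terminate only the specific sudo/k3s process.
	return exec.Command("sudo", "kill", "-TERM", strconv.Itoa(pid)).Run()
}
```
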
Base automatically changed from integration-okr-1 to main February 17, 2026 18:47
