A stateless container deployment platform with three core principles:
- Workloads are disposable - containers can be killed and recreated at any time
- Two node types - proxy nodes handle public traffic, worker nodes run containers
- Networking is private-first - services communicate over WireGuard mesh, public exposure via proxy nodes
| Component | Choice | Rationale |
|---|---|---|
| Control Plane | Next.js (full-stack) | Single deployment, React frontend + API routes |
| Database | Postgres + Drizzle | Type-safe schema and queries via Drizzle, easy backup |
| Background Jobs | Inngest (self-hosted) | Durable workflows, event-driven orchestration, retries |
| Server Agent | Go | Single binary, shells out to Podman |
| Container Runtime | Podman | Docker-compatible, daemonless, bridge networking with static IPs |
| Reverse Proxy | Traefik | TLS termination with Let's Encrypt certificates, runs on proxy nodes only |
| Private Network | WireGuard (self-managed) | Full mesh, control plane coordinates |
| Service Discovery | Built-in DNS | Agent runs DNS server for .internal domains |
| Agent Communication | Pull-based HTTP | Agent polls for expected state, reports status |
| Type | Traefik | Public Traffic | Containers |
|---|---|---|---|
| Proxy | ✓ | Handles TLS termination | ✓ |
| Worker | ✗ | None | ✓ |
- Proxy nodes: Handle incoming public traffic, terminate TLS using certificates issued via HTTP-01 ACME, route to containers over WireGuard
- Worker nodes: Run containers only, no public exposure, lighter footprint
┌─────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Next.js (App Router + API Routes + Postgres) │ │
│ │ │ │
│ │ GET /api/v1/agent/expected-state (agent polls) │ │
│ │ POST /api/v1/agent/status (agent reports) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
▲
│ HTTPS (poll every 10s)
▼
┌─────────────────────────────────────────────────────────────────┐
│ SERVERS │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Proxy Node 1 │ │ Worker Node 1 │ │ Worker Node 2 │ │
│ │ │ │ │ │ │ │
│ │ WG: 10.100.1.1 │ │ WG: 10.100.2.1 │ │ WG: 10.100.3.1 │ │
│ │ Containers: │ │ Containers: │ │ Containers: │ │
│ │ 10.200.1.2-254 │ │ 10.200.2.2-254 │ │ 10.200.3.2-254 │ │
│ │ │ │ │ │ │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │
│ │ │ Agent │ │ │ │ Agent │ │ │ │ Agent │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ Podman │ │ │ │ Podman │ │ │ │ Podman │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ Traefik │ │ │ │ - │ │ │ │ - │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ DNS Server │ │ │ │ DNS Server │ │ │ │ DNS Server │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ WireGuard │ │ │ │ WireGuard │ │ │ │ WireGuard │ │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┴────────────────────┘ │
│ WireGuard Full Mesh │
└─────────────────────────────────────────────────────────────────┘
Public Traffic Flow:
Internet → DNS → Proxy Node → Traefik (TLS) → WireGuard → Container
The agent uses a two-state machine to prevent race conditions during reconciliation:
┌─────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌────────────┐ │
│ │ IDLE │───drift detected───────▶│ PROCESSING │ │
│ │ (poll) │◀────────────────────────│ (no poll) │ │
│ └─────────┘ done/failed/timeout └────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
- Poll control plane every 10 seconds for expected state
- Compare expected state vs actual state (containers, DNS, Traefik*, WireGuard)
- If no drift: send status report, stay in IDLE
- If drift detected: snapshot expected state, transition to PROCESSING
*Traefik drift detection only on proxy nodes
- Stop polling (use the expected state snapshot)
- Apply ONE change at a time with verification
- After each change, re-check drift
- If no drift remains: transition to IDLE
- Timeout after 5 minutes: force transition to IDLE
- Always send status report before transitioning to IDLE
The agent detects drift using hash comparisons:
- Containers: Missing, orphaned, wrong state, or image mismatch
- DNS: Hash of sorted records vs current DNS server config
- Traefik: Hash of sorted routes vs current Traefik config (proxy nodes only)
- WireGuard: Hash of sorted peers vs current wg0.conf
Order of operations:
- Stop orphan containers (no deployment ID)
- Start containers in "created" or "exited" state
- Deploy missing containers
- Redeploy containers with wrong state or image mismatch
- Update DNS records
- Update Traefik routes (proxy nodes only)
- Update WireGuard peers
Deployments go through these stages:
pending → pulling → starting → healthy → dns_updating → traefik_updating → stopping_old → running
| Stage | Description |
|---|---|
| `pending` | Deployment created, waiting for agent |
| `pulling` | Agent is pulling the container image |
| `starting` | Container started, waiting for health check |
| `healthy` | Health check passed (or no health check) |
| `dns_updating` | DNS records being updated |
| `traefik_updating` | Traefik routes being updated |
| `stopping_old` | Old deployment containers being stopped |
| `running` | Deployment complete and serving traffic |
Special states:
- `unknown`: Agent stopped reporting this deployment (container may still exist)
- `stopped`: Container explicitly stopped
- `failed`: Deployment failed (health check, etc.)
- `rolled_back`: Rollout failed, reverted to previous deployment
| Range | Purpose |
|---|---|
| `10.100.X.1` | WireGuard IP for server X (host mesh) |
| `10.200.X.2-254` | Container IPs on server X |
Where X = server's subnet ID (1-255).
Each server gets a /24 subnet for routing:
- Server 1: `10.100.1.0/24` → WireGuard IP: `10.100.1.1`
- Server 2: `10.100.2.0/24` → WireGuard IP: `10.100.2.1`
Full mesh topology - every server peers with every other server. AllowedIPs includes both WireGuard and container subnets:
```
AllowedIPs = 10.100.2.0/24, 10.200.2.0/24
```
Each server has a Podman bridge network:
```shell
podman network create \
  --driver bridge \
  --subnet 10.200.1.0/24 \
  --gateway 10.200.1.1 \
  --disable-dns \
  techulus
```

Containers get static IPs assigned by the control plane:
```shell
podman run -d \
  --name service-deployment \
  --network techulus \
  --ip 10.200.1.2 \
  --label techulus.deployment.id=<deployment-id> \
  --label techulus.service.id=<service-id> \
  traefik/whoami
```

Each agent runs a built-in DNS server for `.internal` domain resolution:
- Listens on the container gateway IP (e.g., `10.200.1.1`)
- Configures systemd-resolved to forward `.internal` queries
- Records pushed from control plane via expected state

Services resolve via `.internal` domain with round-robin across replicas.
Proxy nodes run Traefik with routes and certificates pushed from control plane:
- Routes configured via file provider in `/etc/traefik/dynamic/routes.yaml`
- Certificates configured via file provider in `/etc/traefik/dynamic/tls.yaml`
- Routes: `subdomain.example.com` → container IPs (via WireGuard mesh)
- TLS: Static certificates managed by control plane
- Challenge route: `/.well-known/acme-challenge/*` → control plane for ACME validation
- Control plane only sends routes and certificates to proxy nodes
Worker nodes do not run Traefik.
The platform supports multiple proxy nodes in different regions with automatic proximity steering:
- Users point custom domains to a single DNS name via GeoDNS (BunnyDNS)
- BunnyDNS routes clients to geographically nearest proxy based on their location
- BunnyDNS health checks automatically failover if a proxy goes down
- All proxies share the same TLS certificates (synced from control plane)
Example:
Proxy US: 1.2.3.4
Proxy EU: 5.6.7.8
Proxy SYD: 9.10.11.12
GeoDNS (BunnyDNS):
example.com → lb.techulus.cloud
→ BunnyDNS steers to nearest proxy based on client geography
→ Returns 1.2.3.4 (US), 5.6.7.8 (EU), or 9.10.11.12 (SYD)
→ Health checks: exclude proxy if down, failover to next nearest
ACME challenges work seamlessly because:
- Let's Encrypt validates the domain via single IP (any proxy)
- Challenge hits any proxy node (they're all interchangeable)
- All proxies have identical certificates
- If one proxy goes down, others already have the cert
Within a proxy node, traffic is distributed to replicas using weighted round-robin:
Replica Selection Priority:
- Local replicas (on same proxy server) - weight 5
- Remote replicas (on other proxy servers) - weight 1
This means if a service has 1 local replica and 1 remote replica, the local replica receives ~83% of traffic.
Traffic Flow:
User (US)
→ GeoDNS: nearest proxy = US (1.2.3.4)
→ Traefik: weighted round-robin
- Local replicas (weight 5) ← 83% of traffic
- Remote replicas (weight 1) ← 17% of traffic (failover)
→ Container
Benefits:
- Low latency: Requests stay on same proxy when possible
- Failover: If local replica fails, automatically uses remote
- Cost-effective: Minimizes cross-region traffic
Instead of each proxy managing its own ACME certificates, the control plane handles all certificate lifecycle:
Challenge Flow:
- Control plane initiates ACME renewal for expiring certificates
- Let's Encrypt requests validation: `GET http://domain/.well-known/acme-challenge/{token}`
- Request hits load balancer → any proxy node (all behind same IP)
- Traefik matches `PathPrefix(/.well-known/acme-challenge/)` → special challenge route
- Challenge route (via middleware) rewrites path to `/api/v1/acme/challenge/{token}`
- Traefik forwards to control plane: `https://control-plane.internal/api/v1/acme/challenge/{token}`
- Control plane returns keyAuthorization from database
- Let's Encrypt validates and issues certificate
Certificate Sync:
- Certificate issued and stored in `domain_certificates` table
- Control plane includes certificates in expected state API response (proxy nodes only)
- Agent receives certificates, writes to `/etc/traefik/certs/{domain}.crt` and `.key`
- Agent updates `/etc/traefik/dynamic/tls.yaml` with certificate paths
- Traefik reloads and serves TLS with new certificates
Renewal:
- Cron job checks daily for certificates expiring in 30 days
- Triggers ACME renewal via acme-client library
- Challenge responses served through any proxy node
- New certificates synced to all proxies within agent poll cycle (10 seconds)
Internal (service-to-service):
Container A (10.200.1.2)
→ DNS: redis.internal → 10.200.2.3
→ Packet to 10.200.2.3
→ Host routes via WireGuard to Server 2
→ Container B (10.200.2.3)
External (public) - Custom Domain:
User domain: example.com (points to proxy IP via A record or CNAME)
→ Internet → Proxy Node public IP
→ Traefik: example.com → 10.200.1.2:80 (TLS terminated)
→ WireGuard tunnel to target node
→ Container (10.200.1.2)
ACME Challenge (Let's Encrypt validation):
Let's Encrypt → HTTP request to example.com/.well-known/acme-challenge/{token}
→ Proxy Node (any of them, all same IP)
→ Traefik matches challenge route (priority 9999)
→ Middleware rewrites path to /api/v1/acme/challenge/{token}
→ Traefik backend: control plane HTTPS
→ Returns keyAuthorization
→ Let's Encrypt validates
Responsibilities:
- User authentication
- Project and service configuration
- WireGuard coordination (assigns subnets, broadcasts peer updates)
- Deployment orchestration (rollouts)
- Certificate lifecycle management (issuance, renewal, sync)
- Serves expected state to agents
- Processes status reports from agents
- Advances rollout stages based on deployment status
API Endpoints:
- `GET /api/v1/agent/expected-state` - Returns containers, DNS, Traefik (proxy only), WireGuard, certificates config
- `POST /api/v1/agent/status` - Receives container status, advances rollout stages
- `GET /api/v1/acme/challenge/{token}` - Returns ACME challenge keyAuthorization for Let's Encrypt validation
Background Jobs (Inngest):
- Rollout orchestration: Event-driven deployment workflow with health checks and DNS updates
- Migration orchestration: Backup, restore, and container migration workflows
- Build orchestration: Multi-architecture builds with manifest creation
- Backup/restore: Scheduled and on-demand volume backups
- Certificate renewal: ACME renewal for expiring certificates
Responsibilities:
- Polls control plane for expected state
- Manages containers via Podman with static IPs
- Manages local WireGuard interface
- Updates Traefik routes via file provider (proxy nodes only)
- Syncs TLS certificates to disk (proxy nodes only)
- Updates DNS records
- Reports status (resources, public IP, container health)
Agent Lifecycle:
- User creates server in control plane, receives agent token
- User runs install script (specifies if proxy node)
- User starts agent with token (and `--proxy` flag if proxy node)
- Agent generates WireGuard and signing keypairs
- Agent registers with control plane via HTTP (includes isProxy flag)
- Control plane assigns subnet, returns WireGuard peers
- Agent configures WireGuard, container network, DNS server, and Traefik (if proxy)
- Agent enters IDLE state, begins polling
Containers are tracked via Podman labels:
- `techulus.deployment.id` - Links container to deployment record
- `techulus.service.id` - Links container to service
- `techulus.service.name` - Human-readable service name
- Agent Authentication: HMAC signatures on all HTTP requests
- Request Signing: Body + timestamp signed with server-specific secret
- WireGuard: All inter-server traffic encrypted
- No Public Ports on Containers: Only reachable via WireGuard mesh
- Traefik: Only entry point for public traffic (proxy nodes only)
Registration Token:
- One-time-use token for initial registration
- Invalidated after successful registration
Request Signing:
- Agent signs request body with HMAC-SHA256
- Includes timestamp to prevent replay attacks
- Control plane verifies using stored server secret