Techulus Cloud - Architecture

Overview

A stateless container deployment platform with three core principles:

Workloads are disposable - containers can be killed and recreated at any time
Two node types - proxy nodes handle public traffic and run containers, worker nodes only run containers
Networking is private-first - services communicate over a WireGuard mesh, and public exposure happens only through proxy nodes

Tech Stack

Component	Choice	Rationale
Control Plane	Next.js (full-stack)	Single deployment, React frontend + API routes
Database	Postgres + Drizzle	Simple, typed schema via Drizzle, easy backup
Background Jobs	Inngest (self-hosted)	Durable workflows, event-driven orchestration, retries
Server Agent	Go	Single binary, shells out to Podman
Container Runtime	Podman	Docker-compatible, daemonless, bridge networking with static IPs
Reverse Proxy	Traefik	HTTPS with Let's Encrypt certificates (issued by the control plane), runs on proxy nodes only
Private Network	WireGuard (self-managed)	Full mesh, control plane coordinates
Service Discovery	Built-in DNS	Agent runs DNS server for .internal domains
Agent Communication	Pull-based HTTP	Agent polls for expected state, reports status

Node Types

Type	Traefik	Public Traffic	Containers
Proxy	✓	Handles TLS termination	✓
Worker	✗	None	✓

Proxy nodes: Handle incoming public traffic, terminate TLS with certificates issued via HTTP-01 ACME, and route to containers over WireGuard
Worker nodes: Run containers only, no public exposure, lighter footprint

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                         CONTROL PLANE                           │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │   Next.js (App Router + API Routes + Postgres)           │  │
│  │                                                          │  │
│  │   GET /api/v1/agent/expected-state  (agent polls)        │  │
│  │   POST /api/v1/agent/status         (agent reports)      │  │
│  └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                              ▲
                              │ HTTPS (poll every 10s)
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                          SERVERS                                 │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  Proxy Node 1   │  │  Worker Node 1  │  │  Worker Node 2  │ │
│  │                 │  │                 │  │                 │ │
│  │ WG: 10.100.1.1  │  │ WG: 10.100.2.1  │  │ WG: 10.100.3.1  │ │
│  │ Containers:     │  │ Containers:     │  │ Containers:     │ │
│  │  10.200.1.2-254 │  │  10.200.2.2-254 │  │  10.200.3.2-254 │ │
│  │                 │  │                 │  │                 │ │
│  │ ┌─────────────┐ │  │ ┌─────────────┐ │  │ ┌─────────────┐ │ │
│  │ │    Agent    │ │  │ │    Agent    │ │  │ │    Agent    │ │ │
│  │ ├─────────────┤ │  │ ├─────────────┤ │  │ ├─────────────┤ │ │
│  │ │   Podman    │ │  │ │   Podman    │ │  │ │   Podman    │ │ │
│  │ ├─────────────┤ │  │ ├─────────────┤ │  │ ├─────────────┤ │ │
│  │ │   Traefik   │ │  │ │      -      │ │  │ │      -      │ │ │
│  │ ├─────────────┤ │  │ ├─────────────┤ │  │ ├─────────────┤ │ │
│  │ │  DNS Server │ │  │ │  DNS Server │ │  │ │  DNS Server │ │ │
│  │ ├─────────────┤ │  │ ├─────────────┤ │  │ ├─────────────┤ │ │
│  │ │  WireGuard  │ │  │ │  WireGuard  │ │  │ │  WireGuard  │ │ │
│  │ └─────────────┘ │  │ └─────────────┘ │  │ └─────────────┘ │ │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘ │
│           │                    │                    │          │
│           └────────────────────┴────────────────────┘          │
│                      WireGuard Full Mesh                       │
└─────────────────────────────────────────────────────────────────┘

Public Traffic Flow:
  Internet → DNS → Proxy Node → Traefik (TLS) → WireGuard → Container

Agent State Machine

The agent uses a two-state machine to prevent race conditions during reconciliation:

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│    ┌─────────┐                         ┌────────────┐          │
│    │  IDLE   │───drift detected───────▶│ PROCESSING │          │
│    │ (poll)  │◀────────────────────────│  (no poll) │          │
│    └─────────┘    done/failed/timeout  └────────────┘          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

IDLE State

Poll control plane every 10 seconds for expected state
Compare expected state vs actual state (containers, DNS, Traefik*, WireGuard)
If no drift: send status report, stay in IDLE
If drift detected: snapshot expected state, transition to PROCESSING

*Traefik drift detection only on proxy nodes

PROCESSING State

Stop polling (use the expected state snapshot)
Apply ONE change at a time with verification
After each change, re-check drift
If no drift remains: transition to IDLE
Timeout after 5 minutes: force transition to IDLE
Always send status report before transitioning to IDLE
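The two states and their transitions can be sketched as follows. This is a minimal illustration, not the agent's actual code: the `Agent` type, the `Tick` method, and the use of Unix seconds for time are assumptions made so the logic stays small and testable.

```go
package main

import "fmt"

// State is the agent's reconciliation state.
type State string

const (
	Idle       State = "IDLE"
	Processing State = "PROCESSING"
)

// timeoutSeconds mirrors the 5 minute PROCESSING timeout described above.
const timeoutSeconds = 5 * 60

// Agent is a hypothetical holder for the state machine's fields.
type Agent struct {
	state     State
	startedAt int64 // Unix seconds when PROCESSING began
}

// NewAgent returns an agent in the IDLE state.
func NewAgent() *Agent { return &Agent{state: Idle} }

// Tick advances the machine once. drift says whether expected and actual
// state differ; now is Unix seconds, injected so the logic is testable.
func (a *Agent) Tick(drift bool, now int64) State {
	switch a.state {
	case Idle:
		if drift {
			a.state, a.startedAt = Processing, now
		}
	case Processing:
		// Leave PROCESSING when drift is gone or the timeout has elapsed.
		// A real agent would send its status report before this transition.
		if !drift || now-a.startedAt >= timeoutSeconds {
			a.state = Idle
		}
	}
	return a.state
}

func main() {
	a := NewAgent()
	fmt.Println(a.Tick(true, 0))   // PROCESSING
	fmt.Println(a.Tick(true, 60))  // PROCESSING
	fmt.Println(a.Tick(true, 300)) // IDLE (timeout reached)
}
```

Injecting the clock keeps the timeout path easy to exercise without waiting five minutes.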

Drift Detection

The agent detects drift per resource type. Containers are compared directly; DNS, Traefik, and WireGuard are compared by hash:

Containers: Missing, orphaned, wrong state, or image mismatch
DNS: Hash of sorted records vs current DNS server config
Traefik: Hash of sorted routes vs current Traefik config (proxy nodes only)
WireGuard: Hash of sorted peers vs current wg0.conf
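A hash-of-sorted-records comparison might look like the sketch below. The choice of SHA-256, the `name=ip` record encoding, and the function names are assumptions for illustration; the document only states that sorted records are hashed and compared.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// HashRecords returns a stable digest of a record set. Entries are sorted
// first so that ordering differences do not register as drift.
func HashRecords(records []string) string {
	sorted := append([]string(nil), records...) // copy, leave the caller's slice untouched
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))
	return hex.EncodeToString(sum[:])
}

// HasDrift reports whether the expected and actual record sets differ.
func HasDrift(expected, actual []string) bool {
	return HashRecords(expected) != HashRecords(actual)
}

func main() {
	expected := []string{"redis.internal=10.200.2.3", "api.internal=10.200.1.2"}
	actual := []string{"api.internal=10.200.1.2", "redis.internal=10.200.2.3"}
	fmt.Println(HasDrift(expected, actual)) // false: same records, different order
}
```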

Container Reconciliation

Order of operations:

Stop orphan containers (no deployment ID)
Start containers in "created" or "exited" state
Deploy missing containers
Redeploy containers with wrong state or image mismatch
Update DNS records
Update Traefik routes (proxy nodes only)
Update WireGuard peers
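Combined with the "one change at a time" rule, the container part of this ordering can be read as a function that returns only the next action. The sketch below is hypothetical: the `Container` type, the action strings, and treating a container whose deployment ID is unknown to the expected state as an orphan are all assumptions. DNS, Traefik, and WireGuard steps are omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Container is a hypothetical view of one container on the node.
type Container struct {
	Name         string
	DeploymentID string // empty when the container has no deployment label
	Image        string
	State        string // e.g. "running", "created", "exited"
}

// NextAction returns the first container action required, following the
// order above, or "" when containers match. expected maps deployment ID
// to the image that should be running.
func NextAction(expected map[string]string, actual []Container) string {
	byID := map[string]Container{}
	for _, c := range actual {
		if _, ok := expected[c.DeploymentID]; !ok {
			return "stop-orphan:" + c.Name // step 1
		}
		byID[c.DeploymentID] = c
	}
	for _, c := range actual {
		if c.State == "created" || c.State == "exited" {
			return "start:" + c.DeploymentID // step 2
		}
	}
	ids := make([]string, 0, len(expected))
	for id := range expected {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic order across runs
	for _, id := range ids {
		if _, ok := byID[id]; !ok {
			return "deploy:" + id // step 3
		}
	}
	for _, id := range ids {
		if c := byID[id]; c.Image != expected[id] || c.State != "running" {
			return "redeploy:" + id // step 4
		}
	}
	return ""
}

func main() {
	expected := map[string]string{"d1": "app:2"}
	actual := []Container{{Name: "old", DeploymentID: "d1", Image: "app:1", State: "running"}}
	fmt.Println(NextAction(expected, actual)) // redeploy:d1
}
```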

Rollout Stages

Deployments go through these stages:

pending → pulling → starting → healthy → dns_updating → traefik_updating → stopping_old → running

Stage	Description
`pending`	Deployment created, waiting for agent
`pulling`	Agent is pulling the container image
`starting`	Container started, waiting for health check
`healthy`	Health check passed (or no health check)
`dns_updating`	DNS records being updated
`traefik_updating`	Traefik routes being updated
`stopping_old`	Old deployment containers being stopped
`running`	Deployment complete and serving traffic

Special states:

unknown: Agent stopped reporting this deployment (container may still exist)
stopped: Container explicitly stopped
failed: Deployment failed (health check, etc.)
rolled_back: Rollout failed, reverted to previous deployment
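The linear stage progression can be modelled as an ordered list with a lookup for the next stage. This is a sketch under the assumption that stages are plain strings; `NextStage` is a hypothetical helper, and the special states are treated as having no successor.

```go
package main

import "fmt"

// stages lists the rollout stages in the order given above.
var stages = []string{
	"pending", "pulling", "starting", "healthy",
	"dns_updating", "traefik_updating", "stopping_old", "running",
}

// NextStage returns the stage that follows current. It returns false for
// the final stage and for anything outside the list (for example the
// special states "failed" or "rolled_back").
func NextStage(current string) (string, bool) {
	for i, s := range stages {
		if s == current && i+1 < len(stages) {
			return stages[i+1], true
		}
	}
	return "", false
}

func main() {
	next, ok := NextStage("healthy")
	fmt.Println(next, ok) // dns_updating true
}
```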

Networking

IP Address Scheme

Range	Purpose
`10.100.X.1`	WireGuard IP for server X (host mesh)
`10.200.X.2-254`	Container IPs on server X

Where X = server's subnet ID (1-255).
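As a worked example of the scheme, the addresses can be derived from the subnet ID alone. The function names and the error value below are illustrative assumptions; only the address patterns and ranges come from the table.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRange is returned when an ID falls outside the documented ranges.
var ErrRange = errors.New("value out of range")

// WireGuardIP returns the host mesh address 10.100.X.1 for subnet ID X.
func WireGuardIP(subnetID int) (string, error) {
	if subnetID < 1 || subnetID > 255 {
		return "", ErrRange
	}
	return fmt.Sprintf("10.100.%d.1", subnetID), nil
}

// ContainerIP returns 10.200.X.H for subnet ID X and host part H (2-254).
// Host part 1 is the bridge gateway, so it is not handed out.
func ContainerIP(subnetID, host int) (string, error) {
	if subnetID < 1 || subnetID > 255 || host < 2 || host > 254 {
		return "", ErrRange
	}
	return fmt.Sprintf("10.200.%d.%d", subnetID, host), nil
}

func main() {
	wg, _ := WireGuardIP(2)
	c, _ := ContainerIP(2, 3)
	fmt.Println(wg, c) // 10.100.2.1 10.200.2.3
}
```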

WireGuard Mesh (Host-to-Host)

Each server gets a /24 subnet for routing:

Server 1: 10.100.1.0/24 → WireGuard IP: 10.100.1.1
Server 2: 10.100.2.0/24 → WireGuard IP: 10.100.2.1

Full mesh topology - every server peers with every other server. AllowedIPs includes both WireGuard and container subnets:

AllowedIPs = 10.100.2.0/24, 10.200.2.0/24
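For a full mesh, each server needs one `[Peer]` section per other server. The sketch below renders those sections in the standard WireGuard configuration format. The `Peer` type, the example keys, and the endpoint addresses are placeholders assumed for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Peer is a hypothetical record the control plane might return per server.
type Peer struct {
	SubnetID  int
	PublicKey string
	Endpoint  string // host:port of the peer's public address
}

// AllowedIPs returns the WireGuard and container subnets for a server.
func AllowedIPs(subnetID int) string {
	return fmt.Sprintf("10.100.%d.0/24, 10.200.%d.0/24", subnetID, subnetID)
}

// PeerSections renders one [Peer] block for every server except self.
func PeerSections(self int, peers []Peer) string {
	var b strings.Builder
	for _, p := range peers {
		if p.SubnetID == self {
			continue // a server does not peer with itself
		}
		fmt.Fprintf(&b, "[Peer]\nPublicKey = %s\nEndpoint = %s\nAllowedIPs = %s\n\n",
			p.PublicKey, p.Endpoint, AllowedIPs(p.SubnetID))
	}
	return b.String()
}

func main() {
	peers := []Peer{
		{SubnetID: 1, PublicKey: "<key-1>", Endpoint: "203.0.113.1:51820"},
		{SubnetID: 2, PublicKey: "<key-2>", Endpoint: "203.0.113.2:51820"},
	}
	fmt.Print(PeerSections(1, peers)) // prints only the section for server 2
}
```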

Container Network (Per-Server)

Each server has a Podman bridge network:

podman network create \
  --driver bridge \
  --subnet 10.200.1.0/24 \
  --gateway 10.200.1.1 \
  --disable-dns \
  techulus

Containers get static IPs assigned by the control plane:

podman run -d \
  --name service-deployment \
  --network techulus \
  --ip 10.200.1.2 \
  --label techulus.deployment.id=<deployment-id> \
  --label techulus.service.id=<service-id> \
  traefik/whoami
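Since the agent shells out to Podman, it has to assemble this argument list from the expected state. A minimal sketch of that assembly follows. `RunSpec` and `PodmanRunArgs` are hypothetical names, and the example only prints the command rather than executing it.

```go
package main

import (
	"fmt"
	"strings"
)

// RunSpec is a hypothetical description of one container to start.
type RunSpec struct {
	Name, IP, DeploymentID, ServiceID, Image string
}

// PodmanRunArgs builds the arguments for the "podman run" command shown
// above. Passing arguments as a slice avoids shell quoting issues.
func PodmanRunArgs(s RunSpec) []string {
	return []string{
		"run", "-d",
		"--name", s.Name,
		"--network", "techulus",
		"--ip", s.IP,
		"--label", "techulus.deployment.id=" + s.DeploymentID,
		"--label", "techulus.service.id=" + s.ServiceID,
		s.Image,
	}
}

func main() {
	args := PodmanRunArgs(RunSpec{
		Name: "service-deployment", IP: "10.200.1.2",
		DeploymentID: "dep-1", ServiceID: "svc-1", Image: "traefik/whoami",
	})
	// An agent could hand args to exec.Command("podman", args...).
	fmt.Println("podman " + strings.Join(args, " "))
}
```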

DNS Resolution

Each agent runs a built-in DNS server for .internal domain resolution:

Listens on the container gateway IP (e.g., 10.200.1.1)
Configures systemd-resolved to forward .internal queries
Records pushed from control plane via expected state

Services resolve via .internal domain with round-robin across replicas.
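Round-robin across replicas can be illustrated with a small in-memory resolver. This is a simplified sketch, not the agent's DNS server: the `Resolver` type is an assumption, and it returns one address per lookup, whereas a real DNS server would typically rotate the order of the returned A records.

```go
package main

import (
	"fmt"
	"sync"
)

// Resolver is a hypothetical in-memory table of .internal records.
type Resolver struct {
	mu      sync.Mutex
	records map[string][]string // name -> replica IPs
	next    map[string]int      // name -> number of lookups so far
}

// NewResolver builds a resolver from records pushed by the control plane.
func NewResolver(records map[string][]string) *Resolver {
	return &Resolver{records: records, next: make(map[string]int)}
}

// Resolve returns the next replica IP for name in round-robin order.
func (r *Resolver) Resolve(name string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	ips := r.records[name]
	if len(ips) == 0 {
		return "", false
	}
	ip := ips[r.next[name]%len(ips)]
	r.next[name]++
	return ip, true
}

func main() {
	r := NewResolver(map[string][]string{
		"redis.internal": {"10.200.1.2", "10.200.2.3"},
	})
	for i := 0; i < 3; i++ {
		ip, _ := r.Resolve("redis.internal")
		fmt.Println(ip) // alternates between the two replicas
	}
}
```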

Traefik (Proxy Nodes Only)

Proxy nodes run Traefik with routes and certificates pushed from control plane:

Routes configured via file provider in /etc/traefik/dynamic/routes.yaml
Certificates configured via file provider in /etc/traefik/dynamic/tls.yaml
Routes: subdomain.example.com → container IPs (via WireGuard mesh)
TLS: Static certificates managed by control plane
Challenge route: /.well-known/acme-challenge/* → control plane for ACME validation
Control plane only sends routes and certificates to proxy nodes

Worker nodes do not run Traefik.

Multiple Proxy Nodes (Geographic Distribution)

The platform supports multiple proxy nodes in different regions with automatic proximity steering:

Users point custom domains to a single DNS name via GeoDNS (BunnyDNS)
BunnyDNS routes clients to geographically nearest proxy based on their location
BunnyDNS health checks automatically fail over to another proxy if one goes down
All proxies share the same TLS certificates (synced from control plane)

Example:

Proxy US:   1.2.3.4
Proxy EU:   5.6.7.8
Proxy SYD:  9.10.11.12

GeoDNS (BunnyDNS):
  example.com → lb.techulus.cloud
  → BunnyDNS steers to nearest proxy based on client geography
  → Returns 1.2.3.4 (US), 5.6.7.8 (EU), or 9.10.11.12 (SYD)
  → Health checks: exclude proxy if down, failover to next nearest

ACME challenges work seamlessly because:

Let's Encrypt validates the domain via a single IP (whichever proxy GeoDNS returns)
Challenge hits any proxy node (they're all interchangeable)
All proxies have identical certificates
If one proxy goes down, others already have the cert

Proximity-Aware Load Balancing

Within a proxy node, traffic is distributed to replicas using weighted round-robin:

Replica Selection Priority:

Local replicas (on the same server as the proxy) - weight 5
Remote replicas (on other servers) - weight 1

This means if a service has 1 local replica and 1 remote replica, the local replica receives ~83% of traffic.

Traffic Flow:

User (US) 
  → GeoDNS: nearest proxy = US (1.2.3.4)
  → Traefik: weighted round-robin
    - Local replicas (weight 5) ← 83% of traffic
    - Remote replicas (weight 1) ← 17% of traffic (failover)
  → Container

Benefits:

Low latency: Requests stay on the same server as the proxy when possible
Failover: If local replica fails, automatically uses remote
Cost-effective: Minimizes cross-region traffic

ACME Certificate Management (Centralized)

Instead of each proxy managing its own ACME certificates, the control plane handles all certificate lifecycle:

Challenge Flow:

Control plane initiates ACME renewal for expiring certificates
Let's Encrypt requests validation: GET http://domain/.well-known/acme-challenge/{token}
Request reaches any proxy node (all proxies answer for the same DNS name)
Traefik matches PathPrefix(/.well-known/acme-challenge/) → special challenge route
Challenge route (via middleware) rewrites path to /api/v1/acme/challenge/{token}
Traefik forwards to control plane: https://control-plane.internal/api/v1/acme/challenge/{token}
Control plane returns keyAuthorization from database
Let's Encrypt validates and issues certificate
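The path mapping performed in the rewrite step is simple enough to show directly. In the platform this rewrite is done by Traefik middleware, so the Go function below only illustrates the mapping. `RewriteChallengePath` and its rejection of empty or nested tokens are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	challengePrefix = "/.well-known/acme-challenge/"
	apiPrefix       = "/api/v1/acme/challenge/"
)

// RewriteChallengePath maps an HTTP-01 challenge path to the control
// plane's API path. It returns false for paths that are not challenges.
func RewriteChallengePath(path string) (string, bool) {
	if !strings.HasPrefix(path, challengePrefix) {
		return "", false
	}
	token := strings.TrimPrefix(path, challengePrefix)
	if token == "" || strings.Contains(token, "/") {
		return "", false // a token is a single, non-empty path segment
	}
	return apiPrefix + token, true
}

func main() {
	p, ok := RewriteChallengePath("/.well-known/acme-challenge/abc123")
	fmt.Println(p, ok) // /api/v1/acme/challenge/abc123 true
}
```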

Certificate Sync:

Certificate issued and stored in domain_certificates table
Control plane includes certificates in expected state API response (proxy nodes only)
Agent receives certificates, writes to /etc/traefik/certs/{domain}.crt and .key
Agent updates /etc/traefik/dynamic/tls.yaml with certificate paths
Traefik reloads and serves TLS with new certificates

Renewal:

Cron job checks daily for certificates expiring within 30 days
Triggers ACME renewal via acme-client library
Challenge responses served through any proxy node
New certificates synced to all proxies within agent poll cycle (10 seconds)
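The daily expiry check reduces to comparing each certificate's expiry time against a 30 day window. The control plane itself is Next.js, so the Go below is only a language-neutral sketch of that check. `Cert`, `DueForRenewal`, and the `Day` demo helper are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// renewalWindow is the 30 day threshold described above.
const renewalWindow = 30 * 24 * time.Hour

// Cert is a hypothetical row from the certificate store.
type Cert struct {
	Domain   string
	NotAfter time.Time
}

// Day returns a fixed reference date plus n days. Demo helper only.
func Day(n int) time.Time {
	return time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC).AddDate(0, 0, n)
}

// DueForRenewal returns the domains whose certificates expire within the
// renewal window, including ones that have already expired.
func DueForRenewal(certs []Cert, now time.Time) []string {
	var due []string
	for _, c := range certs {
		if c.NotAfter.Sub(now) <= renewalWindow {
			due = append(due, c.Domain)
		}
	}
	return due
}

func main() {
	certs := []Cert{
		{Domain: "a.example.com", NotAfter: Day(10)},
		{Domain: "b.example.com", NotAfter: Day(60)},
	}
	fmt.Println(DueForRenewal(certs, Day(0))) // [a.example.com]
}
```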

Traffic Flows

Internal (service-to-service):

Container A (10.200.1.2)
  → DNS: redis.internal → 10.200.2.3
  → Packet to 10.200.2.3
  → Host routes via WireGuard to Server 2
  → Container B (10.200.2.3)

External (public) - Custom Domain:

User domain: example.com (points to proxy IP via A record or CNAME)
  → Internet → Proxy Node public IP
  → Traefik: example.com → 10.200.2.2:80 (TLS terminated)
  → WireGuard tunnel to target node (Server 2)
  → Container (10.200.2.2)

ACME Challenge (Let's Encrypt validation):

Let's Encrypt → HTTP request to example.com/.well-known/acme-challenge/{token}
  → Proxy Node (any of them, all behind the same DNS name)
  → Traefik matches challenge route (priority 9999)
  → Middleware rewrites path to /api/v1/acme/challenge/{token}
  → Traefik backend: control plane HTTPS
  → Returns keyAuthorization
  → Let's Encrypt validates

Components

1. Control Plane (Next.js)

Responsibilities:

User authentication
Project and service configuration
WireGuard coordination (assigns subnets, broadcasts peer updates)
Deployment orchestration (rollouts)
Certificate lifecycle management (issuance, renewal, sync)
Serves expected state to agents
Processes status reports from agents
Advances rollout stages based on deployment status

API Endpoints:

GET /api/v1/agent/expected-state - Returns containers, DNS, Traefik (proxy only), WireGuard, certificates config
POST /api/v1/agent/status - Receives container status, advances rollout stages
GET /api/v1/acme/challenge/{token} - Returns ACME challenge keyAuthorization for Let's Encrypt validation

Background Jobs (Inngest):

Rollout orchestration: Event-driven deployment workflow with health checks and DNS updates
Migration orchestration: Backup, restore, and container migration workflows
Build orchestration: Multi-architecture builds with manifest creation
Backup/restore: Scheduled and on-demand volume backups
Certificate renewal: ACME renewal for expiring certificates

2. Server Agent (Go)

Responsibilities:

Polls control plane for expected state
Manages containers via Podman with static IPs
Manages local WireGuard interface
Updates Traefik routes via file provider (proxy nodes only)
Syncs TLS certificates to disk (proxy nodes only)
Updates DNS records
Reports status (resources, public IP, container health)

Agent Lifecycle:

User creates server in control plane, receives agent token
User runs install script (specifies if proxy node)
User starts agent with token (and --proxy flag if proxy node)
Agent generates WireGuard and signing keypairs
Agent registers with control plane via HTTP (includes isProxy flag)
Control plane assigns subnet, returns WireGuard peers
Agent configures WireGuard, container network, DNS server, and Traefik (if proxy)
Agent enters IDLE state, begins polling

3. Container Labels

Containers are tracked via Podman labels:

techulus.deployment.id - Links container to deployment record
techulus.service.id - Links container to service
techulus.service.name - Human-readable service name

Security Model

Agent Authentication: HMAC signatures on all HTTP requests
Request Signing: Body + timestamp signed with server-specific secret
WireGuard: All inter-server traffic encrypted
No Public Ports on Containers: Only reachable via WireGuard mesh
Traefik: Only entry point for public traffic (proxy nodes only)

Registration Token:

One-time-use token for initial registration
Invalidated after successful registration

Request Signing:

Agent signs request body with HMAC-SHA256
Includes timestamp to prevent replay attacks
Control plane verifies using stored server secret
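A signing and verification round trip might look like the sketch below. The message layout (`timestamp.body`), hex encoding, and the allowed clock skew are assumptions; the document only specifies HMAC-SHA256 over the body with a timestamp and a server-specific secret.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// Sign returns a hex HMAC-SHA256 over "timestamp.body".
// The exact message layout is an assumption for this sketch.
func Sign(secret, body []byte, timestamp int64) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(strconv.FormatInt(timestamp, 10)))
	mac.Write([]byte("."))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify checks the signature and rejects timestamps further than
// maxSkew seconds from now, which limits replay of captured requests.
func Verify(secret, body []byte, timestamp int64, sig string, now, maxSkew int64) bool {
	if d := now - timestamp; d > maxSkew || d < -maxSkew {
		return false
	}
	expected := Sign(secret, body, timestamp)
	// hmac.Equal compares in constant time to avoid timing leaks.
	return hmac.Equal([]byte(expected), []byte(sig))
}

func main() {
	secret := []byte("server-secret")
	body := []byte(`{"status":"ok"}`)
	sig := Sign(secret, body, 1000)
	fmt.Println(Verify(secret, body, 1000, sig, 1005, 60)) // true
	fmt.Println(Verify(secret, body, 1000, sig, 2000, 60)) // false: stale timestamp
}
```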
