feat(support): add support service with WebSockets and Yamux #47
edospadoni wants to merge 21 commits into main
Conversation
🔗 Redirect URIs Added to Logto
The following redirect URIs have been automatically added to the Logto application configuration:
Redirect URIs:
Post-logout redirect URIs:
These will be automatically removed when the PR is closed or merged.
🤖 My API structural change detected
Structural change details: Added (13), Modified (5)
Powered by Bump.sh
Force-pushed from c62b877 to 007bd6d
tunnel-client binary (linux/amd64)
Download:

Quick start:

```bash
# Make it executable
chmod +x tunnel-client-linux-amd64
# Run it
./tunnel-client-linux-amd64 \
  --url wss://my-proxy-qa-pr-47.onrender.com/support/api/tunnel \
  --key <SYSTEM_KEY> \
  --secret <SYSTEM_SECRET>
```

Parameters
Service discovery modes

The tunnel-client auto-detects the environment:
Diagnostics plugin system

At connect time, the tunnel-client collects a health snapshot and sends it to MY over the tunnel. Operators see the results directly in the support session popover, before opening a terminal or proxy, so they have immediate context on the system state.

How it works:
Built-in plugin: always runs (CPU load, RAM, disk, uptime, OS info)

External plugins: any executable file placed in /usr/share/my/diagnostics.d/ is picked up. Each plugin must:
```bash
#!/bin/bash
# /usr/share/my/diagnostics.d/10-myservice.sh
STATUS="ok"
SUMMARY="all good"
if ! systemctl is-active --quiet myservice; then
  STATUS="critical"
  SUMMARY="myservice is not running"
fi
echo "{\"id\":\"myservice\",\"name\":\"My Service\",\"status\":\"$STATUS\",\"summary\":\"$SUMMARY\"}"
exit $([ "$STATUS" = "ok" ] && echo 0 || echo 2)
```

The overall session status shown in MY is the worst status across all plugins (critical > warning > ok). If a plugin exceeds its timeout it is marked accordingly.

Environment variables

All flags can also be passed as env vars:

```bash
export SUPPORT_URL=wss://my-proxy-qa-pr-47.onrender.com/support/api/tunnel
export SYSTEM_KEY=<your-key>
export SYSTEM_SECRET=<your-secret>
./tunnel-client-linux-amd64
```
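The worst-status aggregation rule (critical > warning > ok) can be made precise with a short sketch. This is illustrative Go, not the actual tunnel-client code; the function and variable names are assumptions.

```go
package main

import "fmt"

// severity ranks plugin statuses so the worst one wins.
var severity = map[string]int{"ok": 0, "warning": 1, "critical": 2}

// overallStatus returns the worst status across all plugin results,
// defaulting to "ok" when no plugins ran.
func overallStatus(statuses []string) string {
	worst := "ok"
	for _, s := range statuses {
		if severity[s] > severity[worst] {
			worst = s
		}
	}
	return worst
}

func main() {
	fmt.Println(overallStatus([]string{"ok", "warning", "ok"}))
}
```

Any single critical plugin therefore flips the whole session badge to critical, regardless of how many checks passed.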
Support session CRUD, WebSocket terminal with one-time tickets, subdomain proxy with body rewriting, access logging, RBAC with connect:systems permission, database migrations, and security hardening from penetration test findings.
Support sessions table with pagination and sorting, xterm.js web terminal with multi-tab support, service dropdown with multi-node grouping, connect:systems permission guard, and i18n translations.
Add support service routing in nginx proxy, Render.com deployment config, CI pipeline with tunnel-client Docker image and rolling dev release, release workflow with tunnel-client binary and SBOM, connect:systems RBAC permission.
…h in artifact name
…t subdomain catch-all
Allows manually created services (not from Blueprint) to be reached from PR preview environments by setting their env var to a full FQDN.
…kend

Address 27 findings from the security audit: prevent double-close panic with sync.Once, fix TOCTOU race in session creation with a DB transaction, add gzip bomb protection, limit manifest size/rate, validate service names, use the full session UUID in the subdomain proxy, add org_role to proxy tokens, harden WebSocket origin checks, add session rate limiting, fix concurrent read/write safety, and multiple other hardening improvements.
…kend

Address 23 findings from the penetration testing report on the support service:
- SSRF/DNS rebinding prevention with IP validation and DNS resolution checks
- Open redirect fix via protocol-relative URL sanitization
- CORS restriction from AllowAllOrigins to localhost-only in debug mode
- HSTS, CSP, X-Content-Type-Options security headers in nginx proxy
- InternalSecret middleware for defense-in-depth inter-service auth
- PTY environment variable sanitization to prevent credential leakage
- Cookie rewriting to prevent cross-session domain leakage
- Global memory budget (50MB) for gzip decompression (bomb mitigation)
- CONNECT protocol newline injection prevention with service name validation
- Container hardening with nginx-unprivileged and non-root users
- Input validation for node_id and service names
- Nginx server_name regex anchoring for multi-environment support
- Rate limiter single-instance design documentation
- Non-functional default secrets in .env.example files
Add pid directive to /tmp/nginx.pid and create writable cache directories so nginx can run as non-root user without permission errors.
Add https://*.nethesis.it to connect-src so the frontend can reach the Logto identity provider for OIDC flows.
Embed the support session ID directly in system list and detail endpoints to avoid N+1 API calls when checking session status per system.
Show a clickable headset icon next to system name when an active support session exists. The popover displays session status, dates, and connected operators with per-node terminal badges. Backend now tracks terminal disconnect times via access log lifecycle (insert returns ID, disconnect updates disconnected_at).
…able rate limits

Refactor the tunnel-client from a single 1181-line main.go into organized internal packages (config, connection, discovery, models, stream, terminal). Rename traefik.go to nethserver.go with updated function names and log messages. Replace the YAML config with an EXCLUDE_PATTERNS env var / --exclude flag for service filtering. Improve api-cli error logging to include stderr output. Add configurable rate limiting via env vars (RATE_LIMIT_TUNNEL_PER_IP, RATE_LIMIT_TUNNEL_PER_KEY, RATE_LIMIT_SESSION_PER_ID, RATE_LIMIT_WINDOW), with the session limit raised from 100 to 500 req/min. Add build-tunnel-client and run-tunnel-client Makefile targets.
Shift migrations to avoid conflict with 017_inventory_fk_set_null added on main.
Force-pushed from f56f203 to 2fbfa44
At connect time, the tunnel-client collects a health report and pushes it to the support service over a dedicated yamux stream. Operators see the results in the session popover before opening a terminal or proxy.

A built-in system plugin always runs (CPU load, RAM, disk, uptime, OS info). External plugins can be dropped as executables in /usr/share/my/diagnostics.d/; NS8 modules and NethSecurity can ship their own health checks independently. Each plugin writes JSON to stdout and signals severity via exit code (0=ok, 1=warning, 2=critical). The overall session status is the worst status across all plugins.

Diagnostics run in parallel with the WebSocket connection to avoid adding latency. A per-plugin timeout (default 10s) and a total timeout (default 30s) prevent slow plugins from blocking the session.

- tunnel-client: new internal/diagnostics package (runner + models), built-in system check, DIAGNOSTICS yamux stream after the manifest
- support service: acceptControlStream distinguishes the DIAGNOSTICS header from manifest JSON, SaveDiagnostics() stores JSONB on the session
- backend: GET /api/support-sessions/:id/diagnostics with RBAC scoping, migration 021 adds diagnostics + diagnostics_at columns
- frontend: diagnostics section in SupportSessionPopover with status dot and per-plugin summary rows
Operators can now inject arbitrary host:port services into a running tunnel session without reconnection, enabling access to LAN devices (IP phones, switches) through the support proxy.

- Backend: POST /support-sessions/:id/services with RBAC, validation, and Redis pub/sub dispatch (add_services action)
- Support service: SendCommandToSession() opens an outbound yamux stream, writes COMMAND 1\n + JSON payload, waits for OK/ERROR
- Tunnel-client: accept loop pre-reads the first line to route COMMAND vs CONNECT streams; thread-safe serviceStore with sync.RWMutex
- Frontend: Add Service modal with name/target/label/TLS fields; 1500ms delay before re-fetching services to account for the async round-trip
- OpenAPI: documented the new endpoint with a Conflict response component
- README: added a COMMAND stream table and a Static Service Injection section
Fixes 10 security issues identified in the pen-test review of the static service injection and diagnostics features:

- SSRF bypass in applyAddServices (HMAC-signed Redis commands, server pre-check, and client-side validateTarget)
- Diagnostics JSON schema validation, 512 KB size cap, and DB-enforced rate limit across reconnections
- Diagnostic plugins rejected if not owned by root or writable by others; sanitized environment strips credentials
- host:port validation uses net.SplitHostPort with a numeric range check
- DIAGNOSTICS stream version validated as exact "DIAGNOSTICS 1"
- serviceStore total cap (500) prevents unbounded growth
- Diagnostics goroutine starts only after the yamux session is established
Remote apps (NethVoice, NethCTI) proxied through different subdomains
make cross-origin API calls that require CORS headers and shared cookie
authentication across sibling subdomains of the same support session.
Backend:
- Move CORS middleware from router to /api group so it does not
intercept /support-proxy/* routes
- Add CORS preflight (OPTIONS 204) and response headers for
same-session sibling subdomains (validated by session slug match)
- Scope proxy cookie to .support.{domain} with SameSite=Lax so it
is shared across all service subdomains of the same session
- Remove per-service token validation: session ID match is sufficient
since users have session-level access
Support service:
- Fix non-deterministic hostname rewriting in buildHostRewriteMap:
when multiple services share the same original hostname, the current
service's proxy subdomain is always preferred, keeping API calls
same-origin and letting Traefik handle path-based routing
Force-pushed from d683765 to 50624ac
…er display

Add a GET /api/support-sessions/diagnostics?system_id=X endpoint that returns diagnostics for all active sessions of a system grouped by node, with an overall_status reflecting the worst across all nodes. Update the frontend popover to show collapsible per-node sections for multi-node NS8 clusters while keeping the flat list for single-node systems.
📋 Description
🏗 Support Service — Architecture
How it works
A tunnel client on the customer's system opens a persistent WebSocket to our support service. The connection is multiplexed with yamux — one WebSocket carries many parallel streams. When an operator clicks "Open" in the UI, traffic flows through the tunnel to reach the remote service (web UI, terminal, API) as if it were local.
```mermaid
graph LR
  subgraph Customer System
    TC[tunnel-client<br/>yamux mux] --> WU[Web UI]
    TC --> SA[SSH/API]
    TC --> ETC[...]
  end
  TC ---|WebSocket<br/>single connection| SS
  BR[Browser<br/>operator] --> NG[nginx<br/>proxy]
  NG --> BE[Backend :8080<br/>sessions, auth]
  BE --> SS[Support :8082<br/>tunnels, yamux]
```

Session Lifecycle
```mermaid
stateDiagram-v2
  [*] --> pending
  pending --> active : WebSocket established
  active --> closed : operator closes
  active --> grace_period : disconnect
  grace_period --> active : reconnect (same session)
  grace_period --> expired : timeout (30-60s)
```

WebSocket + yamux Multiplexing
The tunnel client opens one WebSocket to the support service. On top of it, yamux creates a multiplexed session — like having many TCP connections inside a single one.
How it connects:
1. The client sends `GET /support/api/tunnel` with HTTP Basic Auth
2. The WebSocket is wrapped as a `net.Conn`
3. A `yamux.Server` session is created over the wrapped connection (keepalive 15s)

Server-initiated streams: the support service can also open streams toward the tunnel-client. These start with a `COMMAND <version>\n` header and carry a JSON payload. The tunnel-client processes the command and responds `OK\n` or `ERROR <msg>\n`.

On disconnect: the tunnel enters a grace period (30-60s). If the client reconnects, the same session is reused. If the grace period expires, the session is closed.
Diagnostics
At connect time, the tunnel-client collects a health report and pushes it to the support service via a dedicated yamux stream. The report is stored on the session (`diagnostics` JSONB column) and shown to operators in the session popover.

Plugin format (exit code: `0` = ok, `1` = warning, `2` = critical):

```json
{
  "id": "myapp",
  "name": "My Application",
  "status": "warning",
  "summary": "DB at 87% capacity",
  "checks": [
    { "name": "service", "status": "ok", "value": "running" },
    { "name": "database", "status": "warning", "value": "87% full" }
  ]
}
```

The overall session status is the worst status across all plugins.
Static Service Injection
Operators can add arbitrary `host:port` services to a running tunnel without reconnection. This is useful for services not auto-discovered via Traefik, for example the web management interface of a device on the customer's LAN (IP phone, managed switch, NAS, etc.).

Example: to access a Yealink phone's web UI at `192.168.1.100:443` on a customer system, add a service with `target: 192.168.1.100:443`, `tls: true`. The phone's interface becomes available at … as if the operator were on the same LAN as the phone.
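The pen-test fixes above mention validating injected targets with `net.SplitHostPort` plus a numeric port range check, and blocking loopback/link-local/multicast destinations. A minimal sketch of that check, with an illustrative function name (the real `validateTarget` may differ):

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// validateTarget parses a host:port string with net.SplitHostPort,
// checks the port range, and rejects obviously unsafe IP targets.
// Private LAN ranges stay allowed: reaching LAN devices is the
// whole point of static service injection.
func validateTarget(target string) error {
	host, portStr, err := net.SplitHostPort(target)
	if err != nil {
		return fmt.Errorf("invalid target %q: %w", target, err)
	}
	port, err := strconv.Atoi(portStr)
	if err != nil || port < 1 || port > 65535 {
		return fmt.Errorf("invalid port %q", portStr)
	}
	if ip := net.ParseIP(host); ip != nil {
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsMulticast() {
			return fmt.Errorf("target %s is not allowed", ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateTarget("192.168.1.100:443")) // LAN device: allowed
	fmt.Println(validateTarget("127.0.0.1:80"))      // loopback: rejected
}
```

Hostname targets still need the server-side DNS resolution check mentioned in the findings, since a name can resolve to a blocked address.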
How the UI Proxy works (subdomain)
When an operator clicks a service link (e.g. NethVoice UI), the browser opens a new tab on a dedicated subdomain. Each service gets its own origin, so all the app's absolute paths (`/_next/`, `/api/`, `/static/`) work natively.

The `?token=` is removed from the URL after the first request (redirect), so it never leaks in logs, referrer headers, or browser history.

How the Web Terminal works (xterm.js)
The terminal needs a WebSocket from the browser, but browsers can't send `Authorization` headers on WebSocket connections. Solution: a one-time ticket exchanged beforehand.

The tunnel client spawns a PTY (pseudo-terminal) directly on the customer system; no SSH daemon is involved. The PTY output is forwarded as raw bytes through the yamux stream back to the browser's xterm.js.
Why TCP hijacking instead of `httputil.ReverseProxy`?

When the browser opens a WebSocket, it sends an HTTP request with `Upgrade: websocket`. The server responds with `101 Switching Protocols`, and from that point the connection is no longer HTTP: it becomes a raw bidirectional byte channel.

`httputil.ReverseProxy` can't handle this. It's designed for the classic HTTP request/response cycle: read the response from the backend, copy it to the client, close. With a WebSocket there is no "response" to copy, only a continuous stream of frames in both directions.

Gin (which uses `net/http` underneath) has the same problem: its `ResponseWriter` buffers, manages headers, sets Content-Length... none of which makes sense after the `101`.

The solution is `http.Hijacker`: a Go interface that lets you take control of the raw TCP connection from the HTTP server. You're telling Go "I'll handle it from here".

The flow:
1. Receive `101 Switching Protocols` from the support service
2. Call `Hijack()` on the browser connection: now it has the raw TCP socket
3. Write the `101` to the browser
4. Bidirectional copy (`io.Copy`): browser ↔ support service

No HTTP, no buffering, no overhead. Just bytes flowing through.
Access Patterns & Auth
- `system_key:system_secret` (SHA256), 3-tier cache (memory → Redis → DB), rate-limited
- `connect:systems` permission, standard middleware chain
- One-time ticket: `GETDEL` on use → WebSocket via TCP hijack
- Proxy token `{session_id, service_name, org_role}` → SameSite=Strict cookie on subdomain → auto-redirect strips the token from the URL
- `INTERNAL_SECRET` and `X-Session-Token` (64-char hex, per-session) + shared `SUPPORT_INTERNAL_SECRET` for service-level auth, constant-time validation

Security Highlights
- One-time tickets (`GETDEL`); the JWT never touches the URL
- `?token=` gets stored as an `HttpOnly SameSite=Strict` cookie; the URL is cleaned via redirect + `Referrer-Policy: no-referrer`
- `crypto/subtle` for all token validations: no timing attacks
- Blocked proxy targets: private metadata ranges (`169.254.x.x`), link-local, multicast, loopback
- `frame-ancestors 'self'` on proxied responses: prevents clickjacking

Subdomain Proxy
Each service gets its own browser origin — no URL rewriting needed:
Requires: DNS wildcard `*.support.{domain}` + matching wildcard SSL certificate + `SUPPORT_PROXY_DOMAIN` env var.

Inter-service Communication
Components & Files
- `services/support/`
- `backend/methods/support_proxy.go`
- `frontend/src/components/support/`
- `proxy/nginx.conf`
- `backend/database/migrations/009_*`, `018_*`–`021_*`: `support_sessions`, `support_access_logs`, diagnostics column
- `.github/workflows/`, `render.yaml`, `deploy.sh`

Related Issue: #[ISSUE_NUMBER]
🚀 Testing Environment
To trigger a fresh deployment of all services in the PR preview environment, comment:
To download the tunnel-client binary, see: #47 (comment)
Automatic PR environments:
✅ Merge Checklist
Code Quality:
Builds: