device-health-oracle: add controller_success device activation criterion by nikw9944 · Pull Request #3503 · malbeclabs/doublezero

nikw9944 · 2026-04-09T14:20:46Z

Resolves: #3493

Summary of Changes

Add controller_success device activation criterion that queries ClickHouse to verify devices have called the controller at least once per minute over the burn-in period before advancing health
Introduce criteria-based evaluation pattern with stage-aware health progression (Pending → ReadyForLinks → ReadyForUsers) to support the many criteria planned in RFC-12
Skip onchain health writes when the value is already at the desired state
ClickHouse connection is optional via CLICKHOUSE_ADDR env var — backward compatible when not set

Diff Breakdown

Category	Files	Lines (+/-)	Net
Core logic	4	+303 / -4	+299
Scaffolding	3	+78 / -2	+76
Tests	3	+432 / -0	+432
Docs	2	+297 / -0	+297

~300 lines of core logic implementing the criteria pattern and ClickHouse integration, well-covered by ~430 lines of tests.

Key files (click to expand)

criteria.go — DeviceCriterion/LinkCriterion interfaces and stage-aware evaluators
controller_success.go — ControllerSuccessCriterion querying ClickHouse for call coverage
clickhouse.go — ClickHouse client with ControllerCallCoverage query
worker.go — Evaluator integration and skip-update optimization
main.go — Wiring ClickHouse connection and criteria into the worker

Testing Verification

Evaluator enforces stage progression — devices at Pending advance to ReadyForLinks (not directly to ReadyForUsers), and failing criteria block advancement
ClickHouse integration tests (testcontainers) verify coverage query with full coverage, partial windows, gaps, and empty data
Burn-in slot count correctly selects provisioning (200K) vs drained (5K) based on device status

Add a criteria-based evaluation pattern to the device-health-oracle and implement the first criterion: devices must have called the controller at least once per minute over the burn-in period (verified via ClickHouse controller_grpc_getconfig_success table). - Introduce DeviceCriterion/LinkCriterion interfaces and stage-aware evaluators that enforce Pending → ReadyForLinks → ReadyForUsers progression for devices - Add ControllerSuccessCriterion querying ClickHouse for controller call coverage over the burn-in window - Optimize update logic to skip health writes when the value is already at the desired state - ClickHouse connection is optional: when CLICKHOUSE_ADDR is not set, the oracle falls back to no-criteria behavior (backward compatible)

Use explicit credentials and the testcontainers ClickHouse module options (WithUsername, WithPassword, WithDatabase) to match the pattern used by other integration tests in the repo. Switch to 23.3.8.21-alpine image which is consistent with the rest of the codebase. Add CHANGELOG entry.

Use positional ? parameters instead of named parameters with type hints, and scan count result into uint64 to match ClickHouse's UInt64 return type from count(DISTINCT ...).

Remove --slot-duration-ms flag. Burn-in boundary slots are now resolved to wall-clock times via GetBlockTime once per tick, and passed to criteria through the context. This avoids hardcoding the ~400ms slot duration estimate.

…tests Move ClickHouse container tests out of the worker package — integration tests belong in e2e/. Replace with a lightweight unit test for NewClickHouseClient. Also update rfc12 to specify how config agent health is verified: query controller_grpc_getconfig_success for at least one record per minute, with the burn-in window resolved via GetBlockTime.

Replace the trivial NewClickHouseClient error test with proper mock-based tests for ControllerCallCoverage: verify query construction, database name quoting (mainnet-beta), result scanning, and error propagation.

Every device criterion needs to select the burn-in start time based on device status. Extract DeviceBurnInStart method on BurnInTimes so future criteria don't duplicate the provisioning-vs-drained selection logic.

Combine BurnInTimesFromContext and the status-based start time selection into a single DeviceBurnIn(ctx, status) function that returns (start, expectedMinutes, ok). This eliminates boilerplate that every device criterion would otherwise repeat. Also fix gofmt alignment in clickhouse_test.go.

## Summary of Changes - Standardize entity identification: probe mutation commands now use `--probe` (was `--code`) accepting pubkey or code; user mutation commands now accept pubkey or code via `validate_pubkey_or_code` (was code-only) - Rename `--signing-keypair` → `--signing-pubkey` (it accepts a pubkey, not a keypair) and `--target-pk` → `--target-signing-pubkey` (spell out instead of abbreviating) - Add `--json-compact` to `probe get` and `user get` (previously only available on `list` commands) ## Diff Breakdown | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|-------| | Core logic | 10 | +358 / -89 | +269 | | Tests | 10 | +49 / -12 | +37 | | E2E | 1 | +3 / -3 | 0 | Most of the net addition is resolution logic: non-create commands now resolve pubkey-or-code to a code via a `get` call before passing to the SDK. <details> <summary>Key files (click to expand)</summary> - `smartcontract/cli/src/geolocation/user/add_target.rs` — `--user` accepts pubkey_or_code; `--target-pk` → `--target-signing-pubkey`; adds user resolution + test mocks - `smartcontract/cli/src/geolocation/user/remove_target.rs` — same changes as add_target - `smartcontract/cli/src/geolocation/probe/update.rs` — `--code` → `--probe` with pubkey_or_code; `--signing-keypair` → `--signing-pubkey`; adds probe resolution - `smartcontract/cli/src/geolocation/probe/add_parent.rs` — `--code` → `--probe` with pubkey_or_code; adds probe resolution - `smartcontract/cli/src/geolocation/probe/delete.rs` — `--code` → `--probe` with pubkey_or_code; adds probe resolution - `smartcontract/cli/src/geolocation/user/update_payment_status.rs` — `--user` accepts pubkey_or_code; adds user resolution - `smartcontract/cli/src/geolocation/user/delete.rs` — `--user` accepts pubkey_or_code; adds user resolution - `smartcontract/cli/src/geolocation/probe/get.rs` — adds `--json-compact`; display field `signing_keypair` → `signing_pubkey` </details> ## Testing Verification - All 38 geolocation CLI unit tests pass - Clippy clean with `-Dclippy::all -Dwarnings` - E2E tests compile (`go build -tags e2e ./e2e/...`) - Infra repo verified: no references to the renamed geolocation CLI flags

Resolves: #3465 ## Summary of Changes - Adds a BGP status submitter to the telemetry daemon that periodically checks BGP session state in the BGP VRF namespace and submits \`Up\`/\`Down\` status onchain for each activated user on the device - Adds \`BGPStatus\` type and \`SetUserBGPStatus\` executor instruction to the Go serviceability SDK - Fixes a gap where a disappeared tunnel interface (e.g., after an ungraceful daemon kill) would silently skip the \`Down\` submission — now correctly transitions from \`Up\` to \`Down\` when the tunnel is gone but the user remains activated onchain - Introduces a \`CachingFetcher\` in the telemetry daemon that deduplicates \`GetProgramData\` RPC calls across the BGP status submitter, ledger peer discovery, and collector within a 5s TTL window using \`singleflight\`, mirroring the pattern in the client daemon - Adds Prometheus metrics for both the caching layer and onchain submission outcomes - Adds an e2e test (\`TestE2E_UserBGPStatus\`) verifying the full \`Up\` → \`Down\` lifecycle ## Diff Breakdown | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|-------| | Core logic | 4 | +526 / -0 | +526 | | Scaffolding | 5 | +161 / -7 | +154 | | Tests | 4 | +770 / -0 | +770 | | Docs | 1 | +4 / -0 | +4 | Mostly new code: a self-contained submitter package, a shared caching layer, and their tests. <details> <summary>Key files (click to expand)</summary> - [\`controlplane/telemetry/internal/bgpstatus/bgpstatus.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7e435c1fdcd7c7b5767709008629f8022c298656aaa548d1636b162057d91c0f) — \`Submitter\` struct, \`Config\`, and pure helpers (\`buildEstablishedIPSet\`, \`computeEffectiveStatus\`, \`shouldSubmit\`) - [\`controlplane/telemetry/internal/bgpstatus/submitter.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-4f6db3ff3dcb17af124697b7d176a806c8359739f33546f5de230105500cd9c8) — tick/worker loop: collects BGP socket state, resolves tunnel IPs per user, enqueues status updates with retry; includes tunnel-not-found → Down fix - [\`e2e/user_bgp_status_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-8b6e16747b259945bc71d9d26b961b1eb8792c690d776448fc8ad26c8209d56a) — e2e test validating \`Up\` detection after BGP session establishes and \`Down\` detection after \`pkill -9 doublezerod\` - [\`controlplane/telemetry/internal/bgpstatus/submitter_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-6d59f1028aac0b9b2609f4bbdb7c155dbf6bef15fce256034100a276905f0572) — unit tests for tick logic, grace period, periodic refresh, in-flight deduplication, and tunnel-not-found → Down transition - [\`controlplane/telemetry/internal/serviceability/cache.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-50e9dfc9d61518cf113b7399651a828cdab7f5ff5abfefd443a772139845a315) — \`CachingFetcher\`: \`sync.RWMutex\` fast-path + \`singleflight\` slow-path + stale-on-error fallback for \`GetProgramData\` - [\`controlplane/telemetry/cmd/telemetry/main.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-ee2c048617227e6110d061752fb7e24fc85ff54c7024cea70c3016e976ee9228) — wires \`CachingFetcher\` and BGP status submitter; flags: \`--bgp-status-enable\`, \`--bgp-status-interval\`, \`--bgp-status-refresh-interval\`, \`--bgp-status-down-grace-period\` - [\`smartcontract/sdk/go/serviceability/executor.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7be6c1e59c995ffb57df9bcf3926738f52220f83f5b4e4747fc0dd2d993d1bc8) — adds \`SetUserBGPStatus\` executor (instruction opcode 106) - [\`smartcontract/sdk/go/serviceability/state.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1) — adds \`BGPStatus\` type with \`Unknown/Up/Down\` constants and JSON serialization </details> ## Testing Verification - Unit tests cover all tick-loop branches: session established, session down with and without grace period, tunnel not found with last status Up/Down, periodic refresh, in-flight dedup, and submission retry - \`TestE2E_UserBGPStatus\` verified end-to-end: connects a client, waits for BGP session to reach \`Established\` in vrf1, confirms \`BGPStatusUp\` appears onchain, then kills doublezerod ungracefully and confirms \`BGPStatusDown\` appears onchain within 60s

Update RFC16 to match implemented status - Removed POC sections - Updated various section to match as-implementated details - Updated Security - Added Future work section for challenge-based inbound probing - Removed resolved open questions.

…er (#3506) Resolves: #3504 - Add `agent_version` ([u8; 16]) and `agent_commit` ([u8; 8]) fields to `DeviceLatencySamplesHeader`, carved from the existing 128-byte reserved region (now 104 bytes) — total header size unchanged, fully backward compatible - Wire the telemetry agent's LDFLAGS-injected `version` and `commit` strings through config → collector → submitter → instruction builder - Update Go, Python, and TypeScript SDKs to deserialize the new fields; update Go V0→V1 migration to split the old `Unused` region correctly | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|------| | Core logic | 7 | +74 / -8 | +66 | | Scaffolding | 5 | +16 / -0 | +16 | | Tests | 9 | +53 / -3 | +50 | | Fixtures | 3 | +15 / -1 | +14 | | Docs | 1 | +6 / -0 | +6 | | **Total** | 25 | +164 / -12 | +152 | Mostly core logic (onchain state + SDK deserialization) and tests; scaffolding is thin config plumbing. <details> <summary>Key files (click to expand)</summary> - [`smartcontract/programs/doublezero-telemetry/src/state/device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-010de70ef44cfdd400df3321ce1d289f9342a7048c7a77e296489a4fca96a780) — add `agent_version` and `agent_commit` fields to onchain header struct, shrink `_unused` from 128 to 104 bytes - [`sdk/telemetry/typescript/telemetry/state.ts`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-bbf486649ad2ac1398535366df9144a768aebee2a63d2baa2d4c54d1c620161a) — deserialize new fields with `readFixedString` helper, expose on interface - [`smartcontract/sdk/go/telemetry/state_v0.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-8f5df3efc61241621c8867509a324f0532369db21cf40d38fe2467362c927fac) — V0→V1 migration splits old 128-byte `Unused` into `AgentVersion` + `AgentCommit` + `Unused` - [`smartcontract/sdk/go/telemetry/initialize_device_latency_samples.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-831dd3d41e82ba7d6832bf1a09777c177bfc1f177283cd756b4d1cf205d56fe8) — serialize `AgentVersion`/`AgentCommit` strings into fixed-size byte arrays for the instruction - [`smartcontract/sdk/go/telemetry/state.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-b21eb1e1d5e0c91a4777fad19d0500e5e36ce0bb67481f42731fa0f736d5d166) — add `AgentVersion` and `AgentCommit` fields to Go SDK header struct - [`sdk/telemetry/python/telemetry/state.py`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-f9724dd41d02082c168c9e7fc4b9b51945eb6dba1964f235465862e972b4aeaf) — deserialize new fields as null-trimmed UTF-8 strings - [`smartcontract/programs/doublezero-telemetry/src/processors/telemetry/initialize_device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-fd37cd38c9a70671b397aa15496237a669afd8a5bb703e712cb6b76386b65fc4) — write new fields into account during initialization </details> - Rust: serialization round-trip test with non-zero `agent_version`/`agent_commit` values passes - Go SDK: V0→V1 conversion test verifies byte-level mapping of the split `Unused` region - TypeScript & Python SDK: fixture tests pass against regenerated binary/JSON fixtures with non-zero values - E2E: `device_telemetry_test.go` asserts non-zero version/commit bytes after live agent submission; `sdk_device_telemetry_test.go` verifies exact byte values round-trip through initialize + deserialize

Use explicit credentials and the testcontainers ClickHouse module options (WithUsername, WithPassword, WithDatabase) to match the pattern used by other integration tests in the repo. Switch to 23.3.8.21-alpine image which is consistent with the rest of the codebase. Add CHANGELOG entry.

Combine BurnInTimesFromContext and the status-based start time selection into a single DeviceBurnIn(ctx, status) function that returns (start, expectedMinutes, ok). This eliminates boilerplate that every device criterion would otherwise repeat. Also fix gofmt alignment in clickhouse_test.go.

- Close ClickHouse client on shutdown (defer chClient.Close()) - Validate database name against [a-zA-Z0-9_-]+ to prevent injection - Use status.IsDrained() instead of == for forward compatibility - Pass tick timestamp through BurnInTimes.Now instead of per-device time.Now() calls - Add comment explaining 2-tick minimum for ReadyForUsers progression - Remove obvious comments per CLAUDE.md guidelines - Add test for expectedMinutes == 0 edge case (new environment)

nikw9944 mentioned this pull request Apr 9, 2026

device-health-oracle: Add activation criterion: devices must consistently call controller #3493

Open

nikw9944 marked this pull request as draft April 9, 2026 14:38

nikw9944 added 2 commits April 9, 2026 14:40

nikw9944 force-pushed the nikw9944/doublezero-3493 branch from c4e29fb to efe6fa4 Compare April 9, 2026 14:45

nikw9944 added 6 commits April 9, 2026 16:01

device-health-oracle: fix clickhouse query parameter format

a4017cc

Use positional ? parameters instead of named parameters with type hints, and scan count result into uint64 to match ClickHouse's UInt64 return type from count(DISTINCT ...).

device-health-oracle: mock clickhouse conn in unit tests

9ce7445

Replace the trivial NewClickHouseClient error test with proper mock-based tests for ControllerCallCoverage: verify query construction, database name quoting (mainnet-beta), result scanning, and error propagation.

device-health-oracle: extract BurnInTimes.DeviceBurnInStart helper

562a663

Every device criterion needs to select the burn-in start time based on device status. Extract DeviceBurnInStart method on BurnInTimes so future criteria don't duplicate the provisioning-vs-drained selection logic.

nikw9944 changed the title ~~device-health-oracle: add controller_success activation criterion~~ device-health-oracle: add controller_success device activation criterion Apr 9, 2026

ben-dz and others added 7 commits April 10, 2026 20:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

device-health-oracle: add controller_success device activation criterion#3503

device-health-oracle: add controller_success device activation criterion#3503
nikw9944 wants to merge 15 commits intomainfrom
nikw9944/doublezero-3493

nikw9944 commented Apr 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nikw9944 commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of Changes

Diff Breakdown

Testing Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nikw9944 commented Apr 9, 2026 •

edited

Loading