device-health-oracle: add controller_success device activation criterion#3503
Draft
device-health-oracle: add controller_success device activation criterion#3503
Conversation
Add a criteria-based evaluation pattern to the device-health-oracle and implement the first criterion: devices must have called the controller at least once per minute over the burn-in period (verified via ClickHouse controller_grpc_getconfig_success table). - Introduce DeviceCriterion/LinkCriterion interfaces and stage-aware evaluators that enforce Pending → ReadyForLinks → ReadyForUsers progression for devices - Add ControllerSuccessCriterion querying ClickHouse for controller call coverage over the burn-in window - Optimize update logic to skip health writes when the value is already at the desired state - ClickHouse connection is optional: when CLICKHOUSE_ADDR is not set, the oracle falls back to no-criteria behavior (backward compatible)
Use explicit credentials and the testcontainers ClickHouse module options (WithUsername, WithPassword, WithDatabase) to match the pattern used by other integration tests in the repo. Switch to 23.3.8.21-alpine image which is consistent with the rest of the codebase. Add CHANGELOG entry.
c4e29fb to
efe6fa4
Compare
Use positional ? parameters instead of named parameters with type hints, and scan count result into uint64 to match ClickHouse's UInt64 return type from count(DISTINCT ...).
Remove --slot-duration-ms flag. Burn-in boundary slots are now resolved to wall-clock times via GetBlockTime once per tick, and passed to criteria through the context. This avoids hardcoding the ~400ms slot duration estimate.
…tests Move ClickHouse container tests out of the worker package — integration tests belong in e2e/. Replace with a lightweight unit test for NewClickHouseClient. Also update rfc12 to specify how config agent health is verified: query controller_grpc_getconfig_success for at least one record per minute, with the burn-in window resolved via GetBlockTime.
Replace the trivial NewClickHouseClient error test with proper mock-based tests for ControllerCallCoverage: verify query construction, database name quoting (mainnet-beta), result scanning, and error propagation.
Every device criterion needs to select the burn-in start time based on device status. Extract DeviceBurnInStart method on BurnInTimes so future criteria don't duplicate the provisioning-vs-drained selection logic.
Combine BurnInTimesFromContext and the status-based start time selection into a single DeviceBurnIn(ctx, status) function that returns (start, expectedMinutes, ok). This eliminates boilerplate that every device criterion would otherwise repeat. Also fix gofmt alignment in clickhouse_test.go.
## Summary of Changes - Standardize entity identification: probe mutation commands now use `--probe` (was `--code`) accepting pubkey or code; user mutation commands now accept pubkey or code via `validate_pubkey_or_code` (was code-only) - Rename `--signing-keypair` → `--signing-pubkey` (it accepts a pubkey, not a keypair) and `--target-pk` → `--target-signing-pubkey` (spell out instead of abbreviating) - Add `--json-compact` to `probe get` and `user get` (previously only available on `list` commands) ## Diff Breakdown | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|-------| | Core logic | 10 | +358 / -89 | +269 | | Tests | 10 | +49 / -12 | +37 | | E2E | 1 | +3 / -3 | 0 | Most of the net addition is resolution logic: non-create commands now resolve pubkey-or-code to a code via a `get` call before passing to the SDK. <details> <summary>Key files (click to expand)</summary> - `smartcontract/cli/src/geolocation/user/add_target.rs` — `--user` accepts pubkey_or_code; `--target-pk` → `--target-signing-pubkey`; adds user resolution + test mocks - `smartcontract/cli/src/geolocation/user/remove_target.rs` — same changes as add_target - `smartcontract/cli/src/geolocation/probe/update.rs` — `--code` → `--probe` with pubkey_or_code; `--signing-keypair` → `--signing-pubkey`; adds probe resolution - `smartcontract/cli/src/geolocation/probe/add_parent.rs` — `--code` → `--probe` with pubkey_or_code; adds probe resolution - `smartcontract/cli/src/geolocation/probe/delete.rs` — `--code` → `--probe` with pubkey_or_code; adds probe resolution - `smartcontract/cli/src/geolocation/user/update_payment_status.rs` — `--user` accepts pubkey_or_code; adds user resolution - `smartcontract/cli/src/geolocation/user/delete.rs` — `--user` accepts pubkey_or_code; adds user resolution - `smartcontract/cli/src/geolocation/probe/get.rs` — adds `--json-compact`; display field `signing_keypair` → `signing_pubkey` </details> ## Testing Verification - All 38 geolocation CLI unit tests pass - Clippy clean with `-Dclippy::all -Dwarnings` - E2E tests compile (`go build -tags e2e ./e2e/...`) - Infra repo verified: no references to the renamed geolocation CLI flags
Resolves: #3465 ## Summary of Changes - Adds a BGP status submitter to the telemetry daemon that periodically checks BGP session state in the BGP VRF namespace and submits \`Up\`/\`Down\` status onchain for each activated user on the device - Adds \`BGPStatus\` type and \`SetUserBGPStatus\` executor instruction to the Go serviceability SDK - Fixes a gap where a disappeared tunnel interface (e.g., after an ungraceful daemon kill) would silently skip the \`Down\` submission — now correctly transitions from \`Up\` to \`Down\` when the tunnel is gone but the user remains activated onchain - Introduces a \`CachingFetcher\` in the telemetry daemon that deduplicates \`GetProgramData\` RPC calls across the BGP status submitter, ledger peer discovery, and collector within a 5s TTL window using \`singleflight\`, mirroring the pattern in the client daemon - Adds Prometheus metrics for both the caching layer and onchain submission outcomes - Adds an e2e test (\`TestE2E_UserBGPStatus\`) verifying the full \`Up\` → \`Down\` lifecycle ## Diff Breakdown | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|-------| | Core logic | 4 | +526 / -0 | +526 | | Scaffolding | 5 | +161 / -7 | +154 | | Tests | 4 | +770 / -0 | +770 | | Docs | 1 | +4 / -0 | +4 | Mostly new code: a self-contained submitter package, a shared caching layer, and their tests. <details> <summary>Key files (click to expand)</summary> - [\`controlplane/telemetry/internal/bgpstatus/bgpstatus.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7e435c1fdcd7c7b5767709008629f8022c298656aaa548d1636b162057d91c0f) — \`Submitter\` struct, \`Config\`, and pure helpers (\`buildEstablishedIPSet\`, \`computeEffectiveStatus\`, \`shouldSubmit\`) - [\`controlplane/telemetry/internal/bgpstatus/submitter.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-4f6db3ff3dcb17af124697b7d176a806c8359739f33546f5de230105500cd9c8) — tick/worker loop: collects BGP socket state, resolves tunnel IPs per user, enqueues status updates with retry; includes tunnel-not-found → Down fix - [\`e2e/user_bgp_status_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-8b6e16747b259945bc71d9d26b961b1eb8792c690d776448fc8ad26c8209d56a) — e2e test validating \`Up\` detection after BGP session establishes and \`Down\` detection after \`pkill -9 doublezerod\` - [\`controlplane/telemetry/internal/bgpstatus/submitter_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-6d59f1028aac0b9b2609f4bbdb7c155dbf6bef15fce256034100a276905f0572) — unit tests for tick logic, grace period, periodic refresh, in-flight deduplication, and tunnel-not-found → Down transition - [\`controlplane/telemetry/internal/serviceability/cache.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-50e9dfc9d61518cf113b7399651a828cdab7f5ff5abfefd443a772139845a315) — \`CachingFetcher\`: \`sync.RWMutex\` fast-path + \`singleflight\` slow-path + stale-on-error fallback for \`GetProgramData\` - [\`controlplane/telemetry/cmd/telemetry/main.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-ee2c048617227e6110d061752fb7e24fc85ff54c7024cea70c3016e976ee9228) — wires \`CachingFetcher\` and BGP status submitter; flags: \`--bgp-status-enable\`, \`--bgp-status-interval\`, \`--bgp-status-refresh-interval\`, \`--bgp-status-down-grace-period\` - [\`smartcontract/sdk/go/serviceability/executor.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7be6c1e59c995ffb57df9bcf3926738f52220f83f5b4e4747fc0dd2d993d1bc8) — adds \`SetUserBGPStatus\` executor (instruction opcode 106) - [\`smartcontract/sdk/go/serviceability/state.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1) — adds \`BGPStatus\` type with \`Unknown/Up/Down\` constants and JSON serialization </details> ## Testing Verification - Unit tests cover all tick-loop branches: session established, session down with and without grace period, tunnel not found with last status Up/Down, periodic refresh, in-flight dedup, and submission retry - \`TestE2E_UserBGPStatus\` verified end-to-end: connects a client, waits for BGP session to reach \`Established\` in vrf1, confirms \`BGPStatusUp\` appears onchain, then kills doublezerod ungracefully and confirms \`BGPStatusDown\` appears onchain within 60s
Update RFC16 to match implemented status - Removed POC sections - Updated various section to match as-implementated details - Updated Security - Added Future work section for challenge-based inbound probing - Removed resolved open questions.
…er (#3506) Resolves: #3504 - Add `agent_version` ([u8; 16]) and `agent_commit` ([u8; 8]) fields to `DeviceLatencySamplesHeader`, carved from the existing 128-byte reserved region (now 104 bytes) — total header size unchanged, fully backward compatible - Wire the telemetry agent's LDFLAGS-injected `version` and `commit` strings through config → collector → submitter → instruction builder - Update Go, Python, and TypeScript SDKs to deserialize the new fields; update Go V0→V1 migration to split the old `Unused` region correctly | Category | Files | Lines (+/-) | Net | |--------------|-------|-------------|------| | Core logic | 7 | +74 / -8 | +66 | | Scaffolding | 5 | +16 / -0 | +16 | | Tests | 9 | +53 / -3 | +50 | | Fixtures | 3 | +15 / -1 | +14 | | Docs | 1 | +6 / -0 | +6 | | **Total** | 25 | +164 / -12 | +152 | Mostly core logic (onchain state + SDK deserialization) and tests; scaffolding is thin config plumbing. <details> <summary>Key files (click to expand)</summary> - [`smartcontract/programs/doublezero-telemetry/src/state/device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-010de70ef44cfdd400df3321ce1d289f9342a7048c7a77e296489a4fca96a780) — add `agent_version` and `agent_commit` fields to onchain header struct, shrink `_unused` from 128 to 104 bytes - [`sdk/telemetry/typescript/telemetry/state.ts`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-bbf486649ad2ac1398535366df9144a768aebee2a63d2baa2d4c54d1c620161a) — deserialize new fields with `readFixedString` helper, expose on interface - [`smartcontract/sdk/go/telemetry/state_v0.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-8f5df3efc61241621c8867509a324f0532369db21cf40d38fe2467362c927fac) — V0→V1 migration splits old 128-byte `Unused` into `AgentVersion` + `AgentCommit` + `Unused` - [`smartcontract/sdk/go/telemetry/initialize_device_latency_samples.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-831dd3d41e82ba7d6832bf1a09777c177bfc1f177283cd756b4d1cf205d56fe8) — serialize `AgentVersion`/`AgentCommit` strings into fixed-size byte arrays for the instruction - [`smartcontract/sdk/go/telemetry/state.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-b21eb1e1d5e0c91a4777fad19d0500e5e36ce0bb67481f42731fa0f736d5d166) — add `AgentVersion` and `AgentCommit` fields to Go SDK header struct - [`sdk/telemetry/python/telemetry/state.py`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-f9724dd41d02082c168c9e7fc4b9b51945eb6dba1964f235465862e972b4aeaf) — deserialize new fields as null-trimmed UTF-8 strings - [`smartcontract/programs/doublezero-telemetry/src/processors/telemetry/initialize_device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-fd37cd38c9a70671b397aa15496237a669afd8a5bb703e712cb6b76386b65fc4) — write new fields into account during initialization </details> - Rust: serialization round-trip test with non-zero `agent_version`/`agent_commit` values passes - Go SDK: V0→V1 conversion test verifies byte-level mapping of the split `Unused` region - TypeScript & Python SDK: fixture tests pass against regenerated binary/JSON fixtures with non-zero values - E2E: `device_telemetry_test.go` asserts non-zero version/commit bytes after live agent submission; `sdk_device_telemetry_test.go` verifies exact byte values round-trip through initialize + deserialize
Use explicit credentials and the testcontainers ClickHouse module options (WithUsername, WithPassword, WithDatabase) to match the pattern used by other integration tests in the repo. Switch to 23.3.8.21-alpine image which is consistent with the rest of the codebase. Add CHANGELOG entry.
Combine BurnInTimesFromContext and the status-based start time selection into a single DeviceBurnIn(ctx, status) function that returns (start, expectedMinutes, ok). This eliminates boilerplate that every device criterion would otherwise repeat. Also fix gofmt alignment in clickhouse_test.go.
- Close ClickHouse client on shutdown (defer chClient.Close()) - Validate database name against [a-zA-Z0-9_-]+ to prevent injection - Use status.IsDrained() instead of == for forward compatibility - Pass tick timestamp through BurnInTimes.Now instead of per-device time.Now() calls - Add comment explaining 2-tick minimum for ReadyForUsers progression - Remove obvious comments per CLAUDE.md guidelines - Add test for expectedMinutes == 0 edge case (new environment)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resolves: #3493
Summary of Changes
controller_successdevice activation criterion that queries ClickHouse to verify devices have called the controller at least once per minute over the burn-in period before advancing healthPending → ReadyForLinks → ReadyForUsers) to support the many criteria planned in RFC-12CLICKHOUSE_ADDRenv var — backward compatible when not setDiff Breakdown
~300 lines of core logic implementing the criteria pattern and ClickHouse integration, well-covered by ~430 lines of tests.
Key files (click to expand)
criteria.go— DeviceCriterion/LinkCriterion interfaces and stage-aware evaluatorscontroller_success.go— ControllerSuccessCriterion querying ClickHouse for call coverageclickhouse.go— ClickHouse client with ControllerCallCoverage queryworker.go— Evaluator integration and skip-update optimizationmain.go— Wiring ClickHouse connection and criteria into the workerTesting Verification
Pendingadvance toReadyForLinks(not directly toReadyForUsers), and failing criteria block advancement