Skip to content

device-health-oracle: add controller_success device activation criterion#3503

Draft
nikw9944 wants to merge 15 commits intomainfrom
nikw9944/doublezero-3493
Draft

device-health-oracle: add controller_success device activation criterion#3503
nikw9944 wants to merge 15 commits intomainfrom
nikw9944/doublezero-3493

Conversation

@nikw9944
Copy link
Copy Markdown
Contributor

@nikw9944 nikw9944 commented Apr 9, 2026

Resolves: #3493

Summary of Changes

  • Add controller_success device activation criterion that queries ClickHouse to verify devices have called the controller at least once per minute over the burn-in period before advancing health
  • Introduce criteria-based evaluation pattern with stage-aware health progression (Pending → ReadyForLinks → ReadyForUsers) to support the many criteria planned in RFC-12
  • Skip onchain health writes when the value is already at the desired state
  • ClickHouse connection is optional via CLICKHOUSE_ADDR env var — backward compatible when not set

Diff Breakdown

Category Files Lines (+/-) Net
Core logic 4 +303 / -4 +299
Scaffolding 3 +78 / -2 +76
Tests 3 +432 / -0 +432
Docs 2 +297 / -0 +297

~300 lines of core logic implementing the criteria pattern and ClickHouse integration, well-covered by ~430 lines of tests.

Key files (click to expand)
  • criteria.go — DeviceCriterion/LinkCriterion interfaces and stage-aware evaluators
  • controller_success.go — ControllerSuccessCriterion querying ClickHouse for call coverage
  • clickhouse.go — ClickHouse client with ControllerCallCoverage query
  • worker.go — Evaluator integration and skip-update optimization
  • main.go — Wiring ClickHouse connection and criteria into the worker

Testing Verification

  • Evaluator enforces stage progression — devices at Pending advance to ReadyForLinks (not directly to ReadyForUsers), and failing criteria block advancement
  • ClickHouse integration tests (testcontainers) verify coverage query with full coverage, partial windows, gaps, and empty data
  • Burn-in slot count correctly selects provisioning (200K) vs drained (5K) based on device status

nikw9944 added 2 commits April 9, 2026 14:40
Add a criteria-based evaluation pattern to the device-health-oracle and
implement the first criterion: devices must have called the controller
at least once per minute over the burn-in period (verified via ClickHouse
controller_grpc_getconfig_success table).

- Introduce DeviceCriterion/LinkCriterion interfaces and stage-aware
  evaluators that enforce Pending → ReadyForLinks → ReadyForUsers
  progression for devices
- Add ControllerSuccessCriterion querying ClickHouse for controller call
  coverage over the burn-in window
- Optimize update logic to skip health writes when the value is already
  at the desired state
- ClickHouse connection is optional: when CLICKHOUSE_ADDR is not set,
  the oracle falls back to no-criteria behavior (backward compatible)
Use explicit credentials and the testcontainers ClickHouse module options
(WithUsername, WithPassword, WithDatabase) to match the pattern used by
other integration tests in the repo. Switch to 23.3.8.21-alpine image
which is consistent with the rest of the codebase. Add CHANGELOG entry.
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3493 branch from c4e29fb to efe6fa4 Compare April 9, 2026 14:45
nikw9944 added 6 commits April 9, 2026 16:01
Use positional ? parameters instead of named parameters with type hints,
and scan count result into uint64 to match ClickHouse's UInt64 return
type from count(DISTINCT ...).
Remove --slot-duration-ms flag. Burn-in boundary slots are now resolved
to wall-clock times via GetBlockTime once per tick, and passed to
criteria through the context. This avoids hardcoding the ~400ms slot
duration estimate.
…tests

Move ClickHouse container tests out of the worker package — integration
tests belong in e2e/. Replace with a lightweight unit test for
NewClickHouseClient.

Also update rfc12 to specify how config agent health is verified:
query controller_grpc_getconfig_success for at least one record per
minute, with the burn-in window resolved via GetBlockTime.
Replace the trivial NewClickHouseClient error test with proper mock-based
tests for ControllerCallCoverage: verify query construction, database
name quoting (mainnet-beta), result scanning, and error propagation.
Every device criterion needs to select the burn-in start time based on
device status. Extract DeviceBurnInStart method on BurnInTimes so future
criteria don't duplicate the provisioning-vs-drained selection logic.
Combine BurnInTimesFromContext and the status-based start time selection
into a single DeviceBurnIn(ctx, status) function that returns
(start, expectedMinutes, ok). This eliminates boilerplate that every
device criterion would otherwise repeat.

Also fix gofmt alignment in clickhouse_test.go.
@nikw9944 nikw9944 changed the title device-health-oracle: add controller_success activation criterion device-health-oracle: add controller_success device activation criterion Apr 9, 2026
ben-dz and others added 7 commits April 10, 2026 20:56
## Summary of Changes
- Standardize entity identification: probe mutation commands now use
`--probe` (was `--code`) accepting pubkey or code; user mutation
commands now accept pubkey or code via `validate_pubkey_or_code` (was
code-only)
- Rename `--signing-keypair` → `--signing-pubkey` (it accepts a pubkey,
not a keypair) and `--target-pk` → `--target-signing-pubkey` (spell out
instead of abbreviating)
- Add `--json-compact` to `probe get` and `user get` (previously only
available on `list` commands)

## Diff Breakdown
| Category     | Files | Lines (+/-) | Net   |
|--------------|-------|-------------|-------|
| Core logic   |    10 | +358 / -89  | +269  |
| Tests        |    10 | +49 / -12   | +37   |
| E2E          |     1 | +3 / -3     |   0   |

Most of the net addition is resolution logic: non-create commands now
resolve pubkey-or-code to a code via a `get` call before passing to the
SDK.

<details>
<summary>Key files (click to expand)</summary>

- `smartcontract/cli/src/geolocation/user/add_target.rs` — `--user`
accepts pubkey_or_code; `--target-pk` → `--target-signing-pubkey`; adds
user resolution + test mocks
- `smartcontract/cli/src/geolocation/user/remove_target.rs` — same
changes as add_target
- `smartcontract/cli/src/geolocation/probe/update.rs` — `--code` →
`--probe` with pubkey_or_code; `--signing-keypair` → `--signing-pubkey`;
adds probe resolution
- `smartcontract/cli/src/geolocation/probe/add_parent.rs` — `--code` →
`--probe` with pubkey_or_code; adds probe resolution
- `smartcontract/cli/src/geolocation/probe/delete.rs` — `--code` →
`--probe` with pubkey_or_code; adds probe resolution
- `smartcontract/cli/src/geolocation/user/update_payment_status.rs` —
`--user` accepts pubkey_or_code; adds user resolution
- `smartcontract/cli/src/geolocation/user/delete.rs` — `--user` accepts
pubkey_or_code; adds user resolution
- `smartcontract/cli/src/geolocation/probe/get.rs` — adds
`--json-compact`; display field `signing_keypair` → `signing_pubkey`

</details>

## Testing Verification
- All 38 geolocation CLI unit tests pass
- Clippy clean with `-Dclippy::all -Dwarnings`
- E2E tests compile (`go build -tags e2e ./e2e/...`)
- Infra repo verified: no references to the renamed geolocation CLI
flags
Resolves: #3465

## Summary of Changes
- Adds a BGP status submitter to the telemetry daemon that periodically
checks BGP session state in the BGP VRF namespace and submits
\`Up\`/\`Down\` status onchain for each activated user on the device
- Adds \`BGPStatus\` type and \`SetUserBGPStatus\` executor instruction
to the Go serviceability SDK
- Fixes a gap where a disappeared tunnel interface (e.g., after an
ungraceful daemon kill) would silently skip the \`Down\` submission —
now correctly transitions from \`Up\` to \`Down\` when the tunnel is
gone but the user remains activated onchain
- Introduces a \`CachingFetcher\` in the telemetry daemon that
deduplicates \`GetProgramData\` RPC calls across the BGP status
submitter, ledger peer discovery, and collector within a 5s TTL window
using \`singleflight\`, mirroring the pattern in the client daemon
- Adds Prometheus metrics for both the caching layer and onchain
submission outcomes
- Adds an e2e test (\`TestE2E_UserBGPStatus\`) verifying the full \`Up\`
→ \`Down\` lifecycle

## Diff Breakdown
| Category     | Files | Lines (+/-) | Net   |
|--------------|-------|-------------|-------|
| Core logic   |     4 | +526 / -0   |  +526 |
| Scaffolding  |     5 | +161 / -7   |  +154 |
| Tests        |     4 | +770 / -0   |  +770 |
| Docs         |     1 | +4   / -0   |    +4 |

Mostly new code: a self-contained submitter package, a shared caching
layer, and their tests.

<details>
<summary>Key files (click to expand)</summary>

-
[\`controlplane/telemetry/internal/bgpstatus/bgpstatus.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7e435c1fdcd7c7b5767709008629f8022c298656aaa548d1636b162057d91c0f)
— \`Submitter\` struct, \`Config\`, and pure helpers
(\`buildEstablishedIPSet\`, \`computeEffectiveStatus\`,
\`shouldSubmit\`)
-
[\`controlplane/telemetry/internal/bgpstatus/submitter.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-4f6db3ff3dcb17af124697b7d176a806c8359739f33546f5de230105500cd9c8)
— tick/worker loop: collects BGP socket state, resolves tunnel IPs per
user, enqueues status updates with retry; includes tunnel-not-found →
Down fix
-
[\`e2e/user_bgp_status_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-8b6e16747b259945bc71d9d26b961b1eb8792c690d776448fc8ad26c8209d56a)
— e2e test validating \`Up\` detection after BGP session establishes and
\`Down\` detection after \`pkill -9 doublezerod\`
-
[\`controlplane/telemetry/internal/bgpstatus/submitter_test.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-6d59f1028aac0b9b2609f4bbdb7c155dbf6bef15fce256034100a276905f0572)
— unit tests for tick logic, grace period, periodic refresh, in-flight
deduplication, and tunnel-not-found → Down transition
-
[\`controlplane/telemetry/internal/serviceability/cache.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-50e9dfc9d61518cf113b7399651a828cdab7f5ff5abfefd443a772139845a315)
— \`CachingFetcher\`: \`sync.RWMutex\` fast-path + \`singleflight\`
slow-path + stale-on-error fallback for \`GetProgramData\`
-
[\`controlplane/telemetry/cmd/telemetry/main.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-ee2c048617227e6110d061752fb7e24fc85ff54c7024cea70c3016e976ee9228)
— wires \`CachingFetcher\` and BGP status submitter; flags:
\`--bgp-status-enable\`, \`--bgp-status-interval\`,
\`--bgp-status-refresh-interval\`, \`--bgp-status-down-grace-period\`
-
[\`smartcontract/sdk/go/serviceability/executor.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-7be6c1e59c995ffb57df9bcf3926738f52220f83f5b4e4747fc0dd2d993d1bc8)
— adds \`SetUserBGPStatus\` executor (instruction opcode 106)
-
[\`smartcontract/sdk/go/serviceability/state.go\`](https://github.com/malbeclabs/doublezero/pull/3487/files#diff-057f627e5f2629b35f1976fba6eaa1b3c507c511b0f47d2f80a3a1d0c684c4a1)
— adds \`BGPStatus\` type with \`Unknown/Up/Down\` constants and JSON
serialization

</details>

## Testing Verification
- Unit tests cover all tick-loop branches: session established, session
down with and without grace period, tunnel not found with last status
Up/Down, periodic refresh, in-flight dedup, and submission retry
- \`TestE2E_UserBGPStatus\` verified end-to-end: connects a client,
waits for BGP session to reach \`Established\` in vrf1, confirms
\`BGPStatusUp\` appears onchain, then kills doublezerod ungracefully and
confirms \`BGPStatusDown\` appears onchain within 60s
Update RFC16 to match implemented status

- Removed POC sections
- Updated various section to match as-implementated details
- Updated Security
- Added Future work section for challenge-based inbound probing
- Removed resolved open questions.
…er (#3506)

Resolves: #3504

- Add `agent_version` ([u8; 16]) and `agent_commit` ([u8; 8]) fields to
`DeviceLatencySamplesHeader`, carved from the existing 128-byte reserved
region (now 104 bytes) — total header size unchanged, fully backward
compatible
- Wire the telemetry agent's LDFLAGS-injected `version` and `commit`
strings through config → collector → submitter → instruction builder
- Update Go, Python, and TypeScript SDKs to deserialize the new fields;
update Go V0→V1 migration to split the old `Unused` region correctly

| Category     | Files | Lines (+/-) | Net  |
|--------------|-------|-------------|------|
| Core logic   |     7 | +74 / -8    |  +66 |
| Scaffolding  |     5 | +16 / -0    |  +16 |
| Tests        |     9 | +53 / -3    |  +50 |
| Fixtures     |     3 | +15 / -1    |  +14 |
| Docs         |     1 | +6 / -0     |   +6 |
| **Total**    |    25 | +164 / -12  | +152 |

Mostly core logic (onchain state + SDK deserialization) and tests;
scaffolding is thin config plumbing.

<details>
<summary>Key files (click to expand)</summary>

-
[`smartcontract/programs/doublezero-telemetry/src/state/device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-010de70ef44cfdd400df3321ce1d289f9342a7048c7a77e296489a4fca96a780)
— add `agent_version` and `agent_commit` fields to onchain header
struct, shrink `_unused` from 128 to 104 bytes
-
[`sdk/telemetry/typescript/telemetry/state.ts`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-bbf486649ad2ac1398535366df9144a768aebee2a63d2baa2d4c54d1c620161a)
— deserialize new fields with `readFixedString` helper, expose on
interface
-
[`smartcontract/sdk/go/telemetry/state_v0.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-8f5df3efc61241621c8867509a324f0532369db21cf40d38fe2467362c927fac)
— V0→V1 migration splits old 128-byte `Unused` into `AgentVersion` +
`AgentCommit` + `Unused`
-
[`smartcontract/sdk/go/telemetry/initialize_device_latency_samples.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-831dd3d41e82ba7d6832bf1a09777c177bfc1f177283cd756b4d1cf205d56fe8)
— serialize `AgentVersion`/`AgentCommit` strings into fixed-size byte
arrays for the instruction
-
[`smartcontract/sdk/go/telemetry/state.go`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-b21eb1e1d5e0c91a4777fad19d0500e5e36ce0bb67481f42731fa0f736d5d166)
— add `AgentVersion` and `AgentCommit` fields to Go SDK header struct
-
[`sdk/telemetry/python/telemetry/state.py`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-f9724dd41d02082c168c9e7fc4b9b51945eb6dba1964f235465862e972b4aeaf)
— deserialize new fields as null-trimmed UTF-8 strings
-
[`smartcontract/programs/doublezero-telemetry/src/processors/telemetry/initialize_device_latency_samples.rs`](https://github.com/malbeclabs/doublezero/pull/3506/files#diff-fd37cd38c9a70671b397aa15496237a669afd8a5bb703e712cb6b76386b65fc4)
— write new fields into account during initialization

</details>

- Rust: serialization round-trip test with non-zero
`agent_version`/`agent_commit` values passes
- Go SDK: V0→V1 conversion test verifies byte-level mapping of the split
`Unused` region
- TypeScript & Python SDK: fixture tests pass against regenerated
binary/JSON fixtures with non-zero values
- E2E: `device_telemetry_test.go` asserts non-zero version/commit bytes
after live agent submission; `sdk_device_telemetry_test.go` verifies
exact byte values round-trip through initialize + deserialize
Use explicit credentials and the testcontainers ClickHouse module options
(WithUsername, WithPassword, WithDatabase) to match the pattern used by
other integration tests in the repo. Switch to 23.3.8.21-alpine image
which is consistent with the rest of the codebase. Add CHANGELOG entry.
Combine BurnInTimesFromContext and the status-based start time selection
into a single DeviceBurnIn(ctx, status) function that returns
(start, expectedMinutes, ok). This eliminates boilerplate that every
device criterion would otherwise repeat.

Also fix gofmt alignment in clickhouse_test.go.
- Close ClickHouse client on shutdown (defer chClient.Close())
- Validate database name against [a-zA-Z0-9_-]+ to prevent injection
- Use status.IsDrained() instead of == for forward compatibility
- Pass tick timestamp through BurnInTimes.Now instead of per-device
  time.Now() calls
- Add comment explaining 2-tick minimum for ReadyForUsers progression
- Remove obvious comments per CLAUDE.md guidelines
- Add test for expectedMinutes == 0 edge case (new environment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

device-health-oracle: Add activation criterion: devices must consistently call controller

3 participants