-
Notifications
You must be signed in to change notification settings - Fork 24
Introduce telemetry for observability #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I've assigned @benthecarman as a reviewer! |
a3831dc to
8692c8c
Compare
This introduces the foundational telemetry infrastructure to improve the observability of LDK Server. It adds a new `/metrics` endpoint exposed on the REST service address, which serves Prometheus-compatible metrics. This endpoint is public and does not require HMAC authentication, allowing for easy integration with monitoring systems. - Added `prometheus` dependency and a `Metrics` utility struct. - Introduced a basic `ldk_health_score` gauge (0-100) that reflects the node's operational status based on connection to peer, sync state, and running status. This is the first step in a larger effort to provide comprehensive telemetry. Future updates will expand this to include metrics for channels, balances, payments, and other critical node activities.
8692c8c to
56b5e4a
Compare
| chrono = { version = "0.4", default-features = false, features = ["clock"] } | ||
| log = "0.4.28" | ||
| base64 = { version = "0.21", default-features = false, features = ["std"] } | ||
| lazy_static = "1.5.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid taking this dependency and rather use one of the std::sync primitives (Once/OnceLock/LazyLock), if we need this at all.
| log = "0.4.28" | ||
| base64 = { version = "0.21", default-features = false, features = ["std"] } | ||
| lazy_static = "1.5.0" | ||
| prometheus = "0.14.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this dependency, or can we just reimplement it easily locally?
If we need it, this should be at the very least made optional behind the metrics feature, and disable any default features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, its a plain text file with some numbers, don't think we need to take a dep :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this dependency, or can we just reimplement it easily locally?
We can reimplement locally, but I think the dependency already offers some benefits that will come in handy when we want to provide metrics for things like balances in different states, payments, fees, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also pulls in several solo-maintainer dependencies (okay, with relatively well-known Rust community folks like burntsushi and dtolnay but also the apparently-kinda-unmaintained fnv crate) which we very strongly try to avoid given the security risk. Unless there's something that takes many hundred to a few thousand lines of code or very complicated and hard to test code to replicate, we should absolutely avoid taking a dependency for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I've implemented building prometheus endpoints in bash and python and....many times with probably less effort than it would have taken to figure out how to add a dependency and use it, so I'm not at all convinced that its worth it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, will drop the dependency and reimplement locally.
| self.service_health_score.set(score); | ||
| } | ||
|
|
||
| /// The health score computation is pretty basic for now and simply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is it customary to have such a score? I would find it very hard to interpret, tbh.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is it customary to have such a score?
I think it is as some users might want to rely on that to know how their node is performing per time at a glance.
I would find it very hard to interpret, tbh.?
The current computation is pretty basic and relies on the NodeStatus informations from ldk-node. Though, we might want to refine that to include more informations and decide the best weightage value to assign for each event.
| pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60); | ||
|
|
||
| lazy_static! { | ||
| pub static ref METRICS: Metrics = Metrics::new(default_registry()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having global statics is somewhat of an anti-pattern, can we avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it's safe since it's not a mutable state and wrapped around lazy_static. But will drop that.
This introduces the foundational telemetry infrastructure to improve the observability of LDK Server.
It adds a new
/metricsendpoint exposed on the REST service address, which serves Prometheus-compatible metrics. This endpoint is public and does not require HMAC authentication, allowing for easy integration with monitoring systems.prometheusdependency and aMetricsutility struct.ldk_health_scoregauge (0-100) that reflects the node's operational status based on connection to peer, sync state, and running status.This is the first step in a larger effort to provide comprehensive telemetry. Future updates will expand this to include metrics for channels, balances, payments, and other critical node activities.