Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 63 additions & 6 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions ldk-server-protos/src/endpoints.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,4 @@ pub const LIST_FORWARDED_PAYMENTS_PATH: &str = "ListForwardedPayments";
pub const UPDATE_CHANNEL_CONFIG_PATH: &str = "UpdateChannelConfig";
pub const GET_PAYMENT_DETAILS_PATH: &str = "GetPaymentDetails";
pub const CONNECT_PEER_PATH: &str = "ConnectPeer";
pub const GET_METRICS_PATH: &str = "metrics";
2 changes: 2 additions & 0 deletions ldk-server/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,8 @@ toml = { version = "0.8.9", default-features = false, features = ["parse"] }
chrono = { version = "0.4", default-features = false, features = ["clock"] }
log = "0.4.28"
base64 = { version = "0.21", default-features = false, features = ["std"] }
lazy_static = "1.5.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid taking this dependency and rather use one of the std::sync primitives (Once/OnceLock/LazyLock), if we need this at all.

prometheus = "0.14.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this dependency, or can we just reimplement it easily locally?

If we need it, this should be at the very least made optional behind the metrics feature, and disable any default features.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, its a plain text file with some numbers, don't think we need to take a dep :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this dependency, or can we just reimplement it easily locally?

We can reimplement locally, but I think the dependency already offers some benefits that will come in handy when we want to provide metrics for things like balances in different states, payments, fees, etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also pulls in several solo-maintainer dependencies (okay, with relatively well-known Rust community folks like burntsushi and dtolnay but also the apparently-kinda-unmaintained fnv crate) which we very strongly try to avoid given the security risk. Unless there's something that takes many hundred to a few thousand lines of code or very complicated and hard to test code to replicate, we should absolutely avoid taking a dependency for it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I've implemented building prometheus endpoints in bash and python and....many times with probably less effort than it would have taken to figure out how to add a dependency and use it, so I'm not at all convinced that its worth it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, will drop the dependency and reimplement locally.


# Required for RabittMQ based EventPublisher. Only enabled for `events-rabbitmq` feature.
lapin = { version = "2.4.0", features = ["rustls"], default-features = false, optional = true }
Expand Down
12 changes: 12 additions & 0 deletions ldk-server/src/api/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -131,3 +131,15 @@ impl From<NodeError> for LdkServerError {
LdkServerError::new(error_code, message)
}
}

impl From<prometheus::Error> for LdkServerError {
fn from(e: prometheus::Error) -> Self {
LdkServerError::new(LdkServerErrorCode::InternalServerError, e.to_string())
}
}

impl From<std::string::FromUtf8Error> for LdkServerError {
fn from(e: std::string::FromUtf8Error) -> Self {
LdkServerError::new(LdkServerErrorCode::InternalServerError, e.to_string())
}
}
11 changes: 11 additions & 0 deletions ldk-server/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ use crate::io::persist::{
use crate::service::NodeService;
use crate::util::config::{load_config, ChainSource};
use crate::util::logger::ServerLogger;
use crate::util::metrics::{BUILD_METRICS_INTERVAL, METRICS};
use crate::util::proto_adapter::{forwarded_payment_to_proto, payment_to_proto};
use crate::util::tls::get_or_generate_tls_config;

Expand Down Expand Up @@ -291,6 +292,16 @@ fn main() {
}
};
let event_node = Arc::clone(&node);

let metrics_node = Arc::clone(&node);
let mut interval = tokio::time::interval(BUILD_METRICS_INTERVAL);
runtime.spawn(async move {
loop {
interval.tick().await;
METRICS.update_service_health_score(&metrics_node);
}
});

let rest_svc_listener = TcpListener::bind(config_file.rest_service_addr)
.await
.expect("Failed to bind listening port");
Expand Down
26 changes: 23 additions & 3 deletions ldk-server/src/service.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,9 @@ use ldk_node::Node;
use ldk_server_protos::endpoints::{
BOLT11_RECEIVE_PATH, BOLT11_SEND_PATH, BOLT12_RECEIVE_PATH, BOLT12_SEND_PATH,
CLOSE_CHANNEL_PATH, CONNECT_PEER_PATH, FORCE_CLOSE_CHANNEL_PATH, GET_BALANCES_PATH,
GET_NODE_INFO_PATH, GET_PAYMENT_DETAILS_PATH, LIST_CHANNELS_PATH, LIST_FORWARDED_PAYMENTS_PATH,
LIST_PAYMENTS_PATH, ONCHAIN_RECEIVE_PATH, ONCHAIN_SEND_PATH, OPEN_CHANNEL_PATH, SPLICE_IN_PATH,
SPLICE_OUT_PATH, UPDATE_CHANNEL_CONFIG_PATH,
GET_METRICS_PATH, GET_NODE_INFO_PATH, GET_PAYMENT_DETAILS_PATH, LIST_CHANNELS_PATH,
LIST_FORWARDED_PAYMENTS_PATH, LIST_PAYMENTS_PATH, ONCHAIN_RECEIVE_PATH, ONCHAIN_SEND_PATH,
OPEN_CHANNEL_PATH, SPLICE_IN_PATH, SPLICE_OUT_PATH, UPDATE_CHANNEL_CONFIG_PATH,
};
use prost::Message;

Expand All @@ -47,6 +47,7 @@ use crate::api::open_channel::handle_open_channel;
use crate::api::splice_channel::{handle_splice_in_request, handle_splice_out_request};
use crate::api::update_channel_config::handle_update_channel_config_request;
use crate::io::persist::paginated_kv_store::PaginatedKVStore;
use crate::util::metrics::METRICS;
use crate::util::proto_adapter::to_error_response;

// Maximum request body size: 10 MB
Expand Down Expand Up @@ -148,6 +149,25 @@ impl Service<Request<Incoming>> for NodeService {
type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send>>;

fn call(&self, req: Request<Incoming>) -> Self::Future {
// Handle metrics endpoint separately to bypass auth and return plain text
if req.uri().path().len() > 1 && &req.uri().path()[1..] == GET_METRICS_PATH {
return Box::pin(async move {
match METRICS.gather_metrics() {
Ok(metrics) => Ok(Response::builder()
.header("Content-Type", "text/plain")
.body(Full::new(Bytes::from(metrics)))
.unwrap()),
Err(e) => {
let (error_response, status_code) = to_error_response(e);
Ok(Response::builder()
.status(status_code)
.body(Full::new(Bytes::from(error_response.encode_to_vec())))
.unwrap())
},
}
});
}

// Extract auth params from headers (validation happens after body is read)
let auth_params = match extract_auth_params(&req) {
Ok(params) => params,
Expand Down
138 changes: 138 additions & 0 deletions ldk-server/src/util/metrics.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
// This file is Copyright its original authors, visible in version control
// history.
//
// This file is licensed under the Apache License, Version 2.0 <LICENSE-APACHE
// or http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your option.
// You may not use this file except in accordance with one or both of these
// licenses.

use std::time::Duration;

use lazy_static::lazy_static;
use ldk_node::Node;
use prometheus::{
default_registry, gather, register_int_gauge_with_registry, Encoder, IntGauge, Opts, Registry,
TextEncoder,
};

use crate::api::error::LdkServerError;

pub const BUILD_METRICS_INTERVAL: Duration = Duration::from_secs(60);

lazy_static! {
pub static ref METRICS: Metrics = Metrics::new(default_registry());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having global statics is somewhat of an anti-pattern, can we avoid it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it's safe since it's not a mutable state and wrapped around lazy_static. But will drop that.

}

pub struct Metrics {
pub service_health_score: IntGauge,
}

impl Metrics {
pub fn new(registry: &Registry) -> Self {
Self {
service_health_score: register_int_gauge_with_registry!(
Opts::new("ldk_health_score", "Current health score (0-100)"),
registry
)
.expect("Failed to register metric"),
}
}

pub fn update_service_health_score(&self, node: &Node) {
let score = self.calculate_ldk_server_health_score(node);
self.service_health_score.set(score);
}

/// The health score computation is pretty basic for now and simply
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is it customary to have such a score? I would find it very hard to interpret, tbh.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is it customary to have such a score?

I think it is as some users might want to rely on that to know how their node is performing per time at a glance.

I would find it very hard to interpret, tbh.?

The current computation is pretty basic and relies on the NodeStatus informations from ldk-node. Though, we might want to refine that to include more informations and decide the best weightage value to assign for each event.

/// calculated based on the impacted events on the components of the
/// `Node`. The events severity and weightage value are as follows:
///
/// - Critical: 0 (Total failure)
/// - Major: 35%
/// - Minor: 25%
///
/// Using the assigned score above, the health score of the `Node` is
/// computed as:
///
/// Health score = Maximum health score - Sum(Event severity score)
///
/// Where:
///
/// - Maximum health score = 100
///
/// If the `Node` is not running/online, i.e `is_running` is false,
/// the severity is critical with a weightage value of -100%.
///
/// If the `Node` is running but isn't connected to any peer yet,
/// the severity is major with a weightage value of -35%.
///
/// If the `Node` is running but the Lightning Wallet hasn't been synced
/// yet, the severity is minor with a weightage value of -25%.
pub fn calculate_ldk_server_health_score(&self, node: &Node) -> i64 {
Self::compute_health_score(
node.status().is_running,
!node.list_peers().is_empty(),
node.status().latest_lightning_wallet_sync_timestamp.is_some(),
)
}

pub fn gather_metrics(&self) -> Result<String, LdkServerError> {
let mut buffer = Vec::new();
let encoder = TextEncoder::new();

let all_metrics = gather();
encoder.encode(&all_metrics, &mut buffer)?;
Ok(String::from_utf8(buffer)?)
}

fn compute_health_score(is_running: bool, has_peers: bool, is_wallet_synced: bool) -> i64 {
if !is_running {
return 0;
}

let mut health_score = 100;

if !has_peers {
health_score -= 35;
}

if !is_wallet_synced {
health_score -= 25;
}

health_score
}
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_compute_health_score() {
// Node is not running
assert_eq!(Metrics::compute_health_score(false, true, true), 0);
assert_eq!(Metrics::compute_health_score(false, false, false), 0);

// Node is running, connected to a peer and wallet is synced
assert_eq!(Metrics::compute_health_score(true, true, true), 100);

// Node is running, not connected to a peer but wallet is synced
assert_eq!(Metrics::compute_health_score(true, false, true), 65);

// Node is running, connected to a peer but wallet is not synced
assert_eq!(Metrics::compute_health_score(true, true, false), 75);

// Node is running, not connected to a peer and wallet is not synced
assert_eq!(Metrics::compute_health_score(true, false, false), 40);
}

#[test]
fn test_gather_metrics_format() {
let result = METRICS.gather_metrics();
assert!(result.is_ok());
let output = result.unwrap();
assert!(output.contains("ldk_health_score"));
}
}
1 change: 1 addition & 0 deletions ldk-server/src/util/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,5 +9,6 @@

pub(crate) mod config;
pub(crate) mod logger;
pub(crate) mod metrics;
pub(crate) mod proto_adapter;
pub(crate) mod tls;