
[Store] Support SSD Metrics#1879

Merged
ykwd merged 4 commits intokvcache-ai:mainfrom
LujhCoconut:add-feature-ssd-metrics
Apr 15, 2026

Conversation

@LujhCoconut (Collaborator)

Description

Add SSD offloading metrics to mooncake-store, enabling observability for SSD read/write performance including throughput (bytes/s), IOPS (ops/s), and exact P50/P90/P99 latency.

Currently, FileStorage::BatchLoad() and FileStorage::OffloadObjects() contain timing code (VLOG) but export no structured metrics. This PR introduces a new SsdMetric struct (following the existing TransferMetric pattern) and instruments both the read (BatchLoad) and write (BatchOffload) paths. All metrics are reported in three dimensions: Read, Write, and Total (combined).

Exported Metrics (12 total)

Counters (6):

| Metric | Description |
| --- | --- |
| `mooncake_ssd_read_bytes_total` | Cumulative bytes read from SSD |
| `mooncake_ssd_write_bytes_total` | Cumulative bytes written to SSD |
| `mooncake_ssd_total_bytes_total` | Cumulative bytes read + written (combined) |
| `mooncake_ssd_read_ops_total` | Cumulative key count read |
| `mooncake_ssd_write_ops_total` | Cumulative key count written |
| `mooncake_ssd_total_ops_total` | Cumulative key count read + written (combined) |

Histograms (3, for Prometheus aggregation):

| Metric | Description |
| --- | --- |
| `mooncake_ssd_read_latency_us` | Per-batch read latency (18 buckets: 50μs ~ 30s) |
| `mooncake_ssd_write_latency_us` | Per-batch write latency |
| `mooncake_ssd_total_latency_us` | Per-batch total latency (combined) |

Summaries (3, for exact percentiles):

| Metric | Description |
| --- | --- |
| `mooncake_ssd_read_latency_summary_us` | Read latency exact P50/P90/P99 |
| `mooncake_ssd_write_latency_summary_us` | Write latency exact P50/P90/P99 |
| `mooncake_ssd_total_latency_summary_us` | Total latency exact P50/P90/P99 |

Key Design Decisions

  • Client-side metrics: SSD I/O runs in the client process, so SsdMetric lives in ClientMetric, not MasterMetricManager.
  • Histogram + Summary dual approach: Histograms for Prometheus export (cross-instance aggregatable); Summaries for precise P50/P90/P99 with avg in human-readable logs.
  • Latency buckets tuned for SSD: 18 buckets from 50μs (high-end NVMe) to 30s (3fs/nfs large-object writes), distinct from the RDMA-tuned kLatencyBucket.
  • Three dimensions: Read, Write, and Total (combined) for all metrics, as per requirements.
  • Failed operations excluded: Only successful BatchLoad/BatchOffload calls are counted, keeping IOPS/throughput/latency clean.
  • Backward compatible: SsdMetric* defaults to nullptr; existing callers are unaffected.
  • Prometheus naming convention: All counters use _total suffix, consistent with master-side metrics (e.g., master_put_start_requests_total).

Changed Files (9 files, +382 / -3)

| File | Change |
| --- | --- |
| `include/client_metric.h` | +212: `kSsdLatencyBucket`, `SsdMetric` struct (12 metrics, serialize, summary with throughput/IOPS/exact percentiles) |
| `include/client_service.h` | +4: `Client::GetSsdMetricPtr()` accessor |
| `include/file_storage.h` | +6: `SsdMetric*` constructor param, `ssd_metric_` member, forward decl |
| `src/client_metric.cpp` | +4: `ssd_metric` in constructor/serialize/summary |
| `src/file_storage.cpp` | +39: `BatchLoad()` read instrumentation, `OffloadObjects()` write instrumentation |
| `src/real_client.cpp` | +3: pass `client_->GetSsdMetricPtr()` to `FileStorage` |
| `tests/ssd_metrics_test.cpp` | New: 13 unit tests |
| `tests/file_storage_test.cpp` | +116: 3 integration tests |
| `tests/CMakeLists.txt` | +1: register `ssd_metrics_test` |

Module

  • Mooncake Store (mooncake-store)

Type of Change

  • New feature

How Has This Been Tested?

Unit Tests (ssd_metrics_test, 13 test cases)

```
$ ./mooncake-store/tests/ssd_metrics_test
[==========] Running 13 tests from 1 test suite.
[  PASSED  ] 13 tests.
```
| Test Case | Verifies |
| --- | --- |
| `InitialValuesTest` | All metrics initialize to 0 |
| `ReadMetricsTest` | Read ops/bytes accumulate correctly across multiple `BatchLoad` calls |
| `WriteMetricsTest` | Write ops/bytes accumulate correctly, read metrics remain 0 |
| `TotalMetricsTest` | `total_ops = read_ops + write_ops`, `total_bytes = read_bytes + write_bytes`, total latency histogram/summary have correct observation counts |
| `FailureNotCountedTest` | Failed operations do not increment any metric |
| `ConcurrentTest` | 8 threads × 1000 ops, final counters (including totals) match expected values |
| `SerializeTest` | Prometheus output contains all 12 metric names (counters, histograms, summaries) |
| `SummaryMetricsTest` | Human-readable summary shows bytes, ops, throughput, IOPS, exact P50/P90/P99 with avg; no throughput/IOPS when data is 0 |
| `ThroughputCalculationTest` | 10MB read + 20MB write → 3 lines of `throughput=` and `IOPS=` (read, write, total) |
| `IntegrationWithClientMetric` | `SsdMetric` data appears in `ClientMetric::serialize()` and `summary_metrics()` |
| `SerializeWithDynamicLabels` | Labels (`instance_id`, `cluster_id`) propagate into serialized output |
| `LatencyBucketBoundaryTest` | Values at min bucket (50μs), max bucket (30s), and >max fall into correct buckets (+Inf) |
| `EmptyBatchMetricsTest` | Empty batch (0 ops, 0 bytes) does not crash; counters remain 0 |

Integration Tests (file_storage_test, 3 new test cases)

```
$ ./mooncake-store/tests/file_storage_test --gtest_filter='*SsdMetric*:*Ssd*:*Null*'
[==========] Running 3 tests from 1 test suite.
[  PASSED  ] 3 tests.
```
| Test Case | Verifies |
| --- | --- |
| `BatchLoadRecordsSsdMetrics` | Write 100 keys (21000 bytes) via `BatchOffloadUtil`, then `BatchLoad` → `ssd_read_ops=100`, `ssd_read_bytes=21000`, latency has 1 observation |
| `BatchLoadFailureDoesNotRecordSsdMetrics` | `BatchLoad` with nonexistent keys → all metrics remain 0 |
| `NullSsdMetricDoesNotCrash` | `FileStorage(nullptr)` completes `BatchLoad` without crash (backward compat) |

Regression Tests

```
$ ./mooncake-store/tests/client_metrics_test   →  [PASSED] 7 tests.
$ ./mooncake-store/tests/file_storage_test     →  [PASSED] 20 tests.
```

Sample Output

summary_metrics() human-readable output:

```
=== SSD Metrics Summary ===
SSD Read: 10.00 MB, ops=1000, throughput=99.9 MB/s, IOPS=9992.9
SSD Write: 20.00 MB, ops=2000, throughput=199.9 MB/s, IOPS=19985.9
SSD Total: 30.00 MB, ops=3000, throughput=299.8 MB/s, IOPS=29978.8

=== SSD Latency Summary (microseconds) ===
Read: count=100, p50=500.0us, p90=2000.0us, p99=49664.0us, avg=6016.4us
Write: count=50, p50=1000.0us, p90=5000.0us, p99=20000.0us, avg=3200.0us
Total: count=150, p50=500.0us, p90=2000.0us, p99=49664.0us, avg=5079.5us
```

Prometheus serialized output (partial):

```
# HELP mooncake_ssd_read_bytes_total Total bytes read from SSD
# TYPE mooncake_ssd_read_bytes_total counter
mooncake_ssd_read_bytes_total 10485760

# HELP mooncake_ssd_read_latency_us SSD BatchLoad latency per batch (us)
# TYPE mooncake_ssd_read_latency_us histogram
mooncake_ssd_read_latency_us_bucket{le="50.000000"} 0
mooncake_ssd_read_latency_us_bucket{le="100.000000"} 0
...

# HELP mooncake_ssd_read_latency_summary_us SSD read latency quantiles (us)
# TYPE mooncake_ssd_read_latency_summary_us summary
mooncake_ssd_read_latency_summary_us{quantile="0.500000"} 500.0
mooncake_ssd_read_latency_summary_us{quantile="0.900000"} 2000.0
mooncake_ssd_read_latency_summary_us{quantile="0.990000"} 49664.0
mooncake_ssd_read_latency_summary_us_sum 601640
mooncake_ssd_read_latency_summary_us_count 100
```

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces SSD performance metrics to the mooncake-store, including counters for bytes and operations along with latency histograms and summaries. The new SsdMetric struct is integrated into the ClientMetric and FileStorage classes to instrument the read and write paths. Review feedback identifies that the current write latency measurement in OffloadObjects likely only captures submission time for asynchronous backends and should be moved to a completion handler for accuracy. Additionally, the format_ssd_latency_summary method is noted as dead code with logic flaws and is recommended for removal.

Comment thread mooncake-store/src/file_storage.cpp Outdated
Comment on lines +375 to +397
```cpp
auto offload_start = std::chrono::steady_clock::now();
auto offload_res = storage_backend_->BatchOffload(
    batch_object, complete_handler, eviction_handler);
if (offload_res && ssd_metric_) {
    auto offload_elapsed_us =
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - offload_start)
            .count();
    int64_t total_bytes = 0;
    for (const auto& [key, slices] : batch_object) {
        for (const auto& slice : slices) {
            total_bytes += slice.size;
        }
    }
    ssd_metric_->ssd_write_ops.inc(batch_object.size());
    ssd_metric_->ssd_write_bytes.inc(total_bytes);
    ssd_metric_->ssd_write_latency_us.observe(offload_elapsed_us);
    ssd_metric_->ssd_write_latency_summary.observe(offload_elapsed_us);
    ssd_metric_->ssd_total_ops.inc(batch_object.size());
    ssd_metric_->ssd_total_bytes.inc(total_bytes);
    ssd_metric_->ssd_total_latency_us.observe(offload_elapsed_us);
    ssd_metric_->ssd_total_latency_summary.observe(offload_elapsed_us);
}
```

Severity: high

The instrumentation for SSD write metrics is currently placed immediately after the call to storage_backend_->BatchOffload. Since BatchOffload uses a CompletionHandler, it is likely an asynchronous operation in many backend implementations (such as BucketStorageBackend). Measuring latency here only captures the submission time, not the actual I/O duration. Furthermore, metrics are recorded based on submission success rather than actual write completion. It is recommended to move the metrics recording logic into a per-bucket completion handler that captures the start time and calculates the total bytes from the successful metadatas received in the callback. This will ensure that throughput, IOPS, and latency metrics accurately reflect successful SSD operations.

```cpp
auto offload_start = std::chrono::steady_clock::now();
auto bucket_complete_handler = [this, offload_start, complete_handler](
                                   const std::vector<std::string>& keys,
                                   std::vector<StorageObjectMetadata>& metadatas) {
    auto res = complete_handler(keys, metadatas);
    if (res == ErrorCode::OK && ssd_metric_) {
        auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(
                              std::chrono::steady_clock::now() - offload_start)
                              .count();
        int64_t total_bytes = 0;
        for (const auto& metadata : metadatas) total_bytes += metadata.size;
        ssd_metric_->ssd_write_ops.inc(keys.size());
        ssd_metric_->ssd_write_bytes.inc(total_bytes);
        ssd_metric_->ssd_write_latency_us.observe(elapsed_us);
        ssd_metric_->ssd_write_latency_summary.observe(elapsed_us);
        ssd_metric_->ssd_total_ops.inc(keys.size());
        ssd_metric_->ssd_total_bytes.inc(total_bytes);
        ssd_metric_->ssd_total_latency_us.observe(elapsed_us);
        ssd_metric_->ssd_total_latency_summary.observe(elapsed_us);
    }
    return res;
};
auto offload_res = storage_backend_->BatchOffload(
    batch_object, bucket_complete_handler, eviction_handler);
```

Comment thread mooncake-store/include/client_metric.h Outdated
Comment on lines +418 to +461
```cpp
    if (bucket_counts.empty()) {
        return "No data";
    }

    int64_t total_count = 0;
    for (auto& bucket : bucket_counts) {
        total_count += bucket->value();
    }

    if (total_count == 0) {
        return "No data";
    }

    std::stringstream ss;
    ss << "count=" << total_count;

    // Find P50, P90, P99
    auto find_percentile = [&](int pct) -> double {
        int64_t target = (total_count * pct) / 100;
        int64_t cumulative = 0;
        for (size_t i = 0;
             i < bucket_counts.size() && i < kSsdLatencyBucket.size(); i++) {
            cumulative += bucket_counts[i]->value();
            if (cumulative >= target) {
                return kSsdLatencyBucket[i];
            }
        }
        return 0;
    };

    double p50 = find_percentile(50);
    double p90 = find_percentile(90);
    double p99 = find_percentile(99);

    if (p50 > 0) ss << ", p50<" << p50 << "us";
    if (p90 > 0) ss << ", p90<" << p90 << "us";
    if (p99 > 0) ss << ", p99<" << p99 << "us";

    return ss.str();
}

std::string format_summary_percentiles(ylt::metric::summary_t& summary) {
    double sum = 0;
```

Severity: medium

The format_ssd_latency_summary method is defined but never used within the SsdMetric struct or elsewhere in the codebase. Additionally, it contains a redundant const_cast on line 420 and its percentile calculation logic ignores the +Inf bucket, which could lead to inaccurate results if observations exceed the defined boundaries. Since summary_metrics() already uses the more accurate format_summary_percentiles based on summary_t, this dead code should be removed to improve maintainability.

@LujhCoconut LujhCoconut force-pushed the add-feature-ssd-metrics branch from b97d8cc to 946e450 Compare April 13, 2026 08:58
@codecov-commenter

codecov-commenter commented Apr 13, 2026


Codecov Report

❌ Patch coverage is 94.31579% with 27 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| `mooncake-store/src/file_storage.cpp` | 37.83% | 23 Missing ⚠️ |
| `mooncake-store/include/client_service.h` | 0.00% | 2 Missing ⚠️ |
| `mooncake-store/tests/ssd_metrics_test.cpp` | 99.26% | 2 Missing ⚠️ |


@staryxchen (Collaborator)

@00fish0 On a separate note unrelated to this PR, I’d like to ask: Should a PR in the draft stage trigger CI?


00fish0 commented Apr 13, 2026

> @00fish0 On a separate note unrelated to this PR, I'd like to ask: Should a PR in the draft stage trigger CI?

Personally, I think we could go with a selective CI approach for drafts. We could trigger just one or two core build checks for Draft PRs to provide early feedback.

@ykwd ykwd requested a review from zhangzuo21 April 13, 2026 13:12
Comment thread mooncake-store/include/client_metric.h Outdated

```cpp
// SSD latency bucket: microseconds, tuned for SSD/network storage
// Range: 50us (high-end NVMe) to 30s (3fs/nfs large object batch writes)
const std::vector<double> kSsdLatencyBucket = {
```
kSsdLatencyBucket is defined in the header, so each .cpp that includes it gets its own copy. Consider adding inline to avoid duplication.

@LujhCoconut (Author)
Thanks for your insightful suggestion. Fixed.

@zhangzuo21

LGTM, thanks for the contribution!

Comment thread .claude/plan_olloc_ssd_v2.md Outdated
```
@@ -0,0 +1,960 @@
# Plan V2: Production-Ready OffsetAllocatorStorageBackend for SSD Offloading
```

Please delete the file.

Comment thread .gitignore Outdated

```
# CodeBuddy Memory
.codebuddy/
```

Why changing .gitignore?

@zhangzuo21

LGTM for me now.

@LujhCoconut LujhCoconut marked this pull request as ready for review April 14, 2026 08:42
@ykwd ykwd merged commit c257340 into kvcache-ai:main Apr 15, 2026
17 checks passed
