Conversation
Code Review
This pull request introduces SSD performance metrics to the mooncake-store, including counters for bytes and operations along with latency histograms and summaries. The new SsdMetric struct is integrated into the ClientMetric and FileStorage classes to instrument the read and write paths. Review feedback identifies that the current write latency measurement in OffloadObjects likely only captures submission time for asynchronous backends and should be moved to a completion handler for accuracy. Additionally, the format_ssd_latency_summary method is noted as dead code with logic flaws and is recommended for removal.
auto offload_start = std::chrono::steady_clock::now();
auto offload_res = storage_backend_->BatchOffload(
    batch_object, complete_handler, eviction_handler);
if (offload_res && ssd_metric_) {
    auto offload_elapsed_us =
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - offload_start)
            .count();
    int64_t total_bytes = 0;
    for (const auto& [key, slices] : batch_object) {
        for (const auto& slice : slices) {
            total_bytes += slice.size;
        }
    }
    ssd_metric_->ssd_write_ops.inc(batch_object.size());
    ssd_metric_->ssd_write_bytes.inc(total_bytes);
    ssd_metric_->ssd_write_latency_us.observe(offload_elapsed_us);
    ssd_metric_->ssd_write_latency_summary.observe(offload_elapsed_us);
    ssd_metric_->ssd_total_ops.inc(batch_object.size());
    ssd_metric_->ssd_total_bytes.inc(total_bytes);
    ssd_metric_->ssd_total_latency_us.observe(offload_elapsed_us);
    ssd_metric_->ssd_total_latency_summary.observe(offload_elapsed_us);
}
The instrumentation for SSD write metrics is currently placed immediately after the call to storage_backend_->BatchOffload. Since BatchOffload uses a CompletionHandler, it is likely an asynchronous operation in many backend implementations (such as BucketStorageBackend). Measuring latency here only captures the submission time, not the actual I/O duration. Furthermore, metrics are recorded based on submission success rather than actual write completion. It is recommended to move the metrics recording logic into a per-bucket completion handler that captures the start time and calculates the total bytes from the successful metadatas received in the callback. This will ensure that throughput, IOPS, and latency metrics accurately reflect successful SSD operations.
auto offload_start = std::chrono::steady_clock::now();
auto bucket_complete_handler = [this, offload_start, complete_handler](
const std::vector<std::string>& keys,
std::vector<StorageObjectMetadata>& metadatas) {
auto res = complete_handler(keys, metadatas);
if (res == ErrorCode::OK && ssd_metric_) {
auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(
std::chrono::steady_clock::now() - offload_start)
.count();
int64_t total_bytes = 0;
for (const auto& metadata : metadatas) total_bytes += metadata.size;
ssd_metric_->ssd_write_ops.inc(keys.size());
ssd_metric_->ssd_write_bytes.inc(total_bytes);
ssd_metric_->ssd_write_latency_us.observe(elapsed_us);
ssd_metric_->ssd_write_latency_summary.observe(elapsed_us);
ssd_metric_->ssd_total_ops.inc(keys.size());
ssd_metric_->ssd_total_bytes.inc(total_bytes);
ssd_metric_->ssd_total_latency_us.observe(elapsed_us);
ssd_metric_->ssd_total_latency_summary.observe(elapsed_us);
}
return res;
};
auto offload_res = storage_backend_->BatchOffload(
    batch_object, bucket_complete_handler, eviction_handler);

if (bucket_counts.empty()) {
    return "No data";
}

int64_t total_count = 0;
for (auto& bucket : bucket_counts) {
    total_count += bucket->value();
}

if (total_count == 0) {
    return "No data";
}

std::stringstream ss;
ss << "count=" << total_count;

// Find P50, P90, P99
auto find_percentile = [&](int pct) -> double {
    int64_t target = (total_count * pct) / 100;
    int64_t cumulative = 0;
    for (size_t i = 0;
         i < bucket_counts.size() && i < kSsdLatencyBucket.size();
         i++) {
        cumulative += bucket_counts[i]->value();
        if (cumulative >= target) {
            return kSsdLatencyBucket[i];
        }
    }
    return 0;
};

double p50 = find_percentile(50);
double p90 = find_percentile(90);
double p99 = find_percentile(99);

if (p50 > 0) ss << ", p50<" << p50 << "us";
if (p90 > 0) ss << ", p90<" << p90 << "us";
if (p99 > 0) ss << ", p99<" << p99 << "us";

return ss.str();
}

std::string format_summary_percentiles(ylt::metric::summary_t& summary) {
    double sum = 0;
The format_ssd_latency_summary method is defined but never used within the SsdMetric struct or elsewhere in the codebase. Additionally, it contains a redundant const_cast on line 420 and its percentile calculation logic ignores the +Inf bucket, which could lead to inaccurate results if observations exceed the defined boundaries. Since summary_metrics() already uses the more accurate format_summary_percentiles based on summary_t, this dead code should be removed to improve maintainability.
b97d8cc to 946e450
@00fish0 On a separate note unrelated to this PR, I'd like to ask: should a PR in the draft stage trigger CI?

Personally, I think we could go with a selective CI approach for drafts: trigger just one or two core build checks for draft PRs to provide early feedback.
// SSD latency bucket: microseconds, tuned for SSD/network storage
// Range: 50us (high-end NVMe) to 30s (3fs/nfs large object batch writes)
const std::vector<double> kSsdLatencyBucket = {
kSsdLatencyBucket is defined in the header, so each .cpp that includes it gets its own copy. Consider adding inline to avoid duplication.
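A minimal sketch of the suggested fix — the boundary values below are illustrative (spanning the 50us-to-30s range the header comment describes), not the PR's exact list:

```cpp
#include <vector>

// C++17 inline variable: one definition shared by every translation unit
// that includes this header, instead of one private copy per .cpp file.
// Boundary values here are illustrative, not the PR's exact buckets.
inline const std::vector<double> kSsdLatencyBucket = {
    50, 100, 1000, 10000, 100000, 1000000, 30000000};
```

Without `inline`, a namespace-scope `const std::vector` in a header has internal linkage, so each including .cpp materializes its own vector at startup.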
Thanks for your insightful suggestion. Fixed.
LGTM, thanks for the contribution!
@@ -0,0 +1,960 @@
# Plan V2: Production-Ready OffsetAllocatorStorageBackend for SSD Offloading
Please delete the file.
# CodeBuddy Memory
.codebuddy/
Why change .gitignore?
LGTM for me now.
Description
Add SSD offloading metrics to mooncake-store, enabling observability for SSD read/write performance including throughput (bytes/s), IOPS (ops/s), and exact P50/P90/P99 latency.

Currently, FileStorage::BatchLoad() and FileStorage::OffloadObjects() have timing code (VLOG), but no structured metrics are exported. This PR introduces a new SsdMetric struct (following the existing TransferMetric pattern) and instruments both the read (BatchLoad) and write (BatchOffload) paths. All metrics are reported in three dimensions: Read, Write, and Total (combined).

Exported Metrics (12 total)
Counters (6):
- mooncake_ssd_read_bytes_total
- mooncake_ssd_write_bytes_total
- mooncake_ssd_total_bytes_total
- mooncake_ssd_read_ops_total
- mooncake_ssd_write_ops_total
- mooncake_ssd_total_ops_total

Histograms (3, for Prometheus aggregation):
- mooncake_ssd_read_latency_us
- mooncake_ssd_write_latency_us
- mooncake_ssd_total_latency_us

Summaries (3, for exact percentiles):
- mooncake_ssd_read_latency_summary_us
- mooncake_ssd_write_latency_summary_us
- mooncake_ssd_total_latency_summary_us

Key Design Decisions
- SsdMetric lives in ClientMetric, not MasterMetricManager.
- A dedicated kSsdLatencyBucket (50us to 30s, tuned for SSD/network storage) is used instead of the generic kLatencyBucket.
- Only successful BatchLoad/BatchOffload calls are counted, keeping IOPS/throughput/latency clean.
- The SsdMetric* parameter defaults to nullptr; existing callers are unaffected.
- Counters use the _total suffix, consistent with master-side metrics (e.g., master_put_start_requests_total).

Changed Files (8 files, +382 / -3)
- include/client_metric.h: kSsdLatencyBucket, SsdMetric struct (12 metrics, serialize, summary with throughput/IOPS/exact percentiles)
- include/client_service.h: Client::GetSsdMetricPtr() accessor
- include/file_storage.h: SsdMetric* constructor param, ssd_metric_ member, forward decl
- src/client_metric.cpp: ssd_metric in constructor/serialize/summary
- src/file_storage.cpp: BatchLoad() read instrumentation, OffloadObjects() write instrumentation
- src/real_client.cpp: passes client_->GetSsdMetricPtr() to FileStorage
- tests/ssd_metrics_test.cpp: new unit tests
- tests/file_storage_test.cpp: new integration tests
- tests/CMakeLists.txt: registers ssd_metrics_test

Module
mooncake-store

Type of Change
How Has This Been Tested?
Unit Tests (ssd_metrics_test, 13 test cases)
- InitialValuesTest
- ReadMetricsTest
- WriteMetricsTest
- TotalMetricsTest: total_ops = read_ops + write_ops, total_bytes = read_bytes + write_bytes; total latency histogram/summary have correct observation counts
- FailureNotCountedTest
- ConcurrentTest
- SerializeTest
- SummaryMetricsTest
- ThroughputCalculationTest: summary output contains throughput= and IOPS= (read, write, total)
- IntegrationWithClientMetric: SsdMetric data appears in ClientMetric::serialize() and summary_metrics()
- SerializeWithDynamicLabels: dynamic labels (instance_id, cluster_id) propagate into serialized output
- LatencyBucketBoundaryTest
- EmptyBatchMetricsTest

Integration Tests (file_storage_test, 3 new test cases)
- BatchLoadRecordsSsdMetrics: offload via BatchOffloadUtil, then BatchLoad → ssd_read_ops=100, ssd_read_bytes=21000, latency has 1 observation
- BatchLoadFailureDoesNotRecordSsdMetrics: BatchLoad with nonexistent keys → all metrics remain 0
- NullSsdMetricDoesNotCrash: FileStorage(nullptr) completes BatchLoad without crash (backward compat)
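The counter relationships these tests assert (every read also feeds the combined Total dimension) can be sketched with a simplified stand-in for SsdMetric — plain atomics here instead of the real ylt::metric counter types, and only the read path shown:

```cpp
#include <atomic>
#include <cstdint>

// Simplified stand-in for SsdMetric: counter names mirror the PR,
// but the internals (plain atomics) are illustrative only.
struct SsdMetricSketch {
    std::atomic<int64_t> ssd_read_ops{0};
    std::atomic<int64_t> ssd_read_bytes{0};
    std::atomic<int64_t> ssd_total_ops{0};
    std::atomic<int64_t> ssd_total_bytes{0};

    // Record one successful BatchLoad of `ops` keys totalling `bytes`;
    // the read counters and the combined "total" counters move together.
    void record_read(int64_t ops, int64_t bytes) {
        ssd_read_ops += ops;
        ssd_read_bytes += bytes;
        ssd_total_ops += ops;
        ssd_total_bytes += bytes;
    }
};
```

After `record_read(100, 21000)` — the figures from BatchLoadRecordsSsdMetrics — read and total counters both show 100 ops and 21000 bytes.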
Sample Output
summary_metrics()human-readable output:Prometheus serialized output (partial):
Checklist
./scripts/code_format.shbefore submitting.