[CI] fix: health_check_test flaky due to fixed sleep, use polling for master down detection by herbertskyper · Pull Request #1868 · kvcache-ai/Mooncake

herbertskyper · 2026-04-11T04:35:58Z

Description

Problem:
The health check tests (ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown) in health_check_test.cpp were flaky in the CI environment. Both relied on a hardcoded std::this_thread::sleep_for(std::chrono::seconds(3)), which was not always long enough for the asynchronous master disconnection state to propagate. The assertions therefore raced against the state change and failed intermittently.

Solution:
Replaced the hardcoded sleep_for with a polling helper (WaitForHealthCode). Each test now polls the health state every 100 ms, with a 10-second timeout. This change:

Eliminates flakiness: Tolerates slow CI environments by allowing up to 10 seconds for the state change.
Improves test performance: Returns as soon as the expected state is observed, reducing the wait from a fixed 3000 ms to an average of ~200 ms per test.
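
The body of WaitForHealthCode is not shown on this page, so the following is only a minimal sketch of the poll-with-deadline pattern described above. The name PollUntilEquals, its signature, and the probe callback are hypothetical and are not taken from the Mooncake code; only the 100 ms interval and 10-second timeout defaults come from the description.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not the PR's actual WaitForHealthCode).
// Calls `probe` repeatedly until it returns `expected` or `timeout` elapses.
// Returns true if the expected value was observed before the deadline.
inline bool PollUntilEquals(
    const std::function<int()>& probe, int expected,
    std::chrono::milliseconds timeout = std::chrono::seconds(10),
    std::chrono::milliseconds interval = std::chrono::milliseconds(100)) {
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the deadline.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        // Check first so an already-satisfied condition returns without sleeping.
        if (probe() == expected) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(interval);
    }
}
```

In a GoogleTest case, the return value could be wrapped in ASSERT_TRUE so that a timeout fails the test with a clear message instead of letting a later assertion fail on stale state.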

Module

Type of Change

How Has This Been Tested?

Repeatedly ran the affected tests (ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown) using --gtest_repeat=100 to ensure stability and zero flakiness.
Instrumented the tests to measure wait times: the expected state is typically reached within ~200 ms (max observed 300 ms), so replacing the hardcoded 3 s wait with polling significantly shortens test execution.
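
The instrumentation used for these measurements is not included on this page. As a rough sketch of how such a wait time could be measured, assuming a hypothetical MeasureWait helper and a caller-supplied `ready` predicate (neither is part of the PR):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical instrumentation (not from the PR): polls `ready` every
// `interval` and returns the elapsed time until it became true, or until
// `timeout` expired, whichever came first.
inline std::chrono::milliseconds MeasureWait(
    const std::function<bool()>& ready,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    const auto deadline = start + timeout;
    while (!ready() && Clock::now() < deadline) {
        std::this_thread::sleep_for(interval);
    }
    // Truncates to whole milliseconds.
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        Clock::now() - start);
}
```

Logging the returned duration on each iteration of a repeated run would give the kind of typical and maximum figures quoted above. The resolution of the result is bounded by the polling interval.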

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.


gemini-code-assist

Code Review

This pull request introduces a WaitForHealthCode helper function in health_check_test.cpp to replace fixed sleep intervals with a polling mechanism. This change is applied to the ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown test cases to improve execution efficiency and reduce flakiness. I have no feedback to provide.

codecov-commenter · 2026-04-11T04:59:24Z


Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
mooncake-store/tests/health_check_test.cpp	92.30%	1 Missing ⚠️


00fish0

Welcome to the project! Thanks for your first contribution.

[CI] fix: health_check_test flaky due to fixed sleep, use polling for…

ad465e9

… master down detection

herbertskyper requested review from XucSh, YiXR, stmatengss and ykwd as code owners April 11, 2026 04:35

github-actions bot added the run-ci and Store labels Apr 11, 2026

gemini-code-assist bot reviewed Apr 11, 2026


00fish0 self-assigned this Apr 11, 2026

00fish0 added the run-e2e-ci label Apr 12, 2026

github-actions bot removed the run-e2e-ci label Apr 12, 2026

00fish0 approved these changes Apr 16, 2026


ykwd merged commit 4a8684b into kvcache-ai:main Apr 16, 2026
35 of 37 checks passed

ykwd merged 1 commit into kvcache-ai:main from herbertskyper:main

4 participants
