Skip to content

[CI] fix: health_check_test flaky due to fixed sleep, use polling for master down detection#1868

Merged
ykwd merged 1 commit intokvcache-ai:mainfrom
herbertskyper:main
Apr 16, 2026
Merged

[CI] fix: health_check_test flaky due to fixed sleep, use polling for master down detection#1868
ykwd merged 1 commit intokvcache-ai:mainfrom
herbertskyper:main

Conversation

@herbertskyper
Copy link
Copy Markdown
Contributor

Description

See #1858

Problem:
The health check tests (ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown) in health_check_test.cpp were flaky in the CI environment. The failure was caused by a hardcoded std::this_thread::sleep_for(std::chrono::seconds(3)) that did not reliably wait for the asynchronous master disconnection state to propagate, leading to race conditions and test assertions failing.

Solution:
Replaced the hardcoded sleep_for with a robust polling mechanism (WaitForHealthCode). The test now polls the health state every 100ms with a 10-second timeout. This change:

  1. Eliminates flakiness: Handles slow CI environments gracefully by allowing up to 10 seconds.
  2. Improves test performance: Exits immediately when the expected state is reached, reducing the wait time from a fixed 3000ms to an average of ~200ms per test.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

  • Repeatedly ran the affected tests (ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown) using --gtest_repeat=100 to ensure stability and zero flakiness.
  • Instrument-tested the wait times: confirmed that the expected state is typically reached within ~200ms (max observed 300ms), proving the transition from a hardcoded 3s wait to the new polling method significantly speeds up the test execution.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a WaitForHealthCode helper function in health_check_test.cpp to replace fixed sleep intervals with a polling mechanism. This change is applied to the ReturnsTwoWhenMasterDown and HttpReturns503WhenMasterDown test cases to improve execution efficiency and reduce flakiness. I have no feedback to provide.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/tests/health_check_test.cpp 92.30% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@00fish0 00fish0 self-assigned this Apr 11, 2026
Copy link
Copy Markdown
Collaborator

@00fish0 00fish0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to the project! Thanks for your first contribution.

@ykwd ykwd merged commit 4a8684b into kvcache-ai:main Apr 16, 2026
35 of 37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants