[TENT] Wire up cross-transport failover with safety limits and observability by staryxchen · Pull Request #1878 · kvcache-ai/Mooncake

staryxchen · 2026-04-13T07:19:36Z

Description

The existing resubmitTransferTask() in TENT had complete failover logic (increment xport_priority, resolve next transport, resubmit) but was never called — the failover path was dead code. This PR activates it and adds production safeguards.

Changes:

Wire getTransferStatus() to call resubmitTransferTask() when a task reports FAILED, automatically falling over to the next transport in the priority chain (e.g. RDMA → TCP)
Add failover_count to TaskInfo with configurable max_failover_attempts (default 3) to prevent infinite retry loops
Add transportTypeName() helper and structured LOG(INFO) for failover events (e.g. Transport failover: RDMA -> TCP (attempt 1/3))
Add tent_transport_failover_total Prometheus counter metric and include it in /metrics/summary
Add 11 unit tests covering failover state machine, config loading, and limit checks

Relates to: #553 (RDMA/TCP fallback)

Module

Transfer Engine (mooncake-transfer-engine)

Type of Change

New feature

How Has This Been Tested?

Unit tests (tent_failover_test): 11 test cases covering TaskInfo.failover_count defaults/increment/limit, config loading (default/custom/zero-disables), TransportType enum correctness, full failover state machine simulation (RDMA→TCP→success), exhaustion scenario, and staging bypass logic.
Compile verification: Builds cleanly with existing CI configuration.
E2E validation with tebench planned for multi-transport environments (RDMA + TCP).

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

…and observability The existing resubmitTransferTask() had complete failover logic (increment xport_priority, resolve next transport, resubmit) but was never called. This commit activates the failover path and adds production safeguards: - Wire getTransferStatus() to call resubmitTransferTask() on FAILED tasks - Add failover_count to TaskInfo with configurable max_failover_attempts (default 3) - Add transportTypeName() helper and structured LOG(INFO) for failover events - Add tent_transport_failover_total Prometheus counter metric - Add unit tests for failover state machine, config loading, and limit checks Signed-off-by: staryxchen <staryxchen@tencent.com>

gemini-code-assist

Code Review

This pull request introduces a transport failover mechanism to the transfer engine, allowing tasks to be automatically resubmitted using alternative transports upon failure. Key changes include the addition of a configurable max_failover_attempts parameter, tracking of failover counts within TaskInfo, and the implementation of resubmitTransferTask to handle the transition between transports. Additionally, new metrics have been added to monitor failover events, and a new test suite validates the failover state machine. A review comment identifies a potential logic error in the batch version of getTransferStatus, where resubmitting one task could incorrectly overwrite a permanent FAILED status of another task in the same batch, potentially leading to incorrect overall status reporting.

Move the failover attempt before status aggregation so that a successfully resubmitted task appears as PENDING to the existing if/else-if chain. Previously the code unconditionally set overall_status.s = PENDING after a successful resubmit, which could overwrite a permanent FAILED from another task in the same batch. Now the `else if (task_status.s != PENDING)` branch naturally skips PENDING tasks, and only truly-failed tasks set overall_status to FAILED. Signed-off-by: staryxchen <staryxchen@tencent.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-13T07:49:09Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

alogfans

LGTM

staryxchen requested review from alogfans, chestnut-Q and doujiang24 as code owners April 13, 2026 07:19

github-actions Bot added run-ci Transfer Engine labels Apr 13, 2026

staryxchen changed the title ~~[TransferEngine] Wire up cross-transport failover with safety limits and observability~~ [TENT] Wire up cross-transport failover with safety limits and observability Apr 13, 2026

gemini-code-assist Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread mooncake-transfer-engine/tent/src/runtime/transfer_engine_impl.cpp

staryxchen added the run-e2e-ci label Apr 13, 2026

github-actions Bot removed the run-e2e-ci label Apr 13, 2026

alogfans approved these changes Apr 15, 2026

View reviewed changes

alogfans merged commit aadcc7d into kvcache-ai:main Apr 15, 2026
37 checks passed

staryxchen mentioned this pull request Apr 15, 2026

feat(tent): replace raw RdmaEndPoint* with weak_ptr for endpoint lifecycle safety #1897

Merged

6 tasks

staryxchen deleted the worktree-tent-failover-hardening branch April 16, 2026 03:58

staryxchen mentioned this pull request Apr 16, 2026

[TENT] add FaultProxyTransport for fault injection testing #1907

Merged

21 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TENT] Wire up cross-transport failover with safety limits and observability#1878

[TENT] Wire up cross-transport failover with safety limits and observability#1878
alogfans merged 2 commits intokvcache-ai:mainfrom
staryxchen:worktree-tent-failover-hardening

staryxchen commented Apr 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

codecov-commenter commented Apr 13, 2026

Uh oh!

alogfans left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

staryxchen commented Apr 13, 2026

Description

Module

Type of Change

How Has This Been Tested?

Checklist

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

codecov-commenter commented Apr 13, 2026

Codecov Report

Uh oh!

alogfans left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants