Skip to content

[TENT] Wire up cross-transport failover with safety limits and observability#1878

Merged
alogfans merged 2 commits intokvcache-ai:mainfrom
staryxchen:worktree-tent-failover-hardening
Apr 15, 2026
Merged

[TENT] Wire up cross-transport failover with safety limits and observability#1878
alogfans merged 2 commits intokvcache-ai:mainfrom
staryxchen:worktree-tent-failover-hardening

Conversation

@staryxchen
Copy link
Copy Markdown
Collaborator

Description

The existing resubmitTransferTask() in TENT had complete failover logic (increment xport_priority, resolve next transport, resubmit) but was never called — the failover path was dead code. This PR activates it and adds production safeguards.

Changes:

  • Wire getTransferStatus() to call resubmitTransferTask() when a task reports FAILED, automatically falling over to the next transport in the priority chain (e.g. RDMA → TCP)
  • Add failover_count to TaskInfo with configurable max_failover_attempts (default 3) to prevent infinite retry loops
  • Add transportTypeName() helper and structured LOG(INFO) for failover events (e.g. Transport failover: RDMA -> TCP (attempt 1/3))
  • Add tent_transport_failover_total Prometheus counter metric and include it in /metrics/summary
  • Add 11 unit tests covering failover state machine, config loading, and limit checks

Relates to: #553 (RDMA/TCP fallback)

Module

  • Transfer Engine (mooncake-transfer-engine)

Type of Change

  • New feature

How Has This Been Tested?

  • Unit tests (tent_failover_test): 11 test cases covering TaskInfo.failover_count defaults/increment/limit, config loading (default/custom/zero-disables), TransportType enum correctness, full failover state machine simulation (RDMA→TCP→success), exhaustion scenario, and staging bypass logic.
  • Compile verification: Builds cleanly with existing CI configuration.
  • E2E validation with tebench planned for multi-transport environments (RDMA + TCP).

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

…and observability

The existing resubmitTransferTask() had complete failover logic (increment
xport_priority, resolve next transport, resubmit) but was never called.
This commit activates the failover path and adds production safeguards:

- Wire getTransferStatus() to call resubmitTransferTask() on FAILED tasks
- Add failover_count to TaskInfo with configurable max_failover_attempts (default 3)
- Add transportTypeName() helper and structured LOG(INFO) for failover events
- Add tent_transport_failover_total Prometheus counter metric
- Add unit tests for failover state machine, config loading, and limit checks

Signed-off-by: staryxchen <staryxchen@tencent.com>
@staryxchen staryxchen changed the title [TransferEngine] Wire up cross-transport failover with safety limits and observability [TENT] Wire up cross-transport failover with safety limits and observability Apr 13, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a transport failover mechanism to the transfer engine, allowing tasks to be automatically resubmitted using alternative transports upon failure. Key changes include the addition of a configurable max_failover_attempts parameter, tracking of failover counts within TaskInfo, and the implementation of resubmitTransferTask to handle the transition between transports. Additionally, new metrics have been added to monitor failover events, and a new test suite validates the failover state machine. A review comment identifies a potential logic error in the batch version of getTransferStatus, where resubmitting one task could incorrectly overwrite a permanent FAILED status of another task in the same batch, potentially leading to incorrect overall status reporting.

Comment thread mooncake-transfer-engine/tent/src/runtime/transfer_engine_impl.cpp
Move the failover attempt before status aggregation so that a
successfully resubmitted task appears as PENDING to the existing
if/else-if chain.  Previously the code unconditionally set
overall_status.s = PENDING after a successful resubmit, which could
overwrite a permanent FAILED from another task in the same batch.

Now the `else if (task_status.s != PENDING)` branch naturally skips
PENDING tasks, and only truly-failed tasks set overall_status to
FAILED.

Signed-off-by: staryxchen <staryxchen@tencent.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@alogfans alogfans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alogfans alogfans merged commit aadcc7d into kvcache-ai:main Apr 15, 2026
37 checks passed
@staryxchen staryxchen deleted the worktree-tent-failover-hardening branch April 16, 2026 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants