[TENT] Wire up cross-transport failover with safety limits and observability#1878
Conversation
…and observability The existing resubmitTransferTask() had complete failover logic (increment xport_priority, resolve next transport, resubmit) but was never called. This commit activates the failover path and adds production safeguards: - Wire getTransferStatus() to call resubmitTransferTask() on FAILED tasks - Add failover_count to TaskInfo with configurable max_failover_attempts (default 3) - Add transportTypeName() helper and structured LOG(INFO) for failover events - Add tent_transport_failover_total Prometheus counter metric - Add unit tests for failover state machine, config loading, and limit checks Signed-off-by: staryxchen <staryxchen@tencent.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a transport failover mechanism to the transfer engine, allowing tasks to be automatically resubmitted using alternative transports upon failure. Key changes include the addition of a configurable max_failover_attempts parameter, tracking of failover counts within TaskInfo, and the implementation of resubmitTransferTask to handle the transition between transports. Additionally, new metrics have been added to monitor failover events, and a new test suite validates the failover state machine. A review comment identifies a potential logic error in the batch version of getTransferStatus, where resubmitting one task could incorrectly overwrite a permanent FAILED status of another task in the same batch, potentially leading to incorrect overall status reporting.
Move the failover attempt before status aggregation so that a successfully resubmitted task appears as PENDING to the existing if/else-if chain. Previously the code unconditionally set overall_status.s = PENDING after a successful resubmit, which could overwrite a permanent FAILED from another task in the same batch. Now the `else if (task_status.s != PENDING)` branch naturally skips PENDING tasks, and only truly-failed tasks set overall_status to FAILED. Signed-off-by: staryxchen <staryxchen@tencent.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Description
The existing
resubmitTransferTask()in TENT had complete failover logic (incrementxport_priority, resolve next transport, resubmit) but was never called — the failover path was dead code. This PR activates it and adds production safeguards.Changes:
getTransferStatus()to callresubmitTransferTask()when a task reports FAILED, automatically falling over to the next transport in the priority chain (e.g. RDMA → TCP)failover_counttoTaskInfowith configurablemax_failover_attempts(default 3) to prevent infinite retry loopstransportTypeName()helper and structuredLOG(INFO)for failover events (e.g.Transport failover: RDMA -> TCP (attempt 1/3))tent_transport_failover_totalPrometheus counter metric and include it in/metrics/summaryRelates to: #553 (RDMA/TCP fallback)
Module
mooncake-transfer-engine)Type of Change
How Has This Been Tested?
tent_failover_test): 11 test cases coveringTaskInfo.failover_countdefaults/increment/limit, config loading (default/custom/zero-disables),TransportTypeenum correctness, full failover state machine simulation (RDMA→TCP→success), exhaustion scenario, and staging bypass logic.tebenchplanned for multi-transport environments (RDMA + TCP).Checklist
./scripts/code_format.shbefore submitting.