
fix(pd): add timeout and null-safety to getLeaderGrpcAddress()#2961

Open
bitflicker64 wants to merge 1 commit into apache:master from bitflicker64:fix/raft-engine-leader-grpc-address

Conversation


@bitflicker64 bitflicker64 commented Mar 5, 2026

Purpose of the PR

In a 3-node PD cluster running in Docker bridge network mode, getLeaderGrpcAddress() makes a bolt RPC call to discover the leader's gRPC address when the current node is a follower. This call fails in bridge mode: the TCP connection is established, but the bolt RPC response never arrives, so CompletableFuture.get() returns null and the subsequent dereference of the response throws a NullPointerException (NPE).

This causes:

  1. redirectToLeader() fails with NPE
  2. Store registration requests landing on follower PDs are never forwarded
  3. Stores register but partitions are never distributed (partitionCount:0)
  4. HugeGraph servers get stuck indefinitely in a DEADLINE_EXCEEDED retry loop

The cluster only works when pd0 wins raft leader election (since isLeader() returns true and the broken code path is skipped). If pd1 or pd2 wins, the NPE fires on every redirect attempt.

Related PR: #2952

Main Changes

  • Add a bounded timeout to the bolt RPC call using config.getRpcTimeout() instead of unbounded .get()
  • Add null-check on the RPC response before accessing .getGrpcAddress()
  • Fall back to deriving the leader address from the raft endpoint IP + local gRPC port when the RPC fails or times out
  • Add TimeUnit and TimeoutException imports
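Taken together, the changed path follows a simple pattern: bounded wait, null-check, best-effort fallback. A minimal runnable sketch of that pattern is below; the names (`rpcTimeoutMs`, the stand-in `CompletableFuture<String>`) are illustrative assumptions, not the actual RaftEngine code, which wires this to raftRpcClient and the jraft leader endpoint.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: a CompletableFuture that never completes stands in
// for the stalled bolt RPC in Docker bridge mode.
public class LeaderAddressSketch {

    static String getLeaderGrpcAddress(CompletableFuture<String> rpcCall,
                                       long rpcTimeoutMs,
                                       String leaderIp,
                                       int localGrpcPort) {
        try {
            // Bounded wait instead of an unbounded .get()
            String address = rpcCall.get(rpcTimeoutMs, TimeUnit.MILLISECONDS);
            if (address != null) {   // null-check before using the response
                return address;
            }
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            // The PR catches TimeoutException | ExecutionException; InterruptedException
            // is included here only so this standalone sketch compiles without `throws`.
        }
        // Fallback: derive from raft endpoint IP + local gRPC port (best effort)
        return leaderIp + ":" + localGrpcPort;
    }

    public static void main(String[] args) {
        CompletableFuture<String> stalled = new CompletableFuture<>(); // never completes
        // Times out after 100 ms and falls back to the derived address
        System.out.println(getLeaderGrpcAddress(stalled, 100, "172.20.0.10", 8686));
        // prints "172.20.0.10:8686"
    }
}
```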

Verifying these changes

  • Done testing and can be verified as follows:
    • Deploy 3-node PD cluster in Docker bridge network mode
    • Verify cluster works regardless of which PD node wins raft leader election
    • Confirm stores show partitionCount:12 on all 3 nodes when pd1 or pd2 is leader
    • Confirm no NPE in pd logs at getLeaderGrpcAddress

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

The bolt RPC call in getLeaderGrpcAddress() returns null in Docker
bridge network mode, causing NPE when a follower PD node attempts
to discover the leader's gRPC address. This breaks store registration
and partition distribution when any node other than pd0 wins the
raft leader election.

Add a bounded timeout using the configured rpc-timeout, null-check
the RPC response, and fall back to deriving the address from the
raft endpoint IP when the RPC fails.

Closes apache#2959
@bitflicker64 (Contributor, Author) commented:

How I tested:

  1. Built a local Docker image from source with this fix applied
  2. Brought up the 3-node cluster (3 PD + 3 Store + 3 Server) in bridge network mode
  3. Confirmed cluster was healthy with pd0 as initial leader
  4. Restarted pd0 to force a new leader election — pd1 won
  5. Checked partition distribution and cluster health with pd1 as leader

Results with pd1 as leader:

partitionCount:12 on all 3 stores ✅
leaderCount:12 on all 3 stores ✅
{"graphs":["hugegraph"]} ✅
All 9 containers healthy ✅

Confirmed fallback triggered in pd1 logs:

[WARN] RaftEngine - Failed to get leader gRPC address via RPC, falling back to endpoint derivation
java.util.concurrent.ExecutionException: com.alipay.remoting.exception.RemotingException:
Create connection failed. The address is 172.20.0.10:8610
    at RaftEngine.getLeaderGrpcAddress(RaftEngine.java:247)
    at PDService.redirectToLeader(PDService.java:1275)

Before this fix: RPC returns null → NPE → follower PDs can't redirect requests to leader → cluster only worked when pd0 won leader election since it never hit the broken code path.

After this fix: RPC failure caught with bounded timeout → fallback to endpoint IP + gRPC port derivation → follower PDs correctly redirect to leader regardless of which PD node wins election.

Related docker bridge networking PR: #2952

Copilot AI left a comment


Pull request overview

This PR addresses PD follower redirects failing in Docker bridge mode by making RaftEngine.getLeaderGrpcAddress() resilient to stalled/failed bolt RPC lookups, preventing NPEs and enabling store registration/partition distribution regardless of which PD becomes Raft leader.

Changes:

  • Add a bounded timeout (config.getRpcTimeout()) to the bolt RPC CompletableFuture.get(...).
  • Add null-safety around the RPC response before reading getGrpcAddress().
  • Add a fallback that derives the leader gRPC address from the Raft endpoint IP/host plus the configured gRPC port.


Comment on lines +245 to +247
RaftRpcProcessor.GetMemberResponse response = raftRpcClient
        .getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
        .get(config.getRpcTimeout(), TimeUnit.MILLISECONDS);

Copilot AI Mar 9, 2026


waitingForLeader(10000) is called when raftNode.getLeaderId() is null, but the result isn’t used/rechecked. If the leader still isn’t elected after the wait, the subsequent raftNode.getLeaderId().getEndpoint() (and the fallback path) will still throw an NPE. Capture the leader into a local variable after waiting and handle the null case explicitly (e.g., throw a clear exception or return a sentinel that callers can handle).

        return response.getGrpcAddress();
    }
} catch (TimeoutException | ExecutionException e) {
    log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e);

Copilot AI Mar 9, 2026


This log.warn(..., e) will likely be hit on every follower redirect in environments where the RPC path is consistently broken/blocked (the scenario described in #2959), which can flood logs and add overhead due to repeated stack traces. Consider reducing verbosity (e.g., warn without stack trace, debug with stack trace, or rate-limit) and include key context like leader endpoint/timeout in the message.

Suggested change
-log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation", e);
+String leaderEndpoint = raftNode.getLeaderId() != null
+        ? String.valueOf(raftNode.getLeaderId().getEndpoint())
+        : "unknown";
+long timeoutMs = config.getRpcTimeout();
+log.warn("Failed to get leader gRPC address via RPC, falling back to endpoint derivation: " +
+         "leaderEndpoint={}, timeoutMs={}, errorType={}, errorMessage={}",
+         leaderEndpoint, timeoutMs, e.getClass().getSimpleName(), e.getMessage());
+log.debug("Stack trace for failed getLeaderGrpcAddress RPC", e);

// Fallback: derive from raft endpoint IP + local gRPC port (best effort)
String leaderIp = raftNode.getLeaderId().getEndpoint().getIp();
return leaderIp + ":" + config.getGrpcPort();
Member


‼️ This fallback is still incorrect for clusters where PD nodes use different grpc.port values. In this repo's own multi-node test configs, application-server1.yml, application-server2.yml, and application-server3.yml advertise 8686, 8687, and 8688 respectively, so a follower on 8687 will redirect to leader-ip:8687 even when the elected leader is actually listening on 8686 or 8688. That turns the original NPE into a silent misroute.

If we can't recover the leader's advertised gRPC endpoint here, I think it's safer to fail fast than to synthesize an address from the local port, for example:

Suggested change
-return leaderIp + ":" + config.getGrpcPort();
+} catch (TimeoutException | ExecutionException e) {
+    throw new ExecutionException(
+        String.format("Failed to resolve leader gRPC address for %s", raftNode.getLeaderId()),
+        e);
+}

A more complete fix would need a source of truth for the leader's actual grpcAddress, not the local node's port.

        .getGrpcAddress();
try {
    RaftRpcProcessor.GetMemberResponse response = raftRpcClient
            .getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
Member


⚠️ waitingForLeader(10000) does not guarantee that leaderId is non-null; it just waits up to the timeout and returns whatever it has. That means both the RPC call and this fallback path can still dereference raftNode.getLeaderId() and reintroduce an NPE during leader-election gaps.

Could we cache the leader once after waiting and turn the "leader still unknown" case into a controlled exception instead?

Suggested change
-.getGrpcAddress(raftNode.getLeaderId().getEndpoint().toString())
+PeerId leader = raftNode.getLeaderId();
+if (leader == null) {
+    throw new ExecutionException(new IllegalStateException("Leader is not ready"));
+}

Then reuse leader for both the RPC request and any follow-up handling.
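A minimal, runnable sketch of that guard is below; jraft's PeerId is stubbed as a plain String purely for illustration, and requireLeader is a hypothetical helper name, not code from this PR.

```java
import java.util.concurrent.ExecutionException;

// Sketch of the "capture the leader once, fail fast on null" suggestion.
// The real code would use com.alipay.sofa.jraft.entity.PeerId rather than String.
public class LeaderGuardSketch {

    static String requireLeader(String leaderId) throws ExecutionException {
        if (leaderId == null) {
            // Controlled failure instead of a later NullPointerException
            throw new ExecutionException(new IllegalStateException("Leader is not ready"));
        }
        return leaderId;
    }

    public static void main(String[] args) throws ExecutionException {
        System.out.println(requireLeader("pd1:8610")); // prints "pd1:8610"
        // requireLeader(null) would throw ExecutionException(IllegalStateException)
    }
}
```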



Development

Successfully merging this pull request may close these issues.

[Bug] 3-node PD cluster fails when pd0 is not raft leader — getLeaderGrpcAddress() NPE in bridge network mode

3 participants