[release-4.20] OCPBUGS-77881: handle event socket broken pipe with automatic reconnection by jzding · Pull Request #561 · openshift/linuxptp-daemon

jzding · 2026-03-06T21:42:36Z

Backport from #560

When cloud-event-proxy restarts, the Unix socket connection breaks, causing silent loss of PTP events. This PR adds robust reconnection logic with exponential backoff to automatically recover.

- Fix broken pipe errors when cloud-event-proxy restarts by adding automatic event socket reconnection with exponential backoff, dial timeouts, and write deadlines.
- Fix data races on shared EventHandler fields (clkSyncState map, clockClass, LeadingClockData) and a deadlock in updateBCState/announceClockClass.
- Refactor all socket write sites (ProcessEvents, EmitClockSyncLogs, EmitPortRoleLogs, EmitProcessStatusLogs) to use centralized connection management with reconnect-on-failure, and separate stdout logging from socket writes so local logs are never lost.
- Add unit tests for reconnection, broken pipe detection, and connection management.

When cloud-event-proxy restarts, linuxptp-daemon's Unix socket connection to /cloud-native/events.sock breaks, causing persistent "broken pipe" errors and silent loss of PTP events.

Add robust reconnection logic with exponential backoff that automatically re-establishes the event socket connection when a broken pipe is detected:

- Introduce ReconnectWithBackoff utility in pkg/utils with configurable retry attempts, exponential backoff, and context-based cancellation for clean shutdown responsiveness.
- Move net.Conn ownership into EventHandler with thread-safe getConn/setConn accessors that automatically close replaced connections, preventing resource leaks.
- Add writeLogToSocket helper that encapsulates write-reconnect-retry logic for a single log line, replacing ad-hoc error handling at each write site.
- Add dial timeouts (DialContext/DialTimeout) and write deadlines (SetWriteDeadline) to all socket operations to prevent indefinite blocking if the listener is unresponsive or the socket buffer is full.
- Separate stdout printing from socket writing in ProcessEvents so all logs are always printed locally regardless of socket state.
- Implement channel-based broken pipe signaling (brokenPipeCh) so background goroutines (clock class ticker, TBC announce) can notify the main ProcessEvents loop to reconnect without blocking.
- Serialize concurrent reconnection attempts via reconnectMu to prevent multiple goroutines from opening duplicate connections.
- Fix multiple data races: protect clkSyncState map and clockClass/clockAccuracy fields with snapshot-under-lock patterns, resolve deadlock in updateBCState/announceClockClass by separating lock-holding from I/O operations, and guard LeadingClockData access in TBC goroutines.
- Refactor EmitClockSyncLogs, EmitPortRoleLogs, EmitClockClass, and EmitProcessStatusLogs to use the EventHandler's managed connection with built-in reconnection support, including reconnect-on-nil-conn for reliable log emission after event proxy restarts.
- Add IsBrokenPipe helper to detect EPIPE, ECONNRESET, ECONNREFUSED, and ENOTCONN errors including those wrapped in net.OpError.
- Add comprehensive unit tests for reconnection backoff, broken pipe detection, connection management, and socket write/reconnect behavior.

Signed-off-by: Jack Ding <jackding@gmail.com>

openshift-ci-robot · 2026-03-06T21:42:44Z

@jzding: This pull request references Jira Issue OCPBUGS-77881, which is invalid:

expected the bug to target the "4.20.z" version, but no target version was set
release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
expected dependent Jira Issue OCPBUGS-77871 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

When cloud-event-proxy restarts, linuxptp-daemon's Unix socket connection to /cloud-native/events.sock breaks, causing persistent "broken pipe" errors and silent loss of PTP events.

Add robust reconnection logic with exponential backoff that automatically re-establishes the event socket connection when a broken pipe is detected:

Introduce ReconnectWithBackoff utility in pkg/utils with configurable retry attempts, exponential backoff, and context-based cancellation for clean shutdown responsiveness.

Move net.Conn ownership into EventHandler with thread-safe getConn/setConn accessors that automatically close replaced connections, preventing resource leaks.

Add writeLogToSocket helper that encapsulates write-reconnect-retry logic for a single log line, replacing ad-hoc error handling at each write site.

Add dial timeouts (DialContext/DialTimeout) and write deadlines (SetWriteDeadline) to all socket operations to prevent indefinite blocking if the listener is unresponsive or the socket buffer is full.

Separate stdout printing from socket writing in ProcessEvents so all logs are always printed locally regardless of socket state.

Implement channel-based broken pipe signaling (brokenPipeCh) so background goroutines (clock class ticker, TBC announce) can notify the main ProcessEvents loop to reconnect without blocking.

Serialize concurrent reconnection attempts via reconnectMu to prevent multiple goroutines from opening duplicate connections.

Fix multiple data races: protect clkSyncState map and clockClass/clockAccuracy fields with snapshot-under-lock patterns, resolve deadlock in updateBCState/announceClockClass by separating lock-holding from I/O operations, and guard LeadingClockData access in TBC goroutines.

Refactor EmitClockSyncLogs, EmitPortRoleLogs, EmitClockClass, and EmitProcessStatusLogs to use the EventHandler's managed connection with built-in reconnection support, including reconnect-on-nil-conn for reliable log emission after event proxy restarts.

Add IsBrokenPipe helper to detect EPIPE, ECONNRESET, ECONNREFUSED, and ENOTCONN errors including those wrapped in net.OpError.

Add comprehensive unit tests for reconnection backoff, broken pipe detection, connection management, and socket write/reconnect behavior.
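The non-blocking signaling described for brokenPipeCh can be sketched as follows. This is an assumption about the general pattern, not the PR's code: `notifyBrokenPipe` and the buffer size of 1 are hypothetical choices for illustration.

```go
package main

import "fmt"

// notifyBrokenPipe is a hypothetical helper. It attempts a non-blocking send
// and reports whether the signal was queued. If a signal is already pending,
// the send is skipped so the calling goroutine never blocks.
func notifyBrokenPipe(ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	// Buffer of 1: at most one reconnect request is pending at a time.
	brokenPipeCh := make(chan struct{}, 1)

	fmt.Println(notifyBrokenPipe(brokenPipeCh)) // prints "true": signal queued
	fmt.Println(notifyBrokenPipe(brokenPipeCh)) // prints "false": already pending

	<-brokenPipeCh // the main loop consumes the signal and would reconnect here

	fmt.Println(notifyBrokenPipe(brokenPipeCh)) // prints "true": slot is free again
}
```

Coalescing repeated signals into a single pending request means several background goroutines hitting the same broken connection trigger one reconnect rather than a queue of them.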

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-03-06T21:42:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jzding

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Needs approval from an approver in each of these files:

~~OWNERS~~ [jzding]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2026-03-07T00:19:43Z

@jzding: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot requested review from SchSeba and nocturnalastro March 6, 2026 21:42

openshift-ci bot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 6, 2026

jzding wants to merge 1 commit into openshift:release-4.20 from jzding:socket-reconnect-4.20


2 participants
