
[release-4.20] OCPBUGS-77881: handle event socket broken pipe with automatic reconnection #561

Open
jzding wants to merge 1 commit into openshift:release-4.20 from jzding:socket-reconnect-4.20

Conversation

@jzding (Contributor) commented Mar 6, 2026

Backport from #560

When cloud-event-proxy restarts, the Unix socket connection breaks, causing silent loss of PTP events. This PR adds robust reconnection logic with exponential backoff to automatically recover.

  • Fix broken pipe errors when cloud-event-proxy restarts by adding automatic event socket reconnection with exponential backoff, dial timeouts, and write deadlines.
  • Fix data races on shared EventHandler fields (clkSyncState map, clockClass, LeadingClockData) and a deadlock in updateBCState/announceClockClass.
  • Refactor all socket write sites (ProcessEvents, EmitClockSyncLogs, EmitPortRoleLogs, EmitProcessStatusLogs) to use centralized connection management with reconnect-on-failure, and separate stdout logging from socket writes so local logs are never lost.
  • Add unit tests for reconnection, broken pipe detection, and connection management.

When cloud-event-proxy restarts, linuxptp-daemon's Unix socket connection
to /cloud-native/events.sock breaks, causing persistent "broken pipe"
errors and silent loss of PTP events.

Add robust reconnection logic with exponential backoff that automatically
re-establishes the event socket connection when a broken pipe is detected:

- Introduce ReconnectWithBackoff utility in pkg/utils with configurable
  retry attempts, exponential backoff, and context-based cancellation for
  clean shutdown responsiveness.

- Move net.Conn ownership into EventHandler with thread-safe
  getConn/setConn accessors that automatically close replaced
  connections, preventing resource leaks.

- Add writeLogToSocket helper that encapsulates write-reconnect-retry
  logic for a single log line, replacing ad-hoc error handling at each
  write site.

- Add dial timeouts (DialContext/DialTimeout) and write deadlines
  (SetWriteDeadline) to all socket operations to prevent indefinite
  blocking if the listener is unresponsive or the socket buffer is full.

- Separate stdout printing from socket writing in ProcessEvents so all
  logs are always printed locally regardless of socket state.

- Implement channel-based broken pipe signaling (brokenPipeCh) so
  background goroutines (clock class ticker, TBC announce) can notify
  the main ProcessEvents loop to reconnect without blocking.

- Serialize concurrent reconnection attempts via reconnectMu to prevent
  multiple goroutines from opening duplicate connections.

- Fix multiple data races: protect clkSyncState map and clockClass/
  clockAccuracy fields with snapshot-under-lock patterns, resolve
  deadlock in updateBCState/announceClockClass by separating lock-holding
  from I/O operations, and guard LeadingClockData access in TBC
  goroutines.

- Refactor EmitClockSyncLogs, EmitPortRoleLogs, EmitClockClass, and
  EmitProcessStatusLogs to use the EventHandler's managed connection
  with built-in reconnection support, including reconnect-on-nil-conn
  for reliable log emission after event proxy restarts.

- Add IsBrokenPipe helper to detect EPIPE, ECONNRESET, ECONNREFUSED,
  and ENOTCONN errors including those wrapped in net.OpError.

- Add comprehensive unit tests for reconnection backoff, broken pipe
  detection, connection management, and socket write/reconnect behavior.

Signed-off-by: Jack Ding <jackding@gmail.com>
@openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Mar 6, 2026
@openshift-ci-robot (Contributor)

@jzding: This pull request references Jira Issue OCPBUGS-77881, which is invalid:

  • expected the bug to target the "4.20.z" version, but no target version was set
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected dependent Jira Issue OCPBUGS-77871 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is MODIFIED instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci bot requested review from SchSeba and nocturnalastro on March 6, 2026, 21:42

@openshift-ci bot (Contributor) commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jzding

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Mar 6, 2026

@openshift-ci bot (Contributor) commented Mar 7, 2026

@jzding: all tests passed!


