[release-4.20] OCPBUGS-77881: handle event socket broken pipe with automatic reconnection #561
jzding wants to merge 1 commit into openshift:release-4.20
When cloud-event-proxy restarts, linuxptp-daemon's Unix socket connection to /cloud-native/events.sock breaks, causing persistent "broken pipe" errors and silent loss of PTP events.

Add robust reconnection logic with exponential backoff that automatically re-establishes the event socket connection when a broken pipe is detected:

- Introduce ReconnectWithBackoff utility in pkg/utils with configurable retry attempts, exponential backoff, and context-based cancellation for clean shutdown responsiveness.
- Move net.Conn ownership into EventHandler with thread-safe getConn/setConn accessors that automatically close replaced connections, preventing resource leaks.
- Add writeLogToSocket helper that encapsulates write-reconnect-retry logic for a single log line, replacing ad-hoc error handling at each write site.
- Add dial timeouts (DialContext/DialTimeout) and write deadlines (SetWriteDeadline) to all socket operations to prevent indefinite blocking if the listener is unresponsive or the socket buffer is full.
- Separate stdout printing from socket writing in ProcessEvents so all logs are always printed locally regardless of socket state.
- Implement channel-based broken pipe signaling (brokenPipeCh) so background goroutines (clock class ticker, TBC announce) can notify the main ProcessEvents loop to reconnect without blocking.
- Serialize concurrent reconnection attempts via reconnectMu to prevent multiple goroutines from opening duplicate connections.
- Fix multiple data races: protect clkSyncState map and clockClass/clockAccuracy fields with snapshot-under-lock patterns, resolve deadlock in updateBCState/announceClockClass by separating lock-holding from I/O operations, and guard LeadingClockData access in TBC goroutines.
- Refactor EmitClockSyncLogs, EmitPortRoleLogs, EmitClockClass, and EmitProcessStatusLogs to use the EventHandler's managed connection with built-in reconnection support, including reconnect-on-nil-conn for reliable log emission after event proxy restarts.
- Add IsBrokenPipe helper to detect EPIPE, ECONNRESET, ECONNREFUSED, and ENOTCONN errors including those wrapped in net.OpError.
- Add comprehensive unit tests for reconnection backoff, broken pipe detection, connection management, and socket write/reconnect behavior.

Signed-off-by: Jack Ding <jackding@gmail.com>
@jzding: This pull request references Jira Issue OCPBUGS-77881, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jzding
@jzding: all tests passed!
Backport from #560
When cloud-event-proxy restarts, the Unix socket connection breaks, causing silent loss of PTP events. This PR adds robust reconnection logic with exponential backoff to automatically recover.