fix: retry rctx.execute() on SIGKILL (exit 137) for macOS 26 #2750
yetanotheralex wants to merge 1 commit into aspect-build:main
Is this a bug with golang on macOS 26? Can we fix it where the Go binary is located instead?
It's not a golang-specific bug -- it's a macOS 26 (Tahoe) bug in the OS networking framework's pthread_atfork handler that intermittently sends SIGKILL to any child process spawned via fork+exec. During investigation we saw exit code 137 hitting not just yq (Go) but also cp, mkdir, openssl, and rg -- none of which are Go programs.

The issue is on the parent side (Bazel's JVM calling ProcessBuilder/fork+exec), and the OS kills the child before it even gets a chance to run. There's an open report on the skaffold project with the same root cause: GoogleContainerTools/skaffold#9925

Until Apple ships a fix for macOS 26, the retry at the rctx.execute() call site is the pragmatic workaround -- there's nothing Go (or any of the affected binaries) can do differently since the OS is killing the process externally.
macOS 26 (Tahoe) has a bug in its networking framework's pthread_atfork handler that intermittently kills child processes with SIGKILL (exit 137) when spawned via fork+exec. This affects all rctx.execute() calls in repository rules, causing yq, cp, and mkdir commands to fail randomly during npm_translate_lock.

Add an execute_with_retry() wrapper in utils.bzl that catches exit code 137 and retries up to 3 times. Apply it to all rctx.execute() call sites in npm_translate_lock_state.bzl (yq lockfile parsing, mkdir, cp) and utils.bzl (reverse_force_copy). The failure is intermittent, so retries reliably work around it.

References:
- GoogleContainerTools/skaffold#9925
- bazelbuild/bazel#27026

Made-with: Cursor
The Bazel issue seems to be resolved in 6.6 and is no longer an issue in Bazel 7. What version of Bazel are you using?

7.7.1





Problem

macOS 26 (Tahoe) has a known bug in its networking framework's pthread_atfork handler that intermittently kills child processes with SIGKILL (exit code 137) when spawned via fork+exec. This affects Bazel's rctx.execute() calls in repository rules.

In rules_js, the npm_translate_lock repository rule calls rctx.execute() to run yq (for parsing pnpm-lock.yaml), mkdir, and cp (for copying input files). On macOS 26, these commands are intermittently killed, so the rule fails with exit code 137.

The failure is transient: retrying the same command succeeds.
Fix

Add an execute_with_retry() helper in utils.bzl that catches exit code 137 (SIGKILL) and retries up to 3 times. Apply it to all rctx.execute() call sites in:
- npm/private/npm_translate_lock_state.bzl: _yaml_to_json() (yq), _copy_input_file() (mkdir, cp)
- npm/private/utils.bzl: _reverse_force_copy() (mkdir, cp)

The function is also exported via utils.execute_with_retry so downstream consumers can use it.

References
- fork+exec broken on macOS 26
- wrapped_clang broken on macOS Tahoe 26.0 (bazelbuild/bazel#27026): Bazel wrapped_clang broken on macOS 26

Testing

Verified on macOS 26.3 (Tahoe) with Bazel 7.7.1: the patch resolves the intermittent SIGKILL failures during npm_translate_lock repository rule evaluation.
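
For readers who want a concrete picture of the helper described under Fix, here is a minimal sketch. It is not the code from this PR. It assumes that "retries up to 3 times" means up to three re-runs after the first attempt, and the max_retries parameter name and the SIGKILL_EXIT_CODE constant are hypothetical. It uses only the Starlark subset that is also valid Python, and relies on rctx.execute() returning a result with a return_code field.

```python
# Sketch only: parameter names and the retry count semantics are assumptions,
# not the actual implementation in utils.bzl.
SIGKILL_EXIT_CODE = 137  # 128 + 9, the exit code reported for a SIGKILLed child

def execute_with_retry(rctx, args, max_retries = 3, **kwargs):
    """Runs rctx.execute(), re-running while the child exits with code 137."""
    result = rctx.execute(args, **kwargs)

    # Starlark has no while loops, so bound the retries with a for loop.
    for _ in range(max_retries):
        if result.return_code != SIGKILL_EXIT_CODE:
            break
        result = rctx.execute(args, **kwargs)

    # Either a non-137 result, or the last 137 once retries are exhausted.
    return result
```

A call site would then swap rctx.execute(args, ...) for execute_with_retry(rctx, args, ...). Any other non-zero exit code is returned on the first attempt, so genuine command failures still surface immediately.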