fix: dispose native FFI resources before process.exit() in job shutdown #1042
Call `dispose()` from `@livekit/rtc-node` before `process.exit(0)` in the job process shutdown sequence. Without this, the process terminates while Rust FFI resources (tokio runtimes, libwebrtc threads) are still running, which can cause:

libc++abi: terminating due to uncaught exception of type std::system_error: mutex lock failed: Invalid argument

The crash is a race condition, most reliably triggered when:
- Audio is actively flowing through the native pipeline (STT/VAD)
- SIP trunk disconnect causes rapid shutdown
- Multiple native threads are mid-execution during `process.exit()`

The fix adds `await dispose()` after all job cleanup completes (session close, room disconnect, shutdown callbacks) but before `process.exit(0)`. `dispose()` is wrapped in try/catch so a failed cleanup never blocks exit.

Shutdown sequence (before): session.close() → room.disconnect() → callbacks → process.exit(0)
Shutdown sequence (after): session.close() → room.disconnect() → callbacks → dispose() → exit(0)

Related: livekit/node-sdks#564
🦋 Changeset detected. Latest commit: a18f008. The changes in this PR will be included in the next version bump. This PR includes changesets to release 19 packages.
toubatbrian
left a comment
Nice catch on this bug! Have you tested the agent after this change and confirmed that the error is gone?
Raysharr
left a comment
Thanks! I tested this against our production voice agent (SIP trunk, Silero VAD, Google STT, ElevenLabs TTS, LiveKit turn detector) and ran a deeper investigation.
dispose() works correctly — after patching, the log confirms native resources disposed on every shutdown. This properly cleans up FFI rooms, native handles, and tokio runtimes that were previously leaking on each job exit.
However, during testing I discovered that the libc++abi: mutex lock failed crash is a separate, deeper issue. It persists even after:
- `dispose()` completes ✅
- ONNX sessions released (VAD + turn detector) ✅
- 3s drain delay before exit ✅
The crash fires at process.exit(0) itself — during C++ destructor ordering in the native addon layer. All job work, IPC, and cleanup complete successfully before it. It's cosmetic but noisy.
I'd suggest we:
- Merge this PR as-is: `dispose()` is the correct JS-side cleanup and should have been here regardless. It fixes the resource leak.
- Track the mutex crash separately: it needs a fix in the native Rust/C++ teardown (likely related to node-sdks#564).
I've also updated the PR description to reflect these findings.
Filed #1375 — same libc++abi mutex error class, but the JS-side dispose-ordering workaround in this PR doesn't address it (the abort fires even after |
Summary
Calls `dispose()` from `@livekit/rtc-node` before `process.exit(0)` in the job process shutdown sequence to properly clean up native FFI resources.

Problem
When a job process shuts down (e.g., after SIP trunk disconnect), the current shutdown sequence is:

session.close() → room.disconnect() → callbacks → process.exit(0)

`room.disconnect()` disconnects from the LiveKit room but does not clean up the Rust FFI Server resources: specifically the tokio async/audio runtimes, `FfiRoom` instances, and native handles in the `DashMap`. These resources leak on every job shutdown.

Fix
Add `await dispose()` after all job cleanup completes but before `process.exit(0)`:

session.close() → room.disconnect() → callbacks → dispose() → exit(0)

`dispose()` calls `livekitDispose()`, which cleans up:
- `FfiRoom` instances (drops track handles, awaits task JoinHandles)
- the `DashMap` of native handles

The call is wrapped in try/catch so a failed cleanup never blocks process exit.
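The ordering described above can be sketched roughly as follows. This is not the code from `job_proc_lazy_main.ts`: `shutdownJob`, `ShutdownSteps`, and every field name are hypothetical stand-ins, and the real steps (`session.close()`, `room.disconnect()`, `dispose()`, `process.exit`) are injected as plain functions so the sketch stays self-contained.

```typescript
// Hypothetical step type: each shutdown stage is a sync or async function.
type Step = () => Promise<void> | void;

// Hypothetical bundle of the stages named in this PR.
interface ShutdownSteps {
  closeSession: Step; // stands in for session.close()
  disconnectRoom: Step; // stands in for room.disconnect()
  runCallbacks: Step; // stands in for the registered shutdown callbacks
  disposeNative: Step; // stands in for dispose() from @livekit/rtc-node
  exit: (code: number) => void; // stands in for process.exit
  onDisposeError?: (err: unknown) => void; // optional hook for logging
}

async function shutdownJob(steps: ShutdownSteps): Promise<void> {
  // Job-level cleanup runs first, in the same order as before the change.
  await steps.closeSession();
  await steps.disconnectRoom();
  await steps.runCallbacks();

  // Native cleanup runs last, so nothing above still needs the FFI layer.
  try {
    await steps.disposeNative();
  } catch (err) {
    // A failed dispose is reported but must never prevent the exit below.
    if (steps.onDisposeError) {
      steps.onDisposeError(err);
    }
  }

  steps.exit(0);
}
```

In a real job process the caller would presumably pass `dispose` as `disposeNative` and `(code) => process.exit(code)` as `exit`. Injecting them here mainly makes the ordering easy to check in isolation.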
Testing
Tested against a production voice agent (SIP trunk + Silero VAD + Google STT + ElevenLabs TTS + LiveKit turn detector). Confirmed via logs that `dispose()` completes successfully on every SIP disconnect.

Note: The `libc++abi: mutex lock failed` crash that sometimes appears on `process.exit(0)` is a separate native-layer issue. It persists regardless of JS-level cleanup (including `dispose()`, ONNX session release, and drain delays). It occurs during C++ destructor ordering and is cosmetic: all job work, IPC messaging, and resource cleanup complete before it fires. See node-sdks#564 for the related native crash.

Changes
- `agents/src/ipc/job_proc_lazy_main.ts`: import `dispose` from `@livekit/rtc-node` and call it before `process.exit(0)` in the shutdown sequence

Risk

Minimal. `dispose()` is idempotent and designed for exactly this purpose. The try/catch ensures it never blocks exit even if it fails.
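The risk assessment leans on `dispose()` being idempotent. If shutdown could be triggered from more than one path and a caller wanted that guarantee on the JS side too, a small guard along these lines might be enough. This is only a sketch: `once` is a hypothetical helper, not part of `@livekit/rtc-node`.

```typescript
// Hypothetical helper: wraps an async cleanup so that repeated or
// concurrent calls all share the result of a single underlying run.
function once(cleanup: () => Promise<void>): () => Promise<void> {
  let pending: Promise<void> | undefined;
  return () => {
    if (pending === undefined) {
      // First caller starts the cleanup; later callers reuse the promise.
      pending = cleanup();
    }
    return pending;
  };
}
```

Wrapping the cleanup as `const disposeOnce = once(dispose)` (an assumed usage) would let several shutdown paths await the same promise. Note that a rejection from the first run is shared by every caller as well.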