Summary

I'm building an application that runs two independent ML models concurrently on Apple Silicon (both macOS and iOS). Each model processes different inputs and produces different outputs - no shared arrays or dependencies between them. Currently this crashes due to thread-safety issues in the Metal backend.

I've researched the existing issues (#2133, #2067, #2086) and PR #2104. I'd like to discuss whether this use case can be supported and what the best path forward might be.

Use Case

Both models are:
- loaded via mx::import_function()
- run through mx::compile() for kernel fusion
Current Behavior

Concurrent inference produces Metal assertion failures such as:
-[_MTLCommandBuffer addScheduledHandler:]: failed assertion 'Scheduled handler provided after commit call'

A command encoder is already encoding to this command buffer
I can provide more detailed stack traces if helpful.
What I've Tried
1. Dedicated streams per model using StreamContext
// Construction
dedicated_stream_ = mx::new_stream(mx::Device::gpu);
// Inference
mx::StreamContext ctx(dedicated_stream_);
auto outputs = compiled_model_({input_array});
mx::eval(outputs[0]);
mx::synchronize();
Result: Crashes. Two threads using StreamContext concurrently race on the global default_streams_ map in Scheduler::set_default_stream().
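Concretely, the two models are driven from separate std::threads, roughly like this (the ModelRunner wrapper and function names are illustrative, not MLX API; model loading is omitted):

#include <functional>
#include <thread>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Illustrative wrapper: each instance owns one compiled model and one GPU stream.
struct ModelRunner {
  mx::Stream dedicated_stream_ = mx::new_stream(mx::Device::gpu);
  std::function<std::vector<mx::array>(const std::vector<mx::array>&)> compiled_model_;

  mx::array infer(const mx::array& input) {
    mx::StreamContext ctx(dedicated_stream_);  // per-model stream (approach 1 above)
    auto outputs = compiled_model_({input});
    mx::eval(outputs[0]);
    mx::synchronize();
    return outputs[0];
  }
};

// Two independent models, two threads: this is where the Metal assertions fire.
void concurrent_inference(ModelRunner& a, ModelRunner& b,
                          const mx::array& in_a, const mx::array& in_b) {
  std::thread ta([&] { a.infer(in_a); });
  std::thread tb([&] { b.infer(in_b); });
  ta.join();
  tb.join();
}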
2. Default stream (no StreamContext)
Both threads use the default GPU stream.
Result: Crashes. Both threads race on the shared DeviceStream::buffer and DeviceStream::encoder fields in get_command_buffer() / commit_command_buffer().
3. Mutex serialization
This works but negates the benefit of running models in parallel.
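For completeness, the workaround is essentially a single global lock around each forward pass (a sketch, building on the illustrative ModelRunner above):

#include <mutex>

// One process-wide mutex serializes all GPU work: correct, but the two
// models now run back-to-back instead of in parallel.
static std::mutex g_inference_mutex;

mx::array infer_serialized(ModelRunner& model, const mx::array& input) {
  std::lock_guard<std::mutex> lock(g_inference_mutex);
  return model.infer(input);  // same per-model code as above
}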
Root Cause Analysis
I traced this to two issues:
A. StreamContext modifies global state
set_default_stream() writes to Scheduler::default_streams_, a non-thread-safe unordered_map. Concurrent StreamContext usage corrupts this or causes incorrect restore on destruction.
B. DeviceStream lacks synchronization
Even with unique stream indices, DeviceStream::buffer and encoder are accessed without locks in the eval path, causing races when operations interleave.
Potential Solutions
I'd appreciate guidance on which (if any) of these aligns with MLX's design:
Option 1: Thread-local default streams
Store per-thread stream overrides in thread-local storage. get_default_stream() checks TLS first, set_default_stream() writes to TLS. This makes StreamContext thread-safe without locks.
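In sketch form (placeholder types, not the actual Scheduler code - just the shape of the idea):

#include <optional>

namespace sketch {

struct Stream { int index; };  // stand-in for mx::Stream

// Per-thread override; empty means "fall back to the process-wide default".
thread_local std::optional<Stream> tls_default_stream;

// Stand-in for the existing lookup into Scheduler::default_streams_.
Stream process_default_stream() { return Stream{0}; }

Stream get_default_stream() {
  // TLS is consulted first, so concurrent StreamContext instances on
  // different threads never read or write shared state.
  return tls_default_stream ? *tls_default_stream : process_default_stream();
}

void set_default_stream(Stream s) { tls_default_stream = s; }  // StreamContext ctor
void clear_default_stream() { tls_default_stream.reset(); }    // StreamContext dtor

}  // namespace sketch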
Option 2: Per-DeviceStream synchronization
Add a mutex to DeviceStream so different streams can execute in parallel while same-stream access is serialized. (I saw concerns about deadlock in PR #2104 discussion.)
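Again as a sketch, with placeholder types standing in for the Metal objects; this only illustrates the locking granularity and does not address the deadlock concern raised in the PR #2104 discussion:

#include <mutex>

namespace sketch {

struct CommandBuffer {};  // stand-in for the MTLCommandBuffer / encoder pair

struct DeviceStream {
  std::mutex mtx;                   // new: guards this stream's encoding state
  CommandBuffer* buffer = nullptr;  // existing per-stream state

  // All access to the buffer/encoder goes through here, so two threads that
  // land on the same stream are serialized while distinct streams proceed in
  // parallel.
  template <typename F>
  void with_command_buffer(F&& encode) {
    std::lock_guard<std::mutex> lock(mtx);
    if (buffer == nullptr) {
      buffer = new CommandBuffer();
    }
    encode(*buffer);
  }

  void commit_command_buffer() {
    std::lock_guard<std::mutex> lock(mtx);
    delete buffer;  // stand-in for commit + release
    buffer = nullptr;
  }
};

}  // namespace sketch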
Option 3: Explicit stream passing for compiled functions
Allow callers to pass a stream directly when invoking compiled functions, bypassing the default stream mechanism entirely.
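Call-site-wise, something like the following (a hypothetical overload - compiled/imported functions don't currently take a stream argument as far as I can tell):

// Hypothetical: the compiled function accepts the stream explicitly, so no
// thread ever reads or writes the global default-stream map.
auto outputs = compiled_model_({input_array}, /*stream=*/dedicated_stream_);
mx::eval(outputs[0]);
mx::synchronize();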
Option 4: Document as unsupported
If this isn't a priority use case, documenting the limitation clearly would also help users plan accordingly.
Questions
Is concurrent multi-model inference something MLX aims to support?
Related
- mx.eval in separate threads (see the issues and PR referenced above)

Thanks for MLX - the performance on Apple Silicon is excellent. Happy to provide more details or test proposed fixes.