Summary

I'm building an application that runs two independent ML models concurrently on Apple Silicon (both macOS and iOS). Each model processes different inputs and produces different outputs - no shared arrays or dependencies between them. Currently this crashes due to thread-safety issues in the Metal backend.

I've researched the existing issues (#2133, #2067, #2086) and PR #2104. I'd like to discuss whether this use case can be supported and what the best path forward might be.

Use Case

Both models are:
- loaded via mx::import_function()
- run through mx::compile() for kernel fusion
Current Behavior

Concurrent inference produces Metal assertion failures such as:
-[_MTLCommandBuffer addScheduledHandler:]: failed assertion 'Scheduled handler provided after commit call'

A command encoder is already encoding to this command buffer
I can provide more detailed stack traces if helpful.
What I've Tried
1. Dedicated streams per model using StreamContext
// Construction
dedicated_stream_ = mx::new_stream(mx::Device::gpu);
// Inference
mx::StreamContext ctx(dedicated_stream_);
auto outputs = compiled_model_({input_array});
mx::eval(outputs[0]);
mx::synchronize();
Result: Crashes. Two threads using StreamContext concurrently race on the global default_streams_ map in Scheduler::set_default_stream().
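Concretely, the two models are driven from separate std::threads, roughly like this (the ModelRunner wrapper and function names are illustrative, not MLX API; model loading is omitted):

#include <functional>
#include <thread>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Illustrative wrapper: each instance owns one compiled model and one GPU stream.
struct ModelRunner {
  mx::Stream dedicated_stream_ = mx::new_stream(mx::Device::gpu);
  std::function<std::vector<mx::array>(const std::vector<mx::array>&)> compiled_model_;

  mx::array infer(const mx::array& input) {
    mx::StreamContext ctx(dedicated_stream_);  // per-model stream (approach 1 above)
    auto outputs = compiled_model_({input});
    mx::eval(outputs[0]);
    mx::synchronize();
    return outputs[0];
  }
};

// Two independent models, two threads: this is where the Metal assertions fire.
void concurrent_inference(ModelRunner& a, ModelRunner& b,
                          const mx::array& in_a, const mx::array& in_b) {
  std::thread ta([&] { a.infer(in_a); });
  std::thread tb([&] { b.infer(in_b); });
  ta.join();
  tb.join();
}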
2. Default stream (no StreamContext)
Both threads use the default GPU stream.
Result: Crashes. Both threads race on the shared DeviceStream::buffer and DeviceStream::encoder fields in get_command_buffer() / commit_command_buffer().
3. Mutex serialization
This works but negates the benefit of running models in parallel.
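For completeness, the workaround is essentially a single global lock around each forward pass (a sketch, building on the illustrative ModelRunner above):

#include <mutex>

// One process-wide mutex serializes all GPU work: correct, but the two
// models now run back-to-back instead of in parallel.
static std::mutex g_inference_mutex;

mx::array infer_serialized(ModelRunner& model, const mx::array& input) {
  std::lock_guard<std::mutex> lock(g_inference_mutex);
  return model.infer(input);  // same per-model code as above
}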
Root Cause Analysis
I traced this to two issues:
A. StreamContext modifies global state
set_default_stream() writes to Scheduler::default_streams_, a non-thread-safe unordered_map. Concurrent StreamContext usage corrupts this or causes incorrect restore on destruction.
B. DeviceStream lacks synchronization
Even with unique stream indices, DeviceStream::buffer and encoder are accessed without locks in the eval path, causing races when operations interleave.
Potential Solutions
I'd appreciate guidance on which (if any) of these aligns with MLX's design:
Option 1: Thread-local default streams
Store per-thread stream overrides in thread-local storage. get_default_stream() checks TLS first, set_default_stream() writes to TLS. This makes StreamContext thread-safe without locks.
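In sketch form (placeholder types, not the actual Scheduler code - just the shape of the idea):

#include <optional>

namespace sketch {

struct Stream { int index; };  // stand-in for mx::Stream

// Per-thread override; empty means "fall back to the process-wide default".
thread_local std::optional<Stream> tls_default_stream;

// Stand-in for the existing lookup into Scheduler::default_streams_.
Stream process_default_stream() { return Stream{0}; }

Stream get_default_stream() {
  // TLS is consulted first, so concurrent StreamContext instances on
  // different threads never read or write shared state.
  return tls_default_stream ? *tls_default_stream : process_default_stream();
}

void set_default_stream(Stream s) { tls_default_stream = s; }  // StreamContext ctor
void clear_default_stream() { tls_default_stream.reset(); }    // StreamContext dtor

}  // namespace sketch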
Option 2: Per-DeviceStream synchronization
Add a mutex to DeviceStream so different streams can execute in parallel while same-stream access is serialized. (I saw concerns about deadlock in PR #2104 discussion.)
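Again as a sketch, with placeholder types standing in for the Metal objects; this only illustrates the locking granularity and does not address the deadlock concern raised in the PR #2104 discussion:

#include <mutex>

namespace sketch {

struct CommandBuffer {};  // stand-in for the MTLCommandBuffer / encoder pair

struct DeviceStream {
  std::mutex mtx;                   // new: guards this stream's encoding state
  CommandBuffer* buffer = nullptr;  // existing per-stream state

  // All access to the buffer/encoder goes through here, so two threads that
  // land on the same stream are serialized while distinct streams proceed in
  // parallel.
  template <typename F>
  void with_command_buffer(F&& encode) {
    std::lock_guard<std::mutex> lock(mtx);
    if (buffer == nullptr) {
      buffer = new CommandBuffer();
    }
    encode(*buffer);
  }

  void commit_command_buffer() {
    std::lock_guard<std::mutex> lock(mtx);
    delete buffer;  // stand-in for commit + release
    buffer = nullptr;
  }
};

}  // namespace sketch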
Option 3: Explicit stream passing for compiled functions
Allow callers to pass a stream directly when invoking compiled functions, bypassing the default stream mechanism entirely.
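Call-site-wise, something like the following (a hypothetical overload - compiled/imported functions don't currently take a stream argument as far as I can tell):

// Hypothetical: the compiled function accepts the stream explicitly, so no
// thread ever reads or writes the global default-stream map.
auto outputs = compiled_model_({input_array}, /*stream=*/dedicated_stream_);
mx::eval(outputs[0]);
mx::synchronize();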
Option 4: Document as unsupported
If this isn't a priority use case, documenting the limitation clearly would also help users plan accordingly.
Questions
Is concurrent multi-model inference something MLX aims to support?
Related
- mx.eval in separate threads (see the issues and PR referenced above)

Thanks for MLX - the performance on Apple Silicon is excellent. Happy to provide more details or test proposed fixes.