Describe the Issue
When attempting to load an embeddings model (Qwen3 0.6B at Q8, to be exact) alongside my normal text-generation model via the terminal, I received a cudaMalloc error on ROCm: it attempted to allocate 4964.22 MiB on the GPU even though the embeddings model was supposed to reside entirely on the CPU. On Vulkan, the same erroneous allocation is attempted, but because Vulkan tolerates overflowing VRAM it only raised a warning rather than an error, so the setup remained usable. The warning had no practical effect: processing and generation speed were both perfectly normal, and the embeddings model appeared to behave correctly as well.
Additional Information:
I am on Fedora 43 with KDE Plasma (Linux).
My system has a Ryzen 5 5500, an RX 9060 XT 16 GB, and 32 GB of DDR4 (of which plenty was available for use at the time).
The error is reproducible on the latest KoboldCPP nocuda release and on the rolling ROCm binary I downloaded today.
My bash scripts are as follows:
Vulkan
#!/bin/bash
./KoboldCPP/koboldcpp-linux-x64-nocuda --model ./KoboldCPP/Models/Maginum-Cydoms-24B.i1-IQ4_XS.gguf --usevulkan --contextsize 16384 --gpulayers 41 --flashattention --quantkv 1 --smartcache 2 --embeddingsmodel ./KoboldCPP/Models/Qwen3-Embedding-0.6B-Q8_0.gguf --embeddingsmaxctx 8192
ROCm
#!/bin/bash
./KoboldCPP/koboldcpp-linux-x64-rocm --model ./KoboldCPP/Models/Maginum-Cydoms-24B.i1-IQ4_XS.gguf --usehipblas mmq --contextsize 16384 --gpulayers 41 --flashattention --quantkv 1 --smartcache 2 --embeddingsmodel ./KoboldCPP/Models/Qwen3-Embedding-0.6B-Q8_0.gguf --embeddingsmaxctx 8192
Here is the full terminal output from both of the scripts above:
EmbeddingsFailureRocm.txt
EmbeddingsWarningVulkan.txt