Describe the Issue
When attempting to load an embeddings model (Qwen3 0.6B at Q8, to be exact) alongside my normal text-generation model via the terminal, I received a cudaMalloc error on ROCm: it attempted to allocate 4964.22 MiB on the GPU even though the embeddings model was supposed to reside entirely on the CPU. On Vulkan, the same erroneous allocation is attempted, but because Vulkan tolerates overflowing VRAM it only raised a warning rather than an error, so the setup remained usable. The warning had no practical effect: processing and generation speed were both perfectly normal, and the embeddings model appeared to behave correctly as well.
Additional Information:
I am on Fedora 43 with KDE Plasma (Linux).
My system has a Ryzen 5 5500, an RX 9060 XT 16 GB, and 32 GB of DDR4 (of which plenty was available for use at the time).
The error is reproducible on the latest KoboldCPP nocuda release and on the rolling ROCm binary I downloaded today.
My bash scripts are as follows:
Vulkan
#!/bin/bash
./KoboldCPP/koboldcpp-linux-x64-nocuda --model ./KoboldCPP/Models/Maginum-Cydoms-24B.i1-IQ4_XS.gguf --usevulkan --contextsize 16384 --gpulayers 41 --flashattention --quantkv 1 --smartcache 2 --embeddingsmodel ./KoboldCPP/Models/Qwen3-Embedding-0.6B-Q8_0.gguf --embeddingsmaxctx 8192
ROCm
#!/bin/bash
./KoboldCPP/koboldcpp-linux-x64-rocm --model ./KoboldCPP/Models/Maginum-Cydoms-24B.i1-IQ4_XS.gguf --usehipblas mmq --contextsize 16384 --gpulayers 41 --flashattention --quantkv 1 --smartcache 2 --embeddingsmodel ./KoboldCPP/Models/Qwen3-Embedding-0.6B-Q8_0.gguf --embeddingsmaxctx 8192
Here is the full terminal output from both of the scripts above:
EmbeddingsFailureRocm.txt
EmbeddingsWarningVulkan.txt