Fluid Server supports multiple NPU (Neural Processing Unit) runtimes for optimal performance on different hardware architectures. This guide covers the supported NPU backends and their specific configurations.
Intel NPU support is provided through the OpenVINO runtime, optimized for Intel NPU and integrated graphics.
- Format: OpenVINO IR format (.xml/.bin files)
- Location: models/whisper/whisper-large-v3-turbo-fp16-ov-npu/
- Optimization: Optimized for Intel NPU and integrated graphics
- Excellent performance on Intel Arc graphics and NPU
- Low power consumption
- Optimized for Intel's AI acceleration hardware
models/whisper/whisper-large-v3-turbo-fp16-ov-npu/
├── openvino_model.xml
├── openvino_model.bin
└── config.json
Qualcomm NPU support uses the Qualcomm Neural Network (QNN) SDK with device-specific compilation for Snapdragon processors.
- Format: ONNX format with device-specific compilation
- Location: models/whisper/whisper-large-v3-turbo-qnn/snapdragon-x-elite/
- Performance: 16× real-time transcription on Snapdragon X Elite
- Hardware: Snapdragon X Elite devices with HTP (Hexagon Tensor Processor)
- Exceptional performance on Snapdragon X Elite devices
- Leverages Hexagon Tensor Processor (HTP) for AI acceleration
- 16× real-time transcription performance
- Optimized for ARM64 architecture
models/whisper/whisper-large-v3-turbo-qnn/snapdragon-x-elite/
├── whisper_encoder.onnx
├── whisper_decoder.onnx
└── config.json
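
The two directory layouts above can be checked before starting the server. The following is a minimal sketch, not part of Fluid Server: the function name `missing_model_files` and the `EXPECTED_FILES` table are hypothetical, and the file names are copied from the listings in this guide.

```python
from pathlib import Path

# Expected files per runtime, copied from the directory listings in this guide.
# The runtime keys ("openvino", "qnn") are labels chosen for this sketch.
EXPECTED_FILES = {
    "openvino": ["openvino_model.xml", "openvino_model.bin", "config.json"],
    "qnn": ["whisper_encoder.onnx", "whisper_decoder.onnx", "config.json"],
}


def missing_model_files(model_dir, runtime):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES[runtime] if not (root / name).is_file()]


if __name__ == "__main__":
    qnn_dir = "models/whisper/whisper-large-v3-turbo-qnn/snapdragon-x-elite"
    # An empty list means every expected file is present.
    print(missing_model_files(qnn_dir, "qnn"))
```

An empty result only shows that the files exist; it does not verify that they are intact or compiled for the right device.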
The server automatically detects your hardware and selects the appropriate NPU runtime:
- ARM64 Architecture: QNN backend is automatically preferred
- Intel x64 Architecture: OpenVINO backend is automatically preferred
- Fallback: CPU-based inference if NPU is unavailable
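
The preference order above can be sketched as a small function. This illustrates the documented rule only and is not the server's actual detection code: `pick_runtime` is a hypothetical name, and the `npu_available` flag stands in for whatever hardware probe the server really performs.

```python
import platform


def pick_runtime(machine, npu_available):
    """Apply the documented preference: QNN on ARM64, OpenVINO on x64, else CPU."""
    if not npu_available:
        return "cpu"
    arch = machine.lower()
    if arch in ("arm64", "aarch64"):
        return "qnn"
    if arch in ("amd64", "x86_64"):
        return "openvino"
    # Unknown architecture: fall back to CPU inference.
    return "cpu"


if __name__ == "__main__":
    # platform.machine() reports "AMD64" or "ARM64" on Windows.
    print(pick_runtime(platform.machine(), npu_available=True))
```

Passing the architecture in as an argument, rather than reading it inside the function, keeps the rule easy to test on any machine.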
You can explicitly specify the runtime through command-line arguments:
# Force OpenVINO runtime
.\dist\fluid-server.exe --whisper-model whisper-large-v3-turbo-ov-npu
# Force QNN runtime (ARM64 only)
.\dist\fluid-server.exe --whisper-model whisper-large-v3-turbo-qnn

- OS: Windows 10/11
- Hardware: Intel Arc graphics or Intel NPU
- Runtime: OpenVINO 2025.2.0 or later
- Memory: 8GB+ RAM recommended
- OS: Windows 11 (ARM64)
- Hardware: Snapdragon X Elite with HTP
- Runtime: ONNX Runtime QNN (bundled with dependencies)
- Memory: 8GB+ RAM recommended
- Driver Updates: Install the latest Intel graphics drivers
- OpenVINO Version: Use OpenVINO 2025.2.0 or later
- Model Precision: FP16 models provide the best balance of performance and accuracy
- Device Compatibility: Confirm the device is a Snapdragon X Elite, since the QNN model is compiled for that specific chip
- Power Settings: Use high-performance power profile
- Memory Management: Close unnecessary applications for optimal memory usage
- Driver Problems: Update Intel graphics drivers
- OpenVINO Installation: Verify OpenVINO runtime is properly installed
- Model Loading Errors: Check model file integrity and paths
- Architecture Mismatch: Verify ARM64 Windows environment
- QNN Availability: Confirm the device is a Snapdragon X Elite with HTP support
- Model Compilation: Check ONNX Runtime QNN installation
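
Several of the checks above (OpenVINO version, runtime installation, QNN availability) can be scripted. The sketch below assumes the `openvino` and `onnxruntime` Python packages may or may not be installed, and the helper names are hypothetical. It uses `openvino.get_version()`, `openvino.Core().available_devices`, and `onnxruntime.get_available_providers()`. On a working QNN setup the provider list would be expected to include `QNNExecutionProvider`.

```python
import re

# Minimum OpenVINO version stated in this guide.
MIN_OPENVINO = (2025, 2, 0)


def parse_version(text):
    """Extract the leading 'major.minor.patch' triple from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def openvino_is_recent_enough(version_text, minimum=MIN_OPENVINO):
    """True if the version string is at or above the documented minimum."""
    return parse_version(version_text) >= minimum


def probe_runtimes():
    """Report what the installed runtimes expose; missing packages stay None."""
    report = {
        "openvino_version": None,
        "openvino_devices": None,
        "onnxruntime_providers": None,
    }
    try:
        import openvino as ov
        report["openvino_version"] = ov.get_version()
        report["openvino_devices"] = list(ov.Core().available_devices)
    except ImportError:
        pass
    try:
        import onnxruntime as ort
        report["onnxruntime_providers"] = ort.get_available_providers()
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    info = probe_runtimes()
    print(info)
    if info["openvino_version"] is not None:
        print("OpenVINO new enough:", openvino_is_recent_enough(info["openvino_version"]))
```

This inspects the Python environment it runs in, which may differ from the runtime bundled inside `fluid-server.exe`, so treat the output as a hint rather than proof.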
# Check NPU availability
.\dist\fluid-server.exe --log-level DEBUG
# Test specific runtime
curl -X POST http://localhost:8080/v1/test -H "Content-Type: application/json"
# Verify model loading
curl http://localhost:8080/v1/models

# Monitor transcription performance
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@test_audio.wav" \
-F "model=whisper-large-v3-turbo-qnn" \
-F "response_format=verbose_json"

- Hardware Matching: Use the NPU runtime that matches your hardware
- Model Selection: Choose the appropriate model size for your use case
- Memory Management: Monitor memory usage with larger models
- Performance Testing: Benchmark different runtimes on your specific hardware
- Regular Updates: Keep NPU drivers and runtimes updated
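
When benchmarking runtimes, a figure such as "16× real-time" is the audio duration divided by the time taken to transcribe it. Here is a minimal sketch of that arithmetic using only the Python standard library. The helper names are hypothetical, and the processing time is assumed to be measured separately, for example by timing the transcription request.

```python
import wave


def wav_duration_seconds(path):
    """Length of a WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def realtime_factor(audio_seconds, processing_seconds):
    """How many seconds of audio are transcribed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


if __name__ == "__main__":
    # Illustrative numbers only: a 60 s clip transcribed in 3.75 s.
    print(realtime_factor(60.0, 3.75))
```

If the elapsed time is measured around an HTTP request, it also includes upload and server overhead, so short clips will understate the runtime's raw speed.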