Real-world implementation notes from setting up local codebase indexing with Ollama, Qwen3, Qdrant, and KiloCode.
What we expected: Based on research, 1024 dimensions seemed optimal (98-99% of full quality at 4× less storage than 4096)
What actually happened: Qwen3-Embedding-8B-FP16 through Ollama outputs 4096 dimensions
Verification:

```shell
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "qwen3-embedding:8b-fp16",
  "prompt": "test code snippet"
}' | jq '.embedding | length'
# Output: 4096
```

Impact:
- ✅ Maximum quality (100% model performance)
- ✅ No configuration needed (works out of the box)
- ⚠️ 4× storage (~160MB vs ~40MB for 10K blocks)
- ⚠️ Slightly slower search (more dimensions to process)
Decision: Use 4096 dimensions for simplicity and maximum quality. With RTX 4090 and local storage, the trade-offs are acceptable.
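The storage trade-off above can be sanity-checked with quick arithmetic (float32 vectors, 4 bytes per dimension; the 10K block count mirrors the figure above):

```shell
# Rough storage estimate: blocks * dims * 4 bytes (float32).
# The 10000-block count is illustrative, matching the ~10K figure above.
blocks=10000
for dims in 1024 4096; do
  bytes=$((blocks * dims * 4))
  mb=$((bytes / 1024 / 1024))
  echo "${dims} dims: ~${mb} MB"
done
# 1024 dims land near ~40MB, 4096 near ~160MB, matching the numbers above.
```

This ignores Qdrant's index overhead and payloads, so real collections run somewhat larger.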
What we expected: Need to manually create Qdrant collection before indexing
What actually happened: KiloCode automatically creates collections when you start indexing
Collection naming:
- Auto-generated name: `ws-{workspace-id}` (e.g., `ws-2e031c6c5c75628a`), based on the workspace directory
- One collection per workspace
Settings applied:
- Vector size: Matches configured model dimension (4096)
- Distance metric: Cosine
- On-disk storage: Enabled by default
Benefit: Zero manual setup - just configure KiloCode and click "Start Indexing"
Error encountered:

```text
Status: Error - Failed during initial scan: Indexing failed:
Failed to process batch after 3 attempts: Bad Request
```
Root cause:
- KiloCode initially configured for 1024 dimensions (based on research)
- Qwen3-8B via Ollama actually outputting 4096 dimensions
- Qdrant collection created for 1024 dimensions (mismatch)
- Insertion failed: vector size mismatch
Solution:
- Delete incorrect collection (via Qdrant dashboard)
- Update KiloCode setting: Model dimension = 4096
- Restart indexing (collection recreated correctly)
Lesson: Always verify actual embedding output dimensions before configuring
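That lesson can be automated as a small pre-flight check (the `check_dims` function is ours, not part of KiloCode; the commented curl line mirrors the verification command earlier in these notes):

```shell
# Sketch of a pre-flight dimension check before starting indexing.
# Pass the dimension count the model actually reports and the dimension
# configured in KiloCode; returns non-zero on a mismatch.
check_dims() {
  actual="$1"; configured="$2"
  if [ "$actual" -ne "$configured" ]; then
    echo "MISMATCH: model outputs ${actual} dims, config says ${configured}" >&2
    return 1
  fi
  echo "OK: ${actual} dims"
}

# Usage against a live Ollama (same endpoint as the verification above):
# actual=$(curl -s http://localhost:11434/api/embeddings \
#   -d '{"model": "qwen3-embedding:8b-fp16", "prompt": "test"}' \
#   | jq '.embedding | length')
# check_dims "$actual" 4096
```

Running this before the first "Start Indexing" would have caught the 1024-vs-4096 mismatch without a failed batch.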
Initial docker-compose.yml had:

```yaml
healthcheck:
  test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:6333/healthz"]
```

Problem: Qdrant container doesn't include wget or curl
Error:

```text
OCI runtime exec failed: exec failed: unable to start container process:
exec: "wget": executable file not found in $PATH
```
Solution: Remove healthcheck entirely (Qdrant official compose doesn't use one)
Lesson: Always check official Docker Compose examples before adding custom healthchecks
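For reference, a minimal compose service consistent with this setup (no healthcheck, in line with the official example). Container name, network, and volume match those used elsewhere in these notes; the image tag and port mappings are standard Qdrant defaults:

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    container_name: qdrant-kilocode
    ports:
      - "6333:6333"   # HTTP API / dashboard
      - "6334:6334"   # gRPC
    volumes:
      - qdrant_storage:/qdrant/storage
    networks:
      - ollama-network

volumes:
  qdrant_storage:

networks:
  ollama-network:
    external: true
```

This is a sketch, not a drop-in file; pin the image tag for reproducible deployments.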
Default settings:
- Max search results: 50
- Search score threshold: 0.40
Testing results:
- Top result score: 0.631 (good relevance)
- 50 results: Provided comprehensive context
- Documentation search: Even 20-25 results sufficient
- Code search: 50 results recommended (distributed logic)
Recommendation:
- Code work: Keep 50 results (comprehensive understanding)
- Simple docs: 20-25 results sufficient
- Threshold: 0.40 works well (filters noise)
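The effect of the 0.40 threshold can be illustrated offline (the scores below are made up for demonstration, apart from the 0.631 top score observed above; a live setup gets them from Qdrant's search response):

```shell
# Illustrative relevance scores (fabricated except 0.631, observed above).
scores="0.631 0.522 0.418 0.397 0.212"
threshold=0.40

# Keep only results at or above the threshold (numeric compare via +0).
kept=$(echo "$scores" | tr ' ' '\n' | awk -v t="$threshold" '$1+0 >= t+0')
echo "$kept"
# Keeps 0.631, 0.522, 0.418; drops the near-miss 0.397 and the noise at 0.212.
```

Raising the threshold to 0.50 would also drop the 0.418 result, which is the lever to pull if low-relevance noise shows up.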
Testing results:
- Query: "hardware requirements"
- Found: RTX 4090, VRAM specs, storage needs across 4 different markdown files
- Top result score: 0.631 (good relevance)
- Search time: Fast (milliseconds)
Observed benefits:
- ✅ Finds relevant content across multiple files automatically
- ✅ Natural language queries work perfectly
- ✅ Semantic understanding (concepts, not just keywords)
- ✅ No manual file pointing needed
Indexing performance:
- This repo (5 markdown files, ~800 blocks): Minutes (very fast with RTX 4090)
- GPU utilization: Spikes to 100% during embedding generation
- VRAM usage: ~15GB (Qwen3 model)
- Speed: Very fast (RTX 4090 handles it easily)
Search performance:
- Total latency: Fast local search (milliseconds)
- Accuracy: High relevance scores observed (excellent)
Storage:
- Qdrant collection: ~3-5MB for this small repo
- Model: 15GB (one-time)
- Docker volumes: Minimal overhead
Ollama:
- Model: `qwen3-embedding:8b-fp16`
- Base URL: `http://localhost:11434/`
- Output: 4096 dimensions (model's output)
Qdrant:
- URL: `http://localhost:6333`
- No API key (local deployment)
- Auto-created collection
- Distance: Cosine
- On-disk storage: Enabled
KiloCode:
- Provider: Ollama
- Model dimension: 4096
- Max results: 50
- Score threshold: 0.40
- Auto-create collection: Yes
Docker:
- Network: `ollama-network` (existing)
- No healthcheck (simplified)
- Persistent volume: `qdrant_storage`
- Check actual embedding dimensions: `curl -s http://localhost:11434/api/embeddings -d '{ "model": "qwen3-embedding:8b-fp16", "prompt": "test" }' | jq '.embedding | length'`
- Verify KiloCode "Model dimension" matches output
- Delete collection and let KiloCode recreate it
- Check Docker logs: `docker logs qdrant-kilocode`
- Remove healthcheck if present
- Verify `ollama-network` exists: `docker network ls`
- Verify indexing completed (Green status)
- Check collection has points: `curl http://localhost:6333/collections`
- Lower score threshold below 0.40 if needed
1. Pull model (one-time): `ollama pull qwen3-embedding:8b-fp16`
2. Deploy Qdrant (if not running): `docker compose up -d`
3. Configure KiloCode:
   - Open project in VS Code
   - Settings → KiloCode → Codebase Indexing
   - Set model dimension to 4096
   - Save and Start Indexing
4. Wait for indexing (varies by project size)
5. Start searching with natural language!
- Open project → KiloCode auto-detects existing index
- Code changes → Incremental updates automatically
- Switch projects → Each has its own collection
- No maintenance needed
- Skip manual collection creation - KiloCode handles it
- Start with 4096 dimensions - No need to configure Matryoshka
- Trust the defaults - 50 results, 0.40 threshold work well
- Use official Docker Compose - No custom healthchecks
If storage becomes an issue:
- Configure Matryoshka for 1024 dimensions via custom Ollama modelfile
- Trade ~1% quality for 4× storage savings (4096 → 1024 dims)
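Matryoshka-style truncation means keeping only the first k components of each embedding and re-normalizing so cosine distance still behaves. Whether that happens inside a custom modelfile or client-side, the math is the same; a sketch with a toy 4-element vector standing in for a 4096-dim embedding (the `truncate_embed` function is ours):

```shell
# Matryoshka-style truncation sketch: keep the first k components of an
# embedding, then L2-normalize so cosine similarity remains meaningful.
truncate_embed() {
  k="$1"
  awk -v k="$k" '{
    norm = 0
    for (i = 1; i <= k; i++) norm += $i * $i
    norm = sqrt(norm)
    for (i = 1; i <= k; i++) printf "%s%.4f", (i > 1 ? " " : ""), $i / norm
    print ""
  }'
}

echo "3 4 0 0" | truncate_embed 2   # -> 0.6000 0.8000
```

Note the Qdrant collection's vector size would need to match the truncated dimension (1024), so the collection must be recreated after such a change.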
If search is too slow:
- Reduce max results to 20-25
- Increase score threshold to 0.50
- Enable quantization in Qdrant
For very large codebases:
- Consider Qdrant clustering
- Enable payload indexing for faster filtering
- Use on-disk payload storage
- ✅ Semantic search - Understands concepts, not just keywords
- ✅ RAG workflow - Retrieval → Augmentation → Generation
- ✅ Local privacy - No data leaves machine
- ✅ Cost efficiency - No API fees, unlimited queries
- ✅ Performance - RTX 4090 handles it effortlessly
- 😮 Qwen3-8B outputs 4096 dimensions - Expected 1024 from research
- 😮 Auto-collection creation - Thought manual setup required
- 😮 Indexing speed - Very fast with RTX 4090 for this small repo
- 😮 Search quality - High relevance scores observed (excellent)
This setup provides production-grade RAG for code with:
- State-of-the-art embeddings (Qwen3 #1 MTEB ranked)
- Local deployment (100% privacy)
- Minimal configuration (KiloCode auto-creates)
- Excellent performance (RTX 4090 optimized)
- Zero ongoing costs
Total setup time: ~30 minutes (including troubleshooting)
Result: Semantic code search across the entire codebase with natural language
Success! 🎉