State-of-the-art semantic code search powered by local AI - optimized for RTX 4090
A complete local setup for intelligent codebase indexing using Ollama, Qwen3 embeddings, Qdrant vector database, and the KiloCode VS Code extension. Provides natural language code search with local embeddings, no API costs, and no rate limits. (Full privacy requires both local embeddings AND local inference LLM.)
- Tech Stack
- Key Features
- Why Qwen3
- Why 4096 Dimensions
- Architecture
- Performance Expectations
- Cost Analysis
- Documentation
- Configuration Details
- Troubleshooting
- Project Status
- Technical References
- License
- Contributing
- Acknowledgments
This repository documents setting up local RAG (Retrieval Augmented Generation) for code using KiloCode, my preferred AI code assistant (as of November 23rd, 2025) for its user-friendly interface and comprehensive features.
Purpose: Make codebase indexing accessible through clear, user-friendly documentation. This is technical research simplified for developers who want to understand WHY and HOW RAG works, not just copy/paste commands.
What's included:
- Plain-English explanations of embeddings, vectors, and semantic search
- Research on embedding models and why Qwen3-8B was chosen
- Hardware-optimized setup guide (RTX 4090 + Ollama + Qdrant)
- Complete Qdrant installation via Docker Compose
- Full documentation on architecture, performance, and troubleshooting
Prerequisites (not covered here):
- Docker - Already installed and configured
- Ollama - Already running with models pulled
- Basic familiarity with terminal/command line
Starting point: Based on KiloCode's Codebase Indexing Documentation, customized for my hardware and needs.
Note: The docker-compose.yml is tailored to my setup (e.g., ollama-network), but easily adaptable as a template. While focused on KiloCode, the RAG principles apply to any AI code assistant (Cursor, Continue, Aider, etc.).
Ideal for:
- Individual developers with GPU hardware (16GB+ VRAM)
- Privacy-focused development (code stays on your machine)
- Unlimited usage scenarios (avoiding per-query API costs)
- Learning RAG architecture and local AI infrastructure
Not ideal for:
- Team collaboration (single-workspace, single-user design)
- Remote access needs (requires your machine running)
- Limited hardware (GPU with 16GB+ VRAM required)
- Enterprise deployment (no multi-user, access control, or redundancy)
For teams/enterprise: Consider Qdrant Cloud, centralized embedding services, or cloud APIs (OpenAI, Voyage) with proper authentication and monitoring. This project demonstrates single-developer architecture - adapt for your scale and requirements.
Note on costs: Electricity costs are highly individual (rates, usage patterns, existing vs. new hardware). Cloud provider pricing changes frequently. Evaluate based on your specific requirements and scale rather than generic comparisons.
This is RAG (Retrieval Augmented Generation) - giving AI models accurate context from YOUR codebase before they answer questions.
| Without RAG | With RAG (This Project) |
|---|---|
| Q: "What database do we use?" | Q: "What database do we use?" |
| A: (After manually reading package.json, database config files, and connection modules...) "PostgreSQL 14, configured in config/database.js:12" (correct, but inefficient - read 5+ files, wasted tokens) | A: "PostgreSQL 14, configured in config/database.js:12 with connection pooling (max 20 connections)" (accurate, instant) |
| Q: "How do I add authentication?" | Q: "How do I add authentication?" |
| A: (After trial and error reading auth.ts, jwt.service.ts, middleware files...) "Your project uses JWT with refresh tokens in auth/jwt.service.ts:34-67" (correct, but slow - read 10+ files manually) | A: "Your project uses JWT with refresh tokens. See auth/jwt.service.ts:34-67 for token generation and middleware/auth.ts:12-28 for verification" (specific, instant) |
The key difference: RAG's semantic search finds code by meaning (not just keywords), enabling natural language queries across different naming conventions. Without RAG, you need exact terminology or better prompting to identify what to search for.
Analogy: Like Google's semantic search vs. Ctrl+F/grep keyword matching - both find things fast, but semantic search understands meaning (finds "authentication" when you search "user verification"), while keyword search requires knowing the exact terms used in the code.
✅ Accuracy - Answers based on YOUR code, not assumptions
✅ Efficiency - Instant semantic search, no manual file hunting
✅ Reduced Hallucination - Grounded in actual code
✅ Context-Aware - Understands your architecture and patterns
✅ Privacy - Local embeddings + storage (full privacy requires a local inference LLM)
✅ Cost - No API fees for local embeddings + storage (cloud options available, with associated costs)
✅ Works with smaller models - Good context = great answers
Worth it? If you work with large codebases and want accurate AI assistance, absolutely yes.
RAG (Retrieval Augmented Generation) is a simple three-step pattern:
- Prepare Knowledge - Break your documents into chunks, convert each chunk into a mathematical representation (vector), store in a database
- Find Relevant Information - Convert your question into a vector, find chunks with similar vectors (semantic similarity)
- Generate Answer - Give the AI model ONLY the relevant chunks as context, let it generate an accurate answer
The key insight: Instead of an AI model reading everything or guessing based on training data, it first retrieves ONLY the relevant information, then generates answers grounded in that specific context.
Why vectors work: Converting text to vectors captures meaning. "authentication" and "user verification" have similar vectors even though the words are different. This is why semantic search finds relevant code regardless of naming conventions.
Simplified example: Imagine each word as a point in space with coordinates. Words with similar meanings end up close together: "king" might be at [0.8, 0.3, 0.1] and "queen" at [0.8, 0.3, 0.2] - they're nearby. Meanwhile "banana" at [0.1, 0.9, 0.5] is far away. The computer calculates distances between these points to find similarity. Real vectors have thousands of dimensions (not just 3), but the principle is the same - similar meanings = nearby points.
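The toy example above can be run directly. The three 3-dimensional vectors are invented for illustration, but the distance calculation (cosine similarity) is exactly what the real system performs at 4096 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-dim "embeddings" from the example above
embeddings = {
    "king":   [0.8, 0.3, 0.1],
    "queen":  [0.8, 0.3, 0.2],
    "banana": [0.1, 0.9, 0.5],
}

query = embeddings["king"]
ranked = sorted(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]),
                reverse=True)
# "queen" ends up far more similar to "king" than "banana" does
```

Real embedding models produce the coordinates automatically from text; the search side is just this ranking, done efficiently over millions of vectors by Qdrant.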
Real-world example: When you ask "how does auth work?", the system:
- Converts your question to a vector
- Finds all code chunks with similar vectors (auth functions, login logic, token handling)
- Feeds those specific chunks to the AI
- AI answers based on YOUR actual code, not generic patterns
This is production-grade RAG - the same technology behind ChatGPT's "custom GPTs" and enterprise AI assistants, but 100% local.
For this project's specific implementation: See Architecture section below for how KiloCode, Ollama, Qwen3, and Qdrant work together.
- Privacy: Embeddings generated AND stored locally in Qdrant (full privacy requires local inference LLM for KiloCode responses)
- Cost: Minimal ongoing electricity costs vs per-usage cloud API fees
- Performance: No network latency, no rate limits
- Control: Your hardware, your rules
- No vendor lock-in: Works offline, independent of cloud services
Note: This setup uses local embedding generation + local vector storage. You could optionally use cloud-based Qdrant or cloud embedding providers if your privacy requirements differ.
| Component | Technology | Specification |
|---|---|---|
| Embedding Model | Qwen3-Embedding-8B-FP16 | SOTA for consumer GPUs (80.68 code, 70.58 ML on MTEB) |
| Dimensions | 4096 | 100% quality, Qwen3-8B via Ollama output |
| Vector Database | Qdrant | Cosine similarity, local deployment |
| AI Runtime | Ollama | Docker-based, GPU accelerated |
| Code Parser | Tree-sitter | AST-based semantic blocks |
| Interface | KiloCode | VS Code extension |
| Hardware | RTX 4090 | 24GB VRAM (~15GB used by the model) |
- Search Latency: Fast local search (milliseconds)
- Indexing Speed: Varies by codebase size (minutes to hours for initial indexing)
- Accuracy: High top-10 retrieval accuracy
- Context Window: 32K tokens (handles large functions/files)
- Natural language queries ("authentication logic", "error handling patterns")
- Semantic understanding across 100+ programming languages
- Incremental indexing with file watching
- Git branch change detection
- Automatic file filtering (binaries, dependencies, large files)
- Hash-based caching for efficiency
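KiloCode's internals aren't reproduced here, but hash-based caching generally follows the pattern below. The helper names are hypothetical, shown only to illustrate the idea: fingerprint each file's contents, and skip re-embedding when the fingerprint is unchanged.

```python
import hashlib
from typing import Optional

def content_hash(source: str) -> str:
    """Fingerprint a file's contents."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def needs_reindex(source: str, cached_hash: Optional[str]) -> bool:
    """Re-embed only when the file's hash no longer matches the cached one."""
    return content_hash(source) != cached_hash
```

This is why re-opening an already-indexed workspace is fast: unchanged files are detected by hash and their existing vectors in Qdrant are left untouched.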
- Model VRAM: ~15GB FP16 (maximum quality)
- Qdrant RAM: ~60-100MB for typical codebase
- Storage: ~160MB vectors for 10K code blocks (4096 dims)
- Throughput: GPU-accelerated embedding generation
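The ~160MB storage figure checks out with simple arithmetic: each code block becomes one float32 vector of 4096 values, at 4 bytes per value.

```python
# Back-of-envelope check of the ~160MB claim for 10K code blocks
blocks = 10_000
dims = 4096
bytes_per_value = 4  # float32
raw_mb = blocks * dims * bytes_per_value / 1_000_000
# raw_mb ≈ 163.84, i.e. the quoted ~160MB of raw vector data,
# before Qdrant's HNSW index and payload metadata add overhead
```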
After evaluating 9 embedding models, Qwen3-Embedding-8B was selected for:
- SOTA performance for consumer GPUs - Ranked #3 (as of November 2025) on MTEB multilingual leaderboard (70.58 score, 80.68 on Code benchmark); top-ranked models require 44GB+ VRAM, impractical for consumer hardware
- Code-optimized training - 100+ programming languages
- Hardware compatibility - ~15GB VRAM fits in RTX 4090's 24GB (vs 44GB+ for higher-ranked models)
- Advanced features - Instruction-aware, Matryoshka support, 32K context
- Future-proof - Latest release (June 2025), Apache 2.0 license
See 2_EMBEDDING_MODEL_SELECTION.md for detailed comparison.
Qwen3 supports Matryoshka embeddings (32-4096 dimensions), but Qwen3-8B via Ollama outputs 4096:
- Quality: 100% maximum performance (no quality loss)
- Simplicity: No configuration needed, works out of the box
- Speed: Fast search with GPU acceleration
- Storage: ~160MB for 10K blocks (acceptable with local setup)
Note: While 1024 dimensions would be more storage-efficient (minimal quality loss), using the model's 4096 output as-is eliminates configuration complexity and provides maximum quality for local deployments.
| Dimension | Quality Impact | Use Case |
|---|---|---|
| 256 | Noticeable degradation | Mobile, 10M+ vectors, storage-critical |
| 512 | Minor degradation | Large scale (100K+ files), speed-critical |
| 1024 | Minimal degradation | Balanced quality/efficiency |
| 2048 | Near-full quality | Specialized/high-precision needs |
| 4096 | Full quality | Maximum precision (our choice) |
Quality retention varies by model and task. One study (Voyage-3-large) showed only 0.31% quality loss at 1024 vs 2048 dimensions. Matryoshka-trained models like Qwen3 are specifically designed for graceful degradation at lower dimensions.
Note: While the model supports Matryoshka embeddings (configurable dimensions), when using Qwen3-Embedding-8B-FP16 through Ollama it outputs 4096 dimensions. This provides maximum quality with no extra configuration needed.
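If you ever do want smaller vectors from a Matryoshka-trained model, the standard recipe is to keep the leading dimensions and re-normalize. This is a minimal sketch of that idea, not something this setup requires, since Ollama returns the full 4096 dimensions:

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the leading `dims` values, then
    re-normalize so cosine similarity remains well-behaved."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Note that both stored vectors and query vectors must be truncated the same way, and the Qdrant collection's dimension setting must match.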
- GPU: NVIDIA GPU with 16GB+ VRAM (Qwen3-Embedding-8B-FP16 requires ~15GB)
- This project uses: RTX 4090 (24GB) for maximum performance
- Alternatives: Smaller embedding models available for lower VRAM cards (see research docs)
- Budget option: Quantized models reduce VRAM further (with minor quality trade-off)
- Note: This setup targets state-of-the-art local performance; adjust model choice for your hardware
- RAM: 16GB+ system RAM recommended
- Storage: 50GB+ free space (model + vectors + Docker)
- Docker & Docker Compose - Container orchestration
- Ollama - Local AI runtime for embeddings
- KiloCode Extension - VS Code extension for codebase indexing
Important: This documentation is optimized for MY specific environment but designed as a reference for others to adapt.
For context, here's how I run Ollama in Docker:
```
docker run -d \
  --network ollama-network \
  --gpus device=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  ollama/ollama
```

Key parameters explained:
- `--network ollama-network`: OPTIONAL - my custom Docker network. You can omit this, use your own network, or run without one entirely.
- `OLLAMA_FLASH_ATTENTION=1`: Optional (recommended) - performance optimization for faster inference.
- `OLLAMA_KV_CACHE_TYPE=q8_0`: Optional - lowers VRAM usage at minimal accuracy cost.
Adapt to your needs: Don't blindly copy the network setting. Either remove it, use your existing Docker network, or create your own. The setup works perfectly fine without custom networks.
For detailed Ollama configuration options, see the official Ollama documentation.
Before starting, verify your environment:
```
# Check Ollama is running
docker ps | grep ollama

# Confirm GPU is available
nvidia-smi

# (Optional) If using a custom Docker network, verify it exists
docker network ls | grep ollama-network
```

Note: The `ollama-network` check is only needed if you're using a custom Docker network like in the setup above. If you're running Ollama without a custom network, skip that check.
Pull the embedding model:

```
ollama pull qwen3-embedding:8b-fp16
```

Start Qdrant:

```
docker compose up -d
```

Verify it's running:

```
docker ps | grep qdrant
curl http://localhost:6333/healthz
```

Dashboard available at: http://localhost:6333/dashboard
- Open VS Code with KiloCode extension
- Navigate to Settings → KiloCode → Codebase Indexing
- Configure:
- Enable: ✓ Codebase Indexing
- Provider: Ollama
- Ollama base URL: http://localhost:11434/
- Model: qwen3-embedding:8b-fp16
- Model dimension: 4096
- Qdrant URL: http://localhost:6333
- Qdrant API key: (leave empty for local)
- Max Results: 50 (adjustable based on your needs)
- Search score threshold: 0.40 (default)
- Click Save in KiloCode settings
- Click Start Indexing
- KiloCode will:
- Auto-create collection (workspace-based name)
- Parse code with Tree-sitter
- Generate embeddings via Ollama
- Store vectors in Qdrant
- Watch status: Gray → Yellow (indexing) → Green (ready)
- Indexing time: Varies by project size (GPU-accelerated)
Natural language queries in KiloCode:
- "user authentication logic"
- "error handling patterns"
- "database connection setup"
- "API endpoint definitions"
This repository contains comprehensive documentation organized by workflow:
| Document | Description |
|---|---|
| 1_CODEBASE_INDEXING_FEATURE.md | Codebase Indexing Feature - Overview of the codebase indexing feature from official KiloCode documentation |
| 2_EMBEDDING_MODEL_SELECTION.md | Embedding Model Selection - Research and comparison of 9 embedding models, why Qwen3-8B was chosen |
| 3_QWEN3_OLLAMA_GUIDE.md | Qwen3 with Ollama Guide - Why the default configuration is perfect, FAQ, and best practices |
| 4_QDRANT_INSTALLATION_GUIDE.md | Qdrant Installation - Step-by-step Docker Compose deployment, configuration, integration with ollama-network |
| FAQ.md | Frequently Asked Questions - Quick reference for understanding RAG, collections, vectors, and workflow |
| IMPLEMENTATION_NOTES.md | Lessons Learned - Real-world implementation notes, key discoveries, troubleshooting tips |
User clicks "Start Indexing" in KiloCode
↓
KiloCode checks if Qdrant collection exists
↓ No collection found
KiloCode creates new collection (workspace-based name)
↓
KiloCode scans entire codebase
↓ Reads all code files
KiloCode parses files with Tree-sitter
↓ Extracts all semantic blocks
Semantic Blocks (functions, classes, methods)
↓ KiloCode sends blocks to Ollama API (batched)
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embeddings
↓ Returns vectors to KiloCode
KiloCode sends vectors + metadata to Qdrant
↓ Stores all vectors
Qdrant Collection (kilocode_codebase)
↓ Persistent storage
Indexed Codebase ✅
↓
Status: Gray → Yellow (indexing) → Green (ready)
File watcher detects change (save/delete/create)
↓
KiloCode identifies changed file(s)
↓
KiloCode deletes old vectors for that file from Qdrant
↓
KiloCode re-parses changed file with Tree-sitter
↓ Extracts new/modified blocks
Semantic Blocks (updated)
↓ KiloCode sends blocks to Ollama API
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embeddings
↓ Returns vectors to KiloCode
KiloCode sends updated vectors + metadata to Qdrant
↓ Replaces old vectors
Qdrant Collection (kilocode_codebase)
↓ Updated storage
Index stays current ✅ (incremental, fast)
Natural Language Query
↓
KiloCode (VS Code Extension)
↓ Sends query to Ollama API
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embedding
↓ Returns vector to KiloCode
KiloCode sends vector to Qdrant
↓
Qdrant Vector Database
↓ Cosine similarity search
↓ Returns top-N matches
Code Snippets + Metadata (file, lines, score)
↓ Sent back to KiloCode
KiloCode Results Panel
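The query flow above can be sketched against the two HTTP APIs involved. This is an illustrative sketch, not KiloCode's actual code: it assumes Ollama's `/api/embeddings` endpoint, Qdrant's `/points/search` endpoint, and the collection name and settings used in this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama base URL (KiloCode setting)
QDRANT_URL = "http://localhost:6333"   # Qdrant HTTP API (KiloCode setting)
COLLECTION = "kilocode_codebase"       # workspace collection name (illustrative)

def embed_payload(query: str) -> dict:
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": "qwen3-embedding:8b-fp16", "prompt": query}

def search_payload(vector: list, limit: int = 50, threshold: float = 0.40) -> dict:
    """Request body for Qdrant's /collections/<name>/points/search endpoint."""
    return {
        "vector": vector,               # 4096-dim query embedding
        "limit": limit,                 # "Max Results" setting
        "score_threshold": threshold,   # "Search score threshold" setting
        "with_payload": True,           # return file/line metadata with each hit
    }

def semantic_search(query: str) -> list:
    """Embed the query with Ollama, then cosine-search the Qdrant collection."""
    def post(url: str, body: dict) -> dict:
        req = urllib.request.Request(
            url, data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    vector = post(f"{OLLAMA_URL}/api/embeddings", embed_payload(query))["embedding"]
    hits = post(f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
                search_payload(vector))
    return hits["result"]  # each hit: {"id": ..., "score": ..., "payload": {...}}

# Usage (requires Ollama and Qdrant running, and an indexed workspace):
# for hit in semantic_search("user authentication logic"):
#     print(round(hit["score"], 3), hit["payload"])
```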
Docker Network: ollama-network
├── Ollama Container (GPU accelerated)
│ └── qwen3-embedding:8b-fp16
└── Qdrant Container
├── HTTP API: localhost:6333
├── gRPC: localhost:6334
└── Dashboard: localhost:6333/dashboard
KiloCode (host) → localhost:6333 → Qdrant
KiloCode (host) → localhost:11434 → Ollama
Containers → ollama:11434 → Ollama (internal, via ollama-network)
Ollama model (defaults, no custom Modelfile needed):

```
# Model: qwen3-embedding:8b-fp16
# No custom modelfile needed - use default
# Context: 32K tokens (built-in)
# Output: 4096 dimensions (Qwen3-8B via Ollama)
# Quantization: FP16 for maximum quality
```

Qdrant collection:

```
Collection Name: kilocode_codebase
Vector Dimensions: 4096
Distance Metric: Cosine
Indexing: HNSW (Hierarchical Navigable Small World)
```

KiloCode settings:

```
Embedding Provider: Ollama
Model: qwen3-embedding:8b-fp16
Qdrant URL: http://localhost:6333
Max Search Results: 50
Min Block Size: 100 chars
Max Block Size: 1000 chars
```
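For reference, a collection with these settings could also be created by hand via Qdrant's REST API (`PUT /collections/<name>`). This is purely illustrative - KiloCode auto-creates the collection, so you don't need to run this:

```python
import json
import urllib.request

# Collection settings mirroring the configuration above
collection_config = {
    "vectors": {
        "size": 4096,          # must match the embedding model's output dimension
        "distance": "Cosine",  # distance metric used for similarity search
    }
}

def create_collection(base_url: str = "http://localhost:6333",
                      name: str = "kilocode_codebase") -> dict:
    """PUT /collections/<name> on a local, unauthenticated Qdrant."""
    req = urllib.request.Request(
        f"{base_url}/collections/{name}",
        data=json.dumps(collection_config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```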
| Codebase Size | Blocks | VRAM | Storage |
|---|---|---|---|
| Small (1K files) | ~5K | 15GB | ~80MB |
| Medium (5K files) | ~25K | 15GB | ~400MB |
| Large (10K files) | ~50K | 15GB | ~800MB |
Indexing Time: Varies by project size and hardware. GPU-accelerated embedding generation is the bottleneck. Initial indexing may take minutes to hours depending on codebase size.
- Search Latency: Fast local search in milliseconds
- Embedding Generation: GPU-accelerated (Ollama)
- Vector Search: Fast similarity matching (Qdrant)
- Accuracy: High top-10 retrieval accuracy
```
# GPU utilization
nvidia-smi

# Qdrant dashboard
# http://localhost:6333/dashboard

# Check collection stats
curl http://localhost:6333/collections/kilocode_codebase
```

- Hardware: One-time (already owned RTX 4090)
- Electricity: Ongoing cost (GPU running intermittently, varies by local rates and usage)
- Per-query cost: Effectively $0 (unlimited)
- Pricing model: Per-token/per-query charges
- Usage costs: Scale with usage (light to heavy)
- Rate limits: Applied
- Privacy: Code sent to cloud
Winner: Local setup - unlimited queries, local embeddings + storage (complete privacy requires local inference LLM), minimal ongoing costs vs per-usage cloud fees
- Research and model selection (Qwen3-Embedding-8B)
- Architecture design
- Documentation (comprehensive guides)
- Docker Compose file for Qdrant
- Qdrant deployment and configuration
- KiloCode integration and testing
- Hardware compatibility verification
- Performance validation
- Implementation documentation
- Automatic backup/restore scripts
- Multi-workspace support
- Custom block sizing strategies
- Performance monitoring dashboard
Q: Indexing is slow
- Check GPU utilization: `nvidia-smi`
- Verify the FP16 model is installed: `ollama list | grep qwen3`
- Check Docker resource limits

Q: Search returns poor results
- Verify 4096 dimensions are configured
- Check the collection exists: `curl localhost:6333/collections`
- Rebuild the index in KiloCode settings

Q: High VRAM usage
- Expected: ~15GB for the FP16 model
- Consider a smaller embedding model if needed (e.g., qwen3-embedding:4b)
- Close other GPU applications
- Qwen3 Model Card: Qwen/Qwen3-Embedding-8B
- MTEB Leaderboard: Code Embedding Benchmark
- Qdrant Docs: https://qdrant.tech/documentation/
- Ollama Docs: https://ollama.ai/docs
- KiloCode: VS Code Extension
This project configuration and documentation is provided as-is for personal use.
Component Licenses:
- Qwen3-Embedding-8B: Apache 2.0
- Qdrant: Apache 2.0
- Ollama: MIT
- KiloCode: Check extension license
This is a personal project documenting a local setup. Feel free to:
- Fork and adapt for your hardware
- Submit issues for documentation improvements
- Share your own optimizations
- Alibaba Cloud - Qwen3 embedding models
- Qdrant - High-performance vector database
- Ollama - Simple local AI deployment
- KiloCode - Semantic code search integration
Ready to get started? Follow the 4_QDRANT_INSTALLATION_GUIDE.md for step-by-step deployment instructions.