State-of-the-art semantic code search powered by local AI - optimized for RTX 4090
A complete local setup for intelligent codebase indexing using Ollama, Qwen3 embeddings, Qdrant vector database, and the KiloCode VS Code extension. Provides natural language code search with local embeddings, no API costs, and no rate limits. (Full privacy requires both local embeddings AND local inference LLM.)
- Tech Stack
- Key Features
- Why Qwen3
- Why 4096 Dimensions
- Architecture
- Performance Expectations
- Cost Analysis
- Documentation
- Configuration Details
- Troubleshooting
- Project Status
- Technical References
- License
- Contributing
- Acknowledgments
This repository documents setting up local RAG (Retrieval Augmented Generation) for code using KiloCode, my preferred AI code assistant (as of November 23rd, 2025) for its user-friendly interface and comprehensive features.
Purpose: Make codebase indexing accessible through clear, user-friendly documentation. This is technical research simplified for developers who want to understand WHY and HOW RAG works, not just copy/paste commands.
What's included:
- Plain-English explanations of embeddings, vectors, and semantic search
- Research on embedding models and why Qwen3-8B was chosen
- Hardware-optimized setup guide (RTX 4090 + Ollama + Qdrant)
- Complete Qdrant installation via Docker Compose
- Full documentation on architecture, performance, and troubleshooting
Prerequisites (not covered here):
- Docker - Already installed and configured
- Ollama - Already running with models pulled
- Basic familiarity with terminal/command line
Starting point: Based on KiloCode's Codebase Indexing Documentation, customized for my hardware and needs.
Note: The docker-compose.yml is tailored to my setup (e.g., ollama-network), but easily adaptable as a template. While focused on KiloCode, the RAG principles apply to any AI code assistant (Cursor, Continue, Aider, etc.).
Ideal for:
- Individual developers with GPU hardware (16GB+ VRAM)
- Privacy-focused development (code stays on your machine)
- Unlimited usage scenarios (avoiding per-query API costs)
- Learning RAG architecture and local AI infrastructure
Not ideal for:
- Team collaboration (single-workspace, single-user design)
- Remote access needs (requires your machine running)
- Limited hardware (GPU with 16GB+ VRAM required)
- Enterprise deployment (no multi-user, access control, or redundancy)
For teams/enterprise: Consider Qdrant Cloud, centralized embedding services, or cloud APIs (OpenAI, Voyage) with proper authentication and monitoring. This project demonstrates single-developer architecture - adapt for your scale and requirements.
Note on costs: Electricity costs are highly individual (rates, usage patterns, existing vs. new hardware). Cloud provider pricing changes frequently. Evaluate based on your specific requirements and scale rather than generic comparisons.
This is RAG (Retrieval Augmented Generation) - giving AI models accurate context from YOUR codebase before they answer questions.
| Without RAG | With RAG (This Project) |
|---|---|
| Q: "What database do we use?" | Q: "What database do we use?" |
| A: (After manually reading package.json, database config files, and connection modules...) "PostgreSQL 14, configured in config/database.js:12" (correct, but inefficient - read 5+ files, wasted tokens) | A: "PostgreSQL 14, configured in config/database.js:12 with connection pooling (max 20 connections)" (accurate, instant) |
| Q: "How do I add authentication?" | Q: "How do I add authentication?" |
| A: (After trial and error reading auth.ts, jwt.service.ts, middleware files...) "Your project uses JWT with refresh tokens in auth/jwt.service.ts:34-67" (correct, but slow - read 10+ files manually) | A: "Your project uses JWT with refresh tokens. See auth/jwt.service.ts:34-67 for token generation and middleware/auth.ts:12-28 for verification" (specific, instant) |
The key difference: RAG's semantic search finds code by meaning (not just keywords), enabling natural language queries across different naming conventions. Without RAG, you need exact terminology or better prompting to identify what to search for.
Analogy: Like Google's semantic search vs. Ctrl+F/grep keyword matching - both find things fast, but semantic search understands meaning (finds "authentication" when you search "user verification"), while keyword search requires knowing the exact terms used in the code.
✅ Accuracy - Answers based on YOUR code, not assumptions
✅ Efficiency - Instant semantic search, no manual file hunting
✅ Reduced Hallucination - Grounded in actual code
✅ Context-Aware - Understands your architecture and patterns
✅ Privacy - Local embeddings + storage (full privacy requires a local inference LLM)
✅ Cost - No API fees for local embeddings + storage (cloud options available, with associated costs)
✅ Works with smaller models - Good context = great answers
Worth it? If you work with large codebases and want accurate AI assistance, absolutely yes.
RAG (Retrieval Augmented Generation) is a simple three-step pattern:
- Prepare Knowledge - Break your documents into chunks, convert each chunk into a mathematical representation (vector), store in a database
- Find Relevant Information - Convert your question into a vector, find chunks with similar vectors (semantic similarity)
- Generate Answer - Give the AI model ONLY the relevant chunks as context, let it generate an accurate answer
The key insight: Instead of an AI model reading everything or guessing based on training data, it first retrieves ONLY the relevant information, then generates answers grounded in that specific context.
Why vectors work: Converting text to vectors captures meaning. "authentication" and "user verification" have similar vectors even though the words are different. This is why semantic search finds relevant code regardless of naming conventions.
Simplified example: Imagine each word as a point in space with coordinates. Words with similar meanings end up close together: "king" might be at [0.8, 0.3, 0.1] and "queen" at [0.8, 0.3, 0.2] - they're nearby. Meanwhile "banana" at [0.1, 0.9, 0.5] is far away. The computer calculates distances between these points to find similarity. Real vectors have thousands of dimensions (not just 3), but the principle is the same - similar meanings = nearby points.
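The toy example above can be run directly. The three 3-dimensional vectors are invented for illustration, but the distance calculation (cosine similarity) is exactly what the real system performs at 4096 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-dim "embeddings" from the example above
embeddings = {
    "king":   [0.8, 0.3, 0.1],
    "queen":  [0.8, 0.3, 0.2],
    "banana": [0.1, 0.9, 0.5],
}

query = embeddings["king"]
ranked = sorted(embeddings, key=lambda w: cosine_similarity(query, embeddings[w]),
                reverse=True)
# "queen" ends up far more similar to "king" than "banana" does
```

Real embedding models produce the coordinates automatically from text; the search side is just this ranking, done efficiently over millions of vectors by Qdrant.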
Real-world example: When you ask "how does auth work?", the system:
- Converts your question to a vector
- Finds all code chunks with similar vectors (auth functions, login logic, token handling)
- Feeds those specific chunks to the AI
- AI answers based on YOUR actual code, not generic patterns
This is production-grade RAG - the same technology behind ChatGPT's "custom GPTs" and enterprise AI assistants, but 100% local.
For this project's specific implementation: See Architecture section below for how KiloCode, Ollama, Qwen3, and Qdrant work together.
- Privacy: Embeddings generated AND stored locally in Qdrant (full privacy requires local inference LLM for KiloCode responses)
- Cost: Minimal ongoing electricity costs vs per-usage cloud API fees
- Performance: No network latency, no rate limits
- Control: Your hardware, your rules
- No vendor lock-in: Works offline, independent of cloud services
Note: This setup uses local embedding generation + local vector storage. You could optionally use cloud-based Qdrant or cloud embedding providers if your privacy requirements differ.
| Component | Technology | Specification |
|---|---|---|
| Embedding Model | Qwen3-Embedding-8B-FP16 | SOTA for consumer GPUs (80.68 code, 70.58 ML on MTEB) |
| Dimensions | 4096 | 100% quality, Qwen3-8B via Ollama output |
| Vector Database | Qdrant | Cosine similarity, local deployment |
| AI Runtime | Ollama | Docker-based, GPU accelerated |
| Code Parser | Tree-sitter | AST-based semantic blocks |
| Interface | KiloCode | VS Code extension |
| Hardware | RTX 4090 | 24GB VRAM (~15GB used by the model) |
- Search Latency: Fast local search (milliseconds)
- Indexing Speed: Varies by codebase size (minutes to hours for initial indexing)
- Accuracy: High top-10 retrieval accuracy
- Context Window: 32K tokens (handles large functions/files)
- Natural language queries ("authentication logic", "error handling patterns")
- Semantic understanding across 100+ programming languages
- Incremental indexing with file watching
- Git branch change detection
- Automatic file filtering (binaries, dependencies, large files)
- Hash-based caching for efficiency
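KiloCode's internals aren't reproduced here, but hash-based caching generally follows the pattern below. The helper names are hypothetical, shown only to illustrate the idea: fingerprint each file's contents, and skip re-embedding when the fingerprint is unchanged.

```python
import hashlib
from typing import Optional

def content_hash(source: str) -> str:
    """Fingerprint a file's contents."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def needs_reindex(source: str, cached_hash: Optional[str]) -> bool:
    """Re-embed only when the file's hash no longer matches the cached one."""
    return content_hash(source) != cached_hash
```

This is why re-opening an already-indexed workspace is fast: unchanged files are detected by hash and their existing vectors in Qdrant are left untouched.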
- Model VRAM: ~15GB FP16 (maximum quality)
- Qdrant RAM: ~60-100MB for typical codebase
- Storage: ~160MB vectors for 10K code blocks (4096 dims)
- Throughput: GPU-accelerated embedding generation
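The ~160MB storage figure checks out with simple arithmetic: each code block becomes one float32 vector of 4096 values, at 4 bytes per value.

```python
# Back-of-envelope check of the ~160MB claim for 10K code blocks
blocks = 10_000
dims = 4096
bytes_per_value = 4  # float32
raw_mb = blocks * dims * bytes_per_value / 1_000_000
# raw_mb ≈ 163.84, i.e. the quoted ~160MB of raw vector data,
# before Qdrant's HNSW index and payload metadata add overhead
```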
After evaluating 9 embedding models, Qwen3-Embedding-8B was selected for:
- SOTA performance for consumer GPUs - Ranked #3 (as of November 2025) on MTEB multilingual leaderboard (70.58 score, 80.68 on Code benchmark); top-ranked models require 44GB+ VRAM, impractical for consumer hardware
- Code-optimized training - 100+ programming languages
- Hardware compatibility - ~15GB VRAM fits in RTX 4090's 24GB (vs 44GB+ for higher-ranked models)
- Advanced features - Instruction-aware, Matryoshka support, 32K context
- Future-proof - Latest release (June 2025), Apache 2.0 license
See 2_EMBEDDING_MODEL_SELECTION.md for detailed comparison.
Qwen3 supports Matryoshka embeddings (32-4096 dimensions), but Qwen3-8B via Ollama outputs 4096:
- Quality: 100% maximum performance (no quality loss)
- Simplicity: No configuration needed, works out of the box
- Speed: Fast search with GPU acceleration
- Storage: ~160MB for 10K blocks (acceptable with local setup)
Note: While 1024 dimensions would be more storage-efficient (minimal quality loss), using the model's 4096 output as-is eliminates configuration complexity and provides maximum quality for local deployments.
| Dimension | Quality Impact | Use Case |
|---|---|---|
| 256 | Noticeable degradation | Mobile, 10M+ vectors, storage-critical |
| 512 | Minor degradation | Large scale (100K+ files), speed-critical |
| 1024 | Minimal degradation | Balanced quality/efficiency |
| 2048 | Near-full quality | Specialized/high-precision needs |
| 4096 | Full quality | Maximum precision (our choice) |
Quality retention varies by model and task. One study (Voyage-3-large) showed only 0.31% quality loss at 1024 vs 2048 dimensions. Matryoshka-trained models like Qwen3 are specifically designed for graceful degradation at lower dimensions.
Note: While the model supports Matryoshka embeddings (configurable dimensions), when using Qwen3-Embedding-8B-FP16 through Ollama it outputs 4096 dimensions. This provides maximum quality with no extra configuration needed.
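If you ever do want smaller vectors from a Matryoshka-trained model, the standard recipe is to keep the leading dimensions and re-normalize. This is a minimal sketch of that idea, not something this setup requires, since Ollama returns the full 4096 dimensions:

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the leading `dims` values, then
    re-normalize so cosine similarity remains well-behaved."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Note that both stored vectors and query vectors must be truncated the same way, and the Qdrant collection's dimension setting must match.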
- GPU: NVIDIA GPU with 16GB+ VRAM (Qwen3-Embedding-8B-FP16 requires ~15GB)
- This project uses: RTX 4090 (24GB) for maximum performance
- Alternatives: Smaller embedding models available for lower VRAM cards (see research docs)
- Budget option: Quantized models reduce VRAM further (with minor quality trade-off)
- Note: This setup targets state-of-the-art local performance; adjust model choice for your hardware
- RAM: 16GB+ system RAM recommended
- Storage: 50GB+ free space (model + vectors + Docker)
- Docker & Docker Compose - Container orchestration
- Ollama - Local AI runtime for embeddings
- KiloCode Extension - VS Code extension for codebase indexing
Important: This documentation is optimized for MY specific environment but designed as a reference for others to adapt.
For context, here's how I run Ollama in Docker:
```
docker run -d \
  --network ollama-network \
  --gpus device=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  -e OLLAMA_FLASH_ATTENTION=1 \
  -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  ollama/ollama
```

Key parameters explained:
- `--network ollama-network`: OPTIONAL - my custom Docker network. You can omit this, use your own network, or run without one entirely.
- `OLLAMA_FLASH_ATTENTION=1`: Optional (recommended) - performance optimization for faster inference.
- `OLLAMA_KV_CACHE_TYPE=q8_0`: Optional - lowers VRAM usage at minimal accuracy cost.
Adapt to your needs: Don't blindly copy the network setting. Either remove it, use your existing Docker network, or create your own. The setup works perfectly fine without custom networks.
For detailed Ollama configuration options, see the official Ollama documentation.
Before starting, verify your environment:
```
# Check Ollama is running
docker ps | grep ollama

# Confirm GPU is available
nvidia-smi

# (Optional) If using a custom Docker network, verify it exists
docker network ls | grep ollama-network
```

Note: The `ollama-network` check is only needed if you're using a custom Docker network like in the setup above. If you're running Ollama without a custom network, skip that check.
Pull the embedding model:

```
ollama pull qwen3-embedding:8b-fp16
```

Start Qdrant:

```
docker compose up -d
```

Verify it's running:

```
docker ps | grep qdrant
curl http://localhost:6333/healthz
```

Dashboard available at: http://localhost:6333/dashboard
- Open VS Code with KiloCode extension
- Navigate to Settings → KiloCode → Codebase Indexing
- Configure:
- Enable: ✓ Codebase Indexing
- Provider: Ollama
- Ollama base URL: http://localhost:11434/
- Model: qwen3-embedding:8b-fp16
- Model dimension: 4096
- Qdrant URL: http://localhost:6333
- Qdrant API key: (leave empty for local)
- Max Results: 50 (adjustable based on your needs)
- Search score threshold: 0.40 (default)
- Click Save in KiloCode settings
- Click Start Indexing
- KiloCode will:
- Auto-create collection (workspace-based name)
- Parse code with Tree-sitter
- Generate embeddings via Ollama
- Store vectors in Qdrant
- Watch status: Gray → Yellow (indexing) → Green (ready)
- Indexing time: Varies by project size (GPU-accelerated)
Natural language queries in KiloCode:
- "user authentication logic"
- "error handling patterns"
- "database connection setup"
- "API endpoint definitions"
This repository contains comprehensive documentation organized by workflow:
| Document | Description |
|---|---|
| 1_CODEBASE_INDEXING_FEATURE.md | Codebase Indexing Feature - Overview of the codebase indexing feature from official KiloCode documentation |
| 2_EMBEDDING_MODEL_SELECTION.md | Embedding Model Selection - Research and comparison of 9 embedding models, why Qwen3-8B was chosen |
| 3_QWEN3_OLLAMA_GUIDE.md | Qwen3 with Ollama Guide - Why the default configuration is perfect, FAQ, and best practices |
| 4_QDRANT_INSTALLATION_GUIDE.md | Qdrant Installation - Step-by-step Docker Compose deployment, configuration, integration with ollama-network |
| FAQ.md | Frequently Asked Questions - Quick reference for understanding RAG, collections, vectors, and workflow |
| IMPLEMENTATION_NOTES.md | Lessons Learned - Real-world implementation notes, key discoveries, troubleshooting tips |
User clicks "Start Indexing" in KiloCode
↓
KiloCode checks if Qdrant collection exists
↓ No collection found
KiloCode creates new collection (workspace-based name)
↓
KiloCode scans entire codebase
↓ Reads all code files
KiloCode parses files with Tree-sitter
↓ Extracts all semantic blocks
Semantic Blocks (functions, classes, methods)
↓ KiloCode sends blocks to Ollama API (batched)
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embeddings
↓ Returns vectors to KiloCode
KiloCode sends vectors + metadata to Qdrant
↓ Stores all vectors
Qdrant Collection (kilocode_codebase)
↓ Persistent storage
Indexed Codebase ✅
↓
Status: Gray → Yellow (indexing) → Green (ready)
File watcher detects change (save/delete/create)
↓
KiloCode identifies changed file(s)
↓
KiloCode deletes old vectors for that file from Qdrant
↓
KiloCode re-parses changed file with Tree-sitter
↓ Extracts new/modified blocks
Semantic Blocks (updated)
↓ KiloCode sends blocks to Ollama API
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embeddings
↓ Returns vectors to KiloCode
KiloCode sends updated vectors + metadata to Qdrant
↓ Replaces old vectors
Qdrant Collection (kilocode_codebase)
↓ Updated storage
Index stays current ✅ (incremental, fast)
Natural Language Query
↓
KiloCode (VS Code Extension)
↓ Sends query to Ollama API
Ollama + Qwen3-Embedding-8B-FP16
↓ Creates 4096-dim vector embedding
↓ Returns vector to KiloCode
KiloCode sends vector to Qdrant
↓
Qdrant Vector Database
↓ Cosine similarity search
↓ Returns top-N matches
Code Snippets + Metadata (file, lines, score)
↓ Sent back to KiloCode
KiloCode Results Panel
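The query flow above can be sketched against the two HTTP APIs involved. This is an illustrative sketch, not KiloCode's actual code: it assumes Ollama's `/api/embeddings` endpoint, Qdrant's `/points/search` endpoint, and the collection name and settings used in this guide.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama base URL (KiloCode setting)
QDRANT_URL = "http://localhost:6333"   # Qdrant HTTP API (KiloCode setting)
COLLECTION = "kilocode_codebase"       # workspace collection name (illustrative)

def embed_payload(query: str) -> dict:
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": "qwen3-embedding:8b-fp16", "prompt": query}

def search_payload(vector: list, limit: int = 50, threshold: float = 0.40) -> dict:
    """Request body for Qdrant's /collections/<name>/points/search endpoint."""
    return {
        "vector": vector,               # 4096-dim query embedding
        "limit": limit,                 # "Max Results" setting
        "score_threshold": threshold,   # "Search score threshold" setting
        "with_payload": True,           # return file/line metadata with each hit
    }

def semantic_search(query: str) -> list:
    """Embed the query with Ollama, then cosine-search the Qdrant collection."""
    def post(url: str, body: dict) -> dict:
        req = urllib.request.Request(
            url, data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    vector = post(f"{OLLAMA_URL}/api/embeddings", embed_payload(query))["embedding"]
    hits = post(f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
                search_payload(vector))
    return hits["result"]  # each hit: {"id": ..., "score": ..., "payload": {...}}

# Usage (requires Ollama and Qdrant running, and an indexed workspace):
# for hit in semantic_search("user authentication logic"):
#     print(round(hit["score"], 3), hit["payload"])
```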
Docker Network: ollama-network
├── Ollama Container (GPU accelerated)
│ └── qwen3-embedding:8b-fp16
└── Qdrant Container
├── HTTP API: localhost:6333
├── gRPC: localhost:6334
└── Dashboard: localhost:6333/dashboard
KiloCode (host) → localhost:6333 → Qdrant
KiloCode (host) → localhost:11434 → Ollama
Containers → ollama:11434 → Ollama (internal, via ollama-network)
Ollama model (defaults, no custom Modelfile needed):

```
# Model: qwen3-embedding:8b-fp16
# No custom modelfile needed - use default
# Context: 32K tokens (built-in)
# Output: 4096 dimensions (Qwen3-8B via Ollama)
# Quantization: FP16 for maximum quality
```

Qdrant collection:

```
Collection Name: kilocode_codebase
Vector Dimensions: 4096
Distance Metric: Cosine
Indexing: HNSW (Hierarchical Navigable Small World)
```

KiloCode settings:

```
Embedding Provider: Ollama
Model: qwen3-embedding:8b-fp16
Qdrant URL: http://localhost:6333
Max Search Results: 50
Min Block Size: 100 chars
Max Block Size: 1000 chars
```
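For reference, a collection with these settings could also be created by hand via Qdrant's REST API (`PUT /collections/<name>`). This is purely illustrative - KiloCode auto-creates the collection, so you don't need to run this:

```python
import json
import urllib.request

# Collection settings mirroring the configuration above
collection_config = {
    "vectors": {
        "size": 4096,          # must match the embedding model's output dimension
        "distance": "Cosine",  # distance metric used for similarity search
    }
}

def create_collection(base_url: str = "http://localhost:6333",
                      name: str = "kilocode_codebase") -> dict:
    """PUT /collections/<name> on a local, unauthenticated Qdrant."""
    req = urllib.request.Request(
        f"{base_url}/collections/{name}",
        data=json.dumps(collection_config).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```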
| Codebase Size | Blocks | VRAM | Storage |
|---|---|---|---|
| Small (1K files) | ~5K | 15GB | ~80MB |
| Medium (5K files) | ~25K | 15GB | ~400MB |
| Large (10K files) | ~50K | 15GB | ~800MB |
Indexing Time: Varies by project size and hardware. GPU-accelerated embedding generation is the bottleneck. Initial indexing may take minutes to hours depending on codebase size.
- Search Latency: Fast local search in milliseconds
- Embedding Generation: GPU-accelerated (Ollama)
- Vector Search: Fast similarity matching (Qdrant)
- Accuracy: High top-10 retrieval accuracy
```
# GPU utilization
nvidia-smi

# Qdrant dashboard
# http://localhost:6333/dashboard

# Check collection stats
curl http://localhost:6333/collections/kilocode_codebase
```

- Hardware: One-time (already owned RTX 4090)
- Electricity: Ongoing cost (GPU running intermittently, varies by local rates and usage)
- Per-query cost: Effectively $0 (unlimited)
- Pricing model: Per-token/per-query charges
- Usage costs: Scale with usage (light to heavy)
- Rate limits: Applied
- Privacy: Code sent to cloud
Winner: Local setup - unlimited queries, local embeddings + storage (complete privacy requires local inference LLM), minimal ongoing costs vs per-usage cloud fees
- Research and model selection (Qwen3-Embedding-8B)
- Architecture design
- Documentation (comprehensive guides)
- Docker Compose file for Qdrant
- Qdrant deployment and configuration
- KiloCode integration and testing
- Hardware compatibility verification
- Performance validation
- Implementation documentation
- Automatic backup/restore scripts
- Multi-workspace support
- Custom block sizing strategies
- Performance monitoring dashboard
Q: Indexing is slow
- Check GPU utilization: `nvidia-smi`
- Verify the FP16 model is installed: `ollama list | grep qwen3`
- Check Docker resource limits

Q: Search returns poor results
- Verify 4096 dimensions are configured
- Check the collection exists: `curl localhost:6333/collections`
- Rebuild the index in KiloCode settings

Q: High VRAM usage
- Expected: ~15GB for the FP16 model
- Consider a smaller embedding model if needed (e.g., qwen3-embedding:4b)
- Close other GPU applications
- Qwen3 Model Card: Qwen/Qwen3-Embedding-8B
- MTEB Leaderboard: Code Embedding Benchmark
- Qdrant Docs: https://qdrant.tech/documentation/
- Ollama Docs: https://ollama.ai/docs
- KiloCode: VS Code Extension
This project configuration and documentation is provided as-is for personal use.
Component Licenses:
- Qwen3-Embedding-8B: Apache 2.0
- Qdrant: Apache 2.0
- Ollama: MIT
- KiloCode: Check extension license
This is a personal project documenting a local setup. Feel free to:
- Fork and adapt for your hardware
- Submit issues for documentation improvements
- Share your own optimizations
- Alibaba Cloud - Qwen3 embedding models
- Qdrant - High-performance vector database
- Ollama - Simple local AI deployment
- KiloCode - Semantic code search integration
Ready to get started? Follow the 4_QDRANT_INSTALLATION_GUIDE.md for step-by-step deployment instructions.