Local LLM serving made manageable. A Docker wrapper around llama.cpp with model profiles, multi-GPU optimization, interactive monitoring dashboard, benchmarking pipeline, and Claude Code local integration — built for a dual-GPU desktop (RTX 4090 + RTX 5070 Ti) but adaptable to other setups.
- What Does This Wrapper Add?
- Use Cases
- Hardware
- Quick Start
- Claude Code Local Integration
- Models
- Benchmarks (EvalPlus HumanEval+)
- Adding New Models
- Configuration
- Architecture & Documentation
- AI-Assisted Development
- Repository Structure
- Updating llama.cpp
llama.cpp is a high-performance C/C++ inference engine for running LLMs locally using quantized GGUF models. It supports CPU and GPU inference (CUDA, Metal, Vulkan), can split model layers across multiple GPUs and CPU RAM, and is the engine that Ollama is built on. It includes a web UI, an OpenAI-compatible API, and automatic multi-GPU placement via --fit.
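To put the wrapper in context, a bare llama-server launch looks roughly like the sketch below — you won't normally type this here, since `start.sh` and `models.conf` generate the real flags inside the container, and the exact `--fit` syntax may vary between llama.cpp versions:

```bash
# Rough sketch of a plain llama-server launch (the wrapper generates the real flags)
./llama-server -m models/<model-dir>/<model-file>.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 131072 \
  --fit    # automatic multi-GPU placement, as described above
```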
What it does not have is a convenient way to manage multiple model configurations, switch between them, monitor hardware usage, or integrate with development tools like Claude Code. That is what this wrapper adds:
- Dockerized build — compiles llama.cpp from source with hardware-specific CUDA flags, making the setup reproducible and isolated
- Model selector (`start.sh`) — interactive menu to pick a model, each with its own optimized GPU layer split, sampler defaults, and context size stored in `models.conf`
- Monitoring dashboard (`dashboard.py`) — curses TUI showing server logs, per-GPU VRAM/utilization/temperature, and system stats; includes an in-dashboard model picker (`m` key) and a management API on port 8081 for switching models programmatically
- Claude Code local integration (`claude-local`) — run Claude Code against the local llama-server instead of the Anthropic cloud, with sandboxing and VS Code integration
- Benchmarking — EvalPlus HumanEval+ runner to compare local models against each other and against proprietary references
- Model onboarding — `/add-model` skill with agent-assisted workflow for evaluating, configuring, and benchmarking new models
- Documentation — GPU placement strategies, sampler settings per model, architecture overview, lessons learned
llama.cpp provides the inference engine, web UI, and API. Everything else listed above is part of this wrapper.
| Use case | What this project offers |
|---|---|
| Local LLM inference with multi-GPU | Automatic tensor placement (--fit) across asymmetric GPUs, model profiles with per-model GPU/context/sampler tuning, monitoring dashboard |
| Running Claude Code with local models | claude-local connects Claude Code to the local llama-server — chat, tool use, thinking, VS Code integration, bubblewrap sandboxing, mid-session model switching |
| Model benchmarking | EvalPlus HumanEval+ pipeline for comparing local models against each other and proprietary references (Claude, GPT, etc.) |
| MoE model optimization | Working reference for how MoE vs dense architectures behave with automatic GPU placement across GPUs with different VRAM sizes |
| Learning about Claude Code workflows | The project itself is developed with Claude Code agents and skills — see AI-Assisted Development |
For a guide on setting up Claude Code itself (agents, skills, project workflows), see the separate Claude Code Setup repository.
Note: This is not a plug-and-play installer. The Docker build compiles llama.cpp for specific GPU architectures (sm_89 + sm_120), and all model configurations are tuned for specific hardware (RTX 4090 + RTX 5070 Ti). It can be adapted to other setups, but GPU layers and build flags will need adjusting. The detailed docs are there to help with that.
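For example, adapting the build to a different GPU mostly means changing the CUDA architecture list. A minimal sketch of the underlying llama.cpp CMake invocation — the actual flags live in this project's Dockerfile, and the build-arg names there may differ:

```bash
# Illustrative only: llama.cpp selects GPU targets via CMAKE_CUDA_ARCHITECTURES.
# This project builds for "89;120" (RTX 4090 + RTX 5070 Ti); a single RTX 3090
# (Ampere, sm_86) would use "86" instead.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="86"
cmake --build build --config Release -j
```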
For local LLM inference I used Ollama for a long time, and it does a great job: easy model management, clean API, simple GPU offloading via GGUF. But as my projects moved toward larger models, agentic flows, and tighter hardware optimization, I kept running into limits. I wanted precise per-layer GPU/CPU placement across two GPUs with different VRAM sizes, access to the latest llama.cpp features as soon as they land, and control over build flags targeting my specific GPU architectures. Ollama is built for simplicity — and does that well — but that means it doesn't expose these lower-level controls and sometimes lags behind on newer llama.cpp features.
So I went back to llama.cpp directly. It has come a long way: web UI, OpenAI-compatible API, automatic multi-GPU placement via --fit (including MoE expert offloading). The goal of this wrapper is simple: get the most out of my hardware in terms of model quality, speed, and context length — and more recently, to use these local models as a backend for Claude Code. More on that in the DGX Spark comparison article.
| Component | Spec |
|---|---|
| GPU 0 | NVIDIA RTX 4090 (24 GB VRAM) — Ada Lovelace, sm_89 |
| GPU 1 | NVIDIA RTX 5070 Ti (16 GB VRAM) — Blackwell, sm_120 |
| CPU | AMD Ryzen 7 5800X3D (8C/16T) |
| RAM | 64 GB DDR4 |
| OS | Ubuntu 24.04 |
| Driver | 580.x (open kernel) |
| CUDA | 13.0 (required for sm_120 / Blackwell support) |
Total GPU VRAM: 40 GB across two GPUs with asymmetric split.
```bash
git clone <repo-url> llama_cpp
cd llama_cpp
git clone https://github.com/ggml-org/llama.cpp.git
```

Download models into `models/<model-dir>/`:

```bash
huggingface-cli download <repo> <file> --local-dir models/<model-dir>/
```

Build the Docker image:

```bash
docker compose build
```

Requirements: Docker with Compose v2, NVIDIA Container Toolkit, NVIDIA driver 580+ (open kernel recommended for Blackwell), sufficient disk space for models (60-200+ GB).
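A quick way to confirm the GPU prerequisites before building — this assumes the NVIDIA Container Toolkit is configured as the Docker GPU runtime (it injects `nvidia-smi` into the container):

```bash
# Both GPUs should appear; if this fails, fix the driver / container toolkit first.
docker run --rm --gpus all ubuntu:24.04 nvidia-smi
```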
```bash
./start.sh                  # Interactive menu + monitoring dashboard
./start.sh glm-flash-q4     # Direct launch (stops running container first)
./start.sh --list           # List available models
./start.sh --no-dashboard   # Launch without dashboard (raw docker compose logs)
```

The script shows an interactive model selector with speeds and context sizes:
It generates .env, starts the container, waits for the server to be ready, and opens a monitoring dashboard with server logs, GPU stats, and system stats:
Dashboard controls:
- `q` — Stop the server and exit
- `r` — Stop the server and return to the model menu
- `m` — Open model picker (switch models without leaving the dashboard)
- `Up`/`Down`/`PgUp`/`PgDn` — Scroll server logs
Access:
- Web UI: http://localhost:8080 — llama.cpp's built-in chat interface
- API: http://localhost:8080/v1/chat/completions
- Management API: http://localhost:8081 — model switching for agents and external tools (`GET /models`, `GET /status`, `POST /switch`)
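The management API can be driven with plain curl as well — a sketch only; the request payload for `POST /switch` isn't documented in this README, so its shape below is an assumption (a `models.conf` section ID):

```bash
# List configured models and check what is currently loaded
curl http://localhost:8081/models
curl http://localhost:8081/status

# Switch to another profile (assumed payload shape)
curl -X POST http://localhost:8081/switch \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen35-35b-q6"}'
```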
Test with curl:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'
```

Since llama.cpp supports the Anthropic Messages API natively, this setup works as a local backend for Claude Code. The `claude-local` command starts a separate Claude Code instance that connects to the local llama-server instead of the Anthropic cloud.
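Conceptually, the wiring amounts to pointing Claude Code's Anthropic endpoint at the local server. This is a rough sketch of the idea only — the `claude-local` wrapper adds its own config home, bubblewrap sandboxing, and VS Code integration on top, so use it rather than doing this by hand:

```bash
# Rough idea only — prefer the claude-local wrapper described below.
export ANTHROPIC_BASE_URL=http://localhost:8080   # llama-server speaks the Anthropic Messages API
export ANTHROPIC_API_KEY=dummy                     # placeholder; a local server without --api-key doesn't validate it
claude
```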
What works:
- Chat, tool use (Glob, Read, Write, Edit), thinking blocks
- VS Code IDE integration (diffs in editor)
- Bubblewrap sandboxing (bash commands restricted)
- Mid-session model switching via management API (conversation context preserved)
- Skills and agents (with local model capability limitations)
What to be aware of:
- Local models are less capable than Opus — review actions before approving
- The sandbox only covers bash commands, not Write/Edit tools
- Some Claude Code features (prompt caching, adaptive reasoning) do not work locally
Setup: See claude-local/README.md for installation, configuration, usage, and safety guide.
What's next: Automatic model switching within claude-local (agent decides which local model fits the task) is a future goal. Integration with other tools (Continue.dev, aider, OpenClaw) is on the Roadmap but not planned for the short term.
All active models are defined in models.conf. Use the section ID with ./start.sh to launch.
| Section ID | Model | Type | Speed | Context | Best for |
|---|---|---|---|---|---|
| `glm-flash-q4` | GLM-4.7 Flash Q4_K_M | MoE | ~147 t/s | 128K | Fast tasks, reasoning |
| `glm-flash-q8` | GLM-4.7 Flash Q8_0 | MoE | ~112 t/s | 128K | Quality reasoning, tools |
| `glm-flash-exp` | GLM-4.7 Flash Q8_0 (experimental) | MoE | ~112 t/s | 128K | Experimental |
| `qwen35-35b-q6` | Qwen3.5-35B-A3B UD-Q6_K_XL | MoE | ~120 t/s | 262K | Thinking, reasoning, coding, agentic |
| `qwen35-122b-q4` | Qwen3.5-122B-A10B UD-Q4_K_XL | MoE | ~18 t/s | 262K | Deep reasoning, quality, coding |
| `qwen35-27b-q8` | Qwen3.5-27B UD-Q8_K_XL | Dense | ~20-30 t/s (est.) | 262K | Pending — CUDA crash under investigation |
Three models were retired 2026-02-26 after benchmark comparison: GPT-OSS 120B (87.2% HumanEval+), Qwen3-Coder-Next (90.9%), and Qwen3-Next-80B-A3B (93.9%). Their profiles are commented out in models.conf.
Recommended client-side settings per model. Most clients override server defaults, so set these explicitly.
| Setting | GLM (general) | GLM (coding) | Qwen3.5 (thinking) | Qwen3.5 (coding) |
|---|---|---|---|---|
| temperature | 1.0 | 0.7 | 1.0 | 0.6 |
| top_p | 0.95 | 1.0 | 0.95 | 0.95 |
| top_k | — | — | 20 | 20 |
| min_p | 0.01 | 0.01 | 0.0 | 0.0 |
| presence_penalty | — | — | 1.5 (client-side) | 0.0 |
Qwen3.5 settings apply to all three Qwen3.5 models (35B-A3B, 27B, 122B-A10B). Full details and rationale: docs/client-settings.md
Qwen3.5 thinking model: All three Qwen3.5 models generate `<think>` blocks by default. Thinking cannot be disabled with `/nothink` (unlike Qwen3) — use the chat template parameter `enable_thinking=false` if your client supports it. `presence_penalty=1.5` is strongly recommended for general use and must be set client-side.
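When calling the API directly, the Qwen3.5 coding settings and the thinking switch can be passed in one request. Recent llama-server builds accept a `chat_template_kwargs` field on `/v1/chat/completions`; other clients may not expose it, in which case set the equivalents in the client's own settings:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 0.0,
        "chat_template_kwargs": {"enable_thinking": false}
      }'
```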
Retired model notes (for reference): GPT-OSS 120B, Qwen3-Coder-Next, and Qwen3-Next-80B-A3B were retired 2026-02-26. Historical sampler settings and notes for these models are preserved in docs/client-settings.md.
Coding benchmark: 164 Python problems (HumanEval+), pass@1, greedy decoding. HumanEval+ uses 80x more tests than standard HumanEval.
| # | Model | HumanEval | HumanEval+ | Speed | vs published |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | 98.2% | 95.1% | (API) | +4.0pp |
| 2 | Qwen3.5-27B UD-Q6_K_XL (dense) | 98.2% | 94.5% | ~31 t/s | — |
| 3 | Qwen3.5-122B-A10B UD-Q4_K_XL | 97.6% | 94.5% | ~18 t/s | — |
| 4 | Claude Opus 4.6 (thinking) | 99.4% | 93.9% | (API) | +5.2pp |
| 5 | Qwen3-Next-80B-A3B UD-Q5_K_XL † | 98.2% | 93.9% | ~33 t/s | — |
| 6 | Qwen3.5-35B-A3B UD-Q6_K_XL | 95.1% | 90.9% | ~120 t/s | — |
| 7 | Qwen3-Coder-Next UD-Q5_K_XL † | 93.9% | 90.9% | ~33 t/s | -0.2pp |
| 8 | GLM-4.7 Flash Q8_0 * | 89.0% | 87.2% | ~112 t/s | +2.0pp |
| 9 | GPT-OSS 120B F16 † | 93.3% | 87.2% | ~22 t/s | +5.0pp |
| 10 | GLM-4.7 Flash Q4_K_M * | 87.8% | 83.5% | ~147 t/s | +0.8pp |
† Retired model (2026-02-26) — benchmark scores preserved for reference.
"vs published" = difference in HumanEval score compared to the closest published reference score for that model (from model cards, EvalPlus leaderboard, or benchmark articles). Not always an exact apples-to-apples comparison — see REPORT.md for full details, reference sources, and caveats.
* Reasoning model — benchmarked with --reasoning-format none (thinking tokens included in output). Claude was benchmarked via Claude Code (Max subscription) using a custom agent with the same prompts and evaluation pipeline, instead of the llama.cpp API.
Full results with proprietary model comparisons: benchmarks/evalplus/results/REPORT.md
HumanEval is a narrow benchmark: 164 short, well-defined Python functions. It measures one specific skill — writing a correct function from a docstring. Models that score 85-98% on this test are not necessarily 85-98% as capable as each other in practice.
What these scores show: local open-source models are genuinely competitive at structured coding tasks. A 3B-active MoE model running at 120 t/s locally can match a proprietary model on this specific benchmark.
What these scores don't show: real-world capability differences in complex reasoning, multi-file code generation, following ambiguous instructions, consistency across diverse domains, or long multi-turn conversations. Proprietary models with orders of magnitude more parameters invest those parameters in breadth, robustness, and handling edge cases — qualities that narrow benchmarks don't capture.
Practical takeaway: local models are excellent tools for experts who know how to pick the right model for the task, craft good prompts, and evaluate the output. They're not drop-in replacements for general-purpose assistants across all use cases. The benchmark numbers help compare models within this project, not make absolute claims about model quality.
Why these benchmarks? Current benchmarks are selected for practical ease: automated evaluation (no LLM judge), runs via the local API, completes in hours, and has published frontier scores for comparison. This limits scope to objectively measurable tasks. See extended benchmarks research for planned additions covering reasoning, generalization, and knowledge breadth.
```bash
cd benchmarks/evalplus
source .venv/bin/activate            # One-time setup: uv venv && uv pip install evalplus
./benchmark.sh bench-glm-flash-q4    # Smoke test (one model)
./benchmark.sh --local               # All local models
./benchmark.sh --all                 # All models (local + Claude)
```

Full setup and usage: benchmarks/evalplus/README.md
The /add-model skill provides a guided 8-phase workflow for evaluating and adding new GGUF models. This is built for Claude Code and uses its agents and skills system, but the workflow pattern (evaluate → configure → test → benchmark → document) could be adapted for other AI-assisted development tools.
1. Evaluate — Analyze architecture, quant options, VRAM fit (model-manager agent)
2. Download — User downloads files to `models/<dir>/`
3. Create profile — Add production profile to `models.conf` (gpu-optimizer agent)
4. Find samplers — Research official sampler settings (model-manager agent)
5. Test — Verify the model loads, generates, and performs well
6. Create bench profile — Add benchmark profile to `models.conf`
7. Run benchmark — EvalPlus HumanEval+ evaluation (benchmark agent)
8. Update docs — Update README, client-settings, ROADMAP (doc-keeper agent)
Usage: run `/add-model <model-name>` in Claude Code.
Models being evaluated for potential addition. Model cards are in models/documentation/CANDIDATES/.
| Model | Params | Architecture | Specialty |
|---|---|---|---|
| Nemotron-3-Nano-30B-A3B | 30B / 3.5B active | Hybrid Mamba2-Transformer MoE | Reasoning, tool calling, math/coding (SWE-bench 38.8%) |
| Devstral-Small-2-24B | 24B dense | Dense Transformer | Agentic coding (SWE-bench 68.0%, Terminal Bench 22.5%), vision |
| Ministral-3-14B-Instruct | 14B | Dense + vision encoder | General-purpose, multilingual, edge-optimized |
| Ministral-3-14B-Reasoning | 14B | Dense + vision encoder | Math/STEM reasoning (AIME25 85.0%) |
| File | Purpose |
|---|---|
| `models.conf` | Server config: model paths, context size, sampler defaults, `--fit` GPU placement |
| `docker-compose.yml` | Docker container config, GPU device mapping, volume mounts |
| `benchmarks/evalplus/bench-client.conf` | Benchmark client config: system prompts, reasoning levels per model |
| `.env` | Auto-generated by `start.sh` from `models.conf` — never edit manually |
Annotated template with full variable reference: docker-compose.example.yml
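To give a feel for what a model profile holds, here is a purely hypothetical sketch — the key names and syntax below are illustrative, not the real ones; the actual variables are defined in `models.conf` and documented in `docker-compose.example.yml`:

```bash
# Hypothetical sketch only — see models.conf / docker-compose.example.yml for the real keys.
MODEL_PATH="models/Qwen3.5/MoE/35B/<model-file>.gguf"   # path under models/
CONTEXT_SIZE=262144                                      # 262K context
GPU_PLACEMENT="--fit"                                    # automatic multi-GPU placement
TEMPERATURE=1.0                                          # server-side sampler default
```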
For a high-level overview of how all components connect (llama-server, dashboard, management API, Claude Code normal vs local, sandboxing), see docs/architecture.md.
| Document | Description |
|---|---|
| GPU Strategy Guide | GPU placement decision tree, strategies A-D, graph splits, tuning guidance |
| Client Settings | Recommended temperature, top_p, top_k, min_p, and system prompt settings per model |
| Bench Profile Test Results | GPU optimization data: VRAM usage, speeds, OOM failures, layer split decisions |
| EvalPlus Benchmark Results | Latest HumanEval+ scores for all models vs proprietary references |
| EvalPlus Benchmark Runner | HumanEval+ coding benchmark setup, usage, and comparison with proprietary models |
| Claude Code Local Setup | Installation, usage, and safety guide for running Claude Code with a local backend |
| Architecture Overview | C4-style overview of all components and design decisions |
| DGX Spark Comparison | DGX Spark vs desktop analysis for local LLM inference |
| Lessons Learned | Common mistakes and prevention rules |
| Local Setup Decision | Analysis of isolation options and rationale for the chosen approach |
| docker-compose.example.yml | Annotated compose template with full variable reference |
This project is developed with Claude Code using specialized agents and workflows.
| Resource | Purpose |
|---|---|
| `AI_INSTRUCTIONS.md` | Project context and rules for AI tools |
| `.claude/agents/` | Specialized agents (gpu-optimizer, benchmark, model-manager, builder, diagnose, api-integration, doc-keeper) |
| `.claude/skills/add-model/` | `/add-model` — 8-phase model onboarding workflow |
| `claude_plans/` | Active plan files (archived to `archive/` when done) |
Workflow: plan → approve → implement → test → document → commit. Non-trivial changes start as a plan file, get user approval, then are implemented with the appropriate agents.
See ROADMAP.md for current status, completed milestones, and future plans.
Research: DGX Spark vs Desktop Comparison — analysis of when NVIDIA's DGX Spark (128 GB unified memory, Grace Blackwell) is worth it compared to a dual-GPU desktop for local inference. Key finding (based on GPT-OSS 120B, now retired): Spark was 2.7x faster for that model (52.8 vs 19.7 t/s), but the desktop wins for models that fit on a single GPU.
```
.
├── README.md                      # This file
├── AI_INSTRUCTIONS.md             # Project context for AI tools
├── ROADMAP.md                     # Future plans and status
├── Dockerfile                     # Multi-stage build (CUDA 13.0, sm_89+sm_120)
├── docker-compose.yml             # Production compose file
├── docker-compose.example.yml     # Annotated template with usage instructions
├── .dockerignore
├── .gitignore
├── models.conf                    # Server configuration (all models)
├── start.sh                       # Model selector script (generates .env, launches dashboard)
├── dashboard.py                   # Terminal monitoring dashboard (curses TUI)
├── .env.example                   # Generic template with all variables documented
├── docs/
│   ├── gpu-strategy-guide.md      # GPU placement decision tree
│   ├── client-settings.md         # Recommended client-side sampler settings per model
│   ├── bench-test-results.md      # Bench profile GPU optimization (VRAM, speeds, OOM tests)
│   ├── dgx-spark-comparison.md    # DGX Spark vs desktop comparison (draft article)
│   ├── lessons_learned.md         # Mistakes and prevention rules
│   ├── claude_tips.md             # Claude Code usage tips
│   ├── extended-benchmarks-research.md  # Research on non-coding benchmarks
│   ├── alternative_benches_advice.md    # Alternative benchmark options
│   ├── screenshots/               # UI screenshots for README
│   ├── architecture.md            # C4-style architecture overview
│   └── decisions/                 # Architecture/design decision records
├── claude-local/                  # Claude Code local instance setup
│   ├── README.md                  # Installation, usage, and safety guide
│   ├── install.sh                 # Copies config to ~/.claude-local/ and ~/bin/
│   ├── bin/claude-local           # Wrapper script
│   └── home/                      # Config files (CLAUDE.md, settings.json, skills)
├── models/                        # GGUF files (gitignored)
│   ├── .gitkeep
│   ├── documentation/             # Model cards (README from HuggingFace)
│   │   ├── CANDIDATES/            # Model cards for candidate models (not yet adopted)
│   │   ├── README_modelcard_GLM-4.7-Flash.md
│   │   ├── README_Qwen3.5-35B-A3B-GGUF.md
│   │   └── README_Qwen3.5-122B-A10B-GGUF.md
│   ├── GLM-4.7-Flash/
│   ├── Qwen3.5/
│   │   ├── MoE/
│   │   │   ├── 35B/               # Qwen3.5-35B-A3B UD-Q6_K_XL
│   │   │   └── 122B/              # Qwen3.5-122B-A10B UD-Q4_K_XL
│   │   └── Dense/
│   │       └── 27B-UD-Q8_K_XL/    # Qwen3.5-27B (pending — CUDA crash)
│   ├── GPT-OSS-120b/              # retired 2026-02-26
│   ├── Qwen3-Coder-Next/          # retired 2026-02-26
│   │   └── UD-Q5_K_XL/
│   └── Qwen3-Next/                # retired 2026-02-26
│       └── UD-Q5_K_XL/
├── benchmarks/
│   └── evalplus/                  # EvalPlus HumanEval+ coding benchmark runner
│       ├── benchmark.sh           # Main runner (orchestrates all steps)
│       ├── bench-client.conf      # Client-side config (system prompts per model)
│       ├── generate-report.py     # Results → comparison table
│       ├── reference-scores.json  # Published proprietary model scores
│       └── results/               # Benchmark outputs (gitignored)
│           └── REPORT.md          # Latest EvalPlus HumanEval+ results
├── archive/                       # Archived plans and superseded docs (cleaned periodically)
├── claude_plans/                  # Claude Code plan files
├── llama.cpp/                     # llama.cpp source (separate git repo, gitignored)
└── .claude/
    ├── agents/                    # Claude Code specialized agents
    │   ├── gpu-optimizer.md
    │   ├── benchmark.md
    │   ├── builder.md
    │   ├── diagnose.md
    │   ├── model-manager.md
    │   ├── api-integration.md
    │   └── doc-keeper.md
    └── skills/                    # Claude Code skills (reusable workflows)
        └── add-model/SKILL.md     # /add-model — model onboarding workflow
```
```bash
cd llama.cpp
git pull origin master
cd ..
docker compose build --no-cache
```

The `llama.cpp/` directory is a separate git repository — it's gitignored from this wrapper project and updated independently.
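If an upstream change breaks the build or a model, you can pin `llama.cpp/` to a known-good state before rebuilding — standard git, shown here as a sketch with a placeholder ref:

```bash
cd llama.cpp
git checkout <known-good-tag-or-commit>   # placeholder — pick the ref you want to pin to
cd ..
docker compose build --no-cache
```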


