Cake is a Rust framework, based on Candle, for distributed inference of large language models and image generation models. The goal is to run big (70B+) models by repurposing consumer hardware into a heterogeneous cluster of iOS, Android, macOS, Linux and Windows devices, effectively leveraging planned obsolescence as a tool to make AI more accessible and democratic.
This is experimental code that is being actively developed and changes quickly; expect bugs.
- Distributed Inference — Shard transformer blocks across multiple devices to run models that don't fit on a single GPU.
- Multi Model — Support for LLaMA 3.x, Qwen2/2.5, Qwen3.5 and Stable Diffusion.
- Multi Platform — CUDA, Metal, and CPU backends across Linux, macOS, Windows, iOS, and Android.
- Zero-Config Clustering — mDNS discovery, automatic layer assignment, and model data push with a single `--cluster-key` flag.
- OpenAI-Compatible API — REST API with streaming support, plus a built-in web UI and TUI chat client.
- Docker — Container builds for Linux/NVIDIA with docker-compose cluster support.
Build with the backend for your hardware:

```sh
cargo build --release --features cuda # or: --features metal
```

Download a model:

```sh
cake download Qwen/Qwen2.5-Coder-1.5B-Instruct
```

Run a one-shot prompt:

```sh
cake master --model Qwen/Qwen2.5-Coder-1.5B-Instruct --prompt "Hello!"
```

To start the API server and web UI:

```sh
cake master --model Qwen/Qwen2.5-Coder-1.5B-Instruct --api 0.0.0.0:8080
```

For the full usage guide and API reference, check the project documentation.
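Since the server's API is OpenAI-compatible, any OpenAI-style client should be able to talk to it. Below is a minimal sketch using only the Python standard library; the `/v1/chat/completions` route and the payload field names are assumed from the OpenAI chat API convention rather than taken from Cake's docs, so check the project documentation for the exact paths.

```python
import json
from urllib import request

# OpenAI-style chat payload; the field names follow the OpenAI chat
# API convention that Cake's server is described as compatible with.
payload = {
    "model": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}


def chat(host: str = "http://localhost:8080") -> dict:
    """POST the payload to the (assumed) /v1/chat/completions route
    and return the decoded JSON response."""
    req = request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server from the previous command running, `chat()["choices"][0]["message"]["content"]` would hold the model's reply, assuming an OpenAI-shaped response body.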
| Model | Type | Feature Flag | Status |
|---|---|---|---|
| LLaMA 3.x | Text | `llama` (default) | ✅ |
| Qwen2 / Qwen2.5 | Text | `qwen2` (default) | ✅ |
| Qwen3.5 | Text | `qwen3_5` (default) | ✅ |
| Stable Diffusion (1.5, 2.1, XL, XL Turbo) | Image | - | ✅ |
Released under the GPL 3 license. To see the licenses of the project dependencies, install cargo-license with `cargo install cargo-license` and then run `cargo license`.