The proof is in the pudding.
Real-time AI video commentary with text-to-speech. Point it at a football match, a security camera, a nature stream, or any video source — and get a live AI commentator that sees, understands, and speaks about what's happening. Runs locally on your GPU, no cloud needed.
Status: Sprint 2 complete. Core pipeline works end-to-end: video in, text + TTS audio out, with adaptive pacing. Docker and WebRTC are next (Sprint 3). See Roadmap for details.
- Demo: Live Sports Commentary
- Video Sources
- Getting Started
- Goal
- Architecture Overview
- Use Cases
- Model
- Resources
- Hardware
- Development Approach
- Project Structure & Agents
- Documentation
- Acknowledgments
- Support
## Demo: Live Sports Commentary

The most fun way to try this: download a football match (or any sports broadcast) and let the AI commentate live — with voice.
```bash
# Start with TTS enabled
ENABLE_TTS=true python -m app.main
```

Open http://localhost:8199 in your browser, enter the path to a video file, click Start, and set the instruction:
```
Commentate on this football match between Brazil (BRA) and France (FRA).
The scoreboard shows country abbreviations, the score, and the match clock
— the clock is NOT the score. Focus on exciting moments: attacks, shots,
saves, fouls, corners, and near-misses. Build tension during dangerous plays.
Be enthusiastic about goal chances, not monotone. Skip boring buildup in
midfield — only speak when something interesting happens.
```
Adapt the team names and context to your match. The AI will commentate with natural pacing — more during action, quieter during slow moments. Use the speaker button in the header to mute/unmute.
Tip: The prompt makes a big difference. Experiment with it while the video is running — you can change the instruction at any time. For example, the model may read the match clock too often. Adding a constraint like "You may mention the match time only at 5, 10, 15, ... 90 minutes play time" fixes that. See the Tuning Guide for more prompt examples.
## Video Sources

Enter any of these in the "Video source" field in the browser UI:
| Source | Format | Example |
|---|---|---|
| Local video file | File path | /home/user/match.mp4 |
| Webcam | Device ID (integer) | 0 |
| RTSP stream | RTSP URL | rtsp://192.168.1.100:554/stream |
| HTTP MJPEG stream | HTTP URL | http://192.168.1.100:8080/video |
| HTTP video stream | HTTP URL | http://example.com/stream.mp4 |
The system uses OpenCV's VideoCapture underneath, so anything OpenCV supports will work. Video files loop automatically for testing.
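Under the hood, that whole table reduces to a few lines of OpenCV. A minimal sketch, not the project's actual code -- open_source is a hypothetical helper:

```python
import cv2

def open_source(source: str) -> cv2.VideoCapture:
    # A plain integer like "0" is a webcam device ID; everything else
    # (file path, RTSP/HTTP URL) is passed to OpenCV as-is.
    cap = cv2.VideoCapture(int(source) if source.isdigit() else source)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open video source: {source}")
    return cap

cap = open_source("/home/user/match.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        # End of file: rewind so test videos loop, as described above.
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        continue
    # ...hand the frame to the commentary pipeline...
```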
Phone as camera: Install IP Webcam (Android) or similar app, then use the MJPEG URL it provides (e.g. http://192.168.1.50:8080/video).
VLC re-streaming: Stream any content as RTSP from another PC:
```bash
vlc input.mp4 --sout '#rtp{sdp=rtsp://:8554/stream}'
# Then use: rtsp://<that-pc-ip>:8554/stream
```

YouTube / Twitch: Not supported directly. Use yt-dlp -g <url> to extract the direct stream URL, then paste that URL — but results vary depending on format and DRM.
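If you want to script that step, a small helper will do. This relies only on yt-dlp's documented -g (print direct URL) flag:

```python
import subprocess

def direct_stream_url(page_url: str) -> str:
    # yt-dlp -g prints the direct media URL(s) without downloading.
    # Formats with separate audio/video may print several lines; the
    # first is the video stream.
    result = subprocess.run(
        ["yt-dlp", "-g", page_url],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()[0]

# Paste the returned URL into the "Video source" field in the UI.
```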
## Getting Started

- NVIDIA GPU with sufficient VRAM (see table below)
- CUDA 12.x installed
- Miniconda or Anaconda
- ~10 GB disk space for AWQ model + TTS assets (~30 GB if also downloading BF16)
| Mode | VRAM Required | Tested On |
|---|---|---|
| Text-only (AWQ) | ~8.6 GB | RTX 4090 |
| Text + TTS (AWQ) | ~14-15 GB | RTX 4090 |
| Text-only (BF16) | ~18.5 GB | RTX 4090 |
```bash
git clone https://github.com/nerdpudding/nerdpudding.git
cd nerdpudding
# Clone the reference repos (not included in this repo)
git clone https://github.com/OpenBMB/MiniCPM-o.git
git clone https://github.com/OpenBMB/MiniCPM-V-CookBook.git
# 1. Create conda environment
conda create -n nerdpudding python=3.12 -y
conda activate nerdpudding
pip install -r app/requirements.txt
# 2. Download AWQ INT4 model (~8 GB, default)
huggingface-cli download openbmb/MiniCPM-o-4_5-awq --local-dir models/MiniCPM-o-4_5-awq
# 3. Download TTS assets (~1.2 GB vocoder + reference audio)
# The AWQ model needs these from the BF16 model's assets directory.
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir models/MiniCPM-o-4_5 --include "assets/*"
cp -r models/MiniCPM-o-4_5/assets models/MiniCPM-o-4_5-awq/assets
# Optional: download full BF16 model (~19 GB, for comparison or fallback)
# huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir models/MiniCPM-o-4_5
# 4. Apply required model patches (see docs/model_patches.md for all patches)
# AWQ model needs config.json fix + streaming fix in modeling_minicpmo.py
# BF16 model (if downloaded) needs streaming fix in modeling_minicpmo.py
# 5. Start the server (text-only)
python -m app.main
# Or with TTS audio commentary
ENABLE_TTS=true python -m app.main
# Server starts on http://localhost:8199
```

Open the browser, enter a video source, click Start, type an instruction, and press Send. The AI commentary streams as text in the right panel. With TTS enabled, you'll also hear it — use the speaker button to mute/unmute.
Note: The server binds to 127.0.0.1 by default — only accessible from your own machine. To allow access from other devices on your network (e.g. for phone testing), use SERVER_HOST=0.0.0.0 python -m app.main. There is no authentication — do not expose on untrusted networks.
```bash
# Test model loading + inference on a single image
python -m scripts.test_model --image test_files/images/test.jpg
# Test frame capture from a video file
python -m scripts.test_capture --source test_files/videos/test.mp4
# Test full pipeline (model + capture + commentary loop)
python -m scripts.test_monitor --source test_files/videos/test.mp4 --cycles 2
# Test TTS audio output (saves WAV file)
ENABLE_TTS=true python -m scripts.test_tts --source test_files/videos/test.mp4
```

All settings are in app/config.py and overridable via environment variables. For detailed tuning instructions — including per-GPU recommendations, TTS pacing, scene detection, and prompt tips — see the Tuning Guide.
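The override mechanism is plain environment-variable parsing. A sketch of the pattern -- the helper names and defaults here are illustrative, not copied from app/config.py:

```python
import os

def env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")

def env_float(name: str, default: float) -> float:
    return float(os.environ.get(name, default))

# Variables shown elsewhere in this README; the defaults are guesses.
ENABLE_TTS = env_bool("ENABLE_TTS", False)
TTS_PAUSE_AFTER = env_float("TTS_PAUSE_AFTER", 1.0)
MODEL_PATH = os.environ.get("MODEL_PATH", "models/MiniCPM-o-4_5-awq")
```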
Quick examples:
```bash
# Enable TTS with custom pacing
ENABLE_TTS=true TTS_PAUSE_AFTER=1.5 python -m app.main
# Use BF16 model instead of AWQ (needs ~18.5 GB VRAM)
MODEL_PATH=models/MiniCPM-o-4_5 python -m app.main
# Disable video-commentary sync (show real-time video, no delay)
STREAM_DELAY_INIT=0 python -m app.main
# Different GPU
CUDA_VISIBLE_DEVICES=1 python -m app.main
```

The cloned MiniCPM repos are used as reference material only -- see Resources for details.
## Goal

Stream live video from any source into MiniCPM-o 4.5 and have a real-time conversation about what it sees -- like a live commentator that watches along and responds to your directions.
The AI continuously monitors the video stream and narrates or answers based on a sliding window of recent frames. The user can steer the AI's focus at any time (e.g., "only tell me what the dog does"). This is not video upload + batch processing -- it's live, continuous, and steerable. Text chat first, voice interaction later.
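In code terms, the sliding window is just a bounded frame buffer paired with the current instruction each cycle. A minimal sketch -- the window size and sampling rate are assumptions, not values from app/config.py:

```python
from collections import deque

WINDOW_SECONDS = 10  # assumption: how much recent context the model sees
SAMPLE_FPS = 2       # assumption: frames sampled for inference, not display

frames = deque(maxlen=WINDOW_SECONDS * SAMPLE_FPS)

def on_frame(frame):
    frames.append(frame)  # oldest frames fall off automatically

def build_model_input(instruction: str) -> list:
    # Pairing the *current* instruction with recent frames each cycle is
    # what makes the commentary steerable mid-stream.
    return [instruction, *frames]
```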
## Architecture Overview

```
Video Source  --->  Model Server (MiniCPM-o 4.5)  --->  Web UI
(cam/stream/file)       (Python, local GPU)            (browser)
                               ^                           |
                               |      user questions       |
                               +---------------------------+
```
## Use Cases

- Live video conversation -- ask questions about what the AI sees in real-time
- Monitoring & alerting -- describe events, trigger alerts on conditions (see the sketch after this list)
- Content logging -- auto-generate text summaries of video content
- Accessibility -- rich scene descriptions for visually impaired users
- Multi-model pipeline -- feed vision output into other LLMs, alert systems, or video generators
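As a concrete sketch of the alerting idea: watch the commentary text stream and fire on a keyword. The commentary is delivered as SSE (see the results table further down), but the /events path here is an assumption -- check the server code for the real route.

```python
# Hypothetical SSE consumer: alert when the commentary mentions a keyword.
# The endpoint path "/events" is assumed, not taken from the codebase.
import urllib.request

KEYWORD = "goal"

with urllib.request.urlopen("http://localhost:8199/events") as resp:
    for raw in resp:
        line = raw.decode("utf-8", errors="replace").strip()
        if not line.startswith("data:"):
            continue
        text = line[len("data:"):].strip()
        if KEYWORD in text.lower():
            print("ALERT:", text)
```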
## Model

MiniCPM-o 4.5 -- omni-modal model (vision + audio/STT + TTS), 9B parameters. Supports video understanding up to 10 FPS, speech recognition, text-to-speech, and full-duplex streaming -- all in one model.
| Variant | VRAM | Backend | Link |
|---|---|---|---|
| AWQ INT4 (default) | ~8.6 GB | Python / transformers + autoawq | HuggingFace |
| Full (BF16) | ~18.5 GB | Python / transformers | HuggingFace / ModelScope |
| GGUF (quantized) | 4.8 - 16.4 GB | C++ / llama.cpp | HuggingFace |
Primary target: AWQ INT4 on RTX 4090 (~8.6 GB VRAM, comparable quality to BF16). Fallback: BF16 via MODEL_PATH env var. See concept for detailed comparison.
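Loading follows the usual trust_remote_code pattern used across the OpenBMB repos. A sketch only -- the exact arguments differ between the AWQ and BF16 variants, so check the model card and cookbook before copying:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "models/MiniCPM-o-4_5-awq"  # local dir from Getting Started

# trust_remote_code pulls in modeling_minicpmo.py -- the file the patches
# in docs/model_patches.md apply to. The dtype/attn arguments shown are
# typical for the BF16 variant; the AWQ build may need the autoawq backend.
model = AutoModel.from_pretrained(
    path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```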
## Resources

Built upon two cloned repositories:

| Repo | Contents |
|---|---|
| MiniCPM-o/ | Official model repo -- web demos, FastAPI server, Vue frontend, VAD |
| MiniCPM-V-CookBook/ | Cookbook -- WebRTC demo, Omni Stream, Gradio, Docker setups, inference examples |
## Hardware

| Component | Spec |
|---|---|
| GPU (primary) | NVIDIA RTX 4090 24 GB |
| GPU (secondary) | NVIDIA RTX 5070 Ti 16 GB (~12 GB usable) -- backup only |
| CPU | AMD Ryzen 5800X3D |
| RAM | 64 GB DDR4 |
| OS | Ubuntu Desktop |
| Tools | Docker, npm, miniconda, uv |
The RTX 4090 is the primary compute target. The 5070 Ti is available but only considered if VRAM constraints require multi-GPU offloading (adds complexity due to mixed architectures).
## Development Approach

Proof of concept with iterative sprints. Start minimal, find limitations, improve. SOLID, DRY, KISS.
## Project Structure & Agents

See the Project hierarchy in AI_INSTRUCTIONS.md for what each folder and file is for, including the agent table and usage guidelines.
Sprint 2 complete. Full end-to-end pipeline: video in, text + TTS audio out, with adaptive pacing and scene-weighted commentary density. See Sprint 2 Review for detailed findings.
| Metric | Text-only | With TTS |
|---|---|---|
| VRAM (AWQ INT4) | ~8.6 GB | ~14-15 GB |
| Inference per cycle | ~1.6s avg | ~5s avg |
| End-to-end latency | ~4.8s avg | Audio-gated (adaptive) |
| Display frame rate | Native (~24 FPS via MJPEG) | Same |
| Commentary output | Streaming text (SSE) | Text + audio (Web Audio API) |
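"Audio-gated" in the table means each cycle waits for the previous utterance to finish playing before the next inference starts. A sketch with injected callables -- none of these names come from the codebase:

```python
import time

SAMPLE_RATE = 24_000  # assumption: TTS output sample rate

def commentary_loop(describe, synthesize, play, pause_after=1.5):
    # describe() runs one vision-inference cycle and returns text;
    # synthesize() turns text into audio samples; play() starts playback.
    # Sleeping for the audio duration plus a pause gates the next cycle
    # on the spoken output (cf. TTS_PAUSE_AFTER above).
    while True:
        text = describe()
        samples = synthesize(text)
        play(samples)
        time.sleep(len(samples) / SAMPLE_RATE + pause_after)
```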
Next: Sprint 3 — Docker, LiveKit WebRTC, input robustness, UI polish.
## Documentation

- Tuning Guide -- per-GPU settings, TTS pacing, prompt tips
- AI Instructions -- project rules, hierarchy, agent table
- Detailed Concept -- full concept with diagrams and technical details
- Roadmap -- project roadmap and sprint status
- Sprint 1 Review -- sprint summary, findings, Sprint 2 ideas
- Sprint 1 Log -- setup steps, test results, findings
- Sprint 2 Review -- sprint summary, findings, Sprint 3 recommendations
- Sprint 2 Log -- AWQ, latency, MJPEG sync, TTS, audio pacing
- Model Patches -- patches applied to model files (must reapply after update)
- Lessons Learned -- what worked and didn't (context for AI assistants)
- docs/ -- all guides and reference documentation
## Acknowledgments

Built on MiniCPM-o 4.5 by OpenBMB — an impressive omni-modal model with vision, speech, and TTS in a single 9B-parameter package. The model is Apache 2.0 licensed. This project applies minor patches to the model code for streaming compatibility (see Model Patches).
## Support

If you find this project useful, consider supporting its development.
## License

MIT -- use it freely, just include the copyright notice.

