
NerdPudding

The proof is in the pudding.

License: MIT Ko-fi

Real-time AI video commentary with text-to-speech. Point it at a football match, a security camera, a nature stream, or any video source — and get a live AI commentator that sees, understands, and speaks about what's happening. Runs locally on your GPU, no cloud needed.

Status: Sprint 2 complete. Core pipeline works end-to-end: video in, text + TTS audio out, with adaptive pacing. Docker and WebRTC are next (Sprint 3). See Roadmap for details.

Table of Contents

  • Demo: Live Sports Commentary
  • Video Sources
  • Getting Started
  • Configuration
  • Goal
  • Architecture Overview
  • Use Cases
  • Model
  • Resources
  • Hardware
  • Development Approach
  • Project Structure & Agents
  • Current Status
  • Documentation
  • Acknowledgments
  • Support
  • License

Demo: Live Sports Commentary

The most fun way to try this: download a football match (or any sports broadcast) and let the AI commentate live — with voice.

# Start with TTS enabled
ENABLE_TTS=true python -m app.main

Open http://localhost:8199 in your browser, enter the path to a video file, click Start, and set the instruction:

Commentate on this football match between Brazil (BRA) and France (FRA).
The scoreboard shows country abbreviations, the score, and the match clock
— the clock is NOT the score. Focus on exciting moments: attacks, shots,
saves, fouls, corners, and near-misses. Build tension during dangerous plays.
Be enthusiastic about goal chances, not monotone. Skip boring buildup in
midfield — only speak when something interesting happens.

Adapt the team names and context to your match. The AI will commentate with natural pacing — more during action, quieter during slow moments. Use the speaker button in the header to mute/unmute.

Tip: The prompt makes a big difference. Experiment with it while the video is running — you can change the instruction at any time. For example, the model may read the match clock too often. Adding a constraint like "You may mention the match time only at 5, 10, 15, ... 90 minutes play time" fixes that. See the Tuning Guide for more prompt examples.

Video Sources

Enter any of these in the "Video source" field in the browser UI:

| Source | Format | Example |
|---|---|---|
| Local video file | File path | /home/user/match.mp4 |
| Webcam | Device ID (integer) | 0 |
| RTSP stream | RTSP URL | rtsp://192.168.1.100:554/stream |
| HTTP MJPEG stream | HTTP URL | http://192.168.1.100:8080/video |
| HTTP video stream | HTTP URL | http://example.com/stream.mp4 |

The system uses OpenCV's VideoCapture underneath, so anything OpenCV supports will work. Video files loop automatically for testing.
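
For reference, a minimal sketch of how such a capture layer behaves (the actual implementation lives in the app; function and variable names here are illustrative):

```python
import cv2

def open_source(source: str):
    # An integer string like "0" selects a local webcam device;
    # anything else (file path, RTSP/HTTP URL) is passed through as-is.
    cap = cv2.VideoCapture(int(source) if source.isdigit() else source)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open video source: {source}")
    return cap

cap = open_source("/home/user/match.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        # End of file: rewind so local video files loop during testing.
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        continue
    # frame (a BGR ndarray) would be handed to the model pipeline here
```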

Phone as camera: Install IP Webcam (Android) or similar app, then use the MJPEG URL it provides (e.g. http://192.168.1.50:8080/video).

VLC re-streaming: Stream any content as RTSP from another PC:

vlc input.mp4 --sout '#rtp{sdp=rtsp://:8554/stream}'
# Then use: rtsp://<that-pc-ip>:8554/stream

YouTube / Twitch: Not supported directly. Use yt-dlp -g <url> to extract the direct stream URL, then paste that URL — but results vary depending on format and DRM.
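
If you prefer scripting that step, yt-dlp also exposes a Python API. A hedged sketch of the equivalent of yt-dlp -g (the URL is a placeholder; with a single muxed format selected, the info dict carries one direct URL):

```python
from yt_dlp import YoutubeDL

# Resolve the direct stream URL without downloading. "best" requests
# a single muxed format so extract_info returns one playable URL.
with YoutubeDL({"format": "best", "quiet": True}) as ydl:
    info = ydl.extract_info("https://example.com/watch?v=...", download=False)
    print(info["url"])  # paste this into the "Video source" field
```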

Getting Started

Prerequisites

  • NVIDIA GPU with sufficient VRAM (see table below)
  • CUDA 12.x installed
  • Miniconda or Anaconda
  • ~10 GB disk space for AWQ model + TTS assets (~30 GB if also downloading BF16)

| Mode | VRAM Required | Tested On |
|---|---|---|
| Text-only (AWQ) | ~8.6 GB | RTX 4090 |
| Text + TTS (AWQ) | ~14-15 GB | RTX 4090 |
| Text-only (BF16) | ~18.5 GB | RTX 4090 |

Quick Start

git clone https://github.com/nerdpudding/nerdpudding.git
cd nerdpudding

# Clone the reference repos (not included in this repo)
git clone https://github.com/OpenBMB/MiniCPM-o.git
git clone https://github.com/OpenBMB/MiniCPM-V-CookBook.git

# 1. Create conda environment
conda create -n nerdpudding python=3.12 -y
conda activate nerdpudding
pip install -r app/requirements.txt

# 2. Download AWQ INT4 model (~8 GB, default)
huggingface-cli download openbmb/MiniCPM-o-4_5-awq --local-dir models/MiniCPM-o-4_5-awq

# 3. Download TTS assets (~1.2 GB vocoder + reference audio)
#    The AWQ model needs these from the BF16 model's assets directory.
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir models/MiniCPM-o-4_5 --include "assets/*"
cp -r models/MiniCPM-o-4_5/assets models/MiniCPM-o-4_5-awq/assets

# Optional: download full BF16 model (~19 GB, for comparison or fallback)
# huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir models/MiniCPM-o-4_5

# 4. Apply required model patches (see docs/model_patches.md for all patches)
#    AWQ model needs config.json fix + streaming fix in modeling_minicpmo.py
#    BF16 model (if downloaded) needs streaming fix in modeling_minicpmo.py

# 5. Start the server (text-only)
python -m app.main

# Or with TTS audio commentary
ENABLE_TTS=true python -m app.main

# Server starts on http://localhost:8199

Open the browser, enter a video source, click Start, type an instruction, and press Send. The AI commentary streams as text in the right panel. With TTS enabled, you'll also hear it — use the speaker button to mute/unmute.

Note: The server binds to 127.0.0.1 by default — only accessible from your own machine. To allow access from other devices on your network (e.g. for phone testing), use SERVER_HOST=0.0.0.0 python -m app.main. There is no authentication — do not expose on untrusted networks.

Testing Without a Browser

# Test model loading + inference on a single image
python -m scripts.test_model --image test_files/images/test.jpg

# Test frame capture from a video file
python -m scripts.test_capture --source test_files/videos/test.mp4

# Test full pipeline (model + capture + commentary loop)
python -m scripts.test_monitor --source test_files/videos/test.mp4 --cycles 2

# Test TTS audio output (saves WAV file)
ENABLE_TTS=true python -m scripts.test_tts --source test_files/videos/test.mp4

Configuration

All settings are in app/config.py and overridable via environment variables. For detailed tuning instructions — including per-GPU recommendations, TTS pacing, scene detection, and prompt tips — see the Tuning Guide.
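
Internally this is the usual env-override pattern. A simplified sketch of what settings in app/config.py might look like (the variable names below are documented above, but the exact defaults and helper are illustrative):

```python
import os

def env_bool(name: str, default: bool) -> bool:
    # Treat "true"/"1"/"yes" (any case) as enabled.
    return os.environ.get(name, str(default)).strip().lower() in ("true", "1", "yes")

ENABLE_TTS = env_bool("ENABLE_TTS", False)
TTS_PAUSE_AFTER = float(os.environ.get("TTS_PAUSE_AFTER", "1.5"))
MODEL_PATH = os.environ.get("MODEL_PATH", "models/MiniCPM-o-4_5-awq")
SERVER_HOST = os.environ.get("SERVER_HOST", "127.0.0.1")
```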

Quick examples:

# Enable TTS with custom pacing
ENABLE_TTS=true TTS_PAUSE_AFTER=1.5 python -m app.main

# Use BF16 model instead of AWQ (needs ~18.5 GB VRAM)
MODEL_PATH=models/MiniCPM-o-4_5 python -m app.main

# Disable video-commentary sync (show real-time video, no delay)
STREAM_DELAY_INIT=0 python -m app.main

# Different GPU
CUDA_VISIBLE_DEVICES=1 python -m app.main

The cloned MiniCPM-o and MiniCPM-V-CookBook repos are used as reference material only -- see Resources for details.

Goal

Stream live video from any source into MiniCPM-o 4.5 and have a real-time conversation about what it sees -- like a live commentator that watches along and responds to your directions.

The AI continuously monitors the video stream and narrates or answers based on a sliding window of recent frames. The user can steer the AI's focus at any time (e.g., "only tell me what the dog does"). This is not video upload + batch processing -- it's live, continuous, and steerable. Text chat first, voice interaction later.
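
Conceptually, the monitoring loop keeps a bounded buffer of recent frames plus a steerable instruction. A hedged sketch of that idea (class and method names are illustrative, not the app's actual API):

```python
from collections import deque

class CommentaryLoop:
    def __init__(self, window_size: int = 32):
        # Sliding window: old frames fall off as new ones arrive.
        self.frames = deque(maxlen=window_size)
        self.instruction = "Describe what you see."

    def add_frame(self, frame):
        self.frames.append(frame)

    def steer(self, instruction: str):
        # The user can redirect the AI's focus at any time, mid-stream.
        self.instruction = instruction

    def next_commentary(self, model):
        # Each cycle sees only the recent window plus the current instruction.
        return model.generate(list(self.frames), self.instruction)
```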

Architecture Overview

Video Source  --->  Model Server (MiniCPM-o 4.5)  --->  Web UI
(cam/stream/file)       (Python, local GPU)           (browser)
                              ^                           |
                              |     user questions        |
                              +---------------------------+

Use Cases

  • Live video conversation -- ask questions about what the AI sees in real-time
  • Monitoring & alerting -- describe events, trigger alerts on conditions (see the sketch after this list)
  • Content logging -- auto-generate text summaries of video content
  • Accessibility -- rich scene descriptions for visually impaired users
  • Multi-model pipeline -- feed vision output into other LLMs, alert systems, or video generators
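
For the alerting case, one simple approach is to scan each streamed commentary chunk for trigger phrases. A minimal, hypothetical sketch (the keywords and notify callback are assumptions, not a built-in feature):

```python
ALERT_KEYWORDS = ("smoke", "fire", "person at the door", "fallen")

def check_alerts(commentary_text: str, notify) -> None:
    # Naive keyword matching on the streamed commentary text.
    # A stricter setup could instead instruct the model to emit a
    # fixed prefix like "ALERT:" and match only on that.
    lowered = commentary_text.lower()
    for kw in ALERT_KEYWORDS:
        if kw in lowered:
            notify(f"Alert triggered by: {kw!r}")
            return

# Wire it to any notifier: print, a webhook call, a push service, etc.
check_alerts("A person at the door is holding a package.", print)
```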

Model

MiniCPM-o 4.5 -- omni-modal model (vision + audio/STT + TTS), 9B parameters. Supports video understanding up to 10 FPS, speech recognition, text-to-speech, and full-duplex streaming -- all in one model.

| Variant | VRAM | Backend | Link |
|---|---|---|---|
| AWQ INT4 (default) | ~8.6 GB | Python / transformers + autoawq | HuggingFace |
| Full (BF16) | ~18.5 GB | Python / transformers | HuggingFace / ModelScope |
| GGUF (quantized) | 4.8 - 16.4 GB | C++ / llama.cpp | HuggingFace |

Primary target: AWQ INT4 on RTX 4090 (~8.6 GB VRAM, comparable quality to BF16). Fallback: BF16 via MODEL_PATH env var. See concept for detailed comparison.
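
Loading follows the standard transformers custom-code path. A hedged sketch of the BF16 route (the AWQ variant additionally involves autoawq; exact kwargs may differ, so check the official repo for the canonical invocation):

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "models/MiniCPM-o-4_5"  # BF16 checkpoint directory

# trust_remote_code is required: MiniCPM-o ships its own modeling code
# (the patched modeling_minicpmo.py mentioned in the Quick Start).
model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```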

Resources

Built upon two cloned repositories:

| Repo | Contents |
|---|---|
| MiniCPM-o/ | Official model repo -- web demos, FastAPI server, Vue frontend, VAD |
| MiniCPM-V-CookBook/ | Cookbook -- WebRTC demo, Omni Stream, Gradio, Docker setups, inference examples |

Hardware

| Component | Spec |
|---|---|
| GPU (primary) | NVIDIA RTX 4090 24 GB |
| GPU (secondary) | NVIDIA RTX 5070 Ti 16 GB (~12 GB usable) -- backup only |
| CPU | AMD Ryzen 5800X3D |
| RAM | 64 GB DDR4 |
| OS | Ubuntu Desktop |
| Tools | Docker, npm, miniconda, uv |

The RTX 4090 is the primary compute target. The 5070 Ti is available but only considered if VRAM constraints require multi-GPU offloading (adds complexity due to mixed architectures).

Development Approach

Proof of concept with iterative sprints. Start minimal, find limitations, improve. SOLID, DRY, KISS.

Project Structure & Agents

See the Project hierarchy in AI_INSTRUCTIONS.md for what each folder and file is for, including the agent table and usage guidelines.

Current Status

Sprint 2 complete. Full end-to-end pipeline: video in, text + TTS audio out, with adaptive pacing and scene-weighted commentary density. See Sprint 2 Review for detailed findings.

| Metric | Text-only | With TTS |
|---|---|---|
| VRAM (AWQ INT4) | ~8.6 GB | ~14-15 GB |
| Inference per cycle | ~1.6s avg | ~5s avg |
| End-to-end latency | ~4.8s avg | Audio-gated (adaptive) |
| Display frame rate | Native (~24 FPS via MJPEG) | Same |
| Commentary output | Streaming text (SSE) | Text + audio (Web Audio API) |

Next: Sprint 3 — Docker, LiveKit WebRTC, input robustness, UI polish.

Documentation

  • Tuning Guide -- per-GPU recommendations, TTS pacing, scene detection, and prompt tips
  • Model Patches (docs/model_patches.md) -- required patches for the AWQ and BF16 models
  • Sprint 2 Review -- detailed findings and metrics from the latest sprint
  • AI_INSTRUCTIONS.md -- project hierarchy, agent table, and usage guidelines

Acknowledgments

Built on MiniCPM-o 4.5 by OpenBMB — an impressive omni-modal model with vision, speech, and TTS in a single 9B-parameter package. The model is Apache 2.0 licensed. This project applies minor patches to the model code for streaming compatibility (see Model Patches).

Support

If you find this project useful, consider supporting development:

Ko-fi

License

MIT -- use it freely, just include the copyright notice.
