# CLAUDE.md — Home AI Assistant Project

## Project Overview

A self-hosted, always-on personal AI assistant running on a **Mac Mini M4 Pro (64GB RAM, 1TB SSD)**. The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration.

---

## Hardware

| Component | Spec |
|---|---|
| Chip | Apple M4 Pro |
| CPU | 14-core |
| GPU | 20-core |
| Neural Engine | 16-core |
| RAM | 64GB unified memory |
| Storage | 1TB SSD |
| Network | Gigabit Ethernet |

All AI inference runs locally on this machine. No cloud dependency is required; cloud APIs are optional.

---

## Core Stack

### AI & LLM

- **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
- **Open WebUI** — browser-based chat interface, runs as a Docker container

### Image Generation

- **ComfyUI** — primary image generation UI, node-based workflows
- Target models: SDXL, Flux.1, ControlNet
- Runs via Metal (Apple's GPU API)

### Speech

- **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine
- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency)
- **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised)
- **Qwen3-TTS** — alternative voice cloning via MLX
- **openWakeWord** — always-on wake word detection

### Smart Home

- **Home Assistant** — smart home control platform (Docker)
- **Wyoming Protocol** — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant
- **Music Assistant** — self-hosted music control, integrates with Home Assistant
- **Snapcast** — multi-room synchronised audio output

### AI Agent / Orchestration

- **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality
- **n8n** — visual workflow automation (Docker), chains AI actions
- **mem0** — long-term memory layer for the AI character

### Character & Personality

- **Character Manager** (built — see
`character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes
- Character config exports to JSON, consumed by the OpenClaw system prompt and pipeline

### Visual Representation

- **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android)
- VTube Studio WebSocket API used to drive expressions from the AI pipeline
- **LVGL** — simplified animated face on ESP32-S3-BOX-3 units
- Live2D model: to be sourced/commissioned (nizima.com or booth.pm)

### Room Presence (Smart Speaker Replacement)

- **ESP32-S3-BOX-3** units — one per room
- Flashed with **ESPHome**
- Acts as a Wyoming Satellite (mic input → Mac Mini → TTS audio back)
- LVGL display shows an animated face + status info
- Communicates over local WiFi

### Infrastructure

- **Docker Desktop for Mac** — containerises Home Assistant, Open WebUI, n8n, etc.
- **Tailscale** — secure remote access to all services, no port forwarding
- **Authelia** — 2FA authentication layer for exposed web UIs
- **Portainer** — Docker container management UI
- **Uptime Kuma** — service health monitoring and mobile alerts
- **Gitea** — self-hosted Git server for all project code and configs
- **code-server** — browser-based VS Code for remote development

---

## Voice Pipeline (End-to-End)

```
ESP32-S3-BOX-3 (room)
→ Wake word detected (openWakeWord, runs locally on device or Mac Mini)
→ Audio streamed to Mac Mini via Wyoming Satellite
→ Whisper.cpp transcribes speech to text
→ OpenClaw receives text + context
→ Ollama LLM generates response (with character persona from system prompt)
→ mem0 updates long-term memory
→ Response dispatched:
   → Kokoro/Chatterbox renders TTS audio
   → Audio sent back to ESP32-S3-BOX-3 (spoken response)
   → VTube Studio API triggered (expression + lip sync on desktop/mobile)
   → Home Assistant action called if applicable (lights, music, etc.)
```

---

## Character System

The AI assistant has a defined personality managed via the Character Manager tool.
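For orientation, an exported character config might look roughly like this. This is a hypothetical sketch: every field name, the character name, and the file paths are illustrative assumptions, not the actual Character Manager schema.

```json
{
  "name": "ExampleCharacter",
  "version": "1.0.0",
  "system_prompt": "You are a warm, helpful home assistant...",
  "voice": {
    "engine": "chatterbox",
    "reference_wav": "~/voices/example-reference.wav"
  },
  "live2d": {
    "expressions": {
      "idle": "Idle",
      "speaking": "Talk",
      "thinking": "Think",
      "happy": "Smile",
      "error": "Confused"
    }
  },
  "memory": {
    "provider": "mem0",
    "namespace": "example-character"
  }
}
```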
Key config surfaces:

- **System prompt** — injected into every Ollama request
- **Voice clone reference** — `.wav` file path for Chatterbox/Qwen3-TTS
- **Live2D expression mappings** — idle, speaking, thinking, happy, error states
- **VTube Studio WebSocket triggers** — JSON map of events to expressions
- **Custom prompt rules** — trigger/response overrides for specific contexts
- **mem0** — persistent memory that evolves over time

The character config JSON (exported from the Character Manager) is the single source of truth consumed by all pipeline components.

---

## Project Priorities

1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma)
2. **LLM** — Ollama running with target models, Open WebUI connected
3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant
4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant
5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, LVGL face
6. **Character system** — system prompt wired up, mem0 integrated, voice cloned
7. **VTube Studio** — model loaded, WebSocket API bridge written as an OpenClaw skill
8. **ComfyUI** — image generation online, character-consistent model workflows
9. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
10.
**Polish** — Authelia, Tailscale hardening, mobile companion, iOS widgets

---

## Key Paths & Conventions

- All Docker compose files: `~/server/docker/`
- OpenClaw skills: `~/.openclaw/skills/`
- Character configs: `~/.openclaw/characters/`
- Whisper models: `~/models/whisper/`
- Ollama models: managed by Ollama at `~/.ollama/models/`
- ComfyUI models: `~/ComfyUI/models/`
- Voice reference audio: `~/voices/`
- Gitea repos root: `~/gitea/`

---

## Notes for Planning

- All services should survive a Mac Mini reboot (launchd or Docker restart policies)
- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on the Mac Mini
- The character JSON schema (from the Character Manager) should be treated as a versioned spec; pipeline components read from it and never hardcode personality values
- OpenClaw skills are the primary extension mechanism — new capabilities = new skills
- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only
- The VTube Studio API bridge should be a standalone OpenClaw skill with a clear event interface
- The mem0 memory store should be backed up as part of regular Gitea commits
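As a sketch of the "clear event interface" the VTube Studio bridge skill needs, the mapping from pipeline events to Live2D expressions can be a thin lookup over the character config so that personality values are never hardcoded. Everything below (function names, the `live2d.expressions` config keys, the default expression names) is a hypothetical illustration, not OpenClaw's or VTube Studio's actual API.

```python
import json
from pathlib import Path

# Fallback pipeline-event → Live2D expression names (hypothetical placeholders;
# the real names come from the character config exported by the Character Manager).
DEFAULT_EXPRESSIONS = {
    "idle": "Idle",
    "speaking": "Talk",
    "thinking": "Think",
    "happy": "Smile",
    "error": "Confused",
}


def load_expression_map(config_path: str) -> dict:
    """Read a character config JSON and return its Live2D expression map,
    falling back to the defaults for any pipeline event it doesn't define."""
    cfg = json.loads(Path(config_path).expanduser().read_text())
    mapping = cfg.get("live2d", {}).get("expressions", {})
    return {**DEFAULT_EXPRESSIONS, **mapping}


def expression_for(event: str, mapping: dict) -> str:
    """Resolve a pipeline event to the expression the bridge should trigger;
    unknown events degrade gracefully to the idle expression."""
    return mapping.get(event, mapping["idle"])
```

The point of the design is that the skill only ever sees event names ("speaking", "error", etc.); swapping characters means swapping the JSON file, never editing the skill.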