Initial project structure and planning docs
Full project plan across 8 sub-projects (homeai-infra, homeai-llm, homeai-voice, homeai-agent, homeai-character, homeai-esp32, homeai-visual, homeai-images). Includes per-project PLAN.md files, top-level PROJECT_PLAN.md, and master TODO.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# CLAUDE.md — Home AI Assistant Project

## Project Overview

A self-hosted, always-on personal AI assistant running on a **Mac Mini M4 Pro (64GB RAM, 1TB SSD)**. The goal is a modular, expandable system that replaces commercial smart home speakers (Google Home, etc.) with a locally-run AI that has a defined personality, voice, visual representation, and full smart home integration.

---
## Hardware

| Component | Spec |
|---|---|
| Chip | Apple M4 Pro |
| CPU | 14-core |
| GPU | 20-core |
| Neural Engine | 16-core |
| RAM | 64GB unified memory |
| Storage | 1TB SSD |
| Network | Gigabit Ethernet |

All AI inference runs locally on this machine. No cloud dependency required (cloud APIs optional).

---
## Core Stack

### AI & LLM
- **Ollama** — local LLM runtime (target models: Llama 3.3 70B, Qwen 2.5 72B)
- **Open WebUI** — browser-based chat interface, runs as Docker container
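The character's persona rides along as a system message on every request. A minimal sketch against Ollama's local REST API (`/api/chat` on the default port 11434); the model tag and persona string here are placeholders:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(persona: str, user_text: str,
                       model: str = "llama3.3:70b") -> dict:
    """Build a request body for Ollama's /api/chat endpoint.

    The character persona is injected as the system message so every
    reply stays in voice.
    """
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": persona},
            {"role": "user", "content": user_text},
        ],
    }


def chat(persona: str, user_text: str) -> str:
    """Send the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(persona, user_text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

OpenClaw will likely own this call in practice; the point is that the system prompt comes from the character config, never from code.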
### Image Generation
- **ComfyUI** — primary image generation UI, node-based workflows
- Target models: SDXL, Flux.1, ControlNet
- Runs via Metal (Apple GPU API)
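ComfyUI also exposes a local HTTP API, so the pipeline can queue image jobs without the browser UI. A sketch, assuming a workflow exported in ComfyUI's "API format" and a known `CLIPTextEncode` node id (both are placeholders from your own export):

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"  # ComfyUI's default local endpoint


def patch_prompt(workflow: dict, node_id: str, text: str) -> dict:
    """Return a copy of an API-format workflow with one CLIPTextEncode
    node's text replaced. Node ids come from your exported workflow."""
    patched = json.loads(json.dumps(workflow))  # deep copy via round-trip
    patched[node_id]["inputs"]["text"] = text
    return patched


def queue_image(workflow: dict) -> None:
    """Queue the workflow on the local ComfyUI server."""
    body = json.dumps({"prompt": workflow}).encode()
    req = urllib.request.Request(
        COMFY_URL, data=body,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```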
### Speech
- **Whisper.cpp** — speech-to-text, optimised for Apple Silicon/Neural Engine
- **Kokoro TTS** — fast, lightweight text-to-speech (primary, low-latency)
- **Chatterbox TTS** — voice cloning engine (Apple Silicon MPS optimised)
- **Qwen3-TTS** — alternative voice cloning via MLX
- **openWakeWord** — always-on wake word detection
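Whisper.cpp is a CLI tool, so the simplest integration is a subprocess wrapper. A sketch, assuming the `whisper-cli` binary name (it varies by build) and a model under the `~/models/whisper/` path used in this project:

```python
import subprocess
from pathlib import Path

# Binary name and build path vary by whisper.cpp version; adjust to taste.
WHISPER_BIN = Path.home() / "whisper.cpp" / "build" / "bin" / "whisper-cli"
MODEL = Path.home() / "models" / "whisper" / "ggml-base.en.bin"


def whisper_cmd(wav_path: str, model: str = str(MODEL)) -> list[str]:
    """Assemble the whisper.cpp command line: -m model, -f input wav,
    -nt drops timestamps so stdout is just the transcript text."""
    return [str(WHISPER_BIN), "-m", model, "-f", wav_path, "-nt"]


def transcribe(wav_path: str) -> str:
    """Run whisper.cpp on a 16kHz mono wav and return the transcript."""
    out = subprocess.run(whisper_cmd(wav_path), capture_output=True,
                         text=True, check=True)
    return out.stdout.strip()
```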
### Smart Home
- **Home Assistant** — smart home control platform (Docker)
- **Wyoming Protocol** — bridges Whisper STT + Kokoro/Piper TTS into Home Assistant
- **Music Assistant** — self-hosted music control, integrates with Home Assistant
- **Snapcast** — multi-room synchronised audio output
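Pipeline components can drive Home Assistant through its REST API using a long-lived access token (created under a user profile in HA). A sketch of a service call; the entity id is a placeholder:

```python
import json
import urllib.request

HA_URL = "http://localhost:8123"        # Home Assistant in Docker on the Mac Mini
HA_TOKEN = "<long-lived-access-token>"  # created in your HA user profile


def service_request(domain: str, service: str, data: dict) -> urllib.request.Request:
    """Build a Home Assistant REST API service call, e.g. light.turn_on."""
    return urllib.request.Request(
        f"{HA_URL}/api/services/{domain}/{service}",
        data=json.dumps(data).encode(),
        headers={"Authorization": f"Bearer {HA_TOKEN}",
                 "Content-Type": "application/json"},
    )


# Example: turn on the living-room lights (entity id is an assumption)
# urllib.request.urlopen(service_request("light", "turn_on",
#                                        {"entity_id": "light.living_room"}))
```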
### AI Agent / Orchestration
- **OpenClaw** — primary AI agent layer; receives voice commands, calls tools, manages personality
- **n8n** — visual workflow automation (Docker), chains AI actions
- **mem0** — long-term memory layer for the AI character
### Character & Personality
- **Character Manager** (built — see `character-manager.jsx`) — single config UI for personality, prompts, models, Live2D mappings, and notes
- Character config exports to JSON, consumed by OpenClaw system prompt and pipeline
### Visual Representation
- **VTube Studio** — Live2D model display on desktop (macOS) and mobile (iOS/Android)
- VTube Studio WebSocket API used to drive expressions from the AI pipeline
- **LVGL** — simplified animated face on ESP32-S3-BOX-3 units
- Live2D model: to be sourced/commissioned (nizima.com or booth.pm)
### Room Presence (Smart Speaker Replacement)
- **ESP32-S3-BOX-3** units — one per room
- Flashed with **ESPHome**
- Acts as Wyoming Satellite (mic input → Mac Mini → TTS audio back)
- LVGL display shows animated face + status info
- Communicates over local WiFi
### Infrastructure
- **Docker Desktop for Mac** — containerises Home Assistant, Open WebUI, n8n, etc.
- **Tailscale** — secure remote access to all services, no port forwarding
- **Authelia** — 2FA authentication layer for exposed web UIs
- **Portainer** — Docker container management UI
- **Uptime Kuma** — service health monitoring and mobile alerts
- **Gitea** — self-hosted Git server for all project code and configs
- **code-server** — browser-based VS Code for remote development

---
## Voice Pipeline (End-to-End)

```
ESP32-S3-BOX-3 (room)
→ Wake word detected (openWakeWord, runs locally on device or Mac Mini)
→ Audio streamed to Mac Mini via Wyoming Satellite
→ Whisper.cpp transcribes speech to text
→ OpenClaw receives text + context
→ Ollama LLM generates response (with character persona from system prompt)
→ mem0 updates long-term memory
→ Response dispatched:
    → Kokoro/Chatterbox renders TTS audio
    → Audio sent back to ESP32-S3-BOX-3 (spoken response)
    → VTube Studio API triggered (expression + lip sync on desktop/mobile)
    → Home Assistant action called if applicable (lights, music, etc.)
```
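The flow above reduces to a chain of swappable stages. A minimal orchestration sketch; the stage wrappers are assumptions, passed in as callables so any component (STT engine, LLM, TTS voice) can be replaced without touching the pipeline:

```python
from typing import Callable


def handle_utterance(
    wav: bytes,
    transcribe: Callable[[bytes], str],   # e.g. a Whisper.cpp wrapper
    respond: Callable[[str], str],        # e.g. OpenClaw -> Ollama
    speak: Callable[[str], bytes],        # e.g. Kokoro/Chatterbox TTS
    side_effects: tuple = (),             # VTS expression, HA action, mem0 write
) -> bytes:
    """One pass through the voice pipeline: audio in, reply audio out."""
    text = transcribe(wav)
    reply = respond(text)
    for effect in side_effects:
        effect(reply)  # fan out: expressions, smart-home calls, memory updates
    return speak(reply)
```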
---
|
||||
|
||||
## Character System
|
||||
|
||||
The AI assistant has a defined personality managed via the Character Manager tool.
|
||||
|
||||
Key config surfaces:
|
||||
- **System prompt** — injected into every Ollama request
|
||||
- **Voice clone reference** — `.wav` file path for Chatterbox/Qwen3-TTS
|
||||
- **Live2D expression mappings** — idle, speaking, thinking, happy, error states
|
||||
- **VTube Studio WebSocket triggers** — JSON map of events to expressions
|
||||
- **Custom prompt rules** — trigger/response overrides for specific contexts
|
||||
- **mem0** — persistent memory that evolves over time
|
||||
|
||||
Character config JSON (exported from Character Manager) is the single source of truth consumed by all pipeline components.
|
||||
|
||||
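The real schema is whatever Character Manager exports; the following is only an illustrative shape for that JSON, written as a dataclass so pipeline components can load it in a typed way. All field names here are assumptions:

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CharacterConfig:
    """Illustrative shape for the exported character JSON; treat the
    actual Character Manager export as the versioned spec."""
    name: str
    system_prompt: str            # injected into every Ollama request
    voice_reference: str          # .wav path under ~/voices/
    expressions: dict = field(default_factory=dict)   # state -> VTS hotkey id
    prompt_rules: list = field(default_factory=list)  # trigger/response overrides


def load_character(path: str) -> CharacterConfig:
    """Load a character config exported by Character Manager."""
    with open(path) as f:
        return CharacterConfig(**json.load(f))
```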
---

## Project Priorities

1. **Foundation** — Docker stack up (Home Assistant, Open WebUI, Portainer, Uptime Kuma)
2. **LLM** — Ollama running with target models, Open WebUI connected
3. **Voice pipeline** — Whisper → Ollama → Kokoro → Wyoming → Home Assistant
4. **OpenClaw** — installed, onboarded, connected to Ollama and Home Assistant
5. **ESP32-S3-BOX-3** — ESPHome flash, Wyoming Satellite, LVGL face
6. **Character system** — system prompt wired up, mem0 integrated, voice cloned
7. **VTube Studio** — model loaded, WebSocket API bridge written as OpenClaw skill
8. **ComfyUI** — image generation online, character-consistent model workflows
9. **Extended integrations** — n8n workflows, Music Assistant, Snapcast, Gitea, code-server
10. **Polish** — Authelia, Tailscale hardening, mobile companion, iOS widgets
---

## Key Paths & Conventions

- All Docker compose files: `~/server/docker/`
- OpenClaw skills: `~/.openclaw/skills/`
- Character configs: `~/.openclaw/characters/`
- Whisper models: `~/models/whisper/`
- Ollama models: managed by Ollama at `~/.ollama/models/`
- ComfyUI models: `~/ComfyUI/models/`
- Voice reference audio: `~/voices/`
- Gitea repos root: `~/gitea/`
---

## Notes for Planning

- All services should survive a Mac Mini reboot (launchd or Docker restart policies)
- ESP32-S3-BOX-3 units are dumb satellites — all intelligence stays on Mac Mini
- The character JSON schema (from Character Manager) should be treated as a versioned spec; pipeline components read from it, never hardcode personality values
- OpenClaw skills are the primary extension mechanism — new capabilities = new skills
- Prefer local models; cloud API keys (Anthropic, OpenAI) are fallback only
- VTube Studio API bridge should be a standalone OpenClaw skill with clear event interface
- mem0 memory store should be backed up as part of regular Gitea commits