Initial project structure and planning docs

Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Aodhan Collins
2026-03-04 01:11:37 +00:00
commit 38247d7cc4
11 changed files with 3060 additions and 0 deletions

homeai-visual/PLAN.md Normal file

@@ -0,0 +1,322 @@
# P7: homeai-visual — VTube Studio Visual Layer
> Phase 5 | Depends on: P4 (OpenClaw skill runner), P5 (character expression map)
---
## Goal
VTube Studio displays a Live2D model on Mac Mini desktop and mobile. Expressions are driven by the AI pipeline state (thinking, speaking, happy, etc.) via an OpenClaw skill that talks to VTube Studio's WebSocket API. Lip sync follows audio amplitude.
---
## Architecture
```
OpenClaw pipeline state
        ↓  (during LLM response generation)
vtube_studio.py skill
        ↓  WebSocket (port 8001)
VTube Studio (macOS app)
        ↓
Live2D model renders expression

Displayed on:
- Mac Mini desktop (primary)
- iPhone/iPad (VTube Studio mobile, same model via Tailscale)
```
---
## VTube Studio Setup
### Installation
1. Download VTube Studio from the Mac App Store
2. Launch, go through initial setup
3. Enable WebSocket API: Settings → WebSocket API → Enable (port 8001)
4. Load Live2D model (see Model section below)
### WebSocket API Authentication
VTube Studio uses a token-based auth flow:
```python
import asyncio
import websockets
import json
async def authenticate():
    async with websockets.connect("ws://localhost:8001") as ws:
        # Step 1: request authentication token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth-req",
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        }))
        # User must click "Allow" in VTube Studio UI
        response = json.loads(await ws.recv())
        token = response["data"]["authenticationToken"]
        # Step 2: authenticate with token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth",
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": token
            }
        }))
        auth_resp = json.loads(await ws.recv())
        print("Authenticated:", auth_resp["data"]["authenticated"])
        return token
```
Token is persisted to `~/.openclaw/vtube_token.json`.
---
## `vtube_studio.py` Skill
Full implementation (replaces the stub from P4).
File: `homeai-visual/skills/vtube_studio.py` (symlinked to `~/.openclaw/skills/`)
```python
"""
VTube Studio WebSocket skill for OpenClaw.
Drives Live2D model expressions based on AI pipeline state.
"""
import asyncio
import json
import websockets
from pathlib import Path
VTUBE_WS_URL = "ws://localhost:8001"
TOKEN_PATH = Path.home() / ".openclaw" / "vtube_token.json"
class VTubeStudioSkill:
    def __init__(self, character_config: dict):
        self.expression_map = character_config.get("live2d_expressions", {})
        self.ws_triggers = character_config.get("vtube_ws_triggers", {})
        self.token = self._load_token()
        self._ws = None

    def _load_token(self) -> str | None:
        if TOKEN_PATH.exists():
            return json.loads(TOKEN_PATH.read_text()).get("token")
        return None

    def _save_token(self, token: str):
        TOKEN_PATH.write_text(json.dumps({"token": token}))

    async def connect(self):
        self._ws = await websockets.connect(VTUBE_WS_URL)
        if self.token:
            await self._authenticate()
        else:
            await self._request_new_token()

    async def _authenticate(self):
        await self._send({
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": self.token
            }
        })
        resp = await self._recv()
        if not resp["data"].get("authenticated"):
            # Token expired — request a new one
            await self._request_new_token()

    async def _request_new_token(self):
        await self._send({
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        })
        resp = await self._recv()
        token = resp["data"]["authenticationToken"]
        self._save_token(token)
        self.token = token
        await self._authenticate()

    async def trigger_expression(self, event: str):
        """Trigger a named expression state (idle, thinking, speaking, etc.)"""
        hotkey_id = self.expression_map.get(event)
        if not hotkey_id:
            return
        await self._trigger_hotkey(hotkey_id)

    async def _trigger_hotkey(self, hotkey_id: str):
        await self._send({
            "messageType": "HotkeyTriggerRequest",
            "data": {"hotkeyID": hotkey_id}
        })
        await self._recv()

    async def set_parameter(self, name: str, value: float):
        """Set a VTube Studio parameter (e.g., mouth open for lip sync)"""
        await self._send({
            "messageType": "InjectParameterDataRequest",
            "data": {
                "parameterValues": [
                    {"id": name, "value": value}
                ]
            }
        })
        await self._recv()

    async def _send(self, payload: dict):
        full = {
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "homeai",
            **payload
        }
        await self._ws.send(json.dumps(full))

    async def _recv(self) -> dict:
        return json.loads(await self._ws.recv())

    async def close(self):
        if self._ws:
            await self._ws.close()


# OpenClaw skill entry point — synchronous wrapper
def trigger_expression(event: str, character_config: dict):
    skill = VTubeStudioSkill(character_config)
    asyncio.run(_run(skill, event))

async def _run(skill, event):
    await skill.connect()
    await skill.trigger_expression(event)
    await skill.close()
```
---
## Lip Sync
### Phase 1: Amplitude-Based (Simple)
During TTS audio playback, sample audio amplitude and map to mouth open parameter:
```python
import asyncio

import numpy as np
import sounddevice as sd

def stream_with_lipsync(audio_data: np.ndarray, sample_rate: int, vtube: VTubeStudioSkill):
    # One event loop for every websocket call: the connection is bound to the
    # loop it was created in, so a fresh asyncio.run() per chunk would fail
    # once the socket outlives its loop.
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(vtube.connect())
        chunk_size = 1024
        for i in range(0, len(audio_data), chunk_size):
            chunk = audio_data[i:i + chunk_size]
            amplitude = float(np.abs(chunk).mean()) / 32768.0  # normalise 16-bit PCM
            mouth_value = min(amplitude * 10, 1.0)             # scale to 0–1
            loop.run_until_complete(vtube.set_parameter("MouthOpen", mouth_value))
            sd.play(chunk, sample_rate, blocking=True)
        loop.run_until_complete(vtube.set_parameter("MouthOpen", 0.0))  # close mouth after
        loop.run_until_complete(vtube.close())
    finally:
        loop.close()
```
### Phase 2: Phoneme-Based (Future)
Parse TTS phoneme timing from Kokoro/Chatterbox output and drive expression per phoneme. More accurate but significantly more complex. Defer to after Phase 5.
---
## Live2D Model
### Options
| Option | Cost | Effort | Quality |
|---|---|---|---|
| Free models (VTube Studio sample packs) | Free | Low | Generic |
| Purchase from nizima.com or booth.pm | ¥3,000–¥30,000 | Low | High |
| Commission custom model | ¥50,000–¥200,000+ | Low (for you) | Unique |
**Recommendation:** Start with a purchased model from nizima.com or booth.pm that matches the character's aesthetic. Commission custom later once personality is locked in.
### Model Setup
1. Download `.vtube.model3.json` + associated assets
2. Place in `~/Documents/Live2DModels/` (VTube Studio default)
3. Load in VTube Studio: Model tab → Add Model
4. Map hotkeys: VTube Studio → Hotkeys → create one per expression state
5. Record hotkey IDs, update `aria.json` `live2d_expressions` mapping
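The resulting `aria.json` fragment might look like this (the values are placeholders; use the UUIDs VTube Studio actually assigns):

```json
{
  "live2d_expressions": {
    "idle": "REPLACE-WITH-IDLE-HOTKEY-UUID",
    "listening": "REPLACE-WITH-LISTENING-HOTKEY-UUID",
    "thinking": "REPLACE-WITH-THINKING-HOTKEY-UUID",
    "speaking": "REPLACE-WITH-SPEAKING-HOTKEY-UUID",
    "happy": "REPLACE-WITH-HAPPY-HOTKEY-UUID",
    "sad": "REPLACE-WITH-SAD-HOTKEY-UUID",
    "surprised": "REPLACE-WITH-SURPRISED-HOTKEY-UUID",
    "error": "REPLACE-WITH-ERROR-HOTKEY-UUID"
  }
}
```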
---
## Expression Hotkey Mapping Workflow
1. Launch VTube Studio, load model
2. Go to Hotkeys → add hotkeys for each state: idle, listening, thinking, speaking, happy, sad, surprised, error
3. VTube Studio assigns a UUID to each hotkey — copy these
4. Open Character Manager (P5), paste UUIDs into expression mapping UI
5. Export updated `aria.json`
6. Restart OpenClaw — new expression map loaded
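The "simple test script" called for in the implementation steps could be as small as this sketch (it assumes `vtube_studio.py` from this plan is importable and `aria.json` already holds real hotkey IDs):

```python
"""Cycle through all eight expression states against a live VTube Studio."""
import asyncio
import json
from pathlib import Path

STATES = ["idle", "listening", "thinking", "speaking",
          "happy", "sad", "surprised", "error"]

async def run_all(skill) -> None:
    await skill.connect()
    for state in STATES:
        print(f"triggering: {state}")
        await skill.trigger_expression(state)
        await asyncio.sleep(2)  # long enough to see the model change
    await skill.close()

if __name__ == "__main__":
    # Requires VTube Studio running with the WebSocket API enabled.
    from vtube_studio import VTubeStudioSkill  # skill module from this plan
    config = json.loads(Path("aria.json").read_text())
    asyncio.run(run_all(VTubeStudioSkill(config)))
```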
---
## Mobile Setup
1. Install VTube Studio on iPhone/iPad
2. On the same Tailscale network, VTube Studio mobile discovers the model running on the Mac Mini
3. Mirror mode: mobile shows same model as desktop
4. Useful as a bedside or kitchen display while the Mac Mini desktop remains the primary
---
## Directory Layout
```
homeai-visual/
└── skills/
    ├── vtube_studio.py   ← full implementation
    ├── lipsync.py        ← amplitude-based lip sync helper
    └── auth.py           ← token management utility
```
---
## Implementation Steps
- [ ] Install VTube Studio (Mac App Store)
- [ ] Enable WebSocket API on port 8001
- [ ] Source/purchase a Live2D model
- [ ] Load model in VTube Studio, verify it renders
- [ ] Create hotkeys in VTube Studio for all 8 expression states
- [ ] Write `vtube_studio.py` full implementation
- [ ] Run auth flow — click "Allow" in VTube Studio UI, save token
- [ ] Test `trigger_expression("thinking")` → model shows expression
- [ ] Test all 8 expressions via a simple test script
- [ ] Update `aria.json` with real VTube Studio hotkey IDs
- [ ] Write `lipsync.py` amplitude-based helper
- [ ] Integrate lip sync into TTS dispatch in OpenClaw
- [ ] Symlink `skills/` → `~/.openclaw/skills/`
- [ ] Test full pipeline: voice query → thinking expression → LLM → speaking expression with lip sync
- [ ] Set up VTube Studio on iPhone (optional, do last)
---
## Success Criteria
- [ ] All 8 expression states trigger correctly via `trigger_expression()`
- [ ] Lip sync is visibly responding to TTS audio (even if imperfect)
- [ ] VTube Studio token survives app restart (token file persists)
- [ ] Expression triggers are fast enough to feel responsive (<100ms from call to render)
- [ ] Model stays loaded and connected after Mac Mini sleep/wake