Initial project structure and planning docs

Full project plan across 8 sub-projects (homeai-infra, homeai-llm,
homeai-voice, homeai-agent, homeai-character, homeai-esp32,
homeai-visual, homeai-images). Includes per-project PLAN.md files,
top-level PROJECT_PLAN.md, and master TODO.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Aodhan Collins
2026-03-04 01:11:37 +00:00
commit 38247d7cc4
11 changed files with 3060 additions and 0 deletions

homeai-visual/PLAN.md Normal file

@@ -0,0 +1,322 @@
# P7: homeai-visual — VTube Studio Visual Layer
> Phase 5 | Depends on: P4 (OpenClaw skill runner), P5 (character expression map)
---
## Goal
VTube Studio displays a Live2D model on Mac Mini desktop and mobile. Expressions are driven by the AI pipeline state (thinking, speaking, happy, etc.) via an OpenClaw skill that talks to VTube Studio's WebSocket API. Lip sync follows audio amplitude.
---
## Architecture
```
OpenClaw pipeline state
        ↓  (during LLM response generation)
vtube_studio.py skill
        ↓  WebSocket (port 8001)
VTube Studio (macOS app)
        ↓
Live2D model renders expression

Displayed on:
- Mac Mini desktop (primary)
- iPhone/iPad (VTube Studio mobile, same model via Tailscale)
```
---
## VTube Studio Setup
### Installation
1. Download VTube Studio from the Mac App Store
2. Launch, go through initial setup
3. Enable WebSocket API: Settings → WebSocket API → Enable (port 8001)
4. Load Live2D model (see Model section below)
### WebSocket API Authentication
VTube Studio uses a token-based auth flow:
```python
import asyncio
import websockets
import json
async def authenticate():
    async with websockets.connect("ws://localhost:8001") as ws:
        # Step 1: request authentication token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth-req",
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        }))
        # User must click "Allow" in VTube Studio UI
        response = json.loads(await ws.recv())
        token = response["data"]["authenticationToken"]
        # Step 2: authenticate with token
        await ws.send(json.dumps({
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "auth",
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": token
            }
        }))
        auth_resp = json.loads(await ws.recv())
        print("Authenticated:", auth_resp["data"]["authenticated"])
        return token
```
Token is persisted to `~/.openclaw/vtube_token.json`.
---
## `vtube_studio.py` Skill
Full implementation (replaces the stub from P4).
File: `homeai-visual/skills/vtube_studio.py` (symlinked to `~/.openclaw/skills/`)
```python
"""
VTube Studio WebSocket skill for OpenClaw.
Drives Live2D model expressions based on AI pipeline state.
"""
import asyncio
import json
import websockets
from pathlib import Path
VTUBE_WS_URL = "ws://localhost:8001"
TOKEN_PATH = Path.home() / ".openclaw" / "vtube_token.json"
class VTubeStudioSkill:
    def __init__(self, character_config: dict):
        self.expression_map = character_config.get("live2d_expressions", {})
        self.ws_triggers = character_config.get("vtube_ws_triggers", {})
        self.token = self._load_token()
        self._ws = None

    def _load_token(self) -> str | None:
        if TOKEN_PATH.exists():
            return json.loads(TOKEN_PATH.read_text()).get("token")
        return None

    def _save_token(self, token: str):
        TOKEN_PATH.write_text(json.dumps({"token": token}))

    async def connect(self):
        self._ws = await websockets.connect(VTUBE_WS_URL)
        if self.token:
            await self._authenticate()
        else:
            await self._request_new_token()

    async def _authenticate(self):
        await self._send({
            "messageType": "AuthenticationRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "authenticationToken": self.token
            }
        })
        resp = await self._recv()
        if not resp["data"].get("authenticated"):
            # Token expired — request a new one
            await self._request_new_token()

    async def _request_new_token(self):
        await self._send({
            "messageType": "AuthenticationTokenRequest",
            "data": {
                "pluginName": "HomeAI",
                "pluginDeveloper": "HomeAI",
                "pluginIcon": None
            }
        })
        resp = await self._recv()
        token = resp["data"]["authenticationToken"]
        self._save_token(token)
        self.token = token
        await self._authenticate()

    async def trigger_expression(self, event: str):
        """Trigger a named expression state (idle, thinking, speaking, etc.)"""
        hotkey_id = self.expression_map.get(event)
        if not hotkey_id:
            return
        await self._trigger_hotkey(hotkey_id)

    async def _trigger_hotkey(self, hotkey_id: str):
        await self._send({
            "messageType": "HotkeyTriggerRequest",
            "data": {"hotkeyID": hotkey_id}
        })
        await self._recv()

    async def set_parameter(self, name: str, value: float):
        """Set a VTube Studio parameter (e.g., mouth open for lip sync)"""
        await self._send({
            "messageType": "InjectParameterDataRequest",
            "data": {
                "parameterValues": [
                    {"id": name, "value": value}
                ]
            }
        })
        await self._recv()

    async def _send(self, payload: dict):
        full = {
            "apiName": "VTubeStudioPublicAPI",
            "apiVersion": "1.0",
            "requestID": "homeai",
            **payload
        }
        await self._ws.send(json.dumps(full))

    async def _recv(self) -> dict:
        return json.loads(await self._ws.recv())

    async def close(self):
        if self._ws:
            await self._ws.close()


# OpenClaw skill entry point — synchronous wrapper
def trigger_expression(event: str, character_config: dict):
    skill = VTubeStudioSkill(character_config)
    asyncio.run(_run(skill, event))

async def _run(skill, event):
    await skill.connect()
    await skill.trigger_expression(event)
    await skill.close()
```
---
## Lip Sync
### Phase 1: Amplitude-Based (Simple)
During TTS audio playback, sample audio amplitude and map to mouth open parameter:
```python
import asyncio

import numpy as np
import sounddevice as sd

def stream_with_lipsync(audio_data: np.ndarray, sample_rate: int, vtube: VTubeStudioSkill):
    # One event loop for every websocket call: the connection is bound to the
    # loop it was created in, so a fresh asyncio.run() per chunk would fail
    # once the socket outlives its loop.
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(vtube.connect())
        chunk_size = 1024
        for i in range(0, len(audio_data), chunk_size):
            chunk = audio_data[i:i + chunk_size]
            amplitude = float(np.abs(chunk).mean()) / 32768.0  # normalise 16-bit PCM
            mouth_value = min(amplitude * 10, 1.0)             # scale to 0–1
            loop.run_until_complete(vtube.set_parameter("MouthOpen", mouth_value))
            sd.play(chunk, sample_rate, blocking=True)
        loop.run_until_complete(vtube.set_parameter("MouthOpen", 0.0))  # close mouth after
        loop.run_until_complete(vtube.close())
    finally:
        loop.close()
```
### Phase 2: Phoneme-Based (Future)
Parse TTS phoneme timing from Kokoro/Chatterbox output and drive expression per phoneme. More accurate but significantly more complex. Defer to after Phase 5.
---
## Live2D Model
### Options
| Option | Cost | Effort | Quality |
|---|---|---|---|
| Free models (VTube Studio sample packs) | Free | Low | Generic |
| Purchase from nizima.com or booth.pm | ¥3,000–¥30,000 | Low | High |
| Commission custom model | ¥50,000–¥200,000+ | Low (for you) | Unique |
**Recommendation:** Start with a purchased model from nizima.com or booth.pm that matches the character's aesthetic. Commission custom later once personality is locked in.
### Model Setup
1. Download `.vtube.model3.json` + associated assets
2. Place in `~/Documents/Live2DModels/` (VTube Studio default)
3. Load in VTube Studio: Model tab → Add Model
4. Map hotkeys: VTube Studio → Hotkeys → create one per expression state
5. Record hotkey IDs, update `aria.json` `live2d_expressions` mapping
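The resulting `aria.json` fragment might look like this (the values are placeholders; use the UUIDs VTube Studio actually assigns):

```json
{
  "live2d_expressions": {
    "idle": "REPLACE-WITH-IDLE-HOTKEY-UUID",
    "listening": "REPLACE-WITH-LISTENING-HOTKEY-UUID",
    "thinking": "REPLACE-WITH-THINKING-HOTKEY-UUID",
    "speaking": "REPLACE-WITH-SPEAKING-HOTKEY-UUID",
    "happy": "REPLACE-WITH-HAPPY-HOTKEY-UUID",
    "sad": "REPLACE-WITH-SAD-HOTKEY-UUID",
    "surprised": "REPLACE-WITH-SURPRISED-HOTKEY-UUID",
    "error": "REPLACE-WITH-ERROR-HOTKEY-UUID"
  }
}
```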
---
## Expression Hotkey Mapping Workflow
1. Launch VTube Studio, load model
2. Go to Hotkeys → add hotkeys for each state: idle, listening, thinking, speaking, happy, sad, surprised, error
3. VTube Studio assigns a UUID to each hotkey — copy these
4. Open Character Manager (P5), paste UUIDs into expression mapping UI
5. Export updated `aria.json`
6. Restart OpenClaw — new expression map loaded
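The "simple test script" called for in the implementation steps could be as small as this sketch (it assumes `vtube_studio.py` from this plan is importable and `aria.json` already holds real hotkey IDs):

```python
"""Cycle through all eight expression states against a live VTube Studio."""
import asyncio
import json
from pathlib import Path

STATES = ["idle", "listening", "thinking", "speaking",
          "happy", "sad", "surprised", "error"]

async def run_all(skill) -> None:
    await skill.connect()
    for state in STATES:
        print(f"triggering: {state}")
        await skill.trigger_expression(state)
        await asyncio.sleep(2)  # long enough to see the model change
    await skill.close()

if __name__ == "__main__":
    # Requires VTube Studio running with the WebSocket API enabled.
    from vtube_studio import VTubeStudioSkill  # skill module from this plan
    config = json.loads(Path("aria.json").read_text())
    asyncio.run(run_all(VTubeStudioSkill(config)))
```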
---
## Mobile Setup
1. Install VTube Studio on iPhone/iPad
2. On the same Tailscale network, VTube Studio mobile discovers the model running on the Mac Mini
3. Mirror mode: mobile shows same model as desktop
4. Useful as a bedside or kitchen display while the Mac Mini desktop remains the primary
---
## Directory Layout
```
homeai-visual/
└── skills/
    ├── vtube_studio.py   ← full implementation
    ├── lipsync.py        ← amplitude-based lip sync helper
    └── auth.py           ← token management utility
```
---
## Implementation Steps
- [ ] Install VTube Studio (Mac App Store)
- [ ] Enable WebSocket API on port 8001
- [ ] Source/purchase a Live2D model
- [ ] Load model in VTube Studio, verify it renders
- [ ] Create hotkeys in VTube Studio for all 8 expression states
- [ ] Write `vtube_studio.py` full implementation
- [ ] Run auth flow — click "Allow" in VTube Studio UI, save token
- [ ] Test `trigger_expression("thinking")` → model shows expression
- [ ] Test all 8 expressions via a simple test script
- [ ] Update `aria.json` with real VTube Studio hotkey IDs
- [ ] Write `lipsync.py` amplitude-based helper
- [ ] Integrate lip sync into TTS dispatch in OpenClaw
- [ ] Symlink `skills/` → `~/.openclaw/skills/`
- [ ] Test full pipeline: voice query → thinking expression → LLM → speaking expression with lip sync
- [ ] Set up VTube Studio on iPhone (optional, do last)
---
## Success Criteria
- [ ] All 8 expression states trigger correctly via `trigger_expression()`
- [ ] Lip sync is visibly responding to TTS audio (even if imperfect)
- [ ] VTube Studio token survives app restart (token file persists)
- [ ] Expression triggers are fast enough to feel responsive (<100ms from call to render)
- [ ] Model stays loaded and connected after Mac Mini sleep/wake