# P6: homeai-esp32 — Room Satellite Hardware

**Phase 4** | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)

## Goal
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini.
## Hardware: ESP32-S3-BOX-3
| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen |
| Mic | Dual microphone array |
| Speaker | Built-in 1W speaker |
| Connectivity | WiFi 802.11b/g/n, BT 5.0 |
| USB | USB-C (programming + power) |
## Architecture Per Unit

```
ESP32-S3-BOX-3
├── microWakeWord (on-device, always listening)
│   └── triggers Wyoming Satellite on wake detection
├── Wyoming Satellite
│   ├── streams mic audio → Mac Mini Wyoming STT (port 10300)
│   └── receives TTS audio ← Mac Mini Wyoming TTS (port 10301)
├── LVGL Display
│   └── animated face, driven by HA entity state
└── ESPHome OTA
    └── firmware updates over WiFi
```
## ESPHome Configuration

### Base Config Template

`esphome/base.yaml` — shared across all units:

```yaml
esphome:
  name: homeai-${room}
  friendly_name: "HomeAI ${room_display}"

# micro_wake_word and voice_assistant require the ESP-IDF framework
esp32:
  board: esp32s3box  # PlatformIO board id for the S3-BOX family
  framework:
    type: esp-idf

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: "HomeAI Fallback"

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome  # platform key required since ESPHome 2024.6
    password: !secret ota_password

logger:
  level: INFO
```
### Room-Specific Config

`esphome/s3-box-living-room.yaml`:

```yaml
substitutions:
  room: living-room
  room_display: "Living Room"
  mac_mini_ip: "192.168.1.x"  # or Tailscale IP

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
```
One file per room, only the substitutions change.
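As a concrete illustration, a hypothetical second room file would differ only in its `substitutions` block (the bedroom values below are placeholders, not finalized config):

```yaml
# esphome/s3-box-bedroom.yaml — same packages, different substitutions
substitutions:
  room: bedroom
  room_display: "Bedroom"
  mac_mini_ip: "192.168.1.x"  # same Mac Mini as the living-room unit

packages:
  base: !include base.yaml
  voice: !include voice.yaml
  display: !include display.yaml
```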
### Voice / Wyoming Satellite — esphome/voice.yaml

```yaml
microphone:
  - platform: esp_adf
    id: mic

speaker:
  - platform: esp_adf
    id: spk

micro_wake_word:
  model: hey_jarvis  # or custom model path
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: mic
  speaker: spk
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0
  on_listening:
    - lvgl.page.show: page_listening
    - script.execute: animate_face_listening
  on_stt_vad_end:
    - lvgl.page.show: page_thinking
    - script.execute: animate_face_thinking
  on_tts_start:
    - lvgl.page.show: page_speaking
    - script.execute: animate_face_speaking
  on_end:
    - lvgl.page.show: page_idle
    - script.execute: animate_face_idle
  on_error:
    - lvgl.page.show: page_error
    - script.execute: animate_face_error
```

(The pages here are LVGL pages, so the action is `lvgl.page.show`; `display.page.show` only applies to pages defined on the `display:` component.)
Note: ESPHome's voice_assistant component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.
### LVGL Display — esphome/display.yaml

```yaml
display:
  - platform: ili9xxx
    model: ILI9341
    id: lcd
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin: GPIO48

touchscreen:
  - platform: tt21100
    id: touch

lvgl:
  displays:
    - lcd
  touchscreens:
    - touch
  # Face widget — centered on screen
  widgets:
    - obj:
        id: face_container
        width: 320
        height: 240
        bg_color: 0x000000
        widgets:  # ESPHome LVGL nests children under `widgets:`
          # Eyes (two circles)
          - obj:
              id: eye_left
              x: 90
              y: 90
              width: 50
              height: 50
              radius: 25
              bg_color: 0xFFFFFF
          - obj:
              id: eye_right
              x: 180
              y: 90
              width: 50
              height: 50
              radius: 25
              bg_color: 0xFFFFFF
          # Mouth (line/arc)
          - arc:
              id: mouth
              x: 110
              y: 160
              width: 100
              height: 40
              start_angle: 180
              end_angle: 360
              arc_color: 0xFFFFFF
  pages:
    - id: page_idle
    - id: page_listening
    - id: page_thinking
    - id: page_speaking
    - id: page_error
```
### LVGL Face State Animations — esphome/animations.yaml

```yaml
script:
  - id: animate_face_idle
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 50  # normal open
      - lvgl.widget.update:
          id: eye_right
          height: 50
      - lvgl.widget.update:
          id: mouth
          arc_color: 0xFFFFFF

  - id: animate_face_listening
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 60  # wider eyes
      - lvgl.widget.update:
          id: eye_right
          height: 60
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00BFFF  # blue tint

  - id: animate_face_thinking
    then:
      - lvgl.widget.update:
          id: eye_left
          height: 20  # squinting
      - lvgl.widget.update:
          id: eye_right
          height: 20

  - id: animate_face_speaking
    then:
      - lvgl.widget.update:
          id: mouth
          arc_color: 0x00FF88  # green speaking indicator

  - id: animate_face_error
    then:
      - lvgl.widget.update:
          id: eye_left
          bg_color: 0xFF2200  # red eyes
      - lvgl.widget.update:
          id: eye_right
          bg_color: 0xFF2200
```

(ESPHome's LVGL action for changing widget properties is `lvgl.widget.update`.)
Note: True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. Phase 2: amplitude-driven mouth height using speaker volume feedback.
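A minimal phase-2 sketch of the amplitude-driven idea, assuming a `speaking` global flag toggled in the voice assistant's `on_tts_start`/`on_tts_end` triggers (the flag name, interval, and height range are placeholders, not decisions from this plan):

```yaml
globals:
  - id: speaking
    type: bool
    initial_value: "false"

interval:
  - interval: 150ms
    then:
      - if:
          condition:
            lambda: 'return id(speaking);'
          then:
            # Vary mouth height pseudo-randomly while TTS plays to fake
            # lip movement; real amplitude feedback can replace rand() later.
            - lvgl.widget.update:
                id: mouth
                height: !lambda 'return 20 + (rand() % 25);'
```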
## Secrets File

`esphome/secrets.yaml` (gitignored):

```yaml
wifi_ssid: "YourNetwork"
wifi_password: "YourPassword"
api_key: "<32-byte base64 key>"
ota_password: "YourOTAPassword"
```
## Flash & Deployment Workflow

```bash
# Install ESPHome
pip install esphome

# Compile + flash via USB (first time)
esphome run esphome/s3-box-living-room.yaml

# OTA update (subsequent)
esphome upload esphome/s3-box-living-room.yaml --device <device-ip>

# View logs
esphome logs esphome/s3-box-living-room.yaml
```
## Home Assistant Integration

After flashing:

- HA discovers the ESP32 automatically via mDNS
- Add the device in HA → Settings → Devices
- Assign the Wyoming voice assistant pipeline to the device
- Set up room-specific automations (e.g., "Living Room" light control from that satellite)
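As an illustration of the last step, a hedged sketch of one room-scoped automation — announcing through the satellite when the room light turns on. All entity ids below are assumptions; substitute the ids HA assigns after discovery.

```yaml
# automations.yaml (HA side) — hypothetical entity ids throughout
automation:
  - alias: "Living room satellite announcement"
    trigger:
      - platform: state
        entity_id: light.living_room   # assumed light entity
        to: "on"
    action:
      - service: tts.speak
        target:
          entity_id: tts.piper         # assumed Wyoming/Piper TTS entity
        data:
          media_player_entity_id: media_player.homeai_living_room
          message: "Lights on."
```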
## Directory Layout

```
homeai-esp32/
└── esphome/
    ├── base.yaml
    ├── voice.yaml
    ├── display.yaml
    ├── animations.yaml
    ├── s3-box-living-room.yaml
    ├── s3-box-bedroom.yaml     # template, fill in when hardware available
    ├── s3-box-kitchen.yaml     # template
    └── secrets.yaml            # gitignored
```
## Wake Word Decisions

| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in microWakeWord) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium |
**Recommendation:** Start with `hey_jarvis`. Train a custom wake word (the character's name) once the character name is finalised.
## Implementation Steps

- Install ESPHome: `pip install esphome`
- Write `esphome/secrets.yaml` (gitignored)
- Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml`
- Write `s3-box-living-room.yaml` for first unit
- Flash first unit via USB: `esphome run s3-box-living-room.yaml`
- Verify unit appears in HA device list
- Assign Wyoming voice pipeline to unit in HA
- Test: speak wake word → transcription → LLM response → spoken reply
- Test: LVGL face cycles through idle → listening → thinking → speaking
- Verify OTA update works: change LVGL color, deploy wirelessly
- Write config templates for remaining rooms (bedroom, kitchen)
- Flash remaining units, verify each works independently
- Document final MAC address → room name mapping
## Success Criteria
- Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- STT transcription accuracy >90% for clear speech in quiet room
- TTS audio plays clearly through ESP32 speaker
- LVGL face shows correct state for idle / listening / thinking / speaking / error
- OTA firmware updates work without USB cable
- Unit reconnects automatically after WiFi drop
- Unit survives power cycle and resumes normal operation