P6: homeai-esp32 — Room Satellite Hardware
Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)
Goal
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.
Hardware: ESP32-S3-BOX-3
| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
| Speaker | Built-in 1W |
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
| USB | USB-C (programming + power, native USB JTAG serial) |
Architecture Per Unit
ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis": triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi
Pin Map (ESP32-S3-BOX-3)
| Function | Pin(s) | Notes |
|---|---|---|
| I2S LRCLK | GPIO45 | strapping pin — warning ignored |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin — warning ignored |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin — mute toggle / factory reset |
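As a rough illustration of how the non-audio pins in this table might map to ESPHome components, the sketch below wires up the backlight and the top-left button. The IDs (`backlight_output`, `backlight`, `top_left_button`) and the entity name are hypothetical, and the mute-toggle and factory-reset actions are left out; `ignore_strapping_warning: true` is presumably what the "warning ignored" notes refer to.

```yaml
# Sketch only: IDs and the entity name are placeholders, not taken from the real config.
output:
  - platform: ledc              # PWM output driving the backlight
    pin: GPIO47
    id: backlight_output

light:
  - platform: monochromatic
    id: backlight
    name: Screen
    output: backlight_output
    restore_mode: ALWAYS_ON

binary_sensor:
  - platform: gpio
    id: top_left_button
    pin:
      number: GPIO0
      ignore_strapping_warning: true   # GPIO0 is a strapping pin
      mode: INPUT_PULLUP
      inverted: true
```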
ESPHome Configuration
Platform & Framework
esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"
psram:
  mode: octal
  speed: 80MHz
Audio Stack
Uses i2s_audio platform with external ADC/DAC codec chips:
- Microphone: ES7210 ADC via I2S, 16kHz 16-bit mono
- Speaker: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
- Media player: wraps speaker with volume control (min 50%, max 85%)
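A minimal sketch of what this audio stack could look like in YAML, using the pins from the pin map and the sample rates listed above. The component IDs (`i2s_audio_bus`, `es7210_adc`, `es8311_dac`, `box_mic`, `box_speaker`, `speaker_media_player`) are assumed names, and the `announcement_pipeline` settings are one plausible choice rather than the values in the actual config.

```yaml
# Sketch only: IDs are placeholders; pins and rates come from the tables above.
i2c:
  scl: GPIO18
  sda: GPIO8

i2s_audio:
  - id: i2s_audio_bus
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2

audio_adc:
  - platform: es7210
    id: es7210_adc
    bits_per_sample: 16bit
    sample_rate: 16000

audio_dac:
  - platform: es8311
    id: es8311_dac
    bits_per_sample: 16bit
    sample_rate: 48000

microphone:
  - platform: i2s_audio
    id: box_mic
    adc_type: external
    i2s_din_pin: GPIO16
    sample_rate: 16000
    bits_per_sample: 16bit

speaker:
  - platform: i2s_audio
    id: box_speaker
    dac_type: external
    i2s_dout_pin: GPIO15
    sample_rate: 48000
    bits_per_sample: 16bit
    channel: left
    audio_dac: es8311_dac

media_player:
  - platform: speaker
    id: speaker_media_player
    name: Media Player
    volume_min: 0.5             # 50% floor
    volume_max: 0.85            # 85% cap to avoid distortion
    announcement_pipeline:
      speaker: box_speaker
      format: FLAC              # assumed format
      sample_rate: 48000
      num_channels: 1
```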
Wake Word
On-device micro_wake_word component with hey_jarvis model. Can optionally be switched to Home Assistant streaming wake word via a selector entity.
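For orientation, an on-device wake word setup along these lines might look like the following. The `mww` ID is an assumption, and the selector entity that switches to Home Assistant streaming is not shown.

```yaml
# Sketch only: the mww ID is a placeholder.
micro_wake_word:
  id: mww
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    # Hand the detected phrase to the voice assistant and start the pipeline.
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
```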
Display
ili9xxx platform with model S3BOX. Uses update_interval: never — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware.
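The fragment below sketches how script-driven page switching could be set up for two of the states. The IDs (`spi_bus`, `s3_box_lcd`, `face_idle`, `face_listening`, `idle_page`, `listening_page`, `show_listening`), the `data_rate`, and the `RGB565` image type are assumptions; the pins come from the pin map and the image paths from the directory layout.

```yaml
# Sketch only: IDs, data_rate, and image type are assumptions.
spi:
  - id: spi_bus
    clk_pin: GPIO7
    mosi_pin: GPIO6

image:
  - file: illustrations/idle.png
    id: face_idle
    type: RGB565
    resize: 320x240
  - file: illustrations/listening.png
    id: face_listening
    type: RGB565
    resize: 320x240

display:
  - platform: ili9xxx
    id: s3_box_lcd
    model: S3BOX
    invert_colors: false
    data_rate: 40MHz
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin:
      number: GPIO48
      inverted: true
    update_interval: never      # redraw only when a script asks for it
    pages:
      - id: idle_page
        lambda: |-
          it.image(0, 0, id(face_idle));
      - id: listening_page
        lambda: |-
          it.image(0, 0, id(face_listening));

script:
  - id: show_listening
    then:
      - display.page.show: listening_page
      - component.update: s3_box_lcd   # force the redraw
```

Because `update_interval` is `never`, the explicit `component.update` after each page change is what actually pushes the new image to the panel.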
Voice Assistant
ESPHome's voice_assistant component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:
- Audio → Wyoming STT (Mac Mini) → text
- Text → OpenClaw conversation agent → response
- Response → Wyoming TTS (Mac Mini) → audio back to ESP32
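A hedged sketch of how the component might be tied to the display states follows. The referenced IDs (`box_mic`, `speaker_media_player`, `mww`, `s3_box_lcd`, and the page IDs) are assumed to be defined elsewhere in the config, and the mapping of triggers to states is an assumption rather than the actual implementation.

```yaml
# Sketch only: all IDs are placeholders assumed to exist elsewhere in the config.
voice_assistant:
  id: va
  microphone: box_mic
  media_player: speaker_media_player
  micro_wake_word: mww
  on_listening:                 # pipeline started, capturing audio
    - display.page.show: listening_page
    - component.update: s3_box_lcd
  on_stt_vad_end:               # speech ended, waiting on STT and the agent
    - display.page.show: thinking_page
    - component.update: s3_box_lcd
  on_tts_start:                 # response text ready, reply is coming
    - display.page.show: replying_page
    - component.update: s3_box_lcd
  on_end:                       # pipeline finished
    - display.page.show: idle_page
    - component.update: s3_box_lcd
  on_error:
    - display.page.show: error_page
    - component.update: s3_box_lcd
```

Note that nothing here points at the Wyoming servers: the device only knows about Home Assistant, and the STT, agent, and TTS endpoints are chosen in the HA pipeline settings.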
Directory Layout
homeai-esp32/
├── PLAN.md
├── setup.sh                      # env check + flash/ota/logs commands
└── esphome/
    ├── secrets.yaml              # gitignored: WiFi + API key
    ├── homeai-living-room.yaml   # first unit (full config)
    ├── homeai-bedroom.yaml       # future: copy + change substitutions
    ├── homeai-kitchen.yaml       # future: copy + change substitutions
    └── illustrations/            # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
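To illustrate the "copy + change substitutions" approach for the future room configs, the top of a copied file might look like this. The substitution keys and values are hypothetical; the real living room config may use different names.

```yaml
# homeai-bedroom.yaml (sketch; substitution names and values are hypothetical)
substitutions:
  name: homeai-bedroom
  friendly_name: HomeAI Bedroom

esphome:
  name: ${name}
  friendly_name: ${friendly_name}

# ...the rest of the file is copied unchanged from homeai-living-room.yaml
```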
ESPHome Environment
# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version # ESPHome 2026.2.4+
# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml # stream logs
# Or use the setup script
./setup.sh flash # compile + USB flash
./setup.sh ota # compile + OTA update
./setup.sh logs # stream device logs
./setup.sh validate # check YAML without compiling
Wake Word Options
| Option | Latency | Privacy | Effort |
|---|---|---|---|
| hey_jarvis (built-in micro_wake_word) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |
Current: hey_jarvis on-device. Train a custom word (character's name) once finalised.
Implementation Steps
- Install ESPHome in ~/homeai-esphome-env (Python 3.12)
- Write esphome/secrets.yaml (gitignored)
- Write homeai-living-room.yaml (based on official S3-BOX-3 reference config)
- Generate placeholder face illustrations (7 PNGs, 320×240)
- Write setup.sh with flash/ota/logs/validate commands
- Write deploy.sh with OTA deploy, image management, multi-unit support
- Flash first unit via USB (living room)
- Verify unit appears in HA device list
- Assign Wyoming voice pipeline to unit in HA
- Test: speak wake word → transcription → LLM response → spoken reply
- Test: display cycles through idle → listening → thinking → replying
- Verify OTA update works: change config, deploy wirelessly
- Write config templates for remaining rooms (bedroom, kitchen)
- Flash remaining units, verify each works independently
- Document final MAC address → room name mapping
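For the secrets and OTA steps above, a minimal sketch of how a unit config might pull credentials from secrets.yaml is shown below. The secret key names (`wifi_ssid`, `wifi_password`, `api_encryption_key`) are assumptions; only the fact that secrets.yaml holds WiFi credentials and an API key comes from this plan.

```yaml
# Sketch only: secret key names are placeholders.
# secrets.yaml (gitignored) would define:
#   wifi_ssid: "..."
#   wifi_password: "..."
#   api_encryption_key: "..."

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

api:
  encryption:
    key: !secret api_encryption_key   # used by the HA ESPHome integration

ota:
  - platform: esphome                 # enables firmware updates over WiFi
```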
Success Criteria
- Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- STT transcription accuracy >90% for clear speech in quiet room
- TTS audio plays clearly through ESP32 speaker
- Display shows correct state for idle / listening / thinking / replying / error / muted
- OTA firmware updates work without USB cable
- Unit reconnects automatically after WiFi drop
- Unit survives power cycle and resumes normal operation
Known Constraints
- Memory: voice_assistant + micro_wake_word + display together run close to the device's memory limit. Do NOT add Bluetooth or LVGL widgets; they will cause crashes.
- WiFi: 2.4GHz only. 5GHz networks are not supported.
- Speaker: 1W built-in. Volume capped at 85% to avoid distortion.
- Display: Static PNGs compiled into firmware. To change images, reflash via OTA (~1-2 min).
- First compile: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.