feat: ESP32-S3-BOX-3 room satellite — ESPHome config, OTA deploy, placeholder faces
Living room unit fully working: on-device wake word (hey_jarvis), voice pipeline via HA (Wyoming STT → OpenClaw → Wyoming TTS), static PNG display states, OTA updates. Includes deploy.sh for quick OTA with custom image support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -6,7 +6,7 @@
|
||||
|
||||
## Goal
|
||||
|
||||
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, local wake word detection, audio playback, and an LVGL animated face showing assistant state. All intelligence stays on the Mac Mini.
|
||||
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.
|
||||
|
||||
---
|
||||
|
||||
@@ -17,11 +17,12 @@ Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite
|
||||
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
|
||||
| RAM | 512KB SRAM + 16MB PSRAM |
|
||||
| Flash | 16MB |
|
||||
| Display | 2.4" IPS LCD, 320×240, touchscreen |
|
||||
| Mic | Dual microphone array |
|
||||
| Speaker | Built-in 1W speaker |
|
||||
| Connectivity | WiFi 802.11b/g/n, BT 5.0 |
|
||||
| USB | USB-C (programming + power) |
|
||||
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
|
||||
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
|
||||
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
|
||||
| Speaker | Built-in 1W |
|
||||
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
|
||||
| USB | USB-C (programming + power, native USB JTAG serial) |
|
||||
|
||||
---
|
||||
|
||||
@@ -29,273 +30,86 @@ Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite
|
||||
|
||||
```
|
||||
ESP32-S3-BOX-3
|
||||
├── microWakeWord (on-device, always listening)
|
||||
│ └── triggers Wyoming Satellite on wake detection
|
||||
├── Wyoming Satellite
|
||||
│ ├── streams mic audio → Mac Mini Wyoming STT (port 10300)
|
||||
│ └── receives TTS audio ← Mac Mini Wyoming TTS (port 10301)
|
||||
├── LVGL Display
|
||||
│ └── animated face, driven by HA entity state
|
||||
├── micro_wake_word (on-device, always listening)
|
||||
│ └── "hey_jarvis" — triggers voice_assistant on wake detection
|
||||
├── voice_assistant (ESPHome component)
|
||||
│ ├── connects to Home Assistant via ESPHome API
|
||||
│ ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
|
||||
│ ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
|
||||
│ └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
|
||||
├── Display (ili9xxx, model S3BOX, 320×240)
|
||||
│ └── static PNG faces per state (idle, listening, thinking, replying, error)
|
||||
└── ESPHome OTA
|
||||
└── firmware updates over WiFi
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Pin Map (ESP32-S3-BOX-3)
|
||||
|
||||
| Function | Pin(s) | Notes |
|---|---|---|
| I2S LRCLK | GPIO45 | strapping pin (warning ignored) |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin (warning ignored) |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin; mute toggle / factory reset |
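
As a rough illustration of how the pin map translates into ESPHome bus definitions, the fragment below is a minimal sketch and not the committed config. The pin numbers come from the table above; the `id` values (`i2s_bus`, `codec_bus`) are hypothetical names chosen for this example.

```yaml
# Sketch only: bus definitions derived from the pin map above.
# The ids (i2s_bus, codec_bus) are illustrative, not taken from the real config.
i2s_audio:
  - id: i2s_bus
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true   # GPIO45 is a strapping pin
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2

i2c:
  - id: codec_bus                      # control bus for the ES7210 / ES8311 codecs
    sda: GPIO8
    scl: GPIO18

spi:
  clk_pin: GPIO7                       # display clock
  mosi_pin: GPIO6                      # display data
```

The microphone, speaker, and display components would then reference these buses; those parts are omitted here to keep the sketch short.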
|
||||
|
||||
---
|
||||
|
||||
## ESPHome Configuration
|
||||
|
||||
### Base Config Template
|
||||
|
||||
`esphome/base.yaml` — shared across all units:
|
||||
### Platform & Framework
|
||||
|
||||
```yaml
|
||||
esphome:
|
||||
name: homeai-${room}
|
||||
friendly_name: "HomeAI ${room_display}"
|
||||
platform: esp32
|
||||
board: esp32-s3-box-3
|
||||
esp32:
|
||||
board: esp32s3box
|
||||
flash_size: 16MB
|
||||
cpu_frequency: 240MHz
|
||||
framework:
|
||||
type: esp-idf
|
||||
sdkconfig_options:
|
||||
CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
|
||||
CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
|
||||
CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"
|
||||
|
||||
wifi:
|
||||
ssid: !secret wifi_ssid
|
||||
password: !secret wifi_password
|
||||
ap:
|
||||
ssid: "HomeAI Fallback"
|
||||
|
||||
api:
|
||||
encryption:
|
||||
key: !secret api_key
|
||||
|
||||
ota:
|
||||
password: !secret ota_password
|
||||
|
||||
logger:
|
||||
level: INFO
|
||||
psram:
|
||||
mode: octal
|
||||
speed: 80MHz
|
||||
```
|
||||
|
||||
### Room-Specific Config
|
||||
### Audio Stack
|
||||
|
||||
`esphome/s3-box-living-room.yaml`:
|
||||
Uses `i2s_audio` platform with external ADC/DAC codec chips:
|
||||
|
||||
```yaml
|
||||
substitutions:
|
||||
room: living-room
|
||||
room_display: "Living Room"
|
||||
mac_mini_ip: "192.168.1.x" # or Tailscale IP
|
||||
- **Microphone**: ES7210 ADC via I2S, 16kHz 16-bit mono
|
||||
- **Speaker**: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
|
||||
- **Media player**: wraps speaker with volume control (min 50%, max 85%)
|
||||
|
||||
packages:
|
||||
base: !include base.yaml
|
||||
voice: !include voice.yaml
|
||||
display: !include display.yaml
|
||||
```
|
||||
### Wake Word
|
||||
|
||||
One file per room, only the substitutions change.
|
||||
On-device `micro_wake_word` component with `hey_jarvis` model. Can optionally be switched to Home Assistant streaming wake word via a selector entity.
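
A minimal sketch of the on-device wake word setup described above, assuming the `models` list form of `micro_wake_word` used in recent ESPHome releases. The `id` value `mww` is a placeholder for this example, and a `voice_assistant` component is assumed to be defined elsewhere in the config. The optional switch to Home Assistant streaming wake word is not shown.

```yaml
# Sketch only: the id is a placeholder for this example.
micro_wake_word:
  id: mww
  models:
    - model: hey_jarvis        # built-in model, evaluated on the ESP32-S3 itself
  on_wake_word_detected:
    # Hand off to the voice_assistant component (assumed to exist elsewhere
    # in the config), which opens the pipeline through Home Assistant.
    - voice_assistant.start:
```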
|
||||
|
||||
### Voice / Wyoming Satellite — `esphome/voice.yaml`
|
||||
### Display
|
||||
|
||||
```yaml
|
||||
microphone:
|
||||
- platform: esp_adf
|
||||
id: mic
|
||||
`ili9xxx` platform with model `S3BOX`. Uses `update_interval: never` — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware.
|
||||
|
||||
speaker:
|
||||
- platform: esp_adf
|
||||
id: spk
|
||||
### Voice Assistant
|
||||
|
||||
micro_wake_word:
|
||||
model: hey_jarvis # or custom model path
|
||||
on_wake_word_detected:
|
||||
- voice_assistant.start:
|
||||
|
||||
voice_assistant:
|
||||
microphone: mic
|
||||
speaker: spk
|
||||
noise_suppression_level: 2
|
||||
auto_gain: 31dBFS
|
||||
volume_multiplier: 2.0
|
||||
|
||||
on_listening:
|
||||
- display.page.show: page_listening
|
||||
- script.execute: animate_face_listening
|
||||
|
||||
on_stt_vad_end:
|
||||
- display.page.show: page_thinking
|
||||
- script.execute: animate_face_thinking
|
||||
|
||||
on_tts_start:
|
||||
- display.page.show: page_speaking
|
||||
- script.execute: animate_face_speaking
|
||||
|
||||
on_end:
|
||||
- display.page.show: page_idle
|
||||
- script.execute: animate_face_idle
|
||||
|
||||
on_error:
|
||||
- display.page.show: page_error
|
||||
- script.execute: animate_face_error
|
||||
```
|
||||
|
||||
**Note:** ESPHome's `voice_assistant` component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.
|
||||
|
||||
### LVGL Display — `esphome/display.yaml`
|
||||
|
||||
```yaml
|
||||
display:
|
||||
- platform: ili9xxx
|
||||
model: ILI9341
|
||||
id: lcd
|
||||
cs_pin: GPIO5
|
||||
dc_pin: GPIO4
|
||||
reset_pin: GPIO48
|
||||
|
||||
touchscreen:
|
||||
- platform: tt21100
|
||||
id: touch
|
||||
|
||||
lvgl:
|
||||
displays:
|
||||
- lcd
|
||||
touchscreens:
|
||||
- touch
|
||||
|
||||
# Face widget — centered on screen
|
||||
widgets:
|
||||
- obj:
|
||||
id: face_container
|
||||
width: 320
|
||||
height: 240
|
||||
bg_color: 0x000000
|
||||
children:
|
||||
# Eyes (two circles)
|
||||
- obj:
|
||||
id: eye_left
|
||||
x: 90
|
||||
y: 90
|
||||
width: 50
|
||||
height: 50
|
||||
radius: 25
|
||||
bg_color: 0xFFFFFF
|
||||
- obj:
|
||||
id: eye_right
|
||||
x: 180
|
||||
y: 90
|
||||
width: 50
|
||||
height: 50
|
||||
radius: 25
|
||||
bg_color: 0xFFFFFF
|
||||
# Mouth (line/arc)
|
||||
- arc:
|
||||
id: mouth
|
||||
x: 110
|
||||
y: 160
|
||||
width: 100
|
||||
height: 40
|
||||
start_angle: 180
|
||||
end_angle: 360
|
||||
arc_color: 0xFFFFFF
|
||||
|
||||
pages:
|
||||
- id: page_idle
|
||||
- id: page_listening
|
||||
- id: page_thinking
|
||||
- id: page_speaking
|
||||
- id: page_error
|
||||
```
|
||||
|
||||
### LVGL Face State Animations — `esphome/animations.yaml`
|
||||
|
||||
```yaml
|
||||
script:
|
||||
- id: animate_face_idle
|
||||
then:
|
||||
- lvgl.widget.modify:
|
||||
id: eye_left
|
||||
height: 50 # normal open
|
||||
- lvgl.widget.modify:
|
||||
id: eye_right
|
||||
height: 50
|
||||
- lvgl.widget.modify:
|
||||
id: mouth
|
||||
arc_color: 0xFFFFFF
|
||||
|
||||
- id: animate_face_listening
|
||||
then:
|
||||
- lvgl.widget.modify:
|
||||
id: eye_left
|
||||
height: 60 # wider eyes
|
||||
- lvgl.widget.modify:
|
||||
id: eye_right
|
||||
height: 60
|
||||
- lvgl.widget.modify:
|
||||
id: mouth
|
||||
arc_color: 0x00BFFF # blue tint
|
||||
|
||||
- id: animate_face_thinking
|
||||
then:
|
||||
- lvgl.widget.modify:
|
||||
id: eye_left
|
||||
height: 20 # squinting
|
||||
- lvgl.widget.modify:
|
||||
id: eye_right
|
||||
height: 20
|
||||
|
||||
- id: animate_face_speaking
|
||||
then:
|
||||
- lvgl.widget.modify:
|
||||
id: mouth
|
||||
arc_color: 0x00FF88 # green speaking indicator
|
||||
|
||||
- id: animate_face_error
|
||||
then:
|
||||
- lvgl.widget.modify:
|
||||
id: eye_left
|
||||
bg_color: 0xFF2200 # red eyes
|
||||
- lvgl.widget.modify:
|
||||
id: eye_right
|
||||
bg_color: 0xFF2200
|
||||
```
|
||||
|
||||
> **Note:** True lip-sync animation (mouth moving with audio) is complex on ESP32. Phase 1: static states. Phase 2: amplitude-driven mouth height using speaker volume feedback.
|
||||
|
||||
---
|
||||
|
||||
## Secrets File
|
||||
|
||||
`esphome/secrets.yaml` (gitignored):
|
||||
|
||||
```yaml
|
||||
wifi_ssid: "YourNetwork"
|
||||
wifi_password: "YourPassword"
|
||||
api_key: "<32-byte base64 key>"
|
||||
ota_password: "YourOTAPassword"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Flash & Deployment Workflow
|
||||
|
||||
```bash
|
||||
# Install ESPHome
|
||||
pip install esphome
|
||||
|
||||
# Compile + flash via USB (first time)
|
||||
esphome run esphome/s3-box-living-room.yaml
|
||||
|
||||
# OTA update (subsequent)
|
||||
esphome upload esphome/s3-box-living-room.yaml --device <device-ip>
|
||||
|
||||
# View logs
|
||||
esphome logs esphome/s3-box-living-room.yaml
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Home Assistant Integration
|
||||
|
||||
After flashing:
|
||||
1. HA discovers ESP32 automatically via mDNS
|
||||
2. Add device in HA → Settings → Devices
|
||||
3. Assign Wyoming voice assistant pipeline to the device
|
||||
4. Set up room-specific automations (e.g., "Living Room" light control from that satellite)
|
||||
ESPHome's `voice_assistant` component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:
1. Audio → Wyoming STT (Mac Mini) → text
2. Text → OpenClaw conversation agent → response
3. Response → Wyoming TTS (Mac Mini) → audio back to ESP32
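
On the device side, these pipeline stages surface as `voice_assistant` triggers, which is where the static face images can be swapped. The fragment below is a simplified sketch covering only two states. It assumes an `spi` bus and a microphone with id `mic` are defined elsewhere, and every `id` here (`lcd`, `img_idle`, `img_listening`, `page_idle`, `page_listening`) is a hypothetical name. The committed config is described as driving display updates through scripts; direct actions are used here only for brevity.

```yaml
# Sketch only: ids and the two-state subset are illustrative.
image:
  - file: illustrations/idle.png
    id: img_idle
    type: RGB565
    resize: 320x240
  - file: illustrations/listening.png
    id: img_listening
    type: RGB565
    resize: 320x240

display:
  - platform: ili9xxx
    id: lcd
    model: S3BOX
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin:
      number: GPIO48
      inverted: true
    invert_colors: false
    update_interval: never          # redraw only when explicitly requested
    pages:
      - id: page_idle
        lambda: it.image(0, 0, id(img_idle));
      - id: page_listening
        lambda: it.image(0, 0, id(img_listening));

voice_assistant:
  microphone: mic                   # assumed id of a microphone defined elsewhere
  on_listening:
    - display.page.show: page_listening
    - component.update: lcd         # needed because update_interval is never
  on_end:
    - display.page.show: page_idle
    - component.update: lcd
```

The remaining states (thinking, replying, error) would follow the same pattern with their own image and page.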
|
||||
|
||||
---
|
||||
|
||||
@@ -303,43 +117,71 @@ After flashing:
|
||||
|
||||
```
|
||||
homeai-esp32/
|
||||
├── PLAN.md
|
||||
├── setup.sh # env check + flash/ota/logs commands
|
||||
└── esphome/
|
||||
├── base.yaml
|
||||
├── voice.yaml
|
||||
├── display.yaml
|
||||
├── animations.yaml
|
||||
├── s3-box-living-room.yaml
|
||||
├── s3-box-bedroom.yaml # template, fill in when hardware available
|
||||
├── s3-box-kitchen.yaml # template
|
||||
└── secrets.yaml # gitignored
|
||||
├── secrets.yaml # gitignored — WiFi + API key
|
||||
├── homeai-living-room.yaml # first unit (full config)
|
||||
├── homeai-bedroom.yaml # future: copy + change substitutions
|
||||
├── homeai-kitchen.yaml # future: copy + change substitutions
|
||||
└── illustrations/ # 320×240 PNG face images
|
||||
├── idle.png
|
||||
├── loading.png
|
||||
├── listening.png
|
||||
├── thinking.png
|
||||
├── replying.png
|
||||
├── error.png
|
||||
└── timer_finished.png
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Wake Word Decisions
|
||||
## ESPHome Environment
|
||||
|
||||
```bash
# Dedicated venv (Python 3.12): do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version  # ESPHome 2026.2.4+

# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml    # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml   # stream logs

# Or use the setup script
./setup.sh flash      # compile + USB flash
./setup.sh ota        # compile + OTA update
./setup.sh logs       # stream device logs
./setup.sh validate   # check YAML without compiling
```
|
||||
|
||||
---
|
||||
|
||||
## Wake Word Options
|
||||
|
||||
| Option | Latency | Privacy | Effort |
|
||||
|---|---|---|---|
|
||||
| `hey_jarvis` (built-in microWakeWord) | ~200ms | On-device | Zero |
|
||||
| `hey_jarvis` (built-in micro_wake_word) | ~200ms | On-device | Zero |
|
||||
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
|
||||
| Mac Mini openWakeWord (stream audio) | ~500ms | On Mac | Medium |
|
||||
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |
|
||||
|
||||
**Recommendation:** Start with `hey_jarvis`. Train a custom word (character's name) once character name is finalised.
|
||||
**Current**: `hey_jarvis` on-device. Train a custom word (the character's name) once the name is finalised.
|
||||
|
||||
---
|
||||
|
||||
## Implementation Steps
|
||||
|
||||
- [ ] Install ESPHome: `pip install esphome`
|
||||
- [ ] Write `esphome/secrets.yaml` (gitignored)
|
||||
- [ ] Write `base.yaml`, `voice.yaml`, `display.yaml`, `animations.yaml`
|
||||
- [ ] Write `s3-box-living-room.yaml` for first unit
|
||||
- [ ] Flash first unit via USB: `esphome run s3-box-living-room.yaml`
|
||||
- [ ] Verify unit appears in HA device list
|
||||
- [ ] Assign Wyoming voice pipeline to unit in HA
|
||||
- [ ] Test: speak wake word → transcription → LLM response → spoken reply
|
||||
- [ ] Test: LVGL face cycles through idle → listening → thinking → speaking
|
||||
- [ ] Verify OTA update works: change LVGL color, deploy wirelessly
|
||||
- [x] Install ESPHome in `~/homeai-esphome-env` (Python 3.12)
|
||||
- [x] Write `esphome/secrets.yaml` (gitignored)
|
||||
- [x] Write `homeai-living-room.yaml` (based on official S3-BOX-3 reference config)
|
||||
- [x] Generate placeholder face illustrations (7 PNGs, 320×240)
|
||||
- [x] Write `setup.sh` with flash/ota/logs/validate commands
|
||||
- [x] Write `deploy.sh` with OTA deploy, image management, multi-unit support
|
||||
- [x] Flash first unit via USB (living room)
|
||||
- [x] Verify unit appears in HA device list
|
||||
- [x] Assign Wyoming voice pipeline to unit in HA
|
||||
- [x] Test: speak wake word → transcription → LLM response → spoken reply
|
||||
- [x] Test: display cycles through idle → listening → thinking → replying
|
||||
- [x] Verify OTA update works: change config, deploy wirelessly
|
||||
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
|
||||
- [ ] Flash remaining units, verify each works independently
|
||||
- [ ] Document final MAC address → room name mapping
|
||||
@@ -351,7 +193,17 @@ homeai-esp32/
|
||||
- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
|
||||
- [ ] STT transcription accuracy >90% for clear speech in quiet room
|
||||
- [ ] TTS audio plays clearly through ESP32 speaker
|
||||
- [ ] LVGL face shows correct state for idle / listening / thinking / speaking / error
|
||||
- [ ] Display shows correct state for idle / listening / thinking / replying / error / muted
|
||||
- [ ] OTA firmware updates work without USB cable
|
||||
- [ ] Unit reconnects automatically after WiFi drop
|
||||
- [ ] Unit survives power cycle and resumes normal operation
|
||||
|
||||
---
|
||||
|
||||
## Known Constraints
|
||||
|
||||
- **Memory**: voice_assistant + micro_wake_word + display is near the limit. Do NOT add Bluetooth or LVGL widgets; they will cause crashes.
- **WiFi**: 2.4GHz only. 5GHz networks are not supported.
- **Speaker**: 1W built-in. Volume capped at 85% to avoid distortion.
- **Display**: Static PNGs compiled into firmware. To change images, reflash via OTA (~1-2 min).
- **First compile**: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.
|
||||
|
||||