feat: ESP32-S3-BOX-3 room satellite — ESPHome config, OTA deploy, placeholder faces

Living room unit fully working: on-device wake word (hey_jarvis), voice pipeline
via HA (Wyoming STT → OpenClaw → Wyoming TTS), static PNG display states, OTA
updates. Includes deploy.sh for quick OTA with custom image support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: Aodhan Collins
Date: 2026-03-13 20:48:03 +00:00
parent 3c0d905e64
commit c4cecbd8dc
13 changed files with 1410 additions and 341 deletions


## Goal
Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.
---
| Component | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
| Speaker | Built-in 1W |
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
| USB | USB-C (programming + power, native USB JTAG serial) |
---
```
ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis" — triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi
```
---
## Pin Map (ESP32-S3-BOX-3)
| Function | Pin(s) | Notes |
|---|---|---|
| I2S LRCLK | GPIO45 | strapping pin — warning ignored |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin — warning ignored |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin — mute toggle / factory reset |
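Translated into ESPHome YAML, the shared buses from this pin map look roughly like the following. This is a sketch only: the `id` names are illustrative, pin options should be checked against the installed ESPHome release, and `ignore_strapping_warning` is how ESPHome suppresses the strapping-pin warnings noted in the table.

```yaml
i2s_audio:
  - id: i2s_bus
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true  # strapping pin, per the table above
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2

i2c:
  sda: GPIO8
  scl: GPIO18

spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

output:
  - platform: ledc
    pin: GPIO47
    id: backlight_pwm

light:
  - platform: monochromatic
    output: backlight_pwm
    name: "Backlight"
    restore_mode: ALWAYS_ON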
---
## ESPHome Configuration
### Platform & Framework
```yaml
esphome:
  name: homeai-${room}
  friendly_name: "HomeAI ${room_display}"

esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password
  ap:
    ssid: "HomeAI Fallback"

api:
  encryption:
    key: !secret api_key

ota:
  - platform: esphome  # recent ESPHome releases require an explicit OTA platform
    password: !secret ota_password

logger:
  level: INFO

psram:
  mode: octal
  speed: 80MHz
```
### Audio Stack
Uses `i2s_audio` platform with external ADC/DAC codec chips:
- **Microphone**: ES7210 ADC via I2S, 16kHz 16-bit mono
- **Speaker**: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
- **Media player**: wraps speaker with volume control (min 50%, max 85%)
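A minimal sketch of that audio stack in ESPHome YAML. The component `id`s (`mic`, `spk`) are illustrative, and option names should be verified against the installed ESPHome release:

```yaml
microphone:
  - platform: i2s_audio
    id: mic
    adc_type: external   # audio goes through the ES7210 codec
    i2s_din_pin: GPIO16
    pdm: false
    sample_rate: 16000
    bits_per_sample: 16bit

speaker:
  - platform: i2s_audio
    id: spk
    dac_type: external   # audio goes through the ES8311 codec
    i2s_dout_pin: GPIO15
    sample_rate: 48000
    channel: left

media_player:
  - platform: speaker
    id: mp
    name: "Media Player"
    announcement_pipeline:
      speaker: spk
    volume_min: 50%
    volume_max: 85%
```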
### Wake Word
On-device `micro_wake_word` component with `hey_jarvis` model. Can optionally be switched to Home Assistant streaming wake word via a selector entity.
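The selector can be modelled as a template `select` that starts either the on-device model or continuous streaming to Home Assistant. A sketch only — the entity name and option strings are illustrative:

```yaml
select:
  - platform: template
    name: "Wake word engine"
    id: wake_word_engine
    options:
      - "On device"
      - "In Home Assistant"
    initial_option: "On device"
    optimistic: true
    restore_value: true
    on_value:
      - if:
          condition:
            lambda: 'return x == "On device";'
          then:
            # local detection: stream nothing until the wake word fires
            - voice_assistant.stop:
            - micro_wake_word.start:
          else:
            # HA streaming: send audio continuously for server-side detection
            - micro_wake_word.stop:
            - voice_assistant.start_continuous:
```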
### Display
`ili9xxx` platform with model `S3BOX`. Uses `update_interval: never` — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware.

### Voice Assistant

```yaml
micro_wake_word:
  model: hey_jarvis  # or custom model path
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: mic
  speaker: spk
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0
  on_listening:
    - script.execute: animate_face_listening
  on_stt_vad_end:
    - script.execute: animate_face_thinking
  on_tts_start:
    - script.execute: animate_face_speaking
  on_end:
    - script.execute: animate_face_idle
  on_error:
    - script.execute: animate_face_error
```
**Note:** ESPHome's `voice_assistant` component connects to HA, which routes to Wyoming STT/TTS on the Mac Mini. This is the standard ESPHome → HA → Wyoming path.
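The static-image display approach can be sketched as follows: scripts set a state variable and request a redraw, and the display lambda blits the matching compiled-in PNG. The `image:` ids, the global, and the script name are illustrative, and the display's pin options are omitted here:

```yaml
globals:
  - id: face_state
    type: int
    initial_value: "0"  # 0 = idle, 1 = listening, ...

image:
  - file: esphome/illustrations/idle.png
    id: img_idle
    type: RGB565
  - file: esphome/illustrations/listening.png
    id: img_listening
    type: RGB565

display:
  - platform: ili9xxx
    model: S3BOX
    id: lcd
    update_interval: never  # redraw only when a script requests it
    lambda: |-
      switch (id(face_state)) {
        case 1:  it.image(0, 0, id(img_listening)); break;
        default: it.image(0, 0, id(img_idle));
      }

script:
  - id: animate_face_listening
    then:
      - globals.set:
          id: face_state
          value: "1"
      - component.update: lcd
```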
---
## Secrets File
`esphome/secrets.yaml` (gitignored):
```yaml
wifi_ssid: "YourNetwork"
wifi_password: "YourPassword"
api_key: "<32-byte base64 key>"
ota_password: "YourOTAPassword"
```
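The API encryption key is 32 random bytes, base64-encoded. Assuming OpenSSL is available, one way to generate it:

```shell
# Generate a 32-byte base64 key for api.encryption.key
openssl rand -base64 32
```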
---
## Home Assistant Integration
After flashing:
1. HA discovers ESP32 automatically via mDNS
2. Add device in HA → Settings → Devices
3. Assign Wyoming voice assistant pipeline to the device
4. Set up room-specific automations (e.g., "Living Room" light control from that satellite)
ESPHome's `voice_assistant` component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:
1. Audio → Wyoming STT (Mac Mini) → text
2. Text → OpenClaw conversation agent → response
3. Response → Wyoming TTS (Mac Mini) → audio back to ESP32
---
```
homeai-esp32/
├── PLAN.md
├── setup.sh                     # env check + flash/ota/logs commands
├── deploy.sh                    # OTA deploy, image management, multi-unit support
└── esphome/
    ├── secrets.yaml             # gitignored — WiFi + API key
    ├── homeai-living-room.yaml  # first unit (full config)
    ├── homeai-bedroom.yaml      # future: copy + change substitutions
    ├── homeai-kitchen.yaml      # future: copy + change substitutions
    └── illustrations/           # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
```
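The placeholder illustrations are simple 320×240 PNGs, one per state. How they were actually generated is not recorded here; as an illustration, a stdlib-only Python sketch that emits a solid-colour placeholder per state (colours and filenames are assumptions):

```python
import struct
import zlib

WIDTH, HEIGHT = 320, 240
# One placeholder colour per assistant state (arbitrary choices)
STATES = {
    "idle": (40, 40, 40),
    "loading": (90, 90, 90),
    "listening": (0, 120, 255),
    "thinking": (255, 180, 0),
    "replying": (0, 200, 100),
    "error": (220, 40, 0),
    "timer_finished": (180, 0, 220),
}

def chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag+data."""
    return (struct.pack(">I", len(data)) + tag + data
            + struct.pack(">I", zlib.crc32(tag + data)))

def solid_png(rgb: tuple) -> bytes:
    """Encode a solid-colour 320x240 8-bit truecolour PNG."""
    ihdr = struct.pack(">IIBBBBB", WIDTH, HEIGHT, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * WIDTH  # filter byte 0 + RGB pixels
    idat = zlib.compress(row * HEIGHT)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

if __name__ == "__main__":
    for state, rgb in STATES.items():
        with open(f"{state}.png", "wb") as f:
            f.write(solid_png(rgb))
```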
---
## ESPHome Environment
```bash
# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version # ESPHome 2026.2.4+
# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml # stream logs
# Or use the setup script
./setup.sh flash # compile + USB flash
./setup.sh ota # compile + OTA update
./setup.sh logs # stream device logs
./setup.sh validate # check YAML without compiling
```
---
## Wake Word Options
| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in micro_wake_word) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |
**Current**: `hey_jarvis` on-device. Train a custom word (character's name) once finalised.
---
## Implementation Steps
- [x] Install ESPHome in `~/homeai-esphome-env` (Python 3.12)
- [x] Write `esphome/secrets.yaml` (gitignored)
- [x] Write `homeai-living-room.yaml` (based on official S3-BOX-3 reference config)
- [x] Generate placeholder face illustrations (7 PNGs, 320×240)
- [x] Write `setup.sh` with flash/ota/logs/validate commands
- [x] Write `deploy.sh` with OTA deploy, image management, multi-unit support
- [x] Flash first unit via USB (living room)
- [x] Verify unit appears in HA device list
- [x] Assign Wyoming voice pipeline to unit in HA
- [x] Test: speak wake word → transcription → LLM response → spoken reply
- [x] Test: display cycles through idle → listening → thinking → replying
- [x] Verify OTA update works: change config, deploy wirelessly
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
- [ ] Flash remaining units, verify each works independently
- [ ] Document final MAC address → room name mapping
- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- [ ] STT transcription accuracy >90% for clear speech in quiet room
- [ ] TTS audio plays clearly through ESP32 speaker
- [ ] Display shows correct state for idle / listening / thinking / replying / error / muted
- [ ] OTA firmware updates work without USB cable
- [ ] Unit reconnects automatically after WiFi drop
- [ ] Unit survives power cycle and resumes normal operation
---
## Known Constraints
- **Memory**: voice_assistant + micro_wake_word + display is near the limit. Do NOT add Bluetooth or LVGL widgets — they will cause crashes.
- **WiFi**: 2.4GHz only. 5GHz networks are not supported.
- **Speaker**: 1W built-in. Volume capped at 85% to avoid distortion.
- **Display**: Static PNGs compiled into firmware. To change images, reflash via OTA (~1-2 min).
- **First compile**: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.