# P6: homeai-esp32 — Room Satellite Hardware

> Phase 4 | Depends on: P1 (HA running), P3 (Wyoming STT/TTS servers running)

---
## Goal

Flash ESP32-S3-BOX-3 units with ESPHome. Each unit acts as a dumb room satellite: always-on mic, on-device wake word detection, audio playback, and a display showing assistant state via static PNG face illustrations. All intelligence stays on the Mac Mini.

---
## Hardware: ESP32-S3-BOX-3

| Feature | Spec |
|---|---|
| SoC | ESP32-S3 (dual-core Xtensa, 240MHz) |
| RAM | 512KB SRAM + 16MB PSRAM |
| Flash | 16MB |
| Display | 2.4" IPS LCD, 320×240, touchscreen (ILI9xxx, model S3BOX) |
| Audio ADC | ES7210 (dual mic array, 16kHz 16-bit) |
| Audio DAC | ES8311 (speaker output, 48kHz 16-bit) |
| Speaker | Built-in 1W |
| Connectivity | WiFi 802.11b/g/n (2.4GHz only), BT 5.0 |
| USB | USB-C (programming + power, native USB JTAG serial) |

---
## Architecture Per Unit

```
ESP32-S3-BOX-3
├── micro_wake_word (on-device, always listening)
│   └── "hey_jarvis" — triggers voice_assistant on wake detection
├── voice_assistant (ESPHome component)
│   ├── connects to Home Assistant via ESPHome API
│   ├── HA routes audio → Mac Mini Wyoming STT (10.0.0.101:10300)
│   ├── HA routes text → OpenClaw conversation agent (10.0.0.101:8081)
│   └── HA routes response → Mac Mini Wyoming TTS (10.0.0.101:10301)
├── Display (ili9xxx, model S3BOX, 320×240)
│   └── static PNG faces per state (idle, listening, thinking, replying, error)
└── ESPHome OTA
    └── firmware updates over WiFi
```

---
## Pin Map (ESP32-S3-BOX-3)

| Function | Pin(s) | Notes |
|---|---|---|
| I2S LRCLK | GPIO45 | strapping pin — warning ignored |
| I2S BCLK | GPIO17 | |
| I2S MCLK | GPIO2 | |
| I2S DIN (mic) | GPIO16 | ES7210 ADC input |
| I2S DOUT (speaker) | GPIO15 | ES8311 DAC output |
| Speaker enable | GPIO46 | strapping pin — warning ignored |
| I2C SCL | GPIO18 | audio codec control bus |
| I2C SDA | GPIO8 | audio codec control bus |
| SPI CLK (display) | GPIO7 | |
| SPI MOSI (display) | GPIO6 | |
| Display CS | GPIO5 | |
| Display DC | GPIO4 | |
| Display Reset | GPIO48 | inverted |
| Backlight | GPIO47 | LEDC PWM |
| Left top button | GPIO0 | strapping pin — mute toggle / factory reset |
| Sensor dock I2C SCL | GPIO40 | sensor bus (AHT-30, AT581x radar) |
| Sensor dock I2C SDA | GPIO41 | sensor bus (AHT-30, AT581x radar) |
| Radar presence output | GPIO21 | AT581x digital detection pin |

---
## ESPHome Configuration

### Platform & Framework

```yaml
esp32:
  board: esp32s3box
  flash_size: 16MB
  cpu_frequency: 240MHz
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

psram:
  mode: octal
  speed: 80MHz
```
### Audio Stack

Uses `i2s_audio` platform with external ADC/DAC codec chips:

- **Microphone**: ES7210 ADC via I2S, 16kHz 16-bit mono
- **Speaker**: ES8311 DAC via I2S, 48kHz 16-bit mono (left channel)
- **Media player**: wraps speaker with volume control (min 50%, max 85%)
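The bullets above could translate into ESPHome YAML roughly as follows. This is a sketch under assumptions, not the project's actual config: the pins come from the pin map, but every `id` (`internal_i2c`, `i2s_audio_bus`, `es7210_adc`, `es8311_dac`, `box_mic`, `box_speaker`, `speaker_media_player`) is a hypothetical name, the announcement pipeline format is a guess, and option names should be checked against the installed ESPHome release. The speaker enable pin (GPIO46) is not shown.

```yaml
# Sketch only: ids are illustrative; pins are taken from the pin map.
i2c:
  - id: internal_i2c              # audio codec control bus
    sda: GPIO8
    scl: GPIO18

i2s_audio:
  - id: i2s_audio_bus
    i2s_lrclk_pin:
      number: GPIO45
      ignore_strapping_warning: true   # GPIO45 is a strapping pin
    i2s_bclk_pin: GPIO17
    i2s_mclk_pin: GPIO2

audio_adc:
  - platform: es7210
    id: es7210_adc
    i2c_id: internal_i2c
    bits_per_sample: 16bit
    sample_rate: 16000

audio_dac:
  - platform: es8311
    id: es8311_dac
    i2c_id: internal_i2c
    bits_per_sample: 16bit
    sample_rate: 48000

microphone:
  - platform: i2s_audio
    id: box_mic
    i2s_audio_id: i2s_audio_bus
    i2s_din_pin: GPIO16
    adc_type: external            # ES7210 does the conversion
    sample_rate: 16000
    bits_per_sample: 16bit

speaker:
  - platform: i2s_audio
    id: box_speaker
    i2s_audio_id: i2s_audio_bus
    i2s_dout_pin: GPIO15
    dac_type: external            # ES8311 does the conversion
    audio_dac: es8311_dac
    sample_rate: 48000
    bits_per_sample: 16bit
    channel: left

media_player:
  - platform: speaker
    id: speaker_media_player
    name: Media Player
    volume_min: 0.5               # 50% floor
    volume_max: 0.85              # 85% cap to avoid distortion
    announcement_pipeline:
      speaker: box_speaker
      format: FLAC                # assumption: format is not stated in this plan
      sample_rate: 48000
      num_channels: 1
```

The volume limits live on the media player rather than the speaker, so anything that plays through `speaker_media_player` stays inside the 50 to 85% range.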
### Wake Word

Wake word detection runs on-device using the `micro_wake_word` component with the `hey_jarvis` model. A selector entity can optionally switch the unit to Home Assistant's streaming wake word detection instead.
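A minimal sketch of how the on-device model and the selector might be wired together. The ids `mww`, `box_mic`, and `wake_word_engine_location`, the entity name, and the option labels are all hypothetical, and the switching logic is simplified (a real config would likely also guard against switching while a pipeline run is in progress).

```yaml
# Sketch only: ids, entity name, and option labels are illustrative.
micro_wake_word:
  id: mww
  microphone: box_mic
  models:
    - model: hey_jarvis
  on_wake_word_detected:
    # Hand the detected phrase to the voice assistant pipeline.
    - voice_assistant.start:
        wake_word: !lambda return wake_word;

select:
  - platform: template
    id: wake_word_engine_location
    name: Wake word engine location
    optimistic: true
    restore_value: true
    options:
      - On device
      - In Home Assistant
    initial_option: On device
    on_value:
      - if:
          condition:
            lambda: return x == "In Home Assistant";
          then:
            # Stream audio continuously and let HA detect the wake word.
            - micro_wake_word.stop:
            - voice_assistant.start_continuous:
          else:
            # Go back to on-device detection.
            - voice_assistant.stop:
            - micro_wake_word.start:
```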
### Display

`ili9xxx` platform with model `S3BOX`. Uses `update_interval: never` — display updates are triggered by scripts on voice assistant state changes. Static 320×240 PNG images for each state are compiled into firmware. No text overlays — voice-only interaction.

Screen auto-dims after a configurable idle timeout (default 1 min, adjustable 1–60 min via HA entity). Wakes on voice activity or radar presence detection.
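The script-driven update pattern described above might look like this for two of the states. It is a sketch, not the actual config: the pins come from the pin map, while the ids (`s3_box_lcd`, `face_idle`, `face_listening`, `idle_page`, `listening_page`, `show_listening`, `backlight_output`, `backlight`) are hypothetical, and the dimming timeout logic is left out.

```yaml
# Sketch only: ids are illustrative; dimming timer not shown.
spi:
  clk_pin: GPIO7
  mosi_pin: GPIO6

image:
  - file: illustrations/idle.png
    id: face_idle
    type: RGB565
    resize: 320x240
  - file: illustrations/listening.png
    id: face_listening
    type: RGB565
    resize: 320x240

display:
  - platform: ili9xxx
    id: s3_box_lcd
    model: S3BOX
    invert_colors: false
    cs_pin: GPIO5
    dc_pin: GPIO4
    reset_pin:
      number: GPIO48
      inverted: true
    update_interval: never        # redraw only when a script asks for it
    pages:
      - id: idle_page
        lambda: |-
          it.image(0, 0, id(face_idle));
      - id: listening_page
        lambda: |-
          it.image(0, 0, id(face_listening));

output:
  - platform: ledc
    id: backlight_output
    pin: GPIO47

light:
  - platform: monochromatic
    id: backlight
    name: Screen
    output: backlight_output
    restore_mode: ALWAYS_ON

script:
  - id: show_listening
    then:
      - display.page.show: listening_page
      - component.update: s3_box_lcd   # force the redraw
```

Because `update_interval` is `never`, the explicit `component.update` call is what actually pushes the new page to the panel.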
### Sensor Dock (ESP32-S3-BOX-3-SENSOR)

Optional accessory dock connected via secondary I2C bus (GPIO40/41, 100kHz):

- **AHT-30** (temp/humidity) — `aht10` component with variant AHT20, 30s update interval
- **AT581x mmWave radar** — presence detection via GPIO21, I2C for settings config
- **Radar RF switch** — toggle radar on/off from HA
- Radar configured on boot: `sensing_distance=600`, `trigger_keep=5s`, `hw_frontend_reset=true`
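As a rough illustration, the dock could be configured with fragments like the ones below. The ids `sensor_i2c` and `radar` and all entity names are hypothetical, and in a real file the `i2c:` and `esphome:` entries would be merged into the existing top-level blocks rather than repeated.

```yaml
# Sketch only: ids and entity names are illustrative.
i2c:
  - id: sensor_i2c                # secondary bus for the dock
    scl: GPIO40
    sda: GPIO41
    frequency: 100kHz

sensor:
  - platform: aht10
    variant: AHT20
    i2c_id: sensor_i2c
    update_interval: 30s
    temperature:
      name: Temperature
    humidity:
      name: Humidity

at581x:
  id: radar
  i2c_id: sensor_i2c

binary_sensor:
  - platform: gpio
    name: Presence
    pin: GPIO21                   # AT581x digital detection output
    device_class: occupancy

switch:
  - platform: at581x
    at581x_id: radar
    name: Radar RF

esphome:
  on_boot:
    then:
      # Push the radar settings listed above once at startup.
      - at581x.settings:
          id: radar
          hw_frontend_reset: true
          sensing_distance: 600
          trigger_keep: 5s
```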
### Voice Assistant

ESPHome's `voice_assistant` component connects to HA via the ESPHome native API (not directly to Wyoming). HA orchestrates the pipeline:
1. Audio → Wyoming STT (Mac Mini) → text
2. Text → OpenClaw conversation agent → response
3. Response → Wyoming TTS (Mac Mini) → audio back to ESP32
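On the device side, the pipeline stages surface as `voice_assistant` triggers, which is where the display state changes can hook in. The sketch below assumes hypothetical ids (`va`, `box_mic`, `speaker_media_player`, `mww`, `s3_box_lcd`) and hypothetical display pages named after the states; none of these names are confirmed by this plan.

```yaml
# Sketch only: ids and page names are illustrative and defined elsewhere.
voice_assistant:
  id: va
  microphone: box_mic
  media_player: speaker_media_player
  micro_wake_word: mww
  on_listening:                   # audio is being streamed to HA for STT
    - display.page.show: listening_page
    - component.update: s3_box_lcd
  on_stt_vad_end:                 # speech ended, agent is working
    - display.page.show: thinking_page
    - component.update: s3_box_lcd
  on_tts_start:                   # response text is ready, TTS begins
    - display.page.show: replying_page
    - component.update: s3_box_lcd
  on_end:                         # pipeline finished
    - display.page.show: idle_page
    - component.update: s3_box_lcd
  on_error:
    - display.page.show: error_page
    - component.update: s3_box_lcd
```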

---

## Directory Layout

```
homeai-esp32/
├── PLAN.md
├── setup.sh                          # env check + flash/ota/logs commands
└── esphome/
    ├── secrets.yaml                  # gitignored — WiFi + API key
    ├── homeai-living-room.yaml       # first unit (full config)
    ├── homeai-bedroom.yaml           # future: copy + change substitutions
    ├── homeai-kitchen.yaml           # future: copy + change substitutions
    └── illustrations/                # 320×240 PNG face images
        ├── idle.png
        ├── loading.png
        ├── listening.png
        ├── thinking.png
        ├── replying.png
        ├── error.png
        └── timer_finished.png
```
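To illustrate the "copy + change substitutions" idea, the top of a per-room file might look like the following. The substitution names, the friendly name, and the secret keys (`wifi_ssid`, `wifi_password`, `api_encryption_key`) are assumptions; only the file name and the fact that `secrets.yaml` holds WiFi credentials and an API key come from the layout above.

```yaml
# Sketch only: hypothetical head of esphome/homeai-bedroom.yaml.
substitutions:
  name: homeai-bedroom            # the only values that change per room
  friendly_name: HomeAI Bedroom

esphome:
  name: ${name}
  friendly_name: ${friendly_name}

wifi:
  ssid: !secret wifi_ssid         # resolved from the gitignored secrets.yaml
  password: !secret wifi_password

api:
  encryption:
    key: !secret api_encryption_key

ota:
  - platform: esphome             # enables firmware updates over WiFi
```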

---

## ESPHome Environment

```bash
# Dedicated venv (Python 3.12) — do NOT share with voice/whisper venvs
~/homeai-esphome-env/bin/esphome version   # ESPHome 2026.2.4+

# Quick commands
cd ~/gitea/homeai/homeai-esp32
~/homeai-esphome-env/bin/esphome run esphome/homeai-living-room.yaml    # compile + flash
~/homeai-esphome-env/bin/esphome logs esphome/homeai-living-room.yaml   # stream logs

# Or use the setup script
./setup.sh flash      # compile + USB flash
./setup.sh ota        # compile + OTA update
./setup.sh logs       # stream device logs
./setup.sh validate   # check YAML without compiling
```

---
## Wake Word Options

| Option | Latency | Privacy | Effort |
|---|---|---|---|
| `hey_jarvis` (built-in micro_wake_word) | ~200ms | On-device | Zero |
| Custom word (trained model) | ~200ms | On-device | High — requires 50+ recordings |
| HA streaming wake word | ~500ms | On Mac Mini | Medium — stream all audio |

**Current**: `hey_jarvis` on-device. Train a custom word (the character's name) once the name is finalised.

---
## Implementation Steps

- [x] Install ESPHome in `~/homeai-esphome-env` (Python 3.12)
- [x] Write `esphome/secrets.yaml` (gitignored)
- [x] Write `homeai-living-room.yaml` (based on official S3-BOX-3 reference config)
- [x] Generate placeholder face illustrations (7 PNGs, 320×240)
- [x] Write `setup.sh` with flash/ota/logs/validate commands
- [x] Write `deploy.sh` with OTA deploy, image management, multi-unit support
- [x] Flash first unit via USB (living room)
- [x] Verify unit appears in HA device list
- [x] Assign Wyoming voice pipeline to unit in HA
- [x] Test: speak wake word → transcription → LLM response → spoken reply
- [x] Test: display cycles through idle → listening → thinking → replying
- [x] Verify OTA update works: change config, deploy wirelessly
- [ ] Write config templates for remaining rooms (bedroom, kitchen)
- [ ] Flash remaining units, verify each works independently
- [ ] Document final MAC address → room name mapping

---
## Success Criteria

- [ ] Wake word "hey jarvis" triggers pipeline reliably from 3m distance
- [ ] STT transcription accuracy >90% for clear speech in quiet room
- [ ] TTS audio plays clearly through ESP32 speaker
- [ ] Display shows correct state for idle / listening / thinking / replying / error / muted
- [ ] OTA firmware updates work without USB cable
- [ ] Unit reconnects automatically after WiFi drop
- [ ] Unit survives power cycle and resumes normal operation

---
## Known Constraints

- **Memory**: voice_assistant + micro_wake_word + display + sensor dock together run close to the memory limit. Do NOT add Bluetooth or LVGL widgets — they will cause crashes.
- **WiFi**: 2.4GHz only. 5GHz networks are not supported.
- **Speaker**: 1W built-in. Volume capped at 85% to avoid distortion.
- **Display**: Static PNGs compiled into firmware. To change images, rebuild and reflash via OTA (~1-2 min).
- **First compile**: Downloads ESP-IDF toolchain (~500MB), takes 5-10 minutes. Incremental builds are 1-2 minutes.