docs: Add comprehensive speech features documentation and configuration

- Introduce detailed documentation for speech processing capabilities
- Add new speech features documentation in `docs/features/speech.md`
- Update README with speech feature highlights and prerequisites
- Expand configuration documentation with speech-related settings
- Include model selection, GPU acceleration, and best practices guidance

docs/features/speech.md (new file, 212 lines)

# Speech Features

The Home Assistant MCP Server includes powerful speech processing capabilities powered by faster-whisper and custom wake word detection. This guide explains how to set up and use these features effectively.

## Overview

The speech processing system consists of two main components:

1. Wake Word Detection - listens for specific trigger phrases
2. Speech-to-Text - transcribes spoken commands using faster-whisper

## Setup

### Prerequisites

1. Docker environment:

   ```bash
   docker --version  # Should be 20.10.0 or higher
   ```

2. For GPU acceleration:
   - NVIDIA GPU with CUDA support
   - NVIDIA Container Toolkit installed
   - NVIDIA drivers 450.80.02 or higher (a quick verification check follows this list)

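Before continuing, it is worth confirming that containers can actually see the GPU. A minimal smoke test, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag here is only an example):

```bash
# Should print the same device table as running `nvidia-smi` on the host.
# A failure here points to the NVIDIA Container Toolkit or driver setup.
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```
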
### Installation

1. Enable speech features in your `.env`:
   ```bash
   ENABLE_SPEECH_FEATURES=true
   ENABLE_WAKE_WORD=true
   ENABLE_SPEECH_TO_TEXT=true
   ```

2. Configure model settings:
   ```bash
   WHISPER_MODEL_PATH=/models
   WHISPER_MODEL_TYPE=base
   WHISPER_LANGUAGE=en
   WHISPER_TASK=transcribe
   WHISPER_DEVICE=cuda  # or cpu
   ```

3. Start the services:
   ```bash
   docker-compose up -d
   ```

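Once the containers are up, you can sanity-check that the speech services started cleanly using standard docker-compose commands (service names depend on your compose file):

```bash
# List services and their current state
docker-compose ps

# Follow recent logs; speech-related lines carry the [SPEECH] tag
docker-compose logs -f --tail=100
```
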
## Usage

### Wake Word Detection

The wake word detector continuously listens for configured trigger phrases. The default wake words are:

- "hey jarvis"
- "ok google"
- "alexa"

Custom wake words can be configured:

```bash
WAKE_WORDS=computer,jarvis,assistant
```

When a wake word is detected, the following happens (see the sketch after this list):

1. The system starts recording audio
2. The audio is processed through the speech-to-text pipeline
3. The resulting command is processed by the server

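For illustration, here is that flow driven through the client's event API; the `SpeechService` class and event names come from the Event Handling section below, so treat this as a sketch rather than the only way to wire it up:

```typescript
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Step 1: the wake word fires and recording begins
speech.on('wakeWord', (word: string) => {
  console.log(`Wake word "${word}" detected, recording audio...`);
});

// Step 2: the captured audio enters the speech-to-text pipeline
speech.on('transcribing', () => {
  console.log('Transcribing captured audio...');
});

// Step 3: the transcribed text is ready to be processed as a command
speech.on('transcribed', (text: string) => {
  console.log(`Command received: ${text}`);
});
```
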
### Speech-to-Text

#### Automatic Transcription

After wake word detection:

1. Audio is automatically captured (default: 5 seconds)
2. The audio is transcribed using the configured Whisper model
3. The transcribed text is processed as a command

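If five seconds is too short for your typical commands, the capture window is usually configurable; the variable name below is hypothetical — check `.env.example` for the setting this project actually uses:

```bash
# Hypothetical name: confirm the real variable in .env.example.
# Lengthens the post-wake-word recording window.
SPEECH_CAPTURE_DURATION=8  # seconds
```
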
#### Manual Transcription

You can also transcribe audio manually through the API.

Using the TypeScript client:

```typescript
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Transcribe from an audio buffer
// (getAudioBuffer() stands in for however you obtain the raw audio data)
const buffer = await getAudioBuffer();
const text = await speech.transcribe(buffer);

// Transcribe from a file
const textFromFile = await speech.transcribeFile('command.wav');
```

Or using the REST API directly:

```http
POST /api/speech/transcribe
Content-Type: multipart/form-data

file: <audio file>
```

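The same endpoint can be exercised with curl. Host, port, and the bearer token are placeholders here — the endpoints require authentication (see Security Considerations), but the exact scheme depends on your deployment:

```bash
# Upload a WAV file for transcription; adjust the URL and auth to your setup.
curl -X POST http://localhost:3000/api/speech/transcribe \
  -H "Authorization: Bearer $HA_MCP_TOKEN" \
  -F "file=@command.wav"
```
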
### Event Handling

The system emits various events during speech processing:

```typescript
speech.on('wakeWord', (word: string) => {
  console.log(`Wake word detected: ${word}`);
});

speech.on('listening', () => {
  console.log('Listening for command...');
});

speech.on('transcribing', () => {
  console.log('Processing speech...');
});

speech.on('transcribed', (text: string) => {
  console.log(`Transcribed text: ${text}`);
});

speech.on('error', (error: Error) => {
  console.error('Speech processing error:', error);
});
```

## Performance Optimization

### Model Selection

Choose an appropriate model based on your needs:

1. Resource-constrained environments:
   - Use `tiny.en` or `base.en`
   - Run on the CPU if no GPU is available
   - Limit concurrent processing

2. High-accuracy requirements:
   - Use `small.en` or `medium.en`
   - Enable GPU acceleration
   - Increase audio quality

3. Production environments (a sample profile follows this list):
   - Use `base.en` or `small.en`
   - Enable GPU acceleration
   - Configure appropriate timeouts

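Putting those recommendations together, a plausible production profile using only the settings documented above (treat the values as a starting point, not a benchmark-backed prescription):

```bash
# English-only small model on GPU: a common accuracy/latency middle ground.
WHISPER_MODEL_TYPE=small.en
WHISPER_DEVICE=cuda
WHISPER_LANGUAGE=en
WHISPER_TASK=transcribe
```
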
### GPU Acceleration

When using GPU acceleration:

1. Monitor GPU memory usage:
   ```bash
   nvidia-smi -l 1
   ```

2. Adjust the model size if needed:
   ```bash
   WHISPER_MODEL_TYPE=small  # decrease if GPU memory is limited
   ```

3. Configure the processing device:
   ```bash
   WHISPER_DEVICE=cuda  # use the GPU
   WHISPER_DEVICE=cpu   # fall back to the CPU if no GPU is available
   ```

## Troubleshooting

### Common Issues

1. Wake word detection not working:
   - Check microphone permissions
   - Adjust `WAKE_WORD_SENSITIVITY` (see the example after this list)
   - Verify the wake word configuration

2. Poor transcription quality:
   - Check the audio input quality
   - Try a larger model
   - Verify the language settings

3. Performance issues:
   - Monitor resource usage
   - Consider a smaller model
   - Check whether GPU acceleration is active

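When tuning sensitivity, the 0–1 scale and the direction shown below are assumptions based on common wake word engines — confirm both against `.env.example`:

```bash
# Assumed 0–1 scale: higher values trigger more readily but also
# produce more false activations; lower values are stricter.
WAKE_WORD_SENSITIVITY=0.7
```
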
### Logging

Enable debug logging for detailed information:

```bash
LOG_LEVEL=debug
```

Speech-specific logs are tagged with the `[SPEECH]` prefix.

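That prefix makes it easy to isolate the speech pipeline in a busy log stream, for example:

```bash
# Show only speech-related lines from the running stack
docker-compose logs -f | grep '\[SPEECH\]'
```
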
## Security Considerations

1. Audio Privacy:
   - Audio is processed locally
   - No data is sent to external services
   - Temporary files are cleaned up automatically

2. Access Control:
   - Speech endpoints require authentication
   - Rate limiting applies to transcription requests
   - Command restrictions are configurable

3. Resource Protection:
   - Timeouts prevent hung operations
   - Memory limits are enforced
   - Errors are handled gracefully