docs: Add comprehensive speech features documentation and configuration

- Introduce detailed documentation for speech processing capabilities
- Add new speech features documentation in `docs/features/speech.md`
- Update README with speech feature highlights and prerequisites
- Expand configuration documentation with speech-related settings
- Include model selection, GPU acceleration, and best practices guidance
2025-02-06 04:30:20 +01:00
3 changed files with 425 additions and 26 deletions
--- a/README.md
+++ b/README.md
@@ -12,12 +12,22 @@ MCP (Model Context Protocol) Server is a lightweight integration tool for Home A
 - 📡 WebSocket/Server-Sent Events (SSE) for state updates
 - 🤖 Simple automation rule management
 - 🔐 JWT-based authentication
+- 🎤 Real-time device control and monitoring
+- 🎤 Server-Sent Events (SSE) for live updates
+- 🎤 Comprehensive logging
+- 🎤 Optional speech features:
+  - 🎤 Wake word detection ("hey jarvis", "ok google", "alexa")
+  - 🎤 Speech-to-text using fast-whisper
+  - 🎤 Multiple language support
+  - 🎤 GPU acceleration support

 ## Prerequisites 📋

 - 🚀 Bun runtime (v1.0.26+)
 - 🏡 Home Assistant instance
-- 🐳 Docker (optional, recommended for deployment)
+- 🐳 Docker (optional, recommended for deployment and speech features)
+- 🖥️ Node.js 18+ (optional, for speech features)
+- 🖥️ NVIDIA GPU with CUDA support (optional, for faster speech processing)

 ## Installation 🛠️

@@ -30,7 +40,7 @@ cd homeassistant-mcp

 # Copy and edit environment configuration
 cp .env.example .env
-# Edit .env with your Home Assistant credentials
+# Edit .env with your Home Assistant credentials and speech features settings

 # Build and start containers
 docker compose up -d --build
@@ -79,33 +89,69 @@ ws.onmessage = (event) => {
 };
 ```

-## Current Limitations ⚠️
+## Speech Features (Optional)

-- 🎙️ Basic voice command support (work in progress)
-- 🧠 Limited advanced NLP capabilities
-- 🔗 Minimal third-party device integration
-- 🐛 Early-stage error handling
+The MCP Server includes optional speech processing capabilities.

-## Contributing 🤝
+### Prerequisites
+1. Docker installed and running
+2. NVIDIA GPU with CUDA support (optional)
+3. At least 4GB RAM (8GB+ recommended for larger models)

-1. Fork the repository
-2. Create a feature branch:
-   ```bash
-   git checkout -b feature/your-feature
-   ```
-3. Make your changes
-4. Run tests:
-   ```bash
-   bun test
-   ```
-5. Submit a pull request
+### Setup

-## Roadmap 🗺️
+1. Enable speech features in your .env:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+```

-- 🎤 Enhance voice command processing
-- 🔌 Improve device compatibility
-- 🤖 Expand automation capabilities
-- 🛡️ Implement more robust error handling
+2. Start the speech services:
+```bash
+docker-compose up -d
+```
+
+### Available Models
+
+Choose a model based on your needs:
+- `tiny.en`: Fastest, basic accuracy
+- `base.en`: Good balance (recommended)
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, resource intensive
+- `large-v2`: Best accuracy, very resource intensive
+
+### Usage
+
+1. Wake word detection listens for:
+   - "hey jarvis"
+   - "ok google"
+   - "alexa"
+
+2. After wake word detection:
+   - Audio is automatically captured
+   - Speech is transcribed
+   - Commands are processed
+
+3. Manual transcription is also available:
+```typescript
+const speech = speechService.getSpeechToText();
+const text = await speech.transcribe(audioBuffer);
+```
+
+## Configuration
+
+See [Configuration Guide](docs/configuration.md) for detailed settings.
+
+## API Documentation
+
+See [API Documentation](docs/api/index.md) for available endpoints.
+
+## Development
+
+See [Development Guide](docs/development/index.md) for contribution guidelines.

 ## License 📄

--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -34,6 +34,14 @@ JWT_SECRET=your_secret_key
  - `MAX_CLIENTS`: Maximum concurrent clients (default: 1000)
  - `PING_INTERVAL`: Keep-alive ping interval in ms (default: 30000)

+### Speech Features (Optional)
+- `ENABLE_SPEECH_FEATURES`: Enable speech processing features (default: false)
+- `ENABLE_WAKE_WORD`: Enable wake word detection (default: false)
+- `ENABLE_SPEECH_TO_TEXT`: Enable speech-to-text conversion (default: false)
+- `WHISPER_MODEL_PATH`: Path to Whisper models directory (default: /models)
+- `WHISPER_MODEL_TYPE`: Whisper model type (default: base)
+  - Available models: tiny.en, base.en, small.en, medium.en, large-v2
+
 ## Environment Variables

 All configuration is managed through environment variables:
@@ -57,6 +65,13 @@ LOG_MAX_SIZE=20m
 LOG_MAX_DAYS=14d
 LOG_COMPRESS=true
 LOG_REQUESTS=true
+
+# Speech Features (Optional)
+ENABLE_SPEECH_FEATURES=false
+ENABLE_WAKE_WORD=false
+ENABLE_SPEECH_TO_TEXT=false
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
 ```

 ## Advanced Configuration
@@ -86,6 +101,26 @@ LOGGING: {
 }
 ```

+### Speech-to-Text Configuration
+When speech features are enabled, you can configure the following options:
+
+```typescript
+SPEECH: {
+  ENABLED: false,  // Master switch for all speech features
+  WAKE_WORD_ENABLED: false,  // Enable wake word detection
+  SPEECH_TO_TEXT_ENABLED: false,  // Enable speech-to-text
+  WHISPER_MODEL_PATH: "/models",  // Path to Whisper models
+  WHISPER_MODEL_TYPE: "base",  // Model type to use
+}
+```
+
+Available Whisper models:
+- `tiny.en`: Fastest, lowest accuracy
+- `base.en`: Good balance of speed and accuracy
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, much slower
+- `large-v2`: Best accuracy, very slow
+
 For production deployments, we recommend using system tools like `logrotate` for log management.

 Example logrotate configuration (`/etc/logrotate.d/mcp-server`):
@@ -109,13 +144,15 @@ Example logrotate configuration (`/etc/logrotate.d/mcp-server`):
 4. Enable SSL/TLS in production (preferably via reverse proxy)
 5. Monitor log files for issues
 6. Regularly rotate logs in production
+7. Start with smaller Whisper models and upgrade if needed
+8. Consider GPU acceleration for larger Whisper models

 ## Validation

 The server validates configuration on startup using Zod schemas:
 - Required fields are checked (e.g., HASS_TOKEN)
 - Value types are verified
-- Enums are validated (e.g., LOG_LEVEL)
+- Enums are validated (e.g., LOG_LEVEL, WHISPER_MODEL_TYPE)
 - Default values are applied when not specified

 ## Troubleshooting
@@ -125,5 +162,109 @@ Common configuration issues:
 2. Invalid environment variable values
 3. Permission issues with log directories
 4. Rate limiting too restrictive
+5. Speech model loading failures
+6. Docker not available for speech features
+7. Insufficient system resources for larger models

 See the [Troubleshooting Guide](troubleshooting.md) for solutions.
+
+# Configuration Guide
+
+This document describes all available configuration options for the Home Assistant MCP Server.
+
+## Environment Variables
+
+### Required Settings
+
+```bash
+# Server Configuration
+PORT=3000                      # Server port
+HOST=localhost                 # Server host
+
+# Home Assistant
+HASS_URL=http://localhost:8123 # Home Assistant URL
+HASS_TOKEN=your_token         # Long-lived access token
+
+# Security
+JWT_SECRET=your_secret        # JWT signing secret
+```
+
+### Optional Settings
+
+```bash
+# Rate Limiting
+RATE_LIMIT_WINDOW=60000       # Time window in ms (default: 60000)
+RATE_LIMIT_MAX=100           # Max requests per window (default: 100)
+
+# Logging
+LOG_LEVEL=info               # debug, info, warn, error (default: info)
+LOG_DIR=logs                 # Log directory (default: logs)
+LOG_MAX_SIZE=10m            # Max log file size (default: 10m)
+LOG_MAX_FILES=5             # Max number of log files (default: 5)
+
+# WebSocket/SSE
+WS_HEARTBEAT=30000          # WebSocket heartbeat interval in ms (default: 30000)
+SSE_RETRY=3000             # SSE retry interval in ms (default: 3000)
+
+# Speech Features
+ENABLE_SPEECH_FEATURES=false # Enable speech processing (default: false)
+ENABLE_WAKE_WORD=false      # Enable wake word detection (default: false)
+ENABLE_SPEECH_TO_TEXT=false # Enable speech-to-text (default: false)
+
+# Speech Model Configuration
+WHISPER_MODEL_PATH=/models  # Path to whisper models (default: /models)
+WHISPER_MODEL_TYPE=base     # Model type: tiny|base|small|medium|large-v2 (default: base)
+WHISPER_LANGUAGE=en        # Primary language (default: en)
+WHISPER_TASK=transcribe    # Task type: transcribe|translate (default: transcribe)
+WHISPER_DEVICE=cuda        # Processing device: cpu|cuda (default: cuda if available, else cpu)
+
+# Wake Word Configuration
+WAKE_WORDS=hey jarvis,ok google,alexa  # Comma-separated wake words (default: hey jarvis)
+WAKE_WORD_SENSITIVITY=0.5   # Detection sensitivity 0-1 (default: 0.5)
+```
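As an illustration outside the diff itself, the speech-related variables listed above could be read and checked roughly as follows. This is a dependency-free sketch, not the project's Zod schema: `SpeechConfig`, `loadSpeechConfig`, and the `parse*` helpers are hypothetical names, and falling back to `cpu` when `WHISPER_DEVICE` is unset is an assumption, since GPU detection is not shown here.

```typescript
// Hypothetical helper: the type and function names are illustrative.
// Only the variable names, allowed values, and defaults come from the listing above.
type WhisperDevice = "cpu" | "cuda";
type WhisperTask = "transcribe" | "translate";

interface SpeechConfig {
  enabled: boolean;
  wakeWordEnabled: boolean;
  speechToTextEnabled: boolean;
  modelPath: string;
  modelType: string;
  language: string;
  task: WhisperTask;
  device: WhisperDevice;
}

const MODEL_TYPES = ["tiny", "base", "small", "medium", "large-v2"];

// Treats only the string "true" (any case) as true; unset or empty uses the fallback.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  return value.trim().toLowerCase() === "true";
}

function parseTask(value: string | undefined): WhisperTask {
  if (value === undefined || value === "transcribe") return "transcribe";
  if (value === "translate") return "translate";
  throw new Error(`Unsupported WHISPER_TASK: ${value}`);
}

// Assumption: without GPU detection, an unset WHISPER_DEVICE falls back to "cpu".
function parseDevice(value: string | undefined): WhisperDevice {
  if (value === undefined || value === "cpu") return "cpu";
  if (value === "cuda") return "cuda";
  throw new Error(`Unsupported WHISPER_DEVICE: ${value}`);
}

function loadSpeechConfig(env: Record<string, string | undefined>): SpeechConfig {
  const modelType = env.WHISPER_MODEL_TYPE ?? "base";
  if (MODEL_TYPES.indexOf(modelType) === -1) {
    throw new Error(`Unsupported WHISPER_MODEL_TYPE: ${modelType}`);
  }
  return {
    enabled: parseBool(env.ENABLE_SPEECH_FEATURES, false),
    wakeWordEnabled: parseBool(env.ENABLE_WAKE_WORD, false),
    speechToTextEnabled: parseBool(env.ENABLE_SPEECH_TO_TEXT, false),
    modelPath: env.WHISPER_MODEL_PATH ?? "/models",
    modelType,
    language: env.WHISPER_LANGUAGE ?? "en",
    task: parseTask(env.WHISPER_TASK),
    device: parseDevice(env.WHISPER_DEVICE),
  };
}
```

In a Bun or Node.js process this would typically be called as `loadSpeechConfig(process.env)`. The accepted model names follow the `tiny|base|small|medium|large-v2` list in the comment above; the `.en` variants mentioned elsewhere would need to be added if the server accepts them.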
+
+## Speech Features
+
+### Model Selection
+
+Choose a model based on your needs:
+
+| Model      | Size  | Memory Required | Speed | Accuracy |
+|------------|-------|-----------------|-------|----------|
+| tiny.en    | 75MB  | 1GB            | Fast  | Basic    |
+| base.en    | 150MB | 2GB            | Good  | Good     |
+| small.en   | 500MB | 4GB            | Med   | Better   |
+| medium.en  | 1.5GB | 8GB            | Slow  | High     |
+| large-v2   | 3GB   | 16GB           | Slow  | Best     |
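As an illustration outside the diff itself, the memory column of the table above can be turned into a simple upper bound on model choice. The function name `largestModelThatFits` and the `WhisperModel` type are hypothetical; the memory figures are copied from the table and are not independently measured.

```typescript
interface WhisperModel {
  name: string;
  memoryGb: number;
}

// Memory figures copied from the table above, ordered smallest to largest.
const MODELS: WhisperModel[] = [
  { name: "tiny.en", memoryGb: 1 },
  { name: "base.en", memoryGb: 2 },
  { name: "small.en", memoryGb: 4 },
  { name: "medium.en", memoryGb: 8 },
  { name: "large-v2", memoryGb: 16 },
];

// Returns the largest model whose listed memory requirement fits,
// or undefined when even the smallest model does not fit.
function largestModelThatFits(availableGb: number): string | undefined {
  let best: string | undefined;
  for (const model of MODELS) {
    if (model.memoryGb <= availableGb) best = model.name;
  }
  return best;
}
```

The result is a ceiling rather than a recommendation: a host with 6GB free could run `small.en`, but starting with `base.en` and moving up only if accuracy is insufficient may still be the better choice.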
+
+### GPU Acceleration
+
+When `WHISPER_DEVICE=cuda`:
+- NVIDIA GPU with CUDA support required
+- Significantly faster processing
+- Higher memory requirements
+
+### Wake Word Detection
+
+- Multiple wake words supported via comma-separated list
+- Adjustable sensitivity (0-1):
+  - Lower values: Fewer false positives, may miss some triggers
+  - Higher values: More responsive, may have false triggers
+  - Default (0.5): Balanced detection
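As an illustration outside the diff itself, one way to read the sensitivity setting is as the complement of a score threshold. This sketch assumes the detector reports a confidence score between 0 and 1 for each phrase and that a detection fires when the score reaches `1 - sensitivity`; the actual detector may map sensitivity differently. `parseWakeWords` and `shouldTrigger` are hypothetical names.

```typescript
// Splits a comma-separated WAKE_WORDS value into trimmed, lowercased phrases.
function parseWakeWords(raw: string): string[] {
  return raw
    .split(",")
    .map((word) => word.trim().toLowerCase())
    .filter((word) => word.length > 0);
}

// Assumed mapping: higher sensitivity lowers the score needed to trigger.
function shouldTrigger(score: number, sensitivity: number): boolean {
  if (sensitivity < 0 || sensitivity > 1) {
    throw new RangeError("WAKE_WORD_SENSITIVITY must be between 0 and 1");
  }
  const threshold = 1 - sensitivity;
  return score >= threshold;
}
```

Under this assumption a score of 0.6 triggers at the default sensitivity of 0.5 but not at 0.2, which matches the trade-off listed above: lower values miss more triggers, higher values accept weaker matches.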
+
+### Best Practices
+
+1. Model Selection:
+   - Start with the `base.en` model
+   - Move to a larger model if you need better accuracy
+   - Move to a smaller model if you hit performance issues
+
+2. Resource Management:
+   - Monitor memory usage
+   - Use GPU acceleration when available
+   - Consider model size vs available resources
+
+3. Wake Word Configuration:
+   - Use distinct wake words
+   - Adjust sensitivity based on environment
+   - Limit number of wake words for better performance
--- a/docs/features/speech.md
+++ b/docs/features/speech.md
@@ -0,0 +1,212 @@
+# Speech Features
+
+The Home Assistant MCP Server includes speech processing capabilities built on fast-whisper and custom wake word detection. This guide explains how to set up and use them.
+
+## Overview
+
+The speech processing system consists of two main components:
+1. Wake Word Detection - Listens for specific trigger phrases
+2. Speech-to-Text - Transcribes spoken commands using fast-whisper
+
+## Setup
+
+### Prerequisites
+
+1. Docker environment:
+```bash
+docker --version  # Should be 20.10.0 or higher
+```
+
+2. For GPU acceleration:
+- NVIDIA GPU with CUDA support
+- NVIDIA Container Toolkit installed
+- NVIDIA drivers 450.80.02 or higher
+
+### Installation
+
+1. Enable speech features in your `.env`:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+```
+
+2. Configure model settings:
+```bash
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+WHISPER_LANGUAGE=en
+WHISPER_TASK=transcribe
+WHISPER_DEVICE=cuda  # or cpu
+```
+
+3. Start the services:
+```bash
+docker-compose up -d
+```
+
+## Usage
+
+### Wake Word Detection
+
+The wake word detector continuously listens for configured trigger phrases. Default wake words:
+- "hey jarvis"
+- "ok google"
+- "alexa"
+
+Custom wake words can be configured:
+```bash
+WAKE_WORDS=computer,jarvis,assistant
+```
+
+When a wake word is detected:
+1. The system starts recording audio
+2. Audio is processed through the speech-to-text pipeline
+3. The resulting command is processed by the server
+
+### Speech-to-Text
+
+#### Automatic Transcription
+
+After wake word detection:
+1. Audio is automatically captured (default: 5 seconds)
+2. The audio is transcribed using the configured whisper model
+3. The transcribed text is processed as a command
+
+#### Manual Transcription
+
+You can also manually transcribe audio using the API:
+
+```typescript
+// Using the TypeScript client
+import { SpeechService } from '@ha-mcp/client';
+
+const speech = new SpeechService();
+
+// Transcribe from audio buffer
+const buffer = await getAudioBuffer();
+const text = await speech.transcribe(buffer);
+
+// Transcribe from file
+const fileText = await speech.transcribeFile('command.wav');
+```
+
+```javascript
+// Using the REST API
+POST /api/speech/transcribe
+Content-Type: multipart/form-data
+
+file: <audio file>
+```
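As an illustration outside the diff itself, the multipart request above could be assembled from TypeScript with the standard `fetch` primitives available in Bun and Node.js 18+. The `file` field name and the endpoint path come from the snippet above; the `Bearer` authorization header is an assumption based on the server's JWT authentication, and `buildTranscribeRequest` is a hypothetical helper. The response format is not documented here, so the sketch stops at building the request.

```typescript
interface TranscribeRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: FormData;
}

// Builds, but does not send, the multipart upload for /api/speech/transcribe.
function buildTranscribeRequest(
  baseUrl: string,
  token: string,
  audio: Blob,
  filename: string,
): TranscribeRequest {
  const form = new FormData();
  form.append("file", audio, filename);
  return {
    url: new URL("/api/speech/transcribe", baseUrl).toString(),
    method: "POST",
    // Content-Type is left unset so fetch can add the multipart boundary itself.
    headers: { Authorization: `Bearer ${token}` },
    body: form,
  };
}
```

The request can then be sent with `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.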
+
+### Event Handling
+
+The system emits various events during speech processing:
+
+```typescript
+speech.on('wakeWord', (word: string) => {
+  console.log(`Wake word detected: ${word}`);
+});
+
+speech.on('listening', () => {
+  console.log('Listening for command...');
+});
+
+speech.on('transcribing', () => {
+  console.log('Processing speech...');
+});
+
+speech.on('transcribed', (text: string) => {
+  console.log(`Transcribed text: ${text}`);
+});
+
+speech.on('error', (error: Error) => {
+  console.error('Speech processing error:', error);
+});
+```
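As an illustration outside the diff itself, the events above can be read as stages of a single pipeline run. The sketch below assumes they fire in the order listed (`wakeWord`, `listening`, `transcribing`, then `transcribed` or `error`), which the documentation implies but does not state. `TinyEmitter` and `runPipeline` are hypothetical stand-ins, not the project's `SpeechService`.

```typescript
type Handler = (payload?: unknown) => void;

// Minimal stand-in for an event emitter, kept local so the sketch is self-contained.
class TinyEmitter {
  private handlers = new Map<string, Handler[]>();

  on(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit(event: string, payload?: unknown): void {
    for (const handler of this.handlers.get(event) ?? []) handler(payload);
  }
}

// Hypothetical pipeline: emits the documented events in their assumed order.
async function runPipeline(
  emitter: TinyEmitter,
  wakeWord: string,
  transcribe: () => Promise<string>,
): Promise<void> {
  emitter.emit("wakeWord", wakeWord);
  emitter.emit("listening");
  try {
    emitter.emit("transcribing");
    emitter.emit("transcribed", await transcribe());
  } catch (err) {
    // Failures surface through the "error" event instead of rejecting.
    emitter.emit("error", err instanceof Error ? err : new Error(String(err)));
  }
}
```

Routing failures to the `error` event keeps a single subscription point for callers, mirroring the `speech.on('error', ...)` handler shown above.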
+
+## Performance Optimization
+
+### Model Selection
+
+Choose an appropriate model based on your needs:
+
+1. Resource-constrained environments:
+   - Use `tiny.en` or `base.en`
+   - Run on CPU if GPU unavailable
+   - Limit concurrent processing
+
+2. High-accuracy requirements:
+   - Use `small.en` or `medium.en`
+   - Enable GPU acceleration
+   - Increase audio quality
+
+3. Production environments:
+   - Use `base.en` or `small.en`
+   - Enable GPU acceleration
+   - Configure appropriate timeouts
+
+### GPU Acceleration
+
+When using GPU acceleration:
+
+1. Monitor GPU memory usage:
+```bash
+nvidia-smi -l 1
+```
+
+2. Adjust model size if needed:
+```bash
+WHISPER_MODEL_TYPE=small  # Decrease if GPU memory limited
+```
+
+3. Configure processing device:
+```bash
+WHISPER_DEVICE=cuda      # Use GPU
+WHISPER_DEVICE=cpu      # Use CPU if GPU unavailable
+```
+
+## Troubleshooting
+
+### Common Issues
+
+1. Wake word detection not working:
+   - Check microphone permissions
+   - Adjust `WAKE_WORD_SENSITIVITY`
+   - Verify wake words configuration
+
+2. Poor transcription quality:
+   - Check audio input quality
+   - Try a larger model
+   - Verify language settings
+
+3. Performance issues:
+   - Monitor resource usage
+   - Consider smaller model
+   - Check GPU acceleration status
+
+### Logging
+
+Enable debug logging for detailed information:
+```bash
+LOG_LEVEL=debug
+```
+
+Speech-specific logs are tagged with the `[SPEECH]` prefix.
+
+## Security Considerations
+
+1. Audio Privacy:
+   - Audio is processed locally
+   - No data sent to external services
+   - Temporary files automatically cleaned
+
+2. Access Control:
+   - Speech endpoints require authentication
+   - Rate limiting applies to transcription
+   - Configurable command restrictions
+
+3. Resource Protection:
+   - Timeouts prevent hanging
+   - Memory limits enforced
+   - Graceful error handling