docs: Add comprehensive speech features documentation and configuration

- Introduce detailed documentation for speech processing capabilities
- Add new speech features documentation in `docs/features/speech.md`
- Update README with speech feature highlights and prerequisites
- Expand configuration documentation with speech-related settings
- Include model selection, GPU acceleration, and best practices guidance
docs: Update configuration documentation to use environment variables
2025-02-06 04:30:20 +01:00 · 2025-02-06 04:25:35 +01:00
3 changed files with 509 additions and 87 deletions
--- a/README.md
+++ b/README.md
@@ -12,12 +12,22 @@ MCP (Model Context Protocol) Server is a lightweight integration tool for Home A
 - 📡 WebSocket/Server-Sent Events (SSE) for state updates
 - 🤖 Simple automation rule management
 - 🔐 JWT-based authentication
+- 🎤 Real-time device control and monitoring
+- 🎤 Server-Sent Events (SSE) for live updates
+- 🎤 Comprehensive logging
+- 🎤 Optional speech features:
+  - 🎤 Wake word detection ("hey jarvis", "ok google", "alexa")
+  - 🎤 Speech-to-text using fast-whisper
+  - 🎤 Multiple language support
+  - 🎤 GPU acceleration support

 ## Prerequisites 📋

 - 🚀 Bun runtime (v1.0.26+)
 - 🏡 Home Assistant instance
- 🐳 Docker (optional, recommended for deployment)
+- 🐳 Docker (optional, recommended for deployment and speech features)
+- 🖥️ Node.js 18+ (optional, for speech features)
+- 🖥️ NVIDIA GPU with CUDA support (optional, for faster speech processing)

 ## Installation 🛠️

@@ -30,7 +40,7 @@ cd homeassistant-mcp

 # Copy and edit environment configuration
 cp .env.example .env
-# Edit .env with your Home Assistant credentials
+# Edit .env with your Home Assistant credentials and speech features settings

 # Build and start containers
 docker compose up -d --build
@@ -79,33 +89,69 @@ ws.onmessage = (event) => {
 };
 ```

-## Current Limitations ⚠️
+## Speech Features (Optional)

- 🎙️ Basic voice command support (work in progress)
- 🧠 Limited advanced NLP capabilities
- 🔗 Minimal third-party device integration
- 🐛 Early-stage error handling
+The MCP Server includes optional speech processing capabilities:

-## Contributing 🤝
+### Prerequisites
+1. Docker installed and running
+2. NVIDIA GPU with CUDA support (optional)
+3. At least 4GB RAM (8GB+ recommended for larger models)

-1. Fork the repository
-2. Create a feature branch:
-   ```bash
-   git checkout -b feature/your-feature
-   ```
-3. Make your changes
-4. Run tests:
-   ```bash
-   bun test
-   ```
-5. Submit a pull request
+### Setup

-## Roadmap 🗺️
+1. Enable speech features in your .env:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+```

- 🎤 Enhance voice command processing
- 🔌 Improve device compatibility
- 🤖 Expand automation capabilities
- 🛡️ Implement more robust error handling
+2. Start the speech services:
+```bash
+docker compose up -d
+```
+
+### Available Models
+
+Choose a model based on your needs:
+- `tiny.en`: Fastest, basic accuracy
+- `base.en`: Good balance (recommended)
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, resource intensive
+- `large-v2`: Best accuracy, very resource intensive
+
+### Usage
+
+1. Wake word detection listens for:
+   - "hey jarvis"
+   - "ok google"
+   - "alexa"
+
+2. After wake word detection:
+   - Audio is automatically captured
+   - Speech is transcribed
+   - Commands are processed
+
+3. Manual transcription is also available:
+```typescript
+const speech = speechService.getSpeechToText();
+const text = await speech.transcribe(audioBuffer);
+```
+
+## Configuration
+
+See [Configuration Guide](docs/configuration.md) for detailed settings.
+
+## API Documentation
+
+See [API Documentation](docs/api/index.md) for available endpoints.
+
+## Development
+
+See [Development Guide](docs/development/index.md) for contribution guidelines.

 ## License 📄

--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -4,103 +4,267 @@ This document provides detailed information about configuring the Home Assistant

 ## Configuration File Structure

-The MCP Server uses a hierarchical configuration structure:
+The MCP Server uses environment variables for configuration, with support for different environments (development, test, production):

-```yaml
-server:
-  host: 0.0.0.0
-  port: 8123
-  log_level: INFO
-
-security:
-  jwt_secret: YOUR_SECRET_KEY
-  allowed_origins:
-    - http://localhost:3000
-    - https://your-domain.com
-
-devices:
-  scan_interval: 30
-  default_timeout: 10
+```bash
+# .env, .env.development, or .env.test
+PORT=4000
+NODE_ENV=development
+HASS_HOST=http://192.168.178.63:8123
+HASS_TOKEN=your_token_here
+JWT_SECRET=your_secret_key
 ```

 ## Server Settings

 ### Basic Server Configuration
- `host`: Server binding address (default: 0.0.0.0)
- `port`: Server port number (default: 8123)
- `log_level`: Logging level (INFO, DEBUG, WARNING, ERROR)
+- `PORT`: Server port number (default: 4000)
+- `NODE_ENV`: Environment mode (development, production, test)
+- `HASS_HOST`: Home Assistant instance URL
+- `HASS_TOKEN`: Home Assistant long-lived access token

 ### Security Settings
- `jwt_secret`: Secret key for JWT token generation
- `allowed_origins`: CORS allowed origins list
- `ssl_cert`: Path to SSL certificate (optional)
- `ssl_key`: Path to SSL private key (optional)
+- `JWT_SECRET`: Secret key for JWT token generation
+- `RATE_LIMIT`: Rate limiting configuration
+  - `windowMs`: Time window in milliseconds (default: 15 minutes)
+  - `max`: Maximum requests per window (default: 100)

-### Device Management
- `scan_interval`: Device state scan interval in seconds
- `default_timeout`: Default device command timeout
- `retry_attempts`: Number of retry attempts for failed commands
+### Server-Sent Events (SSE) Settings
+- `SSE`: Server-Sent Events configuration
+  - `MAX_CLIENTS`: Maximum concurrent clients (default: 1000)
+  - `PING_INTERVAL`: Keep-alive ping interval in ms (default: 30000)
+
+### Speech Features (Optional)
+- `ENABLE_SPEECH_FEATURES`: Enable speech processing features (default: false)
+- `ENABLE_WAKE_WORD`: Enable wake word detection (default: false)
+- `ENABLE_SPEECH_TO_TEXT`: Enable speech-to-text conversion (default: false)
+- `WHISPER_MODEL_PATH`: Path to Whisper models directory (default: /models)
+- `WHISPER_MODEL_TYPE`: Whisper model type (default: base)
+  - Available models: tiny.en, base.en, small.en, medium.en, large-v2

 ## Environment Variables

-Environment variables override configuration file settings:
+All configuration is managed through environment variables:

 ```bash
-MCP_HOST=0.0.0.0
-MCP_PORT=8123
-MCP_LOG_LEVEL=INFO
-MCP_JWT_SECRET=your-secret-key
+# Server
+PORT=4000
+NODE_ENV=development
+
+# Home Assistant
+HASS_HOST=http://your-hass-instance:8123
+HASS_TOKEN=your_token_here
+
+# Security
+JWT_SECRET=your-secret-key
+
+# Logging
+LOG_LEVEL=info
+LOG_DIR=logs
+LOG_MAX_SIZE=20m
+LOG_MAX_DAYS=14d
+LOG_COMPRESS=true
+LOG_REQUESTS=true
+
+# Speech Features (Optional)
+ENABLE_SPEECH_FEATURES=false
+ENABLE_WAKE_WORD=false
+ENABLE_SPEECH_TO_TEXT=false
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
 ```

 ## Advanced Configuration

-### Rate Limiting
-```yaml
-rate_limit:
-  enabled: true
-  requests_per_minute: 100
-  burst: 20
-```
+### Security Rate Limiting
+Rate limiting is enabled by default to protect against brute force attacks:

-### Caching
-```yaml
-cache:
-  enabled: true
-  ttl: 300  # seconds
-  max_size: 1000  # entries
+```typescript
+RATE_LIMIT: {
+  windowMs: 15 * 60 * 1000,  // 15 minutes
+  max: 100  // limit each IP to 100 requests per window
+}
 ```

 ### Logging
-```yaml
-logging:
-  file: /var/log/mcp-server.log
-  max_size: 10MB
-  backup_count: 5
-  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+The server uses Bun's built-in logging capabilities with additional configuration:
+
+```typescript
+LOGGING: {
+  LEVEL: "info",  // debug, info, warn, error
+  DIR: "logs",
+  MAX_SIZE: "20m",
+  MAX_DAYS: "14d",
+  COMPRESS: true,
+  TIMESTAMP_FORMAT: "YYYY-MM-DD HH:mm:ss:ms",
+  LOG_REQUESTS: true
+}
+```
+
+### Speech-to-Text Configuration
+When speech features are enabled, you can configure the following options:
+
+```typescript
+SPEECH: {
+  ENABLED: false,  // Master switch for all speech features
+  WAKE_WORD_ENABLED: false,  // Enable wake word detection
+  SPEECH_TO_TEXT_ENABLED: false,  // Enable speech-to-text
+  WHISPER_MODEL_PATH: "/models",  // Path to Whisper models
+  WHISPER_MODEL_TYPE: "base",  // Model type to use
+}
+```
+
+Available Whisper models:
+- `tiny.en`: Fastest, lowest accuracy
+- `base.en`: Good balance of speed and accuracy
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, much slower
+- `large-v2`: Best accuracy, very slow
+
+For production deployments, we recommend using system tools like `logrotate` for log management.
+
+Example logrotate configuration (`/etc/logrotate.d/mcp-server`):
+```
+/var/log/mcp-server.log {
+    daily
+    rotate 7
+    compress
+    delaycompress
+    missingok
+    notifempty
+    create 644 mcp mcp
+}
 ```

 ## Best Practices

 1. Always use environment variables for sensitive information
-2. Keep configuration files in a secure location
-3. Regularly backup your configuration
-4. Use SSL in production environments
+2. Keep .env files secure and never commit them to version control
+3. Use different environment files for development, test, and production
+4. Enable SSL/TLS in production (preferably via reverse proxy)
 5. Monitor log files for issues
+6. Regularly rotate logs in production
+7. Start with smaller Whisper models and upgrade if needed
+8. Consider GPU acceleration for larger Whisper models

 ## Validation

-The server validates configuration on startup:
- Required fields are checked
+The server validates configuration on startup using Zod schemas:
+- Required fields are checked (e.g., HASS_TOKEN)
 - Value types are verified
- Ranges are validated
- Security settings are assessed
+- Enums are validated (e.g., LOG_LEVEL, WHISPER_MODEL_TYPE)
+- Default values are applied when not specified

 ## Troubleshooting

 Common configuration issues:
-1. Permission denied accessing files
-2. Invalid YAML syntax
-3. Missing required fields
-4. Type mismatches in values
+1. Missing required environment variables
+2. Invalid environment variable values
+3. Permission issues with log directories
+4. Rate limiting too restrictive
+5. Speech model loading failures
+6. Docker not available for speech features
+7. Insufficient system resources for larger models

-See the [Troubleshooting Guide](troubleshooting.md) for solutions. 
+See the [Troubleshooting Guide](troubleshooting.md) for solutions.
+
+# Configuration Guide
+
+This document describes all available configuration options for the Home Assistant MCP Server.
+
+## Environment Variables
+
+### Required Settings
+
+```bash
+# Server Configuration
+PORT=3000                      # Server port
+HOST=localhost                 # Server host
+
+# Home Assistant
+HASS_URL=http://localhost:8123 # Home Assistant URL
+HASS_TOKEN=your_token         # Long-lived access token
+
+# Security
+JWT_SECRET=your_secret        # JWT signing secret
+```
+
+### Optional Settings
+
+```bash
+# Rate Limiting
+RATE_LIMIT_WINDOW=60000       # Time window in ms (default: 60000)
+RATE_LIMIT_MAX=100           # Max requests per window (default: 100)
+
+# Logging
+LOG_LEVEL=info               # debug, info, warn, error (default: info)
+LOG_DIR=logs                 # Log directory (default: logs)
+LOG_MAX_SIZE=10m            # Max log file size (default: 10m)
+LOG_MAX_FILES=5             # Max number of log files (default: 5)
+
+# WebSocket/SSE
+WS_HEARTBEAT=30000          # WebSocket heartbeat interval in ms (default: 30000)
+SSE_RETRY=3000             # SSE retry interval in ms (default: 3000)
+
+# Speech Features
+ENABLE_SPEECH_FEATURES=false # Enable speech processing (default: false)
+ENABLE_WAKE_WORD=false      # Enable wake word detection (default: false)
+ENABLE_SPEECH_TO_TEXT=false # Enable speech-to-text (default: false)
+
+# Speech Model Configuration
+WHISPER_MODEL_PATH=/models  # Path to whisper models (default: /models)
+WHISPER_MODEL_TYPE=base     # Model type: tiny|base|small|medium|large-v2 (default: base)
+WHISPER_LANGUAGE=en        # Primary language (default: en)
+WHISPER_TASK=transcribe    # Task type: transcribe|translate (default: transcribe)
+WHISPER_DEVICE=cuda        # Processing device: cpu|cuda (default: cuda if available, else cpu)
+
+# Wake Word Configuration
+WAKE_WORDS=hey jarvis,ok google,alexa  # Comma-separated wake words (default: hey jarvis)
+WAKE_WORD_SENSITIVITY=0.5   # Detection sensitivity 0-1 (default: 0.5)
+```
+
+## Speech Features
+
+### Model Selection
+
+Choose a model based on your needs:
+
+| Model      | Size  | Memory Required | Speed | Accuracy |
+|------------|-------|-----------------|-------|----------|
+| tiny.en    | 75MB  | 1GB            | Fast  | Basic    |
+| base.en    | 150MB | 2GB            | Good  | Good     |
+| small.en   | 500MB | 4GB            | Med   | Better   |
+| medium.en  | 1.5GB | 8GB            | Slow  | High     |
+| large-v2   | 3GB   | 16GB           | Slow  | Best     |
+
+### GPU Acceleration
+
+When `WHISPER_DEVICE=cuda`:
+- NVIDIA GPU with CUDA support required
+- Significantly faster processing
+- Higher memory requirements
+
+### Wake Word Detection
+
+- Multiple wake words supported via comma-separated list
+- Adjustable sensitivity (0-1):
+  - Lower values: Fewer false positives, may miss some triggers
+  - Higher values: More responsive, may have false triggers
+  - Default (0.5): Balanced detection
+
+### Best Practices
+
+1. Model Selection:
+   - Start with `base.en` model
+   - Upgrade if better accuracy needed
+   - Downgrade if performance issues
+
+2. Resource Management:
+   - Monitor memory usage
+   - Use GPU acceleration when available
+   - Consider model size vs available resources
+
+3. Wake Word Configuration:
+   - Use distinct wake words
+   - Adjust sensitivity based on environment
+   - Limit number of wake words for better performance 
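Reviewer note on `docs/configuration.md`: the Validation section says the server checks enums, applies defaults, and validates values on startup using Zod schemas, but the diff does not show those schemas. The following is a minimal, dependency-free sketch of that kind of check for the speech-related variables, not the project's implementation. The names `parseSpeechConfig`, `SpeechConfig`, and `MODEL_TYPES` are hypothetical, and the accepted model values are assumed from the `WHISPER_MODEL_TYPE` comment in the optional settings block.

```typescript
// Hypothetical sketch: the project is described as using Zod; this uses plain
// TypeScript so it runs without dependencies.
type Env = { [name: string]: string | undefined };

// Assumed from the "tiny|base|small|medium|large-v2" comment in the diff.
const MODEL_TYPES = ["tiny", "base", "small", "medium", "large-v2"] as const;
type ModelType = (typeof MODEL_TYPES)[number];

interface SpeechConfig {
  enabled: boolean;
  modelType: ModelType;
  wakeWords: string[];
  wakeWordSensitivity: number;
}

// Accepts only the literal strings "true" and "false"; unset falls back.
function parseBoolean(name: string, raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw === "") return fallback;
  if (raw === "true") return true;
  if (raw === "false") return false;
  throw new Error(`${name} must be "true" or "false", got "${raw}"`);
}

function isModelType(value: string): value is ModelType {
  return (MODEL_TYPES as readonly string[]).indexOf(value) !== -1;
}

function parseSpeechConfig(env: Env): SpeechConfig {
  // Enum check with the documented default of "base".
  const modelType = env["WHISPER_MODEL_TYPE"] ?? "base";
  if (!isModelType(modelType)) {
    throw new Error(`WHISPER_MODEL_TYPE must be one of: ${MODEL_TYPES.join(", ")}`);
  }

  // Range check: the documented sensitivity range is 0 to 1, default 0.5.
  const sensitivity = Number(env["WAKE_WORD_SENSITIVITY"] ?? "0.5");
  if (!isFinite(sensitivity) || sensitivity < 0 || sensitivity > 1) {
    throw new Error("WAKE_WORD_SENSITIVITY must be a number between 0 and 1");
  }

  // Comma-separated list, trimmed and lower-cased; empty entries are dropped.
  const wakeWords = (env["WAKE_WORDS"] ?? "hey jarvis")
    .split(",")
    .map((word) => word.trim().toLowerCase())
    .filter((word) => word.length > 0);
  if (wakeWords.length === 0) {
    throw new Error("WAKE_WORDS must contain at least one wake word");
  }

  return {
    enabled: parseBoolean("ENABLE_SPEECH_FEATURES", env["ENABLE_SPEECH_FEATURES"], false),
    modelType,
    wakeWords,
    wakeWordSensitivity: sensitivity,
  };
}
```

In a server this would typically be called once at startup with the process environment, so that a bad value fails fast instead of surfacing later as a model loading error.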
--- a/docs/features/speech.md
+++ b/docs/features/speech.md
@@ -0,0 +1,212 @@
+# Speech Features
+
+The Home Assistant MCP Server includes speech processing capabilities built on fast-whisper and custom wake word detection. This guide explains how to set up and use them.
+
+## Overview
+
+The speech processing system consists of two main components:
+1. Wake Word Detection - Listens for specific trigger phrases
+2. Speech-to-Text - Transcribes spoken commands using fast-whisper
+
+## Setup
+
+### Prerequisites
+
+1. Docker environment:
+```bash
+docker --version  # Should be 20.10.0 or higher
+```
+
+2. For GPU acceleration:
+- NVIDIA GPU with CUDA support
+- NVIDIA Container Toolkit installed
+- NVIDIA drivers 450.80.02 or higher
+
+### Installation
+
+1. Enable speech features in your `.env`:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+```
+
+2. Configure model settings:
+```bash
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+WHISPER_LANGUAGE=en
+WHISPER_TASK=transcribe
+WHISPER_DEVICE=cuda  # or cpu
+```
+
+3. Start the services:
+```bash
+docker compose up -d
+```
+
+## Usage
+
+### Wake Word Detection
+
+The wake word detector continuously listens for configured trigger phrases. Default wake words:
+- "hey jarvis"
+- "ok google"
+- "alexa"
+
+Custom wake words can be configured:
+```bash
+WAKE_WORDS=computer,jarvis,assistant
+```
+
+When a wake word is detected:
+1. The system starts recording audio
+2. Audio is processed through the speech-to-text pipeline
+3. The resulting command is processed by the server
+
+### Speech-to-Text
+
+#### Automatic Transcription
+
+After wake word detection:
+1. Audio is automatically captured (default: 5 seconds)
+2. The audio is transcribed using the configured whisper model
+3. The transcribed text is processed as a command
+
+#### Manual Transcription
+
+You can also manually transcribe audio using the API:
+
+```typescript
+// Using the TypeScript client
+import { SpeechService } from '@ha-mcp/client';
+
+const speech = new SpeechService();
+
+// Transcribe from audio buffer
+const buffer = await getAudioBuffer();
+const text = await speech.transcribe(buffer);
+
+// Transcribe from file
+const fileText = await speech.transcribeFile('command.wav');
+```
+
+```http
+# Using the REST API
+POST /api/speech/transcribe
+Content-Type: multipart/form-data
+
+file: <audio file>
+```
+
+### Event Handling
+
+The system emits various events during speech processing:
+
+```typescript
+speech.on('wakeWord', (word: string) => {
+  console.log(`Wake word detected: ${word}`);
+});
+
+speech.on('listening', () => {
+  console.log('Listening for command...');
+});
+
+speech.on('transcribing', () => {
+  console.log('Processing speech...');
+});
+
+speech.on('transcribed', (text: string) => {
+  console.log(`Transcribed text: ${text}`);
+});
+
+speech.on('error', (error: Error) => {
+  console.error('Speech processing error:', error);
+});
+```
+
+## Performance Optimization
+
+### Model Selection
+
+Choose an appropriate model based on your needs:
+
+1. Resource-constrained environments:
+   - Use `tiny.en` or `base.en`
+   - Run on CPU if GPU unavailable
+   - Limit concurrent processing
+
+2. High-accuracy requirements:
+   - Use `small.en` or `medium.en`
+   - Enable GPU acceleration
+   - Increase audio quality
+
+3. Production environments:
+   - Use `base.en` or `small.en`
+   - Enable GPU acceleration
+   - Configure appropriate timeouts
+
+### GPU Acceleration
+
+When using GPU acceleration:
+
+1. Monitor GPU memory usage:
+```bash
+nvidia-smi -l 1
+```
+
+2. Adjust model size if needed:
+```bash
+WHISPER_MODEL_TYPE=small  # Decrease if GPU memory limited
+```
+
+3. Configure processing device:
+```bash
+WHISPER_DEVICE=cuda      # Use GPU
+WHISPER_DEVICE=cpu      # Use CPU if GPU unavailable
+```
+
+## Troubleshooting
+
+### Common Issues
+
+1. Wake word detection not working:
+   - Check microphone permissions
+   - Adjust `WAKE_WORD_SENSITIVITY`
+   - Verify wake words configuration
+
+2. Poor transcription quality:
+   - Check audio input quality
+   - Try a larger model
+   - Verify language settings
+
+3. Performance issues:
+   - Monitor resource usage
+   - Consider smaller model
+   - Check GPU acceleration status
+
+### Logging
+
+Enable debug logging for detailed information:
+```bash
+LOG_LEVEL=debug
+```
+
+Speech-specific log entries are tagged with a `[SPEECH]` prefix.
+
+## Security Considerations
+
+1. Audio Privacy:
+   - Audio is processed locally
+   - No data sent to external services
+   - Temporary files automatically cleaned
+
+2. Access Control:
+   - Speech endpoints require authentication
+   - Rate limiting applies to transcription
+   - Configurable command restrictions
+
+3. Resource Protection:
+   - Timeouts prevent hanging
+   - Memory limits enforced
+   - Graceful error handling
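Reviewer note on `docs/features/speech.md`: the Event Handling section lists the events (`wakeWord`, `listening`, `transcribing`, `transcribed`, `error`) without showing when each one fires relative to the capture and transcription steps. The sketch below illustrates one plausible ordering under stated assumptions. Only the event names come from the documentation; `SpeechPipeline`, `handleWakeWord`, and the injected `capture` and `transcribe` functions are hypothetical stand-ins and not the `@ha-mcp/client` API.

```typescript
// Hypothetical sketch: only the five event names come from docs/features/speech.md.
type SpeechEvent = "wakeWord" | "listening" | "transcribing" | "transcribed" | "error";
type Handler = (payload?: unknown) => void;
type Capture = () => Promise<Uint8Array>;
type Transcribe = (audio: Uint8Array) => Promise<string>;

class SpeechPipeline {
  private handlers: { [event: string]: Handler[] } = {};
  private capture: Capture;
  private transcribe: Transcribe;

  constructor(capture: Capture, transcribe: Transcribe) {
    this.capture = capture;
    this.transcribe = transcribe;
  }

  // Registers a listener for one of the documented events.
  on(event: SpeechEvent, handler: Handler): void {
    const list = this.handlers[event] || [];
    list.push(handler);
    this.handlers[event] = list;
  }

  private emit(event: SpeechEvent, payload?: unknown): void {
    for (const handler of this.handlers[event] || []) {
      handler(payload);
    }
  }

  // Runs one wake word -> capture -> transcribe cycle, emitting events in order.
  // Failures in either step are reported through the "error" event.
  async handleWakeWord(word: string): Promise<string | undefined> {
    this.emit("wakeWord", word);
    try {
      this.emit("listening");
      const audio = await this.capture();
      this.emit("transcribing");
      const text = await this.transcribe(audio);
      this.emit("transcribed", text);
      return text;
    } catch (err) {
      this.emit("error", err instanceof Error ? err : new Error(String(err)));
      return undefined;
    }
  }
}

// Usage with stubbed capture and transcription (no real audio involved).
const pipeline = new SpeechPipeline(
  async () => new Uint8Array(16),
  async () => "turn on the kitchen lights"
);
const seen: string[] = [];
pipeline.on("transcribed", (text) => seen.push(`transcribed:${String(text)}`));
void pipeline.handleWakeWord("hey jarvis");
```

Injecting the capture and transcription functions keeps the ordering logic testable without a microphone or a Whisper model, which is the main reason for this shape.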