From cc9eede8569144e83798913ef603af194b74f2f8 Mon Sep 17 00:00:00 2001
From: jango-blockchained
Date: Thu, 6 Feb 2025 04:30:20 +0100
Subject: [PATCH] docs: Add comprehensive speech features documentation and
 configuration

- Introduce detailed documentation for speech processing capabilities
- Add new speech features documentation in `docs/features/speech.md`
- Update README with speech feature highlights and prerequisites
- Expand configuration documentation with speech-related settings
- Include model selection, GPU acceleration, and best practices guidance
---
 README.md               |  94 +++++++++++++-----
 docs/configuration.md   | 145 ++++++++++++++++++++++++++-
 docs/features/speech.md | 212 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 425 insertions(+), 26 deletions(-)
 create mode 100644 docs/features/speech.md

diff --git a/README.md b/README.md
index aeaabb3..6eb3af8 100644
--- a/README.md
+++ b/README.md
@@ -12,12 +12,22 @@ MCP (Model Context Protocol) Server is a lightweight integration tool for Home A
 - 📡 WebSocket/Server-Sent Events (SSE) for state updates
 - 🤖 Simple automation rule management
 - 🔐 JWT-based authentication
+- 🎛️ Real-time device control and monitoring
+- 📝 Comprehensive logging
+- 🎤 Optional speech features:
+  - Wake word detection ("hey jarvis", "ok google", "alexa")
+  - Speech-to-text using fast-whisper
+  - Multiple language support
+  - GPU acceleration support

 ## Prerequisites 📋

 - 🚀 Bun runtime (v1.0.26+)
 - 🏡 Home Assistant instance
-- 🐳 Docker (optional, recommended for deployment)
+- 🐳 Docker (optional, recommended for deployment; required for speech features)
+- 🖥️ Node.js 18+ (optional, for speech features)
+- 🖥️ NVIDIA GPU with CUDA support (optional, for faster speech processing)

 ## Installation 🛠️

@@ -30,7 +40,7 @@ cd homeassistant-mcp

 # Copy and edit environment configuration
 cp .env.example .env
-# Edit .env with your Home Assistant credentials
+# Edit .env with your Home Assistant credentials and speech feature settings

 # Build and start containers
 docker compose up -d --build
@@ -79,33 +89,69 @@ ws.onmessage = (event) => {
 };
 ```

-## Current Limitations ⚠️
+## Speech Features (Optional) 🎤

-- 🎙️ Basic voice command support (work in progress)
-- 🧠 Limited advanced NLP capabilities
-- 🔗 Minimal third-party device integration
-- 🐛 Early-stage error handling
+The MCP Server includes optional speech processing capabilities.

-## Contributing 🤝
+### Prerequisites
+1. Docker installed and running
+2. NVIDIA GPU with CUDA support (optional)
+3. At least 4GB RAM (8GB+ recommended for larger models)

-1. Fork the repository
-2. Create a feature branch:
-   ```bash
-   git checkout -b feature/your-feature
-   ```
-3. Make your changes
-4. Run tests:
-   ```bash
-   bun test
-   ```
-5. Submit a pull request
+### Setup

-## Roadmap 🗺️
+1. Enable speech features in your `.env`:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+```

-- 🎤 Enhance voice command processing
-- 🔌 Improve device compatibility
-- 🤖 Expand automation capabilities
-- 🛡️ Implement more robust error handling
+2. Start the speech services:
+```bash
+docker compose up -d
+```
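+
+To verify the setup end to end, you can run a one-off transcription with the client API documented in [Speech Features](docs/features/speech.md). A minimal sketch, assuming a short WAV file on disk:
+
+```typescript
+import { SpeechService } from '@ha-mcp/client';
+
+const speech = new SpeechService();
+// Transcribing a local file confirms the whisper model loads and responds.
+const text = await speech.transcribeFile('hello.wav');
+console.log(text);
+```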
+
+### Available Models
+
+Choose a model based on your needs:
+- `tiny.en`: Fastest, basic accuracy
+- `base.en`: Good balance (recommended)
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, resource intensive
+- `large-v2`: Best accuracy, very resource intensive
+
+### Usage
+
+1. Wake word detection listens for:
+   - "hey jarvis"
+   - "ok google"
+   - "alexa"
+
+2. After wake word detection:
+   - Audio is automatically captured
+   - Speech is transcribed
+   - Commands are processed
+
+3. Manual transcription is also available:
+```typescript
+import { SpeechService } from '@ha-mcp/client';
+
+const speech = new SpeechService();
+const text = await speech.transcribe(audioBuffer);
+```
+
+## Configuration ⚙️
+
+See the [Configuration Guide](docs/configuration.md) for detailed settings.
+
+## API Documentation 📚
+
+See the [API Documentation](docs/api/index.md) for available endpoints.
+
+## Development 💻
+
+See the [Development Guide](docs/development/index.md) for contribution guidelines.

 ## License 📄
diff --git a/docs/configuration.md b/docs/configuration.md
index f70ead1..9957281 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -34,6 +34,14 @@ JWT_SECRET=your_secret_key
 - `MAX_CLIENTS`: Maximum concurrent clients (default: 1000)
 - `PING_INTERVAL`: Keep-alive ping interval in ms (default: 30000)

+### Speech Features (Optional)
+- `ENABLE_SPEECH_FEATURES`: Enable speech processing features (default: false)
+- `ENABLE_WAKE_WORD`: Enable wake word detection (default: false)
+- `ENABLE_SPEECH_TO_TEXT`: Enable speech-to-text conversion (default: false)
+- `WHISPER_MODEL_PATH`: Path to the Whisper models directory (default: /models)
+- `WHISPER_MODEL_TYPE`: Whisper model type (default: base)
+  - Available models: tiny.en, base.en, small.en, medium.en, large-v2
+
 ## Environment Variables

 All configuration is managed through environment variables:
@@ -57,6 +65,13 @@ LOG_MAX_SIZE=20m
 LOG_MAX_DAYS=14d
 LOG_COMPRESS=true
 LOG_REQUESTS=true
+
+# Speech Features (Optional)
+ENABLE_SPEECH_FEATURES=false
+ENABLE_WAKE_WORD=false
+ENABLE_SPEECH_TO_TEXT=false
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
 ```

 ## Advanced Configuration
@@ -86,6 +101,26 @@ LOGGING: {
 }
 ```

+### Speech-to-Text Configuration
+
+When speech features are enabled, you can configure the following options:
+
+```typescript
+SPEECH: {
+  ENABLED: false,                 // Master switch for all speech features
+  WAKE_WORD_ENABLED: false,       // Enable wake word detection
+  SPEECH_TO_TEXT_ENABLED: false,  // Enable speech-to-text
+  WHISPER_MODEL_PATH: "/models",  // Path to Whisper models
+  WHISPER_MODEL_TYPE: "base",     // Model type to use
+}
+```
+
+Available Whisper models:
+- `tiny.en`: Fastest, lowest accuracy
+- `base.en`: Good balance of speed and accuracy
+- `small.en`: Better accuracy, slower
+- `medium.en`: High accuracy, much slower
+- `large-v2`: Best accuracy, very slow
+
 For production deployments, we recommend using system tools like `logrotate` for log management.

 Example logrotate configuration (`/etc/logrotate.d/mcp-server`):
@@ -109,13 +144,15 @@ Example logrotate configuration (`/etc/logrotate.d/mcp-server`):
 4. Enable SSL/TLS in production (preferably via reverse proxy)
 5. Monitor log files for issues
 6. Regularly rotate logs in production
+7. Start with smaller Whisper models and upgrade only if needed
+8. Consider GPU acceleration for larger Whisper models

 ## Validation

 The server validates configuration on startup using Zod schemas:
 - Required fields are checked (e.g., HASS_TOKEN)
 - Value types are verified
-- Enums are validated (e.g., LOG_LEVEL)
+- Enums are validated (e.g., LOG_LEVEL, WHISPER_MODEL_TYPE)
 - Default values are applied when not specified
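+
+For illustration, a minimal Zod sketch of the speech-related settings might look like this (variable names mirror the documentation above; this is not the server's actual schema):
+
+```typescript
+import { z } from 'zod';
+
+// Environment variables arrive as strings, so map "true"/"false" explicitly
+// instead of coercing (Boolean("false") would be true).
+const envBool = z.enum(['true', 'false']).default('false').transform((v) => v === 'true');
+
+const speechEnvSchema = z.object({
+  ENABLE_SPEECH_FEATURES: envBool,
+  ENABLE_WAKE_WORD: envBool,
+  ENABLE_SPEECH_TO_TEXT: envBool,
+  WHISPER_MODEL_PATH: z.string().default('/models'),
+  // Assumes the documented default "base" is shorthand for "base.en".
+  WHISPER_MODEL_TYPE: z
+    .enum(['tiny.en', 'base.en', 'small.en', 'medium.en', 'large-v2'])
+    .default('base.en'),
+});
+
+const speechConfig = speechEnvSchema.parse(process.env);
+```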

 ## Troubleshooting
@@ -125,5 +162,109 @@ Common configuration issues:
 2. Invalid environment variable values
 3. Permission issues with log directories
 4. Rate limiting too restrictive
+5. Speech model loading failures
+6. Docker not available for speech features
+7. Insufficient system resources for larger models

-See the [Troubleshooting Guide](troubleshooting.md) for solutions.
\ No newline at end of file
+See the [Troubleshooting Guide](troubleshooting.md) for solutions.
+
+# Configuration Guide
+
+This document describes all available configuration options for the Home Assistant MCP Server.
+
+## Environment Variables
+
+### Required Settings
+
+```bash
+# Server Configuration
+PORT=3000                              # Server port
+HOST=localhost                         # Server host
+
+# Home Assistant
+HASS_URL=http://localhost:8123         # Home Assistant URL
+HASS_TOKEN=your_token                  # Long-lived access token
+
+# Security
+JWT_SECRET=your_secret                 # JWT signing secret
+```
+
+### Optional Settings
+
+```bash
+# Rate Limiting
+RATE_LIMIT_WINDOW=60000                # Time window in ms (default: 60000)
+RATE_LIMIT_MAX=100                     # Max requests per window (default: 100)
+
+# Logging
+LOG_LEVEL=info                         # debug, info, warn, error (default: info)
+LOG_DIR=logs                           # Log directory (default: logs)
+LOG_MAX_SIZE=10m                       # Max log file size (default: 10m)
+LOG_MAX_FILES=5                        # Max number of log files (default: 5)
+
+# WebSocket/SSE
+WS_HEARTBEAT=30000                     # WebSocket heartbeat interval in ms (default: 30000)
+SSE_RETRY=3000                         # SSE retry interval in ms (default: 3000)
+
+# Speech Features
+ENABLE_SPEECH_FEATURES=false           # Enable speech processing (default: false)
+ENABLE_WAKE_WORD=false                 # Enable wake word detection (default: false)
+ENABLE_SPEECH_TO_TEXT=false            # Enable speech-to-text (default: false)
+
+# Speech Model Configuration
+WHISPER_MODEL_PATH=/models             # Path to whisper models (default: /models)
+WHISPER_MODEL_TYPE=base                # Model type: tiny.en|base.en|small.en|medium.en|large-v2 (default: base)
+WHISPER_LANGUAGE=en                    # Primary language (default: en)
+WHISPER_TASK=transcribe                # Task type: transcribe|translate (default: transcribe)
+WHISPER_DEVICE=cuda                    # Processing device: cpu|cuda (default: cuda if available, else cpu)
+
+# Wake Word Configuration
+WAKE_WORDS=hey jarvis,ok google,alexa  # Comma-separated wake words (default: hey jarvis)
+WAKE_WORD_SENSITIVITY=0.5              # Detection sensitivity 0-1 (default: 0.5)
+```
+
+## Speech Features
+
+### Model Selection
+
+Choose a model based on your needs:
+
+| Model      | Size  | Memory Required | Speed  | Accuracy |
+|------------|-------|-----------------|--------|----------|
+| tiny.en    | 75MB  | 1GB             | Fast   | Basic    |
+| base.en    | 150MB | 2GB             | Good   | Good     |
+| small.en   | 500MB | 4GB             | Medium | Better   |
+| medium.en  | 1.5GB | 8GB             | Slow   | High     |
+| large-v2   | 3GB   | 16GB            | Slow   | Best     |
+
+### GPU Acceleration
+
+When `WHISPER_DEVICE=cuda`:
+- NVIDIA GPU with CUDA support required
+- Significantly faster processing
+- Higher memory requirements
+
+### Wake Word Detection
+
+- Multiple wake words supported via a comma-separated list
+- Adjustable sensitivity (0-1), as sketched after this list:
+  - Lower values: fewer false positives, may miss some triggers
+  - Higher values: more responsive, may produce false triggers
+  - Default (0.5): balanced detection
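+
+As an illustration, the wake-word settings above could be parsed like this (a sketch, not the server's actual implementation):
+
+```typescript
+// Split the comma-separated WAKE_WORDS list and normalize each entry.
+const wakeWords = (process.env.WAKE_WORDS ?? 'hey jarvis')
+  .split(',')
+  .map((word) => word.trim().toLowerCase())
+  .filter(Boolean);
+
+// Clamp sensitivity to the documented 0-1 range, falling back to 0.5.
+const raw = Number(process.env.WAKE_WORD_SENSITIVITY ?? '0.5');
+const sensitivity = Number.isNaN(raw) ? 0.5 : Math.min(1, Math.max(0, raw));
+```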
+### Best Practices
+
+1. Model Selection:
+   - Start with the `base.en` model
+   - Upgrade if you need better accuracy
+   - Downgrade if you hit performance issues
+
+2. Resource Management:
+   - Monitor memory usage
+   - Use GPU acceleration when available
+   - Consider model size vs. available resources
+
+3. Wake Word Configuration:
+   - Use distinct wake words
+   - Adjust sensitivity based on your environment
+   - Limit the number of wake words for better performance
\ No newline at end of file
diff --git a/docs/features/speech.md b/docs/features/speech.md
new file mode 100644
index 0000000..e136064
--- /dev/null
+++ b/docs/features/speech.md
@@ -0,0 +1,212 @@
+# Speech Features
+
+The Home Assistant MCP Server includes powerful speech processing capabilities powered by fast-whisper and custom wake word detection. This guide explains how to set up and use these features effectively.
+
+## Overview
+
+The speech processing system consists of two main components:
+1. Wake Word Detection - listens for specific trigger phrases
+2. Speech-to-Text - transcribes spoken commands using fast-whisper
+
+## Setup
+
+### Prerequisites
+
+1. Docker environment:
+```bash
+docker --version  # Should be 20.10.0 or higher
+```
+
+2. For GPU acceleration:
+- NVIDIA GPU with CUDA support
+- NVIDIA Container Toolkit installed
+- NVIDIA drivers 450.80.02 or higher
+
+### Installation
+
+1. Enable speech features in your `.env`:
+```bash
+ENABLE_SPEECH_FEATURES=true
+ENABLE_WAKE_WORD=true
+ENABLE_SPEECH_TO_TEXT=true
+```
+
+2. Configure model settings:
+```bash
+WHISPER_MODEL_PATH=/models
+WHISPER_MODEL_TYPE=base
+WHISPER_LANGUAGE=en
+WHISPER_TASK=transcribe
+WHISPER_DEVICE=cuda  # or cpu
+```
+
+3. Start the services:
+```bash
+docker compose up -d
+```
+
+## Usage
+
+### Wake Word Detection
+
+The wake word detector continuously listens for configured trigger phrases. Default wake words:
+- "hey jarvis"
+- "ok google"
+- "alexa"
+
+Custom wake words can be configured:
+```bash
+WAKE_WORDS=computer,jarvis,assistant
+```
+
+When a wake word is detected:
+1. The system starts recording audio
+2. Audio is processed through the speech-to-text pipeline
+3. The resulting command is processed by the server
+
+### Speech-to-Text
+
+#### Automatic Transcription
+
+After wake word detection:
+1. Audio is automatically captured (default: 5 seconds)
+2. The audio is transcribed using the configured whisper model
+3. The transcribed text is processed as a command
+
+#### Manual Transcription
+
+You can also manually transcribe audio using the API:
+
+```typescript
+// Using the TypeScript client
+import { SpeechService } from '@ha-mcp/client';
+
+const speech = new SpeechService();
+
+// Transcribe from an audio buffer (getAudioBuffer() stands in for your audio source)
+const buffer = await getAudioBuffer();
+const bufferText = await speech.transcribe(buffer);
+
+// Transcribe from a file
+const fileText = await speech.transcribeFile('command.wav');
+```
+
+```javascript
+// Using the REST API
+POST /api/speech/transcribe
+Content-Type: multipart/form-data
+
+file: