# Speech Features

The Home Assistant MCP Server includes powerful speech processing capabilities powered by fast-whisper and custom wake word detection. This guide explains how to set up and use these features effectively.

## Overview

The speech processing system consists of two main components:

- **Wake Word Detection** - listens for specific trigger phrases
- **Speech-to-Text** - transcribes spoken commands using fast-whisper
## Setup

### Prerequisites

- Docker environment:

  ```bash
  docker --version  # Should be 20.10.0 or higher
  ```

- For GPU acceleration:
  - NVIDIA GPU with CUDA support
  - NVIDIA Container Toolkit installed
  - NVIDIA drivers 450.80.02 or higher
### Installation

1. Enable speech features in your `.env`:

   ```env
   ENABLE_SPEECH_FEATURES=true
   ENABLE_WAKE_WORD=true
   ENABLE_SPEECH_TO_TEXT=true
   ```

2. Configure model settings:

   ```env
   WHISPER_MODEL_PATH=/models
   WHISPER_MODEL_TYPE=base
   WHISPER_LANGUAGE=en
   WHISPER_TASK=transcribe
   WHISPER_DEVICE=cuda  # or cpu
   ```

3. Start the services:

   ```bash
   docker-compose up -d
   ```
## Usage

### Wake Word Detection

The wake word detector continuously listens for configured trigger phrases. The default wake words are:

- "hey jarvis"
- "ok google"
- "alexa"

Custom wake words can be configured:

```env
WAKE_WORDS=computer,jarvis,assistant
```
When a wake word is detected:

1. The system starts recording audio
2. The audio is processed through the speech-to-text pipeline
3. The resulting command is processed by the server
### Speech-to-Text

#### Automatic Transcription

After wake word detection:

1. Audio is automatically captured (default: 5 seconds)
2. The audio is transcribed using the configured Whisper model
3. The transcribed text is processed as a command (see the sketch below)
#### Manual Transcription

You can also transcribe audio manually, either with the TypeScript client or via the REST API:

```typescript
// Using the TypeScript client
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Transcribe from an audio buffer
const buffer = await getAudioBuffer();
const bufferText = await speech.transcribe(buffer);

// Transcribe from a file
const fileText = await speech.transcribeFile('command.wav');
```
Or using the REST API:

```http
POST /api/speech/transcribe
Content-Type: multipart/form-data

file: <audio file>
```
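If you are not using the TypeScript client, the same endpoint can be called from any HTTP client. A minimal sketch using standard `fetch` and `FormData`; the base URL and the `{ text }` response shape are assumptions, so adjust them to your deployment:

```typescript
import { readFile } from 'node:fs/promises';

// Minimal sketch: POST an audio file to the transcription endpoint.
// Assumes Node 18+ (global fetch, FormData, Blob). The default base URL
// and the { text } response shape are assumptions, not documented values.
async function transcribeViaRest(
  path: string,
  baseUrl = 'http://localhost:3000',
): Promise<string> {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)]), 'command.wav');

  const res = await fetch(`${baseUrl}/api/speech/transcribe`, {
    method: 'POST',
    body: form, // fetch sets the multipart boundary automatically
  });
  if (!res.ok) {
    throw new Error(`Transcription failed with status ${res.status}`);
  }
  const { text } = (await res.json()) as { text: string };
  return text;
}
```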
### Event Handling

The system emits various events during speech processing:

```typescript
speech.on('wakeWord', (word: string) => {
  console.log(`Wake word detected: ${word}`);
});

speech.on('listening', () => {
  console.log('Listening for command...');
});

speech.on('transcribing', () => {
  console.log('Processing speech...');
});

speech.on('transcribed', (text: string) => {
  console.log(`Transcribed text: ${text}`);
});

speech.on('error', (error: Error) => {
  console.error('Speech processing error:', error);
});
```
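These events compose into higher-level helpers. As a sketch built only on the documented `on` events, here is a one-shot listener that resolves with the next transcription; the timeout value is arbitrary, and listener cleanup is omitted because an `off`/`once` API is not shown in these docs:

```typescript
import { SpeechService } from '@ha-mcp/client';

// Resolve with the next transcribed command, or reject on error/timeout.
// Note: listeners stay attached after settling; if SpeechService exposes
// once()/off() (not documented here), prefer those for cleanup.
function nextCommand(speech: SpeechService, timeoutMs = 15_000): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error('Timed out waiting for a command')),
      timeoutMs,
    );
    speech.on('transcribed', (text: string) => {
      clearTimeout(timer);
      resolve(text);
    });
    speech.on('error', (error: Error) => {
      clearTimeout(timer);
      reject(error);
    });
  });
}
```

For example, `await nextCommand(speech)` waits for the next spoken command after a wake word fires.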
## Performance Optimization

### Model Selection

Choose an appropriate model based on your needs:

- **Resource-constrained environments:**
  - Use `tiny.en` or `base.en`
  - Run on CPU if GPU is unavailable
  - Limit concurrent processing

- **High-accuracy requirements:**
  - Use `small.en` or `medium.en`
  - Enable GPU acceleration
  - Increase audio quality

- **Production environments:**
  - Use `base.en` or `small.en`
  - Enable GPU acceleration
  - Configure appropriate timeouts
### GPU Acceleration

When using GPU acceleration:

1. Monitor GPU memory usage:

   ```bash
   nvidia-smi -l 1
   ```

2. Adjust the model size if needed:

   ```env
   WHISPER_MODEL_TYPE=small  # Decrease if GPU memory is limited
   ```

3. Configure the processing device:

   ```env
   WHISPER_DEVICE=cuda  # Use GPU
   WHISPER_DEVICE=cpu   # Use CPU if GPU is unavailable
   ```
## Troubleshooting

### Common Issues

- **Wake word detection not working:**
  - Check microphone permissions
  - Adjust `WAKE_WORD_SENSITIVITY`
  - Verify the wake words configuration

- **Poor transcription quality:**
  - Check audio input quality
  - Try a larger model
  - Verify language settings

- **Performance issues:**
  - Monitor resource usage
  - Consider a smaller model
  - Check GPU acceleration status
### Logging

Enable debug logging for detailed information:

```env
LOG_LEVEL=debug
```

Speech-specific logs are tagged with the `[SPEECH]` prefix.
## Security Considerations

- **Audio Privacy:**
  - Audio is processed locally
  - No data is sent to external services
  - Temporary files are automatically cleaned up

- **Access Control:**
  - Speech endpoints require authentication
  - Rate limiting applies to transcription (see the sketch after this list)
  - Configurable command restrictions

- **Resource Protection:**
  - Timeouts prevent hanging requests
  - Memory limits are enforced
  - Graceful error handling
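Because rate limiting applies to transcription, REST clients should be prepared for throttling responses. A minimal retry sketch, assuming the server signals throttling with HTTP 429 and an optional `Retry-After` header (both are common conventions, not documented guarantees of this server):

```typescript
// Retry a transcription request once after a rate-limit response.
// HTTP 429 and the Retry-After header are assumptions about the server's
// rate limiter; adjust to what your deployment actually returns.
async function transcribeWithRetry(form: FormData, baseUrl: string): Promise<Response> {
  const post = () =>
    fetch(`${baseUrl}/api/speech/transcribe`, { method: 'POST', body: form });

  let res = await post();
  if (res.status === 429) {
    const seconds = Number(res.headers.get('retry-after') ?? '1');
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
    res = await post(); // FormData built from Blobs can be re-sent safely
  }
  return res;
}
```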