# Speech Features
The Home Assistant MCP Server includes speech processing capabilities built on fast-whisper, along with custom wake word detection. This guide explains how to set up and use these features effectively.
## Overview
The speech processing system consists of two main components:

1. Wake Word Detection - Listens for specific trigger phrases
2. Speech-to-Text - Transcribes spoken commands using fast-whisper
## Setup

### Prerequisites
- Docker environment
- For GPU acceleration:
  - NVIDIA GPU with CUDA support
  - NVIDIA Container Toolkit installed
  - NVIDIA drivers 450.80.02 or higher
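To confirm the GPU stack is working before enabling acceleration, a quick host-side check can help (the CUDA image tag below is only an example):

```bash
# Driver check: should report version 450.80.02 or higher
nvidia-smi

# Container check: verifies the NVIDIA Container Toolkit exposes the GPU
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```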
### Installation

1. Enable speech features in your `.env`
2. Configure model settings
3. Start the services
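A combined sketch of these three steps; the variable names are illustrative assumptions, not confirmed keys for this project:

```env
# Step 1: enable speech features (key names are assumptions)
SPEECH_ENABLED=true
WAKE_WORD_ENABLED=true

# Step 2: model settings
WHISPER_MODEL_TYPE=base.en
```

```bash
# Step 3: start the services
docker compose up -d
```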
## Usage

### Wake Word Detection

The wake word detector continuously listens for configured trigger phrases. Default wake words:

- "hey jarvis"
- "ok google"
- "alexa"
Custom wake words can be configured:
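For example, via an environment variable (the `WAKE_WORDS` key is an assumption; the project may use a different name):

```env
# Comma-separated trigger phrases (key name is an assumption)
WAKE_WORDS=hey jarvis,ok google,alexa
```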
When a wake word is detected:

1. The system starts recording audio
2. Audio is processed through the speech-to-text pipeline
3. The resulting command is processed by the server
### Speech-to-Text

#### Automatic Transcription
After wake word detection:

1. Audio is automatically captured (default: 5 seconds)
2. The audio is transcribed using the configured whisper model
3. The transcribed text is processed as a command
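The capture window is typically tunable; a hypothetical setting (the key name is not confirmed by this project):

```env
# Seconds of audio to record after the wake word fires (key name is an assumption)
SPEECH_CAPTURE_SECONDS=5
```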
#### Manual Transcription
You can also manually transcribe audio using the API:
```typescript
// Using the TypeScript client
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Transcribe from an audio buffer
const buffer = await getAudioBuffer();
const textFromBuffer = await speech.transcribe(buffer);

// Transcribe from a file
const textFromFile = await speech.transcribeFile('command.wav');
```
Or using the REST API:

```http
POST /api/speech/transcribe
Content-Type: multipart/form-data

file: <audio file>
```
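A sketch of calling that endpoint from TypeScript (the base URL and response shape are assumptions):

```typescript
import { readFile } from 'node:fs/promises';

// Read the audio file and wrap it in a multipart form
const audio = await readFile('command.wav');
const form = new FormData();
form.append('file', new Blob([audio]), 'command.wav');

// Hypothetical base URL; adjust to wherever the server is listening
const res = await fetch('http://localhost:3000/api/speech/transcribe', {
  method: 'POST',
  body: form,
});
const { text } = await res.json(); // assumes a { text: string } response
console.log(text);
```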
### Event Handling
The system emits various events during speech processing:
```typescript
speech.on('wakeWord', (word: string) => {
  console.log(`Wake word detected: ${word}`);
});

speech.on('listening', () => {
  console.log('Listening for command...');
});

speech.on('transcribing', () => {
  console.log('Processing speech...');
});

speech.on('transcribed', (text: string) => {
  console.log(`Transcribed text: ${text}`);
});

speech.on('error', (error: Error) => {
  console.error('Speech processing error:', error);
});
```
## Performance Optimization

### Model Selection
Choose an appropriate model based on your needs:
- Resource-constrained environments:
  - Use `tiny.en` or `base.en`
  - Run on CPU if GPU unavailable
  - Limit concurrent processing

- High-accuracy requirements:
  - Use `small.en` or `medium.en`
  - Enable GPU acceleration
  - Increase audio quality

- Production environments:
  - Use `base.en` or `small.en`
  - Enable GPU acceleration
  - Configure appropriate timeouts
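A quick way to compare candidate models is to time a transcription with the client from the Manual Transcription example (a sketch; `sample.wav` stands in for any representative recording):

```typescript
import { SpeechService } from '@ha-mcp/client';

// Rough latency check for the currently configured model
const speech = new SpeechService();
const started = Date.now();
await speech.transcribeFile('sample.wav');
console.log(`Transcription took ${Date.now() - started} ms`);
```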
### GPU Acceleration
When using GPU acceleration:
- Monitor GPU memory usage
- Adjust model size if needed
- Configure processing device
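A sketch of all three: `nvidia-smi` handles the monitoring, while the device and model keys below are illustrative assumptions:

```bash
# Watch GPU memory while the speech services run (refreshes every 5 s)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```

```env
# Illustrative keys - names are assumptions, not confirmed by this project
WHISPER_DEVICE=cuda          # processing device: cuda or cpu
WHISPER_MODEL_TYPE=tiny.en   # drop to a smaller model if memory is tight
```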
## Troubleshooting

### Common Issues
- Wake word detection not working:
  - Check microphone permissions
  - Adjust `WAKE_WORD_SENSITIVITY`
  - Verify the wake words configuration

- Poor transcription quality:
  - Check audio input quality
  - Try a larger model
  - Verify language settings

- Performance issues:
  - Monitor resource usage
  - Consider a smaller model
  - Check GPU acceleration status
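If detections are missed (or fire spuriously), sensitivity is usually the first knob to turn; the value range here is an assumption:

```env
# Higher values trigger more easily but risk false positives (assumed 0-1 scale)
WAKE_WORD_SENSITIVITY=0.6
```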
### Logging
Enable debug logging for detailed information:
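For example (the key name is an assumption; match it to your logging setup):

```env
# Illustrative - enables verbose output, including [SPEECH]-tagged entries
LOG_LEVEL=debug
```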
Speech-specific logs are tagged with the `[SPEECH]` prefix.
## Security Considerations
- Audio Privacy:
  - Audio is processed locally
  - No data is sent to external services
  - Temporary files are cleaned up automatically

- Access Control:
  - Speech endpoints require authentication
  - Rate limiting applies to transcription requests
  - Command restrictions are configurable

- Resource Protection:
  - Timeouts prevent hanging operations
  - Memory limits are enforced
  - Errors are handled gracefully