# Speech Features

The Home Assistant MCP Server includes powerful speech processing capabilities powered by fast-whisper and custom wake word detection. This guide explains how to set up and use these features effectively.

## Overview

The speech processing system consists of two main components:

- **Wake Word Detection** - listens for specific trigger phrases
- **Speech-to-Text** - transcribes spoken commands using fast-whisper
## Setup

### Prerequisites

- Docker environment:

  ```bash
  docker --version  # Should be 20.10.0 or higher
  ```

- For GPU acceleration:
  - NVIDIA GPU with CUDA support
  - NVIDIA Container Toolkit installed
  - NVIDIA drivers 450.80.02 or higher
### Installation

1. Enable speech features in your `.env`:

   ```env
   ENABLE_SPEECH_FEATURES=true
   ENABLE_WAKE_WORD=true
   ENABLE_SPEECH_TO_TEXT=true
   ```

2. Configure model settings:

   ```env
   WHISPER_MODEL_PATH=/models
   WHISPER_MODEL_TYPE=base
   WHISPER_LANGUAGE=en
   WHISPER_TASK=transcribe
   WHISPER_DEVICE=cuda  # or cpu
   ```

3. Start the services:

   ```bash
   docker-compose up -d
   ```
## Usage

### Wake Word Detection

The wake word detector continuously listens for configured trigger phrases. The default wake words are:

- "hey jarvis"
- "ok google"
- "alexa"

Custom wake words can be configured:

```env
WAKE_WORDS=computer,jarvis,assistant
```
When a wake word is detected:

1. The system starts recording audio
2. The audio is processed through the speech-to-text pipeline
3. The resulting command is processed by the server
### Speech-to-Text

#### Automatic Transcription

After wake word detection:

1. Audio is automatically captured (default: 5 seconds)
2. The audio is transcribed using the configured Whisper model
3. The transcribed text is processed as a command (see the sketch below)
#### Manual Transcription

You can also transcribe audio manually, either with the TypeScript client or via the REST API:

```typescript
// Using the TypeScript client
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Transcribe from an audio buffer
const buffer = await getAudioBuffer();
const bufferText = await speech.transcribe(buffer);

// Transcribe from a file
const fileText = await speech.transcribeFile('command.wav');
```
Or using the REST API:

```http
POST /api/speech/transcribe
Content-Type: multipart/form-data

file: <audio file>
```
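If you are not using the TypeScript client, the same endpoint can be called from any HTTP client. A minimal sketch using standard `fetch` and `FormData`; the base URL and the `{ text }` response shape are assumptions, so adjust them to your deployment:

```typescript
import { readFile } from 'node:fs/promises';

// Minimal sketch: POST an audio file to the transcription endpoint.
// Assumes Node 18+ (global fetch, FormData, Blob). The default base URL
// and the { text } response shape are assumptions, not documented values.
async function transcribeViaRest(
  path: string,
  baseUrl = 'http://localhost:3000',
): Promise<string> {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)]), 'command.wav');

  const res = await fetch(`${baseUrl}/api/speech/transcribe`, {
    method: 'POST',
    body: form, // fetch sets the multipart boundary automatically
  });
  if (!res.ok) {
    throw new Error(`Transcription failed with status ${res.status}`);
  }
  const { text } = (await res.json()) as { text: string };
  return text;
}
```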
### Event Handling

The system emits various events during speech processing:

```typescript
speech.on('wakeWord', (word: string) => {
  console.log(`Wake word detected: ${word}`);
});

speech.on('listening', () => {
  console.log('Listening for command...');
});

speech.on('transcribing', () => {
  console.log('Processing speech...');
});

speech.on('transcribed', (text: string) => {
  console.log(`Transcribed text: ${text}`);
});

speech.on('error', (error: Error) => {
  console.error('Speech processing error:', error);
});
```
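These events compose into higher-level helpers. As a sketch built only on the documented `on` events, here is a one-shot listener that resolves with the next transcription; the timeout value is arbitrary, and listener cleanup is omitted because an `off`/`once` API is not shown in these docs:

```typescript
import { SpeechService } from '@ha-mcp/client';

// Resolve with the next transcribed command, or reject on error/timeout.
// Note: listeners stay attached after settling; if SpeechService exposes
// once()/off() (not documented here), prefer those for cleanup.
function nextCommand(speech: SpeechService, timeoutMs = 15_000): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error('Timed out waiting for a command')),
      timeoutMs,
    );
    speech.on('transcribed', (text: string) => {
      clearTimeout(timer);
      resolve(text);
    });
    speech.on('error', (error: Error) => {
      clearTimeout(timer);
      reject(error);
    });
  });
}
```

For example, `await nextCommand(speech)` waits for the next spoken command after a wake word fires.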
## Performance Optimization

### Model Selection

Choose an appropriate model based on your needs:

- **Resource-constrained environments:**
  - Use `tiny.en` or `base.en`
  - Run on CPU if GPU is unavailable
  - Limit concurrent processing

- **High-accuracy requirements:**
  - Use `small.en` or `medium.en`
  - Enable GPU acceleration
  - Increase audio quality

- **Production environments:**
  - Use `base.en` or `small.en`
  - Enable GPU acceleration
  - Configure appropriate timeouts
### GPU Acceleration

When using GPU acceleration:

1. Monitor GPU memory usage:

   ```bash
   nvidia-smi -l 1
   ```

2. Adjust the model size if needed:

   ```env
   WHISPER_MODEL_TYPE=small  # Decrease if GPU memory is limited
   ```

3. Configure the processing device:

   ```env
   WHISPER_DEVICE=cuda  # Use GPU
   WHISPER_DEVICE=cpu   # Use CPU if GPU is unavailable
   ```
## Troubleshooting

### Common Issues

- **Wake word detection not working:**
  - Check microphone permissions
  - Adjust `WAKE_WORD_SENSITIVITY`
  - Verify the wake words configuration

- **Poor transcription quality:**
  - Check audio input quality
  - Try a larger model
  - Verify language settings

- **Performance issues:**
  - Monitor resource usage
  - Consider a smaller model
  - Check GPU acceleration status
### Logging

Enable debug logging for detailed information:

```env
LOG_LEVEL=debug
```

Speech-specific logs are tagged with the `[SPEECH]` prefix.
## Security Considerations

- **Audio Privacy:**
  - Audio is processed locally
  - No data is sent to external services
  - Temporary files are automatically cleaned up

- **Access Control:**
  - Speech endpoints require authentication
  - Rate limiting applies to transcription (see the sketch after this list)
  - Configurable command restrictions

- **Resource Protection:**
  - Timeouts prevent hanging requests
  - Memory limits are enforced
  - Graceful error handling
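Because rate limiting applies to transcription, REST clients should be prepared for throttling responses. A minimal retry sketch, assuming the server signals throttling with HTTP 429 and an optional `Retry-After` header (both are common conventions, not documented guarantees of this server):

```typescript
// Retry a transcription request once after a rate-limit response.
// HTTP 429 and the Retry-After header are assumptions about the server's
// rate limiter; adjust to what your deployment actually returns.
async function transcribeWithRetry(form: FormData, baseUrl: string): Promise<Response> {
  const post = () =>
    fetch(`${baseUrl}/api/speech/transcribe`, { method: 'POST', body: form });

  let res = await post();
  if (res.status === 429) {
    const seconds = Number(res.headers.get('retry-after') ?? '1');
    await new Promise((resolve) => setTimeout(resolve, seconds * 1000));
    res = await post(); // FormData built from Blobs can be re-sent safely
  }
  return res;
}
```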