Speech Features

The Home Assistant MCP Server includes speech processing capabilities powered by fast-whisper and custom wake word detection. This guide explains how to set up and use these features effectively.

Overview

The speech processing system consists of two main components:

  1. Wake Word Detection - Listens for specific trigger phrases
  2. Speech-to-Text - Transcribes spoken commands using fast-whisper

Setup

Prerequisites

  1. Docker environment:

    docker --version  # Should be 20.10.0 or higher
    

  2. For GPU acceleration:
     - NVIDIA GPU with CUDA support
     - NVIDIA Container Toolkit installed (a quick check follows this list)
     - NVIDIA drivers 450.80.02 or higher
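
If you plan to use GPU acceleration, one way to confirm that Docker can see the GPU is to run nvidia-smi inside a CUDA base image (the image tag below is only an example; use any CUDA image available to you):

    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi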

Installation

  1. Enable speech features in your .env:

    ENABLE_SPEECH_FEATURES=true
    ENABLE_WAKE_WORD=true
    ENABLE_SPEECH_TO_TEXT=true
    

  2. Configure model settings:

    WHISPER_MODEL_PATH=/models
    WHISPER_MODEL_TYPE=base
    WHISPER_LANGUAGE=en
    WHISPER_TASK=transcribe
    WHISPER_DEVICE=cuda  # or cpu
    

  3. Start the services:

    docker-compose up -d
    
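
After startup, you can confirm that the containers are running:

    docker-compose ps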

Usage

Wake Word Detection

The wake word detector continuously listens for configured trigger phrases. The default wake words are:

- "hey jarvis"
- "ok google"
- "alexa"

Custom wake words can be configured:

WAKE_WORDS=computer,jarvis,assistant

When a wake word is detected:

  1. The system starts recording audio
  2. Audio is processed through the speech-to-text pipeline
  3. The resulting command is processed by the server

Speech-to-Text

Automatic Transcription

After wake word detection:

  1. Audio is automatically captured (default: 5 seconds)
  2. The audio is transcribed using the configured whisper model
  3. The transcribed text is processed as a command

Manual Transcription

You can also manually transcribe audio using the API:

// Using the TypeScript client
import { SpeechService } from '@ha-mcp/client';

const speech = new SpeechService();

// Transcribe from audio buffer
const buffer = await getAudioBuffer();
const text = await speech.transcribe(buffer);

// Transcribe from a file
const fileText = await speech.transcribeFile('command.wav');

Alternatively, call the REST API directly:

POST /api/speech/transcribe
Content-Type: multipart/form-data

file: <audio file>
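
As a sketch, the endpoint above can be called from Node 18+ using the built-in fetch and FormData. The port and the { text } response shape are assumptions; adjust them to your deployment:

import { readFile } from 'node:fs/promises';

// Hypothetical REST client for the transcription endpoint shown above.
async function transcribeViaRest(path: string): Promise<string> {
  const form = new FormData();
  // Wrap the file bytes in a Blob so FormData builds a proper multipart part.
  form.append('file', new Blob([await readFile(path)], { type: 'audio/wav' }), 'command.wav');

  const res = await fetch('http://localhost:3000/api/speech/transcribe', {
    method: 'POST',
    body: form, // fetch sets the multipart boundary automatically
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);

  const { text } = await res.json(); // assumed response shape: { text: string }
  return text;
}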

Event Handling

The system emits various events during speech processing:

speech.on('wakeWord', (word: string) => {
  console.log(`Wake word detected: ${word}`);
});

speech.on('listening', () => {
  console.log('Listening for command...');
});

speech.on('transcribing', () => {
  console.log('Processing speech...');
});

speech.on('transcribed', (text: string) => {
  console.log(`Transcribed text: ${text}`);
});

speech.on('error', (error: Error) => {
  console.error('Speech processing error:', error);
});
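
If you prefer awaiting a result to wiring up listeners, the events can be wrapped in a promise. This is a minimal sketch assuming SpeechService exposes the EventEmitter-style on() used above; the timeout is arbitrary, and a production version should also remove its listeners once settled:

// Hypothetical helper: resolves with the next transcription, rejects on error or timeout.
function nextCommand(speech: SpeechService, timeoutMs = 15_000): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('Timed out waiting for speech')), timeoutMs);
    speech.on('transcribed', (text: string) => {
      clearTimeout(timer);
      resolve(text);
    });
    speech.on('error', (error: Error) => {
      clearTimeout(timer);
      reject(error);
    });
  });
}

const command = await nextCommand(speech);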

Performance Optimization

Model Selection

Choose an appropriate model based on your needs:

  1. Resource-constrained environments:
     - Use tiny.en or base.en
     - Run on CPU if GPU unavailable
     - Limit concurrent processing

  2. High-accuracy requirements:
     - Use small.en or medium.en
     - Enable GPU acceleration
     - Increase audio quality

  3. Production environments (a sample configuration follows this list):
     - Use base.en or small.en
     - Enable GPU acceleration
     - Configure appropriate timeouts
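
As an illustration, a production-leaning configuration using the settings introduced earlier might look like this (whether WHISPER_MODEL_TYPE accepts the .en variants is an assumption; fall back to base if it does not):

    WHISPER_MODEL_TYPE=base.en
    WHISPER_LANGUAGE=en
    WHISPER_DEVICE=cuda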

GPU Acceleration

When using GPU acceleration:

  1. Monitor GPU memory usage:

    nvidia-smi -l 1
    

  2. Adjust model size if needed:

    WHISPER_MODEL_TYPE=small  # Decrease if GPU memory limited
    

  3. Configure processing device:

    WHISPER_DEVICE=cuda  # Use GPU
    WHISPER_DEVICE=cpu   # Use CPU if GPU unavailable
    

Troubleshooting

Common Issues

  1. Wake word detection not working:
     - Check microphone permissions
     - Adjust WAKE_WORD_SENSITIVITY (see the example after this list)
     - Verify the wake word configuration

  2. Poor transcription quality:
     - Check audio input quality
     - Try a larger model
     - Verify language settings

  3. Performance issues:
     - Monitor resource usage
     - Consider a smaller model
     - Check GPU acceleration status
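
For example, if detection triggers too often or not at all, tune the sensitivity setting mentioned above (the 0.0-1.0 range shown here is an assumption; check your build's documented range):

    WAKE_WORD_SENSITIVITY=0.5  # assumed range 0.0-1.0; higher triggers more easily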

Logging

Enable debug logging for detailed information:

LOG_LEVEL=debug

Speech-specific logs are tagged with the [SPEECH] prefix.
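
To follow just the speech logs from the compose stack:

    docker-compose logs -f | grep "\[SPEECH\]"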

Security Considerations

  1. Audio Privacy:
     - Audio is processed locally
     - No data sent to external services
     - Temporary files automatically cleaned

  2. Access Control:
     - Speech endpoints require authentication
     - Rate limiting applies to transcription
     - Configurable command restrictions

  3. Resource Protection:
     - Timeouts prevent hanging
     - Memory limits enforced
     - Graceful error handling