---
title: "Transcription"
description: "How Whisper-based audio transcription works in Voicebox"
---
## Overview
Voicebox uses OpenAI's Whisper for automatic speech recognition (ASR). Transcription powers two flows:
1. **Reference-text auto-fill** — when a user records or uploads a voice sample, the backend transcribes it and populates the `reference_text` field so cloning can use it.
2. **On-demand transcription** — a user-facing `/transcribe` endpoint for arbitrary audio.
On Apple Silicon, the transcription path runs through **MLX-Whisper** (from `mlx-audio`) for ~8× faster inference than PyTorch. Everywhere else it runs through the Hugging Face `transformers` Whisper implementation on PyTorch.
## Architecture
Transcription is wired through the same backend abstraction as TTS. The `STTBackend` protocol lives in `backend/backends/__init__.py`:
```python
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class STTBackend(Protocol):
    async def load_model(self, model_size: str) -> None: ...
    async def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        model_size: Optional[str] = None,
    ) -> str: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
```
Two implementations ship today:
- **`MLXSTTBackend`** (`backends/mlx_backend.py`) — uses `mlx_audio.stt.load()`. Default on Apple Silicon.
- **`PyTorchSTTBackend`** (`backends/pytorch_backend.py`) — uses `transformers.WhisperForConditionalGeneration`. Default everywhere else.
`get_stt_backend()` picks the right one based on `get_backend_type()`. `backend/services/transcribe.py` is a thin wrapper that delegates to the backend.
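A minimal sketch of that selection and delegation, assuming `get_backend_type()` returns a string like `"mlx"` or `"pytorch"` (the caching decorator and the `transcribe_file` wrapper name are illustrative, not the actual code):
```python
from functools import lru_cache
from typing import Optional

@lru_cache(maxsize=None)
def get_stt_backend() -> STTBackend:
    """Return the shared STT backend for this platform (one instance per process)."""
    if get_backend_type() == "mlx":      # Apple Silicon
        return MLXSTTBackend()
    return PyTorchSTTBackend()           # everywhere else

# backend/services/transcribe.py then just delegates to whichever backend was chosen.
async def transcribe_file(
    audio_path: str,
    language: Optional[str] = None,
    model_size: str = "base",
) -> str:
    backend = get_stt_backend()
    if not backend.is_loaded():
        await backend.load_model(model_size)
    return await backend.transcribe(audio_path, language=language, model_size=model_size)
```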
## Model Sizes
Five Whisper variants are registered in `ModelConfig`:
| Model | HuggingFace Repo | Size | Notes |
|-------|------------------|------|-------|
| **Base** | `openai/whisper-base` | ~300 MB | Default; fast, decent quality |
| **Small** | `openai/whisper-small` | ~500 MB | Better quality, still fast |
| **Medium** | `openai/whisper-medium` | ~1.5 GB | High quality |
| **Large** | `openai/whisper-large-v3` | ~3 GB | Best quality, slow on CPU |
| **Turbo** | `openai/whisper-large-v3-turbo` | ~1.5 GB | Large-tier quality, ~5× faster than Large |
The `tiny` model is **not** exposed — the quality gap to `base` wasn't worth the download.
`Turbo` + MLX-Whisper on Apple Silicon dropped user-facing transcription latency from ~20s to ~2-3s in v0.1.10.
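The registration boils down to a mapping from the user-facing size name to a Hugging Face repo; the dictionary below is only a sketch of that shape, not the actual `ModelConfig` structure:
```python
# Sketch only: the real entries live in ModelConfig and may carry extra metadata.
WHISPER_REPOS = {
    "base":   "openai/whisper-base",
    "small":  "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large":  "openai/whisper-large-v3",
    "turbo":  "openai/whisper-large-v3-turbo",
}
```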
## Language Hints
Whisper can auto-detect language, but providing a hint improves accuracy on short clips:
```python
text = await backend.transcribe(audio_path, language="en")
```
Accepted language codes are the standard Whisper set (99+ languages). The frontend typically passes the profile's language if available, or lets Whisper detect otherwise.
## Model Loading
Both backends are lazy: the model is loaded on first use and cached in memory. Switching sizes unloads the previous model.
On MLX, the model is loaded via `mlx_audio.stt.load(hf_repo)`. On PyTorch, via:
```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained(hf_repo)
model = WhisperForConditionalGeneration.from_pretrained(hf_repo).to(device)
```
Both load paths use `model_load_progress()` from `backends/base.py` so the frontend sees live download progress on the first use.
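Put together, the PyTorch side follows a lazy load-and-cache pattern along these lines (attribute names and the hard-coded repo are illustrative; `model_load_progress()` is the real helper, but its exact usage is not shown here):
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

class PyTorchSTTBackend:
    """Sketch of the lazy-loading behaviour described above."""

    def __init__(self) -> None:
        self._model = None
        self._processor = None
        self._loaded_size = None

    async def load_model(self, model_size: str) -> None:
        if self._loaded_size == model_size:
            return  # already cached in memory
        hf_repo = "openai/whisper-base"  # resolved from ModelConfig in the real code
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # The real backend wraps the calls below with model_load_progress() so the
        # frontend can show download progress on first use.
        self._processor = WhisperProcessor.from_pretrained(hf_repo)
        self._model = WhisperForConditionalGeneration.from_pretrained(hf_repo).to(device)
        self._loaded_size = model_size

    def is_loaded(self) -> bool:
        return self._model is not None
```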
## Audio Preprocessing
Whisper expects mono 16 kHz audio. The audio utility in `backend/utils/audio.py` handles resampling and format conversion transparently:
- **Formats:** WAV, MP3, FLAC, OGG, M4A (via soundfile / librosa)
- **Target:** mono, 16 kHz, float32
Files longer than Whisper's 30-second window are handled by the underlying library's chunking logic — no explicit splitting in Voicebox code.
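A minimal sketch of that normalization using `librosa` (the function name is illustrative; the real helper lives in `backend/utils/audio.py`):
```python
import librosa
import numpy as np

def load_for_whisper(path: str) -> np.ndarray:
    """Decode any supported format to mono 16 kHz float32, as Whisper expects."""
    # librosa handles decoding, downmixing to mono, and resampling in one call.
    audio, _sr = librosa.load(path, sr=16000, mono=True)
    return audio.astype(np.float32)
```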
## API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/transcribe` | Transcribe an uploaded audio file |
### Request
Multipart form data:
```
POST /transcribe
Content-Type: multipart/form-data
file: <audio_file>
language: en # optional
model_size: base # optional (default: "base")
```
### Response
```json
{
  "text": "Hello, this is a test transcription.",
  "duration": 3.5
}
```
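Calling the endpoint from Python looks roughly like this (the host and port are assumptions; point it at wherever your backend is listening):
```python
import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",              # assumed address
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"language": "en", "model_size": "turbo"},  # both fields optional
    )
resp.raise_for_status()
print(resp.json()["text"])
```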
## Use Cases
### Reference Text for Voice Cloning
Adding a voice sample triggers transcription automatically (see the sketch after this list):
1. User uploads or records audio.
2. The backend writes the audio file and calls `/transcribe` internally (or the frontend calls it separately).
3. The returned text becomes `reference_text` on the new `profile_samples` row.
4. Cloning engines that need reference text (Chatterbox, TADA, etc.) read it from there.
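As a rough sketch, the server-side path boils down to something like this (everything except `reference_text` and the backend methods is a hypothetical name):
```python
from typing import Optional

async def add_voice_sample(profile_id: str, audio_path: str,
                           language: Optional[str] = None) -> None:
    # Transcribe the new sample, then store the text alongside it so cloning
    # engines can read it back as reference_text.
    backend = get_stt_backend()
    if not backend.is_loaded():
        await backend.load_model("base")
    reference_text = await backend.transcribe(audio_path, language=language)
    save_profile_sample(profile_id, audio_path, reference_text=reference_text)  # hypothetical DB helper
```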
### Quality Tips
- Provide a language hint for short clips (under 5 seconds) — auto-detection is unreliable on very short audio.
- Use Turbo or Large for noisy audio — Base can hallucinate on hard inputs.
- Prefer clean audio; transcription errors become reference-text errors, which become cloning errors.
## Memory Management
`unload_model()` drops the model reference and clears the CUDA cache if applicable. `/models/unload` wires this up for manual control.
A singleton per backend is returned by `get_stt_backend()` — multiple callers share one Whisper instance.
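On the PyTorch backend that amounts to something like the following (a sketch; attribute names are illustrative):
```python
import torch

class PyTorchSTTBackend:  # continuing the sketch from "Model Loading" above
    def unload_model(self) -> None:
        """Drop the cached Whisper model and free GPU memory where possible."""
        self._model = None
        self._processor = None
        self._loaded_size = None
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached CUDA allocations back to the driver
```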
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Model not found | First run + network failure | Retry; check connectivity |
| OOM on load | Large model on low-VRAM GPU | Switch to Small or Turbo |
| Empty result | No speech in audio | Confirm input has voice; check trim |
| Wrong language | Auto-detect misfired | Pass `language` hint |
## Next Steps
<Cards>
  <Card title="Model Management" href="/developer/model-management">
    Download / load / unload any model
  </Card>
  <Card title="Voice Profiles" href="/developer/voice-profiles">
    How reference text is stored alongside samples
  </Card>
  <Card title="GPU Acceleration" href="/overview/gpu-acceleration">
    Platform-specific acceleration including MLX-Whisper
  </Card>
</Cards>