---
title: "TTS Engines"
description: "How to add new text-to-speech engines to Voicebox"
---

> **For humans:** This doc is optimized for AI agents to implement new TTS engines autonomously. It's structured as a phased workflow with explicit gates and a checklist so an agent can do the full integration — dependency research, backend, frontend, bundling — and hand you a draft release or prod build to test locally. It's also a useful reference if you're doing it yourself.

Adding an engine touches ~10 files across 4 layers. The backend protocol work is straightforward — the real time sink is dependency hell, upstream library bugs, and PyInstaller bundling.

**Do not start writing code until you complete Phase 0.** Shipping the v0.2.1 engines took three patch releases of PyInstaller fixes (through v0.2.3) because dependency research was skipped. Every issue — `inspect.getsource()` failures, missing native data files, metadata lookups, dtype mismatches — was discoverable by reading the model library's source code before integration began.

## Architecture Overview

The backend is split into layers:

| Layer | Purpose | Files Touched |
|-------|---------|---------------|
| `routes/` | Thin HTTP handlers | None (auto-dispatch) |
| `services/` | Business logic | None (auto-dispatch) |
| `backends/` | Engine implementations | `your_engine_backend.py` |
| `utils/` | Shared utilities | As needed |

New engines only need to touch `backends/` and `models.py` on the backend side — the route and service layers use a model config registry that handles dispatch automatically.

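The registry idea can be sketched roughly like this. Helper names such as `ModelConfig`, `get_model_config()`, and `engine_needs_trim()` appear elsewhere in this doc; the field layout and lookup logic below are a simplification for illustration, not the actual Voicebox source:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelConfig:
    model_name: str
    display_name: str
    engine: str
    hf_repo_id: str
    size_mb: int
    needs_trim: bool = False
    languages: list = field(default_factory=lambda: ["en"])

# Registry keyed by model name; routes/services only ever go through lookups.
MODEL_CONFIGS = {
    cfg.model_name: cfg
    for cfg in [
        ModelConfig("your-engine", "Your Engine", "your_engine",
                    "org/model-repo", 3200, needs_trim=True),
    ]
}

def get_model_config(model_name: str) -> ModelConfig:
    try:
        return MODEL_CONFIGS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name}") from None

def engine_needs_trim(model_name: str) -> bool:
    # Routes/services branch on config fields, never on engine names directly.
    return get_model_config(model_name).needs_trim
```

Because dispatch is data-driven, adding an engine means adding a config entry, not editing every route.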
## Phase 0: Dependency Research

**This phase is mandatory.** Clone the model library and its key dependencies into a temporary directory and inspect them before writing any integration code. The goal is to produce a dependency audit that identifies every PyInstaller-incompatible pattern, every native data file, and every upstream bug you'll need to work around.

### 0.1 Clone and Inspect the Model Library

```bash
# Create a throwaway workspace
mkdir /tmp/engine-research && cd /tmp/engine-research

# Clone the model library
git clone https://github.com/org/model-library.git
cd model-library
```

**Read these files first, in order:**

1. **`setup.py` / `setup.cfg` / `pyproject.toml`** — Check pinned dependency versions. If the library pins `torch==2.6.0` or `numpy<1.26`, you'll need `--no-deps` installation and manual sub-dependency listing (this is what happened with `chatterbox-tts`).

2. **`__init__.py` and the main model class** — Trace the import chain. Look for:
   - `from_pretrained()` — does it call `huggingface_hub` internally? Does it pass `token=True` (which crashes without a stored HF token)?
   - `from_local()` — does it exist? You may need manual `snapshot_download()` + `from_local()` to bypass download bugs.
   - Device handling — does it default to CUDA? Does it support MPS? Many libraries crash on MPS with unsupported operators.

3. **All `import` statements** — Recursively trace what the library imports. You're looking for:
   - `inspect.getsource()` anywhere in the chain (search all `.py` files)
   - `typeguard` / `@typechecked` decorators (these call `inspect.getsource()` at import time)
   - `importlib.metadata.version()` or `pkg_resources.get_distribution()` (need `--copy-metadata`)
   - `lazy_loader` (needs `--collect-all` to bundle `.pyi` stubs)

### 0.2 Scan for PyInstaller-Incompatible Patterns

Run these searches against the cloned library **and** its transitive dependencies:

```bash
# inspect.getsource — will crash in frozen binary without --collect-all
grep -r "inspect.getsource\|getsource(" .

# typeguard / @typechecked — calls inspect.getsource at import time
grep -r "@typechecked\|from typeguard" .

# importlib.metadata — needs --copy-metadata
grep -r "importlib.metadata\|pkg_resources.get_distribution\|pkg_resources.require" .

# Data files loaded at runtime — need --collect-all or --collect-data
grep -r "Path(__file__).parent\|os.path.dirname(__file__)\|resources_path\|pkg_resources.resource_filename" .

# Native library paths — may need env var override in frozen builds
grep -r "/usr/share\|/usr/lib\|/usr/local\|espeak\|phonemize" .

# torch.load without map_location — will crash on CPU-only builds
grep -r "torch.load(" . | grep -v "map_location"

# HuggingFace token bugs
grep -r 'token=True\|token=os.getenv' .

# Float64/Float32 assumptions — librosa returns float64, many models assume float32
grep -r "torch.from_numpy\|\.double()\|float64" .

# @torch.jit.script — calls inspect.getsource(), crashes in frozen builds
grep -r "@torch.jit.script\|torch.jit.script" .

# torchaudio.load — requires torchcodec in torchaudio 2.10+, use soundfile.read() instead
grep -r "torchaudio.load\|torchaudio.save" .

# Gated HuggingFace repos — models that hardcode gated repos as tokenizer/config sources
grep -r "from_pretrained\|tokenizer_name\|AutoTokenizer" . | grep -i "llama\|meta-llama\|gated"
```

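If you'd rather run the whole scan in one pass, a small script like this covers the same patterns (a convenience sketch, not part of the repo — the pattern list mirrors the grep commands above):

```python
import re
from pathlib import Path

# Pattern name -> regex, mirroring the grep commands above.
PATTERNS = {
    "inspect.getsource": re.compile(r"inspect\.getsource|getsource\("),
    "typeguard": re.compile(r"@typechecked|from typeguard"),
    "importlib.metadata": re.compile(r"importlib\.metadata|pkg_resources\.get_distribution"),
    "runtime data files": re.compile(r"Path\(__file__\)\.parent|os\.path\.dirname\(__file__\)"),
    "torch.jit.script": re.compile(r"torch\.jit\.script"),
    "torchaudio I/O": re.compile(r"torchaudio\.(load|save)"),
}

def scan_tree(root: str) -> dict[str, list[str]]:
    """Return {pattern_name: [matching files]} for all .py files under root."""
    hits: dict[str, list[str]] = {name: [] for name in PATTERNS}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(errors="ignore")
        for name, rx in PATTERNS.items():
            if rx.search(text):
                hits[name].append(str(path))
    return hits
```

Run it over both the model library clone and your site-packages copy of its transitive dependencies.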
### 0.3 Install and Trace in a Throwaway Venv

```bash
# Create isolated venv
python -m venv /tmp/engine-venv
source /tmp/engine-venv/bin/activate

# Install the package (try normally first)
pip install model-package

# Check if it conflicts with our stack (quote specifiers containing > or <
# so the shell doesn't treat them as redirects)
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"
# If this fails, you need --no-deps:
pip install --no-deps model-package

# Get the full dependency tree
pip show model-package     # Check Requires: field
pip show -f model-package  # List all installed files (look for data files)

# Check for non-PyPI dependencies
pip install model-package 2>&1 | grep -i "no matching distribution"
```

### 0.4 Test Model Loading on CPU

Before writing any integration code, verify the model works on CPU in a plain Python script:

```python
import numpy as np
import torch

# Force CPU to catch map_location bugs early
# (ModelClass is your engine's model class)
model = ModelClass.from_pretrained("org/model", device="cpu")

# Test with a float32 audio array (not float64 — numpy defaults to float64)
audio = np.random.randn(16000).astype(np.float32)
output = model.generate("Hello world", audio)
print(f"Output shape: {output.shape}, dtype: {output.dtype}, sample rate: {model.sample_rate}")
```

If this crashes, you've found a bug you'll need to monkey-patch. Common ones:

- `RuntimeError: expected scalar type Float but found Double` → needs a float32 cast
- `RuntimeError: map_location` → needs a `torch.load` patch
- `RuntimeError: Unsupported operator aten::...` → needs an MPS skip

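To make the smoke test also exercise a clean download path (see the checklist item about a clean HuggingFace cache), redirect the cache to a throwaway directory before importing the model library. `HF_HOME` is the standard `huggingface_hub` cache override; the helper itself is just glue:

```python
import os
import tempfile

def use_throwaway_hf_cache() -> str:
    """Redirect the HuggingFace cache to a fresh temp dir for this process.

    Call this before importing huggingface_hub or the model library,
    since the cache location is resolved when those modules load.
    """
    cache_dir = tempfile.mkdtemp(prefix="hf-cache-")
    os.environ["HF_HOME"] = cache_dir
    return cache_dir
```

With the cache empty, `from_pretrained()` is forced down the real download path, which is exactly where `token=True` bugs and progress-tracking issues show up.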
### 0.5 Produce a Dependency Audit

Before proceeding to Phase 1, write down:

1. **PyPI vs non-PyPI deps** — which packages need `--find-links`, `git+https://`, or `--no-deps`?
2. **PyInstaller directives needed** — which packages need `--collect-all`, `--copy-metadata`, `--hidden-import`?
3. **Runtime data files** — which packages ship data files (YAML, pretrained weights, phoneme tables, shader libraries) that must be bundled?
4. **Native library paths** — which packages look for data at system paths that won't exist in a frozen binary?
5. **Monkey-patches needed** — `torch.load` map_location, float64→float32 casts, MPS skip, HF token bypass, etc.
6. **Sample rate** — what does the engine output? (24kHz, 44.1kHz, 48kHz)
7. **Model download method** — `from_pretrained()` with a library-managed download, or manual `snapshot_download()` + `from_local()`?

This audit becomes your implementation plan for Phases 1, 4, and 5.

## Phase 1: Backend Implementation

### 1.1 Create the Backend File

Create `backend/backends/<engine>_backend.py` (~200-300 lines) implementing the `TTSBackend` protocol:

```python
class YourBackend:
    """Must satisfy the TTSBackend protocol."""

    async def load_model(self, model_size: str = "default") -> None: ...
    async def create_voice_prompt(self, audio_path: str, reference_text: str, use_cache: bool = True) -> tuple[dict, bool]: ...
    async def combine_voice_prompts(self, audio_paths: list[str], ref_texts: list[str]) -> tuple[np.ndarray, str]: ...
    async def generate(self, text: str, voice_prompt: dict, language: str = "en", seed: int | None = None, instruct: str | None = None) -> tuple[np.ndarray, int]: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
    def _get_model_path(self, model_size: str) -> str: ...
```

**Key decisions per engine:**

| Decision | Options | Examples |
|----------|---------|----------|
| **Voice prompt storage** | Pre-computed tensors vs deferred file paths | Qwen stores tensor dicts; Chatterbox stores paths |
| **Caching** | Use the voice prompt cache or skip it | LuxTTS caches with a prefix; Chatterbox skips caching |
| **Device selection** | CUDA / MPS / CPU | Chatterbox forces CPU on macOS (MPS bugs) |
| **Model download** | Library handles it vs manual `snapshot_download` | Turbo uses a manual download to bypass the `token=True` bug |
| **Sample rate** | Engine-specific | LuxTTS outputs 48kHz; everything else is 24kHz |

### 1.2 Voice Prompt Patterns

**Pattern A: Pre-computed tensors** (Qwen, LuxTTS)

```python
encoded = model.encode_prompt(audio_path)
return encoded, False  # (prompt_dict, was_cached)
```

**Pattern B: Deferred file paths** (Chatterbox, MLX)

```python
return {"ref_audio": audio_path, "ref_text": reference_text}, False
```

**Pattern C: Hybrid** (possible for new engines)

```python
embedding = model.extract_speaker(audio_path)
return {"embedding": embedding, "ref_audio": audio_path}, False
```

If caching, prefix your cache keys:

```python
cache_key = "yourengine_" + get_cache_key(audio_path, reference_text)
```

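`get_cache_key` is a Voicebox helper; as a sketch of the idea, a content-addressed key can hash the reference audio bytes together with the transcript, so the key changes whenever either input changes (hypothetical implementation — the real helper may differ):

```python
import hashlib
from pathlib import Path

def get_cache_key(audio_path: str, reference_text: str) -> str:
    """Stable key derived from the audio file contents plus the transcript."""
    h = hashlib.sha256()
    h.update(Path(audio_path).read_bytes())
    h.update(reference_text.encode("utf-8"))
    return h.hexdigest()[:32]

def engine_cache_key(engine: str, audio_path: str, reference_text: str) -> str:
    # Prefix per engine so two engines never collide on the same reference audio.
    return f"{engine}_{get_cache_key(audio_path, reference_text)}"
```

Hashing file contents (not the path) means re-recording the reference audio invalidates the cache even if the filename stays the same.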
### 1.3 Register the Engine

In `backend/backends/__init__.py`:

**Add a `ModelConfig` entry:**

```python
ModelConfig(
    model_name="your-engine",
    display_name="Your Engine",
    engine="your_engine",
    hf_repo_id="org/model-repo",
    size_mb=3200,
    needs_trim=False,  # set True if output needs trim_tts_output()
    languages=["en", "fr", "de"],
),
```

**Add to the `TTS_ENGINES` dict:**

```python
TTS_ENGINES = {
    ...
    "your_engine": "Your Engine",
}
```

**Add a factory branch:**

```python
elif engine == "your_engine":
    from .your_engine_backend import YourBackend
    backend = YourBackend()
```

### 1.4 Update Request Models

In `backend/models.py`:

- Add the engine name to the `GenerationRequest.engine` regex pattern
- Add any new language codes to the language regex

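Extending the validation pattern amounts to appending the new engine to the alternation. The engine list and field definition below are illustrative, not the actual contents of `backend/models.py`:

```python
import re

# Hypothetical current pattern with the new engine appended to the alternation.
ENGINE_PATTERN = r"^(qwen|chatterbox|your_engine)$"

def validate_engine(engine: str) -> str:
    """Reject any engine name not in the allow-list, as the Pydantic regex would."""
    if not re.fullmatch(ENGINE_PATTERN, engine):
        raise ValueError(f"Unsupported engine: {engine!r}")
    return engine
```

Forgetting this step is an easy miss: the backend and frontend both know the engine, but requests 422 at validation.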
## Phase 2: Route and Service Integration

With the model config registry, the route and service layers have **zero per-engine dispatch points**. All endpoints use registry helpers like `get_model_config()`, `load_engine_model()`, `engine_needs_trim()`, `check_model_loaded()`, etc.

**You don't need to touch any route or service files** unless your engine needs custom behavior in the generate pipeline.

### Post-Processing

If your model produces trailing silence, set `needs_trim=True` on your `ModelConfig`. The generation service applies `trim_tts_output()` automatically.

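The trim decision in the generation service reduces to something like this. The `trim_tts_output` below is a naive trailing-silence cut on a plain float list for illustration — not the real implementation:

```python
def trim_tts_output(samples: list[float], threshold: float = 1e-3) -> list[float]:
    """Drop trailing samples whose magnitude is below the silence threshold."""
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[:end]

def postprocess(samples: list[float], needs_trim: bool) -> list[float]:
    # Mirrors the registry-driven branch: only engines flagged needs_trim get trimmed.
    return trim_tts_output(samples) if needs_trim else samples
```

Because the flag lives on `ModelConfig`, the service never mentions engine names — a new engine opts in by setting one field.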
## Phase 3: Frontend Integration

### 3.1 TypeScript Types

In `app/src/lib/api/types.ts`:

- Add to the `engine` union type on `GenerationRequest`

### 3.2 Language Maps

In `app/src/lib/constants/languages.ts`:

- Add an entry to the `ENGINE_LANGUAGES` record
- Add any new language codes to `ALL_LANGUAGES` if needed

### 3.3 Engine/Model Selector

In `app/src/components/Generation/EngineModelSelector.tsx`:

- Add entries to `ENGINE_OPTIONS` and `ENGINE_DESCRIPTIONS`
- Add to `ENGLISH_ONLY_ENGINES` if applicable

### 3.4 Form Hook

In `app/src/lib/hooks/useGenerationForm.ts`:

- Add to the Zod schema enum for `engine`
- Add the engine-to-model-name mapping
- Update payload construction for engine-specific fields

**Watch out for model naming inconsistencies.** The HuggingFace repo name, the model size label, and the API model name don't always follow predictable patterns. For example, TADA's 3B model is named `tada-3b-ml` (not `tada-3b`), because it's a multilingual variant. Always check the actual repo names and build the frontend model name mapping from those, not from assumptions like `{engine}-{size}`.

### 3.5 Model Management

In `app/src/components/ServerSettings/ModelManagement.tsx`:

- Add a description to the `MODEL_DESCRIPTIONS` record
- Add the model name to the `voiceModels` filter condition

### 3.6 Non-Cloning Engines (Preset Voices)

If your engine uses **pre-built voices** instead of zero-shot cloning from reference audio (e.g. Kokoro), additional integration is needed:

**Backend:**

- In `kokoro_backend.py` (or your engine's backend), define a `VOICES` list of `(voice_id, display_name, gender, language)` tuples
- `create_voice_prompt()` should return `{"voice_type": "preset", "preset_engine": "<engine>", "preset_voice_id": "<id>"}`
- `generate()` should read `voice_prompt.get("preset_voice_id")` to select the voice
- Add a `seed_preset_profiles("<engine>")` call in `backend/routes/models.py` after model download completes
- The `seed_preset_profiles()` function in `backend/services/profiles.py` creates DB profiles with `voice_type="preset"`

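The backend half of the preset flow reduces to something like this stdlib-only sketch. The `VOICES` tuple shape and the prompt dict keys follow the bullets above; the voice ids and everything else are illustrative:

```python
VOICES = [
    # (voice_id, display_name, gender, language) — illustrative entries
    ("voice_a", "Voice A", "female", "en"),
    ("voice_b", "Voice B", "male", "en"),
]

def create_voice_prompt(voice_id: str) -> dict:
    """Presets skip audio encoding entirely — the prompt is just a reference."""
    if voice_id not in {v[0] for v in VOICES}:
        raise ValueError(f"Unknown preset voice: {voice_id}")
    return {
        "voice_type": "preset",
        "preset_engine": "kokoro",
        "preset_voice_id": voice_id,
    }

def select_voice(voice_prompt: dict) -> str:
    # generate() reads the preset id back out of the prompt dict.
    return voice_prompt.get("preset_voice_id", VOICES[0][0])
```

`seed_preset_profiles()` then just iterates `VOICES` and creates one DB profile per tuple.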
**Frontend:**

- The `EngineModelSelector` filters options based on `selectedProfile.voice_type`:
  - `"cloned"` profiles → only cloning engines shown (Kokoro hidden)
  - `"preset"` profiles → only the preset's engine shown
- Profile cards show the engine name as a badge for preset profiles
- When a preset profile is selected, the engine auto-switches

**Profile schema fields for presets:**

- `voice_type: "preset"` (vs `"cloned"` for traditional profiles)
- `preset_engine: "<engine>"` — which engine owns this voice
- `preset_voice_id: "<id>"` — the engine-specific voice identifier

**For future "designed" voices** (text description instead of audio, e.g. Qwen CustomVoice):

- Use `voice_type: "designed"` with a `design_prompt` field
- `create_voice_prompt_for_profile()` already returns the design prompt for this type

## Phase 4: Dependencies

Use the dependency audit from Phase 0 to drive this phase. You should already know which packages are needed, which conflict, and which require special installation.

### 4.1 Python Dependencies

Add to `backend/requirements.txt`. There are three installation patterns, depending on what Phase 0 revealed:

**Normal PyPI packages:**

```
some-model-package>=1.0.0
```

**Pinned dependency conflicts (`--no-deps`)** — If the model package pins old versions of torch/numpy/transformers, install it with `--no-deps` and list its sub-dependencies manually. This is the pattern used for `chatterbox-tts`:

```bash
# In justfile / CI setup:
pip install --no-deps chatterbox-tts

# In requirements.txt — list each actual sub-dependency:
conformer>=0.3.2
diffusers>=0.31.0
omegaconf>=2.3.0
resemble-perth>=0.0.2
s3tokenizer>=0.1.6
```

To identify sub-deps: `pip show chatterbox-tts` → `Requires:` field, then cross-reference against the existing `requirements.txt` to avoid duplicates.

**Non-PyPI packages** — Some libraries only exist on GitHub or require custom indexes:

```
# Git-only packages (no PyPI release)
linacodec @ git+https://github.com/ysharma3501/LinaCodec.git
Zipvoice @ git+https://github.com/ysharma3501/LuxTTS.git

# Custom package indexes (C extensions with platform-specific wheels)
--find-links https://k2-fsa.github.io/icefall/piper_phonemize.html
piper-phonemize>=1.2.0
```

### 4.2 Dependency Conflict Resolution

Check for conflicts with the existing stack before adding anything:

```bash
# Our current stack pins (approximate):
# Python 3.12+, torch>=2.10, transformers>=4.57, numpy>=1.26

# Test compatibility (quote specifiers containing > or <)
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"

# If it fails, check what the package pins:
pip show model-package | grep Requires
# Look at setup.py/pyproject.toml for version constraints
```

**Known incompatible patterns in the wild:**

- `torch==2.6.0` — many older packages pin this
- `numpy<1.26` — conflicts with Python 3.12+
- `transformers==4.46.3` — many packages pin old transformers
- pinned `onnxruntime` versions — often conflict with torch

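You can pre-screen the worst offenders programmatically with stdlib `importlib.metadata`, which exposes the same `Requires` data as `pip show`. The stack pin list here is illustrative:

```python
from importlib import metadata

# Illustrative approximation of our stack pins.
STACK_PINS = {"torch": "2.10", "numpy": "1.26", "transformers": "4.57"}

def declared_requirements(package: str) -> list[str]:
    """Raw requirement strings an installed package declares, e.g. 'torch==2.6.0'."""
    return metadata.requires(package) or []

def conflicting_pins(requirements: list[str]) -> list[str]:
    """Flag exact pins (==) on packages our stack pins at a different version."""
    conflicts = []
    for req in requirements:
        name, sep, version = req.partition("==")
        name = name.strip()
        if sep and name in STACK_PINS and not version.startswith(STACK_PINS[name]):
            conflicts.append(req)
    return conflicts
```

This only catches hard `==` pins — range conflicts like `numpy<1.26` still need a real resolver run (`pip install`), so treat it as a first pass, not a replacement for 0.3.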
### 4.3 Update Installation Scripts

Dependencies must be added in multiple places:

| File | What to add |
|------|------------|
| `backend/requirements.txt` | Package and version constraint |
| `justfile` | `--no-deps` install line if needed (in the `setup-python` and `setup-python-release` targets) |
| `.github/workflows/release.yml` | The same `--no-deps` line in CI build steps |
| `Dockerfile` | The same install commands for Docker builds |

## Phase 5: PyInstaller Bundling (`build_binary.py`)

This is where most of the pain lives. **The v0.2.3 release was entirely dedicated to fixing bundling issues** — every new engine that shipped in v0.2.1 (LuxTTS, Chatterbox, Chatterbox Turbo) worked in dev but failed in production builds. Don't skip this phase.

### 5.1 Register Your Engine in `build_binary.py`

Every new engine needs entries in `backend/build_binary.py`. This file drives PyInstaller and is the single most common source of "works in dev, breaks in prod" bugs. Decide which PyInstaller directives your engine's dependencies require:

| Directive | What It Does | When You Need It |
|-----------|-------------|-----------------|
| `--hidden-import <module>` | Includes a module PyInstaller can't detect via static analysis | Dynamic imports, lazy imports, plugin architectures |
| `--collect-all <package>` | Bundles source `.py` files, data files, AND native libraries | Packages that call `inspect.getsource()` at import time (e.g. `inflect` via `typeguard`'s `@typechecked`), or that ship pretrained model files (e.g. `perth` ships `.pth.tar` + `hparams.yaml`) |
| `--collect-data <package>` | Bundles only data files (not source or native libs) | Packages with YAML configs, vocab files, etc. |
| `--collect-submodules <package>` | Bundles all submodules | Packages with deep module trees that PyInstaller misses |
| `--copy-metadata <package>` | Copies `importlib.metadata` info | Packages that call `importlib.metadata.version()` or `pkg_resources.get_distribution()` at runtime. Already required for: `requests`, `transformers`, `huggingface-hub`, `tokenizers`, `safetensors`, `tqdm` |

**Example: adding hidden imports and collect-all for a new engine:**

```python
# In build_binary.py, inside the args list:
"--hidden-import",
"backend.backends.your_engine_backend",
"--hidden-import",
"your_engine_package",
"--hidden-import",
"your_engine_package.inference",
"--collect-all",
"some_dependency_that_uses_inspect_getsource",
"--copy-metadata",
"some_dependency_that_checks_its_own_version",
```

### 5.2 Lessons from v0.2.3 — Real Failures and Their Fixes

These are actual production failures from shipping new engines. Every one of them passed `python -m uvicorn` in dev:

| Engine | Failure | Root Cause | Fix |
|--------|---------|-----------|-----|
| LuxTTS | `"could not get source code"` on import | `inflect` uses `typeguard`'s `@typechecked`, which calls `inspect.getsource()` — it needs `.py` source files, not just bytecode | `--collect-all inflect` |
| LuxTTS | `espeak-ng-data` not found | The `piper_phonemize` C library looks for data at `/usr/share/espeak-ng-data/`, which doesn't exist in the bundle | `--collect-all piper_phonemize` + set `ESPEAK_DATA_PATH` env var at runtime (see 5.3) |
| LuxTTS | `inspect.getsource` error in Vocos codec | `linacodec` and `zipvoice` use source introspection | `--collect-all linacodec` + `--collect-all zipvoice` |
| Chatterbox | `FileNotFoundError` for watermark model | `perth` ships pretrained model files (`hparams.yaml`, `.pth.tar`) that PyInstaller doesn't bundle by default | `--collect-all perth` |
| All engines | `importlib.metadata` failures | Frozen binary doesn't include package metadata for `huggingface-hub`, `transformers`, etc. | `--copy-metadata` for each affected package |
| All engines | Download progress bars stuck at 0% | `huggingface_hub` silently disables tqdm progress bars based on logger level in frozen builds — our progress tracker never receives byte updates | Force-enable tqdm's internal counter in `HFProgressTracker` |
| TADA | `inspect.getsource` error in DAC's `Snake1d` | `@torch.jit.script` calls `inspect.getsource()`, which fails without `.py` source files | Wrote a lightweight shim (`dac_shim.py`) reimplementing `Snake1d` without `@torch.jit.script`, registered fake `dac.*` modules in `sys.modules` |
| All engines | `NameError: name 'obj' is not defined` on macOS | Python 3.12.0 has a [CPython bug](https://github.com/pyinstaller/pyinstaller/issues/7992) that corrupts bytecode when PyInstaller rewrites code objects | Upgrade to Python 3.12.13+ |
| All engines | `resource_tracker` subprocess crash | `multiprocessing` in frozen binaries needs `freeze_support()` called before anything else | Added to `server.py` entry point |

### 5.3 Runtime Frozen-Build Handling (`server.py`)

Some fixes can't live in `build_binary.py` — they need runtime detection. The entry point `backend/server.py` handles these before any heavy imports:

```python
import multiprocessing
import os
import sys

# 1. freeze_support() — MUST be called before any multiprocessing use
multiprocessing.freeze_support()

# 2. Native data paths — redirect C libraries to bundled data
if getattr(sys, 'frozen', False):
    _meipass = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    _espeak_data = os.path.join(_meipass, 'piper_phonemize', 'espeak-ng-data')
    if os.path.isdir(_espeak_data):
        os.environ.setdefault('ESPEAK_DATA_PATH', _espeak_data)

# 3. stdout/stderr safety — PyInstaller --noconsole on Windows sets these to None
# (_is_writable is a small helper in server.py that checks for a usable stream)
if not _is_writable(sys.stdout):
    sys.stdout = open(os.devnull, 'w')
```

If your engine's dependencies include native libraries that look for data at system paths (like espeak-ng does), you'll need to add a similar `os.environ.setdefault()` block here.

### 5.4 CUDA vs CPU Build Branching

`build_binary.py` produces two different binaries:

- **`voicebox-server`** (CPU) — excludes all `nvidia.*` packages to avoid bundling ~3 GB of CUDA DLLs
- **`voicebox-server-cuda`** — includes `torch.cuda` and `torch.backends.cudnn`

On Windows, if the build environment has CUDA torch installed but you're building the CPU binary, the script temporarily swaps to CPU-only torch and restores CUDA torch afterward. This prevents PyInstaller from accidentally bundling CUDA libraries into the CPU build.

New engine imports go in the **common section** (not the CUDA or MLX conditional blocks) unless your engine has platform-specific dependencies.

### 5.5 MLX Conditional Inclusion

Apple Silicon builds conditionally include MLX hidden imports and `--collect-all mlx` / `--collect-all mlx_audio`. If your engine has an MLX-specific backend variant, add its imports inside the `if is_apple_silicon() and not cuda:` block.

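Conceptually, the args assembly in `build_binary.py` branches like this (a simplified sketch — the real script carries many more directives and the torch-swap logic described above):

```python
import platform
import sys

def is_apple_silicon() -> bool:
    return sys.platform == "darwin" and platform.machine() == "arm64"

def build_args(cuda: bool, apple_silicon: bool) -> list[str]:
    # Common section: every engine's hidden imports go here, CPU and CUDA alike.
    args = ["--hidden-import", "backend.backends.your_engine_backend"]
    if cuda:
        args += ["--hidden-import", "torch.cuda",
                 "--hidden-import", "torch.backends.cudnn"]
    elif apple_silicon:
        # MLX is only bundled on Apple Silicon, non-CUDA builds.
        args += ["--collect-all", "mlx", "--collect-all", "mlx_audio"]
    return args
```

The point of the structure: a new engine normally only ever edits the common section at the top.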
### 5.6 Testing Frozen Builds

You can't skip this. Models that work under `python -m uvicorn` will break in the PyInstaller binary. It took **three releases** (v0.2.1 → v0.2.2 → v0.2.3) to get all engines working in production.

1. Build: `just build`
2. Launch the binary directly (not via `python -m`)
3. Test the **full chain**: download → load → generate → progress tracking
4. Check stderr for the actual error (logs go to stderr for Tauri sidecar capture)
5. Fix, rebuild, repeat

**Common gotcha:** testing only generation with a pre-cached model from your dev install. Always test with a clean model cache to verify downloads work too.

## Phase 6: Common Upstream Workarounds

### torch.load device mismatch

```python
import torch

_original_torch_load = torch.load

def _patched_torch_load(*args, **kwargs):
    # Default to CPU so checkpoints saved on CUDA load on CPU-only machines
    kwargs.setdefault("map_location", "cpu")
    return _original_torch_load(*args, **kwargs)

torch.load = _patched_torch_load
```

### Float64/Float32 dtype mismatch

```python
original_fn = SomeClass.some_method

def patched_fn(self, *args, **kwargs):
    result = original_fn(self, *args, **kwargs)
    return result.float()  # cast float64 output down to float32

SomeClass.some_method = patched_fn
```

### HuggingFace token bug

```python
from huggingface_hub import snapshot_download

# token=None bypasses the library's broken token=True default
local_path = snapshot_download(repo_id=REPO, token=None)
model = ModelClass.from_local(local_path, device=device)
```

### MPS tensor issues

Skip MPS entirely if operators aren't supported:

```python
def _get_device(self):
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # Skip MPS
```

### Gated HuggingFace repos as hardcoded config sources

Some models hardcode a gated HuggingFace repo as their tokenizer or config source (e.g., TADA hardcodes `"meta-llama/Llama-3.2-1B"` in both its `AlignerConfig` and `TadaConfig`). This silently fails without HF authentication.

**Fix:** Download from an ungated mirror and patch the config objects directly:

```python
from huggingface_hub import snapshot_download

# Download the tokenizer from an ungated mirror
UNGATED_TOKENIZER = "unsloth/Llama-3.2-1B"
tokenizer_path = snapshot_download(UNGATED_TOKENIZER, token=None)

# Patch the model config to use the local path instead of the gated repo
config = ModelConfig.from_pretrained(model_path)
config.tokenizer_name = tokenizer_path
model = ModelClass.from_pretrained(model_path, config=config)
```

**Do NOT monkey-patch `AutoTokenizer.from_pretrained`** — it's a classmethod, and replacing it corrupts the descriptor, which breaks other engines that use different tokenizers (e.g., Qwen uses a Qwen tokenizer via `AutoTokenizer`). Always patch at the config level, not at the class-method level.

### `torchaudio.load()` requires `torchcodec` in 2.10+

As of `torchaudio>=2.10`, `torchaudio.load()` requires the `torchcodec` package for audio I/O. If your engine or backend code uses `torchaudio.load()`, replace it with `soundfile`:

```python
# Before (breaks without torchcodec):
import torchaudio
waveform, sr = torchaudio.load("audio.wav")

# After:
import soundfile as sf
import torch
data, sr = sf.read("audio.wav", dtype="float32")
waveform = torch.from_numpy(data).unsqueeze(0)
```

Note: `torchaudio.functional.resample()` and other pure-PyTorch math functions work fine without `torchcodec` — only the I/O functions are affected.

### `@torch.jit.script` breaks in frozen builds

`torch.jit.script` calls `inspect.getsource()` to parse the decorated function's source code. In a PyInstaller binary, `.py` source files aren't available, so this crashes at import time.

**Fix:** Remove or avoid `@torch.jit.script` decorators. If the decorated function comes from an upstream dependency, write a shim that reimplements the function without the decorator (see "Toxic dependency chains" below).

### Toxic dependency chains — the shim pattern

Sometimes a model library depends on a package with a massive, hostile transitive dependency tree, but only uses a tiny piece of it. When the dependency chain is unbuildable or would pull in dozens of unwanted packages, the right move is to write a lightweight shim.

**Example:** TADA depends on `descript-audio-codec` (DAC), which pulls in `descript-audiotools` → `onnx`, `tensorboard`, `protobuf`, `matplotlib`, `pystoi`, etc. The `onnx` package fails to build from source on macOS. But TADA only uses `Snake1d` from DAC — a 7-line PyTorch module.

**Solution:** Create a shim at `backend/utils/dac_shim.py` that registers fake modules in `sys.modules`:

```python
import sys
import types

import torch
from torch import nn


def snake(x, alpha):
    """Snake activation — reimplemented without @torch.jit.script."""
    return x + (1.0 / (alpha + 1e-9)) * torch.sin(alpha * x).pow(2)


class Snake1d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return snake(x, self.alpha)


# Register fake dac.* modules so "from dac.nn.layers import Snake1d" works
_nn = types.ModuleType("dac.nn")
_layers = types.ModuleType("dac.nn.layers")
_layers.Snake1d = Snake1d
_nn.layers = _layers

for name, mod in [("dac", types.ModuleType("dac")),
                  ("dac.nn", _nn), ("dac.nn.layers", _layers)]:
    sys.modules[name] = mod
```

**Key rules for shims:**

- Import the shim **before** importing the model library (so it finds the fake modules first)
- Do NOT use `@torch.jit.script` in the shim (see above)
- Only reimplement what the model actually uses — check the import chain carefully

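The import-order rule is easy to verify with a toy shim. The module names here are made up; the mechanism is the same `sys.modules` trick as the DAC example:

```python
import sys
import types

# Register a fake "fancycodec.layers" module BEFORE anything imports it.
_layers = types.ModuleType("fancycodec.layers")
_layers.Snake1d = lambda channels: f"Snake1d({channels})"
_pkg = types.ModuleType("fancycodec")
_pkg.layers = _layers
sys.modules["fancycodec"] = _pkg
sys.modules["fancycodec.layers"] = _layers

# A later import statement now resolves to the shim instead of the real package,
# because the import machinery consults sys.modules first.
from fancycodec.layers import Snake1d
```

If the real package were imported first, it would already occupy the `sys.modules` slot and the shim registration would silently overwrite a half-initialized module — hence rule one above.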
## Candidate Engines

The [`docs/PROJECT_STATUS.md`](https://github.com/jamiepine/voicebox/blob/main/docs/PROJECT_STATUS.md) file is the canonical, living list of candidates under evaluation — including why some have been backlogged (e.g. VoxCPM, which is effectively CUDA-only upstream).

At a glance, the current top candidates:

| Model | Tier | Size | Cross-platform? | Key Features |
|-------|------|------|-----------------|--------------|
| **MOSS-TTS-Nano** | 1 | 0.1 B | Yes (CPU realtime) | 48 kHz stereo, Apache 2.0, released 2026-04-13 |
| **Voxtral TTS** | 2 | 4 B | Likely | `mistralai/Voxtral-4B-TTS-2603` — presets + cloning |
| **VibeVoice** | 2 | ~500 M | Yes | Podcast-style multi-speaker dialogue |
| **Dia2** | 3 | TBD | TBD | Successor to the original Dia |
| **Fish Audio S2 Pro** | 3 | Medium | Yes | Word-level control via inline text |

**Backlogged:**

- **VoxCPM** (2B, Apache 2.0) — CUDA ≥12 required upstream; MPS broken (issues #232/#248); the CPU path was rejected by maintainers (#256). Keep watching for a PR that relaxes the device requirement.

Update `PROJECT_STATUS.md` when you pick one up or mark one as shipped/backlogged.

## Implementation Checklist

Use this as a gate between phases. Do not proceed to the next phase until every item in the current phase is checked.

### Phase 0: Dependency Research
- [ ] Cloned model library source into a temp directory
- [ ] Read `setup.py` / `pyproject.toml` — noted pinned dependency versions
- [ ] Traced all imports from the model class through to leaf dependencies
- [ ] Searched for `inspect.getsource`, `@typechecked`, `typeguard` in the full dependency tree
- [ ] Searched for `importlib.metadata`, `pkg_resources.get_distribution` in the dependency tree
- [ ] Searched for `Path(__file__).parent`, `os.path.dirname(__file__)`, hardcoded system paths
- [ ] Searched for `torch.load` calls missing `map_location`
- [ ] Searched for `torch.from_numpy` without `.float()` cast
- [ ] Searched for `token=True` or `token=os.getenv("HF_TOKEN")` in HuggingFace calls
- [ ] Searched for `@torch.jit.script` / `torch.jit.script` (crashes in frozen builds)
- [ ] Searched for `torchaudio.load` / `torchaudio.save` (requires `torchcodec` in 2.10+)
- [ ] Searched for hardcoded gated HuggingFace repo names (e.g., `meta-llama/*`)
- [ ] Evaluated whether any dependency is used minimally enough to shim instead of install
- [ ] Tested model loading and generation on CPU in a throwaway venv
- [ ] Tested with a clean HuggingFace cache (no pre-downloaded models)
- [ ] Produced a written dependency audit documenting all findings

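The pattern searches above can be folded into one throwaway script. This is a sketch, not part of the repo: the patterns are illustrative, and the `torch.load` regex is only a line-level heuristic for a missing `map_location`.

```python
import re
import tempfile
from pathlib import Path

# Patterns mirroring the checklist; extend as needed for your engine.
RISKY = [
    r"inspect\.getsource",
    r"@typechecked",
    r"importlib\.metadata",
    r"pkg_resources",
    r"torch\.jit\.script",
    r"torch\.load\((?![^)]*map_location)",
    r"torchaudio\.(?:load|save)",
]

def audit(src: Path) -> list[tuple[str, int, str]]:
    """Return (file, line_number, pattern) for every risky hit under src."""
    hits = []
    for py in sorted(src.rglob("*.py")):
        for lineno, line in enumerate(py.read_text(errors="ignore").splitlines(), 1):
            hits += [(py.name, lineno, pat) for pat in RISKY if re.search(pat, line)]
    return hits

# Demo on a tiny fake tree so the script runs standalone; in real use,
# point audit() at the cloned model library instead.
demo = Path(tempfile.mkdtemp())
(demo / "mod.py").write_text("src = inspect.getsource(fn)\n")
findings = audit(demo)
```

Every hit goes into the written dependency audit with a planned mitigation (monkey-patch, shim, `--collect-all`, etc.).
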
### Phase 1: Backend Implementation
- [ ] Created `backend/backends/<engine>_backend.py` implementing `TTSBackend` protocol
- [ ] Chose voice prompt pattern (pre-computed tensors vs deferred file paths)
- [ ] Implemented all monkey-patches identified in Phase 0
- [ ] Used `get_torch_device()` from `backends/base.py` for device selection
- [ ] Used `model_load_progress()` from `backends/base.py` for download/load tracking
- [ ] Tested: model downloads correctly
- [ ] Tested: model loads on CPU
- [ ] Tested: generation produces valid audio
- [ ] Tested: voice cloning from reference audio works
- [ ] Registered `ModelConfig` in `backends/__init__.py`
- [ ] Added to `TTS_ENGINES` dict
- [ ] Added factory branch in `get_tts_backend_for_engine()`
- [ ] Updated engine regex in `backend/models.py`

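The last four items wire the engine into the dispatch registry. The stand-ins below are simplified, hypothetical versions of `ModelConfig`, `TTS_ENGINES`, and `get_tts_backend_for_engine` (the real ones live in `backend/backends/__init__.py` with different fields), but the registration shape is the same:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:  # simplified stand-in; real fields differ
    engine: str
    repo_id: str

TTS_ENGINES: dict[str, ModelConfig] = {}

class MyEngineBackend:  # would implement the TTSBackend protocol
    pass

def get_tts_backend_for_engine(engine: str):
    # Factory: one branch per registered engine
    if engine == "my_engine":
        return MyEngineBackend()
    raise ValueError(f"unknown engine: {engine}")

# Registering the config plus adding the factory branch is all the
# route/service layers need to auto-dispatch to the new engine.
TTS_ENGINES["my_engine"] = ModelConfig(engine="my_engine", repo_id="org/my-model")
backend = get_tts_backend_for_engine("my_engine")
```
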
### Phase 2–3: Route, Service, and Frontend
- [ ] Confirmed zero changes needed in routes/services (or documented why custom behavior is needed)
- [ ] Added engine to TypeScript union type in `app/src/lib/api/types.ts`
- [ ] Added language map entry in `app/src/lib/constants/languages.ts`
- [ ] Added to `ENGINE_OPTIONS` and `ENGINE_DESCRIPTIONS` in `EngineModelSelector.tsx`
- [ ] Added to Zod schema and model-name mapping in `useGenerationForm.ts`
- [ ] Added description in `ModelManagement.tsx`

### Phase 4: Dependencies
- [ ] Added packages to `backend/requirements.txt`
- [ ] If `--no-deps` needed: listed sub-dependencies explicitly
- [ ] If git-only packages: added `@ git+https://...` entries
- [ ] If custom index needed: added `--find-links` line
- [ ] Updated `justfile` setup targets
- [ ] Updated `.github/workflows/release.yml` build steps
- [ ] Updated `Dockerfile` if applicable
- [ ] Verified `pip install` succeeds in a clean venv with existing requirements

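As a reference shape, here is a hypothetical `requirements.txt` fragment combining the custom-index, pinned, and git-only cases. Package names, versions, and URLs are invented; note that `--no-deps` itself is a pip CLI flag that belongs in the `justfile` install command, not in the requirements file.

```text
# Custom wheel index (e.g. for prebuilt torch variants)
--find-links https://example.com/wheels/

# Normal pinned dependency
my-model-lib==1.2.3

# Sub-dependencies listed explicitly because my-model-lib is installed with --no-deps
einops==0.8.0
safetensors==0.4.3

# Git-only package pinned to a tag or commit
vendor-pkg @ git+https://github.com/vendor/pkg.git@v0.1.0
```
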
### Phase 5: PyInstaller Bundling
- [ ] Added `--hidden-import` entries in `build_binary.py` for:
  - [ ] `backend.backends.<engine>_backend`
  - [ ] The model package and its key submodules
- [ ] Added `--collect-all` for any packages that:
  - [ ] Use `inspect.getsource()` / `@typechecked`
  - [ ] Ship pretrained model data files (`.pth.tar`, `.yaml`, etc.)
  - [ ] Ship native data files (phoneme tables, shader libraries, etc.)
- [ ] Added `--copy-metadata` for any packages that use `importlib.metadata`
- [ ] If engine has native data paths: added `os.environ.setdefault()` in `server.py`
- [ ] Built frozen binary with `just build`
- [ ] Tested in frozen binary with **clean model cache** (not pre-cached from dev):
  - [ ] Model download works with real-time progress
  - [ ] Model loading works
  - [ ] Generation produces valid audio
  - [ ] No errors in stderr logs

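The bundling items can be sketched as an argument list appended to the PyInstaller command in `build_binary.py`. The flags themselves are standard PyInstaller options; the package names and the environment variable are placeholders:

```python
import os
import sys

ENGINE_PKG = "my_model_pkg"  # placeholder for the real model package
engine_args = [
    "--hidden-import", "backend.backends.my_engine_backend",
    "--hidden-import", ENGINE_PKG,
    "--collect-all", ENGINE_PKG,    # bundles data files and keeps sources for inspect.getsource
    "--copy-metadata", ENGINE_PKG,  # satisfies importlib.metadata lookups in the frozen app
]

# Companion change in server.py: point the library at its bundled data
# when running frozen (the env var name here is hypothetical).
if getattr(sys, "frozen", False):
    data_dir = os.path.join(sys._MEIPASS, ENGINE_PKG, "data")
    os.environ.setdefault("MY_ENGINE_DATA_DIR", data_dir)
```
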
### Phase 6: Final Verification
- [ ] Engine works in dev mode (`just dev`)
- [ ] Engine works in frozen binary (`just build` → run binary directly)
- [ ] Tested on target platform (macOS for MLX, Windows/Linux for CUDA)
- [ ] No regressions in existing engines