# Voicebox Project Status & Roadmap

> Last updated: 2026-04-18 | Current version: **v0.4.1** | 232 open issues | 12 open PRs

---

## Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [Current State](#current-state)
3. [Open PRs — Triage & Analysis](#open-prs--triage--analysis)
4. [Open Issues — Categorized](#open-issues--categorized)
5. [Existing Plan Documents — Status](#existing-plan-documents--status)
6. [New Model Integration — Landscape](#new-model-integration--landscape)
7. [Architectural Bottlenecks](#architectural-bottlenecks)
8. [Recommended Priorities](#recommended-priorities)

---

## Architecture Overview

**Tauri shell (Rust)** hosts a **React frontend** (`app/`) that talks over HTTP on `localhost:17493` to a **FastAPI backend** (`backend/`). The backend exposes:

- **`TTSBackend` Protocol** with seven concrete engine implementations:
  - Qwen3-TTS (PyTorch or MLX depending on platform)
  - Qwen CustomVoice (predefined speakers with instruct)
  - LuxTTS (fast, CPU-friendly)
  - Chatterbox Multilingual (23 languages)
  - Chatterbox Turbo (English, paralinguistic tags)
  - TADA (1B English, 3B multilingual via HumeAI)
  - Kokoro 82M (pre-built voices, CPU realtime)
- **`STTBackend` Protocol** for Whisper (PyTorch or MLX-Whisper)
- **Profiles / History / Stories** services for persistence and timeline editing

### Key Files

| Layer | File | Purpose |
|-------|------|---------|
| Backend entry | `backend/main.py` | FastAPI app, all API routes (~2850 lines) |
| TTS protocol | `backend/backends/__init__.py:32-101` | `TTSBackend` Protocol definition |
| Model registry | `backend/backends/__init__.py:17-29,153-366` | `ModelConfig` dataclass + registry helpers |
| TTS factory | `backend/backends/__init__.py:382-426` | Thread-safe engine registry (double-checked locking) |
| PyTorch TTS | `backend/backends/pytorch_backend.py` | Qwen3-TTS via `qwen_tts` package |
| MLX TTS | `backend/backends/mlx_backend.py` | Qwen3-TTS via `mlx_audio.tts` |
| LuxTTS | `backend/backends/luxtts_backend.py` | LuxTTS — fast, CPU-friendly |
| Chatterbox MTL | `backend/backends/chatterbox_backend.py` | Chatterbox Multilingual — 23 languages |
| Chatterbox Turbo | `backend/backends/chatterbox_turbo_backend.py` | Chatterbox Turbo — English, paralinguistic tags |
| TADA | `backend/backends/hume_backend.py` | HumeAI TADA — 1B English + 3B Multilingual |
| Kokoro | `backend/backends/kokoro_backend.py` | Kokoro 82M — CPU realtime, pre-built voices |
| Qwen CustomVoice | `backend/backends/qwen_custom_voice_backend.py` | Qwen CustomVoice — predefined speakers with instruct |
| Platform detect | `backend/platform_detect.py` | Apple Silicon → MLX, else → PyTorch |
| API types | `backend/models.py` | Pydantic request/response models |
| HF progress | `backend/utils/hf_progress.py` | `HFProgressTracker` (tqdm patching for download progress) |
| Audio utils | `backend/utils/audio.py` | `trim_tts_output()`, normalize, load/save audio |
| Frontend API | `app/src/lib/api/client.ts` | Hand-written fetch wrapper |
| Frontend types | `app/src/lib/api/types.ts` | TypeScript API types |
| Engine selector | `app/src/components/Generation/EngineModelSelector.tsx` | Shared engine/model dropdown |
| Generation form | `app/src/components/Generation/GenerationForm.tsx` | TTS generation UI |
| Floating gen box | `app/src/components/Generation/FloatingGenerateBox.tsx` | Compact generation UI |
| Model manager | `app/src/components/ServerSettings/ModelManagement.tsx` | Model download/status/progress UI |
| GPU acceleration | `app/src/components/ServerSettings/GpuAcceleration.tsx` | CUDA backend swap UI |
| Gen form hook | `app/src/lib/hooks/useGenerationForm.ts` | Form validation + submission |
| Language constants | `app/src/lib/constants/languages.ts` | Per-engine language maps |
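The `TTSBackend` Protocol is the seam all seven engines plug into. Below is a minimal sketch, reconstructed from the generation flow in the next section; the real definition lives at `backend/backends/__init__.py:32-101`, and the parameter types and defaults here are assumptions.

```python
# Sketch of the TTSBackend Protocol, reconstructed from the generation flow
# below. Real definition: backend/backends/__init__.py:32-101.
# Parameter types and return shapes are assumptions, not shipped code.
from typing import Any, Optional, Protocol


class TTSBackend(Protocol):
    def load_model(self, model_size: str) -> None:
        """Lazily load weights into memory (step 5 of the flow)."""
        ...

    def create_voice_prompt(self, audio_path: str, reference_text: str) -> dict[str, Any]:
        """Build an engine-specific voice prompt: tensors for Qwen,
        plain file paths for LuxTTS/Chatterbox/Kokoro (step 6)."""
        ...

    def generate(
        self,
        text: str,
        voice_prompt: dict[str, Any],
        language: str,
        seed: Optional[int] = None,
        instruct: Optional[str] = None,
    ) -> Any:
        """Synthesize audio; engines without instruct support drop it silently (step 7)."""
        ...
```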
### How TTS Generation Works (Current Flow)

```
POST /generate
1.  Look up voice profile from DB
2.  Resolve engine from request (qwen | qwen_custom_voice | luxtts | chatterbox | chatterbox_turbo | tada | kokoro)
3.  Get backend: get_tts_backend_for_engine(engine)   # thread-safe singleton per engine
4.  Check model cache → if missing, trigger background download, return HTTP 202
5.  Load model (lazy): tts_backend.load_model(model_size)
6.  Create voice prompt: profiles.create_voice_prompt_for_profile(engine=engine) → tts_backend.create_voice_prompt(audio_path, reference_text)
7.  Generate: tts_backend.generate(text, voice_prompt, language, seed, instruct)
8.  Post-process: trim_tts_output() for Chatterbox engines
9.  Save WAV → data/generations/{id}.wav
10. Insert history record in SQLite
11. Return GenerationResponse
```
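Condensed into Python, the happy path of steps 3-8 looks roughly like the sketch below (download/202 handling, queueing, and persistence elided). Helper names are taken from the flow above; the request fields and glue are assumptions.

```python
# Hedged sketch of the /generate happy path (steps 3-8 above).
# Helper names come from the flow; request field names and glue are assumptions.
from backend.backends import get_tts_backend_for_engine
from backend.utils.audio import trim_tts_output


def synthesize(req, profiles):
    backend = get_tts_backend_for_engine(req.engine)          # 3. thread-safe singleton
    backend.load_model(req.model_size)                        # 5. lazy load
    voice_prompt = profiles.create_voice_prompt_for_profile(  # 6. engine-specific prompt
        req.profile_id, engine=req.engine
    )
    audio = backend.generate(                                 # 7. synthesis
        req.text, voice_prompt, req.language, req.seed, req.instruct
    )
    if req.engine in ("chatterbox", "chatterbox_turbo"):      # 8. trim trailing silence
        audio = trim_tts_output(audio)
    return audio
```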
---

## Current State

### What's Shipped (v0.4.x)

**New since v0.3.0:**

- Kokoro 82M TTS engine + voice profile type system (PR #325)
- Qwen CustomVoice preset engine — predefined speakers with instruct support (PR #328)
- Intel Arc (XPU) GPU support (PR #320)
- Blackwell GPU (sm_120) CUDA support (PR #401)
- Generation cancellation flow (PR #444)
- Frontend quality gates + TypeScript hardening (PR #418)
- macOS Intel (x86_64) PyTorch compatibility (PR #416)
- Frozen-binary import fixes for Kokoro / Chatterbox Multilingual / scipy / transformers (PR #438)
- Linux PipeWire/PulseAudio monitor detection (PR #457)
- Server survives GUI close on Windows (PR #402)
- GPU arch compatibility warning on startup (catches unsupported PyTorch builds)
- cpal Stream playback reliability (PR #405), clip-splitting stability (PR #403)
- torch.from_numpy crash with numpy 2.x in frozen binary (PR #361)
- Async CUDA download lock (PR #428), NUMBA_CACHE_DIR env var (PR #425)
- "Clear failed" history button (PR #412)
- External server GUI startup + data refresh (PR #319)
- Force offline mode for cached Qwen/Whisper models (PR #318)
- macOS 11 ScreenCaptureKit launch crash fix (PR #424)

**Core TTS (cumulative):**

- Qwen3-TTS voice cloning (1.7B and 0.6B models, MLX + PyTorch)
- Qwen CustomVoice (preset speakers, instruct)
- LuxTTS — fast, CPU-friendly English TTS (PR #254)
- Chatterbox Multilingual — 23 languages including Hebrew (PR #257)
- Chatterbox Turbo — paralinguistic tags, low latency English (PR #258)
- HumeAI TADA — 1B English + 3B Multilingual (PR #296)
- Kokoro 82M — CPU-realtime, 8 languages, Apache 2.0 (PR #325)
- Multi-engine architecture with thread-safe backend registry (PR #254)
- Chunked TTS generation — engine-agnostic, removes ~500 char limit (PR #266)
- Async generation queue (PR #269)
- Post-processing audio effects system (PR #271)
- Voice profile type system (preset vs cloned, engine compatibility gating)
- Centralized `ModelConfig` registry — no per-engine dispatch maps
- Shared `EngineModelSelector` component

**Infrastructure (cumulative):**

- CUDA backend swap via binary download (PR #252), cu128 upgrade (PR #316), Blackwell/sm_120 (PR #401)
- CUDA backend split into independently versioned server + libs archives (PR #298)
- Intel Arc XPU support (PR #320)
- Docker + web deployment (PR #161)
- Backend refactor: modular architecture, style guide, tooling (PR #285)
- Settings overhaul: routed sub-tabs, server logs, changelog, about page (PR #294)
- Windows support: CUDA detection, cross-platform justfile, server lifecycle (PR #272, #402)
- Linux audio capture via pactl monitor detection (PR #457)
- macOS Intel x86_64 compatibility (PR #416)
- Voice profiles with multi-sample support
- Stories editor (multi-track DAW timeline)
- Whisper transcription (base, small, medium, large, turbo variants)
- Model management UI with inline download progress + folder migration (PR #268)
- Download cancel/clear UI with error panel (PR #238)
- Generation history with caching and cancellation (PR #444)
- Streaming generation endpoint (MLX only)
- Audio player freeze fix + UX improvements (PR #293)
- CORS restriction to known local origins (PR #88)

### Abandoned / Backlogged Integrations

| Model | PR / Branch | Reason |
|-------|-------------|--------|
| **CosyVoice2/3** | PR #311 | Output quality too poor. Heavy deps, no PyPI, needed 5+ shims. PR should be closed. |
| **VoxCPM 1.5 / VoxCPM2** | `voicebox-new-models` research (2026-04-18) | **Backlogged.** See detailed analysis below. |

#### VoxCPM — Evaluation Notes (2026-04-18)

**Project:** [OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM) — tokenizer-free TTS, 2B params (VoxCPM2), end-to-end diffusion autoregressive architecture, 30 languages, 48 kHz output, Apache 2.0, `pip install voxcpm`.

**Why it looked interesting:**

- Clean PyPI install (`pip install voxcpm`)
- Apache 2.0 — commercially safe
- Voice cloning via `reference_wav_path` with optional `prompt_wav_path` + `prompt_text` for "ultimate" cloning
- Streaming API via `generate_streaming()`
- Zero-shot cloning + style control via parenthetical prefixes in text (`(slightly faster, cheerful tone)...`)
- Relatively high-quality output per demos

**Why we backlogged it:**

- **Effectively CUDA-only.** The README states `CUDA ≥ 12.0` as a hard requirement. The source's `from_pretrained(device=None|"auto")` claims "preferring CUDA, then MPS, then CPU," but in practice:
  - **MPS (Apple Silicon) broken upstream** — OpenBMB/VoxCPM issues #232 (`NotImplementedError: Output channels > 65536 not supported at the MPS device`) and #248 (`IndexError` on M3 Mac) are both open with no resolution.
  - **CPU unsupported in the Python package** — issue #256 shows `voxcpm --device cpu` rejected with `unrecognized arguments`. The only CPU path is the third-party **VoxCPM.cpp** GGML engine, which is a separate ecosystem project, not `pip install voxcpm`.
  - **macOS source install fails** — issue #233 open with no resolution.
- Would require CUDA-only gating in the UI (new `requires_cuda` flag on `ModelConfig`, lock icon + "Requires NVIDIA GPU" in `ModelManagement.tsx` / `EngineModelSelector.tsx`) plus a hard error at `load_model()` as a safety net. Doable, but adds first-class platform gating that doesn't exist for any other engine today.
- Voicebox's user base skews Apple Silicon (MLX is a primary backend). Shipping a CUDA-only model sets a precedent worth a separate scoping discussion (see issues #419 engine sprawl, #420 platform tiers, PR #465).

**What would change the decision:**

- Upstream fixes the MPS crashes (watch issues #232, #248).
- We define an "experimental / CUDA-only" engine tier as part of issue #419 / PR #465, and decide it's acceptable to ship engines that are hidden on non-NVIDIA platforms.
- VoxCPM.cpp matures into a viable CPU path we can wrap (currently a separate project, C++/GGML, unclear ergonomics).

**Integration shape if we revive it:** Zero-shot cloning maps naturally to the Chatterbox-style backend (store `ref_audio` + `ref_text` paths in the voice prompt dict, process at generate time). Est. ~250 lines for `voxcpm_backend.py` + one `ModelConfig` entry + engine registration in `backends/__init__.py`. Frontend UI gating is the bigger lift. A hypothetical skeleton follows.
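For the record, a hypothetical skeleton of that backend, built only from the upstream API surface noted above (`from_pretrained`, `prompt_wav_path` + `prompt_text`) and untested, since the integration is backlogged:

```python
# Hypothetical voxcpm_backend.py skeleton: backlogged, never built.
# Upstream names (from_pretrained, prompt_wav_path, prompt_text) come from
# the evaluation notes above; the class name and generate() call are guesses.
from typing import Any, Optional


class VoxCPMBackend:
    def __init__(self) -> None:
        self.model = None

    def load_model(self, model_size: str) -> None:
        from voxcpm import VoxCPM  # assumed entry point; CUDA >= 12.0 required upstream
        if self.model is None:
            self.model = VoxCPM.from_pretrained(device="auto")

    def create_voice_prompt(self, audio_path: str, reference_text: str) -> dict[str, Any]:
        # Chatterbox pattern: store paths now, feed them to the model at generate time.
        return {"ref_audio": audio_path, "ref_text": reference_text}

    def generate(
        self, text: str, voice_prompt: dict[str, Any], language: str,
        seed: Optional[int] = None, instruct: Optional[str] = None,
    ) -> Any:
        return self.model.generate(
            text,
            prompt_wav_path=voice_prompt["ref_audio"],
            prompt_text=voice_prompt["ref_text"],
        )
```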
### What's In-Flight

| Feature | Branch/PR | Status |
|---------|-----------|--------|
| Platform support tiers | PR #465, issue #420 | Defining tier-1 (supported) vs tier-2 (community) platforms |
| Engine sprawl cleanup | issue #419 | First-class vs experimental TTS backends distinction |
| Frontend tech-debt burn-down | issue #421 | Biome + a11y debt before gating CI |
| Docker registry auto-publish | PR #463, issue #453 | ghcr.io image on tag push |
| New model research | `voicebox-new-models` branch | Evaluating Fish Speech, XTTS-v2, Pocket TTS, VibeVoice, Fish Audio S2, index-tts2 |

### TTS Engine Comparison

| Engine | Model Name | Profile Type | Languages | Size | Key Features | Instruct Support |
|--------|-----------|--------------|-----------|------|--------------|------------------|
| Qwen3-TTS 1.7B | `qwen-tts-1.7B` | Cloned | 10 (zh, en, ja, ko, de, fr, ru, pt, es, it) | ~3.5 GB | Highest quality, voice cloning | None (Base model has no instruct path) |
| Qwen3-TTS 0.6B | `qwen-tts-0.6B` | Cloned | 10 | ~1.2 GB | Lighter, faster | None |
| Qwen CustomVoice 1.7B | `qwen-custom-voice-1.7B` | Preset | 10 | ~3.5 GB | Predefined speakers, instruct support | **Yes** |
| Qwen CustomVoice 0.6B | `qwen-custom-voice-0.6B` | Preset | 10 | ~1.2 GB | Predefined speakers, instruct support | **Yes** |
| LuxTTS | `luxtts` | Cloned | English | ~300 MB | CPU-friendly, 48 kHz, fast | None |
| Chatterbox | `chatterbox-tts` | Cloned | 23 (incl. Hebrew, Arabic, Hindi, etc.) | ~3.2 GB | Zero-shot cloning, multilingual | Partial — `exaggeration` float (0-1) |
| Chatterbox Turbo | `chatterbox-turbo` | Cloned | English | ~1.5 GB | Paralinguistic tags ([laugh], [cough]), 350M params, low latency | Partial — inline tags only |
| TADA 1B | `tada-1b` | Cloned | English | ~4 GB | HumeAI speech-language model, 700s+ coherent audio | None |
| TADA 3B Multilingual | `tada-3b-ml` | Cloned | 10 (en, ar, zh, de, es, fr, it, ja, pl, pt) | ~8 GB | Multilingual, text-acoustic dual alignment | None |
| Kokoro 82M | `kokoro` | Preset | 8 (en, es, fr, hi, it, pt, ja, zh) | ~350 MB | 82M params, CPU realtime, Apache 2.0, pre-built voices | None |

### Multi-Engine Architecture (Shipped)

- **Thread-safe backend registry** (`_tts_backends` dict + `_tts_backends_lock`) with double-checked locking — see the sketch below
- **Per-engine backend instances** — each engine gets its own singleton, loaded lazily
- **Engine field on GenerationRequest** — frontend sends `engine: 'qwen' | 'qwen_custom_voice' | 'luxtts' | 'chatterbox' | 'chatterbox_turbo' | 'tada' | 'kokoro'`
- **Per-engine language filtering** — `ENGINE_LANGUAGES` map in frontend; backend regex accepts all languages
- **Per-engine voice prompts** — `create_voice_prompt_for_profile()` dispatches to the correct backend
- **Profile type system** — preset vs cloned profiles; UI grays out incompatible engines and auto-switches on selection
- **Trim post-processing** — `trim_tts_output()` for Chatterbox engines (cuts trailing silence/hallucination)
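The double-checked locking mentioned in the first bullet is the textbook pattern. An abridged sketch (the real factory lives at `backend/backends/__init__.py:382-426`; `_construct_backend` stands in for its if/elif chain):

```python
# Abridged double-checked locking for the per-engine registry.
# _tts_backends / _tts_backends_lock are the names used in the codebase;
# _construct_backend is a stand-in for the real factory dispatch.
import threading

_tts_backends: dict[str, object] = {}
_tts_backends_lock = threading.Lock()


def _construct_backend(engine: str) -> object:
    ...  # stand-in for the real if/elif factory over the seven engines


def get_tts_backend_for_engine(engine: str):
    backend = _tts_backends.get(engine)          # fast path, lock-free once created
    if backend is None:
        with _tts_backends_lock:
            backend = _tts_backends.get(engine)  # re-check: another thread may have won
            if backend is None:
                backend = _construct_backend(engine)
                _tts_backends[engine] = backend
    return backend
```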
### Known Limitations

- **HF XET progress**: Large files downloaded via `hf-xet` (HuggingFace's new transfer backend) report `n=0` in tqdm updates. Progress bars may appear stuck for large `.safetensors` files even though the download is proceeding. This is a known upstream limitation.
- **Chatterbox Turbo upstream token bug**: `from_pretrained()` passes `token=os.getenv("HF_TOKEN") or True`, which fails without a stored HF token. Our backend works around this by calling `snapshot_download(token=None)` + `from_local()`.
- **chatterbox-tts must install with `--no-deps`**: It pins `numpy<1.26`, `torch==2.6.0`, `transformers==4.46.3` — all incompatible with our stack (Python 3.12, torch 2.10, transformers 4.57.3). Sub-deps are listed explicitly in `requirements.txt`.
- **Instruct parameter partially shipped** (#224, #303): Qwen CustomVoice (PR #328) now provides real instruct support via predefined speakers. Other backends still silently drop the instruct field — the UI exposes the field broadly, but most engines ignore it. The floating generate box was patched to restore instruct for CustomVoice (commit `106aec4`).
- **Streaming generation** only works for Qwen on MLX. Other engines use the non-streaming `/generate` endpoint.
- **dicta-onnx** (Hebrew diacritization) not included — an upstream Chatterbox bug requires a `model_path` arg but calls `Dicta()` with none. Hebrew works fine without it.
- **Blackwell (RTX 50-series) CUDA**: cu128 + sm_120 kernel support shipped (PR #401, #316), but users still report `cudaErrorNoKernelImageForDevice` (#417, #400, #396, #395, #390, #362) — likely a stale CUDA binary on upgraded installs. Needs a follow-up diagnostic / forced re-download path.
- **Long text 50k character limit** (#464, #365, #354): Still hit on GPU despite chunking (PR #266). Chunking reliability needs another pass.
- **ROCm on RDNA 3/4** (#469): `HSA_OVERRIDE_GFX_VERSION` is hardcoded and harms newer cards.
- **`flash-attn is not installed` warning on every platform** (cosmetic, common user complaint): Our transformer-based engines (Chatterbox / Qwen) emit `Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference.` on every startup, on every platform. We don't pin `flash-attn` in requirements because installing it is fragile and version-sensitive. The fallback is PyTorch SDPA, which is near-FA2 throughput on Ampere+ and is what actually runs. **Per-platform reality:**
  - **macOS/Apple Silicon** — FlashAttention is CUDA-only, irrelevant here; MLX has its own attention kernels.
  - **Linux** — `pip install flash-attn --no-build-isolation` works but takes 20+ min to compile.
  - **Windows** — no official support (the Dao-AILab README still says only "Might work"; source builds routinely fail on recent CUDA/MSVC, issues #1715, #1828, #2395). Windows users can install community prebuilt wheels from `kingbri1/flash-attention` or `bdashore3/flash-attention` (latest v2.8.3, Aug 2025; `win_amd64` wheels for CUDA 12.4/12.8, Torch 2.6–2.9, Python 3.10–3.13) matching their exact CUDA/Torch/Python, or use WSL2.

  **Native-Windows alternatives worth considering as a build-time swap:** SageAttention (thu-ml, Apache 2.0, claims 2–5× over FA2) and xformers (official Windows wheels). **Action for us:** the troubleshooting doc now covers it (see `docs/content/docs/overview/troubleshooting.mdx`), and we should optionally suppress the warning via `logging.getLogger(...).setLevel(ERROR)` at backend import, since the fallback is functionally fine (see the sketch after this list).
- **WebAudio playback dies after audio-session interruption** (#41, plus an internal repro where the app is backgrounded long enough): WaveSurfer's `AudioContext` gets suspended by macOS — either because another app grabs the audio output, or because the WKWebView throttles when backgrounded. `play()` resolves and `timeupdate` can still fire, but no audio reaches the output. Only an app restart fixes it.

  **Things already tried that didn't work:** (a) swapping the WaveSurfer backend away from WebAudio — introduced more bugs, not an option; (b) a remount hook on the player — doesn't help, because a freshly created `AudioContext` is born suspended and only resumes on a user gesture. PR #293 was a prior partial fix that doesn't cover this path.

  **Next thing to try** (not yet attempted — confirmed via grep of `AudioPlayer.tsx`): call `wavesurfer.getMediaElement().getGainNode().context.resume()` on the play-button click (the click itself is a valid user gesture), plus a `visibilitychange` + `statechange` listener as belt-and-suspenders. The `ctx.resume()` pattern already exists in the codebase at `useStoryPlayback.ts:52` — it's just not wired into the main player.
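On the flash-attn warning above: assuming it is emitted through the standard `logging` module (worth verifying, since some libraries `print` it instead), the suppression would be a couple of lines at backend import. The logger names below are guesses.

```python
# Sketch of suppressing the cosmetic flash-attn warning at backend import.
# ASSUMPTIONS: the warning goes through `logging`/`warnings`, and the logger
# names below are guesses. Verify against the actual emitter before shipping.
import logging
import warnings

warnings.filterwarnings("ignore", message=r".*flash-attn is not installed.*")
for name in ("transformers", "chatterbox"):  # hypothetical logger names
    logging.getLogger(name).setLevel(logging.ERROR)
```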
---

## Open PRs — Triage & Analysis

### Recently Merged (Since Last Update — 2026-03-18 → 2026-04-18)

| PR | Title | Merged |
|----|-------|--------|
| **#481** | fix(build): pin transformers in MLX requirements to prevent 5.x upgrade | 2026-04-19 |
| **#470** | fix(api-client): declare moved + errors on migrateModels response type | 2026-04-18 |
| **#457** | fix(linux): use pactl to detect PipeWire/PulseAudio monitor | 2026-04-18 |
| **#450** | docs: clarify paralinguistic tag support in quick start | 2026-04-18 |
| **#447** | fix: delete version rows and files in delete_generations_by_profile | 2026-04-18 |
| **#444** | Fix generation cancellation flow | 2026-04-18 |
| **#440** | fix(paths): strip legacy "data/" prefix when resolving stored paths | 2026-04-18 |
| **#439** | Fix migration dialog hanging when no models are present | 2026-04-18 |
| **#438** | fix(build): repair frozen-binary imports for kokoro/chatterbox-multilingual/scipy/transformers | 2026-04-18 |
| **#433** | fix: warn user when no models to migrate during storage change | 2026-04-18 |
| **#425** | Add NUMBA_CACHE_DIR environment variable | 2026-04-16 |
| **#424** | fix: avoid ScreenCaptureKit launch crash on macOS 11 | 2026-04-16 |
| **#418** | Frontend quality gates + TypeScript hardening | 2026-04-18 |
| **#416** | fix(deps): relax PyTorch requirement for macOS Intel (x86_64) | 2026-04-16 |
| **#412** | feat(history): add "Clear failed" button | 2026-04-16 |
| **#405** | fix: keep cpal Stream alive until playback completes | 2026-04-16 |
| **#403** | fix: prevent intermittent clip splitting failures | 2026-04-16 |
| **#402** | fix: reliably keep server alive after GUI close on Windows | 2026-04-16 |
| **#401** | feat: add Blackwell GPU (sm_120) CUDA support | 2026-04-16 |
| **#394** | fix(history): populate status/error/engine fields from DB row | 2026-04-16 |
| **#384** | Fix: Resolve ModuleNotFoundError in effects service | 2026-04-16 |
| **#361** | fix: torch.from_numpy crash with numpy 2.x in frozen binary | 2026-04-16 |
| **#345** | Fix: "Failed to Save" preset error by resolving backend import path | 2026-03-22 |
| **#344** | fix: include changelog in docker web build | 2026-03-27 |
| **#332** | Fix links in Get Started section of index.mdx | 2026-03-21 |
| **#328** | feat: add Qwen CustomVoice preset engine | 2026-03-27 |
| **#325** | feat: Kokoro 82M TTS engine + voice profile type system | 2026-03-20 |
| **#321** | fix: allows deletion of failed generations | 2026-03-19 |
| **#320** | feat: Intel Arc (XPU) GPU support | 2026-03-21 |
| **#319** | fix: GUI startup with external server + data refresh on server switch | 2026-03-27 |
| **#318** | fix: force offline mode when loading cached models (Qwen TTS & Whisper) | 2026-03-21 |
| **#316** | Upgrade CUDA backend from cu126 to cu128, fix GPU settings UI | 2026-03-18 |

### Currently Open (12 PRs)

| PR | Title | Status | Notes |
|----|-------|--------|-------|
| **#465** | docs: define tier-1 and tier-2 platform support targets | Community PR | Pairs with issue #420. Important for scoping. |
| **#463** | feat(actions): add docker-registry.yml for automatic ghcr.io publishing | Community PR | Pairs with issue #453. Low risk. |
| **#443** | fix: prevent infinite retry loop in offline mode (#434) | Community PR | Fixes reported bug. |
| **#430** | feat: add MiniMax TTS provider support | Community PR | Cloud TTS provider — new direction (external API). Superset of #331? |
| **#331** | feat: add MiniMax Cloud TTS as a built-in engine | Community PR | Likely superseded by #430. Dedupe. |
| **#311** | feat: add CosyVoice2/3 TTS engine | **Close** | Abandoned — output quality too poor. |
| **#253** | Enhance speech tokenizer with 48kHz version | Community PR | Qwen tokenizer upgrade. Still worth reviewing. |
| **#227** | fix: harden input validation & file safety | Community PR | Coupled to #225 (custom models). |
| **#225** | feat: custom HuggingFace voice model support | Community PR | Needs rework for multi-engine arch. |
| **#195** | feat: per-profile LoRA fine-tuning | Draft | Complex. 15 new endpoints. |
| **#154** | feat: Audiobook tab | Community PR | Chunked generation now shipped (#266). |
| **#91** | fix: CoreAudio device enumeration | Draft | macOS audio device handling. |

---

## Open Issues — Categorized

### GPU / Hardware Detection — still the top category

**RTX 50-series (Blackwell / sm_120) cluster — NEW:** #417, #400, #396, #395, #390, #362 all report `cudaErrorNoKernelImageForDevice` / "no kernel image available." sm_120 support shipped in PR #401 + cu128 in PR #316, but users on upgraded installs still hit it — likely a stale CUDA binary. Needs a diagnostic that detects binary/GPU-arch mismatch and prompts re-download.

**AMD / ROCm — NEW:** #469 `HSA_OVERRIDE_GFX_VERSION` is hardcoded and breaks RDNA 3/4 cards. #313 DirectML on AMD Ryzen AI Max+ 395 not working.

**Intel Arc:** PR #320 shipped XPU support — may resolve #119.

**General GPU-not-detected (older):** #368, #310, #330, #324, #326, #355 (multi-GPU / eGPU).

**Fix path:** CUDA backend swap (PR #252) + cu128 (PR #316) + sm_120 (PR #401) + GPU-arch warning (`73170d0`) are all in. Remaining work is diagnostics + re-download prompts for users whose binary predates the kernel updates.

### Model Downloads

Still reported. Users hit stuck downloads, can't resume, and trip over offline-mode edge cases.
**Key issues:** #475 (Mac CustomVoice install error), #449 (infinite loading macOS), #445 (can't download CustomVoice), #462 (Qwen requires internet even when loaded — regression from #150), #434 (infinite retry loop offline — PR #443 open), #432 (storage location change hangs when empty — partly fixed by PR #439/#433), #348 (TADA 3B Multilingual download fails), #336 (TADA model not listed in app), #275 (`No module named 'chatterbox'` on download), #304 (whisper-base feature extractor load error), #287 (macOS ARM `check_model_inputs` ImportError on new version), #181, #180.

**Fix path:** PR #443 addresses the infinite offline retry. CustomVoice-specific download failures (#475, #445) need triage — likely related to the frozen-binary import fixes in PR #438. The TADA cluster (#336, #348) and macOS ARM import regressions (#287, #275, #304) need a dedicated triage pass.

**Qwen 0.6B-downloads-1.7B reports:** **#485** (2026-04-19), **#423** (macOS M1), **#329**. Originally a stale-fallback bug: `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16` wasn't published when MLX support shipped, so the 0.6B slot was aliased to the 1.7B repo. The 0.6B bf16 conversion is live now, and both `backend/backends/mlx_backend.py` and `backend/backends/__init__.py` point at their correct repos. Qwen CustomVoice is unaffected — it runs via PyTorch on all platforms, and both sizes have always had dedicated repos.

### Language Requests (ongoing)

Strong demand: Hungarian (#479), Indonesian (#458, #247), Thai (#455), Bangla (#454), Arabic (#379), Persian (#162), IndicF5 (#339 — Indian languages), Ukrainian (#109), Chinese UI (#392, #261).

**Fix path:** Chatterbox Multilingual (PR #257) covers Arabic, Danish, German, Greek, Finnish, Hebrew, Hindi, Dutch, Norwegian, Polish, Swedish, Swahili, and Turkish. Still missing: Hungarian, Indonesian, Thai, Bangla, Ukrainian. Issue #411 offers a PR for a UI i18n foundation.

### New Model Requests (growing)

| Issue | Model Requested |
|-------|----------------|
| #478 | CosyVoice3 (we tried & abandoned CosyVoice2/3 — see #311) |
| #407, #347 | RVC-style voice-to-voice / seed voice conversion (STS) |
| #385 | Fish Audio S2 |
| #380 | OmniVoice |
| #370 | index-tts2 |
| #364 | Voxtral-TTS |
| #335 | Faster-Qwen-TTS |
| #346 | Multi-model batch request |
| #381 | Microsoft MAI models |
| #339 | IndicF5 |
| #226 | GGUF support |
| #172 | VibeVoice |
| #138 | Export to ONNX/Piper format |
| #132 | LavaSR (transcription) |
| #147 | Facebook Omnilingual ASR |
| #338 | Default voices |

The multi-engine architecture makes integration straightforward — see [`content/docs/developer/tts-engines.mdx`](content/docs/developer/tts-engines.mdx). Platform-specific gating (e.g. VoxCPM CUDA-only) doesn't exist yet and would need design.

### Platform Scope & Quality Debt — NEW category

Awareness issues filed this cycle — ties into the engine sprawl and platform tier work.

- **#419** — Engine sprawl: define first-class vs experimental TTS backends
- **#420** — Formalize tier-1 vs tier-2 platform support targets (PR #465 open)
- **#421** — Track & burn down frontend Biome + a11y debt before gating CI
- **#422** — Code-split web build (main bundle > 1 MB)

### Long-Form / Chunking

Still reported despite chunking + queue being merged.

**Key issues:** #464 (50k char limit on GPU despite 16 GB VRAM — v0.4.0), #365 (FR: >50k chars), #363 (smart chunking to prevent robotic artifacts), #354 (50k limit v0.3.0).

**Fix path:** Chunking (#266) and queue (#269) shipped. Remaining work is raising/removing the 50k guard and tuning chunk boundaries for prosody — roughly the direction sketched below.
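A hypothetical illustration of the boundary-tuning direction: pack whole sentences up to the engine's comfortable window so no chunk starts mid-clause. This is not the shipped PR #266 chunker.

```python
# Hypothetical sentence-boundary chunker illustrating the prosody-tuning
# direction. Not the chunker shipped in PR #266.
import re


def chunk_text(text: str, max_chars: int = 450) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)  # flush only at sentence boundaries
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks  # note: an over-long single sentence still ships whole
```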
### Feature Requests (ongoing)

Notable:

- **#480** — Noise removal on uploaded recordings
- **#448** — API for non-Qwen models (external integrations)
- **#427** — Task status control
- **#407, #347** — Voice-to-voice / audio-to-audio conversion
- **#387** — Location of downloaded generated voices
- **#383** — Concatenate partial reference audio into generated audio
- **#382** — Lightning.ai support
- **#376** — Remote mode
- **#353** — Audio transcoding
- **#317** — Voice pitch control
- **#189** — "Auto" language option
- **#173** — Vocal intonation/inflection control
- **#165, #270** — Audiobook mode (PR #154 open)
- **#242** — Seed value pinning
- **#228** — Always use 0.6B option
- **#235** — Finetuned Qwen3-TTS tokenizer (PR #253 open)
- **#144** — Copy text to clipboard

### Housekeeping / Triage Needed

| Issue | Reason |
|-------|--------|
| **#431**, **#408** | Spam — Chinese "free Claude API" promos. Close. |
| **#398** ("Excelente") | Non-issue. Close. |
| **#357** | Informational — project featured in Awesome MLX. Close after acknowledgement. |
| **#374**, **#377** | Version-release questions, no bug. Close. |
| **#306** ("voice model"), **#389** ("New model"), **#473** ("New functionality") | Title-only issues, no content. Request details or close. |
| **#309** | Uninstall/cleanup question. Answer and close. |
| **#241** | "How to use in Colab" — support question, not a bug. |
| **#423** / **#485** / **#329** | Stale MLX fallback to the 1.7B repo — fixed; the 0.6B bf16 conversion is now live on `mlx-community`, and the registry points at the correct repo on both backends. |
| **#336** / **#348** | TADA download/registration cluster — triage together. |
| **#287** / **#275** / **#304** | macOS ARM import regressions on new version — likely one root cause. |
| **#292**, **#349** | Possibly already fixed by merged PRs (#321/#412 and #345). Verify + close. |

**~70 older issues (pre-#170) not individually categorized above.** Most are long-tail support questions or duplicates of problems now addressed by the multi-engine / model-registry work. A dedicated backlog-sweep pass is overdue.
### Bugs (ongoing)

| Category | Issues |
|----------|--------|
| Generation failures | #476, #467, #452, #459 (voice clone fetch error), #468 (tada-1b marked error), #437, #300, #301, #282 |
| Audio quality | #456 (clipping errors v0.4.0), #436 (emotion labels), #333 (pitch/echo), #307 (by-model breakdown), #340 (all generations say "www...") |
| Transcription | #371 (fails every time), #291 (extract transcription from generated audio) |
| Effects / presets | #349 ("Failed to save" when creating effects presets — possibly fixed by merged #345) |
| File ops | #477 (spacy_pkuseg dict missing on frozen Windows build), #472 (storage location change), #283 (allow longer files for voice creation + in-app trim), #350 (failed to add sample) |
| History | #292 (can't delete failed generations — possibly fixed by merged #321/#412) |
| Windows | #466 (install problem), #375 (WinError 5 access denied), #273 (port 8000 conflict), #201 (model doesn't stay loaded) |
| Linux | #471 (thread-safe PULSE_SOURCE), #413 (Arch build), #409 (Kubuntu build), #351, #341 |
| macOS | #441 (older macOS), #369 (malware flag), #334 (microphone permission), #287 (`check_model_inputs` ImportError — regression), #171 (ARM64 binary won't open) |
| Profile/UI | #360 (Kokoro profile hides others — partly addressed by auto-switch), #299 (drag-drop on Win11), #329 (size selector state bug), #393 (stuck loading screen after reinstall to new dir) |
| Integrations | #397 (SAMMI-bot 422 Unprocessable Entity) |
| Audio playback / session | **#41** (macOS: Voicebox goes silent after another app takes audio output; restart restores it) — see the deep-dive in Known Limitations above |
| Database | #174 (sqlite3 IntegrityError) |

---

## Existing Plan Documents — Status

| Document | Target Version | Status | Relevance |
|----------|---------------|--------|-----------|
| `TTS_PROVIDER_ARCHITECTURE.md` | v0.1.13 | **Partially superseded** by multi-engine arch + CUDA swap | Core concepts implemented differently than planned |
| `CUDA_BACKEND_SWAP.md` | — | **Shipped** (PR #252) | CUDA binary download + backend restart |
| `CUDA_BACKEND_SWAP_FINAL.md` | — | **Shipped** (PR #252) | Final implementation plan |
| `EXTERNAL_PROVIDERS.md` | v0.2.0 | **Not started** | Remote server support |
| `MLX_AUDIO.md` | — | **Shipped** | MLX backend is live |
| `DOCKER_DEPLOYMENT.md` | v0.2.0 | **Shipped** (PR #161) | Docker + web deployment |
| `OPENAI_SUPPORT.md` | v0.2.0 | **Not started** | OpenAI-compatible API layer |
| `PR33_CUDA_PROVIDER_REVIEW.md` | — | **Reference** | Analysis of the original provider approach |

---

## New Model Integration — Landscape

### Status Snapshot (2026-04-18)

| Model | Cloning | Speed | Sample Rate | Languages | VRAM | Instruct | Cross-platform? | Status |
|-------|---------|-------|-------------|-----------|------|----------|-----------------|--------|
| **Qwen3-TTS** | 10s zero-shot | Medium | 24 kHz | 10 | Medium | None | MLX + PyTorch | **Shipped** |
| **Qwen CustomVoice** | Preset speakers | Medium | 24 kHz | 10 | Medium | **Yes** | PyTorch | **Shipped** (PR #328) |
| **LuxTTS** | 3s zero-shot | 150x RT, CPU ok | 48 kHz | English | <1 GB | None | All | **Shipped** (PR #254) |
| **Chatterbox MTL** | 5s zero-shot | Medium | 24 kHz | 23 | Medium | Partial — `exaggeration` | CPU/CUDA | **Shipped** (PR #257) |
| **Chatterbox Turbo** | 5s zero-shot | Fast | 24 kHz | English | Low | Partial — inline tags | CPU/CUDA | **Shipped** (PR #258) |
| **HumeAI TADA 1B/3B** | Zero-shot | 5x faster than LLM-TTS | 24 kHz | EN (1B), 10 (3B) | Medium | Partial — prosody | PyTorch | **Shipped** (PR #296) |
| **Kokoro-82M** | Preset voices | CPU realtime | 24 kHz | 8 | Tiny (82M) | None | All | **Shipped** (PR #325) |
| ~~**CosyVoice2-0.5B**~~ | 3-10s zero-shot | Very fast | 24 kHz | Multilingual | Low | **Yes** | — | **Abandoned** (PR #311) — poor output quality |
| ~~**VoxCPM2**~~ | Zero-shot | ~0.15 RTF streaming | 48 kHz | 30 | Medium | Partial — parenthetical style | **CUDA-only in practice** | **Backlogged** (2026-04-18) — see notes above |
| **Fish Speech** | 10-30s few-shot | Real-time | 24-44 kHz | 50+ | Medium | **Yes** — word-level inline | All | Candidate — license TBD |
| **Fish Audio S2** | — | — | — | — | — | — | — | Candidate (#385) |
| **XTTS-v2** | 6s zero-shot | Mid-GPU | 24 kHz | 17+ | Medium | Partial — style transfer from ref | All | Candidate — CPML license likely blocker |
| **Pocket TTS** (Kyutai) | Zero-shot + streaming | >1x RT on CPU | — | English + several European (FR/DE/PT/IT/ES added by Feb 2026) | ~100M | None | CPU-first | Candidate — MIT |
| **MOSS-TTS-Nano** | Zero-shot | **Realtime on 4 CPU cores** | 48 kHz stereo | 20 | 0.1B | Partial — MOSS-VoiceGenerator companion does text-to-voice design | All (ONNX CPU path dropped 2026-04-17) | **Top candidate** — Apache 2.0, released 2026-04-13, streaming |
| **VibeVoice** (Microsoft) | — | — | — | Multi-speaker long-form (up to 90 min, 4 speakers) | 1.5B | — | — | Candidate (#172) — Stories-editor fit |
| **index-tts2** | — | — | — | — | — | — | — | Candidate (#370) |
| **Voxtral TTS** (Mistral) | Zero-shot (short clips) + 20 preset voices | Single-GPU | — | — | 4B (`Voxtral-4B-TTS-2603`) | Presets + cloning | CUDA (16 GB+ VRAM) | Candidate (#364) — frontier quality claim, open-weight |
| **Dia / Dia2** | — | — | — | — | — | — | — | Watch — emotion-forward, but "rough edges" / artifacts per April reviews |
| **IndicF5** | — | — | — | Indian languages | — | — | — | Candidate (#339) — fills Indic gap |
| **MiniMax Cloud TTS** | — | Cloud | — | — | N/A (API) | — | N/A | Community PRs #430, #331 — new direction (external API) |
| **OmniVoice** | — | — | — | — | — | — | — | Candidate (#380) |
| **RVC voice conversion** | N/A (STS) | — | — | — | — | N/A | All | New modality, not TTS (#407, #347) |

**Watch list:** MioTTS-2.6B (fast LLM-based EN/JP, vLLM compatible), Oolel-Voices (Soynade Research, expressive modular control), Faster-Qwen-TTS (#335), Orpheus / Sesame CSM (on-device fine-tuning discussions), Fish Audio S2 Pro / Fish Speech V1.5 (benchmark leader but research/non-commercial license — same blocker as Fish Speech).
**Deep-research pass (2026-04-18):** MOSS-TTS-Nano identified as the freshest high-alignment candidate — verified via the [OpenMOSS/MOSS-TTS](https://github.com/OpenMOSS/MOSS-TTS) README (0.1B params, Apache 2.0, 48 kHz stereo, 4-core CPU realtime, streaming, released 2026-04-13). Dedicated repo: [OpenMOSS/MOSS-TTS-Nano](https://github.com/OpenMOSS/MOSS-TTS-Nano). Voxtral TTS verified on HF as `mistralai/Voxtral-4B-TTS-2603`.

#### Active Evaluation Criteria (learned from this cycle)

1. **Cross-platform first.** MLX is a primary backend for our Apple Silicon user base. CUDA-only models require platform gating that doesn't exist yet — shipping one sets a precedent (see VoxCPM notes, issues #419/#420).
2. **PyPI + Apache/MIT licensing preferred.** Heavy deps, git-only installs, and `--no-deps` workarounds are expensive to maintain (Chatterbox taught us this).
3. **Output quality is non-negotiable.** CosyVoice was abandoned despite having the best instruct API.
4. **Instruct support fills a real gap** (#173, #224, #303). Qwen CustomVoice partially addresses it with preset speakers; zero-shot clone-with-instruct is still unmet.
5. **Long-form + streaming are user-requested** (#363, #365, #464). Candidates with native streaming (Pocket TTS, Fish Speech) get extra weight.

### Adding a New Engine (Now Straightforward)

With the model config registry and the shared `EngineModelSelector` component, adding a new TTS engine requires (see the sketch at the end of this section):

1. **Create `backend/backends/<engine>_backend.py`** — implement the `TTSBackend` protocol (~200-300 lines)
2. **Register in `backend/backends/__init__.py`** — add a `ModelConfig` entry + `TTS_ENGINES` entry + factory elif
3. **Update `backend/models.py`** — add the engine name to the regex
4. **Update frontend** — add to the engine union type, `EngineModelSelector` options, form schema, language map, and profile type gating (icons/labels — ~9 files per grep of `kokoro`)

`main.py` requires **zero changes** — the registry handles all dispatch automatically.

**Platform gating doesn't exist yet.** If we add a CUDA-only model (e.g. VoxCPM), we need a new `requires_cuda` (or, more generally, `requires: list[device]`) flag on `ModelConfig`, plumbed through the `/models` API and surfaced in `ModelManagement.tsx` and `EngineModelSelector.tsx` as a lock icon + "Requires NVIDIA GPU" state. The backend should hard-error at `load_model()` as a safety net.

Total effort: **~1 day** for a well-documented, cross-platform model with a PyPI package; **~2 days** if platform gating is required. See [`content/docs/developer/tts-engines.mdx`](content/docs/developer/tts-engines.mdx) for the full guide.
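In code, steps 1-2 come down to something like the sketch below (hypothetical `moss_nano` engine; the `ModelConfig` fields and registry names are illustrative stand-ins, the real dataclass is at `backend/backends/__init__.py:17-29`):

```python
# Hypothetical sketch of steps 1-2 for a new "moss_nano" engine.
# ModelConfig fields and registry names are illustrative stand-ins.
from dataclasses import dataclass


@dataclass
class ModelConfig:  # stand-in for the real registry dataclass
    name: str
    hf_repo: str
    engine: str


MODEL_REGISTRY: dict[str, ModelConfig] = {}
TTS_ENGINES: list[str] = []


class MossNanoBackend:  # step 1: implements the TTSBackend protocol
    def load_model(self, model_size: str) -> None: ...
    def create_voice_prompt(self, audio_path: str, reference_text: str) -> dict: ...
    def generate(self, text, voice_prompt, language, seed=None, instruct=None): ...


# step 2: one ModelConfig entry + one engine registration
MODEL_REGISTRY["moss-nano"] = ModelConfig("moss-nano", "OpenMOSS/MOSS-TTS-Nano", "moss_nano")
TTS_ENGINES.append("moss_nano")
```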
---

## Architectural Bottlenecks

### ~~1. Single Backend Singleton~~ — RESOLVED

The singleton TTS backend was replaced with a thread-safe per-engine registry in PR #254. Multiple engines can now be loaded simultaneously.

### ~~2. `main.py` Dispatch Point Duplication~~ — RESOLVED

Previously, each engine required updates to 6+ hardcoded dispatch maps across `main.py` (~320 lines of if/elif chains). A model config registry in `backend/backends/__init__.py` now centralizes all model metadata (`ModelConfig` dataclass) with helper functions (`load_engine_model()`, `check_model_loaded()`, `engine_needs_trim()`, etc.). Adding a new engine requires zero changes to `main.py`.

### ~~3. Model Config is Scattered~~ — RESOLVED

Model identifiers, HF repo IDs, display names, and engine metadata are now consolidated in the `ModelConfig` registry. Backend-aware branching (e.g. MLX vs PyTorch Qwen repo IDs) happens inside the registry. Frontend model options are centralized in `EngineModelSelector.tsx`.

### 4. Voice Prompt Cache Assumes PyTorch Tensors

`backend/utils/cache.py` uses `torch.save()` / `torch.load()`. The LuxTTS, Chatterbox, and Kokoro backends work around this by storing reference audio paths (or preset voice IDs) instead of tensors in their voice prompt dicts. Not ideal, but functional.

### ~~5. Frontend Assumes Qwen Model Sizes~~ — RESOLVED

The generation form now uses a flat model dropdown with engine-based routing. Per-engine language filtering is in place. Model size is only sent for Qwen / Qwen CustomVoice.

### 6. No Platform Gating on Models — NEW

`ModelConfig` has no way to express hardware requirements. Every engine is shown to every user, regardless of whether it will actually load. Users on non-CUDA platforms discover failure at load time (or not at all — some fall back silently to CPU and never complete). This blocks shipping CUDA-only engines (VoxCPM) and would improve the Intel Arc / ROCm / CPU-only UX today. See the `ModelConfig` TODO: add `requires: list[Literal["cuda", "mps", "xpu", "cpu", "rocm"]]` or equivalent, plumb it through the `/models` API, and render it in `ModelManagement.tsx` + `EngineModelSelector.tsx` (sketch at the end of this section).

### 7. Engine Sprawl — NEW

Seven TTS engines shipped, more candidates queued. Issue #419 asks for a first-class vs experimental distinction. Related: issue #420 asks for formalized platform support tiers. Combined, these would let us ship more engines more confidently, with clearer expectations for users.
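A sketch of the proposed flag plus the `load_model()` safety net. The field shape follows the TODO in bottleneck 6; the `torch` availability probes are standard calls but untested against our platform-detection code.

```python
# Sketch of the proposed `requires` flag (bottleneck 6) and the load_model()
# hard-error safety net. Field shape follows the TODO above; device probing
# is simplified (ROCm builds of torch report through torch.cuda, XPU omitted).
from dataclasses import dataclass, field
from typing import Literal

import torch

Device = Literal["cuda", "mps", "xpu", "cpu", "rocm"]


@dataclass
class GatedModelConfig:  # stand-in: the real ModelConfig would grow this field
    name: str
    requires: list[Device] = field(default_factory=lambda: ["cpu"])  # any one suffices


def available_devices() -> set[Device]:
    devices: set[Device] = {"cpu"}
    if torch.cuda.is_available():
        devices.add("cuda")
    if torch.backends.mps.is_available():
        devices.add("mps")
    return devices


def check_hardware(config: GatedModelConfig) -> None:
    if not set(config.requires) & available_devices():
        # The UI should gate first; this is the backend safety net at load_model().
        raise RuntimeError(f"{config.name} requires one of {config.requires}")
```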
---

## Recommended Priorities

### Tier 1 — Ship Now

| Priority | PR/Item | Impact | Effort |
|----------|---------|--------|--------|
| 1 | **RTX 50-series / Blackwell diagnostic** — detect stale CUDA binary vs GPU arch, prompt re-download (#417, #400, #396, #395, #390, #362) | Large cluster of user-blocking errors | Medium |
| 2 | **CustomVoice download failures** (#475, #445) | New engine blocked on Mac/Win — regression triage | Medium |
| 3 | **50k char limit on GPU** (#464) | Regression — chunking should handle this | Medium |
| 4 | Close PR #311 (CosyVoice) and dedupe #331/#430 (MiniMax) | Housekeeping | None |
| 5 | **PR #443** — infinite offline retry loop | Bug fix, reviewable | Low |
| 6 | **PR #465** — define tier-1 / tier-2 platforms | Unblocks engine-sprawl decision (#419) | Low |
| 7 | **PR #463** — docker registry auto-publish | Community PR, low risk | Low |
| 8 | **#253** — 48kHz speech tokenizer | Quality improvement for Qwen | Medium |
| 9 | **Kokoro profile UX** (#360) — partially addressed by auto-switch | Polish | Low |

### Tier 2 — Feature Work

| Priority | Item | Impact | Effort |
|----------|------|--------|--------|
| 1 | **Engine tier system** (#419) — first-class vs experimental, platform gating in `ModelConfig` | Unblocks CUDA-only engines (VoxCPM, etc.) and frontend polish | Medium |
| 2 | **Frontend tech-debt burn-down** (#421) + code-split (#422) | Before gating CI on Biome | Medium |
| 3 | **#154** — Audiobook tab | Long-form users. Chunking + queue shipped. | Medium |
| 4 | **UI i18n** (#411 PR offer, #392, #261) | Chinese UI + general localization | Medium |
| 5 | **#225** — Custom HuggingFace models | User-supplied models. Needs rework. | High |
| 6 | OpenAI-compatible API (plan doc exists) — see also #448 (API for non-Qwen) | Low effort once API is stable | Low |
| 7 | LoRA fine-tuning (PR #195) | Complex, needs rework for multi-engine | Very High |
| 8 | Streaming for non-MLX engines | Currently MLX-only | Medium |
| 9 | Voice-to-voice / RVC (#407, #347) | New modality — different arch shape | High |

### Tier 3 — Future Engines (cross-platform preferred)

| Priority | Item | Notes |
|----------|------|-------|
| 1 | **MOSS-TTS-Nano** | 0.1B, Apache 2.0, 4-core CPU realtime, 48 kHz stereo, streaming, 20 langs, released 2026-04-13. Best alignment with our criteria. Verify install ergonomics before committing. |
| 2 | **Pocket TTS** (Kyutai) | CPU-first 100M model. MIT. Fills the streaming gap without a CUDA dependency. Several European langs added by Feb 2026. |
| 3 | **IndicF5** | Fills the Indian-language gap (#339). Closes many language-request issues. |
| 4 | **VibeVoice** (Microsoft, #172) | 1.5B, long-form multi-speaker (up to 90 min, 4 speakers). Strong Stories-editor fit. |
| 5 | **Voxtral TTS** (Mistral, #364) | 4B, presets + cloning. Frontier quality claim, but 16 GB+ VRAM — would need the platform-tier work first. |
| 6 | **Fish Speech / Fish Audio S2** | 50+ langs, word-level instruct. **License clarification first.** (#385) |
| 7 | **XTTS-v2** | 17+ langs, mature pip. CPML likely kills commercial use — verify. |
| 8 | **index-tts2** (#370) | Unvetted. |
| — | ~~**VoxCPM2**~~ | **Backlogged** — CUDA-only upstream. Revisit when the tier system ships or the MPS bugs are fixed upstream. |

### ~~Previously Prioritized — Now Done~~

- ~~Kokoro 82M — finish integration~~ **Shipped** (PR #325)
- ~~Qwen CustomVoice~~ **Shipped** (PR #328)
- ~~Intel Arc (XPU) support~~ **Shipped** (PR #320)
- ~~Blackwell CUDA~~ **Shipped** (PR #401, follow-up work open)
- ~~Generation cancellation~~ **Shipped** (PR #444)
- ~~macOS Intel x86_64~~ **Shipped** (PR #416)

---

## Branch Inventory

| Branch | PR | Status | Notes |
|--------|-----|--------|-------|
| `voicebox-new-models` | — | **Active** | New model research (Fish Speech, Pocket TTS, VibeVoice, etc.); VoxCPM evaluated & backlogged |
| `fix/kokoro-pyinstaller-source-files` | — | Active | Kokoro frozen-build source bundling (parent of `voicebox-new-models`) |
| `feat/cosyvoice-engine` | #311 | Open — closing | CosyVoice2/3 — abandoned, poor quality |
| `feat/kokoro` | #325 | **Merged** | Kokoro 82M + voice profile type system |
| `feat/qwen-custom-voice` | #328 | **Merged** | Qwen CustomVoice preset engine |
| `feat/chatterbox-turbo` | #258 | **Merged** | Chatterbox Turbo + per-engine languages |
| `feat/chatterbox` | #257 | **Merged** | Chatterbox Multilingual |
| `feat/luxtts` | #254 | **Merged** | LuxTTS + multi-engine arch |

---
## Quick Reference: API Endpoints

All current endpoints:

| Endpoint | Method | Purpose |
|----------|--------|---------|
| `/health` | GET | Health check, model/GPU status |
| `/profiles` | POST, GET | Create/list voice profiles |
| `/profiles/{id}` | GET, PUT, DELETE | Profile CRUD |
| `/profiles/{id}/samples` | POST, GET | Add/list voice samples |
| `/profiles/{id}/avatar` | POST, GET, DELETE | Avatar management |
| `/profiles/{id}/export` | GET | Export profile as ZIP |
| `/profiles/import` | POST | Import profile from ZIP |
| `/generate` | POST | Generate speech (engine param selects TTS backend) |
| `/generate/stream` | POST | Stream speech (MLX only) |
| `/history` | GET | List generation history |
| `/history/{id}` | GET, DELETE | Get/delete generation |
| `/history/{id}/export` | GET | Export generation ZIP |
| `/history/{id}/export-audio` | GET | Export audio only |
| `/transcribe` | POST | Transcribe audio (Whisper) |
| `/models/status` | GET | All model statuses (Qwen, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Whisper) |
| `/models/download` | POST | Trigger model download |
| `/models/download/cancel` | POST | Cancel/dismiss download |
| `/models/{name}` | DELETE | Delete downloaded model |
| `/models/load` | POST | Load model into memory |
| `/models/unload` | POST | Unload model |
| `/models/progress/{name}` | GET | SSE download progress |
| `/tasks/active` | GET | Active downloads/generations (with inline progress) |
| `/stories` | POST, GET | Create/list stories |
| `/stories/{id}` | GET, PUT, DELETE | Story CRUD |
| `/stories/{id}/items` | POST, GET | Story items CRUD |
| `/stories/{id}/export` | GET | Export story audio |
| `/channels` | POST, GET | Audio channel CRUD |
| `/channels/{id}` | PUT, DELETE | Channel update/delete |
| `/cache/clear` | POST | Clear voice prompt cache |
| `/server/cuda/status` | GET | CUDA binary availability |
| `/server/cuda/download` | POST | Download CUDA binary |
| `/server/cuda/switch` | POST | Switch to CUDA backend |
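For manual smoke tests against a running backend, a minimal `/generate` call looks like the snippet below. The field names are inferred from the generation flow, so treat them as assumptions; `backend/models.py` holds the authoritative request model.

```python
# Minimal /generate smoke test. Field names are assumptions inferred from the
# generation flow; backend/models.py holds the real Pydantic request model.
import requests

resp = requests.post(
    "http://localhost:17493/generate",
    json={
        "text": "Hello from Voicebox.",
        "engine": "kokoro",   # qwen | qwen_custom_voice | luxtts | chatterbox | ...
        "profile_id": 1,      # hypothetical field name
        "language": "en",
    },
    timeout=300,
)
if resp.status_code == 202:
    print("Model not cached: background download started, retry later.")
else:
    resp.raise_for_status()
    print(resp.json())        # GenerationResponse
```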