---
title: "Architecture"
description: "Understanding Voicebox's technical architecture"
---

## System Overview

Voicebox uses a client-server architecture with a React frontend and Python backend. The desktop app is built with Tauri and contains two main layers:

**Frontend Layer:** A React application that handles the UI components, state management with Zustand, and data fetching with React Query (TanStack Query).

**Backend Layer:** A Python FastAPI server that hosts the REST API, runs a pluggable registry of TTS and STT engines, manages the SQLite database, and handles audio processing.

These two layers communicate via HTTP on `localhost:17493`, with the frontend making API requests to the backend. In production the backend is compiled with PyInstaller and launched as a Tauri sidecar; in development it's run manually via `uvicorn`.

## Frontend Architecture

### Tech Stack

- **Framework**: React 18 with TypeScript
- **State Management**: Zustand stores
- **Data Fetching**: React Query (TanStack Query)
- **Styling**: Tailwind CSS
- **Audio**: WaveSurfer.js
- **Desktop**: Tauri (Rust)

### Component Structure

<Files>
  <Folder name="app/src" defaultOpen>
    <Folder name="components">
      <File name="Profiles/" />
      <File name="Generation/" />
      <File name="Stories/" />
      <File name="ServerSettings/" />
    </Folder>
    <Folder name="lib">
      <File name="api/" />
      <File name="constants/" />
      <File name="hooks/" />
      <File name="utils/" />
    </Folder>
    <Folder name="stores" />
  </Folder>
</Files>

## Backend Architecture

### Tech Stack

- **Framework**: FastAPI (Python 3.11+)
- **TTS Engines**: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro
- **Transcription**: Whisper (PyTorch or MLX-Whisper)
- **Inference Backends**: MLX (Apple Silicon), PyTorch (CUDA / ROCm / XPU / DirectML / CPU)
- **Database**: SQLite via SQLAlchemy
- **Audio**: librosa, soundfile, Pedalboard

### Layout

<Files>
  <Folder name="backend" defaultOpen>
    <File name="app.py" />
    <File name="main.py" />
    <File name="config.py" />
    <File name="models.py" />
    <File name="server.py" />
    <File name="build_binary.py" />
    <Folder name="routes">
      <File name="profiles.py" />
      <File name="generate.py" />
      <File name="history.py" />
      <File name="models.py" />
      <File name="channels.py" />
    </Folder>
    <Folder name="services">
      <File name="generation.py" />
      <File name="task_queue.py" />
      <File name="profiles.py" />
      <File name="channels.py" />
    </Folder>
    <Folder name="backends">
      <File name="__init__.py" />
      <File name="base.py" />
      <File name="pytorch_backend.py" />
      <File name="mlx_backend.py" />
      <File name="qwen_custom_voice_backend.py" />
      <File name="luxtts_backend.py" />
      <File name="chatterbox_backend.py" />
      <File name="chatterbox_turbo_backend.py" />
      <File name="hume_backend.py" />
      <File name="kokoro_backend.py" />
    </Folder>
    <Folder name="database">
      <File name="models.py" />
      <File name="session.py" />
    </Folder>
    <Folder name="utils">
      <File name="audio.py" />
      <File name="effects.py" />
    </Folder>
  </Folder>
</Files>

### Request Flow

An HTTP request enters a **route handler**, then flows to a **service** function. The service calls the appropriate **engine backend** via the registry, which runs the actual inference. Audio post-processing runs through **utils** (trim, resample, effects).

Route handlers are intentionally thin — they validate input, delegate to a service function, and format the response. All business logic lives in `services/`.

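As a sketch of that layering — the names here are hypothetical stand-ins, since the real handlers are FastAPI routes in `routes/generate.py` — the pattern looks like:

```python
# Route layer: validate, delegate, format. No business logic.
def generate_route(payload: dict) -> dict:
    text = payload.get("text", "").strip()
    if not text:
        raise ValueError("text is required")  # FastAPI would turn this into a 422
    audio_path = generate_speech(text, payload.get("engine", "qwen"))
    return {"status": "ok", "path": audio_path}


# Service layer: this is where the real work happens. In Voicebox it would
# resolve the engine backend via the registry and run inference; here we
# just fabricate a stored-file path to show the shape of the call.
def generate_speech(text: str, engine: str) -> str:
    return f"generations/{engine}/{abs(hash(text))}.wav"
```

Keeping routes this thin is what lets a single service function back several routes (generate, retry, regenerate) without duplicating validation or response formatting.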
### Multi-Engine Registry

The backend is designed so that adding a new TTS engine only requires touching the `backends/` directory and the central registry. There is no per-engine branching in routes or services.

- **`TTSBackend` Protocol** (`backends/__init__.py`) — defines the contract every engine implements: `load_model`, `create_voice_prompt`, `combine_voice_prompts`, `generate`, `unload_model`, `is_loaded`, `_get_model_path`.
- **`ModelConfig` dataclass** — central metadata record for each model variant: `model_name`, `display_name`, `engine`, `hf_repo_id`, `size_mb`, `needs_trim`, `languages`, `supports_instruct`, etc.
- **`TTS_ENGINES` dict** — maps engine name (`"qwen"`, `"kokoro"`, etc.) to display name.
- **`get_tts_backend_for_engine(engine)`** — thread-safe factory that lazily instantiates and caches the backend for an engine using double-checked locking.

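A compressed sketch of how the Protocol and factory could fit together (the method set is trimmed, and the registration decorator is an illustrative device — the real module wires engines up through `ModelConfig` and explicit tables):

```python
import threading
from typing import Protocol


class TTSBackend(Protocol):
    """Trimmed sketch of the engine contract."""
    def load_model(self, model_name: str) -> None: ...
    def generate(self, text: str) -> bytes: ...
    def is_loaded(self) -> bool: ...


_factories: dict[str, type] = {}   # engine name -> backend class
_backends: dict[str, TTSBackend] = {}
_lock = threading.Lock()


def register_engine(name: str):
    """Illustrative helper to populate the factory table."""
    def wrap(cls):
        _factories[name] = cls
        return cls
    return wrap


def get_tts_backend_for_engine(engine: str) -> TTSBackend:
    """Lazily instantiate and cache one backend per engine (double-checked locking)."""
    backend = _backends.get(engine)          # first check, lock-free fast path
    if backend is None:
        with _lock:
            backend = _backends.get(engine)  # second check, under the lock
            if backend is None:
                backend = _factories[engine]()
                _backends[engine] = backend
    return backend
```

The double-checked locking keeps the hot path lock-free once a backend exists, while guaranteeing that concurrent first requests for the same engine construct it exactly once.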
Shipped engines:

| Engine key | Display name | Profile type |
|------------|--------------|--------------|
| `qwen` | Qwen TTS | Cloned |
| `qwen_custom_voice` | Qwen CustomVoice | Preset |
| `luxtts` | LuxTTS | Cloned |
| `chatterbox` | Chatterbox TTS | Cloned |
| `chatterbox_turbo` | Chatterbox Turbo | Cloned |
| `tada` | TADA | Cloned |
| `kokoro` | Kokoro | Preset |

See [TTS Engines](/developer/tts-engines) for the full contract and integration phases, and [PROJECT_STATUS.md](https://github.com/jamiepine/voicebox/blob/main/docs/PROJECT_STATUS.md) for candidates under evaluation.

### Key Modules

- **`app.py`** — FastAPI app factory, CORS, lifecycle events
- **`main.py`** — Entry point (imports the app, runs uvicorn)
- **`server.py`** — Tauri sidecar launcher, parent-PID watchdog, frozen-build environment setup
- **`services/generation.py`** — Single function handling all generation modes (generate, retry, regenerate)
- **`services/task_queue.py`** — Serial generation queue for GPU inference
- **`backends/__init__.py`** — Protocol definitions, `ModelConfig` registry, and engine factory
- **`backends/base.py`** — Shared utilities across all engine implementations (device selection, progress tracking, output trimming)

### Inference Backend Selection

The server detects the best inference backend at startup and uses it for all engines that support it:

| Platform | Backend | Acceleration |
|----------|---------|--------------|
| macOS (Apple Silicon) | MLX | Metal / Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch | CUDA (cu128) |
| Linux (AMD) | PyTorch | ROCm |
| Windows / Linux (Intel Arc) | PyTorch | XPU (IPEX) |
| Windows (other GPU) | PyTorch | DirectML |
| Any | PyTorch | CPU fallback |

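A simplified probe along these lines — the real startup code also covers ROCm, XPU, and DirectML, and prefers MLX on Apple Silicon; this sketch only shows the common PyTorch cases:

```python
def detect_device() -> str:
    """Pick the best available accelerator, falling back to CPU.

    Simplified sketch: only CUDA and Apple MPS are probed here.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():  # NVIDIA (ROCm builds also report here)
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():  # Apple Silicon Metal
        return "mps"
    return "cpu"
```

Running the probe once at startup, rather than per request, keeps every engine on the same device for the lifetime of the process.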
See [GPU Acceleration](/overview/gpu-acceleration) for platform-specific notes and manual overrides.

### Data Model

Core tables (see `backend/database/models.py`):

- **`profiles`** — Voice profiles with `voice_type` discriminator (`cloned` | `preset` | `designed`), `preset_engine`, `preset_voice_id`, and `default_engine`.
- **`profile_samples`** — Reference audio clips + transcripts for cloned profiles. Empty for preset profiles.
- **`generations`** — Generated audio with text, engine, model, language, seed, and duration.
- **`generation_versions`** — Processed variants of a generation with different effects chains applied.
- **`audio_channels`** + **`channel_device_mappings`** + **`profile_channel_mappings`** — Multi-output routing.

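To make the `voice_type` discriminator concrete, here is illustrative DDL for the first two tables — column names follow the bullets above, but the real schema is defined with SQLAlchemy in `backend/database/models.py` and carries more columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profiles (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    voice_type TEXT NOT NULL CHECK (voice_type IN ('cloned', 'preset', 'designed')),
    preset_engine TEXT,      -- only set for preset profiles
    preset_voice_id TEXT,    -- only set for preset profiles
    default_engine TEXT
);
CREATE TABLE profile_samples (
    id INTEGER PRIMARY KEY,
    profile_id INTEGER NOT NULL REFERENCES profiles(id),
    audio_path TEXT NOT NULL,
    transcript TEXT          -- reference text used for voice cloning
);
""")

conn.execute(
    "INSERT INTO profiles (name, voice_type, default_engine) VALUES (?, ?, ?)",
    ("Narrator", "cloned", "qwen"),
)
row = conn.execute(
    "SELECT voice_type FROM profiles WHERE name = ?", ("Narrator",)
).fetchone()
```

The CHECK constraint mirrors the discriminator: preset profiles carry `preset_engine`/`preset_voice_id` and no samples, while cloned profiles lean on `profile_samples` rows instead.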
See [Voice Profiles](/developer/voice-profiles) and [Effects Pipeline](/developer/effects-pipeline) for details.

## Desktop App (Tauri)

### Rust Backend

<Files>
  <Folder name="tauri/src-tauri" defaultOpen>
    <File name="Cargo.toml" />
    <File name="tauri.conf.json" />
    <File name="src/" />
    <Folder name="binaries" />
  </Folder>
</Files>

### Responsibilities

- Launch Python backend as sidecar process
- Native file dialogs
- System tray integration
- Auto-updates (Tauri updater + custom CUDA backend swap)
- Parent-PID watchdog so the backend exits if the app crashes

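A POSIX-only sketch of that watchdog — the real implementation lives in `backend/server.py`, and on Windows the liveness probe has to be done differently, since `os.kill` semantics differ there:

```python
import os
import threading
import time


def parent_alive(pid: int) -> bool:
    """Probe a pid with signal 0 — delivers nothing, but fails if it's gone."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists, just not ours


def watchdog(parent_pid: int, interval: float = 2.0) -> None:
    """Exit hard if the parent (the Tauri app) disappears, so no orphaned
    backend process keeps holding the port and the GPU."""
    while True:
        if not parent_alive(parent_pid):
            os._exit(0)  # skip cleanup: inference threads may be blocked
        time.sleep(interval)


# Started once at boot, e.g.:
# threading.Thread(target=watchdog, args=(os.getppid(),), daemon=True).start()
```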
## Build Process

### Development

```bash
just dev           # Starts backend + Tauri app
just dev-web       # Starts backend + web app (no Tauri)
just dev-backend   # Backend only
just dev-frontend  # Tauri app only (backend must be running)
```

### Production

```bash
just build         # CPU server binary + Tauri installer
just build-local   # CPU + CUDA binaries + Tauri installer (Windows)
just build-server  # Server binary only
just build-tauri   # Tauri app only
```

See [Building](/developer/building) for what PyInstaller does and how the CUDA binary is split and packaged separately.

## Data Flow

### Generation Flow

1. **User Input** — text entered in a React component, engine + profile selected
2. **State Update** — Zustand generation form store records the request
3. **API Request** — React Query mutation hits `POST /generate`
4. **Route** — `routes/generate.py` validates input, dispatches to `services/generation.py`
5. **Voice Prompt** — the service creates or retrieves a cached voice prompt via the engine's backend
6. **Queue** — `services/task_queue.py` serializes generation to avoid GPU contention
7. **Inference** — the engine backend runs `generate()` and returns audio + sample rate
8. **Post-process** — optional trim for engines that need it, then the effects chain applied per generation version
9. **Storage** — audio written to the generations directory, metadata saved to SQLite
10. **Response** — backend returns the generation record; frontend updates the React Query cache and plays audio

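Steps 6–7 can be sketched with a single asyncio worker draining a queue, so only one generation touches the GPU at a time — illustrative names, not the actual `task_queue` API:

```python
import asyncio


async def worker(queue: asyncio.Queue) -> None:
    """Single consumer: jobs run strictly one at a time."""
    while True:
        text, done = await queue.get()
        # Blocking inference would run via asyncio.to_thread(...) here.
        done.set_result(f"audio for {text!r}")
        queue.task_done()


async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(queue))
    futures = []
    for text in ["hello", "world"]:
        done = asyncio.get_running_loop().create_future()
        await queue.put((text, done))   # enqueue job with its completion future
        futures.append(done)
    results = [await f for f in futures]  # completes in submission order
    task.cancel()
    return results
```

Because there is exactly one consumer, requests are serialized without any explicit locking, and each caller still gets its own future to await.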
## Performance Considerations

### Frontend

- **Code splitting** — lazy-load routes
- **Memoization** — `React.memo` for heavy components
- **Virtual scrolling** — for large lists
- **Debouncing** — search and input handling

### Backend

- **Async I/O** — all I/O is async; inference runs in `asyncio.to_thread`
- **Serial task queue** — avoids multiple engines fighting for the GPU
- **Voice prompt caching** — engine-specific, keyed by audio hash + reference text
- **Model pinning** — only one model per engine loaded at a time; switching unloads the previous one
- **Per-engine backend cache** — engines are instantiated only once per process

## Security

### Current

- Local-only by default (bound to `127.0.0.1:17493`)
- No authentication (localhost trust)
- File system sandboxing via Tauri

### Planned

- API key authentication for remote mode
- User accounts
- Rate limiting
- HTTPS support

## Deployment Modes

### Local Mode

- Backend runs as sidecar
- All data stays on device
- No network required

### Remote Mode

- Backend on a separate machine (Docker or bare host)
- Frontend (desktop or web) connects over HTTP
- See [Remote Mode](/overview/remote-mode) and [Docker](/overview/docker)

## Next Steps

<Cards>
  <Card title="Development Setup" href="/developer/setup">
    Set up your dev environment
  </Card>
  <Card title="TTS Engines" href="/developer/tts-engines">
    How to add a new engine
  </Card>
  <Card title="Contributing" href="/developer/contributing">
    Contribute to Voicebox
  </Card>
</Cards>