Initial commit

2026-04-24 19:18:15 +08:00
commit fbcbe08696
555 changed files with 96692 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,339 @@
+<p align="center">
+  <img src=".github/assets/icon-dark.webp" alt="Voicebox" width="120" height="120" />
+</p>
+
+<h1 align="center">Voicebox</h1>
+
+<p align="center">
+  <strong>The open-source voice synthesis studio.</strong><br/>
+  Clone voices. Generate speech. Apply effects. Build voice-powered apps.<br/>
+  All running locally on your machine.
+</p>
+
+<p align="center">
+  <a href="https://github.com/jamiepine/voicebox/releases">
+    <img src="https://img.shields.io/github/downloads/jamiepine/voicebox/total?style=flat&color=blue" alt="Downloads" />
+  </a>
+  <a href="https://github.com/jamiepine/voicebox/releases/latest">
+    <img src="https://img.shields.io/github/v/release/jamiepine/voicebox?style=flat" alt="Release" />
+  </a>
+  <a href="https://github.com/jamiepine/voicebox/stargazers">
+    <img src="https://img.shields.io/github/stars/jamiepine/voicebox?style=flat" alt="Stars" />
+  </a>
+  <a href="https://github.com/jamiepine/voicebox/blob/main/LICENSE">
+    <img src="https://img.shields.io/github/license/jamiepine/voicebox?style=flat" alt="License" />
+  </a>
+  <a href="https://deepwiki.com/jamiepine/voicebox">
+    <img src="https://img.shields.io/static/v1?label=Ask&message=DeepWiki&color=5B6EF7" alt="Ask DeepWiki" />
+  </a>
+</p>
+
+<p align="center">
+  <a href="https://voicebox.sh">voicebox.sh</a> •
+  <a href="https://docs.voicebox.sh">Docs</a> •
+  <a href="#download">Download</a> •
+  <a href="#features">Features</a> •
+  <a href="#api">API</a> •
+  <a href="docs/content/docs/overview/troubleshooting.mdx">Troubleshooting</a>
+</p>
+
+<br/>
+
+<p align="center">
+  <a href="https://voicebox.sh">
+    <img src="landing/public/assets/app-screenshot-1.webp" alt="Voicebox App Screenshot" width="800" />
+  </a>
+</p>
+
+<p align="center">
+  <em>Click the image above to watch the demo video on <a href="https://voicebox.sh">voicebox.sh</a></em>
+</p>
+
+<br/>
+
+<p align="center">
+  <img src="landing/public/assets/app-screenshot-2.webp" alt="Voicebox Screenshot 2" width="800" />
+</p>
+
+<p align="center">
+  <img src="landing/public/assets/app-screenshot-3.webp" alt="Voicebox Screenshot 3" width="800" />
+</p>
+
+<br/>
+
+## What is Voicebox?
+
+Voicebox is a **local-first voice cloning studio** — a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio or pick from 50+ preset voices, generate speech in 23 languages across 7 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.
+
+- **Complete privacy** — models and voice data stay on your machine
+- **7 TTS engines** — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
+- **Cloning and preset voices** — zero-shot cloning from a reference sample, or curated preset voices via Kokoro (50 voices) and Qwen CustomVoice (9 voices)
+- **23 languages** — from English to Arabic, Japanese, Hindi, Swahili, and more
+- **Post-processing effects** — pitch shift, reverb, delay, chorus, compression, and filters
+- **Expressive speech** — paralinguistic tags like `[laugh]`, `[sigh]`, `[gasp]` via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
+- **Unlimited length** — auto-chunking with crossfade for scripts, articles, and chapters
+- **Stories editor** — multi-track timeline for conversations, podcasts, and narratives
+- **API-first** — REST API for integrating voice synthesis into your own projects
+- **Native performance** — built with Tauri (Rust), not Electron
+- **Runs everywhere** — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker
+
+---
+
+## Download
+
+| Platform              | Download                                               |
+| --------------------- | ------------------------------------------------------ |
+| macOS (Apple Silicon) | [Download DMG](https://voicebox.sh/download/mac-arm)   |
+| macOS (Intel)         | [Download DMG](https://voicebox.sh/download/mac-intel) |
+| Windows               | [Download MSI](https://voicebox.sh/download/windows)   |
+| Docker                | `docker compose up`                                    |
+
+> **[View all binaries →](https://github.com/jamiepine/voicebox/releases/latest)**
+
+> **Linux** — Pre-built binaries are not yet available. See [voicebox.sh/linux-install](https://voicebox.sh/linux-install) for build-from-source instructions.
+
+> **Having trouble?** See the [Troubleshooting Guide](docs/content/docs/overview/troubleshooting.mdx) for common install, generation, model-download, and GPU issues.
+
+---
+
+## Features
+
+### Multi-Engine Voice Cloning
+
+Seven TTS engines with different strengths, switchable per-generation:
+
+| Engine                      | Languages | Strengths                                                                                                                                |
+| --------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
+| **Qwen3-TTS** (0.6B / 1.7B) | 10        | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper")                                                     |
+| **Qwen CustomVoice**        | 10        | 9 curated preset voices with natural-language delivery control — no reference audio required                                             |
+| **LuxTTS**                  | English   | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU                                                                              |
+| **Chatterbox Multilingual** | 23        | Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |
+| **Chatterbox Turbo**        | English   | Fast 350M model with paralinguistic emotion/sound tags                                                                                   |
+| **TADA** (1B / 3B)          | 10        | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment                                                        |
+| **Kokoro**                  | 8         | 50 curated preset voices, tiny 82M model, fast CPU inference                                                                             |
+
+### Emotions & Paralinguistic Tags
+
+Only **Chatterbox Turbo** interprets paralinguistic tags like `[laugh]` and
+`[sigh]`. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them
+literally as text.
+
+With **Chatterbox Turbo** selected, type `/` in the text input to open the tag
+inserter and add expressive tags inline with speech:
+
+`[laugh]` `[chuckle]` `[gasp]` `[cough]` `[sigh]` `[groan]` `[sniff]` `[shush]` `[clear throat]`
+
+### Post-Processing Effects
+
+8 audio effects powered by Spotify's `pedalboard` library. Apply after generation, preview in real time, build reusable presets.
+
+| Effect           | Description                                   |
+| ---------------- | --------------------------------------------- |
+| Pitch Shift      | Up or down by up to 12 semitones              |
+| Reverb           | Configurable room size, damping, wet/dry mix  |
+| Delay            | Echo with adjustable time, feedback, and mix  |
+| Chorus / Flanger | Modulated delay for metallic or lush textures |
+| Compressor       | Dynamic range compression                     |
+| Gain             | Volume adjustment (-40 to +40 dB)             |
+| High-Pass Filter | Remove low frequencies                        |
+| Low-Pass Filter  | Remove high frequencies                       |
+
+Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.
+
+### Unlimited Generation Length
+
+Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.
+
+- Configurable auto-chunking limit (100–5,000 chars)
+- Crossfade slider (0–200ms) for smooth transitions
+- Max text length: 50,000 characters
+- Smart splitting respects abbreviations, CJK punctuation, and `[tags]`
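+
+As a rough illustration of the idea (not Voicebox's actual implementation), the sketch below packs whole sentences into chunks and joins two sample buffers with a linear crossfade. The function names, the naive sentence regex, and the list-of-floats audio representation are all assumptions made for the example; it does not handle abbreviations, CJK punctuation, or `[tags]`.
+
+```python
+import re
+
+
+def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
+    """Greedily pack whole sentences into chunks of roughly max_chars."""
+    # Naive boundary (assumption): whitespace that follows '.', '!' or '?'.
+    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
+    chunks: list[str] = []
+    current = ""
+    for sentence in sentences:
+        candidate = f"{current} {sentence}".strip()
+        if current and len(candidate) > max_chars:
+            # Adding this sentence would overflow, so close the current chunk.
+            chunks.append(current)
+            current = sentence
+        else:
+            # A single sentence longer than max_chars is kept whole.
+            current = candidate
+    if current:
+        chunks.append(current)
+    return chunks
+
+
+def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
+    """Join two sample buffers, blending the last/first `overlap` samples."""
+    overlap = min(overlap, len(a), len(b))
+    if overlap == 0:
+        return a + b
+    head, tail = a[:-overlap], a[-overlap:]
+    blended = []
+    for i in range(overlap):
+        w = (i + 1) / (overlap + 1)  # weight of b rises linearly
+        blended.append(tail[i] * (1 - w) + b[i] * w)
+    return head + blended + b[overlap:]
+```
+
+Converting a crossfade duration to samples is `overlap = sample_rate * ms // 1000`; at an assumed 24,000 Hz sample rate, 50 ms is 1,200 samples.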
+
+### Generation Versions
+
+Every generation supports multiple versions with provenance tracking:
+
+- **Original** — clean TTS output, always preserved
+- **Effects versions** — apply different effects chains from any source version
+- **Takes** — regenerate with a new seed for variation
+- **Source tracking** — each version records its lineage
+- **Favorites** — star generations for quick access
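+
+One way to picture source tracking is a record per version that points at the version it was derived from. The `Version` dataclass, its fields, and the `lineage` helper below are hypothetical and only illustrate the idea; the real schema may differ.
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class Version:
+    id: str
+    kind: str                        # e.g. "original", "effects", or "take"
+    source_id: Optional[str] = None  # version this one was derived from
+
+
+def lineage(versions: dict[str, Version], version_id: str) -> list[str]:
+    """Walk source links from a version back to its root."""
+    chain: list[str] = []
+    seen: set[str] = set()
+    current: Optional[str] = version_id
+    while current is not None:
+        if current in seen:
+            raise ValueError(f"cycle in lineage at {current}")
+        seen.add(current)
+        chain.append(current)
+        current = versions[current].source_id
+    return chain
+```
+
+With an original `v1`, an effects version `v2` built from it, and a second effects version `v3` built from `v2`, `lineage(versions, "v3")` walks back through `v2` to `v1`.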
+
+### Async Generation Queue
+
+Generation is non-blocking. Submit one request and start typing the next right away.
+
+- Serial execution queue prevents GPU contention
+- Real-time SSE status streaming
+- Failed generations can be retried
+- Stale generations from crashes auto-recover on startup
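+
+A serial queue of this kind can be sketched with `asyncio`: one worker pulls jobs in submission order, so only one generation runs at a time, while submitting stays non-blocking. This is a minimal sketch under assumed names (`worker`, `fake_generate`, `run_demo`), not the project's backend code.
+
+```python
+import asyncio
+
+
+async def worker(queue: asyncio.Queue, results: list[str]) -> None:
+    """Run queued jobs one at a time, in submission order."""
+    while True:
+        job_id, job = await queue.get()
+        try:
+            results.append(f"{job_id}:{await job()}")
+        except Exception as exc:
+            # A failed job is recorded and the worker moves on to the next one.
+            results.append(f"{job_id}:failed ({exc})")
+        finally:
+            queue.task_done()
+
+
+async def run_demo() -> list[str]:
+    queue: asyncio.Queue = asyncio.Queue()
+    results: list[str] = []
+    task = asyncio.create_task(worker(queue, results))
+
+    async def fake_generate(text: str) -> str:
+        await asyncio.sleep(0)  # stand-in for a TTS engine call
+        return text.upper()
+
+    # Submitting returns immediately; the caller never awaits the job itself.
+    queue.put_nowait(("a", lambda: fake_generate("hello")))
+    queue.put_nowait(("b", lambda: fake_generate("world")))
+
+    await queue.join()  # wait until both jobs have been processed
+    task.cancel()
+    try:
+        await task
+    except asyncio.CancelledError:
+        pass
+    return results
+```
+
+A single worker is the simplest way to guarantee that jobs never overlap; status streaming and retries would sit on top of this loop.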
+
+### Voice Profile Management
+
+- Create profiles from audio files or record directly in-app
+- Import/export profiles to share or back up
+- Multi-sample support for higher quality cloning
+- Per-profile default effects chains
+- Organize with descriptions and language tags
+
+### Stories Editor
+
+Multi-voice timeline editor for conversations, podcasts, and narratives.
+
+- Multi-track composition with drag-and-drop
+- Inline audio trimming and splitting
+- Auto-playback with synchronized playhead
+- Version pinning per track clip
+
+### Recording & Transcription
+
+- In-app recording with waveform visualization
+- System audio capture (macOS and Windows)
+- Automatic transcription powered by Whisper (including Whisper Turbo)
+- Export recordings in multiple formats
+
+### Model Management
+
+- Per-model unload to free GPU memory without deleting downloads
+- Custom models directory via `VOICEBOX_MODELS_DIR`
+- Model folder migration with progress tracking
+- Download cancel/clear UI
+
+### GPU Support
+
+| Platform                 | Backend        | Notes                                          |
+| ------------------------ | -------------- | ---------------------------------------------- |
+| macOS (Apple Silicon)    | MLX (Metal)    | 4-5x faster via Metal GPU acceleration         |
+| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
+| Linux (AMD)              | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION       |
+| Windows (any GPU)        | DirectML       | Universal Windows GPU support                  |
+| Intel Arc                | IPEX/XPU       | Intel discrete GPU acceleration                |
+| Any                      | CPU            | Works everywhere, just slower                  |
+
+---
+
+## API
+
+Voicebox exposes a full REST API for integrating voice synthesis into your own apps.
+
+```bash
+# Generate speech
+curl -X POST http://localhost:17493/generate \
+  -H "Content-Type: application/json" \
+  -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}'
+
+# List voice profiles
+curl http://localhost:17493/profiles
+
+# Create a profile
+curl -X POST http://localhost:17493/profiles \
+  -H "Content-Type: application/json" \
+  -d '{"name": "My Voice", "language": "en"}'
+```
+
+**Use cases:** game dialogue, podcast production, accessibility tools, voice assistants, content automation.
+
+Full API documentation available at `http://localhost:17493/docs`.
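+
+The same `/generate` call can be made from Python with only the standard library. The sketch below mirrors the curl example above; the helper names are made up for illustration, and because the response format is not described here, the body is returned as raw bytes.
+
+```python
+import json
+import urllib.request
+
+BASE_URL = "http://localhost:17493"  # port taken from the curl examples
+
+
+def build_generate_request(
+    text: str, profile_id: str, language: str = "en"
+) -> urllib.request.Request:
+    """Build the POST /generate request without sending it."""
+    payload = {"text": text, "profile_id": profile_id, "language": language}
+    return urllib.request.Request(
+        f"{BASE_URL}/generate",
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+
+def generate(
+    text: str, profile_id: str, language: str = "en", timeout: float = 30.0
+) -> bytes:
+    """Send the request and return the raw, undecoded response body."""
+    request = build_generate_request(text, profile_id, language)
+    with urllib.request.urlopen(request, timeout=timeout) as response:
+        return response.read()
+```
+
+Calling `generate("Hello world", "abc123")` requires a running Voicebox backend; `build_generate_request` can be inspected without one.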
+
+---
+
+## Tech Stack
+
+| Layer         | Technology                                        |
+| ------------- | ------------------------------------------------- |
+| Desktop App   | Tauri (Rust)                                      |
+| Frontend      | React, TypeScript, Tailwind CSS                   |
+| State         | Zustand, React Query                              |
+| Backend       | FastAPI (Python)                                  |
+| TTS Engines   | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, TADA, Kokoro |
+| Effects       | Pedalboard (Spotify)                              |
+| Transcription | Whisper / Whisper Turbo (PyTorch or MLX)          |
+| Inference     | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
+| Database      | SQLite                                            |
+| Audio         | WaveSurfer.js, librosa                            |
+
+---
+
+## Roadmap
+
+| Feature                 | Description                                    |
+| ----------------------- | ---------------------------------------------- |
+| **Real-time Streaming** | Stream audio as it generates, word by word     |
+| **Voice Design**        | Create new voices from text descriptions       |
+| **More Models**         | XTTS, Bark, and other open-source voice models |
+| **Plugin Architecture** | Extend with custom models and effects          |
+| **Mobile Companion**    | Control Voicebox from your phone               |
+
+For the **full engineering status, open-issue triage, and prioritized work queue**, see [`docs/PROJECT_STATUS.md`](docs/PROJECT_STATUS.md) — a living document that tracks what's shipped, what's in-flight, candidate TTS engines under evaluation, and why we've accepted or backlogged specific integrations.
+
+---
+
+## Development
+
+See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed setup and contribution guidelines.
+
+### Quick Start
+
+```bash
+git clone https://github.com/jamiepine/voicebox.git
+cd voicebox
+
+just setup   # creates Python venv, installs all deps
+just dev     # starts backend + desktop app
+```
+
+Install [just](https://github.com/casey/just): `brew install just` or `cargo install just`. Run `just --list` to see all commands.
+
+**Prerequisites:** [Bun](https://bun.sh), [Rust](https://rustup.rs), [Python 3.11+](https://python.org), [Tauri Prerequisites](https://v2.tauri.app/start/prerequisites/), and [Xcode](https://developer.apple.com/xcode/) on macOS.
+
+### Building Locally
+
+```bash
+just build          # Build CPU server binary + Tauri app
+just build-local    # (Windows) Build CPU + CUDA server binaries + Tauri app
+```
+
+### Adding New Voice Models
+
+The multi-engine architecture makes adding new TTS engines straightforward. A [step-by-step guide](docs/content/docs/developer/tts-engines.mdx) covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.
+
+The guide is optimized for AI coding agents. An [agent skill](.agents/skills/add-tts-engine/SKILL.md) can pick up a model name and handle the entire integration autonomously — you just test the build locally.
+
+### Project Structure
+
+```
+voicebox/
+├── app/              # Shared React frontend
+├── tauri/            # Desktop app (Tauri + Rust)
+├── web/              # Web deployment
+├── backend/          # Python FastAPI server
+├── landing/          # Marketing website
+└── scripts/          # Build & release scripts
+```
+
+---
+
+## Contributing
+
+Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
+
+1. Fork the repo
+2. Create a feature branch
+3. Make your changes
+4. Submit a PR
+
+## Security
+
+Found a security vulnerability? Please report it responsibly. See [SECURITY.md](SECURITY.md) for details.
+
+---
+
+## License
+
+MIT License — see [LICENSE](LICENSE) for details.
+
+---
+
+<p align="center">
+  <a href="https://voicebox.sh">voicebox.sh</a>
+</p>