# End-to-End Model Generation Test — Design

## Goal

A single script, runnable on macOS and Windows, that exercises every TTS model against the **frozen PyInstaller binary** (not the dev server), captures per-model pass/fail and error messages, and exits non-zero if any model fails. Generation is strictly sequential — one model loaded at a time.

## Test matrix (10 runs)

Derived from `backend/backends/__init__.py:185-316`. Each row maps to one `POST /generate` call.

| #  | engine              | model_size | profile kind | notes                        |
|----|---------------------|------------|--------------|------------------------------|
| 1  | `qwen`              | `1.7B`     | cloned       | reference audio required     |
| 2  | `qwen`              | `0.6B`     | cloned       |                              |
| 3  | `qwen_custom_voice` | `1.7B`     | preset       | `preset_voice_id="Ryan"`     |
| 4  | `qwen_custom_voice` | `0.6B`     | preset       | `preset_voice_id="Ryan"`     |
| 5  | `luxtts`            | —          | cloned       | English only                 |
| 6  | `chatterbox`        | —          | cloned       |                              |
| 7  | `chatterbox_turbo`  | —          | cloned       | English only                 |
| 8  | `tada`              | `1B`       | cloned       | tada-1b, English only        |
| 9  | `tada`              | `3B`       | cloned       | tada-3b-ml, multilingual     |
| 10 | `kokoro`            | —          | preset       | `preset_voice_id="af_heart"` |

Cloned engines (1, 2, 5, 6, 7, 8, 9) share **one** profile created once with the reference WAV. Preset profiles are created separately, one for kokoro and one for qwen_custom_voice.

Language for every run: `en` (covers every engine's supported set).

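
One way to hold the matrix in the script is a plain list of named tuples. The sketch below is illustrative only: the names `Run`, `MATRIX`, and `profile_key` are assumptions and do not come from the codebase.

```python
from typing import NamedTuple, Optional


class Run(NamedTuple):
    engine: str
    model_size: Optional[str]  # None where the table shows a dash
    profile_kind: str          # "cloned" or "preset"


# Rows in the same order as the test matrix table.
MATRIX = [
    Run("qwen", "1.7B", "cloned"),
    Run("qwen", "0.6B", "cloned"),
    Run("qwen_custom_voice", "1.7B", "preset"),
    Run("qwen_custom_voice", "0.6B", "preset"),
    Run("luxtts", None, "cloned"),
    Run("chatterbox", None, "cloned"),
    Run("chatterbox_turbo", None, "cloned"),
    Run("tada", "1B", "cloned"),
    Run("tada", "3B", "cloned"),
    Run("kokoro", None, "preset"),
]


def profile_key(run: Run) -> str:
    """Return which of the three profiles a row uses (hypothetical helper)."""
    # Cloned rows share one profile; preset rows get one profile per engine.
    return "cloned" if run.profile_kind == "cloned" else run.engine
```

Keeping `model_size` as `None` for engines without a size makes it easy to omit the field from the request later.
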
## End-to-end flow

```
1. Resolve paths      → find binary, build if missing
2. Launch binary      → spawn with --port --data-dir --parent-pid
3. Wait for /health   → poll until status=="healthy" or 120s timeout
4. Create profiles    → 1 cloned + 2 preset, via /profiles (+ /samples)
5. For each (engine, model_size) in matrix:
   a. Check cache     → GET /models/status → cached? short timeout : long
   b. POST /generate  → get generation_id
   c. Stream /status  → consume SSE until completed/failed/timeout
   d. Record result   → {engine, model_size, status, duration, error, elapsed}
6. Write results      → JSON + Markdown table to ./results/
7. Shutdown binary    → SIGTERM, fall back to kill, verify port freed
8. Exit code          → 0 if all passed, 1 otherwise
```

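
Step 3 could be sketched as a small polling loop. This is a minimal sketch, not the script's actual code: `wait_for_health` and its `fetch_status` callable are hypothetical, and the callable is assumed to return the `status` string from `/health` (or `None`) and to raise `OSError` while the server is not yet accepting connections.

```python
import time
from typing import Callable, Optional


def wait_for_health(
    fetch_status: Callable[[], Optional[str]],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch_status() returns "healthy" or the deadline passes."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if fetch_status() == "healthy":
                return True
        except OSError:
            # Connection refused while the binary is still starting up.
            pass
        sleep(interval_s)
    return False
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting.
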
## Binary resolution

Search order — **first hit wins**:

| Platform | Path | Build type |
|----------|------|------------|
| macOS    | `backend/dist/voicebox-server-cuda/voicebox-server-cuda` | onedir (CUDA, rarely on Mac) |
| macOS    | `backend/dist/voicebox-server` | onefile (CPU) |
| Windows  | `backend\dist\voicebox-server-cuda\voicebox-server-cuda.exe` | onedir (CUDA) |
| Windows  | `backend\dist\voicebox-server.exe` | onefile (CPU) |

If none exist, run `python backend/build_binary.py` and wait for it to finish (this can take 5-20 min). Fail with a clear error if the build itself fails. The `--skip-build` flag disables the auto-build: the script errors out if no binary is found.

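
The search order above could be expressed with `pathlib` roughly as follows. The function names and the `repo_root` parameter are assumptions made for illustration; only the paths come from the table.

```python
import sys
from pathlib import Path
from typing import Optional


def candidate_binaries(repo_root: Path, platform: str = sys.platform) -> list[Path]:
    """Return candidate paths in search order (CUDA onedir, then CPU onefile)."""
    suffix = ".exe" if platform.startswith("win") else ""
    dist = repo_root / "backend" / "dist"
    return [
        dist / "voicebox-server-cuda" / f"voicebox-server-cuda{suffix}",
        dist / f"voicebox-server{suffix}",
    ]


def resolve_binary(repo_root: Path, platform: str = sys.platform) -> Optional[Path]:
    """Return the first candidate that exists, or None if a build is needed."""
    for path in candidate_binaries(repo_root, platform):
        if path.is_file():
            return path
    return None
```

A `None` result is where the script would either start the build or, with `--skip-build`, exit with an error.
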
## Spawn command

Mirrors Tauri's launch in `tauri/src-tauri/src/main.rs:369-388`:

```
<binary> --host 127.0.0.1 --port <free-port> --data-dir <tempdir> --parent-pid <test-pid>
```

- **Port**: bind a socket to port `0` in Python so the OS assigns a free port, then pass that number.
- **Data dir**: `tempfile.mkdtemp(prefix="voicebox-e2e-")`. Deleted after the run unless `--keep-data-dir` is set. Profiles and generated WAVs land here.
- **Parent PID**: the current Python PID, which ensures the backend dies if the test crashes (watchdog in `server.py:102-224`).
- **stdout/stderr**: tee to both a log file in `./results/server-<timestamp>.log` and a rolling in-memory buffer. On model failure, the last 100 lines of the buffer are attached to that model's error record.

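
The port, argument list, and rolling buffer described above might look like this in Python. The helper names are hypothetical; the flags are the ones in the spawn command.

```python
import collections
import os
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Bind to port 0 so the OS assigns a free port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_spawn_argv(binary: str, port: int, data_dir: str) -> list[str]:
    """Assemble the argument list in the same order as the spawn command."""
    return [
        binary,
        "--host", "127.0.0.1",
        "--port", str(port),
        "--data-dir", data_dir,
        "--parent-pid", str(os.getpid()),
    ]


# Rolling buffer: appending beyond maxlen silently drops the oldest line.
log_tail = collections.deque(maxlen=100)
```

Note that the port is released before the binary binds it, so another process could in principle claim it in between; for a developer-run test that small window is usually acceptable.
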
## Profile setup

One cloned profile shared across all cloning engines:

```http
POST /profiles
{
  "name": "e2e-cloned",
  "voice_type": "cloned",
  "language": "en"
}
```

Then:

```http
POST /profiles/{id}/samples  (multipart)
file: <reference WAV>
reference_text: <exact transcription>
```

Two preset profiles:

```http
POST /profiles
{ "name": "e2e-kokoro", "voice_type": "preset", "language": "en",
  "preset_engine": "kokoro", "preset_voice_id": "af_heart" }

POST /profiles
{ "name": "e2e-qwen-cv", "voice_type": "preset", "language": "en",
  "preset_engine": "qwen_custom_voice", "preset_voice_id": "Ryan" }
```

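
A rough sketch of the three profile requests plus the sample upload follows. It assumes an `httpx.Client`-like object created with a `base_url`, and it assumes the `POST /profiles` response carries the new profile's identifier under `id`; neither is confirmed by this document. `PROFILE_PAYLOADS` and `create_profiles` are illustrative names.

```python
PROFILE_PAYLOADS = {
    "cloned": {"name": "e2e-cloned", "voice_type": "cloned", "language": "en"},
    "kokoro": {
        "name": "e2e-kokoro", "voice_type": "preset", "language": "en",
        "preset_engine": "kokoro", "preset_voice_id": "af_heart",
    },
    "qwen_custom_voice": {
        "name": "e2e-qwen-cv", "voice_type": "preset", "language": "en",
        "preset_engine": "qwen_custom_voice", "preset_voice_id": "Ryan",
    },
}


def create_profiles(client, reference_wav: bytes, reference_text: str) -> dict[str, str]:
    """Create all three profiles and attach the reference sample to the cloned one."""
    ids = {}
    for key, payload in PROFILE_PAYLOADS.items():
        response = client.post("/profiles", json=payload)
        response.raise_for_status()
        ids[key] = response.json()["id"]  # assumed response field
    # Multipart upload: the file part plus a plain form field.
    sample = client.post(
        f"/profiles/{ids['cloned']}/samples",
        files={"file": ("reference_voice.wav", reference_wav, "audio/wav")},
        data={"reference_text": reference_text},
    )
    sample.raise_for_status()
    return ids
```
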
## Generation request (per matrix row)

```http
POST /generate
{
  "profile_id": "<appropriate profile>",
  "text": "The quick brown fox jumps over the lazy dog.",
  "language": "en",
  "engine": "<engine>",
  "model_size": "<size or omitted>",
  "seed": 42,
  "normalize": true
}
```

The response `id` feeds into the SSE status loop (`GET /generate/{id}/status`, `routes/generations.py:190-227`). The loop reads lines until a payload with `status in ("completed", "failed")` arrives, then breaks.

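
The request body and the status loop could be sketched as below. This assumes each SSE event arrives as a single `data:` line containing a JSON object with a `status` key; the exact event shape is not specified here, and both function names are hypothetical.

```python
import json
from typing import Iterable, Optional


def build_generate_payload(profile_id: str, engine: str, model_size: Optional[str]) -> dict:
    """Build the POST /generate body, leaving out model_size when there is none."""
    payload = {
        "profile_id": profile_id,
        "text": "The quick brown fox jumps over the lazy dog.",
        "language": "en",
        "engine": engine,
        "seed": 42,
        "normalize": True,
    }
    if model_size is not None:
        payload["model_size"] = model_size
    return payload


def final_status(lines: Iterable[str]) -> Optional[dict]:
    """Scan SSE lines and return the first terminal payload, or None."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank separators, and other SSE fields
        body = line[len("data:"):].strip()
        if not body:
            continue
        event = json.loads(body)
        if event.get("status") in ("completed", "failed"):
            return event
    return None
```

A `None` return means the stream ended without a terminal event, which the caller would likely treat the same as a failure.
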
## Timeout strategy (split)

Check `GET /models/status` for the target model **before** generation:

| Cached? | Per-model timeout | Rationale |
|---------|-------------------|-----------|
| Yes     | **3 minutes**     | Inference only; generous for CPU builds |
| No      | **20 minutes**    | First-run HF download up to 8 GB (tada-3b-ml) |

On timeout: cancel the SSE stream, mark the row `timeout`, and continue to the next row. Don't abort the whole run on one timeout.

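
The split timeout and the continue-on-timeout rule can be sketched as a small driver loop. The callables `is_cached` and `run_one` are placeholders for the real cache check and generation call, and `run_one` is assumed to raise `TimeoutError` when its budget is exhausted.

```python
from typing import Callable, Iterable


def pick_timeout(was_cached: bool, cached_s: float = 180.0, download_s: float = 1200.0) -> float:
    """Return the per-model budget: 3 minutes if cached, 20 minutes otherwise."""
    return cached_s if was_cached else download_s


def run_rows(
    rows: Iterable[str],
    is_cached: Callable[[str], bool],
    run_one: Callable[[str, float], str],
) -> dict[str, str]:
    """Run each row in order; a timeout marks that row and the loop moves on."""
    statuses = {}
    for row in rows:
        budget = pick_timeout(is_cached(row))
        try:
            statuses[row] = run_one(row, budget)
        except TimeoutError:
            statuses[row] = "timeout"
    return statuses
```
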
## Result format

`./results/e2e-<platform>-<arch>-<timestamp>.json`:

```json
{
  "platform": "darwin-arm64",
  "binary": "/abs/path/voicebox-server",
  "binary_size_mb": 612,
  "started_at": "2026-04-16T12:34:56Z",
  "finished_at": "...",
  "results": [
    {
      "engine": "qwen",
      "model_size": "1.7B",
      "status": "passed|failed|timeout",
      "generation_id": "...",
      "was_cached": true,
      "elapsed_seconds": 12.4,
      "audio_duration": 3.1,
      "audio_path": "/tmp/.../gen.wav",
      "error": null,
      "server_log_tail": null
    }
  ]
}
```

Companion `./results/e2e-<...>.md`:

```
# Voicebox E2E — darwin-arm64 — 2026-04-16 12:34

| Engine              | Size | Status | Elapsed | Error         |
|---------------------|------|--------|---------|---------------|
| qwen                | 1.7B | PASS   | 12.4s   |               |
| qwen                | 0.6B | FAIL   | 4.1s    | CUDA OOM: ... |
...
```

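
Rendering the Markdown table and deriving the exit code from the `results` list might look like the sketch below. The function names are illustrative, and the `TIMEOUT` label is an assumption, since the example table only shows `PASS` and `FAIL`.

```python
def render_markdown(results: list[dict]) -> str:
    """Render one table row per result, using the keys from the JSON format."""
    labels = {"passed": "PASS", "failed": "FAIL", "timeout": "TIMEOUT"}  # TIMEOUT is assumed
    lines = [
        "| Engine | Size | Status | Elapsed | Error |",
        "|--------|------|--------|---------|-------|",
    ]
    for r in results:
        lines.append(
            "| {engine} | {size} | {status} | {elapsed:.1f}s | {error} |".format(
                engine=r["engine"],
                size=r["model_size"] or "",
                status=labels[r["status"]],
                elapsed=r["elapsed_seconds"],
                error=r["error"] or "",
            )
        )
    return "\n".join(lines)


def exit_code(results: list[dict]) -> int:
    """Return 0 only if every row passed, 1 otherwise."""
    return 0 if all(r["status"] == "passed" for r in results) else 1
```
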
## CLI flags

```
python -m backend.tests.test_all_models_e2e [flags]

--binary PATH           Use this binary instead of auto-detecting
--skip-build            Error if no binary found (no auto-build)
--reference-wav PATH    Reference audio (default: backend/tests/fixtures/reference_voice.wav)
--reference-text STR    Transcription (default: read from fixtures/reference_voice.txt)
--only ENGINE[,...]     Run only these engines (e.g. kokoro,qwen)
--skip ENGINE[,...]     Skip these engines
--keep-data-dir         Don't delete tempdir after run
--timeout-cached SEC    Override 180
--timeout-download SEC  Override 1200
--port N                Override auto-picked port
--output-dir PATH       Default: backend/tests/results/
```

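
These flags map fairly directly onto `argparse`. The following is a sketch under that assumption; `build_parser` and `_engine_list` are hypothetical names, and the real script may parse arguments differently.

```python
import argparse


def _engine_list(value: str) -> list[str]:
    """Split a comma-separated engine list, ignoring empty items."""
    return [name.strip() for name in value.split(",") if name.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="test_all_models_e2e")
    parser.add_argument("--binary", metavar="PATH")
    parser.add_argument("--skip-build", action="store_true")
    parser.add_argument("--reference-wav", metavar="PATH",
                        default="backend/tests/fixtures/reference_voice.wav")
    parser.add_argument("--reference-text", metavar="STR")
    parser.add_argument("--only", type=_engine_list, default=[], metavar="ENGINES")
    parser.add_argument("--skip", type=_engine_list, default=[], metavar="ENGINES")
    parser.add_argument("--keep-data-dir", action="store_true")
    parser.add_argument("--timeout-cached", type=float, default=180, metavar="SEC")
    parser.add_argument("--timeout-download", type=float, default=1200, metavar="SEC")
    parser.add_argument("--port", type=int, metavar="N")
    parser.add_argument("--output-dir", metavar="PATH", default="backend/tests/results/")
    return parser
```

Leaving `--reference-text` and `--port` as `None` by default lets the script fall back to the fixture file and to the auto-picked port.
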
## File layout

```
backend/tests/
├── E2E_MODEL_TEST_DESIGN.md       (this file)
├── test_all_models_e2e.py         (main script, ~400-500 LoC)
├── fixtures/
│   ├── reference_voice.wav        (user-provided, ~5-15s clean speech)
│   └── reference_voice.txt        (exact transcription)
└── results/                       (gitignored)
    ├── e2e-darwin-arm64-<ts>.json
    ├── e2e-darwin-arm64-<ts>.md
    └── server-<ts>.log
```

The script uses only the stdlib plus `httpx` (or `requests`) and `sseclient-py`, all already in `backend/requirements.txt`. It does not use pytest, so it stays invocable as a single command on fresh checkouts.

## Safety & cleanup

- Always kill the spawned binary in a `try/finally`. On Windows, run `taskkill /F /T` on the whole process tree (Tauri does the same).
- Verify the port is free on shutdown (the Tauri port-reuse check in `main.rs:114-186` could otherwise pick up a ghost).
- Don't touch the user's HF cache by default; let the server use `HF_HUB_CACHE` / `VOICEBOX_MODELS_DIR`. Passing `--isolated-cache` would point both env vars at the tempdir for a true cold-start run (opt-in only; would re-download every time).

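
A minimal sketch of the shutdown path, assuming the binary was started with `subprocess.Popen`. The names `stop_process` and `port_is_free` are hypothetical; the caller would invoke `stop_process` from the `finally` block and then check the port.

```python
import socket
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_s: float = 10.0) -> None:
    """Ask the process to exit, then force-kill it if it does not."""
    if proc.poll() is not None:
        return  # already exited
    if sys.platform == "win32":
        # /T takes down the whole child tree, /F forces termination.
        subprocess.run(
            ["taskkill", "/F", "/T", "/PID", str(proc.pid)],
            capture_output=True,
            check=False,
        )
    else:
        proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) != 0
```
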
## Non-goals

- Not validating audio quality (no WER, no waveform comparison). Pass = "endpoint returned `completed` and produced a non-empty WAV".
- Not testing STT (Whisper), effects chains, channels, or streaming endpoints.
- Not running on CI today — human-invoked on dev machines. CI integration is a follow-up once the script is stable.
- No model unload between runs — models stay loaded; server manages its own eviction.
- No version-drift check on the binary.
- No `instruct` parameter exercised on qwen_custom_voice runs.
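
The "non-empty WAV" half of the pass criterion could be checked with the stdlib `wave` module, as in this sketch. It assumes the server writes PCM WAV files, which `wave` can read; other encodings (for example 32-bit float) would need a different reader. `is_nonempty_wav` is a hypothetical helper.

```python
import wave


def is_nonempty_wav(path: str) -> bool:
    """Return True if path is a readable PCM WAV file with at least one frame."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # Not a WAV file, truncated, or missing.
        return False
```
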