End-to-End Model Generation Test — Design
Goal
A single script, runnable on macOS and Windows, that exercises every TTS model against the frozen PyInstaller binary (not the dev server), captures per-model pass/fail and error messages, and exits non-zero if any model fails. Generation is strictly sequential — one model loaded at a time.
Test matrix (10 runs)
Derived from backend/backends/__init__.py:185-316. Each row maps to one POST /generate call.
| # | engine | model_size | profile kind | notes |
|---|---|---|---|---|
| 1 | qwen | 1.7B | cloned | reference audio required |
| 2 | qwen | 0.6B | cloned | |
| 3 | qwen_custom_voice | 1.7B | preset | preset_voice_id="Ryan" |
| 4 | qwen_custom_voice | 0.6B | preset | preset_voice_id="Ryan" |
| 5 | luxtts | — | cloned | English only |
| 6 | chatterbox | — | cloned | |
| 7 | chatterbox_turbo | — | cloned | English only |
| 8 | tada | 1B | cloned | tada-1b, English only |
| 9 | tada | 3B | cloned | tada-3b-ml, multilingual |
| 10 | kokoro | — | preset | preset_voice_id="af_heart" |
Cloned engines (1, 2, 5, 6, 7, 8, 9) share one profile created once with the reference WAV. Preset profiles are created separately, one for kokoro and one for qwen_custom_voice.
Language for every run: en (covers every engine's supported set).
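A minimal sketch of how the matrix could be encoded in the script so that --only/--skip can filter rows; the dict keys and the profile labels here are illustrative, not the script's actual names:

```python
# Hypothetical in-script encoding of the matrix above.
# "profile" names which of the three test profiles the row uses.
TEST_MATRIX = [
    {"engine": "qwen",              "model_size": "1.7B", "profile": "cloned"},
    {"engine": "qwen",              "model_size": "0.6B", "profile": "cloned"},
    {"engine": "qwen_custom_voice", "model_size": "1.7B", "profile": "preset_qwen_cv"},
    {"engine": "qwen_custom_voice", "model_size": "0.6B", "profile": "preset_qwen_cv"},
    {"engine": "luxtts",            "model_size": None,   "profile": "cloned"},
    {"engine": "chatterbox",        "model_size": None,   "profile": "cloned"},
    {"engine": "chatterbox_turbo",  "model_size": None,   "profile": "cloned"},
    {"engine": "tada",              "model_size": "1B",   "profile": "cloned"},
    {"engine": "tada",              "model_size": "3B",   "profile": "cloned"},
    {"engine": "kokoro",            "model_size": None,   "profile": "preset_kokoro"},
]
```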
End-to-end flow
1. Resolve paths → find binary, build if missing
2. Launch binary → spawn with --port --data-dir --parent-pid
3. Wait for /health → poll until status=="healthy" or 120s timeout
4. Create profiles → 1 cloned + 2 preset, via /profiles (+ /samples)
5. For each (engine, model_size) in matrix:
a. Check cache → GET /models/status → cached? short timeout : long
b. POST /generate → get generation_id
c. Stream /status → consume SSE until completed/failed/timeout
d. Record result → {engine, model_size, status, duration, error, elapsed}
6. Write results → JSON + Markdown table to ./results/
7. Shutdown binary → SIGTERM, fall back to kill, verify port freed
8. Exit code → 0 if all passed, 1 otherwise
Binary resolution
Search order — first hit wins:
| Platform | Path | Build type |
|---|---|---|
| macOS | backend/dist/voicebox-server-cuda/voicebox-server-cuda | onedir (CUDA, rarely on Mac) |
| macOS | backend/dist/voicebox-server | onefile (CPU) |
| Windows | backend\dist\voicebox-server-cuda\voicebox-server-cuda.exe | onedir (CUDA) |
| Windows | backend\dist\voicebox-server.exe | onefile (CPU) |
If none exist, run python backend/build_binary.py and wait for it to finish (this can take 5-20 minutes). Fail with a clear error if the build itself fails. The --skip-build flag errors out when no binary is found instead of auto-building one.
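A sketch of that search order, assuming a hypothetical resolve_binary() helper and that the script lives in backend/tests/:

```python
import platform
import subprocess
import sys
from pathlib import Path

BACKEND = Path(__file__).resolve().parents[1]  # assumes this file sits in backend/tests/

def resolve_binary(skip_build: bool = False) -> Path:
    """Return the first existing frozen binary, building one if allowed."""
    if platform.system() == "Windows":
        candidates = [
            BACKEND / "dist" / "voicebox-server-cuda" / "voicebox-server-cuda.exe",
            BACKEND / "dist" / "voicebox-server.exe",
        ]
    else:  # macOS
        candidates = [
            BACKEND / "dist" / "voicebox-server-cuda" / "voicebox-server-cuda",
            BACKEND / "dist" / "voicebox-server",
        ]
    for path in candidates:
        if path.exists():
            return path
    if skip_build:
        sys.exit("no frozen binary found and --skip-build was given")
    # Build can take 5-20 minutes; check=True surfaces a build failure as a clear error.
    subprocess.run([sys.executable, str(BACKEND / "build_binary.py")], check=True)
    return resolve_binary(skip_build=True)
```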
Spawn command
Mirrors Tauri's launch in tauri/src-tauri/src/main.rs:369-388:
<binary> --host 127.0.0.1 --port <free-port> --data-dir <tempdir> --parent-pid <test-pid>
- Port: bind to 0 first in Python to grab a free port, then pass that number (see the sketch after this list).
- Data dir: tempfile.mkdtemp(prefix="voicebox-e2e-"). Deleted after the run unless --keep-data-dir. Profiles and generated WAVs land here.
- Parent PID: current Python PID — ensures the backend dies if the test crashes (watchdog in server.py:102-224).
- stdout/stderr: tee to both a log file in ./results/server-<timestamp>.log and a rolling in-memory buffer. On model failure, the last 100 lines of the buffer are attached to that model's error record.
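A sketch of the port grab and spawn, assuming hypothetical pick_free_port()/launch_server() helpers; for brevity it writes stdout/stderr straight to the log file rather than also keeping the rolling in-memory tail:

```python
import os
import socket
import subprocess
import tempfile
from pathlib import Path

def pick_free_port() -> int:
    """Bind to port 0, let the OS choose, and pass that number to the binary."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def launch_server(binary: Path, log_path: Path) -> tuple[subprocess.Popen, int, str]:
    port = pick_free_port()
    data_dir = tempfile.mkdtemp(prefix="voicebox-e2e-")
    cmd = [
        str(binary),
        "--host", "127.0.0.1",
        "--port", str(port),
        "--data-dir", data_dir,
        "--parent-pid", str(os.getpid()),  # lets the backend watchdog exit if the test dies
    ]
    # Simplest variant: stream stdout/stderr straight to the log file.
    log = open(log_path, "w")
    proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc, port, data_dir
```

There is a small window between releasing the probe socket and the server binding the port; that is acceptable for a local test, and --port exists as an escape hatch.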
Profile setup
One cloned profile shared across all cloning engines:
POST /profiles
{
"name": "e2e-cloned",
"voice_type": "cloned",
"language": "en"
}
Then:
POST /profiles/{id}/samples (multipart)
file: <reference WAV>
reference_text: <exact transcription>
Two preset profiles:
POST /profiles
{ "name": "e2e-kokoro", "voice_type": "preset", "language": "en",
"preset_engine": "kokoro", "preset_voice_id": "af_heart" }
POST /profiles
{ "name": "e2e-qwen-cv", "voice_type": "preset", "language": "en",
"preset_engine": "qwen_custom_voice", "preset_voice_id": "Ryan" }
Generation request (per matrix row)
POST /generate
{
"profile_id": "<appropriate profile>",
"text": "The quick brown fox jumps over the lazy dog.",
"language": "en",
"engine": "<engine>",
"model_size": "<size or omitted>",
"seed": 42,
"normalize": true
}
The returned generation id feeds into the SSE status loop (GET /generate/{id}/status, routes/generations.py:190-227). The loop reads SSE lines until a payload with status in ("completed", "failed") arrives, then breaks.
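A sketch of one round trip, streaming the SSE endpoint with httpx directly rather than sseclient-py; the generation-id field name and the assumption that each data: line carries a JSON payload with a status field are not confirmed by the routes:

```python
import json
import httpx

def run_generation(base_url: str, profile_id: str, engine: str,
                   model_size: str | None, timeout: float) -> dict:
    """POST /generate, then follow the SSE status stream until a terminal state."""
    payload = {
        "profile_id": profile_id,
        "text": "The quick brown fox jumps over the lazy dog.",
        "language": "en",
        "engine": engine,
        "seed": 42,
        "normalize": True,
    }
    if model_size is not None:
        payload["model_size"] = model_size
    with httpx.Client(base_url=base_url, timeout=timeout) as client:
        resp = client.post("/generate", json=payload)
        resp.raise_for_status()
        body = resp.json()
        gen_id = body.get("generation_id") or body.get("id")  # field name assumed
        # Consume the SSE stream; each "data:" line is assumed to be a JSON payload.
        with client.stream("GET", f"/generate/{gen_id}/status") as stream:
            for line in stream.iter_lines():
                if not line.startswith("data:"):
                    continue
                event = json.loads(line[len("data:"):].strip())
                if event.get("status") in ("completed", "failed"):
                    return {"generation_id": gen_id, **event}
    return {"generation_id": gen_id, "status": "failed",
            "error": "stream ended without terminal status"}
```

The caller would wrap this in the per-row timeout, catching httpx.TimeoutException and recording the row as timeout rather than aborting the run.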
Timeout strategy (split)
Check GET /models/status for the target model before generation:
| Cached? | Per-model timeout | Rationale |
|---|---|---|
| Yes | 3 minutes | Inference only; generous for CPU builds |
| No | 20 minutes | First-run HF download up to 8 GB (tada-3b-ml) |
On timeout: cancel the SSE stream, mark the row timeout, and continue to the next row. Don't abort the whole run on one timeout.
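A sketch of the cache check that picks the per-row timeout; the shape of the /models/status payload (a models list with engine, model_size, and cached fields) is an assumption:

```python
import httpx

CACHED_TIMEOUT = 180      # 3 minutes: inference only
DOWNLOAD_TIMEOUT = 1200   # 20 minutes: first-run HF download, up to ~8 GB

def pick_timeout(base_url: str, engine: str, model_size: str | None) -> int:
    """Use the short timeout only if /models/status reports the target model as cached."""
    resp = httpx.get(f"{base_url}/models/status", timeout=30)
    resp.raise_for_status()
    for entry in resp.json().get("models", []):  # response shape assumed
        if entry.get("engine") == engine and entry.get("model_size") == model_size:
            return CACHED_TIMEOUT if entry.get("cached") else DOWNLOAD_TIMEOUT
    return DOWNLOAD_TIMEOUT  # unknown model: assume a cold download
```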
Result format
./results/e2e-<platform>-<arch>-<timestamp>.json:
{
"platform": "darwin-arm64",
"binary": "/abs/path/voicebox-server",
"binary_size_mb": 612,
"started_at": "2026-04-16T12:34:56Z",
"finished_at": "...",
"results": [
{
"engine": "qwen",
"model_size": "1.7B",
"status": "passed|failed|timeout",
"generation_id": "...",
"was_cached": true,
"elapsed_seconds": 12.4,
"audio_duration": 3.1,
"audio_path": "/tmp/.../gen.wav",
"error": null,
"server_log_tail": null
}
]
}
Companion ./results/e2e-<...>.md:
# Voicebox E2E — darwin-arm64 — 2026-04-16 12:34
| Engine | Size | Status | Elapsed | Error |
|---------------------|------|--------|---------|-------|
| qwen | 1.7B | PASS | 12.4s | |
| qwen | 0.6B | FAIL | 4.1s | CUDA OOM: ... |
...
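A sketch of the writer that emits both files; it assumes the per-row records already carry the fields shown in the JSON above, and the helper and label names are illustrative:

```python
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

STATUS_LABEL = {"passed": "PASS", "failed": "FAIL", "timeout": "TIMEOUT"}

def write_results(results: list[dict], out_dir: Path, binary: Path,
                  started_at: str, finished_at: str) -> None:
    """Write e2e-<platform>-<arch>-<timestamp>.json plus the Markdown companion."""
    plat = f"{platform.system().lower()}-{platform.machine().lower()}"
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "platform": plat,
        "binary": str(binary),
        "binary_size_mb": round(binary.stat().st_size / 1_048_576),  # onefile case; onedir needs a dir walk
        "started_at": started_at,
        "finished_at": finished_at,
        "results": results,
    }
    (out_dir / f"e2e-{plat}-{stamp}.json").write_text(json.dumps(payload, indent=2))

    lines = [
        f"# Voicebox E2E — {plat} — {stamp}",
        "",
        "| Engine | Size | Status | Elapsed | Error |",
        "|---|---|---|---|---|",
    ]
    for r in results:
        lines.append(
            f"| {r['engine']} | {r.get('model_size') or '—'} "
            f"| {STATUS_LABEL.get(r['status'], r['status'])} "
            f"| {r['elapsed_seconds']:.1f}s | {r.get('error') or ''} |"
        )
    (out_dir / f"e2e-{plat}-{stamp}.md").write_text("\n".join(lines) + "\n")
```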
CLI flags
python -m backend.tests.test_all_models_e2e [flags]
--binary PATH Use this binary instead of auto-detecting
--skip-build Error if no binary found (no auto-build)
--reference-wav PATH Reference audio (default: backend/tests/fixtures/reference_voice.wav)
--reference-text STR Transcription (default: read from fixtures/reference_voice.txt)
--only ENGINE[,...] Run only these engines (e.g. kokoro,qwen)
--skip ENGINE[,...] Skip these engines
--keep-data-dir Don't delete tempdir after run
--timeout-cached SEC Override 180
--timeout-download SEC Override 1200
--port N Override auto-picked port
--output-dir PATH Default: backend/tests/results/
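A sketch of the corresponding argparse wiring, with defaults as documented above:

```python
import argparse
from pathlib import Path

def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(
        description="Run every TTS model once against the frozen PyInstaller binary")
    p.add_argument("--binary", type=Path, help="use this binary instead of auto-detecting")
    p.add_argument("--skip-build", action="store_true",
                   help="error if no binary found (no auto-build)")
    p.add_argument("--reference-wav", type=Path,
                   default=Path("backend/tests/fixtures/reference_voice.wav"))
    p.add_argument("--reference-text",
                   help="transcription (default: read from fixtures/reference_voice.txt)")
    p.add_argument("--only", help="comma-separated engines to run, e.g. kokoro,qwen")
    p.add_argument("--skip", help="comma-separated engines to skip")
    p.add_argument("--keep-data-dir", action="store_true", help="don't delete tempdir after run")
    p.add_argument("--timeout-cached", type=int, default=180)
    p.add_argument("--timeout-download", type=int, default=1200)
    p.add_argument("--port", type=int, help="override the auto-picked port")
    p.add_argument("--output-dir", type=Path, default=Path("backend/tests/results"))
    return p.parse_args()
```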
File layout
backend/tests/
├── E2E_MODEL_TEST_DESIGN.md (this file)
├── test_all_models_e2e.py (main script, ~400-500 LoC)
├── fixtures/
│ ├── reference_voice.wav (user-provided, ~5-15s clean speech)
│ └── reference_voice.txt (exact transcription)
└── results/ (gitignored)
├── e2e-darwin-arm64-<ts>.json
├── e2e-darwin-arm64-<ts>.md
└── server-<ts>.log
The script uses only stdlib + httpx (or requests) + sseclient-py — all already in backend/requirements.txt. No pytest dependency, so it stays invocable as a single command on fresh checkouts.
Safety & cleanup
- Always kill the spawned binary in a try/finally. On Windows, taskkill /F /T the whole tree (Tauri does the same; see the sketch after this list).
- Verify the port is free on shutdown (the Tauri port-reuse check in main.rs:114-186 could otherwise pick up a ghost).
- Don't touch the user's HF cache by default — let the server use HF_HUB_CACHE / VOICEBOX_MODELS_DIR. Passing --isolated-cache would point both env vars at the tempdir for a true cold-start run (opt-in only; it would re-download every model each time).
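A sketch of the shutdown helper described in the first bullet; taskkill /F /T removes the whole tree on Windows, and the final connect check catches a ghost process still holding the port:

```python
import socket
import subprocess
import sys

def shutdown_server(proc: subprocess.Popen, port: int) -> None:
    """Terminate the spawned binary (whole tree on Windows) and verify the port is free."""
    try:
        if sys.platform == "win32":
            # Kill the entire process tree, mirroring Tauri's shutdown behaviour.
            subprocess.run(["taskkill", "/F", "/T", "/PID", str(proc.pid)], capture_output=True)
        else:
            proc.terminate()              # SIGTERM first
        proc.wait(timeout=15)
    except subprocess.TimeoutExpired:
        proc.kill()                       # fall back to a hard kill
        proc.wait(timeout=15)
    # Verify nothing is still listening on the test port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        if s.connect_ex(("127.0.0.1", port)) == 0:
            raise RuntimeError(f"port {port} still in use after shutdown")
```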
Non-goals
- Not validating audio quality (no WER, no waveform comparison). Pass = "endpoint returned completed and produced a non-empty WAV".
- Not testing STT (Whisper), effects chains, channels, or streaming endpoints.
- Not running on CI today — human-invoked on dev machines. CI integration is a follow-up once the script is stable.
- No model unload between runs — models stay loaded; server manages its own eviction.
- No version-drift check on the binary.
- No instruct parameter exercised on qwen_custom_voice runs.