# OpenAI API Compatibility

**Status:** Planned for v0.2.0

**Issue:** [#10 OpenAI API compatibility](https://github.com/jamiepine/voicebox/issues/10)

## Overview

This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.

```mermaid
flowchart LR
    subgraph clients [External Clients]
        SDK[OpenAI SDK]
        Curl[curl / HTTP]
        Apps[Third-party Apps]
    end

    subgraph voicebox [Voicebox Server]
        OpenAI["/v1/audio/* endpoints"]
        TTS[TTSModel]
        Whisper[WhisperModel]
        Profiles[Voice Profiles]
    end

    SDK --> OpenAI
    Curl --> OpenAI
    Apps --> OpenAI
    OpenAI --> TTS
    OpenAI --> Whisper
    OpenAI --> Profiles
```

## Use Cases

- **OpenAI SDK users**: `openai.audio.speech.create()` works against Voicebox
- **LLM frameworks**: LangChain, AutoGen, and similar frameworks can use Voicebox for TTS
- **Shell scripts**: `curl` commands copy-pasted from the OpenAI docs work unchanged
- **Existing integrations**: any tool expecting OpenAI's API works without code changes

## Endpoints to Implement

### 1. `POST /v1/audio/speech` (TTS)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech

**Request:**

```json
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

**Response:** Audio file (mp3, wav, opus, aac, flac, pcm)

**Voice Mapping Strategy:**

- `voice` parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: `voice: "profile:uuid"` for explicit profile ID

### 2. `POST /v1/audio/transcriptions` (Whisper)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription

**Request:** (multipart/form-data)

- `file`: Audio file
- `model`: "whisper-1"
- `language`: Optional language hint
- `response_format`: json, text, srt, verbose_json, vtt

**Response:**

```json
{
  "text": "Hello world!"
}
```
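
The non-JSON response formats are plain-text renderings of the same transcription. As a rough illustration (not the final implementation), assuming the existing WhisperModel can return timed segments with `start`, `end`, and `text` fields, srt output could be rendered like this:

```python
# Hypothetical helper: renders Whisper segments as SRT. Assumes each
# segment is a dict with "start"/"end" in seconds and a "text" string.
def segments_to_srt(segments: list[dict]) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps look like 00:01:02,345
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)
```

vtt differs only in the header line and `.` instead of `,` in timestamps, so it can share the same helper.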

## Implementation Details

### New File: `backend/openai_compat.py`

Create a dedicated module with an APIRouter for OpenAI-compatible endpoints:

```python
from typing import Literal, Optional

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session

from .database import get_db  # existing session dependency (module path may differ)

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
```
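
To make the four steps above concrete, here is one possible shape for the `/speech` handler. This is a sketch, not the final implementation: it leans on the `resolve_voice_for_openai` and `convert_audio_format` helpers described in the next two sections, and `tts_model.generate()` is a hypothetical stand-in for the existing TTSModel interface.

```python
from fastapi import Response

# Sketch only: tts_model.generate() and its return shape are assumptions,
# not the actual TTSModel interface.
@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    profile = await resolve_voice_for_openai(request.voice, db)
    if profile is None:
        raise HTTPException(status_code=400, detail=f"Unknown voice: {request.voice}")

    # Assumed interface: returns (float32 samples, sample rate).
    audio, sample_rate = tts_model.generate(
        text=request.input, profile=profile, speed=request.speed
    )

    data = convert_audio_format(audio, sample_rate, request.response_format)
    media_type = (
        "audio/mpeg" if request.response_format == "mp3"
        else f"audio/{request.response_format}"
    )
    return Response(content=data, media_type=media_type)
```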

### Voice Profile Resolution

Add helper in [backend/profiles.py](backend/profiles.py):

```python
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
```
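
One possible implementation of that priority order, sketched under the assumption that `VoiceProfile` is a SQLAlchemy model with `id` and `name` columns and that `config` is `backend/config.py`:

```python
from sqlalchemy import func

# Sketch: VoiceProfile column names and the config import are assumptions.
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    # 1. Exact profile name match (case-insensitive)
    profile = (
        db.query(VoiceProfile)
        .filter(func.lower(VoiceProfile.name) == voice.lower())
        .first()
    )
    if profile:
        return profile

    # 2. Explicit profile ID via the "profile:<id>" special syntax
    if voice.startswith("profile:"):
        return db.get(VoiceProfile, voice.removeprefix("profile:"))

    # 3. Default profile from config (accepts ID or name)...
    default = config.OPENAI_COMPAT_DEFAULT_VOICE
    if default:
        profile = (
            db.query(VoiceProfile)
            .filter(
                (VoiceProfile.id == default)
                | (func.lower(VoiceProfile.name) == default.lower())
            )
            .first()
        )
        if profile:
            return profile

    # 4. ...else fall back to the first available profile
    return db.query(VoiceProfile).first()
```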

### Audio Format Conversion

Add conversion utilities in [backend/utils/audio.py](backend/utils/audio.py):

```python
import numpy as np


def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
```
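
As one way to fill that in, a minimal sketch built on `pydub` (which shells out to ffmpeg for compressed formats). It assumes `audio` is mono float32 in [-1, 1]; the opus/aac container names are assumptions that would need verifying against the installed ffmpeg:

```python
import io

import numpy as np
from pydub import AudioSegment

# pydub/ffmpeg container name for each requested format (assumed mapping;
# "adts" is the usual raw-AAC container name in ffmpeg).
_EXPORT_FORMATS = {"mp3": "mp3", "wav": "wav", "flac": "flac", "opus": "opus", "aac": "adts"}


def convert_audio_format(audio: np.ndarray, sample_rate: int, target_format: str) -> bytes:
    # Convert float32 [-1, 1] samples to 16-bit little-endian PCM bytes.
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    if target_format == "pcm":
        return pcm16  # raw samples, no container

    segment = AudioSegment(data=pcm16, sample_width=2, frame_rate=sample_rate, channels=1)
    buf = io.BytesIO()
    segment.export(buf, format=_EXPORT_FORMATS[target_format])
    return buf.getvalue()
```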

### Configuration

Add to [backend/config.py](backend/config.py):

```python
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None  # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False  # Require API key validation
OPENAI_COMPAT_API_KEY = None  # If set, validate against this
```
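
When `OPENAI_COMPAT_REQUIRE_AUTH` is enabled, the key check can live in a FastAPI dependency on the router. A sketch (the dependency name and error shape are illustrative, not fixed):

```python
from typing import Optional

from fastapi import Header, HTTPException

from . import config


async def verify_api_key(authorization: Optional[str] = Header(None)) -> None:
    # OpenAI clients send "Authorization: Bearer <key>".
    if not config.OPENAI_COMPAT_REQUIRE_AUTH:
        return
    token = (authorization or "").removeprefix("Bearer ").strip()
    if not token or token != config.OPENAI_COMPAT_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
```

The router could then opt in with `APIRouter(prefix="/v1/audio", dependencies=[Depends(verify_api_key)])`.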

### Integration with main.py

In [backend/main.py](backend/main.py), include the router:

```python
from . import openai_compat

# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
```

## Streaming Support (Future Enhancement)

The initial implementation returns complete audio. Streaming can be added later by giving `SpeechRequest` a `stream: bool = False` field:

```python
@router.post("/speech")
async def create_speech(request: SpeechRequest):
    if request.stream:  # requires the `stream` field on SpeechRequest
        return StreamingResponse(
            generate_audio_chunks(request),
            media_type=f"audio/{request.response_format}",
        )
    ...
```
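
`generate_audio_chunks` does not exist yet. The simplest version, sketched below, encodes the full clip and yields fixed-size slices; that streams bytes to the client but still waits for synthesis to finish. True incremental streaming would depend on the TTS model exposing chunked synthesis.

```python
# Sketch: encode everything up front, then stream slices. run_tts() is a
# hypothetical stand-in for the existing TTSModel call.
async def generate_audio_chunks(request: SpeechRequest, chunk_size: int = 64 * 1024):
    audio, sample_rate = await run_tts(request)  # assumed helper
    data = convert_audio_format(audio, sample_rate, request.response_format)
    for offset in range(0, len(data), chunk_size):
        yield data[offset : offset + chunk_size]
```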

## Testing

Example usage after implementation:

```bash
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3

# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
```

```python
# With the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!",
)
response.stream_to_file("output.mp3")
```

## Security Considerations

- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as existing `/generate` endpoint)

## Dependencies

- `pydub` or `ffmpeg-python` for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code