# OpenAI API Compatibility

**Status:** Planned for v0.2.0

**Issue:** [#10 OpenAI API compatibility](https://github.com/jamiepine/voicebox/issues/10)

## Overview

This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.

```mermaid
flowchart LR
    subgraph clients [External Clients]
        SDK[OpenAI SDK]
        Curl[curl / HTTP]
        Apps[Third-party Apps]
    end

    subgraph voicebox [Voicebox Server]
        OpenAI["/v1/audio/* endpoints"]
        TTS[TTSModel]
        Whisper[WhisperModel]
        Profiles[Voice Profiles]
    end

    SDK --> OpenAI
    Curl --> OpenAI
    Apps --> OpenAI
    OpenAI --> TTS
    OpenAI --> Whisper
    OpenAI --> Profiles
```

## Use Cases

- **OpenAI SDK users**: `openai.audio.speech.create()` works with Voicebox
- **LLM frameworks**: LangChain, AutoGen, etc. can use Voicebox for TTS
- **Shell scripts**: `curl` commands copy-pasted from OpenAI docs work
- **Existing integrations**: Any tool expecting OpenAI's API works without code changes

## Endpoints to Implement

### 1. `POST /v1/audio/speech` (TTS)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech

**Request:**

```json
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

**Response:** Audio file (mp3, wav, opus, aac, flac, pcm)

**Voice Mapping Strategy:**

- `voice` parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: `voice: "profile:uuid"` for explicit profile ID

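The `profile:uuid` syntax can be handled with a small parser before any database lookup. A minimal sketch (the function name and the mode/key tuple convention are illustrative, not part of the spec):

```python
def parse_voice_param(voice: str) -> tuple[str, str]:
    """Split the OpenAI `voice` value into a lookup mode and key.

    Returns ("id", <uuid>) for the "profile:<uuid>" syntax, otherwise
    ("name", <lowercased name>) for case-insensitive name matching.
    """
    if voice.startswith("profile:"):
        return ("id", voice[len("profile:"):])
    return ("name", voice.strip().lower())
```

This keeps the syntax decision out of the database code, so both endpoints and tests can exercise it directly.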
### 2. `POST /v1/audio/transcriptions` (Whisper)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription

**Request:** (multipart/form-data)

- `file`: Audio file
- `model`: "whisper-1"
- `language`: Optional language hint
- `response_format`: json, text, srt, verbose_json, vtt

**Response:**

```json
{
  "text": "Hello world!"
}
```

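For the non-JSON `response_format` values, the transcription must be rendered into subtitle formats. A hedged sketch of an SRT renderer, assuming the Whisper backend can provide segments as `(start_seconds, end_seconds, text)` tuples:

```python
def format_timestamp_srt(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render (start, end, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp_srt(start)} --> "
            f"{format_timestamp_srt(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

The `vtt` format differs mainly in its `WEBVTT` header and `.` millisecond separator, so it can share the same segment plumbing.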
## Implementation Details

### New File: `backend/openai_compat.py`

Create a dedicated module with an `APIRouter` for the OpenAI-compatible endpoints:

```python
from typing import Literal, Optional

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session

from .database import get_db  # assumed module path for the get_db dependency

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
```

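Step 4 of `create_speech` needs a `Content-Type` for each `response_format`. A small helper sketch; the mapping below is an assumption based on common MIME registrations, not something the OpenAI spec mandates:

```python
_MEDIA_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "opus": "audio/ogg",   # opus is typically delivered in an Ogg container
    "aac": "audio/aac",
    "flac": "audio/flac",
    "pcm": "audio/pcm",    # raw headerless samples
}


def media_type_for(response_format: str) -> str:
    """Look up the Content-Type for a response_format, defaulting to mp3."""
    return _MEDIA_TYPES.get(response_format, "audio/mpeg")
```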
### Voice Profile Resolution

Add helper in [backend/profiles.py](backend/profiles.py):

```python
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
```

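The priority order can be prototyped against plain records before wiring it to the ORM. A synchronous sketch, where the dataclass stands in for the real `VoiceProfile` model and the in-memory list stands in for the database query:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileStub:
    """Stand-in for the real VoiceProfile ORM model."""
    id: str
    name: str


def resolve_voice(voice: str, profiles: list[ProfileStub],
                  default_id: Optional[str] = None) -> Optional[ProfileStub]:
    """Apply the documented priority order to an in-memory profile list."""
    # 1. Exact profile name match (case-insensitive)
    for p in profiles:
        if p.name.lower() == voice.lower():
            return p
    # 2. Profile ID match for the "profile:<id>" syntax
    if voice.startswith("profile:"):
        wanted = voice[len("profile:"):]
        for p in profiles:
            if p.id == wanted:
                return p
    # 3. Default profile from config
    if default_id is not None:
        for p in profiles:
            if p.id == default_id:
                return p
    # 4. First available profile
    return profiles[0] if profiles else None
```

Keeping the selection logic pure like this makes the fallback chain easy to unit-test without a database session.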
### Audio Format Conversion

Add conversion utilities in [backend/utils/audio.py](backend/utils/audio.py):

```python
def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
```

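The mp3/opus/aac/flac paths need ffmpeg or pydub, but the `wav` and `pcm` branches can be served from the standard library alone. A sketch of those two branches, assuming mono float samples in [-1.0, 1.0] (plain lists stand in for the numpy array):

```python
import io
import struct
import wave


def floats_to_pcm16(samples) -> bytes:
    """Clamp float samples to [-1, 1] and pack as 16-bit little-endian PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)


def pcm16_to_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw mono 16-bit PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

`convert_audio_format` could return `floats_to_pcm16(...)` directly for `pcm` and wrap it for `wav`, shelling out to ffmpeg only for the compressed formats.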
### Configuration

Add to [backend/config.py](backend/config.py):

```python
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None  # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False  # Require API key validation
OPENAI_COMPAT_API_KEY = None  # If set, validate against this
```

### Integration with main.py

In [backend/main.py](backend/main.py), include the router:

```python
from . import openai_compat

# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
```

## Streaming Support (Future Enhancement)

The initial implementation returns complete audio. Streaming can be added later; this sketch assumes a `stream` field is added to `SpeechRequest`, and `generate_audio_chunks` is a hypothetical helper:

```python
@router.post("/speech")
async def create_speech(request: SpeechRequest):
    if request.stream:
        return StreamingResponse(
            generate_audio_chunks(request),
            media_type=f"audio/{request.response_format}",
        )
    ...
```

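Until the TTS model can emit audio incrementally, the hypothetical `generate_audio_chunks` can simply slice the finished buffer, which already lets clients begin playback before the download completes. A sketch:

```python
def iter_bytes(data: bytes, chunk_size: int = 64 * 1024):
    """Yield fixed-size chunks of an in-memory audio buffer.

    A placeholder for true incremental generation: real streaming would
    yield encoder output as the TTS model produces audio.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```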
## Testing

Example usage after implementation:

```bash
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3

# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
```

```python
# With the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!",
)
response.stream_to_file("output.mp3")
```

## Security Considerations

- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as existing `/generate` endpoint)

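The optional API key check should use a constant-time comparison so the key is not leakable via response timing. A sketch of the core check that a FastAPI dependency could wrap, following the config names from the Configuration section:

```python
import hmac
from typing import Optional


def api_key_is_valid(provided: Optional[str], expected: Optional[str],
                     require_auth: bool) -> bool:
    """Validate a presented key against the configured one.

    Mirrors OPENAI_COMPAT_REQUIRE_AUTH / OPENAI_COMPAT_API_KEY: when auth is
    disabled or no key is configured, every request passes.
    """
    if not require_auth or expected is None:
        return True
    if provided is None:
        return False
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(provided, expected)
```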
## Dependencies

- `pydub` or `ffmpeg-python` for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code