# OpenAI API Compatibility

**Status:** Planned for v0.2.0

**Issue:** [#10 OpenAI API compatibility](https://github.com/jamiepine/voicebox/issues/10)

## Overview

This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.

```mermaid
flowchart LR
    subgraph clients [External Clients]
        SDK[OpenAI SDK]
        Curl[curl / HTTP]
        Apps[Third-party Apps]
    end

    subgraph voicebox [Voicebox Server]
        OpenAI["/v1/audio/* endpoints"]
        TTS[TTSModel]
        Whisper[WhisperModel]
        Profiles[Voice Profiles]
    end

    SDK --> OpenAI
    Curl --> OpenAI
    Apps --> OpenAI
    OpenAI --> TTS
    OpenAI --> Whisper
    OpenAI --> Profiles
```

## Use Cases

- **OpenAI SDK users**: `openai.audio.speech.create()` works against Voicebox
- **LLM frameworks**: LangChain, AutoGen, and similar frameworks can use Voicebox for TTS
- **Shell scripts**: `curl` commands copy-pasted from the OpenAI docs work unchanged
- **Existing integrations**: any tool expecting OpenAI's API works without code changes

## Endpoints to Implement

### 1. `POST /v1/audio/speech` (TTS)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech

**Request:**

```json
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

**Response:** Audio file (mp3, wav, opus, aac, flac, pcm)

**Voice Mapping Strategy:**

- `voice` parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: `voice: "profile:uuid"` for explicit profile ID

### 2. `POST /v1/audio/transcriptions` (Whisper)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription

**Request:** (multipart/form-data)

- `file`: Audio file
- `model`: "whisper-1"
- `language`: Optional language hint
- `response_format`: json, text, srt, verbose_json, vtt

**Response:**

```json
{
  "text": "Hello world!"
}
```
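
The non-JSON response formats are plain-text renderings of the same transcription. As a rough illustration (not the final implementation), assuming the existing WhisperModel can return timed segments with `start`, `end`, and `text` fields, srt output could be rendered like this:

```python
# Hypothetical helper: renders Whisper segments as SRT. Assumes each
# segment is a dict with "start"/"end" in seconds and a "text" string.
def segments_to_srt(segments: list[dict]) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps look like 00:01:02,345
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)
```

vtt differs only in the header line and `.` instead of `,` in timestamps, so it can share the same helper.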

## Implementation Details

### New File: `backend/openai_compat.py`

Create a dedicated module with an APIRouter for OpenAI-compatible endpoints:

```python
from typing import Literal, Optional

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session

from .database import get_db  # existing session dependency (module path may differ)

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
```
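
To make the four steps above concrete, here is one possible shape for the `/speech` handler. This is a sketch, not the final implementation: it leans on the `resolve_voice_for_openai` and `convert_audio_format` helpers described in the next two sections, and `tts_model.generate()` is a hypothetical stand-in for the existing TTSModel interface.

```python
from fastapi import Response

# Sketch only: tts_model.generate() and its return shape are assumptions,
# not the actual TTSModel interface.
@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    profile = await resolve_voice_for_openai(request.voice, db)
    if profile is None:
        raise HTTPException(status_code=400, detail=f"Unknown voice: {request.voice}")

    # Assumed interface: returns (float32 samples, sample rate).
    audio, sample_rate = tts_model.generate(
        text=request.input, profile=profile, speed=request.speed
    )

    data = convert_audio_format(audio, sample_rate, request.response_format)
    media_type = (
        "audio/mpeg" if request.response_format == "mp3"
        else f"audio/{request.response_format}"
    )
    return Response(content=data, media_type=media_type)
```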

### Voice Profile Resolution

Add helper in [backend/profiles.py](backend/profiles.py):

```python
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
```
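
One possible implementation of that priority order, sketched under the assumption that `VoiceProfile` is a SQLAlchemy model with `id` and `name` columns and that `config` is `backend/config.py`:

```python
from sqlalchemy import func

# Sketch: VoiceProfile column names and the config import are assumptions.
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    # 1. Exact profile name match (case-insensitive)
    profile = (
        db.query(VoiceProfile)
        .filter(func.lower(VoiceProfile.name) == voice.lower())
        .first()
    )
    if profile:
        return profile

    # 2. Explicit profile ID via the "profile:<id>" special syntax
    if voice.startswith("profile:"):
        return db.get(VoiceProfile, voice.removeprefix("profile:"))

    # 3. Default profile from config (accepts ID or name)...
    default = config.OPENAI_COMPAT_DEFAULT_VOICE
    if default:
        profile = (
            db.query(VoiceProfile)
            .filter(
                (VoiceProfile.id == default)
                | (func.lower(VoiceProfile.name) == default.lower())
            )
            .first()
        )
        if profile:
            return profile

    # 4. ...else fall back to the first available profile
    return db.query(VoiceProfile).first()
```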

### Audio Format Conversion

Add conversion utilities in [backend/utils/audio.py](backend/utils/audio.py):

```python
import numpy as np


def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
```
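
As one way to fill that in, a minimal sketch built on `pydub` (which shells out to ffmpeg for compressed formats). It assumes `audio` is mono float32 in [-1, 1]; the opus/aac container names are assumptions that would need verifying against the installed ffmpeg:

```python
import io

import numpy as np
from pydub import AudioSegment

# pydub/ffmpeg container name for each requested format (assumed mapping;
# "adts" is the usual raw-AAC container name in ffmpeg).
_EXPORT_FORMATS = {"mp3": "mp3", "wav": "wav", "flac": "flac", "opus": "opus", "aac": "adts"}


def convert_audio_format(audio: np.ndarray, sample_rate: int, target_format: str) -> bytes:
    # Convert float32 [-1, 1] samples to 16-bit little-endian PCM bytes.
    pcm16 = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    if target_format == "pcm":
        return pcm16  # raw samples, no container

    segment = AudioSegment(data=pcm16, sample_width=2, frame_rate=sample_rate, channels=1)
    buf = io.BytesIO()
    segment.export(buf, format=_EXPORT_FORMATS[target_format])
    return buf.getvalue()
```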

### Configuration

Add to [backend/config.py](backend/config.py):

```python
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None  # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False  # Require API key validation
OPENAI_COMPAT_API_KEY = None  # If set, validate against this
```
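
When `OPENAI_COMPAT_REQUIRE_AUTH` is enabled, the key check can live in a FastAPI dependency on the router. A sketch (the dependency name and error shape are illustrative, not fixed):

```python
from typing import Optional

from fastapi import Header, HTTPException

from . import config


async def verify_api_key(authorization: Optional[str] = Header(None)) -> None:
    # OpenAI clients send "Authorization: Bearer <key>".
    if not config.OPENAI_COMPAT_REQUIRE_AUTH:
        return
    token = (authorization or "").removeprefix("Bearer ").strip()
    if not token or token != config.OPENAI_COMPAT_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
```

The router could then opt in with `APIRouter(prefix="/v1/audio", dependencies=[Depends(verify_api_key)])`.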

### Integration with main.py

In [backend/main.py](backend/main.py), include the router:

```python
from . import openai_compat

# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
```

## Streaming Support (Future Enhancement)

The initial implementation returns complete audio. Streaming can be added later by giving `SpeechRequest` a `stream: bool = False` field:

```python
@router.post("/speech")
async def create_speech(request: SpeechRequest):
    if request.stream:  # requires the `stream` field on SpeechRequest
        return StreamingResponse(
            generate_audio_chunks(request),
            media_type=f"audio/{request.response_format}",
        )
    ...
```
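
`generate_audio_chunks` does not exist yet. The simplest version, sketched below, encodes the full clip and yields fixed-size slices; that streams bytes to the client but still waits for synthesis to finish. True incremental streaming would depend on the TTS model exposing chunked synthesis.

```python
# Sketch: encode everything up front, then stream slices. run_tts() is a
# hypothetical stand-in for the existing TTSModel call.
async def generate_audio_chunks(request: SpeechRequest, chunk_size: int = 64 * 1024):
    audio, sample_rate = await run_tts(request)  # assumed helper
    data = convert_audio_format(audio, sample_rate, request.response_format)
    for offset in range(0, len(data), chunk_size):
        yield data[offset : offset + chunk_size]
```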

## Testing

Example usage after implementation:

```bash
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3

# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
```

```python
# With the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!",
)
response.stream_to_file("output.mp3")
```

## Security Considerations

- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as existing `/generate` endpoint)

## Dependencies

- `pydub` or `ffmpeg-python` for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code