# OpenAI API Compatibility

**Status:** Planned for v0.2.0

**Issue:** [#10 OpenAI API compatibility](https://github.com/jamiepine/voicebox/issues/10)

## Overview

This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.

```mermaid
flowchart LR
    subgraph clients [External Clients]
        SDK[OpenAI SDK]
        Curl[curl / HTTP]
        Apps[Third-party Apps]
    end

    subgraph voicebox [Voicebox Server]
        OpenAI["/v1/audio/* endpoints"]
        TTS[TTSModel]
        Whisper[WhisperModel]
        Profiles[Voice Profiles]
    end

    SDK --> OpenAI
    Curl --> OpenAI
    Apps --> OpenAI
    OpenAI --> TTS
    OpenAI --> Whisper
    OpenAI --> Profiles
```

## Use Cases

- **OpenAI SDK users**: `openai.audio.speech.create()` works with Voicebox
- **LLM frameworks**: LangChain, AutoGen, etc. can use Voicebox for TTS
- **Shell scripts**: `curl` commands copy-pasted from OpenAI docs work
- **Existing integrations**: Any tool expecting OpenAI's API works without code changes

## Endpoints to Implement

### 1. `POST /v1/audio/speech` (TTS)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech

**Request:**

```json
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

**Response:** Audio file (mp3, wav, opus, aac, flac, pcm)

**Voice Mapping Strategy:**

- `voice` parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: `voice: "profile:uuid"` for explicit profile ID

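The `profile:uuid` syntax can be handled with a small parser before any database lookup. A minimal sketch (the function name and the mode/key tuple convention are illustrative, not part of the spec):

```python
def parse_voice_param(voice: str) -> tuple[str, str]:
    """Split the OpenAI `voice` value into a lookup mode and key.

    Returns ("id", <uuid>) for the "profile:<uuid>" syntax, otherwise
    ("name", <lowercased name>) for case-insensitive name matching.
    """
    if voice.startswith("profile:"):
        return ("id", voice[len("profile:"):])
    return ("name", voice.strip().lower())
```

This keeps the syntax decision out of the database code, so both endpoints and tests can exercise it directly.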
### 2. `POST /v1/audio/transcriptions` (Whisper)

OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription

**Request:** (multipart/form-data)

- `file`: Audio file
- `model`: "whisper-1"
- `language`: Optional language hint
- `response_format`: json, text, srt, verbose_json, vtt

**Response:**

```json
{
  "text": "Hello world!"
}
```

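For the non-JSON `response_format` values, the transcription must be rendered into subtitle formats. A hedged sketch of an SRT renderer, assuming the Whisper backend can provide segments as `(start_seconds, end_seconds, text)` tuples:

```python
def format_timestamp_srt(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render (start, end, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp_srt(start)} --> "
            f"{format_timestamp_srt(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

The `vtt` format differs mainly in its `WEBVTT` header and `.` millisecond separator, so it can share the same segment plumbing.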
## Implementation Details

### New File: `backend/openai_compat.py`

Create a dedicated module with an `APIRouter` for the OpenAI-compatible endpoints:

```python
from typing import Literal, Optional

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session

from .database import get_db  # assumed module path for the get_db dependency

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
```

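Step 4 of `create_speech` needs a `Content-Type` for each `response_format`. A small helper sketch; the mapping below is an assumption based on common MIME registrations, not something the OpenAI spec mandates:

```python
_MEDIA_TYPES = {
    "mp3": "audio/mpeg",
    "wav": "audio/wav",
    "opus": "audio/ogg",   # opus is typically delivered in an Ogg container
    "aac": "audio/aac",
    "flac": "audio/flac",
    "pcm": "audio/pcm",    # raw headerless samples
}


def media_type_for(response_format: str) -> str:
    """Look up the Content-Type for a response_format, defaulting to mp3."""
    return _MEDIA_TYPES.get(response_format, "audio/mpeg")
```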
### Voice Profile Resolution

Add helper in [backend/profiles.py](backend/profiles.py):

```python
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
```

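The priority order can be prototyped against plain records before wiring it to the ORM. A synchronous sketch, where the dataclass stands in for the real `VoiceProfile` model and the in-memory list stands in for the database query:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileStub:
    """Stand-in for the real VoiceProfile ORM model."""
    id: str
    name: str


def resolve_voice(voice: str, profiles: list[ProfileStub],
                  default_id: Optional[str] = None) -> Optional[ProfileStub]:
    """Apply the documented priority order to an in-memory profile list."""
    # 1. Exact profile name match (case-insensitive)
    for p in profiles:
        if p.name.lower() == voice.lower():
            return p
    # 2. Profile ID match for the "profile:<id>" syntax
    if voice.startswith("profile:"):
        wanted = voice[len("profile:"):]
        for p in profiles:
            if p.id == wanted:
                return p
    # 3. Default profile from config
    if default_id is not None:
        for p in profiles:
            if p.id == default_id:
                return p
    # 4. First available profile
    return profiles[0] if profiles else None
```

Keeping the selection logic pure like this makes the fallback chain easy to unit-test without a database session.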
### Audio Format Conversion

Add conversion utilities in [backend/utils/audio.py](backend/utils/audio.py):

```python
def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
```

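The mp3/opus/aac/flac paths need ffmpeg or pydub, but the `wav` and `pcm` branches can be served from the standard library alone. A sketch of those two branches, assuming mono float samples in [-1.0, 1.0] (plain lists stand in for the numpy array):

```python
import io
import struct
import wave


def floats_to_pcm16(samples) -> bytes:
    """Clamp float samples to [-1, 1] and pack as 16-bit little-endian PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return b"".join(struct.pack("<h", int(s * 32767)) for s in clamped)


def pcm16_to_wav(pcm: bytes, sample_rate: int) -> bytes:
    """Wrap raw mono 16-bit PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()
```

`convert_audio_format` could return `floats_to_pcm16(...)` directly for `pcm` and wrap it for `wav`, shelling out to ffmpeg only for the compressed formats.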
### Configuration

Add to [backend/config.py](backend/config.py):

```python
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None  # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False  # Require API key validation
OPENAI_COMPAT_API_KEY = None  # If set, validate against this
```

### Integration with main.py

In [backend/main.py](backend/main.py), include the router:

```python
from . import openai_compat

# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
```

## Streaming Support (Future Enhancement)

The initial implementation returns complete audio. Streaming can be added later; this sketch assumes a `stream` field is added to `SpeechRequest`, and `generate_audio_chunks` is a hypothetical helper:

```python
@router.post("/speech")
async def create_speech(request: SpeechRequest):
    if request.stream:
        return StreamingResponse(
            generate_audio_chunks(request),
            media_type=f"audio/{request.response_format}",
        )
    ...
```

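Until the TTS model can emit audio incrementally, the hypothetical `generate_audio_chunks` can simply slice the finished buffer, which already lets clients begin playback before the download completes. A sketch:

```python
def iter_bytes(data: bytes, chunk_size: int = 64 * 1024):
    """Yield fixed-size chunks of an in-memory audio buffer.

    A placeholder for true incremental generation: real streaming would
    yield encoder output as the TTS model produces audio.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]
```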
## Testing

Example usage after implementation:

```bash
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3

# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
```

```python
# With the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!",
)
response.stream_to_file("output.mp3")
```

## Security Considerations

- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as existing `/generate` endpoint)

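The optional API key check should use a constant-time comparison so the key is not leakable via response timing. A sketch of the core check that a FastAPI dependency could wrap, following the config names from the Configuration section:

```python
import hmac
from typing import Optional


def api_key_is_valid(provided: Optional[str], expected: Optional[str],
                     require_auth: bool) -> bool:
    """Validate a presented key against the configured one.

    Mirrors OPENAI_COMPAT_REQUIRE_AUTH / OPENAI_COMPAT_API_KEY: when auth is
    disabled or no key is configured, every request passes.
    """
    if not require_auth or expected is None:
        return True
    if provided is None:
        return False
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(provided, expected)
```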
## Dependencies

- `pydub` or `ffmpeg-python` for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code