first commit
This commit is contained in:
121
finetuning/README.md
Normal file
121
finetuning/README.md
Normal file
@@ -0,0 +1,121 @@
|
||||
## Fine Tuning Qwen3-TTS-12Hz-1.7B/0.6B-Base
|
||||
|
||||
The Qwen3-TTS-12Hz-1.7B/0.6B-Base model series currently supports single-speaker fine-tuning. Please run `pip install qwen-tts` first, then run the command below:
|
||||
|
||||
```
|
||||
git clone https://github.com/QwenLM/Qwen3-TTS.git
|
||||
cd Qwen3-TTS/finetuning
|
||||
```
|
||||
|
||||
Then follow the steps below to complete the entire fine-tuning workflow. Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.
|
||||
|
||||
### 1) Input JSONL format
|
||||
|
||||
Prepare your training file as a JSONL (one JSON object per line). Each line must contain:
|
||||
|
||||
- `audio`: path to the target training audio (wav)
|
||||
- `text`: transcript corresponding to `audio`
|
||||
- `ref_audio`: path to the reference speaker audio (wav)
|
||||
|
||||
Example:
|
||||
```jsonl
|
||||
{"audio":"./data/utt0001.wav","text":"其实我真的有发现,我是一个特别善于观察别人情绪的人。","ref_audio":"./data/ref.wav"}
|
||||
{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}
|
||||
```
|
||||
|
||||
`ref_audio` recommendation:
|
||||
- Strongly recommended: use the same `ref_audio` for all samples.
|
||||
- Keeping `ref_audio` identical across the dataset usually improves speaker consistency and stability during generation.
|
||||
|
||||
|
||||
### 2) Prepare data (extract `audio_codes`)
|
||||
|
||||
Convert `train_raw.jsonl` into a training JSONL that includes `audio_codes`:
|
||||
|
||||
```bash
|
||||
python prepare_data.py \
|
||||
--device cuda:0 \
|
||||
--tokenizer_model_path Qwen/Qwen3-TTS-Tokenizer-12Hz \
|
||||
--input_jsonl train_raw.jsonl \
|
||||
--output_jsonl train_with_codes.jsonl
|
||||
```
|
||||
|
||||
|
||||
### 3) Fine-tune
|
||||
|
||||
Run SFT using the prepared JSONL:
|
||||
|
||||
```bash
|
||||
python sft_12hz.py \
|
||||
--init_model_path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
|
||||
--output_model_path output \
|
||||
--train_jsonl train_with_codes.jsonl \
|
||||
--batch_size 32 \
|
||||
--lr 2e-6 \
|
||||
--num_epochs 10 \
|
||||
--speaker_name speaker_test
|
||||
```
|
||||
|
||||
Checkpoints will be written to:
|
||||
- `output/checkpoint-epoch-0`
|
||||
- `output/checkpoint-epoch-1`
|
||||
- `output/checkpoint-epoch-2`
|
||||
- ...
|
||||
|
||||
|
||||
### 4) Quick inference test
|
||||
|
||||
```python
|
||||
import torch
|
||||
import soundfile as sf
|
||||
from qwen_tts import Qwen3TTSModel
|
||||
|
||||
device = "cuda:0"
|
||||
tts = Qwen3TTSModel.from_pretrained(
|
||||
"output/checkpoint-epoch-2",
|
||||
device_map=device,
|
||||
dtype=torch.bfloat16,
|
||||
attn_implementation="flash_attention_2",
|
||||
)
|
||||
|
||||
wavs, sr = tts.generate_custom_voice(
|
||||
text="She said she would be here by noon.",
|
||||
speaker="speaker_test",
|
||||
)
|
||||
sf.write("output.wav", wavs[0], sr)
|
||||
```
|
||||
|
||||
### One-click shell script example
|
||||
|
||||
```bash
|
||||
#!/usr/bin/env bash
|
||||
set -e
|
||||
|
||||
DEVICE="cuda:0"
|
||||
TOKENIZER_MODEL_PATH="Qwen/Qwen3-TTS-Tokenizer-12Hz"
|
||||
INIT_MODEL_PATH="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
|
||||
|
||||
RAW_JSONL="train_raw.jsonl"
|
||||
TRAIN_JSONL="train_with_codes.jsonl"
|
||||
OUTPUT_DIR="output"
|
||||
|
||||
BATCH_SIZE=2
|
||||
LR=2e-5
|
||||
EPOCHS=3
|
||||
SPEAKER_NAME="speaker_1"
|
||||
|
||||
python prepare_data.py \
|
||||
--device ${DEVICE} \
|
||||
--tokenizer_model_path ${TOKENIZER_MODEL_PATH} \
|
||||
--input_jsonl ${RAW_JSONL} \
|
||||
--output_jsonl ${TRAIN_JSONL}
|
||||
|
||||
python sft_12hz.py \
|
||||
--init_model_path ${INIT_MODEL_PATH} \
|
||||
--output_model_path ${OUTPUT_DIR} \
|
||||
--train_jsonl ${TRAIN_JSONL} \
|
||||
--batch_size ${BATCH_SIZE} \
|
||||
--lr ${LR} \
|
||||
--num_epochs ${EPOCHS} \
|
||||
--speaker_name ${SPEAKER_NAME}
|
||||
```
|
||||
Reference in New Issue
Block a user