first commit

2026-04-24 19:34:41 +08:00
commit 4899d56d9a
31 changed files with 11866 additions and 0 deletions
--- a/finetuning/README.md
+++ b/finetuning/README.md
@@ -0,0 +1,121 @@
+## Fine Tuning Qwen3-TTS-12Hz-1.7B/0.6B-Base
+
+The Qwen3-TTS-12Hz-1.7B/0.6B-Base model series currently supports single-speaker fine-tuning. Please run `pip install qwen-tts` first, then run the command below:
+
+```
+git clone https://github.com/QwenLM/Qwen3-TTS.git
+cd Qwen3-TTS/finetuning
+```
+
+Then follow the steps below to complete the entire fine-tuning workflow. Multi-speaker fine-tuning and other advanced fine-tuning features will be supported in future releases.
+
+### 1) Input JSONL format
+
+Prepare your training file as a JSONL (one JSON object per line). Each line must contain:
+
+- `audio`: path to the target training audio (wav)
+- `text`: transcript corresponding to `audio`
+- `ref_audio`: path to the reference speaker audio (wav)
+
+Example:
+```jsonl
+{"audio":"./data/utt0001.wav","text":"其实我真的有发现，我是一个特别善于观察别人情绪的人。","ref_audio":"./data/ref.wav"}
+{"audio":"./data/utt0002.wav","text":"She said she would be here by noon.","ref_audio":"./data/ref.wav"}
+```
+
+`ref_audio` recommendation:
+- Strongly recommended: use the same `ref_audio` for all samples.
+- Keeping `ref_audio` identical across the dataset usually improves speaker consistency and stability during generation.
+
+
+### 2) Prepare data (extract `audio_codes`)
+
+Convert `train_raw.jsonl` into a training JSONL that includes `audio_codes`:
+
+```bash
+python prepare_data.py \
+  --device cuda:0 \
+  --tokenizer_model_path Qwen/Qwen3-TTS-Tokenizer-12Hz \
+  --input_jsonl train_raw.jsonl \
+  --output_jsonl train_with_codes.jsonl
+```
+
+
+### 3) Fine-tune
+
+Run SFT using the prepared JSONL:
+
+```bash
+python sft_12hz.py \
+  --init_model_path Qwen/Qwen3-TTS-12Hz-1.7B-Base \
+  --output_model_path output \
+  --train_jsonl train_with_codes.jsonl \
+  --batch_size 32 \
+  --lr 2e-6 \
+  --num_epochs 10 \
+  --speaker_name speaker_test
+```
+
+Checkpoints will be written to:
+- `output/checkpoint-epoch-0`
+- `output/checkpoint-epoch-1`
+- `output/checkpoint-epoch-2`
+- ...
+
+
+### 4) Quick inference test
+
+```python
+import torch
+import soundfile as sf
+from qwen_tts import Qwen3TTSModel
+
+device = "cuda:0"
+tts = Qwen3TTSModel.from_pretrained(
+    "output/checkpoint-epoch-2",
+    device_map=device,
+    dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+)
+
+wavs, sr = tts.generate_custom_voice(
+    text="She said she would be here by noon.",
+    speaker="speaker_test",
+)
+sf.write("output.wav", wavs[0], sr)
+```
+
+### One-click shell script example
+
+```bash
+#!/usr/bin/env bash
+set -e
+
+DEVICE="cuda:0"
+TOKENIZER_MODEL_PATH="Qwen/Qwen3-TTS-Tokenizer-12Hz"
+INIT_MODEL_PATH="Qwen/Qwen3-TTS-12Hz-1.7B-Base"
+
+RAW_JSONL="train_raw.jsonl"
+TRAIN_JSONL="train_with_codes.jsonl"
+OUTPUT_DIR="output"
+
+BATCH_SIZE=2
+LR=2e-5
+EPOCHS=3
+SPEAKER_NAME="speaker_1"
+
+python prepare_data.py \
+  --device ${DEVICE} \
+  --tokenizer_model_path ${TOKENIZER_MODEL_PATH} \
+  --input_jsonl ${RAW_JSONL} \
+  --output_jsonl ${TRAIN_JSONL}
+
+python sft_12hz.py \
+  --init_model_path ${INIT_MODEL_PATH} \
+  --output_model_path ${OUTPUT_DIR} \
+  --train_jsonl ${TRAIN_JSONL} \
+  --batch_size ${BATCH_SIZE} \
+  --lr ${LR} \
+  --num_epochs ${EPOCHS} \
+  --speaker_name ${SPEAKER_NAME}
+```