## Motivation
Qwen3-TTS was initially supported in PR #895 with offline inference; online serving was added in PR #968 (merged Jan 27). To make Qwen3-TTS production-ready, we need to complete the remaining optimization work by the end of February 2026.

Current Status (Feb 5):
| Feature | PR | Status | Priority | Owner |
|---|---|---|---|---|
| Offline inference | #895 | ✅ Merged | - | - |
| Online serving | #968 | ✅ Merged | - | @linyueqian |
| Disaggregated pipeline | #1161 | 🚧 In Progress | P0 | @Sy0307 @gcanlin |
| CUDA Graph acceleration | #1205 | 🚧 In Progress | P0 | @xulusjb @tsdocode |
| Streaming audio output | #1189 | 🚧 In Progress | P0 | @gerayking |
| Voice upload API | #1201 | 🚧 In Progress | P0 | @zhaotyer |
| E2E tests | #1206 | 🚧 In Progress | P0 | @linyueqian |
| Streaming text input | - | 📋 Planned | P1 | - |
| Model-specific params | - | 📋 Planned | P1 | - |
| Gradio demo | - | 📋 Planned | P1 | - |
| TTS benchmark | - | 📋 Planned | P1 | - |
Production-Ready Criteria:
- RTF < 1.0 for real-time synthesis
- First chunk latency < 200ms for streaming mode
- Stable 2-stage deployment for resource optimization
- Streaming output for interactive use cases
- Benchmark tooling for performance validation
Related Infrastructure:
The following PRs provide context and shared infrastructure:
| PR | Description | Relevance |
|---|---|---|
| #1151 | Refactor async chunk for Thinker/Talker | Async chunk pattern; need similar TTS benchmark |
| #962 | Async chunk design docs (Qwen3-Omni) | Architecture reference |
| #986 | Streaming input from vLLM | Streaming input pattern for TTS |
| #1109 | Benchmark audio timing fix | Benchmark infrastructure |
Note on Async Infrastructure:
- The current async put/get will be refactored into `OmniChunkManager`; Qwen3-TTS streaming should align with this new abstraction (we can land a version based on the current main branch and adapt to the changes later)
- The Thinker/Talker async chunk ([Refactor] Refactor async chunk and fix the shape mismatch issue #1151) serves as the reference implementation
## Proposed Change
### 1 Work Items
#### 1.1 Disaggregated Inference Pipeline (#1161) - P0
Separate Qwen3-TTS into two stages for flexible deployment:
Stage 0: Talker (AR Model)
- Generates codec tokens
- Compute-intensive, benefits from large GPU
Stage 1: SpeechTokenizer (Code2Wav)
- Decodes tokens to audio
- Can run on smaller GPU or CPU
Benefits:
- Independent scaling of AR and decoder
- Better GPU utilization
- Foundation for streaming output
Status: WIP - basic implementation done, testing in progress.
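The intended split can be sketched as below. All names here (`TalkerStage`, `SpeechTokenizerStage`, `synthesize`, `TalkerOutput`) are hypothetical placeholders, not the #1161 implementation; the point is that the two stages communicate only through a codec-token handoff, which is what makes separate deployment possible.

```python
from dataclasses import dataclass

@dataclass
class TalkerOutput:
    """Handoff between the two stages: a sequence of discrete codec tokens."""
    codec_tokens: list

class TalkerStage:
    """Stage 0: autoregressive Talker producing codec tokens (GPU-heavy)."""
    def generate(self, text: str) -> TalkerOutput:
        # Placeholder: a real deployment runs the AR model here.
        return TalkerOutput(codec_tokens=list(range(len(text))))

class SpeechTokenizerStage:
    """Stage 1: Code2Wav decoder turning codec tokens into waveform samples."""
    def decode(self, out: TalkerOutput) -> list:
        # Placeholder: a real deployment runs the vocoder here.
        return [t * 0.1 for t in out.codec_tokens]

def synthesize(text: str) -> list:
    # Because the stages interact only through TalkerOutput, they can live on
    # different devices (large GPU for the Talker, smaller GPU or CPU for the
    # decoder) and scale independently.
    talker, decoder = TalkerStage(), SpeechTokenizerStage()
    return decoder.decode(talker.generate(text))
```

The narrow interface between the stages is also what later enables chunk-by-chunk streaming: Stage 1 can start decoding as soon as the first codec tokens arrive.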
#### 1.2 CUDA Graph Acceleration (#1205) - P0
Enable CUDA Graph for SpeechTokenizer decoder to reduce kernel launch overhead.
Implementation:
- `CUDAGraphDecoderWrapper` class for graph capture and replay
- Captured graph sizes: [25, 50, 100, 150, 200, 250, 300]
- Auto-fallback to eager execution for unsupported sizes
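The size-bucketing and eager fallback can be illustrated with a small selection helper (hypothetical; the real logic lives in `CUDAGraphDecoderWrapper` in #1205). The assumption here is that an input is padded up to the smallest captured size that fits, and anything larger than the biggest captured graph falls back to eager mode:

```python
import bisect

# Graph sizes captured per #1205
CAPTURED_SIZES = [25, 50, 100, 150, 200, 250, 300]

def select_graph_size(num_tokens, captured=CAPTURED_SIZES):
    """Return the smallest captured graph size that fits `num_tokens`,
    or None to signal fallback to eager execution."""
    idx = bisect.bisect_left(captured, num_tokens)
    if idx == len(captured):
        return None  # larger than any captured graph -> run eagerly
    return captured[idx]
```

For example, `select_graph_size(30)` selects the 50-token graph (the input would be padded to the bucket size), while `select_graph_size(400)` returns `None` and the decoder runs eagerly.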
Benchmark Results (H200):
| Metric | Before (Eager) | After (CUDA Graph) | Improvement |
|---|---|---|---|
| Avg Latency | ~8.96s | ~6.60s | 26% |
| Requests/sec | ~0.11 | ~0.15 | 36% |
Status: Implementation complete, pending review.
#### 1.3 Streaming Audio Output (#1189) - P0
Enable chunk-based audio generation for low first-chunk latency.
Implementation:
- `StreamingChunkOutput` dataclass for streaming chunk output
- `AsyncDecodingPipeline` class for background-thread decoding
- `generate_streaming_iter()` for token-level streaming
- Streaming variants: `generate_custom_voice_streaming()`, `generate_voice_design_streaming()`, `generate_voice_clone_streaming()`

Key Parameters:
- `chunk_size = 25` (tokens per chunk; note this is not the official default value and may need testing with different values)
- `left_context_size = 25` (for smooth chunk boundaries)
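How `chunk_size` and `left_context_size` interact can be sketched with a minimal windowing helper (hypothetical, not the #1189 code): each chunk is decoded together with up to 25 preceding tokens so chunk boundaries stay smooth, while only the new chunk's audio is emitted downstream.

```python
def iter_chunks(tokens, chunk_size=25, left_context_size=25):
    """Yield (left_context, chunk) windows over a codec-token stream.
    Each chunk would be decoded together with its left context for
    smooth boundaries, but only the new chunk's audio is emitted."""
    for start in range(0, len(tokens), chunk_size):
        left_context = tokens[max(0, start - left_context_size):start]
        yield left_context, tokens[start:start + chunk_size]
```

The first chunk has no left context, which is why first-chunk latency depends only on generating and decoding `chunk_size` tokens.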
Status: Implementation complete, pending review and API integration.
#### 1.4 Voice Upload API (#1201) - P0
Add voice management endpoints for Qwen3-TTS:
Endpoints:
- `POST /v1/audio/voices` - Upload custom voice samples (max 10MB)
- `GET /v1/audio/voices` - List available voices (built-in + uploaded)
Use Case: Allow users to upload reference audio for voice cloning without embedding in each request.
Status: Implementation in progress.
#### 1.5 E2E Tests (#1206) - P0
Add end-to-end tests for /v1/audio/speech endpoint.
Motivation: Existing unit tests used mocks that didn't match real behavior, allowing bugs like #1159 to slip through.
Coverage:
- CustomVoice task with different speakers
- VoiceDesign task with instructions
- Base task (voice cloning) with reference audio
Status: Implementation in progress.
#### 1.6 Streaming Text Input - P1
Accept text input in streaming fashion (for real-time transcription → TTS pipelines).
Reference: PR #986 implements streaming input for Qwen3-Omni, based on vLLM's StreamingInput API.
#### 1.7 Model-Specific Parameters - P1
Expose generation hyperparameters in API:
Currently missing from the API (but supported at the model layer):
- `temperature`, `top_k`, `top_p`
- `repetition_penalty`, `do_sample`, `non_streaming_mode`
- `subtalker_temperature`, `subtalker_top_k`, `subtalker_top_p`
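Assuming an OpenAI-style `/v1/audio/speech` request body, exposing these knobs might look like the fragment below. The field names come from the model layer listed above; the surrounding request shape and all values are illustrative, not recommended defaults.

```json
{
  "model": "Qwen3-TTS",
  "input": "Hello, world.",
  "voice": "default",
  "temperature": 0.9,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.1,
  "do_sample": true,
  "non_streaming_mode": false,
  "subtalker_temperature": 0.9,
  "subtalker_top_k": 40,
  "subtalker_top_p": 0.9
}
```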
#### 1.8 Gradio Demo - P1
Add interactive Gradio demo for Qwen3-TTS (reference: examples/online_serving/qwen3_omni/gradio_demo.py).
Features:
- Support all 3 task types: CustomVoice, VoiceDesign, Base (voice clone)
- Streaming audio output (real-time playback as audio is generated)
- Streaming text input (for real-time transcription → TTS pipelines)
- Speaker/voice selection
- Instruction input for VoiceDesign
- Reference audio upload for voice cloning
#### 1.9 TTS Benchmark - P1
Add benchmark tooling for Qwen3-TTS performance validation.
Challenge: Qwen3-TTS supports 3 different task types, each with different input requirements and use cases; the benchmark should cover all of them.
Task-Specific Benchmarks:
| Task | Input | Key Metrics | Notes |
|---|---|---|---|
| CustomVoice | text + speaker_id + (optional) instruction | Latency, RTF, throughput | Most common use case |
| VoiceDesign | text + instruction | Latency, RTF | Instruction parsing overhead |
| Base (Voice Clone) | text + ref_audio + ref_text | Latency, RTF, speaker similarity | Speaker encoder overhead |
Metrics to Measure:
- Latency Metrics:
  - First-chunk latency (streaming mode)
  - Time to first audio byte (TTFAB)
  - End-to-end latency
- Throughput Metrics:
  - Real-Time Factor (RTF) = processing_time / audio_duration
  - Requests per second
  - Tokens per second (AR generation)
- Quality Metrics (optional):
  - Speaker similarity score (for voice cloning)
  - MOS (Mean Opinion Score) via automatic evaluation
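The latency and RTF metrics above can be computed from a streamed response roughly as follows. This is a sketch: `measure_stream` is a hypothetical helper and the 24 kHz sample rate is an assumption, not benchmark code from this repo.

```python
import time

def measure_stream(chunks, sample_rate=24000):
    """Compute first-chunk latency, end-to-end latency, and RTF from a
    stream of audio chunks (each a list of samples). The 24 kHz sample
    rate is assumed for illustration."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)
    e2e_latency = time.perf_counter() - start
    audio_duration = total_samples / sample_rate
    # RTF < 1.0 means audio is produced faster than real time.
    rtf = e2e_latency / audio_duration if audio_duration else float("inf")
    return {"first_chunk_latency": first_chunk_latency,
            "e2e_latency": e2e_latency,
            "rtf": rtf}
```

In a real benchmark client, `chunks` would be the server's streaming response iterator, so first-chunk latency includes network and queueing time.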
Benchmark Dataset:
- Short sentences (< 50 chars): latency testing
- Medium sentences (50-200 chars): typical use case
- Long sentences (> 200 chars): streaming benefit validation
- Multi-language: Chinese, English, mixed
Reference: PR #1109 fixed audio benchmark timing for Qwen3-Omni.
### 2 Dependencies
```
#1161 Disaggregated Pipeline
├── #1189 Streaming Output (requires 2-stage separation)
└── #1205 CUDA Graph (can run in parallel)
    └── #1189 Streaming Output
```
### 3 Anticipated Timeline
| Week | Focus | Deliverable |
|---|---|---|
| Feb 5-10 | #1161 | Disaggregated pipeline merged |
| Feb 10-15 | #1205 | CUDA Graph merged |
| Feb 15-20 | #1189 | Streaming output merged |
| Feb 20-28 | Testing | e2e validation, benchmarks |
### 4 Performance Targets
| Metric | Target | Reference |
|---|---|---|
| First chunk latency | < 200ms | nano-qwen3tts-vllm: 160ms |
| RTF | < 1.0 | nano-qwen3tts-vllm: 0.65 |
| e2e latency reduction | > 50% | PR #727: 66% |
## CC List
@hsliuustc0106 @Gaohan123 @gcanlin @Sy0307 @tsdocode @xulusjb @gerayking @linyueqian @amy-why-3459 @R2-Y