
[RFC]: Qwen3-TTS Production Ready - February Milestone #938

@Gaohan123

Description


Motivation

Qwen3-TTS was initially supported in PR #895 with offline inference. Online serving was added in PR #968 (merged Jan 27). To make Qwen3-TTS production ready, we need to complete the remaining optimization work by end of February 2026.

Current Status (Feb 5):

| Feature | PR | Status | Priority | Owner |
|---|---|---|---|---|
| Offline inference | #895 | ✅ Merged | - | - |
| Online serving | #968 | ✅ Merged | - | @linyueqian |
| Disaggregated pipeline | #1161 | 🔄 In Progress | P0 | @Sy0307 @gcanlin |
| CUDA Graph acceleration | #1205 | 🔄 In Progress | P0 | @xulusjb @tsdocode |
| Streaming audio output | #1189 | 🔄 In Progress | P0 | @gerayking |
| Voice upload API | #1201 | 🔄 In Progress | P0 | @zhaotyer |
| E2E tests | #1206 | 🔄 In Progress | P0 | @linyueqian |
| Streaming text input | - | 📋 Planned | P1 | - |
| Model-specific params | - | 📋 Planned | P1 | - |
| Gradio demo | - | 📋 Planned | P1 | - |
| TTS benchmark | - | 📋 Planned | P1 | - |

Production Ready Criteria:

  • RTF < 1.0 for real-time synthesis
  • First chunk latency < 200ms for streaming mode
  • Stable 2-stage deployment for resource optimization
  • Streaming output for interactive use cases
  • Benchmark tooling for performance validation
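For reference, RTF here is the usual ratio of synthesis time to produced audio duration; a minimal helper (the function name is illustrative, not from any of the PRs) for checking the first criterion:

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = processing_time / audio_duration; < 1.0 means faster than real time."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_time_s / audio_duration_s

# A 10 s clip synthesized in 6.5 s meets the RTF < 1.0 target:
print(real_time_factor(6.5, 10.0))  # 0.65
```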

Related Infrastructure:

The following PRs provide context and shared infrastructure:

| PR | Description | Relevance |
|---|---|---|
| #1151 | Refactor async chunk for Thinker/Talker | Async chunk pattern; a similar TTS benchmark is needed |
| #962 | Async chunk design docs (Qwen3-Omni) | Architecture reference |
| #986 | Streaming input from vLLM | Streaming input pattern for TTS |
| #1109 | Benchmark audio timing fix | Benchmark infrastructure |

Note on Async Infrastructure:


Proposed Change

1 Work Items

1.1 Disaggregated Inference Pipeline (#1161) - P0

Separate Qwen3-TTS into two stages for flexible deployment:

Stage 0: Talker (AR Model)
  - Generates codec tokens
  - Compute-intensive, benefits from large GPU

Stage 1: SpeechTokenizer (Code2Wav)
  - Decodes tokens to audio
  - Can run on smaller GPU or CPU

Benefits:

  • Independent scaling of AR and decoder
  • Better GPU utilization
  • Foundation for streaming output

Status: WIP - basic implementation done, testing in progress.
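The two-stage split can be sketched as a producer/consumer pipeline: the Talker emits codec tokens into a queue and the SpeechTokenizer drains it, so the stages can be placed and scaled independently. This is a hypothetical illustration of the dataflow only, not code from #1161:

```python
import queue
import threading

def talker_stage(text: str, token_q: queue.Queue) -> None:
    # Stand-in for AR codec-token generation (Stage 0).
    for tok in range(100):           # pretend each int is a codec token
        token_q.put(tok)
    token_q.put(None)                # end-of-stream sentinel

def tokenizer_stage(token_q: queue.Queue, audio_out: list) -> None:
    # Stand-in for Code2Wav decoding (Stage 1); batches tokens into chunks.
    chunk = []
    while (tok := token_q.get()) is not None:
        chunk.append(tok)
        if len(chunk) == 25:
            audio_out.append(chunk)  # would be a decoded waveform chunk
            chunk = []
    if chunk:
        audio_out.append(chunk)

token_q: queue.Queue = queue.Queue()
audio: list = []
t0 = threading.Thread(target=talker_stage, args=("hello", token_q))
t1 = threading.Thread(target=tokenizer_stage, args=(token_q, audio))
t0.start(); t1.start(); t0.join(); t1.join()
print(len(audio))  # 4 chunks of 25 tokens
```

Because the stages only share the queue, the decoder could run in a separate process or on a different device, which is what makes streaming output a natural follow-up.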

1.2 CUDA Graph Acceleration (#1205) - P0

Enable CUDA Graph for SpeechTokenizer decoder to reduce kernel launch overhead.

Implementation:

  • CUDAGraphDecoderWrapper class for graph capture and replay
  • Captured graph sizes: [25, 50, 100, 150, 200, 250, 300]
  • Auto-fallback to eager execution for unsupported sizes
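The size-bucketing behind the capture list can be sketched as follows (logic assumed from the description above, not taken from #1205): requests are padded up to the smallest captured graph size, and anything above the largest bucket falls back to eager execution.

```python
# Captured graph sizes from the implementation notes above.
CAPTURED_SIZES = [25, 50, 100, 150, 200, 250, 300]

def select_graph_size(num_tokens: int, captured=CAPTURED_SIZES):
    """Return the smallest captured bucket that fits, or None for eager fallback."""
    for size in captured:
        if num_tokens <= size:
            return size
    return None  # unsupported size -> eager execution

print(select_graph_size(30))   # 50  (padded into the 50-token graph)
print(select_graph_size(400))  # None (falls back to eager)
```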

Benchmark Results (H200):

| Metric | Before (Eager) | After (CUDA Graph) | Improvement |
|---|---|---|---|
| Avg Latency | ~8.96s | ~6.60s | 26% |
| Requests/sec | ~0.11 | ~0.15 | 36% |

Status: Implementation complete, pending review.

1.3 Streaming Audio Output (#1189) - P0

Enable chunk-based audio generation for low first-chunk latency.

Implementation:

  • StreamingChunkOutput dataclass for streaming chunk output
  • AsyncDecodingPipeline class for background thread decoding
  • generate_streaming_iter() for token-level streaming
  • Streaming variants: generate_custom_voice_streaming(), generate_voice_design_streaming(), generate_voice_clone_streaming()

Key Parameters:

  • chunk_size = 25 (tokens per chunk; not an official default, so other values may need testing)
  • left_context_size = 25 (for smooth chunk boundaries)
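How the two parameters interact can be illustrated with a windowing sketch (illustrative only, not the PR's decoder): each chunk is decoded together with the preceding left_context_size tokens to smooth chunk boundaries, and only the new portion of audio is emitted.

```python
CHUNK_SIZE = 25
LEFT_CONTEXT = 25

def iter_decode_windows(tokens, chunk_size=CHUNK_SIZE, left_context=LEFT_CONTEXT):
    """Yield (context_tokens, new_tokens) windows for streaming decoding."""
    for start in range(0, len(tokens), chunk_size):
        ctx_start = max(0, start - left_context)
        yield tokens[ctx_start:start], tokens[start:start + chunk_size]

windows = list(iter_decode_windows(list(range(60))))
print([len(ctx) for ctx, _ in windows])  # [0, 25, 25]
print([len(new) for _, new in windows])  # [25, 25, 10]
```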

Status: Implementation complete, pending review and API integration.

1.4 Voice Upload API (#1201) - P0

Add voice management endpoints for Qwen3-TTS:

Endpoints:

  • POST /v1/audio/voices - Upload custom voice samples (max 10MB)
  • GET /v1/audio/voices - List available voices (built-in + uploaded)

Use Case: Allow users to upload reference audio for voice cloning without embedding in each request.
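A hedged sketch of a client-side helper for these endpoints: the paths and the 10MB limit come from the RFC, but the form field names ("name", "file") are assumptions, not the PR's confirmed schema.

```python
import os

MAX_VOICE_BYTES = 10 * 1024 * 1024  # 10MB upload limit stated above

def prepare_voice_upload(base_url: str, audio_path: str, voice_name: str):
    """Validate the sample size locally and return (url, form fields) for the POST."""
    size = os.path.getsize(audio_path)
    if size > MAX_VOICE_BYTES:
        raise ValueError(f"{audio_path} is {size} bytes; limit is 10MB")
    return (f"{base_url}/v1/audio/voices",
            {"name": voice_name, "file": audio_path})

# Usage (e.g. with requests; field names are assumed):
#   url, fields = prepare_voice_upload("http://localhost:8000", "me.wav", "my-voice")
#   requests.post(url, data={"name": fields["name"]},
#                 files={"file": open(fields["file"], "rb")})
#   requests.get("http://localhost:8000/v1/audio/voices")  # list voices
```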

Status: Implementation in progress.

1.5 E2E Tests (#1206) - P0

Add end-to-end tests for /v1/audio/speech endpoint.

Motivation: Existing unit tests used mocks that didn't match real behavior, allowing bugs like #1159 to slip through.

Coverage:

  • CustomVoice task with different speakers
  • VoiceDesign task with instructions
  • Base task (voice cloning) with reference audio

Status: Implementation in progress.
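The three covered tasks take different request shapes; a sketch of illustrative payload builders (field names are assumptions for illustration, not the confirmed /v1/audio/speech schema):

```python
def speech_request(task: str, text: str, **kw) -> dict:
    """Build a hypothetical /v1/audio/speech request body for one of the tasks."""
    body = {"input": text}
    if task == "CustomVoice":
        body["voice"] = kw["speaker"]            # built-in or uploaded speaker
    elif task == "VoiceDesign":
        body["instruction"] = kw["instruction"]  # natural-language voice spec
    elif task == "Base":
        body["ref_audio"] = kw["ref_audio"]      # reference clip for cloning
        body["ref_text"] = kw["ref_text"]        # transcript of the reference
    else:
        raise ValueError(f"unknown task: {task}")
    return body
```

E2E tests exercising these bodies against a live server (rather than mocks) would have caught mismatches like #1159.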

1.6 Streaming Text Input - P1

Accept text input in streaming fashion (for real-time transcription → TTS pipelines).

Reference: PR #986 implements streaming input for Qwen3-Omni, based on vLLM's StreamingInput API.

1.7 Model-Specific Parameters - P1

Expose generation hyperparameters in API:

Currently missing from API (but supported at model layer):

  • temperature, top_k, top_p
  • repetition_penalty
  • do_sample, non_streaming_mode
  • subtalker_temperature, subtalker_top_k, subtalker_top_p
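One way to surface these might look like the following merge step (parameter names from the list above; the defaults and the plumbing are illustrative assumptions, not the model layer's actual values):

```python
# Hypothetical defaults; the real values live at the model layer.
DEFAULTS = {
    "temperature": 1.0, "top_k": 50, "top_p": 1.0,
    "repetition_penalty": 1.0, "do_sample": True,
    "non_streaming_mode": False,
    "subtalker_temperature": 1.0, "subtalker_top_k": 50,
    "subtalker_top_p": 1.0,
}

def resolve_sampling_params(overrides: dict) -> dict:
    """Merge API-supplied overrides onto defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported sampling params: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

print(resolve_sampling_params({"temperature": 0.8, "top_k": 20})["top_k"])  # 20
```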

1.8 Gradio Demo - P1

Add interactive Gradio demo for Qwen3-TTS (reference: examples/online_serving/qwen3_omni/gradio_demo.py).

Features:

  • Support all 3 task types: CustomVoice, VoiceDesign, Base (voice clone)
  • Streaming audio output (real-time playback as audio is generated)
  • Streaming text input (for real-time transcription → TTS pipelines)
  • Speaker/voice selection
  • Instruction input for VoiceDesign
  • Reference audio upload for voice cloning

1.9 TTS Benchmark - P1

Add benchmark tooling for Qwen3-TTS performance validation.

Challenge: Qwen3-TTS supports 3 different task types, each with different input requirements and use cases; the benchmark should cover all of them.

Task-Specific Benchmarks:

| Task | Input | Key Metrics | Notes |
|---|---|---|---|
| CustomVoice | text + speaker_id + (optional) instruction | Latency, RTF, throughput | Most common use case |
| VoiceDesign | text + instruction | Latency, RTF | Instruction parsing overhead |
| Base (Voice Clone) | text + ref_audio + ref_text | Latency, RTF, speaker similarity | Speaker encoder overhead |

Metrics to Measure:

  1. Latency Metrics:

    • First chunk latency (streaming mode)
    • Time to first audio byte (TTFAB)
    • End-to-end latency
  2. Throughput Metrics:

    • Real-Time Factor (RTF) = processing_time / audio_duration
    • Requests per second
    • Tokens per second (AR generation)
  3. Quality Metrics (optional):

    • Speaker similarity score (for voice cloning)
    • MOS (Mean Opinion Score) via automatic evaluation
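The latency and throughput metrics above can be collected from a streaming response in one pass; a minimal harness sketch (names hypothetical) that times first chunk, end-to-end latency, and RTF:

```python
import time

def measure_stream(chunk_iter, sample_rate: int = 24_000) -> dict:
    """Consume an iterator of audio chunks (sequences of samples) and time it."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunk_iter:
        if first_chunk_latency is None:
            # First chunk latency / time to first audio byte.
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)
    e2e = time.perf_counter() - start
    audio_duration = total_samples / sample_rate
    return {
        "first_chunk_latency_s": first_chunk_latency,
        "e2e_latency_s": e2e,
        "rtf": e2e / audio_duration if audio_duration else float("inf"),
    }
```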

Benchmark Dataset:

  • Short sentences (< 50 chars): latency testing
  • Medium sentences (50-200 chars): typical use case
  • Long sentences (> 200 chars): streaming benefit validation
  • Multi-language: Chinese, English, mixed

Reference: PR #1109 fixed audio benchmark timing for Qwen3-Omni.

2 Dependencies

#1161 Disaggregated Pipeline
    ├──→ #1189 Streaming Output (requires 2-stage separation)
    └──→ #1205 CUDA Graph (can proceed in parallel)

3 Anticipated Timeline

| Week | Focus | Deliverable |
|---|---|---|
| Feb 5-10 | #1161 | Disaggregated pipeline merged |
| Feb 10-15 | #1205 | CUDA Graph merged |
| Feb 15-20 | #1189 | Streaming output merged |
| Feb 20-28 | Testing | e2e validation, benchmarks |

4 Performance Targets

| Metric | Target | Reference |
|---|---|---|
| First chunk latency | < 200ms | nano-qwen3tts-vllm: 160ms |
| RTF | < 1.0 | nano-qwen3tts-vllm: 0.65 |
| e2e latency reduction | > 50% | PR #727: 66% |

CC List

@hsliuustc0106 @Gaohan123 @gcanlin @Sy0307 @tsdocode @xulusjb @gerayking @linyueqian @amy-why-3459 @R2-Y


Labels

enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed), high priority (needs to be done asap), new model (add new model)
