## Motivation
Qwen3-TTS was initially supported in PR #895 with offline inference; online serving was added in PR #968 (merged Jan 27). To make Qwen3-TTS production-ready, we need to complete the remaining optimization work by the end of February 2026.

Current Status (Feb 5):
| Feature | PR | Status | Priority | Owner |
|---|---|---|---|---|
| Offline inference | #895 | ✅ Merged | - | - |
| Online serving | #968 | ✅ Merged | - | @linyueqian |
| Disaggregated pipeline | #1161 | 🚧 In Progress | P0 | @Sy0307 @gcanlin |
| CUDA Graph acceleration | #1205 | 🚧 In Progress | P0 | @xulusjb @tsdocode |
| Streaming audio output | #1189 | 🚧 In Progress | P0 | @gerayking |
| Voice upload API | #1201 | 🚧 In Progress | P0 | @zhaotyer |
| E2E tests | #1206 | 🚧 In Progress | P0 | @linyueqian |
| Streaming text input | - | 📋 Planned | P1 | - |
| Model-specific params | - | 📋 Planned | P1 | - |
| Gradio demo | - | 📋 Planned | P1 | - |
| TTS benchmark | - | 📋 Planned | P1 | - |
Production-Ready Criteria:
- RTF < 1.0 for real-time synthesis
- First chunk latency < 200ms for streaming mode
- Stable 2-stage deployment for resource optimization
- Streaming output for interactive use cases
- Benchmark tooling for performance validation
Related Infrastructure:
The following PRs provide context and shared infrastructure:
| PR | Description | Relevance |
|---|---|---|
| #1151 | Refactor async chunk for Thinker/Talker | Async chunk pattern; need similar TTS benchmark |
| #962 | Async chunk design docs (Qwen3-Omni) | Architecture reference |
| #986 | Streaming input from vLLM | Streaming input pattern for TTS |
| #1109 | Benchmark audio timing fix | Benchmark infrastructure |
Note on Async Infrastructure:
- The current async put/get will be refactored into `OmniChunkManager`; Qwen3-TTS streaming should align with this new abstraction (we can land a version based on the current main branch and adapt to the changes later)
- The Thinker/Talker async chunk ([Refactor] Refactor async chunk and fix the shape mismatch issue #1151) serves as the reference implementation
## Proposed Change
### 1 Work Items
#### 1.1 Disaggregated Inference Pipeline (#1161) - P0
Separate Qwen3-TTS into two stages for flexible deployment:
Stage 0: Talker (AR Model)
- Generates codec tokens
- Compute-intensive, benefits from large GPU
Stage 1: SpeechTokenizer (Code2Wav)
- Decodes tokens to audio
- Can run on smaller GPU or CPU
Benefits:
- Independent scaling of AR and decoder
- Better GPU utilization
- Foundation for streaming output
Status: WIP - basic implementation done, testing in progress.
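The intended split can be sketched as below. All names here (`TalkerStage`, `SpeechTokenizerStage`, `synthesize`, `TalkerOutput`) are hypothetical placeholders, not the #1161 implementation; the point is that the two stages communicate only through a codec-token handoff, which is what makes separate deployment possible.

```python
from dataclasses import dataclass

@dataclass
class TalkerOutput:
    """Handoff between the two stages: a sequence of discrete codec tokens."""
    codec_tokens: list

class TalkerStage:
    """Stage 0: autoregressive Talker producing codec tokens (GPU-heavy)."""
    def generate(self, text: str) -> TalkerOutput:
        # Placeholder: a real deployment runs the AR model here.
        return TalkerOutput(codec_tokens=list(range(len(text))))

class SpeechTokenizerStage:
    """Stage 1: Code2Wav decoder turning codec tokens into waveform samples."""
    def decode(self, out: TalkerOutput) -> list:
        # Placeholder: a real deployment runs the vocoder here.
        return [t * 0.1 for t in out.codec_tokens]

def synthesize(text: str) -> list:
    # Because the stages interact only through TalkerOutput, they can live on
    # different devices (large GPU for the Talker, smaller GPU or CPU for the
    # decoder) and scale independently.
    talker, decoder = TalkerStage(), SpeechTokenizerStage()
    return decoder.decode(talker.generate(text))
```

The narrow interface between the stages is also what later enables chunk-by-chunk streaming: Stage 1 can start decoding as soon as the first codec tokens arrive.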
#### 1.2 CUDA Graph Acceleration (#1205) - P0
Enable CUDA Graph for SpeechTokenizer decoder to reduce kernel launch overhead.
Implementation:
- `CUDAGraphDecoderWrapper` class for graph capture and replay
- Captured graph sizes: [25, 50, 100, 150, 200, 250, 300]
- Auto-fallback to eager execution for unsupported sizes
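The size-bucketing and eager fallback can be illustrated with a small selection helper (hypothetical; the real logic lives in `CUDAGraphDecoderWrapper` in #1205). The assumption here is that an input is padded up to the smallest captured size that fits, and anything larger than the biggest captured graph falls back to eager mode:

```python
import bisect

# Graph sizes captured per #1205
CAPTURED_SIZES = [25, 50, 100, 150, 200, 250, 300]

def select_graph_size(num_tokens, captured=CAPTURED_SIZES):
    """Return the smallest captured graph size that fits `num_tokens`,
    or None to signal fallback to eager execution."""
    idx = bisect.bisect_left(captured, num_tokens)
    if idx == len(captured):
        return None  # larger than any captured graph -> run eagerly
    return captured[idx]
```

For example, `select_graph_size(30)` selects the 50-token graph (the input would be padded to the bucket size), while `select_graph_size(400)` returns `None` and the decoder runs eagerly.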
Benchmark Results (H200):
| Metric | Before (Eager) | After (CUDA Graph) | Improvement |
|---|---|---|---|
| Avg Latency | ~8.96s | ~6.60s | 26% |
| Requests/sec | ~0.11 | ~0.15 | 36% |
Status: Implementation complete, pending review.
#### 1.3 Streaming Audio Output (#1189) - P0
Enable chunk-based audio generation for low first-chunk latency.
Implementation:
- `StreamingChunkOutput` dataclass for streaming chunk output
- `AsyncDecodingPipeline` class for background-thread decoding
- `generate_streaming_iter()` for token-level streaming
- Streaming variants: `generate_custom_voice_streaming()`, `generate_voice_design_streaming()`, `generate_voice_clone_streaming()`

Key Parameters:
- `chunk_size = 25` (tokens per chunk; note this is not the official default value and may need testing with different values)
- `left_context_size = 25` (for smooth chunk boundaries)
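How `chunk_size` and `left_context_size` interact can be sketched with a minimal windowing helper (hypothetical, not the #1189 code): each chunk is decoded together with up to 25 preceding tokens so chunk boundaries stay smooth, while only the new chunk's audio is emitted downstream.

```python
def iter_chunks(tokens, chunk_size=25, left_context_size=25):
    """Yield (left_context, chunk) windows over a codec-token stream.
    Each chunk would be decoded together with its left context for
    smooth boundaries, but only the new chunk's audio is emitted."""
    for start in range(0, len(tokens), chunk_size):
        left_context = tokens[max(0, start - left_context_size):start]
        yield left_context, tokens[start:start + chunk_size]
```

The first chunk has no left context, which is why first-chunk latency depends only on generating and decoding `chunk_size` tokens.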
Status: Implementation complete, pending review and API integration.
#### 1.4 Voice Upload API (#1201) - P0
Add voice management endpoints for Qwen3-TTS:
Endpoints:
- `POST /v1/audio/voices` - Upload custom voice samples (max 10MB)
- `GET /v1/audio/voices` - List available voices (built-in + uploaded)
Use Case: Allow users to upload reference audio for voice cloning without embedding in each request.
Status: Implementation in progress.
#### 1.5 E2E Tests (#1206) - P0
Add end-to-end tests for /v1/audio/speech endpoint.
Motivation: Existing unit tests used mocks that didn't match real behavior, allowing bugs like #1159 to slip through.
Coverage:
- CustomVoice task with different speakers
- VoiceDesign task with instructions
- Base task (voice cloning) with reference audio
Status: Implementation in progress.
#### 1.6 Streaming Text Input - P1
Accept text input in streaming fashion (for real-time transcription → TTS pipelines).
Reference: PR #986 implements streaming input for Qwen3-Omni, based on vLLM's StreamingInput API.
#### 1.7 Model-Specific Parameters - P1
Expose generation hyperparameters in API:
Currently missing from the API (but supported at the model layer):
- `temperature`, `top_k`, `top_p`
- `repetition_penalty`, `do_sample`, `non_streaming_mode`
- `subtalker_temperature`, `subtalker_top_k`, `subtalker_top_p`
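Assuming an OpenAI-style `/v1/audio/speech` request body, exposing these knobs might look like the fragment below. The field names come from the model layer listed above; the surrounding request shape and all values are illustrative, not recommended defaults.

```json
{
  "model": "Qwen3-TTS",
  "input": "Hello, world.",
  "voice": "default",
  "temperature": 0.9,
  "top_k": 50,
  "top_p": 0.95,
  "repetition_penalty": 1.1,
  "do_sample": true,
  "non_streaming_mode": false,
  "subtalker_temperature": 0.9,
  "subtalker_top_k": 40,
  "subtalker_top_p": 0.9
}
```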
#### 1.8 Gradio Demo - P1
Add interactive Gradio demo for Qwen3-TTS (reference: examples/online_serving/qwen3_omni/gradio_demo.py).
Features:
- Support all 3 task types: CustomVoice, VoiceDesign, Base (voice clone)
- Streaming audio output (real-time playback as audio is generated)
- Streaming text input (for real-time transcription → TTS pipelines)
- Speaker/voice selection
- Instruction input for VoiceDesign
- Reference audio upload for voice cloning
#### 1.9 TTS Benchmark - P1
Add benchmark tooling for Qwen3-TTS performance validation.
Challenge: Qwen3-TTS supports 3 different task types, each with different input requirements and use cases; the benchmark should cover all of them.
Task-Specific Benchmarks:
| Task | Input | Key Metrics | Notes |
|---|---|---|---|
| CustomVoice | text + speaker_id + (optional) instruction | Latency, RTF, throughput | Most common use case |
| VoiceDesign | text + instruction | Latency, RTF | Instruction parsing overhead |
| Base (Voice Clone) | text + ref_audio + ref_text | Latency, RTF, speaker similarity | Speaker encoder overhead |
Metrics to Measure:
- Latency Metrics:
  - First-chunk latency (streaming mode)
  - Time to first audio byte (TTFAB)
  - End-to-end latency
- Throughput Metrics:
  - Real-Time Factor (RTF) = processing_time / audio_duration
  - Requests per second
  - Tokens per second (AR generation)
- Quality Metrics (optional):
  - Speaker similarity score (for voice cloning)
  - MOS (Mean Opinion Score) via automatic evaluation
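The latency and RTF metrics above can be computed from a streamed response roughly as follows. This is a sketch: `measure_stream` is a hypothetical helper and the 24 kHz sample rate is an assumption, not benchmark code from this repo.

```python
import time

def measure_stream(chunks, sample_rate=24000):
    """Compute first-chunk latency, end-to-end latency, and RTF from a
    stream of audio chunks (each a list of samples). The 24 kHz sample
    rate is assumed for illustration."""
    start = time.perf_counter()
    first_chunk_latency = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start
        total_samples += len(chunk)
    e2e_latency = time.perf_counter() - start
    audio_duration = total_samples / sample_rate
    # RTF < 1.0 means audio is produced faster than real time.
    rtf = e2e_latency / audio_duration if audio_duration else float("inf")
    return {"first_chunk_latency": first_chunk_latency,
            "e2e_latency": e2e_latency,
            "rtf": rtf}
```

In a real benchmark client, `chunks` would be the server's streaming response iterator, so first-chunk latency includes network and queueing time.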
Benchmark Dataset:
- Short sentences (< 50 chars): latency testing
- Medium sentences (50-200 chars): typical use case
- Long sentences (> 200 chars): streaming benefit validation
- Multi-language: Chinese, English, mixed
Reference: PR #1109 fixed audio benchmark timing for Qwen3-Omni.
### 2 Dependencies
```
#1161 Disaggregated Pipeline
├── #1189 Streaming Output (requires 2-stage separation)
└── #1205 CUDA Graph (can run in parallel)
    └── #1189 Streaming Output
```
### 3 Anticipated Timeline
| Week | Focus | Deliverable |
|---|---|---|
| Feb 5-10 | #1161 | Disaggregated pipeline merged |
| Feb 10-15 | #1205 | CUDA Graph merged |
| Feb 15-20 | #1189 | Streaming output merged |
| Feb 20-28 | Testing | e2e validation, benchmarks |
### 4 Performance Targets
| Metric | Target | Reference |
|---|---|---|
| First chunk latency | < 200ms | nano-qwen3tts-vllm: 160ms |
| RTF | < 1.0 | nano-qwen3tts-vllm: 0.65 |
| e2e latency reduction | > 50% | PR #727: 66% |
## CC List
@hsliuustc0106 @Gaohan123 @gcanlin @Sy0307 @tsdocode @xulusjb @gerayking @linyueqian @amy-why-3459 @R2-Y