Commit 6d92db4

Merge branch 'vllm-project:main' into feature_mimo_audio

2 parents: f97de11 + 4d8e290

16 files changed (+354, -42 lines)

docs/.nav.yml

Lines changed: 7 additions & 0 deletions
```diff
@@ -11,17 +11,23 @@ nav:
   - Examples:
     - examples/README.md
     - Offline Inference:
+      - BAGEL-7B-MoT: user_guide/examples/offline_inference/bagel.md
       - Image-To-Image: user_guide/examples/offline_inference/image_to_image.md
       - Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
+      - LoRA Inference Examples: user_guide/examples/offline_inference/lora_inference.md
       - Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
       - Qwen3-Omni: user_guide/examples/offline_inference/qwen3_omni.md
       - Qwen3-TTS Offline Inference: user_guide/examples/offline_inference/qwen3_tts.md
+      - Text-To-Audio: user_guide/examples/offline_inference/text_to_audio.md
       - Text-To-Image: user_guide/examples/offline_inference/text_to_image.md
       - Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
     - Online Serving:
+      - BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
       - Image-To-Image: user_guide/examples/online_serving/image_to_image.md
+      - Online LoRA Inference (Diffusion): user_guide/examples/online_serving/lora_inference.md
       - Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md
       - Qwen3-Omni: user_guide/examples/online_serving/qwen3_omni.md
+      - Qwen3-TTS Online Serving: user_guide/examples/online_serving/qwen3_tts.md
       - Text-To-Image: user_guide/examples/online_serving/text_to_image.md
   - General:
     - usage/*
@@ -54,6 +60,7 @@ nav:
   - Feature Design:
     - design/feature/disaggregated_inference.md
     - design/feature/ray_based_execution.md
+    - design/feature/omni_connectors/
   - Module Design:
     - design/module/ar_module.md
     - design/module/dit_module.md
```

docs/api/README.md

Lines changed: 20 additions & 11 deletions
```diff
@@ -10,6 +10,9 @@ Main entry points for vLLM-Omni inference and serving.
 - [vllm_omni.entrypoints.chat_utils.OmniAsyncMultiModalContentParser][]
 - [vllm_omni.entrypoints.chat_utils.OmniAsyncMultiModalItemTracker][]
 - [vllm_omni.entrypoints.chat_utils.parse_chat_messages_futures][]
+- [vllm_omni.entrypoints.cli.benchmark.base.OmniBenchmarkSubcommandBase][]
+- [vllm_omni.entrypoints.cli.benchmark.main.OmniBenchmarkSubcommand][]
+- [vllm_omni.entrypoints.cli.benchmark.serve.OmniBenchmarkServingSubcommand][]
 - [vllm_omni.entrypoints.cli.serve.OmniServeCommand][]
 - [vllm_omni.entrypoints.client_request_state.ClientRequestState][]
 - [vllm_omni.entrypoints.log_utils.OrchestratorMetrics][]
@@ -26,7 +29,9 @@ Main entry points for vLLM-Omni inference and serving.
 
 Input data structures for multi-modal inputs.
 
+- [vllm_omni.inputs.data.OmniDiffusionSamplingParams][]
 - [vllm_omni.inputs.data.OmniEmbedsPrompt][]
+- [vllm_omni.inputs.data.OmniTextPrompt][]
 - [vllm_omni.inputs.data.OmniTokenInputs][]
 - [vllm_omni.inputs.data.OmniTokensPrompt][]
 - [vllm_omni.inputs.parse.parse_singleton_prompt_omni][]
@@ -58,6 +63,7 @@ Core scheduling and caching components.
 - [vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler][]
 - [vllm_omni.core.sched.output.OmniCachedRequestData][]
 - [vllm_omni.core.sched.output.OmniNewRequestData][]
+- [vllm_omni.core.sched.output.OmniSchedulerOutput][]
 - [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedGroupResidualVectorQuantization][]
 - [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.DistributedResidualVectorQuantization][]
 - [vllm_omni.model_executor.models.qwen3_tts.tokenizer_25hz.vq.core_vq.EuclideanCodebook][]
@@ -88,20 +94,23 @@ Configuration classes.
 
 Worker classes and model runners for distributed inference.
 
-- [vllm_omni.diffusion.worker.gpu_diffusion_model_runner.GPUDiffusionModelRunner][]
-- [vllm_omni.diffusion.worker.gpu_diffusion_worker.GPUDiffusionWorker][]
-- [vllm_omni.diffusion.worker.gpu_diffusion_worker.WorkerProc][]
-- [vllm_omni.diffusion.worker.npu.npu_worker.NPUWorker][]
-- [vllm_omni.diffusion.worker.npu.npu_worker.NPUWorkerProc][]
+- [vllm_omni.diffusion.worker.diffusion_model_runner.DiffusionModelRunner][]
+- [vllm_omni.diffusion.worker.diffusion_worker.DiffusionWorker][]
+- [vllm_omni.diffusion.worker.diffusion_worker.WorkerProc][]
+- [vllm_omni.platforms.npu.worker.npu_ar_model_runner.ExecuteModelState][]
+- [vllm_omni.platforms.npu.worker.npu_ar_model_runner.NPUARModelRunner][]
+- [vllm_omni.platforms.npu.worker.npu_ar_worker.NPUARWorker][]
+- [vllm_omni.platforms.npu.worker.npu_generation_model_runner.NPUGenerationModelRunner][]
+- [vllm_omni.platforms.npu.worker.npu_generation_worker.NPUGenerationWorker][]
+- [vllm_omni.platforms.npu.worker.npu_model_runner.OmniNPUModelRunner][]
+- [vllm_omni.platforms.xpu.worker.xpu_ar_model_runner.XPUARModelRunner][]
+- [vllm_omni.platforms.xpu.worker.xpu_ar_worker.XPUARWorker][]
+- [vllm_omni.platforms.xpu.worker.xpu_generation_model_runner.XPUGenerationModelRunner][]
+- [vllm_omni.platforms.xpu.worker.xpu_generation_worker.XPUGenerationWorker][]
 - [vllm_omni.worker.gpu_ar_model_runner.ExecuteModelState][]
 - [vllm_omni.worker.gpu_ar_model_runner.GPUARModelRunner][]
 - [vllm_omni.worker.gpu_ar_worker.GPUARWorker][]
 - [vllm_omni.worker.gpu_generation_model_runner.GPUGenerationModelRunner][]
 - [vllm_omni.worker.gpu_generation_worker.GPUGenerationWorker][]
 - [vllm_omni.worker.gpu_model_runner.OmniGPUModelRunner][]
-- [vllm_omni.worker.npu.npu_ar_model_runner.ExecuteModelState][]
-- [vllm_omni.worker.npu.npu_ar_model_runner.NPUARModelRunner][]
-- [vllm_omni.worker.npu.npu_ar_worker.NPUARWorker][]
-- [vllm_omni.worker.npu.npu_generation_model_runner.NPUGenerationModelRunner][]
-- [vllm_omni.worker.npu.npu_generation_worker.NPUGenerationWorker][]
-- [vllm_omni.worker.npu.npu_model_runner.OmniNPUModelRunner][]
+- [vllm_omni.worker.mixins.OmniWorkerMixin][]
```

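Note on the worker refactor in this file: NPU worker classes move from `vllm_omni.worker.npu.*` to `vllm_omni.platforms.npu.worker.*`, and XPU workers now appear under `vllm_omni.platforms.xpu.worker.*`. A hedged sketch of updated imports, using only paths listed in the diff above (whether they are importable depends on the vLLM-Omni release you have installed):

```python
# Illustrative only: import paths taken from the API reference entries in this diff.
# Verify against your installed vllm_omni version before relying on them.
from vllm_omni.platforms.npu.worker.npu_ar_worker import NPUARWorker
from vllm_omni.platforms.xpu.worker.xpu_generation_worker import XPUGenerationWorker
from vllm_omni.worker.gpu_ar_worker import GPUARWorker  # GPU workers keep their old location
```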
docs/user_guide/examples/offline_inference/bagel.md

Lines changed: 9 additions & 1 deletion
```diff
@@ -2,6 +2,7 @@
 
 Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/bagel>.
 
+
 ## Set up
 
 Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
@@ -99,7 +100,7 @@ python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
 
 BAGEL-7B-MoT supports **multiple modality modes** for different use cases.
 
-The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](../../../vllm_omni/model_executor/stage_configs/bagel.yaml)
+The default yaml configuration deploys Thinker and DiT on the same GPU. You can use the default configuration file: [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)
 
 #### 📌 Command Line Arguments (end2end.py)
 
@@ -177,3 +178,10 @@ sudo apt install ffmpeg
 | Stage-0 (Thinker) | **15.04 GiB** **+ KV Cache** |
 | Stage-1 (DiT) | **26.50 GiB** |
 | Total | **~42 GiB + KV Cache** |
+
+## Example materials
+
+??? abstract "end2end.py"
+    ``````py
+    --8<-- "examples/offline_inference/bagel/end2end.py"
+    ``````
```

docs/user_guide/examples/offline_inference/image_to_image.md

Lines changed: 6 additions & 1 deletion
```diff
@@ -47,10 +47,15 @@ Key arguments:
 - `--image`: path(s) to the source image(s) (PNG/JPG, converted to RGB). Can specify multiple images.
 - `--prompt` / `--negative_prompt`: text description (string).
 - `--cfg_scale`: true classifier-free guidance scale (default: 4.0). Classifier-free guidance is enabled by setting cfg_scale > 1 and providing a negative_prompt. Higher guidance scale encourages images closely linked to the text prompt, usually at the expense of lower image quality.
-- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
 - `--guidance_scale`: guidance scale for guidance-distilled models (default: 1.0, disabled). Unlike classifier-free guidance (--cfg_scale), guidance-distilled models take the guidance scale directly as an input parameter. Enabled when guidance_scale > 1. Ignored when not using guidance-distilled models.
 - `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
 - `--output`: path to save the generated PNG.
+- `--vae_use_slicing`: enable VAE slicing for memory optimization.
+- `--vae_use_tiling`: enable VAE tiling for memory optimization.
+- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in the [user guide](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
+- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
 
 ## Example materials
 
```

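The new `--vae_use_slicing`, `--vae_use_tiling`, and `--enable-cpu-offload` flags documented above correspond to standard diffusion-pipeline memory optimizations. A hedged sketch in plain `diffusers` (not the vLLM-Omni runtime; the model id and file names are placeholders) of roughly what those switches do:

```python
# Illustrative only: plain diffusers, not the vLLM-Omni code path.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
)
pipe.vae.enable_slicing()        # decode batched latents one image at a time (--vae_use_slicing)
pipe.vae.enable_tiling()         # decode large latents tile by tile (--vae_use_tiling)
pipe.enable_model_cpu_offload()  # keep idle submodules on CPU (--enable-cpu-offload)

image = load_image("input.png").convert("RGB")  # placeholder input file
result = pipe(
    prompt="a watercolor painting of the same scene",
    image=image,
    num_inference_steps=30,
)
result.images[0].save("output.png")
```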
docs/user_guide/examples/offline_inference/image_to_video.md

Lines changed: 6 additions & 1 deletion
```diff
@@ -52,12 +52,17 @@ Key arguments:
 - `--num_frames`: Number of frames (default 81).
 - `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high-noise stages for MoE).
 - `--negative_prompt`: Optional list of artifacts to suppress.
-- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
 - `--boundary_ratio`: Boundary split ratio for two-stage MoE models.
 - `--flow_shift`: Scheduler flow shift (5.0 for 720p, 12.0 for 480p).
 - `--num_inference_steps`: Number of denoising steps (default 50).
 - `--fps`: Frames per second for the saved MP4 (requires `diffusers` export_to_video).
 - `--output`: Path to save the generated video.
+- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
+- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
+- `--cfg_parallel_size`: Set it to 2 to enable CFG Parallel. See more examples in the [user guide](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
+- `--enable-cpu-offload`: Enable CPU offloading for diffusion models.
+
+> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
 
 ## Example materials
 
```

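For context on why `--cfg_parallel_size` is specifically 2: classifier-free guidance evaluates two independent denoiser branches per step, and that pair is what CFG Parallel spreads across devices. A conceptual sketch only (the `denoise` function is a stand-in, not the vLLM-Omni DiT):

```python
# Conceptual sketch: one CFG step blends a conditional and an unconditional forward pass.
import torch

def denoise(latents: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # stand-in for one transformer/UNet forward pass
    return latents * 0.99 + text_embeds.mean() * 0.01

def cfg_step(latents, cond_embeds, uncond_embeds, guidance_scale: float = 4.0):
    # the two forwards share no state, so CFG Parallel can place them on two devices
    eps_cond = denoise(latents, cond_embeds)
    eps_uncond = denoise(latents, uncond_embeds)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

latents = torch.randn(1, 16, 60, 104)
out = cfg_step(latents, torch.randn(1, 512, 4096), torch.randn(1, 512, 4096))
print(out.shape)
```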
docs/user_guide/examples/offline_inference/lora_inference.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -1,8 +1,9 @@
-# LoRA-Inference
+# LoRA Inference Examples
 
 Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/lora_inference>.
 
-This contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
+
+This directory contains examples for using LoRA (Low-Rank Adaptation) adapters with vLLM-omni diffusion models for offline inference.
 The example uses the `stabilityai/stable-diffusion-3.5-medium` as the default model, but you can replace it with other models in vLLM-omni.
 
 ## Overview
```

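For readers who want to see the mechanism the example wraps, here is a hedged sketch in plain `diffusers` of attaching a LoRA adapter to `stabilityai/stable-diffusion-3.5-medium`; the adapter id is a placeholder, and the vLLM-Omni example script remains the supported path:

```python
# Hedged illustration (plain diffusers, not the vLLM-Omni entrypoint).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("your-org/your-sd3.5-lora")  # placeholder LoRA adapter id

image = pipe(
    "a cozy reading nook in watercolor style",
    num_inference_steps=28,
    guidance_scale=4.5,
).images[0]
image.save("lora_sample.png")
```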
docs/user_guide/examples/offline_inference/qwen3_tts.md

Lines changed: 9 additions & 0 deletions
```diff
@@ -16,6 +16,15 @@ Qwen3 TTS provides multiple task variants for speech generation:
 ## Setup
 Please refer to the [stage configuration documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) to configure memory allocation appropriately for your hardware setup.
 
+### ROCm Dependencies
+
+You will need to install two additional dependencies: `onnxruntime-rocm` and `sox`.
+
+```
+pip uninstall onnxruntime  # must be removed before installing onnxruntime-rocm
+pip install onnxruntime-rocm sox
+```
+
 ## Quick Start
 
 Run a single sample for a task:
```
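After swapping in `onnxruntime-rocm`, an optional sanity check (this only verifies that the ROCm build is active; it is not required to run TTS):

```python
# Optional check: the ROCm wheel should expose ROCMExecutionProvider,
# which the stock CPU/CUDA onnxruntime wheel does not.
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)
assert "ROCMExecutionProvider" in providers, (
    "ROCm provider not found - uninstall the stock onnxruntime wheel "
    "before installing onnxruntime-rocm"
)
```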
docs/user_guide/examples/offline_inference/text_to_audio.md

Lines changed: 47 additions & 0 deletions

```diff
@@ -0,0 +1,47 @@
+# Text-To-Audio
+
+Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/text_to_audio>.
+
+
+The `stabilityai/stable-audio-open-1.0` pipeline generates audio from text prompts.
+
+## Prerequisites
+
+If you use a gated model (e.g., `stabilityai/stable-audio-open-1.0`), ensure you have access:
+
+1. **Accept Model License**: Visit the model page on Hugging Face (e.g., [stabilityai/stable-audio-open-1.0](https://huggingface.co/stabilityai/stable-audio-open-1.0)) and accept the user agreement.
+2. **Authenticate**: Log in to Hugging Face locally to access the gated model.
+    ```bash
+    huggingface-cli login
+    ```
+
+## Local CLI Usage
+
+```bash
+python text_to_audio.py \
+    --model stabilityai/stable-audio-open-1.0 \
+    --prompt "The sound of a hammer hitting a wooden surface" \
+    --negative_prompt "Low quality" \
+    --seed 42 \
+    --guidance_scale 7.0 \
+    --audio_length 10.0 \
+    --num_inference_steps 100 \
+    --output stable_audio_output.wav
+```
+
+Key arguments:
+
+- `--prompt`: text description (string).
+- `--negative_prompt`: negative prompt for classifier-free guidance.
+- `--seed`: integer seed for deterministic generation.
+- `--guidance_scale`: classifier-free guidance scale.
+- `--audio_length`: audio duration in seconds.
+- `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
+- `--output`: path to save the generated WAV file.
+
+## Example materials
+
+??? abstract "text_to_audio.py"
+    ``````py
+    --8<-- "examples/offline_inference/text_to_audio/text_to_audio.py"
+    ``````
```

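The new page includes the example script via the snippet marker above. For reference, a hedged sketch of the equivalent plain-`diffusers` flow for `stabilityai/stable-audio-open-1.0` (mirroring the upstream diffusers example; the vLLM-Omni `text_to_audio.py` script is the supported path, and the output file name is arbitrary):

```python
# Hedged reference sketch: plain diffusers, not the vLLM-Omni entrypoint.
import soundfile as sf
import torch
from diffusers import StableAudioPipeline

pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)
audio = pipe(
    "The sound of a hammer hitting a wooden surface",
    negative_prompt="Low quality",
    num_inference_steps=100,
    audio_end_in_s=10.0,
    generator=generator,
).audios

# audios[0] is (channels, samples); transpose for soundfile and use the VAE sample rate
sf.write("stable_audio_output.wav", audio[0].T.float().cpu().numpy(), pipe.vae.sampling_rate)
```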
docs/user_guide/examples/offline_inference/text_to_image.md

Lines changed: 7 additions & 2 deletions
```diff
@@ -51,7 +51,7 @@ if __name__ == "__main__":
 
 For diffusion pipelines, the stage config field `stage_args.[].runtime.max_batch_size` is 1 by default, and the input
 list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support
-batched inputs, you can [modify this configuration](../../../configuration/stage_configs.md) to let the model accept a longer batch of prompts.
+batched inputs, you can [modify this configuration](https://github.com/vllm-project/vllm-omni/tree/main/configuration/stage_configs.md) to let the model accept a longer batch of prompts.
 
 Apart from string prompt, vLLM-Omni also supports dictionary prompts in the same style as vLLM.
 This is useful for models that support negative prompts.
@@ -95,11 +95,16 @@ Key arguments:
 - `--prompt`: text description (string).
 - `--seed`: integer seed for deterministic sampling.
 - `--cfg_scale`: true CFG scale (model-specific guidance strength).
-- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
 - `--num_images_per_prompt`: number of images to generate per prompt (saves as `output`, `output_1`, ...).
 - `--num_inference_steps`: diffusion sampling steps (more steps = higher quality, slower).
 - `--height/--width`: output resolution (defaults 1024x1024).
 - `--output`: path to save the generated PNG.
+- `--vae_use_slicing`: enable VAE slicing for memory optimization.
+- `--vae_use_tiling`: enable VAE tiling for memory optimization.
+- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in the [user guide](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
+- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
 
 > ℹ️ Qwen-Image currently publishes best-effort presets at `1328x1328`, `1664x928`, `928x1664`, `1472x1140`, `1140x1472`, `1584x1056`, and `1056x1584`. Adjust `--height/--width` accordingly for the most reliable outcomes.
 
```

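The batching note in this file (`runtime.max_batch_size` defaults to 1, so prompt lists are sliced into single-item requests) can be pictured with a small, purely illustrative sketch; the helper name below is hypothetical, not vLLM-Omni code:

```python
# Illustrative sketch of the batching behaviour described above.
from typing import Iterator, List

def slice_into_batches(prompts: List[str], max_batch_size: int = 1) -> Iterator[List[str]]:
    """Yield prompt batches no larger than max_batch_size."""
    for i in range(0, len(prompts), max_batch_size):
        yield prompts[i : i + max_batch_size]

prompts = ["a snowy cabin at dusk", "a neon-lit city street", "a bowl of ramen"]
for batch in slice_into_batches(prompts, max_batch_size=1):
    print(batch)  # each diffusion call sees one prompt unless max_batch_size is raised
```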
docs/user_guide/examples/offline_inference/text_to_video.md

Lines changed: 12 additions & 6 deletions
```diff
@@ -12,10 +12,11 @@ python text_to_video.py \
     --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
     --negative_prompt "<optional quality filter>" \
     --height 480 \
-    --width 640 \
-    --num_frames 32 \
+    --width 832 \
+    --num_frames 33 \
     --guidance_scale 4.0 \
     --guidance_scale_high 3.0 \
+    --flow_shift 12.0 \
     --num_inference_steps 40 \
     --fps 16 \
     --output t2v_out.mp4
@@ -24,14 +25,19 @@ python text_to_video.py \
 Key arguments:
 
 - `--prompt`: text description (string).
-- `--height/--width`: output resolution (defaults 720x1280). Dimensions should align with Wan VAE downsampling (multiples of 8).
+- `--height/--width`: output resolution (defaults 480x832, i.e. 480P). Dimensions should align with Wan VAE downsampling (multiples of 8).
 - `--num_frames`: Number of frames (Wan default is 81).
-- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high)..
+- `--guidance_scale` and `--guidance_scale_high`: CFG scale (applied to low/high).
 - `--negative_prompt`: optional list of artifacts to suppress (the PR demo used a long Chinese string).
-- `--cfg_parallel_size`: the number of devices to run CFG Parallel. CFG Parallel is valid only if classifier-free guidance is enabled and `cfg_parallel_size` is set to 2.
-- `--boundary_ratio`: Boundary split ratio for low/high DiT.
+- `--boundary_ratio`: Boundary split ratio for low/high DiT. The default `0.875` uses both transformers for best quality. Setting it to `1.0` loads only the low-noise transformer (saves noticeable memory with good quality, recommended if memory is limited); setting it to `0.0` loads only the high-noise transformer (not recommended, lower quality).
 - `--fps`: frames per second for the saved MP4 (requires `diffusers` export_to_video).
 - `--output`: path to save the generated video.
+- `--vae_use_slicing`: enable VAE slicing for memory optimization.
+- `--vae_use_tiling`: enable VAE tiling for memory optimization.
+- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in the [user guide](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
+- `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+
+> ℹ️ If you encounter OOM errors, try using `--vae_use_slicing` and `--vae_use_tiling` to reduce memory usage.
 
 ## Example materials
 
```

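The expanded `--boundary_ratio` description above can be made concrete with a conceptual sketch of how a two-stage Wan-style pipeline routes denoising steps. The helper below is hypothetical, not the vLLM-Omni implementation, and assumes the usual 1000-step training schedule: timesteps at or above the boundary go to the high-noise transformer, the rest to the low-noise one, so `boundary_ratio=1.0` effectively sends every step to the low-noise model.

```python
# Conceptual routing sketch for a two-expert (high-noise / low-noise) video DiT.
def pick_transformer(timestep: int, boundary_ratio: float = 0.875,
                     num_train_timesteps: int = 1000) -> str:
    boundary = boundary_ratio * num_train_timesteps
    return "high_noise" if timestep >= boundary else "low_noise"

for t in (999, 950, 875, 500, 10):
    print(t, pick_transformer(t))           # default 0.875: early (noisy) steps -> high_noise
print(999, pick_transformer(999, 1.0))      # boundary_ratio=1.0 -> everything low_noise
```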