
Commit b07e7ce

Update ReadMe
1 parent c319333 commit b07e7ce

File tree

1 file changed (+131, -5 lines)


tools/llm/README.md

Lines changed: 131 additions & 5 deletions
@@ -7,6 +7,9 @@ This directory provides utilities and scripts for compiling, optimizing, and ben
- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **Multiple Backends:**
  - **SDPA Backend** (default): Registers a custom lowering pass for SDPA operations, enabling TensorRT conversion with optional static KV cache support.
  - **Plugin Backend**: Uses the TensorRT Edge-LLM attention plugin for optimized inference with built-in KV cache management.
- **KV Cache:** Supports static and dynamic KV cache for efficient autoregressive decoding.
- **Benchmarking:** Measures and compares throughput and latency for PyTorch and TensorRT backends.
- **Custom Attention:** Registers and converts custom scaled dot-product attention (SDPA) for compatibility with TensorRT (a toy illustration of the targeted SDPA call follows this list).
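
For orientation, the snippet below is a toy, hypothetical PyTorch module (not code from this repository) whose `torch.nn.functional.scaled_dot_product_attention` call is the kind of operation the SDPA lowering pass targets:

```python
import torch
import torch.nn.functional as F

class ToySelfAttention(torch.nn.Module):
    """Hypothetical single-head self-attention block built around F.scaled_dot_product_attention."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Causal SDPA call; a lowering pass can rewrite calls like this for TensorRT conversion.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

x = torch.randn(1, 8, 64)
print(ToySelfAttention()(x).shape)  # torch.Size([1, 8, 64])
```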
@@ -37,41 +40,164 @@ We have officially verified support for the following models:

#### Text-only LLMs: `run_llm.py`

**1. Generation with Output Verification**

Compare PyTorch and TensorRT outputs to verify correctness:

*SDPA Backend:*
```bash
python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --backend sdpa \
  --prompt "What is parallel programming?" --precision FP16 --num_tokens 30 --enable_pytorch_run
```
<details>
<summary>Expected Output</summary>

```
========= PyTorch =========
PyTorch model generated text: What is parallel programming? Parallel programming is a technique used to improve the performance of a program by dividing the work into smaller tasks and executing them simultaneously on multiple processors or cores.
===================================
========= TensorRT =========
TensorRT model generated text: What is parallel programming? Parallel programming is a technique used to improve the performance of a program by dividing the work into smaller tasks and executing them simultaneously on multiple processors or cores.
===================================
PyTorch and TensorRT outputs match: True
```
</details>

*Plugin Backend:*
```bash
python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --backend plugin \
  --prompt "What is parallel programming?" --precision FP16 --num_tokens 30 --enable_pytorch_run
```
<details>
<summary>Expected Output</summary>

```
========= PyTorch =========
PyTorch model generated text: What is parallel programming? What are the benefits of parallel programming? What are the challenges of parallel programming? What are the different types of parallel programming? What are the advantages of
===================================
========= TensorRT =========
TensorRT model generated text: What is parallel programming? What are the benefits of parallel programming? What are the challenges of parallel programming? What are the different types of parallel programming? What are the advantages of
===================================
PyTorch and TensorRT outputs match: True
```
</details>

**2. Benchmarking for Performance Comparison**

*Plugin Backend (compares TensorRT-Plugin vs PyTorch):*
```bash
python run_llm.py --model Qwen/Qwen2.5-0.5B-Instruct --backend plugin --precision FP16 \
  --benchmark --iterations 5 --isl 128 --num_tokens 20 --batch_size 1 --enable_pytorch_run
```

*SDPA with Static Cache (compares TensorRT-SDPA-StaticCache vs PyTorch):*
```bash
python run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --backend sdpa --cache static_v2 \
  --precision FP16 --benchmark --iterations 5 --isl 128 --num_tokens 20 --batch_size 1 --enable_pytorch_run
```

> **Note**: In benchmark mode, `--prompt` is not used. Random input tokens are generated based on `--isl` (input sequence length).
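
To make the note concrete, the sketch below (with assumed placeholder values, not code from `run_llm.py`) shows how a random input batch of length `--isl` can be constructed:

```python
import torch

# Assumed placeholders; in practice the vocabulary size comes from the tokenizer,
# and batch_size/isl come from the --batch_size and --isl flags.
vocab_size = 32000
batch_size, isl = 1, 128

# Random token IDs stand in for a real prompt during benchmarking.
input_ids = torch.randint(low=0, high=vocab_size, size=(batch_size, isl), dtype=torch.long)
print(input_ids.shape)  # torch.Size([1, 128])
```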
#### Vision Language Models: `run_vlm.py`

*Generation with Output Verification:*
```bash
python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --num_tokens 64 --cache static_v1 --enable_pytorch_run
```

*Benchmarking:*
```bash
python run_vlm.py --model nvidia/Eagle2-2B --precision FP16 --cache static_v1 --benchmark --iterations 5 --num_tokens 128
```

#### Key Arguments

**Model Configuration:**
- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to the model name.
- `--backend`: Backend to use (`sdpa` or `plugin`). Default is `sdpa`. Only applicable to LLM models.

**Generation Settings:**
- `--prompt`: Input prompt for generation (generation mode only, ignored in benchmark mode).
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.

**Cache and Optimization:**
- `--cache`: KV cache type for the SDPA backend (`static_v1`, `static_v2`, or empty for no KV caching).
  - Note: Not applicable to the plugin backend, which manages its cache internally.

**Benchmarking:**
- `--benchmark`: Enable benchmarking mode (uses random inputs instead of a prompt).
- `--iterations`: Number of benchmark iterations. Default is 5.
- `--isl`: Input sequence length for benchmarking. Default is 2048.
- `--batch_size`: Batch size for benchmarking. Default is 1.
- `--enable_pytorch_run`: Also run and compare a PyTorch baseline.

### Caching Strategies

#### SDPA Backend
- **Static Cache v1/v2:** Adds static KV cache tensors as model inputs/outputs for efficient reuse.
- **No Cache:** Standard autoregressive decoding.

Please read our tutorial on how the static cache is implemented. A rough sketch of the idea appears below.
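
As a purely illustrative sketch (the function and tensor shapes below are hypothetical, not the actual graph pass), static caching amounts to threading fixed-size key/value buffers through each decode step as explicit inputs and outputs:

```python
import torch

def decode_step(new_k, new_v, k_cache, v_cache, pos):
    """Hypothetical decode step: write this token's K/V into preallocated buffers.

    The caches are explicit inputs and outputs, so a compiled graph can reuse the
    same fixed-size tensors across autoregressive steps.
    """
    k_cache = k_cache.index_copy(1, pos, new_k)
    v_cache = v_cache.index_copy(1, pos, new_v)
    return k_cache, v_cache

batch, max_seq, heads, head_dim = 1, 32, 4, 16
k_cache = torch.zeros(batch, max_seq, heads, head_dim)
v_cache = torch.zeros(batch, max_seq, heads, head_dim)

# Step 0: insert the first token's key/value at position 0 of the static buffers.
new_k = torch.randn(batch, 1, heads, head_dim)
new_v = torch.randn(batch, 1, heads, head_dim)
k_cache, v_cache = decode_step(new_k, new_v, k_cache, v_cache, torch.tensor([0]))
```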


#### Plugin Backend
The plugin backend uses the TensorRT Edge-LLM AttentionPlugin, which manages the KV cache internally. The `--cache` option is not applicable and will be ignored if specified with `--backend plugin`.

## Plugin Backend Setup

To use the plugin backend (`--backend plugin`), you need to build the TensorRT Edge-LLM AttentionPlugin library.

### Building the AttentionPlugin

Currently, plugin support requires a custom build from a feature branch:

```bash
# Clone the repository with the torch-tensorrt-python-runtime feature
git clone -b feature/torch-tensorrt-python-runtime https://github.com/chohk88/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM

# Build the plugin library
mkdir build && cd build

# Configure with CMake (adjust paths based on your environment)
# Example for a typical Ubuntu setup with CUDA 12.9 and TensorRT in /usr:
cmake .. -DTRT_PACKAGE_DIR=/usr -DCUDA_VERSION=12.9

# Build
make -j$(nproc)

# The plugin library will be at: build/libNvInfer_edgellm_plugin.so
```

> **Note**: CMake configuration may vary depending on your system setup. Common options include:
> - `-DTRT_PACKAGE_DIR`: TensorRT installation directory (e.g., `/usr`, `/usr/local`)
> - `-DCUDA_VERSION`: CUDA version (e.g., `12.9`, `12.6`)
>
> Refer to the [TensorRT-Edge-LLM build documentation](https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime) for complete build instructions and dependencies.

After building, update the plugin path in `plugin_utils.py`:
```python
DEFAULT_PLUGIN_PATH = "/path/to/your/TensorRT-Edge-LLM/build/libNvInfer_edgellm_plugin.so"
```
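
The snippet below shows a common pattern for making a custom plugin library visible to TensorRT from Python; it is an illustrative sketch, and `plugin_utils.py` may load the library differently:

```python
import ctypes
import tensorrt as trt

# Path placeholder, matching DEFAULT_PLUGIN_PATH above.
PLUGIN_PATH = "/path/to/your/TensorRT-Edge-LLM/build/libNvInfer_edgellm_plugin.so"

# Loading the shared library with RTLD_GLOBAL lets its plugin creators register
# themselves with TensorRT's global plugin registry.
ctypes.CDLL(PLUGIN_PATH, mode=ctypes.RTLD_GLOBAL)

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

# Print the registered plugin creators to confirm the attention plugin is visible.
for creator in trt.get_plugin_registry().plugin_creator_list:
    print(creator.name, creator.plugin_version)
```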

### Additional Examples

Two comprehensive examples are provided in `examples/dynamo/` to demonstrate plugin usage:

- **`attention_plugin_example.py`**: Standalone example showing how to use the AttentionPlugin with custom models.
- **`end_to_end_llm_generation_example.py`**: End-to-end LLM generation example with plugin integration.

These examples can serve as references for integrating the plugin into your own applications.
## Extension

This codebase can be extended to:
- Add new models by specifying their HuggingFace name.
- Implement new cache strategies by adding FX graph passes.
- Customize SDPA conversion for new attention mechanisms.
- Add new backend implementations (see `plugin_utils.py` for a plugin backend reference; a purely illustrative registry sketch follows this list).
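
For a sense of what a new backend hook can look like, here is a purely hypothetical registry pattern; the names `compile_sdpa`, `compile_plugin`, and `compile_model` are illustrative and do not describe the actual structure of `plugin_utils.py`:

```python
from typing import Callable, Dict

# Hypothetical backend registry: each backend name maps to a compile function
# that takes a model and returns a TensorRT-ready callable.
BACKENDS: Dict[str, Callable] = {}

def register_backend(name: str):
    """Decorator that records a compile function under a backend name."""
    def wrapper(fn: Callable) -> Callable:
        BACKENDS[name] = fn
        return fn
    return wrapper

@register_backend("sdpa")
def compile_sdpa(model):
    # Placeholder: apply the SDPA lowering pass and compile with Torch-TensorRT.
    return model

@register_backend("plugin")
def compile_plugin(model):
    # Placeholder: insert the Edge-LLM attention plugin and compile.
    return model

def compile_model(model, backend: str = "sdpa"):
    return BACKENDS[backend](model)
```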

## Limitations
- We do not currently support sliding window attention (used in Gemma3 and Qwen 3 models).
