--prompt "What is parallel programming?" --precision FP16 --num_tokens 30 --enable_pytorch_run
```

<details>
<summary>Expected Output</summary>

```
========= PyTorch =========
PyTorch model generated text: What is parallel programming? Parallel programming is a technique used to improve the performance of a program by dividing the work into smaller tasks and executing them simultaneously on multiple processors or cores.
===================================
========= TensorRT =========
TensorRT model generated text: What is parallel programming? Parallel programming is a technique used to improve the performance of a program by dividing the work into smaller tasks and executing them simultaneously on multiple processors or cores.
===================================
PyTorch and TensorRT outputs match: True
```

</details>

```bash
--prompt "What is parallel programming?" --precision FP16 --num_tokens 30 --enable_pytorch_run
```

<details>
<summary>Expected Output</summary>

```
========= PyTorch =========
PyTorch model generated text: What is parallel programming? What are the benefits of parallel programming? What are the challenges of parallel programming? What are the different types of parallel programming? What are the advantages of
===================================
========= TensorRT =========
TensorRT model generated text: What is parallel programming? What are the benefits of parallel programming? What are the challenges of parallel programming? What are the different types of parallel programming? What are the advantages of
===================================
PyTorch and TensorRT outputs match: True
```

</details>

**2. Benchmarking for Performance Comparison**

*Plugin Backend (compares TensorRT-Plugin vs PyTorch):*
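
As a sketch of how such a benchmark run might look (the `run_llm.py` script name and the model placeholder are assumptions; the flags themselves are documented below):

```bash
# Sketch only: benchmark the plugin backend against a PyTorch baseline.
# "run_llm.py" and the model placeholder are assumptions; substitute your script and model.
python run_llm.py --model <hf-model-name-or-path> --backend plugin --precision FP16 \
    --benchmark --isl 2048 --batch_size 1 --iterations 5 --enable_pytorch_run
```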
- `--model`: Name or path of the HuggingFace LLM/VLM.
- `--tokenizer`: (Optional) Tokenizer name; defaults to the model name.
- `--backend`: Backend to use (`sdpa` or `plugin`). Default is `sdpa`. Only applicable to LLM models.

**Generation Settings:**

- `--prompt`: Input prompt for generation (generation mode only; ignored in benchmark mode).
- `--image_path`: (Optional) Path to an input image file for VLM models. If not provided, a sample image is used.
- `--precision`: Precision mode (`FP16`, `FP32`).
- `--num_tokens`: Number of output tokens to generate.

**Cache and Optimization:**

- `--cache`: KV cache type for the SDPA backend (`static_v1`, `static_v2`, or empty for no KV caching).
  - Note: Not applicable to the plugin backend, which manages its cache internally.

**Benchmarking:**

- `--benchmark`: Enable benchmarking mode (uses random inputs instead of the prompt).
- `--iterations`: Number of benchmark iterations. Default is 5.
- `--isl`: Input sequence length for benchmarking. Default is 2048.
- `--batch_size`: Batch size for benchmarking. Default is 1.
- `--enable_pytorch_run`: Also run and compare a PyTorch baseline.
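
As an illustration of how the generation flags combine for a VLM, here is a sketch; the `run_llm.py` script name, the model placeholder, and the image path are assumptions, not values taken from this project.

```bash
# Sketch only: generation mode for a VLM with an input image.
# Script name, model placeholder, and image path are assumptions.
python run_llm.py --model <hf-vlm-name-or-path> --image_path ./sample_image.jpg \
    --prompt "Describe this image." --precision FP16 --num_tokens 64
```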
### Caching Strategies
#### SDPA Backend

- **Static Cache v1/v2:** Adds static KV cache tensors as model inputs/outputs for efficient reuse.
- **No Cache:** Standard autoregressive decoding.

Please read our tutorial on how static cache is implemented.
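
As a sketch of enabling the static cache on the SDPA backend (the `run_llm.py` script name and model placeholder are assumptions):

```bash
# Sketch only: SDPA backend with static KV cache v1.
python run_llm.py --model <hf-model-name-or-path> --backend sdpa --cache static_v1 \
    --prompt "What is parallel programming?" --precision FP16 --num_tokens 30
```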
#### Plugin Backend

The plugin backend uses the TensorRT Edge-LLM AttentionPlugin, which manages the KV cache internally. The `--cache` option is not applicable and is ignored if specified with `--backend plugin`.
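
For example (again a sketch with an assumed script name and a model placeholder), a plugin-backend run simply omits `--cache`:

```bash
# Sketch only: plugin backend; the AttentionPlugin manages the KV cache, so --cache is omitted.
python run_llm.py --model <hf-model-name-or-path> --backend plugin \
    --prompt "What is parallel programming?" --precision FP16 --num_tokens 30
```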
## Plugin Backend Setup

To use the plugin backend (`--backend plugin`), you need to build the TensorRT Edge-LLM AttentionPlugin library.

### Building the AttentionPlugin

Currently, plugin support requires a custom build from a feature branch:

```bash
# Clone the repository with the torch-tensorrt-python-runtime feature
git clone -b feature/torch-tensorrt-python-runtime https://github.com/chohk88/TensorRT-Edge-LLM.git
```

> - `-DCUDA_VERSION`: CUDA version (e.g., `12.9`, `12.6`)
>
> Refer to the [TensorRT-Edge-LLM build documentation](https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime) for complete build instructions and dependencies.
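
As a rough sketch of what an out-of-tree CMake build might look like (the exact targets and required options are assumptions; follow the linked build documentation for the authoritative steps):

```bash
# Sketch only: typical out-of-source CMake build; exact options may differ.
cd TensorRT-Edge-LLM
mkdir -p build && cd build
cmake .. -DCUDA_VERSION=12.9   # -DCUDA_VERSION as described in the note above
make -j"$(nproc)"
```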
After building, update the plugin path in `plugin_utils.py`:
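
The exact variable to edit depends on `plugin_utils.py`; as a sketch, assuming it exposes a module-level constant such as `PLUGIN_LIB_PATH` (a hypothetical name), the path can be pointed at the freshly built library:

```bash
# Sketch only: PLUGIN_LIB_PATH and the library filename are assumptions; check
# plugin_utils.py for the actual constant and the expected .so name before editing.
sed -i 's|^PLUGIN_LIB_PATH = .*|PLUGIN_LIB_PATH = "/path/to/TensorRT-Edge-LLM/build/libAttentionPlugin.so"|' plugin_utils.py
```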