docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md
54 additions & 28 deletions
@@ -6,7 +6,9 @@ comments: true
PaddleOCR-VL is an advanced and efficient document parsing model designed specifically for element recognition in documents. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful Vision-Language Model (VLM) composed of a NaViT-style dynamic resolution visual encoder and the ERNIE-4.5-0.3B language model, enabling precise element recognition. The model supports 109 languages and excels in recognizing complex elements (such as text, tables, formulas, and charts) while maintaining extremely low resource consumption. Comprehensive evaluations on widely used public benchmarks and internal benchmarks demonstrate that PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing Pipeline-based solutions, document parsing multimodal schemes, and advanced general-purpose multimodal large models, while offering faster inference speeds. These advantages make it highly suitable for deployment in real-world scenarios.
**On January 29, 2026, we released PaddleOCR-VL-1.5. PaddleOCR-VL-1.5 not only significantly improved the accuracy on the OmniDocBench v1.5 evaluation set to 94.5%, but also innovatively supports irregular-shaped bounding box localization. As a result, PaddleOCR-VL-1.5 demonstrates outstanding performance in real-world scenarios such as Skew, Warping, Screen Photography, Illumination, and Scanning. In addition, the model has added new capabilities for seal (stamp) recognition and text detection and recognition, with key metrics continuing to lead the industry.**
res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
```
- For PDF files, each page will be processed individually and generate a separate Markdown file. If you want to convert the entire PDF to a single Markdown file, use the following method:
+ For PDF files, each page will be processed individually, and a separate Markdown file will be generated for each page. If you wish to merge tables across pages, reconstruct multi-level titles, or merge multi-page results, you can do so using the following method:
```python
- from pathlib import Path
from paddleocr import PaddleOCRVL

input_file = "./your_pdf_file.pdf"
- output_path = Path("./output")

# NVIDIA GPU
pipeline = PaddleOCRVL()
@@ -658,28 +658,16 @@ pipeline = PaddleOCRVL()
output = pipeline.predict(input=input_file)
- markdown_list = []
- markdown_images = []
+ pages_res = list(output)
+ output = pipeline.restructure_pages(pages_res)
+ # output = pipeline.restructure_pages(pages_res, merge_tables=True)  ## Merge tables across pages
+ # output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True)  ## Merge tables across pages and reconstruct multi-level titles
+ # output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True)  ## Merge tables across pages, reconstruct multi-level titles, and concatenate all pages into one
res.print() ## Print the structured prediction output
+ res.save_to_json(save_path="output") ## Save the current image's structured result in JSON format
+ res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
```
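When `concatenate_pages` is left off, each page still produces its own Markdown file. If a single combined document is wanted, the per-page files can also be concatenated with the standard library alone. The sketch below is an illustration, not part of the PaddleOCR API; the per-page `.md` naming under the output directory is an assumption:

```python
from pathlib import Path


def concatenate_markdown(page_dir: str, merged_path: str) -> str:
    """Concatenate per-page Markdown files (sorted by filename) into one document."""
    pages = sorted(Path(page_dir).glob("*.md"))
    merged = "\n\n".join(p.read_text(encoding="utf-8") for p in pages)
    Path(merged_path).write_text(merged, encoding="utf-8")
    return merged
```

Sorting by filename assumes the pipeline numbers the per-page files so lexicographic order matches page order; verify the naming of your output files before relying on this.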
If you need to process multiple files, **it is recommended to pass the directory path containing the files or a list of file paths to the `predict` method** to maximize processing efficiency. For example:
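A minimal sketch of building such a list from a directory with the standard library; `collect_inputs` is a hypothetical helper, not part of the PaddleOCR API, and the extension set is an assumption:

```python
from pathlib import Path


def collect_inputs(directory: str) -> list[str]:
    """Gather supported input files under a directory into a sorted list of paths
    that can be passed to `pipeline.predict(input=...)` in one call."""
    exts = {".pdf", ".png", ".jpg", ".jpeg"}
    return sorted(str(p) for p in Path(directory).rglob("*") if p.suffix.lower() in exts)


# inputs = collect_inputs("./my_documents")
# output = pipeline.predict(input=inputs)  # one batched call instead of a per-file loop
```

Passing the whole list at once lets the pipeline batch work internally rather than paying per-call overhead for each file.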
- In the example code, the parameters `use_doc_orientation_classify` and `use_doc_unwarping` are both set to `False` by default. These indicate that document orientation classification and document image unwarping are disabled. You can manually set them to `True` if needed.
The above Python script performs the following steps:
@@ -1217,8 +1204,6 @@ Setting it to <code>None</code> means using the instantiation parameter; otherwi
<li><code>chart_max_pixels</code>: Maximum resolution for charts</li>
<li><code>formula_min_pixels</code>: Minimum resolution for formulas</li>
<li><code>formula_max_pixels</code>: Maximum resolution for formulas</li>
- <li><code>spotting_min_pixels</code>: Minimum resolution for grounding</li>
- <li><code>spotting_max_pixels</code>: Maximum resolution for grounding</li>
<li><code>seal_min_pixels</code>: Minimum resolution for seals</li>
<li><code>seal_max_pixels</code>: Maximum resolution for seals</li>
</ul></td>
@@ -1227,7 +1212,48 @@ Setting it to <code>None</code> means using the instantiation parameter; otherwi
</tr>
</table>
</details>
- <details><summary>(3) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>
+ <details><summary>(3) Invoke the <code>restructure_pages()</code> method of the PaddleOCR-VL object to reconstruct pages from the list of per-page inference results. The method returns the reconstructed multi-page results, or a single merged result. The parameters of the <code>restructure_pages()</code> method are described below:</summary>
+ <table>
+ <thead>
+ <tr>
+ <th>Parameter</th>
+ <th>Description</th>
+ <th>Type</th>
+ <th>Default Value</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><code>res_list</code></td>
+ <td><b>Meaning:</b> The list of per-page results predicted from inference on a multi-page PDF.</td>
+ <td><code>list|None</code></td>
+ <td><code>None</code></td>
+ </tr>
+ <tr>
+ <td><code>merge_tables</code></td>
+ <td><b>Meaning:</b> Controls whether to merge tables that span multiple pages.</td>
+ <td><code>bool</code></td>
+ <td><code>True</code></td>
+ </tr>
+ <tr>
+ <td><code>relevel_titles</code></td>
+ <td><b>Meaning:</b> Controls whether to reconstruct the multi-level title hierarchy across pages.</td>
+ <td><code>bool</code></td>
+ <td><code>True</code></td>
+ </tr>
+ <tr>
+ <td><code>concatenate_pages</code></td>
+ <td><b>Meaning:</b> Controls whether to concatenate the multi-page results into a single page.</td>
+ <td><code>bool</code></td>
+ <td><code>False</code></td>
+ </tr>
+ </tbody>
+ </table>
+ </details>
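Conceptually, the cross-page table merge (`merge_tables`) joins a table fragment cut off at the bottom of one page with its continuation at the top of the next. A deliberately simplified illustration of the idea on plain HTML strings follows; this is not the pipeline's actual implementation:

```python
def merge_html_tables(first: str, second: str) -> str:
    """Naively merge two HTML table fragments by appending the second
    table's rows after the first table's rows."""
    # Drop the closing </table> of the first fragment...
    head = first[: first.rindex("</table>")]
    # ...and drop the opening <table ...> tag of the second fragment.
    tail = second[second.index(">", second.index("<table")) + 1 :]
    return head + tail
```

The real pipeline additionally has to decide *whether* two fragments belong to the same logical table (column counts, position on the page, and so on); this sketch only shows the splice itself.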
+
+ <details><summary>(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>