Commit fc82f22 (1 parent: a670767)

[Docs] Update docs (#17565)

* Support new params
* Update docs
* Polish PaddleOCR-VL docs
* Add passing-list notice
* Polish
* Fix local path
* Add notes on hosted VLM services
* Update code
* Update MCP server docs
* Limit lower bound of paddlex
* Update API reference
* Fix workflow
* Fix docs
* Add iluvatar dockerfiles
* Bump lower version bound of PaddleX
* concatenate-markdown-pages -> concatenate-pages
* Support new params
* Add missing param
* Update desc for use_polygon_points
* Fix bug
* Fix
* Fix bugs
* Update interface
* Add missing doc
* Fix typo
* Fix and update
* Fix bug
* Update
* Update and fix
* Fix bugs and support multi-platform build
* Fix bugs
* Fix bugs
* Update documentation for PaddleOCR-VL-1.5
* Delete unused file
* Reset paddlex lower bound version
* Remove PP-StructureV3 concatenate-pages
* Remove PPStructureV3.concatenate_pages
* Install common fonts
* Update docs
* Remove use_polygon_points and add layout_shape_mode
* Update concatenate markdown pages
* Update for PaddleOCR-VL-1.5
* Update for restructure_pages
* Update doc
* Update 3060 doc
* Standardize docker image tags
* Fix name
* Optimize build scripts
* update doc
* update doc
* update doc
* Limit version

Co-authored-by: zhouchangda <zhouchangda@baidu.com>

File tree

3 files changed: +120 −57 lines

docs/version3.x/pipeline_usage/PaddleOCR-VL.en.md

Lines changed: 54 additions & 28 deletions
@@ -6,7 +6,9 @@ comments: true
 
 PaddleOCR-VL is an advanced and efficient document parsing model designed specifically for element recognition in documents. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful Vision-Language Model (VLM) composed of a NaViT-style dynamic-resolution visual encoder and the ERNIE-4.5-0.3B language model, enabling precise element recognition. The model supports 109 languages and excels at recognizing complex elements (such as text, tables, formulas, and charts) while maintaining extremely low resource consumption. Comprehensive evaluations on widely used public and internal benchmarks demonstrate that PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions, multimodal document-parsing approaches, and advanced general-purpose multimodal large models, while offering faster inference. These advantages make it highly suitable for deployment in real-world scenarios.
 
-<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/refs/heads/main/images/paddleocr_vl/metrics/allmetric.png"/>
+**On January 29, 2026, we released PaddleOCR-VL-1.5. It not only raises accuracy on the OmniDocBench v1.5 evaluation set significantly, to 94.5%, but also innovatively supports localizing irregularly shaped bounding boxes. As a result, PaddleOCR-VL-1.5 performs strongly in real-world conditions such as skewed, warped, screen-photographed, unevenly lit, and scanned documents. The model also adds seal (stamp) recognition and text detection and recognition capabilities, with key metrics continuing to lead the industry.**
+
+<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/refs/heads/main/images/paddleocr_vl_1_5/paddleocr-vl-1.5_metrics.png"/>
 
 ## Process Guide
 

@@ -638,14 +640,12 @@ for res in output:
     res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
 ```
 
-For PDF files, each page will be processed individually and generate a separate Markdown file. If you want to convert the entire PDF to a single Markdown file, use the following method:
+For PDF files, each page is processed individually and a separate Markdown file is generated for each page. If you want to merge tables across pages, reconstruct multi-level titles, or merge the multi-page results, you can do so with the following method:
 
 ```python
-from pathlib import Path
 from paddleocr import PaddleOCRVL
 
 input_file = "./your_pdf_file.pdf"
-output_path = Path("./output")
 
 # NVIDIA GPU
 pipeline = PaddleOCRVL()
@@ -658,28 +658,16 @@ pipeline = PaddleOCRVL()
 
 output = pipeline.predict(input=input_file)
 
-markdown_list = []
-markdown_images = []
+pages_res = list(output)
 
+output = pipeline.restructure_pages(pages_res)
+# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
+# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
+# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge all pages into one
 for res in output:
-    md_info = res.markdown
-    markdown_list.append(md_info)
-    markdown_images.append(md_info.get("markdown_images", {}))
-
-markdown_texts = pipeline.concatenate_markdown_pages(markdown_list)
-
-mkd_file_path = output_path / f"{Path(input_file).stem}.md"
-mkd_file_path.parent.mkdir(parents=True, exist_ok=True)
-
-with open(mkd_file_path, "w", encoding="utf-8") as f:
-    f.write(markdown_texts)
-
-for item in markdown_images:
-    if item:
-        for path, image in item.items():
-            file_path = output_path / path
-            file_path.parent.mkdir(parents=True, exist_ok=True)
-            image.save(file_path)
+    res.print() ## Print the structured prediction output
+    res.save_to_json(save_path="output") ## Save the current image's structured result in JSON format
+    res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
 ```
 
 If you need to process multiple files, **it is recommended to pass the directory path containing the files or a list of file paths to the `predict` method** to maximize processing efficiency. For example:
@@ -697,7 +685,6 @@ output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]
 
 **Note:**
 
-- In the example code, the parameters `use_doc_orientation_classify` and `use_doc_unwarping` are all set to `False` by default. These indicate that document orientation classification and document image unwarping are disabled. You can manually set them to `True` if needed.
 
 The above Python script performs the following steps:
 

@@ -1217,8 +1204,6 @@ Setting it to <code>None</code> means using the instantiation parameter; otherwi
 <li><code>chart_max_pixels</code>: Maximum resolution for charts</li>
 <li><code>formula_min_pixels</code>: Minimum resolution for formulas</li>
 <li><code>formula_max_pixels</code>: Maximum resolution for formulas</li>
-<li><code>spotting_min_pixels</code>: Minimum resolution for grounding</li>
-<li><code>spotting_max_pixels</code>: Maximum resolution for grounding</li>
 <li><code>seal_min_pixels</code>: Minimum resolution for seals</li>
 <li><code>seal_max_pixels</code>: Maximum resolution for seals</li>
 </ul></td>
@@ -1227,7 +1212,48 @@ Setting it to <code>None</code> means using the instantiation parameter; otherwi
 </tr>
 </table>
 </details>
-<details><summary>(3) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>
+
+<details><summary>(3) Invoke the <code>restructure_pages()</code> method of the PaddleOCR-VL object to restructure the per-page results from multi-page inference. The method returns either a restructured multi-page result or a merged single-page result. The parameters of the <code>restructure_pages()</code> method are described below:</summary>
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Default Value</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>res_list</code></td>
+<td><b>Meaning:</b> The list of per-page results predicted from a multi-page PDF.</td>
+<td><code>list|None</code></td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>merge_tables</code></td>
+<td><b>Meaning:</b> Controls whether tables that span pages are merged.</td>
+<td><code>bool</code></td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>relevel_titles</code></td>
+<td><b>Meaning:</b> Controls whether multi-level titles (headings) are reconstructed.</td>
+<td><code>bool</code></td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>concatenate_pages</code></td>
+<td><b>Meaning:</b> Controls whether the multi-page results are concatenated into a single page.</td>
+<td><code>bool</code></td>
+<td><code>False</code></td>
+</tr>
+</tbody>
+</table>
+</details>
+
+
+<details><summary>(4) Process the prediction results: The prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>
 <table>
 <thead>
 <tr>
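The new `restructure_pages()` flow in this diff replaces the manual `concatenate_markdown_pages` loop. As a rough, illustrative sketch of what concatenating per-page Markdown results amounts to (this is not the library's implementation, and the helper name here is hypothetical; the real method also merges cross-page tables and relevels titles):

```python
# Hypothetical helper, for illustration only: join per-page Markdown fragments
# into a single document, skipping empty pages. The real pipeline's
# restructure_pages() does far more (cross-page table merging, title releveling).
def concatenate_markdown_pages(pages: list[str]) -> str:
    """Join non-empty per-page Markdown strings with a blank line between pages."""
    return "\n\n".join(page.strip() for page in pages if page.strip())

pages = ["# Report\n\nPage one text.", "More text on page two.", ""]
print(concatenate_markdown_pages(pages))
```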

docs/version3.x/pipeline_usage/PaddleOCR-VL.md

Lines changed: 61 additions & 24 deletions
@@ -6,7 +6,9 @@ comments: true
 
 PaddleOCR-VL is an advanced and efficient document parsing model designed for element recognition in documents. Its core component, PaddleOCR-VL-0.9B, is a compact yet powerful vision-language model (VLM) composed of a NaViT-style dynamic-resolution visual encoder and the ERNIE-4.5-0.3B language model, enabling precise element recognition. The model supports 109 languages and excels at recognizing complex elements (such as text, tables, formulas, and charts) while keeping resource consumption extremely low. Comprehensive evaluations on widely used public and internal benchmarks show that PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing pipeline-based solutions, multimodal document-parsing approaches, and advanced general-purpose multimodal large models, with faster inference, making it well suited for real-world deployment.
 
-<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/refs/heads/main/images/paddleocr_vl/metrics/allmetric.png"/>
+**On January 29, 2026, we released PaddleOCR-VL-1.5. It not only raises accuracy on the OmniDocBench v1.5 evaluation set substantially, to 94.5%, but also innovatively supports localizing irregularly shaped boxes, so PaddleOCR-VL-1.5 performs well in real-world conditions such as scanned, skewed, warped, screen-photographed, and poorly lit documents. The model also adds seal recognition and text detection and recognition capabilities, with key metrics continuing to lead the industry.**
+
+<img src="https://raw.githubusercontent.com/cuicheng01/PaddleX_doc_images/refs/heads/main/images/paddleocr_vl_1_5/paddleocr-vl-1.5_metrics.png"/>
 
 ## Process Guide
 

@@ -616,10 +618,9 @@ for res in output:
     res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
 ```
 
-For PDF files, each page of the PDF is processed separately, and each page gets its own Markdown result. If you want to convert the entire PDF into a single Markdown file, it is recommended to run it as follows:
+For PDF files, each page of the PDF is processed separately, and each page gets its own Markdown result. If you want to merge tables across pages, reconstruct multi-level titles, or merge the multi-page results, you can do so as follows:
 
 ```python
-from pathlib import Path
 from paddleocr import PaddleOCRVL
 
 input_file = "./your_pdf_file.pdf"
@@ -636,28 +637,32 @@ pipeline = PaddleOCRVL()
 
 output = pipeline.predict(input=input_file)
 
-markdown_list = []
-markdown_images = []
+pages_res = list(output)
 
-for res in output:
-    md_info = res.markdown
-    markdown_list.append(md_info)
-    markdown_images.append(md_info.get("markdown_images", {}))
+output = pipeline.restructure_pages(pages_res)
+
+# output = pipeline.restructure_pages(pages_res, merge_tables=True) # Merge tables across pages
+# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True) # Merge tables across pages and reconstruct multi-level titles
+# output = pipeline.restructure_pages(pages_res, merge_tables=True, relevel_titles=True, concatenate_pages=True) # Merge tables across pages, reconstruct multi-level titles, and merge all pages into one
 
-markdown_texts = pipeline.concatenate_markdown_pages(markdown_list)
 
-mkd_file_path = output_path / f"{Path(input_file).stem}.md"
-mkd_file_path.parent.mkdir(parents=True, exist_ok=True)
+for res in output:
+    res.print() ## Print the structured prediction output
+    res.save_to_json(save_path="output") ## Save the current image's structured result in JSON format
+    res.save_to_markdown(save_path="output") ## Save the current image's result in Markdown format
+```
 
-with open(mkd_file_path, "w", encoding="utf-8") as f:
-    f.write(markdown_texts)
+If you need to process multiple files, **it is recommended to pass the directory containing the files, or a list of file paths, to the `predict` method** to maximize processing efficiency. For example:
 
-for item in markdown_images:
-    if item:
-        for path, image in item.items():
-            file_path = output_path / path
-            file_path.parent.mkdir(parents=True, exist_ok=True)
-            image.save(file_path)
+```python
+# The `imgs` directory contains several images to process: file1.png, file2.png, file3.png
+# Pass the directory path
+output = pipeline.predict("imgs")
+# Or pass a list of file paths
+output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"])
+# Both approaches above are more efficient than the following:
+# for file in ["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]:
+#     output = pipeline.predict(file)
 ```
 
 If you need to process multiple files, **it is recommended to pass the directory containing the files, or a list of file paths, to the `predict` method** to maximize processing efficiency. For example:
@@ -1101,7 +1106,7 @@ output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]
 <td><code>prompt_label</code></td>
 <td><b>Meaning:</b> The prompt type for the VL model.<br/>
 <b>Notes:</b>
-Takes effect only when <code>use_layout_detection=False</code>. Valid values are <code>ocr</code>, <code>formula</code>, <code>table</code>, and <code>chart</code>.</td>
+Takes effect only when <code>use_layout_detection=False</code>. Valid values are <code>ocr</code>, <code>formula</code>, <code>table</code>, <code>seal</code>, <code>chart</code>, and <code>spotting</code>.</td>
 <td><code>str|None</code></td>
 <td><code>None</code></td>
 </tr>
@@ -1174,8 +1179,6 @@ output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]
 <li><code>chart_max_pixels</code>: Maximum resolution for charts</li>
 <li><code>formula_min_pixels</code>: Minimum resolution for formulas</li>
 <li><code>formula_max_pixels</code>: Maximum resolution for formulas</li>
-<li><code>spotting_min_pixels</code>: Minimum resolution for grounding</li>
-<li><code>spotting_max_pixels</code>: Maximum resolution for grounding</li>
 <li><code>seal_min_pixels</code>: Minimum resolution for seals</li>
 <li><code>seal_max_pixels</code>: Maximum resolution for seals</li>
 </ul></td>
@@ -1185,7 +1188,41 @@ output = pipeline.predict(["imgs/file1.png", "imgs/file2.png", "imgs/file3.png"]
 </table>
 </details>
 
-<details><summary>(3) Process the prediction results: the prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>
+<details><summary>(3) Invoke the <code>restructure_pages()</code> method of the PaddleOCR-VL object to restructure the per-page results from multi-page inference. The method returns either a restructured multi-page result or a merged single-page result. The parameters of the <code>restructure_pages()</code> method are described below:</summary>
+<table>
+<thead>
+<tr>
+<th>Parameter</th>
+<th>Description</th>
+<th>Type</th>
+<th>Default Value</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>res_list</code></td>
+<td><b>Meaning:</b> The list of per-page results from multi-page PDF inference.</td>
+<td><code>list|None</code></td>
+<td><code>None</code></td>
+</tr>
+<tr>
+<td><code>merge_tables</code></td>
+<td><b>Meaning:</b> Controls whether tables that span pages are merged.</td>
+<td><code>bool</code></td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>relevel_titles</code></td>
+<td><b>Meaning:</b> Controls whether multi-level titles (headings) are reconstructed.</td>
+<td><code>bool</code></td>
+<td><code>True</code></td>
+</tr>
+<tr>
+<td><code>concatenate_pages</code></td>
+<td><b>Meaning:</b> Controls whether the multi-page results are concatenated into a single page.</td>
+<td><code>bool</code></td>
+<td><code>False</code></td>
+</tr>
+</tbody>
+</table>
+</details>
+
+<details><summary>(4) Process the prediction results: the prediction result for each sample is a corresponding Result object, supporting operations such as printing, saving as an image, and saving as a <code>json</code> file:</summary>
 
 <table>
 <thead>
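The `merge_tables` option documented above addresses tables that are split across a page boundary. As a rough illustration of the idea (a sketch under assumptions, not the pipeline's algorithm; detecting that a table actually continues onto the next page is the hard part), merging amounts to appending the continuation rows to the first fragment:

```python
# Illustration only: append continuation rows (which carry no header row) to
# the Markdown table fragment from the previous page. This hypothetical helper
# assumes the caller already knows the two fragments belong to one table.
def merge_split_table(first_fragment: str, continuation_rows: str) -> str:
    """Concatenate a table fragment and its header-less continuation rows."""
    return first_fragment.rstrip("\n") + "\n" + continuation_rows.strip("\n")

page1_table = "| Item | Qty |\n| --- | --- |\n| pens | 3 |"
page2_rows = "| pads | 5 |"
print(merge_split_table(page1_table, page2_rows))
```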

pyproject.toml

Lines changed: 5 additions & 5 deletions
@@ -39,7 +39,7 @@ classifiers = [
     "Topic :: Utilities",
 ]
 dependencies = [
-    "paddlex[ocr-core]>=3.3.0,<3.4.0",
+    "paddlex[ocr-core]>=3.4.0,<3.5.0",
     "PyYAML>=6",
     "requests",
     "typing-extensions>=4.12",
@@ -55,10 +55,10 @@ issues = "https://github.com/PaddlePaddle/PaddleOCR/issues"
 paddleocr = "paddleocr.__main__:console_entry"
 
 [project.optional-dependencies]
-doc-parser = ["paddlex[ocr,genai-client]>=3.3.0,<3.4.0"]
-ie = ["paddlex[ie]>=3.3.0,<3.4.0"]
-trans = ["paddlex[trans]>=3.3.0,<3.4.0"]
-all = ["paddlex[ocr,genai-client,ie,trans]>=3.3.0,<3.4.0"]
+doc-parser = ["paddlex[ocr,genai-client]>=3.4.0,<3.5.0"]
+ie = ["paddlex[ie]>=3.4.0,<3.5.0"]
+trans = ["paddlex[trans]>=3.4.0,<3.5.0"]
+all = ["paddlex[ocr,genai-client,ie,trans]>=3.4.0,<3.5.0"]
 
 [tool.setuptools.packages.find]
 where = ["."]
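The pyproject.toml change pins paddlex to the 3.4.x series. A small, dependency-free sketch of which versions a `>=3.4.0,<3.5.0` bound accepts (simple dotted numeric versions only; real resolvers such as pip implement the full PEP 440 specifier grammar, e.g. via the `packaging` library):

```python
# Minimal version-bound check for plain numeric versions like "3.4.7".
# Real tools implement the full PEP 440 rules (pre-releases, epochs, etc.).
def satisfies(version: str, lower: str, upper: str) -> bool:
    """Return True if lower <= version < upper, compared numerically per part."""
    def as_tuple(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return as_tuple(lower) <= as_tuple(version) < as_tuple(upper)

for candidate in ("3.3.9", "3.4.0", "3.4.7", "3.5.0"):
    print(candidate, satisfies(candidate, "3.4.0", "3.5.0"))
```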
