fix(backend): improve Excel table bounds detection and flatten merged cells #2778

Ra5hidIslam · 2025-12-13T08:19:44Z

fix(backend): improve Excel table bounds detection and flatten merged cells

Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.

Key changes:

Region Growing Algorithm: Replaced the previous explicit boundary finding logic with a region-growing strategy that uses a GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.
Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.
Refactoring: Removed the _find_table_bottom and _find_table_right helper methods as their logic is now integrated into the region expansion loop.
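
For readers who have not opened the diff, the region-growing idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the names `grow_region` and `_has_data`, the list-of-lists grid, and the choice to grow only down and right from the anchor are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]


def _has_data(grid: Grid, rows: range, cols: range) -> bool:
    """Return True if any cell inside the row/column window is non-empty."""
    for r in rows:
        for c in cols:
            in_bounds = 0 <= r < len(grid) and 0 <= c < len(grid[r])
            if in_bounds and grid[r][c] not in (None, ""):
                return True
    return False


def grow_region(
    grid: Grid, anchor: Tuple[int, int], gap_tolerance: int = 3
) -> Tuple[int, int, int, int]:
    """Grow a bounding box down and right from `anchor` (hypothetical helper).

    Up to `gap_tolerance` consecutive empty rows or columns are bridged.
    Returns (top, left, bottom, right), all inclusive.
    """
    top, left = anchor
    bottom, right = top, left
    n_rows = len(grid)
    n_cols = max((len(row) for row in grid), default=0)

    changed = True
    while changed:
        changed = False
        # Look below the box, skipping at most `gap_tolerance` empty rows.
        for step in range(1, gap_tolerance + 2):
            r = bottom + step
            if r >= n_rows:
                break
            if _has_data(grid, range(r, r + 1), range(left, right + 1)):
                bottom, changed = r, True
                break
        # Look right of the box, skipping at most `gap_tolerance` empty columns.
        for step in range(1, gap_tolerance + 2):
            c = right + step
            if c >= n_cols:
                break
            if _has_data(grid, range(top, bottom + 1), range(c, c + 1)):
                right, changed = c, True
                break
    return top, left, bottom, right


sheet: Grid = [
    ["a", "b", None, None, None],
    ["1", "2", None, None, None],
    [None, None, None, None, None],
    [None, None, None, None, None],
    ["3", "4", None, None, "x"],
]
print(grow_region(sheet, (0, 0), gap_tolerance=0))
print(grow_region(sheet, (0, 0), gap_tolerance=3))
```

With a tolerance of 0 only directly adjacent rows and columns are absorbed, so the two clusters in `sheet` stay separate; with a tolerance of 3 the two empty rows are bridged and everything ends up in one box.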

Issue resolved by this Pull Request: Resolves #834

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.
msexcel_backend.py

github-actions · 2025-12-13T08:19:56Z

✅ DCO Check Passed

Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉

mergify · 2025-12-13T08:20:19Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
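
To check a title locally before pushing, the same pattern can be tried in Python. This is only a sketch: it assumes the `~=` condition behaves like a regular-expression search on the title, and `is_conventional_title` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional_title(
    "fix(backend): improve Excel table bounds detection and flatten merged cells"
))
print(is_conventional_title("Made changes to the _find_table_bounds function."))
```

The first title carries the `fix(backend):` prefix and matches; the second has no type prefix and does not.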

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

Michele-Zhu · 2025-12-14T22:04:33Z

Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach.

I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables.

Here is a test case that I expect to fail (I haven't run your code, so I cannot guarantee it):
edge_cases.xlsx

Ra5hidIslam · 2025-12-15T06:08:25Z

Hi @Michele-Zhu I have run my code for that excel file and this is the output:

Does it look fine or wrong to you?

codecov · 2025-12-15T08:13:14Z

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/msexcel_backend.py	0.00%	43 Missing ⚠️

📢 Thoughts on this report? Let us know!

Michele-Zhu · 2025-12-16T15:58:19Z

@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets.
Since the table-bound scan starts from the topmost left cell of a table, you'll also need to grow the region on the left side.

P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option treat_singleton_as_text.
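
The point about growing to the left can be illustrated with a small sketch. It assumes the initial anchor comes from a row-major scan (top to bottom, left to right), which is how the comment above describes it; `first_anchor` and `leftmost_connected_column` are hypothetical helpers, not functions from the backend.

```python
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]


def first_anchor(grid: Grid) -> Optional[Tuple[int, int]]:
    """Row-major scan: return the first non-empty cell as (row, col)."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value not in (None, ""):
                return r, c
    return None


def leftmost_connected_column(grid: Grid, anchor: Tuple[int, int]) -> int:
    """Flood-fill over 4-connected non-empty cells; return the smallest column."""
    stack = [anchor]
    seen = {anchor}
    min_col = anchor[1]
    while stack:
        r, c = stack.pop()
        min_col = min(min_col, c)
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (nr, nc) in seen:
                continue
            in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if in_bounds and grid[nr][nc] not in (None, ""):
                seen.add((nr, nc))
                stack.append((nr, nc))
    return min_col


sheet: Grid = [
    [None, None, "Title"],
    ["id", "name", "value"],
    ["1", "a", "x"],
]
anchor = first_anchor(sheet)
print(anchor, leftmost_connected_column(sheet, anchor))
```

Here the scan lands on the cell in column 2, while the connected data reaches back to column 0. A region that only grows down and right from that anchor would leave the two left-hand columns to be picked up as a separate table.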

Ra5hidIslam · 2025-12-22T10:01:23Z

Hi @Michele-Zhu ,

I have a few thoughts on the feedback:
Edge Case: I feel the edge case regarding connecting two sheets by text might be heading in the wrong direction. Is using attached_left for this type of connection considered an industry standard? It doesn't seem to align with typical use cases.

Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is.

Ra5hidIslam · 2026-01-13T09:59:19Z

Hi @ceberam should I change my approach ?

ceberam

Sorry @Ra5hidIslam for replying only now after your last message.

I like your region-growing strategy and I think we all agree that data like in the spreadsheet of issue #2626 should be considered as a single table instead of 2 different tables. The current release only addresses rectangular tables and your PR tackles other cases like #2626.

However, I have some comments that require changes:

GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.

I don't dislike using a gap tolerance since it may help consolidate some dispersed cells into a single table. But we need to adopt a default approach and I would keep the tolerance to 0 by default. The reason is simple: it is easier to merge tables in a post-processing step than to split tables that have been put together wrongly. You can define the gap tolerance as a backend parser option. We have done it recently with the treat_singleton_as_text option in the class MsExcelBackendOptions. You could apply the same pattern.
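
As a rough sketch of what such an option could look like: the class below is a hypothetical stand-in written for illustration, not the real MsExcelBackendOptions, and the field name `gap_tolerance`, the default shown for `treat_singleton_as_text`, and the helper `max_lookahead` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExcelOptionsSketch:
    """Hypothetical stand-in for the backend options class."""

    # Existing option mentioned above; the default shown here is assumed.
    treat_singleton_as_text: bool = False
    # Proposed option: number of empty rows/columns the table scan may bridge.
    # 0 keeps the conservative behaviour (no bridging) unless a user opts in.
    gap_tolerance: int = 0

    def __post_init__(self) -> None:
        if self.gap_tolerance < 0:
            raise ValueError("gap_tolerance must be >= 0")


def max_lookahead(options: ExcelOptionsSketch) -> int:
    """How many rows/columns past the current bounds a scan would inspect."""
    return options.gap_tolerance + 1


print(max_lookahead(ExcelOptionsSketch()))
print(max_lookahead(ExcelOptionsSketch(gap_tolerance=3)))
```

Keeping the default at 0 means existing users see no bridging of gaps, and anyone who wants dispersed cells consolidated has to request it explicitly.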

Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.

I assume the goal is to have a visual representation as close as possible to how the spreadsheet is rendered in most applications. However, this has drawbacks in many downstream applications (e.g., RAG, data generation for model fine-tuning). By removing the spans, we lose the information that may be useful to connect column and row information. Repeating values in the markdown serialization may look redundant but it preserves the values across the cells, since merged cells are not supported in Markdown standard syntax (i.e., with hyphens and pipes).
An alternative (but not for this PR) could be to customize the Markdown serialization. You can check the documentation on custom serialization.
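
The trade-off can be seen on a toy table. This is only a sketch under simple assumptions: merged cells are given as hypothetical `(top, left, row_span, col_span, text)` tuples, and `expand_cells` and `to_markdown` are illustrative helpers, not the project's serializer.

```python
from typing import List, Tuple

# (top, left, row_span, col_span, text): a hypothetical merged-cell record.
MergedCell = Tuple[int, int, int, int, str]


def expand_cells(
    cells: List[MergedCell], n_rows: int, n_cols: int, repeat: bool
) -> List[List[str]]:
    """Expand spans into a plain grid.

    repeat=True copies the text into every covered cell;
    repeat=False keeps it only in the top-left (head) cell.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for top, left, row_span, col_span, text in cells:
        for r in range(top, top + row_span):
            for c in range(left, left + col_span):
                is_head = r == top and c == left
                grid[r][c] = text if (repeat or is_head) else ""
    return grid


def to_markdown(grid: List[List[str]]) -> str:
    """Render a grid as a pipe table, using the first row as the header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells: List[MergedCell] = [
    (0, 0, 1, 1, "Region"),
    (0, 1, 1, 1, "Sales"),
    (1, 0, 2, 1, "EU"),  # "EU" is merged across two rows
    (1, 1, 1, 1, "10"),
    (2, 1, 1, 1, "20"),
]
print(to_markdown(expand_cells(cells, 3, 2, repeat=True)))
print(to_markdown(expand_cells(cells, 3, 2, repeat=False)))
```

With repetition, the row holding "20" still says which region it belongs to; with the flattened variant that row has an empty first column, so a consumer reading it in isolation loses the association.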

In addition, other considerations:

As mentioned by @Michele-Zhu, the table identification should not collide with the treat_singleton_as_text option (see PR #2589 for more background).
I also agree that the algorithm should expand the bounds on the left. On the first sheet in edge_cases.xlsx shared by @Michele-Zhu, only 1 table should be detected instead of 2.
I am a bit concerned about performance. I tested a large workbook with large sheets and saw a 180% increase in time-to-solution. Do you think there is any room for improvement?
Please rebase onto the latest main and ensure that the code passes the style checks (uv run pre-commit run --all-files) and all the tests (uv run pytest tests). This is currently not the case.
You are also encouraged to add test data and/or unit tests to validate the code changes.

Ra5hidIslam · 2026-01-22T16:33:12Z

Hi @ceberam thanks a lot for the inputs, will follow up once I have made appropriate changes.

made changed to the _find_table_bounds function

f282abc

Ra5hidIslam added 2 commits December 13, 2025 13:51

DCO Remediation Commit for Rashidul Islam <rasidulislam71@gmail.com>

200e396

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

DCO Remediation Commit for Rashidul Islam <rasidulislam71@gmail.com>

27cd8a9

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>

Ra5hidIslam changed the title ~~Made changes to the _find_table_bounds function.~~ fix(backend): improve Excel table bounds detection and flatten merged cells Dec 13, 2025

Michele-Zhu mentioned this pull request Dec 14, 2025

fix(excel): merge two separated tables#2626 #2741

Closed

3 tasks

ceberam self-assigned this Dec 15, 2025

ceberam added the xlsx issue related to xlsx backend label Jan 20, 2026

ceberam requested changes Jan 20, 2026


3 participants
