@Ra5hidIslam

@Ra5hidIslam Ra5hidIslam commented Dec 13, 2025

fix(backend): improve Excel table bounds detection and flatten merged cells

Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.

Key changes:

  • Region Growing Algorithm: Replaced the previous explicit boundary finding logic with a region-growing strategy that uses a GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.
  • Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
    All cells are now forced to row_span=1 and col_span=1.
    Merged cell bodies (non-head cells) are explicitly filled with empty strings.
    This change is intended to prevent text duplication issues in downstream Markdown exports.
  • Refactoring: Removed the _find_table_bottom and _find_table_right helper methods as their logic is now integrated into the region expansion loop.
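To make the region-growing idea above concrete, here is a minimal standalone sketch. It is not the code from this PR: the function name grow_region, the set-of-coordinates input, and the reading of the tolerance as "number of empty rows or columns that may be bridged" are all assumptions made for illustration. It grows only to the right and downward from the anchor.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col), zero-based; hypothetical representation
Bounds = Tuple[int, int, int, int]  # (min_row, min_col, max_row, max_col)


def grow_region(occupied: Set[Cell], anchor: Cell, gap_tolerance: int = 3) -> Bounds:
    """Grow a bounding box right and down from `anchor`.

    A non-empty cell extends the box if it lies within the box's current
    column (or row) range and is separated from the box edge by at most
    `gap_tolerance` empty rows (or columns).
    """
    min_row, min_col = anchor
    max_row, max_col = anchor
    reach = gap_tolerance + 1  # distance to the farthest cell that can still attach
    changed = True
    while changed:  # repeat until no cell extends the box any further
        changed = False
        for row, col in occupied:
            in_cols = min_col <= col <= max_col
            in_rows = min_row <= row <= max_row
            if in_cols and max_row < row <= max_row + reach:
                max_row = row  # extend downward
                changed = True
            elif in_rows and max_col < col <= max_col + reach:
                max_col = col  # extend to the right
                changed = True
    return min_row, min_col, max_row, max_col


# Two 2x2 blocks separated by two empty rows (rows 2 and 3).
cells = {(0, 0), (0, 1), (1, 0), (1, 1), (4, 0), (4, 1)}
merged = grow_region(cells, (0, 0), gap_tolerance=3)
strict = grow_region(cells, (0, 0), gap_tolerance=0)
```

With a tolerance of 3 the two blocks end up in one box, while a tolerance of 0 keeps only the first block.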

Issue resolved by this Pull Request: Resolves #834

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.
    msexcel_backend.py

@github-actions
Contributor

github-actions bot commented Dec 13, 2025

DCO Check Passed

Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Dec 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc

Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
@Ra5hidIslam Ra5hidIslam changed the title Made changes to the _find_table_bounds function. fix(backend): improve Excel table bounds detection and flatten merged cells Dec 13, 2025
@Michele-Zhu

Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach.

I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables.

Here is a test case that I expect to fail (I haven't run your code, so I cannot confirm it):
edge_cases.xlsx

@Ra5hidIslam
Author

Ra5hidIslam commented Dec 15, 2025

Hi @Michele-Zhu, I ran my code on that Excel file and this is the output:
Screenshot 2025-12-15 at 11 34 44

Does it look fine or wrong to you?

@codecov

codecov bot commented Dec 15, 2025

Codecov Report

❌ Patch coverage is 0% with 43 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/backend/msexcel_backend.py 0.00% 43 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ceberam ceberam self-assigned this Dec 15, 2025
@Michele-Zhu

@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets.
Since the table-bound scan starts from the topmost left cell of a table, you'll also need to grow the region on the left side.

P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option treat_singleton_as_text.
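The point about growing to the left can be illustrated with a small sketch. It assumes the anchor is the first non-empty cell in row-major order, and it swaps in a breadth-first flood fill over 8-connected neighbours rather than the PR's expansion loop; the names first_anchor and component_bounds are hypothetical.

```python
from collections import deque
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col), zero-based; hypothetical representation


def first_anchor(occupied: Set[Cell]) -> Cell:
    """First non-empty cell in row-major order (assumed scan order)."""
    return min(occupied)  # tuples compare by row first, then column


def component_bounds(occupied: Set[Cell], anchor: Cell) -> Tuple[int, int, int, int]:
    """Bounding box of all cells 8-connected to `anchor`, in every direction."""
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        row, col = queue.popleft()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                neighbour = (row + d_row, col + d_col)
                if neighbour in occupied and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    rows = [r for r, _ in seen]
    cols = [c for _, c in seen]
    return min(rows), min(cols), max(rows), max(cols)


# The first row starts at column 2, the second row at column 0, so the
# row-major scan anchors at (0, 2) and the table extends to its left.
cells = {(0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)}
anchor = first_anchor(cells)
bounds = component_bounds(cells, anchor)
```

Here the anchor is (0, 2), and a strategy that only grows right and down would leave columns 0 and 1 of the second row outside the box; the flood fill covers them.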

@Ra5hidIslam
Author

Hi @Michele-Zhu ,

I have a few thoughts on the feedback:
Edge Case: I feel the edge case regarding connecting two sheets by text might be heading in the wrong direction. Is using attached_left for this type of connection considered an industry standard? It doesn't seem to align with typical use cases.

Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is.

@Ra5hidIslam
Author

Hi @ceberam, should I change my approach?

@ceberam ceberam added the xlsx issue related to xlsx backend label Jan 20, 2026
Member

@ceberam ceberam left a comment

Sorry @Ra5hidIslam for replying only now after your last message.

I like your region-growing strategy and I think we all agree that data like in the spreadsheet of issue #2626 should be considered as a single table instead of 2 different tables. The current release only addresses rectangular tables and your PR tackles other cases like #2626.

However, I have some comments that require changes:

GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.

I don't dislike using a gap tolerance since it may help consolidate some dispersed cells into a single table. But we need to adopt a default approach and I would keep the tolerance to 0 by default. The reason is simple: it is easier to merge tables in a post-processing step than to split tables that have been put together wrongly. You can define the gap tolerance as a backend parser option. We have done it recently with the treat_singleton_as_text option in the class MsExcelBackendOptions. You could apply the same pattern.
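As a rough sketch of what a configurable tolerance with a conservative default could look like: the class below is a standalone, hypothetical stand-in and does not reproduce the real MsExcelBackendOptions (its base class and existing fields are not shown in this thread). Only the idea of a gap_tolerance field defaulting to 0 comes from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableDetectionOptions:
    """Hypothetical options holder; not the real MsExcelBackendOptions."""

    # 0 means only directly adjacent cells are grouped into one table,
    # so tables are never merged unless the caller opts in.
    gap_tolerance: int = 0

    def __post_init__(self) -> None:
        if self.gap_tolerance < 0:
            raise ValueError("gap_tolerance must be >= 0")


default_options = TableDetectionOptions()
lenient_options = TableDetectionOptions(gap_tolerance=3)
```

Keeping the default at 0 means callers who want nearby clusters merged have to ask for it explicitly, which matches the argument that merging later is easier than splitting.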

Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.

I assume the goal is to have a visual representation as close as possible to how the spreadsheet is rendered in most applications. However, this has drawbacks in many downstream applications (e.g., RAG, data generation for model fine-tuning). By removing the spans, we lose the information that may be useful to connect column and row information. Repeating values in the markdown serialization may look redundant but it preserves the values across the cells, since merged cells are not supported in Markdown standard syntax (i.e., with hyphens and pipes).
An alternative (but not for this PR) could be to customize the Markdown serialization. You can check the documentation on custom serialization.
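The trade-off between repeating merged values and blanking them can be sketched as follows. This is an illustration only: SpanCell, to_grid and to_markdown are hypothetical names, not the ExcelCell model or the project's Markdown serializer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpanCell:
    """Hypothetical cell with spans; not the backend's ExcelCell."""

    row: int
    col: int
    text: str
    row_span: int = 1
    col_span: int = 1


def to_grid(cells: List[SpanCell], n_rows: int, n_cols: int, repeat: bool) -> List[List[str]]:
    """Expand spans into a plain grid.

    repeat=True copies the merged value into every covered position;
    repeat=False keeps it only in the top-left (head) position.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                is_head = r == cell.row and c == cell.col
                grid[r][c] = cell.text if (repeat or is_head) else ""
    return grid


def to_markdown(grid: List[List[str]]) -> str:
    """Render a grid as a pipe table, using the first row as the header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    SpanCell(0, 0, "Region"),
    SpanCell(0, 1, "Q1"),
    SpanCell(1, 0, "EMEA", row_span=2),  # merged over two rows
    SpanCell(1, 1, "10"),
    SpanCell(2, 1, "20"),
]
repeated = to_grid(cells, 3, 2, repeat=True)
flattened = to_grid(cells, 3, 2, repeat=False)
```

In the repeated grid the row holding "20" still carries "EMEA", whereas in the flattened grid that association is lost once the table is read row by row.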

In addition, other considerations:

  • As mentioned by @Michele-Zhu, the table identification should not collide with the treat_singleton_as_text option (see PR #2589 for more background).
  • I also agree that the algorithm should expand the bounds to the left. On the first sheet in edge_cases.xlsx shared by @Michele-Zhu, there should be only 1 table detected instead of 2.
  • I am a bit concerned about the performance. I tested a large workbook with large sheets and I was getting a 180% increase in time-to-solution. Do you think there is any room for improvement?
  • Please rebase onto the latest main and ensure that the code passes the style checks (uv run pre-commit run --all-files) and all the tests (uv run pytest tests). This is currently not the case.
  • You are also encouraged to add test data and/or unit tests to validate the code changes.

@Ra5hidIslam
Copy link
Author

Hi @ceberam thanks a lot for the inputs, will follow up once I have made appropriate changes.

Labels

xlsx issue related to xlsx backend

Development

Successfully merging this pull request may close these issues.

msexcel_backend.py doesn’t parse complex Excel tables properly.

3 participants