fix(backend): improve Excel table bounds detection and flatten merged cells #2778
✅ DCO Check Passed Thanks @Ra5hidIslam, all your commits are properly signed off. 🎉 |
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
I, Rashidul Islam <rasidulislam71@gmail.com>, hereby add my Signed-off-by to this commit: f282abc Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
|
Hi, I've noticed that you're working on the table bounds algorithm now. See #2741 and #2626. I support your approach. I suggest that the algorithm should also expand the bounds on the left due to how the data is scanned for the initial anchor in _find_data_tables. Here is the test case that shouldn't work (I haven't run your code, so I cannot guarantee it) |
|
Hi @Michele-Zhu, I have run my code on that Excel file and this is the output: Does it look right or wrong to you? |
|
@Ra5hidIslam No, in my opinion, it should have detected one table for the first and second sheets. P.S. According to how you have defined the growth region, it creates a problem with the test of the boolean option |
|
Hi @Michele-Zhu , I have a few thoughts on the feedback: Failing Test: Regarding the title extraction, I don't see the benefit of separating the title. Getting the whole block of data seems more helpful/robust. If we separate the title, we'd need to add another processing layer to re-associate or manage the blocks. Unless isolating the title is crucial for a specific reason, I would prefer to keep the logic as is. |
|
Hi @ceberam, should I change my approach? |
ceberam
left a comment
Sorry @Ra5hidIslam for replying only now after your last message.
I like your region-growing strategy and I think we all agree that data like in the spreadsheet of issue #2626 should be considered as a single table instead of 2 different tables. The current release only addresses rectangular tables and your PR tackles other cases like #2626.
However, I have some comments that require changes:
> GAP_TOLERANCE (set to 3). This helps group nearby data clusters into a single table context more reliably.
I don't dislike using a gap tolerance since it may help consolidate some dispersed cells into a single table. But we need to adopt a default approach and I would keep the tolerance to 0 by default. The reason is simple: it is easier to merge tables in a post-processing step than to split tables that have been put together wrongly. You can define the gap tolerance as a backend parser option. We have done it recently with the treat_singleton_as_text option in the class MsExcelBackendOptions. You could apply the same pattern.
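To make the suggestion concrete, here is a minimal sketch of a gap tolerance exposed as an option that defaults to 0. `ExcelOptionsSketch` and `group_columns` are hypothetical names used only for illustration; they are not the actual `MsExcelBackendOptions` class or any code from this PR.

```python
from dataclasses import dataclass


@dataclass
class ExcelOptionsSketch:
    """Hypothetical stand-in for a backend options class (not the real API)."""

    # 0 means any empty column separates two tables (the conservative default).
    gap_tolerance: int = 0


def group_columns(occupied: list[int], gap_tolerance: int = 0) -> list[tuple[int, int]]:
    """Group occupied column indices into (start, end) runs.

    Two runs are bridged when at most `gap_tolerance` empty columns lie between them.
    """
    runs: list[tuple[int, int]] = []
    for col in sorted(occupied):
        if runs and col - runs[-1][1] - 1 <= gap_tolerance:
            # Extend the current run up to this column.
            runs[-1] = (runs[-1][0], col)
        else:
            # Gap is too wide: start a new run.
            runs.append((col, col))
    return runs


options = ExcelOptionsSketch()
columns = [0, 1, 2, 5, 6]  # columns 3 and 4 are empty
print(group_columns(columns, options.gap_tolerance))  # two separate runs
print(group_columns(columns, gap_tolerance=2))        # one bridged run
```

With the default of 0 the two clusters stay separate and can still be merged later; a caller who wants the grouping behaviour opts in explicitly.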
> Visual Grid / Flattening Spans: Changed the ExcelCell generation to force a "Visual Grid" structure.
> All cells are now forced to row_span=1 and col_span=1.
> Merged cell bodies (non-head cells) are explicitly filled with empty strings.
> This change is intended to prevent text duplication issues in downstream Markdown exports.
I assume the goal is to have a visual representation as close as possible to how the spreadsheet is rendered in most applications. However, this has drawbacks in many downstream applications (e.g., RAG, data generation for model fine-tuning). By removing the spans, we lose the information that may be useful to connect column and row information. Repeating values in the markdown serialization may look redundant but it preserves the values across the cells, since merged cells are not supported in Markdown standard syntax (i.e., with hyphens and pipes).
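As an illustration of that trade-off, the sketch below keeps the span information on each cell and only repeats the value when a flat grid is needed for Markdown. `SpanCell`, `expand_spans` and `to_markdown` are hypothetical helpers written for this comment; they are not the project's serializer.

```python
from typing import NamedTuple


class SpanCell(NamedTuple):
    """Hypothetical cell record that retains its merged-cell span."""

    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1


def expand_spans(cells: list[SpanCell], n_rows: int, n_cols: int) -> list[list[str]]:
    """Build a flat grid, repeating a merged cell's text in every position it covers."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.row_span):
            for c in range(cell.col, cell.col + cell.col_span):
                grid[r][c] = cell.text
    return grid


def to_markdown(grid: list[list[str]]) -> str:
    """Render the grid as a pipe table, treating the first row as the header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    SpanCell("Region", 0, 0),
    SpanCell("Sales", 0, 1, col_span=2),  # header merged across two columns
    SpanCell("EU", 1, 0),
    SpanCell("10", 1, 1),
    SpanCell("12", 1, 2),
]
print(to_markdown(expand_spans(cells, n_rows=2, n_cols=3)))
```

The repeated "Sales" header looks redundant, but both value columns stay linked to it, and the original spans are still available on the cells for consumers that can use them.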
An alternative (but not for this PR) could be to customize the Markdown serialization. You can check the documentation on custom serialization.
In addition, other considerations:
- As mentioned by @Michele-Zhu, the table identification should not collide with the treat_singleton_as_text option (check PR #2589 for more background).
- I also agree that the algorithm should expand the bounds on the left. On the first sheet in edge_cases.xlsx shared by @Michele-Zhu, there should be only 1 table detected instead of 2.
- I am a bit concerned about the performance. I tested a large workbook with large sheets and I was getting a 180% increase in time-to-solution. Do you think there is any room for improvement?
- Please rebase to the latest main and ensure that the code passes the style checks (uv run pre-commit run --all-files) and all the tests (uv run pytest tests). This is currently not the case.
- You are also encouraged to add test data and/or unit tests to validate the code changes.
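On the left-expansion point, a rough sketch of what I mean is below. It assumes the sheet is reduced to a set of occupied (row, column) positions and grows the region from the anchor in every direction with zero gap tolerance. `grow_bounds` is a hypothetical helper, not the PR's `_find_table_bounds`.

```python
from collections import deque


def grow_bounds(
    occupied: set[tuple[int, int]], anchor: tuple[int, int]
) -> tuple[int, int, int, int]:
    """Return (min_row, min_col, max_row, max_col) of the region connected to `anchor`.

    Neighbours are checked in all eight directions, so the region can grow
    to the left and upwards as well as to the right and downwards.
    """
    seen = {anchor}
    queue = deque([anchor])
    while queue:
        row, col = queue.popleft()
        for d_row in (-1, 0, 1):
            for d_col in (-1, 0, 1):
                nxt = (row + d_row, col + d_col)
                if nxt in occupied and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    rows = [r for r, _ in seen]
    cols = [c for _, c in seen]
    return min(rows), min(cols), max(rows), max(cols)


# The first row starts at column 2, but the second row starts at column 0.
# A row-major scan finds (0, 2) first and uses it as the anchor.
occupied = {(0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)}
print(grow_bounds(occupied, anchor=(0, 2)))
```

Growing only to the right and down from (0, 2) would leave cells (1, 0) and (1, 1) behind as a second table. Each occupied cell is enqueued at most once here, which may also help with the runtime concern, though I have not measured it.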
|
Hi @ceberam thanks a lot for the inputs, will follow up once I have made appropriate changes. |

Description: This PR refactors the _find_table_bounds method in the MsExcelDocumentBackend to improve how Excel tables are detected and represented.
Key changes:
All cells are now forced to row_span=1 and col_span=1.
Merged cell bodies (non-head cells) are explicitly filled with empty strings.
This change is intended to prevent text duplication issues in downstream Markdown exports.
Issue resolved by this Pull Request: Resolves #834