Conversation
|
Pre-commit.ci autofix |
|
@rhshadrach Failing unit test is unrelated to code changes |
rhshadrach
left a comment
There was a problem hiding this comment.
This is looking really good; but I don't understand current (main) behavior with e.g. UTF-16. Are BOMs being stripped there?
| else: | ||
| strip_bom = False | ||
| warn_bom = False |
There was a problem hiding this comment.
Is this backwards compatible?
There was a problem hiding this comment.
@rhshadrach it might be a breaking change but the original CHECK_FOR_BOM() always strips bytes 0xEF 0xBB 0xBF regardless of encoding. While this is arguably incorrect , changing it now would break existing code.
mismatched encodings will be discussed in another issue
There was a problem hiding this comment.
I don't see a reason to not deprecate this behavior as well.
| # We use check_stacklevel=False to avoid errors with compiled Cython paths | ||
| with tm.assert_produces_warning(warn_type, match=warn_msg, check_stacklevel=False): | ||
| result = parser.read_csv(_encode_data_with_bom(data), encoding=utf8, **kwargs) |
There was a problem hiding this comment.
What does the warning point to if not this file?
There was a problem hiding this comment.
@rhshadrach yea i think i should elaborate the explain comment more at line 140
| # We only implemented the warning for the C engine so far. | ||
| # Python/Pyarrow engines will still silently strip the BOM (warn_type=None). |
There was a problem hiding this comment.
Is there something blocking us from doing so in this PR?
There was a problem hiding this comment.
@rhshadrach It's not a hard blocker, but the implementation for the Python and PyArrow engines is significantly different
C-engine Handles the buffer directly, so the fix was just a flag check in tokenizer.c
Python/PyArrow: Would likely require wrapping the file handles or "peeking" at the stream before passing it to the engine, which adds complexity regarding stream positioning and performance
To avoid making this PR too large or risky, I preferred to land the C-engine fix first and address the other engines in a follow-up PR.
There was a problem hiding this comment.
I understand the concern, but I do not want to introduce inconsistent BOM handling between the different engines since it is avoidable. Would you be willing to include the Python engine here?
| with tm.assert_produces_warning(warning_type, match=warning_match, check_stacklevel=False): | ||
| result = parser.read_csv(BytesIO(data), encoding=encoding) | ||
|
|
||
| assert result.columns[0] == expected_col |
There was a problem hiding this comment.
Can you construct the entire (DataFrame) result and use tm.assert_frame_equal
60d863f to
3103486
Compare
2d42fff to
3c38548
Compare
c71e20e to
2b42218
Compare
|
pre-commit.ci autofix |
for more information, see https://pre-commit.ci
|
@rhshadrachI implemented the changes |
| if parser.engine == "python" and encoding == "latin1": | ||
| # Python engine won't warn for latin1 because it doesn't see the BOM | ||
| # It will fail to parse correctly (will have "Name" column) | ||
| pytest.skip("Python engine doesn't detect BOM with latin1 encoding") |
There was a problem hiding this comment.
Don't skip this, it is important to test.
for more information, see https://pre-commit.ci
- Stop removing 'encoding' from kwds in c_parser_wrapper.py so it reaches the C engine - Update tokenizer.c to correctly detect Byte Order Marks (BOM) - Ensure BOM warnings are triggered even for non-UTF8 encodings like 'latin1' - Resolves pandas-dev#63787
5a52f37 to
1598c34
Compare
ea0be03 to
829b1f5
Compare
|
@rhshadrach Finally done ! Tell me if anything more needed Take a look! @rhshadrach |
|
Is anything prevent you to review the pr @rhshadrach |
pandas/_libs/src/parser/tokenizer.c
Outdated
| if (self->datalen - self->datapos >= 3 && (unsigned char)buf[0] == 0xEF && \ | ||
| (unsigned char)buf[1] == 0xBB && (unsigned char)buf[2] == 0xBF) { \ | ||
| self->bom_found = 1; \ | ||
| if (self->strip_bom) { \ | ||
| buf += 3; \ | ||
| self->datapos += 3; \ | ||
| } \ | ||
| } else if (self->datalen - self->datapos >= 6 && \ | ||
| (unsigned char)buf[0] == 0xC3 && (unsigned char)buf[1] == 0xAF && \ | ||
| (unsigned char)buf[2] == 0xC2 && (unsigned char)buf[3] == 0xBB && \ | ||
| (unsigned char)buf[4] == 0xC2 && (unsigned char)buf[5] == 0xBF) { \ |
There was a problem hiding this comment.
Why is the check for BOM changing?
There was a problem hiding this comment.
I did not say or mean to suggest this is necessarily an issue, I only asked why is it changing. I would like to understand why you made this change previously, and why it is being changed to what it is now.
There was a problem hiding this comment.
@rhshadrach I was trying to make the parser more robust by catching double-encoded BOMs, but realized that was out of scope for this PR.
I’ve reverted it to strictly check for the standard UTF-8 BOM. The only logic change remaining is the buffer length check (>= 3) to prevent crashes on short inputs.
ccb95a6 to
48e04df
Compare
fc70113 to
087f646
Compare
|
Pre-commit.ci autofix |
for more information, see https://pre-commit.ci
|
All good now chief @rhshadrach |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.AGENTS.md.I am AI 🤖 : Code is changed by me Claude, I read agents.md, ensured that every changed line is reviewed 😉.
Summary
This PR addresses GH#63787 by introducing a deprecation warning when a UTF-8 Byte Order Mark (BOM) is detected while a user has explicitly specified encoding='utf-8'. This aligns pandas' behavior with Python's standard codecs, where 'utf-8' preserves the BOM as data, and 'utf-8-sig' is required for automatic stripping.
C Layer (tokenizer.h / tokenizer.c)
Added int strip_bom and int bom_found to the parser_t struct.
Updated parser_set_default_options and parser_init to initialize these flags.
Modified the CHECK_FOR_BOM() macro in parser_buffer_bytes to set bom_found = 1 upon detection and only advance the buffer pointers if strip_bom is enabled.
Cython Bridge (parsers.pyx)
Introduced warn_bom_with_explicit_utf8 to the TextReader class.
Updated cinit to pop internal flags _strip_bom and _warn_bom and pass them to the C struct.
Added logic in _tokenize_rows to trigger the FutureWarning when a BOM is found under explicit 'utf-8' conditions.
Python Logic (c_parser_wrapper.py / readers.py)
c_parser_wrapper.py: Added decision logic to determine if strip_bom and warn_bom should be active based on the user-provided encoding.
readers.py: Updated the read_csv docstring with a versionchanged:: 3.0.0 notice to document the new behavior.
Tests (test_encoding.py)
Updated existing test_utf8_bom to accommodate the new FutureWarning.
Added test_bom_handling_deprecation to verify that: