BUG: fix read_json failure on very large integers#63725
BUG: fix read_json failure on very large integers#63725Mazen050 wants to merge 13 commits intopandas-dev:mainfrom
Conversation
|
About #62072 The saved float values using If this is acceptable I will add a test for it and add a 'closes' keyword for it. |
Dr-Irv
left a comment
There was a problem hiding this comment.
You will need to do a few additional things if this is to be accepted (this list may not be complete);
- Update the docs for
read_json()andto_json()to show the availability of theorjsonvalue forengine - Update these docs https://pandas.pydata.org/docs/dev/user_guide/io.html#json in various places
- Update these docs https://pandas.pydata.org/docs/dev/getting_started/install.html#other-data-sources
Also, other maintainers will have to decide whether this should go in whatsnew/v3.0.0.rst or in something like whatsnew/v3.1.0.rst since 3.0 is almost out the door.
|
@Dr-Irv Thank you for the review. I will update the docs for For I did explore adding If the expectation is to document |
I think that's fine. Will let others also chime in. |
|
I did a little research on this, and we have issue #62464 where there is discussion about replacing So that should be investigated here. Do all the existing JSON tests pass if you use |
|
Hello @Dr-Irv. I have seen this issue and this comment here suggests to make it an optional dependency so thats why I made it so. About the tests. I have put
Error: if the last point should be fixed I would happily do so if you can guide me. |
|
About your suggestion here I tried the latest
making it 4 failed tests and they are the same tests above tho with a notable difference. The test with the This means the new version of Also keeping this warning in mind. To sum up, |
Is the difference in how they handle We might be willing to live with this, IF when we output JSON, the representation of @rhshadrach interested in your opinion here. |
|
The difference is in If what you mean by the second point is a test like this: then yes both the newer version of If you meant this test: The |
|
@Alvaro-Kothe Thanks I will check it out. |
So if It's an interesting question whether we need to support that, as well as if there is a workaround with |
Yes the problem is with I found this Issue where the maintainer of All this suggests that |
|
Hello @Alvaro-Kothe I’d appreciate your opinion on this. The main behavioral difference we’ve identified is that This leaves us with a few possible paths forward:
Given that |
|
My thought from the start was to make |
|
Thanks. I will add documentation tomorrow that informs about I intentionally kept this PR minimal and scoped to introducing I’ve also looked at your PR, which takes a more generalized and future-proof approach. If there’s interest in moving in that direction, I’m happy to adapt this implementation. Also If there is interest, a follow-up PR could explore an opt-in normalization step to align |
85cf226 to
ef74d55
Compare
rhshadrach
left a comment
There was a problem hiding this comment.
To add orjson as an optional dependency, I think we'll need to be running all the tests that we have for other engines with orjson.
Regarding the failing tests from #63725 (comment):
- 2 tests of which were expecting a
'Value is too small|Value is too big'ValueError but got a ValueError message'If using all scalar values, you must pass an index'instead.
In general changing of the exact message is fine if the type of exception is staying the same. However here the content seems to vary wildly, so it makes me suspicious that the proper reason for the error is being truly identified.
- 1 was an error message mismatch for when given a bad JSON like so:
This difference looks fine.
- 1 was a trailing comma error which exposes that orjson engine is more strict that the past ujson engine
I'd be okay with documenting this change.
- the last one was the most important as this one exposed a difference between
ujsonandorjsonwhereorjsondoesn't support infinity and NaN being in the Json like so:
It does seem to me that not supporting NaN/Infinity is a deal-breaker here.
I found this Issue where the maintainer of
orjsonsaid he is not interested in supportingNaNandInfinity.
This does not appear to me to be an accurate summary. The actual statement is:
I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.
I take that as meaning orjson would be open to the extension if someone is willing to put in the work.
|
Hello @rhshadrach Thanks for the feedback.
I have investigated this test and found that the test below: depends on failing for big numbers as we know but since
would the already documented
Yeah sorry looks like I miss understood the maintainers comment. Additionally I ran all tests again this time making sure every test ran for For the |
4b9524a to
a44e9b8
Compare
Update error messages and handling in JSON tests
Remove commented-out code and update test for large numbers.
|
Hi @rhshadrach , thanks for the feedback. I’ve now parameterized the existing The remaining differences are limited to documented, engine-specific behavior:
Please let me know if you’d like any additional tests exercised under orjson. Also note that failing tests are failing under the main branch aswell indicating that they are not related to changes in this PR. |
I mentioned previously, but to elaborate, if we have no path forward to support NaN/Infinity with orjson, then it seems to me we should not move forward. It would mean we cannot remove ujson, which was the reason we were thinking of orjson in the first place. But I would like to get others thoughts here, cc @jorisvandenbossche @mroeschke |
|
Agreed with @rhshadrach's assessment - I don't think it's worth adding Also, is the pyarrow engine able to read json with large integers as described in the original issue? |
That is fair considering it might break a lot of existing code.
Yes the pyarrow engine is able to read large integers Also, If nothing else is required here, I’m fine with closing this PR. |
|
Thanks for your investigation here with Additionally since the origin issue of large integers is covered by at least the pyarrow engine, I believe we can close this PR. |
+1. I would also like to make a pitch to orjson to support NaN/Infinity, but it will likely take me quite some time to do so, so if anyone else is interested please have at it. But if orjson ever does support it, this PR will then be useful to get things up and running. |
Given the discussion at ijl/orjson#170, I don't think There's another option to consider if |
@Dr-Irv - did you see the reason for the closure? It was:
That is not a hard no.
I'm negative on pandas depending on a fork of a rust repo. |
Agreed.
Good point. Not that forking a C-implementation of |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.AGENTS.md.This PR fixes a failure in pd.read_json when parsing JSON data containing very large integers, which currently raises ValueError: Value is too big! due to limitations in the vendored ujson parser.
The fix adds support for engine="orjson" in read_json, allowing large integers to be parsed successfully (following orjson semantics, where very large integers are decoded as floats). The implementation mirrors the existing ujson code path and does not change default behavior unless the orjson engine is explicitly selected.
New tests are added to cover large integer parsing for both DataFrame and Series outputs when using the orjson engine.
Note:
orjsonis added as an optional dependency using pandasimport_optional_dependency()