
BUG: fix read_json failure on very large integers #63725

Closed
Mazen050 wants to merge 13 commits into pandas-dev:main from Mazen050:add-optional-orjson

Conversation

@Mazen050

@Mazen050 Mazen050 commented Jan 17, 2026


This PR fixes a failure in pd.read_json when parsing JSON data containing very large integers, which currently raises ValueError: Value is too big! due to limitations in the vendored ujson parser.

The fix adds support for engine="orjson" in read_json, allowing large integers to be parsed successfully (following orjson semantics, where very large integers are decoded as floats). The implementation mirrors the existing ujson code path and does not change default behavior unless the orjson engine is explicitly selected.

New tests are added to cover large integer parsing for both DataFrame and Series outputs when using the orjson engine.

Note: orjson is added as an optional dependency using pandas' import_optional_dependency().
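For context, the failure is specific to the vendored parser's fixed-width integer handling, not to JSON itself: the standard-library parser returns exact arbitrary-precision ints. A stdlib-only sketch using the boundary value from the pandas test suite (GH20599):

```python
import json

# One past the signed 64-bit range; the vendored ujson parser rejects
# values like this with "Value is too small!"/"Value is too big!",
# while the stdlib parser returns an exact Python int.
big = -(2**63) - 1  # -9223372036854775809
parsed = json.loads('{"articleId": %d}' % big)
assert parsed["articleId"] == big
```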

@Mazen050 Mazen050 requested a review from Dr-Irv as a code owner January 17, 2026 21:31
@Mazen050
Author

Mazen050 commented Jan 17, 2026

About #62072

Saving with df.to_json("out.json", double_precision=15) writes [0.261799387799149,0.11111111111111111] instead of [0.2617993877991494,0.111111111111111112]. Reading that file back with engine=orjson yields 0.261799387799149 and 0.111111111111111, so the JSON is read successfully; any precision loss comes from df.to_json, not from the reader.

If this is acceptable I will add a test for it and add a 'closes' keyword for it.
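The truncation can be seen with plain float formatting; a stdlib-only sketch, assuming double_precision=15 corresponds to 15 significant digits for these sub-1 values:

```python
# Round-tripping this double needs 16 significant digits; formatting at
# 15 drops the last digit, matching the value to_json writes.
x = 0.2617993877991494
assert format(x, ".15g") == "0.261799387799149"
assert repr(x) == "0.2617993877991494"  # shortest round-trip form
```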

Contributor

@Dr-Irv Dr-Irv left a comment


You will need to do a few additional things if this is to be accepted (this list may not be complete):

Also, other maintainers will have to decide whether this should go in whatsnew/v3.0.0.rst or in something like whatsnew/v3.1.0.rst since 3.0 is almost out the door.

@Mazen050
Author

@Dr-Irv Thank you for the review.

I will update the docs for read_json to describe the availability and behavior of engine="orjson".

For to_json, this PR does not add or expose an engine parameter, and df.to_json(engine="orjson") is not currently supported. Because of that, I wasn’t sure how to update the to_json documentation without first extending its API.

I did explore adding engine="orjson" support to to_json, but it appears to require more extensive work, since several parameters passed to ujson_dumps are not supported by orjson.

If the expectation is to document orjson for to_json only once it is actually supported, I can proceed with documenting read_json only in this PR. Alternatively, I can open a follow-up PR to add orjson support to to_json and update the docs accordingly.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

If the expectation is to document orjson for to_json only once it is actually supported, I can proceed with documenting read_json only in this PR. Alternatively, I can open a follow-up PR to add orjson support to to_json and update the docs accordingly.

I think that's fine. Will let others also chime in.

@Mazen050 Mazen050 requested a review from Dr-Irv January 19, 2026 19:56
@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

I did a little research on this, and we have issue #62464 where there is discussion about replacing ujson with orjson.

So that should be investigated here. Do all the existing JSON tests pass if you use orjson as a replacement?

@Mazen050
Author

Hello @Dr-Irv.

I have seen this issue, and this comment here suggests making it an optional dependency, so that's why I implemented it that way.

About the tests: I set orjson as the default engine for read_json and found that all tests pass except for 5.

  • 2 tests were expecting a 'Value is too small|Value is too big' ValueError but instead got a ValueError with the message 'If using all scalar values, you must pass an index'.

  • 1 was an error-message mismatch when given bad JSON like so:

E         Expected regex: 'Expected object or value'
E         Actual message: 'unexpected character: line 1 column 8 (char 7)'
  • 1 was a trailing comma error, which exposes that the orjson engine is stricter than the vendored ujson engine:
        data_json = """{
            "schema":{
                "fields":[
                    {
                        "name":"a",
                        "type":"integer",
                        "extDtype":"Int64"
                    }
                ],  <---- HERE
            },
            "data":[
                {
                    "a":2
                },
                {
                    "a":null
                }
            ]
        }"""
  • the last one was the most important, as it exposed a difference between ujson and orjson: orjson doesn't support Infinity and NaN appearing in the JSON, like so:
            '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'

Error:

E           orjson.JSONDecodeError: unexpected character: line 1 column 7 (char 6)

If the last point should be fixed, I would happily do so if you can guide me.
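As an aside (not part of this PR), Python's stdlib json module behaves like the vendored ujson on this point: its decoder accepts the non-standard unquoted NaN/Infinity tokens by default, which is exactly what orjson refuses to do:

```python
import json
import math

# By default the stdlib decoder accepts NaN/Infinity/-Infinity;
# a parse_constant hook can be passed to reject or remap them.
vals = json.loads('["a", NaN, "NaN", Infinity, "Infinity", -Infinity]')
assert math.isnan(vals[1])   # unquoted NaN -> float('nan')
assert vals[2] == "NaN"      # quoted "NaN" stays a string
assert vals[3] == math.inf
assert vals[5] == -math.inf
```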

@Mazen050
Author

Mazen050 commented Jan 19, 2026

About your suggestion here

I tried the latest ujson version and found that the same tests that fail with orjson also fail with the new version of ujson, except for this one:

  • 1 was an error-message mismatch when given bad JSON like so:
E         Expected regex: 'Expected object or value'
E         Actual message: 'unexpected character: line 1 column 8 (char 7)'

making it 4 failed tests, the same tests as above, though with a notable difference.

The test with the NaN and Infinity had this Error instead:

E   AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
E   
E   DataFrame.iloc[:, 0] (column name="0") values are different (14.28571 %)
E   [index]: [0, 1, 2, 3, 4, 5, 6]
E   [left]:  [a, nan, NaN, inf, Infinity, -inf, -Infinity]
E   [right]: [a, None, NaN, inf, Infinity, -inf, -Infinity]
E   At positional index 1, first diff: nan != None

This means the new version of ujson supports NaN and Infinity appearing in the JSON but handles NaN differently.

Also keeping this warning in mind.

To sum up, orjson (and the latest ujson) cannot be used as a drop-in replacement without behavior changes, mainly due to stricter JSON compliance and differences around NaN / Infinity.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

I have seen this issue and this comment here suggests to make it an optional dependency so thats why I made it so.

Yes, but this comment here makes it seem as if we are open to doing a full replacement.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

This means the new version of ujson supports NaN and Infinity appearing in the JSON but handles NaN differently.

Is the difference in how they handle NaN without quotes or "NaN" with quotes?

We might be willing to live with this, IF when we output JSON, the representation of np.nan is consistent.

@rhshadrach interested in your opinion here.

@Alvaro-Kothe
Member

@Mazen050, I've opened a pull request (#63763) that tries to implement support for orjson. Feel free to reference it as a guide.

@Mazen050
Author

The difference is in NaN without quotes.

If what you mean by the second point is a test like this:

def test_to_json_nan_roundtrip_orjson_ujson():
    df = DataFrame({"a": [1.0, np.nan, 2.0]})
    json_str = df.to_json(orient="records")

    result_ujson = read_json(
        StringIO(json_str), orient="records", engine="ujson"
    )
    result_orjson = read_json(
        StringIO(json_str), orient="records", engine="orjson"
    )
    assert_frame_equal(result_ujson, result_orjson)

then yes both the newer version of ujson and orjson read it correctly.

If you meant this test:

def test_emca_262_nan_inf_support():
    data = StringIO(
        '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
    )
    result = read_json(data)
    json_str = result.to_json(orient="records")

    read_json(StringIO(json_str), engine="orjson")

The orjson engine fails, but the new version of ujson works like the older one (Infinity is represented as null, i.e. NaN).

original:  '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
read after to_json: [{"0":"a"},{"0":null},{"0":"NaN"},{"0":null},{"0":"Infinity"},{"0":null},{"0":"-Infinity"}]

@Mazen050
Author

@Alvaro-Kothe Thanks I will check it out.

@Dr-Irv
Contributor

Dr-Irv commented Jan 20, 2026

If you meant this test:

def test_emca_262_nan_inf_support():
    data = StringIO(
        '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
    )
    result = read_json(data)
    json_str = result.to_json(orient="records")

    read_json(StringIO(json_str), engine="orjson")

The orjson engine fails, but the new version of ujson works like the older one (Infinity is represented as null, i.e. NaN).

original:  '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
read after to_json: [{"0":"a"},{"0":null},{"0":"NaN"},{"0":null},{"0":"Infinity"},{"0":null},{"0":"-Infinity"}]

So if NaN was removed from data (but "NaN" was kept), the test would pass?

It's an interesting question whether we need to support that, as well as if there is a workaround with orjson to make that test pass.

@Mazen050
Author

Mazen050 commented Jan 20, 2026

So if NaN was removed from data (but "NaN" was kept), the test would pass?

Yes, the problem is with NaN and Infinity without quotes.

I found this Issue where the maintainer of orjson said he is not interested in supporting NaN and Infinity. So a workaround would be something like converting every NaN and Infinity that is not a key in the JSON to null (without quotes), which orjson supports.

All this suggests that orjson is stricter about the JSON being valid than the older ujson engine.
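A hypothetical sketch of that preprocessing step (normalize_nonfinite is an illustrative helper, not part of this PR; it splits out string literals first so quoted "NaN"/"Infinity" values and object keys are left untouched):

```python
import re

# Rewrite the non-standard unquoted NaN/Infinity tokens to null so a
# strict parser such as orjson accepts the document.
_TOKEN = re.compile(r"-?Infinity|NaN")
_STRING = re.compile(r'("(?:[^"\\]|\\.)*")')

def normalize_nonfinite(text: str) -> str:
    parts = _STRING.split(text)
    # Even indices are the segments outside string literals; only those
    # are rewritten, so quoted "NaN"/"Infinity" survive unchanged.
    return "".join(
        part if i % 2 else _TOKEN.sub("null", part)
        for i, part in enumerate(parts)
    )
```

The resulting text is strict JSON, so a parser like orjson would accept it, at the cost of collapsing NaN and ±Infinity into null.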

@Mazen050
Author

Hello @Alvaro-Kothe

I’d appreciate your opinion on this.

The main behavioral difference we’ve identified is that orjson intentionally does not support unquoted NaN and Infinity, whereas the current ujson behavior allows them.

This leaves us with a few possible paths forward:

  1. Keep orjson as an optional engine only (not a replacement), and clearly document that unquoted NaN / Infinity are not supported.

  2. Accept the stricter JSON compliance as a trade-off for better performance and large-integer / float handling.

  3. Add a preprocessing workaround to preserve existing behavior.

Given that orjson explicitly does not plan to support these values, I’m leaning toward (1) unless there’s strong motivation otherwise.

@Alvaro-Kothe
Member

My thought from the start was to make orjson an optional engine, and slowly deprecate ujson. I think that your first option is the way to go.

@Mazen050
Author

Thanks.

I will add documentation tomorrow noting that orjson is stricter and that it doesn't support unquoted NaN and Infinity.

I intentionally kept this PR minimal and scoped to introducing orjson as an optional engine for read_json, to avoid coupling it to broader refactors.

I’ve also looked at your PR, which takes a more generalized and future-proof approach. If there’s interest in moving in that direction, I’m happy to adapt this implementation.

Also If there is interest, a follow-up PR could explore an opt-in normalization step to align orjson behavior more closely with the legacy ujson engine.

@Mazen050 Mazen050 force-pushed the add-optional-orjson branch from 85cf226 to ef74d55 Compare January 23, 2026 16:26
Member

@rhshadrach rhshadrach left a comment


To add orjson as an optional dependency, I think we'll need to be running all the tests that we have for other engines with orjson.

Regarding the failing tests from #63725 (comment):

  • 2 tests were expecting a 'Value is too small|Value is too big' ValueError but instead got a ValueError with the message 'If using all scalar values, you must pass an index'.

In general changing of the exact message is fine if the type of exception is staying the same. However here the content seems to vary wildly, so it makes me suspicious that the proper reason for the error is being truly identified.

  • 1 was an error-message mismatch when given bad JSON like so:

This difference looks fine.

  • 1 was a trailing comma error, which exposes that the orjson engine is stricter than the vendored ujson engine

I'd be okay with documenting this change.

  • the last one was the most important, as it exposed a difference between ujson and orjson: orjson doesn't support Infinity and NaN appearing in the JSON

It does seem to me that not supporting NaN/Infinity is a deal-breaker here.

I found this Issue where the maintainer of orjson said he is not interested in supporting NaN and Infinity.

This does not appear to me to be an accurate summary. The actual statement is:

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

I take that as meaning orjson would be open to the extension if someone is willing to put in the work.

@Mazen050
Author

Hello @rhshadrach

Thanks for the feedback.

so it makes me suspicious that the proper reason for the error is being truly identified.

I have investigated this test and found that the test below:

    def test_read_json_large_numbers(self, bigNum):
        # GH20599, 26068
        json = StringIO('{"articleId":' + str(bigNum) + "}")
        msg = r"Value is too small|Value is too big"
        with pytest.raises(ValueError, match=msg):
            read_json(json)

        json = StringIO('{"0":{"articleId":' + str(bigNum) + "}}")
        with pytest.raises(ValueError, match=msg):
            read_json(json)

depends on the parse failing for big numbers, but since orjson doesn't fail, it parses the input and finds that StringIO('{"articleId":' + str(bigNum) + "}") evaluates to {"articleId":-9223372036854775809}, a dict of scalar values, which (even with a smaller number) would raise the same error under ujson: ValueError: If using all scalar values, you must pass an index. When I change it to StringIO('{"articleId":' + '[' + str(bigNum) + ']' + "}"), it works and successfully builds a DataFrame. So this error comes from the pandas validation layer, and this is expected.
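The pandas-layer error can be reproduced without any JSON engine at all; a minimal sketch, assuming only that pandas is importable:

```python
import pandas as pd

# A dict of bare scalars has no length, so DataFrame construction
# raises the same ValueError the test was hitting once orjson parsed
# the large integer successfully.
try:
    pd.DataFrame({"articleId": -9223372036854775809})
except ValueError as err:
    print(err)  # If using all scalar values, you must pass an index

# Wrapping the value in a list gives the column a length, so it works.
df = pd.DataFrame({"articleId": [-9223372036854775809]})
assert df.shape == (1, 1)
```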

I'd be okay with documenting this change.

Would the already-documented "This engine is stricter about JSON compliance" be enough, or should I make it more specific?

I take that as meaning orjson would be open to the extension if someone is willing to put in the work.

Yeah, sorry, it looks like I misunderstood the maintainer's comment.

Additionally, I ran all tests again, this time making sure every test ran for orjson, and found the same errors as stated above.

For the NaN/Infinity test. I’d agree it would be a deal-breaker if orjson were a replacement or default. Since it’s opt-in and documented, my intent is to treat lack of support for unquoted NaN / Infinity as an explicit limitation of that engine, with engine-specific xfails so the difference is visible.

@Mazen050 Mazen050 requested a review from rhshadrach January 25, 2026 14:56
@Mazen050 Mazen050 requested a review from mroeschke as a code owner January 26, 2026 17:12
@Mazen050 Mazen050 force-pushed the add-optional-orjson branch from 4b9524a to a44e9b8 Compare January 26, 2026 17:32
Update error messages and handling in JSON tests
Remove commented-out code and update test for large numbers.
@Mazen050
Author

Hi @rhshadrach , thanks for the feedback.

I’ve now parameterized the existing read_json engine tests so they also run with engine="orjson", similar to ujson/pyarrow, and addressed the resulting failures.

The remaining differences are limited to documented, engine-specific behavior:

  • Stricter JSON compliance (trailing commas).

  • No support for unquoted NaN / Infinity (covered with engine-specific xfail).

Please let me know if you’d like any additional tests exercised under orjson.

Also note that the failing tests fail under the main branch as well, indicating they are not related to changes in this PR.

@rhshadrach
Member

  • No support for unquoted NaN / Infinity (covered with engine-specific xfail).

I mentioned previously, but to elaborate, if we have no path forward to support NaN/Infinity with orjson, then it seems to me we should not move forward. It would mean we cannot remove ujson, which was the reason we were thinking of orjson in the first place.

But I would like to get others thoughts here, cc @jorisvandenbossche @mroeschke

@mroeschke
Member

Agreed with @rhshadrach's assessment - I don't think it's worth adding orjson if it doesn't meet all the feature parity of the existing ujson engine.

Also, is the pyarrow engine able to read json with large integers as described in the original issue?

@Mazen050
Author

Agreed with @rhshadrach's assessment - I don't think it's worth adding orjson if it doesn't meet all the feature parity of the existing ujson engine.

That is fair considering it might break a lot of existing code.

Also, is the pyarrow engine able to read json with large integers as described in the original issue?

Yes, the pyarrow engine is able to read large integers.

Also, If nothing else is required here, I’m fine with closing this PR.

@mroeschke
Member

Thanks for your investigation here with orjson so far @Mazen050. It was helpful in discovering the differences between orjson and our vendored ujson, and that pandas is probably not ready to make a wholesale switch to orjson.

Additionally, since the original issue of large integers is covered by at least the pyarrow engine, I believe we can close this PR.

@mroeschke mroeschke closed this Jan 29, 2026
@rhshadrach
Member

It was helpful in discovering the differences between orjson and our vendored ujson, and that pandas is probably not ready to make a wholesale switch to orjson.

+1. I would also like to make a pitch to orjson to support NaN/Infinity, but it will likely take me quite some time to do so, so if anyone else is interested please have at it. But if orjson ever does support it, this PR will then be useful to get things up and running.

@Dr-Irv
Contributor

Dr-Irv commented Jan 29, 2026

+1. I would also like to make a pitch to orjson to support NaN/Infinity, but it will likely take me quite some time to do so, so if anyone else is interested please have at it. But if orjson ever does support it, this PR will then be useful to get things up and running.

Given the discussion at ijl/orjson#170, I don't think orjson would make the change unless we brought up the issue again saying we'd like this for pandas support. He rejected the idea almost 4 years ago.

There's another option to consider if orjson doesn't want to make the change. Just like we took some prior version of ujson, modified it, and then bundled the modified version as part of the pandas project (as opposed to making it a dependency), we could just see if we could make our own fork of orjson and fix the issue for us. Of course, if we do that, we might want to contribute it back to orjson.

@rhshadrach
Member

rhshadrach commented Jan 30, 2026

Given the discussion at ijl/orjson#170, I don't think orjson would make the change. Unless we brought up the issue again saying we'd like this for pandas support. He rejected the idea almost 4 years ago.

@Dr-Irv - did you see the reason for the closure? It was:

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

That is not a hard no.

we could just see if we could make our own fork of orjson and fix the issue for us.

I'm negative on pandas depending on a fork of a rust repo.

@Dr-Irv
Contributor

Dr-Irv commented Jan 30, 2026

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

That is not a hard no.

Agreed.

we could just see if we could make our own fork of orjson and fix the issue for us.

I'm negative on pandas depending on a fork of a rust repo.

Good point. Not that forking a C-implementation of ujson was that much better!


Development

Successfully merging this pull request may close these issues.

BUG: pd.read_json fails with "Value is too big!" on large integers, while json.load + DataFrame works
ENH: Migrate from ujson to orjson

5 participants