
BUG: fix read_json failure on very large integers #63725

Closed
Mazen050 wants to merge 13 commits into pandas-dev:main from Mazen050:add-optional-orjson

Conversation

@Mazen050

@Mazen050 Mazen050 commented Jan 17, 2026


This PR fixes a failure in pd.read_json when parsing JSON data containing very large integers, which currently raises ValueError: Value is too big! due to limitations in the vendored ujson parser.

The fix adds support for engine="orjson" in read_json, allowing large integers to be parsed successfully (following orjson semantics, where very large integers are decoded as floats). The implementation mirrors the existing ujson code path and does not change default behavior unless the orjson engine is explicitly selected.

New tests are added to cover large integer parsing for both DataFrame and Series outputs when using the orjson engine.

Note: orjson is added as an optional dependency using pandas' import_optional_dependency().
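For context, the failure is specific to the vendored parser's fixed-width integer handling, not to JSON itself: the standard-library parser returns exact arbitrary-precision ints. A stdlib-only sketch using the boundary value from the pandas test suite (GH20599):

```python
import json

# One past the signed 64-bit range; the vendored ujson parser rejects
# values like this with "Value is too small!"/"Value is too big!",
# while the stdlib parser returns an exact Python int.
big = -(2**63) - 1  # -9223372036854775809
parsed = json.loads('{"articleId": %d}' % big)
assert parsed["articleId"] == big
```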

@Mazen050 Mazen050 requested a review from Dr-Irv as a code owner January 17, 2026 21:31
@Mazen050
Author

Mazen050 commented Jan 17, 2026

About #62072

Saving with df.to_json("out.json", double_precision=15) writes [0.261799387799149,0.11111111111111111] instead of [0.2617993877991494,0.111111111111111112]. Reading that file back with engine=orjson yields 0.261799387799149 and 0.111111111111111, so the JSON is read successfully; any precision loss comes from df.to_json, not from the reader.

If this is acceptable I will add a test for it and add a 'closes' keyword for it.
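The truncation can be seen with plain float formatting; a stdlib-only sketch, assuming double_precision=15 corresponds to 15 significant digits for these sub-1 values:

```python
# Round-tripping this double needs 16 significant digits; formatting at
# 15 drops the last digit, matching the value to_json writes.
x = 0.2617993877991494
assert format(x, ".15g") == "0.261799387799149"
assert repr(x) == "0.2617993877991494"  # shortest round-trip form
```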

Contributor

@Dr-Irv Dr-Irv left a comment


You will need to do a few additional things if this is to be accepted (this list may not be complete):

Also, other maintainers will have to decide whether this should go in whatsnew/v3.0.0.rst or in something like whatsnew/v3.1.0.rst since 3.0 is almost out the door.

@Mazen050
Author

@Dr-Irv Thank you for the review.

I will update the docs for read_json to describe the availability and behavior of engine="orjson".

For to_json, this PR does not add or expose an engine parameter, and df.to_json(engine="orjson") is not currently supported. Because of that, I wasn’t sure how to update the to_json documentation without first extending its API.

I did explore adding engine="orjson" support to to_json, but it appears to require more extensive work, since several parameters passed to ujson_dumps are not supported by orjson.

If the expectation is to document orjson for to_json only once it is actually supported, I can proceed with documenting read_json only in this PR. Alternatively, I can open a follow-up PR to add orjson support to to_json and update the docs accordingly.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

If the expectation is to document orjson for to_json only once it is actually supported, I can proceed with documenting read_json only in this PR. Alternatively, I can open a follow-up PR to add orjson support to to_json and update the docs accordingly.

I think that's fine. Will let others also chime in.

@Mazen050 Mazen050 requested a review from Dr-Irv January 19, 2026 19:56
@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

I did a little research on this, and we have issue #62464 where there is discussion about replacing ujson with orjson.

So that should be investigated here. Do all the existing JSON tests pass if you use orjson as a replacement?

@Mazen050
Author

Hello @Dr-Irv.

I have seen this issue, and this comment here suggests making it an optional dependency, so that's why I implemented it that way.

About the tests: I set orjson as the default engine for read_json and found that all tests pass except for 5.

  • 2 tests were expecting a 'Value is too small|Value is too big' ValueError but instead got a ValueError with the message 'If using all scalar values, you must pass an index'.

  • 1 was an error-message mismatch when given bad JSON like so:

E         Expected regex: 'Expected object or value'
E         Actual message: 'unexpected character: line 1 column 8 (char 7)'
  • 1 was a trailing comma error, which exposes that the orjson engine is stricter than the vendored ujson engine:
        data_json = """{
            "schema":{
                "fields":[
                    {
                        "name":"a",
                        "type":"integer",
                        "extDtype":"Int64"
                    }
                ],  <---- HERE
            },
            "data":[
                {
                    "a":2
                },
                {
                    "a":null
                }
            ]
        }"""
  • the last one was the most important, as it exposed a difference between ujson and orjson: orjson doesn't support Infinity and NaN appearing in the JSON, like so:
            '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'

Error:

E           orjson.JSONDecodeError: unexpected character: line 1 column 7 (char 6)

If the last point should be fixed, I would happily do so if you can guide me.
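As an aside (not part of this PR), Python's stdlib json module behaves like the vendored ujson on this point: its decoder accepts the non-standard unquoted NaN/Infinity tokens by default, which is exactly what orjson refuses to do:

```python
import json
import math

# By default the stdlib decoder accepts NaN/Infinity/-Infinity;
# a parse_constant hook can be passed to reject or remap them.
vals = json.loads('["a", NaN, "NaN", Infinity, "Infinity", -Infinity]')
assert math.isnan(vals[1])   # unquoted NaN -> float('nan')
assert vals[2] == "NaN"      # quoted "NaN" stays a string
assert vals[3] == math.inf
assert vals[5] == -math.inf
```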

@Mazen050
Author

Mazen050 commented Jan 19, 2026

About your suggestion here

I tried the latest ujson version and found that the same tests that fail with orjson also fail with the new version of ujson, except for this one:

  • 1 was an error-message mismatch when given bad JSON like so:
E         Expected regex: 'Expected object or value'
E         Actual message: 'unexpected character: line 1 column 8 (char 7)'

making it 4 failed tests, the same tests as above, though with a notable difference.

The test with the NaN and Infinity had this Error instead:

E   AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
E   
E   DataFrame.iloc[:, 0] (column name="0") values are different (14.28571 %)
E   [index]: [0, 1, 2, 3, 4, 5, 6]
E   [left]:  [a, nan, NaN, inf, Infinity, -inf, -Infinity]
E   [right]: [a, None, NaN, inf, Infinity, -inf, -Infinity]
E   At positional index 1, first diff: nan != None

This means the new version of ujson supports NaN and Infinity appearing in the JSON but handles NaN differently.

Also keeping this warning in mind.

To sum up, orjson (and the latest ujson) cannot be used as a drop-in replacement without behavior changes, mainly due to stricter JSON compliance and differences around NaN / Infinity.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

I have seen this issue and this comment here suggests to make it an optional dependency so thats why I made it so.

Yes, but this comment here makes it seem as if we are open to doing a full replacement.

@Dr-Irv
Contributor

Dr-Irv commented Jan 19, 2026

This means the new version of ujson supports NaN and Infinity appearing in the JSON but handles NaN differently.

Is the difference in how they handle NaN without quotes or "NaN" with quotes?

We might be willing to live with this, IF when we output JSON, the representation of np.nan is consistent.

@rhshadrach interested in your opinion here.

@Alvaro-Kothe
Member

@Mazen050, I've opened a pull request (#63763) that tries to implement support for orjson. Feel free to reference it as a guide.

@Mazen050
Author

The difference is in NaN without quotes.

If what you mean by the second point is a test like this:

def test_to_json_nan_roundtrip_orjson_ujson():
    df = DataFrame({"a": [1.0, np.nan, 2.0]})
    json_str = df.to_json(orient="records")

    result_ujson = read_json(
        StringIO(json_str), orient="records", engine="ujson"
    )
    result_orjson = read_json(
        StringIO(json_str), orient="records", engine="orjson"
    )
    assert_frame_equal(result_ujson, result_orjson)

then yes both the newer version of ujson and orjson read it correctly.

If you meant this test:

def test_emca_262_nan_inf_support():
    data = StringIO(
        '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
    )
    result = read_json(data)
    json_str = result.to_json(orient="records")

    read_json(StringIO(json_str), engine="orjson")

The orjson engine fails, but the new version of ujson works like the older one (Infinity is represented as null, i.e. NaN).

original:  '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
read after to_json: [{"0":"a"},{"0":null},{"0":"NaN"},{"0":null},{"0":"Infinity"},{"0":null},{"0":"-Infinity"}]

@Mazen050
Author

@Alvaro-Kothe Thanks I will check it out.

@Dr-Irv
Contributor

Dr-Irv commented Jan 20, 2026

If you meant this test:

def test_emca_262_nan_inf_support():
    data = StringIO(
        '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
    )
    result = read_json(data)
    json_str = result.to_json(orient="records")

    read_json(StringIO(json_str), engine="orjson")

The orjson engine fails, but the new version of ujson works like the older one (Infinity is represented as null, i.e. NaN).

original:  '["a", NaN, "NaN", Infinity, "Infinity", -Infinity, "-Infinity"]'
read after to_json: [{"0":"a"},{"0":null},{"0":"NaN"},{"0":null},{"0":"Infinity"},{"0":null},{"0":"-Infinity"}]

So if NaN was removed from data (but "NaN" was kept), the test would pass?

It's an interesting question whether we need to support that, as well as if there is a workaround with orjson to make that test pass.

@Mazen050
Author

Mazen050 commented Jan 20, 2026

So if NaN was removed from data (but "NaN" was kept), the test would pass?

Yes, the problem is with NaN and Infinity without quotes.

I found this Issue where the maintainer of orjson said he is not interested in supporting NaN and Infinity. So a workaround would be something like converting every NaN and Infinity that is not a key in the JSON to null (without quotes), which orjson supports.

All this suggests that orjson is stricter about the JSON being valid than the older ujson engine.
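A hypothetical sketch of that preprocessing step (normalize_nonfinite is an illustrative helper, not part of this PR; it splits out string literals first so quoted "NaN"/"Infinity" values and object keys are left untouched):

```python
import re

# Rewrite the non-standard unquoted NaN/Infinity tokens to null so a
# strict parser such as orjson accepts the document.
_TOKEN = re.compile(r"-?Infinity|NaN")
_STRING = re.compile(r'("(?:[^"\\]|\\.)*")')

def normalize_nonfinite(text: str) -> str:
    parts = _STRING.split(text)
    # Even indices are the segments outside string literals; only those
    # are rewritten, so quoted "NaN"/"Infinity" survive unchanged.
    return "".join(
        part if i % 2 else _TOKEN.sub("null", part)
        for i, part in enumerate(parts)
    )
```

The resulting text is strict JSON, so a parser like orjson would accept it, at the cost of collapsing NaN and ±Infinity into null.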

@Mazen050
Author

Hello @Alvaro-Kothe

I’d appreciate your opinion on this.

The main behavioral difference we’ve identified is that orjson intentionally does not support unquoted NaN and Infinity, whereas the current ujson behavior allows them.

This leaves us with a few possible paths forward:

  1. Keep orjson as an optional engine only (not a replacement), and clearly document that unquoted NaN / Infinity are not supported.

  2. Accept the stricter JSON compliance as a trade-off for better performance and large-integer / float handling.

  3. Add a preprocessing workaround to preserve existing behavior.

Given that orjson explicitly does not plan to support these values, I’m leaning toward (1) unless there’s strong motivation otherwise.

@Alvaro-Kothe
Member

My thought from the start was to make orjson an optional engine, and slowly deprecate ujson. I think that your first option is the way to go.

@Mazen050
Author

Thanks.

I will add documentation tomorrow noting that orjson is stricter and that it doesn't support unquoted NaN and Infinity.

I intentionally kept this PR minimal and scoped to introducing orjson as an optional engine for read_json, to avoid coupling it to broader refactors.

I’ve also looked at your PR, which takes a more generalized and future-proof approach. If there’s interest in moving in that direction, I’m happy to adapt this implementation.

Also If there is interest, a follow-up PR could explore an opt-in normalization step to align orjson behavior more closely with the legacy ujson engine.

@Mazen050 Mazen050 force-pushed the add-optional-orjson branch from 85cf226 to ef74d55 Compare January 23, 2026 16:26
Member

@rhshadrach rhshadrach left a comment


To add orjson as an optional dependency, I think we'll need to be running all the tests that we have for other engines with orjson.

Regarding the failing tests from #63725 (comment):

  • 2 tests were expecting a 'Value is too small|Value is too big' ValueError but instead got a ValueError with the message 'If using all scalar values, you must pass an index'.

In general changing of the exact message is fine if the type of exception is staying the same. However here the content seems to vary wildly, so it makes me suspicious that the proper reason for the error is being truly identified.

  • 1 was an error-message mismatch when given bad JSON like so:

This difference looks fine.

  • 1 was a trailing comma error, which exposes that the orjson engine is stricter than the vendored ujson engine

I'd be okay with documenting this change.

  • the last one was the most important, as it exposed a difference between ujson and orjson: orjson doesn't support Infinity and NaN appearing in the JSON

It does seem to me that not supporting NaN/Infinity is a deal-breaker here.

I found this Issue where the maintainer of orjson said he is not interested in supporting NaN and Infinity.

This does not appear to me to be an accurate summary. The actual statement is:

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

I take that as meaning orjson would be open to the extension if someone is willing to put in the work.

@Mazen050
Author

Hello @rhshadrach

Thanks for the feedback.

so it makes me suspicious that the proper reason for the error is being truly identified.

I have investigated this test and found that the test below:

    def test_read_json_large_numbers(self, bigNum):
        # GH20599, 26068
        json = StringIO('{"articleId":' + str(bigNum) + "}")
        msg = r"Value is too small|Value is too big"
        with pytest.raises(ValueError, match=msg):
            read_json(json)

        json = StringIO('{"0":{"articleId":' + str(bigNum) + "}}")
        with pytest.raises(ValueError, match=msg):
            read_json(json)

depends on the parse failing for big numbers, but since orjson doesn't fail, it parses the input and finds that StringIO('{"articleId":' + str(bigNum) + "}") evaluates to {"articleId":-9223372036854775809}, a dict of scalar values, which (even with a smaller number) would raise the same error under ujson: ValueError: If using all scalar values, you must pass an index. When I change it to StringIO('{"articleId":' + '[' + str(bigNum) + ']' + "}"), it works and successfully builds a DataFrame. So this error comes from the pandas validation layer, and this is expected.
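The pandas-layer error can be reproduced without any JSON engine at all; a minimal sketch, assuming only that pandas is importable:

```python
import pandas as pd

# A dict of bare scalars has no length, so DataFrame construction
# raises the same ValueError the test was hitting once orjson parsed
# the large integer successfully.
try:
    pd.DataFrame({"articleId": -9223372036854775809})
except ValueError as err:
    print(err)  # If using all scalar values, you must pass an index

# Wrapping the value in a list gives the column a length, so it works.
df = pd.DataFrame({"articleId": [-9223372036854775809]})
assert df.shape == (1, 1)
```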

I'd be okay with documenting this change.

Would the already-documented "This engine is stricter about JSON compliance" be enough, or should I make it more specific?

I take that as meaning orjson would be open to the extension if someone is willing to put in the work.

Yeah, sorry, it looks like I misunderstood the maintainer's comment.

Additionally, I ran all tests again, this time making sure every test ran for orjson, and found the same errors as stated above.

For the NaN/Infinity test. I’d agree it would be a deal-breaker if orjson were a replacement or default. Since it’s opt-in and documented, my intent is to treat lack of support for unquoted NaN / Infinity as an explicit limitation of that engine, with engine-specific xfails so the difference is visible.

@Mazen050 Mazen050 requested a review from rhshadrach January 25, 2026 14:56
@Mazen050 Mazen050 requested a review from mroeschke as a code owner January 26, 2026 17:12
@Mazen050 Mazen050 force-pushed the add-optional-orjson branch from 4b9524a to a44e9b8 Compare January 26, 2026 17:32
Update error messages and handling in JSON tests
Remove commented-out code and update test for large numbers.
@Mazen050
Author

Hi @rhshadrach , thanks for the feedback.

I’ve now parameterized the existing read_json engine tests so they also run with engine="orjson", similar to ujson/pyarrow, and addressed the resulting failures.

The remaining differences are limited to documented, engine-specific behavior:

  • Stricter JSON compliance (trailing commas).

  • No support for unquoted NaN / Infinity (covered with engine-specific xfail).

Please let me know if you’d like any additional tests exercised under orjson.

Also note that the failing tests fail under the main branch as well, indicating they are not related to changes in this PR.

@rhshadrach
Member

  • No support for unquoted NaN / Infinity (covered with engine-specific xfail).

I mentioned previously, but to elaborate, if we have no path forward to support NaN/Infinity with orjson, then it seems to me we should not move forward. It would mean we cannot remove ujson, which was the reason we were thinking of orjson in the first place.

But I would like to get others thoughts here, cc @jorisvandenbossche @mroeschke

@mroeschke
Member

Agreed with @rhshadrach's assessment - I don't think it's worth adding orjson if it doesn't meet all the feature parity of the existing ujson engine.

Also, is the pyarrow engine able to read json with large integers as described in the original issue?

@Mazen050
Author

Agreed with @rhshadrach's assessment - I don't think it's worth adding orjson if it doesn't meet all the feature parity of the existing ujson engine.

That is fair considering it might break a lot of existing code.

Also, is the pyarrow engine able to read json with large integers as described in the original issue?

Yes, the pyarrow engine is able to read large integers.

Also, If nothing else is required here, I’m fine with closing this PR.

@mroeschke
Member

Thanks for your investigation here with orjson so far @Mazen050. It was helpful in discovering the differences between orjson and our vendored ujson, and that pandas is probably not ready to make a wholesale switch to orjson.

Additionally, since the original issue of large integers is covered by at least the pyarrow engine, I believe we can close this PR.

@mroeschke mroeschke closed this Jan 29, 2026
@rhshadrach
Member

It was helpful in discovering the differences between orjson and our vendored ujson, and that pandas is probably not ready to make a wholesale switch to orjson.

+1. I would also like to make a pitch to orjson to support NaN/Infinity, but it will likely take me quite some time to do so, so if anyone else is interested please have at it. But if orjson ever does support it, this PR will then be useful to get things up and running.

@Dr-Irv
Contributor

Dr-Irv commented Jan 29, 2026

+1. I would also like to make a pitch to orjson to support NaN/Infinity, but it will likely take me quite some time to do so, so if anyone else is interested please have at it. But if orjson ever does support it, this PR will then be useful to get things up and running.

Given the discussion at ijl/orjson#170, I don't think orjson would make the change unless we brought up the issue again saying we'd like this for pandas support. He rejected the idea almost 4 years ago.

There's another option to consider if orjson doesn't want to make the change. Just like we took some prior version of ujson, modified it, and then bundled the modified version as part of the pandas project (as opposed to making it a dependency), we could just see if we could make our own fork of orjson and fix the issue for us. Of course, if we do that, we might want to contribute it back to orjson.

@rhshadrach
Member

rhshadrach commented Jan 30, 2026

Given the discussion at ijl/orjson#170, I don't think orjson would make the change. Unless we brought up the issue again saying we'd like this for pandas support. He rejected the idea almost 4 years ago.

@Dr-Irv - did you see the reason for the closure? It was:

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

That is not a hard no.

we could just see if we could make our own fork of orjson and fix the issue for us.

I'm negative on pandas depending on a fork of a rust repo.

@Dr-Irv
Contributor

Dr-Irv commented Jan 30, 2026

I have no interest in figuring this out, but someone can open a new issue with a concrete proposal.

That is not a hard no.

Agreed.

we could just see if we could make our own fork of orjson and fix the issue for us.

I'm negative on pandas depending on a fork of a rust repo.

Good point. Not that forking a C-implementation of ujson was that much better!


Development

Successfully merging this pull request may close these issues.

BUG: pd.read_json fails with "Value is too big!" on large integers, while json.load + DataFrame works
ENH: Migrate from ujson to orjson

5 participants