Json comma fix #9232
base: main
Conversation
scovich
left a comment
Self-review to help other reviewers.
```rust
/// Evaluates to the next non-whitespace byte in the iterator or breaks the current loop
macro_rules! next_non_whitespace {
```
Directly inspired by the `next!` macro.
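To make the pattern concrete, here is a minimal, hypothetical sketch of a `next!`-style macro that yields the next non-whitespace byte or breaks the enclosing loop; the names and the `json_whitespace` helper are illustrative, not the actual arrow-json implementation:

```rust
// True for the four whitespace bytes JSON allows between tokens.
fn json_whitespace(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}

// Hypothetical sketch: evaluates to the next non-whitespace byte from a
// byte iterator, or breaks the loop the macro is expanded inside of.
macro_rules! next_non_whitespace {
    ($iter:expr) => {
        match $iter.find(|&b| !json_whitespace(b)) {
            Some(b) => b,
            None => break, // input exhausted: leave the enclosing loop
        }
    };
}

// Collect every significant (non-whitespace) byte to show the macro in use.
fn significant_bytes(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = input.iter().copied();
    loop {
        out.push(next_non_whitespace!(iter));
    }
    out
}

fn main() {
    assert_eq!(significant_bytes(b"  [\n 1 ]"), b"[1]".to_vec());
    println!("ok");
}
```

The unlabeled `break` inside the expansion targets whatever loop surrounds the call site, which is what lets the caller avoid a separate end-of-input check.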
```rust
/// Dispatches value type detection with optional special case and custom transition function
macro_rules! dispatch_value {
```
There are three places in the fully expanded/inline JSON grammar that expect a value:
- Starting a new line -- we skip leading whitespace before dispatching on the first value byte, in order to know whether we should increment the row count or not
- Expecting an array element -- we skip leading whitespace and check for a closing `]` before dispatching as a value
- Expecting an object field value -- we already saw the `:` and know a value should follow, but there could be leading whitespace. To handle this, we push a Value on the stack instead of dispatching directly.

Rather than dispatch to a separate Value state that needs to re-examine the byte, we use the macro to inline the logic three times, with localized tweaks as needed.
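As a rough illustration of the byte-level dispatch these three sites share, here is a hypothetical, simplified classifier for the first byte of a JSON value (not the arrow-json tape decoder itself):

```rust
// Hypothetical illustration: classify a JSON value by its first byte.
#[derive(Debug, PartialEq)]
enum ValueKind {
    Object,
    List,
    String,
    Literal, // true / false / null
    Number,
}

fn dispatch_value(first: u8) -> Result<ValueKind, String> {
    Ok(match first {
        b'{' => ValueKind::Object,
        b'[' => ValueKind::List,
        b'"' => ValueKind::String,
        b't' | b'f' | b'n' => ValueKind::Literal,
        b'-' | b'0'..=b'9' => ValueKind::Number,
        b => return Err(format!("unexpected byte: {}", b as char)),
    })
}

fn main() {
    assert_eq!(dispatch_value(b'[').unwrap(), ValueKind::List);
    assert!(dispatch_value(b',').is_err());
    println!("ok");
}
```

Because the first byte fully determines the value kind, each of the three call sites can dispatch immediately once it has skipped whitespace, rather than round-tripping through a generic Value state.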
```rust
/// Write the closing elements for a list to the tape
fn end_list(&mut self, start_idx: u32) {
```
These two helpers were factored out from their respective Object and List match arms, because now they have two call sites. It didn't seem helpful to use a macro because there are no custom logic tweaks involved.
```diff
 None => {
-    iter.skip_whitespace();
-    if iter.is_empty() || self.cur_row >= self.batch_size {
+    if self.cur_row >= self.batch_size {
```
After thinking carefully (and asking Claude to double-check my reasoning), I don't think it was actually important to skip whitespace before checking row count. If anything, that ordering was a slight performance pessimization.
arrow-json/src/reader/tape.rs
```rust
    break;
}

let b = match iter.next_non_whitespace() {
```
This should have used the macro... but we still need to do it here, before incrementing the row count.
```rust
},
b']' => {
    self.end_list(start_idx);
    continue;
```
The `continue` is newly required in order to bypass the transition logic inside the macro.
It's not a behavior change, because this match was already the last statement in the loop body.
```rust
        b => unreachable!("{}", b),
    }
}
state @ DecoderState::Value => {
```
The `state @` capture was unnecessary -- `state` was already in scope.
```diff
-match next!(iter) {
-    b':' => self.stack.pop(),
+match next_non_whitespace!(iter) {
+    b':' => *state = DecoderState::Value,
```
This is the only state transition that requires an actual `Value` on the stack, because we don't know the next byte yet. The other two sites that expect values already know the next byte and can dispatch directly on it.
```diff
-self.advance_until(|b| !json_whitespace(b));
+// Advance to the next non-whitespace char and consume it
+fn next_non_whitespace(&mut self) -> Option<u8> {
+    for b in self.as_slice() {
```
Imperative code because all declarative monadic chains I could cook up produced significantly worse machine code.
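A self-contained sketch of the imperative shape described here, under assumed names (`ByteIter`, `buf`, `pos` are illustrative, not the actual arrow-json types): advance an index over a byte slice, skipping whitespace, and consume-and-return the first significant byte.

```rust
fn json_whitespace(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}

// Hypothetical byte cursor over a slice, tracking a consume position.
struct ByteIter<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteIter<'a> {
    // Plain imperative loop: each iteration inspects one byte, advances the
    // position, and returns the byte if it is not whitespace.
    fn next_non_whitespace(&mut self) -> Option<u8> {
        for &b in &self.buf[self.pos..] {
            self.pos += 1;
            if !json_whitespace(b) {
                return Some(b);
            }
        }
        None
    }
}

fn main() {
    let mut it = ByteIter { buf: b"  \t x y", pos: 0 };
    assert_eq!(it.next_non_whitespace(), Some(b'x'));
    assert_eq!(it.next_non_whitespace(), Some(b'y'));
    assert_eq!(it.next_non_whitespace(), None);
    println!("ok");
}
```

The equivalent iterator-combinator version (`skip_while` + `next`, or `find`) expresses the same thing, but per the comment above, the straight loop optimized better here.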
```diff
 fn next(&mut self) -> Option<Self::Item> {
-    let b = self.peek();
+    let b = self.peek()?;
```
I'm on the fence with this change:
- Technically, it's needed because somebody repeatedly calling `next` on the iterator could cause an integer overflow by blindly incrementing the index.
- But it's performance-neutral at best and may even be a slight slowdown.

I can be convinced to revert it, if we think it's unhelpful.
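The overflow-guard argument can be sketched like this (hypothetical `ByteIter`, `buf`, `pos`, and `peek` names, not the actual arrow-json code): with `peek()?`, the position is only incremented when a byte actually exists, so repeated `next` calls at end-of-input leave the counter untouched.

```rust
struct ByteIter<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteIter<'a> {
    fn peek(&self) -> Option<u8> {
        self.buf.get(self.pos).copied()
    }
}

impl<'a> Iterator for ByteIter<'a> {
    type Item = u8;
    fn next(&mut self) -> Option<Self::Item> {
        let b = self.peek()?; // bail out BEFORE touching `pos`
        self.pos += 1;
        Some(b)
    }
}

fn main() {
    let mut it = ByteIter { buf: b"ab", pos: 0 };
    assert_eq!(it.next(), Some(b'a'));
    assert_eq!(it.next(), Some(b'b'));
    // Exhausted: `pos` stays put no matter how often next() is called.
    assert_eq!(it.next(), None);
    assert_eq!(it.next(), None);
    assert_eq!(it.pos, 2);
    println!("ok");
}
```

Without the `?`, an unconditional `self.pos += 1` after a failed peek would keep incrementing past the buffer and could eventually wrap.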
Which issue does this PR close?
Rationale for this change
It's not good to tolerate obviously ill-formed JSON like `[,,, 10,,, 20,,,]`.

What changes are included in this PR?
Reject leading and repeated commas while still tolerating at most one trailing comma, since that's a common and intuitive case.
While we're at it, optimize the tape decoder state machine to eliminate redundant decision-making. The performance benefits from that optimization compensate for the cost of the separate comma checking.
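A standalone illustration of the intended comma semantics (a deliberately simplified string-level check over a flat array, not the actual tape decoder): leading and repeated commas are rejected, while a single trailing comma is tolerated.

```rust
// Simplified, hypothetical comma-rule check for a flat JSON array.
// It does not handle nesting or whitespace between adjacent commas;
// it only illustrates the accept/reject behavior described above.
fn valid_commas(array_text: &str) -> bool {
    let inner = array_text
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'));
    let inner = match inner {
        Some(s) => s.trim(),
        None => return false, // not an array at all
    };
    if inner.is_empty() {
        return true; // "[]"
    }
    if inner.starts_with(',') || inner.contains(",,") {
        return false; // leading or repeated comma
    }
    true // at most one trailing comma passes, e.g. "[10, 20,]"
}

fn main() {
    assert!(valid_commas("[10, 20]"));
    assert!(valid_commas("[10, 20,]")); // one trailing comma tolerated
    assert!(!valid_commas("[,10]")); // leading comma rejected
    assert!(!valid_commas("[10,,20]")); // repeated comma rejected
    println!("ok");
}
```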
Are these changes tested?
Yes, new unit tests cover the expected behavior change, and benchmarking shows a moderate overall improvement in performance. Exception: the three `xxx_hex_json` variants are very noisy and show anything from a 15% speedup to a 20% slowdown from run to run. But as far as I can tell they are all tape-decoding the exact same input JSON values, and any performance difference in the tape decoder should affect them equally. This leads me to conclude that those three benchmark cases are just plain unstable.

Are there any user-facing changes?
JSON parsing now rejects ill-formed JSON it used to accept. Not sure if this might merit a documentation change?