Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jan 20, 2026

Which issue does this PR close?

Rationale for this change

It's not good to tolerate obviously ill-formed JSON like [,,, 10,,, 20,,,]

What changes are included in this PR?

Reject leading and repeated commas while still tolerating at most one trailing comma, since that's a common and intuitive case.

While we're at it, optimize the tape decoder state machine to eliminate redundant decision-making. The performance benefits from that optimization compensate for the performance loss due to checking separately from commas.

Are these changes tested?

Yes, new unit tests cover the expected behavior change and benchmarking shows moderate overall improvement in performance. Exception: the three xxx_hex_json variants are very noisy and show anything from 15% speedup to 20% slowdown from run to run. But as far as I can tell they are all tape-decoding the exact same input JSON values, and any performance differences in the tape decoder should affect them equally. This leads me to conclude that those three benchmark cases are just plain unstable.

Are there any user-facing changes?

JSON parsing now rejects ill-formed JSON it used to accept. Not sure if this might merit a documentation change?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 20, 2026
Copy link
Contributor Author

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review to help other reviewers.

Comment on lines +308 to +309
/// Evaluates to the next non-whitespace byte in the iterator or breaks the current loop
macro_rules! next_non_whitespace {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly inspired by the next! macro

Comment on lines +318 to +319
/// Dispatches value type detection with optional special case and custom transition function
macro_rules! dispatch_value {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three places in the fully expanded/inline JSON grammar that expect a value:

  • Starting a new line -- we skip leading whitespace before dispatching on the first value byte, in order to know whether we should increment the row count or not
  • Expecting an array element -- we skip leading whitespace and check for a closing ] before dispatching as a value
  • Expecting an object field value -- we already saw the : and know a value should follow, but there could be leading whitespace. To handle this, we push a Value on the stack instead of dispatching directly.

Rather than dispatch to a separate Value state that needs to re-examine the byte, we use the macro to inline the logic three times, with localized tweaks as needed.

Comment on lines +399 to +400
/// Write the closing elements for a list to the tape
fn end_list(&mut self, start_idx: u32) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two helpers were factored out from their respective Object and List match arms, because now they have two call sites. It didn't seem helpful to use a macro because there are no custom logic tweaks involved.

None => {
iter.skip_whitespace();
if iter.is_empty() || self.cur_row >= self.batch_size {
if self.cur_row >= self.batch_size {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking carefully (and asking Claude to double-check my reasoning), I don't think it was actually important to skip whitespace before checking row count. If anything, that ordering was a slight performance pessimization.

break;
}

let b = match iter.next_non_whitespace() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have used the macro... but we still need to do it here, before incrementing the row count.

},
b']' => {
self.end_list(start_idx);
continue;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The continue is newly required in order to bypass the transition logic inside the macro.
It's not a behavior change because this match is anyway the last statement in the loop body.

b => unreachable!("{}", b),
}
}
state @ DecoderState::Value => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state @ capture was unnecessary -- state was already in scope.

match next!(iter) {
b':' => self.stack.pop(),
match next_non_whitespace!(iter) {
b':' => *state = DecoderState::Value,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only state transition that requires an actual Value on the stack, because we don't know the next byte yet. The other two sites that expect values already know the next byte and can dispatch directly on it.

self.advance_until(|b| !json_whitespace(b));
// Advance to the next non-whitespace char and consume it
fn next_non_whitespace(&mut self) -> Option<u8> {
for b in self.as_slice() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imperative code because all declarative monadic chains I could cook up produced significantly worse machine code.


fn next(&mut self) -> Option<Self::Item> {
let b = self.peek();
let b = self.peek()?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence with this change:

  • Technically, it's needed because somebody repeatedly calling next on the iterator could cause an integer overflow by blindly incrementing the index
  • But it's performance-neutral at best and may even be a slight slowdown.

I can be convinced to revert it, if we think it's unhelpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JSON parser tolerates illegal commas

1 participant