Json comma fix #9232
base: main
Conversation
scovich
left a comment
Self-review to help other reviewers.
```rust
/// Evaluates to the next non-whitespace byte in the iterator or breaks the current loop
macro_rules! next_non_whitespace {
```
Directly inspired by the `next!` macro.
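To make the pattern concrete, here is a minimal, hypothetical sketch of a `next!`-style macro that yields the next non-whitespace byte or breaks the enclosing loop; the names and the `json_whitespace` helper are illustrative, not the actual arrow-json implementation:

```rust
// True for the four whitespace bytes JSON allows between tokens.
fn json_whitespace(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}

// Hypothetical sketch: evaluates to the next non-whitespace byte from a
// byte iterator, or breaks the loop the macro is expanded inside of.
macro_rules! next_non_whitespace {
    ($iter:expr) => {
        match $iter.find(|&b| !json_whitespace(b)) {
            Some(b) => b,
            None => break, // input exhausted: leave the enclosing loop
        }
    };
}

// Collect every significant (non-whitespace) byte to show the macro in use.
fn significant_bytes(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut iter = input.iter().copied();
    loop {
        out.push(next_non_whitespace!(iter));
    }
    out
}

fn main() {
    assert_eq!(significant_bytes(b"  [\n 1 ]"), b"[1]".to_vec());
    println!("ok");
}
```

The unlabeled `break` inside the expansion targets whatever loop surrounds the call site, which is what lets the caller avoid a separate end-of-input check.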
```rust
/// Dispatches value type detection with optional special case and custom transition function
macro_rules! dispatch_value {
```
There are three places in the fully expanded/inline JSON grammar that expect a value:
- Starting a new line -- we skip leading whitespace before dispatching on the first value byte, in order to know whether we should increment the row count or not
- Expecting an array element -- we skip leading whitespace and check for a closing `]` before dispatching as a value
- Expecting an object field value -- we already saw the `:` and know a value should follow, but there could be leading whitespace. To handle this, we push a Value on the stack instead of dispatching directly.

Rather than dispatch to a separate Value state that needs to re-examine the byte, we use the macro to inline the logic three times, with localized tweaks as needed.
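As a rough illustration of the byte-level dispatch these three sites share, here is a hypothetical, simplified classifier for the first byte of a JSON value (not the arrow-json tape decoder itself):

```rust
// Hypothetical illustration: classify a JSON value by its first byte.
#[derive(Debug, PartialEq)]
enum ValueKind {
    Object,
    List,
    String,
    Literal, // true / false / null
    Number,
}

fn dispatch_value(first: u8) -> Result<ValueKind, String> {
    Ok(match first {
        b'{' => ValueKind::Object,
        b'[' => ValueKind::List,
        b'"' => ValueKind::String,
        b't' | b'f' | b'n' => ValueKind::Literal,
        b'-' | b'0'..=b'9' => ValueKind::Number,
        b => return Err(format!("unexpected byte: {}", b as char)),
    })
}

fn main() {
    assert_eq!(dispatch_value(b'[').unwrap(), ValueKind::List);
    assert!(dispatch_value(b',').is_err());
    println!("ok");
}
```

Because the first byte fully determines the value kind, each of the three call sites can dispatch immediately once it has skipped whitespace, rather than round-tripping through a generic Value state.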
```rust
/// Write the closing elements for a list to the tape
fn end_list(&mut self, start_idx: u32) {
```
These two helpers were factored out from their respective Object and List match arms, because now they have two call sites. It didn't seem helpful to use a macro because there are no custom logic tweaks involved.
```diff
 None => {
-    iter.skip_whitespace();
-    if iter.is_empty() || self.cur_row >= self.batch_size {
+    if self.cur_row >= self.batch_size {
```
After thinking carefully (and asking Claude to double-check my reasoning), I don't think it was actually important to skip whitespace before checking row count. If anything, that ordering was a slight performance pessimization.
arrow-json/src/reader/tape.rs
```rust
    break;
}

let b = match iter.next_non_whitespace() {
```
This should have used the macro... but we still need to do it here, before incrementing the row count.
```rust
},
b']' => {
    self.end_list(start_idx);
    continue;
```
The `continue` is newly required in order to bypass the transition logic inside the macro.
It's not a behavior change, because this match was already the last statement in the loop body.
```rust
        b => unreachable!("{}", b),
    }
}
state @ DecoderState::Value => {
```
The `state @` capture was unnecessary -- `state` was already in scope.
```diff
-match next!(iter) {
-    b':' => self.stack.pop(),
+match next_non_whitespace!(iter) {
+    b':' => *state = DecoderState::Value,
```
This is the only state transition that requires an actual `Value` on the stack, because we don't know the next byte yet. The other two sites that expect values already know the next byte and can dispatch directly on it.
```diff
-self.advance_until(|b| !json_whitespace(b));
+// Advance to the next non-whitespace char and consume it
+fn next_non_whitespace(&mut self) -> Option<u8> {
+    for b in self.as_slice() {
```
Imperative code because all declarative monadic chains I could cook up produced significantly worse machine code.
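A self-contained sketch of the imperative shape described here, under assumed names (`ByteIter`, `buf`, `pos` are illustrative, not the actual arrow-json types): advance an index over a byte slice, skipping whitespace, and consume-and-return the first significant byte.

```rust
fn json_whitespace(b: u8) -> bool {
    matches!(b, b' ' | b'\t' | b'\n' | b'\r')
}

// Hypothetical byte cursor over a slice, tracking a consume position.
struct ByteIter<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteIter<'a> {
    // Plain imperative loop: each iteration inspects one byte, advances the
    // position, and returns the byte if it is not whitespace.
    fn next_non_whitespace(&mut self) -> Option<u8> {
        for &b in &self.buf[self.pos..] {
            self.pos += 1;
            if !json_whitespace(b) {
                return Some(b);
            }
        }
        None
    }
}

fn main() {
    let mut it = ByteIter { buf: b"  \t x y", pos: 0 };
    assert_eq!(it.next_non_whitespace(), Some(b'x'));
    assert_eq!(it.next_non_whitespace(), Some(b'y'));
    assert_eq!(it.next_non_whitespace(), None);
    println!("ok");
}
```

The equivalent iterator-combinator version (`skip_while` + `next`, or `find`) expresses the same thing, but per the comment above, the straight loop optimized better here.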
```diff
 fn next(&mut self) -> Option<Self::Item> {
-    let b = self.peek();
+    let b = self.peek()?;
```
I'm on the fence with this change:
- Technically, it's needed because somebody repeatedly calling `next` on the iterator could cause an integer overflow by blindly incrementing the index.
- But it's performance-neutral at best and may even be a slight slowdown.

I can be convinced to revert it, if we think it's unhelpful.
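The overflow-guard argument can be sketched like this (hypothetical `ByteIter`, `buf`, `pos`, and `peek` names, not the actual arrow-json code): with `peek()?`, the position is only incremented when a byte actually exists, so repeated `next` calls at end-of-input leave the counter untouched.

```rust
struct ByteIter<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteIter<'a> {
    fn peek(&self) -> Option<u8> {
        self.buf.get(self.pos).copied()
    }
}

impl<'a> Iterator for ByteIter<'a> {
    type Item = u8;
    fn next(&mut self) -> Option<Self::Item> {
        let b = self.peek()?; // bail out BEFORE touching `pos`
        self.pos += 1;
        Some(b)
    }
}

fn main() {
    let mut it = ByteIter { buf: b"ab", pos: 0 };
    assert_eq!(it.next(), Some(b'a'));
    assert_eq!(it.next(), Some(b'b'));
    // Exhausted: `pos` stays put no matter how often next() is called.
    assert_eq!(it.next(), None);
    assert_eq!(it.next(), None);
    assert_eq!(it.pos, 2);
    println!("ok");
}
```

Without the `?`, an unconditional `self.pos += 1` after a failed peek would keep incrementing past the buffer and could eventually wrap.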
Which issue does this PR close?
Rationale for this change
It's not good to tolerate obviously ill-formed JSON like `[,,, 10,,, 20,,,]`.

What changes are included in this PR?
Reject leading and repeated commas while still tolerating at most one trailing comma, since that's a common and intuitive case.
While we're at it, optimize the tape decoder state machine to eliminate redundant decision-making. The performance benefits from that optimization compensate for the cost of the separate comma checking.
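A standalone illustration of the intended comma semantics (a deliberately simplified string-level check over a flat array, not the actual tape decoder): leading and repeated commas are rejected, while a single trailing comma is tolerated.

```rust
// Simplified, hypothetical comma-rule check for a flat JSON array.
// It does not handle nesting or whitespace between adjacent commas;
// it only illustrates the accept/reject behavior described above.
fn valid_commas(array_text: &str) -> bool {
    let inner = array_text
        .trim()
        .strip_prefix('[')
        .and_then(|s| s.strip_suffix(']'));
    let inner = match inner {
        Some(s) => s.trim(),
        None => return false, // not an array at all
    };
    if inner.is_empty() {
        return true; // "[]"
    }
    if inner.starts_with(',') || inner.contains(",,") {
        return false; // leading or repeated comma
    }
    true // at most one trailing comma passes, e.g. "[10, 20,]"
}

fn main() {
    assert!(valid_commas("[10, 20]"));
    assert!(valid_commas("[10, 20,]")); // one trailing comma tolerated
    assert!(!valid_commas("[,10]")); // leading comma rejected
    assert!(!valid_commas("[10,,20]")); // repeated comma rejected
    println!("ok");
}
```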
Are these changes tested?
Yes, new unit tests cover the expected behavior change, and benchmarking shows a moderate overall improvement in performance. Exception: the three `xxx_hex_json` variants are very noisy and show anything from a 15% speedup to a 20% slowdown from run to run. But as far as I can tell they are all tape-decoding the exact same input JSON values, and any performance difference in the tape decoder should affect them equally. This leads me to conclude that those three benchmark cases are just plain unstable.

Are there any user-facing changes?
JSON parsing now rejects ill-formed JSON it used to accept. Not sure if this might merit a documentation change?