Description
I have a big-picture question:
What trade-offs are we willing to make on validation of JSON values we will ultimately discard?
At one extreme, we could fully parse and validate everything and just choose not to append the skipped bits to the tape afterward.
- CON: Strongly limits the performance gain of skipping, because parsing and validation are the lion's share of the work.
At the other extreme, we completely ignore the bytes corresponding to skipped values, doing only the bare minimum needed to be relatively confident we correctly identified the byte range to skip.
- CON: Accepts blatantly invalid JSON as long as the bytes satisfy whatever region identification heuristics we come up with.
- CON: Risk of identifying the wrong region and skipping bytes that should not have been skipped.
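To make the lenient extreme concrete, here is a minimal sketch (not code from this PR; the function name and signature are hypothetical) of what such a region-identification heuristic might look like: it finds the end of a container value purely by tracking bracket depth and string/escape state, never validating the bytes in between. It illustrates both CONs above: it happily "skips" garbage like `{invalid!!}`, and any flaw in the string/escape tracking would cause it to mis-identify the region boundary.

```rust
/// Illustrative only: find the length of a JSON container value (`{...}` or
/// `[...]`) at the start of `bytes`, without validating its contents.
/// Returns one past the matching close bracket, or None if unbalanced.
fn skip_value(bytes: &[u8]) -> Option<usize> {
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            // Inside a string, only escape sequences and the closing quote
            // matter; brackets are plain data and must not affect depth.
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' | b'[' => depth += 1,
            b'}' | b']' => {
                // A close bracket with no matching open means the region
                // cannot be identified.
                depth = depth.checked_sub(1)?;
                if depth == 0 {
                    return Some(i + 1); // one past the closing bracket
                }
            }
            // Everything else -- valid or not -- is ignored entirely.
            _ => {}
        }
    }
    None
}
```

Note that this sketch does not even check that `{` pairs with `}` rather than `]`, and it accepts arbitrary non-JSON bytes between the brackets; that is exactly the kind of leniency whose acceptability this question is about.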
I think this PR currently leans toward the lenient-for-max-performance end of the spectrum. That's not necessarily bad, but the PR doesn't really talk about the trade-off. For example, if we decide we want to be maximally lenient in order to skip as quickly as possible, this PR may not be aggressive enough (dunno, haven't explored that direction yet). On the other hand, if we favor correctness even for skipped values, then this PR is probably too lenient (a motivating factor behind some of my previous comments, which I wasn't fully self-aware of at the time).
Do we know what we want?
Originally posted by @scovich in #9097 (comment)