Description
I have a big-picture question:
What trade-offs are we willing to make on validation of JSON values we will ultimately discard?
At one extreme, we could fully parse and validate everything and just choose not to append the skipped bits to the tape afterward.
- CON: Strongly limits the performance gain of skipping, because parsing and validation are the lion's share of the work.
At the other extreme, we completely ignore the bytes corresponding to skipped values, doing only the bare minimum needed to be relatively confident we correctly identified the byte range to skip.
- CON: Accepts blatantly invalid JSON as long as the bytes satisfy whatever region identification heuristics we come up with.
- CON: Risk of identifying the wrong region and skipping bytes that should not have been skipped.
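To make the lenient extreme concrete, here is a minimal sketch (not code from this PR; the function name and signature are hypothetical) of what such a region-identification heuristic might look like: it finds the end of a container value purely by tracking bracket depth and string/escape state, never validating the bytes in between. It illustrates both CONs above: it happily "skips" garbage like `{invalid!!}`, and any flaw in the string/escape tracking would cause it to mis-identify the region boundary.

```rust
/// Illustrative only: find the length of a JSON container value (`{...}` or
/// `[...]`) at the start of `bytes`, without validating its contents.
/// Returns one past the matching close bracket, or None if unbalanced.
fn skip_value(bytes: &[u8]) -> Option<usize> {
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            // Inside a string, only escape sequences and the closing quote
            // matter; brackets are plain data and must not affect depth.
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' | b'[' => depth += 1,
            b'}' | b']' => {
                // A close bracket with no matching open means the region
                // cannot be identified.
                depth = depth.checked_sub(1)?;
                if depth == 0 {
                    return Some(i + 1); // one past the closing bracket
                }
            }
            // Everything else -- valid or not -- is ignored entirely.
            _ => {}
        }
    }
    None
}
```

Note that this sketch does not even check that `{` pairs with `}` rather than `]`, and it accepts arbitrary non-JSON bytes between the brackets; that is exactly the kind of leniency whose acceptability this question is about.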
I think this PR currently leans toward the lenient-for-max-performance end of the spectrum. That's not necessarily bad, but the PR doesn't really talk about the trade-off. For example, if we decide we want to be maximally lenient in order to skip as quickly as possible, this PR may not be aggressive enough (dunno, haven't explored that direction yet). On the other hand, if we favor correctness even for skipped values, then this PR is probably too lenient (a motivating factor behind some of my previous comments, which I wasn't fully self-aware of at the time).
Do we know what we want?
Originally posted by @scovich in #9097 (comment)