-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[WIP] Explore fancier custom JSON decoding #9259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@debugmiller and @alamb, I'd love your thoughts on this pathfinding? It builds on top of @debugmiller excellent starting point and would merge mostly separately (afterward). The main consideration is that it would impact the public API we create, so it would be nice to get that part right on the first try. |
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review.
Also, I just realized there's a missing test, to validate recursive propagation of decoder factories. I'll try to add that soon.
| /// Return a Command instance for running the `flight_sql_client` CLI | ||
| fn flight_sql_client_cmd() -> Command { | ||
| Command::new(assert_cmd::cargo::cargo_bin!("flight_sql_client")) | ||
| Command::new(assert_cmd::cargo::cargo_bin("flight_sql_client")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some newer rust version thing, ignore.
| impl<O: OffsetSizeTrait> ListArrayDecoder<O> { | ||
| pub fn new( | ||
| data_type: DataType, | ||
| data_type: &DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed all the decoder paths to pass borrowed refs because it was leading to a lot of clone calls otherwise.
Plus, one of the example decoders relies on the datatype instances to have stable memory locations during the recursive make_decoder call.
| is_nullable: bool, | ||
| struct_mode: StructMode, | ||
| decoder_factory: Option<Arc<dyn DecoderFactory>>, | ||
| decoder_factory: Option<&dyn DecoderFactory>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed because this change relies on forwarding decoder factories from a trait provided method, and &Arc<Self> receiver is not dyn-compatible.
arrow-json/src/reader/list_array.rs
Outdated
| Some(field.clone()), | ||
| field.data_type().clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the redundant clones I mentioned earlier, eliminated
| field.data_type(), | ||
| field.is_nullable(), | ||
| field.metadata(), | ||
| coerce_primitive, | ||
| strict_mode, | ||
| field.is_nullable(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reordered to match the Field::new
| } | ||
| } | ||
|
|
||
| impl<'a> Hash for DataTypeIdentity<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| impl<'a> Hash for DataTypeIdentity<'a> { | |
| impl Hash for DataTypeIdentity<'_> { |
| enum DataTypeIdentity<'a> { | ||
| FieldRef(FieldRef), | ||
| DataType(&'a DataType), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took a long time to figure out this approach. It allows creating a hash map whose keys compare by pointer instead of by value, while still respecting lifetimes -- pointers are only created very briefly during actual hashing and comparisons.
| /// A factory that routes to custom decoders based on specific field paths. | ||
| /// | ||
| /// This allows fine-grained control: customize specific fields by name without | ||
| /// polluting the schema with metadata or affecting all fields of a given type. | ||
| #[derive(Debug)] | ||
| struct PathBasedDecoderFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really powerful factory -- it allows individually customizing every single field in a schema, without altering the schema itself.
Relies on the fact that the updated decoder code passes the exact same DataType reference to the decoder factory's constructor as will eventually be passed to the make_custom_decoder method calls. Once the decoders have actually been created, we no longer care what happens to the data type.
| // Walk the fields and associate DataTypeIdentity::FieldRef with factory for O(1) lookup | ||
| let mut routes = HashMap::new(); | ||
| for (path, factory) in path_routes { | ||
| let parts: Vec<&str> = path.split('.').collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A real implementation would use a more robust mechanism for expressing nested column names, but I'm not aware of anything already in arrow-rs that would fit. The variant paths used by e.g. variant_get are the closest thing I could find, but they also support array indexing, which is not meaningful here. And they're anyway in the wrong crate.
| /// Recursively find a Field by following a path of field names. | ||
| fn find_field_by_path(fields: &Fields, path: &[&str]) -> Option<FieldRef> { | ||
| let (first, rest) = path.split_first()?; | ||
| let field = fields.iter().find(|f| f.name() == *first)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: very annoying that Fields doesn't have an efficient way to find a field by name...
Which issue does this PR close?
N/A - pathfinding related to
Rationale for this change
NOTE: Stacked on top of the other PR, so only look at the last four commits of this PR!
Exploring #9021 (comment)
What changes are included in this PR?
Enhance the customization support to allow things like:
Most of the changes are in a new test file that demos them. Important supporting changes are made to arrow-json.
If we decide to pursue this direction, the PR should be split into at least three separate PR, in addition to the original PR this one builds on.
Are these changes tested?
Yes, new unit tests
Are there any user-facing changes?