@scovich
Contributor

@scovich scovich commented Jan 24, 2026

Which issue does this PR close?

N/A - pathfinding related to

Rationale for this change

NOTE: Stacked on top of the other PR, so only look at the last four commits of this PR!

Exploring #9021 (comment)

What changes are included in this PR?

Enhance the customization support to allow things like:

  • Turn incompatible types into NULL instead of raising an error
  • Special parsing behavior for complex types when presented with a string
  • Ability to apply specific custom decoders to specific fields based on field metadata
  • Ability to apply specific custom decoders to specific fields in a struct based on their path through the schema

Most of the changes are in a new test file that demos them. Important supporting changes are made to arrow-json.
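
To make the first bullet concrete, here is a minimal sketch of "NULL instead of error" decoding. It does not use the arrow-json API: `JsonValue`, `decode_i64_strict`, and `decode_i64_lenient` are hypothetical stand-ins invented for illustration, and the real decoders in this PR are structured differently.

```rust
/// Hypothetical stand-in for a parsed JSON value (not an arrow-json type).
#[derive(Debug, Clone, PartialEq)]
enum JsonValue {
    Null,
    Bool(bool),
    Number(i64),
    Text(String),
}

/// Strict decoding: the first incompatible value fails the whole batch.
fn decode_i64_strict(values: &[JsonValue]) -> Result<Vec<Option<i64>>, String> {
    values
        .iter()
        .map(|v| match v {
            JsonValue::Null => Ok(None),
            JsonValue::Number(n) => Ok(Some(*n)),
            other => Err(format!("expected integer, got {other:?}")),
        })
        .collect()
}

/// Lenient decoding: any value that is not an integer becomes NULL (None).
fn decode_i64_lenient(values: &[JsonValue]) -> Vec<Option<i64>> {
    values
        .iter()
        .map(|v| match v {
            JsonValue::Number(n) => Some(*n),
            _ => None,
        })
        .collect()
}
```

Given a batch that mixes integers with strings or booleans, the strict version returns an error, while the lenient version keeps the integers and yields `None` for everything else.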

If we decide to pursue this direction, the PR should be split into at least three separate PRs, in addition to the original PR this one builds on.

Are these changes tested?

Yes, new unit tests

Are there any user-facing changes?

@github-actions github-actions bot added the arrow, arrow-flight, and parquet-variant labels Jan 24, 2026
@scovich
Contributor Author

scovich commented Jan 24, 2026

@debugmiller and @alamb, I'd love your thoughts on this pathfinding. It builds on top of @debugmiller's excellent starting point and would merge mostly separately (afterward).

The main consideration is that it would impact the public API we create, so it would be nice to get that part right on the first try.

Contributor Author

@scovich scovich left a comment

Self-review.

Also, I just realized there's a missing test, to validate recursive propagation of decoder factories. I'll try to add that soon.

  /// Return a Command instance for running the `flight_sql_client` CLI
  fn flight_sql_client_cmd() -> Command {
- Command::new(assert_cmd::cargo::cargo_bin!("flight_sql_client"))
+ Command::new(assert_cmd::cargo::cargo_bin("flight_sql_client"))
Contributor Author

Some newer rust version thing, ignore.

  impl<O: OffsetSizeTrait> ListArrayDecoder<O> {
  pub fn new(
- data_type: DataType,
+ data_type: &DataType,
Contributor Author

I changed all the decoder paths to pass borrowed refs because it was leading to a lot of clone calls otherwise.

Plus, one of the example decoders relies on the data type instances having stable memory locations during the recursive make_decoder call.

  is_nullable: bool,
  struct_mode: StructMode,
- decoder_factory: Option<Arc<dyn DecoderFactory>>,
+ decoder_factory: Option<&dyn DecoderFactory>,
Contributor Author

Needed because this change relies on forwarding decoder factories from a trait-provided method, and an &Arc<Self> receiver is not dyn-compatible.

Comment on lines 53 to 54
Some(field.clone()),
field.data_type().clone(),
Contributor Author

One of the redundant clones I mentioned earlier, now eliminated.

Comment on lines +52 to -57
field.data_type(),
field.is_nullable(),
field.metadata(),
coerce_primitive,
strict_mode,
field.is_nullable(),
Contributor Author

Reordered to match the argument order of Field::new.

}
}

impl<'a> Hash for DataTypeIdentity<'a> {
Contributor Author

Suggested change
- impl<'a> Hash for DataTypeIdentity<'a> {
+ impl Hash for DataTypeIdentity<'_> {

Comment on lines +587 to +589
enum DataTypeIdentity<'a> {
FieldRef(FieldRef),
DataType(&'a DataType),
Contributor Author

It took a long time to figure out this approach. It allows creating a hash map whose keys compare by pointer instead of by value, while still respecting lifetimes -- pointers are only created very briefly during actual hashing and comparisons.
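
For readers unfamiliar with the trick, here is a simplified sketch of a pointer-identity map key. It is not the PR's DataTypeIdentity: `ByAddress`, `TypeNode`, and `route_demo` are hypothetical names, and the sketch covers only the borrowed-reference case. The key holds an ordinary borrow, so the borrow checker still enforces lifetimes, and raw pointers exist only inside `eq` and `hash`.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Key that compares and hashes by the address of the referenced value,
/// not by its contents. The borrow keeps the referent alive for 'a.
struct ByAddress<'a, T>(&'a T);

impl<T> PartialEq for ByAddress<'_, T> {
    fn eq(&self, other: &Self) -> bool {
        // The references coerce to raw pointers only for this comparison.
        std::ptr::eq(self.0, other.0)
    }
}

impl<T> Eq for ByAddress<'_, T> {}

impl<T> Hash for ByAddress<'_, T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the address, never the pointee.
        std::ptr::hash(self.0, state);
    }
}

/// Hypothetical stand-in for a schema data type.
#[derive(Debug, Clone, PartialEq)]
struct TypeNode {
    name: String,
}

/// Looks up the original value and a by-value-equal clone of it.
fn route_demo() -> (Option<&'static str>, Option<&'static str>) {
    let a = TypeNode { name: "Utf8".to_string() };
    let b = a.clone(); // equal by value, but stored at a different address

    let mut routes: HashMap<ByAddress<'_, TypeNode>, &'static str> = HashMap::new();
    routes.insert(ByAddress(&a), "custom");

    let hit = routes.get(&ByAddress(&a)).copied();
    let miss = routes.get(&ByAddress(&b)).copied();
    (hit, miss)
}
```

Only the exact instance that was inserted is found. The clone misses even though it is equal by value, which is why this kind of lookup depends on the same reference being passed at registration time and at lookup time.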

Comment on lines +616 to +621
/// A factory that routes to custom decoders based on specific field paths.
///
/// This allows fine-grained control: customize specific fields by name without
/// polluting the schema with metadata or affecting all fields of a given type.
#[derive(Debug)]
struct PathBasedDecoderFactory {
Contributor Author

This is a really powerful factory -- it allows individually customizing every single field in a schema, without altering the schema itself.

It relies on the updated decoder code passing the exact same DataType reference to the decoder factory's constructor as will eventually be passed to the make_custom_decoder method calls. Once the decoders have actually been created, we no longer care what happens to the data type.

// Walk the fields and associate DataTypeIdentity::FieldRef with factory for O(1) lookup
let mut routes = HashMap::new();
for (path, factory) in path_routes {
let parts: Vec<&str> = path.split('.').collect();
Contributor Author

A real implementation would use a more robust mechanism for expressing nested column names, but I'm not aware of anything already in arrow-rs that would fit. The variant paths used by e.g. variant_get are the closest thing I could find, but they also support array indexing, which is not meaningful here. And they're in the wrong crate anyway.

/// Recursively find a Field by following a path of field names.
fn find_field_by_path(fields: &Fields, path: &[&str]) -> Option<FieldRef> {
let (first, rest) = path.split_first()?;
let field = fields.iter().find(|f| f.name() == *first)?;
Contributor Author

aside: very annoying that Fields doesn't have an efficient way to find a field by name...
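
For reference, the dotted-path lookup discussed in these comments can be sketched without the arrow crates. This is an illustration under assumptions: `Field` and `FieldRef` below are hypothetical stand-ins for the arrow types, `leaf` and `parent` are helper constructors invented for the example, and the recursion into child fields is a guess at how the rest of the function proceeds, since the diff hunk above shows only its first two lines.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an arrow Field; `children` is non-empty only
/// for struct-like fields.
#[derive(Debug)]
struct Field {
    name: String,
    children: Vec<Arc<Field>>,
}

type FieldRef = Arc<Field>;

/// Recursively find a field by following a path of field names.
fn find_field_by_path(fields: &[FieldRef], path: &[&str]) -> Option<FieldRef> {
    let (first, rest) = path.split_first()?;
    // Linear scan by name at each level, which is the cost noted in the aside.
    let field = fields.iter().find(|f| f.name == *first)?;
    if rest.is_empty() {
        Some(Arc::clone(field))
    } else {
        find_field_by_path(&field.children, rest)
    }
}

/// Hypothetical helper: a field with no children.
fn leaf(name: &str) -> FieldRef {
    Arc::new(Field { name: name.to_string(), children: Vec::new() })
}

/// Hypothetical helper: a struct-like field with children.
fn parent(name: &str, children: Vec<FieldRef>) -> FieldRef {
    Arc::new(Field { name: name.to_string(), children })
}
```

Splitting "user.address.zip" on '.' and passing the parts walks one level per path segment. An empty path or an unknown name yields None.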
