@ethan-tyler

Fix arrow-cast failing to cast into Dictionary(K, Utf8View) and Dictionary(K, BinaryView) even though can_cast_types returns true.

Which issue does this PR close?

Rationale for this change

Downstream pipelines may require view-typed schemas (e.g. schema_force_view_types) while still dictionary-encoding columns for memory efficiency. Even with view types, dictionary encoding can reduce per-row memory (1–4 byte keys vs 16-byte views) for low-cardinality columns. Today there's no way to materialize Dictionary(K, Utf8View/BinaryView) via arrow-cast.
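The memory argument above can be made concrete with a rough back-of-the-envelope estimate. This is only a sketch: the function names are hypothetical, and it counts just the fixed-width buffers (16 bytes per view, `key_width` bytes per key), ignoring validity bitmaps and the out-of-line data buffers that hold strings longer than 12 bytes.

```rust
/// Fixed-width bytes for a plain view array: one 16-byte view per row.
/// (Hypothetical helper; ignores validity bitmaps and data buffers.)
fn view_array_bytes(rows: usize) -> usize {
    rows * 16
}

/// Fixed-width bytes for a dictionary over view values: one key per row
/// plus one 16-byte view per distinct value in the dictionary.
/// (Hypothetical helper; same simplifications as above.)
fn dict_view_array_bytes(rows: usize, distinct: usize, key_width: usize) -> usize {
    rows * key_width + distinct * 16
}
```

Under these assumptions, a column of 1,000,000 rows with 100 distinct values and 2-byte keys needs about 2 MB of keys and views, versus 16 MB of views without dictionary encoding. The advantage shrinks as cardinality approaches the row count.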

What changes are included in this PR?

Add Utf8View and BinaryView support in cast_to_dictionary using a two-step pack-then-recast approach:

  • Pack to Dictionary(K, Utf8/Binary)
  • Cast dictionary values to Utf8View/BinaryView via existing cast machinery
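The two steps can be modeled in plain Rust without any Arrow types. This is an illustrative sketch, not the arrow-cast implementation: `pack`, `values_to_views`, and the `View` enum are hypothetical stand-ins, and the 12-byte inline threshold is the only detail borrowed from the Arrow view layout.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Step 1 (pack): dictionary-encode the input. Each distinct string is stored
/// once in `values`; each row stores a key into `values`, or None for null.
/// Fails if the number of distinct values does not fit in the key type `K`.
fn pack<K>(input: &[Option<&str>]) -> Result<(Vec<Option<K>>, Vec<String>), String>
where
    K: TryFrom<usize> + Copy,
{
    let mut index_of: HashMap<&str, K> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<Option<K>> = Vec::with_capacity(input.len());
    for item in input {
        match *item {
            None => keys.push(None),
            Some(s) => {
                let existing = index_of.get(s).copied();
                let key = match existing {
                    Some(k) => k,
                    None => {
                        // A new distinct value: its key is its position in `values`.
                        let k = K::try_from(values.len())
                            .map_err(|_| format!("key overflow at value #{}", values.len()))?;
                        values.push(s.to_string());
                        index_of.insert(s, k);
                        k
                    }
                };
                keys.push(Some(key));
            }
        }
    }
    Ok((keys, values))
}

/// Simplified stand-in for a view: short strings are stored inline,
/// longer ones point into a shared data buffer.
#[derive(Debug, PartialEq)]
enum View {
    Inline(String),
    Buffer { offset: usize, len: usize },
}

/// Step 2 (recast): convert only the dictionary values to the view layout.
/// The keys produced by `pack` are left untouched.
fn values_to_views(values: &[String]) -> (Vec<View>, Vec<u8>) {
    let mut data: Vec<u8> = Vec::new();
    let views: Vec<View> = values
        .iter()
        .map(|v| {
            if v.len() <= 12 {
                View::Inline(v.clone())
            } else {
                let offset = data.len();
                data.extend_from_slice(v.as_bytes());
                View::Buffer { offset, len: v.len() }
            }
        })
        .collect();
    (views, data)
}
```

The point of the split is that the second step only touches the (usually small) set of distinct values, so the per-row keys computed in the first step can be reused as-is. The same model also shows the key-overflow case: packing more than 256 distinct values with `u8` keys returns an error.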

Are these changes tested?

  • StringArray/StringViewArray to Dictionary(UInt16, Utf8View) (including sliced inputs)
  • BinaryArray/BinaryViewArray to Dictionary(UInt16, BinaryView) (including sliced inputs)
  • Key overflow for Dictionary(UInt8, Utf8View)
  • Empty and all-null inputs

Are there any user-facing changes?

No

Fix cast_to_dictionary failing for view value types by packing to Utf8/Binary and recasting; add regression tests for view inputs and key overflow.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 19, 2026
};

let dict_base = cast_to_dictionary::<K>(array, &base_value_type, cast_options)?;
cast_with_options(
Member

Is this equivalent to dictionary_cast::<K>(dict_base.as_ref(), &DataType::Dictionary(Box::new(K::DATA_TYPE), Box::new(DataType::Utf8View)), cast_options)?

Author

Yes, effectively equivalent. cast_with_options(dict, Dictionary(..)) hits the (Dictionary(..), _) arm, which calls dictionary_cast::<K>, then routes to dictionary_to_dictionary_cast since the target is also Dictionary(..). Same code path as calling dictionary_cast::<K> directly.

Only difference: cast_with_options short-circuits when from == to (:769-772), which a direct dictionary_cast call skips. Doesn't affect actual dict to dict conversions.

Happy to switch to explicit dictionary_cast::<K>(...) if you prefer, makes the dict-to-dict intent clearer. Let me know.

DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8View));
let cast_array = cast(&array, &cast_type).unwrap();
assert_eq!(cast_array.data_type(), &cast_type);
assert_eq!(
Member

Would it be better to assert the dictionary's keys and values directly?

If array in L5584 is StringArray::from(vec![Some("one"), None, Some("three"), Some("one"), Some("null")]), it seems we can't distinguish where the "null" comes from (None or Some("null")).

Author

Good catch. array_to_strings explicitly sets .with_null("null") (:8507-8512), so None and Some("null") both render as "null" in test output.

The cast itself is correct - nulls are preserved as nulls. For string casts this happens via builder.append_null() in string.rs:21-37. For dictionary-to-string it's via unpack_dictionary and take in dictionary.rs:145-152. The collision is in the test formatter, not the cast result.

Will update the dictionary-cast tests to assert keys and values directly, and add a regression case with both a real null and a literal "null" string to confirm the tests check the array itself rather than its formatted output.
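The formatter collision is easy to reproduce outside Arrow. In the sketch below, `render` is a hypothetical stand-in for a display-oriented test helper that substitutes a fixed string for nulls; it is not the actual `array_to_strings` code.

```rust
/// Render each row as text, substituting `null_repr` for missing values.
/// (Hypothetical helper mimicking a formatter configured with a null string.)
fn render(rows: &[Option<&str>], null_repr: &str) -> Vec<String> {
    rows.iter()
        .map(|row| row.unwrap_or(null_repr).to_string())
        .collect()
}
```

With `null_repr` set to `"null"`, the inputs `[Some("one"), None]` and `[Some("one"), Some("null")]` render to the same strings, so an assertion on the rendered output cannot tell them apart. Comparing the `Option` values themselves (or, in the real tests, the dictionary keys, values, and null mask) does distinguish them.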

ethan-tyler and others added 3 commits January 20, 2026 15:33
Avoid redundant cast dispatch when packing to Dictionary(_, Utf8View/BinaryView). Strengthen dictionary view cast tests to assert keys/values directly and add regressions distinguishing None from Some("null") for both String and Utf8View inputs.

Development

Successfully merging this pull request may close these issues.

[arrow-cast] can_cast_types allows Dictionary(_, Utf8View/BinaryView) but cast fails