Fix ClickBench EventDate handling by casting UInt16 days-since-epoch to DATE via hits view
#19881
+58
−12
Which issue does this PR close?
Rationale for this change
ClickBench encodes `EventDate` as a `UInt16` representing days since 1970-01-01. When DataFusion registers the ClickBench parquet file directly as `hits`, `EventDate` ends up being compared as a string in some queries (notably ClickBench queries 36–42), which causes the date range predicates to filter out all rows.

To make ClickBench queries behave as authored (and align with how other engines handle the dataset), we expose `hits` as a view that converts the raw `UInt16` encoding into a proper SQL `DATE`.

What changes are included in this PR?
- Register the underlying parquet table as `hits_raw` instead of `hits`.
- Add a constant `HITS_VIEW_DDL` that defines a `hits` view which:
  - … `EventDate` column, and
  - … `DATE` using `CAST(CAST("EventDate" AS INTEGER) AS DATE)`.
- Factor view creation into a helper method (`create_hits_view`) and add error context for easier debugging.
- Update the ClickBench sqllogictest file to:
  - … `hits_raw` + `hits` view,
  - … `15901` ↔ `2013-07-15`),
  - … `EventDate` is now a `DATE`, and
  - …

Are these changes tested?
Yes.

Updated `datafusion/sqllogictest/test_files/clickbench.slt` to cover:

- `EventDate` decoding in the `hits` view (returns `DATE`),
- `hits_raw.EventDate` remains the original integer encoding, and
- …

Script to test q36-q42: `benchmarks/run_q36_q42.sh`

Run results on this branch:

On the `main` branch, the queries return 0 rows.

Are there any user-facing changes?
Yes (benchmark/test behavior):

- `hits` continues to exist, but it is now a view that exposes `EventDate` as a proper `DATE` rather than the raw `UInt16` encoding.

No public API changes.
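
To make the encoding concrete, here is a minimal standalone sketch (standard library only, not code from this PR) of what the days-since-epoch decoding amounts to, and of why a text comparison against the raw value rejects every row. The function names `civil_from_days` and `decode_event_date` are hypothetical, the July 2013 bounds are an assumed example of a date range predicate, and the string comparison only approximates how a mis-typed comparison could behave; DataFusion's actual coercion path may differ.

```rust
/// Convert days since 1970-01-01 into a (year, month, day) civil date.
/// This is the widely used "civil from days" algorithm, not DataFusion code.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Hypothetical helper: render a raw ClickBench `EventDate` as an ISO date string.
fn decode_event_date(raw: u16) -> String {
    let (y, m, d) = civil_from_days(i64::from(raw));
    format!("{y:04}-{m:02}-{d:02}")
}

/// Lexicographic range check, standing in for a string-typed comparison.
fn in_text_range(value: &str, lo: &str, hi: &str) -> bool {
    value >= lo && value <= hi
}

fn main() {
    let raw: u16 = 15901;
    let decoded = decode_event_date(raw);
    println!("{raw} decodes to {decoded}");

    // Assumed example bounds for a date range predicate.
    let (lo, hi) = ("2013-07-01", "2013-07-31");

    // Comparing the raw integer as text: "15901" sorts before "2013-07-01".
    println!("raw as text in range: {}", in_text_range(&raw.to_string(), lo, hi));
    // Comparing the decoded date: falls inside the range.
    println!("decoded in range:     {}", in_text_range(&decoded, lo, hi));
}
```

The `15901` ↔ `2013-07-15` pair is the one referenced in the sqllogictest changes above; the view performs the equivalent conversion in SQL through the nested `CAST`.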
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.