Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Jan 19, 2026

Which issue does this PR close?


Rationale for this change

ClickBench encodes EventDate as a UInt16 representing days since 1970-01-01. When DataFusion registers the ClickBench parquet file directly as hits, EventDate ends up being compared as a string in some queries (notably ClickBench queries 36–42), which causes the date range predicates to filter out all rows.

To make ClickBench queries behave as authored (and align with how other engines handle the dataset), we expose hits as a view that converts the raw UInt16 encoding into a proper SQL DATE.


What changes are included in this PR?

  • Register the underlying parquet table as hits_raw instead of hits.

  • Add a constant HITS_VIEW_DDL that defines a hits view which:

    • Removes the original EventDate column, and
    • Re-introduces it as DATE using CAST(CAST("EventDate" AS INTEGER) AS DATE).
  • Factor view creation into a helper method (create_hits_view) and add error context for easier debugging.

  • Update the ClickBench sqllogictest file to:

    • Create hits_raw + hits view,
    • Add explicit assertions validating the transformation (159012013-07-15),
    • Update expected result types where EventDate is now a DATE, and
    • Drop the view before dropping the raw table.

Are these changes tested?

Yes.

  • Updated datafusion/sqllogictest/test_files/clickbench.slt to cover:

    • Correct EventDate decoding in the hits view (returns DATE),
    • Raw hits_raw.EventDate remains the original integer encoding, and
    • Existing ClickBench queries that rely on date predicates (including the previously failing range-filter queries) now execute with the correct types.

Script to test q36-q42.

benchmarks/run_q36_q42.sh

#!/usr/bin/env bash
# Script to run ClickBench queries 36-42 and display results

set -e

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
BENCHMARK=${1:-"clickbench_1"}
OUTPUT_FILE="${2:-results_q36_q42.txt}"

echo "=========================================="
echo "Running ClickBench Queries 36-42"
echo "=========================================="
echo "Benchmark: $BENCHMARK"
echo "Output file: $OUTPUT_FILE"
echo ""

# Create results file
> "$OUTPUT_FILE"

# Run queries 36-42
for q in {36..42}; do
    echo "Running Query $q..."
    
    # Run the query and extract relevant info
    output=$($SCRIPT_DIR/bench.sh run $BENCHMARK $q 2>&1)
    
    # Extract timing and row count from the first iteration
    iteration_0=$(echo "$output" | grep "Query $q iteration 0" | head -1)
    avg_time=$(echo "$output" | grep "Query $q avg time" | head -1)
    
    echo "Q$q: $iteration_0" | tee -a "$OUTPUT_FILE"
    echo "     $avg_time" | tee -a "$OUTPUT_FILE"
    echo "" | tee -a "$OUTPUT_FILE"
done

echo "=========================================="
echo "Summary saved to: $OUTPUT_FILE"
echo "=========================================="
cat "$OUTPUT_FILE"

Run results on this branch:

Q36: Query 36 iteration 0 took 138.1 ms and returned 10 rows
     Query 36 avg time: 116.19 ms

Q37: Query 37 iteration 0 took 66.4 ms and returned 10 rows
     Query 37 avg time: 50.57 ms

Q38: Query 38 iteration 0 took 98.9 ms and returned 10 rows
     Query 38 avg time: 83.20 ms

Q39: Query 39 iteration 0 took 237.3 ms and returned 10 rows
     Query 39 avg time: 223.62 ms

Q40: Query 40 iteration 0 took 40.6 ms and returned 10 rows
     Query 40 avg time: 24.43 ms

Q41: Query 41 iteration 0 took 36.6 ms and returned 10 rows
     Query 41 avg time: 22.90 ms

Q42: Query 42 iteration 0 took 34.0 ms and returned 10 rows
     Query 42 avg time: 20.17 ms

On main branch, the queries return 0 rows.


Are there any user-facing changes?

Yes (benchmark/test behavior):

  • In ClickBench runs, the logical table name hits continues to exist, but it is now a view that exposes EventDate as a proper DATE rather than the raw UInt16 encoding.
  • This fixes ClickBench queries 36–42 so they return rows without requiring manual casts in the SQL.

No public API changes.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

…gister_hits

- Added a CREATE VIEW statement that selects all columns except "EventDate" and casts "EventDate" from UInt16 to Date32.
- Updated the external table creation logic to use 'hits_raw' instead of 'hits'.
- Refactored the registration process to include the new view for EventDate encoding.
- Changed `CAST(CAST("EventDate" AS Int32) AS Date32)` to `CAST(CAST("EventDate" AS INTEGER) AS DATE)` in `clickbench.rs`
- Updated SQL logic test file `clickbench.slt` to reflect the SQL syntax change for consistency.
- Added HITS_VIEW_DDL constant to define the SQL view creation for `hits` with proper `EventDate` casting from UInt16 to DATE.
- Refactored the `register_hits` function to utilize a new `create_hits_view` function for better separation of concerns.
- Added SQL logic tests in `clickbench.slt` to verify the transformation of `EventDate` from UInt16 to DATE and ensure raw values remain unaffected in `hits_raw`.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 19, 2026
@kosiew kosiew marked this pull request as ready for review January 19, 2026 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential invalid ClickBench results

1 participant