Conversation

Ahmed-Gomaa1 commented Jan 23, 2026

Description

Adds an optional trim_whitespace parameter to the generate_surrogate_key macro.

When enabled, leading and trailing whitespace is removed from fields before hashing, improving surrogate key stability and enabling safer adoption of parallel pipeline patterns in modern dbt architectures.
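For context, here is a minimal sketch of how the parameter could slot into the macro's existing field loop. This is simplified (it omits the surrogate_key_treat_nulls_as_empty_strings var that the real macro honors), and the PR's actual diff may differ:

{%- macro generate_surrogate_key(field_list, trim_whitespace=false) -%}

{%- set fields = [] -%}

{%- for field in field_list -%}

    {%- set casted = "cast(" ~ field ~ " as " ~ dbt.type_string() ~ ")" -%}

    {#- when enabled, strip leading/trailing whitespace before hashing -#}
    {%- if trim_whitespace -%}
        {%- set casted = "trim(" ~ casted ~ ")" -%}
    {%- endif -%}

    {%- do fields.append("coalesce(" ~ casted ~ ", '_dbt_utils_surrogate_key_null_')") -%}

    {%- if not loop.last -%}
        {%- do fields.append("'-'") -%}
    {%- endif -%}

{%- endfor -%}

{{ dbt.hash(dbt.concat(fields)) }}

{%- endmacro -%}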

Motivation

The Problem: Silent Hash Mismatches in Parallel Pipelines

Modern dbt projects increasingly use deterministic hashing to eliminate build-time dependencies between facts and dimensions:

-- dim_customers.sql
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_sk

-- fct_orders.sql (no join needed!)
{{ dbt_utils.generate_surrogate_key(['customer_id']) }} as customer_sk

Benefits:

  • Facts and dimensions build in parallel (40-60% faster pipelines)
  • Late-arriving dimensions self-heal automatically
  • Eliminates blocking joins between layers

The Critical Dependency:
This pattern only works if both models generate identical hashes.

md5('customer_123')  → 'a7f3c2d1...'
md5('customer_123 ') → 'f9e4b7c2...'  ← one trailing space = broken relationship

Current Workaround: Manual Duplication

Today, engineers must manually ensure consistency across every model:

-- dim_customers.sql
{{ dbt_utils.generate_surrogate_key(["trim(customer_id)"]) }}

-- fct_orders.sql
{{ dbt_utils.generate_surrogate_key(["trim(customer_id)"]) }}  -- must match exactly

-- fct_returns.sql
{{ dbt_utils.generate_surrogate_key(["trim(customer_id)"]) }}  -- must match exactly

Failure mode: one forgotten trim() means broken relationships, typically discovered weeks later in dashboards.
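One way to catch this failure mode in CI rather than in dashboards is a singular dbt test on the hashed key. A minimal sketch, assuming models and a customer_sk column like the examples above:

-- tests/assert_fct_orders_keys_resolve.sql (hypothetical)
-- Fails if any order's surrogate key has no matching dimension row,
-- which is exactly what a stray trailing space would cause.
select o.customer_sk
from {{ ref('fct_orders') }} as o
left join {{ ref('dim_customers') }} as d
    on o.customer_sk = d.customer_sk
where d.customer_sk is null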

This Solution: Consistency at the Macro Level

-- Everywhere, guaranteed consistent
{{ dbt_utils.generate_surrogate_key(
    ['customer_id'],
    trim_whitespace=true
) }}

This moves the consistency guarantee from developer memory into the tool itself.
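On Postgres, for example, the call above would compile to roughly the following (illustrative; the exact rendering depends on the adapter and on dbt_utils' null sentinel):

-- approximate compiled SQL on Postgres (illustrative)
md5(coalesce(trim(cast(customer_id as varchar)), '_dbt_utils_surrogate_key_null_')) as customer_sk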

Changes

  • Added trim_whitespace parameter (default: false) to generate_surrogate_key
  • Applied to all adapter-specific implementations (Postgres, Snowflake, BigQuery, Redshift, Spark)
  • Added comprehensive integration tests (see the sketch after this list) covering:
    • Basic trim functionality
    • Null handling
    • Empty strings and whitespace-only values
    • Backward compatibility (default behavior unchanged)
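A minimal sketch of what one such test model could assert (file name hypothetical; the actual integration tests may be structured differently):

-- integration_tests/models/test_generate_surrogate_key_trim.sql (hypothetical)
-- With trimming enabled, a padded literal should hash identically to the clean one.
select
    {{ dbt_utils.generate_surrogate_key(["'  padded  '"], trim_whitespace=true) }} as trimmed_sk,
    {{ dbt_utils.generate_surrogate_key(["'padded'"]) }} as clean_sk
-- paired with an equality assertion that trimmed_sk = clean_sk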

Backward Compatibility

This is a fully opt-in feature:

  • Default: trim_whitespace=false (current behavior, zero changes)
  • Explicit: trim_whitespace=true (new behavior, deliberate choice)

No existing models break.
No existing hashes change.
Teams adopt on their own timeline.

Testing

  • Tested locally on Postgres
  • Integration tests pass
  • Unit tests cover edge cases
  • Backward compatibility verified (default behavior unchanged)
  • CI tests pass (waiting for maintainer review)

Real-World Impact

Sequential (join-based):
dim_customers (45 min) → fct_orders (30 min) = 75 minutes total

Parallel (hash-twice):
dim_customers (45 min) and fct_orders (30 min) run concurrently = 45 minutes total (40% faster)

This PR makes the parallel pattern safe and practical by eliminating the primary source of silent failures.

Usage Example

Basic usage:

{{ dbt_utils.generate_surrogate_key(
    ['customer_id', 'order_date'],
    trim_whitespace=true
) }}

Alternative Approaches Considered

  • "Just clean data upstream":
    Doesn't solve the duplication problem (same business keys hashed in N models). Requires perfect discipline.
    Silent failure when discipline lapses.
  • "Use a shared CTE":
    Only works within a single model.
    Doesn't help across dimensions, facts, and bridges.
  • "Document the pattern":
    Documentation can't enforce consistency. Silent failures remain possible.

This PR:
✅ Enforces consistency mechanically
✅ Opt-in and backward compatible
✅ Solves the problem once, centrally

Checklist

  • Code follows project style guidelines
  • Tests added for new functionality
  • Documentation updated
  • Backward compatible (default behavior unchanged)
  • Commit message follows conventions
  • Real-world use case documented

- Add optional trim_whitespace parameter (default: false)
- Trim leading/trailing whitespace before hashing when enabled
- Add integration tests for trim functionality
- Add edge case tests (nulls, empty strings, whitespace-only)
- Update macro documentation with usage examples
- Backward compatible - default behavior unchanged

Resolves #[ISSUE_NUMBER]
Ahmed-Gomaa1 requested a review from a team as a code owner, January 23, 2026 14:41
b-per (Collaborator) commented Jan 23, 2026

I am not really convinced that the macro should take care of trimming...

To me this is something that must be done outside of generating a SK. If the fields are not trimmed they either need to be trimmed in the upstream models (ideally) or in a CTE.
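For reference, the upstream approach described here might look like this in a staging model (source and column names hypothetical):

-- stg_customers.sql (hypothetical)
with source as (
    select * from {{ source('crm', 'customers') }}
)

select
    trim(customer_id) as customer_id,  -- cleaned once; every downstream hash sees the same value
    customer_name
from source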

Ahmed-Gomaa1 (Author) commented Jan 23, 2026

Hi @b-per, thanks for the feedback.
I completely understand the preference for cleaning data upstream.

This feature is fully optional (default trim_whitespace=false) and preserves backward compatibility.
It’s designed as a defensive mechanism for projects handling external sources like CSV, Excel, or other non-database inputs where leading/trailing whitespace can easily break surrogate key consistency.

Using this macro option prevents accidental key mismatches without requiring developers to manually trim fields in every staging model.

If preferred, we could mark this parameter as experimental in the documentation to emphasize that upstream trimming is still the recommended approach, while keeping the macro option available for defensive scenarios.

b-per (Collaborator) commented Jan 23, 2026

Thanks. I am still not convinced (if the SK uses trimmed columns when the actual columns are not trimmed, I can foresee issues), but I will let others chime in.

To me it doesn't feel like a "generic enough" problem to make it global to all dbt_utils users (updating this macro will likely trigger a model change for any person using it). But you are more than welcome to keep this macro in your own projects!

Ahmed-Gomaa1 (Author) commented

Hi @b-per, I appreciate the perspective on upstream cleaning.

I want to clarify the core problem this addresses:

The Issue:
When using deterministic hashing for surrogate keys (the modern pattern that eliminates fact→dimension joins), input consistency becomes critical.

md5('Cairo') ≠ md5('Cairo '): one trailing space breaks the relationship.

Why This Isn't Just Upstream Cleaning:

  1. The same business key appears in multiple models (dimension + facts)
  2. Each must apply identical sanitization before hashing
  3. Manual duplication creates drift and silent failures
  4. The failure mode is discovering broken joins weeks later

Current Reality:
Engineers write trim(customer_id) in 5+ models.
If one forgets, relationships break silently.

This PR:
Moves the consistency guarantee into the macro itself:
generate_surrogate_key(['customer_id'], trim_whitespace=true)
Same logic everywhere, enforced once.

This isn't about handling messy source data - it's about preventing inconsistency between models that hash the same business keys.

Happy to discuss further or provide examples from production systems.
