-
Notifications
You must be signed in to change notification settings - Fork 600
Feature/surrogate key trim #1069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature/surrogate key trim #1069
Conversation
- Add optional trim_whitespace parameter (default: false) - Trim leading/trailing whitespace before hashing when enabled - Add integration tests for trim functionality - Add edge case tests (nulls, empty strings, whitespace-only) - Update macro documentation with usage examples - Backward compatible - default behavior unchanged Resolves #[ISSUE_NUMBER]"
- Add optional trim_whitespace parameter (default: false) - Trim leading/trailing whitespace before hashing when enabled - Add integration tests for trim functionality - Add edge case tests (nulls, empty strings, whitespace-only) - Update macro documentation with usage examples - Backward compatible - default behavior unchanged Resolves #[ISSUE_NUMBER]"
|
I am not really convinced that the macro should take care of To me this is something that must be done outside of generating a SK. If the fields are not trimmed they either need to be trimmed in the upstream models (ideally) or in a CTE. |
|
Hi @b-per, thanks for the feedback. This feature is fully optional (default Using this macro option prevents accidental key mismatches without requiring developers to manually trim fields in every staging model. If preferred, we could mark this parameter as experimental in the documentation to emphasize that upstream trimming is still the recommended approach, while keeping the macro option available for defensive scenarios. |
- Add optional trim_whitespace parameter (default: false) - Trim leading/trailing whitespace before hashing when enabled - Add integration tests for trim functionality - Add edge case tests (nulls, empty strings, whitespace-only) - Update macro documentation with usage examples - Backward compatible - default behavior unchanged Resolves #[ISSUE_NUMBER]"
|
Thanks. I am still not convinced (if the sk uses trimmed columns when the actual columns are not trimmed, i can foresee issues) but I will let others chime in. To me it doesn't feel like a "generic enough" problem to make it global to all |
|
Hi @b-per, I appreciate the perspective on upstream cleaning. I want to clarify the core problem this addresses: The Issue:
Why This Isn't Just Upstream Cleaning:
Current Reality: This PR: This isn't about handling messy source data - it's about preventing Happy to discuss further or provide examples from production systems. |
Description
Adds an optional
trim_whitespaceparameter togenerate_surrogate_keymacro.When enabled, leading and trailing whitespace is removed from fields before hashing, improving surrogate key stability and enabling safer adoption of parallel pipeline patterns in modern dbt architectures.
Motivation
The Problem: Silent Hash Mismatches in Parallel Pipelines
Modern dbt projects increasingly use deterministic hashing to eliminate build-time dependencies between facts and dimensions:
Benefits:
The Critical Dependency:
This pattern only works if both models generate identical hashes.
md5('customer_123')→'a7f3c2d1...'md5('customer_123 ')→'x9z7k3m2...'← one trailing space = broken relationshipCurrent Workaround: Manual Duplication
Today, engineers must manually ensure consistency across every model:
Failure mode: One forgotten
trim()= broken relationships discovered weeks later in dashboards.This Solution: Consistency at the Macro Level
This moves the consistency guarantee from developer memory into the tool itself.
Changes
trim_whitespaceparameter (default:false) togenerate_surrogate_keyBackward Compatibility
This is a fully opt-in feature:
trim=false(current behavior, zero changes)trim=true(new behavior, deliberate choice)No existing models break.
No existing hashes change.
Teams adopt on their own timeline.
Testing
Real-World Impact
Sequential (join-based):
dim_customers: 45 min →fct_orders: 30 min = 75 minutes totalParallel (hash-twice):
dim_customers: 45 minfct_orders: 30 min = 45 minutes total (40% faster)This PR makes the parallel pattern safe and practical by eliminating the primary source of silent failures.
Usage Example
Basic usage:
{{ dbt_utils.generate_surrogate_key( ['customer_id', 'order_date'], trim=true ) }}Alternative Approaches Considered
Doesn't solve the duplication problem (same business keys hashed in N models). Requires perfect discipline.
Silent failure when discipline lapses.
Only works within a single model.
Doesn't help across dimensions, facts, and bridges.
Documentation can't enforce consistency. Silent failures remain possible.
This PR:
✅ Enforces consistency mechanically
✅ Opt-in and backward compatible
✅ Solves the problem once, centrally
Checklist