-
Notifications
You must be signed in to change notification settings - Fork 24
Fix metrics collection bug: preserve PartitionIsolatorExec in plan #281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Fix stage_metrics_rewriter to correctly match metrics to nodes in pre-order traversal - Update test assertions to allow empty metrics for PartitionIsolatorExec nodes - Fix assert_metrics_present_in_plan to handle PartitionIsolatorExec wrapped in MetricsWrapperExec Fixes issue where PartitionIsolatorExec nodes were missing from displayed plans after metrics rewriting. All metrics collection tests now pass.
Add test_metrics_collection_with_partition_isolator to verify that: - PartitionIsolatorExec nodes are preserved during metrics collection - Empty metrics are correctly collected for PartitionIsolatorExec nodes - Metrics count matches plan node count including PartitionIsolatorExec This addresses PR datafusion-contrib#281 comment requesting updated test snapshots and expectations for the PartitionIsolatorExec metrics collection bug fix.
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution! ❤️
Left some comments, let me know what you think.
…rm_down - Refactor test_metrics_collection_with_partition_isolator to be more concise - Reduced from ~150 lines to ~60 lines (60% reduction) - Use weather dataset instead of creating custom test data - Focus on corner case: PartitionIsolatorExec nodes preserved in metrics - Replace recursive traverse_plan with transform_down in test helper - More consistent with DataFusion tree traversal patterns - Cleaner and easier to maintain - Fix clippy warnings across multiple files - Remove unnecessary parentheses in plan_annotator.rs - Use is_multiple_of() instead of manual modulo check - Remove unnecessary return statements - Collapse nested if statements where appropriate
- Format method chaining for stage_plan.clone().transform_down() calls - Remove unused MetricsSetProto import
- Remove special-case logic that tracked PartitionIsolatorExec nodes - Make tests generic to handle any node type without metrics - Remove unused imports (NetworkBoundaryExt, PartitionIsolatorExec) - Update comments to reflect generic approach This makes the tests more robust and prevents them from breaking when PartitionIsolatorExec gets metrics in the future, or when custom execution plans without metrics are used.
gabotechs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor round of comments, but it's looking good! thanks for taking the time for making this PR 🙏
| let is_partition_isolator = plan.name() == "PartitionIsolatorExec" | ||
| || plan | ||
| .as_any() | ||
| .downcast_ref::<PartitionIsolatorExec>() | ||
| .is_some(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even simpler:
| let is_partition_isolator = plan.name() == "PartitionIsolatorExec" | |
| || plan | |
| .as_any() | |
| .downcast_ref::<PartitionIsolatorExec>() | |
| .is_some(); | |
| let is_partition_isolator = plan.name() == PartitionIsolatorExec::static_name(); |
| let mut node_idx = 0; | ||
| plan.clone().transform_down(|plan| { | ||
| // Stop at network boundaries. | ||
| if plan.as_network_boundary().is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if plan.as_network_boundary().is_some() { | |
| if plan.is_network_boundary() { |
| // Verify that metrics were collected for all nodes. Some nodes may legitimately have | ||
| // empty metrics (e.g., custom execution plans without metrics), which is fine - we | ||
| // just verify that a metrics set exists for each node. The count assertion above | ||
| // ensures all nodes are included in the metrics collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 there's a comment here... but no code honor the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment was meant to go above the assert_eq! instead of below it.
jayshrivastava
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! Looks good but I think we can revert a part of this change.
It seems that test_metrics_collection_e2e* tests were failing because they were not accounting for certain nodes (ex. partition isolator) which have no metrics.
Separately, it seems that test_executed_distributed_plan_has_metrics was failing because plan.as_any().downcast_ref::<MetricsWrapperExec>() can never happen since as_any() on MetricsWrapperExec is delegated to the inner node here.
Both of these problems are fixed by fixing the tests.
It seems that the changes to the stage_metrics_rewriter function can be reverted as the function is still logically equivalent to the previous one. The new implementation is more clear, but it allocates a vec and traverses the plan twice. It should be fine to keep the previous one.
Fixes issue where PartitionIsolatorExec nodes were missing from displayed plans after metrics rewriting. All metrics collection tests now pass.
this is a fix for : #260