
Conversation

@sundinesh1

  • Fix stage_metrics_rewriter to correctly match metrics to nodes in pre-order traversal
  • Update test assertions to allow empty metrics for PartitionIsolatorExec nodes
  • Fix assert_metrics_present_in_plan to handle PartitionIsolatorExec wrapped in MetricsWrapperExec

Fixes an issue where PartitionIsolatorExec nodes were missing from displayed plans after metrics rewriting. All metrics collection tests now pass. (A minimal sketch of the pre-order matching idea is shown below.)

This is a fix for #260.
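For context, here is a minimal, self-contained sketch of the pre-order matching idea from the first bullet above (hypothetical Plan/MetricsSet types, not the real DataFusion API): the i-th collected metrics set is assigned to the i-th node visited parent-first, and an empty set still occupies a slot, which is exactly what PartitionIsolatorExec needs.

#[derive(Debug, Clone)]
struct MetricsSet(Vec<(String, u64)>); // may be empty, e.g. for PartitionIsolatorExec

#[derive(Debug)]
struct Plan {
    name: String,
    metrics: Option<MetricsSet>,
    children: Vec<Plan>,
}

// Pre-order walk: assign metrics[idx] to each node visited, parent before
// children. Every node consumes exactly one slot, even when it is empty.
fn attach_metrics(node: &mut Plan, metrics: &[MetricsSet], idx: &mut usize) {
    node.metrics = metrics.get(*idx).cloned();
    *idx += 1;
    for child in &mut node.children {
        attach_metrics(child, metrics, idx);
    }
}

fn main() {
    let mut plan = Plan {
        name: "AggregateExec".into(),
        metrics: None,
        children: vec![Plan {
            name: "PartitionIsolatorExec".into(),
            metrics: None,
            children: vec![Plan {
                name: "DataSourceExec".into(),
                metrics: None,
                children: vec![],
            }],
        }],
    };
    let collected = vec![
        MetricsSet(vec![("output_rows".into(), 100)]),
        MetricsSet(vec![]), // the empty set still occupies the isolator's slot
        MetricsSet(vec![("output_rows".into(), 100)]),
    ];
    let mut idx = 0;
    attach_metrics(&mut plan, &collected, &mut idx);
    assert_eq!(idx, 3); // one metrics slot per node, including the isolator
    println!("{plan:?}");
}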

Add test_metrics_collection_with_partition_isolator to verify that:
- PartitionIsolatorExec nodes are preserved during metrics collection
- Empty metrics are correctly collected for PartitionIsolatorExec nodes
- Metrics count matches plan node count including PartitionIsolatorExec

This addresses the PR datafusion-contrib#281 comment requesting updated test snapshots and
expectations for the PartitionIsolatorExec metrics collection bug fix.

@gabotechs gabotechs left a comment


Thanks for this contribution! ❤️

Left some comments, let me know what you think.

…rm_down

- Refactor test_metrics_collection_with_partition_isolator to be more concise
  - Reduced from ~150 lines to ~60 lines (60% reduction)
  - Use weather dataset instead of creating custom test data
  - Focus on corner case: PartitionIsolatorExec nodes preserved in metrics
- Replace recursive traverse_plan with transform_down in test helper (a generic sketch follows this list)
  - More consistent with DataFusion tree traversal patterns
  - Cleaner and easier to maintain
- Fix clippy warnings across multiple files
  - Remove unnecessary parentheses in plan_annotator.rs
  - Use is_multiple_of() instead of manual modulo check
  - Remove unnecessary return statements
  - Collapse nested if statements where appropriate
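
A generic, self-contained sketch of the transform_down pattern referenced above (hypothetical Node type; DataFusion's real API works on Arc<dyn ExecutionPlan> and returns Transformed values). The point: one closure visits every node top-down, replacing the hand-rolled recursive traverse_plan helper.

#[derive(Debug)]
struct Node {
    name: &'static str,
    children: Vec<Node>,
}

impl Node {
    // Visit self first, then children: the same pre-order that metrics
    // collection uses, so a counter in the closure lines up with metric slots.
    fn transform_down(&self, f: &mut impl FnMut(&Node)) {
        f(self);
        for child in &self.children {
            child.transform_down(f);
        }
    }
}

fn main() {
    let plan = Node {
        name: "SortExec",
        children: vec![Node {
            name: "PartitionIsolatorExec",
            children: vec![],
        }],
    };
    let mut node_idx = 0;
    plan.transform_down(&mut |node| {
        println!("node {node_idx}: {}", node.name);
        node_idx += 1;
    });
    assert_eq!(node_idx, 2);
}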
@sundinesh1 sundinesh1 requested a review from gabotechs January 9, 2026 17:28
- Format method chaining for stage_plan.clone().transform_down() calls
- Remove unused MetricsSetProto import
- Remove special-case logic that tracked PartitionIsolatorExec nodes
- Make tests generic to handle any node type without metrics
- Remove unused imports (NetworkBoundaryExt, PartitionIsolatorExec)
- Update comments to reflect generic approach

This makes the tests more robust and prevents them from breaking
when PartitionIsolatorExec gets metrics in the future, or when
custom execution plans without metrics are used.
@sundinesh1 sundinesh1 requested a review from gabotechs January 12, 2026 15:50

@gabotechs gabotechs left a comment


Just a minor round of comments, but it's looking good! Thanks for taking the time to make this PR 🙏

Comment on lines +474 to +478
let is_partition_isolator = plan.name() == "PartitionIsolatorExec"
    || plan
        .as_any()
        .downcast_ref::<PartitionIsolatorExec>()
        .is_some();

Even simpler:

Suggested change
- let is_partition_isolator = plan.name() == "PartitionIsolatorExec"
-     || plan
-         .as_any()
-         .downcast_ref::<PartitionIsolatorExec>()
-         .is_some();
+ let is_partition_isolator = plan.name() == PartitionIsolatorExec::static_name();
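
A minimal standalone illustration of why the generated name is safer than a hand-typed literal (the trait below is a hypothetical mirror, not DataFusion's actual ExecutionPlan, which as I understand it derives static_name() from the type):

trait Node {
    // Hypothetical mirror of a static_name() that derives the display name
    // from the Rust type, so it can never drift after a rename.
    fn static_name() -> &'static str
    where
        Self: Sized,
    {
        std::any::type_name::<Self>()
            .rsplit("::")
            .next()
            .unwrap_or("UNKNOWN")
    }
    fn name(&self) -> &str;
}

struct PartitionIsolatorExec;

impl Node for PartitionIsolatorExec {
    fn name(&self) -> &str {
        Self::static_name()
    }
}

fn main() {
    let plan: &dyn Node = &PartitionIsolatorExec;
    // Renaming the struct updates both sides of this comparison automatically,
    // whereas the string literal "PartitionIsolatorExec" would silently rot.
    assert_eq!(plan.name(), PartitionIsolatorExec::static_name());
}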

let mut node_idx = 0;
plan.clone().transform_down(|plan| {
    // Stop at network boundaries.
    if plan.as_network_boundary().is_some() {

Suggested change
- if plan.as_network_boundary().is_some() {
+ if plan.is_network_boundary() {
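
A small sketch of the pattern this suggestion implies: an is_* shorthand as a default method over the as_* accessor. (The types below are hypothetical guesses at the shape of this repo's NetworkBoundaryExt, not its actual definition.)

use std::any::Any;

trait NetworkBoundaryExt {
    fn as_network_boundary(&self) -> Option<&dyn Any>;
    // The suggested call site reads better because this default method folds
    // the `.is_some()` into the trait itself.
    fn is_network_boundary(&self) -> bool {
        self.as_network_boundary().is_some()
    }
}

struct NetworkShuffleExec;

impl NetworkBoundaryExt for NetworkShuffleExec {
    fn as_network_boundary(&self) -> Option<&dyn Any> {
        Some(self)
    }
}

fn main() {
    assert!(NetworkShuffleExec.is_network_boundary());
}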

Comment on lines +282 to +285
// Verify that metrics were collected for all nodes. Some nodes may legitimately have
// empty metrics (e.g., custom execution plans without metrics), which is fine - we
// just verify that a metrics set exists for each node. The count assertion above
// ensures all nodes are included in the metrics collection.

🤔 there's a comment here... but no code honors the comment?


I think this comment was meant to go above the assert_eq! instead of below it.


@jayshrivastava jayshrivastava left a comment


Thanks for the contribution! Looks good but I think we can revert a part of this change.

It seems that the test_metrics_collection_e2e* tests were failing because they were not accounting for certain nodes (e.g., the partition isolator) that have no metrics.

Separately, it seems that test_executed_distributed_plan_has_metrics was failing because plan.as_any().downcast_ref::<MetricsWrapperExec>() can never succeed, since as_any() on MetricsWrapperExec is delegated to the inner node.

Both of these problems are fixed by fixing the tests.

It seems that the changes to the stage_metrics_rewriter function can be reverted, as the function is still logically equivalent to the previous one. The new implementation is clearer, but it allocates a Vec and traverses the plan twice. It should be fine to keep the previous one.
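
To make the delegation point concrete, a minimal sketch (hypothetical Exec/Wrapper types, not the real MetricsWrapperExec) of why the downcast can never succeed:

use std::any::Any;

trait Exec {
    fn as_any(&self) -> &dyn Any;
}

struct Leaf;

impl Exec for Leaf {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

struct Wrapper {
    inner: Box<dyn Exec>,
}

impl Exec for Wrapper {
    // Delegation: callers asking "what are you, really?" get the inner node,
    // so `as_any().downcast_ref::<Wrapper>()` is always None.
    fn as_any(&self) -> &dyn Any {
        self.inner.as_any()
    }
}

fn main() {
    let plan = Wrapper { inner: Box::new(Leaf) };
    assert!(plan.as_any().downcast_ref::<Wrapper>().is_none());
    assert!(plan.as_any().downcast_ref::<Leaf>().is_some());
}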
