Fix metrics collection bug: preserve PartitionIsolatorExec in plan #281

sundinesh1 · 2026-01-08T13:06:07Z

Fix stage_metrics_rewriter to correctly match metrics to nodes in pre-order traversal
Update test assertions to allow empty metrics for PartitionIsolatorExec nodes
Fix assert_metrics_present_in_plan to handle PartitionIsolatorExec wrapped in MetricsWrapperExec

Fixes issue where PartitionIsolatorExec nodes were missing from displayed plans after metrics rewriting. All metrics collection tests now pass.

This is a fix for #260.

Add test_metrics_collection_with_partition_isolator to verify that:
- PartitionIsolatorExec nodes are preserved during metrics collection
- Empty metrics are correctly collected for PartitionIsolatorExec nodes
- Metrics count matches plan node count including PartitionIsolatorExec

This addresses PR datafusion-contrib#281 comment requesting updated test snapshots and expectations for the PartitionIsolatorExec metrics collection bug fix.

gabotechs

Thanks for this contribution! ❤️

Left some comments, let me know what you think.

src/common/ttl_map.rs

src/distributed_planner/plan_annotator.rs

src/execution_plans/network_shuffle.rs

src/metrics/proto.rs

src/metrics/task_metrics_collector.rs

…rm_down

- Refactor test_metrics_collection_with_partition_isolator to be more concise
  - Reduced from ~150 lines to ~60 lines (60% reduction)
  - Use weather dataset instead of creating custom test data
  - Focus on corner case: PartitionIsolatorExec nodes preserved in metrics
- Replace recursive traverse_plan with transform_down in test helper
  - More consistent with DataFusion tree traversal patterns
  - Cleaner and easier to maintain
- Fix clippy warnings across multiple files
  - Remove unnecessary parentheses in plan_annotator.rs
  - Use is_multiple_of() instead of manual modulo check
  - Remove unnecessary return statements
  - Collapse nested if statements where appropriate

- Format method chaining for stage_plan.clone().transform_down() calls
- Remove unused MetricsSetProto import

src/metrics/task_metrics_collector.rs

- Remove special-case logic that tracked PartitionIsolatorExec nodes
- Make tests generic to handle any node type without metrics
- Remove unused imports (NetworkBoundaryExt, PartitionIsolatorExec)
- Update comments to reflect generic approach

This makes the tests more robust and prevents them from breaking when PartitionIsolatorExec gets metrics in the future, or when custom execution plans without metrics are used.

gabotechs

Just a minor round of comments, but it's looking good! Thanks for taking the time to make this PR 🙏

gabotechs · 2026-01-13T08:25:47Z

src/metrics/task_metrics_rewriter.rs

+        let is_partition_isolator = plan.name() == "PartitionIsolatorExec"
+            || plan
+                .as_any()
+                .downcast_ref::<PartitionIsolatorExec>()
+                .is_some();


Even simpler:

Suggested change

-        let is_partition_isolator = plan.name() == "PartitionIsolatorExec"
-            || plan
-                .as_any()
-                .downcast_ref::<PartitionIsolatorExec>()
-                .is_some();
+        let is_partition_isolator = plan.name() == PartitionIsolatorExec::static_name();

gabotechs · 2026-01-13T08:26:29Z

src/metrics/task_metrics_rewriter.rs

+    let mut node_idx = 0;
+    plan.clone().transform_down(|plan| {
+        // Stop at network boundaries.
+        if plan.as_network_boundary().is_some() {


Suggested change

-        if plan.as_network_boundary().is_some() {
+        if plan.is_network_boundary() {

gabotechs · 2026-01-13T08:31:43Z

src/metrics/task_metrics_collector.rs

+            // Verify that metrics were collected for all nodes. Some nodes may legitimately have
+            // empty metrics (e.g., custom execution plans without metrics), which is fine - we
+            // just verify that a metrics set exists for each node. The count assertion above
+            // ensures all nodes are included in the metrics collection.


🤔 There's a comment here... but no code that honors it?

I think this comment was meant to go above the assert_eq! instead of below it.

jayshrivastava

Thanks for the contribution! Looks good but I think we can revert a part of this change.

It seems that the test_metrics_collection_e2e* tests were failing because they did not account for certain nodes (e.g., the partition isolator) which have no metrics.

Separately, it seems that test_executed_distributed_plan_has_metrics was failing because plan.as_any().downcast_ref::<MetricsWrapperExec>() can never succeed, since as_any() on MetricsWrapperExec is delegated to the inner node.
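To illustrate why that downcast cannot succeed, here is a minimal sketch using only the standard library. The `Plan` trait, `Inner`, `Wrapper`, and `downcasts_to` are hypothetical stand-ins invented for this example, not the project's or DataFusion's actual types. The sketch only assumes what the comment above states: the wrapper forwards `as_any()` to the node it wraps.

```rust
use std::any::Any;

// Hypothetical stand-in for an execution plan trait (not DataFusion's API).
trait Plan {
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical inner node that exposes itself through `as_any()`.
struct Inner;

impl Plan for Inner {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Hypothetical wrapper that delegates `as_any()` to the wrapped node,
// mirroring the behavior described for MetricsWrapperExec.
struct Wrapper {
    inner: Box<dyn Plan>,
}

impl Plan for Wrapper {
    fn as_any(&self) -> &dyn Any {
        // The caller receives the inner node's `Any`, never the wrapper's.
        self.inner.as_any()
    }
}

// Returns true if the plan's `as_any()` value downcasts to `T`.
fn downcasts_to<T: Any>(plan: &dyn Plan) -> bool {
    plan.as_any().downcast_ref::<T>().is_some()
}
```

With these definitions, `downcasts_to::<Wrapper>` on a `Wrapper { inner: Box::new(Inner) }` is false while `downcasts_to::<Inner>` on the same value is true, so a test that detects the wrapper by downcasting needs another way to recognize it.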

Both of these problems are fixed by fixing the tests.

It seems that the changes to the stage_metrics_rewriter function can be reverted, as the function is still logically equivalent to the previous one. The new implementation is clearer, but it allocates a vec and traverses the plan twice. It should be fine to keep the previous one.
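As a rough illustration of the single-pass approach discussed here, the sketch below matches metrics to nodes by pre-order index in one traversal, without first collecting the nodes into a vec. It is not the project's stage_metrics_rewriter: `Node`, `node`, `boundary`, and `attach_metrics` are hypothetical names on a toy tree, and it assumes that a network boundary and everything below it are left untouched and consume no metrics.

```rust
// Hypothetical toy plan node; the real code operates on DataFusion plans.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    is_network_boundary: bool,
    // None = no metrics attached; Some(vec![]) = attached but empty.
    metrics: Option<Vec<u64>>,
    children: Vec<Node>,
}

// Builds a regular node with no metrics attached yet.
fn node(name: &str, children: Vec<Node>) -> Node {
    Node { name: name.to_string(), is_network_boundary: false, metrics: None, children }
}

// Builds a network-boundary node with no metrics attached.
fn boundary(name: &str, children: Vec<Node>) -> Node {
    Node { name: name.to_string(), is_network_boundary: true, metrics: None, children }
}

// Attaches metrics[i] to the i-th node visited in pre-order, in a single pass.
// Assumption: traversal stops at network boundaries, which keep their subtree
// unchanged and do not advance the index.
fn attach_metrics(plan: &Node, metrics: &[Vec<u64>], next_idx: &mut usize) -> Node {
    if plan.is_network_boundary {
        return plan.clone();
    }
    // Pre-order: handle the current node before any of its children.
    let own = metrics.get(*next_idx).cloned();
    *next_idx += 1;
    let mut children = Vec::with_capacity(plan.children.len());
    for child in &plan.children {
        children.push(attach_metrics(child, metrics, next_idx));
    }
    Node { name: plan.name.clone(), is_network_boundary: false, metrics: own, children }
}
```

In this sketch a node whose metrics set is empty still consumes one index and stays in the output tree, which is the property the tests in this PR check with a count assertion. After the call, `next_idx` equals the number of nodes visited, so it can be compared against `metrics.len()`.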

sundinesh1 added 2 commits January 8, 2026 18:30

gabotechs reviewed Jan 9, 2026

sundinesh1 added 2 commits January 9, 2026 22:56

restore the empty line

d083cf3

sundinesh1 requested a review from gabotechs January 9, 2026 17:28

Fix formatting and remove unused import in task_metrics_collector

8523799

gabotechs reviewed Jan 12, 2026

sundinesh1 requested a review from gabotechs January 12, 2026 15:50

gabotechs approved these changes Jan 19, 2026

jayshrivastava requested changes Jan 20, 2026
