Skip to content

[SPARK-55406][PYTHON][CONNECT] Reimplement the thread pool for ExecutePlanResponseReattachableIterator#54189

Open
gaogaotiantian wants to merge 7 commits intoapache:masterfrom
gaogaotiantian:fix-connect-shutdown
Open

[SPARK-55406][PYTHON][CONNECT] Reimplement the thread pool for ExecutePlanResponseReattachableIterator#54189
gaogaotiantian wants to merge 7 commits intoapache:masterfrom
gaogaotiantian:fix-connect-shutdown

Conversation

@gaogaotiantian
Copy link
Contributor

What changes were proposed in this pull request?

Reimplemented the thread pool mechanism for ExecutePlanResponseReattachableIterator.

  • Number of ExecutePlanResponseReattachableIterator instances is counted with WeakSet so we know when to release the thread pool
  • Instead of relying on thread pool to finish, we collect the futures for each release task and wait for them in the client

Why are the changes needed?

The current implementation is fundamentally wrong. It tries to shutdown a thread pool for a class from client instances. We could have multiple client instances in a process and flipping the state of a global class would just cause racing issues.

For example, the old code can return a thread pool in a thread which could be shutdown before it is used.

The original reason that we "shutdown" the thread pool is because we need the client to wait for the release requests to finish. That can't justify waiting for the whole thread pool (or shut it down) because it can be shared with other clients. We actually have a deadlock caused by it.

Now we collect futures explicitly for each ExecutePlanResponseReattachableIterator call and when we need to close the client, just wait for those future to be done.

Notice that we will not keep these futures longer than they should be. We used WeakSet which means if the future has no reference, it will still be garbage collected properly. We won't use more memory because we keep track of these. In normal cases the set will probably be empty because all the release tasks are done.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Simple local test passed. Waiting for CI.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions
Copy link

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Bug SPARK-55406 ===
Summary: Spark connect with reattach iterator has serious racing issues
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant