V2 rewrite (beta): Support Spark Connect #254
base: master
Conversation
pydeequ/v2/verification.py
Outdated
```python
plan = _create_deequ_plan(extension)

# Use DataFrame.withPlan to properly create the DataFrame
return ConnectDataFrame.withPlan(plan, session=self._spark)
```
Feel free to ignore!
There is a breaking change between 3.5.x and 4.0.x. In GraphFrames we use code like this:

```python
def _dataframe_from_plan(plan: LogicalPlan, session: SparkSession) -> DataFrame:
    if hasattr(DataFrame, "withPlan"):
        # Spark 3
        return DataFrame.withPlan(plan, session)
    # Spark 4
    return DataFrame(plan, session)
```

I would recommend switching to this approach to avoid pain during the Spark 4.x migration.
Thanks for the callout - addressed in 69a5ed9
Issue #, if available:
Description of changes:
This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.
The Deequ-side change will be opened separately; the proto file here is copied in for review purposes. For ease of testing, I created a pre-release https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1 to host the jars/wheels.
Motivation
The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations.
Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.
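For illustration, with Spark Connect the client talks to the server over gRPC using only a remote URL; a minimal connection sketch (assumes a Spark Connect server is already listening on the default port 15002):

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect server over gRPC instead of embedding a JVM
# in-process via Py4J. "sc://localhost:15002" is the default local endpoint.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
```

Because no JVM runs inside the Python process, client and server can be upgraded and deployed independently.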
Code Changes
New `pydeequ/v2/` module with Spark Connect implementation:
- `checks.py` - Check and constraint builders
- `analyzers.py` - Analyzer classes
- `predicates.py` - Serializable predicates (`eq`, `gte`, `between`, etc.)
- `verification.py` - VerificationSuite and AnalysisRunner
- `proto/` - Protobuf definitions and generated code

New test suite in `tests/v2/`:
- `test_unit.py` - Unit tests (no Spark required)
- `test_analyzers.py` - Analyzer integration tests
- `test_checks.py` - Check constraint tests
- `test_e2e_spark_connect.py` - End-to-end tests

Updated documentation.
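Since every check must cross the gRPC boundary, predicates have to be plain serializable data rather than live Python callables. A hypothetical sketch of what a serializable `between` predicate might look like (illustrative only, not the actual pydeequ v2 API):

```python
from dataclasses import dataclass, asdict


@dataclass
class Between:
    # A serializable predicate: plain data that can be encoded into a
    # protobuf/JSON message and evaluated on the server side.
    low: float
    high: float

    def to_message(self):
        # Tag the payload with its operator so the server can dispatch on it.
        return {"op": "between", **asdict(self)}


pred = Between(low=0.0, high=1.0)
print(pred.to_message())  # → {'op': 'between', 'low': 0.0, 'high': 1.0}
```

Keeping predicates as data is what makes them transportable over Spark Connect, unlike Py4J callbacks.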
API Changes
Testing
For more details, see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.