Conversation

@chenliu0831 (Contributor) commented on Jan 13, 2026

Issue #, if available:

Description of changes:

This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.

The Deequ-side change will be opened separately; the proto file here is copied in for review purposes. For ease of testing, I created a pre-release https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1 to host the jars/wheels.

Motivation

The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations:

  • Required a local Spark session with JVM access
  • Python lambdas couldn't be serialized for remote execution
  • Tight coupling between the Python client and the JVM made debugging difficult

Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.
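
For context, a Spark Connect client attaches to a remote server with the standard PySpark API (nothing PyDeequ-specific here; 15002 is the default Spark Connect port):

from pyspark.sql import SparkSession

# gRPC connection to a Spark Connect server; no local JVM required.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()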

Code Changes

  • New pydeequ/v2/ module with Spark Connect implementation:

    • checks.py - Check and constraint builders
    • analyzers.py - Analyzer classes
    • predicates.py - Serializable predicates (eq, gte, between, etc.; see the sketch after this list)
    • verification.py - VerificationSuite and AnalysisRunner
    • proto/ - Protobuf definitions and generated code
  • New test suite in tests/v2/:

    • test_unit.py - Unit tests (no Spark required)
    • test_analyzers.py - Analyzer integration tests
    • test_checks.py - Check constraint tests
    • test_e2e_spark_connect.py - End-to-end tests
  • Updated documentation:

    • Merged README with 2.0 quick start guide
    • Added architecture diagram
    • Migration guide from 1.x to 2.0
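
To make the predicate builders concrete, here is a short sketch; hasSize(eq(3)) comes from the API Changes section below, while the other constraint names and predicate signatures are assumed to mirror 1.x and are not confirmed by this PR:

from pydeequ.v2.predicates import eq, gte, between

# check is an existing pydeequ.v2.checks.Check instance.
# Serializable predicates replace the Python lambdas used in 1.x.
check.hasSize(eq(3))                          # row count == 3
check.hasCompleteness("id", gte(0.95))        # assumed 1.x-style signature
check.hasMean("value", between(0.0, 100.0))   # assumed 1.x-style signature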

API Changes

# Before (1.x): the assertion is a Python lambda, which Py4J
# cannot serialize for remote execution
from pydeequ.checks import Check, CheckLevel
check.hasSize(lambda x: x == 3)

# After (2.0): the assertion is a serializable predicate object
# that can be sent over Spark Connect
from pydeequ.v2.checks import Check, CheckLevel
from pydeequ.v2.predicates import eq
check.hasSize(eq(3))
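
For context, a minimal end-to-end sketch, assuming the 2.0 VerificationSuite keeps the 1.x builder shape (onData/addCheck/run) and that Check is constructed as Check(CheckLevel, description); neither shape is confirmed by this description:

from pyspark.sql import SparkSession
from pydeequ.v2.checks import Check, CheckLevel
from pydeequ.v2.predicates import eq
from pydeequ.v2.verification import VerificationSuite

# Standard PySpark: connect to a Spark Connect server over gRPC.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# Assumed shapes: Check(CheckLevel, description) and the 1.x builder chain.
check = Check(CheckLevel.Error, "size check").hasSize(eq(3))
result = VerificationSuite(spark).onData(df).addCheck(check).run()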

Testing

For more details, see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

A reviewer commented on the following snippet:

plan = _create_deequ_plan(extension)

# Use DataFrame.withPlan to properly create the DataFrame
return ConnectDataFrame.withPlan(plan, session=self._spark)


Feel free to ignore!

There is a breaking change between Spark 3.5.x and 4.0.x. In GraphFrames we use code like this:

from pyspark.sql.connect.dataframe import DataFrame
from pyspark.sql.connect.plan import LogicalPlan
from pyspark.sql.connect.session import SparkSession


def _dataframe_from_plan(plan: LogicalPlan, session: SparkSession) -> DataFrame:
    if hasattr(DataFrame, "withPlan"):
        # Spark 3.x: build via the withPlan classmethod
        return DataFrame.withPlan(plan, session)

    # Spark 4.x: withPlan is gone; the constructor takes the plan directly
    return DataFrame(plan, session)

I would recommend switching to this approach to avoid pain during the Spark 4.x migration.
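
Applied to the reviewed snippet above, the return site would then read (a hypothetical adaptation):

plan = _create_deequ_plan(extension)
# Version-agnostic construction via the helper above
return _dataframe_from_plan(plan, self._spark)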

@chenliu0831 (Contributor, Author) replied:

Thanks for the callout - addressed in 69a5ed9
