Conversation

@chenliu0831 (Contributor) commented on Jan 13, 2026

Issue #, if available:

Description of changes:

This PR introduces PyDeequ 2.0 beta, a major release that replaces the Py4J-based architecture with Spark Connect for client-server communication.

The Deequ-side change will be opened separately; the proto file here is copied in for review purposes. For ease of testing, I created a pre-release https://github.com/awslabs/python-deequ/releases/tag/v2.0.0b1 to host the jars/wheels.

Motivation

The legacy PyDeequ relied on Py4J to bridge Python and the JVM, which had several limitations:

  • Required a local Spark session with JVM access
  • Python lambdas couldn't be serialized for remote execution
  • Tight coupling between the Python client and the JVM made debugging difficult

Spark Connect (introduced in Spark 3.4) provides a clean gRPC-based protocol that solves these issues.
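
For context, a Spark Connect client attaches to a remote server with the standard PySpark API (nothing PyDeequ-specific here; 15002 is the default Spark Connect port):

from pyspark.sql import SparkSession

# gRPC connection to a Spark Connect server; no local JVM required.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()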

Code Changes

  • New pydeequ/v2/ module with Spark Connect implementation:

    • checks.py - Check and constraint builders
    • analyzers.py - Analyzer classes
    • predicates.py - Serializable predicates (eq, gte, between, etc.; see the sketch after this list)
    • verification.py - VerificationSuite and AnalysisRunner
    • proto/ - Protobuf definitions and generated code
  • New test suite in tests/v2/:

    • test_unit.py - Unit tests (no Spark required)
    • test_analyzers.py - Analyzer integration tests
    • test_checks.py - Check constraint tests
    • test_e2e_spark_connect.py - End-to-end tests
  • Updated documentation:

    • Merged README with 2.0 quick start guide
    • Added architecture diagram
    • Migration guide from 1.x to 2.0
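
To make the predicate builders concrete, here is a short sketch; hasSize(eq(3)) comes from the API Changes section below, while the other constraint names and predicate signatures are assumed to mirror 1.x and are not confirmed by this PR:

from pydeequ.v2.predicates import eq, gte, between

# check is an existing pydeequ.v2.checks.Check instance.
# Serializable predicates replace the Python lambdas used in 1.x.
check.hasSize(eq(3))                          # row count == 3
check.hasCompleteness("id", gte(0.95))        # assumed 1.x-style signature
check.hasMean("value", between(0.0, 100.0))   # assumed 1.x-style signature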

API Changes

# Before (1.x): the assertion is a Python lambda, which Py4J
# cannot serialize for remote execution
from pydeequ.checks import Check, CheckLevel
check.hasSize(lambda x: x == 3)

# After (2.0): the assertion is a serializable predicate object
# that can be sent over Spark Connect
from pydeequ.v2.checks import Check, CheckLevel
from pydeequ.v2.predicates import eq
check.hasSize(eq(3))
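
For context, a minimal end-to-end sketch, assuming the 2.0 VerificationSuite keeps the 1.x builder shape (onData/addCheck/run) and that Check is constructed as Check(CheckLevel, description); neither shape is confirmed by this description:

from pyspark.sql import SparkSession
from pydeequ.v2.checks import Check, CheckLevel
from pydeequ.v2.predicates import eq
from pydeequ.v2.verification import VerificationSuite

# Standard PySpark: connect to a Spark Connect server over gRPC.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# Assumed shapes: Check(CheckLevel, description) and the 1.x builder chain.
check = Check(CheckLevel.Error, "size check").hasSize(eq(3))
result = VerificationSuite(spark).onData(df).addCheck(check).run()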

Testing

For more details, see https://github.com/awslabs/python-deequ/blob/v2_rewrite/README.md.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

A reviewer commented on the following snippet:

plan = _create_deequ_plan(extension)

# Use DataFrame.withPlan to properly create the DataFrame
return ConnectDataFrame.withPlan(plan, session=self._spark)


Feel free to ignore!

There is a breaking change between Spark 3.5.x and 4.0.x. In GraphFrames we use code like this:

from pyspark.sql.connect.dataframe import DataFrame
from pyspark.sql.connect.plan import LogicalPlan
from pyspark.sql.connect.session import SparkSession


def _dataframe_from_plan(plan: LogicalPlan, session: SparkSession) -> DataFrame:
    if hasattr(DataFrame, "withPlan"):
        # Spark 3.x: build via the withPlan classmethod
        return DataFrame.withPlan(plan, session)

    # Spark 4.x: withPlan is gone; the constructor takes the plan directly
    return DataFrame(plan, session)

I would recommend switching to this approach to avoid pain during the Spark 4.x migration.
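
Applied to the reviewed snippet above, the return site would then read (a hypothetical adaptation):

plan = _create_deequ_plan(extension)
# Version-agnostic construction via the helper above
return _dataframe_from_plan(plan, self._spark)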

@chenliu0831 (Contributor, Author) replied:

Thanks for the callout - addressed in 69a5ed9
