Contributor

@chenliu0831 chenliu0831 commented Jan 20, 2026

Issue #, if available: #128

Description of changes:

Adds DuckDB as a lightweight, JVM-free backend for PyDeequ 2.0, installable as an optional dependency. The overall design is inspired by the DuckDQ project mentioned in #128, which deserves most of the credit. Stateful aggregation for streaming DQ monitoring (i.e. MetricsRepository) is not implemented yet.
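For readers unfamiliar with the approach, here is a minimal sketch of how a SQL engine can serve as a data-quality backend: several metrics for a column are computed in a single aggregate query. This is not the code in this PR. The function name `basic_column_metrics`, the table, and the metric keys are hypothetical, and the standard-library `sqlite3` module stands in for DuckDB only so the sketch runs without extra packages; a DuckDB connection exposes a similar `execute(...).fetchone()` interface.

```python
import sqlite3


def basic_column_metrics(con, table, column):
    """Compute size, completeness, and exact distinct count in one scan.

    Hypothetical helper for illustration; identifiers are interpolated
    directly, so only pass trusted table and column names.
    """
    sql = (
        f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") '
        f'FROM "{table}"'
    )
    total, non_null, distinct = con.execute(sql).fetchone()
    # Completeness is the fraction of non-NULL values. Returning None for
    # an empty table is an arbitrary choice made for this sketch.
    completeness = non_null / total if total else None
    return {"size": total, "completeness": completeness, "distinct": distinct}


# sqlite3 is a stand-in engine here (assumption, see the note above).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "new"), (2, "paid"), (3, None), (4, "paid")],
)
metrics = basic_column_metrics(con, "orders", "status")
print(metrics)
```

Pushing all three aggregates into one query lets the engine scan the table once, which is the main reason an embedded SQL engine can be a lightweight alternative to a JVM-based one for this kind of check.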

Other notable changes:

  • Restructured pyproject.toml to support optional dependencies: pip install pydeequ[duckdb] installs the DuckDB backend (no JVM required). The core package now has minimal dependencies (numpy, pandas, protobuf).
  • Engine parity tests between the Spark and DuckDB engines. Some HLL/quantile differences exist because the two engines use different algorithms. More details in Engines.md.
  • Benchmark tooling.
  • Comprehensive test suite.
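
Because the approximate metrics (HLL distinct counts, quantiles) can differ between engines, a parity check typically compares them within a tolerance while holding exact metrics to near equality. The sketch below illustrates that idea only. The helper `metrics_match`, the metric names used as keys, and the 5% tolerance are assumptions for illustration, not the values or API used by the parity tests in this PR.

```python
import math

# Hypothetical relative tolerances for approximate metrics (assumed values).
APPROX_TOLERANCE = {
    "ApproxCountDistinct": 0.05,
    "ApproxQuantile": 0.05,
}


def metrics_match(name, spark_value, duckdb_value, exact_tol=1e-9):
    """Return True if two engines agree on a metric.

    Approximate metrics are compared with a loose relative tolerance;
    everything else must agree to within floating-point noise.
    """
    rel_tol = APPROX_TOLERANCE.get(name, exact_tol)
    return math.isclose(
        spark_value, duckdb_value, rel_tol=rel_tol, abs_tol=exact_tol
    )


# Exact metric: values must be essentially identical.
print(metrics_match("Completeness", 0.75, 0.75))
# Approximate metric: a 2% gap is accepted under the assumed 5% tolerance.
print(metrics_match("ApproxCountDistinct", 1000, 1020))
```

Keeping the tolerance table separate from the comparison makes it easy to see which metrics are expected to diverge and by how much.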

See https://github.com/awslabs/python-deequ/blob/v2_engine/README.md and https://github.com/awslabs/python-deequ/blob/v2_engine/docs/architecture.md for more background.

Benchmark

See https://github.com/awslabs/python-deequ/blob/v2_engine/BENCHMARK.md for more details.

[Figure: benchmark chart]
