Note: This project has been maintained by the community since Datafold sunset it in May 2024.
data-diff is an open-source CLI and Python library for efficiently comparing data across 13+ database engines. It uses bisection and checksumming to find differing rows without transferring entire tables, making it fast even on tables with millions of rows.
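
The bisection-and-checksum idea can be illustrated with a small in-memory sketch. This is not data-diff's actual implementation: the function names `checksum` and `diff_segment`, the `threshold` parameter, and the use of SHA-256 over Python lists are all assumptions made for illustration. In a real setup each checksum would presumably be computed inside the database, so only a digest per segment crosses the network.

```python
import hashlib


def checksum(rows):
    """Hash a key-sorted list of (key, value) rows into a single digest."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def diff_segment(a, b, threshold=2):
    """Yield ('-', row) for rows only in `a` and ('+', row) for rows only in `b`.

    Both inputs are lists of (key, value) tuples sorted by key, with unique
    keys inside each list (hypothetical stand-ins for two database tables).
    """
    if checksum(a) == checksum(b):
        return  # identical segments: no rows need to be compared

    keys = sorted({k for k, _ in a} | {k for k, _ in b})
    if len(keys) <= max(threshold, 1):
        # Segment is small enough: compare the rows directly.
        rows_a, rows_b = set(a), set(b)
        for row in a:
            if row not in rows_b:
                yield ("-", row)
        for row in b:
            if row not in rows_a:
                yield ("+", row)
        return

    # Bisect both tables on the same key boundary and recurse into each half.
    mid = keys[len(keys) // 2]
    yield from diff_segment(
        [r for r in a if r[0] < mid], [r for r in b if r[0] < mid], threshold
    )
    yield from diff_segment(
        [r for r in a if r[0] >= mid], [r for r in b if r[0] >= mid], threshold
    )


table_a = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
table_b = [(1, "alice"), (2, "bobby"), (3, "carol"), (5, "eve")]

for sign, row in diff_segment(table_a, table_b):
    print(sign, row)
```

Segments whose checksums match are skipped entirely, so the work concentrates on the key ranges that actually differ. A changed row shows up as a `-` entry followed by a `+` entry with the same key.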

Install from PyPI:

    pip install data-diff

Install with database-specific extras:

    pip install 'data-diff[postgresql,mysql]'

Compare two tables from the command line:

    data-diff \
      postgresql://user:password@localhost/db1 table1 \
      postgresql://user:password@localhost/db2 table2 \
      --key-columns id \
      --columns name,email,updated_at

Or use the Python API:

    import data_diff

    diff = data_diff.diff_tables(
        table1=data_diff.connect_to_table("postgresql://localhost/db1", "table1", "id"),
        table2=data_diff.connect_to_table("postgresql://localhost/db2", "table2", "id"),
    )
    for sign, row in diff:
        print(sign, row)  # '+' for added, '-' for removed

| Database | Status |
|---|---|
| PostgreSQL | Supported |
| MySQL | Supported |
| Snowflake | Supported |
| BigQuery | Supported |
| Databricks | Supported |
| Redshift | Supported |
| DuckDB | Supported |
| Presto | Supported |
| Trino | Supported |
| Oracle | Supported |
| MS SQL | Supported |
| ClickHouse | Supported |
| Vertica | Supported |

data-diff integrates with dbt to compare tables between development and production environments:

    data-diff --dbt

Install with dbt support:

    pip install 'data-diff[dbt]'

See the full documentation for configuration details.

This project is licensed under the terms of the MIT License.