dtsong/data-diff

 
 

data-diff -- Efficiently diff rows across databases

Note: This project has been maintained by the community since Datafold sunset it in May 2024.

data-diff is an open-source CLI and Python library for efficiently comparing data across 13+ database engines. It uses bisection and checksumming to find differing rows without transferring entire tables, making it fast even on tables with millions of rows.
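To illustrate the idea of bisection plus checksumming, here is a minimal in-memory sketch. It is not data-diff's actual implementation: the tables are plain Python dicts keyed by an integer id, and the names `segment_checksum`, `diff_by_bisection`, and `threshold` are hypothetical, chosen only for this example. In a real run the checksums would be computed inside each database, so only small digests cross the network until a mismatching segment is small enough to fetch.

```python
import hashlib


def segment_checksum(table, lo, hi):
    """Checksum all rows whose key satisfies lo <= key < hi, in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode())
    return digest.hexdigest()


def diff_by_bisection(t1, t2, lo, hi, threshold=4):
    """Return ('-', row) / ('+', row) tuples for keys in [lo, hi) that differ."""
    # Matching checksums: treat the whole segment as identical and skip it.
    if segment_checksum(t1, lo, hi) == segment_checksum(t2, lo, hi):
        return []

    # Small segment: compare the rows directly.
    if hi - lo <= max(threshold, 1):
        changes = []
        keys = sorted(k for k in set(t1) | set(t2) if lo <= k < hi)
        for key in keys:
            row1, row2 = t1.get(key), t2.get(key)
            if row1 == row2:
                continue
            if row1 is not None:
                changes.append(("-", (key, *row1)))
            if row2 is not None:
                changes.append(("+", (key, *row2)))
        return changes

    # Large segment with a mismatch: split the key range and recurse.
    mid = (lo + hi) // 2
    return (diff_by_bisection(t1, t2, lo, mid, threshold)
            + diff_by_bisection(t1, t2, mid, hi, threshold))


# Hypothetical sample data: 100 rows, one changed and one deleted in the copy.
source = {i: (f"user{i}",) for i in range(100)}
target = dict(source)
target[42] = ("renamed",)
del target[7]

changes = diff_by_bisection(source, target, 0, 100)
print(changes)
# [('-', (7, 'user7')), ('-', (42, 'user42')), ('+', (42, 'renamed'))]
```

Segments whose checksums match are never inspected row by row, which is why this kind of approach pays off when the tables are large and the differences are few.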

Installation

pip install data-diff

Install with database-specific extras:

pip install 'data-diff[postgresql,mysql]'

Quick Start

CLI

data-diff \
  postgresql://user:password@localhost/db1 table1 \
  postgresql://user:password@localhost/db2 table2 \
  --key-columns id \
  --columns name,email,updated_at

Python API

import data_diff

diff = data_diff.diff_tables(
    table1=data_diff.connect_to_table("postgresql://localhost/db1", "table1", "id"),
    table2=data_diff.connect_to_table("postgresql://localhost/db2", "table2", "id"),
)

for sign, row in diff:
    print(sign, row)  # '-' = row found only in table1, '+' = row found only in table2
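The stream of `(sign, row)` pairs can be grouped by key to separate added, removed, and changed rows. The helper below is a sketch, not part of the data-diff API: `classify_diff` and `key_index` are hypothetical names, and it assumes each row is a tuple whose first element is the key column and that a modified row appears once with `-` and once with `+`. It accepts any iterable of pairs, so the sample list stands in for the result of `diff_tables`.

```python
from collections import defaultdict


def classify_diff(diff, key_index=0):
    """Group (sign, row) pairs by key into added, removed, and changed keys."""
    signs_by_key = defaultdict(set)
    for sign, row in diff:
        signs_by_key[row[key_index]].add(sign)

    summary = {"added": [], "removed": [], "changed": []}
    for key, signs in signs_by_key.items():
        if signs == {"+", "-"}:
            summary["changed"].append(key)   # present on both sides, values differ
        elif signs == {"+"}:
            summary["added"].append(key)     # present only on the second side
        else:
            summary["removed"].append(key)   # present only on the first side
    return summary


# Hypothetical sample standing in for the output of diff_tables.
sample = [
    ("-", (7, "user7")),
    ("-", (42, "user42")),
    ("+", (42, "renamed")),
    ("+", (101, "new")),
]
summary = classify_diff(sample)
print(summary)
# {'added': [101], 'removed': [7], 'changed': [42]}
```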

Supported Databases

Database     Status
-----------  ---------
PostgreSQL   Supported
MySQL        Supported
Snowflake    Supported
BigQuery     Supported
Databricks   Supported
Redshift     Supported
DuckDB       Supported
Presto       Supported
Trino        Supported
Oracle       Supported
MS SQL       Supported
ClickHouse   Supported
Vertica      Supported

dbt Integration

data-diff integrates with dbt to compare tables between development and production environments:

data-diff --dbt

Install with dbt support:

pip install 'data-diff[dbt]'

See the full documentation for configuration details.

License

This project is licensed under the terms of the MIT License.
