This project intends to collect, analyze and synthesize reference material about data lakes, data warehouses and data lakehouses.
Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
- Data Engineering Helpers - Knowledge Sharing - Data products
- Data Engineering Helpers - Knowledge Sharing - Data contracts
- Data Engineering Helpers - Knowledge Sharing - Data quality
- Data Engineering Helpers - Knowledge Sharing - Architecture principles
- Data Engineering Helpers - Knowledge Sharing - Data life cycle
- Data Engineering Helpers - Knowledge Sharing - Data management
- Data Engineering Helpers - Knowledge Sharing - Metadata
- Data Engineering Helpers - Knowledge Sharing - Data pipeline deployment
- Data Engineering Helpers - Knowledge Sharing - Semantic layer
- Data Engineering Helpers - Knowledge Sharing - Data analytics/analysis
- GitHub repository: https://github.com/dipankarmazumdar/awesome-lakehouse-guide
- Author/maintainer: Dipankar Mazumdar
- Post on LinkedIn:
- Databricks blog - What is a data lakehouse
- Authors: Ben Lorica, Michael Armbrust, Reynold Xin, Matei Zaharia and Ali Ghodsi
- Date: Jan. 2020
- Title: What is Apache XTable (formerly OneTable) — Interoperability for Apache Hudi, Iceberg & Delta Lake
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn, Dipankar Mazumdar on Medium)
- Date: Dec. 2023
- Link to the article: https://dipankar-tnt.medium.com/onetable-interoperability-for-apache-hudi-iceberg-delta-lake-bb8b27dd288d
- Title: The Apache Arrow Ecosystem
- Date: Apr. 2026
- Author: Hoyt Emerson (Hoyt Emerson on LinkedIn, Hoyt Emerson on Substack)
- Substack post
- Title: ACID Transactions in an Open Data Lakehouse
- Date: Feb. 2025
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn)
- Link to the article on the Onehouse blog: https://www.onehouse.ai/blog/acid-transactions-in-an-open-data-lakehouse
- Title: Why are companies building a lakehouse
- Date: Feb. 2025
- Author: Roy Hasson (Roy Hasson on LinkedIn)
- Link to the LinkedIn post: https://www.linkedin.com/posts/royhasson_last-week-i-ran-an-enablement-session-for-activity-7297658658773979136-Cr2M/
"Why are companies building a Lakehouse?" This is how I responded... The why is simple:
- Reduce costs
- Eliminate lock-in
- Be more agile and flexible
- Title: From Lakehouse architecture to data mesh
- Date: Dec. 2024
- Author: Adevinta (Adevinta on Medium)
- Link to the article on Medium: https://medium.com/adevinta-tech-blog/from-lakehouse-architecture-to-data-mesh-c532c91f7b61
- Title: Open Table Formats and the Open Data Lakehouse, In Perspective
- Date: Oct. 2024
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn)
- Link to the article on the Onehouse blog: https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective
- Related open source projects:
- Title: Understanding Parquet, Iceberg and Data Lakehouses at Broad
- Author: David Gomes (David Gomes on LinkedIn, David Gomes profile page on his own blog)
- Date: December 2023
- Link to the article: https://davidgomes.com/understanding-parquet-iceberg-and-data-lakehouses-at-broad/
- Title: The Data Lakehouse: Data Warehousing and More
- Authors: Dipankar Mazumdar, Jason Hughes, JB Onofré (all working at Dremio at the time)
- Date: October 2023
- Link to the PDF document on Arxiv: https://arxiv.org/pdf/2310.08697.pdf
- Link to the LinkedIn post: https://www.linkedin.com/posts/dipankar-mazumdar_dataengineering-softwareengineering-activity-7283666426437980160-A33n
- Title: Understanding Big Data File Formats
- Author: Vladimir Sivcevic (Vladimir Sivcevic on LinkedIn, Vladimir Sivcevic profile page on his own blog)
- Date: April 2022
- Link to the article: https://www.vladsiv.com/big-data-file-formats/
- Post on LinkedIn: https://www.linkedin.com/posts/marcoslot_why-we-developed-pgincremental-one-of-activity-7392183986577440770-HCZU/
- Date: Nov. 2025
- Author: Marco Slot