This project benchmarks several data processing engines (e.g., DuckDB, Spark, Polars, Daft).
Inspiration: Todo MVC
Even though the members of the GitHub organization may be employed by some companies, they speak in a personal capacity and do not represent those companies.
- Todo MVC home page
- Material for the Data platform - Data access to Databricks data lakehouse
- Data Engineering Helpers - Knowledge Sharing - Cheat sheets
- Material for the Data platform - Architecture principles
- Material for the Data platform - Modern Data Stack (MDS) in a box
- Material for the Data platform - Data life cycle
- Material for the Data platform - Data contracts
- Material for the Data platform - Metadata
- Material for the Data platform - Data quality
- Author: Mimoune Djouallah (Mimoune Djouallah on LinkedIn, Mimoune Djouallah on GitHub)
- Date: Jan. 2026
- Git repository with a Jupyter notebook and fully reproducible scripts
- Post on LinkedIn
- Title: Accelerating Apache Spark's Execution Engine
- Author: Dipankar Mazumdar
- Date: Dec. 2025
- Post on LinkedIn
- Title: Accelerating Apache Spark with Gluten & Velox
- Author: Angel Conde (Angel Conde on LinkedIn, Angel Conde on Medium)
- Date: Sep. 2025
- Link to the article on Medium
- Companion Git repository
It features a benchmark with:
- Public, generated, datasets containing a fact table and dimension tables
- Several queries representing typical analytics workloads:
  - Query A — Heavy multi‑aggregation: Groups the fact table by `country_id` and `channel_id` and computes counts, sums, averages, standard deviation and approximate percentiles. This pattern stresses hash aggregation, projection and filter operators.
  - Query B — Rollup (cube) aggregation: Joins the fact table with a date dimension and uses `rollup(date_key, country_id, product_id)` to compute revenue, quantity and average discount across multiple grouping levels.
  - Query C — Star schema join and top‑K sort: Joins the fact table with broadcast dimensions (countries and channels) and the date dimension, computes gross and net revenue, and orders by gross descending, taking the top 5,000 rows. Broadcast joins and top‑K sorts test Velox’s vectorized join and sort operators.
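As a rough illustration, the three query patterns above might look like the following SQL sketches. Table and column names (`fact_sales`, `dim_date`, `dim_country`, `dim_channel`, `amount`, `revenue`, `discount`, etc.) are assumptions for illustration; the actual schema and queries live in the companion repository. Function names also vary by engine (e.g., DuckDB's `approx_quantile` vs. Spark's `approx_percentile`).

```sql
-- Query A (sketch): heavy multi-aggregation over the fact table
SELECT country_id,
       channel_id,
       COUNT(*)                      AS n_rows,
       SUM(amount)                   AS total_amount,
       AVG(amount)                   AS avg_amount,
       STDDEV(amount)                AS stddev_amount,
       APPROX_QUANTILE(amount, 0.95) AS p95_amount
FROM fact_sales
GROUP BY country_id, channel_id;

-- Query B (sketch): rollup aggregation joined with the date dimension
SELECT d.date_key, f.country_id, f.product_id,
       SUM(f.revenue)  AS revenue,
       SUM(f.quantity) AS quantity,
       AVG(f.discount) AS avg_discount
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY ROLLUP (d.date_key, f.country_id, f.product_id);

-- Query C (sketch): star-schema join with a top-K sort
SELECT c.country_name, ch.channel_name, d.date_key,
       SUM(f.revenue)                    AS gross_revenue,
       SUM(f.revenue * (1 - f.discount)) AS net_revenue
FROM fact_sales f
JOIN dim_country c  ON f.country_id = c.country_id
JOIN dim_channel ch ON f.channel_id = ch.channel_id
JOIN dim_date d     ON f.date_key   = d.date_key
GROUP BY c.country_name, ch.channel_name, d.date_key
ORDER BY gross_revenue DESC
LIMIT 5000;
```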
- Title: CSV, GZip, S3, Python (Polars vs DuckDB)
- Date: Nov. 2025
- Author: Daniel Beach (Daniel Beach on LinkedIn, Daniel Beach on Substack)
- Link to the article on Substack