This project benchmarks several data processing engines (e.g., DuckDB, Spark, Polars, Daft).
Inspiration: Todo MVC
Even though the members of the GitHub organization may be employed by some companies, they speak in a personal capacity and do not represent those companies.
- Todo MVC home page
- Material for the Data platform - Data access to Databricks data lakehouse
- Data Engineering Helpers - Knowledge Sharing - Cheat sheets
- Material for the Data platform - Architecture principles
- Material for the Data platform - Modern Data Stack (MDS) in a box
- Material for the Data platform - Data life cycle
- Material for the Data platform - Data contracts
- Material for the Data platform - Metadata
- Material for the Data platform - Data quality
- Author: Mimoune Djouallah (Mimoune Djouallah on LinkedIn, Mimoune Djouallah on GitHub)
- Date: Jan. 2026
- Git repository with a Jupyter notebook and fully reproducible scripts
- Post on LinkedIn
- Title: Accelerating Apache Spark's Execution Engine
- Author: Dipankar Mazumdar
- Date: Dec. 2025
- Post on LinkedIn
- Title: Accelerating Apache Spark with Gluten & Velox
- Author: Angel Conde (Angel Conde on LinkedIn, Angel Conde on Medium)
- Date: Sep. 2025
- Link to the article on Medium
- Companion Git repository
It features a benchmark with:
- Public, generated, datasets containing a fact table and dimension tables
- Several queries representing typical analytics workloads:
  - Query A — Heavy multi‑aggregation: Groups the fact table by `country_id` and `channel_id` and computes counts, sums, averages, standard deviation and approximate percentiles. This pattern stresses hash aggregation, projection and filter operators.
  - Query B — Rollup (cube) aggregation: Joins the fact table with a date dimension and uses `rollup(date_key, country_id, product_id)` to compute revenue, quantity and average discount across multiple grouping levels.
  - Query C — Star schema join and top‑K sort: Joins the fact table with broadcast dimensions (countries and channels) and the date dimension, computes gross and net revenue, and orders by gross descending, taking the top 5,000 rows. Broadcast joins and top‑K sorts test Velox’s vectorized join and sort operators.
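As a rough illustration, the three query patterns above might look like the following SQL sketches. Table and column names (`fact_sales`, `dim_date`, `dim_country`, `dim_channel`, `amount`, `revenue`, `discount`, etc.) are assumptions for illustration; the actual schema and queries live in the companion repository. Function names also vary by engine (e.g., DuckDB's `approx_quantile` vs. Spark's `approx_percentile`).

```sql
-- Query A (sketch): heavy multi-aggregation over the fact table
SELECT country_id,
       channel_id,
       COUNT(*)                      AS n_rows,
       SUM(amount)                   AS total_amount,
       AVG(amount)                   AS avg_amount,
       STDDEV(amount)                AS stddev_amount,
       APPROX_QUANTILE(amount, 0.95) AS p95_amount
FROM fact_sales
GROUP BY country_id, channel_id;

-- Query B (sketch): rollup aggregation joined with the date dimension
SELECT d.date_key, f.country_id, f.product_id,
       SUM(f.revenue)  AS revenue,
       SUM(f.quantity) AS quantity,
       AVG(f.discount) AS avg_discount
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY ROLLUP (d.date_key, f.country_id, f.product_id);

-- Query C (sketch): star-schema join with a top-K sort
SELECT c.country_name, ch.channel_name, d.date_key,
       SUM(f.revenue)                    AS gross_revenue,
       SUM(f.revenue * (1 - f.discount)) AS net_revenue
FROM fact_sales f
JOIN dim_country c  ON f.country_id = c.country_id
JOIN dim_channel ch ON f.channel_id = ch.channel_id
JOIN dim_date d     ON f.date_key   = d.date_key
GROUP BY c.country_name, ch.channel_name, d.date_key
ORDER BY gross_revenue DESC
LIMIT 5000;
```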
- Title: CSV, GZip, S3, Python (Polars vs DuckDB)
- Date: Nov. 2025
- Author: Daniel Beach (Daniel Beach on LinkedIn, Daniel Beach on Substack)
- Link to the article on Substack