data-engineering-helpers/benchmark-processing-engines

Knowledge Sharing (KS) - Benchmark of data processing engines

Overview

This project aims to benchmark a few data processing engines (e.g., DuckDB, Spark, Polars, Daft).

Inspiration: TodoMVC

Members of the GitHub organization may be employed by companies, but they speak on their own behalf and do not represent those companies.

References

Articles

Light ETL Python engines

Accelerating Apache Spark's Execution Engine

Accelerating Apache Spark with Gluten and Velox

  • Title: Accelerating Apache Spark with Gluten & Velox
  • Author: Angel Conde (Angel Conde on LinkedIn, Angel Conde on Medium)
  • Date: Sep. 2025
  • Link to the article on Medium
  • Companion Git repository. It features a benchmark with:
    • Public, generated datasets containing a fact table and dimension tables
    • Several queries representing typical analytics workloads:
      • Query A, heavy multi‑aggregation: groups the fact table by country_id and channel_id and computes counts, sums, averages, standard deviations and approximate percentiles. This pattern stresses the hash aggregation, projection and filter operators.
      • Query B, rollup (cube) aggregation: joins the fact table with a date dimension and uses rollup(date_key, country_id, product_id) to compute revenue, quantity and average discount across multiple grouping levels.
      • Query C, star schema join and top‑K sort: joins the fact table with broadcast dimensions (countries and channels) and the date dimension, computes gross and net revenue, and orders by gross revenue descending, taking the top 5,000 rows. Broadcast joins and top‑K sorts test Velox’s vectorized join and sort operators.
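To make the three query shapes concrete, here is a minimal, engine-agnostic sketch in plain Python (standard library only). It is not the code from the companion repository. The toy rows, the column layout, the dimension contents and the net revenue formula (`revenue * (1 - discount)`) are all assumptions made for illustration. The sketch uses an exact median where the benchmark uses approximate percentiles, and it omits the join with the date dimension.

```python
import heapq
import statistics
from collections import defaultdict

# Hypothetical toy fact table; assumed column order:
# (date_key, country_id, channel_id, product_id, quantity, revenue, discount)
FACT = [
    (20250101, 1, 10, 100, 2, 20.0, 0.10),
    (20250101, 1, 11, 101, 1, 15.0, 0.00),
    (20250101, 2, 10, 100, 3, 30.0, 0.20),
    (20250102, 1, 10, 100, 4, 40.0, 0.10),
    (20250102, 2, 11, 101, 5, 75.0, 0.00),
]
# Hypothetical dimension tables, small enough to be "broadcast" as dicts
COUNTRIES = {1: "FR", 2: "DE"}
CHANNELS = {10: "web", 11: "store"}


def query_a(fact):
    """Query A shape: several aggregates per (country_id, channel_id)."""
    groups = defaultdict(list)
    for _, country, channel, _, _, revenue, _ in fact:
        groups[(country, channel)].append(revenue)
    return {
        key: {
            "count": len(vals),
            "sum": sum(vals),
            "avg": statistics.fmean(vals),
            # Sample standard deviation needs at least two values
            "stddev": statistics.stdev(vals) if len(vals) > 1 else None,
            # Exact median; engines typically use an approximate percentile
            "median": statistics.median(vals),
        }
        for key, vals in groups.items()
    }


def query_b(fact):
    """Query B shape: rollup(date_key, country_id, product_id) on revenue.

    A rollup over three columns yields four grouping levels:
    (d, c, p), (d, c), (d) and the grand total. None marks a
    rolled-up column.
    """
    totals = defaultdict(float)
    for date_key, country, _, product, _, revenue, _ in fact:
        full = (date_key, country, product)
        for level in range(len(full), -1, -1):
            key = full[:level] + (None,) * (len(full) - level)
            totals[key] += revenue
    return dict(totals)


def query_c(fact, k):
    """Query C shape: dimension lookups, then top-K by gross revenue."""
    rows = [
        # Net revenue formula is an assumption, not taken from the article
        (COUNTRIES[c], CHANNELS[ch], revenue, revenue * (1 - discount))
        for _, c, ch, _, _, revenue, discount in fact
    ]
    # heapq.nlargest avoids sorting the full result when k is small
    return heapq.nlargest(k, rows, key=lambda r: r[2])
```

Calling `query_b(FACT)` returns one entry per grouping key at every level, which is why rollups produce more output rows than a plain `GROUP BY` on the same columns. `query_c(FACT, 2)` keeps only the two rows with the highest gross revenue, mirroring the top 5,000 cut-off at a much smaller scale.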

Polars vs DuckDB
