databricks-template

A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, coverage tests, CI/CD automation, Databricks Asset Bundles, and DQX data quality framework.

🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It aims to bring software engineering best practices—such as modular architecture, automated testing, and CI/CD—into the world of data engineering. By combining a clean project structure with robust development and deployment workflows, this template helps teams move faster with confidence.

You’re encouraged to adapt the structure and tooling to suit your project’s specific needs and environment.

Interested in bringing these principles into your own project? Let’s connect on LinkedIn.

🧪 Technologies

Databricks Free Edition (Serverless)
Databricks Runtime 18.0 LTS
Databricks Asset Bundles
Databricks DQX
Databricks CLI
Databricks Python SDK
PySpark 4.1
Python 3.12+
Unity Catalog
GitHub Actions
Pytest

📦 Features

This project template demonstrates how to:

structure PySpark code inside classes/packages, instead of notebooks.
package and deploy code to different environments (dev, staging, prod).
use a CI/CD pipeline with GitHub Actions.
run unit tests on transformations with pytest package. Set up VSCode to run unit tests on your local machine.
run integration tests setting the input data and validating the output data.
isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
show developer name and branch as job tags to track issues.
utilize coverage package to generate test coverage reports.
utilize uv as a project/package manager.
configure job to run in different environments with different parameters with jinja package.
configure job to run tasks selectively.
use medallion architecture pattern.
lint and format code with ruff and pre-commit.
use a Makefile to automate repetitive tasks.
utilize argparse package to build a flexible command line interface to start the jobs.

utilize Databricks Asset Bundles to package/deploy/run a Python wheel package on Databricks.
utilize Databricks DQX to define and enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data on quarantine tables.
utilize Databricks SDK for Python to manage workspaces and accounts and analyse costs. Refer to 'scripts' folder for some examples.
utilize Databricks Unity Catalog and get data lineage for your tables and columns.
utilize Databricks Lakeflow Jobs to execute a DAG and task parameters to share context information between tasks (see Task Parameters section). Yes, you don't need Airflow to manage your DAGs here!!!
utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
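
As an illustration of the first feature above (code in classes and packages rather than notebooks), here is a minimal sketch of a task base class with a name-based dispatcher. It is not the template's actual `baseTask.py`: the names `TaskContext`, `BaseTask`, `GenerateOrders`, `TASKS`, and `run_task` are hypothetical, and the task only returns a string instead of touching Spark so the example runs without a cluster.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical container for the parameters every task receives."""

    env: str
    catalog: str
    schema: str = "default"


class BaseTask(ABC):
    """Hypothetical base class: each pipeline step subclasses it."""

    def __init__(self, ctx: TaskContext) -> None:
        self.ctx = ctx

    def table(self, name: str) -> str:
        # Three-level Unity Catalog name: catalog.schema.table
        return f"{self.ctx.catalog}.{self.ctx.schema}.{name}"

    @abstractmethod
    def run(self) -> str:
        """Execute the task and return a short status message."""


class GenerateOrders(BaseTask):
    def run(self) -> str:
        # A real task would read and write DataFrames here.
        return f"would write {self.table('orders')}"


# Assumed registry that maps a task name to its class.
TASKS = {"generate_orders": GenerateOrders}


def run_task(name: str, ctx: TaskContext) -> str:
    try:
        task_cls = TASKS[name]
    except KeyError:
        raise ValueError(f"unknown task: {name}") from None
    return task_cls(ctx).run()
```

Because each step is an ordinary class, a unit test can instantiate it with a test context and assert on the result without deploying anything.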

🧠 Resources

For a debate on the use of notebooks vs. Python packaging, please refer to:

Sessions on Databricks Asset Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other:

📁 Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       └── job1/                  # Job-specific tasks
│           ├── extract_source1.py
│           ├── extract_source2.py
│           ├── generate_orders.py
│           ├── generate_orders_agg.py
│           ├── integration_setup.py
│           └── integration_validate.py
│
├── tests/                          # Unit tests
│   └── job1/
│       └── unit_test.py            # Pytest unit tests
│
├── resources/                      # Databricks workflow templates
│   ├── wf_template_serverless.yml  # Jinja2 template for serverless
│   ├── wf_template.yml             # Jinja2 template for job clusters
│   └── workflow.yml                # Generated workflow (auto-created)
│
├── scripts/                           # Helper scripts
│   ├── generate_template_workflow.py  # Workflow generator (Jinja2)
│   ├── sdk_analyze_job_costs.py       # Cost analysis script
│   └── sdk_workspace_and_account.py   # Workspace and account management
│
├── docs/                           # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                        # Build artifacts (Python wheel)
├── coverage_reports/            # Test coverage reports
│
├── databricks.yml               # Databricks Asset Bundle config
├── pyproject.toml               # Python project configuration (uv)
├── Makefile                     # Build automation
├── .pre-commit-config.yaml      # Pre-commit hooks (ruff)
└── README.md                    # This file

CI/CD pipeline

Jobs

Task Output

Data Lineage

Data Quality (generated by Databricks DQX)

Instructions

Create a workspace. Use a Databricks Free Edition workspace.
Install and configure the Databricks CLI on your local machine. Check the required version in databricks.yml. Follow instructions here.
Build Python env and execute unit tests on your local machine.
```
 make sync && make test
```
Deploy and execute on the dev workspace.
```
 make deploy env=dev
```
Configure CI/CD automation. Configure GitHub Actions repository secrets (DATABRICKS_HOST and DATABRICKS_TOKEN).
You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.

Task parameters

task (required) - determines the current task to be executed.
env (required) - determines the AWS account where the job is running. This parameter also defines the default catalog for the task.
user (required) - determines the name of the catalog when env is "dev".
schema (optional) - determines the default schema to read/store tables.
skip (optional) - determines if the current task should be skipped.
debug (optional) - determines if the current task should go through debug conditional.
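
The parameters above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions, not the template's actual `main.py`: the flag spellings, the use of boolean switches for `skip` and `debug`, the `build_parser` and `resolve_catalog` helpers, and the rule that non-dev catalogs are named after the environment are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run one task of the job.")
    parser.add_argument("--task", required=True, help="task to execute")
    parser.add_argument(
        "--env",
        required=True,
        choices=["dev", "staging", "prod"],
        help="target environment; also selects the default catalog",
    )
    parser.add_argument("--user", required=True, help="catalog name when env is dev")
    parser.add_argument("--schema", default=None, help="default schema for tables")
    # Assumption: skip and debug are modeled as on/off switches.
    parser.add_argument("--skip", action="store_true", help="skip this task")
    parser.add_argument("--debug", action="store_true", help="enable debug branch")
    return parser


def resolve_catalog(env: str, user: str) -> str:
    # In dev, each developer works in a catalog named after the user.
    # Assumption: other environments use a catalog named after the env.
    return user if env == "dev" else env


if __name__ == "__main__":
    args = build_parser().parse_args()
    if args.skip:
        print(f"skipping task {args.task}")
    else:
        catalog = resolve_catalog(args.env, args.user)
        print(f"running task {args.task} in catalog {catalog}")
```

With this layout, a call such as `--task generate_orders --env dev --user alice` would resolve to the `alice` catalog, which is how per-developer isolation in dev can be kept out of the task code itself.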
