This repo has been migrated from an old private one and the migration is not complete yet, i.e. things will not work as they are now.
Automated pipeline for the end-to-end testing of the SpatialData framework, the conversion of sample datasets and the upload of the converted datasets to S3.
This repository contains a series of scripts that can be run manually (either sequentially or in parallel) or in an Airflow pipeline. We need all these approaches, since they cover different use cases.
- Airflow. A machine connected to the internet, to which the team can have secure access, will run Airflow 24/7 and provide a continuously up-to-date status of the health of the SpatialData framework. The Airflow pipeline could be configured to use a parallel executor, but we will keep it simple and run the jobs sequentially. This is fine because the machine is more of a "run and forget" type, so it doesn't really matter if the jobs take a bit longer to run.
The Airflow pipeline has some disadvantages:
- It's not straightforward to debug/rerun failing jobs.
- The machine running it is not powerful. A laptop is more powerful, and the local workstations we have in the lab are much more powerful.

This is why we also have these two approaches:
- Manual execution, sequential. We run the jobs from a series of bash scripts. Simple to understand, run, and debug, and fast.
- Manual execution, parallelized. We run the jobs in parallel using a simple bash function that spins up multiple processes (see the sketch below). Simple and very fast. More fragile than a sequential run, and not as stable or powerful as Airflow or other executors like Nextflow, but much simpler than those.
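As an illustration of the parallel approach, here is a minimal sketch of the bash pattern (spawn background processes, then wait for them). It is not the actual helper used in this repo, and the job commands are placeholders:

```bash
#!/usr/bin/env bash
# Minimal sketch: run independent jobs in parallel and fail if any of them fails.
# The commands below are placeholders, not the real converter invocations.
set -euo pipefail

run_parallel() {
    local pids=()
    for cmd in "$@"; do
        bash -c "$cmd" &   # one background process per job
        pids+=($!)
    done
    for pid in "${pids[@]}"; do
        wait "$pid"        # wait returns the job's exit code, so set -e aborts on failure
    done
}

run_parallel \
    "echo 'convert dataset A'" \
    "echo 'convert dataset B'" \
    "echo 'convert dataset C'"
```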
There are probably better technologies, like Flyte or Prefect, which combine easy local execution with a pipeline system, but they are expensive. Airflow is powerful but not good for local execution; bash is simple and good for local execution but not a good way to build pipelines. That's why we use both.
The installation instructions are not heavily tested, so be ready to adjust them to your environment (and please contribute back your changes), or consider opening a GitHub issue.
- Clone the repository, including the submodules:
  `git clone --recurse-submodules https://github.com/LucaMarconato/spatialdata-data-converter.git`
- Go to the cloned directory:
  `cd spatialdata-data-converter`
- Install the `pixi` environment. You need to have `pixi` available in your system. Run `pixi install` and then `pixi update`. (The setup commands are also collected in a single block below.)
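  For convenience, here are the same setup commands collected in one block (nothing new here; `pixi` itself must already be installed, see https://pixi.sh):

  ```bash
  # Clone, enter the repository, and set up the pixi environment.
  git clone --recurse-submodules https://github.com/LucaMarconato/spatialdata-data-converter.git
  cd spatialdata-data-converter
  pixi install
  pixi update
  ```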
- Copy `template_envvars.sh` into `envvars.sh` and edit the 2 paths according to your system (a sketch of this step is shown below).
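  A minimal sketch of this step, assuming the template sits in the repository root (the variable names in the comments are hypothetical; use whatever `template_envvars.sh` actually defines):

  ```bash
  # Run from the repository root; adjust the path if the template lives elsewhere.
  cp template_envvars.sh envvars.sh
  # Then open envvars.sh and point the two paths to your system, e.g. something like
  #   export RAW_DATA_DIR=/absolute/path/to/raw/data        # hypothetical name
  #   export CONVERTED_DATA_DIR=/absolute/path/to/output    # hypothetical name
  ```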
- Run `workflow_first_run.sh` with:
  `bash src/spatialdata_data_converter/workflow_first_run.sh`
  This will run a small set of commands sequentially.
- If the script fails (it should not, but there is some chance it will):
  - Comment out what worked to avoid re-running everything.
  - Fix what didn't work.
  - Run the script again from where it failed.
  - Continue like this until the script finishes successfully.
  - Please commit and push the fixes, or at minimum report them.
- Now there is a good chance that things work, and you are ready to run the real thing! This is the script that you should run before making a release. The jobs will be run in parallel. On a multi-core Ubuntu machine the parallel version takes 15 minutes; the non-parallel one would take (I think) more than 1 hour.
  `bash src/spatialdata_data_converter/workflow_before_release.sh`
  If it passes, you are ready to make a release! If it doesn't pass, debug things like before. You can either comment out lines from the script above, or, if you want to run things sequentially, manually comment/uncomment lines in the script `invoke_cli.sh` (which contains all the commands, even the ones you don't need) and run it with:
  `bash src/spatialdata_data_converter/invoke_cli.sh`
  (A tip for keeping a log of the run is shown below.)
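  If you want to keep a log of the run for later debugging (just a suggestion, not something the script requires), capture the output with `tee`:

  ```bash
  # Save stdout and stderr to a file while still printing them to the terminal.
  # Enable pipefail if you rely on the script's exit code after the pipe.
  set -o pipefail
  bash src/spatialdata_data_converter/workflow_before_release.sh 2>&1 | tee before_release.log
  ```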
- After making a release, please upload the new datasets to S3 by running:
  `bash src/spatialdata_data_converter/workflow_upload_data_for_release.sh`
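  The upload needs access to the target S3 bucket. How the script picks up credentials is not documented here; assuming it goes through the standard AWS credential chain, the usual environment variables work (this is an assumption, it may instead read `~/.aws/credentials` or another source):

  ```bash
  # Standard AWS credential environment variables (placeholders, do not commit real keys).
  export AWS_ACCESS_KEY_ID=<your-access-key-id>
  export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
  export AWS_DEFAULT_REGION=<bucket-region>
  bash src/spatialdata_data_converter/workflow_upload_data_for_release.sh
  ```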
- Configure the Airflow home directory. In your `~/.zshrc` file (or, depending on your system, `~/.bashrc`, `~/.bash_profile`, ...), add the following line (remember to change the path):
  `export AIRFLOW_HOME=/absolute/path/to/spatialdata-data-converter/airflow`
- Run Airflow:
  `pixi run airflow standalone`
  When running the command above for the first time, it creates a configuration file in `$AIRFLOW_HOME/airflow.cfg`. After creating the configuration, the command starts the webserver and the scheduler. (A combined example of these two steps is shown below.)
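  Putting the two steps together (a usage example, not new configuration):

  ```bash
  # Reload the shell configuration (or open a new terminal) so AIRFLOW_HOME is set,
  # then start Airflow in standalone mode through pixi.
  source ~/.zshrc            # or ~/.bashrc / ~/.bash_profile, depending on your system
  echo "$AIRFLOW_HOME"       # should print .../spatialdata-data-converter/airflow
  pixi run airflow standalone
  ```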
- Manually change the configuration. Set:
  `load_examples = False`
  `page_size = 100`
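  To locate the two keys in the generated file before editing them by hand (assuming `airflow standalone` has already created it):

  ```bash
  # Show the current values and their line numbers in the Airflow configuration.
  grep -n -E "^(load_examples|page_size) *= *" "$AIRFLOW_HOME/airflow.cfg"
  ```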
- Open the dashboard. In your browser, go to `http://localhost:8080` to access the Airflow webserver. If it doesn't work, check the terminal output for the correct port.
- Login. For username and password, follow the instructions printed in the terminal: search for "standalone | Password for the admin user has been previously generated in".
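  If you can't find that line in the terminal scrollback, recent Airflow versions also write the generated password to a file inside `$AIRFLOW_HOME` (the exact filename may differ between Airflow versions):

  ```bash
  # Airflow 2.x standalone typically stores the auto-generated admin password here.
  cat "$AIRFLOW_HOME/standalone_admin_password.txt"
  ```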
- The previous version of the repo used a serial executor for Airflow (i.e. no two tasks could run in parallel). The current default configuration allows parallel execution. I may have spotted some problems during IO operations due to this, but I am still investigating. Worst case, we can sacrifice some performance and go back to serial execution (see the example below).
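  If serial execution turns out to be necessary, Airflow can be switched back to the `SequentialExecutor`, either in `airflow.cfg` or via the standard `AIRFLOW__<SECTION>__<KEY>` environment-variable override (this is the generic Airflow mechanism, not something specific to this repo):

  ```bash
  # Force serial task execution for this Airflow instance; the environment variable
  # overrides the executor setting in the [core] section of airflow.cfg.
  export AIRFLOW__CORE__EXECUTOR=SequentialExecutor
  pixi run airflow standalone
  ```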