Large datasets, mostly CSV files, are currently fetched directly from Git LFS, which incurs significant Git LFS bandwidth costs.
Fetching these datasets as pre-compressed release assets will reduce download time and eliminate most GitHub Git LFS bandwidth costs. Thanks to @jvanulde for the idea and @DamonU2 for the pioneering work.
This approach, I think, is easier to implement and maintain, and thus more robust and less error-prone, than my earlier unimplemented "XZ-compressed copies of repos" idea.
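For concreteness, here is a minimal sketch of the release-asset approach, assuming xz-compressed assets. The repo, tag, and asset names below are placeholders, not actual OpenDRR release artifacts:

```bash
# Minimal sketch: download one pre-compressed release asset.
# REPO, TAG, and ASSET are placeholders, not real artifacts.
REPO="OpenDRR/openquake-inputs"
TAG="v1.0.0"
ASSET="exposure.csv.xz"

# Release assets are plain static downloads, so fetching them does
# not draw on the Git LFS bandwidth quota.
curl -L -o "$ASSET" "https://github.com/$REPO/releases/download/$TAG/$ASSET"

xz -d "$ASSET"   # leaves the decompressed exposure.csv
```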
Data source repos:
- OpenDRR/openquake-inputs
- OpenDRR/model-inputs
- OpenDRR/canada-srm2
- OpenDRR/earthquake-scenarios
Scripts that fetch from these repos include (but may not be limited to):
- python/add_data.sh (OpenDRR/opendrr-api)
- scripts/DSRA_outputs2postgres_lfs.py (OpenDRR/model-factory)
See, for example, these commands found in add_data.sh (a release-asset-first variant is sketched after the list):
- `fetch_csv openquake-inputs ...`
- `fetch_csv model-inputs ...`
- `curl -L https://api.github.com/repos/OpenDRR/canada-srm2/contents/cDamage/output?ref=tieg_natmodel2021`
- `curl -L https://api.github.com/repos/OpenDRR/earthquake-scenarios/contents/FINISHED`
- `python3 DSRA_outputs2postgres_lfs.py --dsraModelDir=$DSRA_REPOSITORY --columnsINI=DSRA_outputs2postgres.ini --eqScenario="$eqscenario"`
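Here is a hedged sketch of how such a fetch could prefer release assets and fall back to Git LFS. The function name, release tag, and URL layout are assumptions for illustration; the real fetch_csv implementation lives in add_data.sh:

```bash
# Hypothetical sketch, not the actual add_data.sh implementation:
# try a pre-compressed release asset first, then fall back to Git LFS.
fetch_csv_from_release () {
  local repo="$1" path="$2" tag="${3:-data-release}"  # tag is a placeholder
  local name
  name="$(basename "$path")"
  local url="https://github.com/OpenDRR/$repo/releases/download/$tag/$name.xz"

  if curl -fsSL -o "$name.xz" "$url"; then
    xz -d "$name.xz"   # yields the plain CSV "$name"
  else
    # Fallback: fetch through Git LFS as today (incurs LFS bandwidth).
    curl -fsSL -O "https://github.com/OpenDRR/$repo/raw/master/$path"
  fi
}

# Usage (hypothetical path):
# fetch_csv_from_release openquake-inputs exposure/some_table.csv
```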
XZ or Zstd compression? (compressed file sizes vs. decompression speed)
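One way to inform that choice is a quick local benchmark on a representative CSV; this is a measurement sketch only, and the file name is a placeholder:

```bash
# Compare compressed size and decompression speed for one CSV.
FILE="sample.csv"

xz   -9  -k "$FILE"              # writes sample.csv.xz, keeps original
zstd -19 "$FILE" -o "$FILE.zst"  # writes sample.csv.zst

ls -l "$FILE.xz" "$FILE.zst"     # compare compressed sizes

time xz   -dc "$FILE.xz"  > /dev/null   # xz decompression speed
time zstd -dc "$FILE.zst" > /dev/null   # zstd decompression speed
```

In general, xz tends to produce smaller files at its higher levels, while Zstandard decompresses considerably faster, so the trade-off depends on whether release-asset size or data-loading time matters more here.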