…der experiment/scenario set, remove some defaults that lead to unintuitive results, add some failure points where needed, add naive approach for scenario handling
I haven't run the tests yet, and I can already see that the linters are failing here (they passed locally for me, though). I just wanted to make sure that you have access to the latest version and can look into the changes. I also think it would be good to merge the current changes before I continue adapting the downloader further. With those changes, all my CMIP6 test cases are now working (6/6), so that's already a big step :D
…sgf.py. split up raw and model vars. remove unused constants.
…g. update attribute handling of class. rewrite some if-else blocks. unify model and raw input vars handling. update constants. rename emission handling funcs. add comments for attributes in downloader class.
Pushed my updates for review/testing.
Let me know what you think and if you find any bugs, typos, or problems!
The code all looks good, thank you so much! Here is a summary of my tests: Data can be retrieved from:
I will add a commit to fix a typo in the CO2 configs and to update the configs for future reference to the issues listed here. Below are the issues I have found; I will move them into separate issues and then delete them from this comment.
Issue 1 (highest priority): bc_historical.yaml downloads all SSP scenario data (plus historical data) instead of only historical data. The same happens for bc_ssp.yaml (all scenarios instead of just the one SSP scenario), and for the CH4, SO2, and CO2 config files.
Issue 2: Make sure failed data retrieval (result length 0, or no overlap between requested and available ensemble members) does not fail silently.
Issue 3: Result length is 0 even when all inputs say the data should be available?
Issue 4: Try out the abstract downloader classes by setting them up for omip and cmip6plus (looking at the code, it should be really easy now!).
Issue 5 (far in the future): Add year span in downloader.
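For Issue 2, one way to make these failures loud is a small guard that raises instead of returning an empty selection. This is only a sketch: `EmptySearchError`, `select_ensemble_members`, and the argument names are hypothetical and not taken from the downloader.

```python
class EmptySearchError(RuntimeError):
    """Raised when a search yields nothing usable (hypothetical name)."""


def select_ensemble_members(requested, available, num_results):
    """Return the requested ensemble members that are actually available.

    Raises instead of returning an empty list, so a failed retrieval
    cannot pass silently.
    """
    if num_results == 0:
        raise EmptySearchError("Search returned 0 results; check the constraints.")
    available_set = set(available)
    # Preserve the order in which the members were requested.
    overlap = [member for member in requested if member in available_set]
    if not overlap:
        raise EmptySearchError(
            f"No overlap between requested {sorted(requested)} "
            f"and available {sorted(available)} ensemble members."
        )
    return overlap
```

The caller can then decide whether to abort the whole run or to log the error and continue with the next config, but it can no longer miss the failure.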
…i-value querying
- Added `esgpull` to `pyproject.toml` as part of the overarching task to implement an async `esgpull` downloader client.
- Refactored `climateset/download/constraints.py` to support native multi-value lists seamlessly compatible with `esgpull.models.Query(selection=...)`.
- Added a `to_esgpull_query()` method while retaining the original `to_esgf_params()` boundary, avoiding breakage of `esgf-pyclient` dependent logic.
- Updated `test_constraints.py` with corresponding multi-value list tests.
- Marked task 01 as completed.
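As a rough illustration of the two output forms described in this commit, the sketch below keeps multi-value facets as lists for the new path and flattens them for the legacy path. The `Constraints` class, its `facets` field, and the comma-joined legacy format are assumptions made for the example; the actual code in `constraints.py` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Constraints:
    """Hypothetical stand-in for the search constraints class."""

    facets: dict = field(default_factory=dict)

    @staticmethod
    def _as_list(value):
        # Normalise a scalar or a list/tuple to a list of strings.
        if isinstance(value, (list, tuple)):
            return [str(v) for v in value]
        return [str(value)]

    def to_esgpull_query(self):
        """Selection mapping that keeps multi-value facets as lists."""
        return {k: self._as_list(v) for k, v in self.facets.items() if v is not None}

    def to_esgf_params(self):
        """Legacy form: one comma-joined string per facet (assumed format)."""
        return {
            k: ",".join(self._as_list(v))
            for k, v in self.facets.items()
            if v is not None
        }
```

Keeping both methods on one object means the facet values are defined once and each client gets them in the shape it expects.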
- Created isolated_esgpull_context context manager in climateset/download/utils.py to prevent SQLite file lock collisions during parallel execution.
- Configured Esgpull to initialize locally within a unique UUID path inside RAW_DATA/.esgpull_jobs.
- Wrapped initialization and execution in a try/finally block with shutil.rmtree to ensure ephemeral state is securely purged after use.
- Added test_isolated_esgpull_context in tests/test_download/test_utils.py to assert behavior.
- Marked task 02 as completed.
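The isolation pattern in this commit can be sketched with the standard library alone. The function name `isolated_job_dir` and its argument are hypothetical, and the Esgpull initialisation itself is left out; only the unique directory and the try/finally cleanup are shown.

```python
import shutil
import uuid
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def isolated_job_dir(raw_data_root):
    """Yield a unique, throwaway job directory and always remove it."""
    job_dir = Path(raw_data_root) / ".esgpull_jobs" / uuid.uuid4().hex
    job_dir.mkdir(parents=True, exist_ok=False)
    try:
        # The real context manager would initialise Esgpull inside job_dir here.
        yield job_dir
    finally:
        # Purge the ephemeral state even if the body raised.
        shutil.rmtree(job_dir, ignore_errors=True)
```

Because every job gets its own directory, two parallel jobs never share a database file, which is what avoids the lock collisions.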
…ntract
- Created EsgpullDownloader mimicking utils.py functions to perform search via esgpull.
- Replaced iterative fallback logic with efficient hints discovery queries mapped onto esgpull.models.Query.
- Integrated missing dynamically configured esgpull facet properties (version, target_mip) via Selection.configure().
- Implemented option parsing (distrib, latest) ensuring robust bulk constraints execution.
- Covered context logic and mock validations thoroughly via test_esgpull_downloader.py.
- Marked task 03 as completed.
- Added test_esgpull_downloader_integration_search to perform a real search on ESGF nodes.
- Validated that results successfully fetch and parse esgpull.models.File items.
- Asserted Dataset ID mapping properties (model, experiment, variable) accurately returned matching properties.
- Implemented async download runner _download_and_move_files for extracting queries natively via asyncio.
- Transferred resulting .nc files utilizing shutil.move() from the localized internal isolated UUID DB directly into target RAW_DATA project-specific locations.
- Safely integrated esg.db.add tracking models bypassing previously dependent ESGF bash THREDDS downloads.
- Validated via rigorous patches mocking asyncio downloads and ensuring mapping matches correctly via test_esgpull_downloader.py.
- Verified file integration structure fully correctly during E2E assertions resolving the Task 4 goal.
- Marked task 04 as completed.
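The download-then-move flow described here might look roughly like the sketch below. `fake_download` is a placeholder for the real async download, and `download_and_move` is a hypothetical name, not the actual `_download_and_move_files` implementation.

```python
import asyncio
import shutil
from pathlib import Path


async def fake_download(job_dir, name):
    """Stand-in for the real async download (hypothetical)."""
    await asyncio.sleep(0)
    path = Path(job_dir) / name
    path.write_bytes(b"netcdf placeholder")
    return path


async def download_and_move(job_dir, target_dir, names):
    """Download all files concurrently, then move them into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # gather() returns results in the same order as the input coroutines.
    downloaded = await asyncio.gather(*(fake_download(job_dir, n) for n in names))
    moved = []
    for src in downloaded:
        dst = target_dir / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

Moving the files only after all downloads have finished keeps the target directory free of partial files if one download fails.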
- Implemented real-search automated tests for EsgpullDownloader, fulfilling the mandate to never mock the search querying phase.
- Added global AsyncMock intercept for Esgpull.download to prevent data bandwidth usage in CI while allowing end-to-end flow verification.
- Performed manual verification of storage independence (using .esgpull_jobs UUID isolation) and subprocess elimination (async native downloads).
- Confirmed backward compatibility with existing esgf-pyclient downloader tests.
- Refined iterative search logic by removing problematic nominal_resolution constraint which caused empty overlaps in esgpull.
- Marked task 05 as completed.
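The testing approach in this commit (search left unmocked, download intercepted) can be illustrated with `unittest.mock.AsyncMock`. `FakeClient` and its methods are hypothetical stand-ins; the actual tests patch `Esgpull.download`, whose signature is not reproduced here.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeClient:
    """Hypothetical stand-in for the download client."""

    def search(self, variable):
        # The search phase is left untouched by the patch.
        return [f"{variable}_file.nc"]

    async def download(self, files):
        raise RuntimeError("would use network bandwidth")


async def run_flow(client, variable):
    """Search for files, then hand them to the (possibly mocked) download."""
    files = client.search(variable)
    return await client.download(files)


def run_with_mocked_download(variable):
    """Run the flow with only the download coroutine replaced."""
    with patch.object(FakeClient, "download", new_callable=AsyncMock) as mocked:
        mocked.return_value = "downloaded"
        result = asyncio.run(run_flow(FakeClient(), variable))
    return result, mocked
```

The mock records what the search phase passed to it, so a test can still assert on the end-to-end hand-off without transferring any data.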
# Conflicts:
#	.github/workflows/lint.yml
#	.github/workflows/precommit.yml
#	.make/base.make
#	Makefile.private.example
#	noxfile.py
Right now, the previous py-esgf client has been refactored and still works, but it has problems with some sources (such as biomass-burning). A new implementation using …
Some more review is needed on my part, and I still need to add information about the new client and how to use it.
I am creating this pull request so that the branches do not diverge too much over time.
Current state:
The other future_cases have not been implemented yet and do not work. Please ignore those.
Next immediate steps: (Input4Mips handling)
Other next steps:
Backlog: