Outliers detection-imputation Standardization#50
Conversation
…and pipeline; add reporting functionality and configuration management
| ) | ||
|
|
||
|
|
||
| def _materialize_standard_outputs_from_legacy( |
There was a problem hiding this comment.
Hey Claude,
I don't think it's a good approach to add data transformation logic in Python on top of the R script output. This splits the computation logic across two languages. In the case of this Python pipeline, it's responsibility should only be to: 1. Pass parameters to the R script, 2. Execute the R notebook and 3. Load results into the dataset.
By adding transformations in Python, we're mixing computation logic between languages. This makes it harder for data scientists to understand, modify, or debug the full computation, since they'd need to look at both the R script and the Python code to see what's actually being produced.
I think all formatting and transformation logic should stay within the R code, keeping the computation self-contained in one code.
There was a problem hiding this comment.
Lets make it simple by using a bool parameter "Complete: True/False".
By default is Partial, with the option to execute complete if selected.
There was a problem hiding this comment.
By using a bool parameter, we can get rid of all this python checking code.
mode_clean = (mode or "partial").strip().lower() if mode_clean not in ("partial", "complete"): raise ValueError('mode must be "partial" or "complete".') run_mg_partial = True run_mg_complete = mode_clean == "complete"
We only need run_mg_complete True or False.
NOTE: Probably this will necessarily introduce changes in the way we inject and how the parameters are used in the computation R notebook.
| # Fallback for legacy MG notebooks still writing legacy file names. | ||
| legacy_data_path = root_path / "data" / "dhis2" / "outliers_detection" | ||
| legacy_candidates = [ | ||
| data_path / f"{country_code}_flagged_outliers_magic_glasses.parquet", |
There was a problem hiding this comment.
Link to my previous commnet:
We should not produce or read this file anymore:
f"{country_code}_flagged_outliers_magic_glasses.parquet"
The outputs are directly produced by the R notebook:
data_path / f"{country_code}_routine_outliers_detected.parquet", data_path / f"{country_code}_routine_outliers_imputed.parquet", data_path / f"{country_code}_routine_outliers_removed.parquet",
| input_params = { | ||
| "ROOT_PATH": Path(workspace.files_path).as_posix(), | ||
| "RUN_MAGIC_GLASSES_PARTIAL": run_mg_partial, | ||
| "RUN_MAGIC_GLASSES_COMPLETE": run_mg_complete, |
There was a problem hiding this comment.
I think we need to standardize the way we provide the information in the parameters file , so the other pipelines know how to handle the information depending of the method being used (if needed).
I propose to define after the root_path a standard node for all outliers methods called OUTLIERS_METHOD: so in this case:
OUTLIERS_METHOD: "MG_COMPLETE" ## or "MG_PARTIAL", "IQR", "MEAN"... etc
This PR is about outiers detection pipelines output standardization.
please find the ticket here