Skip to content

Outliers detection-imputation Standardization#50

Open
claude-marie wants to merge 10 commits intomainfrom
SNT25-384_outliers
Open

Outliers detection-imputation Standardization#50
claude-marie wants to merge 10 commits intomainfrom
SNT25-384_outliers

Conversation

@claude-marie
Copy link
Contributor

This PR is about outiers detection pipelines output standardization.
please find the ticket here

)


def _materialize_standard_outputs_from_legacy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Claude,
I don't think it's a good approach to add data transformation logic in Python on top of the R script output. This splits the computation logic across two languages. In the case of this Python pipeline, it's responsibility should only be to: 1. Pass parameters to the R script, 2. Execute the R notebook and 3. Load results into the dataset.

By adding transformations in Python, we're mixing computation logic between languages. This makes it harder for data scientists to understand, modify, or debug the full computation, since they'd need to look at both the R script and the Python code to see what's actually being produced.

I think all formatting and transformation logic should stay within the R code, keeping the computation self-contained in one code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make it simple by using a bool parameter "Complete: True/False".
By default is Partial, with the option to execute complete if selected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using a bool parameter, we can get rid of all this python checking code.
mode_clean = (mode or "partial").strip().lower() if mode_clean not in ("partial", "complete"): raise ValueError('mode must be "partial" or "complete".') run_mg_partial = True run_mg_complete = mode_clean == "complete"

We only need run_mg_complete True or False.

NOTE: Probably this will necessarily introduce changes in the way we inject and how the parameters are used in the computation R notebook.

# Fallback for legacy MG notebooks still writing legacy file names.
legacy_data_path = root_path / "data" / "dhis2" / "outliers_detection"
legacy_candidates = [
data_path / f"{country_code}_flagged_outliers_magic_glasses.parquet",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to my previous commnet:

We should not produce or read this file anymore:
f"{country_code}_flagged_outliers_magic_glasses.parquet"

The outputs are directly produced by the R notebook:
data_path / f"{country_code}_routine_outliers_detected.parquet", data_path / f"{country_code}_routine_outliers_imputed.parquet", data_path / f"{country_code}_routine_outliers_removed.parquet",

input_params = {
"ROOT_PATH": Path(workspace.files_path).as_posix(),
"RUN_MAGIC_GLASSES_PARTIAL": run_mg_partial,
"RUN_MAGIC_GLASSES_COMPLETE": run_mg_complete,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to standardize the way we provide the information in the parameters file , so the other pipelines know how to handle the information depending of the method being used (if needed).

I propose to define after the root_path a standard node for all outliers methods called OUTLIERS_METHOD: so in this case:
OUTLIERS_METHOD: "MG_COMPLETE" ## or "MG_PARTIAL", "IQR", "MEAN"... etc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants