
1558 csv data reader #1640

Open

alexfurmenkov wants to merge 27 commits into main from 1558-csv-data-reader

Conversation

@alexfurmenkov (Collaborator) commented Feb 26, 2026

Added support for data in .csv format.
Implemented params reading from .env files - based on .env in data folder.

@alexfurmenkov marked this pull request as ready for review March 2, 2026 08:56
@SFJohnson24 (Collaborator) left a comment:
This looks good so far -- can you see my messages on Slack with Gerry re: the .env, and take that into account for the CSV runs as well? https://github.com/cdisc-org/cdisc-rules-engine/issues/1559 My parsing script matches your reader logic -- there is a description of the .env structure there.
It should get the standard/version/substandard/CT/define.xml path from the .env.
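For illustration, a .env of the shape being discussed might look like the following; every key name here is hypothetical, since the authoritative structure is described in issue #1559:

```ini
# Hypothetical key names -- see issue #1559 for the real structure
STANDARD=sdtmig
STANDARD_VERSION=3-4
SUBSTANDARD=
CT_PACKAGE=sdtmct-2023-12-15
DEFINE_XML_PATH=./data/define.xml
```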

@alexfurmenkov alexfurmenkov linked an issue Mar 5, 2026 that may be closed by this pull request
raise NotImplementedError

def from_file(self, file_path):
with open(file_path, "r", encoding=self.encoding) as fp:
Collaborator:

We will need encoding + general error handling wrapping here, similar to the JSON reader.

first_record = {}

with open(self.file_path, encoding=self.encoding) as f:
dataset_length = sum(1 for _ in f) - 1 # subtract header
Collaborator:

This can result in dataset_length = -1 for empty CSV files. For those cases it should be zero, so we should clamp it to zero.

description = "JSON data is malformed."


class InvalidCSVFormat(EngineError):
Collaborator:

I think a better name for this would be InvalidCSVFile instead of InvalidCSVFormat, because this error is raised when the CSV is malformed or decoding fails, which makes it an invalid CSV file. What do you think?

Collaborator:

I think for CSV files passed via the dataset-path -dp flag we will need some validation that the user has not given a metadata CSV. The --data path handles this, but the -dp path just processes the file directly; it would be better to have a check here.
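The check could look something like this (`validate_dataset_path` is a hypothetical helper name; the metadata filenames come from the tables.csv / variables.csv convention discussed in this PR):

```python
from pathlib import Path

# tables.csv / variables.csv describe datasets; they are not datasets themselves
METADATA_FILES = {"tables.csv", "variables.csv"}

def validate_dataset_path(path):
    """Reject metadata CSVs passed via -dp (hypothetical helper sketching
    the requested check; comparison is case-insensitive)."""
    if Path(path).name.lower() in METADATA_FILES:
        raise ValueError(
            f"{path} is a metadata file, not a dataset; pass a dataset CSV instead"
        )
    return path
```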

else:
chunk.to_parquet(temp_file.name, engine="fastparquet", append=True)

return num_rows, temp_file.name
Collaborator:

This function always returns a file path, but the file is only written when the CSV is not empty. An empty CSV can cause it to return a path to a file that has not been created, which may cause downstream errors.

Collaborator (Author):

Added empty parquet file creation for the case when the CSV was empty.
Or should we raise a ValueError in this scenario? Should I also fix it in the XPT reader?

Collaborator:

Please update this to add csv here as well, since we will now be supporting csv too.

core.py Outdated
p for p in paths if p.name.lower() not in ("tables.csv", "variables.csv")
]

if not tables_path:
Collaborator:

This should return an error and cancel execution if the 2 metadata files are not found. It should also ensure that only one of each is present.

Collaborator (Author):

What do you mean by that? How can two files with the same name sit in the same directory?

And if we use the -dp parameter, for example

-dp datasets/dm.csv
-dp datasets_other/lb.csv

should we check that both of the directories have tables.csv, or how should we cover this scenario? As I remember, variables.csv is not necessary (if not required by the rule), but should we also check it?

@alexfurmenkov (Collaborator, Author) commented Mar 23, 2026:

I think we can treat this situation like multiple -d flags, so when we provide

-dp datasets/dm.csv
-dp datasets_other/lb.csv

we should expect at least these files to exist:

datasets/dm.csv
datasets/tables.csv
datasets/variables.csv

and

datasets_other/lb.csv
datasets_other/variables.csv
core.py Outdated
tables_df = pd.read_csv(tables_path, encoding=encoding)

if "Filename" not in tables_df.columns:
return [str(p) for p in dataset_files if p.suffix.lower() == ".csv"]
Collaborator:

I also don't think we should be falling back if tables.csv is incorrect -- it seems this will just cause downstream problems. We should be throwing an InvalidCSVFile error telling the user their metadata file is malformed, halting execution.

…d .env file search in case of different -dp folders
fixed tests due to stricter tables.csv validation.
updated readme with new arguments
file_metadata["path"],
file_name,
encoding=self.encoding,
variables_csv_path=self.variables_csv_path,
Collaborator:

This line is causing an error in the test suite:
TypeError: DatasetXPTMetadataReader.__init__() got an unexpected keyword argument 'variables_csv_path'
This should be conditional: if it is the CSV reader, pass the CSV files to it; otherwise just the others.
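One way to make it conditional without hard-coding reader types is to inspect the constructor signature (helper name hypothetical; the engine may prefer an explicit type check instead):

```python
import inspect

def build_reader_kwargs(reader_cls, encoding, variables_csv_path=None):
    """Only pass variables_csv_path to readers whose __init__ accepts it,
    avoiding the TypeError raised by DatasetXPTMetadataReader."""
    kwargs = {"encoding": encoding}
    params = inspect.signature(reader_cls.__init__).parameters
    if "variables_csv_path" in params:
        kwargs["variables_csv_path"] = variables_csv_path
    return kwargs
```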

@SFJohnson24 (Collaborator) left a comment:

I think this is missing a .env argument for core.py, per @gerrycampion: 'I think if we use -dp, we should expect an env arg as well. If it is not provided, don't throw an error, but use the env from the program directory like we currently do.' You have the defaulting logic, but there is no .env validation argument.


@alexfurmenkov (Collaborator, Author) replied to @SFJohnson24, quoting ".env validation argument":

I've added a load_custom_dotenv function that is called when the --dotenv-path param is passed, and did not remove the original load_dotenv call but modified it to use the dotenv_path parameter. Can you please clarify what kind of validation is expected? Something like this?

    elif dataset_path:
        if not Path(dotenv_path).exists():
            load_dotenv()
        else:
            load_dotenv(dotenv_path)
        ...



Development

Successfully merging this pull request may close these issues.

Support CSV datasets in engine
