Conversation
SFJohnson24
left a comment
This looks good so far -- can you see the messages on Slack between me and Gerry re: the .env, and take that into account for the CSV runs as well? https://github.com/cdisc-org/cdisc-rules-engine/issues/1559 My parsing script matches your reader logic -- there is a description of the .env structure there.
It would want to get the standard/version/substandard/CT/define-XML path from the .env.
…-dp, -d] parameters to run validation.
    raise NotImplementedError
    ...
    def from_file(self, file_path):
        with open(file_path, "r", encoding=self.encoding) as fp:
We will need encoding + general error handling wrapping here, similar to the JSON reader.
    first_record = {}
    ...
    with open(self.file_path, encoding=self.encoding) as f:
        dataset_length = sum(1 for _ in f) - 1  # subtract header
This can result in dataset_length = -1 for empty CSV files; in those cases it should be zero, so we should clamp it to zero.
    description = "JSON data is malformed."
    ...
    class InvalidCSVFormat(EngineError):
I think a better name for this would be InvalidCSVFile instead of InvalidCSVFormat, because this error fires both when the CSV is malformed and when encoding fails -- either way it is an invalid CSV file. What do you think?
I think for CSV files routed through the dataset-path (-dp) flag we will need some validation that the user has not given a metadata CSV. The --data path handles this, but the -dp path just processes the file directly; it would be better to have a check here.
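A sketch of such a check, assuming the metadata file names used elsewhere in this PR (tables.csv, variables.csv); the function name and error type are illustrative:

```python
from pathlib import Path

# metadata file names assumed from this PR's conventions
METADATA_FILENAMES = ("tables.csv", "variables.csv")


def check_not_metadata_csv(dataset_path):
    """Reject a -dp argument that points at a metadata CSV instead of data."""
    if Path(dataset_path).name.lower() in METADATA_FILENAMES:
        raise ValueError(
            f"{dataset_path} is a metadata file, not a dataset; "
            "it should not be passed via -dp"
        )
```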
    else:
        chunk.to_parquet(temp_file.name, engine="fastparquet", append=True)
    ...
    return num_rows, temp_file.name
This function always returns a file path, but the file is only written when the CSV is not empty. For an empty CSV it can therefore return a path to a file that was never created, which may cause downstream errors.
Added empty parquet file creation for the case when the CSV is empty.
Or should we raise a ValueError in this scenario? Should I also fix it in the XPT reader?
Please update this to include CSV here as well, since we will now be supporting CSV too.
core.py
        p for p in paths if p.name.lower() not in ("tables.csv", "variables.csv")
    ]
    ...
    if not tables_path:
This should return an error and cancel execution if the two metadata files are not found. It should also ensure that only one of each is present.
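One way to enforce "exactly one of each" per dataset directory, with illustrative names (the real check would plug into the engine's error handling rather than raise ValueError):

```python
from pathlib import Path


def find_metadata_file(paths, name):
    """Require exactly one metadata file with the given name among the paths
    for a dataset directory; anything else should halt execution."""
    matches = [p for p in paths if Path(p).name.lower() == name]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {name}, found {len(matches)}")
    return matches[0]
```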
What do you mean by that? How could two files with the same name sit in the same directory?
And if we use the -dp parameter, for example

    -dp datasets/dm.csv
    -dp datasets_other/lb.csv

should we check that both directories have a tables.csv, or how should we cover this scenario? As I remember, variables.csv is not necessary (if not required by the rule), but should we also check for it?
I think we can treat this situation like multiple -d flags.
So in the situation where we provide

    -dp datasets/dm.csv
    -dp datasets_other/lb.csv

we should expect at least these files to exist:

    datasets_1/dm.csv
    datasets_1/tables.csv
    datasets_1/variables.csv

and

    datasets_2/lb.csv
    datasets_2/variables.csv
core.py
    tables_df = pd.read_csv(tables_path, encoding=encoding)
    ...
    if "Filename" not in tables_df.columns:
        return [str(p) for p in dataset_files if p.suffix.lower() == ".csv"]
I also don't think we should be falling back if tables.csv is incorrect -- that will just cause downstream problems. We should throw an InvalidCSVFile error telling the user their metadata file is malformed, and halt execution.
…d .env file search in case of different -dp folders
Fixed tests due to stricter tables.csv validation; updated the README with the new arguments.
    file_metadata["path"],
    file_name,
    encoding=self.encoding,
    variables_csv_path=self.variables_csv_path,
This line is causing an error in the test suite:
TypeError: DatasetXPTMetadataReader.__init__() got an unexpected keyword argument 'variables_csv_path'
This should be conditional: if it is the CSV reader, pass the CSV files to it; otherwise just the others.
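One way to build that conditional: inspect the reader's `__init__` signature instead of hard-coding reader classes. This is a variant of the suggested check, not the project's actual API; the helper name is illustrative:

```python
import inspect


def build_reader_kwargs(reader_cls, encoding, variables_csv_path=None):
    """Pass variables_csv_path only to readers whose __init__ accepts it,
    avoiding the TypeError raised by readers like DatasetXPTMetadataReader."""
    kwargs = {"encoding": encoding}
    if "variables_csv_path" in inspect.signature(reader_cls.__init__).parameters:
        kwargs["variables_csv_path"] = variables_csv_path
    return kwargs
```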
SFJohnson24
left a comment
I think this is missing a .env argument for core.py. Per @gerrycampion: 'I think if we use -dp, we should expect an env arg as well. If it is not provided, don't throw an error, but use the env from the program directory like we currently do.' You have the defaulting logic, but there is no .env validation argument.
I've added:

    elif dataset_path:
        if not Path(dotenv_path).exists():
            load_dotenv()
        else:
            load_dotenv(dotenv_path)
        ...
Added support for data in .csv format.
Implemented parameter reading from .env files, based on the .env in the data folder.