
1558 csv data reader #1640

Open

alexfurmenkov wants to merge 27 commits into main from 1558-csv-data-reader

Conversation

@alexfurmenkov (Collaborator) commented Feb 26, 2026

Added support for data in .csv format.
Implemented params reading from .env files - based on .env in data folder.

@alexfurmenkov marked this pull request as ready for review March 2, 2026 08:56
@SFJohnson24 (Collaborator) left a comment:
This looks good so far -- can you see my messages on Slack with Gerry re: the .env, and take that into account for the CSV runs as well? https://github.com/cdisc-org/cdisc-rules-engine/issues/1559 My parsing script matches your reader logic -- there is a description of the .env structure there.
It should get the standard/version/substandard/CT/define.xml path from the .env.
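For illustration, a .env of the shape being discussed might look like the following; every key name here is hypothetical, since the authoritative structure is described in issue #1559:

```ini
# Hypothetical key names -- see issue #1559 for the real structure
STANDARD=sdtmig
STANDARD_VERSION=3-4
SUBSTANDARD=
CT_PACKAGE=sdtmct-2023-12-15
DEFINE_XML_PATH=./data/define.xml
```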

@alexfurmenkov alexfurmenkov linked an issue Mar 5, 2026 that may be closed by this pull request
raise NotImplementedError

def from_file(self, file_path):
with open(file_path, "r", encoding=self.encoding) as fp:
Collaborator:

We will need encoding + general error handling wrapping here, similar to the JSON reader.

first_record = {}

with open(self.file_path, encoding=self.encoding) as f:
dataset_length = sum(1 for _ in f) - 1 # subtract header
Collaborator:

This can result in dataset_length = -1 for empty CSV files. For those cases it should be zero, so we should clamp it to zero.

description = "JSON data is malformed."


class InvalidCSVFormat(EngineError):
Collaborator:

I think a better name for this would be InvalidCSVFile instead of InvalidCSVFormat, because this error is raised when the CSV is malformed or decoding fails, which makes it an invalid CSV file. What do you think?

Collaborator:

I think for CSV files passed via the dataset-path -dp flag we will need some validation that the user has not given a metadata CSV. The --data path handles this, but the -dp path just processes the file directly; it would be better to have a check here.
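The check could look something like this (`validate_dataset_path` is a hypothetical helper name; the metadata filenames come from the tables.csv / variables.csv convention discussed in this PR):

```python
from pathlib import Path

# tables.csv / variables.csv describe datasets; they are not datasets themselves
METADATA_FILES = {"tables.csv", "variables.csv"}

def validate_dataset_path(path):
    """Reject metadata CSVs passed via -dp (hypothetical helper sketching
    the requested check; comparison is case-insensitive)."""
    if Path(path).name.lower() in METADATA_FILES:
        raise ValueError(
            f"{path} is a metadata file, not a dataset; pass a dataset CSV instead"
        )
    return path
```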

else:
chunk.to_parquet(temp_file.name, engine="fastparquet", append=True)

return num_rows, temp_file.name
Collaborator:

This function always returns a file path, but the file is only written when the CSV is not empty. An empty CSV can cause it to return a path to a file that has not been created, which may cause downstream errors.

Collaborator (Author):

Added empty parquet file creation for the case when the CSV was empty.
Or should we raise a ValueError in this scenario? Should I also fix it in the XPT reader?

Collaborator:

Please update this to add csv here as well, since we will now be supporting csv too.

core.py Outdated
p for p in paths if p.name.lower() not in ("tables.csv", "variables.csv")
]

if not tables_path:
Collaborator:

This should return an error and cancel execution if the 2 metadata files are not found. It should also ensure that only one of each is present.

Collaborator (Author):

What do you mean by that? How can two files with the same name sit in the same directory?

And if we use the -dp parameter, for example

-dp datasets/dm.csv
-dp datasets_other/lb.csv

should we check that both of the directories have tables.csv, or how should we cover this scenario? As I remember, variables.csv is not necessary (if not required by the rule), but should we also check it?

@alexfurmenkov (Collaborator, Author) commented Mar 23, 2026:

I think we can treat this situation like multiple -d flags, so when we provide

-dp datasets/dm.csv
-dp datasets_other/lb.csv

we should expect at least these files to exist:

datasets/dm.csv
datasets/tables.csv
datasets/variables.csv

and

datasets_other/lb.csv
datasets_other/variables.csv
core.py Outdated
tables_df = pd.read_csv(tables_path, encoding=encoding)

if "Filename" not in tables_df.columns:
return [str(p) for p in dataset_files if p.suffix.lower() == ".csv"]
Collaborator:

I also don't think we should be falling back if tables.csv is incorrect -- it seems this will just cause downstream problems. We should be throwing an InvalidCSVFile error telling the user their metadata file is malformed, halting execution.

…d .env file search in case of different -dp folders
fixed tests due to stricter tables.csv validation.
updated readme with new arguments
file_metadata["path"],
file_name,
encoding=self.encoding,
variables_csv_path=self.variables_csv_path,
Collaborator:

This line is causing an error in the test suite:
TypeError: DatasetXPTMetadataReader.__init__() got an unexpected keyword argument 'variables_csv_path'
This should be conditional: if it is the CSV reader, pass the CSV files to it; otherwise just the others.
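One way to make it conditional without hard-coding reader types is to inspect the constructor signature (helper name hypothetical; the engine may prefer an explicit type check instead):

```python
import inspect

def build_reader_kwargs(reader_cls, encoding, variables_csv_path=None):
    """Only pass variables_csv_path to readers whose __init__ accepts it,
    avoiding the TypeError raised by DatasetXPTMetadataReader."""
    kwargs = {"encoding": encoding}
    params = inspect.signature(reader_cls.__init__).parameters
    if "variables_csv_path" in params:
        kwargs["variables_csv_path"] = variables_csv_path
    return kwargs
```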

@SFJohnson24 (Collaborator) left a comment:

I think this is missing a .env argument for core.py, per @gerrycampion: 'I think if we use -dp, we should expect an env arg as well. If it is not provided, don't throw an error, but use the env from the program directory like we currently do.' You have the defaulting logic, but there is no .env validation argument.


@alexfurmenkov (Collaborator, Author) replied to @SFJohnson24, quoting ".env validation argument":

I've added a load_custom_dotenv function that is called when the --dotenv-path param is passed, and did not remove the original load_dotenv call but modified it to use the dotenv_path parameter. Can you please clarify what kind of validation is expected? Something like this?

    elif dataset_path:
        if not Path(dotenv_path).exists():
            load_dotenv()
        else:
            load_dotenv(dotenv_path)
        ...



Development

Successfully merging this pull request may close these issues.

Support CSV datasets in engine
