A123CortexM dataset#268

Open
Coerulatus wants to merge 25 commits into main from A123CortexM

Conversation


@Coerulatus (Collaborator) commented Mar 18, 2026

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This PR adds the A123CortexM dataset.
The code was taken from https://github.com/marindigen/TopoBenchForNeuro, but I removed the novel triangle classification tasks.


Note

Medium Risk
Adds a new externally-downloaded dataset pipeline and significantly changes download_file_from_link behavior (streaming, retries, SSL verification control), which could affect all dataset downloads and processing stability.

Overview
Adds support for the Bowen et al. A123 auditory cortex calcium imaging dataset as a new graph classification dataset, including a Hydra dataset config (configs/dataset/graph/a123.yaml), a new A123CortexMDataset that downloads/unzips raw data and converts correlation submatrices into PyG graphs, and a corresponding A123DatasetLoader.
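The conversion from correlation submatrices to graphs can be illustrated with a minimal numpy sketch of the thresholding step. The function name and threshold value are hypothetical; the actual implementation lives in topobench/data/datasets/a123.py and builds PyG Data objects:

```python
import numpy as np

def corr_to_edges(corr, threshold=0.3):
    """Turn a symmetric correlation matrix into an undirected edge list.

    Keeps upper-triangle entries above `threshold`, then mirrors each
    edge so both directions are present. Weights are read directly from
    corr with the final edge order, so they stay aligned with the edges.
    """
    rows, cols = np.triu_indices_from(corr, k=1)  # upper triangle, no self-loops
    mask = corr[rows, cols] > threshold           # keep strong correlations
    r, c = rows[mask], cols[mask]
    edge_index = np.stack([np.concatenate([r, c]), np.concatenate([c, r])])
    edge_attr = corr[edge_index[0], edge_index[1]]
    return edge_index, edge_attr

corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])
edge_index, edge_attr = corr_to_edges(corr, threshold=0.3)
```

In the real dataset, edge_index and edge_attr would then be wrapped in torch tensors and a PyG Data object per submatrix.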

Upgrades download_file_from_link to a streaming, chunked downloader with directory creation, progress reporting, configurable SSL verification and per-chunk timeouts, plus retry/backoff logic; adds MATLAB .mat helpers (collect_mat_files, process_mat, etc.) to io_utils and expands unit tests to cover the new download behavior.
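The retry/backoff behavior described here can be sketched with the network call factored out as a callable, which also makes the logic testable without a server. Function names and parameters below are illustrative, not the actual download_file_from_link signature:

```python
import time

def fetch_with_retries(fetch, max_retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, with exponential backoff between tries.

    fetch: zero-argument callable performing the download attempt.
    Raises the last error if all attempts fail.
    """
    last_err = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError as err:  # network-style failures (requests errors subclass OSError)
            last_err = err
            if attempt < max_retries - 1:
                sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_err

# Simulate a flaky endpoint that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return b"payload"

data = fetch_with_retries(flaky, max_retries=5, sleep=lambda s: None)
```

The real downloader additionally streams the response in chunks, creates parent directories, and exposes SSL verification and per-chunk timeouts as parameters.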

Updates the pipeline test harness to accept dataset/model selection via CLI, and ignores lightning_logs/ in .gitignore.

Written by Cursor Bugbot for commit 5afc5ca. This will update automatically on new commits.

marindigen and others added 23 commits November 19, 2025 12:15
…o check and test the training. To be able to download the dataset, the verify parameter in requests.get() inside download_file_from_link should be set to False. Note also that the run script currently fails to download the data even with verify=False
…ll_to_dict and process_mat. I have also modified download_file_from_link by specifying verify=False in requests.get()
- Implement TriangleClassifier utility for extracting and classifying triangles
- Add create_triangle_classification_task() to generate 7-class role predictions
- Support triangle task via dataset loader (task_type='triangle_task')
- Add config section with triangle task settings (7 classes, 3 features)
- Include comprehensive test suite for triangle classification pipeline

Added triangle common-neighbours classification task: predicting the number of neighbours based on triangle intra-data:

- Dataset creation
- Support triangle task via dataset loader (task_type='triangle_common_task')
- Add config section
- Include comprehensive test suite
… to the datasets (used for the triangle options), as the classes are imbalanced. Ideally, class weights should be added to the model, along with a focal loss
… TriangleClassifier into utils for reusability and created a new class inheriting from it to specialize it for the a123 task
…ion. Move the test class to the appropriate file. Note that the same changes were done in PR #241 (they are duplicated here, as the script wouldn't run otherwise and would require additional adaptation to the old download_file_from_link function).

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.

Bugbot Autofix prepared fixes for all 4 issues found in the latest run.

  • ✅ Fixed: TestPipeline requires args, breaking pytest discovery
    • Removed the required constructor argument from TestPipeline and restored default dataset/model configuration so pytest can collect the class.
  • ✅ Fixed: Edge attributes misaligned after to_undirected sorting
    • Edge weights are now indexed directly from corr using the final to_undirected edge_index order, ensuring correct alignment.
  • ✅ Fixed: Download move logic uses wrong directory name
    • The download reorganization now targets the extracted dataset directory name (Auditory cortex data, with underscore fallback) instead of self.name.
  • ✅ Fixed: Loader uses parent path inconsistent with other loaders
    • The A123 loader now uses root=str(self.root_data_dir) so dataset storage matches AbstractLoader.get_data_dir().

Preview (2951dab532)
diff --git a/test/pipeline/test_pipeline.py b/test/pipeline/test_pipeline.py
--- a/test/pipeline/test_pipeline.py
+++ b/test/pipeline/test_pipeline.py
@@ -1,31 +1,29 @@
 """Test pipeline for a particular dataset and model."""
 
+import argparse
 import sys
-import os
 from pathlib import Path
+
 sys.path.insert(0, str(Path(__file__).parent.parent.parent))
 
 import hydra
+
 from test._utils.simplified_pipeline import run
-import argparse
 
+DATASET = "graph/MUTAG"
+MODELS = ["graph/gcn"]
+
+
 class TestPipeline:
-    """Test pipeline for a particular dataset and model.
+    """Test pipeline for a particular dataset and model."""
 
-    Parameters
-    ----------
-    args : argparse.Namespace
-        Command-line arguments containing dataset and models configuration.
-    """
+    dataset = DATASET
+    models = MODELS
 
-    def __init__(self, args):
-        self.dataset = args.dataset
-        self.models = args.models if isinstance(args.models, list) else [args.models]
-
     def setup_method(self):
         """Setup method."""
         hydra.core.global_hydra.GlobalHydra.instance().clear()
-    
+
     def test_pipeline(self):
         """Test pipeline."""
         with hydra.initialize(config_path="../../configs", job_name="job"):
@@ -34,17 +32,18 @@
                     config_name="run.yaml",
                     overrides=[
                         f"model={MODEL}",
-                        f"dataset={self.dataset}", # IF YOU IMPLEMENT A LARGE DATASET WITH AN OPTION TO USE A SLICE OF IT, ADD BELOW THE CORRESPONDING OPTION
+                        f"dataset={self.dataset}",  # IF YOU IMPLEMENT A LARGE DATASET WITH AN OPTION TO USE A SLICE OF IT, ADD BELOW THE CORRESPONDING OPTION
                         "trainer.max_epochs=2",
                         "trainer.min_epochs=1",
                         "trainer.check_val_every_n_epoch=1",
                         "paths=test",
                         "callbacks=model_checkpoint",
                     ],
-                    return_hydra_config=True
+                    return_hydra_config=True,
                 )
                 run(cfg)
 
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Test Pipeline")
 
@@ -53,18 +52,22 @@
         type=str,
         required=True,
         help="Path to data root directory",
-        choices = ['graph/MUTAG', 'graph/a123']  # ADD YOUR DATASET HERE
+        choices=["graph/MUTAG", "graph/a123"],  # ADD YOUR DATASET HERE
     )
     parser.add_argument(
         "--models",
         type=str,
-        nargs='+',
+        nargs="+",
         required=False,
         default=["graph/gcn"],
         help="Model(s) to use in the pipeline",
     )
     args = parser.parse_args()
 
-    test_pipeline = TestPipeline(args)
\ No newline at end of file
+    test_pipeline = TestPipeline()
+    test_pipeline.dataset = args.dataset
+    test_pipeline.models = (
+        args.models if isinstance(args.models, list) else [args.models]
+    )
     test_pipeline.setup_method()
-    test_pipeline.test_pipeline()
+    test_pipeline.test_pipeline()

diff --git a/topobench/data/datasets/a123.py b/topobench/data/datasets/a123.py
--- a/topobench/data/datasets/a123.py
+++ b/topobench/data/datasets/a123.py
@@ -190,7 +190,9 @@
         os.unlink(path)
 
         # Move files from extracted "Auditory cortex data/" directory to raw_dir
-        downloaded_dir = osp.join(folder, self.name)
+        downloaded_dir = osp.join(folder, dataset_key)
+        if not osp.exists(downloaded_dir):
+            downloaded_dir = osp.join(folder, dataset_key.replace(" ", "_"))
         if osp.exists(downloaded_dir):
             for file in os.listdir(downloaded_dir):
                 src = osp.join(downloaded_dir, file)
@@ -320,13 +322,9 @@
                 edge_index = torch.tensor(edge_index_np, dtype=torch.long)
                 # make undirected
                 edge_index = to_undirected(edge_index)
-                # edge_attr: corresponding corr weights (for both directions, if made undirected)
-                weights = corr[rows, cols]
-                weights = (
-                    np.repeat(weights, 2)
-                    if edge_index.size(1) == weights.size * 2
-                    else weights
-                )
+                # Align edge weights with the final (possibly sorted) edge order.
+                edge_index_np = edge_index.cpu().numpy()
+                weights = corr[edge_index_np[0], edge_index_np[1]]
                 edge_attr = torch.tensor(
                     weights.reshape(-1, 1), dtype=torch.float
                 )

diff --git a/topobench/data/loaders/graph/a123_loader.py b/topobench/data/loaders/graph/a123_loader.py
--- a/topobench/data/loaders/graph/a123_loader.py
+++ b/topobench/data/loaders/graph/a123_loader.py
@@ -91,9 +91,8 @@
         # determine dataset name from parameters, fallback to expected id
         name = self.parameters.data_name
 
-        # root path for dataset: use the parent of root_data_dir since the dataset
-        # constructs its own subdirectory based on name
-        root = str(self.root_data_dir.parent)
+        # root path for dataset: align with AbstractLoader.get_data_dir().
+        root = str(self.root_data_dir)
 
         # Construct dataset; A123CortexMDataset expects (root, name, parameters)
         self.dataset = A123CortexMDataset(


"""Test pipeline for a particular dataset and model."""
def __init__(self, args):
self.dataset = args.dataset
self.models = args.models if isinstance(args.models, list) else [args.models]


TestPipeline requires args, breaking pytest discovery

High Severity

TestPipeline.__init__ now requires an args parameter, but pytest discovers test classes and instantiates them with no constructor arguments. This causes a TypeError when pytest tries to collect any test in this class, effectively breaking the entire test pipeline. Previously, the class had no __init__ and used module-level constants DATASET and MODELS, which allowed pytest to work normally.
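The pytest-compatible pattern the autofix restores keeps configuration in class attributes rather than constructor arguments. A minimal illustration (the attribute values are the defaults from the diff):

```python
class TestPipeline:
    """Pytest-friendly: no __init__, configuration held in class attributes."""
    dataset = "graph/MUTAG"
    models = ["graph/gcn"]

# pytest can collect and instantiate this class with no arguments ...
instance = TestPipeline()

# ... while a CLI entry point can still override the attributes per instance.
instance.dataset = "graph/a123"
```

Class-level defaults give both pytest discovery and the `__main__` CLI path a working configuration.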


)
edge_attr = torch.tensor(
weights.reshape(-1, 1), dtype=torch.float
)


Edge attributes misaligned after to_undirected sorting

Medium Severity

After to_undirected(edge_index) sorts edges via internal coalesce, the edge ordering changes. However, np.repeat(weights, 2) assumes the doubled edges retain the original-then-reversed interleaving. Since to_undirected sorts all edges by (row, col), the resulting edge_attr values are misaligned with their corresponding edges, assigning incorrect correlation weights to most edges.
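The misalignment can be reproduced with numpy alone by mimicking to_undirected's coalesce (add reversed edges, then sort lexicographically). The example data is illustrative:

```python
import numpy as np

corr = np.array([[0.0, 0.9, 0.2],
                 [0.9, 0.0, 0.7],
                 [0.2, 0.7, 0.0]])

# Directed edges in non-sorted order: (1,2) listed before (0,1).
edge_index = np.array([[1, 0],
                       [2, 1]])

# to_undirected: append reversed edges, then coalesce (lexicographic sort).
both = np.concatenate([edge_index, edge_index[::-1]], axis=1)
order = np.lexsort((both[1], both[0]))
undirected = both[:, order]  # columns now (0,1), (1,0), (1,2), (2,1)

# Buggy approach: repeat original weights, assuming the order survived.
naive = np.repeat(corr[edge_index[0], edge_index[1]], 2)

# Fixed approach: index corr with the final edge order.
aligned = corr[undirected[0], undirected[1]]
```

Here `naive` yields [0.7, 0.7, 0.9, 0.9] against the sorted edges, while `aligned` correctly yields [0.9, 0.9, 0.7, 0.7].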


else:
shutil.move(src, dst)
# Delete the extracted top-level directory
shutil.rmtree(downloaded_dir)


Download move logic uses wrong directory name

Medium Severity

The downloaded_dir is constructed as osp.join(folder, self.name) which resolves to <raw_dir>/a123_cortex_m, but the zip file from URL Auditory_cortex_data.zip extracts to a directory like Auditory cortex data/ (matching raw_file_names), not a123_cortex_m/. The osp.exists(downloaded_dir) check will always be False, so the intended file reorganization never happens. The comment on line 192 even says "Move files from extracted 'Auditory cortex data/' directory" but the code checks for self.name instead.
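The fix resolves the extracted directory by its real name, with an underscore fallback, instead of self.name. A self-contained sketch of that lookup (helper name and temp paths are illustrative):

```python
import os
import os.path as osp
import tempfile

def find_extracted_dir(folder, dataset_key):
    """Return the extracted directory, trying the space-separated name
    first and the underscore variant second, as the autofix does."""
    for candidate in (dataset_key, dataset_key.replace(" ", "_")):
        path = osp.join(folder, candidate)
        if osp.isdir(path):
            return path
    return None

# Simulate a zip that extracted to "Auditory cortex data/".
tmp = tempfile.mkdtemp()
os.makedirs(osp.join(tmp, "Auditory cortex data"))

found = find_extracted_dir(tmp, "Auditory cortex data")  # resolves correctly
missing = find_extracted_dir(tmp, "a123_cortex_m")       # the buggy lookup name
```

Checking both name variants makes the reorganization robust to whether the archive uses spaces or underscores in its top-level directory.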



# root path for dataset: use the parent of root_data_dir since the dataset
# constructs its own subdirectory based on name
root = str(self.root_data_dir.parent)


Loader uses parent path inconsistent with other loaders

Medium Severity

root = str(self.root_data_dir.parent) deviates from every other loader in the codebase, which all use root = str(self.root_data_dir). This causes the dataset to store files at a different path than what get_data_dir() (inherited from AbstractLoader) reports. When run.py calls dataset_loader.load(), the returned dataset_dir points to a non-existent directory, which can break downstream processing in PreProcessor.



codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 99.55947% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.54%. Comparing base (0a7477e) to head (78d4981).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
topobench/data/utils/io_utils.py 98.86% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
- Coverage   94.01%   92.54%   -1.47%     
==========================================
  Files         184      185       +1     
  Lines        6665     6859     +194     
==========================================
+ Hits         6266     6348      +82     
- Misses        399      511     +112     
