A123CortexM dataset#268

Open
Coerulatus wants to merge 25 commits into main from A123CortexM

Conversation


@Coerulatus (Collaborator) commented Mar 18, 2026

Checklist

  • My pull request has a clear and explanatory title.
  • My pull request passes the Linting test.
  • I added appropriate unit tests and I made sure the code passes all unit tests. (refer to comment below)
  • My PR follows PEP8 guidelines. (refer to comment below)
  • My code is properly documented, using numpy docs conventions, and I made sure the documentation renders properly.
  • I linked to issues and PRs that are relevant to this PR.

Description

This PR adds the A123CortexM dataset.
The code was taken from https://github.com/marindigen/TopoBenchForNeuro, but I removed the novel triangle classification tasks.


Note

Medium Risk
Adds a new externally-downloaded dataset pipeline and significantly changes download_file_from_link behavior (streaming, retries, SSL verification control), which could affect all dataset downloads and processing stability.

Overview
Adds support for the Bowen et al. A123 auditory cortex calcium imaging dataset as a new graph classification dataset, including a Hydra dataset config (configs/dataset/graph/a123.yaml), a new A123CortexMDataset that downloads/unzips raw data and converts correlation submatrices into PyG graphs, and a corresponding A123DatasetLoader.
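The conversion from correlation submatrices to graphs can be illustrated with a minimal numpy sketch of the thresholding step. The function name and threshold value are hypothetical; the actual implementation lives in topobench/data/datasets/a123.py and builds PyG Data objects:

```python
import numpy as np

def corr_to_edges(corr, threshold=0.3):
    """Turn a symmetric correlation matrix into an undirected edge list.

    Keeps upper-triangle entries above `threshold`, then mirrors each
    edge so both directions are present. Weights are read directly from
    corr with the final edge order, so they stay aligned with the edges.
    """
    rows, cols = np.triu_indices_from(corr, k=1)  # upper triangle, no self-loops
    mask = corr[rows, cols] > threshold           # keep strong correlations
    r, c = rows[mask], cols[mask]
    edge_index = np.stack([np.concatenate([r, c]), np.concatenate([c, r])])
    edge_attr = corr[edge_index[0], edge_index[1]]
    return edge_index, edge_attr

corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])
edge_index, edge_attr = corr_to_edges(corr, threshold=0.3)
```

In the real dataset, edge_index and edge_attr would then be wrapped in torch tensors and a PyG Data object per submatrix.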

Upgrades download_file_from_link to a streaming, chunked downloader with directory creation, progress reporting, configurable SSL verification and per-chunk timeouts, plus retry/backoff logic; adds MATLAB .mat helpers (collect_mat_files, process_mat, etc.) to io_utils and expands unit tests to cover the new download behavior.
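The retry/backoff behavior described here can be sketched with the network call factored out as a callable, which also makes the logic testable without a server. Function names and parameters below are illustrative, not the actual download_file_from_link signature:

```python
import time

def fetch_with_retries(fetch, max_retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, with exponential backoff between tries.

    fetch: zero-argument callable performing the download attempt.
    Raises the last error if all attempts fail.
    """
    last_err = None
    for attempt in range(max_retries):
        try:
            return fetch()
        except OSError as err:  # network-style failures (requests errors subclass OSError)
            last_err = err
            if attempt < max_retries - 1:
                sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_err

# Simulate a flaky endpoint that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return b"payload"

data = fetch_with_retries(flaky, max_retries=5, sleep=lambda s: None)
```

The real downloader additionally streams the response in chunks, creates parent directories, and exposes SSL verification and per-chunk timeouts as parameters.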

Updates the pipeline test harness to accept dataset/model selection via CLI, and ignores lightning_logs/ in .gitignore.

Written by Cursor Bugbot for commit 5afc5ca. This will update automatically on new commits.

marindigen and others added 23 commits November 19, 2025 12:15
…o check and test the training. To be able to download the dataset, the verify parameter in requests.get() inside download_file_from_link should be set to False. Note also that the run script currently fails to download the data even with verify=False
…ll_to_dict and process_mat. I have also modified download_file_from_link by specifying verify=False in requests.get()
- Implement TriangleClassifier utility for extracting and classifying triangles
- Add create_triangle_classification_task() to generate 7-class role predictions
- Support triangle task via dataset loader (task_type='triangle_task')
- Add config section with triangle task settings (7 classes, 3 features)
- Include comprehensive test suite for triangle classification pipeline

Added triangle common-neighbours classification task: predicting the number of neighbours based on triangle intra-data:

- Dataset creation
- Support triangle task via dataset loader (task_type='triangle_common_task')
- Add config section
- Include comprehensive test suite
… to the datasets (used for the triangle options), as the classes are imbalanced. Ideally, class weights should be added to the model, along with a focal loss
… TriangleClassifier into utils for reusability and created a new class inheriting from it to specialize it for the a123 task
…ion. Move the test class to the appropriate file. Note that the same changes were done in PR #241 (they are duplicated here, as the script wouldn't run otherwise and would require additional adaptation to the old download_file_from_link function).

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.

Bugbot Autofix prepared fixes for all 4 issues found in the latest run.

  • ✅ Fixed: TestPipeline requires args, breaking pytest discovery
    • Removed the required constructor argument from TestPipeline and restored default dataset/model configuration so pytest can collect the class.
  • ✅ Fixed: Edge attributes misaligned after to_undirected sorting
    • Edge weights are now indexed directly from corr using the final to_undirected edge_index order, ensuring correct alignment.
  • ✅ Fixed: Download move logic uses wrong directory name
    • The download reorganization now targets the extracted dataset directory name (Auditory cortex data, with underscore fallback) instead of self.name.
  • ✅ Fixed: Loader uses parent path inconsistent with other loaders
    • The A123 loader now uses root=str(self.root_data_dir) so dataset storage matches AbstractLoader.get_data_dir().

Preview (2951dab532)
diff --git a/test/pipeline/test_pipeline.py b/test/pipeline/test_pipeline.py
--- a/test/pipeline/test_pipeline.py
+++ b/test/pipeline/test_pipeline.py
@@ -1,31 +1,29 @@
 """Test pipeline for a particular dataset and model."""
 
+import argparse
 import sys
-import os
 from pathlib import Path
+
 sys.path.insert(0, str(Path(__file__).parent.parent.parent))
 
 import hydra
+
 from test._utils.simplified_pipeline import run
-import argparse
 
+DATASET = "graph/MUTAG"
+MODELS = ["graph/gcn"]
+
+
 class TestPipeline:
-    """Test pipeline for a particular dataset and model.
+    """Test pipeline for a particular dataset and model."""
 
-    Parameters
-    ----------
-    args : argparse.Namespace
-        Command-line arguments containing dataset and models configuration.
-    """
+    dataset = DATASET
+    models = MODELS
 
-    def __init__(self, args):
-        self.dataset = args.dataset
-        self.models = args.models if isinstance(args.models, list) else [args.models]
-
     def setup_method(self):
         """Setup method."""
         hydra.core.global_hydra.GlobalHydra.instance().clear()
-    
+
     def test_pipeline(self):
         """Test pipeline."""
         with hydra.initialize(config_path="../../configs", job_name="job"):
@@ -34,17 +32,18 @@
                     config_name="run.yaml",
                     overrides=[
                         f"model={MODEL}",
-                        f"dataset={self.dataset}", # IF YOU IMPLEMENT A LARGE DATASET WITH AN OPTION TO USE A SLICE OF IT, ADD BELOW THE CORRESPONDING OPTION
+                        f"dataset={self.dataset}",  # IF YOU IMPLEMENT A LARGE DATASET WITH AN OPTION TO USE A SLICE OF IT, ADD BELOW THE CORRESPONDING OPTION
                         "trainer.max_epochs=2",
                         "trainer.min_epochs=1",
                         "trainer.check_val_every_n_epoch=1",
                         "paths=test",
                         "callbacks=model_checkpoint",
                     ],
-                    return_hydra_config=True
+                    return_hydra_config=True,
                 )
                 run(cfg)
 
+
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Test Pipeline")
 
@@ -53,18 +52,22 @@
         type=str,
         required=True,
         help="Path to data root directory",
-        choices = ['graph/MUTAG', 'graph/a123']  # ADD YOUR DATASET HERE
+        choices=["graph/MUTAG", "graph/a123"],  # ADD YOUR DATASET HERE
     )
     parser.add_argument(
         "--models",
         type=str,
-        nargs='+',
+        nargs="+",
         required=False,
         default=["graph/gcn"],
         help="Model(s) to use in the pipeline",
     )
     args = parser.parse_args()
 
-    test_pipeline = TestPipeline(args)
\ No newline at end of file
+    test_pipeline = TestPipeline()
+    test_pipeline.dataset = args.dataset
+    test_pipeline.models = (
+        args.models if isinstance(args.models, list) else [args.models]
+    )
     test_pipeline.setup_method()
-    test_pipeline.test_pipeline()
+    test_pipeline.test_pipeline()

diff --git a/topobench/data/datasets/a123.py b/topobench/data/datasets/a123.py
--- a/topobench/data/datasets/a123.py
+++ b/topobench/data/datasets/a123.py
@@ -190,7 +190,9 @@
         os.unlink(path)
 
         # Move files from extracted "Auditory cortex data/" directory to raw_dir
-        downloaded_dir = osp.join(folder, self.name)
+        downloaded_dir = osp.join(folder, dataset_key)
+        if not osp.exists(downloaded_dir):
+            downloaded_dir = osp.join(folder, dataset_key.replace(" ", "_"))
         if osp.exists(downloaded_dir):
             for file in os.listdir(downloaded_dir):
                 src = osp.join(downloaded_dir, file)
@@ -320,13 +322,9 @@
                 edge_index = torch.tensor(edge_index_np, dtype=torch.long)
                 # make undirected
                 edge_index = to_undirected(edge_index)
-                # edge_attr: corresponding corr weights (for both directions, if made undirected)
-                weights = corr[rows, cols]
-                weights = (
-                    np.repeat(weights, 2)
-                    if edge_index.size(1) == weights.size * 2
-                    else weights
-                )
+                # Align edge weights with the final (possibly sorted) edge order.
+                edge_index_np = edge_index.cpu().numpy()
+                weights = corr[edge_index_np[0], edge_index_np[1]]
                 edge_attr = torch.tensor(
                     weights.reshape(-1, 1), dtype=torch.float
                 )

diff --git a/topobench/data/loaders/graph/a123_loader.py b/topobench/data/loaders/graph/a123_loader.py
--- a/topobench/data/loaders/graph/a123_loader.py
+++ b/topobench/data/loaders/graph/a123_loader.py
@@ -91,9 +91,8 @@
         # determine dataset name from parameters, fallback to expected id
         name = self.parameters.data_name
 
-        # root path for dataset: use the parent of root_data_dir since the dataset
-        # constructs its own subdirectory based on name
-        root = str(self.root_data_dir.parent)
+        # root path for dataset: align with AbstractLoader.get_data_dir().
+        root = str(self.root_data_dir)
 
         # Construct dataset; A123CortexMDataset expects (root, name, parameters)
         self.dataset = A123CortexMDataset(


"""Test pipeline for a particular dataset and model."""
def __init__(self, args):
self.dataset = args.dataset
self.models = args.models if isinstance(args.models, list) else [args.models]


TestPipeline requires args, breaking pytest discovery

High Severity

TestPipeline.__init__ now requires an args parameter, but pytest discovers test classes and instantiates them with no constructor arguments. This causes a TypeError when pytest tries to collect any test in this class, effectively breaking the entire test pipeline. Previously, the class had no __init__ and used module-level constants DATASET and MODELS, which allowed pytest to work normally.
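The pytest-compatible pattern the autofix restores keeps configuration in class attributes rather than constructor arguments. A minimal illustration (the attribute values are the defaults from the diff):

```python
class TestPipeline:
    """Pytest-friendly: no __init__, configuration held in class attributes."""
    dataset = "graph/MUTAG"
    models = ["graph/gcn"]

# pytest can collect and instantiate this class with no arguments ...
instance = TestPipeline()

# ... while a CLI entry point can still override the attributes per instance.
instance.dataset = "graph/a123"
```

Class-level defaults give both pytest discovery and the `__main__` CLI path a working configuration.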


)
edge_attr = torch.tensor(
weights.reshape(-1, 1), dtype=torch.float
)


Edge attributes misaligned after to_undirected sorting

Medium Severity

After to_undirected(edge_index) sorts edges via internal coalesce, the edge ordering changes. However, np.repeat(weights, 2) assumes the doubled edges retain the original-then-reversed interleaving. Since to_undirected sorts all edges by (row, col), the resulting edge_attr values are misaligned with their corresponding edges, assigning incorrect correlation weights to most edges.
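The misalignment can be reproduced with numpy alone by mimicking to_undirected's coalesce (add reversed edges, then sort lexicographically). The example data is illustrative:

```python
import numpy as np

corr = np.array([[0.0, 0.9, 0.2],
                 [0.9, 0.0, 0.7],
                 [0.2, 0.7, 0.0]])

# Directed edges in non-sorted order: (1,2) listed before (0,1).
edge_index = np.array([[1, 0],
                       [2, 1]])

# to_undirected: append reversed edges, then coalesce (lexicographic sort).
both = np.concatenate([edge_index, edge_index[::-1]], axis=1)
order = np.lexsort((both[1], both[0]))
undirected = both[:, order]  # columns now (0,1), (1,0), (1,2), (2,1)

# Buggy approach: repeat original weights, assuming the order survived.
naive = np.repeat(corr[edge_index[0], edge_index[1]], 2)

# Fixed approach: index corr with the final edge order.
aligned = corr[undirected[0], undirected[1]]
```

Here `naive` yields [0.7, 0.7, 0.9, 0.9] against the sorted edges, while `aligned` correctly yields [0.9, 0.9, 0.7, 0.7].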


else:
shutil.move(src, dst)
# Delete the extracted top-level directory
shutil.rmtree(downloaded_dir)


Download move logic uses wrong directory name

Medium Severity

The downloaded_dir is constructed as osp.join(folder, self.name) which resolves to <raw_dir>/a123_cortex_m, but the zip file from URL Auditory_cortex_data.zip extracts to a directory like Auditory cortex data/ (matching raw_file_names), not a123_cortex_m/. The osp.exists(downloaded_dir) check will always be False, so the intended file reorganization never happens. The comment on line 192 even says "Move files from extracted 'Auditory cortex data/' directory" but the code checks for self.name instead.
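The fix resolves the extracted directory by its real name, with an underscore fallback, instead of self.name. A self-contained sketch of that lookup (helper name and temp paths are illustrative):

```python
import os
import os.path as osp
import tempfile

def find_extracted_dir(folder, dataset_key):
    """Return the extracted directory, trying the space-separated name
    first and the underscore variant second, as the autofix does."""
    for candidate in (dataset_key, dataset_key.replace(" ", "_")):
        path = osp.join(folder, candidate)
        if osp.isdir(path):
            return path
    return None

# Simulate a zip that extracted to "Auditory cortex data/".
tmp = tempfile.mkdtemp()
os.makedirs(osp.join(tmp, "Auditory cortex data"))

found = find_extracted_dir(tmp, "Auditory cortex data")  # resolves correctly
missing = find_extracted_dir(tmp, "a123_cortex_m")       # the buggy lookup name
```

Checking both name variants makes the reorganization robust to whether the archive uses spaces or underscores in its top-level directory.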



# root path for dataset: use the parent of root_data_dir since the dataset
# constructs its own subdirectory based on name
root = str(self.root_data_dir.parent)


Loader uses parent path inconsistent with other loaders

Medium Severity

root = str(self.root_data_dir.parent) deviates from every other loader in the codebase, which all use root = str(self.root_data_dir). This causes the dataset to store files at a different path than what get_data_dir() (inherited from AbstractLoader) reports. When run.py calls dataset_loader.load(), the returned dataset_dir points to a non-existent directory, which can break downstream processing in PreProcessor.



codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 99.55947% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.54%. Comparing base (0a7477e) to head (78d4981).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
topobench/data/utils/io_utils.py 98.86% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
- Coverage   94.01%   92.54%   -1.47%     
==========================================
  Files         184      185       +1     
  Lines        6665     6859     +194     
==========================================
+ Hits         6266     6348      +82     
- Misses        399      511     +112     
