Fix GraphLand categorical feature encoding and update num_features by anilkeshwani · Pull Request #289 · geometric-intelligence/TopoBench

anilkeshwani · 2026-03-22T01:53:08Z

Summary

This PR fixes a critical feature preprocessing issue in the GraphLand benchmark integration (PR #192) and validates the fix by running TopoTune experiments.

The core problem: PR #192 loads all GraphLand features from features.csv as raw floats, treating integer-coded categorical features (e.g., education=3, colour_group=15, region_id=42) as continuous numerical values. This is semantically incorrect — the model learns spurious ordinal relationships between unordered categories (e.g., that education=3 is "three times" education=1), which degrades representation quality and downstream task performance.

The fix: Implements proper feature preprocessing following the pipeline explicitly recommended by the GraphLand authors (Bazhenov, Platonov & Prokhorenkova, arXiv:2409.14500, NeurIPS 2025 Datasets & Benchmarks):

Reads info.yaml bundled with each dataset ZIP to identify categorical_features_names vs numerical_features_names
One-hot encodes categoricals using OneHotEncoder(drop='if_binary', sparse_output=False), matching the exact encoding in the GraphLand codebase
Applies imputation only to numerical features (not categoricals)
Updates num_features in all 14 YAML configs to reflect post-encoding dimensions

Motivation and scientific context

The GraphLand benchmark was designed to evaluate graph ML models on real-world industrial data with rich heterogeneous features — a deliberate departure from academic benchmarks like Cora/CiteSeer where all features are homogeneous (bag-of-words). The mixed numerical/categorical feature structure is a defining characteristic of GraphLand and central to its research contribution. The paper states (Appendix B.3):

"For categorical features, we used one-hot encoding for all models except for LightGBM and CatBoost, which support the use of categorical features directly and have their specialized strategies for working with them."

Passing raw integer-coded categoricals to a neural network (GNN or TNN) defeats the purpose of this benchmark, as the model cannot distinguish between ordinal and nominal features. One-hot encoding is the standard approach for neural models, and drop='if_binary' avoids redundant columns for binary indicators (e.g., is_paved, age_is_nan).

Design decisions

1. One-hot encoding in `process()`, not as a TopoBench transform

The encoding is performed inside GraphlandDataset.process() rather than as a separate TopoBench data manipulation transform. This is because:

The encoding depends on dataset-specific metadata (info.yaml), not on graph topology
It must happen before any TopoBench transforms that depend on num_features (e.g., feature lifting via ProjectionSum)
The infer_in_channels config resolver reads dataset.parameters.num_features at config time, so the YAML value must reflect the post-encoded dimension
This matches how other TopoBench datasets handle preprocessing (e.g., ZINC's atom type encoding happens at dataset creation, not as a transform)

2. `info.yaml` added to `raw_file_names`

Added info.yaml to the raw_file_names property so PyG's caching mechanism correctly detects when the metadata file is missing and triggers a re-download. The file is bundled in every GraphLand Zenodo ZIP alongside the CSVs.

3. Imputation scoping

The SimpleImputer is now applied only to numerical features, not to categorical columns. Categorical features in GraphLand have no missing values by construction (they are integer-coded categories), and imputing them with most_frequent could introduce semantic errors.

4. Zenodo download timeout

Increased from 60s to 300s. Several GraphLand datasets (web-fraud, hm-prices, avazu-ctr) exceed 50MB and reliably time out at 60s on typical connections.

Updated `num_features` (all 14 datasets)

Dataset	Raw features	After OHE	Key categorical expansions
tolokers-2	16	19	education (4 levels)
hm-categories	35	120	colour_group (32), graphical_appearance (29), perceived_colour_master (20)
hm-prices	41	264	11 product/style categoricals
twitch-views	4	24	language (21 levels)
city-roads-M	26	68	region_id (23), speed_limit (10), category (7)
city-roads-L	26	207	region_id (156), speed_limit (14)
city-reviews	37	204	feature_18 (83 browser types), feature_19 (36)
artnet-exp	75	75	All 27 categoricals are binary → no expansion
artnet-views	50	50	All 27 categoricals are binary → no expansion
avazu-ctr	260	260	No categorical features
pokec-regions	56	56	All 54 categoricals are binary → no expansion
web-fraud	266	1179	Website zone/category features (4.4x expansion)
web-topics	263	528	Topic categoricals
web-traffic	267	1180	Traffic category features

Experimental validation

Ran TopoTune on tolokers-2 (binary classification: predicting crowdworker bans) with default parameters:

Parameter	Value
Model	`cell/topotune` (GCN backbone, 2 GNN layers, 32 hidden channels)
Lifting	`CellCycleLifting` (max_cell_length=4)
Feature lifting	`ProjectionSum`
Neighborhoods	`up_adjacency-1`, `up_incidence-0`, `down_incidence-2`, `2-up_adjacency-0`
Optimizer	Adam (lr=0.001), StepLR scheduler (step=50, γ=0.5)
Training	100 epochs, seed=42, CPU

Test results:

Metric	Value
Accuracy	78.1%
AUROC	70.7%
Precision	64.7%
Recall	52.0%
Loss	0.479

Notes on the experiment:

Simplicial clique lifting (SimplicialCliqueLifting) was intractable on this dense social graph (11.8K nodes, high clustering) — killed after 50+ minutes. Cell cycle lifting with max_cell_length=4 completed in seconds.
MPS acceleration (Apple Silicon) failed due to PyTorch 2.3.0's SparseMPS backend not supporting sparse tensor operations required by topological data structures (incidence/adjacency matrices). Fell back to CPU.
The dataset is class-imbalanced (most workers are not banned), which explains the accuracy-recall gap. The GraphLand paper reports GCN achieving ~83% accuracy, suggesting room for hyperparameter tuning.

Files changed

topobench/data/loaders/graph/graphland/dataset.py — Rewrote process() with proper categorical/numerical separation, one-hot encoding, and scoped imputation. Added _load_info() and _preprocess_features() methods. Added info.yaml to raw_file_names.
topobench/data/loaders/graph/graphland/repository/zenodo.py — Increased download timeout 60s → 300s.
configs/dataset/graph/*.yaml (10 files) — Updated num_features to post-encoding dimensions.
scripts/inspect_graphland_features.py — Utility script that downloads all 14 GraphLand datasets and computes post-encoding feature dimensions. Useful for verification and for future dataset additions.

References

Bazhenov, Platonov & Prokhorenkova. GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data. NeurIPS 2025. arXiv:2409.14500
GraphLand codebase: github.com/yandex-research/graphland — see dataset.py lines 496–500 for the reference one-hot encoding implementation
Papillon, Bernardez, Battiloro & Miolane. TopoTune: A Framework for Generalized Combinatorial Complex Neural Networks. ICML 2025. arXiv:2410.06530

Test plan

Verified config resolution: num_features: 19 correctly becomes in_channels: [19, 19, 19] for simplicial/cell TopoTune
Ran full train+test pipeline on tolokers-2 with cell/topotune (78.1% accuracy)
Downloaded and verified post-encoding dimensions for all 14 datasets via scripts/inspect_graphland_features.py
Run existing unit tests (test/data/load/test_datasetloaders.py)
Validate on a second GraphLand dataset (e.g., hm-categories) to confirm multiclass classification works with expanded features

🤖 Generated with Claude Code

Note

Medium Risk
Changes GraphLand dataset preprocessing to one-hot encode categorical columns and optionally drop missing targets, which can shift feature dimensionality and model behavior across multiple benchmarks. Risk is mainly around data compatibility/caching and correctness of the new feature/label handling for existing experiments.

Overview
Fixes GraphLand integration by changing GraphlandDataset.process() to read info.yaml, separate numerical vs categorical columns, apply imputation only to numerical features, and one-hot encode categoricals with OneHotEncoder(drop='if_binary') before building the PyG Data object.

Updates GraphLand dataset YAMLs to reflect post-encoding num_features, adds a small ZenodoZip helper for in-memory ZIP download/extraction, and introduces scripts/inspect_graphland_features.py to compute/verify encoded feature dimensions from the upstream Zenodo artifacts.

Adds new dataset entrypoints/config for wiki_cs via WikiCSDatasetLoader, and expands the dataset-loader test exclusions to skip additional slow/long-running datasets (including GraphLand configs).

^{Written by Cursor Bugbot for commit 55b17d6. This will update automatically on new commits. Configure here.}

Introducing nodes with missing y

The original GraphLand integration loaded all features from features.csv as raw floats, treating integer-coded categorical features (e.g., education=3, colour_group=15) as continuous numerical values. This is semantically incorrect and degrades model performance, as the model would learn spurious ordinal relationships between unordered categories. This commit implements proper feature preprocessing following the GraphLand paper's (arXiv:2409.14500) recommended pipeline: 1. Read info.yaml bundled with each dataset to identify feature types (categorical_features_names, numerical_features_names, fraction_features_names) 2. One-hot encode categorical features using OneHotEncoder(drop='if_binary', sparse_output=False), which: - Expands multi-level categoricals into binary indicator columns - Collapses binary categoricals to a single column (drop='if_binary') - Matches the exact encoding used in the GraphLand codebase 3. Apply imputation (SimpleImputer) only to numerical features, not to categoricals (which have no missing values by construction) 4. Concatenate features as [numerical | one_hot_categorical] 5. Update num_features in all 14 YAML configs to reflect post-encoding dimensions, which were computed by downloading each dataset and running the actual encoding pipeline The num_features changes are significant for several datasets: - web-fraud: 266 -> 1179 (website zone/category features expand 4.4x) - web-traffic: 267 -> 1180 - hm-prices: 41 -> 264 (11 product categoricals) - city-roads-L: 26 -> 207 (region_id has 156 unique values) - city-reviews: 37 -> 204 - hm-categories: 35 -> 120 - city-roads-M: 26 -> 68 Datasets with only binary categoricals are unchanged: - artnet-exp (75), artnet-views (50), pokec-regions (56), avazu-ctr (260) Also increases Zenodo download timeout from 60s to 300s to accommodate larger datasets, and adds info.yaml to raw_file_names for proper cache invalidation. Includes scripts/inspect_graphland_features.py utility for computing post-encoding dimensions across all 14 GraphLand datasets. Validated by running TopoTune (cell/topotune, GCN backbone, CellCycleLifting) on tolokers-2, achieving 78.1% test accuracy and 70.7% AUROC on the binary classification task. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

anilkeshwani · 2026-03-22T01:55:32Z

@levtelyatnikov here's the fully automated PR that we were talking about.

Copilot

Pull request overview

This PR updates the GraphLand dataset integration to correctly distinguish categorical vs numerical node features (and one-hot encode categoricals) so models no longer treat integer-coded categories as continuous values; it also updates GraphLand dataset configs to reflect the post-encoding num_features.

Changes:

Reworked GraphlandDataset.process() to load info.yaml, one-hot encode categorical features, and scope imputation to numerical features.
Increased Zenodo ZIP download timeout and added a utility script to compute post-encoding feature dimensions.
Updated GraphLand YAML configs’ num_features values (and added a WikiCS loader/config as well).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
topobench/data/loaders/graph/graphland/dataset.py	Implements GraphLand preprocessing using `info.yaml` + OHE, and updates dataset processing/download behavior.
topobench/data/loaders/graph/graphland/repository/zenodo.py	Adds a minimal in-memory Zenodo ZIP downloader with longer timeout.
topobench/data/loaders/graph/graphland_dataset.py	Introduces a loader wrapper for `GraphlandDataset` used by Hydra configs.
topobench/data/loaders/graph/wiki_cs.py	Adds a loader for PyG’s WikiCS dataset.
scripts/inspect_graphland_features.py	Utility script to compute post-OHE `num_features` across GraphLand datasets.
test/data/load/test_datasetloaders.py	Expands dataset exclusions for loader tests (but currently introduces a syntax issue).
configs/dataset/graph/wiki_cs.yaml	Adds dataset config for WikiCS.
configs/dataset/graph/artnet-exp.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/artnet-views.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/avazu-ctr.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/city-reviews.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/city-roads-L.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/city-roads-M.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/hm-categories.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/hm-prices.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/pokec-regions.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/tolokers-2.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/twitch-views.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/web-fraud.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/web-topics.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.
configs/dataset/graph/web-traffic.yaml	Updates GraphLand `num_features` to reflect post-encoding dimensions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-22T01:57:26Z

topobench/data/loaders/graph/graphland/dataset.py

+    @property
+    def processed_paths(self):
+        """The processed data directory path."""
+        return [os.path.join(self.root, "processed")]
+


GraphlandDataset overrides processed_paths to return the processed directory rather than the processed file path(s). PyG’s InMemoryDataset uses processed_paths to decide whether processing is needed; if the directory exists but data.pt does not, the dataset can incorrectly skip process() and then torch.load(...) will fail. Remove this processed_paths override (or override processed_dir instead) and rely on the base class’s processed_paths (which should be <root>/processed/<processed_file_names>).

Copilot · 2026-03-22T01:57:26Z

topobench/data/loaders/graph/graphland/dataset.py

+
+        # Numerical features (includes fraction features)
+        num_cols = [c for c in num_names if c in feats_df.columns]


The docstring states that fraction_features_names should be included with numerical features, but _preprocess_features() only uses numerical_features_names. If info.yaml provides fraction features separately, they’ll be silently dropped from x. Consider merging numerical_features_names + fraction_features_names (and ensuring a stable column order) so the processed feature matrix matches the GraphLand preprocessing spec and the updated num_features values.

Suggested change

# Numerical features (includes fraction features)

num_cols = [c for c in num_names if c in feats_df.columns]

frac_names = info.get("fraction_features_names", [])

# Numerical features (includes fraction features)

# Merge numerical and fraction features, preserving order and removing duplicates

all_num_names = list(dict.fromkeys(num_names + frac_names))

num_cols = [c for c in all_num_names if c in feats_df.columns]

Copilot · 2026-03-22T01:57:26Z

topobench/data/loaders/graph/graphland/dataset.py

+        if is_integer_dtype(
+            targ_values.fillna(0)
+        ) or targ_values.fillna(0).apply(float.is_integer).all():
+            y = torch.tensor(
+                targs_df.values, dtype=torch.long
+            ).squeeze()
+        else:
+            y = torch.tensor(
+                targs_df.values, dtype=torch.double
+            ).squeeze()
+
+        # Drop nodes with missing targets
+        if self.drop_missing_y:
+            mask = ~torch.tensor(targ_values.isna().values)
+            x = x[mask]
+            y = y[mask]
+            feats_df = feats_df[mask.numpy()]
+
+            old_to_new = {
+                old: new
+                for new, old in enumerate(
+                    mask.numpy().nonzero()[0]


y is cast to torch.long/torch.double before applying drop_missing_y. If targets.csv contains NaNs (GraphLand often has missing labels), converting to torch.long will raise (NaN → integer) and prevent processing. Apply the missing-target mask to targs_df/targ_values before dtype inference and tensor conversion, or fill missing targets with a sentinel only when drop_missing_y=False.

Suggested change

if is_integer_dtype(

targ_values.fillna(0)

) or targ_values.fillna(0).apply(float.is_integer).all():

y = torch.tensor(

targs_df.values, dtype=torch.long

).squeeze()

else:

y = torch.tensor(

targs_df.values, dtype=torch.double

).squeeze()

# Drop nodes with missing targets

if self.drop_missing_y:

mask = ~torch.tensor(targ_values.isna().values)

x = x[mask]

y = y[mask]

feats_df = feats_df[mask.numpy()]

old_to_new = {

old: new

for new, old in enumerate(

mask.numpy().nonzero()[0]

# Apply missing-target handling before dtype inference / tensor conversion

non_missing_mask = None

if self.drop_missing_y:

# Boolean mask over original nodes: True where target is present

non_missing_mask = ~targ_values.isna().values

# Filter targets down to nodes with observed labels

targs_df = targs_df[non_missing_mask]

targ_values = targ_values[non_missing_mask]

if is_integer_dtype(

targ_values.fillna(0)

) or targ_values.fillna(0).apply(float.is_integer).all():

# Integer-like targets

if self.drop_missing_y:

# After filtering, no NaNs remain

target_array = targs_df.values

else:

# Keep missing labels; represent them with a sentinel value

target_array = targs_df.fillna(-1).values

y = torch.tensor(

target_array, dtype=torch.long

).squeeze()

else:

# Continuous targets can safely remain as floating point (NaNs allowed)

y = torch.tensor(

targs_df.values, dtype=torch.double

).squeeze()

# Drop nodes with missing targets

if self.drop_missing_y:

# Use the original-length boolean mask for features / graph structure

mask = torch.from_numpy(non_missing_mask)

x = x[mask]

# y has already been filtered when building targs_df, so no further

# masking is required here.

feats_df = feats_df[non_missing_mask]

old_to_new = {

old: new

for new, old in enumerate(

non_missing_mask.nonzero()[0]

Copilot · 2026-03-22T01:57:27Z

topobench/data/loaders/graph/graphland/dataset.py

+            feats_df = feats_df[mask.numpy()]
+
+            old_to_new = {
+                old: new
+                for new, old in enumerate(
+                    mask.numpy().nonzero()[0]
+                )
+            }
+
+            edges_df = edges_df[
+                edges_df["source"].isin(old_to_new.keys())
+                & edges_df["target"].isin(old_to_new.keys())
+            ].copy()
+
+            edges_df["source"] = edges_df["source"].map(
+                old_to_new
+            )
+            edges_df["target"] = edges_df["target"].map(
+                old_to_new
+            )


When dropping missing targets, the old_to_new mapping is built from mask.numpy().nonzero()[0], i.e. row positions, but edges are filtered/mapped using edges_df['source']/['target'] values. This only works if node IDs are guaranteed to be contiguous 0..N-1 and ordered exactly like features.csv. To make this robust, build the mapping from the actual node IDs (e.g., kept_node_ids = feats_df.index[mask.numpy()]) and map edges using those IDs.

Copilot · 2026-03-22T01:57:27Z

topobench/data/loaders/graph/graphland_dataset.py

+        dataset = GraphlandDataset(
+            root = str(self.root_data_dir),
+            name = self.parameters.data_name,
+            drop_missing_y = self.parameters.get("drop_missing_y", True),
+            impute_missing_x = self.parameters.get("impute_missing_x", None),
+        )


impute_missing_x is read from the Hydra config (parameters.impute_missing_x) and passed straight into GraphlandDataset. In the YAMLs it’s a DictConfig with _target_: sklearn.impute.SimpleImputer, so GraphlandDataset._preprocess_features() will later call .fit_transform() on a DictConfig and crash. Instantiate the imputer in the loader (e.g., hydra.utils.instantiate(self.parameters.impute_missing_x) when present) before passing it to GraphlandDataset.

Copilot · 2026-03-22T01:57:27Z

scripts/inspect_graphland_features.py

@@ -0,0 +1,149 @@
+# /// script
+# requires-python = ">=3.12"


The uv-style script header declares requires-python = ">=3.12", but the repository’s pyproject.toml specifies requires-python = ">= 3.10". This mismatch can confuse users/tooling and suggests the script won’t run in supported environments. Consider lowering the script requirement to >=3.10 (if compatible) or removing the header and documenting any extra requirements in the docstring instead.

Suggested change

# requires-python = ">=3.12"

# requires-python = ">=3.10"

Copilot · 2026-03-22T01:57:28Z

scripts/inspect_graphland_features.py

+    print(f"  Task: {info_data.get('task', 'unknown')}")
+    print(f"  Metric: {info_data.get('metric', 'unknown')}")
+    print(f"  Nodes: {len(feats_data)}")
+    print(f"  Raw features: {len(feats_df.columns) if 'feats_df' in dir() else feats_data.shape[1]}")
+    print(f"  Numerical features: {len(num_names)}")
+    print(f"  Fraction features: {len(frac_names)}")
+    print(f"  Categorical features: {len(cat_names)}")
+


download_and_inspect() prints raw feature count using feats_df, but that variable is never defined in this function. Right now it falls back to feats_data.shape[1], but the conditional ('feats_df' in dir()) is dead/opaque and makes the script harder to trust. Replace that print with a direct feats_data.shape[1] (or rename consistently) so the script output is deterministic and maintainable.

Copilot · 2026-03-22T01:57:28Z

topobench/data/loaders/graph/graphland/repository/zenodo.py

+            for info in zf.infolist():
+                name = info.filename.replace("\\", "/")
+                if name.endswith("/") or name.startswith("__MACOSX/"):
+                    continue # skip dirs and macOS metadata


Inline comment spacing: continue # ... has only one space before #, which will be flagged by ruff (E261). Add two spaces before the inline comment (or move the comment to its own line) to satisfy the repo’s lint configuration.

Suggested change

continue # skip dirs and macOS metadata

continue # skip dirs and macOS metadata

Copilot · 2026-03-22T01:57:28Z

topobench/data/loaders/graph/graphland_dataset.py

+            root = str(self.root_data_dir),
+            name = self.parameters.data_name,
+            drop_missing_y = self.parameters.get("drop_missing_y", True),
+            impute_missing_x = self.parameters.get("impute_missing_x", None),


PEP8/ruff: keyword arguments should not have spaces around = (E251). This file uses root = ..., name = ..., etc., which is inconsistent with other loaders (e.g., PlanetoidDatasetLoader) and may fail linting. Remove the extra spaces in keyword arguments.

Suggested change

root = str(self.root_data_dir),

name = self.parameters.data_name,

drop_missing_y = self.parameters.get("drop_missing_y", True),

impute_missing_x = self.parameters.get("impute_missing_x", None),

root=str(self.root_data_dir),

name=self.parameters.data_name,

drop_missing_y=self.parameters.get("drop_missing_y", True),

impute_missing_x=self.parameters.get("impute_missing_x", None),

Copilot · 2026-03-22T01:57:28Z

topobench/data/loaders/graph/wiki_cs.py

+from omegaconf import DictConfig
+from torch_geometric.data import Dataset
+from torch_geometric.datasets import WikiCS
+
+from topobench.data.loaders.base import AbstractLoader
+
+
+class WikiCSDatasetLoader(AbstractLoader):


The PR title/description focus on fixing GraphLand feature encoding and updating num_features, but this PR also adds a new WikiCS loader and dataset config (wiki_cs.py / wiki_cs.yaml). If WikiCS changes are intentional, please mention them in the PR description (or split into a separate PR) so reviewers know why they’re included.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

cursor · 2026-03-22T01:59:10Z

test/data/load/test_datasetloaders.py

+                            "cocitation_pubmed.yaml", 'minesweeper.yaml', 'roman_empire.yaml',
+                            'tolokers.yaml'
+                            # Avoid datasets that take too long to load
+                            "artnet-views.yaml", 


Missing comma causes silent string concatenation in exclude set

High Severity

'tolokers.yaml' on line 50 is missing a trailing comma before the comment on line 51. In Python, adjacent string literals inside brackets are implicitly concatenated, even across lines with comments. This causes 'tolokers.yaml' and "artnet-views.yaml" to merge into the single string 'tolokers.yamlartnet-views.yaml', which means 'tolokers.yaml' is never actually added to the exclude_datasets set and won't be excluded from test runs.

cursor · 2026-03-22T01:59:10Z

test/data/load/test_datasetloaders.py

+                            "artnet-views.yaml", 
+                            "artnet-exp.yaml", "avazu-ctr.yaml", "city-reviews.yaml", "city-roads-L.yaml",
+                            "artnet-views.yaml", "hm-prices.yaml", "hm-categories.yaml", "pokec-regions.yaml", "tolokers-2.yaml",
+                            "twitch-views.yaml", "web-fraud.yaml", "web-traffic.yaml", "web-topics.yaml",


Missing city-roads-M.yaml from test exclude list

Medium Severity

city-roads-M.yaml is absent from the exclude_datasets set, even though all other 13 GraphLand dataset configs are listed for exclusion. This means the test suite will attempt to download and process the city-roads-M dataset from Zenodo, which will likely cause test timeouts or failures in CI environments without network access.

Loris697 and others added 17 commits October 8, 2025 15:45

Introducing WikiCS dataset

221869c

Complete introduction of wikics

5ba98db

Introducing all the datasets from GraphLand

9c608c2

Introducing node features imputing

7f4090a

Introducing nodes with missing y

Implementing Linting changes

c7383f0

Formatting import

6e20dea

Correcting whitespace

cb25b67

Fix string too long

b30d75b

Fix formatting

2d0a80b

Fix error web-topics configuration

1f7c779

Fix wiki_ccs configuration

61ee598

Modifying test loaders to avoid overflow of memory and time

66bea5d

Comment ZINC

6f1e72c

Changing test on datasets

c07c396

Deleting other dataset from test

47bb68c

Not using all the dataset in the testing phase

81d855a

Copilot AI review requested due to automatic review settings March 22, 2026 01:53

Copilot started reviewing on behalf of anilkeshwani March 22, 2026 01:53 View session

Copilot AI reviewed Mar 22, 2026

View reviewed changes

cursor bot reviewed Mar 22, 2026

View reviewed changes


		# Numerical features (includes fraction features)
		num_cols = [c for c in num_names if c in feats_df.columns]

-        # Numerical features (includes fraction features)
-        num_cols = [c for c in num_names if c in feats_df.columns]
+        frac_names = info.get("fraction_features_names", [])
+        # Numerical features (includes fraction features)
+        # Merge numerical and fraction features, preserving order and removing duplicates
+        all_num_names = list(dict.fromkeys(num_names + frac_names))
+        num_cols = [c for c in all_num_names if c in feats_df.columns]

	continue # skip dirs and macOS metadata
	continue # skip dirs and macOS metadata

Conversation

anilkeshwani commented Mar 22, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation and scientific context

Design decisions

1. One-hot encoding in process(), not as a TopoBench transform

2. info.yaml added to raw_file_names

3. Imputation scoping

4. Zenodo download timeout

Updated num_features (all 14 datasets)

Experimental validation

Files changed

References

Test plan

Uh oh!

anilkeshwani commented Mar 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 22, 2026

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Mar 22, 2026

Choose a reason for hiding this comment

Missing comma causes silent string concatenation in exclude set

Uh oh!

cursor bot Mar 22, 2026

Choose a reason for hiding this comment

Missing city-roads-M.yaml from test exclude list

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

anilkeshwani commented Mar 22, 2026 •

edited by cursor bot

Loading

1. One-hot encoding in `process()`, not as a TopoBench transform

2. `info.yaml` added to `raw_file_names`

Updated `num_features` (all 14 datasets)

Missing `city-roads-M.yaml` from test exclude list