
Fix GraphLand categorical feature encoding and update num_features#289

Open
anilkeshwani wants to merge 17 commits into geometric-intelligence:main from anilkeshwani:fix/graphland-categorical-feature-encoding

Conversation

@anilkeshwani

@anilkeshwani anilkeshwani commented Mar 22, 2026

Summary

This PR fixes a critical feature preprocessing issue in the GraphLand benchmark integration (PR #192) and validates the fix by running TopoTune experiments.

The core problem: PR #192 loads all GraphLand features from features.csv as raw floats, treating integer-coded categorical features (e.g., education=3, colour_group=15, region_id=42) as continuous numerical values. This is semantically incorrect — the model learns spurious ordinal relationships between unordered categories (e.g., that education=3 is "three times" education=1), which degrades representation quality and downstream task performance.

The fix: Implements proper feature preprocessing following the pipeline explicitly recommended by the GraphLand authors (Bazhenov, Platonov & Prokhorenkova, arXiv:2409.14500, NeurIPS 2025 Datasets & Benchmarks):

  • Reads info.yaml bundled with each dataset ZIP to identify categorical_features_names vs numerical_features_names
  • One-hot encodes categoricals using OneHotEncoder(drop='if_binary', sparse_output=False), matching the exact encoding in the GraphLand codebase
  • Applies imputation only to numerical features (not categoricals)
  • Updates num_features in all 14 YAML configs to reflect post-encoding dimensions

Motivation and scientific context

The GraphLand benchmark was designed to evaluate graph ML models on real-world industrial data with rich heterogeneous features — a deliberate departure from academic benchmarks like Cora/CiteSeer where all features are homogeneous (bag-of-words). The mixed numerical/categorical feature structure is a defining characteristic of GraphLand and central to its research contribution. The paper states (Appendix B.3):

"For categorical features, we used one-hot encoding for all models except for LightGBM and CatBoost, which support the use of categorical features directly and have their specialized strategies for working with them."

Passing raw integer-coded categoricals to a neural network (GNN or TNN) defeats the purpose of this benchmark, as the model cannot distinguish between ordinal and nominal features. One-hot encoding is the standard approach for neural models, and drop='if_binary' avoids redundant columns for binary indicators (e.g., is_paved, age_is_nan).

Design decisions

1. One-hot encoding in process(), not as a TopoBench transform

The encoding is performed inside GraphlandDataset.process() rather than as a separate TopoBench data manipulation transform. This is because:

  • The encoding depends on dataset-specific metadata (info.yaml), not on graph topology
  • It must happen before any TopoBench transforms that depend on num_features (e.g., feature lifting via ProjectionSum)
  • The infer_in_channels config resolver reads dataset.parameters.num_features at config time, so the YAML value must reflect the post-encoded dimension
  • This matches how other TopoBench datasets handle preprocessing (e.g., ZINC's atom type encoding happens at dataset creation, not as a transform)

2. info.yaml added to raw_file_names

Added info.yaml to the raw_file_names property so PyG's caching mechanism correctly detects when the metadata file is missing and triggers a re-download. The file is bundled in every GraphLand Zenodo ZIP alongside the CSVs.

3. Imputation scoping

The SimpleImputer is now applied only to numerical features, not to categorical columns. Categorical features in GraphLand have no missing values by construction (they are integer-coded categories), and imputing them with most_frequent could introduce semantic errors.
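A toy illustration of this scoping, with hypothetical column names (the real logic lives in `_preprocess_features()`):

```python
# Hypothetical illustration: the imputer sees only numerical columns, so
# integer-coded categories can never be distorted by imputation.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [50.0, np.nan, 70.0],  # numerical, may contain NaNs
    "region_id": [4, 7, 4],          # integer-coded categorical, no NaNs
})

# Fit/transform on the numerical subset only
df[["income"]] = SimpleImputer(strategy="mean").fit_transform(df[["income"]])
print(df["income"].tolist())  # [50.0, 60.0, 70.0]; region_id is untouched
```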

4. Zenodo download timeout

Increased from 60s to 300s. Several GraphLand datasets (web-fraud, hm-prices, avazu-ctr) exceed 50MB and reliably time out at 60s on typical connections.

Updated num_features (all 14 datasets)

| Dataset | Raw features | After OHE | Key categorical expansions |
|---|---|---|---|
| tolokers-2 | 16 | 19 | education (4 levels) |
| hm-categories | 35 | 120 | colour_group (32), graphical_appearance (29), perceived_colour_master (20) |
| hm-prices | 41 | 264 | 11 product/style categoricals |
| twitch-views | 4 | 24 | language (21 levels) |
| city-roads-M | 26 | 68 | region_id (23), speed_limit (10), category (7) |
| city-roads-L | 26 | 207 | region_id (156), speed_limit (14) |
| city-reviews | 37 | 204 | feature_18 (83 browser types), feature_19 (36) |
| artnet-exp | 75 | 75 | All 27 categoricals are binary → no expansion |
| artnet-views | 50 | 50 | All 27 categoricals are binary → no expansion |
| avazu-ctr | 260 | 260 | No categorical features |
| pokec-regions | 56 | 56 | All 54 categoricals are binary → no expansion |
| web-fraud | 266 | 1179 | Website zone/category features (4.4x expansion) |
| web-topics | 263 | 528 | Topic categoricals |
| web-traffic | 267 | 1180 | Traffic category features |

Experimental validation

Ran TopoTune on tolokers-2 (binary classification: predicting crowdworker bans) with default parameters:

| Parameter | Value |
|---|---|
| Model | cell/topotune (GCN backbone, 2 GNN layers, 32 hidden channels) |
| Lifting | CellCycleLifting (max_cell_length=4) |
| Feature lifting | ProjectionSum |
| Neighborhoods | up_adjacency-1, up_incidence-0, down_incidence-2, 2-up_adjacency-0 |
| Optimizer | Adam (lr=0.001), StepLR scheduler (step=50, γ=0.5) |
| Training | 100 epochs, seed=42, CPU |

Test results:

| Metric | Value |
|---|---|
| Accuracy | 78.1% |
| AUROC | 70.7% |
| Precision | 64.7% |
| Recall | 52.0% |
| Loss | 0.479 |

Notes on the experiment:

  • Simplicial clique lifting (SimplicialCliqueLifting) was intractable on this dense social graph (11.8K nodes, high clustering) — killed after 50+ minutes. Cell cycle lifting with max_cell_length=4 completed in seconds.
  • MPS acceleration (Apple Silicon) failed due to PyTorch 2.3.0's SparseMPS backend not supporting sparse tensor operations required by topological data structures (incidence/adjacency matrices). Fell back to CPU.
  • The dataset is class-imbalanced (most workers are not banned), which explains the accuracy-recall gap. The GraphLand paper reports GCN achieving ~83% accuracy, suggesting room for hyperparameter tuning.

Files changed

  • topobench/data/loaders/graph/graphland/dataset.py — Rewrote process() with proper categorical/numerical separation, one-hot encoding, and scoped imputation. Added _load_info() and _preprocess_features() methods. Added info.yaml to raw_file_names.
  • topobench/data/loaders/graph/graphland/repository/zenodo.py — Increased download timeout 60s → 300s.
  • configs/dataset/graph/*.yaml (10 files) — Updated num_features to post-encoding dimensions.
  • scripts/inspect_graphland_features.py — Utility script that downloads all 14 GraphLand datasets and computes post-encoding feature dimensions. Useful for verification and for future dataset additions.

References

  • Bazhenov, Platonov & Prokhorenkova. GraphLand: Evaluating Graph Machine Learning Models on Diverse Industrial Data. NeurIPS 2025. arXiv:2409.14500
  • GraphLand codebase: github.com/yandex-research/graphland — see dataset.py lines 496–500 for the reference one-hot encoding implementation
  • Papillon, Bernardez, Battiloro & Miolane. TopoTune: A Framework for Generalized Combinatorial Complex Neural Networks. ICML 2025. arXiv:2410.06530

Test plan

  • Verified config resolution: num_features: 19 correctly becomes in_channels: [19, 19, 19] for simplicial/cell TopoTune
  • Ran full train+test pipeline on tolokers-2 with cell/topotune (78.1% accuracy)
  • Downloaded and verified post-encoding dimensions for all 14 datasets via scripts/inspect_graphland_features.py
  • Run existing unit tests (test/data/load/test_datasetloaders.py)
  • Validate on a second GraphLand dataset (e.g., hm-categories) to confirm multiclass classification works with expanded features

🤖 Generated with Claude Code


Note

Medium Risk
Changes GraphLand dataset preprocessing to one-hot encode categorical columns and optionally drop missing targets, which can shift feature dimensionality and model behavior across multiple benchmarks. Risk is mainly around data compatibility/caching and correctness of the new feature/label handling for existing experiments.

Overview
Fixes GraphLand integration by changing GraphlandDataset.process() to read info.yaml, separate numerical vs categorical columns, apply imputation only to numerical features, and one-hot encode categoricals with OneHotEncoder(drop='if_binary') before building the PyG Data object.

Updates GraphLand dataset YAMLs to reflect post-encoding num_features, adds a small ZenodoZip helper for in-memory ZIP download/extraction, and introduces scripts/inspect_graphland_features.py to compute/verify encoded feature dimensions from the upstream Zenodo artifacts.

Adds new dataset entrypoints/config for wiki_cs via WikiCSDatasetLoader, and expands the dataset-loader test exclusions to skip additional slow/long-running datasets (including GraphLand configs).

Written by Cursor Bugbot for commit 55b17d6.

Loris697 and others added 17 commits October 8, 2025 15:45
Introducing nodes with missing y
The original GraphLand integration loaded all features from features.csv
as raw floats, treating integer-coded categorical features (e.g.,
education=3, colour_group=15) as continuous numerical values. This is
semantically incorrect and degrades model performance, as the model
would learn spurious ordinal relationships between unordered categories.

This commit implements proper feature preprocessing following the
GraphLand paper's (arXiv:2409.14500) recommended pipeline:

1. Read info.yaml bundled with each dataset to identify feature types
   (categorical_features_names, numerical_features_names,
   fraction_features_names)

2. One-hot encode categorical features using
   OneHotEncoder(drop='if_binary', sparse_output=False), which:
   - Expands multi-level categoricals into binary indicator columns
   - Collapses binary categoricals to a single column (drop='if_binary')
   - Matches the exact encoding used in the GraphLand codebase

3. Apply imputation (SimpleImputer) only to numerical features, not to
   categoricals (which have no missing values by construction)

4. Concatenate features as [numerical | one_hot_categorical]

5. Update num_features in all 14 YAML configs to reflect post-encoding
   dimensions, which were computed by downloading each dataset and
   running the actual encoding pipeline

The num_features changes are significant for several datasets:
- web-fraud: 266 -> 1179 (website zone/category features expand 4.4x)
- web-traffic: 267 -> 1180
- hm-prices: 41 -> 264 (11 product categoricals)
- city-roads-L: 26 -> 207 (region_id has 156 unique values)
- city-reviews: 37 -> 204
- hm-categories: 35 -> 120
- city-roads-M: 26 -> 68

Datasets with only binary categoricals are unchanged:
- artnet-exp (75), artnet-views (50), pokec-regions (56), avazu-ctr (260)

Also increases Zenodo download timeout from 60s to 300s to accommodate
larger datasets, and adds info.yaml to raw_file_names for proper
cache invalidation.

Includes scripts/inspect_graphland_features.py utility for computing
post-encoding dimensions across all 14 GraphLand datasets.

Validated by running TopoTune (cell/topotune, GCN backbone,
CellCycleLifting) on tolokers-2, achieving 78.1% test accuracy and
70.7% AUROC on the binary classification task.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 22, 2026 01:53
@anilkeshwani
Author

@levtelyatnikov here's the fully automated PR that we were talking about.


Copilot AI left a comment


Pull request overview

This PR updates the GraphLand dataset integration to correctly distinguish categorical vs numerical node features (and one-hot encode categoricals) so models no longer treat integer-coded categories as continuous values; it also updates GraphLand dataset configs to reflect the post-encoding num_features.

Changes:

  • Reworked GraphlandDataset.process() to load info.yaml, one-hot encode categorical features, and scope imputation to numerical features.
  • Increased Zenodo ZIP download timeout and added a utility script to compute post-encoding feature dimensions.
  • Updated GraphLand YAML configs’ num_features values (and added a WikiCS loader/config as well).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.

| File | Description |
|---|---|
| topobench/data/loaders/graph/graphland/dataset.py | Implements GraphLand preprocessing using info.yaml + OHE, and updates dataset processing/download behavior. |
| topobench/data/loaders/graph/graphland/repository/zenodo.py | Adds a minimal in-memory Zenodo ZIP downloader with longer timeout. |
| topobench/data/loaders/graph/graphland_dataset.py | Introduces a loader wrapper for GraphlandDataset used by Hydra configs. |
| topobench/data/loaders/graph/wiki_cs.py | Adds a loader for PyG’s WikiCS dataset. |
| scripts/inspect_graphland_features.py | Utility script to compute post-OHE num_features across GraphLand datasets. |
| test/data/load/test_datasetloaders.py | Expands dataset exclusions for loader tests (but currently introduces a syntax issue). |
| configs/dataset/graph/wiki_cs.yaml | Adds dataset config for WikiCS. |
| configs/dataset/graph/artnet-exp.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/artnet-views.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/avazu-ctr.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-reviews.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-roads-L.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-roads-M.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/hm-categories.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/hm-prices.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/pokec-regions.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/tolokers-2.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/twitch-views.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-fraud.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-topics.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-traffic.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |


Comment on lines +288 to +292

```python
@property
def processed_paths(self):
    """The processed data directory path."""
    return [os.path.join(self.root, "processed")]
```

Copilot AI Mar 22, 2026

GraphlandDataset overrides processed_paths to return the processed directory rather than the processed file path(s). PyG’s InMemoryDataset uses processed_paths to decide whether processing is needed; if the directory exists but data.pt does not, the dataset can incorrectly skip process() and then torch.load(...) will fail. Remove this processed_paths override (or override processed_dir instead) and rely on the base class’s processed_paths (which should be <root>/processed/<processed_file_names>).
Comment on lines +147 to +149

```python
# Numerical features (includes fraction features)
num_cols = [c for c in num_names if c in feats_df.columns]
```

Copilot AI Mar 22, 2026

The docstring states that fraction_features_names should be included with numerical features, but _preprocess_features() only uses numerical_features_names. If info.yaml provides fraction features separately, they’ll be silently dropped from x. Consider merging numerical_features_names + fraction_features_names (and ensuring a stable column order) so the processed feature matrix matches the GraphLand preprocessing spec and the updated num_features values.

Suggested change:

```python
frac_names = info.get("fraction_features_names", [])
# Merge numerical and fraction features, preserving order and removing duplicates
all_num_names = list(dict.fromkeys(num_names + frac_names))
num_cols = [c for c in all_num_names if c in feats_df.columns]
```
Comment on lines +211 to +232

```python
if is_integer_dtype(
    targ_values.fillna(0)
) or targ_values.fillna(0).apply(float.is_integer).all():
    y = torch.tensor(
        targs_df.values, dtype=torch.long
    ).squeeze()
else:
    y = torch.tensor(
        targs_df.values, dtype=torch.double
    ).squeeze()

# Drop nodes with missing targets
if self.drop_missing_y:
    mask = ~torch.tensor(targ_values.isna().values)
    x = x[mask]
    y = y[mask]
    feats_df = feats_df[mask.numpy()]

    old_to_new = {
        old: new
        for new, old in enumerate(
            mask.numpy().nonzero()[0]
```

Copilot AI Mar 22, 2026

y is cast to torch.long/torch.double before applying drop_missing_y. If targets.csv contains NaNs (GraphLand often has missing labels), converting to torch.long will raise (NaN → integer) and prevent processing. Apply the missing-target mask to targs_df/targ_values before dtype inference and tensor conversion, or fill missing targets with a sentinel only when drop_missing_y=False.

Suggested change:

```python
# Apply missing-target handling before dtype inference / tensor conversion
non_missing_mask = None
if self.drop_missing_y:
    # Boolean mask over original nodes: True where target is present
    non_missing_mask = ~targ_values.isna().values
    # Filter targets down to nodes with observed labels
    targs_df = targs_df[non_missing_mask]
    targ_values = targ_values[non_missing_mask]

if is_integer_dtype(
    targ_values.fillna(0)
) or targ_values.fillna(0).apply(float.is_integer).all():
    # Integer-like targets
    if self.drop_missing_y:
        # After filtering, no NaNs remain
        target_array = targs_df.values
    else:
        # Keep missing labels; represent them with a sentinel value
        target_array = targs_df.fillna(-1).values
    y = torch.tensor(
        target_array, dtype=torch.long
    ).squeeze()
else:
    # Continuous targets can safely remain as floating point (NaNs allowed)
    y = torch.tensor(
        targs_df.values, dtype=torch.double
    ).squeeze()

# Drop nodes with missing targets
if self.drop_missing_y:
    # Use the original-length boolean mask for features / graph structure
    mask = torch.from_numpy(non_missing_mask)
    x = x[mask]
    # y has already been filtered when building targs_df, so no further
    # masking is required here.
    feats_df = feats_df[non_missing_mask]

    old_to_new = {
        old: new
        for new, old in enumerate(
            non_missing_mask.nonzero()[0]
```
Comment on lines +227 to +246

```python
feats_df = feats_df[mask.numpy()]

old_to_new = {
    old: new
    for new, old in enumerate(
        mask.numpy().nonzero()[0]
    )
}

edges_df = edges_df[
    edges_df["source"].isin(old_to_new.keys())
    & edges_df["target"].isin(old_to_new.keys())
].copy()

edges_df["source"] = edges_df["source"].map(
    old_to_new
)
edges_df["target"] = edges_df["target"].map(
    old_to_new
)
```

Copilot AI Mar 22, 2026

When dropping missing targets, the old_to_new mapping is built from mask.numpy().nonzero()[0], i.e. row positions, but edges are filtered/mapped using edges_df['source']/['target'] values. This only works if node IDs are guaranteed to be contiguous 0..N-1 and ordered exactly like features.csv. To make this robust, build the mapping from the actual node IDs (e.g., kept_node_ids = feats_df.index[mask.numpy()]) and map edges using those IDs.
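The reviewer’s suggestion can be illustrated with a small self-contained example. The node IDs and edges below are invented, and it assumes node IDs live in the DataFrame index:

```python
# Toy illustration of mapping edges via actual node IDs (the DataFrame index)
# rather than row positions; IDs and edges are invented for this sketch.
import numpy as np
import pandas as pd

feats_df = pd.DataFrame({"f": [0.1, 0.2, 0.3, 0.4]}, index=[10, 11, 12, 13])
mask = np.array([True, False, True, True])  # node 11 has a missing target

kept_node_ids = feats_df.index[mask]  # [10, 12, 13]
old_to_new = {old: new for new, old in enumerate(kept_node_ids)}

edges_df = pd.DataFrame({"source": [10, 11, 12], "target": [12, 13, 10]})
# Drop edges touching removed nodes, then relabel with the compact indices
edges_df = edges_df[
    edges_df["source"].isin(old_to_new.keys())
    & edges_df["target"].isin(old_to_new.keys())
].copy()
edges_df["source"] = edges_df["source"].map(old_to_new)
edges_df["target"] = edges_df["target"].map(old_to_new)
print(edges_df.values.tolist())  # [[0, 1], [1, 0]]
```

Because the mapping is keyed by real node IDs, this remains correct even when IDs are non-contiguous or out of order.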
Comment on lines +37 to +42

```python
dataset = GraphlandDataset(
    root = str(self.root_data_dir),
    name = self.parameters.data_name,
    drop_missing_y = self.parameters.get("drop_missing_y", True),
    impute_missing_x = self.parameters.get("impute_missing_x", None),
)
```

Copilot AI Mar 22, 2026

impute_missing_x is read from the Hydra config (parameters.impute_missing_x) and passed straight into GraphlandDataset. In the YAMLs it’s a DictConfig with _target_: sklearn.impute.SimpleImputer, so GraphlandDataset._preprocess_features() will later call .fit_transform() on a DictConfig and crash. Instantiate the imputer in the loader (e.g., hydra.utils.instantiate(self.parameters.impute_missing_x) when present) before passing it to GraphlandDataset.
@@ -0,0 +1,149 @@

```python
# /// script
# requires-python = ">=3.12"
```

Copilot AI Mar 22, 2026

The uv-style script header declares requires-python = ">=3.12", but the repository’s pyproject.toml specifies requires-python = ">= 3.10". This mismatch can confuse users/tooling and suggests the script won’t run in supported environments. Consider lowering the script requirement to >=3.10 (if compatible) or removing the header and documenting any extra requirements in the docstring instead.

Suggested change:

```python
# requires-python = ">=3.10"
```
Comment on lines +75 to +82

```python
print(f" Task: {info_data.get('task', 'unknown')}")
print(f" Metric: {info_data.get('metric', 'unknown')}")
print(f" Nodes: {len(feats_data)}")
print(f" Raw features: {len(feats_df.columns) if 'feats_df' in dir() else feats_data.shape[1]}")
print(f" Numerical features: {len(num_names)}")
print(f" Fraction features: {len(frac_names)}")
print(f" Categorical features: {len(cat_names)}")
```

Copilot AI Mar 22, 2026

download_and_inspect() prints raw feature count using feats_df, but that variable is never defined in this function. Right now it falls back to feats_data.shape[1], but the conditional ('feats_df' in dir()) is dead/opaque and makes the script harder to trust. Replace that print with a direct feats_data.shape[1] (or rename consistently) so the script output is deterministic and maintainable.
```python
for info in zf.infolist():
    name = info.filename.replace("\\", "/")
    if name.endswith("/") or name.startswith("__MACOSX/"):
        continue # skip dirs and macOS metadata
```

Copilot AI Mar 22, 2026

Inline comment spacing: continue # ... has only one space before #, which will be flagged by ruff (E261). Add two spaces before the inline comment (or move the comment to its own line) to satisfy the repo’s lint configuration.

Suggested change:

```python
        continue  # skip dirs and macOS metadata
```
Comment on lines +38 to +41

```python
root = str(self.root_data_dir),
name = self.parameters.data_name,
drop_missing_y = self.parameters.get("drop_missing_y", True),
impute_missing_x = self.parameters.get("impute_missing_x", None),
```

Copilot AI Mar 22, 2026

PEP8/ruff: keyword arguments should not have spaces around = (E251). This file uses root = ..., name = ..., etc., which is inconsistent with other loaders (e.g., PlanetoidDatasetLoader) and may fail linting. Remove the extra spaces in keyword arguments.

Suggested change:

```python
root=str(self.root_data_dir),
name=self.parameters.data_name,
drop_missing_y=self.parameters.get("drop_missing_y", True),
impute_missing_x=self.parameters.get("impute_missing_x", None),
```
Comment on lines +1 to +8

```python
from omegaconf import DictConfig
from torch_geometric.data import Dataset
from torch_geometric.datasets import WikiCS

from topobench.data.loaders.base import AbstractLoader


class WikiCSDatasetLoader(AbstractLoader):
```

Copilot AI Mar 22, 2026

The PR title/description focus on fixing GraphLand feature encoding and updating num_features, but this PR also adds a new WikiCS loader and dataset config (wiki_cs.py / wiki_cs.yaml). If WikiCS changes are intentional, please mention them in the PR description (or split into a separate PR) so reviewers know why they’re included.

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


"cocitation_pubmed.yaml", 'minesweeper.yaml', 'roman_empire.yaml',
'tolokers.yaml'
# Avoid datasets that take too long to load
"artnet-views.yaml",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comma causes silent string concatenation in exclude set

High Severity

'tolokers.yaml' on line 50 is missing a trailing comma before the comment on line 51. In Python, adjacent string literals inside brackets are implicitly concatenated, even across lines with comments. This causes 'tolokers.yaml' and "artnet-views.yaml" to merge into the single string 'tolokers.yamlartnet-views.yaml', which means 'tolokers.yaml' is never actually added to the exclude_datasets set and won't be excluded from test runs.

Fix in Cursor Fix in Web

"artnet-views.yaml",
"artnet-exp.yaml", "avazu-ctr.yaml", "city-reviews.yaml", "city-roads-L.yaml",
"artnet-views.yaml", "hm-prices.yaml", "hm-categories.yaml", "pokec-regions.yaml", "tolokers-2.yaml",
"twitch-views.yaml", "web-fraud.yaml", "web-traffic.yaml", "web-topics.yaml",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing city-roads-M.yaml from test exclude list

Medium Severity

city-roads-M.yaml is absent from the exclude_datasets set, even though all other 13 GraphLand dataset configs are listed for exclusion. This means the test suite will attempt to download and process the city-roads-M dataset from Zenodo, which will likely cause test timeouts or failures in CI environments without network access.

Fix in Cursor Fix in Web
