Fix GraphLand categorical feature encoding and update num_features#289
Fix GraphLand categorical feature encoding and update num_features#289anilkeshwani wants to merge 17 commits intogeometric-intelligence:mainfrom
Conversation
Introducing nodes with missing y
The original GraphLand integration loaded all features from features.csv as raw floats, treating integer-coded categorical features (e.g., education=3, colour_group=15) as continuous numerical values. This is semantically incorrect and degrades model performance, as the model would learn spurious ordinal relationships between unordered categories. This commit implements proper feature preprocessing following the GraphLand paper's (arXiv:2409.14500) recommended pipeline: 1. Read info.yaml bundled with each dataset to identify feature types (categorical_features_names, numerical_features_names, fraction_features_names) 2. One-hot encode categorical features using OneHotEncoder(drop='if_binary', sparse_output=False), which: - Expands multi-level categoricals into binary indicator columns - Collapses binary categoricals to a single column (drop='if_binary') - Matches the exact encoding used in the GraphLand codebase 3. Apply imputation (SimpleImputer) only to numerical features, not to categoricals (which have no missing values by construction) 4. Concatenate features as [numerical | one_hot_categorical] 5. Update num_features in all 14 YAML configs to reflect post-encoding dimensions, which were computed by downloading each dataset and running the actual encoding pipeline The num_features changes are significant for several datasets: - web-fraud: 266 -> 1179 (website zone/category features expand 4.4x) - web-traffic: 267 -> 1180 - hm-prices: 41 -> 264 (11 product categoricals) - city-roads-L: 26 -> 207 (region_id has 156 unique values) - city-reviews: 37 -> 204 - hm-categories: 35 -> 120 - city-roads-M: 26 -> 68 Datasets with only binary categoricals are unchanged: - artnet-exp (75), artnet-views (50), pokec-regions (56), avazu-ctr (260) Also increases Zenodo download timeout from 60s to 300s to accommodate larger datasets, and adds info.yaml to raw_file_names for proper cache invalidation. Includes scripts/inspect_graphland_features.py utility for computing post-encoding dimensions across all 14 GraphLand datasets. Validated by running TopoTune (cell/topotune, GCN backbone, CellCycleLifting) on tolokers-2, achieving 78.1% test accuracy and 70.7% AUROC on the binary classification task. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@levtelyatnikov here's the fully automated PR that we were talking about. |
There was a problem hiding this comment.
Pull request overview
This PR updates the GraphLand dataset integration to correctly distinguish categorical vs numerical node features (and one-hot encode categoricals) so models no longer treat integer-coded categories as continuous values; it also updates GraphLand dataset configs to reflect the post-encoding num_features.
Changes:
- Reworked
GraphlandDataset.process()to loadinfo.yaml, one-hot encode categorical features, and scope imputation to numerical features. - Increased Zenodo ZIP download timeout and added a utility script to compute post-encoding feature dimensions.
- Updated GraphLand YAML configs’
num_featuresvalues (and added a WikiCS loader/config as well).
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| topobench/data/loaders/graph/graphland/dataset.py | Implements GraphLand preprocessing using info.yaml + OHE, and updates dataset processing/download behavior. |
| topobench/data/loaders/graph/graphland/repository/zenodo.py | Adds a minimal in-memory Zenodo ZIP downloader with longer timeout. |
| topobench/data/loaders/graph/graphland_dataset.py | Introduces a loader wrapper for GraphlandDataset used by Hydra configs. |
| topobench/data/loaders/graph/wiki_cs.py | Adds a loader for PyG’s WikiCS dataset. |
| scripts/inspect_graphland_features.py | Utility script to compute post-OHE num_features across GraphLand datasets. |
| test/data/load/test_datasetloaders.py | Expands dataset exclusions for loader tests (but currently introduces a syntax issue). |
| configs/dataset/graph/wiki_cs.yaml | Adds dataset config for WikiCS. |
| configs/dataset/graph/artnet-exp.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/artnet-views.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/avazu-ctr.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-reviews.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-roads-L.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/city-roads-M.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/hm-categories.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/hm-prices.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/pokec-regions.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/tolokers-2.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/twitch-views.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-fraud.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-topics.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
| configs/dataset/graph/web-traffic.yaml | Updates GraphLand num_features to reflect post-encoding dimensions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @property | ||
| def processed_paths(self): | ||
| """The processed data directory path.""" | ||
| return [os.path.join(self.root, "processed")] | ||
|
|
There was a problem hiding this comment.
GraphlandDataset overrides processed_paths to return the processed directory rather than the processed file path(s). PyG’s InMemoryDataset uses processed_paths to decide whether processing is needed; if the directory exists but data.pt does not, the dataset can incorrectly skip process() and then torch.load(...) will fail. Remove this processed_paths override (or override processed_dir instead) and rely on the base class’s processed_paths (which should be <root>/processed/<processed_file_names>).
|
|
||
| # Numerical features (includes fraction features) | ||
| num_cols = [c for c in num_names if c in feats_df.columns] |
There was a problem hiding this comment.
The docstring states that fraction_features_names should be included with numerical features, but _preprocess_features() only uses numerical_features_names. If info.yaml provides fraction features separately, they’ll be silently dropped from x. Consider merging numerical_features_names + fraction_features_names (and ensuring a stable column order) so the processed feature matrix matches the GraphLand preprocessing spec and the updated num_features values.
| # Numerical features (includes fraction features) | |
| num_cols = [c for c in num_names if c in feats_df.columns] | |
| frac_names = info.get("fraction_features_names", []) | |
| # Numerical features (includes fraction features) | |
| # Merge numerical and fraction features, preserving order and removing duplicates | |
| all_num_names = list(dict.fromkeys(num_names + frac_names)) | |
| num_cols = [c for c in all_num_names if c in feats_df.columns] |
| if is_integer_dtype( | ||
| targ_values.fillna(0) | ||
| ) or targ_values.fillna(0).apply(float.is_integer).all(): | ||
| y = torch.tensor( | ||
| targs_df.values, dtype=torch.long | ||
| ).squeeze() | ||
| else: | ||
| y = torch.tensor( | ||
| targs_df.values, dtype=torch.double | ||
| ).squeeze() | ||
|
|
||
| # Drop nodes with missing targets | ||
| if self.drop_missing_y: | ||
| mask = ~torch.tensor(targ_values.isna().values) | ||
| x = x[mask] | ||
| y = y[mask] | ||
| feats_df = feats_df[mask.numpy()] | ||
|
|
||
| old_to_new = { | ||
| old: new | ||
| for new, old in enumerate( | ||
| mask.numpy().nonzero()[0] |
There was a problem hiding this comment.
y is cast to torch.long/torch.double before applying drop_missing_y. If targets.csv contains NaNs (GraphLand often has missing labels), converting to torch.long will raise (NaN → integer) and prevent processing. Apply the missing-target mask to targs_df/targ_values before dtype inference and tensor conversion, or fill missing targets with a sentinel only when drop_missing_y=False.
| if is_integer_dtype( | |
| targ_values.fillna(0) | |
| ) or targ_values.fillna(0).apply(float.is_integer).all(): | |
| y = torch.tensor( | |
| targs_df.values, dtype=torch.long | |
| ).squeeze() | |
| else: | |
| y = torch.tensor( | |
| targs_df.values, dtype=torch.double | |
| ).squeeze() | |
| # Drop nodes with missing targets | |
| if self.drop_missing_y: | |
| mask = ~torch.tensor(targ_values.isna().values) | |
| x = x[mask] | |
| y = y[mask] | |
| feats_df = feats_df[mask.numpy()] | |
| old_to_new = { | |
| old: new | |
| for new, old in enumerate( | |
| mask.numpy().nonzero()[0] | |
| # Apply missing-target handling before dtype inference / tensor conversion | |
| non_missing_mask = None | |
| if self.drop_missing_y: | |
| # Boolean mask over original nodes: True where target is present | |
| non_missing_mask = ~targ_values.isna().values | |
| # Filter targets down to nodes with observed labels | |
| targs_df = targs_df[non_missing_mask] | |
| targ_values = targ_values[non_missing_mask] | |
| if is_integer_dtype( | |
| targ_values.fillna(0) | |
| ) or targ_values.fillna(0).apply(float.is_integer).all(): | |
| # Integer-like targets | |
| if self.drop_missing_y: | |
| # After filtering, no NaNs remain | |
| target_array = targs_df.values | |
| else: | |
| # Keep missing labels; represent them with a sentinel value | |
| target_array = targs_df.fillna(-1).values | |
| y = torch.tensor( | |
| target_array, dtype=torch.long | |
| ).squeeze() | |
| else: | |
| # Continuous targets can safely remain as floating point (NaNs allowed) | |
| y = torch.tensor( | |
| targs_df.values, dtype=torch.double | |
| ).squeeze() | |
| # Drop nodes with missing targets | |
| if self.drop_missing_y: | |
| # Use the original-length boolean mask for features / graph structure | |
| mask = torch.from_numpy(non_missing_mask) | |
| x = x[mask] | |
| # y has already been filtered when building targs_df, so no further | |
| # masking is required here. | |
| feats_df = feats_df[non_missing_mask] | |
| old_to_new = { | |
| old: new | |
| for new, old in enumerate( | |
| non_missing_mask.nonzero()[0] |
| feats_df = feats_df[mask.numpy()] | ||
|
|
||
| old_to_new = { | ||
| old: new | ||
| for new, old in enumerate( | ||
| mask.numpy().nonzero()[0] | ||
| ) | ||
| } | ||
|
|
||
| edges_df = edges_df[ | ||
| edges_df["source"].isin(old_to_new.keys()) | ||
| & edges_df["target"].isin(old_to_new.keys()) | ||
| ].copy() | ||
|
|
||
| edges_df["source"] = edges_df["source"].map( | ||
| old_to_new | ||
| ) | ||
| edges_df["target"] = edges_df["target"].map( | ||
| old_to_new | ||
| ) |
There was a problem hiding this comment.
When dropping missing targets, the old_to_new mapping is built from mask.numpy().nonzero()[0], i.e. row positions, but edges are filtered/mapped using edges_df['source']/['target'] values. This only works if node IDs are guaranteed to be contiguous 0..N-1 and ordered exactly like features.csv. To make this robust, build the mapping from the actual node IDs (e.g., kept_node_ids = feats_df.index[mask.numpy()]) and map edges using those IDs.
| dataset = GraphlandDataset( | ||
| root = str(self.root_data_dir), | ||
| name = self.parameters.data_name, | ||
| drop_missing_y = self.parameters.get("drop_missing_y", True), | ||
| impute_missing_x = self.parameters.get("impute_missing_x", None), | ||
| ) |
There was a problem hiding this comment.
impute_missing_x is read from the Hydra config (parameters.impute_missing_x) and passed straight into GraphlandDataset. In the YAMLs it’s a DictConfig with _target_: sklearn.impute.SimpleImputer, so GraphlandDataset._preprocess_features() will later call .fit_transform() on a DictConfig and crash. Instantiate the imputer in the loader (e.g., hydra.utils.instantiate(self.parameters.impute_missing_x) when present) before passing it to GraphlandDataset.
| @@ -0,0 +1,149 @@ | |||
| # /// script | |||
| # requires-python = ">=3.12" | |||
There was a problem hiding this comment.
The uv-style script header declares requires-python = ">=3.12", but the repository’s pyproject.toml specifies requires-python = ">= 3.10". This mismatch can confuse users/tooling and suggests the script won’t run in supported environments. Consider lowering the script requirement to >=3.10 (if compatible) or removing the header and documenting any extra requirements in the docstring instead.
| # requires-python = ">=3.12" | |
| # requires-python = ">=3.10" |
| print(f" Task: {info_data.get('task', 'unknown')}") | ||
| print(f" Metric: {info_data.get('metric', 'unknown')}") | ||
| print(f" Nodes: {len(feats_data)}") | ||
| print(f" Raw features: {len(feats_df.columns) if 'feats_df' in dir() else feats_data.shape[1]}") | ||
| print(f" Numerical features: {len(num_names)}") | ||
| print(f" Fraction features: {len(frac_names)}") | ||
| print(f" Categorical features: {len(cat_names)}") | ||
|
|
There was a problem hiding this comment.
download_and_inspect() prints raw feature count using feats_df, but that variable is never defined in this function. Right now it falls back to feats_data.shape[1], but the conditional ('feats_df' in dir()) is dead/opaque and makes the script harder to trust. Replace that print with a direct feats_data.shape[1] (or rename consistently) so the script output is deterministic and maintainable.
| for info in zf.infolist(): | ||
| name = info.filename.replace("\\", "/") | ||
| if name.endswith("/") or name.startswith("__MACOSX/"): | ||
| continue # skip dirs and macOS metadata |
There was a problem hiding this comment.
Inline comment spacing: continue # ... has only one space before #, which will be flagged by ruff (E261). Add two spaces before the inline comment (or move the comment to its own line) to satisfy the repo’s lint configuration.
| continue # skip dirs and macOS metadata | |
| continue # skip dirs and macOS metadata |
| root = str(self.root_data_dir), | ||
| name = self.parameters.data_name, | ||
| drop_missing_y = self.parameters.get("drop_missing_y", True), | ||
| impute_missing_x = self.parameters.get("impute_missing_x", None), |
There was a problem hiding this comment.
PEP8/ruff: keyword arguments should not have spaces around = (E251). This file uses root = ..., name = ..., etc., which is inconsistent with other loaders (e.g., PlanetoidDatasetLoader) and may fail linting. Remove the extra spaces in keyword arguments.
| root = str(self.root_data_dir), | |
| name = self.parameters.data_name, | |
| drop_missing_y = self.parameters.get("drop_missing_y", True), | |
| impute_missing_x = self.parameters.get("impute_missing_x", None), | |
| root=str(self.root_data_dir), | |
| name=self.parameters.data_name, | |
| drop_missing_y=self.parameters.get("drop_missing_y", True), | |
| impute_missing_x=self.parameters.get("impute_missing_x", None), |
| from omegaconf import DictConfig | ||
| from torch_geometric.data import Dataset | ||
| from torch_geometric.datasets import WikiCS | ||
|
|
||
| from topobench.data.loaders.base import AbstractLoader | ||
|
|
||
|
|
||
| class WikiCSDatasetLoader(AbstractLoader): |
There was a problem hiding this comment.
The PR title/description focus on fixing GraphLand feature encoding and updating num_features, but this PR also adds a new WikiCS loader and dataset config (wiki_cs.py / wiki_cs.yaml). If WikiCS changes are intentional, please mention them in the PR description (or split into a separate PR) so reviewers know why they’re included.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| "cocitation_pubmed.yaml", 'minesweeper.yaml', 'roman_empire.yaml', | ||
| 'tolokers.yaml' | ||
| # Avoid datasets that take too long to load | ||
| "artnet-views.yaml", |
There was a problem hiding this comment.
Missing comma causes silent string concatenation in exclude set
High Severity
'tolokers.yaml' on line 50 is missing a trailing comma before the comment on line 51. In Python, adjacent string literals inside brackets are implicitly concatenated, even across lines with comments. This causes 'tolokers.yaml' and "artnet-views.yaml" to merge into the single string 'tolokers.yamlartnet-views.yaml', which means 'tolokers.yaml' is never actually added to the exclude_datasets set and won't be excluded from test runs.
| "artnet-views.yaml", | ||
| "artnet-exp.yaml", "avazu-ctr.yaml", "city-reviews.yaml", "city-roads-L.yaml", | ||
| "artnet-views.yaml", "hm-prices.yaml", "hm-categories.yaml", "pokec-regions.yaml", "tolokers-2.yaml", | ||
| "twitch-views.yaml", "web-fraud.yaml", "web-traffic.yaml", "web-topics.yaml", |
There was a problem hiding this comment.
Missing city-roads-M.yaml from test exclude list
Medium Severity
city-roads-M.yaml is absent from the exclude_datasets set, even though all other 13 GraphLand dataset configs are listed for exclusion. This means the test suite will attempt to download and process the city-roads-M dataset from Zenodo, which will likely cause test timeouts or failures in CI environments without network access.


Summary
This PR fixes a critical feature preprocessing issue in the GraphLand benchmark integration (PR #192) and validates the fix by running TopoTune experiments.
The core problem: PR #192 loads all GraphLand features from
features.csvas raw floats, treating integer-coded categorical features (e.g.,education=3,colour_group=15,region_id=42) as continuous numerical values. This is semantically incorrect — the model learns spurious ordinal relationships between unordered categories (e.g., thateducation=3is "three times"education=1), which degrades representation quality and downstream task performance.The fix: Implements proper feature preprocessing following the pipeline explicitly recommended by the GraphLand authors (Bazhenov, Platonov & Prokhorenkova, arXiv:2409.14500, NeurIPS 2025 Datasets & Benchmarks):
info.yamlbundled with each dataset ZIP to identifycategorical_features_namesvsnumerical_features_namesOneHotEncoder(drop='if_binary', sparse_output=False), matching the exact encoding in the GraphLand codebasenum_featuresin all 14 YAML configs to reflect post-encoding dimensionsMotivation and scientific context
The GraphLand benchmark was designed to evaluate graph ML models on real-world industrial data with rich heterogeneous features — a deliberate departure from academic benchmarks like Cora/CiteSeer where all features are homogeneous (bag-of-words). The mixed numerical/categorical feature structure is a defining characteristic of GraphLand and central to its research contribution. The paper states (Appendix B.3):
Passing raw integer-coded categoricals to a neural network (GNN or TNN) defeats the purpose of this benchmark, as the model cannot distinguish between ordinal and nominal features. One-hot encoding is the standard approach for neural models, and
drop='if_binary'avoids redundant columns for binary indicators (e.g.,is_paved,age_is_nan).Design decisions
1. One-hot encoding in
process(), not as a TopoBench transformThe encoding is performed inside
GraphlandDataset.process()rather than as a separate TopoBench data manipulation transform. This is because:info.yaml), not on graph topologynum_features(e.g., feature lifting viaProjectionSum)infer_in_channelsconfig resolver readsdataset.parameters.num_featuresat config time, so the YAML value must reflect the post-encoded dimension2.
info.yamladded toraw_file_namesAdded
info.yamlto theraw_file_namesproperty so PyG's caching mechanism correctly detects when the metadata file is missing and triggers a re-download. The file is bundled in every GraphLand Zenodo ZIP alongside the CSVs.3. Imputation scoping
The
SimpleImputeris now applied only to numerical features, not to categorical columns. Categorical features in GraphLand have no missing values by construction (they are integer-coded categories), and imputing them withmost_frequentcould introduce semantic errors.4. Zenodo download timeout
Increased from 60s to 300s. Several GraphLand datasets (web-fraud, hm-prices, avazu-ctr) exceed 50MB and reliably time out at 60s on typical connections.
Updated
num_features(all 14 datasets)Experimental validation
Ran TopoTune on
tolokers-2(binary classification: predicting crowdworker bans) with default parameters:cell/topotune(GCN backbone, 2 GNN layers, 32 hidden channels)CellCycleLifting(max_cell_length=4)ProjectionSumup_adjacency-1,up_incidence-0,down_incidence-2,2-up_adjacency-0Test results:
Notes on the experiment:
SimplicialCliqueLifting) was intractable on this dense social graph (11.8K nodes, high clustering) — killed after 50+ minutes. Cell cycle lifting withmax_cell_length=4completed in seconds.SparseMPSbackend not supporting sparse tensor operations required by topological data structures (incidence/adjacency matrices). Fell back to CPU.Files changed
topobench/data/loaders/graph/graphland/dataset.py— Rewroteprocess()with proper categorical/numerical separation, one-hot encoding, and scoped imputation. Added_load_info()and_preprocess_features()methods. Addedinfo.yamltoraw_file_names.topobench/data/loaders/graph/graphland/repository/zenodo.py— Increased download timeout 60s → 300s.configs/dataset/graph/*.yaml(10 files) — Updatednum_featuresto post-encoding dimensions.scripts/inspect_graphland_features.py— Utility script that downloads all 14 GraphLand datasets and computes post-encoding feature dimensions. Useful for verification and for future dataset additions.References
dataset.pylines 496–500 for the reference one-hot encoding implementationTest plan
num_features: 19correctly becomesin_channels: [19, 19, 19]for simplicial/cell TopoTunescripts/inspect_graphland_features.pytest/data/load/test_datasetloaders.py)🤖 Generated with Claude Code
Note
Medium Risk
Changes GraphLand dataset preprocessing to one-hot encode categorical columns and optionally drop missing targets, which can shift feature dimensionality and model behavior across multiple benchmarks. Risk is mainly around data compatibility/caching and correctness of the new feature/label handling for existing experiments.
Overview
Fixes GraphLand integration by changing
GraphlandDataset.process()to readinfo.yaml, separate numerical vs categorical columns, apply imputation only to numerical features, and one-hot encode categoricals withOneHotEncoder(drop='if_binary')before building the PyGDataobject.Updates GraphLand dataset YAMLs to reflect post-encoding
num_features, adds a smallZenodoZiphelper for in-memory ZIP download/extraction, and introducesscripts/inspect_graphland_features.pyto compute/verify encoded feature dimensions from the upstream Zenodo artifacts.Adds new dataset entrypoints/config for
wiki_csviaWikiCSDatasetLoader, and expands the dataset-loader test exclusions to skip additional slow/long-running datasets (including GraphLand configs).Written by Cursor Bugbot for commit 55b17d6. This will update automatically on new commits. Configure here.