Add GNN calorimeter clustering (split-module design) #1823
Adds a Graph-Neural-Network-based calorimeter clustering algorithm
that runs alongside the existing seed+BFS CaloClusterMaker -- both
producers run, both emit a CaloClusterCollection, downstream consumers
select via (module_label, instance_name). The GNN output ships under
instance name "GNN" so existing BFS-reading analyses are untouched.
Following the design-meeting outcome (with Andy Edmonds + Sophie
Middleton, 2026-04-29), the algorithm is split into two art modules
joined by a transient data product:
CaloHitMaker -> CaloHitCollection
    |
    +-- CaloClusterMaker (existing, BFS, untouched)
    |     -> CaloClusterCollection ("")
    |
    +-- CaloHitGraphMaker (NEW, step 1)
          -> CaloHitGraphCollection (transient)
                |
                v
        CaloClusterMakerGNN (NEW, steps 2 + 3)
          -> CaloClusterCollection ("GNN")
New files:
* RecoDataProducts/inc/CaloHitGraph.hh
  Per-disk graph data product. Carries the three normalised tensors
  the ONNX model expects (x, edge_index, edge_attr, all flat) plus
  per-node art::Ptr<CaloHit> back-references. Transient -- not
  registered for ROOT serialisation.
* CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc}
  Step-1 helper. Per-disk hit collection from the CaloHitCollection;
  six node features (log E, t, x, y, r, e_rel) and eight edge
  features computed from the Calorimeter geometry service; radius
  graph at r_max=210 mm via brute-force pairwise distance loop
  (faithful to scipy.spatial.cKDTree at N <= ~65 hits/disk); time
  filter |dt| <= 25 ns; kNN fallback at k_min=3; per-source-node
  degree cap at k_max=20; z-score normalisation using train-split
  statistics loaded from a JSON sidecar.
* CaloCluster/src/CaloHitGraphMaker_module.cc
  The step-1 EDProducer. Consumes CaloHitCollection, partitions by
  disk, runs GnnGraphBuilder once per disk, emits the
  CaloHitGraphCollection. Norm sidecar resolved by
  ConfigFileLookupPolicy.
* CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc}
  Step-3 helper (CCN+BFS10 recipe): sigmoid + symmetrise directed
  edge logits; threshold at tau_edge; BFS traversal seeded from
  highest-energy hits, with the bfs_expand_cut=10 MeV ExpandCut
  rule (hits below cut join the cluster but do not recruit further
  neighbours -- mirrors Offline ClusterFinder semantics); cleanup
  by min_hits and min_energy_mev; relabel to contiguous IDs.
* CaloCluster/src/CaloClusterMakerGNN_module.cc
  The step-2 + step-3 EDProducer. Loads the ONNX session in the
  constructor (via Ort::Env / Ort::SessionOptions / Ort::Session
  in member-declaration order, which is RAII-safe). Asserts the
  loaded model's metadata_props (model_version, node_features,
  edge_features) against FHiCL expectations -- silent tensor-layout
  drift after a retraining is caught loudly. produce() runs ONNX
  inference per disk (zero-copy tensor views over the CaloHitGraph
  payload), invokes GnnClusterAssembler, then builds CaloClusters
  via the existing ClusterUtils linear cog3Vector helper.
  Class is model-agnostic: production declares one instance with
  the CCN .onnx; A/B comparison jobs declare a second instance with
  sen.onnx and a different tau_edge.
* CaloCluster/data/calo_cluster_net_v2_stage1.onnx (2.6 MB)
* CaloCluster/data/calo_cluster_net_v2_stage1.norm.json (~1 KB)
* CaloCluster/data/simple_edge_net_v2.onnx (0.84 MB)
  Trained model artifacts. Resolved at runtime by
  ConfigFileLookupPolicy. The .onnx files carry the
  metadata_props deployment contract; the .json carries the
  train-split z-score normalisation statistics.
* CaloCluster/fcl/prolog.fcl
  New CaloClusterGNN block defining caloHitGraphMakerGNN +
  caloClusterMakerGNN with the frozen CCN+BFS10 recipe defaults.
  Production FCLs include the bundled sequence:
    physics.<reco-path> : [ ..., @sequence::CaloClusterGNN.Reco ]
* CaloCluster/fcl/from_mcs-gnn-prod.fcl
  Production-style standalone FCL that runs the GNN chain on MCS
  art-format input and writes both BFS and GNN CaloClusterCollections
  to the output art file.
* CaloCluster/src/SConscript
  Adds 'onnxruntime' to the plugins dependency list. Picks up the
  central muse onnxruntime install via the u092 qualifier.
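
As a rough illustration of the step-1 graph construction described for GnnGraphBuilder, the sketch below builds a directed radius graph with the time filter and the per-source degree cap. It is not the code in this PR: `HitXYT` and `buildRadiusGraph` are hypothetical names, the nearest-first ordering used for the cap and the inclusive `<=` on the radius are assumptions, and the kNN fallback, feature computation and normalisation are left out for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-hit quantities used here;
// this is not the Offline CaloHit class.
struct HitXYT {
  double x;  // mm, in the disk plane
  double y;  // mm, in the disk plane
  double t;  // ns
};

using Edge = std::pair<int, int>;  // (source, target)

// Brute-force directed radius graph: i -> j when the two hits are within
// rMax in the disk plane and within dtMax in time. Each source keeps at most
// kMax targets, nearest first (ordering and tie-breaking are assumptions).
std::vector<Edge> buildRadiusGraph(const std::vector<HitXYT>& hits,
                                   double rMax = 210.0, double dtMax = 25.0,
                                   std::size_t kMax = 20) {
  std::vector<Edge> edges;
  const int n = static_cast<int>(hits.size());
  for (int i = 0; i < n; ++i) {
    std::vector<std::pair<double, int>> candidates;  // (distance, target)
    for (int j = 0; j < n; ++j) {
      if (i == j) continue;
      const double d = std::hypot(hits[j].x - hits[i].x, hits[j].y - hits[i].y);
      const double dt = std::abs(hits[j].t - hits[i].t);
      if (d <= rMax && dt <= dtMax) candidates.emplace_back(d, j);
    }
    std::sort(candidates.begin(), candidates.end());
    if (candidates.size() > kMax) candidates.resize(kMax);  // degree cap
    for (const auto& c : candidates) edges.emplace_back(i, c.second);
  }
  return edges;
}
```

The O(N^2) loop is the point of the design note above: at the quoted occupancy of roughly 65 hits per disk or fewer, a pairwise loop is cheap and avoids a spatial-index dependency.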
Build dependency: the central muse `onnxruntime` package; activate
with `muse setup -q u092` (the qualifier providing the central
onnxruntime hook, mirroring the pattern in Mu2e/ArtAnalysis#4).
Training repo: see Mu2e/MLTrain CaloClusterGNN/.
Training-data branch dependency: Mu2e/EventNtuple#366 adds
calomcsim.ancestorSimIds, used by the truth-labelling step in
training only -- this Offline-side code does not depend on the
EventNtuple PR landing.
Headline test-set numbers (276,688 events / 481,543 disk-graphs,
calo-entrant truth, E_reco >= 50 MeV downstream cut):
| Metric | BFS | CCN+BFS10 | Change |
|-----------------------|-------|-----------|--------|
| Mean abs(dE) / MeV | 0.839 | 0.616 | -27% |
| 95th-pct abs(dE) / MeV| 3.520 | 2.338 | -34% |
| Mean centroid dr / mm | 1.589 | 1.292 | -19% |
| 95th-pct dr / mm | 3.606 | 2.294 | -36% |
Two test artefacts plus their FHiCL wiring, exercising the C++
implementation against the Python pipeline that trained the model.
* CaloCluster/src/testGnnClusterAssembler_main.cc
  Standalone executable (built via helper.make_bin) that loads a
  JSON parity payload produced by the training repo's
  scripts/dump_parity_payloads.py, replays GnnClusterAssembler on
  each disk-graph, and asserts byte-identical cluster_labels
  against the Python reference. Stage-2 of the parity gate.
  Expected output:
    graphs: N
    mismatch graphs: 0
    mismatch nodes: 0
    [PASS] all N graphs match Python cluster_labels byte-exactly
* CaloCluster/src/CaloHitGraphParityDump_module.cc
  art::EDAnalyzer that consumes a CaloClusterCollection emitted by
  CaloClusterMakerGNN plus the source CaloHitCollection and writes
  a flat TTree (per event-disk: crystalIDs, time, eDep, GNN
  cluster labels). Used by the training-repo Python script
  scripts/compare_parity_dump.py to replay the same hits through
  the Python pipeline and assert byte-exact agreement on cluster
  labels end-to-end. Stage-3 of the parity gate.
* CaloCluster/fcl/from_mcs-gnn-test.fcl
  Drives the parity-dump analyzer over MCS art-format input. Also
  serves as a minimal smoke-test for the C++ pipeline; outputs a
  parity_dump.root TTree.
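
For orientation, the recipe that GnnClusterAssembler implements, and that the Stage-2 test replays, can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's code: `assembleClusters` is a hypothetical name, the two directed probabilities of a pair are averaged (a missing reverse edge counts as zero), hits rejected by the cleanup get label -1 and are not offered to later clusters, and a seed below the expand cut forms a single-hit cluster.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Returns one label per hit: 0, 1, 2, ... for kept clusters, -1 otherwise.
std::vector<int> assembleClusters(const std::vector<double>& energyMeV,
                                  const std::vector<std::pair<int, int>>& edges,
                                  const std::vector<double>& logits,
                                  double tauEdge, double expandCutMeV = 10.0,
                                  std::size_t minHits = 1,
                                  double minEnergyMeV = 0.0) {
  const int n = static_cast<int>(energyMeV.size());

  // Sigmoid on each directed logit, then average the two directions of a pair.
  std::map<std::pair<int, int>, double> pairProb;
  for (std::size_t e = 0; e < edges.size(); ++e) {
    const double p = 1.0 / (1.0 + std::exp(-logits[e]));
    const int a = std::min(edges[e].first, edges[e].second);
    const int b = std::max(edges[e].first, edges[e].second);
    pairProb[{a, b}] += 0.5 * p;
  }

  // Keep pairs at or above the threshold as an undirected adjacency list.
  std::vector<std::vector<int>> adj(n);
  for (const auto& kv : pairProb) {
    if (kv.second >= tauEdge) {
      adj[kv.first.first].push_back(kv.first.second);
      adj[kv.first.second].push_back(kv.first.first);
    }
  }

  // Seeds are visited in decreasing energy order.
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(),
                   [&](int a, int b) { return energyMeV[a] > energyMeV[b]; });

  std::vector<int> label(n, -1);
  int nextLabel = 0;
  for (int seed : order) {
    if (label[seed] != -1) continue;
    std::vector<int> members;
    double eSum = 0.0;
    std::queue<int> frontier;
    label[seed] = nextLabel;
    frontier.push(seed);
    while (!frontier.empty()) {
      const int u = frontier.front();
      frontier.pop();
      members.push_back(u);
      eSum += energyMeV[u];
      // ExpandCut: hits below the cut join but do not recruit neighbours.
      if (energyMeV[u] < expandCutMeV) continue;
      for (int v : adj[u]) {
        if (label[v] == -1) {
          label[v] = nextLabel;
          frontier.push(v);
        }
      }
    }
    if (members.size() < minHits || eSum < minEnergyMeV) {
      for (int m : members) label[m] = -2;  // rejected by cleanup
    } else {
      ++nextLabel;  // only kept clusters advance, so IDs stay contiguous
    }
  }
  for (int& l : label) {
    if (l == -2) l = -1;
  }
  return label;
}
```

Because every choice flagged above changes the labels, a byte-exact parity test against the Python reference is a sensible way to pin them down.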
Both stages already pass on real data:
* Stage-2 (assembler-only on packed val graphs): 100/100 disk-graphs
byte-exact (1,147 nodes, 2,768 edges).
* Stage-3 (full mu2e art job on MCS art files, via from_mcs-gnn-test.fcl
+ the training-repo Python comparison): 100/100 disk-graphs
byte-exact (8,502 hits over 50 events).
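
The pass criterion in both stages is exact per-node label equality, not agreement up to a relabelling. A minimal sketch of such a comparison, producing the same three counters as the expected output of the Stage-2 executable, is shown below; `ParitySummary` and `compareLabels` are hypothetical names, and reading the JSON payload or the TTree is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Labels = std::vector<int>;  // one cluster label per node of a disk-graph

struct ParitySummary {
  std::size_t graphs = 0;
  std::size_t mismatchGraphs = 0;
  std::size_t mismatchNodes = 0;
  bool pass() const { return mismatchGraphs == 0; }
};

// Compares C++ and Python labels graph by graph, node by node.
// Shape differences are treated as a usage error rather than a mismatch.
ParitySummary compareLabels(const std::vector<Labels>& cppLabels,
                            const std::vector<Labels>& pyLabels) {
  if (cppLabels.size() != pyLabels.size())
    throw std::invalid_argument("different number of disk-graphs");
  ParitySummary s;
  s.graphs = cppLabels.size();
  for (std::size_t g = 0; g < cppLabels.size(); ++g) {
    if (cppLabels[g].size() != pyLabels[g].size())
      throw std::invalid_argument("different number of nodes in a disk-graph");
    std::size_t bad = 0;
    for (std::size_t i = 0; i < cppLabels[g].size(); ++i) {
      if (cppLabels[g][i] != pyLabels[g][i]) ++bad;
    }
    if (bad > 0) {
      ++s.mismatchGraphs;
      s.mismatchNodes += bad;
    }
  }
  return s;
}
```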

☔ The build is failing at a58c90e.
N.B. These results were obtained from a build of this Pull Request at a58c90e after being merged into the base branch at d2340b7.

ORT is not available yet. Andy is testing the prototype.

@FNALbuild run build test with #1824

⌛ The following tests have been triggered for a58c90e: build (Build queue - API unavailable)

☔ The build is failing at a58c90e.

Summary
Adds a Graph Neural Network calorimeter clustering algorithm that runs
alongside the existing seed+BFS `CaloClusterMaker`. Both clustering
chains run, both emit a `CaloClusterCollection`; downstream consumers
select via `(module_label, instance_name)`. The GNN output ships under
instance name `"GNN"` so existing BFS-reading analyses are untouched.

Status: draft. The build links `onnxruntime` via the `u092` muse
manifest (Sophie's hook into Andy's local install), tracking
Mu2e/ArtAnalysis#4, which is also draft pending the central muse
`onnxruntime` package.

Once that lands, the rebase-and-flip is small:
* drop the `u092` qualifier in favour of the standard one,
* mirror whatever `SConscript` dependency name ArtAnalysis#4 settles on (likely unchanged at `'onnxruntime'`),
* `gh pr ready`.

The C++ source itself is independent of which qualifier provides
`onnxruntime`.

Design (split-module)
Following the design meeting with Sophie Middleton + Andrew Edmonds
(2026-04-29), the algorithm is split into two `art::EDProducer`s
joined by a transient data product, the `CaloHitGraphCollection`.

`CaloClusterMakerGNN` is model-agnostic: a single C++ class loaded
twice in an A/B comparison job runs SimpleEdgeNet and CaloClusterNet
side by side, with per-instance FHiCL `tauEdge` and
`expectedModelVersion`. Production declares one instance with the CCN
artifact.
Metadata-props deployment contract
The C++ session loader asserts that the loaded `.onnx`'s
`metadata_props` map matches FHiCL expectations at job start, so a
silently out-of-sync retraining is caught before the first event:

| `metadata_props` key | Expected value | FHiCL parameter |
|----------------------|----------------|-----------------|
| `model_version` | `calo-cluster-net-v2-stage1` | `expectedModelVersion` |
| `node_features` | `log_e,t,x,y,r,e_rel` | `expectedNodeFeatures` |
| `edge_features` | `dx,dy,d,dt,dlog_e,asym_e,logsum_e,dr` | `expectedEdgeFeatures` |

Mismatches abort the job loudly.
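
The check itself amounts to a key-by-key string comparison. The sketch below shows the idea with a plain `std::map` standing in for the metadata read from the model; `Props` and `checkDeploymentContract` are hypothetical names, the ONNX Runtime calls that would fill the map are deliberately not shown, and `std::runtime_error` is an assumption about how the abort is signalled.

```cpp
#include <map>
#include <stdexcept>
#include <string>

using Props = std::map<std::string, std::string>;

// Throws on the first expected key that is missing from the model metadata
// or whose value differs from the configured expectation.
void checkDeploymentContract(const Props& modelProps, const Props& expected) {
  for (const auto& kv : expected) {
    const auto it = modelProps.find(kv.first);
    if (it == modelProps.end()) {
      throw std::runtime_error("model metadata is missing key '" + kv.first + "'");
    }
    if (it->second != kv.second) {
      throw std::runtime_error("model metadata mismatch for '" + kv.first +
                               "': expected '" + kv.second + "', found '" +
                               it->second + "'");
    }
  }
}
```

Running the check once at construction time, rather than per event, keeps the failure early and the per-event path free of string comparisons.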
What this adds
Two commits, reviewer-friendly split:

Add GNN calorimeter clustering (split-module design)
Production code, FHiCL wiring, and trained model artifacts.

| File | Purpose |
|------|---------|
| `RecoDataProducts/inc/CaloHitGraph.hh` | Transient per-disk graph data product (flat tensors plus per-node `art::Ptr<CaloHit>` back-references) |
| `CaloCluster/{inc,src}/GnnGraphBuilder.{hh,cc}` | Step-1 helper: radius graph at `r_max=210 mm` (faithful to the Python `scipy.spatial.cKDTree` behaviour at `N <= ~65 hits/disk`), time filter, kNN fallback, degree cap, z-score normalisation from JSON sidecar |
| `CaloCluster/src/CaloHitGraphMaker_module.cc` | Step-1 EDProducer |
| `CaloCluster/{inc,src}/GnnClusterAssembler.{hh,cc}` | Step-3 helper: sigmoid + symmetrise + threshold → BFS with `bfsExpandCut=10 MeV` ExpandCut → `minHits`/`minEnergyMeV` cleanup → contiguous relabel |
| `CaloCluster/src/CaloClusterMakerGNN_module.cc` | Step-2 + step-3 EDProducer: `metadata_props` assertion against FHiCL at job start; zero-copy tensor views over the graph payload at inference time; `CaloCluster` construction via the existing `ClusterUtils::cog3Vector` |
| `CaloCluster/data/calo_cluster_net_v2_stage1.{onnx,norm.json}` | Trained CCN model artifact and its normalisation sidecar |
| `CaloCluster/data/simple_edge_net_v2.onnx` | Trained SimpleEdgeNet model artifact |
| `CaloCluster/fcl/prolog.fcl` | `CaloClusterGNN` block defining the two producers with frozen recipe defaults; bundled `Reco` sequence for one-line inclusion |
| `CaloCluster/fcl/from_mcs-gnn-prod.fcl` | Production-style standalone FCL; writes both BFS and GNN `CaloClusterCollection`s |
| `CaloCluster/src/SConscript` | Adds `'onnxruntime'` to plugin deps |

Add GNN clustering parity tests

| File | Purpose |
|------|---------|
| `CaloCluster/src/testGnnClusterAssembler_main.cc` | Standalone (`make_bin`) executable: loads JSON parity payload from the training repo, replays `GnnClusterAssembler`, asserts byte-identical labels against Python (Stage 2) |
| `CaloCluster/src/CaloHitGraphParityDump_module.cc` | `art::EDAnalyzer`: dumps per-event-disk `CaloHit`s + GNN cluster labels to a flat TTree for end-to-end Python comparison (Stage 3) |
| `CaloCluster/fcl/from_mcs-gnn-test.fcl` | Drives the parity-dump analyzer over MCS art-format input |

Headline result
Evaluated on 276,688 events / 481,543 disk-graphs from the MDC2025
mixed-pileup test set, with calo-entrant truth and an
`E_reco >= 50 MeV` downstream cut (clusters that actually enter track
finding).
Signal region (95-110 MeV, 47,279 clusters): mean abs(dE) drops
from 0.368 to 0.210 MeV (-43%), mean dr from 0.559 to 0.460 mm (-18%).
Parity validation
Both stages already pass on real data:
* Stage 2 (assembler-only on packed val graphs, 1,147 nodes / 2,768 edges): 100/100 byte-exact.
* Stage 3 (full `mu2e` art job on real MCS input via `from_mcs-gnn-test.fcl` + Python comparison harness, 50 events / 100 disk-graphs / 8,502 hits): 100/100 byte-exact.
Stage 3 implicitly covers Stage 1 (graph-maker parity): any divergence
in graph construction would propagate to mismatched cluster labels.
Reproduce locally:
Coordinated PRs
* Mu2e/EventNtuple#366: adds `calomcsim.ancestorSimIds`. Used by the training-time truth-labelling step. This PR does not depend on EventNtuple#366 landing: the C++ Offline code only consumes `CaloHit` collections, not the new EventNtuple branch.
* Mu2e/ArtAnalysis#4: the `onnxruntime` build-dep pattern this PR mirrors. This PR's draft status tracks ArtAnalysis#4; both are blocked on the central muse `onnxruntime` package.
* Mu2e/MLTrain: `CaloClusterGNN/` subdirectory containing the full training pipeline that produces the `.onnx` artifacts shipped in `CaloCluster/data/` here.

Try it
Acknowledgement
Implementation, refactoring, and documentation drafting were assisted
by Anthropic's Claude (Claude Code). Scientific decisions, training
campaign, validation results, and the v1->v2 truth-definition design
are my own work; Claude was used as a coding assistant.
Test plan
* Rebase onto the central muse `onnxruntime` package once it lands (drop `u092`, mirror final `SConscript` dep name from ArtAnalysis#4).
* Run `testGnnClusterAssembler` on the parity payload; expect `[PASS] all 100 graphs match`.
* Run `from_mcs-gnn-test.fcl` end-to-end on a small MCS art file + the training-repo Python comparison; expect 100/100.
* `Production/JobConfig/...` (separate follow-up PR).