# [CVPR 2025] Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation

Repository-aligned implementation notes and usage guide for the paper.
Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, Liqiang Nie
- Updates
- Introduction
- Highlights
- Method
- Project Structure
- Installation
- Dataset / Benchmark
- Usage
- Results
- Citation
- Acknowledgement
- License
## Updates

- [03/2025] KStar Diffuser was released on arXiv and accepted by CVPR 2025
## Introduction

Bimanual robotic manipulation is harder than single-arm control because the policy must model synchronization, collision-aware structure, and kinematic feasibility across two arms at the same time. The paper proposes Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser), which improves diffusion-based imitation learning with:
- A dynamic spatial-temporal robot graph that encodes dual-arm structure and motion history
- A differentiable kinematics regularizer that aligns predicted end-effector poses with feasible joint-space behavior
- A diffusion policy conditioned on language, multi-view RGB-D observations, and robot state
For the planned open-source release, we expose the method as two plugin-style components:
- A graph encoding plugin for dual-arm structural and temporal reasoning
- A kinematic regularizer plugin for turning joint predictions into kinematically meaningful end-effector constraints
They can be attached to different downstream policy backbones rather than being tied to one fixed end-to-end architecture.
## Highlights

- Dynamic spatial-temporal graph over both Panda arms and multiple history steps
- Differentiable forward kinematics regularization with `pytorch_kinematics` and Panda URDFs
- Multiple graph encoders are supported: `GCN`, `GAT`, `MPNN`, `GraphSAGE`, and `EGNN`
- Plugin-style design that can be attached to different policy heads
- The public release focuses on the reusable graph and regularizer plugins
## Method

The paper proposes KStar Diffuser as a diffusion policy for bimanual manipulation that explicitly reasons about robot structure and kinematics. In this repository, the public-facing part can be understood as two reusable modules:
- Graph Encoding in `src/utils/data_utils.py` and `src/models/gnn_models.py`
- Kinematic Regularizer in `src/utils/model_utils.py`
The original full research code plugs these two modules into a downstream policy, but the open-source core is centered on the plugins themselves.
### Graph Encoding

The graph branch is the most distinctive part of this repository. It is enabled by `enable_graph=True` and is built from dual-arm joint coordinates over multiple history steps.
Implemented in `src/utils/data_utils.py::build_node_features`.
For each joint at each history step, the node feature concatenates:
- Joint 3D coordinate
- Arm identity one-hot label: right arm `[1, 0]`, left arm `[0, 1]`
- Distances from the current joint to all other joints at the same timestep
If there are 14 joints in total, the default node feature dimension is:
3 + 2 + 14 = 19
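As a sanity check, this layout can be reproduced with a small self-contained sketch. `sketch_node_features` below is a hypothetical re-derivation of the described feature layout, not the repository's `build_node_features`; it assumes the right arm occupies the first half of the joint indices.

```python
import numpy as np

def sketch_node_features(joint_xyz):
    """Hypothetical node-feature layout: [coord(3) | arm one-hot(2) | dists(num_joints)].

    joint_xyz: array of shape [history, num_joints, 3].
    Returns an array of shape [history * num_joints, 3 + 2 + num_joints].
    """
    history, num_joints, _ = joint_xyz.shape
    feats = []
    for t in range(history):
        for j in range(num_joints):
            coord = joint_xyz[t, j]                                  # joint 3D coordinate
            one_hot = np.array([1.0, 0.0]) if j < num_joints // 2 \
                else np.array([0.0, 1.0])                            # assumed arm split
            dists = np.linalg.norm(joint_xyz[t] - coord, axis=-1)    # distances to all joints
            feats.append(np.concatenate([coord, one_hot, dists]))
    return np.stack(feats)

x = sketch_node_features(np.random.randn(3, 14, 3))
# With 14 joints, each node feature has 3 + 2 + 14 = 19 dimensions.
```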
This matches the default config value `gnn_input_dim: 19`.

Edge construction is implemented in `src/utils/data_utils.py::build_edge_features`.
The graph contains two edge types:
- Spatial edges inside each timestep
  - right arm chain: `(0,1) (1,2) ... (5,6)`
  - left arm chain: `(7,8) (8,9) ... (12,13)`
- Temporal edges across history
  - the same joint index is connected between adjacent timesteps
The edge attribute is a scalar distance:
- spatial edge: Euclidean distance between connected joints in the same timestep
- temporal edge: Euclidean displacement of the same joint between neighboring timesteps
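The edge layout described above can be sketched with plain index arithmetic. This is an illustrative reconstruction, not the repository's `build_edge_features`; it assumes node `t * num_joints + j` is joint `j` at timestep `t`.

```python
import numpy as np

history, num_joints = 3, 14

# Spatial edges: serial chains inside each timestep,
# mirroring the (0,1)...(5,6) and (7,8)...(12,13) chains above.
spatial = []
for t in range(history):
    base = t * num_joints
    chain = [(i, i + 1) for i in range(6)] + [(i, i + 1) for i in range(7, 13)]
    spatial += [(base + a, base + b) for a, b in chain]

# Temporal edges: the same joint index linked between adjacent timesteps.
temporal = [
    (t * num_joints + j, (t + 1) * num_joints + j)
    for t in range(history - 1)
    for j in range(num_joints)
]

edge_index = np.array(spatial + temporal).T  # shape [2, num_edges]
# 12 spatial edges x 3 steps + 14 temporal edges x 2 gaps = 64 edges
```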
Implemented in `src/models/gnn_models.py`.
Supported encoder keys: `gcn`, `gat`, `mpnn`, `sage`, `egnn`.
The graph encoder output can be pooled and attached to any downstream policy, planner, or decoder that needs robot-structure-aware features.
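For instance, a mean pool over all nodes yields one robot-structure feature vector. The dimensions below follow the defaults described in this README; in a batched `torch_geometric` setting, `torch_geometric.nn.global_mean_pool` plays the same role.

```python
import torch

# Toy per-node embeddings from a graph encoder:
# 3 history steps x 14 joints = 42 nodes, 128-dim each (assumed defaults).
node_embeddings = torch.randn(42, 128)

# Mean-pool across nodes into a single conditioning vector.
graph_feature = node_embeddings.mean(dim=0)  # shape [128]
```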
### Kinematic Regularizer

The second exposed module is the kinematic regularizer. Its role is to convert future joint predictions into end-effector-space constraints through differentiable forward kinematics.
Implemented across:
- `src/utils/model_utils.py`
- a Panda-compatible URDF file used by `pytorch_kinematics`
In the current implementation, the regularizer works as follows:
- Predict future joint positions for the right and left arms
- Run differentiable forward kinematics with `pytorch_kinematics`
- Convert transformation matrices to RLBench-style pose representation
- Use the resulting poses as an additional conditioning signal
- Optionally optimize an auxiliary joint prediction loss together with the downstream task loss
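Step 3 (matrix-to-pose conversion) can be illustrated with a hedged stand-in. `matrix_to_pos_quat` below is an assumption-laden sketch; the repository's `matrix_to_rlb_pose` / `proc_quaternion` may use a different quaternion convention or normalization.

```python
import torch

def matrix_to_pos_quat(T):
    """Sketch: homogeneous transform [..., 4, 4] -> pose (x, y, z, qx, qy, qz, qw)."""
    pos = T[..., :3, 3]
    R = T[..., :3, :3]
    # Trace-based quaternion extraction (well-conditioned when trace > -1).
    w = torch.sqrt(torch.clamp(1.0 + R[..., 0, 0] + R[..., 1, 1] + R[..., 2, 2],
                               min=1e-8)) / 2.0
    x = (R[..., 2, 1] - R[..., 1, 2]) / (4.0 * w)
    y = (R[..., 0, 2] - R[..., 2, 0]) / (4.0 * w)
    z = (R[..., 1, 0] - R[..., 0, 1]) / (4.0 * w)
    return torch.cat([pos, torch.stack([x, y, z, w], dim=-1)], dim=-1)

# Identity transform maps to the origin with the identity quaternion (0, 0, 0, 1).
pose = matrix_to_pos_quat(torch.eye(4).unsqueeze(0))
```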
In the original full integration, the resulting end-effector pose is projected back into a conditioning feature and combined with graph-enhanced observation features.
One example objective is:

```python
loss = loss_lambda * ee_loss + (1 - loss_lambda) * joint_loss
```
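A hedged sketch of how these two terms could be combined in training follows; all shapes and the value of `loss_lambda` are illustrative assumptions, not the repository's settings.

```python
import torch
import torch.nn.functional as F

loss_lambda = 0.8  # hypothetical weighting, not the released config value

# Toy tensors following the [batch, horizon, 14] joint convention; the
# end-effector target is an assumed 7-dim pose (position + quaternion).
pred_joints = torch.randn(8, 16, 14, requires_grad=True)
gt_joints = torch.randn(8, 16, 14)
pred_ee = torch.randn(8, 16, 7)  # in the real pipeline this comes from FK on pred_joints
gt_ee = torch.randn(8, 16, 7)

ee_loss = F.mse_loss(pred_ee, gt_ee)             # end-effector-space constraint
joint_loss = F.mse_loss(pred_joints, gt_joints)  # auxiliary joint prediction loss
loss = loss_lambda * ee_loss + (1 - loss_lambda) * joint_loss
loss.backward()  # here only the joint term propagates gradients to pred_joints
```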
From a plugin perspective, this module does not require a specific decoder. It only assumes that your downstream network can produce future joint predictions that can be fed into the kinematics chain.
## Project Structure

```
.
├── src/
│   ├── utils/data_utils.py    # Graph node / edge construction
│   ├── models/gnn_models.py   # GCN, GAT, MPNN, GraphSAGE, EGNN graph encoders
│   ├── models/egnn.py         # EGNN layer used by the graph encoder
│   ├── utils/model_utils.py   # Pose conversion and kinematics helpers
│   └── utils/CONSTANT.py      # Task constants
├── requirements.txt
├── LICENSE
├── stard.png
└── README.md
```
## Installation

The dependency comments in `requirements.txt` suggest a Python 3.8 setup.

```shell
conda create -n krgb python=3.8 -y
conda activate krgb
```

The two public modules mainly depend on `torch`, `torch_geometric`, `pytorch_kinematics`, and `numpy`.

To stay consistent with the repository environment, you can install the full requirements:

```shell
pip install -r requirements.txt
```

The rest of the repository contains an end-to-end research implementation, but that is not the main public API of the open-source release.
## Dataset / Benchmark

The paper evaluates KStar Diffuser on RLBench2 simulated bimanual tasks.
| Setting | Tasks |
|---|---|
| Simulated RLBench2 | `push_box`, `lift_ball`, `handover_item_easy`, `pick_laptop`, `sweep_to_dustpan` |
The paper reports both 20-demo and 100-demo training settings in simulation.
## Usage

The graph plugin expects a joint-coordinate tensor with shape `[history, num_joints, 3]`.
In the current implementation:
- `history` is typically `3`
- `num_joints` is `14` for two 7-DoF Panda arms
- each joint is represented by its 3D coordinate
The regularizer plugin expects predicted future joint positions that can be fed into a Panda forward-kinematics chain. In the example implementation, that prediction has shape `[batch, horizon, 14]`.
### Graph Encoding Example

```python
import numpy as np
from torch_geometric.data import Data

from src.utils.data_utils import build_node_features, build_edge_features
from src.models.gnn_models import GCNGraph

history = 3
num_joints = 14
joint_coordinations = np.random.randn(history, num_joints, 3)

node_features = build_node_features(joint_coordinations, history)
edge_index, edge_attr = build_edge_features(joint_coordinations, history)

graph_data = Data(
    x=node_features.view(history * num_joints, -1),
    edge_index=edge_index,
    edge_attr=edge_attr.unsqueeze(-1),
)

gnn = GCNGraph(
    input_dim=19,
    hidden_dim=128,
    output_dim=128,
    num_layers=4,
    activation="silu",
    norm="layer",
)
graph_output = gnn(graph_data.x, graph_data.edge_index)
```

If you want edge-aware message passing, replace `GCNGraph` with `GATGraph`, `MPNNGraph`, or `EqGraph` and pass `edge_attr`.
### Kinematic Regularizer Example

The minimal example below shows forward kinematics for one arm. In the full bimanual setting, the same procedure is applied to the right and left arms separately.

```python
import torch
import pytorch_kinematics as pk

from src.utils.model_utils import matrix_to_rlb_pose, proc_quaternion

chain = pk.build_serial_chain_from_urdf(
    open("path/to/panda.urdf").read(),
    "Pandatip",
)

joint_prediction = torch.randn(8, 7)
fk_output = chain.forward_kinematics(joint_prediction)
ee_pose = proc_quaternion(matrix_to_rlb_pose(fk_output.get_matrix()))
```

This regularizer can be attached to any downstream model that predicts future joint positions. In the original full implementation, the resulting end-effector pose is projected back into a conditioning feature and combined with the graph-enhanced observation representation.
The intended usage is:
- Use the graph encoding plugin to inject structural and temporal robot bias
- Use the kinematic regularizer plugin to constrain future predictions in a physically meaningful way
- Attach one or both plugins to your own policy backbone, action head, or diffusion decoder
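A minimal sketch of attaching both plugins via simple feature concatenation follows; all names and dimensions are illustrative assumptions, not the repository's actual API.

```python
import torch

# Hypothetical plugin outputs and observation embedding for a batch of 8.
obs_feature = torch.randn(8, 512)    # e.g. pooled multi-view RGB-D features
graph_feature = torch.randn(8, 128)  # pooled graph-encoder output
ee_feature = torch.randn(8, 64)      # projected end-effector pose conditioning

# Concatenate into one conditioning vector for a policy head or diffusion decoder.
condition = torch.cat([obs_feature, graph_feature, ee_feature], dim=-1)
```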
## Results

The following numbers come from the paper, not from a fresh re-run inside this repository.
| Training demos | Push Box | Lift Ball | Handover Item (easy) | Pick Laptop | Sweep Dustpan | Overall |
|---|---|---|---|---|---|---|
| 20 demos | 79.3 | 87.0 | 23.7 | 17.0 | 83.0 | 58.0 |
| 100 demos | 83.0 | 98.7 | 27.0 | 43.7 | 89.0 | 68.2 |
According to the paper, KStar Diffuser outperforms prior transformer-based and diffusion-based baselines by more than 10 percentage points in overall success rate, and the ablation study confirms that both the Spatial-Temporal Graph and the Kinematics Regularizer are important.
## Citation

If you use this repository or the KStar Diffuser paper in your research, please cite:
```bibtex
@InProceedings{Lv_2025_CVPR,
    author    = {Lv, Qi and Li, Hao and Deng, Xiang and Shao, Rui and Li, Yinchuan and Hao, Jianye and Gao, Longxiang and Wang, Michael Yu and Nie, Liqiang},
    title     = {Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {17394-17404}
}
```

## Acknowledgement

- The method description and reported results are based on the KStar Diffuser paper
- The simulator stack builds on RLBench and PyRep
- The implementation uses PyTorch, Hugging Face Transformers, Diffusers, and PyTorch Geometric
## License

This repository is released under the Apache License 2.0. See the top-level LICENSE file for the full text.
External third-party dependencies keep their own original licenses.
