A Vision–Language–Action foundation model that, for the first time, scales reward-label-free contrastive RL into VLA pre-training itself — equipping a single Qwen3-VL backbone with a quantitative, language-grounded sense of goal-reachability, at near-BC compute.
- 2026/05 LIBERO post-trained checkpoint released — 🤗 TeleEmbodied/PRTS-4B-LIBERO, the exact checkpoint behind the LIBERO numbers in the paper.
- 2026/05 Project page refreshed with final video demos, BibTeX, and benchmark cards — rhodes-team-prts.github.io.
- 2026/05 Minimal SFT post-training code released — reproduces the LIBERO and real-robot fine-tuning runs from the paper.
- 2026/05 Pre-trained PRTS-4B checkpoint pushed to 🤗 TeleEmbodied/PRTS-4B.
- 2026/05 PRTS arXiv preprint released — arXiv:2604.27472.
We will progressively open-source the rest of the PRTS stack. Tick = done, square = upcoming.
- [x] PRTS arXiv preprint — arXiv:2604.27472
- [x] PRTS-4B pre-trained checkpoint — 🤗 TeleEmbodied/PRTS-4B
- [x] Standard LIBERO LeRobot-v2.1 dataset for example fine-tuning — 🤗 TeleEmbodied/libero_4_suites
- [x] Minimal SFT post-training code for LIBERO + real-robot platforms
- [ ] LIBERO evaluation of PRTS
- [x] Project page — rhodes-team-prts.github.io with final video demos, BibTeX, and benchmark cards
- [x] PRTS-4B post-trained checkpoint for LIBERO — 🤗 TeleEmbodied/PRTS-4B-LIBERO, for one-click reproduction of the paper's LIBERO numbers
- [ ] CRL value visualization scripts
- [ ] PRTS-4B post-trained checkpoint for SimplerEnv WidowX
Most VLA models pretrain by behavior cloning — they learn what to do, but never internalize how close the current state is to satisfying the instruction. PRTS reframes pre-training as a goal-conditioned RL problem and supervises a language-conditioned contrastive value alongside the action loss, all from offline trajectory structure alone.
The geometry the model converges to is sharp: the inner product of the state–action embedding and the goal embedding tracks the log-discounted goal occupancy along expert rollouts. It rises as the policy approaches the language goal and stays flat under a mismatched instruction.
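For intuition, here is a minimal sketch of such a contrastive value objective — an InfoNCE loss over state–action and goal embeddings with in-batch negatives. Everything in it (`contrastive_value_loss`, the embedding shapes, the temperature) is illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def contrastive_value_loss(sa_emb: torch.Tensor,
                           goal_emb: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """InfoNCE-style contrastive value loss (illustrative sketch).

    sa_emb:   (B, D) state-action embeddings from the shared backbone.
    goal_emb: (B, D) language-goal embeddings; row i is the goal actually
              reached later in the same trajectory as state-action i
              (positive pair), all other rows act as in-batch negatives.
    """
    logits = sa_emb @ goal_emb.T / temperature            # (B, B) inner products
    labels = torch.arange(sa_emb.size(0), device=sa_emb.device)
    # Cross-entropy pulls each state-action toward its own future goal and
    # pushes it away from the other goals in the batch; at the optimum the
    # inner product tracks a log-ratio of discounted goal occupancy, which
    # is why it rises along expert rollouts and stays flat under a
    # mismatched instruction.
    return F.cross_entropy(logits, labels)
```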
| Highlight | Details |
|---|---|
| 🧭 Goal-reachability awareness, end-to-end | The contrastive value head is co-trained inside the same Qwen3-VL backbone the policy uses. No separate value network, no curated reward dataset, no offline-RL post-training loop. |
| 💰 Reward-label-free | Supervision comes purely from the temporal structure of demonstrations — no per-episode success labels and no curated value-training corpus. |
| ⚡ Near-BC pre-training cost | A role-aware causal mask fused into FlashAttention via a custom CuTe kernel keeps per-layer attention within 1.18× of vanilla FA3, vs. 2.7×–8.8× for off-the-shelf FlexAttention. End-to-end pre-training scales at ≥ 85 % linear efficiency on 64 × H100. |
| 🌍 Out-of-distribution wins grow with the shift | On 5 simulation suites and 14 real-world tasks, PRTS matches or exceeds the strongest prior VLAs at ¼–⅛ the post-training compute, with the gap widening as evaluation moves further off-distribution: novel-instruction following (+38.8 over π0.5), long-horizon execution, and recovery under human intervention. |
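To make the role-aware mask concrete, here is a hedged PyTorch sketch of the kind of predicate such a kernel evaluates — assuming two roles, a causal vision/language/state prefix plus an action block that attends to the full prefix and bidirectionally to itself. The role layout and function names are our assumptions, not the paper's exact scheme (which fuses an equivalent predicate into a CuTe FlashAttention kernel instead of materializing a mask):

```python
import torch

def role_aware_mask(roles: torch.Tensor) -> torch.Tensor:
    """Boolean attention mask from per-token roles (illustrative sketch).

    roles: (T,) int tensor; 0 = prefix token (vision/language/state, causal),
           1 = action token (sees the whole prefix plus all action tokens).
    Returns a (T, T) bool mask where True means "query i may attend to key j".
    """
    T = roles.shape[0]
    i = torch.arange(T).unsqueeze(1)                       # query index
    j = torch.arange(T).unsqueeze(0)                       # key index
    causal = j <= i                                        # standard causal part
    both_action = (roles.unsqueeze(1) == 1) & (roles.unsqueeze(0) == 1)
    return causal | both_action                            # bidirectional action block

# Example: 6 prefix tokens followed by 4 action tokens.
mask = role_aware_mask(torch.tensor([0] * 6 + [1] * 4))
# A mask like this can be passed to F.scaled_dot_product_attention as attn_mask;
# materializing it per layer is exactly the overhead a fused kernel avoids.
```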
PRTS targets CUDA 12.6 + PyTorch 2.11 + transformers 4.57.3. We recommend a fresh conda env:
```bash
# 1) Clone
git clone https://github.com/TeleHuman/PRTS.git
cd PRTS

# 2) Create env
conda create -n prts python=3.11 -y
conda activate prts

# 3) Install LeRobot first
pip install lerobot==0.3.3

# 4) Install the rest
pip install -r requirements.txt

# 5) FlashAttention. We recommend FlashAttention-3, the default attention
#    implementation in our training (`--attn-implementation flash_attention_3`);
#    the pip wheel below provides FlashAttention-2 only, as a fallback.
MAX_JOBS=8 pip install flash-attn==2.8.3 --no-build-isolation  # FA2 fallback only

# 6) Editable install of PRTS itself
pip install -e .
```

Next, pull the pre-trained PRTS-4B checkpoint and its Qwen3-VL base model:

```bash
huggingface-cli download TeleEmbodied/PRTS-4B \
    --local-dir $HF_HUB_CACHE/models--TeleEmbodied--PRTS-4B
huggingface-cli download Qwen/Qwen3-VL-4B-Instruct \
    --local-dir $HF_HUB_CACHE/models--Qwen--Qwen3-VL-4B-Instruct
```

Fine-tuning is driven by a single YAML pointing at one or more LeRobot v2.1 datasets (local folder or HF repo). A minimal config (see configs/post-train/template.yaml):
```yaml
# configs/post-train/my_robot.yaml
lerobot_datasets:
  - repo_id: my_robot_v1                  # any string id, used for logging
    root: /path/to/your/lerobot_dataset   # local v2.1 LeRobot folder
    load_quantile_stats: true             # load q01/q99 from meta/norm_stats.json (run compute_stats.py first)
    state_relative_action: false          # true = train on (action - state); must match `embodiment_tag`
    embodiment_tag: my_robot_tag          # see prts/data/embodiment_tag.py
```

Key fields explained:
| Field | Meaning |
|---|---|
| `repo_id` | Dataset short name. Doubles as a tag in the loss-channel logs (`flow_matching/<repo_id>`). |
| `root` | Absolute path to a v2.1 LeRobot dataset folder (must contain `meta/info.json`, `meta/episodes.jsonl`, `meta/stats.json`, …). |
| `load_quantile_stats` | If `true`, loads q01/q99 from `meta/norm_stats.json` for QUANTILE normalization. Generate them once with `python scripts/compute_stats.py --data_path <your.yaml>`. |
| `state_relative_action` | `true` makes the policy predict action − state for the dims masked in the embodiment config. Must agree with the `embodiment_tag` you choose. |
| `embodiment_tag` | One of the keys in `prts/data/embodiment_tag.py`: `libero_panda`, `flexiv`, `realman_dual_arm`, `arx_dual_arm`, `galaxea_r1_pro`, `agibot_g2`, or the generic `full_state_relative` / `full_absolute`. To add a new robot, append a new `EmbodimentConfig` entry (sketched below) — the `delta_action_mask` is truncated to your real action dim at runtime. |
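For a new robot, the appended entry might look like the sketch below. Only `EmbodimentConfig` and `delta_action_mask` are named in this README — every other detail here (including where the entry is registered) is hypothetical:

```python
# prts/data/embodiment_tag.py — hypothetical new entry; only
# `EmbodimentConfig` and `delta_action_mask` are documented names.
"my_robot_tag": EmbodimentConfig(
    # True marks dims trained as (action - state) when
    # `state_relative_action: true`; the mask is truncated to your real
    # action dim at runtime, so an over-length mask is safe.
    delta_action_mask=[True] * 6 + [False],  # e.g. 6 relative joints + absolute gripper
),
```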
You can list multiple datasets under `lerobot_datasets:` and they will be co-trained with sample-level mixing — for example:
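A hedged two-dataset example (ids, paths, and tags are placeholders):

```yaml
lerobot_datasets:
  - repo_id: my_robot_tabletop        # logged as flow_matching/my_robot_tabletop
    root: /data/my_robot_tabletop
    load_quantile_stats: true
    state_relative_action: false
    embodiment_tag: my_robot_tag
  - repo_id: my_robot_teleop          # second corpus, mixed at the sample level
    root: /data/my_robot_teleop
    load_quantile_stats: true
    state_relative_action: false
    embodiment_tag: my_robot_tag
```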
Edit the top of scripts/ft/launch_finetune.sh — these are the knobs you will most often touch:
```bash
GPUS=4                        # number of local GPUs
PER_DEVICE_BATCH_SIZE=8
dataset_config_path=configs/post-train/my_robot.yaml
embodiment_tag=my_robot_tag   # must match the YAML above
chunk_size=20                 # action horizon
action_dim=32                 # >= max action dim across listed datasets
max_train_steps=30000
state_mode=QUANTILE           # or MEAN_STD / MIN_MAX
model_name_or_path=$HF_HUB_CACHE/models--TeleEmbodied--PRTS-4B  # or "TeleEmbodied/PRTS-4B"
```

Then:

```bash
# 1) Pre-compute normalization statistics. This walks each dataset listed under
#    `lerobot_datasets:` once and writes `<root>/meta/norm_stats.json`
#    (mean/std/min/max + q01/q99) — the same file `load_quantile_stats: true`
#    reads at training time.
python scripts/compute_stats.py \
    --data_path configs/post-train/my_robot.yaml

# 2) Launch SFT (DeepSpeed ZeRO-2 by default)
bash scripts/ft/launch_finetune.sh
```

Checkpoints land under `outputs/<date>/<time>-<run_name>/` and the final policy is saved as `checkpoint-final-<step>/`.
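Before a long run, you can sanity-check what step 1 wrote. The JSON layout assumed below (per-feature entries carrying `q01`/`q99`) is our guess from the comment above — adjust the keys to what your file actually contains:

```python
import json

# Hypothetical sanity check for <root>/meta/norm_stats.json.
with open("/path/to/your/lerobot_dataset/meta/norm_stats.json") as f:
    stats = json.load(f)

for name, entry in stats.items():
    missing = [k for k in ("q01", "q99") if k not in entry]
    print(name, "OK" if not missing else f"missing {missing}")
```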
`scripts/serve_policy.py` does not accept a `--ckpt-path` flag — checkpoints are looked up via an `EnvMode` enum and a `DEFAULT_CHECKPOINT` table. To serve your own run, register a new entry in both:
```python
# scripts/serve_policy.py
class EnvMode(enum.Enum):
    ALOHA = "aloha"
    ALOHA_SIM = "aloha_sim"
    DROID = "droid"
    LIBERO = "libero"
    SIMPLER = "simplerenv"
    LIBERO_hf = "libero_dit"
    MY_ROBOT = "my_robot"  # ← add this

DEFAULT_CHECKPOINT: dict[EnvMode, Checkpoint] = {
    ...
    EnvMode.MY_ROBOT: Checkpoint(  # ← and this
        config="prts_my_robot",
        dir="outputs/2026-XX-XX/.../checkpoint-final-30000",
        action_dim=7,                 # must match your dataset action dim
        dataset_path="/path/to/your/lerobot_dataset",
        state_mode="QUANTILE",        # must match the normalization mode used at training
        state_relative_action=False,  # must match the YAML used at training
    ),
}
```

Then launch the websocket policy server (consumed by the OpenPI client included in third_party/openpi-client):

```bash
python scripts/serve_policy.py --env my_robot --port 10093
```

The four fields `action_dim` / `dataset_path` / `state_mode` / `state_relative_action` must agree with the YAML used during fine-tuning — they are how the server reconstructs the un-normalization pipeline so that the actions returned over the websocket are in the same convention your robot expects.
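On the robot side, querying the server could look like the sketch below, assuming the bundled OpenPI client exposes its usual `WebsocketClientPolicy.infer` interface; the observation keys and shapes are placeholders — match them to whatever your policy config expects:

```python
import numpy as np
from openpi_client import websocket_client_policy

# Connect to the policy server launched above.
policy = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=10093)

# Placeholder observation — the exact keys and shapes depend on your config.
obs = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),  # camera frame
    "state": np.zeros(7, dtype=np.float32),            # proprioceptive state
    "prompt": "pick up the soy-sauce bottle",          # language instruction
}

result = policy.infer(obs)
actions = result["actions"]  # (chunk_size, action_dim) un-normalized action chunk
```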
The repo ships a ready-to-run config so you can confirm the stack end-to-end before pointing it at your own data. We also released the LeRobot-v2.1 packaging of the four LIBERO suites used in the paper at 🤗 TeleEmbodied/libero_4_suites — pull it down once and the rest is a couple of shell commands:
```bash
# 1) Download the LIBERO LeRobot dataset
huggingface-cli download TeleEmbodied/libero_4_suites \
    --repo-type dataset \
    --local-dir /path/to/libero_4_suites

# 2) Point configs/post-train/libero.yaml at the folder you just downloaded
#    (edit the `root:` field), then:

# 3) Launch SFT
bash scripts/ft/launch_finetune.sh
```

A successful run prints `total_params=…`, `trainable_params=… [≈99 %]` at startup and begins logging `flow_matching/libero_4_suites_channel_loss` within the first ~100 steps.
PRTS reaches state-of-the-art average success rates on every standard suite, and does so with one of the smallest post-training budgets among directly comparable VLAs.
| Method | LIBERO | LIBERO-Plus | LIBERO-Pro | SimplerEnv (WidowX) |
|---|---|---|---|---|
| OpenVLA-OFT | 97.1 | 69.6 | — | 41.8 |
| GR00T-N1.5 | 97.0 | — | — | 61.9 |
| π0 (bs=32, 30K) | 94.2 | 53.6 | 45.3 | 27.1 |
| π0.5 (bs=256, 30K) | 96.9 | 80.7 | 53.3 | — |
| ABot-M0 (bs=32, 30K) | 97.9 | 78.7 | 52.2 | — |
| PRTS (Ours) (bs=32, 30K) | 98.4 | 81.4 | 58.8 | 77.1 |
The gap to baselines grows as evaluation drifts further off-distribution: +0.5 on LIBERO → +0.7 on LIBERO-Plus → +5.5 on LIBERO-Pro.
This is where PRTS's CRL-shaped representations really earn their keep. LIBERO-Pro holds the visual scene fixed and rewrites either the instruction (Task axis) or the target relation (Position axis).
| Method | Semantic | Object | Position | Task | Average |
|---|---|---|---|---|---|
| π0 (bs=32, 30K) | 90.5 | 90.5 | 0.0 | 0.0 | 45.3 |
| π0.5 (bs=256, 30K) | 95.8 | 96.0 | 20.8 | 0.8 | 53.3 |
| ABot-M0 (bs=32, 30K) | 97.1 | 82.5 | 7.1 | 22.3 | 52.2 |
| PRTS (Ours) (bs=32, 30K) | 97.0 | 82.3 | 24.3 | 31.5 | 58.8 |
On the Task axis, π0 and π0.5 collapse below 1 %, and the strongest comparable VLA (ABot-M0) reaches only 22.3 %. PRTS holds 31.5 % — so although π0.5 posts the second-best average, PRTS surprisingly outperforms it by a large margin (+30.7) on this hardest axis.
We deploy PRTS on a 14-DoF dual-arm RealMan platform (11 tasks) and a 7-DoF Flexiv single-arm platform (3 tasks). All three policies (PRTS, π0, π0.5) share identical post-training data, an identical schedule, and a 20-trial physical evaluation protocol.
| Method | RealMan dual-arm (avg over 11 tasks) | Flexiv single-arm (avg over 3 tasks) |
|---|---|---|
| π0 | 67.3 | 60.0 |
| π0.5 | 85.5 | 75.0 |
| PRTS (Ours) | 95.9 | 90.0 |
PRTS hits ≥ 90 % on every one of the 11 RealMan tasks and 100 % on four of them. On the genuinely long-horizon Office Long Term task (~2 min of continuous bimanual operation), π0.5 collapses to 40 % under multi-task interference while PRTS holds 95 %.
The cleanest test of "does the policy follow the language goal?" is to take a deployed task and change the language instruction to recombine seen primitives in a new way (e.g. Paper Rubbish with a soy-sauce bottle in place of the trash item). All four task-generalization cells are evaluated physically, with 20 trials each:
| Method | Paper Rubbish | Place Block | Pick Shoes | Stack Cups | Average |
|---|---|---|---|---|---|
| π0 | 5.0 | 0.0 | 30.0 | 20.0 | 13.8 |
| π0.5 | 65.0 | 15.0 | 35.0 | 25.0 | 35.0 |
| PRTS (Ours) | 80.0 | 55.0 | 85.0 | 75.0 | 73.8 |
The +38.8-point average margin over π0.5 is the most direct empirical evidence that PRTS's value head ties the language goal to feasible state–action outcomes.
(a) Aggregate token throughput vs. number of H100 GPUs (log-log) — PRTS retains ≥ 85 % of perfect linear scaling up to 64 GPUs. (b) Per-layer attention forward time at matched packing — our role-aware CuTe-FlashAttention kernel sits at 0.531 ms / layer (1.18 × of the BC-only FA3 reference at 0.45 ms), versus 1.23 ms (2.7 ×) and 3.95 ms (8.8 ×) for the alternative FlexAttention realizations of the same role-aware mask.
If you find PRTS useful, please consider citing:
```bibtex
@article{zhang2026prts,
  title   = {PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations},
  author  = {Yang Zhang and Jiangyuan Zhao and Chenyou Fan and Fangzheng Yan and Tian Li and Haitong Tang and Sen Fu and Xuan'er Wu and Qizhen Weng and Weinan Zhang and Xiu Li and Chi Zhang and Chenjia Bai and Xuelong Li},
  journal = {arXiv preprint arXiv:2604.27472},
  year    = {2026},
}
```

Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0). See LICENSE for details. Released model weights and code are free for academic and non-commercial use; commercial use is not permitted under this license.
PRTS builds on Qwen3-VL, FlashAttention, LeRobot, and OpenPI. We thank the authors of Contrastive RL for the ideas behind the contrastive value formulation.
Feel free to open an issue or discussion if you encounter any problems or have questions about this project.
For collaborations, feedback, or further inquiries, please reach out to Yang Zhang (breezeyoung9470@gmail.com).
You can also join our WeChat discussion group for timely Q&A:
(The QR code is refreshed weekly — if it has expired, please email us and we will update it.)