ManiSoft paves the way for soft-arm vision-language manipulation beyond current rigid-arm paradigms.
ManiSoft is a benchmark for vision-language manipulation with soft continuum robots. It combines a simulator that couples soft-body dynamics with contact-rich interaction, four benchmark tasks ranging from basic coordination to obstacle-aware manipulation, and an automated pipeline for generating diverse scenes and expert trajectories at scale.
ManiSoft evaluates four task families. For each task, we show one clean rollout and two randomized rollouts with richer scene clutter and language descriptions.
Pick up an object and place it into a container, evaluating basic trajectory control and end-effector coordination.
Align a target object to a specified 6-DoF pose, testing precise position and orientation adjustment.
Stack tableware from largest to smallest into a stable pile, emphasizing precision in contact-rich interaction.
Arrange multiple objects, requiring perception, spatial reasoning, and obstacle-aware planning.
ManiSoft contains 6,300 scene-trajectory pairs across clean and randomized scenarios, built from an asset library of 263 objects. The dataset covers 109 manipulable objects across 17 categories and 154 obstacles spanning 35 categories, with an average of about 40 language instructions per scene and relatively long expert trajectories due to precise torque-based control.
(a) Distribution of trajectory lengths. Tasks in ManiSoft generally involve long trajectories, with the STK task exhibiting notably longer trajectories than the others. (b) Frequency distribution of target object categories, highlighting the diversity of manipulable objects in ManiSoft. (c) Spatial distribution of initial target object positions on the tabletop, showing that graspable targets are broadly and evenly distributed across the workspace.
ManiSoft uses a tabletop environment with the soft arm fixed behind the table and centered relative to the workspace. Scenes are procedurally generated from an asset library, where each object is annotated with candidate manipulation poses for the soft arm. Clean scenes contain only task-relevant objects with controlled layouts and appearances, while randomized scenes increase difficulty by introducing obstacles and by varying placements, textures, lighting intensity, and brightness.
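The difference between clean and randomized scenes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the table extents, the `Scene` fields, the `sample_scene` function, and the sampling ranges are hypothetical and are not taken from the ManiSoft pipeline. The sketch also omits collision checking and the candidate manipulation poses annotated on each asset.

```python
import random
from dataclasses import dataclass, field

# Hypothetical table extents in metres; the real workspace size is not stated.
TABLE_X = (-0.4, 0.4)
TABLE_Y = (0.2, 0.8)


@dataclass
class Scene:
    placements: dict = field(default_factory=dict)  # object name -> (x, y)
    obstacles: list = field(default_factory=list)
    texture_id: int = 0           # 0 stands for the default appearance
    light_intensity: float = 1.0  # 1.0 stands for the default lighting


def sample_scene(rng, targets, obstacle_pool, randomized, max_obstacles=3):
    """Sample one scene. Assumes obstacle_pool is non-empty when randomized."""
    scene = Scene()
    if not randomized:
        # Clean scene: task-relevant objects only, on fixed evenly spaced slots.
        y_mid = 0.5 * (TABLE_Y[0] + TABLE_Y[1])
        step = (TABLE_X[1] - TABLE_X[0]) / (len(targets) + 1)
        for i, name in enumerate(targets):
            scene.placements[name] = (TABLE_X[0] + (i + 1) * step, y_mid)
        return scene

    # Randomized scene: add obstacles and vary placement, texture, lighting.
    k = rng.randint(1, min(max_obstacles, len(obstacle_pool)))
    scene.obstacles = rng.sample(obstacle_pool, k)
    scene.texture_id = rng.randrange(10)
    scene.light_intensity = rng.uniform(0.5, 1.5)
    for name in list(targets) + scene.obstacles:
        scene.placements[name] = (rng.uniform(*TABLE_X), rng.uniform(*TABLE_Y))
    return scene
```

Passing a seeded `random.Random` instance keeps each generated scene reproducible, which is convenient when a scene has to be paired with an expert trajectory later.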
Language instructions are instantiated from a curated template library using object attributes and descriptions, which preserves semantic accuracy while improving linguistic diversity across episodes.
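As a rough illustration of template-based instantiation, the sketch below fills a few hand-written templates with object attributes. The templates, the attribute names (`color`, `name`), and the helper functions are hypothetical; the actual template library used by ManiSoft is not shown here.

```python
import random

# Hypothetical templates; the benchmark's curated library is not reproduced here.
TEMPLATES = [
    "Pick up the {color} {name} and place it into the {container}.",
    "Put the {color} {name} in the {container}.",
    "Move the {name}, which is {color}, into the {container}.",
]


def instantiate(template, obj, container):
    """Fill one template with the attributes of the target object."""
    return template.format(color=obj["color"], name=obj["name"], container=container)


def sample_instructions(obj, container, n, rng):
    """Return up to n distinct instructions for the same object and goal."""
    templates = rng.sample(TEMPLATES, k=min(n, len(TEMPLATES)))
    return [instantiate(t, obj, container) for t in templates]


if __name__ == "__main__":
    mug = {"name": "mug", "color": "red"}
    for line in sample_instructions(mug, "basket", 2, random.Random(0)):
        print(line)
```

Because every slot is filled from the object's own attributes, each instruction stays semantically tied to the scene while the surface wording varies across templates.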
Unlike rigid robotic arms, for which expert data can be collected directly (e.g., via teleoperation), soft robotic arms make expert demonstrations difficult to acquire because of their inherent physical properties. To address this, we design a data collection framework tailored to soft robotic arms.
Expert demonstrations are generated with a hierarchical mechanism. At the high level, a task-specific rule-based planner decomposes each manipulation problem into semantic waypoints such as approach, grasp, retract, and place. Every waypoint is represented as a desired 6-DoF end-effector pose, which avoids direct long-horizon torque sequencing.
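A minimal sketch of such a rule-based decomposition for a pick-and-place case is shown below. The `Waypoint` type, the `plan_collection` function, the quaternion convention, and the clearance value are all assumptions introduced for illustration; the actual task rules are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    label: str
    position: tuple     # (x, y, z) in metres
    orientation: tuple  # unit quaternion (w, x, y, z), assumed convention


def plan_collection(obj_pos, grasp_quat, container_pos, clearance=0.10):
    """Decompose a pick-and-place case into four semantic waypoints.

    clearance is a hypothetical vertical offset used to hover above the
    object before grasping and above the container before releasing.
    """
    ox, oy, oz = obj_pos
    cx, cy, cz = container_pos
    return [
        Waypoint("approach", (ox, oy, oz + clearance), grasp_quat),
        Waypoint("grasp", (ox, oy, oz), grasp_quat),
        Waypoint("retract", (ox, oy, oz + clearance), grasp_quat),
        Waypoint("place", (cx, cy, cz + clearance), grasp_quat),
    ]


if __name__ == "__main__":
    upright = (1.0, 0.0, 0.0, 0.0)
    for wp in plan_collection((0.1, 0.5, 0.02), upright, (-0.2, 0.6, 0.05)):
        print(wp.label, wp.position)
```

Each waypoint is a full 6-DoF target (position plus orientation), so the planner never has to reason about torques; that is left to whatever executor tracks the poses.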
At the low level, an RL-trained executor converts each waypoint into torque commands for the soft arm. The executor is optimized with dense rewards that combine pose accuracy and motion stability, encouraging smooth convergence while limiting oscillation and unstable deformation.
Figure 4. Trajectory generation pipeline in ManiSoft. (a) An executor is trained with RL to map each waypoint (a 6-DoF pose) to torque commands. (b) The RL reward balances accuracy and stability: it consists of a pose-difference reward Rd, which decreases as the pose difference grows, and a stability reward Rs, which rewards or penalizes changes in the pose difference. (c) Predefined task-specific rules produce the high-level plan (trajectory waypoints) for each case, and the executor converts these waypoints into low-level actions (torque commands) to generate complete trajectories.
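One possible reading of the two reward terms is sketched below. The exact functional forms, the rotation weight, and the stability weight are assumptions, since the text only states that Rd is negatively correlated with the pose difference and that Rs rewards or penalizes its change. Here the pose difference is taken as position error plus a weighted rotation angle, Rd is its negative, and Rs is the per-step reduction in that difference.

```python
import math


def pose_difference(pos, quat, goal_pos, goal_quat, rot_weight=0.1):
    """Scalar pose error: Euclidean distance plus a weighted rotation angle.

    Quaternions are assumed to be unit length; rot_weight is hypothetical.
    """
    pos_err = math.dist(pos, goal_pos)
    dot = abs(sum(a * b for a, b in zip(quat, goal_quat)))
    rot_err = 2.0 * math.acos(min(1.0, dot))  # angle between orientations
    return pos_err + rot_weight * rot_err


def step_reward(d_prev, d_curr, stability_weight=1.0):
    """Combine the two terms for one control step (assumed forms)."""
    r_d = -d_curr          # pose-difference term: larger error, lower reward
    r_s = d_prev - d_curr  # stability term: positive when the error shrinks
    return r_d + stability_weight * r_s


if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    d0 = pose_difference((0.0, 0.0, 0.3), identity, (0.0, 0.0, 0.0), identity)
    d1 = pose_difference((0.0, 0.0, 0.2), identity, (0.0, 0.0, 0.0), identity)
    print(step_reward(d0, d1))
```

Under these assumed forms, a step that moves the end effector away from the waypoint is penalized twice, once through the larger error and once through the negative change, which discourages oscillation around the target.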
We benchmark representative policy models on ManiSoft, including Diffusion Policy (DP), RDT, and OpenVLA-OFT. Overall, performance in clean scenes is noticeably higher than in randomized scenes, confirming that visual variation, clutter, and obstacle injection remain major challenges for current vision-language-action models on soft robots.
OpenVLA-OFT shows the strongest robustness under randomization, while DP remains competitive in simpler clean settings. Across methods, Collection is consistently the easiest task, whereas Stacking and Arrangement expose the difficulty of precise deformable control and obstacle-aware reasoning.
@article{wei2026manisoft,
title={ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics},
author={Ziyu Wei and Luting Wang and Chen Gao and Li Wen and Si Liu},
journal={arXiv preprint arXiv:2605.18617},
year={2026},
}