π“œπ“ͺ𝓷𝓲𝓒𝓸𝓯𝓽: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei1,2,* Luting Wang1,2,* Chen Gao1,3,†,+ Li Wen1,+ Si Liu1,2,+
1Beihang University 2Hangzhou Innovation Institute, Beihang University 3National University of Singapore
*Equal Contribution †Project Lead +Corresponding Authors
ICML 2026
ManiSoft benchmark overview

(a) Rigid arms operate in a low-dimensional action space and can fail due to limited shape adaptation, whereas soft arms driven by distributed low-level actuation can continuously deform to reach around obstacles. (b) Example expert trajectories for the four ManiSoft tasks. (c) The data generation pipeline comprises an asset library, clean and randomized scene generation, hierarchical trajectory generation, simulation, and quality check. (d) Performance of representative policy models on ManiSoft in clean scenarios.

TL;DR

ManiSoft paves the way for soft-arm vision-language manipulation beyond current rigid-arm paradigms.

ManiSoft introduces a benchmark for vision-language manipulation with soft continuum robots. It combines a simulator that couples soft-body dynamics with contact-rich interaction, four benchmark tasks spanning basic coordination to obstacle-aware manipulation, and an automated pipeline for generating diverse scenes and expert trajectories at scale.

Benchmark Overview

ManiSoft evaluates four task families. For each task, we show one clean rollout and two randomized rollouts with richer scene clutter and language descriptions.

Collection (COLL)

Pick up an object and place it into a container, evaluating basic trajectory control and end-effector coordination.

Collection clean demo
(clean) Pick up the bottle and place it in the storage box.
Collection randomized demo 1
(randomized) Put the box-drink with white cap in the purple plastic box while keeping it balanced.
Collection randomized demo 2
(randomized) Put the round red kettle with a gray handle in the white plastic case while keeping it balanced.

Alignment (ALN)

Align a target object to a specified 6-DoF pose, testing precise position and orientation adjustment.

Alignment clean demo
(clean) Move the candlestick to the gray area and align its direction with the target orientation.
Alignment randomized demo 1
(randomized) Grasp the white shoe and place it to match the target position and direction while maintaining stability.
Alignment randomized demo 2
(randomized) Grasp the red bottle with a black cap and place it to match the target position and direction while maintaining stability.

Stacking (STK)

Stack tableware from largest to smallest into a stable pile, emphasizing precision in contact-rich interaction.

Stacking clean demo
(clean) Stack the bowls by size.
Stacking randomized demo 1
(randomized) Pick up the smaller blue bowl and place it steadily inside the bigger one.
Stacking randomized demo 2
(randomized) Pick up the smaller bowl, and steadily put it in the bigger one, don't run into other objects.

Arrangement (ARR)

Arrange multiple objects, requiring perception, spatial reasoning, and obstacle-aware planning.

Arrangement clean demo
(clean) Place the Rubik's Cube in the left area next to the book.
Arrangement randomized demo 1
(randomized) Please place the sport shoe to the left of the white phone.
Arrangement randomized demo 2
(randomized) Please place the white boxed drink to the left of the round kettle with a black handle.

ManiSoft contains 6,300 scene-trajectory pairs across clean and randomized scenarios, built from an asset library of 263 objects. The dataset covers 109 manipulable objects across 17 categories and 154 obstacles spanning 35 categories, with an average of about 40 language instructions per scene and relatively long expert trajectories due to precise torque-based control.

Dataset statistics for ManiSoft

(a) Distribution of trajectory lengths. Tasks in ManiSoft generally involve long trajectories, with the STK task exhibiting notably longer trajectories than the others. (b) Frequency distribution of target object categories, highlighting the diversity of manipulable objects in ManiSoft. (c) Spatial distribution of initial target object positions on the tabletop, showing that graspable targets are broadly and evenly distributed across the workspace.

Data Generation Pipeline

Scene Generation

ManiSoft uses a tabletop environment with the soft arm fixed behind the table and centered relative to the workspace. Scenes are procedurally generated from an asset library, where each object is annotated with candidate manipulation poses for the soft arm. Clean scenes contain only task-relevant objects with controlled layouts and appearances, while randomized scenes increase difficulty by introducing obstacles and by varying placements, textures, lighting intensity, and brightness.
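The clean/randomized split described above can be illustrated with a minimal Python sketch. It is not the ManiSoft generator: `SceneConfig`, `sample_scene`, the sampling ranges, and the obstacle cap are all hypothetical names and values chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneConfig:
    """Hypothetical scene description; field names are illustrative only."""
    mode: str
    target: str
    target_xy: Tuple[float, float]
    obstacles: List[str] = field(default_factory=list)
    texture: str = "default"
    light_intensity: float = 1.0


def sample_scene(mode, targets, obstacle_pool, textures, rng, max_obstacles=3):
    """Sample one scene. 'clean' keeps only the task-relevant object with a
    fixed layout; 'randomized' adds obstacles and varies placement, texture,
    and lighting. All numeric ranges below are assumed, not from the paper."""
    if mode not in ("clean", "randomized"):
        raise ValueError(f"unknown mode: {mode}")
    target = rng.choice(targets)
    if mode == "clean":
        # Controlled layout and appearance: defaults everywhere.
        return SceneConfig(mode, target, target_xy=(0.0, 0.0))
    k_max = min(max_obstacles, len(obstacle_pool))
    n_obstacles = rng.randint(1, k_max) if k_max else 0
    return SceneConfig(
        mode,
        target,
        target_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.2, 0.2)),
        obstacles=rng.sample(obstacle_pool, n_obstacles),
        texture=rng.choice(textures),
        light_intensity=rng.uniform(0.5, 1.5),
    )
```

Passing a seeded `random.Random` instance as `rng` keeps each sampled scene reproducible, which is convenient when a scene has to be regenerated after a failed quality check.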

Language instructions are instantiated from a curated template library using object attributes and descriptions, which preserves semantic accuracy while improving linguistic diversity across episodes.
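As a rough illustration of template-based instantiation, the sketch below fills a template with noun phrases built from object attributes. The template strings, the attribute schema (`adjectives`, `category`, `detail`), and the function names are assumptions made for this example; the actual template library is not shown on this page.

```python
import random

# Hypothetical templates for the Collection task, keyed by task abbreviation.
TEMPLATES = {
    "COLL": [
        "Put the {obj} in the {container}.",
        "Pick up the {obj} and place it in the {container}.",
    ],
}


def describe(attrs):
    """Build a noun phrase such as 'round red kettle with a gray handle'
    from an (assumed) attribute dictionary."""
    phrase = " ".join(list(attrs.get("adjectives", [])) + [attrs["category"]])
    if "detail" in attrs:
        phrase += " with " + attrs["detail"]
    return phrase


def instantiate(task, obj, container, rng):
    """Pick one template for the task and fill it with object descriptions."""
    template = rng.choice(TEMPLATES[task])
    return template.format(obj=describe(obj), container=describe(container))


kettle = {"adjectives": ["round", "red"], "category": "kettle",
          "detail": "a gray handle"}
case = {"adjectives": ["white", "plastic"], "category": "case"}
print(instantiate("COLL", kettle, case, random.Random(0)))
```

Because the object phrase comes from annotated attributes rather than free-form generation, the instruction stays consistent with the scene while the template choice varies the wording.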

Scene generation in ManiSoft

Trajectory Generation

Unlike rigid robotic arms, for which expert data can be collected directly (e.g., via teleoperation), soft robotic arms make expert demonstrations hard to acquire because of their inherent physical properties. To address this, we design a data collection framework tailored to soft robotic arms.

Expert demonstrations are generated with a hierarchical mechanism. At the high level, a task-specific rule-based planner decomposes each manipulation problem into semantic waypoints such as approach, grasp, retract, and place. Every waypoint is represented as a desired 6-DoF end-effector pose, which avoids direct long-horizon torque sequencing.
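A minimal sketch of such a rule-based decomposition for a pick-and-place task is given below. The `Waypoint` class, the roll-pitch-yaw orientation encoding, and the `clearance` height are illustrative assumptions, not the planner used in ManiSoft.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Waypoint:
    """A semantic waypoint expressed as a desired 6-DoF end-effector pose."""
    label: str
    position: Vec3  # x, y, z in meters (assumed convention)
    rpy: Vec3       # roll, pitch, yaw in radians (assumed convention)


def plan_pick_and_place(grasp_pos: Vec3, place_pos: Vec3,
                        rpy: Vec3 = (0.0, 0.0, 0.0),
                        clearance: float = 0.10) -> List[Waypoint]:
    """Decompose a pick-and-place task into approach, grasp, retract, place.
    The 10 cm clearance above the grasp point is a hypothetical value."""
    above_grasp = (grasp_pos[0], grasp_pos[1], grasp_pos[2] + clearance)
    return [
        Waypoint("approach", above_grasp, rpy),
        Waypoint("grasp", tuple(grasp_pos), rpy),
        Waypoint("retract", above_grasp, rpy),
        Waypoint("place", tuple(place_pos), rpy),
    ]


for wp in plan_pick_and_place((0.2, 0.1, 0.05), (-0.2, 0.0, 0.05)):
    print(wp.label, wp.position)
```

Each waypoint is only a target pose, so the planner never has to reason about torques; that burden falls entirely on the low-level executor.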

At the low level, an RL-trained executor converts each waypoint into torque commands for the soft arm. The executor is optimized with dense rewards that combine pose accuracy and motion stability, encouraging smooth convergence while limiting oscillation and unstable deformation.
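One way such a dense reward could be shaped is sketched below: an accuracy term that falls as the pose difference grows, plus a stability term driven by the change in pose difference between consecutive steps. The exact functional forms, the weights `w_d` and `w_s`, the rotation weight, and the roll-pitch-yaw pose encoding are assumptions for illustration and may differ from the rewards actually used.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def pose_difference(pos, target_pos, rpy, target_rpy, rot_weight=0.1):
    """Scalar pose difference: position error plus weighted orientation error.
    rot_weight is a hypothetical trade-off between meters and radians."""
    pos_err = math.dist(pos, target_pos)
    rot_err = math.sqrt(sum(wrap_angle(a - b) ** 2
                            for a, b in zip(rpy, target_rpy)))
    return pos_err + rot_weight * rot_err


def step_reward(d_prev, d_curr, w_d=1.0, w_s=1.0):
    """Dense per-step reward from the previous and current pose difference."""
    r_d = -d_curr          # accuracy: negatively correlated with the error
    r_s = d_prev - d_curr  # stability: positive when the error shrinks,
                           # negative when the arm drifts or oscillates away
    return w_d * r_d + w_s * r_s


# Moving toward the target scores higher than moving away from it.
print(step_reward(0.5, 0.3) > step_reward(0.3, 0.5))
```

Under this form, an arm that oscillates around the target alternately gains and loses the stability term, so steady convergence earns more than overshooting for the same final error.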

Trajectory generation pipeline in ManiSoft

Figure 4. Trajectory generation pipeline in ManiSoft. (a) An executor is trained with RL to transform each waypoint (a 6-DoF pose) into torques. (b) The RL rewards balance accuracy and stability: a pose-difference reward Rd that is negatively correlated with the pose difference, and a stability reward Rs that penalizes or rewards changes in the pose difference. (c) Predefined task-specific rules produce the high-level plan (trajectory waypoints) for each case, which the executor converts into low-level actions (torque commands) to generate complete trajectories.

Main Results

We benchmark representative policy models on ManiSoft, including Diffusion Policy (DP), RDT, and OpenVLA-OFT. Overall, performance in clean scenes is noticeably higher than in randomized scenes, confirming that visual variation, clutter, and obstacle injection remain major challenges for current vision-language-action models on soft robots.

OpenVLA-OFT shows the strongest robustness under randomization, while DP remains competitive in simpler clean settings. Across methods, Collection is consistently the easiest task, whereas Stacking and Arrangement expose the difficulty of precise deformable control and obstacle-aware reasoning.

Main benchmark results on ManiSoft

BibTeX

@article{wei2026manisoft,
      title={ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics},
      author={Ziyu Wei and Luting Wang and Chen Gao and Li Wen and Si Liu},
      journal={arXiv preprint arXiv:2605.18617},
      year={2026},
}