Overview
Embodied intelligence, realized through robots' interaction with their environment, has evolved rapidly from rule-based control to autonomous systems that integrate deep learning and reinforcement learning. Current research in embodied intelligence predominantly focuses on rigid robotic platforms. However, rigid materials not only limit flexibility and increase collision risk but also leave such robots poorly suited to unstructured and constrained environments. To address these limitations, researchers have drawn inspiration from the biological traits of mollusks, introducing flexible materials into robotic design and advancing embodied intelligence centered on soft robotic platforms. Owing to their deformable bodies, soft robots offer highly adaptive and safe solutions, particularly for human-robot collaboration and tasks in complex environments. Nevertheless, their underactuated nature and strongly nonlinear dynamics pose significant challenges for the design of autonomous control systems.
This workshop focuses on multimodal perception and decision-making for soft robots, promoting cutting-edge embodied-intelligence research built on soft robotic platforms. We aim to bring together researchers and practitioners to explore emerging challenges and solutions, including the design of adaptive control systems for soft robots, the integration of multimodal sensory data, and the optimization of decision-making algorithms under nonlinear dynamics.
Invited Speakers

Cosimo Della Santina
TU Delft

Xiang Li
Tsinghua University

Shuqiang Jiang
Chinese Academy of Sciences

Qin Jin
Renmin University of China

Jiebo Luo
University of Rochester
Call for Papers
This workshop aims to bring together researchers and practitioners from different disciplines to share ideas and methods on soft robot learning. We welcome research contributions as well as best-practice contributions on (but not limited to) the following topics:
- Multimodal Embodied Navigation: visual navigation, vision-language navigation, soft robot navigation
- Multimodal Embodied Manipulation: grasping, dexterous manipulation, soft-hand manipulation, tool manipulation
- Embodied Reasoning: spatial reasoning, affordance learning, task planning
- Embodied Perception: multi-modal perception, active perception
- Embodied Simulation: 2D/3D reconstruction, sim-to-real, benchmark
- Control Methods for Soft Robots: model-based/learning-based control methods
For this workshop, we accept papers meeting the following requirements:
- 4–8 pages for the main text, plus up to 2 pages for references.
- Topics include but are not limited to original ideas, perspectives, research visions, and open challenges in the themes outlined above.
Submission Website: https://openreview.net/group?id=acmmm.org%2FACMMM%2F2025%2FWorkshop%2FRoboSoft#tab-your-consoles
Submission templates are available on the ACM MM 2025 website.
Submissions must adhere to the ACM MM 2025 submission policies.
The workshop will also present a Best Paper Award.
Challenge
To advance research in multimodal soft robot planning, we propose two challenge tasks.
This competition employs Elastica, open-source software developed by the Gazzola Lab at UIUC, for soft-body dynamics modeling, establishing a benchmark platform for simulating soft robot dynamics and interaction. In this benchmark, each soft robot is modeled as a single Cosserat rod moving freely in 3D space, serving as a flexible manipulator in Task 1 and a flexible mobile body in Task 2. The rod has a Young's modulus of 10 MPa, giving it the bending stiffness typical of soft robots.
Actuation is realized through internal moments distributed along the rod's length. The continuous activation function is described by spline curves defined by N independent control points and approaches zero at both ends of the rod. Precise control is achieved by decomposing the overall actuation into orthogonal moment components along the local normal and binormal directions (inducing bending) and the tangent direction (inducing torsion).
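As a concrete illustration of this setup, the sketch below builds a single Cosserat rod in PyElastica with the specified 10 MPa Young's modulus and applies a spline-parameterized internal torque profile that vanishes at both rod ends. Only the Young's modulus comes from the challenge description; the rod geometry, density, number of control points, control-point values, and the SplineTorques class are illustrative assumptions, and keyword names may vary slightly between PyElastica versions.

```python
import numpy as np
from scipy.interpolate import CubicSpline
import elastica as ea

# --- Rod definition (Young's modulus from the challenge; other values are assumptions) ---
n_elements = 50
young = 10e6  # 10 MPa, as specified in the challenge
rod = ea.CosseratRod.straight_rod(
    n_elements=n_elements,
    start=np.zeros(3),
    direction=np.array([0.0, 0.0, 1.0]),
    normal=np.array([0.0, 1.0, 0.0]),
    base_length=1.0,      # assumed rod length [m]
    base_radius=0.025,    # assumed rod radius [m]
    density=1000.0,       # assumed density [kg/m^3]
    youngs_modulus=young,
    shear_modulus=young / (2.0 * (1.0 + 0.5)),  # assumed near-incompressible material
)

# --- Spline activation: N control points, pinned to zero at both rod ends ---
N = 6  # number of independent control points (assumption)
control_points = np.array([0.0, 0.3, 0.8, -0.5, 0.2, 0.0])  # illustrative moment amplitudes [N*m]
s_ctrl = np.linspace(0.0, 1.0, N)            # control-point locations (normalized arc length)
s_elem = np.linspace(0.0, 1.0, n_elements)   # element centers along the rod
torque_profile = CubicSpline(s_ctrl, control_points)(s_elem)

class SplineTorques(ea.NoForces):
    """Apply the spline-defined internal torque about one local direction.

    direction_index 0/1 (normal/binormal) induces bending; 2 (tangent) induces torsion.
    """

    def __init__(self, torque_profile, direction_index=0):
        super().__init__()
        self.torque_profile = torque_profile
        self.direction_index = direction_index

    def apply_torques(self, system, time=0.0):
        # external_torques is stored in the rod's local (material) frame, shape (3, n_elements)
        system.external_torques[self.direction_index, :] += self.torque_profile
```

In a complete simulation, such a forcing class would be attached with PyElastica's standard add_forcing_to(rod).using(...) pattern and integrated with a timestepper, following the official Elastica examples.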
Task 1: Vision-Language Manipulation for Soft Robot
Vision-Language Manipulation aims to endow soft robots with the ability to interact with objects based on human instructions and visual perception—a capability crucial for manufacturing and medical fields, encompassing scenarios such as object grasping, component assembly, item classification, and even surgical assistance. In this task, the soft robot must operate within a complex workspace containing various objects (e.g., cubes, spheres, cones). One end of the robot is fixed to a base, while the other end moves freely to accomplish manipulation tasks.
The system inputs include natural language instructions (specifying the object to manipulate and its target position) and multi-view visual observations. The robot must first recognize and localize the target object from visual inputs, then execute motions to transport it to the specified position. The operation is deemed successful when the object accurately reaches the target location.
Instruction: Move the football to the basketball
Instruction: Move the smaller yellow roadblocks next to the larger roadblocks
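The exact tolerance behind "accurately reaches the target location" is not specified above; as a placeholder, success can be checked with a simple distance threshold between the manipulated object and the target position, as in the sketch below. The function name and the 5 cm tolerance are assumptions, not the official evaluation metric.

```python
import numpy as np

def manipulation_success(object_position, target_position, tolerance=0.05):
    """Return True if the object's center lies within `tolerance` meters of the target.

    The 0.05 m tolerance is an illustrative assumption; the official challenge
    evaluation may use a different criterion.
    """
    distance = float(np.linalg.norm(np.asarray(object_position) - np.asarray(target_position)))
    return distance <= tolerance

# Example: the object ends up about 2.8 cm from the commanded target, which counts as a success here.
print(manipulation_success([0.10, 0.02, 0.0], [0.12, 0.00, 0.0]))  # True
```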
Task 2: Vision-Language Navigation for Soft Robot
Vision-Language Navigation requires soft robots to autonomously explore complex environments by comprehending linguistic instructions and parsing visual cues. This task is significant for applications such as disaster search and rescue as well as exploration. Within this task, the agent must process synchronized multimodal inputs comprising visual observations and natural language instructions, which requires cross-modal alignment between vision-language understanding and soft-body dynamics modeling: instructions must be translated into continuum-mechanics control actions.
The solution space must jointly optimize semantic localization accuracy, deformation trajectory smoothness, and obstacle avoidance feasibility under time-varying boundary conditions. Vision-Language Navigation for soft robots establishes a new research domain in embodied intelligence, where soft robots execute navigation tasks through morphological adaptation in dynamic environments.
Instruction: Navigate to the basketball between two footballs
Instruction: Navigate to the football next to the yellow cone
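To make the joint optimization above concrete, one way a submission might score an episode is a weighted combination of the three criteria (semantic localization accuracy, deformation trajectory smoothness, and obstacle avoidance feasibility). The sketch below is only an illustrative formulation; the weights, terms, and function name are assumptions rather than the official challenge metric.

```python
import numpy as np

def navigation_cost(goal_error, tip_trajectory, min_obstacle_clearance,
                    w_goal=1.0, w_smooth=0.1, w_clear=0.5):
    """Illustrative composite cost for one navigation episode (lower is better).

    goal_error: final distance [m] between the rod and the instructed goal object.
    tip_trajectory: (T, 3) array of positions of the rod's moving end over the episode.
    min_obstacle_clearance: smallest rod-to-obstacle distance [m] observed (negative if penetrating).
    All weights are assumptions.
    """
    # Semantic localization accuracy: how close the robot ends up to the described goal.
    goal_term = goal_error
    # Deformation trajectory smoothness: mean squared second difference of the tip path.
    accel = np.diff(np.asarray(tip_trajectory), n=2, axis=0)
    smooth_term = float(np.mean(np.sum(accel**2, axis=1))) if len(accel) else 0.0
    # Obstacle avoidance feasibility: penalize contact or penetration.
    clearance_term = max(0.0, -min_obstacle_clearance)
    return w_goal * goal_term + w_smooth * smooth_term + w_clear * clearance_term
```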
Important Dates
- Paper submission deadline: July 11, 2025
- Challenge submission deadline: July 30, 2025
- Paper notification: August 1, 2025
- Camera-ready submission: August 11, 2025
- Workshop date: October 27–28, 2025
Program Committee

Si Liu
Beihang University

Li Wen
Beihang University

Chen Gao
Beihang University

Ziyu Ren
Beihang University

Luting Wang
Beihang University

Jiaqi Liu
Beihang University

Heqing Yang
Beihang University

Xingyu Chen
Beihang University

Youning Duo
Beihang University
Challenge Technical Committee

Ziyu Wei
Beihang University

Hongliang Huang
Beihang University