OctoNav

Towards Generalist Embodied Navigation

¹Beihang University ²National University of Singapore
³Peking University ⁴Zhongguancun Academy
*Equal Contribution
⁺Corresponding Author

Abstract

Embodied navigation is a foundational pillar of the broader pursuit of embodied intelligence. However, previous navigation research is divided into separate tasks and capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task settings, objectives, and modalities, so datasets and methods are designed individually for each. In this work, we take steps toward generalist navigation agents that can follow free-form instructions combining arbitrary modalities and capabilities. To this end, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a purpose-built automatic annotation pipeline. We carefully craft instruction-trajectory pairs for imitation learning, where the instructions are free-form and diverse, with arbitrary modalities and capabilities. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. OctoNav-R1 is built upon MLLMs and adapted into a VLA-type model that produces low-level actions based solely on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages: Action-/TBA-SFT, Nav-GRPO, and Online RL. Each stage contains specifically designed learning policies and rewards. Importantly, the TBA-SFT and Nav-GRPO designs are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer. We therefore investigate how to achieve thinking-before-action in embodied navigation, in order to improve the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT, which uses the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
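
As a rough illustration of the group-relative idea that GRPO-style training (as popularized by DeepSeek-R1) relies on, the sketch below normalizes the rewards of several sampled responses within one group. It is not the Nav-GRPO implementation: the function name and the reward values are hypothetical, the page does not specify the actual reward design, and the population standard deviation is used here as an assumption (implementations differ).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper).

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are scored relative to their siblings rather than by a
    learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled think-then-act responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group average receive positive advantages and the rest receive negative ones; if every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.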

Real-World Robot Demos

We deploy OctoNav-R1 on physical robots and observe preliminary sim-to-real transfer ability without any real-world fine-tuning. This further confirms the potential of OctoNav-Bench and OctoNav-R1 for real-world applications.

Move forward and go to the chair on the right.

Your current position is (1.0, 1.0, 0.0) and your current orientation is (1.0, 0.0, 0.0). Navigate to the coordinates (3.0, 3.0, 0.0) to reach the target zone.

Initially, look for the chair, and travel towards the chair. Secondly, follow the reference image {ImageNav}(tv) until reaching the matching scene.

Head straight for the sofa, then turn right to the tv.

Firstly, travel forward to the tv, after that, proceed straight to the glass door.

Seek out the sofa and head straight to the sofa.
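
For the coordinate-based instruction above, the geometry the agent has to resolve can be worked out directly. The sketch below is only an illustration under stated assumptions: motion is taken to lie in the x-y plane (the third coordinate is 0 for both points), the orientation is read as a forward direction vector, and a positive turn angle means counter-clockwise. The page does not state these conventions, and the function name is hypothetical.

```python
import math


def point_goal(position, forward, target):
    """Return (distance, signed turn in degrees) to the target.

    Assumes planar motion in x-y and counter-clockwise-positive angles;
    both conventions are assumptions, not taken from the page.
    """
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    distance = math.hypot(dx, dy)
    heading = math.atan2(forward[1], forward[0])  # current facing angle
    bearing = math.atan2(dy, dx)                  # angle toward the target
    turn = math.degrees(bearing - heading)
    turn = (turn + 180.0) % 360.0 - 180.0         # wrap into [-180, 180)
    return distance, turn


distance, turn = point_goal((1.0, 1.0, 0.0), (1.0, 0.0, 0.0), (3.0, 3.0, 0.0))
```

Under these assumptions the target is about 2.83 units away (the square root of 8) and 45 degrees off the current heading.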

