OctoNav: Towards Generalist Embodied Navigation

Chen Gao1,2*       Liankai Jin1*       Xingyu Peng1,4*       Jiazhao Zhang3
Yue Deng1,4       Annan Li1       He Wang3       Si Liu1+
1Beihang University       2National University of Singapore
3Peking University       4Zhongguancun Academy
*Equal Contribution

+Corresponding Author
Teaser Figure

On the left, we present the large-scale OctoNav-Bench, which contains diverse instruction-trajectory pairs and the carefully constructed TBA-CoT dataset across numerous scenes. Based on OctoNav-Bench and our method/training designs, we introduce a VLA-based method, termed OctoNav-R1. On the right, (I) shows performance comparisons on OctoNav-Bench, with a fine-grained breakdown of accuracy across navigation capabilities; OctoNav-R1 outperforms previous methods on all capabilities, demonstrating its versatility. (II) presents a real-world robot demo driven by OctoNav-R1, showing its preliminary sim-to-real generalization.

Abstract

Embodied navigation stands as a foundational pillar within the broader pursuit of embodied intelligence. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav, and VLN, which differ in task settings/objectives and modalities, so their datasets and methods are designed individually. In this work, we take steps toward generalist navigation agents that can follow free-form instructions composed of arbitrary combinations of modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed automatic annotation pipeline. We thoroughly craft instruction-trajectory pairs for imitation learning, where the instructions are diverse and free-form, covering arbitrary modalities and capabilities. We also elaborately construct a Think-Before-Action Chain-of-Thought (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. For OctoNav-R1, we build it upon MLLMs and adapt it into a VLA-type model, which produces low-level actions solely from 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL, each with specifically designed learning policies and rewards. Importantly, the TBA-SFT and Nav-GRPO designs are inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answering. We thus investigate how to achieve thinking-before-action in embodied navigation, improving the model's reasoning ability toward generalist agents. Specifically, we propose TBA-SFT, which uses the TBA-CoT dataset to fine-tune the model as a cold-start phase, and then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
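As a concrete illustration of the Action-/TBA-SFT stage, the sketch below shows one plausible way an instruction-trajectory pair with a TBA-CoT trace could be formatted into a "think-before-action" supervised target. The NavSample fields, the <think>/<action> tags, and the discrete action set are assumptions made for illustration, not the released OctoNav-Bench data format or the OctoNav-R1 training code.

```python
# Minimal sketch (assumed format) of turning an instruction-trajectory pair plus a
# TBA-CoT trace into a supervised "think-before-action" target for Action-/TBA-SFT.

from dataclasses import dataclass
from typing import List

# Hypothetical discrete low-level action vocabulary.
ACTIONS = {0: "STOP", 1: "MOVE_FORWARD", 2: "TURN_LEFT", 3: "TURN_RIGHT"}

@dataclass
class NavSample:
    instruction: str        # free-form, may mix capabilities (ObjNav/ImgNav/VLN/PointNav)
    rgb_frames: List[str]   # paths to 2D egocentric observations along the trajectory
    actions: List[int]      # expert low-level action ids for each step
    thought: str = ""       # TBA-CoT reasoning trace behind the action, if available

def build_tba_sft_target(sample: NavSample, step: int) -> dict:
    """Format one trajectory step as (prompt, target), reasoning before acting."""
    prompt = (
        f"Instruction: {sample.instruction}\n"
        f"Observation: <image:{sample.rgb_frames[step]}>\n"
        "First think about where to go, then output the next action."
    )
    target = (
        f"<think>{sample.thought}</think> "
        f"<action>{ACTIONS[sample.actions[step]]}</action>"
    )
    return {"prompt": prompt, "target": target}

if __name__ == "__main__":
    demo = NavSample(
        instruction="Head straight for the sofa, then turn right to the tv.",
        rgb_frames=["step_000.png"],
        actions=[1],
        thought="The sofa is ahead; move forward before turning toward the tv.",
    )
    print(build_tba_sft_target(demo, step=0))
```

Under this assumed format, Nav-GRPO would then sample several such think-then-act responses per prompt and reward the groups whose actions advance toward the goal, before the final Online RL stage refines the policy with environment feedback.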

Real-World Robot Demos

  • We deploy OctoNav-R1 on physical robots and observe preliminary sim-to-real transfer without any real-world fine-tuning. This further confirms the potential of OctoNav-Bench and OctoNav-R1 for real-world applications.

Example instructions from the demo videos:

  • Move forward and go to the chair on the right.
  • Your current position is (1.0, 1.0, 0.0) and your current orientation is (1.0, 0.0, 0.0). Navigate to the coordinates (3.0, 3.0, 0.0) to reach the target zone.
  • Initially, look for the chair, and travel towards the chair. Secondly, follow the reference image {ImageNav}(tv) until reaching the matching scene.
  • Head straight for the sofa, then turn right to the tv.
  • Firstly, travel forward to the tv, after that, proceed straight to the glass door.
  • Seek out the sofa and head straight to the sofa.
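The sketch below shows one plausible way such compound, multi-capability instructions could be composed. The clause templates, the function names (objnav, imgnav, pointnav, compound), and the {ImageNav}(...) placeholder convention are inferred from the examples above and are assumptions, not the official OctoNav-Bench annotation pipeline.

```python
# Hedged sketch of composing compound, multi-capability navigation instructions,
# modeled on the demo prompts above. All templates here are assumptions.

from typing import Tuple

def objnav(target: str) -> str:
    # Object-goal clause.
    return f"look for the {target}, and travel towards the {target}"

def imgnav(tag: str) -> str:
    # Image-goal clause; the reference image is injected later at the placeholder.
    return f"follow the reference image {{ImageNav}}({tag}) until reaching the matching scene"

def pointnav(position: Tuple[float, float, float],
             orientation: Tuple[float, float, float],
             goal: Tuple[float, float, float]) -> str:
    # Point-goal instruction with coordinates, as in the second demo prompt.
    return (
        f"Your current position is {position} and your current orientation is {orientation}. "
        f"Navigate to the coordinates {goal} to reach the target zone."
    )

def compound(*clauses: str) -> str:
    """Join capability clauses with ordering words, as in the demo instructions."""
    order = ["Initially,", "Secondly,", "Then,", "Finally,"]
    return " ".join(f"{order[min(i, len(order) - 1)]} {c}." for i, c in enumerate(clauses))

if __name__ == "__main__":
    print(compound(objnav("chair"), imgnav("tv")))
    # -> Initially, look for the chair, and travel towards the chair. Secondly, follow
    #    the reference image {ImageNav}(tv) until reaching the matching scene.
    print(pointnav((1.0, 1.0, 0.0), (1.0, 0.0, 0.0), (3.0, 3.0, 0.0)))
```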

BibTeX

BibTex Code Here