
Robotic Control via Embodied Chain-of-Thought Reasoning

Notes

April 2025

Abstract

A key limitation of standard policy networks for robotic control is out-of-distribution generalization. VLAs can generalize better because they build on VLMs pretrained on large-scale internet data. However, chain-of-thought prompting is less effective with VLAs, since most are fine-tuned without exposure to reasoning data, and purely semantic (language-only) CoT is also a poor fit for policies, which typically need to reason about sensory observations and the robot's state. The authors introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs: they train VLAs to perform multiple reasoning steps about their actions, sub-tasks, motions, etc., and build a scalable pipeline for generating synthetic reasoning training data on large robot datasets.

Background

Previous works have shown that VLMs and VLAs can benefit from high-level chain-of-thought reasoning, i.e., reasoning in language space, but the authors conjecture that VLMs/VLAs acting as robotic policies also need to reason in the sensory and control space. Hence they propose embodied chain-of-thought reasoning (ECoT).

Preliminaries: Vision-Language-Action Models

VLAs are trained with a simple recipe. Starting with a pre-trained VLM, directly fine-tune the model to autoregressively predict the next robot action $a$ given the current image observation $I$ and task instruction $T$.
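In symbols (my notation, not the paper's): with $\pi_\theta$ denoting the VLA and $\mathcal{D}$ the robot demonstration dataset, the recipe amounts to standard next-token prediction over the discretized action tokens $a_k$:

$$\max_\theta \; \mathbb{E}_{(I, T, a) \sim \mathcal{D}} \Big[ \sum_k \log \pi_\theta\big(a_k \mid I, T, a_{<k}\big) \Big]$$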

To make this work, robot actions are converted to discrete action tokens $\mathcal{T}_a$ in the vocabulary of the vision-language model.

Assume a set of continuous action values in the range $[-1, 1]$. Split the range into a set of bins ($256$ in this case) and assign each bin a unique token; every value is then mapped to the token of the bin it falls into. More bins give finer-grained robot movement, while fewer bins make inference simpler and faster.
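A minimal sketch of this binning scheme, assuming equal-width bins over $[-1, 1]$; the bin count follows the text, but the helper names are mine:

```python
import numpy as np

N_BINS = 256  # number of discrete bins, as in the text

def actions_to_tokens(actions: np.ndarray, n_bins: int = N_BINS) -> np.ndarray:
    """Map continuous action values in [-1, 1] to discrete bin indices (token ids)."""
    actions = np.clip(actions, -1.0, 1.0)
    # Bin edges split [-1, 1] into n_bins equal-width intervals.
    bin_edges = np.linspace(-1.0, 1.0, n_bins + 1)
    # np.digitize against the interior edges gives indices in [0, n_bins - 1].
    return np.digitize(actions, bin_edges[1:-1])

def tokens_to_actions(tokens: np.ndarray, n_bins: int = N_BINS) -> np.ndarray:
    """Map bin indices back to the center of each bin (approximate de-tokenization)."""
    bin_width = 2.0 / n_bins
    return -1.0 + (tokens + 0.5) * bin_width

# Example: a 7-DoF action (6 end-effector deltas + gripper) round-tripped through tokens.
action = np.array([0.12, -0.40, 0.05, 0.0, 0.0, 0.3, 1.0])
tokens = actions_to_tokens(action)
recovered = tokens_to_actions(tokens)
```

In OpenVLA-style VLAs these bin indices are then mapped onto (rarely used) token ids in the language model's vocabulary; that remapping is omitted here.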

The model consists of a visual encoder (SigLIP and DINOv2) and LLaMA-2 7B as the backbone. The LLaMA-2 7B backbone takes in visually encoded features projected into its latent space, together with textual features obtained by passing the instruction through the tokenizer, and outputs action tokens for the robot.
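A structural sketch of that data flow, with tiny placeholder modules in place of the real encoders and LLaMA-2 7B backbone (dimensions, module choices, and the non-causal attention are simplifications, not the paper's implementation):

```python
import torch
import torch.nn as nn

class VLASketch(nn.Module):
    """Structural sketch of a VLA: visual features + text tokens -> token logits.

    Stand-ins only: the real model uses fused SigLIP/DINOv2 patch features and a
    LLaMA-2 7B decoder; here small placeholder modules make the wiring visible.
    """

    def __init__(self, vocab_size: int = 32000, d_model: int = 512, patch_dim: int = 768):
        super().__init__()
        self.visual_projector = nn.Linear(patch_dim, d_model)  # projects patch features into the LM space
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(                 # stand-in for the causal LLaMA-2 decoder
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)          # logits over the shared text/action vocabulary

    def forward(self, patch_features: torch.Tensor, text_token_ids: torch.Tensor) -> torch.Tensor:
        vis = self.visual_projector(patch_features)  # (B, n_patches, d_model)
        txt = self.token_embedding(text_token_ids)   # (B, n_text, d_model)
        fused = torch.cat([vis, txt], dim=1)         # visual tokens are prepended to the instruction
        hidden = self.backbone(fused)
        return self.lm_head(hidden)                  # next-token logits; action tokens are decoded from here

# Toy usage: 4 image patches and a 6-token instruction for a batch of 1.
model = VLASketch()
logits = model(torch.randn(1, 4, 768), torch.randint(0, 32000, (1, 6)))
```

The real model decodes reasoning and action tokens autoregressively with a causal decoder; the encoder stand-in above is only meant to show how visual and textual features are fused before the language-model head.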

Embodied Chain-of-Thought Reasoning for Visuomotor Policies

They train VLAs to perform embodied CoT by labeling data from existing robot datasets with reasoning chains, populated with features extracted from pre-trained models, and use the resulting dataset of observation $\rightarrow$ reasoning $\rightarrow$ action examples for training.

All elements of the generated reasoning data are represented as strings, so the LLaMA-2 tokenizer can translate them into reasoning tokens.
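A small illustration of that point: even non-textual elements such as pixel coordinates and bounding boxes are serialized to plain strings before tokenization. The serialization format below is mine, not the paper's, and the tokenizer call assumes access to the gated Llama-2 checkpoint on Hugging Face.

```python
from transformers import AutoTokenizer

# Non-textual reasoning elements (pixel coordinates, boxes) become plain strings,
# so the standard LLaMA-2 tokenizer handles them like any other text.
gripper_px = (172, 94)
boxes = {"mug": (65, 110, 140, 190), "plate": (200, 80, 300, 160)}

reasoning_str = (
    f"GRIPPER POSITION: {list(gripper_px)}; VISIBLE OBJECTS: "
    + ", ".join(f"{name} {list(box)}" for name, box in boxes.items())
)

# Gated checkpoint; requires accepting the license on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
reasoning_token_ids = tokenizer(reasoning_str).input_ids
```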

Interesting, but there could be a loss of information when they convert to textual reasoning tokens -- why not reason in a continuous latent space?

Questions remain:

  • Which reasoning steps are suitable for guiding policies in solving embodied robot manipulation tasks?
  • How can we generate training data for these reasoning steps at scale on existing robot datasets?
  • How can the policy reason carefully without slowing down inference too much?

The goal when designing the CoT reasoning traces is to (1) reason through the high-level steps of the task and (2) map that high-level reasoning to lower-level features of the scene before predicting the next robot action.

The ECoT reasoning steps the VLA is trained on are as follows (a sketch of the resulting reasoning string appears after the list):

  1. Rephrase the task
  2. Predict high-level plan of steps for achieving the task (Plan)
  3. Reason through which subtasks need to be performed at the current step.
  4. Predict a low-level command corresponding to the low-level action the robot needs to execute.
  5. Predict precise spatial features that describe the scene in the current state (the pixel position of the robot end effector and the bounding-box pixel coordinates of all objects in the scene).
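Putting the five steps together, a single training target might look like the string below. The tag names and wording are illustrative (the paper's exact formatting may differ), but the ordering from task rephrasing down to spatial grounding matches the list above.

```python
# Illustrative ECoT reasoning chain for one timestep; tags and phrasing are my own,
# only the ordering (task -> plan -> subtask -> move -> spatial grounding) follows the notes above.
ecot_chain = (
    "TASK: Pick up the mug and place it on the plate.\n"
    "PLAN: 1) move to the mug, 2) grasp the mug, 3) move above the plate, 4) release.\n"
    "SUBTASK: Move the gripper to the mug, since it has not been grasped yet.\n"
    "MOVE: Move the arm left and down toward the mug.\n"
    "GRIPPER POSITION: [172, 94]\n"
    "VISIBLE OBJECTS: mug [65, 110, 140, 190], plate [200, 80, 300, 160]\n"
    "ACTION: "  # the discrete action tokens are appended here as the prediction target
)
```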

To generate data, they use the Prismatic-7B VLM to describe the scene and concatenate that description with the original instruction as input to Grounding DINO, an object detection model conditioned on textual inputs.
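Roughly, this labeling step looks like the sketch below; `captioner` and `detector` are placeholder callables standing in for Prismatic-7B and Grounding DINO (this is not their real API), and the prompt wording is mine.

```python
def generate_bounding_boxes(image, task_instruction, captioner, detector):
    """Sketch of the object-labeling step: caption the scene, then ground objects with text.

    `captioner` stands in for the Prismatic-7B VLM and `detector` for Grounding DINO;
    both are placeholders, not real APIs.
    """
    # 1. Ask the VLM for a scene description listing the objects it sees.
    scene_description = captioner(image, prompt="Briefly list the objects in this scene.")

    # 2. Concatenate the description with the original instruction so the detector
    #    is queried for both scene-relevant and task-relevant objects.
    text_prompt = f"{scene_description}. {task_instruction}"

    # 3. Text-conditioned detection returns (label, bounding box) pairs in pixel coordinates.
    detections = detector(image, text_prompt)
    return scene_description, detections
```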

They use proprioception (via sensors on the robot joints) to collect movement data, and they use OWLv2 and SAM to detect the gripper in the image, combining these detections with the robot's 3D proprioceptive positions to fit an estimate of the gripper's position.
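A toy version of turning short-horizon gripper displacements into low-level movement primitives; the axis conventions, threshold, and vocabulary are mine, not the paper's.

```python
import numpy as np

def movement_primitive(gripper_xyz_t, gripper_xyz_t_plus_k, threshold=0.01):
    """Name the dominant direction(s) of gripper motion over a short horizon.

    Axis conventions and the 1 cm threshold are illustrative, not the paper's values.
    """
    dx, dy, dz = np.asarray(gripper_xyz_t_plus_k) - np.asarray(gripper_xyz_t)
    parts = []
    if abs(dx) > threshold:
        parts.append("move right" if dx > 0 else "move left")
    if abs(dy) > threshold:
        parts.append("move forward" if dy > 0 else "move backward")
    if abs(dz) > threshold:
        parts.append("move up" if dz > 0 else "move down")
    return " and ".join(parts) if parts else "stop"

# e.g. movement_primitive([0.30, 0.10, 0.20], [0.30, 0.10, 0.26]) -> "move up"
```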

Finally, they generate the full reasoning chain by feeding each episode's task instruction, scene description, and per-step movement primitives into Gemini 1.0, prompting it to produce a high-level plan of sub-tasks consistent with the task instruction, the observed movements, and the current state.
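A rough sketch of how such a labeling prompt could be assembled; the wording is illustrative, and the actual Gemini 1.0 prompt from the paper is not reproduced here.

```python
def build_plan_prompt(task_instruction, scene_description, movement_primitives):
    """Assemble a per-episode labeling prompt for the LLM (wording is illustrative)."""
    primitives_str = "\n".join(
        f"  step {t}: {move}" for t, move in enumerate(movement_primitives)
    )
    return (
        "You are labeling a robot demonstration.\n"
        f"Task instruction: {task_instruction}\n"
        f"Scene description: {scene_description}\n"
        f"Observed movements:\n{primitives_str}\n"
        "Produce a high-level plan of sub-tasks consistent with the instruction and the "
        "observed movements, and state which sub-task is being executed at each step."
    )
```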