SPOC 🖖

Imitating Shortest Paths in Simulation
Enables Effective Navigation and Manipulation in the Real World

*Equal Contribution, ordered randomly
1Allen Institute for AI, 2University of Washington, 3EPFL

SPOC learns from shortest paths to navigate, explore, and manipulate objects in simulation and the real world.

Highlights

Exploration emerges from imitating shortest paths.

No RL and no human data!
Scale is the key.

SPOC only uses RGB observations!

No depth/LiDAR, no maps.
No privileged information!

SPOC is trained in simulation.

No sim-to-real adaptation.
No real-world fine-tuning!

SPOC is proficient at long-horizon tasks.

Including both navigation and manipulation.

Results

SPOC is effective in the real world. Both the environment and the object instances encountered are entirely novel to the model, as its training was exclusively conducted in a simulator. This highlights SPOC's ability to generalize and adapt to previously unencountered, real-world conditions.


Fetching a mug

SPOC successfully locates and picks up a mug.

Fetching an apple

SPOC successfully searches for and picks up an apple.

Navigating to a toilet

SPOC successfully explores to find a toilet.

Extended search for a bed

SPOC explores a long hallway (an uncommon architectural feature within our ProcTHOR training scenes).

Picking up a houseplant

SPOC mistakenly attempts to extend its arm into the coffee table, realizes its error, and is able to progress.

Finding a houseplant

A partial failure example: the agent does eventually "succeed," as it takes the done action with a very small sliver of the houseplant visible, but this success does not appear robust.

How does it work?

Reinforcement Learning (RL) often faces challenges with speed and efficiency, while Imitation Learning (IL) is constrained by the need for extensive and varied data. We show that using transformer architectures with long context windows to imitate heuristic planners at scale can help unlock the power of IL and produce effective agents in simulation and the real world.

Data

SPOC is trained on a large-scale dataset of expert trajectories consisting of RGB observations paired with the corresponding expert actions. This video shows a sample from our dataset: the RGB observations captured simultaneously by the navigation and manipulation cameras, together with the action taken by the expert at each time step during the task of fetching a bowl.
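
For concreteness, one can picture a single trajectory in this dataset as a goal string, time-aligned frames from both cameras, and one expert action per step. The sketch below is our own illustration; the field names (goal, nav_rgb, manip_rgb, actions) are assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a SPOC-style expert trajectory for imitation learning.
# Field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class ExpertTrajectory:
    goal: str               # natural-language instruction, e.g. "fetch a bowl"
    nav_rgb: np.ndarray     # (T, H, W, 3) navigation-camera frames
    manip_rgb: np.ndarray   # (T, H, W, 3) manipulation-camera frames
    actions: np.ndarray     # (T,) integer ids of the expert's discrete actions

def training_pairs(traj: ExpertTrajectory):
    """Yield (goal, observation history, expert action) pairs for behavior cloning."""
    for t in range(len(traj.actions)):
        history = (traj.nav_rgb[: t + 1], traj.manip_rgb[: t + 1])
        yield traj.goal, history, traj.actions[t]
```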

Model

SPOC, an agent embodied in the Stretch RE-1 robot, is trained to follow text instructions for navigation and task completion. It processes both text instructions and visual data to determine its actions. The Stretch robot has two cameras: one for navigation and another for arm manipulation. The model has three main parts: a textual goal encoder for language instructions, a visual encoder that conditions each visual input on the instruction, and a transformer action decoder that predicts each action based on the goal, current and past visuals, and previous actions.
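
As a rough sketch of how these three parts fit together, see the minimal PyTorch illustration below (class and argument names are our own assumptions, not the released code); the encoder and decoder are sketched in more detail in the following subsections.

```python
import torch
import torch.nn as nn

class SPOCPolicy(nn.Module):
    """Sketch of the three-part policy: goal encoder, visual encoder, action decoder."""

    def __init__(self, goal_encoder, visual_encoder, action_decoder):
        super().__init__()
        self.goal_encoder = goal_encoder      # text instruction -> goal tokens
        self.visual_encoder = visual_encoder  # per-step camera features + goal -> state
        self.action_decoder = action_decoder  # states + previous actions + goal -> logits

    def forward(self, instruction, nav_feats, manip_feats, prev_actions):
        # nav_feats / manip_feats: (T, B, ...) per-step features from the two cameras.
        goal = self.goal_encoder(instruction)
        states = torch.stack(
            [self.visual_encoder(n, m, goal) for n, m in zip(nav_feats, manip_feats)],
            dim=1,
        )  # (B, T, d_model)
        return self.action_decoder(states, prev_actions, goal)
```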


Goal-Conditioned Visual Encoder. At each time step, the goal-conditioned visual encoder processes the frames from the two RGB cameras. A vision transformer first turns each image into contextualized patch embeddings, and these features are projected to fit the encoder's input dimension. To distinguish between the two camera types, learnable camera-type embeddings are added. The patch features, goal features, and a special [CLS] token embedding are then combined and fed into a transformer encoder. The output at the [CLS] token's position serves as the visual representation conditioned on the goal.
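
A minimal PyTorch sketch of such an encoder appears below; the dimensions, layer counts, and names are our own assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GoalConditionedVisualEncoder(nn.Module):
    """Sketch: fuse patch features from the two cameras with the goal encoding."""

    def __init__(self, patch_dim=768, d_model=512, n_layers=3, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)     # fit patches to the model width
        self.camera_embed = nn.Embedding(2, d_model)  # navigation vs. manipulation
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, nav_patches, manip_patches, goal_tokens):
        # nav_patches, manip_patches: (B, P, patch_dim) from the image backbone.
        # goal_tokens: (B, G, d_model) from the text goal encoder.
        nav = self.proj(nav_patches) + self.camera_embed.weight[0]
        manip = self.proj(manip_patches) + self.camera_embed.weight[1]
        cls = self.cls.expand(nav.size(0), -1, -1)
        tokens = torch.cat([cls, goal_tokens, nav, manip], dim=1)
        fused = self.encoder(tokens)
        return fused[:, 0]  # [CLS] output: goal-conditioned visual state, (B, d_model)
```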


Action Decoder. This model predicts actions using an autoregressive Transformer decoder with causal masking. It takes in a sequence of past and present visual representations, additively combined with sinusoidal temporal position encodings and learned embeddings for previous actions. The decoder uses cross-attention to condition on the goal encoding. For each time step, the transformer decoder's output is processed through linear and softmax layers to predict a distribution over possible actions. The model is optimized with cross-entropy loss.
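
A matching sketch of such a decoder follows (again with assumed names and sizes): per-step visual states are summed with sinusoidal temporal encodings and previous-action embeddings, passed through a causally masked transformer decoder that cross-attends to the goal tokens, and mapped to action logits trained with cross-entropy.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(t_max, d_model):
    """Standard sinusoidal position encodings, shape (t_max, d_model)."""
    pos = torch.arange(t_max).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(t_max, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class ActionDecoder(nn.Module):
    """Sketch: causal transformer decoder over visual states, cross-attending to the goal."""

    def __init__(self, d_model=512, n_actions=20, n_layers=3, n_heads=8, t_max=1000):
        super().__init__()
        self.prev_action_embed = nn.Embedding(n_actions + 1, d_model)  # +1 start token
        self.register_buffer("pe", sinusoidal_positions(t_max, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, visual_states, prev_actions, goal_tokens):
        # visual_states: (B, T, d_model); prev_actions: (B, T); goal_tokens: (B, G, d_model)
        T = visual_states.size(1)
        x = visual_states + self.pe[:T] + self.prev_action_embed(prev_actions)
        causal = torch.triu(  # block attention to future time steps
            torch.full((T, T), float("-inf"), device=x.device), diagonal=1
        )
        h = self.decoder(x, memory=goal_tokens, tgt_mask=causal)
        return self.head(h)  # (B, T, n_actions) logits

# Training: loss = torch.nn.functional.cross_entropy(
#     logits.flatten(0, 1), expert_actions.flatten())
```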

Simulation examples

Here we present a few illustrative examples of our agent's behavior in simulation. In addition to the agent's navigation (left) and manipulation (right) camera inputs, we also display the probabilities the agent assigns to each of its available actions.


Failure recovery

The agent initially fails to pick up the target object (headphones) but is able to recover from this failure and successfully pick up the object on the second attempt.

Spatial reasoning

The agent is attempting to find the "highest fruit" in the kitchen. Notice that it ignores the apple when passing it, as it is still searching for other fruits. It eventually sees a bunch of bananas on a shelving unit; since these bananas are lower than the apple, it navigates back to the apple and ends the episode.

Contextual object search

The agent is looking for a "laptop on a sofa". It walks to the first sofa it sees, does not find a laptop, and then explores until it finds a second sofa, at which point it finds the laptop.

What if SPOC had perfect object perception?

Our RGB-only SPOC agent has learned to navigate and explore its environment effectively. An error analysis indicates that most failures stem from failing to recognize the target object. We hypothesize that SPOC's effectiveness as an explorer is not limited by its training via shortest-path imitation, but is instead gated by its object perception.


Indeed, when SPOC is trained with RGB inputs along with ground-truth (GT) target-object detections, i.e., bounding-box coordinates whenever the target object is visible in its cameras, the agent successfully navigates to the target object in 85% of trials, a notable increase over the 65% success rate of the RGB-only agent.
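
One way such ground-truth detections could be supplied to the model (our assumption for illustration; the paper's exact interface may differ) is to encode the visibility flag and normalized box coordinates as a small vector and project it into an extra input token:

```python
import torch
import torch.nn as nn

class GtBoxToken(nn.Module):
    """Sketch: turn an optional ground-truth target box into an extra model input."""

    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Linear(5, d_model)  # visibility flag + 4 box coordinates

    def forward(self, box_xyxy, visible):
        # box_xyxy: (B, 4) normalized corners; visible: (B,) float in {0, 1}.
        v = visible.unsqueeze(1)
        feats = torch.cat([v, box_xyxy * v], dim=1)  # zero the box when unseen
        return self.proj(feats).unsqueeze(1)         # (B, 1, d_model) extra token
```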



Top-down trajectory visualizations

This figure shows top-down visualizations of our agent's object-navigation trajectories on our validation set. Here we list all 200 examples in the validation set using our SPOC model trained with ground-truth detections. The agent's path is drawn as a line that fades from white (episode start) to red (episode end). Red markers indicate the locations of objects of the goal category.


Successes (target category, in episode order): vase, bed, atomizer, basketball, ashcan, bed, houseplant, vase, apple, atomizer, ashcan, laptop, mug, bed, television_receiver, straight_chair, atomizer, atomizer, ashcan, toilet, mug, bowl, ashcan, basketball, television_receiver, laptop, straight_chair, atomizer, vase, bed, bowl, bed, straight_chair, mug, laptop, toilet, bed, houseplant, bowl, toilet, sofa, vase, laptop, television_receiver, sofa, basketball, television_receiver, toilet, straight_chair, basketball, bed, laptop, toilet, bowl, mug, ashcan, vase, sofa, bed, atomizer, alarm_clock, houseplant, straight_chair, straight_chair, toilet, ashcan, laptop, atomizer, houseplant, laptop, houseplant, bed, houseplant, apple, straight_chair, bowl, bed, atomizer, houseplant, straight_chair, television_receiver, sofa, bowl, houseplant, alarm_clock, atomizer, sofa, television_receiver, atomizer, atomizer, vase, alarm_clock, mug, sofa, ashcan, television_receiver, sofa, television_receiver, toilet, television_receiver, toilet, sofa, apple, toilet, apple, alarm_clock, straight_chair, laptop, houseplant, mug, bowl, apple, ashcan, apple, atomizer, television_receiver, vase, straight_chair, toilet, basketball, houseplant, apple, television_receiver, atomizer, bowl, toilet, straight_chair, sofa, laptop, straight_chair, straight_chair, bed, toilet, houseplant, bed, laptop, laptop, bed, apple, bowl, television_receiver, bowl, mug, mug, sofa, sofa, ashcan, ashcan, mug, vase, alarm_clock, bowl, sofa, houseplant, toilet, laptop.

Failures (target category, in episode order): apple, alarm_clock, apple, bowl, ashcan, basketball, television_receiver, ashcan, vase, mug, bed, alarm_clock, houseplant, apple, mug, laptop, basketball, vase, alarm_clock, mug, sofa, ashcan, alarm_clock, bowl, alarm_clock, sofa, basketball, alarm_clock, vase, mug, basketball, vase, ashcan, houseplant, alarm_clock, alarm_clock, alarm_clock, toilet, atomizer, straight_chair, vase, apple, apple.

BibTeX

@article{spoc2023,
        author    = {Kiana Ehsani and Tanmay Gupta and Rose Hendrix and Jordi Salvador and Luca Weihs and Kuo-Hao Zeng and Kunal Pratap Singh and Yejin Kim and Winson Han and Alvaro Herrasti and Ranjay Krishna and Dustin Schwenk and Eli VanderBilt and Aniruddha Kembhavi},
        title     = {Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World},
        journal   = {arXiv},
        year      = {2023},
        eprint    = {2312.02976},
}