SPOC 🖖

Imitating Shortest Paths in Simulation
Enables Effective Navigation and Manipulation in the Real World

*Equal Contribution, ordered randomly
1Allen Institute for AI, 2University of Washington, 3EPFL

SPOC learns from shortest paths to navigate, explore, and manipulate objects in simulation and the real world.

Highlights

Exploration emerges from imitating shortest paths.

No RL and no human data!
Scale is the key.

SPOC only uses RGB observations!

No depth/LiDAR, no maps.
No privileged information!

SPOC is trained in simulation.

No sim-to-real adaptation.
No real-world fine-tuning!

SPOC is proficient at long-horizon tasks.

Including both navigation and manipulation.

Results

SPOC is effective in the real world. Both the environment and the object instances encountered are entirely novel to the model, as its training was exclusively conducted in a simulator. This highlights SPOC's ability to generalize and adapt to previously unencountered, real-world conditions.


Fetching a mug

SPOC successfully locates and picks up a mug.

Fetching an apple

SPOC successfully searches for and picks up an apple.

Navigating to a toilet

SPOC successfully explores to find a toilet.

Extended search for a bed

SPOC explores a long hallway (an uncommon architectural feature within our ProcTHOR training scenes).

Picking up a houseplant

SPOC mistakenly attempts to extend its arm into the coffee table, realizes its error, and is able to progress.

Finding a houseplant

A partial failure example: the agent does eventually "succeed," as it takes the done action with a very small sliver of the houseplant visible, but this success does not appear robust.

How does it work?

Reinforcement Learning (RL) often faces challenges with speed and efficiency, while Imitation Learning (IL) is constrained by the need for extensive and varied data. We show that using transformer architectures with long context windows to imitate heuristic planners at scale can help unlock the power of IL and produce effective agents in simulation and the real world.

Data

SPOC is trained on a large-scale dataset of expert trajectories consisting of RGB observations paired with the corresponding expert actions. This video shows a sample from our dataset: the RGB observations captured simultaneously by the navigation and manipulation cameras, together with the action taken by the expert at each time step during the task of fetching a bowl.
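
For concreteness, one can picture a single trajectory in this dataset as a goal string, time-aligned frames from both cameras, and one expert action per step. The sketch below is our own illustration; the field names (goal, nav_rgb, manip_rgb, actions) are assumptions, not the dataset's actual schema.

```python
# Minimal sketch of a SPOC-style expert trajectory for imitation learning.
# Field names are illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class ExpertTrajectory:
    goal: str               # natural-language instruction, e.g. "fetch a bowl"
    nav_rgb: np.ndarray     # (T, H, W, 3) navigation-camera frames
    manip_rgb: np.ndarray   # (T, H, W, 3) manipulation-camera frames
    actions: np.ndarray     # (T,) integer ids of the expert's discrete actions

def training_pairs(traj: ExpertTrajectory):
    """Yield (goal, observation history, expert action) pairs for behavior cloning."""
    for t in range(len(traj.actions)):
        history = (traj.nav_rgb[: t + 1], traj.manip_rgb[: t + 1])
        yield traj.goal, history, traj.actions[t]
```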

Model

SPOC, an agent embodied in the Stretch RE-1 robot, is trained to follow text instructions for navigation and task completion. It processes both text instructions and visual data to determine its actions. The Stretch robot has two cameras: one for navigation and another for arm manipulation. The model has three main parts: a textual goal encoder for language instructions, a visual encoder that conditions each visual input on the instruction, and a transformer action decoder that predicts each action based on the goal, current and past visuals, and previous actions.
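
As a rough sketch of how these three parts fit together, see the minimal PyTorch illustration below (class and argument names are our own assumptions, not the released code); the encoder and decoder are sketched in more detail in the following subsections.

```python
import torch
import torch.nn as nn

class SPOCPolicy(nn.Module):
    """Sketch of the three-part policy: goal encoder, visual encoder, action decoder."""

    def __init__(self, goal_encoder, visual_encoder, action_decoder):
        super().__init__()
        self.goal_encoder = goal_encoder      # text instruction -> goal tokens
        self.visual_encoder = visual_encoder  # per-step camera features + goal -> state
        self.action_decoder = action_decoder  # states + previous actions + goal -> logits

    def forward(self, instruction, nav_feats, manip_feats, prev_actions):
        # nav_feats / manip_feats: (T, B, ...) per-step features from the two cameras.
        goal = self.goal_encoder(instruction)
        states = torch.stack(
            [self.visual_encoder(n, m, goal) for n, m in zip(nav_feats, manip_feats)],
            dim=1,
        )  # (B, T, d_model)
        return self.action_decoder(states, prev_actions, goal)
```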


Goal-Conditioned Visual Encoder. At each time step, the goal-conditioned visual encoder processes the frames from the two RGB cameras. A vision transformer first turns each image into contextualized patch embeddings, and these features are projected to fit the encoder's input dimension. To distinguish between the two camera types, learnable camera-type embeddings are added. The patch features, goal features, and a special [CLS] token embedding are then combined and fed into a transformer encoder. The output at the [CLS] token's position serves as the visual representation conditioned on the goal.
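
A minimal PyTorch sketch of such an encoder appears below; the dimensions, layer counts, and names are our own assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GoalConditionedVisualEncoder(nn.Module):
    """Sketch: fuse patch features from the two cameras with the goal encoding."""

    def __init__(self, patch_dim=768, d_model=512, n_layers=3, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)     # fit patches to the model width
        self.camera_embed = nn.Embedding(2, d_model)  # navigation vs. manipulation
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, nav_patches, manip_patches, goal_tokens):
        # nav_patches, manip_patches: (B, P, patch_dim) from the image backbone.
        # goal_tokens: (B, G, d_model) from the text goal encoder.
        nav = self.proj(nav_patches) + self.camera_embed.weight[0]
        manip = self.proj(manip_patches) + self.camera_embed.weight[1]
        cls = self.cls.expand(nav.size(0), -1, -1)
        tokens = torch.cat([cls, goal_tokens, nav, manip], dim=1)
        fused = self.encoder(tokens)
        return fused[:, 0]  # [CLS] output: goal-conditioned visual state, (B, d_model)
```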


Action Decoder. This model predicts actions using an autoregressive Transformer decoder with causal masking. It takes in a sequence of past and present visual representations, additively combined with sinusoidal temporal position encodings and learned embeddings for previous actions. The decoder uses cross-attention to condition on the goal encoding. For each time step, the transformer decoder's output is processed through linear and softmax layers to predict a distribution over possible actions. The model is optimized with cross-entropy loss.
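
A matching sketch of such a decoder follows (again with assumed names and sizes): per-step visual states are summed with sinusoidal temporal encodings and previous-action embeddings, passed through a causally masked transformer decoder that cross-attends to the goal tokens, and mapped to action logits trained with cross-entropy.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(t_max, d_model):
    """Standard sinusoidal position encodings, shape (t_max, d_model)."""
    pos = torch.arange(t_max).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(t_max, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class ActionDecoder(nn.Module):
    """Sketch: causal transformer decoder over visual states, cross-attending to the goal."""

    def __init__(self, d_model=512, n_actions=20, n_layers=3, n_heads=8, t_max=1000):
        super().__init__()
        self.prev_action_embed = nn.Embedding(n_actions + 1, d_model)  # +1 start token
        self.register_buffer("pe", sinusoidal_positions(t_max, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, visual_states, prev_actions, goal_tokens):
        # visual_states: (B, T, d_model); prev_actions: (B, T); goal_tokens: (B, G, d_model)
        T = visual_states.size(1)
        x = visual_states + self.pe[:T] + self.prev_action_embed(prev_actions)
        causal = torch.triu(  # block attention to future time steps
            torch.full((T, T), float("-inf"), device=x.device), diagonal=1
        )
        h = self.decoder(x, memory=goal_tokens, tgt_mask=causal)
        return self.head(h)  # (B, T, n_actions) logits

# Training: loss = torch.nn.functional.cross_entropy(
#     logits.flatten(0, 1), expert_actions.flatten())
```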

Simulation examples

Here we present a few illustrative examples of our agent's behavior in simulation. In addition to the agent's navigation (left) and manipulation (right) camera inputs, we also display the probabilities the agent assigns to each of its available actions.


Failure recovery

The agent initially fails to pick up the target object (headphones) but is able to recover from this failure and successfully pick up the object on the second attempt.

Spatial reasoning

The agent is attempting to find the "highest fruit" in the kitchen. Notice that it ignores the apple when passing it, as it is still searching for other fruits. It eventually sees a bunch of bananas on a shelving unit; since these bananas are lower than the apple, it navigates back to the apple and ends the episode.

Contextual object search

The agent is looking for a "laptop on a sofa". It walks to the first sofa it sees, does not find a laptop, and then explores until it finds a second sofa, at which point it finds the laptop.

What if SPOC had perfect object perception?

Our RGB-only SPOC agent has learned to navigate and explore its environment effectively. An error analysis indicates that most failures stem from failing to recognize the target object. We hypothesize that SPOC's effectiveness as an explorer is not limited by its training via shortest-path imitation, but is instead gated by its object perception.


Indeed, when SPOC is trained with RGB inputs along with ground-truth (GT) target-object detections, i.e., bounding-box coordinates whenever the target object is visible in its cameras, the agent successfully navigates to the target object in 85% of trials, a notable increase over the 65% success rate of the RGB-only agent.
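
One way such ground-truth detections could be supplied to the model (our assumption for illustration; the paper's exact interface may differ) is to encode the visibility flag and normalized box coordinates as a small vector and project it into an extra input token:

```python
import torch
import torch.nn as nn

class GtBoxToken(nn.Module):
    """Sketch: turn an optional ground-truth target box into an extra model input."""

    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Linear(5, d_model)  # visibility flag + 4 box coordinates

    def forward(self, box_xyxy, visible):
        # box_xyxy: (B, 4) normalized corners; visible: (B,) float in {0, 1}.
        v = visible.unsqueeze(1)
        feats = torch.cat([v, box_xyxy * v], dim=1)  # zero the box when unseen
        return self.proj(feats).unsqueeze(1)         # (B, 1, d_model) extra token
```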



Top-down trajectory visualizations

This figure shows top-down visualizations of our agent's object-navigation trajectories on our validation set. Here we list all 200 examples in the validation set using our SPOC model trained with ground-truth detections. The agent's path is drawn as a line that fades from white (episode start) to red (episode end). Red markers indicate the locations of objects of the goal category.


Successes (target category, in episode order): vase, bed, atomizer, basketball, ashcan, bed, houseplant, vase, apple, atomizer, ashcan, laptop, mug, bed, television_receiver, straight_chair, atomizer, atomizer, ashcan, toilet, mug, bowl, ashcan, basketball, television_receiver, laptop, straight_chair, atomizer, vase, bed, bowl, bed, straight_chair, mug, laptop, toilet, bed, houseplant, bowl, toilet, sofa, vase, laptop, television_receiver, sofa, basketball, television_receiver, toilet, straight_chair, basketball, bed, laptop, toilet, bowl, mug, ashcan, vase, sofa, bed, atomizer, alarm_clock, houseplant, straight_chair, straight_chair, toilet, ashcan, laptop, atomizer, houseplant, laptop, houseplant, bed, houseplant, apple, straight_chair, bowl, bed, atomizer, houseplant, straight_chair, television_receiver, sofa, bowl, houseplant, alarm_clock, atomizer, sofa, television_receiver, atomizer, atomizer, vase, alarm_clock, mug, sofa, ashcan, television_receiver, sofa, television_receiver, toilet, television_receiver, toilet, sofa, apple, toilet, apple, alarm_clock, straight_chair, laptop, houseplant, mug, bowl, apple, ashcan, apple, atomizer, television_receiver, vase, straight_chair, toilet, basketball, houseplant, apple, television_receiver, atomizer, bowl, toilet, straight_chair, sofa, laptop, straight_chair, straight_chair, bed, toilet, houseplant, bed, laptop, laptop, bed, apple, bowl, television_receiver, bowl, mug, mug, sofa, sofa, ashcan, ashcan, mug, vase, alarm_clock, bowl, sofa, houseplant, toilet, laptop.

Failures (target category, in episode order): apple, alarm_clock, apple, bowl, ashcan, basketball, television_receiver, ashcan, vase, mug, bed, alarm_clock, houseplant, apple, mug, laptop, basketball, vase, alarm_clock, mug, sofa, ashcan, alarm_clock, bowl, alarm_clock, sofa, basketball, alarm_clock, vase, mug, basketball, vase, ashcan, houseplant, alarm_clock, alarm_clock, alarm_clock, toilet, atomizer, straight_chair, vase, apple, apple.

BibTeX

@article{spoc2023,
        author    = {Kiana Ehsani and Tanmay Gupta and Rose Hendrix and Jordi Salvador and Luca Weihs and Kuo-Hao Zeng and Kunal Pratap Singh and Yejin Kim and Winson Han and Alvaro Herrasti and Ranjay Krishna and Dustin Schwenk and Eli VanderBilt and Aniruddha Kembhavi},
        title     = {Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World},
        journal   = {arXiv},
        year      = {2023},
        eprint    = {2312.02976},
}