Plug-And-Play Object-Centric Representations From “What” and “Where” Foundation Models

Abstract

There have recently been large advances both in pre-training visual representations for robotic control and in segmenting objects of unknown categories in general images. To leverage these for improved robot learning, we propose POCR, a new framework for building pre-trained object-centric representations (OCR) for robotic control. Building on theories of "what-where" representations in psychology and computer vision, we use segmentations from a pre-trained model to stably locate the various entities in the scene across timesteps, capturing "where" information. To each such segmented entity, we apply other pre-trained models that build vector descriptions suitable for robotic control tasks, capturing "what" the entity is. Our OCR for control is thus constructed by appropriately combining the outputs of off-the-shelf pre-trained models, with no new training. On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on our OCR achieve better performance and systematic generalization than state-of-the-art pre-trained representations for robotics, as well as prior OCRs, which are typically trained from scratch.
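
The sketch below illustrates the "what-where" composition described in the abstract, assuming generic `segment(image)` and `encode(crop)` callables as stand-ins for the off-the-shelf pre-trained models (a promptable segmenter and a pre-trained visual encoder). The function names, slot layout, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def object_centric_representation(image, segment, encode, num_slots=8, feat_dim=512):
    """Build per-object slots: "where" (mask geometry) + "what" (entity features)."""
    masks = segment(image)                      # list of HxW boolean masks, one per entity
    slots = np.zeros((num_slots, 4 + feat_dim), dtype=np.float32)
    h, w = image.shape[:2]
    for i, mask in enumerate(masks[:num_slots]):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue
        # "Where": normalized bounding box of the segmented entity.
        where = np.array([xs.min() / w, ys.min() / h, xs.max() / w, ys.max() / h])
        # "What": pre-trained features of the image cropped to the entity.
        crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        what = encode(crop)                     # 1D feature vector of length feat_dim
        slots[i] = np.concatenate([where, what])
    # The concatenated slots form the observation fed to an imitation policy.
    return slots.reshape(-1)
```

Note that this sketch omits the temporal association step implied by "stably locate ... across timesteps": in practice, slot indices would need to be kept consistent between frames, e.g., by matching masks across consecutive timesteps.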

Publication
ICRA