Learning Image Representations Tied to Egomotion from Unlabeled Video

Jan 1, 1010

Dinesh Jayaraman

Kristen Grauman

Abstract

Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.
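
The equivariance property mentioned in the abstract can be illustrated with a small numerical sketch. This is not the authors' training procedure: it assumes, purely for illustration, that each discrete egomotion g acts on the feature space through a linear map M_g, so that the feature of the frame after the motion is approximately M_g applied to the feature of the frame before it. The function names `equivariance_loss` and `fit_maps`, the synthetic features, and the least-squares fit are all hypothetical choices made for this example.

```python
import numpy as np


def equivariance_loss(z_before, z_after, motion_ids, maps):
    """Mean squared error between M_g z(x) and z(g x) over frame pairs."""
    total = 0.0
    for zb, za, g in zip(z_before, z_after, motion_ids):
        residual = maps[g] @ zb - za
        total += float(residual @ residual)
    return total / len(z_before)


def fit_maps(z_before, z_after, motion_ids, num_motions):
    """Fit one linear map per egomotion class by least squares (assumed setup)."""
    maps = []
    for g in range(num_motions):
        mask = motion_ids == g
        # Rows satisfy z_after = z_before @ M_g.T, so solve for X = M_g.T.
        x, *_ = np.linalg.lstsq(z_before[mask], z_after[mask], rcond=None)
        maps.append(x.T)
    return maps


# Synthetic stand-in for features of frame pairs related by two egomotions.
rng = np.random.default_rng(0)
dim, num_pairs, num_motions = 4, 200, 2
true_maps = [rng.normal(size=(dim, dim)) for _ in range(num_motions)]
motion_ids = rng.integers(0, num_motions, size=num_pairs)
z_before = rng.normal(size=(num_pairs, dim))
z_after = np.stack([true_maps[g] @ z for g, z in zip(motion_ids, z_before)])

fitted = fit_maps(z_before, z_after, motion_ids, num_motions)
loss_fitted = equivariance_loss(z_before, z_after, motion_ids, fitted)
loss_identity = equivariance_loss(
    z_before, z_after, motion_ids, [np.eye(dim)] * num_motions
)
print(f"loss with fitted maps:   {loss_fitted:.3e}")
print(f"loss with identity maps: {loss_identity:.3e}")
```

In this toy setting the features are equivariant by construction, so the fitted maps drive the loss to numerical zero, while identity maps (which would correspond to invariance rather than equivariance) leave a large residual. In a learned system the feature extractor itself would be trained so that such maps exist; that part is outside the scope of this sketch.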

Publication

In IJCV Special Issue of Best Papers from ICCV 2015

Embodied Intelligence Prediction Unsupervised Features Active Perception First-Person Video Equivariance
