Fast Feature Field (F3): A Predictive Representation of Events

GRASP Laboratory, University of Pennsylvania

Abstract

This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, which we call the Fast Feature Field (F3). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. F3 exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets, achieving 120 Hz at HD and 440 Hz at VGA resolution. F3 represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a broad range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation on data from three robotic platforms (a car, a quadruped robot, and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, and off-road), and dynamic vision sensors (with different resolutions and event rates). Our implementations can predict these tasks at 25-75 Hz at HD resolution.

Video

F3 is a strong foundation for event-based perception

It is possible to use F3 in any standard computer vision algorithm or neural architecture built for RGB data. We show state-of-the-art performance on data from three platforms (driving, quadruped locomotion, and a flying platform) across four tasks (supervised semantic segmentation, unsupervised optical flow, unsupervised stereo matching, and supervised monocular metric depth estimation). The sequences below show F3 performing multiple perception tasks simultaneously, demonstrating its versatility as a foundation for robotic perception.
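Because F3 summarizes events within a spatiotemporal volume as a multi-channel image, plugging it into an architecture written for RGB data amounts to widening the first layer from 3 input channels to C. The sketch below illustrates this; it is not the paper's implementation, and the channel count, resolution, class count, and the SegmentationHead module are assumptions made for illustration.

import torch
import torch.nn as nn

# Minimal sketch (not the paper's code): F3 is consumed exactly like an RGB
# image, except the first convolution takes C input channels instead of 3.
# The channel count (32) and number of classes (19) are illustrative.
class SegmentationHead(nn.Module):
    def __init__(self, in_channels: int = 32, num_classes: int = 19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, f3: torch.Tensor) -> torch.Tensor:
        # f3: (B, C, H, W) multi-channel image computed from events
        return self.decoder(self.encoder(f3))

f3 = torch.randn(1, 32, 720, 1280)             # stand-in for an F3 feature image at HD
logits = SegmentationHead(in_channels=32)(f3)  # (1, 19, 720, 1280) per-pixel class scores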

Overview: Driving scenes in Philadelphia showing multi-task capabilities in urban environments using events.

F3 realizes the benefits of event cameras in low-latency and low-light scenarios, especially where standard RGB cameras struggle. Moving objects such as pedestrians and vehicles are captured reliably in F3 even under challenging lighting conditions. See the examples below:

The F3 architecture is designed specifically for events

F3 is a predictive representation of events: it is a statistic of past events that is sufficient to predict future events. We prove that such a representation retains information about the structure and motion of the scene. F3 achieves low-latency computation by exploiting the sparsity of event data using a multi-resolution hash encoder and a permutation-invariant (deep sets) architecture. Our implementation computes F3 at 120 Hz and 440 Hz at HD and VGA resolutions, respectively, and predicts different downstream tasks at 25-75 Hz at HD resolution. These HD inference rates are roughly 2-5 times faster than current state-of-the-art event-based methods. Please refer to the paper for more details.

F3 Overview Diagram

An overview of the neural architecture for Fast Feature Field (F3).
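To make the two ingredients named above concrete, the sketch below hashes each event's coordinates at several resolutions into small learnable tables, then sum-pools per-event features into pixel bins so the result is invariant to the order of events. This is a rough illustration under assumed hyperparameters (table sizes, resolutions, feature dimensions) and module names, not the paper's exact architecture.

import torch
import torch.nn as nn

# Illustrative sketch of a multi-resolution hash encoder and a deep-sets
# aggregator; all hyperparameters and names are assumptions, not the paper's.
class HashEncoder(nn.Module):
    def __init__(self, levels=4, table_size=2**14, feat_dim=4, base_res=16):
        super().__init__()
        self.tables = nn.ParameterList(
            [nn.Parameter(1e-4 * torch.randn(table_size, feat_dim)) for _ in range(levels)]
        )
        self.resolutions = [base_res * 2**i for i in range(levels)]
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))

    def forward(self, xyt):
        # xyt: (N, 3) event coordinates (x, y, t) normalized to [0, 1]
        feats = []
        for table, res in zip(self.tables, self.resolutions):
            cell = (xyt * res).long()                            # quantize at this resolution
            idx = (cell * self.primes).sum(-1) % table.shape[0]  # spatial hash into the table
            feats.append(table[idx])                             # (N, feat_dim) lookup
        return torch.cat(feats, dim=-1)                          # (N, levels * feat_dim)

class DeepSetAggregator(nn.Module):
    """rho(sum_i phi(e_i)): the output does not depend on event ordering."""
    def __init__(self, in_dim=16, hidden=64, out_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, per_event, pixel_id, num_pixels):
        # Sum-pool per-event features into their pixel bins, then decode per pixel.
        pooled = per_event.new_zeros(num_pixels, self.phi[-1].out_features)
        pooled.index_add_(0, pixel_id, self.phi(per_event))
        return self.rho(pooled)                                  # (num_pixels, out_dim)

# Usage: encode 10k events and pool them into a 720x1280 multi-channel feature image.
events = torch.rand(10_000, 3)                                   # normalized (x, y, t)
pixel_id = (events[:, 1] * 719).long() * 1280 + (events[:, 0] * 1279).long()
feat = DeepSetAggregator()(HashEncoder()(events), pixel_id, 720 * 1280)
f3_image = feat.T.reshape(1, 32, 720, 1280)                      # (B, C, H, W) feature image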

F3 is robust

F3-based approaches work robustly, without additional training, across data from different robotic platforms, lighting and environmental conditions (daytime vs. nighttime, indoors vs. outdoors, urban vs. off-road), and dynamic vision sensors (with different resolutions and event rates). See examples below of F3, trained only on daytime urban car driving sequences, generalizing to these different conditions.


BibTeX

@misc{das2025fastfeaturefield,
  title={Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events}, 
  author={Richeek Das and Kostas Daniilidis and Pratik Chaudhari},
  year={2025},
  eprint={2509.25146},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25146},
}