This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, which we call the Fast Feature Field (F3). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. F3 exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets, achieving 120 Hz at HD and 440 Hz at VGA resolution. F3 represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation on data from three robotic platforms (a car, a quadruped robot, and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, and off-road), and dynamic vision sensors (with different resolutions and event rates). Our implementations can predict these tasks at 25-75 Hz at HD resolution.
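To make the predictive learning signal concrete, here is a minimal sketch, assuming a per-pixel, per-polarity binary target ("does an event occur here in the next time window?") and a 1x1 convolutional prediction head on top of the F3 feature image; the actual head and loss used in the paper may differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuturePredictionHead(nn.Module):
    # Maps the F3 feature image (B, C, H, W) to per-pixel, per-polarity
    # logits for whether an event will occur in the next time window.
    def __init__(self, in_channels: int, polarities: int = 2):
        super().__init__()
        self.head = nn.Conv2d(in_channels, polarities, kernel_size=1)

    def forward(self, f3_features: torch.Tensor) -> torch.Tensor:
        return self.head(f3_features)  # logits, shape (B, 2, H, W)

def predictive_loss(logits: torch.Tensor, future_events: torch.Tensor) -> torch.Tensor:
    # future_events: binary occupancy of events (per polarity) in the next
    # window, shape (B, 2, H, W); this target and loss are illustrative assumptions.
    return F.binary_cross_entropy_with_logits(logits, future_events.float())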
Video
F3 is a strong foundation for event-based perception
It is possible to use F3 in any standard computer vision algorithm or neural architecture built for RGB data. F3 can be the foundation for a variety of robotic perception tasks. We show state-of-the-art performance on data from three platforms (driving, quadruped locomotion and a flying platform) and four tasks (supervised semantic segmentation, unsupervised optical flow, unsupervised stereo matching, and supervised monocular metric depth estimation). These sequences showcase F3 performing multiple perception tasks simultaneously—demonstrating its versatility as a foundation for robotic perception.
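As a concrete illustration of this "drop-in for RGB pipelines" point, here is a minimal sketch, assuming the F3 output is a 32-channel feature image fed to an off-the-shelf torchvision segmentation model whose input stem is widened to accept those channels; the channel count, model choice, and class count are illustrative assumptions rather than the paper's exact setup.

import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

F3_CHANNELS = 32   # assumed F3 feature dimension
NUM_CLASSES = 19   # e.g. semantic classes for driving scenes

# A standard RGB architecture, reused unchanged except for the input stem.
model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
model.backbone.conv1 = nn.Conv2d(F3_CHANNELS, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.eval()

f3 = torch.randn(1, F3_CHANNELS, 720, 1280)    # F3 as a multi-channel image
with torch.no_grad():
    logits = model(f3)["out"]                  # (1, NUM_CLASSES, 720, 1280)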
Overview: Driving scenes in Philly showing multi-task capabilities in urban environments using events.
F3 manages to realize the benefits of event cameras in low-latency and low-light scenarios, especially where standard RGB cameras struggle. Moving objects like pedestrians and vehicles are captured reliably in F3 even under challenging lighting conditions. See cool examples below:
Standard cameras fail to detect pedestrians crossing the road in low-light conditions
Standard cameras fail to detect pedestrians hidden in the shadows cast by sidewalk trees
Standard cameras fail to detect pedestrians hidden in the shadows cast by buildings
F3 captures very subtle structure and motion details, such as the fine movements of pedestrians
F3 prominently captures moving objects like bikes and pedestrians crossing the road
F3 captures structure and motion even under difficult lighting and very high speeds (freeway), distinctly segmenting the passing car and oncoming traffic
The F3 architecture is designed specifically for events
F3 is a predictive representation of events. It is a statistic of past events sufficient to predict future events. We prove that such a representation retains information about the structure and motion in the scene. F3 achieves low-latency computation by exploiting the sparsity of event data using a multi-resolution hash encoder and permutation-invariant architecture. Our implementation can compute F3 at 120 Hz and 440 Hz at HD and VGA resolutions, respectively, and can predict different downstream tasks at 25-75 Hz at HD resolution. These HD inference rates are roughly 2-5 times faster than the current state-of-the-art event-based methods. Please refer to the paper for more details.
An overview of the neural architecture for Fast Feature Field (F3).
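To illustrate the two ingredients named above, here is a minimal sketch, assuming each event's quantized (x, y, t) coordinates are hashed into learned tables at several resolutions and the resulting per-event embeddings are summed into their pixel (a deep-sets style, permutation-invariant pooling) to form the multi-channel feature image; the table sizes, feature widths, hash function, and pooling choice are illustrative assumptions, not the exact F3 architecture.

import torch
import torch.nn as nn

class TinyF3Encoder(nn.Module):
    # Illustrative multi-resolution hash encoding plus permutation-invariant
    # pooling of events into a multi-channel feature image (not the exact
    # architecture used in the paper).
    def __init__(self, levels=4, table_size=2**16, feat_dim=8, height=720, width=1280):
        super().__init__()
        self.levels, self.table_size = levels, table_size
        self.height, self.width = height, width
        self.tables = nn.Parameter(0.01 * torch.randn(levels, table_size, feat_dim))
        self.register_buffer("primes", torch.tensor([1, 2654435761, 805459861]))
        self.mlp = nn.Sequential(nn.Linear(levels * feat_dim, 32), nn.ReLU(), nn.Linear(32, 32))

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (N, 3) integer (x, y, t_bin) coordinates of events in a window.
        feats = []
        for l in range(self.levels):
            coarse = events // (2 ** l)                       # coarser grid per level
            h = (coarse * self.primes).sum(-1) % self.table_size
            feats.append(self.tables[l][h])                   # (N, feat_dim)
        per_event = self.mlp(torch.cat(feats, dim=-1))        # (N, 32)
        # Permutation-invariant (deep-sets style) pooling: sum events into pixels.
        out = per_event.new_zeros(self.height * self.width, 32)
        idx = events[:, 1] * self.width + events[:, 0]        # pixel index y * W + x
        out.index_add_(0, idx, per_event)
        return out.view(self.height, self.width, 32).permute(2, 0, 1)  # (C, H, W)

Events within a time window, stored as integer coordinates, then produce the feature image that the downstream task heads consume.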
F3 is robust
F3-based approaches work robustly without additional training across data from different robotic platforms, lighting and environmental conditions (daytime vs. night-time, indoors vs. outdoors, urban vs. off-road) and dynamic vision sensors (with different resolutions and event rates).
See examples of F3 (trained on daytime urban car driving sequences) generalizing to different platforms, lighting conditions, environments, and event cameras below:
F3 accurately captures scene motion in an almost completely dark parking lot
F3 accurately captures the motion of oncoming traffic despite extreme lighting changes, leveraging the high dynamic range of event cameras
F3 captures fine monocular depth and optical flow in a forest environment, despite being trained only on urban driving sequences
The motion information in F3 helps extract the correct flow and depth of the closer tree branches, which often get smeared with the background in RGB images
F3 can accurately capture the depth of an indoor cluttered environment from a quadruped robot
F3 can capture the depth of fine structures in the stairwell of a building
F3 can capture the depth of skaters moving rapidly in a skatepark
F3 can capture the sharp depth of indoor structures from a flying platform
F3 can consistently estimate the depth of outdoor structures from a flying platform, even at very high speeds
F3 generalizes to event cameras with different resolutions and event rates
BibTeX
@misc{das2025fastfeaturefield,
  title={Fast Feature Field ($\text{F}^3$): A Predictive Representation of Events},
  author={Richeek Das and Kostas Daniilidis and Pratik Chaudhari},
  year={2025},
  eprint={2509.25146},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25146},
}