Learning Affordances from Visual Observation

Our recent advances in visual understanding and physics simulation are enabling robots to learn how objects afford interaction directly from visual data. This line of work forms the PhD thesis of Long Le.


Pixie: Physics from Pixels. We introduce a fast, generalizable framework for predicting the 3D physical properties (e.g., elasticity, stiffness, density) of objects from visual inputs alone. Pixie produces physics predictions that are 1.5–4.4× more realistic than prior approaches while running 10³× faster. By leveraging pretrained visual features like CLIP, our method can also zero-shot generalize to real-world scenes despite being trained only on synthetic data.
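
To make the idea concrete, here is a minimal sketch (not Pixie's actual code) of how pretrained visual features can be mapped to per-point physical properties: a small prediction head takes CLIP-like embeddings lifted onto 3D points and outputs positive property values. All names, shapes, and the choice of a three-layer MLP are illustrative assumptions.

```python
# Hypothetical sketch: an MLP head mapping pretrained visual features
# (e.g., CLIP embeddings associated with 3D points) to physical properties.
# This is illustrative only and does not reproduce Pixie's architecture.
import torch
import torch.nn as nn

class PhysicsHead(nn.Module):
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3),  # e.g., elasticity, stiffness, density
        )

    def forward(self, point_features: torch.Tensor) -> torch.Tensor:
        # point_features: (num_points, feature_dim) visual features per 3D point
        raw = self.mlp(point_features)
        # Softplus keeps predicted physical properties strictly positive
        return nn.functional.softplus(raw)

# Example: predict properties for 1,000 points with 512-d CLIP-like features
features = torch.randn(1000, 512)
properties = PhysicsHead()(features)  # shape: (1000, 3)
print(properties.shape)
```

Because the features come from a model pretrained on broad real-world imagery, a head like this can in principle be trained purely on synthetic scenes yet still produce sensible predictions on real images.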


Articulate-Anything (Le et al., 2025). A major bottleneck in scaling robot learning in simulation is the lack of interactable 3D environments. Our Articulate-Anything method leverages VLMs and an actor-critic refinement process to automatically generate articulated 3D models from various input modalities, including text, real images, or videos. These articulated models can then be used to train robotic manipulation policies that transfer from simulation to real-world systems.
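
The sketch below illustrates the general shape of an actor-critic refinement loop in this spirit: an actor proposes an articulation, a critic scores it and provides feedback, and the loop repeats until the candidate is good enough. The functions `propose_articulation` and `score_articulation` are hypothetical stand-ins for VLM calls; the real system's prompts, interfaces, and output format differ.

```python
# Hypothetical actor-critic refinement loop (illustrative, not the
# Articulate-Anything implementation). Requires Python 3.10+.
from dataclasses import dataclass

@dataclass
class Articulation:
    joint_type: str   # e.g., "revolute" or "prismatic"
    parent: str       # parent link name
    child: str        # child link name
    axis: tuple       # joint axis in the parent frame

def propose_articulation(observation, feedback: str | None) -> Articulation:
    """Actor: a VLM proposes (or revises) an articulated model of the input."""
    # Placeholder: a real actor would condition on the text/image/video
    # observation and on the critic's feedback from the previous round.
    return Articulation("revolute", "cabinet_body", "cabinet_door", (0.0, 0.0, 1.0))

def score_articulation(observation, candidate: Articulation) -> tuple[float, str]:
    """Critic: rates the candidate and explains what should be fixed."""
    # Placeholder: a real critic would render the candidate in simulation
    # and compare its motion against the observation.
    return 0.9, "Door swings about a plausible vertical hinge axis."

def refine(observation, max_iters: int = 5, threshold: float = 0.8) -> Articulation:
    feedback, best = None, None
    for _ in range(max_iters):
        candidate = propose_articulation(observation, feedback)
        score, feedback = score_articulation(observation, candidate)
        best = candidate
        if score >= threshold:
            break
    return best

print(refine(observation="photo_of_cabinet.png"))
```

The key design choice this loop captures is closing the generation loop with automated feedback, so that the system iteratively repairs its own articulation proposals rather than relying on a single forward pass.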

References

2025

  1. Long Le, Jason Xie, William Liang, and 7 more authors. Articulate-Anything. In International Conference on Learning Representations (ICLR), 2025.