Our recent advances in visual understanding and physics simulation enable robots to learn how objects afford interaction directly from visual data. This work constitutes the PhD thesis of Long Le.
Pixie: Physics from Pixels. We introduce a fast, generalizable framework for predicting 3D physics and physical properties (e.g., elasticity, stiffness, density) of objects from visual inputs alone. Pixie produces physics predictions that are 1.5–4.4× more realistic than prior approaches while running 10³× faster. By leveraging pretrained visual features such as CLIP, our method can also generalize zero-shot to real-world scenes despite having been trained only on synthetic data.
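The core idea can be sketched as follows: keep a pretrained visual encoder frozen and learn only a light head that maps per-point features to physical properties. This is an illustrative sketch only; the function and class names (`pretrained_features`, `PropertyHead`), the feature dimension, and the property set are stand-ins, not Pixie's actual code.

```python
import random

FEATURE_DIM = 8          # stand-in for e.g. a 512-d CLIP feature
PROPERTIES = ["youngs_modulus", "density"]

def pretrained_features(point):
    # Stand-in for a frozen pretrained encoder (e.g. CLIP) lifted into 3D;
    # here we just derive a deterministic pseudo-feature from the 3D point.
    rng = random.Random(hash(point))
    return [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]

class PropertyHead:
    """A linear head per physical property; only this part would be trained,
    while the visual encoder stays frozen."""
    def __init__(self):
        self.weights = {p: [0.1] * FEATURE_DIM for p in PROPERTIES}

    def predict(self, feats):
        # One dot product per property over the shared feature vector.
        return {p: sum(w * f for w, f in zip(self.weights[p], feats))
                for p in PROPERTIES}

head = PropertyHead()
preds = head.predict(pretrained_features((0.0, 0.5, 1.0)))
```

Because the encoder is pretrained on broad visual data and never updated, the head can transfer to real scenes even when trained purely in simulation, which is the intuition behind the zero-shot result above.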
Articulate-Anything (Le et al., 2025). A major bottleneck in scaling robot learning in simulation is the lack of interactable 3D environments. Articulate-Anything leverages VLMs and an actor-critic refinement process to automatically generate articulated 3D models from various input modalities, including text, real images, and videos. These articulated models can then be used to train robotic manipulation policies that transfer from simulation to real-world systems.
Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics,
driving immersive experiences and advanced automation. However, creating these
articulated objects requires extensive human effort and expertise, limiting their
broader applications. To overcome this challenge, we present ARTICULATE-ANYTHING, a
system that automates the articulation of diverse, complex objects
from many input modalities, including text, images, and videos. ARTICULATE-ANYTHING
leverages vision-language models (VLMs) to generate code that can
be compiled into an interactable digital twin for use in standard 3D simulators.
Our system exploits existing 3D asset datasets via a mesh retrieval mechanism,
along with an actor-critic system that iteratively proposes, evaluates, and refines
solutions for articulating the objects, self-correcting errors to achieve a robust
outcome. Qualitative evaluations demonstrate ARTICULATE-ANYTHING’s capability
to articulate complex and even ambiguous object affordances by leveraging rich
grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility
dataset, ARTICULATE-ANYTHING substantially outperforms prior work,
increasing the success rate from 8.7–12.2% to 75% and setting a new bar for
state-of-the-art performance. We further showcase the utility of our system by
generating 3D assets from in-the-wild video inputs, which are then used to train
robotic policies for fine-grained manipulation tasks in simulation that go beyond
basic pick-and-place. These policies are then transferred to a real robotic system.
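The actor-critic refinement described in the abstract can be sketched as a simple loop: an actor proposes articulation code, a critic evaluates the rendered result and emits feedback, and the actor refines until the critic is satisfied. Everything below is a hypothetical stand-in (the function names, the joint dictionary, and the feedback strings are illustrative), not the actual Articulate-Anything implementation, which uses VLM calls for both roles.

```python
def propose_articulation(description, feedback=None):
    """Actor: stand-in for a VLM call that emits articulation code.
    Here it returns a joint spec, adjusted when feedback is given."""
    joint = {"type": "revolute", "axis": (0, 0, 1), "limit": 1.57}
    if feedback == "wrong_axis":
        joint["axis"] = (0, 1, 0)  # refine based on critic feedback
    return joint

def critic_score(joint):
    """Critic: stand-in for a VLM that renders the candidate and rates it.
    Returns (score, feedback) so the actor can self-correct."""
    if joint["axis"] != (0, 1, 0):
        return 0.2, "wrong_axis"
    return 0.9, None

def articulate(description, max_iters=5, threshold=0.8):
    """Iteratively propose, evaluate, and refine until the critic's
    score clears the acceptance threshold."""
    feedback = None
    for _ in range(max_iters):
        candidate = propose_articulation(description, feedback)
        score, feedback = critic_score(candidate)
        if score >= threshold:
            return candidate
    return candidate  # best effort after max_iters

best = articulate("cabinet door")
```

In this toy run the first proposal picks the wrong hinge axis, the critic flags it, and the second proposal is accepted; the real system closes the same loop with rendered videos of the candidate articulation as the critic's evidence.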
@inproceedings{Le2025ArticulateAnything,
  author    = {Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
  title     = {Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
}