Discovering Deformable Keypoint Pyramids


The locations of objects and their associated landmark keypoints can serve as versatile and semantically meaningful image representations. In natural scenes, these keypoints are often hierarchically grouped into sets corresponding to coherently moving objects and their movable and deformable parts. Motivated by this observation, we propose Keypoint Pyramids, an approach that exploits this property to discover keypoints without explicit supervision. Keypoint Pyramids discovers multi-level keypoint hierarchies satisfying three desiderata: comprehensiveness of the overall keypoint representation, coarse-to-fine informativeness of individual hierarchy levels, and parent-child associations of keypoints across levels. On human pose and tabletop multi-object scenes, our experimental results show that Keypoint Pyramids jointly discovers object keypoints and their natural hierarchical groupings, with finer levels adding detail to coarser levels to represent the visual scene more comprehensively. Further, we show qualitatively and quantitatively that keypoints discovered by Keypoint Pyramids using its hierarchical prior bind more consistently, and are more predictive of manually annotated semantic keypoints, than those discovered by prior flat keypoint discovery approaches.