Body Parsing: Human Pose Estimation, Segmentation and Recognition

Introduction

Recognition, pose estimation, and segmentation of humans and their body parts remain important unsolved problems in high-level vision. Applications such as action understanding and image search and retrieval would benefit greatly from progress on these problems. Previous work has made headway, but significant challenges remain.

Key Insights

Our work is guided by the following key insights:

Shape needs to be evaluated in larger context:

Many top-down methods for human pose estimation (HPE) detect individual body parts in the image and then piece these parts together with a top-down process, typically using simple geometric cues among the parts. However, many parts (e.g., the lower leg) do not have very distinctive shapes on their own. When several parts are viewed together (e.g., the lower body), their combined shape is much more distinctive. Our method constructs increasingly large regions of the body as the search proceeds, allowing it to take advantage of large-scale shape cues.

Proposal and evaluation can be separated to improve parsing:

Traditional parsing algorithms exploit factorization according to parse rules in order to efficiently find exact solutions to parsing problems with dynamic programming. However, this comes at a cost: the overall scoring function used to evaluate a particular parse is tightly coupled to the search method (dynamic programming) used to consider parse hypotheses. As a result, this scoring function is often suboptimal. The natural language processing community has recognized this to some extent, with the advent of parse re-ranking. The common idea in parse re-ranking and our own shape parsing framework is that some scoring functions are better suited for generating good proposals, while others are better suited for discriminating among these good proposals. We employ different functions for proposing groupings of body regions and for evaluating these regions.
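The separation of proposal and evaluation can be illustrated with a generic re-ranking pattern. The following is a minimal sketch, not the authors' implementation: the function name `propose_then_rerank`, the two scoring callables, and the toy candidates are all hypothetical and chosen only to show that the score used to shortlist hypotheses need not be the score used to rank them.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def propose_then_rerank(
    candidates: Iterable[T],
    proposal_score: Callable[[T], float],
    evaluation_score: Callable[[T], float],
    shortlist_size: int,
) -> List[T]:
    """Shortlist with one score, then order the shortlist with a different one."""
    # Stage 1 (proposal): keep the candidates the proposal score likes best.
    shortlist = sorted(candidates, key=proposal_score, reverse=True)[:shortlist_size]
    # Stage 2 (evaluation): re-rank only the shortlist with the second score.
    return sorted(shortlist, key=evaluation_score, reverse=True)


# Toy candidates (hypothetical): (name, proposal score, evaluation score).
toy = [("a", 0.9, 0.2), ("b", 0.8, 0.7), ("c", 0.1, 0.99), ("d", 0.7, 0.5)]
ranked = propose_then_rerank(toy, lambda c: c[1], lambda c: c[2], shortlist_size=3)
best_name = ranked[0][0]
```

In this toy run, candidate "c" has the highest evaluation score but is never evaluated because the proposal stage did not shortlist it, which is why the quality of the proposal stage matters as much as the quality of the evaluation.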

Input for Parsing: Segments

Because our method performs shape parsing, it needs a set of initial shapes detected in the image. We run Normalized Cuts with several different settings to generate a large set of candidate shapes present in the image. Using multiple segmentations increases the recall of the shape detection process.
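Pooling segments from several segmentations might look like the sketch below. It assumes each segmentation is already available as a 2D grid of integer labels (the Normalized Cuts step itself is not shown), and it represents a segment as a set of pixel coordinates. The names `segments_from_labels` and `pool_segments` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Pixel = Tuple[int, int]
Segment = FrozenSet[Pixel]


def segments_from_labels(labels: List[List[int]]) -> Set[Segment]:
    """Group pixel coordinates by label; each label becomes one segment.

    Assumption: every label already corresponds to one segment, so no
    connectivity check is performed here.
    """
    groups: Dict[int, Set[Pixel]] = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            groups.setdefault(lab, set()).add((r, c))
    return {frozenset(pixels) for pixels in groups.values()}


def pool_segments(label_maps: List[List[List[int]]]) -> Set[Segment]:
    """Union the segments of all segmentations; identical segments collapse."""
    pool: Set[Segment] = set()
    for labels in label_maps:
        pool |= segments_from_labels(labels)
    return pool


# Two toy segmentations of the same 2x4 image at different granularities.
coarse = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
fine = [[0, 0, 1, 2],
        [0, 0, 1, 2]]
pool = pool_segments([coarse, fine])
```

The finer segmentation contributes two narrow segments that the coarse one merges, so a shape missed at one setting can still appear in the pool from another.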

Examples of Proposal and Evaluation

Proposal

For a given part, we need to create proposals of possible instances of this part in the image, and evaluate these proposals according to their overall shape. Our system has three different proposal mechanisms, pictured below: recognizing a part directly as a segment, forming a part as the combination of two smaller parts, and extending a smaller part. The latter two mechanisms reflect the tradeoff between bottom-up and top-down processing, which is currently a human-chosen design decision.
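The three proposal mechanisms can be sketched on regions represented as sets of pixel coordinates. This is an illustration under assumptions, not the system's actual code: the adjacency test `touches`, the reading of "extending" as adding one adjacent segment to a smaller part, and all function names are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pixel = Tuple[int, int]
Region = FrozenSet[Pixel]


def touches(a: Region, b: Region) -> bool:
    """True if a and b are disjoint and some pixel of a is 4-adjacent to b."""
    if a & b:
        return False
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return any((r + dr, c + dc) in b for (r, c) in a for dr, dc in steps)


def propose_direct(segments: Iterable[Region]) -> Set[Region]:
    """Mechanism 1: each segment is itself a candidate for the part."""
    return set(segments)


def propose_by_merging(parts_a: Iterable[Region], parts_b: Iterable[Region]) -> Set[Region]:
    """Mechanism 2: combine adjacent proposals of two smaller parts."""
    parts_b = list(parts_b)
    return {a | b for a in parts_a for b in parts_b if touches(a, b)}


def propose_by_extension(parts: Iterable[Region], segments: Iterable[Region]) -> Set[Region]:
    """Mechanism 3 (assumed reading): grow a smaller part by one adjacent segment."""
    segments = list(segments)
    return {p | s for p in parts for s in segments if touches(p, s)}


# Three toy segments stacked in one column, plus a short horizontal piece.
s1 = frozenset({(0, 0), (1, 0)})
s2 = frozenset({(2, 0), (3, 0)})
s3 = frozenset({(4, 0), (4, 1)})

direct = propose_direct([s1, s2, s3])
merged = propose_by_merging({s1}, {s2})
extended = propose_by_extension(merged, [s1, s2, s3])
```

Here `merged` contains the single region formed from `s1` and `s2`, and `extended` grows that region with `s3`, the only segment that is adjacent to it without overlapping it.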

Evaluation

Given all the possible proposals for a particular part, we score each proposal by matching it to a shape exemplar from a set of hand-segmented exemplars for that part. We retain the proposals that score well and are not redundant, pruning the set to a constant size of 50 proposals.
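The score-then-prune step could be sketched as follows. The shape score here is an assumption: translation-normalized intersection-over-union (IoU) stands in for whatever shape matching the system actually uses, and the redundancy threshold `redundancy_iou` is a hypothetical parameter. Only the limit of 50 retained proposals comes from the text above. Regions and exemplars are assumed to be non-empty sets of pixel coordinates.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Pixel = Tuple[int, int]
Region = FrozenSet[Pixel]


def iou(a: Region, b: Region) -> float:
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def normalize(region: Region) -> Region:
    """Translate a non-empty region so its bounding box starts at (0, 0)."""
    r0 = min(r for r, _ in region)
    c0 = min(c for _, c in region)
    return frozenset((r - r0, c - c0) for r, c in region)


def exemplar_score(proposal: Region, exemplars: Sequence[Region]) -> float:
    """Best match of the proposal against any exemplar (stand-in shape score)."""
    p = normalize(proposal)
    return max(iou(p, normalize(e)) for e in exemplars)


def prune(proposals: Iterable[Region], exemplars: Sequence[Region],
          keep: int = 50, redundancy_iou: float = 0.9) -> List[Region]:
    """Keep up to `keep` high-scoring proposals, skipping near-duplicates."""
    ranked = sorted(proposals, key=lambda p: exemplar_score(p, exemplars), reverse=True)
    kept: List[Region] = []
    for p in ranked:
        if len(kept) == keep:
            break
        # A proposal is redundant if it overlaps a better one too heavily.
        if all(iou(p, q) < redundancy_iou for q in kept):
            kept.append(p)
    return kept


# Toy data: one vertical-bar exemplar and three proposals.
exemplar = frozenset({(0, 0), (1, 0), (2, 0)})
p1 = frozenset({(5, 5), (6, 5), (7, 5)})          # same shape, shifted
p2 = frozenset({(5, 5), (6, 5), (7, 5), (8, 5)})  # one pixel longer
p3 = frozenset({(0, 0), (0, 1)})                  # horizontal, poor match

top_two = prune([p3, p2, p1], [exemplar], keep=2)
top_two_strict = prune([p1, p2, p3], [exemplar], keep=2, redundancy_iou=0.7)
```

With the default threshold, the two best-scoring proposals are kept. With the stricter threshold, `p2` is dropped as a near-duplicate of `p1`, and the lower-scoring but distinct `p3` takes its place.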

Results

From a standard dataset of baseball player images, we used 15 images as shape exemplars and tested on 39 images. We evaluated the results according to both segmentation accuracy and joint position accuracy. Below we show several top parsing results on our test set. For more details, including quantitative results, please see the publications below.

Publications

  1. P. Srinivasan and J. Shi, "Bottom-up Recognition and Parsing of the Human Body", Proceedings of the 2007 EMMCVPR Conference, August 27-29, 2007. [pdf] [slides]

  2. P. Srinivasan and J. Shi, "Bottom-up Recognition and Parsing of the Human Body", Proceedings of the 2007 Conference on Computer Vision and Pattern Recognition (CVPR), North America, 2007. [pdf] [poster] [ehum2 invited talk slides]