Next: Importance of prototype-prototype similarity Up: Algorithm Previous: Document-keyword correspondence

Correspondence via co-embedding prototypes and video segments

Computing a correspondence relationship satisfying transitive closure is in general a difficult problem. We therefore restrict our search to an important sub-family of correspondence functions $CP(p_i,v_j)$ which can be computed efficiently. Let $x:\mathbf{V}\cup\mathbf{P} \rightarrow \mathrm{R}^\mathbf{N}$ denote the co-embedding of the prototypes and video segments. The correspondence function $CE(p_i,v_j) = 1 - (x(p_i) - x(v_j))^2$ satisfies $\alpha$-transitive closure with $\alpha = 9$. Restricting the search for correspondences to $CE$, we can rewrite the optimization problem in (1) as:
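As a minimal sketch (the function name and numpy representation are our own assumptions, not from the paper), the restricted correspondence $CE$ can be evaluated directly from a candidate one-dimensional co-embedding:

```python
import numpy as np

def correspondence(x_p, x_v):
    """CE(p_i, v_j) = 1 - (x(p_i) - x(v_j))^2 for a 1-D co-embedding.

    x_p: embedding coordinates of the K prototypes, shape (K,)
    x_v: embedding coordinates of the N video segments, shape (N,)
    Returns the K x N matrix of correspondence values.
    """
    diff = x_p[:, None] - x_v[None, :]   # pairwise coordinate differences
    return 1.0 - diff ** 2
```

Prototypes and video segments placed at the same coordinate get the maximal correspondence of 1, and correspondence decays quadratically with embedding distance.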

\begin{displaymath}
\max E_{C}(CE) = \sum_{p_i ,v_j } C(p_i,v_j)\, CE(p_i,v_j)
\quad \Longleftrightarrow \quad
\min E_{C}'(x) = \frac{\sum_{p_i ,v_j } C(p_i,v_j)\, (x(p_i) - x(v_j))^2}{\sigma_x^2},
\end{displaymath} (2)

where $\sigma_x^2$ is the variance of the entries of $x$; it removes the arbitrary scaling factor in the embedding and also prevents the trivial solution $x=0$.

The co-embedding optimization problem in (2) can be visualized as finding a placement (embedding) vector $x$ for each prototype and video segment in a low-dimensional space. We can imagine the co-occurrence relationships as springs that pull together (or apart) the prototypes and video segments, the nodes of figure 6(a), so that co-occurring elements become maximally aligned, as shown in figure 6(b). Maximal alignment here means that corresponding nodes are placed close to each other along the embedding axis. To arrange the prototypes in the $\mathbf{N}$-D space, we need to know the positions of the video segments, and vice versa. Note that the chicken-and-egg nature of our unusual event detection problem is inherent in both the correspondence and the embedding formulations. Fortunately, we can break this deadlock in an optimal and computationally efficient way.

Figure 6: Co-embedding of the email-word example. The top row represents emails, the bottom row words; the edges are the co-occurrences between them. (a) Random embedding; (b) optimal co-embedding using our method. The springs pull corresponding elements into alignment.
\includegraphics[width=0.23 \textwidth, height=0.16 \textwidth]{email/unsortspring.eps} \includegraphics[width=0.23 \textwidth, height=0.16 \textwidth]{email/sortspring.eps}
(a) (b)

To compute the optimal co-embedding, we define a weighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. We take the prototypes and video segments as nodes, $\mathcal{V} = \{v_1, ..., v_N\} \cup \{p_1, ..., p_K\}$. The edges of this graph consist of edges between prototypes and video segments, which represent the co-occurrence relationship ($C$), and edges between the prototype nodes, which represent the similarity $S_p \in \mathrm{R}^{K\times K}$ between the prototypes: $\mathcal{E} = \{ (p_i,v_j) \vert C(i,j) \neq 0 \}
\cup \{(p_i,p_j)\}$. The edge weight matrix is defined as:

\begin{displaymath}
W=\left[ \begin{array}{cc} I & C^T \\ C & \beta S_p \end{array} \right]
\in \mathrm{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert},
\end{displaymath} (3)
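Assembling the weight matrix of (3) is a simple block construction; the sketch below (numpy, with a function name of our choosing) orders the $N$ video-segment nodes first, matching the layout of the embedding vector:

```python
import numpy as np

def edge_weights(C, S_p, beta):
    """Edge weight matrix of eq. (3): W = [[I, C^T], [C, beta * S_p]].

    C:    K x N co-occurrence matrix (prototypes x video segments)
    S_p:  K x K prototype-prototype similarity matrix
    beta: weighting factor for the prototype-similarity edges
    Returns the symmetric (N+K) x (N+K) weight matrix.
    """
    K, N = C.shape
    return np.block([[np.eye(N), C.T],
                     [C, beta * S_p]])
```

Setting `beta = 0` drops the prototype-prototype edges, which is the case where minimizing (4) reduces to minimizing (2).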

where $\beta$ is a weighting factor. When $\beta=0$, minimization of (2) is equivalent to minimization of
\begin{displaymath}
E_W(x) = \frac{\sum_{(i,j) \in \mathcal{E}} W(i,j) (x(i) - x(j))^2}{\sigma_x^2}
\end{displaymath} (4)

Expanding the numerator of (4), we get $2x^T(D-W)x$, where $D \in \mathrm{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert}$ is a diagonal matrix with $D(i,i) = \sum_j W(i,j)$. We use the fact that

\begin{displaymath}
\sigma_x^2 = \sum_{i\in \mathcal{V}} x^2(i) P(i) -
(\sum_{i\in \mathcal{V}} x(i) P(i))^2
\end{displaymath} (5)

where $P(i)$ is an estimate of the prior occurrence likelihood of each prototype or video segment, which can be estimated by its occurrence frequency: $P(i) = \frac{1}{\gamma} D(i,i)$, with $\gamma = \mathbf{1}^T D \mathbf{1}$. Centering the embedding around zero (i.e. $x^T D \mathbf{1}=0$), we get $\sigma_x^2 = \frac{1}{\gamma} x^TDx$.
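This reduction of $\sigma_x^2$ under the centering constraint can be verified numerically; the following is a minimal sketch (numpy, variable names are our own) using a random symmetric weight matrix in place of (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n))
W = (A + A.T) / 2                  # symmetric nonnegative edge weights
d = W.sum(axis=1)                  # diagonal of D
gamma = d.sum()                    # gamma = 1^T D 1
P = d / gamma                      # occurrence-frequency prior P(i)

x = rng.standard_normal(n)
x = x - (d @ x) / gamma            # center the embedding: x^T D 1 = 0

# Eq. (5): degree-weighted variance; the mean term vanishes after centering.
var = (x ** 2) @ P - (x @ P) ** 2
assert np.isclose(x @ P, 0.0)                   # centered mean is zero
assert np.isclose(var, (x @ (d * x)) / gamma)   # sigma_x^2 = (1/gamma) x^T D x
```

The centering step subtracts the $D$-weighted mean, which is exactly what makes the second term of (5) drop out.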

Putting all these together, we can rewrite (4) as

\begin{displaymath}
E_W(x) = 2 \gamma \frac{x^T (D-W) x}{x^T D x}
\end{displaymath} (6)
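Equation (6) can also be checked end to end; this sketch (numpy, our own variable names) compares the edge-sum form of (4), taken over all ordered pairs, against the closed form $2\gamma\, x^T(D-W)x / (x^TDx)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2                 # symmetric edge weights (stand-in for eq. (3))
d = W.sum(axis=1)
D = np.diag(d)
gamma = d.sum()                   # gamma = 1^T D 1

x = rng.standard_normal(n)
x = x - (d @ x) / gamma           # center: x^T D 1 = 0

# E_W(x) of eq. (4): edge sum over all ordered pairs, divided by
# sigma_x^2 = (1/gamma) x^T D x.
num = sum(W[i, j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
E_edge = num / ((x @ (d * x)) / gamma)

# Closed form of eq. (6).
E_closed = 2 * gamma * (x @ (D - W) @ x) / (x @ (d * x))

assert np.isclose(E_edge, E_closed)
```

The factor of 2 appears because each undirected edge is counted once in each direction by the ordered-pair sum.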

The global energy minimum of this function is achieved by the eigenvector corresponding to the second smallest eigenvalue of
\begin{displaymath}
(D-W) x = \lambda D x
\end{displaymath} (7)
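A minimal sketch of solving (7) with plain numpy (the function name is ours; this symmetrizes the generalized problem via $D^{-1/2}(D-W)D^{-1/2}$ rather than calling a generalized eigensolver):

```python
import numpy as np

def co_embed(W):
    """Solve (D - W) x = lambda D x and return the eigenvector of the
    second smallest eigenvalue, i.e. a 1-D co-embedding.

    W must be symmetric with strictly positive row sums.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W
    # Equivalent symmetric standard problem:
    #   D^{-1/2} L D^{-1/2} y = lambda y,  with  x = D^{-1/2} y.
    M = D_inv_sqrt @ L @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return D_inv_sqrt @ vecs[:, 1]      # skip the trivial smallest (x = const)
```

On a small graph with two tightly connected groups and weak cross-edges, the returned coordinates place each group on one side of zero, which is the alignment behavior illustrated in figure 6(b).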

Note that the first $N$ elements of vector $x$ contain the coordinates of the video segment nodes, and the next $K$ elements contain the coordinates of the prototype features. Incidentally, this is the same eigenvector used in a different context of image segmentation (Normalized Cuts, [12]).


Mirko Visontai 2004-05-13