Next: Importance of prototype-prototype similarity Up: Algorithm Previous: Document-keyword correspondence

Correspondence via co-embedding prototypes and video segments

Computing a correspondence relationship satisfying transitive closure is in general a difficult problem. We therefore restrict our search to an important sub-family of correspondence functions $CP(p_i,v_j)$ which can be computed efficiently. Let $x:\mathbf{V}\cup\mathbf{P} \rightarrow \mathrm{R}^\mathbf{N}$ denote the co-embedding of the prototypes and video segments. The correspondence function $CE(p_i,v_j) = 1 - (x(p_i) - x(v_j))^2$ satisfies $\alpha$-transitive closure with $\alpha = 9$. Restricting the search for correspondences to $CE$, we can rewrite the optimization problem in (1) as:
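As a minimal sketch (the function name and numpy representation are our own assumptions, not from the paper), the restricted correspondence $CE$ can be evaluated directly from a candidate one-dimensional co-embedding:

```python
import numpy as np

def correspondence(x_p, x_v):
    """CE(p_i, v_j) = 1 - (x(p_i) - x(v_j))^2 for a 1-D co-embedding.

    x_p: embedding coordinates of the K prototypes, shape (K,)
    x_v: embedding coordinates of the N video segments, shape (N,)
    Returns the K x N matrix of correspondence values.
    """
    diff = x_p[:, None] - x_v[None, :]   # pairwise coordinate differences
    return 1.0 - diff ** 2
```

Prototypes and video segments placed at the same coordinate get the maximal correspondence of 1, and correspondence decays quadratically with embedding distance.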

\begin{displaymath}
\max E_{C}(CE) = \sum_{p_i ,v_j } C(p_i,v_j)\, CE(p_i,v_j)
\quad \Longleftrightarrow \quad
\min E_{C}'(x) = \frac{\sum_{p_i ,v_j } C(p_i,v_j)\, (x(p_i) - x(v_j))^2}{\sigma_x^2},
\end{displaymath} (2)

where $\sigma_x^2$ is the variance of the entries of $x$; it removes the arbitrary scaling factor in the embedding and also prevents the trivial solution $x=0$.

The co-embedding optimization problem in (2) can be visualized as finding a placement (embedding) vector $x$ for each prototype and video segment in a low-dimensional space. We can imagine the co-occurrence relationships as springs that pull together (or apart) the prototypes and video segments, the nodes of figure 6(a), so that co-occurring elements become maximally aligned, as shown in figure 6(b). Maximal alignment here means that corresponding nodes are placed close to each other along the embedding axis. To arrange the prototypes in the $\mathbf{N}$-D space, we need to know the positions of the video segments, and vice versa. Note that the chicken-and-egg nature of our unusual event detection problem is inherent in both the correspondence and the embedding formulations. Fortunately, we can break this deadlock in an optimal and computationally efficient way.

Figure 6: Co-embedding of the email-word example. The top row represents emails, the bottom row words; the edges are the co-occurrences between them. (a) Random embedding; (b) optimal co-embedding using our method. The springs pull corresponding elements into alignment.
\includegraphics[width=0.23 \textwidth, height=0.16 \textwidth]{email/unsortspring.eps} \includegraphics[width=0.23 \textwidth, height=0.16 \textwidth]{email/sortspring.eps}
(a) (b)

To compute the optimal co-embedding, we define a weighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. We take the prototypes and video segments as nodes, $\mathcal{V} = \{v_1, ..., v_N\} \cup \{p_1, ..., p_K\}$. The edges of this graph consist of edges between prototypes and video segments, which represent the co-occurrence relationship ($C$), and edges between the prototype nodes, which represent the similarity $S_p \in \mathrm{R}^{K\times K}$ between the prototypes: $\mathcal{E} = \{ (p_i,v_j) \vert C(i,j) \neq 0 \}
\cup \{(p_i,p_j)\}$. The edge weight matrix is defined as:

\begin{displaymath}
W=\left[ \begin{array}{cc} I & C^T \\ C & \beta S_p \end{array} \right]
\in \mathrm{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert},
\end{displaymath} (3)
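Assembling the weight matrix of (3) is a simple block construction; the sketch below (numpy, with a function name of our choosing) orders the $N$ video-segment nodes first, matching the layout of the embedding vector:

```python
import numpy as np

def edge_weights(C, S_p, beta):
    """Edge weight matrix of eq. (3): W = [[I, C^T], [C, beta * S_p]].

    C:    K x N co-occurrence matrix (prototypes x video segments)
    S_p:  K x K prototype-prototype similarity matrix
    beta: weighting factor for the prototype-similarity edges
    Returns the symmetric (N+K) x (N+K) weight matrix.
    """
    K, N = C.shape
    return np.block([[np.eye(N), C.T],
                     [C, beta * S_p]])
```

Setting `beta = 0` drops the prototype-prototype edges, which is the case where minimizing (4) reduces to minimizing (2).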

where $\beta$ is a weighting factor. When $\beta=0$, minimization of (2) is equivalent to minimization of
\begin{displaymath}
E_W(x) = \frac{\sum_{(i,j) \in \mathcal{E}} W(i,j) (x(i) - x(j))^2}{\sigma_x^2}
\end{displaymath} (4)

Expanding the numerator of (4), we get $2x^T(D-W)x$, where $D \in \mathrm{R}^{\vert\mathcal{V}\vert \times \vert\mathcal{V}\vert}$ is a diagonal matrix with $D(i,i) = \sum_j W(i,j)$. We use the fact that

\begin{displaymath}
\sigma_x^2 = \sum_{i\in \mathcal{V}} x^2(i) P(i) -
(\sum_{i\in \mathcal{V}} x(i) P(i))^2
\end{displaymath} (5)

where $P(i)$ is an estimate of the prior occurrence likelihood of each prototype or video segment, which can be estimated by its occurrence frequency: $P(i) = \frac{1}{\gamma} D(i,i)$, with $\gamma = \mathbf{1}^T D \mathbf{1}$. Centering the embedding around zero (i.e. $x^T D \mathbf{1}=0$), we get $\sigma_x^2 = \frac{1}{\gamma} x^TDx$.
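This reduction of $\sigma_x^2$ under the centering constraint can be verified numerically; the following is a minimal sketch (numpy, variable names are our own) using a random symmetric weight matrix in place of (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n))
W = (A + A.T) / 2                  # symmetric nonnegative edge weights
d = W.sum(axis=1)                  # diagonal of D
gamma = d.sum()                    # gamma = 1^T D 1
P = d / gamma                      # occurrence-frequency prior P(i)

x = rng.standard_normal(n)
x = x - (d @ x) / gamma            # center the embedding: x^T D 1 = 0

# Eq. (5): degree-weighted variance; the mean term vanishes after centering.
var = (x ** 2) @ P - (x @ P) ** 2
assert np.isclose(x @ P, 0.0)                   # centered mean is zero
assert np.isclose(var, (x @ (d * x)) / gamma)   # sigma_x^2 = (1/gamma) x^T D x
```

The centering step subtracts the $D$-weighted mean, which is exactly what makes the second term of (5) drop out.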

Putting all these together, we can rewrite (4) as

\begin{displaymath}
E_W(x) = 2 \gamma \frac{x^T (D-W) x}{x^T D x}
\end{displaymath} (6)
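Equation (6) can also be checked end to end; this sketch (numpy, our own variable names) compares the edge-sum form of (4), taken over all ordered pairs, against the closed form $2\gamma\, x^T(D-W)x / (x^TDx)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2                 # symmetric edge weights (stand-in for eq. (3))
d = W.sum(axis=1)
D = np.diag(d)
gamma = d.sum()                   # gamma = 1^T D 1

x = rng.standard_normal(n)
x = x - (d @ x) / gamma           # center: x^T D 1 = 0

# E_W(x) of eq. (4): edge sum over all ordered pairs, divided by
# sigma_x^2 = (1/gamma) x^T D x.
num = sum(W[i, j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
E_edge = num / ((x @ (d * x)) / gamma)

# Closed form of eq. (6).
E_closed = 2 * gamma * (x @ (D - W) @ x) / (x @ (d * x))

assert np.isclose(E_edge, E_closed)
```

The factor of 2 appears because each undirected edge is counted once in each direction by the ordered-pair sum.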

The global energy minimum of this function is achieved by the eigenvector corresponding to the second smallest eigenvalue of
\begin{displaymath}
(D-W) x = \lambda D x
\end{displaymath} (7)
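A minimal sketch of solving (7) with plain numpy (the function name is ours; this symmetrizes the generalized problem via $D^{-1/2}(D-W)D^{-1/2}$ rather than calling a generalized eigensolver):

```python
import numpy as np

def co_embed(W):
    """Solve (D - W) x = lambda D x and return the eigenvector of the
    second smallest eigenvalue, i.e. a 1-D co-embedding.

    W must be symmetric with strictly positive row sums.
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W
    # Equivalent symmetric standard problem:
    #   D^{-1/2} L D^{-1/2} y = lambda y,  with  x = D^{-1/2} y.
    M = D_inv_sqrt @ L @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return D_inv_sqrt @ vecs[:, 1]      # skip the trivial smallest (x = const)
```

On a small graph with two tightly connected groups and weak cross-edges, the returned coordinates place each group on one side of zero, which is the alignment behavior illustrated in figure 6(b).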

Note that the first $N$ elements of vector $x$ contain the coordinates of the video segment nodes, and the next $K$ elements contain the coordinates of the prototype features. Incidentally, this is the same eigenvector used in a different context of image segmentation (Normalized Cuts, [12]).


Mirko Visontai 2004-05-13