
Image features

For each image frame in the video we extract objects of interest, typically moving objects. We make no attempt to track objects. The motion information is computed directly via spatiotemporal filtering of the image frames: $I_t(x,y,t) = I(x,y,t) * G_t * G_{x,y}$, where $G_t = t\, e^{-(\frac{t}{\sigma_t})^2}$ is the temporal Gaussian derivative filter and $G_{x,y} = e^{-((\frac{x}{\sigma_x})^2+(\frac{y}{\sigma_y})^2)}$ is the spatial smoothing filter. This convolution is linearly separable in space and time and is fast to compute. To detect moving objects, we threshold the magnitude of the motion filter output to obtain a binary moving object map: $M(x,y,t) = \Vert I_t(x,y,t)\Vert_2 > a$. The process is demonstrated in figure 2(a)-(b).
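The filtering-and-thresholding step above can be sketched as follows; this is a minimal illustration using SciPy's separable Gaussian filtering, with the smoothing widths and threshold `a` chosen arbitrarily rather than taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def moving_object_map(frames, sigma_t=2.0, sigma_xy=1.5, a=10.0):
    """Detect moving pixels via separable spatiotemporal filtering.

    frames : float array of shape (T, H, W) -- grayscale video.
    Convolves with a temporal Gaussian derivative (order 1 along t)
    and a spatial Gaussian (order 0 along x and y), then thresholds
    the magnitude of the response at `a` to get the binary map M.
    """
    # order=(1, 0, 0): first derivative of the Gaussian along the
    # temporal axis, plain Gaussian smoothing along the spatial axes.
    # The parameter values here are illustrative assumptions.
    It = gaussian_filter(frames.astype(float),
                         sigma=(sigma_t, sigma_xy, sigma_xy),
                         order=(1, 0, 0))
    return np.abs(It) > a  # binary moving-object map M(x, y, t)
```

Because the filter is separable, this runs in time linear in the number of pixels per axis rather than requiring a full 3-D convolution kernel.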

Figure 2: Feature extraction from video frames. (a) Original video frame from the card game sequence. (b) Binary map of objects. (c) Spatial histogram of (b).

The image feature we use is the spatial histogram of the detected objects. Let $H_t(i,j)$ be an $m \times m$ spatial histogram, with $m$ typically equal to 10: $H_t(i,j) = \sum_{x,y} M(x,y,t) \cdot \delta(b^x_i \le x < b^x_{i+1}) \cdot \delta(b^y_j \le y < b^y_{j+1})$, where $b^x_i, b^y_j$ $(i,j = 1 \dots m+1)$ are the boundaries of the spatial bins. The spatial histograms, shown in figure 2(c), indicate the rough area of object movement. Similarly, we can compute a motion and color/texture histogram for the detected object using the spatiotemporal filter output. As we will see, these simple spatial histograms are sufficient to detect many complex activities.
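A direct way to realize the histogram definition above is NumPy's 2-D binning; this sketch assumes the binary map for a single frame as input, with the function name and the uniform bin edges being our own choices:

```python
import numpy as np

def spatial_histogram(M_t, m=10):
    """Spatial histogram H_t of a binary object map for one frame.

    M_t : bool array of shape (H, W) -- moving-object map at time t.
    Returns an (m, m) array where H_t[i, j] counts the object pixels
    falling into spatial bin (i, j) of an m x m grid over the image.
    """
    H, W = M_t.shape
    ys, xs = np.nonzero(M_t)          # coordinates of object pixels
    # Uniform bin edges play the role of the boundaries b^x_i, b^y_j.
    hist, _, _ = np.histogram2d(xs, ys,
                                bins=m,
                                range=[[0, W], [0, H]])
    return hist
```

Summing `spatial_histogram` over all frames of a segment, or stacking it per frame, yields the coarse "where did things move" signature the text describes.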


Mirko Visontai 2004-05-13