next up previous
Next: Conclusion Up: Detecting Unusual Activity in Previous: Computational running time


Experiments

To demonstrate the algorithm we conducted the following tests on various test data sets; Table 1 gives a short summary of the different tests. In the following experiments we used $K=500$ prototypes for vector quantization, a segment length of $T=4\,s$, and $\beta = 1/M$, where $M$ is the largest value in $S_p$.
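As a sketch of how these parameters fit together, assuming features are assigned to their nearest prototype (the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def quantize(features, prototypes):
    # vector quantization: map each feature vector to the index of
    # its nearest prototype (K = prototypes.shape[0], e.g. K = 500)
    d = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

# beta = 1/M, where M is the largest value in the similarity matrix S_p
# (S_p here is a tiny made-up example)
S_p = np.array([[0.0, 2.0],
                [2.0, 5.0]])
beta = 1.0 / S_p.max()
```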


Table 1: The test videos used in our experiments

Title        Duration   Type of test
Road         19h50min   surveillance
Poker game   30min      cheating detection
Hospital     12h        patient monitoring
Webcam       3h         analyzing the crowd


The first test set is a video shot in the dining room of a hospital. After removing the motionless frames, we were left with $169\,880$ frames. We tested our embedding algorithm to see whether it provides a good separation between different events. We observed that the unusual activities are embedded far from the usual ones, as can be seen in figure 7.
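One simple way to flag such remote clusters, given only the embedding coordinates, is to rank segments by their distance from the embedding centroid; this is an illustrative heuristic, not the paper's exact criterion:

```python
import numpy as np

def farthest_segments(embedding, k):
    # rank embedded segments by distance from the centroid and
    # return the indices of the k most remote ones
    centroid = embedding.mean(axis=0)
    dist = np.linalg.norm(embedding - centroid, axis=1)
    return np.argsort(dist)[-k:]
```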

Figure 7: Four unusual activities discovered, corresponding to four remote clusters in the embedding space. A: a patient eating alone at the near table; B: a man in a wheelchair slowly going in and out of the room while everyone else is eating; C: a patient shaking; D: a nurse feeding a patient one-on-one with no one else around. E: the 2-D embedding of the video segments.
[Figure 7 images: panels A, B, C, D (example frames) and E (2-D embedding)]

To quantify the ``goodness'' of the embedding provided in the previous experiment we used another video, from a surveillance camera overlooking a road adjacent to a fenced facility. We tested our system on a continuous video from 16:32 until 12:22 the next day, containing both daytime and nighttime footage ($1\,063\,802$ image frames in total). We applied our embedding algorithm and classified the embedded segments into two groups, usual and unusual. To measure performance we hand-labeled every sequence that contained motion as usual or unusual, and compared our results to this ground truth. The promising results of this experiment are shown in figure 8. Though this surveillance sequence is somewhat limited in the types of actions it contains (in particular, it has just $23$ unusual sequences), we would like to point out that even without motion features, i.e. with spatial histograms only, we were able to detect events such as cars making U-turns, cars backing up, and people walking on and off the road.
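The precision and recall behind figure 8 can be computed from the hand-labeled ground truth in the usual way; a minimal sketch (function and variable names are ours):

```python
def precision_recall(predicted, truth):
    # predicted, truth: parallel lists of booleans, one per
    # motion-containing sequence (True = unusual)
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```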

Figure 8: Results for the 20-hour-long road surveillance video. Usual events consist of cars moving along the road. Correctly detected unusual events include: (A) cars pulling off the road, (B) cars stopping and backing up, (C) cars making U-turns, and people walking on the road. Undetected unusual events include: (D) cars stopping at the far end, due to the coarseness of the spatial features. False positives consist mainly of birds flying by and direct sunlight into the camera (E). The Precision-Recall curve of the results is shown in (F). The star indicates the operating point achieving the precision/recall trade-off shown in (A)-(E).
[Figure 8 images: panels (A)-(E) example frames and (F) Precision-Recall curve]

The next experiment aimed to measure performance in a more complex setting: we recorded a $30$-minute-long poker game sequence, in which two players were asked to cheat creatively. The video contains $17\,902$ frames, and every $4$-second segment was hand-labeled with one of $27$ activity labels. The video contains a wide variety of natural actions: in addition to playing cards and cheating, the players were drinking water, talking, gesturing with their hands, and scratching. Many of the cheating events are among the detected unusual events. To demonstrate that we can detect a specific type of cheating, we retrieve the unusual events corresponding to a prototype feature of our choice. The results of detecting two cheating types are shown in figure 9.
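Selecting the unusual events that correspond to a chosen prototype feature can be sketched as follows, assuming each segment is represented by a histogram over the $K$ prototypes (the threshold and all names here are illustrative):

```python
import numpy as np

def events_for_prototype(histograms, unusual, proto, min_mass=0.2):
    # among the segments flagged as unusual, keep those whose
    # prototype histogram places at least min_mass on prototype
    # index `proto`
    return [i for i in unusual if histograms[i][proto] >= min_mass]
```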

Figure 9: ``Elbow'' cheating detection. A1, B1, C1: examples of detected cheatings, in which the ``near player'' reaches to his left elbow to hide a card. D1: a non-detected cheating; the ``near player'' reaches to his elbow but does not hide anything. E1: a false positive; the ``near player'' makes a different movement with his hand. ``Under'' cheating detection. A2, B2, C2: examples of detected events, in which two players exchange cards under the table. D2: a non-detected cheating; the exchange is mostly occluded. E2: a false positive; the near player is drinking and, due to the camera angle, his hand is in a similar position. F1, F2: ROC curves for the two events; the red stars indicate the operating point for the results shown here.
[Figure 9 images: panels A1-E1 (``elbow'' cheating), A2-E2 (``under'' cheating), and F1, F2 (ROC curves)]

To show that the algorithm can also be used for categorizing usual events, we took a 3-hour-long video from the Berkeley Sproul Plaza webcam (http://www.berkeley.edu/webcams/sproul.html), which contained $28\,208$ frames. The embedding of the video segments and the event-category representatives are shown in figure 10 (left). The automatic categorization of events can potentially allow us to develop a statistical model of activities in an unsupervised fashion.

Figure 10: (left) The embedding of the webcam video shows that the video segments are best organized by two independent event types in the scene. The horizontal axis (A-D) represents crowd movement along the building: many people walking (A), and few or no people walking (D). The vertical axis (B-F) captures events of people walking into or out of Sproul Hall, organized according to the orientation in which people entered or left: (B) along the bottom of the image frame; (F) diagonally from the lower left corner. (E) and (C) are compound events: (E) is a combination of events (F) and (D), and (C) is a combination of (B) and (D). (right) Given the classification of the video into distinct events, a transition model is estimated.

[Figure 10 images: (left) embedding with event-category examples; (right) estimated transition model]
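A transition model of the kind shown in figure 10 (right) can be estimated by counting transitions between consecutive segment labels and row-normalizing the counts; a minimal sketch under that assumption:

```python
import numpy as np

def transition_matrix(labels, n_states):
    # count transitions between consecutive event labels, then
    # row-normalize to obtain transition probabilities
    T = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):
        T[a, b] += 1.0
    rows = T.sum(axis=1, keepdims=True)
    # rows with no outgoing transitions stay all-zero
    return np.divide(T, rows, out=np.zeros_like(T), where=rows > 0)
```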


Mirko Visontai 2004-05-13