Faces on TV

Final submission (20% of grade): December 15

Announcements

Introduction

The goal of the final project is to develop machine-learned models to accurately identify characters from two popular TV shows: LOST and CSI. You are given as input an image of a person's face and an audio clip of that character speaking, like so:
[Example input: an image of Jack's face and an audio clip of Jack speaking.]

The goal is to produce as output the identity of the person: jack

To make this possible we will need to make a few simplifying assumptions. The system will know ahead of time whom it needs to identify, and will have example images and audio snippets for each person. Thus we have a standard supervised learning problem.

These are the people you'll be learning to identify:

LOST


NOTE: due to time constraints, we are NOT doing CSI, only LOST.

CSI

Your goal is to learn a classifier that has the highest accuracy on unseen data. To evaluate, we will have you submit your classifiers to us, and we will test them on new data your model has not seen before. We will hold a contest to see who in the class has the best model, to be described below.

We have decoupled the audio and visual problems for you, so you need to provide models for each of the 4 different problems: image classification and audio classification, for each of the two shows.

Data Collection

The data was collected from a few episodes of LOST and CSI, available on DVD.

The audio was automatically labeled by aligning it with text scripts obtained freely from the web, which indicate which character is speaking. The text of the script was aligned with the frames of the episodes via dynamic time warping. For more information, see the paper
Timothee Cour, Chris Jordan, Eleni Miltsakaki, Ben Taskar. Movie/Script: Alignment and Parsing of Video and Text Transcription. (ECCV 2008)

The image data was collected by running a standard face detector over an entire episode (exactly like the one described in the cascade question of the midterm). We further prune false positive detections by linking face detections together into tracks and throwing away tracks that are too short. Finally, we rank the quality of the hypothesized faces using face part detectors: for each face we run eye, nose and mouth detectors, and score the face by how geometrically consistent the combination of detected eyes, nose and mouth is. The end result is a set of high-quality face detections, along with accurate estimates of face part positions:


(To display something like the picture above with the data you are given, see the function play_with_image_data.m for a demonstration.)

We can then extract the faces and perform a least-squares affine registration to align all the face examples, so that they are the same size and their eyes and mouth are mapped as close as possible (in a least-squares sense) to the same locations across images, giving us our final data set. Noses have too much variation to be consistent and were not included in the mapping. The resulting data:

The extracted images you are working with are an output of this system, resized to 50x50x3 matrices (containing the three RGB channels red, green and blue). The pixel coordinates for the eye and mouth points in these images were mapped to the following points for consistency:
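For concreteness, here is a rough sketch of the kind of least-squares affine fit described above, assuming each *_pt field is a 1x2 [x y] row vector; the canonical target coordinates below are made-up placeholders, not the actual values used to build the dataset:

i = 1;
src = [data(i).left_eye_pt; data(i).right_eye_pt; data(i).mouth_pt];   % 3x2 source points in the original frame
dst = [15 18; 35 18; 25 40];               % HYPOTHETICAL target points in the 50x50 face
A = [src, ones(3,1)];                      % homogeneous coordinates, 3x3
T = A \ dst;                               % 3x2 affine map; exact for 3 points, least-squares if more correspondences are stacked
nose_in_face = [data(i).nose_pt, 1] * T;   % map any other point through the same transform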

Baseline System

We have provided you with a baseline system which serves as starter code for you to improve upon, and which you can change completely if you wish. It simply uses an SVM to train models on the image and audio features described below:

Image features

[features,Y] = generate_image_features(data)

The image features are extremely simple: each example is resized to 25x25x3 (RGB) and flattened into a single feature vector of dimension 1875x1.
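The gist of this step is roughly the following sketch (the real generate_image_features.m is in the starter code; the function name here is hypothetical, one feature row per example is assumed, and imresize requires the Image Processing Toolbox):

function [features, Y] = generate_image_features_sketch(data)
    n = numel(data);
    features = zeros(n, 25*25*3);                % one 1875-dimensional row per example
    Y = zeros(n, 1);
    for i = 1:n
        small = imresize(data(i).img, [25 25]);  % 50x50x3 -> 25x25x3
        features(i, :) = double(small(:))';      % flatten to a row vector
        Y(i) = data(i).label;
    end
end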

Audio features

[features,Y] = generate_audio_features(data)

Each audio example is a 1 second sound clip stored as a vector of samples, i.e., a wave file. Here is what it looks like when you plot the first example, which is a 1 second clip of Claire from LOST speaking (plot(data(1).y)):

For each example sound clip we generate a spectrogram, which computes the frequency spectrum of a number of overlapping subintervals (also called "(sub)windows" or "(sub)frames") of the sound wave. The goal is to capture the frequency characteristics of each person, e.g., women typically speak at a higher frequency than men. For more information on spectrograms, see the MATLAB help via help spectrogram, or read the spectrogram article on Wikipedia.
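To get a feel for this, here is a rough sketch of computing per-window spectral features for one clip; the window length and overlap below are assumptions, not necessarily what generate_audio_features.m uses, and spectrogram requires the Signal Processing Toolbox:

x   = double(data(1).y);                              % raw samples of the first clip
fs  = data(1).sampling_freq;
win = round(0.025 * fs);                              % ~25 ms subwindows
[S, F, T] = spectrogram(x, win, round(win/2), [], fs);  % [] = default FFT length
features  = log(abs(S) + 1e-6)';                      % one row of log-magnitude spectra per subwindow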

As a result of our feature generation technique, each 1 second clip is turned into a number of examples to be used for training. Classifying a test example is then not a straightforward task: each test clip is turned into multiple examples, and it is not clear what the right way to combine these is in order to output a single class prediction. The baseline takes a simple approach: it classifies each example of a 1 second clip independently, and then takes the most common classification as the predicted label. This type of technique is referred to as max voting (i.e., majority voting).
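As a minimal sketch, assuming subwindow_preds (a hypothetical variable name) holds the predicted labels for all subwindows of a single 1 second test clip, the max vote is just:

clip_prediction = mode(subwindow_preds);   % most common predicted label wins (ties go to the smallest label)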

SVM

svmtrain, svmpredict, train_with_cross_validation

The baseline uses libsvm as its classifier. You should become familiar with the settings and functions available in this package by reading the documentation in the README and on the website. Even beyond this class, libsvm is one of the easiest and most powerful off-the-shelf machine learning packages to try as a first attempt on any classification or regression task, so it pays to know how to use it! The source is included for you in the starter code, along with compiled MATLAB files for both 32-bit Windows and 32-bit GNU/Linux.
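For reference, typical usage of libsvm's MATLAB interface looks roughly like this, where Xtrain is an n x d double matrix of features and Ytrain an n x 1 label vector; the kernel and C settings shown are placeholders, not necessarily the baseline's choices:

model = svmtrain(Ytrain, Xtrain, '-t 0 -c 1');       % train a linear-kernel SVM with C = 1
[Yhat, acc, dec] = svmpredict(Ytest, Xtest, model);  % acc(1) is the test accuracy in percent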

Performance

Running the starter code on the small initial LOST data set, you should get ~80% cross-validation (random splits of 80% training, 20% testing) accuracy for classifying images, and 50% accuracy for classifying sounds (note that in the cross validation function, the max-voting technique for classifying sound clips is not used; the accuracy reported is for individual subwindows). Random guessing would only achieve 20% in this multiclass setting.
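If you want to roll your own estimate, a random 80/20 split loop along the lines of what train_with_cross_validation does might look like the sketch below; the number of splits and SVM options are assumptions, and features/Y are assumed to be an n x d double matrix and n x 1 label vector:

n    = size(features, 1);
accs = zeros(10, 1);
for r = 1:10
    perm = randperm(n);
    ntr  = round(0.8 * n);
    tr   = perm(1:ntr);
    te   = perm(ntr+1:end);
    model   = svmtrain(Y(tr), features(tr, :), '-t 0 -c 1');
    yhat    = svmpredict(Y(te), features(te, :), model);
    accs(r) = mean(yhat == Y(te));
end
fprintf('mean cross-validation accuracy: %.1f%%\n', 100 * mean(accs));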

Training error for the image and sound classifiers is very low, near 0% for each.

Data Structure Format

The given data is saved as MATLAB structs with the following fields:

Fields present in both structs:
data(i).name            Name of the character for example i.
data(i).label           Integer class label for the example, in the range [1,5]. This is redundant with the name field, but easier to work with in code.

Fields present in the audio data struct only:
data(i).y               The audio waveform (a vector of samples).
data(i).sampling_freq   The audio sampling frequency.

Fields present in the image data struct only:
data(i).img             The cropped and registered face for the example, in RGB: a 50x50x3 matrix with values in the range [0,255].
data(i).imgfile         Filename of the original frame this example came from, which is provided for you and can be loaded from disk.
data(i).{left_eye,right_eye,mouth,nose}_pt   (x,y) coordinates of the facial features in the original frame pointed to by data(i).imgfile.
data(i).bounding_box    Coordinates of the rectangle around the face in the original frame. Format is a 1x4 vector:
                        [top_left_x top_left_y bottom_right_x bottom_right_y]
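As a quick illustration of these fields (play_with_image_data.m is the provided demo; this is just a minimal sketch that assumes the point fields are 1x2 [x y] row vectors and that the Image Processing Toolbox is available for imshow/imread):

i = 1;
imshow(uint8(data(i).img));                        % the 50x50x3 registered face
title(sprintf('%s (label %d)', data(i).name, data(i).label));

frame = imread(data(i).imgfile);                   % original frame on disk
bb = data(i).bounding_box;                         % [top_left_x top_left_y bottom_right_x bottom_right_y]
figure; imshow(frame); hold on;
rectangle('Position', [bb(1) bb(2) bb(3)-bb(1) bb(4)-bb(2)], 'EdgeColor', 'g');
plot(data(i).left_eye_pt(1),  data(i).left_eye_pt(2),  'r+');
plot(data(i).right_eye_pt(1), data(i).right_eye_pt(2), 'r+');
plot(data(i).mouth_pt(1),     data(i).mouth_pt(2),     'r+');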

Requirements

This is a fairly open-ended project, with the overall goal to do the best you can on the test set. Aside from this, we have several minimum requirements we'd like you to achieve:

In the end you should describe what you did in a ~1 page write-up, submitted as either README.txt or README.pdf.

How to get started

Download the baseline/starter code here:
cis520_final_project_starter_code.zip

After you unpack it, try running the script main.m in the code/ directory. It should successfully load the data, generate the described baseline features, run cross-validation, and calculate training error on the LOST data. It's a good idea to step through this code in the debugger, just to get an idea of the data formats and how the system is put together. Many design decisions were made arbitrarily and may be sub-optimal.

Also take a look at play_with_image_data.m for an example of how to use the information in the image data struct.

After that, you're on your own! Try to learn models that have the lowest generalization error possible. Since you don't have the test set, you can only estimate how well you'll do by holding out some of the training data you have as a validation set.

Feel free to email or ask in office hours for tips, techniques and approaches to use.

What to turn in


Code and models

For evaluation, you must provide the following function:

predictions = test_classifier(datatest, datatrain, data_type, tv_show_name)

This function takes as input a test data struct of m examples in the same format as previously described, except it will NOT have the label or name fields that the training data has. It is also given the training data that you were given, so that it can support instance-based algorithms. Thus, you should NOT submit your training data as part of your submission; we will pass it in at test time.

test_classifier should return the m x 1 vector predictions, containing your model's predictions of the true labels of the examples in datatest. We will then take your predictions and score them against the true label vector Y using simple average 0-1 loss:

error = mean(predictions ~= Y)

In addition to the test_classifier function, you are welcome to submit any other functions, saved model files, subdirectories, etc. you need to have test_classifier work. Whatever you submit should be completely self-sufficient.
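As a bare-bones illustration of the required interface only: the 'image' value checked for data_type and the helper flatten_images are hypothetical, and this sketch simply retrains the baseline image SVM at test time rather than loading a saved model, so adapt it to however your own pipeline is structured.

function predictions = test_classifier(datatest, datatrain, data_type, tv_show_name)
    m = numel(datatest);
    predictions = ones(m, 1);                               % fallback prediction
    if strcmp(data_type, 'image')
        Xtr = flatten_images(datatrain);                    % baseline-style features
        Ytr = [datatrain.label]';
        Xte = flatten_images(datatest);                     % note: datatest has no label or name fields
        model = svmtrain(Ytr, Xtr, '-t 0 -c 1');
        predictions = svmpredict(zeros(m, 1), Xte, model);  % dummy labels; only the predictions are used
    end
    % audio would be handled analogously, including the per-clip max vote
end

function X = flatten_images(data)
    % hypothetical helper: 50x50x3 faces -> 25x25x3 -> one 1875-dim row each
    X = zeros(numel(data), 25*25*3);
    for i = 1:numel(data)
        small = imresize(data(i).img, [25 25]);
        X(i, :) = double(small(:))';
    end
end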

Model restrictions

You are allowed to use any software you have or find, provided:

(1) everything runs on eniac without any supervision from us, and
(2) your total submission is less than 50MB.

Everything else is fair game.

Readme

Provide a 1-page README.txt or README.pdf of what you did, as described in the requirements section above.

names.txt

Include a text file, names.txt, with the first line being the team name, followed by each team member's email address, one per line. For example, if all the TAs were on a team together called the TAsters, the text file would be:

TAsters
bensapp@seas.upenn.edu
katef@seas.upenn.edu
pasingh@seas.upenn.edu

The team name is how your team will be represented in the contest. The email addresses are needed so we can easily add grades to Blackboard and email you if there's a problem with your submission.

How to turn in

Again we will use turnin. Please follow these instructions:

When to turn in, contest format, etc

We will have several contest checkpoints leading up to the final contest. Submitting to each of these checkpoints is optional. After each checkpoint has passed, we will test all of the submissions and post the results on the web so you can see how your team compares to the rest. This is in the spirit of the Netflix Challenge Leaderboard, but it will only be updated at these discrete time intervals, the contest checkpoints.

The tentative dates for the checkpoints are: