Structured Prediction: A Framework for Machine Learning

Despite their impressive aptitude for crunching numbers and processing text, computers have a lot to learn before they can handle intelligent tasks that are relatively simple for a two-year-old child. As Ben Taskar, Magerman Term Assistant Professor of Computer and Information Science, can attest, getting computers to recognize people, actions, locations, objects, and concepts represented in the digital data they store is harder than it seems.

“We are trying to figure out robust ways for computers to do what humans do effortlessly,” says Taskar, who was recently awarded the prestigious Alfred P. Sloan Foundation Research Fellowship, which recognizes and supports his unique potential to make substantial contributions to the field of computer science. To illustrate the complexity of a seemingly simple problem, Taskar offers the example of a child reaching for a cup that’s sitting on the tray of his high chair. The child is demonstrating an understanding of the cup as something he can manipulate: a discrete, relatively permanent object, separate from the surface of the tray, composed of a handle, a container, and, hopefully, some contents.

It seems simple. But because computers lack evolutionarily honed capacities for contextual reasoning and highly specialized “hardware” for perception, Taskar points out, “These things are not simple when you’re starting from a blank slate.” Teaching a computer to understand what a toddler has mastered requires a new twist on some of computer science’s foundational methods. “The standard paradigm of writing a program doesn’t work with these perceptual problems that are essentially trivial for humans, but extremely difficult for machines, because the real world is too messy,” Taskar explains. “Machine learning requires a different paradigm.”

Rather than writing a program to get a computer to accomplish a specific task, Taskar is developing algorithms that provide computers with basic building blocks to learn by correlating labels with images, and identifying commonalities among multiple examples of a particular concept. Because machines learn primarily by rote, teaching them that a particular combination of pixels corresponds to a face, or that a predictable pattern of colors correlates to a location such as a beach, requires tens of thousands of examples, all of which must be labeled in order to provide the context needed for accurate recognition.

As recently as ten years ago, this was a painstaking process that could be done only on a small scale. Images and videos were relatively scarce, and the time-consuming process of creating repositories of labeled, or supervised, data slowed researchers’ progress. As Taskar explains it, “Every time I wanted the computer to detect a new type of object, or a new action (somebody running versus somebody skipping, for example), as long as they’re somewhat different, I had to go out there and label more examples, essentially from scratch.”

But vast stores of data in the wild, or digital files which are becoming more plentiful on the Internet every day, are facilitating Taskar’s efforts to accelerate the process of machine learning. “Before the advent of the Internet, you would have to find thousands of examples of every particular object and every particular action,” Taskar says. “But now people are spending a lot of time posting images, videos and text on the web, creating this vast repository of digital knowledge.” Harvesting video episodes of popular television shows such as LOST, Alias and CSI, along with screenplays and scripts, commentary from fan sites, and text from closed captions, Taskar is able to create sets of unsupervised, or weakly labeled, data that provide a large enough set of examples from which a computer can learn to identify characters, actions, objects, and locations. In order to do this accurately, however, the computer must resolve ambiguities between visual and textual inputs.

Many challenges remain, particularly in terms of resolving variations in lighting, facial expression, camera angle, or pose. Even so, the successes Taskar has had in his work with weakly labeled data include near-perfect precision in identifying the key characters from LOST and CSI, and over 90 percent accuracy in retrieving video results based on textual queries for actions such as grab, cry, kiss, shout, or sleep.

“My hope is that my research can help solve some of the long-standing problems in computer vision and natural language processing,” Taskar says of his work in teaching machines to process language and perceive digital images. From organizing family photos on a hard drive, to searching across terabytes of digital video to find a specific clip, or aggregating news stories told in massive collections of text files, Taskar is transforming the way we access digital information. His research is bringing us beyond the keyword search for needles in haystacks to discovery and retrieval of targeted, multimedia information culled from data in the wild.

View the full article in Penn Engineering Magazine: "Structured Prediction: A Framework for Machine Learning" by Catherine Von Elm.
