About Machine Learning Lunch
Penn's machine learning lunch is a (more or less) biweekly event designed to give students, faculty, postdocs, and visiting researchers in machine learning a chance to present their research to the department and receive feedback. Talks on both polished work and work in progress are welcome.
This semester talks will generally be held on Wednesday at 12:30pm and will be in Levine 512 unless otherwise noted. Contact Jenn Wortman if you would like to give a talk or suggest a speaker.
To sign up for the mlunch mailing list go here.
Upcoming Talks
Mlunch will return in September 2007 under new management. Contact Alex Kulesza or Ted Sandler for details.
Past Talks (2006-2007)
Friday, September 15, Levine 512
Analysis of Representations for Domain Adaption
John Blitzer, CIS
Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source task drawn from one distribution, and we wish to learn a classifier which performs well on a target task from a different distribution. Under what conditions does a classifier for the source task transfer to the target task? Intuitively, a good feature representation is a crucial factor in the success of transfer. We formalize this intuition theoretically with a generalization bound for transfer. Our theory illustrates the tradeoffs inherent in designing a representation for transfer and gives a new justification for a recently proposed model for transfer. It also points toward a promising new model for transfer learning: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
Joint work with Shai Ben-David, Koby Crammer, and Fernando Pereira.
Wednesday, September 27, Levine 512
Risk-Sensitive Online Learning
Jenn Wortman, CIS
We consider the problem of online learning in settings in which we want to compete not simply with the rewards of the best expert or stock, but with the best trade-off between rewards and risk. Motivated by finance applications, we consider two common measures balancing returns and risk: the Sharpe ratio and the mean-variance criterion of Markowitz. We first provide negative results establishing the impossibility of no-regret algorithms under these measures, thus providing a stark contrast with the returns-only setting. We then show that the recent algorithm of Cesa-Bianchi et al. achieves nontrivial performance under a modified bicriteria risk-return measure, and give a modified best expert algorithm that achieves no regret for a "localized" version of the mean-variance criterion. We perform experimental comparisons of traditional online algorithms and the new risk-sensitive algorithms on a recent six-year S&P 500 data set and find that the modified best expert algorithm outperforms the traditional with respect to Sharpe ratio, MV, and accumulated wealth. To our knowledge this work initiates the investigation of explicit risk considerations in the standard models of worst-case online learning.
Joint work with Eyal Even-Dar and Michael Kearns.
Wednesday, October 18, Levine 512
Reconciling Author References with Random Walks
Ted Sandler, CIS
Most bibliographic databases contain extensive publication lists but make no effort to track author identity across publications. Consequently, such databases are not able to answer simple queries like "list all publications by the first author of document X." Because manual reconciliation is impossible for bibliographies possessing millions of records, an automated approach is necessary. We present a probabilistic model for performing author reference reconciliation based on random-walks on a bibliography graph. The model is closely related to the Gaussian field and harmonic function model of Zhu et al., and when tested on sets of confusable authors in the DBLP, it yields substantial improvements against a more traditional string-matching approach. We also present a framework for learning transition probabilities in the bibliography graph and detail our current work to this end.
Joint work with Koby Crammer and Lyle Ungar.
Wednesday, November 1, Levine 512
Deriving Value from Consumer Networks
Shawndra Hill, Department of Operations and Information Management, Wharton
Firms may be able to capitalize on the fact that consumers interact. For example, recently the concept of "buzz" marketing has attracted the interest of firms, some of which go to great lengths to take advantage of direct word of mouth. Firms that collect explicit data on networks of interactions may be able to mine it to increase revenue or to reduce costs. In this talk, I will present a framework for representing and analyzing large consumer networks. I motivate our methodology with two business problems, an application in telecommunications fraud and network-based target marketing. I provide evidence that networked data can improve fraud detection, extending the results from a prior, preliminary study. I also present evidence that directly networked consumers generate more revenue than customers targeted through traditional means, such as demographic similarity or prior purchase behavior. The results of this research show that information in consumer network data can be used to improve business outcomes.
Joint work with Deepak Agarwal, Bob Bell, Foster Provost and Chris Volinsky.
Wednesday, November 15, Levine 512
Winner-Take-All EM Clustering
Bill Kandylas, CIS
The EM algorithm is often used with mixture models to cluster data. For many practical problems, it is desirable for efficiency reasons to produce hard clusters, where each data point is assigned to exactly one cluster. For a mixture of Gaussian distributions, under certain assumptions including the limit of all variances going to zero, the widely used k-means clustering algorithm is known to result. In this talk a new method will be presented for deriving winner-take-all (WTA) versions of the EM algorithm for mixture models, which can be used even when it is not possible or desirable to take the limit of vanishing variance. Using this method, WTA limits can be derived for Gaussian mixtures where different mixture components can have different variances. An interesting consequence of this is that it allows for some of the clusters to reside "inside" other clusters. Experiments show how unequal variances lead to better clusters and how they can be used to find background and foreground clusters.
Wednesday, November 29, Levine 512
Regret to the Best vs. Regret to the Average
Eyal Even-Dar, CIS
We study regret minimization algorithms, focusing not only on their regret to the best expert, but also on their regret to the average of all experts and to the worst expert. We show that any algorithm that achieves only O(sqrt(T)) regret to the best expert must, in the worst case, suffer regret Omega(sqrt(T)) to the average, and that for a wide class of update rules that includes many existing no-regret algorithms (such as weighted majority, exponential weights, follow the perturbed leader, and others), the product of the regret to the best and the regret to the average is Omega(T). We describe and analyze a new algorithm, based on the exponential weights algorithm, that achieves cumulative regret only O(sqrt(T)log(T)) to the best expert and has a constant regret to the average (with no dependence on either T or N). We also give a simple algorithm whose payoff is always better (or equal) to the worst expert and has regret of O(sqrt(T)) to the best expert.
Joint work with Michael Kearns, Yishay Mansour, and Jennifer Wortman.
Wednesday, January 10, 1:00pm, Levine 512
Metric Learning with Semidefinite Programming
Kilian Weinberger, CIS
Many machine learning algorithms rely heavily on the existence of a good measure of (dis-)similarities between all input vectors. One of the most commonly used measures of dissimilarity is the Euclidean distance in input space. This is often suboptimal in many ways. The Euclidean distance metric does not incorporate any side information that might be available, it does not take advantage of the given structure of the input data and ignores that the user has information about the nature of the algorithm (such as classification or regression).
I will discuss two algorithms, based on semidefinite programming, to learn a transformation of the input data that gives rise to more meaningful distances.
The first algorithm, Maximum Variance Unfolding (MVU), is designed for the unsupervised case, in which we are only given the input data without any side information. The algorithm finds a low dimensional Euclidean embedding of the data by exploring the underlying structure of the inputs. I will show that the problem can be phrased as a semidefinite program and can be scaled up by orders of magnitude (to tens of thousands of input vectors) with Laplacian Graph Regularization.
The second algorithm, Large Margin Nearest Neighbor (LMNN), is a supervised algorithm for classification that assumes the existence of additional label information for each input. The goal is to learn a linear transformation of the input data that moves similarly labeled inputs close together and separates differently labeled inputs by a large margin. Similar to the first algorithm, we managed to phrase it as a semidefinite program that could be applied to large data sets with up to 60000 training samples.
Wednesday, January 24, 12:30pm, Levine 512
Balanced Graph Matching
Timothee Cour, GRASP Lab
Graph matching is a fundamental problem in Computer Vision and Machine Learning. We present two contributions. First, we give a new spectral relaxation technique for approximate solutions to matching problems, that naturally incorporates one-to-one or one-to-many constraints within the relaxation scheme. The second is a normalization procedure for existing graph matching scoring functions that can dramatically improve the matching accuracy. It is based on a reinterpretation of the graph matching compatibility matrix as a bipartite graph on edges for which we seek a bistochastic normalization. We evaluate our two contributions on a comprehensive test set of random graph matching problems, as well as on image correspondence problem. Our normalization procedure can be used to improve the performance of many existing graph matching algorithms, including spectral matching, graduated assignment and semidefinite programming.
Joint work with Praveen Srinivasan and Jianbo Shi.
Wednesday, January 31, 12:30pm, Levine 307
Structured Learning with Approximate Inference
Alex Kulesza, CIS
For many structured problems, inference (in particular, choosing the labeling with maximal score) is known to be NP-hard. Approximate inference algorithms have been developed, and some even provide theoretical approximation guarantees. However, guarantees for inference may not lead to guarantees for learning, even when the former is a subroutine of the latter. This talk presents ongoing theoretical work that aims to understand how structured learning can occur when approximations are required. We first show a simple example demonstrating that even when inference is very good (but not exact), learning can fail. This leads to a stronger, "approximate" notion of margin that characterizes the ability of a given inference procedure to separate a data set. With the modified definition we are able to prove bounds on learning. We first show a hard-margin mistake bound using the perceptron algorithm, and then describe a soft-margin version in the PAC-Bayes framework that allows for noise.
Wednesday, February 7, 12:30pm, Levine 512
Learning Multiple Online Tasks with a Global Objective
Ofer Dekel, Hebrew University
The simplicity and elegance of online learning make it a practical tool with many useful applications. In practice, we are often faced with multiple online prediction tasks in parallel. The naive approach to dealing with such situations is to learn each task separately and independently of the others. Can we do any better than this?
We present an online multitask learning framework where the multiple tasks all contribute to a common goal and share the consequences of their prediction mistakes. We discuss new learning algorithms that take advantage of this framework and learn the multiple tasks jointly. We also present cumulative loss bounds that provide a guarantee on the worst-case performance of our algorithms and elucidate the advantages of our approach over the naive approach of learning each task separately.
Many real-world online prediction applications naturally fit within our framework. We give several concrete examples, including a multiclass-multilabel text categorization algorithm capable of processing millions of examples in minutes, a surprising application of our technique to online ordinal regression, and a randomized algorithm that dynamically allocates memory to multiple kernel-based classifiers.
Joint work with Yoram Singer and Phil Long.
Wednesday, February 21, 12:30pm, Levine 512
Mark Dredze, CIS
Unless you are the type of person who buys Cialis in bulk online or takes unsolicited stock tips from strangers, a new form of spam has probably been plaguing your inbox recently. Image spam has attacked the most basic line of spam defense, content filters, by using the idea of CAPTCHAs to benefit spammers. This work seeks to reclaim the content filter by developing classification methods for detecting spam images. Instead of relying on a deep processing of the image, we use a large set of simple tests of the image. We also include features similar to those found in existing literature on this problem. Preliminary results show that our system is able to accurately classify spam images.
Even with an accurate filter, a major practical consideration is the speed of the system. Spam filters are run on large mail servers and are expected to quickly process hundreds of thousands of messages. Even a lightweight image processor could slow down the server. To address this concern we develop a new measure of feature selection based not only on the predictive value of a feature, but the cost of computing it from an image. Our new system achieves similar performance to our full system but greatly reduces the processing time per image.
This talk will present very early results from this project. Feedback on future directions is welcome.
Friday, February 23, 1:00pm, Levine 512
Approaches to the DNF Learning Problem
Vitaly Feldman, Harvard University
Learning of Disjunctive Normal Form (DNF) formulas is one of the most fundamental problems in computational learning theory. Informally, in this problem the goal of a learning algorithm is to accurately predict an unknown two-level (OR of ANDs) Boolean formula given examples labeled by this formula. In the first part of this talk I will survey some of the fundamental results concerning efficient learnability of DNF in Valiant's PAC model as well as the progress that was made in recent years.
Learning of parities (XOR functions) with random noise from random and uniform examples is another famous open problem in learning theory and is equivalent to decoding of random linear codes - a major open problem in coding theory. In the second part of this talk I will describe a recent result that shows a surprising connection between learning of parity functions with random noise and learning of DNF from random and uniform examples. I will also show other applications of the key technical component of this result.
Wednesday, March 14, Levine 512
An infinite-state generalized hidden Markov phylogeny for multi-species
regulatory module detection
Jon McAuliffe, Statistics Department, University of Pennsylvania
The pattern of gene expression in a cell is partly determined by proteins called transcription factors (TFs), which bind short stretches of DNA in the vicinity of genes. These transcription factor binding sites are more difficult to detect with statistical methods than genes, because they are much smaller and exhibit fewer strong biological constraints. Recently, two features of binding sites have been exploited to improve detection: they tend to be conserved across related species, and they tend to appear close together, in what are called cis-regulatory modules.
I will describe a statistical model of aligned DNA sequences, from multiple species, containing shared modules. The model incorporates the tendency of certain TFs' binding sites to appear adjacently in a module. It accounts for evolutionary conservation of binding sites. It also requires no information about the number of TFs acting on the sequences, or what patterns their binding sites follow: these questions are answered by the inference procedure, as part of the model. If something is known in advance about TFs and their sites, as is usually the case, the model makes use of that information in a natural way, while still allowing for new, unknown TFs. The core of the inference involves sampling posterior path trajectories in an infinite-state generalized hidden Markov model; I will explain what this means and how it is done.
Wednesday, March 28, Levine 512
Discriminative Learning via Semidefinite Probabilistic Models
Koby Crammer, CIS
Discriminative linear models are a popular tool in machine learning. These can be generally divided into two types: linear classifiers, such as support vector machines (SVMs), which are well studied and provide state-of-the-art results, and probabilistic models such as logistic regression. One shortcoming of SVMs is that their output (known as the "margin") is not calibrated, so that it is difficult to incorporate such models as components of larger systems. This problem is solved in the probabilistic approach. We combine these two approaches above by constructing a model which is both linear in the model parameters and probabilistic, thus allowing maximum margin training with calibrated outputs. Our model assumes that classes correspond to linear subspaces (rather than to half spaces), a view which is closely related to concepts in quantum detection theory. The corresponding optimization problems are semidefinite programs which can be solved efficiently. We illustrate the performance of our algorithm on real world datasets, and show that it outperforms second-order kernel methods.
Joint work with Amir Globerson.
Wednesday, April 11, Levine 512
Transductive Structured Classification through Constrained Min-Cuts
Kuzman Ganchev, CIS
We introduce an extension to a min-cut based algorithm for transductive (semi-supervised) learning (published by Blum and Chawla (2001)). We extend their approach to multi-class problems as well as to structured classification but this makes the problem NP-hard. We approximately solve it by encoding it as an instance of metric labeling (Kleinberg and Tardos 1999). We explain a linear relaxation of an integer program that approximately solves metric labeling problems and explain how we modified it to get an efficient implementation of structured transductive classification. We present some results on real data that shows our method works sometimes.