Kuzman Ganchev
Ph.D. Candidate
Computer and Information Science
University of Pennsylvania
kuzman@cis.upenn.edu

Advisor: Fernando Pereira

About Me

I was born in Sofia, Bulgaria, where I lived until February 1989. My family then moved to Zimbabwe and, in 1995, to New Zealand, where I went to high school. I came to the US in 1999 to study at Swarthmore College, and spent the 2001-2002 academic year studying abroad in Paris. After graduating with a Bachelor of Arts in Computer Science in 2003, I worked at StreamSage Inc. in Washington, DC until starting at the University of Pennsylvania in Fall 2004. During the summer of 2007 I was an intern at TrialPay in Mountain View, CA.

Research Interests

My research is in machine learning applied to natural language processing. Recently I have been working on problems where there is only partial supervision. The most common case is where some data has been labeled but much more unlabeled data is available. Other cases include semi-automated annotation, where the goal is to reduce the amount of annotator time by performing part of the annotation automatically.

Publications

2008

Better Alignments = Better Translations? Kuzman Ganchev, Joao Graca and Ben Taskar. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics [bib] [pdf] A set of large-scale experiments with word alignment using agreement constraints during EM training. We look at how alignments improve, and how this can improve BLEU score. The alignment scores always improve, and the translation system using alignments from our agreement HMM model always gets better BLEU scores than with alignments from the baseline HMM, and often better than with alignments from IBM Model 4. @InProceedings{ganchev:acl:2008, title = {Better Alignments = Better Translations?}, author = {Kuzman Ganchev and Joao Graca and Ben Taskar}, booktitle = {Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics}, publisher = {Association for Computational Linguistics}, year = {2008} }

Multi-View Learning over Structured and Non-Identical Outputs. Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar. Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI) [bib] [pdf] We use the constrained EM framework to derive a new coregularizer appropriate for both structured and unstructured log-linear models. In addition to having some nice properties (like smoothness), the approach can be used in a partial agreement scenario. @InProceedings{ganchev:uai:2008, title = {Multi-View Learning over Structured and Non-Identical Outputs}, author = {Kuzman Ganchev and Joao Graca and John Blitzer and Ben Taskar}, booktitle = {Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI)}, publisher = {AUAI Press}, year = {2008} }

Small Statistical Models by Random Feature Mixing. Kuzman Ganchev and Mark Dredze. Proceedings of the ACL-2008 Workshop on Mobile Language Processing [bib] [pdf] An investigation of a way to save the space taken up by the alphabet in most NLP models, by replacing it with a hash function. We try 4 linear classifier learning methods with 13 binary problems created from 4 NLP domains, and find that we can reduce the space used by the model by over 70% with little loss in performance. @InProceedings{ganchev:mobilenlp:2008, title = {Small Statistical Models by Random Feature Mixing}, author = {Kuzman Ganchev and Mark Dredze}, booktitle = {Proceedings of the ACL-2008 Workshop on Mobile Language Processing}, publisher = {Association for Computational Linguistics}, year = {2008} }

Expectation Maximization and Posterior Constraints. Joao Graca, Kuzman Ganchev and Ben Taskar. Advances in Neural Information Processing Systems 20 (NIPS). MIT Press, 2008. [bib] [pdf] [poster] A modification to the EM algorithm to include external information in the form of constraints on model posteriors. Experiments on word alignment and clustering of synthetic data. @incollection{graca:nips:2007, title = {Expectation Maximization and Posterior Constraints}, author = {Joao Graca and Kuzman Ganchev and Ben Taskar}, booktitle = {Advances in Neural Information Processing Systems 20}, editor = {J.C. Platt and D. Koller and Y. Singer and S. Roweis}, publisher = {MIT Press}, address = {Cambridge, MA}, year = {2008} }

2007

Frustratingly Hard Domain Adaptation for Dependency Parsing. Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Joao Graca and Fernando Pereira. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL. Association for Computational Linguistics, 2007. [bib] [pdf] The CoNLL 2007 shared task (domain adaptation for dependency parsing) is so hard that no team could substantially improve over the baseline. To a large extent this is because of divergences in the annotation guidelines. @InProceedings{dredze-EtAl:2007:EMNLP-CoNLL2007, author = {Dredze, Mark and Blitzer, John and Pratim Talukdar, Partha and Ganchev, Kuzman and Graca, Jo\~ao and Pereira, Fernando}, title = {Frustratingly Hard Domain Adaptation for Dependency Parsing}, booktitle = {Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007}, pages = {1051--1055}, url = {http://www.aclweb.org/anthology/D/D07/D07-1112} }

Transductive structured classification through constrained min-cuts. Kuzman Ganchev and Fernando Pereira. Proceedings of TextGraphs-2: Workshop on Graph Based Methods for Natural Language Processing. Association for Computational Linguistics, 2007. [bib] [pdf] Graph-based semi-supervised learning for structured learning. The main contribution is an approximate inference method for a multi-way min cut problem with constraints (based on a linear program for the metric labeling problem). @InProceedings{ganchev:textgraphs:2007, author = {Ganchev, Kuzman and Pereira, Fernando}, title = {Transductive Structured Classification through Constrained {Min-Cuts}}, booktitle = {Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing}, month = {}, year = {2007}, address = {Rochester, NY, USA}, publisher = {Association for Computational Linguistics}, pages = {37--44}, url = {http://www.aclweb.org/anthology/W/W07/W07-0206} }

Penn/UMass/CHOP Biocreative II systems. Kuzman Ganchev, Koby Crammer, Fernando Pereira, Gideon Mann, Kedar Bellare, Andrew McCallum, Steven Carroll, Yang Jin and Peter White. Proceedings of the Second BioCreative Challenge Evaluation Workshop. Madrid, Spain, 2007. [bib] [pdf] [slides] Our combined efforts for BioCreative II. Some re-usable tricks include: tuning the loss function during learning, doing greedy feature selection in blocks (to reduce both computational and classifier class complexity) and unsupervised clustering of tokens based on their context in unlabeled data. @InProceedings{ganchev:biocreative:2007, author = {Kuzman Ganchev and Koby Crammer and Fernando Pereira and Gideon Mann and Kedar Bellare and Andrew McCallum and Steven Carroll and Yang Jin and Peter White}, title = {Penn/UMass/CHOP Biocreative II systems}, booktitle = {Proceedings of the Second BioCreative Challenge Evaluation Workshop}, year = {2007}, address = {Madrid, Spain}, pages = {119--124}, url = {http://www.cnio.es/eventos/descargas/Meeting/260454_1346,97_booklet.pdf} }

Empirical Price Modeling for Sponsored Search. Kuzman Ganchev, Ryan Gabbard, Alex Kulesza, Qian Liu, Jinsong Tan and Michael Kearns. Third International Workshop on Internet and Network Economics (WINE). Springer, 2007. Also presented at Second Workshop on Sponsored Search, WWW 2007. [bib] [pdf-swss] [pdf-wine] [slides-swss] An empirical evaluation of the prices of different positions in sponsored search auctions across many different search terms. The results are based on data from about 37,000 queries to the Overture bid-view tool in 2006. We use the models to summarize the effects of modifier words on prices. @InProceedings{ganchev:wine:2007, author = {Kuzman Ganchev and Ryan Gabbard and Alex Kulesza and Qian Liu and Jinsong Tan and Michael Kearns}, title = {Empirical Price Modeling for Sponsored Search}, booktitle = {Proceedings of the 3rd International Workshop On Internet And Network Economics}, year = {2007}, address = {San Diego, CA, USA}, publisher = {Springer} }

Automatic Code Assignment to Medical Text. Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar and Steven Carroll. In Workshop on biological, translational, and clinical language processing (BioNLP). Association for Computational Linguistics, 2007. [bib] [pdf] A learning system for assigning ICD-9-CM clinical codes to radiology free text reports. The task is a (simplified) instance of multi-class, multi-label classification. @InProceedings{crammer:bionlp:2007, author = {Koby Crammer and Mark Dredze and Kuzman Ganchev and Partha Pratim Talukdar and Steven Carroll}, title = {Automatic Code Assignment to Medical Text}, booktitle = {Biological, translational, and clinical language processing}, month = {June}, year = {2007}, address = {Prague, Czech Republic}, publisher = {Association for Computational Linguistics}, pages = {129--136}, url = {http://www.aclweb.org/anthology/W/W07/W07-1017} }

Semi-Automated Named Entity Annotation. Kuzman Ganchev, Fernando Pereira, Mark Mandel, Steven Carroll and Peter White. In Proceedings of the Linguistic Annotation Workshop. Association for Computational Linguistics, 2007. [bib] [pdf] [slides] By tuning the loss function during learning we can achieve very high recall with a named entity tagger trained on little data. Using a human annotator to filter the resulting entity mentions can greatly reduce the cost of annotation at equal performance. We report timing experiments with an experienced annotator. @InProceedings{ganchev:law:2007, author = {Ganchev, Kuzman and Pereira, Fernando and Mandel, Mark and Carroll, Steven and White, Peter}, title = {Semi-Automated Named Entity Annotation}, booktitle = {Proceedings of the Linguistic Annotation Workshop}, month = {June}, year = {2007}, address = {Prague, Czech Republic}, publisher = {Association for Computational Linguistics}, pages = {53--56}, url = {http://www.aclweb.org/anthology/W/W07/W07-1509} }

2003

Nswap: a network swapping module for Linux clusters. Tia Newhall, Sean Finney, Kuzman Ganchev and Michael Spiegel. Proceedings of the 13th International Conference on Parallel and Distributed Computing (Euro-Par'03). Springer, 2003. [bib] [pdf] Swapping to the unused RAM of other machines can be faster than swapping to local disk, even when the local disk is much faster than the network. Implemented as a module for the Linux kernel. @InProceedings{newhall:europar:2003, author = {Tia Newhall and Sean Finney and Kuzman Ganchev and Michael Spiegel}, title = {{Nswap}: A Network Swapping Module for {Linux} Clusters}, booktitle = {Proceedings of the 13th International Conference on Parallel and Distributed Computing (Euro-Par'03)}, year = {2003}, month = {August}, publisher = {Springer}, }