Wei Xu

Postdoctoral Researcher
Computer and Information Science Department
University of Pennsylvania


Office: Levine Hall room 361
3330 Walnut Street
Philadelphia, PA 19104

I just started as a postdoc with Chris Callison-Burch at the University of Pennsylvania. My research centers on Natural Language Processing, with an emphasis on data-driven approaches to paraphrase, social media and information extraction. I create models and systems that help people digest large amounts of text efficiently, and read and write in new languages or different styles.

I recently graduated with a PhD in Computer Science from New York University. My thesis, Data-driven Approaches for Paraphrasing Across Language Variations, was advised by Ralph Grishman. I have been immensely fortunate to work and co-author with Adam Meyers (NYU); Bill Dolan (Microsoft Research); Le Zhao (Google); Joel Tetreault (Yahoo!) and Martin Chodorow (CUNY); Alan Ritter (OSU) and Raphael Hoffmann (AI2), during my two-year visit at the University of Washington; and Wenjie Li (Hong Kong Polytechnic University). I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.

In my spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing and windsurfing.

I also made a list of the best dressed NLP researchers.

What's New


Learning Paraphrases from Twitter
Twitter engages millions of users, who naturally talk about the same topics at the same time and frequently convey similar meanings using diverse linguistic expressions. Learning paraphrases from Twitter can help with adapting generic NLP tools to noisy user-generated text or evolving languages, learning native expressions, generating more natural dialogues for Human-Computer Interaction (e.g. Apple's Siri and Microsoft's Cortana), etc. I have demonstrated the feasibility and value of gathering and generating paraphrases from Twitter [BUCC13]. I have also developed an efficient crowdsourcing methodology [Thesis] and constructed the Twitter Paraphrase Corpus of more than 18,000 sentences.
Distant Supervision for Information Extraction
Information extraction automatically distills structured information from large amounts of free text, such as news articles. One cutting-edge technique is distant supervision, which uses multi-instance learning models and draws its training data from large-scale knowledge bases, e.g. Wikipedia's infoboxes, instead of human-labeled data, which is available only in limited amounts. My research identifies and addresses the missing data problem [ACL2013] and the training data error problem [ACL2014] in distant supervision.
Event Graph
Whether in news articles or in Twitter posts, sentences about events carry the most salient information. We have developed efficient approaches to build event graphs using shallow event extraction techniques and automatically generate summaries from dozens of news articles [LSAM2013] and thousands of [COLING-ACL2006] using graph-based ranking and partitioning algorithms.
Transforming between Writing Styles
This is one of the first studies that aims to model meaning-preserving transformations that systematically change the register or style of input texts [COLING2012]. With such models, we could learn to reliably map from one form of language to another: transforming formal prose into a more colloquial form, translating legalese or medical jargon into plain English, educating children about reading and writing Shakespearean English, etc. We use phrase-based statistical Machine Translation approaches and have also created a human computation algorithm to convert prose into sonnets [HCOMP2014].
Spelling Correction
In spelling correction, it is more difficult to identify an error when a word is misspelt as another valid word than when it is misspelt as a string that does not exist in the vocabulary. My work addresses the former case, incorporating syntactic and distributional information into a state-of-the-art web-scale n-gram model [EMNLP2011]. I have also used Twitter data to learn to correct spelling errors [BUCC13].


Session Chair:
     ACL (2014)

Program Committee:
     EMNLP (2014), COLING (2014), ACL (2014), LASM (2014), ACL (2013), AAAI (2012)
     Mid-Atlantic Student Colloquium on Speech, Language and Learning (2011)

External Reviewer:
     CoNLL (2014), WWW (2014), CIKM (2013), *SEM (2013), CoNLL (2013), EMNLP (2012)