Wei Xu     

Postdoctoral Fellow
Computer and Information Science Department
University of Pennsylvania
   xwe@cis.upenn.edu

   Levine Hall Room 361
3330 Walnut St, Philadelphia, PA 19104

I am a postdoc at the University of Pennsylvania, working with Chris Callison-Burch. My research centers on Natural Language Processing, with an emphasis on data-driven approaches to paraphrasing, social media, and information extraction. I create models and systems that help people digest large amounts of text efficiently and read and write in new languages or different styles (e.g. text simplification for children).


I graduated with a PhD in Computer Science from New York University, where I was advised by Ralph Grishman. My thesis, Data-Driven Approaches for Paraphrasing Across Language Variations, had Bill Dolan, Satoshi Sekine, Luke Zettlemoyer, and Ernest Davis on its committee. I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.

What's New
  Feb 24, 2015: I will give a talk at Johns Hopkins University as part of the Center for Language & Speech Processing's Spring 2015 Lecture Series.
Research

Twitter Paraphrase

Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meanings with diverse linguistic expressions. Learning paraphrases from Twitter can help adapt generic NLP tools to noisy user-generated text and evolving language, capture native expressions, generate more natural dialogues for Human-Computer Interaction (e.g. Apple's Siri and Microsoft's Cortana), and more. I have demonstrated the feasibility and value of gathering and generating paraphrases from Twitter [BUCC2013]. I have also developed an efficient crowdsourcing methodology [Thesis] and constructed the Twitter Paraphrase Corpus of more than 18,000 sentence pairs. I then designed a machine learning algorithm, the Multi-instance Learning Paraphrase (MultiP) model [TACL2014], to learn from this corpus and automatically extract paraphrases from Twitter.

Distant Supervision for Information Extraction

Information extraction automatically distills structured information from large amounts of free text, such as news articles. One cutting-edge technique is distant supervision, which trains extractors from large-scale knowledge bases (e.g. Wikipedia infoboxes) instead of limited human-labeled data, typically together with multi-instance learning models. My research identifies and addresses the missing data problem [ACL2013] and the training data error problem [ACL2014] in distant supervision.
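
For readers unfamiliar with the idea, here is a minimal sketch of how distant supervision heuristically builds training data; this is a generic illustration, not the procedure from the papers above, and the facts, relations, and sentences are invented:

    # Toy sketch of distant supervision: label any sentence that mentions
    # both arguments of a knowledge-base fact with that fact's relation.
    # All facts and sentences below are invented for illustration.
    kb_facts = [
        ("Barack Obama", "born_in", "Honolulu"),
        ("Steve Jobs", "founder_of", "Apple"),
    ]
    sentences = [
        "Barack Obama was born in Honolulu , Hawaii .",
        "Barack Obama visited Honolulu last week .",   # noisy match: not about birthplace
        "Steve Jobs and Steve Wozniak started Apple in a garage .",
    ]

    def distant_labels(facts, sents):
        """Return (sentence, relation) pairs whenever both arguments co-occur."""
        labeled = []
        for subj, rel, obj in facts:
            for sent in sents:
                if subj in sent and obj in sent:
                    labeled.append((sent, rel))
        return labeled

    for sent, rel in distant_labels(kb_facts, sentences):
        print(rel, "<-", sent)

The second Obama sentence shows why such heuristic labels are noisy: it mentions both arguments but says nothing about birthplace. Noise of this kind is exactly what multi-instance learning and the missing-data/error modeling above are meant to handle.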

Event Graph-based Summarization for News & Twitter

In both news articles and Twitter posts, sentences about events carry the most salient information. We have developed efficient approaches that build event graphs using shallow event extraction techniques and automatically generate summaries from dozens of news articles [COLING-ACL2006] and thousands of tweets [LSAM2013] with graph-based ranking/partition algorithms.
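
To give a flavor of graph-based ranking in general (a simple PageRank-style power iteration, not the specific algorithms in the papers above), one can score nodes of an event graph as sketched below; the graph and node names are hypothetical:

    # Minimal PageRank-style ranking over a toy event graph (hypothetical data).
    # Nodes stand for sentences/events; edges connect items that share content.
    graph = {
        "s1": ["s2", "s3"],
        "s2": ["s1"],
        "s3": ["s1", "s2"],
        "s4": ["s3"],
    }

    def rank(graph, damping=0.85, iters=50):
        scores = {n: 1.0 / len(graph) for n in graph}
        for _ in range(iters):
            new = {}
            for node in graph:
                incoming = sum(scores[m] / len(graph[m])
                               for m in graph if node in graph[m])
                new[node] = (1 - damping) / len(graph) + damping * incoming
            scores = new
        return scores

    # The highest-ranked nodes would be candidate summary sentences.
    print(sorted(rank(graph).items(), key=lambda kv: -kv[1]))
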

Paraphrasing between Writing Styles
(e.g. Text Simplification or Shakespearization)

This is one of the first studies aimed at modeling meaning-preserving transformations that systematically change the register or style of input text [COLING2012]. The goal is to learn to reliably map from one form of language to another: transforming formal prose into a more colloquial form, explaining legalese or medical jargon in plain English, helping children learn to read and write Shakespearean English, etc. We make use of phrase-based statistical Machine Translation approaches and also create a human computation algorithm to convert prose into sonnets [HCOMP2014].
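
As a toy illustration of the phrase-table intuition behind this line of work (a real phrase-based SMT system also scores with a language model and searches over segmentations with a decoder), invented style mappings could be applied like this:

    # Toy phrase-table substitution for modern-to-Shakespearean style transfer.
    # The entries are invented; this is only an intuition sketch, not the system.
    phrase_table = {
        "you": "thou",
        "your": "thy",
        "are": "art",
        "do not": "dost not",
    }

    def restyle(sentence, table):
        out = sentence.lower()
        # Greedily replace longer phrases first so "your" is handled before "you".
        for src in sorted(table, key=len, reverse=True):
            out = out.replace(src, table[src])
        return out

    print(restyle("You do not know your own strength", phrase_table))
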

Spelling Correction

In spelling correction, it is harder to identify errors when a word is misspelled into another valid word than into a word that does not exist in the vocabulary. My work addresses the former, incorporating syntactic and distributional information into a state-of-the-art web-scale n-gram model [EMNLP2011]. I have also used Twitter data to learn to correct spelling errors [BUCC2013].
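
To illustrate the general idea of context-sensitive (real-word) spelling correction with an n-gram language model (a generic sketch, not the web-scale model in the paper), one can compare members of a confusion set by how well they fit the surrounding context; the counts below are made up:

    # Toy real-word spelling correction: pick the confusion-set member whose
    # trigram fits the context best. Counts are invented for illustration;
    # a real system would query web-scale n-gram statistics.
    trigram_counts = {
        ("i", "would", "like"): 5000,
        ("i", "wood", "like"): 2,
    }
    confusion_set = {"wood": ["would", "wood"], "would": ["would", "wood"]}

    def best_candidate(prev_word, word, next_word):
        candidates = confusion_set.get(word, [word])
        def score(w):
            # Add-one smoothing over the raw trigram count.
            return trigram_counts.get((prev_word, w, next_word), 0) + 1
        return max(candidates, key=score)

    # "I wood like to go" -> "wood" should be corrected to "would".
    print(best_candidate("i", "wood", "like"))
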

Service
Organizer :
     ACL 2015 Workshop on Noisy User-generated Text (W-NUT)
     SemEval 2015 shared-task: Paraphrases and Semantic Similarity in Twitter (PIT)

Session Chair :
     AAAI (2015), ACL (2014)

Program Committee :
     ACL (2015), KDD (2015), NAACL (2015), WWW (2015), AAAI (2015), EMNLP (2014), COLING (2014), ACL (2014), LASM (2014), ACL (2013), AAAI (2012)

Reviewer :
     Transactions of the Association for Computational Linguistics (2015-2016)
     CoNLL (2014), WWW (2014), CIKM (2013), *SEM (2013), CoNLL(2013), EMNLP (2012)

Publications
Collaborators
I am a big believer in collaboration and have been happy to work and co-author with:
    Colin Cherry (National Research Council Canada)
    Martin Chodorow (CUNY)
    Bill Dolan (Microsoft Research)
    Yangfeng Ji (Gatech)
    Raphael Hoffmann (UW → AI2 Incubator)
    Wenjie Li (Hong Kong Polytechnic University)
    Adam Meyers (NYU)
    Alan Ritter (UW → OSU)
    Joel Tetreault (ETS → Yahoo!)
    Le Zhao (CMU → Google)
    and many others ...

Places I interned at and visited as a PhD student:
    2012-2013, University of Washington, Seattle, WA
    Summer 2011, Microsoft Research, Redmond, WA
    Summer 2010, Amazon.com, Seattle, WA
    Spring/Fall 2010, ETS, Princeton, NJ

Teaching / Advising
I am always happy to work with undergraduate and graduate students. If you are a student at Penn and want to do some research, email me! My past and current advisees (every one of them has published with me):
    Quanze Chen (undergraduate at UPenn)
    Bin Fu (undergraduate at Tsinghua → PhD at CMU → Google)
    Mingkun Gao (master's at UPenn)
    Ray Lei (undergraduate at UPenn)
    Ellie Pavlick (PhD at UPenn)
    Maria Pershina (PhD at NYU)

Invited Talks
Miscellaneous

In my spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing, and windsurfing.

I also made a list of the best-dressed NLP researchers (2014).