Wei Xu          

Postdoctoral Researcher
Computer and Information Science Department
University of Pennsylvania


Office: Levine Hall Room 361
           3330 Walnut Street
           Philadelphia, PA 19104

I am a postdoc at the University of Pennsylvania, working with Chris Callison-Burch. My research centers on Natural Language Processing, with an emphasis on data-driven approaches to paraphrase, social media and information extraction. I create models and systems that help people digest large amounts of text efficiently, and read and write in new languages or different styles.

I graduated with a PhD in Computer Science from New York University, where my advisor was Ralph Grishman. My thesis was on Data-driven Approaches for Paraphrasing Across Language Variations, with Bill Dolan, Satoshi Sekine, Luke Zettlemoyer and Ernest Davis serving on my committee. I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.

What's New


Learning Paraphrases from Twitter
Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meaning using diverse linguistic expressions. Learning paraphrases from Twitter can help with adapting generic NLP tools to noisy user-generated text or evolving languages, learning native expressions, generating more natural dialogues for Human-Computer Interaction (e.g. Apple's Siri and Microsoft's Cortana), etc. I have demonstrated the feasibility and value of gathering and generating paraphrases from Twitter [BUCC13]. I have also developed an efficient crowdsourcing methodology [Thesis] and constructed the Twitter Paraphrase Corpus of more than 18,000 sentence pairs. I then designed a machine learning algorithm, the Multi-instance Learning Paraphrase (MultiP) model [TACL14], to learn from this corpus and automatically extract paraphrases from Twitter.
Distant Supervision for Information Extraction
Information extraction automatically distills structured information from large amounts of free text, such as news articles. One cutting-edge technique is distant supervision, which uses multi-instance learning models and large-scale knowledge bases as training sources, e.g. Wikipedia's infoboxes, instead of human-labeled data, which is limited in quantity. My research identifies and addresses the missing data problem [ACL2013] and the training data error problem [ACL2014] in distant supervision.
Event Graph
Whether in news articles or Twitter posts, sentences about events carry the most salient information. We have developed efficient approaches to build event graphs using shallow event extraction techniques and to automatically generate summaries from dozens of news articles [COLING-ACL2006] and thousands of tweets [LASM2013] using graph-based ranking and partitioning algorithms.
Transforming between Writing Styles
This is one of the first studies to model meaning-preserving transformations that systematically change the register or style of input text [COLING2012]. The goal is to learn to reliably map from one form of language to another: transforming formal prose into a more colloquial form, explaining legalese or medical jargon in plain English, teaching children to read and write Shakespearean English, etc. We make use of phrase-based statistical Machine Translation approaches and have also created a human computation algorithm to convert prose into sonnets [HCOMP2014].
Spelling Correction
In spelling correction, it is more difficult to identify errors when a word is misspelled as another valid word than when it is misspelled as a word that does not exist in the vocabulary. My work addresses the former case, incorporating syntactic and distributional information into a state-of-the-art web-scale n-gram model [EMNLP2011]. I have also used Twitter data to learn to correct spelling errors [BUCC13].

Selected Publications (Full List)

Code & Data


Organizer :
     ACL 2015 Workshop on Noisy User-generated Text (W-NUT)
     SemEval 2015 shared task: Paraphrases and Semantic Similarity in Twitter (PIT)

Session Chair :
     ACL (2014)

Program Committee :
     ACL (2015), KDD (2015), NAACL (2015), WWW (2015), AAAI (2015), EMNLP (2014), COLING (2014), ACL (2014), LASM (2014), ACL (2013), AAAI (2012)

External Reviewer :
     CoNLL (2014), WWW (2014), CIKM (2013), *SEM (2013), CoNLL (2013), EMNLP (2012)

Invited Talks

Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!)
     Dec 2014, Yahoo!, New York, NY
     Nov 2014, Carnegie Mellon University, Pittsburgh, PA
     Aug 2014, Microsoft Research, Redmond, WA

Data-driven Approaches for Paraphrasing across Language Variations
     Jan 2014, University of Pennsylvania, Philadelphia, PA

Incremental Information Extraction
(presentations at IARPA's KDD Technical Exchange Meeting)
     Apr 2012, SRI, Palo Alto, CA
     May 2011, SRI, San Diego, CA

Passage Retrieval for Information Extraction using Distant Supervision
     Nov 2011, Tsinghua University, Beijing, China

Information Extraction Research at New York University
     Jan 2011, University of Washington, Seattle, WA

Event-based Summarization
     Nov 2009, Thomson Reuters, Eagan, MN


I am a big believer in collaboration and have been immensely fortunate to work and co-author with:
    Bill Dolan (Microsoft Research)
    Le Zhao (CMU → Google)
    Joel Tetreault (ETS → Yahoo!)
    Martin Chodorow (CUNY)
    Adam Meyers (NYU)
    Alan Ritter (UW → OSU)
    Raphael Hoffmann (UW → AI2)
    Wenjie Li (Hong Kong Polytechnic University)
    and many others ...

Places I interned at and visited when I was a PhD student:
    2012-2013, University of Washington, Seattle, WA
    Summer 2011, Microsoft Research, Redmond, WA
    Summer 2010, Amazon.com, Seattle, WA
    Spring/Fall 2010, ETS, Princeton, NJ

Teaching / Advising

I am always happy to work with undergraduate and graduate students. If you are a student at Penn and want to do some research, email me! My past and current advisees (every one of them has published with me):
    Ellie Pavlick (PhD student at UPenn)
    Jim Chen (undergraduate at UPenn)
    Ray Lei (undergraduate at UPenn)
    Maria Pershina (PhD student at NYU)
    Bin Fu (Tsinghua → CMU → Google)


When I have spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing and windsurfing.

I also made a list of the best dressed NLP researchers (2014).