I am a postdoc at the University of Pennsylvania, working with Chris Callison-Burch. My research centers on Natural Language Processing, with an emphasis on data-driven approaches to paraphrase, social media and information extraction. I create models and systems that help people digest large amounts of text efficiently, and read and write in new languages or different styles.
I recently graduated with a PhD in Computer Science from New York University, where my advisor was Ralph Grishman. My thesis was on Data-driven Approaches for Paraphrasing Across Language Variations, with Bill Dolan, Satoshi Sekine, Luke Zettlemoyer and Ernest Davis on my committee. I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.
Learning Paraphrases from Twitter
Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meanings using diverse linguistic expressions. Learning paraphrases from Twitter can help with adapting generic NLP tools to noisy user-generated text or evolving language, learning native expressions, generating more natural dialogue for Human-Computer Interaction (e.g. Apple's Siri and Microsoft's Cortana), etc. I have demonstrated the feasibility and value of gathering and generating paraphrases from Twitter [BUCC13]. I have also developed an efficient crowdsourcing methodology [Thesis] and constructed the Twitter Paraphrase Corpus of more than 18,000 sentences.
Distant Supervision for Information Extraction
Information extraction automatically distills structured information from large amounts of free text, such as news articles. One cutting-edge technique is distant supervision, which applies multi-instance learning models and uses large-scale knowledge bases (e.g. Wikipedia's infoboxes) as training sources instead of human-labeled data, which is limited in quantity. My research identifies and addresses the missing data problem [ACL2013] and the training data error problem [ACL2014] in distant supervision.
Whether in news articles or in Twitter posts, sentences about events carry the most salient information. We have developed efficient approaches that build event graphs using shallow event extraction techniques and automatically generate summaries from dozens of news articles [COLING-ACL2006] and thousands of tweets [LASM2013] using graph-based ranking/partition algorithms.
Transforming between Writing Styles
This is one of the first studies that aims to model meaning-preserving transformations that systematically change the register or style of input texts [COLING2012]. The goal is to learn to reliably map from one form of language to another: transforming formal prose into a more colloquial form, explaining legalese or medical jargon in plain English, teaching children to read and write Shakespearean English, etc. We make use of phrase-based statistical Machine Translation approaches and have also created a Human Computing algorithm to convert prose into sonnets [HCOMP2014].
For spelling correction, errors are harder to identify when a word is misspelled as another valid word than when the misspelling does not exist in the vocabulary. My work addresses the former case and incorporates syntactic and distributional information into a state-of-the-art web-scale n-gram model [EMNLP2011]. I have also used Twitter data to learn to correct spelling errors [BUCC13].
Selected Publications (Full List)
- Extracting Lexically Divergent Paraphrases from Twitter
Wei Xu, Alan Ritter, Chris Callison-Burch, William B. Dolan and Yangfeng Ji
In TACL 2014 (Journal)
- Infusion of Labeled Data into Distant Supervision for Relation Extraction [bib]
Maria Pershina, Bonan Min, Wei Xu, Ralph Grishman
Proceedings of ACL 2014
- Data-driven Approaches for Paraphrasing Across Language Variations [bib]
- Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction [data] [bib]
Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman
Proceedings of ACL 2013
- Paraphrasing for Style [data & code]
Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, Colin Cherry
Proceedings of COLING 2012
- Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
Wei Xu, Joel Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao
Proceedings of EMNLP 2011
- New York University 2011 System for KBP Slot Filling
Ang Sun, Ralph Grishman, Wei Xu, Bonan Min
Proceedings of TAC 2011
- Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton, Kathleen R. McKeown, Bob Coyne, Mona T. Diab, Ralph Grishman, Dilek Hakkani-Tür, Mary Harper, Heng Ji, Wei Yun Ma, Adam Meyers, Sara Stolbach, Ang Sun, Gokhan Tur, Wei Xu, Sibel Yaman
Proceedings of ACL-IJCNLP 2009
- Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li, Wei Xu, Mingli Wu, Chunfa Yuan, Qin Lu
Proceedings of COLING-ACL 2006
Code & Data
Session Chair :
Program Committee :
WWW (2015), AAAI (2015), EMNLP (2014), COLING (2014), ACL (2014), LASM (2014), ACL (2013), AAAI (2012)
External Reviewer :
CoNLL (2014), WWW (2014), CIKM (2013), *SEM (2013), CoNLL (2013), EMNLP (2012)
Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!)
Aug 2014, Microsoft Research, Redmond, WA
Data-driven Approaches for Paraphrasing across Language Variations
Jan 2014, University of Pennsylvania, Philadelphia, PA
Incremental Information Extraction
(presentations at IARPA's KDD Technical Exchange Meeting)
Apr 2012, SRI, Palo Alto, CA
May 2011, SRI, San Diego, CA
Passage Retrieval for Information Extraction using Distant Supervision
Nov 2011, Tsinghua University, Beijing, China
Information Extraction Research at New York University
Jan 2011, University of Washington, Seattle, WA
Nov 2009, Thomson Reuters, Eagan, Minnesota
I am a big believer in collaboration and have been immensely fortunate to work and co-author with:
Bill Dolan (Microsoft Research)
Le Zhao (CMU → Google)
Joel Tetreault (ETS → Yahoo!)
Martin Chodorow (CUNY)
Adam Meyers (NYU)
Alan Ritter (UW → OSU)
Raphael Hoffmann (UW → AI2)
Wenjie Li (Hong Kong Polytechnic University)
and many others ...
Places I interned at and visited when I was a PhD student:
2012-2013, University of Washington, Seattle, WA
Summer 2011, Microsoft Research, Redmond, WA
Summer 2010, Amazon.com, Seattle, WA
Spring/Fall 2010, ETS, Princeton, NJ
Teaching / Advising
I am always happy to work with undergraduate and graduate students. If you are a student at Penn and want to do some research, email me! My past and current advisees (every one of them has published a paper with me):
Ellie Pavlick (PhD student at UPenn)
Jim Chen (undergraduate at UPenn)
Ray Lei (undergraduate at UPenn)
Maria Pershina (PhD student at NYU)
Bin Fu (Tsinghua → CMU → Google)
When I have spare time, I enjoy arts, traveling, snowboarding, rock climbing, sailing and windsurfing.
I also made a list of the best dressed NLP researchers (2014).