I just started as a postdoc with Chris Callison-Burch at the University of Pennsylvania. My research centers on Natural Language Processing, with an emphasis on data-driven approaches to paraphrasing, social media and information extraction. I create models and systems that help people digest large amounts of text efficiently, and read and write in new languages or different styles.
I recently graduated with a PhD in Computer Science from New York University. My thesis was on Data-driven Approaches for Paraphrasing Across Language Variations, and my advisor was Ralph Grishman. I have been immensely fortunate to work and co-author with Adam Meyers (NYU), Bill Dolan (Microsoft Research), Le Zhao (Google), Joel Tetreault (Yahoo!), Martin Chodorow (CUNY), Wenjie Li (Hong Kong Polytechnic University), and, during my 2-year visit at the University of Washington, Alan Ritter (OSU) and Raphael Hoffmann (AI2). I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.
When I have spare time, I enjoy arts, traveling, snowboarding, rock climbing, sailing and windsurfing.
I also made a list of the best dressed NLP researchers.
- September 2014, started our SemEval-2015 shared-task evaluation challenge: Paraphrases and Semantic Similarity in Twitter (task 1). Call for participation!
- August 2014, invited to give a talk on Modeling Paraphrases at Microsoft Research, Redmond.
- January 2014, moved to Philadelphia and started my postdoctoral career. I am still visiting NYC often.
- December 2013, defended my PhD dissertation, entitled Data-driven Approaches for Paraphrasing Across Language Variations, with Bill Dolan, Satoshi Sekine, Luke Zettlemoyer and Ernest Davis as my committee.
- November 2013, released the test data for Twitter summarization (NAACL-2013).
- August 2013, released the data for distant supervision of relation extraction (ACL-2013).
- August 2013, released the data for Twitter paraphrasing and normalization (ACL-2013).
- July 2013, released the data and code for paraphrasing Shakespeare (COLING-2012).
Learning Paraphrases from Twitter
Twitter engages millions of users, who naturally talk about the same topics simultaneously and frequently convey similar meanings using diverse linguistic expressions. Learning paraphrases from Twitter can help with adapting generic NLP tools to noisy user-generated text or evolving languages, learning native expressions, generating more natural dialogues for Human-Computer Interaction (e.g. Apple's Siri and Microsoft's Cortana), etc. I have demonstrated the feasibility and value of gathering and generating paraphrases from Twitter [BUCC13]. I have also developed an efficient crowdsourcing methodology [Thesis] and constructed the Twitter Paraphrase Corpus of more than 18,000 sentences.
Distant Supervision for Information Extraction
Information extraction aims to automatically distill structured information from large amounts of free text, such as news articles. One cutting-edge technique is distant supervision, which utilizes multi-instance learning models and uses large-scale knowledge bases, e.g. Wikipedia's infoboxes, as training sources instead of human-labeled data, which is limited in quantity. My research identifies and addresses the missing data problem [ACL2013] and the training data error problem [ACL2014] in distant supervision.
Whether in news articles or in Twitter posts, sentences about events carry the most salient information. We have developed efficient approaches that build event graphs using shallow event extraction techniques and automatically generate summaries from dozens of news articles [COLING-ACL2006] and thousands of tweets [LASM2013] using graph-based ranking/partition algorithms.
Transforming between Writing Styles
This is one of the first studies that aims at modeling meaning-preserving transformations that systematically change the register or style of input texts [COLING2012]. With such models, we could learn to reliably map from one form of language to another: transforming formal prose into a more colloquial form, explaining legalese or medical jargon in plain English, educating children about reading and writing Shakespearean English, etc. We make use of phrase-based statistical Machine Translation approaches and have also created a Human Computing algorithm to convert prose into sonnets [HCOMP2014].
For spelling correction, it is more difficult to identify errors when a word is misspelt as another valid word than when it is misspelt as a word that doesn't exist in the vocabulary. My work addresses the former case and incorporates syntactic and distributional information into a state-of-the-art web-scale n-gram model [EMNLP2011]. I have also utilized Twitter data to learn to correct spelling errors [BUCC13].
Session Chair :
Program Committee :
EMNLP (2014), COLING (2014), ACL (2014), LASM (2014), ACL (2013), AAAI (2012)
Mid-Atlantic Student Colloquium on Speech, Language and Learning (2011)
External Reviewer :
CoNLL (2014), WWW (2014), CIKM (2013), *SEM (2013), CoNLL (2013), EMNLP (2012)
- Infusion of Labeled Data into Distant Supervision for Relation Extraction [bib]
Maria Pershina, Bonan Min, Wei Xu, Ralph Grishman
Proceedings of ACL 2014
- Data-driven Approaches for Paraphrasing Across Language Variations [bib]
- Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction [data] [bib]
Wei Xu, Raphael Hoffmann, Le Zhao, Ralph Grishman
Proceedings of ACL 2013
- Gathering and Generating Paraphrases from Twitter with Application to Normalization [data] [bib]
Wei Xu, Alan Ritter, Ralph Grishman
Proceedings of ACL 2013 Workshop on Building and Using Comparable Corpora (BUCC)
- A Preliminary Study of Tweet Summarization using Information Extraction [data] [bib]
Wei Xu, Ralph Grishman, Adam Meyers, Alan Ritter
Proceedings of NAACL 2013 Workshop on Language Analysis in Social Media (LASM)
- Paraphrasing for Style [data & code]
Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, Colin Cherry
Proceedings of COLING 2012
- Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
Wei Xu, Joel Tetreault, Martin Chodorow, Ralph Grishman, Le Zhao
Proceedings of EMNLP 2011
- Passage Retrieval for Information Extraction using Distant Supervision
Wei Xu, Ralph Grishman, Le Zhao
Proceedings of IJCNLP 2011
- New York University 2011 System for KBP Slot Filling
Ang Sun, Ralph Grishman, Wei Xu, Bonan Min
Proceedings of TAC 2011
- Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton, Kathleen R. McKeown, Bob Coyne, Mona T. Diab, Ralph Grishman, Dilek Hakkani-Tür, Mary Harper, Heng Ji, Wei Yun Ma, Adam Meyers, Sara Stolbach, Ang Sun, Gokhan Tur, Wei Xu, Sibel Yaman
Proceedings of ACL-IJCNLP 2009
- A Parse-and-Trim Approach with Information Significance for Chinese Sentence Compression
Wei Xu, Ralph Grishman
Proceedings of ACL-IJCNLP Workshop on Language Generation and Summarisation 2009
- Transducing Logical Relations from Automatic and Manual Annotation
Adam Meyers, Michiko Kosaka, Heng Ji, Nianwen Xue, Mary Harper, Ang Sun, Wei Xu, Shasha Liao
Proceedings of ACL-IJCNLP Workshop on Linguistic Annotation 2009
- Automatic Recognition of Logical Relations for English, Chinese and Japanese in the GLARF Framework
Adam Meyers, Michiko Kosaka, Nianwen Xue, Heng Ji, Ang Sun, Shasha Liao, Wei Xu
Proceedings of NAACL-HLT Workshop on Semantic Evaluations 2009
- Using Non-Local Features to Improve Named Entity Recognition Recall
Xinnian Mao, Wei Xu, Yuan Dong, Haila Wang
Proceedings of PACLIC 2007
- Domain Extension of Chinese Named Entity Recognition
Wei Xu, Bin Fu, Liu Liu, Chunfa Yuan, Wenjie Li
Frontiers of Content Computing 2007
- Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li, Wei Xu, Mingli Wu, Chunfa Yuan, Qin Lu
Proceedings of COLING-ACL 2006
- Deriving Event Relevance from the Ontology Constructed with Formal Concept Analysis
Wei Xu, Wenjie Li, Mingli Wu, Wei Li, Chunfa Yuan
Proceedings of CICLing 2006
- Building Document Graphs for Multiple News Articles Summarization: An Event-Based Approach
Wei Xu, Wenjie Li, Mingli Wu, Wei Li, Chunfa Yuan, Kam-Fai Wong
Proceedings of ICCPOL 2006
- The Hong Kong Polytechnic University at ACE2005
Wenjie Li, Wei Li, Mingli Wu, Wei Xu
Proceedings of ACE Evaluation Workshop 2005