Wei Xu

[phonetic pronunciation: way shoo]

Postdoctoral Fellow
Computer and Information Science Department
University of Pennsylvania
xwe@cis.upenn.edu
Levine Hall Room 361
3330 Walnut St, Philadelphia, PA 19104

I am a postdoc at the University of Pennsylvania, working with Chris Callison-Burch. My research lies at the intersection of machine learning, natural language processing, and social media. I am particularly interested in designing learning algorithms for gleaning semantic and structured knowledge from massive social media and web data. My work enables deeper analysis of text meaning and better natural language generation.

I graduated with a PhD in Computer Science from New York University in 2014. My advisor was Ralph Grishman. My thesis work was about Data-driven Approaches for Paraphrasing Across Language Variations. I received my bachelor's and master's degrees in Computer Science from Tsinghua University in Beijing, China.

I am currently on the job market and looking for a faculty position starting fall 2016.
What's New
  Feb 4/5 - Yale University
  Feb 11/12 - University of Alberta
  Feb 15 - Simon Fraser University
  Feb 24 - Indiana University, Bloomington
  March 3/4 - University of Waterloo
  March 17/18 - Vanderbilt University
Teaching
I designed and taught a new course: Social Media and Text Analytics

[Summary] Social media provides a massive amount of valuable information and shows us how language is actually used by lots of people. This course covers several important machine learning algorithms and the core natural language processing techniques for obtaining and processing Twitter data.

[Schedule]

Research Highlights

Joint Word-Sentence Models

I build probabilistic graphical models to extract semantic or structured knowledge from large volumes of data. I designed the first successful models to extract paraphrases from Twitter that can scale up to billions of sentences. These web-scale paraphrases enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”). Such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. I invented the multi-instance learning paraphrase (MultiP) model [TACL2014], which jointly infers latent word-sentence relations and relaxes the reliance on human annotation. It is the current state of the art, outperforming deep learning and latent space methods.

Statistical Machine Generation Framework

Many text-to-text generation problems can be thought of as sentential paraphrasing or monolingual machine translation. These problems face an exponential search space larger than that of bilingual translation, but a much smaller space of optimal solutions due to task-specific requirements. I advocate for a statistical text-to-text framework, building on top of statistical machine translation (SMT) technology. My recent work uncovered multiple serious problems in text simplification [TACL2015] research between 2010 and 2014, and set a new state of the art by designing novel objective functions for optimizing syntax-based SMT and overgenerating with large-scale paraphrases [to appear]. I am also very interested in paraphrases of different language styles (e.g. historic ↔ modern [COLING2012], erroneous ↔ well-edited [BUCC2013], feminine ↔ masculine [AAAI2016]).

Publications
Service
Area Chair:   EMNLP (2016)
Publicity Chair:   NAACL (2016)
Session Chair:   EMNLP (2015), NAACL (2015), AAAI (2015), ACL (2014)
Organizer:
     - ACL 2015 Workshop on Noisy User-generated Text (W-NUT)
     - SemEval 2015 shared-task: Paraphrases and Semantic Similarity in Twitter (PIT)
     - Mid-Atlantic Student Colloquium on Speech, Language and Learning 2016
Program Committee:
     ACL (2015, 2014, 2013), NAACL (2015), EMNLP (2015, 2014), COLING (2014)
     WWW (2016, 2015), AAAI (2016, 2015, 2012), KDD (2015)
     WWW Workshop on #Microposts (2016)
     ACL Workshop on Social Factors in Natural Language Processing (2016)
     EACL Workshop on Language Analysis in Social Media (2014)
Journal Reviewer:
     Transactions of the Association for Computational Linguistics (TACL)

Invited Talks
Collaborators
I am a big believer in collaboration and have been happy to work and co-author with:
    Colin Cherry (National Research Council Canada)
    Martin Chodorow (CUNY)
    Bill Dolan (Microsoft Research)
    Yangfeng Ji (Gatech)
    Raphael Hoffmann (U of Washington → AI2 Incubator)
    Wenjie Li (Hong Kong Polytechnic University)
    Adam Meyers (NYU)
    Courtney Napoles (JHU)
    Daniel Preoţiuc-Pietro (UPenn)
    Alan Ritter (U of Washington → Ohio State U)
    Joel Tetreault (ETS → Yahoo! Research)
    Lyle Ungar (UPenn)
    Luke Zettlemoyer (U of Washington)
    Le Zhao (CMU → Google)
    and many others ...

The members of my thesis committee are:
    Ernest Davis (NYU)
    Bill Dolan (MSR)
    Satoshi Sekine (NYU/Rakuten)
    Luke Zettlemoyer (U of Washington)

Places I interned and visited when I was a PhD student:
    2012-2013, University of Washington, Seattle, WA
    Summer 2011, Microsoft Research, Redmond, WA
    Summer 2010, Amazon.com, Seattle, WA
    Spring/Fall 2010, Educational Testing Service, Princeton, NJ

Advising
I am always happy to work with undergraduate and graduate students. If you are a student at Penn and want to do some research, email me!

My past advisees have all published a paper with me:
    Quanze Chen (undergraduate UPenn | currently applying for PhD)
    Bin Fu (undergraduate Tsinghua → PhD CMU → Google NYC)
    Mingkun Gao (master UPenn → PhD UIUC)
    Ray Lei (undergraduate UPenn)
    Maria Pershina (PhD NYU | I served on her thesis committee)

My current advisees:
    Siyu Qiu (master UPenn)

Miscellaneous

When I have spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing and windsurfing.

I also made a list of the best dressed NLP researchers (2015) and (2014).