I am a postdoc at the University of Pennsylvania, working with Chris Callison-Burch. My research lies at the intersection of machine learning, natural language processing, and social media. I am particularly interested in designing learning algorithms that glean semantic and structured knowledge from massive social media and web data. My work enables deeper analysis of text meaning and better natural language generation. I am an area chair for EMNLP 2016 and the publicity chair for NAACL 2016.
May 11 - University of Edinburgh
April 14/15 - Ohio State University
April 6/7 - University of North Carolina at Chapel Hill
March 22 - Arizona State University
March 17/18 - Vanderbilt University
March 11 - Imperial College London
March 3/4 - University of Waterloo
Feb 24 - Indiana University, Bloomington
Feb 18/19 - Washington University in St. Louis
Feb 15 - Simon Fraser University
Feb 11/12 - University of Alberta
Feb 4/5 - Yale University
I am serving as an area co-chair for EMNLP 2016, in the Summarization, Generation, Discourse, and Dialogue area.
Mar 2016, my 3rd TACL paper was officially accepted! It presents a state-of-the-art natural language generation approach based on optimizing syntax-based statistical machine translation.
[Summary] Social media provides a massive amount of valuable information and shows how language is actually used by a broad population. This course covers several important machine learning algorithms and the core natural language processing techniques for obtaining and processing Twitter data.
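One of the first hurdles in processing Twitter data is that tokenizers built for edited text mangle @mentions, #hashtags, and URLs. A minimal illustrative tokenizer (a sketch, not the course's actual code; the regex and function name are my own) looks like this:

```python
import re

# Keep URLs, @mentions, and #hashtags as single tokens; standard tokenizers
# built for edited text often break them apart.
TOKEN_RE = re.compile(
    r"(?:https?://\S+)"      # URLs
    r"|(?:[@#]\w+)"          # @mentions and #hashtags
    r"|(?:\w+(?:'\w+)?)"     # words, with an optional apostrophe (e.g. nom'd)
    r"|(?:[^\w\s])"          # any other single non-space symbol
)

def tokenize_tweet(text):
    """Return a list of tokens from a raw tweet string."""
    return TOKEN_RE.findall(text)

print(tokenize_tweet("oscar nom'd doc @user #Oscars http://t.co/x"))
# → ['oscar', "nom'd", 'doc', '@user', '#Oscars', 'http://t.co/x']
```

The alternation order matters: URLs are matched before word characters so that "http" is not split off as a bare word.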
I build probabilistic graphical models to extract semantic and structured knowledge from large volumes of data. I designed the first successful models for extracting paraphrases from Twitter that scale up to billions of sentences. These web-scale paraphrases enable natural language systems to handle errors (e.g. “everytime” ↔ “every time”), lexical variations (e.g. “oscar nom’d doc” ↔ “Oscar-nominated documentary”), rare words (e.g. “NetsBulls series” ↔ “Nets and Bulls games”), and language shifts (e.g. “is bananas” ↔ “is great”) [BUCC2013][SemEval2015]. Such lexically divergent paraphrases are difficult to capture with conventional similarity-based approaches. I invented the multi-instance learning paraphrase (MultiP) model [TACL2014], which jointly infers latent word-sentence relations and relaxes the reliance on human annotation. It is a conditional random field model with latent variables [ACL2014][ACL2013], and the current state of the art, outperforming deep learning and latent space methods.
Statistical Natural Language Generation (NLG) Framework
Many text-to-text generation problems can be viewed as sentential paraphrasing or monolingual machine translation. Monolingual translation faces an even larger exponential search space than bilingual translation, yet a much smaller space of acceptable outputs because of task-specific requirements. I advocate a statistical text-to-text framework built on top of statistical machine translation (SMT) technology. My recent work uncovered multiple serious problems in text simplification research between 2010 and 2014 [TACL2015], and set a new state of the art by designing novel objective functions for optimizing syntax-based SMT and by overgenerating with large-scale paraphrases [TACL2016]. I am also very interested in paraphrases across different language styles (e.g. historic ↔ modern [COLING2012], erroneous ↔ well-edited [BUCC2013], feminine ↔ masculine [AAAI2016]).
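The overgenerate-and-rank idea can be sketched as follows. This is a toy illustration with an invented two-entry paraphrase table and a crude length-based objective, not the actual TACL2016 system, which tunes syntax-based SMT toward task-specific metrics over millions of automatically acquired paraphrases:

```python
from itertools import product

# Hypothetical paraphrase table: each word maps to alternatives
# (including itself, so keeping the original is always an option).
PARAPHRASES = {
    "utilize": ["use", "utilize"],
    "commence": ["start", "begin", "commence"],
}

def overgenerate(sentence):
    """Yield every candidate produced by substituting table paraphrases."""
    options = [PARAPHRASES.get(w, [w]) for w in sentence.split()]
    for words in product(*options):
        yield " ".join(words)

def simplicity_score(candidate):
    """Crude stand-in objective: shorter output is simpler."""
    return -len(candidate)

def simplify(sentence):
    """Rank the overgenerated candidates and return the best one."""
    return max(overgenerate(sentence), key=simplicity_score)

print(simplify("we utilize it to commence work"))  # → "we use it to start work"
```

The point of the sketch is the division of labor: overgeneration widens the search space with paraphrases, and a task-specific objective narrows it back down to the outputs the task actually wants.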
Multiple-instance Learning from Unlimited Text
Oct 2015, University of Maryland, College Park, MD (CLIP Colloquium)
Oct 2015, Ohio State University, Columbus, OH (Clippers Seminar)
Large-scale Paraphrase Acquisition from Twitter
May 2015, DARPA DEFT PI Meeting, Boulder, CO
Learning and Generating Paraphrases from Twitter and Beyond [poster]
Apr 2015, Carnegie Mellon University, Pittsburgh, PA
Apr 2015, Columbia University, New York, NY (NLP Talk)
Feb 2015, Johns Hopkins University, Baltimore, MD (CLIP Colloquium)
Paraphrases in Twitter [slides]
Feb 2015, Twitter.com, San Francisco, CA
Modeling Lexically Divergent Paraphrases in Twitter (and Shakespeare!) [poster]
Mar 2015, The City University of New York, New York, NY (NLP Seminar)
Feb 2015, IBM Research - Almaden, San Jose, CA
Feb 2015, UC Berkeley, Berkeley, CA
Feb 2015, UT Austin, Austin, TX (Forum for Artificial Intelligence)
Dec 2014, Yahoo! Research, New York, NY
Nov 2014, Carnegie Mellon University, Pittsburgh, PA (CL+NLP Lunch Seminar)
Aug 2014, Microsoft Research, Redmond, WA (Visiting Speaker Series)
Incremental Information Extraction
Apr 2012, Stanford Research Institute, Palo Alto, CA
May 2011, IARPA's KDD PI Meeting, San Diego, CA
Information Extraction Research
Jan 2011, University of Washington
Nov 2009, Thomson Reuters, Eagan
Mar 2007, France Telecom, Beijing
Places I interned at and visited as a PhD student:
2012-2013, University of Washington
Summer 2011, Microsoft Research, Redmond
Summer 2010, Amazon.com, Seattle, WA
Spring/Fall 2010, Educational Testing Service, Princeton, NJ
I am always happy to work with undergraduate and graduate students. If you are a student at Penn and want to do some research, email me.
All of my past advisees have published a paper with me:
Quanze Chen (undergraduate UPenn → PhD University of Washington)
Bin Fu (undergraduate Tsinghua → PhD CMU → Google NYC)
Mingkun Gao (master UPenn → PhD UIUC)
Ray Lei (undergraduate UPenn)
Maria Pershina (PhD NYU; I served on her thesis committee)
Siyu Qiu (master UPenn → Hulu)
When I have spare time, I enjoy the arts, traveling, snowboarding, rock climbing, sailing, and windsurfing.