Penn Discourse Treebank Workshop 2012

Presentations

Group 0: The originators of the PDTB project, who developed the guidelines and annotated the PDTB corpus, some of whom are continuing this work based on these plans, including some proposed extensions.

 

• Geraud Campion, Octagon Research Solutions, Inc., PA, gcampion@octagonresearch.com
• Nikhil Dinesh, SRI, Palo Alto, CA dinesh@ai.sri.com
• Aravind K. Joshi, University of Pennsylvania joshi@seas.upenn.edu
• Eleni Miltsakaki, University of Pennsylvania elenimi@seas.upenn.edu
• Rashmi Prasad, University of Wisconsin-Milwaukee prasadr@uwm.edu
• Bonnie Webber, University of Edinburgh, Edinburgh UK bonnie@inf.ed.ac.uk


Group 1: Those who have experimented with the PDTB corpus in various ways, including some applications.


Discourse connective argument identification with connective specific rankers
Jason Baldridge
University of Texas, Austin, TX

I'll talk about the research in Elwell and Baldridge (2008) on automatically identifying the arguments of discourse connectives. Previous work used a single, general classifier for different connectives; however, connectives differ in their distribution and behavior, so conflating them this way loses discriminative power. That work showed that using models for specific connectives and types of connectives and interpolating them with a general model improves performance. It also used additional features that provide greater sensitivity to morphological, syntactic, and discourse patterns, and less sensitivity to parse quality.
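As a rough illustration of the interpolation idea, the sketch below combines a connective-specific score with a general score for each candidate argument and ranks the candidates. The abstract does not state the combination scheme, so linear interpolation, the function names, and the default weight are assumptions made here for illustration only, not the method of Elwell and Baldridge (2008).

```python
def interpolate(p_specific, p_general, weight=0.7):
    """Linearly mix a connective-specific score with a general score.

    `weight` is the share given to the specific model (assumed value).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_specific + (1.0 - weight) * p_general


def rank_candidates(specific_scores, general_scores, weight=0.7):
    """Return candidate ids sorted by interpolated score, best first.

    Both arguments map a candidate id to a score from one model.
    """
    combined = {
        cand: interpolate(specific_scores[cand], general_scores[cand], weight)
        for cand in specific_scores
    }
    return sorted(combined, key=combined.get, reverse=True)
```

With `weight=1.0` only the connective-specific model is used and with `weight=0.0` only the general one, so the weight can be tuned on held-out data.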


The PDTB in the study of coherence
Ani Nenkova, Annie Louis, and Emily Pitler
University of Pennsylvania, Philadelphia, PA

We have used the PDTB for several studies related to text coherence.

We performed a correlation analysis between a wide range of factors and ratings for how well an article is written. Our findings attest that the distribution of discourse relations is the strongest predictor.  Both implicit and explicit relations are necessary for the correlation to be significant for the 30 WSJ texts we studied.

We have also combined the PDTB with other resources that include information on co-reference.  Our results indicate that additional theories of coherence, beyond discourse relations and co-reference, may be needed to explain local coherence.

Finally, we have used a subset of PDTB implicit relations to develop a highly accurate classifier for sentence specificity. The classifier has been shown useful in several applications in summarization and text quality.

 

Automatically Evaluating Text Coherence Using Discourse Relations
Hwee Tou Ng, Min-Yen Kan
NUS, Singapore

We present a novel model to represent and assess the discourse coherence of text. Our model assumes that a coherent text implicitly favors certain types of discourse relation transitions based on the Penn Discourse Treebank. We implement this model and apply it towards the text ordering ranking task, which aims to discern an original text from a permuted ordering of its sentences. The experimental results demonstrate that our model is able to significantly outperform the coherence model of Barzilay and Lapata (2005). We further show that our model is synergistic with the model of Barzilay and Lapata.
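To make the notion of relation transitions concrete, here is a toy sketch that counts adjacent pairs of discourse relation labels in training sequences and scores a new sequence with add-one smoothing. It is a plain bigram model written for illustration and is not the model presented in the talk; the function names and the label strings in the usage note are assumptions.

```python
import math
from collections import Counter


def train_transitions(sequences):
    """Count adjacent relation-label pairs across training sequences."""
    pair_counts = Counter()
    first_counts = Counter()
    labels = set()
    for seq in sequences:
        labels.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return pair_counts, first_counts, labels


def log_score(seq, model):
    """Sum add-one-smoothed log transition probabilities over a sequence."""
    pair_counts, first_counts, labels = model
    vocab = len(labels)
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        prob = (pair_counts[(prev, nxt)] + 1) / (first_counts[prev] + vocab)
        total += math.log(prob)
    return total
```

For example, after training on sequences such as `["Cause", "Contrast", "Cause"]`, a sequence with frequently seen transitions receives a higher `log_score` than one with unseen transitions, which is the kind of signal a text ordering ranker could use.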

 

Attribution and discourse representation
Christopher Potts
Stanford University, CA

The PDTB Attribution annotations reveal subtle and important facts about the relationships between syntax, semantics, discourse coherence, and agent-relative discourse commitment.  I'll briefly review some of the lessons we've learned from these annotations, and I'll describe a relatively straightforward annotation strategy that could make them even more powerful.

 

Annotation Projection for Discourse Connectives
Yannick Versley
University of Tuebingen, Germany

We present work on tagging German discourse connectives using English training data and a German-English parallel corpus, and report first results towards a more comprehensive approach to annotation projection for explicit discourse relations.

Our results show that (i) an approach based on a dictionary of connectives currently has advantages over a simpler approach that uses word alignments without further linguistic information, but also that (ii) bootstrapping a connective dictionary using distribution-based heuristics on aligned bitexts seems to be a feasible and low-effort way of creating such a resource.

Our best method achieves an F-measure of 68.7% for the identification of discourse connectives without any German-language training data, which is a large improvement over a nontrivial baseline.

 

Attribution and the PDTB
Silvia Pareti
University of Edinburgh

The lack of a large resource annotating a wide range of attribution relations has left attribution extraction studies without a solid basis of data to correctly analyse attribution and develop broad-coverage attribution extraction systems. Based on the annotation in the PDTB, I will present the development of such a resource. After further refining the annotation schema in the PDTB and evaluating it through an inter-annotator agreement study, the schema has been applied to annotate a small corpus of Italian and to create a large attribution corpus for English. The latter was created by collecting and further annotating attribution in the PDTB. I will conclude by presenting the benefits of such a resource and its future applications.


Group 2: Those who have done or are attempting to carry out PDTB-like annotation on corpora in languages other than English, either by suitably adapting the PDTB annotation guidelines, or by incorporating some ideas from the PDTB guidelines into their own.


Building the Turkish Discourse Tree Bank
Deniz Zeyrek, Cem Bozsahin
METU, Ankara, TR


Annotating Discourse in Prague Dependency Treebank
Eva Hajicova, Pavlina Jinova, Lucie Polakova, Katerina Rysova, Magdalena Rysova
Charles University, Prague, CZ

The Prague Dependency Treebank 2.0 consists of 50 000 sentences of Czech journalistic texts; it contains several interlinked layers of annotation. The talk will offer details about the recently completed manual annotation of discourse phenomena in the PDTB style and of co-reference and bridging relations. Unlike in the PDTB, the Prague annotation of phenomena beyond the sentence boundary is carried out directly on the syntactic tree structures.


Hindi Discourse Relation Bank
Sudheer Kolachina, Dipti Misra Sharma
IIIT, Hyderabad, India

The Hindi discourse relation bank (HDRB) project is an ongoing effort to create a discourse relations annotated corpus for Hindi. The annotation is being carried out using the Penn Discourse Treebank (PDTB) scheme.

Initial studies on annotation of discourse relations in Hindi texts using the lexically grounded approach of the PDTB led to some modifications and extensions to the original scheme. For example, the use of a semantic argument naming convention, as opposed to the syntactic linear order-based convention of the PDTB, was proposed in HDRB. This, in turn, led to the elimination of argument-specific labels in the sense hierarchy.

Apart from such modifications to the annotation scheme, the annotation task design in HDRB is also significantly different from the one followed to create the PDTB. We briefly discuss both these aspects of discourse relation annotation in the HDRB.

We also present a few issues encountered in Hindi discourse which are not being handled by the current annotation scheme.

 

Parsing discourse relations in the PDTB corpus
Giuseppe Riccardi
University of Trento, Italy

In this work we take a data-driven approach to theory-free, robust shallow discourse parsing. Following the Penn Discourse Treebank (PDTB) annotation model, we identify arguments of explicit discourse connectives. In contrast to previous work, we do not make any assumptions about the span of arguments and treat parsing as a token-level sequence labeling task. We design the argument parsing task as a cascade of decisions based on conditional random fields (CRFs) and a re-ranking module. We report the latest results on feature selection and parser performance on the standard PDTB training/test splits.
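To show what a token-level formulation might look like, the sketch below converts argument and connective spans into one label per token, the kind of target a sequence labeler such as a CRF is trained to predict. The BIO label set, the half-open span convention, and the function name are assumptions made for illustration; the abstract does not specify the encoding used in the talk.

```python
def label_tokens(tokens, arg1_span, arg2_span, conn_span):
    """Assign one BIO-style label per token.

    Each span is a half-open (start, end) pair of token indices.
    Tokens outside every span keep the label "O".
    """
    labels = ["O"] * len(tokens)
    spans = (("ARG1", arg1_span), ("ARG2", arg2_span), ("CONN", conn_span))
    for name, (start, end) in spans:
        for i in range(start, end):
            # The first token of a span gets B-, the rest get I-.
            labels[i] = ("B-" if i == start else "I-") + name
    return labels


# Usage: "because" is the connective, "it rained" the argument bound to it.
tokens = "He stayed home because it rained".split()
labels = label_tokens(tokens, arg1_span=(0, 3), arg2_span=(4, 6), conn_span=(3, 4))
```

Because every token carries its own label, no constraint ties an argument to a full clause or sentence, which matches the stated goal of making no assumptions about argument spans.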


Annotation of Italian Dialogues Following the Penn Discourse Treebank Paradigm
Sara Tonelli
University of Trento, Italy

In this talk, I will present a qualitative and quantitative analysis of discourse relations within the LUNA conversational spoken dialog corpus. In particular, we describe the adaptation of the Penn Discourse Treebank (PDTB) annotation scheme to the LUNA dialogs. We discuss similarities and differences between our approach and the PDTB paradigm and point out the peculiarities of spontaneous dialogs with respect to written text, which motivated some changes in the sense hierarchy in order to give more emphasis to the pragmatic aspects of spontaneous speech. Then, we present corpus statistics about the discourse relations within a representative set of annotated dialogs.

 

PDTB-style discourse annotation of Chinese text
Yuping Zhou, Nianwen Xue
Brandeis University, Waltham, MA

In this presentation we address two main issues regarding discourse annotation of Chinese text. The first is to what extent discourse structure and relations can be extracted automatically from existing resources such as treebanks, if they exist. We show that while a first approximation of discourse structure and relations is hidden in the Chinese Treebank annotation and can be extracted automatically, more refined discourse structures still have to be annotated manually. Having justified the necessity of manually annotating discourse relations, the second issue is how to annotate Chinese text in the PDTB style. Based on the characteristics of Chinese, we have made the following adaptations: i) explicit and implicit discourse relations are annotated in one single pass rather than separately; ii) in annotating implicit relations, the sense of the relation is annotated directly, bypassing the step of inserting a connective; and iii) the argument labels are defined semantically rather than syntactically. These adaptations work out very well in our annotation experiment.

 

 

Group 3: Those who are experts in many aspects of NLP (CL) and have expressed interest in providing us with their feedback, based on what they will have heard from the people in the first three groups and, more specifically, concerning the proposed extensions of the PDTB.


• Nicoletta Calzolari, ILC, Italy nicoletta.calzolari@ilc.cnr.it
• Barbara Di Eugenio, UIC, Chicago IL bdieugen@cs.uic.edu
• Nancy Ide, Vassar College, Poughkeepsie NY ide@cs.vassar.edu
• Andy Kehler, UCSD, San Diego CA kehler@ling.ucsd.edu
• Kathy McKeown, Columbia Univ, NYC kathy@cs.columbia.edu
• James Pustejovsky, Brandeis Univ, Waltham MA jamesp@cs.brandeis.edu
• Hannah Rohde, Univ of Edinburgh, Edinburgh UK hannah.rohde@ed.ac.uk
• Dan Roth, UIUC, Champaign IL danr@uiuc.edu
• Marilyn Walker, UCSC, Santa Cruz CA maw@soe.ucsc.edu
• Michael White, OSU, Columbus OH mwhite@ling.osu.edu