University of Pennsylvania

Institute For Research in Cognitive Science
The Penn Discourse Treebank Project is an NSF-funded project (NSF grants EIA-05-63063 and IIS-07-05671) at the Institute for Research in Cognitive Science, University of Pennsylvania.

PDTB 2.0 is available from the Linguistic Data Consortium.

Document-level and document-internal meta-data on the PDTB files are now available from the LDC to all current and future PDTB license holders.

Please visit the PDTB API page for technical support.


The Penn Discourse Treebank (PDTB) is a large-scale corpus annotated with information related to discourse structure and discourse semantics. While many aspects of discourse are crucial to a complete understanding of natural language, the PDTB focuses on encoding discourse relations. The annotation methodology follows a lexically grounded approach. The PDTB has striven to remain theory-neutral with respect to the nature of high-level representations of discourse structure, so that the corpus can be used within different theoretical frameworks. Theory-neutrality is achieved by keeping annotations of discourse relations "low-level": each discourse relation is annotated independently of other relations; that is, dependencies across relations are not marked.

The PDTB aims to support the extraction of a range of inferences associated with discourse relations, for a wide range of NLP applications, such as parsing, information extraction, question answering, summarization, machine translation, and generation, as well as corpus-based studies in linguistics and psycholinguistics.

Discourse relations in the current version of the PDTB are taken to be triggered by explicit phrases or by structural adjacency. Each relation is further annotated for its two abstract object arguments, the sense of the relation, and the attributions associated with the relation and each of its two arguments. The annotations in the PDTB are aligned with the syntactic constituency annotations of the Penn Treebank.
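
As a rough illustration of the annotation layers just described, the following Python sketch models a single relation as a record. It is not the PDTB file format or the PDTB API: the class names, field names, sense label, and example sentence are all hypothetical, chosen only to mirror the description above (a trigger, two arguments, a sense, and attributions on the relation and on each argument).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trigger(Enum):
    """How a relation is triggered, following the description above."""
    EXPLICIT = "explicit phrase"
    ADJACENCY = "structural adjacency"


@dataclass(frozen=True)
class Argument:
    """One of the two abstract object arguments (hypothetical layout)."""
    text: str
    attribution: Optional[str] = None  # attribution of this argument, if annotated


@dataclass(frozen=True)
class DiscourseRelation:
    """One relation, annotated independently of all other relations."""
    trigger: Trigger
    arg1: Argument
    arg2: Argument
    sense: str                         # illustrative label, not the PDTB sense inventory
    connective: Optional[str] = None   # the explicit phrase, when there is one
    attribution: Optional[str] = None  # attribution of the relation itself

    def __post_init__(self) -> None:
        # An explicitly triggered relation must name its triggering phrase.
        if self.trigger is Trigger.EXPLICIT and not self.connective:
            raise ValueError("an explicit relation needs a connective")


# Invented example sentence, not taken from the corpus:
# "The flight was cancelled because the weather was bad."
example = DiscourseRelation(
    trigger=Trigger.EXPLICIT,
    connective="because",
    arg1=Argument("The flight was cancelled"),
    arg2=Argument("the weather was bad"),
    sense="cause",
    attribution="writer",
)
```

Because each record carries no pointer to any other relation, the sketch also reflects the "low-level" design: dependencies across relations are simply not represented.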

The following publication describes the PDTB 2.0 corpus:
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC). Marrakech, Morocco.

PDTB annotation guidelines, annotation format, and summary distributions are provided in the manual:
The PDTB Research Group. 2008. The PDTB 2.0 Annotation Manual. Technical Report IRCS-08-01. Institute for Research in Cognitive Science, University of Pennsylvania.

The PDTB project also aims to conduct empirical research with the PDTB corpus, for NLP as well as theoretical linguistics. See the publications page for PDTB-related research supported by the project.