University of Pennsylvania

Institute For Research in Cognitive Science

Data and meta-data relevant to understanding, as texts, the files in the Penn TreeBank (LDC Catalog entry LDC99T42) and the Penn Discourse TreeBank (LDC Catalog entry LDC2008T05) can be found in the TIPSTER WSJ corpus (LDC Catalog entry LDC93T3A). (The same information can be found in the ACL/DCI corpus, LDC Catalog entry LDC93T1.) This information can be accessed indirectly using map files that pair each Penn TreeBank file name (e.g., wsj_0005) with its corresponding index in the TIPSTER WSJ corpus (e.g., 891031-0011). While these map files are publicly available via the LDC catalog entry for the Penn TreeBank, the LDC will provide a convenient meta-data/data package that we have prepared to current and future license holders of the Penn Discourse TreeBank. Please contact the LDC for a copy of the tarball.

The meta-data for a Penn TreeBank file comprise the set of header fields from its TIPSTER entry, explained as follows in the TIPSTER WSJ sample text:

  • DD: the date the article appeared in the Wall Street Journal
  • AN: unique identifier for the article
  • HL: the column name (for regular features such as Marketing & Media, Technology), followed by the article's headline and by-line
  • SO: the source of the article
  • IN: manually-assigned codes or keywords for the article
  • CO: manually-assigned codes for companies or other organizations
  • DATELINE: normally the location where the article was filed, but sometimes has very unexpected contents
  • GV: Branch of Government or Government Agency mentioned in the article
While files usually have an AN field and two DD fields (dates in different formats), the other fields may or may not be present.

Where the HL field goes over multiple lines in the original TIPSTER entry, we have removed the line breaks. So where the TIPSTER file has:

  <HL> Technology & Health:
  @  Asbestos Once Used in Kent Filters Led
  @  To Workers' Cancer Deaths, Group Says 
  @  ---- 
  @  By Anne Newman
  @  Staff Reporter of The Wall Street Journal </HL> 
the HL field for file wsj_0003.meta appears as:

HL : Technology & Health:@ Asbestos Once Used in Kent Filters Led@ To Workers' Cancer Deaths, Group Says@ ----@ By Anne Newman@ Staff Reporter of The Wall Street Journal
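
Since each removed line break leaves an "@" marker behind, the original segments of an HL value can be recovered by splitting on that character. The following is a minimal Python sketch, not part of the distribution; the function name split_hl is hypothetical, and it assumes that "@" never occurs inside the headline text itself.

```python
def split_hl(hl_value):
    """Split a flattened HL value on the '@' markers left by removed line breaks.

    Assumption: '@' only ever marks a former line break and does not
    appear inside the headline or by-line text itself.
    """
    return [part.strip() for part in hl_value.split("@")]


# HL value of wsj_0003.meta, as quoted above.
hl = ("Technology & Health:@ Asbestos Once Used in Kent Filters Led"
      "@ To Workers' Cancer Deaths, Group Says@ ----@ By Anne Newman"
      "@ Staff Reporter of The Wall Street Journal")

parts = split_hl(hl)
for part in parts:
    print(part)
```

Stripping each segment makes the split insensitive to whether one or two spaces follow the "@" marker.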

There are two sorts of data related to text structure:

  • indications as to where errors were made in creating Penn TreeBank files from TIPSTER entries, such that more than a single WSJ article from a TIPSTER file was included in a single Penn TreeBank file.
  • indications as to where separator symbols appear in the body of the original WSJ article, indicating that it consists of a sequence of texts from different sources. For example, a separator symbol appears between each of the four letters to the editor in TIPSTER entry 891102-0087, corresponding to the Penn TreeBank file wsj_0105. The PTB file lacks such separators.

Both these meta-data and data can be of value to discourse researchers. The meta-data can, for example, enable the texts to be distinguished by genre (news reports, editorials, etc.) [Webber, 2009] or by topic [Petrenz and Webber, 2011]. These distinctions can then be used, for example, in text segmentation and text summarization, or in testing hypotheses about domain adaptation [Plank and van Noord, 2011]. The data, on the other hand, can allow researchers to distinguish separate texts within a single file (e.g., the four separate letters to the editor in file wsj_0105, or the two separate TIPSTER articles, each with its own meta-data, that were included in error in the same Penn TreeBank file, e.g., wsj_0814) and thereby avoid, for example, attempting to produce one summary for the entire file.

This new resource created from the TIPSTER files employs the same file structure and conventions used in the Penn TreeBank and the PDTB 2.0. The meta-data and data for a single Penn TreeBank / PDTB file (wsj_XXXX) reside in a corresponding file (wsj_XXXX.meta). All the files corresponding to a given section (XX) are in sub-directory XX. The tarball distribution contains the 25 sub-directories 00 through 24.
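
As an illustration of this layout, the location of a .meta file can be derived from the Penn TreeBank file name alone, since the first two digits of the name give the section. The sketch below is only an assumption-laden example: the function name meta_path and the root directory name pdtb_meta are hypothetical, and the actual name of the top-level directory in the tarball may differ.

```python
import re
from pathlib import Path


def meta_path(root, ptb_name):
    """Return the expected path of the .meta file for a name such as 'wsj_0105'.

    'root' is wherever the tarball was unpacked (a hypothetical location).
    """
    if not re.fullmatch(r"wsj_\d{4}", ptb_name):
        raise ValueError(f"not a Penn TreeBank WSJ file name: {ptb_name!r}")
    section = ptb_name[4:6]  # first two digits after 'wsj_' name the section
    return Path(root) / section / f"{ptb_name}.meta"


# Example: wsj_0105 belongs to section 01.
print(meta_path("pdtb_meta", "wsj_0105"))
```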

Each individual file starts with the meta-data from its corresponding article in the TIPSTER corpus, followed by a list headed SBREAKS of the byte positions of section breaks present in the file. For example:

  DOCNO : 891102-0087.
  DD : = 891102
  AN : 891102-0087.
  HL : Letters to the Editor:@  Brutal World of Life on the Streets
  DD : 11/02/89
  SO : WALL STREET JOURNAL (J)
  SBREAKS : 1988..1989;2857..2858;3536..3537;4077..4078
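
A file in this format can be read with a few lines of Python. The sketch below is one possible reading of the example above, not an official parser: the function names are hypothetical, it assumes every line has the form "KEY : value", and it assumes SBREAKS holds semicolon-separated start..end pairs. Fields are kept as a list of pairs rather than a dictionary because DD occurs twice.

```python
def parse_ranges(value):
    """Turn '1988..1989;2857..2858' into [(1988, 1989), (2857, 2858)]."""
    ranges = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if chunk:
            start, end = chunk.split("..")
            ranges.append((int(start), int(end)))
    return ranges


def parse_meta_block(lines):
    """Parse 'KEY : value' lines into a list of (key, value) pairs."""
    fields = []
    for line in lines:
        # Split on the first colon only, since values (e.g. HL) may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields.append((key.strip(), value.strip()))
    return fields


sample = """\
DOCNO : 891102-0087.
DD : = 891102
AN : 891102-0087.
HL : Letters to the Editor:@  Brutal World of Life on the Streets
DD : 11/02/89
SO : WALL STREET JOURNAL (J)
SBREAKS : 1988..1989;2857..2858;3536..3537;4077..4078
"""

fields = parse_meta_block(sample.splitlines())
sbreaks = [parse_ranges(v) for k, v in fields if k == "SBREAKS"]
print(fields)
print(sbreaks)
```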

Those files that, in error, contain more than one article from the TIPSTER corpus have one block of the above data for each article, with the blocks separated by an ARTICLEBREAK line, as in, for example, wsj_0545:

  DOCNO : 891030-0156.
  DD : = 891030
  AN : 891030-0156.
  HL : Canadian Pig Herd Shrinks
  DD : 10/30/89
  SO : WALL STREET JOURNAL (J)
  CO : CANDA
  DATELINE : OTTAWA
  ARTICLEBREAK : 212..213
  DOCNO : 891030-0155.
  DD : = 891030
  AN : 891030-0155.
  HL : Who's News:@  American Federal Savings Bank of Duval County
  DD : 10/30/89
  SO : WALL STREET JOURNAL (J)
  CO : AMJX WNEWS
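
To handle such files, the lines can be grouped into one block per article by starting a new block at each ARTICLEBREAK line. The following Python sketch illustrates this on the wsj_0545 example; the function name split_articles is hypothetical, and it assumes that each ARTICLEBREAK value is a single start..end pair in the same notation as SBREAKS.

```python
def split_articles(meta_text):
    """Group the lines of a .meta file into one field list per article.

    Returns (articles, breaks): 'articles' is a list of lists of
    (key, value) pairs, and 'breaks' holds the ARTICLEBREAK ranges.
    Assumption: each ARTICLEBREAK value is a single 'start..end' pair.
    """
    articles = [[]]
    breaks = []
    for line in meta_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key == "ARTICLEBREAK":
            start, end = value.split("..")
            breaks.append((int(start), int(end)))
            articles.append([])  # the next fields belong to a new article
        else:
            articles[-1].append((key, value))
    return articles, breaks


sample = """\
DOCNO : 891030-0156.
DD : = 891030
AN : 891030-0156.
HL : Canadian Pig Herd Shrinks
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : CANDA
DATELINE : OTTAWA
ARTICLEBREAK : 212..213
DOCNO : 891030-0155.
DD : = 891030
AN : 891030-0155.
HL : Who's News:@  American Federal Savings Bank of Duval County
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : AMJX WNEWS
"""

articles, breaks = split_articles(sample)
for fields in articles:
    print(dict(fields)["DOCNO"], len(fields), "fields")
print(breaks)
```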