Data and meta-data that help in understanding the files in the Penn
TreeBank (LDC Catalog entry LDC99T42) and the Penn Discourse TreeBank
(LDC Catalog entry LDC2008T05) as texts can be found in
the TIPSTER WSJ corpus (LDC Catalog entry LDC93T3A). (The same
information can be found in the ACL/DCI corpus, LDC Catalog entry LDC93T1.)
This information can be accessed
indirectly using map files that pair each Penn TreeBank file name
(e.g., wsj_0005) with its corresponding index in the TIPSTER WSJ corpus
(e.g., 891031-0011). While these map files are publicly available
via the LDC catalog entry for the Penn TreeBank, we have also prepared a
convenient meta-data/data package, which the LDC will provide to current and
future license holders of the Penn Discourse TreeBank. Please contact the LDC
for a copy of the tarball.
The meta-data for a Penn TreeBank file comprise the set of header fields
from its TIPSTER entry, explained as follows in the TIPSTER WSJ sample text:
- DD: the date the article appeared in the Wall Street Journal
- AN: unique identifier for the article
- HL: the column name (for regular features such as
Marketing & Media, Technology), its headline and by-line
- SO: the source of the article
- IN: manually-assigned codes or keywords for the article
- CO: manually-assigned codes for companies or other organizations
- DATELINE: normally the location where the article was filed, but
sometimes has very unexpected contents
- GV: Branch of Government or Government Agency mentioned in the article
While files usually have an AN field and two DD fields (dates in different formats),
the other fields may or may not be present.
Where the HL field goes over multiple lines in the original TIPSTER entry,
we have removed the line breaks. So where the TIPSTER file has:
<HL> Technology & Health:
@ Asbestos Once Used in Kent Filters Led
@ To Workers' Cancer Deaths, Group Says
@ ----
@ By Anne Newman
@ Staff Reporter of The Wall Street Journal </HL>
the HL field for file wsj_0003.meta appears as:
HL : Technology & Health:@ Asbestos Once Used in Kent Filters Led@ To Workers' Cancer Deaths, Group Says@ ----@ By Anne Newman@ Staff Reporter of The Wall Street Journal
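The flattened HL value can be split back into its original lines. The following is a minimal sketch, assuming that every "@" in the value marks a removed line break and that "@" does not otherwise occur in headlines; the function name split_headline is hypothetical and not part of the distribution.

```python
def split_headline(hl_value):
    """Split a flattened HL value back into its original lines.

    Assumption: each "@" marks a line break removed from the TIPSTER entry.
    """
    parts = [part.strip() for part in hl_value.split("@")]
    # Drop empty pieces that would result from a leading or trailing "@".
    return [part for part in parts if part]


hl = ("Technology & Health:@ Asbestos Once Used in Kent Filters Led@ "
      "To Workers' Cancer Deaths, Group Says@ ----@ By Anne Newman@ "
      "Staff Reporter of The Wall Street Journal")
lines = split_headline(hl)
for line in lines:
    print(line)
```

For the wsj_0003 example, this yields the column name, the two headline lines, the "----" separator and the two by-line lines as separate items.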
There are two sorts of data related to text structure:
- indications as to where errors were made in creating Penn TreeBank files
from TIPSTER entries, such that more than a single WSJ article from a
TIPSTER file was included in a single Penn TreeBank file.
- indications as to where separator symbols appear in the body of the original
WSJ article, indicating that it consists of a sequence
of texts from different sources. For example, a separator symbol appears between
each of the four letters
to the editor in TIPSTER entry 891102-0087, corresponding to the Penn TreeBank
file wsj_0105. The PTB file lacks such separators.
Both these meta-data and data can be of value to discourse researchers.
The meta-data can, for example, enable the texts to be distinguished by genre
(news reports, editorials, etc.) [Webber, 2009] or by topic
[Petrenz and Webber, 2011]. These distinctions can then be used, for example,
in text segmentation and text summarization, or in testing hypotheses about
domain adaptation [Plank and van Noord, 2011].
The data, on the other hand, can allow researchers to distinguish separate
texts within a single file
(e.g., the four separate letters to the editor in file wsj_0105,
or the two separate TIPSTER articles, each with its own meta-data,
that were included in error in the same Penn TreeBank file, e.g., wsj_0814)
and thereby avoid, for example, attempting to produce one summary for the entire file.
This new resource created from the TIPSTER files employs
the same file structure and conventions used in the Penn TreeBank
and the PDTB 2.0.
The meta-data and data for a single Penn TreeBank / PDTB file (wsj_XXXX)
reside in a corresponding file (wsj_XXXX.meta). All the files corresponding
to a given section (XX) are in sub-directory XX. The tarball distribution
contains the 25 sub-directories 00 through 24.
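The layout described above can be turned into a path lookup. This is a minimal sketch, assuming the usual Penn TreeBank convention that the first two digits of the four-digit file number give the section; the function name meta_path and the root directory name pdtb_meta are hypothetical.

```python
from pathlib import PurePosixPath


def meta_path(root, file_id):
    """Build the expected location of the .meta file for an id like 'wsj_0105'.

    Assumption: the first two digits of the file number are the section.
    """
    prefix, _, number = file_id.partition("_")
    if prefix != "wsj" or len(number) != 4 or not number.isdigit():
        raise ValueError(f"unexpected file id: {file_id!r}")
    section = number[:2]
    return PurePosixPath(root) / section / f"{file_id}.meta"


print(meta_path("pdtb_meta", "wsj_0105"))
```

With these assumptions, wsj_0105 resolves to sub-directory 01 under the chosen root.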
Each individual file starts with the meta-data from its corresponding
article in the TIPSTER corpus, followed by a list headed SBREAKS
of the byte positions of section breaks present in the file. For
example:
DOCNO : 891102-0087.
DD : = 891102
AN : 891102-0087.
HL : Letters to the Editor:@ Brutal World of Life on the Streets
DD : 11/02/89
SO : WALL STREET JOURNAL (J)
SBREAKS : 1988..1989;2857..2858;3536..3537;4077..4078
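The SBREAKS value can be read as a list of start..end pairs separated by semicolons. The sketch below only converts the value into integer pairs; how the offsets are to be interpreted (for example, whether the end position is inclusive) is not stated here, so the code makes no assumption about it. The function name parse_ranges is hypothetical.

```python
def parse_ranges(value):
    """Parse a value such as '1988..1989;2857..2858' into (start, end) pairs."""
    ranges = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';' or an empty value
        start, sep, end = item.partition("..")
        if not sep:
            raise ValueError(f"malformed range: {item!r}")
        ranges.append((int(start), int(end)))
    return ranges


breaks = parse_ranges("1988..1989;2857..2858;3536..3537;4077..4078")
print(breaks)
```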
Those files that, in error, contain more than one article from the TIPSTER
corpus have two such sets of data, one per article, separated by ARTICLEBREAK,
as in, for example, wsj_0545:
DOCNO : 891030-0156.
DD : = 891030
AN : 891030-0156.
HL : Canadian Pig Herd Shrinks
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : CANDA
DATELINE : OTTAWA
ARTICLEBREAK : 212..213
DOCNO : 891030-0155.
DD : = 891030
AN : 891030-0155.
HL : Who's News:@ American Federal Savings Bank of Duval County
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : AMJX WNEWS
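A file in this format can be read with a few lines of code. The following is a minimal sketch, assuming that every line has the form "KEY : value", that a field such as DD may repeat within one article, and that each ARTICLEBREAK line starts a new article; the function name parse_meta is hypothetical, and range values are kept as raw strings.

```python
def parse_meta(text):
    """Parse the contents of a .meta file.

    Returns (articles, article_breaks): one dict per article mapping each
    field name to the list of its values, plus the raw ARTICLEBREAK values.
    """
    articles = [{}]
    article_breaks = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue  # skip blank or malformed lines
        if key == "ARTICLEBREAK":
            article_breaks.append(value)
            articles.append({})  # following fields belong to the next article
        else:
            # Fields such as DD can occur twice, so keep every value.
            articles[-1].setdefault(key, []).append(value)
    return articles, article_breaks


sample = """\
DOCNO : 891030-0156.
DD : = 891030
AN : 891030-0156.
HL : Canadian Pig Herd Shrinks
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : CANDA
DATELINE : OTTAWA
ARTICLEBREAK : 212..213
DOCNO : 891030-0155.
DD : = 891030
AN : 891030-0155.
HL : Who's News:@ American Federal Savings Bank of Duval County
DD : 10/30/89
SO : WALL STREET JOURNAL (J)
CO : AMJX WNEWS
"""
articles, article_breaks = parse_meta(sample)
print(len(articles), article_breaks)
```

Storing a list per field keeps both DD values, and a file with no ARTICLEBREAK line simply yields a single article.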