PDTB API (0.2.9)

This is an API to interact with the Penn Discourse Treebank and Penn Treebank annotations. It provides a browser and XPath query support.

These are described below. See the PDTB User Manual for terminology and a description of the file formats expected. There are two distributions of this API:
  1. The user distribution - This should suffice for all needs that don't involve modifying the source. To run the browser (also available as a Java™ Web Start application here), download and unzip the distribution. Then:
    cd PDTBUser
    java -jar pdtb.jar RawRoot PtbRoot PdtbRoot
    The locations RawRoot, PtbRoot, and PdtbRoot are described here. Alternatively, run:
    cd PDTBUser
    java -jar pdtb.jar
    And you will get "friendly" prompts for the directories. To program with it, see API docs.

  2. Older Distributions:
    The PDTB1 user distribution
    The PDTB1 API docs

  3. The developer distribution - Use this if you need to modify the source. It contains Lex and Yacc specifications, which should be reasonably easy to port to APIs in other languages. Also, if you want the PDTB annotation deserialized into classes of your own, the Yacc specs will need to be modified. An Ant build script is included.
By downloading this, you accept the license.

System requirements: Java™ 1.4.2. The API relies on Java™ for platform independence. However, the portability of component graphics is not always seamless. We test mostly on Linux and marginally on Windows. Problems on other platforms cannot be supported.

Mac users - In versions before 0.2.4, the bottom left window of the browser appears garbled under the default look and feel (Aqua). This can be fixed without upgrading by running:

cd PDTBUser
java -Dswing.defaultlaf=javax.swing.plaf.metal.MetalLookAndFeel -jar pdtb.jar
Version 0.2.4 uses this (Metal) look and feel.

Questions/bugs etc:

Contacts: Nikhil Dinesh - first name followed by d at seas dot upenn dot edu

          Geraud Campion - first name followed by at seas dot upenn dot edu

Change log:

The Browser
Launch the PDTB Browser via Java Web Start.

A screenshot of the browser (parse trees on the top, PDTB annotations bottom left, and the raw text on the bottom right) is given below. Arg1 is highlighted in yellow, Arg2 in blue, the connective in red, and the features associated with the connective can be seen on the bottom left.

The bottom left window can be used to select connectives and arguments. Clicking on a connective shows the associated features. Double-clicking toggles between showing and hiding the arguments. Clicking on an argument shows its features. The spans associated with attribution will be added in the second release of the PDTB, so clicking on the features has no effect.

Nodes in the parse tree can be collapsed for viewing without scrolling. For example, the PP-LOC node in Arg1 (in yellow) has been collapsed. Clicking on a parse tree node toggles between expanding and collapsing.

The combo boxes (drop-down lists) in the middle correspond to the section and file numbers (on the left and right respectively). Only files under PdtbRoot that have associated files in PtbRoot and RawRoot can be loaded (the boxes will let you select only valid files). These roots are supplied as command-line arguments, as mentioned above. When the Load button is clicked, the files corresponding to the section and file numbers selected in the combo boxes are loaded as a tab. The Close Tab button closes the tab currently in focus.

The buttons Prev and Next switch to the previous and next files (if any) from the file denoted by the combo boxes. These buttons were added in v0.2.7.

The button New Query brings up a Query Window, allowing you to build a PDTBXPath search on the entire corpus.

XPath Support
The API supports XPath queries on the PDTB annotations and on the PTB annotations independently. For joint queries, one needs to query the PDTB and use the results to query the PTB. XPath support is provided via Jaxen (Jaxen is bundled with the distributions; a separate download is not required).

This is not intended as a replacement for tgrep, or various other query tools written for the PTB. The mechanisms used here are significantly slower, as each file will have to be queried independently. The advantage that this API offers is a more accessible programming model.

As of v0.2.3, queries run about 4 times faster on average than in earlier versions. The following queries were run (v0.2.3) on a machine with an Intel Pentium 4 processor (2.4 GHz, 1 GB of memory) running SUSE Linux (total counts of results were produced):

  1. >::ExplicitRelation[@connHead='because'] (3 secs without loading syntax, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query child::ExplicitRelation[@connHead='because'].
  2. =>>::S/>::NP[->::VP] (7 secs, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query descendant-or-self::S/child::NP[following::VP].
We tested on 23 PTB queries (from the Bird et al. paper), and they take between 6 and 12 secs. There is an implementation of Bird et al.'s extensions to XPath, and it is described briefly here. We do not have a command line XPath interface to the PTB yet; some interface work is needed to facilitate browsing results. Contributions are welcome. The command line XPath tool for the PDTB and the syntactic sugar used are described below.
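
The two timed queries above pair an abbreviated axis form with its standard XPath equivalent. As a rough illustration of that correspondence only, the sketch below expands the three abbreviations that appear in those examples (>:: for child::, =>>:: for descendant-or-self::, ->:: for following::) by plain string replacement. The class name AxisSugar is hypothetical and this is not the API's own parser; the complete set of extensions is documented separately and is not reproduced here.

```java
// Illustrative sketch: handles only the three abbreviations shown in the
// timed examples; it is not how the PDTB API itself parses queries.
final class AxisSugar {
    private AxisSugar() {}

    /** Rewrites the abbreviated axes into standard XPath axis names. */
    static String expand(String query) {
        // Order matters: "=>>::" and "->::" both end in ">::", so the
        // longer abbreviations must be replaced before the bare ">::".
        return query
                .replace("=>>::", "descendant-or-self::")
                .replace("->::", "following::")
                .replace(">::", "child::");
    }

    public static void main(String[] args) {
        System.out.println(expand(">::ExplicitRelation[@connHead='because']"));
        System.out.println(expand("=>>::S/>::NP[->::VP]"));
    }
}
```

Because the replacement is purely textual, it would also rewrite these character sequences if they occurred inside a quoted string literal in the query; a real implementation would need to tokenize the expression first.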

See the API docs of the following packages for more info:

edu.upenn.cis.pdtb.xpath and edu.upenn.cis.ptb.xpath

PDTBXPath is a simple top-level query interface to the PDTB (examples are given toward the bottom of the page). Usage is as follows:
   cd PDTBUser
   java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath args
The arguments are as follows:
   --rawRoot RawRoot (or -r RawRoot)
   --ptbRoot PtbRoot (or -p PtbRoot)
   --pdtbRoot PdtbRoot (or -d PdtbRoot)
   --outputRoot OutputRoot (or -o OutputRoot. This will serve as the result PdtbRoot.
                            OutputRoot should not exist when this is run)
   --xpath XPathExpression (or -x XPathExpression)
   -c (generates total counts in addition to the files)
   -b (opens results in the browser. The saved results can always be opened in
       the browser at a later time, by specifying OutputRoot as the PdtbRoot 
       argument to the browser.)

Or modify this Perl script appropriately, and place it in the PDTBUser directory. Here is another Perl script with example queries, and some slides explaining the overall design. These files use the standard XPath syntax.

We have added extensions to XPath specific to the PDTB. Since this is a growing list, we have moved it to a separate PDTB XPath Extensions page. Scripts containing example queries are now part of the user distribution.

Users unfamiliar with Java should go here for a tutorial on how to run Java programs. The most common error is setting the classpath wrong, which results in output of the form:

  Exception in thread "main" java.lang.NoClassDefFoundError: edu.upenn.cis.pdtb.xpath.PDTBXPath

Please make sure that the classpath includes the "pdtb.jar" file in the PDTBUser directory. That is, if you run the command from a directory other than PDTBUser, the option should be specified as -classpath "....path to...pdtb.jar".
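
One way to confirm the classpath before running a long query is to ask the JVM whether the class can be located at all. The sketch below is a minimal check under that assumption; the class name ClasspathCheck is hypothetical and is not part of the distribution. It uses only the standard Class.forName call and reports a missing class instead of failing with the error shown above.

```java
// Hypothetical helper, not part of the PDTB distribution.
final class ClasspathCheck {
    private ClasspathCheck() {}

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "edu.upenn.cis.pdtb.xpath.PDTBXPath";
        System.out.println(isLoadable(name)
                ? name + " found"
                : name + " NOT found; check the -classpath option");
    }
}
```

Compile it and run it with the same -classpath value you intend to pass to the query tool, with the directory containing the compiled check appended.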

The XPath expression should select Element nodes listed below. If any other kind of node or object is returned by the query, the program exits. The full XPath functionality can be accessed via the API. The following is the list of Elements, their children, and attributes:

Element QName | Children QNames | Attribute QNames
RelationList | ExplicitRelation*, ImplicitRelation*, AltLexRelation*, EntityRelation*, NoRelation* | (none)
ExplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, connHead, sClassA, sClassB?, rawText
ImplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, conn1, sClass1A, sClass1B?, conn2?, sClass2A?, sClass2B?
AltLexRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, sClassA, sClassB?, rawText
EntityRelation | Arg1, Arg2 | (none)
NoRelation | Arg1, Arg2 | (none)
Arg1 | (none) | Source, Type, Polarity, Det, rawText
Arg2 | (none) | Source, Type, Polarity, Det, rawText
Sup1 | (none) | rawText
Sup2 | (none) | rawText

Note that the Source, Type, Polarity, and Det attributes appear on all arguments except those of EntityRelation and NoRelation.

The expression "//ExplicitRelation[@connHead='because']" selects all occurrences of the explicit connective "because". The expression "//ImplicitRelation[@conn1='instead']" selects all implicit relations where "instead" is chosen as conn1. Note that the enclosing quotes are necessary on the command line.
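
To see how such expressions behave, the sketch below evaluates them with the JDK's standard javax.xml.xpath package against a small hand-written XML fragment. This is an illustration only: the fragment is made up to mirror the element and attribute names listed above and is not corpus data, the class name ToyRelationQuery is hypothetical, and the PDTB API itself evaluates queries through Jaxen over its own object model rather than over an XML file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; uses the JDK XPath engine, not the PDTB API.
final class ToyRelationQuery {
    private ToyRelationQuery() {}

    // Hand-written stand-in that follows the element list above; not real annotations.
    static final String TOY_XML =
          "<RelationList>"
        + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
        + "<ExplicitRelation connHead='but'><Arg1/><Arg2/></ExplicitRelation>"
        + "<ImplicitRelation conn1='instead'><Arg1/><Arg2/></ImplicitRelation>"
        + "</RelationList>";

    /** Returns how many nodes of the toy document the expression selects. */
    static int count(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(TOY_XML)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//ExplicitRelation[@connHead='because']"));
        System.out.println(count("//ImplicitRelation[@conn1='instead']"));
    }
}
```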

If an Element is selected by an expression, the closest ancestor (or the node itself) which is a child of PDTBRelationList will be output. This is so that the results can be loaded in the browser. If the -c option is specified, a total count of the objects selected is produced. This will usually be greater than or equal to the number of relations saved in the OutputRoot.
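
The relationship between the two numbers can be sketched as follows. The code below is an assumption-laden illustration, not the tool's implementation: it uses the JDK's DOM and XPath classes on a made-up fragment, the class name RelationResolver is hypothetical, and the root element is named RelationList as in the element list above. For each selected node it walks up to the ancestor (or the node itself) whose parent is the list root, and reports both the number of selected objects and the number of distinct relations they fall under.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the PDTB tool's own code.
final class RelationResolver {
    private RelationResolver() {}

    // Made-up fragment mirroring the element names above; not corpus data.
    static final String TOY_XML =
          "<RelationList>"
        + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
        + "<ImplicitRelation conn1='instead'><Arg1/><Arg2/></ImplicitRelation>"
        + "</RelationList>";

    /** Walks up from a selected node to the ancestor-or-self whose parent is the list root. */
    static Node enclosingRelation(Node selected, Node listRoot) {
        Node n = selected;
        while (n != null && !listRoot.isSameNode(n.getParentNode())) {
            n = n.getParentNode();
        }
        return n; // null when the node is not below the list root
    }

    /** Returns {objects selected, distinct enclosing relations}. */
    static int[] counts(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(TOY_XML)));
            Node root = doc.getDocumentElement();
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<Node> relations = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node rel = enclosingRelation(hits.item(i), root);
                if (rel == null) continue;
                boolean seen = false;
                for (Node known : relations) {
                    if (known.isSameNode(rel)) { seen = true; break; }
                }
                if (!seen) relations.add(rel);
            }
            return new int[] { hits.getLength(), relations.size() };
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        int[] c = counts("//ExplicitRelation/*");
        System.out.println(c[0] + " objects selected, " + c[1] + " relation(s)");
    }
}
```

In this toy case, selecting both arguments of a single relation yields more selected objects than relations, which is why the count can exceed the number of relations written out.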

Here is a sample command line invocation for the Bourne shell:

     cd PDTBUser
     java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath \
             -r Corpora/PTB/raw/wsj \
             -p Corpora/PTB/combined/wsj \
             -d Corpora/PDTB/pdtb/wsj \
             -o Corpora/PDTB/PDTBImplicitBecause \
             -x "//ImplicitRelation[@conn1 = 'because']" -c -b
Note that the output directory specified by -o should not exist prior to invocation.
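
If you would rather drive the tool from Java than from a shell script, the same invocation can be assembled as an argument list. The sketch below is one possible wrapper, assuming only the flags documented above; the class name PdtbXPathCommand and its build method are hypothetical and not part of the distribution. It checks up front that the output directory does not exist and returns the argument list without starting any process.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper; only the flags documented above are used.
final class PdtbXPathCommand {
    private PdtbXPathCommand() {}

    /** Builds the argument list for the PDTBXPath command line tool. */
    static List<String> build(String rawRoot, String ptbRoot, String pdtbRoot,
                              String outputRoot, String xpath,
                              boolean counts, boolean browser) {
        // The tool requires that OutputRoot does not exist yet, so fail early.
        if (new File(outputRoot).exists()) {
            throw new IllegalArgumentException("OutputRoot must not exist: " + outputRoot);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx300M");
        cmd.add("-Xms300M");
        cmd.add("-classpath");
        cmd.add("pdtb.jar");
        cmd.add("edu.upenn.cis.pdtb.xpath.PDTBXPath");
        cmd.add("-r"); cmd.add(rawRoot);
        cmd.add("-p"); cmd.add(ptbRoot);
        cmd.add("-d"); cmd.add(pdtbRoot);
        cmd.add("-o"); cmd.add(outputRoot);
        cmd.add("-x"); cmd.add(xpath); // passed as one argument, so no shell quoting is needed
        if (counts) cmd.add("-c");
        if (browser) cmd.add("-b");
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("Corpora/PTB/raw/wsj", "Corpora/PTB/combined/wsj",
                "Corpora/PDTB/pdtb/wsj", "Corpora/PDTB/PDTBImplicitBecause",
                "//ImplicitRelation[@conn1 = 'because']", true, true);
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder with its working directory set to PDTBUser, so that the relative pdtb.jar classpath entry resolves as in the shell example.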

See the XPath recommendation for further info on XPath. For querying the syntax as well, use the API.