This is an API for interacting with the Penn Discourse Treebank (PDTB) and Penn Treebank (PTB) annotations. It provides:
    cd PDTBUser
    java -jar pdtb.jar RawRoot PtbRoot PdtbRoot

The locations RawRoot, PtbRoot, and PdtbRoot are described here. Alternatively, run:

    cd PDTBUser
    java -jar pdtb.jar

You will then get "friendly" prompts for the directories. To program with the API, see the API docs.
System requirements: Java™ 1.4.2. The API relies on Java™ for platform independence. However, the portability of component graphics is not always seamless. We test mostly on Linux and marginally on Windows. Problems on other platforms cannot be supported.
Mac users: before version 0.2.4, the bottom left window of the browser appears garbled under the default look and feel (Aqua). This can be fixed without upgrading your version by running:

    cd PDTBUser
    java -Dswing.defaultlaf=javax.swing.plaf.metal.MetalLookAndFeel -jar pdtb.jar

Version 0.2.4 uses this (Metal) look and feel.
Contacts:
Nikhil Dinesh - first name followed by d at seas dot upenn dot edu
Geraud Campion - first name followed by at seas dot upenn dot edu
edu.upenn.cis.ptb.util.PTBTask should simplify basic access patterns. Some example XPath queries are given below.
edu.upenn.cis.ptb.util.CorpusFileIterator. Used to cause an exception when the Propbank parsed .cmb files appear under PtbRoot. Added Prev and Next buttons to the browser.
Launch the PDTB Browser via Java Web Start.
A screenshot of the browser (parse trees on the top, PDTB annotations bottom left, and the raw text on the bottom right) is given below. Arg1 is highlighted in yellow, Arg2 in blue, the connective in red, and the features associated with the connective can be seen on the bottom left.
The bottom left window can be used to select connectives and arguments. Clicking on a connective shows its associated features. Double-clicking toggles between showing and hiding the arguments. Clicking on an argument shows its features. The spans associated with attribution will be added in the second release of the PDTB, so clicking on the features currently has no effect.
Nodes in the parse tree can be collapsed for viewing without scrolling. For example, the PP-LOC node in Arg1 (in yellow) has been collapsed. Clicking on a parse tree node toggles between expanding and collapsing.
The combo boxes (drop-down lists) in the middle correspond to the section and file numbers (on the left and right respectively). Only files under PdtbRoot that have associated files in both PtbRoot and RawRoot can be loaded (the boxes will let you select only valid files). These directories are supplied as command-line arguments, as mentioned above. When the Load button is clicked, the files corresponding to the section and file numbers selected in the combo boxes are loaded as a tab. The Close Tab button closes the tab currently in focus.
The buttons Prev and Next switch to the previous and next files (if any) from the file denoted by the combo boxes. These buttons were added in v0.2.7.
The button New Query brings up a Query Window, allowing you to build a PDTBXPath search on the entire corpus.
The API supports XPath queries on the PDTB annotations and on the PTB annotations independently. For joint queries, one needs to query the PDTB and then use the results to query the PTB. XPath support is provided by Jaxen (Jaxen is bundled with the distributions, and a separate download is not required).
This is not intended as a replacement for tgrep, or various other query tools written for the PTB. The mechanisms used here are significantly slower, as each file will have to be queried independently. The advantage that this API offers is a more accessible programming model.
As of v0.2.3, queries run about 4 times faster on average. The following queries were run (v0.2.3) on a machine with an Intel Pentium 4 processor (2.4 GHz, 1 GB memory) running SUSE Linux (total counts of results were produced):
>::ExplicitRelation[@connHead='because'] (3 secs without loading syntax, VM flags: -Xmx300M -Xms300M). This is equivalent to the XPath query
=>>::S/>::NP[->::VP] (7 secs, VM flags: -Xmx300M -Xms300M). This is equivalent to the XPath query
See the API docs of the following packages for more info:
edu.upenn.cis.pdtb.xpath and edu.upenn.cis.ptb.xpath
A simple top-level query interface to the PDTB (examples are given toward the bottom of the page). Usage is as follows:

    cd PDTBUser
    java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath args

The arguments are as follows:

    --rawRoot RawRoot (or -r RawRoot)
    --ptbRoot PtbRoot (or -p PtbRoot)
    --pdtbRoot PdtbRoot (or -d PdtbRoot)
    --outputRoot OutputRoot (or -o OutputRoot. This will serve as the result PdtbRoot. OutputRoot should not exist when this is run)
    --xpath XPathExpression (or -x XPathExpression)
    -c (generates total counts in addition to the files)
    -b (opens results in the browser. The saved results can always be opened in the browser at a later time, by specifying OutputRoot as the PdtbRoot argument to the browser.)
Or modify this Perl script appropriately, and place it in the PDTBUser directory. Here is another Perl script with example queries, and some slides explaining the overall design. These files use the standard XPath syntax.
We have added extensions to XPath specific to the PDTB. Since this is a growing list, we have moved it to a separate PDTB XPath Extensions page. Scripts containing example queries are now part of the user distribution.
Users unfamiliar with Java should go here for a tutorial on how to run Java programs. The most common error is setting the classpath wrong, which results in output of the form:
Exception in thread "main" java.lang.NoClassDefFoundError: edu.upenn.cis.pdtb.xpath.PDTBXPath
Please make sure that the classpath includes the "pdtb.jar" file in the PDTBUser directory. That is, if you run the command from a directory other than PDTBUser, the option should be specified as -classpath "....path to...pdtb.jar".
The XPath expression should select Element nodes listed below. If any other kind of node or object is returned by the query, the program exits. The full XPath functionality can be accessed via the API. The following is the list of Elements, their children, and attributes:
|Element QName||Children QNames||Attribute QNames|
|RelationList||ExplicitRelation*, ImplicitRelation*, AltLexRelation*, EntityRelation*, NoRelation*|| |
|ExplicitRelation||Sup1?, Arg1, Arg2, Sup2?||Source, Type, Polarity, Det, connHead, sClassA, sClassB?, rawText|
|ImplicitRelation||Sup1?, Arg1, Arg2, Sup2?||Source, Type, Polarity, Det, conn1, sClass1A, sClass1B?, conn2?, sClass2A?, sClass2B?|
|AltLexRelation||Sup1?, Arg1, Arg2, Sup2?||Source, Type, Polarity, Det, sClassA, sClassB?, rawText|
|Arg1|| ||Source, Type, Polarity, Det, rawText|
|Arg2|| ||Source, Type, Polarity, Det, rawText|
Note that the Source, Type, Polarity, and Det attributes appear on all arguments except those of EntityRelation and NoRelation.
The expression "//ExplicitRelation[@connHead='because']" selects all occurrences of the explicit connective "because". The expression "//ImplicitRelation[@conn1='instead']" selects all implicit relations where "instead" is chosen as conn1. Note that the enclosing quotes are necessary on the command line.
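To make the attribute predicates concrete, here is a minimal sketch that runs the same kind of expression over a tiny hand-written XML document. It is an illustration only: it uses the XPath engine built into a modern JDK (javax.xml.xpath) rather than the PDTB API or Jaxen, and the toy document, the class name XPathSelectionSketch, and the helper countMatches are assumptions made for this example. Only the element and attribute names come from the table above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSelectionSketch {
    // Toy data, NOT real PDTB annotations: names follow the element table,
    // but the XML serialization itself is an assumption of this sketch.
    static final String TOY_XML =
        "<RelationList>"
        + "<ExplicitRelation connHead='because' rawText='because'>"
        + "<Arg1 rawText='a'/><Arg2 rawText='b'/></ExplicitRelation>"
        + "<ExplicitRelation connHead='but' rawText='but'>"
        + "<Arg1 rawText='c'/><Arg2 rawText='d'/></ExplicitRelation>"
        + "<ImplicitRelation conn1='instead'>"
        + "<Arg1 rawText='e'/><Arg2 rawText='f'/></ImplicitRelation>"
        + "</RelationList>";

    // Hypothetical helper: evaluates the expression over the toy document
    // and returns how many nodes it selects.
    static int countMatches(String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression,
                new InputSource(new StringReader(TOY_XML)),
                XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // One explicit relation in the toy data is headed by "because".
        System.out.println(countMatches("//ExplicitRelation[@connHead='because']")); // prints 1
        // One implicit relation has "instead" as conn1.
        System.out.println(countMatches("//ImplicitRelation[@conn1='instead']"));    // prints 1
    }
}
```

Inside Java the expression is an ordinary string, so the outer double quotes needed on the command line are replaced by the string literal's own quotes.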
If an Element is selected by an expression, the closest ancestor (or the node itself) which is a child of PDTBRelationList will be output. This is so that the results can be loaded in the browser. If the -c option is specified, a total count of the objects selected is produced. This will usually be greater than or equal to the number of relations saved in the OutputRoot.
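The difference between the count and the number of saved relations can be sketched as follows. This is not the PDTB implementation: it uses the JDK's built-in DOM and XPath classes on a toy document, and the class AncestorOutputSketch and method countSelectedAndSaved are hypothetical names. The toy root is called RelationList, following the element table. For each selected node, the sketch climbs to the ancestor that is a direct child of the root and collects those ancestors in a set, so two arguments of one relation count twice but are saved once.

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AncestorOutputSketch {
    // Toy data, NOT real PDTB annotations.
    static final String TOY_XML =
        "<RelationList>"
        + "<ExplicitRelation connHead='because'>"
        + "<Arg1 rawText='a'/><Arg2 rawText='b'/></ExplicitRelation>"
        + "<ImplicitRelation conn1='instead'>"
        + "<Arg1 rawText='c'/><Arg2 rawText='d'/></ImplicitRelation>"
        + "</RelationList>";

    // Returns {number of nodes selected, number of distinct top-level relations}.
    static int[] countSelectedAndSaved(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(TOY_XML)));
            Node root = doc.getDocumentElement();
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);

            Set<Node> relations = new LinkedHashSet<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                // Climb until the node is a direct child of the root element.
                while (node != null && node.getParentNode() != root) {
                    node = node.getParentNode();
                }
                if (node != null) {
                    relations.add(node); // the same relation is stored only once
                }
            }
            return new int[] { hits.getLength(), relations.size() };
        } catch (Exception e) {
            throw new IllegalStateException("Query failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        int[] result = countSelectedAndSaved("//Arg1 | //Arg2");
        // Four arguments are selected, but they belong to only two relations.
        System.out.println(result[0] + " selected, " + result[1] + " saved"); // prints "4 selected, 2 saved"
    }
}
```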
Here is a sample command line invocation for the Bourne shell:
    cd PDTBUser
    java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath \
      -r Corpora/PTB/raw/wsj \
      -p Corpora/PTB/combined/wsj \
      -d Corpora/PDTB/pdtb/wsj \
      -o Corpora/PDTB/PDTBImplicitBecause \
      -x "//ImplicitRelation[@conn1 = 'because']" -c -b

Note that the output directory specified by -o should not exist prior to invocation.
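If the query is launched from another Java program rather than a shell, the same invocation could be assembled as an argument list. The sketch below is one possible way to do that, not part of the PDTB distribution: the class QueryCommandSketch and the method buildCommand are hypothetical, and it assumes a modern JDK. The flags and the main class name are the ones documented above. It also checks up front that the output directory does not exist yet.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryCommandSketch {
    // Hypothetical helper: builds the PDTBXPath command line as a list of arguments.
    static List<String> buildCommand(String rawRoot, String ptbRoot, String pdtbRoot,
                                     String outputRoot, String xpath) {
        // The documentation requires that OutputRoot not exist before the run.
        if (Files.exists(Paths.get(outputRoot))) {
            throw new IllegalStateException("OutputRoot already exists: " + outputRoot);
        }
        List<String> command = new ArrayList<>(Arrays.asList(
            "java", "-Xmx300M", "-Xms300M", "-classpath", "pdtb.jar",
            "edu.upenn.cis.pdtb.xpath.PDTBXPath"));
        command.addAll(Arrays.asList(
            "-r", rawRoot, "-p", ptbRoot, "-d", pdtbRoot,
            "-o", outputRoot, "-x", xpath, "-c"));
        return command;
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
            "Corpora/PTB/raw/wsj",
            "Corpora/PTB/combined/wsj",
            "Corpora/PDTB/pdtb/wsj",
            "Corpora/PDTB/PDTBImplicitBecause",
            "//ImplicitRelation[@conn1 = 'because']");
        System.out.println(String.join(" ", command));
        // To actually run it from the PDTBUser directory, one could use:
        // new ProcessBuilder(command).directory(new java.io.File("PDTBUser")).inheritIO().start();
    }
}
```

Because each argument is passed as a separate list element, the XPath expression needs no extra shell quoting here.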
See the XPath recommendation for further info on XPath. For querying the syntax as well, use the API.