This is an API to interact with the Penn Discourse Treebank and Penn Treebank annotations. It provides:

    cd PDTBUser
    java -jar pdtb.jar RawRoot PtbRoot PdtbRoot

The locations RawRoot, PtbRoot, and PdtbRoot are described here. Alternatively, run:

    cd PDTBUser
    java -jar pdtb.jar

You will then get "friendly" prompts for the directories. To program with it, see the API docs.
System requirements: Java™ 1.4.2. The API relies on Java™ for platform independence. However, the portability of component graphics is not always seamless. We test mostly on Linux and marginally on Windows. Problems on other platforms cannot be supported.
Mac users: before version 0.2.4, the bottom left window of the browser appears garbled under the default look and feel (Aqua). This can be fixed without upgrading your version by running:

    cd PDTBUser
    java -Dswing.defaultlaf=javax.swing.plaf.metal.MetalLookAndFeel -jar pdtb.jar

Version 0.2.4 uses this (Metal) look and feel.
Questions/bugs etc:
Contacts:
Nikhil Dinesh - first name followed by d at seas dot upenn dot edu
Geraud Campion - first name followed by at seas dot upenn dot edu
Change log:
edu.upenn.cis.pdtb.util.PDTBTask and edu.upenn.cis.ptb.util.PTBTask should simplify basic access patterns. Some example XPath queries are given below.
edu.upenn.cis.pdtb.xpath.PDTBNavigator and edu.upenn.cis.ptb.xpath.PTBNavigator
edu.upenn.cis.ptb.util.CorpusFileIterator. Used to cause an exception when the Propbank parsed .cmb files appear under PtbRoot.
Added Prev and Next buttons to the browser.
The Browser
Launch the PDTB Browser via Java Web Start.
A screenshot of the browser (parse trees on the top, PDTB annotations bottom left, and the raw text on the bottom right) is given below. Arg1 is highlighted in yellow, Arg2 in blue, the connective in red, and the features associated with the connective can be seen on the bottom left.
The bottom left window can be used to select connectives and arguments. Clicking on a connective shows the features associated. Double clicking toggles between showing and hiding the arguments. Clicking on an argument will show the features. The spans associated with attribution will be added in the second release of the PDTB, and so clicking on the features has no effect.
Nodes in the parse tree can be collapsed for viewing without scrolling. For example, the PP-LOC node in Arg1 (in yellow) has been collapsed. Clicking on a parse tree node toggles between expanding and collapsing.
The combo boxes (drop-down lists) in the middle correspond to the section and file numbers (on the left and right respectively). Only files under PdtbRoot that have associated files in PtbRoot and RawRoot can be loaded (the boxes will let you select only valid files). These directories are supplied as command-line arguments, as mentioned above. When the Load button is clicked, the files corresponding to the section and file numbers selected in the combo boxes are loaded as a tab. The Close Tab button closes the tab currently in focus.
The buttons Prev and Next switch to the previous and next files (if any) from the file denoted by the combo boxes. These buttons were added in v0.2.7.
The button New Query brings up a Query Window, allowing you to build a PDTBXPath search on the entire corpus.
XPath Support
The API supports XPath queries on the PDTB annotations and the PTB annotations independently. For joint queries one needs to query the PDTB, and use the results to query the PTB. XPath support is achieved via Jaxen (Jaxen is bundled with the distributions and a separate download is not required).
This is not intended as a replacement for tgrep, or various other query tools written for the PTB. The mechanisms used here are significantly slower, as each file will have to be queried independently. The advantage that this API offers is a more accessible programming model.
As of v0.2.3, the query speeds are about 4 times faster on average. The following queries were run (v0.2.3) on a machine with an Intel Pentium4 (at 2.4GHz, 1G memory) processor, running Suse Linux (total counts of results were produced):
>::ExplicitRelation[@connHead='because'] (3 secs without loading syntax, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query child::ExplicitRelation[@connHead='because']
=>>::S/>::NP[->::VP] (7 secs, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query descendant-or-self::S/child::NP[following::VP]
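To make the axis semantics of the second query concrete, here is a minimal sketch that evaluates the standard XPath form on a tiny hand-written tree. It is only an illustration: it uses the JDK's built-in javax.xml.xpath on toy XML, not the PTBNavigator or Jaxen setup that the API itself uses, and the class name ToyAxisQuery and the toy trees are assumptions made up for this example. It requires a modern JDK rather than Java 1.4.2.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the PDTB distribution.
final class ToyAxisQuery {
    // The standard XPath form of the second query quoted above.
    static final String QUERY = "descendant-or-self::S/child::NP[following::VP]";

    /** Counts the nodes selected by the expression in a toy XML tree. */
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // An NP child of S with a VP later in document order: selected.
        System.out.println(count("<S><NP/><VP><NP/></VP></S>", QUERY));
        // The only NP child of S comes after the VP: nothing is selected.
        System.out.println(count("<S><VP/><NP/></S>", QUERY));
    }
}
```

In the first tree only the NP directly under S is a candidate, because the other NP is a child of VP; the predicate then keeps it because a VP follows it in document order.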
See the API docs of the following packages for more info:
edu.upenn.cis.pdtb.xpath and edu.upenn.cis.ptb.xpath
PDTBXPath
A simple top-level query interface to the PDTB (examples are given toward the bottom of the page). Usage is as follows:

    cd PDTBUser
    java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath args

The arguments are as follows:
    --rawRoot RawRoot (or -r RawRoot)
    --ptbRoot PtbRoot (or -p PtbRoot)
    --pdtbRoot PdtbRoot (or -d PdtbRoot)
    --outputRoot OutputRoot (or -o OutputRoot). This will serve as the result PdtbRoot. OutputRoot should not exist when this is run.
    --xpath XPathExpression (or -x XPathExpression)
    -c (generates total counts in addition to the files)
    -b (opens results in the browser. The saved results can always be opened in the browser at a later time by specifying OutputRoot as the PdtbRoot argument to the browser.)
Or modify this Perl script appropriately, and place it in the PDTBUser directory. Here is another Perl script with example queries, and some slides explaining the overall design. These files use the standard XPath syntax.
We have added extensions to XPath specific to the PDTB. Since this is a growing list, we have moved it to a separate PDTB XPath Extensions page. Scripts containing example queries are now part of the user distribution.
Users unfamiliar with Java should go here for a tutorial on how to run Java programs. The most common error is setting the classpath wrong, which results in output of the form:
Exception in thread "main" java.lang.NoClassDefFoundError: edu.upenn.cis.pdtb.xpath.PDTBXPath
Please make sure that the classpath includes the "pdtb.jar" file in the PDTBUser directory. That is, if you run the command from a directory other than PDTBUser, the option should be specified as -classpath "....path to...pdtb.jar" .
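One quick way to check the classpath before running a query is to ask the JVM whether it can locate the class at all. The following is a minimal sketch, not part of the PDTB distribution: the class name ClasspathCheck and its method are hypothetical, and it assumes a modern JDK rather than Java 1.4.2.

```java
// Hypothetical diagnostic helper; not part of the PDTB distribution.
final class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the query class named in the error message above.
        String name = args.length > 0 ? args[0] : "edu.upenn.cis.pdtb.xpath.PDTBXPath";
        System.out.println(name
                + (isLoadable(name) ? " found" : " NOT found; check the -classpath option"));
    }
}
```

Compile it and run it with the same -classpath value you intend to use for PDTBXPath, plus the directory that holds ClasspathCheck.class. If it reports that the class is not found, the path to pdtb.jar is wrong.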
The XPath expression should select Element nodes listed below. If any other kind of node or object is returned by the query, the program exits. The full XPath functionality can be accessed via the API. The following is the list of Elements, their children, and attributes:
Element QName | Children QNames | Attribute QNames |
RelationList | ExplicitRelation*, ImplicitRelation*, AltLexRelation*, EntityRelation*, NoRelation* | |
ExplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, connHead, sClassA, sClassB?, rawText |
ImplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, conn1, sClass1A, sClass1B?, conn2?, sClass2A?, sClass2B? |
AltLexRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, sClassA, sClassB?, rawText |
EntityRelation | Arg1, Arg2 | |
NoRelation | Arg1, Arg2 | |
Arg1 | | Source, Type, Polarity, Det, rawText |
Arg2 | | Source, Type, Polarity, Det, rawText |
Sup1 | | rawText |
Sup2 | | rawText |
Note that the Source, Type, Polarity, and Det attributes appear on all arguments except those of EntityRelation and NoRelation.
The expression "//ExplicitRelation[@connHead='because']" selects all occurrences of the explicit connective "because". The expression "//ImplicitRelation[@conn1='instead']" selects all implicit relations where "instead" is chosen as conn1. Note that the enclosing quotes are necessary on the command line.
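As a rough illustration of what these two expressions select, the sketch below runs them against a small hand-written XML fragment that borrows the element and attribute names from the table above. Everything else is an assumption: the fragment is toy data rather than the real PDTB file format, most attributes are omitted, ToyRelationQuery is a hypothetical class, and the JDK's built-in javax.xml.xpath stands in for the Jaxen-based PDTBXPath tool.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the PDTB distribution.
final class ToyRelationQuery {
    // Hand-written toy data, NOT the real PDTB file format.
    static final String TOY_XML =
            "<RelationList>"
          + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
          + "<ExplicitRelation connHead='but'><Arg1/><Arg2/></ExplicitRelation>"
          + "<ImplicitRelation conn1='instead'><Arg1/><Arg2/></ImplicitRelation>"
          + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
          + "</RelationList>";

    /** Counts the nodes that a standard XPath expression selects in the given XML. */
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Two of the three explicit relations in the toy data use "because".
        System.out.println(count(TOY_XML, "//ExplicitRelation[@connHead='because']"));
        // One implicit relation has conn1 set to "instead".
        System.out.println(count(TOY_XML, "//ImplicitRelation[@conn1='instead']"));
    }
}
```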
If an Element is selected by an expression, the closest ancestor (or the node itself) that is a child of PDTBRelationList is output, so that the results can be loaded in the browser. If the -c option is specified, a total count of the objects selected is produced. This count will usually be greater than or equal to the number of relations saved in OutputRoot.
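The gap between the count and the number of saved relations arises because several selected nodes can share one enclosing relation. The sketch below mimics that mapping on toy XML; it is an assumption about the behaviour described here, not the tool's actual code. The class CountVsSaved, the toy data, and the use of the JDK's javax.xml.xpath instead of Jaxen are all illustrative choices.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the PDTB distribution.
final class CountVsSaved {
    // Hand-written toy data, NOT the real PDTB file format.
    static final String TOY_XML =
            "<RelationList>"
          + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
          + "<ImplicitRelation conn1='instead'><Arg1/><Arg2/></ImplicitRelation>"
          + "</RelationList>";

    /** Returns {number of selected nodes, number of distinct enclosing relations}. */
    static int[] selectedAndSaved(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            Element root = doc.getDocumentElement();
            // Identity-based set: the same relation node is stored only once.
            Set<Node> saved = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                // Climb to the closest ancestor-or-self that is a child of the root list.
                while (n != null && !root.isSameNode(n.getParentNode())) {
                    n = n.getParentNode();
                }
                if (n != null) {
                    saved.add(n);
                }
            }
            return new int[] { hits.getLength(), saved.size() };
        } catch (Exception e) {
            throw new IllegalStateException("query failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Four arguments are selected, but they belong to only two relations.
        int[] r = selectedAndSaved(TOY_XML, "//Arg1 | //Arg2");
        System.out.println("selected=" + r[0] + " saved=" + r[1]);
    }
}
```

Selecting relations directly, for example with //ExplicitRelation, makes the two numbers coincide, since each selected node is already its own top-level relation.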
Here is a sample command line invocation for the Bourne shell:

    cd PDTBUser
    java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath \
        -r Corpora/PTB/raw/wsj \
        -p Corpora/PTB/combined/wsj \
        -d Corpora/PDTB/pdtb/wsj \
        -o Corpora/PDTB/PDTBImplicitBecause \
        -x "//ImplicitRelation[@conn1 = 'because']" -c -b

Note that the output directory specified by -o should not exist prior to invocation.
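If you would rather drive the tool from Java than from a shell or Perl script, the same invocation can be assembled as an argument list. This is a minimal sketch under stated assumptions: PdtbXPathCommand is a hypothetical helper, it uses only the flags documented above, and it targets a modern JDK. It refuses to build the command when the output directory already exists, mirroring the rule for -o.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of the PDTB distribution.
final class PdtbXPathCommand {

    /** Builds the argument list for a PDTBXPath run, using only the documented flags. */
    static List<String> build(String rawRoot, String ptbRoot, String pdtbRoot,
                              String outputRoot, String xpath,
                              boolean counts, boolean browser) {
        // The tool expects OutputRoot not to exist before it runs.
        if (Files.exists(Paths.get(outputRoot))) {
            throw new IllegalArgumentException("OutputRoot must not exist yet: " + outputRoot);
        }
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-Xmx300M", "-Xms300M", "-classpath", "pdtb.jar",
                "edu.upenn.cis.pdtb.xpath.PDTBXPath",
                "-r", rawRoot, "-p", ptbRoot, "-d", pdtbRoot,
                "-o", outputRoot, "-x", xpath));
        if (counts) {
            cmd.add("-c");
        }
        if (browser) {
            cmd.add("-b");
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("Corpora/PTB/raw/wsj", "Corpora/PTB/combined/wsj",
                "Corpora/PDTB/pdtb/wsj", "Corpora/PDTB/PDTBImplicitBecause",
                "//ImplicitRelation[@conn1 = 'because']", true, true);
        // To actually launch it from the PDTBUser directory, one could use:
        // new ProcessBuilder(cmd).directory(new java.io.File("PDTBUser")).inheritIO().start();
        System.out.println(String.join(" ", cmd));
    }
}
```

Because ProcessBuilder passes each list element to the program as one argument without going through a shell, the XPath expression needs no extra enclosing quotes here.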
See the XPath recommendation for further info on XPath. For querying the syntax as well, use the API.