University of Pennsylvania

Institute For Research in Cognitive Science

TOOLS


The Browser / API (0.2.9)

This is an API to interact with the Penn Discourse Treebank and Penn Treebank annotations. See the PDTB User Manual for terminology and a description of the file formats expected. There are two distributions of this API:

  1. The user distribution - This should suffice for all needs that don't involve modifying source. To run the browser (also available as a Java™ Web Start application here), download and unzip the distribution. Then:
    cd PDTBUser
    java -jar pdtb.jar RawRoot PtbRoot PdtbRoot
    											
    The locations RawRoot, PtbRoot, and PdtbRoot are described here. Alternatively, run:
    cd PDTBUser
    java -jar pdtb.jar
    											
    And you will get "friendly" prompts for the directories. To program with it, see API docs.

  2. Older Distributions:
    The PDTB1 user distribution
    The PDTB1 API docs

  3. The developer distribution - If you need to modify the source. It contains Lex and Yacc specifications which should be reasonably easy to port for APIs in other languages. Also, if you have other classes which you want the PDTB annotation deserialized into, the Yacc specs will need to be modified. An Ant build script is included.
By downloading this, you accept the license.

System requirements: Java™ 1.4.2. The API relies on Java™ for platform independence. However, the portability of component graphics is not always seamless. We test mostly on Linux and marginally on Windows. Problems on other platforms cannot be supported.

Mac Users - Before version 0.2.4, the bottom left window of the browser appears garbled under the default look and feel (Aqua). This can be fixed without upgrading your version by running:

cd PDTBUser
java -Dswing.defaultlaf=javax.swing.plaf.metal.MetalLookAndFeel -jar pdtb.jar
									
Version 0.2.4 uses this (Metal) look and feel.

Questions/bugs etc:

Contacts: Nikhil Dinesh - first name followed by d at seas dot upenn dot edu

          Geraud Campion - first name followed by at seas dot upenn dot edu
									

Change log:

  • 0.2 - Several bug fixes. Some add/delete operations in PTBTreeNode/PDTBNode were buggy. The tree canvas had some erroneous behaviour during mutation and when an expansion state isn't maintained.
  • 0.2.1 - Bug fixes in PDTBXPath. Connectives weren't being sorted on output. Added a regexp function for PDTBXPath.
  • 0.2.2 - Fixed a minor bug in the raw text attribute, an extra space was being prepended. Optimizations to the Penn Treebank portion of the API. It is now 2-3 times faster. The convenience classes: edu.upenn.cis.pdtb.util.PDTBTask and edu.upenn.cis.ptb.util.PTBTask should simplify basic access patterns. Some example XPath queries are given below.
  • 0.2.3 - Some more optimizations, with a two-fold speedup on the PTB side. Some syntactic sugar for the XPath queries, and a little bit of documentation on how to query the PTB. The queries are now (perhaps) fast enough to be useful. Query times are discussed here.
  • 0.2.4 - Browser bug on Macs fixed. It seems to be a Java problem, rather than one with our code base. The browser uses the Metal look and feel now. See note above to fix it without upgrading.
  • 0.2.5 - Some extensions to the PDTBXPath tool, described here. Regression fixes to edu.upenn.cis.pdtb.xpath.PDTBNavigator and edu.upenn.cis.ptb.xpath.PTBNavigator
  • 0.2.6 - Added the ability to cache XPath subexpressions on the PDTB, described here. Makes the extensions defined in 0.2.5 more efficient.
  • 0.2.7 - Fixed a bug in edu.upenn.cis.ptb.util.CorpusFileIterator that used to cause an exception when the Propbank parsed .cmb files appear under PtbRoot. Added Prev and Next buttons to the browser.
  • 0.2.8 - Added XPath functions to group explicit connectives into subordinating conjunctions, coordinating conjunctions and adverbials. Described here. A bug in the regex for subordinating conjunctions was fixed on 03-27-2007 (not worth a new version number).
  • 0.2.9 - Some of the span comparison functions were too rigid and crossing was incorrectly defined. Added functionality to state more approximate versions of span queries and corrected crossing. Added a split mode to the browser to compare connectives in the same discourse.

Launch the PDTB Browser via Java Web Start.

A screenshot of the browser (parse trees on the top, PDTB annotations bottom left, and the raw text on the bottom right) is given below. Arg1 is highlighted in yellow, Arg2 in blue, the connective in red, and the features associated with the connective can be seen on the bottom left.

The bottom left window can be used to select connectives and arguments. Clicking on a connective shows the features associated. Double clicking toggles between showing and hiding the arguments. Clicking on an argument will show the features. The spans associated with attribution will be added in the second release of the PDTB, and so clicking on the features has no effect.

Nodes in the parse tree can be collapsed for viewing without scrolling. For example, the PP-LOC node in Arg1 (in yellow) has been collapsed. Clicking on a parse tree node toggles between expanding and collapsing.

The combo boxes (drop down lists) in the middle correspond to the section and file numbers (on the left and right respectively). Only files under PdtbRoot that have associated files in PtbRoot and RawRoot can be loaded (the boxes will let you select only valid files). These roots are supplied as command-line arguments, as mentioned above. When the Load button is clicked, the files corresponding to the section and file numbers selected in the combo boxes are loaded as a tab. The Close Tab button closes the tab currently in focus.

The buttons Prev and Next switch to the previous and next files (if any) from the file denoted by the combo boxes. These buttons were added in v0.2.7.




Graphical Query Support

The button New Query brings up a Query Window, allowing you to build a PDTBXPath search on the entire corpus. Here is a screenshot.


XPath Support

The API supports XPath queries on the PDTB annotations, and PTB annotations independently. For joint queries one needs to query the PDTB, and use the results to query the PTB. XPath support is achieved via Jaxen (Jaxen is bundled with the distributions and a separate download is not required).

This is not intended as a replacement for tgrep, or various other query tools written for the PTB. The mechanisms used here are significantly slower, as each file will have to be queried independently. The advantage that this API offers is a more accessible programming model.

As of v0.2.3, queries run about 4 times faster on average than before. The following queries were run (v0.2.3) on a machine with an Intel Pentium 4 processor (2.4GHz, 1GB of memory) running SUSE Linux; total counts of results were produced:

  1. >::ExplicitRelation[@connHead='because'] (3secs without loading syntax, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query child::ExplicitRelation[@connHead='because']
  2. =>>::S/>::NP[->::VP] (7secs, vmflags: -Xmx300M -Xms300M). This is equivalent to the XPath query descendant-or-self::S/child::NP[following::VP]

We tested 23 PTB queries (from the Bird et al. paper), and they take between 6 and 12 seconds. There is an implementation of Bird et al.'s extensions to XPath, and it is described briefly here. We do not have a command line XPath interface to the PTB yet; some interface work is needed to facilitate browsing results. Contributions welcome. The command line XPath tool for the PDTB, and the syntactic sugar used, are described below.
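
The two equivalences listed above can be reproduced with a small textual rewrite. The following is only a sketch: the class name AxisSugar is hypothetical, it covers just the three abbreviations that appear in the examples (>::, =>>:: and ->::), and it is not the parser used by the API. Plain string replacement would also rewrite these tokens inside quoted string literals, which a real implementation would need to avoid.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: expands the three abbreviated axes from the examples above. */
final class AxisSugar {
    // Longest tokens first, so "=>>::" and "->::" are rewritten before the ">::" they contain.
    private static final Map<String, String> AXES = new LinkedHashMap<>();
    static {
        AXES.put("=>>::", "descendant-or-self::");
        AXES.put("->::", "following::");
        AXES.put(">::", "child::");
    }

    /** Returns the query with each abbreviated axis replaced by its standard XPath name. */
    static String expand(String query) {
        String result = query;
        for (Map.Entry<String, String> axis : AXES.entrySet()) {
            result = result.replace(axis.getKey(), axis.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand(">::ExplicitRelation[@connHead='because']"));
        System.out.println(expand("=>>::S/>::NP[->::VP]"));
    }
}
```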

See the API docs of the following packages for more info:

edu.upenn.cis.pdtb.xpath and edu.upenn.cis.ptb.xpath
			


PDTBXPath

A simple top level query interface to the PDTB (examples are given toward the bottom of the page). Usage is as follows:
   cd PDTBUser
   java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath args
			
The arguments are as follows:
   --rawRoot RawRoot (or -r RawRoot)
   --ptbRoot PtbRoot (or -p PtbRoot)
   --pdtbRoot PdtbRoot (or -d PdtbRoot)
   --outputRoot OutputRoot (or -o OutputRoot. This will serve as the result PdtbRoot.
                            OutputRoot should not exist when this is run)
   --xpath XPathExpression (or -x XPathExpression)
   -c (generates total counts in addition to the files)
   -b (opens results in the browser. The saved results can always be opened in
       the browser at a later time, by specifying OutputRoot as the PdtbRoot 
       argument to the browser.)
			

Or modify this Perl script appropriately, and place it in the PDTBUser directory. Here is another Perl script with example queries, and some slides explaining the overall design. These files use the standard XPath syntax.

We have added extensions to XPath specific to the PDTB. Since this is a growing list, we have moved it to a separate PDTB XPath Extensions page. Scripts containing example queries are now part of the user distribution.

Users unfamiliar with Java should go here for a tutorial on how to run Java programs. The most common error is setting the classpath wrong, which results in output of the form:

  Exception in thread "main" java.lang.NoClassDefFoundError: edu.upenn.cis.pdtb.xpath.PDTBXPath
			

Please make sure that the classpath includes the "pdtb.jar" file in the PDTBUser directory. That is, if you run the command from a directory other than PDTBUser, the option should be specified as -classpath "....path to...pdtb.jar".

The XPath expression should select Element nodes listed below. If any other kind of node or object is returned by the query, the program exits. The full XPath functionality can be accessed via the API. The following is the list of Elements, their children, and attributes:

Element QName    | Children QNames          | Attribute QNames
RelationList     | ExplicitRelation*, ImplicitRelation*, AltLexRelation*, EntityRelation*, NoRelation* | (none)
ExplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, connHead, sClassA, sClassB?, rawText
ImplicitRelation | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, conn1, sClass1A, sClass1B?, conn2?, sClass2A?, sClass2B?
AltLexRelation   | Sup1?, Arg1, Arg2, Sup2? | Source, Type, Polarity, Det, sClassA, sClassB?, rawText
EntityRelation   | Arg1, Arg2               | (none)
NoRelation       | Arg1, Arg2               | (none)
Arg1             | (none)                   | Source, Type, Polarity, Det, rawText
Arg2             | (none)                   | Source, Type, Polarity, Det, rawText
Sup1             | (none)                   | rawText
Sup2             | (none)                   | rawText

Note that the Source, Type, Polarity, and Det attributes appear on all arguments except those of EntityRelation and NoRelation.

The expression "//ExplicitRelation[@connHead='because']" selects all occurrences of the explicit connective "because". The expression "//ImplicitRelation[@conn1='instead']" selects all implicit relations where "instead" is chosen as conn1. Note that the enclosing quotes are necessary on the command line.
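
To see what these two expressions match, they can be tried against a tiny hand-written XML fragment shaped like the element list above. This is a sketch under loud assumptions: it uses the JDK's javax.xml.xpath rather than the PDTB API (which evaluates XPath over its own object model via Jaxen), the class name ToyRelationQuery is hypothetical, and the fragment is invented for illustration, not corpus data.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: evaluates the example expressions on a toy document. */
final class ToyRelationQuery {
    // Hand-written stand-in for the element model; attribute values are made up.
    static final String TOY =
          "<RelationList>"
        + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
        + "<ExplicitRelation connHead='but'><Arg1/><Arg2/></ExplicitRelation>"
        + "<ImplicitRelation conn1='instead'><Arg1/><Arg2/></ImplicitRelation>"
        + "</RelationList>";

    /** Returns how many nodes of the toy document the expression selects. */
    static int count(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(TOY)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//ExplicitRelation[@connHead='because']"));
        System.out.println(count("//ImplicitRelation[@conn1='instead']"));
    }
}
```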

If an Element is selected by an expression, the closest ancestor (or the node itself) which is a child of PDTBRelationList will be output. This is so that the results can be loaded in the browser. If the -c option is specified, a total count of the objects selected is produced. This will usually be greater than or equal to the number of relations saved in the OutputRoot.
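
The difference between the count of selected objects and the number of saved relations can be illustrated on a toy document. This sketch again uses the JDK's DOM and XPath classes, not the PDTB API; the class name EnclosingRelations is hypothetical, the fragment is invented, and the root element is named RelationList as in the element list above (the exact root name used by the tool is not confirmed here).

```java
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustrative only: maps selected nodes to their enclosing top-level relation. */
final class EnclosingRelations {
    static final String TOY =
          "<RelationList>"
        + "<ExplicitRelation connHead='because'><Arg1/><Arg2/></ExplicitRelation>"
        + "<EntityRelation><Arg1/><Arg2/></EntityRelation>"
        + "</RelationList>";

    /** Returns {number of selected nodes, number of distinct enclosing relations}. */
    static int[] selectedAndSaved(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(TOY)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            Set<Node> relations = new LinkedHashSet<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                // Climb until the parent is the RelationList element (or we run out of parents).
                while (node.getParentNode() != null
                        && !"RelationList".equals(node.getParentNode().getNodeName())) {
                    node = node.getParentNode();
                }
                if (node.getParentNode() != null) {
                    relations.add(node); // the set removes repeats of the same relation
                }
            }
            return new int[] {hits.getLength(), relations.size()};
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        int[] counts = selectedAndSaved("//ExplicitRelation/*");
        System.out.println(counts[0] + " selected, " + counts[1] + " relation(s) saved");
    }
}
```

Selecting both arguments of one relation yields two selected objects but only one saved relation, which is why the total count can exceed the number of relations in the output.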

Here is a sample command line invocation for the Bourne shell:

     cd PDTBUser
     java -Xmx300M -Xms300M -classpath "pdtb.jar" edu.upenn.cis.pdtb.xpath.PDTBXPath \
             -r Corpora/PTB/raw/wsj \
             -p Corpora/PTB/combined/wsj \
             -d Corpora/PDTB/pdtb/wsj \
             -o Corpora/PDTB/PDTBImplicitBecause \
             -x "//ImplicitRelation[@conn1 = 'because']" -c -b
				
Note that the output directory specified by -o should not exist prior to invocation.
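
When the tool is launched from another Java program, the same invocation can be assembled as an argument list, with the output-directory precondition checked up front. The sketch below is an assumption-laden illustration: the class PdtbXPathCommand and its build method are hypothetical helpers, not part of the distribution, and only flags documented above are used. Because the arguments are passed as a list rather than through a shell, the XPath expression needs no extra quoting.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: builds the PDTBXPath command line shown above. */
final class PdtbXPathCommand {
    /** Fails early if outputRoot exists, since the tool expects it not to. */
    static List<String> build(String rawRoot, String ptbRoot, String pdtbRoot,
                              String outputRoot, String xpath) {
        Path out = Paths.get(outputRoot);
        if (Files.exists(out)) {
            throw new IllegalStateException("OutputRoot already exists: " + out);
        }
        return new ArrayList<>(Arrays.asList(
            "java", "-Xmx300M", "-Xms300M", "-classpath", "pdtb.jar",
            "edu.upenn.cis.pdtb.xpath.PDTBXPath",
            "-r", rawRoot, "-p", ptbRoot, "-d", pdtbRoot,
            "-o", outputRoot, "-x", xpath, "-c"));
    }

    public static void main(String[] args) {
        List<String> command = build(
            "Corpora/PTB/raw/wsj", "Corpora/PTB/combined/wsj", "Corpora/PDTB/pdtb/wsj",
            "Corpora/PDTB/PDTBImplicitBecause", "//ImplicitRelation[@conn1 = 'because']");
        // To actually run it from the PDTBUser directory, one could use:
        // new ProcessBuilder(command).directory(new java.io.File("PDTBUser")).inheritIO().start();
        System.out.println(String.join(" ", command));
    }
}
```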

See the XPath recommendation for further info on XPath. For querying the syntax as well, use the API.


The Annotator

---Running---

Requirements: To run the Annotator, three sets of folders are required:
..raw/section/
..ann/
..comments/
The names of the folders do not matter. You are required to have at least one section folder inside raw. Each of the rawtext files must go into any of the section folders. For the annotation and comments folder, no sections are required during the first run. Any annotation and comment files that are created by this program will parallel the hierarchy of the raw section folders.
Start the program and you will be given a File dialog.
With the above structure, you would select "raw" as the "RawRoot", "ann" as the "AnnRoot", and "comments" as the commentRoot.
The first time you run the program from a new directory, these fields will be blank. After running, your settings will be saved in the same directory as the jar file as AnnSettings.txt.

To change the combo box choices in the program, you can edit Options.cfg. The numbers in the sets of choices represent the zero-based index of the default choice. These numbers are required for the first 8 items. The lists under the items must be in alphabetic order for this to work. If you delete Options.cfg, the jar will revert back to a default Options.cfg inside the jar at:
Annotator.jar\edu\upenn\cis\anntool\settings\Options.cfg
This file will be copied back out to the same directory as the jar file next time you start the Annotator.

---Using---

Now, when the Annotator starts, you should be able to choose a raw text file by selecting its section number and the file name. Click load to load the raw text and all of the annotations for that file.
Now you can create a new relation or select a relation from the list and edit and delete the relation. If you make any changes to a relation, you will not be able to switch relations, load, or exit without saving the relation or canceling the changes.
For editing spans, the colors of the buttons match with the colors of the spans. To create a span, you can select the text and then click the corresponding span button. To select multiple spans, hold down ctrl (Windows/Linux) or Cmd (Mac - untested) or Spacebar (All) between selections. To deselect a spanlist, click the corresponding span button again.
There is one more handy feature that lets you search for a token. As you type, all instances of a whole token will be highlighted in blue-green. You can also add all instances of this token to the relation list as an Explicit connective by clicking the "Add All" Button.
You can save any relation even if you have not completely filled out all of its required values. These relations will show up with a red background in your relation list.
You can also save comments in the big text box in the bottom right corner. These comments will be saved separately from the annotation files for convenience when parsing the annotation files later.

---Annotated File Structure---

Each annotated relation is simply a pipe "|" delimited line with the following format:

Relation Type|Conn Span|Conn Src|Conn Type|Conn Pol|Conn Det|Conn Feat Span|Conn1|SClass1A|SClass1B|Conn2|SClass2A|SClass2B|Sup1 Span|Arg1 Span|Arg1 Src|Arg1 Type|Arg1 Pol|Arg1 Det|Arg1 Feat Span|Arg2 Span|Arg2 Src|Arg2 Type|Arg2 Pol|Arg2 Det|Arg2 Feat Span|Sup2 Span

For relation types that do not have a particular field, these fields are simply left blank

Description of the Annotation File Fields
Relation Type   Explicit, Implicit, AltLex, EntRel, NoRel
Conn Span       Text Span of the Connective
Conn Src        Connective's Source
Conn Type       Connective's Type
Conn Pol        Connective's Polarity
Conn Det        Connective's Determinacy
Conn Feat Span  Connective's Feature Span
Conn1           Explicit Connective / First Implicit Connective
SClass1A        First Semantic Class of the First Connective
SClass1B        Second Semantic Class of the First Connective
Conn2           Second Implicit Connective
SClass2A        First Semantic Class of the Second Connective
SClass2B        Second Semantic Class of the Second Connective
Sup1 Span       Text Span of the First Argument's Supplement
Arg1 Span       Text Span of the First Argument
Arg1 Src        First Argument's Source
Arg1 Type       First Argument's Type
Arg1 Pol        First Argument's Polarity
Arg1 Det        First Argument's Determinacy
Arg1 Feat Span  Text Span of the First Argument's Feature
Arg2 Span       Text Span of the Second Argument
Arg2 Src        Second Argument's Source
Arg2 Type       Second Argument's Type
Arg2 Pol        Second Argument's Polarity
Arg2 Det        Second Argument's Determinacy
Arg2 Feat Span  Text Span of the Second Argument's Feature
Sup2 Span       Text Span of the Second Argument's Supplement
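
A line in this format can be split into its 27 named fields with a few lines of Java. This is a minimal sketch, not the Annotator's own reader: the class name AnnotationLine is hypothetical, the field values in the demo are placeholders (the span syntax is not specified here), and it assumes that field values never contain a literal pipe.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: splits one pipe-delimited annotation line into named fields. */
final class AnnotationLine {
    static final String[] FIELDS = {
        "Relation Type", "Conn Span", "Conn Src", "Conn Type", "Conn Pol", "Conn Det",
        "Conn Feat Span", "Conn1", "SClass1A", "SClass1B", "Conn2", "SClass2A", "SClass2B",
        "Sup1 Span", "Arg1 Span", "Arg1 Src", "Arg1 Type", "Arg1 Pol", "Arg1 Det",
        "Arg1 Feat Span", "Arg2 Span", "Arg2 Src", "Arg2 Type", "Arg2 Pol", "Arg2 Det",
        "Arg2 Feat Span", "Sup2 Span"
    };

    static Map<String, String> parse(String line) {
        // A limit of -1 keeps trailing empty fields, which are common for blank values.
        String[] values = line.split("\\|", -1);
        if (values.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELDS.length + " fields but found " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            record.put(FIELDS[i], values[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Placeholder values only; real files use the tool's own span notation.
        String[] values = new String[FIELDS.length];
        Arrays.fill(values, "");
        values[0] = "EntRel";
        values[14] = "ARG1-SPAN";
        values[20] = "ARG2-SPAN";
        Map<String, String> record = parse(String.join("|", values));
        System.out.println(record.get("Relation Type") + " " + record.get("Arg1 Span"));
    }
}
```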

---Comments File Structure---

The comments files use the Java Properties class (java.util.Properties) to store key-value pairs, where:
key = Relation Type|Conn Span|Arg1 Span|Arg2 Span
value = A Multiline Comment
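
The key and value layout above can be exercised with java.util.Properties directly. This is a sketch, assuming the comments files are written and read with the standard store and load methods; the class name CommentsFileSketch and the placeholder span strings are hypothetical. It shows that a key containing pipes and a multi-line value survive a write and read cycle.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative only: a comment entry keyed by relation type and spans. */
final class CommentsFileSketch {
    /** Builds the key "Relation Type|Conn Span|Arg1 Span|Arg2 Span". */
    static String key(String relationType, String connSpan, String arg1Span, String arg2Span) {
        return String.join("|", relationType, connSpan, arg1Span, arg2Span);
    }

    /** Writes one entry in properties format, reads it back, and returns the comment. */
    static String roundTrip(String key, String comment) {
        try {
            Properties written = new Properties();
            written.setProperty(key, comment);
            StringWriter text = new StringWriter();
            written.store(text, null); // newlines in the value are escaped on output

            Properties read = new Properties();
            read.load(new StringReader(text.toString()));
            return read.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String k = key("Explicit", "CONN-SPAN", "ARG1-SPAN", "ARG2-SPAN");
        System.out.println(roundTrip(k, "first line\nsecond line"));
    }
}
```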

---About---

- The source code is available here: Annotator-src.zip.
- Compiled with Java 1.5.0_17 (for compatibility with most Macs)
- For bugs or feature requests please feel free to e-mail Geraud Campion at geraud@seas.upenn.edu

The Adjudication Functionality of the Annotator

---Running---

Requirements:
  • Annotator.jar
  • Java 1.5 or later.
  • Raw Text Files.
  • One or more sets of Annotation Files to be adjudicated.
To run, the following sets of folders are required:
..raw/section/
..adjudicated/
..ann1/section/
..ann2/section/
..comments/
The names of the folders do not matter. You are required to have at least one section folder inside raw. Each of the rawtext files must go into any of the section folders. Each of the annotation roots must follow the same section/file structure as the raw root. For the adjudication and comments folder, no sections are required during the first run. Any adjudication and comment files that are created by this program will parallel the hierarchy of the raw section folders.
Start the program and you will be given a File dialog as shown:

As you specify the annotation roots, additional text boxes will appear.
With the above structure, you would select "raw" as the "RawRoot", "adjudicated" as the "Output/AdjudicationRoot", "ann1" as the first "AnnRoot", "ann2" as the second "AnnRoot", and "comments" as the commentRoot.
The first time you run the program from a new directory, these fields will be blank. After running, your settings will be saved in the same directory as the jar file as AnnSettings.txt.

To change the combo box choices in the program, you can edit Options.cfg. The numbers in the sets of choices represent the zero-based index of the default choice. These numbers are required for the first 8 items. The lists under the items must be in alphabetic order for this to work. If you delete Options.cfg, the jar will revert back to a default Options.cfg inside the jar at:
Annotator.jar\edu\upenn\cis\anntool\settings\Options.cfg
This file will be copied back out to the same directory as the jar file next time you start the Annotator.

---Using---

When the Annotator starts, you should be able to choose a raw text file by selecting its section number and the file name. Click load to load the raw text and all of the annotations for that file.

You can see in the "Relation List" pane that all annotations are grouped together if they are annotations of similar relations. Relations are considered similar if either of the following conditions holds:
  • Both relations are Explicit and their connective spans start at the same location.
  • Neither relation is Explicit and their relation identifiers start at the same location. The relation identifier is determined as follows:
    • For AltLex relations, the start of the connective span is used.
    • For Implicit, EntRel, and NoRel relations, the start of Arg2 is used.
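
The grouping rule above can be expressed as a key function: two relations are treated as similar exactly when their keys are equal. This is one reading of the rule, not the Annotator's code. The class name SimilarityKey is hypothetical, and span starts are modeled as plain integer offsets, which is an assumption about how locations are represented.

```java
/** Illustrative only: a grouping key for the similarity rule described above. */
final class SimilarityKey {
    /**
     * connStart is the start of the connective span and arg2Start the start of Arg2.
     * Explicit relations are keyed on the connective start; all other types share one
     * key space, using the connective start for AltLex and the Arg2 start otherwise.
     */
    static String key(String relationType, int connStart, int arg2Start) {
        if ("Explicit".equals(relationType)) {
            return "Explicit@" + connStart;
        }
        if ("AltLex".equals(relationType)) {
            return "NonExplicit@" + connStart;
        }
        return "NonExplicit@" + arg2Start; // Implicit, EntRel, NoRel
    }

    static boolean similar(String typeA, int connA, int arg2A,
                           String typeB, int connB, int arg2B) {
        return key(typeA, connA, arg2A).equals(key(typeB, connB, arg2B));
    }

    public static void main(String[] args) {
        System.out.println(similar("Explicit", 10, 40, "Explicit", 10, 99)); // same connective start
        System.out.println(similar("Implicit", 0, 40, "EntRel", 0, 40));     // same Arg2 start
        System.out.println(similar("Explicit", 10, 40, "AltLex", 10, 40));   // Explicit vs non-Explicit
    }
}
```
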
In each grouping, the parent node represents the relation in the adjudication/output and the child nodes are from the source annotation files. There are 4 icons that indicate 6 different scenarios:
  • Parent : Relation does not exist in adjudication file but similar relation exists in annotation file(s)
  • Parent : Relation exists in adjudication file but similar relation does not exist in annotation file(s)
  • Parent : Relation in adjudication file completely matches all similar relations in annotation file(s)
  • Parent : Relation in adjudication file does not completely match all similar relations in annotation file(s)
  • Child : Relation in annotation file completely matches similar relation in adjudication file
  • Child : Relation in annotation file does not completely match similar relation in adjudication file
Other features in the relation list:
  • Red Font: Incomplete relation
  • Black Font : Complete relation
  • The number in parentheses, e.g., (2), represents the matching group
  • The partial file path in the relation represents the name of the root directory for the set of annotations/adjudications that the relation belongs to.
When loading a new file for the first time, groups of similar relations that do not have any conflicts are automatically moved to the output adjudication file. To manually select an annotation and copy it to the output adjudication file, select that relation and click the "Select Annotation" button. You also have the option of "Deleting Annotation" from the adjudication output and dragging an annotation from one group to another.

---Annotated File Structure---

Each annotated relation is simply a pipe "|" delimited line with the following format:

Relation Type|Conn Span|Conn Src|Conn Type|Conn Pol|Conn Det|Conn Feat Span|Conn1|SClass1A|SClass1B|Conn2|SClass2A|SClass2B|Sup1 Span|Arg1 Span|Arg1 Src|Arg1 Type|Arg1 Pol|Arg1 Det|Arg1 Feat Span|Arg2 Span|Arg2 Src|Arg2 Type|Arg2 Pol|Arg2 Det|Arg2 Feat Span|Sup2 Span

For relation types that do not have a particular field, these fields are simply left blank

Description of the Annotation File Fields
Relation Type   Explicit, Implicit, AltLex, EntRel, NoRel
Conn Span       Text Span of the Connective
Conn Src        Connective's Source
Conn Type       Connective's Type
Conn Pol        Connective's Polarity
Conn Det        Connective's Determinacy
Conn Feat Span  Connective's Feature Span
Conn1           Explicit Connective / First Implicit Connective
SClass1A        First Semantic Class of the First Connective
SClass1B        Second Semantic Class of the First Connective
Conn2           Second Implicit Connective
SClass2A        First Semantic Class of the Second Connective
SClass2B        Second Semantic Class of the Second Connective
Sup1 Span       Text Span of the First Argument's Supplement
Arg1 Span       Text Span of the First Argument
Arg1 Src        First Argument's Source
Arg1 Type       First Argument's Type
Arg1 Pol        First Argument's Polarity
Arg1 Det        First Argument's Determinacy
Arg1 Feat Span  Text Span of the First Argument's Feature
Arg2 Span       Text Span of the Second Argument
Arg2 Src        Second Argument's Source
Arg2 Type       Second Argument's Type
Arg2 Pol        Second Argument's Polarity
Arg2 Det        Second Argument's Determinacy
Arg2 Feat Span  Text Span of the Second Argument's Feature
Sup2 Span       Text Span of the Second Argument's Supplement

---Comments File Structure---

The comments files use the Java Properties class (java.util.Properties) to store key-value pairs, where:
key = Relation Type|Conn Span|Arg1 Span|Arg2 Span
value = A Multiline Comment

---About---

- The source code is available here: Annotator-src.zip.
- Compiled with Java 1.5.0_17 (for compatibility with most Macs)
- For bugs or feature requests please feel free to e-mail Geraud Campion at geraud@seas.upenn.edu

The PTB-Free Browser

The PTB-Free Browser allows you to easily view annotation files without the need to convert the annotation files. The disadvantage of this browser is that the annotations are not aligned with the Penn Treebank.

---Running---

Requirements:
  • noptb.jar
  • Java 1.5 or later
  • Raw Text Files
  • Annotation Files

The Conversion Tool

The Conversion tool is useful if you would like to use the Annotation Tool to annotate files for the PDTB Browser. It lets you convert PDTB Browser files to Annotation files, do annotations with those files, and then convert the Annotated files back to PDTB Browser files.

---Running---

Requirements:
  • Conversion.jar
  • Java 1.5 or later
  • Raw Text Files
  • PTB Files
  • For converting from Browser files to Annotation files:
    • PDTB Files
  • For converting from Annotation files to Browser files:
    • Annotation Files
    • Connective Head File (the default is included with the Conversion jar distribution as "ConnHeads.txt")

---Using---

When you start the Conversion tool, you will have to select which way you want to convert the files, using the radio button.

For converting from the browser files to annotation files, you need to provide, for input, the locations of the Rawtext file root, the PTB file root, and the original pdtb file root.
The files will be output to the AnnRoot location.

For converting from the annotation files to browser files, you need to provide, for input, the locations of the Rawtext file root, the PTB file root, the Annotation file root, and the Connective Head file.
Please note that when converting annotation files to browser files, any relations that are incomplete will be skipped. Incomplete relations are indicated by a red background in the relation list in the Annotator tool.
The files will be output to the New PDTB Root location.
The temporary folder is used for intermediate files and will be deleted automatically at the end of conversion.
The Standoff PTB (SPTB) needs to be created if this is the first time running an Annotation to PDTB file conversion. To create this, you just need to provide an empty folder for SptbRoot. Provide that same location for each successive conversion.

The conversions may take some time after clicking the "convert" button. Conversion from PDTB to Ann takes about 1-2 min. Conversion from Ann to PDTB takes about 3-5 min.
The file locations get saved in a file called "ConvertSettings.txt" in the same directory as Conversion.jar.

---Problems---

During conversion from Annotation files back to PDTB files, if a file exists in the annotation root, but the corresponding file does not exist in the raw root or ptb root, that file is skipped in the conversion (for example, log files or merge files).

There are also a few known cases where the raw or ptb files have errors. These files prevent a conversion of the corresponding pdtb files. These are cases where the text of the rawtext files does not completely match the text of the lexical leaf nodes in the ptb files. If new annotations are necessary for these cases, you can fix the raw files and ptb files yourself or do the annotations by hand. If fixing the raw files and ptb files, please delete the sptb files from previous conversions so that new ones will be created using the fixed files. The following is a list of the known rawtext-ptb problem file pairs:

0004, 0142, 0203, 0285, 0455, 0749, 0998, 1625, 2170, 2312

DO NOT BOTHER ANNOTATING THESE FILES. There is no way to convert them without introducing some non-standardized technique.
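
When preparing a batch of files for annotation, the list above can be used as a simple filter. The following is a sketch: the class name KnownProblemFiles is hypothetical, and it assumes that each file name contains its four-digit file number (for example wsj_0004), which should be checked against the actual file names in use.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: flags files from the known rawtext-ptb problem list. */
final class KnownProblemFiles {
    static final Set<String> IDS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
        "0004", "0142", "0203", "0285", "0455", "0749", "0998", "1625", "2170", "2312")));

    // Exactly four digits, not part of a longer run of digits.
    private static final Pattern FILE_NUMBER = Pattern.compile("(?<!\\d)\\d{4}(?!\\d)");

    /** Returns true if the first four-digit number in the name is on the list. */
    static boolean isKnownProblem(String fileName) {
        Matcher m = FILE_NUMBER.matcher(fileName);
        return m.find() && IDS.contains(m.group());
    }

    public static void main(String[] args) {
        for (String name : new String[] {"wsj_0004", "wsj_0005", "wsj_2312"}) {
            System.out.println(name + (isKnownProblem(name) ? ": skip" : ": ok"));
        }
    }
}
```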

---About---

- The source code is available here: Conversion-src.zip.
- Compiled with Java 1.5.0_17 (for compatibility with most Macs)
- For bugs or feature requests please feel free to e-mail Geraud Campion at geraud@seas.upenn.edu