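This page is intended to serve as minimal documentation for the extensions we make to XPath, and will be expanded as time permits. The extensions so far fall into the following categories: axis syntactic sugar, regular expressions, node type functions, span functions, caching subexpressions, and grouping of explicit connectives.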

We assume familiarity with the overall design. The reader is invited to try out the queries, using the scripts included in the user distribution, while reading this page.

Axis Syntactic Sugar

As of v0.2.3, the following syntactic shortenings are supported:
Standard XPath            Short Form
self::*                   =::*
child::*                  >::*
parent::*                 <::*
descendant::*             >>::*
descendant-or-self::*     =>>::*
ancestor-or-self::*       <<=::*
following::*              ->::*
preceding::*              <-::*
following-sibling::*      $>::*
preceding-sibling::*      <$::*
To select all instances of explicit "because", we can write:

 
>::*[@connHead='because'] 
Simple string substitution using the table above will give you the standard XPath form. A slightly more interesting example is to select all instances of "if" followed (anywhere) by "otherwise", and all instances of "otherwise" preceded by "if".
>::*[(@connHead='if' and 
       $>::*[@connHead='otherwise']) or 
     (@connHead='otherwise' and 
        <$::*[@connHead='if'])]
More complex locational information can be expressed using the span functions.
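The string substitution described above can be sketched in a few lines of Python. This is only an illustration, not the preprocessing the tool itself performs; the names AXIS_SUGAR and expand_axes are hypothetical. The one subtlety is that ">::" also appears at the end of several longer short forms (">>::", "$>::", "->::"), so the substitution is done in a single left-to-right pass rather than with one replace call per table row.

```python
import re

# Short form -> standard XPath axis, copied from the table above.
AXIS_SUGAR = {
    "=::": "self::",
    ">::": "child::",
    "<::": "parent::",
    ">>::": "descendant::",
    "=>>::": "descendant-or-self::",
    "<<=::": "ancestor-or-self::",
    "->::": "following::",
    "<-::": "preceding::",
    "$>::": "following-sibling::",
    "<$::": "preceding-sibling::",
}

# One alternation over all short forms. Scanning left to right in a single
# pass means the ">::" inside ">>::" or "$>::" is never rewritten on its own,
# which is what repeated str.replace calls could do.
_SHORT_FORM = re.compile("|".join(re.escape(s) for s in AXIS_SUGAR))


def expand_axes(query: str) -> str:
    """Replace every short axis form in `query` with its standard XPath axis."""
    return _SHORT_FORM.sub(lambda m: AXIS_SUGAR[m.group(0)], query)


print(expand_axes(">::*[@connHead='because']"))
```

The sketch does not look inside string literals, so a quoted value that happens to contain a short form would also be rewritten.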

Regular Expressions

Regexes are supported via the following function:

boolean regexp(string toMatch,string regex)
This is straightforward to use. The following query selects "if" with "not" in Arg2:
>::*[@connHead='if' and 
     >::Arg2[regexp(@rawText,'.*\\W*not\\W*.*')]]

Valid regexes are those given by the Java standard.
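To get a feel for what the pattern in the query above accepts, it can be tried outside the tool. The sketch below uses Python's re module rather than Java, and rests on two assumptions that are not stated on this page: that the doubled backslash in the query denotes a single regex backslash, and that regexp requires the whole string to match. The regexp function here is a hypothetical stand-in, and the STRICT pattern is a suggested variant, not part of the distribution.

```python
import re

# The pattern from the query above, with "\\W" read as the regex token \W
# (assumption about how the query string is unescaped).
PATTERN = r".*\W*not\W*.*"


def regexp(to_match: str, regex: str) -> bool:
    """Hypothetical stand-in: true iff the whole string matches `regex`."""
    return re.fullmatch(regex, to_match) is not None


print(regexp("if it does not rain", PATTERN))  # True
print(regexp("if nothing changes", PATTERN))   # True: \W* may match nothing
print(regexp("if it rains", PATTERN))          # False

# A stricter variant that requires "not" as a whole word.
STRICT = r".*\bnot\b.*"
print(regexp("if nothing changes", STRICT))    # False
print(regexp("if it does not rain", STRICT))   # True
```

Since \W* can match the empty string, the original pattern accepts any text containing the letters "not", including words such as "nothing"; the word-boundary variant avoids this.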

Node Type Functions

Three functions group the nodes into coarser-grained categories than those given by element names:

boolean is-rel()
boolean is-arg()
boolean is-sup()
These functions return true iff the context node is a relation, an argument, or supplementary information, respectively. The following query selects all supplementary information nodes:
>>::*[is-sup()]

Span Functions

Three functions are provided, which we will discuss in turn: comp-splist (span comparison), splist-size (span list size) and delim-text (text delimiting).

Span Comparison

This section has been modified in version 0.2.9. Please use your local copy of the documentation with earlier versions.

This is a little more complex than the previous extensions. We begin with some basic terminology. A span s has two fields s.start and s.end. A span corresponds to a stretch of text selected by an annotator. Given spans s1 and s2 we say that:

s1 contains s2  iff s1.start <= s2.start and s2.end <= s1.end

s1 is contained by s2 iff s2 contains s1

s1 overlaps s2 iff (s1.start <= s2.start and s2.start < s1.end) or (s2.start <= s1.start and s1.start < s2.end)

Overlapping is symmetric.

s1 is identical to s2 iff s1.start = s2.start and s1.end = s2.end

Identity is an equivalence relation.

s1 crosses s2 iff (s1.start < s2.start and s2.start < s1.end and s2.end > s1.end) or 
                  (s2.start < s1.start and s1.start < s2.end and s1.end > s2.end)

Crossing is symmetric.
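The five span relations translate directly into code. The following Python sketch is an illustration only, assuming that start and end are integer offsets; the Span class and function names are hypothetical and are not part of the distribution.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int


def contains(s1: Span, s2: Span) -> bool:
    return s1.start <= s2.start and s2.end <= s1.end


def is_contained_by(s1: Span, s2: Span) -> bool:
    return contains(s2, s1)


def overlaps(s1: Span, s2: Span) -> bool:
    return (s1.start <= s2.start < s1.end) or (s2.start <= s1.start < s2.end)


def identical(s1: Span, s2: Span) -> bool:
    return s1.start == s2.start and s1.end == s2.end


def crosses(s1: Span, s2: Span) -> bool:
    # Each span starts inside the other's extent and neither contains the other.
    return (s1.start < s2.start < s1.end < s2.end) or (
        s2.start < s1.start < s2.end < s1.end
    )


a, b, c = Span(0, 10), Span(5, 15), Span(2, 4)
print(contains(a, c), overlaps(a, b), crosses(a, b), crosses(a, c))
```

Here a contains c, a and b overlap and cross, while a and c overlap without crossing, because c lies wholly inside a.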
The PDTB allows for discontinuous selections of text, so we need to be able to talk about sets of spans rather than single spans. We will call an ordered set of disjoint spans a spanlist. The relations of containment, overlapping and identity can be easily extended to spanlists. Given spanlists L1 and L2 we say that:
L1 contains L2 iff for all s2 in L2: there exists s1 in L1: s1 contains s2

L1 is contained by L2 iff L2 contains L1

L1 overlaps L2 iff there exists s1 in L1 and s2 in L2: s1 overlaps s2

Overlapping remains symmetric.

L1 is identical to L2 iff for all s1 in L1: there exists s2 in L2: s1 is identical to s2 (and vice versa).
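The spanlist relations are the span relations wrapped in "for all" and "there exists". A minimal Python sketch, with spans represented as (start, end) tuples and all function names hypothetical:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end)
SpanList = List[Span]


def span_contains(s1: Span, s2: Span) -> bool:
    return s1[0] <= s2[0] and s2[1] <= s1[1]


def span_overlaps(s1: Span, s2: Span) -> bool:
    return (s1[0] <= s2[0] < s1[1]) or (s2[0] <= s1[0] < s2[1])


def sl_contains(l1: SpanList, l2: SpanList) -> bool:
    # Every span of L2 lies inside some span of L1.
    return all(any(span_contains(s1, s2) for s1 in l1) for s2 in l2)


def sl_is_contained_by(l1: SpanList, l2: SpanList) -> bool:
    return sl_contains(l2, l1)


def sl_overlaps(l1: SpanList, l2: SpanList) -> bool:
    # Some span of L1 overlaps some span of L2.
    return any(span_overlaps(s1, s2) for s1 in l1 for s2 in l2)


def sl_identical(l1: SpanList, l2: SpanList) -> bool:
    # Every span of L1 has an identical span in L2, and vice versa.
    return all(s1 in l2 for s1 in l1) and all(s2 in l1 for s2 in l2)


argument = [(0, 10), (20, 30)]   # a discontinuous selection
connective = [(22, 25)]
print(sl_contains(argument, connective), sl_overlaps(argument, [(10, 20)]))
```

The second result is false because the span (10, 20) falls exactly in the gap between the two pieces of the discontinuous argument.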
Crossing does not extend naturally to spanlists, and we define an approximate notion using ranges. Given a spanlist L1, the range of L1 (range(L1)) is the smallest span s such that for all s1 in L1: s contains s1. We define relations of overlapping, containment, identity and crossing on ranges:
L1 range-overlaps L2 iff range(L1) overlaps range(L2)

L1 range-contains L2 iff range(L1) contains range(L2)

L1 is range-contained by L2 iff range(L1) is contained by range(L2)

L1 is range-identical to L2 iff range(L1) is identical to range(L2)

L1 range-crosses L2 iff range(L1) crosses range(L2)
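The range relations reduce each spanlist to a single span and then apply the span relations. A Python sketch under the same assumptions as before (spans as (start, end) tuples, hypothetical function names):

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end)
SpanList = List[Span]


def span_range(l: SpanList) -> Span:
    """Smallest span that contains every span in the spanlist."""
    return (min(s[0] for s in l), max(s[1] for s in l))


def span_overlaps(s1: Span, s2: Span) -> bool:
    return (s1[0] <= s2[0] < s1[1]) or (s2[0] <= s1[0] < s2[1])


def span_crosses(s1: Span, s2: Span) -> bool:
    return (s1[0] < s2[0] < s1[1] < s2[1]) or (s2[0] < s1[0] < s2[1] < s1[1])


def range_overlaps(l1: SpanList, l2: SpanList) -> bool:
    return span_overlaps(span_range(l1), span_range(l2))


def range_contains(l1: SpanList, l2: SpanList) -> bool:
    r1, r2 = span_range(l1), span_range(l2)
    return r1[0] <= r2[0] and r2[1] <= r1[1]


def range_identical(l1: SpanList, l2: SpanList) -> bool:
    return span_range(l1) == span_range(l2)


def range_crosses(l1: SpanList, l2: SpanList) -> bool:
    return span_crosses(span_range(l1), span_range(l2))


# Two interleaved discontinuous selections that share no text.
l1 = [(0, 5), (20, 25)]
l2 = [(10, 15), (30, 35)]
print(span_range(l1), span_range(l2), range_crosses(l1, l2))
```

In this example the ranges (0, 25) and (10, 35) cross even though no individual span of l1 overlaps a span of l2.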
All that remains is to make these relations available via XPath. This is achieved by the following function:
boolean comp-splist(string comparisonMethod, node-set setOfL2s, string flags?)
The spanlist L1 is given by the context node. We explain the optional flags below. The node-set setOfL2s should be a set of elements that are relations, arguments or supplementary information. The function is true iff there exists L2 in setOfL2s such that L1 comparisonMethod L2 holds. There are two kinds of flags used:

For Arg1, Sup1 and Sup2 there is a unique way to determine the spanlist, i.e., the text selected by the annotator. For relations there are two spanlists that one might want to associate:

Similarly, for Arg2 of a relation there are two spanlists that one might want to associate:

We use the flags uc (union interpretation of context) and uns (union interpretation of node-set) to shift to the union interpretation for L1 and L2 respectively.

We now explain the various pieces with examples from Lee et al. The queries below are simplified versions of the queries in the paper; in practice, several further flags are needed to improve the quality of the selection. Several variants can be found in the scripts that come with the user distribution.

The following query selects relations which share an argument exactly with another relation:

Shared Argument:

>::*[>::*[is-arg() and 
     comp-splist('identity',/>>::*[is-arg()])]]
There are no flags in use here. For the function comp-splist('identity',/>>::*[is-arg()]) the context spanlist L1 is given by the argument of some relation, and it checks if the spanlist L2 of some node in />>::*[is-arg()] (which is the set of arguments in the same discourse) is such that L1 is identical to L2. A node will never be compared to itself, so the trivial case (here) is excluded.
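The existential reading of comp-splist, together with the rule that a node is never compared to itself, can be sketched as follows. This is a simplified model and not the actual implementation: flags are left out, only the 'identity' method is modelled, and the Node class and the METHODS table are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

SpanList = List[Tuple[int, int]]


@dataclass
class Node:
    """Hypothetical stand-in for an XML element with its spanlist."""
    name: str
    spans: SpanList


def sl_identical(l1: SpanList, l2: SpanList) -> bool:
    return all(s in l2 for s in l1) and all(s in l1 for s in l2)


# Only 'identity' is modelled in this sketch.
METHODS: Dict[str, Callable[[SpanList, SpanList], bool]] = {
    "identity": sl_identical,
}


def comp_splist(method: str, context: Node, node_set: List[Node]) -> bool:
    """True iff some node other than the context node satisfies the comparison."""
    compare = METHODS[method]
    return any(
        compare(context.spans, other.spans)
        for other in node_set
        if other is not context  # a node is never compared to itself
    )


rel1_arg1 = Node("Arg1", [(0, 10)])
rel1_arg2 = Node("Arg2", [(12, 20)])
rel2_arg1 = Node("Arg1", [(12, 20)])   # shares its text with rel1_arg2
all_args = [rel1_arg1, rel1_arg2, rel2_arg1]

print(comp_splist("identity", rel1_arg2, all_args))  # True
print(comp_splist("identity", rel1_arg1, all_args))  # False
```

Without the self-exclusion, every argument would trivially be identical to itself and the Shared Argument query would select every relation.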

The following query selects relations which are contained in an argument of another relation (both the container and containee are selected):

Nested Relation:

>::*[comp-splist('is-contained-by',/>>::*[is-arg()], 'uc') or 
     >::*[is-arg() and 
          comp-splist('contains', />>::*[is-rel()], 'uns')]]
The first disjunct comp-splist('is-contained-by',/>>::*[is-arg()], 'uc') selects the contained relation. L1 is the spanlist for the relation, and the flag uc tells us that this spanlist should contain the spanlists of the connective and the arguments. L2 ranges over />>::*[is-arg()], the arguments of the relations in the discourse, and the disjunct holds if L1 is contained by some such L2. The second disjunct is responsible for selecting the container and is symmetric to the first.

The following query selects relations which have an argument properly contained inside an argument of another relation. Both of the relevant relations are selected, and the identical-argument cases are excluded.

Properly Contained Argument:

>::*[>::*[is-arg() and 
          (comp-splist('contains',/>>::*[is-arg()],'ei') or 
           comp-splist('is-contained-by',/>>::*[is-arg()],'ei'))
          ]]

As before, the disjuncts are symmetrical and the flag ei stands for exclude identical.

The following query selects what Lee et al. call pure crossing, where an argument of a connective appears interspersed with material from another relation with no overlap. As before, both relevant relations are selected.

Pure Crossing:

>::*[comp-splist('ranges-crosses',/>>::*[is-rel()],'uc-uns-eo')]
The flag eo stands for exclude overlap, and the flags uc and uns are as before.

Finally, the following query selects relations with a partial overlap of arguments.

Partial Overlap:

>::*[>::*[is-arg() and comp-splist('overlaps',/>>::*[is-arg()],'ec')]]
The flag ec stands for exclude containment, and there is no disjunction because overlapping (like identity) is symmetric.

The available comparison methods are as follows:

The following flags are available:

The flags emp, emcp and emnsp exclude cases only if the parent is a relation. While checking whether a parent or parents match, the flags eo, ec, ei, ero, erc, eri and erx are negated, i.e., the parent matches if there is overlap, containment, identity, range-overlap, range-containment, range-identity or range-crossing.

Queries written in the form above will be somewhat slow. They can be made two to three times faster by caching subexpressions.

Span List Size

number splist-size()
This evaluates to the number of disjoint spans in the context spanlist. For relations, the spanlist used is that of the connective; if the relation has no connective, the function evaluates to -1. The following query selects the explicit relations with discontinuous connectives, such as "either..or".
>::*[splist-size() > 1]
Text Delimiting

string delim-text()
If the context node has multiple spans, then the spans are concatenated with #### as delimiter. This is perhaps useful in conjunction with the regexp function to search for text patterns within a span. The same effect as the previous query can be achieved using:
>::*[contains(delim-text(),'####')]
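The behaviour of splist-size and delim-text can be mimicked in a few lines of Python. This is a sketch only: it assumes that spans are (start, end) character offsets into the raw text with the end exclusive, and that a missing connective is represented by None; the function names are hypothetical.

```python
from typing import List, Optional, Tuple

SpanList = List[Tuple[int, int]]
DELIMITER = "####"


def splist_size(spans: Optional[SpanList]) -> int:
    """Number of disjoint spans, or -1 when there is no spanlist at all."""
    return -1 if spans is None else len(spans)


def delim_text(raw_text: str, spans: SpanList) -> str:
    """Concatenate the text of each span, separated by the delimiter."""
    return DELIMITER.join(raw_text[start:end] for start, end in spans)


raw = "Either we leave now or we stay."
or_start = raw.index("or we")
connective = [(0, 6), (or_start, or_start + 2)]  # "Either" ... "or"

print(splist_size(connective))        # 2
print(delim_text(raw, connective))    # Either####or
print(splist_size(None))              # -1
```

A connective with a single span produces no delimiter in its text, which is why testing for '####' gives the same selection as testing splist-size() > 1.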
Caching Subexpressions

We are often interested in writing queries that select relations based on some other relations in the document. Consider the simple example of selecting "if" which overlaps with "otherwise".

>::*[@connHead='if' and comp-splist('overlaps',/>>::*[@connHead='otherwise'],'uc-uns')]
The subexpression />>::*[@connHead='otherwise'] is constant for a given discourse, but the XPath engine has no way to determine this. As a result, the subexpression is evaluated every time comp-splist is called, which can make things quite slow. We would like to be able to cache expressions when we know they are constant. Something like:
>::*[@connHead='if' and comp-splist('overlaps',cache('/>>::*[@connHead='otherwise']'),'uc-uns')]
However, the subexpression cache('/>>::*[@connHead='otherwise']') has embedded quotes, which is a preprocessing nightmare. For this reason, inside the cache function, quotes should be replaced by curly braces, like so:
>::*[@connHead='if' and comp-splist('overlaps',cache({/>>::*[@connHead={otherwise}]}),'uc-uns')]
Using this for the queries discussed previously makes them two to three times faster. Note that caching should be used only if the embedded XPath is constant for a given discourse, i.e., independent of context. If the embedded query depends, for example, on being the sibling of the context node, caching may give errors. The following are not equivalent:
>>::Arg1[comp-splist('overlaps',$>::Arg2)] (Selects Arg1 which overlaps with Arg2)


>>::Arg1[comp-splist('overlaps',cache({$>::Arg2}))] (Selects nothing)
The second query attempts to select Arg1 nodes which overlap with Arg2 nodes which are following siblings of the RelationList (which has no siblings as it is the root of the tree).
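The idea behind cache can be pictured as memoization keyed on the discourse and the expression text. The Python sketch below is not the tool's implementation: the class, the unbrace helper and the fake evaluator are all hypothetical, and the evaluator is assumed to always run the expression from the root of the discourse, which is why a context-dependent expression gives a different result when cached.

```python
from typing import Callable, Dict, List, Tuple


def unbrace(expr: str) -> str:
    """Turn the curly braces used inside cache(...) back into single quotes."""
    return expr.replace("{", "'").replace("}", "'")


class SubexpressionCache:
    """Evaluate each (discourse, expression) pair once and reuse the result."""

    def __init__(self, evaluate_from_root: Callable[[str, str], List[str]]):
        self._evaluate = evaluate_from_root
        self._results: Dict[Tuple[str, str], List[str]] = {}

    def get(self, discourse_id: str, expr: str) -> List[str]:
        key = (discourse_id, expr)
        if key not in self._results:
            self._results[key] = self._evaluate(discourse_id, expr)
        return self._results[key]


calls = []


def fake_evaluate(discourse_id: str, expr: str) -> List[str]:
    # Stand-in for a real XPath evaluation; it only records that it was called.
    calls.append((discourse_id, expr))
    return ["node-1", "node-2"]


cache = SubexpressionCache(fake_evaluate)
expr = "/>>::*[@connHead='otherwise']"
for _ in range(3):
    cache.get("doc-A", expr)

print(len(calls))  # 1
print(unbrace("{/>>::*[@connHead={otherwise}]}"))
```

Because the cached result never sees the context node, an expression such as $>::Arg2 is resolved once against the root rather than separately for each Arg1.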

Grouping Explicit Connectives

There are three functions for grouping explicit connectives into coarse-grained categories - subordinating conjunctions, coordinating conjunctions and adverbials. It is not always clear which category a connective belongs in, which is why this information does not appear in the files. However, we have found this distinction useful in studying the properties of connectives. The functions are as follows:

Subordinating Conjunctions

boolean is-sc()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(because|although|even though|when|so that|(^|\s)while|if|since|unless|after|until|
   whereas|as$|as though|^till|once|for$|before|lest|except|else|now that).*

Coordinating Conjunctions

boolean is-cc()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(and|or|but|nor)
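The regex above can be tried against a few connective heads outside the tool. The sketch below uses Python's re module in place of Java and assumes that "matches" means the whole connHead string must match; the check that the node is an explicit relation is left out, and the function name is hypothetical.

```python
import re

# Regex for coordinating conjunctions, copied from above.
CC_REGEX = r".*(and|or|but|nor)"


def matches_cc(conn_head: str) -> bool:
    """True iff the whole connective head matches the is-cc regex."""
    return re.fullmatch(CC_REGEX, conn_head) is not None


for head in ["and", "but", "either or", "because", "however"]:
    print(head, matches_cc(head))
```

Under the whole-string assumption, the pattern accepts any head that ends in one of the four conjunctions, so "either or" matches while "because" and "however" do not.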

Adverbials

boolean is-adv()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(instead|otherwise|therefore|as a result|nevertheless|^then|on the other hand|however|in fact|
   further|furthermore|indeed|for example|^though|yet|so|on the contrary|conversely|consequently|
   besides|nonetheless|afterwards|finally|by contrast|in sum|simultaneously|in addition|accordingly|
   thus|overall|in the meantime|meanwhile|in other words|still|previously|as an alternative|specifically|
   in particular|hence|earlier|later|regardless|for instance|in the end|on the other side|by comparison|
   alternatively|in short|rather|ultimately|moreover|likewise|next|similarly|in contrast|thereafter|by then|
   additionally|also|on the whole|plus as well|separately|in turn).*