This page is intended to serve as minimal documentation for the extensions we make to XPath, and will be expanded time permitting. The extensions so far fall into the following categories:

- Axis Syntactic Sugar - Since 0.2.3
- Regular Expressions - Since 0.2.2
- Node Type Functions - Since 0.2.5
- Span Functions - Since 0.2.5. Modified in 0.2.9.
- Caching Subexpressions - Since 0.2.6
- Grouping Explicit Connectives - Since 0.2.8

As of v0.2.3, the following syntactic shortenings are supported:

| Standard XPath | Short Form |
|---|---|
| `self::*` | `=::*` |
| `child::*` | `>::*` |
| `parent::*` | `<::*` |
| `descendant::*` | `>>::*` |
| `descendant-or-self::*` | `=>>::*` |
| `ancestor-or-self::*` | `<<=::*` |
| `following::*` | `->::*` |
| `preceding::*` | `<-::*` |
| `following-sibling::*` | `$>::*` |
| `preceding-sibling::*` | `<$::*` |

For example, the following query selects the children of the context node whose `connHead` is "because":

`>::*[@connHead='because']`

Simple string substitution using the table above gives the standard XPath form, here `child::*[@connHead='because']`. A slightly more interesting example selects all instances of "if" followed (anywhere) by "otherwise", and all instances of "otherwise" preceded by "if":

`>::*[(@connHead='if' and $>::*[@connHead='otherwise']) or (@connHead='otherwise' and <$::*[@connHead='if'])]`

More complex locational information can be expressed using the span functions.
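The string substitution mentioned above can be sketched in a few lines of Python. This is only an illustration of the table, not the tool's own preprocessor: the names `SHORT_AXES` and `expand_axes` are hypothetical, and the sketch makes no attempt to skip short forms that happen to occur inside quoted string literals.

```python
import re

# Short axis forms from the table above, mapped to standard XPath axes.
SHORT_AXES = {
    "=::": "self::",
    ">::": "child::",
    "<::": "parent::",
    ">>::": "descendant::",
    "=>>::": "descendant-or-self::",
    "<<=::": "ancestor-or-self::",
    "->::": "following::",
    "<-::": "preceding::",
    "$>::": "following-sibling::",
    "<$::": "preceding-sibling::",
}

# One combined pattern, longest tokens first, so each axis is consumed whole
# in a single pass. Sequential str.replace calls would corrupt ">>::" by
# rewriting its trailing ">::" first.
_AXIS_RE = re.compile(
    "|".join(re.escape(tok) for tok in sorted(SHORT_AXES, key=len, reverse=True))
)


def expand_axes(query: str) -> str:
    """Replace every short axis form in `query` with its standard XPath axis."""
    return _AXIS_RE.sub(lambda m: SHORT_AXES[m.group(0)], query)


print(expand_axes(">::*[@connHead='because']"))
# child::*[@connHead='because']
```

Building a single alternation keeps the mapping in one place, so adding a further short form would only need a new dictionary entry.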

Regexes are supported via the following function:

`boolean regexp(string toMatch, string regex)`

This is straightforward to use. The following query selects "if" with "not" in Arg2:

`>::*[@connHead='if' and >::Arg2[regexp(@rawText,'.*\\W*not\\W*.*')]]`

Valid regexes are those defined by the Java standard.
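To get a feel for what the pattern in this query accepts, it can be tried out in Python, whose `re` module treats `\W`, `\b`, `.` and `*` the same way for these simple cases. The sketch assumes that `regexp` performs a whole-string match (as Java's `String.matches` does), which the text above does not state, and that the doubled backslashes in the query are an escaping layer. The `STRICT` pattern is a hypothetical variant, not something taken from the distribution.

```python
import re


def regexp(to_match: str, regex: str) -> bool:
    """Whole-string match; mirroring Java's String.matches is an assumption."""
    return re.fullmatch(regex, to_match) is not None


PATTERN = r".*\W*not\W*.*"  # the pattern used in the query above
STRICT = r".*\bnot\b.*"     # hypothetical variant: "not" as a whole word only

for text in ["they did not agree", "nothing changed", "all agreed"]:
    print(text, regexp(text, PATTERN), regexp(text, STRICT))
# they did not agree True True
# nothing changed True False
# all agreed False False
```

Because `\W*` may match the empty string, the original pattern also accepts words that merely contain "not"; a word-boundary pattern is one way to tighten the selection if that matters for a query.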

Three functions group the nodes into more coarse-grained categories than those given by element names:

- `boolean is-rel()`
- `boolean is-arg()`
- `boolean is-sup()`

These functions return true iff the context node is a relation, an argument, or supplementary information, respectively. The following query selects all supplementary information nodes:

`>>::*[is-sup()]`

Three functions are provided (`comp-splist`, `splist-size` and `delim-text`), which we will discuss in turn.

This section has been modified in version 0.2.9. Please use your local copy of the documentation with earlier versions.

This is a little more complex than the previous extensions. We begin with some basic terminology. A span `s` has two fields, `s.start` and `s.end`. A span corresponds to a stretch of text selected by an annotator. Given spans `s1` and `s2`, we say that:

- `s1` contains `s2` iff `s1.start <= s2.start and s2.end <= s1.end`
- `s1` is contained by `s2` iff `s2` contains `s1`
- `s1` overlaps `s2` iff `(s1.start <= s2.start and s2.start < s1.end) or (s2.start <= s1.start and s1.start < s2.end)`. Overlapping is symmetric.
- `s1` is identical to `s2` iff `s1.start = s2.start and s1.end = s2.end`. Identity is an equivalence relation.
- `s1` crosses `s2` iff `(s1.start < s2.start and s2.start < s1.end and s2.end > s1.end) or (s2.start < s1.start and s1.start < s2.end and s1.end > s2.end)`. Crossing is symmetric.

The PDTB allows for discontinuous selections of text, so we need to be able to talk about sets of spans rather than single spans. We will call an ordered set of disjoint spans a spanlist. The relations of containment, overlapping and identity can be easily extended to spanlists. Given spanlists `L1` and `L2`, we say that:

- `L1` contains `L2` iff for all `s2` in `L2` there exists `s1` in `L1` such that `s1` contains `s2`
- `L1` is contained by `L2` iff `L2` contains `L1`
- `L1` overlaps `L2` iff there exist `s1` in `L1` and `s2` in `L2` such that `s1` overlaps `s2`. Overlapping remains symmetric.
- `L1` is identical to `L2` iff for all `s1` in `L1` there exists `s2` in `L2` such that `s1` is identical to `s2` (and vice versa)

Crossing does not extend naturally to spanlists, and we define an approximate notion using ranges. Given a spanlist `L1`, the range of `L1` (`range(L1)`) is the smallest span `s` such that `for all s1 in L1: s contains s1`. We define relations of overlapping, containment, identity and crossing on ranges:

- `L1` range-overlaps `L2` iff `range(L1)` overlaps `range(L2)`
- `L1` range-contains `L2` iff `range(L1)` contains `range(L2)`
- `L1` is range-contained by `L2` iff `range(L1)` is contained by `range(L2)`
- `L1` is range-identical to `L2` iff `range(L1)` is identical to `range(L2)`
- `L1` range-crosses `L2` iff `range(L1)` crosses `range(L2)`

All that remains is to make these relations available via XPath. This is achieved by the following function:

`boolean comp-splist(string comparisonMethod, node-set setOfL2s, string flags?)`

The spanlist `L1` is given by the context node. We will explain the optional use of flags in what follows. The node-set `setOfL2s` should be a set of elements which are relations, arguments or supplementary information. This function is true iff there exists `L2 in setOfL2s` such that `L1 comparisonMethod L2` holds. There are two kinds of flags:
- Flags which change the way a node is associated with a spanlist
- Flags which exclude nodes based on other comparison methods

For Arg1, Sup1 and Sup2 there is a unique way to determine the spanlist, i.e., the text selected by the annotator. For relations there are two spanlists that one might want to associate:

- the spanlist of the connective (for a lexicalized relation), or
- the spanlist of the connective (if any) together with its arguments (the union interpretation)

For Arg2, the remaining node type, there are likewise two choices:

- the spanlist of the text selected, or
- the spanlist of the connective (if any) together with the text selected (the union interpretation)

Use the flags `uc` (union interpretation of context) and `uns` (union interpretation of node-set) to shift to the union interpretation for `L1` and `L2` respectively.
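The span, spanlist and range relations defined above, and the "there exists `L2`" reading of `comp-splist`, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: all function and class names are hypothetical, flags are ignored, the rule that a node is never compared with itself is left out, and only the four non-range comparison methods are wired into the dispatch table.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int


def contains(s1: Span, s2: Span) -> bool:
    return s1.start <= s2.start and s2.end <= s1.end


def overlaps(s1: Span, s2: Span) -> bool:
    return (s1.start <= s2.start < s1.end) or (s2.start <= s1.start < s2.end)


def identical(s1: Span, s2: Span) -> bool:
    return s1.start == s2.start and s1.end == s2.end


def crosses(s1: Span, s2: Span) -> bool:
    return (s1.start < s2.start < s1.end < s2.end) or (
        s2.start < s1.start < s2.end < s1.end
    )


SpanList = List[Span]


def sl_contains(l1: SpanList, l2: SpanList) -> bool:
    # Every span of l2 lies inside some span of l1.
    return all(any(contains(s1, s2) for s1 in l1) for s2 in l2)


def sl_overlaps(l1: SpanList, l2: SpanList) -> bool:
    return any(overlaps(s1, s2) for s1 in l1 for s2 in l2)


def sl_identical(l1: SpanList, l2: SpanList) -> bool:
    # Each span has an identical partner in the other list, in both directions.
    return all(any(identical(s1, s2) for s2 in l2) for s1 in l1) and all(
        any(identical(s1, s2) for s1 in l1) for s2 in l2
    )


def span_range(l: SpanList) -> Span:
    """Smallest span containing every span of the (non-empty) spanlist."""
    return Span(min(s.start for s in l), max(s.end for s in l))


def range_crosses(l1: SpanList, l2: SpanList) -> bool:
    return crosses(span_range(l1), span_range(l2))


METHODS = {
    "overlaps": sl_overlaps,
    "contains": sl_contains,
    "is-contained-by": lambda l1, l2: sl_contains(l2, l1),
    "identity": sl_identical,
}


def comp_splist(method: str, l1: SpanList, set_of_l2s: Iterable[SpanList]) -> bool:
    """True iff some L2 in set_of_l2s satisfies `L1 method L2` (flags omitted)."""
    compare = METHODS[method]
    return any(compare(l1, l2) for l2 in set_of_l2s)


# Two discontinuous selections that interleave without sharing any text.
a = [Span(0, 5), Span(20, 30)]
b = [Span(8, 12), Span(35, 40)]
print(comp_splist("overlaps", a, [b]), range_crosses(a, b))  # False True
```

The final example shows why ranges are useful: the two spanlists do not overlap span by span, yet their ranges cross.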
We now explain the various pieces with examples from Lee et al. The queries below are simplified versions of queries in the paper, and several flags need to be used to improve the quality of the selection. Several variants can be found in the scripts that come with the user distribution.

The following query selects relations which share an argument exactly with another relation:

Shared Argument:

`>::*[>::*[is-arg() and comp-splist('identity',/>>::*[is-arg()])]]`

There are no flags in use here. For the function `comp-splist('identity',/>>::*[is-arg()])` the context spanlist `L1` is given by the argument of some relation, and the function checks whether the spanlist `L2` of some node in `/>>::*[is-arg()]` (which is the set of arguments in the same discourse) is such that `L1 is identical to L2`. A node will never be compared to itself, so the trivial case (here) is excluded.
The following query selects relations which are contained in an argument of another relation (both the container and containee are selected):

Nested Relation:

`>::*[comp-splist('is-contained-by',/>>::*[is-arg()], 'uc') or >::*[is-arg() and comp-splist('contains', />>::*[is-rel()], 'uns')]]`

The first disjunct `comp-splist('is-contained-by',/>>::*[is-arg()], 'uc')` selects the contained relation. `L1` is the spanlist for the relation, and the flag `uc` tells us that the spanlist should contain the spanlists of the connective and the arguments. `L2`, given by `/>>::*[is-arg()]`, is an argument of some relation in the discourse such that `L1 is-contained-by L2`. The second disjunct is responsible for selecting the container and is symmetric to the first part.
The following query selects relations which have an argument properly contained inside an argument of another relation. Both relevant relations are selected, and the identical-argument cases are excluded.

Properly Contained Argument:

`>::*[>::*[is-arg() and (comp-splist('contains',/>>::*[is-arg()],'ei') or comp-splist('is-contained-by',/>>::*[is-arg()],'ei')) ]]`

As before, the disjuncts are symmetrical. The flag `ei` stands for `exclude identical`.
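One way to read an exclusion flag such as `ei` is as a filter on the candidates: a candidate `L2` counts only if the main comparison holds and the excluded relation does not. The Python sketch below illustrates that reading for `'contains'` with `'ei'`. It is an assumption about the semantics, not a description of the actual implementation; spanlists are modelled as lists of `(start, end)` tuples and the name `contains_excluding_identical` is hypothetical.

```python
def contains(l1, l2):
    """Spanlist containment: every span of l2 lies inside some span of l1."""
    return all(any(a <= c and d <= b for (a, b) in l1) for (c, d) in l2)


def identical(l1, l2):
    """Spanlist identity: both lists consist of exactly the same spans."""
    return set(l1) == set(l2)


def contains_excluding_identical(l1, candidates):
    """Sketch of comp-splist('contains', candidates, 'ei')."""
    return any(contains(l1, l2) and not identical(l1, l2) for l2 in candidates)


outer = [(0, 50)]
inner = [(10, 20)]
same = [(0, 50)]

print(contains_excluding_identical(outer, [same]))         # False
print(contains_excluding_identical(outer, [same, inner]))  # True
```

Without the exclusion, the identical candidate alone would already satisfy `contains`, which is exactly the shared-argument case the query wants to leave out.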
The following query selects what Lee et al. call pure crossing, where an argument of a connective appears interspersed with material from another relation with no overlap. As before, both relevant relations are selected.

Pure Crossing:

`>::*[comp-splist('ranges-crosses',/>>::*[is-rel()],'uc-uns-eo')]`

The flag `eo` stands for `exclude overlap`, and the flags `uc` and `uns` are as before.
Finally, the following query selects relations with a partial overlap of arguments.

Partial Overlap:

`>::*[>::*[is-arg() and comp-splist('overlaps',/>>::*[is-arg()],'ec')]]`

The flag `ec` stands for `exclude containment`, and there is no disjunction because overlapping (like identity) is symmetric.

The available comparison methods are as follows:
- overlaps
- contains
- is-contained-by
- identity
- ranges-overlaps
- ranges-contains
- is-range-contained-by
- range-identity
- range-crosses

The available flags are as follows:

- uc - Union interpretation of context
- uns - Union interpretation of node-set
- eo - Exclude overlap
- ec - Exclude containment
- ei - Exclude identity
- ero - Exclude range overlap
- erc - Exclude range contains
- eri - Exclude range identity
- erx - Exclude range crosses
- esp - Exclude siblings and parents from being compared against each other
- emp - Exclude if the parents of the context and the node-set match
- emcp - Exclude if the parent of the context matches the node in the node-set
- emnsp - Exclude if the context node matches the parent of the node in the node-set

The flags `emp`, `emcp` and `emnsp` exclude cases only if the parent is a relation. While checking whether a parent or parents match, the flags `eo`, `ec`, `ei`, `ero`, `erc`, `eri` and `erx` are negated, i.e., the parent matches if there is overlap, containment, identity, range-overlap, range-containment, range-identity and range-crossing, respectively.
The queries written in the form above will be slightly slow. They can be made two to three times faster by caching subexpressions.

`number splist-size()`

This evaluates to the number of disjoint spans in the context spanlist. For relations this is the number of spans in the spanlist of the connective (if any), and `-1` otherwise. The following query selects the explicit relations with discontinuous connectives, such as "either..or":

`>::*[splist-size() > 1]`

`string delim-text()`

If the context node has multiple spans, then the spans are concatenated with `####` as delimiter. This is perhaps useful in conjunction with the `regexp` function to search for text patterns within a span. The same effect as the previous query can be achieved using:

`>::*[contains(delim-text(),'####')]`
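The relationship between the two functions can be illustrated with a small Python sketch. It assumes, purely for illustration, that spans are end-exclusive character offsets into the raw text; the sentence, the offsets and the function names are made up and are not taken from the corpus or the tool.

```python
DELIM = "####"


def splist_size(spans):
    """Number of disjoint spans in the spanlist."""
    return len(spans)


def delim_text(text, spans):
    """Concatenate the text of each (start, end) span, separated by ####."""
    return DELIM.join(text[start:end] for start, end in spans)


sentence = "either we leave now or we miss the train"
connective = [(0, 6), (20, 22)]  # hypothetical offsets of "either" and "or"

print(delim_text(sentence, connective))           # either####or
print(splist_size(connective) > 1)                # True
print(DELIM in delim_text(sentence, connective))  # True
```

A spanlist with a single span produces no delimiter, which is why the two queries above select the same nodes.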

We are often interested in writing queries that select relations based on some other relations in the document. Consider the simple example of selecting "if" which overlaps with "otherwise".

`>::*[@connHead='if' and comp-splist('overlaps',/>>::*[@connHead='otherwise'],'uc-uns')]`

The subexpression `/>>::*[@connHead='otherwise']` is constant for a given discourse, but the XPath engine has no way to determine this. As a result, the subexpression is evaluated every time `comp-splist` is called, which can make things quite slow. We would like the ability to somehow cache expressions when we know they are constant. Something like:

`>::*[@connHead='if' and comp-splist('overlaps',cache('/>>::*[@connHead='otherwise']'),'uc-uns')]`

However, the subexpression `cache('/>>::*[@connHead='otherwise']')` has embedded quotes, which are a preprocessing nightmare. For this reason, inside the `cache` function, quotes should be replaced by curly braces, like so:

`>::*[@connHead='if' and comp-splist('overlaps',cache({/>>::*[@connHead={otherwise}]}),'uc-uns')]`

Using this for the queries discussed previously makes them two to three times faster. Note that caching should be used only if the embedded XPath is constant for a given discourse, i.e., independent of context. If the embedded query depends, for example, on being the sibling of the context node, caching may give errors. The following are not equivalent:

- `>>::Arg1[comp-splist('overlaps',$>::Arg2)]` (selects Arg1 which overlaps with Arg2)
- `>>::Arg1[comp-splist('overlaps',cache({$>::Arg2}))]` (selects nothing)

The second query attempts to select Arg1 nodes which overlap with Arg2 nodes which are following siblings of the RelationList (which has no siblings as it is the root of the tree).
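The difference between a constant and a context-dependent subexpression can be reproduced with Python's standard `xml.etree.ElementTree`, which supports a small subset of standard XPath. This is only an analogy for how such a cache might behave, under the assumption that a cached expression is evaluated once from the root: the miniature document, the element name `Explicit` and the `cache` helper are hypothetical, and a child path stands in for the sibling axis used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the element name "Explicit" is an assumption.
ROOT = ET.fromstring(
    "<RelationList>"
    "<Explicit connHead='if'><Arg1/><Arg2/></Explicit>"
    "<Explicit connHead='otherwise'><Arg1/><Arg2/></Explicit>"
    "</RelationList>"
)

_cache = {}
evaluations = 0  # counts how often a path is actually evaluated


def cache(path):
    """Evaluate `path` once, from the root, and reuse the result afterwards."""
    global evaluations
    if path not in _cache:
        evaluations += 1
        _cache[path] = ROOT.findall(path)
    return _cache[path]


# Constant subexpression: safe to cache, evaluated once however often it is used.
for _relation in ROOT:
    others = cache(".//*[@connHead='otherwise']")

# Context-dependent subexpression: relative to a relation it finds its Arg2,
# but the cached version is evaluated at the root, which has no Arg2 child.
first = ROOT[0]
print(len(first.findall("Arg2")), len(cache("Arg2")))  # 1 0
```

The cached relative path silently returns an empty result rather than raising an error, which mirrors the "selects nothing" behaviour described above.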

There are three functions for grouping explicit connectives into coarse-grained categories - subordinating conjunctions, coordinating conjunctions and adverbials. It is not always clear which category a connective belongs in, which is why this information does not appear in the files. However, we have found this distinction useful in studying the properties of connectives. The functions are as follows:

*Subordinating Conjunctions*

`boolean is-sc()`

Returns true iff the context node is an explicit relation and the `connHead` attribute matches the regex:
`.*(because|although|even though|when|so that|(^|\s)while|if|since|unless|after|until|whereas|as$|as though|^till|once|for$|before|lest|except|else|now that).*`

*Coordinating Conjunctions*

`boolean is-cc()`

Returns true iff the context node is an explicit relation and the `connHead` attribute matches the regex:

`.*(and|or|but|nor)`

*Adverbials*

`boolean is-adv()`

Returns true iff the context node is an explicit relation and the `connHead` attribute matches the regex:
`.*(instead|otherwise|therefore|as a result|nevertheless|^then|on the other hand|however|in fact|further|furthermore|indeed|for example|^though|yet|so|on the contrary|conversely|consequently|besides|nonetheless|afterwards|finally|by contrast|in sum|simultaneously|in addition|accordingly|thus|overall|in the meantime|meanwhile|in other words|still|previously|as an alternative|specifically|in particular|hence|earlier|later|regardless|for instance|in the end|on the other side|by comparison|alternatively|in short|rather|ultimately|moreover|likewise|next|similarly|in contrast|thereafter|by then|additionally|also|on the whole|plus as well|separately|in turn).*`
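As a rough illustration of how such regex-based grouping behaves, the Python sketch below applies the `is-cc` pattern and a deliberately shortened excerpt of the `is-sc` pattern to a few `connHead` values. It assumes whole-string matching in the style of Java's `String.matches`, which is not stated above, and it skips the check that the node is an explicit relation; the function names and the excerpt are hypothetical.

```python
import re

IS_CC = re.compile(r".*(and|or|but|nor)")  # the is-cc pattern given above
# Shortened excerpt of the is-sc pattern, for illustration only.
IS_SC_EXCERPT = re.compile(r".*(because|although|if|unless).*")


def is_cc(conn_head: str) -> bool:
    return IS_CC.fullmatch(conn_head) is not None


def is_sc(conn_head: str) -> bool:
    return IS_SC_EXCERPT.fullmatch(conn_head) is not None


for head in ["and", "either or", "because", "even if"]:
    print(head, is_cc(head), is_sc(head))
# and True False
# either or True False
# because False True
# even if False True
```

Because the patterns begin with `.*`, multi-word connective heads are grouped by the keyword they contain rather than by an exact match on the whole head.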