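This page is intended to serve as minimal documentation for the extensions we make to XPath, and will be expanded time permitting. The extensions so far fall into the following categories: syntactic shortenings, regular expressions, node type functions, span functions, caching subexpressions, and grouping of explicit connectives.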
As of v0.2.3, the following syntactic shortenings are supported:
| Standard XPath | Short Form |
| self::* | =::* |
| child::* | >::* |
| parent::* | <::* |
| descendant::* | >>::* |
| descendant-or-self::* | =>>::* |
| ancestor-or-self::* | <<=::* |
| following::* | ->::* |
| preceding::* | <-::* |
| following-sibling::* | $>::* |
| preceding-sibling::* | <$::* |
For example, the following query selects all instances of "because":
>::*[@connHead='because']
Simple string substitution using the table above will give you the standard XPath form. A slightly more interesting example is to select all instances of "if" followed (anywhere) by "otherwise", and all instances of "otherwise" preceded by "if":
>::*[(@connHead='if' and
$>::*[@connHead='otherwise']) or
(@connHead='otherwise' and
<$::*[@connHead='if'])]
More complex locational information can be expressed using the span functions.
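The string substitution described above can be sketched in a few lines of Python. This is only an illustration, not the tool's own preprocessor: the function name `expand_short_forms` is hypothetical, and the sketch assumes that a short form is always immediately followed by `::` and never occurs inside a quoted string literal. Longer short forms are tried first so that `>>::` is not misread as `>` followed by `>::`.

```python
import re

# Short form -> standard XPath axis, copied from the table above.
SHORT_TO_AXIS = {
    "=": "self",
    ">": "child",
    "<": "parent",
    ">>": "descendant",
    "=>>": "descendant-or-self",
    "<<=": "ancestor-or-self",
    "->": "following",
    "<-": "preceding",
    "$>": "following-sibling",
    "<$": "preceding-sibling",
}

# Longest alternatives first, so ">>::" wins over ">::" at the same position.
_ALTERNATIVES = sorted(SHORT_TO_AXIS, key=len, reverse=True)
_PATTERN = re.compile("(" + "|".join(re.escape(s) for s in _ALTERNATIVES) + ")::")


def expand_short_forms(query: str) -> str:
    """Rewrite every short axis form in `query` to its standard XPath axis."""
    return _PATTERN.sub(lambda m: SHORT_TO_AXIS[m.group(1)] + "::", query)


if __name__ == "__main__":
    print(expand_short_forms(">::*[@connHead='because']"))
    print(expand_short_forms("/>>::*[is-arg()]"))
```

A comparison operator such as the `>` in `splist-size() > 1` is left alone because it is not followed by `::`.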
Regular Expressions
Regexes are supported via the following function:
boolean regexp(string toMatch, string regex)
This is straightforward to use. The following query selects "if" with "not" in Arg2:
>::*[@connHead='if' and
>::Arg2[regexp(@rawText,'.*\\W*not\\W*.*')]]
Valid regexes are those given by the Java standard.
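To see what the pattern in the query above accepts, it can be tried outside the tool. The sketch below uses Python's `re` module rather than Java, and rests on two assumptions that the text does not confirm: that the doubled backslashes in the query stand for single ones, and that `regexp` requires the whole string to match (as Java's `String.matches` does). The helper name `regexp` simply mirrors the XPath function, and `WORD_PATTERN` is a hypothetical alternative, not part of the documented query.

```python
import re

# Pattern from the query above, with "\\W" read as "\W" (assumption).
PATTERN = r".*\W*not\W*.*"

# Hypothetical stricter variant that only accepts "not" as a separate word.
WORD_PATTERN = r".*\bnot\b.*"


def regexp(to_match: str, regex: str) -> bool:
    """Whole-string match (assumed semantics of the XPath regexp function)."""
    return re.fullmatch(regex, to_match) is not None


if __name__ == "__main__":
    for text in ("it is not clear", "nothing changed", "it is clear"):
        print(text, regexp(text, PATTERN), regexp(text, WORD_PATTERN))
```

Because `\W*` may match zero characters, the original pattern accepts any text containing the letters "not", including "nothing". If only the standalone word is wanted, a word-boundary pattern like the second one may be a better fit.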
Node Type Functions
Three functions are used to group the nodes into more coarse grained categories than given by element names:
boolean is-rel()
boolean is-arg()
boolean is-sup()
These functions return true iff the context node is a relation, an argument, or supplementary information, respectively. The following query selects all supplementary information nodes:
>>::*[is-sup()]
Span Functions
Three functions are provided (comp-splist, splist-size and delim-text), which we will discuss in turn.
Span Comparison
This section has been modified in version 0.2.9. Please use your local copy of the documentation with earlier versions.
This is a little more complex than the previous extensions. We begin with some basic terminology.
A span s has two fields s.start and s.end. A span corresponds
to a stretch of text selected by an annotator. Given spans s1 and s2 we
say that:
s1 contains s2 iff s1.start <= s2.start and s2.end <= s1.end
s1 is contained by s2 iff s2 contains s1
s1 overlaps s2 iff (s1.start <= s2.start and s2.start < s1.end) or (s2.start <= s1.start and s1.start < s2.end)
Overlapping is symmetric.
s1 is identical to s2 iff s1.start = s2.start and s1.end = s2.end
Identity is an equivalence relation.
s1 crosses s2 iff (s1.start < s2.start and s2.start < s1.end and s2.end > s1.end) or
(s2.start < s1.start and s1.start < s2.end and s1.end > s2.end)
Crossing is symmetric.
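The span relations above translate directly into code. The following is a minimal sketch, assuming spans are pairs of integer character offsets; the class and method names are illustrative and are not part of the tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int

    def contains(self, other: "Span") -> bool:
        return self.start <= other.start and other.end <= self.end

    def is_contained_by(self, other: "Span") -> bool:
        return other.contains(self)

    def overlaps(self, other: "Span") -> bool:
        return (self.start <= other.start < self.end) or (
            other.start <= self.start < other.end
        )

    def is_identical_to(self, other: "Span") -> bool:
        return self.start == other.start and self.end == other.end

    def crosses(self, other: "Span") -> bool:
        # One span starts first and the other ends last, with a shared stretch.
        return (self.start < other.start < self.end < other.end) or (
            other.start < self.start < other.end < self.end
        )


if __name__ == "__main__":
    a, b, c = Span(0, 10), Span(3, 8), Span(5, 15)
    print(a.contains(b), a.crosses(b))  # b lies inside a, so they do not cross
    print(a.overlaps(c), a.crosses(c))  # a and c share 5..10 and cross
```

Note that under these definitions two spans that merely touch, such as 0..5 and 5..9, do not overlap.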
The PDTB allows for discontinuous selections of text, so we need to be able to talk about sets of
spans rather than single spans. We will call an ordered set of disjoint spans a spanlist. The relations of
containment, overlapping and identity can be easily extended to spanlists. Given spanlists L1
and L2 we say that:
L1 contains L2 iff for all s2 in L2: there exists s1 in L1: s1 contains s2
L1 is contained by L2 iff L2 contains L1
L1 overlaps L2 iff there exists s1 in L1 and s2 in L2: s1 overlaps s2
Overlapping remains symmetric.
L1 is identical to L2 iff for all s1 in L1: there exists s2 in L2: s1 is identical to s2 (and vice versa).
Crossing does not extend naturally to spanlists, and we define an approximate notion using ranges. Given a spanlist L1, the range of L1 (range(L1)) is the smallest span s such that for all s1 in L1: s contains s1. We define relations of overlapping, containment, identity and crossing on ranges:
L1 range-overlaps L2 iff range(L1) overlaps range(L2)
L1 range-contains L2 iff range(L1) contains range(L2)
L1 is range-contained by L2 iff range(L1) is contained by range(L2)
L1 is range-identical to L2 iff range(L1) is identical to range(L2)
L1 range-crosses L2 iff range(L1) crosses range(L2)
All that remains is to make these relations available via XPath. This is achieved by the following function:
boolean comp-splist(string comparisonMethod, node-set setOfL2s, string flags?)
The spanlist L1 is given by the context node. We will explain the optional use of flags in what follows.
The node-set setOfL2s should be a set of Elements, which are relations, arguments and supplementary
information. This function will be true iff there exists L2 in setOfL2s such that
L1 comparisonMethod L2 holds. There are two kinds of flags used:
For Arg1, Sup1 and Sup2 there is a unique way to determine the spanlist, i.e., the text selected by the annotator. For relations there are two spanlists that one might want to associate: the spanlist of the connective alone, or the union of the spanlists of the connective and the arguments. Use the flags uc (union interpretation of context) and uns (union interpretation of node-set) to shift to the union interpretation for L1 and L2 respectively.
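The spanlist and range relations, and the "there exists L2" reading of comp-splist, can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: spans are modelled as `(start, end)` tuples, the node-set is modelled as a plain list of spanlists, flags are ignored, and all Python names are hypothetical. Only the four method names that appear in the queries on this page are registered.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]   # (start, end)
SpanList = List[Span]    # ordered, disjoint spans


def span_contains(s1: Span, s2: Span) -> bool:
    return s1[0] <= s2[0] and s2[1] <= s1[1]


def span_overlaps(s1: Span, s2: Span) -> bool:
    return (s1[0] <= s2[0] < s1[1]) or (s2[0] <= s1[0] < s2[1])


def span_crosses(s1: Span, s2: Span) -> bool:
    return (s1[0] < s2[0] < s1[1] < s2[1]) or (s2[0] < s1[0] < s2[1] < s1[1])


def contains(l1: SpanList, l2: SpanList) -> bool:
    return all(any(span_contains(s1, s2) for s1 in l1) for s2 in l2)


def is_contained_by(l1: SpanList, l2: SpanList) -> bool:
    return contains(l2, l1)


def overlaps(l1: SpanList, l2: SpanList) -> bool:
    return any(span_overlaps(s1, s2) for s1 in l1 for s2 in l2)


def identical(l1: SpanList, l2: SpanList) -> bool:
    return all(s1 in l2 for s1 in l1) and all(s2 in l1 for s2 in l2)


def span_range(l: SpanList) -> Span:
    """Smallest single span that contains every span in the list."""
    return (min(s[0] for s in l), max(s[1] for s in l))


def range_crosses(l1: SpanList, l2: SpanList) -> bool:
    return span_crosses(span_range(l1), span_range(l2))


METHODS: Dict[str, Callable[[SpanList, SpanList], bool]] = {
    "contains": contains,
    "is-contained-by": is_contained_by,
    "overlaps": overlaps,
    "identity": identical,
}


def comp_splist(method: str, l1: SpanList, set_of_l2s: List[SpanList]) -> bool:
    """True iff some L2 other than the context itself satisfies `L1 method L2`."""
    relation = METHODS[method]
    return any(relation(l1, l2) for l2 in set_of_l2s if l2 is not l1)
```

For example, the spanlists `[(0, 4), (10, 14)]` and `[(6, 8), (16, 20)]` do not overlap, but their ranges `(0, 14)` and `(6, 20)` cross, which is the kind of interleaving the range relations are meant to capture.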
We now explain the various pieces with examples from Lee et al. The queries below are simplified versions of queries in the paper, and several flags need to be used to improve the quality of the selection. Several variants can be found in the scripts that come with the user distribution.
The following query selects relations which share an argument exactly with another relation:
Shared Argument:
>::*[>::*[is-arg() and
comp-splist('identity',/>>::*[is-arg()])]]
There are no flags in use here. For the function comp-splist('identity',/>>::*[is-arg()]) the context
splist L1 is given by the argument of some relation, and it checks if the spanlist L2 of
some node in />>::*[is-arg()] (which is the set of arguments in the same discourse) is such that
L1 is identical to L2. A node will never be compared to itself, so the trivial case (here) is
excluded.
The following query selects relations which are contained in an argument of another relation (both the container and containee are selected):
Nested Relation:
>::*[comp-splist('is-contained-by',/>>::*[is-arg()], 'uc') or
>::*[is-arg() and
comp-splist('contains', />>::*[is-rel()], 'uns')]]
The first disjunct comp-splist('is-contained-by',/>>::*[is-arg()], 'uc') selects the contained relation. L1 is the spanlist for the relation, and the flag uc tells us that this spanlist should contain the spanlists of the connective and the arguments. L2 is the spanlist of a node in />>::*[is-arg()], i.e., an argument of some relation in the discourse, such that L1 is-contained-by L2. The second disjunct is responsible for selecting the container and is symmetric to the first part.
The following query selects relations which have an argument properly contained inside an argument of another relation. Both relevant relations are selected, and the identical-argument cases are excluded.
Properly Contained Argument:
>::*[>::*[is-arg() and
(comp-splist('contains',/>>::*[is-arg()],'ei') or
comp-splist('is-contained-by',/>>::*[is-arg()],'ei'))
]]
As before, the disjuncts are symmetrical and the flag ei stands for exclude identical.
The following query selects what Lee et al. call pure crossing, where an argument of a connective appears interspersed with material from another relation with no overlap. As before, both relevant relations are selected.
Pure Crossing:
>::*[comp-splist('ranges-crosses',/>>::*[is-rel()],'uc-uns-eo')]
The flag eo stands for exclude overlap, and the flags uc and
uns are as before.
Finally, the following query selects relations with a partial overlap of arguments.
Partial Overlap:
>::*[>::*[is-arg() and comp-splist('overlaps',/>>::*[is-arg()],'ec')]]
The flag ec stands for exclude containment, and there is no disjunction
because overlapping (like identity) is symmetric. The available comparison methods are as
follows:
emp, emcp and emnsp exclude cases only if the parent is a relation. When checking whether a parent or parents match, the flags eo, ec, ei, ero, erc, eri and erx are negated, i.e., the parent matches if there is overlap, containment, identity, range-overlap, range-containment, range-identity and range-crossing, respectively.
The queries written in the form above will be slightly slow. They can be made two to three times faster by caching subexpressions.
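One way to read the exclusion flags used in the examples above (ei, eo, ec) is as filters applied on top of the main comparison: the comparison must hold and none of the excluded relations may hold. The sketch below illustrates that reading only. It assumes flags are joined with hyphens as in 'uc-uns-eo', that ec excludes containment in either direction, and that uc and uns affect only how the spanlists are built (so they are ignored here). All Python names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int]
SpanList = List[Span]


def contains(l1: SpanList, l2: SpanList) -> bool:
    return all(any(a[0] <= b[0] and b[1] <= a[1] for a in l1) for b in l2)


def overlaps(l1: SpanList, l2: SpanList) -> bool:
    return any(
        (a[0] <= b[0] < a[1]) or (b[0] <= a[0] < b[1]) for a in l1 for b in l2
    )


def identical(l1: SpanList, l2: SpanList) -> bool:
    return all(a in l2 for a in l1) and all(b in l1 for b in l2)


def parse_flags(flags: str) -> Set[str]:
    """Split a flag string such as 'uc-uns-eo' into individual flags."""
    return set(flags.split("-")) if flags else set()


# Exclusion flag -> relation that must NOT hold (assumed reading).
EXCLUSIONS: Dict[str, Callable[[SpanList, SpanList], bool]] = {
    "eo": overlaps,
    "ec": lambda l1, l2: contains(l1, l2) or contains(l2, l1),
    "ei": identical,
}


def holds(relation, l1: SpanList, l2: SpanList, flags: str = "") -> bool:
    """The relation holds and no relation excluded by the flags holds."""
    if not relation(l1, l2):
        return False
    active = parse_flags(flags)
    return not any(test(l1, l2) for flag, test in EXCLUSIONS.items() if flag in active)
```

Under this reading, 'contains' with ei gives proper containment, and 'overlaps' with ec gives partial overlap, matching the Properly Contained Argument and Partial Overlap queries.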
Span List Size
number splist-size()
This evaluates to the number of disjoint spans in the context spanlist. For relations, the spanlist used is that of the connective (if any); the result is -1 otherwise. The following query selects the explicit relations with discontinuous connectives, such as "either..or":
>::*[splist-size() > 1]
Text Delimiting
string delim-text()
If the context node has multiple spans, then the spans are concatenated with #### as delimiter. This is perhaps useful in conjunction with the regexp function to search for text patterns within a span. The same effect as the previous query can be achieved using:
>::*[contains(delim-text(),'####')]
Caching Subexpressions
We are often interested in writing queries that select relations based on some other relations in the document. Consider the simple example of selecting "if" which overlaps with "otherwise".
>::*[@connHead='if' and comp-splist('overlaps',/>>::*[@connHead='otherwise'],'uc-uns')]
The subexpression />>::*[@connHead='otherwise'] is constant for a given discourse, but
the XPath engine has no way to determine this. As a result, the subexpression is re-evaluated every time
comp-splist is called, which can make things quite slow. We would like the ability to
somehow cache expressions when we know they are constant. Something like:
>::*[@connHead='if' and comp-splist('overlaps',cache('/>>::*[@connHead='otherwise']'),'uc-uns')]
However, the subexpression cache('/>>::*[@connHead='otherwise']') has embedded quotes, which
are a preprocessing nightmare. For this reason, inside the cache function, quotes should
be replaced by curly braces, like so:
>::*[@connHead='if' and comp-splist('overlaps',cache({/>>::*[@connHead={otherwise}]}),'uc-uns')]
Using this for the queries discussed previously makes them two to three times
faster. Note that caching should be used only if the embedded XPath is constant for a given discourse, i.e.,
independent of context. If the embedded query depends, for example, on being the sibling of the context node, caching
may give errors. The following are not equivalent:
>>::Arg1[comp-splist('overlaps',$>::Arg2)] (Selects Arg1 which overlaps with Arg2)
>>::Arg1[comp-splist('overlaps',cache({$>::Arg2}))] (Selects nothing)
The second query attempts to select Arg1 nodes which overlap with Arg2 nodes which are following siblings
of the RelationList (which has no siblings as it is the root of the tree).
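The behaviour described above can be pictured with a small sketch. It is not the tool's code: `unbrace`, `ExpressionCache` and the `evaluate_from_root` callback are hypothetical names, and the sketch assumes that a cached expression is evaluated once from the root of the discourse and stored under its text. That assumption is what makes a context-dependent expression such as `$>::Arg2` come back empty: it is evaluated relative to the root, not relative to each Arg1.

```python
from typing import Callable, Dict, List


def unbrace(cached_arg: str) -> str:
    """Turn '{/>>::*[@connHead={otherwise}]}' back into quoted XPath."""
    text = cached_arg.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("expected a brace-delimited expression")
    # Drop the outer braces, then read every inner brace as a quote.
    return text[1:-1].replace("{", "'").replace("}", "'")


class ExpressionCache:
    """Evaluate each distinct expression once per discourse and reuse the result."""

    def __init__(self, evaluate_from_root: Callable[[str], List[str]]):
        self._evaluate = evaluate_from_root
        self._results: Dict[str, List[str]] = {}

    def get(self, expression: str) -> List[str]:
        if expression not in self._results:
            self._results[expression] = self._evaluate(expression)
        return self._results[expression]


if __name__ == "__main__":
    calls = []

    def fake_evaluate(expression: str) -> List[str]:
        # Stand-in for a real XPath evaluation against the discourse root.
        calls.append(expression)
        return ["node-1", "node-2"]

    cache = ExpressionCache(fake_evaluate)
    xpath = unbrace("{/>>::*[@connHead={otherwise}]}")
    cache.get(xpath)
    cache.get(xpath)
    print(xpath, len(calls))  # the expression is evaluated only once
```

Because the cache key carries no information about the context node, any expression whose result depends on that node should be left uncached.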
Grouping Explicit Connectives
There are three functions for grouping explicit connectives into coarse-grained categories - subordinating conjunctions, coordinating conjunctions and adverbials. It is not always clear which category a connective belongs in, which is why this information does not appear in the files. However, we have found this distinction useful in studying the properties of connectives. The functions are as follows:
Subordinating Conjunctions
boolean is-sc()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(because|although|even though|when|so that|(^|\s)while|if|since|unless|after|until| whereas|as$|as though|^till|once|for$|before|lest|except|else|now that).*
Coordinating Conjunctions
boolean is-cc()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(and|or|but|nor)
Adverbials
boolean is-adv()
Returns true iff the context node is an explicit relation and the connHead attribute matches the regex:
.*(instead|otherwise|therefore|as a result|nevertheless|^then|on the other hand|however|in fact| further|furthermore|indeed|for example|^though|yet|so|on the contrary|conversely|consequently| besides|nonetheless|afterwards|finally|by contrast|in sum|simultaneously|in addition|accordingly| thus|overall|in the meantime|meanwhile|in other words|still|previously|as an alternative|specifically| in particular|hence|earlier|later|regardless|for instance|in the end|on the other side|by comparison| alternatively|in short|rather|ultimately|moreover|likewise|next|similarly|in contrast|thereafter|by then| additionally|also|on the whole|plus as well|separately|in turn).*
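To get a feel for how these patterns classify a connHead value, the shortest one (for coordinating conjunctions) can be tried in isolation. The sketch below uses Python's `re` module and assumes whole-string matching, as with Java's `String.matches`; the text does not say which matching mode the tool uses. It also skips the check that the node is an explicit relation, and the function name `is_cc` is only a stand-in for the XPath function.

```python
import re

# Pattern for coordinating conjunctions, copied from the text above.
CC_PATTERN = r".*(and|or|but|nor)"


def is_cc(conn_head: str) -> bool:
    """Whole-string match of connHead against the is-cc pattern (assumed mode)."""
    return re.fullmatch(CC_PATTERN, conn_head) is not None


if __name__ == "__main__":
    for head in ("and", "either or", "because", "for"):
        print(head, is_cc(head))
```

Under whole-string matching the pattern only requires that the value end in one of the four alternatives, so "either or" is accepted and "because" is not. The same reading would also accept "for", since it ends in "or", so results for such heads are worth checking against the tool itself.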