Project Option 2: P2P/XML Data
Sharing
This project is relatively open-ended, with the goal that
you can tailor it slightly based on your preferences.
The general goal is to produce something analogous to an
instant messaging system, except that (1) all messages are in XML form, and (2)
users can create “profiles” expressing what kinds of messages they are
interested in, and they can also pose queries to find new data. All information in the system should be
recorded permanently (unless its original author deletes it).
The system will have the following characteristics:
- It will
build over P2P technologies, which are designed to give you the ability to
do work in a server-free, decentralized manner. Two popular P2P libraries are JXTA, which provides services for
discovering other peers, coordinating between them, etc.; and Pastry, which
provides a globally distributed hash table over peers that may abruptly join
and leave the system. Distributed
hash tables are a useful way of distributing work and
storage – if I can come up with a global name for something, I can put it
in the “DHT” and later retrieve it from any node.
- Profiles
and queries will be expressed using a subset of the XPath language. However, we must allow for the
possibility that not everyone speaks the same “language” (i.e., has the
same idea of how a schema should be formed and how values should be
named). We will adopt a simple
model in which an element may have more than one name, and the “mappings”
between names are defined in so-called “concordance tables” (these are
simply binary tables that say “Terminology 1’s term X ó
Terminology 2’s term Y”). Likewise,
certain values may also have different mappings in different schemas, and
again we capture this using concordance tables.
- We
should also be able to perform keyword queries over the data. The simplest way of supporting keyword
queries is to do the following:
- Stemming: remove suffixes and variant forms of
words; enforce a single case for all text. “Cleared” => “clear”; “ran” => “run”;
etc. Often, it’s sufficient to simply
write an algorithm that removes common suffixes, and then supplement this
with a dictionary that maps nonstandard verb forms.
- Stop words: certain very common words and strings
are virtually useless in searching.
For instance, “a” and “the” appear in virtually every English
language document.
- Inverted index: we can build an index that, for each
word stem, contains a list of all locations where the word appeared. Note that this meshes nicely with the
distributed hash table concept.
- Ranking: in performing keyword matches, it is
important to provide a ranking over the matches. A common ranking metric is the TF/IDF
(term frequency/inter-document frequency) metric.
- There
have been numerous attempts to combine XML query language (e.g., XPath) and keyword search. Extra credit will be given to those who
manage to come up with a solution to the problem of searching for keywords
within a particular XML element or elements.
Useful references:
- The Salton/Buckley paper on TF/IDF ranking.
- Baeza-Yates and Ribeiro-Neto,
Modern Information Retrieval,
Chapters 2 and 8.
- Kowalski
and Maybury, Information
Storage and Retrieval Systems: Theory and Implementation, 2nd
ed.
Available from the library.
- The Galax and Tukwila XML query processors.
- Altinel and Franklin:
Efficient
Filtering of XML Documents for Selective Dissemination of Information.
- Rowstron and P. Druschel,
"Pastry:
Scalable, distributed object location and routing for large-scale
peer-to-peer systems". IFIP/ACM International Conference on
Distributed Systems Platforms (Middleware), Heidelberg,
Germany, pages
329-350, November, 2001.
- Brendon Wilson:
Project JXTA
book online.