Project Option 2: P2P/XML Data Sharing

This project is relatively open-ended, with the goal that you can tailor it slightly based on your preferences.

The general goal is to produce something analogous to an instant messaging system, except that (1) all messages are in XML form, and (2) users can create “profiles” expressing what kinds of messages they are interested in, and they can also pose queries to find new data. All information in the system should be recorded permanently (unless its original author deletes it).

The system will have the following characteristics:

It will build over P2P technologies, which are designed to give you the ability to do work in a server-free, decentralized manner. Two popular P2P libraries are JXTA, which provides services for discovering other peers, coordinating between them, etc.; and Pastry, which provides a globally distributed hash table over peers that may abruptly join and leave the system. Distributed hash tables are a useful way of distributing work and storage – if I can come up with a global name for something, I can put it in the “DHT” and later retrieve it from any node.

Profiles and queries will be expressed using a subset of the XPath language. However, we must allow for the possibility that not everyone speaks the same “language” (i.e., has the same idea of how a schema should be formed and how values should be named). We will adopt a simple model in which an element may have more than one name, and the “mappings” between names are defined in so-called “concordance tables” (these are simply binary tables that say “Terminology 1’s term X ó Terminology 2’s term Y”). Likewise, certain values may also have different mappings in different schemas, and again we capture this using concordance tables.

We should also be able to perform keyword queries over the data. The simplest way of supporting keyword queries is to do the following:

Stemming: remove suffixes and variant forms of words; enforce a single case for all text. “Cleared” => “clear”; “ran” => “run”; etc. Often, it’s sufficient to simply write an algorithm that removes common suffixes, and then supplement this with a dictionary that maps nonstandard verb forms.
Stop words: certain very common words and strings are virtually useless in searching. For instance, “a” and “the” appear in virtually every English language document.
Inverted index: we can build an index that, for each word stem, contains a list of all locations where the word appeared. Note that this meshes nicely with the distributed hash table concept.
Ranking: in performing keyword matches, it is important to provide a ranking over the matches. A common ranking metric is the TF/IDF (term frequency/inter-document frequency) metric.

There have been numerous attempts to combine XML query language (e.g., XPath) and keyword search. Extra credit will be given to those who manage to come up with a solution to the problem of searching for keywords within a particular XML element or elements.

Useful references:

The Salton/Buckley paper on TF/IDF ranking.
Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, Chapters 2 and 8.
Kowalski and Maybury, Information Storage and Retrieval Systems: Theory and Implementation, 2^nd ed. Available from the library.
The Galax and Tukwila XML query processors.
Altinel and Franklin: Efficient Filtering of XML Documents for Selective Dissemination of Information.
Rowstron and P. Druschel, "Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems". IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), Heidelberg, Germany, pages 329-350, November, 2001.
Brendon Wilson: Project JXTA book online.