CIS 550 Course Project

More details on the course project will be made available in the future.

The course project will be assigned to teams of approximately 3 students (depending on the size of the course). There are two alternative projects for this course, one of which is more applications-oriented (building upon database technology), and one of which is more systems-oriented (working under the covers). We are also open to proposals for other projects if they are of a research nature.

Option 1: Feeder

In the tradition of past CIS 550 projects (where we built blog sites and a clone of Google's Gmail), we will be designing a Web site built around one of today's hottest buzzwords: RSS (and its cousin Atom).

As you may know, most news sites today let you "subscribe" to RSS feeds -- XML-based notifications about new documents that appear on the sites. NYTimes.com, news.com, Slashdot, and many blogs all offer this capability.
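To make the "XML-based" part concrete, here is a minimal sketch of reading the articles out of an RSS 2.0 feed. It is written in Python for brevity (the project itself is servlet-based), the sample feed and its example.com URLs are made up, and the function name parse_rss is only illustrative. A real feed would be fetched over HTTP and would carry more fields (description, publication date, etc.).

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return (channel_title, [(item_title, item_link), ...]) for an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in channel.findall("item")
    ]
    return channel.findtext("title", default=""), items


channel_title, articles = parse_rss(SAMPLE_FEED)
```

Each (title, link) pair is what an aggregator would show to the user as a clickable article title.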

Many RSS readers exist for the desktop. Likewise, "RSS Aggregators" are available on the Web -- one instance is the "My Yahoo" service. Here, users can log in, subscribe to a series of "RSS Channels", and see a list of current articles from all of the sites. They can click on a given article's title and be brought to it.

We will be taking the RSS aggregator one step further, combining it with Gmail-like capabilities. When a new article is posted on a channel, it should immediately be "crawled" and text-indexed, so that users may search their RSS feeds for data.
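Since an RSS feed is something the reader fetches, one way to notice a new article is to poll each channel and compare the result against what has already been indexed. The sketch below (in Python, for brevity) assumes that an article is identified by its link and that the set of seen links fits in memory; both are simplifying assumptions, and the names are hypothetical.

```python
def find_new_articles(seen_links, fetched_items):
    """Return items whose link has not been seen before, and record them as seen.

    seen_links: set of article URLs already indexed (mutated in place).
    fetched_items: list of (title, link) pairs from the latest poll of a channel.
    """
    new_items = []
    for title, link in fetched_items:
        if link not in seen_links:
            seen_links.add(link)
            new_items.append((title, link))
    return new_items


seen = set()
first_poll = [("First story", "http://example.com/1")]
second_poll = first_poll + [("Second story", "http://example.com/2")]

to_index_first = find_new_articles(seen, first_poll)    # everything is new
to_index_second = find_new_articles(seen, second_poll)  # only the second story
```

Only the items returned by each poll would then be handed to the crawler and indexer.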

You will manage multiple users' subscriptions, keeping them confidential from other users (perhaps the boss shouldn't know about the number of hours spent reading craigslist). You will also need to do the following:

Build a servlet-based application that runs on Apache Tomcat (or the equivalent).
Develop user account and document storage capabilities using a relational database, e.g., Oracle or MySQL.
Develop XML processing routines to combine and format the list of RSS articles, channels, etc.
Develop a simple Web crawler that can go to a URL, download the text file, and parse it into keywords.
Develop SQL-based inverted-index search capabilities that let a user search only the documents from his/her own channels.
When displaying articles, use XML and style sheets to format their appearance.
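The keyword-parsing and search items above can be sketched roughly as follows. This is only one possible design, written in Python with an in-memory SQLite database standing in for Oracle or MySQL. The table names (subscriptions, documents, postings), the tokenizer, and the sample data are all assumptions made for illustration.

```python
import re
import sqlite3


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE subscriptions (user_id TEXT, channel_id TEXT,
                                PRIMARY KEY (user_id, channel_id));
    CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, channel_id TEXT, title TEXT);
    CREATE TABLE postings (keyword TEXT, doc_id INTEGER);  -- the inverted index
    CREATE INDEX postings_by_keyword ON postings (keyword);
""")


def index_document(doc_id, channel_id, title, body):
    """Store a document and one posting per distinct keyword it contains."""
    db.execute("INSERT INTO documents VALUES (?, ?, ?)", (doc_id, channel_id, title))
    db.executemany(
        "INSERT INTO postings VALUES (?, ?)",
        [(word, doc_id) for word in sorted(set(tokenize(title + " " + body)))],
    )


def search(user_id, keyword):
    """Return titles of matching documents from the user's own channels only."""
    rows = db.execute(
        """
        SELECT d.title
        FROM postings p
        JOIN documents d ON d.doc_id = p.doc_id
        JOIN subscriptions s ON s.channel_id = d.channel_id
        WHERE p.keyword = ? AND s.user_id = ?
        ORDER BY d.doc_id
        """,
        (keyword.lower(), user_id),
    ).fetchall()
    return [title for (title,) in rows]


db.executemany("INSERT INTO subscriptions VALUES (?, ?)",
               [("alice", "nyt"), ("bob", "slashdot")])
index_document(1, "nyt", "Election results", "Turnout was high in the election")
index_document(2, "slashdot", "Kernel released", "Election of a new maintainer")

alice_hits = search("alice", "election")  # only the document from Alice's channel
bob_hits = search("bob", "election")      # only the document from Bob's channel
```

The join against subscriptions is what keeps one user's search from reaching documents in channels that only other users follow.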

Additionally, there are many opportunities to add extra capabilities such as "word stemming" (looking for different forms of a keyword, e.g., run vs. ran vs. running), ranking of answers (as with Google), generating "meta-feeds" that combine information from other feeds, tracking popular feeds, etc.
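As a rough illustration of word stemming, the sketch below (in Python, for brevity) strips a few common English suffixes so that different forms of a keyword map to the same index term. It is a toy written for this page, not an established algorithm; a real system would more likely use something like Porter's stemmer. Note that suffix stripping alone cannot connect an irregular form such as "ran" to "run"; that needs a dictionary of word forms.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run" (but keep "fall", "miss").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    # Strip a plural or third-person "s", but leave words like "miss" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


stems = {form: naive_stem(form) for form in ("run", "runs", "running", "ran")}
```

Applying the same function to both the indexed keywords and the user's query terms lets a search for "running" match an article that only contains "runs".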

Option 2: P2P Database Synchronization

An alternative project, recommended for students who want to know more about what goes on "under the covers" of a data management system, is to build a peer-to-peer system for "synchronizing" multiple copies of a database.

If a database gets copied to multiple sites, each of which independently makes changes to it, how can we quickly and effectively "sync up" the different databases so they all look the same?

One solution, which doesn't require everyone to sync at the same time, is to use a peer-to-peer system substrate, such as Pastry or Chord, to distribute the data across a network -- where each peer gets part of every table. Individuals download a copy of the entire table, make changes, and then send these changes out to the different peers in the P2P network. Each peer synchronizes part of the database and sends its data back to the individual.
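To give a feel for how "each peer gets part of every table", here is a minimal sketch of hash-based partitioning, written in Python for brevity. It uses a Chord-style rule (a row belongs to the first peer at or after the row's position on an identifier ring); Pastry's own routing rule differs, and none of this is the Pastry API. The peer names, the table:key naming scheme, and the shape of a "change" are all assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string onto a 160-bit identifier ring using SHA-1."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Assigns each row key to the first peer at or after the key's position."""

    def __init__(self, peer_names):
        self.points = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [pos for pos, _ in self.points]

    def owner(self, table, primary_key):
        pos = ring_position(f"{table}:{primary_key}")
        # Wrap around to the first peer if the key falls past the last one.
        i = bisect.bisect_left(self.positions, pos) % len(self.points)
        return self.points[i][1]


def route_changes(ring, changes):
    """Group a user's local changes by the peer responsible for each row."""
    outbox = {}
    for table, primary_key, new_row in changes:
        outbox.setdefault(ring.owner(table, primary_key), []).append(
            (table, primary_key, new_row)
        )
    return outbox


ring = Ring(["peer-a", "peer-b", "peer-c"])
changes = [("students", i, {"name": f"student {i}"}) for i in range(100)]
outbox = route_changes(ring, changes)  # peer name -> list of changes to send it
```

Because every participant computes the same owner for a given row, all changes to that row arrive at the same peer, which can then reconcile them without everyone having to be online at once. How conflicting changes are reconciled is left open here.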

This project is somewhat more open-ended. The first step would be to learn the Pastry Java APIs (which take care of most details of peer-to-peer distribution), and then to build a database interface over Pastry. This database might be relational, or it could even be an XML database.

More detailed information on both projects will be made available as the semester progresses.