CSE 455 / CIS 555: Internet and Web Systems

Project and Homework Assignments

Resources

For all assignments, this course has a dedicated cluster of machines named spec01 through approximately spec40 (the exact number varies with the resources available to CETS). These machines are set up with less port blocking than a typical SEAS machine, so it is up to you to use them responsibly.

Development will be in Java 6, aka JDK 1.6.

We recommend Subversion, a version control system, for maintaining your project code. See here for details.

As a development environment, we recommend Eclipse 3.4 or later (available on the spec cluster). You can get an Eclipse plug-in for Subversion here.

Assignments

Assignment 1: Web and application servers; thread pools; learning APIs.

You will also want the servlet helper classes, the servlet API jar, and the simple command line servlet runner (aka TestHarness).

Note that if you use Eclipse, you may ultimately need to set the classpath using the Eclipse GUI (in Eclipse 3.x, this is under Window > Preferences > Java > Build Path > User Libraries).

Some useful URLs:

Apache Jakarta's cookie specification summary.
Servlet introduction; a second one; a third one. Here are two more: InformIT and DevArticles
Servlet 2.4 APIs
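
As a rough illustration of the thread-pool idea named in the Assignment 1 title, here is a minimal sketch of a fixed-size worker pool built on a blocking queue. The class name SimpleThreadPool and its methods are hypothetical, not part of any course-provided code, and the sketch assumes nothing about the assignment's actual requirements. It uses only Java 6 language features.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration: a fixed number of worker threads pull tasks
// from a shared FIFO queue until they receive a shutdown sentinel.
class SimpleThreadPool {

    // Sentinel task; a worker that dequeues it exits its loop.
    private static final Runnable POISON = new Runnable() {
        public void run() { }
    };

    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>();
    private final List<Thread> workers = new ArrayList<Thread>();

    SimpleThreadPool(int size) {
        for (int i = 0; i < size; i++) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            Runnable task = queue.take(); // blocks while the queue is empty
                            if (task == POISON) {
                                return;
                            }
                            try {
                                task.run();
                            } catch (RuntimeException e) {
                                // A failing task should not kill the worker.
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
    }

    // Enqueues a task; some idle worker will eventually run it.
    void submit(Runnable task) {
        queue.add(task);
    }

    // Queues one sentinel per worker, then waits for every worker to exit.
    // Tasks submitted earlier run first because the queue is FIFO.
    void shutdownAndWait() throws InterruptedException {
        for (int i = 0; i < workers.size(); i++) {
            queue.add(POISON);
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}
```

In a web server, each submitted task would typically handle one accepted connection, so the number of concurrent requests is bounded by the pool size rather than by the number of incoming connections.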

Assignment 2: Web crawling, XPath, XQuery.

You will also need to use the following:

For persistent storage: BerkeleyDB Java Edition, from Oracle. You will need to configure Eclipse to use the lib/je.jar file in this package.
For HTML parsing: JTidy parser, available here.
Sample DOM writer and SAX writer example code.
For testing, your crawler should use the "sandbox" we have set up here.
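
For orientation on the XPath part of Assignment 2, the following sketch evaluates an XPath expression against a DOM tree using the javax.xml.xpath package that ships with Java 6. It is not the course's sample code: the class name XPathSketch, the select method, and the sample XML are all hypothetical. The input must be well-formed XML; real-world HTML would first need cleaning with a parser such as JTidy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the names here are illustrative only.
class XPathSketch {

    // Parses the XML text into a DOM tree and returns the text content
    // of every node matched by the XPath expression.
    static List<String> select(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> results = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            results.add(nodes.item(i).getTextContent());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document for demonstration.
        String xml = "<rss><channel><item><title>A</title></item>"
                   + "<item><title>B</title></item></channel></rss>";
        System.out.println(select(xml, "/rss/channel/item/title"));
    }
}
```

An expression that matches nothing yields an empty list rather than an error, which makes it straightforward to test many expressions against the same document.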

Assignment 3: Web services, distributed hash tables.

You will want the following:

A simple example NodeFactory.java (may need tweaking to work with FreePastry 2.0)
Pastry and tutorial
YouTube Data Web Services
Jetty, if you need an alternate servlet container

Final team project: P2P web crawler and search engine.

We will be using Amazon's Elastic Compute Cloud (EC2) and Simple Storage Service (S3) for the project. See here for details on how to get started with EC2. We will be distributing some EC2 credits that may cover a portion of the costs. The "Getting Started" guide is here. An excerpt tailored for our class is at this location.

At least one member of your team will need to learn Hadoop MapReduce. You may want to install it locally according to these instructions.