The course project will involve building a Google-style Web crawler, indexer, search engine, and ranking system. You will learn the fundamentals of the HTTP protocol, Web servers, synchronization, and servlets while constructing the Web interface. You will then study Distributed Hash Tables (specifically Pastry) and other schemes for partitioning data and work across multiple nodes, including Hadoop, an open-source implementation of Google's MapReduce; these will form the basis of a scalable, distributed Web crawler. Finally, you will learn different schemes for ranking keywords and documents (including PageRank), and use these to provide ranked search results to users.
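To give a flavor of the ranking component, here is a minimal sketch of PageRank computed by power iteration over a tiny hypothetical link graph. The class name, the damping factor of 0.85, and the three-page example graph are illustrative assumptions, not the course's required implementation.

```java
import java.util.Arrays;

public class PageRankSketch {
    // adj[i] lists the pages that page i links to.
    // Returns the PageRank vector after a fixed number of power iterations.
    static double[] pageRank(int[][] adj, int n, double damping, int iters) {
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                   // start from a uniform distribution
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);   // "teleportation" term
            for (int i = 0; i < n; i++) {
                if (adj[i].length == 0) {
                    // Dangling page: distribute its rank evenly to all pages
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    // Split this page's rank equally among its out-links
                    for (int j : adj[i]) next[j] += damping * rank[i] / adj[i].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Example graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 3, 0.85, 50)));
    }
}
```

In the project, the link graph would come from the crawler's extracted hyperlinks rather than a hard-coded array, and the iteration would typically be expressed as a sequence of MapReduce jobs rather than an in-memory loop.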
The homework assignments will be done individually, and each implements a core component of the project. The project itself will be built in teams.
You can find more details about the project here.