CIS 650: Data Sharing and the Web

Spring 2003

Instructor:  Zachary Ives, zives@cis.upenn.edu, 562 Moore-GRW
Location:  Towne 321
Times:  MW 4:30-6PM
Office Hours: Thursday 2:00-3:00PM or by appointment


Objectives

The goal of this course is to gain a better understanding of the issues involved in querying, integrating, and otherwise sharing data across the Internet and the World Wide Web. Data integration is perhaps the best-studied instance of this problem, and we focus on it for much of the semester. We begin with a study of relational query processing as a foundation, and then move on to answering queries using views and adaptive query processing.

We also examine architectures for larger-scale or richer data exchange:  wide-area data sharing (as proposed by projects such as Mariposa in the 1990s and Piazza today) and sharing with very expressive data definition formalisms (the Semantic Web).

Finally, we hope to investigate several efforts to support data exchange scenarios that are not merely query-driven:  publish-subscribe, groupware applications, collaborative web sites (such as Sourceforge.net), and versioning-based systems.

Format

The goal of this course is not only to give a strong understanding of the research issues in the area of focus, but also to stimulate discussion and interaction and to build presentation skills. Some topics that involve a great deal of background knowledge will be covered in standard lectures, but there will also be a mix of student-led paper presentations and project presentations. I expect each student to prepare for and lead a paper discussion, in consultation with me. Moreover, each student will be expected to write and post a short analysis of the assigned papers for each lecture.

There will be a term project, which can either be a survey and analysis paper or an implementation of a novel idea or system. During the last week, there will be a take-home final exam.

Topics and Readings

For each week's readings, a 1-page report, posted to the upenn.cis.cis650 newsgroup, is due by NOON (12:00:00 PM) on the day the paper is to be discussed in class. Any submission made after this deadline is considered LATE and will receive a 30% point deduction. You may miss one review over the course of the semester.

Data Integration Overview and Introduction

Data Sharing across the Wide Area

Query Optimization Basics

Query Execution Basics

Adaptive Query Processing

Querying the Web

XML Processing

Answering Queries Using Views

Versioning, Diffs, and Updates

Alternative Data Sharing Paradigms

Reading Papers and Writing Reviews

Here are some things to consider every time you read a paper and write an analysis/review:

Take notes or scribble in the margins while you read. Don't expect that you'll get everything out of a single read; you may have to come back to certain parts several times. If you start getting bogged down in too many details in a section of the paper, sometimes it's helpful to take a step back and try to figure out what the authors are trying to do in that section.

Also, after you've figured out the main points of your review, it may be interesting and educational to see what other people have said. Feel free to respond to other people's comments -- that's why this is an open forum.

I will try to read all of the (on-time) reviews before the evening's class, and to address any points of confusion or bring up points of debate.

Grading

The grading breakdown will tentatively be as follows:

You may miss one paper review without penalty. Any review posted after 12:00 noon on the day it is due is LATE, and it will receive a 30% point deduction.

Tentative Schedule

Papers highlighted in GREEN will be presented by a student. Papers highlighted in BLUE will be presented by me or a guest lecturer.

Week | Monday | Wednesday
1 | 1/13: Intro to course; data integration | 1/16: Data integration systems
2 | 1/20: MLK Holiday | 1/22: Mariposa and Piazza
3 | 1/27: Query optimization | 1/29: PENG - Volcano
4 | 2/3: Query execution | 2/5: Query execution. Project proposals due.
5 | 2/10: IVAN - Mid-query re-optimization | 2/12: Tukwila
6 | (snow cancellation) | 2/19: JAE - Eddies
7 | 2/24: YIWEN - Stats on intermediate tables | 2/26: YONG - XQuery keyword querying; KIT - IR querying
8 (ICDE) | 3/3: ARDIANTO - WHIRL | 3/5: MURAT - XFilter
- | 3/10: Spring break |
9 | 3/17: QUN - AQUV; STAN - MiniCon | 3/19: Piazza ICDE. Project status reports due.
10 | 3/24: LARRY - Change detection | 3/26: Heraclitus
11 | 3/31: BENJAMIN PIERCE - Harmony/Unison | 4/2: VIJAI - Semantic Web
12 | 4/7: Semantic Web and Piazza | 4/9: IFE - Groupware
13 | 4/14: Schema matching | 4/16: Course wrap-up
14 | 4/21: Project presentations; final handed out | 4/25: AnHai Doan talk, 11AM, Moore-GRW 554
15 | 4/28: Finals week -- projects and final exam due by 5:59:59 PM EST, Friday 5/2. |

Significant Dates