The goal of this semester-long project is to develop some of the ideas from the course readings, or from elsewhere in the database systems area, into a research-oriented project. The project will include an implementation, some experimental validation, a project report, and a brief (~15 minute) presentation. You may suggest your own project idea, but here are several suggestions.
Project 1: Smart CIS
As sensor network technology has improved, we are moving towards a world in which large numbers of sites and entities – sensors, network routers, servers, Web services – are producing data that can be considered “streams,” all accessible from the Internet. A key problem in the future will be to develop architectures to integrate this data – distributing computation to the appropriate devices in order to ensure good performance and reliability, abstracting the data into views, and feeding the data to applications or displays.
This project builds upon several components developed as research projects within the Database Group: the Orchestra distributed stream engine, the Aspen runtime system for sensor query processing, and optionally the P2 network dataflow engine. The goal will be to “hook” them together, build a query parser and optimizer over them, and target applications for monitoring Penn CIS hardware, software, and laboratories. We would like to be able to answer queries – or respond to events – such as these:
Query: “Direct me to a free workstation with MS Word”
- Sources: Lights & motion in lab; machine available; apps installed; lab near user
Query: “How fast can I run my jobs, given 5 machines and a given power/cooling budget?”
- Run a machine under load; measure throughput; measure temperature
Query: “What is the location of the following mobile robot in GRASP?”
- On-board positioning; nearby ceiling-mounted sensors; complex interpolation function
Event: reminders trigger and appear wherever you are
- Your calendar; your RFID and position; nearest display
Event: gracefully shut down machines near air conditioning outage or fire alarm
- Air cond. cooling zone, alarm state, machine state, lookup of IP address from location
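To make the first query concrete, here is a minimal sketch of how it might reduce to a select-project-join over snapshots of its sources. The relation names, field names, and sample values are all hypothetical; a real system would read these from live streams rather than in-memory lists.

```python
# Hypothetical snapshots of the sources for the "free workstation" query.
# All names and values below are illustrative assumptions.
machines = [
    {"host": "ws1", "lab": "lab-a", "available": True},
    {"host": "ws2", "lab": "lab-a", "available": False},
    {"host": "ws3", "lab": "lab-b", "available": True},
    {"host": "ws4", "lab": "lab-b", "available": True},
]
installed = [
    {"host": "ws1", "app": "MS Word"},
    {"host": "ws2", "app": "MS Word"},
    {"host": "ws3", "app": "Emacs"},
    {"host": "ws4", "app": "MS Word"},
]
# Distance from the user to each lab (hypothetical units).
lab_distance = {"lab-a": 40, "lab-b": 10}


def free_workstations(app, machines, installed, lab_distance):
    """Select available machines, join with installed apps, order by lab distance."""
    # Selection on the "apps installed" source.
    hosts_with_app = {row["host"] for row in installed if row["app"] == app}
    # Join with the "machine available" source on host, keeping known labs only.
    candidates = [
        m for m in machines
        if m["available"] and m["host"] in hosts_with_app and m["lab"] in lab_distance
    ]
    # Order by the "lab near user" source, nearest first.
    return sorted(candidates, key=lambda m: lab_distance[m["lab"]])


nearest_first = free_workstations("MS Word", machines, installed, lab_distance)
```

The sketch ignores the lights and motion sensors; those could be folded in as one more filter on the lab.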
Building blocks include the following:
· Sensor runtime system – we currently support select-project-join queries. Interfaces and a plan representation would need to be developed.
· Orchestra query engine – currently has its own plan language, and does not have a query parser.
· P2 query engine – currently has a parser for datalog, but does not have an optimizer.
New components would include the following:
· Graphical interface. We need visualization of state, and an easy way of mapping query output to device displays.
· Extended language. We need to go beyond SQL, to include constructs for security, for routing data to various displays and applications, and to interface with Java (or other) code.
· Query language parser.
· Query plan generator. This would need to generate plans that all of the runtime systems can handle – and it would need to distribute plans to all of the runtime systems.
· Query optimizer. This would need to determine which computation needs to be done in which subsystem.
· Wrappers. We need “adapters” or “wrappers” to monitor internal state in numerous devices (e.g., server load) and to connect to various sensors (e.g., power usage monitors).
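One simple way to think about the optimizer's placement decision is sketched below: walk a linear operator pipeline from the data source upward and keep each operator on the engine closest to the data that can execute it. Only the sensor runtime's select-project-join support comes from the description above; the "stream" engine's capabilities, the engine ordering, and the function name are assumptions made for illustration.

```python
# Capability table. Only the sensor runtime's select-project-join support is
# taken from the project description; the "stream" entry is an assumption.
CAPABILITIES = {
    "sensor": {"select", "project", "join"},
    "stream": {"select", "project", "join", "aggregate"},
}
# Engines ordered from closest to the data to farthest (assumed preference).
PREFERENCE = ["sensor", "stream"]


def place_operators(pipeline):
    """Assign each operator in a bottom-up linear pipeline to an engine.

    Once an operator has to move to a farther engine, later operators stay
    there or move farther still, since the data has already left the
    closer engine.
    """
    placement = []
    level = 0
    for op in pipeline:
        # Advance to the first engine at or beyond the current one that supports op.
        while level < len(PREFERENCE) and op not in CAPABILITIES[PREFERENCE[level]]:
            level += 1
        if level == len(PREFERENCE):
            raise ValueError(f"no engine supports {op!r}")
        placement.append((op, PREFERENCE[level]))
    return placement
```

For example, place_operators(["select", "join", "aggregate", "project"]) keeps the selection and join in the sensor runtime and moves the aggregate and final projection to the stream engine. A real optimizer would also weigh costs such as network traffic and device power, which this greedy sketch does not model.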
Project 2: Cloud-Based Query Processing
One of the ongoing research projects in the database group (and the PhD thesis of Nick Taylor) involves building a distributed query processor over the Pastry distributed hash table. In principle, the Pastry APIs greatly resemble what can be achieved through a MapReduce/GFS architecture (and hence through its open-source “clone,” Hadoop). This project involves taking the existing query engine (in Java) and adapting it so that it can also run on a Hadoop-style architecture.
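The resemblance can be illustrated with a hash-partitioned join: a DHT routes each tuple to the node responsible for its key, and a MapReduce shuffle routes each tuple to the reducer responsible for its key. The sketch below simulates that pattern in plain Python. It uses neither the Pastry nor the Hadoop APIs, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_partitions):
    # Deterministic hash, so a given key always lands on the same partition,
    # as a DHT lookup or a MapReduce shuffle would arrange.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(tagged_rows, num_partitions):
    """Route (key, tag, row) triples to partitions by key hash."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, tag, row in tagged_rows:
        partitions[partition(key, num_partitions)][key].append((tag, row))
    return partitions


def reduce_join(groups):
    """Within one partition, pair left and right rows that share a key."""
    out = []
    for key, tagged in groups.items():
        lefts = [row for tag, row in tagged if tag == "L"]
        rights = [row for tag, row in tagged if tag == "R"]
        out.extend((key, l, r) for l in lefts for r in rights)
    return out


def distributed_join(left, right, num_partitions=4):
    """Equi-join two lists of (key, value) pairs via tag, shuffle, and reduce."""
    tagged = [(k, "L", v) for k, v in left] + [(k, "R", v) for k, v in right]
    results = []
    for groups in shuffle(tagged, num_partitions):
        results.extend(reduce_join(groups))
    return sorted(results)
```

Because rows with equal keys always meet in the same partition, each partition can be joined independently, whether that partition is a DHT node or a reduce task.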
Project 3: Sensor Application
We have approximately half a dozen "mote" sensor devices. Additionally, there are at least two acquisitional query engines for the motes: TinyDB (covered in class) and Svilen Mihaylov's research project. This project involves building a real application (with real motes) based on one or both of these engines, then experimentally studying the application.
Project 4: Data Visualizer
One major open area in databases is the visualization of output data. For example, we might like to look at a table and determine which data does not satisfy a key constraint, or where each item came from (i.e., its provenance). There exist numerous visualization toolkits in Java and other languages. This project involves designing and implementing a visualization interface for data produced by a DBMS and/or the Orchestra research system. The interface should allow manipulation (e.g., deletion, modification, annotation) of data as well as its visualization.
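As a small illustration of the key-constraint example, the rows a visualizer might highlight can be found by grouping on the key columns and keeping any group with more than one row. This is a minimal sketch over in-memory dictionaries; the function name and the sample table are hypothetical.

```python
from collections import defaultdict


def key_violations(rows, key_columns):
    """Return {key value: rows} for every key value shared by more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in key_columns)].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical sample table in which "id" is supposed to be a key.
people = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Anne"},
]
violations = key_violations(people, ["id"])
```

Here violations contains a single entry for the key (1,), holding the two conflicting rows that the interface could flag for deletion or modification.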
Project 5: Transformation Reverse Engineering
Sometimes we would like to take an existing transformation tool -- a Perl script, an ETL (Extract-Transform-Load) tool, a C program -- that translates data from one format to another, and reverse engineer it, arriving at a (declarative) specification that can lead to a schema mapping. This project would involve constructing a tool that does the following: take an input and output schema (in, e.g., SQL DDL and XSD), create a sample input data instance, run the transformation, then determine the operations performed by the tool to create the output data instance. This is more complex than it sounds, because one must create enough rows and unique identifiers in the input instance to determine exactly what join, nesting, and grouping operations are done in order to produce an output instance.
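The simplest part of this idea can be sketched as follows: fill the input instance with globally unique tokens, run the transformation as a black box, and trace each output value back to the input column it was copied from. The sketch below recovers only column-level copies and renames. It does not detect joins, nesting, or grouping, which would require input instances in which key values are deliberately shared. All names, including the stand-in transformation, are hypothetical.

```python
def make_instance(schema, rows_per_table=2):
    """Build a sample instance in which every cell holds a unique token.

    schema maps each table name to its list of column names. Returns the
    instance plus an index from token back to (table, column, row number).
    """
    instance, origin = {}, {}
    for table, columns in schema.items():
        instance[table] = []
        for i in range(rows_per_table):
            row = {}
            for col in columns:
                token = f"{table}.{col}#{i}"
                row[col] = token
                origin[token] = (table, col, i)
            instance[table].append(row)
    return instance, origin


def infer_column_sources(output_rows, origin):
    """Map each output column to the input (table, column) pairs it copies."""
    sources = {}
    for row in output_rows:
        for out_col, value in row.items():
            if value in origin:
                table, col, _ = origin[value]
                sources.setdefault(out_col, set()).add((table, col))
    return sources


def black_box(instance):
    # Stand-in for the tool under study: renames two columns and drops a third.
    return [{"full_name": r["name"], "id": r["pid"]} for r in instance["person"]]


instance, origin = make_instance({"person": ["pid", "name", "age"]})
mapping = infer_column_sources(black_box(instance), origin)
```

With this stand-in transformation, mapping records that full_name is copied from person.name and id from person.pid, while age never reaches the output.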