Project Ideas - CIS 400/401
IDEA: Spam Detection for Wikipedia

PROFESSOR: Oleg Sokolsky (sokolsky@cis) (also contact Andrew West, westand@cis.upenn)

DESCRIPTION: Student(s) will work closely with a graduate student to develop an automatic classifier for detecting link-spam edits to Wikipedia. Work will begin by examining a spam corpus, from which student(s) will develop a taxonomy of spam behaviors. Feature extraction will then identify predictive measures, and these features will be integrated into a real-time edit-processing infrastructure backed by a machine-learning classifier. Wikipedia spam differs from other forms of spam (e.g., email-based) because it is common to see poor links that show no commercial intent but are posted for vanity, subject-skewing, etc. Further, features will likely draw not just on the URL destination (e.g., text processing over the HTML), but also on how the edit is presented on the wiki (where on the page does the link appear? what is the description text?).
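
To make the feature-extraction step concrete, here is a minimal sketch in Python of turning one added link into numeric features. The function name, its parameters, and the specific features are illustrative assumptions, not the project's actual feature set; a real system would derive them from the spam taxonomy described above.

```python
from urllib.parse import urlparse


def extract_link_features(url, anchor_text, char_offset, page_length):
    """Return simple numeric features for one external link added by an edit.

    All feature names here are hypothetical examples.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path_segments = [s for s in parsed.path.split("/") if s]
    return {
        # Features of the URL itself
        "url_length": len(url),
        "host_depth": host.count(".") + 1 if host else 0,
        "path_depth": len(path_segments),
        "has_query": int(bool(parsed.query)),
        # Features of how the link is presented on the wiki page
        "anchor_words": len(anchor_text.split()),
        "anchor_is_bare_url": int(anchor_text.strip() == url),
        "relative_position": char_offset / page_length if page_length else 0.0,
    }


# Example: a link placed near the bottom of a 10,000-character page.
features = extract_link_features(
    "http://shop.example.com/deals/today?ref=wiki", "Best deals", 9000, 10000
)
```

A dictionary like this could then be passed to whatever classifier the project adopts; which features are actually predictive is exactly what the corpus study is meant to determine.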

PRE-REQUISITES: Prospective student(s) should be comfortable with Java programming and have at least an elementary knowledge of machine learning. This work is expected to lead to publication and may therefore be most appropriate for those intending to attend graduate school in CS or a related field.

IDEA: Senior Design Projects at NetDB@Penn

PROFESSOR: Boon Thau Loo (boonloo@cis)

DESCRIPTION: NetDB@Penn has a number of senior design projects suitable for undergraduates. In previous years, student projects have resulted in conference papers (in collaboration with doctoral students) at top conferences such as CIDR'09, NDSS'10, and SIGMOD'10. This year, we are particularly looking for students to develop various components of the following three projects:

If interested, please contact Prof. Boon Thau Loo for more details. In your email, please include your C.V. 

IDEA: Advanced Telepresence using Virtual Reality and a Humanoid Robot

PROFESSOR: Camillo J. Taylor (cjtaylor@cis)

DESCRIPTION: The goal of this work is to explore new ways for humans to operate advanced humanoid robotic systems. More specifically, the aim is to develop a system that will allow a human user to virtually inhabit our newly acquired PR2 humanoid robot from Willow Garage. This robot is sufficiently anthropomorphic to allow us to consider mapping the motions of a human operator directly onto the motions of the head, base, and arms of the robot. The concept is to outfit the operator with a virtual reality headset, monitor the operator's movements with a Vicon motion capture system, and then map those motions onto the robot while relaying the video feeds from the robot's head camera back to the head-mounted display to create an immersive teleoperation experience.
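
As a rough illustration of the motion-mapping idea, the Python sketch below maps a tracked operator head orientation onto pan and tilt commands for a robot head, clamping to joint limits. The limit values, function names, and the direct one-to-one mapping are all assumptions made for this example; they are not taken from the PR2 specification or from any Vicon or robot API.

```python
# Hypothetical joint limits in radians; NOT the PR2's actual specification.
PAN_LIMITS = (-2.5, 2.5)
TILT_LIMITS = (-0.4, 1.2)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def head_command(operator_yaw, operator_pitch):
    """Map operator head yaw/pitch (radians) to a clamped pan/tilt command."""
    return {
        "pan": clamp(operator_yaw, *PAN_LIMITS),
        "tilt": clamp(operator_pitch, *TILT_LIMITS),
    }


# The operator turns further than the assumed pan limit allows.
command = head_command(3.0, 0.2)
```

A real system would also need to handle calibration between the motion-capture frame and the robot frame, as well as latency and smoothing, none of which is shown here.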

IDEA: Provenance Aware Scientific Workflow Systems

PROFESSOR: Susan Davidson (susan@cis)

DESCRIPTION: This is a perfect project for students interested in bioinformatics or computational biology. The project involves developing technology for next-generation scientific workflow systems that are "provenance-aware". Currently, scientific workflow systems maintain repositories of specifications (think of these as programs) that are searchable by keyword to enable component reuse. However, many systems are starting to maintain information about workflow executions as well, e.g., through provenance logs. By maintaining information about the sequence of module executions (processing steps) used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions, the validity and reliability of data can be better understood and results can be made reproducible. Provenance-aware workflow systems will yield repositories of both workflow specifications and the provenance graphs that represent their executions, and will enable a new paradigm for creating and correcting scientific analyses: scientists who wish to perform new analyses may search workflow repositories to find specifications of interest to reuse or modify. They may also search provenance information to understand the meaning of a workflow, or to correct or debug an erroneous specification. On finding erroneous or suspect data, a user may then ask provenance queries to determine what downstream data might have been affected, or to understand how the process that created the data failed. The project will involve:

  1. Gaining a working knowledge of a scientific workflow system (Taverna);
  2. Understanding how provenance information is captured in the workflow system;
  3. Creating a database schema for managing workflow specifications and their associated executions (from which provenance is obtained);
  4. Populating the database to create a repository; and
  5. Creating a front-end to search the repository.
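
The "downstream data" provenance query mentioned in the description can be sketched as a graph traversal. The Python example below assumes provenance is available as a mapping from each data item to the items it was derived from; this representation, the function name, and the file names are illustrative assumptions and do not reflect how Taverna actually stores provenance.

```python
from collections import deque


def downstream_items(derived_from, suspect):
    """Return every data item derived, directly or transitively, from suspect."""
    # Invert the "derived from" relation into input -> outputs edges.
    produces = {}
    for output, inputs in derived_from.items():
        for item in inputs:
            produces.setdefault(item, set()).add(output)

    # Breadth-first search over the inverted edges.
    affected = set()
    queue = deque([suspect])
    while queue:
        current = queue.popleft()
        for nxt in produces.get(current, ()):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected


# Hypothetical provenance for one workflow execution.
provenance = {
    "aligned.fa": ["reads.fq", "reference.fa"],
    "variants.vcf": ["aligned.fa"],
    "report.pdf": ["variants.vcf", "annotations.tsv"],
}
affected = downstream_items(provenance, "reads.fq")
```

In a repository backed by a relational database, the same query would more likely be expressed as a recursive query over an edge table; the in-memory version here only shows the logic.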

IDEA: Enhance the data exploration functionality of myExperiment

PROFESSOR: Susan Davidson (susan@cis) (with Julia Stoyanovich (jstoy@cis))

DESCRIPTION: This project is ideal for a student who is interested in data management, social systems, and/or bioinformatics. The goal of the project is to enhance the data exploration functionality of myExperiment, an open-source online collaborative platform for sharing scientific workflows and experimental plans. Scientific workflows are emerging as a state-of-the-art technology for in-silico experimentation in bioinformatics, and repositories such as the one maintained by myExperiment play a crucial role in the widespread adoption of this technology. Specifically, the project aims to develop new and effective data exploration techniques for myExperiment. It will involve the following:

  1. Gaining a working knowledge of the myExperiment platform and its implementation (in Ruby)
  2. Understanding the myExperiment dataset -- characteristics of users, their interactions, and the workflows they create
  3. Understanding data exploration approaches such as frequent itemset mining, clustering, and topic modeling, and implementing them within the scope of the myExperiment framework
  4. Participating in the design and implementation of a user study that would test the effectiveness of methods in (3)
More information about myExperiment is available at and For more information on proposed data exploration approaches, see our recent publication at
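
To give a flavor of the frequent itemset mining mentioned in step 3, here is a minimal Python sketch that counts which items co-occur across workflows. It uses brute-force counting of all itemsets up to a small size rather than a pruning algorithm such as Apriori, and the workflow contents and names are invented for illustration; they are not drawn from the myExperiment dataset.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return itemsets (as sorted tuples) that occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore duplicates within one transaction
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical workflows, each described by the services it uses.
workflows = [
    ["blast", "fetch_sequence", "render_tree"],
    ["blast", "fetch_sequence"],
    ["blast", "render_tree"],
    ["fetch_sequence", "parse_xml"],
]
frequent = frequent_itemsets(workflows, min_support=2)
```

Brute-force counting is only practical for small item vocabularies and small itemset sizes; on a dataset of realistic size, a pruning-based algorithm would be the more likely choice.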