Spring 2007

Faculty Advisor: Dr. Sudipto Guha

Many OLAP (On-line Analytical Processing) applications have produced data cubes that summarize and aggregate details of data queries. These data cubes are multi-dimensional matrices where each cell that satisfies a specific property or trait is represented as a 1. A cell that does not satisfy that specific property is represented as a 0. In order to compress the amount of space required to represent this matrix completely, others have used MDL (Minimum Description Length) Summarization, including the MDL Summarization with Holes. The MDL Summarization describes rectangular regions (rectangles) of the matrix that contain all 1's, and the MDL Summarization with Holes also describes rectangles that are mostly 1's and then describes the 0's within those rectangles.

While it is NP-Hard to compute the optimal MDL Summarization with
Holes for a data matrix of 2 or more dimensions (Proven by Bu et al. [1]), there
exists a greedy algorithm to approximate the MDL Summarization with Holes,
proven to give an answer that within a factor of l_{m} &lowast log(M) of the optimal
solution (Proven by Guha and Tan [3]), where M is a factor dependent size
of the data matrix. See the "Definitions" section of this web page for a
definition of l_{m}. However, Guha and Tan in [3] mention that this bound has not
been proven tight. I studied this for 2-dimensional matrices where the algorithm
can only compress by covering rows and columns (here l_{m} = 2). In this case the only non-trivial rectangles are the rows and the columns of the matrix. Currently, I
have a proof that the greedy algorithm is a 4-approximation algorithm in this
special 2-dimensional case and a constant-factor (2(&kappa -2))-approximation
algorithm in the general case. Furthermore, I have written a program that uses
the greedy approximation that performs an MDL Summarizization with Holes on an arbitrary n-by-n
2-dimensional matrix of 1's and 0's with the rows and columns of the matrix as the available rectangles.

Final Report - This is the writeup submitted for the Senior Design Project.

Program Source - Here is the Source code to the program I wrote. It is written in C.

Example Program Test File 1 - This is an example that when run, shows the greedy algorithm in action. Here is the output of Test File 1.

Example Program Test File 2 - This file is an illustrative example that illustrates where the greedy algorithm produces a suboptimal solution, by a factor of 2. This example generalizes to show that the algorithm is at best a l

[2] Chaudhuri, Surajit and Umeshwar Dayal. "An Overview of Data Warehousing and OLAP Technology."

[3] Guha, Sudipto and Jinsong Tan. "Recursive MDL Summarization and Approximation Algorithms." 2006.

[4] Lakshmanan, Laks V.S., Raymond T. Ng, Christing Xing Wang, Xiaodong Zhou and Theodore J Johnson. "The Generalized MDL Approach for Summarization."

[5] Pu, Ken Q. and Alberto O. Mendelzon. "Concise Descriptions of Subsets of Structured Sets."

[6] Sathe, Gayatri and Sunita Sarawagi. "Intelligent Rollups in Multidimensional OLAP Data."

[7] Vazirani, Vijay V.