Analyzing a Greedy Approximation of MDL (Minimum Description Length) Summarization with Holes
CSE 401 Senior Project
Spring 2007
Peter Fontana
Faculty Advisor: Dr. Sudipto Guha
Abstract
Many OLAP (On-line Analytical Processing) applications have produced
data cubes that summarize and aggregate details of data queries. These data
cubes are multi-dimensional matrices where each cell that satisfies a specific
property or trait is represented as a 1. A
cell that does not satisfy that specific property is represented as a 0. In order to compress the amount of space required to
represent this matrix completely, others have used MDL (Minimum Description
Length) Summarization, including the MDL Summarization with Holes. The MDL Summarization describes rectangular regions (rectangles) of the matrix that contain all 1's, and the MDL Summarization with Holes also describes rectangles that are mostly 1's and then describes the 0's within those rectangles.
While it is NP-Hard to compute the optimal MDL Summarization with
Holes for a data matrix of 2 or more dimensions (Proven by Bu et al. [1]), there
exists a greedy algorithm to approximate the MDL Summarization with Holes,
proven to give an answer that within a factor of lm &lowast log(M) of the optimal
solution (Proven by Guha and Tan [3]), where M is a factor dependent size
of the data matrix. See the "Definitions" section of this web page for a
definition of lm. However, Guha and Tan in [3] mention that this bound has not
been proven tight. I studied this for 2-dimensional matrices where the algorithm
can only compress by covering rows and columns (here lm = 2). In this case the only non-trivial rectangles are the rows and the columns of the matrix. Currently, I
have a proof that the greedy algorithm is a 4-approximation algorithm in this
special 2-dimensional case and a constant-factor (2(&kappa -2))-approximation
algorithm in the general case. Furthermore, I have written a program that uses
the greedy approximation that performs an MDL Summarizization with Holes on an arbitrary n-by-n
2-dimensional matrix of 1's and 0's with the rows and columns of the matrix as the available rectangles.
Note: My project improves upon a result in [3]. Currently, [3] is a manuscript that will be submitted for publication in the near future, after this result (and its proof) is added to it. The Final Report contains the full proof.
Definitions
&kappa: &kappa is the largest number of rectangles which contain a common cell of the matrix. Two such rectangles, the entire matrix and the individual cell, are trivial. Excluding those rectangles gives (&kappa -2) rectangles.
lm: Take each cell of the data matrix. For each cell u of the matrix, consider all of the rectangles that contain u. The contains (subset) relation forms a partially ordered set over the rectangles, and more specifically a lattice, since the cell and the matrix are considered rectangles. lm is the size of the largest antichain of the lattice for all cells u.
.
Project Documents
Poster - This is the poster presented during CSE Senior Design Day on April 19, 2007.
Final Report - This is the writeup submitted for the Senior Design Project.
Program Source - Here is the Source code to the program I wrote. It is written in C.
Example Program Test File 1 - This is an example that when run, shows the greedy algorithm in action. Here is the output of Test File 1.
Example Program Test File 2 - This file is an illustrative example that illustrates where the greedy algorithm produces a suboptimal solution, by a factor of 2. This example generalizes to show that the algorithm is at best a lm-approximation algorithm. See [3] for the general example. Here is the output of Test File 2.
References
[1] Bu, Shaofeng, Laks V.S. Lakshmanan and Raymond T. Ng. "MDL Summarization
with Holes." Proceedings of the 31st VLDB Conference, pages
433-444, 2005.
[2] Chaudhuri, Surajit and Umeshwar Dayal. "An Overview of Data Warehousing
and OLAP Technology." ACM SIGMOD Record, Volume 26, Issue 1, pages 65-74, 1997.
[3] Guha, Sudipto and Jinsong Tan. "Recursive MDL Summarization and Approximation Algorithms." 2006.
[4] Lakshmanan, Laks V.S., Raymond T. Ng, Christing Xing Wang, Xiaodong
Zhou and Theodore J Johnson. "The Generalized MDL Approach for Summarization."
Proceedings of the 28th VLDB Conference, pages 766-777, 2002.
[5] Pu, Ken Q. and Alberto O. Mendelzon. "Concise Descriptions of Subsets
of Structured Sets." ACM Transactions on Database Systems (TODS) Vol.
30, No. 1, pages 211-248, March 2005.
[6] Sathe, Gayatri and Sunita Sarawagi. "Intelligent Rollups in Multidimensional OLAP Data." Proceedings of the 27th VLDB Conference, pages 531-
540, 2001.
[7] Vazirani, Vijay V. Approximation Algorithms. Berlin: Springer-Verlag.
Corrected Second Printing, 2003.