Analyzing a Greedy Approximation of MDL (Minimum Description Length) Summarization with Holes

CSE 401 Senior Project
Spring 2007

Peter Fontana
Faculty Advisor: Dr. Sudipto Guha

Abstract

Many OLAP (On-line Analytical Processing) applications have produced data cubes that summarize and aggregate details of data queries. These data cubes are multi-dimensional matrices where each cell that satisfies a specific property or trait is represented as a 1. A cell that does not satisfy that specific property is represented as a 0. In order to compress the amount of space required to represent this matrix completely, others have used MDL (Minimum Description Length) Summarization, including the MDL Summarization with Holes. The MDL Summarization describes rectangular regions (rectangles) of the matrix that contain all 1's, and the MDL Summarization with Holes also describes rectangles that are mostly 1's and then describes the 0's within those rectangles.

While it is NP-Hard to compute the optimal MDL Summarization with Holes for a data matrix of 2 or more dimensions (Proven by Bu et al. [1]), there exists a greedy algorithm to approximate the MDL Summarization with Holes, proven to give an answer that within a factor of lm &lowast log(M) of the optimal solution (Proven by Guha and Tan [3]), where M is a factor dependent size of the data matrix. See the "Definitions" section of this web page for a definition of lm. However, Guha and Tan in [3] mention that this bound has not been proven tight. I studied this for 2-dimensional matrices where the algorithm can only compress by covering rows and columns (here lm = 2). In this case the only non-trivial rectangles are the rows and the columns of the matrix. Currently, I have a proof that the greedy algorithm is a 4-approximation algorithm in this special 2-dimensional case and a constant-factor (2(&kappa -2))-approximation algorithm in the general case. Furthermore, I have written a program that uses the greedy approximation that performs an MDL Summarizization with Holes on an arbitrary n-by-n 2-dimensional matrix of 1's and 0's with the rows and columns of the matrix as the available rectangles.

Note: My project improves upon a result in [3]. Currently, [3] is a manuscript that will be submitted for publication in the near future, after this result (and its proof) is added to it. The Final Report contains the full proof.

Definitions

&kappa: &kappa is the largest number of rectangles which contain a common cell of the matrix. Two such rectangles, the entire matrix and the individual cell, are trivial. Excluding those rectangles gives (&kappa -2) rectangles.

lm: Take each cell of the data matrix. For each cell u of the matrix, consider all of the rectangles that contain u. The contains (subset) relation forms a partially ordered set over the rectangles, and more specifically a lattice, since the cell and the matrix are considered rectangles. lm is the size of the largest antichain of the lattice for all cells u. .

Project Documents

Poster - This is the poster presented during CSE Senior Design Day on April 19, 2007.
Final Report - This is the writeup submitted for the Senior Design Project.
Program Source - Here is the Source code to the program I wrote. It is written in C.
Example Program Test File 1 - This is an example that when run, shows the greedy algorithm in action. Here is the output of Test File 1.
Example Program Test File 2 - This file is an illustrative example that illustrates where the greedy algorithm produces a suboptimal solution, by a factor of 2. This example generalizes to show that the algorithm is at best a lm-approximation algorithm. See [3] for the general example. Here is the output of Test File 2.

References

[1] Bu, Shaofeng, Laks V.S. Lakshmanan and Raymond T. Ng. "MDL Summarization with Holes." Proceedings of the 31st VLDB Conference, pages 433-444, 2005.
[2] Chaudhuri, Surajit and Umeshwar Dayal. "An Overview of Data Warehousing and OLAP Technology." ACM SIGMOD Record, Volume 26, Issue 1, pages 65-74, 1997.
[3] Guha, Sudipto and Jinsong Tan. "Recursive MDL Summarization and Approximation Algorithms." 2006.
[4] Lakshmanan, Laks V.S., Raymond T. Ng, Christing Xing Wang, Xiaodong Zhou and Theodore J Johnson. "The Generalized MDL Approach for Summarization." Proceedings of the 28th VLDB Conference, pages 766-777, 2002.
[5] Pu, Ken Q. and Alberto O. Mendelzon. "Concise Descriptions of Subsets of Structured Sets." ACM Transactions on Database Systems (TODS) Vol. 30, No. 1, pages 211-248, March 2005.
[6] Sathe, Gayatri and Sunita Sarawagi. "Intelligent Rollups in Multidimensional OLAP Data." Proceedings of the 27th VLDB Conference, pages 531- 540, 2001.
[7] Vazirani, Vijay V. Approximation Algorithms. Berlin: Springer-Verlag. Corrected Second Printing, 2003.