next up previous
Next: Document-keyword correspondence Up: Algorithm Previous: How to extract document

Bipartite graph coclustering

Clustering based on similarity information in $C^TC$ can be though of as the following bipartite graph coclustering. This concept has been explored in the document-keyword analysis by Dhillon [4] and in bioinformatics by Kluger et al. [8]. The document and keywords represent the two sets of nodes in the bipartite graph, and the co-occurring documents and keywords are connected by graph edges. The graph has an edge weight matrix $\left(
\begin{array}{cc}
0 & C \\
C^T & 0
\end{array}\right)$. These methods are based on finding the normalized cut in this graph. Intuitively, this is the right thing to do, we want to select a set of keyword for grouping a particular set of documents. Furthermore, this grouping gives the correspondence between the informative keywords and relevant documents. However, graphcuts on bipartite graphs amounts to separate clustering on documents and on keywords, given by the fact that $\left(
\begin{array}{cc}
0 & C \\
C^T & 0
\end{array}\right)$ and $\left(
\begin{array}{cc}
CC^T & 0 \\
0 & C^TC
\end{array}\right)$ have the same eigenvectors. Thus, finding the optimal partition on the bipartite graph contradicts the concepts of coclustering and simply results in clustering the documents on the information given by $C^TC$.


next up previous
Next: Document-keyword correspondence Up: Algorithm Previous: How to extract document
Mirko Visontai 2004-05-13