
How do we extract document clusters from the co-occurrence matrix $C$?

To put our algorithm in context, let us compare it with Latent Semantic Indexing (LSI, [3]), a commonly used approach in Natural Language Processing. LSI works in two steps. First it computes the singular value decomposition of the keyword-document co-occurrence matrix: $C_{K\times N} = U_{K\times c} \Sigma_{c\times c} V^T_{c \times N}$. The right singular vectors $V$ define a new feature space in which the documents can be clustered using the similarity given by $VV^T$; similarly, the keywords can be clustered using $UU^T$. Note that the SVD of $C$ is equivalent to the eigendecomposition $C^TC = V \Sigma^2 V^T$, so the clustering information LSI uses is contained in $C^TC$. Expanding the terms, we see that $(C^TC)(i,j) = \sum_{k=1}^K C(k,i)C(k,j)$ defines a document-document similarity measure that counts the number of keywords occurring in both documents $i$ and $j$. As we have seen in the email example, using the complete set of keywords as the basis for the similarity measure (5(e)) can obscure the true similarity (5(d)) provided by the informative keywords.
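
To make the comparison concrete, the following is a minimal sketch of this LSI baseline in Python/NumPy (not the coclustering algorithm of this paper). The toy co-occurrence matrix, the choice $c=2$, and the argmax cluster assignment at the end are illustrative assumptions, not taken from the paper.

import numpy as np

# Toy keyword-document co-occurrence matrix (K = 4 keywords, N = 4 documents):
# C[k, i] = number of times keyword k occurs in document i.
C = np.array([[2., 3., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 4., 2.],
              [0., 1., 3., 3.]])

# Singular value decomposition C = U Sigma V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# The SVD of C is equivalent to the eigendecomposition C^T C = V Sigma^2 V^T.
assert np.allclose(C.T @ C, Vt.T @ np.diag(s**2) @ Vt)

# Keep the top c singular triplets; the rows of V are document coordinates in
# the latent space, and V V^T is the document-document similarity LSI uses.
c = 2
V = Vt[:c, :].T                        # N x c document coordinates
doc_similarity = V @ V.T               # N x N similarity from keyword overlap
kw_similarity = U[:, :c] @ U[:, :c].T  # K x K keyword-keyword similarity

# One crude way to read off clusters: assign each document to the latent
# dimension on which it projects most strongly.
doc_labels = np.argmax(np.abs(V), axis=1)
print(doc_labels)                      # [1 1 0 0] for this toy matrix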

A remedy to this problem is to try to identify the non-informative keywords, e.g., by computing keyword occurrence frequency counts. This is insufficient: as figure 5(c) shows, non-informative keywords can have either high frequency (common keywords) or low frequency (random keywords), so frequency alone cannot single them out.
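
As a small illustration (a contrived example, not the data of figure 5(c)), the sketch below builds a toy co-occurrence matrix in which a rare but informative keyword and a random keyword have the same total count, so no frequency threshold can separate them.

import numpy as np

# Toy matrix: documents 0-1 form one cluster, documents 2-3 the other.
C = np.array([[3., 2., 0., 0.],    # informative, concentrated in docs 0-1
              [1., 1., 0., 0.],    # informative but rare
              [4., 5., 6., 5.],    # non-informative, high frequency (common)
              [1., 0., 1., 0.]])   # non-informative, low frequency (random)

freq = C.sum(axis=1)               # keyword occurrence frequency counts
print(freq)                        # [ 5.  2. 20.  2.]
# The rare informative keyword (row 1) and the random keyword (row 3) have the
# same count, so frequency alone cannot identify the non-informative keywords.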

