
How do we extract document clusters from the co-occurrence matrix $C$?

To put our algorithm in context, let us compare it with Latent Semantic Indexing (LSI, [3]), a commonly used approach in Natural Language Processing. LSI works in two steps. First it computes the singular value decomposition of the keyword-document co-occurrence matrix: $C_{K\times N} = U_{K\times c} \Sigma_{c\times c} V^T_{c \times N}$. The right singular vectors $V$ define a new feature space in which the documents can be clustered using the similarity given by $VV^T$; similarly, the keywords can be clustered using $UU^T$. Note that the SVD of $C$ is equivalent to the eigendecomposition $C^TC = V \Sigma^2 V^T$, so the clustering information LSI uses is contained in $C^TC$. Expanding the terms, we see that $(C^TC)(i,j) = \sum_{k=1}^K C(k,i)C(k,j)$ defines a document-document similarity measure that counts the number of keywords occurring in both documents $i$ and $j$. As we have seen in the email example, using the complete set of keywords as the basis for the similarity measure (5(e)) can obscure the true similarity (5(d)) provided by the informative keywords.
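
To make the comparison concrete, the following is a minimal sketch of this LSI baseline in Python/NumPy (not the coclustering algorithm of this paper). The toy co-occurrence matrix, the choice $c=2$, and the argmax cluster assignment at the end are illustrative assumptions, not taken from the paper.

import numpy as np

# Toy keyword-document co-occurrence matrix (K = 4 keywords, N = 4 documents):
# C[k, i] = number of times keyword k occurs in document i.
C = np.array([[2., 3., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 4., 2.],
              [0., 1., 3., 3.]])

# Singular value decomposition C = U Sigma V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# The SVD of C is equivalent to the eigendecomposition C^T C = V Sigma^2 V^T.
assert np.allclose(C.T @ C, Vt.T @ np.diag(s**2) @ Vt)

# Keep the top c singular triplets; the rows of V are document coordinates in
# the latent space, and V V^T is the document-document similarity LSI uses.
c = 2
V = Vt[:c, :].T                        # N x c document coordinates
doc_similarity = V @ V.T               # N x N similarity from keyword overlap
kw_similarity = U[:, :c] @ U[:, :c].T  # K x K keyword-keyword similarity

# One crude way to read off clusters: assign each document to the latent
# dimension on which it projects most strongly.
doc_labels = np.argmax(np.abs(V), axis=1)
print(doc_labels)                      # [1 1 0 0] for this toy matrix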

A remedy to this problem is to try to identify the non-informative keywords, e.g., by computing keyword occurrence frequency counts. This is insufficient: as figure 5(c) shows, non-informative keywords can have either high frequency (common keywords) or low frequency (random keywords), so frequency alone cannot single them out.
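
As a small illustration (a contrived example, not the data of figure 5(c)), the sketch below builds a toy co-occurrence matrix in which a rare but informative keyword and a random keyword have the same total count, so no frequency threshold can separate them.

import numpy as np

# Toy matrix: documents 0-1 form one cluster, documents 2-3 the other.
C = np.array([[3., 2., 0., 0.],    # informative, concentrated in docs 0-1
              [1., 1., 0., 0.],    # informative but rare
              [4., 5., 6., 5.],    # non-informative, high frequency (common)
              [1., 0., 1., 0.]])   # non-informative, low frequency (random)

freq = C.sum(axis=1)               # keyword occurrence frequency counts
print(freq)                        # [ 5.  2. 20.  2.]
# The rare informative keyword (row 1) and the random keyword (row 3) have the
# same count, so frequency alone cannot identify the non-informative keywords.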

