To put our algorithm in context, let us compare it with Latent Semantic Indexing (LSI, [3]), a commonly used approach in Natural Language Processing. LSI works in two steps. First, it computes the singular value decomposition of the document-keyword co-occurrence matrix $A$: $A = U \Sigma V^T$. The singular vectors define a new feature space in which the documents can be clustered using the similarity defined by $A A^T$; similarly, the keywords can be clustered using $A^T A$. Note that the SVD of $A$ is equivalent to the eigendecompositions $A A^T = U \Sigma^2 U^T$ and $A^T A = V \Sigma^2 V^T$. Therefore the clustering information used for the documents is contained in $A A^T$. Expanding out the terms, we see that $(A A^T)_{ij} = \sum_k A_{ik} A_{jk}$ defines a document-document similarity measure by counting the number of keywords occurring in both documents $i$ and $j$. As we have seen in the email example, using the complete set of keywords as the basis for the similarity measure (5(e)) can obscure the true similarity (5(d)) provided by the informative keywords.
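The two LSI steps, and the equivalence between the SVD of $A$ and the eigendecomposition of $A A^T$, can be sketched in a few lines of NumPy. The document-keyword matrix below is an invented toy example, not data from the email experiment:

```python
import numpy as np

# Toy document-keyword co-occurrence matrix A (rows: documents,
# columns: keywords); the counts here are illustrative only.
A = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 2, 1, 1],
    [0, 1, 1, 2, 0],
], dtype=float)

# Step 1: singular value decomposition A = U S V^T.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Step 2: cluster documents in the feature space of the left singular
# vectors (U); keywords live in the space of the right singular vectors (V).
# The document-document similarity underlying this clustering is A A^T:
doc_sim = A @ A.T

# SVD of A is equivalent to the eigendecomposition A A^T = U S^2 U^T.
assert np.allclose(doc_sim, U @ np.diag(S**2) @ U.T)

# Entry (i, j) sums keyword co-occurrences between documents i and j,
# i.e. it counts (with multiplicity) keywords shared by both documents.
i, j = 0, 1
assert doc_sim[i, j] == sum(A[i, k] * A[j, k] for k in range(A.shape[1]))
```

Because every keyword column contributes equally to $(A A^T)_{ij}$, non-informative keywords enter the similarity with the same weight as informative ones, which is the weakness discussed above.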
A remedy to this problem is to try to identify the non-informative keywords, e.g. by computing the keyword occurrence frequency count. This, however, is insufficient: as we can see in figure 5(c), non-informative keywords can have both high frequency (common keywords) and low frequency (random keywords).