US6038561A | 2000-03-14 |
Claims The following is the claim for this invention: 1. Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. In this invention we use the scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents we compute the term-document (or document-term) matrix, which describes the frequency of the terms that occur in a collection of documents. (Frequently occurring insignificant terms such as "the", "of" and "for" are not considered.) Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the term-document matrix we create a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of documents that relate to each other. In the scatter-gather approach, for a particular frequently used text-based search query of a user, we group the clusters of documents with relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found. The above novel technique of using the scatter-gather approach to flat clustering for frequently used text-based search queries is the claim for this invention. |
Flat clustering creates a flat set of clusters without any explicit structure that would relate the clusters to each other. In this invention we use the scatter-gather approach to flat clustering for frequently used text-based search queries. For a given set of documents we compute the term-document (or document-term) matrix, which describes the frequency of the terms that occur in a collection of documents. (Frequently occurring insignificant terms such as "the", "of" and "for" are not considered.) Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection of documents. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word. From the term-document matrix we create a TF-IDF matrix, which is used to compute document similarity and to create a flat set of clusters of documents that relate to each other. In the scatter-gather approach, for a particular frequently used text-based search query of a user, we group the clusters of documents with relevant information, and the resulting set is clustered again. This process is repeated until a cluster of interest is found.
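The steps described above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The following are assumptions made for illustration only: the stop-word list, the smoothed IDF formula, cosine similarity as the document similarity measure, a k-means-style flat clustering with farthest-first seeding, automatic selection of the cluster closest to the query in place of a user's choice, and the hypothetical names `tfidf_vectors`, `flat_cluster` and `scatter_gather` together with the parameters `k`, `keep` and `max_size`.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; the text only names "the", "of", "for" as examples.
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]  # rows of the document-term matrix
    df = Counter(t for c in counts for t in c)     # documents containing each term
    n = len(docs)
    # Smoothed IDF (an assumption) keeps every weight positive.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in c.items()}
            for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def flat_cluster(ids, vecs, k, iters=10):
    """Partition document ids into at most k flat clusters (cosine k-means)."""
    seeds = [ids[0]]
    while len(seeds) < min(k, len(ids)):
        # Farthest-first seeding: pick the document least similar to all seeds.
        seeds.append(min((i for i in ids if i not in seeds),
                         key=lambda i: max(cosine(vecs[i], vecs[s]) for s in seeds)))
    centers = [vecs[s] for s in seeds]
    groups = [[s] for s in seeds]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for i in ids:
            best = max(range(len(centers)), key=lambda c: cosine(vecs[i], centers[c]))
            groups[best].append(i)
        groups = [g for g in groups if g]  # drop clusters that ended up empty
        centers = [centroid([vecs[i] for i in g]) for g in groups]
    return groups

def scatter_gather(docs, query, k=2, keep=1, max_size=2):
    """Cluster (scatter), keep the clusters closest to the query (gather), repeat."""
    vecs = tfidf_vectors(docs)
    q = dict(Counter(tokenize(query)))  # query as raw term counts (assumption)
    ids = list(range(len(docs)))
    while len(ids) > max_size:
        groups = flat_cluster(ids, vecs, k)  # scatter
        groups.sort(key=lambda g: cosine(q, centroid([vecs[i] for i in g])),
                    reverse=True)
        gathered = [i for g in groups[:keep] for i in g]  # gather
        if len(gathered) == len(ids):  # no narrowing is possible; stop
            break
        ids = gathered
    return [docs[i] for i in ids]
```

In this sketch the TF-IDF matrix is computed once for the whole collection and reused in every round. Calling `scatter_gather(docs, "python programming")` on a small list of strings returns the documents of the last gathered cluster, which stands in for the "cluster of interest" of the text.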