Title:
METHOD AND APPARATUS FOR GENERATING KEY PHRASES
Document Type and Number:
WIPO Patent Application WO/2011/141449
Kind Code:
A1
Abstract:
A method for identifying a set of keyphrases is disclosed which comprises selecting one or more initial keyphrases, generating semantically-related keyphrases from the one or more initial keyphrases, clustering the semantically-related keyphrases into a plurality of clustered keyphrases; and further comprising removing, if necessary, at least one of the plurality of clustered keyphrases by determining a greatest one of a cluster distance between clusters of the clustered keyphrases and removing the cluster having the greatest one of the cluster distance.

Inventors:
WEIR DAVID (GB)
HARPER ROBERT (GB)
TEE PHILIP (GB)
Application Number:
PCT/EP2011/057482
Publication Date:
November 17, 2011
Filing Date:
May 10, 2011
Assignee:
JABBIT LTD (GB)
WEIR DAVID (GB)
HARPER ROBERT (GB)
TEE PHILIP (GB)
International Classes:
G06F17/30
Domestic Patent References:
WO2001031479A12001-05-03
Foreign References:
EP2045735A22009-04-08
Other References:
SHUI-LUNG CHUANG ET AL: "Towards automatic generation of query taxonomy: a hierarchical query clustering approach", 2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, MAEBASHI CITY,, 9 December 2002 (2002-12-09) - 12 December 2002 (2002-12-12), IEEE COMPUT. SOC., LOS ALAMITOS, CA, US, pages 75 - 82, XP010805104, ISBN: 978-0-7695-1754-4, DOI: 10.1109/ICDM.2002.1183888
DANIEL CRABTREE ET AL: "Query Directed Web Page Clustering", PROCEEDINGS OF THE IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, 1 December 2006 (2006-12-01), IEEE COMPUT. SOC., LOS ALAMITOS, CA, US, pages 202 - 210, XP031008602, ISBN: 978-0-7695-2747-5
Attorney, Agent or Firm:
HARRISON, Robert (Munich, DE)
Claims:
1. A method for identifying a set of keyphrases comprising:

- selecting one or more initial keyphrases;

- generating semantically-related keyphrases from the one or more initial keyphrases;

- clustering the semantically-related keyphrases into a plurality of clustered keyphrases; and

- further comprising removing, if necessary, at least one of the plurality of clustered keyphrases by determining a greatest one of a cluster distance between clusters of the clustered keyphrases and removing the cluster having the greatest one of the cluster distance.

2. The method of claim 1, further comprising

- inputting one or more search terms;

- using the one or more search terms to locate a plurality of electronic documents relevant to the one or more search terms; and

- selecting the one or more initial keyphrases from keyphrases associated with the located plurality of electronic documents.

3. The method of claim 1 or claim 2, further comprising

- choosing one or more electronic documents; and

- generating one or more initial keyphrases from the electronic documents.

4. The method of any of the above claims, wherein one or more clusters of the clustered keyphrases are associated with one of the initial keyphrases; and removing any of the initial keyphrases having no cluster associated with the initial keyphrase after removal of the cluster.

5. The method of any of the above claims, further comprising presenting the semantically-related keyphrases to a user.

6. The method of any of the above claims, further comprising using the set of the semantically-related keyphrases to expand a search query.

7. The method of any of the above claims, further comprising tagging one or more electronic documents with one or more members of the semantically-related keyphrases.

8. An apparatus comprising:

- a thesaurus having a plurality of keyphrase triples, each of the keyphrase triples having a first keyphrase, a second keyphrase and a keyphrase distance value;

- a keyphrase augmenter adapted to take one or more initial keyphrases and generate a plurality of semantically-related keyphrases from looking-up the one or more initial search terms in the thesaurus.

9. The apparatus of claim 8, further comprising:

- a search engine adapted to use one or more search terms and search a collection of electronic documents to locate a set of electronic documents relevant to the search terms and further adapted to select the one or more initial keyphrases from keyphrases associated with the located plurality of electronic documents.

10. The apparatus of any of the claims 8 to 9, wherein the keyphrase augmenter is further adapted to generate initial keyphrases from one or more electronic documents.

11. The apparatus of any of the claims 8 or 10 wherein the keyphrase augmenter is further adapted to cluster the semantically-related keyphrases to generate a plurality of clustered keyphrases.

12. The apparatus of claim 11, wherein the keyphrase augmenter is adapted to use the keyphrase distance value from the thesaurus to generate the plurality of clustered keyphrases.

13. The apparatus of claim 12, wherein the keyphrase augmenter is adapted to determine a cluster distance between the plurality of clustered keyphrases and to determine which one of the clustered keyphrases is at a greatest cluster distance from the other ones of the plurality of clustered keyphrases.

14. The apparatus of claim 13, wherein the keyphrase augmenter is adapted to identify any ones of the initial keyphrases having none of the plurality of clustered keyphrases associated therewith.

15. A computer program product in a non-transitory computer readable storage medium comprising

- first logic means for selecting one or more initial keyphrases;

- second logic means for generating semantically-related keyphrases from the one or more initial keyphrases;

- third logic means for clustering the semantically-related keyphrases into a plurality of clustered keyphrases, and

- fourth logic means for removing, if required, at least one of the plurality of clustered keyphrases by determining a greatest one of a cluster distance between clusters of the clustered keyphrases and removing the cluster having the greatest one of the cluster distance.

Description:

Title: Method and Apparatus for Generating Key Phrases

Cross-Reference to Related Applications

[0001] This application claims the priority of and benefit to US Provisional Patent Application No. 61/333,233 "Method and Apparatus for Generating Key Phrases" filed on 10 May 2010.

Field of the Invention

[0002] The field of the present invention relates to an apparatus and a method for generating key phrases.

Background of the Invention

[0003] The internet, often referred to as the worldwide web, is an extremely large interconnected network of electronic information that comprises millions of private and public networks that are linked together. The electronic information is stored on the worldwide web in the form of electronic documents. The contents of the electronic documents can be searched by a user of the internet using, for example, well-known search engines, such as Google or Yahoo.

[0004] It has been estimated during the last decade that the internet has grown by 100% per year (see the document titled "The size and growth rate of the internet" by Coffman, K. G. and Odlyzko, A. M. of AT&T Labs on the website http://netlib.bell-labs.com/netlib/att/mat/people/amo/doc/internat.size.ps). As a consequence of the growth of the internet, the content of electronic documents on the internet has grown too.

[0005] There is a need for a user to quickly locate electronic documents that have a degree of relevance and accuracy to a search that has been performed by the user.

[0006] A user conducting the search on the internet typically enters into a search engine a search string or a plurality of search strings. The search engine will then query the search string with a plurality of further websites to see whether the search string matches a portion of one or more of the electronic documents. Such a search procedure makes the searching process for a particular one of the electronic documents over the internet laborious and time consuming until the user finds the electronic documents that are truly relevant to the information sought.

[0007] Another search strategy implemented by the search engine will search for an embedded tag in a webpage or in the electronic document, which may be relevant to the search string entered by the user. However, the results of the search may not be relevant to the information sought by the user. In particular it is possible that embedded tags are inserted to direct the user to one of the websites or one of the electronic documents that are of marginal interest. Therefore the user may have to use a plurality of search engines offered by a plurality of websites to find the relevant information sought.

[0008] One issue with the input of the search string is that there may be a number of related search strings which the user might also be able to use in order to optimize the search results. Such related search strings include synonyms of the search string or different spellings of the search string. If the user fails to use all relevant related search strings, then there is a risk that the user may fail to identify relevant documents in the search.

Summary of the Invention

[0009] This disclosure teaches a method for identifying a set of keyphrases by selecting one or more initial keyphrases, generating semantically-related keyphrases from the one or more initial keyphrases and then clustering the semantically-related keyphrases into a plurality of clustered keyphrases. Outlying or less relevant ones of the semantically-related keyphrases can be removed by determining a greatest one of a cluster distance between clusters of the clustered keyphrases and removing the cluster having the greatest one of the cluster distance.

[00010] This enables the initial keyphrases to be augmented with semantically-related keyphrases and the clustering of the semantically-related keyphrases then to be used to disambiguate the semantically-related keyphrases by grouping phrases with semantically-similar structures together.

[00011] It will be appreciated that the term "keyphrase" as used in this disclosure means any word or set of words which is relevant to the content of an electronic document. The term "keyphrase" is substantially synonymous with the term "keyword" often used in connection with search engines, such as the Google or Yahoo search engine.

[00012] The method of the disclosure further comprises, in one aspect, inputting one or more search terms and using the one or more search terms to locate a plurality of electronic documents relevant to the one or more search terms, using for example a search engine. The keywords or keyphrases associated with the located plurality of electronic documents can be used to select the one or more initial keyphrases. This enables the generation of a set of initial keyphrases from a user-initiated search.

[00013] It is also possible in another aspect of the invention to choose one or more electronic documents and to generate one or more initial keyphrases from the electronic documents directly.

[00014] The clustering of the semantically-related keyphrases is based upon distance measurements between clusters of the semantically-related keyphrases.

[00015] Outlying (i.e. less relevant) ones of the initial keyphrases can be removed by removing any of the initial keyphrases having no cluster associated with the initial keyphrase after several iterations of removal of the clusters. It will be noted that it is possible that no clusters will be removed if none of the initial keyphrases are considered to be outlying keyphrases.

[00016] The disclosure also teaches an apparatus comprising: a thesaurus having a plurality of keyphrase triples, each of the keyphrase triples having a first keyphrase, a second keyphrase and a keyphrase distance value, and a keyphrase augmenter adapted to take one or more initial keyphrases and generate a plurality of semantically-related keyphrases from looking-up the one or more initial search terms in the thesaurus.

[00017] The disclosure also teaches a computer program product in a non-transitory computer readable storage medium that comprises first logic means for selecting one or more initial keyphrases, second logic means for generating semantically-related keyphrases from the one or more initial keyphrases, and third logic means for clustering the semantically-related keyphrases into a plurality of clustered keyphrases.

Description of Figures

[00018] The foregoing features of the present invention will be apparent from the following detailed description of the invention, taken in conjunction with the accompanying drawings in which:

[00019] Figure 1 shows a schematic of an apparatus for searching and grouping a subset of a plurality of electronic documents according to the present invention.

[00020] Figure 2 shows an example of an apparatus that can be used to implement the invention.

[00021] Figure 3 shows an example of clustering of keyphrases.

Detailed Description of the Invention

[00022] For a complete understanding of the present invention and the advantages thereof, reference is made to the following detailed description taken in conjunction with the accompanying figures.

[00023] It should be appreciated that various aspects of the present invention discussed herein are merely illustrative of the specific ways to make and use the invention and do not limit the scope of the invention when taken into consideration with the claims and the following detailed description and the accompanying figures.

[00024] It should be observed that features from one aspect of the invention can be combined with features from other aspects of the invention.

[00025] The teachings of the cited documents should be incorporated by reference into the description.

[00026] Figure 1 shows an overview of an architecture of a system 10 for generating keyphrases to enable the searching and grouping of a subset of a plurality of electronic documents 70 that are searched for by a user 20 over a web 60. The web 60 could be the internet, an internal intranet, or a combination of the two. Fig. 1 shows a plurality of users 20 who are at a terminal 22 with a display device 23 for accessing the web 60. The users 20 include the user 20 who is performing the search using one or more search terms for the relevant electronic document 70 and further users 20 who are not performing the search (i.e. not entering a search string). It will be appreciated that the system 10 shown in Fig. 1 is merely exemplary and that other architectures are possible.

[00027] The terminal includes any means for the user 20 to access the web 60 and includes, but is not limited to, a personal computer, PDA, mobile phone etc.

[00028] The user 20 can access the web 60 through a web interface 30 by a network connection 25. The network connection 25 includes any known method in the art for connecting to the web interface 30, such as fixed line communication or a mobile communication. The web interface 30, which could be an Internet Service Provider or telecommunication provider, is connected to the web 60 generally by an Internet backbone 50.

[00029] The users 20 are connected through the web interface 30 to a thesaurus 40 and a keyphrase augmenter 42. The operation of the keyphrase augmenter 42 will be explained in more detail later in this disclosure. It will be noted at this stage that the keyphrase augmenter 42 is shown as a single block or module on Fig. 1 but that in practice the keyphrase augmenter 42 comprises a series of software modules or processes. The thesaurus 40 is shown as a single database on Fig. 1 but the thesaurus 40 may comprise a plurality of different components or individual thesauruses which can either be accessed individually or formed together as an aggregated thesaurus which can be commonly accessed.

[00030] Fig. 1 also shows a search engine 35 connected to the web interface 30 and to the web 60 by the internet backbone 50. The search engine 35 is also connected to the thesaurus 40. The search engine 35 could be one or more of conventional search engines 35, such as Google, Yahoo, Bing, or Al, or could be a proprietary search engine. The search engine 35 will generally have a search database 37 attached. The search database 37 will contain references 37a to the electronic documents 70 in the web 60, for example in the form of hyperlinks or web addresses, and may also contain one or more search tags 37b associated with the electronic documents 70. The electronic documents 70 in the web 60 may also contain one or more document keyphrases 72.

[00031] There are numerous ways in which such document keyphrases 72 can be generated and this disclosure is not intended to limit the types and manner of the document keyphrases 72. For example, the Yahoo search engine generates one or more document keyphrases 72 for the electronic documents 70 using a keyword generation algorithm to identify the most important subject words in the electronic document 70. These document keyphrases 72 can be accessed through a BOSS API provided by the Yahoo search engine.

[00032] The document keyphrases 72 can also be generated manually by a reader or author of the electronic document 70. The document keyphrases 72 might also be derivable from metatags associated with the electronic document 70 and programmed by the author of the electronic document 70. It is also possible to use software which will generate document keyphrases 72 from a collection of electronic documents 70.

[00033] It will be appreciated that the thesaurus 40, the keyphrase augmenter 42, the search engine 35 and the search database 37 are shown as separate elements on Fig. 1. In practice the separate elements could without limitation be computer programs running on one or more servers or distributed throughout a cloud computing network and need not be separate dedicated computers, servers or processors.

[00034] The keyphrase augmenter 42 will now be described in more detail. The keyphrase augmenter 42 is connected to and can access keyphrase triplets 45 stored in the thesaurus 40 (or in one or more of the thesaurus components). Entries for the keyphrase triplets 45 comprise a first keyphrase 45a, a second keyphrase 45b which is semantically related to the first keyphrase 45a and a keyphrase distance value 45c which indicates the semantic distance between the first keyphrase 45a and the second keyphrase 45b.

[00035] The keyphrase triplets 45 in the thesaurus 40 are created by parsing a corpus of text and determining semantically related keyphrases. In essence, a large matrix is created in which each row of the matrix represents one of the first keyphrase 45a and/or the second keyphrase 45b. Each column entry of the matrix represents a feature of the first keyphrase 45a or second keyphrase 45b in the row entry of the matrix. The term "feature" used in this context is any word or phrase which has a position in the text of the corpus which is related in some manner to the first keyphrase 45a or the second keyphrase 45b. This relationship could be the combination of a verb and an object of the verb, a noun and an adjective, or it could be two adjacent words or phrases in a text or part of a text, such as a paragraph or sentence. The type of relationship chosen depends on the purpose of the thesaurus 40. The value in the matrix at the intersection of the column and row represents the frequency with which the relationship between the row entry and the column entry occurs in the corpus. This frequency is normalized against a suitable measure. It is possible that the value is given one value (e.g. 1) when the frequency exceeds a given threshold value and is given a second value (e.g. 0) otherwise.

[00036] This can be better explained using examples. Suppose an entry in the row might be "apple", "banana" or "orange". The feature in the column is for example "red", "green", "yellow" (adjectives) or "eat" and "slip on" (verbs). An apple is red, yellow or green but a banana is yellow or green (but not red). There would therefore be high values in the matrix at the intersections representing "apple, red", "apple, yellow" and "apple, green" as well as "banana, yellow" and "banana, green" since it could be expected that these combinations are common.

[00037] One would expect lower values for "apple, orange", "orange, yellow" and "orange, green". Similarly there would be high values for "apple, eat" and "banana, eat" but not necessarily for "apple, slip on". The value for "banana, slip on" would depend on how often the concept of slipping on a banana appeared in the corpus of the text. It will be noted that keyphrases are stemmed, i.e. that variations of a word such as the past tense of a verb (ate) and the present tense (eat) are recognized as being equivalent to each other.
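By way of a non-limiting illustration, the construction of the matrix described above could be sketched in Python as follows. The function name `build_feature_matrix`, the representation of the matrix as nested dictionaries, the normalization of each row by its total count and the sample observations are all assumptions made for this sketch and are not taken from the disclosure; stemming of the keyphrases is omitted.

```python
from collections import Counter, defaultdict


def build_feature_matrix(observations, threshold=None):
    """Count how often each (keyphrase, feature) relationship was observed.

    observations: iterable of (keyphrase, feature) pairs extracted from a corpus.
    threshold: if given, each count becomes 1 when it exceeds the threshold
               and 0 otherwise; if omitted, counts are normalized per row.
    Returns {keyphrase: {feature: value}}; features never seen are implicitly 0.
    """
    counts = defaultdict(Counter)
    for keyphrase, feature in observations:
        counts[keyphrase][feature] += 1

    matrix = {}
    for keyphrase, row in counts.items():
        if threshold is None:
            # Assumed normalization: divide by the total count of the row.
            total = sum(row.values())
            matrix[keyphrase] = {f: c / total for f, c in row.items()}
        else:
            matrix[keyphrase] = {f: 1 if c > threshold else 0 for f, c in row.items()}
    return matrix


# Hypothetical observations, standing in for relationships parsed from a corpus.
observations = [
    ("apple", "red"), ("apple", "green"), ("apple", "eat"), ("apple", "eat"),
    ("banana", "yellow"), ("banana", "eat"), ("banana", "slip on"),
]
matrix = build_feature_matrix(observations)
```

A sparse dictionary per row is used here because most keyphrase and feature combinations would be expected never to occur in a corpus.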

[00038] Examination of the matrix in the thesaurus 40 reveals that a number of the rows are similar to each other. Each row of the matrix is examined for the degree of similarity with other rows in the matrix. The degree of similarity between the values in the row representing one of the keyphrases (e.g. a first keyphrase 45a) with the values in the row representing another one of the keyphrases (e.g. a second keyphrase 45b) is assigned the keyphrase distance value 45c. The keyphrase distance value 45c is therefore a measurement of the semantic similarity of the first keyphrase 45a and the second keyphrase 45b.

[00039] There are several methods known for determining the degree of similarity between the rows. If the rows contain merely values of 1 and 0, as discussed above, then one measure would be to look at the degree to which the rows have a similar pattern of 0s and 1s. Another measure would be to use the (normalized) frequencies and compute the semantic distance therefrom.
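The two kinds of measure mentioned above could, for example, be sketched as follows. The disclosure does not name particular measures; the Jaccard distance (for rows of 0s and 1s) and the cosine distance (for rows of normalized frequencies) are chosen here purely as assumptions, as are the function names and the representation of a row as a dictionary.

```python
import math


def jaccard_distance(row_a, row_b):
    """Distance between two binary rows given as {feature: 0 or 1}."""
    on_a = {f for f, v in row_a.items() if v}
    on_b = {f for f, v in row_b.items() if v}
    union = on_a | on_b
    if not union:
        return 0.0  # two empty rows have identical patterns
    return 1.0 - len(on_a & on_b) / len(union)


def cosine_distance(row_a, row_b):
    """Distance between two rows of (normalized) frequencies."""
    dot = sum(v * row_b.get(f, 0.0) for f, v in row_a.items())
    norm_a = math.sqrt(sum(v * v for v in row_a.values()))
    norm_b = math.sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # nothing in common can be established
    return 1.0 - dot / (norm_a * norm_b)


def make_triplets(matrix, distance=cosine_distance):
    """Return (first keyphrase, second keyphrase, distance value) for every pair of rows."""
    phrases = sorted(matrix)
    return [
        (a, b, distance(matrix[a], matrix[b]))
        for i, a in enumerate(phrases)
        for b in phrases[i + 1:]
    ]
```

With either measure a smaller value indicates a closer semantic relationship, which matches the way the keyphrase distance value 45c is used in the remainder of this description.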

[00040] The matrix and the thesaurus reveal semantic patterns in the words. This can be seen by considering the co-occurrence of the keyphrase pairs "big apple", "big orange" and "big banana". It might be expected that such keyphrase pairs are highly similar as they all refer to the size of a fruit. However, the term "Big Apple" is often used to refer to the city of New York and thus would be expected to occur in different contexts and more often in a randomly chosen corpus of text. Thus, the keyphrase distance value 45c between "big banana" and "big orange" is expected to indicate that these keyphrases are highly semantically similar, whilst the keyphrase distance values 45c between "big apple" and "big banana" or "big orange" indicate that the keyphrase "big apple" has another semantic function. In this example the keyphrase "Big Apple" is a so-called non-compositional term.

[00041] It will be noted that the first keyphrase 45a and the second keyphrase 45b are not necessarily synonyms of each other (but that might happen).

[00042] It will also be appreciated that the keyphrase distance value 45c may also depend on the type of the corpus of text from which the thesaurus 40 (or, more likely, one of the component thesauruses) is derived. The example shown in Fig. 3 relates to a keyphrase "Red Dwarf", which is the name of a TV show "Red Dwarf" and is also the name of an astronomical feature. One would expect, by parsing a large randomly selected number of electronic documents 70 on the web 60, to find a semantic relationship between the term "red dwarf" and other terms (or keyphrases) in both television and astronomy. This close semantic relationship would be reflected by a smaller value for the keyphrase distance value 45c between the first keyphrase 45a (red dwarf) and the second keyphrase 45b (television or astronomy).

[00043] Examining, for example, only book and TV reviews or scientific publications would be expected to produce different values for the same keyphrase pairs for the keyphrase distance value 45c. The scientific publication is more likely to refer to a red dwarf star (i.e. astronomical feature) as opposed to the TV programme and books of the same name. On the other hand the book and TV reviews are more likely to refer to the TV programme. This would be reflected in a smaller value for the keyphrase distance value 45c.

[00044] An aspect of the present invention will now be described with reference to Figure 2. In a first step 200 the user 20 enters a search query with several search terms on the terminal 22. The search query will be displayed on the display device 23. The search query with the search terms is passed (in this aspect of the invention) in step 210 through the web interface 30 to the search engine 35. The search engine 35 produces a list of relevant electronic documents 70 by searching through the search database 37. The search engine 35 extracts in step 213 from this list a set of initial keyphrases from the document keyphrases 72 associated with the electronic documents 70, and/or the search tags 37b in the search database 37. The list of initial keyphrases is passed in step 215 to the keyphrase augmenter 42. The Yahoo search engine offers the BOSS API which allows the extraction of the list of document keyphrases 72 found in a Yahoo search and is one non-limiting way of implementing the step 213.

[00045] It will be noted at this stage that an alternative aspect of the invention is to review a selected set of electronic documents 70 and to extract important and relevant initial keyphrases from the selected set of electronic documents using either a human or an algorithmic approach, such as that known in the art. This can be done using software as noted above. It will be noted that it is indeed possible to generate the initial keyphrases from a single document.

[00046] The keyphrase augmenter 42 examines in step 217 each individual member of the set of initial keyphrases using the thesaurus 40 to identify all occurrences of the initial keyphrases in the thesaurus 40. It will be appreciated that it is possible that the thesaurus 40 may not contain all of the initial keyphrases but is likely to contain many or most of the initial keyphrases. The keyphrase augmenter 42 generates for each individual member of the set of initial keyphrases a series of keyphrase triplets 45 that comprises the individual member, e.g. first keyphrase 45a, one of the entries in the thesaurus 40 having a semantic relationship with the individual member 45a (this is the second keyphrase 45b in this example) and the keyphrase distance value 45c indicating the semantic distance between the first keyphrase 45a and the second keyphrase 45b. The returned keyphrase triplets 45 are examined to form a set of candidate keyphrases. It will be understood that the role of the keyphrase augmenter 42 is thus to take the set of initial keyphrases and augment this set of initial keyphrases to produce for each member of the set of initial keyphrases the set of candidate keyphrases.

[00047] The number of keyphrase triplets 45 returned to the keyphrase augmenter 42 from the thesaurus 40 could potentially be very high. Cut-off mechanisms can be installed to ensure that only, for example, the first fifty most semantically relevant keyphrase triplets 45 are retrieved in step 217 from the thesaurus 40. It will also be understood that there may be overlap between some of the candidate keyphrases extracted from the thesaurus 40. This will occur when two or more of the initial keyphrases are semantically related to the same keyphrase in the thesaurus.
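The look-up of step 217 together with a cut-off mechanism could be sketched as below. The names `KeyphraseTriplet` and `augment`, the storage of the thesaurus as a plain list, the assumption that a look-up matches only the first keyphrase of a triplet, and the distance values in the example are all hypothetical and serve only to illustrate the paragraphs above.

```python
from typing import NamedTuple


class KeyphraseTriplet(NamedTuple):
    first: str       # first keyphrase (45a)
    second: str      # semantically related second keyphrase (45b)
    distance: float  # keyphrase distance value (45c); smaller means closer


def augment(initial_keyphrases, thesaurus, limit=50):
    """Map each initial keyphrase found in the thesaurus to its closest triplets."""
    candidates = {}
    for phrase in initial_keyphrases:
        matches = [t for t in thesaurus if t.first == phrase]
        if not matches:
            continue  # the thesaurus need not contain every initial keyphrase
        matches.sort(key=lambda t: t.distance)
        candidates[phrase] = matches[:limit]  # cut-off: keep only the closest entries
    return candidates


# Invented thesaurus entries and distance values, for illustration only.
thesaurus = [
    KeyphraseTriplet("red dwarf", "kryten", 0.2),
    KeyphraseTriplet("red dwarf", "astronomy", 0.6),
    KeyphraseTriplet("red dwarf", "television", 0.3),
    KeyphraseTriplet("white dwarf", "astronomy", 0.1),
]
candidates = augment(["red dwarf", "white dwarf", "lister"], thesaurus, limit=2)
```

In this example "astronomy" is returned for "white dwarf" and would also be returned for "red dwarf" with a larger limit, which is the kind of overlap between candidate keyphrases noted above.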

[00048] The set of candidate keyphrases generated by the keyphrase augmenter 42 from the thesaurus 40 can be clustered in step 220 for each individual member of the set of initial keyphrases. This is done by taking the candidate keyphrases and clustering the candidate keyphrases for each individual member of the set of initial keyphrases using the keyphrase distance value 45c and using a clustering algorithm, such as the hierarchical clustering algorithm, as known in the art, to arrange the candidate keyphrases into clusters. It will be appreciated that other types of clustering algorithms are possible, for example the "k-means clustering algorithm".

[00049] An example of the use of the clustering algorithm is shown in Fig. 3 in which three different initial keyphrases 310 (white dwarf, red dwarf, kryten) are used to generate semantically related keyphrases 320 (i.e. augmented keyphrases) from the thesaurus 40. These candidate keyphrases are then clustered into clusters 330 using the clustering algorithm. It is possible for the same one of the candidate keyphrases to appear in more than one cluster (so-called "soft-clustering") or it is possible that each one of the candidate keyphrases appears in one and in only one cluster (i.e. hard-clustering).
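One possible form of the clustering of step 220 is sketched below as a simple agglomerative (hierarchical) hard clustering. The use of average linkage, the stopping threshold `max_distance`, the default distance of 1.0 for pairs not found in the thesaurus and the example keyphrases and values are assumptions of this sketch; the disclosure does not prescribe them.

```python
def pair_distance(distances, a, b, default=1.0):
    """Look up a symmetric keyphrase distance; unknown pairs get `default`."""
    if a == b:
        return 0.0
    return distances.get((a, b), distances.get((b, a), default))


def average_linkage(cluster_a, cluster_b, distances):
    """Mean keyphrase distance over all pairs drawn from the two clusters."""
    total = sum(pair_distance(distances, a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clusters(phrases, distances, max_distance):
    """Merge the two closest clusters until no pair is closer than max_distance."""
    clusters = [[p] for p in phrases]  # start with one cluster per keyphrase
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], distances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # the remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented candidate keyphrases and distance values, for illustration only.
distances = {("lister", "rimmer"): 0.1, ("star", "galaxy"): 0.2}
clusters = hierarchical_clusters(["lister", "rimmer", "star", "galaxy"], distances, 0.5)
```

Because each keyphrase ends up in exactly one cluster, this sketch corresponds to hard-clustering; a soft-clustering variant would need a different algorithm.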

[00050] The relationship between each one of the clusters 330 and another one of the clusters 330 is calculated in step 225 for each one of the clusters 330. This calculation is performed once in this aspect of the invention. There are several ways of calculating the relationship between one of the clusters 330 and another one of the clusters 330. In one aspect of the invention a "centroid" is calculated for each one of the clusters and cluster distances between the centroids are calculated. In another aspect of the invention, the semantic distance between each one of the candidate keyphrases in a first one of the clusters is calculated with respect to each one of the candidate keyphrases in a second one of the clusters and the total of the semantic distances is weighted to give the cluster distance.
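The two ways of calculating the cluster distance in step 225 could be sketched as follows. It is assumed here, beyond what the disclosure states, that each keyphrase has a feature vector (for instance its row of the matrix) from which a centroid can be formed, that the Euclidean distance is used between centroids, and that the total of the pairwise distances is weighted by the number of pairs; all function names are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of feature vectors given as {feature: value} dicts."""
    features = {f for v in vectors for f in v}
    return {f: sum(v.get(f, 0.0) for v in vectors) / len(vectors) for f in features}


def euclidean(u, v):
    """Euclidean distance between two sparse feature vectors."""
    features = set(u) | set(v)
    return math.sqrt(sum((u.get(f, 0.0) - v.get(f, 0.0)) ** 2 for f in features))


def centroid_distance(cluster_a, cluster_b, vectors):
    """Cluster distance measured between the centroids of the two clusters."""
    ca = centroid([vectors[p] for p in cluster_a])
    cb = centroid([vectors[p] for p in cluster_b])
    return euclidean(ca, cb)


def mean_pairwise_distance(cluster_a, cluster_b, distance):
    """Total of all cross-cluster keyphrase distances, weighted by the number of pairs."""
    total = sum(distance(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

The second function needs only the keyphrase distance values 45c, whereas the first needs access to the underlying feature vectors.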

[00051] The cluster 330 (or possibly clusters 330) having the lowest relationship (i.e. greatest distance) with all of the other clusters 330 is determined in step 230 and removed in step 240. The removal of the cluster 330 has the effect that those candidate keyphrases in the removed cluster 330 are considered to be the least relevant candidate keyphrases and thus of less interest. If all of the clusters 330 associated with one of the initial keyphrases 310 are removed, then it will be assumed that the initial keyphrase 310 is of little or no relevance. The initial keyphrase 310 having no remaining ones of the clusters 330 will be removed from further consideration. If, on the other hand, the initial keyphrase 310 still has some of the clusters 330 associated with it, then the initial keyphrase 310 is kept for further consideration.

[00052] Let us take the example of Fig. 3 to illustrate this in more detail. The user is interested in the TV show "red dwarf" and has entered the query "red dwarf" which has produced (via the search engine 35 and the electronic documents 70) the initial keyphrases "white dwarf", "red dwarf" and "kryten". The keyphrase augmenter 42 has processed these initial keyphrases to produce the candidate keyphrases 320 shown in Fig. 3. It is expected that kryten and red dwarf would be closely related, as kryten is a character in the TV programme Red Dwarf. White dwarf is an astronomical feature. It is therefore probable that the clustering voting method described will identify "white dwarf" as not as relevant to the search as the other two search keyphrases.

[00053] The most remote cluster determination step 230 and the cluster removal step 240 are repeated until a cut-off point is reached at which only the main clusters 330 are left, i.e. the clusters 330 whose cluster scores indicate that the clusters 330 are most closely related to each other (step 250). The cut-off point is arbitrarily defined and depends on the purpose for which the candidate keyphrases have been calculated. For example, there may be a requirement that only ten augmented keyphrases 320 are produced for an augmented search or there may be a requirement that only two or three augmented keyphrases 320 should be produced to automatically tag relevant ones of the electronic documents 70.

[00054] In step 260 a list of the most relevant semantically-related keyphrases is generated from those related keyphrases left in the clusters 330. These are all of the candidate keyphrases which have not been removed in step 240. It is also possible in step 260 to pick a single, most relevant one of the clusters 330 for each one of the initial keyphrases 310.
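The repetition of steps 230 and 240 up to the cut-off point of step 250 could be sketched as follows. Expressing the cut-off as a number of clusters to keep, scoring a cluster by the sum of its distances to all other clusters, and the cluster identifiers, keyphrases and distance values in the example are assumptions of this sketch and are not taken from Fig. 3.

```python
def prune_clusters(clusters, cluster_distance, keep):
    """Repeatedly drop the cluster that is farthest from all the others.

    clusters: {cluster_id: (initial_keyphrase, [candidate keyphrases])}
    cluster_distance: callable taking two cluster ids and returning a distance.
    keep: cut-off; pruning stops once this many clusters remain.
    Returns the surviving clusters and the initial keyphrases that still own one.
    """
    remaining = dict(clusters)
    while len(remaining) > keep and len(remaining) > 1:
        def remoteness(cid):
            # Total distance from this cluster to every other remaining cluster.
            return sum(cluster_distance(cid, other) for other in remaining if other != cid)

        farthest = max(remaining, key=remoteness)  # most remote cluster (step 230)
        del remaining[farthest]                    # removal of that cluster (step 240)
    surviving_initial = {initial for initial, _ in remaining.values()}
    return remaining, surviving_initial


# Invented clusters and cluster distances, for illustration only.
clusters = {
    "c1": ("red dwarf", ["lister", "rimmer"]),
    "c2": ("kryten", ["android", "lister"]),
    "c3": ("white dwarf", ["neutron star", "supernova"]),
}
table = {("c1", "c2"): 0.2, ("c1", "c3"): 0.9, ("c2", "c3"): 0.8}


def lookup(a, b):
    return table.get((a, b), table.get((b, a)))


remaining, surviving_initial = prune_clusters(clusters, lookup, keep=2)
```

In this example the only cluster associated with "white dwarf" is removed, so that initial keyphrase no longer has any cluster associated with it and would be dropped from further consideration.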

[00055] The set of candidate keyphrases 320 and the initial keyphrases 310 (which have not been removed) are combined to create a set of augmented keyphrases which can be presented to the user on the display device 22 in step 275 to enable the user 20 to select more appropriate search terms for searching for the electronic documents 70 using the search engine 35.

[00056] The set of augmented keyphrases could also be used to conduct a search for relevant electronic documents 70 using the search engine 35. This is done by creating a method of query expansion in which the augmented keyphrases are combined using a complex Boolean operation to produce an expanded list of the electronic documents 70. The query expansion using the teachings of this disclosure increases both the recall and the precision of the search. The recall, i.e. number of relevant electronic documents 70 retrieved, is increased because additional search terms are identified. The precision is increased because the teachings of this disclosure identify those search terms which are less relevant. This enables faster searches to be made with less storage of irrelevant electronic documents 70.
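One simple form of such a query expansion can be sketched in Python as follows. The text does not specify the Boolean operation, so the sketch assumes the simplest case, in which the augmented keyphrases are quoted as exact phrases and joined with OR; the function name and the query syntax are hypothetical and are not tied to any particular search engine.

```python
def expand_query(augmented_keyphrases):
    """Join the augmented keyphrases into a single OR query of quoted phrases.

    Keyphrases are lower-cased and de-duplicated, keeping the first
    occurrence of each.
    """
    seen = set()
    terms = []
    for phrase in augmented_keyphrases:
        key = phrase.strip().lower()
        if key and key not in seen:
            seen.add(key)
            terms.append('"' + key + '"')
    return " OR ".join(terms)


query = expand_query(["red dwarf", "kryten", "Red Dwarf"])
print(query)
```

A more complex Boolean operation, for example one that combines OR groups with AND or excludes the keyphrases of removed clusters with NOT, could be built in the same way.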

[00057] In another aspect of the invention, the set of augmented keyphrases can be added as extra tags in the search database 37 to enable better identification of relevant ones of the electronic documents 70 linked to entries in the search database 37. This enables better optimization of searches using the search engine 35 since relevant ones of the electronic documents 70 can be found even if the text in the electronic document 70 fails to contain the searched-for keyphrase.

[00058] It will be appreciated that the thesaurus 40 can have other applications. It was noted above that it is possible to generate the thesaurus 40 from a corpus of texts of a particular type. This will lead to a set of keyphrase triplets 45 with a particular set of keyphrase distance values 42c which is likely to be different from the keyphrase distance values 42c from a randomly selected corpus of texts. Examination of one of the electronic documents 70 using this specialized thesaurus 40 and extracting the keyphrase triplets 45 would enable the type of the electronic document 70 to be identified. For example, a scientific paper is likely to have a markedly different relationship between the keyphrase pairs (white dwarf-red dwarf; white dwarf-kryten; and kryten-red dwarf) than a TV programme review guide.
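The identification of a document type from keyphrase distances can be sketched in Python as follows. The comparison rule is an assumption, since the text does not say how the distance values are compared: the sketch picks the specialized thesaurus whose distances differ least, on average, from those measured in the document. All names and distance values are hypothetical.

```python
def pair(a, b):
    """Order-independent key for a keyphrase pair."""
    return frozenset((a, b))


def classify_document(doc_distances, thesauri):
    """Return the name of the thesaurus that best matches the document.

    doc_distances maps keyphrase pairs to distances measured in the document;
    thesauri maps a document type to the pair distances of its thesaurus.
    """
    def mismatch(thesaurus):
        shared = doc_distances.keys() & thesaurus.keys()
        if not shared:
            return float("inf")  # nothing to compare against
        total = sum(abs(doc_distances[p] - thesaurus[p]) for p in shared)
        return total / len(shared)

    return min(thesauri, key=lambda name: mismatch(thesauri[name]))


# Hypothetical distance values for two specialized thesauri.
thesauri = {
    "scientific paper": {
        pair("white dwarf", "red dwarf"): 0.2,
        pair("white dwarf", "kryten"): 0.9,
        pair("kryten", "red dwarf"): 0.9,
    },
    "tv guide": {
        pair("white dwarf", "red dwarf"): 0.8,
        pair("white dwarf", "kryten"): 0.9,
        pair("kryten", "red dwarf"): 0.1,
    },
}

# Hypothetical distances extracted from one electronic document.
doc_distances = {
    pair("white dwarf", "red dwarf"): 0.25,
    pair("kryten", "red dwarf"): 0.85,
}

doc_type = classify_document(doc_distances, thesauri)
print(doc_type)
```

In this example the document treats "white dwarf" and "red dwarf" as close and "kryten" as remote, which matches the hypothetical scientific thesaurus far better than the TV guide thesaurus.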

[00059] It is also possible to use the thesaurus 40 to identify compositional and non-compositional phrases. Examination of the keyphrase pairs "Big" and "Apple" and "Big" and "Orange" or indeed "Big" as a modifier (adjective) for other nouns will indicate that the keyphrase "Big Apple" is semantically different from the keyphrase "Big Orange". This is unsurprising since "Big Apple" - as noted above - is a keyphrase often used to refer to the city of New York. Comparison of the keyphrases "Orange" and "Big Orange" with "Apple" and "Big Apple" indicates a greater degree of semantic similarity between "Orange", "Big Orange" and "Apple" than with "Big Apple". This would indicate that the keyphrase "Big Apple" has an additional function. Indeed it is possible that the thesaurus 40 will indicate that this keyphrase is also similar in semantic function to names of cities like "London" or "Bristol". The technique can be used to identify so-called non-compositional, multi-word expressions.
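A test for non-compositional phrases along these lines can be sketched in Python as follows. The sketch makes assumptions that go beyond the text: the last word of a phrase is taken as its head noun, a phrase is flagged when its similarity to that head noun falls below a fixed threshold, and the similarity values stand in for values that would be read from the thesaurus 40. All names and numbers are hypothetical.

```python
def non_compositional(phrases, similarity, threshold=0.5):
    """Flag phrases that are semantically remote from their own head noun."""
    flagged = []
    for phrase in phrases:
        head = phrase.split()[-1]  # assumed: the last word is the head noun
        if similarity(phrase, head) < threshold:
            flagged.append(phrase)
    return flagged


# Hypothetical similarity values standing in for a thesaurus lookup.
HYPOTHETICAL_SIMILARITY = {
    ("big apple", "apple"): 0.2,
    ("big orange", "orange"): 0.8,
}


def lookup(phrase, head):
    """Return the stored similarity between a phrase and its head noun."""
    return HYPOTHETICAL_SIMILARITY[(phrase, head)]


flagged = non_compositional(["big apple", "big orange"], lookup)
print(flagged)
```

Here "big orange" stays close to "orange" and is treated as compositional, whereas "big apple" is remote from "apple" and is flagged as a candidate non-compositional, multi-word expression.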

[00060] Having thus described the present invention in detail, it is to be understood that the foregoing detailed description of the invention is not intended to limit the scope of the invention. What is desired to be protected by letters patent is set forth in the following claims.

Reference Numerals

10 System

20 User

22 Display Device

25 Network connections

30 Web interface

35 Search Engine

37 Search database

37a References

37b Search tags

40 Thesaurus

42 Keyphrase augmenter

45 Keyphrase triplet

50 Internet backbone

60 Web

70 Electronic document

72 Document keyphrase