To efficiently extract suitable feature words corresponding to a specific category.
A first appearance frequency, indicating the number of documents among a plurality of document data in which a word pair co-occurs, and a second appearance frequency, indicating the number of documents in which the word pair co-occurs among those documents associated with a specified category, are calculated. The value obtained by dividing the first appearance frequency by the second appearance frequency is calculated as a degree of co-occurrence. Network data with words as nodes and the degrees of co-occurrence as edge weights is generated as matrix data in the form of an N×N symmetric matrix. The maximum eigenvalue of the generated matrix data is calculated as a degree of aggregation. A cluster, that is, a set of words determined from the eigenvector corresponding to the calculated degree of aggregation, is extracted. A degree of attribution of each word to the cluster is calculated. The nodes whose degrees of attribution exceed a threshold are extracted as feature words expressing a feature of the specified category.
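The steps above can be illustrated with a minimal sketch, which is not the patented implementation. It assumes NumPy, tokenized documents, and one category label per document. The function name `extract_feature_words`, the default threshold of 0.3, the choice to set the degree to 0 for pairs that never co-occur within the category, and the normalization of eigenvector components into attribution degrees are all assumptions made for illustration; the abstract does not specify them.

```python
from itertools import combinations

import numpy as np


def extract_feature_words(docs, labels, category, threshold=0.3):
    """Hypothetical sketch. docs: list of token lists; labels: one label per doc."""
    vocab = sorted({w for tokens in docs for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    first = np.zeros((n, n))   # documents (all categories) where the pair co-occurs
    second = np.zeros((n, n))  # documents of `category` where the pair co-occurs
    for tokens, label in zip(docs, labels):
        ids = sorted({index[w] for w in tokens})
        for i, j in combinations(ids, 2):
            first[i, j] += 1
            first[j, i] += 1
            if label == category:
                second[i, j] += 1
                second[j, i] += 1

    # Degree of co-occurrence = first / second, as the abstract states.
    # Assumption: pairs never seen in the category get degree 0 (no edge).
    degree = np.divide(first, second, out=np.zeros_like(first), where=second > 0)

    # `degree` is an N x N symmetric matrix; eigh returns eigenvalues in
    # ascending order, so the last one is the maximum eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(degree)
    aggregation = float(eigvals[-1])

    # Assumption: attribution degree = absolute eigenvector component,
    # scaled so the strongest word has degree 1.
    v = np.abs(eigvecs[:, -1])
    attribution = v / v.max()

    features = [w for w, a in zip(vocab, attribution) if a > threshold]
    return features, aggregation, dict(zip(vocab, attribution))


docs = [
    ["cpu", "memory", "cache"],
    ["cpu", "memory"],
    ["cpu", "cache"],
    ["goal", "match"],
    ["goal", "cpu"],
]
labels = ["tech", "tech", "tech", "sports", "sports"]
features, aggregation, attribution = extract_feature_words(docs, labels, "tech")
print(features, aggregation)
```

In this toy corpus only the pairs among "cpu", "memory", and "cache" co-occur in documents labeled "tech", so they form the single cluster picked out by the leading eigenvector. A production system would also need a policy for ties between eigenvalues and for very large vocabularies, which this sketch does not address.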
SUENAGA TAKASHI
JP2007079948A | 2007-03-29
JP2004030202A | 2004-01-29
JPH06251072A | 1994-09-09
Yasuhiko Murayama