To provide a corpus selection device, a corpus selection method, and a program that select a learning corpus capable of both improving the quality of a language model and reducing the amount of storage area used.
A corpus selection device AA divides a learning corpus (whole) into learning corpora (subsets 1 to 3) and generates, by language modeling, subset language models 1 to 3 corresponding to the learning corpora (subsets 1 to 3), respectively. For each of the subset language models 1 to 3, a perplexity is calculated using a task representation corpus to obtain perplexity-1 to perplexity-Y. The learning corpora corresponding to the subset language models having lower perplexities are removed from the learning corpus (whole) to select a learning corpus (selected).
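The selection flow described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the device's actual implementation: the abstract does not specify the type of language model, so an add-one-smoothed unigram model is swapped in as a stand-in, and the function names (`train_unigram`, `perplexity`, `select_corpus`) and the `num_to_remove` parameter are hypothetical. The removal rule follows the abstract as written: subsets whose language models yield the lower perplexities on the task representation corpus are removed.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Build a unigram count model from whitespace-tokenized sentences.

    Assumption: a unigram model stands in for the unspecified language model.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def perplexity(model, sentences, vocab_size):
    """Perplexity of `sentences` under an add-one-smoothed unigram model."""
    counts, total = model
    log_prob = 0.0
    n_tokens = 0
    for s in sentences:
        for tok in s.split():
            p = (counts[tok] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


def select_corpus(subsets, task_corpus, num_to_remove):
    """Select a learning corpus from the whole corpus, given as subsets.

    One language model is trained per subset, its perplexity is measured on
    the task representation corpus, and the `num_to_remove` subsets with the
    lowest perplexities are removed, as stated in the abstract.
    `num_to_remove` is a hypothetical parameter; the abstract gives no count.
    """
    # Shared vocabulary so that all subset models are smoothed identically.
    vocab = {tok for sub in subsets for s in sub for tok in s.split()}
    vocab |= {tok for s in task_corpus for tok in s.split()}

    ppls = [
        perplexity(train_unigram(sub), task_corpus, len(vocab))
        for sub in subsets
    ]

    # Indices of the subsets whose models have the lowest perplexities.
    ranked = sorted(range(len(subsets)), key=lambda i: ppls[i])
    removed = set(ranked[:num_to_remove])

    selected = [s for i, sub in enumerate(subsets) if i not in removed for s in sub]
    return selected, ppls, removed


if __name__ == "__main__":
    subsets = [["a a a"], ["b b b"], ["c c c"]]  # toy subsets 1 to 3
    task_corpus = ["a a"]                        # toy task representation corpus
    selected, ppls, removed = select_corpus(subsets, task_corpus, num_to_remove=1)
    print(ppls, removed, selected)
```

In practice the unigram stand-in would be replaced by whatever n-gram or other language model the system actually uses; only the divide, model, score, and remove structure is taken from the text.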
UTSUNOMIYA EIJI
FURUI SADAOKI
SHINOZAKI TAKAHIRO
KUBOTA YU
TOKYO INST TECH