


Title:
Information Extraction Method, Information Extraction Device, and Information Extraction Program
Document Type and Number:
Japanese Patent JP5559104
Kind Code:
B2
Abstract:
PROBLEM TO BE SOLVED: To extract the body text of a structured document without depending on text extraction rules. SOLUTION: A document set recording part 2 records the HTML files of the documents to be processed in a document set DB 3. A link source information extraction part 4 extracts, from each HTML file acquired from the document set DB 3, the hyperlinks embedded in it together with the text surrounding each link (link peripheral text information). When a hyperlink extracted by the link source information extraction part 4 refers to an HTML file acquired from the document set DB 3, a text extraction part 5 identifies that file as the HTML file of the link destination document. The text extraction part 5 then compares the character strings of the text information present in the HTML file of the link destination document with the character string of the link peripheral text information, and extracts the representative part of the link destination document as the body text. An output part 6 outputs the extracted body text.
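
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and everything beyond the abstract is an assumption: the document set DB is modeled as a plain dict from URL to HTML, the anchor text of each link stands in for the link peripheral text, block-level elements are treated as the candidate parts of a document, and `difflib.SequenceMatcher` is used as one possible string comparison. The names `LinkCollector`, `BlockCollector`, and `extract_bodies` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._buf).strip()))
            self._href = None


class BlockCollector(HTMLParser):
    """Splits a document into text blocks at block-level tag boundaries."""

    # Assumption: these tags delimit the candidate parts of a page.
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def similarity(a, b):
    """Character-level similarity in [0, 1]; an assumed comparison measure."""
    return SequenceMatcher(None, a, b).ratio()


def extract_bodies(documents):
    """Maps each linked-to URL to the block most similar to its link text.

    `documents` maps URL -> HTML string. Hrefs must match the keys exactly;
    relative URL resolution is omitted from this sketch.
    """
    link_texts = {}
    for html in documents.values():
        collector = LinkCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.links:
            if href in documents and text:
                link_texts.setdefault(href, []).append(text)

    bodies = {}
    for url, texts in link_texts.items():
        collector = BlockCollector()
        collector.feed(documents[url])
        collector.close()
        if collector.blocks:
            bodies[url] = max(
                collector.blocks,
                key=lambda block: max(similarity(block, t) for t in texts),
            )
    return bodies
```

Given an index page that links to a news page, `extract_bodies` would return the news page's block that best matches the link text, while navigation and footer blocks score low because they share little with that text. Documents with no inbound links in the set are skipped, which mirrors the abstract's condition that a link to the file must have been extracted.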

Inventors:
Masayuki Sugizaki
Yuichiro Sekiguchi
Kenji Ezaki
Tadashi Uchiyama
Application Number:
JP2011166460A
Publication Date:
July 23, 2014
Filing Date:
July 29, 2011
Assignee:
Nippon Telegraph and Telephone Corp.
International Classes:
G06F17/30; G06F13/00
Attorney, Agent or Firm:
Hiromichi Kobayashi
Hidehisa Uzawa
Yamaguchi Koji
Hashimoto 剛