

Title:
AUTOMATIC ABSTRACTION IN HUNGARIAN TEXTS
Document Type and Number:
WIPO Patent Application WO/2016/055818
Kind Code:
A1
Abstract:
The procedure identifies the most significant sentences of an article or text written in Hungarian by considering the roots of the Hungarian words and applying a special weighting method.

Inventors:
LENGYELNÉ MOLNÁR TÜNDE (HU)
Application Number:
PCT/HU2014/000092
Publication Date:
April 14, 2016
Filing Date:
October 08, 2014
Assignee:
ESZTERHÁZY KÁROLY COLLEGE (HU)
International Classes:
G06F17/27; G06F17/30
Other References:
TÜNDE MOLNÁR LENGYEL: "Automatic abstract preparation", ICI 10TH INTERNATIONAL CONFERENCE ON INFORMATION, 4 December 2010 (2010-12-04), Delta University, Gamasa, Mansoura, Egypt, pages 550 - 560, XP055201472
EDMUNDSON ET AL: "New methods in automatic extracting", JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, ACM, NEW YORK, NY, US, no. 16, 1 April 1969 (1969-04-01), pages 264 - 285, XP002078269, ISSN: 0004-5411, DOI: 10.1145/321510.321519
MUHAMMED YAVUZ NUZUMLALI ET AL: "Analyzing Stemming Approaches for Turkish Multi-Document Summarization", PROCEEDINGS OF THE 2014 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 25 October 2014 (2014-10-25), pages 702 - 706, XP055201613
K SANKAR ET AL: "Text Extraction for an Agglutinative Language", LANGUAGE IN INDIA, SPECIAL VOLUME: PROBLEMS OF PARSING IN INDIAN LANGUAGES, vol. 11, 5 May 2011 (2011-05-05), pages 56 - 59, XP055201709, ISSN: 1930-2940
Attorney, Agent or Firm:
KONYA, Tamas (H-3300 Eger Szent Janos út 2. 1/2., HU)
Claims:
CLAIMS

1. To abstract Hungarian texts, a specific characteristic feature of which is that the unique system of conjugations in the Hungarian language must be considered.

2. To incorporate the technique of highlighting the essence in the way that it is done by humans.

Description:
The title of the patent

AUTOMATIC ABSTRACTION IN HUNGARIAN TEXTS

Description

The subject of the patent:

Making abstracts of Hungarian texts that convey the message of each text by including its most important sentences.

A brief definition of the areas for the application of the patent:

It can be used in libraries and databases where texts available in a digital format can be abstracted, and it helps to retrieve information from texts.

It can also be used for creating a brief summary of any electronic text, which can be the abstract of a writer's article or essay for a monthly publication, and for highlighting the gist of texts.

The status of technology (solutions closest to the invention):

In the case of English-language texts, several programs exist that highlight the essence of a text by using quantitative and qualitative theories. Programs that abstract texts in other languages can also be found, so researchers can get support for highlighting the most important sentences of a particular article in languages such as German 1, French 2, Greek 3, Spanish 4, Italian 5, and Chinese 6.

Abstraction has not been solved in the Hungarian language due to the lack of a theoretical solution.

Krippendorff breaks down the steps of content analysis as follows: As a first step the text to be processed had to be typed. That typing was more than just copying the text because it also meant the preparation of the text. Automatic text analysis has to overcome a lot of difficulties (see Chapter 9.2); most of the problems, however, can be avoided if the original words are not given to the computer as the input text. If the text is manually prepared and typed as follows, most of the problems are eliminated:

typing abbreviated words in full original form,

replacing pronouns and references with personal names,

placing special characters to indicate the end of sentences and paragraphs, and the beginning of dialogues. 7

Sandor Szalai conducted an analysis forming the basis for the patent, and published it in his book, "Szalai, Sandor, Mechanical abstract making. Budapest: National Library of Engineering and Documentation Centre, 1963".

Deficiencies to be eliminated with the invention:

In Hungarian-speaking regions the increasing availability of digital datasets is not matched by the number of automatic tools for revealing their content. The definition of key words has been researched, but in Hungary solutions for conveying larger contextual units have not been offered. For Hungarian texts, automatic abstracting processes are non-existent.

The definition of the task to be solved with the invention:

In recent years the automation of content exploration has gained momentum. The reason for this is obvious: the sheer mass of information appearing makes it impossible to follow the developments even in a small sphere like one's own area of research. This is why any research or project that helps to attract attention to, or to highlight, essential information is important. In the Hungarian language it is a neglected area because automatic abstracting programs do not exist. This may be explained by the complexity of the structure of the Hungarian language, but this fact does not diminish the necessity for implementation.

The invention is a process that performs automatic abstraction of Hungarian texts by identifying the most important sentences of the texts. The procedure simulates the way the human mind highlights the essence of a text; by using a special weighting method, it aims to produce an abstract similar to one prepared by a human.

Sequence listing:

Step 1: Defining word roots

Step 2: Conflating, summary of word root frequency

Step 3: Finding word pairs, word triplets, and word quadruplets in texts (omitting taboo words)

Step 4: Determining the basic units for the abstracts (usually a sentence but it can be a paragraph too)

Step 5: Determining the significant words based on word frequency:

a/ either the words occurring more than three times in the frequency list of Step 2,

b/ or, if a frequency dictionary of the particular professional area exists, those words of Step 2 that are found in the dictionary.

Step 6: Weighting; each basic unit receives a weight:

a/ one point for each significant word contained in the basic unit,

b/ one additional point if, considering the results of Step 3, the word pairs, word triplets or word quadruplets are contained in the basic unit,

c/ if the text contains either an "introduction", or a "summary", or a "conclusion", the weight of the sentences in the paragraphs of these will be multiplied by 1.5,

d/ if the text does not contain either an "introduction", or a "summary", or a "conclusion", the weight of the sentences in the first and the last paragraphs will be multiplied by 1.5.

As a result of this each basic unit will have a weighted score.

Step 7: Sequencing the basic units in descending order based on their points.

Step 8: Abstract: based on the sequence above, creating an abstract of the given work, with its size determined as a percentage or a length.
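The steps listed above can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several parts are assumptions made only for illustration: the `root` function is a toy suffix stripper standing in for proper Hungarian root detection, `TABOO_WORDS` and `SUFFIXES` are hypothetical sample lists, Step 3 is read as "word groups that occur at least twice", Step 6a counts every occurrence of a significant root, and the parameter names (`ratio`, `threshold`, `key_paragraphs`) are invented here. Step 5b (the frequency dictionary) is not covered.

```python
import re
from collections import Counter

# Hypothetical sample data: a real system would need a Hungarian
# morphological analyzer and a curated taboo-word list.
TABOO_WORDS = {"a", "az", "és", "is", "hogy", "egy"}
SUFFIXES = ("okat", "eket", "ban", "ben", "nak", "nek", "ok", "ek", "t")


def root(word):
    """Toy Step 1 (assumption): strip one known suffix, longest first."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def roots_of(text):
    """Lowercase, tokenize, drop taboo words, reduce words to roots."""
    words = re.findall(r"\w+", text.lower())
    return [root(w) for w in words if w not in TABOO_WORDS]


def ngrams(roots, n):
    """All runs of n consecutive roots within one basic unit."""
    return [tuple(roots[i : i + n]) for i in range(len(roots) - n + 1)]


def summarize(paragraphs, ratio=0.3, threshold=3, key_paragraphs=None):
    """Rank sentences by weighted score; return the top share of them."""
    # Step 4: basic units are sentences, tagged with their paragraph index.
    units = [
        (p_idx, s.strip())
        for p_idx, para in enumerate(paragraphs)
        for s in re.split(r"(?<=[.!?])\s+", para)
        if s.strip()
    ]
    unit_roots = [roots_of(s) for _, s in units]

    # Steps 1-2: root frequencies over the whole text.
    freq = Counter(r for roots in unit_roots for r in roots)
    # Step 5a: roots occurring more than three times are significant.
    significant = {r for r, c in freq.items() if c > threshold}

    # Step 3 (assumed reading): pairs, triplets, quadruplets seen twice or more.
    gram_freq = Counter(
        g for roots in unit_roots for n in (2, 3, 4) for g in ngrams(roots, n)
    )
    repeated = {g for g, c in gram_freq.items() if c >= 2}

    # Step 6c/6d: paragraphs whose sentences get the 1.5 multiplier.
    # Default is rule d (first and last paragraph); pass the indices of
    # introduction/summary/conclusion paragraphs to apply rule c.
    if key_paragraphs is None:
        key_paragraphs = {0, len(paragraphs) - 1}

    scored = []
    for order, ((p_idx, sentence), roots) in enumerate(zip(units, unit_roots)):
        score = float(sum(1 for r in roots if r in significant))  # Step 6a
        if any(g in repeated for n in (2, 3, 4) for g in ngrams(roots, n)):
            score += 1.0  # Step 6b
        if p_idx in key_paragraphs:
            score *= 1.5  # Step 6c / 6d
        scored.append((score, order, sentence))

    # Step 7: descending by score; ties keep the original text order.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    # Step 8: keep the requested share of sentences, at least one.
    keep = max(1, round(len(ranked) * ratio))
    return [(score, sentence) for score, _, sentence in ranked[:keep]]
```

Calling `summarize(paragraphs, ratio=0.2)` on a list of paragraph strings would return roughly the top fifth of the sentences as `(score, sentence)` pairs, highest score first. Restoring the selected sentences to their original order before presenting them is a further design choice that the step list leaves open.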

One or more examples proving the extent of the patent:

Positive effects in relation to the present state of technology and linked to the invention

By developing the software to automate the steps of abstraction,

retrieval of documents in a digital format, and available in libraries, may be enhanced,

in general the retrieval of texts in Hungarian will be enhanced (as search robot software will consider the summaries with a greater weight),

the invention will help the user to decide whether the reading of the given article or text is necessary or not (or whether another approach to the topic is needed), so it will save time and energy for the researcher/reader.