Title:
A COMPUTER IMPLEMENTED METHOD OF EXTRACTION AND TRANSLATION OF TEXTUAL DATA TO A COMMON FORMAT
Document Type and Number:
WIPO Patent Application WO/2017/116245
Kind Code:
A1
Abstract:
A computer implemented method of extraction and translation of textual data to a common format, consisting in that it comprises the following steps: (a) decoding of the text materials, (b) processing of text data, including the following steps: dividing the text into units, dividing the text into sentences, removal of orthographic mistakes, affixing words with grammatical tags, bringing words into their basic forms, extraction of word stems, using the "stop list" technique, normalisation of capitalisation, representation of text documents, searching for n-gram representations, searching for unigram representations, binary counting, counting of word occurrences, (c) analysis of text data, (d) converting data into a common collection of data describing the content.

Inventors:
KOPER MACIEJ (PL)
Application Number:
PCT/PL2015/000223
Publication Date:
July 06, 2017
Filing Date:
December 31, 2015
Assignee:
VOLANTIS SPOŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ (PL)
International Classes:
G06F17/27; G06F17/30
Foreign References:
US20040148278A12004-07-29
US20140280352A12014-09-18
Other References:
STEVEN BIRD, EWAN KLEIN, AND EDWARD LOPER: "Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit", 30 December 2014 (2014-12-30), O'Reilly Media, XP002761222, Retrieved from the Internet [retrieved on 20160830]
Attorney, Agent or Firm:
GÓRSKA, Anna (PL)
Claims:
Claims

1. A computer implemented method of extraction and translation of textual data to a common format, characterised in that it comprises the following steps:

(a) decoding of the text materials,

(b) processing of text data, including the following steps: dividing the text into units, dividing the text into sentences, removal of orthographic mistakes, affixing words with grammatical tags, bringing words into their basic forms, extraction of word stems, using the "stop list" technique, normalisation of capitalisation, representation of text documents, searching for n-gram representations, searching for unigram representations, binary counting, counting of word occurrences,

(c) analysis of text data,

(d) converting data into a common collection of data describing the content.

2. The computer-implemented method according to claim 1, characterised in that in step (a), the text materials constitute unstructured data.

3. The computer-implemented method according to claim 1, characterised in that in step (a), the text materials constitute semi-structured data.

Description:
A computer implemented method of extraction and translation

of textual data to a common format

The invention relates to a computer implemented method of extraction and translation of textual data to a common format from any type of text, used in particular to create platforms for sharing knowledge.

Prior art

It is known that the Internet and other open environments for data exchange collect increasing amounts of textual data stored digitally. Given such very large amounts of data, the problem is how to aggregate, view and search their collections. An additional difficulty is that data are generated with the use of different formats. A further problem is the lack of technological capacity to manage large collections of data and to search for and aggregate unstructured and diverse data from outside the system, which would then allow the collected knowledge to be shared, also with the use of platforms dedicated thereto.

Various systems for data processing are known from the prior art. For example, from the US Patent Application US2004148278, a method of dynamic building of scalable information databases that contain semi-structured data is known. The method disclosed in the specification of this invention includes performing data acquisition from multiple data repositories, some of which store data in semi-structured or unstructured forms. The obtained data are enriched and stored in the memory.

In turn, from the US Patent Application US2014280352, a method of obtaining an ordered collection of data from unstructured data collections is known. For this purpose, the following functions are used: (a) reception of values for semi-structured data and a key associated therewith; (b) structured collection of data, which constitutes an association of categories having many characteristics; (c) obtaining at least one of historical data, associated with multiple attributes, or of additional data associated with the user of the computer system; (d) mapping an attribute or multiple attributes based on at least one of historical data or additional data; (e) storage of values in a cell of the data register of the structured collection of data.

Solutions identified in the prior art are insufficient in relation to current needs; consequently, a given user must have properly adjusted IT software in order to view and use the contents in question. The problem to be solved is to develop and implement specific algorithms and mechanisms for methods of integrating formats, which allow processing, compressing, coding and unification of formats, and presenting data in a unified format, especially with the use of a platform for sharing knowledge.

The object of the invention is to provide a method of integrating formats which allows processing, compressing, coding and unification of formats, and presenting contents in a unified format, especially with the use of a platform for sharing knowledge. The mechanism of the method according to the invention is based on the concept called "black box", according to which, at the input, users add different types of digital content in various formats and standards so that, at the output, after completion of the internal process of extraction and translation of contents inside the box, end users are given access to the same contents yet in one universal format strictly defined for the entire platform, the said format allowing the use of contents without any trouble in any place and via any medium, especially with the use of a platform for sharing knowledge. The solution being the object of the invention fills a gap, namely the lack of a system allowing concentration of all functions in one tool and extraction and translation of data from any type of text into a common format.

Essence of the invention

The essence of the solution according to the invention is a computer implemented method of extraction and translation of textual data to a common format, characterised in that it comprises the following steps:

(a) decoding of the text materials,

(b) data processing including the following steps: dividing the text into units, dividing the text into sentences, removal of orthographic mistakes, affixing words with grammatical tags, bringing words into their basic forms, extraction of word stems, using the "stop list" technique, normalisation of capitalisation, representation of text documents, searching for n-gram representations, searching for unigram representations, binary counting, counting of word occurrences,

(c) analysis of text data,

(d) converting data into a common collection of data describing the content.

It is preferable when text materials constitute unstructured data that do not have any predetermined internal structure.

It is also preferable when text materials constitute semi-structured data, which have a defined general scheme or a structure relating to a fragment of the documents.

A direct automatic analysis of original documents is unfeasible in practice, and it is necessary to process them first. The main task of processing is to create a representation of a text document which is suitable for the planned analysis. The process of pre-processing has a large impact on the total time required to implement the process of discovering knowledge in texts and may generate a fairly large overhead compared to the remaining steps of the process. Thanks to pre-processing, the quality of data is improved and information noise is eliminated. Typically, errors occur in texts due to human errors of their authors and to imperfections of tools for automatic generation of documents, e.g. in the case of automatic translations. The text of documents also contains expressions that are not related to their basic content and result, for example, only from the assumed convention of creating documents, such as repeated descriptions in footers or headers. Elimination, or at least a significant reduction, of these unfavourable factors is particularly important in tasks in which characteristic features of a given document are sought, e.g. in classification or grouping of texts. The final result of processing by the method according to the invention is a data stream describing the document, allowing its formatting before being sent to a database.

An advantage of the solution according to the invention is that it allows extraction and translation of textual data from any type of digital content into a common format of data, and sharing the processed knowledge, in particular by means of platforms for sharing knowledge.

Embodiment of the invention

Automatic analysis of text documents written in a natural language in principle requires processing of these documents into a suitable form and constructing their representation. Analysis, including the use of data exploration methods, can be performed only on the processed text, reduced to the form of machine-understandable structures. In the embodiment, text materials, including unstructured and semi-structured ones, were subjected to decoding, and then to processing comprising the following steps: dividing the text into units (tokenisation), dividing the text into sentences, removal of orthographic mistakes, affixing words with grammatical tags (POS tagging), bringing words into their basic forms, i.e. lemmas (lemmatisation), extraction of word stems (stemming), using the "stop list" technique consisting in removal of the most frequent words (stop words removal), normalisation of capitalisation, representation of text documents, searching for n-gram representations, searching for unigram representations, binary counting, and counting of word occurrences. Afterwards, the text data are analysed and converted into a common collection of data describing the content. The processes occurring in the step of data processing are described in detail below.

Division of the text into units. Division of the text into units, also known as text segmentation, consists in extracting units (tokens), mainly words but also numbers, symbols, punctuation marks or dates, from the text. The words do not have to be words of a natural language; they may also be other strings of characters, e.g. identifiers or uninominal proper names, specific to a given field.

For the division of text into units, regular expressions adjusted to the characteristics of the natural language of the processed text materials are mainly used (though simple rules based on the occurrence of whitespace characters and punctuation marks can be used for texts written in different languages, in particular European languages). These expressions may be supplemented with rules taking into account specific grammatical structures of a given language. In cases where it is not obvious how to extract a unit, heuristic approaches are used.
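
By way of illustration only, a minimal tokeniser of this kind could be written in Python with a single regular expression; the pattern below is an assumed example for European languages, not a pattern prescribed by the method:

    import re

    # Assumed example pattern: numbers (including decimals and dates),
    # words (with internal hyphens or apostrophes), and single
    # punctuation marks as separate units (tokens).
    TOKEN_PATTERN = re.compile(r"\d+(?:[.,/-]\d+)*|\w+(?:[-']\w+)*|[^\w\s]")

    def tokenize(text):
        """Divide text into units: words, numbers, punctuation."""
        return TOKEN_PATTERN.findall(text)

    print(tokenize("On 31.12.2015, semi-structured data were decoded."))
    # ['On', '31.12.2015', ',', 'semi-structured', 'data', 'were', 'decoded', '.']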

For the extraction of units in the text, name indexes (gazetteers) are also used, which contain collections of, among others, proper names and also, for example, abbreviations. Typically, proper names are extracted from the original text, without any prior transformations, by pattern matching. It should be noted that simple application of this method may lead to division of words.
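
A minimal sketch of such gazetteer matching is given below; the three entries are an assumed toy example, and the word-boundary anchors illustrate how division of words can be avoided:

    import re

    # Assumed miniature gazetteer; real name indexes contain large
    # collections of proper names, abbreviations, etc. for a given field.
    GAZETTEER = ["New York", "Warsaw", "Linux"]

    def find_gazetteer_units(text):
        """Extract gazetteer entries from the original text by pattern
        matching; \b word boundaries prevent dividing longer words."""
        pattern = re.compile("|".join(r"\b" + re.escape(entry) + r"\b"
                                      for entry in GAZETTEER))
        return pattern.findall(text)

    print(find_gazetteer_units("Linuxes run in New York; Linux is free."))
    # ['New York', 'Linux'] - 'Linuxes' is not divided into 'Linux' + 'es'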

Division of the text into sentences. Division of the text into sentences is performed based on a set of rules. Typically, these rules relate to appropriate punctuation marks and line breaks appearing in texts. In typical situations, it is easy to define the boundaries of sentences, since a dot, exclamation mark or question mark usually determines the end of a sentence in many languages. Division of the text into sentences can be performed both before and after the division of the text into units. In the latter case, attention should be paid so as not to lose information on the presence of line break characters.
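
A minimal rule of this kind can be expressed as a single regular expression; the sketch below is an assumed illustration and deliberately ignores many corner cases (e.g. sentences ending in one-letter words):

    import re

    # Assumed rule: a sentence ends with '.', '!' or '?' followed by
    # whitespace and a capital letter; requiring two word characters
    # before the punctuation protects abbreviations such as "e.g.".
    SENT_BOUNDARY = re.compile(r"(?<=\w\w[.!?])\s+(?=[A-Z])")

    def split_sentences(text):
        """Divide text into sentences based on punctuation rules."""
        return SENT_BOUNDARY.split(text)

    text = ("It is easy to define boundaries, e.g. with punctuation. "
            "Line breaks matter too!")
    print(split_sentences(text))
    # ['It is easy to define boundaries, e.g. with punctuation.',
    #  'Line breaks matter too!']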

Removal of orthographic mistakes. Removal of orthographic mistakes relates primarily to the correction of typos and spelling mistakes. Most of the available algorithms for correcting orthographic mistakes, based on digital dictionaries, use, to a greater or lesser extent, the edit distance (Levenshtein distance). The Levenshtein distance between two words is defined as the minimum number of operations required to convert one word into another. In this type of method, a misspelled word is converted into the word taken from the dictionary for which the edit distance is the smallest. There is also a weighted version of this method, in which the approximate probability of occurrence of each mistake is used to determine the weights of individual corrections. Another type of mistake correction method is based on the use of probabilities of occurrence of individual n-grams to determine the distance between words. For example, the words unfortunatly, existance, bizzare and forseeable at the input are converted at the output to the following: unfortunately, existence, bizarre, foreseeable.
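
The unweighted variant can be sketched in a few lines; the four-word dictionary below is an assumed sample, whereas a real implementation would use a full digital dictionary:

    def levenshtein(a, b):
        """Minimum number of operations (insertions, deletions,
        substitutions) required to convert word a into word b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    DICTIONARY = {"unfortunately", "existence", "bizarre", "foreseeable"}

    def correct(word):
        """Convert a misspelled word into the dictionary word for which
        the edit distance is the smallest."""
        return min(DICTIONARY, key=lambda w: levenshtein(word, w))

    for w in ["unfortunatly", "existance", "bizzare", "forseeable"]:
        print(w, "->", correct(w))
    # unfortunatly -> unfortunately, existance -> existence, ...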

Affixing words with grammatical tags. Affixing words with grammatical tags is often encountered in the processing of text documents, and in many applications (e.g. in semantic text analysis) it is essential. This method is based on lists of decision rules which are created during the learning phase of the algorithm. Tagging with this method is much faster than in the case of a classic algorithm and exhibits the same or better correctness. The method of affixing words with grammatical tags is language-dependent, and for each language used in the processed documents, it is necessary to prepare a dedicated method.
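
The patent does not name a specific tagger; purely as an illustration of this step, NLTK's default English tagger can affix such tags (this assumes the nltk package with its punkt and averaged_perceptron_tagger resources is installed, and is not the rule-learning method described above):

    import nltk

    # Assumes nltk.download(...) has fetched the required resources.
    tokens = nltk.word_tokenize("The disc controls the linux container.")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('disc', 'NN'), ('controls', 'VBZ'), ...]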

Bringing words into their basic forms. The variety of forms of words occurring in texts written in a natural language is associated with their declension, and in many applications (e.g. in classification or grouping) it is not recommended or is even unwanted. Various forms of words cause the growth of the dictionary for a given repository, often do not carry additional information, and can also cause problems when comparing documents or when searching for occurrences of words in texts. Therefore, the process of bringing words to their basic forms (lemmatisation) is often performed. This task can be performed on the basis of declension dictionaries and also of a set of rules adapted to the specificities of a given language. For fusional languages, this process results in a significant reduction in the number of words appearing in the repository.
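
As an assumed illustration for English, NLTK's dictionary-based lemmatiser brings words to their lemmas; it requires the wordnet corpus to be downloaded, and the words and part-of-speech codes below are examples only:

    from nltk.stem import WordNetLemmatizer

    # Dictionary-based lemmatisation; assumes nltk.download("wordnet").
    lemmatizer = WordNetLemmatizer()
    for word, pos in [("discs", "n"), ("was", "v"), ("better", "a")]:
        print(word, "->", lemmatizer.lemmatize(word, pos=pos))
    # discs -> disc, was -> be, better -> good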

Extraction of word stems. Extraction of word stems is a process similar to bringing words to their basic forms and consists in searching for stems in words. The stems typically do not constitute a correct grammatical form and do not take into account the context of their occurrence, but they are sufficient to represent a set of words having a similar theme. The main advantage of extraction of word stems is the low complexity of the process compared to lemmatisation. For example, in the Polish word przedszkolny, the stem is the fragment -szkol-. There is also a prefix przed- and suffixes: the formative -n (forming adjectives) and the fusional -y (the latter is also known as a fusional ending).
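
A stemmer is language-specific, so the Polish example above would need a dedicated tool. As an assumed English illustration, the Porter stemmer shipped with NLTK reduces a family of words to one stem:

    from nltk.stem import PorterStemmer  # needs no extra downloads

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["connection", "connected", "connecting"]])
    # ['connect', 'connect', 'connect'] - one stem represents the whole family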

Using the "stop list" technique. The technique of removing commonly occurring words is intended to weed out the most frequently occurring words, which do not carry any information on the document's contents. They are expressions whose type and frequency of occurrence are largely dependent on the language used (e.g. words associated with grammatical structures) and on the type of the processed document. In the case of the Polish language, they are most commonly prepositions, conjunctions, pronouns, etc. Removal of this type of word from a text takes place on the basis of a pre-built dictionary. It is also possible to automatically build a collection which is a list of words to be removed, based on the analysis of the frequency of occurrence of words in documents from a given repository. Words which occur in all documents, or in a large percentage thereof, are introduced to the "stop list". Such words do not allow for distinguishing between documents with different contents, and thus they are not useful in tasks consisting in classification or grouping. The advantage of this approach is the ease of its use in any language.
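
The automatic building of such a list can be sketched as follows; the 90% document-frequency threshold and the toy documents are assumed values chosen for illustration:

    from collections import Counter

    def build_stop_list(documents, threshold=0.9):
        """Put on the stop list every word occurring in at least the
        given fraction of all documents (assumed threshold)."""
        doc_freq = Counter()
        for tokens in documents:
            doc_freq.update(set(tokens))
        return {w for w, df in doc_freq.items()
                if df / len(documents) >= threshold}

    def remove_stop_words(tokens, stop_list):
        """Remove the listed words from a tokenised document."""
        return [w for w in tokens if w not in stop_list]

    docs = [["disc", "and", "linux"], ["box", "and", "disc"],
            ["container", "and", "control"]]
    stop = build_stop_list(docs)          # {'and'} - occurs in all documents
    print(remove_stop_words(docs[0], stop))
    # ['disc', 'linux']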

Removal of the most frequently occurring words is performed in one of the final steps of document processing, so as not to remove information required by other processing techniques (e.g. when affixing words with grammatical tags).

For example, the following collection of documents is given:

d1: disc, linux and disc, disc and linux control, control and disc, linux

d2: container, linux and linux, and linux, disc, linux and disc, box and disc

d3: box and control, container, control, linux and control box container, box and container, and box

d4: and box, and box, and box, control, container, control, container container, control, container

In Table 1, the word occurrence matrix for the above collection of documents is presented.

Table 1

word         d1   d2   d3   d4
and           3    4    4    3
box           0    1    4    3
container     0    1    3    4
control       2    0    3    3
disc          4    3    0    0
linux         3    4    1    0
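
Such a matrix can be derived directly from the documents; a minimal sketch, counting word occurrences with punctuation ignored, is:

    import re
    from collections import Counter

    docs = {
        "d1": "disc, linux and disc, disc and linux control, control and disc, linux",
        "d2": "container, linux and linux, and linux, disc, linux and disc, box and disc",
        "d3": "box and control, container, control, linux and control box container, "
              "box and container, and box",
        "d4": "and box, and box, and box, control, container, control, "
              "container container, control, container",
    }

    # Count word occurrences per document, ignoring punctuation.
    counts = {d: Counter(re.findall(r"\w+", text)) for d, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    for word in vocabulary:
        print(f"{word:9}", *(f"{counts[d][word]:3}" for d in sorted(docs)))
    # and         3   4   4   3
    # box         0   1   4   3   ... (the matrix of Table 1)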

Normalisation of capitalisation. Normalisation of capitalisation is a commonly used method and typically involves conversion of all characters to lower case, excluding those that may carry some information on the special meaning of words (e.g. proper names). This normalisation should be performed after the previous division of texts into sentences and affixing of the words with grammatical tags. Standardisation of the spelling of words not only results in reduction of the glossary containing words occurring in the analysed documents but also has a positive impact on the accuracy in determining similarities of the documents.
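
Because the grammatical tags are available at this point, a sketch of the normalisation can keep proper names intact; the Penn Treebank tags 'NNP' and 'NNPS' are an assumed convention here:

    def normalise_case(tagged_tokens):
        """Lower-case every token except those tagged as proper nouns."""
        return [token if tag in ("NNP", "NNPS") else token.lower()
                for token, tag in tagged_tokens]

    print(normalise_case([("The", "DT"), ("Linux", "NNP"), ("Disc", "NN")]))
    # ['the', 'Linux', 'disc']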

Representations of text documents. For the purpose of automated analysis, each text document, having undergone the pre-processing, must be stored in a way that allows for the extraction of relevant information. Text documents are stored in the form of so-called representations containing, depending on the type of representation, values such as cardinality of words, sequences in which words occur, metadata on the document layout, etc.

Searching for n-gram representations. In situations where the occurrence in the processed document of certain characteristics, such as an expression or a set of expressions, is important, its representation is a structure comprising an association of the document with so-called n-grams. An n-gram is a sequence of n expressions (words) occurring consecutively in the analysed text. A special type of n-gram is the unigram, consisting of an individual expression.

The n-gram representation contains not only information on the word occurrences themselves but also information on their co-occurrence. A sequence of letters, words or more complex expressions can be understood as an n-gram; most often, the n-gram is a sequence of words occurring together. The use of n-grams gives some information about the context of word occurrence in the document and allows obtaining information on phrasemes or proper names. An important element of an n-gram representation is the number of words from which each n-gram is built. Short n-grams lead to representations similar to unigrams, and thus to unnecessary complexity of the issue. Too large a number of constituent expressions may result in very high uniqueness of the n-grams. This is due to the fact that the entire sequence of constituent expressions must be present in the document for an n-gram to occur. The higher their number, the lower the likelihood of all the constituents occurring at the same time in the document, and thus the lower the likelihood of finding common characteristics of the documents. The number of potential n-grams increases rapidly with an increase in the number of expressions from which they are built. For long documents, an increase in the size of n-grams significantly affects the size of their representations, and thus the computational complexity of the algorithms that process them. Complex n-grams characterise documents much better than, for example, unigrams, because they contain information on dependencies between expressions occurring simultaneously in documents. To reduce this type of document representation, limitations on the occurrence of n-grams in texts can be introduced, especially those related to the position of the constituents of the n-gram, e.g. generation of n-grams based on fragments rather than on whole texts.
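
Generating the n-grams of a tokenised text is straightforward; the sketch below is a minimal assumed illustration:

    def ngrams(tokens, n):
        """All sequences of n consecutive expressions (words) in the text."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "linux and disc control".split()
    print(ngrams(tokens, 2))
    # [('linux', 'and'), ('and', 'disc'), ('disc', 'control')]
    print(ngrams(tokens, 1))   # unigrams are the special case n = 1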

Searching for unigram representations. Unigram representations are representations in which words are the elementary units. In these representations, only occurrences of words in the document are taken into account. To create such a representation, pre-processing is essential; thanks to it, non-essential elements, and thus a large part of the so-called "information noise" which can cause problems in the process of analysis, are removed from the documents. In addition, the limitation of representations results in a significant reduction in the processing time of documents in the processes using them. In order to reduce the amount of calculation and, above all, to improve the effectiveness of analyses, algorithms are used which allow for the selection of the words which best characterise a given group of documents.

To create a unigram representation, two approaches are used for counting words: binary counting and counting of word occurrences.

Binary counting. This approach is characterised in that, during the counting, only the very fact of word occurrence is taken into account. In this process, a two-dimensional array is created which comprises, for each document, a vector composed of the values 0 or 1 representing the occurrence of consecutive words. These values determine the mere fact of occurrence of a given word in the document. This approach has a fundamental advantage, namely low computational complexity and simplicity of the solution.
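
A sketch of one row of such an array, for an assumed fixed vocabulary, is:

    def binary_vector(doc_tokens, vocabulary):
        """0/1 vector recording the mere fact of occurrence of each
        vocabulary word in the document."""
        present = set(doc_tokens)
        return [1 if word in present else 0 for word in vocabulary]

    vocabulary = ["and", "box", "container", "control", "disc", "linux"]
    print(binary_vector("disc linux and disc control".split(), vocabulary))
    # [1, 0, 0, 1, 1, 1]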

Counting of word occurrences. Expressions that are related to the subject of the document will, in all likelihood, occur more often than others. The approach of counting occurrences takes this fact into account. The result of counting occurrences of words is an array which contains the total number of occurrences of the words in the document. In this approach, there is a problem with the representation of documents of different lengths: short documents may be classified in a completely different manner than documents containing a very large number of words. The solution is to normalise the obtained results. Instead of the value representing the number of occurrences, the ratio of the number of occurrences of each individual word to the number of occurrences of all words in the document is taken into account. This brings the results obtained for different documents to a form in which they can be compared.
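
A sketch of the normalised variant, using an assumed whitespace-tokenised document, is:

    from collections import Counter

    def term_frequencies(doc_tokens):
        """Ratio of occurrences of each word to the occurrences of all
        words, so that documents of different lengths can be compared."""
        counts = Counter(doc_tokens)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    print(term_frequencies("disc linux and disc disc and linux control".split()))
    # {'disc': 0.375, 'linux': 0.25, 'and': 0.25, 'control': 0.125}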