


Title:
A METHOD FOR EXTRACTING INFORMATION FROM SEMI-STRUCTURED DOCUMENTS, A RELATED SYSTEM AND A PROCESSING DEVICE
Document Type and Number:
WIPO Patent Application WO/2021/009375
Kind Code:
A1
Abstract:
The invention relates to a method comprising: clustering each semi-structured document of a plurality of semi-structured documents into a cluster of a plurality of clusters based on meta-information of said semi-structured documents; detecting segments in the semi-structured documents of each cluster of said plurality of clusters by means of classification methods; detecting segment features of each segment, using NLP, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities with corresponding confidence levels; matching segments of each of the documents in each cluster based on said determined segment features of each segment and a probability distribution of said named entities; determining the meaning of each segment group comprising similar segments based on NLP of the content of each of said segments; and finally assigning a segment identification to each segment.

Inventors:
PAVEL JEŽ (BE)
DE FEU GEORGES (BE)
Application Number:
PCT/EP2020/070367
Publication Date:
January 21, 2021
Filing Date:
July 17, 2020
Assignee:
LYNXCARE CLINICAL INFORMATICS (BE)
International Classes:
G06F16/35; G06F16/80
Foreign References:
US20060026203A12006-02-02
Attorney, Agent or Firm:
GEVERS (BE)
Claims:

1. A method for extracting information from semi-structured documents, said method comprising the steps of:

retrieving a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said method further comprises the steps of:

clustering, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents based on at least one of meta-information, content and layout of each said semi-structured document in a cluster of a plurality of clusters; and

detecting segments in semi-structured document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and

detecting segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

matching segments of each of the documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities; and

determining the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assigning a segment identification to each segment based on the determined concept distribution of each segment.

2. The method for extracting information from semi-structured documents according to claim 1, characterized in that said method further comprises the step of:

determining relations between segments of documents in said cluster by applying natural language processing.

3. The method for extracting information from semi-structured document according to claim 1, characterized in that said step of clustering further may be based on the graphical layout of such document.

4. The method for extracting information from semi-structured document according to claim 1, characterised in that said step of detecting features wherein the context is applied as input parameter for the Natural Language Processing.

5. The method for extracting information from semi-structured document according to claim 1, characterised in that step of said analyzing a textual content of each said segment is based on ensemble of algorithms that are organized in several layers.

6. Processing device, for extracting information from semi-structured documents retrieved from at least one semi-structured document source, said processing device comprising a processing means (3) configured to:

cluster, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents based on at least one of meta-information, content and layout of each said semi-structured document in a cluster of a plurality of clusters; and

detect segments in semi-structured document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and

detect segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

match segments of each of the documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities; and

determine the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assign a segment identification to each segment based on the determined concept distribution of each segment.

7. Processing device for extracting information from semi-structured documents according to claim 6 characterised in that said processing device further is configured to:

determine relations between segments of documents in said cluster by applying natural language processing.

8. Processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

base said cluster of each semi-structured document of said plurality of semi-structured documents on the graphical layout of such document.

9. Processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to: detect segment features by additionally applying a context as an input parameter for the Natural Language Processing.

10. Processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

analyse a textual content of each said segment is based on ensemble of algorithms that are organized in several layers.

11. System for extracting information from semi-structured documents, said system comprising means configured to:

retrieve a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said system further comprises a processing device according to claim 6.

AMENDED CLAIMS

received by the International Bureau on 18 November 2020 (18.11.2020)

1. A computer controlled method for extracting information from semi-structured documents, said method comprising the steps of:

retrieving, by means of a computer controlled processing device, a plurality of semi-structured electronic documents from at least one semi-structured electronic document source, characterized in that said method further comprises the steps of:

clustering, by means of a computer controlled processing device, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured electronic documents based on at least one of i) meta-information, ii) content and iii) layout of each said semi-structured electronic document in a cluster of a plurality of clusters; and

detecting, by means of a computer controlled processing device, segments in semi-structured electronic document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and

detecting, by means of a computer controlled processing device, segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

matching, by means of a computer controlled processing device, segments of each of the documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities in a segment group; and

determining, by means of a computer controlled processing device, the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assigning, by means of a computer controlled processing device, a segment identification to each segment based on the determined concept distribution of each segment.

2. The computer controlled method for extracting information from semi-structured documents according to claim 1, characterized in that said method further comprises the step of: determining, by means of a computer controlled processing device, relations between segments of electronic documents in said cluster by applying natural language processing.

3. The computer controlled method for extracting information from semi-structured electronic document according to claim 1, characterized in that said step of clustering further may be based on the graphical layout of such electronic document.

4. The computer controlled method for extracting information from semi-structured electronic document according to claim 1, characterised in that said step of detecting features wherein the context is applied as input parameter for the Natural Language Processing.

5. The computer controlled method for extracting information from semi-structured electronic document according to claim 1, characterised in that step of said analyzing a textual content of each said segment is based on ensemble of algorithms that are organized in several layers.

6. Computer controlled processing device, for extracting information from semi-structured electronic documents retrieved from at least one semi-structured document source, said computer controlled processing device comprising a processing means (3) configured to:

cluster, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured electronic document of said plurality of semi-structured electronic documents based on at least one of meta-information, content and layout of each said semi-structured electronic document in a cluster of a plurality of clusters; and

detect segments in semi-structured document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and

detect segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

match segments of each of the electronic documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities in a segment group; and

determine the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assign a segment identification to each segment based on the determined concept distribution of each segment.

7. Computer controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterised in that said processing device further is configured to:

determine relations between segments of electronic documents in said cluster by applying natural language processing.

8. Computer controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterised in that said processing device further is configured to:

base said cluster of each semi-structured electronic document of said plurality of semi-structured electronic documents on the graphical layout of such electronic document.

9. Computer controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterised in that said processing device further is configured to:

detect segment features by additionally applying a context as an input parameter for the Natural Language Processing.

10. Computer controlled processing device for extracting information from semi-structured electronic documents according to claim 6, characterised in that said processing device further is configured to:

analyse a textual content of each said segment is based on ensemble of algorithms that are organized in several layers.

11. Computer controlled system for extracting information from semi-structured electronic documents, said system comprising means configured to:

retrieve a plurality of semi-structured electronic documents from at least one semi-structured electronic document source, characterized in that said system further comprises a processing device according to claim 6.

Description:
A method for extracting information from semi-structured documents, a related system and a processing device

Technical field

The present invention relates to a method for extracting information from semi-structured documents.

Background art

Currently, a lot of information is exchanged in the form of semi-structured text. This is especially true in the medical domain, where many systems output messages in HL7 or XML formats. There are, however, many definitions for each of those formats. Moreover, many institutions add their own definitions on top of the standard format to accommodate their needs. This complicates data exchange and limits interoperability between different establishments, e.g. medical establishments.

To overcome these difficulties, custom interfaces can be developed to consume messages from a given producer. However, such an approach is not scalable, as adding a new message source requires an update of the configuration on the receiver side, often involving some development and/or manual intervention. The currently known systems and methods for extracting information from semi-structured documents therefore have the disadvantage that they still deal with complicated data and offer limited interoperability between different establishments, e.g. medical establishments.

Disclosure of the invention

It is an object of the present invention to provide a method for extracting information from semi-structured documents, a related system and a related processing device that overcome or alleviate the problems mentioned above.

Accordingly, embodiments of the present invention relate to a method for extracting information from semi-structured documents, said method comprising the steps of:

retrieving a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said method further comprises the steps of:

clustering, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents based on at least one of meta-information, content and layout of each said semi-structured document in a cluster of a plurality of clusters; and

detecting segments in semi-structured document for each cluster of said plurality of clusters by means of an unsupervised classification method or semi-supervised classification method; and

detecting segment features of each segment, using Natural Language Processing (NLP), by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

matching segments of each of the documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities; and

determining the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assigning a segment identification to each segment based on the determined concept distribution of each segment.

A further relevant embodiment relates to the method for extracting information from semi-structured documents according to claim 1, characterized in that said method further comprises the step of determining relations between segments of documents in said cluster by applying natural language processing.

Another relevant embodiment relates to a method for extracting information from semi-structured documents according to claim 1, characterized in that said step of clustering further may be based on the graphical layout of such document.

A subsequent embodiment of the present invention relates to a method for extracting information from semi-structured documents according to claim 1, wherein, in said step of detecting features, the context is applied as input for the Natural Language Processing.

A further relevant embodiment relates to a method for extracting information from semi-structured documents according to claim 1, characterized in that said step of analyzing a textual content of each said segment is based on an ensemble of algorithms that are organized in several layers.

Another relevant object relates to a processing device, for extracting information from semi-structured documents retrieved from at least one semi-structured document source, said processing device comprising a processing means (3) configured to:

cluster, by an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents based on at least one of meta-information, content and layout of each said semi-structured document in a cluster of a plurality of clusters; and

detect segments in semi-structured document for each cluster of said plurality of clusters by means of unsupervised classification methods or semi-supervised classification methods; and

detect segment features of each segment, using Natural Language Processing, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and

match segments of each of the documents in each cluster based on said segment features of each segment determined and a probability distribution of said named entities; and

determine the meaning of each said segment group comprising similar segments based on a natural language processing of said content of each said segments in said segment group; and

assign a segment identification to each segment based on the determined concept distribution of each segment.

A further relevant embodiment relates to the processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

determine relations between segments of documents in said cluster by applying natural language processing.

Still a further relevant embodiment relates to the processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

base said cluster of each semi-structured document of said plurality of semi-structured documents on the graphical layout of such document.

Another relevant embodiment relates to a processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

detect segment features by additionally applying a context as an input parameter for the Natural Language Processing.

Still another relevant embodiment relates to a processing device for extracting information from semi-structured documents according to claim 6, characterised in that said processing device further is configured to:

analyse a textual content of each said segment based on an ensemble of algorithms that are organized in several layers.

Another relevant embodiment relates to a system for extracting information from semi-structured documents, said system comprising means configured to:

retrieve a plurality of semi-structured documents from at least one semi-structured document source, characterized in that said system further comprises a processing device according to claim 6.

Indeed, this objective is achieved by, upon retrieving a plurality of semi-structured documents from at least one semi-structured document source, clustering each semi-structured document of the plurality of semi-structured documents by means of an unsupervised or semi-supervised learning algorithm, thereby including each retrieved semi-structured document in a cluster (group of documents) based on at least one of meta-information, content and layout of each semi-structured document, and then detecting segments in each semi-structured document in each cluster of the plurality of clusters by means of unsupervised clustering methods such as k-means clustering or hierarchical clustering. It is also possible to employ semi-supervised clustering methods, where several example documents whose type has been indicated by domain experts are used to train a model that puts documents in clusters. Any supervised classification algorithm (random forest, support vector machines or neural network, to name a few) can be used as a starting point for continuous, semi-supervised learning. A sample of the results is then checked and, if needed, corrected by domain experts to improve model performance.

The features that will be used by the clustering algorithm include not only the document text but also its layout. The position of text in a document may indicate a certain type of document (e.g. laboratory results with many tables are visually different from clinical letters containing a steady flow of text). Finally, it is possible to detect the type of the document based on the frequency of certain keywords. For example, term frequency-inverse document frequency (Rajaraman & Ullman, 2011) can be used to check the relevancy (similarity) of the document to a given document cluster, which can be defined by one or more keywords.
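By way of illustration only, the keyword-based relevancy check described above might be sketched as follows. The function names (`tokenize`, `tf_idf_vectors`, `keyword_relevance`), the whitespace tokenizer and the particular tf-idf weighting (relative term frequency multiplied by the natural logarithm of N/df) are assumptions made for this sketch and are not prescribed by the present description.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic whitespace-separated tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def tf_idf_vectors(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def keyword_relevance(vector, keywords):
    """Sum the tf-idf weights of the keywords that define a document cluster."""
    return sum(vector.get(k, 0.0) for k in keywords)
```

A document would then be considered relevant to, for example, a hypothetical laboratory-results cluster defined by the keywords "hemoglobin" and "glucose" when `keyword_relevance` exceeds a threshold chosen for the corpus at hand.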

Subsequently, segment features are detected for each detected segment of all documents in a cluster, by analyzing a textual content of each said segment detected, where the segment features comprise a plurality of possible sets of named entities and where each set has a certain confidence level.

This step of detecting segment features is followed by the step of matching segments of each of the documents in each cluster based on the segment features of each segment that are determined and a probability distribution of said named entities.

Matching is meant to group segments of the same type across different documents. For example, the segments containing the description of the operation in operation reports can be grouped together based on the similarity of the concepts they contain. This similarity can be measured by comparing so-called segment vectors, which are calculated as a superposition of the term vectors obtained by means of an algorithm for determining medical text similarity, such as the UMLS2vec algorithm.
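As a minimal sketch of this comparison, a segment vector may be computed as the element-wise sum of term vectors and two segment vectors may be compared by cosine similarity. The three-dimensional term vectors below are invented placeholders, not the output of any trained model, and the use of cosine similarity as the comparison measure is an assumption of this sketch.

```python
import math

# Hypothetical toy term vectors; in practice these would be produced by a
# trained embedding algorithm rather than written by hand.
TERM_VECTORS = {
    "incision": [0.9, 0.1, 0.0],
    "suture": [0.8, 0.2, 0.1],
    "street": [0.0, 0.1, 0.9],
    "postcode": [0.1, 0.0, 0.8],
}


def segment_vector(terms, term_vectors=TERM_VECTORS):
    """Superpose (sum element-wise) the vectors of all known terms in a segment."""
    dim = len(next(iter(term_vectors.values())))
    total = [0.0] * dim
    for term in terms:
        vec = term_vectors.get(term)
        if vec is None:
            continue  # terms without a vector do not contribute
        total = [a + b for a, b in zip(total, vec)]
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Segments whose vectors have a high mutual similarity would be placed in the same segment group; the threshold for "high" is left open here.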

Further, a meaning is determined for each segment group which comprises similar segments, where this meaning is extracted by means of natural language processing of the content of each of the segments in the determined segment group. The meaning of every segment is condensed in the segment vector (superposition of all concept vectors), which contains a numerical representation (provided by the UMLS2vec algorithm) of all concepts detected by the NLP algorithm. Therefore, by comparing this vector to the vectors assigned to segments identified by domain experts, it is possible to assign an identifying label to the segment (e.g. institution address or patient medical history). Subsequently, by assigning a segment identification, such as a label, to each segment based on the determined concept distribution of each segment, all segments of the documents in a certain cluster are identified and labeled and hence are assigned a structure which is knowable and recognizable.
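The label assignment described above can be illustrated with the following sketch, in which a segment vector receives the label of the most similar expert-identified reference vector. The function names, the cosine measure and the `min_similarity` threshold of 0.5 are assumptions of this sketch; the reference vectors in any usage would be hypothetical values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_label(segment_vec, reference_vectors, min_similarity=0.5):
    """Return the label of the most similar reference vector.

    reference_vectors maps a label (e.g. "institution address") to the vector
    of a segment identified by a domain expert. None is returned when no
    reference reaches min_similarity, so that the segment stays unlabeled
    rather than receiving a poorly supported label.
    """
    best_label, best_score = None, min_similarity
    for label, ref in reference_vectors.items():
        score = cosine(segment_vec, ref)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label
```

Returning `None` below the threshold is a design choice of the sketch: such segments could, for example, be forwarded to domain experts for review.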

It is to be noted that the semi-structured documents may include operation reports, clinical letters, laboratory results or any other document included in the patient’s electronic health record, but also messages such as communications between a hospital and a patient in the form of tweets, social media comments or contributions, SMS, or phone-call transcripts, which are maintained by means of private storage capacities, such as a SQL, nSQL or Hadoop database that can be hosted on premises or in a remote computing center for each of the respective hospitals.

It is to be noted that such clustering of the semi-structured documents of said plurality of semi-structured documents may be based on at least one of meta-information, content and layout of each said semi-structured document, where such meta-information may include a document file type, a document name, a document date and further metadata (such as size, creator and location), the title of the document, the number of paragraphs or sections and/or the word frequency.
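For illustration, some of the meta-information listed above could be gathered into a feature record as sketched below. The function name `document_features`, the dictionary keys and the convention that paragraphs are separated by blank lines are assumptions of this sketch.

```python
import os
from collections import Counter
from datetime import date


def document_features(path, text, created):
    """Collect simple meta-information about one document.

    path:    file path or name of the document
    text:    plain-text content of the document
    created: document date as a datetime.date
    """
    name = os.path.basename(path)
    stem, ext = os.path.splitext(name)
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Strip common punctuation from the edges of each word before counting.
    words = [w.strip(".,;:()").lower() for w in text.split()]
    words = [w for w in words if w]
    return {
        "file_type": ext.lstrip(".").lower(),
        "name": stem,
        "date": created.isoformat(),
        "size": len(text.encode("utf-8")),  # size in bytes when UTF-8 encoded
        "n_paragraphs": len(paragraphs),
        "word_frequency": Counter(words),
    }
```

Such a record would still have to be converted into a numerical vector before being passed to a clustering algorithm; that conversion is not shown here.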

Such an unsupervised learning algorithm may be implemented by means of unsupervised methods such as k-means clustering or hierarchical clustering. It is also possible to employ semi-supervised methods, where several example documents whose type has been indicated by domain experts are used to train a model that puts documents in clusters. Any supervised classification algorithm (random forest, support vector machines or neural network, to name a few) can be used as a starting point for continuous, semi-supervised learning. A sample of the results is then checked and, if needed, corrected by domain experts to improve model performance.
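A minimal sketch of k-means clustering on numerical document feature vectors is given below. For reproducibility the first k points are used as initial centroids, which is a simplification assumed for this sketch; practical implementations typically use randomized or more careful initialization. The function names are likewise illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (assignments, centroids)."""
    # Simplified deterministic initialization: the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

Each point could, for instance, hold two hypothetical layout features of a document, such as the share of text located in tables and the share located in running prose.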

A segment of such a document may, for example, be a paragraph, a table, a document line in HL7, an element in XML or JSON, a row in a form, one tweet out of many with the same user or hashtag, one comment on a social network, a section in an article, or any other logically separate unit of text of any origin.
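The following sketch shows how candidate segments might be obtained from a few of the formats mentioned above. It is a simple rule-based splitter given for illustration only, not the unsupervised or semi-supervised segment detection of the method; the function name `split_segments`, the format identifiers and the choice of top-level XML children and top-level JSON members as segments are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def split_segments(text, fmt):
    """Split a document into candidate segments according to its format."""
    if fmt == "hl7":
        # HL7 v2 messages terminate each segment line with a carriage
        # return; plain newlines are accepted as well.
        return [line for line in text.replace("\r", "\n").split("\n") if line.strip()]
    if fmt == "xml":
        # Each direct child element of the root is treated as one segment.
        root = ET.fromstring(text)
        return [ET.tostring(child, encoding="unicode").strip() for child in root]
    if fmt == "json":
        # Assumes a top-level JSON object; each member is one segment.
        data = json.loads(text)
        return [json.dumps({key: value}) for key, value in data.items()]
    # Fallback for free text: paragraphs separated by blank lines.
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```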

Segment feature detection is the step of detecting segment features in each said segment by analyzing the textual content of each said segment, which step results in the extraction of named entities. Such named entities are the entities of interest in the text, indicating relevant items/topics in a segment of a document. Other segment features may include, for example, the title of the segment, an XML or HL7 tag, the previous and following segments and/or the length and position of the segment.
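One possible in-memory representation of such segment features is sketched below: each segment carries several alternative sets of named entities, each with a confidence level, from which a distribution over entity types can be derived. The class and field names, and the rule of weighting each entity type by the confidence of the set in which it occurs, are assumptions of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntitySet:
    """One possible interpretation of a segment: mention -> entity type."""
    entities: dict
    confidence: float


@dataclass
class SegmentFeatures:
    title: str
    position: int  # index of the segment within its document
    length: int    # length of the segment text in characters
    entity_sets: list = field(default_factory=list)

    def entity_type_distribution(self):
        """Confidence-weighted share of each entity type (sums to 1.0)."""
        weights = defaultdict(float)
        for entity_set in self.entity_sets:
            for entity_type in entity_set.entities.values():
                weights[entity_type] += entity_set.confidence
        total = sum(weights.values())
        return {t: w / total for t, w in weights.items()} if total else {}
```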

The analyzing of the textual content of each said segment is performed by using an NLP algorithm such as sequence tagging, which detects concepts of interest in sentences and/or segments and assigns to them a label indicating their type. The sequence tagging algorithms include, but are not limited to, conditional random fields, long short-term memory or other recurrent neural networks, various Markov models or multinomial logistic classifiers, or any combination thereof.

A further relevant embodiment relates to the method for extracting information from semi-structured documents according to claim 1, wherein said method further comprises the step of determining relations between segments of documents in said cluster by applying NLP. In other words, the method of the present embodiment determines relations across segments, i.e. relations between non-similar segments, by executing a further natural language processing. Determining relations across non-similar segments makes it possible to improve the identification of segments. Non-similar segments can be related, for example, by being present in the same document or by being related to the same patient. Certain classes of documents (e.g. operation reports) contain certain segments, e.g. pre-operation diagnosis, operation description, etc. If a certain segment cannot be uniquely identified, it is possible to limit the possibilities by excluding those segment labels that have already appeared in the given document.
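The exclusion of labels that have already appeared in a document can be sketched as follows. The greedy order (most confident segment first) and the function name `resolve_labels` are assumptions made for this sketch; the description does not prescribe a particular resolution strategy.

```python
def resolve_labels(candidates_per_segment):
    """Assign at most one segment per label within a single document.

    candidates_per_segment is a list with one {label: confidence} dictionary
    per segment. Segments are handled from most to least confident, and a
    label already used in the document is excluded for the remaining ones.
    """
    resolved = [None] * len(candidates_per_segment)
    used = set()
    order = sorted(
        range(len(candidates_per_segment)),
        key=lambda i: max(candidates_per_segment[i].values(), default=0.0),
        reverse=True,
    )
    for i in order:
        remaining = {
            label: conf
            for label, conf in candidates_per_segment[i].items()
            if label not in used
        }
        if remaining:
            label = max(remaining, key=remaining.get)
            resolved[i] = label
            used.add(label)
    return resolved
```

In this sketch a segment that is equally likely to be a pre-operation diagnosis or an operation description is resolved to the latter once another segment of the same document has confidently taken the former label.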

Another relevant embodiment relates to a method for extracting information from semi-structured messages/documents according to claim 1, wherein said step of clustering further may be based on the graphical layout of such document.

The determination of the cluster of a received semi-structured document can be optimized by applying, besides the content of the document, the results of an analysis of the graphical layout of the document, which gives additional information on which to base the decision as to which cluster the document belongs.

A subsequent embodiment of the present invention relates to a method for extracting information from semi-structured documents according to claim 1, wherein, in the step of detecting features, the context is applied as an input parameter for the NLP algorithm.

Indeed, if the context of a document is already known, this context can be applied in the step of detecting segment features as an input parameter to the NLP algorithm, so that the decision making can be improved and consequently the levels of confidence are increased.

A further relevant embodiment relates to a method for extracting information from semi-structured documents according to claim 1, characterized in that said step of analyzing a textual content of each said segment is based on an ensemble of algorithms that are organized in several layers. This means that the upper layers use the knowledge and output of the algorithms in the bottom layers. The bottom layer can be a general, yet language-specific, Named Entity Recognition (NER) algorithm that can recognize items like people, numbers, dates, addresses, geographical locations, etc. The next layer may contain general medical knowledge, so it can extract entities like doctor, patient, medication, etc. Transfer learning from the bottom layer makes training of this layer easier: for example, if it is to extract a doctor, it does not need to learn how to extract a person, as this is already being done by the layer below. Rather, it specializes in distinguishing whether this person is a doctor or a patient.

Above the generic medical layer there can be a layer specific to a given department/medical field, e.g. orthopedics, neonatology, etc. This layer is trained to extract terms specific to the given department. For example, a knee prosthesis will be labeled as “man-made object” by the general layer, as “medical device” by the generic medical layer and as “knee prosthesis” by the orthopedics layer. Other layers are optional and user-specific. Their task is to detect something that is specific to a given department and not used in general. Typically, this can be a clinical study in which a client wants to assess the effect of a new drug/treatment that is not yet part of the general methodology of the field.
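The layered refinement can be pictured with the toy sketch below, in which each layer refines the label produced by the layer beneath it. Dictionary lookups stand in for trained models, and all dictionary contents, including the mention "dr. smith", are invented for illustration; only the knee prosthesis labels are taken from the example above.

```python
# Bottom layer: general labels for known mentions (stand-in for a general NER model).
GENERAL = {"knee prosthesis": "man-made object", "dr. smith": "person"}

# Upper layers: keyed by the label of the layer below, then by mention.
MEDICAL = {
    "man-made object": {"knee prosthesis": "medical device"},
    "person": {"dr. smith": "doctor"},
}
ORTHOPEDICS = {"medical device": {"knee prosthesis": "knee prosthesis"}}


def label_through_layers(mention):
    """Return the labels assigned by each successive layer, bottom layer first."""
    key = mention.lower()
    labels = []
    general = GENERAL.get(key)
    if general is None:
        return labels  # the bottom layer recognized nothing
    labels.append(general)
    for layer in (MEDICAL, ORTHOPEDICS):
        # Each upper layer only refines what the layer below has produced.
        refined = layer.get(labels[-1], {}).get(key)
        if refined is None:
            break
        labels.append(refined)
    return labels
```

For the mention "Knee prosthesis" the sketch yields the three labels of the example in order, while a mention with no department-specific refinement stops at the last layer that recognized it.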

Brief description of the drawings

The invention will be further elucidated by means of the following description and the appended figures.

Figure 1 represents a system for extracting information from semi-structured documents.

Figure 2 represents a more detailed system for extracting information from semi-structured documents.

Modes for carrying out the invention

The present invention will be described with respect to particular embodiments and with reference to certain drawings, but the invention is not limited thereto but only by the claims. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes. The dimensions and the relative dimensions do not necessarily correspond to actual reductions to practice of the invention.

Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. The terms are interchangeable under appropriate circumstances and the embodiments of the invention can operate in other sequences than described or illustrated herein.

Moreover, the terms top, bottom, over, under and the like in the description and the claims are used for descriptive purposes and not necessarily for describing relative positions. The terms so used are interchangeable under appropriate circumstances and the embodiments of the invention described herein can operate in other orientations than described or illustrated herein.

The term “comprising”, used in the claims, should not be interpreted as being restricted to the means listed thereafter; it does not exclude other elements or steps. It needs to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the scope of the expression “a device comprising means A and B” should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.

In the following paragraphs, referring to the drawing in FIG. 1, an implementation of the system for analysing and extracting information from semi-structured documents according to an embodiment of the present invention is described. First, all relevant functional means of the system as presented in FIG. 1 are described, followed by a description of all interconnections between the mentioned elements. In the succeeding paragraphs, the actual execution of the extraction of information from semi-structured documents under control of the system, according to an embodiment of the present invention, is described.

A first essential element of the system 1 is a document reception means 2 that is configured to retrieve a plurality of semi-structured documents from at least one semi-structured message or document source, where each such source 8, 9, 10 may be a respective database of a first, second and third institution such as a hospital.

A second essential element is the processing means 3 that is first configured to cluster, by executing an unsupervised learning algorithm or semi-supervised learning algorithm, each semi-structured document of said plurality of semi-structured documents based on at least one of meta-information, content and layout of each said semi-structured document in a cluster of a plurality of clusters; and

configured to detect segments in each semi-structured document of each cluster of said plurality of clusters by means of unsupervised classification algorithms, such as hierarchical or k-means clustering, or semi-supervised classification algorithms that allow human feedback to improve the performance of supervised classifiers; and

configured to detect segment features of each segment, using NLP, by analyzing a textual content of each said segment, said segment features comprising a plurality of possible sets of named entities, each set having a certain confidence level; and configured to match segments of each of the documents in each cluster based on said segment features determined for each segment and a probability distribution of said named entities. The matching is the grouping of segments of the same type across different documents.

The processing means 3 is further configured to determine the meaning of each said segment group comprising similar segments based on natural language processing of said content of each of said segments in said segment group, and additionally to assign a segment identification to each segment based on the determined concept distribution in said segment. The meaning of the segment is condensed in the segment vector, which contains a numerical representation of all concepts detected by the NLP algorithm. Therefore, by comparing this vector to the ones assigned to the segments identified by domain experts, it is possible to assign an identifying label to it (e.g. institution address or patient medical history).

A further essential means is a storage means that may consist of a single database or a plurality of local or distributed databases 4, 5, 6 and 7 as shown in FIG. 1, where all document clusters are stored, together with or separately from all results of the processing steps of the processing means 3. It is assumed that a number of institutions, e.g. first hospital 8, second hospital 9 and third hospital 10, forward a plurality of semi-structured documents such as patient reports, clinical letters and laboratory results, but possibly also communication between a hospital and a patient in the form of tweets, social media comments or contributions, SMS messages or phone-call transcripts, which are maintained by means of private storage capacities, such as an SQL, NoSQL or Hadoop database that can be hosted on premises or in a remote computing center for each of the respective hospitals.

The plurality of semi-structured documents is received at the reception means 2 and fed to the processing means 3, which in turn initiates the clustering of each received semi-structured document by first subjecting it to the execution of an unsupervised learning algorithm such as k-means or hierarchical clustering. In another embodiment, any supervised classifier combined with expert feedback can be used as a semi-supervised learning algorithm.

Each semi-structured document of said plurality of semi-structured documents is assigned to a certain cluster of a plurality of clusters, based on at least one of meta-information, content and layout of said semi-structured document, and is stored as such in a respective database. For example, if cluster 1 is assigned, the document is stored in the first database 5.

This clustering is done automatically, using one or more algorithms for so-called unsupervised learning. The document features, i.e. the meta-information, that can be used to decide to which cluster a document belongs may include the document file type, document name, document date and other computer file metadata (size, creator, location, ...), but also content-related features like the title of the document, the number of paragraphs or sections, or the word frequency.

Additional information for the document classification, i.e. the clustering, may be the layout of the semi-structured document, as e.g. a letter has a different layout than an operative report or a laboratory result. Such a document may be converted to a black-and-white image that is black in the areas where there is text and white otherwise. An additional unsupervised learning algorithm can then be used to divide the document set into distinct clusters. This can be done either by vectorizing the images directly (assigning 1 or 0 to every pixel based on it being black or white) or by using an autoencoder neural net or any other machine learning algorithm that extracts the relevant latent features from the image corresponding to the document and assigns to the document a vector whose length differs from the number of pixels in the image. The resulting feature vectors can be grouped using hierarchical or k-means clustering or other unsupervised clustering methods. The outputs of the clustering based on the content and of the clustering based on the graphical layout are then combined to provide a more reliable and robust result.
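By way of a non-limiting illustration, the direct vectorization of page images followed by k-means clustering may be sketched as follows. The tiny 4x4 "pages", the threshold value and the function names `binarize` and `kmeans` are assumptions made for this sketch only; a practical embodiment would typically use full-resolution images or latent features from an autoencoder.

```python
import random

def binarize(page, threshold=128):
    """Flatten a grayscale page (rows of 0-255 values) into a 0/1 vector; 1 marks dark (text) pixels."""
    return [1 if pixel < threshold else 0 for row in page for pixel in row]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical pages: "letters" carry text in the top half, "lab results" in the bottom half.
DARK, BLANK = [0, 0, 0, 0], [255, 255, 255, 255]
letter_a = [DARK, DARK, BLANK, BLANK]
letter_b = [[0, 0, 0, 255], DARK, BLANK, BLANK]
lab_a = [BLANK, BLANK, DARK, DARK]
lab_b = [BLANK, BLANK, [0, 255, 0, 0], DARK]

vectors = [binarize(p) for p in (letter_a, letter_b, lab_a, lab_b)]
labels = kmeans(vectors, k=2)
```

In this sketch the two letters end up in one cluster and the two laboratory results in the other; the cluster indices themselves carry no meaning.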

In some cases, the type of document is known - in this case the category of the document is used to train the supervised classifiers to be able to distinguish between different types of documents based on their content and layout.

The rules of separation between different document types learned by this clustering algorithm are stored and re-used whenever a new batch of documents from the same context arrives.

When a batch of new documents arrives from a new context (e.g. so far the algorithm has seen only cardiology documents and it now receives oncology reports), it uses the already learned rules as a starting position and checks whether it can consistently separate the document set into clusters. The processing for determining the clustering needs to adapt itself somewhat, but this phase of unsupervised adaptation is much simpler than before, when no context was known, as the algorithm now only needs to learn the difference between the two medical contexts and does not need to start from scratch. In practice, the algorithm clusters the documents based on the rules it has learned from the previous context. Only after that does the next phase of unsupervised learning start. The advantage is that here the learning starts from reasonably well determined clusters, while for the first context (or contexts) the algorithm started from a random cluster assignment and then iteratively searched for the best division into document clusters such that documents with similar features are in the same cluster.

After the step of clustering the received semi-structured documents, the processing means processes each of the semi-structured documents of each cluster in order to detect segments. This processing entails detection of segments in every cluster. A segment in this case can be, for example, a paragraph, a table, a message line in HL7 or an element in XML or JSON. This stage can also profit from the graphical layout analysis that was already used in clustering, because the chunks of text that form a single cluster are often also visually connected in the document. This use of the graphical layout provides a more reliable and robust result in the detection of segments.

The segment detection is analogous to the cluster detection: in the latter case, the set of all documents was separated into various groups that have something in common. In the case of segment detection, the set of all lexical tokens (strings of characters with an assigned and thus identified meaning) within a document is separated into various groups that have something in common, such as a section, a line in a form, etc.

The segments are detected using unsupervised methods like k-means clustering or hierarchical clustering. It is also possible to employ semi-supervised methods in which several example documents, in which domain experts have indicated the segment boundaries, are used to train a model that detects segments in further documents. Any supervised classification algorithm (random forest, support vector machines or neural networks, to name a few) can be used as a starting point for continuous, semi-supervised learning. A sample of the results is then checked and, if needed, corrected by domain experts to improve model performance. The features used by the segment detection algorithm include not only the document text but also its layout. The position of a text in a document may indicate that it belongs to a certain segment. Finally, it is possible to detect the segment boundaries (and thus the segments) by means of various regular expressions that look for typical means of separating segments (several newlines, numbered titles, page breaks, etc.).
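As a non-limiting illustration of the regular-expression approach, the following minimal sketch splits a text at blank lines, page breaks (form feeds) and numbered titles. The pattern `BOUNDARY`, the function `split_segments` and the sample text are assumptions made for this example; real documents will generally need patterns tuned per cluster.

```python
import re

# Assumed boundary markers: one or more blank lines, a form feed (page break),
# or a newline directly followed by a numbered title such as "2. Plan".
BOUNDARY = re.compile(r"\n\s*\n+|\f|\n(?=\d+[.)]\s+\S)")

def split_segments(text):
    """Split a document text into segments at the assumed boundary markers."""
    return [part.strip() for part in BOUNDARY.split(text) if part and part.strip()]

sample = (
    "Dear colleague,\nthank you for the referral.\n\n"
    "1. History\nKnee pain since 2019.\n"
    "2. Plan\nPhysiotherapy.\f"
    "Lab results attached."
)
segments = split_segments(sample)
```

For the sample text this yields four segments: the salutation, the two numbered sections and the text after the page break.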

After or in parallel with the step of detecting segments, the processing means processes each semi-structured document of each cluster to detect segment features of each segment by using NLP to analyze the textual content of each said segment. The segment features comprise a plurality of possible sets of named entities, each set having a certain confidence level. The NLP algorithms can be organized in layers, meaning that the algorithms in the upper layers use the knowledge and output of the algorithms in the bottom layers. The bottom layer can be a general, yet language-specific, NER algorithm that can recognize items like people, numbers, dates, addresses, geographical locations, etc. The next layer may contain general medical knowledge, so it can extract entities like doctor, patient, medication, etc. Transfer learning from the bottom layer makes training of this layer easier: for example, to extract a doctor, it does not need to learn how to extract a person, as this is already done by the layer below. Rather, it specializes in distinguishing whether this person is a doctor or a patient.

Above the generic medical layer there can be a layer specific to a given department or medical field, e.g. orthopedics, neonatology, etc. This layer is trained to extract terms specific to the given department. For example, a knee prosthesis will be labeled as “man-made object” by the general layer, “medical device” by the generic medical layer and “knee prosthesis” by the orthopedics layer. Other layers are optional and user-specific. Their task is to detect something that is specific to a given department and not in general use. Typically, this can be a clinical study in which a client wants to assess the effect of a new drug or treatment that is not yet part of the general methodology of the field.
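The layering can be illustrated with a deliberately simplified sketch in which each layer is a lookup table that refines the label produced by the layer below. The tables, the name "dr. smith" and the function `label_term` are hypothetical; in an actual embodiment each layer would be a trained NER model rather than a dictionary.

```python
# Bottom layer: general, language-specific entities.
GENERAL_LAYER = {
    "knee prosthesis": "man-made object",
    "dr. smith": "person",  # hypothetical name, for illustration only
}
# Upper layers map (term, label from the layer below) to a more specific label.
MEDICAL_LAYER = {
    ("knee prosthesis", "man-made object"): "medical device",
    ("dr. smith", "person"): "doctor",
}
ORTHOPEDICS_LAYER = {
    ("knee prosthesis", "medical device"): "knee prosthesis",
}

def label_term(term, upper_layers=(MEDICAL_LAYER, ORTHOPEDICS_LAYER)):
    """Return the labels assigned by each successive layer, from general to specific."""
    term = term.lower()
    label = GENERAL_LAYER.get(term)
    if label is None:
        return []
    labels = [label]
    for layer in upper_layers:
        refined = layer.get((term, label))
        if refined is None:
            break  # this layer has nothing more specific to add
        label = refined
        labels.append(label)
    return labels

prosthesis_labels = label_term("Knee prosthesis")
doctor_labels = label_term("Dr. Smith")
```

Here `prosthesis_labels` is `["man-made object", "medical device", "knee prosthesis"]`, while `doctor_labels` stops at the generic medical layer with `["person", "doctor"]`.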

In this step overall features of segments are detected. Examples of such segment features are the type of segment (paragraph, table, and element name in case of XML or JSON, message type for HL7), its length, position in document etc.

The textual content of a segment is analyzed as well, and named entities are extracted using the NLP algorithm described before. It returns the list of all possible named entities together with the confidence level of each decision of the algorithm. The result of this processing stage is, e.g., that a given section contains 10 problems with a high confidence, some procedures with mediocre confidence and some lexical tokens that can be either equipment or a medication depending on the context. In other words, the results are several sets of entity types with different confidence levels.
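One possible way to hold such a result in memory is sketched below. The class `EntityHypothesis`, the helper functions and all tokens and confidence values are assumptions chosen for illustration and are not prescribed by the method.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityHypothesis:
    token: str         # lexical token found in the segment
    entity_type: str   # e.g. "problem", "procedure", "equipment", "medication"
    confidence: float  # confidence of the NLP decision, between 0 and 1

def sets_by_type(hypotheses):
    """Group hypotheses into one set per entity type, highest confidence first."""
    grouped = defaultdict(list)
    for h in hypotheses:
        grouped[h.entity_type].append(h)
    return {t: sorted(hs, key=lambda h: h.confidence, reverse=True)
            for t, hs in grouped.items()}

def ambiguous_tokens(hypotheses):
    """Return tokens that received more than one entity type."""
    types = defaultdict(set)
    for h in hypotheses:
        types[h.token].add(h.entity_type)
    return {tok: sorted(ts) for tok, ts in types.items() if len(ts) > 1}

# Hypothetical output of the NLP stage for one segment.
hypotheses = [
    EntityHypothesis("chest pain", "problem", 0.95),
    EntityHypothesis("angioplasty", "procedure", 0.55),
    EntityHypothesis("patch", "equipment", 0.50),
    EntityHypothesis("patch", "medication", 0.45),
]
grouped = sets_by_type(hypotheses)
ambiguous = ambiguous_tokens(hypotheses)
```

The token "patch" is reported as ambiguous between equipment and medication, which is the kind of case left for the later disambiguation step.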

In case the context is known, this context may be supplied as an input parameter to the NLP algorithm in order to improve its decisions and levels of confidence. On the other hand, in case the context is unknown, the algorithm returns the several most likely interpretations within the contexts known to the processing system. In order to check the agreement with some known contexts, a medical text similarity algorithm such as the UMLS2Vec algorithm is applied to calculate the distance of the CUIs detected in the segments from those typically expected for a given context:

The UMLS2Vec algorithm allows a vector to be assigned to every detected concept in a given segment. The superposition of those vectors yields a segment vector. If there are several possible interpretations of a segment within known contexts, a vector is constructed for every context hypothesis. This vector can be compared against the distribution of the segment vectors for a given segment in the hypothesized context. If the detected segment vector is statistically compatible with that distribution, we can assume that the segment comes from the hypothesized context.
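A minimal sketch of the superposition and of one possible compatibility test is given below. The text does not specify the statistical test; the per-component check against `n_sigma` standard deviations, the two-dimensional vectors and the function names are assumptions made for this example.

```python
from statistics import mean, pstdev

def superpose(concept_vectors):
    """Segment vector as the component-wise sum of the concept vectors."""
    return [sum(column) for column in zip(*concept_vectors)]

def is_compatible(segment_vector, reference_vectors, n_sigma=3.0, min_sigma=1e-6):
    """Assumed test: every component lies within n_sigma standard deviations
    of the mean of the reference segment vectors for the hypothesized context."""
    for value, column in zip(segment_vector, zip(*reference_vectors)):
        sigma = max(pstdev(column), min_sigma)  # avoid a zero-width band
        if abs(value - mean(column)) > n_sigma * sigma:
            return False
    return True

# Hypothetical segment vectors previously observed in one context.
reference = [[1.0, 0.1], [1.2, 0.0], [0.9, 0.2]]

candidate = superpose([[0.5, 0.05], [0.55, 0.05]])  # close to the reference set
outlier = superpose([[0.0, 0.5], [0.0, 0.5]])       # far from the reference set

candidate_ok = is_compatible(candidate, reference)
outlier_ok = is_compatible(outlier, reference)
```

With these values the candidate segment is accepted for the hypothesized context and the outlier is rejected.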

If no known context is compatible with any interpretation of the segment, all possible interpretations of all concepts are saved, and the disambiguation is done at the end of the processing of the document.

Disambiguation then means finding the most likely interpretation among all combinations of concept interpretations. The most likely interpretation will be the one which, when assigned vectors via the UMLS2Vec algorithm, has the smallest variance. This means that it contains terms that are related. A higher variance means that the connection between the concepts is weaker or non-existent.
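The smallest-variance selection can be sketched as follows. The variance is taken here as the mean squared distance of the concept vectors from their centroid, which is one possible reading of the text; the vectors, tokens and function names are hypothetical.

```python
from itertools import product

def variance(vectors):
    """Mean squared Euclidean distance of the vectors from their centroid."""
    n = len(vectors)
    centroid = [sum(column) / n for column in zip(*vectors)]
    return sum(sum((x - c) ** 2 for x, c in zip(v, centroid)) for v in vectors) / n

def disambiguate(candidates):
    """candidates: one dict per token, mapping interpretation -> concept vector.
    Returns the combination of interpretations with the smallest variance."""
    best = min(
        product(*(options.items() for options in candidates)),
        key=lambda combo: variance([vector for _, vector in combo]),
    )
    return tuple(name for name, _ in best)

# Hypothetical two-dimensional concept vectors.
candidates = [
    {"equipment": [1.0, 0.0], "medication": [0.0, 1.0]},  # ambiguous token
    {"medication": [0.1, 0.9]},                           # unambiguous token
]
chosen = disambiguate(candidates)
```

Here the ambiguous token is resolved to "medication", because that combination gives the more tightly grouped vectors. The exhaustive search over all combinations grows quickly with the number of ambiguous tokens, so a practical embodiment may need to prune it.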

Subsequently or in parallel to the previous steps, the processing means 3 matches segments of each of the documents in each cluster, i.e. determines similar segments and groups them, based on said segment features determined for each segment and a probability distribution of said named entities.

For example, the segments containing description of the operation in the operation reports can be grouped together based on similarity of concepts they contain. This similarity can be measured by comparing so-called segment vectors that are calculated as a superposition of the term vectors obtained by a medical text similarity algorithm such as the UMLS2Vec algorithm. Successively or in parallel with the former steps the processing means 3 determines the meaning of each said segment group where such segment group is a group of segments that comprises similar segments. The interpretation of such segment group is determined by applying natural language processing of said content of each said segments in said segment group.

Subsequently, by assigning a segment identification, such as a label, to each segment based on the determined concept distribution of each segment all segments of the documents in a certain cluster are identified and/or labeled and hence are assigned a structure which is knowable and recognizable.

The meaning of the segment is condensed in the segment vector, which contains a numerical representation of all concepts detected by the NLP algorithm. The segment type, and thus the label, is therefore found by selecting, among the vectors assigned to the segments identified by domain experts, the one that forms the smallest angle with the vector assigned to the studied segment.
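Selecting the smallest angle is equivalent to selecting the largest cosine similarity, which can be sketched as follows. The three-dimensional vectors and the function names are assumptions for this example; the two labels are the ones mentioned earlier in the description.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def assign_label(segment_vector, expert_vectors):
    """Return the expert label whose vector forms the smallest angle with the segment vector."""
    return max(expert_vectors, key=lambda label: cosine(segment_vector, expert_vectors[label]))

# Hypothetical vectors of expert-labeled segments.
expert_vectors = {
    "institution address": [1.0, 0.1, 0.0],
    "patient medical history": [0.0, 0.2, 1.0],
}
studied_segment = [0.1, 0.1, 1.1]
label = assign_label(studied_segment, expert_vectors)
```

For the studied segment above the assigned label is "patient medical history".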

The concepts in the medical domain can be described with the help of UMLS, which assigns a unique code (CUI, concept unique identifier) to each concept. Besides this, UMLS also contains various types of relationships between the concepts (e.g. parent-child, broader relation, narrower relation, etc.). We can use these relations as constraints that allow us to define an N-dimensional metric in which every CUI corresponds to an N-dimensional vector, making it possible to assess how close to each other various concepts are, even when they are not directly related.

For example, if concept A is a parent of concept B and concept B is a parent of concept C, a relation also exists between A and C, even though they are not directly related. Their distance is, however, larger than the one between A and B or between B and C, as one needs to make 2 steps.
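The number of steps between two concepts can be obtained with a breadth-first search over the direct relations, as in the following minimal sketch. The function name and the representation of relations as undirected pairs are assumptions for this example; relation types are ignored here.

```python
from collections import deque

def shortest_path_length(direct_relations, start, goal):
    """Number of direct relations on the shortest path from start to goal,
    or None if the two concepts are not related."""
    neighbours = {}
    for a, b in direct_relations:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        concept, steps = queue.popleft()
        if concept == goal:
            return steps
        for other in neighbours.get(concept, ()):
            if other not in seen:
                seen.add(other)
                queue.append((other, steps + 1))
    return None

relations = [("A", "B"), ("B", "C")]  # A is parent of B, B is parent of C
steps_a_b = shortest_path_length(relations, "A", "B")
steps_a_c = shortest_path_length(relations, "A", "C")
```

For the example above, A and B are 1 step apart and A and C are 2 steps apart.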

Formally, finding vectors corresponding to all concepts would be equivalent to solving a system (or several systems) of linear equations.

In order to describe how the concepts will be vectorized, let us first introduce some definitions:

• 2 UMLS concepts A and B are directly related if there is a defined bilateral relation of any type that involves both concepts A and B. For example, A is parent of B; A is caused by B.

• 2 UMLS concepts A and B are related if there exists an ordered, finite list of distinct concepts $A_1, A_2, A_3, \ldots, A_N$ such that each consecutive pair (A and $A_1$, $A_1$ and $A_2$, ..., $A_N$ and B) is directly related. The relationships in each direct relation can be of any type. For example, A cures $A_1$, $A_1$ is parent of $A_2$, etc., and finally B is a manifestation of $A_N$. Because of this definition, all directly related CUIs are also related. The inverse is, however, not true.

The UMLS guarantees neither that all concepts are related nor that for every concept there is at least one bilateral relation defined. Therefore, the first step in the vectorization will be the identification of the largest connected sets, which will be constructed in the following way:

1) A CUI is randomly selected. This CUI is the first member of the set. Then, using UMLS, all directly related CUIs are found and added to the set.

2) For every CUI found in the previous step, all directly related CUIs are found and added to the set, unless they are already there.

3) Step 2 is repeated until no new CUIs can be added.

Then the remaining CUIs will be used to construct the next set using steps 1-3 above. After M steps we obtain M sets with one or more CUIs. In every set that has more than one CUI, every CUI is related to at least one other CUI in the same set. At the same time, it is not related to any CUI from any other set.
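The construction of the connected sets described in steps 1-3 can be sketched as follows. The identifiers "C1" to "C6" are placeholders rather than real UMLS codes, the function name is an assumption, and the seed CUI is picked with `min()` instead of randomly so that the sketch is reproducible; the resulting sets do not depend on this choice.

```python
def connected_sets(cuis, direct_relations):
    """Partition the CUIs into sets of mutually related CUIs (steps 1-3 above).
    Assumes every CUI appearing in direct_relations is also listed in cuis."""
    neighbours = {cui: set() for cui in cuis}
    for a, b in direct_relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    remaining = set(cuis)
    result = []
    while remaining:
        seed = min(remaining)      # step 1 (deterministic instead of random)
        current = {seed}
        frontier = [seed]
        while frontier:            # steps 2 and 3: expand until nothing new is found
            cui = frontier.pop()
            for other in neighbours[cui]:
                if other not in current:
                    current.add(other)
                    frontier.append(other)
        result.append(current)
        remaining -= current       # the remaining CUIs seed the next set
    return result

cuis = ["C1", "C2", "C3", "C4", "C5", "C6"]
direct_relations = [("C1", "C2"), ("C2", "C3"), ("C4", "C5")]
sets = connected_sets(cuis, direct_relations)
```

In this example three sets are obtained: one with C1, C2 and C3, one with C4 and C5, and a single-member set for C6, which has no defined relation.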

The second step is to assign to every pair of related CUIs a real number that corresponds to the strength of the relationship. A large number indicates a very strong relationship, while zero is used for all CUI pairs that are not related in UMLS.

The relation strength can be determined from the shortest path (smallest number of intermediaries) between the related CUIs. Let’s assume that there are K relationship types defined in the UMLS. Then we can define for every relationship type k two parameters $\alpha_k$ and $\beta_k$, which take values from the interval (-1, 1).

The relationship strength can be then calculated using formula:

where k runs over all relationship types that are defined along the shortest path and N is the distance of the furthest intermediary that is connected via a relationship of type k. With this construction we obtain a real number for every pair of CUIs that are related. This can be represented in a matrix $R$ with dimensions $N_c \times N_c$, where $N_c$ is the number of CUIs in a connected set constructed earlier.

We can now use matrix factorization in order to obtain matrix Y such that

$$R = Y^{T} \times Y$$

In order for the matrix multiplication to work, the matrix Y must have $N_c$ columns and an arbitrary number of rows f, so that every column of matrix Y is an f-dimensional vector corresponding to a given CUI. There are many proven techniques for performing a matrix factorization in practice, for example the alternating least squares method.
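A minimal sketch of such a factorization is given below. It uses a truncated eigendecomposition of the symmetric matrix R instead of alternating least squares, simply because it fits in a few lines; the function name, the example matrix and the choice f = 2 are assumptions for this example. The sketch relies on NumPy.

```python
import numpy as np

def factorize(R, f):
    """Return Y with f rows and N_c columns such that Y.T @ Y approximates the
    symmetric matrix R, using its f largest non-negative eigenvalues."""
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    top = np.argsort(eigenvalues)[::-1][:f]                 # indices of the f largest eigenvalues
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))   # negative eigenvalues are dropped
    return (eigenvectors[:, top] * scale).T

# Hypothetical relation-strength matrix for a connected set of 3 CUIs,
# built from known 2-dimensional vectors so that an exact factorization exists.
Y_true = np.array([[1.0, 0.5, 0.0],
                   [0.0, 0.5, 1.0]])
R = Y_true.T @ Y_true

Y = factorize(R, f=2)
reconstruction_error = float(np.abs(Y.T @ Y - R).max())
```

Each column of `Y` is the f-dimensional vector of one CUI. The recovered `Y` may differ from `Y_true` by a rotation or reflection, which does not change the product and therefore leaves all distances and angles between CUI vectors unchanged.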

This factorization is done for every set of related CUIs. The numbers $f$, $\alpha_k$ and $\beta_k$ are free parameters and their values are tuned for the optimal performance of the factorization process.

The outcome of this process is that we can assign an f-dimensional vector to every CUI and so determine the similarity of any pair of CUIs from the set of related CUIs, even those which are not directly related in the UMLS. Also, one can calculate a center of mass for any group of CUIs from one set of related CUIs and thus compare groups of CUIs to each other.