

Title:
METHOD AND SYSTEM FOR AUTOMATED CORRECTION AND/OR COMPLETION OF A DATABASE
Document Type and Number:
WIPO Patent Application WO/2023/020892
Kind Code:
A1
Abstract:
An auto-encoder model (AEM) processes a dataset describing a physical part from a part catalogue in the form of a property co-occurrence graph (G), and performs entity resolution and auto-completion on the co-occurrence graph (G) in order to compute a corrected and/or completed dataset. According to an embodiment, the encoder (E) consists of a recurrent neural network (RNN) and a graph attention network (GAT). The decoder (D) contains a linear decoder (LD) for numeric values and a recurrent neural network decoder (RNN-D) for strings. The auto-encoder model provides an automated end-to-end solution that can auto-complete missing information as well as correct data errors such as misspellings or wrong values. The auto-encoder model is capable of auto-completion for highly unaligned part specification data with missing values. This has multiple benefits: First, the auto-encoder model can be trained completely unsupervised (self-supervised) as no labeled training data is required. Second, the auto-encoder model can capture correlation between any part specification property, value, and unit of measure. Third, the auto-encoder model is a single model instead of many models (for example, one for each property and unit) as would be the case in a Euclidean (table-based) missing data imputation algorithm. Fourth, the auto-encoder model can natively handle misspelled property and value terms and learn to align them. A further advantage is the ability for interactive user involvement. As the auto-encoder model operates purely on character level, immediate feedback can be given to the user, for example after each character that the user is typing or editing.

Inventors:
BRIKIS GEORGIA OLYMPIA (US)
HASAN RAKEBUL (DE)
HILDEBRANDT MARCEL (DE)
JOBLIN MITCHELL (CA)
KOLEVA ANETA (DE)
RINGSQUANDL MARTIN (DE)
Application Number:
PCT/EP2022/072331
Publication Date:
February 23, 2023
Filing Date:
August 09, 2022
Assignee:
SIEMENS AG (DE)
International Classes:
G06F16/215; G06F16/23; G06F16/332; G06F16/901; G06N3/08
Foreign References:
US20210117395A1 (2021-04-22)
US20200089650A1 (2020-03-19)
US10394770B2 (2019-08-27)
Other References:
Jie Zhou et al.: "Graph Neural Networks: A Review of Methods and Applications", arXiv.org, Cornell University Library, 20 December 2018, XP081921491
Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., Koubarakis, M.: "JedAI: The force behind entity resolution", The Semantic Web: ESWC 2017 Satellite Events, Revised Selected Papers, vol. 10577, Springer, 2017, pages 161-166
Obraczka, D., Schuchart, J., Rahm, E.: "EAGER: embedding-assisted entity resolution for knowledge graphs", arXiv:2101.06126v1, 2021
Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., Achan, K.: "Product knowledge graph embedding for e-commerce", WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, 2020, pages 672-680, XP058708109, DOI: 10.1145/3336191.3371778
Claims:
Patent claims

1. A computer implemented method for automated correction and/or completion of a database, comprising the following operations, wherein the operations are performed by components, and wherein the components are software components executed by one or more processors and/or hardware components:
providing (OP1), in a database, a set of datasets representing a set of physical parts, wherein each dataset represents one of the parts and identifies at least one property of the part as well as its value,
receiving (OP3), by an auto-encoder model (AEM) containing an encoder (E) and a decoder (D), a co-occurrence graph (G) representing a first dataset from the set of datasets, with the first dataset representing a first part (A),
wherein the co-occurrence graph (G) contains a set of nodes (X), including property nodes (PN) representing the properties of the first part (A) and value nodes (VN) representing values of the properties of the first part (A),
wherein the value nodes (VN) form a fully connected graph within the co-occurrence graph (G), and
wherein each value node (VN) is connected to its corresponding property node (PN), and optionally, to a unit node (UN) representing a unit of measure,
performing (OP4), by the auto-encoder model (AEM), entity resolution and/or auto-completion on the co-occurrence graph (G) in order to compute a corrected and/or completed first dataset, and
storing (OP6) the corrected and/or completed first dataset in the database.

2. The method according to claim 1, wherein the first dataset has at least one missing value that is filled in the corrected and/or completed first dataset, wherein the missing value is represented as an auxiliary node in the co-occurrence graph (G), wherein the decoder (D) decodes the missing value for the auxiliary node, and wherein the missing value is filled into the corrected and/or completed first dataset.

3. The method according to claim 1 or 2, wherein the first dataset contains at least one data error, in particular a misspelling or a wrong numeric value, that is corrected in the corrected and/or completed first dataset, wherein the decoder (D) decodes values for every node in the co-occurrence graph (G), wherein all decoded values are compared to their original values in the first dataset, and wherein if one of the decoded values differs from its original value, or if a difference between one of the decoded values and its original value exceeds a threshold, the respective decoded value replaces the respective original value in the corrected and/or completed first dataset.

4. The method according to any of the preceding claims, wherein the first dataset contains an incomplete string or an incomplete number that is completed with output of the decoder (D) in the corrected and/or completed first dataset.

5. The method according to any of the preceding claims, wherein the encoder (E) consists of a recurrent neural network (RNN) and a graph attention network (GAT).

6. The method according to claim 5, wherein the encoder (E) processes the set of nodes (X) in the co-occurrence graph (G) according to the formula Encoder(X, G) = GAT(RNN(X), G).

7. The method according to claim 5 or 6, wherein the graph attention network (GAT) stores an attention weight (α_ij) for every link in the co-occurrence graph (G) according to the formula α_ij = a[W_att RNN(x_i), W_att RNN(x_j)], wherein a and W_att are trainable parameters.

8. The method according to any of the preceding claims, wherein the decoder (D) contains a linear decoder (LD) for numeric values and a recurrent neural network decoder (RNN-D) for strings.

9. The method according to any of the preceding claims, with the additional step of outputting (OP5), by a user interface, the corrected and/or completed first dataset, and detecting, by the user interface, a confirming user interaction, before storing (OP6) the corrected and/or completed first dataset in the database.

10. The method according to any of the preceding claims, with the additional step of receiving (OP2), by the user interface, one or more characters or digits and storing them as an incomplete string or incomplete number in the first dataset, wherein the incomplete string or incomplete number is completed in the corrected and/or completed first dataset.

11. The method according to any of the preceding claims, wherein the database is a graph database containing as datasets a co-occurrence graph (G) for each part, with each co-occurrence graph (G) containing a set of nodes (X), including property nodes (PN) representing the properties of the respective part and value nodes (VN) representing values of the properties for the respective part, wherein the value nodes (VN) form a fully connected graph within the respective co-occurrence graph (G), and wherein each value node (VN) is connected to its corresponding property node (PN), and optionally, to a unit node (UN) representing a unit of measure.

12. A system for automated correction and/or completion of a database, comprising:
a database storing a set of datasets representing a set of physical parts, wherein each dataset represents one of the parts and identifies at least one property of the part as well as its value,
a processor configured for execution of an auto-encoder model (AEM) containing an encoder (E) and a decoder (D), configured for
receiving (OP3) a co-occurrence graph (G) representing a first dataset from the set of datasets, with the first dataset representing a first part (A),
wherein the co-occurrence graph (G) contains a set of nodes (X), including property nodes (PN) representing the properties of the first part (A) and value nodes (VN) representing values of the properties of the first part (A),
wherein the value nodes (VN) form a fully connected graph within the co-occurrence graph (G), and
wherein each value node (VN) is connected to its corresponding property node (PN), and optionally, to a unit node (UN) representing a unit of measure, and
performing (OP4), by the auto-encoder model (AEM), entity resolution and/or auto-completion on the co-occurrence graph (G) in order to compute a corrected and/or completed first dataset.

13. Computer program product with program instructions for carrying out a method according to one of the method claims.

14. Provision device for the computer program product according to the preceding claim, wherein the provision device stores and/or provides the computer program product.

Description:

Method and system for automated correction and/or completion of a database

Product catalogues are an essential element of any component engineering process. In contrast to consumer product catalogues (e.g., Amazon, etc.), industrial product/part catalogues must be highly accurate in terms of their technical data, especially product parameters (i.e., technical attributes). When looking for suitable parts, a component engineer must be certain that the given parameters match his specification, otherwise correct functioning of his solution cannot be guaranteed.

Insufficient data quality in industrial part catalogues is a widely acknowledged problem with significant impacts on procurement, manufacturing, and product quality. Especially in electrical engineering, millions of parts are on the market, often with low-quality specifications. Any application area that leverages search technologies on this data suffers from returning either incorrect results or missing relevant results entirely. Machine learning models that make use of this data will also tend to perform poorly (e.g., similarity search, part recommender systems).

Finding data quality issues, such as wrong units of measure or inconsistent parameter values, in a semi-automated and data-driven way is a difficult task since it requires domain knowledge. Human engineers usually have a very good intuition on parameters and how they are related physically in their field of expertise and can spot issues quickly. However, getting human experts to label issues is time-consuming and expensive.

Papadakis, G., Tsekouras, L., Thanos, E., Giannakopoulos, G., Palpanas, T., and Koubarakis, M., JedAI: The force behind entity resolution, The Semantic Web: ESWC 2017 Satellite Events, Revised Selected Papers, volume 10577 of Lecture Notes in Computer Science, Springer, 2017, pages 161-166, describes a traditional end-to-end system for entity resolution that involves several steps, such as data pre-processing, blocking, clustering and matching. In each of these steps a domain expert is expected to be involved and to guide the process. Additionally, knowledge and understanding of the data sources is required, which can be challenging. Moreover, the entity matching depends on regular expression text matching and fuzzy string matching, which in the case of missing values or noisy data would not yield a good result.

Obraczka, D., Schuchart, J., and Rahm, E., EAGER: embedding-assisted entity resolution for knowledge graphs, 2021, arXiv:2101.06126v1 [cs.LG], describes a more sophisticated tool that starts by representing the datasets as knowledge graphs (KGs) and then, by using different graph embedding methods, generates representations of the data in a high-dimensional space. The transformation of industrial data from tabular to a KG format is not a trivial task and it can involve domain-specific modifications, such as the one mentioned in Xu, D., Ruan, C., Korpeoglu, E., Kumar, S., and Achan, K., Product knowledge graph embedding for e-commerce, WSDM '20: The Thirteenth ACM International Conference on Web Search and Data Mining, 2020, pages 672-680.

Solutions which resort to graph embedding methods also require a set of seed alignments, i.e., a set of existing matches between the data in the input graphs. However, in the industrial setting, such seed alignments are very rarely available and the process of extracting such pairs is expensive, requiring expertise and human labor.

It is an object of the present invention to identify a problem in the prior art and to find a technical solution for this. The objectives of the invention are achieved by the independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective dependent claims.

According to the computer implemented method for automated correction and/or completion of a database, the following operations are performed by components, wherein the components are software components executed by one or more processors and/or hardware components:
providing, in a database, a set of datasets representing a set of physical parts, wherein each dataset represents one of the parts and identifies at least one property of the part as well as its value,
receiving, by an auto-encoder model containing an encoder and a decoder, a co-occurrence graph representing a first dataset from the set of datasets, with the first dataset representing a first part,
wherein the co-occurrence graph contains a set of nodes, including property nodes representing the properties of the first part and value nodes representing values of the properties of the first part,
wherein the value nodes form a fully connected graph within the co-occurrence graph, and
wherein each value node is connected to its corresponding property node, and optionally, to a unit node representing a unit of measure,
performing, by the auto-encoder model, entity resolution and/or auto-completion on the co-occurrence graph in order to compute a corrected and/or completed first dataset, and
storing the corrected and/or completed first dataset in the database.

The system for automated correction and/or completion of a database comprises:
a database storing a set of datasets representing a set of physical parts, wherein each dataset represents one of the parts and identifies at least one property of the part as well as its value,
a processor configured for execution of an auto-encoder model containing an encoder and a decoder, configured for
receiving a co-occurrence graph representing a first dataset from the set of datasets, with the first dataset representing a first part,
wherein the co-occurrence graph contains a set of nodes, including property nodes representing the properties of the first part and value nodes representing values of the properties of the first part,
wherein the value nodes form a fully connected graph within the co-occurrence graph, and
wherein each value node is connected to its corresponding property node, and optionally, to a unit node representing a unit of measure, and
performing, by the auto-encoder model, entity resolution and/or auto-completion on the co-occurrence graph in order to compute a corrected and/or completed first dataset.

The following advantages and explanations are not necessarily the result of the object of the independent claims. Rather, they may be advantages and explanations that only apply to certain embodiments or variants.

With regard to the invention, the term "part" refers to any physical part, physical component, and/or physical material that can be used by an engineer to build any kind of physical product, for example a car or an industrial automation system. In other words, a part could be a transistor, a casing, a processor, a PLC, a motor, or a cable. These are, of course, completely arbitrary examples.

With regard to the invention, the term "co-occurrence graph" refers to a graph representing a part, wherein the co-occurrence graph contains a set of nodes, including property nodes representing the properties of the part and value nodes representing values of the properties of the part. The value nodes form a fully connected graph within the co-occurrence graph. Each value node is connected to its corresponding property node, and optionally, to a unit node representing a unit of measure.

With regard to the invention, the automated correction and/or completion of the database is achieved by automated or semi-automated correction and/or completion of a single dataset in the database, for example. Of course, the automated correction and/or completion of the database can also include automated or semi-automated correction and/or completion of several or all datasets stored in the database.

For example, each dataset contains technical part specification data as parametric data describing each part's given parameters. As a result, the set of datasets forms, for example, a part catalogue. The datasets can be implemented with any kind of data structure, for example tables, co-occurrence graphs, or elements of a relational database.

In connection with the invention, unless otherwise stated in the description, the terms "training", "generating", "computer-aided", "calculating", "determining", "reasoning", "retraining" and the like relate preferably to actions and/or processes and/or processing steps that change and/or generate data and/or convert the data into other data, the data in particular being or being able to be represented as physical quantities, for example as electrical impulses.

The term "computer" should be interpreted as broadly as pos- sible, in particular to cover all electronic devices with da- ta processing properties. Computers can thus, for example, be personal computers, servers, clients, programmable logic con- trollers (PLCs), handheld computer systems, pocket PC devic- es, mobile radio devices, smartphones, devices or any other communication devices that can process data with computer support, processors and other electronic devices for data processing. Computers can in particular comprise one or more processors and memory units.

In connection with the invention, a "memory", "memory unit" or "memory module" and the like can mean, for example, a volatile memory in the form of random-access memory (RAM) or a permanent memory such as a hard disk or a disk.

The method and system, or at least some of their embodiments, provide an automated end-to-end solution that can auto-complete missing information as well as correct data errors such as misspellings or wrong values.

The method and system, or at least some of their embodiments, provide with the auto-encoder model a joint model that jointly solves entity resolution as well as missing data imputation on any material property. As a result, the auto-encoder model is capable of auto-completion for highly unaligned part specification data with missing values. This has multiple benefits:

First, the auto-encoder model can be trained completely unsupervised (self-supervised) as no labeled training data is required. Second, the auto-encoder model can capture correlation between any part specification property, value, and unit of measure. Third, the auto-encoder model is a single model instead of many models (for example, one for each property and unit) as would be the case in a Euclidean (table-based) missing data imputation algorithm. Fourth, the auto-encoder model can natively handle misspelled property and value terms and learn to align them.

A further advantage is the ability for interactive user involvement. As the auto-encoder model operates purely on character level, immediate feedback to the user can be given, for example after each character that the user is typing or editing.

In an embodiment of the method and system, the first dataset has at least one missing value that is filled in the corrected and/or completed first dataset. The missing value is represented as an auxiliary node in the co-occurrence graph. The decoder decodes the missing value for the auxiliary node. The missing value is filled into the corrected and/or completed first dataset.

In an embodiment of the method and system, the first dataset contains at least one data error, in particular a misspelling or a wrong numeric value, that is corrected in the corrected and/or completed first dataset. The decoder decodes values for every node in the co-occurrence graph. All decoded values are compared to their original values in the first dataset. If one of the decoded values differs from its original value, or if a difference between one of the decoded values and its original value exceeds a threshold, the respective decoded value replaces the respective original value in the corrected and/or completed first dataset.

In an embodiment of the method and system, the first dataset contains an incomplete string or an incomplete number that is completed with output of the decoder in the corrected and/or completed first dataset.

In an embodiment of the method and system, the encoder consists of a recurrent neural network and a graph attention network.

In an embodiment of the method and system, the encoder processes the set of nodes in the co-occurrence graph according to the formula

Encoder(X, G) = GAT(RNN(X), G).

In an embodiment of the method and system, the graph attention network stores an attention weight for every link in the co-occurrence graph according to the formula α_ij = a[W_att RNN(x_i), W_att RNN(x_j)], where a and W_att are trainable parameters.

In an embodiment of the method and system, the decoder contains a linear decoder for numeric values and a recurrent neural network decoder for strings.

An embodiment of the method comprises the additional step of outputting, by a user interface, the corrected and/or completed first dataset, and detecting, by the user interface, a confirming user interaction, before storing the corrected and/or completed first dataset in the database.

An embodiment of the method comprises the additional step of receiving, by the user interface, one or more characters or digits and storing them as an incomplete string or incomplete number in the first dataset. The incomplete string or incomplete number is completed in the corrected and/or completed first dataset.

This embodiment allows the auto-encoder model to auto-complete character-level input of a user, in particular after typing the first letters or numbers in a property name field, a unit of measure field, or a value field on a user interface.

In an embodiment of the method and system, the database is a graph database containing as datasets a co-occurrence graph for each part, with each co-occurrence graph containing a set of nodes, including property nodes representing the properties of the respective part and value nodes representing values of the properties for the respective part,
wherein the value nodes form a fully connected graph within the respective co-occurrence graph, and
wherein each value node is connected to its corresponding property node, and optionally, to a unit node representing a unit of measure.

The computer program product has program instructions for carrying out the method.

The provision device for the computer program product stores and/or provides the computer program product.

The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. The embodiments may be combined with each other. Furthermore, the embodiments may be combined with any of the features described above. Unless stated otherwise, identical reference signs denote the same features or functionally identical elements between drawings. Included in the drawings are the following Figures:

Fig. 1 shows a first embodiment;

Fig. 2 shows another embodiment;

Fig. 3 shows a first example from an electronics product catalogue;

Fig. 4 shows a second example from an electronics product catalogue;

Fig. 5 shows a co-occurrence graph G for a first part A and a second co-occurrence graph G2 for a second part B;

Fig. 6 shows an auto-encoder model AEM;

Fig. 7 shows a training procedure for the auto-encoder model AEM when masking a numeric value;

Fig. 8 shows a training procedure for the auto-encoder model AEM when masking a string value;

Fig. 9 shows term-based entity resolution by the auto-encoder model AEM;

Fig. 10 shows auto-completion by the auto-encoder model AEM;

Fig. 11 shows a first example of a user U interacting with the auto-encoder model AEM acting as an auto-completion system ACS;

Fig. 12 shows a second example of a user U interacting with the auto-encoder model AEM acting as an auto-completion system ACS; and

Fig. 13 shows a flowchart of a possible exemplary embodiment of a method.

In the following description, various aspects of the present invention and embodiments thereof will be described. However, it will be understood by those skilled in the art that embodiments may be practiced with only some or all aspects thereof. For purposes of explanation, specific numbers and configurations are set forth in order to provide a thorough understanding. However, it will also be apparent to those skilled in the art that the embodiments may be practiced without these specific details.

The described components can each be hardware components or software components. For example, a software component can be a software module such as a software library; an individual procedure, subroutine, or function; or, depending on the programming paradigm, any other portion of software code that implements the function of the software component. A combination of hardware components and software components can occur, in particular, if some of the effects according to the invention are preferably exclusively implemented by special hardware (e.g., a processor in the form of an ASIC or FPGA) and some other part by software.

Fig. 1 shows one sample structure for computer-implementation of the invention which comprises:

(101) computer system

(102) processor

(103) memory

(104) computer program (product)

(105) user interface

In this embodiment of the invention the computer program product 104 comprises program instructions for carrying out the invention. The computer program 104 is stored in the memory 103 which renders, among others, the memory and/or its related computer system 101 a provisioning device for the computer program product 104. The system 101 may carry out the invention by executing the program instructions of the computer program 104 by the processor 102. Results of the invention may be presented on the user interface 105. Alternatively, they may be stored in the memory 103 or on another suitable means for storing data.

Fig. 2 shows another sample structure for computer-implementation of the invention which comprises:

(201) provisioning device

(202) computer program (product)

(203) computer network / Internet

(204) computer system

(205) mobile device / smartphone

In this embodiment the provisioning device 201 stores a computer program 202 which comprises program instructions for carrying out the invention. The provisioning device 201 provides the computer program 202 via a computer network / Internet 203. By way of example, a computer system 204 or a mobile device / smartphone 205 may load the computer program 202 and carry out the invention by executing the program instructions of the computer program 202.

Preferably, the embodiments shown in Figs. 6-13 can be implemented with a structure as shown in Fig. 1 or Fig. 2.

Fig. 3 shows an example from an electronics product catalogue, here a specification for a capacitor A, while Fig. 4 shows a specification for capacitor B. As is typical, properties of the capacitors are described in free text, the unit of measure is optional, and the value can also be free text.

The task for the embodiments described with regard to Figs. 6-13 is to detect that the properties "shape" and "form" have the same semantics (i.e., both correspond to the same physical property). Further, the embodiments need to know that "V/K" is the same as "Volt per Kelvin" and that "6.4 mV/K" is in this case the same as "0.0064 V/K". Additionally, the text expression "also operates at 88 to 350 VDC" is the same as "88 (Min)-350 (Max)".

As for auto-completion, at least some of the embodiments described with regard to Figs. 6-13 can infer that the missing unit of measure for "Current Gain" for Capacitor B is probably "Amp V".

A technical challenge of at least some of the embodiments described with regard to Figs. 6-13 is to encode these misaligned and domain-specific vocabularies of materials in a semantic space that reflects aligned semantics across different terms.

State-of-the-art NLP techniques (e.g., general-purpose pre-trained word embeddings such as BERT or word2vec) fail to deal with such domain-specific terminology. On the other hand, there is a lot of hidden correlation to be captured in material specifications, since all materials operate in the physical world and therefore share physical properties that are highly correlated.

At least some of the embodiments described with regard to Figs. 6-13 aim at assisting engineers to search through non-aligned material specifications and provide a "cleansed" view on missing or erroneous information. In order to reach this goal, at least some of the embodiments use a co-occurrence graph as shown in Fig. 5, an auto-encoder model AEM as shown in Fig. 6, and a suitable training procedure for training the auto-encoder model AEM as shown in Figs. 7 and 8.

In this sense at least some of the embodiments provide one model that jointly solves entity resolution as well as missing data imputation on any material property.

As a first step, for each part of a set of parts (with the set containing, for example, all parts) a co-occurrence graph G is built (one graph per part, focusing on the co-occurrence of properties between parts), with the co-occurrence graph G representing properties of the respective part, and wherein the property values of each part form a fully connected graph within the co-occurrence graph G. Each value is connected to its property and optional unit of measure.

Building a co-occurrence graph (G) for each part in this way is a simple procedure that can be implemented by a computer program that processes, for example, tables in a database that store the respective information about the parts.
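As a non-authoritative illustration, the following Python sketch builds one such per-part co-occurrence graph from a table row; the dict-based part record, the networkx representation, and all names are assumptions chosen for this example, not details taken from the description above.

```python
import networkx as nx
from itertools import combinations

def build_cooccurrence_graph(part: dict) -> nx.Graph:
    """Builds a per-part co-occurrence graph G: a property node per property,
    a value node per value, an optional unit node, and a fully connected
    subgraph over all value nodes."""
    g = nx.Graph()
    value_nodes = []
    for prop, (value, unit) in part.items():
        pn, vn = f"PN:{prop}", f"VN:{prop}"
        g.add_node(pn, feature=prop)        # property node
        g.add_node(vn, feature=str(value))  # value node
        g.add_edge(vn, pn)                  # each value links to its property
        if unit is not None:                # unit node is optional
            un = f"UN:{prop}"
            g.add_node(un, feature=unit)
            g.add_edge(vn, un)
        value_nodes.append(vn)
    g.add_edges_from(combinations(value_nodes, 2))  # values: fully connected
    return g

# Example row from a hypothetical capacitor table: property -> (value, unit)
graph = build_cooccurrence_graph({
    "Rated Voltage": (50, "V"),
    "Shape": ("radial", None),
})
```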

Fig. 5 shows a co-occurrence graph G for a first part A and a second co-occurrence graph G2 for a second part B. Property nodes PN represent properties of the respective part and value nodes VN represent values of these properties for the respective part. As shown in Fig. 5, the value nodes VN are specific either to the first part A or to the second part B, whereas both co-occurrence graphs have the same property nodes PN. The value nodes VN form a fully connected graph within the respective co-occurrence graph. Each value node VN is connected to its corresponding property node PN, and optionally, to a unit node UN representing a unit of measure.

To simplify the encoding process, all node features are just the character sequences of their values/names. Even in the case of numerical values, a character sequence is formed, e.g., 50 turns into the character sequence "5", "0".
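A minimal sketch of this character-level featurization (the helper name is illustrative, not from the description):

```python
def to_char_sequence(feature) -> list:
    """Turns any node feature (string or number) into its character sequence."""
    return list(str(feature))

assert to_char_sequence(50) == ["5", "0"]    # numeric value
assert to_char_sequence("mV") == ["m", "V"]  # unit of measure
```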

Fig. 6 shows a high-level encoder-decoder model, which is an auto-encoder model AEM that contains an encoder E and a decoder D. Technically, the encoder E consists of a recurrent neural network RNN and a graph attention network GAT.

Given a set of nodes X (containing the value nodes VN, the property nodes PN, and the unit nodes UN) in a graph G (for example the co-occurrence graph G representing the first part A) as input:

Encoder(X, G) = GAT(RNN(X), G)

The internal attention mechanism of the graph attention network GAT learns which other values are of high importance when trying to predict the value of another node. For every link in graph G the attention weight α_ij for that link is given by:

α_ij = a[W_att RNN(x_i), W_att RNN(x_j)]

where a and W_att are trainable parameters.
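The following PyTorch sketch illustrates this composition of a character-level RNN with a single graph attention layer; the GRU choice, the dense adjacency matrix, and all dimensions are assumptions made for brevity, not the implementation disclosed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharRNN(nn.Module):
    """Encodes each node's character sequence into one vector (RNN(X))."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_nodes, max_len) padded character indices
        _, h = self.rnn(self.embed(char_ids))
        return h.squeeze(0)  # (num_nodes, dim)

class GATLayer(nn.Module):
    """One attention layer: alpha_ij = a[W_att h_i, W_att h_j] per link."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_att = nn.Linear(dim, dim, bias=False)  # W_att
        self.a = nn.Linear(2 * dim, 1, bias=False)    # a

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        wh = self.w_att(h)                           # (N, dim)
        n = wh.size(0)
        pairs = torch.cat([wh.unsqueeze(1).expand(n, n, -1),
                           wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))  # raw scores (N, N)
        e = e.masked_fill(adj == 0, float("-inf"))   # keep only graph links
        alpha = torch.softmax(e, dim=-1)             # attention weight per link
        return alpha @ wh

def encode(char_ids, adj, char_rnn, gat):
    """Encoder(X, G) = GAT(RNN(X), G); adj should include self-loops."""
    return gat(char_rnn(char_ids), adj)
```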

The training procedure takes a given set of part specifications (from non-aligned data sources) as input. The auto-encoder model AEM is then trained by iteratively sampling masks (i.e., temporary deletion of certain nodes), and the objective is to reconstruct the masked values as closely as possible. Any node type can be masked: value nodes VN, unit nodes UN, or property nodes PN. To further simulate missing values during training, a certain percentage of links are masked and hidden from the model. While the encoder does not distinguish between strings and numeric values, the decoders do. Two types of decoders can be used, one for numeric values and one for strings.
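A hedged sketch of one such masking step over the networkx graph built earlier; `model.reconstruction_loss` is a hypothetical interface standing in for the full encoder/decoder pass:

```python
import random

MASK = ""  # masked nodes carry an empty feature, i.e. temporary deletion

def training_step(graph, model, optimizer, mask_rate: float = 0.15):
    """One self-supervised step: mask some nodes, reconstruct their values."""
    nodes = list(graph.nodes)
    masked = random.sample(nodes, max(1, int(mask_rate * len(nodes))))
    corrupted = graph.copy()
    for n in masked:
        corrupted.nodes[n]["feature"] = MASK
    targets = {n: graph.nodes[n]["feature"] for n in masked}
    loss = model.reconstruction_loss(corrupted, targets)  # hypothetical API
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```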

Fig. 7 shows the training procedure when masking a numeric value of a first value node VN_i and a link between the first value node VN_i and a third value node VN_z. A linear decoder LD decodes the numeric value for the masked first value node VN_i. The decoded value DV (40.3) is then compared to its ground truth GT (50.0) in order to compute an L1 loss L1L.

Fig. 8 shows the training procedure when masking a string value of the third value node VN_z and a link between the third value node VN_z and a second value node VN_j. Here, a recurrent neural network decoder RNN-D decodes the letters "D" and "C" as a decoded string DS, which is then compared to its ground truth GT (a string formed by the letters "A" and "C") in order to compute a loss L.

The loss objective for a masked node x in graph G can be formulated as follows:

loss(x, G) = α_1 · l_num(x̂, x) if the masked node is numeric, and
loss(x, G) = α_2 · l_str(x̂, x) if the masked node is a string,

where x̂ is the decoder output for the masked node, i.e., either a decoded numeric value or a character sequence, and l_num is the root-mean-squared error (or any other regression loss, for example smooth L1).

Similarly, l_str is a character-wise binary cross-entropy loss. The decoder here also needs to learn a special character [EOS] which represents the "end of sequence". As the two losses possibly lie on different scales, α_1 and α_2 are two hyperparameters that can ensure stable training.
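A sketch of such a per-node loss, assuming string targets are given as one-hot character matrices; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def node_loss(decoded, target, is_numeric: bool, a1: float = 1.0, a2: float = 1.0):
    """Reconstruction loss for one masked node, scaled by alpha_1 / alpha_2."""
    if is_numeric:
        # l_num: root-mean-squared error (smooth L1 would also do)
        return a1 * torch.sqrt(F.mse_loss(decoded, target))
    # l_str: character-wise binary cross-entropy over decoded char logits
    return a2 * F.binary_cross_entropy_with_logits(decoded, target)
```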

After training, the auto-encoder model AEM is capable of two kinds of inference steps: in a first step, term-based entity resolution as shown in Fig. 9, and in a second step, auto-completion as shown in Fig. 10.

Term-based entity resolution discovers how different terms are related to each other. Fig. 9 shows that by simply feeding different terms, character by character, into the recurrent neural network RNN (which is part of the encoder E shown in Fig. 6), it can be automatically detected which terms are semantically similar to each other, due to their high similarity HS exceeding a threshold.
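A hedged sketch of this similarity check; `to_ids` is a hypothetical helper that pads terms to equal-length character-index tensors, and the cosine threshold is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def resolve_terms(terms, char_rnn, to_ids, threshold: float = 0.9):
    """Returns term pairs whose RNN embeddings exceed a similarity threshold."""
    emb = char_rnn(torch.stack([to_ids(t) for t in terms]))
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.T  # pairwise cosine similarities
    return [(terms[i], terms[j], sim[i, j].item())
            for i in range(len(terms))
            for j in range(i + 1, len(terms))
            if sim[i, j] > threshold]

# e.g. resolve_terms(["shape", "form", "voltage"], char_rnn, to_ids)
```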

Fig. 10 shows the second inference step, which is auto-completion. Here, the auto-encoder model AEM shown in Fig. 6 acts as a generative model that can fill in any gaps in a given co-occurrence graph G, here two unknown nodes U. In order to trigger auto-completion, the input is prepared accordingly to include any nodes that should be auto-completed, as will be described in more detail in the next paragraph.

To infer missing values, the unknown nodes U (e.g., a missing unit node as shown in Fig. 10) are attached to the co-occurrence graph G as auxiliary nodes, with the unknown node U for the unit node for example containing the empty string. The co-occurrence graph G is then fed to the auto-encoder model AEM. The decoder D will give the most likely value for the unit node in place of the empty string, here the string "mV". Similarly, the decoder D decodes the missing numeric value 500 for the other unknown node U as shown in Fig. 10.
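A minimal sketch of this auxiliary-node trick on the networkx graph from above; `model.decode` is a hypothetical interface returning a decoded feature per node:

```python
def autocomplete_unit(graph, model, value_node: str) -> str:
    """Infers a missing unit by attaching an empty auxiliary node and decoding."""
    g = graph.copy()
    aux = f"UN:{value_node}"
    g.add_node(aux, feature="")   # auxiliary node holds the empty string
    g.add_edge(value_node, aux)   # link the value to its missing unit
    decoded = model.decode(g)     # hypothetical API: node -> decoded feature
    return decoded[aux]           # e.g. "mV"
```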

As an alternative or in addition to the example shown in Fig. 10, the auto-encoder model AEM can also infer data errors: For every node in the co-occurrence graph G, the original (potentially erroneous) input is fed to the auto-encoder model AEM and the decoded output is compared to its original value. In case of deviations (or if a user-defined threshold is exceeded), the decoded value can directly replace the original value.
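A sketch of this correction loop under the same hypothetical `model.decode` interface; the numeric threshold is a user-defined parameter:

```python
def correct_graph(graph, model, num_threshold: float = 0.0) -> dict:
    """Replaces node values whose decoded reconstruction deviates too much."""
    decoded = model.decode(graph)  # hypothetical API: node -> decoded feature
    corrected = {}
    for node, data in graph.nodes(data=True):
        original, new = data["feature"], decoded[node]
        try:
            deviates = abs(float(new) - float(original)) > num_threshold
        except ValueError:          # non-numeric: any character difference counts
            deviates = new != original
        corrected[node] = new if deviates else original
    return corrected
```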

As an alternative or in addition to the examples described above, the auto-encoder model AEM can also infer incomplete values. The procedure is the same as in the previous paragraph, with the additional constraint that the decoder output only starts after the incomplete original value.
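A sketch of this prefix-constrained completion; `model.decode_with_prefix` is a hypothetical decoding entry point that forces the output to begin with the partial input:

```python
def complete_value(graph, model, node: str, prefix: str) -> str:
    """Completes an incomplete string so decoding continues after the prefix."""
    completed = model.decode_with_prefix(graph, node, prefix)  # hypothetical
    assert completed.startswith(prefix)
    return completed

# e.g. complete_value(graph, model, "PN:Dielectric", "Diel")
```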

By automatically translating datasets, for example tables, stored in a database into co-occurrence graphs and processing them with the auto-encoder model AEM, some or all of the datasets stored in the database can be corrected and/or auto-completed in a fully automated fashion.

All of the above operations, including operations that are not performed by the auto-encoder model itself, for example generating the co-occurrence graph G from a dataset in the database, comparing decoded values to original values, and correcting and/or completing the dataset, are easily implemented in software that is executable by one or more processors and provides a fully automated solution for correcting and/or completing the database.

However, it is also possible to involve a user in the process of correcting and/or completing the database with a semi-automated approach, as will be described in the following.

Fig. 11 shows user interaction between the auto-encoder model AEM acting as an auto-completion system ACS, and a user U. Since the auto-completion system ACS operates on character-level input, the user U can interact with it by typing, e.g., giving the first letter of the property name.

In the example shown in Fig. 11, a first view V1 on a user interface contains auto-complete suggestions that the auto-completion system ACS has generated. In a first user interaction UI1, the user U enters the letter "k". The user interface then outputs a second view V2 with adjusted suggestions. In a second user interaction UI2, the user U accepts the auto-complete suggestions, which leads to the user interface outputting a third view V3 with the accepted values.

Similarly, when the user U wants to enter a new property for a given part, the auto-completion system ACS helps to auto-complete, as shown in Fig. 12. Here, in a third user interaction UI3, the user U has typed "Diel" to name a newly created property and gets auto-complete suggestions that are output by the user interface in a fourth view V4.

Fig. 13 shows a flowchart of a possible exemplary embodiment of a method for automated correction and/or completion of a database.

According to the embodiment, the following operations are performed by components, wherein the components are software components executed by one or more processors and/or hardware components.

In a first operation OP1, a database provides a set of datasets representing a set of physical parts, wherein each dataset represents one of the parts and identifies at least one property of the part as well as its value.

In a second operation OP2, which is optional, a user interface receives one or more characters or digits. These are stored as an incomplete string or as an incomplete number in a first dataset. The first dataset is identical with the first dataset that will be described with regard to the next operation.

In a third operation OP3, an auto-encoder model containing an encoder and a decoder receives a co-occurrence graph representing a first dataset from the set of datasets, with the first dataset representing a first part, wherein the co-occurrence graph contains a set of nodes, including property nodes representing the properties of the first part and value nodes representing values of the properties of the first part, wherein the value nodes form a fully connected graph within the co-occurrence graph, and wherein each value node is connected to its corresponding property node, and optionally, to a unit node representing a unit of measure.

In a fourth operation OP4, the auto-encoder model performs entity resolution and/or auto-completion on the co-occurrence graph in order to compute a corrected and/or completed first dataset.

In a fifth operation OP5, which is optional, the user interface outputs the corrected and/or completed first dataset and detects a confirming user interaction.

In a sixth operation OP6, the corrected and/or completed first dataset is stored in the database.

For example, the method can be executed by one or more processors. Examples of processors include a microcontroller or a microprocessor, an Application Specific Integrated Circuit (ASIC), or a neuromorphic microchip, in particular a neuromorphic processor unit. The processor can be part of any kind of computer, including mobile computing devices such as tablet computers, smartphones or laptops, or part of a server in a control room or cloud.

The above-described method may be implemented via a computer program product including one or more computer-readable storage media having stored thereon instructions executable by one or more processors of a computing system. Execution of the instructions causes the computing system to perform operations corresponding with the acts of the method described above.

The instructions for implementing processes or methods described herein may be provided on non-transitory computer-readable storage media or memories, such as a cache, buffer, RAM, FLASH, removable media, hard drive, or other computer readable storage media. Computer readable storage media include various types of volatile and non-volatile storage media. The functions, acts, or tasks illustrated in the figures or described herein may be executed in response to one or more sets of instructions stored in or on computer readable storage media. The functions, acts or tasks may be independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like.

The invention has been described in detail with reference to embodiments thereof and examples. Variations and modifications may, however, be effected within the spirit and scope of the invention covered by the claims. The phrase "at least one of A, B and C" as an alternative expression may provide that one or more of A, B and C may be used.