Title:
VALIDATING DATA ENTRIES
Document Type and Number:
WIPO Patent Application WO/2021/183111
Kind Code:
A1
Abstract:
In some examples, a method for validating an entry in a set of data, the entry structured according to a first format, comprises using a mapping to convert the entry from the first format to a second format to obtain a derived entry that is semantically equivalent to the entry, comparing a representation of the derived entry to a corresponding representation of the entry in a ledger associated with the set of data to validate the provenance of the entry structured according to the first format.

Inventors:
TAYLOR ROBERT (US)
BALINKSY HELEN (GB)
Application Number:
PCT/US2020/021852
Publication Date:
September 16, 2021
Filing Date:
March 10, 2020
Assignee:
HEWLETT PACKARD DEVELOPMENT CO (US)
International Classes:
G06F21/64
Foreign References:
US20200007344A1 2020-01-02
US20170243286A1 2017-08-24
US20170214675A1 2017-07-27
Attorney, Agent or Firm:
WOODWORTH, Jeffrey C. et al. (US)
Claims:
CLAIMS

1. A method for validating an entry in a set of data, the entry structured according to a first format, the method comprising: using a mapping to convert the entry from the first format to a second format to obtain a derived entry that is semantically equivalent to the entry; comparing a representation of the derived entry to a corresponding representation of the entry in a ledger associated with the set of data to validate the provenance of the entry structured according to the first format.

2. The method as claimed in claim 1, further comprising: storing a representation of the mapping in a ledger.

3. The method as claimed in claim 1, wherein the ledger is a distributed ledger.

4. The method as claimed in claim 1, wherein the representation of the entry and/or of the derived entry is a digital signature or hash.

5. The method as claimed in claim 1, wherein the mapping comprises an executable script.

6. The method as claimed in claim 1, wherein a framework for the set of data is defined according to a schema comprising multiple data fields, the method further comprising: generating a mapping to modify a structure of at least one data field provided in the first format to a structure arranged on the basis of the second format.

7. The method as claimed in claim 1, further comprising: recording an inverse of the mapping in a ledger.

8. A machine-readable storage medium encoded with instructions for validating an entry in a set of data, the instructions executable by a processor of an apparatus to cause the apparatus to: modify the structure of an entry in the set of data to provide a derived entry that is semantically equivalent to the entry; generate a representation of the derived entry; compare the representation of the derived entry to a corresponding representation of the entry in a ledger associated with the set of data; and on the basis of the comparison, validate the provenance of the entry.

9. The machine-readable storage medium as claimed in claim 8, further comprising instructions to cause the apparatus to: generate an inverse of a mapping used to modify the structure of the entry in the set of data.

10. The machine-readable storage medium as claimed in claim 9, further comprising instructions to cause the apparatus to: store the mapping and/or the inverse thereof in the ledger.

11. The machine-readable storage medium as claimed in claim 8, further comprising instructions to cause the apparatus to: generate a hash or digital signature of the entry; generate a hash or digital signature of the derived entry; and store the hash or digital signature of the entry and the hash or digital signature of the derived entry in the ledger.

12. The machine-readable storage medium as claimed in claim 8, further comprising instructions to cause the apparatus to: execute a script representing the mapping.

13. The machine-readable storage medium as claimed in claim 8, further comprising instructions to cause the apparatus to: compare the entry and the derived entry to determine a modification of the derived entry relative to the entry.

14. The machine-readable storage medium as claimed in claim 13, further comprising instructions to cause the apparatus to: verify whether the modification is a permitted modification.

15. The machine-readable storage medium as claimed in claim 13, further comprising instructions to cause the apparatus to: determine whether the modification comprises: formatting, and/or rearranging, and/or masking.

Description:
VALIDATING DATA ENTRIES

BACKGROUND

[0001] A set of data, such as in the form of a database for example, can contain multiple related data entries that can be logically linked to form records. The data can be organised according to a schema used to define data attributes and/or structures. Over time, such attributes and data structures may evolve as a result of, for example, data migration, efficiency improvements, schema extensions and so on. Although migrated records may remain semantically equivalent to previous versions, evolution of attributes and data structures can result in a change in the framework of certain records.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] Various features of certain examples will be apparent from the detailed description which follows, taken in conjunction with the accompanying drawings, which together illustrate, by way of example only, a number of features, and wherein:

[0003] Figure 1 is a flowchart of a method for validating an entry in a set of data according to an example;

[0004] Figure 2 is a flowchart of a validation process for migrated data according to an example;

[0005] Figure 3 is a schematic representation of a process according to an example; and

[0006] Figure 4 is a schematic representation of a processor associated with a memory of an apparatus according to an example.

DETAILED DESCRIPTION

[0007] In the following description, for purposes of explanation, numerous specific details of certain examples are set forth. Reference in the specification to "an example" or similar language means that a particular feature, structure, or characteristic described in connection with the example is included in at least that one example, but not necessarily in other examples.

[0008] A set of data can be provided in the form of a database system, in which data items forming value entries in the database can be organised into records or tuples. A set of attributes can be used to describe the values, and form a schema for the set of data.

[0009] Such database systems, which may be provided as part of complex systems, are rarely static. That is, they can be modified in order to improve and extend functionality, improve reliability, facilitate integration with other systems, reduce cost, and so on. Database “migrations”, in which the existing data in the database is changed in form to facilitate these improvements and extensions, are commonplace. For example, an enterprise may move data from one platform to another for cost and/or functionality reasons. As part of this process, the database schema, which provides a structural framework for the database, may need to be modified, changed or converted. For example, one or more attributes that go to make up the schema may need to be modified by, for example, splitting, combining, removing, adding and so on. Such modification has an effect on the underlying data items that form the values for the database. Modification can be performed by a script that modifies entries so that the structure of the data works with the new database as defined according to its schema (or schemas).

[0010] Data, such as entries or records in a database, can be validated using a digital ledger, such as a blockchain for example. Note that the terms ledger and blockchain are used interchangeably herein. Blockchain technology is deployed to assert the integrity and authenticity of records stored in a database, ensuring that they are not accidentally damaged or intentionally modified/erased. Blockchain records, protecting authenticity of the corresponding database records, can be created simultaneously with the original database records. The immutable blockchain records provide assurance that data in a database cannot be modified without being detected. For example, data in a database (e.g. records or entries) can be hashed and stored in a blockchain in order to enable the record to be validated at a later date. This provides assurance between stakeholders that the data in question can be trusted and was not in any way tampered with or modified. Blockchain technology can prove this trust by validating that the data in the database still creates the same hash as is stored in the blockchain: if the data in the database is modified, the hash computed from that data will not match the equivalent hash in the blockchain, indicating that an improper change has been made.
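The hash check described in [0010] can be sketched as follows. This is a minimal illustration only: the dictionary `ledger` is a hypothetical in-memory stand-in for a blockchain, and the key `"jones/street"`, the helper names and the tampered value are assumptions introduced for the example.

```python
import hashlib

# Hypothetical stand-in for a ledger: maps an entry key to its recorded hash.
# A real deployment would write to an append-only blockchain instead.
ledger = {}


def entry_hash(value: str) -> str:
    """Return the SHA-256 hex digest of an entry's text value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def record_entry(key: str, value: str) -> None:
    """Record the hash of an entry at the time the entry is created."""
    ledger[key] = entry_hash(value)


def validate_entry(key: str, current_value: str) -> bool:
    """Recompute the hash of the current value and compare it to the ledger."""
    return ledger.get(key) == entry_hash(current_value)


record_entry("jones/street", "Pudding Lane")
unchanged_ok = validate_entry("jones/street", "Pudding Lane")  # value untouched
tampered_ok = validate_entry("jones/street", "Pudding Road")   # value altered
print(unchanged_ok, tampered_ok)
```

Any change to the stored value, whether malicious or a harmless reformatting, changes the recomputed hash, which is the difficulty addressed in the paragraphs that follow.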

[0011] Since a digital ledger being used to validate data in a database is a mechanism that presupposes that existing data in the database does not change, this collides with the practice of database management, in which evolution (e.g. in the form of a migration) relies on the database being changeable. That is, when a database is migrated for example, the hash of some data in the database will no longer match the information stored in the blockchain. Thus, even though the semantic content of the database remains unchanged, validation will fail.

[0012] Where authenticity of data after a database update can no longer be validated, both versions of the database, old and new, can be kept. Accordingly, data from the old database version can be used for historic authentication and provenance, whilst new operations move to the new database. However, every time a database is updated, the old version is retained for data validation, thereby resulting in data duplication (stored in both the old and new format). Furthermore, keeping older versions does not help to ensure that the new data and the corresponding old data are the same. For instance, they might be the same data, but differently formatted.

[0013] Although blockchain technology, for example, provides a way to ensure the integrity of information stored in a database, when a database is migrated or refactored, the historical data may be preserved together with the new (e.g. migrated) data in order to enable the provenance of the migrated data to be validated against corresponding blockchain signatures and hashes. This, however, can double storage used for every migration, increasing system complexity, and increasing costs (e.g. of storage alone). Furthermore, blockchain authentication of old data does not automatically guarantee the integrity of new data.

[0014] According to an example, there is provided a method that enables the immutable and provable provenance of migrated data to be retained without the need to keep or store multiple copies of a database (e.g. pre- and post-migration versions). A method according to an example provides a mechanism to preserve the ability to successfully validate the contents of an updated (e.g. post-migration) database using a blockchain entry recorded pre-migration. In an example, this proceeds by generating an inverse mapping of a script that is used to migrate the database so that an individual record or entry (for example) can be rolled-back to its pre-migration form to enable validation of that record/entry at the blockchain.

[0015] According to an example, a migration from an original data format to a new data format can be executed using a first migration script. In an example, a second migration script can be generated that can perform the inverse of the first migration script. That is, the first migration script enables migration of data from an original format to a new format, whereas the second migration script enables data in the new format to be rolled back to the original format. The second script thus enables a copy of the original data to be generated from new migrated data at will. Therefore, a hash generated from data in the original format that is stored in a blockchain can be recreated from data in the new format, thereby enabling migrated data to be validated using a digital ledger that comprises hashes generated from the original data (i.e. an original hash can be created from migrated data that will match the hash in the blockchain). This enables the old data and database to be deleted if so desired, whilst individual records and the entire database can be restored from the new data. In this way, blockchain verification of immutability remains consistent, while the system architecture can evolve and remain consistent.

[0016] In an example, a migration script can be used to perform a mapping that can convert a database entry from a first format to a database entry in a second format that is different to the first format. In an example, the entry in the second format is a derived entry that is semantically equivalent to the entry in the first format. An inverse mapping can be used to convert a derived entry from a second format to a first format. In an example, semantically equivalent can mean that the informational content of an entry in one format (e.g. a second format) is the same as the informational content of the entry in a different format (e.g. a first format). That is, although the format of an entry may be migrated from one form to another using mapping (or its inverse), its informational content can remain the same.

[0017] Figure 1 is a flowchart of a method for validating an entry in a set of data according to an example. In the example of figure 1, the entry is initially structured according to a second format. A hash or digital signature of the entry is stored in a ledger, such as a blockchain for example. The ledger can be replicated (i.e., full copies may be replicated in multiple locations). Such replicated ledgers are often referred to as distributed ledgers or distributed blockchains.

[0018] In block 101 a first script, which is one example of a mapping, is generated for data migration/refactoring from the second format (also termed original or historical format) to a new, first format. In block 103, the first script to migrate data from the second format to the first format is applied, thereby modifying the structure of the entry using the mapping so as to convert the structure of the entry from the second (original) format to a first (new) format. In block 105, the second script is generated. The second script is the inverse (or reverse) of the first script and defines a mapping that can be used to produce historical (original) data in the second format from new data in the first format. Although no order is implied by the disposition of blocks in figure 1, it will be clear that the first script may be applied before or after generation of the second script, and the two operations are not mutually exclusive.

[0019] The second (reverse) script thus enables the entry in the first format, representing migrated data to be restored to the second (original) format. Accordingly, a representation (such as a hash or digital signature for example) of the entry structured according to the second format, which has been derived from an entry in the first format, can be compared to a corresponding representation of the entry in a ledger associated with the set of data to validate the provenance of the entry structured according to the first format.

[0020] In block 107, the second (reverse) script used to restore historical data from new can be validated. For example, a proportion up to 100% of the blockchain hashes for restored historical data can be generated to determine whether they correspond correctly to those stored in the blockchain. In block 109, both first and second scripts (migration and reverse scripts) and/or a hash of the scripts can be recorded to a dedicated storage (which can be a new data storage or an unrelated location, or the blockchain itself). In block 111 blockchain transactions recording hashes of both scripts are created. In block 113, the historical data can be removed or deleted.

[0021] In an example, validation in block 107 may be performed in the following way:

Data O -> Data N -> Data O’

Compare O and O’

(O = old, N = new, O’ = restored version of O from N)

[0022] Thus, a migration script can be applied to data O to create data N in format 1 (new refactored database). A reverse script can be applied to data N (in format 1) to generate data O’, a restored version of the data in format 2. A comparison of data O and O’ can then be performed.
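The round trip O -> N -> O’ and the sampled hash check of block 107 might be sketched as below. The date reformatting used for `migrate` and `reverse`, the row keys, and the `proportion` and `seed` parameters are all hypothetical choices made for illustration; the disclosure does not prescribe them.

```python
import hashlib
import random


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a text value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def migrate(value: str) -> str:
    """Hypothetical migration script: DD/MM/YYYY -> YYYY-MM-DD (O -> N)."""
    day, month, year = value.split("/")
    return f"{year}-{month}-{day}"


def reverse(value: str) -> str:
    """Hypothetical reverse script: YYYY-MM-DD -> DD/MM/YYYY (N -> O')."""
    year, month, day = value.split("-")
    return f"{day}/{month}/{year}"


def validate_reverse_script(original, ledger_hashes, proportion=1.0, seed=0):
    """Check O -> N -> O' for a sampled proportion (up to 100%) of entries.

    Returns True only if, for every sampled entry, the hash of the restored
    value O' equals the hash recorded in the ledger for the original value O.
    """
    keys = sorted(original)
    if not keys:
        return True
    sample_size = min(len(keys), max(1, round(len(keys) * proportion)))
    for key in random.Random(seed).sample(keys, sample_size):
        restored = reverse(migrate(original[key]))
        if sha256_hex(restored) != ledger_hashes[key]:
            return False
    return True


data_o = {"row1": "16/09/2021", "row2": "10/03/2020"}
ledger_hashes = {key: sha256_hex(value) for key, value in data_o.items()}
round_trip_ok = validate_reverse_script(data_o, ledger_hashes, proportion=1.0)
```

Checking hashes rather than raw values means the same comparison still works after the historical data has been deleted, since only the ledger hashes need to survive.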

[0023] Figure 2 is a flowchart of a validation process for migrated data according to an example. In block 201, a second (reverse) script is provided or retrieved from a storage location. The authenticity of the second script can be validated against the version stored in the blockchain. Once confirmed as authentic, the second script, as described above, is used to restore data to an original (second) format from a new or migrated (first) format. In block 203 the second script is executed on data in a migrated (first) format to restore historical data. In block 205, a hash of the restored historical data is generated. In block 207 the hash generated in block 205 is compared to a hash of the corresponding historical data from the ledger. If the hashes/signatures match then data provenance can be trusted, and the restored data matches the historical data.
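The four blocks of Figure 2 could be sketched as follows. For illustration the reverse script is assumed to be a small declarative JSON document interpreted by `run_reverse_script`; that representation, the field names `first_name` and `last_name`, and the original value "Jones, Archibold" are hypothetical and not taken from the disclosure.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def run_reverse_script(script_text: str, migrated_row: dict) -> str:
    """Interpret a hypothetical declarative reverse script that joins fields."""
    spec = json.loads(script_text)
    return spec["separator"].join(migrated_row[name] for name in spec["join_fields"])


def validate_migrated_entry(script_text, script_hash_in_ledger,
                            migrated_row, entry_hash_in_ledger) -> bool:
    # Block 201: confirm the retrieved script matches the ledger record.
    if sha256_hex(script_text.encode("utf-8")) != script_hash_in_ledger:
        raise ValueError("reverse script does not match the ledger record")
    # Block 203: execute the reverse script to restore the historical value.
    restored = run_reverse_script(script_text, migrated_row)
    # Block 205: hash the restored historical value.
    restored_hash = sha256_hex(restored.encode("utf-8"))
    # Block 207: compare with the hash recorded before migration.
    return restored_hash == entry_hash_in_ledger


REVERSE_SCRIPT = '{"join_fields": ["last_name", "first_name"], "separator": ", "}'
ledger = {
    "reverse_script": sha256_hex(REVERSE_SCRIPT.encode("utf-8")),
    "entry": sha256_hex("Jones, Archibold".encode("utf-8")),
}
migrated_row = {"first_name": "Archibold", "last_name": "Jones"}
provenance_ok = validate_migrated_entry(
    REVERSE_SCRIPT, ledger["reverse_script"], migrated_row, ledger["entry"]
)
```

Verifying the script before running it matters because a modified reverse script could otherwise be crafted to reproduce an expected hash from altered data.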

[0024] An example of a set of data and corresponding scripts is as follows. The table below represents a simple relational database in which several records, each comprising multiple data entries, are provided based on a schema comprising five attributes:

[0025] Suppose that the entry for the street for Archibold Jones (“Pudding Lane”) is hashed, with the resultant value recorded into a blockchain in order to enable validation of this entry at a later date. For example, the MD5 hash of this entry is: c305eea643d171fd769bdc572e4fcac3

[0026] Now consider a script used to migrate the data of the table above to a different format in which the house number and street attributes are merged to provide the following:

[0027] The script could, for example, take the form:

“Street” = concatenate(“House number”, “Street”); delete column(“House number”)

which simply concatenates the values from the house number and street attributes to new values under the street attribute, and removes the house number attribute.

[0028] Semantically, the content of the database is the same. However, the migration has caused a change in the attributes of the original database such that the (MD5) hash of the street value for Archibold Jones is now:

5e80b4f728830e59a8d4ad146ef0a792

[0029] This is obviously different to that generated pre-migration, and so any corresponding validation will fail. In an example, an inverse script may take the form:

split value of column “Street” into a = “house number” and b = “street name”
create column(“House number”)
replace existing value in column “Street” with b
record value a into column(“House number”)

[0030] This script adds a house number attribute and populates a value with the numeric part of an existing street record, and updates a value in the street attribute with the text part of the existing street record. The original (concatenated) street part can be deleted if desired.

[0031] Accordingly, the inverse script separates a street value into two parts - house number and street, which is the same format as was provided in the original data. Thus, a hash of the street will now return the correct value since the numerical part (which caused the change in hash value) has been stripped away. Therefore, a value in the migrated database can be validated.
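The merge and its inverse from paragraphs [0026] to [0031] can be sketched as below. The house number "10", the single-space separator, and the assumption that a house number contains no spaces are hypothetical; the tables of the original example are not reproduced here, so the hashes computed by this sketch are not claimed to equal the values quoted above.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a text value (MD5 mirrors the example)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def migrate(row: dict) -> dict:
    """Merge "House number" into "Street" and drop the "House number" column."""
    merged = dict(row)
    number = merged.pop("House number")
    merged["Street"] = f"{number} {merged['Street']}"
    return merged


def inverse(row: dict) -> dict:
    """Split "Street" back into "House number" and "Street".

    Assumes the house number is the first space-delimited token.
    """
    restored = dict(row)
    number, street = restored["Street"].split(" ", 1)
    restored["House number"] = number
    restored["Street"] = street
    return restored


# Hypothetical record; only the name and street appear in the description.
original = {"First name": "Archibold", "Last name": "Jones",
            "House number": "10", "Street": "Pudding Lane"}
ledger_hash = md5_hex(original["Street"])  # recorded before migration

migrated = migrate(original)
direct_match = md5_hex(migrated["Street"]) == ledger_hash        # hashes differ
restored = inverse(migrated)
rolled_back_match = md5_hex(restored["Street"]) == ledger_hash   # hashes agree
```

The inverse is only exact if the migration is lossless: a house number such as "10 A" would break the first-token assumption, so a practical script would need an unambiguous separator or a stored split position.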

[0032] Consider another example:

[0033] Here, the Address attribute has values with a postcode embedded. A migration script can split Address into (street) Address and Postcode, and update existing rows. Here, extraction of the postcode is relatively easy since it is separated from the address by a semicolon character providing the following:

[0034] In the original format, the MD5 hash of the Address value for Archibold was: d30f7e322db09da5b8b067d6454ee114, whereas in the migrated format, the MD5 hash of the Address value for Archibold is: 37381443f03b305510e8c7e16eb6d44a.

[0035] Thus, a validation on the migrated data of this address value by comparing hash values will fail. An inverse script can thus be used to append the postcode values to the corresponding address values, resulting in recovery of the original values from which hashes can be generated, and thus validation can be performed. Although the above examples refer to MD5 hashes, any one of the multitude of other hash functions can be used as desired. For example, SHA256 may be used.
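The postcode example of [0033] to [0035] might look as follows with SHA256 in place of MD5. The address value, the postcode, and the exact separator `"; "` (semicolon followed by one space) are hypothetical; the description states only that a semicolon separates the two parts.

```python
import hashlib

SEPARATOR = "; "  # assumed exact separator between address and postcode


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a text value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_address(row: dict) -> dict:
    """Migration: move the trailing postcode into its own "Postcode" field."""
    street, found, postcode = row["Address"].rpartition(SEPARATOR)
    if not found:
        raise ValueError("no postcode separator found in Address")
    migrated = dict(row)
    migrated["Address"] = street
    migrated["Postcode"] = postcode
    return migrated


def join_address(row: dict) -> dict:
    """Inverse: append the postcode back onto the address value."""
    restored = dict(row)
    postcode = restored.pop("Postcode")
    restored["Address"] = restored["Address"] + SEPARATOR + postcode
    return restored


original = {"Name": "Archibold", "Address": "Pudding Lane; EC3R 8AB"}
ledger_hash = sha256_hex(original["Address"])  # recorded before migration

migrated = split_address(original)
fails_directly = sha256_hex(migrated["Address"]) != ledger_hash
restored = join_address(migrated)
validates_after_inverse = sha256_hex(restored["Address"]) == ledger_hash
```

Because a hash is sensitive to every character, the inverse has to reproduce the original separator and whitespace exactly; a migration that trims or normalises spacing would need to record what it removed.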

[0036] Therefore, according to an example, a method for validating a data entry of a set of data, in which a framework for the set of data is defined according to a schema comprising multiple data fields, can comprise generating a mapping to modify a structure of at least one data field provided in the second format to a structure arranged on the basis of the first format. The mapping, which can be in the form of a script as described above for example, can therefore be used to recover a data entry from a migrated (second) format to an original (first) format in order to enable data validation.

[0037] Figure 3 is a schematic representation of a process according to an example. A first script 301 is applied (303) to original data 305. The first script 301 is a migration script that modifies the original data 305 to generate (309) migrated data 307, e.g. as described in one of the examples above. The first script (or a hash thereof) is recorded in 308 in digital ledger 311. The ledger 311 can be stored in a storage apparatus 313. Storage apparatus 313 may be the same as storage apparatus 315 in which the migrated and/or original data is stored (which themselves may be stored in different locations from one another), or may be a different apparatus (e.g. at different location).

[0038] A second script 317 is generated, which is the reverse of script 301. That is, application of script 317 to migrated data 307 results in data D’, a restored version of the original data D (305). Validation can be performed to ensure that D and D’ match. The second script (or a hash thereof) is recorded in 310 in digital ledger 311.

[0039] An entry 318 in the original data 305 can be hashed using a hash function 320 to form a hash 321 of the entry 318. This hash (321) can be recorded in the digital ledger 311. In an example, for an entry 318 from the original data 305 (e.g. at the time when the entry is created), a blockchain transaction (record) can be created that comprises a hash or digital signature of the entry 318. The hash can be a cryptographic hash, e.g. SHA256, SHA512, and so on, and a digital signature can comprise a signature from a DSA, RSA, or an elliptic curve signature for example.

[0040] In order to perform validation of an entry 322 in a set of migrated data 307, the structure of the entry 322 in the migrated data 307 (which corresponds to entry 318, e.g. a value for address for a particular person as noted in the examples above) can be modified (319) using a mapping in the form of the second script 317 to convert the structure of the entry 322 from a second format (migrated data 307) to a first format (original data 305) to form a modified or derived entry 324.

[0041] A hash 323 of the modified/derived entry 324 can be generated using hash function 320. The resultant hash 323 can be compared 325 to the corresponding hash 321 of the entry in the distributed ledger 311 to validate the provenance of the entry 322 structured according to the first format 307. Therefore, in order to validate that the modified/derived entry 324 corresponds to entry 318, the entry 322 is rolled back to its pre-migration format, and a hash of the rolled back version is compared to the hash of the original recorded in the digital ledger. If the hash values match, the provenance of the entry 322 is validated. Otherwise, a difference in the hash values indicates that the entry 322 has been, e.g., tampered with.

[0042] Examples in the present disclosure can be provided as methods, systems or machine-readable instructions. Such machine-readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program codes therein or thereon.

[0043] The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow diagrams described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. In some examples, some blocks of the flow diagrams may not be necessary and/or additional blocks may be added. It shall be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or diagrams in the flow charts and/or block diagrams can be realized by machine readable instructions.

[0044] The machine-readable instructions may, for example, be executed by a general-purpose computer, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams. In particular, a processor or processing apparatus may execute the machine-readable instructions. Thus, modules of apparatus may be implemented by a processor executing machine readable instructions stored in a memory, or a processor operating in accordance with instructions embedded in logic circuitry. The term 'processor' is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate set etc. The methods and modules may all be performed by a single processor or divided amongst several processors.

[0045] Such machine-readable instructions may also be stored in a computer readable storage that can guide the computer or other programmable data processing devices to operate in a specific mode.

[0046] For example, the instructions may be provided on a non-transitory computer readable storage medium encoded with instructions, executable by a processor.

[0047] Figure 4 is a schematic representation of a processor 450 associated with a memory 452 of an apparatus 401 according to an example. The memory 452 comprises computer readable instructions 454. The instructions 454 can comprise instructions which are executable by the processor 450 to validate an entry 322 in a set of data 307 by causing the apparatus to modify the structure of the entry 322 in the set of data 307 to provide a modified entry 324, generate a representation 323 of the modified entry 324, compare (325) the representation 323 of the modified entry 324 to a corresponding representation 321 of the entry in a distributed ledger associated with the set of data, and on the basis of the comparison, validate the provenance of the entry.

[0048] Such machine-readable instructions may also be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing, thus the instructions executed on the computer or other programmable devices provide an operation for realizing functions specified by flow(s) in the flow charts and/or block(s) in the block diagrams.

[0049] Further, the teachings herein may be implemented in the form of a computer software product, the computer software product being stored in a storage medium and comprising a plurality of instructions for making a computer device implement the methods recited in the examples of the present disclosure.

[0050] While the method, apparatus and related aspects have been described with reference to certain examples, various modifications, changes, omissions, and substitutions can be made without departing from the present disclosure. In particular, a feature or block from one example may be combined with or substituted by a feature/block of another example.

[0051] The word "comprising" does not exclude the presence of elements other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims.

[0052] The features of any dependent claim may be combined with the feature of any of the independent claims or other dependent claims.