Title:
DATA STORAGE ARRANGEMENT AND METHOD FOR ANONYMIZATION AWARE DEDUPLICATION
Document Type and Number:
WIPO Patent Application WO/2022/069042
Kind Code:
A1
Abstract:
A data storage arrangement includes a memory and a controller. The memory is configured to store one or more data elements. The controller is configured to store at least one of the one or more data elements utilizing differential compression. The controller is further configured to receive a data element to be stored, generate a copy of the data element to be stored, and mask data to be anonymized in the copy of the data element to be stored by exchanging one or more portions to be anonymized with predefined content. The controller is further configured to generate corresponding hashes for one or more portions of the copy of the data element with masked data for finding one or more reference portions, and to compress the data element to be stored utilizing differential compression with reference to the one or more reference portions. The controller hints the storage on areas of the data element which may be modified during data anonymization, which enables anonymization aware deduplication.

Inventors:
NATANZON ASSAF (DE)
AKIRAV SHAY (DE)
Application Number:
PCT/EP2020/077446
Publication Date:
April 07, 2022
Filing Date:
October 01, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
NATANZON ASSAF (DE)
International Classes:
G06F21/62; G06F16/174; H03M7/30
Foreign References:
US20190286839A1 2019-09-19
CN111737742A 2020-10-02
Other References:
ANONYMOUS: "Data Anonymization Techniques and Best Practices: A Quick Guide", 29 July 2020 (2020-07-29), XP055811177, Retrieved from the Internet [retrieved on 20210607]
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A data storage arrangement (100A, 100B) comprising a memory (102) and a controller (104), the memory (102) being configured to store one or more data elements (106), and the controller (104) being configured to store at least one of the one or more data elements (106) utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement (100A, 100B) is characterized in that the controller (104) is further configured to: receive a data element to be stored; generate a copy of the data element to be stored; mask data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generate corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compress the data element to be stored utilizing differential compression with reference to the one or more reference portions.

2. The data storage arrangement (100A, 100B) according to claim 1, wherein the predefined content is NIL.

3. The data storage arrangement (100A, 100B) according to claim 1, wherein the predefined content is an indicator of a size of the exchanged data portion.

4. The data storage arrangement (100A, 100B) according to claim 1, wherein the predefined content is the same for the one or more portions to be anonymized.

5. The data storage arrangement (100A, 100B) according to claim 1, wherein the predefined content is the same for a type of data to be anonymized.

6. The data storage arrangement (100A, 100B) according to any previous claim, wherein the controller (104) is further configured to mask the data to be anonymized when generating the copy of the data element to be stored through a write-with-mask command.

7. The data storage arrangement (100A, 100B) according to claim 6, wherein the write-with-mask command is associated with a first data type and a second data type and wherein the write-with-mask command is arranged to utilize a first mask when generating the copy of the data element when the data element is of the first data type and to utilize a second mask when generating the copy of the data element when the data element is of the second data type.

8. The data storage arrangement (100A, 100B) according to any previous claim, wherein the mask indicates location and size of data to be anonymized.

9. The data storage arrangement (100A, 100B) according to any previous claim, wherein the controller (104) is further configured to store the one or more data elements (106) utilizing block storage.

10. The data storage arrangement (100A, 100B) according to any previous claim, wherein the controller (104) is further configured to delete the copy of the masked data element.

11. The data storage arrangement (100A, 100B) according to any previous claim, wherein the controller (104) is further configured to receive the data element to be stored as part of a plurality of data elements comprised in a database, wherein the data elements of the database are arranged in a table; mask the data to be anonymized by exchanging one or more portions to be anonymized with predefined content based on the position of the data element in the table; and to store an indication of the location of the data anonymized.

12. A method for a data storage arrangement (100A, 100B) comprising a memory (102) being configured to store one or more data elements (106), the method comprising storing at least one of the one or more data elements (106) utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the method is characterized in that the method further comprises: receiving a data element to be stored; generating a copy of the data element to be stored; masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

13. A computer-readable medium carrying computer instructions that, when loaded into and executed by a controller (104) of a data storage arrangement (100A, 100B), enable the data storage arrangement (100A, 100B) to implement the method according to claim 12.

14. A data storage arrangement (100A, 100B) comprising a memory (102) being configured to store one or more data elements (106), and the data storage arrangement (100A, 100B) further comprising a compression software module (110A) for compressing at least one of the one or more data elements (106) utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement (100A, 100B) is characterized in that the data storage arrangement (100A, 100B) further comprises: a data element receiving software module (110B) for receiving a data element to be stored; a copy software module (110C) for generating a copy of the data element to be stored; a masking software module (110D) for masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; a hash generating software module (110E) for generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and the compression software module (110A) for compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

15. A computer-readable medium carrying a data storage comprising a plurality of data elements, wherein one or more of the plurality of data elements are stored utilizing the method of claim 12.

Description:
DATA STORAGE ARRANGEMENT AND METHOD FOR ANONYMIZATION AWARE DEDUPLICATION

TECHNICAL FIELD

The present disclosure relates generally to the field of data protection and backup; and more specifically, to data storage arrangements and methods for anonymization aware deduplication.

BACKGROUND

Typically, data backup is used to protect and recover data in the event of data loss in a primary storage system. Examples of the event of data loss may include, but are not limited to, data corruption, hardware or software failure in the primary storage system, accidental deletion of data, hacking, or malicious attack. Thus, for safety reasons, a separate backup system or a secondary storage (for example a data storage arrangement) is extensively used to store a backup of the data present in the primary storage system.

Currently, data anonymization is used to protect the privacy of confidential or private information such as credit card numbers and social security numbers. In data anonymization, the confidential or personal data, such as personally identifiable information, is often obfuscated so that the people who are associated with the data remain anonymous. However, it is observed that data anonymization has an adverse impact on the effectiveness of deduplication, as the anonymized data is not deduplicated if the original data is used as a reference. The term deduplication refers generally to eliminating duplicate or redundant information. As data anonymization changes the data, the data is no longer the same as the original data, which increases the backup and deduplication effort. Further, existing deduplication methods, such as variable length deduplication, work well when large identical data chunks appear in a data stream that is to be backed up. However, such existing deduplication methods do not work well if there are frequent changes in the data or the changes are very small. For example, even a change of a single character in a chunk may render the data chunk a new chunk, for which a conventional deduplication method will not find any identical data chunk, thereby reducing the effectiveness of deduplication. Moreover, in some cases, it may be difficult to identify the sensitive data, and in cases where data is obfuscated it may be almost impossible to identify the data from a storage layer in a conventional data storage arrangement. As a result, over time more storage space of the secondary storages becomes occupied, as duplicate data may be stored which occupies a large storage space in the conventional secondary storages. This is undesirable as it causes a reduction in performance of the secondary storages. Moreover, the cost of data storage, with all the associated costs including the cost of storage hardware, continues to be a burden.

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with conventional data storage systems and methods of deduplication.

SUMMARY

The present disclosure seeks to provide a data storage arrangement and a method for anonymization aware deduplication. The present disclosure seeks to provide a solution to the existing problem of inefficient deduplication associated with data anonymization. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in the prior art, and to provide an improved data storage arrangement and method that takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication), which enables an efficient deduplication even in the presence of anonymized data.

The object of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.

In one aspect, the present disclosure provides a data storage arrangement comprising a memory and a controller, the memory being configured to store one or more data elements, and the controller being configured to store at least one of the one or more data elements utilizing differential compression, wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement is characterized in that the controller is further configured to: receive a data element to be stored; generate a copy of the data element to be stored; mask data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generate corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compress the data element to be stored utilizing differential compression with reference to the one or more reference portions.

The data storage arrangement of the present disclosure takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication), which enables an efficient deduplication even in the presence of anonymized data. The present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization. Thus, the one or more portions of the data element to be anonymized are masked and, if the corresponding hash is already present, the data element is further compressed and stored utilizing differential compression to enable an effective deduplication of the data elements that are received for storing. Moreover, the present disclosure efficiently utilizes the storage space because, in the anonymization aware deduplication, duplicate data is not stored, thereby improving system performance in comparison to conventional storages where a large amount of storage space is occupied by duplicate data, which further affects system performance of conventional storages.
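The sequence described above (receive, copy, mask, hash, look up a reference, compress) may be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `mask_copy`, `store_element`, `restore_element` and `reference_index` are hypothetical, and the use of SHA-256 as the hash, zero bytes as the predefined content, and zlib's preset-dictionary mode as a stand-in for differential compression are all assumptions made for illustration only.

```python
import hashlib
import zlib

# Hypothetical in-memory index: hash of masked copy -> previously stored element.
reference_index = {}


def mask_copy(element: bytes, regions) -> bytes:
    """Return a copy with each (offset, length) region overwritten by zero bytes."""
    copy = bytearray(element)              # the received element itself is not changed
    for offset, length in regions:
        copy[offset:offset + length] = b"\x00" * length
    return bytes(copy)


def store_element(element: bytes, regions):
    """Hash the masked copy, then compress the ORIGINAL element against any match."""
    digest = hashlib.sha256(mask_copy(element, regions)).hexdigest()
    reference = reference_index.get(digest)
    if reference is None:
        reference_index[digest] = element  # first occurrence becomes the reference
        return digest, False, zlib.compress(element)
    # Assumption: a zlib preset dictionary stands in for differential compression.
    compressor = zlib.compressobj(zdict=reference)
    return digest, True, compressor.compress(element) + compressor.flush()


def restore_element(record) -> bytes:
    """Rebuild the stored element from a (digest, has_reference, payload) record."""
    digest, has_reference, payload = record
    if not has_reference:
        return zlib.decompress(payload)
    decompressor = zlib.decompressobj(zdict=reference_index[digest])
    return decompressor.decompress(payload) + decompressor.flush()


original = b"Dani, 33, 0541111111, 4580800080001999, A street"
anonymized = b"Dani, 33, 0541111111, 0000111122223333, A street"
card_region = (original.index(b"4580"), 16)   # location and size of the card number

first = store_element(original, [card_region])
second = store_element(anonymized, [card_region])
```

In this sketch the two records differ only inside the masked region, so their masked copies hash identically and the second record is compressed with reference to the first, while the data actually stored remains the unmasked element.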

In an implementation form, the predefined content is NIL.

By virtue of the predefined content being NIL, it acts as a hint of the presence of anonymized data in the data element, which then allows effective compression of the data element to be stored utilizing differential compression, which in turn results in effective data deduplication.

In a further implementation form, the predefined content is an indicator of a size of the exchanged data portion.

The indicator of the size of the exchanged data portion further acts as a hint of the presence of anonymized data of a defined length in the data element, which then allows effective compression of the data element to be stored utilizing differential compression, which in turn results in effective data deduplication.

In a further implementation form, the predefined content is the same for the one or more portions to be anonymized.

By virtue of similar predefined content for the one or more portions to be anonymized, the presence of data portion to be anonymized in the data element is easily identified which further results in effective data deduplication.

In a further implementation form, the predefined content is the same for a type of data to be anonymized.

By virtue of similar predefined content for a type of data to be anonymized, the presence of data portion to be anonymized in the data element is easily identified which further results in effective data deduplication. The type of data may correspond to a field of data, such as name, age, phone number, credit card number, address.
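The alternatives for the predefined content listed above may be contrasted in a short sketch. The helper `exchange`, the tokens in `TYPE_TOKENS`, and the textual form of the size indicator are hypothetical. Reading NIL as empty content is likewise an assumption, since the disclosure does not define its encoding.

```python
def exchange(element: bytes, offset: int, length: int, predefined: bytes) -> bytes:
    """Replace element[offset:offset + length] with the predefined content."""
    return element[:offset] + predefined + element[offset + length:]


record = b"Dani, 33, 0541111111, 4580800080001999, A street"
offset, length = record.index(b"4580"), 16

# NIL, read here as empty content (assumption).
nil_masked = exchange(record, offset, length, b"")

# An indicator of the size of the exchanged portion (hypothetical "<16>" form).
size_masked = exchange(record, offset, length, b"<%d>" % length)

# The same predefined content for every portion of a given data type.
TYPE_TOKENS = {"credit_card": b"<CC>", "phone": b"<PHONE>"}  # hypothetical tokens
type_masked = exchange(record, offset, length, TYPE_TOKENS["credit_card"])
```

Whichever variant is chosen, two records that differ only in the exchanged portion yield the same masked copy, which is what allows their hashes to match.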

In a further implementation form, the controller is further configured to mask the data to be anonymized when generating the copy of the data element to be stored through a write-with-mask command.

The write-with-mask command enables efficient masking of the data by providing information such as the type of data, the size of data, and the data structure of the data. The write-with-mask command refers to an IO write command that hints to the data storage arrangement (i.e. to storage) on areas or portions of the data element which may be modified during data anonymization.

In a further implementation form, the write-with-mask command is associated with a first data type and a second data type and wherein the write-with-mask command is arranged to utilize a first mask when generating the copy of the data element when the data element is of the first data type and to utilize a second mask when generating the copy of the data element when the data element is of the second data type.

The first mask and the second mask generated for the first data type and the second data type act as hints and enable accurate identification of the data portions to be anonymized in the data element, which further results in effective data deduplication.

In a further implementation form, the mask indicates location and size of data to be anonymized.

The location and size of the data act as hints and enable accurate identification of the data portions to be anonymized in the data element, which further results in effective data deduplication.
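One way to picture a write-with-mask command whose mask carries location and size, and which selects a different mask per data type, is the following sketch. The disclosure does not specify a command format, so `MaskRegion`, `MASKS_BY_TYPE`, `write_with_mask`, the data type names and the single-byte filler are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MaskRegion:
    offset: int   # location of the data to be anonymized within the element
    length: int   # size of the data to be anonymized


# Hypothetical masks: one for a first data type, one for a second data type.
MASKS_BY_TYPE: Dict[str, Tuple[MaskRegion, ...]] = {
    "customer_record": (MaskRegion(offset=22, length=16),),
    "phone_record": (MaskRegion(offset=0, length=10),),
}


def write_with_mask(data: bytes, data_type: str, filler: bytes = b"\x00") -> bytes:
    """Return the masked copy for `data`; `filler` is assumed to be one byte."""
    copy = bytearray(data)                 # the data to be written stays unchanged
    for region in MASKS_BY_TYPE[data_type]:
        end = region.offset + region.length
        copy[region.offset:end] = filler * region.length
    return bytes(copy)


masked_customer = write_with_mask(
    b"Dani, 33, 0541111111, 4580800080001999, A street", "customer_record"
)
masked_phone = write_with_mask(b"0541111111 home", "phone_record")
```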

In a further implementation form, the controller is further configured to store the one or more data elements utilizing block storage.

The use of block storage for storing the one or more data elements enables efficient data retrieval from the storage and recovery of the one or more data elements, when needed.

In a further implementation form, the controller is further configured to delete the copy of the masked data element.

The copy of the masked data element is deleted to enable the controller to find the similarity hash digest for the unmasked data element, which further enables execution of differential compression. As a result, storage space is used efficiently.

In a further implementation form, the controller is further configured to receive the data element to be stored as part of a plurality of data elements comprised in a database, wherein the data elements of the database are arranged in a table; mask the data to be anonymized by exchanging one or more portions to be anonymized with predefined content based on the position of the data element in the table; and to store an indication of the location of the data anonymized.

In many cases, data (e.g. different data portions of the data element) is kept in tables in a database, where sensitive data is typically stored in one or more specific columns of the table. In contrast to conventional systems, the present disclosure enables anonymization to be executed at the layer of the database (e.g. using database queries, such as SQL queries) and data portions to be masked (e.g. by exchanging data) in the one or more specific columns containing the sensitive data. In other words, the present disclosure enables the knowledge and know-how of how to anonymize the data to be built into a database application, and the changes can be done either inside the database or even externally by using any database query language (e.g. SQL queries). This is advantageous in comparison to identification of data portions that are to be anonymized at the storage layer (i.e. after being stored at the block level storage). Alternatively stated, database hints are beneficial and thus the present disclosure also works even in the presence of anonymized data which is obfuscated (e.g. encrypted). In other words, the masking commands flow through the database and even through the anonymization mechanism that understands what is to be anonymized.
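Column-based masking at the database layer may be sketched with an in-memory SQLite table. The table name, column names, the `SENSITIVE_COLUMNS` list and the `PREDEFINED` value are hypothetical. The query builds a masked copy of each row and leaves the stored row unchanged, and the list `masked_locations` plays the role of the stored indication of where data was anonymized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(name TEXT, age INTEGER, phone TEXT, credit_card TEXT, address TEXT)"
)
conn.execute(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    ("Dani", 33, "0541111111", "4580800080001999", "A street"),
)

COLUMNS = ["name", "age", "phone", "credit_card", "address"]
SENSITIVE_COLUMNS = {"phone", "credit_card"}   # hypothetical: known to the application
PREDEFINED = "MASKED"                          # hypothetical predefined content

# Build a SELECT that exchanges the sensitive columns with the predefined content.
# Column names come from the fixed list above, never from user input.
select_list = ", ".join(
    f"? AS {column}" if column in SENSITIVE_COLUMNS else column for column in COLUMNS
)
params = [PREDEFINED for column in COLUMNS if column in SENSITIVE_COLUMNS]
masked_rows = conn.execute(f"SELECT {select_list} FROM customers", params).fetchall()

# Indication of the location (column position and name) of the anonymized data.
masked_locations = [
    (index, column) for index, column in enumerate(COLUMNS) if column in SENSITIVE_COLUMNS
]

stored_rows = conn.execute("SELECT * FROM customers").fetchall()
```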

In another aspect, the present disclosure provides a method for a data storage arrangement comprising a memory being configured to store one or more data elements, the method comprising storing at least one of the one or more data elements utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the method is characterized in that the method further comprises: receiving a data element to be stored; generating a copy of the data element to be stored; masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

The method of the present disclosure takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication), which enables an efficient deduplication even in the presence of anonymized data. The method of the present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization. Thus, the one or more portions of the data element to be anonymized are masked by use of the method and, if the corresponding hash is already present, the data element is further compressed and stored by the method utilizing differential compression to enable an effective deduplication of the data elements that are received for storing. Moreover, the method efficiently utilizes the storage space because, in the anonymization aware deduplication, duplicate data is not stored, thereby improving performance in comparison to conventional methods where a large amount of storage space is occupied by duplicate data, as conventional methods do not take into account the presence of anonymized data, which further affects performance of conventional storages.

In another aspect, the present disclosure provides a computer-readable medium carrying computer instructions that, when loaded into and executed by a controller of a data storage arrangement, enable the data storage arrangement to implement the method of the previous aspect.

The computer-readable medium of the present disclosure takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication), which enables an efficient deduplication even in the presence of anonymized data. Beneficially, the computer-readable medium of the present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization.

In another aspect, the present disclosure provides a data storage arrangement comprising a memory being configured to store one or more data elements, and the data storage arrangement further comprising a compression software module for compressing at least one of the one or more data elements utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement is characterized in that the data storage arrangement further comprises: a data element receiving software module for receiving a data element to be stored; a copy software module for generating a copy of the data element to be stored; a masking software module for masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; a hash generating software module for generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and the compression software module for compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

The data storage arrangement of the present disclosure takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication) via the software modules, which enables an efficient deduplication even in the presence of anonymized data. The data storage arrangement of the present disclosure, via the software modules, enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization. Thus, the one or more portions of the data element to be anonymized are masked and, if the corresponding hash is already present, the data element is further compressed and stored utilizing differential compression to enable an effective deduplication of the data elements that are received for storing.

In another aspect, the present disclosure provides a computer-readable medium carrying a data storage comprising a plurality of data elements, wherein one or more of the plurality of data elements are stored utilizing the method of the previous aspect.

The computer-readable medium of the present disclosure takes into account data anonymization during deduplication (i.e. an anonymization aware deduplication), which enables an efficient deduplication even in the presence of anonymized data. Beneficially, the computer-readable medium of the present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization.

It is to be appreciated that all the aforementioned implementation forms can be combined.

It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.

Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

FIG. 1A is a block diagram of a data storage arrangement for anonymization aware deduplication, in accordance with an embodiment of the present disclosure;

FIG. 1B is a block diagram of a data storage arrangement for anonymization aware deduplication, in accordance with another embodiment of the present disclosure; and

FIG. 2 is a flowchart of a method for a data storage arrangement for anonymization aware deduplication, in accordance with an embodiment of the present disclosure.

In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS

The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.

FIG. 1A is a block diagram of a data storage arrangement, in accordance with an embodiment of the present disclosure. With reference to FIG. 1A, there is shown a data storage arrangement 100A. The data storage arrangement 100A includes a memory 102 and a controller 104. The memory 102 is configured to store one or more data elements 106. In an implementation, the data storage arrangement 100A further includes a network interface 108.

In one aspect, the present disclosure provides a data storage arrangement 100A comprising a memory 102 and a controller 104, the memory 102 being configured to store one or more data elements 106, and the controller 104 being configured to store at least one of the one or more data elements 106 utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement 100A is characterized in that the controller 104 is further configured to: receive a data element to be stored; generate a copy of the data element to be stored; mask data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generate corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compress the data element to be stored utilizing differential compression with reference to the one or more reference portions.

The data storage arrangement 100A includes the memory 102 being configured to store one or more data elements 106. The memory 102 refers to a hardware storage of the data storage arrangement 100A. The memory 102 includes suitable logic, circuitry, or interfaces that are configured to store one or more data elements 106, pointers and other data based on instructions received from the controller 104. Moreover, the memory 102 may be configured to further store instructions executable by the controller 104. Examples of implementation of the memory 102 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), and/or CPU cache memory. The memory 102 may store an operating system and/or other program products (including one or more operation algorithms) to operate the data storage arrangement 100A.

The one or more data elements 106 refer to incoming information or a data stream that arrives at the data storage arrangement 100A. For example, the one or more data elements 106 may arrive as an input/output (I/O) request during a deduplication process (i.e. when backup is performed from a primary storage system (e.g. a host server) to a secondary storage system, such as the data storage arrangement 100A). In an example, the one or more data elements 106 potentially include personal data, such as personally identifiable data or data that may be subject to data privacy and security under various regulations or data protection laws, such as the Data Protection Act (DPA). The one or more data elements 106 may be structured data, such as data with defined fields, for example name, age, phone number, credit card number, address and the like. For example, the one or more data elements 106 may include a string of characters such as “Dani, 33, 0541111111, 4580800080001999, A street” representing the name, age, phone number, credit card number, and address fields in a sequence. In another example, the one or more data elements 106 may be unstructured data where the data that needs to be anonymized may appear at any place in the data set without a specific field or known location.

The network interface 108 includes suitable logic, circuitry, and/or interfaces that may be configured to communicate with one or more external devices, such as user devices or servers. Examples of the network interface 108 may include, but are not limited to, a network interface card (NIC), an antenna, a radio frequency (RF) transceiver, or network ports. In an implementation, the controller 104 is configured to execute instructions stored in the memory 102. In an example, the controller 104 may be a general-purpose processor. Examples of the controller 104 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry. Moreover, the controller 104 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the data storage arrangement 100A.

The controller 104 is configured to store at least one of the one or more data elements 106 utilizing differential compression, wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion. Differential compression refers to finding similar portions of a data element and compressing a similar portion (the first portion) using another portion (the second portion) as a reference. Differential compression enables much better compression because a similar portion of the data element is already present. In other words, differential compression enables the controller 104 to store portions of a data element that have matching hashes in a compressed form, executing data deduplication to save storage space. The controller 104 is configured to generate the hash (which may also be referred to as a similarity hash digest) for the first portion of the data element using a hashing algorithm. Moreover, the hashing algorithm is applied to all portions of data elements previously stored to enable detection of duplicate portions of any data element that is to be stored in the data storage arrangement 100A. In this case, the data element is chunked into fixed-size portions. Further, if the hash of the first portion of the data element is identical to the stored hash of the second portion previously stored in the data storage arrangement 100A, then the first portion is identified as a duplicate data portion and is compressed with reference to the second portion. As a result, the storage space is significantly reduced compared to conventional systems and methods. In an example, the compression of the first portion with reference to the second portion may be executed using compression algorithms known in the art.
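
The compression of a first portion with reference to a second portion can be illustrated with a minimal Python sketch. The disclosure does not name a compression algorithm, so the sketch substitutes zlib with a preset dictionary as a stand-in; the function names delta_compress and delta_decompress are hypothetical and are not part of the disclosure.

```python
import zlib


def delta_compress(portion: bytes, reference: bytes) -> bytes:
    """Compress `portion` using `reference` as a zlib preset dictionary.

    Assumption: zlib stands in for the unnamed differential compressor.
    """
    compressor = zlib.compressobj(zdict=reference)
    return compressor.compress(portion) + compressor.flush()


def delta_decompress(blob: bytes, reference: bytes) -> bytes:
    """Rebuild the portion; the same reference portion must be supplied."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    second = b"Dani,33,0541111111,4580800080001999,A street"  # already stored
    first = b"Dani,33,0541111111,1111222233334444,A street"   # incoming
    blob = delta_compress(first, second)
    print(delta_decompress(blob, second) == first)
```

Because the reference portion is needed for decompression, a store built this way would have to keep the second portion for as long as any portion compressed against it exists.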

In an example, the controller 104 calculates a hash digest and a similarity hash digest for the first portion of the data element. The hash digest simply refers to the hash value of the first portion, and the similarity hash digest refers to the hash value of the first portion after masking out the data to be anonymized or the data that is already anonymized in the first portion. In such an example, if the hash digest already exists in the memory 102, then the controller 104 identifies that an identical portion of the data element already exists and keeps a pointer instead. If the controller 104 identifies another data portion (such as the second portion) with the same similarity hash digest, the controller 104 may compress the first portion with the second portion as a reference. In other words, the data other than the data to be anonymized, or the data that is already anonymized, is the same for both the first portion and the second portion. As a result, a large amount of storage space is saved, which further improves system performance of the data storage arrangement 100A.
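
The two-level lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: SHA-256 is chosen arbitrarily as the hashing algorithm, the mask is a list of per-byte flags, masked bytes are exchanged with a zero byte, and the names classify, exact_index and similar_index are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified hashing algorithm.
    return hashlib.sha256(data).hexdigest()


def apply_mask(portion: bytes, mask: list) -> bytes:
    """Return a copy in which every flagged byte is exchanged with a zero byte."""
    if len(portion) != len(mask):
        raise ValueError("mask must have one flag per byte")
    return bytes(0 if flagged else byte for byte, flagged in zip(portion, mask))


def classify(portion: bytes, mask: list, exact_index: dict, similar_index: dict):
    """Return ('duplicate' | 'similar' | 'new', id of the reference portion or None)."""
    exact = digest(portion)
    if exact in exact_index:
        return "duplicate", exact_index[exact]    # keep a pointer only
    similar = digest(apply_mask(portion, mask))
    if similar in similar_index:
        return "similar", similar_index[similar]  # compress against reference
    portion_id = len(exact_index)
    exact_index[exact] = portion_id
    similar_index[similar] = portion_id
    return "new", None


if __name__ == "__main__":
    head, tail = b"Dani33" + b"0541111111", b"Astreet"
    mask = [False] * 16 + [True] * 16 + [False] * 7
    exact_index, similar_index = {}, {}
    for card in (b"4580800080001999", b"4580800080001999", b"1111222233334444"):
        print(classify(head + card + tail, mask, exact_index, similar_index))
```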

According to an embodiment, the controller 104 is further configured to store the one or more data elements utilizing block storage. The use of block storage for storing the one or more data elements enables efficient data retrieval from the storage and recovery of the one or more data elements, when needed. The one or more data elements may be stored as fixed-size blocks (i.e. fixed-size chunks). In an example, a basic block size may be 8 Kilobytes, which means that all inputs and outputs of the data elements must be aligned to an 8 Kilobyte offset and have a size that is a multiple of 8 Kilobytes (e.g. 24 Kilobytes, 40 Kilobytes, 128 Kilobytes and the like).
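
As a small illustration of fixed-size block handling, the following Python sketch checks offset alignment and splits a buffer into 8 Kilobyte blocks. The helper names are hypothetical, and the sketch only covers the example block size mentioned above.

```python
BLOCK_SIZE = 8 * 1024  # example basic block size from the text


def is_offset_aligned(offset: int, block_size: int = BLOCK_SIZE) -> bool:
    """True when an I/O offset falls on a block boundary."""
    return offset % block_size == 0


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split a buffer into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


if __name__ == "__main__":
    payload = bytes(24 * 1024)  # a 24 Kilobyte write
    print(is_offset_aligned(16 * 1024), len(split_into_blocks(payload)))
```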

The controller 104 is further configured to receive a data element to be stored. The data element (e.g. an I/O write request) to be stored may be received from the memory 102 or from a remote data source. The data element to be stored may be received by the controller 104 from an external device that is communicatively coupled to the data storage arrangement 100A via the network interface 108. The data storage arrangement 100A may be a secondary storage for storing a backup of data of user devices or primary storage system(s). In an example, the data storage arrangement 100A is configured to store data of a plurality of user devices or host servers in an organization. In an example, the data element may be received at the time of backup to perform deduplication. The data element corresponds to the one or more data elements 106.

In an example, the data element to be stored that is received by the controller 104 is given in table (1), wherein the name is of 64 characters, the age is of 3 characters, the phone number is of 10 characters, the credit card number is of 16 characters, and the address is of 64 characters. It is to be understood that in practice, there may be millions of rows or a large number of columns stored in tabular form in a database. In another example, the data element to be stored that is received by the controller 104 is an alphanumeric character string: Dani, 33, 0541111111, 4580800080001999, A street; Josef, 55, 0541222222, 4580123434001999, B street; Eyal, 22, 0543111111, 4580800080011999, C street; Guy, 53, 0544222222, 4580123434061999, D street.

The controller 104 is further configured to generate a copy of the data element to be stored. The copy of the one or more data portions to be stored is generated to enable the controller 104 to mask the data to be anonymized from the rest of the data element received by the controller 104. In an example, the copy of the data element to be stored is generated based on a start offset and an end offset of the data element. In an example, in the data element received by the controller 104, the start offset is ‘1’ and the end offset is ‘224’, and thus a copy of the data element is generated based on these start and end offsets. The copy of the data element to be stored may be a virtual copy (e.g. for temporary processing purposes) of the data element without actually fully copying the data element.

The controller 104 is further configured to mask data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content. The data to be anonymized is masked by exchanging with predefined content to enable identification of the data portions to be anonymized among all the data portions in the copy of data element. In other words, the controller 104 hints the memory 102 on areas or portions of the data element that may be modified during data anonymization. As a result, data deduplication can be easily executed by the data storage arrangement 100A. In contrast, conventional storages have no access to the portions of data elements that may be modified during anonymization, so it is impossible for them to identify such sensitive data (i.e. portions of data elements that may be modified during anonymization) when the data element is modified (such as encrypted). Such conventional techniques are therefore not able to execute data deduplication.

According to an embodiment, the controller 104 may receive an indication of which data is to be anonymized in the copy of data element. Further, based on the indication, the controller 104 may mask the data to be anonymized. In an example, the indication of data to be anonymized may be provided by a user. The user may define which columns in a table (of the data element received) include sensitive data, which may be anonymized, or the controller 104 may use an automatic tool to identify such fields. In an example, the controller 104 may be configured to receive a user input that indicates which setting or configuration to select for masking data. For example, a user may configure which anonymization is likely to be applied to the data, such as General Data Protection Regulation (GDPR) anonymization, anonymization of a portion, such as credit card numbers only, and the like, as per need. Such settings and configurations which are selected or configured may be prestored in the memory 102.

According to an embodiment, the predefined content is NIL. The data to be anonymized is masked by exchanging with predefined content which is NIL. In an example, the one or more portions to be anonymized are exchanged with predefined content which is NIL. In an example, the one or more portions to be anonymized are exchanged with predefined content which is bit ‘0’. In such an example, the other portions which are not to be anonymized remain unchanged. In an example, the other portions which are not to be anonymized may be exchanged with predefined content which may be bit ‘1’ of the binary bits. As a result, a deletion of the data to be anonymized may be executed at a known place of a known size. The predefined content as NIL acts as a hint of the presence of anonymized data in the data element, which allows effective compression of the data element to be stored utilizing differential compression. Beneficially, masking the data to be anonymized by exchanging with predefined content which is NIL also enables a reduction in data storage space.

In an alternative embodiment, the predefined content is the character ‘1’. In an example, the one or more portions to be anonymized are masked by exchanging with predefined content which is bit ‘1’. In such an example, the other portions which are not to be anonymized and are not masked may be exchanged with predefined content which is NIL (or even NULL).

According to an embodiment, the predefined content is an indicator of a size of the exchanged data portion. The number of predefined content characters corresponds to the number of characters in the received data element which are to be anonymized. In an example, if the number of predefined content characters is 16, then the size of the exchanged data portion is 16 characters. Beneficially, the indicator of the size of the exchanged data portion further acts as a hint of the presence of anonymized data of a defined length in the data element.

According to an embodiment, the predefined content is the same for the one or more portions to be anonymized. In an example, the predefined content is NIL for all portions of the one or more portions to be anonymized. In another example, the predefined content may be the character ‘1’ for all portions of the one or more portions to be anonymized.

According to an embodiment, the predefined content is the same for a type of data to be anonymized. The type of data may correspond to a field of data such as name, age, phone number, credit card number, or address. The predefined content is the same for a type of data to be anonymized to ease identification of the portion of the data element to be anonymized. In an example, two types of data to be anonymized may have different predefined content, such as a first type of data that is a phone number, which may have predefined content as NIL, and a second type of data that is a credit card number, which may have predefined content as the numeric character “1” or “2” or other predefined characters or bit values.
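
The masking variants above (one predefined content for all portions, or one per type of data, with the amount of predefined content matching the size of the exchanged portion) can be sketched in Python. This is a minimal sketch under assumptions: NIL is represented as a zero byte, the spans and the fill table are supplied by the caller, and the name mask_copy is hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, length, type of data)


def mask_copy(data: bytes, spans: List[Span], fill: Dict[str, bytes]) -> bytes:
    """Return a masked copy; the received data element itself is left untouched.

    Each span is exchanged with its type's single-byte predefined content,
    repeated so that the size of the exchanged portion is preserved.
    """
    masked = bytearray(data)  # the copy of the data element
    for start, length, kind in spans:
        filler = fill[kind]
        if len(filler) != 1:
            raise ValueError("predefined content must be a single byte")
        masked[start:start + length] = filler * length
    return bytes(masked)


if __name__ == "__main__":
    record = b"Dani33" + b"0541111111" + b"4580800080001999" + b"Astreet"
    spans = [(6, 10, "phone"), (16, 16, "card")]
    # Assumption: NIL is modelled as a zero byte, '1' as the ASCII character.
    print(mask_copy(record, spans, {"phone": b"\x00", "card": b"1"}))
```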

In an example, serialized data received by the controller 104 is represented as “Dani3305411111114580800080001999A street”. In such an example, the copy of data after being masked may be: “00000000000000001111111111111111000000”, wherein bit “1” is the predefined content, i.e. it indicates that the data in that location may be anonymized. In this case, the location where there is “1” is the location of the credit card number. The masked copy of data is provided in the form of bytes such that 8 bits are translated to a byte. The data mask in this example is a bitmap, where each bit indicates a byte in the data, or it may be kept in a different format.
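
The translation of 8 mask bits into one byte can be sketched as below. The bit order (most significant bit first) and the zero padding of the final byte are assumptions, since the text does not specify them; pack_mask and is_masked are hypothetical helper names.

```python
def pack_mask(bits: str) -> bytes:
    """Pack a string of '0'/'1' flags into bytes, most significant bit first.

    Assumption: the last byte is padded with '0' flags when needed.
    """
    padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def is_masked(packed: bytes, position: int) -> bool:
    """True when the data byte at `position` may be changed by anonymization."""
    return bool(packed[position // 8] & (0x80 >> (position % 8)))


if __name__ == "__main__":
    # 16 unmasked bytes, 16 masked bytes (the card number), 7 unmasked bytes.
    bitmap = pack_mask("0" * 16 + "1" * 16 + "0" * 7)
    print(bitmap.hex(), is_masked(bitmap, 16), is_masked(bitmap, 32))
```

A bitmap of this kind costs one bit per data byte, so a 512-byte write would carry a 64-byte mask.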

According to an embodiment, the controller 104 is further configured to mask the data to be anonymized when generating the copy of the data element to be stored through a write-with-mask command. In an example, the controller 104 is configured to provide the write-with-mask command to the memory 102. The memory 102 upon receiving the command may use the mask to improve deduplication. The write-with-mask command refers to a new I/O write command that hints to the data storage arrangement on areas or portions of the data element which may be modified during data anonymization. In an example, the write-with-mask command is represented by function (1) that is shown below.

WriteWithMask(Data, device, offset, size, dataMask)    (1)

wherein

‘Data’ refers to the data to be written;

‘device’ refers to the device where the data is written, it can be a file, or a Logical Unit (LU);

‘offset’ refers to the offset where the data is written (i.e. in the file or the LU);

‘size’ refers to the size of the data, which is typically a multiple of 512 bytes for block devices;

‘dataMask’ refers to a data structure describing which parts of the data may be changed during anonymization (this is used for any change an application can cause to the data that the controller 104 hints to the memory 102).

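The parameters of function (1) can be modelled with a short Python sketch. It is a hypothetical illustration of how a request might be validated before being handed to storage: the representation of dataMask as a list of (start, length) ranges, the class name WriteWithMaskRequest and the validation rules are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WriteWithMaskRequest:
    data: bytes                       # 'Data': the data to be written
    device: str                       # 'device': a file or a Logical Unit (LU)
    offset: int                       # 'offset': where the data is written
    size: int                         # 'size': size of the data
    data_mask: List[Tuple[int, int]]  # 'dataMask': (start, length) ranges


def write_with_mask(data: bytes, device: str, offset: int, size: int,
                    data_mask: List[Tuple[int, int]]) -> WriteWithMaskRequest:
    """Validate the arguments and build a request (no real I/O is issued)."""
    if size != len(data):
        raise ValueError("size must match the length of the data")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    for start, length in data_mask:
        if start < 0 or length <= 0 or start + length > size:
            raise ValueError("mask range falls outside the data")
    return WriteWithMaskRequest(data, device, offset, size, list(data_mask))


if __name__ == "__main__":
    request = write_with_mask(bytes(512), "lu0", 0, 512, [(16, 16)])
    print(request.device, request.size, request.data_mask)
```
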
According to an embodiment, the write-with-mask command is associated with a first data type and a second data type, wherein the write-with-mask command is arranged to utilize a first mask when generating the copy of the data element when the data element is of the first data type and to utilize a second mask when generating the copy of the data element when the data element is of the second data type. The first mask and the second mask are utilized for the first data type and the second data type, respectively, to distinguish the two data types which are to be anonymized and to ease their identification. In an example, there may be two data types (or fields), such as the data types of phone number and credit card number, which are to be masked for anonymization. In such an example, the write-with-mask command utilizes a first mask for masking the data type of phone number and a second mask for masking the data type of credit card number.

According to an embodiment, the mask indicates the location and size of the data to be anonymized. The location of the data may be indicated based on the value of ‘offset’ in the write-with-mask command, which refers to the location where the data is written. The size of the data may be indicated based on the value of ‘size’ in the write-with-mask command, which refers to the size of the data, typically in multiples of a specified size, for example, 512 bytes.

According to an embodiment, the controller 104 is further configured to delete the copy of the masked data element. The copy of the masked data element is not stored by the controller 104. As a result, even if the same serialized data is received by the controller 104 but with changed data to be anonymized, the hash digest and the similarity hash digest remain the same. In an example, if the data to be anonymized is a credit card number, then upon masking and deletion of the masked data element associated with the credit card number, even if the same sequence of data element with a different credit card number is received, the same similarity hash digest is obtained. This results in easy deduplication by the controller 104 of the data elements received by the controller 104.

Beneficially, the memory 102 may leverage the write-with-mask command to significantly improve the differential compression technique. Moreover, the hash digests, or similarity hash digests, may be calculated only for the non-masked part of the data element.

According to an embodiment, the controller 104 is further configured to receive the data element to be stored as part of a plurality of data elements comprised in a database, wherein the data elements of the database are arranged in a table; mask the data to be anonymized by exchanging one or more portions to be anonymized with predefined content based on the position of the data element in the table; and store an indication of the location of the data anonymized. In an example, the data element to be stored is received by the controller 104 as part of the plurality of data elements comprised in the database, wherein the data elements of the database are arranged in table (2).

In such an example, the data to be anonymized is masked by exchanging one or more portions to be anonymized with predefined content based on the position of the data element in the table. The masked data is given in the table (3) wherein credit card number is masked.

Further, in such an example, the indication of the location of the data anonymized is stored. The indication of the location of the credit card number which is anonymized is stored. The indication may be in the form of a start offset value and an end offset value of the credit card number. In other words, the masking commands flow through the database and even through an anonymization mechanism that understands what is to be anonymized. For example, in many cases, data (e.g. different data portions of the data element) is kept in tables in a database, where sensitive data is typically stored in one or more specific columns of the table. In contrast to conventional systems, the present disclosure enables anonymization to be executed at the layer of the database (e.g. using database queries, such as SQL queries) by masking data portions (e.g. by exchanging data) in the one or more specific columns containing the sensitive data. In other words, the present disclosure enables the knowledge and know-how of how to anonymize the data to be built into a database application, and the changes can be done either inside the database or even externally by using any database query language (e.g. SQL queries). This is advantageous in comparison to identification of the data portions that are to be anonymized at the storage layer (i.e. after being stored at the block level storage). Alternatively stated, database hints are beneficial, and thus the present disclosure also works even in the presence of anonymized data which is obfuscated (e.g. encrypted).
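
Masking based on the position of a column in a table, together with storing the location of the masked data, can be sketched in Python. The sketch assumes fixed-width rows that use the field widths given earlier (name 64, age 3, phone number 10, credit card number 16, address 64), space padding, zero-byte predefined content, and 0-based offsets with an exclusive end; all helper names are hypothetical.

```python
FIELD_WIDTHS = [("name", 64), ("age", 3), ("phone", 10),
                ("card", 16), ("address", 64)]


def column_offsets(column: str) -> tuple:
    """Return (start, end) of a column inside a fixed-width row (end exclusive)."""
    start = 0
    for name, width in FIELD_WIDTHS:
        if name == column:
            return start, start + width
        start += width
    raise KeyError(column)


def serialize_row(row: dict) -> bytes:
    """Serialize a row to fixed-width ASCII, padding each field with spaces."""
    return b"".join(str(row[name]).ljust(width)[:width].encode("ascii")
                    for name, width in FIELD_WIDTHS)


def mask_column(serialized: bytes, column: str, fill: bytes = b"\x00"):
    """Exchange one column with predefined content and report its location."""
    start, end = column_offsets(column)
    masked = serialized[:start] + fill * (end - start) + serialized[end:]
    return masked, (start, end)


if __name__ == "__main__":
    row = {"name": "Dani", "age": 33, "phone": "0541111111",
           "card": "4580800080001999", "address": "A street"}
    masked, location = mask_column(serialize_row(row), "card")
    print(location, len(masked))
```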

The controller 104 is further configured to generate corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions. The corresponding hashes herein refer to the similarity hash digests that are generated for one or more portions of the copy of data element with masked data. The similarity hash digests are generated using hashing algorithms known in the art. The similarity hash digests are compared with hashes of reference portions which are already stored to find reference portions having the same similarity hash digests. The controller 104 is further configured to compress the data element to be stored utilizing differential compression with reference to the one or more reference portions. If the corresponding hashes, i.e. the similarity hash digests that are generated for one or more portions of the copy of data element with masked data, are the same as the similarity hash digests for one or more portions (i.e. reference portions) already stored, then the data element is compressed and stored. As a result, storage space of the memory 102 is saved, which further increases efficiency of the data storage arrangement 100A. Beneficially, compression ratios are high due to the similarity of the portions.

In an example, serialized data that is received is represented by a first sequence as: “Dani3305411111114580800080001999Astreet”. In such an example, the data mask may be: “00000000000000001111111111111111000000”. In this example, the data mask is a bitmap. In the first sequence, “4580800080001999” is to be anonymized. The similarity hash digest or score is calculated for “Dani330541111111Astreet”. So, even if the same sequence with a different credit card number is received, the same hash and similarity hash digest is obtained.

Further, in such an example, a second sequence that is received may be represented as “Dani3305411111111111222233334444Astreet” and the similarity hash digest is calculated for “Dani330541111111Astreet”. In the second sequence, “1111222233334444” is already anonymized. Thus, for both the first sequence and the second sequence the same similarity hash digest is obtained. Further, the second sequence is compressed using the first sequence as reference.
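
The two sequences can be checked with a short Python sketch in which the masked positions are removed before hashing, so that the digest is calculated over “Dani330541111111Astreet” only. SHA-256 and the helper names are assumptions, and the mask here is built to cover all 39 characters of each sequence (16 unmasked, 16 masked, 7 unmasked).

```python
import hashlib


def strip_masked(data: bytes, mask: str) -> bytes:
    """Keep only the bytes whose mask flag is '0'."""
    if len(data) != len(mask):
        raise ValueError("mask must have one flag per byte")
    return bytes(byte for byte, flag in zip(data, mask) if flag == "0")


def similarity_digest(data: bytes, mask: str) -> str:
    # Assumption: SHA-256 stands in for the unspecified hashing algorithm.
    return hashlib.sha256(strip_masked(data, mask)).hexdigest()


if __name__ == "__main__":
    prefix, suffix = b"Dani33" + b"0541111111", b"Astreet"
    first = prefix + b"4580800080001999" + suffix
    second = prefix + b"1111222233334444" + suffix
    mask = "0" * 16 + "1" * 16 + "0" * 7
    print(strip_masked(first, mask))
    print(similarity_digest(first, mask) == similarity_digest(second, mask))
```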

Beneficially, the data storage arrangement 100A of the present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) on areas of the data element which may be modified during data anonymization. Thus, the one or more portions of the data element to be anonymized are masked and, if the corresponding hash is already present, the data element is further compressed and stored utilizing differential compression to enable an effective deduplication of the data elements that are received for storing.

FIG. 1B is a block diagram of a data storage arrangement, in accordance with yet another embodiment of the present disclosure. With reference to FIG. 1B there is shown a data storage arrangement 100B. The data storage arrangement 100B further includes a compression software module 110A, data element receiving software module 110B, copy software module 110C, masking software module 110D, hash generating software module 110E, and compression software module 110F that are installed in the memory 102. There is further shown the controller 104 and the network interface 108.

In another aspect the present disclosure provides a data storage arrangement 100B comprising a memory 102 being configured to store one or more data elements 106, and the data storage arrangement 100B further comprising a compression software module 110A for compressing at least one of the one or more data elements 106 utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the data storage arrangement 100B is characterized in that the data storage arrangement 100B further comprises: a data element receiving software module 110B for receiving a data element to be stored; a copy software module 110C for generating a copy of the data element to be stored; a masking software module 110D for masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; a hash generating software module 110E for generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and the compression software module 110A for compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

The data storage arrangement 100B comprises the data element receiving software module 110B, which when executed by the controller 104, receives a data element to be stored. The data element to be stored may be received from the memory 102 or from a remote source via the network interface 108, when the software module 110B is executed by the controller 104.

The data storage arrangement 100B comprises the copy software module 110C, which when executed by the controller 104, generates a copy of the data element to be stored. The copy of the one or more data portions to be stored is generated when the software module 110C is executed by the controller 104 to enable the controller 104 to mask the data to be anonymized from the rest of the data element received by the controller 104.

The data storage arrangement 100B comprises the masking software module 110D, which when executed by the controller 104, masks data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content. The data to be anonymized is masked by exchanging with predefined content when the software module 110D is executed by the controller 104 to enable identification of the data portions to be anonymized among all the data portions in the copy of data element. In other words, the controller 104 hints the memory 102 on areas or portions of the data element that may be modified during data anonymization.

The data storage arrangement 100B comprises the hash generating software module 110E, which when executed by the controller 104, generates corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions. The corresponding hashes are generated when the software module 110E is executed by the controller 104 for one or more portions of the copy of data element with masked data. The corresponding hashes are compared with hashes of reference portions which are already stored to find reference portions having the same corresponding hashes.

The data storage arrangement 100B comprises the compression software module 110A, which when executed by the controller 104, compresses the data element to be stored utilizing differential compression with reference to the one or more reference portions. If the corresponding hashes that are generated for one or more portions of the copy of data element with masked data are the same as the similarity hash digests for one or more portions already stored, then the data element is compressed and stored when the software module 110A is executed by the controller 104.

Beneficially, the data storage arrangement 100B of the present disclosure enables provisioning of hints to the data storage arrangement (i.e. to storage) via the software modules, on areas of the data element which may be modified during data anonymization. Thus, the one or more portions of the data element to be anonymized are masked and, if the corresponding hash is already present, the data element is further compressed and stored utilizing differential compression.

FIG. 2 is a flowchart of a method for a data storage arrangement, in accordance with an embodiment of the present disclosure. The method 200 is executed at a data storage arrangement (e.g. the data storage arrangement 100A or 100B) described, for example, in Fig. 1. The method 200 includes steps 202 to 210.

In one aspect the present disclosure provides a method 200 for a data storage arrangement (e.g. the data storage arrangement 100A or 100B) comprising a memory 102 being configured to store one or more data elements, the method 200 comprising storing at least one of the one or more data elements utilizing differential compression wherein a hash for a first portion of a data element is generated and compared to a stored hash for a second portion, and if the hashes match, the first portion is compressed with reference to the second portion, wherein the method is characterized in that the method 200 further comprises: receiving a data element to be stored; generating a copy of the data element to be stored; masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content; generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions; and compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions.

At step 202, the method 200 comprises receiving a data element to be stored. The data element to be stored may be received by the controller 104 from the memory 102 or from a remote source. The data element to be stored may be received by the controller 104 from an external device that is communicatively coupled to the data storage arrangement (e.g. the data storage arrangement 100A or 100B) via the network interface 108.

At step 204, the method 200 comprises generating a copy of the data element to be stored. The copy of the one or more data portions to be stored is generated by the controller 104 to enable masking the data to be anonymized from the rest of the data element received by the controller 104.

At step 206, the method 200 comprises masking data to be anonymized in the copy of data element to be stored by exchanging one or more portions to be anonymized with predefined content. The data to be anonymized is masked by exchanging with predefined content by the controller 104 to enable identification of the data portions to be anonymized among all the data portions in the copy of data element. In other words, the controller 104 hints the memory 102 on areas or portions of the data element that may be modified during data anonymization.

According to an embodiment, the method further comprises masking the data to be anonymized when generating the copy of the data element to be stored through a write-with-mask command. In an example, the controller 104 is configured to provide the write-with-mask command to the memory 102. The memory 102 upon receiving the command may use the mask to improve deduplication. According to an embodiment, the method further comprises deleting the copy of the masked data element. The copy of the masked data element is not stored by the controller 104.

According to an embodiment, the method further comprises receiving the data element to be stored as part of a plurality of data elements comprised in a database, wherein the data elements of the database are arranged in a table; masking the data to be anonymized by exchanging one or more portions to be anonymized with predefined content based on the position of the data element in the table; and storing an indication of the location of the data anonymized.

At step 208, the method 200 comprises generating corresponding hashes for one or more portions of the copy of data element with masked data for finding one or more reference portions. The corresponding hashes are generated by the controller 104 for one or more portions of the copy of data element with masked data. The corresponding hashes are compared with hashes of reference portions which are already stored to find reference portions having the same corresponding hashes.

At step 210, the method 200 comprises compressing the data element to be stored utilizing differential compression with reference to the one or more reference portions. If the corresponding hashes that are generated by the controller 104 for one or more portions of the copy of data element with masked data are the same as the similarity hash digests for one or more portions (i.e. reference portions) already stored, then the data element is compressed and stored by the controller 104. As a result, storage space of the memory 102 is saved, which further increases efficiency of the data storage arrangement (e.g. the data storage arrangement 100A or 100B). Beneficially, compression ratios are high due to the similarity of the portions.

The steps 202 to 210 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
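
Steps 202 to 210 can be tied together in one compact Python sketch. It is an illustration under assumptions rather than the claimed implementation: the store is an in-memory dictionary, masked ranges arrive as (start, length) pairs, masked bytes are exchanged with zero bytes, SHA-256 provides the hashes, and zlib with a preset dictionary stands in for the differential compressor. The name store_element is hypothetical.

```python
import hashlib
import zlib


def store_element(data: bytes, masked_ranges: list, store: dict) -> str:
    """Run steps 202-210 for one received data element (step 202 is the call)."""
    # Step 204: generate a copy of the data element.
    copy = bytearray(data)
    # Step 206: exchange the portions to be anonymized with predefined content.
    for start, length in masked_ranges:
        copy[start:start + length] = b"\x00" * length
    # Step 208: generate the hash of the masked copy; the copy is then dropped.
    key = hashlib.sha256(bytes(copy)).hexdigest()
    del copy
    # Step 210: compress against the reference if one with the same hash exists.
    if key in store:
        compressor = zlib.compressobj(zdict=store[key]["reference"])
        delta = compressor.compress(data) + compressor.flush()
        store[key]["deltas"].append(delta)
        return "compressed"
    store[key] = {"reference": data, "deltas": []}
    return "stored"


if __name__ == "__main__":
    prefix, suffix = b"Dani33" + b"0541111111", b"Astreet"
    store = {}
    print(store_element(prefix + b"4580800080001999" + suffix, [(16, 16)], store))
    print(store_element(prefix + b"1111222233334444" + suffix, [(16, 16)], store))
```

In this sketch the unmasked data element is what gets stored or compressed; the masked copy exists only long enough to produce the hash.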

In one aspect, provided is a computer-readable medium carrying computer instructions that, when loaded into and executed by a controller 104 of a data storage arrangement (e.g. the data storage arrangement 100A or 100B), enable the data storage arrangement to implement the method 200. In another aspect, provided is a computer-readable medium carrying a data storage comprising a plurality of data elements, wherein one or more of the plurality of data elements are stored utilizing the method 200. Examples of implementation of the computer-readable medium include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory. In yet another aspect, a computer program product is provided comprising a non-transitory computer-readable storage medium having computer program code stored thereon, the computer program code being executable by a processor to execute the method 200. A computer readable storage medium for providing a non-transient memory may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.

Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.