Title:
METHOD AND SYSTEM FOR INDEXING DATA ITEM IN DATA STORAGE SYSTEM AND DATA INDEXING MODULE
Document Type and Number:
WIPO Patent Application WO/2022/262990
Kind Code:
A1
Abstract:
A computer-implemented method for indexing a data item in a data storage system includes dividing the data item into one or more blocks, calculating a strong hash value for each block, and checking for a duplicate block by checking the calculated strong hash value of each block against a deduplication index. For any non-duplicate block, the block data is stored in the data storage system and an ID table is updated. For any duplicate block, the method further includes storing a reference to a matching block in the data storage system, determining one or more common areas, selecting one or more representative hash values, and updating the deduplication index with an entry for each representative hash value. The deduplication index can be efficiently used for multiple scenarios (i.e., source-based deduplication, online deduplication, and offline deduplication) without degrading the performance (i.e., latency, throughput, and deduplication ratio) for any scenario.

Inventors:
NATANZON ASSAF (DE)
ZACH IDAN (DE)
STERNBERG MICHAEL (DE)
Application Number:
PCT/EP2021/066536
Publication Date:
December 22, 2022
Filing Date:
June 18, 2021
Assignee:
HUAWEI TECH CO LTD (CN)
NATANZON ASSAF (DE)
International Classes:
G06F3/06
Foreign References:
US20120166448A12012-06-28
US20110276781A12011-11-10
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A computer-implemented method (100) for indexing a data item in a data storage system (202), the method comprising: dividing the data item into one or more blocks; calculating a strong hash value for each block; checking for a duplicate block by checking the calculated strong hash value of each block against a deduplication index; for any non-duplicate block: storing the block data at an address in the data storage system (202); and updating an ID table with an entry for the block, including the strong hash value, a block ID and a pointer indicating the address of the block data, wherein if the block is a duplicate the pointer indicates a location of the target block; for any duplicate block for which a matching block is found in the data storage system (202): storing, at an address in the data storage system (202), a reference to the block ID of the matching block, such that the block data for the duplicate block can be found at an address of the matching block; determining one or more common areas of the data storage system (202) having the largest number of addresses which store block data for the data item; and selecting one or more representative hash values from the calculated strong hash values of each block; and updating the deduplication index with an entry for each representative hash value, wherein each entry includes a pointer to one or more block IDs, wherein: if the representative hash value corresponds to a block having an address in the common area, the pointer indicates the block ID of that block; and if the representative hash value corresponds to a block having an address which is not in the common area, the pointer indicates a block ID of a block in one or more common areas.

2. The method (100) of claim 1, wherein selecting the representative hash values uses a determinative process.

3. The method (100) of claim 2, wherein the determinative process comprises selecting the one or more largest hash values.

4. The method (100) of claim 2, wherein the determinative process comprises selecting one or more hash values matching a predetermined bit-pattern.

5. The method (100) of any preceding claim, wherein two representative hash values are selected.

6. The method (100) of any preceding claim, wherein updating the deduplication index includes calculating a weak hash for each representative hash value.

7. The method (100) of any preceding claim, wherein the deduplication index is stored in a memory (212), and the ID table is stored in a disk storage.

8. The method (100) of any preceding claim, wherein checking for a duplicate block comprises: selecting one or more representative hash values from the calculated strong hash values of each block; checking the representative hash values against the deduplication index for matches; and for one or more positive matches, retrieving the strong hash values corresponding to the one or more block IDs indicated by the pointer and checking for a match against the incoming data item.

9. The method (100) of claim 8 wherein checking for a match against the incoming data item comprises fetching block data from a target address corresponding to the block ID and from a plurality of associated addresses.

10. The method (100) of claim 9, wherein the plurality of associated addresses are addresses within a certain distance of the target address.

11. The method (100) of claim 9, wherein the plurality of associated addresses are addresses following the target address.

12. The method (100) of any preceding claim, wherein updating the deduplication index with an entry for a representative hash value comprises, if an entry for an identical representative hash value exists, adding an additional pointer to one or more block IDs.

13. A computer readable medium configured to store instructions which, when executed by a processor, cause the processor to execute the method (100) of any preceding claim.

14. A data indexing module (210) for a data storage system (202), the module comprising one or more processors (210A) configured to execute the method (100) of any one of claims 1 to 12.

15. A data storage system (202) comprising: one or more data storage units (208); and the data indexing module (210) of claim 14.

Description:
METHOD AND SYSTEM FOR INDEXING DATA ITEM IN DATA STORAGE SYSTEM AND DATA INDEXING MODULE

TECHNICAL FIELD

The present disclosure relates generally to the field of data deduplication; and more specifically, to a method for indexing a data item in a data storage system, a data indexing module for a data storage system, and a data storage system.

BACKGROUND

Generally, data deduplication is a process that eliminates duplicate copies of data which are either shared over a network or stored in a data storage system. Data deduplication solutions are required to detect duplicate copies of data and thus significantly decrease storage capacity requirements. Typically, data deduplication is done by dividing the data into segments, calculating fingerprints (i.e., strong hashes) per segment, and using these fingerprints to identify identical data. However, efficiently locating the existing fingerprints that have a high probability of being equal to incoming fingerprints can be difficult, since the number of fingerprints is very large for the large amounts of data where deduplication is required. Conventionally, to execute data deduplication, a conventional sparse index may be used, in which the fingerprints are divided into areas (such as association blocks) and representatives (i.e., weak hashes) are selected from each area via a deterministic method. Duplicate data is located when pre-stored fingerprints (i.e., weak hashes) and the incoming fingerprints are found to be equal. However, the information in the conventional sparse index needs to be consistent with the existing fingerprints; otherwise, the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) will degrade, since existing fingerprints will not be represented in the conventional sparse index.

In certain scenarios, the conventional sparse index may be used for source-based deduplication. Source-based deduplication is a client-server-based deduplication in which the conventional sparse index is mainly used to reduce the network bandwidth. Source-based deduplication is sensitive to latency. Thus, in such scenarios, the focus is to provide minimal latency in order to minimize the time that the client waits for the server to reply. In another scenario, the conventional sparse index may be used for online deduplication. In online deduplication, the conventional sparse index is mainly used to reduce the amount of data that should be written to disk. Thus, in this scenario, the focus is to provide good throughput with low latency. However, ultra-low latency is not critical in this scenario, because the client has already received an acknowledgement from the server (i.e., the data is already available in the server’s cache and should be written to the disk). In yet another scenario, the conventional sparse index may be used for offline deduplication. In offline deduplication, the conventional sparse index is mainly used to reduce the amount of data stored on the storage system. Offline deduplication is not sensitive to latency. In this scenario, the focus is to find as much duplicate data as possible.

The conventional data deduplication solutions make a trade-off between several performance aspects and optimize the algorithm for a single application scenario while degrading performance in the other application scenarios. For example, one conventional data deduplication solution may use a conventional deduplication index optimized for online deduplication, which provides high throughput but with relatively high latency. Thus, such a conventional deduplication index is not preferred for the source-based deduplication scenario, in which a very low latency is required. Therefore, there exists a technical problem of inefficient data deduplication solutions that provide non-optimal performance aspects, such as latency, throughput, and deduplication ratio, under multiple application scenarios.

Typically, the main flow of the conventional data deduplication solutions is receiving a new write request, segmenting the data using any segmentation algorithm, calculating a strong hash and a weak hash for each segment, and searching for the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block identity (ID) is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched for in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. The ID table includes a list of unique data blocks, sorted by the ID number of the block. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated and a new block ID is generated. However, since the given weak hash table is large and most of it is stored on the disk, the search requires read IOs from the disk. Hence, there is a significant impact on the latency as well as the read IO throughput. Moreover, a single entry in the given weak hash table for each weak hash key creates a conflict (i.e., different data blocks with the same weak hash key) and leads to degradation of the deduplication ratio.
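
As a concrete illustration of this conventional lookup order (strong hash cache, then weak hash table, then ID table), the following Python sketch may be considered. The names strong_hash_cache, weak_hash_table and id_table, the SHA-1 strong hash, and the 64-bit weak hash are illustrative assumptions, not structures defined in this disclosure.

```python
import hashlib

def strong_hash(block: bytes) -> bytes:
    # Illustrative strong hash (fingerprint) of a segment.
    return hashlib.sha1(block).digest()

def weak_hash(strong: bytes) -> bytes:
    # Illustrative weak hash: 64 bits taken from the strong hash.
    return strong[:8]

def conventional_lookup(block: bytes,
                        strong_hash_cache: dict[bytes, int],
                        weak_hash_table: dict[bytes, int],
                        id_table: dict[int, bytes]) -> int | None:
    """Return an existing block ID for a duplicate segment, or None."""
    strong = strong_hash(block)

    # 1. Fast path: the strong hash is already in the in-memory cache.
    if strong in strong_hash_cache:
        return strong_hash_cache[strong]

    # 2. Search the (mostly on-disk) weak hash table; this may cost a read IO.
    block_id = weak_hash_table.get(weak_hash(strong))
    if block_id is not None:
        # 3. Fetch the stored strong hash from the ID table and compare.
        if id_table[block_id] == strong:
            return block_id

    # 4. No match: the data cannot be deduplicated; a new block ID is needed.
    return None
```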

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional data deduplication solutions.

SUMMARY

The present disclosure provides a method for indexing a data item in a data storage system, a data indexing module for a data storage system, and a data storage system. The present disclosure provides a solution to the existing problem of inefficiency and unreliability associated with the conventional data deduplication solutions, where the problem is compounded by the fact that in existing solutions, the same deduplication index cannot be used for multiple application scenarios with optimal performance parameters, such as latency, throughput, and deduplication ratio. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides an improved data deduplication solution that can efficiently utilize the same deduplication index for multiple application scenarios as well as with optimal performance parameters, such as latency, throughput, and deduplication ratio.

One or more objects of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.

In one aspect, the present disclosure provides a computer-implemented method for indexing a data item in a data storage system. The method comprises dividing the data item into one or more blocks; calculating a strong hash value for each block; and checking for a duplicate block by checking the calculated strong hash value of each block against a deduplication index. For any non-duplicate block, the method further comprises storing the block data at an address in the data storage system; and updating an ID table with an entry for the block, including the strong hash value, a block ID and a pointer indicating the address of the block data, wherein if the block is a duplicate the pointer indicates a location of the target block. For any duplicate block for which a matching block is found in the data storage system, the method further comprises storing, at an address in the data storage system, a reference to the block ID of the matching block, such that the block data for the duplicate block can be found at an address of the matching block; determining one or more common areas of the data storage system having the largest number of addresses which store block data for the data item; selecting one or more representative hash values from the calculated strong hash values of each block; and updating the deduplication index with an entry for each representative hash value, wherein each entry includes a pointer to one or more block IDs. Further, if the representative hash value corresponds to a block having an address in the common area, the pointer indicates the block ID of that block; and if the representative hash value corresponds to a block having an address which is not in the common area, the pointer indicates a block ID of a block in one or more common areas.

The method of the present disclosure provides an improved data deduplication solution by efficiently indexing a data item in the data storage system with the help of the deduplication index (i.e., a weak hash sparse cache). By virtue of the deduplication index, the number of disk-read accesses is reduced, which consequently minimizes the latency in fetching the IO operations of the data storage system. Further, the method of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication, online deduplication, and offline deduplication) without degrading the performance (i.e., latency, throughput, and deduplication ratio) for any scenario. Moreover, the method allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system with a new block ID. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather a reference (i.e., a pointer) to the matching data is stored. Thus, the method of the present disclosure allows an efficient utilization of the storage capacity of the data storage system.

In an implementation form, selecting the representative hash values uses a determinative process.

By virtue of the determinative process, for given weak hash values derived from the calculated strong hash values of each block, the same representative hash values are selected every time. Hence, the selection of the representative hash values becomes deterministic and easy.

In a further implementation form, the determinative process comprises selecting the one or more largest hash values.

Beneficially, the method of selection of the representative hash values for the deduplication index (i.e., the weak hash cache) becomes easy and deterministic because for the given one or more largest hash values, the same one or more largest hash values are selected as the representative hash values every time.

In a further implementation form, the determinative process comprises selecting one or more hash values matching a predetermined bit-pattern.

Beneficially, the method of selection of the representative hash values for the deduplication index (i.e., the weak hash cache) becomes easy and deterministic because for the given one or more hash values matching a predetermined bit-pattern, the same one or more hash values matching the predetermined bit-pattern are selected as the representative hash values every time.

In a further implementation form, two representative hash values are selected.

Beneficially, the deduplication index can contain multiple IDs (i.e., multiple pointers) for a single weak hash key. Hence, in case of a conflict (i.e., different data blocks with the same weak hash key), the deduplication index can improve the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication).

In a further implementation form, updating the deduplication index includes calculating a weak hash for each representative hash value.

Beneficially, the size of the deduplication index (i.e., the weak hash sparse cache) is made small because the weak hash comprises only a portion of the bits of a corresponding strong hash value for the representative hash value.

In a further implementation form, the deduplication index is stored in a memory, and the ID table is stored in a disk storage.

Beneficially, the number of disk-read accesses is reduced and consequently the latency in the data storage system is minimized. In a further implementation form, checking for a duplicate block comprises selecting one or more representative hash values from the calculated strong hash values of each block; checking the representative hash values against the deduplication index for matches; and for one or more positive matches, retrieving the strong hash values corresponding to the one or more block IDs indicated by the pointer and checking for a match against the incoming data item.

Beneficially, the duplicate block (i.e., the block for which a matching block is found in the data storage system) can be efficiently identified by the deduplication index. Thus, the performance of the data deduplication solution is improved.

In a further implementation form, checking for a match against the incoming data item comprises fetching block data from a target address corresponding to the block ID and from a plurality of associated addresses.

By virtue of fetching block data from the target address corresponding to the block ID and from a plurality of associated addresses, maximum number of data blocks, which store the maximum data for the incoming data item, can be fetched for checking against the incoming data item.

In a further implementation form, the plurality of associated addresses are addresses within a certain distance of the target address.

Beneficially, the size of the plurality of associated addresses can be bounded (e.g., up to 5 references), and the oldest reference can be dropped if the limit is exceeded.

In a further implementation form, the plurality of associated addresses are addresses following the target address.

Beneficially, the addresses which store the maximum data for the data item can be fetched in a single read IO.

In a further implementation form, updating the deduplication index with an entry for a representative hash value comprises, if an entry for an identical representative hash value exists, adding an additional pointer to one or more block IDs.

Beneficially, the number of disk-read accesses is reduced and consequently the latency in the data storage system is minimized. Moreover, by virtue of the deduplication index, the throughput of the data storage system is increased and the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) is improved.

In another aspect, the present disclosure provides a computer readable medium configured to store instructions which, when executed by a processor, cause the processor to execute the method of aforementioned aspect.

The computer readable medium achieves all the advantages and effects of the respective method of the present disclosure.

In a yet another aspect, the present disclosure provides a data indexing module for a data storage system. The module comprises one or more processors configured to execute the method of aforementioned aspect.

The data indexing module of the present disclosure executes the method of the aforementioned aspect by efficiently indexing a data item in the data storage system with the help of the deduplication index (i.e., weak hash sparse cache). Further, the data indexing module provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system with a new block ID. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather a reference (i.e., a pointer) to the matching data is stored. Thus, the data indexing module helps in an efficient utilization of the storage capacity of the data storage system.

In a yet another aspect, the present disclosure provides a data storage system. The data storage system comprises one or more data storage units, and the data indexing module.

The data storage system of the present disclosure provides an efficient and reliable system to store data. By virtue of the data indexing module, the data storage system can easily avoid storing duplicate copies of data and thus, significantly decreases storage capacity requirements.

It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.

Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

FIG. 1A and FIG. 1B, collectively, is a flowchart of a method for indexing a data item in a data storage system, in accordance with an embodiment of the present disclosure;

FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure;

FIG. 3 is an illustration that depicts various operations of different layers of a data deduplication solution, in accordance with an embodiment of the present disclosure;

FIG. 4 is an exemplary illustration of a deduplication index, in accordance with an embodiment of the present disclosure.

In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS

The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.

FIG. 1A and FIG. 1B, collectively, is a flowchart of a method for indexing a data item in a data storage system, in accordance with an embodiment of the present disclosure. With reference to FIG. 1A and FIG. 1B, there is shown a method 100. The method 100 is executed by a data indexing module described in detail, for example, in FIG. 2. The method 100 includes steps 102 to 118. The step 118 further includes the steps 118A, 118B, and 118C.

In one aspect, the present disclosure provides a computer-implemented method (i.e., the method 100) for indexing a data item in a data storage system. The method 100 comprises: dividing the data item into one or more blocks; calculating a strong hash value for each block; checking for a duplicate block by checking the calculated strong hash value of each block against a deduplication index; for any non-duplicate block: storing the block data at an address in the data storage system; and updating an ID table with an entry for the block, including the strong hash value, a block ID and a pointer indicating the address of the block data, wherein if the block is a duplicate the pointer indicates a location of the target block; for any duplicate block for which a matching block is found in the data storage system: storing, at an address in the data storage system, a reference to the block ID of the matching block, such that the block data for the duplicate block can be found at an address of the matching block; determining one or more common areas of the data storage system having the largest number of addresses which store block data for the data item; selecting one or more representative hash values from the calculated strong hash values of each block; and updating the deduplication index with an entry for each representative hash value, wherein each entry includes a pointer to one or more block IDs, wherein: if the representative hash value corresponds to a block having an address in the common area, the pointer indicates the block ID of that block; and if the representative hash value corresponds to a block having an address which is not in the common area, the pointer indicates a block ID of a block in one or more common areas.

At step 102, the method 100 comprises dividing the data item into one or more blocks. The data item (i.e., data files, database files, input-output (I/O) writes, and the like) can be huge. In order to efficiently manage such a huge data item (or the incoming write request), the data item is broken down into one or more blocks. This process of dividing is also referred to as chunking of data, that is, splitting a large data item into small data items called data chunks. The one or more blocks may also be referred to as data segments. Further, dividing the data item into one or more blocks may result in blocks of either fixed length (i.e., equal-size blocks) or variable length (i.e., unequal-size blocks), depending upon the way chunking is performed.
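
The following Python sketch illustrates the simplest, fixed-length form of this chunking step; the 4 KiB block size is an illustrative assumption, and a content-defined (variable-length) chunker could be substituted without changing the rest of the flow.

```python
def split_into_blocks(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Fixed-length chunking: split a data item into equal-size blocks.

    The final block may be shorter than block_size. A content-defined
    chunker would instead choose boundaries based on the data itself,
    producing variable-length blocks.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"example data item " * 1000)  # list of blocks
```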

At step 104, the method 100 further comprises calculating a strong hash value for each block. Generally, hashing is the process of converting a given key into a new value. A hash function is used to generate the new value according to a mathematical algorithm. The result of the hash function (i.e., the new value) is referred to as the strong hash value or simply, a hash. The strong hash values may also be referred to as fingerprints of data segments (or data blocks), since each hash value may be of the same length but may be unique based on the data block. A strong hash is a value that uniquely describes the data block. As such, two data blocks with the same strong hash are identical with such high probability that in a real practical system they will be considered identical. In other words, the strong hash value is a bit string that represents the data block that is processed. If a given data block is processed by a given hashing algorithm, and the same hashing algorithm is later applied to the same data block, then the same strong hash value is created each time. Thus, if identical copies of data segments arrive, the same strong hash value is generated for all the copies.
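
A minimal sketch of this step, assuming SHA-256 as the strong hash function (the disclosure does not mandate a particular hash algorithm):

```python
import hashlib

def strong_hash(block: bytes) -> bytes:
    """Strong hash (fingerprint) of a block; SHA-256 is an illustrative choice."""
    return hashlib.sha256(block).digest()

# The same block data always yields the same fingerprint, so identical
# copies of a segment produce identical strong hash values.
assert strong_hash(b"same segment") == strong_hash(b"same segment")
```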

At step 106, the method 100 further comprises checking for a duplicate block by checking the calculated strong hash value of each block against a deduplication index. The deduplication index refers to an indexing structure used for database management, in which an index record stores pairs of keys and pointers for every block of the data stored in a data storage system. The deduplication index may be referred to as a sparse index. Further, the calculated strong hash value of each block corresponds to the strong hash value calculated for each block of the data item (i.e., the incoming write request). The calculated strong hash value of each block is checked against the deduplication index for identifying the duplicate block. The checking against the deduplication index corresponds to checking against a weak hash sparse cache. The weak hash sparse cache refers to a small memory footprint that maps a weak hash value into a list of block IDs. The weak hash sparse cache is used to quickly (i.e., with ultra-low latency) find a good set of candidates from a full index (e.g., a weak hash table), such that a single read IO can fetch many relevant block IDs. Alternatively stated, the weak hash sparse cache is populated with only a sub-set of the weak hash values, such that the sub-set provides ultra-low latency for most of the deduplication block IDs. In addition, the weak hash sparse cache can contain multiple IDs (i.e., multiple pointers) for a single weak hash key. Hence, in case of a conflict (i.e., different data blocks with the same weak hash key), the weak hash sparse cache can improve the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication). Further, the weak hash table refers to a full index that maps a weak hash value into a block ID. In order to provide a good deduplication ratio, the size of the weak hash table is large and hence, the whole weak hash table cannot be stored in memory. However, the deduplication index (i.e., the weak hash sparse cache) is a small index and hence, the deduplication index is stored in the memory and is used as an in-memory cache for weak hash values. Thus, when a match is found for the calculated strong hash value of each block against the deduplication index, it is identified as the duplicate block; otherwise, if the match is not found, the block is identified as a non-duplicate block.
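
The following Python sketch shows one plausible shape for such a deduplication index (weak hash sparse cache): an in-memory mapping from a weak hash key to a bounded list of block IDs. The class name, the per-key limit of 5 references, and the eviction of the oldest reference are illustrative assumptions consistent with the examples given later in this description.

```python
class DeduplicationIndex:
    """In-memory weak hash sparse cache: weak hash key -> list of block IDs."""

    def __init__(self, max_ids_per_key: int = 5):
        self._entries: dict[bytes, list[int]] = {}
        self._max_ids_per_key = max_ids_per_key   # bound the size of each entry

    def lookup(self, weak_key: bytes) -> list[int]:
        # Candidate block IDs for this weak hash key; an empty list means the
        # key is not represented in the sparse cache (a possible non-duplicate).
        return self._entries.get(weak_key, [])

    def add(self, weak_key: bytes, block_id: int) -> None:
        ids = self._entries.setdefault(weak_key, [])
        if block_id not in ids:
            ids.append(block_id)                  # multiple IDs per weak hash key
        if len(ids) > self._max_ids_per_key:
            ids.pop(0)                            # drop the oldest reference
```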

In a case where the block is found to be a non-duplicate block, the control moves to steps 108-110; otherwise, if the block is found to be a duplicate block, the control moves to steps 112-118.

At step 108, for any non-duplicate block, the method 100 comprises storing the block data at an address in the data storage system. The non-duplicate block corresponds to the data block for which no matches (or fingerprints) could be found in the data storage system. Alternatively stated, the non-duplicate block corresponds to the data block which cannot be deduplicated because it does not exist in the data storage system. A new block ID is generated for the non-duplicate block and the exact block (i.e., the non-duplicate block) is stored at the address in the data storage system.

At step 110, for any non-duplicate block, the method 100 further comprises updating an ID table with an entry for the block, including the strong hash value, a block ID and a pointer indicating the address of the block data, wherein if the block is a duplicate the pointer indicates a location of the target block. The ID table refers to a full index that maps a block ID into the actual address of the data item. The ID table includes a list of unique data blocks, sorted by the ID number of the block. Once a new data block is found, it is added to the end of the list with a new sequence number, which is the ID number of the data block. Hence, the ID table includes all the block IDs of all the blocks stored in the data storage system. When the new data block is a non-duplicate block, the ID table is updated by the data indexing module of the data storage system. Updating the ID table refers to adding the information regarding the non-duplicate block, such as the block data (i.e., data stored in the new block), the strong hash value corresponding to the non-duplicate block, the block ID, and the pointer indicating the actual address of the non-duplicate block in the data storage system. If the new data block is a duplicate data block, then the pointer indicates a location of the target block (i.e., the data block already stored in the data storage system, which contains the same data as the new data block). Hence, no duplicate copies of the data block are stored at different addresses in the data storage system and beneficially, the storage capacity requirements of the data storage system are reduced.

At step 112, for any duplicate block for which a matching block is found in the data storage system, the method 100 comprises storing, at an address in the data storage system, a reference to the block ID of the matching block, such that the block data for the duplicate block can be found at an address of the matching block. The duplicate block corresponds to the data block for which a matching data block is found in the data storage system. Alternatively stated, the duplicate block corresponds to the data block which is to be deduplicated because a similar data block already exists in the data storage system. When the data block is found to be the duplicate block, the data block is not stored in the data storage system; rather a reference to the block ID of the matching block (i.e., the data block already existing in the data storage system, which contains the same data as the duplicate block) is stored in the data storage system. Here, the reference refers to a pointer to the address of the matching block in the data storage system. Hence, the block data of the duplicate block can be found at the address of the matching block with the help of the stored reference. Thus, no duplicate copies of the data block are stored at different addresses in the data storage system and beneficially, the storage capacity requirements of the data storage system are reduced.
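
A minimal sketch of steps 108-112, using an in-memory list to stand in for the data storage units and an IdTableEntry dataclass as an illustrative shape for an ID table record (names and fields are assumptions for illustration only):

```python
from dataclasses import dataclass

@dataclass
class IdTableEntry:
    block_id: int
    strong_hash: bytes
    address: int                       # address of the block data in storage

storage: list[object] = []             # stands in for the data storage units
id_table: dict[int, IdTableEntry] = {}

def store_non_duplicate(block: bytes, strong: bytes) -> int:
    """Step 108/110: store new block data and add an ID table entry for it."""
    address = len(storage)
    storage.append(block)              # the actual block data is written
    block_id = len(id_table)           # new IDs are assigned sequentially
    id_table[block_id] = IdTableEntry(block_id, strong, address)
    return block_id

def store_duplicate(matching_block_id: int) -> int:
    """Step 112: store only a reference to the matching block's ID, so the
    duplicate's data can be found at the matching block's address."""
    storage.append(("ref", matching_block_id))
    return matching_block_id
```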

At step 114, for any duplicate block, the method 100 further comprises determining one or more common areas of the data storage system having the largest number of addresses which store block data for the data item. The one or more common areas correspond to a list of block IDs that are stored close to one another in the ID table. The ID table includes all the block IDs of all the blocks stored in the data storage system, and refers to a full index that maps the block ID into the actual address of the data item. The one or more common areas may also be referred to as association blocks (e.g., 1 MB of sequential data). The data indexing module determines the one or more common areas which have the largest number of addresses, where most of the block data for the data item is stored. Since the data item is divided into one or more blocks, these blocks may be stored at different addresses in the data storage system. The largest number of addresses, which are close to one another, are selected as one or more common areas because they store the maximum data for the data item.
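
One simple way to realize this step is to group the addresses that hold the data item's blocks into fixed-size areas and keep the most populated ones; the area size and the helper below are illustrative assumptions, not the specific procedure claimed.

```python
from collections import Counter

def find_common_areas(block_addresses: list[int],
                      area_size: int = 256,
                      top_n: int = 1) -> list[int]:
    """Return the indices of the area(s) holding the largest number of the
    addresses that store block data for the data item.

    area_size is illustrative (e.g., 256 blocks of 4 KiB ~ 1 MB association block).
    """
    counts = Counter(address // area_size for address in block_addresses)
    return [area for area, _ in counts.most_common(top_n)]

# Example: most of the item's blocks are stored in area 3.
common_areas = find_common_areas([770, 771, 772, 773, 12, 900], area_size=256)
```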

At step 116, for any duplicate block, the method 100 further comprises selecting one or more representative hash values from the calculated strong hash values of each block. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. A weak hash is so named because it comprises only a portion of the bits of a corresponding strong hash value. For example, for a strong hash value having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits. Furthermore, the one or more representative hash values are selected from the calculated strong hash values of each block in a deterministic manner. For example, selecting two (or more) representative hash values from the calculated strong hash values of each block such that the two (or more) representative hash values have the minimal/maximal hash value. The one or more representative hash values can also be selected by using any other method (e.g., probabilistic method).
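
A sketch of this selection step, assuming a 64-bit weak hash taken from the strong hash and the deterministic rule of keeping the two largest weak hash values (other deterministic or probabilistic rules could be used, as noted above):

```python
import hashlib

def weak_hash(strong: bytes) -> bytes:
    """Illustrative weak hash: 64 bits (8 bytes) taken from the strong hash."""
    return strong[:8]

def select_representatives(strong_hashes: list[bytes], count: int = 2) -> list[bytes]:
    """Deterministically select the `count` largest weak hash values.

    The same set of strong hashes always yields the same representatives,
    so repeated writes of the same data choose the same index entries.
    """
    weak_values = [weak_hash(s) for s in strong_hashes]
    return sorted(weak_values, reverse=True)[:count]

strong_hashes = [hashlib.sha256(bytes([i]) * 4096).digest() for i in range(8)]
representatives = select_representatives(strong_hashes)   # two weak hash keys
```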

At step 118, for any duplicate block, the method 100 further comprises updating the deduplication index with an entry for each representative hash value, wherein each entry includes a pointer to one or more block IDs. For any duplicate block for which a matching block is found in the data storage system, each representative hash value corresponding to the duplicate block is added into the deduplication index (i.e., the weak hash sparse cache). The entry for each representative hash value refers to the pointer set to the one or more block IDs. The step 118 further includes the sub-steps 118A to 118C to determine the one or more block IDs to which the pointer for each representative hash value points.

At step 118A, the method 100 checks if the representative hash value corresponds to a block having an address in the common area. The common area refers to the association block (i.e., one or more common areas) in which most of data for the duplicate block is stored.

In a case where the representative hash value corresponds to a block having an address in the common area, the control moves to step 118B else if the representative hash value corresponds to a block having an address which is not in the common area, the control moves to step 118C.

At step 118B, the representative hash value corresponds to a block having an address in the common area and the pointer indicates the block ID of that block. In this case, the pointer for the representative hash value of the duplicate block points to the block ID of the deduplicate block (i.e., the matching block corresponding to the duplicate block) found in the common area of the data storage system.

At step 118C, the representative hash value corresponds to a block having an address which is not in the common area and the pointer indicates a block ID of a block in one or more common areas. In this case, the pointer for the representative hash value of the duplicate block does not point to the block ID of the deduplicate block; rather it points to a different block ID where most of the data for the specified association block (i.e., one or more common areas) is actually stored in the ID table. The idea is to find as many potential block IDs as possible in a single read IO. In this case, the deduplicate block (i.e., the matching block corresponding to the duplicate block) may be missed because the representative hash value corresponds to the duplicate block whose address is not in the common area, i.e., the deduplicate block is stored at a different location in the ID table and not in the one or more common areas. However, the overall performance of the data storage system is improved because the value (i.e., the pointer) in the deduplication index (i.e., the weak hash sparse cache) refers to the one or more common areas in the ID table in which most of the original data is stored.
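
A sketch of the pointer rule of steps 118A to 118C, assuming the deduplication index is a plain dictionary from weak hash key to a list of block IDs and the common area is given as a range of addresses (both are illustrative simplifications):

```python
def update_dedup_index(dedup_index: dict[bytes, list[int]],
                       representative_to_block: dict[bytes, int],
                       block_addresses: dict[int, int],
                       common_area: range) -> None:
    """For each representative hash value, add a pointer entry to the index.

    Step 118B: if the representative's block lies in the common area, point
    at that block's ID. Step 118C: otherwise point at a block ID inside the
    common area, so one read IO over the common area covers most candidates.
    """
    # Any block whose address falls inside the common area can serve as the
    # redirect target (the sketch assumes at least one such block exists).
    fallback_id = next(bid for bid, addr in block_addresses.items()
                       if addr in common_area)
    for weak_key, block_id in representative_to_block.items():
        in_common = block_addresses[block_id] in common_area
        target = block_id if in_common else fallback_id
        dedup_index.setdefault(weak_key, []).append(target)
```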

In accordance with an embodiment, selecting the representative hash values uses a determinative process. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. A weak hash is so named because it comprises only a portion of the bits of a corresponding strong hash value. For example, for a strong hash value having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits. Furthermore, selecting the representative hash values uses the determinative process. The determinative process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the determinative process always produces the same output from a given starting condition or initial state. The determinative process may also refer to a probabilistic process. For example, selecting two (or more) representative hash values from the calculated strong hash values of each block such that the two (or more) representative hash values have the minimal or maximal hash value.

In accordance with an embodiment, the determinative process comprises selecting the one or more largest hash values. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. Further, selecting the representative hash values uses a determinative process. The determinative process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the determinative process always produces the same output from a given starting condition or initial state. The determinative process may also refer to a probabilistic method. For example, selection of one or more representative hash values from the calculated strong hash values of each block can be done via selecting the one or more largest weak hash values generated from the calculated strong hash values of each block.

In accordance with an embodiment, the determinative process comprises selecting one or more hash values matching a predetermined bit-pattern. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. Further, selecting the representative hash values uses a determinative process. The determinative process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the determinative process always produces the same output from a given starting condition or initial state. The determinative process may also refer to a probabilistic method. For example, selection of one or more representative hash values from the calculated strong hash values of each block can be done via selecting the one or more weak hash values generated from the calculated strong hash values of each block, such that the one or more weak hash values match the predetermined bit-pattern.

In accordance with an embodiment, two representative hash values are selected. The representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. For example, selecting two representative hash values from the calculated strong hash values of each block such that the two representative hash values have the minimal/maximal hash value. The two representative hash values may also be selected by any other method (e.g., probabilistic method). Each representative hash value is added to the deduplication index (i.e., the weak hash sparse cache). Since the deduplication index (i.e., the weak hash sparse cache) is populated with only a sub-set of the weak hash values, two representative hash values are selected. Hence, the deduplication index (i.e., the weak hash sparse cache) provides a small memory footprint and can be used to quickly (i.e., with ultra-low latency) fetch many relevant block IDs in a single read IO.

In accordance with an embodiment, updating the deduplication index includes calculating a weak hash for each representative hash value. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. A weak hash is so named because it comprises only a portion of the bits of a corresponding strong hash value. For example, for a strong hash value having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits. Further, the deduplication index (i.e., the weak hash sparse cache) is updated by adding an entry for each representative hash value, i.e., the weak hash corresponding to the representative hash value.

In accordance with an embodiment, the deduplication index is stored in a memory, and the ID table is stored in a disk storage. The deduplication index (i.e., the weak hash sparse cache) refers to a small memory footprint, which is used to quickly (i.e., with ultra-low latency) fetch the list of block IDs corresponding to the weak hash retrieved from a full index (e.g., weak hash table) in a single read IO. Alternatively stated, the deduplication index (i.e., the weak hash sparse cache) is a sub-set of the weak hashes of a full index (e.g., weak hash table). Hence, the size of the deduplication index is small. Thus, the deduplication index is stored in the memory of the data storage system. On the other hand, the ID table refers to a full index that maps the block ID of the strong hash into the actual address of the data. Alternatively stated, the ID table includes all the block IDs of all the blocks stored in the data storage system. Hence, the size of the ID table is large. Thus, the ID table is stored in the disk storage (e.g., one or more data storage units) of the data storage system.

In accordance with an embodiment, checking for a duplicate block comprises selecting one or more representative hash values from the calculated strong hash values of each block; checking the representative hash values against the deduplication index for matches; and for one or more positive matches, retrieving the strong hash values corresponding to the one or more block IDs indicated by the pointer and checking for a match against the incoming data item. The one or more representative hash values are selected by generating a list of weak hash values from the calculated strong hash values of each block and selecting one or more of the weak hash values as representative hash values. Further, the one or more representative hash values are checked against the deduplication index for matches. The checking against the deduplication index corresponds to checking against a weak hash sparse cache. The weak hash sparse cache refers to a small memory footprint that maps a weak hash value into a list of block IDs. The weak hash sparse cache quickly (i.e., with ultra-low latency) fetches a list of relevant block IDs from a full index (e.g., a weak hash table) in a single read IO. The one or more representative hash values are checked against the weak hashes corresponding to the list of block IDs of the weak hash sparse cache (i.e., the deduplication index). Further, for one or more positive matches, the strong hash values corresponding to the one or more block IDs indicated by the pointer are retrieved from the ID table at the disk addresses of those strong hash values. The retrieved strong hash values are checked for a match against the incoming data by checking them against the calculated strong hash values for each block of the incoming data. Hence, if a match is found, the incoming data is identified as a duplicate block and is not stored explicitly in the data storage system.
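
A minimal sketch of this duplicate check, assuming the dictionary-shaped deduplication index and the IdTableEntry record from the earlier sketches (illustrative names, not structures defined in this disclosure):

```python
def check_for_duplicate(strong: bytes,
                        dedup_index: dict[bytes, list[int]],
                        id_table: dict) -> int | None:
    """Return the matching block ID if the incoming block is a duplicate, else None."""
    weak_key = strong[:8]                        # illustrative 64-bit weak hash
    for candidate_id in dedup_index.get(weak_key, []):
        # Retrieve the strong hash stored for the candidate block ID (in
        # practice nearby IDs are fetched in the same read IO) and compare
        # it against the incoming block's strong hash.
        if id_table[candidate_id].strong_hash == strong:
            return candidate_id
    return None                                  # treated as a non-duplicate block
```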

In accordance with an embodiment, checking for a match against the incoming data item comprises fetching block data from a target address corresponding to the block ID and from a plurality of associated addresses. The deduplication index (i.e., weak hash sparse cache) is checked against the incoming data item (i.e., data files, database files, I/O writes, and the like) to find a match between the weak hash values of the deduplication index and the weak hashes corresponding to the calculated strong hash values for each block of the incoming data item. Since a data item is divided into one or more blocks, each block may be stored at a different address in the data storage system. In other words, the data of a data item may be stored in one or more blocks, which may be stored at different addresses in the data storage system. Hence, while checking for a match against the incoming data item, the block data is fetched from the target address corresponding to the block ID of the data block, as well as from the plurality of associated addresses. The plurality of associated addresses refers to one or more common areas. The one or more common areas correspond to a list of block IDs that are stored close to one another in the ID table. The ID table includes all the block IDs of all the blocks stored in the data storage system and refers to a full index that maps the block ID into the actual address of the data item. Further, the largest number of addresses, which are close to one another, are selected as one or more common areas (i.e., the plurality of associated addresses) because they store the maximum data for the data item. Thus, by virtue of fetching block data from the target address corresponding to the block ID and from a plurality of associated addresses, the maximum number of data blocks can be fetched for checking against the incoming data item.

In accordance with an embodiment, the plurality of associated addresses are addresses within a certain distance of the target address. The plurality of associated addresses refers to one or more common areas, which correspond to a list of block IDs that are stored close to one another in the ID table. The largest number of addresses, which are close to one another, are selected as one or more common areas (i.e., the plurality of associated addresses) because they store the maximum data for the data item. The largest number of addresses corresponds to addresses within a certain distance of the target address. Hence, the size of the plurality of associated addresses can be bounded (e.g., up to 5 references), and the oldest reference can be dropped if the limit is exceeded. Beneficially, the size of the deduplication index (i.e., the weak hash sparse cache) remains small and the deduplication index can be stored in the memory. Thus, the number of disk-read accesses is reduced and consequently the latency in the data storage system is minimized. Moreover, by virtue of the plurality of associated addresses, the throughput of the data storage system is increased and the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) is improved.

In accordance with an embodiment, the plurality of associated addresses are addresses following the target address. The target address refers to the address of the block ID corresponding to the matching block found for the incoming data item in the data storage system. The plurality of associated addresses refers to one or more common areas, which correspond to a list of block IDs that are stored close to one another in the ID table. The plurality of associated addresses is selected such that the addresses follow the target address because they store the maximum data for the data item. Hence, the plurality of associated addresses comprises addresses which are near the target address in the ID table. The size of the plurality of associated addresses can be bounded (e.g., up to 5 references), and the oldest reference can be dropped if the limit is exceeded. Beneficially, the size of the deduplication index (i.e., the weak hash sparse cache) remains small and the deduplication index can be stored in the memory. Thus, the number of disk-read accesses is reduced and consequently the latency in the data storage system is minimized. Moreover, by virtue of the plurality of associated addresses, the throughput of the data storage system is increased and the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) is improved.

In accordance with an embodiment, updating the deduplication index with an entry for a representative hash value comprises, if an entry for an identical representative hash value exists, adding an additional pointer to one or more block IDs. For any duplicate block for which a matching block is found in the data storage system, each representative hash value corresponding to the duplicate block is added into the deduplication index (i.e., the weak hash sparse cache). The entry for each representative hash value refers to the pointer set to the one or more block IDs. Further, if an identical representative hash value exists, an additional pointer to one or more block IDs is added in the deduplication index (i.e., the weak hash sparse cache). The additional pointer to one or more block IDs corresponds to the pointer to one or more IDs where most of the data for the specified association block is actually stored in the ID table. The idea is to find as many potential block IDs as possible in a single read IO. Thus, the number of disk-read accesses is reduced and consequently the latency in the data storage system is minimized. Moreover, by virtue of the deduplication index, the throughput of the data storage system is increased and the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) is improved.

The method 100 uses the data indexing module to provide an improved data deduplication solution by efficiently indexing a data item in the data storage system with the help of the deduplication index (i.e., weak hash sparse cache). Since the deduplication index is stored in the memory, the number of disk-read accesses is reduced, which consequently minimizes the latency in fetching the IO operations of the data storage system. Further, the deduplication index of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication, online deduplication, and offline deduplication) without degrading the performance (i.e., latency, throughput, and deduplication ratio) for any scenario. Moreover, the method 100 allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system with a new block ID. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather a reference (i.e., a pointer) to the matching data is stored. Thus, the method 100 allows an efficient utilization of the storage capacity of the data storage system.

The steps 102 to 118 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.

In yet another aspect, the present disclosure provides a computer readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method 100 of the aforementioned aspect. The computer readable medium refers to a non-transitory computer-readable storage medium. Examples of implementation of the computer-readable media include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.

FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements of FIGs. 1A and 1B, collectively. With reference to FIG. 2, there is shown a block diagram 200 of a data storage system 202. The data storage system 202 includes a control circuitry 204, a transceiver 206, one or more data storage units 208, a data indexing module 210, and a memory 212. The data indexing module 210 further includes one or more processors 210A.

In yet another aspect, the present disclosure provides a data storage system 202. The data storage system 202 comprises: one or more data storage units 208; and the data indexing module 210.

The data storage system 202 refers to a computer storage system that stores information (i.e., data item such as data files or database files, I/O writes etc.) in a storage medium, such as a storage disk. Examples of data storage system 202 include, but are not limited to, a secondary storage system, a cloud server, a file storage system, a block storage system, an object storage system, or a combination thereof.

The control circuitry 204 includes a logic circuitry that may be communicatively coupled to the transceiver 206, the one or more data storage units 208, the data indexing module 210, and the memory 212. In an implementation, the one or more processors 210A may execute its operation independently of the control circuitry 204 or under control of the control circuitry 204. Examples of the control circuitry 204 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry.

The transceiver 206 includes suitable logic, circuitry, and/or interfaces that are configured to transmit/receive IO read/write operations. Examples of the transceiver 206 include, but are not limited to, a transmitter/receiver antenna, an Internet-of-Things (IoT) controller, a network interface, customized hardware for wireless communication, or the like.

The data storage system 202 comprises one or more data storage units 208. The one or more data storage units 208 include suitable logic, circuitry, and/or interfaces that may be configured to store machine code and/or instructions executable by the data storage system 202 and to store data received from various users. The one or more data storage units 208 refer to one or more storage devices or one or more storage arrays. Other examples of implementation of the one or more data storage units 208 may include, but are not limited to, Network Attached Storage (NAS), SSD Flash Drive Arrays, Hybrid Flash Arrays, Cloud Storage, and the like.

In yet another aspect, the present disclosure provides a data indexing module 210 for a data storage system 202. The module comprises one or more processors 210A configured to execute the method of aforementioned aspect.

The data storage system 202 further comprises the data indexing module 210. The data indexing module 210 refers to a computer component which uses a data structure technique to quickly retrieve records from a database file. The data indexing module 210 may also simply be referred to as a module or a data indexing circuitry. The data indexing module 210 includes suitable logic, circuitry, and/or interfaces that may be configured to execute the method 100 (of FIGs. 1A and 1B, collectively).

The data indexing module 210 includes one or more processors 210A. The one or more processors 210A may be configured to monitor and execute operations of the data indexing module 210. The one or more processors 210A may be configured to index a data item in the data storage system 202 so as to ensure that duplicate copies of the data item are not created in the data storage system 202. Examples of the one or more processors 210A include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry.

The memory 212 includes suitable logic, circuitry, and/or interfaces that may be configured to store the deduplication index as well as machine code and/or instructions executable by the data storage system 202. Examples of implementation of the memory 212 may include, but are not limited to, Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), Network Attached Storage (NAS), SSD Flash Drive Arrays, Hybrid Flash Arrays, Cloud Storage, and the like.

In operation, the one or more processors 210A of the data indexing module 210 execute the method 100 (of FIGs. 1A and 1B, collectively) by efficiently indexing a data item in the data storage system 202 with the help of the deduplication index (i.e., weak hash sparse cache). The operations of the data indexing module 210 include dividing the data item into one or more blocks; calculating a strong hash value for each block; and checking for a duplicate block by checking the calculated strong hash value of each block against the deduplication index (i.e., the weak hash sparse cache). For any non-duplicate block, the operations of the data indexing module 210 further include storing the block data at an address in the data storage system 202; and updating an ID table with an entry for the block, including the strong hash value, a block ID and a pointer indicating the address of the block data, wherein if the block is a duplicate the pointer indicates a location of the target block. For any duplicate block for which a matching block is found in the data storage system 202, the operations of the data indexing module 210 further include storing, at an address in the data storage system 202, a reference to the block ID of the matching block, such that the block data for the duplicate block can be found at an address of the matching block; determining one or more common areas of the data storage system 202 having the largest number of addresses which store block data for the data item; selecting one or more representative hash values from the calculated strong hash values of each block; and updating the deduplication index with an entry for each representative hash value, wherein each entry includes a pointer to one or more block IDs. Further, if the representative hash value corresponds to a block having an address in the common area, the pointer indicates the block ID of that block; and if the representative hash value corresponds to a block having an address which is not in the common area, the pointer indicates a block ID of a block in one or more common areas.
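
Purely as an illustrative, simplified sketch of this flow (the helper names, the fixed 4 KB block size, the SHA-1 strong hash, the 64-bit weak hash and the omission of the common-area selection are all assumptions for readability, not the claimed implementation), the indexing operation may be written as:

import hashlib

BLOCK_SIZE = 4096  # fixed-length blocks assumed for simplicity

def split_into_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def strong_hash(block):
    return hashlib.sha1(block).digest()  # 160-bit strong hash (assumed choice)

def weak_hash(strong):
    return strong[:8]  # keep, e.g., 64 of the 160 bits

def index_data_item(data, dedup_index, id_table):
    # dedup_index: weak hash -> list of block IDs (the weak hash sparse cache)
    # id_table:    block ID  -> {"hash": ..., "ref_count": ..., "data": ...}
    written_ids, hashes = [], []
    for block in split_into_blocks(data):
        h = strong_hash(block)
        hashes.append(h)
        # check for a duplicate block via the deduplication index
        match = next((bid for bid in dedup_index.get(weak_hash(h), [])
                      if id_table[bid]["hash"] == h), None)
        if match is None:                      # non-duplicate: store the block data
            bid = len(id_table)
            id_table[bid] = {"hash": h, "ref_count": 1, "data": block}
        else:                                  # duplicate: count one more reference
            bid = match
            id_table[bid]["ref_count"] += 1
        written_ids.append(bid)
    # update the deduplication index with representative hashes (here the two
    # largest); the common-area selection of the target block ID is omitted
    for h in sorted(set(hashes))[-2:]:
        pointers = dedup_index.setdefault(weak_hash(h), [])
        bid = written_ids[hashes.index(h)]
        if bid not in pointers:
            pointers.append(bid)
    return written_ids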

In accordance with an embodiment, the deduplication index is stored in the memory 212, and the ID table is stored in a disk storage (e.g., one or more data storage units 208). In accordance with an embodiment, checking for a duplicate block comprises selecting one or more representative hash values from the calculated strong hash values of each block; checking the representative hash values against the deduplication index for matches; and for one or more positive matches, retrieving the strong hash values corresponding to the one or more block IDs indicated by the pointer and checking for a match against the incoming data item.

In accordance with an embodiment, checking for a match against the incoming data item comprises fetching block data from a target address corresponding to the block ID and from a plurality of associated addresses. The plurality of associated addresses are addresses within a certain distance of the target address. Further, the plurality of associated addresses are addresses following the target address. In accordance with an embodiment, selecting the representative hash values uses a determinative process. The determinative process comprises selecting the one or more largest hash values. The determinative process may also comprise selecting one or more hash values matching a predetermined bit-pattern.

In accordance with an embodiment, updating the deduplication index includes calculating a weak hash for each representative hash value. Further, updating the deduplication index with an entry for a representative hash value comprises, if an entry for an identical representative hash value exists, adding an additional pointer to one or more block IDs.

Thus, the data indexing module 210 provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system 202 with a new block ID. On the other hand, the duplicate data is not stored explicitly in the data storage system 202; rather a reference (i.e., a pointer) to the matching data is stored. Thus, the data indexing module 210 helps in an efficient utilization of the storage capacity of the data storage system 202.

FIG. 3 is an illustration that depicts various operations of different layers of a data deduplication solution, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements of FIGs. 1A, 1B, and 2. With reference to FIG. 3, there is shown a process 300 that depicts various operations of different layers of a data deduplication solution. There is further shown an incoming write request 302, a strong hash cache 304, a weak hash sparse cache 306, a weak hash table 308, and an ID table 310. The strong hash cache 304 further includes a strong hash 304A, and a block ID 304B. The weak hash sparse cache 306 further includes a weak hash 306A, and a list of block IDs 306B. The weak hash table 308 further includes a weak hash 308A, and a block ID 308B. The ID table 310 further includes a block ID 310A, a strong hash 310B, a ref-count 310C, and a disk address 310D. The process 300 includes operations 312 to 320. The operations 312 to 320 are performed by the one or more processors 210A of the data indexing module 210 (of FIG. 2). The process 300 may correspond to the method 100 (of FIGs. 1A and 1B, collectively).
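
For illustration only, the four layers can be pictured as plain Python mappings; the field layout below is an assumption that merely mirrors the reference numerals of FIG. 3 and is not the claimed implementation:

strong_hash_cache = {}       # strong hash 304A -> block ID 304B           (layer 1, in memory)
weak_hash_sparse_cache = {}  # weak hash 306A   -> list of block IDs 306B  (layer 2, in memory)
weak_hash_table = {}         # weak hash 308A   -> block ID 308B           (layer 3, mostly on disk)
id_table = {}                # block ID 310A    -> {"strong_hash": 310B,
                             #                      "ref_count": 310C,
                             #                      "disk_address": 310D}  (layer 4, on disk)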

The incoming write request 302 refers to a request to store a data item (i.e., data files, database files, I/O writes, and the like) in the data storage system 202. The strong hash cache 304 refers to a small index that maps the strong hash 304A into the block ID 304B. The strong hash cache 304 is stored in the memory 212 and is used as an in-memory cache for strong hashes. The strong hash cache 304 corresponds to a first layer of the data deduplication solution.

The weak hash sparse cache 306 refers to an index with a small memory footprint, which is used to quickly (i.e., with ultra-low latency) fetch the list of block IDs 306B corresponding to the weak hash 306A retrieved from the weak hash table 308 in a single read IO. Hence, the number of disk-read accesses is reduced, and consequently the latency in the data storage system 202 is minimized. The weak hash sparse cache 306 corresponds to a second layer of the data deduplication solution. In the present disclosure, the weak hash sparse cache 306 is the main index used for data deduplication and thus, the weak hash sparse cache 306 is also referred to as the deduplication index. Further, the weak hash sparse cache 306 is a sub-set of the weak hashes of the weak hash table 308 and thus, the size of the weak hash sparse cache 306 is small. In addition, the weak hash sparse cache 306 can contain multiple IDs (i.e., multiple pointers) for a single weak hash key (e.g., weak hash 306A). Hence, in case of a conflict (i.e., different data blocks with the same weak hash key), the weak hash sparse cache 306 can improve the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication).

The weak hash table 308 refers to a full index that maps the weak hash 308A into the block ID 308B. The weak hash table 308 corresponds to a third layer of the data deduplication solution. In order to provide a good deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication), the size of the weak hash table 308 is large and hence, the whole weak hash table 308 cannot be stored in the memory 212.

The ID table 310 refers to a full index that maps the block ID 310A of a block, together with its strong hash 310B, into the actual address of the data (i.e., the disk address 310D). The ID table 310 corresponds to a fourth layer of the data deduplication solution. The ID table 310 includes all the block IDs of all the blocks stored in the data storage system 202. Hence, the size of the ID table 310 is large. Thus, the ID table 310 is stored on the disk storage (e.g., one or more data storage units 208). Further, the ID table 310 keeps a count of the references (i.e., pointers) for duplicate blocks in the ref-count 310C.

At operation 312, the incoming write request 302 is divided into one or more blocks and a strong hash value is calculated for each block. The incoming write request 302 may include a data item (i.e., data files, database files, I/O writes, and the like) which is very large in size. In order to efficiently manage such a large data item (i.e., the incoming write request 302), the data item is broken down into one or more blocks. The dividing of the data item into one or more blocks may result in blocks of fixed length (i.e., equal size blocks) or variable length (i.e., unequal size blocks), depending upon the dividing algorithm performed. Further, a strong hash value is calculated for each block of the data item (i.e., the incoming write request 302). The calculated strong hash value for each block refers to a fingerprint of the corresponding block, which uniquely describes the corresponding block of the data item. In other words, the calculated strong hash value for each block is a bit string that represents the corresponding block of the data item.
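
One possible dividing algorithm for variable-length blocks is content-defined chunking, in which a block boundary is cut wherever a rolling value over the data matches a fixed bit pattern; the sketch below is only one assumed example of such an algorithm (the mask, size bounds and the simple rolling function are illustrative choices), not the dividing algorithm required by the present disclosure:

def variable_length_blocks(data, mask=0x0FFF, min_size=2048, max_size=16384):
    # Cut a new block when the low bits of a simple rolling value hit the mask,
    # so boundaries depend on content rather than on absolute offsets.
    blocks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            blocks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        blocks.append(data[start:])
    return blocks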

At operation 314, the calculated strong hash value for each block is checked against the strong hash cache 304. The checking of the calculated strong hash value for each block against the strong hash cache 304 corresponds to checking the calculated strong hash value for each block against the strong hash 304A of the strong hash cache 304. If a match is found between the calculated strong hash value and the strong hash 304A, then the corresponding block ID (i.e., the block ID 304B) is returned to the ID table 310 for update. On the other hand, if a match is not found between the calculated strong hash value and the strong hash 304A, then the calculated strong hash value for each block is sent to the weak hash sparse cache 306 for further processing. Hence, if a match is found, the control moves from operation 314 to operation 320; otherwise, the control moves to operation 316.

At operation 316, the calculated strong hash value for each block is checked against the weak hash sparse cache 306. The checking of the calculated strong hash value for each block against the weak hash sparse cache 306 corresponds to checking one or more weak hashes of the calculated strong hash value for each block against the weak hash 306A of the weak hash sparse cache 306. A weak hash is so named because it comprises only a portion of the bits of the corresponding calculated strong hash value. For example, for a calculated strong hash value having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits. Further, the one or more weak hashes for each association block (e.g., 1 MB of sequential data of the incoming write request 302) are selected from the calculated strong hash values of each block in a determinative manner. For example, two weak hashes may be selected from the calculated strong hash values of each block such that the two weak hashes have the minimal and maximal hash values. Further, if a match is found between the one or more weak hashes of the calculated strong hash value for each block and the weak hash 306A of the weak hash sparse cache 306, then the one or more weak hashes of the calculated strong hash value for each block are selected and directly sent to the ID table 310 for update. Moreover, the blocks corresponding to the one or more weak hashes that lie in the same area (e.g., all blocks in the same 4 KB) are loaded into the strong hash cache 304. On the other hand, if a match is not found, then the one or more weak hashes of the calculated strong hash value for each block are ignored and the calculated strong hash value for each block is sent to the weak hash table 308 for further processing. Hence, if a match is found, the control moves from operation 316 to operation 320; otherwise, the control moves to operation 318.
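
A minimal sketch of deriving a weak hash and selecting representative hashes per association block is given below; the 64-bit slice and the choice of the minimal and maximal values follow the examples above, while the function names and the slicing of the leading bytes are assumptions:

def derive_weak_hash(strong_hash_value):
    # keep 64 of the (e.g.) 160 bits of the strong hash
    return strong_hash_value[:8]

def select_representative_hashes(strong_hashes_of_association_block):
    # determinative selection: e.g., the minimal and maximal strong hash values
    return [min(strong_hashes_of_association_block),
            max(strong_hashes_of_association_block)]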

In accordance with an embodiment, checking for a duplicate block comprises selecting one or more representative hash values from the calculated strong hash values of each block; checking the representative hash values against the deduplication index (i.e., the weak hash sparse cache 306) for matches; and for one or more positive matches, retrieving the strong hash values corresponding to the one or more block IDs indicated by the pointer and checking for a match against the incoming write request 302.

In accordance with an embodiment, checking for a match against the incoming write request 302 comprises fetching block data from a target address corresponding to the block ID and from a plurality of associated addresses. The plurality of associated addresses are addresses within a certain distance of the target address. Further, the plurality of associated addresses are addresses following the target address.

In accordance with an embodiment, selecting the representative hash values uses a determinative process. The determinative process comprises selecting the one or more largest hash values. The determinative process may also comprise selecting one or more hash values matching a predetermined bit-pattern.

In accordance with an embodiment, updating the deduplication index (i.e., the weak hash sparse cache 306) includes calculating a weak hash for each representative hash value. Further, updating the deduplication index (i.e., the weak hash sparse cache 306) with an entry for a representative hash value comprises, if an entry for an identical representative hash value exists, adding an additional pointer to one or more block IDs.

At operation 318, the calculated strong hash value for each block is checked against the weak hash table 308. The checking of the calculated strong hash value for each block against the weak hash table 308 corresponds to checking one or more weak hashes of the calculated strong hash value for each block against the weak hash 308A of the weak hash table 308. If a match is found between the one or more weak hashes of the calculated strong hash value for each block and the weak hash 308A of the weak hash table 308, then the corresponding strong hash is retrieved from the ID table 310 for comparison. On the other hand, if a match is not found, then the data block is identified as a non-duplicate block and thus, the data block cannot be deduplicated. A new block ID is generated for the data block (i.e., the non-duplicate block) and the exact data block is stored in the data storage system 202.
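
As an illustrative sketch of this fallback only (the names, the integer block IDs and the store function passed as a parameter are assumptions for readability, not the claimed implementation):

def resolve_via_weak_hash_table(weak_hash_table, id_table, weak, strong, block, store_block):
    candidate = weak_hash_table.get(weak)
    if candidate is not None and id_table[candidate]["strong_hash"] == strong:
        return candidate                       # duplicate: reuse the matching block ID
    new_id = len(id_table)                     # non-duplicate: generate a new block ID
    id_table[new_id] = {"strong_hash": strong, "ref_count": 1,
                        "disk_address": store_block(block)}  # store the exact data block
    return new_id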

At operation 320, the process 300 comprises updating the ID table 310. Updating the ID table 310 corresponds to updating the ID table 310 with an entry for the block (i.e., non-duplicate block or duplicate block). The entry for the block includes the block ID 310A, the strong hash 310B, a pointer with ref-count 310C, and the disk address 310D. The pointer indicates the disk address 310D of the block. If the block is a duplicate block, then the pointer indicates a location of the matching block corresponding to the duplicate block. On the other hand, if the block is a non-duplicate block, then the disk address 310D represents the actual address of the block.
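
For operation 320, a minimal sketch of the ID-table update could look as follows (the field names are assumptions mirroring the reference numerals; this is not the claimed layout):

def update_id_table(id_table, block_id, strong_hash_value, disk_address, matching_block_id=None):
    if matching_block_id is not None:
        # duplicate block: the pointer indicates the location of the matching block
        id_table[block_id] = {"strong_hash": strong_hash_value, "ref_count": 1,
                              "disk_address": id_table[matching_block_id]["disk_address"]}
        id_table[matching_block_id]["ref_count"] += 1
    else:
        # non-duplicate block: the disk address is the actual address of the block
        id_table[block_id] = {"strong_hash": strong_hash_value, "ref_count": 1,
                              "disk_address": disk_address}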

In conventional systems, in an example, the typical main flow of a conventional data deduplication solution is receiving a new write request, segmenting the data using any segmentation algorithm, calculating a strong hash and a weak hash for each segment, and searching the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block ID is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated and a new block ID is generated. However, since the given weak hash table is large and most of the given weak hash table is stored on the disk, the search requires read IOs from the disk. Hence, there is a significant impact on the latency as well as the read IO throughput. Moreover, a single entry in the given weak hash table for each weak hash key creates a conflict (i.e., different data blocks with the same weak hash key), and leads to degradation of the deduplication ratio. Thus, the conventional data deduplication solutions have only three data deduplication layers (i.e., the given strong hash cache, the given weak hash table, and the given ID table) and use the given weak hash table as the main index.

In contrast to the conventional systems that have only three layers, the data deduplication solution of the present disclosure includes another layer (i.e., the weak hash sparse cache 306) on top of the existing data deduplication layers. The deduplication index (i.e., the weak hash sparse cache 306) of the present disclosure reduces the number of disk-read accesses and consequently, the latency in the data storage system 202 is minimized.

Thus, the operations 312 to 320 of the process 300 correspond to the method for indexing the data item in the data storage system 202, which provides an improved data deduplication solution with the help of the deduplication index (i.e., the weak hash sparse cache 306). The data deduplication solution of the present disclosure can be efficiently used for multiple scenarios (i.e., source-based deduplication, online deduplication, and offline deduplication) without degrading the performance (i.e., latency, throughput, and deduplication ratio) for any scenario. Further, since the deduplication index (i.e., the weak hash sparse cache 306) is populated with only a sub-set of the full index (i.e., the weak hash table 308), the deduplication index provides ultra-low latency for most of the deduplication block IDs. In addition, the deduplication index (i.e., the weak hash sparse cache 306) can contain multiple IDs (i.e., multiple pointers) for a single weak hash key. Hence, in case of a conflict (i.e., different data blocks with the same weak hash key), the deduplication index of the data deduplication solution can improve the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication).

The operations 312 to 320 are only illustrative and other alternatives can also be provided where one or more operations are added, one or more operations are removed, or one or more operations are provided in a different sequence without departing from the scope of the claims herein.

FIG. 4 is an exemplary illustration of a deduplication index, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements of FIGs. 1A, 1B, 2, and 3. With reference to FIG. 4, there is shown an exemplary scenario 400 of a deduplication index. There is further shown a weak hash sparse cache 402, and an ID table 404. The weak hash sparse cache 402 further includes a weak hash 402A, and a list of block IDs 402B. The ID table 404 further includes a block ID 404A, a strong hash 404B, a ref-count 404C, and a disk address 404D. The exemplary scenario 400 includes exemplary operations 406 to 412.

The weak hash sparse cache 402 refers to an index with a small memory footprint, which is used to quickly (i.e., with ultra-low latency) fetch the list of block IDs 402B corresponding to the weak hash 402A in a single read IO. In the present disclosure, the weak hash sparse cache 402 is the main index used for data deduplication and thus, the weak hash sparse cache 402 is also referred to as the deduplication index. In addition, the weak hash sparse cache 402 can contain multiple IDs (i.e., multiple pointers) for a single weak hash key (e.g., the weak hash 402A). In an implementation, the weak hash 402A of the weak hash sparse cache 402 may include weak hash keys, such as “0xA01B”, “0xA02C”, and “0xA03D”. The list of block IDs 402B for the weak hash key “0xA01B” may include IDs such as “ID 1” and “ID 100”. Similarly, the list of block IDs 402B for the weak hash key “0xA02C” may include an ID, such as “ID 200”, and for the weak hash key “0xA03D” an ID, such as “ID 300”.

The ID table 404 refers to a full index that maps the block ID 404A of a block, together with its strong hash 404B, into the actual address of the data (i.e., the disk address 404D). The ID table 404 includes all the block IDs of all the blocks stored in the data storage system 202. Hence, the size of the ID table 404 is large. Thus, the ID table 404 is stored on the disk (e.g., one or more data storage units 208). Further, the ID table 404 keeps a count of the references (i.e., pointers) for duplicate blocks in the ref-count 404C. In an implementation, the ID table 404 may include all the IDs (e.g., 1 to 300) for the blocks stored in the data storage system 202.

In a first example, at operation 406, storing of non-deduplicate data is executed, where the “ID 1” for the weak hash key “0xA01B” corresponds to the ID of a non-duplicate block. Hence, the exact block (i.e., the non-duplicate block) is stored in the data storage system 202 and the ID table 404 is updated with the exact ID which the block refers to, i.e., the “ID 1”.

At operation 408, storing of another non-deduplicate data is executed in another example, where the “ID 100” for the weak hash key “0xA01B” corresponds to the ID of a non-duplicate block. Hence, the exact block (i.e., the non-duplicate block) is stored in the data storage system 202 and the ID table 404 is updated with the exact ID which the block refers to, i.e., the “ID 100”.

At operation 410, storing of another non-deduplicate data is executed in yet another example, where the “ID 200” for the weak hash key “0xA02C” corresponds to the ID of a non-duplicate block. Hence, the exact block (i.e., the non-duplicate block) is stored in the data storage system 202 and the ID table 404 is updated with the exact ID which the block refers to, i.e., the “ID 200”. Thus, for non-deduplicate sequential data, the IDs for the data are stored close to one another in the ID table 404 and hence, a single IO read can retrieve a bunch of relevant IDs together.

At operation 412, storing of deduplicate data is executed in another example, where the “ID 300” for the weak hash key “0xA03D” corresponds to the ID of a duplicate block. Here, the exact block (i.e., the duplicate block) is not stored in the data storage system 202; rather a reference (i.e., a pointer) to the matching block corresponding to the duplicate block is stored. The ID table 404 maps the “ID 300” of the duplicate block to the ID of the matching block, i.e., the “ID 200”. In this case, the weak hash key (i.e., “0xA03D”) may point to the deduplicate block ID (i.e., “ID 300”), but the reference is set to a different block ID (i.e., “ID 200”) where most of the data for the specified association block is actually stored in the ID table 404. Thus, for deduplicate data, the values in the weak hash sparse cache 402 refer to the area in the ID table 404 in which most of the original data is stored. The idea is to find as many potential block IDs as possible in a single read IO. In this case, the deduplicate block may be missed because it is stored at a different location in the ID table 404, but the overall performance of the data storage system 202 is improved.
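
For illustration, the scenario of FIG. 4 can be restated as plain Python data; the field names and address placeholders below are assumptions, while the weak hash keys and block IDs follow the example above:

weak_hash_sparse_cache_402 = {
    "0xA01B": [1, 100],   # two references for one weak hash key (conflict case)
    "0xA02C": [200],
    "0xA03D": [300],
}
id_table_404 = {
    1:   {"reference": None, "disk_address": "addr_1"},    # operation 406: non-deduplicate data
    100: {"reference": None, "disk_address": "addr_100"},  # operation 408: non-deduplicate data
    200: {"reference": None, "disk_address": "addr_200"},  # operation 410: non-deduplicate data
    300: {"reference": 200,  "disk_address": None},        # operation 412: reference to "ID 200"
}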

For both non-deduplicate and deduplicate scenarios, in case of a conflict (i.e., different data blocks with the same weak hash key), multiple references to the ID table 404 can be added into the weak hash sparse cache 402 for a single weak hash key and thus, the deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication) is improved. Further, the size of a single list per key can be bounded (e.g., up to 5 references), and the oldest reference can be dropped in case the limit is exceeded.
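
A minimal sketch of such a bounded per-key list is shown below (the bound of five follows the example above; the names and the drop-oldest policy expressed with a plain list are assumptions):

MAX_REFERENCES = 5  # e.g., up to 5 references per weak hash key

def add_reference(weak_hash_sparse_cache, weak_hash_key, block_id):
    refs = weak_hash_sparse_cache.setdefault(weak_hash_key, [])
    refs.append(block_id)
    if len(refs) > MAX_REFERENCES:
        refs.pop(0)  # drop the oldest reference when the limit is exceeded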

Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “have”, “is” used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.