Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHOD AND APPARATUS FOR QUOTING DATA ON MAGNETIC TAPE STORING DEDUPLICATED DATA
Document Type and Number:
WIPO Patent Application WO/2024/056163
Kind Code:
A1
Abstract:
To write input data to a magnetic tape storing deduplicated data, a sparse index of the deduplicated data is searched for all weak hash, WH, records matching with weak hashes of segments of the input data. For each matched WH record, determined are strong hash, SH, areas and locations on the magnetic tape storing quoted data pointed by the SH areas. All routes of reading the quoted data are determined by selecting one of the determined locations for each of the matched WH records and adding to each route a head of tape location. A new part of the input data is determined using a minimal route of reading and the new part is written to the head of tape location. Finally, the indexes are updated and a metadata is updated with locations of the new part and the quoted data in accordance with the minimal route of reading.

Inventors:
KUVENT AVIV (DE)
NATANZON ASSAF (DE)
TOAFF YAIR (DE)
ZACH IDAN (DE)
STERNBERG MICHAEL (DE)
Application Number:
PCT/EP2022/075545
Publication Date:
March 21, 2024
Filing Date:
September 14, 2022
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HUAWEI TECH CO LTD (CN)
KUVENT AVIV (DE)
International Classes:
G06F3/06
Foreign References:
US20140358871A12014-12-04
US20010034811A12001-10-25
US20120154946A12012-06-21
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Download PDF:
Claims:
CLAIMS

1. A method (100) of quoting data on magnetic tape (206) storing deduplicated data, comprising: obtaining input data to be written to a magnetic tape (206) storing deduplicated data that has a full index of strong hashes, SHs, and a spare index of weak hashes, WHs, of the deduplicated data, wherein the input data comprises segments that belong to a target file or data object, calculating a strong hash, SH, for each of the segments of the input data, selecting a number of SH representatives from the calculated SHs for each area of a predefined size in the calculated SHs, extracting a weak hash, WH, from each of the SH representatives, searching the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index, for each of the i matched WH records: determining a set of j SH areas of the full index pointed by the WH record, determining j locations on the magnetic tape (206) storing quoted data pointed by the determined j SH areas that potentially matches with a repeating part of the input data, determining all unique routes of reading the quoted data from the magnetic tape (206) by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written, determining the new part of the input data by loading SH areas that correspond to a minimal route of reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data, writing the new part of the input data to the head of tape location into the magnetic tape (206) as a part of the deduplicated data, and updating the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data and updating a metadata, MD, with the location of the new part of input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data.

2. The method (100) of claim 1, wherein the minimal route of reading is selected among the determined routes of reading by using a travelling salesman algorithm.

3. The method (100) of claim 1 or 2, wherein the obtaining of the input data comprises: receiving data to be written to the magnetic tape (206), splitting the data into segments, defining the target file or data object, and grouping together segments of the split data that belong to the target file or data object to obtain the input data.

4. An apparatus (202) for quoting data on magnetic tape (206) storing deduplicated data, comprising: a magnetic-tape data storage (204) comprising a magnetic tape (206) with a plurality of tape wraps (208) storing deduplicated data that has a full index of strong hashes, SHs, and a spare index of weak hashes, WHs, of the deduplicated data, and a data processing module (210) configured for: obtaining input data to be written to the magnetic tape (206), wherein the input data comprises segments that belong to a target file or data object, calculating a strong hash, SH, for each of the segments of the input data, selecting a number of SH representatives from the calculated SHs for each area of a pre-defined size in the calculated SHs, extracting a weak hash, WH, from each of the SH representatives, searching the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index, for each of the i matched WH records: determining a set of j SH areas of the full index pointed by the WH record, determining j locations on the magnetic tape storing quoted data pointed by the determined j SH areas that potentially matches with a repeating part of the input data, determining all unique routes of reading the quoted data from the magnetic tape (206) by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written, determining the new part of the input data by loading SH areas that correspond to a minimal route or reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data, controlling the magnetic-tape data storage (204) to write the new part of the input data to the head of tape location into the magnetic tape (206) as a part of the deduplicated data, and updating the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data and updating a metadata, MD, with the location of the new part of input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data.

5. The apparatus (202) of claim 4, wherein the data processing module (210) is configured for selecting the minimal route of reading among the determined routes of reading by using a travelling salesman algorithm.

6. The apparatus (202) of claim 4 or 5, wherein the data processing module (210) is further configured for: receiving data to be written to the magnetic tape (206), splitting the data into segments, defining the target file or data object, and grouping together segments of the split data that belong to the target file or data object to obtain the input data.

Description:
METHOD AND APPARATUS FOR QUOTING DATA ON MAGNETIC TAPE STORING DEDUPLICATED DATA

TECHNICAL FIELD

The present disclosure relates generally to the field of data storage and more specifically, to a method and an apparatus for quoting data on a magnetic tape storing deduplicated data when writing input data thereto.

BACKGROUND

Generally, various storage devices are used for storing digital data, such as hard disks, pen drives, memory cards, and the like. Moreover, the need for storing digital data is increasing such that tape technologies are used with deduplication for secondary storage due to cheap price and long retention. However, the seek time of a conventional magnetic tape is relatively long, due to which the magnetic tape is used without any optimization beside the built-in compression.

Conventionally, certain attempts have been made in order to solve the problem of seek time when using deduplication on the magnetic tape, such as by writing the data on the magnetic tape in its original (i.e., without deduplication) form. The other existing solution to solve the problem of seek time is by maintaining a closed unit, such as a similarity locality (SILO) that includes all the deduped data and further reads the entire unit of data during the required. In certain scenarios, the problem of seek time can also be resolved by storing the deduped data on the magnetic tape to reduce the restoration time, such as by writing a common data in two files for sequentially reading the common data from the magnetic tape. However, such attempts further require lots of preprocessing of the common data and also lower the deduplication ratio drastically. Thus, there exists a technical problem of how to allow deduplication with the reduced seek time to optimize future reads of data by reducing seeks required to read files (or data objects) from the magnetic tape storing deduplicated data.

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional techniques for writing data to the magnetic tape storing the deduplicated data. SUMMARY

The present disclosure provides a method and an apparatus for quoting data on a magnetic tape storing deduplicated data. The present disclosure provides a solution to the existing problem of how to allow deduplication with reduced seek time to optimize future reads by reducing seeks required to read files (or data objects) from the magnetic tape by choosing which data to quote during deduplication when writing data on the magnetic tape. An aim of the present disclosure is to provide a solution that overcomes at least partially the problem encountered in the prior art and provides an improved method of quoting data on the magnetic tape storing deduplicated data and an improved apparatus for quoting data on the magnetic tape storing deduplicated data when writing input data thereto.

One or more objectives of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.

In one aspect, the present disclosure provides a method of quoting data on magnetic tape storing deduplicated data, comprising obtaining input data to be written to a magnetic tape storing deduplicated data that has a full index of strong hashes, SHs, and a spare index of weak hashes, WHs, of the deduplicated data, wherein the input data comprises segments that belong to a target file or data object, calculating a strong hash, SH, for each of the segments of the input data, selecting a number of SH representatives from the calculated SHs for each area of a predefined size in the calculated SHs, extracting a weak hash, WH, from each of the SH representatives, and searching the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index. The method then comprises, for each of the i matched WH records, determining a set of j SH areas of the full index pointed by the WH record and determining j locations on the magnetic tape storing quoted data pointed by the determined j SH areas that potentially matches with a repeating part of the input data. The method further comprises determining all unique routes of reading the quoted data from the magnetic tape by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written, determining the new part of the input data by loading SH areas that correspond to a minimal route of reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data, writing the new part of the input data to the head of tape location into the magnetic tape as a part of the deduplicated data, and updating the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data and updating a metadata, MD, with the location of the new part of input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data.

The method is used to choose which data on the magnetic tape storing the deduplicate data to quote when writing input data thereto to achieve reduced seek time on reads of the data. The method includes determining the minimal route for reading the quoted data from the magnetic tape that provides the overall minimal distance for reading the data from the magnetic tape. Furthermore, the method provides an optimal order to read the data from the magnetic tape with deduplicated data more efficiently, with reduced complexity and seek time, such as by moving the head of the tape drive across the tape wraps than by moving the head of the tape across the length of the tape wrap. In other words, the method is used to select which segments of deduplicated data already stored on the magnetic tape to quote for providing optimized reads of the data. As a result, the seek time to read the files (or the data objects) from the magnetic tape is reduced.

In an implementation form, the minimal route of reading is selected among the determined routes of reading by using a travelling salesman algorithm.

The travelling salesman algorithm is used to calculate the minimal distance required to traverse all the points of each of the determined routes and select a route that provides the overall minimal distance for optimizing later restoration of data from the magnetic tape.

In a further implementation form, the obtaining of the input data includes receiving data to be written to the magnetic tape, splitting the data into segments, defining the target file or data object, and grouping together segments of the split data that belong to the target file or data object to obtain the input data.

Such implementation allows selecting target files or data objects in the input data for optimization of later reads of these files or data objects from the magnetic tape. In yet another aspect, the present disclosure provides an apparatus for writing data to magnetic tape for storing deduplicated data that includes a magnetic-tape data storage that includes a magnetic tape with a plurality of tape wraps for storing deduplicated data that has a full index of strong hashes (SHs) and a sparse index of weak hashes (WHs) of the deduplicated data and a data processing module. The data processing module is configured for obtaining input data to be written to the magnetic tape. The input data includes segments that belong to a target file or data object. Further, the data processing module is configured for calculating a strong hash, SH, for each of the segments of the input data, selecting a number of SH representatives from the calculated SHs for each area of a pre-defined size in the calculated SHs, extracting a weak hash, WH, from each of the SH representatives, searching the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index. For each of the i matched WH records determining a set of j SH areas of the full index pointed by the WH record, determining j locations on the magnetic tape storing quoted data pointed by the determined j SH areas that potentially match with a repeating part of the input data, determining all unique routes of reading the quoted data from the magnetic tape by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written. The data processing module is configured for determining the new part of the input data by loading SH areas that correspond to a minimal route or reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data, controlling the magnetic-tape data storage to write the new part of the input data to the head of the tape location into the magnetic tape as a part of the deduplicated data, and updating the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data and updating metadata (MD) with the location of the new part of input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data.

The apparatus achieves all the advantages and technical effects of the method of the present disclosure.

It is to be appreciated that all the aforementioned implementation forms can be combined. It is be noted that all devices, elements, circuitry, units, and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity that performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.

Additional aspects, advantages, features, and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS

The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.

Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

FIG. 1 is a flowchart of a method of quoting data on a magnetic tape storing deduped data when writing input data thereto, in accordance with an embodiment of the present disclosure;

FIG. 2 is a block diagram that depicts an apparatus for quoting data on a magnetic tape storing deduplicated data when writing input data thereto, in accordance with an embodiment of the present disclosure; and FIG. 3 is a diagram that depicts reordering of data segments in quoting of data on a magnetic tape storing deduplicated data, in accordance with an embodiment of the present disclosure.

In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the nonunderlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS

The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practising the present disclosure are also possible.

FIG. 1 is a flowchart of a method of quoting data on a magnetic tape storing deduplicated data when writing input data thereto, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a flowchart of a method 100 that includes steps 102 to 122

There is provided the method 100 of quoting data on the magnetic tape storing the deduplicated data when writing input data thereto. In an implementation, the magnetic tape is composed of tracks on which the data is written and from which the data is read. Moreover, each track runs a length of the magnetic tape from a start point to an end point of the magnetic tape. In addition, the magnetic tape is divided into bands and a tape drive head that covers the width of a single band. In addition, each band is composed of an even number of tape wraps. In addition, the tape wraps correspond to a head of a tape drive, used for reading and writing, and contain multiple read and write elements, which allows reading/writing multiple adjacent tracks. In an example, a set of such adjacent tracks is referred to as the tape wrap. For example, linear tape open ninth generation (LTO 9) tapes include thirty-two (32) tracks per tape wrap and fifty-two (52) tape wraps per band with twenty-six (26) start-to-end tape wraps and twenty-six (26) end-to-start tape wraps and four (4) bands. In addition, the length of the magnetic tape (i.e., the length of every tape wrap) is nearly 1 kilometre (km), such as in the LTO 9.

At step 102, the method 100 comprises obtaining an input data to be written to the magnetic tape for storing the deduplicated data that includes a full index of strong hashes (SHs) and a sparse index of weak hashes (WHs) of the deduplicated data. In order to decrease the space required to store the data, deduplication of the data is used. In an example, a data processing module is configured to obtain the input data written to the magnetic tape for storing the deduplicated data that includes the full index of the SHs and the sparse index of the WHs of the deduplicated data. Moreover, the input data includes segments that belong to a target file or data object. In deduplication, the input data is segmented, and the SH is computed per segment. In an implementation, the segments are made of variable sizes, and a variable segmentation is used for dividing the data into the segments in accordance with a window of the input data. A rolling hash function is used on a rolling window. In an example, the rolling window ranges from sixty-four (64) to two hundred fifty-six (256) bytes. Furthermore, an algorithm cuts the segment and starts a new segment to provide the segments with desired hash values when a hash value of the rolling hash function accommodates a certain condition. In an implementation, only the SHs with unique values are maintained in the full index, and the SHs of new input data is checked against the full index. Moreover, the data is not stored for the SHs with repeated values. In other words, for SHs that already exist, the data is not stored again, and rather the already existing data is pointed to by a logical unit being stored (e.g., by the file or the object). However, the already existing data is pointed to by a logical unit (e.g., the target files or the data objects). Furthermore, for SHs that do not exist in the full index, the data is considered new and is stored, and the full index is updated to include the new SHs that do not exist in the full index.

In accordance with an embodiment, the obtaining of the input data includes receiving the data to be written to the magnetic tape, splitting the data into segments, defining the target file or the data object, and grouping together the segments of the split data that belong to the target file or the data object to obtain the input data. Firstly, the data to be written to the magnetic tape is received. Thereafter, the received data is split into segments (i.e., the data segments), and the target file or the data object is defined. In an implementation, the received data is split into the segments (i.e., the data segments), and the target file is defined. In another implementation, the received data is split into the segments (i.e., the data segments), and the data objects are defined. Furthermore, the segments are grouped together to finally obtain the input data. At step 104, the method 100 comprises calculating the SH for each of the segments of the input data. In an implementation, the data processing module is configured to calculate the SH for each of the segments of the input data. The SH for each of the segments is calculated to determine unique SHs that are to be maintained in the full index of the deduplicated data. At step 106, the method 100 further comprises selecting a number of SH representatives from the calculated SHs for each area of a pre-defined size in the calculated SHs. Firstly, the method 100 includes obtaining the input data and calculating the SH for each of the segments of the input data. Thereafter, the SH representatives are selected for each of the pre-defined sizes from the calculated SH. For example, two SH representatives for each area with the pre-defined size of four (4) megabytes (MB) are selected. At step 108, the method 100 further comprises extracting the WH from each of the SH representatives. For example, ten (10) bytes to seventeen (17) bytes out of twenty (20) bytes of full hash are extracted as the WH. After that, at step 110, the method 100 comprises searching for each of the i matched WH records, the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index. The sparse index of the deduplicated data includes all the WH records that are the part of the SH. In an implementation, a small index for a determined length of the original data corresponds to a subset of the hashes from the full index is referred to as a sparse index. At step 112, the method 100 further comprises determining, for each of the i matched WH records, a set of j SH areas of the full index pointed by the WH record. Each WH record points to one or more SH areas in the full index that have a high probability of pointing to the stored data matching with the segment(s) of the input data corresponding to the WH record which thereby can be quoted. At step 114, the method 100 comprises determining j locations on the magnetic tape for storing quoted data pointed by the determined j SH areas that potentially match with a repeating part of the input data.

At step 116, the method 100 further comprises determining all unique routes of reading the quoted data from the magnetic tape by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written.

At step 118, the method 100 comprises determining the new part of the input data by loading SH areas that correspond to a minimal route of reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data. The minimal route of reading can be determined by selecting a route with minimal travelling distance among all the determined routes of reading. Thereby, the minimal route of reading can be determined by comparing travelling distances of all the determined routes of reading. After that, the SH areas of the full index that correspond to the minimal route are loaded for comparing them with the SHs of the input data. The comparison of the loaded SH areas with the SHs of the input data enables determining a repeating part of the input data that matches with already existing data stored on the magnetic tape (which is to be quoted) and a new part of the input data that does not match with the already stored data. In an implementation, various techniques are used for loading and comparing the SHs to achieve a high probability of locating already existing data in the deduplicated data, such as by utilizing the sparse index.

In accordance with an embodiment, the minimal route of reading is selected among the determined routes of reading by using a travelling salesman algorithm. The travelling salesman algorithm is used to calculate the minimal distance required to traverse all the points of each of the determined routes and select a route that provides the overall minimal distance for optimizing later restoration of data from the magnetic tape. At step 120, the method 100 further comprises writing the new part of the input data to the head of tape location into the magnetic tape as a part of the deduplicated data. Thereafter, at step 122, the method 100 comprises updating the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data, and further updating the metadata (MD) with the location of the new part of the input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data. The metadata corresponds to a collection of information that is used to map user offsets to the location of the data on the magnetic tape. The metadata updated as described above in accordance with the minimal route of reading provides for an optimized restoration of the input data with a reduced seek time and minimal travelling distance.

The method 100 is used to quote the data on the magnetic tape storing the deduplicate data when writing input data thereto efficiently to achieve reduced seek time in later restoring data from the magnetic tape. The method 100 includes determining the minimal route for reading the data from the magnetic tape by using the travelling salesman algorithm that calculates the total minimal distance required to traverse to visit all locations on the magnetic tape storing segments of the input data and select a route that provides the overall minimal distance for later restoring data from the magnetic tape. Thereby, the method 100 provides an optimal quoting of data on the deduped magnetic tape to read the data more efficiently with reduced complexity, such as by moving the head of the tape drive across the tape wraps than by moving the head of the tape across the length of the tape wrap. The method 100 is used to quote data on the magnetic tape storing the deduplicated data when writing input data thereto without affecting a deduplication ratio of the data. As a result, the seek time to read the files (or the data objects) from the magnetic tape is reduced.

The steps 102 to 122 are only illustrative, and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.

FIG. 2 is a block diagram that depicts an apparatus for quoting data on a magnetic tape storing deduplicated data when writing input data thereto, in accordance with an embodiment of the present disclosure. With reference to FIG. 2 there is shown a block diagram 200 of an apparatus 202 that includes a magnetic-tape data storage 204, a magnetic tape 206, a plurality of tape wraps 208, a communication interface 214, a memory 212, and a data processing module 210.

The apparatus 202 is configured to quote data on the magnetic tape 206 of the magnetic-tape data storage 204 storing the deduplicated data when writing input data thereto. The magnetic- tape data storage 204 includes the magnetic tape 206 with the plurality of tape wraps 208. The magnetic tape 206 consists of tracks that are used for both writings and reading the data. A track runs the length of the magnetic tape 206, from tape start to tape end. In addition, the head of a tape drive used for reading and writing that contains multiple read and write elements that allows reading and writing on multiple adjacent tracks are referred to as the plurality of tape wraps 208, such as a first tape wrap 208A, a second tape wrap 208B, and the like.

The communication interface 214 is used by the apparatus 202 to communicate with another device(s). Examples of implementation of the communication interface 214 may include but are not limited to a network interface, a computer port, a network socket, a network interface controller (NIC), and any other network interface device.

The data processing module 210 is configured to write the data to the magnetic tape 206. Examples of implementation of the data processing module 210 may include but are not limited to a central data processing device, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a state machine, and other processors or control circuitry.

The memory 212 is configured to store the instructions to write to the magnetic tape 206. Examples of implementation of the memory 212 may include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Dynamic Random- Access Memory (DRAM), Random Access Memory (RAM), Read-Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), and/or CPU cache memory.

The apparatus 202 is configured to write the data to the magnetic tape 206 for storing the deduplicated data. In an implementation, the magnetic tape 206 is composed of tracks on which the data is written and from which the data is read. Moreover, each track runs a length of the magnetic tape 206 from a start to an end of the magnetic tape 206. In addition, the magnetic tape 206 is divided into bands and a tape drive head that covers the width of a single band. In addition, each band is composed of an even number of tape wraps. In addition, the tape wraps correspond to the head of a tape drive, used for reading and writing, and contains multiple read and write elements, which allows reading/writing multiple adjacent tracks. In an example, a set of such adjacent tracks is referred to as the tape wrap. For example, linear tape open ninth generation (LTO 9) tapes include thirty-two (32) tracks per tape wrap and fifty-two (52) tape wraps per band with twenty-six (26) start-to-end tape wraps and twenty-six (26) end-to-start tape wraps and four (4) bands. In addition, the length of the magnetic tape 206 (i.e., the length of every tape wrap) is nearly 1 kilometer (km), such as in LTO 9. The apparatus 202 is configured to write the data to the magnetic tape 206 provides an efficient and reliable use of recommended access order and time-based access order system techniques for optimized reading order of objects (or files), such as by reordering the data before writing. The apparatus 202 includes the magnetic-tape data storage 204 that includes the magnetic tape 206 with the plurality of tape wraps 208 such as the first tape wrap 208A, the second tape wrap 208B and the Nth tape wrap 208N. The plurality of tape wraps 208, such as the first tape wrap 208A, the second tape wrap 208B, and the Nth tape wrap 208N, is used to store deduplicated data that has a full index of strong hashes (SHs) and a sparse index of weak hashes (WHs) of the deduplicated data. In an implementation, the magnetic-tape data storage 204 includes the magnetic tape 206 that includes the first tape wrap 208A. Similarly, the magnetic-tape data storage 204 includes the second tape wrap 208B and the like.

Furthermore, the apparatus 202 includes the data processing module 210 configured to obtain an input data to be written to the magnetic tape 206. In order to decrease the space required to store the data, deduplication of the data is used. In an example, a data processing module 210 is configured to obtain the input data written to the magnetic tape 206, storing deduplicated data that includes the full index of SHs and the sparse index of WHs of the deduplicated data. Moreover, the input data includes segments that belong to a target file or data object. In deduplication, the input data is segmented, and the SH is computed per segment. In an implementation, the segments are of variable sizes, and a variable segmentation is used for dividing the data to the segments in accordance with a window of the input data. A rolling hash function is used on a rolling window. In an example, the rolling window ranges from sixty-four (64) to two hundred fifty-six (256) bytes. Furthermore, an algorithm cuts the segment and starts a new segment to provide the segments with desired hash values when a hash value of the rolling hash function accommodates a certain condition. In an implementation, only the SHs with unique values are maintained in the full index, and the SHs of new input data is checked against the full index. Moreover, the data is not stored for the SHs with repeated values. In other words, for SHs that already exist, the data is not stored again, and rather the already existing data is pointed to by a logical unit being stored (e.g., by the file or the object). However, the already existing data is pointed to by a logical unit (e.g., the target files or the data objects). Furthermore, for SHs that do not exist in the full index, the data is considered new and is stored, and the full index is updated to include the new SHs that do not exist in the full index.

In accordance with an embodiment, the data processing module 210 is configured to receive data to be written to the magnetic tape 206, split the data into segments, define the target file or the data object, and group together the segments of the split data that belong to the target file or data object to obtain the input data. Firstly, the data to be written to the magnetic tape is received. Thereafter, the received data is split into segments (i.e., the data segments) and the target file or the data object is defined. In an implementation, the received data is split into the segments (i.e., the data segments), and the target file is defined. In another implementation, the received data is split into the segments (i.e., the data segments), and the target data object is defined. The data processing module 210 is further configured to calculate the SH, for each of the segments of the input data. The SH for each of the segments is calculated to determine unique SHs that are to be maintained in the full index. The calculation of the SHs for each of the segments reduces the probability of conflicts, such as the probability of storing different data in the same hash (e.g., the SH). Furthermore, the data processing module 210 is configured to select the number of SH representatives from the calculated SHs for each area of a pre-defined size in the calculated SHs. Firstly, the data processing module 210 is configured to obtain the input data and calculate the SH for each of the segments of the input data. Thereafter, the SH representatives are selected for each of the pre-defined sizes from the calculated SH. For example, two SH representatives for each area with the pre-defined size of four (4) megabytes (MB) are selected. The data processing module 210 is configured to extract the WH from each of the SH representatives and search the sparse index of the deduplicated data for all WH records matching with the extracted WHs to obtain a set of i matched WH records of the sparse index. For example, ten (10) bytes to seventeen (17) bytes out of twenty (20) bytes of full hash are extracted as the WH. The sparse index of the deduplicated data includes all the WH records that are the part of the SH. In an implementation, a small index for a determined length of the original data corresponds to a subset of the hashes from the full index is referred to as sparse index.

The data processing module 210 is further configured to determine a set of j SH areas of the full index pointed by each of the i matched WH records, and then j locations on the magnetic tape 206 storing quoted data pointed by the determined j SH areas that potentially match with a repeating part of the input data for each of the i matched WH records. The data processing module 210 is further configured to determine all unique routes of reading the quoted data from the magnetic tape 206 by selecting one of the j determined locations for each of the i matched WH records and adding to each route a head of tape location into which a new part of the input data that does not match with the deduplicated data will be written. Furthermore, the data processing module 210 is configured to determine the new part of the input data by loading SH areas that correspond to a minimal route or reading with minimal travelling distance among the determined routes of reading and comparing the loaded SH areas with the SHs of the input data. The data processing module 210 is configured to determine the minimal route of reading by selecting a route with minimal travelling distance among all the determined routes of reading. Thereby, the minimal route of reading can be determined by comparing travelling distances of all the determined routes of reading. After that, the SH areas of the full index that correspond to the minimal route are loaded for comparing them with the SHs of the input data. The comparison of the loaded SH areas with the SHs of the input data enables determining a repeating part of the input data that matches with already existing data stored on the magnetic tape (which is to be quoted) and a new part of the input data that does not match with the already stored data. In an implementation, various techniques are used for loading and comparing the SHs to achieve a high probability of locating already existing data in the deduplicated data, such as by utilizing the sparse index.

In accordance with an embodiment, the data processing module 210 is configured to select the minimal route of reading among the determined routes of reading by using a travelling salesman algorithm. The travelling salesman algorithm is used to calculate the minimal distance required to traverse all the points of each of the determined route and select a route that provides the overall minimal distance for optimizing later restoration of data from the magnetic tape 206. The data processing module 210 is further configured to control the magnetic-tape data storage 204 to write the new part of the input data to the head of tape location into the magnetic tape 206 as a part of the deduplicated data.

Furthermore, the data processing module 210 is configured to update the full index and the sparse index of the deduplicated data with the calculated SHs and the extracted WHs to quote the new part of the input data and to update the metadata (MD) with the location of the new part of the input data and the locations of the quoted data in accordance with the minimal route of reading to enable restoring of the input data. The metadata corresponds to a collection of data that is used to map user offsets to the location of the data on the magnetic tape 206. The metadata updated as described above in accordance with the minimal route of reading provides for an optimized restoration of the input data with a reduced seek time and minimal travelling distance.

The apparatus 202 is thereby configured to quote data on the magnetic tape 206 storing the deduplicate data when writing input data thereto efficiently so that to provide a reduced seek time in later restoration of the data. For this end, the data processing module 210 of the apparatus 202 is configured to determine, before the writing of the input data, a minimal route for reading to-be quoted data on the magnetic tape by using the travelling salesman algorithm for selecting a route of reading that provides the overall minimal distance that optimizes later restoration of data from the magnetic tape 206. Furthermore, the data processing module 210 is configured to control the magnetic-tape data storage 204 to write the new part of the input data into the magnetic tape 206 as a part of the deduplicated data and to update the full and sparse indexes of the deduplicated data as well as the metadata (MD) to quote the data in accordance with the minimal route of reading which enables optimizing of restoration of the data from the magnetic tape with a reduced complexity, such as by moving the head of the tape drive across the tape wraps than by moving the head of the tape across the length of the tape wrap. At that, the optimized quoting implemented in the apparatus 202 does not affect a deduplication ratio of the data on the magnetic tape 206. As a result, the seek time to read the files (or the data objects) from the magnetic tape 206 is reduced.

FIG. 3 is a diagram that depicts an exemplary representation of points or physical locations on a magnetic tape storing segments of data to be quoted in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIG. 2 and FIG. l to present an explanatory example of the quoting techniques according to the disclosure.

Let us assume that in an implementation, there are 3 matched WH records, WH1, WH2, WH3, found out in the sparse index at the step 110 of the method 100, or by means of the data processing module 210 as described above. In other words, the 3 matched WH records, WH1, WH2, WH3, are the set of i matched WH records. At that, the first matched WH record WH1 points to 3 different SH areas (Areal, Area2, Area3) in the full index, while the other two matched WH records, WH2 and WH3, point each to 2 different SH areas (Area4, Area5) and (Area6, Area7):

WH1 = (Areal, Area2, Area3),

WH2 = (Area4, Area5),

WH3 = (Area6, Area7).

Thereby, the above-listed 3 sets of SH areas are the sets of j SH areas of the full index pointed by each of the i matched WH records (the 3 matched WH records in this example) as determined on the step 112 of the method 100, for example, by means of the data processing module 210.

Each of Areal, Area2, Area3, Area4, Area5, Area6 and Area7 comprises SHs that point to different data segments stored in different physical locations on the magnetic tape. Let us assume that on the step 114 of the method 100, for example, by means of the data processing module 210, determined are seven points Pl, P2, P3, P4, P5, P6 and P7 on the magnetic tape that store data segments pointed by SHs in the said seven areas. Let us further define the eighth point P8 on the magnetic tape as a head of tape location into which a new data is to be written, or in other words, the point of maximum offset already reached in the magnetic tape.

With reference to FIG. 3, there is shown a diagram 300 of the magnetic tape that includes a first tape wrap 302, a second tape wrap 304, a third tape wrap 306, and a fourth tape wrap 308, each having different direction of writing and reading data, as illustrated in the diagram by arrows. In the diagram 300, there is also shown a plurality of points or physical locations storing data in said tape wraps, including a first point Pl 310, a second point P2 312, a third point P3 314, a fourth point P4 316, a fifth point P5 318, a sixth point P6 320, and a seventh point P7322 that were described earlier as storing data segments pointed by the seven SH areas, and an eighth point P8 324 being the head of tape location.

In this example, the first point Pl 310, the second point P2 312, the third point P3 314, the fourth point P4 316, the fifth point P5 318, the sixth point P6 320, the seventh point P7 322, and the eighth point P8 324 are located on three tape wraps only, including the first tape wrap 302 (e.g., wrapO), the second tape wrap 304 (e.g., wrapl) and the third tape wrap 306 (e.g., wrap2), whereas the fourth tape wrap 308 does not comprise any data (it has not been written with data yet). At that, the first point Pl 310, the second point P2 312, and the fourth point P4 316 are located on the first tape wrap 302. The third point P3 314, the fifth point P5 318, the sixth point P6 320, and the seventh point P7 322 are located on the second tape wrap 304. In addition, the eighth point P8 324 representing the head of tape location is located on the third tape wrap 306.

Further on the step 116 of the method 100, for example, by means of the the data processing module 210, determined are all unique routes of reading. In the given example, the determined routes shall comprise all the possible 3x2x2 point combinations with the addition of the eighth point P8 of the head of tape location to each of the possible combination, i.e.: (P1,P4,P6,P8), (P1,P4,P7,P8), (P1,P5,P6,P8), (P1,P5,P7,P8), (P2,P4,P6,P8), (P2,P4,P7,P8), (P2,P5,P6,P8), (P2,P5,P7,P8), (P3,P4,P6,P8), (P3,P4,P7,P8), (P3,P5,P6,P8), (P3,P5,P7,P8). Then, on the step 118 of the method 100, determined is a minimal route of reading with a minimal traveling distance among all the determined routes, for example, by using the travelling salesman algorithm. In the given example, the minimal route of reading with the minimal traveling distance is (P1,P5,P7,P8) i.e. the route 326 (illustrated by three arrows) that starts from the first point Pl 310 and goes through the fifth point P5 318 and the seventh point 322, to the eighth point P8 324 of the head of tape location. One can readily see on the diagram 300 in FIG. 3 that the route 326 is the one of the minimal traveling distance, and thereby it is an optimal route for reading the data segments with a reduced seek time.

Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe, and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the current disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.