

Title:
APPARATUS AND METHOD FOR DATA COMPRESSION
Document Type and Number:
WIPO Patent Application WO/2018/139947
Kind Code:
A1
Abstract:
A data compression apparatus for compressing a data stream, the data compression apparatus comprising a data splitter configured to split the data stream into data blocks, a classifier configured to classify the data blocks into a plurality of data classes, a reference block detector configured to detect reference blocks for each of the plurality of data classes, and a first data compressor configured to compress data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

Inventors:
MAZURENKO IVAN LEONIDOVICH (CN)
PARKHOMENKO DENIS VLADIMIROVICH (CN)
LENG JINAN (CN)
ZHANG XUECANG (CN)
KHOLODENKO ALEXANDER BORISOVICH (CN)
PETYUSHKO ALEXANDER ALEXANDROVICH (CN)
Application Number:
PCT/RU2017/000030
Publication Date:
August 02, 2018
Filing Date:
January 24, 2017
Assignee:
HUAWEI TECH CO LTD (CN)
MAZURENKO IVAN LEONIDOVICH (CN)
International Classes:
H03M7/30
Foreign References:
US8712978B1 (2014-04-29)
US20100250501A1 (2010-09-30)
US9514146B1 (2016-12-06)
Other References:
MIKHAIL BILENKO ET AL: "On Evaluation and Training-Set Construction for Duplicate Detection", PROCEEDINGS OF THE KDD-2003 WORKSHOP ON DATA CLEANING, RECORD LINKAGE, AND OBJECT CONSOLIDATION, 31 August 2003 (2003-08-31), pages 7 - 12, XP055406428, Retrieved from the Internet [retrieved on 20170913]
Attorney, Agent or Firm:
MITS, Alexander Vladimirovich et al. (RU)
Claims:
CLAIMS

1. A data compression apparatus (100) for compressing a data stream, the data compression apparatus comprising:

a data splitter (110) configured to split the data stream into data blocks,

a classifier (120) configured to classify the data blocks into a plurality of data classes,

a reference block detector (130) configured to detect (312) reference blocks for each of the plurality of data classes, and

a first data compressor (140) configured to compress data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

2. The data compression apparatus (100) of claim 1, wherein the classifier comprises a plurality of similarity detectors (306, 308) which correspond to the plurality of data classes.

3. The data compression apparatus (100) of claim 2, wherein the data compression apparatus is configured to, if for a given data block each of the plurality of similarity detectors (306, 308) yields a detection score below a predetermined threshold, assign (322) the given data block to a second data compressor which is configured to operate in a different way than the first data compressor.

4. The data compression apparatus (100) of claim 2 or 3, wherein the data compression apparatus (100) is configured to, if for a given data block each of the plurality of similarity detectors (306, 308) yields a detection score below a predetermined threshold, accumulate (608) the data block in a log of data blocks and, if a size of the log of data blocks is larger than a predetermined threshold, train a new similarity detector based on the log of data blocks and add (320) the new similarity detector to the plurality of similarity detectors.

5. The data compression apparatus (100) of one of claims 2 to 4, wherein a similarity detector of the plurality of similarity detectors (306, 308) comprises a machine learning-based similarity detector, in particular a support vector machine, SVM.

6. The data compression apparatus (100) of one of claims 2 to 5, wherein an input to a similarity detector (306, 308) of the plurality of similarity detectors comprises:

a histogram (410, 420) of n-grams of elements of a data block to be classified,

a histogram of hashes of n-grams of elements of a data block to be classified, and/or

a binary vector (602) indicating one or more maxima of a histogram of n-grams of elements of a data block to be classified and/or of a histogram of hashes of n-grams of elements of a data block to be classified.

7. The data compression apparatus (100) of claim 5 or 6, wherein the reference block detector is configured to detect the reference blocks for a data class based on support vectors of the SVM of the similarity detector for the data class.

8. The data compression apparatus (100) of one of the previous claims, wherein the apparatus is configured to adapt one or more parameters of the classifier online after an initial training phase.

9. The data compression apparatus (100) of one of the previous claims, wherein the data compression apparatus is configured to determine the plurality of data classes by clustering a plurality of data blocks of the data stream.

10. The data compression apparatus (100) of one of the previous claims, wherein a reference block detector (130) is configured to detect the reference blocks based on the rule

mod(block_index, CN) == 0,

wherein block_index is an index of a block within a data class and 1/CN is a reference block frequency.

11. The data compression apparatus (100) of one of the previous claims, wherein the reference block detector (130) is configured to adapt the reference block frequency of a data class based on a classification score of the classifier, in particular based on a detection score of the similarity detector corresponding to the data class.

12. The data compression apparatus (100) of claim 11, wherein adapting the reference block frequency comprises a step of decreasing the reference block frequency if the classification score increases in time.

13. A method (200) for compressing a data stream, the method comprising:

splitting (210) the data stream into data blocks,

classifying (220) the data blocks into a plurality of data classes,

detecting (230, 312) reference blocks for each of the plurality of data classes, and

compressing (240) data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

14. The method (200) of claim 13, wherein the method comprises an initial training phase, in which the data classes are determined, and a subsequent online phase, wherein the online phase comprises a step of adapting one or more classification parameters.

15. A computer-readable storage medium storing program code, the program code comprising instructions that when executed by a processor carry out the method of claim 13 or 14.

Description:
APPARATUS AND METHOD FOR DATA COMPRESSION

TECHNICAL FIELD

The present invention relates to a data compression apparatus and a method for compressing a data stream. The present invention also relates to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out a method for compressing a data stream.

BACKGROUND

Traditional identity-based data deduplication is a technique for eliminating duplicate copies of repeating data. It can be applied to storage systems to improve the storage utilization and can also be applied to network data transfers to improve throughput. A typical deduplication process searches for duplicate data blocks. In case of a storage system, traditional deduplication can save space by substituting a duplicated portion of data with a hard link to an identical portion of data already stored in the system. Hard-link-like techniques can give a good space gain when dealing with small blocks of data.

Similarity-based deduplication differs from traditional identity-based deduplication in how redundant data is eliminated. Traditional identity-based deduplication is designed to detect exact matches of data. In contrast to identity-based deduplication that looks for exact matches, a similarity-based deduplication system tries to identify similar data blocks and remove redundant data by applying differential compression, i.e. computing a difference D between two data blocks, A and B, and then substituting B with the difference D and a link to A. Similarity-based deduplication can give an additional benefit when dealing with large files with small modifications, e.g. databases, text documents or presentations, where exact matching cannot find identical blocks.
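For illustration only, the following Python sketch shows this substitution step: block B is replaced by the byte ranges in which it differs from a reference block A. The helper names are hypothetical and byte-wise diffing is chosen only for brevity; it is not the differential compression method of any particular product.

def delta_encode(reference: bytes, target: bytes):
    # Record the byte ranges in which target differs from reference.
    patches, start = [], None
    for i in range(max(len(reference), len(target))):
        same = i < len(reference) and i < len(target) and reference[i] == target[i]
        if not same and start is None:
            start = i
        if same and start is not None:
            patches.append((start, target[start:i]))
            start = None
    if start is not None:
        patches.append((start, target[start:len(target)]))
    return {"length": len(target), "patches": patches}

def delta_decode(reference: bytes, delta) -> bytes:
    out = bytearray(reference[:delta["length"]].ljust(delta["length"], b"\0"))
    for offset, chunk in delta["patches"]:
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)

# B is stored as a link to A plus the (usually small) patch list.
A = b"customer_id=1001;balance=250.00;status=active"
B = b"customer_id=1001;balance=975.40;status=active"
assert delta_decode(A, delta_encode(A, B)) == B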

Looking for a similar data block (to apply differential compression with) is a computationally complex and time-consuming task. To speed up the search, a typical similarity-based deduplication system uses a cache of recent data blocks and their locality-sensitive fingerprints (LSH fingerprints). An LSH fingerprint is a small portion of data such that if two data blocks, A and B, have similar LSH fingerprints, LSH_A and LSH_B, then the blocks themselves have a significant amount of common information with high probability. This property makes it possible to search the cache for a data block similar to a given data block by computing the distance between their LSH fingerprints. A typical deduplication system contains a decision module, a compression module and an in-RAM cache.
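A minimal Python sketch of such a fingerprint cache is given below, assuming a bottom-k sketch over hashed n-grams as the LSH fingerprint and set overlap as the fingerprint similarity; real systems use more elaborate LSH schemes, and all names here are illustrative.

import hashlib

def fingerprint(block: bytes, k: int = 8, n: int = 4):
    # Keep the k smallest hashes of all n-grams as a simple locality-sensitive fingerprint.
    hashes = {int.from_bytes(hashlib.blake2b(block[i:i + n], digest_size=4).digest(), "big")
              for i in range(len(block) - n + 1)}
    return tuple(sorted(hashes)[:k])

def similarity(fp_a, fp_b):
    # Overlap of two fingerprints approximates the resemblance of the underlying blocks.
    union = set(fp_a) | set(fp_b)
    return len(set(fp_a) & set(fp_b)) / max(len(union), 1)

cache = {}  # fingerprint -> recently seen data block

def find_reference(block: bytes, threshold: float = 0.5):
    fp = fingerprint(block)
    best = max(cache.items(), key=lambda item: similarity(fp, item[0]), default=None)
    if best is not None and similarity(fp, best[0]) >= threshold:
        return best[1]      # candidate for differential compression
    cache[fp] = block       # otherwise remember the block for future queries
    return None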

The decision module typically has the following responsibilities:

- It accepts a new data block as input.

- It makes queries to the cache to find data blocks similar to a given data block (the one that is best suited to apply differential compression to).

- It decides what kind of compression to apply: two-block differential compression, single-block compression, or no compression.

- It decides if a new block should be put into the cache.

The purpose of a cache is to store recent data blocks to be used as candidates for differential compression in future. A typical cache has the following responsibilities:

- It stores a set of data blocks and their fingerprints.

- It supports fast search for a reference data block with a fingerprint similar to a given fingerprint.

Typically, the compression module supports many kinds of compression, including two-block differential compression, single-block traditional compression, and no compression. A typical deduplication procedure involves cooperation of all three main modules of a deduplication system - in-RAM cache, decision module, and compression module. Different compression methods have been suggested in the prior art, but these typically involve significant computational effort or achieve sub-optimal compression rates.
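Purely as an illustration of how such a decision module might choose among the three options, the sketch below compares the sizes of the candidate encodings; zlib and an XOR-based delta stand in for the actual coders, fixed-size blocks are assumed, and none of the names are taken from the description.

import zlib

def decide_and_compress(block, reference=None):
    # Decision-module sketch: pick the smallest of the three candidate encodings.
    candidates = {"none": block, "single": zlib.compress(block)}
    if reference is not None and len(reference) == len(block):
        delta = bytes(a ^ b for a, b in zip(block, reference))   # toy stand-in for a delta coder
        candidates["differential"] = zlib.compress(delta)
    kind, payload = min(candidates.items(), key=lambda kv: len(kv[1]))
    return kind, payload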

SUMMARY OF THE INVENTION

The objective of the present invention is to provide a data compression apparatus and a method for compressing a data stream, wherein the data compression apparatus and the method for compressing a data stream overcome one or more of the above-mentioned problems of the prior art.

A first aspect of the invention provides a data compression apparatus for compressing a data stream, the data compression apparatus comprising:

a data splitter configured to split the data stream into data blocks,

a classifier configured to classify the data blocks into a plurality of data classes,

a reference block detector configured to detect reference blocks for each of the plurality of data classes, and

a first data compressor configured to compress data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

The data stream can be any kind of data that is read e.g. from a storage device, a file, from a network, over the Internet and/or from a plurality of data sources internal or external to the data compression apparatus. The first data compressor can be configured to compress the data blocks from all of the plurality of data classes (using the corresponding reference blocks detected for the plurality of data classes). In other words, the first data compressor can use reference blocks detected for the first data class to compress data blocks of the first data class and can use reference blocks detected for a second data class to compress data blocks of a second data class, and so on.

The data compression apparatus can also comprise a second data compressor, which uses a completely different compression scheme.

The data compression apparatus of the first aspect solves the problem of efficient similarity-based data deduplication. It is suitable not only for a specific kind of input stream data (e.g. Oracle or MySQL databases, logs etc.), but also for mixed-type data. The proposed apparatus is adaptive and automatically adjusts to input stream characteristics, allowing for a better overall compression ratio and lower computational complexity. According to tests, the data compression apparatus outperforms many top-edge deduplication systems in terms of deduplication ratio while keeping the same or similar computational complexity.

The data compression apparatus of the first aspect can use big block deduplication (BBD). BBD refers to a set of compression methods which compress input (so-called target) data blocks with the help of another, predefined so-called reference data block. Hereinafter, without loss of generality, by BBD we refer to any method from the delta compression family, e.g. LZ-delta compression. Usually delta compression provides a higher compression ratio than conventional compression methods, but requires a search for appropriate reference data blocks.

In a first implementation of the data compression apparatus according to the first aspect, the classifier comprises a plurality of similarity detectors which correspond to the plurality of data classes.

For example, each of the plurality of similarity detectors can be configured to determine a score for a given data block. The data block can then be assigned to the data class that corresponds to the highest score.
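A minimal sketch of this score-based assignment is shown below; the detector interface (a score method) and the threshold are assumptions made purely for illustration.

def classify(block, detectors, threshold):
    # Each similarity detector scores the block; the class with the highest score wins.
    scores = [detector.score(block) for detector in detectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return None, scores   # no class is similar enough (see the second implementation)
    return best, scores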

Having a classifier which comprises a plurality of similarity detectors has the advantage that the classifier can easily be extended to include a further data class by adding a further similarity detector. Similarly, a data class can easily be removed from the classifier by removing the corresponding similarity detector.

In a second implementation of the data compression apparatus according to the first aspect as such or according to the first implementation of the first aspect, the data compression apparatus is configured to, if for a given data block each of the plurality of similarity detectors yields a detection score below a predetermined threshold, assign the given data block to a second data compressor which is configured to operate in a different way than the first data compressor.

If the similarity detectors yield a low score, this can indicate that the data block is not similar to any of the existing data classes and thus none of the first data compressors corresponding to the different data classes may be suitable for compressing this data block. Thus, it may be preferable that a second data compressor, which operates in a different way, is assigned to compress this data block. For example, this may be a second data compressor that operates independently of any prior assumptions about the similarity of the data block.

In a third implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, the data compression apparatus is configured to, if for a given data block each of the plurality of similarity detectors yields a detection score below a predetermined threshold, accumulate the data block in a log of data blocks and, if a size of the log of data blocks is larger than a predetermined threshold, train a new similarity detector based on the log of data blocks and add the new similarity detector to the plurality of similarity detectors. This has the advantage that the data compression apparatus can adapt to a new class of data blocks that is found in the data stream.

The data compression apparatus of the third implementation can be implemented as a data compression apparatus according to the second implementation: Data blocks that achieve a low score from each of the similarity detectors are compressed by an independent second data compressor. However, at the same time these "unclassified" data blocks are accumulated in a log of data blocks such that a new similarity detector can be trained with the unclassified data blocks.

In a fourth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, a similarity detector of the plurality of similarity detectors comprises a machine learning-based similarity detector, in particular a support vector machine, SVM.

Support vector machines have been shown to be particularly good binary classifiers. They can be seen as linear classifiers but, by using the so-called kernel trick, they can also be adapted to classify data that are not linearly separable. In particular, SVMs are suitable for dealing with very high-dimensional input data.

In a fifth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, an input to a similarity detector of the plurality of similarity detectors comprises:

a histogram of n-grams of elements of a data block to be classified,

a histogram of hashes of n-grams of elements of a data block to be classified, and/or

a binary vector indicating one or more maxima of a histogram of n-grams of elements of a data block to be classified and/or of a histogram of hashes of n-grams of elements of a data block to be classified.

Looking at histograms of n-grams of elements of data blocks has the advantage that a possibly extremely high dimensionality of the data blocks can be significantly reduced, which simplifies further processing.

Looking at hashes of n-grams instead of the n-grams themselves has the advantage that a further dimensionality reduction can be achieved.

In a sixth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, the reference block detector is configured to detect the reference blocks for a data class based on support vectors of the SVM of the similarity detector for the data class.

SVMs can be trained on a large amount of training data and identify, as so-called "support vectors", the data points that are particularly useful for classifying data. Because the number of support vectors is typically a small fraction of the number of training samples, the classification of further input data is simplified.
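As a rough illustration of how support vectors could be mapped back to reference-block candidates, the sketch below uses scikit-learn (an assumption; the description does not prescribe a library), whose trained SVC exposes the indices of its support vectors.

from sklearn.svm import SVC

def reference_block_candidates(feature_vectors, training_blocks, labels):
    # Train a linear-kernel SVM on the class-membership problem and reuse the
    # training blocks behind its support vectors as reference-block candidates.
    svm = SVC(kernel="linear").fit(feature_vectors, labels)
    return [training_blocks[i] for i in svm.support_]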

In a seventh implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, the apparatus is configured to adapt one or more parameters of the classifier online after an initial training phase.

The initial training phase can be performed e.g. with a special selection of data blocks from the data stream. Subsequently, one or more parameters of the classifier can be adapted based on processing of further data blocks from the data stream. Thus, the data compression apparatus can adapt to changes of the data in the data stream over time.

In an eighth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, the data compression apparatus is configured to determine the plurality of data classes by clustering a plurality of data blocks of the data stream.

Clustering is an unsupervised machine learning technique. Thus, the data compression apparatus can recognize different data classes in the data stream even if no "ground truth" labels are available. Useful clustering techniques include e.g. k-means clustering.

If the dimensionality of the data blocks is too high, the dimensionality of the data blocks can be reduced before the clustering. For example, clustering can be performed on histograms of n-grams of elements of the data blocks and/or based on hashes of histograms of n-grams of elements of the data blocks.

In a ninth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, a reference block detector is configured to detect the reference blocks based on the rule

mod(block_index, CN) == 0,

wherein block_index is an index of a block within a data class and 1/CN is a reference block frequency.

In other words, for a given data class, the reference block detector can be configured to detect every CN-th block as a reference block, wherein CN is a number that can be selected separately for each data class.

In a tenth implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, the reference block detector is configured to adapt the reference block frequency of a data class based on a classification score of the classifier, in particular based on a detection score of the similarity detector corresponding to the data class.

In an eleventh implementation of the data compression apparatus according to the first aspect as such or according to any of the preceding implementations of the first aspect, adapting the reference block frequency comprises a step of decreasing the reference block frequency if the classification score increases in time.

If a classification score increases, this means that the data blocks become more and more similar to the class. This means that the data does not change significantly in time. Therefore, the frequency of reference blocks (1/CN) can be reduced, which means increasing the value CN.

A second aspect of the invention refers to a method for compressing a data stream, the method comprising:

- splitting the data stream into data blocks,

classifying the data blocks into a plurality of data classes,

detecting reference blocks for each of the plurality of data classes, and

compressing data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

The methods according to the second aspect of the invention can be performed by the data compression apparatus according to the first aspect of the invention. Further features or implementations of the method according to the second aspect of the invention can perform the functionality of the data compression apparatus according to the first aspect of the invention and its different implementation forms.

In a first implementation of the method for compressing a data stream of the second aspect, the method comprises an initial training phase, in which the data classes are determined, and a subsequent online phase, wherein the online phase comprises a step of adapting one or more classification parameters.

A third aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the second aspect or the implementation of the second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description show merely some embodiments of the present invention; modifications of these embodiments are possible without departing from the scope of the present invention as defined in the claims.

FIG. 1 is a block diagram illustrating a data compression apparatus,

FIG. 2 is a flow chart of a method for compressing a data stream,

FIG. 3 is a flow chart of a further method for compressing a data stream,

FIG. 4 is an illustration of histograms of unigrams and bigrams for the set L=[2,2,3,1,4,2,1,4],

FIG. 5 is a flow chart of a method for classifying a data chunk, and

FIG. 6 is a flow chart of a method for determining a new similarity detector.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Herein, data deduplication can refer to a method for eliminating duplicate copies of repeating data. Similarity-based deduplication can refer to a method of compression that exploits data resemblance. Similarity detection can refer to a procedure that detects if two portions of data are similar or not. Similarity degree can refer to a number that shows how similar two data blocks are, or a function that computes such a number. Hashing can refer to a method for computing a fixed-size fingerprint from a block of data of arbitrary length. Locality-sensitive hashing (LSH) can refer to a hashing method that preserves locality of data, i.e. if two data blocks, A and B, have similar LSH fingerprints, then A and B are similar with a high degree of probability. When dealing with LSH hashing, two similarity measures must be defined: one for data blocks and a second for LSH fingerprints. A fingerprint can refer to a small piece of fixed-size data, usually describing a bigger data chunk. If two fingerprints are equal, the data chunks that they describe are also equal with high probability. Differential compression can refer to a lossless compression method that computes a difference between two binary data blocks. SVM refers to a support vector machine. SVMs can be used to linearly separate two or more classes of samples. An n-gram can refer to a contiguous sequence of n items from a given sequence of symbols. A similarity detector can be a detector that finds the "most similar" class of samples for a given sample, for a given similarity measure, when the number of classes is unknown. Similarity detection is a subclass of the pattern recognition problem. The deduplication ratio can be a ratio equal to the volume of non-compressed data divided by the volume of data deduplicated by some method.

FIG. 1 shows a data compression apparatus 100 for compressing a data stream. The data compression apparatus 100 comprises a data splitter 110, a classifier 120, a reference block detector 130 and a first data compressor 140.

The data splitter 110 is configured to split the data stream into data blocks. For example, the data blocks can be blocks of size n Bytes, and the data splitter is configured to create a new block based on every n Bytes of the data stream.

The classifier 120 is configured to classify the data blocks into a plurality of data classes. The reference block detector 130 is configured to detect reference blocks for each of the plurality of data classes.

The first data compressor 140 is configured to compress data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.
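For illustration only, the following Python sketch traces blocks through the components of FIG. 1; the classify function, the per-class thresholds CN and the delta coder are hypothetical placeholders, blocks are assumed to have a fixed size, and the first block of each class simply becomes its initial reference block.

def compress_stream(stream, block_size, classify, cn, compress_delta):
    references, index_in_class, output = {}, {}, []
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]            # data splitter (110)
        data_class = classify(block)                          # classifier (120)
        i = index_in_class.get(data_class, 0)
        index_in_class[data_class] = i + 1
        if i % cn.get(data_class, 1000) == 0:                 # reference block detector (130)
            references[data_class] = block
            output.append(("reference", data_class, block))
        else:                                                 # first data compressor (140)
            output.append(("delta", data_class, compress_delta(references[data_class], block)))
    return output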

Advantages of the data compression apparatus 100 of FIG. 1 can include: It can be suitable for file-based, constant and variable data block based deduplication architectures. It can provide a higher compression rate than known architectures (+12% on average compared to an existing solution). It can be implemented in an adaptive way, so that it naturally adapts to changing input stream features (as it uses data locality). It has low computational complexity and is applicable for inline data deduplication and even RAM deduplication.

FIG. 2 shows a method 200 for compressing a data stream.

The method comprises a first step 210 of splitting the data stream into data blocks. The method comprises a second step 220 of classifying the data blocks into a plurality of data classes. The method comprises a third step 230 of detecting reference blocks for each of the plurality of data classes.

The method comprises a fourth step 240 of compressing data blocks of a first data class based on a difference between the data blocks of the first data class and reference blocks detected for the first data class.

A further method for data-locality aware deduplication, based on machine learning techniques, comprises the following steps:

- detecting locality class in an input data stream (file-based, constant or variable block length) by:

• splitting a data stream into blocks and

• applying similarity class detection for the given block using any machine learning techniques

- detecting reference chunks for each similarity stream independently using a chunk-index modulo rule.

- using big block data deduplication (e.g. delta-compression) method for blocks in each similarity stream with the given reference blocks.

In a further embodiment, a data deduplication mechanism can split input streams into sub-streams using sets of similarity detectors. A similarity detector reconstructs a data locality feature, which typically exists in adequate input data. The set of similarity detectors is adaptive and can be enlarged by new detectors if the existing ones cannot sufficiently recognize an input data block. After stream splitting, each sub-stream is responsible for handling one locality class and therefore all blocks in such a stream are processed in a similar manner. Each stream has its own reference detection mechanism to extract reference data blocks. Reference detection is managed by an adaptive chunk type detection rule. After a reference chunk is found, a delta compression method is applied for a number of subsequent data chunks.

Special branch processing can be provided for the case when none of the similarity detectors could find a close-enough similarity class. In this case some conventional compression method (e.g. LZ4 compression) can be applied.

Overall, reconstructing data locality and adaptive reference chunk detection based on similarity detector feedback (e.g. SVM scores) allow a higher compression ratio to be provided. Tests show up to 15% compression ratio improvement compared to competing deduplication schemes.

FIG. 3 is a flow chart of a further method for compressing a data stream. In the following, chunks are data blocks. The method 300 receives some input 302. From this input 302, data chunks of size "T" are determined in step 304. The data chunks are fed to N similarity detectors, including a first similarity detector 306 and an N-th similarity detector 308.

The similarity detectors 306, 308 can be implemented using Support Vector Machines. SVMs are a classification method which separates given samples of two different classes in the best way. In an implementation, the SVM score is a weighted sum

score = a_1*t_1 + ... + a_n*t_n + bias,

where (t_1, ..., t_n) are coefficients and bias is a constant term (both describing the separating hyperplane). A typical binary SVM decision rule is: score > 0.

Herein, by training the SVM detector we refer to calculating the best set of coefficients and the best bias. Retraining the SVM detector can refer to recalculating the coefficients and the bias online.
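The score and decision rule above translate directly into a few lines of Python; this is only a restatement of the formula, not a full SVM implementation.

def svm_score(features, coefficients, bias):
    # score = a_1*t_1 + ... + a_n*t_n + bias
    return sum(a * t for a, t in zip(features, coefficients)) + bias

def detects_similarity(features, coefficients, bias):
    # Typical binary SVM decision rule: a positive score means the block belongs to this class.
    return svm_score(features, coefficients, bias) > 0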

If a similarity detector 306, 308 detects a similarity, the method proceeds in a step 310 of updating a detection rule. Subsequently, in step 312, a chunk type is detected. If it is detected that the chunk is a reference chunk, then in step 314, this chunk is set as the current reference block. If, in step 312, it is determined that the chunk is a delta block, the method proceeds in step 316 with applying block deduplication based on the current reference block.

If none of the similarity detectors 306, 308 detects a similarity, it is determined in step 318 whether all similarity scores are low, e.g. whether all scores are lower than a predetermined threshold. If yes, a new similarity detector is added and applied in step 320. If no, a conventional deduplication method is applied.

Similarity detection can be based on so-called n-grams. Consider a string L = a_1 a_2 ... a_k of length k, k >= n. An n-gram is any substring of length n: a_i a_(i+1) ... a_(i+n-1), i = 1, ..., k-n+1. A histogram of n-grams contains the frequencies of all n-grams in the input string. 1-grams are called unigrams; 2-grams are called bigrams.

FIG. 4 is an illustration of a first histogram 410 of unigrams and a second histogram 420 of bigrams for the set L=[2,2,3,1,4,2,1,4] over the alphabet A={1, 2, 3, 4}. Hashes can be computed based on the histograms 410, 420 using suitable cryptographic functions.
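These counts can be checked with a few lines of Python (collections.Counter is used purely for illustration):

from collections import Counter

L = [2, 2, 3, 1, 4, 2, 1, 4]
unigrams = Counter(tuple(L[i:i + 1]) for i in range(len(L)))      # histogram 410
bigrams = Counter(tuple(L[i:i + 2]) for i in range(len(L) - 1))   # histogram 420
print(unigrams)  # Counter({(2,): 3, (1,): 2, (4,): 2, (3,): 1})
print(bigrams)   # e.g. (1, 4) occurs twice, every other bigram once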

FIG. 5 is a flow chart of a method 500 for classifying a data chunk. A data chunk is a data block. To detect the closest similarity class, different machine learning based similarity detection methods can be used, e.g. a computationally efficient Support Vector Machine method based on data chunk histogram analysis. In a first step 502, the method is provided with data chunks of equal fixed size T (e.g. T=4096) bytes at the input of the system. In a second step 504, an n-gram representation of a given data chunk is computed. This is repeated for all data chunks.

In a third step 506, a cryptographic hash of the n-gram representation is computed. The cryptographic function is preferably chosen in a way that the hash size of one n-gram is less than n Bytes.

In a fourth step 508, a frequency calculation of the hashes is performed (leading to histograms of hashes). In a fifth step 510, the M most frequent positions are selected. Preferably, this is performed as follows: As all hashes are enumerable, h_1, h_2, ..., h_(2^m), where m indicates the bit count of the cryptographic function, we extract p_1, ..., p_M, where 0 <= p_i < 2^m, 1 <= i <= M, and form a vector in which M ones, out of 2^m positions, are placed at positions p_1, ..., p_M. In a sixth step 512, this vector, which typically comprises mostly zeros, is fed to the inputs of all classifiers.

In a final step 514, besides the class index, the output classifier scores are accumulated. In one embodiment, T=4096, n=4, m=8 and a cyclic redundancy check (CRC) code is used as the cryptographic hash.
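Steps 504 to 512 can be sketched in Python as follows. CRC32 truncated to m bits stands in for the hash, numpy is assumed, and the value M=8 is an illustrative choice rather than one taken from the description.

import zlib
import numpy as np

def chunk_feature_vector(chunk, n=4, m=8, M=8):
    # Hash every n-gram (step 506), build a histogram of hash values (step 508),
    # keep the M most frequent positions (step 510) and encode them as a binary vector (step 512).
    counts = np.zeros(2 ** m, dtype=np.int64)
    for i in range(len(chunk) - n + 1):
        h = zlib.crc32(chunk[i:i + n]) & (2 ** m - 1)
        counts[h] += 1
    top_positions = np.argsort(counts)[-M:]
    vector = np.zeros(2 ** m, dtype=np.uint8)
    vector[top_positions] = 1
    return vector    # mostly zeros; input to the similarity detectors

features = chunk_feature_vector(b"example database page contents " * 128)   # one chunk of roughly T bytes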

FIG. 6 is a flow chart of a method 600 for determining a new similarity detector. If no similar class is found for a new input chunk, the similarity detectors return low feedback scores. To improve the compression ratio, we propose to incorporate additional similarity detectors. As a new similarity detector requires training, we propose to accumulate these "hardly recognized" chunks in a history pool. When the pool is full, a new training procedure for the classifiers starts.

In particular, the method 600 receives a vector 602 of scores as input. In step 604, it is determined whether all scores are below a first predetermined threshold, thr_1. If not, the method ends in step 606. If all scores are below the first predetermined threshold, the method proceeds in step 608 and accumulates the current chunk in a log of unclassified chunks. In step 610, it is then determined whether the size of this log, referred to as history_size, is larger than a second threshold, thr_2. If so, SVM coefficients are trained for a new similarity class in step 612. Otherwise, if the log is not yet large enough, the method proceeds in step 614 with processing further data chunks.
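A compact sketch of this logic is shown below; the threshold values and the train_svm callable are hypothetical placeholders, not values prescribed by the description.

history = []   # log of "hardly recognized" chunks

def maybe_train_new_detector(chunk_features, scores, known_class_features,
                             train_svm, thr_1=0.0, thr_2=1000):
    if max(scores) >= thr_1:           # step 604: some detector recognized the chunk
        return None                    # step 606: nothing to do
    history.append(chunk_features)     # step 608: accumulate the unclassified chunk
    if len(history) <= thr_2:          # step 610: pool not full yet
        return None                    # step 614: keep processing further chunks
    new_detector = train_svm(positives=history, negatives=known_class_features)   # step 612
    history.clear()
    return new_detector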

Reference block selection is important as delta compression directly depends on the reference/target block pair. A bigger number of reference blocks leads to a better data locality response, but on the other hand requires additional space for reference block storage. To keep the balance, we propose using adaptive thresholds CN that set the detection frequency of reference blocks. Adaptation is performed by the "update detection rule" block. In one embodiment, the following method of chunk type detection is used (it should be noted that more elaborate approaches can be applied). To detect whether the current chunk is a reference chunk for big block compression, we use the following rule:

If mod(chunk_index, CN(i)) == 0, then set the chunk as a reference chunk, i=1,...,SDN, where i is the similarity detector index and SDN is the number of similarity detectors. The threshold CN(i) is unique for each similarity class and can depend on locality features of the current class. At the initialization stage, CN(i) may be set to some predefined constant value (e.g. 1000), or it can be updated on every step by the "update detection rule" block described below.

To update thresholds in the detection rule, the following procedure can be used:

- If the classifier feedback score[i] for class #i increases in time (it means data locality growth), increase CN(i) in order to achieve a better compression rate, and/or

- if score[i] for class #i decreases in time (which means less data locality), decrease CN(i).

The updating mechanism can be different. For example, the following method can be used without loss of generality:

If score[i](t) - score[i](t-1) > CN_2,

then CN(i) = MAX(1, CN(i) - 0.02*MAX_CN),

else CN(i) = MIN(MAX_CN, CN(i) + 0.02*MAX_CN).

Here CN_2 denotes a score difference threshold, and MAX_CN = min_{1 <= i <= SDN} CN(i).
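A minimal sketch of the chunk type rule and of the qualitative adaptation described in the bullet points above is given below; the constants are illustrative and the function names are not taken from the description.

MAX_CN = 1000

def is_reference_chunk(chunk_index, cn):
    # Chunk type detection: every CN-th chunk of a similarity class is a reference chunk.
    return chunk_index % cn == 0

def update_cn(cn, score_now, score_prev, step=0.02):
    # Growing similarity scores indicate stronger data locality, so reference chunks can be
    # taken less frequently (larger CN); falling scores do the opposite.
    if score_now > score_prev:
        return min(MAX_CN, cn + step * MAX_CN)
    return max(1, cn - step * MAX_CN)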

Exhaustive simulation with various scenarios and configurations shows that a deduplication system based on the proposed method significantly outperforms a reference solution on almost every test and is at least not worse on the remaining minor part of the tests in terms of deduplication rate. Analysis shows that our system has the same computational complexity as competing deduplication systems and equals the complexity of a reference solution. Table 1 shows the deduplication ratio of the proposed system in comparison with an existing EMC solution and a reference solution for different input scenarios.

Table 1. Compression ratio testing results for various input scenarios.

Embodiments of the invention include:

■ A method and system for efficient data locality-aware deduplication comprising:

- detecting locality class in an input data stream (file-based, constant or variable block length) by:

• splitting a data stream into data blocks and

• applying similarity class detection for the given data block using machine learning techniques

- detecting reference chunks for each similarity stream independently using a chunk-index modulo rule: mod(chunk_index, CN) == 0

- using a big block data deduplication (e.g. delta-compression) method for data blocks in each similarity stream with the given reference blocks

■ A method described above, which uses a computationally efficient support vector machine detector on binary vectors corresponding to n-gram histogram maxima for detecting the locality class

- A method may be implemented using a summation of a limited number of SVM filter coefficients

■ A method described above, which optionally applies a score analysis procedure to construct new similarity classes on the fly or to update the SVM filter coefficients

■ A method described above, which optionally updates the detection policy for reference blocks in each similarity stream by changing the reference block selection parameter CN, based on a similarity detection score for the given stream

■ A method described above, which assigns a data block not detected by any similarity stream detector (i.e. with a detection score below a predefined score threshold in all similarity streams) to a separate stream in which some universal data deduplication method is used.

The foregoing descriptions are only implementation manners of the present invention; the scope of the present invention is not limited to them. Any variations or replacements can easily be made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.