Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
DISTRIBUTED DATA STORAGE AND RECOVERY
Document Type and Number:
WIPO Patent Application WO/2012/007715
Kind Code:
A2
Abstract:
Method and apparatus for retrieving and receiving data divided into components of the data, the components being recreatable using associated parity data generated from the components of data, wherein each component and parity data is sub-divided into related data subsets, the related data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are stored in separate storage locations, the method comprising the steps of: a) retrieving some of the subsets of data and further parity data from the separate storage locations, b) recreating any missing subsets of data by executing a parity function across unrelated further data subsets and further parity data, c) Combining related subsets of data and any recreated related subsets of data to form one or more combined components of data, d) Combining the combined components and recreated components to form the retrieved data.

Inventors:
SYGABEKOV ISKENDER (KZ)
ZADAULY YERKIN (KZ)
LAUMULIN CHOKAN (GB)
Application Number:
PCT/GB2011/001050
Publication Date:
January 19, 2012
Filing Date:
July 12, 2011
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
EXTAS GLOBAL LTD
SYGABEKOV ISKENDER (KZ)
ZADAULY YERKIN (KZ)
LAUMULIN CHOKAN (GB)
International Classes:
G06F11/10
Domestic Patent References:
WO2010026366A12010-03-11
WO2010002633A12010-01-07
Foreign References:
EP1780716A22007-05-02
Attorney, Agent or Firm:
HOWARD, Sands (Verulam Gardens70 Gray`s Inn Road, London WC1X 8BT, GB)
Download PDF:
Claims:
CLAIMS :

1. A method of retrieving data divided into components of the data, the components being recreatable using associated parity data generated from the components of data, wherein each component and parity data is sub-divided into related data subsets, the related data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are stored in separate storage locations, the method comprising the steps of:

a) retrieving some of the subsets of data and further parity data from the separate storage locations;

b) recreating any missing subsets of data by

executing a parity function across unrelated further data subsets and further parity data;

c) combining related subsets of data and any

recreated related subsets of data to form one or more combined components of data; and

d) combining the combined components and recreated components to form the retrieved data.

2. The method of claim 1 further comprising the steps of : bl) recreating any missing further parity data by executing a parity function across unrelated data subsets, wherein the unrelated data subsets are either retrieved data subsets or recreated data subsets; and/or

b2) recreating any missing subsets of data by

executing a parity function across related data subsets and the recreated further parity data.

3. The method of claim 1 or claim 2 further comprising the steps of: cl) recreating a component of data for which a

complete set of its related data subsets are not available from combining retrieved or recreated data subsets by executing a parity function across the available data subsets of the component of data and their associated further parity data.

4. The method according to any previous claim further comprising the step of:

c2) recreating the retrieved data by executing a parity function across the components of data and parity data .

5. The method according to any previous claim, wherein the parity function is exclusive OR, XOR.

6. The method according to any previous claim, wherein the separate storage locations are remote separate physical devices .

7. The method according to any previous claim, wherein the retrieved data is combinable with other retrieved data to form consolidated data. 8. The method according to any previous claim, further comprising the step of decrypting the data that had been previously encrypted.

9. The method according to any previous claim, wherein the separate storage locations are selected from the group consisting of hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server.

10. The method according to any previous claim, wherein the data are web pages .

11. The method according to any previous claim further comprising the steps of before retrieving the data:

dividing the data into the components;

generating the parity data from the components;

dividing each component and parity data into the related data subsets;

generating the further parity data from the related data subsets; and

storing the data subsets and further parity data in the separate storage locations . 12. Apparatus for retrieving data divided into components of the data, the components being recreatable using parity data generated from the components of data, wherein each component and parity data is sub-divided into related data subsets, the data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are stored in separate storage locations, the apparatus comprising a processor arranged to:

a) retrieve some of the subsets of data and subsets of further parity data from the separate storage locations; b) recreate any missing subsets of data by executing a parity function across unrelated further data subsets and further parity data;

c) combine related subsets of data and any recreated related subsets of data to form one or more combined

components of data;

d) combine the combined components and recreated components to form the retrieved data.

13. The apparatus of claim 12 further comprising separate remote storage locations . 14. A method of receiving data divided into components of the data, the components being recreatable using associated parity data generated from the components of data, wherein each component and parity data is sub-divided into related data subsets, the related data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are received from separate transmission

channels, the method comprising the steps of:

a) receiving some of the subsets of data and further parity data from the separate transmission channels;

b) recreating any missing subsets of data by

executing a parity function across unrelated further data subsets and further parity data,- c) combining related subsets of data and any

recreated related subsets of data to form one or more combined components of data; and

d) combining the combined components and recreated components to form the received data. 15. The method of claim 14, wherein the different

transmission channels are one or more selected from the group consisting of: wire, optical fibre, radio wave, internet protocol and mobile communication channels. 16. The method of claim 14, wherein the different

transmission channels are different frequencies.

17. The method according to any previous claim, wherein the components of data are bits, bytes or groups of bytes.

18. A computer program comprising program instructions that, when executed on a computer cause the computer to perform the method of any of claims 1 to 11 or 14 to 17.

19. A computer-readable medium carrying a computer program according to claim 18.

20. A computer programmed to perform the method of any of claims 1 to 11 or 14 to 17.

Description:
Distributed Data Storage and Recovery

Field of the Invention The present invention relates to a system and method for distributing, storing, receiving and retrieving data.

Background of the Invention WO2010/026366 describes a system for storing data across separate storage locations such as hard drives located at remote locations. In this system, the data or file is divided into separate data subsets. Parity data are created so that any lost data subsets may be recreated. The data subsets and parity data are then further separated with further parity data being recreated for each division. The data may be divided and redivided until each available separate storage location is allocated a separate data subset or parity data component. The original data is recreated by retrieving each data subset. Where a

particular storage location becomes unavailable or a data subset is corrupted or lost then it may be recreated from its remaining data subsets at that particular level of division using the corresponding parity data.

This system provides resilience to loss of storage locations or corruption of data and also provides security to the owner of the original data as the original data may not be recreated without knowledge of the data division scheme used to divide and sub-divide the data and distribute it amongst the separate storage locations.

However, once a particular level of data loss is reached (number of data subsets) or when certain

combinations of the data subsets are lost simultaneously then it becomes impossible to recreate the data from the remaining data subsets are parity data according to this method .

Therefore, there is required a way to improve

resilience to data loss under such conditions.

Summary of the Invention

In accordance with a first aspect of the present invention there is provided a method of retrieving data divided into components of the data, the components being recreatable using associated parity data generated from the components of data, wherein each component and parity data is sub-divided into related data subsets, the related data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are stored in separate storage locations, the method comprising the steps of:

a) retrieving some of the subsets of data and further parity data from the separate storage locations;

b) recreating any missing subsets of data by

executing a parity function across unrelated further data subsets and further parity data;

c) combining related subsets of data and any

recreated related subsets of data to form one or more combined components of data;

d) combining the combined components and recreated components to form the retrieved data. Combining the related subsets and combined components may be the same step and involve directly forming the original data by matching the recovered and any recreated data elements to its position in the original data based on the storage location from which it was recovered or for which it was recreated. WO2010/026366 describes recovering data from associated data sets. In other words, data subsets that are directly combinable to form the original data or file. Enhanced data recovery may be achieved by utilising unrelated or

unassociated data subsets and parity data that are not directly combinable to form components of the original data. For instance, it is possible to recreate a lost set of parity data from two or more data subsets that do not combine to form a data subset (or component) at a higher level in a cascade of data division. Once this missing parity data is recreated then it may be used to regenerate or recreate data subsets that are combinable to form a higher level data subset (or component) . In a similar way, a missing data subset may be recreated from recovered or retrieved data subsets and parity data that are not

components of higher level data subsets .

Parity data at a higher level (i.e. towards the

original data) may be divided in a similar way to form subsets of data at a lower level. Once divided it may be treated as any other data subset with further parity data generated amongst its component data subsets. Similarly, the parity data at the higher level may be reconstructed by either retrieving all of its data subsets or by regenerating missing data subsets from its associated available subsets and the further parity data. This will then be followed by the reconstruction step.

This technique makes use of the communicative and associative properties of parity comparison and in

particular, properties of the exclusive OR operation (XOR) .

In other words, carrying out the parity operation and in any order provides the same result. Therefore, unrelated or distantly related data subsets of parity data may be compared to accurately recreate unrelated data subsets or parity data.

This property may be described as cross parity control or mutual parity control and improves the resilience of the system or method to loss of storage locations or other corruptions of data.

Optionally, the separate locations may be located on one or more storage media selected from the group consisting of: compact disc, DVD, hard drive, solid state drive, FLASH memory and digital tape.

Optionally, the data or data file may be selected from the group consisting of: multimedia, audio, video, MPEG, MP3 , music, database and binary file.

Optionally, the method may further comprise the steps of :

bl) recreating any missing further parity data by executing a parity function across unrelated data subsets, wherein the unrelated data subsets are either retrieved data subsets or recreated data subsets; and/or

b2) recreating any missing subsets of data by

executing a parity function across related data subsets and the recreated further parity data. These steps may be used if further parity data and one or more associated data subsets are lost from the same group combinable to form a higher level component. These steps may be carried out at any suitable time or at the same time as step b. The missing further parity data is recreated from data subsets outside of its own group (i.e. unrelated data subsets are those data subsets that are not directly combinable to form a higher level component) . It should be noted that the further parity data may have been created from these unrelated data subsets or from related data subsets (i.e. from within their own group) . Optionally, the method further comprises the steps of: cl) recreating a component of data for which a

complete set of its related data subsets are not available from combining retrieved or recreated data subsets by executing a parity function across the available data subsets of the component of data and their associated further parity data.

Optionally, the method according to any previous claim further comprising the step of:

c2) recreating the retrieved data by executing a parity function across the components of data and parity data .

Preferably, the parity function may be exclusive OR, XOR. Therefore, XOR can be used to both generate the parity data and to regenerate any lost data or parity data.

Preferably, the separate storage locations may be remote separate physical devices. This improves security and resilience to data loss.

Optionally, the retrieved data is combinable with other retrieved data to form consolidated data. In other words, the method may be extended to any number of levels or cascades of data. Regenerating or combining data may take the form of several steps until the original data or file is reconstructed .

Optionally, the method may further comprise the step of decrypting the data that had been previously encrypted.

Therefore, further enhancing security.

Optionally, the separate storage locations are selected from the group consisting of hard disk drive, optical disk, FLASH RAM, web server, FTP server and network file server.

Others may be used. Optionally, the data are web pages. Therefore, it may be difficult or impossible to determine what web pages have been retrieved.

Preferably, the method further comprises the steps of before retrieving the data:

dividing the data into the components;

generating the parity data from the components;

dividing each component and parity data into the related data subsets;

generating the further parity data from the related data subsets ; and

storing the data subsets and further parity data in the separate storage locations. This is one way for the data to arrive at the separate storage locations.

According to a second aspect there is provided

apparatus for retrieving data divided into components of the data, the components being recreatable using parity data generated from the components of data, wherein each

component and parity data is sub-divided into related data subsets, the data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are stored in separate storage locations, the apparatus comprising a processor arranged to:

a) retrieve some of the subsets of data and subsets of further parity data from the separate storage locations; b) recreate any missing subsets of data by executing a parity function across unrelated further data subsets and further parity data;

c) combine related subsets of data and any recreated related subsets of data to form one or more combined

components of data; d) combine the combined components and recreated components to form the retrieved data.

Preferably, the apparatus may further comprise separate remote storage locations. These may be physically separate, remote or separate parts of the same physical device.

According to a third aspect there is provided a method of receiving data divided into components of the data, the components being recreatable using associated parity data generated from the components of data, wherein each

component and parity data is sub-divided into related data subsets, the related data subsets being recreatable from associated further parity data generated from the related data subsets, wherein the data subsets and further parity data are received from separate transmission channels, the method comprising the steps of:

a) receiving some of the subsets of data and further parity data from the separate transmission channels;

b) recreating any missing subsets of data by

executing a parity function across unrelated further data subsets and further parity data;

c) combining related subsets of data and any

recreated related subsets of data to form one or more combined components of data;

d) combining the combined components and recreated components to form the received data. Therefore,

communication of data can have similar benefits to those described with reference to the storage of data including improved security and resilience to loss.

Optionally, the different transmission channels may be one or more selected from the group consisting of: wire, optical fibre, radio wave, internet protocol and mobile communication channels or any combination. Optionally, the different transmission channels may be different frequencies.

Advantageously, the components of data may be formed by taking bits, bytes or groups of bytes from the data. For example, the data may be separated bitwise or byte-wise.

The method may be implemented using a computer or a device including a processor programmed or dedicated to perform the method.

It should be noted that any feature described above may be used with any particular aspect or embodiment of the invention.

Brief description of the Figures The present invention may be put into practice in a number of ways and embodiments will now be described by way of example only and with reference to the accompanying drawings, in which:

FIG. 1 shows a flowchart of a method for dividing and storing data, retrievable by an aspect of the present invention; and

FIG. 2 shows a schematic diagram of a network used to retrieve and store data according to a further aspect of the present invention;

FIG. 3 shows a schematic diagram of data stored

according to the method of FIG. 1;

FIG. 4 shows a schematic diagram of the data

distributed as clusters stored following the method of FIG.

1;

FIG. 5 shows a schematic diagram of data split into components or data subsets and parity data; FIG. 6 shows a schematic diagram of each of the data subsets and parity data further split into further data subsets and further parity data;

FIG. 7 shows a diagram illustrating the parity

relationships of data subsets and parity data retrieved according an aspect of the present invention;

FIG. 8 shows a diagram illustrating the recoverability of data according to a prior art technique;

FIG. 9 shows a diagram illustrating the non- recoverability of data according to a prior art technique;

FIG. 10 shows a diagram illustrating the recoverability of the data of FIG. 9 according to an aspect of the present invention, given by way of example only;

FIG. 11 shows a schematic diagram of the parity

relationship of data subsets and parity data illustrating the recoverability of data according to an aspect of the present invention given by way of example only;

FIG. 12 shows a graph of the the greatest percent of loss not being fatal to further restoration;

FIG. 13 shows a graph of the probability of total loss for with the minimum quantity of lost files depending on division level;

FIG. 14 shows a graph of the probability of total loss for different percentages of lost files at a second level of file division;

FIG. 15 shows a schematic diagram of an example method of dividing data;

FIG. 16 shows a schematic diagram of an example method of dividing data; and

FIG. 17 shows a schematic representation of apparatus for transmitting data according to a further aspect of the present invention. It should be noted that the figures are illustrated for simplicity and are not necessarily drawn to scale.

Detailed description of the preferred embodiments

Data to be stored may be in the form of a binary file, for instance. The data may be divided into subsets of data. Parity data may be generated from the subsets of data in such a way that if one or more of the data subsets is destroyed, corrupted or lost, that missing subset may be recreated from the remaining subsets and parity data.

Parity or control data, C, is generated from the original data for the purpose of error checking or to enable lost data to be regenerated. However, the parity data does not contain any additional information over that contained by the original data. There are several logical operations that may achieve the generation of such parity data. For instance, applying an exclusive or (XOR) to two binary numbers results in a third binary number, which is the parity number. Should either of the original two binary numbers be lost then it may be recovered by simply

performing an XOR between the remaining original number and the parity number. For a more detailed description of a calculation of parity data see

http: //www.pcguide . com/ref/hdd/perf/raid/concepts/genParity- c.html. Once the parity data has been calculated all of the data subsets and parity data may be stored in separate or remote file locations.

However, each of the data subsets or parity data may be separated into further subsets and further parity data may be generated in order to utilise any additional storage locations. In this way, a cascade of data subsets may be created until all available storage locations are utilised or a predetermined limit in the number of locations is reached. The data may be recovered using a reverse process with any missing data subsets being regenerated or recreated from the remaining data subsets and parity data using a suitable regeneration calculation or algorithm. The reading process continues until the single original data is

recovered.

FIG. 1 shows a flow diagram of an example method 10 for storing data. The original data 20 is split into data subsets A and B in step 30. The data may be split into two equal parts, so that the subsets A and B are of equal size. Zero padding may be used to ensure equal sized subsets A and B. For example, additional zero bytes (or groups of bits) may be added to the end of subsets A and B before the parity data P is generated. After the data 20 has been split into subsets A and B an exclusive OR operation may be carried out on subsets A and B, at step 40, to generate parity data set P. Alternatively, the parity data P may be generated during the splitting or separation step 30.

FIG. 3 shows a schematic diagram of the data resulting from a single iteration of the method shown in FIG. 1. Like method steps have the same reference numerals. The original data set or file 20 is split byte-wise to generate data subset A and data subset B (i.e. block size of one byte) . The exclusive OR operation generates parity data P. Where there are three separate storage locations available the method 10 would stop at this stage resulting in a data cluster 150 having three distributed discrete data subsets A, B and P.

FIG. 4 shows the result of a further iteration of steps

30, 40 and 50 of method 10. In this case, nine separate storage locations are available and so each of the three data subsets A, B and P may be further split into three further data subsets each.

This additional recursive splitting 230 results in data subset A being split to form further data subsets AA and further parity data AP. Similarly, data subset B may be split into BA and BB, which together may be used to form parity data BP. Parity data P may be split into PA, PB and PP. For this particular embodiment of the method each of the three data subsets have the same size. The nine

separate data locations used to store each of these nine data subsets may form a second level cluster 250.

In other words, the first level cluster 150 has been expanded to form a second level cluster 250. There is therefore no need to store the original three data sets A, B and P (but this may be done anyway as an alternative method for additional resilience to data loss) as these may each be recreated from the nine data subsets in the second level cluster 250. The loop in the method 10 may be repeated as many times as necessary until all available storage

locations are used or a predetermined limit is reached of the size of each subset has been reduced to a particular level .

A system used to perform the method 10 is shown in FIG. 2. The system 400 shown in FIG. 2 may be used to distribute information securely over networks such as the Internet or an intranet. The Internet or subsets of web pages 420 may be distributed securely to a user machine 440 via a central server 410. The central server 410 takes the web pages 420 and stores them according to the method 10 shown in FIG. 1 within separate storage locations 430. Central server 410 supplies the user machine 440 with an optional decryption code or codes and information to identify and locate data subsets from particular storage locations 430 and how to recreate the data forming the original web pages 420.

Therefore, the web pages 420 are no longer prone to a single point of failure or attack (for instance, a single web server going down) as the original data 20 is distributed amongst separate storage locations 430. Furthermore, any third party intercepting the network traffic of the user computer 440 would not be able to decrypt or recreate the original data forming the web pages 420 without the

decryption keys and regeneration information supplied by the central server 410.

Figure 5 shows an alternative notation used to depict the splitting of a File into a data subset labelled as Left L and labelled as Right R with the parity data labelled as the Check C. Figure 6 indicates how these data subsets and parity data are further split and were divided to a second level where each of the Left, Right and Check data sets are themselves split into Left, Right and Check data subsets (for example LL, LR and LC) . This results in nine separate data subsets or Files, at the second cascade or level, that may be stored at separate remote storage locations.

Figure 7 illustrates the possible ways in which data may be regenerated and how each potential missing data subset can be recreated from existing data in the example where the data is split to a second level and wherein each splitting splits a File F or subset of data into two with a single parity data being generated (e.g. XOR) .

The possible recreation techniques described in

WO2010/026366 are shown at the end of the horizontal arrows. According to this technique, associated data files are those that directly form from a split file at a previous level or were generated as parity data from these split data subsets (although it can be seen that the parity data may be created from data sets outside of its own group) . Regeneration possibilities available due to the cross parity or mutual parity comparison techniques are shown for this particular example at the end of the vertical arrows . This provides additional resilience to data loss as more possibilities are available using this technique.

Figure 8 illustrates how the loss of three data subsets or files from three separate storage locations (LL, LR and LC) is recoverable in the sense that the original file may be retrieved, by using the parity reconstruction technique described in WO2010/026366 for the "Nine files at a second level or splitting" example. In this case, it is not necessary to recreate the first level parity data C in order to create File F but this is shown in Figure 7 for

completeness .

Figure 9 shows an example where the reconstruction technique described in WO2010/02633 cannot be used to recreate the original data or file F where five particular data subsets (LL, LC, RL, RR and CR) are lost. In this example, there is not enough information to recreate the original file based on the prior art method.

However, Figure 10 shows how the use of the cross parity technique and parity commutation can be used to reconstruct enough missing data subsets and parity data to reconstruct the original data F. As illustrated by the vertical arrow pointing down through missing data subset LC, present data subset RC and available data subset CC, the data subset LC may be reconstructed. Following from this and at the first level, data subset L may be reconstructed from one of its two component data subsets LR and the reconstructed parity data LC. Data subset C may be

reconstructed according to the conventional parity

reconstruction method described in WO2010/026366.

Therefore, data subset L and its corresponding parity set C have been formed and from this, the missing data subset R may be reconstructed. Once R and L are available then they may be combined in reverse of the splitting procedure used to generate them in the first place to arrive at the

original data file F.

It is noted that without the use of the cross parity technique then reconstruction of the original data file F would not be possible.

Figure 11 illustrates diagrammatically the data

structure for a third level of data splitting, again where at each level a file is split into two (for example) with parity data generated. Therefore, the third level results in 27 separate data subsets or files and parity data. A similar naming convention is used in Figure 11. The darker circles indicate data subsets containing only parity data information.

Whereas, two- level splitting results in the potential for each data subset to be recreated in two different ways, following three-level splitting each data subset may be recreated in three separate ways. This may be described as three-dimensional cross parity control. In other words, the number of dimensions available in recreating data subsets is proportional to the level of splitting. This is illustrated by the lower half of Figure 11 where each data subset is shown as a cube . Nine cubes form a layer and three layers form a larger cube made up of each of the 27 separate data subsets (3x3x3) .

As in the two-dimensional example, two out of three corresponding groups of data subsets may be used to recreate a missing third data subset. As shown in Figure 11, data reconstruction according to WO2010/026366 is shown in the direction of arrow 1. Cross parity reconstruction is illustrated in the direction of arrows 2 and 3.

The following example uses numbers to represent data components (e.g. bits) in a file F.

The file F of size n bits can be presented in the form of a bit set {1,2,3,4 ... n} , where the digits represent a serial number of a bit in the file.

One can split the file bitwise, whilst a parity set is generated (or generated at a different time) .

File F, is split on three parts {L, R, C} where:

L — Sequence of odd bits;

R — Sequence of even bits;

C — Sequence of bits gained as a result of a parity operation «exclusive OR» (XOR) over the first two sequences. This procedure is illustrated in FIG. 5.

F → {l, 2, 3, 4, 5, 6, 7, 8 ...} then

L → {1, 3, 5, 7 ...}

R → {2, 4, 6, 8 ...}

C → {Ιθ2, 3θ4, 5Θ6, 7θ8 ...}

This procedure results in file F having parity control across the three new files, which are then resistant to the loss or corruption of one of the files; any missing file may be restored:

L © R = C → R ® C = L → L ® C = R

Following this first level of splitting, dividing or cascading, as it may be known, a further division or

splitting procedure may be applied.

This results in a file structure of the form shown in FIG. 6 resulting in nine files or data subsets in three groups :

Group 1 LL → {1, 5 ...}

LR → {3, 7 ...}

LC → {1Θ3, 507 ...} Group 2

RL → { 2 , 6 ...}

RR — {4, 8 ...}

RC → {204, 608 ...}

Group 3

CL → {102, 506 ...}

CR → {304, 708 ...}

CC -> {(102)0(304), (506)0(708) ...} where each group of three files has a similar

resilience to data loss whereby by any one file within a group may be regenerated from the remaining associated two files or datasets . However, the loss of two files in one particular group means that it will not be possible to recreate the missing data from that group alone. Should two groups both be missing two data sets then the entire file will be unrecoverable using the reconstruction method described in WO2010/026366. FIG. 9 illustrates the three groups in rows. In this example, groups 1 and 2 both have two missing data subsets and so the file F is unrecoverable.

The associative and commutative property of the parity option and in particular the XOR function, enables different combinations of data subsets to be used to regenerate missing or corrupted data. For example:

CC→{ (102^0^(304) , (506) 0(708 )■■■} = { (103)0(204) , (507)0(608) ...}

Therefore, a different grouping of the nine files is available, which is different to the three groups above but which still provides resistance to data loss from one file within each group. Group 1 '

LL → {l, 5 ...}

RL —> {2, 6 ...}

CL → {l©2, 5®6 ...}

Group 2

LR → {3, 7 ...}

RR →· {4, 8 ...}

CR → {3θ4, 7Θ8 ...}

Group 3 '

LC → {Ιθ3, 5Θ7 ...}

RC → {2®4, 6θ8 ...}

CC → { (1®3)Φ (2®4) , (5θ7)Φ(6θ8) ...}

The above groups are shown vertically in FIG. 10. In this grouping the same loss of files does not result in a total loss of the data as LC may be regenerated from RC and CC.

Further data loss resilience is found when a further level of splitting is made:

LLL → 1, 9 ...}

LLR → 5, 13 ...}

LLC → 1θ5, 9Θ13 ...}

LRL → 3, 11 ...}

LRR → 7, 15 ...}

LRC → 3Θ7, 11Θ15 ...}

LCL → 1®3, 9Θ11 ...}

LCR → 5Θ7, 13®15 ...}

LCC → (Ιθ3)θ(5θ7) , (9θ11)Φ( 13Θ15)

CLL → 1Θ2, 9Θ10 ...}

CLR - 5Θ6, 13Φ14 ...}

CLC → (Ιθ2)θ(5θ6) , (9©10)Φ(13Φ14) ...}

CRL - 3θ4, 11Φ12 ...}

CRR → 7Φ8, 15Φ16 ...}

CRC → (3Φ4)Φ (7Φ8) , (11Φ12)Θ(15Φ16) ...}

CCL → (1Φ2) Φ(3Φ4) , (9Φ10)® (11Φ12) ...}

CCR - (5Φ6)Φ(7Φ8) , (13©14)Θ(15Φ16) ...}

CCC - ( (1Φ2)Φ (3Φ4) )©( (5Φ6)Φ(7Φ8) ) ,

( (9Θ10) Φ(11Φ12) )©( (13Φ14)® (15Θ16) ) ...} RLL → {2, 10 ...}

RLR → {6, 14 ...}

RLC → {2θ6, 10Φ14 . ·}

RRL → {4, 12 ...}

RRR → {8, 16 ...}

RRC → {4θ8, 12Θ16 . ·}

RCL → {2θ4, 10Θ12 . ·}

RCR → {608, 14Θ16 . ·}

RCC → { (2θ4)θ(6θ8) , (10θ12)θ(14θΐ6)

Resilience is enhanced for each further level of splitting or dividing.

For a level n, it can be shown that an n-dimensional cross parity check n-dimensional space is obtained. Each split file in n-dimensional space can be characterised by one of the following features, namely L, R or C. A set of files, sequentially accepting the values of L, R and C, forms a stable group of three data subsets , allowing retrieval of the lost file on the basis of the remaining other two.

If or ! 0^ 2 ar ... o^ a are the co-ordinates of the file, where i = {L, R, c} , and n — a dimensionality of space (equal to the level of splitting) the following correlation is found:

cc\ a a . • a l „ Φ a l aS · . 1 ,, a aS . • a x n

a L i a a . • Oi^ Θ a c i a a . = a R x a .

c a L 2 a . • a l n Φ <x « R 2 a · • α Α η = \ a c 2 a . • a\,

a\ a a . Φ \ a · = a\ a aS - where L, R, C are conditional indications of the identifier of the split file, marking up the serial numbers of bit sequences of the file F in the appropriate split file. In these examples, at each level the file F and data subsets are split into two with a parity check file. Other file splitting may be used. For example, the file may be divided into three, four, five or more parts and a parity file generated across all subsets. However, splitting into two is advantageous as it provides the greatest resistance to data loss.

Having such cross parity control in each of n- directions, and n-dimensional space, provides more

opportunities for a full retrieval of the original file F, when data is lost.

Loss of all data subsets formed from the original file is the same as total deletion of the original file F. Obviously, this will not be recoverable, and therefore makes retrieval using remaining parity files (even all parity files or parts) impossible. Removal of all of the parity files whilst retaining all original file subsets, does not lead to any loss .

This forms two extreme points of stability of the algorithm. Any variations of lost files and parity data lies between these two extremes.

The total number of the files in the algorithm is as follows: T = 3", and F = 2 n are parts of the original file. Loss of all parity files C ~ 3" - 2 n would be the most amount of loss that the system could tolerate where

retrieval (n is a splitting level) is still possible.

A minimal number of files, leading to a fatal loss, is equal to a total loss of the original file, i.e. F = 2\ The total number N of possible combinations of k from j is determined by the following binomial coefficient:

(j-k \

If in the N combinations some number of the

combinations of K is critical for retrieving the file, then the probability P of the loss of the entire file is as follows:

One can estimate a probability P of fatal

combinations in order to achieve a variation of a minimal possible number of the lost files: combinations on k files from j , where k = F = 2 n and j = T = 3 n .

1. At the first level of splitting, a loss of two files unambiguously leads to a fatal result, as all three combinations of two out of three are fatal — P = 1.0.

2. At the second level, 22=4 files should be in pairs simultaneously in both a line and a column. The total of such combinations is nine. Thus, for a combination of 4 out of 9, a probability of loss of the split file is as follows :

j=9, k=4, K=9; N = ^-^= 126,

P =— = 0,07

126

3. At the third level of splitting and applying a similar approach, K=21 is obtained for 8 out of 27 fatal combinations. Accordingly,

= 2220075,

P = 21 - = 0,95 · 10 ~s

2220075

FIG. 12 shows the the greatest percent of loss not being fatal to the further restoration or retrieval .

The graph in FIG. 13 shows, that the probability of a scenario of total loss of a file F from the lowest number of lost files {K = 2 n ) rapidly falls to an insignificantly small value already at the low levels of splitting (n) . For instance, at the third level, the probability of a fatal combination, with a loss of 8 out of 27 files (about 30 % of files lost), is equal to 10 "5 .

Now one can consider different fatal combinations of lost data subsets. As an example and for simplicity, the second level is considered here. Fatal combinations are as follows:

1. 4 out of 9 data subsets lost. A probability of loss of the file F is as follows:

2. 5 out of 9 data subsets lost. This may be obtained based on the previous configuration but by adding a file in different positions for each fatal combination 4 out of 9. Taking into consideration that there are five positions of single files (9 - 4 = 5), then one can obtain the following:

n=9, k=5, Κ=9·5=45

Ψ 4&

N = 5!,(9- ' 5.)! = 126, P =—126 = 0,'36

3. All combinations of 6 and more out of 9, are fatal, and a probability of loss P = 1.0.

These three scenarios are illustrated in FIG. 14. The described algorithm in different ways. For instance, blocks consisting of sets of bits, may be involved splitting:

1. Splitting maybe blockwise, i.e. involve a block, consisting of several consecutive bits grouped together

(as illustrated in FIG. 15) .

2. Splitting may be bitwise but according to a variable pattern. Examples include where:

a. The pattern is generated randomly (as illustrated in FIG. 16) .

Given a random sequence, for example, of the five nonrepeating numbers in the range from 1 to 10

M(odd) → 1,3,4,5,9

where, according to our terminology, this issue of "odd" bit. Then the "even" bits are numbered

M (even) → 2,6,7,8,10

The proposed algorithm can be used in systems of distributed storage of the information or in transmitting, receiving and reconstructing data (e.g. mobile telephony as illustrated schematically in FIG . 17 between a

transmitting mobile telephone 500 and a receiving mobile telephone 510, wireless or wired data transmission

including voice, audio, video and other data types) . The initial file or data set may be split into groups of files whose number is determined by a designated level of splitting. The split files can be stored in different locations of computers or servers, connected by either corporate networks, or the Internet. The method allows restoring of an initial file or data set even under destruction of a large number of data storage locations or elements. Calamity-proof networks of such data storages may be created, improving resilience, security and quality. The split files, stored in data centres, may be made meaningless, i.e. they contain no comprehensive information. This significantly increases information security in a distributed system.

As will be appreciated by the skilled person, details of the above embodiment may be varied without departing from the scope of the present invention, as defined by the appended claims.

Many combinations, modifications, or alterations to the features of the above embodiments will be readily apparent to the skilled person and are intended to form part of the invention. Any of the features described specifically relating to one embodiment or example may be used in any other embodiment by making the appropriate changes .