Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHODS FOR ENCODING AND DECODING A BINARY STRING AND SYSTEM THEREFORE
Document Type and Number:
WIPO Patent Application WO/2017/085245
Kind Code:
A1
Abstract:
The invention relates to a method and a system for encoding a binary string (SB), comprising the steps of: • receiving said binary bit string (SB), • converting said binary bit string (SB) into a two-element-string (ST1, ST2, … STN), whereby each element of said two-element-string is uniquely represented by a sequence of two nucleobases, selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), • whereby for each two-element-string (ST1, ST2, … STN) an error protection nucleobase (ET1, ET2,… ETN) is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby each two-element-string and its selected error protection nucleobase form an error protected block, • whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding previous error protected block if present. The invention relates to a method and a system for decoding a binary string (SB), said binary string being encoded as a two-element-string (ST1, ST2, … STN), comprising the steps of: • receiving three consecutive nucleobases forming an error protected block (B1, B2,… BN), whereby each nucleobase is selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby the first and second nucleobases are representing information and said third nucleobase represents an error protection nucleobase, • checking each error protected block (B1, B2, … BN), whether the block (B1, B2, …BN) is error-free on basis of the first and second nucleobase and an thereto expected error protection nucleobase and said received error protection nucleobase, • whereby said expected error protection base is selected according to a selection scheme taking into account at least an preceding previous error protected block if present, • for each error-free block (B1, B2, … BN) converting the first two nucleobases into a binary bit string.

Inventors:
SONG, Lifu (Denickestraße 40a, Hamburg-Harburg, 21073, DE)
ZENG, An-Ping (Hohlredder 15a, Rosengarten, 21224, DE)
Application Number:
EP2016/078122
Publication Date:
May 26, 2017
Filing Date:
November 18, 2016
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
TECHNISCHE UNIVERSITÄT HAMBURG-HARBURG (Am Schwarzenberg-Campus 1, Hamburg, 21073, DE)
International Classes:
H03M13/09; H03M5/14
Foreign References:
US20120304041A12012-11-29
Other References:
RON M. ROTH: "Introduction to Coding Theory", 2006, CAMBRIDGE UNIVERSITY PRESS, pages: 5 - 6, XP002766626
"Runlength Limited Codes for Single Error-Detection and Single Error-Correction with Mixed Type Errors", IEEE TRANSACTIONS ON INFORMATION THEORY, vol. 44, no. 4, July 1998 (1998-07-01), XP002766627, Retrieved from the Internet [retrieved on 20170201]
MARKARIAN G ET AL: "Trellis decoding technique for block RLL/ECC", IEE PROCEEDINGS : COMMUNICATIONS, INSTITUTION OF ELECTRICAL ENGINEERS, GB, vol. 141, no. 5, 1 October 1994 (1994-10-01), pages 297 - 302, XP006001671, ISSN: 1350-2425, DOI: 10.1049/IP-COM:19941275
POPPLEWELL A ET AL: "SPECTRAL CHARACTERISATION AND PERFORMANCE EVALUATION FOR A NEW CLASS OF ERROR CONTROL LINE CODES", IEE PROCEEDINGS I. SOLID- STATE & ELECTRON DEVICES, INSTITUTION OF ELECTRICAL ENGINEERS. STEVENAGE, GB, vol. 137, no. 4, 1 August 1990 (1990-08-01), pages 242 - 246, XP000147604, ISSN: 0956-3776
JAEJIN LEE ET AL: "ERROR CORRECTING RUN-LENGTH LIMITED CODES FOR MAGNETIC RECORDING", 1995 DIGESTS OF INTERMAG. INTERNATIONAL MAGNETICS CONFERENCE. SAN ANTONIO, APR. 18 - 21, 1995; [PROCEEDINGS OF THE INTERNATIONAL MAGNETICS CONFERENCE (INTERMAG)], NEW YORK, IEEE, US, 18 April 1995 (1995-04-18), pages HP - 03, XP000582233, ISBN: 978-0-7803-2606-4
OYVIND YTREHUS: "RUNLENGTH-LIMITED CODES FOR MIXED-ERROR CHANNELS", IEEE TRANSACTIONS ON INFORMATION THEORY, IEEE PRESS, USA, vol. 37, no. 6, 1 November 1991 (1991-11-01), pages 1577 - 1585, XP000235395, ISSN: 0018-9448, DOI: 10.1109/18.104318
P. LEE, WOLF: "A General Error-Correcting Code Construction for Run-Length Limited Binary Channels", IEEE TRANSACTIONS ON INFORMATLON THEORY, vol. 35, no. 6, November 1989 (1989-11-01), XP002766629, Retrieved from the Internet [retrieved on 20170201]
JACK KEIL WOLF: "AN INTRODUCTION TO ERROR CORRECTING CODES Part 1", INTERNET ARTICLE, 2008, XP002755802, Retrieved from the Internet [retrieved on 20160323]
ARA PATAPOUTIAN, P. VIJAY KUMAR: "The ( d , k ) Subcode of a Linear Block Code", IEEE TRANSACTIONS ON INFORMATLON THEORY, vol. 38, no. 4, July 1992 (1992-07-01), XP002766630, Retrieved from the Internet [retrieved on 20170201]
ROBERT GRASS, REINHARD HECKEL: "Robust Data Storage in DNA with Error-Correcting Codes", INTERNET ARTICLE, 28 September 2015 (2015-09-28), XP002766631, Retrieved from the Internet [retrieved on 20170201]
SEBASTIAN ANTHONY: "Harvard cracks DNA storage, crams 700 terabytes of data into a single gram", INTERNET ARTICLE, 17 December 2012 (2012-12-17), XP002766632, Retrieved from the Internet [retrieved on 20170201]
Attorney, Agent or Firm:
SCHMELCHER, Thilo (3106, TPH III Eingang BKaisertraße 100, Herzogenrath, 52118, DE)
Download PDF:
Claims:
Claims

1 . Method for encoding a binary string (SB), comprising the steps of:

• receiving said binary bit string (SB), · converting said binary bit string (SB) into a two-element-string (S-ρι , ST2, ■■■ STN), whereby each element of said two-element-string is uniquely represented by a sequence of two nucleobases, selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G),

• whereby for each two-element-string (S-ρι , ST2, ■■■ STN) an error protection nucleobase (En , EJ2, ... ETN) is selected from the group comprising Adenine (A),

Cytosine (C), Thymine (T) and Guanine (G), whereby each two-element-string and its selected error protection nucleobase form an error protected block,

• whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding previous error protected block if present.

2. Method according to claim 1 , whereby the error protection nucleobase is attributed according to a first selection scheme, and when a TTT or GGG or CCC or AAA sequence is to be expected in the present and the previous error protected block, the error protection nucleobase is selected according to a second selection scheme.

3. Method for decoding a binary string (SB), said binary string being encoded as a two-element-string (S-pi , ST2, ■■■ STN), comprising the steps of: · receiving three consecutive nucleobases forming an error protected block (B ; B2, .. . BN), whereby each nucleobase is selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby the first and second nucleobases are representing information and said third nucleobase represents an error protection nucleobase, · checking each error protected block (B ; B2, .. . BN), whether the block (B ; B2, .. .

BN) is error-free on basis of the first and second nucleobase and an thereto expected error protection nucleobase and said received error protection nucleobase,

• whereby said expected error protection base is selected according to a selection scheme taking into account at least an preceding previous error protected block if present,

• for each error-free block (B-i , B2, ... BN) converting the first two nucleobases into a binary bit string.

4. Method according to claim 1 , the error protection nucleobase is attributed according to a first selection scheme, and when a TTT or GGG or CCC or AAA sequence is to be expected in the present and the previous error protected block, the error protection nucleobase is selected according to a second selection scheme.

5. Method according to claim 3 or 4, whereby at least 5 consecutive nucleobases are received, whereby for the first and second received nucleobase an expected error protection base is determined and compared with the third received nucleobase, whereby for the second and third received nucleobase an expected error protection base is determined and compared with the fourth received nucleobase, whereby for the third and fourth received nucleobase an expected error protection base is determined and compared with the fifth received nucleobase, whereby when an expected nucleobase is not matching the received nucleobase, the respective nucleobases are held not to from a block.

6. Method according to claim 3 or 4 or 5, whereby a plurality representation of a same blocks are received, whereby erroneous representations are replaced by error-free representations.

7. System for encoding a binary string (SB), comprising · means for Receiving said binary bit string (SB), • means for Converting said binary bit string (SB) into a two-element-string (STi , ST2, ■■■ STN), whereby each element of said two-element-string is uniquely represented by a sequence of two nucleobases, selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), · whereby for each two-element-string (S-ρι , ST2, ■■■ STN) an error protection nucleobase (En , ET2, ... ETN) is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G),

• whereby said two-element-string and said selected error detection nucleobase (En , ET2, .. . ETN) form an error protected block (B-i, B2, ... BN), · whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding previous error protected block if present.

8. System for decoding a binary string (SB), said binary string being encoded as a two-element-string (S-pi , ST2, ■■■ STN), comprising

• means for receiving three consecutive bases forming an error protected block (B ;

B2, ... BN), whereby each base is selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby the first and second bases are representing information and said third base represents an error protection base,

• means for checking each error protected block (B-i, B2, ... BN), whether the block (B-i, B2, ... BN) is error-free on basis of the first and second nucleobase and an thereto expected error protection nucleobase and said received error protection nucleobase, · whereby said expected error protection base is selected according to a selection scheme taking into account at least an preceding previous error protected block if present,

• means for converting the first two nucleobase into a binary bit string.

Description:
Methods for encoding and decoding a binary string and system therefore

The invention relates to methods for encoding and decoding a binary string and a system therefore.

Background

Reliable information storage on a large scale becomes an ever increasing problem.

While on the one hand the need for quick access towards data having a short lifetime increases it also becomes apparent that archiving large sets of data for long periods is getting a serious problem.

The need for such reliable data storage may be different. Just to mention a simple example: Storage of data relating to cancer and later-on processing of high amount of data sets for distilling commonalities needs storage of Hugh data amounts. Also the storage of knowledge for future generations is an ever increasing problem as the amounts of knowledge explodes while the lifetime of modern storage systems is less of old stylish books.

The presently used media of data storage such as magnetic tape or hard drives have a decisive shortcoming of limited life time and density, e.g. around 50 years for hard drivers. However, to achieve such a long-term storage the respective devices needs to be stored in special rooms in a strictly controlled environment such that temperature changes / humidity changes, etc. may not negatively impact the devices. Such storage is extremely cost-expensive.

Long-term archival of big data is thus expensive and challenging. I.e. reliable long term storage is extremely expensive. Recently synthetic DNA (Deoxyribonucleic acid) has been proposed for future digital information storage with unprecedented density and longevity over thousands of years.

Even though some progress has been made, this progress is made on the expense of additional costs and / or increased efforts for adapting environmental conditions like in previous technologies.

To push the DNA information storage into a real cost-effective and practical technology, improvements along several lines are desperately needed. Unlike other data storage media, errors in "writing" and "reading" information in DNA introduce inevitable errors into the digital data, especially if fast and cheap synthesis and sequencing technologies (such as Nanopore sequencer) are to be used.

I.e. the content of Guanine (G) and Cytosine (C) among the nitrogenous bases forming the DNA affects the stability of DNA. While a higher GC content may lead to a certain higher thermo-stability (e.g. due to 3 hydrogen bonds) a GC content above a certain level may tend to autolysis. Also it is known that a GC content above a certain level is hard to synthesize and is also hard to sequence. To address this problem, it has been proposed to use a 1 bit / per two base coding, where a binary "1 " was coded as A/C (Adenine/ Cytosine) while a binary "0" was coded as G/T (Guanine/Thymine). However, such a coding scheme is extremely inefficient, leading again to an increased need for storage capacity.

However, in biochemistry also so called homopolymers may be a problem. Homopolymers are sequences of nitrogenous bases where a plurality of consecutive equal nitrogenous bases is arranged. Synthesis and sequencing of these homopolymers constitute a problem. In electronics, a similar problem arises when using binary technology. There, the problem is tackled by use of RLL (Run Length Limited) - Codes, which are typically classified by two parameters, a minimum number of consecutive "0" and a maximum number of consecutive "0" in between to "1 ". To address the biochemistry problem, which is different from the electronic problem, a base-3 encoding scheme has been proposed. Within the proposed scheme in order to avoid long homopolymers a quarter of the encoding capacity was spend and additional complex means introducing a very high level of redundancy were taken to ensure a full coverage of every fragment during sequencing. However, such a high redundancy reduces the data density and raises the cost for information storage. In addition, while the problem associated with homopolymers is reduced, the problems associated with GC content may not be solved at the same time. Other scientists tried to introduce error-correction codes. However, the usage thereof typically involves sophisticated en-/de- coding algorithms. The code achieved thereby introduces complex secondary structures which again may constitute a problem when sequencing. Secondary structures in the understanding of nucleic acids are e.g. helices, loops, double helixes, stem-loops, pseudoknots, to name some. It is therefore an object of the invention to provide methods and systems allowing for highly reliable encoding and decoding. It is another object to provide methods and systems which are cost effective.

Short description of the Invention

The object is solved by a method and a system for encoding a binary string. The method comprises a step of receiving said binary bit string, a step of converting said binary bit string into a two-element-string, whereby each element of said two- element-string is uniquely represented by a sequence of two nucleobases, selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby for each two-element-string an error protection nucleobase (En , E J2 , ... E T N) is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), and whereby said two-element-string and said selected error detection nucleobase form an error protected block, whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding previous error protected block if present.

The invention also relates to a method and system for decoding a binary string, said binary string being encoded as a two-element-string. The method comprises a step of receiving three consecutive nucleobases forming an error protected block, whereby each nucleobase is selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby the first and second nucleobases are representing information and said third nucleobase represents an error protection nucleobase, a step of checking each error protected block, whether the block is error-free on basis of the first and second nucleobase and an thereto expected error protection nucleobase and said received error protection nucleobase, whereby said expected error protection base is selected according to a selection scheme taking into account at least an preceding previous error protected block if present, and for each error-free block the first two nucleobases are converted into a binary bit string.

Further advantageous embodiments are subject to the detailed description as well as to the dependent claims.

Brief description of the drawings

In the following reference will be made towards the figures. In these

Fig. 1 shows an aspect of an exemplary embodiment of encoding a binary string according to the invention,

Fig. 2 shows another aspect of an exemplary embodiment of encoding a binary string according to the invention, and

Fig. 3 shows an aspect of an exemplary embodiment of decoding a binary string according to the invention.

Detailed Description

The present disclosure describes preferred embodiments with reference to the Figures, in which like reference signs represent the same or similar elements. Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment," "in an embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment. The described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the description, numerous specific details are recited to provide a thorough understanding of embodiments of the invention. I.e., unless indicated as alternative only any feature of an embodiment may also be utilized in another embodiment.

In addition, even though at some occurrences certain features will be described with reference to a single entity, such a description is for illustrative purpose only and actual implantations of the invention may also comprise one or more of these entities. I.e. usage of singular also encompasses plural entities unless indicated.

The principles of the invention will be best understood when discussing an example.

DNA sequences are composed of nucleobases, selected form the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G). I.e. by means of 4 different (constituting) nucleobases, it is possible to address 4 different information statuses. 4 different information statues in digital information processing are understood as information, which could be encoded by 2 bits, 00, 01 , 10, 1 1 .

Now within the invention more than 2 bits are encoded.

In the following we will present one example, where 4 bits are encoded into a sequence of 2 nucleobases. Obviously, further embodiments may use different amounts of bits. E.g. it may also be possible to use coding schemes which employ 3 bits into 2 nucleobases, whereby the remaining bit may be used for other purposes such as redundancy, error detection, error correction.... As such the invention is not limited to a particular scheme.

In the following we will nevertheless assume the simple case that 4 bits are encoded in a sequence of 2 nucleobases.

For the purpose of illustration we will assume an attribution of sequences of nucleobases according to the following table, without limiting the invention to that s ecific example:

ΤΑ TC TG TT

0100 0101 01 10 01 1 1

GG GA GT GC

1000 1001 1010 101 1

CC CT CA CG

1 100 1 101 1 1 10 1 1 1 1

Ετ A G c T

Ε'τ C A T G

Table 1

In this table for each binary String SB composed of 4 bits a respective (unique) sequence of nucleobases ST is displayed. Now suppose a text string as displayed in Figure 1 at the top. This text string may be represented by a respective binary string. For ease of understanding each character of the text string is displayed as an 8 bit sequence.

For each char, said 8 bit sequence may be understood as consecutive 4 bit sequences. Let's have a closer look.

The character for "space" is encoded as 0010 0000. Hence, the first 4-bits SB = 0010 may be mapped into AC while the second 4-bits SB = 0000 may be mapped into AT. The letter "w" is encoded as 01 1 1 01 1 1 . Hence, the first 4-bits SB = 01 1 1 may be mapped into TT and the second 4-bits S B = 01 1 1 may be mapped into TT as well. This process may be done continuously. As each nucleobase could in principle store 2 bits, the sequence of two nucleobases is in the following referred to as a two- element-string, i.e. a string of quaternary information elements. Obviously, in our example the string "space" may be composed of two such two-element-string elements. However, there may also be other cases where three or more such elements may constitute such a string.

The sequences of nucleobases may be arranged in the coding scheme as displayed in table 1 such that GC sequences are held at a moderate level. It is however noted that also other attribution schemes could be employed will still having the same properties.

Now to enable error detection (in that case single error detection), an error protection nucleobase is added. As a single nucleobase can encode two bits information, a single base error can lead to two bit changes which could be undiscovered by parity checking.

To enable single base error detection, the 16 nucleobase - nucleobase combinations may be arranged in into four groups arranged in columns. Within each two- nucleobase column, all two- nucleobase combinations do not share any identical base in each of the single- nucleobase columns.

Therefore, any single nucleobase change will result in a two-base combination that does not belong to the corresponding two- nucleobase column any more.

I.e. suppose that the initial two nucleobases were GA (2 nd column). In case the first nucleobase would display T, the nucleobases would correspond to the first column, while in case the second nucleobase would display a C, the nucleobases would correspond to the fourth column.

To allow detection of an erroneous nucleobase, an error protection nucleobase is added.

The exact process may be subject to implementation. We will first describe certain thoughts on the attribution of error protection nucleobases and later-on describe one example of implementation. We will refer to table 1 and figure 2.

Suppose the two nucleobases S T i=AC. For the respective column, the error protection nucleobase is C. Hence the error protection nucleobase E T i=C is associated to the two-element-string S T i . For the second two-element-string S T 2=AT the error protection nucleobase is found in column 1 . Hence the error protection nucleobase E T 2=A is associated to the two-element-string ST2- The process moves on to the third two-element-string S T3 =TT. The respective error protection nucleobase is found in column 4. Hence the error protection nucleobase E T3 =T is associated to the two-element-string ST 3 - The process moves on to the fourth two- element-string ST 4 =TT. The respective error protection nucleobase is again found in column 4. Hence the error protection nucleobase E T4 =T is associated to the two- element-string ST 4 . Next, the process moves on to the fifth ttwo-element-string S T 5=TG. The respective error protection nucleobase is found in column 3. Hence the error protection nucleobase E T 5=C is associated to the two-element-string S T 5- This process is shown in the second line of Figure 2. Now, have a closer look at the third and fourth two-element-strings and the respective error protection nucleobase. They would all be T. As 6 (or more) consecutive Ts are understood as an unwanted homopolymer, we adapt the attribution scheme. In case a certain pattern would appear, like TTT, the error protection nucleobase of the next two-element-string is attributed according to a different scheme indicated in table 1 as E' T . I.e. in the case of the fourth two-element- string TT, the error protection base is chosen according to the last line of table 1 , i.e. the error protection base is selected in the 4 th column as G.

In this way, the data encoding two-base and the error detecting base will not match to each other if there is any single base error in any of the three bases. In other words, single base error in any of the three bases can be detected by checking the compatibility between the two-base and the error detecting base during decoding. I.e. the error protection nucleobase selection takes into account at least a preceding error protected block, i.e. here error protected block formed of ST 3 and E T3 is taken into account. This process is shown in the third line of Figure 2. Obviously, there might be several ways to implement such a processing.

One example of such an implementation is as follows:

First, error protection nucleobases are attributed according to the normal scheme. Afterwards, presence of certain patterns is examined. In case a certain pattern is present, the respective error protection nucleobase is altered according to a secondary scheme.

The table 1 provides along with the attribution scheme for error protection nucleobases for an attribution scheme which avoids extreme GC sequences and long homopolymers.

The error-detecting bases may be assigned in such a way that (1 ) "A" or "T" are assigned to "GC/GG/CG/CC" containing columns to avoid extreme GC sequences;

(2) three-base homopolymers are avoided. To fulfill requirement (1 ), the two-base combinations of "GC/GG/CG/CC" are kept within two columns such that "A" and "T" are available for assigning the error- detecting base.

Therefore, in the example "GC/CG" is arranged in column 4 while "GG/CC" is arranged in column 1 .

Based on the same idea, the allowed two-base combinations that could be included in these two "GC/GG/CG/CC" containing columns are "AA/TT" or "AT/TA".

Hence, by means of the invention it is possible to encode quick and reliable in a manner allowing for long term storage. Now as this scheme may still lead to TTT sequences, it may be foreseen in advantageous embodiments that when a certain pattern arises (like TTT), the next error protection base is selected according to a different process to ensure that the same pattern is avoided.

Naturally, the pattern may be different subject to the coding scheme, e.g. TTT or GGG or CCC or AAA.

The next block is then handled again according to the normal selection scheme.

During the encoding process, a normal selection scheme is used in general and when a three-base combination "TTT" is encountered, the coding scheme for assigning the consecutive error-detecting base is switched to a different selection scheme temporarily and switched back to the normal selection scheme immediately after encoding one three-base block. Furthermore, extreme GC combinations (i.e. a high GC content) are avoided as combinations showing such content (e.g. 3 consecutive nucleobases based on combinations having merely G or C (e.g. GGG, GGC, CCG or CCC) in said different selection scheme, are only present if the previous encoding three-base block is "TTT", i.e. said different selection scheme was selected, which is rarely the case.

I.e. by means of switching the coding scheme according to a pattern noticed in a preceding error protected block, one may even avoid 7 consecutive nucleobases of the same type, thereby improving reliability. It is to be noted that when the error protection nucleobase is not added to the tail but to the header or even in between nucleobases of a two-element-string, the attribution scheme may be adopted accordingly. Now we will turn to the decoding process. The decoding process may be performed in pretty much the same scheme.

First a string of three consecutive nucleobases is received, e.g. S T i , E T -| . We again assume that the first and second nucleobases are representing information and said third nucleobase represents an error protection nucleobase. These three consecutive nucleobases represent an error protected block B-i . Further blocks B 2 , ... B N may be received as well.

Then each error protected block B ; B 2, ... B N is checked whether the block B ; B 2, ... B N is error-free on basis of the first and second nucleobase and an thereto expected error protection nucleobase.

This checking may be done by means of table 1 .

Suppose we receive the blocks as indicated in the first line of Figure 3.

There based on the first two nucleobases in the first block AG, we would expect G as error protection base En. However, received error protection nucleobase E T i=C. Hence, we may deduce that an error is present. Hence, the first block is not correctly received.

If the received error protection base matches the expected error protection base, the error protected block is held error-free. Consequently, the two nucleobases containing the stored information may be translated back into a binary string according to table 1 .

Hence, by means of the invention it is possible to decode quick and reliable in a manner allowing for long term storage.

Again, it may be foreseen that the coding avoids long homopolymers. Then the respective process needs to be implemented with respect to detection as well. Here we may again rely on the same scheme as for the encoding. I.e. a selection scheme takes into account at least a preceding previous error protected block if present.

That is, in case of the block B 4 it is taken into account that block B 3 matches a certain pattern and consequently a different error protection nucleobase is to be expected. Again, as in the encoding example the pattern may be TTT.

The next block B 5 is then handled again according to the normal selection scheme. During the encoding process, a normal selection scheme is used in general and when a three-base combination "TTT" is encountered, the coding scheme for assigning the error-detecting base is switched to a different selection scheme temporarily and switched back to said normal selection scheme immediately after encoded one three-base block. Furthermore, extreme GC combinations can be avoided as three-base combinations with merely G or C (e.g. GGG, GGC, CCG or CCC) in different selection scheme is only present if the previous encoding three- base block is "TTT".

I.e. by means of switching the coding scheme according to a pattern noticed in a preceding error protected block, one may even avoid 7 consecutive nucleobases of the same type, thereby improving reliability.

It is to be noted that when the error protection nucleobase is not added to the tail but to the header or even in between nucleobases of a two-element-string, the attribution scheme may be adopted accordingly. Now, as DNA may be reproduced in numerous copies, one may advantageously benefit from multiple copies of the same original.

Suppose that one has received multiple copies showing errors. Then one may reproduce from said plurality of copies a copy which most likely represents the original. In Figure 3 such erroneous copies of the same sequence are displayed. For ease of understanding, the erroneous nucleobases are indicated by bold italic face. Note, even though errors are shown only to be present in the second nucleobase of a block, the same process may apply in case an error is in a first nucleobase or an error protection nucleobase to be found. Now, within each sequence we see one error. Gladly, the errors are of different nature, i.e. within each sequence a different block is erroneous. While in sequence 1 block Bi is erroneous, in sequence 2 block B 3 , in sequence 3 block B 2 and in sequence 5 block B 5 is erroneous. Hence, we may deduce from all sequences a sequence which is error-free. I.e. if a plurality of representations of a same block is received, erroneous representations may be replaced by error-free representations. Without further detailing, it is apparent that the method steps of the encoding as well as the decoding may be embodied in respective synthesis systems for encoding and sequencing systems for decoding.

Although not further detailed, it may be envisaged that adding further higher coding schemes like the one known in binary world allowing for error detection and error correction, e.g. by implementing hamming distances when mapping binary information towards quaternary information and/or additional redundancy, code overlapping, spreading, etc., e.g. Reed-Solomon error correction coding, may additionally improve the methods and systems. In addition one may provision certain arrangement to detect the proper start of block.

For that purpose different methods may be applied.

A basic process is to first read at least 5 sequences. The first sequence is held as starting sequence and consequently by the first and second sequence one can determine the expected error protection sequence. Then one may shift the starting sequence and perform error protection determination again. This process may again be repeated. The most likely starting sequence of a block may be detected upon detecting no error.

I.e. in case of row 2, one would examine ACC, CCA and CAT as a candidate block. Taking Table 1 into account, one would deduce that ACC and CCA would be valid blocks while CAT would not be a valid block.

Obviously, accuracy of this procedure may be improved by taking a plurality of blocks into account.

Now we assume in case of row 2 that the sequences ACC ATA, CCA TAT, CAT ATA are examined, ACC ATA are two valid consecutive blocks. Within the second sample CCA TAT, TAT would not be valid, while within the third sample CAT ATA, both CAT and ATA would not be valid.

Obviously there may also be other measures to indicate the proper start of a sequence. Such measures may include a predefined start sequence. Such a start sequence may also be repeated in predefined distances, e.g. every 8 th block. By means of the invention an information density far above that of commercial technologies such as hard disks may be provisioned. Furthermore, the storage of information by means of (synthetic) DNA provides for a long-term storage which needs little efforts for maintenance.

The invention envisages usage of a feature of DNA synthesis for digital information storage, namely there are always many copies of each DNA fragment synthesized. In other words, there is already a high "natural" redundancy introduced by DNA synthesis.

Additionally, as DNA fragments are special molecules with potential biological activities, the encoding scheme should provide way(s) to prevent that the encoded DNA fragments would form biologically dangerous sequences (e.g. viruses) with regard to biological safety.

By usage of coding schemes biological safety may be maintained.

The invention proposes in the described example to benefit of natural redundancy for error detection and correction.

The proposed encoding scheme is a self-error-detecting three-base block (SED3B) scheme for DNA digital data encoding and decoding.

The proposed SED3B scheme can tolerate relatively high error rates (>40%) in the synthesis (information writing) and sequencing (information retrieval) of DNA.

By using the SED3B scheme, the data storage density can be increased to about 9 Zettabytes (9 x 10 21 byte or 9 x 10 6 petabyte) per gram DNA. DNA sequences encoded according to the proposed method can limit extreme GC content, remove homopolymers longer than 6bp and show a much simple secondary structure. Furthermore, the encoding scheme provides a strong biological safety of data storage.