

Title:
NOVEL SEQUENCE CONVERSION METHODS FOR THE PRODUCTION OF OVERLAPPINGLY TRANSLATED PROTEIN SEQUENCES (OTS)
Document Type and Number:
WIPO Patent Application WO/2003/019453
Kind Code:
A1
Abstract:
The present invention relates to a new method for converting a protein sequence or an NA sequence into an overlappingly translated protein sequence (OTS), comprising the steps: a) back-translating said protein sequence into an NA sequence, if necessary, otherwise using the already existing NA sequence and going directly to step b); b) translating said NA sequence wherein 3 different codon reading frames are used, one starting on nucleic acid 1, one starting on nucleic acid 2 and one starting on nucleic acid 3; c) combining the amino acids obtained into an overlappingly translated protein sequence. Also disclosed are a method for comparing overlappingly translated protein sequences and computer programs, stored on a data carrier, for performing the above methods. The present invention also relates to a method for converting an overlappingly translated protein sequence (OTS) into a 20 amino acid long fingerprint, and to a method for determining a fingerprint similarity coefficient comprising comparing a fingerprint obtainable from said method with a predetermined fingerprint stored on a data carrier.

Inventors:
BIRO JAN (SE)
Application Number:
PCT/SE2002/001432
Publication Date:
March 06, 2003
Filing Date:
August 07, 2002
Assignee:
BIRO JAN (SE)
International Classes:
G16B30/10; C12N15/10; G16B45/00; (IPC1-7): G06F19/00; G06F17/30; C12N15/10
Domestic Patent References:
WO2001041038A2 2001-06-07
Foreign References:
EP1108781A2 2001-06-20
Other References:
AGARWAL PANKAJ ET AL.: "Comparative accuracy of methods for protein sequence similarity search", BIOINFORMATICS, vol. 14, no. 1, 1998, pages 40 - 47, XP002958451
KORF I. AND GISH W.: "MPBLAST: improved BLAST performance with multiplexed queries", BIOINFORMATICS IN PRESS, 2000, XP002958452, Retrieved from the Internet [retrieved on 20020327]
BLEASBY A.: "EMBOSS: fuzztran", 2000, XP002958453, Retrieved from the Internet [retrieved on 20020327]
Attorney, Agent or Firm:
Berglund, Erik (Aspebråten, Sturefors, SE)
Claims:
1. A method for converting a first protein sequence or an NA sequence into a second protein sequence, a so-called overlappingly translated protein sequence (OTS), comprising the steps: a) back-translating said first protein sequence into an NA sequence, if necessary, otherwise using the already existing NA sequence and going directly to step b), b) translating said NA sequence wherein 3 different codon reading frames are used, one starting on nucleic acid 1, one starting on nucleic acid 2 and one starting on nucleic acid 3, c) combining the amino acids obtained into said second protein sequence, i.e. an overlappingly translated protein sequence.
2. A method according to claim 1 wherein the NA sequence obtained in step a) is reversed, reverse-complemented and complemented before step b) is performed.
3. A method according to claim 1 wherein the combining, of step c), into an overlappingly translated protein sequence comprises forming the overlappingly translated protein sequence by using the following steps: d) adding one amino acid starting with the amino acid obtained using the reading frame starting on nucleic acid 1; e) adding one amino acid obtained using the reading frame starting on nucleic acid 2; f) adding one amino acid obtained using the reading frame starting on nucleic acid 3; g) adding the next amino acid obtained using the reading frame starting on nucleic acid 1; and h) repeating steps e), f) and g).
4. A method according to claim 1 comprising, additionally, in step a) reversing, reverse-complementing and complementing said NA sequence; in step b) translating said NA sequence and said reversed, reverse-complemented and complemented NA sequence; and in step c) combining the amino acids obtained into four different overlappingly translated protein sequences (OTS-D, OTS-R, OTS-RC and OTS-C) derived from said NA sequence and the reverse, reverse-complement and complement of said NA sequence, respectively.
5. A method according to claim 3 wherein the combining, of step c), into four overlappingly translated protein sequences derived from said NA sequence and reversed, reverse-complemented and complemented NA sequence, respectively, comprises forming each of the four overlappingly translated protein sequences by using the following steps: d) adding one amino acid starting with the amino acid obtained using the reading frame starting on nucleic acid 1; e) adding one amino acid obtained using the reading frame starting on nucleic acid 2; f) adding one amino acid obtained using the reading frame starting on nucleic acid 3; g) adding the next amino acid obtained using the reading frame starting on nucleic acid 1; and h) repeating steps e), f) and g).
6. A method according to any one of claims 1 to 5 characterized by that the method is performed in vitro.
7. An overlappingly translated protein sequence obtainable by a method according to any one of claims 1 to 6.
8. A computer program stored on a data carrier for performing the method according to any one of the claims 1 to 6.
9. A method for determining a sequence similarity coefficient comprising comparing an overlappingly translated protein sequence obtainable by a method according to any one of claims 1 to 5 with a predetermined overlappingly translated protein sequence stored on a data carrier; preferably the predetermined overlappingly translated protein sequence has been obtained from a method according to any one of claims 1 to 5.
10. A computer program stored on a data carrier for performing the method according to claim 9.
11. A method for obtaining a fingerprint comprising the following steps: i) translating an NA sequence into 4 overlappingly translated protein sequences, representing the D: direct, RC: reverse-complement, R: reverse and C: complement strands, according to any one of claims 2 to 5; ii) counting the amino acid composition of each overlappingly translated protein sequence; iii) comparing the number (#) of individual amino acids in [D+RC], [R+C] and [D+RC+R+C] with the number of amino acids which might be expected (EXP#) by translation of a random nucleic acid sequence containing equal numbers of the 4 nucleotides; iv) obtaining ratios of the real and expected numbers of the different amino acids, [D+RC]#/EXP#, [R+C]#/EXP# and [D+RC+R+C]#/EXP# values, optionally multiplied by 100 to obtain percentages; v) sorting said ratios in descending order, preferably using [D+RC+R+C]/EXP100% or [D+RC]/EXP100% values obtained for each amino acid. The ranking and sorting of amino acids (iv. and v.) gives an order of the 20 amino acids which is regarded to be the OTS fingerprint of the original NA sequence. The ranking values, and the occurrence (order after sorting) of amino acids determined by the ranking values, give an OTS "fingerprint" for said NA sequence, which preferably may be plotted in a diagram; and optionally vi) forming a 20 amino acid long fingerprint by starting with the amino acid with the highest rank and adding the other 19 amino acids subsequently by virtue of their occurrence (order after sorting), which is determined by the ranking value.
12. A computer program stored on a data carrier for performing the method according to claim 11.
13. A method for determining a fingerprint similarity coefficient comprising comparing a fingerprint obtainable by a method according to claim 11 with a predetermined fingerprint, preferably obtained using a method according to claim 11 stored on a data carrier.
14. A computer program stored on a data carrier for performing the method according to claim 13.
15. A database comprising predetermined overlappingly translated protein sequences obtainable by a method according to any one of claims 1 to 5.
16. A database comprising predetermined fingerprints obtainable by a method according to claim 11.
Description:
Novel sequence conversion methods for the production of overlappingly translated Protein sequences (OTS)

The present invention relates to a new method for converting a protein sequence or a nucleic acid (NA) sequence into an overlappingly translated protein sequence, comprising the steps: a) back-translating said protein sequence into a nucleic acid (DNA or RNA) sequence, if necessary, otherwise using the already existing NA sequence, b) translating said NA sequence wherein 3 different codon reading frames are used, one starting on nucleic acid 1, one starting on nucleic acid 2 and one starting on nucleic acid 3, c) combining the amino acids obtained into an Overlappingly Translated protein Sequence (OTS). Steps b and c together are called overlapping translation.

Table I below compares overlapping and non-overlapping (normal) translations of nucleic acids, illustrates the principle and gives an example of the OTS construction performed on the BMC Search Launcher [w4].

Further the present invention relates to a method for comparing overlappingly translated protein sequences, a method for obtaining a fingerprint of an OTS sequence and computer programs for performing said above methods stored on a data carrier. Further databases comprising OTS sequences and fingerprints, respectively, are disclosed.

Table I.

Comparison of the overlapping and non-overlapping translations of nucleic acids.

A.: Illustration of the principle

Background to the Invention

Any improvement to existing techniques for detecting weak similarity based on pairwise database searches would be of significant interest to a large part of the molecular biology community. Other sensitive methods, such as those based on profiles or Hidden Markov Models, may become impracticable due to the exponential growth of the databases.

Bioinformatics methods that can be applied to either proteins or nucleic acids are more sensitive when applied to proteins [1-2,10-11]. This is because the 20-letter polypeptide code is more compact and robust than the 4-letter polynucleotide code. A random sequence of 8 amino acids would be expected to occur at most once in a sequence the size of the human genome (3 x 10^9 characters), whereas a random sequence of 8 nucleotides would be expected to occur about 45,000 times.
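The figures quoted above can be checked with a short calculation. The sketch below assumes, for illustration only, that every letter of the text is drawn independently and uniformly from its alphabet; the function name is hypothetical and not part of the original disclosure.

```python
GENOME_SIZE = 3 * 10**9  # characters, the genome size quoted in the text
WORD_LENGTH = 8          # length of the random query word


def expected_occurrences(alphabet_size: int, word_length: int, text_length: int) -> float:
    """Expected number of positions at which a fixed word matches a random text.

    Assumes independent, uniformly distributed letters (an idealisation).
    """
    positions = text_length - word_length + 1
    return positions / alphabet_size ** word_length


protein_hits = expected_occurrences(20, WORD_LENGTH, GENOME_SIZE)
nucleotide_hits = expected_occurrences(4, WORD_LENGTH, GENOME_SIZE)
print(f"8 amino acids: {protein_hits:.3f} expected occurrences")
print(f"8 nucleotides: {nucleotide_hits:.0f} expected occurrences")
```

Under these assumptions the 8-residue word is expected well under once (roughly 0.1 times) and the 8-nucleotide word roughly 45,000 to 46,000 times, in line with the orders of magnitude stated in the text.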

Therefore many scientists will prefer to work with proteins if both nucleic acid and protein sequences are available. Unfortunately the protein databases are much smaller than the nucleic acid ones and will continue to be so, because only a small proportion of all nucleic acids are translated and spliced into real proteins. One might completely disregard the biochemical nature of the sequences and codon-translate the query, the entire DNA/RNA database, or both, into the six possible frames and examine them as if they were polypeptides (TblastX, blastX, TblastN). However, this approach increases the number of pairwise comparisons 6- or 36-fold and often results in a chaotic network of fragmented similarities caused by frame shifts.

Overlapping translation of nucleic acids solves this problem: a., it converts one nucleic acid to only one protein-like sequence and bypasses the problems caused by codon redundancy and frame shifts; b., it makes possible the creation of the "protein-based genome"; c., it improves the sensitivity of already existing bioinformatics methods; d., it amplifies the signal of the tools which are developed for visualizing sequence similarity.

A short review of the related literature

1., Database searching

Database searching for sequence similarity is a fundamental practice in bioinformatics. These methods and tools look for a., global or local sequence similarity; b., continuous sequence similarities, with more or fewer gaps; c., similar motifs, profiles or Hidden Markov models; d., similarities between nucleic acids, proteins or both (similarities between nucleic acids after frame translation into proteins). There are many names in the literature for these methods: for example the Blast family: blastN, blastP, TblastX, psi-blast, et cetera; the WU-blast family; the Fasta family; Ssearch; the Probabilistic Smith-Waterman models (PSW); the Hidden Markov Models (HMM); Fuzztran et cetera [3-5].

None of these methods is applied to overlappingly translated sequences. Overlapping translation is fundamentally different from conventional frame translation.

Overlapping translation is a conceptual sequence conversion and not a Hidden Markov model.

There is at least 33% identity between overlapping and non-overlapping translations of the same nucleic acid. However, this similarity is not enough to permit using an OTS query on a non-OTS database or a non-OTS (conventional protein) query on an OTS database.

2., Construction of the query

The query, as well as the database, may be a protein, nucleic acid, profile, motif or Hidden Markov model. The quality of the query is critical for the quality and reliability of the result. Some efforts to improve the quality of the query are: a., generating candidate query sequences using a profile and successively improving the quality of the query by iterative searching and re-searching of the databases [6]; b., recursive nucleic acid recombination ("shuffling") of two or more parental sequences or DNA mutagenesis by random fragmentation and reassembling to model the natural molecular evolution [7]; c., multiplexing (concatenating) the query sequences and thereby reducing the number of actual database searches which it is necessary to perform [8].

These methods are clearly different from, and not to be confused with, the overlapping translation of nucleic acid sequences.

3., Some historical references: George Gamow and Francis Crick

In the early days of biochemistry, directly after the discovery of the DNA double helix in 1953, a question arose: how are the nucleic acids translated into proteins? The Diamond Code of George Gamow (in 1954) and the Comma-Free Codes of Francis Crick (in 1957) are actually speculations about a kind of overlapping translation of nucleic acids [9].

Summary of the invention

The present invention provides, according to a first aspect of the present invention, a method, solving one or more of the above problems, for converting a first protein sequence or an NA sequence into a second protein sequence, a so-called overlappingly translated protein sequence, comprising the steps: a) back-translating said first protein sequence into an NA sequence, if necessary, otherwise using the already existing NA sequence, b) translating said NA sequence wherein 3 different codon reading frames are used, one starting on nucleic acid 1, one starting on nucleic acid 2 and one starting on nucleic acid 3, c) combining the amino acids obtained into a second protein sequence, i.e. an Overlappingly Translated Protein Sequence (OTS). The steps b and c together are called Overlapping Translation because the codons coding the OTS residues overlap each other.
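Steps b) and c) can be sketched in a few lines of Python. This is a minimal illustration, not the program of the invention: it assumes the standard genetic code, DNA input (U is mapped to T), unambiguous bases only, and "*" as the stop symbol; the names `translate_frame` and `overlapping_translation` are hypothetical.

```python
from itertools import product, zip_longest

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(na: str, start: int) -> str:
    """Step b): conventional, non-overlapping translation from 0-based offset `start`."""
    na = na.upper().replace("U", "T")
    return "".join(CODON_TABLE[na[i:i + 3]] for i in range(start, len(na) - 2, 3))


def overlapping_translation(na: str) -> str:
    """Step c): interleave one residue from frame 1, frame 2, frame 3, and repeat."""
    frames = [translate_frame(na, start) for start in range(3)]
    return "".join(aa for triple in zip_longest(*frames, fillvalue="") for aa in triple)


print(overlapping_translation("ATGGCC"))  # frames: "MA", "W", "G"
```

For the six-nucleotide input `ATGGCC` the three frames give `MA`, `W` and `G`, and interleaving them yields the four-residue OTS `MWGA`. Ambiguous bases such as `N` would raise a `KeyError` in this sketch.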

Figure 24 shows, in addition to Table I, a block diagram of the method for obtaining an OTS protein sequence (exemplified through DNA).

Figure 24. Block diagram of the method for obtaining an OTS protein sequence. [Block diagram: the input is a protein sequence or a DNA sequence; if it is a protein sequence, it is back-translated into a DNA sequence; the DNA sequence is translated and the obtained amino acids are combined into an OTS, repeating while a codon is left, until the DNA ends; end product: OTS.]

The present invention also provides, according to a second aspect of the present invention, a method for determining a sequence similarity coefficient comprising comparing an overlappingly translated protein sequence obtainable from a method as outlined above with a predetermined overlappingly translated protein sequence stored on a data carrier [blastNP].

The present invention also provides, according to a third aspect of the present invention, a method for converting an Overlappingly Translated protein Sequence (OTS) into a 20 amino acid long "fingerprint".

Further the present invention also provides, according to a fourth aspect of the present invention, a method for determining a "fingerprint" similarity coefficient comprising comparing a "fingerprint" obtainable from a method as outlined above with a predetermined fingerprint stored on a data carrier.

Further the present invention also provides, according to a fifth aspect of the present invention, a database comprising overlappingly translated protein sequences (OTS) obtainable by the method according to the first aspect of the present invention.

Further the present invention also provides, according to a sixth aspect of the present invention, a database structure comprising "fingerprints" obtainable by the method according to the third aspect of the present invention.

Further the present invention also provides, according to a seventh aspect of the present invention, a computer program, i.e. software, for performing the method according to the first aspect of the present invention.

Further the present invention also provides, according to an eighth aspect of the present invention, a computer program, i.e. software, for performing the method according to the second aspect of the present invention.

Further the present invention also provides, according to a ninth aspect of the present invention, a computer program, i.e. software, for performing the method according to the third aspect of the present invention.

Further the present invention also provides, according to a tenth aspect of the present invention, a computer program, i.e. software, for performing the method according to the fourth aspect of the present invention.

Detailed description of the Invention

The terms "NA sequence" or "nucleic acid sequence", which are used interchangeably throughout the present specification, are intended to embrace in the present description either a DNA sequence or an RNA sequence or a combination thereof.

The term "back-translating" in the present description is intended to embrace any process for converting a protein into an NA sequence. Conventional back-translation was used in the present specification, which means that known, non-overlapping codons and the human codon usage table were used (w17). This conversion may preferably be performed electronically (in-silico), e.g. in a computer. Further, this conversion may be followed by biochemical processes, e.g. in-vitro or in-vivo (which may involve enzymatic processes), for obtaining the in-silico derived NA sequence as a chemical entity.
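A minimal in-silico back-translation might look as follows. The specification uses the human codon usage table (w17), which is not reproduced here; as a loudly labelled placeholder, this sketch simply picks the first codon of each amino acid in the standard genetic code (T, C, A, G order). The names `PLACEHOLDER_CODON`, `back_translate` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# ASSUMPTION: one arbitrary codon per amino acid (the first in table order).
# A real implementation would choose codons from a codon usage table instead.
PLACEHOLDER_CODON = {}
for codon, aa in CODON_TABLE.items():
    PLACEHOLDER_CODON.setdefault(aa, codon)


def back_translate(protein: str) -> str:
    """Convert a protein sequence into one of its possible coding DNA sequences."""
    return "".join(PLACEHOLDER_CODON[aa] for aa in protein.upper())


def translate(na: str) -> str:
    """Conventional non-overlapping translation starting on nucleotide 1."""
    return "".join(CODON_TABLE[na[i:i + 3]] for i in range(0, len(na) - 2, 3))


dna = back_translate("MKWA")
print(dna, translate(dna))
```

Because the genetic code is redundant, back-translation is not unique; whichever codon is chosen, conventional translation of the result returns the original protein.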

The term "translating" in the present description is intended to embrace the conventional, usual frame translation process for converting an NA sequence into a protein, where the regular three-letter nucleic acid codons of each amino acid are used and the codons do not overlap with each other (non-overlapping translation). This conversion may be performed biochemically, i.e. in-vitro or in-vivo (which may involve enzymatic processes), or electronically, e.g. in a computer. The translation involves reading codons appearing in a row, normally starting on codon No. 1, each made up of three nucleotides. The NA sequence comprising codons may be 10000000 nucleotides long. The translation normally goes on until all codons appearing in a row of the NA sequence have been read and thus each amino acid coded by the codons has been deciphered. The term "back-translating" is thus not equal to "reversing" a nucleic acid.

The term "codon reading frame starting on nucleic acid 1" (frame 1) in the present description is intended to embrace any codon reading frame which starts the reading of an NA sequence on nucleic acid No. 1 or any other nucleic acid appearing on a subsequent position 3 * x nucleotides further downstream in the sequence, wherein x is an integer and may be from 1 to 1000000. Preferably the codon reading frame starting on nucleic acid 1 actually starts the reading on nucleotide No. 1. The codon reading frame reads three nucleotides at a time.

The term "codon reading frame starting on nucleic acid 2" (frame 2) in the present description is intended to embrace any codon reading frame which starts the reading of an NA sequence on nucleic acid No. 2 or any other nucleic acid appearing on a subsequent position 3 * x nucleotides further downstream in the sequence, wherein x is an integer and may be from 1 to 1000000. Preferably the codon reading frame starting on nucleic acid 2 actually starts the reading on nucleotide No. 2. The codon reading frame reads three nucleotides at a time. Obviously there will be one less codon read by the "codon reading frame starting on nucleic acid 2" than by the one starting on nucleotide No. 1, as there will be a nucleotide deficit (one nucleotide) at the end of the DNA sequence.

The term "codon reading frame starting on nucleic acid 3" (frame 3) in the present description is intended to embrace any codon reading frame which starts the reading of an NA sequence on nucleic acid No. 3 or any other nucleic acid appearing on a subsequent position 3 * x nucleotides further downstream in the sequence, wherein x is an integer and may be from 1 to 1000000. Preferably the codon reading frame starting on nucleic acid 3 actually starts the reading on nucleotide No. 3. The codon reading frame reads three nucleotides at a time. Obviously there will be one less codon read by the "codon reading frame starting on nucleic acid 3" than by the one starting on nucleotide No. 1, as there will be a nucleotide deficit (two nucleotides) at the end of the NA sequence.
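The codon counts of the three frames follow directly from the sequence length. The helper below is a hypothetical illustration that assumes each frame starts on nucleotide No. 1, 2 or 3 respectively and reads complete codons only.

```python
def codons_in_frame(na_length: int, frame: int) -> int:
    """Number of complete codons read by frame 1, 2 or 3 on a sequence of given length."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    return max(0, (na_length - (frame - 1)) // 3)


for length in (12, 13, 14):
    counts = [codons_in_frame(length, frame) for frame in (1, 2, 3)]
    print(length, counts, sum(counts))
```

When the sequence length is a multiple of three (for example 12), frames 2 and 3 each read one codon fewer than frame 1, as described above. For any length of at least two nucleotides the three counts add up to the length minus two.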

The term "reversing" in the present description is intended to embrace any method for obtaining an NA sequence in its reverse sequence, e.g. a DNA sequence having GCG----ACT is reversed to TCA----GCG. The method may preferably be performed electronically (in-silico), e.g. in a computer. Further, this method may be followed by (bio)chemical processes, e.g. in-vitro or in-vivo (which may involve enzymatic processes), for obtaining the in-silico derived NA sequence in its reverse sequence as a chemical entity.

The term "combining" in the present description is intended to mean the alternating and periodic addition of one amino acid from frame 1, followed by one amino acid from frame 2, followed by one amino acid from frame 3, followed by the next amino acid from frame 1, and so on, to a common protein-like sequence, called the OTS. The OTS itself is not further combined (see also Table I, part B).

The term "overlapping translation" in the present description is intended to embrace the construction of a protein-like sequence (OTS) in which the codons coding for the amino acids overlap with each other, i.e. the last two bases of one codon are also the first two bases of the next codon. The overlapping translation produces a poly-amino-acid sequence (OTS) corresponding to a coding polynucleotide sequence, the OTS being just 2 letters shorter than the template, i.e. the polynucleotide sequence (Table I). The OTS may, when all available codons have been read, be stored in an OTS database. Preferably the combining is performed according to a preferred embodiment of the first aspect of the present invention wherein the combining, of step c), into an overlappingly translated protein sequence comprises forming the overlappingly translated protein sequence by using the following steps: d) adding one amino acid starting with the amino acid obtained using the reading frame starting on nucleic acid 1; e) adding one amino acid obtained using the reading frame starting on nucleic acid 2; f) adding one amino acid obtained using the reading frame starting on nucleic acid 3; g) adding the next amino acid obtained using the reading frame starting on nucleic acid 1; and h) repeating steps e), f) and g). A chain of amino acids (the OTS) as set out in Table I is thus obtained.
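Because consecutive OTS codons share two bases, the same sequence can be obtained by sliding a three-nucleotide window one position at a time. The sketch below assumes the standard genetic code and unambiguous DNA input; `overlapping_translation_window` is a hypothetical name.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def overlapping_translation_window(na: str) -> str:
    """Translate every window of three nucleotides, advancing one nucleotide per step."""
    na = na.upper().replace("U", "T")
    return "".join(CODON_TABLE[na[i:i + 3]] for i in range(len(na) - 2))


na = "ATGGCCAAGT"
ots = overlapping_translation_window(na)
print(ots)       # one residue per window: ATG, TGG, GGC, GCC, CCA, CAA, AAG, AGT
print(ots[::3])  # every third residue is the conventional frame-1 translation
```

For this 10-nucleotide template the OTS has 8 residues, two fewer than the template, and every third residue reproduces the frame-1 translation, which is consistent with the minimum 33% identity between overlapping and non-overlapping translations stated earlier.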

The addition may be performed in-vitro, in-vivo or electronically (in-silico). The combining of the amino acids into an overlappingly translated protein sequence (OTS) may further be done chemically using peptide bonds.

The steps b and c together in the first aspect of the present invention constitute the overlapping translation.

The term "reverse-complementing" in the present description is intended to embrace any method for obtaining an NA sequence in its reverse-complemented sequence, e.g. a DNA sequence having GCG----ACT is reverse-complemented into AGT----CGC. The method may preferably be performed electronically (in-silico), e.g. in a computer. Further, this method may be followed by (bio)chemical processes, e.g. in-vitro or in-vivo (which may involve enzymatic processes), for obtaining the in-silico derived NA sequence in its reverse-complemented sequence as a chemical entity.

The term "complementing" in the present description is intended to embrace any method for obtaining an NA sequence in its complemented sequence, e.g. a DNA sequence having GCG----ACT is complemented into CGC----TGA. The method may preferably be performed electronically (in-silico), e.g. in a computer. Further, this method may be followed by (bio)chemical processes, e.g. in-vitro or in-vivo (which may involve enzymatic processes), for obtaining the in-silico derived NA sequence in its complemented sequence as a chemical entity.
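The three strand operations defined above can be illustrated in-silico as follows. This is a sketch for DNA input with unambiguous upper-case bases; the function names are hypothetical, and the example uses the end triplets GCG and ACT from the definitions joined into one short sequence.

```python
# Watson-Crick base pairing for DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(na: str) -> str:
    """Read the sequence backwards without changing the bases."""
    return na[::-1]


def complement(na: str) -> str:
    """Replace every base by its pairing partner, keeping the order."""
    return na.translate(COMPLEMENT)


def reverse_complement(na: str) -> str:
    """Complement the sequence and read it backwards."""
    return complement(na)[::-1]


na = "GCGACT"
print(reverse(na), complement(na), reverse_complement(na))
```

For `GCGACT` this gives `TCAGCG`, `CGCTGA` and `AGTCGC`, matching the examples in the three definitions.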

According to a preferred embodiment of the first aspect of the present invention there is provided a method as outlined above wherein the NA sequence obtained in step a) is reversed, reverse-complemented and complemented before step b) is performed.

According to a preferred embodiment of the first aspect of the present invention there is provided a method as outlined above wherein the combining, of step c), into an overlappingly translated protein sequence comprises forming the overlappingly translated protein sequence by using the following steps: d) adding one amino acid starting with the amino acid obtained using the reading frame starting on nucleic acid 1; e) adding one amino acid obtained using the reading frame starting on nucleic acid 2; f) adding one amino acid obtained using the reading frame starting on nucleic acid 3; g) adding the next amino acid obtained using the reading frame starting on nucleic acid 1; and h) repeating steps e), f) and g).

The addition may be performed in-vitro, in-vivo or electronically (in-silico). The combining of the amino acids into an overlappingly translated protein sequence (OTS) may further be done chemically using peptide bonds.

According to a preferred embodiment of the first aspect of the present invention there is provided a method as outlined above comprising, additionally, in step a) reversing, reverse-complementing and complementing said NA sequence; in step b) translating said NA sequence and said reversed, reverse-complemented and complemented NA sequence; and in step c) combining the amino acids obtained into four overlappingly translated protein sequences, called OTS-D, OTS-R, OTS-C and OTS-RC, derived from said NA sequence and the reverse, reverse-complement and complement of said NA sequence, respectively.

According to a preferred embodiment of the first aspect of the present invention there is provided a method as outlined above wherein the combining, of step c), into four overlappingly translated protein sequences derived from said NA sequence and reversed, reverse-complemented and complemented NA sequence, respectively, comprises forming each of the four overlappingly translated protein sequences by using the following steps: d) adding one amino acid starting with the amino acid obtained using the reading frame starting on nucleic acid 1; e) adding one amino acid obtained using the reading frame starting on nucleic acid 2; f) adding one amino acid obtained using the reading frame starting on nucleic acid 3; g) adding the next amino acid obtained using the reading frame starting on nucleic acid 1; and h) repeating steps e), f) and g). The four OTSs (OTS-D, OTS-R, OTS-RC and OTS-C) are not further combined. (The method for obtaining the four different OTS sequences from the same NA is illustrated below in Figure 30. Note that OTS-D, OTS-R, OTS-C and OTS-RC are results of independent overlapping translations of the NA and they are normally not combined with each other.)

The addition may be performed in-vitro, in-vivo or electronically (in-silico).

Figure 30. Block diagram of the method for obtaining Overlappingly Translated protein Sequences (OTS). [Block diagram: a DNA or RNA sequence, or a protein sequence after back-translation, gives the nucleic acid (NA) sequence; its direct (D), reverse (R), complement (C) and reverse-complement (RC) forms (NA-D, NA-R, NA-C, NA-RC) are each translated, repeating while a codon is left and building the OTS until the NA ends; end products: OTS-D, OTS-R, OTS-C and OTS-RC.]
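The flow of Figure 30 can be sketched as a single function that derives the four strands and translates each one independently. The sketch assumes DNA input, the standard genetic code and a one-nucleotide sliding window for the overlapping translation; `four_ots` and the dictionary keys are illustrative names.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def ots(na: str) -> str:
    """Overlapping translation: one residue per three-nucleotide window, step one."""
    return "".join(CODON_TABLE[na[i:i + 3]] for i in range(len(na) - 2))


def four_ots(na: str) -> dict:
    """Return the four independent overlapping translations of one NA sequence."""
    na = na.upper()
    strands = {
        "OTS-D": na,                               # direct
        "OTS-R": na[::-1],                         # reverse
        "OTS-C": na.translate(COMPLEMENT),         # complement
        "OTS-RC": na.translate(COMPLEMENT)[::-1],  # reverse-complement
    }
    return {name: ots(strand) for name, strand in strands.items()}


for name, sequence in four_ots("ATGGCC").items():
    print(name, sequence)
```

The four results are kept separate in the returned dictionary, reflecting the statement that the four OTSs are not combined with each other.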

According to a preferred embodiment of the first aspect of the present invention there is provided an overlappingly translated protein sequence obtainable by a method as outlined above, i.e. according to the method of the first aspect of the present invention.

According to a preferred embodiment of the first aspect of the present invention there is provided a method as outlined above characterized in that the method is performed in-vitro.

According to a preferred embodiment of the first aspect of the present invention there is provided, according to a seventh aspect of the present invention, a computer program (software) stored on a data carrier for performing the method as outlined above and its preferred embodiments outlined above as well. The data carrier may e.g. be a floppy disc or a hard disc in a computer. Further, the software may preferably perform formatting of sequences. The computer program according to the seventh aspect of the present invention is essentially not a general-purpose but a special-purpose computer program, which preferably is designed
- to manage the conversion of nucleic acids to overlappingly translated proteins,
- to manage the statistical analysis of OTS, which is necessary to create OTS fingerprints (which preferably is performed in a computer program according to the eighth or the ninth aspect, respectively, as set out below),
- to obtain and sort OTS fingerprints (which also preferably is performed in a computer program according to the eighth or the ninth aspect, respectively, as set out below).

This program operates in large scale and mass-converts and mass-analyses data usable for the bioinformatics industry.

According to a second aspect, as mentioned above, of the present invention there is provided a method for determining a sequence similarity coefficient comprising comparing an overlappingly translated protein sequence obtainable by a method as outlined above, according to the first aspect of the present invention, with a predetermined overlappingly translated protein sequence stored on a data carrier. Preferably the predetermined overlappingly translated protein sequence has been obtained from a method as outlined above.

According to a preferred embodiment of the second aspect of the present invention there is provided, according to an eighth aspect of the present invention, a computer program stored on a data carrier for performing the method according to the second aspect of the present invention.

According to the third aspect of the present invention there is thus also provided a method for converting an overlappingly translated protein sequence into a fingerprint, preferably a 20 amino acid (AA) long fingerprint. The method for obtaining a fingerprint comprises the following steps:
i) translating an NA sequence into 4 OTSs representing the D: direct (normal), RC: reverse-complement, R: reverse and C: complement strands of the translated NA (Table 30);
ii) counting the amino acid composition of each OTS protein sequence (OTS-D, OTS-R, OTS-RC and OTS-C);
iii) comparing the number (#) of individual amino acids in [D+RC], [R+C] and [D+RC+R+C] with the number of amino acids which might be expected (EXP#) by translation of a random nucleic acid sequence containing equal numbers of the 4 nucleotides;
iv) obtaining ratios of the real and expected numbers of the different amino acids, [D+RC]#/EXP#, [R+C]#/EXP# and [D+RC+R+C]#/EXP# values, optionally multiplied by 100 to obtain percentages;
v) sorting said ratios in descending order, preferably using [D+RC+R+C]/EXP100% or [D+RC]/EXP100% values obtained for each amino acid. The ranking and sorting of amino acids (iv. and v.) gives an order of the 20 amino acids which is regarded to be the OTS "fingerprint" of the original NA sequence. The ranking values, and the occurrence (order after sorting) of amino acids determined by the ranking values, give an OTS "fingerprint" for said NA sequence, which preferably may be plotted in a diagram; and optionally
vi) forming a 20 amino acid long fingerprint by starting with the amino acid with the highest rank and adding the other 19 amino acids subsequently by virtue of their occurrence (order after sorting), which is determined by the ranking value.

The ratios in step iv) may thus be obtained by using the following formulas:

[D+RC]/EXP% = [D+RC]# x 100/EXP#
[R+C]/EXP% = [R+C]# x 100/EXP#
[D+RC+R+C]/EXP% = [D+RC+R+C]# x 100/EXP#

The ratios in step iv) may also be obtained by using the following formulas:

EXP/EXP% = EXP# x 100/EXP# = 100
[D+RC]/EXP100% = ([D+RC]/EXP%) x 100/([D+RC+R+C]/EXP%) = [D+RC] x 100/[D+RC+R+C]
[R+C]/EXP100% = ([R+C]/EXP%) x 100/([D+RC+R+C]/EXP%) = [R+C] x 100/[D+RC+R+C]

The method according to the third aspect of the present invention is a special one, because it is built entirely on the properties of overlappingly translated sequences and is used essentially exclusively to analyse OTS. It is a special application, applied to achieve a useful, concrete and tangible result, and is thus a means for obtaining information about OTS sequences. Using the OTS fingerprinting algorithm on non-overlappingly translated sequences is essentially meaningless. The "fingerprinting" of any sequence after conversion to an OTS sequence is the key to a new sequence context searching method, like different gene-finder programs.
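The fingerprinting steps i) to vi) may be illustrated by the following minimal Python sketch. It is not the software of the invention. Several points are assumptions made only for illustration: EXP# is taken as the total number of overlapping codons in the four OTSs multiplied by the share of the 64 standard codons that encode the amino acid; stop codons are written as "*" and left out of the fingerprint; ties are broken alphabetically; and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})   # the 20 amino acids
CODONS_PER_AA = Counter(AMINO)             # number of codons per amino acid
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def ots(seq):
    """Overlapping translation: one residue per nucleotide position."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(len(seq) - 2))


def fingerprint(na):
    """Return (20-letter fingerprint, per-amino-acid ratio table).

    Assumes an input of at least 3 unambiguous A/C/G/T letters.
    """
    d = na.upper()
    if len(d) < 3:
        raise ValueError("sequence must contain at least one codon")
    c = d.translate(COMPLEMENT)
    strands = {"D": d, "R": d[::-1], "C": c, "RC": c[::-1]}
    counts = {name: Counter(ots(s)) for name, s in strands.items()}
    total = sum(len(s) - 2 for s in strands.values())
    rows = []
    for aa in AMINO_ACIDS:
        exp = total * CODONS_PER_AA[aa] / 64          # EXP# (assumption)
        d_rc = counts["D"][aa] + counts["RC"][aa]     # [D+RC]#
        r_c = counts["R"][aa] + counts["C"][aa]       # [R+C]#
        rows.append((aa,
                     100 * d_rc / exp,                # [D+RC]/EXP%
                     100 * r_c / exp,                 # [R+C]/EXP%
                     100 * (d_rc + r_c) / exp))       # [D+RC+R+C]/EXP%
    # Step v): descending [D+RC+R+C]/EXP%, alphabetical on ties (assumption).
    rows.sort(key=lambda row: (-row[3], row[0]))
    return "".join(row[0] for row in rows), rows


fp, table = fingerprint("ATGGCCCTGTGGATGCGCCTCCTGCCC")  # arbitrary demo input
print(fp)
```

Sorting on the second column of the table instead of the fourth would give the alternative ordering by [D+RC] values mentioned in step v).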

According to a preferred embodiment of the third aspect of the present invention there is provided, according to a ninth aspect of the present invention, a computer program stored on a data carrier for performing the method according to the third aspect of the present invention. Further, the computer program (software) according to the ninth aspect of the present invention may preferably perform sorting of the OTS statistical data and the OTS sequence fingerprinting. The software may also preferably perform formatting of sequences.

According to the fourth aspect of the present invention there is provided a method for determining a fingerprint similarity coefficient comprising comparing a fingerprint obtainable by a method according to the third aspect of the present invention with a predetermined fingerprint stored on a data carrier.

According to a preferred embodiment of the fourth aspect of the present invention there is provided, according to a tenth aspect of the present invention, a computer program stored on a data carrier for performing the method according to the fourth aspect of the present invention.

According to the present invention there is also provided, according to a fifth aspect of the present invention, a database comprising predetermined overlappingly translated protein sequences obtainable by a method according to the first aspect of the present invention. The Overlappingly Translated Sequence Database (OTS-DB) according to the fifth aspect of the present invention not only concerns the information content itself but also the structure of the database, namely that
- the biological information (sequences) in this database is stored in the form of OTS, and that
- the fundamental aspect of all new information obtained from the OTS-DB is that the information is more difficult or not possible to obtain from other sequence databases (i.e. from nucleic acid or non-overlappingly translated protein databases).

According to the present invention there is also provided according to a sixth aspect of the present invention a database comprising predetermined fingerprints obtainable by a method according to the third aspect of the present invention.

Preferred features of each aspect of the invention are as for each of the other aspects mutatis mutandis. The prior art documents mentioned herein are incorporated to the fullest extent permitted by law. The invention is further described in the following examples in conjunction with the appended figures, which shall not limit the scope of the invention in any way and shall not in any way limit the scope of the appended set of claims. Embodiments of the present invention are described in more detail with the aid of examples of embodiments, the only purpose of which is to illustrate the invention and are in no way intended to limit its extent.

Figures & Tables - Description

Table I shows the Comparison of the overlapping and non-overlapping translation of nucleic acids.

Table II shows the Construction of the conventional nucleic acid (N4), protein (PX12), overlapping translation sequence (OTS) databases (OTS-DB, P4) and the respective queries.

Figure 1 shows the LALIGN comparison of the overlapping (OTS) and non-overlapping (non-OTS) translations of an artificial sequence. A randomised artificial sequence (ASR) was translated into twelve frames (-X [6L-6A] -) by non-overlapping translation or into four frames (-P [2L-2A] -) by overlapping translation. The sixteen frames (-16S-) were ordered into one 2432 letter long sequence and analysed by the LALIGN method. Matrix: Blosum50, gap opening penalty: 19, gap extension penalty: 8.

Figure 2 shows the Comparison of blast scores. Prion protein (PRP-D-771, A), human insulin gene (HIG-D-4044, B) and insulin promoter sequence (INSP-D-600, C) were used as query either in nucleic acid or protein sequence (obtained by overlapping translation according to the present invention) forms. Local nucleic acid and protein (OTS) databases were searched by blastN, TblastX and blastNP methods. OTS: overlappingly translated sequence, CDS: coding sequence, control: shuffled sequence.

Figure 3 shows the Estimation of the specificity of 3 blast methods. The three different blast methods together identified 40 sequences (100%) in a human sequence database that were significantly similar to prion protein (PrP-D-771). The bars represent the percentage of these sequence similarities that were found by 1, 2 and 3 methods.

Table III shows the Summary of LALIGN comparison statistics.

Table IV shows the Comparison of the specificity and sensitivity of three different blasts.

Figure 4 shows the Comparison of the specificity and sensitivity of three blast methods. The number of true (T) and false (F) positive matches found by three different blast methods (blastN, TblastX and blastNP) was counted using two relative score values (>10% and >R) and expressed as the proportion of the total number of sequences in the super-families. Each symbol represents the mean ± S.E.M. of blast searches in 5 different PIR super-families. R: random, ns: not significant.

Figure 5 shows Sequence Composition. Mixing of two sequences (A, black boxes, and B, white boxes) to achieve hybrid sequences with predefined similarities to the original sequences. Note the even distribution of the replaced residues. Only 10 residues are shown.

Figure 6 shows Estimation of the specificity and sensitivity of three blast methods. BlastN, TblastX and blastNP searches were performed on nucleic acid and OTS databases. The sequences in the databases were designed to be 100-0% similar to an artificial sequence, called A, and 0-100% similar to another artificial sequence, called B (triangles below the x axis). The queries were identical to sequence B (B=100%). The number of identical residues in the matching sequences is plotted against the detected relative scores (where RS=100% means the highest possible score at 100% sequence identity).

Figure 7 shows the Correlation between the length of the query and the E-values. Paired BlastN, TblastX and BlastNP were performed on two identical sequences and the effect of sequence length on the E-values was monitored. E is the actual E-value; Eo is the E-value of the 100 residue long identical match; p is the significance of the difference with Student's t-test.

Figure 8 shows the LALIGN comparison of Human prion protein (PrP) mRNA (HPRPC) and novel PrP-like Protein Doppel mRNA (PRND) as nucleic acid sequences (C), as conventionally translated proteins (B) and as OTSs (A). The gridlines in Part B of the figure indicate the borders between the different translation frames.

Figure 9 shows the LALIGN comparison of the insulin related substance 1 (IRS1) and 2 (IRS2) as nucleic acid sequences (C), as conventionally translated proteins (B) and as OTSs (A). The gridlines in Part B of the figure indicate the borders between the different translation frames.

Figure 10 shows the LALIGN comparison of human insulin (INS) and human insulin like growth factor II (HSIGFIIC) as nucleic acid sequences (C), as conventionally translated proteins (B) and as OTSs (A). The gridlines in the Part B of the figure indicate the borders between the different translation frames.

Figure 11 shows the LALIGN comparison of the Human Insulin Gene (HIG) to Insulin pro-peptide (INSULIN) and Prion Protein (PRP). S, B, C and A are different domains of the insulin pro-peptide, and they were repeated 10 times. The sequences were compared as nucleic acids and overlappingly translated sequences (OTS). LALIGN parameters: Matrix: Blosum50, gap opening penalty: 19, gap extension penalty: 8.

Figure 12 shows the LALIGN comparison of insulin (INS), the insulin pro-peptide domains INS- [S-B-C-A] and the NH2-terminal 300 residues of the Prion Protein PRP- [1-300]. The insulin pro-peptide domains (S, B, C, and A) were repeated 10 times. The sequences were compared as nucleic acids and overlappingly translated sequences (OTS). The similarity between INS and PRP- [1-300] is emphasized by circles. LALIGN parameters: Matrix: Blosum50, gap opening penalty: 19, gap extension penalty: 8.

Table V shows the Results of SIM with INS-PD-245 (245 residues) and INS-PRP-CONSENSUS (245 residues).

Table VI shows the Comparison of the similarity parameters measured by blastNP, TblastX and blastN. Sequences, listed under NAME, ACCESSION NUMBER-DESCRIPTION, were compared with human prion protein (HUMPRPOA) and INSULIN-C peptides. The BlastN method was used to compare the sequences in nucleic acid form, TblastX to compare the sequences after conventional (non-overlapping, frame) transcription and blastNP to compare the sequences in overlappingly translated (OTS) form. Blast parameters: % id (percent identical residues in the matching sequences), length (number of residues in the matching sequences), score (bits) and E (E-value). The bold values indicate parameters determined when HUMPRPOA and INSULIN-C were compared with themselves. PD: OTS. XD: frame transcription, non-OTS. ND: nucleic acid. SD: standard deviation.

Figure 13 shows the Comparison of the similarity parameters measured by blastNP, TblastX and blastN. Box plot display of the similarity parameters listed in Table VI, but only the similarity of human prion protein (hprp) and insulin-C (insc) to the transcription factors is calculated. The bars indicate the minimum, first quartile, median, third quartile and maximum values of 7 independent comparisons. o: suspected outlier value. The blast parameters are: the identical residues in the matching sequences (A), the number of residues in the matching sequences (B), the score (C) and the log E (D). The shaded area indicates score and log E values which are usually not regarded to be significant.

Figure 14 shows the Visualization of the similarities between sequences with LALIGN. Human Prion Protein (HUMPRPOA) and INSULIN-C sequences were compared to transcription factors and to each other (1-6, see also Table: Comparison of the similarity parameters by blastN, TblastX and blastNP). The sequences were compared to each other as nucleic acids (N), conventionally (non-overlappingly) translated proteins (P) and overlappingly translated proteins (O). Gridlines separate the different protein frames of non-overlappingly translated sequences (P). The scale is artificially made uniform and does not reflect the real size (number of residues) of the sequences. The real size is indicated with numbers after the short name of the sequences. The 82 residue long insulin-c sequence was repeated 10 times and the 300 residue long NH2-terminal part of HUMPRPOA was repeated 3 times to achieve similar length and to intensify the signals.

Figure 15 shows the Multiple Sequence Alignment of transcription factors, insulin-C peptide and Prion Protein in OTS form. ClustalW alignment displayed by Jalview (A) and Boxshade (B) as well as the Neighbour joining tree using PID (C).

Figure 16 shows the Multiple Sequence Alignment of transcription factors, insulin-C peptide and Prion Protein in OTS form. ClustalW alignment displayed by Jalview (A) and Boxshade (B) as well as the Neighbour joining tree using PID (C).

Figure 17 shows the amino acid usage frequency, expected and found, in protein frames of an artificial randomised sequence, ASR. The ASR was designed to contain 5% of each amino acid (20x5%) in one of its reading frames, and compared to the calculated frequency as determined by the "biased-codon" (EXPECTED) and the real amino acid usage frequency as it was found in the human genome (HS. REAL) (a). The cumulative frequency (%) is the sum of different amino acid frequencies in the 12 (=4x3) theoretically possible reading frames (D: direct, RC: reverse & complement, R: reverse, C: complement) (b).

Figure 18 shows the Amino acid usage frequency in Insulin (INS) -OTS frames. The number of a given amino acid coded by the four OTS frames (D: direct, RC: reverse & complement, R: reverse, C: complement) together was regarded to be 100% and the contribution of one frame to it was expressed as D%, RC%, R%, C%. The amino acid usage frequency by the different frames (a, b, c) and the correlation between R+C% and D+RC% (d) are plotted.

Figure 19 shows the Amino acid usage frequency in TIG gene-OTS frames. The number of a given amino acid coded by the four OTS frames (D: direct, RC: reverse & complement, R: reverse, C: complement) together was regarded to be 100% and the contribution of one frame to it was expressed as D%, RC%, R%, C%. The amino acid usage frequency by the different frames (a, b, c) and the correlation between R+C% and D+RC% (d) are plotted.

Figure 20 shows the Amino acid usage frequency in an artificial, randomised sequence (ASR) -OTS frames. The number of a given amino acid coded by the four OTS frames (D: direct, RC: reverse & complement, R: reverse, C: complement) together was regarded to be 100% and the contribution of one frame to it was expressed as D%, RC%, R%, C%. The amino acid usage frequency by the different frames (a, b, c) and the correlation between R+C% and D+RC% (d) are plotted.

Figure 21 shows the Construction of an OTS-"fingerprint"-I. The real amino acid frequency (%) of INS, ASR, TIG and the expected frequency (EXP) in the four (D+RC+R+C) OTS reading frames together are indicated (S-EXP, S-INS, S-ASR, S-TIG) (a). The ratio of the real and expected values (both in D+RC+R+C frames together) gives the relative frequencies (% of S expected: S-INS%, S-ASR%, S-TIG%, where the S-EXP% = S-EXP/S-EXP x 100 = 100%) (b). The ratio of the real and expected values (but both only in D+RC frames) gives the relative frequencies (% of D+RC expected: [D+RC] INS%, [D+RC] ASR%, [D+RC] TIG%, where the [D+RC] EXP% = [D+RC] EXP/[D+RC+R+C] EXP x 100 = 50%) (c).

Figure 22 shows the Construction of an OTS-"fingerprint"-II and indicates the amino acid frequency sorting I (a), which was obtained by sorting the data in Table VII in descending order of [D+RC+R+C]/EXP% values. There is no obvious correlation between the [D+RC]/EXP% and [R+C]/EXP% values (b). The order of amino acids on the x axis of panel (a) was regarded to be one OTS fingerprint of ASRx-P-298 sequences (c).

Figure 23 shows the Construction of an OTS-"fingerprint"-III and indicates the amino acid frequency sorting II (a), which was obtained by recalculating the [D+RC]/EXP% and [R+C]/EXP% values, regarding the [D+RC+R+C]/EXP% to be 100 %, and sorting the data in descending order of [D+RC]/EXP100%. (See the equations at the end of the Table V.) There is a strong, negative, linear correlation between the [D+RC]/EXP100% and [R+C]/EXP100% values (b).

Table VII shows the Construction of an OTS-"fingerprint".

Figure 24 shows a Block diagram of the method for obtaining an OTS protein sequence.

Figure 25 shows a Flow diagram, the modular structure of a computer program (software) according to the seventh aspect of the present invention. The raw input file is a text (txt) file, from which the FASTA sequence (SEQ) and the description (DES) are extracted and converted into the four nucleic acid (N4), twelve conventionally translated (X12) and four OTS (P4) forms. The nucleic acid and amino acid composition of each sequence is counted and stored in separate files (SN4, SX12, SP4).

Figure 26 shows a conversion scheme of a computer program (software) according to the seventh aspect of the present invention.

Figure 27 shows a view results scheme of a computer program (software) according to the seventh aspect of the present invention.

Figure 28 shows a view log scheme of a computer program (software) according to the seventh aspect of the present invention.

Figure 29 shows use cases implemented by a sequence-converting tool of a computer program (software) according to the seventh aspect of the present invention.

Figure 30 shows the method for obtaining four different overlappingly translated sequences from the same NA wherein the OTS-D, OTS-R, OTS-C and OTS-RC are results of independent overlapping translation of the NA.

Examples

I. Application I: Sequence Searching and Visualization

A. Construction of local real and artificial OTS databases.

Nucleic acid FASTA sequences, mostly randomly selected complete human genes, were taken from the Genbank database [12, w1]. No selection was done for exons or introns.

The sequences were converted into the four possible nucleic acid frames: direct (D), reverse (R), complement (C) and reverse-complement (RC), without considering whether the resulting sequences were biochemically possible or not. The D and R sequences represented the leading (L) and the RC and C sequences represented the anti-sense (A) DNA strand. These sequences were pooled in a local database, called N4.
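The four strand forms can be sketched in a few lines of Python. This is only an illustration of the conversion described above and not the SeqConv program; the function name is hypothetical and the input is assumed to contain only the letters A, C, G and T.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def four_strands(seq):
    """Return the direct, reverse, complement and reverse-complement forms."""
    d = seq.upper()
    r = d[::-1]                    # same letters read backwards
    c = d.translate(COMPLEMENT)    # complementary letters, same direction
    rc = c[::-1]                   # complementary letters read backwards
    return {"D": d, "R": r, "C": c, "RC": rc}


for name, strand in four_strands("ATGGCC").items():
    print(name, strand)
```

Pooling the four values for every input record would correspond to the N4 collection described above.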

The nucleic acid sequences in the N4 database were translated both non-overlappingly and overlappingly. For the non-overlapping translation, consecutive amino acids were specified by consecutive code words (triplet codons, [w2]), as is known to occur in the production of proteins. The non-overlapping translation led to the conventional 4x3=12 frames, each one third of the length of the coding nucleic acid sequence (D3, D2, D1, R3, R2, R1, RC1, RC2, RC3, C1, C2, C3), which were collected in a local database called PX12.
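A minimal Python sketch of the conventional twelve-frame translation is given here for illustration. It assumes the standard genetic code, writes stop codons as "*" (an assumption; the text does not say how stops are represented), accepts only A, C, G and T, and uses hypothetical function names.

```python
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_frame(seq, offset):
    """Non-overlapping translation: step through the sequence 3 bases at a time."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(offset, len(seq) - 2, 3))


def twelve_frames(seq):
    """Translate the D, R, C and RC strands in reading frames 1, 2 and 3."""
    d = seq.upper()
    c = d.translate(COMPLEMENT)
    strands = {"D": d, "R": d[::-1], "C": c, "RC": c[::-1]}
    return {f"{name}{k + 1}": translate_frame(s, k)
            for name, s in strands.items() for k in range(3)}


frames = twelve_frames("ATGGCCTAA")
print(frames["D1"], frames["D2"], frames["D3"])  # MA* WP GL
```

Each frame is roughly one third of the template length, in line with the description of the PX12 entries.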

For the overlapping code, consecutive amino acids were encoded by codons that shared two bases; the last two bases of one codon were also the first two bases of the next codon. The overlapping translation produced 4 poly-amino-acid sequences (OTSs), corresponding to the 4 coding polynucleotide sequences, each being just 2 letters shorter than the template (Table I). They were stored in an OTS database called P4.
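The overlapping translation can be sketched as a sliding window that advances one base at a time. The Python below is an illustration under the same assumptions as before (standard genetic code, stop codons written as "*", input restricted to A, C, G and T); it is not the SeqConv implementation, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def overlapping_translate(seq):
    """Read a codon at every position, so neighbouring codons share two bases."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(len(seq) - 2))


template = "ATGGCC"
protein = overlapping_translate(template)   # codons ATG, TGG, GGC, GCC
print(protein, len(template) - len(protein))  # MWGA 2
```

The result is two letters shorter than the template because the last two positions cannot start a complete codon.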

Table I.

Comparison of the overlapping and non-overlapping translations of nucleic acids.

A.: Illustration of the principle

Standard Code (transl_table=1) was used in both cases [w2-w3].

Publicly available tools and databases were used: BCM search launcher [w4], HGMP-RC [w5], Biology Workbench [w6], SWISS-PROT [w7], TrEMBL [w8] and Eukaryotic Promoter Database [EPD, 13, w9]. Conventional blastN, TblastX [w10], blastP [14-15, w10], LALIGN [w11] and SIM [16, w12] methods were used.

A software tool called SeqConv (Java) was developed for the overlapping translation of nucleic acids.

Paired two sample Student's t-test was used for statistical evaluation of the results [w13].

The construction of conventional nucleic acid (N4) and protein (PX12) sequences and databases, in contrast to the overlappingly translated sequences (OTS) and database (OTS-DB, P4), is illustrated in Table II (in addition to Table I).

Table II

Construction of the conventional nucleic acid (N4), protein (PX12), overlapping translation sequence (OTS) databases (OTS-DB, P4) and the respective queries.

(Illustrated on an artificial sequence)

An artificial protein sequence was designed which contained 5% of each amino acid; it was randomised and then back-translated [w17], using the human codon usage frequency table, to produce a nucleic acid (ASR-ND-300). This nucleic acid contained 26.0% A, 23.7% T, 26.0% G and 24.3% C nucleotides and served as a random template for translations. The resulting proteins, ASR-X [6L-6A]-12S-1228 after non-overlapping and ASR-P [2L-2A]-4S-1204 after overlapping translation, were regarded as the random controls for studies and comparisons of the basic bioinformatics properties of these two types of sequences. First of all, the non-OTS and OTS do not show any significant similarity when compared to each other with the LALIGN method (Figure 1). Searching a conventional protein database with an OTS query is therefore not meaningful.
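The design of such an artificial template can be sketched as follows. This Python illustration builds a shuffled 100-residue protein with 5% of each amino acid and back-translates it into 300 nucleotides. The back-translation here uses a single placeholder codon per amino acid, which is an assumption made only to keep the sketch short; it is not the human codon usage frequency table used in the text, so the base composition it yields will differ from the reported values. All names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# PLACEHOLDER: one valid standard-code codon per amino acid, chosen for
# illustration only. A frequency-weighted codon table is not reproduced here.
PLACEHOLDER_CODON = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "CGG",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}


def random_uniform_protein(per_residue=5, seed=0):
    """Shuffle a protein that contains each amino acid the same number of times."""
    residues = list(AMINO_ACIDS * per_residue)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def back_translate(protein, codon_for=PLACEHOLDER_CODON):
    """Replace every residue by the codon chosen for it."""
    return "".join(codon_for[aa] for aa in protein)


def base_composition(na):
    """Percentage of each nucleotide in the sequence."""
    counts = Counter(na)
    return {base: 100 * counts[base] / len(na) for base in "ATGC"}


protein = random_uniform_protein()
template = back_translate(protein)
print(len(protein), len(template))
print(base_composition(template))
```

Passing a different `codon_for` mapping, or sampling codons according to a usage table, would bring the composition closer to a balanced template.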

B. The BlastNP

1. Abstract of blastNP

Background of blastNP: Using overlappingly translated sequences (OTS) instead of conventionally (non-overlappingly) translated sequences or nucleic acids seems to have special advantages in database searching and visualization methods like BLAST, FASTA, and LALIGN.

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The standard nucleotide-nucleotide BLAST [blastN] has relatively low sensitivity because of the poor information density of nucleotide sequences. The standard protein-protein BLAST [blastP] is much more sensitive but the number of known real protein sequences is limited. It is possible to combine the advantages of the nucleotide and protein blasts by translating the nucleotide query [blastX], the database [TblastN] or both [TblastX] into real or conceptual protein frames.

Results with blastNP: An alternative method to TblastX has now been developed, the "blastNP". Nucleic acids in database and query sequences were translated into overlapping protein-like sequences (overlappingly translated sequences or OTSs) before searching with blastP. Thus, each nucleic acid sequence is represented by a single "protein-like" sequence (instead of three hypothetical proteins in different reading frames), and the 6x6 comparison of TblastX is replaced by 2x2 comparisons, as both OTS sequences can be generated in the forward and reverse directions.

A qualitative comparison of blastN, TblastX and blastNP [blastP against OTS] showed some fundamental differences between these methods: blastNP using OTS detected about two thirds of blastN and TblastX matches but discovered additional similarities. When blastN was compared to blastNP, identical matches discovered by blastNP were generally longer (602 compared to 213 letters, p<0.01), had higher scores (748 compared to 460 bits, p<0.05) and lower E-values (3.16E-20 vs. 1.17E+01, p<0.01), but the percentage identity was lower (25 % vs. 61 %, p<0.001). A qualitative evaluation with LALIGN showed that the visualization improved when OTSs were used instead of nucleic acids.

Conclusion regarding blastNP: The BlastNP method can be defined as a BlastP that is performed on an overlappingly translated nucleic acid database (OTS-DB) using a similarly converted nucleic acid query (OTS). BlastNP combines the advantages of nucleotide and protein blasts and bypasses many difficulties: 1. it can be more sensitive to weak sequence similarities than either blastN or TblastX, 2. codon redundancy is eliminated, 3. the sensitivity to single nucleotide polymorphisms, mutations and sequencing errors is reduced, 4. it is insensitive to frame shifts and 5. it makes the use of shorter queries possible.

2. Results with blastNP

a. Sequence searching in conventional and OTS databases.

Conventional blastN, TblastX [w10], blastP [14] and LALIGN [w11] methods were used. (LALIGN compares two protein or DNA sequences for local similarity, finds multiple matching sub-segments in the two sequences, and shows and plots the local sequence alignments.) Generally, the BLOSUM-62 matrix was used because experimentation has shown that this matrix is among the best for detecting weak protein similarities using queries of average length (100-10000 residues).

DUST [17] and SEG [18], respectively, were used by default to filter the queries, that is, to mask off segments of the query sequence that have low compositional complexity, and to find a set of best scoring sequences. However, the final scores were determined on the pre-selected sequences without filtering.

BlastNP refers to the following procedure: 1. the creation of an OTS database from any nucleic acid sequence database; 2. the conversion of a nucleic acid query into an OTS query; 3. the application of the usual blastP method to the OTS sequences. An in-house sequence-converting program, SeqConv, performs the conversion of nucleic acids. The paired two-sample t-test was used for statistical evaluation of the results [w13].
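Steps 1 and 2 of this procedure amount to rewriting nucleic acid FASTA records as OTS FASTA records, which can then be handed to any protein search program. The Python sketch below illustrates only that conversion; it is not SeqConv, the blastP search itself is not shown, and the "OTS-D" header suffix and the function names are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code (transl_table=1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def overlapping_translate(seq):
    """One residue per nucleotide position (input limited to A, C, G, T)."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(len(seq) - 2))


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_ots(text, width=60):
    """Convert every nucleic acid record into an OTS record of the direct strand."""
    out = []
    for header, seq in read_fasta(text):
        protein = overlapping_translate(seq)
        out.append(f">{header} OTS-D")
        out.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(out) + "\n"


print(fasta_to_ots(">demo\nATGG\nCC\n"), end="")
```

The same function would be applied to the database file and to the query file, so that both sides of the search are in OTS form.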

Three very well known sequences, the sheep gene for prion protein (PrP) coding sequence (PRP-D-771, D38179) and the human insulin gene (HIG-D-4044, J00265) as well as its promoter (INSP-600, EP07109), were chosen as examples for searching the local nucleic acid and OTS databases. BlastN and blastP vs. OTS gave very similar best scores and both were higher than the TblastX scores (Figure 2). The respective shuffled (randomised) sequences were used as controls. No matches to the OTS sequences were found in the conventional protein databases SWISS-PROT and TrEMBL [11, w7, w8]; no similarities were found between the reverse (R) and complement (C) sequences and any known conventional nucleic acid sequences (results not shown).

b. Estimation of the specificity and sensitivity of the blastNP method

- Searching databases with the prion protein

The blastN, TblastX and blastNP searches with PrP identified 40 sequences altogether (100%), which were regarded to be significant matches because their scores were over 150 bits (random controls below 30 bits). Not one of these sequences was located on the R or C strands and none was detected by the DUST or SEG filters as a low complexity sequence (Figure 3). Twelve (30% of 40) were identified by all three methods. An additional 11 (27%) were found by two methods in combination but missed by the third method. As many as 17 (43%) of the significant matches were identified by only one blast method, and five (13%) of these were identified by blastNP but not by the other two.
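The tally of how many methods found each match can be sketched with simple set counting. In the Python below the sequence identifiers in the toy example are hypothetical; only the counts 12, 11 and 17 out of 40 come from the text, and the percentages computed from them agree with the reported values to within rounding.

```python
from collections import Counter


def tally_by_method_count(hits):
    """Map 'number of methods that found a sequence' to (count, percentage)."""
    found_by = Counter()
    for ids in hits.values():
        found_by.update(set(ids))          # one vote per method per sequence
    per_level = Counter(found_by.values())
    total = len(found_by)
    return {k: (per_level[k], 100 * per_level[k] / total)
            for k in sorted(per_level)}


# Toy example with hypothetical identifiers.
toy_hits = {
    "blastN": {"a", "b", "c"},
    "TblastX": {"a", "b", "d"},
    "blastNP": {"a", "c", "e"},
}
print(tally_by_method_count(toy_hits))

# Counts reported in the text: found by three, two and one method.
reported = {3: 12, 2: 11, 1: 17}
total = sum(reported.values())
shares = {k: 100 * v / total for k, v in reported.items()}
print(total, shares)
```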

Fifteen sequences that showed similarity to PrP in both blastN and blastNP were further studied by LALIGN comparisons. This approach turned out to be very useful for visualizing the location and extent of the sequence similarities. Comparison of OTSs seems to give a subjectively better visualization of the local sequence similarities than comparison of nucleic acids. A numerical evaluation of the LALIGN results is given in Table III, where the parameters of the 15 different nucleic acid and protein LALIGNs (length of the match, percentage of identical residues in the match, score and E-value) are compared with each other using the paired Student's t-test. Only the parameters of the first and best local similarity in each of the 15 LALIGN comparisons were statistically evaluated. This shows that OTS alignments produced higher scores (748 vs. 460, p<0.05) and lower E-values (3.16E-20 vs. 1.17E+01, p<0.01) than the corresponding nucleic acid alignments. The matching sequence segments were longer (602 vs. 213, p<0.01), but the percentage of identical residues was lower (25.3 vs. 61.6 %, p<0.001) when OTSs were used instead of nucleic acids.
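The paired comparison used here can be illustrated by computing the paired t statistic from the per-sequence differences. The Python sketch below uses only the standard library and hypothetical numbers, not the LALIGN results; it returns the t value and degrees of freedom and leaves the conversion to a p value to a t-distribution table or a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired t statistic for two equally long lists of measurements.

    Returns (t, degrees_of_freedom). Raises ZeroDivisionError when all
    pairwise differences are identical.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    diffs = [b - a for a, b in zip(x, y)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical paired measurements (e.g. a score for the same five sequence
# pairs evaluated in two different representations).
nucleic = [1, 2, 3, 4, 5]
ots = [2, 4, 5, 4, 7]
t, df = paired_t(nucleic, ots)
print(round(t, 3), df)  # 3.5 4
```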

Table III.

Summary of LALIGN Comparison Statistics

Type of the sequence:    nucleic acid (N)      protein (P)           significance (N-P)
Sequence A:              PrP nucleic acid      PrP-OTS
Sequence B:              15 n.a. sequences     15 OTSs
Length of match (M):     213 ± 413             602 ± 656             p<0.01
% identity in M:         61.6 ± 7.3            25.3 ± 11.5           p<0.001
Score:                   460 ± 1274            748 ± 1719            p<0.05
E-value:                 1.17E01 ± 1.02E03     3.16E-20 ± 5.0E25     p<0.01

Sequence A: prion protein (PrP). Sequence B: a sequence preselected on the basis of its similarity to PrP with both blastN and blastNP. Each value represents the mean ± standard deviation of the main parameters of LALIGN comparisons of A and 15 different B sequences (n=15). Only the top local alignments in every LALIGN comparison are evaluated by paired Student's t-test. OTS: overlappingly translated sequence. n.a.: nucleic acid.

- Searching the PIR superfamilies

Five medium-size super-families (insulin, prolactin, zfp-36, actin, globin) were chosen from the PIR (Protein Information Resource) database [19-20, w14]. Fifty percent sequence identity is used by the PIR database staff for the provisional clustering of proteins into families, because sequences with more than 50% sequence identity are usually similar in structure and function and the major sequence features are unambiguously aligned by commonly used multiple sequence alignment programs. Protein families are clustered into "homeomorphic superfamilies", i.e. all members of a superfamily should have the same overall domain architecture. It is assumed that the molecules in a homeomorphic superfamily share a common evolutionary history. The genes containing the coding sequences of the chosen proteins were identified in the EMBL database by the Sequence Retrieval System (SRS) [w16]. The sequences were stored in a local nucleic acid database and, after overlapping translation, in an OTS database. BlastN, TblastX and blastP searches were performed on the respective databases using one representative query from each super-family (5 x 5 x 3 blast searches altogether).

All blast methods generate a series of optimal and sub-optimal local alignments, which are characterized by a score and an E-value. Perfect identity between two sequences is characterized by a maximum score (Smax) and a corresponding minimum E-value (Emin).

Mismatch between two (random) sequences gives the random score (Sr) and random E-value (Er). These values are easy to determine. Different blast methods operate in somewhat different ranges of score and E-value (between Smax and Sr, and between Emin and Er). This complicates the direct comparison of the results of two different blast methods. To solve this difficulty the relative score (RS) and relative E-value (RE) were calculated. The RS is defined as the ratio of a given score and the maximum score of a blast search (RS=S/Smax, RSmax=Smax/Smax=1 or 100%). The RE is defined as the ratio of a given E-value and the Emin of a blast search (RE=E/Emin, REmin=Emin/Emin=1 or 100%).

It is up to the investigator to decide which score and E-value thresholds to use to discriminate between significant and non-significant matches. If the RS and RE thresholds are too low (close to the Sr and Er), many weak similarities will be regarded as statistically significant. If the RS and RE thresholds are too high, only the almost identical and optimal alignments will be discovered. The specificity and sensitivity of a blast method obviously depend on the actual RS and RE values. In this study two RS thresholds were used to give a nuanced picture of the specificity and sensitivity question: RS>10% and RS>Sr.
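The relative score, the relative E-value and the two thresholds can be written out as a short Python sketch. The numeric values in the example are hypothetical, the function names are not from the source, and the "greater than random" threshold is interpreted here as a raw score above Sr, which is an assumption.

```python
def relative_score(score, s_max):
    """RS as a percentage of the self-match score Smax."""
    return 100.0 * score / s_max


def relative_e(e_value, e_min):
    """RE as a multiple of the self-match E-value Emin (Emin must be non-zero)."""
    return e_value / e_min


def is_significant(score, s_max, s_random=None, min_rs_percent=None):
    """Apply either the RS > x% threshold or the 'above random score' threshold."""
    if min_rs_percent is not None:
        return relative_score(score, s_max) > min_rs_percent
    if s_random is None:
        raise ValueError("give either min_rs_percent or s_random")
    return score > s_random


# Hypothetical search: self-match scores 1500 bits, a shuffled control 30 bits.
s_max, s_random = 1500, 30
print(relative_score(150, s_max))                      # 10.0
print(is_significant(150, s_max, min_rs_percent=10))   # False: not above 10%
print(is_significant(150, s_max, s_random=s_random))   # True: above random
```

The example shows why the two thresholds give different pictures: the same match is rejected by the stricter RS>10% criterion and accepted by the looser one.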

Significant matches between a query and the sequences belonging to its own super-family were regarded to be true positives. Significant matches between a query and the sequences belonging to a different super-family than the query were regarded to be false positives. The proportion of the true positive matches to the number of sequences in the super-families was regarded to be the sensitivity of the blast method (100% is the maximal sensitivity). The proportion of the false positive matches to the number of sequences in the super-families was regarded to be a measure of the selectivity (0% false positives indicated maximal, 100% selectivity). The results are given in Table IV and Figure 4.
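These definitions can be illustrated with a small Python sketch. The choice of denominators is an assumption, because the text speaks only of "the number of sequences in the super-families": here true positives are divided by the size of the query's own super-family and false positives by the combined size of the other super-families. The identifiers and the function name are hypothetical.

```python
def sensitivity_and_false_positive(hits, family_of, query_family):
    """Return (true positive %, false positive %) for one query.

    hits         set of sequence ids reported as significant matches
    family_of    dict mapping every database sequence id to its super-family
    query_family super-family of the query
    """
    own = {s for s, fam in family_of.items() if fam == query_family}
    other = set(family_of) - own
    true_pos = len(hits & own)
    false_pos = len(hits & other)
    return 100 * true_pos / len(own), 100 * false_pos / len(other)


# Toy database: four insulin-family and four globin-family sequences.
family_of = {f"i{k}": "insulin" for k in range(1, 5)}
family_of.update({f"g{k}": "globin" for k in range(1, 5)})
hits = {"i1", "i2", "i3", "g1"}
print(sensitivity_and_false_positive(hits, family_of, "insulin"))  # (75.0, 25.0)
```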

The sensitivity of TblastX and blastNP (blastP using an OTS query and an OTS database) was very similar (88.6 ± 4.2% and 80.2 ± 4.6%, mean ± S.E.M.) and much higher than the sensitivity of blastN (43.4 ± 9%). The opposite was found for the selectivity: the proportion of false positive matches was 19.6 ± 8.9% when using TblastX and 18.6 ± 2.6% when using blastNP, but as little as 0.8 ± 0.3% with blastN.

Table IV. Comparison of the specificity and sensitivity of three blast methods.
True positive (TP) and false positive (FP) matches are given in % of the PIR superfamily (sf) size, at two score thresholds (score in % of maximum): >10 and >R.

blastNP
Query (name, accession no.) | Length | PIR superfamily | sf size (# of sequences) | TP >10 | TP >R | FP >10 | FP >R
Igf-1 (m29644) | 726 | Insulin | 197 | 38 | 73 | 0 | 19
prolactin (V00566) | 831 | Prolactin | 171 | 28 | 79 | 0 | 27
znf132 (U09411) | 2366 | Zfp36 | 131 | 45 | 92 | 0 | 11
h-alpha-actin (J05192) | 1327 | Actin | 252 | 78 | 90 | 0 | 20
h-alpha-globin (V00493) | 573 | Globin | 397 | 30 | 67 | 0 | 16
SUMMARY (mean ± S.E.): TP >10 [mean illegible] ± 9.2; TP >R 80.2 ± 4.6; FP >10 0 ± 0; FP >R 18.6 ± 2.6

TblastX
Query (name, accession no.) | Length | PIR superfamily | sf size (# of sequences) | TP >10 | TP >R | FP >10 | FP >R
Igf-1 (m29644) | 242 | Insulin | 197 | 35 | 83 | 0 | 6
prolactin (V00566) | 277 | Prolactin | 171 | 28 | 93 | 0 | 24
znf132 (U09411) | 789 | Zfp36 | 131 | 52 | 96 | 0 | 10
h-alpha-actin (J05192) | 443 | Actin | 252 | 91 | 98 | 0 | 52
h-alpha-globin (V00493) | 192 | Globin | 397 | 65 | 73 | 0 | 6
SUMMARY (mean ± S.E.): TP >R 88.6 ± 4.2; FP >10 0 ± 0; FP >R 19.6 ± 8.9

blastN
Query (name, accession no.) | Length | PIR superfamily | sf size (# of sequences) | TP >10 | TP >R | FP >10 | FP >R
Igf-1 (m29644) | 728 | Insulin | 197 | 19 | 40 | 0 | 1
prolactin (V00566) | 833 | Prolactin | 171 | 7 | 26 | 0 | 0
znf132 (U09411) | 2368 | Zfp36 | 131 | 0 | 74 | 0 | 1
h-alpha-actin (J05192) | 1329 | Actin | 252 | 24 | 50 | 0 | 1
h-alpha-globin (V00493) | 575 | Globin | 397 | 9 | 27 | 0 | 2
SUMMARY (mean ± S.E.): TP >10 11.8 ± 4.3; TP >R 43.4 ± 9; FP >10 0 ± 0; FP >R 0.8 ± 0.3

- Searching with artificial sequences

Two different artificial sequences were constructed as described before. The sequences (A and B) were mixed as illustrated in Figure 5 and placed into local nucleic acid and OTS databases. Blast searches (blastN, TblastX and blastNP) were performed on the nucleic acid and OTS databases with queries containing only the B sequence (B-100% query). Figure 6 shows the correlation between the sequence identity in the detected sequences and the corresponding relative scores (RS) as detected by the three different blast methods. The sensitivity of TblastX and blastNP was very similar at every RS value, and both detected as little as about 30% sequence similarity to the query. More than 80% sequence similarity to the query was required to reach the RS=20% detection level with blastN. These results indicate that the sensitivity and selectivity of TblastX and blastNP are very similar.

However, this is true only for queries of average size. When the query is short (fewer than 50 residues), the sensitivity of blastNP is clearly better than the sensitivity of TblastX, and it approaches the sensitivity of blastN (Figure 7).

This is one of the major advantages of blastNP: it makes possible the detection of very short sequence identities, which are not detectable with TblastX and are not practical to search for with blastN.

C. Signal amplification by converting nucleic acids to OTS

Sequence similarities between nucleic acids (found by blastN or TblastX) are sometimes very weak and difficult to visualize by LALIGN. However, they can be significantly improved by converting the nucleic acid sequences into OTSs. Some examples are the signal amplification of the similarity between human prion (HPRPC, M13899) and doppel prion protein (PRND, NM012409) (Figure 8); insulin receptor substrate 1 and insulin receptor substrate 2 (IRS1, XM002475; IRS2, XM007095) (Figure 9); and insulin (INS, L15440) and insulin-like growth factor II (IGFII, Y13633) (Figure 10). Especially interesting is the similarity between the prion and doppel prion sequences, which was not detectable using blastN. The similarity between the 179-residue doppel prion and human PrP was discovered by TblastX [21]. This similarity was clearly seen when OTS sequences were compared.

LALIGN is a powerful tool for localizing and visualizing sequence similarities. It works well on nucleic acids, but converting the nucleic acids into OTS further amplifies the similarity signal. This is demonstrated by the LALIGN comparisons of the human insulin gene (HIG) to the insulin propeptide domains and to the prion protein (Figures 11 and 12).

Statistical analysis of the similarity between the insulin pro-peptide and prion protein showed that it is highly significant, E < 1.8 E-14 (Table V).

Table V Results of SIM with INS 245 (245 residues) and INS-PRP-CONSENSUS (245 residues)

D. Identifying the prion protein and the insulin-C peptide as novel transcription factors and its consequences.

The similarity between the prion protein and pro-insulin is clearly a new and unexpected observation. The sequence similarity is statistically significant; it is detected by blastNP and is not detectable with other methods such as blastN or TblastX. The similarity involves mainly the first 300 residues at the NH2-terminal of the PRP and the C-peptide of pro-insulin.

The NH2-terminal part of the PRP is a protease-sensitive, flexible part of the sequence and contains four zinc-binding octa-repeats. Prion protein is normally present in the nervous system; however, a change in its structure turns it into a pathogenic agent, which causes the prion diseases (kuru, scrapie, bovine spongiform encephalopathy (BSE), Creutzfeldt-Jakob disease) [22].

The insulin pro-peptide consists of four domains: signal (S), B, C and A. The S and C sequences are cleaved out during the synthesis of insulin. The hormone insulin contains only the A and B domains. Insulin (A and B domains) regulates the sugar metabolism, while the C-peptide does not. Type I diabetes mellitus is caused by the defective synthesis of pro-insulin (S+B+C+A), while the disease is treated with insulin (A+B) only. The insulin treatment normalizes the blood sugar level and keeps the patient alive, but does not prevent the development of the so-called "late complications of diabetes" such as neuropathy, nephropathy and angiopathy. Therefore, there is a hypothesis that the lack of the C-peptide is responsible for the development of the late complications and that the C-peptide should be added to the substitution treatment of a diabetic patient [23]. This question is of tremendous importance if one considers that as much as 1% of the population in the industrial countries suffers from type I diabetes and is already treated with insulin.

It is hard to believe that a statistically significant similarity between two sequences should indicate a biologically important connection between prion disease and diabetes mellitus. There is no indication in the literature of any connection between these two molecules, although both belong to the world's most intensively studied molecules and diseases.

Further bioinformatics studies of the NH2-terminal part of the PRP and the insulin-C peptide showed that both sequences are similar to transcription factors. The similarity is detectable by blastNP (blastP using an OTS query and an OTS database) but not with blastN or TblastX. Seven of the transcription factors similar to both the PRP and insulin-C are listed in Table VI together with the main parameters of three different blast comparisons. The parameters are statistically analysed and summarized in Figure 13.

Table VI Comparison of the similarity parameters measured by blastNP, TblastX and blastN. BLAST-M* HUMPRPOA ? D INSULINGPU NAME ACCESBiONNUMBER-UESCRIVTION % id Lsnpth scon E % Id LmgM ncons E erebEp2 U85962-P-D-7327-CREBBP2-CREB-bWiig protein 33. 5 221 165 1. 30E-07 29. 7 238 130 S. OOE44 snapt5 U44898-P-o-I 512-FLSU44898 Hnmen SNM45 s6mrt mRN/4 cokm cde. 70. 5 2N ! 2 6. 4CE10 31. 4 Ml 170). M6-M PIP U29185-P-D-35520-U291S5Hortnfapiernprionproteim (PrP) gerc E6. 9 320 H67 2. 50E-180 30. 3 627 32S 3. OOE-23 znve M91592-HSZINC-P-d-2455-Hmm zbx-finger protein (ZNF76) 29. 7 12D 155 5. 30E-07 32. 3 375 1M 2. QOE-tO e21 AF519877-P-D-24330k-Hoiw upiem EZF"racrn factor 2 (E2F2) Wm 26. 8 400 158 2. OCEAN 296 3t7 193 7. OOE-10 AF3239t4-P-D-1907-AF325914 FIsKmppeEhkeancfapyerproeemGLIS2mRNA 7a. m W 9 2. 0tE-09 32. 5 431 261 8. 00E-1 zne3l AF052224-r-iN15962-AF0522242bneuora1doubia2asEn ; rpno ; pin (ZNF231) 2ii. 3 232 IEa 2orE47 31. 7 325 178 looE4s Efp AH026054-P-D3080. AB026054 Fls BFP2NF 179 mRNA forbrem &ger prote'm 11. 7 J31 285 S. OOE-20 34. 9 f 75 198 4. OOE-11 insa lNS-C-18223x10-S31-pd 30. 4 815 353 2. 00E-27 lOO 748 _ noli hPP OA-1-300 anc7-904-700 101 Ib7 ! I 70. 1 ! 15 351 2. OOE-27 mem 36t 32i 393 3. 20E-07 31. 4 412 222 5. 50E-05 SD 19 199 SS5 8. 50E47 1. 6 197 75. 8 1. BOE-01 T-aLAST-XHUOPRPOAXOtMBUUM-XD NAME ACCESSION NUMBEROE. 9GRIPTION 7 (id knptn cnn E 7f10 LENOTMCORI E erdbp2 U85962-P-D-7327-CRFBP2-CREB-bcd'ygproteen 30. 4 48 63 2. OOE-03 25. 2 iSt 94 2. DOE101 onap45 U44999-P. D. 1512-RSU44898 HumanSNAP45 suburit MRNA, conv [cte cds. 24, 6 so 75 4. OOE-01 24. 5 143 e9 I. tMEtM PIP U29185-P-D-35520-U29185 Homoupimprion protein (PrP) gm 90. 2 102 672 G. OOE58 29. 8 228 120 2. 00E42 : n76 M91592-HSZINC-P-d-2435-Huananme-ivgerpmtein (ZNF76) 324 74 61 I. OOE-03 26 89 eo ZOOE-02 e21 AF518877-P-D-24330-HownpiensEZF tramripdonfictor2 (E2F2) gene 25 52 80 2. o0Eo2 t w 2t o t 1t t. ooE-o2 k. ppl AF325914-P-D-1907-AF325914 HsK. ppeEkeim &e"protein GUS2 niRNA 32. 
7 52 76 4. OOE-01 27. 1 181 118 2. ME-03 zni231 AF052224-P-D-15962-AF052224 Rs raldotibit imc finger protein (ZNF23 1) 31. 3 67 99 2. OOE-00 29. 8 114 125 3. OGE-03 bO AB026054-P-D-3080-ABO26054 Hi BFPFZNF 179 niRNA for brain EW protein 42. 5 40 97 S. Oye41 30. 5 125 95 7. 00EO inse NS-C-E822lx10-831-pd 251 239 150 9. 00E47 X3 7s3 5t3 E9D hprp HUMPRPOA-1-300 x3-904-pd 7 noll 35. f 238 10 O. OOE-07 mtM 371 t2. 3 153 3. ME*02 27. 6 165 109 2. 10EOt D 20 7 61. 5 197 8. 90E 02 2. 7 52. 8 23. 3 6. 6) Et01 BLAST-NHUMPRPOA-HOtMaUUfC-ND NAME ACCESSION NUMBER-0ESCRIPTION % id Inplh ton E % id hnqth son E crebbp2 U85962-P-I). 7327- CREBBP2-CRB3-biriding protein 77, 5 31 77 3. OOE-03 59. 1 88 ES 8. OOE-02 nep45 U44898-P-D-1512-HSU44898 Hmmn SNAP45 snbnmi mRNPS conpkte cds. 66. 8 75 78 I. OOE01 5l. 177 77 7. DOE. 02 prp U29185-P-D-35520-U29185 Honio sapiem prion prottin (PrP) gim 91. 7 302 1181 2. 00E-88 se. e 67 3. 00E-ti2 : nne M91592-HSZINC-P-d-255-Human ic-fngerprotein (ZNF76) 67. 3 IA S2 S. OOE-02 63. 9 63 61 3. ODE-02 a7t AF518877-P-D-24330. Homo sapiens F1F tanscription factor 2 (E2F2) re 635 52 89 I. OOH02 71. 4 49 107 4. 00E-OI 4kuppel AF325914-P-D-1907-AF325914 HsKruppei-ikezim&VrpreteinGLIS2mRNA 59. 5 79 73 2. 00E-03 ei. 6 es 105 . OOE Oi zn^31 AF052224-P-D. 15962-AF052224 HsneurotWdotble air firwrpmdein (2NF23 1) 55. 3 132 91 8. 00E. 57. 4 209 117 400Etm bf ABa26054-P-D-3080-AB026054 Hs BFP2NF 179 mRNA forbrein fingerprotein 80. 4 139 121 3. DOE-01 59. 4 105 108 4. ODE-00 insc INS-C- [82*10-831. pd 54. 6 204 all 4. OOE-02 loo L"3772 3. 00 hprp 3-904-pd 100 ol 1. 00-M 54. 9 204 99 4. OOE+01 605 114 209 7. ME*02 60. 4 115 95. 9 2. 40E+02 SD11. 6 91. 4 M5 1. ME<03 4. 6 SS. 6 136 3. 00E*02

The parameters of similarity between PRP, the insulin-C peptide and the transcription factors depend on which blast method was used. The scores and E-values are significant only when blastNP is used. This confirms that the similarity of PRP and insulin-C to transcription factors is normally not detectable with blastN and TblastX.

The similarity between PRP, insulin-C and some transcription factors is visualized by LALIGN graphics (Figure 14). The intensity of the similarity signals in LALIGN differs depending on whether the sequences are compared as nucleic acids (as detectable by blastN), as proteins after conventional, non-overlapping translation (as detectable by TblastX), or as overlappingly translated protein-like sequences (OTS, as detectable by blastNP, i.e. blastP on OTS). The LALIGN signal is most intense (most significant) when OTSs are compared.

The multiple sequence alignment shows the common identical residues in the PRP, insulin-C and transcription factor sequences (Figure 15) and those shared only between insulin-C and the transcription factor sequences (Figure 16). The neighbour-joining tree shows that insulin-C and the transcription factors in particular are closely related to each other, while the PRP sequence is a somewhat more distant relative.

II. Application II: Sequence "fingerprinting", filter and function-finder

A. General properties of the OTS

The average amino acid frequency in proteins is expected to be 1/20 (5%), but it actually varies between 1.29% (Trp) and 9.86% (Ala) (w3). Nor does it follow the codon-biased pattern of translation, because only one of the 12 possible translations is biologically meaningful and is considered in the codon usage tables. However, when the sequences of the 12 non-overlapping translation frames are summed, the sum follows the codon-biased frequency pattern (Figure 17/a-b). The same holds for the four overlapping translation frames: together they follow the codon-biased amino acid usage pattern.

The amino acid usage frequency on the four OTS frames shows regularity: there is an inverse relationship between the amino acid frequency on the D and RC frames as well as on the R and C frames. There is a strong negative correlation between the D+RC and R+C frequencies. It is illustrated on the INS, TIG and ASR sequences (Figure 18-19-20/a-d).

B. Construction of an OTS-fingerprint

Generally the amino acid usage frequency is not specific for a given protein, because very different sequences may have exactly the same composition. However, in the case of OTS, the composition is sequence dependent because of the overlaps of codons in the coding of neighbouring amino acids. Therefore the amino acid composition of an overlappingly translated sequence may be a unique, sequence-specific "fingerprint" of the OTS (Figures 21, 22 and 23). See also Table VII below.
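The sequence dependence of the OTS composition can be illustrated by a minimal sketch in Python. It is an illustration only, not the disclosed program: the function names are hypothetical, the standard genetic code is assumed as the conversion table, "*" is assumed as the symbol for stop codons, and "Z" is used for triplets that cannot be converted.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (assumed conversion table; "*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def conventional_translate(nucleic_acid, frame=0):
    """Non-overlapping translation: one codon every third nucleotide."""
    seq = nucleic_acid.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z")
                   for i in range(frame, len(seq) - 2, 3))


def overlapping_translate(nucleic_acid):
    """Overlapping translation: a codon is read at every nucleotide position."""
    seq = nucleic_acid.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z")
                   for i in range(len(seq) - 2))


# Two different nucleic acids coding for the same conventional protein (Ala-Ala)
for na in ("GCCGCC", "GCAGCA"):
    ots = overlapping_translate(na)
    print(na, conventional_translate(na), ots, dict(Counter(ots)))
```

In this small example both nucleic acids translate conventionally to the same dipeptide, while their OTSs, and therefore their OTS compositions, differ.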

Table VII. Construction of an OTS-"fingerprint", illustrated on an artificial sequence: ASRx-P-298

AA | EXP # | [D+RC] # | [R+C] # | [D+RC+R+C] # | [D+RC]/EXP% | [R+C]/EXP% | [D+RC+R+C]/EXP% | EXP/EXP% | [D+RC]/EXP100% | [R+C]/EXP100%
M-MET | 19 | 20 | 7 | 27 | 105 | 36 | 142 | 100 | 73 | 25
H-HIS | 37 | 29 | 16 | 45 | 78 | 43 | 121 | 100 | 64 | 35
C-CYS | 37 | 25 | 20 | 45 | 67 | 54 | 121 | 100 | 55 | 44
V-VAL | 75 | 37 | 40 | 86 | 49 | 65 | 114 | 100 | 42 | 57
T-THR | 75 | 37 | 49 | 86 | 49 | 65 | 114 | 100 | 42 | 57
Q-GLN | 37 | 20 | 21 | 41 | 54 | 56 | 110 | 100 | 49 | 50
Y-TYR | 37 | 13 | 26 | 39 | 35 | 70 | 105 | 100 | 33 | 80
D-ASP | 37 | 21 | 17 | 38 | 56 | 45 | 102 | 100 | 54 | 44
E-GLU | 37 | 20 | 18 | 38 | 54 | 48 | 102 | 100 | 52 | 47
S-SER | 112 | 59 | 55 | 114 | 52 | 49 | 101 | 100 | 51 | 48
L-LEU | 112 | 51 | 59 | 110 | 45 | 52 | 98 | 100 | 45 | 53
G-GLY | 75 | 37 | 36 | 73 | 49 | 48 | 97 | 100 | 50 | 49
P-PRO | 75 | 36 | 37 | 73 | 48 | 49 | 97 | 100 | 49 | 50
W-TRP | 19 | 12 | 6 | 18 | 83 | 31 | 94 | 100 | 67 | 32
N-ASN | 37 | 18 | 18 | 34 | 48 | 43 | 91 | 100 | 52 | 47
R-ARG | 112 | 45 | 56 | 101 | 40 | 50 | 90 | 100 | 44 | 55
F-PHE | 37 | 17 | 15 | 32 | 45 | 40 | 86 | 100 | 52 | 46
K-LYS | 37 | 15 | 17 | 32 | 40 | 45 | 86 | 100 | 46 | 52
A-ALA | 75 | 35 | 28 | 63 | 48 | 37 | 84 | 100 | 54 | 44
I-ILE | 58 | 24 | 19 | 43 | 42 | 34 | 76 | 100 | 55 | 44
SUMMA | 1138 | 571 | 567 | 138 | 1065 | 960 | 2031 | 2000 | 1029 | 945

[D+RC]/EXP% = [D+RC]# x 100 / EXP#
[R+C]/EXP% = [R+C]# x 100 / EXP#
[D+RC+R+C]/EXP% = [D+RC+R+C]# x 100 / EXP#
EXP/EXP% = EXP# x 100 / EXP# = 100
[D+RC]/EXP100% = [D+RC]/EXP% x 100 / [D+RC+R+C]/EXP% = [D+RC] x 100 / [D+RC+R+C]
[R+C]/EXP100% = [R+C]/EXP% x 100 / [D+RC+R+C]/EXP% = [R+C] x 100 / [D+RC+R+C]

Table VII. Construction of an OTS-"fingerprint". An artificial nucleic acid sequence coding for ASRx-P-298 was translated into four OTSs representing the D (direct), RC (reverse-complement), R (reverse) and C (complement) strands. The amino acid (AA) composition of each was counted. The numbers (#) of individual amino acids in [D+RC], [R+C] and [D+RC+R+C] were compared to the number of amino acids which might be expected (EXP#) from the translation of a random nucleic acid sequence containing equal numbers of the 4 nucleotides. The ratios of the real and expected numbers of the different amino acids gave the [D+RC]/EXP%, [R+C]/EXP% and [D+RC+R+C]/EXP% values (EXP/EXP% = 100 in these cases). When the [D+RC+R+C]/EXP% was regarded to be 100%, the [D+RC]/EXP100% and [R+C]/EXP100% values were derived.
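The ratios defined in the legend of Table VII may be computed along the following lines. This is a hedged sketch, not the disclosed program: the function names and the example sequence are hypothetical, the standard genetic code is assumed, and the expected number EXP# is taken here to be the total number of OTS residues on the four strands multiplied by (number of codons for the amino acid)/64, which is one reading of "a random nucleic acid sequence containing equal numbers of the 4 nucleotides".

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def overlapping_translate(seq):
    """Read one codon at every nucleotide position ("Z" = not convertible)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z") for i in range(len(seq) - 2))


def strand_forms(seq):
    """Direct (D), reverse (R), complement (C) and reverse-complement (RC)."""
    comp = seq.translate(str.maketrans("ACGT", "TGCA"))
    return {"D": seq, "R": seq[::-1], "C": comp, "RC": comp[::-1]}


def fingerprint(nucleic_acid):
    """Return the Table VII ratios for each of the 20 amino acids."""
    seq = nucleic_acid.upper().replace("U", "T")
    ots = {name: overlapping_translate(s) for name, s in strand_forms(seq).items()}
    total = sum(len(o) for o in ots.values())
    if total == 0:
        raise ValueError("sequence must contain at least 3 nucleotides")
    d_rc = Counter(ots["D"] + ots["RC"])
    r_c = Counter(ots["R"] + ots["C"])
    table = {}
    for aa in sorted(set(AMINO) - {"*"}):
        expected = total * AMINO.count(aa) / 64  # assumed definition of EXP#
        both = d_rc[aa] + r_c[aa]
        table[aa] = {
            "EXP#": expected,
            "[D+RC]/EXP%": 100 * d_rc[aa] / expected,
            "[R+C]/EXP%": 100 * r_c[aa] / expected,
            "[D+RC+R+C]/EXP%": 100 * both / expected,
            "[D+RC]/EXP100%": 100 * d_rc[aa] / both if both else 0.0,
            "[R+C]/EXP100%": 100 * r_c[aa] / both if both else 0.0,
        }
    return table


example = fingerprint("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for aa, row in example.items():
    print(aa, {k: round(v, 1) for k, v in row.items()})
```

Because the two EXP100% columns are normalized to the combined count, they sum to 100 for every amino acid that occurs at all, which is what makes the [D+RC] and [R+C] proportions directly comparable.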

The overlapping coding rules create an intrinsic relationship between neighbouring amino acids and between the sequences coded by the leading (L) and antisense (A) strands of a DNA. There is a strong inverse correlation between the proportion of amino acids in the direct plus reverse-complement (D+RC%) and the reverse plus complement (R+C%) sequences. As a result, the amino acid composition of an OTS is not only a mathematical sum of its components, as in the case of a non-OTS, but also a specific and unique "fingerprint" of its individual sequence.

It is hypothesized that:
a. the OTSs represent a kind of "core" biological information in the protein and nucleic acid sequences;
b. the use of OTS-based methods and databases may improve the sensitivity of many bioinformatics tools;
c. the OTS may become a new opening to understanding the sequence-structure-function relationships;
d. the OTS may thus be a starting point for a sequence/composition-based, uniform DNA-RNA-protein "fingerprint" catalogue and dictionary.

Discussion

One potential problem with protein bioinformatics is that the genetic code is degenerate, in that some amino acids are specified by more than one codon; 20 amino acids are coded by 61 codons. The base in the third position, the so-called "wobble base", is often allowed to vary. Crick [24] first suggested that the interaction between the bases at the 3' end of the codon and the 5' end of the anticodon is not as spatially confined as the other two. Two significantly different nucleic acid sequences may therefore code for exactly the same protein sequence. Another potential problem in proteomics is that only a fraction of the total DNA of a species is expressed as proteins (often <5%). Many more protein sequences have been predicted than are confirmed with confidence. The non-overlapping coding of proteins allows the co-existence of 3 completely different reading frames, each frame carrying only one third of the coding information. Insertions, deletions, mutations and sequencing errors may lead to frame shifts, as may naturally occurring dynamic reprogramming of translation [25]. These factors seriously limit the reliability of protein bioinformatics.

However, although the idea of overlapping translation was disproved as a biological possibility [9, 26, 28], it may be useful as a bioinformatics abstraction. Overlapping translation of nucleic acids, and using OTS sequences as if they were real proteins, circumvents the limitations described above. OTS sequences contain a high proportion of start and stop signals: exactly how high will depend on the base composition. However, the exact location of the start signal has no critical influence on the translated sequence. As there is only one frame, frame shift is not a possibility. It is important to keep in mind that an OTS sequence will essentially only find specific matches in an OTS database; cross searching between real proteins and OTSs is not possible.
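The absence of frame shift in an OTS can be illustrated with a short sketch. It is an illustration under stated assumptions (standard genetic code, hypothetical function names and a hypothetical test sequence), not a part of the disclosed software. A single nucleotide is inserted into a sequence; in the OTS only the residues whose codons span the insertion change, and the downstream residues are merely displaced by one position.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def conventional_translate(seq, frame=0):
    """Non-overlapping translation in a single reading frame."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z")
                   for i in range(frame, len(seq) - 2, 3))


def overlapping_translate(seq):
    """Overlapping translation: a codon starts at every nucleotide."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z") for i in range(len(seq) - 2))


original = "ATGGCCATTGTAATGGGCCGC"
k = 6                                           # insertion point (0-based)
mutated = original[:k] + "A" + original[k:]     # one inserted nucleotide

print("conventional:", conventional_translate(original), conventional_translate(mutated))
print("overlapping: ", overlapping_translate(original), overlapping_translate(mutated))

ots_original = overlapping_translate(original)
ots_mutated = overlapping_translate(mutated)
# Residues read entirely upstream of the insertion are unchanged ...
print(ots_mutated[:k - 2] == ots_original[:k - 2])
# ... and residues read entirely downstream are identical, shifted by one.
print(ots_mutated[k + 1:] == ots_original[k:])
```

In the conventional translation, by contrast, every codon downstream of the insertion is read in a different frame.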

It is not surprising that the blastNP method has a somewhat different match profile from blastN and TblastX. It appears to be more sensitive but less specific than the classical methods of searching DNA databases with nucleotide sequences, because it does not take into account the redundant codon and frame sequence variations. However, it is difficult to judge the true sensitivity and specificity of the blastNP method, because it missed many sequences which were detectable by blastN and/or TblastX. It did discover some new similarities that were not detectable by either blastN or TblastX. The different blast methods are, of course, meant to complement, not to replace, each other.

The information encoded in OTS sequences is non-redundant. It can be speculated that converting an entire nucleic acid genome sequence into an OTS genome (a kind of "protein-like virtual genome"), and comparing the two different ways of representing biological information with each other, might uncover even more repeats and redundancies (species and individual variations) in the nucleic acid genome than are already recognized.

The blastNP search method combined with LALIGN visualization amplifies the similarity signal. However, it also amplifies the non-specific, low complexity (LC) sequence regions commonly present in biological sequences, which yield false-positive matches.

As much as 25% of the known protein sequence universe is comprised of LC regions that exhibit non-specific but high-scoring similarity to each other. Nucleic acid sequences may often be worse in this regard [27]. Usually sequences are masked for LC regions before being compared; this is in fact the default behaviour of NCBI's BLAST software. With OTSs it seems likely that many LC regions will no longer be masked by current LC detection software, which has been tuned to find low complexity regions within regular biological sequences. The development of a program like SEG for blastNP (to filter OTSs before searching with blastP) would usefully increase the specificity of this method. However, it is still unknown whether it is always correct to automatically filter low complexity sequences, as some may have biological importance.
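The kind of filter envisaged above may be sketched, in a very simplified form, as a sliding-window test on the Shannon entropy of the residue composition. This is explicitly not the SEG algorithm and has not been tuned for OTSs; the window length, the entropy threshold, the mask character and the function names are arbitrary assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="x"):
    """Mask every window whose composition entropy is below min_entropy.

    window and min_entropy are illustrative values, not tuned parameters.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMN" + "Q" * 12))
```

A filter of this type would be applied to the OTS query before the blastP search, in the same way that LC masking is applied to ordinary sequences.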

An interesting feature of the OTS is that, while it seems to eliminate the codon redundancy, it emphasizes the influence of the nucleic acid sequence on the amino acid composition of a protein. The amino acid usage frequency tells a lot about a protein [29].

Collagens are rich in proline and glycine, transmembrane domains are hydrophobic, and active sites contain many polar residues. However, there is no close correlation between the sequence and the amino acid usage in the known proteins. The overlapping translation of amino acids establishes a close correlation between the sequence and the amino acid composition of the OTS. A specific and sensitive correlation between sequence and composition makes possible a 20-letter "fingerprinting" of nucleic acid and protein sequences and the creation of a catalogue (or dictionary) of these polymeric molecules, as set out earlier in the description.

Conclusions

It is believed that the overlappingly translated sequences (OTS) have unique properties (listed in Table VIII, Comparison of some properties of Nucleic acids, Proteins and Overlappingly Translated protein Sequences (OTS)) and will be a significant contribution to studies of the information transfer between biologically active macromolecules.

Table VIII. Comparison of some properties of Nucleic acids, Proteins and Overlappingly Translated protein Sequences (OTS).

PROPERTY | DNA (RNA) | PROTEIN | OTS
Composition | 4+1 nucleotides | 20 amino acids | 20 amino acids
Information type | primary | secondary | secondary
Information transfer | transcription | frame translation | overlapping translation
Information density | 8 bits/codon | 2.5 bits/amino acid | 8 bits/amino acid
Reading | continuous | discontinuous | continuous-overlapping
Redundancy | yes: 2+1 strands | yes, 2x3 frames | no ?, 2 frames
Reading frame | not obvious | start-stop sites | not important
Blast | blastN, TblastX | blastP | blastNP
Blast sensitivity at 20% RS | low, 20% difference | high, 60-70% difference | best for short seq., ~70% diff.
Blast specificity at 20% RS | high, 80% identity | low, 20-30% identity | best for short seq., ~30% id.
Minimum length of query | 15 n.a./hum. gen. | 3 x 8 = 24 a.a./hum. gen. | 8 a.a./hum. gen.
Biological importance | yes, real sequence | yes, real sequence | no, virtual sequence
SNP, mutations | exists | is a problem | not a problem
Frame shift | crossing-over | is a problem | does not exist

BlastNP combines the advantages of blastN (all genetic information is represented in nucleic acids) and blastP (sensitivity for short sequences, high information density of proteins). Translating all nucleic acid sequences into OTSs would be very simple, and would open the way for many protein bioinformatics tools to be applied to nucleic acids. Using blastNP and LALIGN in combination with each other is a very powerful similarity signal amplification and visualization method. However, the user has to keep in mind that this method amplifies not only the signal but also the noise. Results need to be monitored for possible false matches caused by the presence of low complexity sequences.

The OTS "fingerprint" reflects the sequence (the order of residues) of its coding DNA and probably its function too. It is the key to a new sequence filter and function (gene) finder.

Accordingly, overlapping translation of nucleic acids (instead of conventional, non-overlapping translation) represents a useful improvement in the performance of the existing bioinformatics methods described earlier. Converting nucleic acid sequences into OTS (i.e. using overlapping translation according to the present invention instead of conventional, non-overlapping translation) gives the unexpected, inventive result of this invention. The existing database searching and visualization methods (such as blastN, TblastX and WU-blast) were not and are not used on OTS. This is a novelty in bioinformatics: overlapping translation of nucleic acids for bioinformatics applications has never been patented or published before. It has been demonstrated (see the experimental part above) that the blastNP method (blastP applied to OTS databases with an OTS query) provides new and unexpected results, such as the significant similarity between the prion protein and the insulin-C peptide and the similarity of both to transcription factors. This type of new information is useful because it leads to new working hypotheses and knowledge. Classifying insulin-C as a transcription factor might change our view of the biological role of the insulin-C peptide and prompt reconsideration of its possible therapeutic application in the treatment of the late complications of diabetes mellitus. Thus the idea of using OTS instead of conventionally translated proteins is not obvious to an average person in the field of bioinformatics with knowledge of all prior art (previous knowledge and inventions). [30, 31, w17]

Software developments related to OTS

A. A computer program (software) according to the seventh aspect of the present invention

The computer program according to the seventh aspect of the present invention, described below, is an example of software which may be used when performing the method according to the first aspect of the present invention.

The computer program is software for generating conventionally translated proteins and OTSs from nucleic acids, and for counting the number of nucleic acids and amino acids in each sequence for statistical purposes. The data flow and modules are illustrated in Figure 25.

The sequence conversion of a single nucleic acid or of an entire nucleic acid sequence database is performed automatically without human intervention. The results of the conversions are files (databases) containing the 4 forms of the original nucleic acid (N4), the 12 forms of their conventional (non-overlapping) translations (X12) and the four possible OTSs (P4). Each file has its statistical data stored in a separate file (SN4, SX12, SP4). The prototype version of the computer program was written in Java 1.3.1. Its performance in a PC environment (Pentium III, 900 MHz, Microsoft Windows NT 2000 Professional) was about 1.0 MB of nucleic acid data converted per minute.
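The three groups of conversion results (N4, X12 and P4) can be sketched as follows. The prototype itself was written in Java; the Python sketch below is only an illustration with hypothetical function names, and it assumes the standard genetic code as the conversion table, "*" for stop codons and "Z" for non-convertible triplets.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def n4_forms(nucleic_acid):
    """N4: direct (D), reverse (R), complement (C), reverse-complement (RC)."""
    seq = nucleic_acid.upper().replace("U", "T")
    comp = seq.translate(str.maketrans("ACGT", "TGCA"))
    return {"D": seq, "R": seq[::-1], "C": comp, "RC": comp[::-1]}


def translate(seq, start, step):
    """Translate codons beginning at start, advancing step nucleotides."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "Z")
                   for i in range(start, len(seq) - 2, step))


def x12_translations(nucleic_acid):
    """X12: three non-overlapping reading frames for each of the 4 forms."""
    return {f"{name}{frame + 1}": translate(seq, frame, 3)
            for name, seq in n4_forms(nucleic_acid).items()
            for frame in range(3)}


def p4_translations(nucleic_acid):
    """P4: one overlappingly translated sequence (OTS) for each of the 4 forms."""
    return {name: translate(seq, 0, 1)
            for name, seq in n4_forms(nucleic_acid).items()}


na = "ATGGCCATTGTA"
x12, p4 = x12_translations(na), p4_translations(na)
print(sorted(x12))
print(p4)
# The OTS of a strand is the interleaving of its three conventional frames.
print(all(p4["D"][i] == x12[f"D{i % 3 + 1}"][i // 3] for i in range(len(p4["D"]))))
```

The last line shows the relationship between the two kinds of translation: residue i of the OTS is residue i // 3 of reading frame (i % 3) + 1.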

The different functions of the computer program are illustrated in figures: a, Conversion (Figure 26); b, View results (Figure 27) ; c, View Log (Figure 28); d, Use cases implementation (Figure 29).

B. Specification

General

- The text files are generated in UNIX-like text format. (Only CR at the end of lines.) One input sequence contains fewer than 350.000 letters.
- Only one conversion table will be implemented.
- The converted files will be stored in the same directory as the input file. The extension will be different (see the specification).
- The sequences will be formatted in the converted files and on the screen as well.

Raw input file format

File name: NAME.TXT
NAME contains only capital letters and digits. The extension must be TXT.

rawInputDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR rawInputDescription...

rawInputDescription:
EMBL: TEXT | Description (may contain spaces, but not the "|" and ">" characters)
>ACCESSION_NUMBER

Valid ACCESSION_NUMBER formats:
- 1 letter and 5 digits
- 2 letters and 6 digits
(Any other format is accepted but a warning is generated.)

NUCLEIC ACID SEQUENCE:
A|a|T|t|G|g|C|c|U|u|N|n|EOL|space
(Other letters and numbers are allowed, but are replaced by N letters. NA triplets containing N letters will be converted to Z letters.)

Example (SEQ ID No. 1):
EMBL: AF007567 | Homo sapiens insulin receptor substrate 4 mRNA, complete cds.
>AF007567
ggtcagggtagttccccaaccctccctttcgtgaattccccctcgtcctcgctcacctta
aaaccatcgtgcatcaccatggcgagttgctccttcactcgcgaccaagcgacaagaaga
ctaagaggtgcagcagcggcggcagcggcagctctagcagcagtggtgaccaccccgctt
ctttcctcgggaaccccgaccgcactcattgggaccgggtcgtcttgtccgggagccatg

Input file format

File name: NAME.SEQ
NAME contains only capital letters and digits. The extension must be SEQ.

inputDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR inputDescription...

inputDescription:
> ACCESSION_NUMBER Description

Valid ACCESSION_NUMBER formats:
- 1 letter and 5 digits
- 2 letters and 6 digits
(Any other format is accepted but a warning is generated.)

NUCLEIC ACID SEQUENCE:
A|a|T|t|G|g|C|c|U|u|N|n|EOL|space
(Other letters are allowed, but ignored during the conversion.)

N4 result format

Normal, direct
File name: NAME.ND
NAME contains only capital letters and digits. The extension must be ND.
ndDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR ndDescription...
ndDescription:
>ACCESSION_NUMBER|N|D|INT_LENGTH|Description

Normal, reverse
File name: NAME.NR
NAME contains only capital letters and digits. The extension must be NR.
nrDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR nrDescription...
nrDescription:
>ACCESSION_NUMBER|N|R|INT_LENGTH|Description

Complement
File name: NAME.NC
NAME contains only capital letters and digits. The extension must be NC.
ncDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR ncDescription...
ncDescription:
>ACCESSION_NUMBER|N|C|INT_LENGTH|Description

Reverse, complement
File name: NAME.NRC
NAME contains only capital letters and digits. The extension must be NRC.
nrcDescription EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR nrcDescription...
nrcDescription:
>ACCESSION_NUMBER|N|RC|INT_LENGTH|Description

N4 combined result
File name: NAME.N4
NAME contains only capital letters and digits. The extension must be N4.
n4Description EOL
NUCLEIC ACID SEQUENCE EOL
EOF OR n4Description...
n4Description:
ndDescription OR nrDescription OR ncDescription OR nrcDescription

X12 result format

File name: NAME.X12
NAME contains only capital letters and digits. The extension must be X12.
x12Description EOL
PROTEIN SEQUENCE EOL
EOF OR x12Description...

PROTEIN SEQUENCE:
F|f|L|l|I|i|M|m|V|v|S|s|P|p|T|t|A|a|Y|y|X|x|H|h|Q|q|N|n|K|k|D|d|E|e|C|c|W|w|R|r|G|g|EOL|space
(No other letters are generated, see the conversion table. N is used when the converted triplet is not defined in the conversion table, e.g. when it contains a non-convertible character.)

x12Description:
>ACCESSION_NUMBER|PX|D1|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|D2|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|D3|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|R1|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|R2|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|R3|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|C1|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|C2|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|C3|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|RC1|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|RC2|INT_LENGTH|Description
>ACCESSION_NUMBER|PX|RC3|INT_LENGTH|Description

Log file format

Method of error handling
The conversion program generates a log file about the result of the conversion process.

File format
------CONVERSION BEGIN------
> (header line with the accession number)
(result lines: ERROR, WARNING, or OK)

At the end of the conversion the system inserts a closing message into the file, e.g.:

------CONVERSION BEGIN------
> Y00856
OK
------CONVERSION BEGIN------
> M009655656 | Human mRNA for insulin.....
WARNING-Invalid accession number format.
OK
------CONVERSION BEGIN------
> M00965
ERROR-Not enough disk space.

------CONVERSION FINISHED------

Conversion summary file
File name: NAME.CON
NAME contains only capital letters and digits. The extension must be CON.
conDescription EOL EOF OR conDescription EOL
conDescription: >ACCESSION_NUMBER#NAs#N4s#X12s#Description

Statistical summary file
File name: NAME.SN4, NAME.SP4, NAME.X12
NAME contains only capital letters and digits. The extension depends on the statistical module used.

The numbers written by the statistical modules are separated by tab characters. MS Excel must be able to read the files and display them in the format specified in the attached xls file.
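As a minimal sketch of the formats specified above, the following Python fragment builds the four nucleic acid forms (D, R, C, RC), the twelve X12 frame labels and the corresponding description lines. It is an illustration under stated assumptions, not the conversion program itself: all function names are hypothetical, the field separator is taken to be "#" as rendered in this text (it is a parameter, since the original character may differ), INT_LENGTH is assumed to be the sequence length, "reverse" is assumed to mean the reversed character string, and the translation step is omitted because the conversion table is not part of this section.

```python
import re

# Assumption: plain DNA alphabet; any other character passes through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
FORMS = ("D", "R", "C", "RC")  # form codes used in the description lines


def na_forms(seq):
    """Return the normal (D), reverse (R), complement (C) and
    reverse-complement (RC) forms of a nucleic acid sequence."""
    comp = seq.translate(COMPLEMENT)
    return {"D": seq, "R": seq[::-1], "C": comp, "RC": comp[::-1]}


def reading_frames(seq):
    """Split a sequence into complete triplets for the three reading
    frames starting on nucleic acid 1, 2 and 3."""
    return {
        frame: [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]
        for frame in (1, 2, 3)
    }


def x12_labels():
    """Return the twelve frame labels D1..D3, R1..R3, C1..C3, RC1..RC3."""
    return [form + str(frame) for form in FORMS for frame in (1, 2, 3)]


def description_line(accession, kind, form, length, text, sep="#"):
    """Build a description line such as >ACCESSION#PX#D1#LENGTH#Description.
    The separator is an assumption taken from the rendering in this text."""
    return ">" + sep.join([accession, kind, form, str(length), text])


def is_valid_file_name(file_name, extension):
    """Check that NAME contains only capital letters and digits and that
    the file name carries the required extension."""
    pattern = r"[A-Z0-9]+\." + re.escape(extension)
    return re.fullmatch(pattern, file_name) is not None
```

For each of the four forms, `reading_frames` yields the triplets that a conversion table would then map to the letters of the PROTEIN SEQUENCE alphabet, with N for any triplet the table does not define.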

A computer program, according to the ninth aspect of the present invention
The computer program according to the ninth aspect of the present invention is software that performs the statistical calculations necessary for constructing the OTS-fingerprint. Its input files are the SN, SX and SP output files of the computer program according to the seventh aspect of the present invention. The program can display the statistical results in diagrams and construct OTS-fingerprint files.

A sequence-formatting tool
Another computer program may be used as a sequence-formatting tool. It preferably has "find", "select" and "replace" functions; however, these functions are specially designed for use on nucleic acid sequences, protein sequences and OTS. The user can perform the functions in pre-designed periods, i.e. the start site, the stop site and the periodicity of a function can be defined. All functions are designed and displayed on a "test sequence" before being performed on a large scale, for example on every sequence in a database.
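The periodic "find" and "replace" functions described above might be sketched as follows. This is only an illustration under assumptions: the function names, the zero-based positions and the rule that a match must lie entirely before the stop site are hypothetical choices, not taken from the tool itself.

```python
def periodic_positions(seq, pattern, start=0, stop=None, period=1):
    """Find occurrences of pattern that begin at start, start + period,
    start + 2 * period, ... and end at or before stop (zero-based)."""
    if not pattern or period < 1:
        raise ValueError("pattern must be non-empty and period must be >= 1")
    end = len(seq) if stop is None else min(stop, len(seq))
    return [
        i
        for i in range(start, end - len(pattern) + 1, period)
        if seq.startswith(pattern, i)
    ]


def periodic_replace(seq, pattern, replacement, start=0, stop=None, period=1):
    """Replace the occurrences reported by periodic_positions, skipping any
    match that overlaps a match already replaced."""
    pieces = []
    cursor = 0
    for i in periodic_positions(seq, pattern, start, stop, period):
        if i < cursor:
            continue  # overlaps the previous replacement
        pieces.append(seq[cursor:i])
        pieces.append(replacement)
        cursor = i + len(pattern)
    pieces.append(seq[cursor:])
    return "".join(pieces)


def apply_to_database(sequences, pattern, replacement, **window):
    """Apply the same periodic replacement to every sequence in a list,
    after the settings have been checked on a test sequence."""
    return [periodic_replace(s, pattern, replacement, **window) for s in sequences]
```

With `period=3` and `start=0` only matches aligned to the first reading frame are touched, which is one way to preview a codon-level edit on a test sequence before calling `apply_to_database` on a whole collection.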

References
1. MJ Bishop: Genetics Databases. Academic Press 1999.
2. SI Letovsky: Bioinformatics. Databases and systems. Kluwer Academic Publishers 1999.
3. Pankaj Agarwal & David J. States: Comparative accuracy of methods for protein sequence similarity search. Bioinformatics 1998, 14: 40-47.
4. Alan Bleasby: Program fuzztran. EMBOSS 2000. http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/fuzztran.html
5. Warren Gish: WUBLAST2: Wash-U. multi-processor BLAST, with gaps. http://bioweb.pasteur.fr/seqanal/interfaces/wublast2.html
6. Douglas W. Bigwood, John T. Hart, David K. Gonda, Joseph A. Cerro & Luan Cong: Finding homologs of known genes. International patent application 2001. WO 01/41038 A2.
7. Sergey A. Selifonov et al.: Method for making character strings, polynucleotides and polypeptides. European patent application 2001. EP 1 108 781 A2.
8. Ian Korf & Warren Gish: MPBlast: improved BLAST performance with multiplex queries. Bioinformatics 2000, 16: 1052-1053.

9. Brian Hayes: "Computing Science: The Invention of the Genetic Code." American Scientist, Vol. 86, No. 1, January-February 1998, pages 9-14. http://www.amsci.org/amsci/issues/Comsci98/compsci9801.html
10. Baxevanis, A. D. & Ouellette, B. F. F. Bioinformatics. A practical guide to the analysis of genes and proteins. Wiley-Interscience (1998).
11. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucl. Acids Res. 25, 31-36 (1997).
12. Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J. & Ouellette, B. F. F. GenBank. Nucl. Acids Res. 26, 1-7 (1998).
13. Perier, R. C., Praz, V., Junier, T., Bonnard, C. & Bucher, P. The Eucaryotic Promoter Database (EPD). Nucleic Acids Res. 28, 302-303 (2000).
14. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.
15. Pearson, W. R. Rapid and Sensitive Sequence Comparison with FASTP and FASTA. Methods in Enzymology 183, 63-98 (1990).
16. Xiaoqiu Huang & Webb Miller. A Time-Efficient, Linear-Space Local Similarity Algorithm. Advances in Applied Mathematics 12, 337-357 (1991).

17. JM Hancock, JS Armstrong: SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 1994, 10: 67-70.
18. JC Wootton, S Federhen: Analysis of compositionally biased regions in sequence databases. Methods in Enzymology 1996, 266: 554-571.
19. Sidman KE, George DG, Barker WC, Hunt LT (1988). The protein identification resource (PIR). Nucleic Acids Res 16: 1869-71.
20. Cathy H. Wu, Hongzhan Huang, Leslie Arminski, Jorge Castro-Alvear, Yongxing Chen, Zhang-Zhi Hu, Robert S. Ledley, Kali C. Lewis, Hans-Werner Mewes, Bruce C. Orcutt, Baris E. Suzek, Akira Tsugita, C. R. Vinayaka, Lai-Su L. Yeh, Jian Zhang, and Winona C. Barker (2002). The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Research 30, 35-37.
21. RC Moore, IY Lee, GL Silverman, PM Harrison, R Strome, C Heinrich, A Karunaratne, SH Pasternak, MA Chishti, Y Liang, et al.: Ataxia in prion protein (PrP)-deficient mice is associated with upregulation of the novel PrP-like protein doppel. J Mol Biol. 1999, 292 (4): 797-817.
22. Stanley B Prusiner: Prions. Proc. Natl. Acad. Sci. USA 1998, 95: 13363-13383.
23. John Wahren, Karin Ekberg, et al.: Role of C-peptide in human physiology. Am. J. Physiol. Endocrinol. Metab. 2000, 278: 759-768.

24. FHC Crick: Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 1966, 19: 548-555.
25. RF Gesteland, JF Atkins: Recoding: dynamic reprogramming of translation. Annu Rev Biochem. 1996, 65: 741-68.
26. Sidney Brenner: On the impossibility of all overlapping triplet codes in information transfer from nucleic acid to protein. Proc. Nat. Acad. Sci. USA 1957, 43: 687-694.
27. SF Altschul, MS Boguski, W Gish, and JC Wootton: Issues in searching molecular sequence databases. Nat Genet 1994, 6: 119-129.
28. Crick, F. H. C. Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19: 548-555 (1966).
29. Ponting, C. P. Issues in predicting protein function from sequence. Briefings in Bioinformatics 2: 19-29 (2001).
30. Steven J. Hultquist, Robert Harrison, Yongzhi Yang: Patenting bioinformatics inventions: Emerging trends in Europe. Nature Biotechnology 2002, 20: 517-518.
31. Steven J. Hultquist, Robert Harrison, and Yongzhi Yang: Patenting bioinformatics inventions: Emerging trends in the United States. Nature Biotechnology 2002, 20: 743-744.

WEB References
w1 GenBank: http://www.ncbi.nlm.nih.gov/
w2 The genetic codes: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=t&num SG 1
w3 Codon usage database: http://www.kazusa.or.jp/codon/
w4 BCM Search Launcher: http://dot.imgen.bcm.tmc.edu:9331
w5 HGMP R-C, UK: http://www.hgmp.mrc.ac.uk/
w6 Biology Workbench: http://workbench.sdsc.edu/CGI/BW.cgi#
w7 SWISS-PROT: http://www.expasy.ch/
w8 TrEMBL: http://www.expasy.ch/
w9 EPD: http://www.epd.isb-sib.ch/

w10 BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
w11 LALIGN: http://biowb.sdsc.edu/CGI/BW.cgi#
w12 SIM: http://www.expasy.ch/tools/sim-prot.html
w13 t-test: http://www.hanover.edu/biology/statistics.htmp
w14 PIR: http://pir.georgetown.edu/
w15 dnapatent: http://www.dnapatent.com/law/watcan.html
w16 SRS: http://srs.hgmp.mrc.ac.uk/srs6/
w17 back-translation: http://www.entelechon.com/englbacktranslation.html

It should be understood that modifications can be made to the embodiments disclosed herein. Therefore the above description should not be construed as limiting, but merely as an exemplification of preferred embodiments. Those skilled in the art will envision other modifications within the scope of the claims appended hereto.

Content: Text
New conversion method-Contents
Background to the Invention
A short review of the related literature
  - database searching
  - construction of the query
  - some historical references
Summary of the invention
Detailed description of the Invention
Figures & Tables-Description
Examples
I. Application I: Sequence searching and visualization
  A. Construction of local real and artificial OTS databases
  B. The BlastNP
    1. Abstract of blastNP
    2. Results with blastNP
      a. Sequence searching in conventional and OTS databases
      b. Estimation of the specificity and sensitivity of the blastNP method
        - Searching databases with the prion protein
        - Searching the PIR superfamilies
        - Searching with artificial sequences
  C. Signal amplification by converting nucleic acids to OTS
  D. Identifying the prion protein and the insulin-C peptide as novel transcription factors and its consequences
II. Application II: Sequence "fingerprinting", filter and function-finder
  A. General properties of the OTS
  B. Construction of an OTS-fingerprint
Discussion
Conclusion
Software developments related to OTS
  A. According to the 7th aspect
  B. Specification
  C. According to the 9th aspect
  D. Sequence formatting tool
References
WEB References
Contents
Sequence listing
Claims
Abstract

Content: Figures & Tables
Table I: Comparison of the overlapping and non-overlapping translation of nucleic acids
Table II: Construction of the conventional nucleic acid (N4), protein (PX12).

Figure 1: LALIGN comparison of OTS and non-OTS
Figure 2: Comparison of blast scores
Figure 3: Estimation of the specificity of 3 blast methods.
Table III: Summary of LALIGN comparison statistics
Table IV: Comparison of the specificity and sensitivity of three different blasts.
Figure 4: Comparison of the specificity and sensitivity of three blast methods.
Figure 5: Sequence Composition.
Figure 6: Estimation of the specificity and sensitivity of three blast methods
Figure 7: Correlation between the length of the query and the E-values.
Figure 8: LALIGN comparison of HPRPC and novel PrP-like Protein Doppel mRNA (PRND)
Figure 9: LALIGN comparison of the insulin related substance 1 (IRS1) and 2 (IRS2)
Figure 10: LALIGN comparison of human insulin (INS) and insulin like growth factor II (HSIGFIIC)
Figure 11: LALIGN comparison of Insulin Gene (HIG) to Insulin pro-peptide (INSULIN) and PRP.
Figure 12: LALIGN comparison of insulin pro-peptide (INS), the INS-domains and the PRP-[1-300].
Table V: Results of SIM with INS-PD-245 and INS-PRP-CONSENSUS.
Table VI: Comparison of the similarity parameters measured by blastNP, TblastX and blastN.
Figure 13: Comparison of the similarity parameters measured by blastNP, TblastX and blastN
Figure 14: Visualization of the similarities between sequences with LALIGN.
Figure 15: M. S. Alignment of transcription factors, insulin-C peptide and Prion Protein in OTS form.
Figure 16: M. S. Alignment of transcription factors, insulin-C peptide and Prion Protein in OTS form.

Figure 17: The amino acid usage frequency-expected and found in protein frames
Figure 18: Amino acid usage frequency in INS OTS frames.
Figure 19: Amino acid usage frequency in TIG OTS frames
Figure 20: Amino acid usage frequency in ASR OTS frames.
Figure 21: Construction of an OTS-"fingerprint"-I.
Figure 22: Construction of an OTS-"fingerprint"-II.-amino acid frequency-sorting I.
Figure 23: Construction of an OTS-"fingerprint"-III.-amino acid frequency-sorting II.
Table VII: Construction of an OTS-"fingerprint"
Figure 24: Block diagram of the method for obtaining an OTS protein sequence.
Table VIII: Comparison of Nucleic acids, Proteins and Overlappingly Translated protein Sequences (OTS)
Figure 25: Flow diagram, the modular structure of SeqConv 1.0
Figure 26: SeqConv 1.0 conversion scheme.
Figure 27: SeqConv 1.0 view results scheme.
Figure 28: SeqConv 1.0 view log scheme.
Figure 29: Use cases implemented by the sequence converting tool.
Figure 30: Block of the method for obtaining OTSs.