

Title:
METHOD AND APPARATUS FOR DETERMINING A MUTATION
Document Type and Number:
WIPO Patent Application WO/1997/000972
Kind Code:
A1
Abstract:
A method for determining a mutation in a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence, comprises the steps of: (a) aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence; (b) identifying a region of the alignment with a frequent occurrence of matching ambiguities; (c) comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code; (d) summing the number of matches obtained in step (c); (e) repeating steps (c) and (d) for a number of different predetermined distances to the right and to the left of the aligned master base code; (f) assigning the distance resulting in the highest number of matches as the length of the mutation; and (g) determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.

Inventors:
BJOERKESTEN LENNART (SE)
Application Number:
PCT/SE1996/000726
Publication Date:
January 09, 1997
Filing Date:
May 31, 1996
Assignee:
PHARMACIA BIOTECH AB (SE)
BJOERKESTEN LENNART (SE)
International Classes:
C12N15/09; C12Q1/68; C12Q1/6827; C12Q1/6869; G16B30/10; H03M13/00; (IPC1-7): C12Q1/68; H03M13/00
Other References:
J. MOL. BIOL., Volume 221, 1991, B. EDWIN BLAISDELL et al., "An Efficient Algorithm for Identifying Matches with Errors in Multiple Long Molecular Sequences", page 1367 - page 1378.
BULLETIN OF MATHEMATICAL BIOLOGY, Volume 52, No. 3, 1990, OSAMU GOTOH, "Optimal Sequence Alignment Allowing for Long Gaps", page 359 - page 373.
Claims:
CLAIMS
1. A method for determining a mutation in a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence, characterized by the steps of a) aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence, b) identifying a region of the alignment with a frequent occurrence of matching ambiguities, c) comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, d) summing the number of matches obtained in step c), e) repeating steps c) and d) for a number of different predetermined distances to the right and to the left of the aligned master base code, f) assigning the distance resulting in the highest number of matches as the length of the mutation, and g) determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.
2. Method according to claim 1, characterized by statistically estimating from said highest number of matches and the length of said region, an upper limit for the probability of obtaining a false mutation assignment.
3. A method for determining a mutation in a nucleic acid sample, comprising subjecting said sample to a sequencing procedure to obtain a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence, characterized in that the sample base code sequence obtained is processed in accordance with steps a) to g) of the method according to claim 1.
4. An apparatus for determining a mutation in a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence, characterized in that it comprises a) aligning means for aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence, b) identifying means for identifying a region of the alignment with a frequent occurrence of matching ambiguities, c) comparing means for comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, d) summing means for summing the number of matches, e) control means for causing said comparing means to repeat the comparison for a number of different predetermined distances to the right and to the left of the aligned master base code, and for causing said summing means to repeat the summation accordingly, f) assigning means for assigning the distance resulting in the highest number of matches as the length of the mutation, and g) determining means for determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.
Description:
METHOD AND APPARATUS FOR DETERMINING A MUTATION

Technical Field

The invention relates to a method and an apparatus for determining a mutation in a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence.

Background of the invention

Sequence data obtained by processing and sequencing DNA or mRNA samples from e.g. fine needle aspirates of tumour tissue can be used for the detection of inherited or induced mutations in genes related to the cause of the tumour. The sample sequence obtained will often consist of a mixture of two superimposed sequence components, the wild type component and a mutated component. This could be due to a mixture of cell populations in the sample or due to an inherited mutation in one of the two copies of the gene. In cases where the mutated sequence component is the predominant component, insertion and deletion mutations as well as point mutations can be readily detected by aligning the sequence data obtained from the sample with the expected wild type sequence using standard alignment algorithms. See for example S. Needleman and C. Wunsch, J. Mol. Biol. 48, 444 (1970), and W. R. Pearson and W. Miller, Methods in Enzymology, 210, 575 (1992). Often, however, the mutated sequence material is mixed with an equally large amount of non-mutated material. In some cases the non-mutated material will even be predominant. In these cases ordinary alignment algorithms fail to resolve the mutation.

Brief description of the invention

The object of the invention is to bring about an improved method and an improved apparatus for determining mutations.

This is attained by the method according to the invention, which comprises the steps of: a) aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence, b) identifying a region of the alignment with a frequent occurrence of matching ambiguities, c) comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, d) summing the number of matches of these comparisons, e) repeating steps c) and d) for a number of different predetermined distances to the right and to the left of the aligned master base code, f) assigning the distance resulting in the highest number of matches as the length of the mutation, and g) determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.

This is also attained by the apparatus according to the invention, which comprises a) aligning means for aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence, b) identifying means for identifying a region of the alignment with a frequent occurrence of matching ambiguities, c) comparing means for comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, d) summing means for summing the number of matches, e) control means for causing said comparing means to repeat the comparison for a number of different predetermined distances to the right and to the left of the aligned master base code, and for causing said summing means to repeat the summation accordingly, f) assigning means for assigning the distance resulting in the highest number of matches as the length of the mutation, and g) determining means for determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.

Brief description of the drawing

The invention will be described more in detail below with reference to the appended drawing, on which Fig. 1 shows a sample sequence component with deletion mutation, superimposed on a wild type component, and Fig. 2A and 2B illustrate shift score calculations in sequence alignment representation.

Preferred embodiments

The method according to the invention assumes that sequence data has been obtained from an automatic sequencer capable of distinguishing between pure DNA base codes (A, C, G and T) and mixed base codes produced by the superposition of data from two different DNA fragments. The combinations of two different base codes at certain positions are denoted by ambiguity codes in accordance with the Nomenclature Committee of the International Union of Biochemistry (NC-IUB): Nomenclature for incompletely specified bases in nucleic acid sequences, Eur. J. Biochem. 150:1, 1985, as per Table 1 below.

Table 1. Ambiguity codes

Notation   Nucleotide combination
M          AC
R          AG
W          AT
S          CG
Y          CT
K          GT
V          ACG
H          ACT
D          AGT
B          CGT
N          ACGT

The ambiguity positions should in this context not be regarded as ambiguous but rather as a consequence of the mixed sample sequence.
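Table 1 can be written down as a small lookup, which also makes it easy to compute the code a sequencer would report when two bases coincide at one position. The following is a minimal sketch; the names `AMBIGUITY` and `superimpose` are illustrative and do not come from the application.

```python
# Ambiguity codes from Table 1, mapped to the bases each one denotes.
AMBIGUITY = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: set of bases -> single-letter code (pure bases map to themselves).
_CODE_FOR_SET = {frozenset(bases): code for code, bases in AMBIGUITY.items()}
for base in "ACGT":
    _CODE_FOR_SET[frozenset(base)] = base


def superimpose(a: str, b: str) -> str:
    """Return the base code observed when pure bases a and b fall on the same position."""
    return _CODE_FOR_SET[frozenset(a + b)]
```

For example, `superimpose("A", "G")` yields the code R, while two identical bases yield the pure base itself.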

An example of a superimposed sequence pattern is shown in Fig. 1. Fig. 1A shows the wild type sequence. The deletion action producing the mutated sequence is illustrated in Fig. 1B and the superimposed sequence pattern is derived in Fig. 1C. The two sequence components are aligned at the left end due to the action of a common sequencing primer and add up to the resulting sequence pattern.

The method according to the invention comprises the following steps of a) aligning said sample base code sequence with a master base code sequence identical to said known wild type second base code sequence, b) identifying a region of the alignment with a frequent occurrence of matching ambiguities, c) comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, d) summing the number of matches obtained in step c), e) repeating steps c) and d) for a number of different predetermined distances to the right and to the left of the aligned master base code, f) assigning the distance resulting in the highest number of matches as the length of the mutation, and g) determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.

These steps will be described more in detail below.

1. Aligning the sample sequence with the wild type sequence.

The alignment algorithm used is based on the algorithm proposed by S. Needleman and C. Wunsch, J. Mol. Biol. 48, 444 (1970), where all different possible alignments between the sample sequence and a master sequence (the wild type sequence) are given a total score calculated as the sum of local points for all the base positions involved. By this approach the alignment given the highest score is assumed to describe the behaviour of the sample sequence with respect to the master sequence in terms of point mutations, insertion mutations and deletion mutations. The following events are assigned a local score (point):

Event                                                 Score
Matching base                                         +1
Mismatching base                                      -1
Gap in the master sequence (independent of length)    -1
Gap in the sample sequence (independent of length)    -1

In the method according to the invention, the algorithm described above was extended by two more event types:

Event                     Score
Matching ambiguity         0
Mismatching ambiguity     -1

The matching ambiguity event occurs when there is an ambiguity code in the sample sequence including a base code component that matches the master base code at the corresponding position in the alignment.

The mismatching ambiguity event occurs when there is an ambiguity code in the sample sequence not including a base code component matching the master base code at the corresponding position in the alignment.

The reason for incorporating these two event types into the alignment algorithm is that sequence data consisting of two shifted and superimposed components will then still align with the master sequence in a well behaved manner.
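The per-position scoring described above can be sketched as a small function. This is only an illustration of the local score for one aligned column without gaps; the gap events, which cost -1 regardless of length, depend on neighbouring columns and are left to the alignment routine. The names `AMBIGUITY` and `local_score` are assumptions made for this sketch.

```python
# Ambiguity codes from Table 1, mapped to the bases each one denotes.
AMBIGUITY = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def local_score(sample: str, master: str) -> int:
    """Score one aligned, gap-free column using the event tables above."""
    if sample in AMBIGUITY:
        # Matching ambiguity scores 0, mismatching ambiguity scores -1.
        return 0 if master in AMBIGUITY[sample] else -1
    # Pure bases: +1 for a match, -1 for a mismatch.
    return 1 if sample == master else -1
```

Because a matching ambiguity is neutral rather than penalized, a long run of superimposed positions does not push the aligner toward opening a gap.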

The nomenclature that is used for describing the alignment between the sample sequence and the master sequence is shown in Table 2 below.

Table 2. Nomenclature used for the sequence alignment

Matching bases:

Sample sequence: ...ACAGGTAGA...
Master sequence: ...ACAGGTAGA...

Mismatching bases:

                       T
Sample sequence: ...ACA GTAGA...
Master sequence: ...ACA GTAGA...
                       G

Master sequence gap (explicit insertion):

                       TCA
Sample sequence: ...ACAG GTAGA...
Master sequence: ...ACAG GTAGA...

Sample sequence gap (explicit deletion):

Sample sequence: ...ACAG GTAGA...
Master sequence: ...ACAG GTAGA...
                       CCT

Matching ambiguity:

Sample sequence: ...ACAG TAGA...
Master sequence: ...ACAG TAGA...
                        G

Mismatching ambiguity:

                        Y
Sample sequence: ...ACAG TAGA...
Master sequence: ...ACAG TAGA...
                        G

It should be noted that a point mutation in a population with high enough representation relative to the wild type background population will be explicitly pointed out during this stage either as a mismatching base or as a single matching ambiguity. It should furthermore be noted that insertion mutations and deletion mutations in cases where the wild type contribution is weak will also be pointed out explicitly during this stage in terms of master sequence gaps and sample sequence gaps respectively.

A frequent occurrence of matching ambiguity positions in the alignment should however be an indication of possible insertion or deletion of one or more nucleotides in a superimposed sample sequence component.
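The text does not state how a "frequent occurrence" is to be recognized. One possible approach, shown here purely as an assumption, is to flag each matching ambiguity position and look for the first sliding window in which such positions reach a chosen density. The function names and the parameters `window` and `min_fraction` are hypothetical.

```python
# Ambiguity codes from Table 1, mapped to the bases each one denotes.
AMBIGUITY = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def matching_ambiguity_flags(sample: str, master: str) -> list:
    """True where the sample code is an ambiguity code containing the aligned master base."""
    return [s in AMBIGUITY and m in AMBIGUITY[s] for s, m in zip(sample, master)]


def first_dense_window(flags, window=10, min_fraction=0.5):
    """Start index of the first window whose share of flagged positions
    reaches min_fraction, or None if no window qualifies (hypothetical criterion)."""
    for start in range(len(flags) - window + 1):
        if sum(flags[start:start + window]) >= min_fraction * window:
            return start
    return None
```

The returned index could serve as a candidate starting position for the region evaluated in the next step; suitable values for the window length and density threshold would have to be chosen for the data at hand.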

2. Sequence shift evaluation

When there is an indication of a superimposed and partly shifted sequence component, i.e. when there is a frequent occurrence of matching ambiguity positions starting at a certain position in the alignment, one should continue with the sequence shift evaluation. The sequence shift evaluation is based on the sequence alignment produced during the previous step of the analysis:

Let the sample sequence bases be denoted by s_1, s_2, s_3, ... and the master sequence bases be denoted by m_1, m_2, m_3, ... according to their position in the alignment, the base s_1 being aligned to m_1, s_2 to m_2 and so on. The following master sequence base codes are used:

(1) A, C, G, T, -

A to T represent the nucleotides and "-" represents a gap in the master sequence at this position of the alignment.

The following sample sequence base codes are used:

(2) A, C, G, T, M, R, W, S, Y, K, V, H, D, B, N, -

A to T represent the nucleotides, M to N represent superimposed bases denoted using ambiguity codes and "-" represents a gap in the sample sequence at this position of the alignment.

A region of the alignment where there is a frequent occurrence of matching ambiguity positions is chosen. Let the starting position of the chosen region be denoted by f and the length of the region chosen be denoted by n. For each position in this region of the alignment, the sample base code is compared to the combination of the aligned master base and the master base at a certain offset Δ (sequence shift). Shifts where Δ > 0 are denoted right shifts and shifts where Δ < 0 are denoted left shifts. If the sample base code matches the combination of master codes, the shift score for that shift, k_Δ, is incremented. This procedure is repeated for all shift candidates Δ_j to be included in the analysis. The total shift score for a certain shift, k_{Δ_j}, is then obtained as

(3) k_{\Delta_j} = \sum_{i=0}^{n-1} S(s_{f+i}, m_{f+i}, m_{f+i+\Delta_j})

where the score function S is defined by:

(4) S(b_a, b_b, b_c) = \begin{cases} 1 & \text{if } b_a \text{ is an ambiguity code comprising components } b_b \text{ and } b_c \\ 1 & \text{if } b_a = b_b = b_c \\ 0 & \text{otherwise} \end{cases}

The shift associated with the highest score will be the most probable candidate for the shift assignment. This indicates that there is probably a sequence component superimposed on the sequence that was aligned to the master sequence, the superimposed sequence being shifted relative to the aligned sequence by the amount given by the highest scored shift, i.e. there is probably a mutation present, the length of which is given by the size of the shift. The sequence shift evaluation is illustrated in Fig. 2A, wherein the sample base code at position C, s_{f+i}, is compared to a combination of the master sequence base code at position B, m_{f+i}, and the left shifted base code, Δ = -6, at position A, m_{f+i+Δ}. The alignment is indicated by the "X" letters.
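The shift evaluation of equations (3) and (4) can be sketched as follows. This is one reading of the description, not a definitive implementation: the first case of the score function is taken to apply when the two master bases differ, gap symbols are assumed never to score, shifted positions outside the master sequence are skipped, indices are 0-based, and a shift of zero is expected to be left out of the candidate list. The names `S`, `shift_score` and `best_shift` are illustrative.

```python
# Ambiguity codes from Table 1, mapped to the bases each one denotes.
AMBIGUITY = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def S(b_a: str, b_b: str, b_c: str) -> int:
    """Score function of eq. (4) under the assumptions stated above."""
    if "-" in (b_a, b_b, b_c):
        return 0                      # assumption: gaps never count as a match
    if b_b == b_c:
        return int(b_a == b_b)        # second case: all three codes identical
    bases = AMBIGUITY.get(b_a, "")    # empty for pure bases
    return int(b_b in bases and b_c in bases)


def shift_score(sample: str, master: str, f: int, n: int, delta: int) -> int:
    """Eq. (3): matches over positions f .. f+n-1 of the alignment for one shift."""
    total = 0
    for i in range(f, f + n):
        j = i + delta
        if 0 <= j < len(master):      # skip shifted positions outside the master
            total += S(sample[i], master[i], master[j])
    return total


def best_shift(sample: str, master: str, f: int, n: int, candidates) -> int:
    """Return the candidate shift with the highest total score."""
    return max(candidates, key=lambda d: shift_score(sample, master, f, n, d))
```

With a sample built by superimposing a wild type sequence and a copy lacking two bases, calling `best_shift` over candidates such as -4 to +4 (excluding 0) would be expected to return +2 for the region following the deletion.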

3. Significance of most probable shift

The significance of the most probable shift obtained in step 2 for the chosen region will now be discussed in some detail.

The most probable shift is characterized by the length of the region, n, and the number of matches between the combined master base codes and the corresponding sample base codes in that region, k. Assuming randomly distributed base codes superimposed on a wild type sample sequence, one would expect the probability F(n, k) of obtaining k matching positions to be given by the binomial distribution:

(5) F(n, k) = \binom{n}{k} p^k (1 - p)^{n-k}

where p = 0.25 is the probability for each base code type.

The probability p_k of obtaining k or more matching positions in a random distribution would then be

(6) p_k = \sum_{i=k}^{n} F(n, i)

This means that, given the probability p_k and assuming randomly distributed base codes superimposed on a wild type sample sequence, one would expect every (1/p_k):th sample sequence to show k or more matching base positions within the region investigated. Equation (6) then gives an estimated upper limit for the probability of obtaining a false assignment of a shifted sequence component.
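The binomial tail described above is straightforward to evaluate numerically. The sketch below uses only the Python standard library; the function names are illustrative, and p defaults to the value 0.25 given in the text.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float = 0.25) -> float:
    """Probability of exactly k matches in n positions (binomial distribution)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def tail_probability(n: int, k: int, p: float = 0.25) -> float:
    """Probability of k or more matches in n positions, as in eq. (6)."""
    return sum(binom_pmf(n, i, p) for i in range(k, n + 1))
```

A small value of `tail_probability(n, k)` for the best shift suggests that the observed number of matches is unlikely to arise from randomly distributed superimposed base codes.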

Since there is a correlation between the different base positions in the wild type sequence (the base codes are not randomly distributed), one should also account for the risk of getting a false shift assignment when a short sequence is repeated several times in the region of the wild type sequence that is used. This would erroneously indicate a shifted component, the size of the shift being the length of the repeat cycle. This situation can be avoided by comparing the score between the sample sequence bases and the combined master bases, k_Δ, obtained according to the description in step 2, with the score obtained for the master bases alone, k'_Δ, which is defined in the following way:

(7) k'_{\Delta_j} = \sum_{i=0}^{n-1} S'(m_{f+i}, m_{f+i+\Delta_j})

where the score function S' is defined by:

(8) S'(b_a, b_b) = \begin{cases} 1 & \text{if } b_a = b_b \\ 0 & \text{otherwise} \end{cases}

This part of the evaluation is further illustrated in Fig. 2B, wherein the master sequence base code at position B, m_{f+i}, is compared to the left shifted base code, Δ = -6, at position A, m_{f+i+Δ}. The alignment is indicated by the "X" letters.

If a high score is obtained in the latter case, one should be aware of this when considering the significance of a shifted superimposed sequence component based on the former score.
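The repeat check can be sketched as a self-comparison of the master sequence at the candidate shift. This is an illustration under assumptions: indices are 0-based, the region is taken to lie within the master sequence, gap symbols are not counted, and shifted positions outside the sequence are skipped. The name `master_shift_score` is hypothetical.

```python
def master_shift_score(master: str, f: int, n: int, delta: int) -> int:
    """Count positions in the region where the master base equals the master
    base delta positions away (score k' for the master sequence alone)."""
    total = 0
    for i in range(f, f + n):
        j = i + delta
        if 0 <= j < len(master) and master[i] != "-" and master[i] == master[j]:
            total += 1
    return total
```

A master region such as ACACACAC scores highly against itself at a shift of two, which would warn that a high sample shift score at that same shift may reflect the repeat rather than a mutation.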

4. Insertion or deletion mutation?

The method described so far gives information on the length of an insertion or deletion and on the significance of this information. This information will be enough to decide if one has to deal with an in-frame or out-of-frame mutation but it has not so far been discussed how to decide which kind of mutation (insertion or deletion) is present. In order to illustrate the principles we need a simple example. Let us consider for simplicity the following three sequences represented by strings of plain letters:

The wild type sequence is represented by the following letter string:

(9) ABCDEFGHIJKL...

A sequence with a deletion mutation is represented by a similar letter string where the letters F and G were removed (deleted):

(10) ABCDEHIJKL...

In the same way a sequence with an insertion mutation is represented by a similar letter string but the letters XYZ were inserted after the letter E:

(11) ABCDEXYZFGHIJKL...

From the alignment point of view there are two different cases to be considered:

a. The wild type component did align to the master sequence

This case will often occur when the wild type component is the predominating component in the sample mixture. In this case there is no explicit gap in the alignment.

A deletion mutation will in this case be represented by a right shift, Δ_j > 0, the size of the shift being equal to the length of the mutation. This is illustrated using the representation from our simple example in the following alignment, where the two uppermost lines represent the superimposed sample sequences and the lowermost line represents the master sequence:

(12)

...ABCDEFGHIJKL...
...ABCDEHIJKLMN...

...ABCDEFGHIJKL...

An insertion mutation will be represented by a left shift, Δ_j < 0, the size of the shift being equal to the length of the mutation. This is illustrated in the following alignment example:

(13)

...ABCDEFGHIJKL...
...ABCDEXYZFGHI...

...ABCDEFGHIJKL...
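The shift directions for case a can be checked directly on the letter strings (9) to (11). The sketch below tests whether a mutated component, left-aligned with the master, equals the master read at a given offset from some starting position onward. The helper name `offset_matches` and the 0-based indices are assumptions of this illustration.

```python
wild = "ABCDEFGHIJKL"            # string (9), also used as the master sequence
deletion = "ABCDEHIJKL"          # string (10): F and G removed
insertion = "ABCDEXYZFGHIJKL"    # string (11): XYZ inserted after E


def offset_matches(component: str, master: str, start: int, delta: int) -> bool:
    """True if component[i] == master[i + delta] for every i >= start
    where the shifted position exists in the master."""
    pairs = [(component[i], master[i + delta])
             for i in range(start, len(component))
             if 0 <= i + delta < len(master)]
    return bool(pairs) and all(c == m for c, m in pairs)


# Deletion of two letters: the component matches the master two positions to the right.
print(offset_matches(deletion, wild, start=5, delta=2))
# Insertion of three letters: past the insert, it matches three positions to the left.
print(offset_matches(insertion, wild, start=8, delta=-3))
```

Both calls print True, whereas the opposite signs (for instance `delta=-2` for the deletion) do not match, in line with the right shift for a deletion and the left shift for an insertion described above.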

b. The mutated component did align to the master sequence

This will often be the case when the mutated component is the predominating component in the sample mixture.

A deletion mutation will in this case be represented by an explicit gap in the sample sequence and a left shift, Δ_j < 0, the size of the shift being equal to the length of the mutation. This is illustrated in the following example:

(14)

...ABCDE--FGHIJ...
...ABCDE--HIJKL...

...ABCDEFGHIJKL...

An insertion mutation will in this case be represented by an explicit gap in the wild type sequence and a right shift, Δ_j > 0, the size of the shift being equal to the length of the mutation. This is illustrated in the following example:

(15)

In cases where there is almost no wild type component in the sample sequence mix there will be no significant shift assignment information available but the insertion or deletion mutation will be fully characterized by the explicit gap.

If the complementary reversed strand is sequenced and the sequence data is translated into coding direction before evaluation, one should notice that the left and right shifts in the description above will be the opposite. This implies that a deletion mutation corresponds to a left shift and an insertion mutation corresponds to a right shift in cases when the wild type sequence component aligned to the master sequence. Analogously, a deletion mutation corresponds to a right shift and an insertion mutation corresponds to a left shift in cases when the mutated sequence component aligned to the master sequence.

The reason for this is that the sample sequence components are in this case aligned by the common sequencing primer at the right end of the sequence components when represented in the coding direction.
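The rules of this section amount to a small decision table, which can be sketched as a single function. The function name `classify_mutation` and its parameters are hypothetical; `aligned` states which component aligned to the master sequence, and `reverse_strand` is set when the complementary reversed strand was sequenced.

```python
def classify_mutation(delta: int, aligned: str, reverse_strand: bool = False) -> str:
    """Return 'insertion' or 'deletion' for a non-zero shift delta.

    aligned: 'wild' if the wild type component aligned to the master
             sequence (case a), 'mutant' if the mutated component did (case b).
    """
    if delta == 0:
        raise ValueError("a zero shift carries no insertion/deletion information")
    right = delta > 0
    if reverse_strand:
        right = not right             # left and right swap for the reversed strand
    if aligned == "wild":
        return "deletion" if right else "insertion"
    if aligned == "mutant":
        return "insertion" if right else "deletion"
    raise ValueError("aligned must be 'wild' or 'mutant'")
```

The length of the mutation is the absolute value of the shift in every case; only the interpretation of its sign depends on the aligned component and the sequenced strand.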

The apparatus according to the invention for determining a mutation in a sample base code sequence which is made up of a first base code sequence superimposed on a known wild type second base code sequence and having inserted or deleted nucleotides compared to the wild type second base code sequence, comprises aligning means (not shown) for aligning the sample base code sequence with a master base code sequence identical to the known wild type second base code sequence, identifying means (not shown) for identifying a region of the alignment with a frequent occurrence of matching ambiguities, comparing means (not shown) for comparing, for each individual base code position in said region of the alignment, whether the respective sample base code matches a combination of the aligned master base code and a master base code at a predetermined distance to the right or to the left of said aligned master base code, summing means (not shown) for summing the number of matches, control means (not shown) for causing said comparing means to repeat the comparison for a number of different predetermined distances to the right and to the left of the aligned master base code, and for causing said summing means to repeat the summation accordingly, assigning means (not shown) for assigning the distance resulting in the highest number of matches as the length of the mutation, and determining means (not shown) for determining on the basis of whether said distance that resulted in said highest number of matches is to the right or to the left of the aligned master base code, whether the mutation is an insertion or deletion.

This apparatus is preferably implemented in computer software.

In accordance with the invention, insertion and deletion mutations superimposed on a high wild type background can be detected and given a significance measure. Even a weak superimposed signal that only sparsely gives rise to relevant ambiguity codes could be used to detect insertion and deletion mutations at a reasonable level of significance.