
Title:
CHROMATOGRAPHIC ANALYSIS OF NUCLEIC ACIDS
Document Type and Number:
WIPO Patent Application WO/2007/019619
Kind Code:
A1
Abstract:
A method is disclosed for predicting the retention time of a nucleic acid during chromatographic analysis. Typically, this will be the prediction of the retention time of DNA molecules during DHPLC analysis. The predicted retention time is calculated based on the nucleotide distribution of the nucleic acid, and particular emphasis is placed on the terminal nucleotides. Furthermore, different retention times are predicted depending on whether dideoxynucleotides are present in the nucleic acid. The method can also be used in a software program for the design of multiplex primer extension reactions, enabling high throughput SNP genotyping.

Inventors:
WARD MICHAEL BRUCE (AU)
SORICH MICHAEL JOSEPH (AU)
MCKINNON ROSS A (AU)
Application Number:
PCT/AU2006/001164
Publication Date:
February 22, 2007
Filing Date:
August 15, 2006
Assignee:
UNIV SOUTH AUSTRALIA (AU)
WARD MICHAEL BRUCE (AU)
SORICH MICHAEL JOSEPH (AU)
MCKINNON ROSS A (AU)
International Classes:
C12Q1/68; G06Q99/00; G16B30/00
Other References:
GILAR M. ET AL.: "ION-PAIR REVERSED-PHASE HIGH-PERFORMANCE LIQUID CHROMATOGRAPHY ANALYSIS OF OLIGONUCLEOTIDES: RETENTION PREDICTION", JOURNAL OF CHROMATOGRAPHY A, vol. 958, 2002, pages 167 - 182, XP004358049
HOOGENDOORN B. ET AL.: "GENOTYPING NUCLEOTIDE POLYMORPHISMS BY PRIMER EXTENSION AND HIGH PERFORMANCE LIQUID CHROMATOGRAPHY", HUM. GENET., vol. 104, 1999, pages 89 - 93, XP000938094
Attorney, Agent or Firm:
MADDERNS (64 Hindmarsh Square Adelaide, S.A. 5000, AU)
Claims:

THE CLAIMS:

1. A method for predicting the retention time of a nucleic acid during chromatographic analysis, comprising: calculating a predicted retention time based on one or more factors, wherein the factors used to predict the retention time comprise at least one of the following: the nucleotide distribution of the nucleic acid, and whether or not the nucleic acid comprises a dideoxynucleotide triphosphate.

2. A method according to claim 1, wherein the factors used to predict the retention time comprise whether or not the nucleic acid comprises a dideoxynucleotide triphosphate.

3. A method according to any one of claims 1 to 3, wherein the factors used to predict the retention time comprise both the nucleotide distribution of the nucleic acid, and whether or not the nucleic acid comprises a dideoxynucleotide triphosphate.

4. A method according to any one of claims 1 to 4, wherein the factors used to predict the retention time further comprise the length and the nucleotide composition of the nucleic acid.

5. A method according to claim 1, wherein the nucleotide distribution is used as a factor to predict the retention time by placing an increased importance on the nucleotides at the 3' and 5' ends of the nucleic acid.

6. A method according to claim 5, wherein the nucleic acid is a deoxyribonucleic acid, and the retention time is calculated according to the following general formula:

$$T_R = a_1 \log(L) + a_2 (\%A)/L + a_3 (\%C)/L + a_4 (\%G)/L + a_5 (\%T)/L + a_6 (\mathrm{3'end}(A))/L + a_7 (\mathrm{3'end}(C))/L + a_8 (\mathrm{3'end}(G))/L + a_9 (\mathrm{3'end}(T))/L + a_{10} (\mathrm{5'end}(A))/L + a_{11} (\mathrm{5'end}(C))/L + a_{12} (\mathrm{5'end}(G))/L + a_{13} (\mathrm{5'end}(T))/L + a_{14}$$

wherein $T_R$ is the predicted retention time, L is the length of the nucleic acid in nucleotides, %X is the percentage composition of nucleotide X in the nucleic acid, $a_1$, $a_2$, $a_3$, $a_4$, $a_5$, $a_6$, $a_7$, $a_8$, $a_9$, $a_{10}$, $a_{11}$, $a_{12}$, $a_{13}$ and $a_{14}$ are constants, 3'end(A) is the number of adenine nucleotides at the 3' end of the nucleic acid, 3'end(C) is the number of cytidine nucleotides at the 3' end of the nucleic acid, 3'end(G) is the number of guanine nucleotides at the 3' end of the nucleic acid, 3'end(T) is the number of thymidine nucleotides at the 3' end of the nucleic acid, 5'end(A) is the number of adenine nucleotides at the 5' end of the nucleic acid, 5'end(C) is the number of cytidine nucleotides at the 5' end of the nucleic acid, 5'end(G) is the number of guanine nucleotides at the 5' end of the nucleic acid, and 5'end(T) is the number of thymidine nucleotides at the 5' end of the nucleic acid.

7. A method according to claim 6, wherein different constants $a_1$ ... $a_{14}$ are used depending on whether the nucleotide at the 3' end of the nucleic acid is a dideoxynucleotide.

8. A method of determining the nucleotide sequence of a nucleic acid from two or more possible nucleotide sequences, by: predicting the retention time during chromatographic analysis of a nucleic acid of each possible sequence, wherein the factors used to predict the retention time comprise at least one of the following: the nucleotide distribution of the nucleic acid, and whether or not the nucleic acid comprises a dideoxynucleotide triphosphate; measuring the retention time of the nucleic acid through a liquid chromatography column; and comparing the measured retention time to the predicted retention times, thereby identifying the nucleotide sequence of the nucleic acid.

9. A method according to claim 8, wherein the nucleic acid is a product of a primer extension reaction at a polymorphic site, and the method further includes identifying the two or more possible nucleotide sequences by determining the different primer extension products obtained depending on the nucleotide at the polymorphic site.

10. A method of selecting primers for use in analysing two or more polymorphic sites by primer extension, the method including:

(a) for each site, creating a list of one or more possible primers that can be extended to or past the corresponding site;

(b) for each possible primer, determining the primer extension products that would be obtained by conducting primer extension at the corresponding site;

(c) predicting the retention time of said possible primers and primer extension products during chromatographic analysis; and

(d) selecting a primer from each list of possible primers to form one or more primer sets, such that, for each primer set, the predicted retention times differ for all selected primers and the corresponding primer extension products.

11. A method according to claim 10, wherein the chromatographic analysis is liquid chromatographic analysis.

12. A method according to claim 11, wherein the chromatographic analysis is by denaturing high performance liquid chromatography, under completely denaturing conditions.

13. A method according to any one of claims 10 to 12, wherein the number of polymorphic sites is greater than five.

14. A method according to any one of claims 10 to 12, wherein the number of polymorphic sites is greater than ten.

15. A method according to any one of claims 10 to 14, wherein (a) comprises for each site, creating a list of two or more possible primers that can be extended to or past the corresponding site.

16. A method according to any one of claims 10 to 15, wherein (d) comprises selecting a primer from each list of possible primers to form two or more primer sets, such that, for each primer set, the predicted retention times differ for all selected primers and the corresponding primer extension products.

17. A method according to any one of claims 10 to 15, wherein (d) comprises selecting a primer from each list of possible primers to form two or more primer sets of up to five primers, such that, for each primer set, the predicted retention times differ for all selected primers and the corresponding primer extension products.

18. A method according to any one of claims 10 to 17, wherein the one or more primer sets is selected to provide the least likelihood of conflict between the observed retention times.

19. A method according to any one of claims 10 to 18, further comprising:

(aa) creating a list of one or more combinations of terminating nucleotides to terminate primer extension.

20. A method according to claim 19, wherein (aa) comprises creating a list of two or more combinations of terminating nucleotides to terminate primer extension.

21. A method according to any one of claims 10 to 20, wherein each list of possible primers is created comprising all primers that bind with their 3' end adjacent the site, between a specified minimum and maximum primer length.

22. A system for selecting primers for use in analysing two or more polymorphic sites by primer extension, the system comprising: a list creator to create, for each site, a list of one or more possible primers that can be extended to or past the corresponding site; a determiner to determine, for each primer, the primer extension products that would be obtained by conducting primer extension of the primer at the corresponding site; a predictor to predict the retention time during chromatographic analysis of said possible primers and primer extension products; and a selector to select a primer set which includes a selected primer for each site, such that the predicted retention times for all selected primers and the corresponding primer extension products differ.

23. A computer readable medium, encoded with data representing a computer program, that can be used to direct a programmable device to perform the method of any of claims 1 to 21.

24. A computer program element comprising computer program code means to make a programmable device execute the method of any of claims 1 to 21.

Description:

CHROMATOGRAPHIC ANALYSIS OF NUCLEIC ACIDS

FIELD OF THE INVENTION

The invention relates to the chromatographic analysis of nucleic acids.

BACKGROUND OF THE INVENTION

Adverse reactions to medications are a major medical problem resulting in significant patient suffering and financial cost to the community, and the frequency of adverse reactions appears to be increasing. Unfortunately, many adverse drug reactions occur unpredictably. For one drug, some patients may suffer from debilitating side effects that limit treatment whilst others experience only the desired, beneficial responses. However, for another drug to treat the same condition, the effects may be reversed. Genetic variation between patients may explain these differences.

The human genome is made up from deoxyribonucleic acid (DNA), which is a sequence of adenine (A), cytidine (C), guanine (G) and thymidine (T) nucleotides. Each copy of the human genome is unique and varies from any other version in the population by approximately 1 in every 1,250 nucleotides. Many of these variations are single nucleotide polymorphisms (SNPs) which are scattered throughout the human genome with varying density. Significant work has been done to identify and catalogue SNPs.

It is increasingly clear that access to high-throughput methods of gene analysis, primarily for SNP discovery and genotyping, will be essential as pharmacogenomics moves from the laboratory to the clinic. SNP genotyping is expected to provide benefits in drug development and testing by pharmaceutical companies which in turn may benefit the consumer in reduced drug costs. Furthermore, a major advantage will be the ability to assess an individual's reaction to a drug before it is prescribed. This will increase a physician's confidence in prescribing the drug and the patient's confidence in taking the drug, and reduce the frequency of adverse drug reactions. In turn, this should encourage the development of new drugs.

For SNP discovery, a number of techniques are available, such as direct sequencing, single-strand conformation analysis, and denaturing high performance liquid chromatography (DHPLC). The technique chosen is dependent upon factors such as the number of samples to be screened and cost. The most popular techniques are reliant upon conformation or stability differences induced by the presence of a mutation.

Arguably the most sensitive method of SNP discovery is DHPLC. When using DHPLC to discover SNPs, a target sequence is first amplified using the well known polymerase chain reaction (PCR) technique. The PCR products are denatured and allowed to cool slowly. For heterozygous samples, this cooling process allows a mismatch to occur in the Watson-Crick base pairing at the SNP location. Therefore, heterozygous samples will result in the formation of both correctly matched homoduplex DNA helices and mismatched heteroduplex DNA helices, whilst homozygous samples will result in only homoduplices.

The cooled PCR products are then subjected to high performance liquid chromatographic analysis, wherein heteroduplices are not retained as strongly by the column as homoduplices. Accordingly, they will be eluted after a shorter retention time. Samples containing only homoduplices (either homozygous wild type or homozygous mutant) will generally display a single peak within the chromatogram whilst those samples generating heteroduplices, which are heterozygous for an SNP, will display a different peak profile (generally two peaks). The specific nucleotide change can then be determined using direct dye terminator sequencing.

In relation to SNP genotyping, various technologies have emerged in recent years for the high-throughput detection of genetic variability. These include restriction fragment length polymorphism (RFLP) analysis, and TaqMan® genotyping. However, recently, primer extension (PE) genotyping has become one of the most popular genotyping techniques.

As the name implies, PE genotyping involves the extension of a synthetic oligonucleotide primer by one or more nucleotides in an allele specific manner. The primer to be used for the PE reaction is specifically designed to anneal one nucleotide immediately upstream of the polymorphic site. The reaction utilises DNA polymerase and a specific combination of deoxynucleotide triphosphates (dNTPs) and dideoxynucleotide triphosphates (ddNTPs) to extend and terminate the primers respectively. The primer extension products will differ in sequence depending on the nucleotide at the polymorphic site.
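By way of illustration only, the extension and termination behaviour described above can be sketched in Python. The function names, the example sequences and the convention of writing the bases downstream of the primer on the primer strand are assumptions of this sketch and do not appear in the specification. Every base not supplied as a ddNTP is assumed to be supplied as a dNTP.

```python
def extension_product(primer, downstream, terminators):
    """Return the primer extension product, written 5'->3' on the primer strand.

    primer      -- primer sequence
    downstream  -- bases that would be added after the primer's 3' end,
                   written on the primer strand (an assumption of this sketch)
    terminators -- set of bases supplied as ddNTPs; all other bases are
                   assumed to be supplied as dNTPs
    """
    product = primer
    for base in downstream:
        product += base
        if base in terminators:
            break  # a ddNTP was incorporated, so extension stops here
    return product


def products_by_allele(primer, alleles, flank, terminators):
    """Map each possible nucleotide at the polymorphic site to its product."""
    return {
        allele: extension_product(primer, allele + flank, terminators)
        for allele in alleles
    }


# Hypothetical primer and site: alleles C or T, followed by the flank "GAT".
products = products_by_allele(
    "ACGTTGCA", alleles="CT", flank="GAT", terminators={"C", "G"}
)
print(products)
```

In this hypothetical case the C allele terminates after one added nucleotide and the T allele after two, so the two products differ in both length and sequence.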

The next problem, therefore, is to analyse the products produced by the PE reaction. Analysis methods may take advantage of differences in length or mass of the primer extension products - for example, gel electrophoresis. Alternatively, specific nucleotides may have been fluorescently or radioactively labeled, which may also be used to distinguish between PE products.

DHPLC can also be used for PE reaction product analysis, which extends the utility of the equipment beyond SNP discovery. However, this option suffers because the retention of the primers and their respective extension products is highly variable. Therefore, multiplexing of two or more PE reactions is severely limited by the need to ensure that the various nucleic acids contained within the complex mixture of primers and PE products can be satisfactorily resolved from each other.

A commonly used method to analyse PE products is matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). The major advantages of MALDI-TOF MS are the ability to easily multiplex the assay and the speed of analysis. PE reactions with MALDI-TOF MS analysis do not require the use of labeled nucleotides. Separation is based upon the mass of the primers and extension products, and multiplexing is achieved through the selection of primers of different length, up to a maximum of approximately 40 nucleotides. A one nucleotide difference in length results in at least a 300 Da mass difference, which can easily be resolved in a mass spectrum.

However, MALDI-TOF MS analysis suffers due to the presently prohibitive cost of the equipment, and the strict requirements for sample preparation.

It is an object of the present invention to reduce or eliminate some or all of the disadvantages of, or to provide an alternative to, current methods for analysing polymorphic sites.

SUMMARY OF THE INVENTION

Accordingly, in a first aspect of the present invention, there is provided a method for predicting the retention time of a nucleic acid during chromatographic analysis, comprising: calculating a predicted retention time based on one or more factors, wherein the factors used to predict the retention time comprise at least one of the following: the nucleotide distribution of the nucleic acid, and whether or not the nucleic acid comprises a dideoxynucleotide triphosphate.

As will be understood, the nucleotide distribution of a nucleic acid differs from its nucleotide composition. The nucleotide composition of a nucleic acid is simply the amount of each nucleotide (A, C, G and T) represented in the nucleic acid. However, nucleic acids having the same nucleotide composition may have these nucleotides distributed very differently throughout - for example, nucleotides of the same type may be clustered together in nucleotide repeats, or they may be over-represented in a portion of the acid, such as a portion at an end (or both ends) of the acid or towards the middle of the acid. This distribution of nucleotides is referred to as the nucleotide distribution, and the examples given above are not intended to be exhaustive.

Preferably, the factors used to predict the retention time comprise whether or not the nucleic acid includes a dideoxynucleotide triphosphate. This will typically be the 3' terminal nucleotide of the nucleic acid.

Further, the factors used to predict the retention time preferably comprise both of the abovementioned factors. Ideally, the factors used to predict the retention time also comprise the length and the nucleotide composition of the nucleic acid. One way of considering nucleotide distribution is to place an increased importance on the nucleotides at the 3' and 5' ends of the nucleic acid.

Preferably, the present invention is used to determine the nucleotide sequence of a nucleic acid from two or more possible sequences, by:

predicting the retention time during chromatographic analysis of a nucleic acid of each possible sequence; measuring the retention time of the nucleic acid through a liquid chromatography column; and comparing the measured retention time to the predicted retention times, thereby identifying the nucleotide sequence of the nucleic acid.

Preferably, the method is used to determine the nucleotide sequence of a nucleic acid formed by conducting primer extension at a mutation site. The method can therefore be used for genotyping.
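The comparison of a measured retention time against the predicted retention times may, for example, be sketched as follows. This is a minimal illustration only: the function name, the tolerance parameter and the retention time values are hypothetical, and the specification does not prescribe any particular comparison rule.

```python
def identify_sequence(measured_rt, predicted_rts, tolerance):
    """Return the candidate sequence whose predicted retention time is closest
    to the measured one, or None if no prediction lies within `tolerance`."""
    best = min(predicted_rts, key=lambda seq: abs(predicted_rts[seq] - measured_rt))
    if abs(predicted_rts[best] - measured_rt) > tolerance:
        return None
    return best


# Made-up predicted retention times (minutes) for two possible PE products.
predicted = {"ACGTTGCAC": 4.10, "ACGTTGCATG": 4.65}
call = identify_sequence(4.58, predicted, tolerance=0.2)
print(call)
```

Returning no call when every prediction lies outside the tolerance is a design choice of this sketch; it avoids assigning a sequence to a peak that matches none of the expected products.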

In a second aspect of the present invention, there is provided a method of selecting primers for use in analysing two or more polymorphic sites by primer extension, the method including:

(a) for each site, creating a list of one or more possible primers that can be extended to or past the corresponding site;

(b) for each possible primer, determining the primer extension products that would be obtained by conducting primer extension at the corresponding site;

(c) predicting the retention time of said possible primers and primer extension products during chromatographic analysis; and

(d) selecting a primer from each list of possible primers to form one or more primer sets, such that, for each primer set, the predicted retention times differ for all selected primers and the corresponding primer extension products.

Each primer set is intended to be subsequently used in a PE reaction, and then the products of the PE reaction are subjected to chromatographic analysis. Each primer set in (d) may be any set wherein the predicted retention times differ for all selected primers and primer extension products in the set. The minimum separation between predicted retention times will depend on the accuracy of the predicted retention times and/or on the sensitivity of the chromatography apparatus.

It will be desirable to require the least number of PE reactions (and, accordingly, primer sets), as this will allow the greatest number of polymorphic sites to be analysed in the shortest time. However, as the number of primer sets is decreased, the number of primers in each set will correspondingly increase. The number of primers and primer extension products which can be differentiated in a single assay will depend on the sensitivity of the chromatographic analysis.

Preferably, the type of chromatography used is liquid chromatography, and most preferably is denaturing high performance liquid chromatography (DHPLC), under completely denaturing conditions. Currently, an appropriate limit to primer set size when using DHPLC to analyse the PE products is five. However, obviously this limit is not absolute.

In some cases, there may be a number of possible selections of primer set(s) for which the predicted retention times differ for all selected primers in the primer set and the corresponding primer extension products. It may, therefore, be preferable to select the primer set(s) which have the maximum total difference between the predicted retention times, provided all the differences meet the minimum separation for the particular chromatography apparatus to be used. However, where the accuracy of the retention time predictions is not absolute, but the limitations of the prediction method are understood, the most desirable selection of primer sets is the one that results in the least likelihood of conflict between the observed retention times. This accordingly minimises the likelihood of being unable to separate relevant nucleic acids.
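One possible way of carrying out such a selection is sketched below. It is an assumption-laden illustration rather than the claimed method: the function names, the data layout and the retention times (in seconds) are hypothetical, and the smallest pairwise gap is used here merely as a simple stand-in for "least likelihood of conflict". The search is exhaustive and is therefore only practical for small candidate lists.

```python
from itertools import combinations, product


def smallest_gap(times):
    """Smallest absolute difference between any two retention times."""
    return min(abs(a - b) for a, b in combinations(times, 2))


def select_primer_set(candidates_per_site, min_separation):
    """Pick one candidate per site so that every predicted retention time in
    the set differs from every other by at least `min_separation`.

    candidates_per_site -- one list per site of (primer_name, times) tuples,
                           where `times` holds the predicted retention times
                           of the primer and of its extension products
    Returns (chosen primer names, smallest gap), or None if nothing fits.
    """
    best = None
    for choice in product(*candidates_per_site):
        times = [t for _, primer_times in choice for t in primer_times]
        gap = smallest_gap(times)
        # Keep the valid selection whose closest pair of peaks is furthest apart.
        if gap >= min_separation and (best is None or gap > best[1]):
            best = ([name for name, _ in choice], gap)
    return best


# Hypothetical predicted retention times in seconds (primer, then products).
site1 = [("S1_18mer", [180, 204, 216]), ("S1_20mer", [210, 234, 252])]
site2 = [("S2_19mer", [210, 228, 258]), ("S2_22mer", [276, 300, 318])]
result = select_primer_set([site1, site2], min_separation=10)
print(result)
```

Other scoring rules, such as the total difference between predicted retention times mentioned above, could be substituted for `smallest_gap` without changing the structure of the search.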

Preferably, the method of the second aspect further includes: (aa) creating a list of one or more combinations of terminating nucleotides to terminate primer extension.

In this event, the selection in (d) would include selecting both the primer set(s) and a terminating nucleotide combination for each primer set. The terminating nucleotides will generally be dideoxynucleotide triphosphates (ddNTPs). Preferably, the list of combinations of terminating nucleotides includes at least two combinations. Different terminating nucleotide combinations may be selected for different primer sets.

Furthermore, preferably at least one and, ideally, all of the lists of possible primers include at least two possible primers.

Each list of possible primers may be created in many ways. One way of creating each list is by including all primers that bind with their 3' end adjacent the site, between a specified minimum and maximum primer length. Of course, factors other than length may be used to restrict the possible primers. Criteria such as melting point or GC (guanine and cytidine) content may also be considered. Primer selection may also be limited by the predicted retention time of the primer - e.g. if the predicted retention time is too long or too short.
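The creation of a list of possible primers by length, optionally restricted by GC content, might be sketched as follows. The function name, parameters and example sequence are hypothetical. The sketch assumes that the upstream sequence is written 5' to 3' on the primer strand and ends at the base immediately before the polymorphic site, that only one strand is considered, and that the minimum length is at least one.

```python
def candidate_primers(upstream, min_len, max_len, gc_range=(0.0, 1.0)):
    """List primers whose 3' end sits immediately before the polymorphic site.

    upstream -- sequence 5'->3' ending at the base just before the site,
                written on the primer strand (an assumption of this sketch)
    min_len  -- minimum primer length (assumed to be at least 1)
    max_len  -- maximum primer length
    gc_range -- optional (low, high) bounds on the G+C fraction
    """
    primers = []
    for length in range(min_len, min(max_len, len(upstream)) + 1):
        primer = upstream[-length:]  # the 3' end is fixed; the 5' end moves
        gc = (primer.count("G") + primer.count("C")) / length
        if gc_range[0] <= gc <= gc_range[1]:
            primers.append(primer)
    return primers


# Hypothetical upstream sequence; real primers would typically be longer.
all_primers = candidate_primers("TTGACCGTAGCATGCA", min_len=4, max_len=6)
gc_filtered = candidate_primers(
    "TTGACCGTAGCATGCA", min_len=4, max_len=6, gc_range=(0.5, 1.0)
)
print(all_primers, gc_filtered)
```

Further filters, such as melting point or predicted retention time, could be added in the same loop.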

In another aspect of the present invention, there is provided a system for selecting primers for use in analysing two or more polymorphic sites by primer extension, the system including:

a list creator to create, for each site, a list of one or more possible primers that can be extended to or past the corresponding site; a determiner to determine, for each primer, the primer extension products that would be obtained by conducting primer extension of the primer at the corresponding site; a predictor to predict the retention time during chromatographic analysis of said possible primers and primer extension products; and a selector to select a primer set which includes a selected primer for each site, such that the predicted retention times for all selected primers and the corresponding primer extension products differ.

In further aspects of the present invention, computer readable media and computer program elements for directing a programmable device to perform the steps of the above methods are also provided.

BRIEF DESCRIPTION OF THE FIGURES

An illustrative embodiment of the present invention will be discussed with reference to the accompanying drawings wherein:

FIGURE 1 is a graph of the correlation between nucleic acid length and retention time;

FIGURES 2 to 4 are graphs of predicted retention time against measured retention time, for different models for predicting retention time;

FIGURE 4a is a graph of standardised residuals against retention time, for the model for predicting retention time of Figure 4;

FIGURE 5 is a graph of predicted retention time against measured retention time, for another model for predicting retention time;

FIGURE 5a is a graph of standardised residuals against retention time, for the model for predicting retention time of Figure 5;

FIGURE 6 is an overall flow chart of the method of the second aspect of the present invention;

FIGURE 7 depicts a polymorphic site with several possible primers which can anneal adjacent the polymorphic site;

FIGURE 8 depicts a system according to an embodiment of the present invention;

FIGURE 9 is a more detailed flow chart of one step of the method of the second aspect of the present invention;

FIGURE 10 is a flow chart of the general use of the second aspect of the present invention;

FIGURE 11 is a user interface for a software embodiment of the second aspect of the present invention; and

FIGURE 12 provides a detailed analysis of the output of the second aspect of the present invention.

DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

The retention time of a nucleic acid during chromatographic analysis can vary widely depending on many factors. Firstly, it will depend on the type of chromatography used, and also on the type of nucleic acid (for instance, RNA or DNA). Hereinafter, the present invention is discussed with reference to the retention time of DNA using DHPLC in completely denaturing conditions. However, the present invention should not be limited to this preferred embodiment, since the present invention may be applied to different types of chromatography and nucleic acids.

To briefly review the operation of DHPLC apparatus, a stationary phase is contained within a chromatography column. The stationary phase is generally chemically inert, electrically neutral and hydrophobic, such as alkylated non-porous polystyrene-divinylbenzene particles of 2-3 microns in diameter. However, DNA is negatively charged and therefore cannot adsorb to the stationary phase by itself. Accordingly, the mobile phase includes not only water and an organic solvent (e.g. acetonitrile) but also an ion-pairing reagent to help the adsorption of the DNA. Commonly, the ion-pairing reagent is triethylammonium acetate (TEAA), but other reagents such as triethylamine-hexafluoroisopropanol (TEA-HFIP) may also be used. The ion-pairing reagent is used to interact with both the stationary phase and the DNA. For instance, the positively charged ammonium ions of TEAA molecules will interact with the negatively charged phosphate ions of DNA molecules, while the alkyl chains of the TEAA molecule interact with the hydrophobic surface of the stationary phase.

When used to analyse PE products, the DHPLC column is generally operated at 70 to 80 degrees Celsius to ensure completely denaturing conditions. The DNA sample is injected into the mobile phase, which is then passed through the DHPLC column. After a period of time (the "retention time"), the DNA will be eluted from the column. The eluting DNA fragments are then typically detected by ultraviolet (UV) absorbance and a chromatogram created.

As will be understood, the retention time of a DNA molecule through the DHPLC column will depend on the nucleotide sequence of that molecule, and there is provided herein a method to predict the retention time of a particular nucleotide sequence.

However, the retention time will also depend on other factors such as the temperature of the column, the stationary phase used, the flow rate of the mobile phase, the concentration of the solvent and the ion-pairing reagent used. For simplicity, the present invention will be described assuming these factors are kept consistent, and that variations in retention time of nucleic acids are caused by variations in their sequence. However, it would be within the scope of the present invention to provide a method that accounted for these other factors listed above.

Predicting the Retention Time of a Nucleic Acid

Sequence Length

The most significant determinant of the retention time of a nucleic acid is its length - increased length results in a longer retention time, as shown by way of example in Figure 1. However, taking into account only the length of a nucleic acid does not predict retention time with sufficient accuracy (e.g. for the efficient design of multiplexed assays). Using the logarithm of the length of the nucleic acid can improve the accuracy of the predictions, but not to a significant extent.

Sequence Composition

In addition to length, sequence composition is an important determinant of nucleic acid retention time. Two nucleic acids of the same length will not necessarily demonstrate the same retention time. Therefore, in addition to length, the composition of a nucleic acid is preferably used to predict retention time.

Sequence composition may be considered by predicting the retention time of a DNA molecule using the following general equation (Equation 1):

$$T_R = a_1 \log(L) + a_2 (\%A)/L + a_3 (\%C)/L + a_4 (\%G)/L + a_5 (\%T)/L + a_6$$

$T_R$ - Predicted Retention time

L - Length

%X - Percentage Composition of Nucleotide X

$a_1$ : $a_6$ - constants

The constants in the equation will vary depending on the DHPLC parameters used (e.g. temperature). However, the constants can be determined by first measuring the retention times for nucleic acids of known and varying sequence through a DHPLC column having the desired settings to obtain a data set, and then subjecting the data set to partial least squares (PLS) regression analysis. Of course, the above is provided only as an example, and other ways of accounting for sequence composition may also be used.

It should be observed that the nucleotide percentages have been scaled against the length of the nucleic acid, since the relative importance of sequence composition is inversely related to the length.
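The effect of this scaling can be illustrated with a short Python sketch that computes the variable terms of Equation 1 for a given sequence. The function name and example sequences are hypothetical, and the use of a base-10 logarithm is an assumption, since the text specifies only "log".

```python
import math


def composition_terms(seq):
    """Return [log(L), %A/L, %C/L, %G/L, %T/L] for a DNA sequence."""
    length = len(seq)
    terms = [math.log10(length)]  # base 10 is an assumption of this sketch
    for base in "ACGT":
        percent = 100.0 * seq.count(base) / length
        terms.append(percent / length)  # percentage scaled against length
    return terms


short = composition_terms("AACCGGTTAA")      # 10 nucleotides, 40% A
longer = composition_terms("AACCGGTTAA" * 2)  # 20 nucleotides, same percentages
print(short)
print(longer)
```

Both hypothetical sequences have the same percentage composition, but each composition term for the 20-nucleotide sequence is half that of the 10-nucleotide sequence, reflecting the reduced importance of composition as length increases.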

In passing, it is also noted that the importance of sequence composition is also dependent upon the ion-pair reagent being used. Furthermore, the importance of each of the four nucleotides is dependent upon the stationary phase employed.

Nucleotide Distribution

Although taking into account both nucleic acid length and composition improves the accuracy of retention time predictions, the accuracy can still be further improved. Accordingly, the present invention discloses that the distribution of nucleotides within the nucleic acid is also important for predicting retention time. Indeed, two nucleic acids may have the same composition of nucleotides, but if the nucleotides are distributed differently then the retention times will differ.

There are several aspects to nucleotide distribution, and accordingly a method of predicting nucleic acid retention time can account for nucleotide distribution in several different ways. For instance, emphasis may be placed on nucleotide repeats, but this may not be applicable where the sequences of interest do not contain nucleotide repeats.

One aspect of nucleotide distribution of particular importance is that the nucleotides on the ends of the nucleic acid have an increased effect on retention time. Therefore, for more accurate prediction of retention time, the distribution of the nucleotides should be taken into account, and nucleotides on the 3' and 5' terminals of the nucleic acid should be given more emphasis.

Nucleotide distribution can consequently be considered by predicting the retention time of a DNA molecule using the following general equation (Equation 2):

$$T_R = a_1 \log(L) + a_2 (\%A)/L + a_3 (\%C)/L + a_4 (\%G)/L + a_5 (\%T)/L + a_6 (\mathrm{3'end}(A))/L + a_7 (\mathrm{3'end}(C))/L + a_8 (\mathrm{3'end}(G))/L + a_9 (\mathrm{3'end}(T))/L + a_{10} (\mathrm{5'end}(A))/L + a_{11} (\mathrm{5'end}(C))/L + a_{12} (\mathrm{5'end}(G))/L + a_{13} (\mathrm{5'end}(T))/L + a_{14}$$

$T_R$ - Predicted Retention time

L - Length

%X - Percentage Composition of Nucleotide X

$a_1$ : $a_{14}$ - constants

3'end(X) - Number of X Nucleotides at 3' End (i.e. 1 or 0)

5'end(X) - Number of X Nucleotides at 5' End (i.e. 1 or 0)

As above, the constants can be determined by PLS regression analysis of a data set of nucleic acid retention times.

Of course, the above is provided only as an example, and other ways of accounting for nucleotide distribution may also be used. For instance, an alternative would be to consider the 3' and 5' ends of the nucleic acid together - i.e. do not assign different weights to the different ends.

Dideoxynucleotide triphosphates (ddNTPs)

It would be natural to assume that it would make no difference to retention time whether or not any nucleotide in the nucleic acid was a ddNTP or a dNTP, since the structural differences between dNTPs and ddNTPs are minor. Furthermore, generally only a single base within the entire sequence - e.g. the 3' terminal nucleotide of a PE product - will be a ddNTP.

However, a surprising discovery of the present invention is that this assumption is false - a single ddNTP can have a significant impact on the retention time of a nucleic acid. In fact, inclusion of a ddNTP will tend to increase retention time. Therefore, where retention times must be predicted for nucleic acids both with and without ddNTPs (e.g. for predicting the retention times of primers used in primer extension and their PE products), the method preferably takes into account the effect of ddNTPs.

A simple way of taking ddNTPs into account is by applying a different equation if the nucleic acid includes a ddNTP (e.g. if it is a PE product). For example, when predicting the retention times of nucleic acids including a ddNTP, the following general equation could be used (Equation 3):

$$
\begin{aligned}
T_R = {} & b_1 \log(L) + b_2 \frac{\%A}{L} + b_3 \frac{\%C}{L} + b_4 \frac{\%G}{L} + b_5 \frac{\%T}{L} \\
& + b_6 \frac{\text{3'end}(A)}{L} + b_7 \frac{\text{3'end}(C)}{L} + b_8 \frac{\text{3'end}(G)}{L} + b_9 \frac{\text{3'end}(T)}{L} \\
& + b_{10} \frac{\text{5'end}(A)}{L} + b_{11} \frac{\text{5'end}(C)}{L} + b_{12} \frac{\text{5'end}(G)}{L} + b_{13} \frac{\text{5'end}(T)}{L} + b_{14}
\end{aligned}
$$

T_R - Predicted Retention time

L - Length

%X - Percentage Composition of Nucleotide X

b_1 : b_14 - constants

3'end(X) - Number of X Nucleotides at 3' End (i.e. 1 or 0)

5'end(X) - Number of X Nucleotides at 5' End (i.e. 1 or 0)

As above, the constants can be determined by PLS regression analysis of a data set of nucleic acid retention times, where all the nucleic acids measured include a ddNTP. However, postulating for the moment that Equation 2 applies to nucleic acids without a ddNTP, then the constants b_1 : b_14 in Equation 3 will differ from the constants a_1 : a_14 in Equation 2, because they would need to account for the ddNTP. Since the ddNTP will generally be the 3' terminal nucleotide, an equation as above is well suited to this analysis since it separately considers the nucleotide on the 3' end.

By way of example, two data sets were obtained by measuring the retention times for nucleic acids of known and varying sequence through a DHPLC column having the settings as set out in Table 1. Retention times were measured by detecting UV absorption at 260nm.

Table 1

The first data set contained retention times for nucleic acids without ddNTPs. For this data set, Figure 2 displays the correlation obtained between measured retention times and retention time predictions based solely on the length of the nucleic acid, whilst Figure 3 shows the vastly improved correlation where the retention time predictions are further based on nucleotide composition and nucleotide distribution according to the following equation (corresponding generally to Equation 2):

$$
\begin{aligned}
T_R = {} & 3.49 \log(L) - 0.0851 \frac{\%A}{L} - 0.656 \frac{\%C}{L} - 1.03 \frac{\%G}{L} + 0.343 \frac{\%T}{L} \\
& - 0.567 \frac{\text{3'end}(A)}{L} - 4.17 \frac{\text{3'end}(C)}{L} - 3.32 \frac{\text{3'end}(G)}{L} - 5.40 \frac{\text{3'end}(T)}{L} \\
& + 0.0387 \frac{\text{5'end}(A)}{L} - 5.17 \frac{\text{5'end}(C)}{L} - 1.69 \frac{\text{5'end}(G)}{L} - 1.86 \frac{\text{5'end}(T)}{L} + 0.941
\end{aligned}
$$

T_R - Predicted Retention time (minutes)

L - Length

%X - Percentage Composition of Nucleotide X

3'end(X) - Number of X Nucleotides at 3' End (i.e. 1 or 0)

5'end(X) - Number of X Nucleotides at 5' End (i.e. 1 or 0)

The second data set contained retention times for nucleic acids with 3' terminal ddNTPs. In relation to the second data set, Figures 4 and 4a indicate the underprediction of retention times where the equation developed for the first data set is used. Figures 5 and 5a demonstrate the improved correlation of a separate equation for the nucleic acids including a ddNTP, according to the following equation (corresponding generally to Equation 3):

$$
\begin{aligned}
T_R = {} & 3.15 \log(L) - 0.0511 \frac{\%A}{L} - 0.605 \frac{\%C}{L} - 0.957 \frac{\%G}{L} + 0.378 \frac{\%T}{L} \\
& + 0.842 \frac{\text{3'end}(A)}{L} - 7.07 \frac{\text{3'end}(C)}{L} - 0.234 \frac{\text{3'end}(G)}{L} - 3.72 \frac{\text{3'end}(T)}{L} \\
& + 0.888 \frac{\text{5'end}(A)}{L} - 4.92 \frac{\text{5'end}(C)}{L} - 2.49 \frac{\text{5'end}(G)}{L} - 1.06 \frac{\text{5'end}(T)}{L} + 1.49
\end{aligned}
$$

T_R - Predicted Retention time (minutes)

L - Length

%X - Percentage Composition of Nucleotide X

3'end(X) - Number of X Nucleotides at 3' End (i.e. 1 or 0)

5'end(X) - Number of X Nucleotides at 5' End (i.e. 1 or 0)
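By way of illustration only, the two fitted equations above could be evaluated as in the following Python sketch. The coefficients are copied from the equations; everything else is an assumption made for the example: the function name `predict_retention_time` is hypothetical, the logarithm is taken as base 10 (the specification writes only "log"), %X is taken on a 0 to 100 scale, and the sequence is assumed to be written 5' to 3'.

```python
import math

# Coefficients copied from the two fitted equations above, in the order:
# log(L), %A, %C, %G, %T, 3'end A/C/G/T, 5'end A/C/G/T, constant term.
WITHOUT_DDNTP = [3.49, -0.0851, -0.656, -1.03, 0.343,
                 -0.567, -4.17, -3.32, -5.40,
                 0.0387, -5.17, -1.69, -1.86, 0.941]
WITH_DDNTP = [3.15, -0.0511, -0.605, -0.957, 0.378,
              0.842, -7.07, -0.234, -3.72,
              0.888, -4.92, -2.49, -1.06, 1.49]


def predict_retention_time(seq, has_ddntp):
    """Predicted retention time in minutes for a sequence written 5'->3'."""
    seq = seq.upper()
    length = len(seq)
    coeffs = WITH_DDNTP if has_ddntp else WITHOUT_DDNTP
    # Assumption: base-10 logarithm; the specification does not state the base.
    terms = [math.log10(length)]
    # Percentage composition (0-100) of each nucleotide, scaled by length.
    terms += [100.0 * seq.count(n) / length / length for n in "ACGT"]
    # 3' end indicators (1 if the last nucleotide is X, else 0), scaled by length.
    terms += [(seq[-1] == n) / length for n in "ACGT"]
    # 5' end indicators (1 if the first nucleotide is X, else 0), scaled by length.
    terms += [(seq[0] == n) / length for n in "ACGT"]
    # zip() pairs the 13 terms with the first 13 coefficients; the 14th is the constant.
    return sum(c * t for c, t in zip(coeffs, terms)) + coeffs[13]
```

A primer would be passed with `has_ddntp=False` and its PE product with `has_ddntp=True`, so that each is evaluated with the equation fitted to the corresponding data set.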

Selecting Primers for use in Analysing Polymorphic Sites by Primer Extension

The second aspect of the present invention provides a method of selecting primers for use in analysing two or more polymorphic sites by primer extension. The polymorphic sites will typically be SNPs, and the particular sites to be analysed at a given time may vary widely. For instance, certain SNPs may indicate a predisposition to a particular disorder, or an increased likelihood of an adverse drug reaction. The sites to be analysed will depend on the reason for the analysis, and the initial step is to define the sites of interest.

To specify a site, the surrounding nucleotides must be identified either specifically (i.e. by entering the precise sequence and identifying the polymorphic site) or by reference - there are a number of existing nomenclature formats for defining SNPs. Where possible, specific identification of the sites is preferred, since this removes any ambiguity due to different nomenclature formats.

Figure 6 shows an overview of the method of the present invention, once each site 20 is identified. The polymorphic sites 20 are identified, and then the method of the present invention is used to select a primer to anneal adjacent to each site, for use in a PE reaction.

Create a List of Possible Terminating Nucleotide Combinations (21)

Referring to Figure 6, primer extension uses terminating nucleotides (generally ddNTPs) to terminate primer extension. In DNA, there are four different nucleotides (A, C, G, T), and any combination of 1, 2, 3 or 4 of these nucleotides (12 total combinations) may be used to terminate a PE reaction. However, the terminating nucleotide combination will affect the length of the PE products. This step is optional, since the terminating nucleotide combination may be predetermined.

Create a List of Possible Primers (23)

For each identified polymorphic site, there are many possible primers that could be designed to anneal with their 3' ends adjacent to the site. Figure 7 shows a DNA sequence containing an SNP, and lists eleven possible primer sequences that could hybridise with their 3' ends paired at a position one nucleotide upstream of the polymorphic site. The eleven possible primers range in length from 20 to 30 nucleotides, and differ only in that the longer possible primers have additional nucleotides at their 5' ends.

As will be understood, each of the possible primers shown in Figure 7 can be extended by a PE reaction to or past the SNP site, regardless of the combination of terminating nucleotides (ddNTPs) that are used to terminate PE.

However, it should be mentioned that the list of possible primers in Figure 7 is not exhaustive. Firstly, longer or shorter primers could be used. Secondly, if (for example) the terminating nucleotide combination used in the PE reaction does not include Gs - i.e. deoxyguanosine triphosphates (dGTPs) are used, not dideoxyguanosine triphosphates (ddGTPs) - then the list of possible primers could include primers with up to two fewer G nucleotides at their 3' ends. Thirdly, primers having additional non-annealing nucleotides (e.g. a sequence of Ts) at their 5' ends could also be used.

Nonetheless, in most cases it will be preferable to create the list of possible primers by including primers which are complementary to the nucleic acid containing the polymorphic site, and which hybridise with their 3' ends paired at a position one nucleotide upstream of the polymorphic site.

However, the number of primers included in the list may be limited by various primer restriction criteria. These criteria could include length (e.g. the primer must be in the range 20 - 30 nucleotides), the melting temperature of the primer, the GC content of the primer, or the primer retention time (e.g. between 4 and 8 minutes).

In a system according to the present invention, as depicted in Figure 8, this step is performed by List Creator 60.
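As a minimal sketch of how such a list might be created (not a description of List Creator 60 itself), the following Python function enumerates primers whose 3' ends lie one nucleotide upstream of the polymorphic site, limited by a length criterion. The function name `candidate_primers` and its parameters are hypothetical, and it is assumed that the supplied strand is the one having the same sequence as the primers (i.e. the complement of the strand to which the primers anneal), written 5' to 3'.

```python
def candidate_primers(strand, site_index, min_len=20, max_len=30):
    """List primers whose 3' end sits one nucleotide upstream of the site.

    strand     -- sequence of the primer's own strand, written 5'->3'
    site_index -- 0-based position of the polymorphic site in that strand
    """
    primers = []
    for length in range(min_len, max_len + 1):
        start = site_index - length
        if start < 0:
            break  # not enough upstream sequence for this or any longer primer
        # The slice ends at site_index - 1, immediately before the site.
        primers.append(strand[start:site_index])
    return primers
```

With the default length range of 20 to 30 nucleotides and sufficient upstream sequence, this yields eleven primers that differ only by additional nucleotides at their 5' ends, as in the example of Figure 7.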

Determining Primer Extension Products (25)

Referring again to Figure 6, PE products are determined for each of the possible primers. The possible PE products from each possible primer will depend not only on the primer itself, but also on the polymorphisms at the polymorphic site which are to be detected, and on the terminating nucleotide combination used in the PE reaction. For instance, in Figure 7, if the nucleotide at the SNP site could be either T or C, and the terminating nucleotide used is a C, then each primer may be extended by either of the sequences AC or GC, depending on the nucleotide at the SNP site.

In this step, the list 22 of terminating nucleotide combinations, the polymorphisms at each site 20 and the lists of possible primers 24 are all used to determine the range of possible PE products 26 that may be produced.

In a system according to the present invention, as depicted in Figure 8, this step is performed by Determiner 62.
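The determination of PE products can be sketched in Python as follows. This is an illustrative sketch only, not the implementation of Determiner 62. The function name `extension_products` and its parameters are hypothetical, and it is assumed that each terminating nucleotide is supplied only as a ddNTP, so that extension always stops at the first such nucleotide incorporated.

```python
def extension_products(primer, alleles, downstream, terminators):
    """Return {allele: product sequence} for one primer.

    primer      -- primer sequence, written 5'->3'
    alleles     -- nucleotides that may be incorporated opposite the polymorphic site
    downstream  -- nucleotides that would be incorporated after the site, 5'->3'
    terminators -- nucleotides supplied as ddNTPs
    """
    products = {}
    for allele in alleles:
        added = ""
        for base in allele + downstream:
            added += base
            if base in terminators:
                break  # a ddNTP was incorporated, so extension stops here
        else:
            raise ValueError("no terminating nucleotide reached for allele " + allele)
        products[allele] = primer + added
    return products


# Mirrors the example above: extension by AC or GC when C is the terminator.
print(extension_products("TTT", "AG", "CTT", {"C"}))
# -> {'A': 'TTTAC', 'G': 'TTTGC'}
```

The short primer "TTT" is a placeholder chosen to keep the example readable.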

Predicting Retention Times (27)

Many methods could be used to predict the retention times during chromatographic analysis of each of the possible primers and primer extension products. Preferably, a method according to the first aspect of the present invention is used. Otherwise, if the prediction method used does not predict retention times with sufficient accuracy, then the selection of primers may not be adequate to ensure separation of the various nucleic acids during chromatographic analysis.

In a system according to the present invention, as depicted in Figure 8, this step is performed by Predictor 64.

Selecting Primer Set(s) (29)

The next step (Figure 6) is to select (based on the predicted retention times 28) primer set(s) and terminating nucleotide combination(s), wherein the predicted retention times differ for all selected primers and primer extension products in each set. In a system according to the present invention, as depicted in Figure 8, this step is performed by Selector 66.

Each primer set is intended to be subsequently used in a PE reaction, and it is therefore most desirable to have only one primer set. However, where many polymorphic sites are to be analysed, the limitations of the chromatographic apparatus (or of the retention time prediction method) will usually mean that all of the sites cannot be analysed using a single PE reaction. In this event, two or more primer sets will need to be selected, one set for each PE reaction.

The most important aspect of each primer set 30 is that the predicted retention times differ for all of the selected primers and their associated PE products. The minimum separation required will depend on the chromatography apparatus being used and the accuracy of the predicted retention times 28. However, the selection is preferably optimised to reduce the likelihood of any nucleic acids being inseparable during chromatographic analysis.
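The basic requirement that all predicted retention times in a set differ can be checked as in the following Python sketch. The function name `all_separated` and the `min_separation` parameter are hypothetical; a suitable minimum separation would depend on the apparatus and on the accuracy of the prediction method, and no particular value is given in this specification.

```python
from itertools import combinations


def all_separated(retention_times, min_separation):
    """True if every pair of predicted retention times (minutes) differs
    by at least min_separation."""
    return all(abs(a - b) >= min_separation
               for a, b in combinations(retention_times, 2))
```

The list passed in would contain the predicted retention times of every primer and every PE product in the candidate primer set.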

Where the limitations of the prediction method are understood, the most desirable primer set is the one that results in the least likelihood of conflict in the predicted retention times of the primers and primer extension products. This can be calculated using (for example) the standard deviation of the method used to predict retention times.

Figure 9 shows in detail one method of selecting primer set(s) 29 for analysing polymorphic sites (the sites are SNPs in this instance). First, the retention time differences between each primer and its extension products are calculated 29a. Then, the probability is calculated 29b that any pair of primers 31 (for different sites) and PE products (for a given terminating nucleotide combination) will not overlap.

"Overlap" between a primer pair and associated PE products occurs if a PE product associated with the primer of shorter retention time elutes after the primer of longer retention time. In most cases, the likelihood of conflict (wherein nucleic acids cannot be appropriately separated by chromatographic analysis) would increase if primers were considered which would overlap with any PE products of a primer having shorter retention time. Therefore, this preferred embodiment aims to minimise the chance of overlap between primer pairs. However, depending on the sensitivity of the chromatography apparatus and /or the accuracy of the prediction method, it would be within the scope of the present invention to select primers such that there was an "overlap" between primer pairs, provided that the predicted retention times for all primers and PE products differ.

The overlap probabilities can be calculated with a Student's t-test using i) the retention time of the primer having longer retention time and ii) the maximum retention time of the PE products associated with the primer having shorter retention time, for the given terminating nucleotide combination. The calculated probabilities are subsequently stored 29c.
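The specification does not set out the details of the test, so the following Python sketch substitutes a simpler normal approximation for the Student's t-test and should be read as an illustration of the idea only. It assumes that the two predicted retention times have independent, normally distributed errors with a common standard deviation; the function name `no_overlap_probability` and its parameters are hypothetical.

```python
from statistics import NormalDist


def no_overlap_probability(long_primer_rt, max_short_product_rt, prediction_sd):
    """Approximate probability that the later-eluting primer really elutes after
    the latest PE product of the earlier-eluting primer.

    Normal approximation (an assumption, not the t-test named in the text):
    the error of the difference of two independent predictions has
    standard deviation prediction_sd * sqrt(2).
    """
    diff_sd = prediction_sd * 2 ** 0.5
    predicted_gap = long_primer_rt - max_short_product_rt
    return NormalDist(0.0, diff_sd).cdf(predicted_gap)
```

Under these assumptions a predicted gap of zero gives a probability of 0.5, and the probability approaches 1 as the gap grows relative to the standard deviation of the prediction method.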

The possible combinations of primer set(s) are then determined 29d. Clearly, to analyse four SNPs (for example), a single set of four primers could be used for a single PE reaction. Alternatively, one set of three primers and one set of one primer might be appropriate. Finally, two sets of two primers may be used, and this alternative is shown in Figure 8, where primers may be selected to analyse SNPs in three set combinations 33:

i) AB and CD

ii) AC and BD

iii) AD and BC

For each set combination 33, the stored overlap probabilities (29c) are multiplied 35 to give a total no overlap probability for each primer set, which is then stored 29f and compared so that the primer set(s) yielding the highest no overlap probability are selected 29g.
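The multiplication and comparison steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of Selector 66: the function name `best_set_combination` is hypothetical, the sites are labelled with single letters, and the stored probabilities are assumed to be keyed by pairs of sites.

```python
from itertools import combinations
from math import prod


def best_set_combination(set_combinations, pair_no_overlap):
    """Return (combination, total probability) with the highest total
    no overlap probability.

    set_combinations -- e.g. [("AB", "CD"), ("AC", "BD"), ("AD", "BC")]
    pair_no_overlap  -- {frozenset({"A", "B"}): probability, ...}
    """
    def total(sets):
        # Multiply the stored probability of every primer pair within each set.
        return prod(pair_no_overlap[frozenset(pair)]
                    for primer_set in sets
                    for pair in combinations(primer_set, 2))

    best = max(set_combinations, key=total)
    return best, total(best)
```

For four SNPs A to D analysed as two sets of two primers, `set_combinations` would hold the three set combinations listed above, and the function would return whichever of them has the largest product.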

Preferably, the method of the present invention is embodied as a software program, which can be used by a doctor or a scientist to obtain an individual's genotype or haplotype. Figure 10 shows generally the use of the method for genotyping or haplotyping. Initially, the polymorphic sites 20 are defined by the user (the doctor or scientist). The software then compiles a list of possible primers 23 for each site, determines the possible PE products for each primer and accordingly makes a selection of a primer set 29. Primer extension 36 is then conducted, and the PE products are then subjected to DHPLC analysis 38. The chromatogram in Figure 10 would indicate an individual who is heterozygous at each of the polymorphic sites.

Figure 11 displays a screenshot for an exemplary user interface for a software program performing the method of the present invention, for analysis of polymorphic sites (in this instance SNPs). SNP Table 40 displays the SNPs selected by the user. Although not all of them can be seen in the screenshot, in this instance thirteen SNPs have been selected.

Terminating Nucleotide Table 42 allows the user to select which terminating nucleotide combinations should be considered by the software program. In this instance, no limitations have been applied, and therefore the program will consider all terminating nucleotide combinations.

Assay Size Table 44 allows the user to select the combination of assays to be used - the software program shown allows assays to be designed to analyse up to five SNPs by PE in one assay. Therefore, the selected primer set(s) will correspond to the combination of assays selected by the user. In this instance, the selected combination is for three primer sets - two of four primers, and one of five primers. The software performs the method of the present invention to select a combination of primers in each assay.

Limit Button 46 allows limitations to be set to the selection of primers. These limitations may include primer length, retention time, melting temperature or GC content. The SNP Table 40, Terminating Nucleotide Table 42, Assay Size Table 44 and Limit Button 46 allow the user to specify the inputs to the method.

The software then performs the method according to the present invention. Predicted Retention Time Table 48 displays a summary for each SNP (one SNP per row) and each terminating nucleotide combination (one terminating nucleotide combination per column) of the number of possible primers (shown in brackets) and the average primer retention time. Again, there are more predicted retention times in the Predicted Retention Time Table 48 than can be displayed in this screenshot.

Finally, the Assay Selection Table 50 displays the designed assays, including the terminating nucleotide combinations and the primer set(s) selected. The probability of no overlap for each assay is displayed alongside each primer set. The total no overlap probability (i.e. the likelihood that there will be no overlap for any of the three selected primer sets) is displayed in No Overlap Box 51. In this case, the displayed no overlap probability is 0.858, meaning there is less than a 15% chance that any of the primer pairs in any of the assays will overlap.

Figure 12 shows in more detail Assay 2 in Assay Selection Table 50. All of the primers and PE products are displayed in Sequence Table 52, along with their predicted retention times. The differing retention times have also been graphically presented in the predicted chromatogram 54.

The hardware and software platform for the implementation of this method is not restrictive of the present invention. The software may be programmed in any programming language, for example Python or C++, and typically the software will be run on a personal computer.

The present invention allows the design of multiplex PE reactions, for efficient analysis of the PE products using chromatography equipment. Chromatography equipment is much less expensive and more readily available than the equipment required for MALDI-TOF MS analysis of PE products. Accordingly, an application of the present invention is to allow for more widespread genotyping or haplotyping of patients before the prescription of a drug. In turn, this will help ensure the most suitable drug is prescribed, and reduce the incidence of adverse drug reactions.

Although a preferred embodiment of the present invention has been described in the foregoing detailed description, it will be understood that the invention is not limited to the embodiment disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope of the invention. Modifications and variations such as would be apparent to a skilled addressee are deemed within the scope of the present invention.

For example, although the preferred embodiment has been described with particular reference to analysing SNPs, the present invention could clearly be used to analyse insertion or deletion sites.

Throughout this specification and the claims that follow, unless the context requires otherwise, the words 'comprise' and 'include' and variations such as 'comprising' and 'including' will be understood to imply the inclusion of a stated integer or group of integers but not the exclusion of any other integer or group of integers.

The reference to any prior art in this specification is not, and should not be taken as, an acknowledgment or any form of suggestion that such prior art forms part of the common general knowledge.