NEXT-GENERATION SEQUENCING FOR PHASED HLA CLASS I ANTIGEN RECOGNITION DOMAIN EXONS

Document Type and Number:

WIPO Patent Application WO/2016/054135

Abstract:

Provided are methods for the rapid and accurate determination of the phase of nucleotide polymorphisms in the functionally important region of an HLA protein, i.e., which polymorphic nucleotides are encoded by the same allele. This allows the resolution of an HLA assignment to a specific genotype, and thereby more accurate matching for transplantation and more accurate testing of HLA in general for other applications. In accordance with the instant invention, the phase of polymorphic nucleotides in the functionally relevant region of the HLA molecule, the antigen recognition domain exons, is determined in a single next-generation sequence run.

Inventors:

HURLEY CAROLYN K (US)
NG-MCCLELLAND JENNIFER (US)
HOU LIHUA (US)

Application Number:

PCT/US2015/053087

Publication Date:

April 07, 2016

Filing Date:

September 30, 2015

Assignee:

UNIV GEORGETOWN (US)

International Classes:

C40B20/00; C12Q1/68; G01N33/50

Foreign References:

US20140206547A1

2014-07-24

Other References:

HOSOMICHI ET AL.: "Phase-defined complete sequencing of the HLA genes by next-generation sequencing", BMC GENOMICS, vol. 14, no. 355, 28 May 2013 (2013-05-28), pages 1 - 16

Attorney, Agent or Firm:

STEELE, Alan, W. et al. (Seaport West 155 Seaport Boulevard, Boston MA, US)

Claims:

CLAIMS

We claim:

1. A method of phase-defined genotyping of both alleles of at least one human leukocyte antigen (HLA) locus of a subject, comprising

amplifying a sample of human genomic DNA encoding an antigen recognition domain (ARD) of both alleles of at least one HLA locus, thereby forming a plurality of amplicons;

fragmenting the amplicons to give a plurality of fragments of about 200 to about 800 nucleotides;

sequencing the fragments using sequencing-by-synthesis, thereby generating a plurality of overlapping partial nucleotide sequences;

aligning the overlapping partial nucleotide sequences to determine a contiguous composite nucleotide sequence encoding the ARD of each allele of the at least one HLA locus;

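comparing the contiguous composite nucleotide sequences to a library of reference genomic sequences encoding ARDs of the at least one HLA locus; and

identifying each contiguous composite nucleotide sequence as either (i) a sequence encoding a known ARD of the at least one HLA locus, or (ii) a sequence encoding a novel ARD of the at least one HLA locus.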

2. The method of claim 1, wherein the at least one HLA locus is selected from the group consisting of HLA-A, HLA-B, HLA-C, and any combination thereof.

3. The method of claim 2, wherein the at least one HLA locus is HLA-A.

4. The method of claim 2, wherein the at least one HLA locus is HLA-B.

5. The method of claim 2, wherein the at least one HLA locus is HLA-C.

6. The method of claim 2, wherein the at least one HLA locus is HLA-A and HLA-B.

7. The method of claim 2, wherein the at least one HLA locus is HLA-A and HLA-C.

8. The method of claim 2, wherein the at least one HLA locus is HLA-B and HLA-C.

9. The method of claim 2, wherein the at least one HLA locus is HLA-A, HLA-B, and HLA-C.

10. The method of any one of claims 2-9, wherein each amplicon comprises DNA encoding exon 2, intron 2, and exon 3 of the at least one HLA locus.

11. The method of any one of the preceding claims, wherein the fragments are about 200 to about 500 nucleotides.

12. The method of claim 11, wherein the fragments are about 300 to about 400 nucleotides.

13. The method of any one of the preceding claims, wherein the fragmenting comprises acoustical shearing.

14. The method of claim 13, wherein the fragmenting further comprises end-repairing the fragments.

15. The method of any one of the preceding claims, further comprising labeling each fragment, prior to sequencing, with at least one source label.

16. The method of claim 15, wherein the at least one source label is an oligonucleotide label.

17. The method of claim 15 or 16, wherein each fragment is labeled with one source label.

18. The method of claim 15 or 16, wherein each fragment is labeled with two source labels.

19. The method of any one of claims 16-18, further comprising sequencing the at least one source label.

20. The method of any one of the preceding claims, further comprising attaching to each fragment, prior to sequencing, an oligonucleotide complementary to a sequencing primer.

21. The method of any one of the preceding claims, further comprising attaching to each fragment, prior to sequencing, an oligonucleotide adapter complementary to at least one immobilized bridge amplification primer.

22. The method of any one of the preceding claims, wherein the method is performed in a multiplex manner.

23. A kit, comprising

(a) paired oligonucleotide polymerase chain reaction (PCR) amplification primers suitable for use to amplify, from a sample of human genomic DNA, DNA encoding an antigen recognition domain (ARD) of both alleles of at least one human leukocyte antigen (HLA) locus;

(b) paired oligonucleotide adapters, each adapter oligonucleotide comprising a nucleotide sequence complementary to at least one bridge amplification primer immobilized on a substrate; and

(c) paired sequencing primers suitable for use to sequence amplification products prepared using the paired PCR amplification primers.

24. The kit of claim 23, wherein the at least one HLA locus is selected from the group consisting of HLA-A, HLA-B, HLA-C, and any combination thereof.

25. The kit of claim 24, wherein the at least one HLA locus is HLA-A.

26. The kit of claim 24, wherein the at least one HLA locus is HLA-B.

27. The kit of claim 24, wherein the at least one HLA locus is HLA-C.

28. The kit of claim 24, wherein the at least one HLA locus is HLA-A and HLA-B.

29. The kit of claim 24, wherein the at least one HLA locus is HLA-A and HLA-C.

30. The kit of claim 24, wherein the at least one HLA locus is HLA-B and HLA-C.

31. The kit of claim 24, wherein the at least one HLA locus is HLA-A, HLA-B, and HLA-C.

32. The kit of any one of claims 23-31, further comprising paired oligonucleotide adapters, each adapter comprising a unique sequence to be used in identifying the source of the sample of human genomic DNA.

33. The kit of any one of claims 23-32, further comprising enzymes T4 DNA polymerase, Klenow fragment of T4 DNA polymerase, and T4 polynucleotide kinase; and a buffer suitable for activity of said enzymes in repairing DNA fragments generated by acoustical shearing.

34. The kit of claim 33, further comprising a DNA polymerase and dATP in a buffer suitable for activity of said DNA polymerase.

35. The kit of any one of claims 23-34, further comprising at least one source label.

36. The kit of claim 35, wherein the at least one source label is an oligonucleotide label.

37. The kit of any one of claims 23-36, further comprising an oligonucleotide complementary to at least one of the paired sequencing primers.

38. The kit of any one of claims 23-25, 28, 29, and 31-37, wherein at least one HLA locus is HLA-A; and nucleotide sequences of the paired PCR amplification primers for HLA-A are selected from

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense); and

CCTCTGYGGGGAGAAGCAA (SEQ ID NO:3) (AIn1-46-F) (sense) and

GTCCCAATTGTCTCCCCTCCTT (SEQ ID NO:4) (3AIn3-62-R) (antisense).

39. The kit of any one of claims 23-25, 28, 29, and 31-37, wherein at least one HLA locus is HLA-A; and nucleotide sequences of the paired PCR amplification primers for HLA-A are

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense).

40. The kit of any one of claims 23, 24, 26, 28, and 30-37, wherein at least one HLA locus is HLA-B; and nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense); and

GGGAGGAGMRAGGGGACCGCAG (SEQ ID NO:8) (BIn1b-F) (sense), GGAGSCCATCCCCGSCGACCTAT (SEQ ID NO:9) (BIn3-R) (antisense), and GGAGGCCATCCCGGGCGATCTAT (SEQ ID NO:10) (BIn3-AC) (antisense).

41. The kit of any one of claims 23, 24, 26, 28, and 30-37, wherein the nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense).

42. The kit of any one of claims 23, 24, 27, and 29-37, wherein at least one HLA locus is HLA-C; and nucleotide sequences of the paired PCR amplification primers for HLA-C are AGCGAGGKGCCCGCCCGGCGA (SEQ ID NO:11) (5CIn1-61) (sense) and GGAGATGGGGAAGGCTCCCCACT (SEQ ID NO:12) (3BCIn3-12) (antisense).
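
By way of non-limiting illustration, several of the primer sequences recited above contain IUPAC ambiguity codes (Y, S, M, R, K), so each such primer denotes a small mixture of concrete oligonucleotides. The following Python sketch enumerates that mixture. The function name expand_degenerate is hypothetical and is not part of the application; the code table is the standard IUPAC nucleotide code.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes (two-base codes plus N).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "N": "ACGT",
}

def expand_degenerate(primer: str) -> list[str]:
    """Return every concrete sequence represented by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# SEQ ID NO:3 contains a single Y (C or T), so it expands to two sequences.
for sequence in expand_degenerate("CCTCTGYGGGGAGAAGCAA"):
    print(sequence)
```

A primer with two two-fold ambiguity codes, such as SEQ ID NO:8 (M and R), expands to four sequences in the same way.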

Description:

NEXT-GENERATION SEQUENCING FOR PHASED HLA CLASS I ANTIGEN RECOGNITION DOMAIN EXONS

RELATED APPLICATION

This application claims benefit of United States Provisional Patent Application No. 62/058,246, filed October 1, 2014.

GOVERNMENT SUPPORT

This invention was made with government support under grant numbers N00014-11-0590, N00014-12-0240, and N00014-13-0210 awarded by the Office of Naval Research. The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

DNA sequencing is a powerful technique for identifying allelic variation within the human leukocyte antigen (HLA) genes. Sequencing is usually focused on the most polymorphic exons of the class I (HLA-A, -B, -C) and class II (HLA-DR, -DQ, and -DP) genes. These exons encode the antigen recognition domain, the region of the HLA molecule that binds peptides and interacts with the T cell receptor for antigen and natural killer cell immunoglobulin-like receptors (KIR).

Today DNA sequencing is commonly used for unrelated donor and umbilical cord blood selection in hematopoietic stem cell transplantation (HSCT) used to treat leukemia, lymphoma, or other serious diseases affecting the hematopoietic system. Sequencing may also be used to identify an HLA-matched family donor for HSCT when similar HLA alleles are segregating in the family or where parents are unavailable. Selection of a donor who is allele matched with the patient for HLA-A, -B, -C, and -DRB1 increases survival following transplantation.

The concept behind next-generation sequencing (NGS) technology is similar to Sanger-based DNA sequencing: the bases of a single strand of DNA are sequentially identified from signals emitted as the strand is re-synthesized to complement a DNA template strand. NGS extends this process across millions of reactions in a massively parallel fashion, rather than being limited to a single or a few DNA fragments. This enables rapid sequencing of large stretches of DNA base pairs spanning entire genomes, with the latest instruments capable of producing hundreds of gigabases of data in a single sequencing run. In a typical application, genomic DNA (gDNA) is first fragmented into a library of small segments that can be uniformly and accurately sequenced in numerous (e.g., millions or even billions of) parallel reactions. The newly identified strings of bases, called reads, are then reassembled using a known reference genome as a scaffold (resequencing), or in the absence of a reference genome (de novo sequencing). The full set of aligned reads reveals the entire sequence of each chromosome in the gDNA sample.

There are a number of methods of NGS, including sequencing-by-synthesis, single-molecule real-time sequencing, ion semiconductor sequencing, pyrosequencing, and sequencing by ligation. The sequencing-by-synthesis method was developed by Shankar Balasubramanian and David Klenerman at the University of Cambridge, and it is described in International Publication No. WO 00/06770 and U.S. Patent Nos. 6,787,308 and 7,232,656, the entire disclosures of which are incorporated herein by reference.

SUMMARY OF THE INVENTION

An aspect of the invention is a method of phase-defined genotyping of both alleles of at least one human leukocyte antigen (HLA) locus of a subject, comprising

amplifying a sample of human genomic DNA encoding an antigen recognition domain (ARD) of both alleles of at least one HLA locus, thereby forming a plurality of amplicons;

fragmenting the amplicons to give a plurality of fragments of about 200 to about 800 nucleotides;

sequencing the fragments using sequencing-by-synthesis, thereby generating a plurality of overlapping partial nucleotide sequences;

aligning the overlapping partial nucleotide sequences to determine a contiguous composite nucleotide sequence encoding the ARD of each allele of the at least one HLA locus;

comparing the contiguous composite nucleotide sequences to a library of reference genomic sequences encoding ARDs of the at least one HLA locus; and

identifying each contiguous composite nucleotide sequence as either (i) a sequence encoding a known ARD of the at least one HLA locus, or (ii) a sequence encoding a novel ARD of the at least one HLA locus.

In certain embodiments, the at least one HLA locus is selected from the group consisting of HLA-A, HLA-B, HLA-C, and any combination thereof.

In certain embodiments, the at least one HLA locus is HLA-A, HLA-B, and HLA-C.

In certain embodiments, each amplicon comprises DNA encoding the entirety of exon 2, intron 2, and exon 3 of the at least one HLA locus.

In certain embodiments, the fragmenting comprises acoustical shearing, i.e., sonicating.

In certain embodiments, the method is performed in a multiplex manner.

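An aspect of the invention is a kit, comprising

(a) paired oligonucleotide polymerase chain reaction (PCR) amplification primers suitable for use to amplify, from a sample of human genomic DNA, DNA encoding an antigen recognition domain (ARD) of both alleles of at least one human leukocyte antigen (HLA) locus;

(b) paired oligonucleotide adapters, each adapter oligonucleotide comprising a nucleotide sequence complementary to at least one bridge amplification primer immobilized on a substrate; and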

(c) paired sequencing primers suitable for use to sequence amplification products prepared using the paired PCR amplification primers.

In certain embodiments, the kit further comprises paired oligonucleotide adapters, each adapter comprising a unique sequence to be used in identifying the source of the sample of human genomic DNA.

In certain embodiments, the kit further comprises enzymes T4 DNA polymerase, Klenow fragment of T4 DNA polymerase, and T4 polynucleotide kinase; a buffer suitable for activity of said enzymes in repairing DNA fragments generated by acoustical shearing; and, optionally, a DNA polymerase and dATP in a buffer suitable for activity of said DNA polymerase.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic drawing of the structure of a class I molecule. NH2, amino terminal end; COOH, carboxy terminal end.

Figure 2 is a schematic drawing depicting the structure of the human major histocompatibility gene complex (MHC) located on chromosome 6. A, B, and C in the HLA Class I Region represent loci encoding the alpha chain of HLA-A, -B, and -C, respectively. Also shown are the loci encoding the alpha (A) and beta (B) chains of HLA-DR, -DQ, and -DP.

Figure 3 is a schematic drawing of the structure of a class I alpha chain locus, corresponding mRNA, and corresponding polypeptide. In the DNA, exons are depicted as numbered boxes, and introns flank the exons. In the DNA, intron 1 is between exon 1 and exon 2, intron 2 is between exon 2 and exon 3, and intron 3 is between exon 3 and exon 4. In the polypeptide, numbered boxes correspond to structural and functional regions. L, leader; TM, transmembrane domain; Cyt, cytoplasmic domain.

Figure 4 is a schematic drawing of the structure of a class II molecule.

Figure 5 is a table of samples to be analyzed for HLA-A, -B, and -C from ten subjects. Note the paired unique indexes associated with the combined HLA-A, -B, and -C for each of the ten subjects. Each index shown includes a source label. For example, the index sequence ATTACTCG and index sequence TATAGCCT identify subject 1; and index sequence TCCGGAGA and index sequence TATAGCCT identify subject 2.
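
By way of non-limiting illustration, the paired-index scheme described for Figure 5 can be sketched in Python as a lookup from an index pair to a subject. Only the two index pairs quoted above are taken from the description; the function names, the read tuples, and the "undetermined" bin are hypothetical.

```python
# Index pairs quoted in the Figure 5 description; all other names are hypothetical.
SAMPLE_SHEET = {
    ("ATTACTCG", "TATAGCCT"): "subject 1",
    ("TCCGGAGA", "TATAGCCT"): "subject 2",
}

def assign_subject(index1: str, index2: str) -> str:
    """Map a pair of index reads to a subject, or to an 'undetermined' bin."""
    return SAMPLE_SHEET.get((index1, index2), "undetermined")

def demultiplex(reads):
    """Group (index1, index2, sequence) records by the subject they came from."""
    bins = {}
    for index1, index2, sequence in reads:
        bins.setdefault(assign_subject(index1, index2), []).append(sequence)
    return bins

# Toy records standing in for pooled sequencing output.
toy_reads = [
    ("ATTACTCG", "TATAGCCT", "ACGT"),
    ("TCCGGAGA", "TATAGCCT", "TTGA"),
    ("ATTACTCG", "TATAGCCT", "GGCA"),
    ("AAAAAAAA", "TATAGCCT", "CCCC"),
]
print(demultiplex(toy_reads))
```

Because the second index is shared between subjects 1 and 2 in this example, it is the pair of indexes, not either index alone, that identifies the source.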

Figure 6 is a table depicting representative results, related to the information in Figure 5, after alignment and assembly with reference.

DETAILED DESCRIPTION OF THE INVENTION

One challenge for unrelated donor registries is the complexity of the HLA system. The continuing discovery of novel alleles has resulted in loci with hundreds to thousands of alleles, for example, HLA-B with over 2000 alleles. DNA-based typing results obtained at recruitment of registry volunteers usually include many alternative (or ambiguous) genotypes. An added complexity for typing is that more than one pair of alleles share a diploid DNA sequence for these exons. These pairs of alleles differ in the phase of the polymorphisms, i.e., which of the alternative polymorphic nucleotides are located on a specific homologue of chromosome 6. As novel alleles are identified, the number of pairs of alleles sharing a diploid sequence increases and new ambiguities are identified.
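
By way of non-limiting illustration, the phase ambiguity described above can be shown with toy sequences in Python. The four five-base "alleles" are invented for the example and are not HLA sequences; the function name diploid_sequence is hypothetical.

```python
def diploid_sequence(allele_a: str, allele_b: str):
    """Unphased view of a heterozygote: the set of bases seen at each position."""
    return [frozenset(pair) for pair in zip(allele_a, allele_b)]

# Two different pairs of toy alleles. Each pair differs at positions 1 and 3,
# but the polymorphic bases are combined in opposite phase.
pair_one = ("ACGTA", "ATGCA")  # C..T on one chromosome, T..C on the other
pair_two = ("ACGCA", "ATGTA")  # C..C on one chromosome, T..T on the other

same_diploid = diploid_sequence(*pair_one) == diploid_sequence(*pair_two)
print("Same unphased diploid sequence:", same_diploid)
print("Same alleles:", set(pair_one) == set(pair_two))
```

Both pairs present the same mixture of bases at every position, so a method that reports only the unphased diploid sequence cannot tell the two genotypes apart.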

This means that, because of cost constraints, additional testing is required prior to donor selection to "phase" polymorphic nucleotides. This slows down the process and would not be ideal in a contingency situation. Even more important, however, is the impact of secondary assays on the robust nature of the HLA assignment. Today primary data and test reagents used in secondary assays are not readily incorporated into the initial result and are not captured by the registry. This is particularly true if the secondary assay uses a different testing technology than the initial assay (e.g., DNA sequencing followed by sequence-specific priming). In these cases, laboratory software is unlikely to capture and merge primary data from both results, making it difficult for the registry to collect this information. A second limitation is that the reagents used are selected based on the current alternative genotypes and do not take into account new alternatives that will appear over time.

An advantage of the instant invention is that single molecules of DNA are sequenced so that alleles are routinely separated and ambiguity is reduced. This should allow more rapid donor selection.

Human Leukocyte Antigens

The human leukocyte antigen (HLA) system includes multiple genes that are highly polymorphic in the human population (i.e., several thousand alleles at some of the HLA loci). HLA genes are encoded by 12 loci including three class I loci (HLA-A, -B, -C) and nine class II loci (HLA-DRA, -DRB1, -DRB3, -DRB4, -DRB5, -DQA1, -DQB1, -DPA1, -DPB1). These genes likely arose from gene duplications and crossing over since they share extensive sequence homology. Each gene is divided into 5-8 exons that encode the signal peptide, two or three extracellular domains, transmembrane region, and cytoplasmic tail. The coding sequence of each gene is approximately 1,100 base pairs in length for a class I gene and about 800 base pairs in length for a class II gene; if introns are included, an HLA gene is about 3,000 base pairs in length for a class I gene and about 5,000-10,000 bases in length for a class II gene.

While most HLA genes are present in a diploid state, individuals vary in the number of DRB genes carried from two to four. Usually two DRB1 loci are present, one on each copy of chromosome six; the additional loci can be one or two of the following loci, DRB3, DRB4, or DRB5.

The HLA loci are polymorphic. DRA has only seven alleles, while HLA-B has over 2,200. Alleles at a locus may differ by synonymous substitutions that do not alter the protein sequence or by nonsynonymous substitutions that alter the protein sequence. Some alleles are not expressed as full length proteins. For example, the presently known 2,271 HLA-B alleles encode 1,737 different proteins, and there are 73 non-expressed alleles.

Nomenclature used to designate HLA alleles and a description of the DNA sequence variation can be found, for example, at <http://hla.alleles.org/> and <http://www.ebi.ac.uk/imgt/hla/>.

The protein structures formed by the assembly of a HLA class I polypeptide with beta-2 microglobulin and by the assembly of HLA class II alpha and beta polypeptides are very similar. The amino-terminal regions of the HLA polypeptides form an antigen recognition domain (ARD), binding antigenic peptides within the cell for transport to the cell surface and interacting with antigen-receptors on T lymphocytes to trigger an adaptive immune response. The remainder of the HLA protein forms a scaffold for the ARD. The majority of the genetic variation within an HLA allele alters the sequence of the ARD, giving that region different specificities for antigenic peptide binding and T-cell antigen receptor interaction. Natural killer (NK) cell immunoglobulin-like receptors also interact with the ARD to influence NK cell killing.

HLA Genotyping

HLA genotypes are used in solid organ and bone marrow transplantation for donor selection to reduce immune responses to foreign tissue and to generate anti-tumor responses, to diagnose autoimmune diseases, to determine sensitivity to specific drugs, to determine the effectiveness of peptide-based vaccines, and in research studies including population genetic studies and studies of disease resistance and susceptibility. The instant invention allows the rapid and accurate determination of the phase of nucleotide polymorphisms in the functionally important region of an HLA protein, i.e., which polymorphic nucleotides are encoded by the same allele. This allows the resolution of an HLA assignment to a specific genotype, and thereby more accurate matching for transplantation and more accurate testing of HLA in general for other applications.

In accordance with the instant invention, the phase of polymorphic nucleotides in the functionally relevant region of the HLA molecule is determined in a single next-generation sequence run. Significantly, current DNA sequencing methods in wide use, Sanger sequencing and next-generation sequencing (NGS) using isolated HLA exons, do not phase nucleotides across the entire functionally important exons and thus cannot readily resolve alternative genotypes. In order to reach this level of resolution using Sanger sequencing, multi-step testing must be employed. For example, sequencing of a heterozygote by Sanger is followed by use of additional DNA sequencing primers selected based on the heterozygous sequence in order to determine phase. The second step sequence usually covers only a portion of the functionally important exons. Because this secondary sequencing is limited, new alleles may be missed. Another approach using today's methods is allele-specific amplification or cloning to isolate individual alleles for Sanger DNA sequencing. Again, these approaches rely on first obtaining a heterozygous sequence in order to select the PCR primers and/or to eliminate the artifacts of cloning. Some of these strategies require extensive expertise and knowledge of molecular biology which may be lacking in a hospital clinical laboratory. NGS for isolated exons does not establish phase between the exons characterized and may miss new alleles. Current attempts to develop NGS strategies are focused on sequencing of the entire HLA gene. Since much information is missing from the reference HLA database about exons and introns outside of the well characterized functionally important exons of this highly polymorphic system, these current efforts will be slow and it will be difficult to obtain precise HLA assignments until the reference database is much more complete.

In accordance with the instant invention, polymerase chain reaction (PCR) amplicons including exons encoding the HLA Class I (A, B, C) antigen recognition domain (ARD) and the intervening intron are sequenced using a sequencing-by-synthesis technique. By including the intron in next-generation sequencing, the phase of polymorphic residues is established throughout exons 2 and 3. This allows the identification of the HLA G group genotypes present or absent without ambiguity.

Advantageously, the methods of the invention can be used in multiplex format to process, simultaneously, genomic DNA from a plurality of unique samples, i.e., genomic DNA from multiple individuals.

HLA is too polymorphic to be accurately analyzed through whole genome sequencing. In contrast, targeted resequencing focuses on PCR-amplified HLA genes. Until now, such targeted resequencing has been carried out by amplification of the whole gene or by amplification of individual ARD-encoding exons only. While the former amplification strategy permits exon phasing, it is complicated by lack of robustness associated with the long amplicon, missing reference sequence information for exons and introns, more complex analysis of sequencing fragments, and generally more information than is needed for matching purposes, resulting in end-user confusion. The amplification strategy using only individual ARD-encoding exons has the advantage of simplified analysis of sequence, but it does not phase exons. A third strategy of amplification for targeted re-sequencing, employed by the methods of the present invention, involves amplification of a region or regions of an HLA gene encoding antigen recognition domain (ARD) exons and intervening intron or introns. Such strategy includes the region most important in matching and phases exons.

In certain embodiments, the methods of the invention include, in a general sense, the steps of amplifying genomic DNA; fragmenting the amplified DNA; attaching bar codes and annealing sites (sequencing adapters), for example through a second round of PCR; PCR clean-up and size selection; sample normalization and pooling of multiple samples to form a library; sequencing by synthesis, for example using an Illumina® (San Diego, Calif.) platform; and analyzing sequence data.
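
By way of non-limiting illustration, the sample normalization and pooling step can be sketched in Python as an equimolar pooling calculation. The sketch assumes the common approximation of 660 g/mol per base pair of double-stranded DNA; the library names, concentrations, fragment lengths, and target amount are hypothetical and are not taken from the application.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair, a common approximation for dsDNA

def molarity_nM(conc_ng_per_ul: float, mean_length_bp: float) -> float:
    """Convert a mass concentration (ng/uL) to a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_length_bp)

def pooling_volumes_ul(libraries: dict, target_fmol: float) -> dict:
    """Volume of each library that contributes the same number of molecules.

    1 nM equals 1 fmol/uL, so volume (uL) = target amount (fmol) / molarity (nM).
    """
    return {
        name: target_fmol / molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }

# Hypothetical libraries: name -> (concentration in ng/uL, mean fragment length in bp).
libraries = {"subject 1": (6.6, 500), "subject 2": (3.3, 500)}
for name, volume in pooling_volumes_ul(libraries, target_fmol=10.0).items():
    print(f"{name}: {volume:.2f} uL")
```

In this sketch the more dilute library contributes a proportionally larger volume, so that each subject is represented by about the same number of template molecules in the pooled library.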

Sequencing-by-Synthesis

The sequencing-by-synthesis method is similar to Sanger sequencing, but it uses modified dNTPs containing a terminator which blocks further polymerization, so only a single base can be added by a polymerase enzyme to each growing DNA copy strand. The sequencing reaction is conducted simultaneously on a very large number (many millions or more) of different template molecules spread out on a solid surface. The terminator also contains a fluorescent label, which can be detected by a camera or other suitable optical device.

In a common embodiment, sequencing-by-synthesis technology uses four fluorescently labeled nucleotides to sequence the tens of millions of clusters on the flow cell surface in parallel. During each sequencing cycle, a single labeled deoxynucleoside triphosphate (dNTP) is added to the nucleic acid chain. The nucleotide label serves as a terminator for polymerization, so after each dNTP incorporation, the fluorescent dye is imaged to identify the base and then enzymatically cleaved to allow incorporation of the next nucleotide. Since all four reversible terminator-bound dNTPs (A, C, T, G) are present as single, separate molecules, natural competition minimizes incorporation bias. Base calls are made directly from signal intensity measurements during each cycle, which greatly reduces raw error rates compared to other technologies. The end result is highly accurate base-by-base sequencing that eliminates sequence-context specific errors, enabling robust base calling across the genome, including repetitive sequence regions and within homopolymers.
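
By way of non-limiting illustration, base calling from per-cycle signal intensities can be sketched in Python as choosing the brightest of four channels in each cycle. This is a deliberately simplified sketch: the intensity values are invented, the function name call_bases is hypothetical, and commercial base callers additionally correct for channel cross-talk and signal decay and assign quality scores, none of which is modeled here.

```python
CHANNELS = "ACGT"  # assumed channel order for this sketch

def call_bases(intensities) -> str:
    """Call one base per cycle for a single cluster.

    intensities: sequence of per-cycle (A, C, G, T) signal tuples.
    """
    read = []
    for cycle in intensities:
        brightest = max(range(4), key=lambda channel: cycle[channel])
        read.append(CHANNELS[brightest])
    return "".join(read)

# Three sequencing cycles of invented signal values for one cluster.
cluster_signal = [
    (0.9, 0.1, 0.0, 0.1),
    (0.1, 0.2, 0.8, 0.1),
    (0.0, 0.1, 0.1, 0.7),
]
print(call_bases(cluster_signal))
```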

In an alternative embodiment, only a single fluorescent color is used, so each of the four bases must be added in a separate cycle of DNA synthesis and imaging. Following the addition of the four dNTPs to the templates, the images are recorded and the terminators are removed. This chemistry is called "reversible terminators". Finally, another four cycles of dNTP additions are initiated. Since single bases are added to all templates in a uniform fashion, the sequencing process produces a set of DNA sequence reads of uniform length.

The fluorescent imaging system used in sequencers is not sensitive enough to detect the signal from a single template molecule; the major innovation of the sequencing-by-synthesis method is therefore the amplification of template molecules on a solid surface. The DNA sample is prepared into a "sequencing library" by fragmentation into pieces each typically around 200 to 800 nucleotides long. Custom adapters are added to each end and the library is flowed across a solid surface (the "flow cell"), whereby the template fragments bind to this surface. Following this, a solid phase "bridge amplification" PCR process (cluster generation) creates approximately one million copies of each template in tight physical clusters on the flow cell surface. These clusters are of sufficient size and density to permit signal detection.

Amplicon sequencing allows researchers to sequence small, selected regions of the genome spanning hundreds of base pairs. Commercially available NGS amplicon library preparation kits allow researchers to perform rapid in-solution amplification of custom-targeted regions from genomic DNA. Using this approach, thousands of amplicons spanning multiple samples can be simultaneously prepared and indexed in a matter of hours. With the ability to process numerous amplicons and samples on a single run, NGS is much more cost-effective than CE-based Sanger sequencing technology, which does not scale with the number of regions and samples required in complex study designs. NGS enables researchers to simultaneously analyze all genomic content of interest in a single experiment, at a fraction of the time and cost.

This highly targeted NGS approach enables a wide range of applications for discovering, validating, and screening genetic variants for various study objectives.

Amplicon sequencing is well-suited for clinical environments, where researchers are examining a limited number of disease-related highly polymorphic genes like HLA.

Methods of the Invention

An aspect of the invention is a method of phase-defined genotyping of both alleles of at least one human leukocyte antigen (HLA) locus of a subject, comprising amplifying a sample of human genomic DNA encoding an antigen recognition domain (ARD) of both alleles of at least one HLA locus, thereby forming a plurality of amplicons;

fragmenting the amplicons to give a plurality of fragments of about 200 to about 800 nucleotides;

sequencing the fragments using sequencing-by-synthesis, thereby generating a plurality of overlapping partial nucleotide sequences;

aligning the overlapping partial nucleotide sequences to determine a contiguous composite nucleotide sequence encoding the ARD of each allele of the at least one HLA locus; and

comparing the contiguous composite nucleotide sequences to a library of reference genomic sequences encoding ARDs of the at least one HLA locus.

In certain embodiments, the method further includes the step of identifying each contiguous composite nucleotide sequence as either (i) a sequence encoding a known ARD of the at least one HLA locus, or (ii) a sequence encoding a novel ARD of the at least one HLA locus.
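
By way of non-limiting illustration, the comparing and identifying steps can be sketched in Python as a lookup of each composite sequence in a reference library. The sketch assumes exact string matching; the library entries are short invented placeholders rather than real HLA reference sequences, and the function name classify_composite is hypothetical.

```python
# Hypothetical placeholder library: reference name -> ARD-encoding sequence.
REFERENCE_LIBRARY = {
    "ref_allele_1": "ACGTACGT",
    "ref_allele_2": "ACGTTCGT",
    "ref_allele_3": "ACGTTCGT",  # identical to ref_allele_2 across this region
}

def classify_composite(composite: str, library: dict):
    """Label a composite sequence as known (with matching names) or novel."""
    matches = sorted(name for name, seq in library.items() if seq == composite)
    if matches:
        return "known", matches
    return "novel", []

print(classify_composite("ACGTACGT", REFERENCE_LIBRARY))
print(classify_composite("ACGTTCGT", REFERENCE_LIBRARY))
print(classify_composite("ACGAACGT", REFERENCE_LIBRARY))
```

Returning every matching name, rather than the first one, reflects the fact that several reference alleles can be identical across the region sequenced, in which case the composite sequence identifies the group rather than a single allele.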

As discussed above, there are numerous HLA proteins and corresponding genes and loci encoding them. Normally, each nucleated diploid cell has both a maternal allele and a paternal allele for each HLA locus, e.g., a maternal HLA-A allele and a paternal HLA-A allele. In accordance with the methods of the invention, not only can both alleles of a given locus be sequenced and phased simultaneously, but also both alleles of a plurality of loci can be sequenced and phased simultaneously.

In certain embodiments, the at least one HLA locus is a class I HLA locus.

In certain embodiments, the at least one HLA locus is HLA-A.

In certain embodiments, the at least one HLA locus is HLA-B.

In certain embodiments, the at least one HLA locus is HLA-C.

In certain embodiments, the at least one HLA locus is HLA-A and HLA-B.

In certain embodiments, the at least one HLA locus is HLA-A and HLA-C.

In certain embodiments, the at least one HLA locus is HLA-B and HLA-C.

In certain embodiments, the at least one HLA locus is HLA-A, HLA-B, and HLA-C.

The term "phase-defined genotyping" as used herein generally refers to elucidating the nucleotide sequence of a single allele of an HLA-encoding locus on a first chromosome with sufficient detail to distinguish it from a heterologous allele at the same locus on a second chromosome. This term can also be understood as defining individual HLA haplotypes. In a preferred embodiment, "phase-defined genotyping" refers to elucidating the nucleotide sequences of both alleles of an HLA-encoding locus with sufficient detail to distinguish one allele from the other and one genotype from another. Of course, when two alleles (e.g., maternal and paternal alleles) are completely identical, it will not be possible to distinguish one from the other. Information generated by the method is used to separate the two chromosomes and to determine the two phase-defined HLA gene sequences for any given HLA locus of a subject. Taking advantage of the highly polymorphic nature of the HLA genes, a wide range of library sizes, and massively parallel sequencing, it becomes possible to phase sequence reads on a chromosome and to tile the phased reads to generate HLA gene haplotype sequences from the large numbers of individuals needed to maintain a hematopoietic stem cell registry of volunteer donors.

In certain embodiments, the term "phase-defined genotyping" as used herein refers to elucidating the nucleotide sequence of a single allele of an HLA-encoding locus on a first chromosome with sufficient detail to distinguish it from a reference allele at the same locus on a second chromosome. In such embodiments the reference allele can be a known haplotype sequence, for example, a haplotype sequence in a library of known haplotype sequences.

Amplification primers are designed and selected so that, when they are used to amplify a sample of human genomic DNA encoding an ARD of both alleles of at least one HLA locus, the resulting amplification products include a plurality of amplicons comprising sequence encoding the ARD of both alleles of the at least one HLA locus.

For class I HLA, DNA encoding an ARD generally includes all of exon 2, all of intron 2, and all of exon 3. Accordingly, in certain embodiments, each amplicon comprises DNA encoding all of exon 2, all of intron 2, and all of exon 3 of the at least one HLA locus, i.e., each amplicon comprises DNA encoding exon 2, intron 2, and exon 3 of the at least one HLA locus. Each such amplicon optionally can include additional sequence from intron 1, intron 3, or both intron 1 and intron 3.

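In certain embodiments, each amplicon comprises DNA encoding part of exon 2, all of intron 2, and all of exon 3 of the at least one HLA locus. In certain various embodiments, the part of exon 2 can comprise at least 10 percent, at least 20 percent, at least 30 percent, at least 40 percent, at least 50 percent, at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, at least 95 percent, or at least 98 percent of the 3' end of exon 2. That is, if all of exon 2 occupied only 100 nucleotides, then in certain various embodiments, the part of exon 2 can comprise at least the last (i.e., 3') 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, or 98 nucleotides of exon 2.

In certain embodiments, each amplicon comprises DNA encoding all of exon 2, all of intron 2, and part of exon 3 of the at least one HLA locus. In certain various embodiments, the part of exon 3 can comprise at least 10 percent, at least 20 percent, at least 30 percent, at least 40 percent, at least 50 percent, at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, at least 95 percent, or at least 98 percent of the 5' end of exon 3. That is, if all of exon 3 occupied only 100 nucleotides, then in certain various embodiments, the part of exon 3 can comprise at least the first (i.e., 5') 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, or 98 nucleotides of exon 3. Each such amplicon optionally can include additional sequence from intron 1.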

In certain embodiments, each amplicon comprises DNA encoding part of exon 2, all of intron 2, and part of exon 3 of the at least one HLA locus. In certain various embodiments, the part of exon 2 can comprise at least 10 percent, at least 20 percent, at least 30 percent, at least 40 percent, at least 50 percent, at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, at least 95 percent, or at least 98 percent of the 3' end of exon 2. That is, if all of exon 2 occupied only 100 nucleotides, then in certain various embodiments, the part of exon 2 can comprise at least the last (i.e., 3') 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, or 98 nucleotides of exon 2. In certain various embodiments, the part of exon 3 can comprise at least 10 percent, at least 20 percent, at least 30 percent, at least 40 percent, at least 50 percent, at least 60 percent, at least 70 percent, at least 80 percent, at least 90 percent, at least 95 percent, or at least 98 percent of the 5' end of exon 3. That is, if all of exon 3 occupied only 100 nucleotides, then in certain various embodiments, the part of exon 3 can comprise at least the first (i.e., 5') 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, or 98 nucleotides of exon 3.
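The percentage language above amounts to counting nucleotides from one end of an exon: from the 3' end for exon 2 and from the 5' end for exon 3. The following is a minimal Python sketch of that arithmetic. The function names `partial_exon2` and `partial_exon3` and the 100-nucleotide toy exon are hypothetical and serve only to mirror the text's own "100 nucleotides" illustration.

```python
import math

def partial_exon3(exon3: str, percent: float) -> str:
    """Return `percent` of exon 3 counted from its 5' end (rounded up)."""
    n = math.ceil(len(exon3) * percent / 100)
    return exon3[:n]

def partial_exon2(exon2: str, percent: float) -> str:
    """Return `percent` of exon 2 counted from its 3' end (rounded up)."""
    n = math.ceil(len(exon2) * percent / 100)
    return exon2[len(exon2) - n:]

# Toy exon of exactly 100 nucleotides, as in the illustration in the text.
toy_exon = "ACGT" * 25

first_30 = partial_exon3(toy_exon, 30)  # the first (5') 30 nucleotides
last_98 = partial_exon2(toy_exon, 98)   # the last (3') 98 nucleotides
```

With a 100-nucleotide exon, each listed percentage maps directly onto the same number of nucleotides, which is the point of the example in the text.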

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-A are selected from

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense); and

CCTCTGYGGGGAGAAGCAA (SEQ ID NO:3) (AIn1-46-F) (sense) and

GTCCCAATTGTCTCCCCTCCTT (SEQ ID NO:4) (3AIn3-62-R) (antisense).

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-A are

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense).

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-A are

CCTCTGYGGGGAGAAGCAA (SEQ ID NO:3) (AIn1-46-F) (sense) and

GTCCCAATTGTCTCCCCTCCTT (SEQ ID NO:4) (3AIn3-62-R) (antisense).
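Several of the primer sequences listed in this document contain letters other than A, C, G, and T (for example Y, S, M, R, and K). The text does not explain them; the sketch below assumes they are standard IUPAC nucleotide ambiguity codes, in which case each such primer stands for a small mixture of unambiguous oligonucleotides. The helper name `expand_degenerate` is hypothetical.

```python
from itertools import product

# Standard IUPAC ambiguity codes (assumed to be what the primer listings use).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
}

def expand_degenerate(primer: str) -> list[str]:
    """List every unambiguous sequence represented by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# SEQ ID NO:3 contains a single Y (C or T), so it expands to two sequences.
variants = expand_degenerate("CCTCTGYGGGGAGAAGCAA")
for v in variants:
    print(v)
# CCTCTGCGGGGAGAAGCAA
# CCTCTGTGGGGAGAAGCAA
```

A primer with two two-fold ambiguous positions, such as SEQ ID NO:8 (M and R), would expand to four sequences under the same assumption.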

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense); and

GGGAGGAGMRAGGGGACCGCAG (SEQ ID NO:8) (BIn1b-F) (sense), GGAGSCCATCCCCGSCGACCTAT (SEQ ID NO:9) (BIn3-R) (antisense), and GGAGGCCATCCCGGGCGATCTAT (SEQ ID NO:10) (BIn3-AC) (antisense).

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense).

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

GGGAGGAGMRAGGGGACCGCAG (SEQ ID NO:8) (BIn1b-F) (sense), GGAGSCCATCCCCGSCGACCTAT (SEQ ID NO:9) (BIn3-R) (antisense), and GGAGGCCATCCCGGGCGATCTAT (SEQ ID NO:10) (BIn3-AC) (antisense).

In certain embodiments, nucleotide sequences of the paired PCR amplification primers for HLA-C are

AGCGAGGKGCCCGCCCGGCGA (SEQ ID NO:11) (5CIn1-61) (sense) and GGAGATGGGGAAGGCTCCCCACT (SEQ ID NO:12) (3BCIn3-12) (antisense).

The amplicons are fragmented to give a plurality of fragments of about 200 to about 800 nucleotides. In certain embodiments, the fragments are about 200 to about 500 nucleotides. In certain embodiments, the fragments are about 300 to about 400 nucleotides.

Generally, the fragmentation will be random. In certain embodiments, the fragmentation comprises acoustical shearing, i.e., sonication. In certain embodiments, the fragmentation comprises enzymatic cleavage, for example using a transposase or the like. In certain embodiments, the fragmentation results in fragments having blunt ends. In certain embodiments, the fragmentation results in fragments having single-strand 5' overhangs, 3' overhangs, or both 5' overhangs and 3' overhangs. For example, fragmentation with acoustical shearing generally will result in fragments with single-strand 5' overhangs, 3' overhangs, or both 5' overhangs and 3' overhangs.
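Because fragmentation is random and acts on many copies of each amplicon, the resulting fragments start at different positions and therefore overlap. The following Python sketch is a toy model of that idea only; it does not simulate the physics of acoustical shearing or any enzyme. The function name `sample_fragments`, the uniform length distribution, and the 1,000-nucleotide toy amplicon are all assumptions made for illustration.

```python
import random

def sample_fragments(amplicon: str, n: int, min_len: int = 200,
                     max_len: int = 800, seed: int = 0) -> list[tuple[int, str]]:
    """Draw n random fragments (start position, sequence) from an amplicon.

    Each draw models one fragment from a different copy of the amplicon.
    Assumes len(amplicon) >= min_len.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    fragments = []
    for _ in range(n):
        length = rng.randint(min_len, min(max_len, len(amplicon)))
        start = rng.randint(0, len(amplicon) - length)
        fragments.append((start, amplicon[start:start + length]))
    return fragments

toy_amplicon = "ACGT" * 250  # 1,000 nucleotides, not a real HLA sequence
fragments = sample_fragments(toy_amplicon, n=50)
```

Every fragment drawn this way falls within the 200 to 800 nucleotide range given above, and fragments with different start positions overlap one another along the amplicon.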

In certain embodiments, the method further includes end-repairing such fragments, for example with enzymes selected from T4 DNA polymerase, Klenow fragment of E. coli DNA polymerase I, T4 polynucleotide kinase, and any combination thereof.

In certain embodiments, the method further comprises labeling each fragment, prior to sequencing, with at least one source label. The source label can be designed and used to associate a source (subject or potential donor) with any given piece of DNA. For example, DNA from a subject can be amplified, sheared, optionally end-repaired, and optionally labeled, all prior to sequencing. Importantly, DNA from a first subject can be amplified, sheared, optionally end-repaired, and optionally labeled, all prior to pooling such DNA with corresponding DNA from a second subject, prior to sequencing. Advantageously, DNA from a first subject can be amplified, sheared, optionally end-repaired, and optionally labeled, all prior to pooling such DNA with corresponding DNA from a plurality of other subjects, prior to sequencing. In such embodiments, DNA of any one subject can be differentiated from DNA of any other subject or plurality of subjects, even when such DNA is pooled prior to sequencing.

In certain embodiments, the at least one source label is an oligonucleotide label. Such oligonucleotide label is sometimes referred to as a barcode or index, and it can be attached to an amplicon or fragment thereof by any suitable method, including, for example, ligation. Such oligonucleotide labels are generally synthetic oligonucleotides, about 8 to about 40 nucleotides long, characterized by a specific nucleotide sequence. In certain embodiments, an oligonucleotide label comprises about 8 to about 16 nucleotides. In certain embodiments, an oligonucleotide label comprises about 12 to about 40 nucleotides. In certain embodiments, an oligonucleotide label comprises about 15 to about 30 nucleotides. In certain embodiments, an oligonucleotide label comprises about 20 to about 25 nucleotides. In certain embodiments, an oligonucleotide label consists of 8 nucleotides.

In certain embodiments, the oligonucleotide label is part of a longer oligonucleotide construct comprising additional functional sequence, e.g., annealing site or adapter suitable for making the modified fragment compatible with a sequencing primer, an immobilized bridge amplification primer of complementary sequence (part of the sequencing strategy), or both a sequencing primer and an immobilized bridge amplification primer.

Other types of source labels are also contemplated by the invention. Such alternative source labels can include, for example, radiolabels, fluorescent tags, chemical tags, and the like.

In certain embodiments, each fragment is labeled with one source label.

In certain embodiments, each fragment is labeled with two source labels. The two source labels can be the same as or different from one another.

For embodiments in which at least one source label is an oligonucleotide, generally such source label will be sequenced along with the amplified DNA to which it is attached.
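Because an oligonucleotide source label is sequenced together with the DNA it is attached to, pooled reads can be sorted back to their subjects after sequencing. The following Python sketch illustrates that sorting under simplifying assumptions: the index is taken to be the first 8 nucleotides of each read (sequencing instruments commonly report indices separately), and the index sequences, subject names, and the function name `demultiplex` are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads: list[str], index_to_subject: dict[str, str],
                index_len: int = 8) -> dict[str, list[str]]:
    """Group reads by subject using the index at the start of each read."""
    by_subject = defaultdict(list)
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        # Reads whose index is not recognized are set aside, not discarded.
        subject = index_to_subject.get(index, "undetermined")
        by_subject[subject].append(insert)
    return dict(by_subject)

# Hypothetical 8-nucleotide indices for two subjects.
index_to_subject = {"ACGTACGT": "subject_1", "TGCATGCA": "subject_2"}
pooled_reads = [
    "ACGTACGTGGGCCC",
    "TGCATGCAAAATTT",
    "ACGTACGTTTTAAA",
    "NNNNNNNNCCCGGG",  # unreadable index
]
sorted_reads = demultiplex(pooled_reads, index_to_subject)
```

The same grouping applies whether DNA from two subjects or from many subjects was pooled, as long as each subject's fragments carry a distinct index.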

In certain embodiments, the method further comprises attaching to each fragment, prior to sequencing, an oligonucleotide complementary to a sequencing primer.

In certain embodiments, the method further comprises attaching to each fragment, prior to sequencing, an oligonucleotide adapter complementary to at least one immobilized bridge amplification primer. Bridge amplification is part of and preparatory to sequencing-by-synthesis, whereby clusters of immobilized sequencing templates are formed on a surface. Each such cluster typically can include approximately 10⁶ copies of a given template.

The method optionally can include a clean-up step prior to sequencing. For example, the clean-up step can comprise a sizing step, a quantity normalization step, or both a sizing step and a quantity normalization step in preparation for sequencing.

In certain embodiments, the method is performed in a multiplex manner.

Typically, the method comprises the step of pooling samples (amplicon fragments) prepared as described above from a plurality of loci and a plurality of subjects, prior to sequencing. The fragments, e.g., pooled sample fragments, are then sequenced using sequencing-by-synthesis, thereby generating a plurality of overlapping partial nucleotide sequences. Preferably, the sequencing will result in so-called deep sequencing.

Deep sequencing means that the total number of reads is many times larger than the length of the sequence under study. Coverage is the average number of reads representing a given nucleotide in the reconstructed sequence. Depth can be calculated from the length of the original genome or sequence under study (G), the number of reads (N), and the average read length (L) as N × L / G. For example, a hypothetical genome or sequence with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 2x redundancy. The same hypothetical genome or sequence with 2,000 base pairs reconstructed from 80 reads with an average length of 500 nucleotides will have 20x redundancy, and the same hypothetical genome or sequence with 2,000 base pairs reconstructed from 400 reads with an average length of 500 nucleotides will have 100x redundancy. Generally, obtaining several hundred high quality reads at each position along the amplicons is sufficient for purposes of the invention.
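The depth calculation described above (number of reads times average read length, divided by the length of the sequence under study) can be checked with a few lines of Python. The function name `sequencing_depth` is hypothetical; the numbers are the three worked examples from the text.

```python
def sequencing_depth(sequence_length: int, num_reads: int,
                     avg_read_length: float) -> float:
    """Average depth: reads x average read length / length of the sequence."""
    return num_reads * avg_read_length / sequence_length

# The three examples from the text: a 2,000 bp sequence, 500-nucleotide reads.
for num_reads in (8, 80, 400):
    depth = sequencing_depth(2000, num_reads, 500)
    print(f"{num_reads} reads -> {depth:.0f}x")
# 8 reads -> 2x
# 80 reads -> 20x
# 400 reads -> 100x
```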

The result is many overlapping short reads that cover the region being sequenced. Confident single-nucleotide polymorphism (SNP) calls typically require a read depth of 30-40x, although in some instances as little as 15x may suffice. Reads are "paired," meaning that both the sense and antisense strands are sequenced. Software assembles the sequence either de novo or by comparison to a reference used as a scaffold.

The overlapping partial nucleotide sequences are then aligned to determine a contiguous composite nucleotide sequence encoding the ARD of each allele of the at least one HLA locus. This alignment step typically uses publicly or commercially available computer-based nucleotide sequence alignment tools, e.g., a genome browser.
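Phase information comes from individual reads that span more than one polymorphic position: the nucleotides observed together on one read must come from the same allele. The following Python sketch illustrates that principle for two positions. It is a toy illustration, not the algorithm used by any particular alignment or HLA analysis software; it assumes ungapped reads with known start coordinates, and the function name `phase_two_sites` and the toy reads are hypothetical.

```python
from collections import Counter

def phase_two_sites(reads: list[tuple[int, str]], pos_a: int, pos_b: int) -> Counter:
    """Count nucleotide pairs seen together on reads spanning both positions.

    reads: (start, sequence) pairs, ungapped, in reference coordinates.
    """
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_a < end and start <= pos_b < end:
            pairs[(seq[pos_a - start], seq[pos_b - start])] += 1
    return pairs

# Toy reads over a 10-nucleotide region with polymorphic positions 2 and 7.
toy_reads = [
    (0, "TTATTTTGTT"),  # A at position 2 together with G at position 7
    (1, "TCTTTTTTT"),   # C at position 2 together with T at position 7
    (0, "TTCTTTTTTT"),  # C with T
    (2, "ATTTTG"),      # A with G
    (0, "TTATT"),       # covers position 2 only, so it carries no phase information
]
linked = phase_two_sites(toy_reads, pos_a=2, pos_b=7)
```

In this toy data set the pairs A with G and C with T are each supported by two reads, while A with T and C with G are never observed, which is consistent with one allele carrying A and G and the other carrying C and T.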

In certain embodiments, the contiguous composite nucleotide sequence includes all of exon 2, all of intron 2, and all of exon 3. In certain such embodiments, the contiguous composite nucleotide sequence further includes at least a part of intron 1, at least a part of intron 3, or at least a part of intron 1 and at least a part of intron 3.

In certain embodiments, the contiguous composite nucleotide sequence includes part of exon 2, all of intron 2, and all of exon 3. In certain such embodiments, the contiguous composite nucleotide sequence further includes at least a part of intron 3.

In certain embodiments, the contiguous composite nucleotide sequence includes all of exon 2, all of intron 2, and part of exon 3. In certain such embodiments, the contiguous composite nucleotide sequence further includes at least a part of intron 1.

Following the alignment step just described, the method includes the step of comparing the contiguous composite nucleotide sequences to a library of reference genomic sequences encoding ARDs of the at least one HLA locus. This comparison step typically uses publicly or commercially available computer-based nucleotide sequence analysis tools, e.g., Connexio Assign and Omixon; and libraries of known HLA genomic sequences, e.g., <http://www.ebi.ac.uk/ipd/imgt/hla/>.
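The comparison step, together with the identification of each composite sequence as known or novel described earlier, can be pictured as a lookup against the reference library. The following Python sketch uses exact string matching on toy sequences. The function name `classify_composite`, the reference names, and the sequences are hypothetical, and real analysis software is expected to be considerably more elaborate.

```python
def classify_composite(composite: str,
                       reference_library: dict[str, str]) -> tuple[str, list[str]]:
    """Label a composite sequence 'known' (with matching names) or 'novel'."""
    query = composite.upper()
    matches = sorted(name for name, seq in reference_library.items()
                     if seq.upper() == query)
    if matches:
        return "known", matches
    return "novel", []

# Toy reference library; real entries would be full ARD genomic sequences.
toy_library = {
    "ref_allele_1": "ACGTTGCA",
    "ref_allele_2": "ACGTAGCA",
}

known_result = classify_composite("acgtagca", toy_library)
novel_result = classify_composite("ACGTCGCA", toy_library)
```

The function returns a list of names because more than one reference entry can share the same sequence over the region compared.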

Kits of the Invention

An aspect of the invention is a kit, comprising

(b) paired oligonucleotide adapters, each adapter oligonucleotide comprising a nucleotide sequence complementary to at least one bridge amplification primer immobilized on a substrate; and

(c) paired sequencing primers suitable for use to sequence amplification products prepared using the paired PCR amplification primers.

In certain embodiments, the at least one HLA locus is a class I HLA locus.

In certain embodiments, the at least one HLA locus is HLA-A.

In certain embodiments, the at least one HLA locus is HLA-B.

In certain embodiments, the at least one HLA locus is HLA-C.

In certain embodiments, the at least one HLA locus is HLA-A and HLA-B.

In certain embodiments, the at least one HLA locus is HLA-A and HLA-C.

In certain embodiments, the at least one HLA locus is HLA-B and HLA-C.

In certain embodiments, the at least one HLA locus is HLA-A, HLA-B, and HLA-C.

In certain embodiments, at least one HLA locus is HLA-A; and nucleotide sequences of the paired PCR amplification primers for HLA-A are selected from

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense); and

CCTCTGYGGGGAGAAGCAA (SEQ ID NO:3) (AIn1-46-F) (sense) and

GTCCCAATTGTCTCCCCTCCTT (SEQ ID NO:4) (3AIn3-62-R) (antisense).

In certain embodiments, at least one HLA locus is HLA-A; and nucleotide sequences of the paired PCR amplification primers for HLA-A are

CCCAGACGCCGAGGATGGCCG (SEQ ID NO:1) (5A2) (sense) and

GCAGGGCGGAACCTCAGAGTCACTCTCT (SEQ ID NO:2) (3A2) (antisense).

In certain embodiments, at least one HLA locus is HLA-A; and nucleotide sequences of the paired PCR amplification primers for HLA-A are

CCTCTGYGGGGAGAAGCAA (SEQ ID NO:3) (AIn1-46-F) (sense) and

GTCCCAATTGTCTCCCCTCCTT (SEQ ID NO:4) (3AIn3-62-R) (antisense).

In certain embodiments, at least one HLA locus is HLA-B; and nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense); and

GGGAGGAGMRAGGGGACCGCAG (SEQ ID NO:8) (BIn1b-F) (sense), GGAGSCCATCCCCGSCGACCTAT (SEQ ID NO:9) (BIn3-R) (antisense), and GGAGGCCATCCCGGGCGATCTAT (SEQ ID NO:10) (BIn3-AC) (antisense).

In certain embodiments, at least one HLA locus is HLA-B; and nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

CCGAACCSTCCTCCTGCTGCTCT (SEQ ID NO:5) (Bex1-BT) (sense), CCATCCCCGGCGACCTATAGGAGATG (SEQ ID NO:6) (3B1) (antisense), and AGGCCATCCCGGGCGATCTAT (SEQ ID NO:7) (3B1-AC) (antisense).

In certain embodiments, at least one HLA locus is HLA-B; and nucleotide sequences of the paired PCR amplification primers for HLA-B are selected from

GGGAGGAGMRAGGGGACCGCAG (SEQ ID NO:8) (BIn1b-F) (sense), GGAGSCCATCCCCGSCGACCTAT (SEQ ID NO:9) (BIn3-R) (antisense), and GGAGGCCATCCCGGGCGATCTAT (SEQ ID NO:10) (BIn3-AC) (antisense).

In certain embodiments, at least one HLA locus is HLA-C; and nucleotide sequences of the paired PCR amplification primers for HLA-C are

AGCGAGGKGCCCGCCCGGCGA (SEQ ID NO:11) (5CIn1-61) (sense) and

GGAGATGGGGAAGGCTCCCCACT (SEQ ID NO:12) (3BCIn3-12) (antisense).

In certain embodiments, the kit further comprises enzymes T4 DNA polymerase, Klenow fragment of E. coli DNA polymerase I, and T4 polynucleotide kinase; and a buffer suitable for activity of said enzymes in repairing DNA fragments generated by acoustical shearing.

In certain embodiments, the kit further comprises a DNA polymerase and dATP in a buffer suitable for activity of said DNA polymerase to allow for adapter ligation.

In certain embodiments, the kit further comprises at least one source label.

In certain embodiments, the at least one source label is an oligonucleotide label.

In certain embodiments, the kit further comprises an oligonucleotide complementary to at least one of the paired sequencing primers.

Having now described the present invention in detail, the same will be more clearly understood by reference to the following example, which is included herewith for purposes of illustration only and is not intended to be limiting of the invention.

EXAMPLE

Example 1.

HLA-A, HLA-B, and HLA-C were separately amplified by PCR from each of ten individuals. Each HLA amplicon included exon 2, intron 2, and exon 3 in their entirety and portions of intron 1 and intron 3. The three amplicons (A, B, and C) from each individual were combined. The amplicons were fragmented to an average length of either 200 base pairs or 300 base pairs using acoustical shearing (Covaris instrument). Using an Illumina TruSeq kit, DNA fragment ends were repaired, A-overhangs were added, and adapters and indices were ligated. Figure 5 shows the indices used for each sample.

Following sequencing-by-synthesis using an Illumina MiSeq, the software sorted the reads based on the indices, grouping all the reads for HLA-A, -B, and -C for one individual together. The software, Connexio Assign, compared each read to reference databases for HLA-A, -B, and -C and identified the HLA allele(s) that carry the phased sequences for each locus. See Figure 6.

INCORPORATION BY REFERENCE

All patents and published patent applications mentioned in the description above are incorporated by reference herein in their entirety.

EQUIVALENTS

Having now fully described the present invention in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious to one of ordinary skill in the art that the same can be performed by modifying or changing the invention within a wide and equivalent range of conditions, formulations and other parameters without affecting the scope of the invention or any specific embodiment thereof, and that such modifications or changes are intended to be encompassed within the scope of the appended claims.
