Title:
METHODS AND APPARATUS FOR IDENTIFYING GENETIC VARIANTS
Document Type and Number:
WIPO Patent Application WO/2018/033733
Kind Code:
A1
Abstract:
Apparatus and computer-implemented methods are disclosed for identifying genetic variants in a target individual's genome. An initial processing step receives initial input sequence data and processes the initial input sequence data to identify variants involving more than N1 base pairs. A first subsequent processing step uses an output from the initial processing step to identify further variants, the further variants involving more than N2 base pairs and including at least one variant that involves less than N1 base pairs.

Inventors:
LUNTER GERTON ANTON (GB)
SANDERS EDWARD (GB)
RIMMER ANDREW (GB)
Application Number:
PCT/GB2017/052419
Publication Date:
February 22, 2018
Filing Date:
August 16, 2017
Assignee:
GENOMICS PLC (GB)
International Classes:
G06F19/18
Other References:
ANDRÁS GÉZSI ET AL: "VariantMetaCaller: automated fusion of variant calling pipelines for quantitative, precision-based filtering", BMC GENOMICS, vol. 16, no. 1, 28 October 2015 (2015-10-28), XP055423009, DOI: 10.1186/s12864-015-2050-y
ANDY RIMMER ET AL: "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications", NATURE GENETICS, vol. 46, no. 8, 1 August 2014 (2014-08-01), pages 912 - 918, XP055158556, ISSN: 1061-4036, DOI: 10.1038/ng.3036
ZAMIN IQBAL ET AL: "De novo assembly and genotyping of variants using colored de Bruijn graphs", NATURE GENETICS., vol. 44, no. 2, 1 February 2012 (2012-02-01), NEW YORK, US, pages 226 - 232, XP055334060, ISSN: 1061-4036, DOI: 10.1038/ng.1028
NATURE GENETICS, vol. 47, 2015, pages 717 - 726
NATURE GENETICS, vol. 43, 2011, pages 491 - 498
NATURE GENETICS, vol. 46, 2014, pages 912 - 918
NATURE GENETICS, vol. 44, 2012, pages 226 - 232
NATURE GENETICS, vol. 47, 2015, pages 296 - 303
BIOINFORMATICS, vol. 28, no. 18, 2012, pages i333 - i339
NATURE GENETICS, vol. 47, 2015, pages 682 - 688
JOURNAL OF MOLECULAR BIOLOGY, vol. 48, no. 3, 1970, pages 443 - 53
Attorney, Agent or Firm:
FORSYTHE, Dominic (GB)
Claims:
CLAIMS

1. A computer-implemented method for identifying genetic variants in a target individual's genome, comprising:

an initial processing step that receives initial input sequence data and processes the initial input sequence data to identify variants involving more than N1 base pairs; and

a first subsequent processing step that uses an output from the initial processing step to identify further variants, the further variants involving more than N2 base pairs and including at least one variant that involves less than N1 base pairs.

2. The method of claim 1, wherein the initial input sequence data comprises sequence data aligned to a reference genome sequence and the initial processing step comprises merging the variants identified in the initial processing step into the reference genome sequence to obtain first augmented reference genome sequence data.

3. The method of claim 2, wherein the initial processing step comprises aligning the initial input sequence data to the first augmented reference genome sequence data to obtain first subsequent input sequence data.

4. The method of claim 3, wherein the aligning is performed using an algorithm which re-aligns reads only in regions affected by the variants identified in the initial processing step.

5. The method of claim 3 or 4, wherein the aligning is performed using an algorithm which aligns reads which were previously unaligned in the initial input sequence data.

6. The method of any of claims 3-5, wherein the first subsequent processing step comprises processing the first subsequent input sequence to perform the identification of the further variants.

7. The method of any of claims 3-6, wherein the first subsequent processing step comprises merging the further variants identified in the first subsequent processing step into the first augmented reference genome sequence data to obtain second augmented reference genome sequence data.

8. The method of any preceding claim, further comprising:

a second subsequent processing step that uses an output from the first subsequent processing step to identify still further variants, the still further variants including at least one variant involving less than N2 base pairs.

9. The method of claim 8, wherein:

the initial input sequence data comprises sequence data aligned to a reference genome sequence; the initial processing step comprises merging the variants identified in the initial processing step into the reference genome sequence to obtain first augmented reference genome sequence data; and the first subsequent processing step comprises merging the further variants identified in the first subsequent processing step into the first augmented reference genome sequence data to obtain second augmented reference genome sequence data.

10. The method of claim 9, wherein the first subsequent processing step comprises aligning the first subsequent input sequence data with the second augmented reference genome sequence data to obtain second subsequent input sequence data.

11. The method of claim 10, wherein the aligning is performed using an algorithm which re-aligns reads only in regions affected by the variants identified in the initial processing step and first subsequent processing step.

12. The method of claim 10 or 11, wherein the aligning is performed using an algorithm which aligns reads which were previously unaligned in the initial input sequence data and first subsequent input sequence data.

13. The method of any of claims 10-12, wherein the second subsequent processing step comprises processing the second subsequent input sequence data to perform the identification of the still further variants.

14. The method of any preceding claim, wherein each of any one or more of the initial input sequence data, first subsequent input sequence data, and, where provided, second subsequent input sequence data comprises a graph representation.

15. The method of any preceding claim, wherein each of any one or more of the initial processing step, the first subsequent processing step, and, where provided, the second subsequent processing step, comprises determining respective posterior probabilities for the identified variants, the identified further variants, and, where the second subsequent processing step is provided, the identified still further variants.

16. The method of any preceding claim, wherein N1 = 1000.

17. The method of any preceding claim, wherein N2 = 50.

18. The method of any of claims 1-16, wherein N2 = 0.

19. The method of any preceding claim, wherein the identification of variants in each of one or more of the initial processing step, the first subsequent processing step, and, where provided, the second subsequent processing step, uses information about the frequency of selected variants in a population of individuals other than the target individual.

20. The method of any preceding claim, wherein each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a coverage provided by the reads differs from an average coverage over the genome by more than a predetermined threshold.

21. The method of any preceding claim, wherein each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in each of which an average frequency of a set of variants is substantially 100% in the region.

22. The method of any preceding claim, wherein each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a read-pair insert size differs from an average read-pair insert size by more than a predetermined threshold.

23. A computer program comprising code which when run on a computer causes the computer to perform the method of any preceding claim.

24. A computer program product comprising the computer program of claim 23.

25. An apparatus for identifying genetic variants in a target individual's genome, comprising:

a data receiving unit configured to receive at least initial input sequence data; and

a processing unit configured to:

process the initial input sequence data to identify variants exclusively involving more than N1 base pairs; and

using an output from the processing of the initial input sequence data, identify further variants, the further variants exclusively involving more than N2 base pairs and including at least one variant that involves less than N1 base pairs.

Description:
METHODS AND APPARATUS FOR IDENTIFYING GENETIC VARIANTS

The invention relates to the field of genomic data analysis, in particular the identification of genetic variants across a genome (also referred to as variant calling), especially across a human genome.

Genomics is now a data-rich science, due to the improvement in DNA sequencing technology over the last 10 years. However, converting huge volumes of sequencing data into useful information for clinical applications is complex and difficult. A significant bottleneck in this process is the identification of genetic variants, i.e. the ways in which the DNA of a person under study is different from a standard reference genome. This genetic variation occurs in several forms, ranging from single base changes or single nucleotide polymorphisms (SNPs) to duplications or deletions of whole chromosomes.

Accordingly, the size of a single variation event may range from one base pair to over 100 million base pairs (bp).

Identification of SNPs is a well-studied problem, and there are a variety of accurate and effective algorithms for this (e.g. GATK - see reference (2); and Platypus - see reference (3)). Small (<50 bp) insertions and deletions of sequence - so-called 'indels' - are also quite well studied, but accurate methods for identifying these from short-read data have proved more difficult to produce (however, GATK (2) and Platypus (3) do a reasonable job here). Larger variants are studied less well still, and there are few accurate approaches for calling these. DELLY (see reference (6)) and GenomeStrip (see reference (5)) are example approaches which have been used. Of particular interest are large Copy Number Variants (CNVs), where a repeated section of DNA is duplicated or deleted, and large insertions and deletions.

Because the basic methods used for identifying SNPs, indels, CNVs etc. are very different, they are currently implemented as separate programs, so that a scientist / clinician wanting to identify multiple sizes of genetic variants in a patient must run several different programs and somehow combine the output of these manually. This approach is inefficient, as it requires processing the same (large) dataset multiple times, and can also be inaccurate. Furthermore, it has been found that different tools can give conflicting information about the same parts of the genome (e.g. one may call a deletion and one an insertion at the same location), which makes downstream interpretation of genomic information very difficult. In addition, tools that aim to identify small variants would be helped by having accurate information about the location of large variants; for instance, within a region affected by a heterozygous deletion, small variants can only affect a single chromosome (the technical term is "hemizygous"), and data patterns that suggest a heterozygous small variant in such regions must be due to technical artefacts.

Some existing variant calling software (e.g. Platypus (3)) uses a haplotype-based variant calling method. In this type of method, the variant caller does not consider each base in the genome separately, but looks at blocks of contiguous bases (typically around 10-200 bases). The contiguous sequence of bases is called a haplotype, and may span one or more variants. This means that variants occurring close together (within about 100 bases in this case) can be identified simultaneously. Such methods integrate SNP and short indel calling, and employ a local haplotype assembly algorithm which can identify indels up to ~1 kb. But these methods have no ability to detect large variants, and the integration between small and medium variants is poor. The GATK HaplotypeCaller (2), the most widely used variant calling tool, can detect small variants and some medium variants (at least up to 100 bp), but again not large variants.

It is an object of the invention to provide improved methods and apparatus for efficiently and accurately identifying genetic variants.

According to an aspect of the invention, there is provided a computer-implemented method for identifying genetic variants in a target individual's genome, comprising: an initial processing step that receives initial input sequence data and processes the initial input sequence data to identify variants involving (optionally exclusively) more than N1 base pairs; and a first subsequent processing step that uses an output from the initial processing step to identify further variants, the further variants involving (optionally exclusively) more than N2 base pairs and including at least one variant that involves less than N1 base pairs.

Thus, a multi-stage method is provided in which variants of different size ranges are identified in an integrated manner in a single tool.

Efficiency is improved relative to prior art systems because multiple processing of the same data set using different tools is reduced or avoided. Furthermore, the method is easier to use because one tool is capable of providing what was previously only available using multiple tools, often with improved accuracy relative to a combination of such tools. A single output containing all the information needed for clinical interpretation can be provided, instead of the multiple files, with potentially conflicting data, that arise from the use of conventional methods. Accuracy is improved because the identification of larger variants in earlier steps can be used to inform calling of smaller variants in later steps. In particular, by first identifying variants including variants involving more than N1 base pairs, and assessing variants involving fewer than N1 base pairs at a subsequent processing step, it becomes possible to make use of the inferred ploidy along the more than N1 base pairs involved in the variants involving more than N1 base pairs, which information helps to improve the quality of the subsequent variant calls involving fewer than N1 base pairs made at a subsequent processing step. This contrasts with prior art variant calling approaches which treat variation on different scales with separate algorithms. These approaches typically result in a fragmented picture of the genome, and many errors where one class of variant (typically larger) is not considered, and a spurious, different sized, variant is called instead.
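By way of illustration only, the size-based staging described above can be sketched in a few lines of Python (the function name, the dictionary fields and the example candidates are all hypothetical; N1 = 1000 and N2 = 50 as in the claims):

```python
def partition_by_stage(candidates, n1=1000, n2=50):
    """Partition candidate variants (dicts with a 'size' in base pairs)
    into the three stages of the method: large variants (> n1 bp) are
    called first, then medium variants (n2 < size <= n1), then small
    variants (<= n2), so that the larger calls can inform the local
    ploidy used when genotyping the smaller ones."""
    large = [v for v in candidates if v["size"] > n1]
    medium = [v for v in candidates if n2 < v["size"] <= n1]
    small = [v for v in candidates if v["size"] <= n2]
    return large, medium, small

candidates = [
    {"name": "del_chr1", "size": 250000},  # large copy number variant
    {"name": "ins_chr2", "size": 300},     # medium indel
    {"name": "snp_chr3", "size": 1},       # SNP
]
large, medium, small = partition_by_stage(candidates)
```

In a real pipeline each stage would of course re-discover candidates against the successively augmented reference rather than filter a fixed list; the sketch shows only the size thresholds.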

For clinical applications, accurate variant calling is essential, and variants of all sizes are known to be implicated in rare diseases and cancers. Spurious small variant calls due to the lack of large variant calls could be the source of serious misinterpretations of clinical data. For projects looking at thousands or tens of thousands (or more) of human genomes, extensive manual curation of the data is not possible, and reliance on high-quality variant calls is great.

In an embodiment, the initial input sequence data comprises sequence data aligned to a reference genome sequence. The initial processing step comprises merging the variants identified in the initial processing step into the reference genome sequence to obtain first augmented reference genome sequence data. The initial processing step comprises aligning the initial input sequence data with the first augmented reference genome sequence data to obtain first subsequent input sequence data. The first subsequent processing step comprises processing the first subsequent input sequence to perform the identification of the further variants. The inventors have found that re-aligning the input sequence data in this manner improves the subsequent variant identification processes.

In an embodiment, the aligning is performed using an algorithm which re-aligns reads only in regions affected by the variants identified in the initial processing step. The inventors have found this approach improves efficiency significantly because only a small fraction of the reads need to be moved. In an embodiment, the aligning is performed using an algorithm which aligns reads which were previously unaligned in the initial input sequence data. The inventors have found this approach improves accuracy significantly. In an embodiment, each of any one or more of the sequence data comprises a graph representation. The inventors have found that a graph representation improves the efficiency of the method, particularly with regard to the re-aligning of the sequence data mentioned above.

In an embodiment, the method further comprises a second subsequent processing step that uses an output from the first subsequent processing step to identify still further variants, the still further variants including at least one variant involving less than N2 base pairs. Thus, three processing steps are provided in which variants of progressively smaller size are identified, the output from each of the processing steps forming the basis for the variant identification in a subsequent processing step. The inventors have found that this three stage approach provides particularly accurate and efficient identification of variants.

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a coverage provided by the reads differs from an average coverage over the genome by more than a predetermined threshold. The inventors have found that abnormal coverage provides a particularly effective indication of the presence of large variants. This approach is therefore particularly advantageously used in the initial processing step.
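A minimal, purely illustrative sketch of this coverage test, using fixed non-overlapping windows and a fractional threshold (all names, window sizes and toy depths are assumptions, not the claimed method):

```python
def flag_coverage_regions(depths, threshold_fraction=0.5, window=3):
    """Return (start, end) windows whose mean read depth differs from the
    genome-wide mean depth by more than threshold_fraction of that mean.
    A sharp dip suggests a deletion; a sharp rise suggests a duplication."""
    mean_depth = sum(depths) / len(depths)
    flagged = []
    for start in range(0, len(depths) - window + 1, window):
        w = depths[start:start + window]
        if abs(sum(w) / window - mean_depth) > threshold_fraction * mean_depth:
            flagged.append((start, start + window))
    return flagged

# toy per-position depths; the dip at positions 3-5 mimics a deletion
depths = [30, 31, 29, 5, 4, 6, 30, 31, 30, 29, 31, 30]
flagged = flag_coverage_regions(depths)
```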

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in each of which an average frequency of a set of variants is substantially 100% in the region. For typical common variants the normal observation is a mixture of variants occurring in both copies and variants occurring in only one copy, leading to an average frequency of around 65-70%. Where a large deletion is present in one of the chromosomes an abnormally high frequency of substantially 100% may be observed. The set of variants used for this process may comprise a set of common SNPs for example.
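The frequency signature described above can be sketched as follows (illustrative only: the region names and frequencies are made up, and 0.95 is an arbitrary stand-in for "substantially 100%"):

```python
def flag_hemizygous_regions(regions, min_mean_freq=0.95):
    """Return regions whose mean observed allele frequency over a set of
    common SNPs approaches 100%, the signature of a heterozygous deletion
    leaving only one chromosome copy (hemizygosity)."""
    return [name for name, freqs in regions.items()
            if freqs and sum(freqs) / len(freqs) >= min_mean_freq]

regions = {
    "chr1:10k-60k": [0.52, 0.48, 1.00, 0.67],  # normal mix, mean ~0.67
    "chr2:5k-40k":  [0.98, 1.00, 0.97, 1.00],  # every SNP near 100%
}
flagged = flag_hemizygous_regions(regions)
```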

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a read-pair insert size differs from an average read-pair insert size by more than a predetermined threshold. The inventors have found that abnormal read-pair insert size provides a particularly effective indication of the presence of large variants. This approach is therefore particularly advantageously used in the initial processing step.
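An illustrative sketch of the insert-size test; here the median is used as a robust stand-in for the average insert size, since a region spanning a large deletion would otherwise distort the mean (names, sizes and the 100 bp threshold are all assumptions):

```python
import statistics

def flag_insert_size_regions(inserts_by_region, threshold=100):
    """Return regions whose mean read-pair insert size differs from the
    typical genome-wide insert size by more than threshold base pairs.
    Pairs spanning a deletion map abnormally far apart on the reference."""
    all_sizes = [s for sizes in inserts_by_region.values() for s in sizes]
    typical = statistics.median(all_sizes)  # robust to outlier regions
    return [name for name, sizes in inserts_by_region.items()
            if abs(statistics.mean(sizes) - typical) > threshold]

inserts = {
    "chr1:100k": [350, 340, 360, 355],
    "chr1:150k": [348, 352, 351, 349],
    "chr1:200k": [2350, 2400, 2380, 2320],  # pairs spanning a ~2 kb deletion
}
flagged = flag_insert_size_regions(inserts)
```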

In an embodiment, the first processing step comprises the identification of variant breakpoints, here defined as candidates for the first or final base pair that is affected by a potential variant involving more than N1 base pairs; followed by the identification of breakpoint clusters, here defined as sets of two or more of said breakpoints, representing candidates for variants involving more than N1 base pairs, such as for example pairs of breakpoints that are consistent with deletions involving more than N1 base pairs; followed by the identification of candidate variants involving fewer than N1 base pairs; followed by the clustering of said candidate variants into candidate variant clusters; followed by constructing the union of said candidate variant clusters and breakpoint clusters; followed by the sorting of said union of clusters in a hierarchy in such a way that clusters representing variants that involve more than N1 base pairs are sorted before clusters representing variants that involve all or some of said base pairs; followed by a process that calls variants in an order that reflects said hierarchy and that makes use of the ploidy in the local region that is implied by the called variants that go before in said hierarchy.
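The ordering and ploidy aspects of this embodiment can be sketched as follows; clusters are reduced to bare (start, end, label) intervals and the ploidy rule to heterozygous deletions only, so this is a simplified illustration rather than the described algorithm:

```python
def sort_cluster_hierarchy(clusters):
    """Order (start, end, label) clusters so that a cluster spanning more
    base pairs is sorted before clusters covering all or some of the same
    base pairs: by start position, then by decreasing span."""
    return sorted(clusters, key=lambda c: (c[0], -(c[1] - c[0])))

def ploidy_at(pos, called_events, default_ploidy=2):
    """Ploidy implied at a position by earlier calls in the hierarchy:
    inside a called heterozygous deletion only one copy remains."""
    for start, end, label in called_events:
        if label == "large_deletion" and start <= pos < end:
            return 1
    return default_ploidy

clusters = [
    (1200, 1210, "small_indel"),
    (1000, 6000, "large_deletion"),  # spans both smaller clusters
    (3000, 3001, "snp_cluster"),
]
ordered = sort_cluster_hierarchy(clusters)
```

Calling in this order, `ploidy_at(3000, ordered[:1])` evaluates to 1, so a small variant inside the deletion would be genotyped as hemizygous.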

According to an alternative aspect of the invention, there is provided an apparatus for identifying genetic variants in a target individual's genome, comprising: a data receiving unit configured to receive at least initial input sequence data; and a processing unit configured to: process the initial input sequence data to identify variants involving (optionally exclusively) more than N1 base pairs; and using an output from the processing of the initial input sequence data, identify further variants, the further variants involving (optionally exclusively) more than N2 base pairs and including at least one variant that involves less than N1 base pairs.

Embodiments of the invention provide particularly effective identification of large variants, which makes it possible for example to output genotypes with correct ploidy (e.g. triploid or haploid genotypes) when appropriate.

The invention will now be further described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 depicts an example method for identifying genetic variants in a target individual's genome;

Figure 2 is a graph of coverage (vertical axis) against genomic coordinate (horizontal axis) illustrating identification of variants based on deviations in coverage;

Figure 3 is a graph of SNP frequency against genomic coordinate illustrating identification of variants based on deviations in SNP frequency;

Figure 4 is a graph of read-pair insert size against genomic coordinate illustrating identification of variants based on deviations in read-pair insert size;

Figure 5 depicts an apparatus for identifying genetic variants; and

Figure 6 depicts a schematic output from an example method showing identified genetic variants in a segment of reference genome NA12878.

According to an embodiment of the invention, an example of which is depicted in Figure 1, a computer-implemented method is provided for identifying genetic variants in a target individual's genome. In an embodiment the target individual is a human. Any combination of steps in the method or all steps in the method may be implemented by a computer. The computer may comprise any of the various hardware configurations known for providing the required data processing functions, e.g. CPU, memory, data input/output ports, etc. A computer program may be provided comprising code suitable for instructing a computer to carry out the computer-implemented steps of the method. The computer program may be provided in any of the forms known in the art, including on a transitory or non-transitory computer program product, such as a storage medium storing the computer program. The method may be implemented on an apparatus 2 such as that depicted in Figure 5. The apparatus 2 comprises a data receiving unit 4 configured to receive any data required by the method and a processing unit 6 configured to perform any data processing task required by the method. The data receiving unit 4 may also be configured to output data produced by the method, for example in the form of an output data file.

The method comprises an initial processing step S1. Step S1 comprises receiving initial input sequence data. The initial input sequence data may take various forms. In an embodiment, the initial input sequence data comprises sequence list data aligned relative to a reference genome sequence. The sequence list data aligned relative to a reference genome may be provided in the BAM file format, for example. The BAM file format is a well-known binary format for storing sequence data. The sequence list data may be stored in the form of short sequences of DNA bases called reads, where each read has N bases and N quality scores. Step S1 may further receive data representing the reference genome sequence. The data representing the reference genome sequence may be provided in the FASTA file format, for example. The FASTA file format is a well-known text-based format for representing nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. Alternatively or additionally, the reference genome sequence may be provided in a graph representation (see reference (7)). The inventors have found that the graph representation is particularly well suited to implement the method of embodiments of the invention in a computationally efficient manner.
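For concreteness, FASTA-formatted reference data of the kind described can be parsed with a few lines of standard Python (a toy parser for well-formed input, not part of the method; the sequence names and bases are invented):

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {sequence_name: sequence}.
    The name is the first whitespace-delimited token after '>'."""
    sequences, name, parts = {}, None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(parts)
            name, parts = line[1:].split()[0], []
        elif line.strip():
            parts.append(line.strip())
    if name is not None:
        sequences[name] = "".join(parts)
    return sequences

fasta_text = ">chr1 example\nACGTACGT\nTTGA\n>chr2\nGGCC\n"
genome = read_fasta(fasta_text)
```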

Step S1 further comprises processing the initial input sequence data to identify variants involving (optionally exclusively involving) more than N1 base pairs. Such variants may be referred to as large variants. In an embodiment, N1 is at least 1000, optionally at least 2000, optionally at least 5000, optionally at least 10000. The identified variants are output to a first subsequent processing step S2. In an embodiment, the identified variants are merged (e.g. spliced) into the reference genome sequence to provide a first augmented reference genome sequence that represents the target individual's genome more closely than the original reference genome sequence. In an embodiment, the identified variants are output from step S1 in a form which includes the first augmented reference genome sequence, which may be stored in the FASTA file format for example.
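The merging (splicing) of identified variants into the reference can be illustrated with plain strings; a production implementation would more likely operate on a graph representation, and the coordinates and sequences here are toy values:

```python
def merge_variants_into_reference(reference, variants):
    """Splice non-overlapping variants, given as (start, end, replacement)
    in original reference coordinates, into the reference sequence.
    Applying them right-to-left keeps earlier coordinates valid."""
    augmented = reference
    for start, end, replacement in sorted(variants, reverse=True):
        augmented = augmented[:start] + replacement + augmented[end:]
    return augmented

reference = "ACGTACGTACGT"
variants = [(4, 8, ""),    # deletion of four bases
            (1, 2, "T")]   # single-base substitution
augmented = merge_variants_into_reference(reference, variants)
```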

In step S2, an output from the initial processing step S1 is used to identify further variants. The further variants exclusively involve more than N2 base pairs (i.e. none of the variants has less than N2 base pairs) and include at least one variant that involves less than N1 base pairs. In an embodiment, the processing exclusively identifies variants involving between N2 and N1 base pairs (i.e. none of the variants has more than N1 base pairs or less than N2 base pairs). The variants identified in step S2 may be referred to as medium variants. In an embodiment, N2 is at least 50, optionally at least 100, optionally at least 200. In an embodiment, N2 = 0. In the embodiment shown the further identified variants are output to a second subsequent processing step S3. In an embodiment, the further identified variants are merged (e.g. spliced) into the first augmented reference genome sequence output from the initial processing step S1 (which effectively represents the variants identified in the initial processing step) to provide a second augmented reference genome sequence (which effectively therefore represents the variants identified in steps S1 and S2). This second augmented reference genome sequence represents the target individual's genome even more closely than the first augmented reference genome sequence obtained in step S1. In an embodiment, the variants identified in steps S1 and S2 are output from step S2 in a form which includes the second augmented reference genome sequence, which may be stored in the FASTA file format for example. In an embodiment, step S1 comprises aligning the initial input sequence data with the first augmented reference genome sequence data to obtain first subsequent input sequence data. This process may be referred to as a re-mapping. Step S2 may then comprise processing the first subsequent input sequence to perform the identification of the further variants.
The variants identified in step S1 are therefore used to improve the alignment of the sequence data with the reference genome sequence, which improves the accuracy of variant identification in subsequent steps. This is not essential, however. In other embodiments, step S2 may process the initial input sequence using other data about variants provided by step S1 to identify the further variants.

In an embodiment, the aligning is performed using an algorithm which re-aligns reads only in regions affected by the variants identified in the initial processing step. This improves the efficiency of the alignment process. Optionally, the algorithm aligns reads which were previously unaligned in the initial input sequence data. This improves the accuracy of the first subsequent input sequence data.

In one particular embodiment, the aligning is performed using an algorithm having all of the following properties: (i) it can re-map (re-align) any read from any part of the genome to any other (i.e. it is a global re-mapping, not a local re-mapping); (ii) any read which is in a part of the genome not affected by the new variants, and which is not similar to those affected parts, will not be moved. This means that the whole BAM can be processed very efficiently, moving only a small fraction of reads; (iii) any read which is in a part of the genome affected by the changes will be re-mapped; and (iv) previously unmapped reads (those which the original mapping tool could not place) will be re-mapped, and may now be placed on a segment of the augmented reference genome sequence, which was not present in the original reference. The algorithm works particularly efficiently when the reference genome sequence data (augmented or not) comprises a graph representation.
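The triage implied by properties (ii)-(iv) can be sketched as follows; this is a deliberate simplification in which the check that an unaffected read is also not similar to affected parts of the genome is omitted, and reads are reduced to hypothetical (name, position) pairs:

```python
def select_reads_for_remapping(reads, affected_regions):
    """Split reads into those that must be re-mapped against the augmented
    reference and those whose original alignment is kept unchanged.
    Reads overlapping a region changed by new variants, and previously
    unmapped reads (position None), are re-mapped; the rest are not moved."""
    to_remap, unchanged = [], []
    for name, pos in reads:
        if pos is None or any(start <= pos < end
                              for start, end in affected_regions):
            to_remap.append(name)
        else:
            unchanged.append(name)
    return to_remap, unchanged

reads = [("read_a", 50), ("read_b", 1500), ("read_c", None)]
to_remap, unchanged = select_reads_for_remapping(reads, [(1000, 2000)])
```

Because only reads in (or near) changed regions and previously unmapped reads are touched, the bulk of a BAM file passes through unmodified, which is the source of the efficiency gain described above.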

In step S3, which is optional, an output from step S2 is used to identify still further variants. The still further variants include at least one variant involving less than N2 base pairs. In an embodiment, the still further variants exclusively involve less than N2 base pairs (i.e. none of the variants has more than N2 base pairs). The still further variants identified in step S3 may be referred to as small variants. In an embodiment, the identified still further variants are merged (e.g. spliced) into the second augmented reference genome sequence output from step S2 to provide a third augmented reference genome sequence. This third augmented reference genome sequence represents the target individual's genome even more closely than the first and second augmented reference genome sequences. In an embodiment, the variants identified in steps S1, S2 and S3 are output from step S3 in a form that includes the third augmented reference genome sequence, which may be stored in the FASTA file format for example.

In an embodiment, step S2 further comprises aligning the first subsequent input sequence data with the second augmented reference genome sequence data to obtain second subsequent input sequence data, thereby again improving the alignment of the sequence data with the reference genome sequence, and improving the accuracy of variant identification in subsequent steps. In an embodiment, step S3 comprises processing the second subsequent input sequence data to perform the identification of the still further variants. This is not essential, however. In other embodiments, step S3 may process the initial input sequence data using other data about variants identified in preceding steps to identify the still further variants.

In an embodiment, the aligning is performed using an algorithm which re-aligns reads only in regions affected by the variants identified in the initial processing step and first subsequent processing step. This improves the efficiency of the alignment process. Optionally, the algorithm aligns reads which were previously unaligned in the initial input sequence data. This improves the accuracy of the second subsequent input sequence data.

As mentioned above, the method may be provided as steps S1 and S2 only (i.e. a two-stage process). In this case step S2 would provide an output from the method. In the case where steps S1, S2 and S3 are provided, an output from the method would be provided by step S3. In other embodiments further steps (e.g. a third subsequent processing step, a fourth subsequent processing step, etc.) are provided to identify additional variants and/or refine the variant calling performed in earlier processing steps.

The output may be provided in various forms. The output may comprise one or more of the following in any combination: 1) an augmented reference genome sequence (e.g., as a FASTA file or graph representation); 2) sequence data aligned with an augmented reference genome sequence (e.g., as a BAM file); and 3) data representing the identified variants (e.g. as a VCF file).

In an embodiment, the output from each of the processing steps comprises respective posterior probabilities for the identified variants. These probabilities provide a measure of confidence in the identified variants and allow the output of the method to be used in subsequent probabilistic analyses.

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a coverage provided by the reads differs from an average coverage over the genome by more than a predetermined threshold. Sharp changes in coverage indicate the addition or deletion of a segment of DNA. For example, a duplicated sequence will increase coverage and a deleted sequence will decrease coverage. This is depicted schematically in Figure 2.
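The coverage test described above can be sketched as a single scan over a per-base coverage array; expressing the threshold as a fixed fraction of the genome-wide mean is one plausible convention, chosen here for illustration:

```python
def coverage_anomalies(coverage, threshold_factor=0.5):
    """Return (start, end) index intervals where per-base coverage deviates
    from the overall mean by more than threshold_factor * mean.

    Deletions depress coverage towards zero; duplications raise it.
    """
    mean_cov = sum(coverage) / len(coverage)
    regions, start = [], None
    for i, c in enumerate(coverage):
        if abs(c - mean_cov) > threshold_factor * mean_cov:
            if start is None:
                start = i            # open a new anomalous region
        elif start is not None:
            regions.append((start, i))  # close the region at first normal base
            start = None
    if start is not None:
        regions.append((start, len(coverage)))
    return regions
```

On a toy track with mean coverage 30, a stretch of zero coverage (deletion-like) and a stretch of doubled coverage (duplication-like) are both reported.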

In an embodiment, identification of variants in each of one or more of the initial processing step, the first subsequent processing step, and, where provided, the second subsequent processing step uses information about the frequency of selected variants in a population of individuals other than the target individual.

For example, in step S1 an input VCF of all known large variants could be supplied along with a population frequency for each variant (i.e. the fraction of previously observed genomes in which the variant has been found). This information is used by the variant calling algorithm to improve accuracy, particularly when the initial input sequence data (e.g. BAM data) alone is inconclusive or gives only weak support for a variant.

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in each of which an average frequency of a set of variants is substantially 100%. For example, if the method is supplied with a VCF of common SNP information, the method can examine sites at which known SNPs occur at a reasonable (e.g. >0.01) frequency in the population. Where such SNPs are present, the average frequency of the SNPs in the data may be used as further evidence for the presence or absence of large variants. For example, a large deletion which removes a chunk of one chromosome would (since chromosomes are paired) leave only a single copy of that piece of sequence, so any SNPs occurring would appear at substantially 100% frequency in the input sequence data (e.g. BAM data) across that stretch of chromosome, whereas the normal observation is a mixture of SNPs occurring in both copies and SNPs occurring in only one copy, leading to a frequency of around 65-70%. See Figure 3.
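As an illustration of this SNP-frequency signal, the sketch below computes alternate-allele frequencies at known SNP sites from pileup counts and flags a region whose sites all sit near 100%, consistent with loss of one chromosome copy. The dict-of-counts data model and the 0.95 cut-off are deliberate simplifications of real BAM/VCF inputs:

```python
def snp_allele_frequencies(pileups):
    """pileups: dict mapping site -> (alt_count, total_count) from read data
    (a hypothetical stand-in for a BAM pileup at known SNP sites)."""
    return {site: alt / total for site, (alt, total) in pileups.items()}

def hemizygous_deletion_evidence(pileups, min_fraction=0.95):
    """True if every known-SNP site in the region shows near-100% alternate
    allele frequency, as expected when one chromosome copy is deleted.

    A normal diploid region instead shows a mix of ~50% (heterozygous)
    and ~100% (homozygous) sites.
    """
    freqs = snp_allele_frequencies(pileups)
    return bool(freqs) and all(f >= min_fraction for f in freqs.values())
```

A region containing a heterozygous site (~50% alternate reads) is not flagged; a region where every site is near 100% is.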

In an embodiment, each of one or more of the initial input sequence data, the first subsequent input sequence data where the first subsequent input sequence data is obtained, and the second subsequent input sequence data where the second subsequent input sequence data is obtained, comprises a plurality of reads and the respective processing of the processing step comprises identifying genomic regions in which a read-pair insert size differs from an average read-pair insert size by more than a predetermined threshold. In typical high-throughput sequencing, reads are sequenced in pairs, with each of the pair reading the same physical fragment of DNA from opposite ends. The distance between the read-pairs, when aligned against a reference genome sequence, is typically constrained to be between 200 and 600 bp. Read-pairs with insert sizes significantly outside a normal (average) range can be used as evidence for the presence of large changes in the DNA. See Figure 4.
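Flagging such discordant pairs can be sketched as a simple outlier test on the observed insert sizes; the three-standard-deviation cut-off used here is an illustrative choice, not a prescribed threshold:

```python
import statistics

def discordant_pairs(insert_sizes, n_sigma=3):
    """Return indices of read pairs whose insert size deviates from the
    sample mean by more than n_sigma population standard deviations.

    A pair spanning a large deletion appears to have an abnormally large
    insert size when mapped to the (unmodified) reference.
    """
    mean = statistics.mean(insert_sizes)
    sd = statistics.pstdev(insert_sizes)
    return [i for i, s in enumerate(insert_sizes)
            if abs(s - mean) > n_sigma * sd]
```

In practice the expected insert-size distribution would be estimated genome-wide rather than from the local sample, but the outlier principle is the same.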

In an embodiment, two or more of the coverage information, SNP frequency information, and read-pair insert size information mentioned above can be combined in a Bayesian model and used to infer the presence or absence of variants, particularly large variants (i.e. to identify variants in step S1). Other sources of information (e.g. common variant frequencies) can be used to augment this model.
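One simple way to combine such evidence sources is a naive Bayes update in log-odds space; the sketch below assumes each source has been summarised as a likelihood ratio and that the sources are conditionally independent given the variant state, which is an idealisation of whatever Bayesian model is used in practice:

```python
import math

def posterior_variant(prior, likelihood_ratios):
    """Posterior probability that a variant is present, combining
    independent evidence sources (e.g. coverage, SNP frequency,
    insert size) in a naive Bayes model.

    prior:             P(variant present) before seeing the evidence,
                       e.g. derived from a population frequency.
    likelihood_ratios: P(evidence | variant) / P(evidence | no variant)
                       for each source.
    """
    log_odds = math.log(prior / (1 - prior))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)   # each independent source adds its log-LR
    odds = math.exp(log_odds)
    return odds / (1 + odds)
```

Working in log-odds makes the contribution of each source additive, and a population-frequency prior slots in naturally as the starting term.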

In an embodiment, where there is significant evidence for a large variant in step S1, a Hidden Markov Model (HMM) is used to estimate the size of the variant, and the start and end points in the reference genome sequence.
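A minimal version of such an HMM has two states, normal coverage and deleted, with Poisson emissions over per-base read counts; the Viterbi path then reads off the deletion's start and end points. The emission rate assigned to the deleted state and the switching probability below are illustrative values, not parameters from the actual method:

```python
import math

def viterbi_deletion(coverage, mean_cov, p_switch=1e-4):
    """Two-state Viterbi decode over per-base coverage.

    State 0 = normal (Poisson rate mean_cov), state 1 = deleted
    (near-zero rate). Returns the most likely state sequence; the
    boundaries of runs of state 1 estimate the deletion breakpoints.
    """
    def log_poisson(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    lams = (mean_cov, max(mean_cov * 0.05, 0.1))  # illustrative deleted-state rate
    log_stay, log_move = math.log(1 - p_switch), math.log(p_switch)

    v = [log_poisson(coverage[0], lams[s]) for s in (0, 1)]
    back = []
    for c in coverage[1:]:
        new_v, ptr = [], []
        for s in (0, 1):
            cand = [(v[0] + (log_stay if s == 0 else log_move), 0),
                    (v[1] + (log_move if s == 0 else log_stay), 1)]
            best, arg = max(cand)
            new_v.append(best + log_poisson(c, lams[s]))
            ptr.append(arg)
        v = new_v
        back.append(ptr)

    state = 0 if v[0] >= v[1] else 1           # best final state
    path = [state]
    for ptr in reversed(back):                 # backtrace to the start
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Because the switching probability is small, isolated noisy bases do not flip the state, while a sustained drop to zero coverage does, giving base-level start and end estimates.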

In an embodiment, step S2 comprises comparing the first subsequent input sequence data output from step S1 (e.g. as a BAM file) to the reference genome sequence to look for evidence of differences. An algorithm is used to determine whether the differences show a consensus pattern which can be resolved in step S3, or whether additional processing is required due to the likely presence of one or more medium variants. If medium variants are suspected, the invention uses a de Bruijn graph to assemble consensus sequences across all the data present. This may be implemented using the algorithms described in reference (3) or similar. Any medium variants found are then aligned against the reference sequence to give exact start and end positions. A posterior for the presence of the variant is computed from the proportion of that sample's k-mers which can be explained by the sequences identified.

In an embodiment, step S2 comprises analysing the first subsequent input sequence data to examine the read data around the estimated start and end points in more detail. In an embodiment, all the reads within about 1000 bases of the estimated breakpoints are obtained, the reads are broken down into short sub-sequences of length k (k-mers), and the sub-sequences are loaded into a coloured de Bruijn graph (see reference (4)). Using a sequence assembly algorithm scaffolded onto the reference sequence as in references (3) and (4), this embodiment determines the breakpoints to single base-pair accuracy. The breakpoints and other data about the large variants are then output to a new file, similar to the output of step S1 but with the additional data. In an embodiment the algorithm used in step S3 for small variant calling follows closely the approach described in reference (3).
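The k-mer decomposition and graph construction can be sketched as follows. This toy version builds an uncoloured de Bruijn graph and greedily walks unambiguous edges to recover a consensus sequence; it omits the colouring and reference scaffolding of references (3) and (4):

```python
from collections import defaultdict

def build_de_bruijn(reads, k=4):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer in a
    read contributes an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk_contig(graph, start):
    """Greedily extend a contig while the current node has exactly one
    distinct successor (i.e. the path is unambiguous)."""
    contig, node = start, start
    while len(set(graph.get(node, []))) == 1:
        node = graph[node][0]
        contig += node[-1]          # each step adds one new base
        if len(contig) > 10000:     # guard against cycles in repeats
            break
    return contig
```

With two overlapping reads, the walk reconstructs the underlying consensus spanning both, which is how medium variants crossing read boundaries become assemblable.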

In an embodiment, the input to step S3 comprises the second subsequent input sequence data (e.g. a BAM file), the second augmented reference genome sequence data (e.g. a FASTA file or graph representation), and a VCF file (all output from step S2). A detailed scan is then performed across every read in the second subsequent input sequence data (e.g. BAM file), comparing each to the second augmented reference genome sequence data (e.g. a FASTA file or graph representation) and looking for changes of either base substitutions (SNPs) or alignment differences (insertions or deletions of sequence).

Once a full scan has been completed, a list of every difference from the second augmented reference genome sequence seen in all reads is provided. This list is sorted by genomic co-ordinate. The embodiment then employs a heuristic clustering to identify groups of differences (variant candidates) which occur close (within a read-length) together.
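This clustering heuristic, grouping sorted candidates that lie within one read length of their neighbour, can be sketched as a single pass over the sorted list:

```python
def cluster_candidates(positions, read_length=100):
    """Group variant-candidate positions into clusters in which each
    candidate lies within one read length of the previous one.

    positions: genomic co-ordinates of observed differences.
    Returns a list of clusters, each a sorted list of positions.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= read_length:
            clusters[-1].append(pos)   # close enough: extend current cluster
        else:
            clusters.append([pos])     # gap too large: start a new cluster
    return clusters
```

Clustering by read length is the natural granularity here, since candidates further apart than a read can never be phased by a single read anyway.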

In addition, at this point a file of known variants (e.g. variants found in the 1000 Genomes Project or in other studies) can be input to augment the list of variant candidates.

It is desirable at this point to build pairs of consensus sequences across each chromosome, for each sample (because the human genome is diploid, i.e. there are normally two copies of each chromosome). Due to limitations in the data (the reads, and even read-pairs, are typically too short to unambiguously resolve large repeats in the genome) and in our knowledge of certain parts of the genome (centromeres, telomeres, and other repetitive regions), it is not possible to build complete, contiguous sequences of each chromosome. Embodiments aim, however, to build the longest possible contiguous sequences given the available data.

Information is used from steps S1 and S2 to build a scaffold for each sample's (target individual's) genome, modifying the reference genome sequence to include all large and medium variants. On top of this scaffold are placed the clusters of small variant candidates just identified. A list of possible sequences (haplotype candidates) using all combinations of small variants in the cluster is built on top of the scaffold of large and medium variants. The output of this process is a (potentially large) list of possible haplotypes, which are then evaluated against the data.
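Enumerating haplotype candidates for a cluster of n small variants means trying all 2^n presence/absence combinations on top of the scaffold. The SNP-only, offset-based representation below is a simplification for illustration (real clusters also contain insertions and deletions):

```python
from itertools import product

def haplotype_candidates(scaffold, cluster):
    """Enumerate candidate haplotypes for one cluster of small variants.

    scaffold: region sequence with large/medium variants already applied.
    cluster:  list of (offset, ref_base, alt_base) SNP candidates.
    Each variant may be present or absent, giving 2**len(cluster)
    candidate sequences.
    """
    haplotypes = []
    for choices in product([False, True], repeat=len(cluster)):
        seq = list(scaffold)
        for use_alt, (offset, ref, alt) in zip(choices, cluster):
            if use_alt:
                assert seq[offset] == ref   # sanity-check the candidate
                seq[offset] = alt
        haplotypes.append("".join(seq))
    return haplotypes
```

The exponential growth in candidates is why clustering (restricting combinations to nearby variants) matters before this step.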

The read data from the second subsequent input sequence data (e.g. a BAM file) is used to evaluate each haplotype by locally re-aligning each read against each haplotype, using an alignment algorithm. The Needleman-Wunsch algorithm (8) or a variant thereof can be used, for example. The scores for the best alignments of each read against each haplotype form the basis of a likelihood computation, which is used to score the variants. These are combined to give Genotype Likelihoods, which are scores for each pair of haplotypes. The pair of sequences / haplotypes which give the highest Genotype Likelihood are considered to be the best inference of the sequence of the genome of the sample under study, and these are then added to the genome scaffold.
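A compact sketch of this evaluation follows: a standard Needleman-Wunsch global aligner (8) scores each read against each haplotype, and each candidate haplotype pair is scored by summing, over reads, the better of its two alignment scores. Real genotype likelihoods are probabilistic rather than raw alignment scores, so this is only a stand-in for the likelihood computation described above:

```python
from itertools import combinations_with_replacement

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score between read a and haplotype b (8)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

def best_genotype(reads, haplotypes):
    """Pick the haplotype pair with the highest total score, where each
    read contributes its best alignment against either member of the pair
    (a crude stand-in for a genotype likelihood)."""
    def pair_score(h1, h2):
        return sum(max(needleman_wunsch(r, h1), needleman_wunsch(r, h2))
                   for r in reads)
    return max(combinations_with_replacement(haplotypes, 2),
               key=lambda p: pair_score(*p))
```

With reads drawn from both alleles of a heterozygous site, the heterozygous pair outscores either homozygous pair, mirroring the Genotype Likelihood comparison in the text.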

The processing moves along the genome in order of genomic coordinate, and as each cluster of variant candidates is resolved into a hard call for the pair of sequences, those sequences are added to the scaffold and it moves further along.

All discovered variants are then merged (spliced) back into the augmented reference genome. The second subsequent input sequence data (e.g. a BAM file) is then re-mapped (re-aligned) to the augmented reference genome. The final output can be a FASTA (or graph representation, for example GraphGenome) file, a BAM file, and a VCF file containing the best possible inference of the genome of the sample (target individual) under study. The VCF file will contain every identified difference from the reference genome sequence, as well as regions (reference calls) where the genomes are identical to the reference genome sequence. The latest augmented reference genome sequence data and the latest aligned subsequent input sequence data (e.g. BAM file) provide a comprehensive representation of the patient's genome.

Figure 6 depicts a schematic output from an example method showing identified genetic variants in a segment of reference genome NA12878 comprising a homozygous large deletion. The output is presented in the form of a screenshot comprising five panels 11-15.

The first panel 11 indicates the location of the specific 1200bp segment of the NA12878 genome.

The second panel 12 indicates variant calls made by a method according to an embodiment. The primary call is a large deletion followed by three SNPs. These calls align perfectly with the raw read data (shown in the fourth panel 14 described below).

The third panel 13 shows the coverage distribution of read data along the genome. The drop in coverage to zero is consistent with the method's assertion of a large-deletion event. This panel also has three thin lines which are consistent with the three SNPs that the method identifies.

The fourth panel 14 shows the raw read level data. The grey reads are normally aligned reads to the reference sequence (shown in the fifth panel 15). The black reads contain signal (soft-clipping) that is used by the method to make a large deletion call. They occur in this figure near to the large deletion event.

The fifth panel 15 shows the NA12878 reference sequence.

References:

(1) Nature Genetics 47, 717-726 (2015); doi: 10.1038/ng.3304.

(2) Nature Genetics 43, 491-498 (2011); doi: 10.1038/ng.806.

(3) Nature Genetics 46, 912-918 (2014) doi: 10.1038/ng.3036.

(4) Nature Genetics 44, 226-232 (2012) doi: 10.1038/ng.1028.

(5) Nature Genetics 47, 296-303 (2015) doi: 10.1038/ng.3200.

(6) Bioinformatics 28 (18): i333-i339 (2012); doi: 10.1093/bioinformatics/bts378.

(7) Nature Genetics 47, 682-688 (2015) doi: 10.1038/ng.3257.

(8) Journal of Molecular Biology 48 (3): 443-53 (1970); doi: 10.1016/0022-2836(70)90057-4.