METHODS, SYSTEMS AND APPARATUS FOR IDENTIFYING PATHOGENIC GENE VARIANTS

Document Type and Number:

WIPO Patent Application WO/2018/042185

Abstract:

A system for assessing the pathogenicity of a genetic variant comprises a data analysis server connected to at least one of: a genetic information data source storing frequency information relating to the frequency of at least one genetic variant in at least a control population; a disease-variation association data source storing information on associations between at least one gene variant, at least one gene or other gene variants in the at least one gene with diseases; and a protein-related data source storing information on the known or predicted effects of at least one genetic variant on a gene product. The data analysis server is connected to a user device and is configured to receive information from a user about a genetic variant identified in an individual, and to determine and transmit a pathogenicity score to the user device. A method of assessing pathogenicity of a genetic variant, a data analysis server and a computing device are also disclosed.

Inventors:

COOK STUART ALEXANDER (GB)
WARE JAMES (GB)
BARTON PAUL (GB)
WALSH RODDY (GB)
REA GILLIAN (GB)
WHIFFIN NICOLA (GB)
EDWARDS ELIZABETH (GB)
MACARTHUR DANIEL GEOFFREY (US)
MINIKEL ERIC (US)

Application Number:

PCT/GB2017/052545

Publication Date:

March 08, 2018

Filing Date:

September 01, 2017

Assignee:

IMP INNOVATIONS LTD (GB)

International Classes:

G16B20/20; G06F19/00

Domestic Patent References:

WO2008067551A2

2008-06-05

Foreign References:

US20160140288A1	2016-05-19
US20090087854A1	2009-04-02

Other References:

EXAC ET AL., NATURE, 2015
RICHARDS, S. ET AL., GENETICS IN MEDICINE, vol. 17, 2015, pages 405 - 423
LANDRUM, M. J. ET AL., NUCLEIC ACIDS RESEARCH, vol. 42, 2013, pages D980 - D985

Attorney, Agent or Firm:

MOORE, Michael et al. (GB)

Claims:

1. A system for assessing the pathogenicity of a genetic variant, the system comprising:

a data analysis server including a processor and associated memory, the data analysis server being connected to at least one of a genetic information data source storing frequency information relating to frequency of at least one genetic variant in at least a control population, a disease-variation association data source storing information on associations between at least one gene variant, at least one gene or other gene variants in the at least one gene with diseases; and a protein-related data source storing information on the known or predicted effects of at least one genetic variant on a gene product;

wherein the data analysis server is connected to a user device and configured to receive information from a user about a genetic variant identified in an individual;

the data analysis server further including

a search application configured to query at least one of the genetic information data source for frequency information relating to the genetic variant in at least a control population, the protein-related data source for information on the known or predicted effect of the variant on the gene product, and the disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases; and

a data analysis application configured to execute at least one of

one or more rules that include a comparison of the frequency of the genetic variant in a control population and

one or more rules that include a comparison of the known or predicted effect of the genetic variant on the gene product and

one or more rules that include a comparison of the association information from a disease-variation association data source; and

determine a pathogenicity score as a function of a result of the execution of one or more of the rules, and transmit the pathogenicity score to the user device,

wherein the executing one or more rules that include a comparison of the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level,

wherein Eq. 1 is

$$\mathrm{mpaf} = \mathrm{prevalence} \times \mathrm{mac} \times \frac{1}{\mathrm{penetrance}}$$

where mac (maximum allelic contribution) is the maximum proportion of cases potentially attributable to a single allele; mpaf is the maximum credible population allele frequency; prevalence is the proportion of a population found to have a condition; and penetrance is the proportion of individuals carrying a particular variant of a gene, allele or genotype that also expresses an associated trait or phenotype;

wherein Eq. 3 is

$$\mathrm{mpaf} = \sqrt{\mathrm{prevalence}} \times \mathrm{mac} \times \sqrt{\mathrm{mgc}} \times \frac{1}{\sqrt{\mathrm{penetrance}}}$$

where mgc (maximum genetic contribution) represents the proportion of all cases that are attributable to the gene under evaluation, and mac (maximum allelic contribution) represents the maximum proportion of cases attributable to that gene that are attributable to an individual variant.

2. The system according to Claim 1, wherein the data analysis server includes a program that generates a pathogenicity assessment as a function of the pathogenicity score.

3. The system according to Claim 1 or Claim 2, wherein the search application is configured to query a genetic information database for frequency information relating to the variant in at least a control and a diseased population.

4. The system according to Claim 3, wherein the data analysis application is configured to execute one or more rules that determine whether the prevalence of the variant in affected individuals is significantly increased compared with controls, and to execute a Fisher's exact test to determine whether a variant is associated with a disease based on the frequency of the variant in the control and the diseased population.

5. The system according to Claim 4, wherein the results of the one or more rules are pre-computed across all variants present in the control and the diseased population, and the results are corrected for multiple testing.

6. The system according to any of Claims 3 to 5, wherein the maximum allelic / genetic contribution parameter in Eq. 1 or Eq. 3 is determined as a function of the frequency of the most common pathogenic variant in the diseased population, or in a disease population for a similar disease.

7. The system according to any preceding claim, which comprises a genomic database (14b).

8. The system according to Claim 7, wherein the genomic database contains information about paralogous genes, and the data analysis application is configured to execute one or more rules that includes a comparison of whether the variant is a missense mutation and an equivalent amino acid change in a paralogous gene is pathogenic, based on information from the disease-variation association data source.

9. The system according to Claim 7 or Claim 8, wherein the genomic database contains information about paralogous genes, and the data analysis application is configured to execute one or more rules that includes a comparison of whether there is a pathogenic missense mutation at an equivalent amino acid residue of a paralogous gene, based on information from the disease-variation association data source.

10. The system according to any preceding claim, wherein information on the known or predicted effect of the variant on the gene product is obtained from the protein-related data source.

11. The system according to Claim 10, wherein the information obtained from the protein-related data source comprises information relating to the amino acid sequence of the variant protein and/or the effect of an amino acid sequence variant on the function of the protein.

12. The system according to any preceding claim, wherein the disease-variant association data source contains information on the association between a chosen disease and the variant under assessment, other variants in the same gene, and/or variants in a paralogous gene.

13. The system according to any preceding claim, wherein:

the known or predicted effect of the variant on the gene product comprises information on whether the variant is a null variant; the disease-variation association data source contains information on whether loss of function of the gene containing the variant is a known mechanism of disease; and wherein the data analysis application is configured to execute one or more rules that includes determining based at least on the frequency of the variant whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease.

14. The system according to any preceding claim, wherein the disease-variation association data source contains information on the location of pathogenic variants within a gene, and the data analysis application is configured to execute one or more rules that include determining whether the pathogenicity of a variant in the gene is highly dependent on the location of the variant.

15. The system according to Claim 14, wherein determining whether the pathogenicity of the variant is highly dependent on its location, and/or determining whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease, additionally comprises determining whether the variant is a nonsense, frameshift or essential splice site variant within exons with a high proportion spliced in (PSI), such as a PSI > 0.9.

16. The system according to any preceding claim, wherein the protein-related data source comprises the results of at least five tools for prediction of the effect of a variant on the function of the protein, and the data analysis application is configured to execute one or more rules that comprises:

determining whether at least two of the tools predict a deleterious effect on the gene or gene product; and

no more than one tool predicts no deleterious effect on the gene or gene product.

17. The system according to any preceding claim, wherein the data analysis application is configured to execute one or more rules that include determining, based at least on the frequency of the variant in a control population, whether the allele frequency of the variant in a control population is above a threshold, wherein the threshold is the maximum sample estimate corresponding to the population maximum tolerated allele frequency at the chosen confidence level.

18. The system according to any preceding claim, wherein the data analysis application is configured to execute one or more rules that includes assigning weights; and determining the pathogenicity score includes computing a sum of the weights for all the rules that are evaluated as positive.

19. The system according to Claim 18, wherein the weights are dependent on the strength of the evidence associated with each test.

20. The system according to Claim 19, wherein the rules are separated into multiple categories of evidence, and a common weight is assigned to all the tests in the same category.

21. The system according to any preceding claim, wherein the pathogenicity of a protein altering genetic variant is assessed in relation to cardiac conditions.

22. The system according to any preceding claim, wherein the pathogenicity of a protein altering genetic variant is assessed in relation to cardiomyopathies.

23. The system according to any preceding claim, wherein calculating the maximum sample estimate corresponding to the population frequency mpaf obtained at the chosen confidence level x comprises calculating the x-th percentile of a Poisson distribution where λ is given by Eq. 2, wherein sample size is the number of individuals in the control population from the genetic information data source.

24. The system according to any preceding claim, wherein the confidence level x is 90%, 95% or 99%; preferably 95%.

25. The system of any preceding claim, wherein providing the pathogenicity assessment to a user comprises providing a report that comprises the pathogenicity score, and an indication of the result of all the rules evaluated.

26. The system of any preceding claim, wherein the data analysis server is further programmed to receive user input commands to modify the results of one or more rules, in order to further refine the pathogenicity assessment.

27. A method for assessing the pathogenicity of a protein altering genetic variant, the method comprising:

(1) receiving information from a user about a genetic variant identified in an individual;

(2) querying:

(i) a genetic information data source for frequency information relating to the variant in at least a control population; and

(ii) a protein-related data source for information about the known or predicted effect of the variant on the gene product; and/or

(iii) a disease-variation association data source for information on association between a chosen disease and the variant under assessment, other variants in the same gene; and/or variants in a paralogous gene;

(3) executing one or more rules that include a comparison of the frequency of the variant in a control population and one or more rules that include a comparison of the known or predicted effect of the variant on the gene product and/or information from the disease-variant association data source;

(4) determining a pathogenicity score as a function of a result of the execution of one or more of the rules; and

(5) transmitting the pathogenicity score to a user,

wherein the evaluating the results of at least one test based on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level, wherein Eq. 1 is

$$\mathrm{mpaf} = \mathrm{prevalence} \times \mathrm{mac} \times \frac{1}{\mathrm{penetrance}}$$

where mac (maximum allelic contribution) is the maximum proportion of cases potentially attributable to a single allele; mpaf is the maximum credible population allele frequency; prevalence is the proportion of a population found to have a condition; and penetrance is the proportion of individuals carrying a particular variant of a gene, allele or genotype that also expresses an associated trait or phenotype;

wherein Eq. 3 is

$$\mathrm{mpaf} = \sqrt{\mathrm{prevalence}} \times \mathrm{mac} \times \sqrt{\mathrm{mgc}} \times \frac{1}{\sqrt{\mathrm{penetrance}}}$$

where mgc (maximum genetic contribution) represents the proportion of all cases that are attributable to the gene under evaluation, and mac (maximum allelic contribution) represents the maximum proportion of cases attributable to that gene that are attributable to an individual variant.

28. A data analysis server comprising a processor and a memory, wherein the memory includes a processor executable program configured to perform the method of Claim 27 and the processor is configured to execute the program.

29. A computing device comprising a software program configured to control the computing device to perform the method of Claim 27.

Description:

METHODS, SYSTEMS AND APPARATUS FOR IDENTIFYING PATHOGENIC GENE VARIANTS

Field of the Invention

The present invention relates to methods, systems and apparatus for classifying gene variants according to pathogenicity. In particular, the invention relates to diagnosing inherited cardiac conditions based on the genetic variant profile of an individual and identifying causative variants for these conditions, for example to allow family screening after an individual has been diagnosed.

Background of the Invention

Genetic information is increasingly used as part of the diagnostic toolbox for conditions that have an inheritable component, supported by advances in sequencing technologies such as the development of affordable high-throughput next generation sequencing. However, the generation of increasing amounts of genetic data is accompanied by new challenges in interpreting this information. Indeed, as every individual is estimated to carry approximately 12,000 to 14,000 predicted protein-altering variants, distinguishing disease-causing variants from benign bystanders is perhaps the principal challenge in contemporary clinical genetics.

While a standardised approach to variant interpretation has been proposed by the American College of Medical Genetics and Genomics, this merely comprises general guidelines combining expertise from clinical laboratory geneticists and is still far from a tool that can be applied directly by clinicians in their practice.

The consensus view in the clinical arena is that the value and power of sequence-based diagnostic approaches will be driven by the ability to meaningfully and confidently interpret genetic variation data of an individual in a disease and gene specific manner, combined with an understanding of the clinical phenotype and familial history. Accordingly, there is a need for methods and tools that provide readily interpretable output for a clinician to discriminate genetic variants that are clinically relevant in an individual.

Summary of the invention

In accordance with a first aspect of the invention, there is provided a system for assessing the pathogenicity of a genetic variant, the system comprising: a data analysis server including a processor and associated memory, the data analysis server being connected to at least one of a genetic information data source storing frequency information relating to frequency of at least one genetic variant in at least a control population, a disease-variation association data source storing information on associations between at least one gene variant, at least one gene or other gene variants in the at least one gene with diseases; and a protein-related data source storing information on the known or predicted effects of at least one genetic variant on a gene product; wherein the data analysis server is connected to a user device and configured to receive information from a user about a genetic variant identified in an individual; the data analysis server further including a search application configured to query at least one of the genetic information data source for frequency information relating to the genetic variant in at least a control population, the protein-related data source for information on the known or predicted effect of the variant on the gene product, and the disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases; and a data analysis application configured to execute at least one of one or more rules that include a comparison of the frequency of the genetic variant in a control population, and one or more rules that include a comparison of the known or predicted effect of the genetic variant on the gene product, and one or more rules that include a comparison of the association information from a disease-variation association data source; and determine a pathogenicity score as a function of a result of the execution of one or more of the rules, and transmit the pathogenicity score to the user device, wherein the executing one or more rules that include a comparison of the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level, wherein Eq. 1 is

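$$\mathrm{mpaf} = \mathrm{prevalence} \times \mathrm{mac} \times \frac{1}{\mathrm{penetrance}}$$

and Eq. 3 is

$$\mathrm{mpaf} = \sqrt{\mathrm{prevalence}} \times \mathrm{mac} \times \sqrt{\mathrm{mgc}} \times \frac{1}{\sqrt{\mathrm{penetrance}}}$$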

In an alternative aspect or embodiment of the system, the data analysis application is configured to transmit a pathogenicity assessment to the user device, wherein the pathogenicity assessment is based at least in part on the pathogenicity score. In another aspect there is provided a method for assessing the pathogenicity of a protein altering genetic variant, the method comprising: (1) receiving information from a user about a genetic variant identified in an individual; (2) querying: (i) a genetic information data source for frequency information relating to the variant in at least a control population; and (ii) a protein-related data source for information about the known or predicted effect of the variant on the gene product; and/or (iii) a disease-variation association data source for information on association between a chosen disease and the variant under assessment, other variants in the same gene; and/or variants in a paralogous gene; (3) executing one or more rules that include a comparison of the frequency of the variant in a control population and one or more rules that include a comparison of the known or predicted effect of the variant on the gene product and/or information from the disease-variant association data source; (4) determining a pathogenicity score as a function of a result of the execution of one or more of the rules; and (5) transmitting the pathogenicity score to a user, wherein the evaluating the results of at least one test based on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level, wherein Eq. 1 is

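$$\mathrm{mpaf} = \mathrm{prevalence} \times \mathrm{mac} \times \frac{1}{\mathrm{penetrance}}$$

and Eq. 3 is

$$\mathrm{mpaf} = \sqrt{\mathrm{prevalence}} \times \mathrm{mac} \times \sqrt{\mathrm{mgc}} \times \frac{1}{\sqrt{\mathrm{penetrance}}}$$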

In an alternative aspect or embodiment of the method, the data analysis application is configured to transmit a pathogenicity assessment to the user device, wherein the pathogenicity assessment is based at least in part on the pathogenicity score. According to another aspect of the invention there is provided a data analysis server comprising a processor and a memory, wherein the memory includes a processor executable program configured to perform the method of the invention and the processor is configured to execute the program. According to yet another aspect there is provided a computing device comprising a software program configured to control the computing device to perform the method of the invention.

According to still another aspect of the invention, there is provided a system for assessing the pathogenicity of a genetic variant, the system comprising: a data analysis server, a genetic information data source, a disease-variation association data source and a protein-related data source wherein the data analysis server is programmed to: (1) receive information from a user about a genetic variant identified in an individual; (2) query a genetic information data source for frequency information relating to the variant in at least a control population and a protein-related data source for information on the known or predicted effect of the variant on the gene product and/or a disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases; (3) evaluate the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variation association data source; (4) combine the results of the tests of step (3) into a pathogenicity score; and (5) provide the pathogenicity assessment to a user, wherein the evaluating the results of one or more tests based at least on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level.
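By way of illustration only, step (4) of the foregoing aspect might be sketched in Python as follows. The sketch assumes the weighting scheme recited in Claims 18 to 20, in which each rule belongs to a category of evidence, each category carries a common weight, and the score is the sum of the weights of the rules evaluated as positive. The category names, the weight values and the names `RuleResult`, `CATEGORY_WEIGHTS` and `pathogenicity_score` are hypothetical and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class RuleResult:
    """Outcome of one rule (hypothetical structure, for illustration)."""
    name: str        # label of the rule
    category: str    # category of evidence the rule belongs to
    positive: bool   # True if the rule was evaluated as positive


# Hypothetical categories and weights: one common weight per category.
CATEGORY_WEIGHTS: Mapping[str, int] = {
    "strong": 4,
    "moderate": 2,
    "supporting": 1,
}


def pathogenicity_score(
    results: Iterable[RuleResult],
    weights: Mapping[str, int] = CATEGORY_WEIGHTS,
) -> int:
    """Sum the category weights of all rules evaluated as positive."""
    return sum(weights[r.category] for r in results if r.positive)


example_results = [
    RuleResult("rare_in_controls", "moderate", True),
    RuleResult("null_variant_in_lof_gene", "strong", False),
    RuleResult("in_silico_consensus", "supporting", True),
]
example_score = pathogenicity_score(example_results)
```

In this example two rules are positive, so the score is the sum of the "moderate" and "supporting" weights. How a score maps to a pathogenicity assessment is not specified here and is left out of the sketch.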

In accordance with still another aspect of the invention there is provided a method for assessing the pathogenicity of a protein altering genetic variant, the method comprising:

(1) receiving information from a user about a genetic variant identified in an individual;

(2) querying: (i) a genetic information data source for frequency information relating to the variant in at least a control population; and (ii) a protein-related data source for information about the known or predicted effect of the variant on the gene product; and/or (iii) a disease-variation association data source for information on association between a chosen disease and the variant under assessment, other variants in the same gene; and/or variants in a paralogous gene; (3) evaluating the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from the disease-variant association data source; (4) combining the results of the tests of step (3) into a pathogenicity score; and (5) providing the pathogenicity assessment to a user, wherein the evaluating the results of at least one test based on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level.
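The determination of a maximum sample estimate at a chosen confidence level might be sketched as follows, purely as an illustration. Claim 23 recites the x-th percentile of a Poisson distribution whose parameter λ is given by Eq. 2; Eq. 2 is not reproduced in this section, so the sketch assumes that λ is the expected allele count in the control sample, namely mpaf multiplied by the number of sampled alleles (two per individual). That assumption, and the function names `poisson_percentile` and `max_sample_allele_count`, are not taken from the disclosure.

```python
import math


def poisson_percentile(lam: float, confidence: float = 0.95) -> int:
    """Smallest count k such that P(X <= k) >= confidence for X ~ Poisson(lam).

    Simple float-based accumulation; the bounds below keep exp(-lam) from
    underflowing and keep the loop from chasing rounding error near 1.
    """
    if not 0.0 <= lam <= 700.0:
        raise ValueError("lam must lie in [0, 700] for this sketch")
    if not 0.0 < confidence <= 0.999:
        raise ValueError("confidence must lie in (0, 0.999]")
    k = 0
    pmf = math.exp(-lam)   # P(X = 0)
    cdf = pmf
    while cdf < confidence:
        k += 1
        pmf *= lam / k     # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


def max_sample_allele_count(
    mpaf: float, n_individuals: int, confidence: float = 0.95
) -> int:
    """Maximum allele count expected in the sample for a given mpaf.

    ASSUMPTION: lambda = mpaf * 2 * n_individuals (diploid individuals).
    """
    lam = mpaf * 2 * n_individuals
    return poisson_percentile(lam, confidence)
```

For example, under these assumptions an mpaf of 4e-5 and a control sample of 60,000 individuals give λ = 4.8 and a 95% maximum sample allele count of 9; a variant observed more often than this in the control sample would be considered too common for the modelled disease.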

In yet another aspect of the invention there is provided a data analysis server comprising a processor and a memory, wherein the processor is programmed to perform the method of the invention.

In another aspect of the invention there is provided a computing device comprising software adapted to perform the method of the invention.

Within the scope of this application it is expressly intended that the various aspects, embodiments, examples and alternatives set out in the preceding paragraphs, in the claims and/or in the following description and drawings, and in particular the individual features thereof, may be taken independently or in any combination. That is, all embodiments and/or features of any embodiment or any claim can be combined in any way and/or combination, unless such features are incompatible.

Brief Description of the Drawings

One or more embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 shows schematically relevant parts of a representative genetics diagnostic system suitable for implementing an embodiment of the disclosure;

Figures 2a and 2b illustrate schematically relevant functions of a user device and a data analysis server, each suitable for implementing an embodiment of the disclosure;

Figure 3 describes a method according to an aspect of the disclosure;

Figure 4 is an example of the use of a rare variant frequency filtering method according to embodiments of the invention.

Figure 5a shows the results of applying a variant frequency filtering method according to an aspect of the invention on the ExAC data and a simulated disease.

Figure 5b shows how the use of methods according to aspects of the invention in the context of cardiac disease allows filtering for clinically significant genes.

Figure 6 shows an extract of an exemplary report generated using the methods and systems of the invention.

Detailed Description of the Invention

Although the invention will be described by way of examples, it will be appreciated by a person skilled in the art that the invention could be modified to take many alternative forms without departing from the spirit and scope of the invention as defined in the appended claims.

Unless mentioned otherwise, the terminology used throughout this disclosure is to be interpreted with the meaning that is common in the art. In addition, the following terms are used throughout this document, with the following meaning:

A reference sample or reference data / database is a source of genetic data from one or more individuals that have not been specifically selected for the presence or absence of a particular condition. Therefore, the frequency of a disease or disease causing variant in this data is not expected to be larger than the disease prevalence in the general population or subpopulation from which the genetic data was extracted. An example of reference data and an associated database is the Exome Aggregation Consortium (ExAC) dataset (see ExAC et al. (2015), Nature; bioRxiv 030338; doi: http://dx.doi.org/10.1101/030338), which has characterised the population allele frequencies of 10 million genomic variants through the analysis of exome sequencing data from over 60,000 humans.

A genetic information data source, as used in this document, is a data source that comprises reference data. A genetic information data source may also contain genetic data from one or more individuals in one or more case cohorts, as part of the same or multiple databases.

A genetic variant is any departure from the sequence of a reference genome. Variants may be single nucleotide changes, changes in copy number, insertions, deletions, or other structural variants. In particular, in the context of this invention, 'variant' refers to genetic sequence modifications that influence protein function, either by influencing the abundance of the protein, or by causing a change in protein coding genetic sequences. In some embodiments, the term 'variant' refers in particular to protein altering variants. These are variations in a gene sequence that alter the protein that results from the gene once transcribed and translated (when compared to a non-variant / wild-type gene), for example, by changing the amino acid incorporated at a position or by causing the premature termination of translation (truncating variants). In particular, protein altering variants may result in a frameshift (where a mutation caused by the addition or deletion of a base pair or base pairs in the DNA of a gene results in the translation of the genetic code in an unnatural reading frame from the position of the mutation to the end of the gene / next stop codon), a nonsense codon (a mutation replacing a codon corresponding to an amino acid by a stop codon), a splicing error (creation of a new splice donor / acceptor or loss of a site), a missense change (point mutation resulting in a change of the amino acid incorporated) or an in-frame insertion / deletion.

The known or predicted effect of a variant on a gene product, as used herein, refers to the consequences of the variant being present on the subsequent steps of expression and function of the gene in which the variant is located. These may include predicted effects of the variant on the expression of the gene (e.g. transcription and/or translation rate), but in the context of protein altering variants, these relate to the way in which the resulting protein differs from the reference one. For example, this includes whether the variant will cause a frameshift (and hence a completely different sequence), a nonsense codon resulting in truncation of the protein, a splicing error, resulting in a different combination of exons being included in the protein, a missense mutation resulting in a different amino acid being incorporated, including the identity of the new amino acid, or an insertion-deletion resulting in a change of the total length of the protein. Additionally, the known or predicted effect of a variant on a gene product may also refer to the functional effect of such protein sequence modifications, whether known or predicted, such as: does the variant cause a loss or degradation of the function of the protein, a structural change, etc. The known or predicted effect of a variant on a gene product refers to changes in the identity and/or function of the protein, not to whether or not these changes, in turn, may contribute to the development of a disease.

The term 'allele' as used herein takes its common meaning in the art and refers to one of a number of alternative forms of the same gene or genetic locus.

Methods of obtaining sequence data for an individual, such as second generation sequencing techniques, are well known in the art and are not described in the present disclosure. Methods of extracting variant data based on an individual's sequence data and reference genome data are also known in the art and will not be discussed further.

The term 'Inherited Cardiovascular Conditions' (or ICCs) refers to a diverse set of diseases of the heart and blood vessels with a strong genetic predisposition, and in which genetic testing may be applicable. These include cardiomyopathies (heart muscle diseases), arrhythmia syndromes or "channelopathies" (leading to abnormalities of heart rhythm), dyslipidaemias (abnormalities of blood lipids including cholesterol), aortopathies (abnormalities of the aorta), and a number of congenital structural abnormalities. The phenotypic features of these diseases are known in the art, and these terms are used throughout this disclosure with the meaning that is common in the art. The terms prevalence, penetrance and heterogeneity are used herein with the meaning that is common in the art.

In particular, penetrance refers to the proportion of individuals carrying a particular variant of a gene (allele or genotype) that also express an associated trait (phenotype). Put another way, in a clinical context, the penetrance of a disease causing mutation is the proportion of individuals with the mutation who also exhibit clinical manifestations. As penetrance estimates for individual variants are not widely available, a variant penetrance of 0.5 may advantageously be used. This corresponds to the minimum variant penetrance found when researching HCM and other variants / disorders.

Heterogeneity refers to a phenomenon whereby a disorder may be caused by any one of a number of disease-causing variants. In particular in the context of the assessment of rarity of a variant described herein and below, the maximum allelic or genetic contribution is used to refer to the maximum proportion of cases potentially attributable to a single allele or gene (depending on whether allelic or genetic heterogeneity are investigated). This maximum allelic / genetic contribution is inversely proportional to the allelic / genetic heterogeneity. Where a large cohort exists for a disorder, the upper confidence interval of the frequency of the most common variant in this cohort may be used.

The prevalence of a condition is the proportion of a population found to have a condition. Estimates of disease prevalence may be obtained from the literature. Where multiple different values are reported, the highest value may be used in the calculation, which leads to conservative filtering.
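
As a rough illustration of how prevalence, maximum allelic contribution and penetrance can interact in a rarity calculation, the following is a minimal sketch only. The formula shown (for a dominant disorder) is an assumption made for this example and is not taken from the passage above; the calculation actually used is the one set out in the 'Rare variants' section. The function name and the input values in the usage note are hypothetical.

```python
def max_credible_allele_frequency(prevalence, max_allelic_contribution, penetrance):
    """Assumed form for a dominant disorder (illustrative only).

    prevalence: proportion of the population with the condition
    max_allelic_contribution: largest share of cases attributable to one allele
    penetrance: proportion of carriers who show the phenotype
    """
    if not 0 < penetrance <= 1:
        raise ValueError("penetrance must be in (0, 1]")
    # Each individual carries two alleles, hence the factor of 1/2
    # (an assumption of this sketch for a dominant disorder).
    return prevalence * max_allelic_contribution / penetrance / 2


# Illustrative inputs only: prevalence 1 in 500, allelic contribution 0.02,
# and the penetrance of 0.5 mentioned above.
threshold = max_credible_allele_frequency(1 / 500, 0.02, 0.5)
```

Under this assumed form, a higher prevalence estimate gives a higher frequency threshold, so fewer variants are filtered out as too common. This is consistent with the statement above that using the highest reported prevalence leads to conservative filtering.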

The ACMG Standard and Guidelines refers to a set of guidelines published by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, for assessing the pathogenicity of a genetic variant. These are published in Richards, S. et al. (2015), Genetics in Medicine 17, 405-423.

A protein-related data source is a source of information regarding the sequence, function or properties of proteins. For example, protein-related data sources may contain information regarding the predicted effect of a protein-coding genetic variant on the resulting protein (e.g. frameshift, truncation due to stop codon, a splicing error, missense and in-frame insertions / deletions). Protein-related data sources may instead or in addition contain information regarding functional domains, predicted effect of mutations on the function of a protein, etc. As such, protein-related data sources may be in the form of a static data repository, or may be in the form of algorithms that can e.g. predict the effect of a variant on the gene product.

A disease-variation association data source refers to a data source, in the form of e.g. a database, that collects information about genetic variants and their association with diseases. This may be in the form of e.g. annotations of a variant for reported pathogenicity, functional data, etc. Examples of such data sources include the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) and the HGMD® database (http://www.hgmd.cf.ac.uk/ac/index.php).

Description of a general embodiment of the invention

The present invention is directed to methods, systems and apparatus for analysing a patient's genetic variation profile and determining whether any variants identified are likely disease causing variants. In particular, the present invention determines whether any variants identified in a patient are likely to cause an inherited cardiovascular condition in the patient. Figure 1 shows schematically relevant parts of a representative genetics diagnostic system suitable for implementing an embodiment of the disclosure.

A user (not shown) is provided with an electronic device 2 - this may be for example a personal computer or a mobile device 2 (such as a mobile phone, tablet, laptop, or other mobile computing device). These devices typically have processors and memories for storing information including firmware and applications run by the respective processors. The user device 2 may comprise an antenna and associated hardware and software to allow communications with a data analysis server 4 via the internet 6 via a local WiFi router 10, a 3G/4G telecommunications network 8, any combination of the above or any wireless communications protocol, or may connect to the internet using a wired connection.

The data analysis server 4 (which, while represented here as a single server, may of course comprise any appropriate computer system or set of computer systems) is shown as interacting with both the user device 2 and a genetic data source 12. The genetic data source 12 contains genetic data from individuals in a reference or control group. Advantageously the data source 12 also contains data from individuals in a case cohort for a particular disease. The genetic data source 12 is used to represent a repository of genetic data but may in fact be implemented as a collection of databases, such as e.g. a database containing reference genetic information, and one or more databases containing genetic information of disease cohorts. In embodiments, the data analysis server 4 also interacts with one or more other data sources 14, such as a disease-variation association data source 14a, a genomic database 14b, a protein-related data source 14c, etc. The protein-related data source 14c contains information about the known or predicted effect of a variant on a protein sequence, structure and/or function. For example, the protein-related data source 14c may comprise data on or tools to determine the class of a variant in terms of its consequence on the protein sequence (as detailed above in reference to protein-altering variants). The disease-variation association data source 14a contains information about variants and genes and their relationship with diseases, such as e.g.: reported association with disease; associated publications; associated in vitro and/or in vivo functional evaluation; which diseases are linked to a gene; what classes of variant are relevant for each disease; are there specific sub-regions that are 'hotspots'; what are the possible inheritance patterns associated with each class of variant and disease, and so on.

As the person skilled in the art would understand, the different sources of data 12, 14 may in fact be organised as a single database, or multiple separate databases or tools that provide the required data on query.

Figures 2a and 2b illustrate schematically relevant functions of a user device and a data analysis server that are suitable for implementing embodiments of the disclosure.

Figure 2a shows a user device 2 such as a mobile phone, though it should be noted that any other portable computing apparatus such as a laptop, notebook or tablet computer, or even a fixed apparatus such as a desktop computer, can be used as computing apparatus in embodiments of the disclosure.

The mobile device comprises a processor 202 and a memory 204, such that the memory stores and the processor will subsequently run applications 206. The user device has a user interface comprising a display 208 and an input device 210 such as a keyboard, a mouse, touchpad, touchscreen or any combination of these and associated drivers to allow a user to enter data into and view information from the applications 206. In embodiments where the user device 2 is a mobile phone, it also has a cellular telecommunications capability, including a wireless communication element 212 providing the ability to connect to a cellular communications network. The user device 2 may, instead of or in addition to the wireless communication element 212, include a local networking element 214, in order to establish a short range wireless network. While a network connection is needed to enable communication between the computing device and the data analysis server 4, this need not involve cellular telecommunications. For example, the computing device may be a tablet computer without cellular telecommunications capability but capable of making a local wireless network connection, and so a connection to the data analysis server through the public internet. Further, the device may be a fixed apparatus such as a desktop computer, establishing a wired or wireless connection to the data analysis server 4 via the internet.

Figure 2b describes elements of the data analysis server 4. This is shown as comprising a server 220 with processor 222 and memory 224, with associated communications functionality 226. The communications functionality may include networking capability allowing communication with the user device 2. The processor 222 is a representation of processing capability and may in practice be provided by several processors. The server provides at least a data analysis application 228 stored in the memory 224 and run on the processor 222, and a search engine 230 interacting with the one or more databases 12, 14. In some embodiments, the memory 224 also stores the genetic data 12, and/or one or more other data 14.

The data analysis server 4 receives information from the user device 2, and interacts with the data analysis application 228 and the search engine 230 to obtain the required data (as will be further described below) from the databases 12, 14. The data analysis application 228 collects and analyses the data and serves it to the user device 2 for display by an application 206. In some embodiments, the application 206 is a browser, and the data is submitted by, and provided to a user, via a webpage.

In some embodiments, the methods of the invention may be run locally on the user device 2. In such embodiments, the functionalities of the data analysis server 4 are run directly on the user device as part of an application 206. As is the case with embodiments in which the methods of the invention are run on a remote server, the data from databases 12, 14 may also be locally stored on the device, or may be accessed via e.g. the internet.

Figure 3 describes a method according to an aspect of the disclosure. At step 310, a user enters genetic information in the form of variant data about a patient, and any other relevant information available (see below). Variant data may advantageously be in the form of a list of variants found in a sample (i.e. a list of loci where the sample sequence was found to depart from a reference genome, and the nature of the departure, for example a VCF file). Any number of variants may be included in such a list, and means to obtain such a list from a DNA sample or resulting sequencing data are known in the art.
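
By way of illustration only, variant data supplied as a VCF file might be reduced to a simple list of variants along the following lines. This is a minimal sketch, not the parser used by the system: it reads only the fixed CHROM, POS, REF and ALT columns of each tab-separated record, and the `Variant` record and `parse_vcf_lines` function are hypothetical names introduced for this example.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """One departure from the reference genome (hypothetical record type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_lines(lines: Iterable[str]) -> List[Variant]:
    """Turn VCF text lines into a list of variants, one per alternate allele."""
    variants = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        chrom, pos, _identifier, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # a record may list several ALT alleles
            variants.append(Variant(chrom, int(pos), ref, alt))
    return variants
```

A record with two alternate alleles yields two entries in the list, so each allele can then be assessed separately. In practice, an established VCF library would normally be preferred over hand-written parsing.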

At step 320, the server collects the data that is relevant to any variants present in the user data. At step 330, the server computes 330a a series of evidence rules (as described below) and combines 330b this evidence to generate a variant score. The variant score is to be understood here in the broadest sense as any combined measure of how likely a variant is to be pathogenic. In particular, in some embodiments the score is computed as the assignment of a variant to one of a discrete set of categories based on the combined result of the evaluation of the rules (see section 'Combining evidence for pathogenicity' below). At step 340, the server sends that combined result to the user in the form of a report highlighting activated evidence rules. The report is displayed by the user device at step 350. Optionally, the report may be queried, for example by triggering the display of underlying evidence for a rule, or any evidence rule may be modified by the user. This may result in the modified evidence being used to re-start the process from step 330. Optionally, a user may then decide 360, based on the output of the method, whether any of the variants identified in a patient are disease causing for a specific disease.

If multiple variants are present in the genetic information submitted by the user, the server will collect, analyse and report data separately for each variant in the data set for which evidence is available.
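
The per-variant flow of steps 330 and 340 can be sketched as follows. This is an illustrative outline under stated assumptions rather than the implementation: rules are modelled as named yes/no functions, the way evidence is combined is passed in as a caller-supplied `combine` function (the actual scheme is described in the 'Combining evidence for pathogenicity' section), and all names are hypothetical.

```python
from typing import Callable, Dict, List

VariantRecord = Dict[str, object]          # hypothetical: whatever data step 320 collected
Rule = Callable[[VariantRecord], bool]     # True when the evidence rule is activated


def build_report(
    variants: List[VariantRecord],
    rules: Dict[str, Rule],
    combine: Callable[[List[str]], object],
) -> List[dict]:
    """Evaluate every rule for every variant and attach a combined score."""
    report = []
    for variant in variants:               # each variant is handled separately
        activated = [name for name, rule in rules.items() if rule(variant)]  # step 330a
        report.append({
            "variant": variant,
            "activated_rules": activated,  # highlighted in the report (step 340)
            "score": combine(activated),   # step 330b, scheme supplied by the caller
        })
    return report
```

If a user modifies the evidence for a rule at step 350, calling `build_report` again with the updated records corresponds to re-starting the process from step 330.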

Evidence for pathogenicity

The methods of the invention rely on computing the results of multiple evidence rules (i.e. tests based on evidence related to the variant, for which a yes/no answer provides evidence of the variant being benign or pathogenic), each of which analyses a piece of data that is relevant to the pathogenicity (or lack thereof) of a variant. The rules (also referred to herein as 'tests') presented below have been found by the inventors to provide a superior diagnostic assessment, in particular, for inherited cardiac conditions (ICCs). However, additional rules may be added or removed as appropriate, for example because of growing knowledge about a disease. In embodiments, variants are only analysed if they are protein altering variants, as many of the rules mentioned below relate to the function of the resulting protein. In some embodiments, all variants that may alter protein function are analysed, including e.g. synonymous variants. As would be clear to the person skilled in the art, variants may only be analysed if data is available for this genetic location; for example, if sufficient data is available to evaluate the result of at least some of the evidence tests detailed below, in relation to the genetic location of the variant. Additionally, some of the evidence required to assess pathogenicity (see e.g. the discussion on the rarity of variants below), depends on the disease that the variant is analysed for. Much of this document is centred on cardiomyopathies, as the inventors have found the set of rules described in the embodiments below to be particularly useful in diagnosing pathogenic variants in such diseases. However, the person skilled in the art would understand that the principles of this invention may be applicable to a variety of other inheritable diseases, provided that adequate data is available.

We will now describe each of the rules that may be used to determine whether a variant is pathogenic or not. As the person skilled in the art will understand, not all of these rules need be present or used in any particular embodiment of the method, for example because data may not be available for a given rule. However, methods and systems of the invention suitably require evaluation of (i) at least one rule relating to the rarity of the variant in a reference population, using embodiments of the method for determining what constitutes a rare variant for a disease that is described in the 'Rare variants' section below, and (ii) at least one test based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variation association data source.

For practical reasons that will be clear in the description of how evidence from rules is combined (see below), rules are divided into different categories. This follows a scheme set out in the ACMG guidelines, and in particular rules are referred to as 'Pathogenic Very Strong' (PVS, indicating that such rules being activated represents very strong evidence of pathogenicity), 'Pathogenic Strong' (PS, indicating that such rules being activated represents strong evidence of pathogenicity), 'Pathogenic Moderate' (PM, indicating that such rules being activated represents moderate evidence of pathogenicity), 'Pathogenic Supporting' (PP, indicating that such rules being activated supports pathogenicity), 'Benign Stand Alone' (BSA, indicating that a variant is very likely to be benign), 'Benign Strong' (BS, indicating that such rules being activated represents strong evidence of the variant being benign), and 'Benign Supporting' (BP, indicating that such rules being activated represents supporting evidence of the variant being benign).

Very strong evidence for pathogenicity (PVS rules)

The first PVS rule determines whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease. In embodiments, truncating variants are assumed to cause loss of function. In embodiments, the first PVS rule is activated when the variant is a truncating variant in a gene that has been implicated in a disease with loss of function as a reported mechanism. Further, in embodiments, a variant causes activation of the rule if any one of three criteria are fulfilled: (i) the variant is in a gene with a significant burden in disease cases compared to control populations based on one or more genetic information data sources; (ii) the variant is in a gene where truncating mutations are reported in excess in case cohorts compared to reference data; (iii) the variant is in a gene associated with phenocopy and where truncating mutations are reported in excess in case cohorts compared to reference data.

In embodiments where the systems and methods of the invention are used to diagnose variants involved in inherited cardiac conditions, a truncating variant in any of the following genes may activate this rule: LMNA, DSP, VCL, MYBPC3, TNNT2, PLN, DSG2, PKP2 and DSC2 (based on criterion (i) and analysis of 7,855 cardiomyopathy cases and 60,706 controls from http://biorxiv.org/content/early/2016/02/24/041111), KCNQ1, KCNH2, SCN5A, FHL1, BAG3, TAZ, FBN1, TGFB2, LDLR (based on criterion (ii) and comparison of data between the HGMD® database (http://www.hgmd.cf.ac.uk/ac/index.php) and the ExAC data), and GLA, LAMP2 (based on criterion (iii)).
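
For the inherited cardiac condition embodiment, the gene list above lends itself to a simple lookup. The following is a minimal sketch: the gene symbols are those listed above, while the consequence labels in `TRUNCATING` and the function name are assumptions made for this example (the passage above states only that the variant must be truncating).

```python
# Consequence labels treated as truncating (hypothetical labels for this sketch).
TRUNCATING = frozenset({"nonsense", "frameshift", "essential_splice_site"})

# Genes listed above for inherited cardiac conditions, grouped by criterion.
PVS_GENES_ICC = frozenset({
    # criterion (i): significant burden in cases compared to controls
    "LMNA", "DSP", "VCL", "MYBPC3", "TNNT2", "PLN", "DSG2", "PKP2", "DSC2",
    # criterion (ii): truncating mutations reported in excess in case cohorts
    "KCNQ1", "KCNH2", "SCN5A", "FHL1", "BAG3", "TAZ", "FBN1", "TGFB2", "LDLR",
    # criterion (iii): phenocopy genes with an excess of truncating mutations
    "GLA", "LAMP2",
})


def first_pvs_rule_activated(gene: str, consequence: str) -> bool:
    """True for a truncating variant in one of the listed genes."""
    return gene in PVS_GENES_ICC and consequence in TRUNCATING
```

Keeping the gene set as data rather than logic makes it straightforward to substitute a different set for another disease area.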

In embodiments, null variants in a gene where loss of function is a known mechanism of disease do not activate this rule if there is a strong regional effect in the gene (i.e. the pathogenicity is highly dependent on the location of the variant). This may be the case, for example, where variants are located in regions that are frequently spliced out. This exception was not identified in the ACMG guidelines, but the inventors have found that it allowed for a more reliable output of the rule. In particular, in embodiments relating to cardiomyopathy, the gene TTN is excluded on this basis. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information on the frequency of the variant in a control population (as compared to a case population).

Strong evidence for pathogenicity (PS rules)

The first PS rule is activated if the variant results in the same amino acid change as a previously established pathogenic variant. In embodiments, any variants from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) that have multiple submitters, for the phenotype of interest, with no conflicting evidence and classed as 'pathogenic' are used. Although the ACMG guidelines indicate that variants resulting in the same amino acid change and "previously established as pathogenic" should be considered as strong evidence for pathogenicity, there is no indication in the guidelines as to what level of evidence constitutes "established pathogenic variants". The inventors have found that a disease-variation association data source containing annotations (from individual laboratories, publications etc.) can be used to reliably implement this condition if the above parameters are used (i.e. a filter on the other variant being classified as "pathogenic", a filter on the number of lines of individual evidence, e.g. submitters, for the specific phenotype / disease of interest, and a filter on the presence of conflicting annotations). In some embodiments, users can save the final classification of a variant being analysed using the methods and systems of the invention, and this data may be queried to evaluate this rule (for example, it may be considered as one of the lines of evidence as mentioned above). In some embodiments, a user can choose to include previous user data in the evaluation of this rule. In some embodiments, the method comprises displaying previous user data so that a user can decide to activate this rule or not. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.
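
The three filters described above (classification, number of submitters for the phenotype of interest, absence of conflicting annotations) can be sketched as a single predicate over annotation records. This is an illustration only: the record fields are hypothetical and do not reflect the actual ClinVar schema, and the default of two submitters is an assumed reading of "multiple submitters".

```python
from typing import Dict, Iterable


def first_ps_rule_activated(
    gene: str,
    protein_change: str,
    phenotype: str,
    records: Iterable[Dict[str, object]],
    min_submitters: int = 2,   # assumption: "multiple" taken to mean at least two
) -> bool:
    """True if a reliably annotated pathogenic variant gives the same amino acid change."""
    return any(
        rec["gene"] == gene
        and rec["protein_change"] == protein_change    # same amino acid change
        and rec["phenotype"] == phenotype              # phenotype of interest
        and rec["classification"] == "pathogenic"
        and rec["n_submitters"] >= min_submitters
        and not rec["conflicting"]                     # no conflicting annotations
        for rec in records
    )
```

Saved classifications from previous uses of the method could be supplied as additional records of the same shape, in line with the embodiments described above.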

A further PS rule may be activated if the variant is observed de novo in a patient that has the disease and the paternity and maternity of the patient are confirmed. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.

A further PS rule may be activated if there are well established in vitro or in vivo functional studies supportive of a damaging effect of the variant on the gene or gene product. In embodiments, the rule is activated if the variant has been shown to recapitulate a disease phenotype or endophenotype in a model system that has been shown to be predictive of human disease. While the ACMG guidelines suggest the use of functional studies supportive of a damaging effect of the variant, they provide no indication as to how this should be assessed. The present inventors have found that the above criterion provided a reliable and transferrable way of assessing this rule. In embodiments, the rule is activated when such evidence is available in a database, such as a database collated from previous user reports, or a source of curated data on the effect of mutations on protein function. In some embodiments, the rule is only activated if the user provides information to activate it, at step 310 or 350 above. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.

A fourth PS rule may be activated if the prevalence of the variant in affected individuals is significantly increased compared with controls. According to ACMG guidelines, this rule should be applied based on an odds ratio above 5 and a confidence interval not including 1. However, the present inventors have found that this threshold was not generally applicable in cases where the case and control cohorts were imbalanced. Accordingly, in embodiments, a case database is compared with a reference database in the following way: the frequency of each rare variant in the two data sets is compared using a Fisher's exact test to assay for association of each variant with disease. The results across variants are adjusted for multiple testing as known in the art, for example, using a Bonferroni correction. An appropriate threshold for statistical significance of the corrected test result may then be used. In embodiments, the strength of the disease association data is taken into account in calculating the weight associated with activation of this rule (see 'Combining evidence' below). In such embodiments, the weight associated with activation of this rule may be proportional to the odds ratio (odds of developing the condition if an individual has the variant versus odds of developing the condition if an individual does not have the variant).

A minimum threshold on the number of individuals with a variant in the case cohort data may also be applied in order to avoid including variants where there is not enough data available. Advantageously, this data is precomputed, based on chosen data sources for a given disease, for each rare variant found in these databases. However, in some embodiments, this test may be computed dynamically based on the disease being analysed, and e.g. a choice of case cohort and reference database given to the user.

A threshold for what is considered a rare variant may be set at e.g. a frequency in control data of below 0.0001. This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.
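
The case/reference comparison described for this rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation: it uses a one-sided Fisher's exact test (enrichment in cases), whereas the description above does not specify sidedness; counts are treated as allele counts out of a total number of alleles; and the minimum case count of 2 is an arbitrary placeholder. Function names are hypothetical.

```python
from math import comb
from typing import Dict, Tuple


def fisher_exact_greater(case_count, case_total, control_count, control_total):
    """One-sided Fisher's exact p-value for enrichment of the variant in cases."""
    carriers = case_count + control_count
    total = case_total + control_total
    # Hypergeometric upper tail: probability of seeing at least `case_count`
    # of the `carriers` among the cases if there were no association.
    tail = sum(
        comb(carriers, k) * comb(total - carriers, case_total - k)
        for k in range(case_count, min(carriers, case_total) + 1)
    )
    return tail / comb(total, case_total)


def adjusted_p_values(
    counts: Dict[str, Tuple[int, int]],   # variant -> (case count, control count)
    case_total: int,
    control_total: int,
    rare_af: float = 1e-4,                # rarity threshold in controls, as above
    min_case_count: int = 2,              # assumed minimum; not given in the text
) -> Dict[str, float]:
    """Bonferroni-adjusted p-values for rare variants with enough case data."""
    tested = {
        name: (case, ctrl)
        for name, (case, ctrl) in counts.items()
        if ctrl / control_total < rare_af and case >= min_case_count
    }
    n_tests = len(tested)                 # Bonferroni: multiply by number of tests
    return {
        name: min(1.0, n_tests * fisher_exact_greater(case, case_total, ctrl, control_total))
        for name, (case, ctrl) in tested.items()
    }
```

The rule would then be activated for variants whose adjusted p-value falls below the chosen significance threshold. Because the result depends only on the chosen data sources, it can be precomputed per disease, as noted above.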

A fifth PS rule may be activated if the variant is a truncating variant in a gene where truncating variants are known to cause disease, but the gene shows strong regional effect such that not all truncating variants are equally deleterious. This may be the case e.g. where the variant only truncates some isoforms (e.g. Titin). The rule is only activated if the truncating variant is in an exon that is constitutively expressed in the specific transcripts relevant to the disease. For example, in the case of inherited cardiac conditions, the rule may only be activated if it is in an exon constitutively expressed in the isoforms relevant to the heart. This rule was not present in the ACMG guidelines but the present inventors have found that introducing this rule and excluding such variants from activating a pathogenic very strong rule led to more reliable results. In embodiments, this rule is activated for nonsense, frameshift and essential splice site variants within exons with proportion spliced in (PSI) > 0.9. In embodiments, this rule is restricted to a predetermined set of genes. In embodiments relating to cardiomyopathy, the predetermined set of genes may comprise TTN. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information on the frequency of the variant in a control population (as compared to a case population).
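
The conditions for this rule (variant class, gene restriction and exon usage) can be sketched as a single check. This is illustrative only: the consequence labels and function name are hypothetical, and the exon PSI value is assumed to be supplied from an external annotation for the disease-relevant transcripts.

```python
# Variant classes named above for this rule (labels are hypothetical).
TRUNCATING_CLASSES = frozenset({"nonsense", "frameshift", "essential_splice_site"})


def fifth_ps_rule_activated(
    gene: str,
    consequence: str,
    exon_psi: float,                          # proportion spliced in, from annotation
    genes: frozenset = frozenset({"TTN"}),    # predetermined set for cardiomyopathy
    psi_threshold: float = 0.9,
) -> bool:
    """True for a truncating variant in a constitutively expressed exon of a listed gene."""
    return (
        gene in genes
        and consequence in TRUNCATING_CLASSES
        and exon_psi > psi_threshold          # strictly greater than 0.9, as above
    )
```

A TTN truncating variant in an exon with a PSI of 0.5 would therefore not activate the rule, reflecting the regional effect described above.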

Moderate evidence for pathogenicity (PM rules)

A first PM rule may be activated when the variant is located in a mutational hot spot, and/or in a critical and well established functional domain. In the context of the present invention, mutational hot spots are defined regions of genes that are either enriched in variation in cases, or depleted of variation in controls such that the odds ratio associated with variants in that region is higher than for other parts of the gene. They may be defined using curated literature evidence, or by comparing variant frequencies from case data to reference data over a set of defined regions, as known in the art. In embodiments, the rule is evaluated by calculating the prior probability that a variant in certain regions is pathogenic before considering other evidence, based on the frequencies of variants in disease and control populations. Advantageously, the above definition of mutational hotspots allows a consistent definition to be applied across genes and conditions, thereby providing a reliable and widely applicable test for this property. Protein functional domains may be extracted e.g. from protein databases. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product (when the rule is activated based on the location in a critical functional domain), or the frequency of variants in a gene in a control and disease population (when the rule is activated based on the location in a mutational hotspot).

A second PM rule may be activated if the variant is present at extremely low frequency in a control population. In advantageous embodiments, this rule is activated if the allele frequency in a reference population is below the maximum acceptable frequency calculated as described in the 'Rare variants' section below. Although the ACMG guidelines suggest that the frequency in control populations should be taken into account as evidence of pathogenicity, they recommend considering this as evidence only if the variant is absent in control populations. However, the inventors have found that the methods of the invention, as described below, provided a widely applicable, reliable approach to confidently identify how low a variant frequency has to be in a control population, to be potentially pathogenic, taking into account the size of the control population used to make this assessment. In embodiments, the user is able to directly access the reference data from the report and check the coverage at the variant location. The user may then be able to overrule an activation of this rule if the coverage is not sufficient. In some embodiments, the rule is automatically deactivated if the coverage in the reference data used is insufficient. This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.

A further PM rule may be activated if the disorder analysed is recessive and the variant is detected in trans with a pathogenic variant. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. This rule is an example of a rule that is evaluated based on information from a disease-variant association data source.

A fourth PM rule may be activated if the variant results in an in-frame deletion or insertion in a non-repeat region or a stop-loss variant, resulting in a protein length change. In embodiments, this rule is activated based on a prediction of the effect of the variant on the protein. Methods of prediction such as that implemented in the Ensembl Variant Effect Predictor (VEP) are suitable for the purpose of the invention, and known in the art. In embodiments, variants that are in-frame insertions / deletions only activate the rule if they are not within a repeat region, where repeat regions are available from e.g. genome browsers such as the UCSC table browser (https://genome-euro.ucsc.edu/cgi-bin/hgTables). This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
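
The repeat-region condition amounts to an interval overlap check, which can be sketched as follows. This is an illustration under stated assumptions: the consequence labels follow the Sequence Ontology terms reported by VEP (`inframe_insertion`, `inframe_deletion`, `stop_lost`), coordinates are treated as closed intervals on a single chromosome, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

INFRAME = frozenset({"inframe_insertion", "inframe_deletion"})


def overlaps_any(start: int, end: int, regions: Iterable[Tuple[int, int]]) -> bool:
    """True if the closed interval [start, end] overlaps any closed region."""
    return any(start <= r_end and r_start <= end for r_start, r_end in regions)


def fourth_pm_rule_activated(
    consequence: str,
    start: int,
    end: int,
    repeat_regions: Iterable[Tuple[int, int]],  # e.g. exported from a genome browser
) -> bool:
    """True for stop-loss variants and for in-frame indels outside repeat regions."""
    if consequence == "stop_lost":
        return True
    if consequence in INFRAME:
        return not overlaps_any(start, end, repeat_regions)
    return False
```

For large repeat annotations, a sorted list with binary search or an interval tree would be a more efficient choice than the linear scan used here.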

A fifth PM rule may be activated if the variant results in a novel missense change at an amino acid residue where a different missense change has previously been determined to be pathogenic. In embodiments, data from a suitable database is used, together with filters (based on reliability) on the evidence provided. For example, any variants from the ClinVar database (http://www.ncbi.nlm.nih.gov/clinvar/) that have multiple submitters, for the phenotype of interest, with no conflicting evidence and classed as 'pathogenic' may be used. In embodiments, data from previous uses of the method are stored and queried for any 'pathogenic' variants at the same residue that result in a different missense change. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.

A further PM rule may be activated if the variant is observed de novo in a patient that has the disease, but the paternity and maternity of the patient are not confirmed. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments of the method, the user provides information to indicate whether the variant is de novo or inherited. In other embodiments, this may be obtained from the variant input file. Having been identified as a de novo variant once (either through use of the tool, or in literature or other data source available to the inventors), such information may be stored and used to activate this rule in further uses.

A seventh PM rule may be activated if an equivalent amino acid change in a paralogous gene is pathogenic. This rule is not used in the ACMG guidelines, but the inventors have found that it strengthened the diagnostic provided by the tool of the invention. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.

Supporting evidence for pathogenicity (PP rules)

A first PP rule may be activated if the variant co-segregates with the disease in multiple affected family members, and lies in a gene that is known to cause the disease. Information to assess that rule may be obtained from disease-variant association data sources such as ClinVar. Alternatively, this information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the strength of the segregation data is taken into account in calculating the weight associated with activation of this rule (see 'Combining evidence' below). In such embodiments, the weight associated with activation of this rule may be proportional to the strength of segregation as quantified by the LOD score (log odds). In embodiments, the LOD score may be estimated as 0.3 x the number of informative meioses, or more formally calculated, as known in the art. LOD thresholds for supporting, moderate and strong evidence may be predefined or specified by the user. For example the following thresholds may be used: i) strong when random chance <1% (~7 meioses/segregations); ii) moderate when random chance <5% (~5 meioses/segregations); and iii) supporting when random chance <25% (~3 meioses/segregations). In other embodiments, thresholds of 3, 6 and 10 meioses/segregations, respectively for supporting, moderate and strong evidence may be used. This rule is an example of a rule that is evaluated based on information from a disease-variant association data source.

A further PP rule may be activated if the variant is a missense variant in a gene with a low rate of benign missense variation and in which missense variants are common mechanisms of disease. In embodiments, this rule is activated based on a comparison between the frequencies of missense variants in a gene of interest in a case cohort and control population. This rule may be activated if the variant is in a gene with etiological fraction (i.e. the estimated proportion of cases with a variant where the variant is causative) >0.90 with Fisher's exact P<0.05. Using this novel approach of defining what should be interpreted as a "low rate of benign missense variation" and "missense variants being common mechanisms of disease", the rule can be evaluated in an unambiguous and widely applicable way, thereby providing a consistent and reliable method of assessing evidence for pathogenicity. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product. In embodiments, this rule is not activated if the first PM rule mentioned above (activated when the variant is located in a mutational hot spot, and/or in a critical and well established functional domain) is already activated.

A third PP rule may be activated if multiple lines of computational evidence support a deleterious effect on the gene or gene product. The ACMG guidelines recommend that the rule only be activated if all tools provide a consistent result. However, the present inventors have found that a more reliable prediction could be obtained by using multiple (4 or more) independent computational tools and combining their results in a slightly less stringent way. In embodiments, at least 5 tools, preferably at least 7 tools are used and the rule is activated if: (i) only 1 tool predicts that the variant is benign and fewer than 3 have unknown classifications, or (ii) 3 or more tools have unknown outcomes and all other tools predict that the variant is damaging. Examples of tools that may be used include SIFT, PolyPhen2 var, LRT, Mutation Taster, Mutation Assessor, FATHMM and Grantham scores, as known in the art. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.
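By way of non-limiting illustration, the combination of computational predictions described for this rule may be sketched as follows. The function name, the three labels "damaging", "benign" and "unknown", and the default of five tools are hypothetical choices made for this sketch. Two further assumptions are made: condition (i) is read as "at most one" dissenting tool, and condition (ii) requires at least one tool to return a call. Neither reading is stated explicitly above.

```python
from collections import Counter


def consensus_rule_activated(calls, supported="damaging", opposing="benign",
                             min_tools=5):
    """Return True if the in-silico consensus rule fires for `supported`.

    `calls` maps a (hypothetical) tool name to one of the labels
    "damaging", "benign" or "unknown".
    """
    if len(calls) < min_tools:
        return False
    counts = Counter(calls.values())
    n_supported = counts[supported]
    n_opposing = counts[opposing]
    n_unknown = counts["unknown"]
    # Condition (i): assumed to mean at most one dissenting tool,
    # with fewer than three tools returning an unknown classification.
    if n_opposing <= 1 and n_unknown < 3:
        return True
    # Condition (ii): three or more unknowns, and every remaining tool agrees.
    # Requiring at least one agreeing tool is an assumption of this sketch.
    if n_unknown >= 3 and n_opposing == 0 and n_supported > 0:
        return True
    return False
```

Swapping the `supported` and `opposing` arguments gives the mirror-image test for a benign consensus.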

A fourth PP rule may be activated if the patient's phenotype or family history is highly specific for a disease with a single genetic aetiology. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In other embodiments, the rule is activated if information from the disease-variation association data source indicates that the disease implies a specific single genetic aetiology concordant with input by the user.

A further PP rule may be activated when a reputable source has reported the variant as pathogenic, but the evidence is not available to the user to perform an independent validation. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.

A sixth PP rule may be activated if there is a missense mutation at an equivalent amino acid residue of a paralogous gene and this mutation is pathogenic. This rule is not used in the ACMG guidelines, but the inventors have found that it strengthened the diagnostic provided by the tool of the invention. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.

Stand-alone benign evidence (BA rules)

An example of a benign stand-alone rule is a rule that is activated when the allele frequency of the variant in a control population is above a threshold. In the ACMG guidelines, the threshold suggested is 5%. However, in the case of inherited cardiac conditions, the present inventors have found that a lower threshold produced more reliable results, at least in part due to the rarity of the disorder. In embodiments, variants present in a control population at a frequency >0.1% for heterozygotes or >3.16% (sqrt(0.001)) for homozygotes activate this rule. In embodiments, the sampling variance in subset populations is taken into account in applying this threshold. In embodiments, the variant count in a control population is compared to a maximum count calculated from the 95th percentile of a Poisson distribution with λ = totalCount x maximumFreq, where totalCount is the number of individuals in the control population covered at that variant position and maximumFreq is the threshold as described above (e.g. 0.1% and 3.16% respectively for hetero- and homozygotes). This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.

In embodiments, this rule is used as a pass / fail test, whereby any variant that activates this rule is automatically classified as benign.

Strong evidence for benignity (BS rules)

A first BS rule may be activated if the allele frequency in a control population is higher than expected for the disorder. The maximum credible population frequency for any variant involved in a disease is calculated using embodiments of the method described in the 'Rare variants' section below. Although the ACMG guidelines refer to an allele frequency being "too high" for a disorder, they do not provide any indication of how to decide what is "too high". The solution of the present invention, as described below, provides a reliable framework to confidently assess this rule, taking the genetic architecture of the disease under consideration into account. In embodiments, the penetrance is set at 0.5. This is a conservatively low value that the present inventors have found useful when specific information is not available. In embodiments, where data pertaining to large cohorts of cases is available, the maximum allelic contribution is defined as the upper bound of the confidence interval for the proportion of cases carrying the most common causal variant in the case cohort. In embodiments, in the absence of a case cohort the frequency in a mutation database is used instead. In embodiments, if neither a case cohort nor a mutation database is available, the maximum allelic contribution is set to the maximum proportion of cases due to a single variant across diseases of interest (e.g. diseases that are similar or related to a disease of interest) where this is known. In the case of cardiac diseases, this may be set to 0.1. This rule is an example of a rule that is evaluated based on the frequency of the variant in a control population.

A further BS rule may be activated if the variant is observed in a healthy individual, with full penetrance expected at an early age. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.

A third BS rule may be activated if there are well-established in vitro or in vivo functional studies showing that there is no damaging effect on protein function or splicing. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is activated based on stored results of previous users. In embodiments, this information is obtained from disease-variant association data sources.

A fourth BS rule may be activated if there is a lack of segregation in affected members of a family. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, this information is obtained from disease-variant association data sources.

Supporting evidence for benignity (BP rules)

A first BP rule may be activated if the variant is a missense variant in a gene for which primarily truncating variants are known to cause disease. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is activated based on data from a variation-disease association database. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.

A second BP rule may be activated if the variant is or has been observed in trans with a pathogenic variant for a fully penetrant dominant gene / disorder, or observed in cis with a pathogenic variant in any inheritance pattern. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is activated based on data from a variation-disease association database. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product and information from a disease-variant association data source.

A further BP rule may be activated if the variant is an in-frame deletion or insertion in a repetitive region without known function. Data on repetitive regions may be obtained from e.g. genomic data sources, such as the UCSC table browser (https://genome-euro.ucsc.edu/cgi-bin/hgTables). Such data may for example be cross referenced with gene regions, also available from genomic data sources. In embodiments, any variant that is an in-frame insertion / deletion that overlaps with a repeat region activates this rule. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.

A fourth BP rule may be activated if multiple lines of computational evidence suggest that the variant has no impact on the gene or gene product. The ACMG guidelines recommend that the rule only be activated if all tools provide a consistent result. However, the present inventors have found that a more reliable prediction could be obtained by using multiple (4 or more) independent computational tools and combining their results in a slightly less stringent way. In embodiments, at least 5 tools, preferably at least 7 tools are used and the rule is activated if: (i) only 1 tool predicts that the variant is damaging and less than 3 have unknown classifications, or (ii) 3 or more tools have unknown outcomes and all other tools predict that the variant is benign. Examples of tools that may be used include SIFT, PolyPhen2 var, LRT, Mutation Taster, Mutation Assessor, FATHMM and Grantham scores, as known in the art. This rule is an example of a rule that is evaluated based on the known or predicted effect of the variant on the gene product.

A fifth BP rule may be activated if the variant was found in a case with an alternative molecular basis for disease. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, the rule is blocked from activation if the user has already activated the rule indicating that the variant is observed in trans with a pathogenic variant for a fully penetrant dominant gene / disorder, or observed in cis with a pathogenic variant in any inheritance pattern.

A sixth BP rule may be activated if a reputable source has reported the variant as benign, but the evidence is not available to the user to perform an independent evaluation. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence.

A seventh BP rule may be activated if the variant is a synonymous (silent) variant for which splicing prediction algorithms predict no impact and the nucleotide is not highly conserved. This information can be provided directly by the user at steps 310 or 350. In embodiments of the method, this rule is displayed in the diagnostic report and a user can insert evidence. In embodiments, the rule is only evaluated and taken into account in computing the score of a variant if the user has activated the rule by providing evidence. In embodiments, data about the known or predicted effect of the variant on the gene product is combined with data from a genomic data source to evaluate this rule.

Rare variants

The present invention is directed to identification of a genetic variant as likely pathogenic or discounting it as likely benign. A variant's low frequency in, or absence from, reference databases is a necessary but not sufficient criterion for variant pathogenicity, and a high frequency is strong evidence for a benign role. However, assessing how rare a variant has to be in order to confidently mark it as likely pathogenic is not trivial. In practice, there exists considerable ambiguity around what allele frequency should be considered 'too common' (in a reference sample), with the conservative values of 1% and 0.1% often invoked as frequency cut-offs for recessive and dominant diseases respectively. However, such thresholds generate a large number of false positive candidates, and therefore provide an indication that is not sufficiently reliable to be used in a clinical setting. The present invention provides methods and systems for confidently determining an allele frequency threshold under which a variant may be considered as a pathogenic candidate for a given disease. In particular, the invention provides a method for assessing whether rare variants are sufficiently rare to cause penetrant Mendelian diseases, while accounting for both disease-specific genetic architecture and sampling variance in observed allele counts.

In some embodiments the method disclosed relies on the principle that when assessing a variant for a causative role in a dominant Mendelian disease, the frequency of a variant in a reference sample not selected for the condition should not exceed the prevalence of the condition. However, this will be influenced by different inheritance modes, genetic and allelic heterogeneity, and reduced penetrance. In addition, for rare variants, estimation of true population allele frequency is clouded by considerable sampling variance, even in the largest samples currently available. The methods disclosed address these issues and provide an improved way of assessing whether a variant is likely pathogenic, as shown in Examples 1 and 2.

The invention also provides systems to allow a user to determine a frequency threshold for a particular disease, where the system comprises a reference data source and a processor programmed to perform the method described herein and below. The system may also comprise input means for a user to specify the genetic architecture of the disease of interest.

For a penetrant dominant Mendelian allele to be disease causing, it cannot be present in the general population more frequently than the disease it causes. Furthermore, if the disease is genetically heterogeneous, it must not be more frequent than the proportion of cases attributable to that gene, or indeed to any single variant. We therefore define the maximum credible population allele frequency (for a pathogenic allele) as (Eq. 1):

$$\text{mpaf} = \text{prevalence} \times \text{mac} \times \frac{1}{\text{penetrance}}$$

Where mac (maximum allelic contribution) is the maximum proportion of cases potentially attributable to a single allele; and mpaf is the maximum credible population allele frequency.

By specifying a maximum credible allele frequency value that should be considered in the population (using Eq. 1), it is possible to estimate the probability distribution for allele counts in a given sample size. This follows a binomial distribution, and can be satisfactorily approximated with a Poisson distribution. This allows the setting of an upper limit on the number of alleles in a sample (based on a confidence interval associated with a chosen level of confidence, e.g. 90%, 95% or 99%) that is consistent with a given population frequency.

For example, using the mpaf calculated above for a particular disease, the maximum allele count tolerated in a reference data set can be calculated, taking into account the sampling variation in the reference data set using a Poisson distribution and the size of the reference data set used. Any allele count in the reference data set that is above e.g. the 95th percentile of the Poisson distribution (upper bound of the one-tailed 95% confidence interval) for that allele frequency, given the observed allele number, where λ is the expected allele count given by Eq. 2:

$$\lambda = 2 \times \text{sample size} \times \text{mpaf}$$

is too frequent to be a credible disease-causing variant. Therefore, the 95th percentile of the Poisson distribution with the above λ is the maximum sample estimate corresponding to the population frequency mpaf, obtained at the chosen confidence level of 95% for a reference data set of the given sample size (in number of individuals).

The sample size is the number of individual genomes sequenced in the reference data. It is multiplied by two to account for the fact that loci on autosomal chromosomes will be sequenced twice for each individual (i.e. 2 x sample size = allele number). The confidence chosen is, as would be clear to the person skilled in the art, a matter of preference. Confidence above 95% is generally preferred, however thresholds based on the 90th, 95th or 99th percentile may be used, depending on the number of false positive results that a user may be prepared to take into account. This may also depend on the availability of any orthogonal data to make the pathogenicity assessment, and on how conservatively any of the parameters of Eq. 1 have been set, as would be clear to the person skilled in the art. Accordingly, the maximum sample estimate may be based on a 90%, 95% or 99% confidence threshold (or any other appropriate level).
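By way of non-limiting illustration, the calculation of a maximum tolerated allele count may be sketched as follows. The function names are hypothetical. The sketch follows the worked figures of Example 1a, in which the prevalence is multiplied by 0.5 to convert from individuals to chromosomes and the sample size is expressed as an allele number (number of chromosomes); both conventions are assumptions of this sketch rather than a restatement of Eq. 1 and Eq. 2. The percentile is obtained by direct summation of the Poisson probabilities, which is adequate for the small values of λ typical of rare variants.

```python
import math


def poisson_quantile(lam, q):
    """Smallest integer k such that P(X <= k) >= q for X ~ Poisson(lam)."""
    if lam < 0 or not 0 < q < 1:
        raise ValueError("lam must be >= 0 and q must lie strictly in (0, 1)")
    term = math.exp(-lam)  # P(X = 0)
    if lam > 0 and term == 0.0:
        raise ValueError("lam is too large for this simple summation")
    k = 0
    cdf = term
    while cdf < q:
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
        cdf += term
    return k


def max_tolerated_allele_count(prevalence, mac, penetrance, allele_number,
                               confidence=0.95):
    """Maximum allele count consistent with the maximum credible frequency."""
    # 0.5 converts prevalence per individual to a per-chromosome frequency,
    # as in the arithmetic of Example 1a.
    mpaf = prevalence * 0.5 * mac / penetrance
    expected_count = allele_number * mpaf  # lambda of the Poisson distribution
    return poisson_quantile(expected_count, confidence)


# Figures taken from Example 1a (HCM, 121,412 chromosomes).
print(max_tolerated_allele_count(1 / 500, 0.02, 0.5, 121412))  # 9
print(max_tolerated_allele_count(1 / 500, 0.02, 1.0, 121412))  # 5
```

A variant whose observed allele count in the reference data exceeds the returned value would be treated as too common to be a credible causative variant at the chosen confidence level.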

The allele number will depend on the size of the population that is considered, and as the tightness of a 95% confidence interval in the Poisson distribution depends upon sample size, the stringency of the filter depends upon the allele number (AN). In some embodiments, it may be appropriate to consider a subpopulation of a reference data set, for example, where differences in ethnic background of individuals are likely to play a role. For example, a variant relatively common in any one population is unlikely to be pathogenic, even if it is rare in other populations. In some embodiments, the method therefore comprises computing a maximum tolerated allele count (AC) for each distinct subpopulation of a reference sample, and filtering based on the highest allele frequency observed in any major population. In other embodiments, the sequencing coverage in the reference data at a particular site is taken into account to correct the sample size. The stringency of the filter may therefore vary according to the size of the sub-population in which the variant is observed, and/or the sequencing coverage at that site.

Where large case cohort data is not available to determine the parameters of Eq. 1, these may be estimated by extrapolating from similar disorders and/or variant data. In particular, where disease-specific variant databases exist, we can use these to help estimate the maximum genetic / allelic contribution in lieu of individual case series.

Where no mutation database exists, knowledge of similar disorders may be used to estimate the maximum allelic / genetic contribution. If the maximum allelic contribution of a disorder is not well characterised, the maximum genetic contribution (i.e. the maximum proportion of the disease attributable to a single gene - rather than a single variant as used above) can be used as a conservative estimate.

Example 1 below demonstrates application of this approach to dominant diseases. Example 1a demonstrates the application of this approach to hypertrophic cardiomyopathy, for which case cohort data exists. Example 1b demonstrates the application of this approach to a disease where disease-specific variant databases exist and can be used to estimate maximum allelic / genetic contribution. Example 1c demonstrates the application of this approach to diseases where no mutation database exists. Example 1d demonstrates the application of this approach to a disease where allelic heterogeneity is poorly characterised.

For recessive diseases, the method above is modified and the maximum frequency of a recessive disease-causing variant in the population is represented by Eq. 3:

$$\text{mpaf} = \sqrt{\text{prevalence}} \times \text{mac} \times \sqrt{\text{mgc}} \times \frac{1}{\sqrt{\text{penetrance}}}$$

where mgc (maximum genetic contribution) represents the proportion of all cases that are attributable to the gene under evaluation, and mac (maximum allelic contribution) represents the maximum proportion of cases attributable to that gene that are attributable to an individual variant.

This is calculated according to Eq. 4:

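$$\text{prevalence} = \sum \left(\text{af of causative alleles in each contributing gene}\right)^{2} \times \text{penetrance}$$

(in which af refers to allele frequency); approximating to Eq. 5:

$$\text{prevalence} = \left(\text{combined frequency of causative alleles in gene}\right)^{2} \times \text{number of similar genes} \times \text{penetrance}$$

and expanding to Eq. 6:

$$\text{prevalence} = \left(\text{max individual af} \times \frac{1}{\text{mac}}\right)^{2} \times \frac{1}{\text{mgc}} \times \text{penetrance}$$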
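By way of non-limiting illustration, the recessive calculation may be sketched as follows, with Eq. 3 read as the rearrangement of Eq. 6 for the maximum individual allele frequency. The function names and the numerical values used are hypothetical and are chosen only to show that the two expressions are mutually consistent.

```python
import math


def recessive_max_credible_af(prevalence, mac, mgc, penetrance):
    """Eq. 3: maximum credible allele frequency for a recessive variant."""
    return math.sqrt(prevalence) * mac * math.sqrt(mgc) / math.sqrt(penetrance)


def prevalence_from_max_af(max_af, mac, mgc, penetrance):
    """Eq. 6: prevalence implied by a maximum individual allele frequency."""
    return (max_af / mac) ** 2 * (1.0 / mgc) * penetrance


# Hypothetical inputs, for illustration only.
af = recessive_max_credible_af(prevalence=1e-4, mac=0.1, mgc=0.2, penetrance=1.0)
# Substituting the result back into Eq. 6 recovers the starting prevalence.
recovered = prevalence_from_max_af(af, mac=0.1, mgc=0.2, penetrance=1.0)
```

The resulting frequency can then be converted into a maximum tolerated allele count using a Poisson percentile in the same way as for dominant diseases.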

Example 2 below demonstrates the application of this approach to a recessive disease, Primary Ciliary Dyskinesia. Example 3 shows the application of this approach on every variant in the ExAC data using a simulated dominant Mendelian variant discovery analysis, and uses HCM data to demonstrate that the approach results in filtering of candidate variants that are in the clinically actionable range of disease odds ratio. Note that the approach has been described and illustrated below by analysing frequencies at the level of a disease. However, in some cases this approach may be further refined by calculating distinct thresholds for individual genes, or even variants. For example, if there is one common founder mutation but no other variants that are recurrent across cases, then the founder mutation may be seen as an exception to the calculated threshold.

Combining evidence for pathogenicity

Having obtained the results of the evaluation of as many rules described above as possible, depending on the data available, the method of the invention classifies a variant into one of a series of diagnostic categories. In embodiments, the classification is based on how many of the evidence rules described above are activated (i.e. the rule produces a positive outcome when assessed), and the nature (in terms of evidence category, as described above) of these activated rules. In embodiments of the invention, in line with the ACMG guidelines, the categories are: Pathogenic, Likely Pathogenic, Benign, Likely Benign and Uncertain Significance. As the person skilled in the art would understand, further categories or subcategories may be created and the combination of evidence rules leading to a certain classification may be adapted accordingly.

An embodiment in which the evidence rules are combined in line with the ACMG guidelines will now be described. However, in other embodiments, the outcome of assessment of evidence rules is combined using any approach that allows weighing evidence for pathogenicity against evidence for benignity. In embodiments, evidence rules relating to pathogenicity are assigned weights $x_{P_1}, \ldots, x_{P_N}$, one for each rule $P_1, \ldots, P_N$ relating to pathogenicity, and evidence rules relating to benignity may be assigned weights $y_{B_1}, \ldots, y_{B_M}$, one for each rule $B_1, \ldots, B_M$ relating to benignity. A combined score may for example be obtained by summing the evidence for pathogenicity and subtracting the sum of evidence for benignity (i.e. variant score = $\sum_{i=1}^{N} x_{P_i} - \sum_{j=1}^{M} y_{B_j}$). In such embodiments, a variant may be classified based on thresholds on the variant score, such as variant score > pp: pathogenic, lp < variant score ≤ pp: likely pathogenic, etc. As the person skilled in the art would understand, the thresholds will depend on the values of the individual rule scores, and on the number of categories used. In some embodiments, rules may be divided into categories, such as very strong pathogenic / strong pathogenic / pathogenic moderate / supporting pathogenic / strong benign / supporting benign / stand-alone benign as described above, and all rules in a category may be assigned the same weight. For example weights $x_{PSv}, x_{PS}, x_{PM}, x_{PP}, y_{BA}, y_{BS}, y_{BP}$ may be used, wherein $x_{PSv} > x_{PS} > x_{PM} > x_{PP}$, and $y_{BA} > y_{BS} > y_{BP}$.

In embodiments of the invention, in line with the ACMG guidelines, a variant is put in the pathogenic category if any of the following apply:

one rule with a 'pathogenic very strong' label is activated, and any of the following applies:

o (i) at least one rule with a 'pathogenic strong' label is activated;

o (ii) at least two rules with a 'pathogenic moderate' label are activated;

o (iii) one rule with a 'pathogenic moderate' and one rule with a 'pathogenic supporting' label are activated; or

o (iv) at least two rules with a 'pathogenic supporting' label are activated;

at least two rules with a 'pathogenic strong' label are activated;

one rule with a 'pathogenic strong' label is activated, and any of the following applies:

o (i) at least three rules with a 'pathogenic moderate' label are activated;

o (ii) two rules with a 'pathogenic moderate' and at least two rules with a 'pathogenic supporting' label are activated; or

o (iii) one rule with a 'pathogenic moderate' and at least four rules with a 'pathogenic supporting' label are activated.

In embodiments of the invention, in line with the ACMG guidelines, a variant is put in the likely pathogenic category if any of the following apply:

one rule with a 'pathogenic very strong' label is activated and a rule with a 'pathogenic moderate' label is activated;

one rule with a 'pathogenic strong' label is activated and one or two rules with a 'pathogenic moderate' label is/are activated;

one rule with a 'pathogenic strong' label is activated and at least two rules with a 'pathogenic supporting' label are activated;

at least three rules with a 'pathogenic moderate' label are activated;

two rules with a 'pathogenic moderate' and at least two rules with a 'pathogenic supporting' label are activated;

one rule with a 'pathogenic moderate' and at least four rules with a 'pathogenic supporting' label are activated.

In embodiments of the invention, in line with the ACMG guidelines, a variant is put in the benign category if either a rule with a 'benign standalone' label is activated, or at least two rules with a 'benign strong' label are activated.

In embodiments of the invention, in line with the ACMG guidelines, a variant is put in the likely benign category if any of the following apply:

one rule with a 'benign strong' label is activated and one rule with a 'benign supporting' label is activated;

at least two rules with a 'benign supporting' label are activated.

In embodiments of the invention, in line with the ACMG guidelines, a variant is put in the 'uncertain significance' category if none of the other criteria apply, or the criteria for benign and pathogenic are contradictory.
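By way of non-limiting illustration, the combination of activated rules into a diagnostic category may be sketched as follows. The function name and its arguments (the number of activated rules in each evidence category) are hypothetical. Three assumptions are made in this sketch: counts stated as "one rule" are read as "at least one rule"; a variant meeting both a pathogenic-side and a benign-side criterion is treated as contradictory; and the optional pass / fail use of the stand-alone benign rule is not modelled.

```python
def classify_variant(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0):
    """Map counts of activated rules per category to a diagnostic category."""
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2)
                         or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Contradictory evidence falls back to uncertain significance.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain Significance"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "Uncertain Significance"
```

The stronger category is tested first on each side, so a variant that satisfies both the pathogenic and the likely pathogenic criteria is reported as pathogenic.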

In embodiments, any combination of the rules described in this document may be used, provided that at least one test is based at least on the frequency of the variant in a control population and at least one test is based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variant association data source. In embodiments, the user can decide which rules are used. Having obtained a score and/or a classification for the variant, this information is sent to the user (via the user device 2) for assessment of the pathogenicity of the variant. Additionally, the methods of the invention may involve producing a report that allows a user to confidently decide whether a variant may be pathogenic by providing the evidence that supports this decision. In particular, the report may display the result of evaluation of each rule, as well as any evidence that has led to this activation, and highlight any rule that is activated. Additional evidence may be displayed together with the classification and the outcome of the rules, e.g. in the form of frequency data in one or more reference or disease cohorts datasets, predicted effect of a mutation on the resulting protein function, location of the variant in the gene sequence, possibly in relation to other known disease causing variants, etc.

Example 4 shows an example of a report generated using a method of the invention.

The invention will now be further illustrated by way of the following non-limiting examples.

Examples

Example 1 - Assessing rarity of variants for specific diseases

Data from the ExAC database was used to assess maximum tolerated allele count in the data for variants causative of a series of inherited cardiac conditions with different patterns of available information.

Example 1a - Case cohorts data available

We illustrate our general approach using the dominant cardiac disorder hypertrophic cardiomyopathy (HCM), which has an estimated prevalence of 1 in 500 in the general population. As there have been previous large-scale genetic studies of HCM, with series of up to 6,179 individuals, an assumption is made that no newly identified variant will be more frequent in cases than those identified to date (for well-studied ethnicities). This allows an estimation of the minimum genetic / allelic heterogeneity of the disorder. In large case series, the largest proportion of cases is attributable to the missense variant MYBPC3:c.1504C>T (p.Arg502Trp), found in 104/6179 HCM cases (1.7%; 95CI 1.4-2.0%). We conservatively take the upper bound of this proportion as our minimum mac (maximum allele contribution), and term this the 'heterogeneity factor' (HF) for HCM. Our maximum expected population allele frequency for this allele, assuming 50% penetrance, is 1/500 x 0.5 (to convert prevalence in individuals to chromosomes) x 0.02 x 1/0.5 = 4.0x10^-5, which we take as the maximum credible population AF for any causative variant for HCM.

We then calculate how many times a variant with true population allele frequency of 4.0x10^-5 can be observed in a random population sample of a given size. For a 5% error rate we take the 95th percentile of a Poisson distribution with λ as detailed above. For HCM this gives a maximum tolerated allele count of 9, assuming 50% penetrance (or 5 for fully penetrant alleles), for variants genotyped in the full ExAC cohort (sample size = 121,412 chromosomes). The MYBPC3:c.1504C>T variant is observed 3 times in ExAC (freq=2.49x10^-5).

To empirically assess these thresholds, we explored the ExAC allele frequency spectrum of 1,132 distinct autosomal variants identified in 6,179 HCM cases referred for diagnostic sequencing, and individually assessed and reported according to international guidelines (as described in Walsh, R. et al. (2016), Genetics in Medicine, doi: 10.1038/gim.2016.90). Figure 4 illustrates the results of this analysis, and shows the ExAC allele count (all populations) against case allele count for variants classified as variants of uncertain significance (VUS), Likely Pathogenic or Pathogenic in 6,179 HCM cases. The dotted lines represent the maximum tolerated ExAC counts in HCM for 50% (upper line) and 100% penetrance (lower line). Variants are colour coded according to reported pathogenicity. Where classifications from contributing laboratories were discordant the more conservative classification is plotted.
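The HCM threshold calculation described above can be reproduced with a short sketch. The function and variable names are illustrative and are not taken from the source, and the Poisson percentile is computed directly from the cumulative distribution rather than with a statistics library.

```python
import math


def max_credible_af(prevalence, max_allelic_contribution, penetrance):
    """Maximum credible population allele frequency (dominant disorder).

    The factor 0.5 converts prevalence in individuals to chromosomes.
    """
    return prevalence * 0.5 * max_allelic_contribution / penetrance


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with terms evaluated in log space."""
    if lam == 0:
        return 1.0
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def max_tolerated_ac(af, n_chromosomes, confidence=0.95):
    """Smallest allele count k with P(X <= k) >= confidence."""
    lam = af * n_chromosomes  # expected allele count in the sample
    k = 0
    while poisson_cdf(k, lam) < confidence:
        k += 1
    return k


EXAC_CHROMOSOMES = 121_412  # full ExAC cohort size stated in the text

af_half = max_credible_af(1 / 500, 0.02, 0.5)  # 50% penetrance
af_full = max_credible_af(1 / 500, 0.02, 1.0)  # fully penetrant alleles

print(af_half, max_tolerated_ac(af_half, EXAC_CHROMOSOMES))
print(af_full, max_tolerated_ac(af_full, EXAC_CHROMOSOMES))
```

Under these inputs the sketch yields the allele counts of 9 and 5 quoted above for 50% and 100% penetrance respectively.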

477/479 (99.6%) variants reported as 'Pathogenic' or 'Likely Pathogenic' fell below the threshold calculated according to methods of the invention (i.e. were rarer in the reference population than the maximum tolerated frequency), including all variants with a clear excess in cases. The 2 variants historically classified as 'Likely Pathogenic', but prevalent in ExAC in this analysis, were reassessed using contemporary criteria: there was no strong evidence in support of pathogenicity, and they were reclassified in light of these results. This analysis identifies 66/653 (10.1%) VUS that are unlikely to be causative for HCM. The results demonstrate that calculating a maximum tolerated frequency in a control population for a particular disease according to the method of the invention allows the identification of high proportions of likely true positive pathogenic variants, while excluding likely false positive identifications. In this way, likely false positive identifications are fewer than with prior art methods. Thus, variants filtered using these criteria provide a more reliable data set for further pathogenicity investigations when compared to results from prior art methods and systems.

The above analysis applied a single global allele count limit of 9 for HCM. However, as allele frequencies differ between populations, filtering based on frequencies in individual populations may provide greater power. We examined all 601 variants identified as 'Pathogenic' or 'Likely Pathogenic' and non-conflicted for HCM in ClinVar (Landrum, M. J. et al. (2013), Nucleic Acids Research 42, D980-D985). 558 (93%) were sufficiently rare when assessed as described. 43 variants were insufficiently rare in at least one ExAC population, and were therefore re-curated. 42 of these had no segregation or functional data sufficient to demonstrate pathogenicity in the heterozygous state, and so would be classified as VUS (at most) according to the methods described herein. This demonstrates that the methods described herein allow removal of further likely false positive identifications by stratifying data by subpopulations, when such data is available. The remaining variant had convincing evidence of pathogenicity, though with uncertain penetrance, and was observed twice in the African / African American ExAC population. This fell outside the 95% confidence interval for a true population frequency <4x10^-5, but within the 99% confidence threshold: a single outlier due to stochastic variation is unsurprising given that these nominal probabilities are not corrected for multiple testing across 601 variants. In light of the updated assessment, 20 variants were reclassified as Benign / Likely Benign and 22 as VUS according to the American College of Medical Genetics and Genomics (ACMG) guidelines for variant interpretation.

Example 1b - Disease-specific variant data available

Where disease-specific variant databases exist, these can be used to help estimate the genetic / allelic heterogeneity in lieu of individual case series. For example, Marfan syndrome is a rare connective tissue disorder caused by variants in the FBN1 gene. The UMD-FBN1 database contains 3,077 variants in FBN1 from 280 references (last updated 28/08/14). The most common variant is in 30/3,006 records (1.00%; 95CI 0.53-1.46%), which likely overestimates its contribution to disease if related individuals are not systematically excluded. Taking the upper bound of this frequency as our HF, a maximum tolerated allele count of 2 is derived. None of the five most common variants in the database are present in ExAC.

Example 1c - No mutation database available

Where no mutation database exists, knowledge of similar disorders can be used to estimate the HF. For the cardiac conditions with large case series available, the maximum proportion of cases attributable to one variant is 6.7% (95CI 4.1-9.2%; PKP2:c.2146-1G>C found in 24/361 ARVC cases). Therefore, the upper bound of this confidence interval (rounded up to 0.1) can be taken as an estimate of the HF for other genetically heterogeneous cardiac conditions, unless disease-specific evidence can be found to alter it. For Noonan syndrome and Catecholaminergic Polymorphic Ventricular Tachycardia (CPVT - an inherited cardiac arrhythmia syndrome) with prevalences of 1 in 1,000 and 1 in 10,000 respectively, this translates to maximum population frequencies of 5x10^-5 and 5x10^-6 and maximum tolerated ExAC allele counts of 10 and 2.
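The confidence bounds used to derive a heterogeneity factor can be sketched as below. The source does not state which interval method was used; the normal-approximation (Wald) interval here is an assumption, chosen because it is consistent with the bounds quoted for the ARVC and HCM case series. The function name is illustrative.

```python
import math


def proportion_ci(count, total, z=1.96):
    """Point estimate and normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    p = count / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# PKP2:c.2146-1G>C in ARVC: 24 of 361 cases (counts from the text)
arvc_p, arvc_lower, arvc_upper = proportion_ci(24, 361)

# MYBPC3:c.1504C>T in HCM: 104 of 6,179 cases (counts from the text)
hcm_p, hcm_lower, hcm_upper = proportion_ci(104, 6179)

# The upper bound is then taken (and optionally rounded up) as the HF.
print(f"ARVC upper bound: {arvc_upper:.3f}")
print(f"HCM upper bound:  {hcm_upper:.3f}")
```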

Example 1d - Poorly characterised allelic heterogeneity

If the allelic heterogeneity of a disorder is not well characterised, the maximum genetic contribution (i.e. the maximum proportion of the disease attributable to a single gene) can be used as a conservative estimate. For classic Ehlers-Danlos syndrome, up to 40% of the disease is caused by variation in the COL5A1 gene. Taking 0.4 as our HF, and a population prevalence of 1/200,000 a maximum tolerated ExAC AC of 5 is derived.

Example 2 - Recessive diseases

Primary Ciliary Dyskinesia (PCD) is a recessive disease with a prevalence of up to 1 in 10,000 individuals in the general population. The method above was applied using data from the ExAC database, as for Example 1. Across previously published cohorts of PCD cases, DNAI1 IVS1+2_3insT was the most common variant with a total of 17/358 alleles (4.7%; 95CI 2.5-7.0%). Given that approximately 9% of all patients with PCD have disease-causing variants in DNAI1 and the IVS1+2_3insT variant is estimated to account for approximately 57% of variant alleles in DNAI1, these values can be taken as estimates of the maximum genetic and allelic contribution for PCD, yielding a maximum expected population allele frequency of 2.42x10^-3. This translates to a maximum tolerated ExAC AC of 322. DNAI1 IVS1+2_3insT is itself present at 56/121,108 ExAC alleles (45/66,636 non-Finnish European alleles). A single variant reported to cause PCD in ClinVar occurs in ExAC with AC > 322 (NME8 NM_016616.4:c.271-27C>T; AC=2306/120984): this variant meets none of the ACMG criteria for assertions of pathogenicity, and was reclassified as VUS.
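The PCD allele frequency can be approximated with the sketch below. The formula is not legible in this passage, so both the square-root form used here and the penetrance of 0.5 are assumptions: they are chosen because, together with the prevalence (1 in 10,000), genetic contribution (9%) and allelic contribution (57%) given in the text, they reproduce the stated value of 2.42x10^-3. The function name is illustrative.

```python
import math


def max_credible_af_recessive(prevalence, max_allelic_contribution,
                              max_genetic_contribution, penetrance):
    """Assumed recessive form of the maximum credible allele frequency.

    Square roots appear because, for a recessive disorder, an affected
    individual carries two pathogenic alleles in the gene.
    """
    return (
        math.sqrt(prevalence)
        * max_allelic_contribution
        * math.sqrt(max_genetic_contribution)
        / math.sqrt(penetrance)
    )


# Inputs from the text, except penetrance=0.5, which is an assumption.
pcd_af = max_credible_af_recessive(
    prevalence=1 / 10_000,
    max_allelic_contribution=0.57,
    max_genetic_contribution=0.09,
    penetrance=0.5,
)
print(f"{pcd_af:.2e}")
```

The maximum tolerated allele count then follows from the same Poisson percentile step used for the dominant examples.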

Example 3 - Computing threshold values for the ExAC population.

For each ExAC variant, a 'filtering allele frequency' was defined, which represents the highest disease-specific 'maximum tolerated allele frequency' that would be incompatible with that variant causing disease. If the disease under study has a maximum tolerated allele frequency ≤ the filtering allele frequency, the variant should be filtered, while if it has a maximum tolerated allele frequency > the filtering allele frequency, the variant remains a candidate.
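One way to compute such a filtering allele frequency is sketched below, assuming the same Poisson model and 95% confidence level used for the maximum tolerated allele count. This is an illustrative reading of the definition, not necessarily the procedure used to annotate ExAC; the function names and the bisection approach are assumptions.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), with terms evaluated in log space."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k + 1)
    )


def filtering_af(ac, an, confidence=0.95):
    """Highest allele frequency at which observing `ac` alleles among `an`
    chromosomes still exceeds the maximum tolerated allele count."""
    if ac == 0:
        return 0.0
    # Find lam where P(X <= ac - 1) == confidence; the CDF decreases with lam.
    low, high = 0.0, 1.0
    while poisson_cdf(ac - 1, high) >= confidence:
        high *= 2
    for _ in range(100):  # bisection
        mid = (low + high) / 2
        if poisson_cdf(ac - 1, mid) >= confidence:
            low = mid
        else:
            high = mid
    return low / an


def should_filter(disease_max_af, ac, an):
    """True if the variant is too common to cause the disease under study."""
    return disease_max_af <= filtering_af(ac, an)


# HCM example: maximum tolerated allele frequency of 4.0e-5 (from the text)
print(should_filter(4.0e-5, ac=10, an=121_412))
print(should_filter(4.0e-5, ac=9, an=121_412))
```

With the HCM threshold, a variant seen 10 times in the full cohort is filtered while one seen 9 times remains a candidate, matching the maximum tolerated allele count of 9.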

To assess the efficiency of our approach, the filtering allele frequency was calculated based on 60,206 exomes and the filters were applied to a simulated dominant Mendelian variant discovery analysis on the remaining 500 exomes. Figure 5a shows that filtering at allele frequencies lower than 0.1% can substantially reduce the number of predicted protein-altering variants in consideration, with the mean number of variants per exome falling from 176 at a cutoff of 0.1% to 63 at a cutoff of 1e-6.

Additionally, we compared the prevalence of variants in HCM genes in cases and controls across the allele frequency spectrum, and computed disease odds ratios for different frequency bins. We used a cohort of 322 patients recruited to the Royal Brompton Hospital cardiac Biomedical Research Unit with diagnosis of HCM confirmed by cardiac MRI. These samples were sequenced using the Illumina® TruSight Cardio Sequencing Kit on the Illumina® MiSeq and NextSeq platforms. The number of rare variants in MYBPC3, MYH7 and the six other sarcomeric genes associated with HCM (TNNT2, TNNI3, MYL2, MYL3, TPM1 and ACTC1) were calculated for this HCM cohort, and for reference population samples from ExAC. Case / control variant frequencies were calculated for all protein altering variants (frameshift, nonsense, splice donor / acceptor, missense and in-frame insertions / deletions), with frequencies and case / control odds ratios calculated separately for non-overlapping ExAC allele frequency bins with the following breakpoints: 1x10^-5, 5x10^-5, 1x10^-4, 5x10^-4, 1x10^-3, 5x10^-3 and 1x10^-2.

Odds Ratios were calculated as OR = (cases with variant / cases without variant) / (ExAC samples with variant / ExAC samples without variant) along with 95% confidence intervals. In the absence of sample-level genotype data for ExAC, the number of samples with a variant was approximated by the total number of variant alleles - i.e. assuming that each rare variant was found in a distinct sample. Figure 5b shows that the odds ratio for disease-association increases markedly at very low allele frequencies demonstrating that increasing the stringency of a frequency filter improves the information content of a genetic result. Therefore, for established disease genes it has been shown that prioritisation of variants purely by rarity can achieve disease-association odds ratios in the clinically-actionable range.
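The odds ratio calculation above can be sketched as follows. The source does not state how the 95% confidence intervals were obtained; the log-scale (Woolf) interval used here is an assumption, and the counts in the example are hypothetical rather than taken from the HCM cohort or ExAC.

```python
import math


def odds_ratio_with_ci(cases_with, cases_without,
                       controls_with, controls_without, z=1.96):
    """Odds ratio with a log-scale (Woolf) confidence interval.

    All four counts must be non-zero. z=1.96 gives a 95% interval.
    """
    odds_ratio = (cases_with / cases_without) / (controls_with / controls_without)
    se_log_or = math.sqrt(
        1 / cases_with + 1 / cases_without
        + 1 / controls_with + 1 / controls_without
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for a single allele frequency bin.
# For the reference sample, the number of samples with a variant would be
# approximated by the total number of variant alleles, as described above.
or_value, or_lower, or_upper = odds_ratio_with_ci(20, 80, 10, 90)
print(f"OR = {or_value:.2f} (95% CI {or_lower:.2f}-{or_upper:.2f})")
```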

Example 4 - Generation of a diagnostic report

Figure 6 shows an extract of a report generated using the methods and systems of the invention. The report includes a top line providing information about the variant, including the variant score (or its corresponding classification, i.e. 'Likely Pathogenic', in this case), the gene that the variant is found in, the type of variant (e.g. non-synonymous single nucleotide polymorphism, or snSNP, in this case), the effect of the variant on the coding sequence of the gene, the effect on the protein sequence, the zygosity of the variant in the patient, and the data source (e.g. sequencing platform used to obtain the data). The exact information displayed will of course depend on the data that is available, and the variant being assessed.

The report also includes a table that details the evidence rules that have been used to produce the assessment. The table is subdivided into two main sections: a set of columns showing rules that are evidence of benignity of the variant, and a set of columns showing rules that are evidence of pathogenicity of the variant. The table is further divided into rows according to the type of data on which the rule is based. Fields that are 'greyed out' indicate that the rule was not used to calculate the score of the variant, or that the assessment of the rule led to the rule not being activated. Hence, by looking at the report, a user can immediately identify those rules that are activated, make a diagnostic assessment based on the balance of benign / pathogenic rules that are activated (i.e. coloured rather than greyed out), and the strength of evidence that is associated with the activation of these rules (i.e. whether a rule is strong, moderate or supporting of a pathogenic diagnostic, or strong / supporting of a benign diagnostic).

Additionally, the report displayed in Figure 6 contains, below the table, additional evidence that can be taken into account by the user to refine the diagnostic. In this case, the report shows detailed accounts of the number of individuals with the variant in a series of genetic information databases, including reference and disease cohort data sources. Additionally, the report displayed in Figure 6 shows the results produced by multiple computational tools that predict the functional effect that the variant might have on the protein, and the location of the variant in relation to other variants of the gene. As the person skilled in the art would understand, the additional data displayed here may depend on the data that is available for a variant, as well as user preferences on which evidence they would like to be able to investigate for themselves.

Aspects and embodiments of the invention are also defined by the following numbered clauses:

1. A system for assessing the pathogenicity of a genetic variant, the system comprising:

a data analysis server (4), a genetic information data source (12), a disease-variation association data source (14a) and a protein-related data source (14c) wherein the data analysis server is programmed to:

(1) receive (310) information from a user about a genetic variant identified in an individual;

(2) query (320) a genetic information data source for frequency information relating to the variant in at least a control population and a protein-related data source for information on the known or predicted effect of the variant on the gene product and/or a disease-variation association data source for information on association between the variant, the gene or other variants in the gene with diseases;

(3) evaluate (330a) the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from a disease-variation association data source;

(4) combine (330b) the results of the tests of step 3) into a pathogenicity score; and

(5) provide (340) the pathogenicity assessment to a user,

wherein the evaluating the results of one or more tests based at least on the frequency of the variant in a control population comprises calculating a maximum tolerated allele frequency using Eq. 1 or Eq. 3 and determining a maximum sample estimate corresponding to the population frequency obtained at the chosen confidence level.

2. The system according to Clause 1, wherein the data analysis server is programmed to provide a pathogenicity assessment including the pathogenicity score of step (4).

3. The system according to Clause 1 or Clause 2, wherein the data analysis server is programmed to query a genetic information database for frequency information relating to the variant in at least a control and a diseased population.

4. The system according to Clause 3, wherein evaluating the results of one or more tests comprises determining whether the prevalence of the variant in affected individuals is significantly increased compared with controls, and a Fisher's exact test is used to determine whether a variant is associated with a disease based on the frequency of the variant in the control and the diseased population.

5. The system according to Clause 4, wherein the results of the one or more tests are pre-computed across all variants present in the control and the diseased population, and the results are corrected for multiple testing.

6. The system according to any of Clauses 3 to 5, wherein the maximum allelic / genetic contribution parameter in Eq. 1 or Eq. 3 is determined based on the frequency of the most common pathogenic variant in the diseased population, or in a disease population for a similar disease.

7. The system according to any of Clauses 1 to 6, which comprises a genomic database (14b).

8. The system according to Clause 7, wherein the genomic database contains information about paralogous genes, and evaluating the one or more tests comprises determining whether the variant is a missense mutation and an equivalent amino acid change in a paralogous gene is pathogenic, based on information from the disease-variation association data source.

9. The system according to Clause 7 or Clause 8, wherein the genomic database contains information about paralogous genes, and evaluating the results of one or more tests comprises determining whether there is a pathogenic missense mutation at an equivalent amino acid residue of a paralogous gene, based on information from the disease-variation association data source.

10. The system according to any of Clauses 1 to 9, wherein information on the known or predicted effect of the variant on the gene product is obtained from the protein-related data source.

11. The system according to Clause 10, wherein the information obtained from the protein-related data source comprises information relating to the amino acid sequence of the variant protein and/or the effect of an amino acid sequence variant on the function of the protein.

12. The system according to any of Clauses 1 to 11, wherein the disease-variant association data source contains information on the association between a chosen disease and the variant under assessment, other variants in the same gene, and/or variants in a paralogous gene.

13. The system according to any of Clauses 1 to 12, wherein:

the known or predicted effect of the variant on the gene product comprises information on whether the variant is a null variant;

the disease-variation association data source contains information on whether loss of function of the gene containing the variant is a known mechanism of disease; and

wherein evaluating the results of one or more tests based at least on the frequency of the variant comprises determining whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease.

14. The system according to any of Clauses 1 to 13, wherein the disease-variation association data source contains information on the location of pathogenic variants within a gene, and evaluating the results additionally comprises determining whether the pathogenicity of a variant in the gene is highly dependent on the location of the variant.

15. The system according to Clause 14, wherein determining whether the pathogenicity of the variant is highly dependent on its location, and/or determining whether the variant is a null variant in a gene where loss of function of the gene is a known mechanism of disease, additionally comprises determining whether the variant is a nonsense, frameshift or essential splice site variant within exons with a high proportion spliced in (PSI), such as a PSI > 0.9.

16. The system according to any of Clauses 1 to 15, wherein the protein-related data source comprises the results of at least five tools for prediction of the effect of a variant on the function of the protein, and evaluating the results of one or more tests comprises:

determining whether at least two of the tools predict a deleterious effect on the gene or gene product; and

no more than one tool predicts no deleterious effect on the gene or gene product.

17. The system according to any of Clauses 1 to 16, wherein evaluating the results of one or more tests based at least on the frequency of the variant in a control population comprises determining whether the allele frequency of the variant in a control population is above a threshold, wherein the threshold is the maximum sample estimate corresponding to the population maximum tolerated allele frequency at the chosen confidence level.

18. The system according to any of Clauses 1 to 17, wherein the one or more tests of step (3) are assigned weights; and combining the results of the tests of step (3) into a pathogenicity score in step (4) comprises computing a sum of the weights for all the tests that are evaluated as positive.

19. The system according to Clause 18, wherein the weights are dependent on the strength of the evidence associated with each test.

20. The system according to Clause 19, wherein the tests are separated into multiple categories of evidence, and a common weight is assigned to all the tests in the same category.

21. The system according to any of Clauses 1 to 20, wherein the pathogenicity of a protein altering genetic variant is assessed in relation to cardiac conditions.

22. The system according to any of Clauses 1 to 21, wherein the pathogenicity of a protein altering genetic variant is assessed in relation to cardiomyopathies.

23. The system according to any of Clauses 1 to 22, wherein calculating the maximum sample estimate corresponding to the population frequency mpaf obtained at the chosen confidence level x comprises calculating the x-th percentile of a Poisson distribution where λ is given by Eq. 2, wherein sample size is the number of individuals in the control population from the genetic information data source.

24. The system according to any of Clauses 1 to 23, wherein the confidence level x is 90%, 95% or 99%; preferably 95%.

25. The system of any of Clauses 1 to 24, wherein providing the pathogenicity assessment to a user comprises providing a report that comprises the pathogenicity score, and an indication of the result of all the rules evaluated.

26. The system of any of Clauses 1 to 25, wherein the data analysis server is further programmed to (6) receive user input commands to modify the results of one or more tests, in order to further refine the pathogenicity assessment.

27. A method for assessing the pathogenicity of a protein altering genetic variant, the method comprising:

(1) receiving (310) information from a user about a genetic variant identified in an individual;

(2) querying (320):

(i) a genetic information data source (12) for frequency information relating to the variant in at least a control population; and

(ii) a protein-related data source (14c) for information about the known or predicted effect of the variant on the gene product; and/or

(3) evaluating (330a) the results of one or more tests based at least on the frequency of the variant in a control population and one or more tests based at least on the known or predicted effect of the variant on the gene product and/or information from the disease-variant association data source;

(4) combining (330b) the results of the tests of step (3) into a pathogenicity score; and

(5) providing (340) the pathogenicity assessment to a user,

28. A data analysis server comprising a processor 222 and a memory 224, wherein the processor is programmed to perform the method of Clause 27.

29. A computing device comprising software adapted to perform the method of Clause 27.
