
Title:
METHODS AND APPARATUS FOR IDENTIFYING ONE OR MORE GENETIC VARIANTS ASSOCIATED WITH DISEASE IN AN INDIVIDUAL OR GROUP OF RELATED INDIVIDUALS
Document Type and Number:
WIPO Patent Application WO/2018/051072
Kind Code:
A1
Abstract:
Methods and apparatus for identifying one or more genetic variants associated with a disease are disclosed. In one arrangement a method comprises receiving input data comprising a set of candidate genetic variants present in an individual or group of individuals. A set of candidate hypotheses is generated. Each candidate hypothesis comprises a set of one or more of the candidate genetic variants. Prioritisation data is received. The prioritisation data represents an initial prioritisation of the candidate hypotheses in relative order of likely validity. A plurality of update steps is performed. Each update step takes as input all of the prioritised candidate hypotheses from a preceding step and updates the prioritisation using update reference data. A prioritised list of candidate hypotheses is output after the updating of the prioritisation provided by the plurality of update steps.

Inventors:
DING ZHIHAO (GB)
FROT BENJAMIN (GB)
JOSTINS LUKE (GB)
MCVEAN GILEAN (GB)
SIMPSON MICHAEL (GB)
WELLER SUSANNE (GB)
Application Number:
PCT/GB2017/052677
Publication Date:
March 22, 2018
Filing Date:
September 12, 2017
Assignee:
GENOMICS PLC (GB)
International Classes:
G16B20/20; G16B40/20; G16B20/00; G16B20/40
Foreign References:
US20130231404A12013-09-05
US20140310215A12014-10-16
Other References:
DACE RUKLISA ET AL: "Bayesian models for syndrome- and gene-specific probabilities of novel variant pathogenicity", GENOME MEDICINE,, vol. 7, no. 1, 28 January 2015 (2015-01-28), pages 5, XP021210524, ISSN: 1756-994X, DOI: 10.1186/S13073-014-0120-4
GLORIA M PETERSEN: "Missense Mutations in Disease Genes: A Bayesian Approach to Evaluate Causality", 1 January 1998 (1998-01-01), XP055428255, Retrieved from the Internet [retrieved on 20171123]
JENNY C TAYLOR ET AL: "Factors influencing success of clinical genome sequencing across a broad spectrum of disorders", NATURE GENETICS., vol. 47, no. 7, 18 May 2015 (2015-05-18), NEW YORK, US, pages 717 - 726, XP055426698, ISSN: 1061-4036, DOI: 10.1038/ng.3304
CAROLINE F WRIGHT ET AL: "Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data", THE LANCET, vol. 385, no. 9975, 1 April 2015 (2015-04-01), GB, pages 1305 - 1314, XP055426697, ISSN: 0140-6736, DOI: 10.1016/S0140-6736(14)61705-0
MATTHEW STEPHENS ET AL: "Bayesian statistical methods for genetic association studies", NATURE REVIEWS GENETICS, vol. 10, no. 10, 1 October 2009 (2009-10-01), GB, pages 681 - 690, XP055428167, ISSN: 1471-0056, DOI: 10.1038/nrg2615
Attorney, Agent or Firm:
FORSYTHE, Dominic (GB)
Claims:
CLAIMS

1. A computer-implemented method for identifying one or more genetic variants associated with a disease in an individual or group of individuals, comprising:

receiving input data comprising a set of candidate genetic variants present in the individual or group of individuals;

generating a set of candidate hypotheses, each candidate hypothesis comprising a set of one or more of the candidate genetic variants;

receiving prioritisation data representing an initial prioritisation of the candidate hypotheses in relative order of likely validity;

performing a plurality of update steps, each update step taking as input all of the prioritised candidate hypotheses from a preceding step and updating the prioritisation using update reference data; and

outputting a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps.

2. The method of claim 1, wherein the update steps are performed using a probabilistic model.

3. The method of claim 1 or 2, wherein the probabilistic model comprises a Bayesian model.

4. The method of any preceding claim, wherein:

the received prioritisation data comprises a probability parameter representing the likely validity of each hypothesis; and

the updating of the prioritisation comprises, in each update step and for each hypothesis, calculating an update factor and applying the update factor to the probability parameter to obtain an updated probability parameter.

5. The method of claim 4, wherein in at least one of the update steps:

the update reference data comprises a value of a predetermined metric for each candidate hypothesis; and

the calculation of the update factor comprises using a probability distribution of the predetermined metric over variants known to be pathogenic.

6. The method of claim 5, wherein the calculation of the update factor comprises using a probability distribution of the predetermined metric over variants known to be pathogenic with respect to disease consistent with a phenotype displayed by the individual or group of individuals.

7. The method of claim 5 or 6, wherein the calculation of the update factor comprises using a probability distribution of the predetermined metric over variants not known to be pathogenic.

8. The method of any of claims 5-7, wherein the predetermined metric in at least one of the update steps comprises identification of one of a plurality of sub-ranges of allele frequency in which the allele frequency of a variant of the candidate hypothesis falls.

9. The method of any of claims 5-8, wherein the predetermined metric in at least one of the update steps comprises a classification of the impact of a variant of the candidate hypothesis on a protein produced by a gene containing the variant.

10. The method of any of claims 5-9, wherein the predetermined metric in at least one of the update steps comprises identification of one of a plurality of sub-ranges of a number of mutations that are within a predetermined number of amino acids of a reference mutation, the reference mutation being a mutation caused in a protein produced by a gene containing a variant of the candidate hypothesis.

11. The method of any of claims 4-10, wherein the calculating of the update factor in at least one of the update steps comprises calculating a proportion by which a gene containing the variant or variants of the candidate hypothesis is depleted of variation.

12. The method of any of claims 4-11, wherein the calculating of the update factor in at least one of the update steps comprises determining whether the variant or variants of the candidate hypothesis are in a gene on a reference list of genes known to contribute to the disease.

13. The method of any of claims 4-12, further comprising:

outputting update step ranking data, wherein the update step ranking data provides a quantitative indication of a relative contribution of one or more of the update steps to the prioritised list of candidate hypotheses output by the method, wherein the update step ranking data is generated using the calculated update factors.

14. The method of any preceding claim, wherein one or more of the candidate hypotheses each comprise that a combination of a selected set of one or more of the candidate genetic variants and a selected inheritance mode causes the disease.

15. The method of any preceding claim, wherein one of the candidate hypotheses is that the disease is not caused by any of the candidate genetic variants.

16. The method of any preceding claim, wherein each update step outputs a probability for each of the candidate hypotheses.

17. The method of any preceding claim, wherein at least one of the candidate hypotheses comprises a combination of a selected one of the candidate genetic variants and an inheritance mode consistent with a parent carrying a variant that is causal for the disease but where the phenotype of the disease has not been identified in the parent.

18. A computer program comprising code which when run on a computer causes the computer to perform the method of any preceding claim.

19. A computer program product comprising the computer program of claim 18.

20. An apparatus for identifying one or more genetic variants associated with a disease in an individual or group of individuals, comprising:

a data receiving unit configured to receive data representing a set of candidate genetic variants in the individual or group of individuals; and

a processing unit configured to:

generate a set of candidate hypotheses, each candidate hypothesis comprising a set of one or more of the candidate genetic variants;

receive prioritisation data representing an initial prioritisation of the candidate hypotheses in relative order of likely validity;

perform a plurality of update steps, each update step taking as input all of the prioritised candidate hypotheses from a preceding step and updating the prioritisation using update reference data; and

output a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps.

Description:
METHODS AND APPARATUS FOR IDENTIFYING ONE OR MORE GENETIC VARIANTS ASSOCIATED WITH DISEASE IN AN INDIVIDUAL OR GROUP OF RELATED INDIVIDUALS

The invention relates to the field of genomic medicine, and in particular to identifying one or more disease-causing genetic variants in a set of variants that typically span the whole genome. Disclosed methods and apparatus provide a framework in which evidence from multiple data sources can be combined effectively to evaluate pathogenicity for each genetic variant.

Genomic medicine is defined by the US National Human Genome Research Institute as an emerging medical discipline that involves using genomic information about an individual as part of their clinical care (e.g., for diagnostic or therapeutic decision-making) and the health outcomes and policy implications of that clinical use. Central to the successful application of genomic medicine is accurate, consistent and reproducible identification of the genetic basis of certain diseases through the interpretation of genomic data in a clinical setting.

The human genome consists of around 3.2 billion base pairs. It is known that the average extent of divergence from the reference human genome is around 3.6 million sites per individual (1). In the context of this application, we refer to these divergences as 'variants'.

In order to diagnose a genetic disease, one must identify the one or more disease causing variants within all recorded variants in an individual. In the context of this application, we refer to a diagnosis as the identification of the disease causing variant(s) for a particular individual in the context of a clinical phenotype. It should be noted that clinical analysis of genetic disease can also relate to members of a family (group of related individuals) where one or more individuals are affected by a putative genetic disease.

In the context of diagnostics, the molecular genetic assessment of genetic diseases has made huge advances in the last ten years. So called 'next-generation genome sequencing' technology has made it feasible in a clinical setting to sequence a whole human genome or exome, or in the case of cancer, a patient's tumour and normal tissue separately. These sequencing efforts enable the construction of an individual's variant profile, which is used as a starting point to identify possible disease causing variants.

A clinical scientist will then work through this data. Depending on the condition, several sets of filters are usually applied to narrow down to the causal variant. These filters usually include filtering for variants whose genotype is consistent with the observed inheritance, excluding common variants, and filtering for deleterious variants that are located within genes of interest.

Some genetic conditions are caused by variants in a single gene, variants that are usually inherited in one of several straightforward patterns, depending on the gene involved. In the case of diseases that are inherited from an individual's parents, clinical variant analysis would typically consider only those variants that match a particular inheritance mode, i.e. the way in which a variant is passed on from parents to offspring. These modes include autosomal dominant (where one mutated copy of a gene in each cell is sufficient to cause disease) and autosomal recessive (where two mutated copies of the gene are required). Special attention is usually given to variants on the sex chromosomes because rare developmental disorders are often caused by variants on the X chromosome. These diseases are referred to as X-linked diseases.

After screening for all variants that match an inheritance mode that is consistent with the disease pattern observed in the family, these candidate variants are then compared against known, common variants, which are listed in public genetic databases such as the 1000 Genomes Consortium database and the Exome Aggregation Consortium (ExAC) database. Variants that are common in the population will be filtered out as they are most probably not causing a disease. To further assess the effect of the assumed disease-causing variant(s) one needs to annotate the variant with its predicted functional consequence. Within protein coding regions this may include predicted effects on protein structure and function, using bioinformatic tools such as Annovar, SnpEff or Variant Effect Predictor (VEP). After applying all these filters, the list of potentially disease-causing variants typically numbers in the tens to hundreds. Each of these diagnostic candidate variants needs to be assessed manually, usually by a clinical scientist or pathologist, for quality, causality, and recurrence in reported disease cases. If the disease-causing variant cannot be found, the process has to be repeated for another likely inheritance mode, or a different set of filters.
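The tiered filtering workflow described above can be sketched in outline as follows. The variant fields, the frequency threshold and the consequence labels are illustrative assumptions, not values taken from this application.

```python
# Illustrative sketch of a tiered filtering pipeline for variant analysis.
# Field names and thresholds are hypothetical, chosen for illustration only.

variants = [
    {"id": "A", "fits_inheritance": True,  "allele_freq": 0.0001, "consequence": "missense"},
    {"id": "B", "fits_inheritance": True,  "allele_freq": 0.15,   "consequence": "synonymous"},
    {"id": "C", "fits_inheritance": False, "allele_freq": 0.0002, "consequence": "stop_gained"},
]

def tiered_filter(variants, max_freq=0.01):
    """Apply the three filters in sequence; variants failing any filter are discarded."""
    step1 = [v for v in variants if v["fits_inheritance"]]          # inheritance filter
    step2 = [v for v in step1 if v["allele_freq"] < max_freq]       # common-variant filter
    step3 = [v for v in step2 if v["consequence"] != "synonymous"]  # deleteriousness filter
    return step3

shortlist = tiered_filter(variants)
print([v["id"] for v in shortlist])
```

Note that a variant discarded at any filter (here, variant C failing the inheritance filter) can never reach the shortlist, however strong the evidence from later annotations might have been.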

Several methods exist that attempt to assign functional consequence to variants without a disease context (or inheritance mode). Examples include PolyPhen, SIFT, VAAST, and MutationTaster. These approaches can be characterised as classification methods, aiming to assign any variant to a "benign" or "disease-causing" class, and use a generic machine-learning classifier algorithm, two training sets of known benign and known disease-causing variants, and a large range of covariates, such as evolutionary conservation, to arrive at an automatic classification method. These methods can provide useful information for diagnosis, but by themselves have insufficient specificity to be used as-is in a clinical context. For instance, MutationTaster, one of the more accurate methods, achieves a false-positive rate of ~1% in exome data, corresponding to ~100 mutations per individual.

A few authors have considered the problem of interpreting the variation in an individual's genome in the context of a disease and familial inheritance patterns. Taylor et al (2015) (2) use a semi-automated approach of variant analysis based on a three-tiered system, comprising consideration of known genes for the disorder; known genes in related disorders; and known genes in relevant biological pathways. A similar approach was used by Wright et al (2015) (3), who combined an automated variant filtering pipeline with a comparison against a database of genes consistently implicated in specific developmental disorders. Common to these methods is the exclusion of variants under consideration based on specific rules or thresholds, leading to suboptimal results.

It is an object of the invention to provide improved methods and apparatus for efficiently identifying variants associated with disease in an individual or group of related individuals.

According to an aspect of the invention, there is provided a computer-implemented method for identifying one or more genetic variants associated with a disease in an individual or group of individuals, comprising: receiving input data comprising a set of candidate genetic variants present in the individual or group of individuals; generating a set of candidate hypotheses, each candidate hypothesis comprising a set of one or more of the candidate genetic variants; receiving prioritisation data representing an initial prioritisation of the candidate hypotheses in relative order of likely validity; performing a plurality of update steps, each update step taking as input all of the prioritised candidate hypotheses from a preceding step and updating the prioritisation using update reference data; and outputting a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps.

Thus, a method is provided in which candidate hypotheses concerning genetic variants are evaluated in a series of steps to provide a prioritised list indicating which variants are most likely to be disease-causing. In contrast to prior art approaches involving filtering, each update step in the computation allows new evidence to be incorporated without discarding any candidate hypotheses (and associated genetic variants) from the process. As described in further detail below, this improves the reliability and/or sensitivity with which genetic variants associated with disease can be identified and/or evaluated, particularly for certain classes of genetic disease or variants with unusual properties. The method allows candidate hypotheses to be prioritised that would not meet all of the inclusion criteria of alternative tiered filtering approaches.
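The update-loop framework summarised above can be sketched as follows; every candidate hypothesis is retained through every step, and only its probability changes. The hypothesis names, prior values and update factors are illustrative assumptions.

```python
# Minimal sketch of the update-loop framework: no hypothesis is ever
# discarded; each step rescales and renormalises the probabilities.
# All names and numbers are hypothetical, for illustration only.

def run_updates(priors, update_steps):
    """priors: {hypothesis: prior probability}; update_steps: list of
    {hypothesis: update factor} dicts. Returns a prioritised list of
    (hypothesis, probability) pairs, highest probability first."""
    probs = dict(priors)
    for factors in update_steps:
        # Apply the step's update factor to every hypothesis, then renormalise.
        probs = {h: p * factors[h] for h, p in probs.items()}
        total = sum(probs.values())
        probs = {h: p / total for h, p in probs.items()}
    # Output ALL hypotheses, ranked - none are filtered out along the way.
    return sorted(probs.items(), key=lambda hp: hp[1], reverse=True)

priors = {"A": 0.4, "B": 0.4, "C": 0.2}
steps = [{"A": 1.0, "B": 0.2, "C": 5.0},   # e.g. allele-frequency evidence
         {"A": 2.0, "B": 1.0, "C": 3.0}]   # e.g. consequence evidence
print(run_updates(priors, steps))
```

In this sketch hypothesis C, which starts with the lowest prior, ends up top-ranked after the evidence is accumulated; under a hard filter it might have been discarded at the first step.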

In addition to providing improved reliability and/or sensitivity compared to a filtering-based approach, the proposed method has the advantage of allowing the user to interrogate the reasons for the algorithm's decisions after the algorithm has run. In particular, it is possible to quantify the contribution of each update step to the final posterior probability, affording the possibility of determining a list of qualitative reasons why the variant has been assigned its posterior probability, ranked by their importance, that is, their contribution to the final posterior probability.

In an embodiment, the update steps are performed using a probabilistic model such as a Bayesian model. The inventors have found that this approach can be implemented quickly on a computer and provides a natural framework for the update steps. In the context of a Bayesian model the update steps may be referred to as Bayesian update steps. If a user needs to reconfigure input data, for example by changing gene lists, this can be achieved significantly more efficiently than in prior art approaches where several filtering steps would typically need to be repeated.

Filtering pipelines in variant analysis also need to be rerun for each suspected inheritance mode. Embodiments disclosed herein are able to present weighted evidence for each inheritance mode, and provide a statistical score or confidence measure for both the inheritance mode and the variant in the diagnosis.

In an embodiment, at least one of the candidate hypotheses comprises a combination of a selected one of the candidate genetic variants and an inheritance mode consistent with a parent carrying a variant that is causal for the disease but where the phenotype of the disease has not been identified in the parent. This is desirable for scenarios with low penetrance, where a parent may carry the causal variant but may not exhibit the phenotype. In a rule-based approach, a variant with genotypes not matching the phenotype description would be filtered out. Embodiments disclosed herein are able to recover these variants and present evidence for low penetrance. The user would then be able to check for hidden phenotypes in the parent.

In contrast to other state-of-the-art methods, embodiments disclosed herein are able to present evidence across the spectrum of inheritance modes, including X chromosome linked or compound heterozygous modes.

Furthermore, unlike other analytical methods, embodiments disclosed herein can take into account assumptions that certain diseases may not have a genetic cause or a genetic cause that currently cannot be interpreted. This is achieved by including a candidate hypothesis that the disease does not have a genetic cause. Providing evidence for a non-genetic cause is of substantial clinical relevance especially when diagnosing infants and children, as this may change their prognosis and treatment trajectory. Having knowledge about the absence of a genetic condition may affect decisions about treatment, as the scope for improvement of the symptoms may be different than under the assumption that there is an underlying genetic cause. Similarly, such issues may impact the parents' future reproductive decisions.

The inventors have found that the methodology lends itself to particularly efficient computer implementation. It has been found that posterior probabilities can typically be calculated in 5-7 minutes for a trio family on a single CPU. Speed can be enhanced by annotating variants in memory using a single, collated database rather than different sources of annotation data.

The disclosed methodology provides particular advantages if the user needs to reconfigure the input data, for example by changing gene lists. The flexibility and speed provides a significant advantage where data needs to be frequently analysed and potentially re-analysed. Prior art methods for interpretation of disease causing variants require repetition of several filtering steps. According to methods disclosed herein all variants can be considered simultaneously, allowing automated reanalysis any time the input data is altered.

In an embodiment, the predetermined metric in at least one of the update steps comprises identification of one of a plurality of sub-ranges of allele frequency in which the allele frequency of a variant of the candidate hypothesis falls. The method is able efficiently to take into account that severe disease causing variants tend to have low allele frequency, as natural selection purges them from the population. This information is used by the method to improve the prioritisation of the candidate hypotheses.
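The allele-frequency update factor described above can be sketched as follows: the metric identifies which sub-range (bin) the variant's allele frequency falls into, and the update factor is the ratio of that bin's probability under variants known to be pathogenic versus variants not known to be pathogenic. The bin boundaries and distributions below are illustrative assumptions.

```python
import bisect

# Hypothetical sub-ranges of allele frequency and the probability of a
# variant falling into each, for pathogenic and for other variants.
# All values are illustrative assumptions, not data from this application.
BIN_EDGES = [1e-4, 1e-3, 1e-2]                   # three boundaries -> four bins
P_BIN_PATHOGENIC = [0.70, 0.20, 0.08, 0.02]      # distribution over pathogenic variants
P_BIN_BENIGN     = [0.10, 0.15, 0.25, 0.50]      # distribution over other variants

def frequency_update_factor(allele_freq):
    """Return the update factor (likelihood ratio) for a variant's allele frequency."""
    b = bisect.bisect_right(BIN_EDGES, allele_freq)  # which sub-range does it fall in?
    return P_BIN_PATHOGENIC[b] / P_BIN_BENIGN[b]

# Rare variants are up-weighted; common variants are down-weighted but kept.
print(frequency_update_factor(5e-5))   # rarest bin: large factor
print(frequency_update_factor(0.2))    # common bin: small factor, not a hard exclusion
```

A common variant thus receives a small but non-zero factor, so it is deprioritised rather than removed, consistent with the no-filtering principle above.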

In an embodiment, the predetermined metric in at least one of the update steps comprises a classification of the impact of a variant of the candidate hypothesis on a protein produced by a gene containing the variant. The method is able efficiently to take into account that the impact that a mutation has on a gene is a major predictor of how likely it is to be pathogenic. This information is used by the method to improve the prioritisation of the candidate hypotheses.

In an embodiment, the predetermined metric in at least one of the update steps comprises identification of one of a plurality of sub-ranges of a number of mutations that are within a predetermined number of amino acids of a reference mutation, the reference mutation being a mutation caused in a protein produced by a gene containing a variant of the candidate hypothesis. The method is able efficiently to take into account that disease causing mutations, particularly gain-of-function mutations, are often clustered in the same region of the gene. This information is used by the method to improve the prioritisation of the candidate hypotheses.
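The clustering metric described above can be sketched as follows: count the known mutations whose amino-acid position lies within a window of the candidate mutation's position, then identify which sub-range the count falls in. The positions, window size and sub-range boundaries are illustrative assumptions.

```python
# Hypothetical amino-acid positions of known disease mutations in a protein.
# All values are illustrative assumptions, for demonstration only.
KNOWN_MUTATION_POSITIONS = [12, 14, 15, 200, 450]

def clustering_bin(candidate_pos, window=10, edges=(1, 3)):
    """Return the sub-range index for the number of known mutations within
    `window` amino acids of the candidate mutation's position."""
    n_near = sum(1 for p in KNOWN_MUTATION_POSITIONS
                 if abs(p - candidate_pos) <= window)
    if n_near < edges[0]:
        return 0   # isolated: no nearby known mutations
    if n_near < edges[1]:
        return 1   # weakly clustered
    return 2       # strongly clustered

print(clustering_bin(13))    # falls among the cluster at positions 12-15
print(clustering_bin(300))   # no known mutations nearby
```

The resulting bin index would then be used like any other predetermined metric, via distributions over pathogenic and non-pathogenic variants, to compute the update factor.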

In an embodiment, the calculating of the update factor in at least one of the update steps comprises calculating a proportion by which a gene containing the variant or variants of the candidate hypothesis is depleted of variation. The method is able efficiently to take into account that if mutations in a gene cause a severe disease or other deleterious phenotype then this gene will become depleted of variation as natural selection weeds out the pathogenic variation. This information is used by the method to improve the prioritisation of the candidate hypotheses.
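The depletion-of-variation calculation described above can be sketched as the shortfall of observed variants in a gene relative to the number expected under neutrality. The variant counts below are illustrative assumptions.

```python
# Sketch of a gene-constraint calculation: the proportion by which a gene
# is depleted of variation. The counts are hypothetical, for illustration.

def depletion_proportion(observed_variants, expected_variants):
    """Fraction of expected variation missing from the gene
    (0 = no depletion, approaching 1 = almost fully depleted)."""
    return max(0.0, 1.0 - observed_variants / expected_variants)

# A strongly constrained gene: far fewer variants observed than expected.
print(depletion_proportion(12, 80))   # ~0.85: strong depletion
# An unconstrained gene: roughly as many variants as expected.
print(depletion_proportion(78, 80))   # ~0.03: little depletion
```

A high depletion proportion suggests the gene is under strong purifying selection, so variants within it would be up-weighted by the corresponding update step.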

In an embodiment, the calculating of the update factor in at least one of the update steps comprises determining whether the variant or variants of the candidate hypothesis are in a gene on a reference list of genes known to contribute to the disease. The method is able efficiently to take into account that for many diseases we have a good idea of many of the genes that contribute to the disease. A metric that describes whether the variant under consideration is located in one of these well-known genes, and if so which one, can then be used to carry out an update that upweights mutations found in important genes. This information is used by the method to improve the prioritisation of the candidate hypotheses.

In an embodiment, the method further comprises outputting update step ranking data, wherein the update step ranking data provides a quantitative indication of a relative contribution of each of the update steps to the prioritised list of candidate hypotheses output by the method, wherein the update step ranking data is generated using the calculated update factors. Thus, the method is able to quantify the contribution of each update step to the final posterior probability, and therefore to rank the various update steps by their importance to the final result, which can be interpreted by the end user as reasons why the algorithm has made its particular decision.
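The update-step ranking described above can be sketched as follows: since the final posterior is proportional to the product of the prior and the per-step update factors, the magnitude of each step's log update factor measures its contribution, and the steps can be ranked accordingly. The step names and factor values are illustrative assumptions.

```python
import math

# Sketch of ranking update steps by their contribution to a hypothesis's
# final posterior probability. Step names and factors are hypothetical.

def rank_update_steps(update_factors):
    """update_factors: {step name: update factor applied to the hypothesis}.
    Returns (step, log-contribution) pairs, largest influence first.
    A positive log-contribution pushed the hypothesis up; a negative one
    pushed it down; magnitude measures the strength of the evidence."""
    contributions = {step: math.log(f) for step, f in update_factors.items()}
    return sorted(contributions.items(), key=lambda sc: abs(sc[1]), reverse=True)

factors = {"allele frequency": 7.0, "consequence": 2.5,
           "gene list": 12.0, "clustering": 1.1}
for step, logf in rank_update_steps(factors):
    print(f"{step}: {logf:+.2f}")
```

The ranked list reads as a list of reasons, ordered by importance, for the hypothesis's final posterior probability.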

According to an alternative aspect of the invention, there is provided an apparatus for identifying one or more genetic variants associated with a disease in an individual or group of individuals, comprising: a data receiving unit configured to receive data representing a set of candidate genetic variants in the individual or group of individuals; and a processing unit configured to: generate a set of candidate hypotheses, each candidate hypothesis comprising a set of one or more of the candidate genetic variants; receive prioritisation data representing an initial prioritisation of the candidate hypotheses in relative order of likely validity; perform a plurality of update steps, each update step taking as input all of the prioritised candidate hypotheses from a preceding step and updating the prioritisation using update reference data; and output a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps.

The invention will now be further described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 schematically depicts a prior art rule-based approach to variant analysis;

Figure 2 schematically depicts a probabilistic approach to variant analysis;

Figure 3 is a flowchart depicting steps in an example method for identifying one or more genetic variants associated with disease in a subject; and

Figure 4 schematically depicts an apparatus for identifying one or more genetic variants associated with a disease in an individual or group of individuals according to an embodiment.

As mentioned in the introductory part of the description, prior art approaches to identifying genetic variants associated with disease use tiered filtering approaches, whereas embodiments of the present invention rely on a sequence of update steps (e.g. Bayesian update steps) which involve all of the candidate hypotheses (and associated variants). Within each update step, evidence for disease-causing properties of a variant is weighed against evidence for known healthy properties in order to provide improved prioritisation of a list of candidate hypotheses. The difference between the tiered filtering approach of the prior art and embodiments of the invention is depicted schematically in Figures 1 and 2. In contrast to the tiered filtering approach, embodiments of the invention do not discard any variants, but rather enable prioritisation of them based on their probability of being disease causing. This means that all variants, including those with unusual properties that would likely not form part of other analytical methods, will be considered in the variant prioritisation process. As a consequence it is possible to prioritise variants that do not match all inclusion criteria - a scenario that is likely for higher frequency recessive variants or variants with marginal quality metrics.

An example rule-based (tiered filtering) approach is depicted in Figure 1 and an example probabilistic approach involving update steps (as in embodiments of the invention) is depicted in Figure 2. In the rule-based approach of Figure 1, a set of candidate variants (A, B, C) is run through Filters 1, 2 and 3. The result is a short list of variants that then have to be examined manually. In difficult cases, such as the aforementioned recessive variants, an important variant might get discarded (C in Filter 1). The state-of-the-art approach currently allows for an adjustment of the filter threshold (dotted lines) if no plausible disease-causing variant can be found. This will result in the potential inclusion of the variant in the results, but also increases the number of variants that have to be screened manually. When the filter thresholds are adjusted, the filtering pipeline has to be reiterated, and a second set of results has to be stored. This can lead to reproducibility problems. In contrast, in the approach of Figure 2, all variants are kept in the system at each stage of the calculation. Instead of a hard filter, the method models the probability of the variant being disease causing for a certain annotation, and prioritises variants along a well-calibrated gradient of disease probability (labelled "Prob. gradient" and decreasing from left to right in the figure). This is repeated for different annotations, forming several 'update steps'. Note how variants are ordered differently after each update step (A, B, C in Update 1; C, B, A in Update 2; etc.). The result is a full list of all variants, eliminating the risk of filtering out the causal variant. Each variant is annotated with a disease probability (0-100%), so that manual examination can be limited to the relevant variants.

Figure 3 is a flowchart illustrating the framework of methods according to an embodiment of the invention.

In an embodiment of this type, there is provided a computer-implemented method for identifying one or more genetic variants associated with a disease in an individual or group of individuals (e.g. a family group).

In step S1, the method comprises receiving input data comprising a set of candidate genetic variants present in the individual or group of related individuals. The input data may be provided for example in the form of a VCF (Variant Call Format) file or be derived from a VCF file.

In step S2, the method comprises generating a set of candidate hypotheses. Each candidate hypothesis comprises a set of one or more of the candidate genetic variants. As described in further detail below, in one embodiment each candidate hypothesis comprises a combination of an inheritance mode and a small set of one or more genetic variants.

In step S3, the method comprises receiving prioritisation data representing an initial prioritisation (e.g. ranking) of the candidate hypotheses in relative order of likely validity. The prioritisation data may comprise a prior probability for each candidate hypothesis and be set by a user or calculated based on initial reference data.

In step S4, the method further comprises performing an update step. The update step S4 may be repeated, as indicated schematically via the YES loop via block S5. The update step takes as input all of the prioritised candidate hypotheses from the preceding step. The preceding step will be step S3 for the first update step S4 and a preceding update step S4 for each subsequent update step S4. Each update step comprises updating the prioritisation using update reference data. Examples of update steps are described below.

In step S6, the method outputs a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps S4. As indicated schematically by the broken line, after output of any given prioritised list the method may perform further update steps S4 and further corresponding output steps S6, thereby providing a sequence of outputs of increasingly refined prioritised lists of candidate hypotheses.

The computer-implemented method may be controlled by a computer program which when run on a computer causes the computer to perform the method. A computer program product, such as a storage device, may provide the computer program.

In an embodiment the update steps are performed using a probabilistic model, particularly a Bayesian model. In an embodiment each update step outputs a probability for each of the candidate hypotheses.

One or more of the candidate hypotheses each comprise that a combination of a selected set of one or more of the candidate genetic variants and a selected inheritance mode causes the disease. For example, one or more of the candidate hypotheses may each comprise a combination of a selected one of the candidate genetic variants and an X chromosome linked inheritance mode. One or more of the candidate hypotheses may comprise a combination of a selected two of the candidate genetic variants and a compound heterozygous inheritance mode.

In an embodiment, one of the candidate hypotheses is that the disease is not caused by any of the candidate genetic variants. In an embodiment, the prioritisation data received in step S3 (before the update steps) comprises a probability parameter representing a measure of the likely validity of each hypothesis. In the detailed Bayesian model example described below, the initial probability parameter π_0(m, c) = Pr(C = c | M = m) Pr(M = m) for each hypothesis is derived from a combination of an initial prior probability for an inheritance mode of the hypothesis, Pr(M = m), and a prior probability for a set of variants of the hypothesis conditional on the inheritance mode, Pr(C = c | M = m).

The updating of the prioritisation comprises, in each update step S4 and for each hypothesis, calculating an update factor and applying the update factor to the probability parameter to obtain an updated probability parameter. In the detailed Bayesian model example described below, the update factor comprises a Bayes factor BF_t(m, c).

In an embodiment, the update reference data used in each update step S4 comprises a value of a predetermined metric for each candidate hypothesis. Examples of the types of predetermined metric that have been found to be effective are described in detail below. The predetermined metric will typically comprise a single value for each candidate variant summarising external data concerning that variant. For some types of external data (e.g. mutation consequence) a single value will be present already in the raw data. For other types of external data some processing may be necessary to obtain the single value.

The calculation of the update factor for each candidate hypothesis comprises using a probability distribution of the predetermined metric over variants known to be pathogenic. In an embodiment, the calculation is performed using a probability distribution of the predetermined metric over variants known to be pathogenic with respect to disease consistent with a phenotype displayed by the individual or group of individuals. A probability is obtained of observing the value of the predetermined metric for the candidate hypothesis provided by the update reference data conditional on the candidate hypothesis being causally linked to a pathogenic state. In the detailed Bayesian model described below, this probability is combined with a probability of observing the value of the predetermined metric for the candidate hypothesis provided by the update reference data conditional on the candidate hypothesis not being causally linked to a pathogenic state to obtain the Bayes factor BF_t(m, c). Thus, the method uses sources of data from both healthy and pathogenic variation to support the identification of genetic variants associated with disease.

Figure 4 depicts an example apparatus 2 for carrying out the computer-implemented method. The apparatus 2 comprises a data receiving unit 4 and a processing unit 6. These components may be implemented using any of the wide range of suitable computer hardware (e.g. CPU, memory, network infrastructure, servers, etc.) known to the skilled person. In an embodiment, the data receiving unit 4 receives data representing a set of candidate genetic variants in the individual or group of individuals and/or prioritisation data representing an initial prioritisation of candidate hypotheses in relative order of likely validity. The processing unit 6 performs at least the following steps: generate a set of candidate hypotheses, each candidate hypothesis comprising a set of one or more of the candidate genetic variants; receive prioritisation data representing an initial prioritisation of the candidate hypotheses in relative order of likely validity; perform a plurality of update steps, each update step taking as input all of the prioritised candidate hypotheses from a preceding step and updating the prioritisation using update reference data; and output a prioritised list of candidate hypotheses after the updating of the prioritisation provided by the plurality of update steps.

Detailed Bayesian Model Example

The following section describes a detailed example embodiment using a Bayesian model. The method uses a sequence of Bayesian updates to generate a posterior probability that any one of a set of candidate hypotheses (each associated with one or more variants) is causing the disease, and includes techniques for parameterizing these Bayesian update steps using external datasets. A large set of candidate genetic variants is taken as input from the patient (step S1) (e.g. in VCF format), candidate hypotheses are generated (step S2), and, after prioritising and updating steps (steps S3-S5), a relative likely validity (e.g. as a weight or posterior probability) for each candidate hypothesis is output (step S6).

The following discussion addresses two aspects of the methodology: 1) the calculating of posteriors (examples of the updated probability parameters referred to above) from a series of Bayes factors (examples of the update factors referred to above), and 2) the calculation of those Bayes factors (update factors) by parameterizing models of healthy and pathogenic genetic variation using external datasets (also referred to as update reference data). The first part is dependent on the application of the second.

Calculating candidate hypothesis posteriors from sequential Bayesian updates

The method assumes that there is a single genetic cause of this disease in this patient (i.e. this patient's disease is caused by one or more mutations in one gene). We assume that the genetic cause is inherited according to some inheritance mode M, where M can take six possible forms, i.e. M ∈ {m_p, m_m, m_d, m_r, m_c, m_n}, where m_p is paternal dominant (i.e. a dominant mutation inherited from the father), m_m is maternal dominant, m_d is de novo dominant, m_r is recessive, m_c is compound heterozygous and m_n is the null mode, corresponding to a non-genetic, or otherwise unknown, disease cause. We assume that we observe a set of possible disease-causing variants, V = {v_1, ..., v_n}, where a small subset of these variants, C ⊆ V, is truly driving the disease. The size of this subset is determined by the inheritance mode: for m_p, m_m, m_d and m_r, there is only one causal variant (i.e. |C| = 1). For the compound heterozygous mode, m_c, there are two causal variants in the same gene (one on the maternal chromosome and one on the paternal chromosome, i.e. |C| = 2). For the null mode there are no causal variants, i.e. |C| = 0.

The aim of the method is to identify the probability that each possible combination of inheritance mode and set of variants (m, c) (where m ∈ {m_p, m_m, m_d, m_r, m_c, m_n} and c ⊆ V) is the true mode and causal variants, conditional on some observed data D, i.e. Pr(M = m, C = c | D). We refer to this combination of inheritance mode and set of variants as a candidate hypothesis.

The method requires in this embodiment an initial prior probability for each inheritance mode, Pr(M = m). This can be set by the user, or it can be calculated from the observed pedigree data, as described below. Each possible set of variants is then assigned a prior probability conditional on each inheritance mode, Pr(C = c | M = m). This value is zero if a set of variants is incompatible with this inheritance mode (e.g. a heterozygous variant is incompatible with the recessive mode, a set of two variants is incompatible with the dominant mode, a variant inherited from the mother alone is incompatible with the paternal dominant mode, etc). Otherwise, the prior is spread equally across all compatible sets of variants. From these two values we can calculate an initial prior for each hypothesis, which we will write π_0(m, c) = Pr(C = c | M = m) Pr(M = m). The initial prior is an example of the probability parameter representing a likely validity of the candidate hypothesis referred to above. The set of probability parameters provides the initial prioritisation of the candidate hypotheses in relative order of likely validity. The later performed update steps will refine the probability parameter based on external data so that it represents the likely validity of the hypothesis more and more reliably, and it will then be referred to as a posterior probability.
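As an illustrative sketch (all function and variable names here are hypothetical, not taken from the disclosure), the initial prioritisation can be computed by spreading each inheritance-mode prior uniformly over the variant sets compatible with that mode:

```python
def initial_priors(mode_priors, compatible_sets):
    """mode_priors: dict mapping inheritance mode -> Pr(M = m).
    compatible_sets: dict mapping mode -> list of variant tuples compatible with it.
    Returns dict mapping (mode, variants) -> initial prior pi_0(m, c)."""
    priors = {}
    for mode, p_mode in mode_priors.items():
        sets_ = compatible_sets.get(mode, [])
        for c in sets_:
            # Pr(C = c | M = m) is spread equally across compatible sets
            priors[(mode, c)] = p_mode / len(sets_)
    return priors

# Invented toy numbers for illustration only
mode_priors = {"de_novo": 0.3, "recessive": 0.3, "null": 0.4}
compatible = {
    "de_novo": [("v1",), ("v2",)],
    "recessive": [("v3",)],
    "null": [()],  # the null mode has an empty causal set
}
pri = initial_priors(mode_priors, compatible)
print(pri[("de_novo", ("v1",))])  # 0.3 / 2 = 0.15
```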

In an embodiment each update step incorporates a new source of data (referred to above as update reference data). Each update step t takes in the probabilities (probability parameters) of a preceding update step t-1 and returns probabilities that have been updated to reflect the update reference data D_t being incorporated in that update step t. We write the posterior that a given hypothesis is true, conditional on all the data incorporated at update steps 1 to t, as π_t(m, c), which we calculate using

π_t(m, c) = π_{t-1}(m, c) BF_t(m, c) / Σ_{m', c'} π_{t-1}(m', c') BF_t(m', c')

where BF_t(m, c) is the Bayes factor for the candidate hypothesis (m, c) and the update reference data used at update step t, and is given by

BF_t(m, c) = Π_{v_i ∈ c} Pr(D_t,i | M = m, v_i ∈ C) / Pr(D_t,i | v_i ∉ C)

where D_t,i is the subset of the update reference data D_t that is associated with variant v_i (e.g. that variant's allele frequency, or predicted consequence). Pr(D_t,i | M = m, v_i ∈ C) is therefore the probability of observing the associated data if the variant is underlying the disease, and Pr(D_t,i | v_i ∉ C) is the probability of observing the associated data if the variant is not underlying the disease. The Bayes factors are thus calculated from probability distributions of data across truly pathogenic and truly non-pathogenic variation.
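The sequential update can be sketched in a few lines of code: each step multiplies the current probability parameters by that step's Bayes factors and renormalises. This is an illustrative sketch, not the disclosed implementation; the hypothesis keys and numbers are invented for the example.

```python
def bayes_update(priors, bayes_factors):
    """One update step: multiply each hypothesis probability by its Bayes
    factor and renormalise so the posteriors sum to one."""
    unnorm = {h: p * bayes_factors.get(h, 1.0) for h, p in priors.items()}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

# Two hypothetical update steps applied in sequence
post = {"h1": 0.5, "h2": 0.5}
for bf in [{"h1": 4.0, "h2": 1.0}, {"h1": 1.0, "h2": 2.0}]:
    post = bayes_update(post, bf)
print(post)  # h1 ends at 2/3, h2 at 1/3
```

Because each step only rescales and renormalises, the final posterior is proportional to the prior times the product of all Bayes factors, which is what allows the per-step factors to be retained and inspected later.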

Generating Bayes factors from external data

If desired, the Bayes factors (update factors) mentioned in the previous section could be set arbitrarily in one or more of the update steps. For instance, to recover the behaviour of standard allele frequency filters, it is possible to set BF_t(m, c) = 0 for all hypotheses containing variants with an allele frequency above a certain threshold, and BF_t(m, c) = 1 otherwise. However, improved performance relative to such filters can be achieved by calculating the Bayes factors in at least one of the update steps in the manner described below.

We will discuss specific updates below, but here we will describe the approach in general. As we mention above, a Bayes factor update consists of two parts, the distribution of the data conditional on a variant being pathogenic, and the distribution of data conditional on a variant being non-pathogenic. In practice, these distributions are not known, and must be estimated from real data.

The approach to calculating these Bayes factors for the update reference data in a given update step can be summarized as:

1. Identify a metric that summarizes the data as a single value for each variant. For some data sources (e.g. mutation consequence) this is already the case. For hard-to-estimate continuous distributions (such as allele frequency) we split the data into discrete bins.

2. Identify a large set of non-pathogenic variants (e.g. the 1000 Genomes database) and calculate the metric for each of these variants.

3. Identify a large set of pathogenic variants (e.g. the ClinVar database) and calculate the metric for each of these variants.

4. Carry out an optional smoothing step to de-noise the data and ensure robustness.

5. Calculate the Bayes factor for each value of the metric.
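Steps 1-5 above can be sketched as follows, assuming a scalar metric, hand-chosen bins, and a simple pseudocount as the optional smoothing step (all names and numbers are hypothetical):

```python
from collections import Counter

def binned_bayes_factors(pathogenic_values, benign_values, bins, pseudocount=1.0):
    """Estimate a Bayes factor per metric bin from two labelled variant sets.
    The pseudocount stands in for the optional smoothing step (step 4)."""
    def bin_of(x):
        for i, (lo, hi) in enumerate(bins):
            if lo <= x < hi:
                return i
        raise ValueError(f"value outside all bins: {x}")

    n_bins = len(bins)
    path = Counter(bin_of(v) for v in pathogenic_values)
    ben = Counter(bin_of(v) for v in benign_values)
    p_path = [(path[i] + pseudocount) / (len(pathogenic_values) + pseudocount * n_bins)
              for i in range(n_bins)]
    p_ben = [(ben[i] + pseudocount) / (len(benign_values) + pseudocount * n_bins)
             for i in range(n_bins)]
    # Bayes factor per bin: Pr(bin | pathogenic) / Pr(bin | non-pathogenic)
    return [pp / pb for pp, pb in zip(p_path, p_ben)]

bins = [(0.0, 0.001), (0.001, 0.01), (0.01, 1.01)]
bf = binned_bayes_factors([0.0, 0.0005, 0.0002], [0.05, 0.2, 0.3], bins)
print(bf)  # roughly [4.0, 1.0, 0.25]: rare bins favour pathogenicity
```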

In an embodiment, the Bayes factor calculation is personalised based on patient information to provide the best possible interpretation for that specific patient. Examples of specific information for personalization are given in the example specific updates described below. In general, specific information for personalization falls into the following categories:

• Set inheritance priors based on disease-specific parameters, pedigree information and genetic evidence of consanguinity

• Carry out an update stage based on a phenotype-specific and/or inheritance-specific data source (e.g. using a disease-relevant gene list).

• Calculate Bayes factors using a user-provided or otherwise disease-specific set of pathogenic variants (e.g. variants from previously diagnosed patients, or by subsetting a pathogenic variant database for a particular disease)

• Use an update based on supplied phenotypes (e.g. Human Phenotype Ontology terms) and a gene-phenotype database

• Use variant information from the individual to inform the estimated data distribution for non-pathogenic variation

Example Generation of Prioritisation Data Representing Initial Prioritisation

Setting inheritance priors

We mention above that the method is initialized with prior probabilities on each inheritance state. These can be set by the user, or they can be calculated from disease parameters (disease penetrance and prevalence) and the observed affection status of the parents. So, for instance, a high-penetrance disease with both parents unaffected will place a larger prior on the recessive and de novo inheritance modes. By contrast, if the mother is affected and the father is unaffected, the prior on a maternal dominant model will be much higher.

In an embodiment, the inheritance mode priors incorporate a genetic estimate of consanguinity (i.e. relatedness between parents). The offspring of related parents are more likely to have a recessive disease mode. We assess the parental relatedness using runs of homozygosity (ROH), using a Hidden Markov Model to infer the parts of the genome that are or are not in an ROH, with emission probabilities determined by allele frequencies from an external database and transition probabilities given by a genetic map. Individuals who have more than a given threshold of their genome inside an ROH are declared consanguineous, and consanguineous individuals have the recessive inheritance mode upweighted. Both the ROH threshold and the upweighting ratio are estimated from a reference set of previously diagnosed individuals (either a user-provided disease-specific reference set or a pre-defined reference set taken from publicly accessible data).

Example Update (Bayes) Factors

Allele frequency updates

Severe disease-causing variants tend to have low allele frequencies, as natural selection purges them from the population.

In an embodiment a predetermined metric used in the update step S4 of the method of Figure 3 comprises one of a plurality of sub-ranges (referred to below as "bins") of allele frequency. The update reference data comprises a value of the predetermined metric for each candidate hypothesis. The calculation of the update factor comprises using a probability distribution of the predetermined metric over variants known to be pathogenic. The calculation of the update factor may comprise using a probability distribution of the predetermined metric over variants known to be pathogenic with respect to disease consistent with a phenotype displayed by the individual or group of individuals. The calculation of the update factor may further comprise using a probability distribution of the predetermined metric over variants not known to be pathogenic.

We measure allele frequencies using an allele frequency reference set (e.g. the 1000 Genomes project data), with the metric being a small set of allele frequency bins. These bins can be varied in size and position. In one example, the bins are as follows: 0% (i.e. not observed in the allele frequency reference set), 0-0.1%, 0.1-0.5%, 0.5-1%, 1-5% and 5-100%.
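A minimal sketch of assigning a variant to these example bins (the function name is hypothetical; the bin edges are taken from the example above, with each upper edge treated as inclusive and bin 0 reserved for unobserved variants):

```python
def af_bin(freq):
    """Assign an allele frequency (as a fraction, 0.0-1.0) to one of the
    example bins from the text. Bin 0 means not observed in the reference set."""
    edges = [(0.0, 0.001), (0.001, 0.005), (0.005, 0.01), (0.01, 0.05), (0.05, 1.0)]
    if freq == 0.0:
        return 0  # not observed in the allele frequency reference set
    for i, (lo, hi) in enumerate(edges, start=1):
        if lo < freq <= hi:
            return i
    raise ValueError(f"frequency out of range: {freq}")

print([af_bin(f) for f in [0.0, 0.0004, 0.002, 0.007, 0.03, 0.5]])  # [0, 1, 2, 3, 4, 5]
```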

The method estimates the data distribution of allele frequency in pathogenic variants using a database of pathogenic variants (such as ClinVar, appropriately filtered to contain only high confidence pathogenic variants). To make this stage personalised we optionally intersect the pathogenic variants list with a list of genes that have a one-to-one relationship with the broad phenotype of the patient being analyzed using a database of disease genes (such as OMIM). This ensures that the allele frequency bin distribution is set for variants that are pathogenic specifically for the phenotype under study. We also use the same technique to estimate different allele frequency bin distributions for recessive and dominant disease causing variants.

The same approach described above could be used to calculate the allele frequency bin distribution for healthy variants. However, in practice this can introduce bias and inaccuracy due to mismatches between the allele frequency reference set and the data for the patient under study (e.g. if the patient is from an ethnic group not well represented in the allele frequency reference set). Instead, we define a personalised non-pathogenic allele frequency bin distribution using the patient data itself. Assuming that the vast majority of variants are non-pathogenic, we count the number of variants that the individual carries in each allele frequency bin (where those allele frequency bins are themselves calculated using the allele frequency reference set), and use these counts for the denominator in the Bayes factor.

Homozygous and heterozygous variants have a different allele frequency distribution, and thus we calculate separate allele frequency bin distributions for these different classes of variants. In practice, there are too few homozygous variants with low allele frequency observed in a single individual, which introduces a significant degree of noise into the homozygous frequency bin distribution estimates. In an embodiment, this is addressed using a smoothing step that incorporates both heterozygous and homozygous variants into the allele frequency bin distribution calculation. Assuming random mating, the estimate for the probability of a homozygous variant being in the allele frequency bin with lower bound a and upper bound b is given by

Pr(homozygous variant in [a, b)) = Σ_{v: a ≤ f_v < b} g_v f_v / Σ_v g_v f_v

where f_v is the allele frequency of variant v in the reference set, and g_v is the dosage of variant v in the individual (1 for heterozygous variants and 2 for homozygous variants). A similar expression exists to estimate the allele frequency bin distribution for heterozygous variants.
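Assuming the g_v f_v weighting under random mating described above (a reconstruction; the exact disclosed estimator may differ), the smoothed homozygous bin distribution could be computed as in this hypothetical sketch:

```python
def hom_bin_distribution(variants, bins):
    """Estimate the homozygous allele-frequency bin distribution from all of
    the individual's variants, weighting each variant by g_v * f_v.
    variants: list of (f_v, g_v) pairs, where f_v is the reference-set allele
    frequency and g_v is the dosage (1 = heterozygous, 2 = homozygous)."""
    weights = [0.0] * len(bins)
    total = 0.0
    for f, g in variants:
        w = g * f
        total += w
        for i, (a, b) in enumerate(bins):
            if a <= f < b:
                weights[i] += w
                break
    return [w / total for w in weights]

# Toy example: one heterozygous and one homozygous variant, two bins
dist = hom_bin_distribution([(0.1, 1), (0.4, 2)], [(0.0, 0.25), (0.25, 1.0)])
print(dist)  # weights 0.1 and 0.8, so roughly [1/9, 8/9]
```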

In an embodiment a further update step is used that includes allele frequencies from a cohort of other patients sequenced by the same customer, genome centre or project, and processed using the same software, as the patient under consideration. This is useful to downweight variants that are common in the patient population, or that were introduced by sequencing or data processing errors that are specific to the data source. To avoid double-counting, we produce an average allele frequency metric using both the public and private allele frequency datasets, weighted by the sample sizes of the data sets.
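The sample-size weighted average of public and cohort allele frequencies mentioned above could look like the following sketch (function name and numbers are hypothetical):

```python
def combined_allele_frequency(f_public, n_public, f_private, n_private):
    """Average the public and private (cohort) allele frequency estimates,
    weighted by the sample sizes of the two data sets, to avoid double-counting."""
    return (f_public * n_public + f_private * n_private) / (n_public + n_private)

# A variant rare in the public set but common in the local cohort
print(combined_allele_frequency(0.01, 5000, 0.05, 1000))  # (50 + 50) / 6000
```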

Consequence updates

The impact that a mutation has on the gene is a major predictor of how likely it is to be pathogenic.

In an embodiment a predetermined metric used in the update step S4 of the method of Figure 3 comprises a classification of the impact of a variant on a protein produced by a gene containing the variant. The update reference data comprises a value of the predetermined metric for each candidate hypothesis. The calculation of the update factor comprises using a probability distribution of the predetermined metric over variants known to be pathogenic. The calculation of the update factor may comprise using a probability distribution of the predetermined metric over variants known to be pathogenic with respect to disease consistent with a phenotype displayed by the individual or group of individuals. The calculation of the update factor may further comprise using a probability distribution of the predetermined metric over variants not known to be pathogenic.

The classification may contain categories such as "truncating", "missense damaging" and "synonymous". We calculate the healthy and pathogenic consequence distributions by counting the number of each consequence in a database of healthy variation and an appropriately filtered database of disease-causing variants, respectively. Our method can also generate personalised Bayes factors by subsetting these databases according to disease phenotype using a phenotype-genotype database such as OMIM, as described for the allele frequency update.

In practice, the consequence distribution varies strongly from gene to gene for both healthy and pathogenic variation. For healthy variation, some genes may have more sites where mutations can introduce a particular consequence (e.g. more potential to introduce stop codons). For pathogenic variation, some genes may cause disease through loss-of-function (in which case stop gain and frameshift mutations will be more common among pathogenic variants), or through gain-of-function (in which case missense variants may be more common). Ideally, the consequence distribution for both healthy and pathogenic variation would be estimated on a per-gene basis, to account for these differences. However, in practice the number of observations per gene is small, and thus calculating the distribution per gene would be very noisy and would introduce errors.

In an embodiment, a form of Dirichlet-multinomial smoothing is used to reduce error on the per-gene estimates. We assume that the consequence bin distribution for a given gene, p_g (a vector with the probability mass for each consequence), is drawn from a Dirichlet distribution, i.e. p_g ~ Dirichlet(α), where α is a vector that represents the distribution of consequence bin distributions across all genes. The vector of mutation counts for a given gene is thus drawn from a multinomial distribution, y_g ~ Multinomial(p_g, N), where N is the total number of mutations observed at that gene. The value of the vector α is estimated across all genes by maximum likelihood. The Dirichlet distribution is the conjugate prior for the multinomial distribution, and thus p_g | y_g ~ Dirichlet(α + y_g), and a good estimator for the consequence bin distribution is thus given by the posterior expectation, calculated as

p̂_g = (α + y_g) / (Σ_k α_k + N)

This has the desirable property that when y_g = 0, i.e. no mutations are observed, the consequence distribution defaults to the global estimate across all genes, α. However, when y_g is very large the consequence distribution is dominated by the observed data at that gene. This smoothing approach can be used to estimate both pathogenic and non-pathogenic consequence distributions. The same Dirichlet-multinomial smoothing approach can be applied to other update steps to introduce a per-gene estimate for the pathogenic and healthy metric distributions.
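A sketch of the posterior-expectation estimator, showing both limiting behaviours described above (the α values and counts are invented for illustration):

```python
def smoothed_distribution(alpha, counts):
    """Posterior-expected consequence distribution for one gene under
    Dirichlet-multinomial smoothing: (alpha_k + y_k) / (sum(alpha) + N)."""
    total = sum(alpha) + sum(counts)
    return [(a + y) / total for a, y in zip(alpha, counts)]

alpha = [2.0, 1.0, 1.0]  # hypothetical global consequence profile across genes
print(smoothed_distribution(alpha, [0, 0, 0]))   # no data: defaults to alpha/sum(alpha)
print(smoothed_distribution(alpha, [0, 96, 0]))  # many observations dominate the prior
```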

Spatial clustering updates

Disease-causing mutations, particularly gain-of-function mutations, are often clustered in the same region of the gene. We thus include a spatial clustering update that upweights hypotheses that include mutations close to known mutations.

In an embodiment a predetermined metric used in the update step S4 of the method of Figure 3 comprises identification of one of a plurality of sub-ranges of a number of mutations that are within a predetermined number of amino acids of a reference mutation, the reference mutation being a mutation caused in a protein produced by a gene containing a variant of the candidate hypothesis. The update reference data comprises a value of the predetermined metric for each candidate hypothesis. The calculation of the update factor comprises using a probability distribution of the predetermined metric over variants known to be pathogenic. The calculation of the update factor may comprise using a probability distribution of the predetermined metric over variants known to be pathogenic with respect to disease consistent with a phenotype displayed by the individual or group of individuals. The calculation of the update factor may further comprise using a probability distribution of the predetermined metric over variants not known to be pathogenic.

In one embodiment, the predetermined metric is the number of (pathogenic) mutations seen within 5 amino acids on either side of the mutation under consideration (counting each report of the same mutation separately, e.g. if one mutation has been seen in 10 patients, we add 10 to the metric). We bin (assign to a plurality of sub-ranges) the number of nearby mutations into 5 bins (sub-ranges): 0 (i.e. no mutations seen within 5 amino acids), 1-4, 5-9, 10-99 and 100+.

We calculate the pathogenic spatial clustering bin distribution using a database of pathogenic variation (such as ClinVar). Because we also calculate the metric using this reference dataset, we need to ensure that the variant itself does not contribute to the metric for itself. We generate the healthy spatial clustering bin distribution from population variation (e.g. from 1000 Genomes).
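A hypothetical sketch of the clustering metric and its binning follows. Excluding reports at the variant's own position is a simplification of the rule that a variant must not contribute to its own metric; the disclosed method may handle same-position reports differently.

```python
def clustering_bin(position, known_positions, window=5):
    """Count known pathogenic mutation reports within `window` amino acids of
    the mutation under consideration (each report counted separately), then
    assign the count to one of the five bins from the text."""
    n = sum(1 for p in known_positions
            if abs(p - position) <= window and p != position)
    for lo, hi, b in [(0, 0, 0), (1, 4, 1), (5, 9, 2), (10, 99, 3)]:
        if lo <= n <= hi:
            return b
    return 4  # 100 or more nearby reports

known = [10, 11, 12, 13, 14, 200]  # hypothetical reported positions
print(clustering_bin(12, known))   # 4 nearby reports -> bin 1
```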

Selective constraint updates

If mutations in a gene cause a severe disease or other deleterious phenotype then this gene will become depleted of variation as natural selection purges the pathogenic variation from the population. This will be especially true for dominant disease mutations and loss-of-function or other severe consequence mutations. To harness this information, we include a selective constraint stage that upweights dominant hypotheses with variants that are found in genes that are depleted for loss-of-function mutations.

In an embodiment, the calculating of the update factor in step S4 of the method of Figure 3 comprises calculating a proportion by which a gene containing the variant or variants of the candidate hypothesis is depleted of variation.

We can show using basic population genetic theory that, if a gene is depleted for loss-of-function mutations by a proportion ω_g, then pathogenic variation at this gene will be enriched by a proportion (Bayes factor) 1/ω_g compared to the average gene. Here ω_g = observed/expected = (μ_N + μ_P(1 + ε)) / μ, where μ_N, μ_P and μ_D are the probabilities of neutral, positively selected and pathogenic LoF mutations respectively, and their sum is μ. The Bayes factor is computed as

P(LoF | Disease) / P(LoF | Healthy) = μ_D / (μ_N + μ_P(1 + ε)) = (1/ω_g)(μ_D / μ)

As a result, for example, we apply a Bayes factor update of BF_t(m, c) = k/ω_{g(c)} for all dominant hypotheses, where g(c) is the gene containing the variants in hypothesis c and k is a normalising constant that ensures the mean Bayes factor, taken over the distribution of non-pathogenic variants, is equal to 1. The mean value of 1/ω_g across all genes is (1/|G|) Σ_g 1/ω_g = 1/0.28, thus by default the normalising constant k = 0.28. We estimate ω_g from a population reference set. We assume that ω_g is distributed according to a gamma distribution, ω_g ~ Gamma(α, β), where the values of α and β are shared across all genes. The observed number of mutations is then assumed to be distributed as y_g ~ Poisson(ω_g x_g), where x_g is the expected number of loss-of-function mutations in gene g in the absence of selection (calculated as the total rate of mutations that cause loss of function in that gene). We set priors α ~ Uniform(0, 100) and β ~ Uniform(0, 100), draw samples from the posterior distribution across α, β and all ω_g using a Gibbs sampler, and then estimate ω̂_g = E[ω_g] for each gene by averaging across all samples.

Personalised gene list updates

For many diseases there are established genes that harbour variants that contribute to the disease. A metric that describes whether a variant falls in one of these well-known genes, and if so which one, can then be used to carry out an update that upweights mutations found in important genes.

In an embodiment, the calculating of the update factor in step S4 of the method of Figure 3 in at least one of the update steps comprises determining whether the variant or variants of the candidate hypothesis are in a gene on a reference list of genes known to contribute to the disease.

We call this a gene list update.

We calibrate the Bayes factors for the gene list update using a reference set of successfully diagnosed patients with the disease of interest. The Bayes factors are then calculated as BF_t(m, c) = p_{g(c)} × |G|, where p_{g(c)} is the proportion of diagnoses in the reference set attributed to mutations in the gene g(c), and |G| is the total number of genes in the genome.

In practice, for many diseases (e.g. developmental disorders) the list of known genes can be long, and thus the number of known diagnoses for each gene in the reference set of patients can be small (often less than 1 per gene). This means that the per-gene Bayes factor estimates can be noisy. In this instance we combine all the genes into a single gene list and change the metric for each variant to be whether or not it is in a gene in this entire gene list. The Bayes factor update for variants in genes on this gene list is then estimated as BF_t(m, c) = (n/N) × |G|/|S|, where n/N is the proportion of diagnoses with mutations in genes that are in the gene list, and |S| is the number of genes on the gene list.
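The per-gene calibration can be sketched as follows. The gene names and diagnosis counts are hypothetical, and the formula assumes the reconstruction BF = p_{g(c)} × |G| (the proportion of diagnoses in the gene, divided by the 1/|G| expected with no gene-list information):

```python
def gene_list_bayes_factor(gene, diagnoses, n_genes_genome):
    """Per-gene Bayes factor for the gene list update: the proportion of
    reference diagnoses attributed to this gene, times the number of genes
    in the genome (i.e. divided by the uninformative rate 1/|G|)."""
    p_gene = diagnoses.get(gene, 0) / sum(diagnoses.values())
    return p_gene * n_genes_genome

# Hypothetical reference set: counts of diagnoses per gene
diagnoses = {"BRCA1": 30, "TP53": 10, "OTHER": 60}
print(gene_list_bayes_factor("BRCA1", diagnoses, 20000))  # 0.3 * 20000
```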

If the number of diagnoses per gene is too small to get reliable estimates, but large enough to have multiple observations at most genes, we can instead optionally use Dirichlet-multinomial smoothing to de-noise the per-gene Bayes factors.

In some cases the user will not have a list of known genes, because the patient has a disease that has not been widely studied, or has a set of phenotypes that does not clearly match any one disease. In this case, we can generate a personalised gene list update on the basis of the patient's phenotype. The phenotype is specified using a series of Human Phenotype Ontology (HPO) terms. HPO terms are hierarchical, and thus form a natural graph. We combine this with a genotype-phenotype graph that connects HPO terms to genes (such as OMIM), and a gene-gene graph (such as a protein-protein interaction graph), to form a combined graph that contains HPO-HPO connections, HPO-gene connections and gene-gene connections. We then use a personalised PageRank algorithm that starts at one of the patient's HPO terms selected at random and randomly walks around the graph, occasionally restarting at another randomly selected patient HPO term. Each gene is then given a score according to how often it is visited during the random walk, and the ratio of this score to the average score across all genes is taken as a gene list update Bayes factor for that gene. The parameters of the graph (the edge weights of HPO-HPO and HPO-gene connections, and the restart rate) are tuned on a reference set of diagnosed individuals with disorders with diverse phenotypes. We calculate an ensemble of networks using various values of the three parameters, and use each network to rank all genes for each diagnosed individual by the score given by the personalised PageRank algorithm applied to their reported phenotypes. We then check the rank of the true diagnosed gene for each of the individuals, and choose the values of the parameters associated with the network that gives the highest average rank of the true causal gene across all individuals.

Determining reasons for variant ranking decisions

After completing all (e.g. Bayesian) update steps, each variant is associated with a posterior probability that it is in fact driving the disease under consideration. Variants are then ranked by this posterior probability to provide a list of potential disease-causing variants for consideration by the physician. As described above, in embodiments the posterior probabilities are formed by taking the product of a prior probability, and a number of update factors (e.g. Bayes factors) deriving from the various update steps.

Each update step can be interpreted as quantifying the change in belief that a variant is in fact causing the disease under consideration, as a result of considering a particular piece of evidence. For example, the allele frequency update will assign a higher Bayes factor to variants that have never been seen before than to variants that occur at high frequency in a population. As another example, the spatial clustering update will assign a high Bayes factor to variants close to known pathogenic variants, whereas variants further away from known pathogenic variants will receive lower Bayes factors in this update step. The update steps that result in the largest increase in posterior belief of disease causation can be interpreted as the "reasons why" the variant has received its ranking.

In an embodiment, data is provided giving a quantitative indication of the relative contribution of one or more of the update steps to the prioritised list of candidate hypotheses (e.g. to the order of the prioritised list and/or to the final posterior associated with each hypothesis in the list) output by the method. This data may be referred to as update step ranking data and is generated using the calculated update factors. For example, in an embodiment, for each variant we keep the individual Bayes factors (update factors) that have resulted in the final posterior. After the variants have been ranked, the individual Bayes factors are then ranked to form update step ranking data. The update step ranking data may be used in various ways to provide information to the physician. In one embodiment, the update step ranking data is used to identify update steps that contribute substantially to the posterior. The reason (e.g. allele frequency, clustering with known variation) associated with each of the identified update steps is then communicated to the physician.
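Ranking a variant's individual update factors to surface the dominant "reasons" can be sketched as below. The step names and the choice of reporting the top two steps are illustrative assumptions of this sketch, not features of any embodiment.

```python
def top_reasons(update_factors, n=2):
    """Return the n update steps whose Bayes factors contributed most
    to a variant's posterior, largest factor first.

    `update_factors` maps an update step name to the Bayes factor that
    step contributed for this variant.
    """
    ranked = sorted(update_factors.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

For a variant whose allele frequency update contributed a factor of 120, spatial clustering 8.5, and gene list 0.9, the top reasons reported to the physician would be allele frequency followed by spatial clustering; the gene list step (factor below 1) actually reduced belief in causation.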
