Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
AN ITERATIVE REGRESSION METHOD FOR GENOMIC PREDICTION
Document Type and Number:
WIPO Patent Application WO/2021/011990
Kind Code:
A1
Abstract:
A method for improving the computational efficiency of estimation of Bayesian methods, such as Bayesian MCMC models, for performing genomic analysis of SNP data to estimate breeding values, Quantitative Trait Locus (QTLs), or genomic locations associated with a disease or trait. The method extends the BayesR method by using a blocked Gibbs sampling approach to estimate SNO effects comprising breaking the SNPS into non overlapping blocks of markers and sequentially processing each block according to a block ordering. Each regression coefficient in the block is then sequentially estimated, starting from a starting index. The method significantly reduces the computational time to estimate model parameters and can be applied to both single trait and multi-trait analyses.

Inventors:
BREEN EDWARD JOSEPH (AU)
DAETWYLER HANS DIETER (AU)
GODDARD MICHAEL EDWARD (AU)
Application Number:
PCT/AU2020/000072
Publication Date:
January 28, 2021
Filing Date:
July 23, 2020
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
AGRICULTURE VICTORIA SERV PTY (AU)
DIARY AUSTRALIA LTD (AU)
GEOFFREY GARDINER DAIRY FOUNDATION LTD (AU)
International Classes:
G16B20/40; C12Q1/68
Other References:
JIANG JICAI: "Genetic Architecture of Complex Traits and Accuracy of Genomic Selection in Dairy Cattle", PHD THESIS, UNIVERSITY OF MARYLAND, DIGITAL REPOSITORY AT THE UNIVERSITY OF MARYLAND, 1 January 2018 (2018-01-01), XP055784258, Retrieved from the Internet [retrieved on 20210310], DOI: 10.13016/xsqu-xjmp
MACLEOD I. M., BOWMAN P. J., VANDER JAGT C. J., HAILE-MARIAM M., KEMPER K. E., CHAMBERLAIN A. J., SCHROOTEN C., HAYES B. J., GODDA: "Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits", BMC GENOMICS, vol. 17, no. 1, 1 December 2016 (2016-12-01), XP055784255, DOI: 10.1186/s12864-016-2443-6
KATHRYN E KEMPER;CORALIE M REICH;PHILIP J BOWMAN;CHRISTY J VANDER JAGT;AMANDA J CHAMBERLAIN;BRETT A MASON;BENJAMIN J HAYES;MICHAEL: "Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed population leads to greater accuracy of across-breed genomic predictions", GENETICS SELECTION EVOLUTION, EDP SCIENCES, LES ULIS,, FR, vol. 47, no. 1, 17 April 2015 (2015-04-17), FR, pages 29, XP021216724, ISSN: 1297-9686, DOI: 10.1186/s12711-014-0074-4
ERBE M., B.J. HAYES, L.K. MATUKUMALLI, S. GOSWAMI, P.J. BOWMAN, C.M. REICH, B.A. MASON, M.E. GODDARD: "Improving Accuracy of Genomic Predictions Within and Between Dairy Cattle Breeds with Imputed High-Density Single Nucleotide Polymorphism Panels", JOURNAL OF DAIRY SCIENCE, 1 January 2012 (2012-01-01), pages 4114 - 4129, XP055784194, DOI: 10.3168/jds.2011-5019
TINGTING WANG;YI-PINGPHOEBE CHEN;PHILJ. BOWMAN;MICHAELE. GODDARD;BENJ. HAYES: "A hybrid expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model based genomic prediction and QTL mapping", BMC GENOMICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 17, no. 1, 21 September 2016 (2016-09-21), London, UK, pages 1 - 21, XP021266405, DOI: 10.1186/s12864-016-3082-7
KEMPER KATHRYN E., BOWMAN PHILIP J., HAYES BENJAMIN J., VISSCHER PETER M., GODDARD MICHAEL E.: "A multi-trait Bayesian method for mapping QTL and genomic prediction", GENETICS SELECTION EVOLUTION, vol. 50, no. 1, 1 December 2018 (2018-12-01), XP055784186, DOI: 10.1186/s12711-018-0377-y
Attorney, Agent or Firm:
MADDERNS PTY LTD (AU)
Download PDF:
Claims:
CLAIMS

1. A computational method for performing genomic analysis on a plurality of individuals in a population by estimating a plurality of model parameters of a Bayesian statistical model using an iterative regression method, the method comprising:

obtaining response data and an explanatory data wherein either the response data or the explanatory data comprises genomic data for a plurality of individuals in a population;

estimating a plurality of model parameters of a Bayesian statistical model using an iterative regression method applied to the response data and the explanatory data, wherein the plurality of model parameters comprises a plurality of regression coefficients, and processing each iteration of the iterative regression method comprises:

breaking the plurality of regression coefficients into a plurality of non-overlapping blocks of regression coefficients;

determining a block ordering for the iteration;

sequentially processing each block according to the block ordering for each iteration to obtain updated estimates of the regression coefficients, wherein sequentially processing each block comprises:

determining a starting index within the block;

sequentially estimating each regression coefficient in the block starting with the starting index, and

generating, using the estimated plurality of regression coefficients, one or more of an estimated breeding value, one or more Quantitative Trait Locus (QTL), or one or more genomic locations associated with a disease or trait.

2. The method as claimed in claim 1, wherein the block ordering is a random ordering.

3. The method as claimed in claim 1 or 2, wherein determining a starting index within the block comprises randomly selecting a starting index within the block.

4. The method as claimed in claim 1 , 2 or 3 further comprising updating a set of model residuals after sequentially processing each block.

5. The method as claimed in claim 4, wherein the model has a form, ignoring fixed effects and a polygenic effect, of : where y is the response data, Vi is the explanatory data in block i, gi is the vector of regression coefficients associated with block i, B is the number of blocks, b is a block size of a block comprising the number of regression coefficients in the block and e is a residual error vector, and sequentially estimating a regression coefficient for each regression coefficient in the /"' block of block size b comprises obtaining a least squares estimate of:

for each k is in the set { 1, 2, ... , b} , where is the diagonal element of associated with

regression coefficient j, block i, and is precomputed prior to performing the sequential estimation step and r is the right hand side vector with b elements where and e* are the residuals which

are calculated using the current value of each regression coefficient during sequential estimation of regression coefficients and , for the first block being processed at each iteration.

6. The method as claimed in claim 5 wherein on entering block /, the residuals r for that block are estimated using:

where , for the first block being processed at each iteration, and we set , and then set all

elements of gi to zero, where gf represents the regression coefficients associated with the ith block prior to sequentially processing the block.

7. The method as claimed in claim 5 or 6 wherein the right-hand side vector r is updated for regression coefficient / + 1 by adding to it: , where is thejth column of and is the jth regression coefficient.

8. The method as claimed in claim 7, wherein updating the set of model residuals after sequentially processing each block comprises calculating:

where represents the regression coefficients associated with the ith block prior to sequentially processing the block and represents the regression coefficients after sequentially processing the block.

9. The method as claimed in claim 8, wherein the block size b is fixed for all blocks except a last block and the last block has a size smaller than or equal to b.

10. The method as claimed in claim 9, wherein the block size is set as where R is a number

of individuals.

11. The method as claimed in claim 9 or 10, wherein the L is a number of iterations and the method further comprises a step of updating all other model effects wherein all other model effects comprises one or more polygenic effects, pedigree effects, fixed effects and/or variances, and updating of all other model effects is interlaced with the step of sequentially processing each block such that updating of all other model effects is performed at least / times where / = floor(Lb).

12. The method as claimed in claim 11, wherein updating of all other model effects is performed after sequentially processing every B/b blocks.

13. The method as claimed in any preceding claim, wherein the iterative regression method is a Bayesian Markov Chain Monte Carlo (MCMC) method in which a Gibbs Sampler is used to sample regression coefficients.

14. The method as claimed in claim 13, wherein the explanatory data comprises genotype data and the response data comprises phenotype data for a plurality of individuals in a population where the genotype data is obtained using a plurality of markers.

15. The method as claimed in claim 14 wherein the Bayesian statistical model a priori assumes the markers are each drawn from a mixture of K zero mean normal distributions each with a different marker effect, and breaking the plurality of regression coefficients comprises breaking the genotype data into a plurality of non-overlapping blocks of markers.

16. The method as claimed in claims 14 or 15, wherein the plurality of markers are classified into a plurality of classes, where allocation of markers to each class is based on biological prior data.

17. The method as claimed in any one of claims 1 to 15, wherein the Bayesian statistical model is a single trait model.

18. The method as claimed in any one of claims 1 to 15, wherein the Bayesian statistical model is a multi-trait model.

19. The method as claimed in any one of claims 1 to 16, wherein the method further comprises: obtaining genotype data of an individual drawn from the population, wherein the genotype data comprises one or more of the one or more markers associated with one or more traits; estimating a genomic breeding value for making a breeding decision in relation to the individual based on their genotype data and one or traits.

20. The method as claimed in any one of claims 1 to 12, wherein the Bayesian statistical model is a genetic prediction or association model and the explanatory data comprises phenotype data and the response data comprises genotype data for a plurality of individuals in a population where the genotype data is obtained using a plurality of markers.

21. The method as claimed in any preceding claim, wherein the genomic data comprises Single Nucleotide Polymorphism (SNP) data.

22. The method as claimed in any preceding claim, wherein the genomic data comprises marker data where each marker has a set of n allele states.

23. A computer system comprising:

at least one processor;

at least one memory, wherein the memory comprises instructions to perform the method of any one of claims 1 to 22.

24. A computer readable medium comprising instructions for causing a processor to perform the method of any one of claims 1 to 22.

Description:
AN ITERATIVE REGRESSION METHOD FOR GENOMIC PREDICTION

PRIORITY DOCUMENTS

[0001] The present application claims priority from Australian Provisional Patent Application No. 2019902659 titled“AN ITERATIVE REGRESSION METHOD FOR GENOMIC PREDICTION” and filed on 25 July 2019, the content of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] The present disclosure relates to methods for genomic analysis and genomic prediction. In a particular form the present disclosure relates to methods of estimating model parameters in Bayesian statistical model for performing genomic analysis or genomic prediction of traits in populations.

BACKGROUND

[0003] Estimation of single trait and multi-trait loci in genomic data from a population of individuals can be used to improve estimation of breeding value for individuals, as well as predicting genetic traits and mapping quantitative trait loci (QTLs). Erbe et al. 2012, developed a non-linear Bayesian Markov Chain Monte Carlo (MCMC) model, known as BayesR, to make genomic estimates of breeding value using high density Single Nucleotide Polymorphism (SNP) data. This used a mixture model in which it a prior assumes that each SNP is drawn from a mixture of 4 zero mean normal distributions of SNP effects and was able to make across-breed predictions of breeding values and mapping of QTLs more accurately than genomic best linear unbiased prediction (GBLUP).

[0004] However whilst BayesR, and similar Bayesian Markov Chain Monte Carlo (MCMC) based inference models are more accurate than GBLUP, they are computationally slow. For example the BayesR model has a per iteration processing time proportional to M × R, where M is the number of markers and R is the number of records. Thus for a Bovine High Density SNP chip comprises 633374 SNPs) performing estimate of single traits in large populations (eg over 10,000) takes days. Performing multi-trait analysis takes many times longer (eg tens or hundreds of days) to the point of being either irrelevant/out of date (eg for use in a monthly publication of genomic breeding values, or computationally infeasible for large populations).

[0005] Some attempts have been made to improve the speed of Bayesian selection methods. Calus, M.L.P.“Right-hand-side updating for fast computing of genomic breeding values”, Genetics Selection Evolution 2014, 46:24 (Calus) proposed a blocking method for application to SNP data. Calus proposes two approaches to speed up the Bayesian Genomic Selection algorithm. In the first method, takes advantage of the characteristic that the predictor variables in the model (i.e. the SNP genotypes) take only three different values, and uses this characteristic to reduce the number of calculations required to update the residuals (and is therefore termed“improved residual updating”). A second method called RHS- updating proposes blocking on up to s SNPS subject to the constraint that they must also track the indexes of which individuals are in each block (see Figure 1 of Calus). Identifying the groups within each block requires keeping track of up to 3 s groups per block, where s is the number of SNPs in a block. If s = 6 there are potentially 729 groups to keep track off and to process for each block. Therefore, if the block size is small and the number of SNPs is small but the number of records is relatively large (see Figure 5 of Calus), then compared to a non-block process the Calus method appears to give a speed increase. Flowever Calus uses standard Fortran matrix routines for the calculations rather than an optimised matrix calculation library. As such the actual gain in speed compared to BayesR is expected to be reduced in the case that such an optimised matrix calculation library. More significantly the requirement to keep track of up to 3 s groups per block effectively limits the size of the SNP blocks to less than around six. Once s increases beyond around six, any computational gains are lost to the overhead associated with tracking of the groups.

[0006] There is thus a need to provide improved methods for estimation of single trait and multi-trait loci in genomic data, or to at least provide a useful alternative to current methods.

SUMMARY

[0007] According to a first aspect, there is provided a computational method for performing genomic analysis on a plurality of individuals in a population by estimating a plurality of model parameters of a Bayesian statistical model using an iterative regression method, the method comprising:

obtaining a response data and an explanatory data wherein either the response data or the explanatory data comprises genomic data for a plurality of individuals in a population;

estimating a plurality of model parameters of a Bayesian statistical model using an iterative regression method applied to the response data and the explanatory data, wherein the plurality of model parameters comprises a plurality of regression coefficients, and processing each iteration of the iterative regression method comprise:

breaking the plurality of regression coefficients into a plurality of non-overlapping blocks of regression coefficients;

determining a block ordering for each iteration of a blocked Gibbs Sampler; sequentially processing each block according to the block ordering for each iteration to obtain updated estimates of the regression coefficients, wherein sequentially processing each block comprises:

determining a starting index within the block; sequentially estimating each regression coefficient in the block starting with the starting index, and

generating, using the estimated plurality of regression coefficients, one or more of an estimated breeding value, one or more Quantitative Trait Locus (QTL), or one or more genomic locations associated with a disease or trait.

[0008] In one form, the block ordering may be a random ordering.

[0009] In one form, determining a starting marker within the block may comprise randomly selecting a starting index within the block.

[0010] In one form the method may further comprise updating a set of model residuals after sequentially processing each block.

[0011] In a further form, the model may have a form, ignoring fixed effects and a polygenic effect, of :

where y is the response data, V i is the explanatory data in block i, g i is the vector of regression coefficients associated with block i, B is the number of blocks, b is a block size of a block comprising the number of regression coefficients in the block and e is a residual error vector, and sequentially estimating a regression coefficient for each regression coefficient in the i th block of block size b comprises obtaining a least squares estimate of:

for each k is in the set {1, 2, ... ,}, where is the diagonal element of associated with

regression coefficient j, block i, and is precomputed prior to performing the sequential estimation step and r is the right hand side vector with b elements, where and e* are the residuals which are calculated using the current value of each regression coefficient during sequential estimation of regression coefficients. In a further form, on entering block i, the residuals r for that block can also be estimated using:

[0012] where , for the first block being processed at each iteration, and we set , and then

set all elements of g i to zero. In a further form the right-hand side vector r may be updated for regression coefficient j + 1 by adding to it: , where is the jth column of and is the jth regression coefficient

[0013] In a further form, updating the set of model residuals after sequentially processing each block may comprise calculating:

where represents the regression coefficients associated with the block prior to sequentially processing the block and represents the regression coefficients after sequentially processing the block.

[0014] In a further form the block size b may be fixed for all blocks except a last block and the last block has a size smaller than or equal to b. In a further form, the block size may be set as where R is the number of individuals.

[0015] In one form the L is a number of iterations and the method may further comprise a step of updating all other model effects wherein all other model effects comprises one or more polygenic effects, pedigree effects, fixed effects and/or variances, and updating of all other model effects is interlaced with the step of sequentially processing each block such that updating of all other model effects is performed at least / times where / = floor(Lb).

[0016] In a further form updating of all other model effects may be performed after sequentially processing every B/b blocks.

[0017] In one form, the iterative regression method is a Bayesian Markov Chain Monte Carlo (MCMC) method in which a Gibbs Sampler is used to sample regression coefficients. In a further form, the explanatory data comprises genotype data and the response data comprises phenotype data for a plurality of individuals in a population where the genotype data is obtained using a plurality of markers. In one form the Bayesian statistical model a priori assumes the markers are each drawn from a mixture of K zero mean normal distributions each with a different marker effect, and breaking the plurality of regression coefficients comprises breaking the genotype data into a plurality of non-overlapping blocks of markers. In one form the plurality of markers are classified into a plurality of classes, where allocation of markers to each class is based on biological prior data.

[0018] The method may be used for single trait analysis using a single trait model or for multi-trait analysis including Quantitative Trait Loci (QTL) analysis, using a multi-trait model. The method may also be used for genomic discovery and genomic prediction of disease. The method may be used on human, animal or plant data. The method may also be used for estimating genomic breeding values. In one form the method may further comprise

obtaining genotype data of an individual drawn from the population, wherein the genotype data comprises one or more of the one or more markers associated with one or more traits;

estimating a genomic breeding value for making a breeding decision in relation to the individual based on their genotype data and one or traits.

[0019] In one form the Bayesian statistical model is a genetic prediction or association model and the explanatory data comprises phenotype data and the response data comprises genotype data for a plurality of individuals in a population where the genotype data is obtained using a plurality of markers. In one form the explanatory data is based on processing of mid infra-red (MID) spectra of milk and used to map genes (or genetic loci) affecting milk traits (ie associated with spectral features).

[0020] In one form the genomic data comprises are Single Nucleotide Polymorphisms (SNP) data (eg SNP chip data). In another form, the genomic data comprises marker data where each marker has a set of n allele states. The markers may also include minisattellites, microsatellites, tandem repeats, short tandem repeats, and other DNA polymorphisms.

[0021] In a further aspect there is provided a computer system comprising:

at least one processor

at least one memory, wherein the memory comprises instructions to perform the method of the first aspect.

[0022] In a further aspect, there is provided a computer readable medium comprising instructions for causing a processor to perform the method of the first aspect.

BRIEF DESCRIPTION OF DRAWINGS

[0023] Embodiments of the present disclosure will be discussed with reference to the accompanying drawings wherein:

[0024] Figure 1A is a flowchart of a computational method for a plurality of model parameters of a Bayesian statistical model using an iterative regression method according to an embodiment;

[0025] Figure 1B is a schematic illustration of an implementation of a Bayesian Markov Chain Monte Carlo (MCMC) based genomic prediction statistical model using a blocked Gibbs Sampler to estimate one or more marker effects according to an embodiment; [0026] Figure 2A is a plot of the Accuracy = correlation (phenotype, Genomic Breeding Value) between EM-Hybrid (horizontal bars)and BayesR3 (diagonal bars), where FY=fat yield, MY= milk yield and PY = protein yield; and

[0027] Figure 2B is a plot of the Bias between EM-Hybrid (horizontal bars) and BayesR3 (diagonal bars), where FY=fat yield, MY- milk yield and PY - protein yield;

[0028] Figure 3 is a plot of the expected processing time ratio between a standard BayesR to BayesR3 for 633374 SNPs and 97000 animals with respect to block size;

[0029] Figure 4 is a Manhattan plot of posterior probabilities (PP > 0.05) for Multi-trait BayesR3 analysis of 20 milk PCA components. Square symbols indicate SNP from the MY single trait analysis that had a PP > 0.8 and overlapped with the multi-trait analysis. Several of these SNP are in or very close to annotated genes known to influence milk composition; and

[0030] Figure 5 is a plot showing the utility of BayesR3 to analyse large numbers of traits in a multi-trait model. In this plot phenotype data for 24 traits was concatenated in a repeating pattern to generate 576 traits. The top panel shows the proportion of single nucleotide polymorphisms in each of four variance categories. Bottom panel shows achieved predictive ability of model across the traits. A clear repeating pattern is evident demonstrating that BayesR3 can successfully analyse a very large number of traits simultaneously.

[0031] Figure 6 is a schematic illustration of a computer system according to an embodiment.

[0032] In the following description, like reference characters designate like or corresponding parts throughout the figures.

DESCRIPTION OF EMBODIMENTS

[0033] Referring now to Figure 1 A, there is shown a flowchart of method for estimating a plurality of model parameters of a Bayesian statistical model using an iterative regression method 100. The model parameters may comprise fixed effects, random effects, and variances, and may be used for performing genomic prediction or genomic analysis such as estimating breeding value (eg for making breeding decisions) for example by estimating one or more marker (eg SNP) effects in a population, performing single trait and multi-trait analyses for example to identity Quantitative Trait Loci, as well as performing genomic prediction or association studies by estimating genotypes based on phenotypic data, for example by using milk spectra data to map genes affecting milk traits or by identifying genomic predictors of a disease for estimating disease risk. [0034] The Bayesian statistical model may be Bayesian Markov Chain Monte Carlo (MCMC) statistical model using a computer implementation of a blocked Gibbs Sampler. To further illustrate the method Figure 1B is a schematic illustration of an implementation of the Bayesian MCMC based genomic prediction statistical model using a blocked Gibbs Sampler to estimate one or more marker effects using genotype data and phenotype data from individuals in a population according to an embodiment.

Embodiments of the method, which will also be referred to as BayesR3, provide a faster implementation of BayesR (as described in Erbe et al. 2012, the contents of which is hereby incorporated in its entirety) with equal accuracy. In particular the implementation is significantly faster when applied to multi-trait analysis. However it is to be understood this is illustrative only, and the method can be more generally applied to any case where iterative regression is performed to estimate model parameters, including genomic prediction models based on BayesA, BayesB, or BayesR.

[0035] To assist in understanding various embodiments, several terms will now be defined.

[0036] The term "allele" refers to any one of the different forms of a gene or DNA sequence at a single locus i.e., chromosomal location including a coding sequence, non-coding sequence or regulatory sequence.

[0037] The term "amplified fragment length polymorphism" or "AFLP" refers to any one of different DNA fragment lengths produced by random-primed amplification of pooled or isolated restriction DNA fragments of genomic DNA or cDNA, wherein the fragment length varies between individuals in a population.

[0038] By "ancestor" is meant an individual having a genetic contribution to the current population. The term "ancestor" is thus a function of pedigree, the determination of which does not require prior knowledge of a particular trait or combination of traits present in the current population and its progenitors. Genotype information for an ancestor, as opposed to a founder, is generally incomplete as a consequence of poor record keeping and the absence of genetic material e.g., semen, from the ancestor to permit genotyping, such that missing genotypes of the ancestral population must be inferred to complete a genotype analysis. Ancestors in a pedigree may be overlapping, e.g., a sire and one of his sons, by virtue of contributing common genetic material to the current population notwithstanding any genes contributed independently by one or the other ancestor. To determine ancestry, the average relationship of a progenitor to the current population is determined excluding double -counting of overlapping ancestral contributions.

[0039] "Artificial selection" shall be taken to mean a selection under human control, including those systems, processes, steps or combinations of steps of a breeding program for producing genetic gain, including the collective design and/or implementation of said breeding program and intermediate steps by one or more persons. It is to be understood that artificial selection therefore requires a determination by man, based on a defined selection criterion or defined selection criteria, of one or more individuals in a population that are to be parents and ultimately ancestors, thereby producing a genetic gain as defined herein. This is distinct from the mere observation of population genetics e.g., for determining a genetic parameter such as heritability, diversity, inbreeding etc. Artificial selection systems include phenotypic selection and genotypic selection processes. Artificial selection steps include e.g., determining one or more of the following parameters: selection criteria and/or breeding objective(s); one or more selection indices; one or more selection targets; selection intensity; one or both sexual partners for a single mating or for multiple matings including references and/or replacements; the number of matings that any one or more individuals will contribute to a breeding program and the length of time that an individual will remain in a breeding population; generation interval; breeding value; or genetic gain. Artificial selection steps can also include e.g., performing one or more breeding steps based on a determination of one or more parameters supra and/or selecting progeny.

[0040] "Breeding objective" refers to a goal of an artificial selection program e.g., an improved germplasm. Breeding objective may be determined by weighted combination of traits defining an aggregate breeding value of an animal.

[0041] "Breeding value" means the genetic value of an individual as a parent in a breeding program and, more particularly, the effect of an individual's genes or genetic markers when considered in isolation or combination ("aggregate breeding value") on performance against a selection criterion or selection criteria.

[0042] By "current population" is meant a population that are candidates for selection. Typically the current population includes individuals e.g., animals that are at or near an end-point in a pedigree.

[0043] As used herein the term "derived from" shall be taken to indicate that a specified integer may be obtained from a particular source albeit not necessarily directly from that source.

[0044] The term "effective population size" or "N e " refers to the number of individuals in a population that contribute gametes to the next generation and preferably also to future generations. The effective population size is generally calculated as the number of breeding individuals in an idealized population that would show the same amount of dispersion of allele frequencies under random genetic drift or the same amount of inbreeding as a population under consideration. For example, in a randomly mating population consisting of 1000 individuals of which 500 are male are 500 are female with discrete generations, the expected fraction of the genes carried by any future generation contributed by any one animal in the current generation is 0.1% and the effective population size is the same as the absolute or real population size (N) i.e., 1000. However, because most populations are inbred to some degree, individuals do not select mates at random, generations may overlap, and fewer males generally breed than females, the effective population size typically has a value less than the absolute or real population size.

[0045] "Estimated breeding value" or "EBV" refers to a predicted breeding value of the progeny of a mating event, as determined by multiplying the ploidy of the organism in question by the progeny difference i.e., the difference between the average performances of an individual's progeny and the average performances of all progeny in a population assuming random mating. For a diploid organism, the progeny difference is doubled, because the breeding value is a measure of all genes for the organism, whereas the progeny difference is based upon the contribution of only one haploid genome from one parent. Progeny differences are based on average predicted performance of the progeny because each parent contributes the same number of genes to each progeny in the population.

[0046] "Generation interval" means the amount of time required to replace one generation with the next and, in a closed population that is subject to artificial selection, the average age of parents when their selected progeny are bom. As used herein, the term "genetic gain" shall be taken to mean the average change in a heritable trait or combination of heritable traits from one generation to the next generation, including a predicted genetic gain and/or actual genetic gain. More particularly, the average change is in the direction of one or more selection targets, or will at least avoid significant negative genetic gain i.e., an undesired effect for the selection criteria. The genetic gain can arise from artificial selection.

[0047] "Genotypic selection" means an artificial selection based upon the presence and/or absence of one or more genes or genetic markers of an individual associated with a particular gene, combination of genes, single-gene trait, quantitative trait, or combination of traits. Genotypic selection includes a diverse array of marker-assisted selection methods comprising the use of genetic markers e.g., alleles, haplotypes, haplogroups, loci, quantitative trait loci, or DNA polymorphisms (including restriction fragment length polymorphisms(RFLPs), amplified fragment length polymorphisms (AFLPs), single nuclear

polymorphisms (SNPs), indels, short tandem repeats (STRs), microsatellites and minisatellites) wherein the marker(s) is(are) determinative of the estimated breeding value of the individual.

[0048] A "haplogroup" is a cluster of similar haplotypes e.g., haplogroups of the human Y chromosome defined on the basis of unique mutation events in Y-STRs.

[0049] The term "haplotype" refers to a combination of alleles, loci or DNA polymorphisms that are linked so as to cosegregate in a significant proportion of gametes during meiosis. The alleles of a haplotype may be in linkage disequilibrium (LD). [0050] The term "indel" refers to any one of different insertions or deletions of DNA at a particular allele or locus that are present in different individuals in a population, e.g., Y chromosome AIu polymorphisms (YAPs).

[0051] The term "linkage disequilibrium" or "LD" refers to alleles or loci or DNA polymorphisms that associate at a frequency higher than expected for independent alleles or markers, such that they appear as a haplotype. For example, when variants of two genetic loci are in strong linkage disequilibrium, the variant at one locus is predictive of the variant at the other locus on an individual chromosome.

[0052] In the present context, the term "mating" or similar term such as "mate" shall be construed without reference to kingdom or phylum to mean any sexual reproduction wherein a haploid genome is transferred from one individual of a population to another individual of a population, including the mating of one animal or cell (e.g., yeast cell) to another by natural or assisted means e.g., artificial insemination (AI); and the self-pollination of a plant or cross-pollination between plants.

[0053] "Mendelian sampling variation" means the variation in the deviation of the breeding value of an individual from the mean breeding values of its parents.

[0054] The term "minisatellite" refers to a variable number tandem repeat (VNTR) comprising more than about 5 repeats and from 6 to about 60 base pairs per repeat unit, wherein the number of repeat units varies between individuals in a population. As with microsatellites, changes may occur and the number of repeats may increase or decrease.

[0055] "Phenotypic selection" means an artificial selection based upon one, and possibly more, phenotypes of an individual. Phenotypic selection generally comprises progeny testing wherein the estimated breeding value of an individual is determined by performing multiple matings of the individual and determining the performance of the progeny.

[0056] In the present context, the term "population" is to be interpreted broadly and means a group of individuals that potentially breed with each other such that they contribute genetically to the next generation, including but not limited to those individuals in a breeding program. The group can be of any size e.g., a species, breed, line, cultivar, herd or flock etc). The individuals (and populations) may be humans, animals, or plants.

[0057] The term "quantitative trait" refers to a trait that is determined by expression of more than one gene. [0058] The term "quantitative trait locus" or "QTL" refers to a region of DNA that is associated with a particular quantitative trait, wherein variation in the QTL is associated with variation in the quantitative trait as determined by genetic mapping or marker-assisted selection.

[0059] "Reference" means a parent or ancestor (and/or founder) that provides a genetic contribution to a number of groups of individuals, thereby permitting comparison of the performances of the progeny within and between groups relative to the performance of progeny from other parents or ancestors (and/or founders). References permit the best ancestors (and/or founders) to be selected and used in artificial selection.

[0060] "Replacement" means an individual that is to become a parent for the first time in an artificial selection program.

[0061] The term "restriction fragment length polymorphism" or "RFLP" refers to any one of different DNA fragment lengths produced by restriction digestion of genomic DNA or cDNA with one or more endonuclease enzymes, wherein the fragment length varies between individuals in a population.

[0062] As used herein, the term "selection" shall be taken to refer to one or more systems, processes, steps or combinations of steps that determine one or more individuals in a population that are to contribute to the next generation, including natural selection and artificial selection.

[0063] "Selection criterion" refers to a phenotype or genotype forming the basis for a selection decision, including the presence or absence of one or more genes, or one or more genetic markers associated with a particular gene, combination of genes, trait or combination of traits.

[0064] "Selection index" means a ranking of a selection criterion or selection criteria according to a weighting or grade, used for estimating breeding value.

[0065] "Selection intensity" refers to the extent to which a breeder adheres to a decision on the selection of a particular individual or group of individuals for mating. Statistically, the selection intensity is determined as the difference between mean selection criterion of those individuals selected to contribute to the next generation and the mean selection criterion of all potential parents, expressed in standard deviation units.

[0066] "Selection target" refers to an optimum desired breeding value.

[0067] The term "short tandem repeat" or "STR" refers to a variable number tandem repeat (VNTR) comprising from 2 to about 5 or 6 base pairs per repeat unit, wherein the number of repeat units varies between individuals in a population. Microsatellites are an example of an STR that is generally highly polymorphic and randomly distributed in the genome and that may contain variability in sequence and/or for which the number of repeat units may increase or decrease.

[0068] The term "single-gene trait" refers to a trait that is determined by expression of one gene.

[0069] The term "single nucleotide polymorphism" or "SNP" refers to any one of different single nucleotides at a particular allele or locus varying between individuals in a population. Many SNPs are bi- allelic.

[0070] A statistical model is a mathematical model that embodies a set of statistical assumptions regarding the generation of sample or observed data, and uses a mathematical relationship to estimate model effects or parameters based on the observed data. A fixed effect model is a statistical model in which the model parameters are fixed or non-random quantities. A random effects model also known as a variance components model in which the model parameters are random variables. A mixed effects model is a statistical model that contains both fixed effects and random effects. Bayesian statistical model require the specification of prior distributions for unknown parameters. An iterative regression method is a method of estimating the parameters of a model using an iterative process to progressively update model parameters and residuals. Model parameters may include fixed effects, random effects (each comprising vectors of regression coefficients) and variances including random term variances and residual variances. Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution and may be used to estimate parameters in a Bayesian statistical model. MCMC methods create samples from a possibly multi-dimensional continuous random variable, with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, as its expected value or variance. Practically, an ensemble of Markov chains is developed, starting from a set of points arbitrarily chosen and sufficiently distant from each other. These chains are stochastic processes of "walkers" which move around randomly according to an algorithm which looks for places with a reasonably high contribution to the integral to move into next, assigning them higher probabilities. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by observing the chain after a number of steps. One class of MCMC method are based on the Metropolis-Hasting algorithm which generates a Markov chain using a proposal density for new steps and a method for rejecting some of the proposed moves. Gibbs Sampling is a special case of Metropolis-Hasting algorithm, and is MCMC method for obtaining a sequence of observations which are approximated from a specified multivariate probability distribution.

[0071] At step 1 10 we obtain a response data and an explanatory data. In the context of genomic applications, either the response data or the explanatory data comprises genomic data for a plurality of individuals in a population. This data will typically be arranged (or structured) as a data vector or matrix, but other data structures could be used. For example in the example shown in Figure 1 B we obtain genotype (or genotypic) data 112 and phenotype (or phenotypic ) data 113 for a plurality of individuals in a population 111. The genotype data may be obtained using a plurality of markers. In one embodiment the markers are Single Nucleotide Polymorphism (SNPs), having two alleles at a genetic locus, and the genotype data may be obtained using a SNP chip, including high density SNP chips. However it will be understood that a range of genetic markers, and high throughput genomic platforms (eg microarray, bead based systems, high throughput sequencing systems) may be used. In some embodiments each marker has a set of n allele states, where n is typically more than 2, and may be for example a minisatellite, microsatellite, tandem repeat, short tandem repeat, restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs), and other DNA polymorphisms. An allele will refer to any one of the different forms of a gene or DNA sequence at a single locus i.e., chromosomal location including a coding sequence, non-coding sequence or regulatory sequence.

[0072] The phenotype data can be any relevant dataset comprising observable characteristics or traits of the individuals in the population. The phenotype data may be generated from any suitable source and may be pre-processed to generate data more suitable for use by the MCMC model. In one embodiment, in the context of dairy cows and bulls, the phenotype data may comprise milk, fat and protein yield phenotypes. In another embodiment the phenotype data may be milk composition analysed using an infrared spectrometer. In this embodiment the corresponding spectra may be pre-processed for example to remove noise, eliminating outliers and by taking first derivative to generate wavenumber data. A principal components analysis may be performed on the wavenumber data to identify the first n components which explain a specified level (eg, 90 or 95%) of the variation in the phenotype data, or some arbitrary number (eg first 10 or 20), which are then used as the phenotype data in the model. In the context of plants, phenotype data may relate to salinity tolerance, yield, herbicide resistance etc. In the context of humans, phenotype data may relate to an endotype, pharmacokinetics, disease status, blood pressure, cholesterol, weight, etc.

[0073] At step 120 we estimate a plurality of model parameters of a Bayesian statistical model using an iterative regression method applied to the response data and the explanatory data. For example in the embodiment shown in Figure 1B this comprises estimating marker effects in a Bayesian Markov Chain Monte Carlo (MCMC) based statistical model using the genotype data 112 and phenotype data 113. The model a priori assumes the markers are each drawn from a mixture of K distributions each with a different marker effect i.e. a mixture model of marker priors 122. A computer implementation of a blocked Gibbs Sampler 123 is used to perform the iterative regression in order to estimate the marker effects. Processing each iteration of the iterative regression method is performed by breaking the vector of regression coefficients (eg genotype data) into a plurality of non-overlapping blocks of regression coefficients 130. For example in the case of Figure 1B this may comprise breaking the genome (ie set of markers) into non overlapping blocks of markers. Each block has a block size b, which is the number of regression coefficients in the block. The block size may vary from block to block. In one embodiment the block size is fixed (or constant) for all blocks except the last block, which has a size smaller than or equal to b. In this embodiment the number of blocks B is B=ceil(Number of markers / b), where ceil(x) maps x to the least integer greater to or equal to x.

[0074] A block ordering 140 is then determined for each iteration (of L iterations) and the blocks are then sequentially processed (according to the block ordering for that iteration) to obtain updated estimates of regression coefficients (eg marker effects) 150. The block order could be determined at the start of each iteration, or the block orderings could be precomputed for all iterations in an initialisation step. In one embodiment the block ordering is a random ordering, or a pseudo-random ordering, or some sequence of orderings to reduce or eliminate any bias due to the use of the same block order in each iteration. Sequential processing of each block 150 comprises determining a starting index within the block 152 and then sequentially estimating a regression coefficient (eg marker effect) starting with the starting index 154. Selection of which index in the block to start the sequential estimation of regression coefficients may be performed randomly or pseudo-randomly, or according to some sequence of orderings in order to reduce or eliminate any bias due to the use of the same starting index (ie location) within the block.

[0075] In one embodiment the set of model residuals are updated after sequential processing of a block (ie they are periodically updated) 156. In one embodiment there are two sets of residuals comprising outer residuals which are the residuals for the whole model, and a set of residuals for each block. At the start of processing each block a set of block residuals are setup that are corrected for all effects except for the block being updated. These block residuals are then updated after each inner iteration (ie after sequentially estimating, or processing all of the markers in a block). This may be performed based on equation 5 of Calus (the entire content of which is hereby included by reference). After a block has been processed, the current block residuals are then used to update the outer residuals. Again these may be updated based on equation 5 of Calus, but modified to update a block of residuals rather than just a single residual at a time.

[0076] Updating all other model effects 128 may be interlaced with the step of sequentially processing each block 150. All other model effects (beside the marker effects) will include, but is not limited to, polygenic effects, pedigree effects (eg any effect due to relationships between individuals - such as described in the Average relationship matrix A or genomic relationship matrix G), fixed effects and /or variances, and will depend upon the specific model applied. In one embodiment where block sizes are fixed except for the last block, updating of all other model effects is performed at least / times where / = floor(Lb).

[0077] After performing L iterations, final model parameters (eg estimates of the marker effect) are returned 129. The method may be used to estimate one or more traits in the population, and the estimated traits may be single traits or multi traits (i.e. QTLs). The trait estimates may then be used for breeding decisions. For example in one embodiment the method may further include the step 160 of obtaining genotype data an individual drawn from the population. The genotype data may be the markers associated with specific traits (although the same marker set used to generate the model could be used, e.g. by using the same SNP chip). A breeding decision in relation to the individual can then be made based on the observed genotype data for the specific traits of interest 170. However it will be understood that this is one example embodiment, and in addition to estimating genomic breeding value, or making genomic breeding decisions the method may be used for genomic discovery, estimating disease risk (ie by identifying genomic predictors of a disease), dosage estimation based on genetic factors (e.g. warfarin dose), performing genomic prediction or association studies, etc. For example mid-infrared spectra can be used to map genes affecting milk traits as described in Benedet et al“The use of mid-infrared spectra to map genes affecting milk composition, Journal of Dairy Science, Volume 102, Issue 8, 2019, Pages 7189- 7203, ISSN 0022-0302”. In this application the explanatory data is based on milk spectra, and the response (dependent) variables are the genes or markers of interest.

[0078] To further illustrate features of the above described method (which we refer to as the BayesR3 method) a more detailed mathematical explanation will now be provided in which the Bayesian MCMC based genomic prediction statistical model is based on the BayesR model described in Erbe et al, and is applied to SNP data. In the context of the following discussion, estimating marker effects thus comprises estimating SNP effects (ie the markers are SNPs). We begin with a review of the BayesR (mixed effects) model described in Erbe et al.: Equation 1 where y is a vector of n phenotype data for each trait; W is the (n × m ) design matrix allocating records to the marker effects; vector u is a (m x 1) vector of SNP effects assumed normally distributed

; e is a vector of random deviates, where is the error variance; v j is the polygenic

breeding value of the /th individual, where A is the average relationship matrix; where is the polygenic error variance; and Z is a matrix that allocated records to individuals. The BayesR model shown in equation 1 is a mixed effects model that estimates random marker effects (u) and random polygenic effects (v). The polygenic effect accounts for genetic variance not explained by the markers. In other embodiments variations of the model shown in equation 1 may be used including additional or fewer fixed or random effects. For example polygenic effects could be omitted by dropping the Zv term, or an extra random effects term added to account for repeat values, or a different form of the fixed effects component such as Xa for 1 n 'm where X is a design matrix allocating phenotypes to fixed effects, and a is the vector of fixed effects, such as overall mean, breed, data type (eg trait deviations, daughter trait deviation) etc. In this embodiment we are estimating model parameters

[0079] The (mixture) model 122 a priori assumes each SNP is drawn from a mixture of 4 zero mean normal distributions of SNP effects:

Equation 2 where

gives the prior probability that SNP j has an effect, given p and ; and is the sum of all the genetic variances.

[0080] Sparseness is included in the model by setting the first mixture's component variance to zero. The prior distributions for the BayesR parameters are given in (Erbe et al. 2012) as follows.

[0081] As described in Erbe et al. 2012, the proportions of the SNP in each distribution can be defined as pr1, pr2, pr3, and pr4, forming vector pr. The prior distribution of the proportions of SNP in each distribution pr was the Dirichlet distribution, with a = 1 (where a is a 4 × 1 vector of pseudo counts, all with value 1 to give an almost uninformative prior with the numbers of SNP used here). The Dirichlet distribution is a convenient choice of prior, as it is a conjugate before the multinomial distribution, such that the posterior distribution of pr is ~Dir(a + b ), where b is a vector containing the number of SNP in each distribution estimated from the data. To obtain these estimates, we first calculated 4 likelihoods assuming the considered SNP being in 1 of the 4 normal distributions at a time with the respective probability pr k . The likelihood that SNP i is in distribution k is:

Equation 3 where y* is the vector of phenotypes corrected for all marker effects other than marker i, the overall mean, and the polygenic effects (û); Z* is a column vector containing the SNP genotypes of all animals for SNP i; V is the variance-covariance structure of a reduced model, including only the effect of the respective SNP and a residual effect; and log|V| was calculated as where

is the kth distribution variance and Z* contains only the information for the current SNP effect. Then, the probability that SNP / is in distribution k is

Equation 4

[0082] Based on these probabilities e can select the normal distrib tion to sample the SNP effect from using a uniform random variate, using the probabilities of the SNP being in each of the distributions for the current iteration. Over all the SNP, we thus obtained estimates for the elements of b. Tha t in each iteration we sample pr in order t o update the conditional posterior probability that SNP i belongs to istribution k. The posterior of pr cannot be estimated directly, as it is conditional on both the estimates of the SNP effects (to calculate y*) and estimates of the polygenic effects . A Gibbs sampling scheme was, therefore, used to sample from the posterior distributions of all parameters. Prior distributions for other parameters were as described by Verbyla et al. (2009). The Gibbs sampling scheme was similar to that described by Meuwissen et al. (2001) for BayesA, but with the addition of a polygenic effect, and with the SNP variances described above. At the end of each iteration, the proportion of SNP in each distribution can be sampled from the posterior Dirichlet distribution as described above. Wang et al (2016) introduced a hybrid between an Expectation -Maximisation (EM) algorithm and standard Bayes R (“EM-Hybrid”) to reduce computer time. This method proposes performing initial iterations using an EM algorithm to obtain initial estimation of model parameters which are then used as starting values in a MCMC algorithm with limited iterations.

[0083] In another embodiment, a loglikehood function with similar properties to Equation 3, could be used, such as:

Equation 5 where

[0084] As discussed above, a blocked Gibbs sampling approach is used to estimate SNP effects. Ignoring fixed effects and a polygenic effect described by the A matrix, the genomic prediction model is: Equation 6

where y represents a vector of phenotypes, the genotypes in block i, g i the SNP effects associated with block i and e the residual error vector. The genome (or more specifically, the genotype data) is broken into B non-overlapping blocks of SNPs.

[0085] As discussed above the B Blocks may be any size. However in the following discussion we will fix all blocks to be the same size except the last block that can be smaller. In one embodiment the number of blocks B = ceil (M/b), where ceil(x) maps x to the least integer greater to or equal to x, Mis the total number of markers in the genotype and b is the chosen block size (ie number of markers in a block). For example in the case of a genotype comprising 633,374 SNPs and a block size of 300, the number of blocks B = ceil(633,374/300)=2112 (ie 2111 with 300 markers and 1 with 74 markers). That is if the set of m markers are considered a vector, blocking represents slicing the vector after every b th marker to create a plurality of smaller vectors for example blocking {a, b, c, d, e, f} using a block size of 2 would generate 3 blocks {a, b}, {c, d} and {e, f}. Alternatively if the genotype data is represented as an n x m array where the n rows are each individual’s genotype (eg from a SNP chip) and the m columns represent the m SNPs, and, then blocking represents slicing the array along column boundaries (every b th marker/column) to create B subarrays with the same number of rows n but fewer columns (ie B n x b arrays).

[0086] The residuals ( e *) are calculated using the current value of each parameter. When sampling the SNP effects in block i, the right-hand sides: Equation 7 are calculated. Note that r has only b elements. Each SNP's least squares estimate at the kth iteration, where k in the set {1, 2, ..., b} is given by:

Equation 8

[0087] After which, is sampled as specified by (Erbe et al. 2012). is the diagonal element of

associated with SNP j, block i. All are precomputed and saved prior to analysis. The extra memory required is b × M. [0088] In another embodiment, on entering block i, the residuals, r, for that block are calculated using:

Equation 9

where , for the first block being processed at each iteration. The 1st term on the

righthand side, adds to r the current block residual estimate and represents an optional extra calculation designed to stabilise block convergence across runs (repeatability). This is useful when the heritability ( H 2 ), of the trait is expected to be small (ie H 2 « 0.2). In this case we (1) set and (2) then set all

elements of g i to zero (to be clear if we are using equation 7 then we leave g i as is). Whilst this approach imposes a slight performance penalty (increasing the computational time by around 10%), this improves the robustness of the model estimates, as it effectively prevents the Gibbs sampler from getting trapped in a local minima (or maxima) by forcing a new (random) search of the state space on each iteration.

[0089] After each SNP has been sampled the right-hand sides, r, are updated for SNP j + 1 by adding to it: . where is the jth column of and is the jth SNP effect. After all

block SNPs have been processed sequentially b-times, e is updated in preparation for block i+1 :

Equation 10 where gf represents the SNP-effects associated with the i lh block prior to block updating. The term represents the effects after block updating. Each block in turn gets iterated L times, where L is ceil (//b), and / represents the number of iterations associated with a normal Gibbs process; such as 50,000 (Wang et al. 2016). Equivalently / = floor(Lb). In one embodiment the updating of all other effects 128, such as polygenic effects, pedigree effects (eg due to relationships between individuals, for example the Average relationship matrix A or a similar term), fixed effects and variances is interlaced with the block updating process so that they are processed at least I times. This may be done by sampling these effects after every B/b blocks. As noted above, the blocks are sequentially processed, however the block ordering for each iteration may be randomised or varied. It is also noted that estimation of marker effects within a block is performed sequentially (rather than a joint estimation of all marker effects in the block) according to an order, eg, {1, 2, ..., b}. As noted above order of sequential processing may be varied by, at the start of each block, selecting a starting index in the set {1, 2, ... , b ) . For example if the starting index i s i (i th marker in the set) then the ordering is { i, i + 1 , ... , b, 1, 2, ..., i - 1 } .

[0090] Under standard BayesR approaches described above, all genotypes are assumed to be equally likely to affect a trait. In another embodiment, where biological prior information is available such as information generated using functional genetic assays in any species, the markers could be classified into different classes based on the biological prior data such as described in MacLeod, I. M., et al. (2016). "Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits." BMC Genomics 17(1): 1 -21. This paper proposed a method called BayesRC which extends the BayesR approach to the QTL case. This proposes that the a priori independent biological information can be used to allocate each variant to a specific“class” c (where c ³ 2), where the purpose is to provide one or more classes that are enriched for QTL. That is the markers are classified into different classes depending on the biological significance on the trait(s) of interest. For example, all variants in or close to candidate genes could be allocated to class I, while all other variants could be in class II. As for BayesR, the variant effects for members of class I are assumed to belong to a mixture of four normal distributions with proportions (P d1_ cI , P d2_ cI , P d3_ cI , P d4_ cI , ) while the variant effects that are members of class II belong to an independent mixture of the four distributions with proportions

(P d1 _ c II ,P d2 _ c II ,P d3 _ c II ,P d4 _ c II , ), etc. As outlined in MacLeod et al, this can be taken into account through a small modification in the BayesR algorithm to allow updating of the distribution of QTL effects within classes (which is an advantage if a particular class is enriched for QTL). Within each class c, we used a uniform Dirichlet prior (as in BayesR) for the proportion of effects in each distribution: P c ~Dir(a c ), where a c = [1,1, 1,1] (in the case 4 classes). This is updated each iteration within each class P c ~Dir(a c + b c ) where b c was the current number of variants in each of the four distributions within class c, as estimated from the data. Thus, we used a relatively uninformative prior for all classes, but within a class the posterior proportion of variants in each distribution was informed by the biological prior data and could vary from one class to the next. If a class is found to be enriched for QTL this increases the probability that a true QTL effect in this class will be included in the model. The prior of a c = 1 can be argued to have little influence on the posterior distribution provided that there is a reasonably large number of variants per class.

[0091] The changes from BayesR to BayesRC is that McLeod first requires that prior to calculating the model, the markers are allocated to a class. Then, when estimating a marker effect, we must sample from the appropriate distribution based on the class of the marker. However the actual process (or algorithm) for estimating the marker effect is unchanged, and as noted by McLeod, a drawback of the BayesRC method is that as it effectively implements the BayesR method for estimating regression coefficients, BayesRC (like BayesR) is computational demanding. Embodiments of the present BayesR3 method can thus be use speed up the estimation process. With reference to Figure 1B application of McLeod thus comprises modifying the mixture model of marker priors in block 122 so that markers are assigned to a class. Then rather than using a BayesR or BayesRC approach to calculate regression coefficients we instead use the method of BayesR3 as described herein in and illustrated in Figure 1B in which we apply blocking of regression co-efficients as shown in block 130 to improve the computational efficiency. This thus enables the use of prior biological knowledge to improve the accuracy of QTL discovery and genomic prediction to assist with breeding decisions. [0092] An embodiment of the method (BayesR3) described herein was implemented and benchmarked on a single trait analysis and a multi-trait analysis of SNP data.

[0093] Single trait analyses: A reference population of 97,634 dairy cows and bulls was available with milk, fat and protein yield phenotypes (MY, PY and FY) pre-corrected for known fixed effects. The cattle included Holstein (68%), Jersey (14%) and crossbreds (17%). There were two validation sets: 1251 Holstein Bulls and 508 Jersey bulls. All animals had genotypes that were mainly imputed from low or medium density SNP arrays up to high density (633,374 SNP). A single trait genomic prediction analysis was carried out for each trait using both the BayesR3 and the EM-Hybrid. An analysis using the equivalent GBLUP model using MTG2 software (Lee and van der Werf 2016) was also performed.

MTG2 requires the genomic relationship (G) matrix to be pre-calculated prior to running GBLUP, so G was calculated using GCTA software (Yang et al. 2011), invoking the multi-threading option.

[0094] Multi trait MIR analyses: Milk samples of 5,202 Australian Holstein, Jersey, and cross-bred cows were analysed for milk composition by an infrared spectrometer (Model 2000, Bentley Instruments, Chaska, MN) and the corresponding spectra were stored for this study. A single spectrum includes 899 data points, with each point representing the absorption of infrared light through the milk sample at wave number from 649 to 3,999 cm -1 regions. Several procedures were applied to the raw spectra, including removal of noise, eliminating outliers and taking first derivative, as commonly recommended for MIR studies (De Marchi et al. 2014). This resulted in 537 wavenumbers per sample, which was used for further analysis. The BayesR approach to multi -trait analysis is described by (Kemper et al. 2018). As the traits in this model are assumed to be uncorrelated, the traits are first transformed into PCA components and the first 20 components were used. Genotypes were imputed HD genotypes (633,374 SNP).

[0095] Single Trait: For both methods, EM-Hybrid and BayesR3, we ran four independent analyses (“MCMC chains”) for each milk trait to test for convergence of the results. The accuracy of genomic prediction was estimated as the correlation between the genomic estimated breeding value (GEBV) and the phenotype. The bias of genomic prediction was estimated as the regression of the phenotype on the GEBV. The accuracy and bias are presented as the average across the 4 MCMC chains and the results shown in Figure 2A and 2B of the Accuracy and Bias between EM-Hybrid (horizontal bars) and BayesR3 (diagonal bars) which show that the two methods give very comparable results.

[0096] The processing time (PT) for a standard BayesR (Erbe et al 2012) run per iteration is proportional to M × R, where M is the number of markers and R is the number of records (ie the number of individuals genotyped). For embodiments of the present method (BayesR3) PT per iteration is proportional to B × R + M × b, where B is the number of blocks, and b is block size. Figure 3 is a plot of the expected processing time ratio 30 between a standard BayesR to BayesR3 for 633374 SNPs and 97000 animals with respect to block size. The maximum speed gain 32 is proportional to . Table 1 gives the actual processing times observed for the 97,634-cow data set, using EM-Hybrid, BayesR3 and GBLUP models.

TABLE 1

Comparison of actual processing times in days between EM-Hybrid, GBLUP, and BayesR3

[0097] Note that BayesR3 is approximately 18 times faster than EM-Hybrid, but only 2 times faster than GBLUP. As block size increases beyond a certain limit, not only does processing time of BayesR3 decrease, its prediction accuracy starts to decrease also. Thus in one embodiment the block size is set as where R is the number of individuals genotyped.

[0098] Multi-trait: The BayesR3 multi -trait analysis took 14 hrs and in the standard multi-trait BayesR was projected to take around 110 days. The posterior probabilities (PP > 0.05) that each SNP was included in the multi -trait MIR model are shown for chromosome 5 to 27 in Figure 4. The square symbols on Figure 4 are the overlapping SNP from the milk yield single trait results (97,634 cow data set) for only those SNP with PP > 0.8 (98 SNPs). Figure 4 shows the subset of those SNPs where PP from the MIR analysis was above 0.05. A number of these SNP map to gene regions that are well known to influence milk components (annotated on Figure 4). This demonstrates that the MIR multi-trait has successfully fine mapped SNP to expected gene regions.

[0099] The computational efficiency gains of BayesR3 allow for the implementation of multi-trait analyses with a very large number of traits (eg thousands of traits). The utility of BayesR3 to analyse large numbers of traits in a multi-trait model is illustrated in Figure 5 which shows a repeating a set of 24 traits multiple times (576 in total) which shows that predictive ability remains for the repeated trait groups.

[00100] Determining the association of a SNP to the traits in a multi -trait analysis requires a product of probabilities. Therefore, there will be a point where the resulting product essentially becomes digitally zero at which point SNPs will be prevented from being included into the analysis. For this reason, we defined an upper-bound to the number of traits we could analyse before computation precision issues prevented SNPs from being associated with the traits. We constructed a data set of 46,771 SNPs, and an associated phenotype data set of 5248 records of 24 uncorrelated traits, all with zero mean and variance of 1. The phenotype records were broken into a training set of 4924 records and a test set of 314 records. Across the traits there was also varying degrees of heritability that ranged from 7% to 88%, and the traits also had different amounts of missing values, which ranged from zero to 70% on different traits. The traits were placed in order such that the trait with the least number of missing values became trait number 1 and the trait with the most missing values was assigned to trait 24.

[00101] We know the analysis results of this 24 trait data set, such as each trait's prediction accuracies (R 2 ) and the number of SNPs that fell into each of the 4 model variance distributions for each trait Therefore, we should be able to concatenate the 24 trait phenotype data set into larger sets and inspect the results to see whether they confirm the expected repeated pattern across traits. We did this for trait sets that varied in trait number from 48 to 2304, and all of which showed the expected repeating pattern of that seen for the 24-trait analysis. In Figure 5, for ease of inspection, the results for the 576 (24x24) trait analysis is shown. From the top plot 52 in Figure 5 the SNP counts falling into the four variance distributions, Nkl to Nk4 are given as stacked bar plots. From this plot it is easily seen that there is a clear repeating pattern for every 24 traits. Likewise, for the pattern of accuracies across the 576 traits as shown in the bottom plot 54 of Figure 5. Therefore, we conclude that the multi-trait analysis procedure presented here can simultaneously analyse phenotype data sets comprising of thousands of traits - in this case 2304 traits was demonstrated, but it could reasonably be expected that more traits could be simultaneously estimated.

[00102] An embodiment of computer system 200 for implementing the methods described herein is illustrated in Figure 5. Genotype data 202 and phenotype data 204 are generated and stored in a storage device 206. A computing apparatus 210 comprises a central processing unit (CPU) 220, a memory 230, and an Input/Output (or Communications) interface 240, and may include a graphical processing unit (GPU) 250, and input and output devices 260. The CPU 220 may comprise an Arithmetic and Logic Unit (ALU) 222 and a Control Unit and Program Counter element 224. The memory 230 may comprise RAM and ROM components 232 such as solid state memory and secondary storage 234 such as a hard disk.

The Input/Output Interface 240 may comprise a network interface and/or communications module for communicating with an equivalent communications module in another apparatus using a predefined communications protocol (e.g. Bluetooth, Zigbee, IEEE 802.25, IEEE 802.22, TCP/IP, UDP, etc). Input and output devices may be connected via wired or wireless connections. The Input/Output interface 240 may be in communication with the storage device 206 and other computing apparatus or servers 208. Input and output devices 260 may comprise a keyboard, a mouse, and a display apparatus such as a flat screen display (eg LCD, LED, plasma, touch screen, etc), a projector, CRT, etc.

[00103] The computing apparatus 210 may comprise a single CPU (core) or multiple CPU's

(multiple core), or multiple processors. The computing apparatus may be a server, desktop or portable computer and may use a parallel processor, a vector processor, or be may be part of a distributed (cloud) computing apparatus. The memory 230 is operatively coupled to the processor(s) 220 and may comprise RAM and ROM components 232, and secondary storage components such as solid state disks and hard disks 234, which may be provided within or external to the device. The storage device 206 may be an internal device, a directly connected external device, or a network storage device (ie connected over a network interface) external to the computing apparatus 210. The memory may comprise instructions to cause the processor to execute a method described herein. The memory 230 may be used to store the operating system and additional software modules or instructions. The processor(s) 220 may be configured to load and execute the software modules or instructions stored in the memory 230. A report comprising the likely genetic markers determined from the model may be compiled into an electronic report or graphic and displayed on a display apparatus 260, or the likely genetic markers may be reported to another computing apparatus 208 for report compilation or other uses. The software modules may be written in high level computing languages such as Fortran, C, C++, java, python, etc, or statistical languages such as R. In some embodiments the software modules may be use software libraries optimised for fast matrix operations (eg LAPACK, NAG, etc). The genotype data and phenotype data may be stored in one or more databases, including relational/SQL databases and NoSQL databases, or in various formats (eg text files, csv files, etc) in data repositories or file systems, including structured file systems or data repositories, which can be accessed by the computing system.

[00104] The genotype data may be obtained from a range of genotyping systems. Genotyping generally involves detecting one or more markers of interest e.g., SNPs in a sample from an individual being tested, and analysing the results obtained to determine the haplotype of the subject. As will be apparent from the disclosure herein, it is particularly preferred to detect the one or more markers of interest using a high-throughput system comprising a solid support consisting essentially of or having nucleic acids of different sequence bound directly or indirectly thereto, wherein each nucleic acid of different sequence comprises a polymorphic genetic marker derived from an ancestor or founder that is representative of the current population and, more preferably wherein said high-throughput system comprises sufficient markers to be representative of the genome of the current population (eg high density SNP chips).

[00105] Suitable samples for genotyping may comprise nucleic acid, e.g., RNA or genomic DNA and preferably genomic DNA. Genetic testing of plants can involve testing of any plant part e.g., leaf, floral organ, seed, etc. Genetic testing of animals can be performed using a hair follicle, for example, isolated from the tail of an animal to be tested. Other examples of readily accessible samples include, for example, skin or a bodily fluid or an extract thereof or a fraction thereof. For example, a readily accessible bodily fluid includes, for example, whole blood, saliva, semen or urine. Exemplary whole blood fractions are selected from the group consisting of buffy-coat fraction, Fraction IH-III obtainable by ethanol fractionation of Cohn (E. J. Cohn et al., J. Am. Chem. Soc, 68, 459 (1946), Fraction II obtainable by ethanol fractionation of Cohn (E. J. Cohn et al., J. Am. Chem. Soc, 68, 459 (1946), albumin fraction, an immunoglobulin-containing fraction and mixtures thereof. Preferably, a sample from an animal has been isolated or derived previously from an animal subject by, for example, surgery, or using a syringe or swab. In another embodiment, a sample can comprise a cell or cell extract or mixture thereof derived from a tissue or organ such as described herein above. Nucleic acid preparation derived from organs, tissues or cells are also particularly useful. The sample can be prepared on a solid matrix for histological analyses, or alternatively, in a suitable solution such as, for example, an extraction buffer or suspension buffer, and the present invention clearly extends to the testing of biological solutions thus prepared. However, in a preferred embodiment, the high-throughput system of the present invention is employed using samples in solution.

[00106] The skilled artisan is aware that a suitable probe or primer may be designed for a maker i.e., one capable of specifically detecting a marker, will specifically hybridize to a region of the genome in genomic DNA from the individual being tested that comprises the marker. As used herein "selectively hybridizes" means that the polynucleotide used as a probe is used under conditions where a target polynucleotide is found to hybridize to the probe at a level significantly above background. The background hybridization may occur because of other polynucleotides present, for example, in genomic DNA being screened. In this event, background implies a level of signal generated by interaction between the probe and non-specific DNA which is less than 10 fold, preferably less than 100 fold as intense as the specific interaction observed with the target DNA. The intensity of interaction are measured, for example, by radiolabelling the probe, e.g. with 32P, or by labelling the probe with a fluorophore including fluorescent dyes (eg Cy3, Cy5), fluorescent molecules (eg GFP), or fluorescent particle (Quantum dots).

[00107] As will be known to the skilled artisan a probe or primer comprises nucleic acid and may consist of synthetic oligonucleotides up to about 100-300 nucleotides in length and more preferably of about 50-100 nucleotides in length and still more preferably at least about 8-100 or 8-50 nucleotides in length. For example, locked nucleic acid (LNA) or protein-nucleic acid (PNA) probes or molecular beacons for the detection of one or more SNPs are generally at least about 8 to 12 nucleotides in length. Longer nucleic acid fragments up to several kilobases in length can also be used, e.g., derived from genomic DNA that has been sheared or digested with one or more restriction endonucleases.

Alternatively, probes/primers can comprise RNA. Preferred probes or primers for use in the present invention will be compatible with the high-throughput system described herein. Exemplary probes and primers will comprise locked nucleic acid (LNA) or protein-nucleic acid (PNA) probes or molecular beacons, preferably bound to a solid phase. For example, LNA or PNA probes bound to a solid support are used, wherein the probes each comprise an SNP and sufficient probes are bound to the solid support to span the genome of the species to which an individual being tested belongs. The number of probes or primers will vary depending upon the number of loci or QTLs being screened and, in the case of genome- wide screens, the size of the genome being screened. The determination of such parameters is readily determined by a skilled artisan without undue experimentation.

[00108] Specificity of probes or primers can also depend upon the format of hybridization or amplification reaction employed for genotyping. The sequence(s) of any particular probe(s) or primer(s) used in the method of the present invention will depend upon the locus or QTL or combination thereof being screened. In this respect, the present invention can be generally applied to the genotyping of any locus or QTL or to the simultaneous or sequential genotyping of any number of QTLs or loci including genome -wide genotyping. This generality is not to be taken away or read down to a specific locus or QTL or combination thereof. The determination of probe/primer sequences is readily determined by a skilled artisan without undue experimentation. Standard methods are employed for designing probes and/or primers e.g., as described by Dveksler (Eds) (In: PCR Primer: A Laboratory Manual, Cold Spring Harbour Laboratories, NY, 1995). Software packages are also publicly available for designing optimal probes and/or primers for a variety of assays, e.g., Primer 3 available from the Center for Genome Research, Cambridge, MA, USA. Probes and/or primers are preferably assessed to determine those that do not form hairpins, self-prime, or form primer dimers (e.g. with another probe or primer used in a detection assay). Furthermore, a probe or primer (or the sequence thereof) is preferably assessed to determine the temperature at which it denatures from a target nucleic acid (i.e. the melting temperature of the probe or primer, or Tm). Methods of determining Tm are known in the art and described, for example, in Santa Lucia, Proc. Natl. Acad. Sci. USA, 95: 1460-1465, 1995 or Bresslauer et al, Proc. Natl. Acad.

Sci. USA, 83: 3746-3750, 1986.

[00109] For LNA or PNA probes or molecular beacons, it is particularly preferred for the probe or molecular beacon to be at least about 8 to 12 nucleotides in length and more preferably, for the SNP to be positioned at approximately the centre of the probe, thereby facilitating selective hybridization and accurate detection. For detecting one or more SNPs using an allele -specific PCR assay or a ligase chain reaction assay, the probe/primer is generally designed such that the 3' terminal nucleotide hybridizes to the site of the SNP. The 3' terminal nucleotide may be complementary to any of the nucleotides known to be present at the site of the SNP. When complementary nucleotides occur in both the probe/primer and at the site of the polymorphism, the 3' end of the probe or primer hybridizes completely to the marker of interest and facilitates, for example, PCR amplification or ligation to another nucleic acid. Accordingly, a probe or primer that completely hybridizes to the target nucleic acid produces a positive result in an assay.

[00110] For primer extension reactions, the probe/primer is generally designed such that it specifically hybridizes to a region adjacent to a specific nucleotide of interest, e.g., an SNP. While the specific hybridization of a probe or primer may be estimated by determining the degree of homology (or sequence similarity) of the probe or primer to any nucleic acid using software, such as, for example, BLAST, the specificity of a probe or primer is generally determined empirically using methods known in the art. Methods of producing/synthesizing probes and/or primers useful in the present invention are known in the art. For example, oligonucleotide synthesis is described, in Gait (Ed) (In: Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford, 1984); LNA synthesis is described, for example, in Nielsen et al, J. Chem. Soc. Perkin Trans., 1 : 3423, 1997; Singh and Wengel, Chem. Commun. 1247, 1998; and PNA synthesis is described, for example, in Egholm et al, Am. Chem. Soc, 114: 1895, 1992; Egholm et al, Nature, 365: 566, 1993; and Orum et al., Nucl. Acids Res., 21: 5332, 1993.

[00111] Numerous methods are known in the art for determining the occurrence of a particular marker in a sample may be used. In a preferred embodiment, a marker is detected using a probe or primer that selectively hybridizes to said marker in a sample from an individual under moderate stringency, and preferably, high stringency conditions. If the probe or primer is detectably labelled with a suitable reporter molecule, e.g., a chemiluminescent label, fluorescent label, radiolabel, enzyme, hapten, or unique oligonucleotide sequence etc, then the hybridization may be detected directly by determining binding of reporter molecule. Alternatively, the hybridized probe or primer may be detected by performing an amplification reaction such as polymerase chain reaction (PCR) or similar format, and detecting the amplified nucleic acid. Preferably, the probe or primer is bound to solid support e.g., in the high- throughput system of the present invention.

[00112] For the purposes of defining the level of stringency to be used in the hybridization, a low stringency is defined herein as hybridization and/or a wash step(s) carried out in 2-6 x SSC buffer, 0.1% (w/v) SDS at 28°C, or equivalent conditions. A moderate stringency is defined herein as hybridization and/or a wash step(s) carried out in 0.2-2 x SSC buffer, 0.1% (w/v) SDS at a temperature in the range 45°C to 65°C, or equivalent conditions. A high stringency is defined herein as hybridization and/or a wash step(s) carried out in 0.1 x SSC buffer, 0.1% (w/v) SDS, or lower salt concentration, and at a temperature of at least 65°C, or equivalent conditions. Reference herein to a particular level of stringency encompasses equivalent conditions using wash/hybridization solutions other than SSC known to those skilled in the art. Generally, the stringency is increased by reducing the concentration of SSC buffer, and/or increasing the concentration of SDS and/or increasing the temperature of the hybridization and/or wash. Those skilled in the art will be aware that the conditions for hybridization and/or wash may vary depending upon the nature of the hybridization matrix used to support the sample DNA, or the type of hybridization probe used. Progressively higher stringency conditions can also be employed wherein the stringency is increased stepwise from lower to higher stringency conditions. Exemplary progressive stringency conditions are as follows: 2 x SSC/0.1% SDS at about room temperature (hybridization conditions); 0.2 x SSC/0.1% SDS at about room temperature (low stringency conditions); 0.2 x SSC/0.1% SDS at about 42°C (moderate stringency conditions); and 0.1 x SSC at about 68°C (high stringency conditions). Washing can be carried out using only one of these conditions, e.g., high stringency conditions, or each of the conditions can be used, e.g., for 10-15 minutes each, in the order listed above, repeating any or all of the steps listed. However, as mentioned above, optimal conditions will vary, depending on the particular hybridization reaction involved, and can be determined empirically.

[00113] For example, a change in the sequence of a region of the genome or an expression product thereof, such as, for example, an insertion, a deletion, a transversion, a transition, is detected using a method, such as, polymerase chain reaction (PCR), strand displacement amplification, ligase chain reaction, cycling probe technology, DNA microarray chip (including SNP microarrays or SNP chips), and other high throughput sequencing technologies including Ion semiconductor (IonTorrent) sequencing, sequencing by synthesis (Illuimina), sequencing by ligation (SOLiD), Single-molecule real time (SMRT) sequencing (PacBio), Combinatorial probe anchor synthesis (cPAS BGI), Pyrosequencing (454), Nanopore sequencing, sequencing by hybridisation, sequencing with mass spectrometry, microfluidic sequencing, or other high throughput methods.

[00114] Methods of PCR are known in the art and described, for example, in Dieffenbach (ed) and Dveksler (ed) (In: PCR Primer: A Laboratory Manual, Cold Spring Harbour Laboratories, NY, 1995). Generally, for PCR two non-complementary nucleic acid primer molecules comprising at least about 15 nucleotides, more preferably at least 20 nucleotides in length are hybridized to different strands of a nucleic acid template molecule, and specific nucleic acid molecule copies of the template are amplified enzymatically. PCR products may be detected using electrophoresis and detection with a detectable marker that binds nucleic acids. Alternatively, one or more of the oligonucleotides is/are labeled with a detectable marker (e.g. a fluorophore) and the amplification product detected using, for example, a lightcycler (Perkin Elmer, Wellesley, MA, USA). Clearly, the present invention also encompasses quantitative forms of PCR, such as, for example, Taqman assays.

[00115] Strand displacement amplification (SDA) utilizes oligonucleotides, a DNA polymerase and a restriction endonuclease to amplify a target sequence. The oligonucleotides are hybridized to a target nucleic acid and the polymerase used to produce a copy of this region. The duplexes of copied nucleic acid and target nucleic acid are then nicked with an endonuclease that specifically recognizes a sequence at the beginning of the copied nucleic acid. The DNA polymerase recognizes the nicked DNA and produces another copy of the target region at the same time displacing the previously generated nucleic acid. The advantage of SDA is that it occurs in an isothermal format, thereby facilitating high- throughput automated analysis.

[00116] Ligase chain reaction (described, for example, in EP 320,308 and US 4,883,750) uses at least two oligonucleotides that bind to a target nucleic acid in such a way that they are adjacent. A ligase enzyme is then used to link the oligonucleotides. Using thermocycling the ligated oligonucleotides then become a target for further oligonucleotides. The ligated fragments are then detected, for example, using electrophoresis, or MALDI-TOF. Alternatively, or in addition, one or more of the probes is labeled with a detectable marker, thereby facilitating rapid detection.

[00117] Cycling Probe Technology uses chimeric synthetic probe that comprises DNA-RNA-

DNA that is capable of hybridizing to a target sequence. Upon hybridization to a target sequence the RNA-DNA duplex formed is a target for RNase H thereby cleaving the probe. The cleaved probe is then detected using, for example, electrophoresis or MALDI-TOF. Additional methods for detecting SNPs are known in the art, and reviewed (for example, in Landegren et al, Genome Research 8: 769-776, 1998).

[00118] For example, an SNP that introduces or alters a sequence that is a recognition sequence for a restriction endonuclease is detected by digesting DNA with the endonuclease and detecting the fragment of interest using, for example, Southern blotting (described in Ausubel et al (In: Current Protocols in Molecular Biology. Wiley Interscience, ISBN 047 150338, 1987) and Sambrook et al (In: Molecular Cloning: Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratories, New York, Third Edition 2001)). Alternatively, a nucleic acid amplification method described supra, is used to amplify the region surrounding the SNP. The amplification product is then incubated with the endonuclease and any resulting fragments detected, for example, by electrophoresis, MALDI-TOF or PCR.

[00119] The direct analysis of the sequence of polymorphisms of the present invention can be accomplished using either the dideoxy chain termination method or the Maxam-Gilbert method (see Sambrook et al, Molecular Cloning, A Laboratory Manual (2nd Ed., CSHP, New York 1989); Zyskind et al, Recombinant DNA Laboratory Manual, (Acad. Press, 1988)). For example, a region of genomic DNA comprising one or more markers is amplified using an amplification reaction, e.g., PCR, and following purification of the amplification product, the amplified nucleic acid is used in a sequencing reaction to determine the sequence of one or both alleles at the site of an SNP of interest.

[00120] Alternatively, one or more SNPs is/are detected using single stranded conformational polymorphism (SSCP). SSCP relies upon the formation of secondary structures in nucleic acids and the sequence dependent nature of these secondary structures. In one form of this analysis, an amplification method, such as, for example, a method described supra, is used to amplify a nucleic acid that comprises an SNP. The amplified nucleic acids are then denatured, cooled and analyzed using, for example, non- denaturing polyacrylamide gel electrophoresis, mass spectrometry, or liquid chromatography (e.g., HPLC or dHPLC). Regions that comprise different sequences form different secondary structures, and as a consequence migrate at different rates through, for example, a gel and/or a charged field. Clearly, a detectable marker may be incorporated into a probe/primer useful in SSCP analysis to facilitate rapid marker detection. [00121] Alternatively, any nucleotide changes may be detected using, for example, mass spectrometry or capillary electrophoresis. For example, amplified products of a region of DNA comprising an SNP from a test sample are mixed with amplified products from an individual having a known genotype at the site of the SNP. The products are denatured and allowed to re-anneal. Those samples that comprise a different nucleotide at the position of the SNP will not completely anneal to a nucleic acid molecule from the control sample thereby changing the charge and/or conformation of the nucleic acid, when compared to a completely annealed nucleic acid. Such incorrect base pairing is detectable using, for example, mass spectrometry.

[00122] Allele-specific PCR (as described, for example, In Liu et al, Genome Research, 7: 389-

398, 1997) is also useful for determining the presence of one or other allele of an SNP. An

oligonucleotide is designed, in which the most 3' base of the oligonucleotide hybridizes to a specific form of an SNP of interest (i.e., allele). During a PCR reaction, the 3' end of the oligonucleotide does not hybridize to a target sequence that does not comprise the particular form of the SNP detected.

Accordingly, little or no PCR product is produced, indicating that a base other than that present in the oligonucleotide is present at the site of SNP in the sample. PCR products are then detected using, for example, gel or capillary electrophoresis or mass spectrometry.

[00123] Primer extension methods (described, for example, in Dieffenbach (ed) and Dveksler (ed)

(In: PCR Primer: A Laboratory Manual, Cold Spring Harbour Laboratories, NY, 1995)) are also useful for the detection of an SNP. An oligonucleotide is used that hybridizes to the region of a nucleic acid adjacent to the SNP. This oligonucleotide is used in a primer extension protocol with a polymerase and a free nucleotide diphosphate that corresponds to either or any of the possible bases that occur at the site of the SNP. Preferably, the nucleotide-diphosphate is labeled with a detectable marker (e.g. a fluorophore). Following primer extension, unbound labeled nucleotide diphosphates are removed, e.g. using size exclusion chromatography or electrophoresis, or hydrolyzed, using for example, alkaline phosphatase, and the incorporation of the labeled nucleotide into the oligonucleotide is detected, indicating the base that is present at the site of the SNP. Alternatively, or in addition, as exemplified herein primer extension products are detected using mass spectrometry (e.g., MALDI-TOF).

[00124] The present invention extends to high-throughput forms of primer extension analysis, such as, for example, minisequencing (Sy Vamen et al, Genomics 9: 341-342, 1995) wherein a probe or primer or multiple probes or primers is/are immobilized on a solid support (e.g. a glass slide), a sample comprising nucleic acid is brought into contact with the probe(s) or primer(s), a primer extension reaction is performed wherein each of the free nucleotide bases A, C, G, T is labeled with a different detectable marker and the presence or absence of one or more SNPs is determined by determining the detectable marker bound to each probe and/or primer. Fluorescently labeled locked nucleic acid (LNA) molecules or fluorescently labeled protein-nucleic acid (PNA) molecules are useful for the detection of SNPs (as described in Simeonov and Nikiforov, Nucleic Acids Research, 30(17): 1 -5, 2002). LNA and PNA molecules bind, with high affinity, to nucleic acid, in particular, DNA.

[00125] Flurophores (in particular, rhodomine or hexachlorofluorescein) conjugated to the LNA or PNA probe fluoresce at a significantly greater level upon hybridization of the probe to target nucleic acid compared to a probe that has not hybridized to a target nucleic acid. Flowever, the level of increase of fluorescence is not enhanced to the same level when even a single nucleotide mismatch occurs.

Accordingly, the degree of fluorescence detected in a sample is indicative of the presence of a mismatch between the LNA or PNA probe and the target nucleic acid, such as, in the presence of an SNP.

Preferably, fluorescently labeled LNA or PNA technology is used to detect a single base change in a nucleic acid that has been previously amplified using, for example, an amplification method described supra. As will be apparent to the skilled artisan, LNA or PNA detection technology is amenable to a high- throughput detection of one or more markers immobilizing an LNA or PNA probe to a solid support, as described in Oram et al, Clin. Chem. 45: 1898-1905, 1999.

[00126] Similarly, Molecular Beacons are useful for detecting SNPs directly in a sample or in an amplified product (see, for example, Mhlang and Malmberg, Methods 25: 463-471 , 2001). Molecular beacons are single stranded nucleic acid molecules with a stem-and-loop structure. The loop structure is complementary to the region surrounding the SNP of interest. The stem structure is formed by annealing two "arms" complementary to each other on either side of the probe (loop). A fluorescent moiety is bound to one arm and a quenching moiety that suppresses any detectable fluorescence when the molecular beacon is not bound to a target sequence bound to the other arm. Upon binding of the loop region to its target nucleic acid the arms are separated and fluorescence is detectable. However, even a single base mismatch significantly alters the level of fluorescence detected in a sample. Accordingly, the presence or absence of a particular base at the site of an SNP is determined by the level of fluorescence detected.

[00127] The present invention encompasses other methods of detecting a SNP that is , such as, for example, SNP microarrays (available from Affymetrix, or described, for example, in US 6,468,743 or Hacia et al, Nature Genetics, 14: 441, 1996), Taqman assays (as described in Livak et al, Nature Genetics, 9: 341-342, 1995), solid phase minisequencing (as described in Syvamen et al, Genomics, 13: 1008-1017, 1992), minisequencing with FRET (as described in Chen and Kwok , Nucleic Acids Res. 25: 347-353, 1997) or pyrominisequencing (as reviewed in Landegren et al, Genome Res., 8(8): 769-776, 1998) and other high throughput sequencing technologies including Ion semiconductor (IonTorrent) sequencing, sequencing by synthesis (Illuimina), sequencing by ligation (SOLiD), Single-molecule real-time (SMRT) sequencing (PacBio), Combinatorial probe anchor synthesis (cPAS BGI), Pyrosequencing (454), Nanopore sequencing. [00128] In those cases in which the polymorphism or marker occurs in a region of nucleic acid that encodes RNA, said polymorphism or marker is detected using a method such as, for example, RT- PCR, NASBA or TMA. Methods of RT-PCR are known in the art and described, for example, in Dieffenbach (ed) and Dveksler (ed) (In: PCR Primer: A Laboratory Manual, Cold Spring Harbour Laboratories, NY, 1995). Methods of TMA or self-sustained sequence replication (3SR) use two or more oligonucleotides that flank a target sequence, a RNA polymerase, RNase H and a reverse transcriptase. One oligonucleotide (that also comprises a RNA polymerase binding site) hybridizes to an RNA molecule that comprises the target sequence and the reverse transcriptase produces cDNA copy of this region. RNase H is used to digest the RNA in the RNA-DNA complex, and the second oligonucleotide used to produce a copy of the cDNA. The RNA polymerase is then used to produce a RNA copy of the cDNA, and the process repeated. NASBA systems relies on the simultaneous activity of three enzymes (a reverse transcriptase, RNase H and RNA polymerase) to selectively amplify target mRNA sequences. The mRNA template is transcribed to cDNA by reverse transcription using an oligonucleotide that hybridizes to the target sequence and comprises a RNA polymerase binding site at its 5' end. The template RNA is digested with RNase H and double stranded DNA is synthesized. The RNA polymerase then produces multiple RNA copies of the cDNA and the process is repeated.

[00129] The hybridization to and/or amplification of a marker is detectable using, for example, electrophoresis and/or mass spectrometry. In this regard, one or more of the probes/primers and/or one or more of the nucleotides used in an amplification reactions may be labeled with a detectable marker to facilitate rapid detection of a marker, for example, a fluorescent label (e.g. Cy5 or Cy3) or a radioisotope (e.g. 32P). Alternatively, amplification of a nucleic acid may be continuously monitored using a melting curve analysis method, such as that described in, for example, US 6,174,670. Such methods are suited to determining the level of an alternative splice form in a biological sample.

[00130] Methods of the invention can identify nucleotide occurrences at SNPs using genome wide sequencing or "microsequencing" methods. Whole-genome sequencing of individuals identifies all SNP genotypes in a single analysis. Microsequencing methods determine the identity of only a single nucleotide at a "predetermined" site. Such methods have particular utility in determining the presence and identity of polymorphisms in a target polynucleotide. Such microsequencing methods, as well as other methods for determining the nucleotide occurrence at SNP loci are discussed in Boyce-Jacino, et al, U.S. Pat. No. 6,294,336, incorporated herein by reference.

[00131] Microsequencing methods include the Genetic Bit Analysis method disclosed by Goelet,

P. et al. (WO 92/15712, herein incorporated by reference). Additional, primer-guided, nucleotide incorporation procedures for assaying polymorphic sites in DNA have also been described (Komher et al, Nucl. Acids. Res. 17, 7779-7784, 1989; Sokolov, Nucl. Acids Res. 18, 3671 (1990); Syvanen et al, Genomics 8, 684-692, 1990; Kuppuswamy et al, Proc. Natl. Acad. Sci. (U.S.A.) 88, 1143-1147, 1991; Prezant et al, Hum. Mutat. 1, 159-164, 1992; Ugozzoli et al, GATA 9, 107-112, 1992; Nyren et al, Anal Biochem. 208, 171 -175, 1993; Wallace, WO89/10414; Mundy, U.S. Pat. No. 4,656, 127; Cohen et al, French Pat. No. 2,650,840; W091/02087). In response to the difficulties encountered in employing gel electrophoresis to analyze sequences, alternative methods for microsequencing have been developed, e.g., Macevicz, U.S. Pat. No. 5,002,867 incorporated herein by reference. Boyce-Jacino et al, U.S. Pat. No. 6,294,336 provide a solid phase sequencing method for determining the sequence of nucleic acid molecules (either DNA or RNA) by utilizing a primer that selectively binds a polynucleotide target at a site wherein the SNP is the most 31 nucleotide selectively bound to the target. Oliphant et al, Suppl Biotechniques, June 2002, describe the use of BeadArray Technology to determine the nucleotide occurrence of an SNP. Alternatively, nucleotide occurrences for SNPs can be determined using a DNAMassARRAY system (Sequenom, San Diego, Calif.) is used, which system combines

SpectroChips™, microfiuidics, nanodispensing, biochemistry, and MALDI-TOF MS (matrix-assisted laser desorption ionization time of flight mass spectrometry).

[00132] Particularly useful methods include those that are readily adaptable to a high throughput format, to a multiplex format, or to both. High-throughput systems for analyzing markers, especially SNPs, can include, for example, a platform such as the UHT SNP-IT™ platform (Orchid Biosciences, Princeton, N.J., USA) MassArray™ system (Sequenom, San Diego, Calif., USA), the integrated SNP genotyping system (Illumina, San Diego, Calif, USA), TaqMan™ (ABI, Foster City, Calif, USA), Rolling circle amplification, fluorescent polarization, amongst others described herein above. In general, SNP- IT™ is a 3 -step primer extension reaction. In the first step, a target polynucleotide is isolated from a sample by hybridization to a capture primer, which provides a first level of specificity. In a second step the capture primer is extended from a terminating nucleotide trisphosphate at the target SNP site, which provides a second level of specificity. In a third step, the extended nucleotide trisphosphate can be detected using a variety of known formats, including: direct fluorescence, indirect fluorescence, an indirect colorimetric assay, mass spectrometry, fluorescence polarization, etc. Reactions can be processed in 384 well format in an automated format using an SNPstream™ instrument (Orchid BioSciences, Inc., Princeton, N.J.).

[00133] The present invention also provides a high-throughput system for genotypic selection in a current population having a small effective population size, said system comprising a solid support consisting essentially of or having nucleic acids of different sequence bound directly or indirectly thereto, wherein each nucleic acid of different sequence comprises a polymorphic genetic marker derived from an ancestor or founder that is representative of the current population. Exemplary high-throughput systems are hybridization mediums e.g., a microfluidic device or homogenous assay medium. Numerous microfluidic devices are known that include solid supports with microchannels (See e.g., U.S. Pat. Nos. 5,304,487, 5, 110, 745, 5,681,484, and 5,593,838). In a particularly preferred embodiment, the high throughput system comprises an SNP chip comprising 10,000-1,000,000 oligonucleotides each of which consists of a sequence comprising an SNP. Each of these hybridization mediums is suitable for determining the presence or absence of a marker associated with a trait.

[00134] The nucleic acids are typically oligonucleotides, attached directly or indirectly to the solid support. Accordingly, the oligonucleotides are used to determine the nucleotide occurrence of a marker associated with a trait, by virtue of the hybridization of nucleic acid from the subject being tested to an oligonucleotide of a series of oligonucleotides bound to the solid support being affected by the nucleotide occurrence of the marker in question e.g., by the presence or absence of an SNP in the subject's nucleic acid. Accordingly, oligonucleotides can be selected that bind at or near a genomic location of each marker. Such oligonucleotides can include forward and reverse oligonucleotides that can support amplification of a particular polymorphic marker present in template nucleic acid obtained from the subject being tested. Alternatively, or in addition, the oligonucleotides can include extension primer sequences that hybridize in proximity to a marker to thereby support extension to the marker for the purposes of identification. A suitable detection method will detect binding or tagging of the

oligonucleotides e.g., in a genotyping method described herein.

[00135] Techniques for producing immobilised arrays of DNA molecules have been described in the art. Generally, most methods describe how to synthesise single-stranded nucleic acid molecule arrays, using for example masking techniques to build up various permutations of sequences at the various discrete positions on the solid substrate. U.S. Patent No. 5,837,832, the contents of which are

incorporated herein by reference, describes an improved method for producing DNA arrays immobilised to silicon substrates based on very large scale integration technology. In particular, U.S. Patent No. 5,837,832 describes a strategy called "tiling" to synthesize specific sets of probes at spatially-defined locations on a substrate which are used to produce the immobilised DNA array. U.S. Patent No.

5,837,832 also provides references for earlier techniques that may also be used.

[00136] DNA can be synthesised in situ on the surface of the substrate. However, DNA may also be printed directly onto the substrate using for example robotic devices equipped with either pins or piezo electric devices. Microarrays are generally produced stepwise, by the in situ synthesis of the target directly onto the support, or alternatively, by exogenous deposition of pre-prepared targets.

Photolithography, mechanical microspotting, and inkjet technology are generally employed for producing microarrays.

[00137] In photolithography, a glass wafer, modified with photolabile protecting groups, is selectively activated e.g., for DNA synthesis, by shining light through a photomask. Repeated deprotection and coupling cycles enable the preparation of high -density oligonucleotide microarrays (see for example, U.S. Pat. No. 5,744,305, issued Apr. 28, 1998). [00138] Microspotting encompasses deposition technologies that enable automated microarray production, by printing small quantities of pre-made target substances onto solid surfaces. Printing is accomplished by direct surface contact between the printing substrate and a delivery mechanism, such as a pin or a capillary. Robotic control systems and multiplexed print heads allow automated microarray fabrication.

[00139] Inkjet technologies utilize piezoelectric and other forms of propulsion to transfer biochemical substances from miniature nozzles to solid surfaces. Using piezoelectricity, the target sample is expelled by passing an electric current through a piezoelectric crystal which expands to expel the sample. Piezoelectric propulsion technologies include continuous and drop-on-demand devices. In addition to piezoelectric ink jets, heat may be used to form and propel drops of fluid using bubble-jet or thermal inkjet heads; however, such thermal ink jets are typically not suitable for the transfer of biological materials due to the heat which is often stressful on biological samples. Examples of the use of ink jet technology include U.S. Pat. No. 5,658,802 (issued Aug. 19, 1997).

[00140] A plurality of nucleic acids is typically immobilised onto or in discrete regions of a solid substrate. The substrate is porous to allow immobilisation within the substrate, or substantially non- porous to permit surface immobilization.

[00141] The solid substrate can be made of any material to which polypeptides can bind, either directly or indirectly. Examples of suitable solid substrates include flat glass, silicon wafers, mica, ceramics and organic polymers such as plastics, including polystyrene and polymethacrylate. It is also possible to use semi-permeable membranes such as nitrocellulose or nylon membranes, which are widely available. The semi-permeable membranes are mounted on a more robust solid surface such as glass. The surfaces may optionally be coated with a layer of metal, such as gold, platinum or other transition metal. Preferably, the solid substrate is generally a material having a rigid or semi-rigid surface. In preferred embodiments, at least one surface of the substrate will be substantially flat, although in some

embodiments it are desirable to physically separate synthesis regions for different polymers with, for example, raised regions or etched trenches. It is also preferred that the solid substrate is suitable for the high density application of DNA sequences in discrete areas of typically from 50 to 100 mm, giving a density of 10,000 to 40,000 cm-2.

[00142] The solid substrate is conveniently divided up into sections. This is achieved by techniques such as photoetching, or by the application of hydrophobic inks, for example teflon-based inks (Cel-line, USA). Discrete positions, in which each different member of the array is located may have any convenient shape, e.g., circular, rectangular, elliptical, wedge-shaped, etc. [00143] Attachment of the nucleic acids to the substrate can be covalent or non-covalent, generally via a layer of molecules to which the nucleic acids bind. For example, the nucleic acid probes/primers can be labelled with biotin and the substrate coated with avidin and/or streptavidin. A convenient feature of using biotinylated probes/primers is that the efficiency of coupling to the solid substrate is determined easily. A chemical interface may be provided between the solid substrate e.g., in the case of glass, and the probes/primers. Examples of suitable chemical interfaces include hexaethylene glycol, polylysine. For example, polylysine can be chemically modified using standard procedures to introduce an affinity ligand. Other methods for attaching the probes/primers to the surface of a solid substrate include the use of coupling agents known in the art, e.g., as described in WO98/49557.

[00144] The high-throughput system of the present invention is designed to determine nucleotide occurrences of one SNP or a series of SNPs. The systems can determine nucleotide occurrences of an entire genome- wide high-density SNP map.

[00145] High-throughput systems for analyzing markers, especially SNPs, can include, for example, a platform such as the UF1T SNP-IT platform (Orchid Biosciences, Princeton, N.J., USA) MassArray™ system (Sequenom, San Diego, Calif., USA), the integrated SNP genotyping system (Illumina, San Diego, Calif., USA), TaqMan™ (ABI, Foster City, Calif, USA). Exemplary nucleic acid arrays are of the type described in WO 95/1 1995. WO 95/1 1995 also describes sub-arrays optimized for detection of a variant form of a pre-characterized polymorphism. Such a sub-array contains probes designed to be complementary to a second reference sequence, which is an allelic variant of the first reference sequence. The inclusion of a second group (or further groups) can be particularly useful for analyzing short sub-sequences of a primary reference sequence in which multiple mutations are expected to occur within a short distance commensurate with the length of the probes (e.g., two or more mutations within 9 to 21 bases). More preferably, the high throughput system comprises a SNP microarray such as those available from Affymetrix or described, for example, in US 6,468,743 or Hacia et al, Nature Genetics, 14: 441, 1996.

[00146] DNA arrays are typically read at the same time by charged coupled device (CCD) camera or confocal imaging system. Alternatively, the DNA array can be placed for detection in a suitable apparatus that can move in an x-y direction, such as a plate reader. In this way, the change in

characteristics for each discrete position are measured automatically by computer controlled movement of the array to place each discrete element in turn in line with the detection means. The detection means is capable of interrogating each position in the library array optically or electrically. Examples of suitable detection means include CCD cameras or confocal imaging systems. The system can further include a detection mechanism for detecting binding the series of oligonucleotides to the series of SNPs. Such detection mechanisms are known in the art. The high-throughput system of the present invention can include a reagent handling mechanism that can be used to apply a reagent, typically a liquid, to the solid support. The high-throughput system can also include a mechanism effective for moving a solid support and a detection mechanism.

[00147] Embodiments of the method can be used to assist calculating estimated breeding value or for marker assisted selective breeding. Embodiments of the method describes herein may be used to identify markers associated with desired traits (eg via the phenotype data) from which a breeding value or similar score can be estimated. That is through the use of statistical models phenotypic traits can be decomposed into an estimated breeding value (EBV, and a residual). The breeding value of an individual can then be estimated by genotyping the individual, either using genome wide genotyping (eg SNP chip) or a reduced set of specific markers of interest (for example known to be correlated with the trait or traits of interest) and estimating the correlation with desired traits. For example embodiments of the method described herein may be used to estimate a set of markers with significant effects (ie QTLs) which can them be used to build a regression or similar model which can be used to obtain a breeding value or score based on the how well the observed genotype data of the individual correlates with the significant markers in the model. Embodiments of the method may also be used for genomic prediction studies, including human disease studies for the inference of genetic architecture, the identification of causal mutations (QTL mapping), and prediction of disease risk.

[00148] Embodiments of a Bayesian Markov Chain Monte Carlo based statistical model, referred to as BayesR3 have been developed which in some embodiments are 18 times faster than previous versions of BayesR, whilst maintaining similar accuracy. This is achieved by sequentially processing SNPs (or other markers) in blocks and residual updating. Embodiments are efficient enough to allow them to used to perform large multi-trait analyses and are suitable for Australian genomic evaluations.

For example an embodiment of this method was applied to two large data sets of dairy cattle: a single trait analysis of milk, fat and protein yields using 97,000 animals with high density genotypes (633374 SNPs); and a multi-trait analysis of 20 principal components extracted from milk MIR recorded on 5000 cows. Both analyses find SNPs in common with high association probabilities with these traits. The method was twice as fast as GBLUP on the single trait analysis, and 14 times faster than BayesR on the multi-trait analysis.

[00149] Currently genomic breeding value tables are produced monthly using BLUP and the use of embodiments of the above method would allow multi-trait analysis to be performed and made available at similar frequency. Embodiments of the method also have much wider application, and can be used on human, animal or plant data, for a wide range of applications including genomic discovery, estimating genomic breeding value, making genomic breeding decisions, disease risk (ie by identifying genomic predictors of a disease), dosage estimation, wheat grain composition, etc. Embodiments of the method can be applied on a wide range of data, including full genome sequence data with rich and/or complex phenotypic data such as milk mass spectra data, blood samples, FACS, gene expression data etc. The method may also be used on other markers besides SNPs. The method can also be coupled with one-step algorithms for use in national genetic evaluations (akin to Fernando, R. L., et al. (2016). "Computational strategies for alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals." Genetics, Selection, Evolution : GSE 48(1): 96-96.), More generally the method could be applied to any regression problem where Bayesian MCMC based statistical models using Gibbs Sampler are appropriate, for example models similar to the model shown in equation 1 , such as a model ignoring polygenic effects (ie dropping Zv), or including other random effects.

[00150] Embodiments of the method can be applied to similar genomic prediction algorithms such as BayesA, BayesB, BayesC, or BayesR. It would further be apparent to the person of ordinary skill in the art that embodiments of the blocking approach described herein can be generally applied to any application where we are estimating model parameters (of a statistical model) using iterative regression. As would be apparent to the person of ordinary skill in the art, the blocking approach can be generally applied to application where we are for solving linear systems; particularly those of a Gauss-Seidel nature. The blocking approach developed is a method that reduces the speed of calculation to be in some sense independent of the number of records (R) being analysed. Thus it reduces the number of calculation from being proportional to M × R to M × b. where b is block size and M is the number of markers or regression coefficients. However, blocks still need to be entered and this incurs an overhead of B x R, where B is the number of blocks. In particular embodiments of the BayesR3 method described herein provide computational efficient estimation of regression co-efficients. The model parameters can then be used to estimate a breeding value, one or more QTLs or one or more genomic locations associated with a disease or trait.

[00151] The following references are each hereby incorporated by reference:

A. Benedet, P.N. Ho, R. Xiang, S. Bolormaa, M. De Marchi, M.E. Goddard, J.E. Pryce,‘The use of mid-infrared spectra to map genes affecting milk composition’, Journal of Dairy Science, 102(8): 7189-7203

Calus, M.P.L.“Right-hand-side updating for fast computing of genomic breeding values”, Genetics Selection Evolution 2014, 46:24

De Marchi, M., V. Toffanin, M. Cassandro, and M. Penasa. 2014. 'Invited review: Mid-infrared spectroscopy as phenotyping tool for milk traits', J Dairy Sci, 97: 1171-86.

Erbe, M., B. J. Hayes, L. K. Matukumalli, S. Goswami, P. J. Bowman, C. M. Reich, B. A.

Mason, and M. E. Goddard. 2012. 'Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels', J Dairy Sci, 95: 4114-29.

Kemper, K. E., P. J. Bowman, B. J. Hayes, P. M. Visscher, and M. E. Goddard. 2018. 'A multi- trait Bayesian method for mapping QTL and genomic prediction', Genet Sel Evol, 50: 10. Lee, S. H., and J. H. van der Werf. 2016. 'MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information', Bioinformatics, 32: 1420-2.

MacLeod, I. M., et al. (2016). "Exploiting biological priors and sequence variants enhances QTL discovery and genomic prediction of complex traits." Bmc Genomics 17(1): 1-21.

Meuwissen, T. H. E., B. J. Hayes, and M. E. Goddard. 2001. 'Prediction of total genetic value using genome -wide dense marker maps. ' Genetics 157:1819-1829.

Verbyla, K. L., B. J. Hayes, P. J. Bowman, and M. E. Goddard. 2009. 'Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy cattle. ' Genet. Res.(Camb.) 91:307-311.

Wang, T., Y. P. Chen, P. J. Bowman, M. E. Goddard, and B. J. Hayes. 2016. 'A hybrid expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model based genomic prediction and QTL mapping', BMC Genomics, 17: 744.

Yang, J., S.H. Lee, M.E. Goddard, and P.M. Visscher. 2011. 'GCTA: a tool for genome -wide complex trait analysis', American journal of human genetics, 88: 6.

[00152] The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. Embodiments of the method may be implemented in a computer program product. For example, such a computer program product may comprise a computer (or processor) readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by one or more processors to perform the operations described herein. For certain aspects, the computer program product may include packaging material. The computer program product may be an“App” available at an app store for download by a user.

[00153] Software modules, also known as computer programs, computer codes, or instructions, may contain a number a number of source code or object code segments or instructions, and may reside in any computer readable medium such as a RAM memory, flash memory, ROM memory, EPROM memory, registers, hard disk, a removable disk, a CD-ROM, a DVD-ROM, a Blu-ray disc, or any other form of computer readable medium. In some aspects the computer-readable media may comprise non- transitory computer-readable media (e.g., tangible media). In addition, for other aspects computer- readable media may comprise transitory computer- readable media (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media. In another aspect, the computer readable medium may be integral to the processor. The processor and the computer readable medium may reside in an ASIC or related device. The software codes may be stored in a memory unit and the processor may be configured to execute them. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art. [00154] Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein can be downloaded and/or otherwise obtained by a computing device. For example, such a device can be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a compact disc (CD) or floppy disk, etc.), such that a computing device can obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device can be utilized.

[00155] The methods disclosed herein comprise one or more steps or actions for achieving the described method. The method steps and/or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is specified, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.

[00156] The processing of signals may be performed directly in hardware, in a software module executed by a processor, or in a combination of the two. Those of skill in the art would understand that information and signals may be represented using any of a variety of technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof. Those of skill in the art would further appreciate that the various illustrative logical blocks, modules, processing and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software or instructions, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. For a hardware implementation, processing may be implemented within one or more processors, controllers, micro-controllers, microprocessors, application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), other electronic units designed to perform the functions described herein, or a combination thereof.

[00157] As used herein, the terms“processing” and“estimating” encompasses a wide variety of actions. For example,“processing” and“estimating” may include analysing, calculating, computing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also,“processing” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also,“processing” may include resolving, selecting, choosing, establishing and the like.

[00158] As used herein, a phrase referring to“at least one of’ a list of items refers to any combination of those items, including single members. As an example,“at least one of: a, b, or c” is intended to cover: a, b, c, a-b, a-c, b-c, and a-b-c.

[00159] Throughout the specification and the claims that follow, unless the context requires otherwise, the words“comprise” and“include” and variations such as“comprising” and“including” will be understood to imply the inclusion of a stated integer or group of integers, but not the exclusion of any other integer or group of integers.

[00160] The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement of any form of suggestion that such prior art forms part of the common general knowledge.

[00161] It will be appreciated by those skilled in the art that the disclosure is not restricted in its use to the particular application or applications described. Neither is the present disclosure restricted in its preferred embodiment with regard to the particular elements and/or features described or depicted herein. It will be appreciated that the disclosure is not limited to the embodiment or embodiments disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope as set forth and defined by the following claims.