Title:
CHROMOSOME BIOMARKER
Document Type and Number:
WIPO Patent Application WO/2020/084289
Kind Code:
A1
Abstract:
A process for analysing chromosome regions and interactions relating to physical performance.

Inventors:
HUNTER EWAN (GB)
RAMADASS AROUL (GB)
AKOULITCHEV ALEXANDRE (GB)
Application Number:
PCT/GB2019/052996
Publication Date:
April 30, 2020
Filing Date:
October 21, 2019
Assignee:
OXFORD BIODYNAMICS LTD (GB)
International Classes:
C12Q1/6876
Domestic Patent References:
WO2016207661A1, 2016-12-29
Other References:
"ENCYCLOPEDIA OF LIFE SCIENCES", 17 June 2010, JOHN WILEY & SONS, LTD, Chichester, ISBN: 978-0-470-01590-2, article BING YU ET AL: "Genetics of Athletic Performance", XP055558481, DOI: 10.1002/9780470015902.a0022400
N.C. CRAIG SHARP: "The Human Genome and Sport, Including Epigenetics, Gene Doping, and Athleticogenomics", ENDOCRINOLOGY AND METABOLISM CLINICS OF NORTH AMERICA., vol. 39, no. 1, 1 March 2010 (2010-03-01), PHILADELPHIA, pages 201 - 215, XP055558468, ISSN: 0889-8529, DOI: 10.1016/j.ecl.2009.10.010
JOÃO PAULO LIMONGI FRANÇA GUILHERME ET AL: "Genetics and sport performance: current challenges and directions to the future", REVISTA BRASILEIRA DE EDUCAÇÃO FÍSICA E ESPORTE, vol. 28, no. 1, 1 March 2014 (2014-03-01), pages 177 - 193, XP055558473, DOI: 10.1590/S1807-55092014000100177
ELAINE A. OSTRANDER ET AL: "Genetics of Athletic Performance", ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS, vol. 10, no. 1, 1 September 2009 (2009-09-01), US, pages 407 - 429, XP055558459, ISSN: 1527-8204, DOI: 10.1146/annurev-genom-082908-150058
DEVEREUX ET AL., NUCLEIC ACIDS RESEARCH, vol. 12, 1984, pages 387 - 395
ALTSCHUL S. F., J MOL EVOL, vol. 36, 1993, pages 290 - 300
ALTSCHUL, S, F ET AL., J MOL BIOL, vol. 215, 1990, pages 403 - 10
HENIKOFF; HENIKOFF, PROC. NATL. ACAD. SCI. USA, vol. 89, 1992, pages 10915 - 10919
KARLIN; ALTSCHUL, PROC. NATL. ACAD. SCI. USA, vol. 90, 1993, pages 5873 - 5877
Attorney, Agent or Firm:
AVIDITY IP (GB)
Claims:
CLAIMS

1. A process for detecting a chromosome state which represents a subgroup in a population comprising determining whether a chromosome interaction relating to that chromosome state is present or absent within a defined region of the genome, wherein said subgroup relates to physical performance in an individual; and

- wherein said chromosome interaction has optionally been identified by a method of determining which chromosomal interactions are relevant to a chromosome state corresponding to a physical performance subgroup of the population, comprising contacting a first set of nucleic acids from subgroups with different states of the chromosome with a second set of index nucleic acids, and allowing complementary sequences to hybridise, wherein the nucleic acids in the first and second sets of nucleic acids represent a ligated product comprising sequences from both the chromosome regions that have come together in chromosomal interactions, and wherein the pattern of hybridisation between the first and second set of nucleic acids allows a determination of which chromosomal interactions are specific to a physical performance subgroup; and

- wherein the chromosome interaction either:

(i) corresponds to any one of the chromosome interactions shown in any of Tables 3, 7, 8, 9, 25 and 30; and/or

(ii) corresponds to any one of the chromosome interactions shown in any of Tables 13, 14, 18, 22, 23 and 24; and/or

(iii) corresponds to any one of the chromosome interactions shown in Table 31 or 32; and/or

(iv) is present in a 4,000 base region which comprises or which flanks (i), (ii) or (iii); and/or

(v) is present in any one of the regions or genes listed in Table 21, 24, 25, 30, 31 or 32.

2. A process according to claim 1 wherein:

- the individual is a human and the subgroup is a human subgroup, or

- the individual is a horse and the subgroup is a horse subgroup, and

wherein optionally:

(i) the process is carried out to determine physical performance ability, and/or

(ii) the process is carried out to detect responsiveness to a stimulus relating to physical performance, which is preferably physical training, and optionally strength or endurance training; and/or

(iii) the process is carried out to select an individual suitable for a physical activity, which is preferably a sport; and/or

(iv) the process is carried out to select a stimulus relating to physical performance to give to the individual, wherein said stimulus is a type of physical training.

3. A process according to claim 1 or 2 wherein a specific combination of chromosome interactions are typed:

(i) comprising all of the chromosome interactions represented in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23; and/or

(ii) comprising at least 10%, 20%, 50%, or 80% of the chromosome interactions in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23; and/or

(iii) which together are present in at least 10, 50 or 100 of the regions or genes listed in any of Tables 21, 24, 25 or 30; and/or

(iv) wherein at least 10, 50, 100, 150, 200 or 300 chromosome interactions are typed which are present in a 4,000 base region which comprises or which flanks the chromosome interactions represented in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23.

4. A process according to any one of the preceding claims wherein a specific combination of chromosome interactions are typed:

(i) comprising all of the chromosome interactions represented in Table 31 or 32; and/or

(ii) comprising at least 10%, 20%, 50%, or 80% of the chromosome interactions in Table 31 or 32; and/or

(iii) which together are present in at least 5 of the regions or genes listed in Table 31 or 32; and/or

(iv) wherein at least 5 chromosome interactions are typed which are present in a 4,000 base region which comprises or which flanks the chromosome interactions represented in Table 31 or 32.

5. A process according to any one of the preceding claims in which the chromosome interactions are typed:

- in a sample from an individual, and/or

- by detecting the presence or absence of a DNA loop at the site of the chromosome interactions, and/or

- detecting the presence or absence of distal regions of a chromosome being brought together in a chromosome conformation, and/or

- by detecting the presence of a ligated nucleic acid which is generated during said typing and whose sequence comprises two regions each corresponding to the regions of the chromosome which come together in the chromosome interaction, wherein detection of the ligated nucleic acid is preferably by using either:

(i) a probe that has at least 70% identity to any of the specific probe sequences mentioned in Table 24, 25, 30, 31 or 32 and/or

(ii) by a primer pair which has at least 70% identity to any primer pair in Table 24, 25, 30, 31 or 32.

6. A process according to any one of the preceding claims, wherein:

- the second set of nucleic acids is from a larger group of individuals than the first set of nucleic acids; and/or

- the first set of nucleic acids is from at least 8 individuals; and/or

- the first set of nucleic acids is from at least 4 individuals from a first subgroup and at least 4 individuals from a second subgroup which is preferably non-overlapping with the first subgroup.

7. A process according to any one of the preceding claims wherein:

- the second set of nucleic acids represents an unselected group; and/or

- wherein the second set of nucleic acids is bound to an array at defined locations; and/or

- wherein the second set of nucleic acids represents chromosome interactions in at least 100 different genes; and/or

- wherein the second set of nucleic acids comprises at least 1,000 different nucleic acids representing at least 1,000 different chromosome interactions; and/or

- wherein the first set of nucleic acids and the second set of nucleic acids comprise at least 100 nucleic acids with length 10 to 100 nucleotide bases.

8. A process according to any one of the preceding claims, wherein the first set of nucleic acids is obtainable in a process comprising the steps of: -

(i) cross-linking of chromosome regions which have come together in a chromosome interaction;

(ii) subjecting said cross-linked regions to cleavage, optionally by restriction digestion cleavage with an enzyme; and

(iii) ligating said cross-linked cleaved DNA ends to form the first set of nucleic acids (in particular comprising ligated DNA).

9. A process according to any one of the preceding claims:

- wherein at least 10 to 50 different chromosome interactions are typed, preferably in 10 to 50 different regions or genes optionally as defined in Table 21, 24, 25, 30, 31 or 32; and/or

- which is:

(i) carried out on a human or horse athlete; and/or

(ii) carried out as part of a training regime, preferably after the start of the training regime; and/or

(iii) carried out on a Thoroughbred horse, preferably a racing horse; or

(iv) carried out on a human individual who is less than 20 years old or is carried out on a horse that is less than 18 months old, and/or

(v) which is carried out at multiple time points to assess physical performance characteristics at specific time points, wherein the process is optionally carried out at at least 3 time points, which are preferably at least 30 days apart from each other.

10. A process according to any one of the preceding claims wherein said defined region of the genome:

(i) comprises a single nucleotide polymorphism (SNP); and/or

(ii) expresses a microRNA (miRNA); and/or

(iii) expresses a non-coding RNA (ncRNA); and/or

(iv) expresses a nucleic acid sequence encoding at least 10 contiguous amino acid residues; and/or

(v) expresses a regulating element; and/or

(vi) comprises a CTCF binding site.

11. A process according to any one of the preceding claims:

- which is carried out to identify an individual that is suited to endurance training, and preferably the identified individual is then subject to endurance training, which optionally occurs on at least 100 days out of the next 365 days after the identification; or

- which is carried out to identify an individual that is suited to strength training, and preferably the identified individual is then subject to strength training, which optionally occurs on at least 100 days out of the next 365 days after the identification.

12. A process according to any one of the preceding claims which is carried out to select the individual for racing.

13. A process according to any one of the preceding claims wherein a specific combination of chromosome interactions are typed:

(i) comprising all of the chromosome interactions represented in any of Tables 33, 34, 35, 36, 37, 38, 39, 40 or 41, or in any of Figures 16, 17 or 18; and/or

(ii) comprising at least 10%, 20%, 50%, or 80% of the chromosome interactions in any of Tables 33, 34, 35, 36, 37, 38, 39, 40 or 41, or in any of Figures 16, 17 or 18; and/or

(iii) which together are present in at least 10 of the regions or genes listed in any of Tables 33, 34, 35, 36, 37, 38, 39, 40 or 41, or in any of Figures 16, 17 or 18; and/or

(iv) wherein at least 10 chromosome interactions are typed which are present in a 4,000 base region which comprises or which flanks the chromosome interactions represented in any of Tables 33, 34, 35, 36, 37, 38, 39, 40 or 41, or in any of Figures 16, 17 or 18.

14. A process according to any one of the preceding claims which is carried out to identify or design an agent that affects physical performance, wherein said process is used to detect whether a candidate agent is able to cause a change to a chromosome state which is associated with a different physical performance state; wherein

- the chromosomal interaction is any specific interaction or combination of interactions defined in any preceding claim and/or is present in any one of the regions or genes listed in Table 21, 24, 25, 30, 31 or 32; and/or

- the change in chromosomal interaction is monitored using (i) a probe that has at least 70% identity to any of the specific probe sequences mentioned in Table 24, 25, 30, 31 or 32 and/or (ii) by a primer pair which has at least 70% identity to any primer pair in Table 24, 25, 30, 31 or 32.

15. A process according to claim 14 which comprises selecting a target based on detection of chromosome interactions, and preferably screening for a modulator of the target to identify an agent which affects physical performance, wherein said target is optionally a protein.

16. A process according to any one of the preceding claims, wherein the typing or detecting comprises specific detection of the ligated product by quantitative PCR (qPCR) which uses primers capable of amplifying the ligated product and a probe which binds the ligation site during the PCR reaction, wherein said probe comprises sequence which is complementary to sequence from each of the chromosome regions that have come together in the chromosome interaction, wherein preferably said probe comprises:

an oligonucleotide which specifically binds to said ligated product, and/or

a fluorophore covalently attached to the 5' end of the oligonucleotide, and/or

a quencher covalently attached to the 3' end of the oligonucleotide, and

optionally

said fluorophore is selected from HEX, Texas Red and FAM; and/or

said probe comprises a nucleic acid sequence of length 10 to 40 nucleotide bases, preferably a length of 20 to 30 nucleotide bases.

17. A process according to any one of the preceding claims which further comprises:

- producing a report on the physical performance characteristics of the individual based on the results of the process, or

- inputting the results of the process into a database, or

- assigning a specific fitness or training regime to the individual based on the results of the process, or

- designing a specific fitness or training regime for the individual based on the results of the process.

Description:
Chromosome Biomarker

Field of the Invention

The invention relates to detecting chromosome interactions.

Background of the Invention

Physical performance is complex and cannot be predicted using available methods. It is clear that coordination, flexibility, precision, power, speed, endurance, balance, awareness, efficiency, and timing are relevant to performance.

Summary of the Invention

The inventors have identified chromosomal interactions relevant to physical performance using an approach which analyses subgroups in a population. The inventors' work allows physical performance to be typed and modulated in an entirely new way which is more sensitive and personalised than genomic or protein typing, and which reflects the personal history of the individual. This has applications in fitness and physical training regimes, as well as in sports medicine.

Accordingly, the invention provides a process for detecting a chromosome state which represents a subgroup in a population comprising determining whether a chromosome interaction relating to that chromosome state is present or absent within a defined region of the genome, wherein said subgroup relates to physical performance in an individual; and

- wherein said chromosome interaction has optionally been identified by a method of determining which chromosomal interactions are relevant to a chromosome state corresponding to a physical performance subgroup of the population, comprising contacting a first set of nucleic acids from subgroups with different states of the chromosome with a second set of index nucleic acids, and allowing complementary sequences to hybridise, wherein the nucleic acids in the first and second sets of nucleic acids represent a ligated product comprising sequences from both the chromosome regions that have come together in chromosomal interactions, and wherein the pattern of hybridisation between the first and second set of nucleic acids allows a determination of which chromosomal interactions are specific to a physical performance subgroup; and

- wherein the chromosome interaction either:

(i) corresponds to any one of the chromosome interactions shown in any of Tables 3, 7, 8, 9, 25 and 30; and/or

(ii) corresponds to any one of the chromosome interactions shown in any of Tables 13, 14, 18, 22, 23 and 24; and/or

(iii) corresponds to any one of the chromosome interactions shown in Table 31 or 32; and/or

(iv) is present in a 4,000 base region which comprises or which flanks (i), (ii) or (iii); and/or

(v) is present in any one of the regions or genes listed in Table 21, 24, 25, 30, 31 or 32.

In a preferred embodiment, the invention provides a process for detecting a chromosome state which represents a subgroup in a population comprising determining whether a chromosome interaction relating to that chromosome state is present or absent within a defined region of the genome, wherein said subgroup relates to physical performance in an individual; and wherein the chromosome interaction either:

(a) corresponds to any one of the chromosome interactions shown in any of Tables 33, 34, 35, 36, 37, 38, 39, 40 or 41, or in any of Figures 16, 17 or 18; and/or

(b) is present in a 4,000 base region which comprises or which flanks (a).
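The flanking criterion in (b) can be illustrated with a short coordinate check. This is only a sketch under one possible reading, namely that a position qualifies if it lies inside the marker interval or within 4,000 bases on either side of it. The function name and the coordinates used in the usage note are hypothetical and do not come from the tables.

```python
FLANK_BASES = 4_000  # region size stated in the text


def in_or_flanking(marker_start, marker_end, position, flank=FLANK_BASES):
    """Return True if `position` lies inside the marker interval or within
    `flank` bases on either side of it (1-based, inclusive coordinates).

    Assumption: "comprises or flanks" is read as the marker interval
    extended by `flank` bases in both directions.
    """
    if marker_start > marker_end:
        raise ValueError("marker_start must not exceed marker_end")
    return marker_start - flank <= position <= marker_end + flank
```

For a hypothetical marker spanning positions 10,000 to 10,500, `in_or_flanking(10_000, 10_500, 14_500)` is true, while position 14,501 falls just outside the extended region.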

Detailed Description of the Invention

Aspects of the Invention

The invention concerns a panel of epigenetic markers which relate to the regulation and stratification of physical performance, in particular strength and endurance. The markers are preferably stable and may allow detection of a predisposition in physiology to a specific stimulus, such as physical training.

The invention also includes monitoring of physical performance or responsiveness of physical performance to a specific stimulus, such as a specific training regimen. The invention therefore provides in one aspect a 'live' ongoing readout of physical performance status allowing a personalised stimulus to be given to the individual which reflects the individual's needs.

The invention provides a method of selecting an individual for a given physical task, such as racing or specific training. The invention also provides a method of selecting or designing a training or fitness regime, for example for a specific individual.

The invention also provides methods of predicting physical performance, including whether an individual would have high strength or endurance or both, for example as measured in any specific way described herein.

The invention allows categorisation of individuals into 'fit', 'strong', or 'sedentary'. The invention allows stratification of 'baseline' individuals entering training programmes for better predisposition for either strength, or endurance - de facto predictive biomarkers for response. The invention also allows stratification of early biomarker evidence of response to training in individuals - de facto early response biomarkers for monitoring training progress.

Any marker disclosed herein may be used in the method of the invention, including any marker disclosed in any table or Figure. Preferred markers are shown in:

- Tables 3, 7, 8, 9, 25 and 30; and

- Tables 13, 14, 18, 22, 23 and 24; and

- Table 31 or 32; and

- Tables 33 to 41; and

- Figures 16 to 18.

Physical Performance

The invention relates to determining physical performance. The process of the invention may detect responsiveness to a stimulus relating to physical performance, for example to training, such as strength or endurance training. The training typically comprises subjecting the individual to physical exertion, for example in terms of applying force, moving physically, running or carrying out a specific physical activity (for example as disclosed herein) over a certain time period. The process of the invention may be used to detect a high or low response to the stimulus, such as a high or low response to any specific training or physical activity disclosed herein. Preferably the invention detects the response to strength or endurance training.

The invention can be used to select an individual suitable for a physical activity, such as a sport. Preferred sports include strength or endurance sports. A preferred endurance sport is racing or running. Thus the invention may be used to select an individual that is suited to any particular activity mentioned herein, such as training.

The Process of the Invention

The process of the invention comprises a typing system for detecting chromosome interactions relevant to physical performance. This typing may be performed using the EpiSwitch™ system mentioned herein which is based on cross-linking regions of chromosome which have come together in the chromosome interaction, subjecting the chromosomal DNA to cleavage and then ligating the nucleic acids present in the cross-linked entity to derive a ligated nucleic acid with sequence from both the regions which formed the chromosomal interaction. Detection of this ligated nucleic acid allows determination of the presence or absence of a particular chromosome interaction.

The chromosomal interactions may be identified using the above described method in which populations of first and second nucleic acids are used. These nucleic acids can also be generated using EpiSwitch™ technology.

The Epigenetic Interactions Relevant to the Invention

As used herein, the terms 'epigenetic' and 'chromosome' interactions typically refer to interactions between distal regions of a chromosome, said interactions being dynamic and altering, forming or breaking depending upon the status of the region of the chromosome.

In particular processes of the invention chromosome interactions are typically detected by first generating a ligated nucleic acid that comprises sequence from both regions of the chromosomes that are part of the interactions. In such processes the regions can be cross-linked by any suitable means. In a preferred embodiment, the interactions are cross-linked using formaldehyde, but may also be cross-linked by any aldehyde, or D-Biotinoyl-ε-aminocaproic acid-N-hydroxysuccinimide ester or Digoxigenin-3-O-methylcarbonyl-ε-aminocaproic acid-N-hydroxysuccinimide ester. Para-formaldehyde can cross link DNA chains which are 4 Angstroms apart. Preferably the chromosome interactions are on the same chromosome and optionally 2 to 10 Angstroms apart.

The chromosome interaction may reflect the status of the region of the chromosome, for example, if it is being transcribed or repressed. Chromosome interactions which are specific to subgroups as defined herein have been found to be stable, thus providing a reliable means of measuring the differences between the two subgroups.

In addition, chromosome interactions specific to a characteristic (such as physical performance) will normally occur early in a biological process, for example compared to other epigenetic markers such as methylation or changes to binding of histone proteins, and are also capable of providing a 'live' readout of the status of the individual. Thus the process of the invention is able to detect early stages of a biological process as well as allowing continuous monitoring. Chromosome interactions also reflect the current state of the individual and therefore can be used to assess changes to physical performance. Furthermore there is little variation in the relevant chromosome interactions between individuals within the same subgroup. Detecting chromosome interactions is highly informative with up to 50 different possible interactions per gene, and so processes of the invention can interrogate 500,000 different interactions.

Preferred Marker Sets

Herein the term 'marker' or 'biomarker' refers to a specific chromosome interaction which can be detected (typed) in the invention. Specific markers are disclosed herein, any of which may be used in the invention or any of which may be used in any combination with other specific markers or combinations disclosed herein. Preferably sets of markers may be used, for example in the combinations or numbers disclosed herein. The specific markers disclosed in the tables herein are preferred, as are markers present in genes and regions mentioned in the tables herein. These may be typed by any suitable method, for example the PCR or probe based methods disclosed herein, including a qPCR method. The markers are defined herein by location or by probe and/or primer sequences.

Location and Causes of Epigenetic Interactions

Epigenetic chromosomal interactions may overlap and include the regions of chromosomes shown to encode relevant or undescribed genes, but equally may be in intergenic regions. The chromosome interactions which are detected in the invention could be caused by changes to the underlying DNA sequence, by environmental factors, DNA methylation, non-coding antisense RNA transcripts, non-mutagenic carcinogens, histone modifications, chromatin remodelling and specific local DNA interactions. The changes which lead to the chromosome interactions may be caused by changes to the underlying nucleic acid sequence, which themselves do not directly affect a gene product or the mode of gene expression. Such changes may be for example, SNPs within and/or outside of the genes, gene fusions and/or deletions of intergenic DNA, microRNA, and non-coding RNA. For example, it is known that roughly 20% of SNPs are in non-coding regions, and therefore the process as described is also informative in non-coding situations. In one embodiment the regions of the chromosome which come together to form the interaction are less than 5 kb, 3 kb, 1 kb, 500 base pairs or 200 base pairs apart on the same chromosome.

The chromosome interaction which is detected is preferably within any of the genes mentioned in Table 21. However it may also be upstream, for example up to 50,000, up to 30,000, up to 20,000, up to 10,000 or up to 5000 bases upstream from the gene. It may be downstream, for example up to 50,000, up to 30,000, up to 20,000, up to 10,000 or up to 5000 bases downstream from the gene. It may be upstream or downstream from coding sequences, for example by any of these specific numbers of bases.
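The upstream and downstream windows described above can be expressed as a simple coordinate test. The sketch below assumes 1-based inclusive gene coordinates and a known strand; the function name, return format and example coordinates are illustrative assumptions and are not taken from Table 21.

```python
def locate_relative_to_gene(gene_start, gene_end, strand, position,
                            max_distance=50_000):
    """Classify `position` relative to a gene.

    Returns ("within", 0), ("upstream", distance) or ("downstream", distance),
    or None if the position is more than `max_distance` bases from the gene.
    Upstream/downstream depend on the strand the gene is transcribed from.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if gene_start <= position <= gene_end:
        return ("within", 0)
    if position < gene_start:
        distance = gene_start - position
        side = "upstream" if strand == "+" else "downstream"
    else:
        distance = position - gene_end
        side = "downstream" if strand == "+" else "upstream"
    if distance > max_distance:
        return None
    return (side, distance)
```

The `max_distance` default corresponds to the widest window mentioned in the text (50,000 bases); the narrower limits (30,000, 20,000, 10,000 or 5,000 bases) can be passed explicitly.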

Subgroups, Time Points and Personalisation

One aim of the present invention is to determine a characteristic relevant to physical performance. This may be at one or more defined time points for the same individual, for example at at least 1, 2, 5, 8 or 10 different time points. The durations between the time points may be at least 20, 50, 80 or 100 days. Typically testing of the individual (by the process of the invention) may occur before a physical stimulus is applied, or during or after. The testing may determine the predisposition to certain types of response, or the actual response to the stimulus.
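As a minimal sketch of the time-point requirements just described, the following function checks that a set of sampling dates contains enough time points and that consecutive time points are sufficiently far apart. The function name and default values are assumptions chosen from the ranges given above (at least 2 time points, at least 20 days apart), not a prescribed schedule.

```python
from datetime import date


def timepoints_ok(sample_dates, min_points=2, min_gap_days=20):
    """Return True if there are at least `min_points` sampling dates and
    every pair of consecutive dates is at least `min_gap_days` apart."""
    ordered = sorted(sample_dates)
    if len(ordered) < min_points:
        return False
    return all(
        (later - earlier).days >= min_gap_days
        for earlier, later in zip(ordered, ordered[1:])
    )


# Hypothetical sampling schedule: gaps of 24 and 36 days.
schedule = [date(2020, 1, 1), date(2020, 1, 25), date(2020, 3, 1)]
```

With this hypothetical schedule, `timepoints_ok(schedule)` is true for a 20 day minimum gap, and false if a 50 day minimum gap is requested.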

As used herein, a "subgroup" preferably refers to a population subgroup (a subgroup in a population), more preferably a subgroup in a human or horse population. The invention includes detecting and applying a physical stimulus to particular subgroups in a population. The inventors have discovered that chromosome interactions differ between subsets (for example at least two subsets) in a given population. Identifying these differences will allow categorisation of individuals and this allows personalised stimuli to be given, such as personalised training, or allows selection of the individual for particular physical activities.

Generating Ligated Nucleic Acids

Certain embodiments of the invention utilise ligated nucleic acids, in particular ligated DNA. These comprise sequences from both of the regions that come together in a chromosome interaction and therefore provide information about the interaction. The EpiSwitch™ method described herein uses generation of such ligated nucleic acids to detect chromosome interactions.

Thus a process of the invention may comprise a step of generating ligated nucleic acids (e.g. DNA) by the following steps (including a method comprising these steps):

(i) cross-linking of epigenetic chromosomal interactions present at the chromosomal locus, preferably in vitro;

(ii) optionally isolating the cross-linked DNA from said chromosomal locus;

(iii) subjecting said cross-linked DNA to cutting, for example by restriction digestion with an enzyme that cuts it at least once (in particular an enzyme that cuts at least once within said chromosomal locus);

(iv) ligating said cross-linked cleaved DNA ends (in particular to form DNA loops); and

(v) optionally identifying the presence of said ligated DNA and/or said DNA loops, in particular using techniques such as PCR (polymerase chain reaction), to identify the presence of a specific chromosomal interaction.

These steps may be carried out to detect the chromosome interactions for any embodiment mentioned herein. The steps may also be carried out to generate the first and/or second set of nucleic acids mentioned herein.
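The cutting and ligation steps (iii) and (iv) can be pictured with a toy string model. This is an in silico illustration only, not the laboratory protocol: it assumes single-stranded sequence strings, uses the TaqI recognition sequence TCGA (cut after the first base, T^CGA), and the two region sequences are invented for the example.

```python
TAQI_SITE = "TCGA"    # TaqI recognition sequence
TAQI_CUT_OFFSET = 1   # TaqI cuts after the first base: T^CGA


def digest(sequence, site=TAQI_SITE, cut_offset=TAQI_CUT_OFFSET):
    """Cut a sequence string at every occurrence of `site` and return
    the resulting fragments in order."""
    fragments = []
    start = 0
    index = sequence.find(site)
    while index != -1:
        cut = index + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        index = sequence.find(site, index + 1)
    fragments.append(sequence[start:])
    return fragments


def ligate(left_fragment, right_fragment):
    """Join two fragments end to end, modelling ligation of cut ends."""
    return left_fragment + right_fragment


# Hypothetical sequences for two chromosome regions held together by cross-linking.
region_a = "AAGGCCTTTCGAGGAATT"
region_b = "CCTTGGAATCGATTGGCC"

# The ligated product carries sequence from both regions across the junction.
ligated_product = ligate(digest(region_a)[0], digest(region_b)[1])
```

In this toy model the ligated product contains sequence from region A on one side of the restored TCGA site and sequence from region B on the other, which is the property that a junction-spanning primer pair or probe would detect.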

PCR (polymerase chain reaction) may be used to detect or identify the ligated nucleic acid, for example the size of the PCR product produced may be indicative of the specific chromosome interaction which is present, and may therefore be used to identify the status of the locus. In preferred embodiments at least 1, 2 or 3 primers or primer pairs as shown in Table 24, 25 or 30 are used in the PCR reaction. In other preferred embodiments at least 1, 2 or 3 primers or primer pairs as shown in Table 31 or 32 are used in the PCR reaction. The skilled person will be aware of numerous restriction enzymes which can be used to cut the DNA within the chromosomal locus of interest. It will be apparent that the particular enzyme used will depend upon the locus studied and the sequence of the DNA located therein. A non-limiting example of a restriction enzyme which can be used to cut the DNA as described in the present invention is TaqI.

Embodiments such as EpiSwitch™ Technology

The EpiSwitch™ Technology also relates to the use of microarray EpiSwitch™ marker data in the detection of epigenetic chromosome conformation signatures specific for phenotypes. Embodiments such as EpiSwitch™ which utilise ligated nucleic acids in the manner described herein have several advantages. They have a low level of stochastic noise, for example because the nucleic acid sequences from the first set of nucleic acids of the present invention either hybridise or fail to hybridise with the second set of nucleic acids. This provides a binary result permitting a relatively simple way to measure a complex mechanism at the epigenetic level. EpiSwitch™ technology also has fast processing time and low cost. In one embodiment the processing time is 3 hours to 6 hours.
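Because each marker gives a binary result (hybridised or not), subgroup-specific markers can in principle be found with simple set operations. The sketch below assumes a strict criterion, present in every individual of one subgroup and absent from every individual of the other; that criterion, the function name and the marker identifiers are assumptions for illustration, and the text does not state which statistical method is used.

```python
def subgroup_specific(calls_a, calls_b):
    """Return markers scored present in every individual of subgroup A
    and in no individual of subgroup B.

    calls_a, calls_b: lists of sets, one set of marker IDs per individual,
    containing the markers scored as present for that individual.
    """
    if not calls_a or not calls_b:
        raise ValueError("both subgroups need at least one individual")
    present_in_all_a = set.intersection(*calls_a)
    present_in_any_b = set.union(*calls_b)
    return present_in_all_a - present_in_any_b


# Hypothetical binary calls for two individuals per subgroup.
endurance = [{"m1", "m2"}, {"m1", "m2", "m3"}]
sedentary = [{"m2"}, {"m4"}]
```

Here `subgroup_specific(endurance, sedentary)` yields `{"m1"}`: marker m2 is excluded because it also appears in the second subgroup, and m3 because it is not present in every individual of the first.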

Samples and Sample Treatment

The process of the invention will normally be carried out on a sample. The sample may be obtained at a defined time point, for example at any time point defined herein. The sample will normally contain DNA from the individual. It will normally contain cells. In one embodiment a sample is obtained by minimally invasive means and may for example be a blood sample. DNA may be extracted and cut up with a standard restriction enzyme. This can pre-determine which chromosome conformations are retained and will be detected with the EpiSwitch™ platforms. Due to the synchronisation of chromosome interactions between tissues and blood, including horizontal transfer, a blood sample can be used to detect the chromosome interactions in tissues.

Properties of Nucleic Acids of the Invention

The invention relates to certain nucleic acids, such as the ligated nucleic acids which are described herein as being used or generated in the process of the invention. These may be the same as, or have any of the properties of, the first and second nucleic acids mentioned herein. The nucleic acids of the invention typically comprise two portions each comprising sequence from one of the two regions of the chromosome which come together in the chromosome interaction. Typically each portion is at least 8, 10, 15, 20, 30 or 40 nucleotides in length, for example 10 to 40 nucleotides in length. Preferred nucleic acids comprise sequence from any of the genes mentioned in any of the tables, including Table 21. Typically preferred nucleic acids comprise the specific probe sequences mentioned in Table 24, 25 or 30; or fragments and/or homologues of such sequences. Preferred nucleic acids also comprise the specific probe sequences mentioned in Table 31 or 32; or fragments and/or homologues of such sequences. Preferably the nucleic acids are DNA. It is understood that where a specific sequence is provided the invention may use the complementary sequence as required in the particular embodiment. The primers shown in Table 24, 25 or 30 may also be used in the invention as mentioned herein. In one embodiment primers are used which comprise any of: the sequences shown in Table 24, 25 or 30; or fragments and/or homologues of any sequence shown in Table 24, 25 or 30.

The primers shown in Table 31 or 32 may also be used in the invention as mentioned herein. In one embodiment primers are used which comprise any of: the sequences shown in Table 31 or 32; or fragments and/or homologues of any sequence shown in Table 31 or 32.

The Second Set of Nucleic Acids - the 'Index' Sequences

The second set of nucleic acid sequences has the function of being a set of index sequences, and is essentially a set of nucleic acid sequences which are suitable for identifying subgroup-specific sequence. They can represent the 'background' chromosomal interactions and might be selected in some way or be unselected. They are in general a subset of all possible chromosomal interactions.

The second set of nucleic acids may be derived by any suitable process. They can be derived computationally or they may be based on chromosome interaction in individuals. They typically represent a larger population group than the first set of nucleic acids. In one particular embodiment, the second set of nucleic acids represents all possible epigenetic chromosomal interactions in a specific set of genes. In another particular embodiment, the second set of nucleic acids represents a large proportion of all possible epigenetic chromosomal interactions present in a population described herein. In one particular embodiment, the second set of nucleic acids represents at least 50% or at least 80% of epigenetic chromosomal interactions in at least 20, 50, 100 or 500 genes, for example in 20 to 100 or 50 to 500 genes.

The second set of nucleic acids typically represents at least 100 possible epigenetic chromosome interactions which modify, regulate or in any way mediate a phenotype in a population. The second set of nucleic acids may represent chromosome interactions that affect a physical characteristic. The second set of nucleic acids typically comprises sequences representing epigenetic interactions both relevant and not relevant to a physical performance subgroup.

In one particular embodiment the second set of nucleic acids derive at least partially from naturally occurring sequences in a population, and are typically obtained by in silico processes. Said nucleic acids may further comprise single or multiple mutations in comparison to a corresponding portion of nucleic acids present in the naturally occurring nucleic acids. Mutations include deletions, substitutions and/or additions of one or more nucleotide base pairs. In one particular embodiment, the second set of nucleic acids may comprise sequence representing a homologue and/or orthologue with at least 70% sequence identity to the corresponding portion of nucleic acids present in the naturally occurring species. In another particular embodiment, at least 80% sequence identity or at least 90% sequence identity to the corresponding portion of nucleic acids present in the naturally occurring species is provided.

Properties of the Second Set of Nucleic Acids

In one particular embodiment, there are at least 100 different nucleic acid sequences in the second set of nucleic acids, preferably at least 1000, 2000 or 5000 different nucleic acid sequences, with up to 100,000, 1,000,000 or 10,000,000 different nucleic acid sequences. A typical number would be 100 to 1,000,000, such as 1,000 to 100,000 different nucleic acid sequences. All or at least 90% or at least 50% of these would correspond to different chromosomal interactions.

In one particular embodiment, the second set of nucleic acids represents chromosome interactions in at least 20 different loci or genes, preferably at least 40 different loci or genes, and more preferably at least 100, at least 500, at least 1000 or at least 5000 different loci or genes, such as 100 to 10,000 different loci or genes. The lengths of the second set of nucleic acids are suitable for them to specifically hybridise according to Watson-Crick base pairing to the first set of nucleic acids to allow identification of chromosome interactions specific to subgroups. Typically the second set of nucleic acids will comprise two portions corresponding in sequence to the two chromosome regions which come together in the chromosome interaction. The second set of nucleic acids typically comprise nucleic acid sequences which are at least 10, preferably 20, and preferably still 30 bases (nucleotides) in length. In another embodiment, the nucleic acid sequences may be at the most 500, preferably at most 100, and preferably still at most 50 base pairs in length. In a preferred embodiment, the second set of nucleic acids comprises nucleic acid sequences of between 17 and 25 base pairs. In one embodiment at least 100%, 80% or 50% of the second set of nucleic acid sequences have lengths as described above. Preferably the different nucleic acids do not have any overlapping sequences, for example at least 100%, 90%, 80% or 50% of the nucleic acids do not have the same sequence over at least 5 contiguous nucleotides.
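The non-overlap preference above can be checked computationally. The following is a minimal sketch only, assuming the sequences are available as plain strings and taking a window of 5 contiguous nucleotides; the function names and the example sequences are hypothetical and are not taken from any table herein.

```python
from itertools import combinations


def kmers(seq, k=5):
    """Return the set of all contiguous k-nucleotide substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_non_overlapping(seqs, k=5):
    """Fraction of sequences sharing no k-nucleotide stretch with any other."""
    if not seqs:
        return 1.0
    kmer_sets = [kmers(s, k) for s in seqs]
    overlapping = set()
    # Compare every pair once; record both members of any overlapping pair.
    for (i, a), (j, b) in combinations(enumerate(kmer_sets), 2):
        if a & b:
            overlapping.update((i, j))
    return 1 - len(overlapping) / len(seqs)


# Hypothetical example sequences: the first and third share "ACGTA".
example = ["ACGTACGTAA", "TTTTTGGGGG", "CCACGTACCC"]
print(fraction_non_overlapping(example))
```

The returned fraction can then be compared against the 100%, 90%, 80% or 50% levels mentioned above.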

Given that the second set of nucleic acids acts as an 'index' then the same set of second nucleic acids may be used with different sets of first nucleic acids which represent subgroups for different characteristics, i.e. the second set of nucleic acids may represent a 'universal' collection of nucleic acids which can be used to identify chromosome interactions relevant to different characteristics.

The First Set of Nucleic Acids

The first set of nucleic acids are typically from subgroups relevant to physical performance. The first nucleic acids may have any of the characteristics and properties of the second set of nucleic acids mentioned herein. The first set of nucleic acids is normally derived from samples from the individuals which have undergone treatment and processing as described herein, particularly the EpiSwitch™ cross-linking and cleaving steps. Typically the first set of nucleic acids represents all or at least 80% or 50% of the chromosome interactions present in the samples taken from the individuals.

Typically, the first set of nucleic acids represents a smaller population of chromosome interactions across the loci or genes represented by the second set of nucleic acids in comparison to the chromosome interactions represented by the second set of nucleic acids, i.e. the second set of nucleic acids represents a background or index set of interactions in a defined set of loci or genes.

Library of Nucleic Acids

Any of the types of nucleic acid populations mentioned herein may be present in the form of a library comprising at least 200, at least 500, at least 1000, at least 5000 or at least 10000 different nucleic acids of that type, such as 'first' or 'second' nucleic acids. Such a library may be in the form of being bound to an array. A library may for example comprise all of the nucleic acids disclosed in any table disclosed herein, or all of the probe sequences disclosed in any table herein.

Hybridisation

The invention requires a means for allowing wholly or partially complementary nucleic acid sequences from the first set of nucleic acids and the second set of nucleic acids to hybridise. In one embodiment all of the first set of nucleic acids is contacted with all of the second set of nucleic acids in a single assay, i.e. in a single hybridisation step. However any suitable assay can be used.

Labelled Nucleic Acids and Pattern of Hybridisation

The nucleic acids mentioned herein may be labelled, preferably using an independent label such as a fluorophore (fluorescent molecule) or radioactive label which assists detection of successful hybridisation. Certain labels can be detected under UV light. The pattern of hybridisation, for example on an array described herein, represents differences in epigenetic chromosome interactions between the two subgroups, and thus provides a process of comparing epigenetic chromosome interactions and determination of which epigenetic chromosome interactions are specific to a subgroup in the population of the present invention.

The term 'pattern of hybridisation' broadly covers the presence and absence of hybridisation between the first and second set of nucleic acids, i.e. which specific nucleic acids from the first set hybridise to which specific nucleic acids from the second set, and so is not limited to any particular assay or technique, or the need to have a surface or array on which a 'pattern' can be detected.

Selecting a Subgroup with Particular Characteristics

The invention provides a process which comprises detecting the presence or absence of chromosome interactions, typically 5 to 20 or 5 to 500 such interactions, preferably 20 to 300 or 50 to 100 interactions, in order to determine the presence or absence of a characteristic relating to physical performance in an individual. Preferably the chromosome interactions are those in any of the genes mentioned herein or in any Table herein. In one embodiment the chromosome interactions which are typed are those represented in any of Tables 3, 7, 8, 9, 25 and 30. In another embodiment the chromosome interactions are those represented in any of Tables 13, 14, 18, 22, 23 and 24. In a preferred embodiment the chromosome interactions which are typed are those represented in Table 31 or 32. In one embodiment the chromosome interactions which are typed are those from any of Tables 33, 34, 35, 36, 37, 38, 39 or 40 or in any of Figures 16, 17 or 18. The relevant chromosome interaction may be present or absent for a given characteristic, and therefore either presence or absence of the interaction will indicate the presence of the characteristic.

The Individual that is Tested

The individual that is tested is preferably a human or horse. The human may be an athlete or sportsman. The human is typically 30 years old or less. The horse may be any type of horse mentioned herein, such as a Thoroughbred. The horse may be a racing horse. The horse may be one which is not a racing horse, but which optionally is being considered for selection as a race horse. The horse may be less than 500 days old, such as less than 200 or less than 100 days old.

Preferred Gene Regions, Loci, Genes and Chromosome Interactions

For all aspects of the invention preferred gene regions, loci, genes and chromosome interactions are mentioned in the tables, for example in any of Tables 3, 7, 8, 9, 25 and 30 (preferably for typing humans) or in any of Tables 13, 14, 18, 22, 23 and 24 (preferably for typing horses), or in Table 31 or 32 (preferably for typing humans). Typically in the processes of the invention chromosome interactions are detected from at least 1, 2, 10, 30 or 50 genes listed in Table 21. The chromosome interaction may be upstream or downstream of any of the genes mentioned herein, for example 50 kb upstream or 20 kb downstream, for example from the coding sequence.

In one embodiment at least 5, 10 or all of the chromosome interactions of Table 3 are typed. In one embodiment at least the interactions with the top 5 or 10 highest odds ratio of Table 3 are typed.

In one embodiment at least 5, 10, 15 or all of the chromosome interactions in Table 7 are typed. In one embodiment at least the interactions with the smallest 5, 10 or 15 mean p values of Table 7 are typed. In one embodiment at least 5, 10, 15, 20 or all of the chromosome interactions in Table 8 are typed. In one embodiment at least the interactions with the smallest 5, 10, 15 or 20 mean p values of Table 8 are typed.

In one embodiment at least 5, 10, 15, 20 or all of the chromosome interactions in Table 9 are typed. In one embodiment at least the interactions with the smallest 5, 10, 15 or 20 mean p values of Table 9 are typed.

In one embodiment at least 5, 10, 15 or all of the chromosome interactions in Table 13 are typed. In one embodiment at least the interactions with the smallest 5 or 10 Exact Boschloo p values of Table 13 are typed.

In one embodiment at least 5, 10, 15 or all of the chromosome interactions in Table 14 are typed.

In one embodiment at least 5, 10, 20, 30 or all of the chromosome interactions in Table 18 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20 or 30 Exact Boschloo p values of Table 18 are typed.

In one embodiment at least 5, 10 or all of the chromosome interactions of Table 22 are typed. In one embodiment at least the interactions with the top 5 or 10 highest odds ratio of Table 22 are typed.

In one embodiment at least 5, 10, 15 or all of the chromosome interactions of Table 23 are typed.

In one embodiment at least 5, 10, 20, 30 or all of the chromosome interactions of Table 24 are typed. In one embodiment at least the interactions with the top 5, 10, 20 or 30 GLMNET values of Table 24 are typed.

In one embodiment at least 5, 10, 20, 30, 40, 50 or all of the chromosome interactions of Table 25 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20, 30, 40 or 50 adjusted p values of Table 25 are typed. In one embodiment at least the markers numbered 1 to 30 in Table 25 are typed. In another embodiment at least the markers numbered 31 to 77 in Table 25 are typed.

In one embodiment at least 5, 10, 20, 30, 40, 50, 150, 180 or all of the chromosome interactions of Table 30 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20, 30, 40, 50, 150, or 180 adjusted p values of Table 30 are typed. In one embodiment at least the markers numbered 1 to 50 in Table 30 are typed. In another embodiment at least the markers numbered 51 to 100 in Table 30 are typed. In another embodiment at least the markers numbered 101 to 150 in Table 30 are typed. In one embodiment at least the markers numbered 151 to 202 in Table 30 are typed.

In one embodiment at least 5, 10 or all of the chromosome interactions of Table 31 are typed.

In one embodiment at least 5 or all of the chromosome interactions of Table 32 are typed. In one embodiment at least 5, 10, 20, 30, 40, 50, 150, 180, 200, 250 or all of the chromosome interactions of Table 33 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20, 30, 40, 50, 150, 180 or 250 adjusted p values of Table 33 are typed. In one embodiment at least the markers numbered 1 to 50 in Table 33 are typed. In another embodiment at least the markers numbered 51 to 100 in Table 33 are typed. In another embodiment at least the markers numbered 101 to 150 in Table 33 are typed. In one embodiment at least the markers numbered 151 to 202 in Table 33 are typed. In one embodiment at least the markers numbered 202 to 320 in Table 33 are typed.

In one embodiment at least 5, 10, 20, 30, 40, 50 or all of the chromosome interactions of Table 34 are typed.

In one embodiment at least 5, 10, 20, 30 or all of the chromosome interactions of Table 35 are typed.

In one embodiment at least 5, 10, 15 or all of the chromosome interactions of Table 36 are typed.

In one embodiment at least 5, 10, 20, 30, 40, 50, 150, 180 or all of the chromosome interactions of Table 37 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20, 30, 40, 50, 150, or 180 adjusted p values of Table 37 are typed. In one embodiment at least the markers numbered 1 to 50 in Table 37 are typed. In another embodiment at least the markers numbered 51 to 100 in Table 37 are typed. In another embodiment at least the markers numbered 101 to 150 in Table 37 are typed. In one embodiment at least the markers numbered 151 to 202 in Table 37 are typed.

In one embodiment at least 5, 10 or all of the chromosome interactions of Table 38 are typed.

In one embodiment at least 5 or all of the chromosome interactions of Table 39 are typed.

In one embodiment at least 5, 10 or all of the chromosome interactions of Table 40 are typed. In one embodiment at least the 3 'shared' chromosome interactions of Table 40 are typed. In one embodiment at least the 7 'strength' chromosome interactions of Table 40 are typed.

In one embodiment at least 5, 10, 20, 30, 40, 50 or all of the chromosome interactions of Table 41 are typed. In one embodiment at least the interactions with the smallest 5, 10, 20, 30, 40 or 50 adjusted p values of Table 41 are typed. In one embodiment at least the markers numbered 1 to 30 in Table 41 are typed. In another embodiment at least the markers numbered 31 to 77 in Table 41 are typed. [Table 41 is shown in abbreviated form to avoid duplicating information from Table 25, which relates to the same marker set. It is understood that the smallest p values mentioned here can be obtained from Table 25.]

In one embodiment at least 5, 8 or all of the chromosome interactions of Figure 16 are typed.

In one embodiment at least 5, 10 or all of the chromosome interactions of Figure 17 are typed.

In one embodiment at least 5, 8 or all of the chromosome interactions of Figure 18 are typed. Typically at least 5, 10, 15, 20, 30, 40 or 70 chromosome interactions are typed from any of the genes or regions disclosed in the tables herein, or parts of tables disclosed herein. Typically the chromosome interactions which are typed are present in at least 20, 50 or all of the genes mentioned in Table 21.

For all aspects of the invention preferred gene regions, loci, genes and chromosome interactions are mentioned in Tables 24 and 30.

In one embodiment the locus (including the gene and/or place where the chromosome interaction is detected) may comprise a CTCF binding site. This is any sequence capable of binding transcription repressor CTCF. That sequence may consist of or comprise the sequence CCCTC which may be present in 1, 2 or 3 copies at the locus. The CTCF binding site sequence may comprise the sequence CCGCGNGGNGGCAG (in IUPAC notation). The CTCF binding site may be within at least 100, 500, 1000 or 4000 bases of the chromosome interaction or within any of the chromosome regions shown in Table 24 or 30.
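As an illustration of how a motif given in IUPAC notation may be located in a sequence, the following minimal sketch converts the motif CCGCGNGGNGGCAG into a regular expression and reports match positions. The helper names and the demonstration sequence are hypothetical, and only the strand supplied is scanned; a complete search would also scan the complementary strand.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


CTCF_MOTIF = iupac_to_regex("CCGCGNGGNGGCAG")


def ctcf_sites(seq):
    """Return 0-based start positions of motif matches on the given strand."""
    return [m.start() for m in CTCF_MOTIF.finditer(seq.upper())]


# Hypothetical sequence with one embedded match starting at position 4.
demo = "TTTT" + "CCGCGAGGTGGCAG" + "AAAA"
print(ctcf_sites(demo))  # [4]
```

The distance between a reported position and the site of the chromosome interaction can then be compared with the 100, 500, 1000 or 4000 base windows mentioned above.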

Thus typically sequence from both regions of the probe (i.e. from both sites of the chromosome interaction) could be detected. In preferred embodiments probes are used in the process which comprise or consist of the same or complementary sequence to a probe shown in any table. In some embodiments probes are used which comprise sequence which is homologous to any of the probe sequences shown in the tables.

Tables Provided Herein

The tables show probe (EpiSwitch™ marker) data and gene data representing chromosome interactions relevant to physical performance. The probe sequences show sequence which can be used to detect a ligated product generated from both sites of gene regions that have come together in chromosome interactions, i.e. the probe will comprise sequence which is complementary to sequence in the ligated product. The first two sets of Start-End positions show probe positions, and the second two sets of Start-End positions show the relevant 4kb region. The following information is provided in the probe data table:

HyperG_Stats: p-value for the probability of finding that number of significant EpiSwitch™ markers in the locus based on the parameters of hypergeometric enrichment

Probe Count Total: Total number of EpiSwitch™ Conformations tested at the locus

Probe Count Sig: Number of EpiSwitch™ Conformations found to be statistically significant at the locus

FDR HyperG: Multi-test (False Discovery Rate) corrected hypergeometric p-value

Percent Sig: Percentage of significant EpiSwitch™ markers relative to the number of markers tested at the locus

logFC: logarithm base 2 of Epigenetic Ratio (FC)

AveExpr: average log2-expression for the probe over all arrays and channels

T: moderated t-statistic

p-value: raw p-value

adj. p-value: adjusted p-value or q-value

B - B-statistic (lods or B) is the log-odds that the gene is differentially expressed.

FC - non-log Fold Change

FC_1 - non-log Fold Change centred around zero

LS - Binary value derived from the FC_1 value. If the FC_1 value is below -1.1, LS is set to -1; if the FC_1 value is above 1.1, LS is set to 1; between those values, LS is 0.
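The LS rule can be written as a short function. This is a minimal sketch; the function name is hypothetical, and the treatment of values exactly equal to -1.1 or 1.1 (here assigned 0) is an assumption, since only "below" and "above" are specified.

```python
def ls_value(fc_1):
    """Map an FC_1 value to LS: -1 below -1.1, 1 above 1.1, otherwise 0."""
    if fc_1 < -1.1:
        return -1
    if fc_1 > 1.1:
        return 1
    return 0


print([ls_value(v) for v in (-1.5, -1.0, 1.0, 1.2)])  # [-1, 0, 0, 1]
```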

The tables also show genes where a relevant chromosome interaction has been found to occur. The p-value in the loci table is the same as the HyperG_Stats (p-value for the probability of finding that number of significant EpiSwitch™ markers in the locus based on the parameters of hypergeometric enrichment).
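For illustration, a hypergeometric enrichment p-value of the kind described for HyperG_Stats could be computed as follows. This is a sketch under stated assumptions: the exact parameters are not given here, so the background is assumed to be all markers tested (`n_total`), of which `k_sig_total` are significant, and the p-value is taken as the upper tail probability of observing at least Probe Count Sig significant markers among the Probe Count Total markers at the locus. All names are hypothetical.

```python
from math import comb


def hyperg_pvalue(k_sig_locus, n_locus, k_sig_total, n_total):
    """Upper-tail hypergeometric probability P(X >= k_sig_locus).

    X counts significant markers among n_locus markers drawn without
    replacement from n_total markers, k_sig_total of which are significant.
    """
    upper = min(n_locus, k_sig_total)
    denom = comb(n_total, n_locus)
    numer = sum(
        comb(k_sig_total, x) * comb(n_total - k_sig_total, n_locus - x)
        for x in range(k_sig_locus, upper + 1)
    )
    return numer / denom


def percent_sig(k_sig_locus, n_locus):
    """Percent Sig: significant markers relative to markers tested at the locus."""
    return 100.0 * k_sig_locus / n_locus


# Hypothetical counts: 3 of 12 markers significant at the locus,
# 50 of 1000 significant across the assumed background.
print(hyperg_pvalue(3, 12, 50, 1000), percent_sig(3, 12))
```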

The probes are designed to be 30bp away from the TaqI site. In case of PCR, PCR primers are typically designed to detect ligated product but their locations from the TaqI site vary.

Probe locations:

Start 1 - 30 bases upstream of TaqI site on fragment 1

End 1 - TaqI restriction site on fragment 1

Start 2 - TaqI restriction site on fragment 2

End 2 - 30 bases downstream of TaqI site on fragment 2

4kb Sequence Location:

Start 1 - 4000 bases upstream of TaqI site on fragment 1

End 1 - TaqI restriction site on fragment 1

Start 2 - TaqI restriction site on fragment 2

End 2 - 4000 bases downstream of TaqI site on fragment 2
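The two coordinate schemes above differ only in the size of the flank (30 bases for the probe, 4000 bases for the 4kb region). A minimal sketch follows, assuming that each TaqI site is given as a single genomic coordinate and that "upstream" corresponds to the lower coordinate; the function name and the example positions are hypothetical.

```python
def region_coordinates(taqi_site_1, taqi_site_2, flank=30):
    """Return Start/End positions around the TaqI sites of fragments 1 and 2.

    flank=30 gives the probe locations; flank=4000 gives the 4kb region.
    """
    return {
        "Start 1": taqi_site_1 - flank,  # flank bases upstream on fragment 1
        "End 1": taqi_site_1,            # TaqI site on fragment 1
        "Start 2": taqi_site_2,          # TaqI site on fragment 2
        "End 2": taqi_site_2 + flank,    # flank bases downstream on fragment 2
    }


# Hypothetical TaqI site positions.
print(region_coordinates(10_000, 50_000))
print(region_coordinates(10_000, 50_000, flank=4000))
```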

GLMNET values relate to procedures for fitting the entire lasso or elastic-net regularization (Lambda set to 0.5 (elastic-net)).

Preferred Embodiments for Sample Preparation and Chromosome Interaction Detection

Methods of preparing samples and detecting chromosome conformations are described herein.

Optimised (non-conventional) versions of these methods can be used, for example as described in this section. Typically the sample will contain at least 2 × 10^5 cells. The sample may contain up to 5 × 10^5 cells. In one embodiment, the sample will contain 2 × 10^5 to 5.5 × 10^5 cells.

Crosslinking of epigenetic chromosomal interactions present at the chromosomal locus is described herein. This may be performed before cell lysis takes place. Cell lysis may be performed for 3 to 7 minutes, such as 4 to 6 or about 5 minutes. In some embodiments, cell lysis is performed for at least 5 minutes and for less than 10 minutes.

Digesting DNA with a restriction enzyme is described herein. Typically, DNA restriction is performed at about 55°C to about 70°C, such as about 65°C, for a period of about 10 to 30 minutes, such as about 20 minutes.

Preferably a frequent cutter restriction enzyme is used which results in fragments of ligated DNA with an average fragment size of up to 4000 base pairs. Optionally the restriction enzyme results in fragments of ligated DNA having an average fragment size of about 200 to 300 base pairs, such as about 256 base pairs. In one embodiment, the typical fragment size is from 200 base pairs to 4,000 base pairs, such as 400 to 2,000 or 500 to 1,000 base pairs.
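One way to read the figure of about 256 base pairs is as the expected spacing of a four-base recognition site (such as the TaqI site TCGA) in a sequence of independent, equally likely bases, which is 4^4 = 256. This derivation is not stated in the text and is offered only as an idealised assumption; the function name below is hypothetical.

```python
def expected_site_spacing(site_length, alphabet_size=4):
    """Expected distance between occurrences of a fixed site of the given
    length, assuming independent and equiprobable bases (idealised model)."""
    return alphabet_size ** site_length


print(expected_site_spacing(4))  # 256, for a four-base cutter
print(expected_site_spacing(6))  # 4096, for a six-base cutter
```

Real genomes depart from this model, so observed average fragment sizes will differ.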

In one embodiment of the EpiSwitch method a DNA precipitation step is not performed between the DNA restriction digest step and the DNA ligation step.

DNA ligation is described herein. Typically the DNA ligation is performed for 5 to 30 minutes, such as about 10 minutes.

The protein in the sample may be digested enzymatically, for example using a proteinase, optionally Proteinase K. The protein may be enzymatically digested for a period of about 30 minutes to 1 hour, for example for about 45 minutes. In one embodiment after digestion of the protein, for example Proteinase K digestion, there is no cross-link reversal or phenol DNA extraction step.

In one embodiment PCR detection is capable of detecting a single copy of the ligated nucleic acid, preferably with a binary read-out for presence/absence of the ligated nucleic acid.

Figure 14 shows a preferred method of detecting chromosome interactions.

Processes and Uses of the Invention

The process of the invention can be described in different ways. It can be described as a method of making a ligated nucleic acid comprising (i) in vitro cross-linking of chromosome regions which have come together in a chromosome interaction; (ii) subjecting said cross-linked DNA to cutting or restriction digestion cleavage; and (iii) ligating said cross-linked cleaved DNA ends to form a ligated nucleic acid, wherein detection of the ligated nucleic acid may be used to determine the chromosome state at a locus, and wherein preferably:

- the locus may be any of the loci, regions or genes mentioned in any table, and/or

- wherein the chromosomal interaction may be any of the chromosome interactions mentioned herein or corresponding to any of the probes disclosed in any table, and/or

- wherein the ligated product may have or comprise (i) sequence which is the same as or homologous to any of the probe sequences disclosed in any table herein; or (ii) sequence which is complementary to (i).

The process of the invention can be described as a process for detecting chromosome states which represent different subgroups in a population comprising determining whether a chromosome interaction is present or absent within a defined epigenetically active region of the genome, wherein preferably:

- the subgroup is defined by presence or absence of physical performance, and/or

- the chromosome state may be at any locus, region or gene mentioned in any table; and/or

- the chromosome interaction may be any of those mentioned in any table or corresponding to any of the probes disclosed in that table.

The invention includes detecting chromosome interactions at any locus, gene or regions mentioned in any table, such as Table 24 or 30. The invention includes use of the nucleic acids and probes mentioned herein to detect chromosome interactions, for example use of at least 1, 5, 10, 50, 100 such nucleic acids or probes to detect chromosome interactions, preferably in at least 1, 5, 10, 50, 100 different loci or genes. The invention includes detection of chromosome interactions using any of the primers or primer pairs listed in Table 24 or 30 or using variants of these primers as described herein (sequences comprising the primer sequences or comprising fragments and/or homologues of the primer sequences).

The invention includes detecting chromosome interactions at any locus, gene or regions mentioned in Table 24 or 30. The invention includes use of the nucleic acids and probes mentioned herein to detect chromosome interactions, for example use of at least 1, 5, 10, 50, 100, 200, 250, 300 such nucleic acids or probes to detect chromosome interactions, preferably in at least 1, 5, 10, 50, 100, 200, 250, 300 different loci or genes. The invention includes detection of chromosome interactions using any of the primers or primer pairs listed in Table 24 or 30 or using variants of these primers as described herein (sequences comprising the primer sequences or comprising fragments and/or homologues of the primer sequences).

When analysing whether a chromosome interaction occurs 'within' a defined gene, region or location, either both the parts of the chromosome which have come together in the interaction are within the defined gene, region or location or in some embodiments only one part of the chromosome is within the defined gene, region or location.

Use of the Method of the Invention to Identify New Training or Fitness Regimens

Knowledge of chromosome interactions can be used to identify new fitness or training regimens. The invention provides methods and uses of chromosome interactions defined herein to identify or design new agents.

Homologues

Homologues of polynucleotide / nucleic acid (e.g. DNA) sequences are referred to herein. Such homologues typically have at least 70% homology, preferably at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98% or at least 99% homology, for example over a region of at least 10, 15, 20, 30, 100 or more contiguous nucleotides, or across the portion of the nucleic acid which is from the region of the chromosome involved in the chromosome interaction. The homology may be calculated on the basis of nucleotide identity (sometimes referred to as "hard homology").

Therefore, in a particular embodiment, homologues of polynucleotide / nucleic acid (e.g. DNA) sequences are referred to herein by reference to percentage sequence identity. Typically such homologues have at least 70% sequence identity, preferably at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98% or at least 99% sequence identity, for example over a region of at least 10, 15, 20, 30, 100 or more contiguous nucleotides, or across the portion of the nucleic acid which is from the region of the chromosome involved in the chromosome interaction.

For example the UWGCG Package provides the BESTFIT program which can be used to calculate homology and/or % sequence identity (for example used on its default settings) (Devereux et al (1984) Nucleic Acids Research 12, p387-395). The PILEUP and BLAST algorithms can be used to calculate homology and/or % sequence identity and/or line up sequences (such as identifying equivalent or corresponding sequences (typically on their default settings)), for example as described in Altschul S. F. (1993) J Mol Evol 36:290-300; Altschul, S, F et al (1990) J Mol Biol 215:403-10.

Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information. This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence that either match or satisfy some positive valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighbourhood word score threshold (Altschul et al, supra). These initial neighbourhood word hits act as seeds for initiating searches to find HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extensions for the word hits in each direction are halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T and X determine the sensitivity and speed of the alignment. The BLAST program uses as defaults a word length (W) of 11, the BLOSUM62 scoring matrix (see Henikoff and Henikoff (1992) Proc. Natl. Acad. Sci. USA 89: 10915-10919), alignments (B) of 50, expectation (E) of 10, M=5, N=4, and a comparison of both strands.

The BLAST algorithm performs a statistical analysis of the similarity between two sequences; see e.g., Karlin and Altschul (1993) Proc. Natl. Acad. Sci. USA 90: 5873-5877. One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two polynucleotide sequences would occur by chance. For example, a sequence is considered similar to another sequence if the smallest sum probability in comparison of the first sequence to the second sequence is less than about 1, preferably less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001.
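The seeding and extension steps described above can be illustrated with a simplified sketch. It is not the BLAST implementation: it seeds on exact word matches only (the neighbourhood threshold T is omitted), performs ungapped extension, scores a match as +5 and a mismatch as -4 (one reading of M=5, N=4), and uses a hypothetical drop-off value X of 20. All function names are hypothetical.

```python
def word_hits(query, subject, w=11):
    """Yield (query_pos, subject_pos) for every exact word match of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            yield i, j


def extend(query, subject, q_pos, s_pos, w, match=5, mismatch=-4, x_drop=20):
    """Ungapped extension of a word hit in both directions.

    Returns (best_score, q_start, q_end) with q_end exclusive. Extension in a
    direction stops when the score falls x_drop below its maximum, reaches
    zero or below, or the end of either sequence is reached.
    """
    score = best = w * match
    q_end = q_pos + w
    i, j = q_pos + w, s_pos + w
    while i < len(query) and j < len(subject):  # extend to the right
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i
        elif best - score >= x_drop or score <= 0:
            break
    score = best
    q_start = q_pos
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:  # extend to the left
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, q_start = score, i
        elif best - score >= x_drop or score <= 0:
            break
        i -= 1
        j -= 1
    return best, q_start, q_end


def best_hsp(query, subject, w=11, **scoring):
    """Return the highest-scoring extended hit, or None if there is no hit."""
    hits = (extend(query, subject, i, j, w, **scoring)
            for i, j in word_hits(query, subject, w))
    return max(hits, default=None)


# Hypothetical 16-base query found intact inside a longer subject.
q = "ACGTTGCAAGCTTAGC"
print(best_hsp(q, "TT" + q))  # (80, 0, 16)
```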

The homologous sequence typically differs by 1, 2, 3, 4 or more bases, such as less than 10, 15 or 20 bases (which may be substitutions, deletions or insertions of nucleotides). These changes may be measured across any of the regions mentioned above in relation to calculating homology and/or % sequence identity.

Homology of a 'pair of primers' can be calculated, for example, by considering the two sequences as a single sequence (as if the two sequences are joined together) for the purpose of then comparing against another primer pair, which again is considered as a single sequence.
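
The joining of a primer pair into a single sequence can be illustrated with the sketch below. It is an assumption-laden simplification: identity is counted position by position without alignment or gaps, whereas in practice an alignment tool such as BLAST would be used. The function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of matching positions, measured over the longer sequence.

    Simplification: positions are compared directly, with no alignment or gaps.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))


def primer_pair_identity(pair_a, pair_b):
    """Treat each (forward, reverse) pair as one joined sequence, then compare."""
    return percent_identity("".join(pair_a), "".join(pair_b))


pair_a = ("ACGTACGTAC", "TTGGCCAATT")
pair_b = ("ACGTACGTAA", "TTGGCCAATT")
print(primer_pair_identity(pair_a, pair_b))  # 19 of 20 positions match
```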

Arrays

The second set of nucleic acids may be bound to an array, and in one embodiment there are at least 15,000, 45,000, 100,000 or 250,000 different second nucleic acids bound to the array, which preferably represent at least 300, 900, 2000 or 5000 loci. In one embodiment one, or more, or all of the different populations of second nucleic acids are bound to more than one distinct region of the array, in effect repeated on the array allowing for error detection. The array may be based on an Agilent SurePrint G3 Custom CGH microarray platform. Detection of binding of first nucleic acids to the array may be performed by a dual colour system.

Forms of the Substance Mentioned Herein

Any of the substances, such as nucleic acids or therapeutic agents, mentioned herein may be in purified or isolated form. They may be in a form which is different from that found in nature, for example they may be present in combination with other substances with which they do not occur in nature. The nucleic acids (including portions of sequences defined herein) may have sequences which are different to those found in nature, for example having at least 1, 2, 3, 4 or more nucleotide changes in the sequence as described in the section on homology. The nucleic acids may have heterologous sequence at the 5' or 3' end. The nucleic acids may be chemically different from those found in nature, for example they may be modified in some way, but preferably are still capable of Watson-Crick base pairing. Where appropriate the nucleic acids will be provided in double stranded or single stranded form. The invention provides all of the specific nucleic acid sequences mentioned herein in single or double stranded form, and thus includes the complementary strand to any sequence which is disclosed.

The invention provides a kit for carrying out any process of the invention, including detection of a chromosomal interaction relating to physical performance. Such a kit can include a specific binding agent capable of detecting the relevant chromosomal interaction, such as agents capable of detecting a ligated nucleic acid generated by processes of the invention. Preferred agents present in the kit include probes capable of hybridising to the ligated nucleic acid or primer pairs, for example as described herein, capable of amplifying the ligated nucleic acid in a PCR reaction. The invention provides a device that is capable of detecting the relevant chromosome interactions. The device preferably comprises any specific binding agents, probe or primer pair capable of detecting the chromosome interaction, such as any such agent, probe or primer pair described herein.

Detection Methods

In one embodiment quantitative detection of the ligated sequence which is relevant to a chromosome interaction is carried out using a probe which is detectable upon activation during a PCR reaction, wherein said ligated sequence comprises sequences from two chromosome regions that come together in an epigenetic chromosome interaction, wherein said method comprises contacting the ligated sequence with the probe during a PCR reaction, and detecting the extent of activation of the probe, and wherein said probe binds the ligation site. The method typically allows particular interactions to be detected in a MIQE compliant manner using a dual labelled fluorescent hydrolysis probe.

The probe is generally labelled with a detectable label which has an inactive and active state, so that it is only detected when activated. The extent of activation will be related to the extent of template (ligation product) present in the PCR reaction. Detection may be carried out during all or some of the PCR, for example for at least 50% or 80% of the cycles of the PCR.

The probe can comprise a fluorophore covalently attached to one end of the oligonucleotide, and a quencher attached to the other end of the oligonucleotide, so that the fluorescence of the fluorophore is quenched by the quencher. In one embodiment the fluorophore is attached to the 5' end of the oligonucleotide, and the quencher is covalently attached to the 3' end of the oligonucleotide.

Fluorophores that can be used in the methods of the invention include FAM, TET, JOE, Yakima Yellow, HEX, Cyanine3, ATTO 550, TAMRA, ROX, Texas Red, Cyanine 3.5, LC610, LC 640, ATTO 647N, Cyanine 5, Cyanine 5.5 and ATTO 680. Quenchers that can be used with the appropriate fluorophore include TAM, BHQ1, DAB, Eclip, BHQ2 and BBQ650, optionally wherein said fluorophore is selected from HEX, Texas Red and FAM. Preferred combinations of fluorophore and quencher include FAM with BHQ1 and Texas Red with BHQ2.

Use of the Probe in a qPCR Assay

Hydrolysis probes of the invention are typically temperature gradient optimised with concentration matched negative controls. Preferably single-step PCR reactions are optimised. More preferably a standard curve is calculated. An advantage of using a specific probe that binds across the junction of the ligated sequence is that specificity for the ligated sequence can be achieved without using a nested PCR approach. The methods described herein allow accurate and precise quantification of low copy number targets. The target ligated sequence can be purified, for example gel-purified, prior to temperature gradient optimisation. The target ligated sequence can be sequenced. Preferably PCR reactions are performed using about 10 ng, or 5 to 15 ng, or 10 to 20 ng, or 10 to 50 ng, or 10 to 200 ng template DNA. Forward and reverse primers are designed such that one primer binds to the sequence of one of the chromosome regions represented in the ligated DNA sequence, and the other primer binds to the other chromosome region represented in the ligated DNA sequence, for example, by being complementary to the sequence.
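
One common way of calculating a qPCR standard curve is sketched below for illustration. It assumes the conventional model in which the quantification cycle (Ct) is linear in log10 of the template quantity, with amplification efficiency estimated as 10^(-1/slope) - 1; the specification does not state which model is used, and the function names and dilution values are hypothetical.

```python
import math


def fit_standard_curve(quantities, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # An efficiency of 1.0 corresponds to perfect doubling per cycle.
    efficiency = 10 ** (-1 / slope) - 1
    return slope, intercept, efficiency


def quantity_from_ct(ct, slope, intercept):
    """Read an unknown sample's quantity off the fitted curve."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical ten-fold dilution series (copies per reaction) and measured Ct values.
dilutions = [10, 100, 1000, 10000]
cts = [34.68, 31.36, 28.04, 24.72]
slope, intercept, efficiency = fit_standard_curve(dilutions, cts)
print(round(slope, 2), round(intercept, 2), round(efficiency, 2))
```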

Choice of Ligated DNA Target

The invention includes selecting primers and a probe for use in a PCR method as defined herein, comprising selecting primers based on their ability to bind and amplify the ligated sequence and selecting the probe sequence based on properties of the target sequence to which it will bind, in particular the curvature of the target sequence.

Probes are typically designed/chosen to bind to ligated sequences which are juxtaposed restriction fragments spanning the restriction site. In one embodiment of the invention, the predicted curvature of possible ligated sequences relevant to a particular chromosome interaction is calculated, for example using a specific algorithm referenced herein. The curvature can be expressed as degrees per helical turn, e.g. 10.5° per helical turn. Ligated sequences are selected for targeting where the ligated sequence has a curvature propensity peak score of at least 5° per helical turn, typically at least 10°, 15° or 20° per helical turn, for example 5° to 20° per helical turn. Preferably the curvature propensity score per helical turn is calculated for at least 20, 50, 100, 200 or 400 bases, such as for 20 to 400 bases upstream and/or downstream of the ligation site. Thus in one embodiment the target sequence in the ligated product has any of these levels of curvature. Target sequences can also be chosen based on lowest thermodynamic structure free energy.
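
The curvature-based selection can be illustrated with the sketch below. It assumes that per-base curvature propensity scores (in degrees per helical turn) have already been calculated by a curvature prediction algorithm; that calculation is not reproduced here, and the function name, parameter names and example scores are hypothetical.

```python
def passes_curvature_filter(scores, ligation_index, window=20, min_peak=5.0):
    """Check the peak curvature score near the ligation site against a threshold.

    scores: precomputed curvature propensity per base (degrees per helical turn).
    window: number of bases examined upstream and downstream of the ligation site.
    """
    start = max(0, ligation_index - window)
    end = min(len(scores), ligation_index + window + 1)
    return max(scores[start:end]) >= min_peak


# Hypothetical scores: flat background with one peak at position 30.
scores = [1.0] * 50
scores[30] = 12.0
print(passes_curvature_filter(scores, ligation_index=25))  # peak lies within the window
print(passes_curvature_filter(scores, ligation_index=5))   # peak lies outside the window
```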

Particular Embodiments

In one embodiment only intrachromosomal interactions are typed/detected, and no extrachromosomal interactions (between different chromosomes) are typed/detected.

In particular embodiments certain chromosome interactions are not typed, for example any specific interaction mentioned herein (for example as defined by any probe or primer pair mentioned herein). In some embodiments chromosome interactions are not typed in any of the genes mentioned here, for example in any gene mentioned in Table 21.

Publications

The contents of all publications mentioned herein are incorporated by reference into the present specification and may be used to further define the features relevant to the invention.

Tables

Table 1 shows the patient samples for the human study.

Table 2 shows the classification of responders in the human study.

Tables 3 and 4 show markers from the human study which are preferably used for typing humans.

Table 5 illustrates the predispositions present in subjects in the human study.

Table 6 shows markers from the human study, which are preferably used for typing humans.

Table 7 shows predictive markers for strength training response, which are preferably used for typing humans.

Table 8 shows predictive markers for endurance training response, which are preferably used for typing humans.

Table 9 shows predictive markers for either strength or endurance training response, which are preferably used for typing humans.

Table 10 shows the samples for the equine study.

Table 11 defines the 'sex' description used in Table 10.

Table 12 shows the sex distribution in the equine study.

Table 13 shows the top markers for Stayer versus Sprinter phenotype (n=32, 16 Stayer, 16 Sprinter), which are preferably used to type horses.

Table 14 shows markers discovered in humans that are applicable to horses, and the closest genomic loci, which can be used to type horses.

Table 15 shows classifier calls for Sprinters and Stayers.

Table 16 shows probability scores for the equine study.

Table 17 shows classifier calls of naive samples from young Thoroughbreds.

Table 18 shows the informative markers from the equine study, which are preferably used to type horses.

Tables 19 and 20 show the subjects for the human study.

Table 21 shows preferred genes for carrying out the invention.

Tables 22 and 23 show preferred markers from the equine study and the traits they relate to, which are preferably used to type horses.

Table 24 shows markers identified in the equine study, which are preferably used to type horses.

Table 25 shows markers identified in the human study, which are preferably used to type humans.

Table 26 shows another set of preferred genes for carrying out the invention.

Table 27 shows pathway analysis for gene locations for 171 chromosome interactions shared between the strength and endurance groups.

Table 28 shows pathway analysis for gene locations for the top 79 chromosome interactions which are unique to the endurance group.

Table 29 shows pathway analysis for gene locations for the top 79 chromosome interactions which are unique to the strength group.

Tables 30 to 32 show markers identified in the human study, which are preferably used to type humans. To clarify the nomenclature used in the tables, including Table 30:

E_Trn refers to presence in Endurance Training

Str_Trn refers to presence in Strength Training

E_Ctrl refers to presence in Endurance Control (i.e. absence in Endurance Training)

Str_Ctrl refers to presence in Strength Control (i.e. absence in Strength Training)

Table 33 shows markers from an equine study, which are preferably used to type horses. In the LS column, 1 means present in Sprinters, while (-1) means present in Stayers. The 'Loop detection' column in effect decodes what +1 and -1 mean in terms of detection.

Table 34 shows markers from a human study, which are preferably used to type humans.

Table 35 shows preferred markers from a horse study, which are preferably used to type horses.

Table 36 shows preferred markers from a human study, which are preferably used to type humans.

Table 37 shows an updated version of Table 30. The same markers are typed in a human study and are preferably used to type humans.

Table 38 shows an updated version of Table 31. The same markers are typed in a human study and preferably used to type humans.

Table 39 shows an updated version of Table 32. The same markers are typed in a human study and preferably used to type humans.

Table 40 shows markers corresponding to those shown in Figure 15, which are preferably used to type humans.

Table 41 shows updated results for Table 25, where the markers are from a human study and are preferably used to type humans.

Figures 16 and 17 show markers from a human study, which preferably can be used to type humans.

Figure 18 shows markers from a horse study, which preferably can be used to type horses.

Preferred Methods

The following numbered paragraphs define preferred methods:

1. A process for detecting a chromosome state which represents a subgroup in a population comprising determining whether a chromosome interaction relating to that chromosome state is present or absent within a defined region of the genome, wherein said subgroup relates to physical performance in an individual; and

- wherein said chromosome interaction has optionally been identified by a method of determining which chromosomal interactions are relevant to a chromosome state corresponding to a physical performance subgroup of the population, comprising contacting a first set of nucleic acids from subgroups with different states of the chromosome with a second set of index nucleic acids, and allowing complementary sequences to hybridise, wherein the nucleic acids in the first and second sets of nucleic acids represent a ligated product comprising sequences from both the chromosome regions that have come together in chromosomal interactions, and wherein the pattern of hybridisation between the first and second set of nucleic acids allows a determination of which chromosomal interactions are specific to a physical performance subgroup; and

- wherein the chromosome interaction either:

(i) corresponds to any one of the chromosome interactions shown in any of Tables 3, 7, 8, 9, 25 and 30; and/or

(ii) corresponds to any one of the chromosome interactions shown in any of Tables 13, 14, 18, 22, 23 and 24; and/or

(iii) is present in a 4,000 base region which comprises or which flanks (i) or (ii); and/or

(iv) is present in any one of the regions or genes listed in Table 21, 24, 25 or 30.

2. A process according to paragraph 1 wherein:

- the individual is a human and the subgroup is a human subgroup, or

- the individual is a horse and the subgroup is a horse subgroup, and wherein optionally:

(i) the process is carried out to determine physical performance ability, and/or

(ii) the process is carried out to detect responsiveness to a stimulus relating to physical performance, which is preferably physical training, and optionally strength or endurance training; and/or

(iii) the process is carried out to select an individual suitable for a physical activity, which is preferably a sport; and/or

(iv) the process is carried out to select a stimulus relating to physical performance to give to the individual, wherein said stimulus is a type of physical training.

3. A process according to paragraph 1 or 2 wherein a specific combination of chromosome interactions are typed:

(i) comprising all of the chromosome interactions represented in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23; and/or

(ii) comprising at least 10%, 20%, 50%, or 80% of the chromosome interactions in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23; and/or

(iii) which together are present in at least 10, 50 or 100 of the regions or genes listed in any of Tables 21, 24, 25 or 30; and/or

(iv) wherein at least 10, 50, 100, 150, 200 or 300 chromosome interactions are typed which are present in a 4,000 base region which comprises or which flanks the chromosome interactions represented in any of Tables 3, 7, 8, 9, 25 and 30 or any of Tables 13, 14, 18, 22, 23.

4. A process according to any one of the preceding paragraphs in which the chromosome interactions are typed:

- in a sample from an individual, and/or

- by detecting the presence or absence of a DNA loop at the site of the chromosome interactions, and/or

- detecting the presence or absence of distal regions of a chromosome being brought together in a chromosome conformation, and/or

- by detecting the presence of a ligated nucleic acid which is generated during said typing and whose sequence comprises two regions each corresponding to the regions of the chromosome which come together in the chromosome interaction, wherein detection of the ligated nucleic acid is preferably by using either:

(i) a probe that has at least 70% identity to any of the specific probe sequences mentioned in Table 24, 25 or 30, and/or

(ii) by a primer pair which has at least 70% identity to any primer pair in Table 24, 25 or 30.

5. A process according to any one of the preceding paragraphs, wherein:

- the second set of nucleic acids is from a larger group of individuals than the first set of nucleic acids; and/or

- the first set of nucleic acids is from at least 8 individuals; and/or

- the first set of nucleic acids is from at least 4 individuals from a first subgroup and at least 4 individuals from a second subgroup which is preferably non-overlapping with the first subgroup.

6. A process according to any one of the preceding paragraphs wherein:

- the second set of nucleic acids represents an unselected group; and/or

- wherein the second set of nucleic acids is bound to an array at defined locations; and/or

- wherein the second set of nucleic acids represents chromosome interactions in at least 100 different genes; and/or

- wherein the second set of nucleic acids comprises at least 1,000 different nucleic acids representing at least 1,000 different chromosome interactions; and/or

- wherein the first set of nucleic acids and the second set of nucleic acids comprise at least 100 nucleic acids with length 10 to 100 nucleotide bases.

7. A process according to any one of the preceding paragraphs, wherein the first set of nucleic acids is obtainable in a process comprising the steps of:

(i) cross-linking of chromosome regions which have come together in a chromosome interaction;

(ii) subjecting said cross-linked regions to cleavage, optionally by restriction digestion cleavage with an enzyme; and

(iii) ligating said cross-linked cleaved DNA ends to form the first set of nucleic acids (in particular comprising ligated DNA).

8. A process according to any one of the preceding paragraphs:

- wherein at least 10 to 50 different chromosome interactions are typed, preferably in 10 to 50 different regions or genes optionally as defined in Table 21, 24, 25 or 30; and/or

- which is:

(i) carried out on a human or horse athlete; and/or

(ii) carried out as part of a training regime, preferably after the start of the training regime; and/or

(iii) carried out on a Thoroughbred horse, preferably a racing horse; or

(iv) carried out on a human individual who is less than 20 years old or is carried out on a horse that is less than 18 months old, and/or

(v) which is carried out at multiple time points to assess physical performance characteristics at specific time points, wherein the process is optionally carried out at 3 or more time points, which are preferably at least 30 days apart from each other.

9. A process according to any one of the preceding paragraphs wherein said defined region of the genome:

(i) comprises a single nucleotide polymorphism (SNP); and/or

(ii) expresses a microRNA (miRNA); and/or

(iii) expresses a non-coding RNA (ncRNA); and/or

(iv) expresses a nucleic acid sequence encoding at least 10 contiguous amino acid residues; and/or

(v) expresses a regulating element; and/or

(vii) comprises a CTCF binding site.

10. A process according to any one of the preceding paragraphs:

- which is carried out to identify an individual that is suited to endurance training, and preferably the identified individual is then subject to endurance training, which optionally occurs on at least 100 days out of the next 365 days after the identification; or

- which is carried out to identify an individual that is suited to strength training, and preferably the identified individual is then subject to strength training, which optionally occurs on at least 100 days out of the next 365 days after the identification.

11. A process according to any one of the preceding paragraphs which is carried out to select the individual for racing.

12. A process according to any one of the preceding paragraphs which is carried out to identify or design an agent that affects physical performance, wherein said process is used to detect whether a candidate agent is able to cause a change to a chromosome state which is associated with a different physical performance state; wherein

- the chromosomal interaction is any specific interaction or combination of interactions defined in any paragraph and/or is present in any one of the regions or genes listed in Table 21, 24, 25 or 30; and/or

- the change in chromosomal interaction is monitored using (i) a probe that has at least 70% identity to any of the specific probe sequences mentioned in Table 24, 25 or 30, and/or (ii) by a primer pair which has at least 70% identity to any primer pair in Table 24, 25 or 30.

13. A process according to paragraph 12 which comprises selecting a target based on detection of chromosome interactions, and preferably screening for a modulator of the target to identify an agent which affects physical performance, wherein said target is optionally a protein.

14. A process according to any one of the preceding paragraphs, wherein the typing or detecting comprises specific detection of the ligated product by quantitative PCR (qPCR) which uses primers capable of amplifying the ligated product and a probe which binds the ligation site during the PCR reaction, wherein said probe comprises sequence which is complementary to sequence from each of the chromosome regions that have come together in the chromosome interaction, wherein preferably said probe comprises: an oligonucleotide which specifically binds to said ligated product, and/or a fluorophore covalently attached to the 5' end of the oligonucleotide, and/or a quencher covalently attached to the 3' end of the oligonucleotide, and optionally said fluorophore is selected from HEX, Texas Red and FAM; and/or said probe comprises a nucleic acid sequence of length 10 to 40 nucleotide bases, preferably a length of 20 to 30 nucleotide bases.

15. A process according to any one of the preceding paragraphs which further comprises:

- producing a report on the physical performance characteristics of the individual based on the results of the process, or

- inputting the results of the process into a database, or

- assigning a specific fitness or training regime to the individual based on the results of the process, or

- designing a specific fitness or training regime for the individual based on the results of the process.

Specific Embodiments

The EpiSwitch™ platform technology detects epigenetic regulatory signatures of regulatory changes between normal and abnormal conditions at loci. The EpiSwitch™ platform identifies and monitors the fundamental epigenetic level of gene regulation associated with regulatory high order structures of human chromosomes also known as chromosome conformation signatures. Chromosome signatures are a distinct primary step in a cascade of gene deregulation. They are high order biomarkers with a unique set of advantages against biomarker platforms that utilize late epigenetic and gene expression biomarkers, such as DNA methylation and RNA profiling.

EpiSwitch™ Array Assay

The custom EpiSwitch™ array-screening platforms come in 4 densities of 15K, 45K, 100K and 250K unique chromosome conformations. Each chimeric fragment is repeated on the arrays 4 times, making the effective densities 60K, 180K, 400K and 1 Million respectively.

Custom Designed EpiSwitch™ Arrays

The 15K EpiSwitch™ array can screen the whole genome including around 300 loci interrogated with the EpiSwitch™ Biomarker discovery technology. The EpiSwitch™ array is built on the Agilent SurePrint G3 Custom CGH microarray platform; this technology offers 4 densities: 60K, 180K, 400K and 1 Million probes. The density per array is reduced to 15K, 45K, 100K and 250K as each EpiSwitch™ probe is presented in quadruplicate, thus allowing for statistical evaluation of the reproducibility. The average number of potential EpiSwitch™ markers interrogated per genetic locus is 50; as such the numbers of loci that can be investigated are 300, 900, 2000 and 5000.
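
The arithmetic linking the platform densities to the number of unique probes and loci can be checked with the short sketch below. The constant and function names are hypothetical; the figures are those given above.

```python
REPLICATES_PER_PROBE = 4   # each probe is presented in quadruplicate
MARKERS_PER_LOCUS = 50     # average number of markers interrogated per locus


def array_capacity(total_features):
    """Return (unique probes, loci) for a given number of array features."""
    unique_probes = total_features // REPLICATES_PER_PROBE
    loci = unique_probes // MARKERS_PER_LOCUS
    return unique_probes, loci


for total in (60_000, 180_000, 400_000, 1_000_000):
    print(total, array_capacity(total))
```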

EpiSwitch™ Custom Array Pipeline

The EpiSwitch™ array is a dual colour system with one set of samples, after EpiSwitch™ library generation, labelled in Cy5 and the other set of samples (controls) to be compared/analysed labelled in Cy3. The arrays are scanned using the Agilent SureScan Scanner and the resultant features extracted using the Agilent Feature Extraction software. The data is then processed using the EpiSwitch™ array processing scripts in R. The arrays are processed using standard dual colour packages in Bioconductor in R: Limma *. The normalisation of the arrays is done using the normalizeWithinArrays function in Limma *, and this is done to the on-chip Agilent positive controls and EpiSwitch™ positive controls. The data is filtered based on the Agilent Flag calls, the Agilent control probes are removed and the technical replicate probes are averaged, in order for them to be analysed using Limma *. The probes are modelled based on their difference between the 2 scenarios being compared and then corrected by using False Discovery Rate. Probes with Coefficient of Variation (CV) <=30% that are <=-1.1 or >=1.1 and pass the p<=0.1 FDR p-value are used for further screening. To reduce the probe set further, Multiple Factor Analysis is performed using the FactoMineR package in R.

* Note: LIMMA is Linear Models and Empirical Bayes Processes for Assessing Differential Expression in Microarray Experiments. Limma is an R package for the analysis of gene expression data arising from microarray or RNA-Seq.

The pool of probes is initially selected based on adjusted p-value, FC and CV <30% (arbitrary cut off point) parameters for final picking. Further analyses and the final list are drawn based only on the first two parameters (adj. p-value; FC).

Statistical Pipeline

EpiSwitch™ screening arrays are processed using the EpiSwitch™ Analytical Package in R in order to select high value EpiSwitch™ markers for translation on to the EpiSwitch™ PCR platform.

Step 1

Probes are selected based on their corrected p-value (False Discovery Rate, FDR), which is the product of a modified linear regression model. Probes with a p-value <= 0.1 are selected and then further reduced by their Epigenetic ratio (ER): a probe's ER has to be <=-1.1 or >=1.1 in order for it to be selected for further analysis. The last filter is the coefficient of variation (CV), which has to be <=0.3.
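
The three filters of Step 1 can be sketched as follows. This is an illustration only and not the EpiSwitch™ Analytical Package, which runs in R; the dictionary keys `fdr`, `er` and `cv`, and the function name, are hypothetical.

```python
def select_probes(probes, max_fdr=0.1, min_abs_er=1.1, max_cv=0.3):
    """Keep probes passing the FDR, epigenetic ratio and CV thresholds.

    Each probe is a dict with keys 'fdr' (corrected p-value),
    'er' (epigenetic ratio) and 'cv' (coefficient of variation, as a fraction).
    """
    return [
        p for p in probes
        if p["fdr"] <= max_fdr          # corrected p-value filter
        and abs(p["er"]) >= min_abs_er  # ER <= -1.1 or ER >= 1.1
        and p["cv"] <= max_cv           # reproducibility filter
    ]


example = [
    {"id": "p1", "fdr": 0.05, "er": 1.5, "cv": 0.10},
    {"id": "p2", "fdr": 0.20, "er": 2.0, "cv": 0.10},   # fails the FDR filter
    {"id": "p3", "fdr": 0.05, "er": -1.2, "cv": 0.30},
    {"id": "p4", "fdr": 0.01, "er": 1.05, "cv": 0.10},  # fails the ER filter
]
print([p["id"] for p in select_probes(example)])
```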

Step 2

The top 40 markers from the statistical lists are selected based on their ER for selection as markers for PCR translation. The top 20 markers with the highest negative ER load and the top 20 markers with the highest positive ER load form the list.
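
The selection in Step 2 can be sketched as below, assuming each probe carries its ER under a hypothetical `er` key. The function name is likewise hypothetical.

```python
def top_markers(probes, n_each=20):
    """Return the n_each most negative and n_each most positive ER probes."""
    negative = sorted((p for p in probes if p["er"] < 0),
                      key=lambda p: p["er"])[:n_each]
    positive = sorted((p for p in probes if p["er"] > 0),
                      key=lambda p: p["er"], reverse=True)[:n_each]
    return negative + positive


# Hypothetical probes with ER values from -30 to 30 (excluding 0).
example = [{"id": i, "er": float(i)} for i in range(-30, 31) if i != 0]
selected = top_markers(example)
print(len(selected), selected[0]["er"], selected[20]["er"])
```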

Step 3

The resultant markers from step 1, the statistically significant probes, form the basis of enrichment analysis using hypergeometric enrichment (HE). This analysis enables marker reduction from the significant probe list and, along with the markers from step 2, forms the list of probes translated on to the EpiSwitch™ PCR platform.

The statistical probes are processed by HE to determine which genetic locations have an enrichment of statistically significant probes, indicating which genetic locations are hubs of epigenetic difference.

The most significant enriched loci based on a corrected p-value are selected for probe list generation. Genetic locations below p-value of 0.3 or 0.2 are selected. The statistical probes mapping to these genetic locations, with the markers from step 2, form the high value markers for EpiSwitch™ PCR translation.
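
A hypergeometric enrichment test for a single genetic location can be sketched as follows. This is a minimal illustration under the usual formulation (the probability of seeing at least the observed number of significant probes at a locus by chance); it omits the multiple-testing correction mentioned above, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total, total_sig, locus_probes, locus_sig):
    """P(X >= locus_sig) under the hypergeometric distribution.

    total:        number of probes on the array
    total_sig:    number of statistically significant probes overall
    locus_probes: number of probes mapping to the locus
    locus_sig:    number of significant probes mapping to the locus
    """
    upper = min(total_sig, locus_probes)
    tail = sum(
        comb(total_sig, k) * comb(total - total_sig, locus_probes - k)
        for k in range(locus_sig, upper + 1)
    )
    return tail / comb(total, locus_probes)


# Hypothetical locus: 50 probes, 12 of them significant, on an array of
# 15,000 probes of which 600 are significant overall.
p = hypergeom_enrichment_p(15_000, 600, 50, 12)
print(p < 0.2)
```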

Array design and processing

Array Design

1. Genetic loci are processed using the SII software (currently v3.2) to:

a. Pull out the sequence of the genome at these specific genetic loci (gene sequence with 50kb upstream and 20kb downstream)

b. Define the probability that a sequence within this region is involved in CCs

c. Cut the sequence using a specific RE

d. Determine which restriction fragments are likely to interact in a certain orientation

e. Rank the likelihood of different CCs interacting together.

2. Determine array size and therefore number of probe positions available (x)

3. Pull out x/4 interactions.

4. For each interaction define the sequence of 30bp to the restriction site from part 1 and 30bp to the restriction site of part 2. Check that those regions aren't repeats; if they are, exclude the interaction and take the next interaction down the list. Join both 30bp sequences to define the probe.

5. Create a list of x/4 probes plus defined control probes and replicate 4 times to create the list to be created on the array.

6. Upload the list of probes onto the Agilent SureDesign website for custom CGH array.

7. Use probe group to design Agilent custom CGH array.
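
Steps 2 to 5 of the array design can be sketched as below. The sketch assumes that the two restriction fragments are oriented so that fragment 1 ends, and fragment 2 begins, at the restriction site; the repeat check is passed in as a hypothetical `is_repeat` callable because the specification does not say how repeats are identified. All names are illustrative.

```python
def design_probe(fragment_1, fragment_2, flank=30):
    """Join the last `flank` bases of fragment 1 to the first `flank` of fragment 2."""
    if len(fragment_1) < flank or len(fragment_2) < flank:
        raise ValueError("fragments must be at least `flank` bases long")
    return fragment_1[-flank:] + fragment_2[:flank]


def build_array_list(ranked_interactions, positions, controls=(),
                     replicates=4, flank=30, is_repeat=lambda seq: False):
    """Take the top-ranked interactions, skip repeats, and replicate the list."""
    n_unique = positions // replicates          # step 3: x/4 interactions
    probes = []
    for frag_1, frag_2 in ranked_interactions:  # assumed sorted best first
        if len(probes) == n_unique:
            break
        # Step 4: exclude interactions whose 30bp regions are repeats.
        if is_repeat(frag_1[-flank:]) or is_repeat(frag_2[:flank]):
            continue
        probes.append(design_probe(frag_1, frag_2, flank))
    # Step 5: probes plus control probes, replicated 4 times.
    return (probes + list(controls)) * replicates


frag_a = "A" * 40 + "C" * 30
frag_b = "G" * 30 + "T" * 40
array_list = build_array_list([(frag_a, frag_b)] * 3, positions=8)
print(len(array_list), len(array_list[0]))
```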

Array Processing

1. Process samples using EpiSwitch™ Standard Operating Procedure (SOP) for template production.

2. Clean up with ethanol precipitation by array processing laboratory.

3. Process samples as per Agilent SureTag complete DNA labelling kit - Agilent Oligonucleotide Array-based CGH for Genomic DNA Analysis Enzymatic labelling for Blood, Cells or Tissues

4. Scan using Agilent C Scanner using Agilent feature extraction software.

EpiSwitch™ biomarker signatures demonstrate high robustness, sensitivity and specificity in the stratification of complex disease phenotypes. This technology takes advantage of the latest breakthroughs in the science of epigenetics, monitoring and evaluation of chromosome conformation signatures as a highly informative class of epigenetic biomarkers. Current research methodologies deployed in academic environments require from 3 to 7 days for biochemical processing of cellular material in order to detect CCSs. Those procedures have limited sensitivity and reproducibility and, furthermore, do not have the benefit of the targeted insight provided by the EpiSwitch™ Analytical Package at the design stage.

EpiSwitch™ Array in silico marker identification

CCS sites across the genome are directly evaluated by the EpiSwitch™ Array on clinical samples from testing cohorts for identification of all relevant stratifying lead biomarkers. The EpiSwitch™ Array platform is used for marker identification due to its high-throughput capacity, and its ability to screen large numbers of loci rapidly. The array used was the Agilent custom-CGH array, which allows markers identified through the in silico software to be interrogated.

EpiSwitch™ PCR

Potential markers identified by EpiSwitch™ Array are then validated either by EpiSwitch™ PCR or DNA sequencers (e.g. Roche 454, Nanopore MinION, etc.). The top PCR markers which are statistically significant and display the best reproducibility are selected for further reduction into the final EpiSwitch™ Signature Set, and validated on an independent cohort of samples. EpiSwitch™ PCR can be performed by a trained technician following an established standardised operating procedure protocol. All protocols and manufacture of reagents are performed under ISO 13485 and 9001 accreditation to ensure the quality of the work and the ability to transfer the protocols. EpiSwitch™ PCR and EpiSwitch™ Array biomarker platforms are compatible with analysis of both whole blood and cell lines. The tests are sensitive enough to detect abnormalities in very low copy numbers using small volumes of blood.

Example 1

This work concerns human epigenetic biomarkers which monitor physiological differences and predispositions associated with physical fitness training programs. Defined biomarkers have been discovered and evaluated to assist in the determination of epigenetic predisposition for either strength or endurance training, with monitoring after 4 weeks of mixed training and 8 weeks of specialized training.

Participant Recruitment:

Participants were recruited using posted flyers, electronic newsletter, local print and radio media, targeted recruitment at local athletics clubs and word of mouth. To be eligible for enrolment in the study, participants were required to meet the following 'performance' criteria.

Group 1: strength athlete: Participant should be a regular weight-lifter. Example provided: 100kg body mass and a current accumulated total of 550kg across bench-press + squat + deadlift exercises.

Group 2: fitness athlete: Participant should be a regular fitness athlete. Example provided: current 10km run time: <40 mins; or current 5km run time <19 mins.

Group 3: sedentary non-athlete: has not been participating in sport or any form of structured exercise that causes physical exertion for >3 years.

Study Enrolment

Requisite criteria for potential enrolment were subjective (participant) reporting of athletic ability in order to meet one of three distinct phenotypes: 1) strength athlete, 2) fitness athlete, 3) sedentary non-athlete. Subsequently, eighty-five (n=85) male participants aged 18-54 years provided written informed consent prior to enrolling in the study. To confirm that the enrolment criteria were met, comprehensive medical and athletic histories were obtained before the familiarisation protocol, blood sampling and performance tests were performed.

Familiarisation Protocol

Prior to physiological assessment, participants were acquainted with study procedures and personnel, and were provided with triaxial accelerometers (ActiGraph GT3X+, ActiGraph Corp), which were worn for 7 days in order to objectively determine participant physical activity.

Blood sampling

Following an overnight fast, morning blood samples were drawn from an antecubital vein by venepuncture, using a 22-gauge needle, into a 6ml EDTA (BD Vacutainer®) blood tube. The blood tubes underwent 12 gentle inversions and were immediately frozen at -80°C.

Anthropometrics

Height was determined using a portable stadiometer (Seca, Birmingham, U.K.). Body mass was measured to the nearest 0.1 kg by commercially available scales (body composition analyser TBF-300, Tanita, Tokyo, Japan). Total body fat percentage was calculated using bioelectrical impedance analysis (BIA) using a commercially available analyser (body composition analyser TBF-300, Tanita, Tokyo, Japan).

Strength Tests

For the 1RM tests, each subject attempted a weight that he believed could be lifted only once using maximum effort. The subject then added weight in increments of 2.2-4.5 kg until the heaviest load that could be successfully lifted once was determined. The subjects rested for approximately 3-5 minutes between attempts. The criterion for participant maximum strength was the combined 1RM max lifts (kg) for Squat + Bench Press + Deadlift exercises. Participant relative strength ratio was calculated as [Maximum Strength: Body Mass (kg:kg)].
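The relative strength ratio described above can be sketched in a few lines. This is an illustrative calculation only: the function name and the split of the 550 kg total into individual lifts are hypothetical, chosen to match the 100 kg / 550 kg example given in the recruitment criteria.

```python
def relative_strength_ratio(squat_kg, bench_kg, deadlift_kg, body_mass_kg):
    """Combined 1RM total (squat + bench press + deadlift) divided by body mass, in kg:kg."""
    if body_mass_kg <= 0:
        raise ValueError("body mass must be positive")
    return (squat_kg + bench_kg + deadlift_kg) / body_mass_kg


# Hypothetical split of a 550 kg combined total for a 100 kg lifter.
print(relative_strength_ratio(200, 150, 200, 100))  # 5.5
```

A ratio of 5.5 kg:kg would exceed the strength phenotype threshold of 4.5 kg:kg used for phenotype confirmation.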

Squat Exercise, 1 Rep Maximum (SQ 1RM)

During the SQ 1RM test, each lifter assumed an upright position, with the top of the bar not more than 3.0 cm below the top of the anterior deltoids. With both hands grasping the bar, the bar was removed from the rack, and the lifter moved back to assume a ready position, with knees extended, looking forward at the chief referee. On command, the lifter bent the knees and lowered the body in one smooth descent, until the top surface of the legs at the hip joint were lower than the top of the knees. The lifter then raised himself from the deepest point of the SQ to a standing position, with the knees extended. On command, the lifter replaced the weight back onto the rack with the aid of a spotter.

Bench Press, 1 Rep Maximum (BP 1RM)

The participant placed himself in a supine position, keeping his head, shoulders, and buttocks in constant contact with the weightlifting bench. The lifter's feet remained flat and motionless on the floor during the attempt. The participant received the bar at full arm's length from a spotter located behind the head of the bench. The bar was then lowered to the chest at a point 1-2 cm below the nipple line. When the bar became motionless on the chest, a "press" command was issued and the participant extended his arms, returning the weight to its starting position. Once the arms were completely extended, the chief referee gave a "rack" command, and the spotter aided the participant in returning the weight to the racks on the bench.

Dead-lift, 1 Rep Maximum (DL 1RM)

For the DL 1RM, the participant's feet and hands were spaced evenly from the centre of the bar and were allowed to be placed close to its centre (power style), or farther from the centre (sumo style). The participant lifted the bar vertically from the floor with one smooth motion until the knees and back extended the body to an erect position. When the knees became fully extended, upon command, the participant lowered the bar to the floor.

Peak Aerobic Capacity

Peak aerobic capacity (VO2 peak) was obtained by indirect calorimetry on an electronically braked cycle ergometer (Velotron, RacerMate, Seattle, WA, USA). Gas exchange was collected throughout the test using a metabolic cart (Moxus, AEI Technologies, Pittsburgh, PA, USA). The test consisted of unloaded pedalling for 1 min, followed by a step-wise increase to 50 W for 2 min. Subsequently, work rate was increased by 30 W min⁻¹ until the participant reached volitional fatigue (determined by the inability to maintain a minimum cadence above 60 rpm; blood lactate >7 mmol.l⁻¹; respiratory exchange ratio >1.15; reaching >90% of age-predicted heart rate maximum). VO2 peak values were confirmed as the highest value during the final stage of the ramp protocol. Work rate (WR) was collected continuously throughout the test and peak aerobic power was calculated using the average WR from the last 30 s of the test.

Phenotype Confirmation

Group 1: Participants were required to meet at least the 90th percentile for strength.

STRENGTH PHENOTYPE: Relative Strength Ratio* (RSR) > 4.5 kg:kg
*RSR = [(SQ1RM kg + BP1RM kg + DL1RM kg) ÷ Body Mass (kg)]

Group 2: Participants were required to meet at least the 90th percentile for aerobic capacity.

FITNESS PHENOTYPE: VO2 peak > 51.4 ml.kg⁻¹

Group 3: Participants were required to be on or below the 50th percentile for strength and aerobic capacity.

Strength: Relative Strength Ratio* (RSR) < 3 kg:kg
Fitness: 50th Percentile: VO2 peak < 40.8 ml.kg⁻¹

Samples and Processing

The samples used in the study are shown in Table 1. The EpiSwitch™ template was prepared for each of the samples using the EpiSwitch™ extraction procedure. The 3C template library was quantified and the amount standardised to 1 ng/µl. A serial dilution was created and Nested PCR performed according to the EpiSwitch™ protocols.

Nested PCR was performed using the created serial dilutions for each sample for all 65 markers identified in part 2. For each marker the appropriate controls were included: a no template control (NTC, all other reagents minus any DNA template) to monitor for any potential contamination of PCR reagents, and a genomic control (negative control) to ensure that the PCR products being detected are specific 3C products. The Nested PCR was analysed using high throughput capillary gel electrophoresis (LabChip GX Touch HT, Perkin Elmer) to identify and size the PCR products.

Training Response Annotations

The Nested PCR data were analysed using the retrospective annotation of end-point outcome for high and low response to exercise. To generate the annotations for each training group, each individual's increase in physiological measurements due to undertaking the specified training regime was ranked. The top 5 individuals in the ranking were classed as High responders for that training regime, and the bottom 5 individuals were classed as Low responders (see Table 2).
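The ranking-based annotation can be illustrated with a short sketch. The function name, participant identifiers and the "Unannotated" label for mid-ranked individuals are assumptions introduced for illustration; only the top-5 / bottom-5 rule comes from the text.

```python
def annotate_responders(gains, n=5):
    """Label the n largest gains 'High' and the n smallest 'Low'.

    gains: dict mapping a participant id to the increase in the
    physiological measure over the training regime.
    """
    if len(gains) < 2 * n:
        raise ValueError("need at least 2 * n participants")
    # Rank from largest to smallest gain (ties keep insertion order).
    ranked = sorted(gains, key=gains.get, reverse=True)
    labels = {pid: "Unannotated" for pid in ranked}
    for pid in ranked[:n]:
        labels[pid] = "High"
    for pid in ranked[-n:]:
        labels[pid] = "Low"
    return labels


# Hypothetical gains for 12 participants (arbitrary units).
example = {f"P{i}": float(i) for i in range(12)}
labels = annotate_responders(example)
print([pid for pid, lab in labels.items() if lab == "High"])
```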

Results Overview

The analysis for part three of the project has been performed in two separate ways. The first is a direct analysis for predictive capability of 65 markers identified through the response analysis in part two and now screened in the third part of the project. The second is a parallel analysis of all the top 131 markers originally translated from EpiSwitch™ array of the high achieving Endurance and Strength Athletes. In this analysis the markers were evaluated strictly for their predictive potential using their baseline readouts and retrospective annotations of end-point outcome as High and Low response to specialized training by the end of 8 weeks of training. Figure 1 provides a graphical representation and overview of the analysis. When compared, the two analysis streams have an overlap of 40.8%. The second analysis identifies 38 additional markers from the 131 original markers at the start of part 2 that are prime candidates to be predictive markers for training response. The overlap is shown in Figure 2, with a circle marking the markers identified in the section 2 analysis stream that have not been processed. [Figure 2 shows a Venn diagram of 80 potential markers identified in Analysis 2 compared to the 65 markers identified in part 2 and screened in part 3 based on exercise response]

Analysis 1

Analysis 1 was based on 3 comparisons. These three questions were designed to identify the markers from the 65 that were originally filtered on their responsiveness to mixed exercise and were predictive of an individual's outcome to specific training regimes or training in general. To that end the nested PCR data was analysed in three stratification groups:

1. High versus Low response (H/L) for Strength Training - to identify markers that are predictive at baseline of the end-point outcome of high or low response to Strength training.

2. High versus Low response (H/L) for Cardiovascular Endurance Training - to identify markers that are predictive at baseline of the end-point outcome of high or low response to Endurance training.

3. High versus Low (H/L) response independent of training, grouping Strength and Endurance together - to identify markers that are predictive at baseline of the end-point outcome of a response to any training program.

The annotations of the outcome response to the training regimes at 8 weeks (after 4 weeks of specific cardio or strength training) were used to identify markers that are statistically significant for exercise response for the individual training regimes; the third group looked at markers that were significant in both training programs.

To ensure the markers were predictive of training response H/L outcome and not just response markers, the binary nature of the marker at baseline and 8 weeks was investigated. Only markers that showed concordance between the two time points were selected. Filtering chromosome conformations that are stably detected before and after training that also show statistical significance between low and high response to the training programs selects for high quality markers that represent an inherent stable regulation framework in individuals that pre-disposes them to physiologically advance well or poorly to the final outcome of physical training of particular type.

Out of the 65 markers filtered through exercise response to 4 weeks of mixed training, 17 were found to be statistically significant. These are detailed in Tables 3 and 4. These 17 markers represent a pool of high quality markers for the predisposition of training response in individuals at baseline (naive) before training commences. The odds ratio shown in Table 3 is the measure of how strongly the absence and detection of the individual marker is associated with High/Low predicted outcome response to training in the sample population. An odds ratio of 1 indicates there is no difference between the two subpopulations (High and Low response). The data strongly suggest that the 17 discovered predictive markers are strongly associated with the successful outcome of training in the sample population, with odds ratios of between 2 and 12.
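As an illustration of how an odds ratio relates marker detection to High/Low response, the following sketch computes it from a 2x2 table of counts. The counts are hypothetical and are not taken from Table 3; the 0.5 continuity correction for empty cells is an assumption added so that the function does not divide by zero, and is not stated in the source.

```python
def odds_ratio(high_detected, high_absent, low_detected, low_absent):
    """Odds of detection in High responders divided by odds of detection in Low responders."""
    cells = [high_detected, high_absent, low_detected, low_absent]
    if any(c == 0 for c in cells):
        # Assumed Haldane-Anscombe style correction for empty cells.
        cells = [c + 0.5 for c in cells]
    hd, ha, ld, la = cells
    return (hd * la) / (ha * ld)


# Hypothetical counts for 5 High and 5 Low responders.
print(odds_ratio(4, 1, 2, 3))  # 6.0, marker detected more often in High responders
print(odds_ratio(2, 3, 2, 3))  # 1.0, no difference between the two groups
```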

Each marker was also individually assessed using Welch's t-test. This is a statistical test used to compare two sample populations and ascertain whether the populations have equal means; equal means indicate that the two populations are not significantly different. The p values shown in Table 5 are a measure of confidence that the inequality in the population means is due to actual differences and not to chance sampling.

A p value cut-off of 0.3 is used to determine whether the difference in detection of chromosome conformations is statistically significant. As will be seen in other literature, this differs from the normally used value of 0.05. The 0.3 limit was experimentally derived to ensure the capture of markers that provide useful information gain, for example in the machine learning classifiers, strengthening the classifier performance.
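A minimal sketch of the Welch statistic, assuming binary detection readouts (1 = detected, 0 = not detected) for two groups, is given below. The readouts are hypothetical. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom only; converting these to a p value requires the Student t distribution, which is not implemented here.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical detection readouts for one marker.
high = [1, 1, 1, 0, 1]
low = [0, 0, 1, 0, 0]
t, df = welch_t(high, low)
print(round(t, 3), round(df, 1))
```

If SciPy is available, `scipy.stats.ttest_ind(high, low, equal_var=False)` performs the same test and also returns a p value, which could then be compared against the chosen cut-off.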

The significant p values for each marker show the statistical significance of the odds ratios (OR) between training types in Table 3. The data in Tables 3 and 4 demonstrate quality and robustness of the 17 predictive markers identified.

Binary data from the marker OBD142_081.083_1.4x, as shown in Table 5 gives a useful example of both the higher detection rates between High and Low responders to exercise, in this case cardio endurance training and the predisposition already present at baseline before training commences. The data also represents a documented phenomenon of interference by mixed training causing disruption to the regulation before the chromatin regulation is reprogrammed back to its original predisposition state.

We identified the regulatory framework for predictive advantage and predisposition of High response outcome in individuals as being present at onset. The regulatory framework in the Low responders either reprogrammes to match that of the High responders, which they inherently possess already, or fails to change at all. The markers represented in Figure 3 are all markers for high response in strength training. Figure 6. showing changes in detection for specific markers from baseline and after 8 weeks of training (4 weeks mixed, 4 weeks strength training), Y axis detection state 0 = no detection, 1 = detection, X axis 1 = baseline, 2 = 8 weeks. Figure 3A shows a marker that is fixed in High responders, with Low responders reprogramming so that the marker becomes detectable due to the training programme. Figure 3B shows a marker that is fixed in High responders, with Low responders reprogramming so that the marker becomes undetectable due to the training programme. Figure 3C shows a marker that is fixed in High responders and also fixed in Low responders, showing no modification of the chromatin landscape for this particular interaction.

Three of the 17 markers identified are anchored in the 3' UTR of the DKK3 genetic locus (Table 6). It is evident that the DKK3 3' UTR is a hub of epigenetic control for this specific genetic locus, with differential genomic architecture changing the regulation and accessibility and creating predisposition for training response.

Analysis 2

The rationale for the second analysis was to look at the 131 markers screened at baseline in part 2 against retrospective annotations of end-point outcome of High or Low response to specialised training after 8 weeks. Many potential predictive markers may have been excluded in Analysis 1 as they were largely insensitive to the 4 weeks of mixed training. In contrast to analysis 1, which identified predictive markers with additional feature of responsiveness to mixed exercise, Analysis 2 was designed only to identify markers that were predictive of an individual's response to specific training programs.

Analysis 2 was based on the same three comparisons.

1. High versus Low response for Strength Training - to identify markers that are predictive at baseline of the end-point outcome of high or low response to Strength training.

2. High versus Low response for Cardiovascular Endurance Training - to identify markers that are predictive at baseline of the end-point outcome of high or low response to Endurance training.

3. High versus Low response independent of training, grouping Strength and Endurance together - to identify markers that are predictive at baseline of the end-point outcome of a response to any training program.

Welch's t-test was used to investigate the three comparisons. The test was cross validated by performing 1000 repeats with randomised sample selection. This gives 1000 different sample populations to test for the same 131 markers. Resampling and cross validation are used to ensure that the markers can generalise to an independent data set. The analysis output statistics for the predictive markers are shown in Tables 7, 8 and 9. When duplications between the comparisons were removed, the analysis identified 80 potential predictive markers. These 80 contained all 17 good predictive markers identified in Analysis 1.

Figure 4 shows the analysis overlap. The circle on the left represents 38 additional predictive markers from Analysis 2. The circle in the middle shows the 42 overlapping predictive markers that have been investigated in parts 2 and 3. These include the 17 markers detailed in Analysis 1. The circle on the right shows 23 markers investigated in parts 2 and 3 that are Responsive markers to training. The composition of the 80 markers identified in Analysis 2 is shown in the Venn diagram in Figure 5.
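The marker counts above can be cross-checked with simple arithmetic. The sketch below assumes that the stated overlap between the two analysis streams is the shared markers as a fraction of all distinct markers in either stream; this is one reading that reproduces the reported 40.8%, not a definition given in the source.

```python
only_analysis_2 = 38   # additional predictive markers found only in Analysis 2
shared = 42            # predictive markers common to both streams
only_part_2 = 23       # responsive markers investigated only in parts 2 and 3

analysis_2_total = only_analysis_2 + shared   # 80 markers from Analysis 2
part_2_total = shared + only_part_2           # 65 markers screened in part 3

# Assumed definition: shared markers over the union of both marker sets.
overlap = shared / (only_analysis_2 + shared + only_part_2)
print(analysis_2_total, part_2_total, f"{overlap:.1%}")  # 80 65 40.8%
```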

It is important to mention that, as an independent validation of the methodology in general and of individual markers in particular, the top significant markers with predictive power for strength or endurance training outcome have been successfully translated into markers in an equine study. Based on the evolutionary conservation of the genetic and epigenetic regulatory mechanisms underlying those markers, the equine study of stayers and sprinters achieved 93.75% accuracy in classification of the development cohort of 32 (Figure 6, which shows confusion matrix results).

Conclusions

1. Existing chromosome conformation signatures can be predictive of later response outcome to training phenotypes. This is consistent with a number of validated studies for development of the predictive biomarkers to response to various treatments.

2. Statistically significant discriminating EpiSwitch™ markers identified and evaluated in strength/endurance training for human participants have been successfully translated into statistically significant equine markers for stratification of verified Sprinters and Stayers.

3. The 10 discovered markers classify between Sprinter and Stayer phenotypes with an accuracy of 93.75% on the available set of 32 validated samples.

Analysis 2 has identified 80 strong predictive markers. These include all 17 predictive/response markers reported earlier in Analysis 1 and an additional 38 markers that are predictive for H/L outcome to training response phenotypes when measured at baseline.

We have successfully identified and evaluated significant predictive biomarkers for H/L outcome of the strength/endurance training. These results achieve the objective of the project.

• 17 high quality robust predictive stable markers, with sensitivity to mixed training, have been developed for use in baseline prediction of an individual's response outcome to specialized training regimes

• A further 38 strictly predictive markers, independent of response to mixed training, have been identified and evaluated at baseline, for use in baseline prediction of an individual's response outcome to specialized training regimes.

Example 2

Identification of validated signatures containing binary CCSs, which are either present or absent, as conditional biomarkers of epigenetic regulation in equine individuals in strength or endurance training.

We identified CCS biomarkers that successfully distinguish between Thoroughbreds trained as either sprinters or stayers. These markers can be used to determine predisposition and to monitor young unraced Thoroughbreds as they are selected for and undergo the training programs for sprinters and stayers. The markers are relevant to training potential, physiological monitoring and epigenetic reprogramming through training.

Project Approach

The top 50 EpiSwitch™ markers identified from the work in humans were translated into EpiSwitch™-designed equine nested PCR assays and tested on the EpiSwitch™ PCR platform using Thoroughbreds with a defined phenotype, 1) sprinting or 2) long-distance (Stayers), to evaluate the best discriminating markers for Sprinters and Stayers.

Samples

A total of 48 Samples were used as shown in Table 10:

• 16 Untrained young Thoroughbreds.

• 16 Trained Thoroughbreds classed as Sprinters.

• 16 Trained Thoroughbreds classed as Stayers.

The sex distribution between the sprinter and stayer sample types is balanced. The untrained male samples are skewed towards Colts against their castrated peers (Gelding), table 4. This skew will be accounted for in the biostatistical analysis.

Genome Conversion

The 131 markers identified on the Human array and translated to Nested PCR were translated from the GRCh38 Human Genome assembly to the EquCab2.0 Equine genome assembly using LiftOver (UCSC). In total, 114 of these were successfully translated. OBD's internal primer design application was used to design Nested primer sets for PCR; 65 of the 114 markers met the strict criteria for successful primer design according to EpiSwitch™ operational procedures and methodology. The top 50 markers were selected for EpiSwitch™ equine nested PCR interrogation.

Nested PCR screening

The EpiSwitch™ template was prepared for each of the samples using the EpiSwitch™ extraction standard operating procedure. The 3C template library was quantified and the amount standardised to 1 ng/µl. A serial dilution was created and Nested PCR performed according to the EpiSwitch™ protocols.

Nested PCR was performed using the created serial dilutions for each sample for all 50 markers selected. The Nested PCR was analysed using high throughput capillary gel electrophoresis (LabChip GX Touch HT, Perkin Elmer) to identify and size the PCR products. The data for the 16 Sprinters and 16 Stayers were analysed using Boschloo's test with resampling. The top 17 markers are shown in Table 13. In total, 17 of the 50 markers investigated were found to be statistically significant using a Boschloo p value cut-off of 0.2. Importantly, this selection was produced using individual dilution titres, so the p value for each marker is based on its specific titre. Resampling of the data set was performed 100 times with a 66.7% partition: from the dataset of 32, 22 samples were selected at random and the statistical tests performed. The median result of the 100 repeats is documented in Table 13. These 17 markers represent interactions spaced around 16 genomic loci. The markers OBD154_045/047 and OBD154_049/051 are unique interactions that span the same genomic location (see Table 4). The data show that a number of translated markers, discovered originally in a human cohort, are also applicable to horses.
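The resampling scheme described above (100 repeats, 22 of 32 samples per repeat, median reported) can be sketched generically. The helper below is hypothetical and takes the statistic as a callable; the example uses a simple difference in detection rates as a stand-in, not Boschloo's test, and the detection counts are invented for illustration.

```python
import math
import random
from statistics import median


def resampled_median(samples, stat_fn, n_repeats=100, fraction=2 / 3, seed=0):
    """Median of stat_fn over repeated random subsets drawn without replacement."""
    rng = random.Random(seed)
    k = math.ceil(len(samples) * fraction)  # 22 when there are 32 samples
    return median(stat_fn(rng.sample(samples, k)) for _ in range(n_repeats))


def detection_rate_difference(subset):
    """Stand-in statistic: Sprinter detection rate minus Stayer detection rate."""
    sprinters = [d for label, d in subset if label == "Sprinter"]
    stayers = [d for label, d in subset if label == "Stayer"]
    return sum(sprinters) / len(sprinters) - sum(stayers) / len(stayers)


# Hypothetical binary readouts for one marker: 12/16 Sprinters, 4/16 Stayers.
data = ([("Sprinter", 1)] * 12 + [("Sprinter", 0)] * 4
        + [("Stayer", 1)] * 4 + [("Stayer", 0)] * 12)
print(resampled_median(data, detection_rate_difference))
```

To follow the text more closely, `stat_fn` could instead build a 2x2 table from each subset and return the p value from an exact test such as `scipy.stats.boschloo_exact`, assuming a SciPy version that provides it.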

MTFR1 (mitochondrial fission regulator 1) is notable from amongst identified loci as this was indicated in the most significant region from the Horse genetics for selection in thoroughbred Horses. Mitochondrial function and the management of reactive oxygen species are important for favorable exercise responses and therefore MTFR1 may impose strong selection pressure in Thoroughbreds by protection of mitochondria-rich tissue against oxidative stress. As such it is an excellent region for racing performance and endurance.

Classification

A classifier based on the Sprinter vs Stayer markers identified on the trained samples was developed to enable the classification of the untrained samples. Glmnet (Lasso and Elastic-Net Regularized Generalized Linear Models) was used to rank the markers by their ability to classify between sample types. The top 14 markers identified in Glmnet were then compared to the top 11 markers based on Boschloo's p value (Table 13). As shown in Figure 7, 10 of the markers showed concordance and were shared between the two statistical analyses. The 10 concordant markers were selected to develop the classifier. XGBoost was used to model the markers and build a classifier.

The XGBoost model is an ensemble-based classifier that uses a series of weak classifying models to produce one overall strong model. In this gradient boosting methodology an initial model is created from the training data; a second model is then created that attempts to correct any errors in classification from the first model. This process is repeated n times to produce a final model that will classify the training set. A process of early stopping, ending the classifier build chain earlier than the initially set n value, is used to prevent the model overfitting the training set. A classification of the training set based on Sprinters and Stayers is shown in Figure 6. Based on the 10 markers, the EpiSwitch™ test classifies well between Sprinters and Stayers, with an accuracy of 93.75%.
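The boosting idea described above can be illustrated with a small pure-Python sketch. This is not XGBoost and not the classifier used in the study: it is a simplified gradient boosting loop with one-marker decision stumps, a logistic loss and early stopping on a validation set. All function names, the learning rate, the patience value and the toy data are assumptions made for illustration.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _log_loss(y, scores):
    eps = 1e-12
    total = 0.0
    for yi, s in zip(y, scores):
        p = min(max(_sigmoid(s), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def _fit_stump(X, residuals):
    """Least-squares stump on one binary marker: (index, value if detected, value if absent)."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j]]
        off = [r for x, r in zip(X, residuals) if not x[j]]
        v_on = sum(on) / len(on) if on else 0.0
        v_off = sum(off) / len(off) if off else 0.0
        sse = sum((r - v_on) ** 2 for r in on) + sum((r - v_off) ** 2 for r in off)
        if best is None or sse < best[0]:
            best = (sse, j, v_on, v_off)
    return best[1:]


def _apply(stump, x):
    j, v_on, v_off = stump
    return v_on if x[j] else v_off


def boost(X_train, y_train, X_val, y_val, n_rounds=50, learning_rate=0.5, patience=5):
    """Fit stumps to the residual errors of the current ensemble, with early stopping."""
    stumps = []
    f_train = [0.0] * len(X_train)
    f_val = [0.0] * len(X_val)
    best_loss, best_len = float("inf"), 0
    for _ in range(n_rounds):
        # Errors of the current ensemble (negative gradient of the logistic loss).
        residuals = [y - _sigmoid(f) for y, f in zip(y_train, f_train)]
        j, v_on, v_off = _fit_stump(X_train, residuals)
        stump = (j, learning_rate * v_on, learning_rate * v_off)
        stumps.append(stump)
        f_train = [f + _apply(stump, x) for f, x in zip(f_train, X_train)]
        f_val = [f + _apply(stump, x) for f, x in zip(f_val, X_val)]
        loss = _log_loss(y_val, f_val)
        if loss < best_loss:
            best_loss, best_len = loss, len(stumps)
        elif len(stumps) - best_len >= patience:
            break  # early stopping: validation loss has stopped improving
    return stumps[:best_len]


def predict_proba(stumps, x):
    return _sigmoid(sum(_apply(s, x) for s in stumps))


# Toy data: two binary markers; label 1 (e.g. Sprinter) tracks marker 0.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
model = boost(X, y, X, y)
print(predict_proba(model, [1, 0]) > 0.5, predict_proba(model, [0, 1]) < 0.5)
```

In practice the validation set would be held out from the training samples; the toy example reuses the training data only to keep the sketch short.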

The probability calls of the 10 markers classifier are shown in Table 15. By default, a probability score >0.5 by the classifier is considered a call for a Sprinter, while < 0.5 is considered a call for Stayer. On the basis of the results of classification in Table 15, the original probability scores (>0.5 for Sprinter, <0.5 for Stayer) were adjusted for quality classification calls on naive samples from young Thoroughbreds. The established cut offs for the classifier calls are shown in Table 16.
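The calling rule can be sketched as a simple threshold function. The default cut-off of 0.5 comes from the text; the adjusted cut-offs of 0.7 and 0.3 used in the example are hypothetical placeholders, since the values of Table 16 are not reproduced here.

```python
def call_phenotype(prob_sprinter, sprinter_cutoff=0.5, stayer_cutoff=0.5):
    """Map a classifier probability to a call, leaving a band between the cut-offs unclassified."""
    if stayer_cutoff > sprinter_cutoff:
        raise ValueError("stayer_cutoff must not exceed sprinter_cutoff")
    if prob_sprinter > sprinter_cutoff:
        return "Sprinter"
    if prob_sprinter < stayer_cutoff:
        return "Stayer"
    return "Unclassified"


print(call_phenotype(0.8))            # default cut-off of 0.5
print(call_phenotype(0.6, 0.7, 0.3))  # hypothetical adjusted cut-offs
```

Widening the gap between the two cut-offs trades fewer calls for higher certainty in the calls that are made.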

The classifier was applied to the data for the naive samples from young Thoroughbreds with the call results in Table 17. For 4 of the samples the classifier was unable to call the phenotype with high enough certainty and they fell into the unclassified limits. The remaining 12 samples were successfully called with a prediction of epigenetic profile conducive for the potential good Sprinter or Stayer. The 10 discovered markers classify between Sprinter and Stayer phenotypes with an accuracy of 93.75% on the available set of 32 validated samples.

Example 3

Further Work on the Human Study

A multinomial Glmnet regression analysis was performed based on three annotation groups, High responder, Medium responder and Low responder to each training type. Each CCS identified as predictive at baseline relative to training response outcomes was then compared across the time points for the sedentary controls to remove any CCS that showed variation due to effects other than the training under investigation. In total 18 CCS were identified as predictive for response to Strength training and 7 for response to Endurance training. Two of the CCS are shared between the training types (Mixed). Results are shown in tables 31 and 32.

Loop detected, or EpiSwitch marker present, are strict categories of predictive and stable markers for strength or endurance training. 'High-responder' markers are parts of the epigenetic profile that are conducive to a very good physiological response to the training program (using VO2 maximum and one-repetition maximum strength tests), i.e. responders in our use of the term. 'Low-responder' is the data analysis term that marks the stable predictive markers for individuals who will not change their physiological performance much after the training program ('non-responders').

Considering that 'high response' is the physiological data analysis term for 'responder' biomarker, and 'low-response' for 'non-responders', the majority of predictive response biomarkers are strength or endurance specific. However, several biomarkers reflect a dual function and provide information for good response in both endurance and strength.

The loop detected refers to array data, however Table 31 refers to strength and Table 32 refers to endurance based on PCR data. In regards to marker definitions and marker categories used by data analysis on array data:

Strength control: these are just reference controls on the untrained side - readouts on the group of sedentary individuals before training at baseline. This is a physiological baseline control group. Endurance control: same as above, but for endurance.

Strength training: these are reference controls, retrospective and ultimate positive controls, on high achievers in strength, read out on the group of strength trained individuals after successful training. This is a physiological, positive control group.

Endurance training: same as before but for endurance.

This group of markers should be evaluated based on the following rules: the high responder markers are predictive and stable (from baseline onwards) responder markers, while low responder markers are the opposite side - predictive and stable markers of non-response to training. Generally, predictive markers are specific for either strength or endurance, as they were developed separately on either strength or endurance programmes and control groups. However, there is some overlap, probably reflecting physiological overlap of the relevant regulatory networks - several markers are good predictive markers for good response in both strength and endurance training.

Overall Review of the Work

The original pool of genetic sites on the screening array was comprised of 1) inflammatory genes, 2) loci associated with micro-satellite differences in Thoroughbred horses and translated into humans.

Thoroughbreds are highly inbred and the input of genetic component into phenotypical differences is highly limited in those subjects - this was considered the right model to search for any associations with epigenetic control sites, in terms of chromosome conformations. The fact that statistically significant sites are associated broadly with some of the well-known genes functionally linked to strength and endurance is the consequence of an independent selection. Importantly, screening of the potential candidates for chromosome conformation markers was driven by proprietary EpiSwitch annotations within non-coding parts of the genome within 100 kb windows of the referenced loci: upstream, downstream or encompassing the whole genes within the chromatin domain. At any part of the selection of CCS marker leads, no comparative assessment for any gene expression of any genes in the vicinity were made to assist the selection, thus excluding any link or correlation to gene expression changes of any of the genes and their known functional link to physical training outcomes.

Tables 34 to 36 show preferred powerful markers. The odds ratio shown in Table 36 is the measure of how strongly the absence and detection of the individual marker is associated with High/Low predicted outcome response to training in the sample population. An odds ratio of 1 indicates there is no difference between the two subpopulations (High and low response). The data strongly suggest that 17 discovered predictive markers are strongly associated with the successful outcome of training in the sample population with odds ratios of between 2 and 12. Each marker was also individually assessed using Welch's t-test. This is a statistical test used to compare two sample populations and ascertain if the populations have equal means. An equal mean

demonstrating that the two populations are not significant different. The p values shown are a measure of confidence that the inequality in the population means is due to actual differences and not be chance sampling.

We used a p value cut-off of 0.3 to determine whether the difference in detection of chromosome conformations is statistically significant. This differs from the value of 0.05 normally used in other literature. Based on our previous projects, assay type and know-how, the 0.3 limit was derived experimentally to ensure the capture of markers that provide useful information gain in the machine learning classifiers, strengthening classifier performance. The significant p values for each marker show the statistical significance of the odds ratios (OR) between training types in this table. The data in this table demonstrate the quality and robustness of the 17 predictive markers identified.
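The effect of the cut-off on marker selection can be sketched as follows. The marker names and p values are invented for illustration, and treating a p value equal to the cut-off as passing is an assumption, since the text does not say whether the limit is inclusive.

```python
P_VALUE_CUTOFF = 0.3  # limit stated in the text (0.05 is the conventional value)


def select_markers(p_values, cutoff=P_VALUE_CUTOFF):
    """Return the names of markers with p <= cutoff, ordered by ascending p value.

    Assumption: the cut-off is inclusive.
    """
    kept = [(p, name) for name, p in p_values.items() if p <= cutoff]
    return [name for p, name in sorted(kept)]


# Hypothetical p values, not taken from any table in this document.
example_p_values = {
    "marker_A": 0.02,
    "marker_B": 0.27,
    "marker_C": 0.45,
    "marker_D": 0.30,
}

# Markers passed on to the classifier under the relaxed limit.
relaxed = select_markers(example_p_values)
# Markers that would remain under the conventional 0.05 limit.
conventional = select_markers(example_p_values, cutoff=0.05)
```

In this invented example the relaxed limit keeps three markers and the conventional limit keeps one, which shows how a looser threshold leaves more candidate features for a classifier to weigh.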

Figure 15 is a Venn diagram of significant markers for CV and Strength at baseline and at 8 weeks. These markers are shown in Table 40.
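The regions of such a Venn diagram can be computed from the four marker sets. The sketch below is an illustration only: the group labels follow the figure description, while the marker identifiers are hypothetical and are not taken from Table 40.

```python
from collections import defaultdict


def venn_regions(groups):
    """Map each combination of group names to the markers found in exactly those groups."""
    # First record, for every marker, the set of groups it belongs to.
    membership = defaultdict(set)
    for group_name, markers in groups.items():
        for marker in markers:
            membership[marker].add(group_name)
    # Then collect markers that share the same combination of groups.
    regions = defaultdict(set)
    for marker, names in membership.items():
        regions[frozenset(names)].add(marker)
    return dict(regions)


# Hypothetical marker identifiers used only to exercise the function.
example_groups = {
    "CV baseline": {"m1", "m2"},
    "CV 8 weeks": {"m2", "m3"},
    "Strength baseline": {"m2", "m4"},
    "Strength 8 weeks": {"m3"},
}
example_regions = venn_regions(example_groups)
```

Each key of `example_regions` corresponds to one region of the diagram, so a marker significant in several groups appears once, under the combination of groups that contain it.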

Figure 16 shows baseline CV markers based on the High and Low CV groups, with exact p values for association. These are the top 10 significant markers for the subjects at baseline based on the CV ranges. The range for Strength is also significant at baseline. This indicates the close association between the CV and Strength ranges in these markers.

Figure 17 shows baseline Strength markers based on the High and Low Strength groups, with exact p values for association. These are the top 20 significant markers for the subjects at baseline based on the Strength ranges. The range for CV is also significant at baseline. This indicates the close association between the CV and Strength ranges in these markers.

Figure 18 shows 10 EpiSwitch horse markers identified from a UK cohort. The figure shows the genomic location in horses, the homologous region in the human genome and the type of the marker: Sprinter or Stayer. The same 10 markers were used to classify the Singaporean horses.

Table 1

Table 2

Table 3

Table 4

Table 5

Table 6

Table 7 shows predictive markers for strength training response, which are preferably used for typing humans. (Present in strength training in humans)

Table 8 shows predictive markers for endurance training response, which are preferably used for typing humans. (Present in endurance training in humans).

Table 9 shows predictive markers for either strength or endurance training response, which are preferably used for typing humans. (Present in strength and endurance training in humans).

Table 10

Table 11

Table 12

Table 13 shows the top markers for Stayer versus Sprinter phenotype (n=32, 16 Stayer, 16 Sprinter), which are preferably used to type horses.

Table 14 shows markers discovered in humans that are applicable to horses, and the closest genomic loci, which can be used to type horses.

Table 15

Table 16

Table 17

Table 18 shows the informative markers from the equine study, which are preferably used to type horses.

Table 19

Table 20

* REMOVE - KALLMANS SYNDROME

Table 21.a

Table 21.b

Table 21.c

Table 21.d

Table 22 shows preferred markers from the equine study and the traits they relate to, which are preferably used to type horses.

Table 23 shows preferred markers from the equine study and the traits they relate to, which are preferably used to type horses.

Table 24.a [Table 24 shows markers identified in the equine study, which are preferably used to type horses.]

Table 24.b

Table 24.c

Table 24.d

Table 24.e

Table 24.f

Table 25.a

Table 25.b

Table 25.c

Table 25.d

Table 25.e

Table 25.f

Table 25.g

Table 25.h

Table 26

Table 27

Table 28

Table 29

Table 30.A1

Table 30.A2

Table 30.A3

Table 30.A4

Table 30.B1

Table 30.B2

Table 30.B3

Table 30.B4

Table 30.C1

Table 30.C2

Table 30.C3

Table 30.C4

Table 30.D1

Table 30.D2

Table 30.D3

Table 30.D4

Table 30.E1

Table 30.E2

Table 30.E3

Table 30.E4

Table 30.F1

Table 30.F2

Table 30.F3

Table 30.F4

Table 30.G1

Table 30.G2

Table 30.G3

Table 30.G4

Table 31.a

Table 31.b

Table 31.c

Table 31.d

Table 31.e

Table 31.f

Table 31.g

Table 31.h

Table 31.i

Table 32.a

Table 32.b

Table 32.c

Table 32.d

Table 32.e

Table 32.f

Table 32.g

Table 32.h

Table 32.i

Table 33.a1

Table 33.a2

Table 33.a3

Table 33.a4

Table 33.a5

Table 33.a6

Table 33.a7

Table 33.a8

Table 33.b1

Table 33.b2

Table 33.b3

Table 33.b4

Table 33.b5

Table 33.b6

Table 33.b7

Table 33.b8

Table 33.c1

Table 33.c2

Table 33.c3

Table 33.c4

Table 33.c5

Table 33.c6

Table 33.c7

Table 33.c8

Table 33.d1

Table 33.d2

Table 33.d3

Table 33.d4

Table 33.d5

Table 33.d6

Table 33.d7

Table 33.d8

Table 33.e1

Table 33.e2

Table 33.e3

Table 33.e4

Table 33.e5

Table 33.e6

Table 33.e7

Table 33.e8

Table 34.a

Table 34.b

Table 35

Table 36. Top Human Markers (from I to NIB)

Table 37 is shown below. Some parts of the table have been left out which relate to information shown in other tables.

Table 37.A1

Table 37.B1

Table 37.C1

Table 37.E1

Table 37.F2

Table 38 is shown below. Some parts of the table have been left out which relate to information shown in other tables.

Table 38.a

Table 39.a

Table 40

Table 41 shows markers identified in the human study, which are preferably used to type humans.