Title:
GENOMIC ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2020/095035
Kind Code:
A1
Abstract:
A method of identifying contribution of a genomic variation to a phenotypic feature is disclosed. The method comprises determining a degree of a genomic variation (51; Fig. ) in each individual in a set of individuals and recording the degree of genomic variation in each individual in a database, for example, in a first table (6). The method comprises determining a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product (53, 54; Fig. 5). The method comprises determining a first gene (51; Fig. 5) affected by the genomic variation. When the genomic variation affects a gene product directly, recording the gene, in the coding sequence of which the genomic variation is located, as the first gene, for example, in a second table (7). When the genomic variation regulates the production of a gene product, recording the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene. The method comprises determining an outcome of the genomic variation on the first gene and recording the outcome of the genomic variation in the database, for example, in the second table and determining the presence or absence of one or more other gene products (i) interacting with the first gene, (ii) encoded by the first gene; and/or (iii) regulating the first gene and, if present, recording the identity of the gene encoding the one or more other gene products in the database as the second gene, for example, in the second table.

Inventors:
KORCSMAROS TAMAS (GB)
BROOKS JOHANNE (GB)
CARDING SIMON R (GB)
Application Number:
PCT/GB2019/053128
Publication Date:
May 14, 2020
Filing Date:
November 05, 2019
Assignee:
EARLHAM INST (GB)
UNIV OF EAST ANGLIA (GB)
QUADRAM INST BIOSCIENCE (GB)
International Classes:
G16B5/00; G16B20/20
Foreign References:
US20130116930A12013-05-09
US20160048634A12016-02-18
US20030144799A12003-07-31
Other References:
JIANMING WU ET AL: "Human FasL Gene Is a Target of [beta]-Catenin/T-Cell Factor Pathway and Complex FasL Haplotypes Alter Promoter Functions", PLOS ONE, vol. 6, no. 10, 11 October 2011 (2011-10-11), pages e26143, XP055661194, DOI: 10.1371/journal.pone.0026143
DEZSO MÓDOS ET AL: "Neighbours of cancer-related proteins have key influence on pathogenesis and could increase the drug target space for anticancer therapies", NPJ SYSTEMS BIOLOGY AND APPLICATIONS, vol. 3, no. 1, 24 January 2017 (2017-01-24), pages 1 - 13, XP055660890, DOI: 10.1038/s41540-017-0003-6
YUNGUO GONG ET AL: "Polymorphisms in microRNA target sites influence susceptibility to schizophrenia by altering the binding of miRNAs to their targets", EUROPEAN NEUROPSYCHOPHARMACOLOGY., vol. 23, no. 10, 1 October 2013 (2013-10-01), NL, pages 1182 - 1189, XP055660932, ISSN: 0924-977X, DOI: 10.1016/j.euroneuro.2012.12.002
PATRICIA SARLOS ET AL: "Genetic update on inflammatory factors in ulcerative colitis: Review of the current literature", WORLD JOURNAL OF GASTROINTESTINAL PATHOPHYSIOLOGY, vol. 5, no. 3, 1 January 2014 (2014-01-01), pages 304, XP055498556, ISSN: 2150-5330, DOI: 10.4291/wjgp.v5.i3.304
OZ, L MOL. BIOL. AND EVOL., 2014
LAZAR ET AL., MOL. SYS. BIOL., 2013
LAZAR ET AL., NATURE COMM., 2014
BURKITT, M.D.HANEDI, A.F.DUCKWORTH, C.A.WILLIAMS, J.M.TANG, J.M.O'REILLY, L.A.PUTOCZKI, T.L.GERONDAKIS, S.DIMALINE, R.CAAMANO, J.H: "NF-κB1, NF-κB2 and c-Rel differentially regulate susceptibility to colitis-associated adenoma development in C57BL/6 mice", J PATHOL, vol. 236, 2015, pages 326 - 336
CAAMANO, J.HUNTER, C.A.: "NF-kappaB family of transcription factors: central regulators of innate and adaptive immune functions", CLIN MICROBIOL REV, vol. 15, 2002, pages 414 - 429
CROFT, D.MUNDO, A.F.HAW, R.MILACIC, M.WEISER, J.WU, G.CAUDY, M.GARAPATI, P.GILLESPIE, M.KAMDAR, M.R. ET AL.: "The Reactome pathway knowledgebase", NUCLEIC ACIDS RES, vol. 42, 2014, pages D472 - 7
ENRIGHT, A.J.JOHN, B.GAUL, U.TUSCHL, T.SANDER, C.MARKS, D.S.: "MicroRNA targets in Drosophila", GENOME BIOL, vol. 5, 2003, pages R1, XP021012829, DOI: 10.1186/gb-2003-5-1-r1
GINI, C., VARIABILITÀ E MUTABILITÀ (BOLOGNA: C. CUPPINI), 1912
GONG, Y.WU, C.N.XU, J.FENG, G.XING, Q.H.FU, W.LI, C.HE, L.ZHAO, X.Z.: "Polymorphisms in microRNA target sites influence susceptibility to schizophrenia by altering the binding of miRNAs to their targets", EUR NEUROPSYCHOPHARMACOL, vol. 23, 2013, pages 1182 - 1189
GOULD, N.J.DAVIDSON, K.L.NWOKOLO, C.U.ARASARADNAM, R.P.: "A systematic review of the role of DNA methylation on inflammatory genes in ulcerative colitis", EPIGENOMICS, vol. 8, 2016, pages 667 - 684
HUANG, H.FANG, M.JOSTINS, L.UMICEVIC MIRKOV, M.BOUCHER, G.ANDERSON, C.A.ANDERSEN, V.CLEYNEN, I.CORTES, A.CRINS, F. ET AL.: "Fine-mapping inflammatory bowel disease loci to single-variant resolution", NATURE, vol. 547, 2017, pages 173 - 178
JOSTINS, L.RIPKE, S.WEERSMA, R.K.DUERR, R.H.MCGOVERN, D.P.HUI, K.Y.LEE, J.C.SCHUMM, L.P.SHARMA, Y.ANDERSON, C.A. ET AL.: "Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease", NATURE, vol. 491, 2012, pages 119 - 124, XP055484363, DOI: 10.1038/nature11582
KANG, S.W.WAHL, M.I.CHU, J.KITAURA, J.KAWAKAMI, Y.KATO, R.M.TABUCHI, R.TARAKHOVSKY, A.KAWAKAMI, T.TURCK, C.W. ET AL.: "PKCbeta modulates antigen receptor signaling via regulation of Btk membrane localization", EMBO J, vol. 20, 2001, pages 5692 - 5702
KATOH, M.KATOH, M.: "Notch signaling in gastrointestinal tract (review)", INT J ONCOL, vol. 30, 2007, pages 247 - 251, XP009100925
KENT, W.J.SUGNET, C.W.FUREY, T.S.ROSKIN, K.M.PRINGLE, T.H.ZAHLER, A.M.HAUSSLER, D.: "The human genome browser at UCSC", GENOME RES, vol. 12, 2002, pages 996 - 1006, XP007901725, DOI: 10.1101/gr.229102. Article published online before print in May 2002
KIM, Y.S.HO, S.B.: "Intestinal goblet cells and mucins in health and disease: recent insights and progress", CURR GASTROENTEROL REP, vol. 12, 2010, pages 319 - 330
KINI, A.T.THANGARAJ, K.R.SIMON, E.SHIVAPPAGOWDAR, A.THIAGARAJAN, D.ABBAS, S.RAMACHANDRAN, A.VENKATRAMAN, A.: "Aberrant niche signaling in the etiopathogenesis of ulcerative colitis", INFLAMM BOWEL DIS, vol. 21, 2015, pages 2549 - 2561
KOZOMARA, A.GRIFFITHS-JONES, S.: "miRBase: integrating microRNA annotation and deep-sequencing data", NUCLEIC ACIDS RES, vol. 39, 2011, pages D152 - 7
KRUSKAL, J.B.: "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis", PSYCHOMETRIKA, vol. 29, 1964, pages 1 - 27, XP008130844
DE LANGE, K.M.MOUTSIANAS, L.LEE, J.C.LAMB, C.A.LUO, Y.KENNEDY, N.A.JOSTINS, L.RICE, D.L.GUTIERREZ-ACHURY, J.JI, S.-G. ET AL.: "Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease", NAT GENET, vol. 49, 2017, pages 256 - 261
LIU, C.CHENG, H.SHI, S.CUI, X.YANG, J.CHEN, L.CEN, P.CAI, X.LU, Y.WU, C. ET AL.: "MicroRNA-34b inhibits pancreatic cancer metastasis through repressing Smad3", CURR MOL MED, vol. 13, 2013, pages 467 - 478
MATHELIER, A.FORNES, O.ARENILLAS, D.J.CHEN, C.-Y.DENAY, G.LEE, J.SHI, W.SHYR, C.TAN, G.WORSLEY-HUNT, R. ET AL.: "JASPAR 2016: a major expansion and update of the openaccess database of transcription factor binding profiles", NUCLEIC ACIDS RES, vol. 44, 2016, pages D110 - 5
MCELHINNY, A.S.LI, J.L.WU, L.: "Mastermind-like transcriptional co-activators: emerging roles in regulating cross talk among multiple signaling pathways", ONCOGENE, vol. 27, 2008, pages 5138 - 5147
MORRIS, J.H.APELTSIN, L.NEWMAN, A.M.BAUMBACH, J.WITTKOP, T.SU, G.BADER, G.D.FERRIN, T.E.: "clusterMaker: a multi-algorithm clustering plugin for Cytoscape", BMC BIOINFORMATICS, vol. 12, 2011, pages 436, XP021093164, DOI: 10.1186/1471-2105-12-436
NEWMAN, M.E.J.GIRVAN, M.: "Finding and evaluating community structure in networks", PHYS REV E STAT NONLIN SOFT MATTER PHYS, vol. 69, 2004, pages 026113
PEDREGOSA, F.VAROQUAUX, G.GRAMFORT, A.MICHEL, V.THIRION, B.GRISEL, O.BLONDEL, M.PRETTENHOFER, P.WEISS, R.DUBOURG, V. ET AL.: "Scikit-learn: Machine Learning in Python", JOURNAL OF MACHINE LEARNING RESEARCH, 2011
PRAGER, M.BUETTNER, J.BUENING, C.: "Genes involved in the regulation of intestinal permeability and their role in ulcerative colitis", J DIG DIS, vol. 16, 2015, pages 713 - 722
SHERRY, S.T.WARD, M.H.KHOLODOV, M.BAKER, J.PHAN, L.SMIGIELSKI, E.M.SIROTKIN, K.: "dbSNP: the NCBI database of genetic variation", NUCLEIC ACIDS RES, vol. 29, 2001, pages 308 - 311, XP055125042, DOI: 10.1093/nar/29.1.308
TURATSINZE, J.-V.THOMAS-CHOLLIER, M.DEFRANCE, M.VAN HELDEN, J.: "Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules", NAT PROTOC, vol. 3, no. 1, 2008, pages 8 - 1588
TÜREI, D.KORCSMAROS, T.SAEZ-RODRIGUEZ, J.: "OmniPath: guidelines and gateway for literature-curated signaling pathway resources", NAT METHODS, vol. 13, 2016, pages 966 - 967
"UniProt Consortium (2015). UniProt: a hub for protein information", NUCLEIC ACIDS RES, vol. 43, 2015, pages D204 - 12
WU, L.GRIFFIN, J.D.: "Modulation of Notch signaling by mastermind-like (MAML) transcriptional co-activators and their involvement in tumorigenesis", SEMIN CANCER BIOL, vol. 14, 2004, pages 348 - 356
WU, F.ZIKUSOKA, M.TRINDADE, A.DASSOPOULOS, T.HARRIS, M.L.BAYLESS, T.M.BRANT, S.R.CHAKRAVARTI, S.KWON, J.H.: "MicroRNAs are differentially expressed in ulcerative colitis and alter expression of macrophage inflammatory peptide-2 alpha", GASTROENTEROLOGY, vol. 13, 2008, pages 1624 - 1635
WU, F.HUANG, Y.DONG, F.KWON, J.H.: "Ulcerative Colitis-Associated Long Noncoding RNA, BC012900, Regulates Intestinal Epithelial Cell Apoptosis", INFLAMM BOWEL DIS, vol. 22, 2016, pages 782 - 795
Attorney, Agent or Firm:
PIOTROWICZ, Pawel et al. (GB)
Claims:
1. A method of identifying contribution of a genomic variation to a phenotypic feature, the method comprising:

determining a degree of a genomic variation in each individual in a set of individuals and recording the degree of genomic variation in each individual in a database;

determining a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product;

determining a first gene affected by the genomic variation wherein:

when the genomic variation affects a gene product directly, recording the gene, in the coding sequence of which the genomic variation is located, as the first gene;

when the genomic variation regulates the production of a gene product, recording the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene;

determining an outcome of the genomic variation on the first gene and recording the outcome of the genomic variation in the database; and

determining the presence or absence of one or more other gene products:

(i) interacting with the first gene;

(ii) encoded by the first gene; and/or

(iii) regulating the first gene; and

if present, recording the identity of the gene encoding the one or more other gene products in the database as the second gene.

2. A method, comprising:

determining a degree of a genomic variation in each individual in a set of individuals and recording the degree of genomic variation in each individual in a database;

determining a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product;

determining a first gene affected by the genomic variation wherein:

when the genomic variation affects a gene product directly, recording the gene, in the coding sequence of which the genomic variation is located, as the first gene;

when the genomic variation regulates the production of a gene product, recording the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene; and

determining an outcome of the genomic variation on the first gene and recording the outcome of the genomic variation in the database.

3. A method of determining the presence or absence of one or more other gene products, wherein a degree of genomic variation in each individual in a set of individuals has been recorded in a database, a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product has been determined, a first gene affected by the genomic variation has been determined wherein, when the genomic variation affects a gene product directly, the gene has been recorded, in the coding sequence of which the genomic variation is located, as the first gene and, when the genomic variation regulates the production of a gene product, the gene has been recorded, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene, and an outcome of the genomic variation on the first gene has been determined and recorded in the database, the method comprising:

determining the presence or absence of one or more other gene products:

(i) interacting with the first gene;

(ii) encoded by the first gene; and/or

(iii) regulating the first gene; and

if present, recording the identity of the gene encoding the one or more other gene products in the database as the second gene, optionally in the second database.

4. A method as claimed in any one of claims 1 to 3, wherein a genomic variation that regulates the production of a gene product is a genomic variation located outside of a coding region of a gene and within the promoter region or other regulatory region of said gene, and transcription or transcription and translation of the gene has the effect of modifying the production of the gene product.

5. A method as claimed in any one of claims 1 to 4, wherein when the genomic variation regulates the production of a gene product, the first gene is identified by:

assessing the flanking sequences of the genomic variation for transcription factor binding sites (TFBS) and/or miRNA target sites (miRNA-TS);

classifying the effect of the genomic variation on the TFBS or miRNA-TS as loss or gain of binding site/target or a neutral change; and

identifying the gene corresponding to the loss or gain effect as the first gene.

6. A method as claimed in claim 5, wherein the flanking sequences are 50 bases upstream and downstream of the genomic variation.

7. A method as claimed in any one of claims 1 to 6, further comprising recording the gene product(s) encoded by the first gene in the database and/or recording the one or more other gene products in the database.

8. A method as claimed in claim 7, further comprising generating a genetic profile for each individual within the set of individuals, wherein the genetic profile comprises one or more of: the degree of genomic variation; the location of the genomic variation; the outcome of the genomic variation; the first gene; the second gene; the gene product(s) encoded by the first gene, and the one or more other gene products.

9. A method as claimed in any one of claims 1 to 8, further comprising clustering individuals within the set of individuals based on similarities in their genetic profile.

10. A method of identifying, from a group of individuals with a particular phenotypic feature, individuals with one or more common genomic variations using the method of claim 8 or claim 9.

11. A method of identifying, from a group of individuals with a particular phenotypic feature, an individual with a susceptibility to one or more treatment pathways using the method of claim 8 or claim 9.

12. A method of determining the susceptibility of an individual to a particular treatment pathway using the method of claim 8 or claim 9.

13. A method as claimed in claim 12, wherein the individual has one or more phenotypic features, and the treatment pathway is associated with one or more of the phenotypic features.

14. A method of assigning an individual to one of a certain number of treatment pathways, the method comprising obtaining a sample from each said individual and creating, for each sample, a genetic profile using a method according to claim 8.

15. A method of assigning an individual to one of a certain number of treatment pathways on the basis of their genetic profile as created using a method according to claim 8; and in particular, a method of assigning an individual with one or more phenotypic features to one of a certain number of treatment pathways on the basis of their genetic profile as created using a method according to claim 8, wherein the treatment pathway is relevant to one or more aspects of their genetic profile.

16. A method as claimed in any one of claims 1 to 15, wherein the genomic variation is one or more of an allelic variation, polymorphism or mutation in DNA or RNA, missense or non-synonymous, synonymous and nonsense mutations, insertions, deletions, substitutions, inversions, frameshift mutations, repeat expansions, duplications, copy number variations, point mutations, single nucleotide polymorphisms (SNPs), and epigenetic modifications.

17. A method as claimed in any one of claims 1 to 16, wherein the genomic variation is a SNP.

18. A method as claimed in any one of claims 1 to 17, wherein the phenotypic feature is a disease, clinical condition or pathology, a stage of a disease, clinical condition or pathology; a marker of a disease, clinical condition or pathology, a response to treatment of a disease, clinical condition or pathology or a stage of a disease, clinical condition or pathology.

19. A method as claimed in any one of claims 1 to 18, wherein the phenotypic feature is elevation of one or more markers of inflammation; depression of a metabolite or hormone; presence or absence of biomarkers associated with a disease or condition; elevation or depression of expression of transcripts, proteins and/or metabolites, or altered levels of cell death markers.

20. A method as claimed in any one of claims 1 to 19, wherein the individual is a mammal or a bacterium, preferably wherein the individual is a human.

21. A method as claimed in claim 20, wherein the individual is a human and the phenotypic feature is a disease or clinical condition.

22. A method as claimed in claim 21, wherein the phenotypic feature is ulcerative colitis.

23. A method as claimed in claim 22, wherein the genomic variation is a SNP.

24. A method as claimed in claim 17 or claim 22, wherein the degree of genomic variation is the presence or absence of the SNP.

25. A Notch pathway inhibitor or MAML2 inhibitor for use in treating individuals with ulcerative colitis having a mutation in a Notch signalling pathway member, but not a mutation in an NFkB pathway member.

26. A Notch pathway inhibitor or MAML2 inhibitor for use in treating individuals with ulcerative colitis as claimed in claim 25, wherein the Notch signalling pathway member is MAML2, and/or the NFkB pathway member is NFkB or PRKCB, and/or the inhibitor is a gamma-secretase inhibitor.

27. A Notch pathway inhibitor or MAML2 inhibitor for use in a method of treating individuals with ulcerative colitis wherein the method comprises (i) determining whether a test sample from the patient comprises a mutation in a Notch signalling pathway member; and (ii) determining whether a test sample from the same individual does not comprise a mutation in an NFkB pathway member, and if the test sample from the patient comprises a mutation in a Notch signalling pathway member and not a mutation in an NFkB pathway member, administering to the patient an effective amount of a Notch pathway inhibitor or MAML2 inhibitor.

28. A Notch pathway inhibitor or MAML2 inhibitor for use in a method of treating individuals with ulcerative colitis as claimed in claim 27, wherein the Notch pathway member is MAML2, and/or the NFkB pathway member is NFkB or PRKCB and/or the inhibitor is a gamma-secretase inhibitor.

29. An inhibitor for use in treating individuals with ulcerative colitis having a genomic variation in a member of a specific pathway associated with a cell type, wherein the genomic variation increases function of the cell type, and the inhibitor inhibits the function or number of the cell type, wherein the cell type is a fibroblast, myofibroblast, regulatory T-cell, B-cell, macrophage or dendritic cell; or

an activator for use in treating individuals with ulcerative colitis having a genomic variation in a member of a specific pathway associated with a cell type, wherein the genomic variation decreases function of the cell type, and the activator increases the function or number of the cell type, wherein the cell type is a fibroblast, myofibroblast, regulatory T-cell, B-cell, macrophage or dendritic cell.

30. An inhibitor or activator for use in treating individuals with ulcerative colitis having a genomic variation in a member of a specific pathway associated with a cell type as claimed in claim 29, wherein the inhibitor or activator is a B-cell inhibitor or activator, and the genomic variation is a genomic variation in a B-cell pathway member.

31. A B-cell inhibitor or activator for use in a method of treating individuals with ulcerative colitis wherein the method comprises (i) determining whether a test sample from the patient comprises a genomic variation in a B-cell pathway member; (ii) if the test sample comprises a genomic variation in a B-cell pathway member establishing if the genomic variation is a gain or loss of function variation; (iii) if the test sample from the patient comprises:

(a) a gain of function genomic variation, administering to the patient an effective amount of a B-cell inhibitor; or

(b) a loss of function genomic variation, administering to the patient an effective amount of a B-cell activator.

32. A computer program which, when executed by one or more processors, causes the one or more processors to perform the method of any one of claims 1 to 24.

33. Apparatus comprising:

at least one processor; and memory;

wherein the at least one processor is configured to perform the method of any one of claims 1 to 24.
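
The regulatory-variation steps of claims 5 and 6 (inspecting 50 bases on either side of the variation and classifying the effect on a binding site as loss, gain or neutral) might be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the function names, the toy motif and the use of exact string matching in place of a real TFBS or miRNA-TS scan are all assumptions made for the example.

```python
FLANK = 50  # bases taken on each side of the variation, as in claim 6


def flanking_window(sequence, pos, flank=FLANK):
    """Return the slice covering `flank` bases either side of 0-based `pos`."""
    start = max(0, pos - flank)
    end = min(len(sequence), pos + flank + 1)
    return sequence[start:end]


def apply_substitution(sequence, pos, alt):
    """Return a copy of `sequence` with the base at `pos` replaced by `alt`."""
    return sequence[:pos] + alt + sequence[pos + 1:]


def classify_site_change(ref_seq, pos, alt, motif, flank=FLANK):
    """Classify a substitution as 'loss', 'gain' or 'neutral' for one motif.

    Assumption: exact string matching stands in for a real binding-site scan.
    """
    ref_hit = motif in flanking_window(ref_seq, pos, flank)
    alt_seq = apply_substitution(ref_seq, pos, alt)
    alt_hit = motif in flanking_window(alt_seq, pos, flank)
    if ref_hit and not alt_hit:
        return "loss"
    if alt_hit and not ref_hit:
        return "gain"
    return "neutral"


# Toy data: a made-up 8-base motif embedded in a poly-A background.
MOTIF = "TGACGTCA"
reference = "A" * 60 + MOTIF + "A" * 60
print(classify_site_change(reference, 62, "T", MOTIF))
```

The sketch only reports whether the motif is present anywhere in the window before and after the change, so a window containing two copies of the motif, one of which is destroyed, would be reported as neutral. A fuller treatment would score each candidate site separately.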

Description:
Genomic analysis

Field

The present invention relates to analysis of genomic data. In particular, the present invention relates to a method of identifying the contribution of a genomic variation to a phenotypic feature.

The present invention also relates to identifying, from a group of individuals with a particular phenotypic feature, individuals with one or more genomic variations in common.

The present invention also relates to identifying, from a group of individuals with a particular phenotypic feature, an individual with a susceptibility to one or more treatment pathways. The present invention also relates to determining susceptibility of an individual to a particular treatment pathway, and in particular, a method of determining the susceptibility of an individual with one or more phenotypic features to a particular treatment pathway which may be associated with one or more of the phenotypic features.

The present invention also relates to a method of assigning an individual to one of a certain number of treatment pathways on the basis of their genetic profile, and in particular, a method of assigning an individual with one or more phenotypic features to one of a certain number of treatment pathways on the basis of their genetic profile, wherein the treatment pathway is relevant to one or more aspects of their genetic profile.

Background

Generation of large quantities of genomic data is possible through methods such as comparative sequence analysis. Functional genomics attempts to provide a meaningful analysis of such data by evaluating patterns of gene expression and interactions between molecular gene products, and Genome-Wide Association Studies (GWAS) permit comparisons between DNA sequences by identifying regions of genetic variation associated with a specific feature or trait. However, meaningful interpretation of GWAS data can be limited, for example where the genetic variation is located in a non-coding region.

Summary

According to an aspect of the invention there is provided a computer-implemented method of identifying contribution of a genomic variation to a phenotypic feature. The method comprises determining a degree of a genomic variation in each individual in a set of individuals and recording the degree of genomic variation in each individual in a database, for example, in a first table. The method comprises determining a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product, for example, from a table, such as a second table. The method comprises determining a first gene affected by the genomic variation. When the genomic variation affects a gene product directly, recording the gene, in the coding sequence of which the genomic variation is located, as the first gene, for example, in a second table. When the genomic variation regulates the production of a gene product, recording the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene, for example, in the second table. The method comprises determining an outcome of the genomic variation on the first gene and recording the outcome of the genomic variation in the database, for example, in the second table, and determining the presence or absence of one or more other gene products (i) interacting with the first gene, (ii) encoded by the first gene; and/or (iii) regulating the first gene and, if present, recording the identity of the gene encoding the one or more other gene products in the database as the second gene, for example, in the second table.

According to another aspect of the present invention there is provided a computer-implemented method. The method comprises determining a degree of a genomic variation in each individual in a set of individuals and recording the degree of genomic variation in each individual in a database; determining a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product; determining a first gene affected by the genomic variation wherein, when the genomic variation affects a gene product directly, recording the gene, in the coding sequence of which the genomic variation is located, as the first gene, and, when the genomic variation regulates the production of a gene product, recording the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene; and determining an outcome of the genomic variation on the first gene and recording the outcome of the genomic variation in the database.

According to yet another aspect of the present invention there is provided a computer-implemented method of determining the presence or absence of one or more other gene products. A degree of genomic variation in each individual in a set of individuals has been recorded in a database, a location of the genomic variation to allow determination, by virtue of the location, of whether the genomic variation affects a gene product directly and/or regulates production of a gene product has been determined, a first gene affected by the genomic variation has been determined wherein, when the genomic variation affects a gene product directly, the gene has been recorded, in the coding sequence of which the genomic variation is located, as the first gene and, when the genomic variation regulates the production of a gene product, the gene has been recorded, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene, and an outcome of the genomic variation on the first gene has been determined and recorded in the database. The method comprises determining the presence or absence of one or more other gene products: (i) interacting with the first gene, (ii) encoded by the first gene; and/or (iii) regulating the first gene, and, if present, recording the identity of the gene encoding the one or more other gene products in the database as the second gene, optionally in the second database.
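
One possible layout for the first and second tables referred to above is sketched here using Python's built-in sqlite3 module. The table names, column names and example identifiers are hypothetical and are chosen only for illustration; the text does not prescribe a schema, and storing a single second gene per variation is a simplification.

```python
import sqlite3


def create_database(path=":memory:"):
    """Create a database holding a first table and a second table."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        -- First table: degree of each genomic variation per individual.
        CREATE TABLE first_table (
            individual_id TEXT NOT NULL,
            variation_id  TEXT NOT NULL,
            degree        INTEGER NOT NULL,
            PRIMARY KEY (individual_id, variation_id)
        );
        -- Second table: location, affected first gene, outcome, second gene.
        CREATE TABLE second_table (
            variation_id TEXT PRIMARY KEY,
            location     TEXT NOT NULL,
            mode         TEXT NOT NULL CHECK (mode IN ('direct', 'regulatory')),
            first_gene   TEXT NOT NULL,
            outcome      TEXT,
            second_gene  TEXT
        );
        """
    )
    return conn


def record_degree(conn, individual_id, variation_id, degree):
    """Record the degree of a variation for one individual."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO first_table VALUES (?, ?, ?)",
            (individual_id, variation_id, degree),
        )


def record_variation(conn, variation_id, location, mode, first_gene,
                     outcome=None, second_gene=None):
    """Record location, first gene, outcome and second gene for a variation."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO second_table VALUES (?, ?, ?, ?, ?, ?)",
            (variation_id, location, mode, first_gene, outcome, second_gene),
        )


def genes_for_individual(conn, individual_id):
    """List (variation, first gene, second gene) for variations carried."""
    rows = conn.execute(
        """
        SELECT s.variation_id, s.first_gene, s.second_gene
        FROM first_table AS f
        JOIN second_table AS s ON s.variation_id = f.variation_id
        WHERE f.individual_id = ? AND f.degree > 0
        ORDER BY s.variation_id
        """,
        (individual_id,),
    )
    return rows.fetchall()
```

Keeping per-individual degrees separate from per-variation annotations means each variation is annotated once, and a join recovers the genes implicated for any individual.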

As used herein, the term “genomic variation” means a difference in any aspect of the coding sequence that determines the genetic makeup of an individual compared to a comparable aspect of code from another individual or group of individuals.

Examples of genomic variations include allelic variations, polymorphisms or mutations in DNA or RNA, such as hereditary mutations and somatic mutations, missense or non-synonymous, synonymous and nonsense mutations, insertions, deletions, substitutions, inversions, frameshift mutations, repeat expansions, duplications, copy number variations, point mutations, and single nucleotide polymorphisms (SNPs). A mutation may be within an extra-chromosomal nucleotide sequence (such as a plasmid) or a chromosomal nucleotide sequence. Genomic variations may also include epigenetic modifications, for example DNA methylation (for example of CpG regions/islands) or histone modifications (for example methylation, acetylation, phosphorylation, ubiquitination).

As used herein, the term “phenotypic feature” means an identifiable trait or condition. It includes observable characteristics, such as one or more aspects of morphology, for example bone length; physiology, for example metabolic rate; or behaviour, such as aggression. It also includes diseases, clinical conditions and/or pathologies in any stage or state, or a marker of a disease, clinical condition or pathology, or a marker of a response to treatment of a disease. It also includes desirable traits (for example increased milk yield in a cow), or undesirable traits, such as biofilm formation in a bacterium or bacterial resistance to an antibiotic.

The phenotypic feature may be a disease, clinical condition or pathology, or a stage of a disease, clinical condition or pathology; or a marker of a disease, clinical condition or pathology. Alternatively, the phenotypic feature may be a marker of a response to treatment of a disease, clinical condition or pathology or a stage of a disease, clinical condition or pathology. Examples include elevation of one or more markers of inflammation; depression of a metabolite or hormone, for example depression of insulin levels as an indicator of diabetes; presence or absence of biomarkers associated with a disease or condition, for example CD34 or CD38 as prognostic biomarkers for acute B lymphoblastic leukemia; elevation or depression of expression of transcripts, proteins and/or metabolites, for example elevation of phospholipid metabolites as an indicator of cancer cell growth, or altered levels of cell death markers, such as apoptotic markers, as an indicator of neurodegenerative conditions or cancer.

As used herein, the term “individual” means any organism, for example eukaryotes such as animals, plants, fungi and protists, prokaryotes such as bacteria and archaea, and viruses.

The degree of genomic variation in each individual in a set of individuals is then determined.

The step of determining the degree of a genomic variation in each individual in a set of individuals may be carried out by comparing the code or a part of the code that determines the genetic makeup of each individual to the comparable code, or comparable part of the code, from a control group. The individual has or displays one or more phenotypic features of interest. The control group does not have and/or display the phenotypic feature(s) of interest. The code or part of the code that determines the genetic makeup of each individual may be in a dataset stored, for example, in an appropriate storage. One or more of the datasets may be publicly available and/or available to members of the research community. The code may have been generated from genome-wide association studies, and thus the data may be genotype-phenotype associated data. The code or part of the code may therefore be from individuals with a recognised phenotypic feature. The phenotypic feature may be the phenotypic feature of interest. Examples of datasets include Immunochip, or databases such as Immunobase, the database of Genotypes and Phenotypes (dbGaP), OMIM, COSMIC, PharmGKB and bacterial databases such as SalComMac, PhenoLink, ProTraits; epigenetic databases such as IHEC Data Portal, ROADMAP Epigenomics, DiseaseMeth; and genome-wide association databases (Oz et al Mol. Biol and Evol. (2014); Lazar et al Mol. Sys. Biol. (2013) and Lazar et al Nature Comm. (2014)).

The comparable code or part of the code for the control group may be a dataset stored for example in an appropriate storage means such as one or more databases or one or more chips. One or more of the datasets may be publicly available or available to members of the research community. Examples of databases include the International Genome Sequencing Consortium; TSC (The SNP Consortium Ltd); the International HapMap Project, Decipher and ancestral allele databases, such as db ancestral allele database (Sherry et al., 2001), Human Gene Mutation Database (HGVbase), ExPASy, GeneSNPs, ClinBar, Geneatlas, GeneCards Database, Genome Variation Server (GVS), Human Organised Whole Genome Database (HOWDY), jSNP (Japanese SNPs), Leelab SNP database, The Human SNP database, OMIM (Online Mendelian Inheritance in Man), NIEHs SNPs, PharmGKB, Seattle SNPs, Sequence Tag Alignment and Consensus Knowledgebase (STACK), SNP database (Single Nucleotide Polymorphisms in the Human Genome), copy number variation (CNV) databases.

All of the code that determines the genetic makeup of an individual may be compared to the comparable code of the control, and the degree of genomic variation between the codes may be recorded in the database. Alternatively, part of the code that determines the genetic makeup of an individual may be compared to the comparable part of the control code. For example, the computer may be programmed to analyse only a proportion of the code of the individual and comparable control, such as regions of code known or thought to comprise genomic variations associated with the phenotypic feature of interest. The degree of genomic variation between the codes in those regions may be recorded in the database.

The degree of genomic variation may be binary. For example, a SNP may be found, as compared to the control, in a particular location in part of the code that determines the genetic makeup of an individual. The presence of the SNP may be recorded in the database as the ‘presence’ or ‘absence’ of an SNP.

Alternatively, the degree of genomic variation may be non-binary, that is quantitative, for example the number of mutations, size of insertion, number of CNV, size of deletion, or graded, for example the degree of gene expression, methylation, non-coding RNA regulatory effect. Non-binary genomic variation may be measured using a continuous scale. Accordingly, by the phrase “degree of genomic variation” as used herein is meant a quantitative and/or qualitative measurement of the difference between any aspect of the code that determines the genetic makeup of an individual compared to a comparable aspect of code from another individual or group of individuals. The step of recording the degree of genomic variation in each individual in a database may occur as an integral part of the comparison step.
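By way of illustration only, the binary and quantitative forms of the degree of genomic variation may be sketched as follows. The function names, the variant identifiers (`var1` to `var3`) and the region name are hypothetical and are not taken from any dataset named herein; the sketch assumes that the individual and the control have already been reduced to simple per-site values.

```python
from typing import Dict


def binary_degree(individual: Dict[str, str], control: Dict[str, str]) -> Dict[str, int]:
    """Record 1 where the individual's allele differs from the control allele, else 0.

    Sites missing from the individual are treated as matching the control
    (an assumption made only for this sketch).
    """
    return {
        site: int(individual.get(site, ref_allele) != ref_allele)
        for site, ref_allele in control.items()
    }


def quantitative_degree(individual: Dict[str, int], control: Dict[str, int]) -> Dict[str, int]:
    """Record a non-binary degree, here the difference in copy number per region."""
    return {
        region: individual.get(region, ref_count) - ref_count
        for region, ref_count in control.items()
    }


# Hypothetical control and individual genotypes.
control_alleles = {"var1": "A", "var2": "G", "var3": "T"}
individual_alleles = {"var1": "A", "var2": "C", "var3": "T"}
binary = binary_degree(individual_alleles, control_alleles)

# Hypothetical copy numbers for one region.
copy_number_change = quantitative_degree({"region1": 4}, {"region1": 2})
```

Here `binary` marks only `var2` as present, and `copy_number_change` records a gain of two copies for `region1`.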

One or more details of the genomic variation (“variant identifiers”) may be recorded in the database. Examples of such details include the accession number, the reference SNP cluster ID (rsID), TCGA IDs, the chromosomal location ID, ssID, HGVS, Cosmic ID, HGMD, ClinVar ID, Uniprot ID, DGVa variant call ID, dbVar variant call.

Information about an individual, or each individual in the set of individuals may also be recorded. Such information may include clinically relevant information, and/or demographic data, for example, gender, weight, height, age, symptoms, date of onset of symptoms, drug or treatment regime. Such information may be recorded in the database.

The location of the genomic variation is then determined. The location of the genomic variation may be determined by comparing the genomic variation with a database of genetic information and/or identification of the genomic variation within one or more databases of genetic information. For example, databases containing details, including location details, of known genomic variations; or databases containing details of coding sequences which are known to relate or be ascribed to a particular function. One or more of the databases may be publicly available or available to members of the research community. For example, the JASPAR database of transcription factor binding sites; the JCR GMO Amplicons database of DNA sequences appearing in genetically modified organisms; the Modomics database; miRBase (Kozomara and Griffiths-Jones, 2011), TRANSFAC; RegTransBase; PlantRegMap; RegulonDB, AnimalTFDB, PlantCARE.

The information provided in the database may be reordered or re-configured, for example for ease of reference or use.

Comparison of the identified genomic variation with a database of genetic information for the purpose of determining the location of the genomic variation may be carried out by a computer using software adapted for the purpose. For example, Regulatory Sequence Analysis Tool (RSAT) matrix-scan (Turatsinze et al., 2008) and ReMap with integrated ChIP-seq peak analysis (Cheneby et al., 2018).

The purpose of identifying the location of the genomic variation is to allow determination of whether the genomic variation affects a gene product directly and/or whether the genomic variation regulates the production of a gene product.

As used herein, the term “gene product” means an entity resulting from transcription, or transcription and translation, of genetic code. It includes proteins and RNA including non-coding RNAs (ncRNAs), rRNA, tRNA, tmRNA, antisense RNA, messenger RNA (mRNA), microRNA (miRNA), small nucleolar RNA (snoRNA), short interfering RNA (siRNA), rasiRNA, piwi RNA (piRNA) and long non-coding RNA (lncRNA).

Gene expression is the process by which the code that determines the genetic makeup of an individual is used as a template for the synthesis of a gene product. The process is comprised of two independent processes: transcription and translation. A gene product can result from either transcription alone (an RNA species), or transcription followed by translation (a protein).

Transcription involves the synthesis of precursor messenger RNA (pre-mRNA), catalysed by RNA polymerase, from a DNA template. The pre-mRNA then undergoes further processing (splicing to remove non-coding introns, addition of a 5’ cap and a 3’ poly(A) tail) and in eukaryotes (and their viruses), the mature mRNA is translocated out of the nucleus (in prokaryotes and their viruses, transcription and translation occur simultaneously). At this stage, gene products that do not encode proteins (such as ncRNAs) are processed further and directed to their downstream pathways.

Translation occurs within the ribosome and is the process by which the mature mRNA is used as a template for the directed assembly of an amino acid or polypeptide chain. With the aid of chaperone proteins, the polypeptide chain folds into the characteristic three-dimensional structure of a protein.

The transcription, or transcription and translation, of gene products is tightly controlled. Ensuring the controlled transcription, or transcription and translation, of a gene is important for the maintenance of homeostasis within the cellular environment. This can be affected by the presence of a genomic variation within the code that determines the genetic makeup of an individual.

For example, if a genomic variation is present within the coding region of the code that determines the genetic makeup of an individual, the resultant gene product will be modified as a result of the variation.

Thus, the phrase “affects a gene product directly” as used herein means the genomic variation is located within a coding region of a gene, and transcription, or transcription and translation, of the coding sequence results in modification of the gene product in comparison to the gene product that would result from transcription, or transcription and translation, of the coding region without the genomic variation.

For example, a genomic variation within a coding region for a mRNA could result in an amino acid substitution or an alteration to the length of the translated polypeptide. The resultant protein would thus be modified in comparison to a protein resulting from transcription and translation of the coding region with an alternative genomic variant that does not lead to an amino acid substitution.

As an alternative example, the genomic variation is present within the mRNA, in a target site for another gene product (such as but not limited to a miRNA). The modification within the target site is recognised by the miRNA, which then directs the gene product to an RNA degradation pathway.

In addition, the genomic variation can be present on the miRNA and either changes the miRNA’s mature region responsible for binding to the target site on the mRNA, or changes the miRNA sequences targeted by lncRNAs or other regulatory RNAs or other competitive endogenous RNAs (ceRNAs).

However, genomic variations can occur outside of a coding region of a gene and have the effect of modifying the production of the gene product. A genomic variation can also occur at an epigenetic site on the DNA in the promoter region or other regulatory regions, and thus modify the transcription rate of nearby gene(s).

Thus, the phrase “regulating production of a gene product” as used herein means that the genomic variation is located outside of a coding region of a gene, and within the promoter region or other regulatory region of said gene, and transcription, or transcription and translation, of the gene has the effect of modifying the production of the gene product. The promoter region includes sequences recognised by gene products such as transcription factors and enhancers, which are required for the initiation of transcription. Genomic variations within this region can, for example, prevent the binding of the factors required for the activation of transcription. This can modify the levels of the gene product which would normally result from transcription, or transcription and translation, of the gene: for example, a genomic variation within the promoter region can prevent the production of a transcript and thus prevent production of a gene product; or enhance the production of a transcript thus resulting in an increase in the gene product. An epigenetic (methylation) marker within a promoter or other regulatory region as a result of a genomic variation can, for example, result in modification of the transcription rate of nearby gene(s). For example, the aberrant addition of a methyl group to cytosine (hyper-methylation) within CpG islands found in the promoter region of a gene can prevent the binding of transcription factors. This can silence the gene and prevent transcription of a gene product (for example, methylation of tumour suppressor genes in cancer). Methyl groups may also be removed (hypo-methylation), allowing transcription factors to bind and allow transcription to occur (for example removal of methylation on oncogenes promoting the development of cancer).

As an alternative example, an epigenetic marker on key histone residues as a result of a genomic variation may modify the transcription rate of nearby gene(s). For example, the addition of a methyl group to histone H3 at lysine 27 causes re-modelling of the local chromatin structure and blocks access to the promoter region of a gene (heterochromatin state). This results in gene silencing and prevents the transcription of a gene product. Alternatively, histone modifications may also result in gene activation and local re-modelling of chromatin structure allowing transcription factors to bind (euchromatin state).

The genomic variation is thereby determined to be located either within a coding region or a non-coding region. For example, a genomic variation, such as a single nucleotide polymorphism in a human DNA sequence may be located within a region which is known to encode for a protein; a further SNP within the DNA sequence may be found to be located within a region known to encode for a regulatory entity, such as miRNA or a transcription factor binding site. Location of an epigenetic genomic variation may also be determined, for example remodelling of chromatin or altering/regulating expression of gene products in a specific location, for example by methylation of DNA at CpG sites or modification of histone residues.

The location of the genomic variation is recorded in the database.
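A minimal sketch of how a location might be resolved to “coding” or “regulatory” is given below. It assumes, purely for illustration, that each gene is annotated with one coding interval and one regulatory (for example promoter) interval on a single chromosome; the class, the function and the gene names `GENE_A` and `GENE_B` are hypothetical and do not correspond to any database named herein.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GeneAnnotation:
    name: str
    coding: Tuple[int, int]      # inclusive start and end of the coding region
    regulatory: Tuple[int, int]  # inclusive start and end of the promoter/regulatory region


def locate(position: int, genes: List[GeneAnnotation]) -> Tuple[str, Optional[str]]:
    """Return (location class, gene name) for a variation at the given position."""
    # Coding regions are checked first: the variation affects a gene product directly.
    for gene in genes:
        if gene.coding[0] <= position <= gene.coding[1]:
            return ("coding", gene.name)
    # Otherwise check regulatory regions: the variation regulates production of a gene product.
    for gene in genes:
        if gene.regulatory[0] <= position <= gene.regulatory[1]:
            return ("regulatory", gene.name)
    return ("unannotated", None)


# Hypothetical annotations.
GENES = [
    GeneAnnotation("GENE_A", coding=(1000, 2000), regulatory=(800, 999)),
    GeneAnnotation("GENE_B", coding=(5000, 6000), regulatory=(4700, 4999)),
]

in_coding = locate(1500, GENES)
in_promoter = locate(4800, GENES)
elsewhere = locate(3000, GENES)
```

The returned pair could then be recorded in the database alongside the variant identifier.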

The outcome may be, for example, production of a modified protein: a SNP present in the coding region of the first gene may result in an amino acid substitution in the resulting protein; or production of a modified miRNA: a SNP in the coding region of the first gene may result in the production of a modified miRNA which may, for example, affect the half-life of mRNA and result in reduction in gene expression. As a further example, the outcome of the genomic variation may be the presence or absence of an epigenetic marker within the promoter region of the first gene, for example, the addition of a methyl group to a cytosine base within the CpG region of the promoter of the first gene, which may cause the DNA to adopt a heterochromatin state, silencing the first gene and preventing transcription, or transcription and translation, of a gene product.

When the genomic variation regulates the production of a gene product, the first gene may be identified by assessing the flanking sequences of the genomic variation for transcription factor binding sites (TFBS) and/or miRNA target sites (miRNA-TS). Further to this, the effect of the genomic variation on the TFBS or miRNA-TS may be classified as loss or gain of binding site/target or a neutral change. The gene corresponding to the loss or gain effect may then be identified as the first gene. In some embodiments, the flanking sequences are 50 bases upstream and downstream of the genomic variation.
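The flanking-sequence assessment may be sketched as follows. This is a simplified illustration and not the RSAT matrix-scan procedure: it assumes a binding site can be represented by an exact consensus string, whereas real tools score position weight matrices. The motif, the sequences and all function names are hypothetical.

```python
FLANK = 50  # bases upstream and downstream of the genomic variation


def flanking_window(sequence: str, pos: int, flank: int = FLANK) -> str:
    """Return the variation at index `pos` together with up to `flank` bases on each side."""
    start = max(0, pos - flank)
    end = min(len(sequence), pos + flank + 1)
    return sequence[start:end]


def classify_site_change(reference: str, pos: int, alt_base: str,
                         motif: str, flank: int = FLANK) -> str:
    """Classify the effect of a single-base substitution on a binding site as loss, gain or neutral."""
    variant = reference[:pos] + alt_base + reference[pos + 1:]
    site_in_reference = motif in flanking_window(reference, pos, flank)
    site_in_variant = motif in flanking_window(variant, pos, flank)
    if site_in_reference and not site_in_variant:
        return "loss"
    if site_in_variant and not site_in_reference:
        return "gain"
    return "neutral"


# Hypothetical consensus motif embedded in an otherwise uniform sequence.
MOTIF = "TGACGTCA"
reference_with_site = "A" * 60 + MOTIF + "A" * 60
reference_without_site = "A" * 60 + "TGTCGTCA" + "A" * 60

loss = classify_site_change(reference_with_site, 62, "T", MOTIF)        # disrupts the motif
gain = classify_site_change(reference_without_site, 62, "A", MOTIF)     # creates the motif
neutral = classify_site_change(reference_with_site, 70, "C", MOTIF)     # motif untouched
```

In such a scheme, the gene associated with a site classified as “loss” or “gain” would be the candidate first gene.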

The gene product(s) encoded by the first gene may be recorded in the database.

The second gene may be referred to herein as “the effecting gene”. The gene product of the second gene may have an effect on the product of the first gene. For example, a transcription factor that is encoded by a second gene may be the transcription factor required for the transcription of the first gene.

The gene product(s) encoded by the second gene may be recorded in the database.

One or more further details about the genomic variation may be recorded in the database, for example, the identification and/or name of the gene affected by the genomic variation.

A genetic profile may be generated for an individual, or for each individual, using the information obtained according to the methods discussed herein. For example, a genetic profile may be generated comprising one or more of: the degree of genomic variation; the location of the genomic variation; the outcome of the genomic variation; the first gene; the second gene; or the gene product(s) encoded by the first and/or second genes. The genetic profile may also contain additional data such as transcriptomics and proteomics data. Individuals within a set of individuals may then be clustered based on similarities in their genetic profile.

The methods discussed herein provide a means of analysis which is applicable to many areas. They allow the grouping of individuals within a cohort on the basis of a profile created as a result of their genetic code. In particular, they allow the grouping of individuals within a cohort having a particular phenotypic feature on the basis of common features identified in a profile created as a result of genomic variations in their genetic code. This may allow determination of common or consistent pathways affected, for example, common biological processes or pathways in such individuals, and thus allow identification of targets for treatment. Such information can be particularly relevant to disease or disease states or clinical or pathological conditions, for example where individuals with the condition or state display common phenotypic features but respond differently to the same treatment(s).

Accordingly, there is provided a method of identifying, from a group of individuals with a particular phenotypic feature, individuals with one or more common genomic variations.
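One possible way of clustering individuals by similarity of their genetic profiles is sketched below. The choice of Jaccard similarity, the single-linkage grouping and the 0.5 threshold are assumptions made for this illustration and are not prescribed herein; each profile is reduced to the set of genes affected in that individual, and `GENE_X` and the patient labels are hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two profiles: shared genes divided by all genes in either profile."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_profiles(profiles: Dict[str, Set[str]], threshold: float = 0.5) -> List[Set[str]]:
    """Group individuals so that each is linked to at least one sufficiently similar member."""
    clusters: List[Set[str]] = []
    for person in sorted(profiles):
        # Find every existing cluster containing a member similar enough to this person.
        linked = [
            cluster for cluster in clusters
            if any(jaccard(profiles[person], profiles[member]) >= threshold for member in cluster)
        ]
        merged = {person}
        for cluster in linked:
            merged |= cluster
            clusters.remove(cluster)
        clusters.append(merged)
    return clusters


# Hypothetical profiles: genes affected by genomic variations in each individual.
profiles = {
    "P1": {"MAML2", "NOTCH1"},
    "P2": {"MAML2", "NOTCH1", "GENE_X"},
    "P3": {"PRKCB"},
    "P4": {"PRKCB", "NFKB1"},
}
clusters = cluster_profiles(profiles)
```

With these inputs, P1 and P2 fall into one group and P3 and P4 into another, since the two pairs share no affected genes.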

Also provided is a method of identifying, from a group of individuals with a particular phenotypic feature, an individual(s) with a susceptibility to one or more treatment pathways associated with the phenotypic feature.

Also provided is a method of determining the susceptibility of an individual to a particular treatment pathway, and in particular, a method of determining the susceptibility of an individual with one or more phenotypic features to a particular treatment pathway associated with one or more of the phenotypic features.

Also provided is a method of determining the susceptibility of an individual to a particular treatment pathway, and in particular, a method of determining the susceptibility of an individual with one or more phenotypic features to a particular treatment pathway, which may be associated with one or more of the phenotypic features.

Also provided is a method of assigning a person to one of a certain number of treatment pathways on the basis of their genetic profile; and in particular, a method of assigning a person with one or more phenotypic features to one of a certain number of treatment pathways on the basis of their genetic profile, wherein the treatment pathway is relevant to one or more aspects of their genetic profile.

As used herein the term “genetic profile” means a profile created as a result of the genomic variations of said individual.

As defined above, the term “individual” means any organism, for example eukaryotes such as animals, plants and protists; prokaryotes such as bacteria; Archaea; viruses and fungi.

The individual may be a plant or an animal. The individual may be a host, for example to a microbe. The individual may be a mammal. The individual may be a human, a domesticated animal, a microbe or a bacterium. Where the individual is a human, the phenotypic feature may be a disease or clinical condition. Where the individual is a microbe, the phenotypic feature may be antibiotic resistance or virulence. Where the individual is a plant or animal host for a microbe, the phenotypic feature may be host-microbe interaction.

Accordingly, there is provided a method which allows the contribution of a genomic variation to a phenotypic feature to be identified.

The method may allow extraction of meaningful information relevant to one or more phenotypic traits from genetic information.

The method may allow insight into a microbe-host relationship.

The method may allow identification of biological pathways and processes which had previously not been associated with a particular phenotypic feature, or individuals with a particular phenotypic feature, and/or biological pathways and processes which had previously not been associated with a particular cohort of individuals with a particular phenotypic feature. For example, in a cohort of human patients suffering from a particular pathological condition, it may have been found that certain individuals respond positively to a particular treatment, for example by demonstrating an improvement in one or more pathological markers, whilst other individuals within the cohort may show no response or limited response to the same treatment. The methods described herein may be used to reveal the biological basis for the difference in response; for example, the individuals within the cohort who do not respond to the treatment may have one or more genomic variations which affect certain gene products associated with a particular cellular pathway, which mean that treatment aimed at regulating that pathway will (or will not) be efficacious. The methods described herein may inform the decision as to whether an individual in a group of individuals with a particular phenotypic feature may be a candidate for treatment in a particular manner. For example, the genetic profile of an individual may be established using the methods described herein, which may then be used to determine the susceptibility of that individual to a particular treatment.

For example, if a patient with ulcerative colitis (UC) is known to have a SNP that is annotated or determined to lead to an increase in MAML2 (Mastermind-like protein 2) expression, then, because MAML2 is known to activate the NOTCH1 receptor, thereby increasing NOTCH1 activation, a key player in the activation of inflammation in UC, treatment with inhibitors of NOTCH1 or MAML2 would be appropriate.

The methods described herein may also allow identification of individuals with a phenotypic feature who do not have a genomic variation(s) known or thought to be associated with that phenotypic feature. For example, NFKB1 (nuclear factor kappa light chain enhancer of activated B cells) is the central mediator of inflammation (Caamano and Hunter, 2002) and PRKCB (protein kinase C beta type; also denoted by PKCB) activates NFKB1 indirectly (Kang et al., 2001). A prominent position for NFkB and PRKCB in ulcerative colitis would therefore be expected. However, the methods described herein may allow identification of a patient with UC who does not have a genomic variation associated with a NFkB pathway member and may not, therefore, be an appropriate candidate for treatment with a drug which targets the NFkB pathway.

Accordingly, there is provided a Notch pathway inhibitor or MAML2 inhibitor for use in treating individuals with ulcerative colitis having a genomic variation, such as a gain of function mutation, in a Notch signalling pathway member, but not a genomic variation, such as a mutation, in an NFkB pathway member. The Notch signalling pathway member may be MAML2, and/or the NFkB pathway member may be NFkB or PRKCB, and/or the inhibitor may be a gamma-secretase inhibitor.

Also provided is a Notch pathway inhibitor or MAML2 inhibitor for use in a method of treating individuals with ulcerative colitis wherein the method comprises (i) determining whether a test sample from the patient comprises a genomic variation, such as a mutation, in a Notch signalling pathway member; and (ii) determining whether a test sample from the same individual does not comprise a genomic variation, such as a mutation, in an NFkB pathway member, establishing whether the genomic variation is a loss or gain of function variation, and if the test sample from the patient comprises a gain of function genomic variation in a Notch signalling pathway member and not a genomic variation in an NFkB pathway member, administering to the patient an effective amount of a Notch pathway or MAML2 inhibitor. The Notch pathway member may be MAML2, and/or the NFkB pathway member may be NFkB or PRKCB and/or the inhibitor may be a gamma-secretase inhibitor.

As a further example, there is provided an inhibitor for use in treating individuals with ulcerative colitis having a genomic variation in a member of a specific pathway associated with a cell type, wherein the genomic variation increases function of the cell type, and the inhibitor inhibits the function or number of the cell type, wherein the cell type is a fibroblast, myofibroblast, regulatory T-cell, B-cell, macrophage or dendritic cell; or an activator for use in treating individuals with ulcerative colitis having a genomic variation in a member of a specific pathway associated with a cell type, wherein the genomic variation decreases function of the cell type, and the activator increases the function or number of the cell type, wherein the cell type is a fibroblast, myofibroblast, regulatory T-cell, B-cell, macrophage or dendritic cell.

Also provided is a B-cell inhibitor or activator for use in treating individuals with ulcerative colitis having a genomic variation, such as a mutation in a B-cell pathway member.

Also provided is a B-cell inhibitor or activator for use in a method of treating individuals with ulcerative colitis wherein the method comprises (i) determining whether a test sample from the patient comprises a genomic variation in a B-cell pathway member; (ii) if the test sample comprises a genomic variation in a B-cell pathway member establishing if the genomic variation is a gain or loss of function variation; (iii) if the test sample from the patient comprises: (a) a gain of function genomic variation, administering to the patient an effective amount of a B-cell inhibitor; or (b) a loss of function genomic variation, administering to the patient an effective amount of a B-cell activator.

According to an aspect of the present invention there is provided a computer program which, when executed by a computer, causes the computer to perform the method.

According to another aspect of the present invention there is provided a computer readable medium (which may be non-transitory) which stores the computer program.

According to yet another aspect of the present invention there is provided apparatus, for example one or more computer systems, configured to perform the method. The at least one computer system comprises at least one processor and memory.

According to still yet another aspect of the present invention there is provided a computer-readable table storing, for each of a plurality of genomic variations: an identity of genomic variation, an identity of a product of an effecting gene, an identity of an interaction type, if present, an identity of one or more first genes directly affected by the genomic variation and an identity of the type of genomic variation and, for a set of individuals, a respective degree of variation of the genomic variation.
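One row of such a table might be represented in memory as sketched below. The field names, the helper function and the example values (`var1`, `GENE_A`, `TF_X`, `ind1` to `ind3`) are hypothetical and chosen only to mirror the items listed above; no particular storage format is implied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VariationRecord:
    variation_id: str                              # identity of the genomic variation
    variation_type: str                            # identity of the type of genomic variation
    first_genes: List[str]                         # first gene(s) directly affected
    effecting_gene_product: Optional[str] = None   # product of the effecting (second) gene
    interaction_type: Optional[str] = None         # interaction type, if present
    degree_by_individual: Dict[str, float] = field(default_factory=dict)


def carriers(record: VariationRecord) -> List[str]:
    """Return the individuals with a non-zero degree of this genomic variation."""
    return sorted(person for person, degree in record.degree_by_individual.items() if degree)


# Hypothetical row.
record = VariationRecord(
    variation_id="var1",
    variation_type="SNP",
    first_genes=["GENE_A"],
    effecting_gene_product="TF_X",
    interaction_type="transcriptional regulation",
    degree_by_individual={"ind1": 1, "ind2": 0, "ind3": 1},
)
carrier_list = carriers(record)
```

Storing the per-individual degrees alongside the annotation keeps each variation self-describing, at the cost of repeating individual identifiers across rows.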

According to yet another aspect of the present invention there is provided a system comprising at least one computer system and a database storing the table.

Brief Description of the Drawings

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of a system for identifying contribution of a genomic variation to a phenotypic feature;

Figure 2 is a process flow diagram of a method of identifying contribution of a genomic variation to a phenotypic feature;

Figure 3 is a first table used in a method of identifying contribution of a genomic variation to a phenotypic feature;

Figure 4 is a second table used in a method of identifying contribution of a genomic variation to a phenotypic feature;

Figure 5 illustrates three examples of gene products;

Figure 6 is a block diagram of a system for extracting and processing single nucleotide polymorphisms (SNPs) to uncover hidden pathways and to identify mediators;

Figure 7 is a block diagram of a high-performance computing cluster used in extracting and processing single nucleotide polymorphisms to uncover hidden pathways and to identify mediators;

Figure 8 is a block diagram of a local computing device used in extracting and processing single nucleotide polymorphisms to uncover hidden pathways and to identify mediators;

Figure 9 illustrates analysed SNPs;

Figure 10 illustrates an SNP matrix;

Figure 11 illustrates an interaction matrix; and

Figure 12 schematically illustrates extraction of patient-specific UC-omes from a UC-ome.

Detailed Description

Introduction

Advances in genomics have resulted in an increase in the availability and access to whole genome sequencing (WGS) and these recent advances in sequencing technology have resulted in the generation of large quantities of genomic data. In the next decade it is predicted that genomic data produced will reach the exabyte scale.

Advances in sequencing have also led to the emergence of genome-wide association studies (GWAS). GWAS seek to identify and compare single nucleotide polymorphisms (SNPs), which are a common source of genomic variation amongst individuals. GWAS compare the distributions of SNPs to make meaningful conclusions and comparisons between genomes. There are an estimated 10 million commonly occurring SNPs within the human genome.

Prior to the completion of the human genome project in 2003, the study of genomic variation was limited to the study of genetic linkage within families to identify heritable traits and genetic disorders. Whilst this worked well for single gene conditions, it was more challenging to solve complex disease patterns with multiple allelic variants. The concept of genetic association, wherein the frequency of genetic variants of a particular allele are evaluated and compared between individuals with and without the phenotype of interest (i.e., a disease state), was proposed and led to the further development of GWAS.

The progression of GWAS was further supported by the advent of biobanks and open access databases where genomes and genetic sequences are deposited and can be freely accessed. A global collaboration, the HapMap project, was founded to identify the distribution and location of SNPs and other causes of genetic variation present in the human genome. This access to data has led to a reduction in cost and difficulties involved in collecting sufficient numbers of biological specimens required to make meaningful genetic comparisons. Since the development of HapMap, numerous databases are now in existence.

For example, the database PharmGKB facilitates the study of how different people respond to different pharmacological agents depending upon which genetic variant they possess. COSMIC, the Catalogue of Somatic Mutations in Cancer, curates data on the genetic variants amongst cancer types, enabling identification of common cancer sub-types and potentially informing treatment.

This process is not just restricted to human genomes; databases such as SalComMac, PhenoLink and ProTraits curate gene expression and genome profiles of bacterial species. Additional databases such as the Influenza Research Database and viruSITE enable similar studies in viral populations.

Sequencing advances have provided an invaluable tool for generation and storage of genomic data, and GWAS have contributed a significant amount of data to further the understanding of the association between genomic variation and phenotypic traits. However, problems remain regarding the meaningful analysis of genomic data.

For example, the “missing heritability” problem identifies that a single genomic variation may not account for much of a phenotypic feature. This is a problem that has significant implications for medicine, since a person's susceptibility to disease may depend more on "the combined effect of all the genes in the background than on the disease genes in the foreground", or the role of particular genomic variations may have been overestimated. For example, a phenotypic feature may be determined by more than the concerted effect of disease associated genes, so that the direct role of individual genes/gene variants in determining phenotypic variation could prove insignificant. Similarly, it has been identified that not all genetic effects are easily attributed to the SNPs identified by GWAS. Instead, rare mutations, variations in copy number and other unusual genetic variants have been implicated in contributing to phenotypic variation. Such genomic variants are often maintained at low frequencies by natural selection and would require WGS methods to identify the specific mutations. GWAS also fails to address the combinatorial effect of very different genomic variants/genetic loci and the effect these have on the phenotypic output. Further, the presence of genomic variants that impact expression but have no impact on disease risk remains unclear. The effects of combinatorial genetic variants on important cellular networks and pathways are also uncertain.

Thus, there is provided herein a systems approach to enable the analysis of genomic variations on both the genome and the network of regulatory interactions. Such an approach can provide insight into the cumulative effects of genomic variations, including multiple regulatory genomic variations, to the central developmental and cell proliferation pathways in the core region of the network.

System for identifying contribution of a genomic variation to a phenotypic feature

Before describing a specific example in relation to processing single nucleotide polymorphisms (SNPs) and ulcerative colitis (UC), a generic system for and a method of identifying contribution of a genomic variation to a phenotypic feature will first be described.

Referring to Figure 1, the system 1 may include first and second computer systems 2, 3 in communication with storage 4 which stores first and second tables 8, 9. The first system 2 may take the form of a high-performance computer (HPC) cluster and the second system 3 may take the form of a non-HPC system, such as a desktop computer. The first system 2 includes first, second, third and fourth modules 21, 22, 23, 24.

Referring also to Figures 3, 4 and 5, the first module 21 determines a degree 31 of genomic variation 51 for a plurality of genomic variations 51 for each individual in a set of individuals and records the value 31 (i.e., the degree) in the first table 8 (step S1). The degree 31 of genomic variation may be the presence or absence of the genomic variation 51 and so may be stored using ‘0’ or ‘1’. The second module 22 determines the location of the genomic variation 51, which is stored in the second table 9 (step S2).

Referring still to Figures 1 to 5, the third module 23 determines the identity of a first gene 52 directly affected by the genomic variation 51 and, if the genomic variation affects a gene product directly, records the gene, in the coding sequence of which the genomic variation is located, as the first gene (steps S3 to S5).

The fourth module 24 determines whether the genomic variation regulates the production of a gene product 53, 54 and, if so, records the gene, in the promoter region or other regulatory region of which the genomic variation is located, as the first gene (steps S6 to S7).

The fourth module 24 also determines the outcome of the genomic variation on the first gene and records the outcome of the genomic variation in the second table 7 (step S8).

The second system 3 includes fifth and sixth modules 25, 26.

The fifth module 25 determines whether one or more other gene products are present and, if so, records the identity of the gene encoding the one or more other gene products in the second table 7 (steps S9 & S10).
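By way of illustration only, the recording steps S1 to S10 may be sketched in Python as follows. The field names, the location labels and the example identifiers (var1, GENE_A and so on) are assumptions made for this sketch and do not appear in the described system; the first and second tables are represented simply as in-memory dictionaries, and the annotations are supplied as inputs rather than computed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class VariationRecord:
    """One row of the second table (hypothetical layout)."""
    variation_id: str
    location: str                      # step S2, e.g. "coding", "promoter", "intergenic"
    first_gene: Optional[str] = None   # steps S3 to S7
    outcome: Optional[str] = None      # step S8
    second_genes: List[str] = field(default_factory=list)  # steps S9 and S10


def build_first_table(carriers: Dict[str, Set[str]],
                      variations: List[str]) -> Dict[str, Dict[str, int]]:
    """Step S1: record presence (1) or absence (0) of each variation per individual."""
    return {
        individual: {v: int(v in present) for v in variations}
        for individual, present in carriers.items()
    }


def annotate(variation_id: str, location: str, gene_at_location: Optional[str],
             outcome: Optional[str], partners: Dict[str, Set[str]]) -> VariationRecord:
    """Steps S2 to S10 for one variation, using annotations supplied by the caller."""
    record = VariationRecord(variation_id, location)
    # A first gene is recorded only when the variation lies in a coding sequence
    # or in a promoter/other regulatory region of that gene.
    if gene_at_location is not None and location in {"coding", "promoter", "regulatory"}:
        record.first_gene = gene_at_location
        record.outcome = outcome
        # Genes whose products interact with, are encoded by or regulate the first gene.
        record.second_genes = sorted(partners.get(gene_at_location, set()))
    return record


# Toy data (entirely hypothetical).
first_table = build_first_table(
    carriers={"individual_1": {"var1"}, "individual_2": set()},
    variations=["var1", "var2"],
)
second_table = {
    "var1": annotate("var1", "promoter", "GENE_A", "loss of binding site",
                     partners={"GENE_A": {"GENE_C", "GENE_B"}}),
    "var2": annotate("var2", "intergenic", None, None, partners={}),
}
```

In this sketch an intergenic variation is still listed in the second table with its location, but no first gene, outcome or second gene is recorded for it.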

Referring in particular to Figure 5, the one or more other gene products may interact with the first gene, may be encoded by the first gene and/or may regulate the first gene.

The following example is provided to illustrate an embodiment of the present invention and should not be construed as limiting thereof.

Example 1: Identification of an aberrant Notch signalling pathway in a well-defined cohort of individuals suffering from ulcerative colitis

Referring to Figures 6 to 11, the system and method can be used to identify contribution of a genomic variation, in the form of single nucleotide polymorphisms (SNPs), to ulcerative colitis (UC).

Summary

Genetic association studies have identified causal disease-associated variants including single nucleotide polymorphisms (SNPs) for inflammatory bowel diseases. Understanding how SNPs affect cellular signalling networks is crucial for understanding disease pathogenesis. SNP profiles of 377 ulcerative colitis (UC) patients and their regulatory effects were mapped. Individuals within the cohort formed four distinct clusters based on whether (1) PRKCB alone, (2) NFKB1 alone, (3) both, or (4) neither were affected by SNPs. In a subset of cluster (4) patients, the Notch pathway was identified as key in pathogenesis and therapeutic response, which was validated by UC transcriptomics and qPCR analysis.

Background

Genome-wide association studies (GWAS) and subsequent fine-mapping of SNPs have identified causal disease-associated variants for the inflammatory bowel diseases Crohn’s disease (CD) and ulcerative colitis (UC) (de Lange et al., 2017; Huang et al., 2017). However, the direct biological impact of many of the UC-associated SNPs is unknown. In addition, initial GWAS identified SNPs occurring in exonic regions with the potential to alter amino acid composition and function of the translated proteins. However, these SNPs account for less than 10% of the total SNPs profiled in UC, and did not result in the expected pathogenic effect (Prager et al., 2015).

Sources of SNP data

The degree of a genomic variation in each individual in the cohort was determined and recorded in a database: UC-associated SNPs and their associated ‘risk’ allele were identified using Immunochip data (Jostins et al., 2012) and the dbSNP ancestral allele database (Sherry et al., 2001). Using this combined SNP dataset, UC-specific SNP data for 377 UC patients were compiled from seven centres across East Anglia, UK (Cambridge, Norwich, Ipswich, Welwyn Garden City, Luton, Bedford, and West Suffolk).

Patients were aged between 25 and 100 years with a mean age of diagnosis of 37 (SD: 14.9 years). 246 patients were on mesalazine treatment and 124 were on additional immunomodulatory treatment (therapeutic upscaling).

The location of the genomic variation was determined: the location of each SNP was recorded as exonic (missense, synonymous), intronic/non-translated or intergenic.

Flanking nucleotide sequences were obtained from dbSNP (Sherry et al., 2001). The analyzed SNPs are shown in Figure 9.

Assessing effect of SNPs on transcription factor binding sites and miRNA target sites

From the JASPAR database we downloaded 396 human transcription factors’ binding profiles represented by Position Specific Scoring Matrices (PSSMs) (Mathelier et al., 2016). The downloaded PSSMs in JASPAR format were converted to the TRANSFAC format to ease handling of results. To assess the effect of the SNP on the gain or loss of putative transcription factor (TF) binding sites, flanking sequences 50 bases upstream and downstream of the SNPs were extracted. The Regulatory Sequence Analysis Tool (RSAT) matrix-scan (Turatsinze et al., 2008) was used to search for potential TFBS in the ancestral and patient-specific mutant alleles. The background model was estimated using residue probabilities from the input sequences with a Markov order of 1. Both strands of the sequences were searched. Hits with a P-value < 1e-05 were considered as putative binding sites. Other parameters were set at default values.

To assess the effect of the SNPs on miRNA target sites, the 22 bp sequences of mature miRNAs were retrieved from miRBase (Kozomara and Griffiths-Jones, 2011). The flanking sequences of SNPs were assessed for the presence of miRNA target sites using miRanda (Enright et al., 2003). Hits predicted to occur in the seed region (positions 2 to 8) of the miRNAs and with alignment scores ≥ 90 and energy threshold < -16 kcal/mol were considered as target sites. Other parameters were set to default settings. A final manual check was performed to ensure that the SNPs overlapped with the predicted TF or miRNA binding sites.

We also considered gain or loss of the regulatory interactions between TFs and protein-coding genes in our analysis, where the protein-coding gene was within 10 kb upstream or downstream of the SNP-affected TFBS. This information was retrieved using the feature retrieval function of the UCSC genome table browser (Kent et al., 2002). We also captured pre-existing regulatory interactions with experimentally determined binding regions/sites. In these cases, the protein-coding gene(s) at the cis level corresponding to the SNP were assigned as targets of the TF which recognizes the binding regions/sites. All gains or losses of regulatory interactions and protein-coding genes via SNP-affected miRNA target sites were included in the network except when the SNPs were annotated or recorded as intergenic. The effect of each SNP on the uncovered TFBS or miRNA-TS was classified as either a gain or loss of binding site/target or a neutral change. Only those sites identified as loss or gain with respect to sites corresponding to the ancestral allele were considered for subsequent analysis.

The genes corresponding to such SNPs are referred to as ‘SNP-affected genes’ from here onwards.
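The classification of a SNP effect as a gain, a loss or a neutral change can be illustrated with a minimal Python sketch. It assumes that the binding-site or target-site hits for the ancestral and the patient-specific allele have already been obtained from an external scanner and are supplied as sets of regulator names. The function names, the regulator names and the toy sequence are hypothetical, and the scanning tools themselves are not called.

```python
from typing import Dict, Set


def flanking_window(sequence: str, snp_index: int, width: int = 50) -> str:
    """Return the SNP base together with up to `width` bases on either side."""
    start = max(0, snp_index - width)
    return sequence[start:snp_index + width + 1]


def with_allele(sequence: str, snp_index: int, allele: str) -> str:
    """Return a copy of the sequence carrying `allele` at the SNP position."""
    return sequence[:snp_index] + allele + sequence[snp_index + 1:]


def classify_sites(ancestral_hits: Set[str], variant_hits: Set[str]) -> Dict[str, str]:
    """Label each regulator as 'gain', 'loss' or 'neutral' relative to the ancestral allele."""
    effects = {}
    for regulator in ancestral_hits | variant_hits:
        if regulator in ancestral_hits and regulator in variant_hits:
            effects[regulator] = "neutral"   # site present with both alleles
        elif regulator in variant_hits:
            effects[regulator] = "gain"      # site only present with the variant allele
        else:
            effects[regulator] = "loss"      # site only present with the ancestral allele
    return effects


# Toy example: a 120-base sequence with a SNP at index 60.
sequence = "ACGT" * 30
ancestral_window = flanking_window(sequence, 60)
variant_window = flanking_window(with_allele(sequence, 60, "G"), 60)

# Hits as they might be reported by a scanner for each window (hypothetical).
effects = classify_sites({"TF_A", "TF_B"}, {"TF_B", "TF_C"})
# Only gains and losses are carried forward to the network analysis.
retained = {name: effect for name, effect in effects.items() if effect != "neutral"}
```

Here TF_A is a lost site, TF_C a gained site and TF_B a neutral change, so only TF_A and TF_C are retained.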

Creation of a genetic profile

Protein-protein interactions of the proteins encoded by SNP-affected genes were obtained from OmniPath in January 2017 (Turei et al., 2016). For each patient, the set of proteins encoded by SNP-affected genes and their first interactors (first neighbours) was defined as the UC-specific network footprint of that patient. The union of all network footprints, the UC-ome, was analyzed and visualized in Cytoscape 3.3.0 (Su et al., 2014) using the inverted self-organizing map layout. We retained only those SNP-affected genes present in the OmniPath resource, which formed a giant component with their interactors. Patient-specific networks were constructed using the Cytoscape CyRestClient 0.6 in Python (Ono et al., 2015). Cluster analysis was carried out using the clusterMaker Cytoscape app (Morris et al., 2011) with the GLay clustering method (Su et al., 2010), which is an implementation of the Girvan-Newman clustering algorithm (Newman and Girvan, 2004). Hereinafter, network clusters are referred to as “modules” (to be distinguishable from patient clusters).
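The definition of a network footprint and of the UC-ome can be expressed compactly in Python. The following is a minimal sketch under the assumption that the interaction resource has been loaded as an undirected adjacency dictionary; the protein and patient identifiers are placeholders and the functions are not part of any of the tools named above.

```python
from typing import Dict, Iterable, Set


def network_footprint(affected_proteins: Set[str],
                      interactions: Dict[str, Set[str]]) -> Set[str]:
    """Proteins encoded by SNP-affected genes plus their first neighbours."""
    footprint = set(affected_proteins)
    for protein in affected_proteins:
        footprint |= interactions.get(protein, set())
    return footprint


def union_of_footprints(footprints: Iterable[Set[str]]) -> Set[str]:
    """Union of all patient footprints (the 'UC-ome' in the text)."""
    merged: Set[str] = set()
    for footprint in footprints:
        merged |= footprint
    return merged


# Hypothetical undirected interaction map (each edge listed in both directions).
interactions = {
    "P1": {"P2", "P3"}, "P2": {"P1"}, "P3": {"P1"},
    "P4": {"P5"}, "P5": {"P4"},
}
# Hypothetical sets of proteins encoded by SNP-affected genes, per patient.
affected = {"patient_a": {"P1"}, "patient_b": {"P4"}, "patient_c": set()}

footprints = {pid: network_footprint(prots, interactions) for pid, prots in affected.items()}
uc_ome = union_of_footprints(footprints.values())
```

A patient without any SNP-affected protein has an empty footprint in this sketch, which corresponds to contributing nothing to the union.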

The SciPy scikit-learn package was used for hierarchical clustering (Pedregosa et al., 2011) of the patient-specific clusters. The constructed distance matrix between patients was based on the Hamming distance (Hamming, 1950). If a protein was directly or indirectly affected by a SNP in a patient, then it was assigned a “1” for that patient. If the protein was not affected, then it was scored as “0”. Multidimensional scaling was conducted in the KNIME environment using the MSA KNIME node (Berthold et al., 2008; Kruskal, 1964). We retained only the first three dimensions. The first two dimensions were plotted in Microsoft Excel. For the obtained clusters and sub-clusters we compared the occurrence of therapeutic upscaling with Fisher exact tests in Matlab version 8.4 (MATLAB, 2014).
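The binary scoring and Hamming-distance clustering described above can be sketched in plain Python. This is an illustrative re-implementation, not the library routine used in the study: the linkage criterion is not stated in the text, so average linkage is assumed here, and the patient names and profiles are invented.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x != y for x, y in zip(a, b))


def cluster_profiles(profiles: Dict[str, Sequence[int]], n_clusters: int) -> List[List[str]]:
    """Agglomerative clustering with average linkage (an assumption) on Hamming distances."""
    names = list(profiles)
    dist = {(p, q): hamming(profiles[p], profiles[q]) for p in names for q in names}
    clusters = [[name] for name in names]

    def linkage(c1: List[str], c2: List[str]) -> float:
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[(p, q)] for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; i < j, so deleting j keeps i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Each column is one protein: 1 if affected by a SNP in that patient, else 0 (toy data).
profiles = {
    "p1": [1, 1, 0, 0], "p2": [1, 1, 0, 0],
    "p3": [0, 0, 1, 1], "p4": [0, 0, 1, 1],
    "p5": [1, 1, 1, 1], "p6": [0, 0, 0, 0],
}
patient_clusters = cluster_profiles(profiles, n_clusters=4)
```

With these toy profiles the four groups are the two pairs of identical patients and the two remaining patients on their own.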

Gene Ontology analysis

Gene Ontology Biological Processes (GO BP) (Gene Ontology Consortium, 2015) and Reactome pathways (Croft et al., 2014) were accessed from the UniProt database (UniProt Consortium, 2015) on 7 February 2017. We used the hypergeometric test in an R environment (Core Team, 2015) to infer overrepresented biological processes and pathways, with the whole of OmniPath as background. We considered a pathway or GO BP to be significant if the Benjamini-Hochberg corrected false discovery rate was below 0.1.

Machine learning

A Random Forest classifier was implemented to predict therapeutic outcome. The outcome variable, encoded as a binary variable, was whether a patient needed additional immunomodulation therapy besides mesalazine, for the 370 patients for whom therapeutic information was available. The SNP-affected genes and the first neighbour protein-protein interactors of multiple SNP-affected genes were the binary features, and redundant features were removed. There were 15 such genes/proteins. The model was implemented in the Python library scikit-learn (v. 0.18.1) (Pedregosa et al., 2011), then evaluated via 5-fold cross-validation, and accuracy, recall and precision values were computed for each fold. Accuracy was reported with a range of 2 standard deviations. In addition, feature importance was calculated for each fold with scikit-learn, which implements the Gini importance. We ranked the features by importance averaged over all the folds. Default parameters were used for the analysis.
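To make the reported quantities concrete, the following Python sketch shows the impurity decrease that underlies a Gini-based feature importance for a single split on a binary feature, and a mean with a 2 standard deviation range over per-fold accuracies. It is a simplified illustration only: a Random Forest accumulates such decreases over many splits and trees and normalises them, the use of the population standard deviation is an assumption, and all numbers shown are invented.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Reduction in Gini impurity obtained by splitting on a binary feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted


def summarize_folds(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean score and 2 x population standard deviation across folds (assumed convention)."""
    return mean(scores), 2 * pstdev(scores)


# Toy labels: 1 = therapy escalated, 0 = not escalated (hypothetical).
labels = [0, 0, 1, 1]
informative = impurity_decrease([0, 0, 1, 1], labels)     # feature separates the classes
uninformative = impurity_decrease([0, 1, 0, 1], labels)   # feature carries no information

# Hypothetical per-fold accuracies from a cross-validation run.
avg_accuracy, two_sd = summarize_folds([0.5, 0.6, 0.7])
```

A feature that separates the two outcome classes perfectly removes all impurity at the split, whereas an uninformative feature removes none, which is why the former would rank higher.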

Results

In a cohort of 377 patients, we identified 12 UC-associated regulatory SNPs localized within transcription factor binding sites (TFBS) or miRNA target sites (miRNA-TS) based on integrating Immunochip data (Jostins et al., 2012) and the dbSNP ancestral allele database (Sherry et al., 2001) with regulatory network resources (see Methods for details). Of these 12 SNPs, eight were related to the regulation of proteins with known high confidence interactors. We limited the computational analysis to manually curated resources to include only the most reliable biological components and interactions. We added first neighbour interactors to the eight identified proteins, in addition to the regulators whose TFBS or miRNA-TS were affected, and the regulatory interactions that target the first neighbours from these regulators. In total, the UC network consisted of 247 protein nodes, 1269 protein-protein interactions, 631 TF-target gene connections and 66 miRNA-mRNA regulatory connections. The two most central proteins were NFKB1 and PRKCB, both of which are known to be involved in UC (Burkitt et al., 2015; Gould et al., 2016). NFKB1 is the central mediator of inflammation (Caamano and Hunter, 2002) and PRKCB activates NFKB1 indirectly (Kang et al., 2001). The prominent position of NFKB1 and PRKCB in the UC network was therefore expected.

The UC network consists of six distinct but intertwined network modules according to Girvan-Newman clustering (Newman and Girvan, 2004). These modules are distinguishable by visualizing the whole network in a force-directed layout (Figure 2a). Each module is centred around a key signalling protein directly affected by a SNP (Figure 2b). The three largest modules are formed mainly by 1) the interactors of PRKCB and FCGR2A (88 proteins), 2) the interactors of NFKB1 (51 proteins), and 3) the binding partners of LSP1 and GNA12, a module that also contains many interactors of both NFKB1 and PRKCB (71 proteins). We also identified two epigenetic modules centred around 4) histone deacetylase 7 (HDAC7) (25 proteins), and 5) DNA methyltransferase 3 beta (DNMT3B) (7 proteins). These two epigenetic regulators are affected by SNPs altering not only miRNA-based post-transcriptional regulation (as in the other modules), but also transcriptional regulation. A mutation in the TFBS for SMARCA4 in the promoter region of HDAC7 is associated with UC, which supports previously described alterations in both histone acetylation and DNA methylation in the context of UC pathogenesis (Yi and Kim, 2015). Similarly, the TFBSs of the transcription factors SMARCA4, TFAP2C, RBL2 and FOXP1 are all affected by SNPs in the promoter region of DNMT3B.

The sixth module contains members of the Notch pathway. The Notch pathway is connected to the UC-ome through MAML2, an important NOTCH protein co-activator (McElhinny et al., 2008; Wu and Griffin, 2004). MAML2 expression could be affected by loss of the miR-4495 target site as a result of SNP ^543104, which occurs in 40% of the examined patients. NOTCH proteins are involved in gastrointestinal stem cell homeostasis, driving differentiation towards absorptive epithelial cells (rather than secretory cells, like goblet cells) (Chen et al., 2017; Katoh and Katoh, 2007). While malfunction of these processes is associated with UC, the direct involvement of NOTCH proteins in UC pathogenesis is unclear (Kini et al., 2015).

Thus, the present method identified several major signalling proteins linked to UC, with a novel systems-level overview identifying cross-talk between these proteins and the pathways they reside within. In addition to connecting known components of UC, this systems genomics approach, by revealing SNP-affected signalling proteins and processes, extends our understanding of UC pathogenesis.

Revealing the role of the Notch pathway in a distinct group of patients

Based on the set of SNPs present in each of 377 UC patients, we defined patient-specific subsets. The genetic profile of each patient contained the proteins encoded by the SNP-affected genes and the interactors of these proteins, i.e. their first neighbour proteins. We clustered the patients based on their profile with a hierarchical clustering algorithm into four distinct clusters. The first cluster contained the profiles for patients whose mutations were related to PRKCB, with the second cluster containing profiles for patients with mutations related to NFKB1. In the third cluster, the profiles contained both PRKCB and NFKB1 SNPs, while the network footprints of the fourth cluster had neither PRKCB nor NFKB1 affected. The clear distinction of the four clusters was confirmed by unbiased multidimensional scaling (MDS) analysis, a form of non-linear dimensionality reduction.

Next, we investigated whether the patients in distinct network footprint clusters had similar therapeutic escalation in their clinical history as measured by a change from the standard mesalazine (5-amino-salicylic-acid) therapy to an immunomodulatory-based therapy (Clinical guideline CG166, 2013, Ulcerative colitis: management). Such information was available for 370 of the 377 patients. The use of additional immunomodulators instead of mesalazine did not depend on the network footprint clusters (p > 0.05, Chi-squared test), suggesting a complex, non-genomic basis of UC pathogenesis and response to therapy.

To identify potential hidden variables underlying the therapeutic escalation, we used a Random Forest machine learning approach. In this analysis, we focused on 15 proteins which were either directly affected by a SNP or interacting with at least two different SNP-affected genes. The output variable was whether a patient requires additional immunomodulation or not. The performance of the machine learning model was poor (5-fold cross-validation average accuracy 0.57 (+/-0.072 SD)), potentially due to a strong influence of non-genetic/extrinsic factors in the pathogenesis of UC. However, the model highlighted two interesting proteins, FCGR2A (low affinity immunoglobulin gamma Fc region receptor II-a) and MAML2 (Mastermind-like protein 2). These two proteins were found to contribute 10% or more to the prediction of immunomodulation, using the Gini index for the feature importance. No other protein contributed as much to the prediction of the treatment outcome. Using the five Random Forest models to predict therapeutic upscaling, FCGR2A contributed 20.1% (2SD = 3.8%) of the model prediction whereas MAML2 contributed 11.34% (2SD = 3.1%). FCGR2A has previously been linked to UC (Li et al., 2017) whereas MAML2 has not been previously implicated in UC pathogenesis. The model’s selection of MAML2 can be rationalized by its ability to bind NOTCH proteins. Through NOTCH proteins, MAML2 can modulate the activity of already described UC-associated pathways, including the NFKB pathway. Thus, an unbiased machine learning approach has also pointed to the importance of MAML2 in influencing therapeutic outcome, due to its role in regulating the Notch pathway.

As part of a more focused analysis, we assessed the correlation between therapy escalation and the presence of the SNP causing a loss in miRNA regulation of MAML2. No correlation was found when all patients were investigated (Chi-squared test, p = 0.22). However, close scrutiny of the four network footprint clusters identified a correlation between MAML2 and therapeutic escalation in the fourth cluster, where neither PRKCB nor NFKB1 regulation was affected. In this cluster, the 41 patients with the MAML2-affecting SNP had an almost three times higher chance of therapy escalation compared to the 64 patients without the MAML2-affecting SNP (Fisher exact test p = 0.0131; OR = 2.959; confidence interval: 1.302 - 6.724), indicating the importance of patient cohort specific analysis in evaluating SNP effects. The significance and clinical relevance of this SNP in UC is apparent from a specific patient who had only the MAML2 SNP, and still required therapy escalation.
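The kind of 2x2 comparison reported above can be reproduced with a short Python sketch. The underlying counts are not stated in the text; the table used below (22 of 41 carriers and 18 of 64 non-carriers escalated) is an assumed example chosen because it yields the same odds ratio and interval to three decimal places when a log-scale (Woolf) 95% interval is used, which is itself an assumption about how the interval was obtained. The function names are illustrative.

```python
from math import comb, exp, log, sqrt
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def odds_ratio_with_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Sample odds ratio with a Woolf (log-scale) confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


# Assumed counts (not given in the text): rows = SNP present / absent,
# columns = therapy escalated / not escalated.
a, b = 22, 19   # 41 patients with the SNP
c, d = 18, 46   # 64 patients without the SNP

p_value = fisher_exact_two_sided(a, b, c, d)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(a, b, c, d)
```

Because the counts are assumed, the sketch should be read as a consistency check on the reported statistics rather than as a reconstruction of the study data.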

Using these approaches, we discovered that in a well-defined genetic background (present in 11% of patients), Notch signalling and mutations affecting MAML2 have a key role in UC pathogenesis.

In order to address various issues and advance the art, the entirety of this disclosure shows, by way of illustration, various embodiments in which the claimed invention may be practiced and provides for a superior process for analysing genetic data. The advantages and features of the disclosure are of a representative sample of embodiments only and are not exhaustive and/or exclusive. They are presented only to assist in understanding and teach the claimed features. It is to be understood that advantages, embodiments, examples, functions, features, structures, and/or other aspects of the disclosure are not to be considered limitations on the disclosure as defined by the claims or limitations on equivalents to the claims, and that other embodiments may be utilized and modifications may be made without departing from the scope and/or spirit of the disclosure. Various embodiments may suitably comprise, consist of, or consist essentially of, various combinations of the disclosed elements, components, features, parts, steps, means, etc. In addition, the disclosure includes other inventions not presently claimed, but which may be claimed in future.

References

Burkitt, M.D., Hanedi, A.F., Duckworth, C.A., Williams, J.M., Tang, J.M., O’Reilly, L.A., Putoczki, T.L., Gerondakis, S., Dimaline, R., Caamano, J.H., et al. (2015). NF-κB1, NF-κB2 and c-Rel differentially regulate susceptibility to colitis-associated adenoma development in C57BL/6 mice. J Pathol 236, 326-336.

Caamano, J., and Hunter, C.A. (2002). NF-kappaB family of transcription factors: central regulators of innate and adaptive immune functions. Clin Microbiol Rev 15, 414-429.

Clinical guideline CG166: Ulcerative colitis: management (2013). National Institute for Health and Care Excellence.

Core Team, R. (2015). R: A Language and Environment for Statistical Computing.

Croft, D., Mundo, A.F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar, M.R., et al. (2014). The Reactome pathway knowledgebase. Nucleic Acids Res 42, D472-7.

Enright, A.J., John, B., Gaul, U., Tuschl, T., Sander, C., and Marks, D.S. (2003). MicroRNA targets in Drosophila. Genome Biol 5, R1.

Gini, C. (1912). Variabilità e Mutabilità (Bologna: C. Cuppini).

Gong, Y., Wu, C.N., Xu, J., Feng, G., Xing, Q.H., Fu, W., Li, C., He, L., and Zhao, X.Z. (2013). Polymorphisms in microRNA target sites influence susceptibility to schizophrenia by altering the binding of miRNAs to their targets. Eur Neuropsychopharmacol 23, 1182-1189.

Gould, N.J., Davidson, K.L., Nwokolo, C.U., and Arasaradnam, R.P. (2016). A systematic review of the role of DNA methylation on inflammatory genes in ulcerative colitis. Epigenomics 8, 667-684.

Huang, H., Fang, M., Jostins, L., Umicevic Mirkov, M., Boucher, G., Anderson, C.A., Andersen, V., Cleynen, I., Cortes, A., Crins, F., et al. (2017). Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547, 173-178.

Jostins, L., Ripke, S., Weersma, R.K., Duerr, R.H., McGovern, D.P., Hui, K.Y., Lee, J.C., Schumm, L.P., Sharma, Y., Anderson, C.A., et al. (2012). Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491, 119-124.

Kang, S.W., Wahl, M.I., Chu, J., Kitaura, J., Kawakami, Y., Kato, R.M., Tabuchi, R., Tarakhovsky, A., Kawakami, T., Turck, C.W., et al. (2001). PKCbeta modulates antigen receptor signaling via regulation of Btk membrane localization. EMBO J 20, 5692-5702.

Katoh, M., and Katoh, M. (2007). Notch signaling in gastrointestinal tract (review). Int J Oncol 30, 247-251.

Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. (2002). The human genome browser at UCSC. Genome Res 12, 996-1006.

Kim, Y.S., and Ho, S.B. (2010). Intestinal goblet cells and mucins in health and disease: recent insights and progress. Curr Gastroenterol Rep 12, 319-330.

Kini, A.T., Thangaraj, K.R., Simon, E., Shivappagowdar, A., Thiagarajan, D., Abbas, S., Ramachandran, A., and Venkatraman, A. (2015). Aberrant niche signaling in the etiopathogenesis of ulcerative colitis. Inflamm Bowel Dis 21, 2549-2561.

Kozomara, A., and Griffiths-Jones, S. (2011). miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39, D152-7.

Kruskal, J.B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1-27.

de Lange, K.M., Moutsianas, L., Lee, J.C., Lamb, C.A., Luo, Y., Kennedy, N.A., Jostins, L., Rice, D.L., Gutierrez-Achury, J., Ji, S.-G., et al. (2017). Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat Genet 49, 256-261.

Liu, C., Cheng, H., Shi, S., Cui, X., Yang, J., Chen, L., Cen, P., Cai, X., Lu, Y., Wu, C., et al. (2013). MicroRNA-34b inhibits pancreatic cancer metastasis through repressing Smad3. Curr Mol Med 13, 467-478.

Mathelier, A., Fornes, O., Arenillas, D.J., Chen, C.-Y., Denay, G., Lee, J., Shi, W., Shyr, C., Tan, G., Worsley-Hunt, R., et al. (2016). JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 44, D110-5.

MATLAB (2014). 8.4.0.150421 (R2014b) (Natick, Massachusetts).

McElhinny, A.S., Li, J.L., and Wu, L. (2008). Mastermind-like transcriptional co-activators: emerging roles in regulating cross talk among multiple signaling pathways. Oncogene 27, 5138-5147.

Morris, J.H., Apeltsin, L., Newman, A.M., Baumbach, J., Wittkop, T., Su, G., Bader, G.D., and Ferrin, T.E. (2011). clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics 12, 436.

Newman, M.E.J., and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys 69, 026113.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research.

Prager, M., Buettner, J., and Buening, C. (2015). Genes involved in the regulation of intestinal permeability and their role in ulcerative colitis. J Dig Dis 16, 713-722.

Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, 308-311.

Turatsinze, J.-V., Thomas-Chollier, M., Defrance, M., and van Helden, J. (2008). Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc 3, 1578-1588.

Turei, D., Korcsmaros, T., and Saez-Rodriguez, J. (2016). OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nat Methods 13, 966-967.

UniProt Consortium (2015). UniProt: a hub for protein information. Nucleic Acids Res 43, D204-12.

Wu, L., and Griffin, J.D. (2004). Modulation of Notch signaling by mastermind-like (MAML) transcriptional co-activators and their involvement in tumorigenesis. Semin Cancer Biol 14, 348-356.

Wu, F., Zikusoka, M., Trindade, A., Dassopoulos, T., Harris, M.L., Bayless, T.M., Brant, S.R., Chakravarti, S., and Kwon, J.H. (2008). MicroRNAs are differentially expressed in ulcerative colitis and alter expression of macrophage inflammatory peptide-2 alpha. Gastroenterology 135, 1624-1635.e24.

Wu, F., Huang, Y., Dong, F., and Kwon, J.H. (2016). Ulcerative Colitis-Associated Long Noncoding RNA, BC012900, Regulates Intestinal Epithelial Cell Apoptosis. Inflamm Bowel Dis 22, 782-795.