Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
PREDICTION OF GENE TRANSCRIPTION INTENSITY AND GENE PERTURBATION
Document Type and Number:
WIPO Patent Application WO/2014/209225
Kind Code:
A1
Abstract:
This invention relates to methods of using gene co-expression networks for identifying one or more marker genes that allow prediction of the transcription intensity of at least one gene of interest and for predicting the transcription intensity of at least one gene of interest in a given sample. The present invention also relates to methods for identifying a group of genes of interest whose transcription correlates with the transcription of a gene whose transcription is to-be-perturbed and for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism. Further, the present invention relates to a panel of marker genes and a kit for detecting these marker genes.

Inventors:
LING HAN TONG MAURICE (SG)
POH CHUEH LOO (SG)
LIM YUTING ROSARY (SG)
Application Number:
PCT/SG2014/000234
Publication Date:
December 31, 2014
Filing Date:
May 28, 2014
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
UNIV NANYANG TECH (SG)
International Classes:
G16B20/20; C40B30/02; G16B5/00; G16B25/10; G16B20/00
Other References:
ZHANG, J. ET AL.: "Using gene co-expression network analysis to predict biomarkers for chronic lymphocytic leukemia", BMC BIOINFORMATICS, vol. 11, no. SUPPL, 2010, pages 1 - 9
PRIETO, C. ET AL.: "Human Gene Coexpression Landscape: Confident Network Derived from Tissue Transcriptomic Profiles", PLOS ONE, vol. 3, no. ISSUE, 2008, pages 1 - 14
AYA, K. ET AL.: "Comprehensive Network Analysis of Anther-Expressed Genes in Rice by the Combination of 33 Laser Microdissection and 143 Spatiotemporal Microarrays", PLOS ONE, vol. 6, no. ISSUE, 2011, pages 1 - 13
"Affymetrix White Paper: GeneChip 3' IVT Express Kit", 14 May 2009 (2009-05-14), Retrieved from the Internet [retrieved on 20140822]
LING, M.H.T ET AL.: "A predictor for predicting Escherichia coli transcriptome and the effects of gene perturbations", BMC BIOINFORMATICS, vol. 15, no. 140, 2014, pages 1 - 12
Attorney, Agent or Firm:
VIERING, JENTSCHURA & PARTNER LLP (Rochor Post OfficeRochor Road, Singapore 3, SG)
Download PDF:
Claims:
CLAIMS

1. Method for identifying one or more marker genes that allow prediction of the transcription intensity of at least one gene of interest, the method comprising:

a) providing a gene co-expression network (GCN) (1 st network) including the correlation of the transcription of all possible gene pairs and wherein the 1st network comprises the at least one gene of interest;

b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co-expression network to form a diminished network of gene(s) (2nd network) and the co-transcription of all gene pairs of the 1 st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6;

c) identifying within the selected genes of step b) marker gene(s), wherein said marker gene(s) form with at least 90% of the genes of the 2nd network transcription correlated gene pairs, said correlated gene pairs form a 3rd network and the number of marker genes is at most 15% of the number of genes of the 2nd network and wherein the 3rd network comprises the gene of interest and the gene of interest is not a marker gene.

2. Method for predicting the transcription intensity of at least one gene of interest in a given sample, the method comprising:

a) measuring the transcription intensity of one or more marker genes identified according to the method of claim 1 in the given sample; and

b) using the transcriptional intensity measured in step a) and pair-wise linear regression on the gene pairs of the 3rd network to predict the transcription intensity of the at least one gene of interest.

3. The method of claim 1 or 2, wherein the transcription of a gene pair correlates when the correlation coefficient is at least 0.75 or when the p-value of the correlation coefficient is at least le-9.

4. The method according to any one of the previous claims, wherein the pair-wise linear regression is a pair-wise first-order linear regression.

5. The method according to any one of the previous claims, wherein the marker genes are identified by a round-based selection, wherein one selection round comprises:

- identifying the gene of the 2nd network that forms the highest number of correlated gene pairs with the other genes of the 2nd network, wherein the gene of the 2nd network that forms the highest number of correlated gene pairs with the other genes of the 2nd network is a marker gene, and

- removing said marker gene and the genes that form transcription correlated gene pairs with said marker gene from the list of the 2nd network; and

wherein further rounds of selection are applied on the remaining genes of the 2nd network until the selected marker gene(s) form with at least 90% of the genes of the 2nd network transcription correlated gene pairs.

6. The method according to any one of the previous claims, wherein the marker gene(s) form with at least 95% of the genes of the 2nd network transcription correlated gene pairs.

7. The method according to any one of claims 2 to 6, wherein the transcription intensity of the at least one gene of interest is predicted by using not more than five jumps between linear regressions of the gene pairs of the 3rd network.

8. The method according to any one of claims 2 to 7, wherein the transcription intensity of the at least one gene of interest is predicted by using not more than four jumps between linear regressions of the gene pairs of the 3rd network.

9. The method according to any one of claims 2 to 8, wherein the transcription intensity of the at least one gene of interest is predicted by using not more than three jumps between linear regressions of the gene pairs of the 3rd network.

10. The method according to any one of claims 2 to 9, wherein in step a) of claim 2 the transcription intensity of one gene is measured.

1 1. The method according to any one of claims 2 to 9, wherein in step a) of claim 2 the transcription intensity of at least two genes is measured.

12. The method according to any one of claims 2 to 1 1 , wherein the transcription intensity of at least one gene is measured at least two-times and the measured values are merged and averaged.

13. The method according to any one of claims 2 to 12, wherein the given sample is a human sample.

14. The method according to claim 13, wherein the human sample is a sample from a breast cancer patient.

15. The method according to any one of claims 2 to 12, wherein the given sample is a yeast sample.

16. The method according to any one of claims 2 to 12, wherein the given sample is an Escherichia coli sample.

17. The method according to any one of claims 2 to 16, wherein the transcription intensity of the at least one gene of interest is predicted within three standard deviations of error.

18. Method for identifying a group of genes of interest whose transcription correlates with the transcription of a gene whose transcription is to-be-perturbed, the method comprising:

a) providing a gene co-expression network (GCN) (1st network) including the correlation of the transcription of all possible gene pairs and wherein the 1st network comprises the group of genes of interest and the to -be -perturbed gene; b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co-expression network to form a diminished network of gene(s) (2nd network) and the co-transcription of all gene pairs of the 1st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6;

c) identifying within the 2nd network genes whose transcription is correlated with the transcription of the to-be-perturbed gene, wherein the genes of the 2nd network whose transcription correlates with the transcription of the to-be-perturbed gene form the group of genes of interest.

19. Method for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism, the method comprising:

a) identifying the group of genes of interest according to claim 18;

b) using transcriptional intensity of the at least one to-be-perturbed gene in a non- perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a non- perturbed background;

c) using transcriptional intensity of the at least one to-be-perturbed gene in a perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a perturbed background; and

d) comparing the transcription intensity of the non-perturbed and perturbed background for each gene of the group of genes of interest.

20. The method of claim 19, wherein the pair-wise linear regression is a pair-wise first-order linear regression.

21. The method of claims 19 or 20, wherein the effect of one perturbed gene is predicted.

22. The method of claims 19 or 20, wherein the effect of at least two perturbed genes is predicted.

23. The method of claims 19, 20 or 22, wherein the effect of at least four perturbed genes is predicted.

24. The method according to any one of claims 19 to 23, wherein the transcription intensity of the at least one perturbed gene in a perturbed and/or non-perturbed background is

a) known;

b) derived from a previously known transcription intensity; or

c) measured.

25. The method of claims 19 or 20, wherein the transcription intensity of the at least one perturbed gene is measured at least two-times and the measured values are merged and averaged.

26. The method of claims 22 or 23, wherein in step b) of claim 18 the transcription intensity of each gene of the group of genes of interest is determined by separately calculating the transcription intensity for each gene perturbation and merging and averaging for each gene of the group of genes of interest the transcription intensity for each gene perturbation.

27. The method according to any one of claims 19 to 26, wherein the at least one perturbed gene has decreased transcription intensity.

28. The method according to any one of claims 19 to 26, wherein the at least one perturbed gene has increased transcription intensity.

29. The method according to any one of claims 22 to 26, wherein the perturbed genes have decreased and increased transcription intensity.

30. The method according to any one of claims 19 to 29, wherein the group of genes of interest comprises at least one gene.

31. The method according to any one of claims 19 to 30, wherein the group of genes of interest comprises at least five genes.

32. The method according to any one of claims 19 to 31 , wherein the group of genes of interest comprises at least ten genes.

33. The method according to any one of claims 19 to 32, wherein the group of genes of interest comprises at least 20 genes.

34. The method according to any one of claims 19 to 33, wherein the group of genes of interest comprises at least 50 genes.

35. The method according to any one of claims 18 to 34, wherein the transcription of a gene pair correlates when the correlation coefficient is at least 0.75.

36. The method according to any one of claims 18 to 35, wherein the transcription intensity for each gene of the group of genes of interest is calculated by using not more than five jumps between linear regressions of the gene pairs of the 2nd network.

37. The method according to any one of claims 18 to 35, wherein the transcription intensity for each gene of the group of genes of interest is calculated by using not more than four jumps between linear regressions of the gene pairs of the 2nd network.

38. The method according to any one of claims 18 to 35, wherein the transcription intensity for each gene of the group of genes of interest is calculated by using not more than three jumps between linear regressions of the gene pairs of the 2nd network.

39. The method according to any one of claims 19 to 38, wherein the given organism is a human.

40. The method according to claim 39, wherein the human is a breast cancer patient.

41. The method according to any one of claims 19 to 38, wherein the given organism is yeast.

42. The method according to any one of claims 19 to 38, wherein the given organism is

Escherichia coli.

43. A panel of marker genes identified according to a method of any one of claims 1 to 17.

44. The panel of marker genes according to claim 43, wherein said panel comprises the marker genes

- Hs.63 1923 (GenBank ID NM_004157), Hs.367799 (GenBank ID NM_002088), Hs.388739 (GenBank ID NM_021 141 ), Hs.469022 (GenBank IDs NM_001929, or NM_08O915, or

NM 080916, or NM_080917, or NM_080918), Hs.579264 (GenBank IDs NM_001 130090, or NM_001 130091 , or NM_001 130092, or NM_03 1294), Hs.435765 (GenBank ID NMJ301977);

- Hs.514242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.138860 (GenBank ID NM_004308), Hs.24529 (GenBank IDs NM_001 1 14121 , or

NM_001 1 14122, or NM 001244846, or NM_001274, or NR_045204, or NR_045205);

- Hs.5 14242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.595333 (GenBank IDs NM_001025596, or NM_001 172639, or NM_001 172640, or

NM_006560, or NM_ 198700);

- Hs.607822 (GenBank IDs NM_001 145966, or NM_002417), Hs.233240 (GenBank IDs NM_004369, or NM_057164, or NM_057165 , or NM_057166, or NM_057167), Hs.5 18249 (GenBank IDs NM_001 127192, or NM_001 127193, or NM_001 127194, or NM_001 127195, or NM_001 127196, or NM_003418), Hs.731619 (GenBank IDs NM_007172, or NM_153645), Hs.63 1923 (GenBank ID NM_004157), Hs.367799 (GenBank ID NM_002088), Hs.5 13083 (GenBank IDs NM 000661 , or NM_001024921 ), Hs.31 1640 (GenBank IDs NM_001 135592, or NMJD01 177413, or NM_002954), Hs.388739 (GenBank ED NM_021 141 ), Hs.469022 (GenBank IDs NM_001929, or NM_080915, or NM_080916, or NM_080917, or NM_08091 8), Hs.405144 (GenBank IDs NM_003017, or NR_036610), Hs.371240 (GenBank IDs

NM_005100, or NM_144497), Hs.73 1463 (GenBank ID NM_014155), Hs.579264 (GenBank IDs NM_001 130090, or NM_001 130091 , or NM_001 130092, or NM_031294), Hs.210283 (GenBank ID NM_000093), Hs.279784 (GenBank ID NM_013388), Hs.631989 (GenBank ID NM_002931 ), Hs.435765 (GenBank ID NM_001977), Hs.496710 (GenBank IDs

NM_001042749, or NM_001042750, or NM_00104275 1 , or NM_006603);

- Hs.48031 1 (GenBank IDs NM_00101 1513, or NM_00101 1515, or NM_00101 1516, or NM_001256425, or NM_001256426, or NM_001256427, or NM_001256428, or

NM_001256429, or NM_006457, or NR_024179, or NR_ 046186), Hs.516032 (GenBank ID NM_000182), Hs.654370 (GenBank ID NM_ 004460), Hs.5 14242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.138860 (GenBank ID NM_004308), Hs.24529 (GenBank IDs NM_001 1 14121 , or NM_001 1 14122, or NM_001244846, or

NM_001274, or NR_045204, or NR_045205), Hs.446588 (GenBank IDs NM_001017, or NR_001452, or XR l 1 1 163), Hs.373763 (GenBank IDs NM_001 102397, or NM_001 102398, or NM_001 102399, or NMJ)05826);

- Hs.514242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.595333 (GenBank IDs NM_001025596, or NM_001 172639, or NM_001 172640, or

NM_006560, or NM_198700), Hs.596704 (GenBank IDs NM_001037328, or NM_004602, or NM_017452, or NM_017453, or NM_017454), Hs.122583 (GenBank IDs NM_024743, or NR_024010);

- 1774893_at (GenBank ID NM_001 184556), 1779850_at (GenBank ID NM_001 184257);

- 1771091_at (GenBank ID NM_001 181808);

- 1774893_at (GenBank ID NM_001 184556), 1779850_at (GenBank ID NM_001 184257);

- 177109 l at (GenBank ID NM_001 181808);

- 1 765788_s_at (SEQ ID NO: l ), 17613 13_at (SEQ ID NO:2), 176470 l_s_at (SEQ ID NO:3), 1765667_s_at (SEQ ID NO:4), 1761 171_s_at (SEQ ID NO:5), 1764563_s_at (SEQ ID NO:6), 1765668_at (SEQ ID NO:7), 1766750_s_at (SEQ ID NO:8), 1759650_s_at (SEQ ID NO:9), 1764385_at (SEQ ID NO: 10), 1767674_s_at (SEQ ID NO: l 1 ), 176808 l_s_at (SEQ ID NO: 12), 176859 l_s_at (SEQ ID NO: 13), 1766138_s_at (SEQ ID NO: 14), 1768603_s_at (SEQ ID ΝΟ.Ϊ 5), 1768237_s_at (SEQ ID NO: 16), 1761582_s_at (SEQ ID NO: 17), 1760255_s_at (SEQ ID NO: 18), 1763957_s_at (SEQ ID NO: 19), 175945 l_s_at (SEQ ID NO:20); or

- 1763376_at (SEQ ID NO:21 ), 1768602_at (SEQ ID NO:22), 1765970_x_at (SEQ ID NO:23), 1768855_at (SEQ ID NO:24), 1762662_s_at (SEQ ID NO:25), 1764355_at (SEQ ID NO:26), 1761298_at (SEQ ID NO:27), 1767293_at (SEQ ID NO:28), 1766730_at (SEQ ID NO:29), 1761324_at (SEQ ID NO:30), 1761072_at (SEQ ID NO:3 1 ), 1763355_at (SEQ ID NO:32), 1760568_s_at (SEQ ID NO:33), 1765285_s_at (SEQ ID NO:34), 1762875_at (SEQ ID NO:35), 1762326_at (SEQ ID NO: 36), 1768785_s_at (SEQ ID NO:37), 1759869_at (SEQ ID NO:38), 1766040_at (SEQ ED NO:39), 1760045_s_at (SEQ ID NO:40), 1761673_at (SEQ ID NO:41 ), 1763754_at (SEQ ID NO:42), 1767770_at (SEQ ID NO:43), 1768126_at (SEQ ID NO:44), 1763914_s_at (SEQ ID NO:45), 1764068_x_at (SEQ ED NO:46), 1768400_s_at (SEQ ID NO:47), 1763222_x_at (SEQ ED NO:48), 1768805_s_at (SEQ ID NO:49), 1762892_s_at (SEQ ID NO:50), 1763332_s_at (SEQ ID NO:51 ), 1764533_s_at (SEQ ID NO:52), 1766506_s_at (SEQ ID NO:53), 1765366_s_at (SEQ ED NO:54), 1767220_at (SEQ ID NO:55), 1765015_at (SEQ ID NO:56), 1766029_at (SEQ ID NO:57), 1760408_s_at (SEQ ID NO:58), 1764010_s_at (SEQ ID NO:59), 1763879_s_at (SEQ ED NO:60), 1759435_at (SEQ ID NO:61), 1767485_at (SEQ ID NO:62), 1766425_s_at (SEQ ED NO:63), 1759255_at (SEQ ID NO:64), 1763325_at (SEQ ID NO:65), 1764978_at (SEQ ID NO:66), 1766452_at (SEQ ED NO:67), 1764869_s_at (SEQ ID NO:68), 1764915_s_at (SEQ ED NO:69), 1768426_s_at (SEQ ID NO:70),

1762728_s_at (SEQ ED NO:71 ), 1767138_s_at (SEQ ID NO:72), 1767987_x_at (SEQ ID NO:73), 1766777_s_at (SEQ ED NO:74), 1763925_s_at (SEQ ID NO:75), 1762374_s_at (SEQ ID NO:76), 1767386_at (SEQ ID NO:77), 1767335_s_at (SEQ ED NO:78), 1767009_s_at (SEQ ID NO:79).

45. A kit for detecting the marker genes identified according to a method of any one of claims 1 to 17.

Description:
PREDICTION OF GENE TRANSCRIPTION INTENSITY AND GENE

PERTURBATION

FIELD OF THE INVENTION

[0001] The present invention lies in the field of bioinformatics and relates to methods for identifying one or more marker genes that allow prediction of the transcription intensity of at least one gene of interest and methods for predicting the transcription intensity of at least one gene of interest in a given sample. The present invention also relates to methods for identifying a group of genes of interest whose transcription correlates with the transcription of a gene whose transcription is to-be-perturbed and methods for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism. Further, the present invention relates to a panel of marker genes and a kit for detecting these marker genes.

BACKGROUND OF THE INVENTION

[0002] One of the key challenges in systems biology is to develop a complete computational model of biology that can be used for both integration of knowledge and to develop and test hypotheses. A number of computational tools had been developed [Medema et al. (2012) Nature reviews Microbiology, 10(3), 191 -202] over the years but did not describe any tools for transcriptome prediction. Selinger et al. [Selinger et al. (2003) Trends in Biotechnology, 21 (6), 251 -254] proposed that a model, which can be used to predict gene expressions, will be useful for predicting the effects of gene over-expression, knockouts, and environmental stimuli. In addition, such prediction model will be useful to synthetic biologists to predict the effects of transgenes, also known as degree of orthogonality, on the gene expression of the host system based on the effects of the transgenes on a small number of effector genes [Porcar et al. (201 1 ) Syst Synth Biol, 5( 1 -2): 1 -9]. Such a model will be complementary to the list of tools for synthetic biology. A source of data for building such a model may be publically available microarray data, such as those deposited in Gene Expression Omnibus. Previous studies estimated that 50 microarrays are needed to specify an expression network of Escherichia coli and 40 thousand microarrays are needed for 99% chance of successful prediction (the number of required microarrays depends on the number of genes in the genome and the degree of complexity of the gene network).

[0003] Gene co-expression networks (GCN) have been commonly used to study expressional similarities of genes [Obayashi et al. (2013) DNA Res, 16(5), 249-260], wherein the nodes of the network are the genes and the links between the nodes, also known as edges, are the degree of co-expression. The basis of GCN is that genes, which are correlated in expression, have a higher likelihood of functional relatedness [Reverter et al. (2005) Australian Journal of Experimental Agriculture, 45, 821 -829]. Once the co-expression between two genes can be established, the expression level of a gene can be predicted from the known expression of another gene by means of linear regression.

[0004] However, despite the increased research interest in recent years concerning the development of tools for transcriptome prediction, no methods are known in the art that allow for precise and fast prediction of the transcriptome of a given individual and that allow for prediction of perturbed gene expression. Hence, there is need in the art for such methods.

SUMMARY OF THE INVENTION

[0005] It is an object of the present invention to meet the above need by providing methods for predicting the transcription intensity of at least one gene of interest in a given sample and methods for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism. Surprisingly, the present inventors have found that a network of genes wherein the transcription of each gene is significantly correlated with the transcription of at least one other gene of said network allows for a fast and precise prediction of a transcriptome or downstream/effector genes of a perturbed gene. The transcriptional intensity of at least 90% of the genes of a transcriptome or downstream/effector genes of a perturbed gene can be predicted within three standard deviations of the expected value of expression.

[0006J In a first aspect, the present invention is thus directed to a method for identifying one or more marker genes that allow prediction of the transcription intensity of at least one gene of interest, the method comprising: a) providing a gene co-expression network (GCN) (1 st network) including the correlation of the transcription of all possible gene pairs and wherein the 1st network comprises the at least one gene of interest; b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co- expression network to form a diminished network of gene(s) (2nd network) and the co- transcription of all gene pairs of the 1st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6; c) identifying within the selected genes of step b) marker gene(s), wherein said marker gene(s) form with at least 90% of the genes of the 2nd network transcription correlated gene pairs, said correlated gene pairs form a 3rd network and the number of marker genes is at most 15% of the number of genes of the 2nd network and wherein the 3rd network comprises the gene of interest and the gene of interest is not a marker gene.

[0007] In a further aspect, the present invention relates to a method for predicting the transcription intensity of at least one gene of interest in a given sample, the method comprising: a) measuring the transcription intensity of one or more marker genes identified according to the method according to the present invention in the given sample; and b) using the transcriptional intensity measured in step a) and pair-wise linear regression on the gene pairs of the 3rd network to predict the transcription intensity of the at least one gene of interest.

[0008] In a still further aspect, the present invention relates to a method for identifying a group of genes of interest whose transcription correlates with the transcription of a gene whose transcription is to-be-perturbed, the method comprising: a) providing a gene co-expression network (GCN) (1 st network) including the correlation of the transcription of all possible gene pairs and wherein the 1 st network comprises the group of genes of interest and the to-be- perturbed gene; b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co-expression network to form a diminished network of gene(s) (2nd network) and the co-transcription of all gene pairs of the 1 st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6; c) identifying within the 2nd network genes whose transcription is correlated with the transcription of the to-be-perturbed gene, wherein the genes of the 2nd network whose transcription correlates with the transcription of the to-be- perturbed gene form the group of genes of interest.

[0009] In another aspect, the present invention relates to a method for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism, the method comprising: a) identifying the group of genes of interest according to the method of the present invention using transcriptional intensity of the at least one to-be-perturbed gene in a non-perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a non-perturbed background; c) using transcriptional intensity of the at least one to-be-perturbed gene in a perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a perturbed background; and d) comparing the transcription intensity of the non-perturbed and perturbed background for each gene of the group of genes of interest.

[00010] In a still further aspect, the present invention relates to a panel of marker genes identified according to a method of the present invention.

[00011] In another aspect, the present invention relates to a kit for detecting the marker genes identified according to a method of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[00012] The invention will be better understood with reference to the detailed description when considered in conjunction with the non-limiting examples and the accompanying drawings.

[00013] Figure 1 shows microarray data series for model building. 605 microarrays from the following 40 data series in Gene Expression Omnibus (GEO) were used for model building. The respective CEL files were downloaded and processed by Affymetrix Expression Console using RMA normalization.

[00014] Figure 2 shows internal consistency of microarray values. In GPL3154 platform, 21 ORFs have 2 probes. As these 2 probes are detecting the same transcript, the ratio of these 2 probes can be used to evaluate the consistency and reproducibility of the microarray data. The ratio of 1 represents perfect consistency between the values of the 2 probes. The average ratio of each pairs of probe deviates from perfect consistency by 19.19% with a standard deviation of 12.61 %, suggesting that the intra-variation in each microarray is estimated to be 19.19%.

[00015] Figure 3 shows the distribution of correlation coefficients. Pairwise Pearson's correlations were calculated from 10,1 12 probes, giving a total of 51 , 121 ,216 permuted pairs of Pearson's correlation. The skew is 3.04 and the kurtosis is 4.09, which does not suggest normal distribution (Jacque-Bera statistic = 138.84).

[00016] Figure 4 shows the distribution of path length. Path lengths (number of jumps or degrees of separation) between each pair of genes were computed. Some pairs of genes were not reachable by each other (infinite jumps). Only reachable pairs were used for tabulation.

[00017] Figure 5 shows the concept of predicting target gene expression values. Target gene expression value can be estimated from a single source gene expression based on linear regression. In this figure, the source probe/marker gene and target probe/gene are separated by 2 sets of probe/genes and 5 different paths. Expression values of probes/genes adjacent to the source probe/gene (denoted as Probe(x)) can be estimated by linear regression. This can be done repeatedly to reach the target. Target probe/gene expression value can then be statistically inferred.

[00018] Figure 6 shows the precision of prediction. (A) shows the percentage of genes predicted within specific thresholds from 2 to 8 jumps from the source gene. The predicted value of each gene is defined as the average predicted value from source gene to target gene across different paths. If the actual expression value of a gene is +/- 20%> of the predicted value, then the gene is considered to be predicted within 20% threshold. (B) shows the difference in prediction precision between the current set of microarray with 19.19% intra-slide variation and a set of theoretically consistent microarray with no intra-slide variation. [00019] Figure 7 shows the number of target genes reached within 4 jumps. Horizontal axis shows the list of genes sorted by descending numbers of genes reachable within 4 jumps.

[00020] Figure 8 shows the network coverage by path length.

[00021] Figure 9 shows microarrays used to evaluate the accuracy of single pass and multi-pass transcriptome predictors (Algorithm 1 and 2). 30 and 10 arrays were used to test single pass and multi-pass transcriptome predictors, respectively. For comparison between single pass and multi-pass transcriptome predictors, the sample numbers from single pass transcriptome predictor test were relabeled in Figure 14 (multi-pass transcriptome predictor test). For example, GSM239043 was used as Sample 3 in the single pass transcriptome predictor test, but the same data was relabeled as Sample 2 in the multi-pass transcriptome predictor test in Figure 14.

[00022] Figure 10 shows a list of 59 source probes/marker genes. Probe ID and Locus Tag are Probe IDs in the microarray (GPL3154) and locus tag in NCBI gene record, respectively.

[00023] Figure 11 shows the correlation between actual expression data from microarray and predicted expression using single pass method. All 30 test samples (transcriptomes) showed statistically significant positive correlation except for Sample 22. The average correlation is 0.468 with a standard error of 0.0382, suggesting that the average correlation is significantly different from zero.

[00024] Figure 12 shows a prediction evaluation test for the single-pass transcriptome predictor. Thirty microarrays were used to evaluate the accuracy of the single-pass transcriptome predictor. Two different tests were performed using 30 to 50 paths between the source and target genes and 30 to 10000 paths between the source and target genes. Error bar denotes standard error. (A) shows the average standard deviations between the predicted and actual target gene values. (B) shows the average percent difference between the predicted and actual target gene values.

[00025] Figure 13 shows the prediction error of each path. Path averaged correlation is defined as the quotient of the sum of the Pearson's correlation coefficients across the entire path, and the number of jumps in the path. Locus Tag Z51 16 was used to predict b l 510 with 180 paths of 3 jumps between these 2 genes. [00026] Figure 14 shows a prediction evaluation test for the multi-pass transcriptome predictor. Ten microarrays which had been used for single-pass transcriptome predictor evaluation were used to evaluate multi-pass transcriptome predictor (Figure 10). Error bar denotes standard error. (A) shows the average standard deviations between the predicted and actual target gene values. (B) shows the average percent difference between the predicted and actual target gene values. (C) shows the average standard deviation from each predicted values across different paths.

[00027] Figure 15 shows the correlation between actual expression data from microarray and predicted expression using multi-pass method. All 10 test samples (transcriptomes) showed statistically significant positive. The average correlation is 0.269 with a standard error of 0.0455, suggesting that the average correlation is significantly different from zero.

[00028] Figure 16 shows the perturbation predictor evaluation setup. Six types of perturbations were performed (1 . Single gene over-expression; 2. Single gene knockdown; 3. Double gene over-expression; 4. Double gene knockdown; 5. Single gene over-expression with single gene knockdown; 6. Double gene over-expression with double gene knockdown). Each perturbation was performed on three microarrays as triplicates.

[00029] Figure 17 shows the prediction evaluation test for the perturbation transcriptome predictor. The vertical axis shows the percentage of affected genes predicted within a specific threshold. (A), (B), (C) and (D) demonstrate single gene over-expression, single gene knockdown or under-expression, double gene over-expression and double gene knockdown respectively. (E) and (F) show two gene perturbations, single gene over-expression and single gene knockdown, predicted using single pass and multi-pass, respectively. (G) and (H) show four gene perturbations, double gene over-expression and double gene knockdown, predicted using single pass and multi-pass, respectively.

[00030] Figure 18 shows the correlations between predicted and expected expression values of genes affected by perturbation(s). Correlations are calculated from all the affected genes by comparing the perturbation affected expression against the respective test transcriptome. Higher correlation is indicative of the higher predictability. The average correlation coefficient of single-pass method is 0.698 (SD = 0. 123). The average correlation coefficient of multi-pass method is 0.392 (SD = 0.036). The difference between single and multipass method is significant (p-value = 7.45e-12) using t-test for unequal variance. In addition, paired t-test between the correlations of single and multi-pass method for single/double gene over-expression/knockdown (n = 6) is significant (p-value = 0.0012).

[00031J Figure 19 shows the panel of source genes/marker genes for the Escherichia coli model.

[00032] Figure 20 shows the average error between actual and predicted gene expression using two different models based on 20 and 59 marker genes, respectively.

[00033 J Figure 21 shows the correlation between actual and predicted gene expression using two different models based on 20 and 59 marker genes, respectively.

[00034] Figure 22 shows the panel of source genes/marker genes for the human breast cancer model.

[00035] Figure 23 shows the evaluation of the gene expression prediction for the human breast cancer model.

[00036] Figure 24 shows GO Terms (biological processes) enriched as a result of BRCA1 down-regulation.

[00037] Figure 25 shows the summary of results of examining the effects of MAPK over- expression in human breast cancer.

[00038] Figure 26 shows the number of significantly affected genes that were affected by one or more MAPK isoforms. 1642 genes were significantly affected by one of the 10 MAPK isoforms while 56 genes were significantly affected by 8 of the 10 MAPK isoforms. No gene was significantly affected by more than 8 MAPK isoforms.

[00039] Figure 27 shows the panel of source genes/marker genes for the yeast model.

[00040] Figure 28 shows the comparison and selection of the regression model used for the prediction of yeast gene expression.

[00041] Figure 29 shows the correlation between predicted and expected gene expressions in yeast using model A (absolute Pearson's correlation coefficient of 0.396 as threshold). [00042] Figure 30 shows GO Terms (biological processes) enriched as a result of HAA 1 up-regulation.

DETAILED DESCRIPTION OF THE INVENTION

[00043] The present inventors surprisingly found that a network of genes wherein the transcription of each gene is significantly correlated with the transcription of at least one other gene of said network allows for a fast and precise prediction of a transcriptome or downstream/effector genes of a perturbed gene.

[00044] Thus, in a first aspect, the present invention is thus directed to a method for identifying one or more marker genes that allow prediction of the transcription intensity of at least one gene of interest, the method comprising: a) providing a gene co-expression network (GCN) (1 st network) including the correlation of the transcription of all possible gene pairs and wherein the 1 st network comprises the at least one gene of interest; b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co-expression network to form a diminished network of gene(s) (2nd network) and the co- transcription of all gene pairs of the 1 st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6 or when the p-value of the correlation coefficient is at least le-9; c) identifying within the selected genes of step b) marker gene(s), wherein said marker gene(s) form with at least 90% of the genes of the 2nd network transcription correlated gene pairs, said correlated gene pairs form a 3rd network and the number of marker genes is at most 15% of the number of genes of the 2nd network and wherein the 3rd network comprises the gene of interest and the gene of interest is not a marker gene. In various embodiments, the transcription of a gene pair correlates when the correlation coefficient is statistically significant, such as a p-value < l e-9.

[00045] In a further aspect, the present invention relates to a method for predicting the transcription intensity of at least one gene of interest in a given sample, the method comprising: a) measuring the transcription intensity of one or more marker genes identified according to the method of the present invention in the given sample; and b) using the transcriptional intensity measured in step a) and pair-wise linear regression on the gene pairs of the 3rd network to predict the transcription intensity of the at least one gene of interest.

[00046] In various embodiments of the invention, the pair-wise linear regression is a pair- wise first-order linear regression.

[00047] In various embodiments of the invention, the transcription of a gene pair correlates when the correlation coefficient is at least 0.75 or when the p-value of the correlation coefficient is at least le-9. In various embodiments, the transcription of a gene pair correlates when the correlation coefficient is statistically significant, such as a p-value < l e-9.

[00048] In various embodiments of the invention, the marker genes are identified by a round -based selection, wherein one selection round comprises: a) identifying the gene of the 2nd network that forms the highest number of correlated gene pairs with the other genes of the 2nd network, wherein the gene of the 2nd network that forms the highest number of correlated gene pairs with the other genes of the 2nd network is a marker gene, and b) removing said marker gene and the genes that form transcription correlated gene pairs with said marker gene from the list of the 2nd network; and wherein further rounds of selection are applied on the remaining genes of the 2nd network until the selected marker gene(s) form with at least 90% of the genes of the 2nd network transcription correlated gene pairs.

[00049] In various embodiments of the invention, the marker gene(s) form with at least 95% of the genes of the 2nd network transcription correlated gene pairs.

[00050] In various embodiments of the invention, the transcription intensity of the at least one gene of interest is predicted by using not more than five, preferably not more than four, more preferably not more than three and still more preferably not more than two jumps between linear regressions of the gene pairs of the 3rd network.

[00051] In various embodiments of the invention, the transcription intensity of one gene is measured. In other various embodiments, the transcription intensity of at least two genes is measured.

[00052] In various embodiments of the invention, the transcription intensity of at least one gene is measured at least two-times and the measured values are merged and averaged. [00053] In various embodiments of the invention, the given sample is a human sample, preferably the human sample is a sample from a breast cancer patient. In other various embodiments, the given sample is a yeast sample. In still other various embodiments of the invention, the given sample is an Escherichia coli sample.

[00054J In various embodiments of the invention, the transcription intensity of the at least one gene of interest is predicted within three standard deviations of error.

[00055] In a further aspect, the present invention relates to a method for identifying a group of genes of interest whose transcription correlates with the transcription of a gene whose transcription is to-be-perturbed, the method comprising: a) providing a gene co-expression network (GCN) (1 st network) including the correlation of the transcription of all possible gene pairs and wherein the 1 st network comprises the group of genes of interest and the to-be- perturbed gene; b) selecting for genes of said gene pairs whose transcription is correlated with the transcription of at least one other gene of the gene co-expression network to form a diminished network of gene(s) (2nd network) and the co-transcription of all gene pairs of the 1 st network is determined by Pearson's correlation and the transcription of a gene pair correlates when the correlation coefficient is at least 0.6 or when the p-value of the correlation coefficient is at least l e-9; c) identifying within the 2nd network genes whose transcription is correlated with the transcription of the to-be-perturbed gene, wherein the genes of the 2nd network whose transcription correlates with the transcription of the to-be-perturbed gene form the group of genes of interest. In various embodiments, the transcription of a gene pair correlates when the correlation coefficient is statistically significant, such as a p-value < le-9.

[00056] In another aspect, the present invention relates to a method for predicting the effect of the perturbation of the transcription of at least one gene on the transcription intensity of a group of genes of interest in a given organism, the method comprising: a) identifying the group of genes of interest according to a method of the present invention using transcriptional intensity of the at least one to-be-perturbed gene in a non-perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a non-perturbed background; c) using transcriptional intensity of the at least one to-be-perturbed gene in a perturbed background and pair-wise linear regression on the gene pairs of the 2nd network to calculate the transcription intensity of each gene of the group of genes of interest in a perturbed background; and d) comparing the transcription intensity of the non-perturbed and perturbed background for each gene of the group of genes of interest.

[00057] In various embodiments of the invention, the pair-wise linear regression is a pair- wise first-order linear regression.

[00058] In various embodiments of the invention, the effect of one perturbed gene is predicted. In other various embodiments, the effect of at least two perturbed genes is predicted. In still other various embodiments of the invention, the effect of at least four perturbed genes is predicted.

[00059] In various embodiments of the invention, the transcription intensity of the at least one perturbed gene in a perturbed and/or non-perturbed background is a) known; b) derived from a previously known transcription intensity; or c) measured. In other various embodiments, the transcription intensity of the at least one perturbed gene is measured at least two-times and the measured values are merged and averaged.

[00060] In various embodiments, the present invention relates to methods wherein the transcription intensity of each gene of the group of genes of interest is determined by separately calculating the transcription intensity for each gene perturbation and merging and averaging for each gene of the group of genes of interest the transcription intensity for each gene perturbation.

[00061] In various embodiments of the invention, the at least one perturbed gene has decreased or increased transcription intensity or both.

[00062] In various embodiments of the invention, the group of genes of interest comprises at least one gene and preferably at least five, ten, 20 or 50 genes.

[00063] In various embodiments of the invention, the transcription of a gene pair correlates when the correlation coefficient is at least 0.75 or when the p-value of the correlation coefficient is at least l e-9. In various embodiments, the transcription of a gene pair correlates when the correlation coefficient is statistically significant, such as a p-value < l e-9.

[00064] In various embodiments of the invention, the transcription intensity for each gene of the group of genes of interest is calculated by using not more than five, preferably not more than four, more preferably not more than three and still more preferably not more than two jumps between linear regressions of the gene pairs of the 2nd network.

[00065] In a further aspect, the present invention relates to a kit for detecting the marker genes identified according to a method of the present invention.

[00066] In another aspect, the present invention relates to a panel of marker genes identified according to a method of the present invention. In various embodiments, the panel of marker genes comprises or consists of the marker genes groups a) Hs.631923 (GenBank ID NM_004157), Hs.367799 (GenBank ID NM_002088), Hs.388739 (GenBank ID NM_021 141), Hs.469022 (GenBank IDs NM_001929, or NM_080915, or NMJ380916, or NM_080917, or NM_080918), Hs.579264 (GenBank IDs NM_001 130090, or NM_001 130091 , or NM_001 130092, or NM_03 1294), Hs.435765 (GenBank ID NM_001977); b) Hs.514242 (GenBank IDs NMJH4798, or NR_024386, or NR_027774, or NR_027782), Hs.138860 (GenBank ID NM_004308), Hs.24529 (GenBank IDs NM_001 1 14121 , or NM_001 1 14122, or NM_001244846, or NM_001274, or NR_045204, or NR_045205); c) Hs.514242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NRJ327782), Hs.595333 (GenBank IDs NM_001025596, or NM_001 172639, or NM_001 172640, or NM_006560, or NM_198700); d) Hs.607822 (GenBank IDs NM 001 145966, or NM 002417), Hs.233240 (GenBank IDs NM_004369, or NM_ 057164, or NM_057165, or NM_057166, or NM_057167), Hs.518249 (GenBank IDs NM_001 127192, or NM_001 127193, or NM_001 127194, or NM_001 127195, or NM_001 127196, or NM_003418), Hs.731619 (GenBank IDs NM_007172, or NMJ 53645), Hs.631923 (GenBank ID NM_004157), Hs.367799 (GenBank ID NM_002088), Hs.513083 (GenBank IDs NM_000661 , or NMJX) 1024921 ), Hs.31 1640 (GenBank IDs NM 001 135592, or NM_001 177413, or NMJ302954), Hs.388739 (GenBank ID NM_021 141 ), Hs.469022 (GenBank IDs NM_001929, or NM_ 080915, or NM_080916, or NM_080917, or NM_080918), Hs.405144 (GenBank IDs NM_003017, or NR_036610), Hs.371240 (GenBank IDs NM_005100, or NM_144497), Hs.731463 (GenBank ID NM_014155), Hs.579264 (GenBank IDs NM_001 130090, or NM_001 130091 , or NM_001 130092, or NM_03 1294), Hs.210283 (GenBank ID NM_000093), Hs.279784 (GenBank ID NM_013388), Hs.631989 (GenBank ID NM_002931 ), Hs.435765 (GenBank ID NM_001977), Hs.496710 (GenBank IDs NM_001042749, or NM_001042750, or NM_001042751 , or NM_006603); e) Hs.48031 1 (GenBank IDs NM_00101 1513, or NM_00101 1515, or NM_00101 1516, or NM 001256425, or NM_001256426, or NM_001256427, or NM_001256428, or NM_001256429, or NM_006457, or NR_024179, or NR_046186), Hs.516032 (GenBank ID NM_000182), Hs.654370 (GenBank ID NM_004460), Hs.514242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.138860 (GenBank ID NM_004308), Hs.24529 (GenBank IDs NM_001 1 14121 , or NM_001 1 14122, or NM_001244846, or NM_001274, or NR_045204, or NR_045205), Hs.446588 (GenBank IDs NM_001017, or NR_001452, or XR_ 1 1 1 163), Hs.373763 (GenBank IDs NM_001 102397, or NM_001 102398, or NM_001 102399, or NM_005826); f) Hs.514242 (GenBank IDs NM_014798, or NR_024386, or NR_027774, or NR_027782), Hs.595333 (GenBank IDs NM_001025596, or NM_001 172639, or NM_001 172640, or NM_006560, or NMJ 98700), Hs.596704 (GenBank IDs NM_001037328, or NM_004602, or NM_017452, or NM_017453, or NM_017454), Hs.122583 (GenBank IDs NM_024743, or NR_024010); g) 1774893_at (GenBank ID NM_001 184556), 1779850_at (GenBank ID NM_001 184257); h) 1771091_at (GenBank ID NM 001 181808); i) 1774893_at (GenBank ID NM_001 184556), 1779850_at (GenBank ID NM_001 184257); j) 1771091_at (GenBank ID NM_001 181808); k) 1765788_s_at (SEQ ID NO: l ), 1761313_at (SEQ ID NO:2), 1764701_s_at (SEQ ID NO:3), 1765667_s_at (SEQ ID NO:4), 1 761 171_s_at (SEQ ID NO:5), 1764563_s_at (SEQ ID NO:6), 1765668_at (SEQ ID NO:7), 1766750_s_at (SEQ ID NO:8), 1759650_s_at (SEQ ID NO:9), 1764385_at (SEQ ID NO: 10), 1767674_s_at (SEQ ID NO: l 1), 1768081_s_at (SEQ ID NO: 12), 1768591_s_at (SEQ ID NO: 13), 1766138_s_at (SEQ ID NO: 14), 1768603_s_at (SEQ ID NO: 15), 1768237_s_at (SEQ ID NO:16), 1761582_s_at (SEQ ID NO: 17), 1760255_s_at (SEQ ID NO: 18), 1763957_s_at (SEQ ID NO: 19), 175945 l_s_at (SEQ ID NO:20); or 1) 1763376_at (SEQ ID NO:21 ), 1768602_at (SEQ ID NO:22), 1765970_x_at (SEQ ID NO:23), 1768855_at (SEQ ID NO:24), 1762662_s_at (SEQ ID NO:25), 1764355_at (SEQ ID NO:26), 1761298_at (SEQ ID NO:27), 1767293_at (SEQ ID NO:28), 1766730_at (SEQ ID NO:29), 1761324_at (SEQ ID NO:30), 1761072_at (SEQ ID NO:3 1 ), 1763355_at (SEQ ID NO:32), 1 760568_s_at (SEQ ID NO:33), 1765285_s_at (SEQ ID NO:34), 1762875_at (SEQ ID NO:35), 1762326_at (SEQ ID NO: 36), 1768785_s_at (SEQ ID NO:37), 1759869_at (SEQ ID NO:38), 1766040_at (SEQ ID NO:39), 1760045_s_at (SEQ ID NO:40), 1761673_at (SEQ ID NO:41 ), 1763754_at (SEQ ID NO:42), 1767770_at (SEQ ID NO:43), 1768126_at (SEQ ID NO:44), 1763914_s_at (SEQ ID NO:45), 1764068_x_at (SEQ ID NO:46), 1768400_s_at (SEQ ID NO:47), 1763222_x_at (SEQ ED NO:48), 1768805_s_at (SEQ ID NO:49), 1762892_s_at (SEQ ID NO:50), 1763332_s_at (SEQ ID N0:51), 1764533_s_at (SEQ ID NO:52), 1766506_s_at (SEQ ID NO:53), 1765366_s_at (SEQ ID NO:54), 1767220_at (SEQ ID NO:55), 1765015_at (SEQ ID N0.56), 1766029_at (SEQ ID NO:57), 1760408_s_at (SEQ ID NO:58), 1764010_s_at (SEQ ID NO:59), 1763879_s_at (SEQ ID NO:60), 1759435_at (SEQ ID NO:61), 1767485_at (SEQ ID NO:62), 1766425_s_at (SEQ ID NO:63), 1759255_at (SEQ ID NO:64), 1763325_at (SEQ ID NO:65), 1764978_at (SEQ ED NO:66), 1766452_at (SEQ ID NO:67), 1764869_s_at (SEQ ID NO:68), 1764915_s_at (SEQ ID NO:69), 1768426_s_at (SEQ ID NO:70), 1762728_s_at (SEQ ID N0:71), 1767138_s_at (SEQ ID NO:72), 1767987_x_at (SEQ ID NO:73), 1766777_s_at (SEQ ID NO:74), 1763925_s_at (SEQ ID NO:75), 1762374_s_at (SEQ ED NO:76), 1767386_at (SEQ ID NO.-77), 1767335_s_at (SEQ ED NO:78), 1767009_s_at (SEQ ID NO:79).

[00067) The terms "marker gene" or "source gene", which are used interchangeably herein, refer to a gene or other genomic nucleic acid sequences that are transcribed into RNA. The marker gene is not limited to a specific nucleic acid sequence and comprises sequences of genomic DNA that can be transcribed into mRNA, rRNA, tRNA, miRNA, ribozymes or snRNA. In preferred embodiments of the invention the marker is a genomic nucleic acid sequence that is transcribed into mRNA. According to the present invention, marker genes are used to form a gene network with a maximum degree of network coverage and a minimum degree of separation (also known as jump) or complexity between the gene pairs of the network. Network (transcriptome) coverage with a minimal number of marker/source genes can be generated by repeatedly choosing the source gene with the largest network coverage in each repeat (for example, by using the code for identifying marker genes for transcriptome prediction). If a 10000-gene network with 4 degrees is given, the algorithm for identifying marker genes for transcriptome prediction will calculate the coverage of each of these 10 thousand genes within 4 degrees and will select the gene with the largest coverage as a source gene. For example, if geneB25 has the highest coverage of 5000 genes (50% coverage), geneB25 will be selected as a source gene and all 5000 genes reachable from geneB25 within 4 degrees are removed from the list of 10 thousand genes, resulting in 5 thousand genes remaining. The next iteration will calculate the coverage of each of the remaining 5 thousand genes within 4 degrees and select the gene with the largest coverage as a source gene. For example, geneC I O has the highest coverage of 2000 genes (20% of original network), geneC IO will be selected as a source gene. Hence, 70% of the original network (50% covered by geneB25 + 20% covered by geneCI O) will be reached and removed. Thus, only 3000 or 30% of the original network remains for the third iteration. In various embodiments, the marker genes identified with the present method form with at least 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the genes of the 2nd network transcription correlated gene pairs. In preferred embodiments, the marker genes identified with the present method form with at least 90% of the genes of the 2nd network transcription correlated gene pairs. In various other embodiments, the number of marker genes is at most 15%, 10%, 7%, 5%, 4%, 3%, 2%, 1 %, 0.5% or 0.1 % of the number ofgenes of the 2nd network.

[00068] The term "prediction", as used herein, relates to the calculation of the transcription intensity of a gene of interest in an individual/patient under defined conditions. According to the present invention, the term "transcription intensity" refers to the expression level of a gene into messenger RNA (mRNA) or other types of RNA under different experimental or environmental conditions. Transcription intensity corresponds to the number of copies of mRNA or other RNAs that exists for a particular gene. In living cells gene expression is controlled by transcription factors binding to the regulatory regions of a gene. The genes can be inducible, i.e. resulting in increase in the amount of mRNA or other types of RNA, or can be constitutively expressed, i.e. having a constant amount of RNA under different conditions. Methods to analyze the quality and quantity of the transcribed RNA, specifically mRNA, are described in the several laboratory handbooks [Sambrook and Russell (2001 ) Molecular Cloning: A laboratory manual, 3rd ed, Cold Spring Harbor Laboratory Press, New York] and are well known for a person skilled in the art. These methods comprise microarray technologies, ribonuclease protection, primer extension, northern blotting, dot blot hybridization, and conventional or real time PCR. In preferred embodiments of the invention, the transcription intensity is measured using microarray technology.

[00069] "At least one", as used herein, relates to one or more, in particular 1 , 2, 3, 4, 5, 6, 7, 8, 9, 10 or more.

[00070] The term "gene of interest", as used herein, relates to a linear sequence of nucleotides along a segment of genomic DNA that provides the coded instructions for synthesis of RNA. In preferred embodiments of the invention said RNA is mRNA that is subsequently translated into a protein.

(00071] "Gene co-expression network (GCN) analysis" or "weighted correlation network analysis", as interchangeably used herein, is a widely used data mining method especially for studying biological networks based on pairwise correlations between variables. It allows the definition of modules (clusters), intramodular hubs, and network nodes with regard to module membership, to study the relationships between co-expression modules, and to compare the network topology of different networks. In detail, the gene co-expression network (GCN), also called 1 st network, is a network wherein the nodes of the network are the genes and the links between the nodes, also known as edges, are the degree of co-expression. Such GCN may be established from the set of data of RNA or DNA microarray analysis. An algorithm to calculate gene co-expression networks is, for example, the python code for generating correlation network and pair-wise regression models.

[00072) The term "correlation", as used herein, relates to any of a class of statistical relationships involving dependence, wherein dependence is defined as any statistical relationship between two random variables or two sets of data. According to the present invention, correlation refers to Pearson's correlation that defines the linear relationship between two sets of data. The degree of linear relationship is indicated by the Pearson's correlation coefficient that measures the linear correlation (dependence) between two variables X and Y, giving a value between +1 and - 1 inclusive, where 1 is total positive correlation, 0 is no correlation, and - 1 is total negative correlation. In preferred embodiments of the invention, the transcription of a gene pair correlates when the correlation coefficient is at least 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 or 0.95. In more preferred embodiments, the transcription of a gene pair correlates when the correlation coefficient is at least 0.75.

[00073] According to the present invention, a "gene pair" relates to any given pair of genes formed within the 1 st, 2nd or 3rd gene network of the present invention, wherein the two genes are compared for their transcription intensity and the correlation of their transcription intensity is quantified by the correlation coefficient of Pearson's correlation. [00074J In various embodiments of the invention, the sample is a biological sample, for example a body fluid, cell or tissue sample. Body fluids comprise, but are not limited to blood, blood plasma, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), endolymph and perilymph, gastric juice, mucus (including nasal drainage and phlegm), peritoneal fluid, pleural fluid, saliva, sebum (skin oil), semen, sweat, tears, vaginal secretion, nipple aspirate fluid, vomit and urine. The cell or tissue sample may comprise material originated from any part of the body such as connective tissue, muscle tissue, nervous tissue, and epithelial tissue. The term "obtaining a sample", as used herein, relates to different methods known in the art that comprise, but not limited to, biopsy, sentinel node biopsy or removal of blood, bone marrow, sputum or bronchial fluids. In various embodiments of the invention, the given sample is a human sample, preferably the human sample is a sample from a breast cancer patient. In other various embodiments, the given sample is a yeast sample. In still other various embodiments of the invention, the given sample is an Escherichia coli sample.

[00075] The term "linear regression", as used herein, relates to an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. In linear regression, data are modeled using linear predictor functions, and unknown model parameters are estimated from the data. For example, if Gene(A) is a source gene with known expression and is reachable to Gene(C) via Gene(B), then the expression level of Gene(C) can be predicted as Gene(C) = bi,B-c(b i,A-BGene(A) + b 0 ,A-B) + b 0i B-c where bi,A-B and b 0 ,A-B is the first-order linear regression gradient and intercept between Gene(A) and Gene(B), respectively, and bi£-c and bo , B-c is the first-order linear regression gradient and intercept between Gene(B) and Gene(C), respectively. The linear regression can be calculated, for example, with the algorithm for single pass and multi-pass transcriptome that are described as Algorithm 1 and 2, respectively.

[00076] "Diminished network of genes" or "2nd network", as used interchangeably herein, relates to a network of genes that is based on the gene co-expression network wherein the co- transcription of all gene pairs significantly correlates. A significant correlation may be found when the correlation coefficients of the gene pairs of the 2nd network are at least 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 or 0.95. In more preferred embodiments, the transcription of a gene pair correlates when the correlation coefficient is at least 0.75.

[00077] The term "3rd network", as used herein, relates to a network of genes wherein the genes are marker genes or genes whose transcription is correlated with at least one marker gene. The correlation of a gene pair is determined by Person's correlation and a significant correlation may be found when the correlation coefficient is at least 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 or 0.95. In more preferred embodiments, the transcription of a gene pair correlates when the correlation coefficient is at least 0.75.

[00078] In various embodiments of the invention, the transcription intensity for the gene of interest or for each gene of the group of genes of interest is calculated by using not more than five, preferably not more than four, more preferably not more than three and still more preferably not more than two jumps between linear regressions of the gene pairs of the 2nd or 3rd network. In a statistically significant co-expression network the degree of separation of gene pairs is also known as "jumps". For example, if Gene(A) is a source gene with known expression and is reachable to Gene(C) via Gene(B), then the distance between Gene(A)/Gene(B) and Gene(B)/Gene(C) is one jump, whereas the distance between Gene(A)/Gene(C) is two jumps.

[00079] The term "path", as used herein, relates to one specific combination within a network of genes that allows to estimate the transcription intensity of a gene of interest from a single marker gene based on pair-wise linear regression.

[00080] In various embodiments of the invention, the transcription intensity of a gene is measured at least two-times and the measured values are merged and averaged. This means that the measured transcription intensity values are added and divided by the number of used values to generate the arithmetic mean. An algorithm to calculate the arithmetic mean of a gene those transcription intensity was measured at least two-times is, for example, the code to merge probes of the same ID (examples "Algorithms and codes"). "standard deviation of error in simple linear regression" is calculated by the wherein a(y) is the standard deviation of y (dependent variable), and p(x,y)2 is correlation between x and y squared, a(c) the standard deviation of the error and if y=a+bx+c.

[00082J The term "gene perturbation", as used herein, relates to changes in the transcription intensity of a given gene such as knockdown, knockout or overexpression of this gene. However, this term may also include factors outside of a cell and/or organism that influence the transcription intensity of the given gene. Such factors may include but are not limited to environmental stress, metabolic stress, changing oxygen concentrations, administration of drugs, different diets etc. The term "non-perturbed background", as used herein, relates to conditions or an environment encompassing all living and non-living things wherein a given organism or ecological community normally lives or occurs (natural habitat). In contrast, the term "perturbed background" relates to conditions that are non-natural. The term "comparing the transcription intensity of the non-perturbed and perturbed background for each gene of the group of genes of interest", as used herein, relates to the ratio of how much the transcription intensity of the given gene has been changed under perturbed conditions compared to the transcription intensity under non-perturbed conditions which are used as a reference.

100083] In various embodiments of the invention, the transcription intensity of the at least one perturbed gene in a perturbed and/or non-perturbed background is a) known; b) derived from a previously known transcription intensity; or c) measured. Values of the transcription intensity of perturbed genes in a perturbed and/or non-perturbed background may be found, for example, in literature or in electronic data bases such as the NCBI Gene Expression Omnibus that disclose the expression of several genes from different organisms under diverse conditions. Such data bases may also provide the basis to estimate a virtual transcription intensity for an overexpression or knockdown of interest. For example, if the transcription intensity of gene(A), which is the to-be-perturbed gene, is known from literature and/or an electronic data base to be 10 units under non-perturbed conditions and it is desired to predict the effects of a 50% knockdown of gene(A), then the perturbation predictor of the present invention can be employed using the transcription intensity value of 5 units as the transcriptional intensity of the to-be- perturbed gene in a perturbed background. Alternatively, the transcription intensity of the at least one to-be-perturbed gene in a perturbed and/or non-perturbed background can be measured. Methods to analyze the quality and quantity of the transcription are described in the several laboratory handbooks [Sambrook and Russell (2001 ) Molecular Cloning: A laboratory manual, 3rd ed, Cold Spring Harbor Laboratory Press, New York] and are well known for a person skilled in the art. These methods comprise microarray technologies, ribonuclease protection, primer extension, northern blotting, dot blot hybridization, and conventional or real time PCR. In preferred embodiments of the invention, the transcription intensity is measured using microarray technology.

[00084J The term "panel of genes" in this aspect of the invention can comprise a set of 1 to 1000 genes, more preferably of 1 to 100 genes, more preferably of 2 to 60 genes, and most preferably of 3 to 40 genes.

[00085] The term "kit", as used herein, relates to packaged reagents for the detection of the presence of the marker genes identified according to a method of the present invention. Accordingly, the kits of the invention comprise in various embodiments detection reagents that allow determining the transcription levels of one or more marker genes identified according to the methods of the present invention. Additionally, such a kit may comprise instructions for use as well as typical reagents known to those skilled in the art. For example, the exact number of reaction tubes, their holders, PCR-primers and/or antibodies etc. can be determined by the skilled person.

EXAMPLES

Materials and Methods

Construction of co-expression network and regression models

The CEL files of 605 E. coli microarrays across 40 experiments were downloaded from NCBI Gene Expression Omnibus (Figure 1 ) and RMA normalized using Affymetrix Expression Console. Pairwise permutations of Pearson's correlation were calculated and the expression values for the pair of genes were fitted into first order linear regression equation of the form Gene(x) = biGene(y) + bo (python code). Pairs with absolute Pearson's correlation of more than 0.75 were retained for building co-expression network using NetworkX wherein the nodes are the genes and an edge was generated between the nodes when the absolute Pearson's correlation of two genes is more than 0.75. Transcriptome predictors

Two transcriptome predictors, single pass and multi-pass, were implemented. The difference between the two predictors is that the single pass predictor performed one prediction per target gene, whereas the multi-pass predictor allowed a target gene to be predicted using 2 or more source genes/marker genes. Given a list of source genes (marker genes) and their expression values, the transcriptome predictors predict all genes reachable within 4 jumps using a loop over the linear regression models. For example, if Gene(A) is a source gene with known expression and is reachable to Gene(C) via Gene(B), then the expression level of Gene(C) can be predicted as Gene(C) + bojB-c where b i,A-B and b 0) A-B is the first-order linear regression gradient and intercept between Gene(A) and Gene(B), respectively, and bi B-c and bo,B-c is the first-order linear regression gradient and intercept between Gene(B) and Gene(C), respectively. The algorithm for single pass and multi-pass transcriptome are described as Algorithm 1 and 2, respectively. As there could be more than one path between any source and target genes via different intermediary genes, there could be more than one predicted expression values. The predictor would report the arithmetic mean and standard deviation of the predicted values (Figure 6).

Perturbation predictor

A perturbation predictor was implemented on top of both transcriptome predictors to calculate the effects of perturbation(s), such as single or multiple gene over-expression or knockdowns, on a background transcriptome. A list of perturbations was given as ratio of the original expression values, for example, 1.8 time of Gene(A) and 0.4 times of Gene(B). The predictor estimated the effects of perturbations by a two-pass prediction where the first pass predicted the expression values of all affected target genes within 4 jumps using the original expression values of the background transcriptome [ l x Gene(A) and l x Gene(B)], followed by a second pass using the perturbed values from the background transcriptome [ 1.8x Gene(A) and 0.4x Gene(B)]. As a result, each perturbation runs using different combinations of perturbed genes might have different numbers of affected target genes. Evaluating predictors

The single pass and multi-pass transcriptome predictors were evaluated using 30 and 10 microarrays that were not used for model building, respectively. Perturbation predictor was evaluated using six types of perturbations (1. Single gene over-expression; 2. Single gene knockdown; 3. Double gene over-expression; 4. Double gene knockdown; 5. Single gene over- expression with single gene knockdown; 6. Double gene over-expression with double gene knockdown) on 3 replicates (Figure 9). For each microarray, the expression values of 59 genes were extracted and used as source genes (Figure 10) to predict all reachable genes, known as target genes, within 4 jumps. The target genes consisted of adjacent genes (one jump from source genes) and non-adjacent genes (two to four jumps from source genes). As there would be only one path from source gene to adjacent gene, standard deviation would not be calculated and only non-adjacent genes would be used to evaluate the predictors. The accuracy of prediction was determined by the number of standard deviations and the percentage difference between the expected expression value (from the microarray data) and the average predicted values.

Algorithms and codes

In the following, the used algorithms and codes are disclosed:

Python code for generating correlation network and pair-wise regression models

# datafile refers to input data file

datafile = 'model2_rma.csv'

# output refers to output data file or results file

output = 1 model2 rm correlation . csv 1

import math

def process (x, y) :

mean_x = float (sum (x) ) / len(x)

mean_y = float (sum (y) ) / len(y)

covariance = sum([((x[i] - mean_x) * (y[i] - mean_y) )

for i in range (len (x) )] )

error_x = [i-mean_x for i in x]

error_y = [i-mean_x for i in y]

sd_x = math . sqrt ( sum ( [i*i for i in error_x] ) )

sd_y = math. sqrt (sum( [i*i for i in error_y] ) )

gradient = sum ( [error_x[ index] * error_y [index]

for index in range ( len (error_x) )] ) / \

sum ( [error_x [index] * error_x [index]

for index in range ( len (error_x) )] )

intercept = mean_y - (gradient * mean_x) return ( float ( covariance) ,

float (sd_x), float (sd_y),

float (gradient) , float (intercept) )

data = open (datafile, ' r ' ) . readlines ( )

data = [x [ : -1 ] . split ( ' , ' ) for x in data]

out = open (output, 'w')

print str (len(data) ) , . ' number of samples to process'

out .write ( ' , ' .join ( [ ' sample_x' , ' sample_y' , ' covariance 1 ,

'sd_x', 'sd_y', 'gradient', 'intercept']) + count = 0

whilelen (data) > 1:

dl = data. pop (0)

ID1 = dl [0]

dl = [float (x) for x in dl[l:]]

for x in data:

ID2 = x[0]

d2 = [float(y) for y in x[l:]]

result = process (dl, d2)

result = [ID1, ID2, str (result [0] ) , str ( result [ 1 ]) ,

str(result[2] ) , str ( result [ 3]- ) , str ( result [ 4 ]) ]

result = ','. j oin ( result ) + '\η'

out .write (result)

count = count + 1

print str (count), ' processed. ', str (len (data) ) , ' left.', ID1 - out. close ()

Algorithm 1 for single pass transcriptome prediction

source__data<- source probes/expression value tuples

min_path_length<- minimum path length to predict

max_path_length<- maximum path length to predict

min_paths<- minimum number of paths between source and target max_paths<- maximum number of paths between source and target predictedjprobes<- source probes

unpredicted_probes<- total probes - predicted_probes

for each source probe

predict adjacent (path length = 1) target probe expression for path__length from min_path_length to max_path_length

for each source probe

for each target probe in unpredicted_probes

path_list<- list of minimum lengthed paths

between source and target probes if number of paths < min_paths: discard

predict target expression using up to max_path add target probe to predicted_probes

remove target probe from unpredicted_probes

Algorithm 2 for multi-pass transcriptome prediction

source_data<- source probes/expression value tuples min_path__length<- minimum path length to predict

max_path_length<- maximum path length to predict

min_paths<- minimum number of paths between source and target

max_paths<- maximum number of paths between source and target

max_source<- maximum number of source probes to use per target

predicted_probes<- source probes

for each source probe

predict adjacent (path length = 1) target probe expression

for path_length from min_path_length to max_path_length

for each source probe

for each target probe in unpredicted_probes

path_li.st<- list of minimum lengthed paths

between source and target probes

if number of paths <min_paths: discard

- predict target expression using up to max_path

if target probe had been predicted by max_source probes

add target probe to predicted_probes

. remove target probe from unpredicted_probes calculate average predicted value and standard

deviation of predicted value of target probe calculate average predicted value and standard deviation of predicted value of target probe that had not been reached by max_source probes

Code for identifying marker genes for transcriptome prediction

g network graph

target number of genes to reach

reached, not_reached - { } , { }

source - { }

while len (reached) < target:

high_reacher gene of highest reach within

path_length among genes in not_reached

reached reached + genes reached by high_reacher

within path_length

source source + high_reacher

not reached <- reached'

Code to merge probes of the same ID

Input file format (no headers allowed) :

ID-1, expression-1-1 , expression-1-2 , ... , expression-l-n

ID-2 , expression-2-1 , expression-2-2 , ... , expression-2-n

ID-m, expression-m-1, expression-m-2 , ... , expression-m-n inputfile = 'C:\Users\mauriceling\Desktop\data.csv'

outputfile = 'C:\Users\mauriceling\Desktop\datal.csv'

outfile = open (outputfile, 'w 1 )

d = {} # to aggregate expression values dcount = {} # to keep track of the number of replicates

count = 0

for x in open(inputfile, 'r'):

x = x[:-l] . split ( ' , ' )

data = x[l: ]

ID = x[0]

if not d.has_key (ID) : # first replicate

d[ID] = [float (x) for x in data]

dcount [ID] = 1

else: # subsequent replicate

odata = d[ID]

for i in range (len (data) ) :

odata[i] = odata[i] + float (data [ i] )

d[ID] = odata

dcount [ID] = dcount [ID] + 1

# print something to show that it is running

count = count + 1

print 'Rows read and assembled:', count

for ID in dcount . keys () :

# divide each aggregated expression values by replicates

expressions = join ( [str ( float ( sample) /float (dcount [ID] ) )

for sample in d[ID]])

# write out data

outfile. write ( 1 ,' .join( [ID, expressions]) + '\η')

outfile . close ( )

E. coii model system

Example 1: Establishment of a gene network of E. coli wherein fift -nine genes can cover 90% of the transcriptome within 4 jumps with reasonable accuracy

[00086] A total of 51,121,216 permuted probe-pairs were generated from 10, 112 non- control probes in GPL3154. These non-control probes were mapped to 10,091 genes. Thus, only 21 genes were represented by 2 probes (Figure 2). The results suggest that the Pearson's correlation coefficients were not normally distributed (Figure 3; Jacque-Bera statistic = 138.84). Using the correlation threshold of absolute correlation coefficient suggested by Reverter et al. [Reverter et al. (2005) Australian Journal of Experimental Agriculture, 45, 821-829] that is higher than 0.75, only 533,311 (1.04%) pairs and 7,360 (72.78%) genes remained and were used to construct the co-expression network. Given that 605 microarrays were used to generate the correlation, absolute correlation coefficient of 0.75 is significant after Bonferroni correction for multiple tests (p-value = 1.28e-102).

[00087] Using the 21 genes that were represented by 2 probes in the microarray, intra- array variation can be estimated by analyzing the differences from these 2 probes. Theoretically, their expression values should be the same and the ratio of expression values should be 1 as they are measuring the same transcript. Using all 605 microarrays, the average ratio of expression values could be calculated. The results suggested that the average deviation i [average ratio— 1 |)/N] from a perfect ratio of 1 was 19.19%, suggesting that the average intra-array variation could be estimated to be 19.19%.

[00088] After the construction of a statistically significant co-expression network, the next step was to determine a small set of genes with the maximum network coverage and minimum degree of separation (also known as jump) as network coverage is directly proportional to the extent of predictable transcriptome and the error in prediction is directly proportional to the number of jumps. This was carried out in 4 steps. Firstly, the number of jumps between any given gene-pairs was analyzed. The results suggest that the density peaks at 4 jumps (Figure 4). With reference to Figure 5, when a pair of genes is linked by a finite number of paths, the expression of one of the pairs (known as a target gene) can be predicted if expression of the other gene (known as a source gene) is known. As there can be many paths between the source and target genes, there can be many predicted expression values for the target gene as the number of predicted values equals the number of paths. An average predicted value and the standard deviation for the target gene can be estimated. Secondly, the accuracy of prediction with respect to path length (number of jumps) was analyzed. Accuracy of prediction, in terms of the number of standard deviations between the actual and predicted expression value of the target gene, is likely to decrease with increasing path length between the source and target genes. The results show that accuracy decreases drastically from path length of 5 or more (Figure 6 A and 7), suggesting that the limits of predictability is 4 jumps. In addition, the results also suggest that intra-array variation adversely affect prediction accuracy (Figure 6B).

[00089] Thirdly, the degree of network coverage was analyzed using 2 sets of source probes, a set of 32 source genes from coefficient of variation (r 2 ) of more than 0.95 and a set of

392 source genes from absolute Pearson's correlation of more than 0.95. Since coefficient of variation can be used as a measure of prediction accuracy between a pair of source and target probes, strong correlation in the first jump is likely to improve the overall prediction. The results show that the coverage from the set of 392 genes is significantly better than that of 32 genes

(Figure 8). However, 392 is a large number of genes to be measured experimentally. Further, this set of 392 genes was analyzed in attempt to reduce it into a small set of marker genes [Porcar et al. (201 1) Syst Synth Biol, 5(1 -2), 1 -9] which is feasible for experimental work. At 4 jumps, a number of these genes reach to the same set of target genes. By removing redundancy, the 392 genes were reduced to 49 genes but the coverage dropped from 6154 to 6053 genes. The set of genes not reached by these 49 source genes within 4 jumps was examined and 10 genes with the highest degree were added (most number of edges) to increase the number of source genes from 49 to 59. With this addition, the coverage increased to 6140 genes. It is further argued that adding more source probes at this stage is unlikely to give equivalent increase in coverage. Hence, this set of 59 source genes was used for further testing (Figure 10).

Example 2: The transcriptome can be predicted within 3 standard deviations using 59 source genes

[00090] The E. coli transcriptome was predicted using the panel of 59 source genes. To do so, a single pass transcriptome predictor (Algorithm 1 ) was implemented wherein each target gene will be predicted using expression value from one source gene. Using a test set of 30 microarrays that was not used in model construction two sets of tests were performed to determine the accuracy of transcriptome prediction and the effects of using different number of paths in prediction (Figure 9). The results suggest a positive correlation between the average predicted expression values and the actual expression values of each target gene across all 30 transcriptomes (Figure 1 1 ; average correlation = 0.468, p-value = 2.77e-13). This is similar to , the correlation of 0.489 reported by Fox and Erill [Fox and Erill (2010) DNA Res, 17(3), 185- 196] using relative codon usage bias to predict the expression levels of E. coli genes of more than 1000 bp and higher than the correlation of 0.301 reported by Roymondal et al. [Roymondal et al. (2009) DNA Res, 16(1 ), 13-30] when correlating relative codon usage bias to the expression levels of E. coli genes of all lengths.

[00091) The results also show that 29 of the 30 transcriptomes were predicted within an average of 3 standard deviations from expected values using between 30 and 50 paths per prediction (Figure 12A). All 30 transcriptomes were predicted within an average of 3 standard deviations from expected values using between 30 and 10000 paths per prediction (Figure 12A).

This result is significant (paired t-test, p-value = 4.25e-14). However, there is no difference between the standard deviations of expression values of each predicted target using different number of paths (p-value = 0.077). Using the percentage differences between the predicted and expected target gene expression (Figure 12B), the results show that 23 of the 30 transcriptomes and 24 of the 30 transcriptomes were predicted within 40% error using 30 to 50 paths and 30 to 10000 paths, respectively. These results are comparable to 33% predictability using chromatin states and transcription factor occupancy to predict spatial-temporal expression of genes [Wilczynski et al. (2012) Plos Comput Biol, 8(12), el002798]. Although the results show that using 30 to 10000 paths reduced the error in 28 of the 30 transcriptomes (except Samples 13 and 30), this result is not significant (p-value = 0.095). Despite so, there is a noticeable trend that the standard deviation of prediction error by percentage is lower when 30 to 10000 paths were used compared to 30 to 50 paths (Figure 12B), suggesting that using larger number of paths increases precision even though the accuracy does not increase. Hence, the results demonstrate that better prediction can be achieved by using more paths between the source and target genes.

[00092] The possibility was examined of whether the correlation across a path is able to indicate the accuracy of prediction. It is plausible to expect better prediction accuracy on a path with 4 jumps when the expression values between any 2 genes on the path is perfectly correlated than poorly correlated. To do so, the path averaged correlation was defined as the quotient of the sum of the Pearson's correlation coefficients across the entire path and the number of jumps in the path. However, the result does not suggest that the path averaged correlation is a good indicator for prediction accuracy (r 2 = 0.069; Figure 13).

[00093] In addition, it is conceivable that using more than one source gene to predict a target gene may improve prediction accuracy. To test this hypothesis, a multi-pass transcriptome predictor (Algorithm 2) has been developed that allows for the use of any number of source genes to predict a target gene. Network coverage analysis shows that 59 source genes can reach a total of 169,012 genes in 4 jumps (Figure 7) or each target gene is reached by an average of 27.5 source probes. This suggests that the computation time for multi-pass transcriptome prediction will be 27.5 times longer than single pass transcriptome prediction if maximum number of source gene per target gene is used. It took about 5 to 7 hours to predict a transcriptome using single pass method and using a maximum of two source gene per target gene required about 38 hours (5.4 to 7.6 times) per transcriptome using multi-pass method. [00094] Ten of the 30 transcriptomes used in the evaluation of single pass transcnptome predictor were used to evaluation multi-pass transcnptome predictor (Figure 9). The results suggest that multi-pass method give better accuracy compared to single pass method in terms of the average standard deviations between expected and predicted expression levels of target genes (Figure 14A; p-value = 1.28e-7). However, there is no difference in terms of percentage difference (Figure 14B, p-value = 0.076) even though 3 of the 10 predicted transcriptomes (Samples 3, 6, and 7) are significantly less accurate when predicted by multi-pass method. By examining the standard deviations of the predicted values of each target gene (Figure 14C), multi-pass method consistently gives higher standard deviation compared to single pass method (p-value = 3.10e-5). This suggests that better prediction by multi-pass method in terms of average standard deviations between expected and predicted expression levels of target genes is an artifact as a result of larger standard deviations for the predicted values of each target gene. The correlation analysis between expected and predicted target expression values is significant (Figure 15; average correlation = 0.269, p-value = 1.13e-4), which is similar to that previously reported by Roymondal et al. [Roymondal et al. (2009) DNA Res, 16(1 ), 13-30] but lower than that reported by Fox and Erill [Fox and Erill (2010) DNA Res, 17(3), 185- 196] and from the single pass prediction. This suggests that multi-pass transcnptome predictor does not give better prediction compared to single pass transcnptome predictor despite requiring significantly more computational time.

Example 3: 90% of perturbation-affected genes can be predicted within 3 standard deviations

[00095] An important application of transcriptome prediction model is predicting the effects of gene over-expression, knockouts, and environmental stimuli in silico [Selinger et al. (2003) Trends in Biotechnology, 21(6), 251 -254]. Over-expressions and knockdowns or under- expressions are collectively known as perturbations. A perturbation predictor was developed, based on the validated transcriptome predictors, to estimate the effects of gene perturbations in affected genes. For example, if geneA is 2x over-expressed, the affected genes will be the set of genes reachable within 4 jumps of geneA. The perturbation predictor used a microarray sample as a background transcriptome and performs two predictions. The first prediction predicts the expression values of all reachable genes from the genes of interest before perturbation. Perturbations are carried out by varying the expression values of the genes of interest before predicting the expression values all reachable genes from the genes of interest after perturbation. Both predictions will provide a predicted value (the mean) and a standard deviation of the affected probes, which allow for standard hypothesis testing and power analysis to be performed.

[00096J For evaluation, a background transcriptome and a test transcriptome were identified, and one or more genes from the background transcriptome were perturbed to the value of the test transcriptome. Experimentally, if the effects of a 2-fold over-expression of geneA in E. coli were to be studied, the standard experimental protocol will require an over-expression of geneA using a vector which regulates the expression of geneA under an inducible promoter and compare the transcriptomes of the control sample against the over-expressed sample. In this case, the background and test transcriptomes were selected to represent the control and perturbed samples, respectively. Three replicates were performed on each of the 6 evaluation tests including single, double and quadruple gene perturbations (Figure 16).

[00097] The results show that 87.1 % to 98.5% of the affected genes in single gene over- expression or knockdown are predicted within 3 standard deviations of error (Figure 17A and 17B). For double gene over-expression or knockdown, 91 .3% to 99.0% of the affected genes are predicted within 3 standard deviations of error (Figure 17C and 17D). Using single pass prediction, the results show that at least 89.8% of the genes affected by single gene over- expression with single gene knockdown (Figure 17E) and at least 89.9% of the genes affected by double gene over-expression with double gene knockdown (Figure 17G) can be predicted within 3 standard deviations of error. Comparing single pass versus multi-pass prediction (Figure 17E versus 17F, and 17G versus 17H), although the prediction accuracy in terms of standard deviations is better using multi-pass method, accuracy in terms of percentage difference between the predicted and actual expression values of the affected genes dropped. Statistical comparison between single and multi-pass method shows that this difference is significant (p-value = 0.0012). This is consistent with the findings for the multi-pass predictor in transcriptome prediction. The average correlation between the expression values of affected genes predicted by single-pass method after perturbation is 0.698 with a standard deviation of 0.123 (Figure 18), which is significant (p-value = 7.44e-15). This result is comparable to that reported by McLeay et al. [McLeay et al. (2012) Bioinformatics, 28(21 ), 2789-2796] using ChfP-seq, histones and DNase scores to predict gene expression. This suggests expression values of genes affected by perturbations can be predicted with accuracy comparative to next generation sequencing methods and sequence analyses.

Example 4: A case study using the perturbation predictor

[00098] The perturbation predictor can be used to study the effects of any forms of gene expression perturbations, singly or in combination, on the transcriptome. This case study shows how this predictor can be used. For example, the second replicate of single gene knockdown evaluation corresponds to 56% knockdown of hydrogenase 2 maturation endopeptidase (gene symbol: hybD) that is involved in the maturation of hydrogenase 2. Of the 1603 genes affected by this perturbation, 77 genes are directly correlated and 27 genes show more than 3-fold differences between background expression level and predicted expression level after perturbation. Of the 1526 genes affected between 2 to 4 jumps, 60 are significantly different after Bonferroni correction between predicted expression level before and after perturbation. These 87 genes were analyzed for Gene Ontology enrichment using GOEAST [Zheng and Wang (2008) Nucleic acids research, 36(Web Server issue):W358-363]. All 5 significantly enriched molecular functions are of carbon/sugar transferase-typed activity (GOIDs 0008194, 0008378, 0035250, 0016757, and 0016758). This agrees with recent findings associating hydrogenase 2 to hydrogen production during glucose [Maeda et al. (2007) Applied Microbiology and Biotechnology, 77(4), 879-890] and/or glycerol fermentation [Trchounian et al. (2013) Cell Biochem Biophys, 66(1), 103-8]. In addition, this predictor may also be used to study the effects of 50% versus 60% knockdown of hybD as suggested by Selinger et al. [Selinger et al. (2003) Trends in Biotechnology, 21 (6), 251 -254].

Example 5: Identification of a panel of 20 new marker genes and comparison of the predictions generated by the use of the gene networks using 59 and 20 marker genes

[00099] Choosing a suitable panel of source/marker genes depends on 2 contradictory criteria - maximizing network (transcriptome) coverage while minimizing the number of source genes. Network coverage is directly proportional to the number of source genes.

[000100] Network (transcriptome) coverage was maximized with a minimal number of source genes by repeatedly choosing the source gene with the largest network coverage in each repeat (code for identifying marker genes for transcriptome prediction). If a 10000-gene network and 4 degrees is given, the algorithm to calculate the coverage of each of these 10 thousand genes within 4 degrees and select the gene with the largest coverage as a source gene. For example, if geneB25 has the highest coverage of 5000 genes (50% coverage), geneB25 will be selected as a source gene and all 5000 genes reachable from geneB25 within 4 degrees are removed from the list of 10 thousand genes, resulting in 5 thousand genes remaining.

[000101] The next iteration will calculate the coverage of each of the remaining 5 thousand genes within 4 degrees and select the gene with the largest coverage as a source gene. For example, geneCIO has the highest coverage of 2000 genes (20% of original network), geneCIO will be selected as a source gene. Hence, 70% of the original network (50% covered by geneB25 + 20% covered by geneCI O) will be reached and removed. Thus, only 3000 or 30% of the original network remains for the third iteration.

[000102] Using the code for identifying marker genes for transcriptome prediction, a separate panel of source genes for E. coli was identified. 20 genes were found as source genes for 90% network coverage at 4 degrees, compared to 59 source genes described in Example 1. This represents a 66% reduction in the number of source genes for the E. coli model, whereas the coverage remains the same. The 20 source genes are shown in Figure 19.

[000103] The average error in transcriptome prediction using the set of 59 source genes is 34.289% with a standard error of 2.807% across 30 test samples. However, the average error in transcriptome prediction using the set of 20 source genes is 21.842% with a standard error of 1.035% across 30 test samples (Figure 20). Thus, the average error is much lower when the set of 20 source genes is used compared to the set of 59 (paired t-test p-value = 1.63E-05).

[000104] Similarly, the average correlation (another way of measuring prediction accuracy as correlation of 1 represents perfect accuracy) using the set of 59 source genes is 0.467 with a standard error of 0.038 across 30 test samples. However, the average error in transcriptome prediction using the set of 20 source genes is 0.699 with a standard error of 0.026 across 30 test samples (Figure 21). Thus, the average error is much lower when the set of 20 source genes is used compared to the set of 59 (paired t-test p-value = 5.92E-1 1) and the set of 20 source genes gives less variation in terms of prediction accuracy compared to that from the set of 59 source genes.

Breast cancer model system

Example 6: Transcriptome prediction using the human breast cancer model

[000105] The microarray platform selected for developing the human breast cancer transcriptome prediction model is based on the same criteria used to generate the E, coli model.

[000106] GPL96 was selected as a suitable platform, with 1672 microarray samples for human breast cancer. There are a total of 21351 probes in the array, which are mapped to UniGene IDs. Of which, 13440 UniGene IDs are unique. A sample of 30 pairs (each pair of probes measured the expression of the same UniGene ID) was randomly selected to estimate intra-array variation. The measured intensity should be identical in an ideal situation when 2 probes detect for the same transcript (expression from the same UniGene ID), and variation from ideal situation can be used as an estimate for intra-array variation. The results estimated intra- array variation to be 25.63% with a standard deviation of 19.592%, suggesting the presence of intra-array variation (p-value = 2.5e-9). The average correlation of microarray intensities from 2 probes detect for the same transcript is 0.750 with a standard deviation of 0.1591 , which is significantly different from random (p-value = 1.5e-21 ). Despite the presence of intra-array variations, a statistically significant correlation is expected when the pairs of probes are measuring the same transcript.

[000107] As the number of unique UniGene IDs is 62.9% of the number of probes mapped to UniGene IDs, it suggests that GPL96 has 37.1% probe redundancy. Redundant probes can be merged and averaged, which is likely to give better prediction. The code used to conduct this step is the code to merge probes of the same ID.

[000108] Of the 1672 available microarray samples, 1 100 samples were used for gene co- expression network (GCN) construction. Hence, 572 microarray samples which were not used for GCN construction were used for testing.

[000109] Thirty samples from the test set of 572 samples were used to evaluate the accuracy of prediction at 4 degrees, covering 90% of the entire GCN. Three source genes were used (Hs.514242, Hs.138860, and Hs.24529, Figure 22). The results showed that the average percentage error in prediction is 9.24% with standard error of 0.315% (Figure 23 A), which is significantly lower intra-array variation of 25.63% (p-value = 2.4e-30, power > 0,9999). This suggests that by taking the average of redundant probes (more than one probe in the microarray that measures the same mRNA transcript), transcriptome prediction accuracies can be achieved that are significantly lower than intra-array variation.

[000110] The average correlation between the actual and predicted gene expression in the 30 test samples is 0.83 with a standard error of 0.0129 (Figure 23B). This is comparable to the one described by McLeay et al. [McLeay et al. (2012) Bio informatics, 28(21 ), 2789-2796] whose model required histones modification and chromatin accessibility data (p-value = 0.13) for the prediction of gene expression in mouse embryonic stem cells. However, when tested on another cell line, GM12878, the gene expression prediction model of the present invention is significantly better (p-value = 1.2e-14) than the model reported by McLeay et al. despite requiring histones modification and chromatin accessibility data (correlation of 0.642 between the actual and predicted gene expression). Hence, the results suggest that the method of the present invention for predicting gene expression of human breast cancer is comparable to the one using histones modification and chromatin accessibility data. However, using histones modification and chromatin accessibility data instead of microarray data has the disadvantage that such data is more difficult to obtain.

Example 7: Case study using the human breast cancer model: BRCA1

[000111] BRCA1 (BReast-CAncer susceptible gene 1 ) is commonly mutated in clinical presentations of breast tumors [Yoshida and Miki (2004) Cancer Science, 95(1 1 ), 866-871 ]. It was recently found that down-regulation [Wang et al. (2007) The Journal of International Medical Research, 35(4), 564-573] of BRCA1 results in genetic instability [Yakovlev (2013) Cancer research, 73(2), 706-715] and promotion of cancer growth [Salem et al. (2012) Cell Cycle, 1 1 (22), 4167-4173],

[000112] The method of the present invention can be used to examine the effects of BRCA1 down -regulation. The average expression value of BRCA1 (UniGene ID Hs.194143) in 1671 microarrays is 6.127, with standard deviation of 0.6647. Hence, -3 standard deviations from average expression value of BRCA1 is 4.133, representing 32.54% down-regulation. Therefore, expression values of 6.127 and 4.133 can be considered as average expression and down- regulated expression, respectively. These values can be used to examine the effects of BRCAl down-regulation compared to normal expression.

[000113] A total of 5132 genes were affected by BRCAl down-regulation and all 5132 genes were predicted to have down-regulated expression when BRCAl was down-regulated compared to average BRCAl expression. However, 1839 genes showed more than 32.54% difference in predicted expression between average BRCAl expression and down-regulated BRCAl expression. These 1839 genes were analyzed for Gene Ontology (GO) term enrichment using GOrilla [Eden et al. (2009) BMC Bioinformatics, 10, 48] and 4 GO terms were over- represented (Figure 24).

[000114] The pattern specification process (GO:0007389) is the descendant node of both single-organism developmental process (GO:0044767) and single-multicellular organism process (GO:0044707). De-differentiation of normal mammary cells into breast cancer stem cells is a hallmark of tumor formation [Rameshwar (2009) Current pharmaceutical biotechnology, 10(2), 148-153] and breast cancer invasiveness [Abdel-Fatah et al. (2014) Breast cancer research and treatment, 143(3), 41 1 -421 ]. As the results had suggested that down-regulation of BRCAl resulted in down-regulation of genes whose function involves the pattern specification process or differentiation (as observed in the GO definition of the pattern specification process), it is plausible to consider that the down-regulation of the pattern specification process genes might result in de-differentiation.

[000115] Neutrophils had been suggested to exhibit antitumor effects in breast cancer via the production of tumor necrosis factor alpha [Vendrell et al. (201 1 ) Vaccine, 29(4), 728-736]. Hence, down-regulation of neutrophil mediated immunity (GO:0002446) by down-regulation of BRCAl is consistent with the pathogenesis of breast cancer. Thus, the results suggested that BRCAl may function to maintain the state of differentiation of normal mammary cells and to promote a healthy state of neutrophil mediated immunity to prevent breast cancer formation.

Example 8: Case study using human the breast cancer model: MAP Kinase

[000116] MAP kinase (MAPK) is one breast cancer marker for which it had been demonstrated that over-expression or constitutive expression results in cell proliferation [Nunes- Xavier et al. (2010) The Journal of biological chemistry, 285(34), 26417-26430] and metastasis. Further, MAPK is used as a metastasis marker [Adeyinka et al. (2002) Clinical Cancer Research, 8(6), 1747-1753]. Hence, MAPK is a potential candidate for over-expression study using the method of the present invention.

[000117] There were 10 isoforms of MAPK in the human breast cancer model (Figure 25). Similar to BRCA1 case study, average expression value of each iso form was used as standard. For over-expression, expression values of 3 standard deviations higher than the average expression of each iso form were used. Instead of a single threshold for significant differences in expression value of affected genes as in the case of the BRCA1 case study, two criteria were used. For genes adjacent to MAPK, a change of at least 2-log intensities in expression was needed in order to be significantly different. For genes non-adjacent to MAPK, a p-value of less than 0.001 after Bonferroni correction and a power of more than 0.8 were needed in order to be considered as having a significantly different expression level.

[000118] The results showed that between 57.0% (MAPK9) to 83.6% (MAPK12) of the reachable genes were significantly affected (Figure 25). However, it is possible that two or more isoforms may significantly affect the expression level of the same gene. Hence, it is not feasible to evaluate genes that are significantly affected by over-expression of each MAPK iso form but to examine only the set of genes whose expression are affected by more than one isoform. The results showed that no gene was commonly affected by 9 or all 10 MAPK isoforms in this analysis but a set of 56 genes was commonly affected by 8 MAPK isoforms (Figure 26). This set of 56 genes was analyzed for Gene Ontology (GO) term enrichment using GOrilla and 3 GO terms were over-represented (GO:0032886, GO:0060632, GO:0070507).

[000119] Regulation of microtubule-based process (GO:0032886) is the ancestor to both regulation of microtubule-based movement (GO:0060632) and regulation of microtubule cytoskeleton organization (GO:0070507). This result is consistent with current studies suggesting that MAPK affects cytoskeletal organization [Scuto et al. (2007) Cancer research, 67(21 ), 10317-10324] and the inhibition of MAPK pathways can result in inhibition of metastasis [Lai et al. (2014) Oncology reports, 31(1), 189-196]. Hence, this set of 56 genes whose expression is significantly affected by 8 isoforms of MAPK may present as a useful set of drug targets. Yeast model system

Example 9: Transcriptome prediction using the human breast cancer model

[000120] The microarray platform selected for developing the yeast transcriptome prediction model is based on the same criteria used to generate the E. coli model.

[000121] GPL2529, which contains 1958 microarray samples of yeast, was selected as the initial platform. There are a total of 10715 probes in the array, which are mapped to UniGene IDs, of which 10639 UniGene IDs are unique. A sample of 30 pairs of probes (each pair of probes measured the expression of the same UniGene ID) was randomly selected to estimate intra-array variation. The measured intensity should be identical in an ideal situation when 2 probes detect for the same transcript (expression from the same UniGene ID), and variation from ideal situation can be used as an estimate for intra-array variation. The results estimated the intra- array variation to be 23.22% with a standard deviation of 50.235%, suggesting the presence of intra-array variation (p-value = 0.01705). The average correlation of microarray intensities from 2 probes detect for the same transcript is 0.451 with a standard deviation of 0.2909, which is significantly different from random (p-value = 2.4e-09). Despite the presence of intra-array variations, a statistically significant correlation is expected as the pairs of probes are measuring the same transcript.

[000122] As the number of unique UniGene IDs is 99.3% of the number of probes mapped to UniGene IDs, it suggests that GPL2529 has 0.7% probe redundancy. Redundant probes can be merged and averaged, which is likely to give better prediction. The code used to conduct this step is the code to merge probes of the same ID.

[000123] Of the 1958 available microarray samples, 1500 samples were used for gene co- expression network (GCN) construction. Hence, 458 microarray samples which were not used for GCN construction were used for testing. The procedure for GCN construction is comparable to the ones of constructing the GCN of the E. coli and the human breast cancer model.

[000124] Thirty samples from the test set of 458 samples were used to evaluate the accuracy of prediction at 3 degrees, covering 90% of the entire GCN. Two source genes were used (1774893_at and 1779850_at; Figure 27). The results showed that the average percentage error in prediction is 18.56% with standard error of 0.204% (Figure 28A), which is significantly lower intra-array variation of 25.63% (p-value = 0.0170). This suggests that by taking the average of redundant probes (more than one probe in the microarray that measures the same mRNA transcript), transcriptome prediction accuracies can be achieved that are significantly lower than intra-array variation.

1000125] As shown in Figure 29, the percentage errors of 30 samples from both models in (A) and (B) are well below the intra-array variation of 23.22%. The result from a 2-tailed t-test shows a p-value of l e-06. The average percentage error of Model A (absolute Pearson's correlation coefficient of 0.396 as threshold) is significantly higher (p-value = 1.9e-19) than that in Model B (absolute Pearson's correlation coefficient of 0.528 as threshold), having values of 18.56% and 16.99%, respectively. The average correlation between predicted and expected gene expressions for Model A and Model B are 0.855 and 0.854, respectively, which is not significantly different (p-value = 0.603) between the 2 models.

[000126] However, Model A can predict more genes (3,473 unique genes) than Model B (921 unique genes) due to the fact that it has a higher network density. This comparison provides the evidence that the trade-off of a lower correlation in Model A is justified due to the fact that all 30 samples have percentage errors of prediction well below the intra-array variation, and that it has a greater capacity for a wider coverage of gene predictions. Therefore, Model A can be utilized for subsequent transcriptome predictions.

[000127] Meyer et al. [Meyer et al. (2013) Genome research, 23(1 1 ), 1928-1937] reported the results of a competition (Gene Promoter Expression Prediction) aiming to predict gene expression from promoter sequences. The best predictor (Team FiRST) in the competition showed a correlation of 0.65 between the gene expression predicted from promoter sequence and the actual expression of the set of genes. Meyer et al. attempted to improve on the prediction of

Team FiRST by adding biological features, such as the binding of transcription factors and nucleosomes to the DNA, then training a support vector machine (machine learning approach) on the combined features. In doing so, it was reported that a correlation of 0.67 between the gene expression predicted from promoter sequence and the actual expression of the set of genes can be observed. The average correlation between the actual and predicted gene expression in the 30 test samples using Model A (absolute Pearson's correlation coefficient of 0.396 as threshold) is

0.85 with a standard error of 0.0036 (Figure 29), which is significantly better (p-value = 4.5e-30) than that reported by Meyer et al. An advantage is that the present method used microarray data instead of more difficult to obtain protein-DNA binding data, suggesting that the present method may be more easily applied to other organisms.

Example 10: Case study using the yeast model: HAA1

[000128] HAA1 is a transcriptional activator required for weak acid tolerance and adaptation in Saccharomyces cerevisiae. As reported by Tanaka et al. [Tanaka et al. (2012) Applied and environmental microbiology, 78(22), 8161-8163], an overexpression of the HAA1 gene is associated with increased acetic acid tolerance. Further, the overexpression of HAA1 leads to intracellular levels of acetic acid that were significantly lower compared to the ones in wild-type cells.

[000129] The method of the present invention not only allowed in silico experimental validation, but was also able to identify downstream activated genes responsible for this acid tolerance in budding yeast.

[000130] The present method was used to validate the reproducibility of experimentally- derived conclusions of Tanaka et al. by mimicking an overexpression of 2.5 folds for the HAA1 gene. The average expression value for HAA1 calculated from microarray data is 10.007, with a standard deviation of 0.6747. The value of 25.017, corresponding to approximately +22 standard deviations, was used as a representation of a 2.5 fold increase for HAAl expression. This 2.5 fold increment is equivalent to a 150% difference in overexpression. Henceforth, the values 10.007 and 25.017 are used to signify the average expression and overexpression values, respectively, for the study of regulatory effects of HAAl in Saccharomyces cerevisiae.

[000131] Results of the present method revealed 2019 genes affected from the overexpression of HAAl , where all of them have been predicted to be up-regulated from this effect. Of these, 14 genes are up-regulated by more than a 150% difference in predicted expression between the average and overexpressed values used. These 14 genes were analyzed for Gene Ontology (GO) term enrichment using GOrilla and the major GO process enriched by the overexpression of HAAl gene are GO:0044262, GO:0015078, GO:0015077 (Figure 30).

[000132] The overexpression of the HAAl gene in Saccharomyces cerevisiae apparently up-regulates the cellular carbohydrate metabolic process (GO:0044262) and transmembrane transporter activities (GO:0015078, GO:0015077) to a significantly greater extent. Studies have provided evidence that despite environmental acetic acid stress, glucose uptake and its metabohzation is not inhibited [Mira et al. (2010) Microbial Cell Factories, 9, 79]. In fact, yeast cells succumbed to acetic acid during incubation exhibited a slightly higher maximum glucose uptake rate than wild type cells [Mira et al. (2010) OMICS: A Journal of Integrative Biology, 14(5), 587-601 ]. It might be possible that adaptation and eventual viability of yeast cells in the face of weak acid stress is due to this increase in glucose uptake and metabolism.

[000133] Weak acids are capable of entering the cells via passive diffusion in their un- dissociated toxic form to dissociate into protons and counter-ions inside the cytosol [Fernandes et al. (2005) Biochem Biophys Res Commun, 337, 95-103]. As a result, one of the immediate responses to weak acid aggression is the transcription and regulation of genes related to proton homeostasis via HAA l overexpression. They do so through the assembly or regulation of the activity of plasma membrane H + -ATPase, vacuolar H + -ATPase and mitochondrial FiFo ATP synthase [Mira et al. (2010) Microbial Cell Factories, 9, 79]. The present results have paralleled experimental findings that these significantly overexpressed genes cause immediate and essential acetic acid stress responses in Saccharomyces cerevisiae. In detail, the overexpression of HAAl led to up-regulation of metabolic processes essential for cell viability, as well as an increase in activity of transmembrane transporters to purge and/or sequester the accumulated protons to vacuole lumen for weak acid adaptation [Carmelo et al. ( 1997) Biochim Biphys Acta, 1325, 63- 70; Fernandes et al. (2003) Biochem Biophys Res Commun, 337, 95-103; Makrantoni et al. (2007) Microbiology, 153, 4016-4026].

[000134] The invention has been described broadly and generically herein. Each of the narrower species and subgeneric groupings falling within the generic disclosure also form part of the invention. This includes the generic description of the invention with a proviso or negative limitation removing any subject-matter from the genus, regardless of whether or not the excised material is specifically recited herein. Other embodiments are within the following claims. In addition, where features or aspects of the invention are described in terms of Markush groups, those skilled in the art will recognize that the invention is also thereby described in terms of any individual member or subgroup of members of the Markush group. [000135] One skilled in the art would readily appreciate that the present invention is well adapted to carry out the objects and obtain the ends and advantages mentioned, as well as those inherent therein. Further, it will be readily apparent to one skilled in the art that varying substitutions and modifications may be made to the invention disclosed herein without departing from the scope and spirit of the invention. The compositions, methods, procedures, treatments, molecules and specific compounds described herein are presently representative of preferred embodiments are exemplary and are not intended as limitations on the scope of the invention. Changes therein and other uses will occur to those skilled in the art which are encompassed within the spirit of the invention are defined by the scope of the claims. The listing or discussion of a previously published document in this specification should not necessarily be taken as an acknowledgement that the document is part of the state of the art or is common general knowledge.

[000136] The invention illustratively described herein may suitably be practiced in the absence of any element or elements, limitation or limitations, not specifically disclosed herein. Thus, for example, the terms "comprising", "including," containing", etc. shall be read expansively and without limitation. The word "comprise" or variations such as "comprises" or "comprising" will accordingly be understood to imply the inclusion of a stated integer or groups of integers but not the exclusion of any other integer or group of integers. Additionally, the terms and expressions employed herein have been used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention has been specifically disclosed by exemplary embodiments and optional features, modification and variation of the inventions embodied therein herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention.

[000137] The content of all documents and patent documents cited herein is incorporated by reference in their entirety.