

Title:
METHODS TO DIAGNOSE AND TREAT CANCER USING NON-HUMAN NUCLEIC ACIDS
Document Type and Number:
WIPO Patent Application WO/2020/093040
Kind Code:
A1
Abstract:
Methods for diagnosing cancer, its subtypes, molecular features, and likelihood of response to therapy, as well as other diseases, based on microbial presence or abundance in tissues, including blood-derived tissues, of the host subject. Methods of treatment of the identified cancer in subjects are also provided.

Inventors:
POORE GREGORY D (US)
KNIGHT ROBIN (US)
Application Number:
PCT/US2019/059647
Publication Date:
May 07, 2020
Filing Date:
November 04, 2019
Assignee:
UNIV CALIFORNIA (US)
International Classes:
C12Q1/6886; C12Q1/689; G01N33/569; G16H50/20
Domestic Patent References:
WO2018109219A1 2018-06-21
WO2018200813A1 2018-11-01
WO2018031545A1 2018-02-15
WO2017156431A1 2017-09-14
WO2017025617A1 2017-02-16
WO2018195097A1 2018-10-25
WO2018112365A2 2018-06-21
WO2017123676A1 2017-07-20
WO2018136598A1 2018-07-26
WO2018026742A1 2018-02-08
Foreign References:
US20180258495A1 2018-09-13
US20090061422A1 2009-03-05
US20180291463A1 2018-10-11
US20180163272A1 2018-06-14
US20150259728A1 2015-09-17
US20160220619A1 2016-08-04
US20180311269A1 2018-11-01
US20160130365A1 2016-05-12
Other References:
ZHU ET AL.: "Analysis of the Intestinal Lumen Microbiota in an Animal Model of Colorectal Cancer", PLOS ONE, vol. 9, no. 3, 6 March 2014 (2014-03-06), pages e90849, XP055133610, DOI: 10.1371/journal.pone.0090849
YU ET AL.: "Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features", NATURE COMMUNICATIONS, vol. 7, no. 12474, 16 August 2016 (2016-08-16), XP055706000
HSIEH ET AL.: "Design Ensemble Machine Learning Model for Breast Cancer Diagnosis", JOURNAL OF MEDICAL SYSTEMS, vol. 36, no. 5, 3 August 2011 (2011-08-03), pages 2841 - 2847, XP035103459, DOI: 10.1007/s10916-011-9762-6
WU ET AL.: "Recent Advances and Challenges in Studies of Control of Cancer Stem Cells and the Gut Microbiome by the Trametes-Derived Polysaccharopeptide PSP (Review)", INTERNATIONAL JOURNAL OF MEDICINAL MUSHROOMS, vol. 18, no. 8, 31 December 2015 (2015-12-31), pages 651 - 660, XP055706010
See also references of EP 3874068A4
Attorney, Agent or Firm:
WARREN, William L. et al. (US)
Claims:
What is claimed is:

1. A method for creating a diagnostic model based on non-mammalian features to diagnose a mammalian disease comprising:

detecting microbial presence or abundance in a tissue sample from one or more mammalian subjects;

determining a shared pattern of microbial presence or abundance among one or more of the mammalian subjects;

forming an association between the shared pattern of microbial presence or abundance and the disease present in the mammalian subject; and

summarizing the association in a diagnostic model to diagnose disease in a further mammalian tissue sample using microbial presence or abundance.

2. The method of Claim 1, wherein the diagnostic model utilizes microbial presence or abundance information from one or more of the following non-mammalian domains of life: viral, bacterial, archaeal, and/or fungal.

3. The method of Claim 1, wherein the diagnostic model diagnoses the presence or absence of cancer.

4. The method of Claim 1, wherein the diagnostic model diagnoses a category or location of cancer.

5. The method of Claim 1, wherein the diagnostic model is used to diagnose one or more types of cancer in a subject.

6. The method of Claim 1, wherein the diagnostic model is used to diagnose one or more subtypes of cancer in a subject.

7. The method of Claim 1, wherein the diagnostic model is used to predict the stage of cancer in a subject and/or predict cancer prognosis in the subject.

8. The method of Claim 1, wherein the diagnostic model is used to diagnose a type of cancer at low-stage (stage I or stage II) tumor.

9. The method of Claim 1, wherein the diagnostic model is used to predict the mutation status of one or more cancers in the subject.

10. The method of Claim 1, wherein the diagnostic model is used to predict immunotherapy response of a subject.

11. The method of Claim 1, wherein the diagnostic model is utilized to select an optimal therapy for a particular subject.

12. The method of Claim 1, wherein the diagnostic model is utilized to longitudinally model the course of one or more cancers’ response to therapy and to then adjust a treatment regimen.

13. The method of Claim 1, wherein the diagnostic model diagnoses one or more of the following: acute myeloid leukemia, adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, lymphoid neoplasm diffuse large B-cell lymphoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumors, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma, or uveal melanoma.

14. The method of Claim 1, wherein the diagnostic model is a machine learning model.

15. The method of Claim 1, wherein the diagnostic model is a regularized machine learning model.

16. The method of Claim 1, wherein the diagnostic model is an ensemble of machine learning models.

17. The method of Claim 1, wherein the diagnostic model identifies and removes certain microbial features as contaminants termed noise, while selectively retaining other microbial features termed signal.

18. The method of Claim 1, wherein the subject is a non-human mammal.

19. The method of Claim 1, wherein the subject is human.

20. The method of Claim 1, wherein the tissue is a whole blood biopsy.

21. The method of Claim 1, wherein the tissue biopsy is one or more constituents of whole blood, including but not limited to one or more of the following: plasma, white blood cells, red blood cells, and/or platelets.

22. The method of Claim 1, wherein the tissue is a solid tissue biopsy, including but not limited to a solid tissue biopsy of malignant tissue and/or of adjacent non-malignant tissue.

23. The method of Claim 1, further comprising the inclusion of mammalian features, in addition to non-mammalian microbial features, in the diagnostic model.

24. The method of Claim 23, wherein mammalian features in the diagnostic model include one or more of the following: cell-free tumor DNA, cell-free tumor RNA, exosomal-derived tumor DNA, exosomal-derived tumor RNA, circulating tumor cell derived DNA, circulating tumor cell derived RNA, methylation patterns of cell-free tumor DNA, methylation patterns of cell-free tumor RNA, methylation patterns of circulating tumor cell derived DNA, and/or methylation patterns of circulating tumor cell derived RNA.

25. A method of diagnosing disease in a mammalian subject comprising:

detecting microbial presence or abundance in a tissue sample from the subject;

determining that the detected microbial presence or abundance is similar to or different than microbial presence or abundance in tissues from healthy or diseased individuals; and

correlating the detected microbial presence or abundance with a known microbial presence or abundance for a disease, thereby diagnosing the disease.

26. The method of Claim 25, wherein the diagnosis is the presence or absence of cancer.

27. The method of Claim 25, wherein the diagnosis is a category or location of cancer.

28. The method of Claim 25, wherein the diagnosis is one or more types of cancer in a subject.

29. The method of Claim 25, wherein the diagnosis is one or more subtypes of cancer in a subject.

30. The method of Claim 25, wherein the diagnosis is the stage of cancer in a subject and/or cancer prognosis in the subject.

31. The method of Claim 25, wherein the diagnosis is a type of cancer at low-stage (stage I or stage II) tumor.

32. The method of Claim 25, wherein the diagnosis is the mutation status of one or more cancers in the subject.

33. The method of Claim 25, wherein the diagnosis is an anticipated response to immunotherapy of the subject.

34. The method of Claim 25, wherein the diagnosis is one or more of the following: acute myeloid leukemia, adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, lymphoid neoplasm diffuse large B-cell lymphoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumors, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma, or uveal melanoma.

35. The method of Claim 25, wherein the subject is a non-human mammal.

36. The method of Claim 25, wherein the subject is human.

37. The method of Claim 25, further comprising optimal treatment selection for the disease in the subject based on the diagnostic information.

38. The method of Claim 37, wherein the optimal treatment selection is a regimen comprising administering to the subject in need of treatment an effective amount of one or more of the following: a small molecule, a biologic, an engineered host-derived cell type or types, a probiotic, an engineered bacterium, a natural-but-selective virus, an engineered virus, and/or a bacteriophage.

39. The method of Claim 25, wherein the microbial presence or abundance is obtained from one or more of the following non-mammalian domains of life: viral, bacterial, archaeal, and/or fungal.

40. The method of Claim 25, wherein the tissue is a whole blood biopsy.

41. The method of Claim 25, wherein the tissue is one or more constituents of whole blood, including but not limited to one or more of the following: plasma, white blood cells, red blood cells, and/or platelets.

42. The method of Claim 25, wherein the tissue is a solid tissue biopsy, including but not limited to a solid tissue biopsy of malignant tissue and/or of adjacent non-malignant tissue.

43. The method of Claim 25, wherein the microbial presence or abundance of the disease is determined by measuring other locations of the host microbiome.

44. The method of Claim 25, wherein the microbial presence or abundance is detected by nucleic acid measurement.

45. The method of Claim 44, wherein one or more of the following nucleic acid markers of microbial origin are detected: V1, V2, V3, V4, V5, V6, V7, V8, or V9 variable domain region of 16S rRNA; or the internal transcribed spacer (ITS) region of the 18S rRNA.

46. The method of Claim 44, wherein the nucleic acid detection is intended to target either metagenomic DNA or RNA or both.

47. The method of Claim 44, wherein the nucleic acid detection is intended to target either host DNA or RNA or both.

48. The method of Claim 44, wherein the nucleic acid detection is intended to target either cancer-derived DNA or RNA or both.

49. The method of Claim 44, wherein the nucleic acid detection procedure is modified to selectively deplete host DNA and/or RNA while selectively retaining microbial DNA and/or RNA.

50. The method of Claim 44, further comprising the simultaneous detection and/or quantification of both host-derived nucleic acids and microbial-derived nucleic acids.

51. The method of Claim 25, wherein the microbial presence and/or abundance is detected and/or measured via immunohistochemistry.

52. The method of Claim 25, wherein the microbial presence and/or abundance is detected and/or measured via in situ hybridization.

53. The method of Claim 25, wherein the microbial presence or abundance is detected and/or measured via flow cytometry.

54. The method of Claim 25, further comprising determining the geospatial distribution of microbial nucleic acids within a cancer of the subject.

55. The method of Claim 54, wherein the geospatial distribution of microbial presence or abundance information is detected and/or measured via multisampling the tumor tissue and/or its microenvironment.

56. The method of Claim 54, wherein the geospatial distribution of microbial presence or abundance information is detected and/or measured using one or more of the following methods: immunohistochemistry, in situ hybridization, digital spatial genomics, and/or digital spatial transcriptomics.

57. The method of Claim 54, further comprising administering to the subject in need an effective amount of an optimal treatment regimen, including but not limited to drug choice and dynamic time course, selected based on the geospatial distribution of microbial presence or abundance information of the cancer.

58. A method for treating a mammalian cancer in a subject based on non-mammalian, microbial presence or abundances comprising:

detecting microbial presence or abundance in a tissue sample from the subject with cancer;

determining a shared pattern of the microbial presence or abundance in the mammalian subject with cancer;

forming an association between the pattern of microbial presence or abundance and the cancer present in the mammalian subject; and

administering to the subject a therapeutically effective amount of a treatment utilizing the microbial association with cancer to treat the mammalian cancer.

59. The method of Claim 58, wherein the subject is a non-human mammal.

60. The method of Claim 58, wherein the subject is human.

61. The method of Claim 58, wherein the treatment repurposes an existing medication, which may or may not have been originally approved for targeting cancer, to improve overall therapeutic efficacy by exploiting microbial presence or abundance information.

62. The method of Claim 58, wherein the treatment is a small molecule.

63. The method of Claim 58, wherein the treatment is a biologic.

64. The method of Claim 58, wherein the treatment is an engineered host-derived cell type.

65. The method of Claim 58, wherein the treatment is a probiotic.

66. The method of Claim 58, wherein the probiotic is an engineered bacterium strain or an ensemble of engineered bacteria.

67. The method of Claim 58, wherein the treatment is a virus.

68. The method of Claim 58, wherein the treatment is a bacteriophage.

69. The method of Claim 58, wherein the treatment is an adjuvant given in combination with a primary treatment against the cancer to improve the efficacy of the primary treatment.

70. The method of Claim 58, wherein the treatment is an immunotherapy.

71. The method of Claim 70, wherein the form of immunotherapy involves adoptive cell transfer to target microbial antigens associated with the tumor or tumor microenvironment.

72. The method of Claim 70, wherein the form of immunotherapy is a cancer vaccine that exploits the microbial antigens associated with the cancer or cancer microenvironment.

73. The method of Claim 70, wherein the form of immunotherapy is a monoclonal antibody against microbial antigens associated with the cancer or cancer microenvironment.

74. The method of Claim 70, wherein the form of immunotherapy is an antibody-drug-conjugate designed to at least partially target microbial antigens associated with the cancer or cancer microenvironment.

75. The method of Claim 70, wherein the form of immunotherapy is a multi-valent antibody, antibody fragment, or antibody derivative thereof designed to at least partially target one or more microbial antigens associated with the cancer or cancer microenvironment.

76. The method of Claim 58, wherein the treatment is an antibiotic.

77. The method of Claim 76, wherein the antibiotic is targeted against a particular kind of microbe or class of functionally or biologically similar microbes.

78. The method of Claim 76, wherein the antibiotic is a broad-spectrum agent against multiple microbial groups.

79. The method of Claim 58, wherein two or more of the following treatment types are combined and whereby at least one type exploits cancer microbial presence or abundance to improve overall therapeutic efficacy: small molecules, biologics, engineered host-derived cell types, probiotics, engineered bacteria, natural-but-selective viruses, engineered viruses, and bacteriophages.

80. The method of Claim 58, wherein one or more treatment types exploit the geospatial distribution of microbial presence or abundance information in cancer to improve overall therapeutic efficacy.

Description:
METHODS TO DIAGNOSE AND TREAT CANCER USING NON-HUMAN NUCLEIC ACIDS

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the priority benefit of U.S. Provisional Application No. 62/754,696, filed November 2, 2018, which application is incorporated herein by reference.

TECHNICAL FIELD

[0002] The invention relates to the field of methods to accurately diagnose and treat disease using nucleic acids of non-human origin from a human tissue biopsy or blood-derived sample.

BACKGROUND

[0003] Despite a commonly held view that cancer is a ‘disease of the human genome,’ an increasing amount of evidence indicates a key role for microbiota in carcinogenesis, tumor progression, and response to therapy. In fact, as much as 20% of the global cancer burden has been estimated to be caused by microbial agents. Many researchers believe the potential mechanism is through our resident microbes’ influence on the immune system, with their abilities to dial up or dampen down inflammation as well as to manipulate the capabilities and responsiveness of our immune cells.

[0004] Based on data from studies using gnotobiotic mouse models colonized with one or more specific bacteria, it appears that microbiota can alter cancer susceptibility and progression by diverse mechanisms, such as modulating inflammation, inducing DNA damage, and producing metabolites involved in oncogenesis or tumor suppression. In addition to carcinogenesis and cancer progression, emerging evidence suggests that microbiota can predict response to cancer treatment or be manipulated for improving cancer treatment, including “traditional” chemotherapies (e.g. gemcitabine) and more “innovative” immunotherapies (e.g. PD-1 blockade). Yet, virtually all of this literature has relied on examining variants of the host gut microbiome and its influence on cancer, and the few examples in the literature that have explored cancer tissue specific microbiota, almost universally in gastrointestinal tract cancers, have merely examined questions of pathogenesis. Conversely, no prior art has described broad relationships between non-gastrointestinal microbiota and pan-cancer diagnostics, including from blood-derived samples; similarly, no prior art has described how cancer tissue resident microbiota can predict or impact patient responsiveness to cancer treatment, notably including immunotherapy response. The closest related prior art known to the inventor in this area, US20180291463A1, WO2018200813A1, and WO2018031545A1 (all attributed to Robertson et al.), relies on a microarray-based technology for detecting pre-selected (“biased”) populations of microbes in tumor tissue samples (NOT blood or other bodily fluids); moreover, this prior art has only covered three cancer types (breast cancer, ovarian cancer, and oral squamous cell carcinoma) rather than taking a pan-cancer approach.

[0005] The prior art for this invention builds upon the core concepts of cancer diagnosis using nucleic acids of human origin, either in solid tissue biopsies or liquid (i.e. blood-based) biopsies. It also builds upon the concepts of detecting circulating tumor DNA (ctDNA) to diagnose the presence of a tumor (e.g. PMID: 24553385) and recently described microbial cell-free DNA to detect infectious disease agents in a patient suspected of sepsis (PMID: 30742071). Notably, these host-based ctDNA assays usually cannot diagnose the kind of cancer, since the majority of genomic alterations in cancer are shared between cancer types. From a biological perspective, it has been well known for several years that isolating (via microbial blood culture) certain kinds of bacteria from the blood is highly suggestive of underlying colorectal cancer (e.g. Streptococcus bovis; PMID: 21247505), and a recent study on >13,000 patients demonstrated widespread, transient bacteremias, as detected by traditional blood culture, in those who ended up having colorectal cancer (PMID: 29729257). For blood-based diagnostics, this invention extends the notion of cancer-specific bacteremias to include many more tumor types; it does not rely on traditional blood culture methods, does not necessarily require pre-selecting the microbial population of interest, and exploits this idea to create a broad diagnostic assay. The invention additionally extends tumor tissue-based diagnostics to discriminate between several dozens of cancer types (i.e. “pan cancer” diagnostics), their subtypes, their molecular features (e.g. mutations), and their predicted response to therapy, including immunotherapy. Moreover, this invention extends the diagnostic information to select or create new treatments based on intra-tumoral microbial features.

[0006] Other prior art that is relevant to this field is as follows: U.S. Publication No. 2018/0223338 describes using the solid tissue microbiome or saliva microbiome in identifying and diagnosing head and neck cancer; and U.S. Publication No. 2018/0258495 A1 describes using the solid tissue microbiome or fecal microbiome to detect colon cancer, some kinds of mutations associated with colon cancer, and a kit to collect and amplify the corresponding microbes.

SUMMARY OF THE INVENTION

[0007] The disclosure of the present invention provides a method to accurately diagnose cancer and other diseases, their subtypes, and their likelihood of response to certain therapies solely using nucleic acids of non-human origin from a human tissue biopsy or blood-derived sample.

[0008] In embodiments, the invention provides a method for broadly creating patterns of microbial presence or abundance (‘signatures’) that are associated with the presence and/or type of cancer using blood-derived tissues. These ‘signatures’ can then be deployed to diagnose the presence, kind, and/or subtype of cancer in a human.

[0009] In embodiments, the invention provides a method for broadly creating patterns of microbial presence or abundance that are associated with the presence and/or type of cancer using primary tumor tissues. These ‘signatures’ can then be deployed to diagnose the presence, kind, and/or subtype of cancer in a human.

[0010] In embodiments, the invention provides a method of broadly diagnosing disease in a mammalian subject comprising: detecting microbial presence or abundance in a tissue sample from the subject; determining that the detected microbial presence or abundance is different than microbial presence or abundance in a normal tissue sample, and correlating the detected microbial presence or abundance with a known microbial presence or abundance for a disease, thereby diagnosing the disease.

[0011] In embodiments, the invention provides a method of broadly diagnosing the type of disease in a mammalian subject comprising: detecting microbial presence or abundance in a tumor tissue sample from the subject; determining that the detected microbial presence or abundance is similar or different to the microbial presence or abundance in a population of previously studied tumors, and correlating the detected microbial presence or abundance with the most similar tumor type, thereby diagnosing the kind of disease.

[0012] In embodiments, the invention provides a method of diagnosing the type of disease in a mammalian subject comprising: detecting microbial presence or abundance in a blood-derived tissue sample from the subject; determining that the detected microbial presence or abundance is similar or different to the microbial presence or abundance in a population of cancer and/or healthy patients with previously studied blood-derived tissue samples, and correlating the detected microbial presence or abundance with the most similar blood-derived tissue samples in this cohort, thereby diagnosing the disease and/or kind of disease.
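The comparison step of paragraphs [0011] and [0012] can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sample is a vector of per-taxon read counts, that each reference group is summarized by the mean of its relative-abundance profiles, and that "most similar" means highest Pearson correlation. The function names, the toy cohort, and the choice of similarity measure are hypothetical and are not taken from the disclosure.

```python
from math import sqrt

def relative_abundance(counts):
    """Convert raw per-taxon read counts to fractions that sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no microbial reads")
    return [c / total for c in counts]

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / (sx * sy)

def centroid(profiles):
    """Mean profile of a group of samples (column-wise average)."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def build_reference(cohort):
    """Map each label to the mean relative-abundance profile of its samples."""
    return {
        label: centroid([relative_abundance(s) for s in samples])
        for label, samples in cohort.items()
    }

def diagnose(counts, reference):
    """Return the label whose reference profile correlates best, plus all scores."""
    profile = relative_abundance(counts)
    scores = {label: pearson(profile, ref) for label, ref in reference.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical toy cohort: each vector holds read counts for three made-up taxa.
COHORT = {
    "healthy": [[90, 5, 5], [80, 10, 10]],
    "cancer_type_A": [[10, 80, 10], [20, 70, 10]],
    "cancer_type_B": [[5, 15, 80], [10, 10, 80]],
}
REFERENCE = build_reference(COHORT)
```

Calling `diagnose([12, 75, 13], REFERENCE)` assigns the new sample to the group whose mean profile it most resembles. A real cohort would need many more taxa and samples, and the similarity measure would be a design choice to validate.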

[0013] In embodiments, the invention provides a method of diagnosing the bodily location of disease, wherein the disease is cancer, wherein the location of origin is the bone (acute myelogenous leukemia, sarcoma), the adrenal glands, the bladder, the brain, the breast, the cervix, the gallbladder, the colon, the esophagus, the neck (head and neck squamous cell carcinoma), the kidney, the liver, the lung, the lymph nodes (diffuse large B-cell lymphoma), the skin, the ovary, the prostate, the rectum, the stomach, the thyroid, and the uterus, and wherein the subject is human.

[0014] In embodiments, the invention provides a method of diagnosing disease, wherein the disease is cancer, wherein the cancer is leukemia (acute myelogenous), adrenocortical cancer, bladder cancer, brain cancer (lower grade glioma; glioblastoma), breast cancer, cervical cancer, cholangiocarcinoma, colon cancer, esophageal cancer, head and neck cancer, kidney cancer (chromophobe; renal clear cell carcinoma; papillary cell carcinoma), liver cancer, lung cancer (adenocarcinoma; squamous cell carcinoma), lymphoid neoplasm diffuse large B-cell lymphoma, melanoma (skin cutaneous melanoma, uveal melanoma), ovarian cancer, prostate cancer, rectum cancer, sarcoma, stomach cancer, thyroid cancer (thyroid carcinoma, thymoma), and uterine cancer, and wherein the subject is human.

[0015] In embodiments, the invention provides a method of diagnosing disease, further comprising diagnosis of the stage of the disease, wherein the disease is cancer.

[0016] In embodiments, the invention provides a method of diagnosing disease when the disease is at low pathologic stage, wherein the disease is cancer, wherein the pathologic stage is stage I or stage II.

[0017] In embodiments, the invention provides a method of predicting the molecular features of the mammalian disease using non-mammalian features, wherein the mammalian disease is cancer, wherein the molecular features are mutation statuses.

[0018] In embodiments, the invention provides a method of predicting which subjects will respond or will not respond to a particular treatment for disease, wherein the disease is cancer, wherein the subject is human, wherein the treatment is immunotherapy, wherein the immunotherapy is a PD-1 blockade (e.g. nivolumab, pembrolizumab).

[0019] In embodiments, the invention provides a method of diagnosing disease, further comprising treating the disease in the subject based on the identified non-mammalian features of the disease, wherein the disease is cancer, wherein the non-mammalian features are microbial, wherein the subject is human.

[0020] In embodiments, the invention provides a method of diagnosing disease, further comprising designing a new treatment to treat the mammalian disease in the subject based on its non-mammalian features, wherein the disease is cancer, wherein the non-mammalian features are microbial, wherein the subject is human.

[0021] In embodiments, new treatments may be designed to target and exploit the non-mammalian features identified in the mammalian disease using one or more of the following modalities: small molecules, biologics, engineered host-derived cell types, probiotics, engineered bacteria, natural-but-selective viruses, engineered viruses, and bacteriophages.

[0022] In embodiments, the invention provides a method of diagnosing disease, further comprising longitudinal monitoring of its non-mammalian features to indicate response to treating the disease, wherein the disease is cancer, wherein the non-mammalian features are microbial, wherein the subject is human.

[0023] In embodiments, the invention provides a kit to measure the microbial presence or abundance in the specified tissue samples, thereby permitting diagnosis of the disease.

[0024] In embodiments, the invention utilizes a diagnostic model based on a machine learning architecture.

[0025] In embodiments, the invention utilizes a diagnostic model based on a regularized machine learning architecture.

[0026] In embodiments, the invention utilizes a diagnostic model based on an ensemble of machine learning architectures.
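Paragraphs [0024] to [0026] name a machine learning model, a regularized model, and an ensemble without fixing a particular algorithm. The sketch below is one possible illustration under stated assumptions: an L2-regularized logistic regression fitted by gradient descent, combined into an ensemble by averaging models trained on stratified bootstrap resamples. The algorithm choice, all function names, the hyperparameters, and the toy two-taxon data are assumptions made for illustration only.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, l2=0.01, lr=0.5, epochs=500):
    """Fit L2-regularized logistic regression by batch gradient descent.

    The penalty l2 shrinks the weights (not the bias), which is the
    'regularized' part of the model. Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Probability of the positive class (here: disease present)."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def stratified_bootstrap(X, y, rng):
    """Resample with replacement within each class so both classes remain."""
    Xb, yb = [], []
    for label in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == label]
        for _ in idx:
            k = rng.choice(idx)
            Xb.append(X[k])
            yb.append(label)
    return Xb, yb

def train_ensemble(X, y, n_models=5, seed=0, **kwargs):
    """Train several regularized models, each on its own bootstrap resample."""
    rng = random.Random(seed)
    return [train_logistic(*stratified_bootstrap(X, y, rng), **kwargs)
            for _ in range(n_models)]

def ensemble_proba(models, x):
    """Ensemble prediction: average of the member probabilities."""
    return sum(predict_proba(m, x) for m in models) / len(models)

# Hypothetical toy data: relative abundance of two made-up taxa per sample;
# label 1 = disease, 0 = healthy.
X = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.7, 0.3],
     [0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.3, 0.7]]
Y = [1, 1, 1, 1, 0, 0, 0, 0]
MODELS = train_ensemble(X, Y, n_models=5, seed=0, l2=0.01)
```

In this toy setting, `ensemble_proba(MODELS, [0.95, 0.05])` comes out above 0.5 and `ensemble_proba(MODELS, [0.05, 0.95])` below it. Averaging over resampled models is only one way to form an ensemble; voting or stacking different model families would fit the same paragraph.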

[0027] In embodiments, the invention identifies and selectively removes certain non-mammalian features as contaminants termed noise, while selectively retaining other non-mammalian features as non-contaminants termed signal, wherein non-mammalian features are microbial.
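Paragraph [0027] does not state how contaminants are told apart from signal. As a hedged illustration only, the sketch below assumes that negative-control samples (for example reagent blanks) are available and flags a taxon as noise when it is at least as prevalent in the controls as in the biological samples. This prevalence rule, the function names, and the taxon label `reagent_taxon` are hypothetical and are not described in the disclosure.

```python
def prevalence(samples, taxon):
    """Fraction of samples in which the taxon has at least one read."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.get(taxon, 0) > 0) / len(samples)

def split_signal_noise(biological, controls):
    """Partition taxa into (signal, noise) lists.

    Assumed rule: a taxon seen at least as often in negative controls as in
    biological samples is treated as a contaminant ("noise").
    """
    taxa = sorted({t for s in biological + controls for t in s})
    signal, noise = [], []
    for t in taxa:
        if prevalence(controls, t) >= prevalence(biological, t):
            noise.append(t)
        else:
            signal.append(t)
    return signal, noise

def keep_signal(sample, signal):
    """Drop noise taxa from one sample, retaining only signal taxa."""
    return {t: c for t, c in sample.items() if t in signal}

# Hypothetical toy data: per-sample read counts keyed by taxon name.
BIOLOGICAL = [
    {"Fusobacterium": 40, "reagent_taxon": 5},
    {"Fusobacterium": 12, "reagent_taxon": 7},
    {"Fusobacterium": 0, "reagent_taxon": 3},
]
CONTROLS = [
    {"reagent_taxon": 9},
    {"reagent_taxon": 4, "Fusobacterium": 0},
]
SIGNAL, NOISE = split_signal_noise(BIOLOGICAL, CONTROLS)
```

Here `reagent_taxon` appears in every control and is removed, while `Fusobacterium` never appears in a control and is retained. A practical filter would likely add a statistical test and account for sequencing depth.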

[0028] In embodiments, the invention provides a method of diagnosing disease wherein the microbes are of viral, bacterial, archaeal, and/or fungal origin.

[0029] In embodiments, the invention provides a method of diagnosing disease wherein microbial presence or abundance information is combined with additional information about the host (subject) and/or the host’s (subject’s) cancer to create a diagnostic model that has greater predictive performance than only having microbial presence or abundance information alone.

[0030] In embodiments, the diagnostic model utilizes information in combination with microbial presence or abundance information from one or more of the following sources: cell-free tumor DNA, cell-free tumor RNA, exosomal-derived tumor DNA, exosomal-derived tumor RNA, circulating tumor cell derived DNA, circulating tumor cell derived RNA, methylation patterns of cell-free tumor DNA, methylation patterns of cell-free tumor RNA, methylation patterns of circulating tumor cell derived DNA, and/or methylation patterns of circulating tumor cell derived RNA.

[0031] In embodiments, microbial presence or abundance is detected by nucleic acid detection of one or more of the following methods: targeted microbial sequencing (e.g. 16S rRNA sequencing, 18S rRNA ITS sequencing), ecological shotgun sequencing, quantitative polymerase chain reaction (qPCR), immunohistochemistry (IHC), in situ hybridization (ISH), flow cytometry, host whole genome sequencing, host transcriptomic sequencing, cancer whole genome sequencing, and cancer transcriptomic sequencing.

[0032] In embodiments, the geospatial distribution of microbial presence or absence is measured in the cancer tissue of the host by one or more of the following methods: multisampling of the tumor tissue and/or its microenvironment, IHC, ISH, digital spatial genomics, digital spatial transcriptomics.

[0033] In embodiments, the microbial nucleic acids are detected simultaneously with nucleic acids from the host and subsequently distinguished.

[0034] In embodiments, the host nucleic acids are selectively depleted and the microbial nucleic acids are selectively retained prior to measurement (e.g. sequencing) of a combined nucleic acid pool.

[0035] In embodiments, the invention provides that the tissue is blood, a constituent of blood (e.g. plasma), or a tissue biopsy, wherein the tissue biopsy may be malignant or non-malignant.

[0036] In embodiments, the microbial presence or abundance of the cancer is determined by measuring microbial presence or abundance in other locations of the host.

BRIEF DESCRIPTION OF THE DRAWINGS

[0037] Figures 1A-1D: Fig. 1A (left) shows the total percentage of sequencing reads identified as “microbial” by the bioinformatic microbial detection pipeline across 33 cancer types and over 10,000 patients in The Cancer Genome Atlas (TCGA), as well as the percentage of microbial reads retained when summarizing to the genus taxonomy level (right). Figs. 1B-1C show a principal component analysis (PCA) on normalized (i.e. approximately normal in its distribution) but not batch corrected microbial abundances (1B), as well as normalized and batch corrected microbial abundances (1C). The legend shows that the data were derived from eight sequencing centers in total. Fig. 1D shows the results of a principal variance component analysis (PVCA) before and after batch correction to estimate the amount of microbial variance (“signal”) attributed across each major metadata variable in the dataset. Fold-increases and fold-decreases are shown above the major metadata variables that changed during the batch correction process.

[0038] Figures 2A-2F: In Fig. 2A, patients that were clinically evaluated for HPV-infected cervical squamous cell carcinoma and endocervical adenocarcinoma were examined for differential abundance of the Alphapapillomavirus genus in their tumors and matched blood samples. Primary tumor samples are compared as a positive control and blood derived normal samples are compared as a negative control. In Fig. 2B, patients that were clinically evaluated for HPV-infected head and neck squamous cell carcinoma (TCGA-HNSCC; primary tumor samples) were compared for differential abundance of the Alphapapillomavirus genus using both in situ hybridization (ISH) and immunohistochemistry (IHC) assays (p16). In Fig. 2C, patients with stomach adenocarcinoma that were assigned integrative molecular subtypes by The Cancer Genome Atlas Research Network and those in the Epstein-Barr virus (EBV) subtype were examined for selective overabundance of the EBV genus (i.e. Lymphocryptovirus). Blood derived normal and solid tissue normal samples are shown as negative controls. Other molecular subtypes of STAD: CIN = chromosomal instability; GS = genome stable; MSI = microsatellite unstable. In Fig. 2D, patients with clinically adjudicated risk factors for liver hepatocellular cancer were plotted against the normalized abundance of the Orthohepadnavirus genus to examine selective overabundance of the Orthohepadnavirus genus in patients with a history of hepatitis B infection. “EtOH” denotes heavy alcohol consumption as a prior risk factor while “Hep C” denotes prior hepatitis C infection. Blood derived normal samples are shown as negative controls; solid tissue normals reveal high viral loads of hepatitis B. In Fig. 2E, common gastrointestinal cancers were evaluated for differential abundances of the Fusobacterium genus, as associated in the literature. Blood derived normals and solid tissue normals are shown for comparative negative controls. In Fig. 2F, abundances of the Fusobacterium genus were examined between gastrointestinal tract (GI-tract) cancers and non-GI-tract cancers. The following cancers were included in the GI-tract group: colon adenocarcinoma, rectum adenocarcinoma, cholangiocarcinoma, liver hepatocellular carcinoma, pancreatic adenocarcinoma, head and neck squamous cell carcinoma, esophageal carcinoma, and stomach adenocarcinoma. The remaining cancer types in Table 1 were placed in the non-GI-tract cancers with the exception of acute myeloid leukemia, which was excluded from this analysis. Fusobacterium abundance from adjacent non-malignant tissue is included from both groups as a negative control. For all figures: The y-axis shows normalized microbial abundances on a log2 scale; significance testing was performed using a two-sided Mann-Whitney test for all comparisons; symbols are as follows: **** for p-values <= 0.0001, *** for p-values <= 0.001, ** for p-values <= 0.01, * for p-values <= 0.05, and “ns” for not significant.
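The significance convention in this caption can be illustrated with a minimal sketch. It assumes a normal approximation to the two-sided Mann-Whitney test without tie or continuity correction, which is a simplification and not necessarily the procedure used to generate the figures; the function names are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_two_sided(x, y):
    """Return (U, p) for a two-sided Mann-Whitney test.

    Assumption: normal approximation with no tie or continuity correction,
    so the p-value is only approximate for small or heavily tied samples.
    """
    n1, n2 = len(x), len(y)
    # U counts pairs where a value in x exceeds a value in y; ties count one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = erfc(abs(z) / sqrt(2))  # two-sided tail probability of a standard normal
    return u, p


def significance_symbol(p):
    """Map a p-value to the symbols listed in the caption."""
    for cutoff, symbol in ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p <= cutoff:
            return symbol
    return "ns"
```

For example, `significance_symbol(mann_whitney_two_sided(tumor, blood)[1])` would label a comparison between two hypothetical lists of log2 abundances.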

[0039] Figure 3: The distribution of Alphapapillomavirus genus abundance across 32 cancer types and 3 sample types (solid tissue normal, blood derived normal, and primary tumor tissues). For cancer types that had patients who were clinically adjudicated for HPV infection, the cancer types are split into groups that either tested “Positive” or “Negative” for HPV infection. The dotted lines are the average abundance values for all patients that tested “Negative” within each sample type.

[0040] Figures 4A-4F: Whole transcriptome data (RNA-Seq) collected by Hugo et al. (2016; Science, PMID: 26997480) on patients prior to receiving anti-PD-1 immunotherapy (pembrolizumab or nivolumab) were explored for microbial RNA reads. Fig. 4A shows the principal co-ordinate analysis for patients with complete response (CR) versus those with progressive disease (PD). “Adonis” denotes a PERMANOVA test for significant separation between the two centroids of the groups. Fig. 4B shows the distances of each patient to his or her respective centroid (i.e. CR or PD), which is a measure of beta-diversity, namely that patients with CR have distinguishably lower beta dispersion than those with PD. “Betadisper Perm Test” denotes a permutation test to discern if the beta dispersion is significantly different between the groups. Fig. 4C shows the principal co-ordinate analysis for patients with complete response (CR) versus those with partial response (PR). “Adonis” denotes a PERMANOVA test for significant separation between the two centroids of the groups. Fig. 4D shows the distances of each patient to his or her respective centroid (i.e. CR or PR), which is a measure of beta-diversity, namely that patients with CR have distinguishably lower beta dispersion than those with PR. “Betadisper Perm Test” denotes a permutation test to discern if the beta dispersion is significantly different between the groups. Fig. 4E shows the ROC and PR curves (i.e. machine learning model performance) for predicting microsatellite instability in TCGA colon adenocarcinoma samples solely using microbial DNA or RNA abundances. These performances are based on a randomly selected, 30% holdout test set after the model was trained on 70% of the data and internally parameterized using k-fold cross validation of the training data. Fig. 4F shows the ROC and PR curves for predicting which TCGA breast cancer samples are triple negative or not. These performances are based on a randomly selected, 30% holdout test set after the model was trained on 70% of the data and internally parameterized using k-fold cross validation of the training data.
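The distance-to-centroid quantity plotted in Figs. 4B and 4D can be sketched as follows. This is an illustration only: it assumes the principal co-ordinates have already been computed, uses Euclidean distance in that space, and omits the permutation test; the coordinates and variable names are hypothetical.

```python
from math import dist
from statistics import mean


def distances_to_centroid(points):
    """Euclidean distance of each sample to the centroid of its own group."""
    centroid = tuple(mean(axis) for axis in zip(*points))
    return [dist(tuple(p), centroid) for p in points]


# Hypothetical two-axis ordination coordinates, for illustration only.
group_cr = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2)]
group_pd = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

# A tighter group has a smaller average distance to its centroid,
# i.e. lower beta dispersion.
dispersion_cr = mean(distances_to_centroid(group_cr))
dispersion_pd = mean(distances_to_centroid(group_pd))
```

A permutation test would then shuffle the group labels many times and compare the observed difference in dispersion against the shuffled differences.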

[0041] Figures 5A-5F: ROC and PR curves for the following cancer types: Adrenocortical carcinoma, bladder urothelial carcinoma. Exemplar arrows are given in the first ROC and PR plots and point to respective extrema locations on the plots for a given probability cutoff threshold of 1.0 or 0.0; the rest of the probability cutoff threshold spectrum, as well as their respective ROC or PR points, span proportionately between the two points on the plots that are indicated by the arrows. Abbreviations are as follows: “PT” denotes “Primary Tumor”, “BDN” denotes “Blood Derived Normal”, and “STN” denotes “Solid Tissue Normal”. For “PT” and “BDN” labeled figures, predictions were done in a one-cancer-type-versus-all-others fashion; for “PT vs STN” labeled figures, predictions were done to discriminate primary tumor tissue versus adjacent solid tissue normal within a given cancer type. All prediction performances were generated on a randomly selected, 30% holdout test set after the respective model was trained on the remaining 70% of the data for a given comparison; during model training, k-fold cross validation was employed to tune the model parameters. Additionally, in cases of class imbalance, the minority class was up-sampled to promote model generalization.
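The evaluation scheme in this caption can be sketched in a few lines. The sketch below is a simplified illustration, not the pipeline used for the figures: it assumes a class-stratified split (the caption only says "randomly selected"), up-samples the minority class in the training portion only, omits k-fold tuning, and all function names are hypothetical. The last function shows why probability cutoffs of 0.0 and 1.0 land on the extreme points of the ROC and PR plots.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test, holding out ~30% of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def upsample_minority(train_indices, labels, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in train_indices:
        by_class[labels[index]].append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


def roc_pr_point(scores, labels, threshold):
    """Return (FPR, TPR, precision) when score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # Convention assumed here: precision is 1.0 when nothing is called positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    return fp / negatives, tp / positives, precision


labels = [0] * 90 + [1] * 10  # imbalanced toy labels (1 = cancer type of interest)
train, test = stratified_holdout(labels)
balanced_train = upsample_minority(train, labels)
```

Up-sampling only the training indices keeps the 30% holdout set untouched, so reported performance still reflects the original class balance.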

[0042] Figures 6A-6F: ROC and PR curves for the following cancer types: Bladder urothelial carcinoma, brain lower grade glioma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0043] Figures 7A-7F: ROC and PR curves for the following cancer types: Breast invasive carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0044] Figures 8A-8F: ROC and PR curves for the following cancer types: Cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0045] Figures 9A-9F: ROC and PR curves for the following cancer types: Colon adenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0046] Figures 10A-10F: ROC and PR curves for the following cancer types: Esophageal carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0047] Figures 11A-11F: ROC and PR curves for the following cancer types: Glioblastoma multiforme, head and neck squamous cell carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0048] Figures 12A-12F: ROC and PR curves for the following cancer types: Head and neck squamous cell carcinoma, kidney chromophobe. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0049] Figures 13A-13F: ROC and PR curves for the following cancer types: Kidney chromophobe, kidney renal clear cell carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0050] Figures 14A-14F: ROC and PR curves for the following cancer types: Kidney renal papillary cell carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0051] Figures 15A-15F: ROC and PR curves for the following cancer types: Liver hepatocellular carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0052] Figures 16A-16F: ROC and PR curves for the following cancer types: Lung adenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0053] Figures 17A-17F: ROC and PR curves for the following cancer types: Lung squamous cell carcinoma, lymphoid neoplasm diffuse large B-cell lymphoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0054] Figures 18A-18F: ROC and PR curves for the following cancer types: Mesothelioma, ovarian serous cystadenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0055] Figures 19A-19F: ROC and PR curves for the following cancer types: Pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0056] Figures 20A-20F: ROC and PR curves for the following cancer types: Prostate adenocarcinoma, rectum adenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0057] Figures 21A-21F: ROC and PR curves for the following cancer types: Rectum adenocarcinoma, sarcoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0058] Figures 22A-22F: ROC and PR curves for the following cancer types: Skin cutaneous melanoma, stomach adenocarcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0059] Figures 23A-23F: ROC and PR curves for the following cancer types: Stomach adenocarcinoma, testicular germ cell tumors. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0060] Figures 24A-24F: ROC and PR curves for the following cancer types: Thymoma, thyroid carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0061] Figures 25A-25F: ROC and PR curves for the following cancer types: Thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0062] Figures 26A-26F: ROC and PR curves for the following cancer types: Uterine corpus endometrial carcinoma, uveal melanoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0063] Figures 27A-27B: ROC and PR curves for the following cancer types: Uveal melanoma. Abbreviations are given in the caption for Figs. 5A-5F. Model performances were generated the same way as described in the caption for Figs. 5A-5F.

[0064] Figure 28: Fig. 28A shows one embodiment of a decontamination pipeline, which strives to identify and subsequently remove contaminating microbes (“noise”) while retaining non-contaminating microbes (“signal”) from primary surgical resection of the tissue through nucleic acid sequencing and data analysis. Figs. 28B and 28C show the comparative model performances as areas under ROC and PR curves, respectively, on models built on full (“non-decontaminated”) data and on decontaminated data. A linear regression with a gray standard error bar ribbon is shown of the data points; a diagonal line is shown to denote what perfect (1:1) correspondence would be between the two sets of model performances. In this particular embodiment, microbial taxonomies that were suspected to be contaminants by the decontamination pipeline (cf. Fig. 28A) were entirely removed prior to model building and testing. As before, the models were built and tested as described in Figs. 5A-5F, namely that the predictions were one-cancer-type-versus-all-others using either “Primary Tumor” or “Blood Derived Normal” tissues. Model performances were generated on randomly selected, 30% holdout test sets after training the model on the remaining 70% of the data with internal k-fold cross validation for model parameterization.
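The removal step described for this embodiment, in which every taxon flagged as a suspected contaminant is dropped before model building, can be illustrated as below. This sketch does not show how contaminants are identified; the sample names, the placeholder taxon "taxon_x", the counts, and the function names are all hypothetical.

```python
def remove_contaminants(counts, suspected):
    """Drop every taxon flagged as a suspected contaminant from each sample."""
    return {
        sample: {taxon: n for taxon, n in taxa.items() if taxon not in suspected}
        for sample, taxa in counts.items()
    }


def fraction_reads_retained(before, after):
    """Share of all reads that survive the filtering step."""
    total_before = sum(sum(taxa.values()) for taxa in before.values())
    total_after = sum(sum(taxa.values()) for taxa in after.values())
    return total_after / total_before


# Hypothetical per-sample read counts by genus, for illustration only.
counts = {
    "sample_1": {"Fusobacterium": 40, "taxon_x": 60},
    "sample_2": {"Fusobacterium": 10, "taxon_x": 90},
}
cleaned = remove_contaminants(counts, suspected={"taxon_x"})
retained = fraction_reads_retained(counts, cleaned)
```

Models trained on `counts` and on `cleaned` could then be compared on the same holdout samples, as in Figs. 28B and 28C.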

[0065] Figures 29A-29I: Fig. 29A shows one embodiment of validating the model performances observed in Figs. 5A-27B. Specifically, before normalization and batch correction, the raw microbial count data were split in half in a stratified manner. Each raw data half was then processed through the normalization and batch correction pipelines prior to machine learning model building. In this case, the machine learning model that was built on the first half was tested on the second half, and vice versa. The resultant model performances were compared to building a model on 50% of the full, non-subsetted, normalized, batch corrected data and then subsequently testing on the remaining 50% of the full, non-subsetted, normalized, batch corrected data. Area under the curve values for ROC and PR curves are shown and labeled in the heatmap with each row being (and labeled as) a distinct TCGA cancer type (see Table 1 for abbreviations). Figs. 29B and 29C show comparative model performance (ROC and PR curve areas) between models that were built to discriminate between one cancer type versus all others using both DNA and RNA (“full data”) or just RNA. All microbial DNA and/or RNA came from primary tumors in TCGA and each data point is respectively labeled with a TCGA cancer type. Model performance was generated by applying the trained model on a randomly selected, 30% holdout test set. Figs. 29D and 29E show comparative model performance (ROC and PR curve areas) between models that were built to discriminate between one cancer type versus all others using both DNA and RNA (“full data”) or just DNA. All microbial RNA and/or DNA came from primary tumors in TCGA and each data point is respectively labeled with a TCGA cancer type. Model performance was generated by applying the trained model on a randomly selected, 30% holdout test set. Figs. 29F and 29G show comparative model performance (ROC and PR curve areas) between models that were built to discriminate between one cancer type versus all others using sequencing data from all eight TCGA sequencing centers (“full data”) or just from the University of North Carolina (UNC). Notably, all sequencing data from UNC was only RNA (RNA-Seq), so this comparison eliminates possible variation due to incorporating multiple sequencing centers and experimental types. All microbial DNA and/or RNA came from primary tumors in TCGA and each data point is respectively labeled with a TCGA cancer type. Model performance was generated by applying the trained model on a randomly selected, 30% holdout test set. Figs. 29H and 29I show comparative model performance (ROC and PR curve areas) between models that were built to discriminate between one cancer type versus all others using sequencing data from all eight TCGA sequencing centers (“full data”) or just from the Harvard Medical School (HMS). Notably, all sequencing data from HMS was only DNA (Whole Genome Sequencing, WGS), so this comparison eliminates possible variation due to incorporating multiple sequencing centers and experimental types. All microbial RNA and/or DNA came from primary tumors in TCGA and each data point is respectively labeled with a TCGA cancer type. Model performance was generated by applying the trained model on a randomly selected, 30% holdout test set.
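The split-half validation of Fig. 29A can be sketched as follows. The point of the design is that each half is normalized and batch corrected on its own, so nothing learned from one half leaks into the other before cross-testing. The sketch is an assumption-laden outline: `process`, `fit`, and `evaluate` are hypothetical callables standing in for the normalization and batch correction pipeline, model training, and scoring, and the labels are illustrative.

```python
import random
from collections import defaultdict


def stratified_halves(labels, seed=0):
    """Split sample indices into two halves with similar class composition."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    first, second = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        middle = len(indices) // 2
        first.extend(indices[:middle])
        second.extend(indices[middle:])
    return sorted(first), sorted(second)


def cross_test(first, second, process, fit, evaluate):
    """Process each half independently, then train on one and test on the other."""
    half_a, half_b = process(first), process(second)
    return evaluate(fit(half_a), half_b), evaluate(fit(half_b), half_a)


# Illustrative labels only; any per-sample class labels would do.
labels = ["BRCA"] * 6 + ["LUAD"] * 4
first_half, second_half = stratified_halves(labels)
```

The two scores returned by `cross_test` would then be compared against a model trained and tested on halves of the jointly processed data.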

[0066] Figures 30A-30J: The mutation status of each of the five most frequently mutated genes in TCGA (TP53, PTEN, PIK3CA, ARID1A, APC) is predicted solely from intratumoral microbial DNA and RNA abundances. The areas under the ROC and PR curves are shown on each respective plot.

[0067] Figure 31: For benchmarking purposes, all patients with stage I and stage II cancers in TCGA were explored for discriminative performance between cancer types solely using microbial DNA identified in their matched blood samples. Models were built and tested as previously described: 70% of the data (randomly selected) were used for training discriminative models with internal k-fold cross validation for model tuning and final performance values were generated on the remaining, held-out 30% of the data; predictions were one-cancer-type-versus-all-others solely using microbial DNA. Additionally, model performance was compared across three levels of decontamination stringency, which resulted in models being built on four distinct datasets with varying proportions of original microbes being removed; for example, in the “Most Stringent Filtering” embodiment, over 90% of the original reads and taxa were discarded. One skilled in the art will recognize that there are many possible variations of decontamination stringency that are employable here and that model performance may be improved or worsened by shifting that stringency level higher or lower.

[0068] Figures 32A-32C: For a conservative, comparative analysis against existing cell-free tumor DNA (ctDNA) assays, all TCGA patients containing at least one mutation in their tumor that was examined by two commercial ctDNA assays (GUARDANT360, FOUNDATIONONE Liquid) were removed. The remaining patients, whose cancers thus cannot be detected under any circumstances using these two commercial ctDNA assays, had microbial DNA extracted from their matched blood samples in TCGA. Using this microbial DNA, machine learning models were subsequently trained and tested to predict one cancer type versus all others; as before, performance was generated based on applying the model to a randomly selected, 30% holdout test set. The resultant model performances for patients without any detectable genomic alterations on the GUARDANT360 ctDNA panel are shown in Fig. 32A; similarly, model performances for patients without any detectable genomic alterations on the FOUNDATIONONE Liquid ctDNA panel are shown in Fig. 32B. The exact list of genomic alterations examined by these commercial ctDNA assay panels is given in Fig. 32C.
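The cohort construction described here amounts to excluding every patient whose tumor carries at least one alteration covered by a given panel. A minimal sketch, assuming alterations are represented as simple string identifiers; the patient IDs, the identifiers, and the panel contents below are hypothetical and do not reproduce the actual panels listed in Fig. 32C.

```python
def patients_without_panel_alterations(patient_alterations, panel):
    """Return patients whose tumors carry none of the alterations on the panel."""
    panel = set(panel)
    return sorted(
        patient
        for patient, alterations in patient_alterations.items()
        if not panel & set(alterations)
    )


# Hypothetical data for illustration only.
patient_alterations = {
    "patient_1": {"TP53"},
    "patient_2": {"gene_x"},
    "patient_3": set(),
    "patient_4": {"KRAS", "gene_x"},
}
hypothetical_panel = {"TP53", "KRAS"}
remaining = patients_without_panel_alterations(patient_alterations, hypothetical_panel)
```

Only the remaining patients' matched blood samples would then be used to train and test the one-cancer-type-versus-all-others models.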

[0069] Figures 33A-33B: A website was developed to host and display the microbial presence and abundance information across dozens of cancer types in TCGA (Fig. 33A), as well as to show the discriminatory performance of models in one-cancer-type-versus-all-others and tumor-vs-normal comparisons and their ranked microbial features (Fig. 33B).

DETAILED DESCRIPTION

[0070] The invention provides, in embodiments, a method to accurately diagnose human cancer, its subtypes, and its likelihood of therapy response using nucleic acids of non-human origin from a human tissue biopsy, malignant or non-malignant, or a blood-derived sample. It does this by identifying specific patterns of microbial nucleic acids and their presence or abundances ('a signature') within the sample to assign a certain probability that the sample (1) originated from a tumor rather than a 'normal' tissue site (e.g. the sample was a surgically resected solid tissue biopsy); (2) that the individual has cancer (e.g. the sample came from a typical blood draw with or without the intention to diagnose cancer); (3) that the individual has a cancer from a particular body site (e.g. the sample came from a typical blood draw with or without the intention to diagnose cancer); (4) that the individual has a particular type of cancer (e.g. a patient with suspected cancer has a blood draw taken to quickly diagnose which cancer it may be instead of doing radiation-based imaging studies [e.g. PET-CT] or other costly imaging studies [e.g. MRI]; alternatively, a tissue biopsy of a newly found tumor lesion may be taken and the microbial ‘signature’ may be indicative of what kind of cancer type it is); (5) that a cancer, which may or may not be diagnosed at the time, has a high or low likelihood of responding to a particular cancer therapy (e.g. a tissue biopsy of a suspected tumor lesion is taken, for which a microbial ‘signature’ provides a prediction of whether the patient will respond to therapy or not; alternatively, a blood sample from the same patient may be used, for which a microbial ‘signature’ may predict the immunogenicity of a patient’s tumor); (6) that a cancer, which may or may not be diagnosed at the time, is found to harbor microbial features (e.g. microbial antigens) that can be targeted for developing a personalized therapeutic to treat the subject’s cancer (e.g. a solid tissue biopsy reveals unique microbial neoantigens in the tumor tissue that can be used to develop a personalized cancer vaccine for the subject). Other uses for such methods are reasonably imaginable and readily implementable to those skilled in the art.

[0071] The invention is novel, in part, because it uses nucleic acids of non-human origin to diagnose a condition (i.e. cancer) that has been traditionally thought to be a disease of the human genome. It is better than a typical pathology report because it does not necessarily rely upon observed tissue structure, cellular atypia, or any other subjective measure traditionally used to diagnose cancer. It also has much better sensitivity by focusing solely on microbial sources rather than modified human (i.e. cancerous) sources, which are modified often at extremely low frequencies in a background of ‘normal’ human sources. It can be done using either solid tissue or blood derived samples, the latter of which requires minimal sample preparation and is minimally invasive. It can also predict response to therapies that remain challenging to prognose, including distinguishing ‘complete responders’ to immunotherapy versus subjects who will experience ‘progressive disease’. In certain circumstances, it can further provide information about host molecular aberrations and processes, such as mutation status of a subject’s cancer. The blood-based assay additionally does not deal with the same challenges posed by circulating tumor DNA (ctDNA) assays, which can have sensitivity issues due to cell-free DNA (cfDNA) that originates from non-malignant human cells. Moreover, based on data presented in Figs. 5A-27B, the blood-based microbial assay can distinguish between cancer types, which ctDNA assays most often cannot do, since most common cancer genomic aberrations are shared between cancer types (e.g. TP53 mutations, KRAS mutations). By constraining the size of the signatures, using methods that will be familiar to someone knowledgeable in the art (e.g. regularized machine learning), the microbial assays can be made clinically available through the use of e.g. multiplexed qPCR, ISH, or table-top sequencers (e.g. MinION, MiniSeq).

[0072] The machine learning models herein containing the microbial signatures can be deployed on real-time sequencing data or retrospective sequencing data. The signatures themselves were developed originally from data that were generated to sequence host nucleic acids and that also contained microbial features that were not analyzed at the time (i.e. human whole genome sequencing and RNA-Seq). These include sequencing studies performed on over 17,000 samples, over 10,000 patients, and several dozens of cancer types from patients in geographically diverse regions. However, the input data for these models can also be derived from targeted metagenomic studies if so desired (e.g. 16S rRNA sequencing, shotgun sequencing). Moreover, such microbial presence or abundance information may be combined with host nucleic acid information to improve the predictive performance of these models in practice. Reduced to practice, this may or may not include doing the following (i.e. other examples are possible and will be anticipated by those skilled in the art):

Taking a blood sample from a patient during a routine clinic visit;

Removing an aliquot of that blood sample, extracting the nucleic acids within, and amplifying the sequences for specific microbial genes that are indicative of microbial taxonomy (e.g. V4 region of 16S rRNA gene);

Obtaining a digital read-out of the presence and/or abundance of these microbial sequences;

Normalizing the presence and/or abundance data on an adjacent computer or cloud computing infrastructure and feeding it into a previously trained machine learning model;

Reading out a prediction and a certain degree of confidence for how likely this sample (1) is associated with the presence or absence of cancer, (2) is associated with cancer of a particular type or bodily location, or (3) is associated with a high, intermediate, or low likelihood of response to a range of cancer therapies; and

Using that sample’s microbial information to continue training the machine learning model if additional information is later inputted by the user.
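The normalization and prediction steps in the list above can be sketched as follows. This is only an illustration under stated assumptions: the normalization shown (log2 counts-per-million with a pseudocount) and the logistic linear model are stand-ins, since the specification does not fix either choice, and the taxa, weights, intercept, and function names are hypothetical rather than taken from any trained model.

```python
from math import exp, log2


def normalize(counts, pseudocount=1.0):
    """log2 counts-per-million with a pseudocount (an assumed normalization)."""
    total = sum(counts.values())
    return {taxon: log2(n / total * 1e6 + pseudocount) for taxon, n in counts.items()}


def predict_probability(features, weights, intercept):
    """Probability from a hypothetical, previously trained logistic model."""
    z = intercept + sum(w * features.get(taxon, 0.0) for taxon, w in weights.items())
    return 1.0 / (1.0 + exp(-z))


# Hypothetical read-out and model parameters, for illustration only.
sample_counts = {"taxon_a": 900, "taxon_b": 100}
model_weights = {"taxon_a": 0.1, "taxon_b": -0.1}
model_intercept = -0.5

features = normalize(sample_counts)
probability = predict_probability(features, model_weights, model_intercept)
```

The returned probability plays the role of the "degree of confidence" in the read-out step; retraining with later-confirmed labels would update the weights and intercept.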

[0073] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

[0074] Unless defined otherwise, all technical and scientific terms and any acronyms used herein have the same meanings as commonly understood by one of ordinary skill in the art in the field of the invention. Although any methods and materials similar or equivalent to those described herein can be used in the practice of the present invention, the exemplary methods, devices, and materials are described herein.

[0075] The practice of the present invention will employ, unless otherwise indicated, conventional techniques of molecular biology (including recombinant techniques), microbiology, cell biology, biochemistry and immunology, which are within the skill of the art. Such techniques are explained fully in the literature, such as, Molecular Cloning: A Laboratory Manual, 2nd ed. (Sambrook et al., 1989); Oligonucleotide Synthesis (M. J. Gait, ed., 1984); Animal Cell Culture (R. I. Freshney, ed., 1987); Methods in Enzymology (Academic Press, Inc.); Current Protocols in Molecular Biology (F. M. Ausubel et al., eds., 1987, and periodic updates); PCR: The Polymerase Chain Reaction (Mullis et al., eds., 1994); Remington, The Science and Practice of Pharmacy, 20th ed., (Lippincott, Williams & Wilkins 2003), and Remington, The Science and Practice of Pharmacy, 22nd ed., (Pharmaceutical Press and Philadelphia College of Pharmacy at University of the Sciences 2012).

DEFINITIONS

[0076] To facilitate understanding of the invention, a number of terms and abbreviations as used herein are defined below as follows:

[0077] When introducing elements of the present invention or the preferred embodiment(s) thereof, the articles “a”, “an”, “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising”, “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.

[0078] The term “and/or” when used in a list of two or more items, means that any one of the listed items can be employed by itself or in combination with any one or more of the listed items. For example, the expression “A and/or B” is intended to mean either or both of A and B, i.e. A alone, B alone or A and B in combination. The expression “A, B and/or C” is intended to mean A alone, B alone, C alone, A and B in combination, A and C in combination, B and C in combination or A, B, and C in combination.

[0079] It is understood that aspects and embodiments of the invention described herein include “consisting” and/or “consisting essentially of” aspects and embodiments.

[0080] It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range. Values or ranges may also be expressed herein as “about,” from “about” one particular value, and/or to “about” another particular value. When such values or ranges are expressed, other embodiments disclosed include the specific value recited, from the one particular value, and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another embodiment. It will be further understood that there are a number of values disclosed therein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. In embodiments, “about” can be used to mean, for example, within 10% of the recited value, within 5% of the recited value, or within 2% of the recited value.

[0081] As used herein, “patient” or “subject” means a human or mammalian animal subject to be treated.

[0082] As used herein, the term “pharmaceutical composition” refers to a pharmaceutically acceptable composition, wherein the composition comprises a pharmaceutically active agent, and in some embodiments further comprises a pharmaceutically acceptable carrier. In some embodiments, the pharmaceutical composition may be a combination of pharmaceutically active agents and carriers.

[0083] As used herein, the term “pharmaceutically acceptable carrier” refers to an excipient, diluent, preservative, solubilizer, emulsifier, adjuvant, and/or vehicle with which demethylation compound(s) is administered. Such carriers may be sterile liquids, such as water and oils, including those of petroleum, animal, vegetable or synthetic origin, such as peanut oil, soybean oil, mineral oil, sesame oil and the like, polyethylene glycols, glycerine, propylene glycol or other synthetic solvents. Antibacterial agents such as benzyl alcohol or methyl parabens; antioxidants such as ascorbic acid or sodium bisulfite; chelating agents such as ethylenediaminetetraacetic acid; and agents for the adjustment of tonicity such as sodium chloride or dextrose may also be carriers. Methods for producing compositions in combination with carriers are known to those of skill in the art. In some embodiments, the language “pharmaceutically acceptable carrier” is intended to include any and all solvents, dispersion media, coatings, isotonic and absorption delaying agents, and the like, compatible with pharmaceutical administration. The use of such media and agents for pharmaceutically active substances is well known in the art. See, e.g., Remington, The Science and Practice of Pharmacy, 20th ed., (Lippincott, Williams & Wilkins 2003). Except insofar as any conventional media or agent is incompatible with the active compound, such use in the compositions is contemplated.

[0084] As used herein, “therapeutically effective” refers to an amount of a pharmaceutically active compound(s) that is sufficient to treat or ameliorate, or in some manner reduce the symptoms associated with diseases and medical conditions. When used with reference to a method, the method is sufficiently effective to treat or ameliorate, or in some manner reduce the symptoms associated with diseases or conditions. For example, an effective amount in reference to age-related eye diseases is that amount which is sufficient to block or prevent onset; or if disease pathology has begun, to palliate, ameliorate, stabilize, reverse or slow progression of the disease, or otherwise reduce pathological consequences of the disease. In any case, an effective amount may be given in single or divided doses.

[0085] As used herein, the terms “treat,” “treatment,” or “treating” embrace at least an amelioration of the symptoms associated with diseases in the patient, where amelioration is used in a broad sense to refer to at least a reduction in the magnitude of a parameter, e.g., a symptom associated with the disease or condition being treated. As such, “treatment” also includes situations where the disease, disorder, or pathological condition, or at least symptoms associated therewith, are completely inhibited (e.g., prevented from happening) or stopped (e.g., terminated) such that the patient no longer suffers from the condition, or at least the symptoms that characterize the condition.

[0086] “Amplification” refers to any known procedure for obtaining multiple copies of a target nucleic acid or its complement, or fragments thereof. The multiple copies may be referred to as amplicons or amplification products. Amplification, in the context of fragments, refers to production of an amplified nucleic acid that contains less than the complete target nucleic acid or its complement, e.g., produced by using an amplification oligonucleotide that hybridizes to, and initiates polymerization from, an internal position of the target nucleic acid. Known amplification methods include, for example, replicase-mediated amplification, polymerase chain reaction (PCR), reverse transcription polymerase chain reaction (RT-PCR), ligase chain reaction (LCR), strand-displacement amplification (SDA), and transcription-mediated or transcription-associated amplification. Amplification is not limited to the strict duplication of the starting molecule. For example, the generation of multiple cDNA molecules from RNA in a sample using reverse transcription (RT)-PCR is a form of amplification. Furthermore, the generation of multiple RNA molecules from a single DNA molecule during the process of transcription is also a form of amplification. During amplification, the amplified products can be labeled using, for example, labeled primers or by incorporating labeled nucleotides.

[0087] “Amplicon” or “amplification product” refers to the nucleic acid molecule generated during an amplification procedure that is complementary or homologous to a target nucleic acid or a region thereof. Amplicons can be double stranded or single stranded and can include DNA, RNA or both. Methods for generating amplicons are known to those skilled in the art.

[0088] “Codon” refers to a sequence of three nucleotides that together form a unit of genetic code in a nucleic acid.

[0089] “Codon of interest” refers to a specific codon in a target nucleic acid that has diagnostic or therapeutic significance (e.g. an allele associated with viral genotype/subtype or drug resistance).

[0090] “Complementary” or “complement thereof” means that a contiguous nucleic acid base sequence is capable of hybridizing to another base sequence by standard base pairing (hydrogen bonding) between a series of complementary bases. Complementary sequences may be completely complementary (i.e., no mismatches in the nucleic acid duplex) at each position in an oligomer sequence relative to its target sequence by using standard base pairing (e.g., G:C, A:T or A:U pairing), or sequences may contain one or more positions that are not complementary by base pairing (e.g., there exists at least one mismatch or unmatched base in the nucleic acid duplex), but such sequences are sufficiently complementary because the entire oligomer sequence is capable of specifically hybridizing with its target sequence in appropriate hybridization conditions (i.e., partially complementary). Contiguous bases in an oligomer are typically at least 80%, preferably at least 90%, and more preferably completely complementary to the intended target sequence.
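As a rough illustration of the 80% and 90% complementarity figures above, the following minimal Python sketch scores an oligomer against an equal-length target site. The function names, the DNA-only pairing table, and the example sequences are illustrative assumptions and are not taken from the application; real hybridization also depends on conditions such as temperature and ionic strength, which this position-by-position count ignores.

```python
# Watson-Crick partners for DNA (assumption: DNA only; RNA would pair A with U).
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(PAIRS[base] for base in reversed(seq.upper()))


def percent_complementarity(oligo: str, target_site: str) -> float:
    """Percent of oligo positions that base-pair with an equal-length target site.

    Both sequences are written 5'->3'. The oligo is compared, position by
    position, with the perfect antiparallel partner of the target site.
    """
    if len(oligo) != len(target_site):
        raise ValueError("oligo and target site must have the same length")
    perfect_partner = reverse_complement(target_site)
    matches = sum(a == b for a, b in zip(oligo.upper(), perfect_partner))
    return 100.0 * matches / len(oligo)


target = "ATGCCGTAAC"          # hypothetical 10-nt target site
perfect = "GTTACGGCAT"         # its exact reverse complement
one_mismatch = "GTTACGGCAA"    # last base changed

print(percent_complementarity(perfect, target))       # 100.0
print(percent_complementarity(one_mismatch, target))  # 90.0
```

Under this simple count, the one-mismatch oligomer would still meet the "at least 90%" level described above.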

[0091] “Configured to” or “designed to” denotes an actual arrangement of a nucleic acid sequence configuration of a referenced oligonucleotide. For example, a primer that is configured to generate a specified amplicon from a target nucleic acid has a nucleic acid sequence that hybridizes to the target nucleic acid or a region thereof and can be used in an amplification reaction to generate the amplicon. Also as an example, an oligonucleotide that is configured to specifically hybridize to a target nucleic acid or a region thereof has a nucleic acid sequence that specifically hybridizes to the referenced sequence under stringent hybridization conditions.

[0092] “Polymerase chain reaction” (PCR) generally refers to a process that uses multiple cycles of nucleic acid denaturation, annealing of primer pairs to opposite strands (forward and reverse), and primer extension to exponentially increase copy numbers of a target nucleic acid sequence. In a variation called RT-PCR, reverse transcriptase (RT) is used to make a complementary DNA (cDNA) from mRNA, and the cDNA is then amplified by PCR to produce multiple copies of DNA. There are many permutations of PCR known to those of ordinary skill in the art.
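The exponential increase in copy number can be made concrete with the standard idealized model, in which each cycle multiplies the number of copies by (1 + efficiency). The sketch below is a hedged illustration only: the function name and the default efficiency of 1.0 (perfect doubling) are assumptions, and real reactions fall below this ideal and eventually plateau.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized PCR yield: copies = initial * (1 + efficiency) ** cycles.

    efficiency = 1.0 means every template is duplicated in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


# Ten starting templates after 30 perfect cycles: 10 * 2**30 copies.
print(pcr_copies(10, 30))
# The same reaction at 90% per-cycle efficiency yields noticeably fewer copies.
print(pcr_copies(10, 30, efficiency=0.9))
```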

[0093] “Position” refers to a particular nucleotide or nucleotides in a nucleic acid sequence.

[0094] “Primer” refers to an enzymatically extendable oligonucleotide, generally with a defined sequence that is designed to hybridize in an antiparallel manner with a complementary, primer-specific portion of a target nucleic acid. A primer can initiate the polymerization of nucleotides in a template-dependent manner to yield a nucleic acid that is complementary to the target nucleic acid when placed under suitable nucleic acid synthesis conditions (e.g., a primer annealed to a target can be extended in the presence of nucleotides and a DNA/RNA polymerase at a suitable temperature and pH). Suitable reaction conditions and reagents are known to those of ordinary skill in the art. A primer is typically single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is generally first treated to separate its strands before being used to prepare extension products. The primer generally is sufficiently long to prime the synthesis of extension products in the presence of the inducing agent (e.g., polymerase). Specific length and sequence will be dependent on the complexity of the required DNA or RNA targets, as well as on the conditions of primer use such as temperature and ionic strength. Preferably, the primer is about 5-100 nucleotides. Thus, a primer can be, e.g., 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 or 100 nucleotides in length. A primer does not need to have 100% complementarity with its template for primer elongation to occur; primers with less than 100% complementarity can be sufficient for hybridization and polymerase elongation to occur. A primer can be labeled if desired. The label used on a primer can be any suitable label, and can be detected by, for example, spectroscopic, photochemical, biochemical, immunochemical, chemical, or other detection means. A labeled primer therefore refers to an oligomer that hybridizes specifically to a target sequence in a nucleic acid, or in an amplified nucleic acid, under conditions that promote hybridization to allow selective detection of the target sequence.

[0095] A primer nucleic acid can be labeled, if desired, by incorporating a label detectable by, e.g., spectroscopic, photochemical, biochemical, immunochemical, chemical, or other techniques. To illustrate, useful labels include radioisotopes, fluorescent dyes, electron-dense reagents, enzymes (as commonly used in ELISAs), biotin, or haptens and proteins for which antisera or monoclonal antibodies are available. Many of these and other labels are described further herein and/or are otherwise known in the art. One of skill in the art will recognize that, in certain embodiments, primer nucleic acids can also be used as probe nucleic acids.

[0096] “RNA-dependent DNA polymerase” or “reverse transcriptase” (“RT”) refers to an enzyme that synthesizes a complementary DNA copy from an RNA template. All known reverse transcriptases also have the ability to make a complementary DNA copy from a DNA template; thus, they are both RNA- and DNA-dependent DNA polymerases. RTs may also have an RNAse H activity. A primer is required to initiate synthesis with both RNA and DNA templates.

[0097] “DNA-dependent DNA polymerase” is an enzyme that synthesizes a complementary DNA copy from a DNA template. Examples are DNA polymerase I from E. coli, bacteriophage T7 DNA polymerase, or DNA polymerases from bacteriophages T4, Phi-29, M2, or T5. DNA-dependent DNA polymerases may be the naturally occurring enzymes isolated from bacteria or bacteriophages or expressed recombinantly, or may be modified or “evolved” forms which have been engineered to possess certain desirable characteristics, e.g., thermostability, or the ability to recognize or synthesize a DNA strand from various modified templates. All known DNA-dependent DNA polymerases require a complementary primer to initiate synthesis. It is known that under suitable conditions a DNA-dependent DNA polymerase may synthesize a complementary DNA copy from an RNA template. RNA-dependent DNA polymerases typically also have DNA-dependent DNA polymerase activity.

[0098] “DNA-dependent RNA polymerase” or “transcriptase” is an enzyme that synthesizes multiple RNA copies from a double-stranded or partially double-stranded DNA molecule having a promoter sequence that is usually double-stranded. The RNA molecules (“transcripts”) are synthesized in the 5'-to-3' direction beginning at a specific position just downstream of the promoter. Examples of transcriptases are the DNA-dependent RNA polymerase from E. coli and bacteriophages T7, T3, and SP6.

[0099] A “sequence” of a nucleic acid refers to the order and identity of nucleotides in the nucleic acid. A sequence is typically read in the 5' to 3' direction. The terms “identical” or percent “identity,” in the context of two or more nucleic acid or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same, when compared and aligned for maximum correspondence, e.g., as measured using one of the sequence comparison algorithms available to persons of skill or by visual inspection. Exemplary algorithms that are suitable for determining percent sequence identity and sequence similarity are the BLAST programs, which are described in, e.g., Altschul et al. (1990) “Basic local alignment search tool” J. Mol. Biol. 215:403-410, Gish et al. (1993) “Identification of protein coding regions by database similarity search” Nature Genet. 3:266-272, Madden et al. (1996) “Applications of network BLAST server” Meth. Enzymol. 266:131-141, Altschul et al. (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs” Nucleic Acids Res. 25:3389-3402, and Zhang et al. (1997) “PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation” Genome Res. 7:649-656, which are each incorporated by reference. Many other optimal alignment algorithms are also known in the art and are optionally utilized to determine percent sequence identity.
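To illustrate what "percent identity" means once two sequences have been aligned, the following Python sketch counts identical columns in a pre-computed pairwise alignment. It is a simplified assumption-laden example, not the BLAST algorithm: the function name, the use of "-" as the gap symbol, and the choice to divide by the number of alignment columns are all illustrative conventions, and alignment tools may report identity with a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Inputs must already be aligned to equal length, with '-' marking gaps.
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if not (x == "-" and y == "-")
    ]
    if not columns:
        raise ValueError("alignment has no informative columns")
    identical = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * identical / len(columns)


# Hypothetical alignment: one substitution and one gap across six columns.
print(round(percent_identity("ACGT-A", "ACCTGA"), 1))  # 66.7
```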

[00100] A “label” refers to a moiety attached (covalently or non-covalently), or capable of being attached, to a molecule, which moiety provides or is capable of providing information about the molecule (e.g., descriptive, identifying, etc. information about the molecule) or another molecule with which the labeled molecule interacts (e.g., hybridizes, etc.). Exemplary labels include fluorescent labels (including, e.g., quenchers or absorbers), weakly fluorescent labels, non-fluorescent labels, colorimetric labels, chemiluminescent labels, bioluminescent labels, radioactive labels, mass-modifying groups, antibodies, antigens, biotin, haptens, enzymes (including, e.g., peroxidase, phosphatase, etc.), and the like.

[00101] A “linker” refers to a chemical moiety that covalently or non-covalently attaches a compound or substituent group to another moiety, e.g., a nucleic acid, an oligonucleotide probe, a primer nucleic acid, an amplicon, a solid support, or the like. For example, linkers are optionally used to attach oligonucleotide probes to a solid support (e.g., in a linear or other logic probe array). To further illustrate, a linker optionally attaches a label (e.g., a fluorescent dye, a radioisotope, etc.) to an oligonucleotide probe, a primer nucleic acid, or the like. Linkers are typically at least bifunctional chemical moieties and in certain embodiments, they comprise cleavable attachments, which can be cleaved by, e.g., heat, an enzyme, a chemical agent, electromagnetic radiation, etc. to release materials or compounds from, e.g., a solid support. A careful choice of linker allows cleavage to be performed under appropriate conditions compatible with the stability of the compound and assay method. Generally a linker has no specific biological activity other than to, e.g., join chemical species together or to preserve some minimum distance or other spatial relationship between such species. However, the constituents of a linker may be selected to influence some property of the linked chemical species such as three-dimensional conformation, net charge, hydrophobicity, etc. Exemplary linkers include, e.g., oligopeptides, oligonucleotides, oligopolyamides, oligoethyleneglycerols, oligoacrylamides, alkyl chains, or the like. Additional description of linker molecules is provided in, e.g., Hermanson, Bioconjugate Techniques, Elsevier Science (1996), Lyttle et al. (1996) Nucleic Acids Res. 24(14):2793, Shchepino et al. (2001) Nucleosides, Nucleotides, & Nucleic Acids 20:369, Doronina et al. (2001) Nucleosides, Nucleotides, & Nucleic Acids 20:1007, Trawick et al. (2001) Bioconjugate Chem. 12:900, Olejnik et al. (1998) Methods in Enzymology 291:135, and Pljevaljcic et al. (2003) J. Am. Chem. Soc. 125(12):3486, all of which are incorporated by reference.

[00102] “Fragment” refers to a piece of contiguous nucleic acid that contains fewer nucleotides than the complete nucleic acid.

[00103] “Hybridization,” “annealing,” “selectively bind,” or “selective binding” refers to the base-pairing interaction of one nucleic acid with another nucleic acid (typically an antiparallel nucleic acid) that results in formation of a duplex or other higher-ordered structure (i.e., a hybridization complex). The primary interaction between the antiparallel nucleic acid molecules is typically base specific, e.g., A/T and G/C. It is not a requirement that two nucleic acids have 100% complementarity over their full length to achieve hybridization. Nucleic acids hybridize due to a variety of well characterized physico-chemical forces, such as hydrogen bonding, solvent exclusion, base stacking and the like. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology-Hybridization with Nucleic Acid Probes part I chapter 2, “Overview of principles of hybridization and the strategy of nucleic acid probe assays,” (Elsevier, New York), as well as in Ausubel (Ed.) Current Protocols in Molecular Biology, Volumes I, II, and III, 1997, each of which is incorporated by reference.

[00104] The term “attached” or “conjugated” refers to interactions and/or states in which material or compounds are connected or otherwise joined with one another. These interactions and/or states are typically produced by, e.g., covalent bonding, ionic bonding, chemisorption, physisorption, and combinations thereof.

[00105] A “composition” refers to a combination of two or more different components. In certain embodiments, for example, a composition includes one or more oligonucleotide probes in solution.

[00106] “Nucleic acid” or “nucleic acid molecule” refers to a multimeric compound comprising two or more covalently bonded nucleosides or nucleoside analogs having nitrogenous heterocyclic bases, or base analogs, where the nucleosides are linked together by phosphodiester bonds or other linkages to form a polynucleotide. Nucleic acids include RNA, DNA, or chimeric DNA-RNA polymers or oligonucleotides, and analogs thereof. A nucleic acid backbone can be made up of a variety of linkages, including one or more of sugar-phosphodiester linkages, peptide-nucleic acid bonds, phosphorothioate linkages, methylphosphonate linkages, or combinations thereof. Sugar moieties of the nucleic acid can be ribose, deoxyribose, or similar compounds having known substitutions (e.g., 2'-methoxy substitutions and 2'-halide substitutions). Nitrogenous bases can be conventional bases (A, G, C, T, U) or analogs thereof (e.g., inosine, 5-methylisocytosine, isoguanine).

[00107] An “oligonucleotide” or “oligomer” refers to a nucleic acid that includes at least two nucleic acid monomer units (e.g., nucleotides), typically more than three monomer units, and more typically greater than ten monomer units. The exact size of an oligonucleotide generally depends on various factors, including the ultimate function or use of the oligonucleotide. Oligonucleotides are optionally prepared by any suitable method, including, but not limited to, isolation of an existing or natural sequence, DNA replication or amplification, reverse transcription, cloning and restriction digestion of appropriate sequences, or direct chemical synthesis by a method such as the phosphotriester method of Narang et al. (1979) Meth. Enzymol. 68:90-99; the phosphodiester method of Brown et al. (1979) Meth. Enzymol. 68:109-151; the diethylphosphoramidite method of Beaucage et al. (1981) Tetrahedron Lett. 22:1859-1862; the triester method of Matteucci et al. (1981) J. Am. Chem. Soc. 103:3185-3191; automated synthesis methods; or the solid support method of U.S. Pat. No. 4,458,066, or other methods known in the art. All of these references are incorporated by reference.

[00108] A “mixture” refers to a combination of two or more different components. A “reaction mixture” refers to a mixture that comprises molecules that can participate in and/or facilitate a given reaction. An “amplification reaction mixture” refers to a solution containing reagents necessary to carry out an amplification reaction, and typically contains primers, a thermostable DNA polymerase, dNTPs, and a divalent metal cation in a suitable buffer. A reaction mixture is referred to as complete if it contains all reagents necessary to carry out the reaction, and incomplete if it contains only a subset of the necessary reagents. It will be understood by one of skill in the art that reaction components are routinely stored as separate solutions, each containing a subset of the total components, for reasons of convenience, storage stability, or to allow for application-dependent adjustment of the component concentrations, and that reaction components are combined prior to the reaction to create a complete reaction mixture. Furthermore, it will be understood by one of skill in the art that reaction components are packaged separately for commercialization and that useful commercial kits may contain any subset of the reaction components, which includes the modified primers of the invention.

EXAMPLES

[00109] The broad evaluation of microbes from cancer patient sequencing data is shown in Fig. 1A across 33 cancer types in TCGA. Since these data were derived from multiple sequencing centers, they had to be batch corrected (Figs. 1B-1C), which was done in a supervised manner, permitting selective reduction of technical batch variables while retaining or increasing the importance of biological variables (Fig. 1D).

[00110] Ecological validation was subsequently performed to ensure that the identified microbes were in line with expected and/or observed clinical and literature findings (Figs. 2A-3).

[00111] Concurrently, another dataset from Hugo et al. (2016; Science; PMID: 26997480) that collected whole transcriptomic data from patients’ tumors prior to them receiving anti-PD-1 immunotherapy (i.e., nivolumab or pembrolizumab) was harvested for microbial reads. The intratumoral microbial RNA was then used to distinguish patients who had a ‘complete response’ (CR) versus those who had ‘progressive disease’ (PD), per iRECIST classification, as well as to distinguish patients who had a ‘complete response’ (CR) versus those who had a ‘partial response’ (PR). The PCoA plots are shown in Figs. 4A and 4C, and the plots showing discriminatory beta dispersion differences between the comparisons are shown in Figs. 4B and 4D.

[00112] Since the concept of immunogenicity is important in predicting response to certain types of cancer therapy, immunogenic subtypes of cancers were explored in TCGA to see if they could be discriminated by microbial DNA and RNA against non-immunogenic subtypes of cancer. Examples presented herein include discriminating cases of microsatellite instability in colon cancer (Fig. 4E) and discriminating cases of triple negative (“basal-like”) subtype of breast cancer among other breast cancer subtypes (Fig. 4F).

[00113] Using liver hepatocellular carcinoma as an example for distinguishing primary tumor samples as coming from a particular cancer type by solely using microbial DNA and RNA, a total of 13,883 primary tumor samples were processed across 32 cancer types, 416 of which were liver cancer. After training on a randomly selected, class-stratified 70% of the cases and testing on the remaining 30% of the cases, the model showed nearly perfect discrimination with an area under the receiver operating characteristic curve (AUROC) of 0.991300703 and an area under the precision-recall curve (AUPR) of 0.940399017. Figs. 15E and 16F show the PR and ROC curves, respectively, of the model’s performance on the randomly selected 30% holdout test set. The model performance is also shown in the website screenshot in Fig. 33B.

[00114] Using liver hepatocellular carcinoma as another example for distinguishing blood-derived normal samples as coming from a particular cancer type by solely using microbial DNA, a total of 1866 blood-derived normal samples were processed, 32 of which were from liver cancer. After training on a randomly selected, class-stratified 70% of the cases, the model was tested on the remaining 30% of the cases and showed exceptionally good discrimination with an AUROC of 0.998585859 and an AUPR of 0.888716603. The respective PR and ROC plots are shown in Figs. 15A and 15B.

[00115] Again using liver hepatocellular carcinoma as another example for distinguishing tumor tissue from normal tissue solely using microbial DNA and RNA, all of the primary tumor and adjacent solid tissue normal samples from liver cancer patients were extracted for processing (n=488, of which 416 are primary tumors and 72 are adjacent solid tissue normals). After training on a randomly selected 70% of the cases, the model was tested on the remaining 30% of the cases and showed phenomenal discrimination with an AUROC of 0.983102919 and an AUPR of 0.997228962. The respective PR and ROC plots are shown in Figs. 15C and 15D.
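The evaluation pattern used in the three liver cancer examples above (a class-stratified 70%/30% split followed by AUROC and AUPR on the held-out cases) can be sketched in plain Python as follows. This is an illustrative assumption, not the applicants' code: the function names are hypothetical, the model-fitting step is omitted, and AUPR is approximated by average precision, which is one common step-wise estimate of the area under the precision-recall curve.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("at least one positive sample is required")
    return total / hits


# Toy labels (1 = cancer type of interest) and made-up classifier scores.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1]
print(auroc(y_true, y_score))              # 0.75
print(average_precision(y_true, y_score))  # (1/1 + 2/3) / 2, about 0.833
```

With a strongly imbalanced test set, AUROC and AUPR can diverge considerably, which is one reason both metrics are worth reporting together.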

[00116] A similar procedure, as described above, was applied to every possible discrimination for every cancer type in the TCGA dataset, as long as the minority class contained at least 20 samples, and the results are shown in Figures 5A-27B. The cancer types shown include the following: adrenocortical carcinoma, bladder urothelial carcinoma, brain lower grade glioma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, lymphoid neoplasm diffuse large B-cell lymphoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic adenocarcinoma, pheochromocytoma and paraganglioma, prostate adenocarcinoma, rectum adenocarcinoma, sarcoma, skin cutaneous melanoma, stomach adenocarcinoma, testicular germ cell tumors, thymoma, thyroid carcinoma, uterine carcinosarcoma, uterine corpus endometrial carcinoma, and uveal melanoma. Data on the discriminatory performance on acute myelogenous leukemia samples were shown in the provisional application but are not shown here.
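The eligibility rule described above (a discrimination is attempted only when the minority class has at least 20 samples) can be expressed as a short Python sketch for the one-cancer-type-versus-all-others case. The function name, the abbreviations, and the sample counts in the example are illustrative assumptions, not data from the application.

```python
from collections import Counter


def one_vs_rest_tasks(cancer_types, min_minority=20):
    """Build binary label vectors (1 = this type, 0 = all others) for eligible types.

    A type is eligible only if both the type itself and the pooled remainder
    contain at least `min_minority` samples.
    """
    counts = Counter(cancer_types)
    total = len(cancer_types)
    tasks = {}
    for ctype, n in counts.items():
        if min(n, total - n) >= min_minority:
            tasks[ctype] = [1 if c == ctype else 0 for c in cancer_types]
    return tasks


# Made-up sample sheet: the 5-sample type falls below the threshold and is skipped.
samples = ["LIHC"] * 32 + ["BRCA"] * 100 + ["UVM"] * 5
tasks = one_vs_rest_tasks(samples)
print(sorted(tasks))  # ['BRCA', 'LIHC']
```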

[00117] In cases of class imbalance, up-sampling of the minority class was used to promote model generalization, as shown herein. Many other strategies were previously attempted and presented in the provisional application, including: differential weighting of the samples during model training (i.e., higher weighting of the minority class and lower weighting of the majority class); down-sampling the majority class; and interpolating new instances of the minority class using several interpolation algorithms (e.g., SMOTE and ROSE). Minor variation in model performance is possible with these, and someone skilled in the art will anticipate ways to improve model performance by their implementation and fine-tuning. For example, some of these strategies lead to models of the same discrimination that differ substantially in their sensitivity versus specificity, and it is possible to combine these models into an ensemble to make an overall better performing model.
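Random up-sampling of the minority class, the strategy named above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, sampling is with replacement until every class matches the largest class, and it is assumed to be applied to the training portion only so that duplicated samples do not leak into the test set. SMOTE and ROSE, which synthesize new instances rather than duplicating existing ones, are not shown.

```python
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen samples until every class is as large as the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy training set: 2 positives and 8 negatives become 8 and 8.
rows = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
new_rows, new_labels = upsample_minority(rows, labels)
print(Counter(new_labels))
```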

[00118] Notably, the models presented herein have been minimally tuned and there is an anticipated opportunity to increase their predictive accuracy, among other performance metrics, by further model tuning and/or employing different training strategies, increasing sample size, regularization, model types, building ensembles of models, or a combination thereof.

[00119] To study the effects of (de)contamination on the model predictions, a decontamination pipeline was theorized and implemented (Fig. 28A) prior to machine learning model building and testing. Notably, the decontamination pipeline described in Fig. 28A represents one among many ways to evaluate the impact of and remove contaminants from such cancer microbiome data, and an individual skilled in the art will be able to anticipate other such methods that extend or lessen the complexity of the presented pipeline. After decontamination, Figs. 28B and 28C show that classifier performance is maintained relative to models built and tested on the “full dataset” that was not decontaminated.

[00120] In order to explore the generality of the findings described herein, several additional steps of analysis were performed. The first split the original microbial count data in half in a stratified manner, then normalized and batch corrected each half independently, and then built separate machine learning models on each half. Each trained machine learning model was then tested on the opposite half’s data to estimate overall performance and model generalization. These predictions involved labeling one cancer type versus all others solely using microbial DNA and RNA from primary tumors. These performance values were then compared to a model trained and tested on the full dataset that had been normalized and batch corrected with 50%-50% training-testing splits, also predicting one cancer type versus all others solely using microbial DNA and RNA from primary tumors. The results are shown in Fig. 29A. Additionally, further comparative analysis on models built and tested on RNA-only data (Figs. 29B-29C) or DNA-only data (Figs. 29D-29E) did not show significant reductions in overall model performance. Even a more stringent comparative analysis, whereby data from a single sequencing center that only performed one type of sequencing (University of North Carolina: RNA-Seq) or another (Harvard Medical School: whole genome sequencing) were used to train and test models, did not show significant reductions in predictive performance when predicting one cancer type versus all others solely based on microbial nucleic acid information (Figs. 29F-29I).
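The split-half generalization check described above can be outlined in Python as a small harness. It is a sketch under stated assumptions: the function names are hypothetical, and the normalization, batch correction, model fitting, and scoring steps are passed in as caller-supplied functions because the application does not specify them at the code level. The key point the sketch preserves is that each half is preprocessed independently before a model trained on one half is scored on the other.

```python
import random


def stratified_halves(labels, seed=0):
    """Split sample indices into two halves with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    half_a, half_b = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        mid = len(indices) // 2
        half_a.extend(indices[:mid])
        half_b.extend(indices[mid:])
    return sorted(half_a), sorted(half_b)


def cross_half_evaluate(labels, preprocess, fit, evaluate, seed=0):
    """Train on each half and score on the opposite half.

    preprocess(indices) -> data prepared from that half alone
    fit(data)           -> trained model
    evaluate(model, data) -> performance score
    """
    idx_a, idx_b = stratified_halves(labels, seed)
    data_a, data_b = preprocess(idx_a), preprocess(idx_b)
    model_a, model_b = fit(data_a), fit(data_b)
    return {
        "train_A_test_B": evaluate(model_a, data_b),
        "train_B_test_A": evaluate(model_b, data_a),
    }


# Placeholder callables only show the data flow; they are not real models.
labels = [0] * 6 + [1] * 4
result = cross_half_evaluate(
    labels,
    preprocess=lambda idx: idx,
    fit=lambda data: len(data),
    evaluate=lambda model, data: (model, len(data)),
)
print(result)
```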

[00121] Figure 30 shows several examples of predicting the mutation status of the top five most common mutations in TCGA solely using microbial DNA and RNA in primary tumors in a pan-cancer fashion.

[00122] Since many currently available liquid biopsy diagnostics are not able to accurately diagnose low-stage cancers (stage I and stage II), a conservative benchmarking analysis was done using microbial DNA derived from blood samples of TCGA patients who only had stage I or stage II cancers. Figure 31 shows that it is readily feasible to distinguish which cancer type a given blood sample belongs to solely using microbial DNA and further shows that varying stringencies of decontamination do not drastically affect the performance of the model classifications.

[00123] Figure 32 also depicts a very conservative benchmarking analysis for predicting cancer type using microbial DNA derived from blood samples of TCGA patients that do not have any detectable genomic alterations in their tumors as measured by two commercial ctDNA assays. The results show that it is readily feasible to distinguish which cancer type a given blood sample belongs to just based on the microbial DNA found within it, notably when two major liquid biopsy assays would fail to even detect the presence of cancer, even when assuming 100% sensitivity and 100% specificity.

[00124] Figure 33 describes how an electronic website interface can be built for hosting, displaying, and sharing information about microbial presence and abundance in various cancer types, as well as showing model performances and which microbial features were most important for a model to make a particular discrimination. For anyone skilled in the art, it is expected that similar electronic, online interfaces can be used to remotely evaluate and diagnose a cancer using microbial nucleic acids that were measured as part of a deployable kit.

[00125] Appendix A is a listing of microbial features (i.e. taxonomy names at the genus level) that were detected in TCGA (n=1993). The models presented herein were not regularized and can utilize information from all 1993 available genera, although many models performed well with 30-1200 genera. Furthermore, a number of “decontaminated” datasets were built off of this original “full dataset” with varying levels of decontamination stringency. Since the combinatorial number of models trained and tested on all possible comparisons and datasets is high, and since the number of genera per model is even higher (i.e. several to many genera per model), it is not necessary to list out every ranked, unique model feature (estimated at >120,000 features) in this patent application. Instead, it is expected that someone skilled in the art would be able to readily replicate the invention using the methods described herein, as well as the list of microbial features provided. It is further expected that any subset of these microbial features, as selected by some algorithmic or machine learning process, can be used to make a variety of discriminatory predictions among various cancer types, subtypes, mutation statuses, sample types, treatment responses, and so forth.
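
As one hedged illustration of selecting a subset of microbial features by an algorithmic process, the sketch below ranks genera by the absolute difference in mean abundance between the target class and all other samples and keeps the top k. This ranking criterion, the function name `top_k_genera`, and its arguments are assumptions chosen for illustration only. They are not the feature ranking used by the models described herein.

```python
def top_k_genera(counts, genera, y, k):
    """Keep the k genera with the largest between-class mean abundance gap.

    counts: one row per sample, one column per genus (same order as genera).
    y: binary labels, 1 for the target class and 0 for all other samples.
    """
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    gaps = []
    for j, genus in enumerate(genera):
        pos = sum(row[j] for row, cls in zip(counts, y) if cls == 1) / n_pos
        neg = sum(row[j] for row, cls in zip(counts, y) if cls == 0) / n_neg
        gaps.append((abs(pos - neg), genus))
    # Largest gap first; ties are broken alphabetically for reproducibility.
    gaps.sort(key=lambda item: (-item[0], item[1]))
    return [genus for _, genus in gaps[:k]]
```

The returned names can be used to restrict the count table to the selected columns before a model is trained. Any other selection rule, such as model-derived feature importance, could be substituted.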

[00126] The diagnostic methods described herein further provide a basis for methods of treatment of a diagnosed subject with an effective amount of a therapy directed against the diagnosed cancer, wherein the therapy is now known in the art or later discovered.

[00127] Examples of analogous machine learning model creation known to those in the art are described in Ridgeway, “Generalized Boosted Models: a guide to the gbm package,” 2007, and in Kuhn, Max, and Kjell Johnson, Applied predictive modeling. Vol. 26. New York: Springer, 2013, both incorporated herein by reference.

[00128] These and other aspects, features, alternatives, and advantages of the present invention will be apparent to those skilled in the art upon a review of the specific embodiments disclosed herein, which are not to be considered limiting to the scope of the claimed invention.

APPENDIX A