Title:
SYSTEM AND METHOD FOR CLASSIFYING GENOMIC DATA
Document Type and Number:
WIPO Patent Application WO/2021/030193
Kind Code:
A1
Abstract:
A system and method are described for predicting the origin of a neoplastic tissue sample from a patient to assist in treatment of the patient by accurately informing a medical professional of the tissue origin so that an appropriate treatment can be provided to the patient. The method generally includes preparing a biological data profile of the neoplastic tissue sample and comparing it with models in a model dataspace to determine a best fit of a model with the biological data profile.

Inventors:
NEWTON YULIA (US)
Application Number:
PCT/US2020/045421
Publication Date:
February 18, 2021
Filing Date:
August 07, 2020
Assignee:
NANTOMICS LLC (US)
International Classes:
G16B40/20; G16B15/00; G16B20/00; G16H50/20
Domestic Patent References:
WO2019018374A12019-01-24
Foreign References:
US20170051281A12017-02-23
US10340031B22019-07-02
Other References:
JOAO C. GUIMARAES, MIHAELA ZAVOLAN: "Patterns of ribosomal protein expression specify normal and malignant human cells", GENOME BIOLOGY, vol. 17, no. 1, 1 December 2016 (2016-12-01), XP055566778, DOI: 10.1186/s13059-016-1104-z
RAM AJORE, DAVID RAISER, MARIE MCCONKEY, MAGNUS JÖUD, BERND BOIDOL, BRENTON MAR, GORDON SAKSENA, DAVID M WEINSTOCK, SCOTT ARMSTRON: "Deletion of ribosomal protein genes is a common vulnerability in human cancer, especially in concert with TP53 mutations", EMBO MOLECULAR MEDICINE (ONLINE), WILEY - V C H VERLAG GMBH & CO. KGAA, DE, vol. 9, no. 4, 1 April 2017 (2017-04-01), DE, pages 498 - 507, XP055566779, ISSN: 1757-4684, DOI: 10.15252/emmm.201606660
Attorney, Agent or Firm:
CONNELL, Gary J. (US)
Claims:
What Is Claimed Is:

1. A method, comprising: receiving a neoplastic tissue sample; preparing a biological data profile of the neoplastic tissue sample; inputting the biological data profile into a model dataspace comprising models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types; comparing the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile; and communicating the best fit to a user device.

2. The method of claim 1, wherein the models have two or three dimensions.

3. The method of claim 1, wherein the model dataspace is built using a support vector machine.

4. The method of claim 1, wherein the model dataspace is prepared by T-distributed Stochastic Neighbor Embedding (t-SNE).

5. The method of claim 1, wherein the biological data profile comprises RNA sequence data.

6. The method of claim 1, wherein two or more of the models are built using one or more support vector machines.

7. The method of claim 6, further comprising retraining one or more of the models by comparing training data.

8. The method of claim 7, wherein the training data comprises a set of biological training data, and wherein a subset of the biological training data is used in the retraining.

9. The method of claim 8, wherein the subset is a set having a best fit of the subset of biological training data determined using the at least one of the support vector machines.

10. The method of claim 8, wherein the subset is a set of most frequently varying genes.

11. A system, comprising: a prediction server receiving input for processing biological data, wherein the server comprises a microprocessor and a computer-readable medium coupled thereto, and the microprocessor receives instructions from the computer-readable medium and is programmed to: prepare a biological data profile of a neoplastic tissue sample; input the biological data profile into a model dataspace comprising models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types; compare the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile; and communicate the best fit to a user device.

12. The system of claim 11, wherein the models have two or three dimensions.

13. The system of claim 11, wherein the model dataspace is built by a support vector machine.

14. The system of claim 11, wherein the model dataspace is prepared by T-distributed Stochastic Neighbor Embedding (t-SNE).

15. The system of claim 11, wherein the biological data profile comprises RNA sequence data.

16. A method for treatment of a patient, comprising: analyzing a neoplastic tissue sample from the patient to obtain biological sample data; inputting the biological sample data into a model dataspace comprising models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types; identifying the neoplastic tissue sample as the neoplastic tissue type with which the biological sample data of the neoplastic tissue sample has the best fit in the model dataspace; and treating the patient with a treatment suitable to the identified neoplastic tissue type.

17. The method of claim 16, wherein the models have two or three dimensions.

18. The method of claim 16, wherein the model dataspace is built by a support vector machine.

19. The method of claim 16, wherein the model dataspace is prepared by T-distributed Stochastic Neighbor Embedding (t-SNE).

20. The method of claim 16, wherein the biological sample data comprises RNA sequence data.

Description:
SYSTEM AND METHOD FOR CLASSIFYING GENOMIC DATA

FIELD OF THE INVENTION

[0001] The field of the invention is systems and methods for classifying biological data using an ensemble of tissue models to predict the origin of neoplastic tissue.

BACKGROUND OF THE INVENTION

[0002] This background description includes information that may be useful in understanding the present inventive subject matter. It is not an admission that any of the information provided herein is prior art or applicant admitted prior art, or relevant to the presently claimed inventive subject matter, or that any publication specifically or implicitly referenced is prior art or applicant admitted prior art.

[0003] There are many different types of treatments for cancer, including surgery, radiation, chemotherapy and biological therapy. Selection of an appropriate treatment depends on the nature of the cancer, including the tissue type from which the cancer originated. In the case of metastatic cancer, the tissue of origin can be difficult to determine, and thus, the ability of medical professionals to effectively treat such cancers can be limited. The present invention addresses this problem by constructing model dataspaces using biological data from tissues of known origin and comparing biological data from a patient sample tissue of unknown origin to accurately classify the patient sample tissue, thereby enabling more effective treatment of such patients.

[0004] All publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.

[0005] In some embodiments, the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the inventive subject matter are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the inventive subject matter are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the inventive subject matter may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.

[0006] Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints and open-ended ranges should be interpreted to include only commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.

[0007] As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

[0008] The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the inventive subject matter and does not pose a limitation on the scope of the inventive subject matter otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the inventive subject matter.

[0009] Groupings of alternative elements or embodiments of the inventive subject matter disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.

[0010] Thus, there is still a need for improved methods and systems to classify biological data and predict the origin(s) of biological samples.

SUMMARY OF THE INVENTION

[0011] Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.

[0012] Cancer is a disease of altered tissue growth regulation, and treatments for different types of cancers are often determined by a presumed origin of the tumor. However, it can be difficult to determine the presumed origin (or origins) of a tumor and improved methods and systems are needed to predict the origin(s).

[0013] It is with respect to the above that embodiments of the present disclosure were contemplated. In particular, embodiments of the present disclosure relate to data classification methods and systems.

[0014] One such non-limiting embodiment of the present invention is a method that includes receiving a neoplastic tissue sample and preparing a biological data profile of the neoplastic tissue sample. The biological data profile is input into a model dataspace comprising models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types. The method further includes comparing the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile. The best fit can be communicated to a user device.
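The comparison step of this embodiment can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that each model is summarized by a single centroid in a two-dimensional dataspace and that "best fit" means the smallest Euclidean distance; the disclosure does not fix either choice at this point. The names `MODELS` and `best_fit` and all coordinates are hypothetical.

```python
import math

# Hypothetical 2-D models: one centroid per neoplastic tissue type.
# Coordinates are invented for illustration and carry no biological meaning.
MODELS = {
    "colorectal": (0.0, 0.0),
    "brain": (5.0, 5.0),
    "breast_basal": (-4.0, 3.0),
}


def best_fit(profile, models):
    """Compare a profile with every model and return (tissue, distance).

    Assumption: fit quality is Euclidean distance to the model centroid,
    so the best fit is the model with the smallest distance.
    """
    distances = {
        tissue: math.dist(profile, centroid)
        for tissue, centroid in models.items()
    }
    tissue = min(distances, key=distances.get)
    return tissue, distances[tissue]


# A profile already projected into the same 2-D dataspace (hypothetical values).
tissue, distance = best_fit((4.0, 4.5), MODELS)
```

In this toy setup the returned tissue label is what would be communicated to the user device; a real system could equally report a ranked list of fits.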

[0015] Another embodiment of the invention is a system that includes a prediction server for receiving input for processing biological data. The server comprises a microprocessor and a computer-readable medium coupled thereto, and the microprocessor receives instructions from the computer-readable medium. Further, the microprocessor is programmed to prepare a biological data profile of a neoplastic tissue sample and to input the biological data profile into a model dataspace having models having two or more dimensions and including data characteristic of multiple neoplastic tissue types. The microprocessor is also programmed to compare the biological data profile with each model within the model dataspace to determine a best fit of a model in the model dataspace with the biological data profile. The microprocessor is further programmed to communicate the best fit to a user device.

[0016] A further embodiment of the invention is a method for treatment of a patient having neoplastic tissue, such as a tumor or other tissue growth. Medical professionals working with patients having such conditions need to identify the nature of the tissue to make an accurate diagnosis and determine an appropriate course of treatment. For example, a medical professional must determine whether the tissue growth is malignant or benign, and for example, in the case of metastatic cancer, the site of origin of the cancer. Correctly identifying the nature of the tissue is critical to correctly diagnosing the condition of the patient and recommending an appropriate treatment. The method includes analyzing a neoplastic tissue sample from the patient to obtain biological sample data which is input into a model dataspace having models having two or more dimensions and comprising data characteristic of multiple neoplastic tissue types. The method further includes identifying the neoplastic tissue sample as the neoplastic tissue type with which the biological sample data of the neoplastic tissue sample has the best fit in the model dataspace and treating the patient with a treatment suitable to the identified neoplastic tissue type.

[0017] In some embodiments, the transcriptional profile data is maintained in a normalized data space. In various embodiments, the models have five or fewer dimensions, such as having two or three dimensions. In other embodiments, the model dataspace is built by a support vector machine (SVM). Also, the model dataspace can be prepared by T-distributed Stochastic Neighbor Embedding (t-SNE). Further, some embodiments include the biological data profile comprising RNA sequence data.

[0018] In yet other embodiments, two or more of the models can be built using one or more support vector machines. Such embodiments can further include retraining one or more of the models by comparing training data and the training data can include a set of biological training data, wherein a subset of the biological training data is used in the retraining. In this embodiment, the subset can be a set having a best fit of the subset of biological training data determined using at least one of the support vector machines. Alternatively, the subset can be a set of most frequently varying genes.
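One way the gene subset used for retraining might be chosen is sketched below. The sketch assumes that "most frequently varying genes" can be approximated by ranking genes on their variance across the training samples, which is an interpretation rather than a definition given in the disclosure. The function name `most_variable_genes`, the gene names, and the expression values are hypothetical.

```python
from statistics import pvariance


def most_variable_genes(expression, k):
    """Return the k genes with the highest variance across training samples.

    expression maps a gene name to its list of expression values, one value
    per training sample. Ranking by population variance is an assumption.
    """
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical training data: three genes measured in four samples.
TRAINING = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0],   # nearly constant
    "GENE_B": [0.0, 5.0, 10.0, 2.0],  # highly variable
    "GENE_C": [3.0, 3.5, 2.5, 4.0],   # moderately variable
}

subset = most_variable_genes(TRAINING, 2)
```

Only the genes in `subset` would then be passed to the retraining step, which keeps the retrained models focused on the features that differ most between samples.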

[0019] The phrases “at least one,” “one or more,” and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B and C,” “at least one of A, B, or C,” “one or more of A, B, and C,” “one or more of A, B, or C” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together. When each one of A, B, and C in the above expressions refers to an element, such as X, Y, and Z, or class of elements, such as X1-Xn, Y1-Ym, and Z1-Zo, the phrase is intended to refer to a single element selected from X, Y, and Z, a combination of elements selected from the same class (e.g., X1 and X2) as well as a combination of elements selected from two or more classes (e.g., Y1 and Zo).

[0020] The term “a” or “an” entity may refer to one or more of that entity. As such, the terms “a” (or “an”), “one or more” and “at least one” can be used interchangeably herein. It is also to be noted that the terms “comprising”, “including”, and “having” can be used interchangeably.

[0021] The preceding is a simplified summary of the disclosure to provide an understanding of some aspects of the disclosure. This summary is neither an extensive nor exhaustive overview of the disclosure and its various aspects, embodiments, and configurations. It is intended neither to identify key or critical elements of the disclosure nor to delineate the scope of the disclosure but to present selected concepts of the disclosure in a simplified form as an introduction to the more detailed description presented below. As will be appreciated, other aspects, embodiments, and configurations of the disclosure are possible utilizing, alone or in combination, one or more of the features set forth above or described in detail below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] The present disclosure is described in conjunction with the appended figures, which are not necessarily drawn to scale:

[0023] Fig. 1 is a block diagram view of a system in accordance with at least some embodiments of the present disclosure;

[0024] Fig. 2 is a block diagram view of additional details of a system in accordance with at least some embodiments of the present disclosure;

[0025] Fig. 3 is a flow chart view of a method of building a prediction model in accordance with at least some embodiments of the present disclosure;

[0026] Fig. 4 is a flow chart view of a method of determining a tissue prediction in accordance with at least some embodiments of the present disclosure; and

[0027] Fig. 5 is a flow chart view of a method of retraining a model in accordance with at least some embodiments of the present disclosure.

[0028] Fig. 6 is a schematic that outlines the process for the validation and performance estimation of provided methodology and tools.

[0029] Figs. 7A, 7B, and 7C are Venn diagrams that show the overlap of tumors with cancer type and/or ICD10 annotations.

[0030] Figs. 8A and 8B show the cancer types used for training. Fig. 8A shows cancer types in the FFPE set, and Fig. 8B shows cancer types in the TCGA set.

[0031] Fig. 9 shows the cancer types that were used for training and cancer types that were not used. The cancer types to the left of the dashed line were used, while those to the right were not used. The lower bars represent the tumors from the FFPE set, and the upper bars represent tumors from TCGA. Above the bars, the bottom number is the number of FFPE samples, and the top number is the number of TCGA samples.

[0032] Fig. 10 shows the categories of cancer types on which the model predicts. Those categories that do not have enough validation samples are marked with an asterisk. Beside the bars, the left number is the number of FFPE samples, and the right number is the number of TCGA samples.

[0033] Fig. 11 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.

[0034] Figs. 12A and 12B show the Per-Tissue Accuracy (Fig. 12A) and PPV (Fig. 12B) of FFPE samples. The points in the plots represent point estimates of each metric. Confidence intervals indicate the 95th percentile binomial distribution confidence interval.

[0035] Fig. 13 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.

[0036] Figs. 14A and 14B show the Per-Tissue Accuracy (Fig. 14A) and PPV (Fig. 14B) of TCGA samples. The points in the plots represent point estimates of each metric. Confidence intervals indicate the 95th percentile binomial distribution confidence interval.

[0037] Fig. 15 shows the FFPE sample counts and categories in the validation study (total n = 902).
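The binomial confidence intervals mentioned for the per-tissue accuracy and PPV plots can be approximated as in the following sketch. The disclosure does not say which binomial interval was used; the Wilson score interval is assumed here only because it needs nothing beyond the standard library. The function name `wilson_interval` and the counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval. This is one of
    several binomial intervals and is an assumption, not the method
    stated in the source.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical example: 90 correct predictions out of 100 samples of one tissue.
low, high = wilson_interval(90, 100)
```

The interval narrows as the number of samples per tissue grows, which is why categories with few validation samples yield wide error bars.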

[0038] Fig. 16 shows the confusion matrix for the comparison of true labels to the predicted labels. The true labels are on the x-axis and the predicted labels are on the y-axis.

[0039] Fig. 17 shows the summary of the comparison of the predicted labels to true labels after applying the prediction model to 2075 TCGA validation samples. The true labels are on the x-axis and the predicted labels are on the y-axis.

[0040] Fig. 18 shows the confusion matrix for the comparison of true labels to the predicted labels for certain FFPE samples. The true labels are on the x-axis and the predicted labels are on the y-axis.
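For readers unfamiliar with the confusion matrices and per-tissue metrics referred to in these figure descriptions, the following minimal sketch shows how such counts and a positive predictive value (PPV) could be tallied from paired labels. "Per-tissue accuracy" is interpreted here as the fraction of samples of a given true tissue that were predicted correctly, which is an assumption. All function names and labels are hypothetical.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs; each pair is one matrix cell."""
    return Counter(zip(true_labels, predicted_labels))


def ppv(counts, tissue):
    """Fraction of samples predicted as `tissue` that truly are `tissue`."""
    predicted = sum(n for (_, pred), n in counts.items() if pred == tissue)
    return counts[(tissue, tissue)] / predicted if predicted else None


def per_tissue_accuracy(counts, tissue):
    """Fraction of true `tissue` samples predicted as `tissue` (assumed meaning)."""
    actual = sum(n for (true, _), n in counts.items() if true == tissue)
    return counts[(tissue, tissue)] / actual if actual else None


# Hypothetical validation labels for five samples.
true_labels = ["brain", "brain", "colorectal", "colorectal", "breast"]
predicted_labels = ["brain", "colorectal", "colorectal", "colorectal", "breast"]

counts = confusion_counts(true_labels, predicted_labels)
colorectal_ppv = ppv(counts, "colorectal")
brain_accuracy = per_tissue_accuracy(counts, "brain")
```

Both helper functions return `None` for a tissue that never appears, so categories without validation samples are reported as undefined rather than as zero.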

[0041] Figs. 19A, 19B, and 19C show a successful prediction by the model of Example 1. Fig. 19A shows the individual samples grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types. The patient’s tumor, in the Colorectal cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box. Fig. 19B shows the molecularly most similar tumors in the TCGA dataset (clinical samples are not displayed on the report for privacy reasons). Fig. 19C shows the distributions of true and false positives and negatives, with the patient’s tumor’s score indicated by the dashed line.

[0042] Fig. 20 shows a successful prediction by the model of Example 1. The individual samples are grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types. The patient’s tumor, in the Brain cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box.

[0043] Fig. 21 shows a prediction by the model of Example 1. The individual samples are grouped by similarity, which separates into tissue of origin clusters as indicated by labelled tissue types. The patient’s tumor, in the Breast Basal cluster, is denoted with a star, and the most similar tumors are denoted with large circles, as indicated in the inset box.

DETAILED DESCRIPTION

[0044] It should be noted that any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively. One should appreciate the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, FPGA, PLA, solid state drive, RAM, flash, ROM, etc.). The software instructions configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus. Further, the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that cause a processor to execute the disclosed steps associated with implementations of computer-based algorithms, processes, methods, or other instructions. In some embodiments, the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Data exchanges among devices can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, or other type of packet switched network; a circuit switched network; cell switched network; or other type of network.

[0045] As used in the description herein and throughout the claims that follow, when a system, engine, server, device, module, or other computing element is described as configured to perform or execute functions on data in a memory, the meaning of “configured to” or “programmed to” is defined as one or more processors or cores of the computing element being programmed by a set of software instructions stored in the memory of the computing element to execute the set of functions on target data or data objects stored in the memory.

[0046] One should appreciate that the disclosed techniques provide many advantageous technical effects including the analysis of biological characteristics to create various models of biological tissue types, where the models can predict origins of cancer within the body. The models may be within a model dataspace allowing multi-dimensional matching of data within one or more models. In embodiments, the computer modeling of the data may advantageously provide the ability to analyze and use data in methods and systems that were not previously available. In addition, advantageous technical effects include the ability of systems and methods of the invention to more accurately predict one or more origins of neoplastic tissue in a biological sample from a patient and to more easily visualize the relationship between cancers of known origin and a sample tissue of a patient. This knowledge allows the patient to be more effectively treated because the effectiveness of cancer treatments can be dependent on the origin of cancer within the body.

[0047] The focus of the disclosed inventive subject matter is to enable construction or configuration of a computing device to operate on vast quantities of digital data, beyond the capabilities of a human. Although the digital data, or one or more portions of the digital data, represents the biological characteristics of a patient or patient tissue, it should be appreciated that the digital data is a representation of one or more digital models of the biological characteristics of a patient or patient tissue, not the biological characteristics of a patient or patient tissue themselves. By instantiation of such digital models in the memory of the computing devices, the computing devices are able to manage the digital data or models in a manner that could provide utility to a user of the computing device that the user would lack without such a tool.

[0048] The following discussion provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.

[0049] As used herein, and unless the context dictates otherwise, the term "coupled to" is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms "coupled to" and "coupled with" are used synonymously.

[0050] It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C .... and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.

[0051] Before any embodiments of the disclosure are explained in detail, it is to be understood that the disclosure is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The disclosure is capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.

[0052] With reference to Figs. 1 and 2, illustrative systems 100, 200 will be described in accordance with at least some embodiments of the present disclosure. The systems 100, 200, in some embodiments, may include one or more computing devices operating in cooperation with one another to provide classification of biological data using an ensemble of tissue models. The components of the system 100, 200 may be utilized to facilitate one, some, or all of the methods described herein or portions thereof without departing from the scope of the present disclosure. Furthermore, although a server is depicted as including particular components or instruction sets, it should be appreciated that embodiments of the present disclosure are not so limited. For instance, although a single server may be provided with all of the instruction sets depicted and described in the server of Fig. 1, various instruction sets may reside in multiple servers. Alternatively, different instruction sets may exist other than those depicted in Fig. 1.

[0053] The systems 100, 200 are shown to include a communication network 104 that facilitates machine-to-machine communications between server 116 and one or more other devices. The system 100 is shown to include a prediction server 116. The system 200 is shown to include a prediction server 116 that communicates with a client device 204.

[0054] The communication network 104 may comprise any type of known communication medium or collection of communication media and may use any type of protocols to transport messages between endpoints. The communication network 104 may include wired and/or wireless communication technologies. The Internet is an example of the communication network 104 that constitutes an Internet Protocol (IP) network consisting of many computers, computing networks, and other communication devices located all over the world, which are connected through many telephone systems and other means. Other examples of the communication network 104 include, without limitation, a standard Plain Old Telephone System (POTS), an Integrated Services Digital Network (ISDN), the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Session Initiation Protocol (SIP) network, a Voice over Internet Protocol (VoIP) network, a cellular network, and any other type of packet-switched or circuit-switched network known in the art. In addition, it can be appreciated that the communication network 104 need not be limited to any one network type, and instead may be comprised of a number of different networks and/or network types. Moreover, the communication network 104 may comprise a number of different communication media such as coaxial cable, copper cable/wire, fiber-optic cable, antennas for transmitting/receiving wireless messages, and combinations thereof.

[0055] The client device 204 may correspond to any type of computing resource that includes a processor, computer memory, and a user interface. The client device 204 may also include one or more network interfaces that connect the client device 204 to the communication network 104 and enable the client device 204 to send/receive packets via the communication network 104. Non-limiting examples of client devices 204 include personal computers, laptops, mobile phones, smart phones, tablets, etc. In some embodiments, the client device 204 is configured to be used by and/or carried by a user 208. As will be discussed in further detail herein, the user 208 may utilize a client device 204 to receive and/or view various outputs of the prediction server 116.

[0056] The prediction server 116 or components thereof may be provided as a single server or in a cloud-computing environment. The prediction server 116 may be configured to execute one or multiple different types of instruction sets. For example, the prediction server 116 may be configured to execute instruction sets in connection with processing patient data (i.e., biological sample data) received from a patient data source 156 and transforming the patient data into biological sample data that is useable by the prediction server 116. As will be discussed in further detail herein, the biological sample data received from the patient data source 156 may include data relating to a biological tissue and in particular, neoplastic tissue, including tissue exhibiting dysplasia or hyperplasia, benign tumors and malignant tumors. As will be discussed in further detail herein, the prediction server 116 may be configured to classify the biological sample data using one or more data space(s) in which the prediction server 116 is configured to process data. In this way, the prediction server 116 can transform the biological sample data into data that comprises a format necessary for further processing by the prediction server 116.

[0057] Biological sample data can refer to genes; nucleic acid molecules (DNA or RNA), including sequence information; RNA polymerase levels and/or activity; RNA processing; proteins, including amino acid sequence information and/or three-dimensional structure and/or post-translational modifications; organelles; cells; cellular structures; cell signaling (including chemical and receptor signaling); cell cycle information; organs; and organisms. Biological sample data may be obtained from The Cancer Genome Atlas (TCGA). Biological sample data can include information regarding different states of a biological or chemical entity, for example, information regarding an unmodified protein as compared to a phosphorylated protein or a free base form of a drug as compared to a salt of the drug. Biological sample data can also include any “omics” data, including genomics, transcriptomics, proteomics or metabolomics.

[0058] The prediction server 116 is shown to include a processor 120, memory 124, and network interface 128. The prediction server 116 is also shown to include a database interface 152, which may be provided as a physical set of database links and drivers. Alternatively or additionally, the database interface 152 may be provided as one or more instruction sets in memory 124 that enable the processor 120 of the prediction server 116 to interact with the databases 156, 157, 159.

[0059] These resources of the prediction server 116 may enable functionality of the prediction server 116 as will be described herein. For instance, the network interface 128 provides the server 116 with the ability to send and receive communication packets over the communication network 104. The network interface 128 may be provided as a network interface card (NIC), a network port, drivers for the same, and the like. Communications between the components of the server 116 and other devices connected to the communication network 104 may all flow through the network interface 128.

[0060] The processor 120 may correspond to one or many computer processing devices. For instance, the processor 120 may be provided as silicon, as a Field Programmable Gate Array (FPGA), an Application-Specific Integrated Circuit (ASIC), any other type of Integrated Circuit (IC) chip, a collection of IC chips, or the like. As a more specific example, the processor 120 may be provided as a microprocessor, Central Processing Unit (CPU), or plurality of microprocessors that are configured to execute the instruction sets stored in memory 124. Upon executing the instruction sets stored in memory 124, the processor 120 enables various functions of the prediction server 116.

[0061] The memory 124 may include any type of computer memory device or collection of computer memory devices. Non-limiting examples of memory 124 include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Electrically Erasable Programmable ROM (EEPROM), Dynamic RAM (DRAM), etc. The memory 124 may be configured to store the instruction sets depicted in addition to temporarily storing data for the processor 120 to execute various types of routines or functions. Although not depicted, the memory 124 may include instructions that enable the processor 120 to store or retrieve data from the databases 156, 157, 159. Further still, the memory 124 may include instructions that enable the prediction server 116 to process various types of data (for example, use training data from the training data database 157 to create (e.g., train) or retrain prediction models 142, use patient data from the patient data database 156 to provide predictions using the prediction models 142, etc.).

[0062] The illustrative instruction sets that may be stored in memory 124 include, without limitation, data organization instructions 136, an inference engine 144, a training engine 146, a prediction engine 134, arbitration instructions 148, and verification instructions 147.

[0063] The patient data source 156 (also referred to herein as the patient data database) stores biological sample data of patient tissue samples. For example, the patient data database 156 can store RNA sequencing data and transcription/expression data, mRNA data, DNA data, and protein data, among others. In some embodiments, the patient sample data corresponds to textual data.

[0064] The training data source 157 (also referred to herein as the training data database) stores biological data used to train or retrain the prediction models 142. The training data database 157 stores biological sample data. Such data can include, for example, RNA expression data, RNA expression labels, clinical data, and The Cancer Genome Atlas (TCGA) data, among others. Thus, in some embodiments, the patient sample data corresponds to textual data.

[0065] The inference engine 144, when executed by the processor 120, enables the prediction server 116 to scan and analyze the biological sample data (e.g., received from the patient data source 156 and/or the training data source 157) and, if necessary, manipulate the data or obtain additional biological data. For example, the inference engine 144 may obtain RNA sequencing data and/or one or more RNA expression (e.g., transcription) profiles and/or expression levels related to various genes within biological sample data. The expression profiles and/or expression levels may be inferred for each gene in a sample (where the sample may be either sample data from the training data in training data database 157, or sample data from the patient data database 156). In various embodiments, in the instance of an item (i.e., a datum) of biological sample data being the genetic material of only some cells of a sample, the inference engine 144 may access a data source to obtain the genetic material of other cells in the tissue sample or a different sample from a patient in order to obtain a full transcriptional profile of the biological sample data. Also, for example, in the instance of an item (i.e., a datum) of biological sample data being one type of data, the prediction server 116 may access a data source or instructions to generate (e.g., convert the data into) data types that are necessary for use with various processes of the prediction server 116, such as for use with the prediction models 142. In some embodiments, the inference engine 144 enables the prediction server 116 to measure the relative activity of previously identified target genes within the biological sample data by performing expression profiling of the sample data. The inference engine 144 may be configured to automatically scan the text of the biological sample data and extract relevant data from within the biological sample data.

[0066] The biological data and inferred expression levels may be stored on the training data source 157 (e.g., for training sample data) and on the patient data source 156 (e.g., for patient sample data). After the inference engine 144 has inferred the expression profiles and/or levels of biological sample data from the training data source 157, the inferred expression data may be compared to other training data (e.g., previous clinical data and TCGA data (The Cancer Genome Atlas, also referred to as “genome atlas” herein)) in order to build the prediction models 142 (e.g., model A 142a through model N 142n). The training engine 146, when executed by the processor 120, may enable the prediction server 116 to train the prediction models 142 by comparing various training data with expression profiles. The training data and the expression profiles (including inferred expression levels) may be obtained from the training data database 157. Any amounts or types of training data may be compared with any of the expression profiles and/or inferred expression levels to create the prediction models 142. In various embodiments, biological data from the training data database 157 is used to build several support vector machines (SVMs) for classification of biological sample data. Then, the training engine 146 compares data (e.g., from the training data database 157) against each of the SVMs to determine which genes should be used to build each of the prediction models 142. In some embodiments, the most frequently varying genes may be used to build the prediction models 142.
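By way of non-limiting illustration, the gene-selection step may be sketched as follows. The sketch ranks genes by the variance of their expression across training samples and keeps the highest-ranked genes. Reading "most frequently varying genes" as "highest-variance genes" is an assumption, as are the function name `select_most_variable_genes`, the parameter `top_k`, and the toy expression values; none of these appear in the disclosure.

```python
from statistics import pvariance


def select_most_variable_genes(expression, top_k):
    """Return the top_k gene names ranked by variance across samples.

    expression maps a gene name to a list of expression values,
    one value per training sample (hypothetical layout).
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    # Sort by descending variance; break ties by gene name for determinism.
    ranked = sorted(variances, key=lambda gene: (-variances[gene], gene))
    return ranked[:top_k]


# Toy data: GENE_B varies most, GENE_A is nearly constant.
toy_expression = {
    "GENE_A": [1.0, 1.1, 0.9, 1.0],
    "GENE_B": [0.0, 5.0, 10.0, 15.0],
    "GENE_C": [2.0, 4.0, 2.0, 4.0],
}
selected = select_most_variable_genes(toy_expression, top_k=2)
```

In this sketch the selected gene list would then define the feature set on which each tissue model is trained.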

[0067] The training engine 146, when executed by the processor 120, may also be used to retrain any of the prediction models 142. Any criteria or data may be used to determine that retraining should be done. Retraining any of the prediction models 142 may be done using batches of data (e.g., from the training data database 157) or using hard user input. The retraining instructions may be configured to perform various tasks related to retraining the models, including but not limited to: determining whether retraining the models is necessary, suggesting retraining the models, providing suggested updates to the models, and retraining the models, if necessary.

[0068] The prediction models 142 may be based on different tissue types. In various embodiments there may be a single prediction model for each tissue type. For example, there may be twenty-five different tissue types and each of these twenty-five types may have a corresponding prediction model within the prediction models 142. In some embodiments, the prediction models 142 may be three-dimensional models occupying a single data space. In some embodiments, some of the prediction models 142 may occupy one data space while others occupy one or more different data spaces. Although the individual prediction models contain separate data (e.g., based on different tissue types), the data from the models may overlap within the data space. In various embodiments, a data space may be processed using visualization techniques to improve the visualization of the data. For example, a T-distributed Stochastic Neighbor Embedding (t-SNE) technique may be applied to one or more of the data spaces to improve visualization of the one or more prediction models 142 by reducing the dimensionality of the modeled data.

[0069] The prediction engine 134, when executed by the processor 120, may enable the prediction server 116 to compare various patient data 156 (e.g., genetic data from one or more tissue samples) with one or more of the prediction models 142 (e.g., model A 142a through model N 142n) to provide predictions for the patient data. In various embodiments, some or all of the patient data 156 is input to each of the models 142 to obtain prediction results (also referred to herein as prediction data), which the prediction engine 134 stores in the prediction data database 159. For example, the prediction engine 134 obtains the expression data (e.g., transcription data or expression profiles from the genetic data from the patient data source 156, which may be inferred data provided by the inference engine 144) for one or more genes in a tissue sample and the prediction engine 134 compares the expression data to one or more of the prediction models 142. In various embodiments, the prediction engine 134 matches the expression data from a tissue sample from the patient data database 156 to each of the prediction models 142 (i.e., each of model A 142a through model N 142n). The prediction engine 134 may obtain predictions for the samples from the comparisons of expression levels with the prediction models 142, and store the predictions on the prediction data database 159. Thus, the prediction models 142 can advantageously be used to predict one or more origins of cancer cells based on expression data (e.g., transcriptional profiles) of a biological sample.

[0070] The verification instructions 147 include instructions to verify the models 142 and/or to verify other data. The verification instructions 147 can use any type of verification methods (e.g., models, programs, etc.) to perform the verifications. For example, the verification instructions 147 can use an iterative modeling process with portions of data (e.g., training data 157) to determine an accuracy of the prediction models 142. In some aspects, a five-fold vector prediction may be used to verify that an SVM model is a preferred type of model to train the prediction models 142. Once a type of model is chosen, the model type may be used to build each of model A 142a through model N 142n in the prediction models 142 (e.g., the model type can be used to build a model for each tissue type). Other models that may be used include, but are not limited to, random forest, k-nearest neighbors, neural network, and RANSAC, among others. Models other than basic machine learning models may be used. Verification data (including but not limited to mean accuracy, positive predictive value, and false discovery rate) may be monitored to determine how well any of the prediction models 142 are performing, and if any of the models 142 should be retrained, then the prediction server 116 can retrain using the training engine 146. In some embodiments, one or more of the multiple data models are updated automatically in response to a confidence score meeting or exceeding the predetermined confidence threshold.
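By way of non-limiting illustration, an iterative verification over portions of the training data may be sketched as a five-fold hold-out procedure. Treating the "five fold vector prediction" as conventional five-fold cross-validation is an assumption. The helper names `k_fold_splits` and `mean_cross_validated_accuracy`, the `fit`/`predict` callback interface, and the placeholder majority-label model are all hypothetical and stand in for whichever model type is being verified.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumes n_samples >= k and that samples were shuffled beforehand.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def mean_cross_validated_accuracy(samples, labels, fit, predict, k=5):
    """Average held-out accuracy over k folds.

    fit(train_samples, train_labels) returns a model;
    predict(model, sample) returns a predicted label.
    """
    accuracies = []
    for train, test in k_fold_splits(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Placeholder model: always predicts the most common training label.
def fit_majority(train_samples, train_labels):
    return max(sorted(set(train_labels)), key=train_labels.count)


def predict_majority(model, sample):
    return model


toy_samples = list(range(10))
toy_labels = ["lung"] * 8 + ["colon"] * 2
accuracy = mean_cross_validated_accuracy(
    toy_samples, toy_labels, fit_majority, predict_majority
)
```

Comparing the mean held-out accuracy obtained with different `fit`/`predict` pairs is one way a preferred model type could be chosen under these assumptions.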

[0071] The data organization instructions 136 may organize data used by and generated by the prediction server 116. For example, the data organization instructions 136 may be configured to organize the data output by the training engine 146 for eventual storage as model A 142a through model N 142n in prediction models 142. For instance, the data organization instructions 136 may enable the prediction server 116 to organize the model data based on the data outputs of the training engine 146. In some embodiments, the data organization instructions 136 organize and classify the data, or portions of the data. In some embodiments, the data organization instructions 136 may be configured to organize the various data inputs based on genomics classifications and/or labeling. Non-limiting examples of genomics classifications include shared/common pathways, cell communication behaviors, and/or cellular network behaviors. Also, the data organization instructions 136 may be configured to organize the data output by the inference engine 144 for eventual storage as training data within the training data database 157 and/or transcriptional data within the patient data database 156. For instance, the data organization instructions 136 may enable the server 116 to organize the sample data based on inferences drawn by the inference engine 144.

[0072] The arbitration instructions 148 may be configured to resolve conflicts within the instruction sets of the prediction server 116. For example, the arbitration instructions 148 may be configured to resolve conflicts between inferences generated by the inference engine 144 and/or between conclusions drawn by the training engine 146. In some embodiments, the arbitration instructions 148 may also enable the prediction server 116 to adhere to a predetermined policy or philosophy in connection with resolving such inference conflicts. In some embodiments, these predetermined policies or philosophies may be applied to newly-generated inferences as well as inferences that were previously generated by the inference engine 144 and stored in connection with prediction models 142, prediction data 159, and/or training data 157.

[0073] Although not depicted, but as will be described in further detail herein, the prediction server 116 may also have one or more of its instruction sets (e.g., the inference engine 144) executed as a neural network or similar type of artificial intelligence data structure. Furthermore, these neural networks, such as an intelligent inference engine 144, may be capable of being dynamically trained and updated based on outputs of the prediction server 116. Further still, one or more models used by an intelligent inference engine 144 may be constantly analyzed for possible improvements thereto. Such analysis may be done internally or by an external neural network that is specifically designed to train other neural networks. As another non-limiting example, the data organization instructions 136 may be executed as a neural network whose coefficients between nodes are constantly updated in accordance with desired updates to the data organization for any of the data associated with the prediction server 116. For instance, if a particular normalized data space is initially used by the data organization instructions 136, but there is a desire to try a second, different, normalized data space that focuses on different biological information (e.g., shared pathways as compared to cellular communication behaviors), then the data organization instructions 136 may be reconfigured (e.g., offline rather than reconfiguring online with live data) to determine if using a different normalized data space is useful, provides certain benefits, or makes the overall system work less efficiently. If it is determined that the different normalized data space provides an improvement over the original normalized data space, then the data organization instructions 136 may be updated within the prediction server 116 to begin applying the new normalized data space to further organizations of the transcriptional profile data.

[0074] As shown in Fig. 2, a user may interact with the prediction server 116 via a communication network 104 and a client device 204. Thus, the communication network 104 facilitates machine-to-machine communications between one or more servers (e.g., prediction server 116) and/or one or more client devices (e.g., client device 204).

[0075] With reference now to Figs. 3-5, various methods of operating the systems 100, 200 or components therein will be described. It should be appreciated that any of the following methods may be performed in part or in total by any of the components depicted and described in connection with Figs. 1 or 2.

[0076] Referring to Fig. 3, methods of building prediction models will be described in accordance with at least some embodiments of the present disclosure. The method begins with the prediction server 116 receiving raw input data from data source(s) (step 304), which may include training data (e.g., previous clinical data and genome atlas) and RNA data. In step 308, the expression levels of RNA are inferred to obtain RNA expression profiles. For example, the inference engine 144 may access a data source to obtain the genetic material of cells in one or more tissue samples in order to obtain a full transcriptional profile of the sample(s).

[0077] After the full transcriptional profile(s) have been obtained, they are compared to the training data to build a prediction model at step 312. At step 316, the prediction data model is stored in a database. The steps of Fig. 3 may be duplicated and/or repeated for various tissue types to obtain prediction models for each tissue type.

[0078] With reference now to Fig. 4, methods of determining a tissue prediction will be described in accordance with at least some embodiments of the present disclosure. The methods of Fig. 4 may be applied after the prediction models are obtained, e.g., using the methods of Fig. 3. The methods begin at step 404, where at least one transcriptional profile of the tissue sample is received. At step 408, the transcriptional profile is input into each of the prediction models (e.g., the prediction models obtained in Fig. 3) to obtain predictions for each tissue type. For example, the prediction server 116 may compare the transcriptional profile with each of the SVMs obtained in Fig. 3 to obtain a result of the comparison. At step 412, the comparison results are confirmed. In various embodiments, this may be done using five fold vector prediction, where the mean accuracy, positive predictive value, false discovery rate, etc., of the results of each comparison are calculated to determine a best match of the comparisons with the SVMs. At step 416, the result of the tissue prediction is obtained by determining the best match from step 412. The result of the tissue prediction may be a single predicted tissue type, or multiple predicted tissue types.
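By way of non-limiting illustration, steps 408 through 416 may be sketched as scoring one transcriptional profile against a model for each tissue type and reporting the highest-scoring tissue or tissues. The sketch assumes each tissue model reduces to a linear decision function (a weight per gene plus a bias), as a linear SVM would; that representation, the names `decision_score` and `predict_tissue`, and the toy weights are assumptions rather than details taken from the disclosure.

```python
def decision_score(model, profile):
    """Linear decision value (weights . profile + bias) for one tissue model."""
    weighted = sum(
        model["weights"].get(gene, 0.0) * value for gene, value in profile.items()
    )
    return weighted + model["bias"]


def predict_tissue(models, profile, top_n=1):
    """Score the profile against every tissue model; return the best matches.

    Returns a list of (tissue, score) pairs, highest score first.
    """
    scores = {tissue: decision_score(model, profile) for tissue, model in models.items()}
    ranked = sorted(scores, key=lambda tissue: (-scores[tissue], tissue))
    return [(tissue, scores[tissue]) for tissue in ranked[:top_n]]


# Hypothetical per-tissue models and a hypothetical patient profile.
toy_models = {
    "lung": {"weights": {"GENE_A": 1.0, "GENE_B": -0.5}, "bias": 0.0},
    "colon": {"weights": {"GENE_A": -1.0, "GENE_B": 1.0}, "bias": 0.1},
}
toy_profile = {"GENE_A": 2.0, "GENE_B": 1.0}
best = predict_tissue(toy_models, toy_profile)
```

Passing `top_n=2` returns more than one candidate, which corresponds to the case where the result comprises multiple predicted tissue types.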

[0079] With reference now to Fig. 5, methods of retraining a model will be described in accordance with at least some embodiments of the present disclosure. The methods begin with receiving raw input data at step 504. At step 508, it is determined if retraining will be based on data batches or hard user input. The model is modified based on retraining data at step 512. In step 516, the modified model is verified. After the model has been verified, the retrained model may be stored in a database in step 520. It should be appreciated that any combination of prediction processes depicted and described herein can be performed without departing from the scope of the present disclosure. Alternatively or additionally, any number of other prediction processes can be developed by combining various portions or sub-steps of the described prediction processes without departing from the scope of the present disclosure.

[0080] Specific details were given in the description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, well-known circuits, processes, algorithms, structures, and techniques have been shown without unnecessary detail in order to avoid obscuring the embodiments.

[0081] EXAMPLES

[0082] Example 1.

[0083] This Example describes studies to validate methods of predicting the tissue of origin of a given single tumor sample (n-of-one case) based on RNA sequencing data.

[0084] Between 2% and 5% of all initial solid malignancy diagnoses are of unknown primary site of origin. These diagnoses are called Carcinomas of Unknown Primary (CUP), Carcinomas of Uncertain Primary (CUP), or Occult Primary Tumors (OPT). Often the site of origin remains undetermined for these malignancies. Tumor site of origin often informs therapeutic decisions because it may affect treatment efficacy and mechanisms of sensitivity and resistance to those treatments. Because these diagnoses usually occur in later tumor stages, they tend to be more aggressive tumors and therefore not knowing the site of origin can significantly hinder therapeutic progress and patient survival.

[0085] In addition to CUP cases, those diagnoses that do carry a site of origin determination can be incorrect. It is estimated that up to 5.3% of patients are misdiagnosed and would benefit from a corrected diagnosis. Molecular signature can be more informative than tumor histology. It is also the case that sometimes a tumor arising in one anatomical site is molecularly more similar to tumors arising in another anatomical site and the patient would benefit from a treatment based on that molecular similarity.

[0086] Finally, in case of a metastatic biopsy site it is sometimes difficult to tell the primary site of origin. Pathologists use histology as well as common sites of metastasis information to make this determination. It is also sometimes the case that metastatic tumors are initially diagnosed as primary tumors of the site of biopsy. Therefore, there is a great need for either confirming or correcting primary site diagnosis in such cases.

[0087] This Example provides methodology and software tools for computer-assisted site of origin diagnosis to:

[0088] confirm original pathology diagnosis

[0089] identify and correct inconsistencies with original pathology diagnosis

[0090] provide diagnosis in cases where no previous diagnosis was able to be made

[0091] Fig. 6 is a schematic that outlines aspects of the current invention, including the process for the validation and performance estimation of provided methodology and tools.

[0092] 1. Definitions:

1.1. CUP - carcinomas of unknown primary or carcinomas of uncertain primary

1.2. OPT - occult primary tumors

1.3. FFPE - formalin-fixed paraffin-embedded

1.4. FF - fresh frozen or flash frozen

1.5. Nant clinical - set of clinical tumor samples sequenced and processed by Nantomics, LLC of Culver City, California.

1.6. F1 - a first cohort of clinical and research samples (up until January 2016)

1.7. F2 - a second cohort of clinical samples (between January 2016 and April 2018)

1.8. TCGA - The Cancer Genome Atlas

1.9. PPV - positive predictive value

1.10. FDR - false discovery rate

1.11. GDC - The Genomics Data Commons

1.12. TPM - transcripts per million

1.13. HUGO - The HUGO Gene Nomenclature Committee is a committee of the Human Genome Organisation that sets the standards for human gene nomenclature. HUGO gene symbols are gene symbols standardized and approved by this committee.

1.14. SVM - support vector machine

1.15. TIN - transcript integrity number

1.16. TPM - transcripts per million, a measure of gene expression

1.17. Clinical curator - a person with sufficient clinical expertise and background to annotate cancer cases with tumor annotations from pathology reports

[0093] 2. Materials:

[0094] 2.1. Specimen Type:

[0095] RNA sequencing data from a combined cohort of FFPE and FF samples was used. FFPE samples came from the non-metastatic (as defined in 2.3.1) Nant clinical F1 cohort and FF samples came from the non-metastatic TCGA tumor cohort. TCGA data was downloaded in raw sequencing format from GDC and processed to produce TPM estimates per gene.

[0096] 2.2. Equipment:

[0097] Computer cluster, configured with the software described in the Bioinformatics Software Inventory.

[0098] 2.3. Data:

[0099] 2.3.1. Annotations:

[00100] Tumor submissions have several entries that can be used to label a tumor, for example a text-based cancer type and an optional ICD10 code.

[00101] The following fields were collected from the laboratory information management system, where present, for each candidate tumor sample: cancer type, pathology notes, ICD10, and ICD10 description. Most, but not all, tumors have both cancer type and ICD10 annotations, as shown in Fig. 7.

[00102] An aim was to remove annotation complications arising from metastatic tumors, which might have been entered by either the primary or secondary anatomic site. The cancer type, pathology notes, and ICD10 description fields were examined for the words “metastatic”, “metastasis”, or “secondary”, and samples containing any of these words were marked as metastatic. All metastatic samples were set aside for the clinical curator to go through clinical reports and fill in these annotations, for use in other validation.
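By way of non-limiting illustration, the keyword screen for metastatic samples may be sketched as follows. The three scanned fields and the three keywords come from the description above; the dictionary keys (`cancer_type`, `pathology_notes`, `icd10_description`), the case-insensitive whole-word matching, the function name `is_metastatic`, and the toy records are assumptions.

```python
import re

# Keywords named in the description; whole-word, case-insensitive matching
# is an assumption of this sketch.
METASTATIC_PATTERN = re.compile(r"\b(metastatic|metastasis|secondary)\b", re.IGNORECASE)

# Hypothetical field names for the three annotation fields that were examined.
FIELDS_TO_SCAN = ("cancer_type", "pathology_notes", "icd10_description")


def is_metastatic(sample):
    """Return True if any scanned field mentions one of the keywords."""
    return any(
        METASTATIC_PATTERN.search(sample.get(field) or "") for field in FIELDS_TO_SCAN
    )


toy_samples = [
    {"id": "S1", "cancer_type": "Breast Cancer",
     "pathology_notes": "Metastatic lesion in liver", "icd10_description": None},
    {"id": "S2", "cancer_type": "Colon Cancer",
     "pathology_notes": "", "icd10_description": "Malignant neoplasm of colon"},
    {"id": "S3", "cancer_type": "Unknown Primary",
     "icd10_description": "Secondary malignant neoplasm of lung"},
]

set_aside = [s["id"] for s in toy_samples if is_metastatic(s)]
remaining = [s["id"] for s in toy_samples if not is_metastatic(s)]
```

Under these assumptions, the samples in `set_aside` would go to the clinical curator and the samples in `remaining` would proceed to the annotation steps.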

[00103] With the remaining samples, the following steps were performed:

[00104] 1. For samples with an ICD10 field, but either without a "cancer type" or "cancer type" as Unknown Primary or Other Cancer, ICD10 was assigned as cancer type assignment.

[00105] 2. For samples with both cancer type and ICD10 fields, all possible pairings were examined for apparent mismatch between these two fields. Table 4 shows detailed mismatches. The following rates of mismatch were assessed:

[00106] a. F1 clinical: 5/351 (1.4%)

[00107] b. F1 research: 1/214 (0.5%)

[00108] c. F2 clinical: 30/866 (3.5%)

[00109] 3. During the process in the prior step, it was determined that Ampulla of Vater tumors ICD10 field was annotated as Biliary Tract (intrahepatic) cancer type in F2 samples, so that cancer type mapping in F1 was changed to map to Cholangiocarcinoma.

[00110] 4. The pathology notes field was examined and every note mentioning GIST was re-annotated as GIST.

[00111] 5. Some tumors had cancer type annotation as Oral and Throat Cancers (Including Thyroid). Each of these was assigned to either Head and Neck (C00-C14 ICD10), Thyroid (C73 ICD10), or flagged as needing a review by the clinical curator if the ICD10 code matched neither.

[00112] 6. Some of the tumors in the cohort were identified as a rare tumor type called Adenoid Cystic Carcinoma. However, this tumor type was not present in the cancer type annotations available for training or validation cohorts. Those tumors were not re-annotated as Adenoid Cystic Carcinoma; instead, those samples’ annotations were left as originally entered into the system.

[00113] 7. The samples flagged for future curation were excluded from the selected training and validation datasets. The flagged samples from step 2 and the metastatic samples kept aside came from these datasets:

[00114] a. F1 clinical: n = 27

[00115] b. F1 research: n = 1

[00116] c. F2 clinical: n = 107

[00117] 2.3.2. Genomic samples:

[00118] The prediction model was trained on RNA sequencing data from a combined cohort of non-metastatic Nant clinical F1 FFPE samples and non-metastatic TCGA tumor FF samples. TPM expression quantifications obtained by running the RSEM bioinformatics tool were used. Normalization techniques were applied to the TPM values, as described in the methodology description, to make TCGA samples more comparable to FFPE samples and to avoid batch effects in the final dataset. The training dataset includes 8,110 TCGA samples and 559 FFPE samples. The cancer types are shown in Figs. 8A and 8B.

[00119] As shown in Fig. 9, a cutoff of 30 samples per tissue was defined as a requirement to be included in the training dataset (N = 29; dashed line represents the count cutoff):

[00120] The 29 categories shown in Fig. 9 are composed of varying numbers of FFPE and TCGA samples as shown in the barplot of Fig. 9 (upper bars - FFPE samples, lower bars - TCGA samples).
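By way of non-limiting illustration, the per-tissue sample cutoff may be sketched as a simple count filter. The cutoff value of 30 comes from the description above; treating it as inclusive (at least 30 samples), the function name `tissues_meeting_cutoff`, and the toy label counts are assumptions.

```python
from collections import Counter


def tissues_meeting_cutoff(labels, min_samples=30):
    """Return {tissue: count} for tissues with at least min_samples samples."""
    counts = Counter(labels)
    return {tissue: n for tissue, n in counts.items() if n >= min_samples}


# Toy cohort: one tissue label per sample (hypothetical counts).
toy_labels = ["lung"] * 45 + ["thyroid"] * 30 + ["ampulla"] * 12

kept = tissues_meeting_cutoff(toy_labels)
# Only samples whose tissue passed the cutoff remain in the training set.
training_labels = [tissue for tissue in toy_labels if tissue in kept]
```

In this toy cohort the category with 12 samples is excluded and the two remaining categories are retained for training.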

[00121] 2.4. Supplies:

[00122] Refer to Clinical and Laboratory Standards Institute (CLSI), “Application of a Quality Management System Model for Laboratory Services”, 3rd Edition, 2004, College of American Pathologists Checklists, and Hardeep Singh, Traber Davis Giardina, Ashley N.D. Meyer, Samuel N. Forjuoh, Michael D. Reis, and Eric J. Thomas. "Types and Origins of Diagnostic Errors in Primary Care Settings", JAMA Intern Med. 173(6), 2014 for supplies required for producing FFPE blocks and extracting DNA from FFPE and blood samples.

[00123] 2.5. Reagents:

[00124] Refer to Clinical and Laboratory Standards Institute (CLSI), “Application of a Quality Management System Model for Laboratory Services”, 3rd Edition, 2004; College of American Pathologists Checklists; and Hardeep Singh, Traber Davis Giardina, Ashley N.D. Meyer, Samuel N. Forjuoh, Michael D. Reis, and Eric J. Thomas, "Types and Origins of Diagnostic Errors in Primary Care Settings", JAMA Intern Med. 173(6), 2014, for reagents required for producing FFPE blocks and extracting DNA from FFPE and blood samples.

[00125] 3. Quality:

[00126] 3.1. Quality metrics

[00127] 3.1.1. Performance of this methodology was evaluated using overall prediction accuracy, PPV, sensitivity, specificity, and FDR metrics.
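The metrics named in 3.1.1 can be illustrated with a minimal sketch. The source does not give formulas, so the standard one-vs-rest definitions are assumed here; the function name `one_vs_rest_metrics` and the toy labels are hypothetical and are not taken from the described system.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, category):
    """Accuracy, PPV, sensitivity, specificity, and FDR for one category,
    treating that category as positive and every other category as negative."""
    tp = fp = tn = fn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if prediction == category and truth == category:
            tp += 1
        elif prediction == category:
            fp += 1
        elif truth == category:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "fdr": ratio(fp, tp + fp),
    }


# Toy labels for illustration only.
true = ["Lung", "Lung", "Breast", "Colon", "Breast", "Colon"]
pred = ["Lung", "Breast", "Breast", "Colon", "Breast", "Lung"]
lung = one_vs_rest_metrics(true, pred, "Lung")
```

Under these definitions FDR is the complement of PPV, so reporting both is redundant but matches the list of metrics given above.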

[00128] 4. Procedure:

[00129] 4.1. Sample preparation

[00130] 4.1.1. Tumor acquisition: Tumor samples were acquired.

[00131] 4.1.2. Generation of FFPE blocks and sections for tumor samples.

[00132] 4.1.3. Isolation of genomic RNA from these samples.

[00133] 4.1.3.1. RNA was extracted from FFPE material.

[00134] 4.1.3.2. Ribosomal RNA was degraded, and stranded cDNA was created with the Kapa Stranded RNA-seq Kit with RiboErase.

[00135] 4.1.3.3. Multiple libraries per sample were sequenced using standard Illumina sequencing.

[00136] 4.2. Sequencing

[00137] 4.2.1. RNA-sequencing was performed.

[00138] 4.3. Transcript quantification bioinformatics pipeline

[00139] 4.3.1. Bowtie2, RSEM, and custom software were used for alignment and transcript quantification.

[00140] 4.3.2. TPM quantifications for protein coding genes were extracted from RSEM output files by summing the TPM quantifications of all transcripts for each HUGO symbol, restricted to those symbols that had at least one NM_ transcript.
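A minimal sketch of the gene-level aggregation in 4.3.2 follows. It assumes the transcript-level quantifications have already been parsed into (transcript ID, HUGO symbol, TPM) tuples; that input shape, the function name `gene_level_tpm`, and the placeholder identifiers are illustrative assumptions, not the actual pipeline.

```python
from collections import defaultdict


def gene_level_tpm(transcripts):
    """Sum TPM over all transcripts of each symbol, keeping only symbols
    that have at least one transcript whose ID starts with 'NM_'."""
    totals = defaultdict(float)
    symbols_with_nm = set()
    for transcript_id, symbol, tpm in transcripts:
        totals[symbol] += tpm
        if transcript_id.startswith("NM_"):
            symbols_with_nm.add(symbol)
    return {symbol: totals[symbol] for symbol in totals if symbol in symbols_with_nm}


# Placeholder identifiers, not real accessions.
rows = [
    ("NM_0000001", "GENE_A", 12.0),
    ("NR_0000002", "GENE_A", 3.0),   # counted, because GENE_A also has an NM_ transcript
    ("NR_0000003", "GENE_B", 0.5),   # dropped, because GENE_B has no NM_ transcript
]
genes = gene_level_tpm(rows)
```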

[00141] 4.3.3. A per-sample rescaling procedure was applied to the sample’s extracted TPM quantifications in order to normalize each sample and minimize variability across samples due to technical artifacts. Each TPM value was rescaled to force the sum of per-sample TPMs to 1 million.
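The rescaling in 4.3.3 amounts to multiplying every value in a sample by one common factor. A minimal sketch, with the hypothetical name `rescale_tpm` and toy values:

```python
def rescale_tpm(tpm_by_gene, target_sum=1_000_000.0):
    """Rescale one sample's TPM values so that they sum to target_sum."""
    total = sum(tpm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expression signal to rescale")
    factor = target_sum / total
    return {gene: value * factor for gene, value in tpm_by_gene.items()}


sample = {"GENE_A": 300.0, "GENE_B": 100.0, "GENE_C": 0.0}
rescaled = rescale_tpm(sample)
```

Rescaling is needed because dropping the transcripts excluded in 4.3.2 leaves each sample's retained TPM values summing to less than 1 million, and by a different amount per sample.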

[00142] 4.4. Mapping of external datasets into Nantomics space bioinformatics pipeline

[00143] 4.4.1. A quantile normalization procedure (James H Bullard, Elizabeth Purdom, Kasper D Hansen and Sandrine Dudoit. “Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments”, BMC Bioinformatics, 11(94), 2010) was applied to TCGA FF samples, with the FFPE samples' expression quantiles as the target distribution. In order to control for tissue distribution differences between the two datasets, the TCGA dataset was divided into two subsets. The first subset had either the same or smaller proportions of individual tissues as the Nant FI dataset. The second subset contained the remaining TCGA samples. The first TCGA subset was quantile normalized by mapping per-gene quantiles between TCGA and FFPE data, with exclusion of at-zero expression from these distributions. The second TCGA subset was normalized by using the first subset's distributions to compute quantiles and then mapping those quantiles.
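The per-gene quantile mapping in 4.4.1 can be sketched as follows for a single gene. This is an illustration under stated assumptions rather than the actual pipeline: the empirical quantile is taken from the rank within the non-zero reference values, the target quantile function is linearly interpolated, and zeros are left at zero. The function name and toy values are hypothetical.

```python
from bisect import bisect_right


def quantile_map_gene(values, reference, target):
    """Map one gene's values onto the target distribution.

    Quantiles are computed against `reference` (the first TCGA subset in the
    text) and looked up in `target` (FFPE). Zeros are excluded from both
    distributions and stay at zero.
    """
    ref = sorted(v for v in reference if v > 0)
    tgt = sorted(v for v in target if v > 0)
    if not ref or not tgt:
        return list(values)  # nothing to map against; leave unchanged

    def target_at(q):
        # Linear interpolation of the target's empirical quantile function.
        pos = q * (len(tgt) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(tgt) - 1)
        return tgt[lo] + (pos - lo) * (tgt[hi] - tgt[lo])

    mapped = []
    for v in values:
        if v <= 0:
            mapped.append(0.0)
            continue
        rank = bisect_right(ref, v)
        q = (rank - 1) / (len(ref) - 1) if len(ref) > 1 else 0.5
        mapped.append(target_at(min(max(q, 0.0), 1.0)))
    return mapped


tcga_first = [0.0, 10.0, 20.0, 30.0]   # toy values for one gene
ffpe = [0.0, 1.0, 2.0, 3.0, 0.0]
first_subset = quantile_map_gene(tcga_first, tcga_first, ffpe)
# The second subset reuses the first subset's distribution to compute quantiles.
second_subset = quantile_map_gene([15.0, 40.0], tcga_first, ffpe)
```

Passing the first subset as its own reference reproduces the first normalization step, and passing it as the reference for the remaining samples reproduces the second.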

[00144] 4.5. Metastatic samples in training data

[00145] 4.5.1. For metastatic samples (n = 18) in the training data, the site of origin, and not the site of biopsy/metastasis, was used for the training labels of these samples.

[00146] 4.6. Tumor subtype categories

[00147] 4.6.1. Several tumor types have molecularly divergent subtypes. Information about the subtype is diagnostically, prognostically, and clinically important. Therefore, extra steps were taken to introduce subtype labels for the following tumor types: breast (basal or non-basal), esophageal (adenocarcinoma or squamous), and lung (adenocarcinoma or squamous). These subtypes were readily available for TCGA samples, as they were assigned by pathologists during data collection. These subtypes were not always available for FFPE clinical data. Therefore, a computational step was developed to predict tumor subtypes on samples for these three tumor types.

[00148] 4.6.2. The following is the description of the prediction model for subtype labels:

[00149] 4.6.2.1. For each tumor type, samples were collected for which subtype labels were available (only TCGA samples had these labels).

[00150] 4.6.2.2. For each subtype, a centroid expression vector was computed (mean expression for each gene across that subtype cohort of samples).

[00151] 4.6.2.3. For each unlabeled sample, Spearman correlation was computed between that sample’s expression profile and the centroid of each subtype available for the sample’s tumor type.

[00152] 4.6.2.4. The subtype label was assigned according to which subtype centroid had the highest correlation to this sample.
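Steps 4.6.2.1 through 4.6.2.4 can be sketched in plain Python as below. The Spearman correlation is computed as a Pearson correlation on average ranks; the function names and the four-gene toy profiles are hypothetical and only illustrate the centroid-and-correlation idea.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def centroid(profiles):
    """Mean expression of each gene across a subtype's samples."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def assign_subtype(sample, labeled_profiles):
    """Return the subtype whose centroid correlates best with the sample."""
    centroids = {name: centroid(p) for name, p in labeled_profiles.items()}
    return max(centroids, key=lambda name: spearman(sample, centroids[name]))


# Toy profiles over four hypothetical genes.
labeled = {
    "basal": [[9.0, 1.0, 5.0, 2.0], [8.0, 2.0, 6.0, 1.0]],
    "non-basal": [[1.0, 9.0, 2.0, 6.0], [2.0, 8.0, 1.0, 5.0]],
}
predicted = assign_subtype([7.0, 0.5, 4.0, 1.0], labeled)
```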

[00153] 4.6.3. This model was validated using the leave-one-out (LOO) method.

[00154] 4.6.3.1. For each sample with a subtype label (only TCGA samples had these labels), that sample was kept out of the training cohort, and the training and prediction steps outlined in 4.6.2 were repeated. The predicted label was then compared to the true label. Overall prediction accuracy was computed for a given tumor subtype based on these comparisons of predicted labels to true labels for all samples.

[00155] 4.6.3.2. The LOO-based accuracy of subtype labeling was computed, as depicted in Table 1:

Table 1
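The leave-one-out procedure of 4.6.3 can be sketched generically. To keep the example short, the classifier passed in is a nearest-centroid rule using squared Euclidean distance; this is a stand-in and not the Spearman-correlation rule described in 4.6.2. All names and values are hypothetical.

```python
def leave_one_out_accuracy(samples, classify):
    """samples: list of (profile, label); classify(training, profile) -> label."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # hold this one sample out
        if classify(training, profile) == label:
            correct += 1
    return correct / len(samples)


def nearest_centroid(training, profile):
    """Stand-in classifier: closest subtype centroid by squared distance."""
    groups = {}
    for other_profile, label in training:
        groups.setdefault(label, []).append(other_profile)

    def distance(label):
        center = [sum(col) / len(groups[label]) for col in zip(*groups[label])]
        return sum((a - b) ** 2 for a, b in zip(profile, center))

    return min(groups, key=distance)


samples = [
    ([9.0, 1.0], "adenocarcinoma"),
    ([8.0, 2.0], "adenocarcinoma"),
    ([1.0, 9.0], "squamous"),
    ([2.0, 8.0], "squamous"),
]
accuracy = leave_one_out_accuracy(samples, nearest_centroid)
```

Because the centroids are recomputed without the held-out sample on every iteration, the reported accuracy is not inflated by a sample contributing to its own centroid.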

[00156] 4.7. Reportable Range

[00157] 4.7.1. Categories on which the model predicts are shown in the table of category counts in Table 2. Categories that did not have enough validation samples (at least 10 were required, n >= 10) were still trained on, but robust validation of those categories is not possible at this point (marked with * in the barplot of Fig. 10).

Table 2

[00158] 4.7.2. Final categories for site of origin prediction were further reviewed and finalized into the final N = 27 categories (down from initial N = 29).

[00159] 4.7.2.1. The findings indicated that Lung Squamous and Lung Adenocarcinoma subtypes of Lung cancer were not easily separated by the prediction model. Therefore, those two categories were combined into a single category labeled “Lung” for the final prediction model.

[00160] 4.7.2.2. Most of the training Lymphoma samples were of Non-Hodgkin’s subtype. There were not enough samples to separate this subtype from other subtypes of Lymphoma. Therefore, all Lymphoma samples were combined into a single category labeled “Lymphoma”.

[00161] 4.7.2.3. Adrenal and Pheochromocytoma and Paraganglioma tumor types are molecularly similar and were not well separated by the prediction model. Therefore, these two tumor types were combined into a single category labeled “Adrenal/PCPG”.

[00162] 4.7.3. The training dataset labels were checked through a blind review of those samples that the model mis-predicted in preliminary results. Patient and RNA-Seq UUIDs for all 44 mis-predicted samples and for 29 correctly predicted samples (selected at random) were provided to a reviewer who was blinded to the prediction status and current tissue annotations. The reviewer then used the patient/RNA-Seq UUIDs to find as much information as possible about each sample (utilizing pathology reports, order information, and any other documentation that could be found) and made their own annotation of the site of origin for the sample. The reviewer’s annotation was then compared with the cancer type annotation and predicted site of origin for each tumor to assess:

[00163] 4.7.3.1. How often the annotated site of origin matched the blind review guess in those samples that were mis-predicted as well as in samples that were predicted correctly.

[00164] 4.7.3.2. How often the predicted site of origin matched the blind review guess in those samples that were mis-predicted as well as in samples that were predicted correctly.

[00165] 4.7.3.3. How often the predicted site of origin matched neither the annotated cancer type nor the review guess.

[00166] 4.7.3.4. Once completed, training samples where the annotated site of origin was incorrect were re-annotated/corrected. These samples were labeled with one of the N = 27 final categories or were taken out of the training set if such labeling was impossible or questionable.

[00167] 4.7.4. The site of origin prediction model scored a test sample for each of the potentially reportable sites. The next section enumerates which sites were reportable, based on availability of training and testing data. The number of potential tissue sites with positive model scores determined what was reported:

[00168] A single predicted tissue, when the model returned a single positive score across tissues.

[00169] More than one site of origin, if the model returned multiple positive scores.

[00170] No predicted tissue, when no sites had a positive score. In this case, a "best match" was listed, identifying the tissue with the highest score (still negative) for one of the possible predicted categories in the table above.

[00171] 4.7.5. In situations where the true tissue type of the n-of-one sample being tested was not present in the N=27 labels of the training dataset, the method would not be able to predict that tissue. It would only be able to predict the most molecularly similar tissue type among the 27 tissue labels. In this case, the method did one of the following things:

[00172] 4.7.5.1. Predict another molecularly closely related tissue (e.g. Gastric (Stomach) for Gastrointestinal Stromal Tumor (GIST)).

[00173] 4.7.5.2. Output no predicted tissue, while selecting another molecularly closely related tissue as the best guess.

[00174] 4.8. Model training and implementation details

[00175] 4.8.1. A Support Vector Machine (SVM) prediction model was trained on a subset of informative gene expression features to report on up to N tissue sites, where N is the number of tissues available in the training data (N = 27 here). Informative genes were defined as the 3,000 most varying genes across all training samples. The N = 27 tissues were selected by applying a cutoff of 30 samples per category to the list of all tissue categories available in the RNA-Seq dataset.
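The feature selection in 4.8.1 can be sketched as ranking genes by their spread across training samples. The source says only "most varying", so the use of population variance on the normalized values is an assumption, as are the function name and the toy data (which selects 2 genes rather than 3,000).

```python
from statistics import pvariance


def most_variable_genes(expression, n_genes):
    """expression: dict mapping gene -> list of values across training samples.
    Returns the n_genes genes with the largest variance, most variable first."""
    ranked = sorted(expression, key=lambda gene: pvariance(expression[gene]), reverse=True)
    return ranked[:n_genes]


expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],    # variance 0
    "GENE_B": [0.0, 10.0, 0.0, 10.0],  # variance 25
    "GENE_C": [4.0, 6.0, 4.0, 6.0],    # variance 1
}
selected = most_variable_genes(expression, n_genes=2)
```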

[00176] 4.8.2. Input data for model training

[00177] 4.8.2.1. Gene expression values processed and normalized as described in 4.3 and 4.4 were used from 8903 samples (564 FFPE and 8339 TCGA samples) for 3,000 genes described in 4.8.1 as input training data.

[00178] 4.8.2.2. Tissue category labels (27 unique categories) for each of these samples were used as training labels.

[00179] 4.8.3. Implementation specifics

[00180] 4.8.3.1. The Python programming language was used to implement model training as well as application of the trained model.

[00181] 4.8.3.2. The sklearn.svm.LinearSVC() function from scikit-learn, a free machine learning and artificial intelligence Python library, was used. This function was called to fit a multi-label predictor using the input data and input labels described in 4.8.2.

[00182] 4.8.3.2.1. Internally, scikit-learn implemented this multi-label fitting as fitting N individual models (N = 27 in this case), one for each tissue label.

[00183] 4.8.3.2.2. scikit-learn returned a matrix of feature weights, where the dimensions represent number of features by number of unique labels. In this case, the matrix was 3000 x 27, where 3000 was the number of gene features used in training and 27 was the number of tissue categories in the training data. This matrix was saved into a flat text file and used when applying the model to new samples.

[00184] 4.8.3.2.3. sklearn.svm.LinearSVC() arguments:

[00185] 4.8.3.2.3.1. multi_class='ovr'

[00186] 4.8.3.2.3.2. class_weight='balanced'

[00187] 4.8.3.2.3.3. random_state=0

[00188] 4.8.4. Model training description

[00189] 4.8.4.1. Stratified 5-fold cross validation was performed. Each fold incorporated every tissue category to ensure training and prediction on every category. For each fold, the model was trained on the remaining 4 folds and tested on the held-out fold. This training-testing process was performed 5 times (once for each fold).

[00190] 4.8.4.2. For every fold, prediction results were collected for that fold, from which all the quality metrics outlined in 3.1 could be computed for every fold. This allowed computation of the mean for each of the metrics across 5 folds.

[00191] 4.8.4.3. The final model was trained using all the input data. The model was saved to a flat file as described in 4.8.5.
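The training scheme of 4.8.3 and 4.8.4 can be sketched with scikit-learn as below. The LinearSVC arguments are the three listed in 4.8.3.2.3; everything else is an assumption for illustration: the synthetic data (3 classes, 10 features, standing in for 27 tissues and 3,000 genes), shuffling inside `StratifiedKFold`, and accuracy as the only per-fold metric.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic, well-separated classes standing in for tissue expression profiles.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 3, 20, 10
centers = rng.normal(scale=5.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)


def make_model():
    # Arguments as listed in 4.8.3.2.3; all other settings are library defaults.
    return LinearSVC(multi_class="ovr", class_weight="balanced", random_state=0)


# Stratified 5-fold cross-validation: every fold contains every class.
fold_accuracy = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    model = make_model().fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_accuracy.append(float(np.mean(predictions == y[test_idx])))
mean_accuracy = sum(fold_accuracy) / len(fold_accuracy)

# Final model trained on all input data.
final_model = make_model().fit(X, y)
weights = final_model.coef_.T  # transposed to features x labels
```

In scikit-learn the fitted `coef_` attribute has shape (labels, features) for more than two classes, so the features-by-labels layout described in 4.8.3.2.2 corresponds to its transpose. The fitted estimator also carries an `intercept_` term; the text mentions only the weights, so whether an intercept was stored or applied is not stated.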

[00192] 4.8.5. Applying the model to new samples

[00193] 4.8.5.1. As was described in 4.8.3, the final model was saved as a flat text file. Specifically, feature/gene weights were saved for each of the tissue labels.

[00194] 4.8.5.2. When applying the model to new samples, a dot product was computed between that sample’s gene expression (only those genes used by the model) and the gene weights for each tissue category. These dot products were the prediction scores for each of the tissue labels. The final predictions were derived from these prediction scores as follows:

[00195] 4.8.5.2.1. If a single positive score was present, then the prediction was returned as that tissue label.

[00196] 4.8.5.2.2. If multiple positive scores were present then multiple predictions were returned as those tissue labels.

[00197] 4.8.5.2.3. If no positive scores were present then the highest score (negative score) was used as the best guess for the tissue label.
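The scoring and reporting rules of 4.8.5.2 can be sketched in plain Python. The weight values, gene names, and function names are hypothetical; treating a gene missing from the sample as zero expression is an assumption, and no intercept term is applied because the text describes only a dot product with the saved weights.

```python
def score_sample(expression, weights):
    """Dot product of a sample's expression with each tissue's gene weights.

    expression: dict gene -> value; weights: dict tissue -> dict gene -> weight.
    """
    return {
        tissue: sum(w * expression.get(gene, 0.0) for gene, w in gene_weights.items())
        for tissue, gene_weights in weights.items()
    }


def report(scores):
    """Apply the three reporting rules to a dict of tissue -> score."""
    positive = sorted((t for t, s in scores.items() if s > 0), key=lambda t: -scores[t])
    if positive:
        # One positive score: a single prediction; several: all are returned.
        return {"predicted": positive, "best_guess": None}
    # No positive score: the highest (still negative) score is the best guess.
    return {"predicted": [], "best_guess": max(scores, key=scores.get)}


weights = {
    "Lung": {"GENE_A": 0.5, "GENE_B": -1.0},
    "Breast": {"GENE_A": -0.5, "GENE_B": 0.25},
}
confident = report(score_sample({"GENE_A": 4.0, "GENE_B": 1.0}, weights))
uncertain = report(score_sample({"GENE_A": 2.0, "GENE_B": 2.0}, weights))
```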

[00198] 4.9. Acceptable performance

[00199] 4.9.1. Performance of each tissue site category was evaluated separately, and each category was binned into one of the following groups based on the category’s final performance:

[00200] 4.9.1.1. High confidence category

[00201] 4.9.1.2. Low confidence category

[00202] 4.9.2. A high confidence tissue category was any tissue that had a final accuracy of at least 95% and PPV of at least 95%.

[00203] 4.9.3. A low confidence tissue category was any tissue that had a final accuracy of below 95% or PPV of below 95%.
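The binning rule of 4.9.2 and 4.9.3 reduces to a single comparison per category. A minimal sketch, with a hypothetical function name:

```python
def confidence_category(accuracy, ppv, threshold=0.95):
    """High confidence requires both accuracy and PPV of at least the threshold;
    anything else is low confidence."""
    return "high" if accuracy >= threshold and ppv >= threshold else "low"


both_above = confidence_category(0.97, 0.96)
ppv_below = confidence_category(0.97, 0.90)
at_threshold = confidence_category(0.95, 0.95)
```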

[00204] 4.10. Other evaluation references

[00205] 4.10.1. CGI TOO (Tissue of Origin test from Cancer Genetics Inc.)

[00206] 4.10.1.1. www.cancergenetics.com/laboratory-services/specialty-tests/too-tissue-of-origin-test

[00207] 4.10.1.2. Is the only FDA approved clinical test currently on the market. FDA approved for 15 tissue types.

[00208] 4.10.1.3. Published PPV is 89% across 462 FFPE validation samples for all tissues; however, it varies within individual tissues.

[00209] 4.10.2. CancerTYPE ID by Biotheranostics, Inc.

[00210] 4.10.2.1. www.cancertypeid.com

[00211] 4.10.2.2. Not FDA approved at this time. Distinguishes 50 cancer types.

[00212] 4.10.2.3. Published PPV is 83% and published specificity is on 790 validation samples.

[00213] [00214] 4.11. For each validation study, overall as well as per-tissue accuracy, PPV, and FDR were reported. For those samples that were mis-predicted, a blind reviewer (a member of the clinical or research staff) was assigned to review each sample’s information (pathology report, order information, and any other documentation available for the sample) and guess the sample’s site of origin based on that information. Both the annotated and predicted sites of origin were compared to that blind reviewer’s evaluation to assess whether the prediction should be interpreted as correct or incorrect.

[00215] 4.12. Overview of validation studies

[00216] 4.12.1. Cross-validation training step

[00217] 4.12.1.1. Overall as well as per-tissue accuracy, PPV, and FDR for 5-fold cross-validation results on training data (reporting for all as well as FFPE samples only) were reported.

[00218] 4.12.2. Held-out validation samples

[00219] 4.12.2.1. FFPE samples

[00220] 4.12.2.1.1. Overall as well as per-tissue accuracy, PPV, and FDR for held out F2 dataset were reported.

[00221] 4.12.2.1.2. A subset of the samples for which predictions did not match the annotated tissue of origin was sent back to the pathology lab for further verification.

[00222] 4.12.2.2. TCGA samples

[00223] 4.12.2.2.1. Overall as well as per-tissue accuracy, PPV, and FDR for held out 20% of TCGA dataset were reported.

[00224] 4.12.3. Metastatic samples

[00225] 4.12.3.1. FFPE samples

[00226] 4.12.3.1.1. Overall as well as per-tissue accuracy, PPV, and FDR for held out FI and F2 metastatic samples were reported.

[00227] 4.12.3.1.2. Whether the methodology of the current invention better predicts the site of origin or the site of biopsy was analyzed.

[00228] 4.12.3.1.2.1. The following relationships were identified:

[00229] 4.12.3.1.2.1.1. Between tumor purity and prediction evaluation metrics. For this analysis, computationally derived tumor purity values, obtained with a method based on allele frequencies, were used.

[00230] 4.12.3.1.2.1.2. Between transcript integrity number (TIN), a post-sequencing computationally derived proxy for a sample’s RNA quality, and prediction evaluation metrics.

[00231] 4.12.3.2. TCGA samples

[00232] 4.12.3.2.1. Overall as well as per-tissue accuracy, PPV, and FDR for held out TCGA metastatic samples were reported.

[00233] 4.12.4. Other Nantomics samples

[00234] 4.12.4.1. Site of origin predictions were run on Nantomics samples of rare cancer types, which were not included in the training data (e.g. vaginal). These samples were not included in model training because of an insufficient number of samples in the tumor type.

[00235] 4.12.4.2. Site of origin predictions were run on other Nantomics samples that were not included in the above listed validation sets. These samples were of unknown cancer type for the following reasons:

[00236] 4.12.4.2.1. Annotated as Other Cancers with no indication of cancer type in ICD10 field.

[00237] 4.12.4.2.2. Annotated as CUP.

[00238] 4.12.4.2.3. Un-annotated as any cancer type with no indication of cancer type in ICD10 field.

[00239] 4.12.4.3. Analysis of clinical and pathology reports for a subset of predictions was performed to verify that predictions were consistent with those reports.

[00240] 5. Results of validation studies

[00241] Validation studies described in section 4 were performed. Below are summaries of the results of the studies.

[00242] 5.1. Prediction results

[00243] This sub-section outlines the comparison of predicted labels, based on the model described in the previous sections, to the annotated labels in the samples for each validation study. Those predictions with positive prediction scores were denoted as high confidence predictions. Prediction rate was computed as the fraction of high confidence predictions.
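The three summary figures reported in the following sub-sections (high confidence prediction rate, accuracy within high confidence predictions, and accuracy within all predictions) can be sketched as below. The input layout, one (true label, predicted label, top score) tuple per sample, and the toy values are assumptions for illustration.

```python
def summarize(results):
    """results: list of (true_label, predicted_label, top_score) per sample.
    A prediction counts as high confidence when its score is positive."""
    high = [row for row in results if row[2] > 0]

    def accuracy(rows):
        if not rows:
            return float("nan")
        return sum(truth == predicted for truth, predicted, _ in rows) / len(rows)

    return {
        "prediction_rate": len(high) / len(results),
        "accuracy_high_confidence": accuracy(high),
        "accuracy_all": accuracy(results),
    }


results = [
    ("Lung", "Lung", 1.2),
    ("Breast", "Breast", 0.4),
    ("Colon", "Lung", 0.1),
    ("Colon", "Breast", -0.3),  # best guess only: no positive score
]
summary = summarize(results)
```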

[00244] 5.1.1. Cross-validation training results

[00245] 5-fold cross-validation results were evaluated for the samples in the training dataset.

[00246] 5.1.1.1. FFPE samples

[00247] The following overall accuracy estimates were computed for 559 FFPE samples in the training dataset:

[00248] · High confidence prediction rate: 0.919

[00249] · Prediction accuracy within high confidence predictions: 0.971 +/- 0.011

[00250] · Prediction accuracy within all predictions: 0.941 +/- 0.016

[00251] Fig. 11 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.

[00252] Per-tissue accuracy (A) and PPV (B) were analyzed, depicted in Figs. 12A and 12B. The points in the plots represent point estimates of each metric. Confidence intervals indicate 95th percentile binomial distribution confidence interval.
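The source does not state which binomial interval was used for these plots. As one common choice, a Wilson score interval at the 95% level is sketched below; the choice of method, the function name, and the example counts are assumptions and may differ from what produced the figures.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    p = successes / trials
    denominator = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denominator
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denominator
    return center - half_width, center + half_width


# Example: 45 correct predictions out of 50 samples of one tissue.
low, high = wilson_interval(45, 50)
```

Intervals of this kind widen quickly for categories with few samples, which is consistent with the caution above about categories that lack enough validation samples.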

[00253] Out of 17 metastatic FFPE samples in the training dataset, 15 predicted the site of origin correctly and 2 were incorrectly predicted, 1 with a high confidence score and 1 with low confidence score.

[00254] The first incorrectly predicted sample was a colorectal liver metastasis, which was predicted to come from pancreas. In this case, neither the site of origin nor the site of metastasis was predicted. It is, however, possible that this sample was initially mis-annotated and that the molecularly derived site is the more correct one.

[00255] The second incorrectly predicted sample was a cystic carcinoma metastasized to lymph nodes, predicted to come from lung with a low confidence score.

[00256] 5.1.1.2. TCGA samples

[00257] The following overall accuracy estimates were computed for 8,110 TCGA samples in the training dataset:

[00258] · High confidence prediction rate: 0.993

[00259] · Prediction accuracy within high confidence predictions: 0.997 +/- 0.001

[00260] · Prediction accuracy within all predictions: 0.996 +/- 0.001

[00261] Fig. 13 shows the confusion matrix summarizing the results for all predictions (both high and low confidence). The true labels are on the x-axis and the predicted labels are on the y-axis.

[00262] Per-tissue accuracy (A) and PPV (B) were analyzed, with the results shown in Figs. 14A and 14B. The points in the plots represent point estimates of each metric. Confidence intervals indicate 95th percentile binomial distribution confidence interval.

[00263] Out of 318 metastatic TCGA samples in the training dataset, 317 were predicted correctly and 1 non-basal breast sample was predicted to come from ovarian tissue.

[00264] 5.1.2. Held out validation samples

[00265] Herein are the results of the comparison of annotated tissue type and predicted tissue type, based on the model described in the previous sections, on the samples that were held out and not seen by the model. These samples only included those with annotations from the same tissue types as the model outputs. Samples annotated as a tissue type not represented by the prediction model are not described here.

[00266] 5.1.2.1. FFPE samples

[00267] Figure 15 shows the FFPE sample counts and categories in this validation study (total n = 902).

[00268] The comparison of true labels to the predicted labels is summarized in the confusion matrix shown in Fig. 16. The true labels are on the x-axis and the predicted labels are on the y-axis.

[00269] Per-tissue accuracy and PPV derived from Fig. 16 were analyzed, as depicted in Table 3. The overall accuracy of the predictions in this dataset is 0.79.

Table 3

[00270] 5.1.2.2. TCGA samples

[00271] Figure 17 shows the summary of the comparison of the predicted labels to true labels after applying the prediction model to 2075 TCGA validation samples. The true labels are on the x-axis and the predicted labels are on the y-axis.

[00272] 5.1.3. Other FFPE samples

[00273] The prediction model was run on FFPE samples annotated by a tissue type that was either not present in one of the model outputs or for which not enough validation samples were available, in order to test edge cases of when the sample’s site of origin was not present in the training data. The comparison of true labels to the predicted labels is summarized in the confusion matrix shown in Fig. 18. The true labels are on the x-axis and the predicted labels are on the y-axis.

[00274] The following mismatches between cancer type (on the left) and ICD10 codes (ICD10 description on the right) were found in the FFPE cohort (includes FI clinical, FI research, and F2 clinical; excludes metastatic samples):

Table 4

[00275] Example 2

[00276] Figure 19 shows the results of a successful prediction of the model described in Example 1. The process of Example 1 successfully identified metastatic colon adenocarcinoma as being colon cancer. Figure 19A shows the locations of molecular signature groupings of many cancers, with the patient’s tumor depicted with a star and the most similar molecular signatures depicted as circles. Figure 19B shows the correlations and certain details for the most similar molecular signatures in the TCGA dataset, or those depicted as circles in Fig. 19A. Figure 19C shows the distributions of true positives and negatives and false positives and negatives from the process of Example 1. The dashed line indicates the score of the patient’s tumor, demonstrating that it was above any previously observed false positive prediction scores.

[00277] Example 3

[00278] Figure 20 shows the results of a successful prediction of the model described in Example 1. Figure 20 shows the locations of molecular signature groupings of many cancers, with the patient’s tumor depicted with a star and the most similar molecular signatures depicted as circles. The process of Example 1 successfully identified pediatric glioma, which was not in the training dataset, as similar to adult brain cancers. This Example demonstrates that the processes of the current invention are capable of identifying tumors that are unknown to the prediction models of the invention. The implications of this include the models’ abilities to encounter new and even unknown (to the model and/or to medicine in general) tumors and identify their origin and similarity to other tumors. Amongst other advantages, those abilities would facilitate diagnosis, for example in identifying that a tumor has metastasized, and treatment, for example by helping with the selection of drugs that work on similar tumors. These aspects of the invention, unexpectedly, can outperform traditional diagnosis and evaluation, for example by doctors, as the processes of the invention can identify novel and rare tumors that can be difficult or impossible to diagnose by other means.

[00279] Example 4

[00280] Figure 21 shows the results of a prediction in which the model matched the tumor to a tumor type different from its annotation. In this case, the model matched the tumor with basal breast cancer, while the tumor had been annotated as an adenoid cystic carcinoma. Adenoid cystic carcinomas are rare cancers and were not in the training dataset. The model here demonstrates that it can make connections between tumor types that can inform treatment and lead to better outcomes, superior to what traditional methods might accomplish. In this case, the molecular similarity between the adenoid cystic carcinoma and basal breast cancer indicates that certain treatments for basal breast cancer could be used for the treatment of the patient’s adenoid cystic carcinoma. This surprising finding demonstrates that the model can outperform other methods of diagnosis and inform treatment options that are appropriate for tumors in a manner superior to traditional diagnosis.

[00281] It should be appreciated that any combination of authentication processes depicted and described herein can be performed without departing from the scope of the present disclosure. Alternatively or additionally, any number of other authentication processes can be developed by combining various portions or sub-steps of the described authentication processes without departing from the scope of the present disclosure.

[00282] Specific details were given in the description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, well-known circuits, processes, algorithms, structures, and techniques have been shown without unnecessary detail in order to avoid obscuring the embodiments.