

Title:
METHYLATION BIOMARKER SELECTION APPARATUSES AND METHODS
Document Type and Number:
WIPO Patent Application WO/2023/052917
Kind Code:
A1
Abstract:
Methylation biomarker selection apparatuses and methods are provided. A methylation biomarker selection apparatus stores a plurality of first data sets and a plurality of second data sets, wherein each of the first data sets includes a plurality of methylation degrees corresponding to a plurality of methylation loci, and each of the second data sets includes at least one medical record. The methylation biomarker selection apparatus determines a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees, determines a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets, and determines a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers.

Inventors:
PAI TUN-WEN (TW)
LAI YI-HSUAN (TW)
CHEN SHU-JEN (TW)
SU FANG-CHENG (TW)
Application Number:
PCT/IB2022/058985
Publication Date:
April 06, 2023
Filing Date:
September 22, 2022
Assignee:
ACT GENOMICS IP LTD (CN)
International Classes:
G06N3/04; G06N3/08; G16B25/10
Domestic Patent References:
WO2018183980A2 2018-10-04
Foreign References:
CN112927757A 2021-06-08
CN107025387A 2017-08-08
US20180166170A1 2018-06-14
CN110799196A 2020-02-14
CN104745575A 2015-07-01
Claims:

CLAIMS

What is claimed is:

1. A methylation biomarker selection apparatus, comprising: a storage, being configured to store a plurality of first data sets and a plurality of second data sets, wherein each of the first data sets comprises a plurality of methylation degrees corresponding to a plurality of methylation loci, and each of the second data sets comprises at least one medical record; and a processor, being electrically connected to the storage and configured to perform the following operations:

(a) determining a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees,

(b) determining a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets, and

(c) determining a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers.

2. The methylation biomarker selection apparatus of claim 1, wherein the processor further performs the following operations:

(d) clustering the candidate biomarkers into a plurality of functional clusters,

(e) calculating a weight for each of the candidate biomarkers in each of the functional clusters, and

(f) determining at least one target biomarker from at least one of the functional clusters according to the weights in each of the functional clusters.

3. The methylation biomarker selection apparatus of claim 1, wherein the processor determines the primary biomarkers by performing the following operation: selecting the methylation loci having at least one of an averaged methylation degree difference conforming to a first predetermined rule and a p-value conforming to a second predetermined rule as the differentiable loci, wherein the differentiable loci are determined as the primary biomarkers.

4. The methylation biomarker selection apparatus of claim 1, wherein the processor determines the secondary biomarkers by performing the following operations: calculating an association degree indicating relevance to the target disease for each of the distinct diagnosed diseases, selecting the diagnosed diseases having the association degree conforming to a third predetermined rule as the comorbidities, and determining a plurality of genes corresponding to the comorbidities as the secondary biomarkers.

5. The methylation biomarker selection apparatus of claim 4, wherein the association degree of each of the distinct diagnosed diseases comprises an odds ratio, a p-value, and a supporting rate.

6. The methylation biomarker selection apparatus of claim 2, wherein the processor is further configured to calculate at least one gene distance by the following operations: calculating a Gene Ontology (GO) term distance for each of at least one GO term pair between a first candidate biomarker and a second candidate biomarker, and determining the gene distance between the first candidate biomarker and the second candidate biomarker according to the at least one GO term distance.

7. The methylation biomarker selection apparatus of claim 6, wherein each of the GO term distances is calculated based on an information content distance and a Czekanowski-Dice distance.

8. The methylation biomarker selection apparatus of claim 2, wherein the processor is further configured to execute a recurrent neural network comprising an encoder, an attention mechanism, and a decoder, each of a plurality of candidate biomarker sequences belongs to one of a normal subject group and a disease subject group, each of the candidate biomarker sequences corresponds to one of the candidate biomarkers, and the processor calculates the weight for each of the candidate biomarkers in each of the functional clusters by the following operations: deriving a plurality of normal attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the normal subject group into the recurrent neural network, deriving a plurality of disease attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the disease subject group into the recurrent neural network, calculating an averaged normal weight by averaging the normal attention weights, calculating an averaged disease weight by averaging the disease attention weights, and calculating the weight according to the averaged normal weight and the averaged disease weight.

9. The methylation biomarker selection apparatus of claim 2, wherein the processor further ranks the candidate biomarkers in each of the functional clusters according to the corresponding weights.

10. A methylation biomarker selection method for use in an electronic apparatus, the electronic apparatus storing a plurality of first data sets and a plurality of second data sets, each of the first data sets comprising a plurality of methylation degrees corresponding to a plurality of methylation loci, each of the second data sets comprises at least one medical record, and the methylation biomarker selection method comprising the following steps:

(a) determining a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees;

(b) determining a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets; and

(c) determining a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers.

11. The methylation biomarker selection method of claim 10, further comprising the following steps:

(d) clustering the candidate biomarkers into a plurality of functional clusters;

(e) calculating a weight for each of the candidate biomarkers in each of the functional clusters; and

(f) determining at least one target biomarker from at least one of the functional clusters according to the weights in each of the functional clusters.

12. The methylation biomarker selection method of claim 10, wherein the step (a) comprises the following step: selecting the methylation loci having at least one of an averaged methylation degree difference conforming to a first predetermined rule and a p-value conforming to a second predetermined rule as the differentiable loci, wherein the differentiable loci are determined as the primary biomarkers.

13. The methylation biomarker selection method of claim 10, wherein the step (b) comprises the following steps: calculating an association degree indicating relevance to the target disease for each of the distinct diagnosed diseases; selecting the diagnosed diseases having the association degree conforming to a third predetermined rule as the comorbidities; and determining a plurality of genes corresponding to the comorbidities as the secondary biomarkers.

14. The methylation biomarker selection method of claim 13, wherein the association degree of each of the distinct diagnosed diseases comprises an odds ratio, a p-value, and a supporting rate.

15. The methylation biomarker selection method of claim 11, further comprising the following steps: calculating at least one gene distance, comprising the following steps: calculating a GO term distance for each of at least one GO term pair between a first candidate biomarker and a second candidate biomarker; and determining the gene distance between the first candidate biomarker and the second candidate biomarker according to the at least one GO term distance.

16. The methylation biomarker selection method of claim 15, wherein each of the GO term distances is calculated based on an information content distance and a Czekanowski-Dice distance.

17. The methylation biomarker selection method of claim 11, wherein the electronic apparatus executes a recurrent neural network comprising an encoder, an attention mechanism, and a decoder, each of a plurality of candidate biomarker sequences belongs to one of a normal subject group and a disease subject group, each of the candidate biomarker sequences corresponds to one of the candidate biomarkers, and the step (e) comprises the following steps: deriving a plurality of normal attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the normal subject group into the recurrent neural network; deriving a plurality of disease attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the disease subject group into the recurrent neural network; calculating an averaged normal weight by averaging the normal attention weights; calculating an averaged disease weight by averaging the disease attention weights; and calculating the weight according to the averaged normal weight and the averaged disease weight.

18. The methylation biomarker selection method of claim 11, further comprising the following step: ranking the candidate biomarkers in each of the functional clusters according to the corresponding weights.

Description:
METHYLATION BIOMARKER SELECTION APPARATUSES AND METHODS

PRIORITY

This application claims priority to US Provisional Patent Application No. 63/261,780 filed on September 28, 2021, which is hereby incorporated by reference in its entirety.

FIELD OF THE INVENTION

[0001] The present invention relates to methylation biomarker selection apparatuses and methods. More specifically, the present invention relates to methylation biomarker selection apparatuses and methods that provide biomarkers pertaining to a target disease based on comorbidity analysis.

BACKGROUND OF THE INVENTION

[0002] Biomarkers have played an important role in the medical field, such as for diagnosing diseases and developing drugs. An ideal biomarker for a target disease should be of high sensitivity and high specificity so that the target disease can be detected in an early stage and prognosis can be evaluated. The common approach to discover biomarker(s) pertaining to a target disease is to investigate into the samples of the patients with the target disease. However, as the samples analyzed by the common approach are quite limited in terms of both quantity and diversity, the results are usually unsatisfactory (e.g., the derived biomarker(s) is/are without high sensitivity and/or without high specificity) and insufficient (e.g., only few biomarkers are derived).

[0003] Consequently, a technique that can provide a sufficient number of biomarkers that are highly sensitive and highly specific to a target disease is still needed.

SUMMARY OF THE INVENTION

[0004] An objective of this invention is to provide a methylation biomarker selection apparatus. The methylation biomarker selection apparatus comprises a storage and a processor, wherein the processor is electrically connected to the storage. The storage is configured to store a plurality of first data sets, wherein each of the first data sets comprises a plurality of methylation degrees corresponding to a plurality of methylation loci. The storage is also configured to store a plurality of second data sets, wherein each of the second data sets comprises at least one medical record. The processor is configured to perform the following operations: (a) determining a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees, (b) determining a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets, and (c) determining a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers.

[0005] Another objective of this invention is to provide a methylation biomarker selection method for use in an electronic apparatus. The electronic apparatus stores a plurality of first data sets, wherein each of the first data sets comprises a plurality of methylation degrees corresponding to a plurality of methylation loci. The electronic apparatus also stores a plurality of second data sets, wherein each of the second data sets comprises at least one medical record. The methylation biomarker selection method comprises the following steps: (a) determining a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees, (b) determining a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets, and (c) determining a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers.

[0006] The methylation biomarker selection technique (which at least comprises the methylation biomarker selection apparatuses and methods) provided by the present invention utilizes two different kinds of data sets (i.e., the first data sets and the second data sets) to discover candidate biomarkers pertaining to a target disease. While the first data sets comprise methylation degrees of various methylation loci, the second data sets comprise medical record(s). With the first data sets, differentiable loci can be identified as the primary biomarkers pertaining to the target disease. With the second data sets, comorbidities of the target disease and associated genes thereof can be identified so as to provide the secondary biomarkers pertaining to the target disease. As both methylation degrees and comorbidities of the target disease are considered, the methylation biomarker selection technique of the present invention can provide candidate biomarkers that are highly sensitive and highly specific to the target disease. Furthermore, as the candidate biomarkers are determined based on a correlation analysis of the primary biomarkers and the secondary biomarkers, a sufficient number of candidate biomarkers can be provided.

[0007] The detailed technology and preferred embodiments implemented for the subject invention are described in the following paragraphs accompanying the appended drawings for people skilled in this field to well appreciate the features of the claimed invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1 illustrates the schematic view of a methylation biomarker selection apparatus 1 in some embodiments of the present invention.

[0009] FIG. 2 illustrates the general data process flow for finding out the candidate biomarkers based on methylation degrees and comorbidities associated with a target disease.

[0010] FIG. 3 illustrates the data process flow for deriving the first data sets D1_1, ..., D1_q in some embodiments of the present invention.

[0011] FIG. 4 illustrates the data process flow for weight calculation and target biomarker selection in some embodiments of the present invention.

[0012] FIG. 5 illustrates the schematic view of an exemplary recurrent neural network used in some embodiments of the present invention.

[0013] FIG. 6 illustrates the main flowchart of a methylation biomarker selection method in some embodiments of the present invention.

[0014] FIG. 7 illustrates the main flowchart of a methylation biomarker selection method in some embodiments of the present invention.

[0015] FIG. 8 illustrates the main flowchart of the step S709 in some embodiments of the present invention.

[0016] FIG. 9 shows an exemplary result of clinical validation of the target biomarkers.

DETAILED DESCRIPTION

[0017] In the following descriptions, the methylation biomarker selection apparatuses and methods of the present invention will be explained regarding certain embodiments thereof. However, these embodiments are not intended to limit the present invention to any specific environment, application, or implementations described in these embodiments. Therefore, descriptions of these embodiments are to provide illustration rather than to limit the scope of the present invention. It should be noted that, in the following embodiments and the attached drawings, elements unrelated to the present invention are omitted from depiction. In addition, dimensions of elements and any dimensional scales between individual elements in the attached drawings are provided only for ease of depiction and illustration but not to limit the scope of the present invention.

[0018] FIG. 1 illustrates the schematic view of a methylation biomarker selection apparatus 1 in some embodiments of the present invention. The methylation biomarker selection apparatus 1 comprises a storage 11 and a processor 13, wherein the storage 11 is electrically connected to the processor 13. The storage 11 may be a memory, a Universal Serial Bus (USB) disk, a portable disk, a Hard Disk Drive (HDD), or any other non-transitory storage medium, apparatus, or circuit that can store data and is known to a person having ordinary skill in the art. The processor 13 may be one of the various processors, central processing units (CPUs), microprocessor units (MPUs), digital signal processors (DSPs), or other computing apparatuses known to a person having ordinary skill in the art.

[0019] The storage 11 stores a plurality of first data sets D1_1, ..., D1_q, wherein each of the first data sets D1_1, ..., D1_q comprises a plurality of methylation degrees corresponding to a plurality of methylation loci. Please note that a methylation locus is a gene locus that refers to a CG-rich or CG-poor DNA region that includes at least one differentially methylated region. In some embodiments, the methylation loci comprise CpG methylation loci and non-CpG methylation loci. In addition, the storage 11 stores a plurality of second data sets D2_1, ..., D2_r, wherein each of the second data sets D2_1, ..., D2_r comprises at least one medical record.

[0020] The methylation biomarker selection apparatus 1 aims to find biomarkers that may be highly related to a target disease based on methylation degrees and comorbidities associated with the target disease; the general data process flow is illustrated in FIG. 2. Specifically, the processor 13 determines a plurality of primary biomarkers PB_1, ..., PB_m by identifying a plurality of differentiable loci from the methylation loci recorded in the first data sets D1_1, ..., D1_q according to the methylation degrees recorded in the first data sets D1_1, ..., D1_q, determines a plurality of secondary biomarkers SB_1, ..., SB_n by identifying a plurality of comorbidities of a target disease and associated genes thereof based on the second data sets D2_1, ..., D2_r, and determines a plurality of candidate biomarkers CB_1, ..., CB_k based on a correlation analysis of the primary biomarkers PB_1, ..., PB_m and the secondary biomarkers SB_1, ..., SB_n. The candidate biomarkers CB_1, ..., CB_k are the biomarkers that may be highly related to the target disease so that they may be used for further investigation and/or evaluation of the target disease. As used herein, “comorbidity” refers to one or more conditions, syndromes, diseases, or disorders that cause, are caused by, or co-occur with the target disease and can be either directly or indirectly linked to the target disease. In some embodiments, the first data sets D1_1, ..., D1_q are generated by a methylation array or methylation sequencing. In some embodiments, the target disease includes, but is not limited to, brain cancer, breast cancer, colon cancer, endocrine gland cancer, esophageal cancer, female reproductive organ cancer, head and neck cancer, hepatobiliary system cancer, kidney cancer, lung cancer, mesenchymal cell neoplasm, prostate cancer, skin cancer, stomach cancer, tumor of exocrine pancreas, and urinary system cancer.

[0021] The detailed descriptions of the first data sets D1_1, ..., D1_q, the second data sets D2_1, ..., D2_r, and the operations performed by the processor 13 in various embodiments are provided below.

[0022] First data sets

[0023] In some embodiments, the methylation biomarker selection apparatus 1 derives the first data sets D1_1, ..., D1_q from the data files generated by the methylation array (e.g., Illumina Infinium HumanMethylation450 BeadChip (450K Chip)); the data process flow is illustrated in FIG. 3. In those embodiments, the Chip Analysis Methylation Pipeline (ChAMP) package is installed on the methylation biomarker selection apparatus 1, and the processor 13 imports the data files F_1, ..., F_o (e.g., the IDAT files) of the methylation array from a first database (e.g., The Cancer Genome Atlas (TCGA)) through a transceiving interface (not shown) of the methylation biomarker selection apparatus 1. Each of the imported data files F_1, ..., F_o comprises a plurality of methylation degrees corresponding to a plurality of methylation loci (e.g., N methylation degrees correspond to N methylation loci one to one, and N is a positive integer greater than one). In the data files F_1, ..., F_o generated by the methylation array, a methylation degree is called a β value. Then, the processor 13 may derive the first data sets D1_1, ..., D1_q by pre-processing the imported data files F_1, ..., F_o, which usually involves quality control, normalization, and outlier removal.

[0024] An example regarding quality control is given herein. In this example, probes that meet any one of the following criteria are excluded: (1) probes with a detection p-value greater than 0.01 in at least one sample, (2) probes with a bead count smaller than 3 in at least 5% of samples, (3) probes targeting non-CpG positions, (4) probes targeting single nucleotide polymorphism (SNP) sites, (5) probes that align to multiple locations, and (6) probes located on X and Y chromosomes. After the aforesaid quality control, only the methylation loci corresponding to the remaining probes are kept in the imported data files.
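The six exclusion criteria above can be sketched in Python as follows. This is only an illustrative sketch, not the ChAMP implementation: the `Probe` record, its field names, and the function names are hypothetical, and the default cut-offs simply restate the numbers given in the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    """Hypothetical per-probe summary; all field names are illustrative."""
    probe_id: str
    detection_pvalues: List[float]  # one detection p-value per sample
    bead_counts: List[int]          # one bead count per sample
    is_cpg: bool                    # False for probes targeting non-CpG positions
    overlaps_snp: bool              # True for probes targeting SNP sites
    multi_hit: bool                 # True for probes aligning to multiple locations
    chromosome: str                 # e.g. "1", "17", "X", "Y"


def passes_quality_control(probe, p_cutoff=0.01, min_beads=3,
                           max_low_bead_fraction=0.05):
    """Return True only if the probe meets none of the six exclusion criteria."""
    # (1) detection p-value above the cut-off in at least one sample
    if any(p > p_cutoff for p in probe.detection_pvalues):
        return False
    # (2) bead count below min_beads in at least 5% of samples
    if probe.bead_counts:
        low = sum(1 for b in probe.bead_counts if b < min_beads)
        if low / len(probe.bead_counts) >= max_low_bead_fraction:
            return False
    # (3) non-CpG, (4) SNP site, (5) multiple alignments
    if not probe.is_cpg or probe.overlaps_snp or probe.multi_hit:
        return False
    # (6) located on a sex chromosome
    if probe.chromosome.upper() in {"X", "Y"}:
        return False
    return True


def filter_probes(probes):
    """Keep only the probes that survive quality control."""
    return [p for p in probes if passes_quality_control(p)]
```

Only the methylation loci whose probes are returned by `filter_probes` would be carried forward to normalization.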

[0025] Examples regarding normalization are given herein. The methylation degrees in the aforesaid imported data files are biased because the methylation array adopts two different types of probe design (Infinium type 1 probe design and Infinium type 2 probe design); therefore, normalization is required to adjust for the biases. For example, beta-mixture quantile normalization (BMIQ), subset-quantile within array normalization (SWAN), peak-based correction (PBC), or functional normalization (FunNorm) can be used.

[0026] An example regarding outlier removal is given herein. The imported data files that have been processed by the aforesaid quality control and normalization are classified into a normal subject group and a disease subject group. The normal subject group comprises the imported data files related to the subjects without the target disease, while the disease subject group comprises the imported data files related to the subjects with the target disease. For each methylation locus in each of the normal subject group and the disease subject group, the outlier(s) are eliminated by the Interquartile Range (IQR) method. A person having ordinary skill in the art shall be familiar with the IQR method and, thus, the details are not given herein. By removing the outliers, the distribution of the methylation degrees of each methylation locus in each of the normal subject group and the disease subject group is in a concentrated form. In this way, noise interferences during primary biomarker selection can be avoided.
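A minimal Python sketch of IQR-based outlier removal for one methylation locus within one subject group is given below. The text does not state the fence multiplier or the quartile convention, so the conventional 1.5 × IQR fences and the "inclusive" quartile method of the standard library are assumptions, as is the function name.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    `values` holds the methylation degrees of ONE locus within ONE group
    (normal or disease). k = 1.5 is the conventional choice and is an
    assumption here, not a value taken from the description.
    """
    if len(values) < 4:
        # Too few points for meaningful quartiles; keep everything.
        return list(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]
```

The function would be applied separately to each locus in the normal subject group and in the disease subject group, so that an extreme value in one group is never judged against the distribution of the other.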

[0027] The imported data files that have been processed by the aforesaid quality control, normalization, and outlier removal are the first data sets D1_1, ..., D1_q. Please note that the above examples are not intended to limit the approach for deriving the first data sets D1_1, ..., D1_q. In some other embodiments, the first data sets D1_1, ..., D1_q may be derived from other sources and by other approaches as long as each of the first data sets D1_1, ..., D1_q comprises a plurality of methylation degrees corresponding to a plurality of methylation loci.

[0028] Primary biomarker selection

[0029] As described above, the processor 13 determines a plurality of primary biomarkers PB_1, ..., PB_m by identifying a plurality of differentiable loci from the methylation loci recorded in the first data sets D1_1, ..., D1_q according to the methylation degrees recorded in the first data sets D1_1, ..., D1_q. The differentiable loci are the loci that are more distinguishable among the methylation loci recorded in the first data sets D1_1, ..., D1_q.

[0030] In some embodiments, for each of the methylation loci, the processor 13 determines whether the methylation locus can be selected as a differentiable locus based on an averaged methylation degree difference of the methylation locus and/or a p-value of the methylation locus. The averaged methylation degree difference of a methylation locus reflects the extent to which the methylation degrees of the methylation locus from disease subjects deviate from the methylation degrees of the methylation locus from normal subjects. The p-value of a methylation locus is a statistical measurement regarding a null hypothesis that the methylation locus is related to the target disease. Specifically, from the methylation loci recorded in the first data sets D1_1, ..., D1_q, the processor 13 selects the methylation loci having: (i) the averaged methylation degree difference conforming to a first predetermined rule (e.g., the averaged methylation degree difference being greater than a first predetermined threshold) and/or (ii) the p-value conforming to a second predetermined rule (e.g., the p-value being smaller than a second predetermined threshold) as the differentiable loci. The differentiable loci are determined as the primary biomarkers PB_1, ..., PB_m.
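The selection rule can be sketched in Python as follows, taking the per-locus averaged methylation degree difference and p-value as already computed. The threshold values (0.2 and 0.05), the use of the absolute difference, the locus identifiers, and the function name are all assumptions for illustration; the description only requires conformity to predetermined rules.

```python
def select_differentiable_loci(loci_stats, min_delta_beta=0.2,
                               max_p_value=0.05, require_both=True):
    """Select loci by the first and/or second predetermined rule.

    loci_stats maps a locus identifier to (delta_beta, p_value).
    require_both=True applies the "and" reading of "and/or";
    require_both=False applies the "or" reading.
    Thresholds are illustrative, not values from the description.
    """
    selected = []
    for locus_id, (delta_beta, p_value) in loci_stats.items():
        # Assumption: hyper- and hypomethylation both count, hence abs().
        meets_first_rule = abs(delta_beta) > min_delta_beta
        meets_second_rule = p_value < max_p_value
        if require_both:
            keep = meets_first_rule and meets_second_rule
        else:
            keep = meets_first_rule or meets_second_rule
        if keep:
            selected.append(locus_id)
    return selected
```

For instance, with the hypothetical input `{"cg0001": (0.35, 0.001), "cg0002": (0.05, 0.0001)}` and the default settings, only `cg0001` satisfies both rules and would be kept as a primary biomarker.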

[0031] The aforesaid averaged methylation degree difference is elaborated herein. In some embodiments, the first data sets D1_1, ..., D1_q are classified into a normal subject group and a disease subject group. That is, each first data set in the normal subject group is related to a subject without the target disease, while each first data set in the disease subject group is related to a subject with the target disease. In those embodiments, the processor 13 derives the averaged methylation degree difference of a methylation locus by performing the following operations (a) and (b).

[0032] In the operation (a), the processor 13 calculates an averaged normal value according to the methylation degrees corresponding to the methylation locus from the normal subject group. In one example, the averaged normal value is the mean value of the methylation degrees of the methylation locus within the normal subject group, and can be characterized by the following equation (1):

$$\beta_{normal\_avg} = \frac{\sum_{i=1}^{n} \beta_i}{n} \tag{1}$$

[0033] In the above equation (1), $\beta_{normal\_avg}$ represents the averaged normal value, $\beta_i$ represents the methylation degree corresponding to the methylation locus from the i-th subject in the normal subject group, and n represents the number of subjects in the normal subject group (i.e., the number of the methylation degrees corresponding to the methylation locus in the normal subject group).

[0034] In the operation (b), the processor 13 calculates the averaged methylation degree difference according to the averaged normal value and the methylation degrees corresponding to the methylation locus from the disease subject group. In one example, the averaged methylation degree difference is the mean value of a plurality of individual methylation degree differences and can be characterized by the following equation (2):

$$\Delta\beta = \frac{\sum_{j=1}^{m} (\beta_j - \beta_{normal\_avg})}{m} \tag{2}$$

[0035] In the above equation (2), $\Delta\beta$ represents the averaged methylation degree difference, $\beta_j$ represents the methylation degree corresponding to the methylation locus from the j-th subject in the disease subject group, $\beta_{normal\_avg}$ represents the averaged normal value, and m represents the number of subjects in the disease subject group (i.e., the number of the methylation degrees corresponding to the methylation locus in the disease subject group). In addition, the value $(\beta_j - \beta_{normal\_avg})$ represents the individual methylation degree difference.

[0036] The aforesaid approach for deriving primary biomarkers PB_1, ..., PB_m has been applied to various target diseases, and the relevant information and data are listed in Table 1. Please note that the data files from TCGA are as of March 15, 2021, and the data files from the Gene Expression Omnibus (GEO) database are as of October 30, 2021. In Table 1, the variable N_N represents the number of subjects without the target disease, and the variable N_TD represents the number of subjects with the target disease.
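Operations (a) and (b), i.e., equations (1) and (2), can be sketched for a single methylation locus as follows. The function names are hypothetical, and the signed (non-absolute) difference follows the description of the individual methylation degree difference as the disease-subject value minus the averaged normal value.

```python
def averaged_normal_value(normal_betas):
    """Equation (1): mean methylation degree of one locus in the normal group."""
    return sum(normal_betas) / len(normal_betas)


def averaged_methylation_degree_difference(normal_betas, disease_betas):
    """Equation (2): mean of the individual differences for one locus.

    Each individual difference is a disease-subject methylation degree
    minus the averaged normal value.
    """
    normal_avg = averaged_normal_value(normal_betas)
    individual_differences = [beta - normal_avg for beta in disease_betas]
    return sum(individual_differences) / len(individual_differences)
```

As a worked example, normal-group values 0.2 and 0.4 give an averaged normal value of 0.3; disease-group values 0.5 and 0.7 then give individual differences of 0.2 and 0.4, whose mean is 0.3.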

Table 1

[0037] Second data sets

[0038] In some embodiments, the methylation biomarker selection apparatus 1 derives the second data sets D2_1, ..., D2_r from a second database through a transceiving interface (not shown) of the methylation biomarker selection apparatus 1. For example, the second database may be any electronic medical record dataset (e.g., Taiwan’s National Health Insurance Research Database, NHIRD) that comprises a plurality of anonymous electronic medical records (EMRs).

[0039] Medical records stored in the second database are related to a plurality of subjects. The subjects with the target disease are selected as an experimental group, while some of the subjects without the target disease are selected as a control group. The subjects in the control group may be randomly selected, matched on age group and gender, in a number five times that of the subjects in the experimental group. For the control group, medical record(s) of each subject is/are retrieved. For the experimental group, medical record(s) of each subject within a predetermined time interval (e.g., 3, 4, or 5 years before the first diagnosis of the target disease) is/are retrieved. All the retrieved medical records are subjected to data cleaning and integration to yield the second data sets D2_1, ..., D2_r so that each of the second data sets D2_1, ..., D2_r corresponds to one subject, and the medical record(s) of the same subject is/are included in one second data set.
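The control-group selection can be sketched in Python as stratified random sampling. This is one plausible reading of the matching described above, not a confirmed procedure: the subject record layout (`id`, `age_group`, `gender`), the function name, and the behavior when a stratum has too few candidates are assumptions.

```python
import random
from collections import defaultdict


def match_controls(cases, candidates, ratio=5, seed=0):
    """Randomly draw `ratio` controls per case within each (age group, gender) stratum.

    cases / candidates: lists of dicts with hypothetical keys
    'id', 'age_group', and 'gender'. Candidates are subjects without
    the target disease. If a stratum has too few candidates, all of
    them are taken (an assumption; the text does not cover this case).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible draw

    # Group the candidate pool by stratum.
    pool = defaultdict(list)
    for subject in candidates:
        pool[(subject["age_group"], subject["gender"])].append(subject)

    # Count how many controls each stratum needs.
    needed = defaultdict(int)
    for case in cases:
        needed[(case["age_group"], case["gender"])] += ratio

    controls = []
    for stratum, count in needed.items():
        available = pool.get(stratum, [])
        controls.extend(rng.sample(available, min(count, len(available))))
    return controls
```

Sampling without replacement within each stratum keeps the control group's age and gender distribution aligned with that of the experimental group while ensuring no subject is selected twice.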

[0040] Each medical record of the second data sets D2_1, ..., D2_r has diagnosis information of a subject. If a subject has been diagnosed with one or more diseases, the corresponding medical record(s) will record the diagnosed disease(s). Please note that the present invention does not limit the way to record the diagnosed disease(s). In some embodiments, a diagnosed disease is a specific disease and can be recorded as a disease code following the International Classification of Diseases (ICD). In some embodiments, a diagnosed disease is a disease group and can be recorded as a disease group code following the ICD.

[0041] In some embodiments, the disease code(s) may be the code(s) from the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). More than 1,000 diseases are listed in ICD-9-CM. They are organized into 17 major chapters as shown in Table 2 and are further classified into various disease groups, each of which includes several diseases. Taking chapter 2 (i.e., neoplasms) of the ICD-9-CM as an example, it has 11 disease groups.

Table 2

[0042] The aforesaid approach for deriving the second data sets D2_1, ..., D2_r has been applied to various target diseases, and the relevant information and data are listed in Table 3. Please note that the data sets derived from NHIRD are as of January 29, 2016. The disease codes are based on ICD-9-CM. In addition, the variable N_EG represents the number of subjects in the experimental group, and the variable N_CG represents the number of subjects in the control group.

Table 3

[0043] Secondary biomarker selection

[0044] As described above, the processor 13 determines a plurality of secondary biomarkers SB_1, ..., SB_n by identifying a plurality of comorbidities of the target disease and associated genes thereof based on the second data sets D2_1, ..., D2_r. In some embodiments, the processor 13 identifies a plurality of distinct diagnosed diseases from the second data sets D2_1, ..., D2_r and determines the secondary biomarkers SB_1, ..., SB_n by performing the following operations (c), (d), and (e).

[0045] In the operation (c), the processor 13 calculates an association degree indicating relevance to the target disease for each of the distinct diagnosed diseases.

[0046] In some embodiments, an association degree between a diagnosed disease and the target disease comprises an odds ratio, a p-value, and a supporting rate. For those embodiments, the processor 13 calculates the following four statistical numbers based on the second data sets D2_1, ..., D2_r: (i) the total number of the subjects with both the diagnosed disease and the target disease, which is represented by the variable N_DD_DT, (ii) the total number of the subjects with the diagnosed disease but without the target disease, which is represented by the variable N_DD_NDT, (iii) the total number of the subjects without the diagnosed disease but with the target disease, which is represented by the variable N_NDD_DT, and (iv) the total number of the subjects without the diagnosed disease and without the target disease, which is represented by the variable N_NDD_NDT. With the four statistical numbers, the processor 13 can calculate the odds ratio and the supporting rate by the following equations (3) and (4) respectively:

$$\text{Supporting Rate} = \frac{N_{DD\_DT} + N_{DD\_NDT}}{N_{DD\_DT} + N_{DD\_NDT} + N_{NDD\_DT} + N_{NDD\_NDT}} \tag{4}$$
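The calculation in operation (c) can be sketched as follows. This is only an illustration under stated assumptions: equation (3) is not reproduced in this text, so the odds ratio below assumes the standard cross-product definition for a 2x2 table; the supporting rate follows equation (4). The function name, parameter names, and example counts are hypothetical, and the p-value calculation is omitted.

```python
def association_degree(n_dd_dt, n_dd_ndt, n_ndd_dt, n_ndd_ndt):
    """Return (odds_ratio, supporting_rate) from the four counts of paragraph [0046]."""
    total = n_dd_dt + n_dd_ndt + n_ndd_dt + n_ndd_ndt
    if total == 0:
        raise ValueError("at least one subject is required")
    # Assumption: standard cross-product odds ratio for a 2x2 contingency table.
    denominator = n_dd_ndt * n_ndd_dt
    if denominator == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = (n_dd_dt * n_ndd_ndt) / denominator
    # Equation (4): share of all subjects who have the diagnosed disease.
    supporting_rate = (n_dd_dt + n_dd_ndt) / total
    return odds_ratio, supporting_rate


# Hypothetical counts, for illustration only.
odds_ratio, supporting_rate = association_degree(120, 300, 80, 700)
```

With these made-up counts the odds ratio is 3.5 and the supporting rate is 0.35.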

[0047] Please note that any other indicator that reflects the relevance between two diseases can be used as an association degree. For example, an indicator of relative risk can be used as an association degree in some embodiments.

[0048] In the operation (d), among the distinct diagnosed diseases, the processor 13 selects the diagnosed diseases having the association degree conforming to a third predetermined rule as the comorbidities.

[0049] For the embodiments in which an association degree comprises an odds ratio, a p-value, and a supporting rate, the third predetermined rule comprises three sub-rules for the odds ratio, the p-value, and the supporting rate, respectively. As an example, the three sub-rules may be “the odds ratio being greater than 2,” “the p-value being smaller than 0.05,” and “the supporting rate being greater than 10%.”
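A minimal sketch of operation (d) with the three example sub-rules is given below. The record type, the function name, and the disease labels "A" to "D" are hypothetical placeholders, and the numbers are invented for illustration.

```python
from typing import NamedTuple


class Association(NamedTuple):
    disease_code: str      # hypothetical label, not a real ICD-9-CM code
    odds_ratio: float
    p_value: float
    supporting_rate: float


def select_comorbidities(associations, min_odds_ratio=2.0, max_p_value=0.05,
                         min_supporting_rate=0.10):
    """Keep the diagnosed diseases that satisfy all three sub-rules."""
    return [
        a.disease_code
        for a in associations
        if a.odds_ratio > min_odds_ratio
        and a.p_value < max_p_value
        and a.supporting_rate > min_supporting_rate
    ]


candidates = [
    Association("A", 3.5, 0.001, 0.35),  # passes all three sub-rules
    Association("B", 1.4, 0.020, 0.40),  # odds ratio too low
    Association("C", 2.8, 0.200, 0.15),  # p-value too high
    Association("D", 4.1, 0.010, 0.05),  # supporting rate too low
]
comorbidities = select_comorbidities(candidates)
```

Only disease "A" is kept here, since each of the other three fails exactly one sub-rule.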

[0050] In the operation (e), the processor 13 determines a plurality of genes corresponding to the comorbidities as the secondary biomarkers SB_1, ..., SB_n. For example, the processor 13 may retrieve the genes corresponding to the comorbidities from a third database (e.g., the DisGeNET database, the Online Mendelian Inheritance in Man (OMIM) database) through a transceiving interface (not shown) of the methylation biomarker selection apparatus 1.

[0051] The aforesaid approach for deriving the secondary biomarkers SB_1, ..., SB_n has been applied to various target diseases under the condition that the third predetermined rule comprises “the odds ratio being greater than 2,” “the p-value being smaller than 0.05,” and “the supporting rate being greater than 10%.” Various significant comorbidities of these target diseases and the relevant data are listed in Table 4 to Table 12. Specifically, Table 4 is for the target disease “colorectal cancer,” Table 5 is for the target disease “lung cancer,” Table 6 is for the target disease “liver cancer,” Table 7 is for the target disease “pancreatic cancer,” Table 8 is for the target disease “prostate cancer,” Table 9 is for the target disease “breast cancer,” Table 10 is for the target disease “ovarian cancer,” Table 11 is for the target disease “esophagus cancer,” and Table 12 is for the target disease “stomach cancer.”

Attorney Docket No. 3819.0470W01

Table 4 (Significant comorbidities of colorectal cancer)

Table 5 (Significant comorbidities of lung cancer)

Table 6 (Significant comorbidities of liver cancer)

Table 7 (Significant comorbidities of pancreatic cancer)

Table 8 (Significant comorbidities of prostate cancer)

Table 9 (Significant comorbidities of breast cancer)

Table 10 (Significant comorbidities of ovarian cancer)

Table 11 (Significant comorbidities of esophagus cancer)

Table 12 (Significant comorbidities of stomach cancer)

[0052] Candidate biomarker selection

[0053] After deriving the primary biomarkers PB_1, ..., PB_m and the secondary biomarkers SB_1, ..., SB_n, the processor 13 determines a plurality of candidate biomarkers CB_1, ..., CB_k based on a correlation analysis of the primary biomarkers PB_1, ..., PB_m and the secondary biomarkers SB_1, ..., SB_n. In some embodiments, the correlation analysis is the intersection or the union of the primary biomarkers and the secondary biomarkers. Please note that different correlation analyses may be used in different embodiments.
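The intersection and union variants of the correlation analysis can be sketched with plain set operations. The sketch assumes that each primary biomarker (a methylation locus) has already been mapped to a gene symbol so that it can be compared with the secondary biomarkers (genes); that mapping step, the function name, and the gene labels are hypothetical.

```python
def candidate_biomarkers(primary_genes, secondary_genes, mode="intersection"):
    """Combine primary and secondary biomarkers by set intersection or union."""
    primary, secondary = set(primary_genes), set(secondary_genes)
    if mode == "intersection":
        return primary & secondary
    if mode == "union":
        return primary | secondary
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical gene labels, for illustration only.
primary = ["GENE_A", "GENE_B", "GENE_C"]
secondary = ["GENE_B", "GENE_C", "GENE_D"]
common = candidate_biomarkers(primary, secondary)            # stricter set
either = candidate_biomarkers(primary, secondary, "union")   # broader set
```

The intersection yields a smaller set supported by both data sources, while the union keeps every biomarker supported by at least one of them.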

[0054] As described above, the primary biomarkers PB_1, ..., PB_m are differentiable loci regarding a target disease, and the secondary biomarkers SB_1, ..., SB_n are genes corresponding to the comorbidities of the same target disease. Hence, determining the candidate biomarkers CB_1, ..., CB_k based on a correlation analysis of the primary biomarkers PB_1, ..., PB_m and the secondary biomarkers SB_1, ..., SB_n provides a promising result. That is, within the candidate biomarkers CB_1, ..., CB_k, biomarker(s) that is/are highly sensitive and highly specific to the target disease can be found and can be used for further analysis regarding the target disease.

[0055] Biomarker functional clustering

[0056] Different candidate biomarkers CB_1, ..., CB_k represent different functional roles. As shown in FIG. 4, in some embodiments, the processor 13 further clusters the candidate biomarkers CB_1, ..., CB_k into a plurality of functional clusters G_1, ..., G_p. In FIG. 4, every black dot represents a candidate biomarker. Candidate biomarkers within the same functional cluster are close to each other in terms of function (e.g., regulating the same function or similar functions).

[0057] Biomarker functional clustering based on gene distances

[0058] In some embodiments, the processor 13 can cluster the candidate biomarkers CB_1, ..., CB_k into the functional clusters G_1, ..., G_p based on a plurality of gene distances between every pair of the candidate biomarkers CB_1, ..., CB_k. Please note that a gene distance is a value showing the distance in terms of function between two genes.

[0059] In some embodiments, the concept of Gene Ontology (GO) is adopted for calculating the gene distances. GO depicts gene functions in a GO tree by a plurality of GO terms, and the GO terms are categorized into three complementary biological concepts including Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Functions of most human genes are well annotated by GO terms. In those embodiments, each of the candidate biomarkers CB_1, ..., CB_k is annotated with at least one GO term with reference to a fourth database (e.g., Ensembl Release 104, Ensembl Release 105, Ensembl Release 106 or Ensembl Release 107).

[0060] In those embodiments, the processor 13 calculates a gene distance for every pair of the candidate biomarkers CB_1, ..., CB_k. Specifically, the processor 13 can calculate a gene distance between a first candidate biomarker and a second candidate biomarker by the following operations (f) and (g).

[0061] In the operation (f), the processor 13 calculates a GO term distance for each of at least one GO term pair between the first candidate biomarker and the second candidate biomarker. Please note that a GO term distance is a value showing the distance (in terms of function) between two GO terms.

[0062] A concrete example is given herein for better understanding. In this example, the first candidate biomarker is the gene “B3GNTL1” and is annotated with a GO term “GO:0016757,” while the second candidate biomarker is the gene “PLD5” and is annotated with three GO terms “GO:0003824,” “GO:0008152,” and “GO:0016021.” Three GO term pairs can be formed between the first candidate biomarker and the second candidate biomarker, including (GO:0016757, GO:0003824), (GO:0016757, GO:0008152), and (GO:0016757, GO:0016021). The processor 13 calculates a GO term distance for each of the three GO term pairs.

[0063] In the operation (g), the processor 13 determines the gene distance between the first candidate biomarker and the second candidate biomarker according to the GO term distance(s) derived in the operation (f). In some embodiments, the processor 13 takes the mean value of the GO term distance(s) as the gene distance between the first candidate biomarker and the second candidate biomarker.

[0064] The above concrete example is continued herein for better understanding. For the first candidate biomarker “B3GNTL1” and the second candidate biomarker “PLD5,” the GO term distances of the three GO term pairs (GO:0016757, GO:0003824), (GO:0016757, GO:0008152), and (GO:0016757, GO:0016021) have been calculated in the operation (f). Thus, the gene distance between the first candidate biomarker “B3GNTL1” and the second candidate biomarker “PLD5” may be derived by averaging the three GO term distances.
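Operations (f) and (g) for this example can be sketched as follows, assuming the mean-value embodiment. The function name is hypothetical, and the three GO term distances are invented placeholders rather than values from this disclosure.

```python
from itertools import product
from statistics import mean


def gene_distance(go_terms_a, go_terms_b, term_distance):
    """Mean GO term distance over every GO term pair between two genes."""
    pairs = list(product(go_terms_a, go_terms_b))
    if not pairs:
        raise ValueError("each gene needs at least one GO term")
    return mean(term_distance(a, b) for a, b in pairs)


# Hypothetical GO term distances, for illustration only.
example_distances = {
    ("GO:0016757", "GO:0003824"): 0.5,
    ("GO:0016757", "GO:0008152"): 0.9,
    ("GO:0016757", "GO:0016021"): 1.0,
}

b3gntl1_terms = ["GO:0016757"]
pld5_terms = ["GO:0003824", "GO:0008152", "GO:0016021"]
distance = gene_distance(
    b3gntl1_terms, pld5_terms, lambda a, b: example_distances[(a, b)]
)
```

With these placeholder values the gene distance is the mean of 0.5, 0.9, and 1.0, i.e., about 0.8.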

[0065] GO term distances for calculating gene distances

[0066] As described above, a GO term distance is a value showing the distance (in terms of function) between two GO terms. In some embodiments, the processor 13 calculates each of the GO term distances based on a corresponding information content distance and a corresponding Czekanowski-Dice distance (e.g., averaging the information content distance and the Czekanowski-Dice distance). Before calculating the information content distances and the Czekanowski-Dice distances, the processor 13 calculates a weight for each of the GO terms. The weight of a GO term can be considered as an indicator for the position of the GO term located in the GO tree.

[0067] For the i-th GO term, its weight is defined as the number of the candidate biomarkers CB_1, ..., CB_k annotated by the i-th GO term divided by the number of non-duplicated candidate biomarkers CB_1, ..., CB_k annotated by all the GO terms. A GO term located in an upper level of the GO tree corresponds to more candidate biomarkers than a GO term located in lower-level branches of the GO tree, so its weight is relatively higher.

[0068] Two concrete examples are given herein with the assumption that 70 candidate biomarkers are annotated by the GO term “GO:0016757,” 690 candidate biomarkers are annotated by the GO term “GO:0003824,” and 20,987 non-duplicated candidate biomarkers are annotated by GO terms. Under the assumption, the weight of the GO term “GO:0016757” is approximately 0.003335 (i.e., 70/20,987 ≈ 0.003335), and the weight of the GO term “GO:0003824” is approximately 0.032877 (i.e., 690/20,987 ≈ 0.032877).
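The weight definition of paragraph [0067] can be sketched as below. The function name, the dictionary layout, and the toy GO term labels ("GO:A", "GO:B") and gene labels are hypothetical; the last two lines simply redo the arithmetic of the two examples above.

```python
def go_term_weights(annotations):
    """Map each GO term to (#biomarkers it annotates) / (#distinct annotated biomarkers).

    annotations: dict mapping a GO term to the collection of candidate
    biomarkers annotated by that term.
    """
    all_biomarkers = set().union(*annotations.values())
    if not all_biomarkers:
        raise ValueError("no annotated candidate biomarkers")
    return {
        term: len(set(biomarkers)) / len(all_biomarkers)
        for term, biomarkers in annotations.items()
    }


# Toy annotation table with hypothetical labels: four distinct biomarkers.
toy_weights = go_term_weights({
    "GO:A": {"g1", "g2", "g3"},
    "GO:B": {"g3", "g4"},
})

# Arithmetic of the two examples in paragraph [0068].
weight_0016757 = 70 / 20987
weight_0003824 = 690 / 20987
```

In the toy table, "GO:A" annotates three of four distinct biomarkers (weight 0.75) and "GO:B" annotates two of four (weight 0.5).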

[0069] The information content distance between two GO terms is elaborated herein. If two GO terms belong to different biological concepts in the GO tree, the information content distance between them is defined as 1 (i.e., a value representing the farthest distance) because they do not have a Lowest Common Ancestor (LCA). If two GO terms belong to the same biological concept in the GO tree, the two GO terms have one or more LCAs. If there is more than one LCA, the common ancestor with the lowest weight value is selected. For the case that two GO terms belong to the same biological concept in the GO tree, the information content distance between them is calculated based on the weights of the two GO terms as well as the weight of the LCA. The calculation of the information content distance of any two GO terms can be characterized by the following equation (5).

[0070] In the above equation (5), $t_i$ represents the i-th GO term, $t_j$ represents the j-th GO term, $t_{LCA_{ij}}$ represents the LCA of the i-th and j-th GO terms, $W(t_i)$ represents the weight of the i-th GO term, $W(t_j)$ represents the weight of the j-th GO term, $W(t_{LCA_{ij}})$ represents the weight of the GO term $t_{LCA_{ij}}$, and $dist_{IC}(t_i, t_j)$ represents the information content distance between the i-th and j-th GO terms.
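A minimal sketch of the information content distance follows. Equation (5) is not reproduced in this text, so the same-concept branch below is an assumption inferred from the arithmetic of the worked example in paragraph [0071] (twice the LCA weight minus the two term weights); the function and parameter names are hypothetical.

```python
def information_content_distance(weight_i, weight_j, weight_lca=None):
    """Information content distance between two GO terms.

    weight_lca is None when the two terms belong to different biological
    concepts and therefore share no LCA; the distance is then 1.
    """
    if weight_lca is None:
        return 1.0
    # Assumed form, inferred from the worked example in paragraph [0071].
    return 2 * weight_lca - weight_i - weight_j


# Weights taken from paragraphs [0068] and [0071].
same_concept = information_content_distance(0.003335, 0.032877, 0.036451)
different_concept = information_content_distance(0.003335, 0.032877)
```

With the weights of the worked example, the same-concept result is about 0.03669, and the different-concept case returns the maximum distance of 1.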

[0071] A concrete example regarding the information content distance is given herein. It is assumed that the GO term “GO:0016757” and the GO term “GO:0003824” have an LCA with the weight 0.036451. Under this assumption, the information content distance between the GO term “GO:0016757” and the GO term “GO:0003824” is 0.03669 (i.e., 2 × 0.036451 − 0.003335 − 0.032877 = 0.03669).

[0072] The Czekanowski-Dice distance between two GO terms is elaborated herein. The Czekanowski-Dice distance represents the similarity of the sets of the candidate biomarkers annotated by the two GO terms. It is assumed that $G_{t_i}$ and $G_{t_j}$ represent the sets of the candidate biomarkers annotated by the i-th and j-th GO terms respectively. The Czekanowski-Dice distance between the i-th and j-th GO terms can be calculated based on the following equation (6).

[0073] In the above equation (6), $t_i$ represents the i-th GO term, $t_j$ represents the j-th GO term, $G_{t_i}$ represents the set of the candidate biomarkers annotated by the i-th GO term, $G_{t_j}$ represents the set of the candidate biomarkers annotated by the j-th GO term, and $dist_{CD}(t_i, t_j)$ represents the Czekanowski-Dice distance between the i-th and j-th GO terms. In addition, $G_{t_i} \Delta G_{t_j}$ is the symmetric difference between the sets $G_{t_i}$ and $G_{t_j}$, $G_{t_i} \cup G_{t_j}$ is the union of the sets $G_{t_i}$ and $G_{t_j}$, and $G_{t_i} \cap G_{t_j}$ is the intersection of the sets $G_{t_i}$ and $G_{t_j}$. When the number of exclusive candidate biomarkers between the i-th and j-th GO terms is high, the Czekanowski-Dice distance between the i-th and j-th GO terms is relatively large.

[0074] A concrete example regarding the Czekanowski-Dice distance is given herein. Regarding the GO term “GO:0016757” and the GO term “GO:0003824,” it is assumed that the number of the exclusive candidate biomarkers is 694, the number of the candidate biomarkers in the union is 694, and the number of the candidate biomarkers in the intersection is 0. Under this assumption, the Czekanowski-Dice distance between the GO term “GO:0016757” and the GO term “GO:0003824” is 1.
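The Czekanowski-Dice distance and its combination with the information content distance can be sketched as below. Equation (6) is not reproduced in this text, so the sketch assumes the usual Czekanowski-Dice form (size of the symmetric difference divided by the sum of the sizes of the union and the intersection), which agrees with the example above. The combination uses the averaging embodiment of paragraph [0066]. Function names and gene labels are hypothetical.

```python
def czekanowski_dice_distance(genes_i, genes_j):
    """|A symmetric-difference B| / (|A union B| + |A intersection B|)."""
    genes_i, genes_j = set(genes_i), set(genes_j)
    denominator = len(genes_i | genes_j) + len(genes_i & genes_j)
    if denominator == 0:
        raise ValueError("both annotation sets are empty")
    return len(genes_i ^ genes_j) / denominator


def go_term_distance(ic_distance, cd_distance):
    """Average of the information content and Czekanowski-Dice distances."""
    return (ic_distance + cd_distance) / 2


# Hypothetical annotation sets, for illustration only.
disjoint = czekanowski_dice_distance({"g1", "g2"}, {"g3"})
overlapping = czekanowski_dice_distance({"g1", "g2", "g3"}, {"g2", "g3", "g4"})
combined = go_term_distance(0.03669, 1.0)
```

Disjoint sets give the maximum distance of 1, as in the example above, while the overlapping sets give 2 / (4 + 2), i.e., about 0.33.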

[0075] Algorithms for biomarker functional clustering

[0076] As described above, in some embodiments, the processor 13 further clusters the candidate biomarkers CB_1, ..., CB_k into the functional clusters G_1, ..., G_p.

[0077] In some embodiments, the processor 13 adopts a partition clustering algorithm (e.g., K-means clustering method) to cluster the candidate biomarkers CB_1, ..., CB_k into the functional clusters G_1, ..., G_p based on the gene distances between every pair of the candidate biomarkers CB_1, ..., CB_k.

[0078] Table 13 to Table 21 show several examples of the clustering results by using the K-means clustering method. Specifically, Table 13 is for the target disease “colorectal cancer,” Table 14 is for the target disease “lung cancer,” Table 15 is for the target disease “liver cancer,” Table 16 is for the target disease “pancreatic cancer,” Table 17 is for the target disease “prostate cancer,” Table 18 is for the target disease “breast cancer,” Table 19 is for the target disease “ovarian cancer,” Table 20 is for the target disease “esophagus cancer,” and Table 21 is for the target disease “stomach cancer.” In these examples, the candidate biomarkers CB_1, ..., CB_k being clustered are the intersection of the aforesaid exemplary primary biomarkers PB_1, ..., PB_m and the aforesaid exemplary secondary biomarkers SB_1, ..., SB_n.

Table 13 (K-means clustering result for the target disease “colorectal cancer”)

Table 14 (K-means clustering result for the target disease “lung cancer”)

Table 15 (K-means clustering result for the target disease “liver cancer”)

Table 16 (K-means clustering result for the target disease “pancreatic cancer”)

Table 17 (K-means clustering result for the target disease “prostate cancer”)

Table 18 (K-means clustering result for the target disease “breast cancer”)

Table 19 (K-means clustering result for the target disease “ovarian cancer”)

Table 20 (K-means clustering result for the target disease “esophagus cancer”)

Table 21 (K-means clustering result for the target disease “stomach cancer”)

[0079] In some embodiments, the processor 13 adopts a hierarchical clustering algorithm (e.g., the unweighted pair-group method with arithmetic mean (UPGMA)) to cluster the candidate biomarkers CB_1, ..., CB_k into the functional clusters G_1, ..., G_p based on the gene distances between every pair of the candidate biomarkers CB_1, ..., CB_k.

[0080] Table 22 shows several examples of the clustering results by using the UPGMA method. In these examples, the candidate biomarkers CB_1, ..., CB_k being clustered are the intersection of the aforesaid exemplary primary biomarkers PB_1, ..., PB_m and the aforesaid exemplary secondary biomarkers SB_1, ..., SB_n.

Table 22 (UPGMA clustering results for nine target diseases)

[0081] Weight calculation and target biomarker selection

[0082] As described above, different candidate biomarkers CB_1, ..., CB_k represent different functional roles, and candidate biomarkers within the same functional cluster are close to each other in terms of function. Therefore, to understand the relation between the target disease and at least one category of function(s), at least one of the functional clusters G_1, ..., G_p may be further investigated.

[0083] In some embodiments, all the functional clusters G_1, ..., G_p are further investigated. The processor 13 calculates a weight for each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p. The weight of a candidate biomarker indicates its importance within the functional cluster that it belongs to. Within a functional cluster, the higher the weight is, the more representative the corresponding candidate biomarker is for that functional cluster.

[0084] In some embodiments, the processor 13 determines at least one target biomarker from at least one of the functional clusters according to the weights in each of the functional clusters G_1, ..., G_p. As shown in the example in FIG. 4, the processor 13 determines two target biomarkers Ta, Tb from the functional cluster G_1 according to the weights of the candidate biomarkers in the functional cluster G_1 but determines no target biomarker from the functional cluster G_p according to the weights of the candidate biomarkers in the functional cluster G_p.

[0085] The processor 13 can determine at least one target biomarker from at least one of the functional clusters according to the weights in each of the functional clusters G_1, ..., G_p based on different strategies. In some embodiments, given a functional cluster, the processor 13 may select the candidate biomarker(s) whose weight(s) is/are greater than a third predetermined threshold as the target biomarker(s). In some embodiments, the processor 13 can rank the candidate biomarkers in each of the functional clusters G_1, ..., G_p according to the corresponding weights. For those embodiments, the processor 13 can determine the target biomarker(s) for each of the functional clusters G_1, ..., G_p according to the corresponding ranking result.
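The two selection strategies can be sketched as below. The function names, the threshold value, the number of top-ranked biomarkers kept, the extra biomarker label "Tc", and all weights are hypothetical; only the labels Ta and Tb come from FIG. 4.

```python
def select_by_threshold(weights, threshold):
    """Keep the candidate biomarkers whose weight exceeds the threshold."""
    return [name for name, weight in weights.items() if weight > threshold]


def select_top_ranked(weights, top_n):
    """Rank the candidate biomarkers by weight (descending) and keep the first top_n."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:top_n]


# Hypothetical weights for one functional cluster, for illustration only.
cluster_weights = {"Ta": 0.9, "Tb": 0.7, "Tc": 0.2}
by_threshold = select_by_threshold(cluster_weights, threshold=0.5)
by_rank = select_top_ranked(cluster_weights, top_n=1)
```

A threshold can return no target biomarker for a cluster whose weights are all low, as for the functional cluster G_p in FIG. 4, whereas ranking returns up to `top_n` biomarkers for any non-empty cluster.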

[0086] The above description regarding weight calculation and target biomarker selection is for the case that all the functional clusters G_1, ..., G_p are further investigated. As mentioned, it is also feasible that only one or some of the functional clusters G_1, ..., G_p are further investigated. A person having ordinary skill in the art shall understand how to modify the aforesaid operations for the case that only one or some of the functional clusters G_1, ..., G_p are further investigated and, thus, the details are not described herein.

[0087] Recurrent neural network for weight calculation

[0088] In some embodiments, the processor 13 executes a recurrent neural network M and calculates the weight of each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p by the recurrent neural network M. As shown in FIG. 5, the recurrent neural network M is attention-based and comprises an encoder EN, an attention mechanism AM, and a decoder DE, wherein the attention mechanism AM may be a two-layer fully connected network. Please note that there is only one encoder EN in the recurrent neural network M. Although more than one encoder EN is shown in FIG. 5, they are shown to represent that the encoder EN executes several times (which will be elaborated later). The recurrent neural network M can be trained for outputting a prediction P regarding whether an inputted biomarker sequence corresponds to a subject having the target disease (which will be elaborated later).

[0089] In those embodiments, the storage 11 stores a plurality of candidate biomarker sequences D3_1, ..., D3_s, which may be retrieved from a fifth database through a transceiving interface (not shown) of the methylation biomarker selection apparatus 1. Each of the candidate biomarker sequences D3_1, ..., D3_s corresponds to one of the candidate biomarkers CB_1, ..., CB_k. The candidate biomarker sequences D3_1, ..., D3_s are classified into a normal subject group or a disease subject group. The normal subject group comprises the candidate biomarker sequences related to the subjects without the target disease, while the disease subject group comprises the candidate biomarker sequences related to the subjects with the target disease.

[0090] In those embodiments, the processor 13 calculates the weight for each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p by the following operations (h), (i), (j), (k), and (l).

[0091] In the operation (h), the processor 13 derives a plurality of normal attention weights from the attention mechanism AM by inputting, into the recurrent neural network M, the candidate biomarker sequences that correspond to the candidate biomarker and come from the normal subject group.

[0092] A concrete example is given herein for better understanding. It is assumed that the processor 13 is handling the functional cluster G_p, and the functional cluster G_p comprises three candidate biomarkers gp1, gp2, gp3. It is also assumed that the candidate biomarker sequences comprised in the normal subject group correspond to N normal subjects (i.e., N subjects without the target disease), wherein N is a positive integer. For each of the N normal subjects, his or her candidate biomarker sequences sg1, sg2, sg3, respectively corresponding to the candidate biomarkers gp1, gp2, gp3, are inputted to the encoder EN in sequence. As shown in FIG. 5, the encoder EN outputs a feedback vector ht1 and a status vector hs1 in response to the candidate biomarker sequence sg1, outputs a feedback vector ht2 and a status vector hs2 in response to the candidate biomarker sequence sg2 and the feedback vector ht1, and outputs a feedback vector ht3 and a status vector hs3 in response to the candidate biomarker sequence sg3 and the feedback vector ht2. The attention mechanism AM outputs the normal attention weights aw1, aw2, aw3 in response to the status vectors hs1, hs2, hs3 and the feedback vector ht3, wherein the normal attention weights aw1, aw2, aw3 respectively correspond to the candidate biomarkers gp1, gp2, gp3. After the candidate biomarker sequences of all the N normal subjects have been processed, N normal attention weights for each of the candidate biomarkers gp1, gp2, gp3 will be derived.

[0093] Although the above concrete example is for the functional cluster G_p, a person having ordinary skill in the art shall understand that the normal attention weights corresponding to the candidate biomarker(s) in each of the remaining functional clusters can be derived by the same approach. Hence, the details are not repeated.

[0094] In the operation (i), the processor 13 derives a plurality of disease attention weights from the attention mechanism AM by inputting, into the recurrent neural network M, the candidate biomarker sequences that correspond to the candidate biomarker and come from the disease subject group. The operation (i) is similar to the operation (h), and the only difference is that the operation (i) is applied to candidate biomarker sequences from the disease subject group. A person having ordinary skill in the art shall understand the details of the operation (i) based on the above description of the operation (h).

[0095] In the operation (j), the processor 13 calculates an averaged normal weight by averaging the normal attention weights. Taking the candidate biomarker gp1 as an example, the processor 13 calculates the averaged normal weight corresponding to the candidate biomarker gp1 by averaging the normal attention weights corresponding to the candidate biomarker gp1. Please note that the processor 13 calculates an averaged normal weight for each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p.

[0096] In the operation (k), the processor 13 calculates an averaged disease weight by averaging the disease attention weights. Similarly, taking the candidate biomarker gp1 as an example, the processor 13 calculates the averaged disease weight corresponding to the candidate biomarker gp1 by averaging the disease attention weights corresponding to the candidate biomarker gp1. Please also note that the processor 13 calculates an averaged disease weight for each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p.

[0097] In the operation (l), the processor 13 calculates the weight according to the averaged normal weight and the averaged disease weight. Again, taking the candidate biomarker gp1 as an example, the processor 13 calculates the weight of the candidate biomarker gp1 according to the averaged normal weight of the candidate biomarker gp1 and the averaged disease weight of the candidate biomarker gp1. Similarly, the processor 13 calculates the weight for each of the candidate biomarkers in each of the functional clusters G_1, ..., G_p.
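Operations (j), (k), and (l) for a single candidate biomarker can be sketched as follows. The text does not specify how the averaged normal weight and the averaged disease weight are combined, so the absolute difference used below is purely an assumption for illustration. The function name and the attention weights are hypothetical as well.

```python
from statistics import mean


def biomarker_weight(normal_attention_weights, disease_attention_weights):
    """Return (averaged_normal, averaged_disease, weight) for one candidate biomarker."""
    averaged_normal = mean(normal_attention_weights)      # operation (j)
    averaged_disease = mean(disease_attention_weights)    # operation (k)
    # Operation (l). Assumption: the weight is the absolute difference of the
    # two averages; the source only says it is calculated "according to" them.
    weight = abs(averaged_disease - averaged_normal)
    return averaged_normal, averaged_disease, weight


# Hypothetical attention weights for one candidate biomarker.
avg_normal, avg_disease, weight = biomarker_weight(
    [0.20, 0.30, 0.25],   # one value per normal subject
    [0.50, 0.60, 0.55],   # one value per disease subject
)
```

Under this assumed combination, a biomarker that receives similar attention in both groups gets a weight near zero, while one that the network attends to very differently in the two groups gets a large weight.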

[0098] The advantage of using the recurrent neural network M for weight calculation is that the recurrent neural network M is good at handling long data sequences. A conventional neural network model usually faces the technical problem of lacking sufficient space for storing long data sequences. The attention mechanism AM of the recurrent neural network M has the ability to ignore less important data. As only the more important data is stored, adopting the recurrent neural network M for weight calculation does not face the technical problem of lacking sufficient space for storing data.

[0099] As described above, the recurrent neural network M can be trained for outputting a prediction P regarding whether the inputted biomarker sequences correspond to a subject having the target disease. In the example shown in FIG. 5 (i.e., the example in which the inputted biomarker sequences are the candidate biomarker sequences sg1, sg2, sg3), the weighted summation operation OP generates a signal by weighting the status vectors hs1, hs2, hs3 by the normal attention weights aw1, aw2, aw3 respectively and summing them up, and the decoder DE then generates the prediction P in response to the signal from the weighted summation operation OP.
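The weighted summation operation OP can be sketched in isolation as below. The encoder EN, the attention mechanism AM, and the decoder DE are not modeled; the status vectors and attention weights are hypothetical two-dimensional toy values, and the function name is illustrative.

```python
def weighted_summation(status_vectors, attention_weights):
    """Sum the status vectors, each scaled by its attention weight."""
    if len(status_vectors) != len(attention_weights):
        raise ValueError("one attention weight is required per status vector")
    dimension = len(status_vectors[0])
    return [
        sum(w * vector[d] for w, vector in zip(attention_weights, status_vectors))
        for d in range(dimension)
    ]


# Hypothetical status vectors hs1, hs2, hs3 and attention weights aw1, aw2, aw3.
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aw = [0.5, 0.3, 0.2]
signal = weighted_summation(hs, aw)
```

The resulting signal has the same dimension as one status vector and is what the decoder DE would receive in this toy setting.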

[0100] Candidate biomarker validation

[0101] In some embodiments, to achieve a more accurate result, the processor 13 validates the candidate biomarkers CB_1, ..., CB_k before performing biomarker functional clustering and eliminates the candidate biomarker(s) that fail(s) the validation. Candidate biomarker validation comprises two stages, including optimal cut-point selection and candidate biomarker screening.

[0102] In the first stage, the processor 13 determines an optimal cut-point from a plurality of preset cut-points for each of the candidate biomarkers CB_1, ..., CB_k by the following operations (m), (n), (o), and (p). The optimal cut-point of a candidate biomarker may be considered as a threshold for determining whether a methylation degree corresponding to this candidate biomarker is severe. A preset cut-point may be a value between 0 and the maximum value of the methylation degree. It is noted that the present invention does not limit the number of the preset cut-points. Nevertheless, more preset cut-points will result in a more accurate optimal cut-point. As an example, if the maximum value of the methylation degree is 1 and 99 preset cut-points are desired, the values of the 99 preset cut-points can be set to 0.01, 0.02, ..., and 0.99.

[0103] In the operation (m), the processor 13 calculates an averaged normal value according to the methylation degrees corresponding to the concerned candidate biomarker (e.g., the candidate biomarker CB_1) from the normal subject group based on the first data sets D1_1, ..., D1_q. Please note that if the averaged normal value has been calculated (e.g., the aforesaid operation (a) has been executed), the operation (m) can be omitted.

[0104] In the operation (n), the processor 13 calculates a plurality of first difference values by subtracting the averaged normal value from each of the methylation degrees corresponding to the concerned candidate biomarker (e.g., the candidate biomarker CB_1) recorded in the first data sets D1_1, ..., D1_q.

[0105] In the operation (o), the processor 13 generates a first confusion matrix for each of the preset cut-points according to the first difference values corresponding to the concerned candidate biomarker (e.g., the candidate biomarker CB_1).

[0106] A concrete example is given herein for better understanding. The first confusion matrix for a concerned candidate biomarker (e.g., the candidate biomarker CB_1) and a concerned preset cut-point (e.g., 0.02) comprises the following four statistical numbers: (i) the total number of the subjects that are predicted as having the target disease and do have the target disease, which is represented by the variable N_TP, (ii) the total number of the subjects that are predicted as having the target disease but do not have the target disease, which is represented by the variable N_FP, (iii) the total number of the subjects that are predicted as not having the target disease but do have the target disease, which is represented by the variable N_FN, and (iv) the total number of the subjects that are predicted as not having the target disease and actually do not have the target disease, which is represented by the variable N_TN.

[0107] For a first difference value, if it is greater than the concerned preset cut-point (e.g., 0.02), it is predicted that the corresponding subject has the target disease. In addition, whether a subject corresponding to a first difference value has the target disease is known because a first difference value is calculated based on a methylation degree recorded in one of the first data sets D1_1, ..., D1_q, and each of the first data sets D1_1, ..., D1_q belongs to the normal subject group or the target subject group.
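
The counting described above for one preset cut-point may be sketched as follows. This is an illustrative sketch only: the class name, the function name, and the sample values are hypothetical, and a subject is predicted as having the target disease only when the difference value is strictly greater than the cut-point, as stated above.

```python
from typing import NamedTuple, Sequence


class ConfusionMatrix(NamedTuple):
    n_tp: int  # predicted as having the disease, and does have it
    n_fp: int  # predicted as having the disease, but does not have it
    n_fn: int  # predicted as not having the disease, but does have it
    n_tn: int  # predicted as not having the disease, and does not have it


def first_confusion_matrix(diffs: Sequence[float],
                           has_disease: Sequence[bool],
                           cut_point: float) -> ConfusionMatrix:
    """Count the four statistical numbers for one preset cut-point."""
    n_tp = n_fp = n_fn = n_tn = 0
    for diff, diseased in zip(diffs, has_disease):
        predicted = diff > cut_point  # strictly greater than the cut-point
        if predicted and diseased:
            n_tp += 1
        elif predicted:
            n_fp += 1
        elif diseased:
            n_fn += 1
        else:
            n_tn += 1
    return ConfusionMatrix(n_tp, n_fp, n_fn, n_tn)


# Hypothetical difference values and known group membership (illustration only).
example_diffs = [0.00, 0.03, 0.01, 0.25, 0.40, 0.01]
example_labels = [False, False, False, True, True, True]
cm = first_confusion_matrix(example_diffs, example_labels, cut_point=0.02)
```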

[0108] In the operation (p), the processor 13 selects one of the preset cut-points as the optimal cut-point for the concerned candidate biomarker (e.g., the candidate biomarker CB_1) according to the corresponding first confusion matrixes.

[0109] For a concerned candidate biomarker (e.g., the candidate biomarker CB_1), a first confusion matrix for each of the preset cut-points has been generated in the operation (o). For example, if there are 99 preset cut-points, there will be 99 first confusion matrixes corresponding to the concerned candidate biomarker. In some embodiments, for each of the first confusion matrixes, the processor 13 can generate a sensitivity value (i.e., N_TP / (N_TP + N_FN)) and a specificity value (i.e., N_TN / (N_TN + N_FP)) based on the first confusion matrix and then generate a summarized value of the sensitivity value and the specificity value. Then, the processor 13 selects the preset cut-point with the greatest summarized value as the optimal cut-point for the concerned candidate biomarker.
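
The selection in the operation (p) may be sketched as follows. The sketch assumes that the summarized value is the plain sum of the sensitivity value and the specificity value, and that the lowest preset cut-point is kept when several cut-points share the greatest summarized value; both points are assumptions made for illustration, as are the function names and the sample values.

```python
def confusion_counts(diffs, labels, cut_point):
    """Return (N_TP, N_FP, N_FN, N_TN) for one preset cut-point."""
    n_tp = sum(1 for d, y in zip(diffs, labels) if d > cut_point and y)
    n_fp = sum(1 for d, y in zip(diffs, labels) if d > cut_point and not y)
    n_fn = sum(1 for d, y in zip(diffs, labels) if d <= cut_point and y)
    n_tn = sum(1 for d, y in zip(diffs, labels) if d <= cut_point and not y)
    return n_tp, n_fp, n_fn, n_tn


def summarized_value(n_tp, n_fp, n_fn, n_tn):
    """Assumed summary: sensitivity plus specificity."""
    sensitivity = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    specificity = n_tn / (n_tn + n_fp) if (n_tn + n_fp) else 0.0
    return sensitivity + specificity


def optimal_cut_point(diffs, labels, cut_points):
    """Pick the preset cut-point with the greatest summarized value."""
    # max() keeps the first maximum, i.e. the lowest cut-point on ties.
    return max(cut_points,
               key=lambda c: summarized_value(*confusion_counts(diffs, labels, c)))


# 99 preset cut-points: 0.01, 0.02, ..., 0.99.
preset_cut_points = [i / 100 for i in range(1, 100)]

# Hypothetical difference values and known group membership (illustration only).
sample_diffs = [0.00, 0.03, 0.01, 0.25, 0.40, 0.20]
sample_labels = [False, False, False, True, True, True]
best = optimal_cut_point(sample_diffs, sample_labels, preset_cut_points)
```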

[0110] The second stage (i.e., candidate biomarker screening) is described herein. To perform the second stage, the storage 11 stores a plurality of third data sets D4_1, ..., D4_t, wherein each of the third data sets D4_1, ..., D4_t comprises a plurality of methylation degrees corresponding to the methylation loci. The methylation biomarker selection apparatus 1 may derive the third data sets D4_1, ..., D4_t from a sixth database (e.g., Gene Expression Omnibus (GEO) database) through a transceiving interface (not shown) of the methylation biomarker selection apparatus 1.

[0111] Examples regarding the information related to the third data sets D4_1, ..., D4_t used for nine target diseases are shown in Table 23. Please note that the data files from TCGA are of March 15, 2021, and the data files from GEO database are of October 30, 2021. In addition, the variable N_N represents the number of the subjects without the target disease, and the variable N_TD represents the number of the subjects with the target disease.

Table 23

[0112] The processor 13 validates each of the candidate biomarkers CB_1, ..., CB_k by the following operations (q), (r), (s), and (t).

[0113] In the operation (q), the processor 13 calculates a plurality of second difference values by subtracting the averaged normal value from each of the methylation degrees that correspond to the candidate biomarker and are recorded in the third data sets D4_1, ..., D4_t.

[0114] In the operation (r), the processor 13 generates a second confusion matrix for the optimal cut-point according to the optimal cut-point and the second difference values corresponding to the candidate biomarker. Similarly, the second confusion matrix comprises the following four statistical numbers: (i) the total number of the subjects that are predicted as having the target disease and do have the target disease, (ii) the total number of the subjects that are predicted as having the target disease but do not have the target disease, (iii) the total number of the subjects that are predicted as not having the target disease but do have the target disease, and (iv) the total number of the subjects that are predicted as not having the target disease and actually do not have the target disease.

[0115] In the operation (s), the processor 13 generates a sensitivity value, a specificity value, and an accuracy value (i.e., the ratio that the prediction is correct) according to the second confusion matrix. For better understanding, please refer to Table 24 for the statistics of the accuracy values of the candidate biomarkers of each of the nine target diseases.

Atorney Docket No. 3819.0470W01

Table 24

[0116] In the operation (t), the processor 13 validates the candidate biomarker according to the accuracy value and a fourth predetermined threshold. For example, if the accuracy value of a candidate biomarker is lower than the fourth predetermined threshold, that candidate biomarker is eliminated.
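
The operations (s) and (t) may be sketched as follows, assuming that the second confusion matrix of each candidate biomarker is already available as a tuple (N_TP, N_FP, N_FN, N_TN). The threshold of 0.8, the function names, and the counts are hypothetical values chosen for illustration only.

```python
def accuracy_value(n_tp, n_fp, n_fn, n_tn):
    """Ratio of correct predictions among all predictions."""
    total = n_tp + n_fp + n_fn + n_tn
    return (n_tp + n_tn) / total if total else 0.0


def validate_candidates(second_matrixes, threshold):
    """Eliminate candidates whose accuracy value is lower than the threshold."""
    return [name for name, counts in second_matrixes.items()
            if accuracy_value(*counts) >= threshold]


# Hypothetical second confusion matrixes (illustration only).
second_matrixes = {
    "CB_1": (40, 5, 10, 45),   # accuracy 85/100
    "CB_2": (20, 30, 30, 20),  # accuracy 40/100
}
validated = validate_candidates(second_matrixes, threshold=0.8)
```

In this sketch a candidate biomarker whose accuracy value equals the threshold is kept, because the elimination condition stated above is "lower than".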

[0117] For the embodiments that perform candidate biomarker validation, only candidate biomarkers that pass the validation (i.e., have not been eliminated) will be functionally clustered.

[0118] FIG. 6 illustrates the main flowchart of a methylation biomarker selection method in some embodiments of the present invention. The methylation biomarker selection method is for use in an electronic apparatus (e.g., the methylation biomarker selection apparatus 1). The electronic apparatus stores a plurality of first data sets and a plurality of second data sets, wherein each of the first data sets comprises a plurality of methylation degrees corresponding to a plurality of methylation loci and each of the second data sets comprises at least one medical record. The methylation biomarker selection method comprises the following steps S601, S603, and S605.

[0119] In the step S601, the electronic apparatus determines a plurality of primary biomarkers by identifying a plurality of differentiable loci from the methylation loci according to the methylation degrees in the first data sets. In some embodiments, the step S601 comprises a step of selecting the methylation loci having at least one of an averaged methylation degree difference conforming to a first predetermined rule and a p-value conforming to a second predetermined rule as the differentiable loci, wherein the differentiable loci are determined as the primary biomarkers.
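
One possible reading of the step S601 may be sketched as follows. The sketch assumes that the first predetermined rule is "the absolute averaged methylation degree difference is at least 0.2", that the second predetermined rule is "the p-value is at most 0.05", and that the p-values have been computed beforehand by a statistical test of choice. These rules, the locus identifiers, and the data are hypothetical and serve as an illustration only.

```python
from statistics import mean


def differentiable_loci(normal, disease, p_values,
                        min_abs_diff=0.2, max_p_value=0.05):
    """Select loci satisfying at least one of the two assumed rules."""
    selected = []
    for locus in normal:
        diff = mean(disease[locus]) - mean(normal[locus])
        meets_first_rule = abs(diff) >= min_abs_diff
        meets_second_rule = p_values[locus] <= max_p_value
        if meets_first_rule or meets_second_rule:
            selected.append(locus)
    return selected


# Hypothetical methylation degrees per locus (illustration only).
normal_degrees = {"locus_A": [0.1, 0.2], "locus_B": [0.5, 0.5], "locus_C": [0.3, 0.3]}
disease_degrees = {"locus_A": [0.6, 0.7], "locus_B": [0.52, 0.5], "locus_C": [0.35, 0.4]}
precomputed_p = {"locus_A": 0.001, "locus_B": 0.6, "locus_C": 0.01}

primary_biomarkers = differentiable_loci(normal_degrees, disease_degrees, precomputed_p)
```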

[0120] In the step S603, the electronic apparatus determines a plurality of secondary biomarkers by identifying a plurality of comorbidities of a target disease, and associated genes thereof based on the second data sets. In some embodiments, the step S603 comprises a step of calculating an association degree indicating relevance to the target disease for each of the distinct diagnosed diseases, a step of selecting the diagnosed diseases having the association degree conforming to a third predetermined rule as the comorbidities, and a step of determining a plurality of genes corresponding to the comorbidities as the secondary biomarkers. In some embodiments, the association degree of each of the distinct diagnosed diseases comprises an odds ratio, a p-value, and a supporting rate.

[0121] In the step S605, the electronic apparatus determines a plurality of candidate biomarkers based on a correlation analysis of the primary biomarkers and the secondary biomarkers. Please note that the order for executing steps S601 and S603 is not limited by the present invention. In one example, the step S603 may be executed prior to the step S601. In another example, the step S601 and the step S603 may be executed at the same time.

[0122] FIG. 7 illustrates the main flowchart of a methylation biomarker selection method in some embodiments of the present invention. In those embodiments, the methylation biomarker selection method further comprises the following steps S707, S709, and S711 in addition to the steps S601, S603, and S605.

[0123] In the step S707, the electronic apparatus clusters the candidate biomarkers into a plurality of functional clusters. In some embodiments, the step S707 clusters the candidate biomarkers into the functional clusters based on a plurality of gene distances between every pair of the candidate biomarkers. In those embodiments, the step S707 comprises a step of calculating at least one gene distance, which further comprises a step of calculating a GO term distance for each of at least one GO term pair between a first candidate biomarker and a second candidate biomarker and a step of determining the gene distance between the first candidate biomarker and the second candidate biomarker according to the at least one GO term distance. In some embodiments, each of the GO term distances is calculated based on an information content distance and a Czekanowski-Dice distance.
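
For reference, the Czekanowski-Dice distance between two sets may be sketched as follows, using its commonly known set form (the size of the symmetric difference divided by the sum of the sizes of the union and the intersection). Which sets are compared for a GO term pair and how this distance is combined with the information content distance are not restated here, so the sets below are hypothetical placeholders.

```python
def czekanowski_dice_distance(set_a, set_b):
    """Czekanowski-Dice distance: |A symmetric-difference B| / (|A union B| + |A intersection B|)."""
    if not set_a and not set_b:
        return 0.0  # two empty sets are treated as identical
    return len(set_a ^ set_b) / (len(set_a | set_b) + len(set_a & set_b))


# Hypothetical annotation sets (illustration only).
annotations_1 = {"g1", "g2", "g3"}
annotations_2 = {"g2", "g3", "g4"}
distance = czekanowski_dice_distance(annotations_1, annotations_2)
```

The distance is 0 for identical sets and 1 for disjoint sets, which makes it usable as an input to a distance-based clustering.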

[0124] In the step S709, the electronic apparatus calculates a weight for each of the candidate biomarkers in each of the functional clusters. In some embodiments, the electronic apparatus executes a recurrent neural network comprising an encoder, an attention mechanism, and a decoder, and the step S709 is realized by the recurrent neural network. In those embodiments, each of a plurality of candidate biomarker sequences belongs to one of a normal subject group and a disease subject group, each of the candidate biomarker sequences corresponds to one of the candidate biomarkers, and the step S709 comprises the steps S801, S803, S805, S807, and S809 as shown in FIG. 8.

[0125] In the step S801, the electronic apparatus derives a plurality of normal attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the normal subject group into the recurrent neural network. In the step S803, the electronic apparatus derives a plurality of disease attention weights from the attention mechanism by inputting the candidate biomarker sequences corresponding to the candidate biomarker and from the disease subject group into the recurrent neural network. In the step S805, the electronic apparatus calculates an averaged normal weight by averaging the normal attention weights. In the step S807, the electronic apparatus calculates an averaged disease weight by averaging the disease attention weights. In the step S809, the electronic apparatus calculates the weight according to the averaged normal weight and the averaged disease weight. Please note that the steps S801, S803, S805, and S807 may be executed in another order as long as the step S801 is prior to the step S805 and the step S803 is prior to the step S807, because each averaging step requires the attention weights derived for the same subject group.
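
The steps S805, S807, and S809 may be sketched as follows, taking the attention weights as given inputs (the recurrent neural network itself is outside the scope of this sketch). How the weight is calculated from the two averaged weights is not specified above; the sketch assumes, for illustration only, the absolute difference between the averaged disease weight and the averaged normal weight. The function name and the attention weights are hypothetical.

```python
from statistics import mean


def biomarker_weight(normal_attention_weights, disease_attention_weights):
    """Combine averaged attention weights into one weight (assumed rule)."""
    averaged_normal_weight = mean(normal_attention_weights)    # step S805
    averaged_disease_weight = mean(disease_attention_weights)  # step S807
    # Step S809, under the assumption stated above: absolute difference.
    return abs(averaged_disease_weight - averaged_normal_weight)


# Hypothetical attention weights for one candidate biomarker (illustration only).
weight = biomarker_weight([0.10, 0.20], [0.50, 0.70])
```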

[0126] In the step S711, the electronic apparatus determines at least one target biomarker from at least one of the functional clusters according to the weights in each of the functional clusters. In some embodiments, the methylation biomarker selection method further comprises a step of ranking the candidate biomarkers in each of the functional clusters according to the corresponding weights. In those embodiments, the step S711 may determine the at least one target biomarker from at least one of the functional clusters according to the ranking result of each of the functional clusters.
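
The ranking and the step S711 may be sketched as follows, assuming that the target biomarkers are the top-ranked candidate biomarkers of each functional cluster. The number of biomarkers taken per cluster, the cluster names, and the weights are hypothetical values used for illustration only.

```python
def select_target_biomarkers(cluster_weights, top_n=1):
    """Rank candidates in each functional cluster by weight and keep the top ones."""
    targets = {}
    for cluster, weights in cluster_weights.items():
        ranked = sorted(weights, key=weights.get, reverse=True)
        targets[cluster] = ranked[:top_n]
    return targets


# Hypothetical weights per functional cluster (illustration only).
cluster_weights = {
    "cluster_1": {"CB_1": 0.45, "CB_2": 0.10},
    "cluster_2": {"CB_3": 0.05, "CB_4": 0.30, "CB_5": 0.20},
}
target_biomarkers = select_target_biomarkers(cluster_weights)
```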

[0127] In addition to the previously mentioned steps, the methylation biomarker selection method provided by the present invention can also execute all the operations and steps that can be executed by the methylation biomarker selection apparatus 1, have the same functions as the methylation biomarker selection apparatus 1, and deliver the same technical effects as the methylation biomarker selection apparatus 1. How the methylation biomarker selection method provided by the present invention executes these operations and steps, has the same functions, and delivers the same technical effects as the methylation biomarker selection apparatus 1 will be readily appreciated by a person having ordinary skill in the art based on the above explanation of the methylation biomarker selection apparatus 1 and, thus, will not be further described herein.

[0128] The methylation biomarker selection method described in the above embodiments may be implemented as a computer program comprising a plurality of codes. The computer program is stored in a non-transitory computer readable storage medium. After the codes of the computer program are loaded into an electronic apparatus (e.g., the methylation biomarker selection apparatus 1), the computer program executes the methylation biomarker selection method as described in the above embodiments. The non-transitory computer readable storage medium may be an electronic product, such as a Read Only Memory (ROM), a flash memory, a floppy disk, a hard disk, a Compact Disk (CD), a Digital Versatile Disc (DVD), a mobile disk, a database accessible to networks, or any other storage media with the same function and well-known to a person having ordinary skill in the art.

[0129] Clinical validation of target biomarkers for colorectal cancer

[0130] In order to confirm the utility of the candidate biomarkers in the clinical setting, the methylation-specific Polymerase Chain Reaction (PCR) strategy is utilized to accomplish the clinical validation on these candidate biomarkers of the colorectal cancer using DNA extracted from formalin-fixed, paraffin-embedded (FFPE) tumor tissue specimens. Taking colorectal cancer as an example, 10 target biomarkers are selected from 141 candidate biomarkers, and the corresponding quantitative methylation-specific PCR (qMSP) primers are designed for each target biomarker. First, commercial human methylated and non-methylated DNA standards (Zymo Research, Cat. #D5014) are used to test the primer performance and to build the calibration curves for subsequent estimation of methylation levels in the clinical samples.

[0131] Next, 99 clinical FFPE samples are selected, including 18 normal tissues and 81 tumor tissues across 9 cancer types, to ascertain the methylation levels of the 10 selected target biomarkers of the colorectal cancer in various cancer specimens. The extracted DNA undergoes bisulfite conversion by using the EZ DNA Methylation-Lightning™ kit (Zymo Research, Cat. #D5031) following the manufacturer’s instruction manual. Finally, the bisulfite-converted DNA is subjected to qMSP tests, and its methylation levels are determined by using the calibration curves.

[0132] All the results are presented in FIG. 9 and Table 25 to Table 33 below. In FIG. 9, “CRC” stands for colorectal cancer, “LC” stands for lung cancer, “BC” stands for breast cancer, “EC” stands for esophageal cancer, “GC” stands for gastric cancer, “HCC” stands for hepatocellular carcinoma, “OV” stands for ovarian cancer, “Pan” stands for pancreatic cancer, and “Pros” stands for prostate cancer. In addition, Table 25 is for “colorectal cancer,” Table 26 is for “lung cancer,” Table 27 is for “breast cancer,” Table 28 is for “esophageal cancer,” Table 29 is for “gastric cancer,” Table 30 is for “hepatocellular carcinoma,” Table 31 is for “ovarian cancer,” Table 32 is for “pancreatic cancer,” and Table 33 is for “prostate cancer.”

[0133] The results reveal that the methylation levels of the target biomarkers of the colorectal cancer are significantly up-regulated in colorectal cancer tumor tissue compared to normal tissues. In addition, ADHFE1, PLD5, and NRG1 had a higher methylation level in gastric (GC), esophageal (EC), and pancreatic (Pan) cancers. In contrast, the methylation extent of the MMP23B gene seemed to be elevated in every tested cancer type.

Table 25 (Clinical validation result for colorectal cancer)

Table 26 (Clinical validation result for lung cancer)

Table 27 (Clinical validation result for breast cancer)

Table 28 (Clinical validation result for esophageal cancer)

Table 29 (Clinical validation result for gastric cancer)

Table 30 (Clinical validation result for hepatocellular carcinoma)

Table 31 (Clinical validation result for ovarian cancer)

Table 32 (Clinical validation result for pancreatic cancer)

Table 33 (Clinical validation result for prostate cancer)

[0134] It shall be appreciated that, in the specification and the claims of the present invention, some terms (e.g., data sets, database, predetermined rule, predetermined threshold, candidate biomarker, difference value, confusion matrix) are preceded by “first,” “second,” “third,” “fourth,” “fifth,” or “sixth.” Please note that “first,” “second,” “third,” “fourth,” “fifth,” and “sixth” are used only for distinguishing different terms. If the order of these terms is not specified or cannot be derived from the context, the order of these terms is not limited by the preceding “first,” “second,” “third,” “fourth,” “fifth,” and “sixth.”

[0135] Furthermore, it shall be appreciated that the aforesaid normal subjects and the normal subject group may have different meanings in different embodiments. For example, if the methylation biomarker selection apparatus or method aims to find out the candidate biomarkers and/or target biomarker(s) for a specific race, the aforesaid normal subjects and the normal subject group may be narrowed down to subjects of that specific race without the target disease.

[0136] According to the above descriptions, the methylation biomarker selection technique (which at least comprises the methylation biomarker selection apparatuses and methods) provided by the present invention utilizes two different kinds of data sets (i.e., the first data sets and the second data sets) to discover candidate biomarkers pertaining to a target disease. While the first data sets comprise methylation degrees of various methylation loci, the second data sets comprise medical record(s). With the first data sets, differentiable loci can be identified as the primary biomarkers pertaining to the target disease. With the second data sets, comorbidities of the target disease and associated genes thereof can be identified so as to provide the secondary biomarkers pertaining to the target disease. As both methylation degrees and comorbidities of the target disease are considered, the methylation biomarker selection technique of the present invention can provide candidate biomarkers that are highly sensitive and highly specific to the target disease. Furthermore, as the candidate biomarkers are determined based on a correlation analysis of the primary biomarkers and the secondary biomarkers, a sufficient number of candidate biomarkers can be provided.

[0137] The above disclosure is related to the detailed technical contents and inventive features thereof. People skilled in this field may proceed with a variety of modifications and replacements based on the disclosures and suggestions of the invention as described without departing from the characteristics thereof. Nevertheless, although such modifications and replacements are not fully disclosed in the above descriptions, they have substantially been covered in the following claims as appended.