A NORMALIZATION METHOD FOR SAMPLE ASSAYS

Title:

A NORMALIZATION METHOD FOR SAMPLE ASSAYS

Document Type and Number:

WIPO Patent Application WO/2017/083310

Abstract:

Disclosed herein are covariation-based methods and systems for identifying normalizers to normalize measurements of one or more target analytes from any quantitative assays. The methods and systems can also be used to categorize biological samples as normal or abnormal based on the normalized measurements of one or more target analytes.

Inventors:

LI XITONG (US)

Application Number:

PCT/US2016/061001

Publication Date:

May 18, 2017

Filing Date:

November 08, 2016

Assignee:

INKARYO CORP (US)

International Classes:

A61B5/00; C12N15/115; G01N33/543; G06F17/10; G06N5/02; G16B25/10

Foreign References:

US20110306856A1	2011-12-15
US20120208282A1	2012-08-16
US20100204055A1	2010-08-12
US20050090021A1	2005-04-28
US20140274795A1	2014-09-18
US20140242588A1	2014-08-28
US20150227681A1	2015-08-13
US20110137851A1	2011-06-09
US20090043171A1	2009-02-12
US20140335630A1	2014-11-13

Other References:

SPELLMAN ET AL.: "Development and evaluation of a multiplexed mass spectrometry based assay for measuring candidate peptide biomarkers in Alzheimer's Disease Neuroimaging Initiative (ADNI) CSF.", PROTEOMICS-CLINICAL APPLICATIONS., vol. 9, no. 7-8, 24 April 2015 (2015-04-24), pages 715 - 731, XP055381013, Retrieved from the Internet

Attorney, Agent or Firm:

MAO, Yifan et al. (US)

Claims:

We claim:

1. A method of identifying a plurality of normalizers from a pool of candidate analytes for a target analyte in an assay, the method comprising: a) obtaining measurements of the target analyte and the measurements of the pool of candidate analytes from a group of training samples, b) calculating a covariation value of the target analyte and the measurements of one of the pool of candidate analytes, c) identifying the candidate analyte as a normalizer when the covariation value of the one of the pool of candidate analytes and the target analyte is higher than a threshold, and d) repeating steps b) and c) with the rest of the pool of candidate analytes until a plurality of normalizers is identified, wherein the plurality of normalizers are used to normalize measurements of the target analyte in a test sample.

2. The method of claim 1, further comprising e) measuring the amount of target analyte and the plurality of normalizers in a test sample to obtain the measurement of the target analyte and measurements of the plurality of normalizers, and f) normalizing the measurement of the target analyte in the test sample by producing a normalized quotient (NQ) for the target analyte, wherein the NQ is produced by dividing the measurement of the target analyte by a sum or weighted sum of the measurements of the normalizers in the test sample.

3. A method of categorizing a test sample based on a target analyte comprising steps of: a) generating an NQ of the target analyte in the test sample using the method of claim 2; and b) categorizing the test sample as normal based on the target analyte when said NQ is (1) greater than or equal to a lower limit of a normal NQ range and (2) less than or equal to an upper limit of the normal NQ range, or categorizing the test sample as abnormal when said NQ is (1) lower than the lower limit of the normal NQ range or (2) higher than the higher limit of said normal NQ range.

4. The method of any of claims 1-3, wherein the group of training samples comprises at least 3, at least 4, at least 5, or at least 10 samples.

5. The method of any of claims 1-3, wherein said covariation value is a correlation that is selected from the group consisting of a Spearman correlation, a Pearson correlation, and a Kendall correlation.

6. The method of any of claims 1-3, wherein said covariation value is a covariance.

7. The method of any of claims 1-3, wherein said threshold is a decimal number greater than or equal to 0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, or 0.99.

8. The method of any of claims 1-3, wherein the target analyte is a gene, a chromosome or a chromosomal bin.

9. The method of any of claims 1-3, wherein said threshold is determined by a minimum rank among the covariation values of all candidate analytes relative to the target analyte ranked from high to low, wherein the minimum rank is a positive integer between 1 and the total number of candidate analytes in the pool.

10. The method of any of claims 1-3, wherein said threshold is a decimal number greater than or equal to the 10th, 20th, 30th, 40th, 50th, 55th, 60th, 65th, 70th, 75th, 80th, 85th, 90th, 91st, 92nd, 93rd, 94th, 95th, 96th, 97th, 98th, or 99th percentile of the covariation values for all candidate analytes relative to the target analyte.

11. The method of claim 3, wherein said lower limit of normal NQ range is determined by the mean of the NQ values of the target analyte in a set of reference samples minus a multiple of the standard deviation of the NQ values of the target analyte in the set of reference samples, and wherein said upper limit of normal NQ range is determined by the mean of the NQ values of the target analyte in the set of reference samples plus a multiple of the standard deviation of the NQ values of the target analyte in the set of reference samples.

12. The method of claim 11, wherein said multiple is greater than or equal to 1, greater than or equal to 1.5, greater than or equal to 2, greater than or equal to 2.5, greater than or equal to 3, greater than or equal to 3.5, greater than or equal to 4, greater than or equal to 4.5, greater than or equal to 5, greater than or equal to 5.5, or greater than or equal to 6.

13. The method of claim 3, wherein said lower limit of normal NQ range is determined by the median of the NQ values of the target analyte in a set of reference samples minus a multiple of the median absolute deviation (MAD) of the NQ values of the target analyte in the set of reference samples, and wherein said upper limit of normal NQ range is determined by the median of the NQ values of the target analyte in the set of reference samples plus a multiple of the median absolute deviation (MAD) of the NQ values of the target analyte in the set of reference samples.

14. A method for normalizing a measurement of a target analyte in a test sample, the method comprising:

receiving, with a computing system, measurements of the target analyte and candidate analytes in a group of training samples;

for each of the candidate analytes: determining, with a computing system, a covariation value between the target analyte and each of the candidate analytes, and selecting the candidate analyte as a normalizer if the covariation value of the candidate analyte and the target analyte is higher than a threshold, thereby selecting a plurality of normalizers,

receiving, with a computing system, measurements of the target analyte and the plurality of normalizers in the test sample,

calculating a normalized quotient (NQ) of the target analyte in the test sample, thereby normalizing the measurement of the target analyte.

15. The method of claim 14, wherein the method further comprises comparing the NQ of the target analyte of the test sample with a normal NQ range, outputting to a user, with a computing system, an indication representing the test sample as normal based on the target analyte if said NQ is greater than or equal to a lower limit of normal NQ range and less than or equal to an upper limit of normal NQ range, or categorizing the test sample as abnormal if said NQ is lower than the lower limit of the normal NQ range or higher than the higher limit of said normal NQ range.

16. A computer-readable medium comprising computer program code for controlling a computer system to perform the method of claim 14 or 15.

17. A computer system comprising: the computer-readable medium of claim 16.

18. A system for categorizing a target analyte in a sample, the system comprising:

one or more processors;

a memory coupled with the one or more processors via an interconnect;

a communications interface coupled with the interconnect and adapted to receive measurements of a target analyte and the candidate analytes in a group of training samples;

wherein the one or more processors are configured to:

receive values representing measurements of the target analyte and candidate analytes; determine, for each of the candidate analytes, the covariation value between the measurement of target analyte and the measurements of the candidate analytes and select the candidate analyte as a normalizer when the covariation value of the candidate analyte is higher than a threshold, thereby selecting a plurality of normalizers, receive measurements of the target analyte and the plurality of normalizers in the test sample and determine the NQ of the target analyte in the test sample, compare the NQ of the target analyte of the test sample with a normal NQ range,

an output module communicatively coupled with the processor and configured to output an indication, wherein the indication represents the test sample as normal based on the target analyte when said NQ is greater than or equal to a lower limit of normal NQ range and less than or equal to an upper limit of normal NQ range, or categorizing the test sample as abnormal when said NQ is lower than the lower limit of the normal NQ range or higher than the higher limit of said normal NQ range.

19. The method of any of claims 1-3, wherein the target analyte is mitochondrial DNA.

20. The method of claim 19, wherein each candidate analyte is an autosomal bin.

21. The method of any of claims 2-3, wherein the test sample is a tissue biopsy for testing a disease.

22. The method of claim 21, wherein the disease is Charcot-Marie-Tooth type IIA, optic nerve atrophy, diabetes, or cancer.

Description:

A NORMALIZATION METHOD FOR SAMPLE ASSAYS

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] The present application claims priority to US Provisional Application No. 62/252,673, filed on November 9, 2015, and US Provisional Application No. 62/252,679, filed on November 9, 2015, the entire disclosures of which are incorporated herein in their entirety.

FIELD OF INVENTION

[0002] The present invention relates to methods and systems for normalizing measurements of analytes that may be used to detect abnormality in a sample, such as chromosomal aneuploidy, disorders associated with abnormal copy number of mitochondria, infectious diseases and cancer.

BACKGROUND OF THE INVENTION

[0003] Quantitative measurement of an analyte of interest in a biological sample is critical for proper disease diagnosis, prognosis, and therapeutic treatment (Park et al., Cronin et al., Jennings et al. 1997, Jennings et al. 2012, Walsh et al.). Such an analyte can be, for example, a gene, an RNA, a peptide, a cluster of genes, a chromosome, or a chromosomal region called a bin. For example, copy number aberrations of chromosomal regions detected in human samples have been indicated as prognostic markers in cancer-associated phenotypes (Sapkota et al., Krepischi et al.). Expression quantities of 21 RNA biomarkers have been used for breast cancer prognosis (Park et al.). A high level of the cell surface protein Her2 is a companion diagnostic biomarker for the effectiveness of the cancer drug Herceptin (Dent et al.). However, direct quantitative measurements of a particular analyte of interest from biological samples are often not meaningful because these values are affected by variations unrelated to the biological significance of the analytes in the samples. Examples of these variations include uneven quality of the input materials, and variations introduced through different sample preparation procedures, including enzyme bias and unknown assay condition changes. In aggregate, these variations often make it difficult to correlate a measurement of an analyte generated by one particular assay with the true signal of the analyte present in the samples.

[0004] Normalization is an approach that transforms measurements into corrected values to allow comparison of samples. It is useful for large-scale biological or chemical assays using high throughput technologies such as microarray, quantitative PCR (QPCR), mass spectrometry, antibody array, two-dimensional (2D) gel electrophoresis, digital PCR, or massively parallel sequencing, where a large quantity of analytes is measured at the same time. These assays include a variety of genome-wide analyses of gene copy number variation (CNV), chromosomal aberration, RNA expression, protein peptide expression, or protein binding patterns, and are now routinely performed to compare treatment and reference groups for differentiation of disease and normal tissues.

[0005] The current practice of normalization typically involves adjustment of the measurements of analytes using single or multiple reference analytes through regression, spike-in control, or a statistical model (Landfors et al., Quinlan et al., Gao et al., Ghandi and Beer). However, these normalization techniques are unable to eliminate bias in cases where the data of the analytes are skewed for unknown reasons and do not follow any typical mathematical distribution model. This occurs when, for example, a large number of the analytes in the assay are either positively or negatively affected by a particular treatment. In addition, certain high throughput experiments are likely to generate such skewed distributions, including large-scale parallel sequencing studies for the detection of chromosomal aneuploidy, chromatin immunoprecipitation (ChIP) experiments for the study of chromatin structure, RNA expression analyses for the discovery of cancer biomarkers, and SNP-based copy number variation profiling for the comparison of normal and tumor tissues. This adversely affects the accuracy of sample analysis because the capacity of an experiment to identify altered analytes and to quantify the unbiased changes decreases as the portion of altered analytes and the skewness increase, as shown by the spike-in array data (Landfors et al.). As a result of such an inaccurate normalization process, data variance is often still too large for the data to be used for clinical diagnosis.

[0006] Current methods available for quantifying mtDNA fall into two categories: 1) a PCR-based method targeting a region of mtDNA, such as described in Duran et al. and Koekemoer et al., and 2) a biochemical assay based on mitochondrial-specific protein or enzyme activity, such as described in Mogensen et al. 2006, Rabol et al. 2009, and Picard et al. 2011. However, all these methods are subject to assay bias and are thus inaccurate (Larsen et al.).

SUMMARY OF INVENTION

[0007] The present disclosure provides a covariation-based normalization (CBN) method that corrects various biases caused by, for example, sample and assay conditions. The method involves selecting normalizers from a set of analytes to normalize an analyte of interest, i.e., a target analyte. The selection is based on a covariation value of a candidate analyte relative to a target analyte, and the covariation value is independent of any distribution assumption about the data. The resulting normalization is thus reliable for correcting data skewness even if the cause is unknown. Therefore, the normalization method presented in this disclosure is robust, consistent, and useful for achieving the desired sensitivity and specificity, which is especially critical for high throughput assays.

[0008] The CBN method of selecting normalizers of a target analyte for use in a high throughput assay disclosed herein comprises calculating covariation values of each candidate analyte and the target analyte among a group of training samples. A candidate analyte is selected as a normalizer if its covariation value is higher than a threshold. A combination of a few or all normalizers that meet the selection criterion can be collectively used to normalize measurements of the target analyte.
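The selection step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: it uses the Pearson correlation as the covariation value, the function names `pearson` and `select_normalizers`, the bin names, and the training data are all hypothetical, and a constant candidate is treated as having a covariation value of zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length measurement sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Assumption: a constant analyte carries no covariation signal.
        return 0.0
    return sxy / sqrt(sxx * syy)

def select_normalizers(target, candidates, threshold=0.9):
    """Return names of candidates whose covariation with the target
    is strictly higher than the threshold.

    target: target analyte measurements, one per training sample.
    candidates: dict of candidate name -> measurements, same sample order.
    """
    return [name for name, values in candidates.items()
            if pearson(target, values) > threshold]

# Hypothetical training data from five samples.
target = [10.0, 12.0, 9.0, 15.0, 11.0]
candidates = {
    "bin_A": [20.0, 24.0, 18.0, 30.0, 22.0],  # proportional to the target
    "bin_B": [5.0, 4.0, 6.0, 3.0, 5.0],       # moves opposite to the target
    "bin_C": [7.0, 7.0, 7.0, 7.0, 7.0],       # constant across samples
}
normalizers = select_normalizers(target, candidates, threshold=0.9)
```

With these made-up values only `bin_A` passes the 0.9 threshold; a Spearman or Kendall correlation, or a covariance, could be substituted for `pearson` without changing the selection logic.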
[0009] In some embodiments, the disclosure provides a method of identifying a plurality of normalizers from a pool of candidate analytes for a target analyte in an assay, the method comprising: a) obtaining measurements of the target analyte and measurements of the pool of candidate analytes from a group of training samples; b) calculating a covariation value of the measurement of the target analyte and the measurement of one of the pool of candidate analytes; c) identifying the candidate analyte as a normalizer when the covariation value of the one of the pool of candidate analytes and the target analyte is higher than a threshold; and repeating steps b) and c) with the rest of the pool of candidate analytes until a plurality of normalizers is identified, wherein the plurality of normalizers are used to normalize the measurement of the target analyte in a test sample. In some embodiments, the plurality of normalizers is at least 10, 50, 100, 150, 200, 300, 400, 500, 600, 700, 800, 900, or 1000. In some embodiments, the group of training samples comprises at least 3, at least 4, at least 5, at least 10, at least 50, at least 70, at least 80, at least 100, at least 150, at least 200, or at least 300 samples.

[0010] In some embodiments, the method further comprises measuring the target analyte and the plurality of normalizers in a test sample to obtain the measurement of the target analyte and the measurements of the plurality of normalizers in the test sample, and normalizing the measurement of the target analyte in the test sample by producing a normalized quotient (NQ), wherein the NQ is produced by dividing the measurement of the target analyte by a sum or weighted sum of the measurements of the normalizers in the test sample.

[0011] In some embodiments of the invention, the covariation value is a covariance or a correlation. In some embodiments of the invention, the covariation value is a Spearman, Pearson, or Kendall correlation.

[0012] In some embodiments of the invention, the threshold of the covariation value used to select normalizers is a decimal number greater than or equal to 0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, or 0.99.

[0013] In some embodiments of the invention, the threshold of the covariation value is determined by a minimum rank among all candidates’ covariation values ranked from high to low. The minimum rank is a positive integer between 1 and the total number of candidate analytes. In some embodiments of the invention, the threshold of the covariation value is a decimal number greater than or equal to the 10th, 20th, 30th, 40th, 50th, 55th, 60th, 65th, 70th, 75th, 80th, 85th, 90th, 91st, 92nd, 93rd, 94th, 95th, 96th, 97th, 98th, or 99th percentile of the covariation values for all pairs of candidate analytes and said target analyte.

[0014] This invention further provides a method of normalizing test data using a normalized quotient (NQ) of a target analyte in a test sample. The NQ is generated by dividing the measurement of the target analyte in the test sample by a sum or a weighted sum of the measurements, in the same test sample, of the normalizers selected using the method above.
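As a rough illustration of the normalized quotient just described, the sketch below divides a target measurement by the plain or weighted sum of its normalizers' measurements in the same test sample. The function name `normalized_quotient`, the read counts, and the weights are hypothetical; the source does not specify how weights are chosen.

```python
def normalized_quotient(target_value, normalizer_values, weights=None):
    """Divide the target measurement by the (weighted) sum of the
    normalizer measurements taken from the same test sample."""
    if weights is None:
        denominator = sum(normalizer_values)
    else:
        if len(weights) != len(normalizer_values):
            raise ValueError("one weight is required per normalizer")
        denominator = sum(w * v for w, v in zip(weights, normalizer_values))
    if denominator == 0:
        raise ZeroDivisionError("normalizer measurements sum to zero")
    return target_value / denominator

# Hypothetical read counts in one test sample.
target_count = 50.0
normalizer_counts = [100.0, 150.0, 250.0]

nq_plain = normalized_quotient(target_count, normalizer_counts)
nq_weighted = normalized_quotient(target_count, normalizer_counts,
                                  weights=[0.5, 1.0, 1.0])
```

Here the unweighted NQ is 50 / 500 and the weighted NQ is 50 / 450; because both numerator and denominator come from the same sample, sample-wide effects such as total input amount tend to cancel.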
[0015] In some embodiments, the invention provides a method of categorizing a test sample based on a target analyte comprising the steps of a) generating an NQ of the target analyte in the test sample using the methods described above, and b) categorizing the test sample as normal based on the target analyte when said NQ is (1) greater than or equal to the lower limit of a normal NQ range and (2) less than or equal to an upper limit of the normal NQ range, or categorizing the test sample as abnormal when said NQ is (1) lower than the lower limit of the normal NQ range or (2) higher than the higher limit of said normal NQ range.

[0016] In some embodiments, the lower limit of the normal NQ range is determined by the mean of the NQ values of the target analyte in a set of reference samples minus a multiple of the standard deviation of the NQ values of the target analyte in the set of reference samples, and the upper limit of the normal NQ range is determined by the mean of the NQ values of the target analyte in the set of reference samples plus a multiple of the standard deviation of the NQ values of the target analyte in the set of reference samples. In some embodiments, the multiple is a number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 7, 8, 9, or 10.
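The mean and standard deviation rule above might be sketched as follows. This is a hedged illustration: the function names, the reference NQ values, and the choice of a multiple of 3 are hypothetical, and the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or population form is intended.

```python
from statistics import mean, stdev

def normal_range_sd(reference_nqs, multiple=3.0):
    """Normal NQ range as mean -/+ multiple * standard deviation of the
    reference NQs (sample standard deviation, by assumption)."""
    center = mean(reference_nqs)
    spread = stdev(reference_nqs)
    return center - multiple * spread, center + multiple * spread

def categorize(nq, lower, upper):
    """A sample is normal when lower <= NQ <= upper, otherwise abnormal."""
    return "normal" if lower <= nq <= upper else "abnormal"

# Hypothetical NQs of one target analyte in five reference samples.
reference_nqs = [0.100, 0.102, 0.098, 0.101, 0.099]
lower, upper = normal_range_sd(reference_nqs, multiple=3.0)

call_a = categorize(0.101, lower, upper)  # inside the range
call_b = categorize(0.110, lower, upper)  # above the upper limit
```

A larger multiple widens the range and so trades sensitivity for specificity; the limits are inclusive, matching the "greater than or equal to" and "less than or equal to" wording of the categorization step.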
[0017] In some embodiments, the lower limit of the normal NQ range is determined by the median of the NQ values of the target analyte in a set of reference samples minus a multiple of the median absolute deviation (MAD) of the NQ values of the target analyte in the set of reference samples, and the upper limit of the normal NQ range is determined by the median of the NQ values of the target analyte in the set of reference samples plus a multiple of the median absolute deviation (MAD) of the NQ values of the target analyte in the set of reference samples.

[0018] This invention also provides a method of categorizing a test sample based on a target analyte by comparing the target analyte’s NQ with a normal range of NQ. A test sample is categorized as normal based on the target analyte if said NQ falls within the normal range of NQ and abnormal if the NQ is outside the normal range.

[0019] In certain embodiments, the normal range of NQ is defined by a low end (a.k.a. “the lower limit”), which is the median minus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the median absolute deviation (MAD) of the NQs of the target analyte from a set of reference samples; and a high end, which is the median plus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the median absolute deviation (MAD) of the NQs of the target analyte from the set of reference samples. Alternatively, the low end of the NQ range is the mean minus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the standard deviation (SD) of the NQs of the target analyte from a set of reference samples; and the high end (a.k.a. “the higher limit”) of the range is the mean plus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the standard deviation (SD) of the NQs of the target analyte from the set of reference samples.
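For the median and MAD variant, a minimal sketch could look like the following. The function names and reference values are hypothetical, and the MAD is left unscaled (no consistency constant is applied), which is an assumption because the text does not mention any scaling.

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])

def normal_range_mad(reference_nqs, multiple=3.0):
    """Normal NQ range as median -/+ multiple * MAD of the reference NQs."""
    center = median(reference_nqs)
    spread = mad(reference_nqs)
    return center - multiple * spread, center + multiple * spread

def is_normal(nq, lower, upper):
    """True when the NQ lies inside the inclusive normal range."""
    return lower <= nq <= upper

# Hypothetical reference NQs; the last value is a deliberate outlier.
reference_nqs = [0.100, 0.102, 0.098, 0.101, 0.099, 0.150]
lower, upper = normal_range_mad(reference_nqs, multiple=3.0)
```

In this made-up reference set the single outlier of 0.150 barely moves the median or the MAD, so the range stays close to 0.096 to 0.105, whereas a mean and standard deviation computed on the same values would be pulled upward and widened.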
[0020] In certain embodiments, the step of identifying normalizers can be reiterated; that is to say, the NQs generated for the target analyte and candidate analytes based on a first set of a plurality of normalizers are used as the measurements for the target analyte and candidate analytes to select a second set of a plurality of normalizers. This reiteration process can be repeated multiple times until a desired set of normalizers is obtained. The target analyte is then normalized using the second or later set of a plurality of normalizers as described above. The sample can be categorized based on the normalized measurements of the target analyte as described above.

[0021] The present invention also provides a computer-based method for selecting a set of normalizers for a set of target analytes in an assay with a group of training samples. The method steps include inputting measurements of analytes, including said set of target analytes, into a computer, and executing computer readable instructions on the computer for selecting a set of normalizers for each target analyte. A candidate analyte is selected as a normalizer of the target analyte if the covariation value of the candidate analyte and the target analyte is greater than a threshold. A set of normalizers is then produced for the target analyte. By repetition of the process, sets of normalizers are produced for said set of target analytes. In some embodiments, the number of normalizers that is desired for the target analyte is also provided to the computer as an input parameter, which is then used as the minimum rank value for selection of a covariation threshold. The executing act can be performed either locally or remotely relative to the inputting act.
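One way to read the minimum-rank option is sketched below: the desired number of normalizers is used as the rank cutoff, and the covariation value found at that rank becomes the threshold. This is an interpretation, not the stated implementation. The function name, bin names, and covariation values are hypothetical, and including the candidate that sits exactly at the minimum rank (a "greater than or equal to" comparison) is an assumption made so that the requested number of normalizers is returned.

```python
def select_by_minimum_rank(covariation_values, n_normalizers):
    """Rank candidates by covariation value (high to low) and keep those
    at or above the value found at rank n_normalizers.

    covariation_values: dict of candidate name -> covariation with the target.
    Returns (selected_names, threshold). Ties at the threshold are all kept,
    so slightly more than n_normalizers names may be returned.
    """
    ranked = sorted(covariation_values.items(),
                    key=lambda item: item[1], reverse=True)
    if not 1 <= n_normalizers <= len(ranked):
        raise ValueError("minimum rank must be between 1 and the pool size")
    threshold = ranked[n_normalizers - 1][1]
    selected = [name for name, value in ranked if value >= threshold]
    return selected, threshold

# Hypothetical covariation values of five candidates against one target.
covariation_values = {
    "bin_A": 0.98, "bin_B": 0.40, "bin_C": 0.95,
    "bin_D": 0.91, "bin_E": -0.20,
}
selected, threshold = select_by_minimum_rank(covariation_values, 3)
```

With these illustrative values a request for three normalizers yields `bin_A`, `bin_C`, and `bin_D` with a derived threshold of 0.91; a percentile-based threshold could be obtained the same way by converting the percentile to a rank first.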
[0022] In some embodiments, the present invention provides a method for normalizing a measurement of a target analyte in a test sample, the method comprising: receiving, with a computing system, measurements of the target analyte and candidate analytes in a group of training samples; for each of the candidate analytes: determining, with a computing system, a covariation value between the target analyte and each of the candidate analytes, and selecting the candidate analyte as a normalizer if the covariation value of the candidate analyte and the target analyte is higher than a threshold, thereby selecting a plurality of normalizers; receiving, with a computing system, measurements of the target analyte and the plurality of normalizers in the test sample; and calculating a normalized quotient (NQ) of the target analyte in the test sample, thereby normalizing the measurement of the target analyte.

[0023] In some embodiments, the method further comprises comparing the NQ of the target analyte of the test sample with a normal NQ range, outputting to a user, with a computing system, an indication representing the test sample as normal based on the target analyte if said NQ is greater than or equal to a lower limit of the normal NQ range and less than or equal to an upper limit of the normal NQ range, or categorizing the test sample as abnormal if said NQ is lower than the lower limit of the normal NQ range or higher than the higher limit of said normal NQ range.

[0024] In some embodiments, the invention provides a computer-readable medium comprising computer program code for controlling a computer system to perform the methods above. In some embodiments, the invention provides a computer system comprising the computer-readable medium.
[0025] This invention also provides a system for categorizing a sample based on a target analyte, the system comprising (1) one or more processors; (2) a memory coupled with the one or more processors via an interconnect; and (3) a communications interface coupled with the interconnect and adapted to receive a measurement of a target analyte and measurements of candidate analytes in a group of training samples. The one or more processors are configured to (a) receive values representing measurements of the target analyte and measurements of candidate analytes; determine, for each of the candidate analytes, the covariation value between the measurement of the target analyte and the measurements of the candidate analyte, and select the candidate analyte as a normalizer when the covariation value of the candidate analyte is higher than a threshold, thereby selecting a plurality of normalizers; (b) receive measurements of the target analyte and the plurality of normalizers in the test sample and determine the NQ of the target analyte in the test sample; and (c) compare the NQ of the target analyte of the test sample with a normal NQ range. In some embodiments, the system additionally comprises an output module communicatively coupled with the processor and configured to output an indication, wherein the indication represents the test sample as normal based on the target analyte when said NQ is greater than or equal to a lower limit of the normal NQ range and less than or equal to an upper limit of the normal NQ range, or categorizing the test sample as abnormal when said NQ is lower than the lower limit of the normal NQ range or higher than the higher limit of said normal NQ range.

[0026] The present invention also provides a physical computer-readable medium comprising a computer program for performing the following steps: inputting measurements of all measurable analytes, including said set of target analytes, into a computer; executing a set of computer readable instructions on the computer to select a set of normalizers for each target analyte, wherein a candidate analyte is selected as a normalizer if its covariation value is greater than a threshold; and outputting the set of normalizers for the set of target analytes.

[0027] In some embodiments of the invention, the target analyte is a chromosome, a chromosomal bin, or a gene. In some embodiments, the target analyte is a human gene.

[0028] In a preferred embodiment of the invention, the analytes are measured in a high throughput assay.
[0029] In another preferred embodiment of the invention, the target analyte is a clinical diagnostic marker for a disease.

[0030] In some embodiments, the target analyte is mitochondrial DNA. In some embodiments, each of the candidate analytes is an autosomal bin. In some embodiments, the test sample is a tissue biopsy for testing a disease. In some embodiments, the disease is Charcot-Marie-Tooth type IIA, optic nerve atrophy, diabetes, or cancer.

BRIEF DESCRIPTION OF THE DRAWINGS

[0031] FIG. 1 shows a system adapted for categorizing a sample as normal or abnormal based on a target analyte.

[0032] FIG. 2 shows a data processing system adapted for use in systems adapted for categorizing a sample as normal or abnormal based on a target analyte in accordance with various embodiments.

[0033] FIG. 3 illustrates a method of categorizing a sample as normal or abnormal based on a target analyte.

[0034] FIG. 4 illustrates a computer implemented method of categorizing a sample as normal or abnormal based on a target analyte.

[0035] FIG. 5 illustrates normalizing the sequencing counts of human sub-chromosomal bins for detection of chromosome 21 trisomy in cfDNA samples from pregnant women’s blood plasma using the methods disclosed herein.

Definitions

[0036] The term“analyte” refers to the basic unit, element, or entity for analysis. In a typical high throughput study, an analyte is a nucleotide sequence, a gene, an amplicon, an RNA, a protein, an array probe, a peptide, a protein, an antigen, an antibody, a macromolecule, a SNP, a mutation, a sequence of DNA, a fragment of chromosomal region, a chromosome, or any measurable biomarker. Typically an analyte is quantitatively measured and measurements are normalized for an analysis. [0037] The term“target analyte” or“target” refers to an analyte of interest. [0038] The term“normalizer” refers to an analyte the measurement of which is used to normalize measurement(s) of the target analyte. The term“normalizers” or“set of normalizers” refers to a collection of such analytes. The term“sets of normalizers” implies normalizers selected for a group of target analytes with each target analyte having its own set of normalizers. In the present invention, the measurements of a target analyte and its normalizers are used to calculate a normalized value called“normalized quotient”. [0039] The term“candidate analyte” or“candidate” refers to an analyte that can serve as a normalizer for a“target analyte”. The present invention provides a method to screen candidate analytes to serve as normalizers. [0040] The term“sample” refers to a material source wherein an assay is performed in order to obtain the quantitative measurements of the material source for analysis. Samples may be ones obtained from an organism or from the environment (e.g., a soil sample, water sample, etc.), or may be those directly obtained from a source (e.g., such as a biopsy or from a tumor, blood, serum sample, ) or indirectly obtained, e.g., after culturing and/or one or more processing steps. [0041] The terms“assay”,“experiment”,“study”, and“test” are used interchangeably and refer to a biological, chemical, or assay of similar nature with a sample or a group of samples. 
An assay in this disclosure normally produces quantitative measurements of analytes for each sample in the assay. The term “high throughput” or “high throughput assay” is used to describe an assay in which a plurality of analytes are measured.

[0042] The term “normalized quotient” (“NQ”) refers to the output value from a normalization formula comprising measurements of target analyte and normalizers. In some embodiments, the normalization formula is a function of target analyte measurements divided by the sum of measurements of all normalizers. Typically the value is generated as the measurement of the target analyte divided by the sum of the measurements of all normalizers. In some circumstances, one or more coefficients or weights are included to adjust the measurements for calculation of the normalized quotient value. In certain embodiments, various coefficients or weights determined empirically or statistically are applied to the measurements of the normalizers to produce a weighted sum, and a NQ is generated by dividing the measurement of the target analyte by the weighted sum.

[0043] The term “normal NQ range” refers to the range of NQs for a target analyte calculated based on a set of a plurality of normalizers in a set of reference samples. The normal NQ range is compared with the NQ for a target analyte based on the same set of normalizers in a test sample to determine whether a target analyte in the test sample is normal or abnormal.

[0044] The term “covariation” refers to the simultaneous change in measurement of two analytes across a group of samples. The terms “measurement of covariation”, “covariation measurement”, and “covariation value” are used interchangeably and refer to a statistical or mathematical similarity measurement, which includes, but is not limited to, the common statistical or mathematical terms “covariance”, “correlation”, “correlation coefficient”, “Pearson correlation”, “Spearman correlation”, and “Kendall correlation”.
Unless explicitly stated to the contrary in this disclosure, wherever the term “a covariation value” is used, a target analyte is implied to be one of the two analytes for the calculation thereof; as an example, the term “a covariation value of a candidate analyte” refers to a covariation value of a target analyte and the candidate analyte.

[0045] The term “covariance” refers to a covariation value through statistical covariance calculation. If X and Y are real-valued variables representing two analytes for an experiment with means E(X), E(Y) and variances var(X), var(Y), respectively, the covariance of (X, Y) or cov(X, Y) is determined by

$$\operatorname{cov}(X, Y) = E\big[(X - E(X))\,(Y - E(Y))\big]$$

where E is the expected value operator.
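
By way of a non-limiting illustration, the covariance definition above can be evaluated on a small set of paired measurements. The sketch below is not part of the disclosure: the function name `covariance` and the sample values are hypothetical, and the expected value operator is taken as the arithmetic mean over the samples (the population form, dividing by n).

```python
from statistics import fmean


def covariance(xs, ys):
    """Population covariance: mean of (x - mean(x)) * (y - mean(y))."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two equal-length, non-empty sequences")
    mean_x, mean_y = fmean(xs), fmean(ys)
    return fmean((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


# Hypothetical measurements of a target and a candidate across five samples.
target = [10.0, 12.0, 14.0, 16.0, 18.0]
candidate = [20.0, 24.0, 28.0, 32.0, 36.0]

print(covariance(target, candidate))  # 16.0
```

A sample-based estimator that divides by n - 1 instead of n would give a slightly larger value on the same data; the disclosure does not state which form is intended.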

[0046] The term “correlation” or “correlation coefficient” or “correlation value” refers to a covariation value through statistical correlation calculation. If X and Y are real-valued variables representing two analytes for an experiment with means E(X), E(Y), variances var(X), var(Y), and standard deviations sd(X), sd(Y), respectively, the covariance of (X, Y) or cov(X, Y) is determined by

$$\operatorname{cov}(X, Y) = E\big[(X - E(X))\,(Y - E(Y))\big]$$

where E is the expected value operator. Then “Pearson correlation” of (X, Y) or cor(X, Y) is determined by

$$\operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\operatorname{sd}(X)\,\operatorname{sd}(Y)}$$

[0047] If the ranks of X and Y are used instead of real values, the resulting correlation is “Spearman correlation”.

[0048] Let (x₁, y₁), (x₂, y₂), …, (x_n, y_n) be a set of measurements of variables X and Y representing two analytes respectively. Any pair of observations (x_i, y_i) and (x_j, y_j) is said to be concordant if the ranks for both elements agree: that is, if both x_i > x_j and y_i > y_j or if both x_i < x_j and y_i < y_j. The pair is said to be discordant if x_i > x_j and y_i < y_j or if x_i < x_j and y_i > y_j. If x_i = x_j or y_i = y_j, the pair is neither concordant nor discordant. Then “Kendall correlation” or τ is defined by

$$\tau = \frac{n_c - n_d}{\tfrac{1}{2}\,n\,(n - 1)}$$

where n_c is the number of concordant pairs and n_d is the number of discordant pairs.

[0049] The term “threshold value” or “threshold” refers to a minimal value of the measurement of covariation for selection of a normalizer.

[0050] The term “covariation-based normalization” or “CBN” refers to the process or the method of selecting normalizers based on candidate analytes’ covariation, covariance, or correlation to the target analyte.

[0051] The term “test sample” refers to a sample of interest in a study. It is also referred to as “patient sample” in this disclosure.

[0052] The term “training samples” refers to a group of samples that are used to generate a training data set for the selection of normalizers for a target analyte. Any sample that has characteristics similar to those of a test sample can be selected as a training sample.

[0053] The term “reference samples” refers to a group of samples that represent the typical sample populations of a study. NQs of the target analyte in the reference samples are used to determine the median, mean, median absolute deviation (“MAD”), and standard deviation of the NQs of the target analyte, as well as the low end and the high end of the normal range of NQ for a target analyte. Reference samples can be the training samples that are used to select normalizers.

[0054] The term “control sample” or “control” refers to a sample to be compared within a study. Typically the results from a test sample are compared to those from a control for analysis. In practice, control samples can be the training samples used for normalizer selection or the reference samples for the determination of the metrics of samples in the same experiment.

[0055] The term “Ratio Score” or “RS” refers to the ratio value generated by dividing the NQ of a target analyte by the NQ median or mean of reference samples and adjusting the value by an arbitrary scaling factor or coefficient. RS is a further transformation of NQ and is useful when a statistic or metric is needed for a group of analytes.

[0056] A "CGH array" refers to an array that can be used to compare DNA samples for relative differences in copy number.
A CGH array can be used in any assay in which it is desirable to scan a genome with a sample of nucleic acids. For example, a CGH array can be used in a location analysis as described in Wyrick et al. A CGH array thus can also be referred to as a "location analysis array" or an "array for ChIP-chip analysis." In some instances, a CGH array provides probes for screening or scanning a genome of an organism and comprises probes from across the genome.

[0057] The term “single nucleotide polymorphism” or “SNP” refers to a variation that occurs at a polymorphic site occupied by a single nucleotide, which is the site of variation between allelic sequences. The site is usually preceded by and followed by highly conserved sequences of the allele (e.g., sequences that vary in less than 1/100 or 1/1000 members of the populations). A single nucleotide polymorphism usually arises due to substitution of one nucleotide for another at the polymorphic site. A transition is the replacement of one purine by another purine or one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine or vice versa. Single nucleotide polymorphisms can also arise from a deletion of a nucleotide or an insertion of a nucleotide relative to a reference allele.

[0058] The term “chromosomal bin” or “bin” refers to a fragment of a chromosome. The bins can be chromosomal fragments that overlap or do not overlap with one another.

Chromosomal bins have been commonly used for analysis of subchromosomal regions (Hu et al., Xu et al.). The term “autosomal bin” refers to a bin of an autosome.

[0059] The term “sequence read” or “read” refers to a nucleotide sequence generated by a sequencing device for a sample.

[0060] The term “sequencing count” or its simplified version “count” refers to the number of reads that are aligned or mapped to a reference sequence. The count is zero if no read is aligned to the reference sequence and a positive integer otherwise. In some embodiments, a measurement of the analyte disclosed herein is a sequencing count.

[0061] The term “chromosomal representation” or “chromosomal binning representation” refers to a numeric measurement for a chromosome or chromosomal bin divided by the sum measurement of all analytes in a sample. This is a way to adjust for input quantity variation in the measurement. For example, the chromosome representation in the context of the assay platform of a sequencing device is the count for the chromosome divided by the count for all chromosomes.

[0062] The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to a processor for execution. Such a medium can take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium, or any other tangible medium from which a computer can read.

[0063] The term “high throughput assay” refers to an assay involving 10 or more analytes.

[0064] A "biological sample" refers generally to a body fluid or a tissue or organ sample from a subject. For example, the biological sample may be a body fluid such as blood, plasma, lymph fluid, serum, urine or saliva.
A tissue or organ sample, such as a non-liquid tissue sample, may be digested, extracted or otherwise rendered to a liquid form; examples of such tissues or organs include cultured cells, blood cells, skin, liver, heart, kidney, pancreas, islets of Langerhans, bone marrow, blood, blood vessels, heart valve, lung, intestine, bowel, spleen, bladder, penis, face, hand, bone, muscle, fat, cornea or the like. A plurality of biological samples may be collected at any one time.

[0065] The term “categorizing a test sample based on a target analyte” refers to determining whether the measurement of the target analyte in the test sample is within the normal range of the measurement of the target analyte in reference samples. For purposes of this disclosure, the categorization is carried out by comparing the NQ of the target analyte in the test sample with a normal NQ range determined based on the NQs of the reference samples. A test sample is categorized as normal based on a target analyte if the NQ of the target analyte falls within the normal NQ range (including the lower limit and higher limit) and abnormal if the NQ of the target analyte falls outside the normal NQ range.

[0066] The term “measurement” refers to any type of quantitative measurement of an analyte, e.g., the amount of the analyte present in a sample, or sequencing counts of the analyte in the sample. A measurement of an analyte can be expressed in various forms, e.g., an amount of an assay signal, or a value resulting from mathematical transformations of the amount of assay signal. Other data processing approaches, such as normalization of assay signals in reference to a population’s mean values, can also be used to produce a measurement for an analyte.

DETAILED DESCRIPTION

[0067] In this disclosure and the appended claims, the singular forms "a," "an" and "the" include plural reference unless the context clearly dictates otherwise.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

[0068] Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range, and any other stated or intervening value in that stated range, is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

[0069] Although any methods, devices and materials similar or equivalent to those described herein can be used in the practice or testing of the invention, the preferred methods, devices and materials are now described. All publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing the invention components that are described in the publications that might be used in connection with the presently described invention.

Methods

[0070] This invention can be used to accurately quantify one or more target analytes in a sample. The disclosed method of identifying normalizers and using them for normalizing a target analyte measurement is especially useful and advantageous when the number of normalizer candidates for a given target analyte is greater than 10, 100, or 1000, which is typical for high throughput studies, for example, high throughput studies using microarray, PCR, sequencing, or mass spectrometry to quantify levels of a set of DNA, RNA, protein, or metabolite analytes. In these studies it is challenging to identify normalizers that perform well across different assay conditions such as different protocols, different samples, different assay times, or different batches of reagents. This disclosure provides methods for identifying normalizers for the target analyte using training samples under a desired assay condition, so that measurements of the target analyte can be normalized and accurately analyzed even if the assay condition changes.

[0071] The methods disclosed herein include providing a sample comprising a biological sample. In this context, the term “providing” is to be construed broadly. The term is not intended to refer exclusively to a subject who provided a biological sample. For example, a technician in an off-site clinical laboratory can be said to “provide” the sample, for example, as the sample is prepared for purification by chromatography.

[0072] The biological sample is preferably an in vitro sample but is not limited to any particular sample type. The biological sample may also include other components, such as solvents, buffers, anticlotting agents, and the like. In some embodiments, a sample used in the disclosure is an aliquot of material, frequently an aqueous solution or an aqueous suspension derived from biological material containing DNA.
Samples to be assayed for the presence of the target nucleic acid by the methods of the present invention include, for example, cells, tissues, e.g., tumors, homogenates, lysates, extracts, and other biological molecules and mixtures thereof. Non-limiting examples of samples typically used in the methods of the invention include human and animal body fluids such as whole blood, serum, plasma, cerebrospinal fluid, sputum, bronchial washings, bronchial aspirates, urine, lymph fluids and various external secretions of the respiratory, intestinal and genitourinary tracts, tears, saliva, milk, white blood cells, myelomas and the like; biological fluids such as cell culture supernatants; tissue specimens which may or may not be fixed; and cell specimens which may or may not be fixed.

[0073] The samples used in the methods of the present invention will vary depending on the assay format and the nature of samples, e.g., the characteristics of the tissues, cells, extracts or other materials, especially biological materials, to be assayed. Methods for preparing, e.g., homogenates and extracts, or other samples are well known in the art and can be readily adapted in order to obtain a sample that is compatible with the methods of the invention. The invention is not limited to any particular volume of biological sample. In some embodiments, the biological sample is at least about 1-100 µL, at least about 10-75 µL, or at least about 15-50 µL in volume. In certain embodiments, the biological sample is at least about 20 µL in volume.

[0074] The method disclosed herein can be used to analyze any target analyte present in a sample. Non-limiting examples of the target analytes include a gene, an RNA, a peptide, a cluster of genes, a chromosome, a chromosomal region called a bin, or a Single Nucleotide Polymorphism (SNP). In some embodiments, the target analyte is a prognostic or diagnostic marker, the amount of which in the sample is associated with the presence or absence of a disease, e.g., cancer. For example, copy number aberrations of chromosomal regions detected in human samples have been indicated as prognostic markers in cancer associated phenotypes (Sapkota et al., Krepischi et al.). Expression quantities of 21 RNA biomarkers have been used for breast cancer prognosis (Park et al.). A high level of the cell surface protein Her2 is a companion diagnostic biomarker for the effectiveness of the cancer drug Herceptin (Dent et al.). In some embodiments, the target analyte is a marker, the amount of which is correlated with the effectiveness of a treatment.

[0075] The method disclosed herein can be used to analyze measurements of target analytes produced by any assays that generate quantifiable and detectable signals. Such assays are well known in the art, for example, real time PCR, microarray, or sequencing.

[0076] The invention provides a method of selecting normalizers for a target analyte from candidate analytes.
In certain embodiments, the method includes calculating a covariation value for each candidate analyte and the target analyte in a group of training samples and selecting a candidate analyte as a normalizer if its covariation value is higher than a threshold (see Examples I-V below). A covariation value quantifies the simultaneous change or similarity in measurement of a candidate analyte and the target analyte in a group of training samples. The group of training samples comprises at least 3, 5, or 10 samples. In some embodiments, the group of training samples comprises one or more reference samples. In some embodiments, the group of training samples comprises 30-100, or 50-150 samples. In some embodiments, the group of training samples comprises 100-200 samples. In some embodiments, the group of training samples includes the test sample itself.

[0077] Various ways can be used to calculate the covariation value mathematically. In some embodiments, the covariation value is a mathematical correlation such as Spearman correlation, Pearson correlation, or Kendall correlation. In certain other embodiments, the covariation value is a mathematical covariance. The threshold for selection of a normalizer is determined either empirically or by statistical analysis. In preferred embodiments of the invention the covariation value is a Spearman, a Pearson, or a Kendall correlation that ranges from -1 to 1, and the threshold is set to a decimal number greater than or equal to 0.1, 0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, or 0.99.

[0078] In some embodiments of the invention, the threshold of the covariation value is determined by a minimum rank among all candidates’ covariation values ranked high to low. The minimum rank is a positive integer between 1 and the total number of candidate analytes.
This rank-determined threshold effectively selects the top n candidate analytes with the highest covariation values as normalizers, where n represents a desired number of normalizers for a target analyte, for example, at least 3, 5, 10, 50, 100, 150, or 200.

[0079] In some embodiments of the invention, the threshold of the covariation value is a decimal number greater than or equal to the 10th, 20th, 30th, 40th, 50th, 55th, 60th, 65th, 70th, 75th, 80th, 85th, 90th, 91st, 92nd, 93rd, 94th, 95th, 96th, 97th, 98th, or 99th percentile of the covariation values for all pairs of candidate analytes and said target analyte. This percentile-based threshold allows effective selection, as normalizers, of candidate analytes having covariation values within the high percentile of all covariation values.

[0080] Measurements of these selected normalizers are used to normalize the measurement of the target analyte. In cases where multiple target analytes are to be analyzed, the normalization process starts with selecting one target analyte at a time and treating all the remaining analytes as candidate analytes, from which normalizers are selected using the method described above. The normalizer selection process is then repeated until normalizers for all target analytes are identified (see Examples I-V below).

[0081] Normalized quotient (NQ), as used in the claimed method, is the result generated by dividing the measurement of the target analyte by the sum of the measurements of the normalizers, which is shown in the following formula:

$$NQ_i = \frac{x_i}{\sum_{j \in S_i} x_j}$$

where NQ_i is the NQ value for target analyte i, x_i denotes the measurement for target analyte i, and x_j denotes the measurement of the j-th normalizer in the set of normalizers S_i for target analyte i.

[0082] In certain embodiments, various coefficients or weights determined empirically or statistically are applied to the measurements of each of the normalizers to calculate a weighted sum in order to determine the NQ value, using the following formula:

$$NQ_i = \frac{x_i}{\sum_{j \in S_i} w_j\, x_j}$$

where NQ_i is the NQ value for target analyte i, x_i denotes the measurement for target analyte i, w_j denotes the weight of the j-th normalizer in the set of normalizers S_i for target analyte i, and x_j denotes the measurement of the j-th normalizer in the set of normalizers S_i for target analyte i.

[0083] The analytes in the described invention may be polynucleotide probes for detection of aberrations in genomic DNA regions using a microarray chip (Example I). They may be polynucleotide probes for measuring the level of cDNA converted from RNA signals in biological samples using a microarray. They may be amplicons defined by specific PCR primer sets in a high throughput assay that is designed to detect either RNA expression levels (RT-PCR assay) or specific DNA targets. The analytes may also be polynucleotide sequences that are defined bioinformatically or experimentally and detected by high throughput sequencing.
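
By way of a non-limiting illustration, the normalizer selection of paragraphs [0076]-[0078] and the normalized quotient of paragraphs [0081]-[0082] can be sketched together as follows. The sketch is not part of the disclosure: the function names, the analyte names "A", "B" and "C", and all numeric values are hypothetical. Pearson correlation is chosen here as the covariation value, a candidate is kept only when its value is strictly higher than the threshold, and a constant analyte is assumed to score zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant analyte carries no covariation signal
    return sxy / sqrt(sxx * syy)


def select_normalizers(target, candidates, threshold=None, top_n=None):
    """Pick normalizers by a fixed threshold or by rank (top n)."""
    scores = {name: pearson(target, values) for name, values in candidates.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)  # high to low
    if top_n is not None:
        return ranked[:top_n]
    if threshold is None:
        raise ValueError("give either threshold or top_n")
    return [name for name in ranked if scores[name] > threshold]


def normalized_quotient(target_value, normalizer_values, weights=None):
    """NQ = target / (weighted) sum of normalizer measurements."""
    if weights is None:
        weights = [1.0] * len(normalizer_values)  # plain sum
    denominator = sum(w * v for w, v in zip(weights, normalizer_values))
    return target_value / denominator


# Hypothetical training data: one target and three candidates, four samples.
training_target = [100.0, 120.0, 90.0, 110.0]
training_candidates = {
    "A": [50.0, 60.0, 45.0, 55.0],      # rises and falls with the target
    "B": [200.0, 241.0, 181.0, 219.0],  # nearly proportional to the target
    "C": [30.0, 20.0, 35.0, 25.0],      # moves against the target
}
normalizers = select_normalizers(training_target, training_candidates, threshold=0.9)

# Hypothetical test sample measured for the target and the same analytes.
test_sample = {"target": 131.0, "A": 52.0, "B": 210.0, "C": 27.0}
nq = normalized_quotient(test_sample["target"], [test_sample[n] for n in normalizers])
print(normalizers, nq)  # ['A', 'B'] 0.5
```

Passing `top_n` instead of `threshold` gives the rank-determined selection of paragraph [0078]; a percentile-based threshold as in paragraph [0079] could be obtained by computing the cutoff from the score distribution before filtering.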

Because the normalization method presented in this invention is based on similarity or covariation among the analytes in a study, the analytes are not restricted to those from specific samples or assayed on specific technology platforms.

[0084] The normalized quotient (NQ) of a target analyte can be used to determine whether a biological sample is normal or abnormal. This can be achieved by comparing the NQ of a target analyte to an NQ range determined through a set of reference samples, which comprises at least 3, 5, or 10 samples. In some embodiments, the set of reference samples comprises 30-100, or 50-150 samples. In some embodiments, the set of reference samples comprises 100-200, 300-500, 200-400, or 500-1000 samples. In some embodiments, the set of reference samples includes the test sample itself. In certain other embodiments, the detection of abnormality of a test sample is based on a statistical z-score of an analyte using the formula:

$$z_i = \frac{NQ_i - \overline{NQ}}{\sigma}$$

where z_i represents the analyte’s z-score of the test sample, NQ_i is the NQ value for the target analyte i, \overline{NQ} is the NQ median or NQ mean of the reference samples, and σ is the NQ median absolute deviation (MAD) or standard deviation (SD) of the reference samples. The z-scores obtained in high throughput assays typically follow a Gaussian distribution, and thus the normal range for the z-score is set as -2 to 2, -2.5 to 2.5, -3 to 3, -3.5 to 3.5, -4 to 4, or another fractional value range of similar size. If an analyte’s z-score for a test sample is not within the normal range, then the test sample is classified as abnormal for the analyte. This z-score based classification can, for example, be applied for detection of chromosomal aneuploidy as described in Sehnert et al. or Palomaki et al.

[0085] In practice, the normal range of NQ based on average and deviation can be used equivalently to the z-score. In certain embodiments, the normal range of NQ is defined by a low end, which is the median of the NQs of the target analytes in a set of reference samples minus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the median absolute deviation (MAD); and a high end, which is the median plus a decimal number that is greater than or equal to 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the median absolute deviation (MAD). Alternatively, the low end of the NQ range is the mean of the NQs of the target analytes in a set of reference samples minus 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the standard deviation (SD) and the high end of the range is the mean plus 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, or 5 times the standard deviation (SD).

[0086] In some embodiments, the training samples and reference samples are of the same tissue type as the test sample. In some embodiments, the training samples and reference samples are not of a different tissue type from the test sample.
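
By way of a non-limiting illustration, the z-score test of paragraph [0084] can be sketched using the median and the median absolute deviation of the reference NQs. The sketch is not part of the disclosure: the function names and the reference values are hypothetical, the MAD is used unscaled (no consistency constant is applied), and the normal range is taken as -3 to 3 with the limits included.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of absolute deviations from the median (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def z_score(nq, reference_nqs):
    """z = (NQ - median of reference NQs) / MAD of reference NQs."""
    center = median(reference_nqs)
    spread = median_absolute_deviation(reference_nqs)
    if spread == 0:
        raise ValueError("reference NQs have zero spread")
    return (nq - center) / spread


def classify(z, limit=3.0):
    """Normal when the z-score lies within [-limit, limit]."""
    return "normal" if -limit <= z <= limit else "abnormal"


# Hypothetical NQs of one target analyte in five reference samples.
reference_nqs = [0.96, 0.98, 1.00, 1.02, 1.04]

for nq in (1.03, 1.10):
    z = z_score(nq, reference_nqs)
    print(round(z, 2), classify(z))
```

With these hypothetical values the first test NQ lies about 1.5 MADs above the reference median and is classified as normal, while the second lies about 5 MADs above it and is classified as abnormal.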
In some embodiments, training samples used to select the normalizers are of the same tissue type as the reference samples. In some embodiments, training samples also serve as reference samples.

[0087] The claimed methods are especially advantageous for studies in which the number of data points for each sample is large such that the bin analysis can be performed. In some embodiments the data are generated by whole genome sequencing to detect chromosomal aneuploidy. A chromosome can be digitally divided into fragments called bins. The claimed methods can be performed on data for a collection of bins, in order to detect amplification or deletion of a bin or a part of a bin in a sample. In accordance with the methods disclosed herein, these bins are designated as analytes, and each bin can be normalized using the method described above and also as illustrated in the Examples below. After each bin is normalized, i.e., by calculating a NQ, the NQs for the bins belonging to the same chromosome can be aggregated to determine whether there is an abnormality for the chromosome. Whether bins belong to the same chromosome can be readily determined by chromosomal mapping. This method of normalizing data at chromosome bins and then aggregating to a chromosome allows more accurate analyses when data are not even across the full chromosome; meanwhile, the method allows subchromosomal analysis where an abnormality is to be detected on only a portion of a chromosome.

[0088] The bins can be of various or equal lengths, and may be overlapping or non-overlapping. In some instances, bins from human samples were used that are non-overlapping bins and have lengths ranging from 100 to 50,000,000 nucleotides, e.g., 200-5,000,000, 300-100,000, 1,000-3,000,000, or 5,000-30,000 nucleotides.
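
By way of a non-limiting illustration, the binning and per-chromosome aggregation of paragraphs [0087]-[0088] can be sketched as follows. The sketch is not part of the disclosure: the function names, read positions, and scores are hypothetical, bins are taken as non-overlapping and of equal length (10,000 nucleotides), and the median is assumed as the aggregation function because the text above does not prescribe one.

```python
from collections import Counter, defaultdict
from statistics import median


def bin_counts(read_positions, bin_size):
    """Count aligned reads per non-overlapping, equal-length bin.

    read_positions: iterable of (chromosome, 0-based start position).
    Returns a Counter keyed by (chromosome, bin index).
    """
    counts = Counter()
    for chrom, pos in read_positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def aggregate_by_chromosome(bin_scores):
    """Summarize per-bin scores (e.g. z-scores of bin NQs) per chromosome.

    Assumption: the median is used as the aggregate.
    """
    per_chrom = defaultdict(list)
    for (chrom, _bin_index), score in bin_scores.items():
        per_chrom[chrom].append(score)
    return {chrom: median(scores) for chrom, scores in per_chrom.items()}


# Hypothetical aligned reads and a 10,000-nucleotide bin size.
reads = [("chr21", 100), ("chr21", 15000), ("chr21", 15500), ("chr1", 5)]
counts = bin_counts(reads, bin_size=10_000)

# Hypothetical per-bin scores obtained after normalizing each bin.
bin_scores = {
    ("chr21", 0): 3.0, ("chr21", 1): 4.0, ("chr21", 2): 5.0,
    ("chr1", 0): 0.5, ("chr1", 1): -0.5,
}
summary = aggregate_by_chromosome(bin_scores)
print(dict(counts), summary)
```

Keeping the per-bin scores alongside the per-chromosome summary would also support the subchromosomal analysis mentioned above, since a run of adjacent bins can then be inspected on its own.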
In some embodiments, after obtaining the measurements of the bins, one bin is designated as a target analyte and all other bins are designated as candidate analytes to select normalizers that are used to normalize the measurement for the bin. Sequencing counts for bins were obtained through a DNA sequencing device.

Systems

[0089] In one aspect, systems are provided for facilitating normalizing a target analyte and/or categorizing a test sample based on the target analyte. Such systems can include one or more computing devices and can be communicatively coupled to a network. Such a computing device can include a discrete computing device, a computing device tied into a main-frame system of a medical facility, or one or more portable devices that are communicatively coupled to a network or server associated with a treating physician or medical facility. In some embodiments, one or more of the computing devices can include a portable computing device of a treating physician, such as a tablet or handheld device. Such systems are configured, typically with programmed instructions recorded on a memory thereof, to select normalizers for normalizing the target analyte and to categorize the test sample. In some embodiments, the system includes a computation engine that determines the covariation value between the target analyte and each of the candidate analytes; selects a candidate analyte as a normalizer if the covariation value of the candidate analyte is higher than a threshold, thereby selecting a plurality of normalizers; calculates NQs of the target analyte in the reference samples and determines a normal NQ range for the target analyte; determines the NQ of the target analyte in the test sample; and compares the NQ of the target analyte of the test sample with the normal NQ range. The computation engine can be defined by programmable instructions recorded on a memory of the system, which can include a memory accessed through a server or a memory coupled with one or more processors of one or more computing devices of the system.

[0090] Provided below are descriptions of some devices (and components of those devices) that may be used in the systems and methods described above. These devices may be used, for instance, to communicate, process, and/or store data related to any of the functionality described above. As will be appreciated by one of ordinary skill in the art, the devices described below may have only some of the components described below, or may have additional components.

[0091] FIG.1 depicts an example block diagram of a system configured to normalize measurements of a target analyte and/or categorize a test sample based on the analyte.
In the illustrated embodiment, system 100 includes a computer system 115 coupled to a network or server 110 that includes medical data associated with the patient from one or more data sources 105 (e.g., laboratory output of sample results). Data sources 105 can include values corresponding to the measurements of the target analyte and candidate analytes from a group of training samples and/or a set of reference samples. The techniques described herein are not limited to any particular type of computer system or computer network and could include one or more computing devices, including portable computing devices. For example, network 110 can be a local area network (LAN), a wide-area network (WAN), a wireless network, a bus connection, an interconnect, or any other means of communicating data or control information across one or more transmission lines or traces in an electronic system. While in this embodiment data sources 105 are accessed through a network or server 110, it is appreciated that data sources 105 can communicate data directly to the computer system 115, or data can be manually input into computer system 115 through a user input.

[0092] Computer system 115 includes a processor 101 and a system memory 104 coupled together via an interconnect bus 108. In some embodiments, processor 101 and system memory 104 can be directly interconnected, or can be connected indirectly through one or more intermediary components or units. Processor 101 and system memory 104 can be any general-purpose or special-purpose components as are known in the art and are not limited to any particular type of processor or memory system. System memory 104 can be configured to store system and control data for automatically performing the normalization and sample categorization methods described herein. In some embodiments, computer system 115 is coupled with a database to receive data.
The data stored on the database can include data values corresponding to the measurements of the target analyte and candidate analytes from a group of training samples; data values corresponding to the measurements of the target analyte and normalizers from the test sample and a set of reference samples; and/or data pertaining to the normalization of the target analyte in the test sample. For example, the processor can determine the covariation value between the target analyte and each of the candidate analytes, and select a candidate as a normalizer if the covariation value of the candidate analyte is equal to or greater than a threshold. The threshold can be stored on system memory 104, or can be automatically obtained from a database as needed or obtained from another data source 105 accessed through communication with network 110. The processor can also determine a NQ for the target analyte in the test sample and in each of the set of reference samples based on the measurements of the target analyte and the normalizers in each sample. The processor determines a NQ range having a lower limit, e.g., by calculating a mean minus a multiple of the standard deviation, and a higher limit, e.g., by calculating the mean plus a multiple of the standard deviation. One advantage of including programmable instructions that query an external data source for the pre-determined threshold (for selecting normalizers) or the multiple (for determining the limits of the normal NQ range) is that these values can be changed or updated periodically, as needed, without altering the configuration of computer system 115.
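
By way of a non-limiting illustration, the range determination and categorization attributed to the processor in paragraph [0092] can be sketched as follows. The sketch is not part of the disclosure: the function names and reference values are hypothetical, the sample standard deviation (dividing by n - 1) is assumed because the text does not specify the form, and the multiple is passed in as a parameter to reflect that it may be obtained from an external data source.

```python
from statistics import fmean, stdev


def normal_nq_range(reference_nqs, multiple=3.0):
    """Lower and higher limits: mean -/+ multiple * standard deviation."""
    center = fmean(reference_nqs)
    spread = stdev(reference_nqs)  # assumption: sample SD (n - 1)
    return center - multiple * spread, center + multiple * spread


def categorize(nq, nq_range):
    """Normal when the NQ lies within the range, limits included."""
    low, high = nq_range
    return "normal" if low <= nq <= high else "abnormal"


# Hypothetical NQs of the target analyte in the reference samples.
reference_nqs = [0.96, 0.98, 1.00, 1.02, 1.04]
nq_range = normal_nq_range(reference_nqs, multiple=3.0)

for test_nq in (1.05, 1.20):
    print(test_nq, categorize(test_nq, nq_range))
```

Because the multiple is an argument rather than a constant, a different value can be supplied at run time without changing the functions themselves, which mirrors the updating advantage noted above.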

[0093] Computing system 115 receives input data 103 from the various data sources through communications interface 120. Computer system 115 processes the received data according to programmed instructions recorded on memory 104 and provides resulting data pertaining to the differential diagnosis to a user via output module 125. Output module 125 can be communicatively coupled to a user interface display or printer for presenting the processed data pertaining to the differential diagnosis. Typically, the output module 125 outputs an indication representing the categorization of the test sample to the user (e.g., “normal for target analyte”, “abnormal for target analyte”). The output module can further output a list of the normalizers used to normalize the target analyte. In another aspect, the output module 125 can output the processed data directly to the network or to a health information database so that the categorization result or associated data can be accessed by various other computing devices communicatively coupled with the network or database.

[0094] In some embodiments, the computing system 115 receives values representing measurements of the target analyte and candidate analytes in a group of training samples, via network 110, and provides those values to computation engine 130. Computation engine 130 then causes the processor 101 to determine, for each of the candidate analytes, the covariation value between the target analyte and the candidate analyte and to select the candidate analyte as a normalizer if its covariation value is higher than a threshold, thereby selecting a plurality of candidate analytes as normalizers. The computing system 115 also receives measurements of the target analyte and the plurality of identified normalizers in a set of reference samples, via network 110, and provides those values to computation engine 130. The computation engine then causes the processor to calculate NQs of the target analyte in the reference samples and determine a normal NQ range for the target analyte; determine the NQ of the target analyte in the test sample; and compare the NQ of the target analyte of the test sample with the normal NQ range to determine whether the NQ of the target analyte is within the normal NQ range. If the NQ of the target analyte in the test sample falls within the normal NQ range, a signal indicating that the test sample is normal based on the target analyte can be output by the output module.

[0095] Computation engine 130 may be implemented using specially configured computer hardware or circuitry or general-purpose computing hardware programmed by specially designed software modules or components; or any combination of hardware and software. The techniques described herein are not limited to any specific combination of hardware circuitry or software. For example, computation engine 130 may include off-the-shelf circuitry components or custom-designed circuitry. Alternatively, the computation functionality may be performed in software stored in memory 104 and executed by the processor 101.

[0096] FIG.2 depicts an example block diagram of a data processing system upon which the disclosed embodiments may be implemented. Embodiments of the present invention may be practiced with various computer system configurations such as hand-held devices, microprocessor systems, microprocessor-based or programmable user electronics, minicomputers, mainframe computers and the like. The embodiments can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wire-based or wireless network or remotely through a cloud server.

[0097] An example of a data processing system is shown in FIG.3, which depicts a data processing system 1000 that can be used with the embodiments described herein. Note that while various components of a data processing system are depicted, it is not intended to represent any particular architecture or manner of interconnecting the components, as such details are not germane to the techniques described herein. It will also be appreciated that network computers and other data processing systems which have fewer components or perhaps more components may also be used. For example, the data processing system could be distributed across multiple computing devices that are communicatively coupled. The data processing system of FIG.3 can be a personal computer (PC), workstation, tablet, smartphone or other hand-held wireless device, or any device having similar functionality.

[0098] As shown, the data processing system 1001 includes a system bus 1002 which is coupled to a microprocessor 1003, a Read-Only Memory (ROM) 1007, a volatile Random Access Memory (RAM) 1005, as well as other nonvolatile memory 1006. In the illustrated embodiment, microprocessor 1003 is coupled to cache memory 1004. System bus 1002 can be adapted to interconnect these various components together and also interconnect components 1003, 1007, 1005, and 1006 to a display controller and display device 1008, and to peripheral devices such as input/output (“I/O”) devices 1010. Types of I/O devices can include keyboards, modems, network interfaces, printers, scanners, video cameras, or other devices well known in the art. Typically, I/O devices 1010 are coupled to the system bus 1002 through I/O controllers 1009.
In one embodiment the I/O controller 1009 includes a Universal Serial Bus (“USB”) adapter for controlling USB peripherals or another type of bus adapter.

[0099] RAM 1005 can be implemented as dynamic RAM (“DRAM”), which requires power continually in order to refresh or maintain the data in the memory. The other nonvolatile memory 1006 can be a magnetic hard drive, magnetic optical drive, optical drive, DVD RAM, or other type of memory system that maintains data after power is removed from the system. While nonvolatile memory 1006 is shown as a local device coupled with the rest of the components in the data processing system, it will be appreciated that the described techniques can use a nonvolatile memory remote from the system, such as a network storage device coupled with the data processing system through a network interface such as a modem or Ethernet interface (not shown).

[0100] With these embodiments in mind, it will be apparent from this description that aspects of the described techniques may be embodied, at least in part, in software, hardware, firmware, or any combination thereof. It should also be understood that embodiments can employ various computer-implemented functions involving data stored in a data processing system. That is, the techniques may be carried out in a computer or other data processing system in response to executing sequences of instructions stored in memory. In various embodiments, hardwired circuitry may be used independently, or in combination with software instructions, to implement these techniques. For instance, the described functionality may be performed by specific hardware components containing hardwired logic for performing operations, or by any combination of custom hardware components and programmed computer components. The techniques described herein are not limited to any specific combination of hardware circuitry and software.
[0101] Embodiments herein may also be in the form of computer code stored on a computer-readable storage medium embodied in computer hardware or a computer program product. Computer-readable media can be adapted to store computer program code, which when executed by a computer or other data processing system, such as data processing system 1000, is adapted to cause the system to perform operations according to the techniques described herein. Computer-readable media can include any mechanism that stores information in a form accessible by a data processing device such as a computer, network device, tablet, smartphone, or any device having similar functionality. Examples of computer-readable media include any type of tangible article of manufacture capable of storing information thereon such as a hard drive, floppy disk, DVD, CD-ROM, magnetic-optical disk, ROM, RAM, EPROM, EEPROM, flash memory and equivalents thereto, a magnetic or optical card, or any type of media suitable for storing electronic data. Computer-readable media can also be distributed over a network-coupled computer system, so that the code can be stored or executed in a distributed fashion.

[0102] FIG.3 shows an exemplary method of categorizing a test sample for a target analyte. The method includes steps of: obtaining measurements of a target analyte and candidate analytes; calculating a covariation value between the target analyte and each of the candidate analytes; for each of the candidate analytes, selecting the candidate as a normalizer if the covariation value is equal to or greater than the threshold, thereby selecting a plurality of normalizers; calculating the NQs of the target analyte in the test sample and reference samples based on the measurements of the selected normalizers; determining a normal range of NQ for the target analyte from reference samples; comparing the NQ of the target analyte with the normal NQ range; and categorizing the sample as normal based on the target analyte if the NQ falls within the normal NQ range or as abnormal if the NQ falls outside the normal NQ range. It is appreciated that the above described method can be performed, in part or in full, by use of a computing system configured to automatically perform part or all of the above steps.

[0103] FIG.4 shows another exemplary method of categorizing a test sample for a target analyte by use of a computing system adapted for performing such a categorization.
It is appreciated that the computing system can include one or more computing devices that can be communicatively coupled with a network or server. Such a method optionally includes a step of receiving a request to categorize a test sample for a target analyte. Such a request could be made by a physician for a patient and would be input through a user interface coupled with the computing device. In other embodiments, the categorization method can be performed automatically for such a patient without requiring a request from the treating physician or other personnel. The method includes steps of: obtaining measurements of a target analyte and candidate analytes; calculating a covariation value between the target analyte and each of the candidate analytes; for each of the candidate analytes, selecting the candidate as a normalizer if the covariation value is equal to or greater than the threshold, thereby selecting a plurality of normalizers; calculating the NQs of the target analyte in the test sample and reference samples based on the measurements of the selected normalizers; determining a normal range of NQ for the target analyte from reference samples; comparing the NQ of the target analyte with the normal NQ range; categorizing the sample as normal based on the target analyte if the NQ falls within the normal NQ range or as abnormal if the NQ falls outside the normal NQ range; and outputting, with the computing device, an indication that the test sample is normal if the NQ of the target analyte falls within the normal NQ range (including the lower and higher limits), or an indication that the test sample is abnormal if the NQ of the target analyte in the test sample falls outside the normal NQ range. The method optionally includes a step of outputting the selected normalizers.
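The steps that follow the covariation calculation can be sketched in Python as shown below. This is a minimal illustration and not the implementation of the disclosed system: the function names, the dictionary-based data layout, the analyte labels, the default multiple of two standard deviations, and the use of the sample standard deviation are all assumptions, and the covariation values are taken as already computed.

```python
import statistics


def select_normalizers(covariations, threshold):
    """Keep candidates whose covariation value is >= the threshold."""
    return sorted(name for name, value in covariations.items() if value >= threshold)


def normalized_quotient(sample, target, normalizers):
    """NQ = target measurement / sum of the normalizer measurements."""
    return sample[target] / sum(sample[name] for name in normalizers)


def categorize(test_sample, reference_samples, target, normalizers, sd_multiple=2.0):
    """Compare the test NQ with a range derived from the reference NQs."""
    reference_nqs = [normalized_quotient(s, target, normalizers) for s in reference_samples]
    mean = statistics.mean(reference_nqs)
    sd = statistics.stdev(reference_nqs)  # sample SD; an assumption
    lower, upper = mean - sd_multiple * sd, mean + sd_multiple * sd
    nq = normalized_quotient(test_sample, target, normalizers)
    status = "normal" if lower <= nq <= upper else "abnormal"  # limits inclusive
    return {"indication": f"{status} for {target}", "nq": nq,
            "normal_range": (lower, upper), "normalizers": normalizers}


# Toy data (hypothetical): target "T" and candidate analytes "A", "B", "C".
covariations = {"A": 0.8, "B": 0.4, "C": 0.1}
normalizers = select_normalizers(covariations, threshold=0.4)
references = [
    {"T": 10.0, "A": 5.0, "B": 5.0, "C": 1.0},
    {"T": 11.0, "A": 5.0, "B": 5.0, "C": 7.0},
    {"T": 9.0, "A": 5.0, "B": 5.0, "C": 3.0},
]
test_sample = {"T": 20.0, "A": 5.0, "B": 5.0, "C": 2.0}
result = categorize(test_sample, references, "T", normalizers)
print(result["indication"], result["normalizers"])
```

The returned dictionary carries both the indication and the list of selected normalizers, mirroring the optional step of outputting the normalizers.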
[0104] It should be appreciated that the specific operations illustrated in FIG.4 depict a particular embodiment of a process and that other sequences of operations may also be performed in alternative embodiments. For example, certain steps can be performed by another computing device communicatively coupled with the computing device or the above operations could be performed in a different order. Moreover, the individual operations may include multiple sub-steps that may be performed in various sequences as appropriate and additional operations may be added or removed depending on the particular applications. One of ordinary skill in the art would recognize the many possible variations, modifications, and alternatives.

Applications

[0105] The covariation based normalization method disclosed herein can be used in any assay that involves quantitative analyses, for example, assays that are used to quantify biomarker expression or determine the amount of RNA or DNA in biosamples; and assays used to determine the presence of an abnormal number of chromosomes in a cell, which is commonly referred to as aneuploidy. The most common aneuploidies in the human population are chromosome 21 trisomy (associated with Down syndrome), chromosome 18 trisomy (associated with Edwards syndrome), and chromosome 13 trisomy (associated with Patau syndrome).

[0106] The methods disclosed herein can also be used to quantify mitochondrial DNA (mtDNA). Mitochondria play a crucial role in many cellular functions. Mitochondrial DNA copy number, under the control of a number of regulatory genes, is a critical component of mitochondrial health. An abnormal change in mtDNA copy number is commonly associated with Alpers syndrome, progressive external ophthalmoplegia (PEO), and other recessive myopathies. MtDNA depletion is also implicated in more common diseases including type 2 diabetes, many cancers, and neurodegenerative disorders such as Alzheimer's disease and Parkinson's disease. See, Rooney et al., Methods Mol Biol. 2015; 1241: 23–38. The claimed invention can be used to accurately monitor the mitochondrial DNA copy number of a patient sample. Thus, in some embodiments of the invention, the target analyte is mtDNA. In some embodiments, the target analyte is a chromosome or a chromosomal bin comprising mtDNA.

EXAMPLES

[0107] It is to be understood that this invention is not limited to the particular devices, machines, materials and methods described. Although particular embodiments are described, equivalent embodiments may be used to practice the invention.

[0108] The invention, now being generally described, will be more readily understood by reference to the following examples, which are included merely for purposes of illustration of certain aspects and embodiments of the present invention and are not intended to limit the invention.

Example I. Normalizing the signals of CGH array

[0109] Comparative genomic hybridization (CGH) is useful for detecting small chromosomal abnormalities, for example, small genetic imbalances (gains or losses of chromosomal material), also known as chromosome copy number variations (CNVs), because these small genetic imbalances are difficult to detect by routine cytogenetics. In a CGH array, short DNA sequences called probes, corresponding to known chromosomal loci spanning the genome, are fixed to a solid surface. The composition of these sequences affects the size of the smallest detectable chromosomal anomaly. The typical array includes loci of common microdeletion or duplication syndromes, as well as numerous sub-telomeric and peri-centromeric regions. Sub-telomeric locations are sites where DNA copy number alterations frequently occur.

[0110] Currently, the CGH method first hybridizes differentially fluorescently-labeled DNAs from patient samples and control samples to the array. After hybridization, the fluorescence signals from the patient samples and the control samples are detected and the ratios of the fluorescent signal of the patient sample to that of the control sample are calculated. In some cases, the logarithmically transformed counterparts of these ratios (log ratio) are calculated. These data are then processed through software to determine any copy number differences between the control and the patient DNA and serve as bases for clinical diagnosis.

[0111] This traditional method of analysis of CGH array data is unable to produce reliable results because the signal ratios used by this method are not adequately normalized, for at least two reasons. First, the two fluorescently-labeled DNAs from the patient and control samples could interfere with each other during hybridization.
Second, the accuracy of signal ratios depends on both the patient samples and the control samples; the variation in the measurement of control samples could render the signal ratios significantly skewed; this is especially problematic for the analytes producing low signals in the control sample. Therefore the results from CGH analyses are often not optimal, especially for assays that require high accuracy, e.g., a non-invasive prenatal test using pregnant women’s blood plasma.

[0112] In contrast, the claimed methods, i.e., the CBN normalization methods, solve the problems associated with inadequate normalization of CGH arrays. The CBN process involves a normalizer selection process using a group of training samples or equivalents. The CBN method does not require the dual fluorescence labeling used in the CGH array and works equally well with a single fluorescence labeling assay. In this example, we used a publicly available CGH array data set from Pollack et al. The data set, which was deposited at www.pnas.org/content/suppl/2002/09/23/162471999.DC1/4719C opyNoGeneDatsetLegend.html, shows 87 sample arrays consisting of the log2 ratio data of 6095 fluorescent probes. The first 46 samples were from a genomic DNA CGH assay. Out of the 46 samples, 5 samples of X chromosome variation and 4 cell line samples were removed. The remaining 37 samples, namely NORWAY.7, NORWAY.10, NORWAY.11, NORWAY.12, NORWAY.14, NORWAY.15, NORWAY.16, NORWAY.17, NORWAY.18, NORWAY.19, NORWAY.26, NORWAY.27, NORWAY.39, NORWAY.41, NORWAY.47, NORWAY.48, NORWAY.53, NORWAY.56, NORWAY.57, NORWAY.61, NORWAY.65, NORWAY.100, NORWAY.101, NORWAY.102, NORWAY.104, NORWAY.109, NORWAY.111, NORWAY.112, STANFORD.2, STANFORD.14, STANFORD.16, STANFORD.17, STANFORD.23, STANFORD.24, STANFORD.35, STANFORD.38, and STANFORD.A, were selected to serve as the training samples to determine normalizers for each of the 6095 probes. The dataset consisting of measurements of 6095 probes (designated as probe #1 through 6095) from each of the 37 samples (designated as sample 1 through 37) was preprocessed by the following steps: 1) the data, originally generated as log ratios, were converted to ratio values by an exponential function with base 2; and 2) the ratio values were divided by the sample median on each array to adjust for variations in input DNA quantity.

[0113] Using the preprocessed signal data of the 37 samples as the input, the selection of normalizers began with designating probe #1 as the target analyte and the remaining 6094 probes as candidate analytes. A Spearman correlation between probe #1 (the target analyte) and probe #2 (a candidate analyte) was calculated using the 37 signals of probe #1 (the median adjusted ratio values for probe #1 from the 37 input samples) and the 37 signals of probe #2 from the 37 samples. The process was reiterated to calculate a Spearman correlation for probe pairs of #1 versus #3, #1 versus #4, and all the way through the last pair of #1 versus #6095. A candidate analyte having a Spearman correlation value higher than a threshold of 0.4 was selected as a normalizer. There are several ways to set the threshold. For a genomic CGH array analysis, a fixed threshold of Spearman 0.4 typically produces 50 to 500 normalizers from 6094 candidate analytes for a given target analyte. In this example, 278 normalizers were selected for probe #1 using the described method.

[0114] In the next step, probe #2 was designated as the target analyte, and probes #1 and #3 through #6095 were designated as candidate analytes. Using the claimed method, 135 probes were identified as normalizers for probe #2.

[0115] The above process was then reiterated until normalizers for probes #3 through #6095 were determined.
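The preprocessing and normalizer selection described for this example can be sketched in Python as follows. The sketch is illustrative only and assumes a data layout in which each sample is a list of log2 ratios, one per probe, with probes indexed from 0 rather than from #1; the function names are hypothetical. The Spearman correlation is computed as the Pearson correlation of average ranks, which is undefined when a probe has the same value in every sample.

```python
import statistics


def preprocess(log2_ratios_by_sample):
    """Convert log2 ratios to ratios, then divide by each sample's median."""
    processed = []
    for log_ratios in log2_ratios_by_sample:
        ratios = [2.0 ** value for value in log_ratios]
        median = statistics.median(ratios)
        processed.append([ratio / median for ratio in ratios])
    return processed


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # fails if either probe is constant


def select_normalizers(signals_by_sample, threshold=0.4):
    """For each target probe, list the other probes correlated above the threshold."""
    n_probes = len(signals_by_sample[0])
    columns = [[sample[p] for sample in signals_by_sample] for p in range(n_probes)]
    return {
        target: [c for c in range(n_probes)
                 if c != target and spearman(columns[target], columns[c]) > threshold]
        for target in range(n_probes)
    }


# Toy signals (hypothetical): 4 samples by 3 probes, already preprocessed.
signals = [[1.0, 2.0, 9.0], [2.0, 3.0, 7.0], [3.0, 5.0, 8.0], [4.0, 4.0, 1.0]]
normalizers = select_normalizers(signals)
```

For an array of several thousand probes this pairwise loop recomputes ranks many times; ranking each probe once and reusing the ranks would be a natural optimization, omitted here for brevity.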
Once these processes were completed, the normalizers for each probe on the CGH array were determined and used to normalize the signals for samples (including the training samples and other samples) assayed on the CGH array. The 278 and 135 normalizers for probe #1 and #2 are shown in Table 1 along with their Spearman correlations. The same normalizers can be used to normalize data for an existing assay or any new assay on the same CGH array. However, if the CGH array is redesigned or the assay chemistry is significantly changed, it might be necessary to redo the normalizer selection process with samples assayed under the new conditions.

[0116] To use the determined normalizers for the 46 genomic DNA samples (used as test samples) from the existing assay, the log ratios were first transformed into signal ratios with an exponential function with base 2 to serve as the input for a NQ calculation. For each target probe on the array, its NQ was generated by dividing the signal ratio of the target by the sum of signal ratios of the normalizers as shown in the following formula:

$$NQ_i = \frac{S_i}{\sum_{k} S_k}$$

where $NQ_i$ denotes the NQ value for target probe $i$, $S_i$ denotes the signal ratio for target probe $i$, and $S_k$ denotes the signal ratio of the $k$-th normalizer in the set of normalizers for target probe $i$.
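As a small worked illustration of this calculation, the Python sketch below converts log2 ratios to signal ratios and divides each target's signal ratio by the sum of the signal ratios of its normalizers. The probe labels, the numeric values, and the function name are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def normalized_quotients(signal_ratios, normalizers_by_probe):
    """Return {probe: NQ}, dividing each target by the sum of its normalizers."""
    nqs = {}
    for probe, normalizers in normalizers_by_probe.items():
        denominator = sum(signal_ratios[name] for name in normalizers)
        nqs[probe] = signal_ratios[probe] / denominator
    return nqs


# Hypothetical log2 ratios for one sample; 2 ** value gives the signal ratio.
log2_ratios = {"probe1": 1.0, "probe2": 0.0, "probe3": 2.0}
signal_ratios = {probe: 2.0 ** value for probe, value in log2_ratios.items()}

# Hypothetical normalizer sets, as would come out of the selection step.
normalizers_by_probe = {"probe1": ["probe2", "probe3"], "probe2": ["probe1"]}

nqs = normalized_quotients(signal_ratios, normalizers_by_probe)
# probe1: 2 / (1 + 4) = 0.4; probe2: 1 / 2 = 0.5
```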

[0117] This process was repeated for all remaining probes until all signal ratios were normalized into NQ values. The NQ values can then serve as the input for downstream analysis, or can be transformed into new log NQ values to be analyzed by the software processing log ratio signals for detection of biological abnormalities of a patient sample as compared to controls.

Table 1. Summary of the normalizers identified for probe #1 and probe #2 in Example I

Table 1. (continued)

Example II. Normalizing RNA expression data in a genomic assay

[0118] Coding or non-coding RNA expression assays have become powerful tools for the discovery of specific RNA biomarkers for diagnosis or prognosis of cancer, immune disease, or blood disease. They are also very useful for the determination of drug response and effectiveness. Because an RNA expression signal often varies more significantly than its encoding DNA, it is critical to have the right normalizers to correct the RNA expression bias commonly associated with these RNA expression assays. The current RNA normalization methods used in these assays are based on a set of reference genes, such as beta-actin or GAPDH, whose RNA expression is relatively constant and less prone to biases (Cronin et al.). The CBN method presented in this invention provides an alternative and more effective way to define the reference genes, a.k.a. normalizers.

[0119] A genomic scale RNA expression assay is typically performed using a high throughput technology platform such as microarray, multi-gene panel RT-QPCR, massively parallel sequencing, or digital PCR. The CBN method, claimed in this application, was used to normalize the RNA expression data from the same CGH array (Pollack et al.) as in Example I above. The data set, deposited at www.pnas.org/content/suppl/2002/09/23/162471999.DC1/4719C opyNoGeneDatsetLegend.html, comprises 87 sample arrays and 6095 probes of log2 ratio data. The last 41 samples were from the RNA expression assay (the same CGH array as shown in Example I). Out of the 41 samples, 37 samples of tumor biopsy origin, namely NORWAY.7.mRNA, NORWAY.10.mRNA, NORWAY.11.mRNA, NORWAY.12.mRNA, NORWAY.14.mRNA, NORWAY.15.mRNA, NORWAY.16.mRNA, NORWAY.17.mRNA, NORWAY.18.mRNA, NORWAY.19.mRNA, NORWAY.26.mRNA, NORWAY.27.mRNA, NORWAY.39.mRNA, NORWAY.41.mRNA, NORWAY.47.mRNA, NORWAY.48.mRNA, NORWAY.53.mRNA, NORWAY.56.mRNA, NORWAY.57.mRNA, NORWAY.61.mRNA, NORWAY.65.mRNA, NORWAY.100.mRNA, NORWAY.101.mRNA, NORWAY.102.mRNA, NORWAY.104.mRNA, NORWAY.109.mRNA, NORWAY.111.mRNA, NORWAY.112.mRNA, STANFORD.2.mRNA, STANFORD.14.mRNA, STANFORD.16.mRNA, STANFORD.17.mRNA, STANFORD.23.mRNA, STANFORD.24.mRNA, STANFORD.35.mRNA, STANFORD.38.mRNA, and STANFORD.A.mRNA, were selected to serve as the group of samples to select normalizers for each of the 6095 probes. The dataset consisting of measurements of the 6095 probes (designated as probe #1 through 6095) from each of the 37 samples (designated as sample #1 through 37) was preprocessed by the following steps: 1) the data, originally generated as log ratios, were transformed into signal ratios by an exponential function with base 2; and 2) the signal ratios were divided by the sample median on each array to adjust for variation in input RNA quantity.

[0120] With the above preprocessed signal data of the 37 samples as the input, the normalizer selection began with probe #1 designated as the target analyte and the remaining 6094 probes as candidate analytes. A Spearman correlation between probe #1 (the target analyte) and probe #2 (a candidate analyte) was calculated using the 37 signal ratios of probe #1 and the 37 signal ratios of probe #2 derived from the 37 samples. The process was reiterated to generate a Spearman correlation for each probe pair of #1 versus #3, #1 versus #4, and all the way through the last pair of #1 versus #6095. A candidate analyte having a correlation value higher than a threshold of 0.4 was selected as a normalizer. For a genomic CGH array analysis, a fixed threshold of Spearman 0.4 will typically produce 50 to 500 normalizers from 6094 candidate analytes for a given target analyte. In this example, 117 normalizers were selected for probe #1.

[0121] In the next step, probe #2 was designated as the target analyte, and probes #1 and #3 through #6095 were designated as candidate analytes. A total of 69 probes were selected as the normalizers for probe #2 using the above CBN method.

[0122] The above process was then repeated to select normalizers for probes #3 through #6095. The 117 and 69 normalizers for probe #1 and #2 are shown in Table 2 along with their Spearman correlations.
If the RNA expression array is redesigned or the assay conditions have significantly changed, it might be necessary to re-perform the normalizer selection process with a new set of samples assayed under the new conditions to ensure optimal results.

[0123] Once these processes were completed, the normalizers selected for each probe on the RNA expression array were used to normalize the ratio signals for all RNA expression samples on the array in the same assay. The log ratios assayed on the RNA expression array were first transformed into signal ratios with an exponential function with base 2 and then served as the input for NQ calculation. For each target probe on the array, the NQ was generated by dividing the signal ratio of the target by the sum of signal ratios of the normalizers as shown in the following formula:

$$NQ_i = \frac{S_i}{\sum_{k} S_k}$$

where $NQ_i$ denotes the NQ value for target probe $i$, $S_i$ denotes the signal ratio for target probe $i$, and $S_k$ denotes the signal ratio of the $k$-th normalizer in the set of normalizers for target probe $i$.

[0124] This NQ generation process was performed for all remaining probes until all signal ratios were converted into NQ values. These NQ values can be analyzed directly, or transformed into log ratios to be analyzed by the software processing log ratio signals for detection of RNA expression level change in the given sample in comparison to the training samples.

Table 2. The 117 and 69 normalizers identified for target analytes (probe #1 and #2) in Example II

Example III. Normalizing the sequencing counts of human chromosomes and categorizing samples

[0125] Massively parallel shotgun sequencing (MPSS) has been proven to be an effective platform for detection of fetal chromosomal aneuploidy, i.e., gain or loss of one or more full copies of a fetal chromosome in the presence of maternal chromosomes (Chiu et al. 2008; Fan et al. 2008). This technique is used to sequence the first 36 bases (termed reads) of millions of DNA fragments to determine their specific chromosomal origin. The count, which is the number of reads mapped to a chromosome, is used to generate a ratio value over the counts of all chromosomes (Palomaki et al. 2011) or the counts of a specific set of chromosomes called denominators selected by minimizing the variation of the chromosome ratios (Sehnert et al. 2011). Such a ratio value is then used to generate a standardized z-score in order to detect chromosomal aneuploidy. The CBN method in this invention provides an alternative and more computationally efficient way to find normalizers and generate NQ values for the z-score calculation.

[0126] A data set of 14 sequencing samples from the public NIH SRA database (www.ncbi.nlm.nih.gov/sra) was subjected to the CBN method to select normalizers and to detect chromosomal aneuploidy. SRA study SRP016573, completed by Beijing Genome Institute (BGI), is a human blood genome sequencing study of 49 runs using the Illumina HiSeq 2000 sequencing machine. From these 49 runs, we chose 14 runs including 6 runs of male samples (SRA run ID SRR609113, SRR609114, SRR609115, SRR609116, SRR609117, SRR609118), 7 runs of female samples (SRA run ID SRR609125, SRR609126, SRR609127, SRR609128, SRR609129, SRR609130, SRR609131), and 1 run of a pregnant maternal plasma sample (SRA run ID SRR609105). The whole genome reads from each sample were trimmed and mapped to human genome HG19 reference sequences to generate counts for 24 human chromosomes including chromosomes 1 through 22, X and Y. The counts of the 24 chromosomes from each of the 14 samples were used as the training data to select normalizers for each chromosome.

[0127] In order to adjust for the variation of input DNA quantity, chromosomal representation was calculated as

$$\mathrm{chr}_i = \frac{\mathrm{counts}_i}{\sum_{j} \mathrm{counts}_j}$$

where $\mathrm{chr}_i$ denotes the chromosomal representation for chromosome $i$, and $\mathrm{counts}_j$ is the number of aligned reads on chromosome $j$.

[0128] The CBN method was applied by designating chromosome 1 as a target analyte and all other 23 chromosomes as candidate analytes. Using the chromosomal representation of the 14 samples, a Spearman correlation was calculated between each of the 23 chromosomes (the candidate analytes) and chromosome 1 (the target analyte). A threshold of 0.4 was chosen, and candidate analytes that have Spearman correlation values greater than 0.4 were selected as normalizers for chromosome 1.

[0129] The above process was then reiterated by designating chromosome 2 as a target analyte and selecting the normalizers having Spearman correlation values greater than the threshold value of 0.4. This process was then reiterated for each of the chromosomes 3 - 22, X, Y. For chromosome Y, however, the Spearman correlation threshold was lowered to 0 since the chromosome representation of Y does not have high correlations to the rest of the chromosomes. Table 3 lists the normalizers for all 24 chromosomes generated using the method of this invention.

[0130] Once a set of normalizers was defined for each of the 24 chromosomes, the NQ values were calculated for a sequenced sample using the formula

$$NQ_i = \frac{\mathrm{chr}_i}{\sum_{k} \mathrm{chr}_k}$$

where $NQ_i$ denotes the NQ value for target chromosome $i$, $\mathrm{chr}_i$ denotes the chromosome representation for target chromosome $i$, and $\mathrm{chr}_k$ denotes the chromosome representation of the $k$-th normalizer in the set of normalizers for target chromosome $i$.
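The chromosomal representation and NQ steps of this example can be sketched in Python as follows. This is a hedged illustration: the read counts are invented, the function and constant names are hypothetical, and the sketch assumes that the denominator of the chromosomal representation is the total of aligned reads over all chromosomes of the sample. The per-chromosome threshold lookup reflects the lowered threshold described for chromosome Y.

```python
DEFAULT_THRESHOLD = 0.4
THRESHOLD_OVERRIDES = {"chrY": 0.0}  # chromosome Y used a lowered threshold


def threshold_for(chromosome):
    """Spearman threshold to apply when this chromosome is the target."""
    return THRESHOLD_OVERRIDES.get(chromosome, DEFAULT_THRESHOLD)


def chromosomal_representation(counts):
    """Fraction of a sample's aligned reads that map to each chromosome."""
    total = sum(counts.values())  # assumed: sum over all chromosomes
    return {chromosome: n / total for chromosome, n in counts.items()}


def chromosome_nq(representation, target, normalizers):
    """NQ = target representation / sum of normalizer representations."""
    return representation[target] / sum(representation[c] for c in normalizers)


# Invented read counts for a four-chromosome toy sample.
counts = {"chr1": 500, "chr2": 300, "chrX": 150, "chrY": 50}
representation = chromosomal_representation(counts)
nq_chr1 = chromosome_nq(representation, "chr1", ["chr2", "chrX"])
# 0.5 / (0.3 + 0.15), i.e. 500 / 450
```

Under this assumption the sample total cancels in the quotient, so the NQ equals the target count divided by the summed normalizer counts.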

[0131] To calculate a normal value range with a group of samples, the NQs of the same 14 samples were generated with the normalizers selected as described above. The lower limit of the normal NQ range for a chromosome was determined as the mean minus two times the standard deviation, and the higher limit of the normal NQ range was determined as the mean plus two times the standard deviation, of the 14 NQ values for the chromosome. In this example, sample SRR609105 showed NQ values outside the normal range for 15 out of 24 chromosomes (Table 4), while the NQ values of all chromosomes in the other 13 samples appeared to be normal. This result indicates that sample SRR609105 is different from the others and should be further examined. Indeed, sample SRR609105 is the only blood plasma sample from a pregnant woman.

Table 3. Normalizers for all 24 chromosomes generated in Example III


Table 4. NQ values and classification of chromosomal sequencing sample SRR609105 in Example III
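The normal-range rule used in Example III (mean plus or minus two standard deviations of the reference NQ values) can be illustrated with a short Python sketch. The function names and the numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form was used.

```python
from statistics import mean, stdev

def normal_range(reference_nqs, k=2.0):
    """Return (lower, upper) limits as mean -/+ k standard deviations."""
    m = mean(reference_nqs)
    s = stdev(reference_nqs)  # assumption: sample standard deviation (n - 1)
    return m - k * s, m + k * s

def categorize(nq, lower, upper):
    """Label an NQ value as normal when it lies within the inclusive limits."""
    return "normal" if lower <= nq <= upper else "abnormal"

# Made-up NQ values of one chromosome across five reference samples.
reference_nqs = [0.98, 1.00, 1.02, 1.01, 0.99]
lower, upper = normal_range(reference_nqs)

label_inside = categorize(1.01, lower, upper)
label_outside = categorize(1.08, lower, upper)
```

The inclusive comparison mirrors claim 3, which treats an NQ equal to either limit as normal.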

Example IV. Normalizing the sequencing counts of human sub-chromosomal bins for detection of DNA copy number variation (CNV) or chromosomal aneuploidy

[0132] The CBN normalization was performed on sequencing counts for a collection of fragmented chromosomes (“bins”) in order to detect amplification or deletion of a bin or a part of a bin in a sample. The bins may be of various or equal lengths, and may be overlapping or non-overlapping. In some instances, non-overlapping bins from human samples were used, with lengths ranging from 100 to 50000000 nucleotides. One bin was designated as the target analyte and all other bins were designated as candidate analytes. Sequencing counts for bins were obtained through a DNA sequencing device. The CBN method was then used to select normalizers for the bin. The process is repeated until normalizers are selected for all bins.

[0133] In one experiment, sequencing data of the 14 samples from the SRA database described in Example III were used. According to the public HG19 human genome reference sequences (hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/), the human genome can be divided into about 3000 bins of 1 megabase (MB). Sequence counts were generated for 2917 bins (analytes) in each of the 14 samples. In order to adjust for variations in input DNA quantity, the chromosomal binning representation was calculated using the following equation:

where cb _i denotes the chromosomal binning representation for bin i, and counts _j is the number of aligned reads on bin j.

[0134] Chromosomal bin 1 was designated as the target analyte, and all other 2916 bins as candidate analytes. Using the chromosomal binning representation of the 14 samples, a Spearman correlation was calculated between each of the 2916 bins (the candidate analytes) and bin 1 (the target analyte). Candidate analytes were then ranked from high to low based on the Spearman correlation values. The correlation value ranked 200th was selected as the threshold. Candidate analytes having Spearman correlation values higher than or equal to the threshold were selected as normalizers for bin 1. This process thus selected the 200 most highly correlated candidates as normalizers.

[0135] The above process was repeated to select normalizers for bins 2 through 2917. Once sets of normalizers were selected for all bins, the NQ value for each bin was calculated using the following equation:

where Q _i denotes the NQ value for target bin i, cb _i denotes the chromosomal binning representation for target bin i, and N _ij denotes the chromosomal binning representation of the j-th normalizer in the set of normalizers N _i for target bin i.

[0136] The NQ values for the bins can then be used for downstream sub-chromosomal analysis such as detection of CNVs or detection of chromosomal aneuploidy. The NQ values can also be used as the input data for covariation calculation to iteratively improve the normalizer selection. Figure 5 is a study analysis flowchart illustrating a process of using bin NQ values for detection of fetal chromosome 21 trisomy in cell-free DNA (“cfDNA”) samples from pregnant women’s blood plasma in real commercial non-invasive prenatal tests. cfDNA samples from pregnant women’s blood plasma have been shown to contain both maternal DNA and fetal DNA (see review by Chiu and Lo 2011). Fetal chromosome 21 trisomy causes an abnormally high chromosome 21 measurement in cfDNA samples, which can be detected with the normalization method as shown in Figure 5. In the study described above, the data analyses include selection of initial normalizers, calculation of NQ values, re-selection of normalizers, and re-calculation of NQ values for each sub-chromosomal bin of 1000000 nucleotides in size. The bin NQ values thus obtained for chromosome 21 can be directly compared to the mean bin NQ values of chromosome 21 obtained from reference samples through a t-test; a p-value less than 0.01 from the test predicts trisomy of fetal chromosome 21.

[0137] Similarly, comparing the bin NQ values of a small portion of a chromosome with the t-test in the same study can detect a trisomic copy number of that chromosomal portion, which is an example of sub-chromosomal copy number variation (CNV).

Example V. Normalizing the sequencing count of mtDNA and quantifying mtDNA in a sample

[0138] The CBN normalization was performed on sequencing counts for a collection of fragmented chromosomes (“bins”) comprising mtDNA in order to quantify the copy number of each mtDNA-derived bin in a sample. The bins may be of various or equal lengths, and may be overlapping or non-overlapping. In some instances, non-overlapping bins from human samples were used, with lengths ranging from 100 to 50000000 nucleotides. One bin derived from mtDNA was designated as the target analyte and bins derived from autosomes were designated as candidate analytes. Sequencing counts for bins were obtained through a DNA sequencing device. The CBN method was then used to select normalizers for the bin. The process is repeated until normalizers are selected for all target bins derived from mtDNA.

[0139] In one experiment, sequencing data of six DNA samples extracted from buccal swabs of six human adults were used. According to the public HG19 human genome reference sequences (hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/), the human autosomes (X chromosome, Y chromosome, and mtDNA excluded) can be divided into about 2750 bins of 1 megabase (MB). Sequence counts were generated for 2734 autosomal bins plus one mtDNA bin (human mtDNA is about 16569 base pairs in size, so it is treated as a single bin in this case) for a total of 2735 bins (analytes) in each of the six samples. In order to adjust for variations in input DNA quantity, the chromosomal binning representation was calculated using the following equation:

where cb _i denotes the chromosomal binning representation for bin i, and counts _j is the number of aligned reads on bin j.
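The equation itself is not reproduced in this text. Going by the definitions of its terms, one plausible reading is that the representation of a bin is its read count divided by the total read count over all bins of the sample. The Python sketch below implements that reading only; the formula, the function name, and the counts are assumptions made for illustration.

```python
def binning_representation(counts):
    """counts maps bin id -> aligned reads; returns each bin's share of all reads.

    Assumed form: cb_i = counts_i / sum over j of counts_j.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no aligned reads")
    return {bin_id: reads / total for bin_id, reads in counts.items()}

# Made-up read counts for two autosomal bins and the single mtDNA bin.
sample_counts = {"chr1_bin1": 500, "chr1_bin2": 300, "mtDNA": 200}
cb = binning_representation(sample_counts)
```

Dividing by the sample total removes the dependence on sequencing depth, which is the stated purpose of adjusting for input DNA quantity.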

[0140] A bin comprising mtDNA (“mtDNA bin”) was designated as the target analyte, and all other 2734 bins as candidate analytes. Using the chromosomal binning representation of the six samples, a Spearman correlation was calculated between each of the 2734 bins (the candidate analytes) and the mtDNA bin (the target analyte). Candidate analytes were then ranked from high to low based on the Spearman correlation values. The correlation value ranked 200th was selected as the threshold. Candidate analytes having Spearman correlation values higher than or equal to the threshold were selected as normalizers for mtDNA. This process thus selected the 200 most highly correlated candidates as normalizers. Then the NQ value for the mtDNA bin was calculated using the following equation:

where Q _i denotes the NQ value for target mtDNA bin, cb _i denotes the chromosomal binning representation for target mtDNA bin, and N _ij denotes the chromosomal binning representation of the j-th normalizer in the set of normalizers for target mtDNA bin.
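The rank-based selection of paragraph [0140] differs from the fixed threshold of Example III in that the threshold is taken from the data: it is the correlation value at rank k (k = 200 in the text). A minimal Python sketch of that step follows. The function name, the small k, and the correlation values are hypothetical, and the correlations are assumed to have been computed beforehand.

```python
def top_k_normalizers(correlations, k=200):
    """correlations maps candidate bin -> Spearman correlation with the target.

    The k-th highest correlation is used as the threshold, and every candidate
    at or above it is kept.
    """
    ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= k:
        return [name for name, _ in ranked]
    threshold = ranked[k - 1][1]
    return [name for name, value in ranked if value >= threshold]

# Made-up correlations between five candidate bins and the mtDNA bin.
correlations = {"binA": 0.9, "binB": 0.5, "binC": 0.7, "binD": 0.7, "binE": -0.2}
selected = top_k_normalizers(correlations, k=2)
```

Because the rule keeps every candidate at or above the threshold, a tie at the cutoff returns more than k normalizers, as `binC` and `binD` do here. With only six samples the Spearman correlation can take a limited number of distinct values, so such ties are plausible in practice.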

[0141] As a result, the NQ values for the target mtDNA bin (or mtDNA NQ values) for the six samples were generated as normalized measurements of mtDNA quantity with high accuracy and robustness (Table 5).

[0142] The mtDNA NQ values can then be used for further analyses, such as computing mtDNA copy number through an established NQ-to-copy-number reference relation function and establishing various mtDNA quantity ranges for clinical or physiological conditions of human tissues or cells.

References

[0143] Charles E. Rogler, Tatyana Tchaikovskaya, Raquel Norel, Aldo Massimi, Christopher Plescia, Eugeny Rubashevsky, Paul Siebert, and Leslie E. Rogler, RNA expression microarrays (REMs), a high-throughput method to measure differences in gene expression in diverse biological samples, Nucleic Acids Res. 2004;32(15)

[0144] Cronin et al. Measurement of Gene Expression in Archival Paraffin-Embedded Tissues. American Journal of Pathology, Vol. 164, No. 1, January 2004

[0145] Chiu RW, Chan KC, Gao Y, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc Natl Acad Sci USA 2008;105:20458–20463

[0146] Chiu RW, Lo YM. Non-invasive prenatal diagnosis by fetal nucleic acid analysis in maternal plasma: the coming of age. Semin Fetal Neonatal Med. 2011 Apr;16(2):88-93.

[0147] Dent S, Verma Sh, Latreille J, Rayson D, Clemons M, Mackey J, Verma S, Lemieux J, Provencher L, Chia S, Wang B, Pritchard K. The role of HER2-targeted therapies in women with HER2-overexpressing metastatic breast cancer. Curr. Oncol. 2009 Aug;16(4):25-35

[0148] Fan HC, Blumenfeld YJ, Chitkara U, Hudgins L, Quake SR. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc Natl Acad Sci USA 2008;105:16266–16271

[0149] Gao J, Giles BS, Webb PG, Roberts DN. Normalization probes for comparative genome hybridization arrays. PCT Pub. No.: WO2007/097876

[0150] Ghandi M, Beer MA. Group normalization for genomic data. PLoS One. 2012;7(8)

[0151] Hu M, Deng K, Selvaraj S, Qin Z, Ren B, Liu JS. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics. 2012 Dec 1;28(23):3131-3

[0152] Jennings BA, Hadfield JE, Worsley SD, Girling A, Willis G. A differential PCR assay for the detection of c-erbB 2 amplification used in a prospective study of breast cancer. Mol Pathol. 1997 Oct;50(5):254-6

[0153] Jennings EK, Kennedy GC, Baloch ZW, Cibas ES, Chudova D, Diggans J, Friedman L, Kloos RT, LiVolsi VA, Mandel SJ, Raab SS, Rosai J, Steward DL, Walsh PS, Wilde JI, Zeiger MA, Lanman RB, Haugen BR. Preoperative diagnosis of benign thyroid nodules with indeterminate cytology. N Engl J Med. 2012 Aug 23;367(8):705-15

[0154] Jonathan R. Pollack, Therese Sørlie, Charles M. Perou, Christian A. Rees, Stefanie S. Jeffrey, Per E. Lonning, Robert Tibshirani, David Botstein, Anne-Lise Børresen-Dale, and Patrick O. Brown, Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors, PNAS October 1, 2002 vol. 99 no. 20 12963-12968

[0155] Krepischi AC, Achatz MI, Santos EM, Costa SS, Lisboa BC, Brentani H, Santos TM, Gonçalves A, Nóbrega AF, Pearson PL, Vianna-Morgante AM, Carraro DM, Brentani RR, Rosenberg C. Germline DNA copy number variation in familial and early-onset breast cancer. Breast Cancer Res. 2012 Feb 7;14(1):R24

[0156] Landfors M, Philip P, Ryden P, Stenberg P. Normalization of High Dimensional Genomics Data Where the Distribution of the Altered Variables Is Skewed. PLoS ONE 2011;6(11)

[0157] Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351(27):2817-26

[0158] Palomaki GE, Kloza EM, Lambert-Messerlian GM, Haddow JE, Neveux LM, Ehrich M, van den Boom D, Bombard AT, Deciu C, Grody WW, Nelson SF, Canick JA. DNA sequencing of maternal plasma to detect Down syndrome: an international clinical validation study. Genet Med. 2011 Nov;13(11):913-20

[0159] Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004 Dec 30;351(27):2817-26

[0160] Quinlan S, Smith T, Vishwanath P, Alien sequences, US Patent Appl. No.: 11/224,573, 2005

[0161] Rooney et al., Methods Mol Biol. 2015; 1241: 23–38

[0162] Sapkota Y, Ghosh S, Lai R, Coe BP, Cass CE, et al. Germline DNA Copy Number Aberrations Identified as Potential Prognostic Factors for Breast Cancer Recurrence. PLoS ONE 2013;8(1)

[0163] Sehnert AJ, Rhees B, Comstock D, de Feo E, Heilek G, Burke J, Rava RP. Optimal detection of fetal chromosomal abnormalities by massively parallel DNA sequencing of cell-free fetal DNA from maternal blood. Clin Chem. 2011 Jul;57(7):1042-9.

[0164] Sterrenburg, E., et al., A Common Reference For cDNA Microarray Hybridizations. Nucleic Acids Res 2002;30(21)

[0165] Wyrick J, Young RA, Ren B, Robert F. Chromosome-wide analysis of protein-DNA interactions. US Patent No: 6,410,243 B1

[0166] Xu X, Zeng L, Tao Y, Vuong T, Wan J, Boerma R, Noe J, Li Z, Finnerty S, Pathan SM, Shannon JG, Nguyen HT. Pinpointing genes underlying the quantitative trait loci for root-knot nematode resistance in palaeopolyploid soybean by whole genome resequencing. Proc Natl Acad Sci U S A. 2013 Aug 13;110(33):13469-74

[0167] All publications and patent applications cited in this specification are herein incorporated by reference as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention.

[0168] Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended statements of invention.

[0169] The following Statements of the Invention are intended to characterize possible elements of the invention according to the foregoing description given in the specification. Because this application is a provisional application, these statements may be changed upon preparation and filing of the complete application. Such changes are not intended to affect the scope of equivalents according to the claims issuing from the complete application, if such changes occur. According to 35 U.S.C. § 111(b), claims are not required for a provisional application. Consequently, the Statements of the Invention cannot be interpreted to be claims pursuant to 35 U.S.C. § 112.
