Other References:
BANDELT HANS-JURGEN ET AL: "Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma", BMC CANCER, BIOMED CENTRAL, LONDON, GB, vol. 9, no. 1, 16 April 2009 (2009-04-16), pages 113, XP021048991, ISSN: 1471-2407, DOI: 10.1186/1471-2407-9-113
LIJUN SHEN ET AL: "Evaluating mitochondrial DNA in patients with breast cancer and benign breast disease", JOURNAL OF CANCER RESEARCH AND CLINICAL ONCOLOGY, SPRINGER, BERLIN, DE, vol. 137, no. 4, 16 June 2010 (2010-06-16), pages 669 - 675, XP019890192, ISSN: 1432-1335, DOI: 10.1007/S00432-010-0912-X
PALANICHAMY MALLIYA GOUNDER ET AL: "Potential pitfalls in MitoChip detected tumor-specific somatic mutations: a call for caution when interpreting patient data", BMC CANCER, BIOMED CENTRAL, LONDON, GB, vol. 10, no. 1, 30 October 2010 (2010-10-30), pages 597, XP021075420, ISSN: 1471-2407, DOI: 10.1186/1471-2407-10-597
RUOYU ZHANG ET AL: "Independent impacts of aging on mitochondrial DNA quantity and quality in humans", BMC GENOMICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 18, no. 1, 21 November 2017 (2017-11-21), pages 1 - 14, XP021250683, DOI: 10.1186/S12864-017-4287-0
MARIA ANGELA DIROMA ET AL: "Extraction and annotation of human mitochondrial genomes from 1000 Genomes Whole Exome Sequencing data", BMC GENOMICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 15, no. Suppl 3, 6 May 2014 (2014-05-06), pages S2, XP021184322, ISSN: 1471-2164, DOI: 10.1186/1471-2164-15-S3-S2
MICHAEL V. ZARAGOZA ET AL: "Mitochondrial DNA Variant Discovery and Evaluation in Human Cardiomyopathies through Next-Generation Sequencing", PLOS ONE, vol. 5, no. 8, 1 January 2010 (2010-01-01), pages e12295, XP055046795, ISSN: 1932-6203, DOI: 10.1371/journal.pone.0012295
PFEIFER ET AL., AMER. J. CLIN. PATHOL., vol. 139, 2013, pages 93 - 100
COSTELLO ET AL., BMC GENOMICS, vol. 19, 2018, pages 332
LERNER ET AL., CANCER RES., vol. 75, 2015, Abstract P5-02-08
SEHN ET AL., AMER. J. CLIN. PATHOL., vol. 144, 2015, pages 667 - 674
GUNNARSDOTTIR ET AL., NATURE COMMUN., vol. 2, 2011, pages 228
SLATKIN ET AL., GENETICS, vol. 129, 1991, pages 555 - 562
YE ET AL., PROC. NAT'L ACAD. SCI. USA, vol. 111, 2014, pages 10654 - 10659
ZHANG ET AL., BMC GENOMICS, vol. 18, 2017, pages 890
LI ET AL., BIOINFORMATICS, vol. 25, 2009, pages 2078 - 2079
WEISSENSTEINER ET AL., NUC. ACIDS RES., vol. 44, 2016, pages W58 - W63
BOLGER ET AL., BIOINFORMATICS, vol. 30, 2014, pages 2114 - 2120
LANGMEAD ET AL., NATURE METHODS, vol. 9, 2012, pages 357 - 359
AUTON ET AL., NATURE, vol. 526, 2015, pages 68 - 74
BAR-YAACOV ET AL., GENOME RES., vol. 23, 2013, pages 1789 - 1796
HODGKINSON ET AL., SCIENCE, vol. 344, 2014, pages 413 - 415
HUANG ET AL., INFLAMM. BOWEL DIS., vol. 23, 2017, pages 366 - 378
WU ET AL., MOLEC. CANCER, vol. 19, 2020, pages 99
Description:
Mitochondrial DNA Quality Control

Field

The present disclosure is directed, in part, to methods of identifying unreliable biological samples that may be mislabeled or contaminated.

Background

For over 10 years, next-generation sequencing (NGS) has been an important component of biological and biomedical research because it makes sequencing of large batches of DNA or RNA samples feasible. NGS has broad applications, such as whole genome and whole exome sequencing for large cohort genetic investigation, bulk RNA-seq for identification of disease gene expression signatures in clinical evaluation, tissue biopsy sequencing in tumor studies and diagnostics, and recently emerged single-cell sequencing research, providing answers and solutions to many different problems and questions. However, in studies involving large numbers of samples, sample identity complications are a common and almost inevitable problem. The estimated sample identity error rate can range from 0.2% to 6% in practice (Pfeifer et al., Amer. J. Clin. Pathol., 2013, 139, 93-100; Costello et al., BMC Genomics, 2018, 19, 332; Lerner et al., Cancer Res., 2015, 75, Abstract P5-02-08; and Sehn et al., Amer. J. Clin. Pathol., 2015, 144, 667-674). The errors can happen to different degrees: 1) a complete swapping between samples, and/or 2) contamination of one sample with one or more other samples. Various steps during sample processing can introduce the errors, such as sample mislabeling during sample collection, material spillage during pipetting, index swapping in pooled libraries during sequencing, and many other unexpected situations. Sample swapping or contamination subsequently reduces the quality and accuracy of the downstream analysis. For example, sample swapping in a whole transcriptome analysis may lead to false discoveries or a loss of power to detect differentially expressed genes. In cancer studies, somatic mutation identification is routinely used; because many of those mutations are present at very low frequency (<5%), even low levels (1% to 5%) of contamination may result in false-positive mutation calls. For those reasons, accurate detection of sample swapping and contamination is an important quality control step in
large-scale NGS studies.

Mitochondria are essential organelles in most eukaryotic cells. Human mitochondrial DNA (mtDNA) molecules are 16.5 kb circular DNA molecules located in mitochondria, and encode gene products that are essential for mitochondrial functions. There are hundreds to thousands of mtDNA copies within a single cell. mtDNA is maternally inherited with negligible recombination. Because mtDNA is uniparentally inherited and undergoes negligible recombination at a population level, mutations acquired over time have subdivided the human population into several discrete mtDNA haplogroups. On average, two random individuals will have 30 to 40 nucleotide differences in their mitochondrial genomes (Gunnarsdóttir et al., Nature Commun., 2011, 2, 228; Slatkin et al., Genetics, 1991, 129, 555-562; and Ye et al., Proc. Nat’l Acad. Sci. USA, 2014, 111, E4548-E4550). Because of its multicopy nature, an mtDNA mutation is often present in only a small proportion of a cell’s mtDNA, a state termed heteroplasmy. The percentage of mtDNA carrying the mutation is referred to as the heteroplasmy frequency. In contrast, a mutation found in all mtDNA molecules is termed a homoplasmy. Previous studies demonstrated that in generally healthy populations, most individuals harbor fewer than 5 heteroplasmies (with frequency > 1 to 2%) in their mitochondrial genome (Zhang et al., BMC Genomics, 2017, 18, 890; and Ye et al., Proc. Nat’l Acad. Sci. USA, 2014, 111, 10654-10659). For a batch of samples, samples collected from the same individual should all belong to the same haplogroup.

Summary

The present disclosure provides methods of identifying
an unreliable biological sample, the method comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain mitochondrial DNA (mtDNA) sequencing reads for each biological sample; b) identifying a heteroplasmy and a homoplasmy in the mtDNA sequencing reads from the previous step for each of the biological samples; and c) assigning a primary mtDNA haplogroup to each biological sample, wherein any biological sample having an assigned primary mtDNA haplogroup that is different than the primary mtDNA haplogroup assigned to a majority of the biological samples from the same individual is an unreliable biological sample that is a mislabeled biological sample.

The present disclosure also provides methods of identifying an unreliable biological sample, the method comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain mitochondrial DNA (mtDNA) sequencing reads for each biological sample; b) identifying a heteroplasmy and a homoplasmy in the mtDNA sequencing reads from the previous step for each of the biological samples; c) assigning a primary mtDNA haplogroup to each biological sample; and d) determining the total heteroplasmy number for each biological sample, wherein when a biological sample has a high heteroplasmy number, the biological sample is assigned a secondary mtDNA haplogroup based on the minor alleles in the heteroplasmy sites, wherein a biological sample having an assigned secondary mtDNA haplogroup that is different than the assigned primary mtDNA haplogroup is an unreliable sample that is contaminated.

The present disclosure also provides methods of identifying an unreliable biological sample, the method comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain mitochondrial DNA (mtDNA) raw sequencing reads for each biological sample; b) processing the mtDNA raw sequencing reads for quality control and adaptor sequence removal to produce quality controlled mtDNA sequencing reads; c) mapping the quality controlled mtDNA sequencing reads to a mitochondrial reference genome to produce candidate mtDNA sequencing reads; d) re-mapping the candidate mtDNA sequencing reads to a human reference genome and retaining the candidate mtDNA sequencing reads when: i) the candidate mtDNA sequencing read is uniquely mapped to the mitochondrial reference genome or has fewer mismatches to the mitochondrial reference genome than to the human reference genome; and ii) the alignment mismatch count of the candidate mtDNA sequencing read is less than 5; e) performing post-mapping processing of the retained candidate mtDNA sequencing reads for sorting and duplicate removal; f) identifying heteroplasmy and homoplasmy in the retained candidate mtDNA sequencing reads for each of the biological samples; and g) assigning a primary mtDNA haplogroup to each biological sample, wherein any biological sample having an assigned primary mtDNA haplogroup that is different than the primary mtDNA haplogroup assigned to a majority of the biological samples from the same individual is an unreliable biological sample that is a mislabeled biological sample.

The present disclosure also provides methods of identifying an unreliable biological sample, the method comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain mitochondrial DNA (mtDNA) raw sequencing reads for each biological sample; b) processing the mtDNA raw sequencing reads for quality control and adaptor sequence removal to produce quality controlled mtDNA sequencing reads; c) mapping the quality controlled mtDNA sequencing reads to a mitochondrial reference genome to produce candidate mtDNA sequencing reads; d) re-mapping the candidate mtDNA sequencing reads to a human reference genome and retaining the candidate mtDNA sequencing reads when: i) the candidate mtDNA sequencing read is uniquely mapped to the mitochondrial reference genome or has fewer mismatches to the mitochondrial reference genome than to the human reference genome; and ii) the alignment mismatch count of the candidate mtDNA sequencing read is less than 5; e) performing post-mapping processing of the retained candidate mtDNA sequencing reads for sorting and duplicate removal; f) identifying heteroplasmy and homoplasmy in the retained candidate mtDNA sequencing reads for each of the biological samples; g) assigning a primary mtDNA haplogroup to each biological sample; and h) determining the total heteroplasmy number for each biological sample, wherein when a biological sample has a high heteroplasmy number, the biological sample is assigned a secondary mtDNA haplogroup, wherein a biological sample having an assigned secondary mtDNA haplogroup that is different than the assigned primary mtDNA haplogroup is an unreliable sample that is contaminated.

Brief Description Of The Drawings

The patent or application file contains at least one
drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

Figure 1 shows a representative schematic of suitable steps for carrying out the quality control analysis described herein. In Box 1, mtDNA homoplasmies and heteroplasmies are identified from fastq files. An optional down-sampling step can be applied to samples with high mtDNA coverage. After read QC, mtDNA reads are selected by a two-step mapping strategy. mtDNA variants are identified from the mtDNA mapping results, and primary and secondary mtDNA haplogroups are assigned to each sample based on the variant information. In Box 2, sample swapping/mislabeling can be detected by comparing haplogroup assignments of samples from a given individual. In Box 3, sample contamination can be detected by an unusually high mtDNA heteroplasmy number and mismatched primary and secondary haplogroups.

Figure 2 shows the performance of the methods described herein on virtual contamination samples. Virtual contamination samples were created by mixing two samples from the 1000 Genomes Project at different ratios. The X-axis indicates the theoretical contamination level and the Y-axis indicates the heteroplasmy frequencies identified from each virtual contamination sample. Each colored dot represents one heteroplasmy, the black dots represent the means of the heteroplasmy frequencies in the samples, and error bars represent the frequency standard errors. The mean of the frequencies is significantly correlated with the theoretical contamination level (Pearson correlation = 0.996781, P value = 6.212e-09).

Figure 3 shows results of contamination detection in virtual contamination samples.

Figure 4 shows results from sample swapping and contamination detection in RNA-seq data 1.

Figure 5 shows results from sample swapping and contamination detection in RNA-seq data 2.

Description Of Embodiments

Methods are presented herein for leveraging mtDNA sequence information to detect potential sample mislabeling and contamination in NGS
data. mtDNA polymorphisms and mutations can be used to infer the identity of a particular biological sample, and act as an indicator of sample mislabeling. In addition, when a biological sample is contaminated by DNA/RNA from another biological sample, unusual mtDNA mutation patterns will be revealed, which can help identify and further quantify the contaminant. Compared to nuclear DNA mutation-based approaches, the methods described herein allow for higher sensitivity even in low coverage sequencing data.

The methods described herein can take any NGS data containing sufficient mtDNA reads as input, identify mtDNA variants (heteroplasmy and homoplasmy) from the data, and use the variant information to assign haplogroups to each sample to detect potential sample swapping or mislabeling. By evaluating the samples’ heteroplasmy information, the methods described herein can further detect cross-individual contamination.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.

A human will have wild type (or reference) mtDNA molecules and may have mutant mtDNA molecules. If a human has no mutant mtDNA molecules, such a human is considered to be homoplasmic wild type (or homoplasmic reference).
If a human has no wild type mtDNA molecules (i.e., only has mutant mtDNA), such a human is considered to be homoplasmic mutant. Homoplasmy is, thus, a measure of possessing all or no copies of mutant mtDNA. If a human possesses a mixture of wild type and mutant mtDNA molecules, the human is considered to possess a heteroplasmy. The fraction of mutated copies is referred to herein as the “heteroplasmy frequency.” For example, assuming a human has 8 copies of mtDNA molecules, and a single copy of the eight mtDNA molecules has a particular mutation in gene A, such a human is considered to have a heteroplasmy frequency of 12.5% (i.e., 1/8). Heteroplasmy can be determined for each mutation within the mtDNA genome for a particular individual. Thus, an individual having 2 mtDNA mutations (relative to wild type mtDNA) can have two heteroplasmies. Each heteroplasmy is associated with its own heteroplasmy frequency.

The present disclosure provides methods of identifying an unreliable biological sample. The methods comprise performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain mitochondrial DNA (mtDNA) sequencing reads for each biological sample.
The methods also comprise identifying the presence of one or more heteroplasmies and homoplasmies in the mtDNA sequencing reads for each of the biological samples. The methods also comprise assigning a primary mtDNA haplogroup to each biological sample. A biological sample that has an assigned primary mtDNA haplogroup that is different than the primary mtDNA haplogroup assigned to a majority of the biological samples from the same individual is an unreliable biological sample. Such an unreliable biological sample may have been, for example, mislabeled or swapped with another biological sample.

The nucleic acid sequencing assay is any nucleic acid sequencing protocol. In some embodiments, the sequencing assay comprises next generation sequencing (NGS). In some embodiments, the NGS comprises whole genome sequencing. In some embodiments, the NGS comprises whole exome sequencing. In some embodiments, the NGS comprises RNA sequencing. In some embodiments, the NGS comprises bisulfite sequencing.

The nucleic acid sequencing assay is performed on each biological sample of a plurality of biological samples obtained from a single individual. In some embodiments, the plurality of biological samples can number from as low as 2 to thousands of samples. In some embodiments, the plurality of biological samples can number from as low as 2 to hundreds of samples. In some embodiments, the plurality of biological samples is obtained from one or more clinical studies. In some embodiments, the single individual’s plurality of biological samples may be intermixed or batched with a plurality of biological samples from another individual. mtDNA sequencing reads for each biological
sample are obtained.

The presence of one or more heteroplasmies and homoplasmies in the mtDNA sequencing reads for each of the biological samples is determined. Thus, for each mutation identified in the mtDNA sequencing reads, the heteroplasmy and homoplasmy analysis is performed. The sum of all the heteroplasmies is represented by the total heteroplasmy number for a particular biological sample. The mtDNA sequencing read information for each mtDNA mutational site is compiled to provide a summary of the sequencing information of mapped reads at each single site. In some embodiments, the compiling can be carried out using, for example, the samtools mpileup function (Li et al., Bioinformatics, 2009, 25, 2078-2079). The mtDNA sequencing read information for each mtDNA mutational site is filtered by sequence quality to, for example, remove bases with low sequencing quality and thereby reduce sequencing errors. In some embodiments, a sequence quality score (Q) is determined, which is a property that is logarithmically related to the sequencing error probability (Q = -10 × log10(P), where P is the probability of a sequencing error).
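As a minimal illustration of the quality score relationship stated above, the following Python sketch converts between Q and P and drops low-quality bases. The function names and the list-based representation of bases and qualities are assumptions made for this example only; they are not part of samtools or of any tool named in this disclosure.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability P for a quality score Q, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Quality score Q for an error probability P."""
    return -10 * math.log10(p)


def filter_bases(bases, quals, min_q=20):
    """Keep only the bases whose quality score is at least min_q.

    `bases` and `quals` are parallel sequences (hypothetical layout chosen
    for this sketch).
    """
    return [base for base, q in zip(bases, quals) if q >= min_q]


kept = filter_bases("ACGT", [30, 10, 20, 19])
```

With a cutoff of 20, the second and fourth bases in the usage line are discarded because their scores (10 and 19) fall below the cutoff.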
In some embodiments, the sequence quality Q is ≥ 20. When Q is 20, there is a 1% probability of a sequencing error.

In some embodiments, the heteroplasmy is identified by determining the sequencing coverage, the presence of a minor allele, and the minor allele frequency. The sequencing coverage represents the number of reads that align to a known mtDNA reference base. In some embodiments, the sequencing coverage is ≥ 50. The sequencing coverage is generated by mpileup minus the bases with Q < 20. In some embodiments, the minor allele frequency is ≥ 1% for DNA sequencing data and ≥ 5% for RNA sequencing data. In some embodiments, the minor allele is observed at least twice from each DNA strand, or the minor allele is observed at least three times for RNA. For example, the following mtDNA sequencing reads may be obtained (the first sequence is the reference sequence):

N1-N2-N3-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-N18-N19-N20
5’-N1-N2-N3-N4-N5-N6-N7-N8-A9-N10-N11-N12-N13-N14-N15-3’
3’-N2-N3-N4-N5-N6-N7-N8-A9-N10-N11-N12-N13-N14-N15-N16-5’
5’-N2-N3-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-3’
5’-N3-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-3’
5’-N3-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-3’
3’-N3-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-5’
3’-N4-N5-N6-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-N18-5’
5’-N7-N8-G9-N10-N11-N12-N13-N14-N15-N16-N17-N18-N19-N20-3’
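The tally for position 9 of the illustrative reads above can be sketched in Python as follows. The `(strand, base)` encoding, the use of "+" for reads written 5’ to 3’ and "-" for reads written 3’ to 5’, and the function name are assumptions introduced for this example; the sketch only counts alleles and does not apply the coverage or quality thresholds.

```python
from collections import Counter

# (strand, base observed at position 9) for the eight illustrative reads,
# in the order listed. "+" marks a read written 5'->3', "-" a read
# written 3'->5' (a labeling convention assumed for this sketch).
READS_AT_SITE_9 = [
    ("+", "A"), ("-", "A"), ("+", "G"), ("+", "G"),
    ("+", "G"), ("-", "G"), ("-", "G"), ("+", "G"),
]


def minor_allele_summary(observations):
    """Return (minor allele, its frequency, strands it was seen on).

    Returns None when only one allele is observed at the site.
    """
    counts = Counter(base for _, base in observations)
    if len(counts) < 2:
        return None
    # most_common(2) lists the major allele first and the minor allele second.
    minor_allele, minor_count = counts.most_common(2)[1]
    strands = {s for s, base in observations if base == minor_allele}
    return minor_allele, minor_count / len(observations), strands


summary = minor_allele_summary(READS_AT_SITE_9)
```

Here the minor allele A is seen on two of the eight reads, one on each strand, so its frequency is 2/8.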
The heteroplasmy frequency for this candidate mtDNA heteroplasmy site is 25% (2/8). In this particular analysis, the sequencing quality is > 20 and the sequencing coverage is > 50. The minor allele is observed in both strands of the DNA. Accordingly, this particular mutational site (i.e., a candidate mtDNA heteroplasmy site) is an mtDNA heteroplasmy.

In some embodiments, the homoplasmy is identified by determining the sequencing coverage and the presence of one or more alleles. A homoplasmy is present when: i) the sequencing coverage is ≥ 10; and ii) only one allele is observed at a particular nucleic acid mutation site and it is different than the corresponding reference allele, or multiple alleles are observed at a particular nucleic acid mutation site and the major allele is different than the corresponding reference allele, and the particular nucleic acid site is not a heteroplasmy and does not meet the heteroplasmy identification criteria.

In some embodiments, assigning the primary mtDNA haplogroup to each biological sample comprises constructing an mtDNA sequence for each biological sample. In some embodiments, the mtDNA sequence for each biological sample is constructed using the homoplasmy and major alleles of the heteroplasmy. In some embodiments, the primary mtDNA haplogroups are assigned based on the constructed mtDNA sequences using HaploGrep2 (Weissensteiner et al., Nuc. Acids Res., 2016, 44, W58-W63). HaploGrep2 is an algorithm whereby the haplogroup is classified based on precalculated phylogenetic weights that correspond to the mutation occurrence per position in Phylotree. Similar tools for assigning primary mtDNA haplogroups include mthap (world wide web at “dna.jameslick.com/mthap/”) and haplofind (world wide web at “haplofind.unibo.it/”).

A biological sample that has an assigned primary mtDNA haplogroup that is different than the primary mtDNA haplogroup assigned to a majority of the biological samples from the same individual is an unreliable biological sample. In some embodiments, the unreliable biological sample has been mislabeled. In some embodiments, the unreliable biological sample has been swapped with another biological sample. In some embodiments, the one or more mislabeled samples are re-labeled correctly. In some embodiments, the one or more mislabeled samples are discarded.

In some embodiments, the methods further comprise determining a heteroplasmy number for each biological sample. In some embodiments, the heteroplasmy frequency is determined for each mutation identified in the mtDNA sequences for each biological sample. When a biological sample has a high heteroplasmy number, the biological sample is assigned a secondary mtDNA haplogroup. In some embodiments, a threshold for having a high heteroplasmy number is ≥ 10 heteroplasmies.

In some embodiments, assigning the secondary mtDNA haplogroup comprises constructing a secondary mtDNA sequence using the homoplasmy and minor alleles of the heteroplasmy. A biological sample having an assigned secondary mtDNA haplogroup that is different than the assigned primary mtDNA haplogroup is an unreliable sample that is contaminated. The choices for the primary haplogroup
are identical to the choices for the secondary haplogroup.

In some embodiments, the methods further comprise determining the level of contamination of a biological sample. In some embodiments, the level of contamination is indicated by determining the median of the heteroplasmy frequencies of all heteroplasmies in the contaminated sample. The greater the median heteroplasmy frequency, the greater the level of contamination. There is a strong correlation between the real contamination percentage and the median or mean heteroplasmy frequency. For example, if the median heteroplasmy frequency is 6%, the contamination level is also about 6%.

In some embodiments, the methods further comprise processing the mtDNA sequencing reads obtained from the nucleic acid sequencing assay for quality control and adaptor sequence removal prior to identifying a heteroplasmy and a homoplasmy. In such embodiments, the mtDNA sequencing reads obtained from the nucleic acid sequencing assay are mtDNA raw sequencing reads. In carrying out the processing of the mtDNA raw sequencing reads, quality controlled mtDNA sequencing reads are produced. In some embodiments, the processing of the mtDNA sequencing reads obtained from the nucleic acid sequencing assay for quality control and adaptor sequence removal can be carried out by using “Trimmomatic” (Bolger et al., Bioinformatics, 2014, 30, 2114-2120). This processing step improves the accuracy of the subsequent mtDNA variant identification. Another tool that can be used for processing is cutadapt (world wide web at “cutadapt.readthedocs.io/en/stable/”).

In some embodiments, the method further comprises a two-step mapping process prior to identifying a heteroplasmy and a homoplasmy.
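The retention criteria that accompany the two mapping passes, as recited in the Summary (a candidate read is retained when it maps uniquely to the mitochondrial reference genome or has fewer mismatches there than to the human reference genome, and its alignment mismatch count is less than 5), can be sketched in Python as follows. The `ReadAlignment` record, its field names, and the use of `None` to represent a read with no competing placement in the human reference are hypothetical choices for illustration; they are not the output format of bowtie2, bwa, or any other aligner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadAlignment:
    """Mismatch counts for one candidate mtDNA read (hypothetical record)."""
    mito_mismatches: int
    # Mismatches of the best competing placement in the human reference;
    # None is assumed here to mean the read maps uniquely to the
    # mitochondrial reference.
    human_mismatches: Optional[int]


def keep_read(aln: ReadAlignment, max_mismatches: int = 5) -> bool:
    """Apply both retention criteria to a candidate read."""
    mito_specific = (
        aln.human_mismatches is None
        or aln.mito_mismatches < aln.human_mismatches
    )
    few_mismatches = aln.mito_mismatches < max_mismatches
    return mito_specific and few_mismatches


candidates = [
    ReadAlignment(1, None),  # unique to mtDNA, few mismatches
    ReadAlignment(2, 1),     # fits the competing placement better
    ReadAlignment(5, None),  # unique, but mismatch count is not below 5
    ReadAlignment(0, 3),     # fits mtDNA better than the competing placement
]
retained = [aln for aln in candidates if keep_read(aln)]
```

In this sketch only the first and last candidates are retained. A read that matches a nuclear location equally well or better is dropped, which is one plausible way to limit the influence of nuclear sequences of mitochondrial origin on variant calls.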
In some embodiments, the mtDNA sequencing reads obtained from the nucleic acid sequencing assay can be used in the two-step mapping process. In some embodiments, the quality controlled mtDNA sequencing reads obtained from the quality control and adaptor sequence removal process can be used in the two-step mapping process. In these embodiments, the mtDNA sequencing reads (obtained from the nucleic acid sequencing assay) or the quality controlled mtDNA sequencing reads (obtained from the quality control and adaptor sequence removal process) are mapped to a mitochondrial reference genome to produce candidate mtDNA sequencing reads. In some embodiments, the mitochondrial reference genome is the revised Cambridge Reference Sequence (rCRS) for the mitochondrial genome. In some embodiments, the mapping step can be carried out using “bowtie2” (Langmead et al., Nature Methods, 2012, 9, 357-359) or bwa. The candidate mtDNA sequencing reads obtained from the first mapping step are re-mapped to an entire human reference genome. In some embodiments, the human reference genome is GRCh38 for the nuclear genome. GRCh37 can also be used. In some embodiments, the mapping step can be carried out using “bowtie2”.

Upon carrying out the two-step mapping process, the candidate mtDNA sequencing reads are retained when both of the following conditions are met: 1) the candidate mtDNA sequencing read is uniquely mapped to the mitochondrial reference genome, or has fewer mismatches to the mitochondrial reference genome than to the human reference genome; and 2) the alignment mismatch count of the candidate mtDNA sequencing read is less than 5 mismatched bases.

In some embodiments, the methods further comprise processing the mtDNA sequencing reads (obtained from the nucleic acid sequencing assay) for sorting and duplicate removal. In some embodiments, the methods further comprise processing the quality controlled mtDNA sequencing reads (obtained from the quality control and adaptor sequence removal process) for sorting and duplicate removal. In some embodiments, the methods further comprise performing post-mapping processing of the retained candidate mtDNA sequencing reads for sorting and duplicate removal. In some embodiments, the processing for sorting and duplicate removal can be carried out by
using the “samtools toolkit” (Li et al., Bioinformatics, 2009, 25, 2078-2079). These processing steps are standard Next Generation Sequencing (NGS) data processing steps. The GATK toolkit can also be used.

In some embodiments, the methods further comprise down-sampling the mtDNA sequencing reads obtained from the nucleic acid sequencing assay to a desired depth prior to identifying the heteroplasmy and the homoplasmy. In some embodiments, the methods further comprise down-sampling the mtDNA sequencing reads obtained from the nucleic acid sequencing assay to a desired depth prior to processing the mtDNA raw sequencing reads for quality control and adaptor sequence removal. In some embodiments, the mtDNA raw sequencing reads from a whole transcriptome data set can be down-sampled to 10 million reads. In some embodiments, the down-sampling can be carried out by using “seqtk” (world wide web at “github.com/lh3/seqtk”). RNA-seq data usually has very high mtDNA content. Thus, not all of the sequences are required to perform the methodology described herein because the greater the mtDNA coverage, the longer the computational time will be. In some embodiments, the desired depth is about 1000, but can be as low as about 200. Additional tools that can be used include, for example, FASTQ SAMPLE (world wide web at “homes.cs.washington.edu/~dcjones/fastq-tools/fastq-sample.html”).

In some embodiments, the methods described herein further comprise obtaining the plurality of biological samples from the individual prior to performing the nucleic acid sequencing assay on the plurality of samples. In some embodiments, the biological samples are blood, tissue, or tumor biopsy. In some embodiments, the methods described herein further comprise amplifying nucleic acid molecules in the biological samples prior to performing the nucleic acid sequencing assay on the plurality of samples.

The present disclosure also provides methods of identifying an unreliable biological sample, the methods comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain DNA raw sequencing reads for each biological sample;
b) processing the DNA raw sequencing reads for quality control and adaptor sequence removal to produce quality controlled DNA sequencing reads; c) mapping the quality controlled DNA sequencing reads to a mitochondrial reference genome to produce candidate mtDNA sequencing reads; d) re-mapping the candidate mtDNA sequencing reads to a human reference genome and retaining the candidate mtDNA sequencing reads when: i) the candidate mtDNA sequencing read is uniquely mapped to the mitochondrial reference genome or has fewer mismatches to the mitochondrial reference genome than to the human reference genome; and ii) the alignment mismatch count of the candidate mtDNA sequencing read is less than 5; e) performing post-mapping processing of the retained candidate mtDNA sequencing reads for sorting and duplicate removal; f) identifying heteroplasmy and homoplasmy in the retained candidate mtDNA sequencing reads for each of the biological samples; and g) assigning a primary mtDNA haplogroup to each biological sample, wherein any biological sample having an assigned primary mtDNA haplogroup that is different than the primary mtDNA haplogroup assigned to a majority of the biological samples from the same individual is an unreliable biological sample that is a mislabeled biological sample. The steps of this method can be carried out by the processes described herein.

The present disclosure also provides methods of identifying an unreliable biological sample, the methods comprising: a) performing a nucleic acid sequencing assay on each biological sample of a plurality of biological samples obtained from a single individual to obtain DNA raw sequencing reads for each biological sample;
b) processing the DNA raw sequencing reads for quality control and adaptor sequence removal to
produce quality controlled DNA sequencing reads; c) mapping the quality controlled DNA sequencing
reads to a mitochondrial reference genome to produce candidate mtDNA sequencing reads; d)
re‐mapping the candidate mtDNA sequencing reads to a human reference genome and retaining the
candidate mtDNA sequencing reads when: i) the candidate mtDNA sequencing read is uniquely mapped to
the mitochondrial reference genome or has fewer mismatches to the mitochondrial reference genome
than to the human reference genome; and ii) the alignment mismatch count of the candidate mtDNA
sequencing read is less than 5; e) performing post‐mapping processing of the retained candidate
mtDNA sequencing reads for sorting and duplicate removal; f) identifying heteroplasmy and
homoplasmy in the retained candidate mtDNA sequencing reads for each of the biological samples; g)
assigning a
primary mtDNA haplogroup to each biological sample; and h) determining the total heteroplasmy
number for each biological sample, wherein when a biological sample has a high heteroplasmy number,
the biological sample is assigned a secondary mtDNA haplogroup, wherein a biological sample having
an assigned secondary mtDNA haplogroup that is different than the assigned primary mtDNA haplogroup
is an unreliable sample that is contaminated. The steps of this method can be carried out by the
processes described herein.

In some embodiments, the methods described herein can be carried out as a workflow. Numerous
workflow management tools such as, for example, Pyflow (see, world wide web at
“github.com/Illumina/pyflow”), can be used to chain the steps together.

The methods described herein have several advantages.
First, the methods do not require any nuclear DNA (nDNA) variant information, either for each
sample or at the population allele frequency level; this kind of information is often unavailable,
especially for RNA‐seq studies. Second, the methods do not require intensively preprocessed
sequencing data as input, such as whole genome mapped bam files or whole genome variant VCF files.
The methods can directly take fastq files as input. Third, the methods can be applied to
low‐coverage sequencing data. nDNA variant‐based methods
usually need high coverage (>50X) to detect low‐level contamination. Owing to the multiple‐copy
nature of mtDNA, even for low coverage data (for example, 2 to 4X for the 1000 Genomes Project),
mtDNA coverage can still be as high as 1000 to 2000X, which is sufficient to detect contamination
levels as low as 1%. Fourth, the methods do not require high computational power; a typical sample
with 1000X mtDNA coverage can be processed in 10 to 20 minutes with a single processor and 4 GB of
memory. Samples with high mtDNA content can take longer to
process but can be down‐sampled to shorten the processing time. The methods can be easily
incorporated into standard NGS data processing pipelines and serve as an important quality control
step by identifying problematic samples and further improving the accuracy of downstream data
analysis.

In order that the subject matter disclosed herein may be more efficiently understood, examples are
provided below. It should be understood
that these examples are for illustrative purposes only and are not to be construed as limiting the
claimed subject matter in any manner.

Examples

Example 1: mtDNA Variation Identification and Haplogroup Assignment

General Methodology

mtDNA variations (both homoplasmy and heteroplasmy) are identified from next generation sequencing
data by, for example, performing
an analysis represented in Figure 1 (see, Box 1). Upon performing a nucleic acid sequencing assay
on a plurality of biological samples, raw sequencing reads can be down‐sampled to a desired depth
to lower the computational burden (see, Figure 1, Box 1, “Step0”) using “seqtk”, which can be found
at, for example, the world wide web at “github.com/lh3/seqtk”. This step is optional and need not
be performed.

The raw mtDNA sequencing reads obtained from the nucleic acid sequencing assay, optionally from the
previous down‐sampling step, are processed for quality control and adaptor sequence removal by
using “Trimmomatic” (Bolger et al., Bioinformatics, 2014, 30, 2114‐2120) (see, Figure 1, Box 1,
“Step1”).

To retrieve candidate mtDNA sequencing reads, the quality controlled sequencing reads are mapped to
the mitochondrial reference genome (revised Cambridge Reference Sequence, rCRS) with “bowtie2”
(Langmead et al., Nature Methods, 2012, 9, 357‐359) (see, Figure 1, Box 1, “Step2”). Nuclear
mitochondrial
DNA segments (NUMTs) in the nuclear genome may be mismapped to the mitochondrial genome and counted
as mtDNA reads. To minimize the effect of NUMTs, a second‐round mapping can be performed whereby
the mapped reads from the first round are re‐mapped to the entire human reference genome, GRCh38
for the nuclear genome and the revised Cambridge Reference Sequence (rCRS) for the mitochondrial
genome. Reads (or read pairs) are retained if: a) the reads (read pairs) are uniquely mapped to the
mitochondrial genome or have fewer mismatches to the mitochondrial genome than to the nuclear
genome; and b) the alignment mismatch count is less than 5.

The retained candidate mtDNA sequencing reads are further processed with the “samtools” toolkit (Li
et al., Bioinformatics, 2009, 25, 2078‐2079), including sam to bam conversion, sorting, and
duplicate removal (see, Figure 1, Box 1, “Step3”).

The retained candidate mtDNA sequencing reads for each mtDNA site are compiled with the “samtools
mpileup” function (Li et al., Bioinformatics, 2009, 25, 2078‐2079), and bases are further filtered
by sequencing quality (>= 20),
and heteroplasmies and homoplasmies are identified (see, Figure 1, Box 1, “Step4”). Heteroplasmies
are identified with the following criteria: a) sequencing coverage >= 50; b) minor allele frequency
>= 1%; and c) for DNA data, a minor allele must be observed at least twice from each strand, and
for RNA data, a minor allele must be observed at least three times. Homoplasmies are identified
with the following criteria: a) sequencing coverage > 10; and b1) only one allele is observed at
the given site and it is different from the reference allele, or b2) multiple alleles are observed
and the major allele is different from the reference, but the site fails the heteroplasmy criteria.

mtDNA sequences for each sample are constructed with
the homoplasmy information and the major alleles at heteroplasmy sites, and haplogroups are
assigned based on the constructed sequences using “HaploGrep2” (Weissensteiner et al., Nuc. Acids
Res., 2016, 44, W58‐W63) (see, Figure 1, Box 1, “Step5”). The assigned haplogroup at this step is
referred to as the primary haplogroup for each sample. If a particular sample has an unusually high
heteroplasmy number, a secondary mtDNA sequence is constructed with the homoplasmy information and
the minor alleles at the heteroplasmy sites, and a secondary haplogroup is assigned based on this
secondary mtDNA sequence (see, Figure 1, Box 1, “Step6”).
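
By way of non‐limiting illustration only, the site‐level criteria of “Step4” and the construction of the primary and secondary variant sets of “Step5” and “Step6” can be sketched in Python as follows. The sketch assumes that per‐strand allele counts at each site have already been extracted from the pileup after base quality filtering. The names `SiteCounts`, `Call`, `classify_site` and `build_variants` are hypothetical and are not part of any tool named above; the allele counts shown are toy values, not data from the examples; and ties between equally frequent alleles are broken alphabetically, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


@dataclass
class SiteCounts:
    """Quality-filtered allele counts at one mtDNA position (hypothetical container)."""
    pos: int
    ref: str
    fwd: Dict[str, int] = field(default_factory=dict)  # allele -> forward-strand count
    rev: Dict[str, int] = field(default_factory=dict)  # allele -> reverse-strand count


class Call(NamedTuple):
    kind: str               # "heteroplasmy" or "homoplasmy"
    major: str
    minor: Optional[str]
    maf: Optional[float]    # minor allele frequency, heteroplasmy only


def classify_site(site: SiteCounts, is_rna: bool = False) -> Optional[Call]:
    """Apply the Step4 criteria to one site; return None if no variant is called."""
    alleles = set(site.fwd) | set(site.rev)
    totals = {a: site.fwd.get(a, 0) + site.rev.get(a, 0) for a in alleles}
    totals = {a: n for a, n in totals.items() if n > 0}
    coverage = sum(totals.values())
    if coverage == 0:
        return None
    # Rank alleles by count; ties are broken alphabetically (assumption).
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    major = ranked[0][0]
    if len(ranked) > 1:
        minor, minor_n = ranked[1]
        maf = minor_n / coverage
        if is_rna:
            supported = minor_n >= 3  # RNA: at least three observations
        else:
            # DNA: at least two observations on each strand
            supported = site.fwd.get(minor, 0) >= 2 and site.rev.get(minor, 0) >= 2
        if coverage >= 50 and maf >= 0.01 and supported:
            return Call("heteroplasmy", major, minor, maf)
    # Homoplasmy: coverage > 10, major allele differs from the reference,
    # and the site did not qualify as a heteroplasmy.
    if coverage > 10 and major != site.ref:
        return Call("homoplasmy", major, None, None)
    return None


def build_variants(
    sites: Iterable[SiteCounts], is_rna: bool = False
) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Return (primary, secondary) maps of position -> non-reference allele."""
    primary: Dict[int, str] = {}
    secondary: Dict[int, str] = {}
    for site in sites:
        call = classify_site(site, is_rna)
        if call is None:
            continue
        # Primary sequence: homoplasmies plus major alleles at heteroplasmy sites.
        if call.major != site.ref:
            primary[site.pos] = call.major
        # Secondary sequence: homoplasmies plus minor alleles at heteroplasmy sites.
        second = call.minor if call.kind == "heteroplasmy" else call.major
        if second != site.ref:
            secondary[site.pos] = second
    return primary, secondary


# Toy input (illustrative counts only).
toy_sites = [
    SiteCounts(73, "A", {"G": 60}, {"G": 55}),                      # homoplasmy
    SiteCounts(2610, "T", {"T": 493, "C": 7}, {"T": 493, "C": 7}),  # 1.4% minor allele
    SiteCounts(100, "A", {"A": 20, "G": 1}, {"A": 20}),             # below thresholds
]
toy_primary, toy_secondary = build_variants(toy_sites)
```

In such a sketch, the two returned variant sets would each be passed to a haplogroup classifier such as HaploGrep2 to obtain the primary and secondary haplogroups; that invocation is not shown because its exact form depends on the installed version.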

Sample mislabeling/swap detection (see, Figure 1, Box 2)

In a plurality of samples, each sample can be assigned a primary haplogroup as described herein. In
the case where all samples are processed accurately, all samples from the same individual will be
assigned to the same haplogroup. Conversely, if two or more haplogroups are assigned among these
samples, the sample(s) with a minority haplogroup assignment are considered haplogroup unmatched
(i.e., mislabeled or swapped with another sample). For example, in Table 1 below, sample 001
is considered to be swapped with sample 008.

Table 1

Sample contamination detection and quantification (see, Figure 1, Box 3)

If an unusually high heteroplasmy number is observed in a particular sample, the sample is
potentially contaminated. The primary and secondary haplogroups are assigned to the suspected
sample based on the major alleles and minor alleles at the heteroplasmy sites, respectively (see,
Figure 1, Box 1, “Step5” and “Step6”). If the primary and secondary haplogroups are different, the
sample is considered to be a contaminated sample. When a sample is determined to be contaminated,
the median of the frequencies of all the heteroplasmies in the sample is used to represent the
contamination level.

Example 2: Use of mtDNA Haplogroup to Detect a Mislabeled Sample

Samples are collected from several individuals and each individual has multiple samples. After
obtaining RNA‐seq data for a batch
of clinical samples, mtDNA haplogroups are assigned to each sample (see, Table 2). Samples
collected from the same individual should belong to the same mtDNA haplogroup. Unmatched mtDNA
haplogroups suggest possible sample mislabeling. The sample with the haplogroup L3h1a1 should be
considered a mislabeled sample.

Table 2

Example 3: Use of mtDNA Heteroplasmy to Detect Sample Contamination

Virtual Contamination Sample Preparation

Whole genome sequencing fastq files of two individuals, HG00290 and NA19086, were downloaded from
the 1000 Genomes Project (see, ftp site at “ftp.1000genomes.ebi.ac.uk/vol1/ftp/”). Sequencing reads
were sampled from the two individuals, and NA19086 reads were mixed into HG00290 at different
ratios (0.1%, 0.5%, 1%, 2%, 5%, 10%, 20%, 30%, and 40%) to create virtual contamination samples.

Real‐World Datasets

DNA sequencing data were downloaded from the 1000 Genomes
Project (see, ftp site at “ftp.1000genomes.ebi.ac.uk/vol1/ftp/”). For each individual, reads mapped
to the mitochondrial genome were extracted from the bam file with samtools (Li et al.,
Bioinformatics, 2009, 25, 2078‐2079) and subsequently converted to paired‐end fastq files. The
fastq files were used as the input for the methods described herein.

Fastq files of two RNA‐seq studies were downloaded from GEO at GSE81266 and GSE127165. GSE81266
contained whole transcriptome data for 77 ileum and prepouch ileum samples, including 61 paired‐end
(2 x 75 bp) and 16 single‐end (50 bp) samples. GSE127165 contained whole transcriptome data from 57
laryngeal
squamous cell carcinoma patients; each patient had a tumor sample and an adjacent normal sample.
All samples were paired‐end with 150 bp read length.

Analytical Performance

Virtual contamination samples were analyzed. Whole genome sequencing data of two individuals,
HG00290 and NA19086, were downloaded from the 1000 Genomes Project (Auton et al., Nature, 2015,
526, 68‐74). HG00290 belonged to haplogroup U5a2a1a and one heteroplasmy was identified in this
individual’s mtDNA genome (2610T>C 1.4%), while NA19086 belonged to haplogroup D4b1a1 and two
heteroplasmies were identified (1646T>C 2.1%, 12785T>T 21.3%). The two individuals had 45
nucleotide differences in their mitochondrial genomes.

Virtual contamination samples were created by mixing the sequencing reads from the two samples at a
series of ratios, ranging from 0.1% to
40%. HG00290 was treated as the original sample and NA19086 was treated as the contaminant. Each
contamination sample contained 50 million read pairs with a read length of 100 bp. The virtual
contaminated samples were processed by the methods described herein for contamination analysis and
the results are summarized in Figure 3. When the contamination levels were 2% or above, 45‐46
heteroplasmies were identified from the samples, much higher than the normal range (1 to 2
heteroplasmies in an individual). These heteroplasmies covered almost all the expected sites (the
45 segregating sites between the two individuals plus the original heteroplasmy 2610T>C in
HG00290). Only one expected site was missed in the 2% sample. The
primary haplogroups were U5a2a1a for all 6 samples, the same as the original sample HG00290, and
the secondary haplogroup was D4b1a1, the same as the contaminant. When the contamination level was
1%, 29 heteroplasmies were detected. The 17 missing sites were manually checked and it was
determined that these sites all showed some heteroplasmy signal, but because the heteroplasmy
frequency identification cutoff was set at 1%, those sites did not make the cutoff. The secondary
haplogroup of the 1% sample was correctly assigned to D4b1a1. Only 1 and 11 heteroplasmies were
detected when the contamination levels were 0.1% and 0.5%, respectively, and the secondary
haplogroups for these two samples were still U5a2a1a; therefore, contamination could not be
confidently detected in these low contamination level
samples. These results indicate that by combining the heteroplasmy number and secondary haplogroup
assignment, contamination as low as 1% could be detected.

The heteroplasmy frequencies in the artificial contamination samples were further evaluated. There
were some fluctuations of the heteroplasmy frequencies in each sample, but the mean and median of
the frequencies were significantly correlated with the theoretical contamination level (see, Figure
2, and Figure 3; Pearson correlation = 0.996781 and 0.9979935, P value = 6.212e‐09 and 1.189e‐09
for mean and median, respectively). Therefore, when a given sample is detected as contaminated by
the methods described herein, the contamination level can be relatively quantified by the
mean/median of the heteroplasmy frequencies in the sample.

Real‐World Data Application: RNA‐seq Data

There are several factors that can make the identification of low‐frequency (<5%) heteroplasmies
more challenging in RNA‐seq data than
in DNA‐seq data: 1) errors introduced during the reverse transcription step; 2) RNA
editing/modification; and/or 3) uneven coverage across the mtDNA genome due to varied gene
expression levels. Therefore, to reduce false positive heteroplasmies, only heteroplasmies with
frequency > 5% were considered reliable heteroplasmies in RNA data. In addition, three well defined
mtDNA editing sites, 295, 2617 and 13710 (Bar‐Yaacov et al., Genome Res., 2013, 23, 1789‐1796; and
Hodgkinson et al., Science, 2014, 344, 413‐415), were excluded.

The methods described herein were applied to two bulk RNA‐seq datasets to evaluate different
disease or tissue type contexts. First, the methods described herein were applied to a dataset with
77 samples from 25 subjects (Huang et al., Inflamm. Bowel Dis., 2017, 23, 366‐378). Most subjects
in this study had samples from
different tissues (ileum and prepouch ileum) and/or at different biopsy time points (4 months, 8
months, 12 months, etc.). In this dataset, 16 samples were single‐end with 50 bp read length and 61
were paired‐end with 75 bp read length. For each sample, 10 million reads (or read pairs) were
randomly sampled for testing. The primary haplogroup assignments of the samples were first
evaluated. In this dataset, samples from the same subject were all assigned to the same mtDNA
haplogroup (see, Figure 4), indicating that there was no sample swapping. Potential contamination
in those samples was evaluated next. At a 5% heteroplasmy frequency cutoff,
all samples except SRR3493833 had at most 6 heteroplasmies, and their secondary haplogroup
assignments were the same as their primary haplogroups (see, Figure 4). In sample SRR3493833, 29
heteroplasmies were identified, much higher than the normal range, and the median frequency of the
heteroplasmies was 14.8%. The secondary haplogroup of this sample was J1c8a, which was different
from the primary haplogroup U5b2a1a. These results indicated that sample SRR3493833 was potentially
contaminated by another sample from the J1 haplogroup and that the contamination level was about
14.8%.

The methods described herein were also applied to a
dataset involving tumor samples (Wu et al., Molec. Cancer, 2020, 19, 99). This dataset contained
samples from 57 laryngeal squamous cell carcinoma patients, each with a tumor sample and a paired
adjacent normal mucosa sample. In this dataset, the paired tumor sample and adjacent normal sample
from the same patient were all assigned to the same haplogroup (see, Figure 5); no sample swap was
detected. All samples had low heteroplasmy numbers and the same primary and secondary haplogroup
assignments; therefore, there was also no detectable contamination. With this dataset, the methods
described herein were demonstrated to be able to confirm the identities of tumor samples.

Early and accurate sample swapping and contamination
detection is a critical quality control step for large scale NGS data, since it can filter out
suspect samples and improve the quality of subsequent analysis. In these examples, an efficient
method is presented to detect sample swapping and cross‐individual contamination by using mtDNA
variations identified from the NGS data. The methodology can take demultiplexed
fastq files as input without any data preprocessing. It first detects any sample swapping for
individuals with multiple samples. It then detects and quantifies potential contamination and
suggests the source sample of the contaminants. Although whole genome DNA sequencing data from the
1000 Genomes Project and two bulk RNA‐seq datasets were used as working examples, the methods
described herein can be generalized to any NGS dataset containing mtDNA reads, such as whole exome
sequencing data with off‐target mtDNA reads, single cell RNA‐seq data, ATAC‐seq data, etc. The
simulation results described herein show that the methods described herein effectively detected
contamination as low as 1%.

Various modifications of the described subject matter,
in addition to those described herein, will be apparent to those skilled in the art from the
foregoing description. Such modifications are also intended to fall within the scope of the
appended claims. Each reference (including, but not limited to, journal articles, U.S. and non‐U.S.
patents, patent application publications, international patent application publications, gene bank
accession numbers, and the like) cited in the present application is incorporated herein by
reference in its entirety.
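
By way of further non‐limiting illustration, the sample‐level decisions described in Example 1 (Figure 1, Box 2 and Box 3) can be sketched in Python as follows. The function names `find_unmatched` and `assess_contamination` and the parameter `high_het_threshold` are hypothetical. The description does not fix a numerical cutoff for an “unusually high” heteroplasmy number, so the cutoff is left to the caller. The sketch treats the most common primary haplogroup as the majority and reports every sample as unresolved when the top two haplogroups are tied, which is an assumption. The inputs shown are toy values, not data from the examples.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Optional, Sequence, Tuple


def find_unmatched(
    primary_by_sample: Dict[str, str]
) -> Tuple[Optional[str], List[str]]:
    """For samples from ONE individual, return (majority haplogroup, unmatched samples)."""
    counts = Counter(primary_by_sample.values()).most_common()
    if not counts:
        return None, []
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # No single most common haplogroup: every sample is unresolved (assumption).
        return None, sorted(primary_by_sample)
    majority = counts[0][0]
    unmatched = sorted(s for s, h in primary_by_sample.items() if h != majority)
    return majority, unmatched


def assess_contamination(
    primary: str,
    secondary: Optional[str],
    het_freqs: Sequence[float],
    high_het_threshold: int,
) -> Tuple[bool, Optional[float]]:
    """Return (contaminated?, estimated level as the median heteroplasmy frequency)."""
    # A secondary haplogroup is only considered when the heteroplasmy number is high.
    if len(het_freqs) <= high_het_threshold or secondary is None:
        return False, None
    if secondary == primary:
        return False, None
    return True, median(het_freqs)


# Toy inputs (illustrative only).
majority, unmatched = find_unmatched(
    {"s1": "U5a2a1a", "s2": "U5a2a1a", "s3": "L3h1a1"}
)
flag, level = assess_contamination(
    "U5b2a1a", "J1c8a", [0.10, 0.15, 0.20], high_het_threshold=2
)
```

The median is used for the contamination level because that is the statistic named in Example 1; the mean could be substituted, since both were reported above to correlate with the theoretical contamination level.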