SYSTEMS AND METHODS OF DETECTING MERGED DROPLETS IN SINGLE CELL SEQUENCING

Title:

SYSTEMS AND METHODS OF DETECTING MERGED DROPLETS IN SINGLE CELL SEQUENCING

Document Type and Number:

WIPO Patent Application WO/2023/154816

Abstract:

Disclosed herein are methods for detecting one or more droplet mergers in a single cell sequencing workflow, including obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; normalizing the dataset; clustering the barcodes; and determining the one or more droplet mergers by labelling one or more clusters.

Inventors:

SCIAMBI ADAM (US)
PARIKH SAURABH (US)
PARIKH ANUP (US)

Application Number:

PCT/US2023/062313

Publication Date:

August 17, 2023

Filing Date:

February 09, 2023

Assignee:

MISSION BIO INC (US)

International Classes:

C12Q1/6869; C12Q1/6804; C12Q1/6806; C12Q1/6874; A61K9/00; B01F33/302

Foreign References:

US20210246488A1	2021-08-12
US20140303025A1	2014-10-09
US20180216160A1	2018-08-02

Attorney, Agent or Firm:

ZHANG, Clark et al. (US)

Claims:

CLAIMS

1. A method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; clustering the barcodes according to at least the barcode correlation value and the barcode coverage value for the barcodes; and determining the one or more droplet mergers by labelling one or more clusters.

2. The method of claim 1, wherein normalizing the dataset further comprises: removing amplicons with sequence reads less than a first threshold value from the dataset.

3. The method of claim 2, wherein the first threshold value is one read in the plurality of cells.

4. The method of any one of claims 1-3, wherein normalizing the dataset further comprises: removing barcodes with a coverage less than a second threshold value from the dataset.

5. The method of claim 4, wherein the second threshold value is two sequence reads per amplicon per cell.

6. The method of any one of claims 1-5, wherein normalizing the dataset further comprises: generating a plurality of barcode amplicon profiles based on the dataset.

7. The method of any one of claims 1-6, wherein normalizing the dataset further comprises: generating an average amplicon profile by calculating mean sequence reads per amplicon across the one or more barcodes.

8. The method of any one of claims 1-7, wherein normalizing the dataset further comprises: generating the barcode correlation value by performing a linear regression for the plurality of barcode amplicon profiles against the average amplicon profile.

9. The method of claim 8, wherein the barcode correlation value is the coefficient of determination of the linear regression.

10. The method of claim 1, wherein the barcode coverage value is a log base 10 of the mean sequence reads per barcode.

11. The method of any one of claims 1-10, wherein labelling one or more clusters further comprises: operating a DBSCAN method using a plurality of parameter values; selecting one or more parameter values, wherein the dataset is classified into two clusters by applying the selected parameter values; and for each of the selected parameter values, determining whether a criterion is met, wherein the criterion comprises a first cluster having a higher coverage value and a lower correlation value than a second cluster.

12. The method of claim 11, further comprising: if the criterion is not met for all selected parameter values, marking the method as failed.

13. The method of claim 11, further comprising: if the criterion is met for only one selected parameter value, marking the first cluster as a singlet cluster and the second cluster as a merger cluster.

14. The method of claim 11, further comprising: if the criterion is met for two or more selected parameter values, further selecting a cluster with fewest unassigned barcodes, and marking the selected cluster as the singlet cluster and the other cluster associated with the selected cluster as a merger cluster.

15. The method of any one of claims 1-14, wherein a droplet merger comprises one or more or two or more barcodes in a merged droplet, wherein a merged droplet represents a coalescence of two or more droplets, wherein the barcodes in respective droplets are different.

16. The method of any one of claims 1-15, wherein the amplicon profile of the barcode is any one of mean sequence reads per amplicon comprising the barcode or median sequence reads per amplicon comprising the barcode.

17. The method of any one of claims 1-16, wherein the average amplicon profile is a mean sequence reads per amplicon comprising one of the barcodes or median sequence reads per amplicon comprising one of the barcodes.

18. A method for detecting one or more droplet mergers in a single cell sequencing, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; normalizing the dataset to generate dimensionally reduced counts of the barcodes; clustering the barcodes according to the dimensionally reduced counts to generate clusters that satisfy one or more criteria; determining one or more droplet mergers by labelling one or more clusters.

19. The method of claim 18, wherein normalizing the dataset further comprises: removing barcodes with sequence reads less than a threshold value in a fraction of amplicons.

20. The method of claim 19, wherein normalizing the dataset further comprises: for each amplicon, determining a median of the normalized sequence reads by dividing read counts for each barcode by mean sequence reads of the amplicon.

21. The method of claim 20, wherein normalizing the dataset further comprises: for each of one or more barcodes, generating the dimensionally reduced counts by dividing the counts for each amplicon by the median of the normalized sequence reads for the barcode.

22. The method of claim 19, wherein the threshold value is 3 sequence reads.

23. The method of claim 19, wherein the fraction of amplicons is about 20%.

24. The method of claim 18, wherein labelling one or more clusters further comprises: creating a 2D grid comprising a plurality of spaced points on the visual graph; at one of the spaced points, generating a plurality of lines on the visual graph, wherein each of the plurality of lines comprises a slope between -90° to 90° relative to a horizontal line; selecting, from the plurality of lines, lines that split the dimensionally reduced counts of the one or more barcodes into two clusters; and for each selected line, labelling a first cluster comprising a higher statistical read count as a singlet cluster and a second cluster comprising a lower statistical read count as a merger cluster.

25. The method of claim 24, further comprising: further selecting lines by removing, from selected lines, lines that are close to more than a threshold of cells.

26. The method of claim 25, further comprising: if no lines are further selected, marking the method as failed.

27. The method of claim 26, further comprising: if one or more lines are further selected, identifying, from the further selected lines, a line having the maximum difference in the median counts between the two clusters.

28. A method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; and operating one of a first method and a second method, wherein the first method comprises: normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; clustering the barcodes according to the barcode correlation value and the barcode coverage value for the barcodes; and determining the one or more droplet mergers by labelling one or more clusters, and wherein the second method comprises: normalizing the dataset to generate dimensionally reduced counts of the barcodes; clustering the barcodes according to the dimensionally reduced counts to generate clusters that satisfy one or more criteria; determining one or more droplet mergers by labelling one or more clusters.

29. A method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; and normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; normalizing the dataset to generate normalized counts of the barcodes; dimensionally reducing a combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value for the barcodes; clustering the barcodes using the dimensionally reduced combination; and determining one or more droplet mergers by labelling one or more clusters.

30. The method of claim 29, further comprising: validating the labeled one or more clusters.

31. The method of claim 30, wherein validating the labeled one or more clusters comprises: generating a quality score for at least one of the one or more clusters, the quality score representing an efficiency measure of identification of mergers and non-mergers.

32. The method of claim 31, wherein generating the quality score comprises generating two or more of: a silhouette score representing a measure of separation between clusters; a cluster score representing a percentage of barcodes categorized as non-outliers; and a cell score representing a measure of a position of a cell cluster in comparison to a position of a merger cluster.

33. The method of claim 32, wherein the cell score is generated by: performing a linear fit on a merger cluster; and determining a percentage of cells below the linear fit as the cell score.

34. The method of claim 32 or 33, wherein generating the quality score comprises generating a product of each of the silhouette score, the cluster score, and the cell score.

35. The method of any one of claims 31-34, further comprising: selecting a cluster with a highest quality score; determining whether the cluster with the highest quality score includes at least a threshold number of barcodes; and responsive to the determination that the cluster with the highest quality score includes at least the threshold number of barcodes, completing the validation of the labeled one or more clusters.

36. The method of any one of claims 31-35, further comprising: recovering one or more barcodes not assigned to a cluster as one or more cells.

37. The method of claim 36, wherein recovering one or more barcodes not assigned to a cluster comprises: determining a first distance between a barcode not assigned to a cluster and another barcode assigned to a cell cluster; determining a second distance between the barcode not assigned to a cluster and another barcode assigned to a merger cluster; comparing the first distance and the second distance to determine whether to recover the barcode.

38. The method of claim 37, wherein comparing the first distance and the second distance to determine whether to recover the barcode further comprises: determining that the second distance is greater than the first distance; recovering the barcode in response to the determination that the second distance is greater than the first distance.

39. The method of claim 37, wherein comparing the first distance and the second distance to determine whether to recover the barcode further comprises: determining that the second distance is at least twice as large as the first distance; recovering the barcode in response to the determination that the second distance is at least twice as large as the first distance.

40. The method of any one of claims 29-39, wherein normalizing the dataset to generate a barcode correlation value and a barcode coverage value comprises: removing amplicons with sequence reads less than a first threshold value from the dataset.

41. The method of claim 40, wherein the first threshold value is one read in the plurality of cells.

42. The method of any one of claims 40-41, wherein normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: removing barcodes with a coverage less than a second threshold value from the dataset.

43. The method of claim 42, wherein the second threshold value is two sequence reads per amplicon per cell.

44. The method of any one of claims 40-43, wherein normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating a plurality of barcode amplicon profiles based on the dataset.

45. The method of any one of claims 40-44, wherein normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating an average amplicon profile by calculating mean sequence reads per amplicon across the one or more barcodes.

46. The method of any one of claims 40-45, wherein normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating the barcode correlation value by performing a linear regression for the plurality of barcode amplicon profiles against the average amplicon profile.

47. The method of claim 46, wherein the barcode correlation value is the coefficient of determination of the linear regression.

48. The method of any one of claims 29-47, wherein the barcode coverage value is a log base 10 of the mean sequence reads per barcode.

49. The method of claim 48, wherein normalizing the dataset to generate normalized counts of the barcodes further comprises: removing barcodes with sequence reads less than a threshold value in a fraction of amplicons.

50. The method of claim 49, wherein normalizing the dataset to generate normalized counts of the barcodes further comprises: for each amplicon, determining a median of the normalized sequence reads by dividing read counts for each barcode by mean sequence reads of the amplicon.

51. The method of claim 50, wherein normalizing the dataset to generate normalized counts of the barcodes further comprises: for each of one or more barcodes, generating the dimensionally reduced counts by dividing the counts for each amplicon by the median of the normalized sequence reads for the barcode.

52. The method of claim 49, wherein the threshold value is 3 sequence reads.

53. The method of claim 49, wherein the fraction of amplicons is about 20%.

54. The method of any one of claims 29-53, wherein the combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value comprises a concatenated matrix comprising two or more of the normalized counts, the barcode correlation value, and the barcode coverage value.

55. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform a method of any one of claims 1-54.

56. A system comprising: a processor; and a non-transitory computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform a method of any one of claims 1-54.

Description:

SYSTEMS AND METHODS OF DETECTING MERGED DROPLETS IN SINGLE CELL SEQUENCING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application No. 63/308,150 filed February 9, 2022, the entire disclosure of which is hereby incorporated by reference in its entirety for all purposes.

BACKGROUND

[0002] Single cell sequencing technologies have enabled and increased the throughput of single-cell transcriptomics studies. However, in current droplet-based single cell sequencing protocols, errors often occur when a droplet containing a plurality of cells (e.g., a merged drop, or a doublet) is mistaken for a single cell, making it challenging to distinguish data from single cells when generating high throughput libraries for single-cell analysis. Therefore, it is valuable to develop a method that accurately detects merged droplets, increasing the efficiency and performance of current single-cell technologies.

SUMMARY

[0003] Disclosed herein are methods, systems, and apparatuses for detecting one or more merged droplets in a single cell sequencing workflow. Generally, the methods, systems, and apparatuses disclosed herein enable identifying or detecting droplet mergers, and thus differentiating merged droplets from individual intact droplets, by performing normalization and/or clustering on raw sequence reads of barcodes. Methods disclosed herein can identify and differentiate intact non-merged droplets from merged droplets in a single cell sequencing workflow (e.g., for analyzing cellular analytes, examples of which include DNA, RNA, or protein). In various embodiments, the single cell sequencing workflow refers to a single cell DNA sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell RNA sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell protein sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell DNA + protein sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell DNA + RNA sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell RNA + protein sequencing workflow. In various embodiments, the single cell sequencing workflow refers to a single cell DNA + RNA + protein sequencing workflow. In various embodiments, the methods, systems, and apparatuses disclosed herein are automated.

[0004] Advantageously, the systems and methods in the disclosed embodiments as described herein include the following benefits in correcting mislabeling of mergers as singlet-like barcodes: 1) identifying mergers as an artefact, which may have been previously incorrectly considered to be singlet-like barcodes (e.g., barcodes in an intact drop); 2) identifying singlet-like barcodes with lower read counts, which corrects mislabeling of low read depth singlet-like barcodes as background barcodes. Accordingly, the systems and methods in the disclosed embodiments can improve the quality of the cells, reduce the number of false discoveries, and increase the confidence in the observed data. Furthermore, the systems and methods in the disclosed embodiments can identify or detect more singlets, which might be useful in applications which require high cell throughputs such as minimal residual disease detection.

[0005] Disclosed herein is a method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; clustering the barcodes according to at least the barcode correlation value and the barcode coverage value for the barcodes; and determining the one or more droplet mergers by labelling one or more clusters.

[0006] In various embodiments, normalizing the dataset further comprises: removing amplicons with sequence reads less than a first threshold value from the dataset.

[0007] In various embodiments, the first threshold value is one read in the plurality of cells.

[0008] In various embodiments, normalizing the dataset further comprises: removing barcodes with a coverage less than a second threshold value from the dataset.

[0009] In various embodiments, the second threshold value is two sequence reads per amplicon per cell.

[0010] In various embodiments, normalizing the dataset further comprises: generating a plurality of barcode amplicon profiles based on the dataset.

[0011] In various embodiments, normalizing the dataset further comprises: generating an average amplicon profile by calculating mean sequence reads per amplicon across the one or more barcodes.

[0012] In various embodiments, normalizing the dataset further comprises: generating the barcode correlation value by performing a linear regression for the plurality of barcode amplicon profiles against the average amplicon profile.

[0013] In various embodiments, the barcode correlation value is the coefficient of determination of the linear regression.

[0014] In various embodiments, the barcode coverage value is a log base 10 of the mean sequence reads per barcode.
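The normalization steps of paragraphs [0006]-[0014] can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the function names, the input layout (a dictionary mapping each barcode to a list of per-amplicon read counts), and the reading of the thresholds (total reads per amplicon, mean reads per amplicon for a barcode) are hypothetical. The coefficient of determination of a simple linear regression equals the squared Pearson correlation, which is what `r_squared` computes.

```python
import math
from statistics import mean


def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return sxy * sxy / (sxx * syy)


def barcode_metrics(counts, min_amplicon_reads=1, min_barcode_coverage=2):
    """Return {barcode: (correlation value, coverage value)}.

    counts: {barcode: [reads for amplicon 0, reads for amplicon 1, ...]}
    Threshold interpretations are assumptions made for this sketch.
    """
    if not counts:
        return {}
    n_amplicons = len(next(iter(counts.values())))
    # Remove amplicons whose total reads across all barcodes fall below the first threshold.
    kept = [a for a in range(n_amplicons)
            if sum(row[a] for row in counts.values()) >= min_amplicon_reads]
    if not kept:
        return {}
    # Remove barcodes whose mean reads per amplicon fall below the second threshold.
    profiles = {}
    for barcode, row in counts.items():
        profile = [row[a] for a in kept]
        if mean(profile) >= min_barcode_coverage:
            profiles[barcode] = profile
    if not profiles:
        return {}
    # Average amplicon profile: mean reads per amplicon across the remaining barcodes.
    average = [mean(p[i] for p in profiles.values()) for i in range(len(kept))]
    return {barcode: (r_squared(average, p), math.log10(mean(p)))
            for barcode, p in profiles.items()}


example = {
    "A": [10, 20, 30, 0],
    "B": [20, 40, 60, 0],
    "C": [1, 0, 1, 0],
    "D": [30, 10, 5, 0],
}
metrics = barcode_metrics(example)
```

In this toy input the fourth amplicon has no reads and is dropped, barcode "C" is dropped for low coverage, and barcodes "A" and "B" receive the same correlation value because their profiles differ only by a scale factor.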

[0015] In various embodiments, labelling one or more clusters further comprises: operating a DBSCAN method using a plurality of parameter values; selecting one or more parameter values, wherein the dataset is classified into two clusters by applying the selected parameter values; and for each of the selected parameter values, determining whether a criterion is met, wherein the criterion comprises a first cluster having a higher coverage value and a lower correlation value than a second cluster.

[0016] In various embodiments, the method further comprises: if the criterion is not met for all selected parameter values, marking the method as failed.

[0017] In various embodiments, the method further comprises: if the criterion is met for only one selected parameter value, marking the first cluster as a singlet cluster and the second cluster as a merger cluster.

[0018] In various embodiments, the method further comprises: if the criterion is met for two or more selected parameter values, further selecting a cluster with fewest unassigned barcodes, and marking the selected cluster as the singlet cluster and the other cluster associated with the selected cluster as a merger cluster.
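The selection logic of paragraphs [0015]-[0018] can be sketched as follows. The DBSCAN runs themselves are not shown; the sketch assumes each run has already produced one label per barcode, with -1 marking an unassigned barcode (the convention used by common DBSCAN implementations). The function name, the input layout, the use of cluster means to compare coverage and correlation, and the choice of the run with the fewest unassigned barcodes are assumptions made for illustration.

```python
from statistics import mean


def label_clusters(points, runs):
    """Pick one DBSCAN run and label its two clusters.

    points: list of (correlation value, coverage value), one per barcode
    runs:   {parameter value: [cluster label per barcode, -1 = unassigned]}
    Returns None when the criterion is met for no run (method marked as failed).
    """
    passing = []
    for param, labels in runs.items():
        cluster_ids = sorted(set(labels) - {-1})
        if len(cluster_ids) != 2:
            continue  # only parameter values that yield exactly two clusters are selected
        stats = {}
        for cid in cluster_ids:
            members = [p for p, lab in zip(points, labels) if lab == cid]
            stats[cid] = (mean(m[0] for m in members), mean(m[1] for m in members))
        a, b = cluster_ids
        for first, second in ((a, b), (b, a)):
            higher_coverage = stats[first][1] > stats[second][1]
            lower_correlation = stats[first][0] < stats[second][0]
            if higher_coverage and lower_correlation:
                passing.append((labels.count(-1), param, first, second))
    if not passing:
        return None
    # With several passing runs, keep the one leaving the fewest barcodes unassigned.
    _, param, first, second = min(passing, key=lambda item: item[0])
    # Cluster naming follows the wording of paragraphs [0017]-[0018].
    return {"param": param, "singlet_cluster": first, "merger_cluster": second}


points = [(0.90, 1.0), (0.92, 1.1), (0.50, 2.0), (0.55, 2.1), (0.70, 1.5)]
runs = {
    0.10: [0, 0, 1, 1, -1],
    0.20: [0, 0, 0, 0, 0],
    0.05: [0, -1, 1, -1, -1],
}
result = label_clusters(points, runs)
```

Here the run with parameter 0.20 is skipped because it yields a single cluster, and 0.10 is preferred over 0.05 because it leaves one barcode unassigned instead of three.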

[0019] In various embodiments, a droplet merger comprises one or more or two or more barcodes in a merged droplet, wherein a merged droplet represents a coalescence of two or more droplets, wherein the barcodes in respective droplets are different.

[0020] In various embodiments, the amplicon profile of the barcode is any one of mean sequence reads per amplicon comprising the barcode or median sequence reads per amplicon comprising the barcode.

[0021] In various embodiments, the average amplicon profile is a mean sequence reads per amplicon comprising one of the barcodes or median sequence reads per amplicon comprising one of the barcodes.

[0022] Additionally disclosed herein is a method for detecting one or more droplet mergers in a single cell sequencing, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; normalizing the dataset to generate dimensionally reduced counts of the barcodes; clustering the barcodes according to the dimensionally reduced counts to generate clusters that satisfy one or more criteria; determining one or more droplet mergers by labelling one or more clusters.

[0023] In various embodiments, normalizing the dataset further comprises: removing barcodes with sequence reads less than a threshold value in a fraction of amplicons.

[0024] In various embodiments, normalizing the dataset further comprises: for each amplicon, determining a median of the normalized sequence reads by dividing read counts for each barcode by mean sequence reads of the amplicon.

[0025] In various embodiments, normalizing the dataset further comprises: for each of one or more barcodes, generating the dimensionally reduced counts by dividing the counts for each amplicon by the median of the normalized sequence reads for the barcode.

[0026] In various embodiments, the threshold value is 3 sequence reads.

[0027] In various embodiments, the fraction of amplicons is about 20%.
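One possible reading of the normalization in paragraphs [0023]-[0027] is sketched below in Python. The wording of those paragraphs leaves several details open, so the following choices are assumptions: a barcode is removed when more than the stated fraction of its amplicons has fewer than the threshold number of reads; each count is first divided by the mean reads of its amplicon; and the amplicon-scaled values of a barcode are then divided by that barcode's median scaled value. The function name and input layout are hypothetical.

```python
from statistics import mean, median


def normalize_counts(counts, min_reads=3, max_low_fraction=0.2):
    """Return {barcode: normalized values per amplicon}.

    counts: {barcode: [reads for amplicon 0, reads for amplicon 1, ...]}
    """
    # Remove barcodes with too many low-count amplicons (assumed reading of the filter).
    kept = {barcode: row for barcode, row in counts.items()
            if sum(c < min_reads for c in row) / len(row) <= max_low_fraction}
    if not kept:
        return {}
    n_amplicons = len(next(iter(kept.values())))
    # Mean reads of each amplicon across the remaining barcodes.
    amplicon_mean = [mean(row[a] for row in kept.values()) for a in range(n_amplicons)]
    normalized = {}
    for barcode, row in kept.items():
        # Scale each count by its amplicon mean to remove amplicon-specific efficiency.
        scaled = [c / m if m else 0.0 for c, m in zip(row, amplicon_mean)]
        # Divide by the barcode's median scaled value to remove barcode read depth.
        barcode_median = median(scaled)
        normalized[barcode] = [s / barcode_median if barcode_median else 0.0
                               for s in scaled]
    return normalized


example = {
    "A": [10, 20, 30, 40, 50],
    "B": [20, 40, 60, 80, 100],
    "C": [0, 1, 2, 50, 60],
}
normalized = normalize_counts(example)
```

Barcode "C" is removed because three of its five amplicons have fewer than 3 reads. Barcodes "A" and "B", which differ only in read depth, end up with identical normalized profiles.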

[0028] In various embodiments, labelling one or more clusters further comprises: creating a 2D grid comprising a plurality of spaced points on the visual graph; at one of the spaced points, generating a plurality of lines on the visual graph, wherein each of the plurality of lines comprises a slope between -90° to 90° relative to a horizontal line; selecting, from the plurality of lines, lines that split the dimensionally reduced counts of the one or more barcodes into two clusters; and for each selected line, labelling a first cluster comprising a higher statistical read count as a singlet cluster and a second cluster comprising a lower statistical read count as a merger cluster.

[0029] In various embodiments, the method further comprises: further selecting lines by removing, from selected lines, lines that are close to more than a threshold of cells.

[0030] In various embodiments, the method further comprises: if no lines are further selected, marking the method as failed.

[0031] In various embodiments, the method further comprises: if one or more lines are further selected, identifying, from the further selected lines, a line having the maximum difference in the median counts between the two clusters.
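The line search of paragraphs [0028]-[0031] can be sketched as a brute-force scan. This is a simplified illustration, not the disclosed implementation: the function name, the grid resolution, the angle step, the distance `margin` that defines "close", and the fraction used as the closeness threshold are all hypothetical parameters. Each barcode is assumed to have a 2D position and one read-count statistic.

```python
import math
from statistics import median


def best_split_line(points, values, grid_steps=5, angle_step=15,
                    margin=0.05, max_close_fraction=0.1):
    """Scan lines through grid points and return the best split, or None on failure.

    points: list of (x, y) positions, one per barcode
    values: read-count statistic per barcode, in the same order
    """
    xs, ys = zip(*points)

    def axis(lo, hi):
        # Evenly spaced grid coordinates between lo and hi.
        return [lo + (hi - lo) * i / (grid_steps - 1) for i in range(grid_steps)]

    best = None
    for x0 in axis(min(xs), max(xs)):
        for y0 in axis(min(ys), max(ys)):
            for degrees in range(-90, 90, angle_step):
                theta = math.radians(degrees)
                # Unit normal of the line through (x0, y0) at the given angle.
                nx, ny = -math.sin(theta), math.cos(theta)
                signed = [(x - x0) * nx + (y - y0) * ny for x, y in points]
                # Discard lines that run close to too many barcodes.
                close = sum(abs(d) < margin for d in signed)
                if close > max_close_fraction * len(points):
                    continue
                above = [v for v, d in zip(values, signed) if d > 0]
                below = [v for v, d in zip(values, signed) if d <= 0]
                if not above or not below:
                    continue  # the line does not split the barcodes into two clusters
                gap = abs(median(above) - median(below))
                if best is None or gap > best["gap"]:
                    high_is_above = median(above) > median(below)
                    # Per paragraph [0028], the higher-count side is labelled singlet.
                    best = {"gap": gap, "line": (x0, y0, degrees),
                            "singlet": [(d > 0) == high_is_above for d in signed]}
    return best


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
values = [100, 100, 100, 100, 10, 10, 10, 10]
split = best_split_line(points, values)
```

Several lines can tie for the maximum median difference; this sketch keeps the first one found, so a production version would need an explicit tie-breaking rule.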

[0032] Additionally disclosed herein is a method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; and operating one of a first method and a second method, wherein the first method comprises: normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; clustering the barcodes according to the barcode correlation value and the barcode coverage value for the barcodes; and determining the one or more droplet mergers by labelling one or more clusters, and wherein the second method comprises: normalizing the dataset to generate dimensionally reduced counts of the barcodes; clustering the barcodes according to the dimensionally reduced counts to generate clusters that satisfy one or more criteria; determining one or more droplet mergers by labelling one or more clusters.

[0033] Additionally disclosed herein is a method for detecting one or more droplet mergers in a single cell sequencing workflow, the method comprising: obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells; and normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode, wherein the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile, and wherein the barcode coverage value represents a number of mean sequence reads per barcode; normalizing the dataset to generate normalized counts of the barcodes; dimensionally reducing a combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value for the barcodes; clustering the barcodes using the dimensionally reduced combination; and determining one or more droplet mergers by labelling one or more clusters.

[0034] In various embodiments, the method further comprises: validating the labeled one or more clusters.

[0035] In various embodiments, validating the labeled one or more clusters comprises: generating a quality score for at least one of the one or more clusters, the quality score representing an efficiency measure of identification of mergers and non-mergers.

[0036] In various embodiments, generating the quality score comprises generating two or more of: a silhouette score representing a measure of separation between clusters; a cluster score representing a percentage of barcodes categorized as non-outliers; and a cell score representing a measure of a position of a cell cluster in comparison to a position of a merger cluster.

[0037] In various embodiments, the cell score is generated by: performing a linear fit on a merger cluster; and determining a percentage of cells below the linear fit as the cell score.

[0038] In various embodiments, generating the quality score comprises generating a product of each of the silhouette score, the cluster score, and the cell score.
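The quality score of paragraphs [0036]-[0038] can be sketched in Python as a product of three components. The sketch makes several assumptions: all three scores are expressed as fractions rather than percentages, the silhouette score is the mean silhouette coefficient over assigned barcodes using Euclidean distance, unassigned barcodes carry the label -1, and the linear fit is an ordinary least-squares fit. All function names are hypothetical.

```python
import math
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient over assigned points (label != -1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(mean(math.dist(p, q) for q in other)
                    for key, other in clusters.items() if key != lab)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def cluster_score(labels):
    """Fraction of barcodes assigned to a cluster (non-outliers)."""
    return sum(lab != -1 for lab in labels) / len(labels)


def cell_score(cells, mergers):
    """Fraction of cell-cluster points lying below a least-squares line fitted to the mergers."""
    mx = mean(x for x, _ in mergers)
    my = mean(y for _, y in mergers)
    sxx = sum((x - mx) ** 2 for x, _ in mergers)
    sxy = sum((x - mx) * (y - my) for x, y in mergers)
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum(y < slope * x + intercept for x, y in cells) / len(cells)


def quality_score(points, labels, cell_id, merger_id):
    """Product of the silhouette score, the cluster score, and the cell score."""
    cells = [p for p, lab in zip(points, labels) if lab == cell_id]
    mergers = [p for p, lab in zip(points, labels) if lab == merger_id]
    return silhouette(points, labels) * cluster_score(labels) * cell_score(cells, mergers)


points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (0, 10), (1, 10.5), (2, 11), (3, 11.5),
          (50, 50)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, -1]
score = quality_score(points, labels, cell_id=0, merger_id=1)
```

Because the score is a product, a poor value in any one component pulls the overall score down, which matches the use of the score for choosing between candidate clusterings.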

[0039] In various embodiments, the method further comprises: selecting a cluster with a highest quality score; determining whether the cluster with the highest quality score includes at least a threshold number of barcodes; and responsive to the determination that the cluster with the highest quality score includes at least the threshold number of barcodes, completing the validation of the labeled one or more clusters.

[0040] In various embodiments, the method further comprises: recovering one or more barcodes not assigned to a cluster as one or more cells.

[0041] In various embodiments, recovering one or more barcodes not assigned to a cluster comprises: determining a first distance between a barcode not assigned to a cluster and another barcode assigned to a cell cluster; determining a second distance between the barcode not assigned to a cluster and another barcode assigned to a merger cluster; comparing the first distance and the second distance to determine whether to recover the barcode.

[0042] In various embodiments, comparing the first distance and the second distance to determine whether to recover the barcode further comprises: determining that the second distance is greater than the first distance; recovering the barcode in response to the determination that the second distance is greater than the first distance.
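The recovery rule of paragraphs [0040]-[0042] can be sketched as a nearest-neighbour comparison. The sketch assumes Euclidean distance to the nearest barcode in each cluster and the label -1 for unassigned barcodes. The function name and the `min_ratio` and `strict` parameters are hypothetical; they generalise the comparison so that a stricter ratio, such as the factor of two recited in claim 39, can be expressed with `min_ratio=2, strict=False`.

```python
import math


def recover_unassigned(points, labels, cell_id, merger_id, min_ratio=1.0, strict=True):
    """Return indices of unassigned barcodes (label -1) recovered as cells.

    A barcode is recovered when its distance to the nearest merger barcode
    exceeds (strict) or reaches (non-strict) min_ratio times its distance
    to the nearest cell barcode.
    """
    cells = [p for p, lab in zip(points, labels) if lab == cell_id]
    mergers = [p for p, lab in zip(points, labels) if lab == merger_id]
    if not cells or not mergers:
        return []
    recovered = []
    for index, (p, lab) in enumerate(zip(points, labels)):
        if lab != -1:
            continue
        first = min(math.dist(p, q) for q in cells)     # distance to the cell cluster
        second = min(math.dist(p, q) for q in mergers)  # distance to the merger cluster
        if strict:
            keep = second > min_ratio * first
        else:
            keep = second >= min_ratio * first
        if keep:
            recovered.append(index)
    return recovered


points = [(0, 0), (1, 0), (10, 0), (11, 0), (2, 0), (7, 0), (4, 0)]
labels = [0, 0, 1, 1, -1, -1, -1]
recovered = recover_unassigned(points, labels, cell_id=0, merger_id=1)
```

In this toy input the barcodes at index 4 and index 6 are nearer to the cell cluster and are recovered, while the barcode at index 5 is nearer to the merger cluster and stays unassigned.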

[0043] In various embodiments, comparing the first distance and the second distance to determine whether to recover the barcode further comprises: determining that the second distance is at least twice as large as the first distance; recovering the barcode in response to the determination that the second distance is at least twice as large as the first distance.

[0044] In various embodiments, normalizing the dataset to generate a barcode correlation value and a barcode coverage value comprises: removing amplicons with sequence reads less than a first threshold value from the dataset.

[0045] In various embodiments, the first threshold value is one read in the plurality of cells.

[0046] In various embodiments, normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: removing barcodes with a coverage less than a second threshold value from the dataset.

[0047] In various embodiments, the second threshold value is two sequence reads per amplicon per cell.
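
By way of non-limiting illustration, the two filtering steps may be sketched in Python as follows. The sketch assumes a read-count matrix with one row per barcode and one column per amplicon, interprets the first threshold as the total reads of an amplicon across all barcodes, and interprets barcode coverage as the mean reads per amplicon of a barcode; these interpretations, and the name `filter_counts`, are assumptions made for illustration.

```python
import numpy as np


def filter_counts(counts, min_amplicon_reads=1, min_barcode_coverage=2.0):
    """Drop low-read amplicons (columns), then low-coverage barcodes (rows).

    Assumptions: `counts` has shape (barcodes, amplicons); an amplicon is kept
    when its total reads across barcodes reach `min_amplicon_reads`; a barcode
    is kept when its mean reads per remaining amplicon reach
    `min_barcode_coverage`. At least one amplicon is expected to survive.
    """
    counts = np.asarray(counts, dtype=float)
    keep_amplicons = counts.sum(axis=0) >= min_amplicon_reads
    counts = counts[:, keep_amplicons]
    keep_barcodes = counts.mean(axis=1) >= min_barcode_coverage
    return counts[keep_barcodes], keep_barcodes, keep_amplicons


# Illustrative values only: 3 barcodes x 3 amplicons.
filtered, kept_barcodes, kept_amplicons = filter_counts(
    [[10, 0, 8], [1, 0, 0], [5, 0, 7]]
)
print(filtered)
```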

[0048] In various embodiments, normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating a plurality of barcode amplicon profiles based on the dataset.

[0049] In various embodiments, normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating an average amplicon profile by calculating mean sequence reads per amplicon across the one or more barcodes.

[0050] In various embodiments, normalizing the dataset to generate a barcode correlation value and a barcode coverage value further comprises: generating the barcode correlation value by performing a linear regression for the plurality of barcode amplicon profiles against the average amplicon profile.

[0051] In various embodiments, the barcode correlation value is the coefficient of determination of the linear regression.

[0052] In various embodiments, the barcode coverage value is a log base 10 of the mean sequence reads per barcode.
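
By way of non-limiting illustration, the barcode correlation value and the barcode coverage value may be computed as sketched below in Python. The sketch assumes a filtered read-count matrix with one row per barcode and one column per amplicon, fits each barcode amplicon profile against the average amplicon profile by ordinary least squares, and takes "mean sequence reads per barcode" to be the mean reads per amplicon of that barcode. The name `correlation_and_coverage` and the handling of a flat profile are assumptions.

```python
import numpy as np


def correlation_and_coverage(counts):
    """Return (R^2 per barcode, log10 mean reads per barcode).

    Assumption: `counts` has shape (barcodes, amplicons) and every barcode has
    at least one read, so the logarithm is defined.
    """
    counts = np.asarray(counts, dtype=float)
    # Average amplicon profile: mean reads per amplicon across barcodes.
    average_profile = counts.mean(axis=0)
    r2 = np.empty(counts.shape[0])
    for i, profile in enumerate(counts):
        # Linear regression of the barcode profile against the average profile.
        slope, intercept = np.polyfit(average_profile, profile, 1)
        predicted = slope * average_profile + intercept
        ss_res = np.sum((profile - predicted) ** 2)
        ss_tot = np.sum((profile - profile.mean()) ** 2)
        # A perfectly flat profile has no variance to explain; report 0 here.
        r2[i] = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    coverage = np.log10(counts.mean(axis=1))
    return r2, coverage


# Illustrative values only: two barcodes with proportional amplicon profiles.
r2, coverage = correlation_and_coverage([[10, 20, 30], [20, 40, 60]])
print(r2, coverage)
```

The resulting pair of values per barcode can then serve as the two coordinates on which the barcodes are clustered.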

[0053] In various embodiments, normalizing the dataset to generate normalized counts of the barcodes further comprises: removing barcodes with sequence reads less than a threshold value in a fraction of amplicons.

[0054] In various embodiments, normalizing the dataset to generate normalized counts of the barcodes further comprises: for each amplicon, determining a median of the normalized sequence reads by dividing read counts for each barcode by mean sequence reads of the amplicon.

[0055] In various embodiments, normalizing the dataset to generate normalized counts of the barcodes further comprises: for each of one or more barcodes, generating the dimensionally reduced counts by dividing the counts for each amplicon by the median of the normalized sequence reads for the barcode.

[0056] In various embodiments, the threshold value is 3 sequence reads.

[0057] In various embodiments, the fraction of amplicons is about 20%.
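
By way of non-limiting illustration, one possible reading of the count normalization is sketched below in Python. The sketch assumes a read-count matrix with one row per barcode and one column per amplicon, removes a barcode when more than about 20% of its amplicons have fewer than 3 reads, divides each count by the mean reads of its amplicon, and then divides each barcode's values by the median of that barcode's amplicon-normalized values. The order of operations, the direction of the fraction test, and the name `normalize_counts` are assumptions rather than requirements of the embodiments.

```python
import numpy as np


def normalize_counts(counts, min_reads=3, max_low_fraction=0.2):
    """Filter barcodes, then normalize by amplicon mean and barcode median.

    Assumptions: `counts` has shape (barcodes, amplicons), and amplicons with
    no reads were removed beforehand so that every amplicon mean is positive.
    """
    counts = np.asarray(counts, dtype=float)
    # Fraction of amplicons in which each barcode has fewer than `min_reads` reads.
    low_fraction = (counts < min_reads).mean(axis=1)
    keep = low_fraction <= max_low_fraction
    kept = counts[keep]
    # Per amplicon: divide each barcode's reads by the amplicon's mean reads.
    by_amplicon = kept / kept.mean(axis=0)
    # Per barcode: divide by the median of the amplicon-normalized reads.
    barcode_median = np.median(by_amplicon, axis=1, keepdims=True)
    return by_amplicon / barcode_median, keep


# Illustrative values only: the third barcode is sparse and is removed.
normalized, kept = normalize_counts(
    [[10, 20, 30, 40, 50], [20, 40, 60, 80, 100], [0, 1, 2, 50, 60]]
)
print(kept, normalized)
```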

[0058] In various embodiments, the combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value comprises a concatenated matrix comprising two or more of the normalized counts, the barcode correlation value, and the barcode coverage value.
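
By way of non-limiting illustration, a concatenated matrix of this kind may be assembled as sketched below in Python, assuming the normalized counts form a matrix with one row per barcode and the correlation and coverage values are vectors with one entry per barcode. The name `concatenate_features` is hypothetical.

```python
import numpy as np


def concatenate_features(normalized_counts, correlation, coverage):
    """Append the correlation and coverage values as two extra columns."""
    normalized_counts = np.asarray(normalized_counts, dtype=float)
    correlation = np.asarray(correlation, dtype=float)
    coverage = np.asarray(coverage, dtype=float)
    # np.column_stack turns each 1-D vector into a column next to the 2-D matrix.
    return np.column_stack([normalized_counts, correlation, coverage])


# Illustrative values only: 2 barcodes x 3 amplicons plus two per-barcode values.
features = concatenate_features(
    [[1.0, 0.9, 1.1], [1.2, 1.0, 0.8]], [0.98, 0.75], [1.3, 1.9]
)
print(features.shape)
```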

[0059] Additionally disclosed herein is a non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform a method.

[0060] Additionally disclosed herein is a system comprising: a processor; and a non-transitory computer readable medium comprising instructions that, when executed by the processor, cause the processor to perform a method.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0061] These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings, where:

[0062] Figure (FIG.) 1A depicts an overall system environment for conducting single-cell analysis, including a merged droplet detection system to detect merged droplets, in accordance with an embodiment.

[0063] FIG. 1B shows a block diagram of a merged droplet detection system, in accordance with an embodiment.

[0064] FIG. 1C shows an embodiment of processing individual cells to generate amplicons for sequencing, in accordance with an embodiment.

[0065] FIG. 2A shows a flow process of performing barcodes analysis to detect one or more droplet mergers, in accordance with a first embodiment (e.g., correlation-cluster method).

[0066] FIG. 2B shows a flow process of performing barcodes analysis to detect one or more droplet mergers, in accordance with a second embodiment (e.g., dimensional reduction method).

[0067] FIG. 2C shows a flow process of performing barcodes analysis to detect one or more droplet mergers, in accordance with a third embodiment (e.g., correlation-dimensional reduction method).

[0068] FIGS. 3A-3C show the steps of lysing and digesting in the first droplet as described in the step 165 in FIG. 1C, in accordance with an embodiment.

[0069] FIG. 4A illustrates the priming and barcoding of an antibody-conjugated oligonucleotide, in accordance with an embodiment.

[0070] FIG. 4B illustrates the priming and barcoding of genomic DNA, in accordance with an embodiment.

[0071] FIG. 5 depicts an example computing device for implementing the systems and methods described herein.

[0072] FIGS. 6-10 show example results obtained from the droplet analysis by performing the correlation-cluster method and dimensional reduction method as described herein.

[0073] FIGS. 11A-11D and 12 show results obtained by performing the correlation-UMAP method as described herein.

[0074] FIGS. 13A-13D, 14, 15, and 16A-16B show example performance comparisons among methods as described herein.

DETAILED DESCRIPTION

Definitions

[0075] Terms used in the claims and specification are defined as set forth below unless otherwise specified.

[0076] The term “about” refers to a ± 10% variation from the nominal value unless otherwise indicated or inferred.

[0077] The terms “subject” and “patient” are used interchangeably and encompass an organism, human or non-human, mammal or non-mammal, male or female.

[0078] The term “sample” or “test sample” can include a single cell or multiple cells or fragments of cells or an aliquot of body fluid, such as a blood sample, taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.

[0079] The term “analyte” refers to a component of a cell. Cell analytes can be informative for understanding a state or behavior of a cell. Therefore, performing single-cell analysis of one or more analytes of a cell using the systems and methods described herein is informative for determining a state or behavior of a cell. Examples of an analyte include a nucleic acid (e.g., RNA, DNA, cDNA), a protein, a peptide, an antibody, an antibody fragment, a polysaccharide, a sugar, a lipid, a small molecule, or combinations thereof. In particular embodiments, a single-cell analysis involves analyzing protein analytes. In particular embodiments, a single-cell analysis involves analyzing surface protein analytes. In particular embodiments, a single-cell analysis involves analyzing intracellular protein analytes. In particular embodiments, a single-cell analysis involves analyzing two different analytes such as protein (e.g., intracellular and/or surface protein) and DNA, protein (e.g., intracellular and/or surface protein) and RNA, or RNA and DNA. In particular embodiments, a single-cell analysis involves analyzing three or more different analytes of a cell, such as RNA, DNA, and protein.

[0080] The phrase “cell phenotype” refers to the cell expression of one or more proteins (e.g., cellular proteomics). In various embodiments, a cell phenotype is determined using a single-cell analysis. In various embodiments, the cell phenotype can refer to the expression of a panel of proteins (e.g., a panel of proteins involved in cancer processes). In various embodiments, the protein panel includes proteins involved in any of the following hematologic malignancies: acute lymphoblastic leukemia, acute myeloid leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, classic Hodgkin’s Lymphoma, diffuse large B-cell lymphoma, follicular lymphoma, mantle cell lymphoma, multiple myeloma, myelodysplastic syndromes, myeloid disease, myeloproliferative neoplasms, or T-cell lymphoma. In various embodiments, the protein panel includes proteins involved in any of the following solid tumors: breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, kidney renal clear cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian cancer, pancreatic adenocarcinoma, prostate adenocarcinoma, or skin cutaneous melanoma. Example proteins in the panel can include any of HLA-DR, CD10, CD117, CD11b, CD123, CD13, CD138, CD14, CD141, CD15, CD16, CD163, CD19, CD193 (CCR3), CD1c, CD2, CD203c, CD209, CD22, CD25, CD3, CD30, CD303, CD304, CD33, CD34, CD4, CD42b, CD45RA, CD5, CD56, CD62P (P-Selectin), CD64, CD68, CD69, CD38, CD7, CD71, CD83, CD90 (Thy1), Fc epsilon RI alpha, Siglec-8, CD235a, CD298-A, B2M-A, GATA3, CSTB, BCR-ABL (b3a2), MYC, BAD, AKT pS473, CASP3, BCL2, MPO, MKI67, IFNG, IL2, CDK1, RPS6 pS244, CD49d, CD45, CD8, CD45RO, mouse IgG1, kappa, mouse IgG2a, kappa, mouse IgG2b, kappa, CD103, CD62L, CD11c, CD44, CD27, CD81, CD319 (SLAMF7), CD269 (BCMA), CD99, CD164, KCNJ3, CXCR4 (CD184), CD109, CD53, CD74, HLA-DR, DP, DQ, HLA-A, B, C, ROR1, Annexin A1, or CD20.

[0081] The phrase “cell genotype” refers to the genetic makeup of the cell and can refer to one or more genes and/or the combination of alleles (e.g., homozygous or heterozygous) of a cell. The phrase cell genotype further encompasses one or more mutations of the cell including polymorphisms, single nucleotide polymorphisms (SNPs), single nucleotide variants (SNVs), insertions, deletions, knock-ins, knock-outs, insertion or deletion mutation (indel), copy number variations (CNVs), duplications, translocations, and loss of heterozygosity (LOH). In various embodiments, a cell phenotype is determined using a single-cell analysis. In various embodiments, the cell phenotype can refer to the expression of a panel of genes (e.g., a panel of genes involved in cancer processes). In various embodiments, the panel includes genes involved in any of the following hematologic malignancies: acute lymphoblastic leukemia, acute myeloid leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, classic Hodgkin’s Lymphoma, diffuse large B-cell lymphoma, follicular lymphoma, mantle cell lymphoma, multiple myeloma, myelodysplastic syndromes, myeloid disease, myeloproliferative neoplasms, or T-cell lymphoma. In various embodiments, the panel includes genes involved in any of the following solid tumors: breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, kidney renal clear cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian cancer, pancreatic adenocarcinoma, prostate adenocarcinoma, or skin cutaneous melanoma. For example, for acute lymphoblastic leukemia, the following genes are interrogated: ASXL1, GATA2, KIT, PTPN11, TET2, DNMT3A, IDH1, KRAS, RUNX1, TP53, EZH2, IDH2, NPM1, SF3B1, U2AF1, FLT3, JAK2, NRAS, SRSF2, or WT1.

[0082] In some embodiments, the discrete entities as described herein are droplets. The terms “emulsion,” “drop,” “droplet,” and “microdroplet” are used interchangeably herein, to refer to small, generally spherical structures, containing at least a first fluid phase, e.g., an aqueous phase (e.g., water), bounded by a second fluid phase (e.g., oil) which is immiscible with the first fluid phase. In some embodiments, droplets according to the present disclosure may contain a first fluid phase, e.g., oil, bounded by a second immiscible fluid phase, e.g., an aqueous phase fluid (e.g., water). In some embodiments, the second fluid phase will be an immiscible (with respect to the first fluid phase) phase carrier fluid. Thus droplets according to the present disclosure may be provided as aqueous-in-oil emulsions or oil-in-aqueous emulsions. Droplets may be sized and/or shaped as described herein for discrete entities. For example, droplets according to the present disclosure generally range from 1 µm to 1000 µm, inclusive, in diameter. Droplets according to the present disclosure may be used to encapsulate cells, nucleic acids (e.g., DNA), enzymes, reagents, and a variety of other components. The term emulsion may be used to refer to an emulsion produced in, on, or by a microfluidic device and/or flowed from or applied by a microfluidic device.

[0083] The term “antibody” encompasses monoclonal antibodies (including full length monoclonal antibodies), polyclonal antibodies, multispecific antibodies (e.g., bispecific antibodies), and antibody fragments that are antigen-binding, e.g., an antibody or an antigen-binding fragment thereof. “Antibody fragment”, and all grammatical variants thereof, as used herein are defined as a portion of an intact antibody comprising the antigen binding site or variable region of the intact antibody, wherein the portion is free of the constant heavy chain domains (i.e., CH2, CH3, and CH4, depending on antibody isotype) of the Fc region of the intact antibody. Examples of antibody fragments include Fab, Fab’, Fab’-SH, F(ab’)2, and Fv fragments; diabodies; any antibody fragment that is a polypeptide having a primary structure consisting of one uninterrupted sequence of contiguous amino acid residues (referred to herein as a “single-chain antibody fragment” or “single chain polypeptide”).

[0084] “Identity,” as known in the art, is a relationship between two or more polypeptide sequences or two or more polynucleotide sequences, as determined by comparing the sequences. In the art, “identity” also means the degree of sequence relatedness between polypeptide or polynucleotide sequences, as determined by the match between strings of such sequences. “Identity” and “similarity” can be readily calculated by known methods, including, but not limited to, those described in Computational Molecular Biology, Lesk, A. M., ed., Oxford University Press, New York, 1988; Biocomputing: Informatics and Genome Projects, Smith, D. W., ed., Academic Press, New York, 1993; Computer Analysis of Sequence Data, Part I, Griffin, A. M., and Griffin, H. G., eds., Humana Press, New Jersey, 1994; Sequence Analysis in Molecular Biology, von Heinje, G., Academic Press, 1987; and Sequence Analysis Primer, Gribskov, M. and Devereux, J., eds., M Stockton Press, New York, 1991; and Carillo, H., and Lipman, D., Siam J. Applied Math., 48: 1073 (1988). In addition, values for percentage identity can be obtained from amino acid and nucleotide sequence alignments generated using the default settings for the AlignX component of Vector NTI Suite 8.0 (Informax, Frederick, Md.). Preferred methods to determine identity are designed to give the largest match between the sequences tested. Methods to determine identity and similarity are codified in publicly available computer programs. Example computer program methods to determine identity and similarity between two sequences include, but are not limited to, the GCG program package (Devereux, J., et al., Nucleic Acids Research 12(1): 387 (1984)), BLAST and BLAST 2.0 algorithms (e.g., BLAST X programs), which are described in Altschul et al., Nuc. Acids Res. 25:3389-3402 (1997), and FASTA (Altschul, S. F. et al., J. Molec. Biol. 215:403-410 (1990)). The BLAST X (e.g., BLASTP, BLASTN) programs are publicly available from NCBI and other sources (BLAST Manual, Altschul, S., et al., NCBI NLM NIH Bethesda, Md. 20894; Altschul, S., et al., J. Mol. Biol. 215:403-410 (1990)). Other methods include the algorithms of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), and Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), etc.

[0085] The term “identical” and its variants, as used herein, when used in reference to two or more sequences, refer to the degree to which the two or more sequences (e.g., nucleotide or polypeptide sequences) are the same. In the context of two or more sequences, the percent identity or homology of the sequences or subsequences thereof indicates the percentage of all monomeric units (e.g., nucleotides or amino acids) that are the same at a given position or region of the sequence (i.e., about 70% identity, preferably 75%, 80%, 85%, 90%, 95%, 97%, 98% or 99% identity). The percent identity can be over a specified region, when compared and aligned for maximum correspondence over a comparison window, or designated region as measured using BLAST or BLAST 2.0 sequence comparison algorithms with default parameters described below, or by manual alignment and visual inspection. Sequences are said to be “substantially identical” when there is at least 85% identity at the amino acid level or at the nucleotide level. Preferably, the identity exists over a region that is at least about 25, 50, or 100 residues in length, or across the entire length of at least one compared sequence. Another indication that two nucleic acid sequences are substantially identical is that the two molecules or their complements hybridize to each other under stringent hybridization conditions.

[0086] The term “block,” “blocking,” “using a block buffer” and their variants, refer generally to any action or process whereby non-specific binding of antibodies or other reagents to the tissue is prevented. For example, non-specific binding prevents visualization of the antigen-antibody binding of interest. Thus, to mitigate nonspecific binding, a blocking step can be carried out before incubation with an antibody. A blocking buffer may be used in a blocking step. A blocking buffer can be a solution of a different protein, mixture of proteins, or other compound that passively adsorbs to remaining binding surfaces. The blocking buffer may reduce background interference and improve the signal-to-noise ratio. For example, an ideal blocking buffer may bind to potential sites of nonspecific interaction, eliminating background altogether, without altering or obscuring the epitope for antibody binding.

[0087] The terms “fixing,” “fixative,” and their related variants, refer generally to any action or process whereby cellular morphology, integrity, and/or structure are preserved so as to prevent autolysis of cells and the process of putrefaction (cellular decay). A fixative may be used to enhance the rigidity and mechanical strength of cells, to withstand the immunostaining procedure, as described herein. In some embodiments, cells may be fixed immediately following removal from cell culture conditions to limit autolysis and putrefaction.

[0088] The terms “permeabilize,” “permeabilizing,” “permeabilization,” “using a permeabilization buffer” and their variants, refer generally to any action or process whereby the cell membrane is punctured where membrane lipids are partially removed or dissolved to allow for at least a portion of the antibodies or any desired molecules to pass through a cellular membrane and enter the cell. A permeabilization buffer may be used in a permeabilizing step. In some embodiments, a permeabilization buffer can be a solution of non-ionic detergent, or other permeabilizing agents, as described herein.

[0089] The terms “amplify,” “amplifying,” “amplification reaction” and their variants, refer generally to any action or process whereby at least a portion of a nucleic acid molecule (referred to as a template nucleic acid molecule) is replicated or copied into at least one additional nucleic acid molecule. The additional nucleic acid molecule optionally includes a sequence that is substantially identical or substantially complementary to at least some portion of the template nucleic acid molecule. The template nucleic acid molecule can be single-stranded or double-stranded and the additional nucleic acid molecule can independently be single-stranded or double-stranded. In some embodiments, amplification includes a template-dependent in vitro enzyme-catalyzed reaction for the production of at least one copy of at least some portion of the nucleic acid molecule or the production of at least one copy of a nucleic acid sequence that is complementary to at least some portion of the nucleic acid molecule. Amplification optionally includes linear or exponential replication of a nucleic acid molecule. In some embodiments, such amplification is performed using isothermal conditions; in other embodiments, such amplification can include thermocycling. In some embodiments, the amplification is a multiplex amplification that includes the simultaneous amplification of a plurality of target sequences in a single amplification reaction. At least some of the target sequences can be situated on the same nucleic acid molecule or on different target nucleic acid molecules included in the single amplification reaction. In some embodiments, “amplification” includes amplification of at least some portion of DNA- and RNA-based nucleic acids alone, or in combination. The amplification reaction can include single or double-stranded nucleic acid substrates and can further include any of the amplification processes known to one of ordinary skill in the art. In some embodiments, the amplification reaction includes polymerase chain reaction (PCR). In some embodiments, the amplification reaction includes an isothermal amplification reaction such as LAMP. In the present invention, the terms “synthesis” and “amplification” of nucleic acid are used. The synthesis of nucleic acid in the present invention means the elongation or extension of nucleic acid from an oligonucleotide serving as the origin of synthesis. If not only this synthesis but also the formation of other nucleic acid and the elongation or extension reaction of this formed nucleic acid occur continuously, a series of these reactions is comprehensively called amplification. The polynucleic acid produced by the amplification technology employed is generically referred to as an “amplicon” or “amplification product.”

[0090] Any nucleic acid amplification method may be utilized, such as a PCR-based assay, e.g., quantitative PCR (qPCR), or an isothermal amplification may be used to detect the presence of certain nucleic acids, e.g., genes of interest, present in discrete entities or one or more components thereof, e.g., cells encapsulated therein. Such assays can be applied to discrete entities within a microfluidic device or a portion thereof or any other suitable location. The conditions of such amplification or PCR-based assays may include detecting nucleic acid amplification over time and may vary in one or more ways.

[0091] A number of nucleic acid polymerases can be used in the amplification reactions utilized in certain embodiments provided herein, including any enzyme that can catalyze the polymerization of nucleotides (including analogs thereof) into a nucleic acid strand. Such nucleotide polymerization can occur in a template-dependent fashion. Such polymerases can include without limitation naturally occurring polymerases and any subunits and truncations thereof, mutant polymerases, variant polymerases, recombinant, fusion or otherwise engineered polymerases, chemically modified polymerases, synthetic molecules or assemblies, and any analogs, derivatives or fragments thereof that retain the ability to catalyze such polymerization. Optionally, the polymerase can be a mutant polymerase comprising one or more mutations involving the replacement of one or more amino acids with other amino acids, the insertion or deletion of one or more amino acids from the polymerase, or the linkage of parts of two or more polymerases. Typically, the polymerase comprises one or more active sites at which nucleotide binding and/or catalysis of nucleotide polymerization can occur. Some exemplary polymerases include without limitation DNA polymerases and RNA polymerases. The term “polymerase” and its variants, as used herein, also includes fusion proteins comprising at least two portions linked to each other, where the first portion comprises a peptide that can catalyze the polymerization of nucleotides into a nucleic acid strand and is linked to a second portion that comprises a second polypeptide. In some embodiments, the second polypeptide can include a reporter enzyme or a processivity-enhancing domain. Optionally, the polymerase can possess 5’ exonuclease activity or terminal transferase activity. In some embodiments, the polymerase can be optionally reactivated, for example through the use of heat, chemicals or re-addition of new amounts of polymerase into a reagent. In some embodiments, the polymerase can include a hot-start polymerase or an aptamer-based polymerase that optionally can be reactivated.

[0092] The terms “target primer” or “target-specific primer” and variations thereof refer to primers that are complementary to a binding site sequence. Target primers are generally single-stranded or double-stranded polynucleotides, typically oligonucleotides, that include at least one sequence that is at least partially complementary to a target nucleic acid sequence.

[0093] “Forward primer binding site” and “reverse primer binding site” refer to the regions on the template DNA and/or the amplicon to which the forward and reverse primers bind. The primers act to delimit the region of the original template polynucleotide which is exponentially amplified during amplification. In some embodiments, additional primers may bind to the region 5’ of the forward primer and/or reverse primers. Where such additional primers are used, the forward primer binding site and/or the reverse primer binding site may encompass the binding regions of these additional primers as well as the binding regions of the primers themselves. For example, in some embodiments, the method may use one or more additional primers which bind to a region that lies 5’ of the forward and/or reverse primer binding region. Such a method was disclosed, for example, in WO0028082 which discloses the use of “displacement primers” or “outer primers.”

[0094] A “barcode” nucleic acid identification sequence can be incorporated into a nucleic acid primer or linked to a primer to enable independent sequencing and identification to be associated with one another via a barcode which relates information and identification that originated from molecules that existed within the same sample. There are numerous techniques that can be used to attach barcodes to the nucleic acids within a discrete entity. For example, the target nucleic acids may or may not be first amplified and fragmented into shorter pieces. The molecules can be combined with discrete entities, e.g., droplets, containing the barcodes. The barcodes can then be attached to the molecules using, for example, splicing by overlap extension. In this approach, the initial target molecules can have “adaptor” sequences added, which are molecules of a known sequence to which primers can be synthesized. When combined with the barcodes, primers can be used that are complementary to the adaptor sequences and the barcode sequences, such that the product amplicons of both target nucleic acids and barcodes can anneal to one another and, via an extension reaction such as DNA polymerization, be extended onto one another, generating a double-stranded product including the target nucleic acids attached to the barcode sequence. Alternatively, the primers that amplify the target can themselves be barcoded so that, upon annealing and extending onto the target, the amplicon produced has the barcode sequence incorporated into it. This can be applied with a number of amplification strategies, including specific amplification with PCR or non-specific amplification with, for example, MDA. An alternative enzymatic reaction that can be used to attach barcodes to nucleic acids is ligation, including blunt or sticky end ligation. In this approach, the DNA barcodes are incubated with the nucleic acid targets and ligase enzyme, resulting in the ligation of the barcode to the targets. The ends of the nucleic acids can be modified as needed for ligation by a number of techniques, including by using adaptors introduced with ligase or fragments to enable greater control over the number of barcodes added to the end of the molecule.

[0095] A “barcode” as used herein is classified into one of the following categories based on sequence read count and variant calls: singlet, cell doublet (or doublet), merger, and background, as described below in further detail. In some embodiments, these categories are mutually exclusive.

[0096] A “merger” or “droplet merger” used in embodiments herein refers to a barcode present in a merged drop. A “background barcode” used in embodiments herein refers to a barcode that is not paired with a cell. A “singlet” used in embodiments herein refers to a barcode in an intact drop including one cell. A “cell doublet” or “doublet” refers to a barcode in an intact drop including two cells. A “singlet-like barcode” used in embodiments herein refers to a singlet or a cell doublet barcode. In some embodiments, both a singlet and a cell doublet barcode originate from intact drops and have a similar read count signature. In some embodiments, the singlet-like barcodes are associated with the cells on which sample specific analysis (e.g., tertiary analysis) is performed.

[0097] A “merged drop” or “merged droplet” as used herein refers to a drop generated when multiple drops (e.g., individual parent drops) coalesce together. Thus, a merged drop may include a plurality of barcodes and/or cells that individual parent drops include. In some embodiments, a parent drop includes a barcode and/or a cell. In some embodiments, each parent drop contains a barcode and/or a cell. In some embodiments, a parent drop individually contains no barcode or cell.

[0098] The terms “nucleic acid,” “polynucleotides,” and “oligonucleotides” refer to biopolymers of nucleotides and, unless the context indicates otherwise, include modified and unmodified nucleotides, and DNA and RNA, and modified nucleic acid backbones. For example, in certain embodiments, the nucleic acid is a peptide nucleic acid (PNA) or a locked nucleic acid (LNA). Typically, the methods as described herein are performed using DNA as the nucleic acid template for amplification. However, nucleic acid whose nucleotide is replaced by an artificial derivative or modified nucleic acid from natural DNA or RNA is also included in the nucleic acid of the present invention insofar as it functions as a template for synthesis of complementary chain. The nucleic acid of the present invention is generally contained in a biological sample. The biological sample includes animal, plant or microbial tissues, cells, cultures and excretions, or extracts therefrom. In certain aspects, the biological sample includes intracellular parasitic genomic DNA or RNA such as virus or mycoplasma. The nucleic acid may be derived from nucleic acid contained in said biological sample. For example, genomic DNA, or cDNA synthesized from mRNA, or nucleic acid amplified on the basis of nucleic acid derived from the biological sample, are preferably used in the described methods. Unless denoted otherwise, whenever an oligonucleotide sequence is represented, it will be understood that the nucleotides are in 5’ to 3’ order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, “T” denotes deoxythymidine, and “U” denotes uridine. Oligonucleotides are said to have “5’ ends” and “3’ ends” because mononucleotides are typically reacted to form oligonucleotides via attachment of the 5’ phosphate or equivalent group of one nucleotide to the 3’ hydroxyl or equivalent group of its neighboring nucleotide, optionally via a phosphodiester or other suitable linkage.

[0099] A template nucleic acid is a nucleic acid serving as a template for synthesizing a complementary chain in a nucleic acid amplification technique. A complementary chain having a nucleotide sequence complementary to the template has a meaning as a chain corresponding to the template, but the relationship between the two is merely relative. That is, according to the methods described herein a chain synthesized as the complementary chain can function again as a template. That is, the complementary chain can become a template. In certain embodiments, the template is derived from a biological sample, e.g., plant, animal, virus, micro-organism, bacteria, fungus, etc. In certain embodiments, the animal is a mammal, e.g., a human patient. A template nucleic acid typically comprises one or more target nucleic acids. A target nucleic acid in exemplary embodiments may comprise any single or double-stranded nucleic acid sequence that can be amplified or synthesized according to the disclosure, including any nucleic acid sequence suspected or expected to be present in a sample.

[00100] Primers and oligonucleotides used in embodiments herein comprise nucleotides. A nucleotide comprises any compound, including without limitation any naturally occurring nucleotide or analog thereof, which can bind selectively to, or can be polymerized by, a polymerase. Typically, but not necessarily, selective binding of the nucleotide to the polymerase is followed by polymerization of the nucleotide into a nucleic acid strand by the polymerase; occasionally however the nucleotide may dissociate from the polymerase without becoming incorporated into the nucleic acid strand, an event referred to herein as a “nonproductive” event. Such nucleotides include not only naturally occurring nucleotides but also any analogs, regardless of their structure, that can bind selectively to, or can be polymerized by, a polymerase. While naturally occurring nucleotides typically comprise base, sugar and phosphate moieties, the nucleotides of the present disclosure can include compounds lacking any one, some or all of such moieties. For example, the nucleotide can optionally include a chain of phosphorus atoms comprising three, four, five, six, seven, eight, nine, ten or more phosphorus atoms. In some embodiments, the phosphorus chain can be attached to any carbon of a sugar ring, such as the 5’ carbon. The phosphorus chain can be linked to the sugar with an intervening O or S. In one embodiment, one or more phosphorus atoms in the chain can be part of a phosphate group having P and O. In another embodiment, the phosphorus atoms in the chain can be linked together with intervening O, NH, S, methylene, substituted methylene, ethylene, substituted ethylene, CNH2, C(O), C(CH2), CH2CH2, or C(OH)CH2R (where R can be a 4-pyridine or 1-imidazole). In one embodiment, the phosphorus atoms in the chain can have side groups having O, BH3, or S. In the phosphorus chain, a phosphorus atom with a side group other than O can be a substituted phosphate group. In the phosphorus chain, phosphorus atoms with an intervening atom other than O can be a substituted phosphate group. Some examples of nucleotide analogs are described in Xu, U.S. Pat. No. 7,405,281.

[00101] In some embodiments, the nucleotide comprises a label and is referred to herein as a “labeled nucleotide”; the label of the labeled nucleotide is referred to herein as a “nucleotide label.” In some embodiments, the label can be in the form of a fluorescent moiety (e.g., dye), luminescent moiety, or the like attached to the terminal phosphate group, i.e., the phosphate group most distal from the sugar. Some examples of nucleotides that can be used in the disclosed methods and compositions include, but are not limited to, ribonucleotides, deoxyribonucleotides, modified ribonucleotides, modified deoxyribonucleotides, ribonucleotide polyphosphates, deoxyribonucleotide polyphosphates, modified ribonucleotide polyphosphates, modified deoxyribonucleotide polyphosphates, peptide nucleotides, modified peptide nucleotides, metallonucleosides, phosphonate nucleosides, and modified phosphate-sugar backbone nucleotides, analogs, derivatives, or variants of the foregoing compounds, and the like. In some embodiments, the nucleotide can comprise non-oxygen moieties such as, for example, thio- or borano- moieties, in place of the oxygen moiety bridging the alpha phosphate and the sugar of the nucleotide, or the alpha and beta phosphates of the nucleotide, or the beta and gamma phosphates of the nucleotide, or between any other two phosphates of the nucleotide, or any combination thereof.

[00102] It must be noted that, as used in the specification, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise.

[00103] All references, issued patents and patent applications cited within the body of the specification are hereby incorporated by reference in their entirety, for all purposes.

Overview

[00104] Described herein are embodiments to identify barcodes that are mergers by analyzing barcode(s) associated with cell(s) in a plurality of drops, and thus to differentiate merged drops from intact drops. Generally, analyzing the one or more barcodes involves obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells, normalizing the dataset, clustering (e.g., based on the normalized dataset), and determining the one or more droplet mergers by labelling or visualizing one or more clusters. Thus, the barcodes in real intact drops (e.g., singlets on which sample specific analysis is performed, or cell doublets) can be correctly identified from the artifact (e.g., mergers).

[00105] Advantageously, the systems and methods in the disclosed embodiments as described herein use read count information (instead of variant information) to identify mergers. Hence, this approach can work even in cases with homogenous populations or populations with few differentiating variants. This is in contrast to conventional DNA sequencing approaches, which use variant information to identify cell doublets. Example methodologies for identifying a cell doublet are further described in Weber et al., “DoubletD: detecting doublets in single-cell DNA sequencing data.” Bioinformatics 2021(37): i214-i221, and Lun et al., “EmptyDrops: Distinguishing Cells from Empty Droplets in Droplet-Based Single-Cell RNA Sequencing Data.” Genome Biology 2019(20): 63, which are hereby incorporated by reference in their entirety.

[00106] In some embodiments, the methods and systems described herein are not applied to remove cell doublets, and instead may classify both cell doublets and singlets into one category, e.g., “singlet-like” barcodes.

[00107] The single-cell analysis as described herein involves generating amplicons derived from the one or more analytes and sequencing the amplicons to determine presence or absence of the analytes. In some embodiments, the one or more analytes comprise genomic DNA, RNA, and/or protein. The single-cell analysis further involves determining presence or absence of the cell genotype (e.g., cell mutations such as CNVs, indels, and/or SNVs). In various embodiments, to analyze analytes, the single cell analysis involves sequencing oligonucleotides that are linked to antibodies, where the antibodies exhibit binding affinity for a specific analyte expressed by a cell. Thus, sequence reads derived from the antibody-conjugated oligonucleotides are used to determine the cell phenotype (e.g., expression or presence of one or more analytes of the cell). The single-cell analysis in the present disclosure (e.g., inclusion of intracellular protein detection) can enable measurement of proteins in cancer mechanisms, such as apoptosis (BCL2 family proteins), transcription factors (GATA3), tumor suppressors (TP53), and/or phosphorylated proteins involved in cell growth signaling pathways (e.g., phosphorylated ERK and/or STAT proteins).

[00108] In various embodiments, the systems and methods in the disclosed embodiments as described herein are performed for at least 1,000, 10,000, 100,000, or 1 million cells in one workflow.

[00109] In various embodiments, the systems and methods in the disclosed embodiments as described herein are performed for at least 1,000, 10,000, 100,000, or 1 million barcodes in one workflow.

[00110] In various embodiments, the systems and methods in the disclosed embodiments as described herein are performed for at least 1,000, 10,000, 100,000, or 1 million droplets in one workflow.

[00111] In various embodiments, the systems and methods depicted in FIGS. 1-2 can include additional or fewer components and/or steps. For example, the system 100 in FIG. 1A need not include single cell preparation step 104. In another example, the merger detection system 130 in FIG. 1B need not include the normalized data store 180. In another example, the merger detection workflow or methods as described herein include the steps in both FIG. 2A and FIG. 2B. In another example, the workflow in FIG. 2A further includes the steps in FIG. 2B. In another example, the workflows in FIG. 2B and FIG. 2A are combined such that the output results from both methods can be compared and a result with better performance may be selected.

[00112] Reference is made to FIG. 1A, which depicts an overall system environment 100 including a single cell workflow device 106 and a computational device 108 for analyzing one or more analytes of one or more individual cells 102, in accordance with an embodiment. In various embodiments, the cells 102 can be isolated from a test sample obtained from a subject or a patient. In various embodiments, the cells 102 are healthy cells taken from a healthy subject. In various embodiments, the cells 102 include diseased cells taken from a subject. In one embodiment, the cells 102 include cancer cells taken from a subject previously diagnosed with cancer. For example, cancer cells can be tumor cells available in the bloodstream of the subject diagnosed with cancer. As another example, cancer cells can be cells obtained through a tumor biopsy. Thus, single-cell analysis of the tumor cells enables analysis of cells of the subject’s cancer. In various embodiments, the test sample is obtained from a subject following treatment of the subject (e.g., following a therapy such as cancer therapy). Thus, single-cell analysis of the cells enables analysis of cells representing the subject’s response to a therapy. In various embodiments, the cells 102 are or include one or more complete cells. In various embodiments, the cells 102 are or include one or more nuclei and/or partial cells, where the nuclei and/or partial cells are isolated from tissues and/or a suspension of complete cells before the single cell analysis workflow. Example methodologies for isolating cellular nuclei from cells are further described in Nabbi et al., “Isolation of Nuclei.” Cold Spring Harb Protoc. 2015(8): 731-734, and Vindelov et al., “N. I. A detergent-trypsin method for the preparation of nuclei for flow cytometric DNA analysis.” Cytometry 1983(3), 323-327, which are hereby incorporated by reference in its entirety.

[00113] At step 104, the cells 102 are prepared. In various embodiments, the cells 102 are incubated with one or more antibodies. In various embodiments, the antibody is conjugated to the oligonucleotide. In various embodiments, the antibody exhibits binding affinity to a target analyte. For example, the antibody can exhibit binding affinity to a target epitope of a target protein.

[00114] In particular embodiments, step 104 involves performing any of washing the cell 102, blocking the cell 102, fixing the cell 102, quenching the cell 102, and/or permeabilizing the cell 102, as described in further detail below.

[00115] In various embodiments, washing the cell 102 comprises washing the cell 102 with wash buffer. In various embodiments, washing the cell 102 comprises washing the cell 102 one or more times. In various embodiments, washing the cell 102 comprises washing the cell 102 at least 2, 3, 4, 5, 6, 7, 8, 9, or 10 times. In various embodiments, washing the cell 102 comprises washing the cell 102 for at least 1 minute, 2 minutes, 3 minutes, 4 minutes, 5 minutes, 6 minutes, 7 minutes, 8 minutes, 9 minutes, or 10 minutes.

[00116] In various embodiments, fixing the cell 102 comprises fixing the cell 102 using fixatives for at least 30, 45, 60, or 90 minutes. In particular embodiments, fixing the cell 102 comprises fixing the cell 102 using fixatives for 90 minutes. In various embodiments, fixing the cell 102 comprises fixing the cell 102 at a temperature between 4 and 50 °C. In various embodiments, fixing the cell 102 comprises fixing the cell 102 at a temperature between 10 and 30 °C. In various embodiments, fixing the cell 102 comprises fixing the cell 102 at a temperature between 20 and 25 °C. In various embodiments, fixing the cell 102 comprises fixing the cell 102 at a temperature between 20 and 25 °C for 90 minutes. In various embodiments, fixing the cell 102 comprises fixing the cell 102 using 0.1 mM to 20 mM of one or more fixatives in a reactive volume using a background buffer. In various embodiments, fixing the cell 102 comprises fixing the cell 102 using 0.5 mM to 10 mM of one or more fixatives in a reactive volume using a background buffer. In various embodiments, fixing the cell 102 comprises fixing the cell 102 using 1 mM to 5 mM of one or more fixatives in a reactive volume using a background buffer. In various embodiments, the reactive volume is from 0.01 to 10 mL. In various embodiments, the reactive volume is from 0.05 to 5 mL. In particular embodiments, the reactive volume is from 0.1 to 1 mL. In particular embodiments, the background buffer is Dulbecco’s phosphate-buffered saline (DPBS).

[00117] In various embodiments, quenching the cell 102 comprises quenching the fixed cell for at least 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 minutes. In various embodiments, quenching the cell 102 comprises quenching the fixed cell at a temperature between 10 and 50 °C. In various embodiments, quenching the cell 102 comprises quenching the fixed cell at a temperature between 10 and 30 °C. In various embodiments, quenching the cell 102 comprises quenching the fixed cell at a temperature between 20 and 25 °C.

[00118] In various embodiments, blocking the cell 102 comprises blocking the cell 102 for at least 10, 20, or 30 minutes. In various embodiments, blocking the cell 102 comprises blocking the cell 102 at a temperature between 10 and 50 °C. In various embodiments, blocking the cell 102 comprises blocking the cell 102 at a temperature between 10 and 30 °C. In various embodiments, blocking the cell 102 comprises blocking the cell 102 at a temperature between 20 and 25 °C. In various embodiments, blocking the cell 102 comprises using a blocking buffer. In particular embodiments, the blocking buffer is used in the surface protein product for preparing the cell.

[00119] In various embodiments, permeabilizing the cell 102 comprises permeabilizing the cell 102 for at least 10, 20, or 30 minutes. In various embodiments, permeabilizing the cell 102 comprises permeabilizing the cell 102 at a temperature between 10 and 50 °C. In various embodiments, permeabilizing the cell 102 comprises permeabilizing the cell 102 at a temperature between 10 and 30 °C. In various embodiments, permeabilizing the cell 102 comprises permeabilizing the cell 102 at a temperature between 20 and 25 °C. In various embodiments, permeabilizing the cell 102 comprises permeabilizing the cell 102 using a permeabilization buffer. In various embodiments, the permeabilization buffer comprises a 0.01%, 0.05%, 0.1%, 0.15%, or 0.2% solution. In various embodiments, the permeabilization buffer comprises at least one of Triton™ X-100, Prionex® gelatin, salmon sperm DNA, mouse IgG, EDTA. In various embodiments, the permeabilization buffer comprises Triton™ X-100. In particular embodiments, the permeabilization buffer comprises 0.1% Triton™ X-100.

[00120] In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-conjugated oligonucleotides. In various embodiments, the antibody-conjugated oligonucleotide binds to the analyte located on the surface of the cell to generate a surface antibody-oligonucleotide conjugate. In various embodiments, the antibody-oligonucleotide conjugate enters the permeabilized cell to contact the analyte located internally within the cell to generate an intracellular antibody-oligonucleotide conjugate. In various embodiments, the antibody-conjugated oligonucleotide binds to the analyte located on the surface of the cell to generate a surface antibody-oligonucleotide conjugate, and enters the permeabilized cell to contact the analyte located internally within the cell to generate an intracellular antibody-oligonucleotide conjugate.

[00121] In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates (e.g., antibody tag) for 10 minutes to 30 hours. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates for 10-60 minutes. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates for 30 minutes. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates for 10-25 hours. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates for 16-20 hours. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates overnight.

[00122] In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates at a temperature between 0-30 °C. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates at a temperature between 2-30 °C. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates at a temperature between 3-6 °C. In various embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates at a temperature of about 4 °C. In particular embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates at room temperature (e.g., about 22 °C). In particular embodiments, incubating the cell 102 with antibodies includes incubating the cell 102 with antibody-oligonucleotide conjugates on ice (e.g., at about 0 °C).

[00123] In various embodiments, the number of cells incubated with antibodies can be 10² cells, 10³ cells, 10⁴ cells, 10⁵ cells, 10⁶ cells, or 10⁷ cells. In various embodiments, between 10³ cells and 10⁷ cells are incubated with antibodies. In various embodiments, between 10⁴ cells and 10⁶ cells are incubated with antibodies. In various embodiments, varying concentrations of antibodies are incubated with cells. In various embodiments, for an antibody in the protein panel, a concentration of 0.1 nM, 0.5 nM, 1.0 nM, 2.0 nM, 3.0 nM, 4.0 nM, 5.0 nM, 6.0 nM, 7.0 nM, 8.0 nM, 9.0 nM, 10.0 nM, 20 nM, 30 nM, 40 nM, 50 nM, 60 nM, 70 nM, 80 nM, 90 nM, or 100 nM of the antibody is incubated with cells.

[00124] In various embodiments, cells 102 are incubated with a plurality of different antibodies. In one embodiment, amongst the plurality of different antibodies, each antibody exhibits binding affinity for an analyte of a panel. For example, each antibody exhibits binding affinity for a protein of a panel. Examples of proteins included in protein panels are described herein. The incubation of cells with antibodies leads to the binding of the antibodies against target epitopes. In various embodiments, a concentration of 0.05 nM, 0.1 nM, 0.5 nM, 1.0 nM, 2.0 nM, 3.0 nM, 4.0 nM, 5.0 nM, 6.0 nM, 7.0 nM, 8.0 nM, 9.0 nM, 10.0 nM, 20 nM, 30 nM, 40 nM, 50 nM, 60 nM, 70 nM, 80 nM, 90 nM, or 100 nM for each antibody of the antibody panel is incubated with cells.

[00125] Following incubation, the cells 102 may be washed (e.g., with a wash buffer) one or more times to remove excess antibodies that are unbound. In various embodiments, the cells 102 are washed for at least 1 minute, 2 minutes, 3 minutes, 4 minutes, 5 minutes, 6 minutes, 7 minutes, 8 minutes, 9 minutes, 10 minutes, 15 minutes, or 20 minutes to wash away unbound antibody-oligonucleotide conjugates. In various embodiments, the cells 102 are washed at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 times to wash away unbound antibody-oligonucleotide conjugates. In particular embodiments, the cells 102 are washed 4 times. In particular embodiments, the cells 102 are washed for 5 minutes. In particular embodiments, the cells 102 are washed 4 times for 5 minutes each.

[00126] In various embodiments, the antibodies are labeled with one or more oligonucleotides, also referred to as antibody oligonucleotides. Such oligonucleotides can be read out with microfluidic barcoding and DNA sequencing, thereby enabling the detection of cell analytes of interest. When an antibody binds its target, the antibody oligonucleotide is carried with it and thus allows the presence of the target analyte to be inferred based on the presence of the oligonucleotide tag. In some implementations, analyzing antibody oligonucleotides provides an estimate of the different epitopes present in the cell.

[00127] The single cell workflow device 106 refers to a device that processes individual cells to generate amplicons for sequencing. In various embodiments, the single cell workflow device 106 can encapsulate individual cells into a first droplet, lyse cells within the first droplet, perform cell barcoding of cell lysate in a second droplet, and generate amplicons in the second droplet. Thus, amplicons can be collected and sequenced. In various embodiments, the single cell workflow device 106 further includes a sequencer for sequencing the amplicons. In various embodiments, at least 10, 50, 100, 150, 200, 250, 300, 350, 400, 450, or 500 amplicons (e.g., DNA amplicons and/or amplicons derived from antibody oligonucleotides) are generated in a workflow. In various embodiments, the single cell workflow device 106 can be applied to one or more cell lines. In various embodiments, the single cell workflow device 106 can be applied to at least 2, 3, 4, 5, or 6 cell lines, or combinations thereof. In particular embodiments, the one or more cell lines include HL60, K562, KCL22, Jurkat, T47D, KG1, A549, and/or mixtures and/or mergers thereof.

[00128] The computing device 108 is configured to receive the sequenced reads from the single cell workflow device 106. In various embodiments, the computing device 108 is communicatively coupled to the single cell workflow device 106 and therefore directly receives the sequence reads from the single cell workflow device 106. The computing device 108 analyzes the sequence reads to generate a cellular analysis 110. In one embodiment, the computing device 108 analyzes the sequence reads to determine presence or absence of the analytes. For example, the computing device analyzes the sequence reads to determine presence or absence of surface proteins and/or intracellular proteins. In one embodiment, the computing device 108 analyzes the sequence reads to determine cellular genotypes and phenotypes. The computing device 108 uses the determined cellular genotypes and phenotypes to discover new cell subpopulations and/or to classify individual cells into cell subpopulations. Thus, in such embodiments, the cellular analysis 110 can refer to the identification of cell subpopulations or the classifications of cells into cell subpopulations. In one embodiment, the computing device 108 analyzes the sequence reads to determine one or more mutations such as a single-nucleotide variant (SNV), an insertion or deletion mutation (indel), or a copy number variation (CNV).

[00129] In various embodiments, the computing device 108 includes a merger detection system 130 to distinguish mergers from real droplets, as described herein. In general, the merger detection system 130 processes source data 120 obtained from a single cell workflow (e.g., step 106 in FIG. 1A) and provides output data 140 for further analysis (e.g., for the computing device 108 to generate a cellular analysis 110). In various embodiments, the computing device 108 is the merger detection system 130.

[00130] In various embodiments, the source data 120 includes sequence reads of a plurality of amplicons that include barcodes associated with a plurality of cells. In particular embodiments, the source data 120 includes a raw sequence read count matrix of one or more barcodes obtained from a single cell workflow (e.g., step 106 in FIG. 1A).

[00131] In various embodiments, the merger detection system 130 processes (e.g., normalizes and/or clusters) the source data 120 using one or more merge detection methods.

[00132] In various embodiments, the merge detection methods as described herein include the “correlation clustering” method (also “correlation-cluster” or “corr-cluster” as used herein), as further described in FIG. 2A and FIGS. 4A-4D in the Examples section; the “Dimensional Reduction” method, as further described in FIG. 2B and FIGS. 5A-5D and 6A-6D in the Examples section; a combination of the results of both; or combinations thereof (e.g., the “Correlation-Dimensional Reduction” method as further described in FIGS. 11A-11D and 12). In various embodiments, the “Dimensional Reduction” method or “Dimensional Reduction” portion of the method involves a dimensionality reduction analysis selected from one of principal component analysis (PCA), linear discriminant analysis (LDA), T-distributed stochastic neighbor embedding (t-SNE), or uniform manifold approximation and projection (UMAP). In particular embodiments, the “Dimensional Reduction” method involves Uniform Manifold Approximation and Projection (UMAP), hereafter referred to as the “UMAP” method. Furthermore, the data may be further processed using “K-means” or “Split” methods (e.g., using results obtained from UMAP as input), as described further in the Examples section.

[00133] In various embodiments, the output data 140 includes output from the merger detection system 130. For example, the output data 140 may include barcodes identified as mergers, singlet-like, or background barcodes. Thus, the computing device 108 may perform an additional analysis on a subset of the barcodes to generate the cellular analysis 110. For example, the computing device 108 may analyze barcodes identified as singlet-like (e.g., from singlet or doublet droplets) to determine presence or absence of surface proteins and/or intracellular proteins, determine cellular genotypes and phenotypes, and/or to discover new cell subpopulations and/or to classify individual cells into cell subpopulations. Generally, the computing device 108 excludes barcodes identified as mergers from the additional analysis.

[00134] Reference is now made to FIG. 1B, which depicts a block diagram illustrating the computer logic components of the merger detection system 130, in accordance with an embodiment. Specifically, the merger detection system 130 includes a data normalization module 150, a data clustering module 160, a source data store 170, a normalized data store 180, and an output data store 190, as described herein. In various embodiments, the merger detection system 130 can be configured differently with additional or fewer modules. For example, the merger detection system 130 need not include the source data store 170, and instead, the source data (e.g., source data 120 in FIG. 1A) is stored in a different system.

[0001] Generally, the data normalization module 150 processes (e.g., normalizes) the source data (e.g., source data 120 in FIG. 1A). In various embodiments, the data normalization module 150 performs transformation on the raw read count matrix. The data clustering module 160 processes (e.g., clusters) the coordinates provided by the data normalization module 150 to a 2D space. The source data store 170 stores source data (e.g., source data 120 in FIG. 1A) for the data normalization module 150 to process. The normalized data store 180 stores normalized data provided by the data normalization module 150 for the data clustering module 160 to process. The output data store 190 stores data obtained from the data clustering module 160, and may include a classification of barcodes that are analyzed.

[00135] Depending on the clustering methods, the data normalization module 150 and data clustering module 160 may perform one or more merger detection methods (e.g., “correlation cluster” and/or “Dimensional Reduction” methods in FIG. 1A and the Examples section), as described herein. In general, the merger detection methods are herein described in reference to two phases: 1) a normalization phase performed by the data normalization module 150 and 2) a clustering phase performed by the data clustering module 160. In various embodiments, the normalization phase refers to normalizing (e.g., constructing, adjusting, and/or arranging) the source data 120 for use in the clustering phase. In various embodiments, the clustering phase refers to clustering the normalized data for determining, identifying, or differentiating mergers (e.g., from singlet-like barcodes). In various embodiments, the clustering phase further includes generating a 2D space for visualization of the clusters. Therefore, the various barcodes provided in the source data, as well as the various drops and/or cells associated with the barcodes, can be determined, identified, and classified as one of the categories as described herein (e.g., background, singlet-like, or mergers).

Correlation-cluster Method

Normalization Phase

[00136] In general, the normalization phase of the correlation clustering method includes analyzing sequence reads including barcode sequences to obtain two parameters: a) the coverage value and b) the correlation value, as described herein. In various embodiments, the normalization phase of the correlation clustering method includes analyzing raw read counts of barcodes and amplicons. The raw read counts can be structured as a raw read count matrix of the barcodes and amplicons.

[00137] In various embodiments, the normalization phase involves removing amplicons from the raw read counts that have below a threshold number of reads across all cells. In various embodiments, the threshold number of reads is any of less than 10 reads, less than 9 reads, less than 8 reads, less than 7 reads, less than 6 reads, less than 5 reads, less than 4 reads, less than 3 reads, less than 2 reads, or less than 1 read. In particular embodiments, the threshold number of reads is less than 1 read. Thus, the normalization phase involves removing amplicons from the raw read counts that have 0 reads across all cells.
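
By way of illustration only, the amplicon filtering step described above can be sketched in Python as follows. The function name `drop_empty_amplicons`, the list-of-rows layout of the raw read count matrix (one row per barcode, one column per amplicon), and the choice to sum reads over all barcodes are assumptions made for this sketch and are not taken from the disclosure.

```python
def drop_empty_amplicons(counts, amplicon_ids, min_total_reads=1):
    """Remove amplicon columns whose reads, summed over all barcodes, fall below a threshold.

    counts: list of rows, one per barcode; each row lists read counts per amplicon.
    With the default threshold of 1, only amplicons with 0 reads everywhere are removed.
    (Hypothetical helper; summing over barcodes is an assumption of this sketch.)
    """
    totals = [sum(row[j] for row in counts) for j in range(len(amplicon_ids))]
    keep = [j for j, total in enumerate(totals) if total >= min_total_reads]
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [amplicon_ids[j] for j in keep]


# Toy raw read count matrix: 3 barcodes x 3 amplicons (made-up numbers).
counts = [
    [12, 0, 7],
    [30, 0, 0],
    [5, 0, 9],
]
filtered, kept_ids = drop_empty_amplicons(counts, ["amp1", "amp2", "amp3"])
```

In this toy matrix only the second amplicon has 0 reads across all barcodes, so it is the only column removed.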

[00138] In various embodiments, the normalization phase involves removing background noise as represented by barcodes with low coverage. Barcodes with low coverage are defined as barcodes with less than a threshold number of reads per amplicon per cell. In various embodiments, the threshold number of reads per amplicon per cell is less than 10 reads per amplicon per cell, less than 9 reads per amplicon per cell, less than 8 reads per amplicon per cell, less than 7 reads per amplicon per cell, less than 6 reads per amplicon per cell, less than 5 reads per amplicon per cell, less than 4 reads per amplicon per cell, less than 3 reads per amplicon per cell, less than 2 reads per amplicon per cell, or less than 1 read per amplicon per cell. In particular embodiments, the threshold number of reads is less than 2 reads per amplicon per cell. Thus, the normalization phase involves removing background noise as represented by barcodes with low coverage (e.g., with less than 2 reads per amplicon per cell).
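
As a non-limiting sketch, the background removal step above might be expressed in Python as follows. The helper name `drop_low_coverage_barcodes` is hypothetical, and interpreting "reads per amplicon per cell" as the mean read count over the amplicons of a barcode is an assumption of this sketch.

```python
def drop_low_coverage_barcodes(counts, barcode_ids, min_mean_reads=2.0):
    """Keep only barcodes whose mean reads per amplicon reach the threshold.

    counts: list of rows, one per barcode; each row lists read counts per amplicon.
    Barcodes with a mean below min_mean_reads are treated as background noise.
    (Hypothetical helper; the mean-based definition is an assumption.)
    """
    kept_rows, kept_ids = [], []
    for barcode, row in zip(barcode_ids, counts):
        mean_reads = sum(row) / len(row)
        if mean_reads >= min_mean_reads:
            kept_rows.append(row)
            kept_ids.append(barcode)
    return kept_rows, kept_ids


# Toy data: the second barcode averages 1 read per amplicon and is dropped.
counts = [
    [40, 35, 50],
    [1, 0, 2],
    [3, 2, 1],
]
kept_rows, kept_barcodes = drop_low_coverage_barcodes(counts, ["bc1", "bc2", "bc3"])
```

The third barcode averages exactly 2 reads per amplicon, which is not less than the threshold, so it is retained.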

[00139] In various embodiments, the normalization phase involves generating an average amplicon profile, which is represented by the mean reads per amplicon across some or all of the barcodes. Thus, the average amplicon profile represents a profile of a plurality of amplicons (as opposed to a single amplicon). In particular embodiments, the average amplicon profile represents a profile of at least 10 amplicons, at least 20 amplicons, at least 30 amplicons, at least 40 amplicons, at least 50 amplicons, at least 75 amplicons, at least 100 amplicons, at least 200 amplicons, at least 300 amplicons, at least 400 amplicons, at least 500 amplicons, at least 1000 amplicons, at least 2000 amplicons, or at least 5000 amplicons.
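
For illustration, the average amplicon profile can be computed as a column-wise mean of the read count matrix. The sketch below assumes a list-of-rows matrix (one row per barcode) and uses all barcodes; the function name `average_amplicon_profile` is hypothetical.

```python
def average_amplicon_profile(counts):
    """Return the mean reads per amplicon across all barcodes (one value per amplicon).

    counts: list of rows, one per barcode; each row lists read counts per amplicon.
    """
    n_barcodes = len(counts)
    n_amplicons = len(counts[0])
    return [sum(row[j] for row in counts) / n_barcodes for j in range(n_amplicons)]


# Toy data: 3 barcodes x 3 amplicons (made-up numbers).
counts = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 60, 90],
]
profile = average_amplicon_profile(counts)  # [20.0, 40.0, 60.0]
```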

[00140] In various embodiments, the normalization phase involves performing a linear regression for at least one barcode amplicon profile against the average amplicon profile (e.g., the average amplicon profile generated in the normalization phase). In various embodiments, the normalization phase involves performing a linear regression for at least two, at least three, at least four, at least five, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 5000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, or at least 1 million barcode amplicon profiles against the average amplicon profile (e.g., the average amplicon profile generated in the normalization phase). In various embodiments, the normalization phase involves performing a linear regression for each barcode amplicon profile against the average amplicon profile (e.g., the average amplicon profile generated in the normalization phase).

[00141] In various embodiments, the linear regression performed for a barcode amplicon profile in the normalization phase generates or provides a correlation value for at least one barcode. In various embodiments, the linear regression performed for a barcode amplicon profile in the normalization phase generates or provides a correlation value for each barcode. The correlation value for a barcode is defined as a coefficient of determination (R²) of the linear regression.
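The average amplicon profile and the per-barcode correlation value can be sketched as below. This is a minimal example under stated assumptions: the function names and data layout are hypothetical, and R² is computed as the squared Pearson correlation, which equals the coefficient of determination for a simple least-squares regression with an intercept.

```python
from statistics import mean


def average_amplicon_profile(counts):
    """Mean reads per amplicon across all barcodes (column-wise mean)."""
    profiles = list(counts.values())
    return [mean(column) for column in zip(*profiles)]


def r_squared(x, y):
    """Coefficient of determination of the least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return (sxy * sxy) / (sxx * syy)


# Hypothetical read counts: barcode -> reads per amplicon.
counts = {
    "BC1": [10, 20, 30, 40],
    "BC2": [20, 40, 60, 80],  # same shape as BC1 at twice the depth
    "BC3": [40, 10, 35, 15],  # shape unlike the average profile
}
avg = average_amplicon_profile(counts)
correlation = {bc: r_squared(avg, profile) for bc, profile in counts.items()}
```

Because R² is insensitive to overall scale, BC1 and BC2 receive the same correlation value here even though their coverage differs, which is why coverage is tracked as a separate parameter.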

[00142] In various embodiments, the normalization phase involves generating a coverage value of a barcode amplicon profile for each or at least one barcode. In various embodiments, the normalization phase generates a coverage value for at least 10 barcodes, at least 50 barcodes, at least 100 barcodes, at least 500 barcodes, at least 1000 barcodes, at least 5000 barcodes, at least 10,000 barcodes, at least 50,000 barcodes, at least 100,000 barcodes, at least 500,000 barcodes, or at least 1 million barcodes. In various embodiments, the coverage value for a barcode is determined based on a mean number of reads for the barcode. In particular embodiments, the coverage value is defined as the log base 10 of the mean reads per barcode.

[00143] In various embodiments, the normalization phase involves one or more steps of removing amplicons from the raw read counts that have below a threshold number of reads across all cells, removing background noise as represented by barcodes with low coverage, generating an average amplicon profile, which is represented by the mean reads per amplicon across some or all of the barcodes, performing a linear regression for at least one barcode amplicon profile against the average amplicon profile, and generating a coverage value for each or at least one barcode. In various embodiments, the normalization phase involves each of removing amplicons from the raw read counts that have below a threshold number of reads across all cells, removing background noise as represented by barcodes with low coverage, generating an average amplicon profile, which is represented by the mean reads per amplicon across some or all of the barcodes, performing a linear regression for at least one barcode amplicon profile against the average amplicon profile, and generating a coverage value for each or at least one barcode.
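The coverage value of the particular embodiments (log base 10 of the mean reads for a barcode) can be illustrated with a short sketch. The function name is hypothetical, and the sketch assumes that low-coverage barcodes have already been removed so that the mean is positive.

```python
import math
from statistics import mean


def coverage_value(reads):
    """log10 of the mean reads for one barcode (reads: counts per amplicon).

    Assumes the mean is positive, e.g. after low-coverage barcodes were removed.
    """
    return math.log10(mean(reads))


coverage = coverage_value([80, 120, 100, 100])  # mean of 100 reads
```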

Clustering Phase

[00144] In general, the clustering phase of the correlation clustering method includes generating a space (e.g., 2D space) that includes one or more parameter values, and/or performing parameter optimization and/or labelling. In various embodiments, the clustering phase of the correlation clustering method includes generating a space that includes one or more parameter values. In particular embodiments, the clustering phase of the correlation clustering method includes generating a 2D space that includes one or more parameter values. In various embodiments, the clustering phase of the correlation clustering method includes performing parameter optimization and/or labelling, as described further below.

[00145] In various embodiments, the clustering phase of the correlation clustering method involves running a method (e.g., DBSCAN method) by implementing one or more parameter values to obtain clusters. In various embodiments, the method includes a density-based clustering method. In particular embodiments, the method includes DBSCAN method. In various embodiments, the parameter values include the correlation value obtained in the normalization phase as described above. In various embodiments, the parameter values include the coverage value obtained in the normalization phase as described above. In particular embodiments, the parameter values include both the correlation value and the coverage value obtained in the normalization phase as described above.

[00146] In various embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where at least 2 clusters were identified. In various embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where at least 3 clusters were identified. In various embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where at least 4 clusters were identified. In various embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where at least 5 clusters were identified. In various embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where at least 6 clusters were identified. In particular embodiments, the clustering phase of the correlation clustering method involves selecting a case involving parameter values where 2 clusters were identified.

[00147] In various embodiments, if only one case is selected, the clustering phase of the correlation clustering method involves, among the identified clusters (e.g., clusters where the parameter values are selected in the prior step), selecting the cluster that has a higher median coverage value and lower correlation (R²) value.

[00148] In various embodiments, if no case is selected, the clustering phase of the correlation clustering method may further involve marking the correlation clustering method as failed, indicating that it could not differentiate singlets from mergers.

[00149] In various embodiments, if two or more cases are selected, the clustering phase of the correlation clustering method may further involve selecting the case that includes the fewest unassigned barcodes. In various embodiments, the unassigned barcodes are represented by the noise points. In particular embodiments, the unassigned barcodes are represented by the noise points labelled by the DBSCAN algorithm.

[00150] In various embodiments, for the case that is selected, the clustering phase of the correlation clustering method further involves marking the cluster with the higher median coverage value and lower correlation (R²) value as singlets and the other cluster(s) as merger(s).
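The case-selection logic of the clustering phase can be sketched as follows. The sketch assumes that each candidate case is a list of cluster labels (one per barcode, with -1 marking noise points) produced elsewhere by a density-based method such as DBSCAN under one set of parameter values; the clustering itself is not shown. The function names and the restriction to exactly two clusters (the particular embodiment) are illustrative assumptions.

```python
from statistics import median


def label_case(labels, coverage, correlation):
    """Return {cluster_id: 'singlet' | 'merger'}, or None if the case is unusable."""
    clusters = sorted(set(labels) - {-1})  # -1 marks noise (unassigned) points
    if len(clusters) != 2:
        return None
    stats = {}
    for c in clusters:
        idx = [i for i, lab in enumerate(labels) if lab == c]
        stats[c] = (median(coverage[i] for i in idx),
                    median(correlation[i] for i in idx))
    a, b = clusters
    for first, second in ((a, b), (b, a)):
        higher_cov = stats[first][0] > stats[second][0]
        lower_r2 = stats[first][1] < stats[second][1]
        if higher_cov and lower_r2:
            return {first: "singlet", second: "merger"}
    return None


def select_case(cases, coverage, correlation):
    """Pick the usable case with the fewest noise points; None means failure."""
    usable = []
    for labels in cases:
        naming = label_case(labels, coverage, correlation)
        if naming is not None:
            usable.append((labels.count(-1), labels, naming))
    if not usable:
        return None  # the method is marked as failed
    _, labels, naming = min(usable, key=lambda item: item[0])
    return labels, naming


# Hypothetical per-barcode values and three candidate label sets.
coverage = [2.0, 2.1, 2.2, 1.0, 1.1, 1.2]
correlation = [0.50, 0.55, 0.60, 0.90, 0.92, 0.95]
cases = [
    [0, 0, 0, 0, 0, 0],    # one cluster only: rejected
    [0, 0, -1, 1, 1, -1],  # two clusters, two noise points
    [0, 0, 0, 1, 1, -1],   # two clusters, one noise point: selected
]
result = select_case(cases, coverage, correlation)
```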

Dimensional Reduction Method

Normalization Phase

[00151] In general, the normalization phase of the dimensional reduction method includes analyzing sequence reads including barcode sequences to generate normalized reads as described herein. In various embodiments, the normalization phase of the dimensional reduction method includes analyzing raw read counts of barcodes and amplicons. The raw read counts can be structured as a raw read count matrix of the barcodes and amplicons.

[00152] In various embodiments, the normalization phase involves removing incomplete background barcodes as represented by barcodes with low raw read counts. Barcodes with low read counts are defined as barcodes with less than a threshold number of reads across a fraction of amplicons. In various embodiments, less than a threshold number of reads is any of less than 10 reads, less than 9 reads, less than 8 reads, less than 7 reads, less than 6 reads, less than 5 reads, less than 4 reads, less than 3 reads, less than 2 reads, or less than 1 read. In particular embodiments, less than a threshold number of reads is less than 3 reads. In various embodiments, the fraction of amplicons includes at least 5%, at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or 100% of amplicons. In particular embodiments, the fraction of amplicons includes at least 20% of amplicons. In various embodiments, the fraction of amplicons includes at least 10 amplicons, at least 20 amplicons, at least 30 amplicons, at least 40 amplicons, at least 50 amplicons, at least 75 amplicons, at least 100 amplicons, at least 200 amplicons, at least 300 amplicons, at least 400 amplicons, at least 500 amplicons, at least 1000 amplicons, at least 2000 amplicons, or at least 5000 amplicons.

[00153] In various embodiments, the normalization phase involves generating normalized reads for each or some of the amplicons. In various embodiments, generating normalized reads for an amplicon is performed by dividing the counts for a barcode by the mean reads of the amplicon. In various embodiments, normalized reads are generated for at least 10 amplicons, at least 20 amplicons, at least 30 amplicons, at least 40 amplicons, at least 50 amplicons, at least 75 amplicons, at least 100 amplicons, at least 200 amplicons, at least 300 amplicons, at least 400 amplicons, at least 500 amplicons, at least 1000 amplicons, at least 2000 amplicons, or at least 5000 amplicons.

[00154] In various embodiments, the normalization phase involves generating normalized reads for each or some of the barcodes. In various embodiments, generating normalized reads for a barcode is performed by dividing the counts for an amplicon by the median of the normalized reads for the barcode. In various embodiments, the median of the normalized reads for the barcode is obtained during the step of generating normalized reads for each or some of the amplicons as described above. In various embodiments, normalized reads are generated for at least 10 barcodes, at least 50 barcodes, at least 100 barcodes, at least 500 barcodes, at least 1000 barcodes, at least 5000 barcodes, at least 10,000 barcodes, at least 50,000 barcodes, at least 100,000 barcodes, at least 500,000 barcodes, or at least 1 million barcodes.

[00155] In various embodiments, the normalization phase involves one or more steps of removing incomplete background barcodes as represented by barcodes with low raw read counts, generating normalized reads for each or some of the amplicons, and generating normalized reads for each or some of the barcodes. In various embodiments, the normalization phase involves each of removing incomplete background barcodes as represented by barcodes with low raw read counts, generating normalized reads for each or some of the amplicons, and generating normalized reads for each or some of the barcodes.
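The two normalization steps (by amplicon mean, then by barcode median) can be sketched as below. The matrix layout (rows as barcodes, columns as amplicons) and the function name `normalize` are assumptions for illustration. The sketch also assumes that every amplicon mean and barcode median is nonzero, for example because incomplete background barcodes were removed beforehand.

```python
from statistics import mean, median


def normalize(matrix):
    """matrix[i][j] = raw reads for barcode i and amplicon j (assumed layout)."""
    # Step 1: divide each count by the mean reads of its amplicon (column).
    amplicon_means = [mean(col) for col in zip(*matrix)]
    by_amplicon = [[v / m for v, m in zip(row, amplicon_means)]
                   for row in matrix]
    # Step 2: divide each row by the median of its amplicon-normalized reads.
    normalized = []
    for row in by_amplicon:
        med = median(row)
        normalized.append([v / med for v in row])
    return normalized


raw = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 60, 90],
]
norm = normalize(raw)
```

In this toy matrix all barcodes share the same amplicon profile and differ only in depth, so every normalized value comes out as 1.0. Deviations from 1.0 in real data would reflect amplicon-specific differences between barcodes.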

Clustering Phase

[00156] In general, the clustering phase of the dimensional reduction method includes generating a space with reduced dimensions including coordinates transformed from the normalized reads (or read counts) obtained from the normalization phase using a dimensional reduction analysis. In particular embodiments, the dimensional reduction analysis includes UMAP method. In particular embodiments, the dimensional reduction analysis reduces the space to two dimensions, thus creating a 2D space.

[00157] In various embodiments, the clustering phase of the dimensional reduction method includes generating a grid of points on the coordinates. In particular embodiments, the coordinates are UMAP coordinates. In various embodiments, the points are equally spaced on the grid. In various embodiments, the points are not equally spaced on the grid.

[00158] In various embodiments, the clustering phase of the dimensional reduction method includes generating one or more lines at each or some of the points generated on the grid. In various embodiments, the one or more lines include a slope, e.g., relative to a base line (e.g., a horizontal line) on the grid. In various embodiments, the slope is from -90° to 90°. In various embodiments, the slope is about -90 degrees, about -85 degrees, about -80 degrees, about -75 degrees, about -70 degrees, about -65 degrees, about -60 degrees, about -55 degrees, about -50 degrees, about -45 degrees, about -40 degrees, about -35 degrees, about -30 degrees, about -25 degrees, about -20 degrees, about -15 degrees, about -10 degrees, about -5 degrees, about 0 degrees, about 5 degrees, about 10 degrees, about 15 degrees, about 20 degrees, about 25 degrees, about 30 degrees, about 35 degrees, about 40 degrees, about 45 degrees, about 50 degrees, about 55 degrees, about 60 degrees, about 65 degrees, about 70 degrees, about 75 degrees, about 80 degrees, about 85 degrees, or about 90 degrees.

[00159] In various embodiments, the clustering phase of the dimensional reduction method includes identifying lines that split the barcodes into at least 2 clusters. In particular embodiments, the clustering phase of the dimensional reduction method includes identifying cases including lines that split the barcodes into 2 clusters.

[00160] In various embodiments, the clustering phase of the dimensional reduction method includes labelling the clusters in the identified cases. In various embodiments, the cluster with a higher median is labeled as singlet-like or “valid-cell” cluster. In various embodiments, the cluster with a lower median is labeled as merger or “invalid-barcodes” cluster.

[00161] In various embodiments, labelling the clusters in the identified cases is performed by implementing one or more criteria. For example, the criterion may include the singlet-like cluster having barcodes that are above a first threshold number or fraction. In another example, the criterion may include the merger cluster having barcodes that are above a second threshold number or fraction. Thus, not all barcodes are called singlet-like or mergers. Accordingly, lines that do not meet the criterion are discarded. In various embodiments, the first threshold number or fraction is at least 5%, at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or 100% of barcodes that are on the plot. In various embodiments, the second threshold number or fraction is at least 5%, at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or 100% of barcodes that are on the plot. In particular embodiments, the first threshold number or fraction is at least 10%. In particular embodiments, the second threshold number or fraction is at least 5%. In another example, the criterion may include having below a third threshold number or fraction of cells close to the identified line(s). As used herein, close to the identified line may refer to less than a distance with respect to the line. In various embodiments, the distance is less than 0.01%, less than 0.05%, less than 0.1%, less than 0.5%, less than 1%, or less than 5% of one or any dimension of the plot. In various embodiments, the third threshold number or fraction is below 0.1%, below 0.5%, below 1%, below 5%, below 10%, below 20%, below 30%, below 40%, or below 50% of cells.

[00162] In various embodiments, the dimensional reduction method may fail if no line can be identified or labelled.
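The evaluation of a single candidate line can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: the per-barcode `values` whose medians decide which side is singlet-like, the use of an absolute distance `near_dist` rather than a percentage of a plot dimension, and the illustrative defaults for `near_dist` and `max_near_frac`. The 10% and 5% defaults follow the particular embodiments for the first and second thresholds.

```python
import math
from statistics import median


def split_by_line(points, values, origin, angle_deg,
                  min_singlet_frac=0.10, min_merger_frac=0.05,
                  near_dist=0.5, max_near_frac=0.10):
    """Label barcodes by side of a line, or return None if the line is discarded.

    points: 2D coordinates (e.g. from a dimensional reduction), one per barcode.
    origin: grid point the line passes through; angle_deg: slope in degrees.
    """
    theta = math.radians(angle_deg)
    nx, ny = -math.sin(theta), math.cos(theta)  # unit normal of the line
    signed = [(x - origin[0]) * nx + (y - origin[1]) * ny for x, y in points]
    above = [i for i, d in enumerate(signed) if d > 0]
    below = [i for i, d in enumerate(signed) if d <= 0]
    if not above or not below:
        return None  # the line does not split the barcodes into two groups
    near = sum(1 for d in signed if abs(d) < near_dist)
    if near / len(points) >= max_near_frac:
        return None  # too many barcodes sit close to the line
    med_above = median(values[i] for i in above)
    med_below = median(values[i] for i in below)
    singlet, merger = (above, below) if med_above > med_below else (below, above)
    if len(singlet) / len(points) < min_singlet_frac:
        return None
    if len(merger) / len(points) < min_merger_frac:
        return None
    labels = ["merger"] * len(points)
    for i in singlet:
        labels[i] = "singlet-like"
    return labels


points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (6, 5), (5, 6), (6, 6)]
values = [1, 1, 1, 1, 3, 3, 3, 3]  # hypothetical per-barcode quantity
labels = split_by_line(points, values, origin=(3, 3), angle_deg=-45)
```

A full search would repeat this evaluation for every grid point and slope and keep only the lines that are not discarded.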

Combination of Results from Correlation-cluster and Dimensional Reduction Methods

[00163] In some embodiments, both correlation-cluster and dimensional reduction methods are implemented, and one of the methods may be selected as the final output. In various embodiments, the final output is implemented for classification of barcodes, as described herein.

[00164] In various embodiments, the correlation clustering method is selected if it is successful or does not fail. For example, a successful correlation clustering method can identify 2 clusters with appropriate properties that meet the criteria as described above. In various embodiments, the correlation clustering method is associated with finding a specific signal.

[00165] In various embodiments, the dimensional reduction method is selected if the correlation clustering method fails. In various embodiments, the dimensional reduction method is given a lower priority than the correlation clustering method.

[00166] In various embodiments, if both correlation cluster and dimensional reduction methods fail or are not successful, both methods may not be considered as successful for identifying mergers or singlet-like barcodes.
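The priority between the two methods can be summarized in a short sketch. The function name and the convention that a failed method returns `None` are assumptions for illustration.

```python
def choose_result(correlation_result, dimred_result):
    """Each argument is a method's output, or None if that method failed."""
    if correlation_result is not None:
        # The correlation clustering method has priority when it succeeds.
        return "correlation-clustering", correlation_result
    if dimred_result is not None:
        # The dimensional reduction method is the fallback.
        return "dimensional-reduction", dimred_result
    return "failed", None  # neither method separated mergers from singlets


chosen = choose_result(None, ["singlet-like", "merger"])
```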

Correlation-Dimensional Reduction Method

[00167] Generally, the correlation-dimensional reduction method may include portions of the steps described herein regarding the correlation-cluster and the dimensional reduction methods. In various embodiments, the correlation-dimensional reduction method can achieve improved clustering and differentiation between merger clusters and non-merger clusters (e.g., singlets). In particular embodiments, the correlation-dimensional reduction method achieves improved performance on edge cases (e.g., cases in which it is difficult to clearly categorize a barcode as a merger or a non-merger).

[00168] In various embodiments, the correlation-dimensional reduction method involves one or more steps of the normalization phase or the clustering phase of the correlation-cluster method. In particular embodiments, the correlation-dimensional reduction method involves one or more steps of the normalization phase of the correlation-cluster method, and does not include the steps of the clustering phase of the correlation-cluster method. For example, the correlation-dimensional reduction method can involve analyzing sequence reads including barcode sequences to obtain two parameters: a) the coverage value and b) the correlation value, as described herein. Thus, the steps described above regarding the determination of the coverage value and correlation value in the correlation-cluster method can be similarly applied here for the correlation-dimensional reduction method.

[00169] In various embodiments, the correlation-dimensional reduction method involves one or more steps of the normalization phase or the clustering phase of the dimensional reduction method. In various embodiments, the correlation-dimensional reduction method involves one or more steps of the normalization phase of the dimensional reduction method. In particular embodiments, the correlation-dimensional reduction method involves analyzing sequence reads including barcode sequences to generate normalized reads as described herein. Thus, the steps described above regarding the determination of the normalized reads in the dimensional reduction method can be similarly applied here for the correlation-dimensional reduction method.

[00170] In various embodiments, the correlation-dimensional reduction method involves one or more steps of the clustering phase of the dimensional reduction method. For example, the correlation-dimensional reduction method may involve performing a dimensionality reduction methodology to generate a space with reduced dimensions. Example dimensionality reduction methodologies include any of principal component analysis (PCA), linear discriminant analysis (LDA), T-distributed stochastic neighbor embedding (t-SNE), or uniform manifold approximation and projection (UMAP). In particular embodiments, the dimensionality reduction methodology involves performing UMAP. Such a correlation-dimensional reduction method that involves performing UMAP is also referred to herein as a correlation-UMAP method.

[00171] In various embodiments, the dimensional reduction methodology involves performing a dimensional reduction on the normalized reads. In various embodiments, the dimensional reduction methodology involves performing dimensional reduction on a combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value for the barcodes. In various embodiments, the dimensional reduction methodology involves performing dimensional reduction on a combination of each of the normalized counts, the barcode correlation value, and the barcode coverage value for the barcodes. For example, the normalized counts can be structured/stored as a normalized read counts matrix and the barcode correlation and coverage values can be structured/stored as a correlation-coverage matrix. In various embodiments, the dimensional reduction methodology involves performing dimensional reduction on a combination of the normalized read counts matrix and the correlation-coverage matrix (e.g., a concatenation of the normalized read counts matrix and the correlation-coverage matrix).
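The concatenation of the normalized read counts matrix with the correlation-coverage matrix can be sketched as below. The function name and list-of-lists layout are assumptions for illustration. The dimensional reduction itself (e.g., UMAP) would then be applied to the combined matrix and is not shown.

```python
def concatenate_features(normalized_counts, correlation, coverage):
    """Append each barcode's correlation and coverage values to its count row."""
    if not (len(normalized_counts) == len(correlation) == len(coverage)):
        raise ValueError("one row per barcode is required in every input")
    return [list(row) + [r2, cov]
            for row, r2, cov in zip(normalized_counts, correlation, coverage)]


# Hypothetical values for two barcodes and three amplicons.
normalized_counts = [[1.0, 0.9, 1.1], [0.4, 1.0, 1.8]]
correlation = [0.95, 0.60]
coverage = [2.1, 2.6]
combined = concatenate_features(normalized_counts, correlation, coverage)
```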

[00172] In various embodiments, the dimensional reduction methodology involves clustering the dimensionally reduced combination to generate two or more clusters. In various embodiments, the clustering methodology involves performing one of DBSCAN or Flat-HDBSCAN. The number of clusters that are generated using the clustering methodology may be pre-determined, e.g., using a density based approach. For example, the clustering methodology may be pre-set to generate two clusters. As another example, the clustering methodology may be pre-set to generate three clusters. As yet further examples, the clustering methodology may be pre-set to generate four clusters, five clusters, six clusters, seven clusters, eight clusters, nine clusters, or ten clusters.

[00173] In various embodiments, the dimensional reduction methodology involves labeling the clusters. For example, labeling the clusters can involve labelling clusters according to their position in a 2D plot. In various embodiments, labeling the clusters involves labeling clusters according to their position on the correlation-coverage plot. For example, clusters with higher coverage can be labeled as cells whereas clusters with lower coverage can be labeled as mergers. In various embodiments, clusters may also be labeled according to a score, such as a quality score. In various embodiments, clusters may be labeled according to a subscore (e.g., a subscore that is used to generate the quality score). For example, subscores can be any of a silhouette score, a cluster score, or a cell score, each of which is described in further detail herein. In particular embodiments, cell clusters are differently labeled given a low silhouette score (e.g., 0.1 or less).

[00174] In some embodiments, the correlation-dimensional reduction method includes identifying various combinations of merger and/or cell clusters. In some embodiments, the correlation-dimensional reduction method includes identifying 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, or 20 mergers. In some embodiments, the correlation-dimensional reduction method includes identifying up to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, or 20 cell clusters. In some embodiments, the number of clusters that are identified is pre-determined. In some embodiments, the correlation-dimensional reduction method includes equal to or less than a threshold number of clusters. For example, the correlation-dimensional reduction method includes equal to or less than 10, 5, 3, 2, or 1 clusters. In particular embodiments, the correlation-dimensional reduction method includes equal to or less than 3 clusters.

[00175] In various embodiments, the correlation-dimensional reduction method includes validating the labeled one or more clusters. For example, validating the labeled one or more clusters can involve generating a quality score for at least one of the one or more clusters, the quality score representing an efficiency measure of identification of mergers and non-mergers. The quality score can be generated by combining two or more subscores, examples of which include a silhouette score representing a measure of separation between clusters; a cluster score representing a percentage of barcodes categorized as non-outliers; and a cell score representing a measure of a position of a cell cluster in comparison to a position of a merger cluster. In various embodiments, the quality score is a product of each of the silhouette score, the cluster score, and the cell score. Referring to the cell score, it may be generated by performing a linear fit on a merger cluster; and determining a percentage of cells below the linear fit as the cell score. In case of single clusters, the cell-score will be low due to the fact that the cluster identified as mergers will align with the cluster identified as cells.
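The cell score and the combined quality score can be sketched as follows. The sketch assumes that barcodes are given as (x, y) positions on a 2D plot, that scores are expressed as fractions between 0 and 1 rather than percentages, and that the merger cluster contains at least two distinct x values so that a line can be fitted. The function names are hypothetical.

```python
from statistics import mean


def linear_fit(points):
    """Least-squares slope and intercept for (x, y) points."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx  # assumes the x values are not all identical
    return slope, my - slope * mx


def cell_score(cell_points, merger_points):
    """Fraction of cell barcodes lying below the line fitted to the mergers."""
    slope, intercept = linear_fit(merger_points)
    below = sum(1 for x, y in cell_points if y < slope * x + intercept)
    return below / len(cell_points)


def quality_score(silhouette, cluster, cell):
    """Product of the three subscores."""
    return silhouette * cluster * cell


mergers = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]           # fit: y = x + 1
cells = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.5), (2.5, 4.0)]
cell = cell_score(cells, mergers)                        # 3 of 4 cells are below
quality = quality_score(silhouette=0.8, cluster=0.9, cell=cell)
```

Because the quality score is a product, a low value in any one subscore pulls the overall score down, which is consistent with the single-cluster case described above.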

[00176] In various embodiments, the correlation-dimensional reduction method further includes selecting a cluster with a highest quality score; determining whether the cluster with the highest quality score includes at least a threshold number of barcodes; and responsive to the determination that the cluster with the highest quality score includes at least the threshold number of barcodes, completing the validation of the labeled one or more clusters. This validation ensures that sufficient barcodes are labeled as cells.

[00177] In various embodiments, the correlation-dimensional reduction method further includes recovering one or more barcodes not assigned to a cluster as one or more cells. In various embodiments, recovering one or more barcodes not assigned to a cluster comprises: determining a first distance between a barcode not assigned to a cluster and another barcode assigned to a cell cluster; determining a second distance between the barcode not assigned to a cluster and another barcode assigned to a merger cluster; and comparing the first distance and the second distance to determine whether to recover the barcode. In various embodiments, if the second distance (e.g., distance between the barcode not assigned to a cluster and another barcode assigned to a merger cluster) is greater than the first distance (e.g., between a barcode not assigned to a cluster and another barcode assigned to a cell cluster), then this indicates that the unassigned barcode is more similar to a non-merger barcode. In such embodiments, the unassigned barcode can be recovered in response to the determination that the second distance is greater than the first distance. In various embodiments, if the second distance (e.g., distance between the barcode not assigned to a cluster and another barcode assigned to a merger cluster) is at least twice as large as the first distance (e.g., between a barcode not assigned to a cluster and another barcode assigned to a cell cluster), then this indicates that the unassigned barcode is more similar to a non-merger barcode. In such embodiments, the unassigned barcode can be recovered in response to the determination that the second distance is at least twice as large as the first distance.
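The recovery of unassigned barcodes can be sketched as below, using the "at least twice as large" variant of the distance comparison. The sketch assumes Euclidean distance to the nearest barcode of each kind and uses hypothetical function names. The variant that only requires the second distance to be greater than the first would use a strict comparison with a factor of 1.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from point to its nearest neighbour in others."""
    return min(math.dist(point, o) for o in others)


def recover_unassigned(unassigned, cell_points, merger_points, factor=2.0):
    """Return the unassigned barcodes that are recovered as cells."""
    recovered = []
    for p in unassigned:
        first = nearest_distance(p, cell_points)     # distance to a cell barcode
        second = nearest_distance(p, merger_points)  # distance to a merger barcode
        if second >= factor * first:
            recovered.append(p)
    return recovered


cells = [(0.0, 0.0), (1.0, 0.0)]
mergers = [(10.0, 0.0)]
unassigned = [(2.0, 0.0), (7.0, 0.0)]
recovered = recover_unassigned(unassigned, cells, mergers)
```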

Workflow of the correlation-dimensional reduction method or algorithm

[00178] Reference is now made to FIG. 2C, which shows a flow process 205C of performing barcode analysis to detect one or more droplet mergers, in accordance with a third embodiment (e.g., correlation-dimensional reduction method).

[00179] Step 210 involves obtaining a dataset comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells.

[00180] Step 260 involves normalizing the dataset to generate a barcode correlation value and a barcode coverage value for a barcode. Step 260 may be similar to step 220A as described in reference to FIG. 2A. Step 265 involves normalizing the dataset to generate normalized counts of the barcodes. In some scenarios, step 265 may be similar to step 220B as described in reference to FIG. 2B.

[00181] Step 270 involves dimensionally reducing a combination of two or more of the normalized counts, the barcode correlation value, and the barcode coverage value for the barcodes. In some embodiments, dimensional reduction (e.g., UMAP) is performed on the normalized read counts matrix concatenated with the correlation-coverage matrix. Combining both matrices into one matrix and performing a dimensional reduction can increase the odds of separating mergers from cells.

[00182] Step 275 involves clustering barcodes using the dimensionally reduced combination. For example, step 275 can involve performing a data clustering algorithm (e.g., flat-HDBSCAN) on the dimensional reduction for various cluster sizes and numbers to generate a predetermined number of clusters using a density based approach. In some embodiments, the cluster size is equal to or above a threshold number.

[00183] Step 280 involves determining the one or more droplet mergers by labelling one or more clusters. For example, the clusters are labelled according to their positions on the correlation-coverage plot. This may be performed for each clustering parameter combination. Clusters with higher coverage and lower R²/coverage are labeled as cells and the others are labeled as mergers. The cell clusters may have a low silhouette score (default: 0.1) with respect to each other on the correlation-coverage plot. Distinct cell clusters on the correlation-coverage plot are not supported (also not observed in the test data). Clustering parameters where no cell cluster can be identified are discarded from further analysis.

[00184] Step 285 involves validating the labeled one or more clusters. Step 285 may, in various embodiments, be an optional step and therefore, need not be performed (as is indicated by the dotted line shown in FIG. 2C). In various embodiments, the validation involves generating a quality score for each clustering where the clusters may be labeled. This measures the efficacy of the identification of mergers and cells. This quality score is a product of three metrics, including silhouette score, cluster score, and cell-score. In some embodiments, the silhouette score is generated on the correlation-coverage plot by combining the cell clusters into 1 cluster (“0” and “1” are combined into one cluster) and using the mergers (cluster “2”) as the second population. In some embodiments, the outliers (cluster “-1”) are discarded for this score to ensure that the clusters have significant separation on the correlation-coverage plot.
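The silhouette score computed on the combined cell clusters versus the merger cluster can be sketched as follows. The sketch uses the standard mean silhouette coefficient with Euclidean distance; the cluster labels ("0" and "1" for cells, "2" for mergers, "-1" for outliers) follow the example in the text, while the function name and the value returned when one group is empty are assumptions.

```python
import math
from statistics import mean


def two_group_silhouette(points, labels, cell_labels=(0, 1), merger_label=2):
    """Mean silhouette of cells (merged into one group) versus mergers."""
    cells = [p for p, lab in zip(points, labels) if lab in cell_labels]
    mergers = [p for p, lab in zip(points, labels) if lab == merger_label]
    # Outliers (label -1) are in neither list and are therefore discarded.
    if not cells or not mergers:
        return 0.0  # assumption: no separation can be scored with one group
    scores = []
    for own, other in ((cells, mergers), (mergers, cells)):
        for i, p in enumerate(own):
            if len(own) == 1:
                scores.append(0.0)  # usual convention for a singleton group
                continue
            a = mean(math.dist(p, q) for j, q in enumerate(own) if j != i)
            b = mean(math.dist(p, q) for q in other)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 5)]
labels = [0, 1, 0, 2, 2, -1]
silhouette = two_group_silhouette(points, labels)
```

Values close to 1 indicate that the cell and merger populations are well separated, and values near 0 indicate overlapping populations.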

[00185] In some embodiments, “cluster score” refers to the percentage of barcodes not labeled as outliers (“-1”). This factor may enable the selection of clusters with the least number of outliers.

[00186] In some embodiments, “cell-score” is used to measure the relative position of the cell and merger clusters. For example, a linear fit can be generated on each merger population. The score refers to the percentage of cells below the linear fit. This factor enables the identification of single clusters. In case of single clusters, the cell-score may be low because the cluster identified as mergers may align with the cluster identified as cells.

[00187] In various embodiments, step 285 further involves selecting clusters with the highest quality score. If the highest score is below a threshold (e.g., 0.25 by default), then the algorithm has failed to identify more than 1 cluster. If there are fewer than a predefined number of total barcodes (e.g., 5000) then all barcodes are labeled as cells (single-cluster).

[00188] In various embodiments, step 285 further involves recovering one or more unassigned barcodes. These are barcodes which could potentially be cells, but were marked as outliers (“-1”) by HDBSCAN. In various embodiments, barcodes are recovered by analyzing their proximity to the nearest labelled barcode. If the closest invalid-barcode is at least twice as far as the closest valid-cell, then the outlier is marked as a recovered cell (“1+” in this case).

Methods for Detecting Mergers

[00189] FIGS. 2A and 2B are embodiments of flow processes for detecting one or more droplet mergers in a single cell sequencing workflow. Generally, the flow processes as shown in FIGS. 2A and 2B generalize the merger detection methods as elaborated above in FIGS. 1A and 1B in further detail. In various embodiments, a flow process for detecting one or more droplet mergers includes both the methods in FIGS. 2A and 2B, where one of the methods is selected according to the output results of the methods.

[00190] Reference is now made to FIG. 2A. At step 210, a dataset (e.g., source data 120 in FIGS. 1B and 1C) comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells is obtained.

[00191] At step 220A, the dataset is normalized (e.g., using the data normalization module 150 of the merger detection system 130 and/or correlation clustering method described in FIG. 1B) to generate a barcode correlation value and a barcode coverage value for a barcode.

[00192] In some embodiments, the barcode correlation value represents a correlation between an amplicon profile of the barcode and an average amplicon profile.

[00193] In some embodiments, the barcode coverage value represents a number of mean sequence reads per barcode.

[00194] In some embodiments, the amplicon profile of the barcode can be mean sequence reads per amplicon comprising the barcode or median sequence reads per amplicon comprising the barcode.

[00195] In some embodiments, the average amplicon profile is a mean sequence reads per amplicon comprising one of the barcodes or median sequence reads per amplicon comprising one of the barcodes.

[00196] In some embodiments, normalizing the dataset in step 220A further includes one or more of the following steps: 1) removing amplicons with sequence reads less than a first threshold value (e.g., one read in the plurality of cells) from the dataset; 2) removing barcodes with a coverage less than a second threshold value (e.g., two sequence reads per amplicon per cell) from the dataset; 3) generating a plurality of barcode amplicon profiles based on the dataset; 4) generating an average amplicon profile by calculating mean sequence reads per amplicon across the one or more barcodes; 5) generating the barcode correlation value by performing a linear regression for each of the plurality of barcode amplicon profiles against the average amplicon profile.

[00197] In some embodiments, the barcode correlation value is the coefficient of determination of the linear regression. In some embodiments, the barcode coverage value is a log to the base 10 of the mean sequence reads per barcode.
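
By way of illustration only, the two values described above can be computed as in the following sketch. It assumes that the dataset has already been filtered and is held as a mapping from barcode to per-amplicon read counts, and that the mean sequence reads per barcode is the mean of that barcode's per-amplicon counts; the names `r_squared` and `barcode_features` and the example counts are hypothetical. For a simple linear regression, the coefficient of determination equals the squared Pearson correlation, which is the identity the sketch uses.

```python
import math
from statistics import mean

def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a + b*x,
    computed as the squared Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation information
    return (sxy * sxy) / (sxx * syy)

def barcode_features(counts):
    """Return {barcode: (correlation value, coverage value)}."""
    n_amplicons = len(next(iter(counts.values())))
    # Average amplicon profile: mean reads per amplicon across barcodes.
    avg_profile = [mean(row[j] for row in counts.values())
                   for j in range(n_amplicons)]
    return {bc: (r_squared(avg_profile, row), math.log10(mean(row)))
            for bc, row in counts.items()}

# Hypothetical read counts for three barcodes over four amplicons.
counts = {
    "BC1": [10, 20, 30, 40],
    "BC2": [12, 18, 33, 41],
    "BC3": [40, 10, 35, 15],
}
features = barcode_features(counts)
```

The sketch relies on the earlier filtering steps to guarantee a non-zero mean coverage, since the base-10 logarithm is undefined at zero.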

[00198] At step 230A, the barcodes are clustered according to the barcode correlation value and the barcode coverage value for the barcodes (e.g., using the data clustering module 160 of the merger detection system 130 and/or correlation clustering method described in FIG. 1B).

[00199] At step 240, one or more droplet mergers are determined by labelling one or more clusters.

[00200] In some embodiments, labelling one or more clusters further includes: operating a DBSCAN method using a plurality of parameter values; selecting one or more parameter values where the dataset is classified into two clusters by applying the selected parameter values; and for each of the selected parameter values, determining whether a criterion is met. In some embodiments, the criterion includes a first cluster having a higher coverage value and a lower correlation value than a second cluster. If the criterion is not met for any of the selected parameter values, the method is marked as failed. If the criterion is met for only one selected parameter value, the first cluster is marked as a singlet cluster and the second cluster is marked as a merger cluster. If the criterion is met for two or more selected parameter values, labelling one or more clusters further includes: selecting a cluster with fewest unassigned barcodes, and marking the selected cluster as the singlet cluster and the other cluster associated with the selected cluster as a merger cluster.

[00201] Reference is now made to FIG. 2B. At step 210, a dataset (e.g., source data 120 in FIG. 1B) comprising sequence reads of a plurality of amplicons comprising barcodes associated with a plurality of cells is obtained.
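
By way of illustration only, the cluster-labelling rule described for the method of FIG. 2A (keeping the parameter values that yield two clusters, checking the coverage and correlation criterion, and preferring the result with the fewest unassigned barcodes) can be sketched as follows. The sketch does not run DBSCAN itself; it assumes the cluster labels produced for each parameter value are already available, with -1 denoting an unassigned barcode. Comparing clusters by their mean correlation and mean coverage, and the name `label_clusters`, are assumptions.

```python
from statistics import mean

def label_clusters(features, runs):
    """features: list of (correlation, coverage) per barcode.
    runs: {parameter value: list of cluster labels, -1 = unassigned}.
    Returns the chosen labelling, or None if the method is marked as failed."""
    passing = []
    for param, labels in runs.items():
        clusters = sorted(set(labels) - {-1})
        if len(clusters) != 2:
            continue  # keep only parameter values that give two clusters
        stats = {}
        for c in clusters:
            members = [f for f, lab in zip(features, labels) if lab == c]
            stats[c] = (mean(m[0] for m in members), mean(m[1] for m in members))
        a, b = clusters
        for first, second in ((a, b), (b, a)):
            higher_coverage = stats[first][1] > stats[second][1]
            lower_correlation = stats[first][0] < stats[second][0]
            if higher_coverage and lower_correlation:
                passing.append((labels.count(-1), param, first, second))
    if not passing:
        return None
    # Prefer the parameter value leaving the fewest barcodes unassigned.
    _, param, singlet, merger = min(passing, key=lambda t: t[0])
    return {"param": param, "singlet": singlet, "merger": merger}

# Hypothetical features and labellings for three parameter values.
features = [(0.70, 2.0), (0.72, 2.1), (0.95, 1.2), (0.96, 1.1), (0.50, 1.6)]
runs = {
    0.1: [0, 0, 1, 1, -1],  # two clusters, one barcode unassigned
    0.2: [0, 0, 1, 1, 1],   # two clusters, no barcode unassigned
    0.5: [0, 0, 0, 0, 0],   # a single cluster, so not selected
}
result = label_clusters(features, runs)
```

In this hypothetical input, both two-cluster labellings meet the criterion, and the one with no unassigned barcodes is selected.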

[00202] At step 220B, the dataset is normalized (e.g., using the data normalization module 150 of the merger detection system 130 and/or the dimensional reduction method described in FIG. 1B) to generate dimensionally reduced counts of the barcodes.

[00203] In some embodiments, normalizing the dataset further includes one or more of the following steps: removing barcodes with sequence reads less than a threshold value (e.g., 3 sequence reads) in a fraction of amplicons (e.g., 20%); for each amplicon, determining a median of the normalized sequence reads by dividing read counts for each barcode by mean sequence reads of the amplicon; for each of one or more barcodes, generating the dimensionally reduced counts by dividing the counts for each amplicon by the median of the normalized sequence reads for the barcode.

[00204] At step 230B, the barcodes are clustered according to the dimensionally reduced counts to generate clusters that satisfy one or more criteria (e.g., using the data clustering module 160 of the merger detection system 130 and/or dimensional reduction method described in FIG. 1B).

[00205] At step 240, one or more droplet mergers are determined by labelling one or more clusters.

[00206] In some embodiments, labelling one or more clusters further includes one or more of the following steps: creating a 2D grid comprising a plurality of spaced points on the visual graph; at one of the spaced points, generating a plurality of lines on the visual graph, wherein each of the plurality of lines comprises a slope between -90° to 90° relative to a horizontal line; selecting, from the plurality of lines, lines that split the dimensionally reduced counts of the one or more barcodes into two clusters; and for each selected line, labelling a first cluster comprising a higher statistical read count as a singlet cluster and a second cluster comprising a lower statistical read count as a merger cluster.
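
By way of illustration only, the line-based splitting described above can be sketched as follows. The sketch assumes that each barcode has a two-dimensional position on the graph and a read count, that a line is defined by a grid point and an angle relative to the horizontal, and that the median is used as the statistical read count for each side; the function names and example data are hypothetical.

```python
import math
from statistics import median

def split_by_line(points, origin, angle_deg):
    """True for points lying to the left of the line through `origin`
    drawn at `angle_deg` relative to the horizontal."""
    t = math.radians(angle_deg)
    dx, dy = math.cos(t), math.sin(t)
    x0, y0 = origin
    # Sign of the cross product between the line direction and the point offset.
    return [dx * (y - y0) - dy * (x - x0) > 0 for x, y in points]

def candidate_splits(points, reads, grid, angles):
    """Return (median difference, origin, angle, singlet_is_left) for every
    line that separates the barcodes into two non-empty groups."""
    out = []
    for origin in grid:
        for angle in angles:
            side = split_by_line(points, origin, angle)
            left = [r for r, s in zip(reads, side) if s]
            right = [r for r, s in zip(reads, side) if not s]
            if not left or not right:
                continue  # this line does not split the data in two
            m_left, m_right = median(left), median(right)
            # The side with the higher read count is labelled the singlet cluster.
            out.append((abs(m_left - m_right), origin, angle, m_left > m_right))
    return out

# Hypothetical barcodes: three with high read counts, three with low.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
reads = [100, 120, 110, 20, 25, 30]
splits = candidate_splits(points, reads, grid=[(3, 3)], angles=[-45, 0])
best = max(splits, key=lambda s: s[0])
```

The sketch covers only line generation, splitting, and labelling; any further filtering of candidate lines is outside its scope.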

[00207] In some embodiments, the method in FIG. 2B further includes selecting lines by removing, from selected lines, lines that are close to more than a threshold of cells. If no lines are further selected, the method is marked as failed. If one or more lines are further selected, the method includes identifying, from the further selected lines, a line having the maximum difference in the median counts between the two clusters.

[00208] In some embodiments, the method in FIG. 2A is selected if it does not fail but successfully provides an output. In some embodiments, the method in FIG. 2B is selected if the method in FIG. 2A fails. In some embodiments, additional methods may be used if the methods in both FIG. 2A and FIG. 2B fail and are not able to identify mergers or singlet-like barcodes.
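
By way of illustration only, the selection between the method of FIG. 2A and the method of FIG. 2B can be sketched as a simple fallback. The sketch assumes that each method is a callable that returns None when it is marked as failed; the names `detect_mergers`, `correlation_method`, and `reduction_method` are hypothetical placeholders and do not correspond to any named component of the disclosure.

```python
def detect_mergers(dataset, correlation_method, reduction_method):
    """Try the FIG. 2A method first and fall back to the FIG. 2B method.
    Returns (name of the selected method, result), or ("failed", None)."""
    result = correlation_method(dataset)
    if result is not None:
        return "FIG. 2A", result
    result = reduction_method(dataset)
    if result is not None:
        return "FIG. 2B", result
    # Both methods failed; additional methods could be tried at this point.
    return "failed", None

# Hypothetical stand-ins: the first method fails, the second succeeds.
def always_fails(dataset):
    return None

def always_succeeds(dataset):
    return {"singlet": 0, "merger": 1}

selected, outcome = detect_mergers([], always_fails, always_succeeds)
```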

Methods for Performing Single-Cell Analysis

Encapsulation, Analyte Release, Barcoding, and Amplification

[00209] Embodiments described herein involve encapsulating one or more cells (e.g., at step 160 in FIG. 1B) to perform single-cell analysis on the one or more cells. In various embodiments, encapsulating a cell with reagents is accomplished by combining an aqueous phase including the cell and reagents with an immiscible oil phase. In one embodiment, an aqueous phase including the cell and reagents is flowed together with a flowing immiscible oil phase such that water in oil emulsions are formed, where at least one emulsion includes a single cell and the reagents. In various embodiments, the immiscible oil phase includes a fluorous oil, a fluorous non-ionic surfactant, or both. In various embodiments, emulsions can have an internal volume of about 0.001 to 1000 picoliters or more and can range from 0.1 to 1000 μm in diameter.

[00210] In various embodiments, the aqueous phase including the cell and reagents need not be simultaneously flowing with the immiscible oil phase. For example, the aqueous phase can be flowed to contact a stationary reservoir of the immiscible oil phase, thereby enabling the budding of water in oil emulsions within the stationary oil reservoir.

[00211] In various embodiments, combining the aqueous phase and the immiscible oil phase can be performed in a microfluidic device. For example, the aqueous phase can flow through a microchannel of the microfluidic device to contact the immiscible oil phase, which is simultaneously flowing through a separate microchannel or is held in a stationary reservoir of the microfluidic device. The encapsulated cell and reagents within an emulsion can then be flowed through the microfluidic device to undergo cell lysis.

[00212] Further example embodiments of adding reagents and cells to emulsions can include merging emulsions that separately contain the cells and reagents or picoinjecting reagents into an emulsion. Further description of example embodiments is described in US Application No. 14/420,646, which is hereby incorporated by reference in its entirety.

[00213] The encapsulated cell in an emulsion is lysed to generate cell lysate. In various embodiments, a cell is lysed by lysing agents that are present in the reagents. For example, the reagents can include a detergent such as NP-40 and/or a protease. The detergent and/or the protease can lyse the cell membrane. In some embodiments, cell lysis may also, or instead, rely on techniques that do not involve a lysing agent in the reagent. For example, lysis may be achieved by mechanical techniques that may employ various geometric features to effect piercing, shearing, abrading, etc. of cells. Other types of mechanical breakage such as acoustic techniques may also be used. Further, thermal energy can also be used to lyse cells. Any convenient means of effecting cell lysis may be employed in the methods described herein.

[00214] Reference is now made to FIGS. 3A-3C, which depict steps of releasing and processing analytes within an emulsion or a droplet (e.g., emulsion 300), in accordance with a first embodiment. FIG. 3A depicts emulsion 300A that includes both the cell 102 and reagents 120 (as shown in FIG. 1C). Specifically, in FIG. 3A, the emulsion 300A contains the cell (which further includes DNA 302), antibody oligonucleotides 304 (from the antibodies used to bind cell proteins at step 104 in FIG. 1A), as well as proteases 310 that are added from the reagents. Within the emulsion 300A, the cell is lysed, as indicated by the dotted line of the cell membrane. In one embodiment, the cell is lysed by detergents included in the reagents, such as NP40 (e.g., 0.01% NP40).

[00215] FIG. 3B depicts the emulsion 300B as the proteases 310 digest the chromatin-bound DNA 302, thereby releasing genomic DNA. In various embodiments, emulsion 300B is exposed to elevated temperatures to enable the proteases 310 to digest the chromatin. In various embodiments, emulsion 300B is exposed to a temperature between 40 °C and 60 °C. In various embodiments, emulsion 300B is exposed to a temperature between 45 °C and 55 °C. In various embodiments, emulsion 300B is exposed to a temperature between 48 °C and 52 °C. In various embodiments, emulsion 300B is exposed to a temperature of 50 °C.

[00216] FIG. 3C depicts the free genomic DNA strands 306 and the antibody oligonucleotides 304 residing within emulsion 300C. Proteases 310 are inactivated. In various embodiments, proteases 310 are inactivated by exposing emulsion 300C to an elevated temperature. In various embodiments, emulsion 300C is exposed to a temperature between 70 °C and 90 °C. In various embodiments, emulsion 300C is exposed to a temperature between 75 °C and 85 °C. In various embodiments, emulsion 300C is exposed to a temperature between 78 °C and 82 °C. In various embodiments, emulsion 300C is exposed to a temperature of 80 °C.

[00217] In various embodiments, the antibody oligonucleotide 304 and/or the free genomic DNA 306 undergo priming within emulsion 300C. In various embodiments, reverse primers can hybridize with a portion of the antibody oligonucleotide 304 and/or the free genomic DNA 306. For example, the reverse primer is a gene specific reverse primer that hybridizes with a portion of the free genomic DNA 306. Example gene specific primers are described in further detail below. As another example, the reverse primer is a PCR handle that hybridizes with a portion of the antibody oligonucleotide 304, which is described in further detail below in relation to FIG. 4A. In various embodiments, the priming of the antibody oligonucleotide 304 can occur earlier, for example in emulsion 300A or emulsion 300B, given that the reverse primers are included in the reagents, which are introduced into emulsion 300A along with the proteases 310.

[00218] In various embodiments, the antibody oligonucleotide 304 and the free genomic DNA 306 in emulsion 300C represent at least in part the cell lysate, such as cell lysate shown in FIG. 1C, which is subsequently encapsulated in a second emulsion for barcoding and amplification.

[00219] Once the reagents and barcode are added to an emulsion, the emulsion may be incubated under conditions that facilitate the nucleic acid amplification reaction. In various embodiments, the emulsion may be incubated on the same microfluidic device as was used to add the reagents and/or barcode, or may be incubated on a separate device. In certain embodiments, incubating the emulsion under conditions that facilitate nucleic acid amplification is performed on the same microfluidic device used to encapsulate the cells and lyse the cells. Incubating the emulsions may take a variety of forms. In certain aspects, the emulsions containing the reaction mix, barcode, and cell lysate may be flowed through a channel that incubates the emulsions under conditions effective for nucleic acid amplification. Flowing the microdroplets through a channel may involve a channel that snakes over various temperature zones maintained at temperatures effective for PCR. Such channels may, for example, cycle over two or more temperature zones, wherein at least one zone is maintained at about 65 °C and at least one zone is maintained at about 95 °C. As the drops move through such zones, their temperature cycles, as needed for nucleic acid amplification. The number of zones, and the respective temperature of each zone, may be readily determined by those of skill in the art to achieve the desired nucleic acid amplification.

[00220] In various embodiments, following nucleic acid amplification, emulsions containing the amplified nucleic acids are collected. In various embodiments, the emulsions are collected in a well, such as a well of a microfluidic device. In various embodiments, the emulsions are collected in a reservoir or a tube, such as an Eppendorf tube. Once collected, the amplified nucleic acids across the different emulsions are pooled. In one embodiment, the emulsions are broken by providing an external stimulus to pool the amplified nucleic acids. In one embodiment, the emulsions naturally aggregate over time given the density differences between the aqueous phase and immiscible oil phase. Thus, the amplified nucleic acids pool in the aqueous phase.

[00221] In various embodiments, following pooling, the amplified nucleic acids can undergo further preparation for sequencing. For example, sequencing adapters can be added to the pooled nucleic acids. Example sequencing adapters are P5 and P7 sequencing adapters. The sequencing adapters enable the subsequent sequencing of the nucleic acids.

Example Barcoding of Antibody-Conjugated Oligonucleotide and Genomic DNA

[00222] FIG. 4A illustrates the priming and barcoding of an antibody-conjugated oligonucleotide, in accordance with an embodiment. In various embodiments, the antibody-conjugated oligonucleotide can be specific for a surface protein. In various embodiments, the antibody-conjugated oligonucleotide can be specific for an intracellular protein. Specifically, FIG. 4A depicts step 410 involving the priming of the antibody oligonucleotide 304 and further depicts step 420 which involves the barcoding and amplification of the antibody oligonucleotide 304. In various embodiments, step 410 occurs within a first emulsion during which cell lysis occurs and step 420 occurs within a second emulsion during which cell barcoding and nucleic acid amplification occurs. In such embodiments, the primer 405 is provided in the reagents and the barcodes are provided via a barcode bead. In some embodiments, both steps 410 and 420 occur within the second emulsion.

[00223] The antibody oligonucleotide 304 is conjugated to an antibody. In various embodiments, an antibody oligonucleotide 304 includes a PCR handle, a tag sequence (e.g., an antibody tag), and a capture sequence that links the oligonucleotide to the antibody. In various embodiments, the antibody oligonucleotide 304 is conjugated to a region of the antibody, such that the antibody’s ability to bind a target epitope is unaffected. For example, the antibody oligonucleotide 304 can be linked to a Fc region of the antibody, thereby leaving the variable regions of the antibody unaffected and available for epitope binding. In various embodiments, the antibody oligonucleotide 304 can include a unique molecular identifier (UMI). In various embodiments, the UMI can be inserted before or after the antibody tag. In various embodiments, the UMI can flank either end of the antibody tag. In various embodiments, the UMI enables the quantification of the particular antibody oligonucleotide 304 and antibody combination.

[00224] In various embodiments, the antibody oligonucleotide 304 includes more than one PCR handle. For example, the antibody oligonucleotide 304 can include two PCR handles, one on each end of the antibody oligonucleotide 304. In various embodiments, one of the PCR handles of the antibody oligonucleotide 304 is conjugated to the antibody. Here, forward and reverse primers can be provided that hybridize with the two PCR handles, thereby enabling amplification of the antibody oligonucleotide 304.

[00225] Generally, the antibody tag of the antibody oligonucleotide 304 enables the subsequent identification of the antibody (and the corresponding protein that the antibody specifically binds to). For example, the antibody tag can serve as an identifier (e.g., a barcode) for identifying the type of protein to which the antibody binds. In various embodiments, antibodies that bind to the same target are each linked to the same antibody tag. For example, antibodies that bind to the same epitope of a target protein are each linked to the same antibody tag, thereby enabling the subsequent determination of the presence of the target protein. In various embodiments, antibodies that bind different epitopes of the same target protein can be linked to the same antibody tag, thereby enabling the subsequent determination of the presence of the target protein.

[00226] In some embodiments, an oligonucleotide sequence is encoded by its nucleobase sequence and thus confers a combinatorial tag space far exceeding what is possible with conventional approaches using fluorescence. For example, a modest tag length of ten bases provides over a million unique sequences, sufficient to label an antibody against every epitope in the human proteome. Indeed, with this approach, the limit to multiplexing is not the availability of unique tag sequences but, rather, that of specific antibodies that can detect the epitopes of interest in a multiplexed reaction.
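
By way of illustration only, the size of the tag space mentioned above follows from the four-letter nucleobase alphabet: a tag of length n has 4^n possible sequences. The following sketch computes that count and the shortest tag length covering a given number of targets; the function names are hypothetical.

```python
def tag_space(length, alphabet_size=4):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return alphabet_size ** length

def min_tag_length(n_targets, alphabet_size=4):
    """Shortest tag length whose sequence space covers n_targets."""
    length = 0
    while alphabet_size ** length < n_targets:
        length += 1
    return length

ten_base_tags = tag_space(10)              # 4**10 = 1,048,576
needed = min_tag_length(1_000_000)         # 10, since 4**9 = 262,144
```

In practice fewer sequences may be usable than this upper bound, for example if tags are chosen to differ from one another at several positions to tolerate sequencing errors.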

[00227] Step 410 depicts the priming of the antibody oligonucleotide 304 by a primer 405. As shown in FIG. 4A, the primer 405 may include a PCR handle and a common sequence. Here, the PCR handle of the primer 405 is complementary to the PCR handle of the antibody oligonucleotide 304. Thus, the primer 405 primes the antibody oligonucleotide 304 given the hybridization of the PCR handles. In various embodiments, extension occurs from the PCR handle of the antibody oligonucleotide 304 (as indicated by the dotted arrow). In various embodiments, extension occurs from the PCR handle of the primer 405, thereby generating a nucleic acid with the antibody tag and capture sequence.

[00228] Step 420 depicts the barcoding of the antibody oligonucleotide 304. As shown in FIG. 4A, the barcode (e.g., cell barcode) is releasably attached to a bead and is further linked to a common sequence. Here, the common sequence linked to the cell barcode is complementary to the common sequence linked to the PCR handle, antibody tag, and capture sequence. The antibody oligonucleotide is extended to include the common sequence and cell barcode.

[00229] In various embodiments, the antibody oligonucleotide is amplified, thereby generating amplicons with the cell barcode, common sequence, PCR handle, antibody tag, and capture sequence. In various embodiments, the capture sequence contains a biotin oligonucleotide capture site, which enables streptavidin bead enrichment prior to library preparation. In various embodiments, the barcoded antibody-oligonucleotides can be enriched by size separation from the amplified genomic DNA targets.

[00230] FIG. 4B illustrates the priming and barcoding of genomic DNA 455, in accordance with an embodiment. Specifically, FIG. 4B depicts step 460 involving the priming of the genomic DNA 455 and further depicts step 470 which involves the barcoding and amplification of the genomic DNA 455. In various embodiments, step 460 occurs within a first emulsion during which cell lysis occurs and step 470 occurs within a second emulsion during which cell barcoding and nucleic acid amplification occurs. In such embodiments, the primer 465 is added in the reagents and the barcode and forward primers shown in step 470 are added. In some embodiments, step 460 and step 470 both occur within a single emulsion (e.g., a second emulsion) during which cell barcoding and nucleic acid amplification occurs. In such embodiments, the primer 465 shown in step 460 and the barcode and forward primers shown in step 470 are added.

[00231] At step 460, a primer 465 (as indicated by the dotted line) hybridizes with a portion of the genomic DNA 455. In various embodiments, the primer 465 is a gene specific primer that targets a sequence of a gene of interest. Therefore, the primer 465 hybridizes with a sequence of the genomic DNA 455 corresponding to the gene of interest. In various embodiments the primer 465 further includes a PCR handle or is linked to a PCR handle.

[00232] At step 470, a primer 475 (as indicated by the dotted line) hybridizes with a portion of the genomic DNA 455. In various embodiments, the primer 475 includes a PCR handle or is linked to a PCR handle. In various embodiments, the primer 475 is a gene specific primer that targets another sequence of the gene of interest that differs from the sequence targeted by the primer 465. Additionally, a cell barcode (cell BC), which is releasably attached to a bead, is linked to a PCR handle which hybridizes with the PCR handle of the forward primer. Nucleic acid amplification generates amplicons, each of which includes the cell barcode, PCR handle, forward primer, the gene sequence of interest, the primer 465, and the PCR handle.

Sequencing and Read Alignment

[00233] Amplified nucleic acids (e.g., amplicons) are sequenced to obtain sequence reads for generating a sequencing library. Sequence reads can be achieved with commercially available next generation sequencing (NGS) platforms, including platforms that perform any of sequencing by synthesis, sequencing by ligation, pyrosequencing, using reversible terminator chemistry, using phospholinked fluorescent nucleotides, or real-time sequencing. As an example, amplified nucleic acids may be sequenced on an Illumina MiSeq platform.

[00234] In pyrosequencing, libraries of NGS fragments are clonally amplified in situ by capturing single template molecules on beads coated with oligonucleotides complementary to the adapters. Each bead carrying a single type of template is compartmentalized in a “water in oil” microdroplet, and the template is clonally amplified using a method called emulsion PCR. After amplification, the emulsion is broken and the beads are deposited into individual wells of a picotiter plate that acts as a flow cell during the sequencing reactions. Each of the four dNTP reagents is introduced into the flow cell in an ordered, repeated fashion in the presence of sequencing enzymes and a luminescent reporter, such as luciferase. When an appropriate dNTP is added to the 3’ end of the sequencing primer, the resulting ATP production causes a flash of luminescence within the well, which is recorded using a CCD camera. It is possible to achieve a read length of 400 bases or more, and it is possible to obtain 10⁶ sequence reads, resulting in up to 500 million base pairs (500 megabases) of sequence. Additional details for pyrosequencing are described in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 6,210,891; US patent No. 6,258,568; each of which is hereby incorporated by reference in its entirety.

[00235] On the Solexa/Illumina platform, sequencing data is produced in the form of short reads. In this method, fragments of a library of NGS fragments are captured on the surface of a flow cell that is coated with oligonucleotide anchor molecules. An anchor molecule is used as a PCR primer, but due to the length of the template and its proximity to other nearby anchor oligonucleotides, extension by PCR causes the molecule to arch over and hybridize with a neighboring anchor oligonucleotide, forming a bridge structure on the surface of the flow cell. These DNA loops are denatured and cleaved. The resulting linear strands are then sequenced using reversible dye terminators. The incorporated nucleotides are determined by detecting fluorescence after incorporation, where each fluorophore and blocking group is removed prior to the next dNTP addition cycle. Additional details for sequencing using the Illumina platform are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 6,833,246; US patent No. 7,115,400; US patent No. 6,969,488; each of which is hereby incorporated by reference in its entirety.

[00236] Sequencing of nucleic acid molecules using SOLiD technology includes clonal amplification of the library of NGS fragments using emulsion PCR. The template-bearing beads are then immobilized on the derivatized surface of a glass flow cell and annealed with a primer complementary to the adapter oligonucleotide. However, instead of being used for 3’ extension, this primer provides a 5’ phosphate group for ligation to interrogation probes containing two probe-specific bases followed by 6 degenerate bases and one of four fluorescent labels. In the SOLiD system, interrogation probes have 16 possible combinations of two bases at the 3’ end of each probe and one of four fluorescent dyes at the 5’ end. The color of the fluorescent dye and, thus, the identity of each probe, corresponds to a specified color space coding scheme. Multiple cycles of probe annealing, probe ligation, and fluorescent signal detection are followed by denaturation and then a second round of sequencing using a primer that is shifted by one base compared to the original primer. In this way, the template sequence can be reconstructed computationally; template bases are interrogated twice, which leads to increased accuracy. Additional details for sequencing using SOLiD technology are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US patent No. 5,912,148; US patent No. 6,130,073; each of which is incorporated by reference in its entirety.

[00237] In particular embodiments, HeliScope from Helicos BioSciences is used. Sequencing is achieved by the addition of polymerase and serial additions of fluorescently-labeled dNTP reagents. Incorporation events produce a fluorescent signal corresponding to the dNTP, and the signal is captured by a CCD camera before each dNTP addition cycle. The read length ranges from 25 to 50 nucleotides, with a total yield exceeding 1 billion nucleotide pairs per analytical run. Additional details for performing sequencing using HeliScope are found in Voelkerding et al., Clinical Chem., 55: 641-658, 2009; MacLean et al., Nature Rev. Microbiol., 7: 287-296; US Patent No. 7,169,560; US patent No. 7,282,337; US patent No. 7,482,120; US patent No. 7,501,245; US patent No. 6,818,395; US patent No. 6,911,345; each of which is incorporated by reference in its entirety.

[00238] In some embodiments, a Roche 454 sequencing system is used. 454 sequencing involves two steps. In the first step, DNA is cut into fragments of approximately 300-800 base pairs, and these fragments have blunt ends. Oligonucleotide adapters are then ligated to the ends of the fragments. The adapters serve as primers for amplification and sequencing of the fragments. Fragments can be attached to DNA-capture beads, for example, streptavidin-coated beads, using, for example, an adapter that contains a 5’-biotin tag. Fragments attached to the beads are amplified by PCR within the droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (several picoliters in volume). Pyrosequencing is carried out on each DNA fragment in parallel. Adding one or more nucleotides leads to the generation of a light signal, which is recorded by the CCD camera of the sequencing instrument. The signal intensity is proportional to the number of nucleotides incorporated. Pyrosequencing uses pyrophosphate (PPi), which is released upon the addition of a nucleotide. PPi is converted to ATP using ATP sulfurylase in the presence of adenosine 5’ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and as a result of this reaction, light is generated that is detected and analyzed. Additional details for performing 454 sequencing are found in Margulies et al. (2005) Nature 437: 376-380, which is hereby incorporated by reference in its entirety.

[00239] Ion Torrent technology is a DNA sequencing method based on the detection of hydrogen ions that are released during DNA polymerization. A microwell contains a fragment of a library of NGS fragments to be sequenced. Beneath the microwell layer is a hypersensitive ISFET ion sensor. All layers are contained within a semiconductor CMOS chip, similar to the chips used in the electronics industry. When a dNTP is incorporated into a growing complementary strand, a hydrogen ion is released that triggers the hypersensitive ion sensor. If homopolymer repeats are present in the sequence of the template, multiple dNTP molecules will be incorporated in one cycle. This results in a corresponding number of hydrogen ions being released and a proportionally higher electrical signal. This technology differs from other sequencing technologies in that it does not use modified nucleotides or optical devices. Additional details for Ion Torrent technology are found in Science 327 (5970): 1190 (2010); US Patent Application Publication Nos. 20090026082, 20090127589, 20100301398, 20100197507, 20100188073, and 20100137143, each of which is incorporated by reference in its entirety.

[00240] In various embodiments, sequencing reads obtained from the NGS methods can be filtered by quality and grouped by barcode sequence using any algorithms known in the art, e.g., the Python script barcodeCleanup.py. In some embodiments, a given sequencing read may be discarded if more than about 20% of its bases have a quality score (Q-score) less than Q20, where Q20 indicates a base call accuracy of about 99%. In some embodiments, a given sequencing read may be discarded if more than about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of its bases have a Q-score less than Q10, Q20, Q30, Q40, Q50, Q60, or more, indicating a base call accuracy of about 90%, about 99%, about 99.9%, about 99.99%, about 99.999%, about 99.9999%, or more, respectively.
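
By way of illustration only, the quality filter described above can be sketched as follows. The sketch assumes that the per-base Q-scores of a read are available as a list of integers and uses the standard Phred relationship, in which a Q-score of Q corresponds to an error probability of 10^(-Q/10). The function names are hypothetical, and the sketch is not the barcodeCleanup.py script referenced above.

```python
def base_call_accuracy(q):
    """Accuracy implied by a Phred Q-score: 1 - 10**(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)

def keep_read(qualities, min_q=20, max_low_fraction=0.20):
    """Keep a read unless more than `max_low_fraction` of its bases
    have a Q-score below `min_q`."""
    if not qualities:
        return False  # an empty read carries no usable bases
    low = sum(1 for q in qualities if q < min_q)
    return low / len(qualities) <= max_low_fraction

# Hypothetical reads of ten bases each.
borderline = [30] * 8 + [10] * 2   # exactly 20% of bases below Q20: kept
poor = [30] * 7 + [10] * 3         # 30% of bases below Q20: discarded
kept = [keep_read(borderline), keep_read(poor)]
```

The thresholds are exposed as parameters so that the other percentage and Q-score combinations listed above can be applied in the same way.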

[00241] In some embodiments, sequencing reads associated with a barcode containing fewer than 50 reads may be discarded to ensure that all barcode groups, representing single cells, contain a sufficient number of high-quality reads. In some embodiments, all sequencing reads associated with a barcode containing fewer than 30, fewer than 40, fewer than 50, fewer than 60, fewer than 70, fewer than 80, fewer than 90, or fewer than 100 reads, or more, may be discarded to ensure the quality of the barcode groups representing single cells.
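
By way of illustration only, the per-barcode read count filter described above can be sketched as follows. The sketch assumes that the barcode of each quality-filtered read is available as a string; the function name `filter_barcodes` and the example barcodes are hypothetical.

```python
from collections import Counter

def filter_barcodes(read_barcodes, min_reads=50):
    """Count reads per barcode and keep barcodes with at least `min_reads`."""
    counts = Counter(read_barcodes)
    return {bc: n for bc, n in counts.items() if n >= min_reads}

# Hypothetical input: one barcode label per sequencing read.
read_barcodes = ["AAAC"] * 60 + ["GGTT"] * 49 + ["CCAT"] * 50
retained = filter_barcodes(read_barcodes)
```

In this hypothetical input, the barcode with 49 reads is discarded and the barcodes with 50 and 60 reads are retained.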

[00242] In various embodiments, sequence reads with common barcode sequences (e.g., meaning that sequence reads originated from the same cell) may be aligned to a reference genome using known methods in the art to determine alignment position information. For example, sequence reads derived from genomic DNA can be aligned to a range of positions of a reference genome. In various embodiments, sequence reads derived from genomic DNA can align with a range of positions corresponding to a gene of the reference genome. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. A region in the reference genome may be associated with a target gene or a segment of a gene. Further details for aligning sequence reads to reference sequences are described in US Application No. 16/279,315, which is hereby incorporated by reference in its entirety. In various embodiments, an output file having SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for subsequent analysis, such as for determining cell trajectory.
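For illustration only, the beginning and end positions of an aligned read can be recovered from a single SAM record using its reference name, 1-based leftmost position, and CIGAR string. The function name is hypothetical, and the sketch assumes a well-formed, tab-delimited SAM alignment line.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def alignment_span(sam_line):
    """Return (reference_name, start, end), 1-based inclusive, or None if unmapped."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    if rname == "*" or cigar == "*" or pos == 0:
        return None
    ref_length = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                     if op in REF_CONSUMING)
    return rname, pos, pos + ref_length - 1
```

Soft clips and insertions do not consume reference bases, so they are excluded from the span, whereas deletions are included.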

Cells and Cell Populations

[00243] Embodiments described herein involve the single-cell analysis of cells. In various embodiments, the cells are healthy cells. In various embodiments, the cells are diseased cells. Examples of diseased cells include cancer cells, such as cells of hematologic malignancies or solid tumors. Examples of hematologic malignancies include, but are not limited to, acute lymphoblastic leukemia, acute myeloid leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, classic Hodgkin’s Lymphoma, diffuse large B-cell lymphoma, follicular lymphoma, mantle cell lymphoma, multiple myeloma, myelodysplastic syndromes, myeloid, myeloproliferative neoplasms, or T-cell lymphoma. Examples of solid tumors include, but are not limited to, breast invasive carcinoma, colon adenocarcinoma, glioblastoma multiforme, kidney renal clear cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, ovarian cancer, pancreatic adenocarcinoma, prostate adenocarcinoma, or skin cutaneous melanoma.

[00244] In various embodiments, the single-cell analysis is performed on a population of cells. The population of cells can be a heterogeneous population of cells. In one embodiment, the population of cells can include both cancerous and non-cancerous cells. In one embodiment, the population of cells can include cancerous cells that are heterogeneous amongst themselves. In various embodiments, the population of cells can be obtained from a subject. For example, a sample is taken from a subject, and the population of cells in the sample is isolated for performing single-cell analysis.

Barcodes and Barcoded Beads

[00245] Embodiments of the invention involve providing one or more barcode sequences for labeling analytes of a single cell during step 170 shown in FIG. 1C. The one or more barcode sequences are encapsulated in an emulsion with a cell lysate derived from a single cell. As such, the one or more barcodes label analytes of the cell, thereby enabling the subsequent determination that sequence reads derived from the analytes originated from the same single cell.

[00246] In various embodiments, a plurality of barcodes are added to a droplet with a cell lysate. In various embodiments, the plurality of barcodes added to a droplet includes at least 10², at least 10³, at least 10⁴, at least 10⁵, at least 10⁶, at least 10⁷, or at least 10⁸ barcodes. In various embodiments, the plurality of barcodes added to an emulsion have the same barcode sequence. For example, multiple copies of the same barcode label are added to an emulsion to label multiple analytes derived from the cell lysate, thereby enabling identification of the cell from which an analyte originates. In various embodiments, the plurality of barcodes added to an emulsion comprise a ‘unique identification sequence’ (UMI). A UMI is a nucleic acid having a sequence which can be used to identify and/or distinguish one or more first molecules to which the UMI is conjugated from one or more second molecules to which a distinct UMI, having a different sequence, is conjugated. UMIs are typically short, e.g., about 5 to 20 bases in length, and may be conjugated to one or more target molecules of interest or amplification products thereof. UMIs may be single or double stranded. In some embodiments, both a barcode sequence and a UMI are incorporated into a barcode. Generally, a UMI is used to distinguish between molecules of a similar type within a population or group, whereas a barcode sequence is used to distinguish between populations or groups of molecules that are derived from different cells. In some embodiments, where both a UMI and a barcode sequence are utilized, the UMI is shorter in sequence length than the barcode sequence. The use of barcodes is further described in US Patent Application No. 15/940,850, which is hereby incorporated by reference in its entirety.

[00247] In some embodiments, the barcodes are single-stranded barcodes. Single-stranded barcodes can be generated using a number of techniques. For example, they can be generated by obtaining a plurality of DNA barcode molecules in which the sequences of the different molecules are at least partially different. These molecules can then be amplified so as to produce single stranded copies using, for instance, asymmetric PCR. Alternatively, the barcode molecules can be circularized and then subjected to rolling circle amplification. This will yield a product molecule in which the original DNA barcode is concatenated numerous times as a single long molecule.

[00248] In some embodiments, circular barcode DNA containing a barcode sequence flanked by any number of constant sequences can be obtained by circularizing linear DNA. Primers that anneal to any constant sequence can initiate rolling circle amplification by the use of a strand displacing polymerase (such as Phi29 polymerase), generating long linear concatemers of barcode DNA.

[00249] In various embodiments, barcodes can be linked to a primer sequence that enables the barcode to label a target nucleic acid. In one embodiment, the barcode is linked to a forward primer sequence. In various embodiments, the forward primer sequence is a gene specific primer that hybridizes with a forward target of a nucleic acid. In various embodiments, the forward primer sequence is a constant region, such as a PCR handle, that hybridizes with a complementary sequence attached to a gene specific primer. The complementary sequence attached to a gene specific primer can be provided. Including a constant forward primer sequence on barcodes may be preferable as the barcodes can have the same forward primer and need not be individually designed to be linked to gene specific forward primers.

[00250] In various embodiments, barcodes can be releasably attached to a support structure, such as a bead. Therefore, a single bead with multiple copies of barcodes can be partitioned into an emulsion with a cell lysate, thereby enabling labeling of analytes of the cell lysate with the barcodes of the bead. Example beads include solid beads (e.g., silica beads), polymeric beads, or hydrogel beads (e.g., polyacrylamide, agarose, or alginate beads). Beads can be synthesized using a variety of techniques. For example, using a mix-split technique, beads with many copies of the same, random barcode sequence can be synthesized. This can be accomplished by, for example, creating a plurality of beads including sites on which DNA can be synthesized. The beads can be divided into four collections and each mixed with a buffer that will add a base to it, such as an A, T, G, or C. By dividing the population into four subpopulations, each subpopulation can have one of the bases added to its surface. This reaction can be accomplished in such a way that only a single base is added and no further bases are added. The beads from all four subpopulations can be combined and mixed together, and divided into four populations a second time. In this division step, the beads from the previous four populations may be mixed together randomly. They can then be added to the four different solutions, adding another, random base on the surface of each bead. This process can be repeated to generate sequences on the surface of the bead of a length approximately equal to the number of times that the population is split and mixed. If this was done 10 times, for example, the result would be a population of beads in which each bead has many copies of the same random 10-base sequence synthesized on its surface. The sequence on each bead would be determined by the particular sequence of reactors it ended up in through each mix-split cycle. 
Additional details of example beads and their synthesis are described in International Application No. PCT/US2016/016444, which is hereby incorporated by reference in its entirety.
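The mix-split synthesis described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: each bead is modeled as carrying a single sequence (standing in for its many identical copies), each bead is assigned to one of the four reactors uniformly at random in every round, and the function name and seed parameter are hypothetical.

```python
import random


def split_pool_synthesis(n_beads, n_rounds, seed=0):
    """Simulate mix-split barcode synthesis; returns one sequence per bead."""
    rng = random.Random(seed)
    beads = ["" for _ in range(n_beads)]
    for _ in range(n_rounds):
        # Split: every bead lands in one of four reactors (A, T, G, or C).
        # Extend: the reactor adds exactly one base to the bead's sequence.
        # Pool: building one combined list models remixing before the next round.
        beads = [sequence + rng.choice("ATGC") for sequence in beads]
    return beads
```

With 10 rounds, each simulated bead carries one of 4^10 = 1,048,576 possible 10-base sequences, determined by the reactors it passed through.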

Reagents

[00251] Embodiments described herein include the encapsulation of a cell with reagents (e.g., reagents 120A and/or 120B in FIG. 1C) within a droplet (e.g., a first droplet and/or a second droplet in FIG. 1C). Generally, the reagents interact with the encapsulated cell under conditions in which the cell is lysed, thereby releasing target analytes of the cell. The reagents can further interact with target analytes to prepare for subsequent barcoding and/or amplification.

[00252] In various embodiments, the reagents include one or more lysing agents that cause the cell to lyse. Examples of lysing agents include detergents such as Triton X-100, Nonidet P-40 (NP40) as well as cytotoxins. In some embodiments, the reagents include NP40 detergent which is sufficient to disrupt the cell membrane and cause cell lysis, but does not disrupt chromatin-packaged DNA. In various embodiments, the reagents include 0.01%, 0.05%, 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1.0%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, 2.0%, 3.0%, 3.1%, 3.2%, 3.3%, 3.4%, 3.5%, 3.6%, 3.7%, 3.8%, 3.9%, 4.0%, 4.1%, 4.2%, 4.3%, 4.4%, 4.5%, 4.6%, 4.7%, 4.8%, 4.9%, or 5.0% NP40 (v/v). In various embodiments, the reagents include at least 0.01%, at least 0.05%, at least 0.1%, at least 0.5%, at least 1%, at least 2%, at least 3%, at least 4%, or at least 5% NP40 (v/v).

[00253] In various embodiments, the reagents further include proteases that assist in the lysing of the cell and/or accessing of genomic DNA. Examples of proteases include proteinase K, pepsin, protease (subtilisin Carlsberg), protease type X (Bacillus thermoproteolyticus), and protease type XIII (Aspergillus saitoi). In various embodiments, the reagents include 0.01 mg/mL, 0.05 mg/mL, 0.1 mg/mL, 0.2 mg/mL, 0.3 mg/mL, 0.4 mg/mL, 0.5 mg/mL, 0.6 mg/mL, 0.7 mg/mL, 0.8 mg/mL, 0.9 mg/mL, 1.0 mg/mL, 1.5 mg/mL, 2.0 mg/mL, 2.5 mg/mL, 3.0 mg/mL, 3.5 mg/mL, 4.0 mg/mL, 4.5 mg/mL, 5.0 mg/mL, 6.0 mg/mL, 7.0 mg/mL, 8.0 mg/mL, 9.0 mg/mL, or 10.0 mg/mL of proteases. In various embodiments, the reagents include between 0.1 mg/mL and 5 mg/mL of proteases. In various embodiments, the reagents include between 0.5 mg/mL and 2.5 mg/mL of proteases. In various embodiments, the reagents include between 0.75 mg/mL and 1.5 mg/mL of proteases. In various embodiments, the reagents include between 0.9 mg/mL and 1.1 mg/mL of proteases.

[00254] In various embodiments, the reagents can further include dNTPs, stabilization agents such as dithiothreitol (DTT), and buffer solutions. In various embodiments, the reagents can include primers, such as antibody tag primers. In various embodiments, the reagents can include primers, such as reverse primers that hybridize with a target analyte (e.g., genomic DNA or an antibody oligonucleotide). In various embodiments, such primers can be gene specific primers. Example primers are described in further detail below.

Primers (or Primer Reagents)

[00255] Embodiments of the invention described herein use primers to conduct the single-cell analysis. For example, primers are implemented during the workflow process shown in FIG. 1. Primers can be used to prime (e.g., hybridize) with specific sequences of nucleic acids of interest, such that the nucleic acids of interest can be barcoded and/or amplified. Specifically, primers hybridize to a target sequence and act as a substrate for enzymes (e.g., polymerases) that catalyze nucleic acid synthesis off a template strand to which the primer has hybridized. As described hereafter, primers can be provided in the workflow process shown in FIG. 1 in various steps. Referring again to FIG. 1, in various embodiments, primers can be included in the reagents 120 that are encapsulated with the cell 102. In various embodiments, primers can be included in the reagents that are encapsulated with the cell lysate 130. In various embodiments, primers can be included in or linked with a barcode 145 that is encapsulated with the cell lysate 130. Further description and examples of primers that are used in a single-cell analysis workflow process are described in US Application No. 16/749,731, which is hereby incorporated by reference in its entirety.

[00256] In various embodiments, the number of distinct primers in any of the reagents, or with barcodes, may range from about 1 to about 500 or more, e.g., about 2 to 100 primers, about 2 to 10 primers, about 10 to 20 primers, about 20 to 30 primers, about 30 to 40 primers, about 40 to 50 primers, about 50 to 60 primers, about 60 to 70 primers, about 70 to 80 primers, about 80 to 90 primers, about 90 to 100 primers, about 100 to 150 primers, about 150 to 200 primers, about 200 to 250 primers, about 250 to 300 primers, about 300 to 350 primers, about 350 to 400 primers, about 400 to 450 primers, about 450 to 500 primers, or about 500 primers or more.

[00257] For targeted DNA sequencing, primers in the reagents (e.g., reagents 120 in FIG. 1) may include reverse primers that are complementary to a reverse target sequence on a nucleic acid of interest (e.g., DNA or RNA). In various embodiments, primers in the reagents may be gene-specific primers that target a reverse target sequence of a gene of interest. In various embodiments, primers in the reagents may include forward primers that are complementary to a forward target sequence on a nucleic acid of interest (e.g., DNA). In various embodiments, primers in the reagents may be gene-specific primers that target a forward target of a gene of interest. In various embodiments, primers of the reagents form primer sets (e.g., forward primer and reverse primer) for a region of interest on a nucleic acid. Example gene-specific primers can be primers that target any of the genes identified in the “Targeted Panels” section above.

[00258] The number of distinct forward or reverse primers for genes of interest that are added may be from about one to 500, e.g., about 1 to 10 primers, about 10 to 20 primers, about 20 to 30 primers, about 30 to 40 primers, about 40 to 50 primers, about 50 to 60 primers, about 60 to 70 primers, about 70 to 80 primers, about 80 to 90 primers, about 90 to 100 primers, about 100 to 150 primers, about 150 to 200 primers, about 200 to 250 primers, about 250 to 300 primers, about 300 to 350 primers, about 350 to 400 primers, about 400 to 450 primers, about 450 to 500 primers, or about 500 primers or more.

[00259] In various embodiments, instead of the primers being included in the reagents, such primers can be included in or linked to a barcode. In particular embodiments, the primers are linked to an end of the barcode and, therefore, are available to hybridize with target sequences of nucleic acids in the cell lysate.

[00260] In various embodiments, primers of the reagents, or primers of barcodes may be added to an emulsion in one step, or in more than one step. For instance, the primers may be added in two or more steps, three or more steps, four or more steps, or five or more steps. Regardless of whether the primers are added in one step or in more than one step, they may be added after the addition of a lysing agent, prior to the addition of a lysing agent, or concomitantly with the addition of a lysing agent. When added before or after the addition of a lysing agent, the primers of the reagents may be added in a separate step from the addition of a lysing agent (e.g., as exemplified in the two step workflow process shown in FIG. 1C).

[00261] A primer set for the amplification of a target nucleic acid typically includes a forward primer and a reverse primer that are complementary to a target nucleic acid or the complement thereof. In some embodiments, amplification can be performed using multiple target-specific primer pairs in a single amplification reaction, wherein each primer pair includes a forward target-specific primer and a reverse target-specific primer, where each includes at least one sequence that is substantially complementary or substantially identical to a corresponding target sequence in the sample, and each primer pair having a different corresponding target sequence. Accordingly, certain methods herein are used to detect or identify multiple target sequences from a single cell sample.

Example Kit Embodiments

[00262] Also provided herein are kits for performing the single-cell workflow for determining cellular genotypes and phenotypes of populations of cells. The kits may include one or more of the following: fluids for forming emulsions (e.g., carrier phase, aqueous phase), barcoded beads, microfluidic devices for processing single cells, reagents for lysing cells and releasing cell analytes, reagents and buffers for labeling cells with antibodies, reaction mixtures for performing nucleic acid amplification reactions, and instructions for using any of the kit components according to the methods described herein.

System and/or Computer Embodiments

[00263] Additionally described herein are systems and computer embodiments for performing the single cell analysis described above. An example system can include a single cell workflow device and a computing device, such as single cell workflow device 106 and computing device 108 shown in FIG. 1A. In various embodiments, the single cell workflow device 106 is configured to perform the steps of cell encapsulation 160, lysis and digestion 165, cell reencapsulation 170, and/or barcoding and amplification 175. In various embodiments, the computing device 108 is configured to perform the in silico steps such as read alignment, determining presence or absence of the analyte of the cell, and/or determining one or more mutations (e.g., SNV, indel, CNV etc).

[00264] In various embodiments, a single cell workflow device 106 includes at least a microfluidic device that is configured to encapsulate cells with reagents, encapsulate cell lysates with reagents, and perform nucleic acid amplification reactions. For example, the microfluidic device can include one or more fluidic channels that are fluidically connected. Therefore, the combining of an aqueous fluid through a first channel and a carrier fluid through a second channel results in the generation of emulsion droplets. In various embodiments, the fluidic channels of the microfluidic device may have at least one cross-sectional dimension on the order of a millimeter or smaller (e.g., less than or equal to about 1 millimeter). Additional details of microchannel design and dimensions are described in International Patent Application No. PCT/US2016/016444 and US Patent Application No. 14/420,646, each of which is hereby incorporated by reference in its entirety. An example of a microfluidic device is the Tapestri™ Platform.

[00265] In various embodiments, the single cell workflow device 106 may also include one or more of: (a) a temperature control module for controlling the temperature of one or more portions of the subject devices and/or droplets therein and which is operably connected to the microfluidic device(s), (b) a detection module, i.e., a detector, e.g., an optical imager, operably connected to the microfluidic device(s), (c) an incubator, e.g., a cell incubator, operably connected to the microfluidic device(s), and (d) a sequencer operably connected to the microfluidic device(s). The one or more temperature and/or pressure control modules provide control over the temperature and/or pressure of a carrier fluid in one or more flow channels of a device. As an example, a temperature control module may be one or more thermal cyclers that regulate the temperature for performing nucleic acid amplification. The one or more detection modules, i.e., a detector, e.g., an optical imager, are configured for detecting the presence of one or more droplets, or one or more characteristics thereof, including their composition. In some embodiments, detector modules are configured to recognize one or more components of one or more droplets, in one or more flow channels. The sequencer is a hardware device configured to perform sequencing, such as next generation sequencing. Examples of sequencers include Illumina sequencers (e.g., MiniSeq™, MiSeq™, NextSeq™ 550 Series, or NextSeq™ 2000), Roche sequencing system 454, and Thermo Fisher Scientific sequencers (e.g., Ion GeneStudio S5 system, Ion Torrent Genexus System).

[00266] FIG. 5 depicts an example computing device for implementing systems and methods described in reference to FIGS. 1-4. For example, the example computing device 108 is configured to perform the in silico steps such as read alignment, determining presence or absence of the analyte of the cell, and/or determining one or more mutations (e.g., SNV, indel, CNV, etc.). Examples of a computing device can include a personal computer, desktop computer, laptop, server computer, a computing node within a cluster, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.

[00267] FIG. 5 illustrates an example computing device 108 for implementing system and methods described in FIGS. 1-4B. In some embodiments, the computing device 108 includes at least one processor 502 coupled to a chipset 504. The chipset 504 includes a memory controller hub 520 and an input/output (I/O) controller hub 522. A memory 506 and a graphics adapter 512 are coupled to the memory controller hub 520, and a display 518 is coupled to the graphics adapter 512. A storage device 508, an input interface 514, and network adapter 516 are coupled to the I/O controller hub 522. Other embodiments of the computing device 108 have different architectures.

[00268] The storage device 508 is a non-transitory computer-readable storage medium such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device. The memory 506 holds instructions and data used by the processor 502. The input interface 514 is a touch-screen interface, a mouse, track ball, or other type of input interface, a keyboard, or some combination thereof, and is used to input data into the computing device 108. In some embodiments, the computing device 108 may be configured to receive input (e.g., commands) from the input interface 514 via gestures from the user. The graphics adapter 512 displays images and other information on the display 518. For example, the display 518 can show an indication of a predicted cell trajectory. The network adapter 516 couples the computing device 108 to one or more computer networks.

[00269] The computing device 108 is adapted to execute computer program modules for providing functionality described herein. As used herein, the term “module” refers to computer program logic used to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on the storage device 508, loaded into the memory 506, and executed by the processor 502.

[00270] The types of computing devices 108 can vary from the embodiments described herein. For example, the computing device 108 can lack some of the components described above, such as graphics adapters 512, input interface 514, and displays 518. In some embodiments, a computing device 108 can include a processor 502 for executing instructions stored on a memory 506.

[00271] In various embodiments, methods described herein, such as methods of aligning sequence reads, methods of determining cellular genotypes and phenotypes, and/or methods of analyzing cells using cellular genotypes and phenotypes can be implemented in hardware or software, or a combination of both. In one embodiment, a non-transitory machine-readable storage medium, such as one described above, is provided, the medium comprising a data storage material encoded with machine readable data which, when using a machine programmed with instructions for using said data, is capable of displaying any of the datasets and execution and results of a cell trajectory of this invention. Such data can be used for a variety of purposes, such as patient monitoring, treatment considerations, and the like. Embodiments of the methods described above can be implemented in computer programs executing on programmable computers, comprising a processor, a data storage system (including volatile and non-volatile memory and/or storage elements), a graphics adapter, an input interface, a network adapter, at least one input device, and at least one output device. A display is coupled to the graphics adapter. Program code is applied to input data to perform the functions described above and generate output information. The output information is applied to one or more output devices, in known fashion. The computer can be, for example, a personal computer, microcomputer, or workstation of conventional design.

[00272] Each program can be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Each such computer program is preferably stored on a storage media or device (e.g., ROM or magnetic diskette) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.

[00273] The signature patterns and databases thereof can be provided in a variety of media to facilitate their use. “Media” refers to a manufacture that contains the signature pattern information of the present invention. The databases of the present invention can be recorded on computer readable media, e.g. any medium that can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy discs, hard disc storage medium, and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. One of skill in the art can readily appreciate how any of the presently known computer readable mediums can be used to create a manufacture comprising a recording of the present database information. “Recorded” refers to a process for storing information on computer readable medium, using any such methods as known in the art. Any convenient data storage structure can be chosen, based on the means used to access the stored information. A variety of data processor programs and formats can be used for storage, e.g. word processing text file, database format, etc.

EXAMPLES

Example 1: Example Results of the Correlation Clustering Methods

[00274] FIGS. 6A-6D show example performance obtained by running an algorithm that implements the correlation clustering method, including the following steps: removing amplicons that have zero coverage in all cells; keeping barcodes with coverage and completeness above a pre-determined threshold; generating the per-barcode correlations to the experiment as a whole; removing barcodes that have a correlation less than a minimum (r² < 0.1); clustering the data (e.g., per-barcode correlation and log10 mean coverage) and returning an array of cluster assignments using density-based spatial clustering of applications with noise (DBSCAN); and grouping in good outliers.
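As a non-limiting sketch of the normalization steps listed above, the following computes, for each barcode, the squared Pearson correlation (r²) between its amplicon profile and the average amplicon profile, together with the log10 of its mean coverage. The input layout (a mapping from barcode to per-amplicon read counts), the function names, and the interpretation of mean coverage as mean reads per amplicon are assumptions; the coverage and completeness threshold step is omitted for brevity.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_features(counts, min_r2=0.1):
    """Return {barcode: (r2, log10_mean_coverage)} for barcodes with r2 >= min_r2."""
    n_amplicons = len(next(iter(counts.values())))
    # Remove amplicons that have zero coverage in every barcode.
    keep = [j for j in range(n_amplicons)
            if any(row[j] > 0 for row in counts.values())]
    rows = {bc: [row[j] for j in keep] for bc, row in counts.items()}
    # Average amplicon profile of the experiment as a whole.
    bulk = [sum(row[j] for row in rows.values()) / len(rows)
            for j in range(len(keep))]
    features = {}
    for bc, row in rows.items():
        mean_coverage = sum(row) / len(row)
        if mean_coverage <= 0:
            continue  # log10 is undefined for an empty barcode
        r2 = pearson(row, bulk) ** 2
        if r2 >= min_r2:
            features[bc] = (r2, math.log10(mean_coverage))
    return features
```

The resulting pairs are the two values on which the barcodes are subsequently clustered.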

[00275] In this example, “existing CF” includes 2,876 cells and “Corr-cluster” includes 3,789 cells (131.7%). Out of the 252 runs of the algorithm, 206 runs exhibit improved performance (e.g., as compared with previously applied algorithms), 23 runs exhibit worse performance, and 23 runs exhibit no cells being called.
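The percentage reported above is consistent with the ratio of cells called by the two approaches, as the following short check illustrates. The function name is hypothetical, and the reading of the percentage as a simple ratio of cell counts is an assumption.

```python
def relative_cell_yield(new_cells, baseline_cells):
    """Cells called by the new method as a percentage of the baseline count."""
    return 100.0 * new_cells / baseline_cells


# 3,789 cells versus 2,876 cells, rounded to one decimal place.
yield_percent = round(relative_cell_yield(3789, 2876), 1)

# The three run outcomes account for all 252 runs.
total_runs = 206 + 23 + 23
```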

[00276] FIGS. 6A-6D show four example runs that are successful. More specifically, two clusters were visualized and labelled on a 2D space with “log10 mean coverage” as the X-axis and “r² of barcode coverage to bulk coverage” as the Y-axis. The cluster on the left of the 2D space represented “invalid-cells” that referred to mergers. The cluster on the right of the 2D space represented “valid-cells” that referred to singlet-like barcodes. Additionally, the data points in proximity to the right cluster represented “outliers called valid-cells,” and the data points in proximity to the left cluster represented “unassigned-barcodes.”
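The clustering and labelling described above may be sketched as follows. The sketch uses a minimal pure-Python DBSCAN written for illustration rather than any particular library implementation; the eps and min_samples values are hypothetical; and, as a simplification, all noise points are reported as unassigned rather than being regrouped into the valid cluster.

```python
import math
from statistics import median


def dbscan(points, eps, min_samples):
    """Plain O(n^2) DBSCAN; returns one cluster id per point (-1 means noise)."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, so keep expanding
    return labels


def label_clusters(points, labels):
    """Label points given as (log10 mean coverage, r2); None unless 2+ clusters exist."""
    clusters = sorted({lab for lab in labels if lab >= 0})
    if len(clusters) < 2:
        return None
    medians = {c: median(p[0] for p, lab in zip(points, labels) if lab == c)
               for c in clusters}
    valid = max(medians, key=medians.get)  # right-most cluster on the coverage axis
    return ["unassigned-barcodes" if lab == -1
            else "valid-cells" if lab == valid
            else "invalid-cells"
            for lab in labels]
```

Here the cluster with the highest median log10 mean coverage is taken to be the right-hand cluster of the 2D space.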

[00277] As shown in FIG. 6A, the assignments include 48938 invalid-barcode and 7459 valid-cell. As shown in FIG. 6B, the assignments include 5382 invalid-barcode and 5000 valid-cell. As shown in FIG. 6C, the assignments include 1974 invalid-barcode and 2512 valid-cell. As shown in FIG. 6D, the assignments include 1924 invalid-barcode and 5640 valid-cell.

[00278] For the runs that called more cells (e.g., runs exhibiting improved performance), the mean improvement was about 168.0%, and the median improvement was about 135.9%.

Example 2: Example Results of the Dimensional Reduction Method

[00279] FIGS. 7-8 show example performance of the dimensional reduction method by running an algorithm that implements the dimensional reduction method including the following steps: removing incomplete barcodes (e.g., requiring minimum fraction of amplicons to have minimum coverage); removing barcodes with low total reads; for each barcode, normalizing read counts across amplicons; for each amplicon, normalizing the read counts across barcodes; and reducing dimensions (2D) using UMAP.
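By way of illustration, the filtering and two-way normalization steps above may be sketched as follows. The specific normalizations (dividing each barcode by its total reads, then dividing each amplicon by its mean across barcodes), the thresholds, and the function names are assumptions made for the sketch, and the UMAP step itself is not shown.

```python
def filter_barcodes(counts, min_amplicon_reads=5, min_fraction=0.7, min_total=20):
    """Return indices of rows (barcodes) that are complete and deep enough."""
    kept = []
    for i, row in enumerate(counts):
        covered = sum(1 for c in row if c >= min_amplicon_reads)
        if covered / len(row) >= min_fraction and sum(row) >= min_total:
            kept.append(i)
    return kept


def normalize_counts(counts):
    """Normalize across amplicons per barcode, then across barcodes per amplicon."""
    # Per barcode: convert read counts to fractions of that barcode's total reads.
    per_barcode = [[c / sum(row) for c in row] for row in counts]
    n_barcodes, n_amplicons = len(per_barcode), len(per_barcode[0])
    # Per amplicon: divide by the mean of that amplicon over all barcodes.
    means = [sum(row[j] for row in per_barcode) / n_barcodes
             for j in range(n_amplicons)]
    return [[row[j] / means[j] if means[j] > 0 else 0.0
             for j in range(n_amplicons)]
            for row in per_barcode]
```

The normalized matrix (barcodes by amplicons) would then be passed to a dimensional reduction step to obtain two coordinates per barcode. The sketch assumes that barcodes with zero total reads have already been removed by the filter.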

Example 2.1: Example Results from UMAP - K-Means Method

[00280] The method in FIGS. 7A-7D uses the above results from the UMAP as an input and clusters the output using k-means (e.g., to generate 2-10 clusters), followed by: ordering the clusters in descending order based on median per-barcode read count; partitioning the clusters into two groups (e.g., one group includes clusters 1-5, and another group includes clusters 6-10); requiring a minimum ratio of medians from the two clusters (e.g., clusters 5 and 6) around the cut point or index; maximizing the difference in median read counts between groups; and generating output when, e.g., two clusters are found.

[00281] In some scenarios, k-means after UMAP works effectively when the separation of clusters is clear. It may be challenging to find a step function in some cases. The method may instead find alternate structures in the data.
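The cut-point selection described above for the k-means method may be sketched as follows, taking cluster labels as already computed. The function name, the input layout, and the default minimum ratio are hypothetical, and the k-means and UMAP steps are not shown.

```python
from statistics import median


def choose_cut(read_counts, labels, min_ratio=2.0):
    """Split clusters into (high_read_clusters, low_read_clusters), or return None."""
    clusters = {}
    for count, label in zip(read_counts, labels):
        clusters.setdefault(label, []).append(count)
    # Order clusters in descending order of median per-barcode read count.
    ordered = sorted(clusters, key=lambda c: median(clusters[c]), reverse=True)
    best = None
    for cut in range(1, len(ordered)):
        high, low = ordered[:cut], ordered[cut:]
        # Require a step between the two clusters adjacent to the cut point.
        m_high, m_low = median(clusters[high[-1]]), median(clusters[low[0]])
        if m_low > 0 and m_high / m_low < min_ratio:
            continue
        # Score the cut by the difference in median read counts between groups.
        diff = (median([x for c in high for x in clusters[c]])
                - median([x for c in low for x in clusters[c]]))
        if best is None or diff > best[0]:
            best = (diff, high, low)
    return None if best is None else (best[1], best[2])
```

When no cut satisfies the minimum ratio, the sketch returns None, corresponding to a run in which no cells are called.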

[00282] Out of 252 runs, 72 runs exhibit improved performance, 1 run exhibits worse performance, and 172 runs exhibit no cells being called. FIGS. 7A and 7B show two example failed runs, where 8 clusters were identified in FIG. 7A, and 10 clusters were identified in FIG. 7B. FIGS. 7C and 7D show two example successful runs, where two clusters are visualized and labelled on each 2D space.

Example 2.2: Example Results from UMAP - Split Method

[00283] The method in FIGS. 8A-8D uses the above results from the UMAP as an input and generates a grid of points in the 2D UMAP plane, followed by: splitting the points with lines of various slopes from -90° to 90° to assign valid and invalid barcodes or cells; maximizing the absolute difference between median per-barcode read counts; and filtering and selecting lines between clusters (e.g., by requiring a minimum fraction of valid (10%) and invalid (5%) barcodes, and removing lines where too many points are in proximity of the lines).
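A simplified sketch of the line search is given below. The grid size, the number of angles, and the rule that the side with the higher median read count is treated as valid are assumptions made for illustration; the filter that removes lines with too many nearby points is omitted for brevity.

```python
import numpy as np

def best_split(embedding, read_counts, grid_size=10, n_angles=19,
               min_valid=0.10, min_invalid=0.05):
    """Search for a straight line separating valid and invalid barcodes.

    grid_size and n_angles are illustrative assumptions; the minimum
    fractions follow the example values in the text.
    Returns a boolean mask (True = valid) or None if no line qualifies.
    """
    xy = np.asarray(embedding, dtype=float)
    reads = np.asarray(read_counts, dtype=float)
    xs = np.linspace(xy[:, 0].min(), xy[:, 0].max(), grid_size)
    ys = np.linspace(xy[:, 1].min(), xy[:, 1].max(), grid_size)
    angles = np.deg2rad(np.linspace(-90.0, 90.0, n_angles))
    best = None
    for gx in xs:
        for gy in ys:
            for theta in angles:
                # Normal vector of the line through (gx, gy) at angle theta.
                normal = np.array([-np.sin(theta), np.cos(theta)])
                side = (xy - [gx, gy]) @ normal > 0
                if side.all() or not side.any():
                    continue
                med_a = np.median(reads[side])
                med_b = np.median(reads[~side])
                # Assumption: the side with the higher median holds valid cells.
                valid = side if med_a >= med_b else ~side
                if valid.mean() < min_valid or (~valid).mean() < min_invalid:
                    continue
                score = abs(med_a - med_b)
                if best is None or score > best[0]:
                    best = (score, valid.copy())
    return None if best is None else best[1]
```

Because the median is insensitive to a few misplaced points, several lines can tie on the score; the proximity filter described in the text is presumably what disambiguates such cases in practice.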

[00284] FIGS. 8A-8C show three example successful results, where two clusters are identified and separated by a line in each 2D space. FIG. 8D is an example “failure” result, where only one cluster is identified in the 2D space.

[00285] In terms of performance, out of 252 runs, 181 runs exhibit improved performance, 16 runs exhibit worse performance, and 54 runs exhibit no cells being called. For the runs that called more cells (e.g., runs exhibiting improved performance), the mean improvement was about 192.6%, and the median improvement was about 139.1%.

Example 3: Example Results from Combination of Correlation Cluster and Dimensional Reduction-Split Methods

[00286] In this example, both the correlation cluster and dimensional reduction methods were applied, and the method that called the greater number of cells was selected.

[00287] In terms of performance, out of 252 runs (also named as “experiments” herein), 232 runs exhibit improved performance, 14 runs exhibit worse performance, and 6 runs exhibit no cells being called. For the runs that called more cells, the mean improvement was about 197.3%, and the median improvement was about 141.8%.

Example 3.1: PRQ0077 mixture

[00288] FIGS. 9A-9D and Table 1 below show example results of both methods applied to PR00077 mixture in comparison with existing data. As shown in the results, an increase in the number of cells does not introduce doublets at a higher rate.

Table 1

Example 3.2: PRQ0078 mixture

[00289] FIGS. 10A-10D and Table 2 below show example results of both methods applied to PR00078 mixture in comparison with existing data. As shown in the results, an increase in the number of cells does not introduce doublets at a higher rate.

Table 2

[00290] Altogether, the results shown in FIGS. 6-10 show that mergers can be successfully identified using the systems and methods as described in the presently disclosed embodiments.

Example 4: Example Methodology of Correlation-Dimensional Reduction Method

[00291] This example describes the example methodology for performing the correlation-UMAP methodology. The steps are as follows:

1. Obtain sequence reads of amplicons comprising barcodes. Normalize the dataset to generate barcode correlation values and barcode coverage values. Barcode correlation values and barcode coverage values can be structured/stored as a correlation-coverage matrix. Further normalize the dataset to generate normalized read counts. The normalized read counts can be structured/stored as a normalized read counts matrix.

2. Perform a UMAP on the normalized read counts matrix concatenated with the correlation-coverage matrix. In certain cases the normalized read counts don’t provide enough information to differentiate mergers from cells, whereas in other cases it’s the other way around. Combining both into one matrix and performing a UMAP on it increases the odds of separating mergers from cells.

3. Perform flat-HDBSCAN on the UMAP for various minimum cluster sizes and various numbers of clusters. Flat-HDBSCAN generates a predetermined number of clusters using a density-based approach; this is similar to k-means for density-based clustering. By default the maximum number of clusters is limited to 3; however, this can be modified in case of a failure of the algorithm. It is also possible to specify the exact number of cell clusters and merger clusters.

4. Label the clusters according to their position on the correlation-coverage plot. This is performed for each clustering parameter combination. Clusters with higher coverage (cov in the table) and lower R²/coverage (r2/cov in the table) are labeled as cells and the others are labeled as mergers. The cell clusters may have a low silhouette score (default: 0.1) with respect to each other on the correlation-coverage plot. Distinct cell clusters on the correlation-coverage plot are not supported (also not observed in the test data). Clustering parameters where no cell cluster can be identified are discarded from further analysis.

The additional steps described below are useful for validating the labeled clusters.

5. Generate a quality score for each clustering where the clusters could be labeled. This measures the efficacy of the identification of mergers and cells. This quality score is a product of three metrics.

1. Silhouette score: This is calculated on the correlation-coverage plot by combining the cell clusters into 1 cluster (0 and 1 are combined into one cluster) and using the mergers (cluster 2) as the second population. The outliers (cluster -1) are discarded for this score. The purpose of this is to ensure that the clusters have significant separation on the correlation-coverage plot.

2. Percentage cells clustered: This is simply the percentage of barcodes not labeled as outliers (-1). This factor enables the selection of clusters with the least number of outliers.

3. Cell-score: This score is used to measure the relative position of the cell and merger clusters. A linear fit is generated on each merger population. The score is the percentage cells below the linear fit. This factor enables the identification of single clusters. In case of single clusters the cell-score will be low due to the fact that the cluster identified as mergers will align with the cluster identified as cells.

6. Select clusters with the highest quality score. If the highest score is below a threshold (0.25 by default), then the algorithm has failed to identify more than 1 cluster. If there are fewer than a predefined number of total barcodes (5000 by default) then all barcodes are labeled as cells (single-cluster), otherwise the algorithm fails and reverts to the completeness algorithm.

7. Recover unassigned barcodes. These are barcodes which could potentially be cells, but were marked as outliers (-1) by HDBSCAN. They are recovered by looking at their proximity to the nearest labelled barcode. If the closest invalid-barcode is at least twice as far as the closest valid-cell, then the outlier is marked as a recovered cell (1+ in this case).
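The recovery of unassigned barcodes in step 7 can be sketched as follows. The integer label encoding, the function name, and the use of Euclidean distance in whatever coordinate space is passed in are assumptions; the text does not state in which space proximity is measured.

```python
import numpy as np

# Assumed label encoding for illustration only.
INVALID, VALID, OUTLIER, RECOVERED = 0, 1, -1, 2

def recover_outliers(points, labels, ratio=2.0):
    """Relabel outliers that lie much closer to valid cells than to invalid barcodes.

    An outlier is recovered when its nearest invalid-barcode is at least
    `ratio` times as far away as its nearest valid-cell.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    out = labels.copy()
    valid = points[labels == VALID]
    invalid = points[labels == INVALID]
    if len(valid) == 0:
        return out  # nothing to be close to, so nothing is recovered
    for i in np.flatnonzero(labels == OUTLIER):
        d_valid = np.linalg.norm(valid - points[i], axis=1).min()
        d_invalid = (np.linalg.norm(invalid - points[i], axis=1).min()
                     if len(invalid) else np.inf)
        if d_invalid >= ratio * d_valid:
            out[i] = RECOVERED
    return out
```

For example, an outlier at distance 1 from the nearest valid-cell and distance 9 from the nearest invalid-barcode would be recovered, whereas one at distances 6 and 4 would remain an outlier.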

[00292] Example parameters of the algorithm include:

• max_num_clusters (default: 3) - This is the maximum number of clusters the algorithm will test. Each cluster could be either valid-cells or invalid-barcodes. Therefore, either 1 or 2 valid-cell clusters and 0, 1, or 2 merger clusters can be identified, with their combined count being less than or equal to 3.

• num_cell_clusters and num_merger_clusters (default: None) When these are provided, max_num_clusters is ignored and the algorithm attempts to find the given number of cell and merger clusters. It might find fewer clusters if there is no way of splitting the barcodes into populations such that the given number of clusters are obtained. With the default values the algorithm may sometimes incorrectly label a merger cluster as cells or vice-versa. Passing these in such situations is useful.

• umap_corr_cov_fraction (default: 0.5) This is the weight given to the correlation-coverage matrix when combining it with the normalized read counts to generate the UMAP. When this is set to 0, the UMAP is generated using only the normalized read counts. When it is set to 1, it behaves like the correlation-cluster algorithm and tries to find the clusters on the correlation-coverage plot, albeit using its own methods such as flat-HDBSCAN and cluster quality scoring.

• min_quality (default: 0.25) This is the minimum value of the product of the silhouette score, percentage cells clustered, and cell-score as described above. 0.25 was chosen as the default since that is the recommended threshold for a good silhouette score. Percentage cells clustered and cell-score are both less than or equal to 1. In the worst-case scenario, where the merger and cell clusters just separate out on the correlation-coverage plot and are also in the correct positions, the score would be 0.25. Sometimes the clustering may be correct but may result in a lower quality score. Reducing this parameter will recover such samples.

• max_cell_silhouette_score (default: 0.1): The maximum silhouette score between the cell clusters on the correlation-coverage plot. It is observed that the cell clusters which might separate out on the UMAP overlap significantly on the correlation-coverage plot. This ensures that arbitrary small cell clusters far away from the main cell population are not artificially created.

• max_single_cluster_cells (default: 5000): The maximum number of total barcodes which will be called as cells if no clustering resulting in a quality score above min_quality is found. These are samples where a single cluster is found on the correlation-coverage plot and all barcodes are cells.
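To illustrate how the correlation-coverage matrix and the weighting parameter might interact, a minimal sketch is given below. It assumes that the barcode correlation value is the Pearson correlation between a barcode's amplicon profile and the average amplicon profile, that coverage is the mean read count of the barcode across amplicons, and that the weight scales the two blocks linearly before concatenation. The function names and the linear weighting scheme are assumptions, not details taken from the text.

```python
import numpy as np

def correlation_coverage(counts):
    """Return an (n_barcodes, 2) matrix of [correlation, coverage].

    Note: a barcode with identical counts on every amplicon has an
    undefined correlation (NaN) under this definition.
    """
    counts = np.asarray(counts, dtype=float)
    mean_profile = counts.mean(axis=0)  # average amplicon profile
    corr = np.array([np.corrcoef(row, mean_profile)[0, 1] for row in counts])
    coverage = counts.mean(axis=1)      # assumed: mean reads per amplicon
    return np.column_stack([corr, coverage])

def combine_features(normalized_counts, corr_cov, umap_corr_cov_fraction=0.5):
    """Concatenate both matrices, weighting the correlation-coverage block.

    Assumed scheme: weight w on the correlation-coverage columns and
    (1 - w) on the normalized read counts, so w = 0 uses only the read
    counts and w = 1 uses only the correlation-coverage values.
    """
    w = float(umap_corr_cov_fraction)
    if not 0.0 <= w <= 1.0:
        raise ValueError("umap_corr_cov_fraction must be between 0 and 1")
    normalized_counts = np.asarray(normalized_counts, dtype=float)
    corr_cov = np.asarray(corr_cov, dtype=float)
    return np.hstack([(1.0 - w) * normalized_counts, w * corr_cov])
```

In practice the two blocks would likely need to be brought onto comparable scales before weighting; that step is left out of this sketch.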

[00293] FIGs. 11A-11D and 12 show results obtained by performing the correlation-UMAP method as described herein. For example, the number of generated clusters can vary, examples of which can include 1 merger and 1 cell cluster (FIG. 11C), or 2 merger and 1 cell cluster (FIG. 11B), or 1 merger and 2 cell clusters (FIG. 12). The clustering may be scored for its quality and, if the quality is too low, it is assumed that there was a single cluster (FIG. 11A) and all barcodes are called as cells. If there are more than 5000 barcodes and no cluster is found, it is assumed that the clustering is inefficient, and the algorithm falls back to the conventional method.
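The quality scoring and fallback behavior described above can be summarized in a short sketch. The function names and the returned strings are hypothetical; the default thresholds follow the example parameter values given earlier, and the three inputs to the score are assumed to be fractions between 0 and 1.

```python
def quality_score(silhouette, fraction_clustered, cell_score):
    """Product of the three metrics described in the methodology."""
    return silhouette * fraction_clustered * cell_score

def decide(best_quality, n_barcodes, min_quality=0.25,
           max_single_cluster_cells=5000):
    """Choose the outcome for a run given its best clustering quality.

    best_quality may be None when no clustering could be labeled.
    """
    if best_quality is not None and best_quality >= min_quality:
        return "use-clustering"
    if n_barcodes < max_single_cluster_cells:
        # Too few barcodes to expect a separate merger population:
        # treat the run as a single cluster and call all barcodes as cells.
        return "single-cluster"
    return "fallback-to-completeness"
```

For instance, a clustering with a silhouette score of 0.5, 90% of barcodes clustered, and a cell-score of 0.8 would have a quality of about 0.36 and would be accepted under the default threshold.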

[00294] In some scenarios, the dimensional reduction on which the clustering is performed is generated using both the normalized read counts and the correlation-coverage values. In some cases, the correlation-coverage values differentiate mergers from cells better than the normalized read counts (FIG. 11B). In some cases, the normalized read counts differentiate mergers from cells better than the correlation-coverage values (FIG. 11D). Thus, the combination of the two matrices and information may improve the separation between the merger and cell barcodes.

Example 5: Performance Comparisons

[00295] FIGs. 13A-13D illustrate successful labelling of clusters. For all the runs, the barcodes were manually labelled as valid-cells or invalid-barcodes. This was done by first looking at the UMAP and marking the cluster which could be cells (FIG. 13A). If the UMAP did not show clear differentiation, then the cell cluster was marked using the correlation-cluster plot (FIG. 13B).

[00296] 155/276 runs were Raji-KG1 QC or R&D runs. For these runs, all the barcodes were labelled as Raji, KG1, or Mixed using a simplified pileup-based variant caller (FIG. 13C). For performance-related reasons, only high quality SNPs were used and no local realignment was performed. This labelling was used to calculate the mixing rate and the number of singlets, i.e., pure Raji/KG1 (good cells), called by the cellfinder.
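As a small illustration of how such genotype labels could be turned into the two quantities mentioned, the sketch below counts singlets and computes a mixing rate. It assumes the mixing rate is the fraction of called barcodes labeled Mixed; the text does not give the exact formula, and the function name is hypothetical.

```python
from collections import Counter

def mixing_summary(genotype_labels, called):
    """Return (number of singlets, mixing rate) among called barcodes.

    genotype_labels: per-barcode labels, each 'Raji', 'KG1', or 'Mixed'.
    called: per-barcode booleans, True if the cellfinder called the barcode.
    """
    counts = Counter(g for g, c in zip(genotype_labels, called) if c)
    total = sum(counts.values())
    singlets = counts["Raji"] + counts["KG1"]
    # Assumed definition: Mixed barcodes as a fraction of all called barcodes.
    mixing_rate = counts["Mixed"] / total if total else 0.0
    return singlets, mixing_rate
```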

[00297] The runs were also marked as single-cluster by looking at the correlation plot (FIG. 13D). 51 runs (18%) are in this group; therefore, the identification of single-clusters is critical to the performance of the cellfinder.

[00298] Further, 21 runs were marked as edge cases of the correlation-cluster algorithm or method.

[00299] FIG. 14 illustrates performance metrics across the correlation-UMAP methodology, correlation-cluster method, and a previously published method (herein referred to as the “completeness” method). The “completeness” method refers to a conventional cell calling algorithm for identifying barcodes corresponding to “complete” cells. The conventional cell calling algorithm is based on a total read completeness parameter (e.g., > 8 * number of amplicons) and a per-amplicon read completeness parameter (e.g., >80% data completeness for working amplicons, which are defined as amplicons with greater than 0.2 * mean of all amplicon reads per qualified barcode). Further details and example implementations of the “completeness” method are described in Zhang, H., et al. “Application of high-throughput, high-depth, targeted single-nucleus DNA sequencing in pancreatic cancer.” bioRxiv 2022.03.06.483206, and Leighton et al., “Reconstructing mutational lineages in breast cancer by multi-patient-targeted single cell DNA sequencing.” bioRxiv 2021.11.16.468877, each of which is incorporated by reference in its entirety.
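One possible reading of the “completeness” criteria is sketched below. The interpretation of “data completeness” as the fraction of working amplicons reaching a minimum read count, and the min_reads value of 10, are assumptions made for illustration; the cited references should be consulted for the actual implementation.

```python
import numpy as np

def completeness_calls(counts, reads_per_amplicon=8, working_fraction=0.2,
                       min_completeness=0.8, min_reads=10):
    """Call cells from a barcodes x amplicons matrix using completeness rules.

    min_reads and the exact definition of completeness are assumptions.
    Returns a boolean mask of barcodes called as cells.
    """
    counts = np.asarray(counts, dtype=float)
    n_amplicons = counts.shape[1]
    # Total read completeness: more than 8 reads per amplicon on average.
    qualified = counts.sum(axis=1) > reads_per_amplicon * n_amplicons
    if not qualified.any():
        return qualified
    # Working amplicons: mean reads per qualified barcode exceeds
    # 0.2 times the mean over all amplicons.
    per_amplicon = counts[qualified].mean(axis=0)
    working = per_amplicon > working_fraction * per_amplicon.mean()
    # Per-amplicon completeness over the working amplicons only.
    completeness = (counts[:, working] >= min_reads).mean(axis=1)
    return qualified & (completeness > min_completeness)
```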

[00300] The sensitivity, specificity, and accuracy for the algorithms were calculated using the labels that were manually curated. Here the valid-cells are assumed to be true positives and invalid-barcodes are assumed to be true negatives. For a subset of the runs the genotype labels were used to identify the number of singlets (cells labeled Raji/KG1) and the fraction of mixed cells.
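The three metrics follow the standard confusion-matrix definitions, which can be written out as a short sketch. The function name and the boolean encoding (True for valid-cell, False for invalid-barcode) are assumptions for illustration.

```python
def classification_metrics(truth, predicted):
    """Return (sensitivity, specificity, accuracy).

    truth: manually curated labels, True for valid-cell.
    predicted: algorithm calls, True for valid-cell.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return sensitivity, specificity, accuracy
```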

[00301] An arbitrary number of tubes was sequenced for the various runs. The number of cells called by each algorithm is therefore only interpretable relative to the others.

[00302] Table 3 below illustrates performance metrics. In Table 3, Cell-Completeness = %(amplicons > 10 reads)

Table 3

[00303] In this example, the correlation-UMAP method showed that it can handle various edge cases and single cluster runs with improved performance. For example, the correlation-coverage method failed to find any cluster for 49 runs (17%) whereas correlation-UMAP failed to find any cluster (including single cluster) for 11 runs (4%).

[00304] The correlation-UMAP algorithm performs better than both the other algorithms in many metrics except percent mixed cells, for which it is 1 percentage point higher than the correlation-cluster algorithm. The labeled valid-cells also have a median mixing of 9.5%. Splitting the cells into two categories (low coverage and high coverage) and calculating the mixing rate for each of those (FIG. 15) shows that the slightly higher mixing rate is due to the extra lower coverage barcodes captured by the correlation-UMAP method. A hypothesis is that these cells are ineffectively genotyped due to their low read counts. These cells would have been discarded in tertiary analysis in most cases.

[00305] However, due to the extra cells being called the average cell-completeness for the correlation-UMAP algorithm is lower than the other two algorithms. The extra cells can always be removed in tertiary analysis if the variant of interest is on a low-flyer amplicon, but when it is on a high-flyer amplicon the extra cells can be the difference between detecting the rare population and missing it.

[00306] Correlation-cluster and correlation-UMAP both have a lower mixing rate than the completeness algorithm because the mergers are removed and significantly more cells are called. Correlation-UMAP and correlation-cluster algorithms call more cells and have a lower mixing rate, as shown in FIG. 16. Correlation-UMAP calls at least 90% of the valid-cells called by the completeness algorithm in 96% of the runs, compared to 87% for the correlation-cluster algorithm. In the majority of cases these algorithms also call other barcodes as valid-cells which are missed by the completeness method.
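The overlap statistic quoted above can be computed along the following lines. This is only a sketch: the function names are hypothetical, and treating a run with no reference calls as fully retained is an assumption.

```python
def fraction_retained(reference_calls, other_calls):
    """Fraction of reference-called barcodes that the other method also calls."""
    reference = set(reference_calls)
    if not reference:
        return 1.0  # assumed convention when the reference calls nothing
    return len(reference & set(other_calls)) / len(reference)

def runs_meeting_threshold(runs, threshold=0.9):
    """Fraction of runs in which at least `threshold` of reference calls are kept.

    runs: iterable of (reference_calls, other_calls) pairs, one per run.
    """
    runs = list(runs)
    hits = sum(fraction_retained(ref, other) >= threshold for ref, other in runs)
    return hits / len(runs) if runs else 0.0
```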
