

Title:
SYSTEM AND METHOD FOR TCR SEQUENCE IDENTIFICATION AND/OR CLASSIFICATION
Document Type and Number:
WIPO Patent Application WO/2024/018467
Kind Code:
A1
Abstract:
Methods and a computer system for the construction and use of TCR sequence sample classification/identification means. The system comprises a pre-training module configured to construct one or more pre-training datasets of TCR sequence samples data; an NLP module configured to: pre-train an NLP model in a self-supervised manner with the one or more pre-training datasets for the NLP model to provide latent space representations to amino acid sequence samples inputted thereto, and process by the pre-trained NLP model one or more training datasets of TCR sequence samples data to thereby produce training embedding data; and a training module configured to construct the one or more training datasets and train one or more classifiers with the one or more training datasets and the training embedding data, for the one or more classifiers thereby trained to identify/classify features of TCR sequence sample data inputted thereto.

Inventors:
EFRONI SOL (IL)
GOLDNER KABELI ROMI (IL)
Application Number:
PCT/IL2023/050758
Publication Date:
January 25, 2024
Filing Date:
July 19, 2023
Assignee:
CLONAL LTD (IL)
International Classes:
G06N3/045; G16B15/30; G16B30/00; G16B40/00
Domestic Patent References:
WO2022185179A12022-09-09
Foreign References:
CN113593631A2021-11-02
Other References:
WU KEVIN, YOST KATHRYN E., DANIEL BENCE, BELK JULIA A., XIA YU, EGAWA TAKESHI, SATPATHY ANSUMAN, CHANG HOWARD Y., ZOU JAMES: "TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses", BIORXIV, 19 November 2021 (2021-11-19), XP093131224, Retrieved from the Internet [retrieved on 20240214], DOI: 10.1101/2021.11.18.469186
OSTROVSKY-BERMAN MIRI, FRANKEL BOAZ, POLAK PAZIT, YAARI GUR: "Immune2vec: Embedding B/T Cell Receptor Sequences in ℝN Using Natural Language Processing", FRONTIERS IN IMMUNOLOGY, FRONTIERS MEDIA, LAUSANNE, CH, vol. 12, 1 July 2021 (2021-07-01), Lausanne, CH , pages 680687, XP093131226, ISSN: 1664-3224, DOI: 10.3389/fimmu.2021.680687
Attorney, Agent or Firm:
LOTAN, Mirit (IL)
Claims:
CLAIMS:

1. A computer system comprising one or more processors and memories configured with a set of software modules for the construction and use of TCR sequence sample classification/identification means, the system comprising: a pre-training module configured to construct one or more pre-training datasets of TCR sequence samples data; an NLP module configured to: pre-train an NLP model in a self-supervised manner with said one or more pre-training datasets for said NLP model to provide latent space representations to amino acid sequence samples inputted thereto; and process by the pre-trained NLP model one or more training datasets of TCR sequence samples data to thereby produce training embedding data; and a training module configured to construct said one or more training datasets and train one or more classifiers with said one or more training datasets and said training embedding data, for said one or more classifiers thereby trained to identify/classify features of TCR sequence sample data inputted thereto.

2. The system of claim 1 comprising a dimension reduction module configured to reduce a dimension of the training embedding data into a lower dimension data representation thereof.

3. The system of claim 1 or 2 configured for one or both of the following: generate, from the lower dimension presentation of the training embedding data, a presentation and/or analysis of the training embedding data; and/or determine one or more classification thresholds from the lower dimension presentation of the training embedding data.

4. The system of claim 2 or 3 wherein the training module is configured to train the one or more classifiers with the lower dimension presentation of the training embedding data and the training datasets.

5. The system of any one of the preceding claims comprising an NLP performance test module configured to analyze the training embedding data and evaluate the performance of the pre-trained NLP model based thereon.

6. The system of claim 5 wherein the NLP performance test module is configured to produce test scores for the training embedding data, and wherein the system is further configured to use said test scores for the identification/classification of TCR sequence samples.

7. The system of any one of the preceding claims wherein the pre-training module is configured to either arrange bulk amino acid sequence samples data in a table or concatenate single cell amino acid sequence samples of the same cell with a separator token placed therebetween.

8. A method for TCR sequence identification and/or classification, the method comprising: constructing a pre-training dataset comprising CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data; pre-training a BERT model with said pre-training dataset in a self-supervised manner for thereby generating latent space representations to amino acid sequence samples inputted thereto; constructing a training dataset comprising private and public CDR3 TCRβ amino acid sequence samples data; processing the training dataset by the pre-trained BERT model to thereby produce latent space embedding data thereof; and training in a supervised manner a classifier with the latent space embedding data produced by the pre-trained BERT model for said training dataset, for said classifier to identify one or more biological features of an amino acid sequence sample data inputted thereto.

9. The method of claim 8 comprising reducing dimensionality of the produced latent space embedding data and using the dimensionality reduced latent space embedding data for the supervised training of the classifier.

10. The method of claim 9 comprising using a uniform manifold approximation and projection (UMAP) technique for the reducing of the dimensionality of the produced latent space embedding data.

11. The method of claim 9 or 10 comprising determining a tagging classification threshold based on the dimensionality reduced latent space embedding data.

12. The method of any one of claims 8 to 11 wherein the pre-training of the BERT model comprises masking about 10% to 20% of each amino acid sequence of the pre-training dataset.

13. The method of any one of claims 8 to 12 wherein the constructing of the training dataset comprises randomly selecting 30% to 70% of the training dataset from private CDR3 TCRβ amino acid sequence sample data records of one or more repositories, and randomly selecting a remaining percentage of said training dataset from public CDR3 TCRβ amino acid sequence sample data records of one or more repositories.

14. The method of claim 13 comprising randomly selecting 50% private CDR3 TCRβ amino acid sequence samples data and 50% public CDR3 TCRβ amino acid sequence samples data.

15. The method of any one of claims 8 to 14 comprising using a perplexity test to classify an amino acid sequence input as either a CDR3 or a non-CDR3 sequence.

16. The method of any one of claims 8 to 15 comprising using a perplexity test to classify an amino acid sequence input as either a TCRα or a TCRβ sequence.

17. The method of any one of claims 8 to 16 comprising using a perplexity test to classify an amino acid sequence input as either a private or a public sequence.

18. The method of any one of claims 8 to 17 wherein the classifier is configured to use a linear discriminant analysis (LDA) algorithm for dimensionality reduction and to classify a processed amino acid sequence as J-gene sequence, and/or as either a public or a private sequence.

19. The method of any one of claims 8 to 18 wherein the classifier is configured to use a robust machine-learning algorithm (xgBoost) to classify an amino acid sequence as J-gene sequence, and/or either as a public or a private sequence.

20. The method of any one of claims 8 to 19 wherein the classifier is configured to use a deep neural network (DNN) to classify an amino acid sequence as J-gene sequence, and/or as a MAIT cell, and/or either as a public or a private sequence.

21. The method of any one of claims 8 to 20 wherein the classifier is configured to use a self-supervised trained classifier to classify a processed amino acid sequence as a V-gene.

22. The method of any one of claims 8 to 21 wherein the classifier is configured to use a self-supervised trained classifier to classify a processed amino acid sequence as a J-gene.

23. The method of any one of claims 8 to 22 comprising calculating convergent recombination (CR) data for sample sequences data to verify the ability of the produced latent space embedding data to identify public sequences.

24. The method of any one of claims 8 to 23 wherein the CDR3 TCRβ amino acid sequence samples are from a human source.

25. The method of any one of claims 8 to 24 wherein the constructing of the pre-training dataset utilizes only bulk CDR3 TCRα and/or bulk CDR3 TCRβ amino acid sequence samples data, and wherein the constructing of the training dataset utilizes only bulk private and public CDR3 TCRβ amino acid sequence samples data.

26. The method of claim 25 wherein the constructing of the pre-training dataset comprises randomly selecting private and public bulk CDR3 TCRβ amino acid sequence sample data records of one or more repositories.

27. The method of claim 25 or 26 wherein the constructing of the pre-training dataset comprises arranging the bulk amino acid sequence samples data in a table listing the CDR3 sequences of each bulk amino acid sequence sample.

28. The method of any one of claims 8 to 24 wherein the constructing of the pre-training dataset comprises utilizing only private and public single cell CDR3 TCRα and/or single cell CDR3 TCRβ amino acid sequence samples data, and wherein the constructing of the training dataset comprises utilizing only single cell private and public CDR3 TCRβ amino acid sequence samples data.

29. The method of claim 27 wherein the constructing of the pre-training dataset comprises utilizing single cell CDR3 TCRα and single cell CDR3 TCRβ amino acid sequence samples data.

30. The method of claim 28 or 29 wherein the constructing of the pre-training dataset comprises concatenating amino acid sequence samples data of the same cell with a separator token placed therebetween.

31. The method of any one of claims 8 to 30 wherein said method is for classifying a CDR3 sequence as a public CDR3 sequence or a private CDR3 sequence.

32. The method of any one of claims 8 to 30 wherein said method is for identifying TCRβ sequences that can partner with the same TCRα sequence.

33. The method of any one of claims 8 to 32 wherein said method is for identifying and/or classifying short CDR3 sequences lacking V and J segments.

34. A TCR sequence sample data analyzer comprising: an NLP module pre-trained in a self-supervised manner with a pre-training dataset comprising CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data, and configured to receive sequence sample data and generate corresponding sequence sample embedding data therefor; and one or more classifiers configured to receive the sequence sample embedding data generated by the NLP module and identify based thereon one or more features and/or classifications of said sequence sample data, wherein at least one of said one or more classifiers is trained in a supervised manner with: (1) a training dataset comprising private and public CDR3 TCRβ amino acid sequence samples data; and (2) training embedding data produced for said training dataset by said pre-trained NLP model.

35. The analyzer of claim 34 wherein the supervised training of the at least one classifier at least partially utilizes dimensionality reduced representation of the training embedding data.

36. The analyzer of claim 35 configured to use a classification threshold determined at least partially based on the dimensionality reduced representation of the training embedding data.

37. The analyzer of any one of claims 34 to 36 wherein the training dataset comprises randomly selected 30% to 70% private CDR3 TCRβ amino acid sequence sample data, and wherein a remaining percentage of said training dataset comprises randomly selected public CDR3 TCRβ amino acid sequence sample data.

38. The analyzer of any one of claims 34 to 37 comprising an embeddings input stage and a scores test module configured to receive the training embedding data from said embedding input stage and the sequence sample embedding data from the NLP module and identify based thereon at least one of the features and/or classifications of the sequence sample data.

39. The analyzer of claim 38 wherein the scores test module is configured to use a perplexity test.

40. The analyzer of any one of claims 34 to 39 comprising at least one classifier configured to use a linear discriminant analysis (LDA) algorithm for dimensionality reduction and to classify the received sequence sample data as J-gene sequence, and/or as either a public or a private sequence.

41. The analyzer of any one of claims 34 to 40 comprising at least one classifier configured to use a robust machine-learning algorithm (xgBoost) to classify the received sequence sample data as J-gene sequence, and/or either as a public or a private sequence.

42. The analyzer of any one of claims 34 to 41 comprising at least one classifier configured to use a deep neural network (DNN) to classify the received sequence sample data as J-gene, and/or as a MAIT cell, and/or either as a public or a private sequence.

43. The analyzer of any one of the preceding claims 34 to 42 comprising at least one classifier configured to use a self-supervised trained classifier to classify the received sequence sample data as a V-gene.

44. The analyzer of any one of claims 34 to 43 comprising at least one classifier configured to use a self-supervised trained classifier to classify the received sequence sample data as a J-gene.

45. The analyzer of any one of claims 34 to 44 configured to identify the one or more features and/or classifications of either bulk or single cell sequence sample data.

46. The analyzer of claim 45 configured to identify the one or more features and/or classifications of bulk sequence sample data, and to update at least one of a pre-trained NLP model and/or a classifier thereof based on single cell sequence sample data processed by a single cell sample sequence analyzer and output data generated by said single cell sample sequence analyzer for said single cell sequence sample data.

47. The analyzer of claim 45 configured to identify the one or more features and/or classifications of single cell sequence sample data, and to update at least one of a pre-trained NLP model and/or a classifier thereof based on bulk sequence sample data processed by a bulk sample sequence analyzer and output data generated by said bulk sample sequence analyzer for said bulk sequence sample data.

48. The analyzer of any one of claims 34 to 45 wherein the one or more features and/or classifications of the sequence sample data comprises at least one of the following: CDR3 or non-CDR3 sequence, TCRα or TCRβ sequence, private or public sequence, MAIT or non-MAIT sequence, V-gene or non-V-gene sequence, and/or J-gene or non-J-gene sequence.

Description:
SYSTEM AND METHOD FOR TCR SEQUENCE IDENTIFICATION AND/OR CLASSIFICATION

TECHNOLOGICAL FIELD

The present application relates to identification and/or classification of TCR amino acid sequences, particularly short sequences, utilizing artificial intelligence tools.

BACKGROUND ART

References considered to be relevant as background to the presently disclosed subject matter are listed below:

[1] K. Wu et al., "TCR-BERT: learning the grammar of T-cell receptors for flexible antigen-binding analyses", bioRxiv-Bioinformatics, 2021, https://doi.org/10.1101/2021.11.18.469186.

[2] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805, 2018, https://arxiv.org/abs/1810.04805.

[3] US2022/0139498

[4] WO2022/185179

Bulk samples database PMID references:

1. Jia, Q. et al. Local mutational diversity drives intratumoral immune heterogeneity in non-small cell lung cancer. Nat. Commun. 9, 5361 (2018).

2. Napolitani, G. et al. Clonal analysis of Salmonella-specific effector T cells reveals serovar-specific and cross-reactive T cell responses. Nat. Immunol. 19, 742-754 (2018).

3. Giudice, V. et al. Deep sequencing and flow cytometric characterization of expanded effector memory CD8+CD57+ T cells frequently reveals T-cell receptor Vβ oligoclonality and CDR3 homology in acquired aplastic anemia. Haematologica 103, 759-769 (2018).

4. Seet, C. S. et al. Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nat. Methods 14, 521-530 (2017).

5. Sims, J. S. et al. Diversity and divergence of the glioma-infiltrating T-cell receptor repertoire. Proc. Natl. Acad. Sci. U. S. A. 113, E3529-37 (2016).

6. Genolet, R., Stevenson, B. J., Farinelli, L., Osteras, M. & Luescher, I. F. Highly diverse TCRα chain repertoire of pre-immune CD8+ T cells reveals new insights in gene recombination. EMBO J. 31, 1666-1678 (2012).

7. Neal, J. T. et al. Organoid modeling of the tumor immune microenvironment. Cell 175, 1972-1988.e16 (2018).

8. Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast tumor microenvironment. Cell 174, 1293-1308.e36 (2018).

9. Heuvel, H. van den et al. Allo-HLA cross-reactivities of Cytomegalovirus-, influenza-, and varicella zoster virus-specific memory T cells are shared by different healthy individuals. Am. J. Transplant 17, 2033-2044 (2017).

10. Beziat, V. et al. A recessive form of hyper-IgE syndrome by disruption of ZNF341-dependent STAT3 transcription and activity. Sci. Immunol. 3, (2018).

11. Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16, 409-412 (2019).

12. Buggert, M. et al. Identification and characterization of HIV-specific resident memory CD8+ T cells in human lymphoid tissue. Sci. Immunol. 3, (2018).

13. Sousa, A. de P. A. et al. Comprehensive analysis of TCR-β repertoire in patients with neurological immune-mediated disorders. Sci. Rep. 9, 344 (2019).

14. Cloughesy, T. F. et al. Neoadjuvant anti-PD-1 immunotherapy promotes a survival benefit with intratumoral and systemic immune responses in recurrent glioblastoma. Nat. Med. 25, 477-486 (2019).

15. Martino, D. et al. Epigenetic dysregulation of naive CD4+ T-cell activation genes in childhood food allergy. Nat. Commun. 9, 3308 (2018).

16. Kagoya, Y. et al. DOT1L inhibition attenuates graft-versus-host disease by allogeneic T cells in adoptive immunotherapy models. Nat. Commun. 9, 1915 (2018).

17. Wu, J. et al. Minimal residual disease detection and evolved IGH clones analysis in acute B lymphoblastic leukemia using IGH deep sequencing. Front. Immunol. 7, 403 (2016).

18. Zhang, W. et al. IMonitor: A robust pipeline for TCR and BCR repertoire analysis. Genetics 201, 459-472 (2015).

19. Yost, K. E. et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251-1259 (2019).

20. Song, I. et al. Broad TCR repertoire and diverse structural solutions for recognition of an immunodominant CD8+ T cell epitope. Nat. Struct. Mol. Biol. 24, 395-406 (2017).

21. Stromnes, I. M., Hulbert, A., Pierce, R. H., Greenberg, P. D. & Hingorani, S. R. T-cell localization, activation, and clonal expansion in human pancreatic ductal adenocarcinoma. Cancer Immunol. Res. 5, 978-991 (2017).

22. Spreafico, R. et al. A circulating reservoir of pathogenic-like CD4+ T cells shares a genetic and phenotypic signature with the inflamed synovial micro-environment. Ann. Rheum. Dis. 75, 459-465 (2016).

23. Carey, A. J. et al. Public clonotypes and convergent recombination characterize the naive CD8+ T-cell receptor repertoire of extremely preterm neonates. Front. Immunol. 8, 1859 (2017).

24. Abdel-Hakeem, M. S., Boisvert, M., Bruneau, J., Soudeyns, H. & Shoukry, N. H. Selective expansion of high functional avidity memory CD8 T cell clonotypes during hepatitis C virus reinfection and clearance. PLoS Pathog. 13, e1006191 (2017).

25. Rossetti, M. et al. TCR repertoire sequencing identifies synovial Treg cell clonotypes in the bloodstream during active inflammation in human arthritis. Ann. Rheum. Dis. 76, 435-441 (2017).

26. Suessmuth, Y. et al. CMV reactivation drives posttransplant T-cell reconstitution and results in defects in the underlying TCRβ repertoire. Blood 125, 3835-3850 (2015).

27. Hsu, M. et al. TCR sequencing can identify and track glioma-infiltrating T cells after DC vaccination. Cancer Immunol. Res. 4, 412-418 (2016).

28. Beausang, J. F. et al. T cell receptor sequencing of early-stage breast cancer tumors identifies altered clonal structure of the T cell repertoire. Proc. Natl. Acad. Sci. U. S. A. 114, E10409-E10417 (2017).

29. Gomez-Tourino, I., Kamra, Y., Baptista, R., Lorenc, A. & Peakman, M. T cell receptor β-chains display abnormal shortening and repertoire sharing in type 1 diabetes. Nat. Commun. 8, 1792 (2017).

30. Keane, C. et al. The T-cell receptor repertoire influences the tumor microenvironment and is associated with survival in aggressive B-cell lymphoma. Clin. Cancer Res. 23, 1820-1828 (2017).

31. Page, D. B. et al. Deep sequencing of T-cell receptor DNA as a biomarker of clonally expanded TILs in breast cancer after immunotherapy. Cancer Immunol. Res. 4, 835-844 (2016).

32. Wu, D. et al. High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Sci. Transl. Med. 4, 134ra63 (2012).

33. Seay, H. R. et al. Tissue distribution and clonal diversity of the T and B cell repertoire in type 1 diabetes. JCI Insight 1, e88242 (2016).

34. Emerson, R. O. et al. Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nat. Genet. 49, 659-665 (2017).

Single cell samples database PMID references:

1. Leader, A. M. et al. Single-cell analysis of human non-small cell lung cancer lesions refines tumor classification and patient stratification. Cancer Cell 39, 1594-1609.e12 (2021).

2. Zhao, Y. et al. Clonal expansion and activation of tissue-resident memory-like TH17 cells expressing GM-CSF in the lungs of patients with severe COVID-19. Sci Immunol 6, eabf6692 (2021).

3. Tang, Y., Kwiatkowski, D. J. & Henske, E. P. Midkine expression by stem-like tumor cells drives persistence to mTOR inhibition and an immune-suppressive microenvironment. Nat Commun 13, 5018 (2022).

4. Banta, K. L. et al. Mechanistic convergence of the TIGIT and PD-1 inhibitory pathways necessitates co-blockade to optimize anti-tumor CD8+ T cell responses. Immunity 55, 512-526.e9 (2022).

5. Mahuron, K. M. et al. Layilin augments integrin activation to promote antitumor immunity. J Exp Med 217, e20192080 (2020).

6. Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat Med 26, 842-844 (2020).

7. Biermann, J. et al. Dissecting the treatment-naive ecosystem of human melanoma brain metastasis. Cell 185, 2591-2608.e30 (2022).

8. Caushi, J. X. et al. Transcriptional programs of neoantigen-specific TIL in anti-PD-1-treated lung cancers. Nature 596, 126-132 (2021).

9. Chandran, S. S. et al. Immunogenicity and therapeutic targeting of a public neoantigen derived from mutated PIK3CA. Nat Med 28, 946-957 (2022).

10. Azizi, E. et al. Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment. Cell 174, 1293-1308.e36 (2018).

11. Luoma, A. M. et al. Tissue-resident memory and circulating T cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918-2935.e29 (2022).

12. Zheng, Y. et al. Immune suppressive landscape in the human esophageal squamous cell carcinoma microenvironment. Nat Commun 11, 6268 (2020).

13. Han, L. et al. Interleukin 32 Promotes Foxp3+ Treg Cell Development and CD8+ T Cell Function in Human Esophageal Squamous Cell Carcinoma Microenvironment. Frontiers Cell Dev Biology 9, 704853 (2021).

14. Anadon, C. M. et al. Ovarian cancer immunogenicity is governed by a narrow subset of progenitor tissue-resident memory T cells. Cancer Cell 40, 545-557.e13 (2022).

15. Anadon, C. M. et al. Protocol for the isolation of CD8+ tumor-infiltrating lymphocytes from human tumors and their characterization by single-cell immune profiling and multiome. Star Protoc 3, 101649 (2022).

16. Heming, M. et al. Neurological Manifestations of COVID-19 Feature T Cell Exhaustion and Dedifferentiated Monocytes in Cerebrospinal Fluid. Immunity 54, 164-175.e6 (2021).

17. Gueguen, P. et al. Contribution of resident and circulating precursors to tumor- infiltrating CD8+ T cell populations in lung cancer. Sci Immunol 6, (2021).

18. Kourtis, N. et al. A single-cell map of dynamic chromatin landscapes of immune cells in renal cell carcinoma. Nat Cancer 3, 885-898 (2022).

19. Wang, Z. et al. Single-cell RNA sequencing of peripheral blood mononuclear cells from acute Kawasaki disease patients. Nat Commun 12, 5444 (2021).

20. Shi, X. et al. Single-cell atlas of diverse immune populations in the advanced biliary tract cancer microenvironment. Npj Precis Oncol 6, 58 (2022).

21. Ferreira-Gomes, M. et al. SARS-CoV-2 in severe COVID-19 induces a TGF-β-dominated chronic immune response that does not target itself. Nat Commun 12, 1961 (2021).

22. Ramaswamy, A. et al. Immune dysregulation and autoreactivity correlate with disease severity in SARS-CoV-2-associated multisystem inflammatory syndrome in children. Immunity 54, 1083-1095.e7 (2021).

23. Gaydosik, A. M., Stonesifer, C. J., Khaleel, A. E., Geskin, L. J. & Fuschiotti, P. Single-Cell RNA Sequencing Unveils the Clonal and Transcriptional Landscape of Cutaneous T-Cell Lymphomas. Clin Cancer Res 28, 2610-2622 (2022).

24. Eberhardt, C. S. et al. Functional HPV-specific PD-1+ stem-like CD8 T cells in head and neck cancer. Nature 597, 279-284 (2021).

25. Corridoni, D. et al. Single-cell atlas of colonic CD8+ T cells in ulcerative colitis. Nat Med 26, 1480-1490 (2020).

26. Gao, S. et al. Single-cell RNA sequencing coupled to TCR profiling of large granular lymphocyte leukemia T cells. Nat Commun 13, 1982 (2022).

27. Saluzzo, S. et al. Delayed antiretroviral therapy in HIV-infected individuals leads to irreversible depletion of skin- and mucosa-resident memory T cells. Immunity 54, 2842-2858.e5 (2021).

28. Hu, Y. et al. Antigen multimers: Specific, sensitive, precise, and multifunctional high-avidity CAR-staining reagents. Matter 4, 3917-3940 (2021).

29. Borcherding, N. et al. Mapping the immune environment in clear cell renal carcinoma by single-cell genomics. Commun Biology 4, 122 (2021).

30. Cheon, I. S. et al. Immune signatures underlying post-acute COVID-19 lung sequelae. Sci Immunol 6, eabk1741 (2021).

Acknowledgement of the above references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the presently disclosed subject matter.

BACKGROUND

T cells are an essential component of the adaptive immune system and are capable of recognizing a vast number of different antigens. To be able to target such a large number of foreign antigens, specific mechanisms produce enormous numbers of T cell strains, which differ by their T cell receptor (TCR) sequences. The composition of the multiple different TCRs constitutes the T cell receptor repertoire. The T cell receptor is composed of alpha and beta chains, or gamma and delta chains. Each T cell interacts with an antigen through the TCR; this interaction is mainly determined by the third complementarity-determining region (CDR3) of the receptor in the alpha and beta T cells. The chain itself is created by a rearrangement of multiple V, D, and J gene segments in a process called VDJ recombination.

Progress in Natural Language Processing (NLP) has been made by utilizing transformers for the discovery of rules within sequential data, like DNA sequences and protein sequences. Transformers are often encoder and decoder based; the encoder generates an embedding (hyper-dimensional numerical representation) of the input. In the case of DNA and proteins, the studied language is either nucleotides, consisting of 4 unique letters, or amino acids, consisting of 20 unique letters. In current research, the most common transformer used for these tasks is BERT (bidirectional encoder representations from transformers), which has been trained to learn the structure of large sets of unlabeled data.

An encoder-based transformer (BERT) was applied to predict different antigens specific to different TCRs [1].

US Patent Publication No. 2022/0139498 discloses apparatuses, systems and methods which analyze deoxyribonucleic acid (DNA) sequence data using techniques that include the use of natural language processing (NLP) models, to achieve several outcomes, for example to identify genetic elements and transcriptional regulators as well as to verify putative novel cis-regulatory elements.

International Patent Publication No. WO 2022/185179 discloses a protein language NLP system for predicting specific biophysiochemical properties of a protein, including TCR-epitope binding. The procedure includes training neural networks, in several training phases, on a collection of amino acid sequences of proteins, which are tokenized and masked.

GENERAL DESCRIPTION

BERT, which stands for Bidirectional Encoder Representations from Transformers [2], is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

CountVonCount (CVC) is a transformer model based on the architecture of BERT that, in possible embodiments hereof, uses bidirectional self-attention. The CVC is used in some embodiments to create embedding data for the CDR3 sequences it processes (e.g., in a training stage of the system), which can then be used for multiple downstream tasks. CVC has demonstrated meaningful embedding abilities, used in embodiments hereof to classify Private and Public sequences in a supervised manner, as well as to cluster them in an unsupervised manner. By using the embeddings, the CVC displayed different phenotypes for each sequence. Together, based on its unique set of traits, the CVC can serve as an unsupervised clustering module, exploiting the embedding space of CDR3, and may offer a proper alternative to common clustering metrics (i.e., edit distance).

The term private CDR3 sequence(s) refers to CDR3 sequences that are unique to an individual. The term public CDR3 sequence(s) refers to CDR3 sequences that are shared among several (i.e., more than one) individuals.

In possible embodiments the CVC is fed with an input of CDR3 amino acid sequences for producing their embeddings. In order to create these embeddings, the NLP model must understand the underlying language of these sequences, which has been achieved in a stage referred to as the ‘pre-training’ stage. The NLP model could be trained on any given database with k CDR3 sequences, with k being a positive integer number in the range of millions. Once completed, the NLP model is ready to be used for the creation of the embeddings data that can then be used for processing and analysis of new CDR3 amino acid sequences.
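
By way of a non-limiting illustration of the pre-training input just described, the following minimal Python sketch tokenizes CDR3 amino acid strings character by character and masks a fraction of the residues for self-supervised (masked language model) pre-training. The vocabulary layout, function names and the example sequence are illustrative assumptions, not the actual implementation of the disclosed system.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIAL = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + AMINO_ACIDS)}

def encode_cdr3(seq):
    """Tokenize a CDR3 amino acid string character by character, BERT style."""
    return [VOCAB["[CLS]"]] + [VOCAB[aa] for aa in seq] + [VOCAB["[SEP]"]]

def mask_for_mlm(token_ids, mask_fraction=0.15):
    """Randomly mask a fraction of the amino acid positions for masked-LM
    pre-training; labels are -100 (ignored by the loss) everywhere except
    the masked positions, which keep the true token id."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    positions = [i for i, t in enumerate(token_ids) if t >= len(SPECIAL)]
    for i in random.sample(positions, max(1, int(len(positions) * mask_fraction))):
        labels[i] = inputs[i]
        inputs[i] = VOCAB["[MASK]"]
    return inputs, labels

# Example: one pre-training sample built from an arbitrary (non-real) CDR3 sequence.
ids, lbls = mask_for_mlm(encode_cdr3("CASSLGQAYEQYF"))
```

In practice, pre-training would iterate such masked samples over a database with millions of CDR3 sequences, as noted above.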

A central element of the transformer of the NLP model is its attention algorithm. In general, attention can be seen as a mapping from a query and a set of key-value pairs to an output, where the output is a weighted sum of the values. Self-attention, also called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence.
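
The weighted-sum mapping described above can be illustrated with a short, hedged numpy sketch of scaled dot-product self-attention; the projection matrices and dimensions below are arbitrary stand-ins rather than the parameters of the disclosed model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token embeddings X
    (shape: sequence_length x d_model). Each output position is a weighted sum
    of the values, with weights derived from query-key similarity."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of values

# Toy usage with random projections (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(13, 32))                                  # e.g., a 13-residue CDR3
out = self_attention(X, *(rng.normal(size=(32, 32)) for _ in range(3)))
```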

The term supervised learning refers to a machine learning scheme of training an NLP model with labeled data, e.g., the input training data is fed to the NLP model with tags indicative of the correct output for training it to map the input data to the correct output. The term self-supervised learning refers to machine learning schemes of training an NLP model to represent its input data in an unsupervised manner, without use of large amounts of labeled data, e.g., the target output is generated automatically from the input data by the NLP model, without labeling. The term embeddings refers to representation/encoding by the NLP model of the meaning of a portion/item (e.g., a word in text analysis) of its input data by a vector of real numbers, such that other items encoded by the NLP model to the proximity of such a representation are expected to be of a similar meaning.

The terms TCR alpha, TCRα, and TRA are used interchangeably herein to denote the alpha chain of the TCR. The terms TCR beta, TCRβ, and TRB are used interchangeably herein to denote the beta chain of the TCR.

The methods of the invention may therefore be employed, for example but not limited to, for identifying CDR3 sequences, for identifying whether a CDR3 sequence is a public or a private sequence, and for identifying “sister” TCRβ sequences (namely, different TCRβ chains that can partner with the same TCRα).

Moreover, traditional methods for identification and classification of CDR3 sequences rely on alignment of the examined sequences to identify the V and J regions flanking the D region. As a result, such methods require that the fragments of CDR3 would be sufficiently long, containing the V, D and J regions, to be able to accurately identify and classify the CDR3.

In contrast, the present invention allows for alignment-free categorization of arbitrary sequences as either CDR3 or non-CDR3. This quality is utilizable as a tool to detect CDR3 sequences out of sequencing data regardless of the available length of the CDR3 fragment, namely it is effective even with short sequences, e.g., sequences lacking the V and J segments.

Accordingly, in some embodiments, the methods of the invention may be employed for identifying and/or classifying short CDR3 sequences, e.g., CDR3 sequences lacking the V and J segments. In some embodiments the term short CDR3 sequences refers to sequences which are 8 or 9 amino acids long. Evidently, the methods of the invention may be employed for identifying and/or classifying longer CDR3 sequences.

In one aspect there is provided a computer system comprising one or more processors and memories configured with a set of software modules for the construction and use of TCR sequence sample classification/identification means, the system comprising a pre-training module configured to construct one or more pre-training datasets of TCR sequence samples data, an NLP module configured to: pre-train an NLP model in a self-supervised manner with the one or more pre-training datasets for the NLP model to provide latent space representations to amino acid sequence samples inputted thereto; and process by the pre-trained NLP model one or more training datasets of TCR sequence samples data to thereby produce training embedding data, and a training module configured to construct the one or more training datasets and train one or more classifiers with the one or more training datasets and the training embedding data, for the one or more classifiers thereby trained to identify/classify features of TCR sequence sample data inputted thereto.

The system comprises in some embodiments a dimension reduction module configured to reduce a dimension of the training embedding data into a lower dimension data representation thereof. The system can be configured for one or both of the following: generate, from the lower dimension presentation of the training embedding data, a presentation and/or analysis of the training embedding data; and/or determine one or more classification thresholds from the lower dimension presentation of the training embedding data. The training module is configured in some embodiments to train the one or more classifiers with the lower dimension presentation of the training embedding data and the training datasets.

In possible embodiments the system comprises an NLP performance test module configured to analyze the training embedding data and evaluate the performance of the pre-trained NLP model based thereon. The NLP performance test module can be configured to produce test scores for the training embedding data. The system can be further configured to use the test scores for the identification/classification of TCR sequence samples.

Optionally, but in some embodiments preferably, the pre-training module is configured to either arrange bulk amino acid sequence samples data in a table or concatenate single cell amino acid sequence samples of the same cell with a separator token placed therebetween.

In another aspect there is provided a method for TCR sequence identification and/or classification. The method comprises constructing a pre-training dataset comprising CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data; pre-training a BERT model with the pre-training dataset in a self-supervised manner for thereby generating latent space representations to amino acid sequence samples inputted thereto; constructing a training dataset comprising private and public CDR3 TCRβ amino acid sequence samples data; processing the training dataset by the pre-trained BERT model to thereby produce latent space embedding data thereof; and training in a supervised manner a classifier with the latent space embedding data produced by the pre-trained BERT model for the training dataset, for the classifier to identify one or more biological features of an amino acid sequence sample data inputted thereto.

The method comprises in some embodiments reducing dimensionality of the produced latent space embedding data (e.g., using uniform manifold approximation and projection techniques) and using the dimensionality reduced latent space embedding data for the supervised training of the classifier. The method can comprise determining a tagging classification threshold based on the dimensionality reduced latent space embedding data. The pre-training of the BERT model may comprise masking about 10% to 20% of each amino acid sequence of the pre-training dataset. Optionally, the constructing of the training dataset comprises randomly selecting 30% to 70%, and in some embodiments 50%, of the training dataset from private CDR3 TCRβ amino acid sequence sample data records of one or more repositories, and randomly selecting a remaining percentage of the training dataset from public CDR3 TCRβ amino acid sequence sample data records of one or more repositories.
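
As a non-limiting illustration of the dimensionality reduction step mentioned above, the sketch below projects training embeddings into two dimensions with the umap-learn package; the 768-dimensional embedding width and the stand-in data are assumptions made only for the example.

```python
import numpy as np
import umap  # umap-learn package

def reduce_embeddings(embeddings, n_components=2, random_state=0):
    """Project high-dimensional training embeddings (n_sequences x d) into a
    low-dimensional space with UMAP, e.g., for visualization or for deriving
    classification thresholds as described above."""
    reducer = umap.UMAP(n_components=n_components, random_state=random_state)
    return reducer.fit_transform(embeddings)

# Toy usage: stand-in embeddings with a BERT-typical width of 768 dimensions.
fake_embeddings = np.random.default_rng(1).normal(size=(1000, 768))
coords_2d = reduce_embeddings(fake_embeddings)
```

The resulting low-dimensional coordinates can then be plotted or used to derive tagging classification thresholds, as described above.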

In possible applications the method comprises randomly selecting 50% private CDR3 TCRβ amino acid sequence samples data and 50% public CDR3 TCRβ amino acid sequence samples data. The method comprises in possible embodiments using a perplexity test to classify an amino acid sequence input as: either a CDR3 or a non-CDR3 sequence; and/or as either a TCRα or a TCRβ sequence; and/or as either a private or a public sequence.
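
One hedged way such a perplexity test could be realized with a masked language model is sketched below: each residue is masked in turn, the true residue is scored, and the exponentiated average negative log-likelihood is compared against a tuning threshold. The model/tokenizer pair is assumed to follow the Hugging Face masked-LM interface, and the threshold and function names are illustrative assumptions rather than the disclosed implementation.

```python
import math
import torch

def pseudo_perplexity(sequence, model, tokenizer):
    """Masked-LM pseudo-perplexity: mask each residue in turn, score the true
    residue under the model, and exponentiate the average negative log-likelihood.
    Lower scores indicate inputs that look more like the sequences the model
    was pre-trained on (e.g., genuine CDR3 sequences)."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for pos in range(1, len(ids) - 1):              # skip [CLS]/[SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[pos]].item())
    return math.exp(sum(nlls) / len(nlls))

def classify_by_perplexity(sequence, model, tokenizer, threshold):
    """Binary decision (e.g., CDR3 vs. non-CDR3) against a tuned threshold."""
    score = pseudo_perplexity(sequence, model, tokenizer)
    return "CDR3-like" if score < threshold else "non-CDR3-like"
```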

The classifier is configured in possible embodiments to use at least one of the following: a linear discriminant analysis (LDA) algorithm for dimensionality reduction and to classify a processed amino acid sequence as J-gene sequence, and/or as either a public or a private sequence; a robust machine-learning algorithm (xgBoost) to classify an amino acid sequence as J-gene sequence, and/or either as a public or a private sequence; a deep neural network (DNN) to classify an amino acid sequence as J-gene sequence, and/or as a MAIT cell, and/or either as a public or a private sequence; a self-supervised trained classifier to classify a processed amino acid sequence as a V-gene; a self-supervised trained classifier to classify a processed amino acid sequence as a J-gene.
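
The classifier options listed above can be illustrated with a brief sketch that fits an LDA classifier and an xgBoost classifier on per-sequence embeddings with binary public/private labels; the synthetic data and hyperparameters are assumptions for demonstration only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier

# Stand-in data: per-sequence embeddings (rows) with binary public(1)/private(0) labels.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 768))
y = rng.integers(0, 2, size=5000)

# LDA doubles as a supervised dimensionality reducer and a linear classifier.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Gradient-boosted trees (xgBoost) trained on the same embedding inputs.
xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
xgb.fit(X, y)

public_probability = xgb.predict_proba(X[:1])[:, 1]   # probability of "public"
```

A DNN head could be trained on the same embedding inputs in an analogous way.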

The method can comprise calculating convergent recombination (CR) data for sample sequences data to verify the ability of the produced latent space embedding data to identify public sequences. Optionally, the CDR3 TCRβ amino acid sequence samples are from a human source. The constructing of the pre-training dataset can utilize only bulk CDR3 TCRα and/or bulk CDR3 TCRβ amino acid sequence samples data. Optionally, the constructing of the training dataset utilizes only bulk private and public CDR3 TCRβ amino acid sequence samples data. The constructing of the pre-training dataset may comprise randomly selecting private and public bulk CDR3 TCRβ amino acid sequence sample data records of one or more repositories. The constructing of the pre-training dataset may comprise arranging the bulk amino acid sequence samples data in a table listing the CDR3 sequences of each bulk amino acid sequence sample.

The constructing of the pre-training dataset can comprise utilizing only private and public single cell CDR3 TCRα and/or single cell CDR3 TCRβ amino acid sequence samples data. Optionally, the constructing of the training dataset comprises utilizing only single cell private and public CDR3 TCRβ amino acid sequence samples data. The constructing of the pre-training dataset may comprise utilizing single cell CDR3 TCRα and single cell CDR3 TCRβ amino acid sequence samples data. The constructing of the pre-training dataset may comprise concatenating amino acid sequence samples data of the same cell with a separator token placed therebetween.

In possible applications the method is used for at least one of the following: classifying a CDR3 sequence as a public CDR3 sequence or a private CDR3 sequence; identifying TCRβ sequences that can partner with the same TCRα sequence; identifying and/or classifying short CDR3 sequences lacking V and J segments.

In yet another aspect there is provided a TCR sequence sample data analyzer comprising an NLP module pre-trained in a self-supervised manner with a pre-training dataset comprising CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data, and configured to receive sequence sample data and generate corresponding sequence sample embedding data therefor; and one or more classifiers configured to receive the sequence sample embedding data generated by the NLP module and identify based thereon one or more features and/or classifications of the sequence sample data. At least one of the one or more classifiers can be trained in a supervised manner with: (1) a training dataset comprising private and public CDR3 TCRβ amino acid sequence samples data; and (2) training embedding data produced for the training dataset by the pre-trained NLP model.

Optionally, the supervised training of the at least one classifier at least partially utilizes dimensionality reduced representation of the training embedding data. The analyzer can be configured to use a classification threshold determined at least partially based on the dimensionality reduced representation of the training embedding data. The training dataset comprises in some embodiments randomly selected 30% to 70% private CDR3 TCRβ amino acid sequence sample data, and a remaining percentage of the training dataset can comprise randomly selected public CDR3 TCRβ amino acid sequence sample data.

The analyzer comprises in some embodiments an embeddings input stage and a scores test module configured to receive the training embedding data from the embedding input stage and the sequence sample embedding data from the NLP module, and identify based thereon at least one of the features and/or classifications of the sequence sample data. The scores test module can be configured to use a perplexity test.

The analyzer comprises in possible applications at least one classifier configured to use: a linear discriminant analysis (LDA) algorithm for dimensionality reduction and to classify the received sequence sample data as J-gene sequence, and/or as either a public or a private sequence; and/or a robust machine-learning algorithm (xgBoost) to classify the received sequence sample data as J-gene sequence, and/or either as a public or a private sequence; and/or a deep neural network (DNN) to classify the received sequence sample data as J-gene, and/or as a MAIT cell, and/or either as a public or a private sequence; and/or a self-supervised trained classifier to classify the received sequence sample data as a V-gene; and/or a self-supervised trained classifier to classify the received sequence sample data as a J-gene.

The analyzer can be configured to identify the one or more features and/or classifications of either bulk or single cell sequence sample data. In some embodiments the analyzer is configured to identify the one or more features and/or classifications of bulk sequence sample data, and to update at least one of a pre-trained NLP model and/or a classifier thereof based on single cell sequence sample data processed by a single cell sample sequence analyzer and output data generated by the single cell sample sequence analyzer for said single cell sequence sample data. Alternatively, the analyzer is configured to identify the one or more features and/or classifications of single cell sequence sample data, and to update at least one of a pre-trained NLP model and/or a classifier thereof based on bulk sequence sample data processed by a bulk sample sequence analyzer and output data generated by said bulk sample sequence analyzer for said bulk sequence sample data. The one or more features and/or classifications of the sequence sample data identified by the analyzer may comprise at least one of the following: CDR3 or non-CDR3 sequence, TCRα or TCRβ sequence, private or public sequence, MAIT or non-MAIT sequence, V-gene or non-V-gene sequence, and/or J-gene or non-J-gene sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to better understand the subject matter that is disclosed herein and to exemplify how it may be carried out in practice, embodiments will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:

Figs. 1A to 1D are functional block diagrams schematically illustrating construction and use of a TCR sequence sample classifier according to possible embodiments;

Figs. 2A and 2B respectively schematically illustrate construction of a bulk TCR sequence sample classifier, and a single cell TCR sequence sample classifier, according to possible embodiments; all the sequences that are noted in the figures are non-limiting, arbitrary (not “real”) sequences provided for demonstrative purposes only;

Figs. 3A to 3G are graphical plots showing TCR publicness in the (bulk sample/CVC) latent representation space using embeddings of CDR3 TCR beta sequences followed by dimensionality reduction (e.g., using UMAP for visualization) obtained by the trained NLP model of possible embodiments, and its association with sequence length and convergent recombination, wherein Fig. 3A shows dimensionality reduction of the embeddings of 1,000,000 TCR beta sequences colored according to their Public/Private label, with lighter points representing private sequences and darker points representing public sequences, Fig. 3B shows the public appearance distribution of the sequences in the dataset, colored according to sequence length percentiles (10%, 25%, 50%, 75% and 90%, corresponding to lengths 13, 14, 15, 16 and 18), displayed in the upper right corner, Fig. 3C shows the sequence length distribution of 1,050,000 TCR beta sequences colored by the sequence length percentiles of 10%, 25%, 50%, 75% and 90%, which corresponded to amino acid lengths of 13, 14, 15, 16 and 18, respectively, Figs. 3D and 3E show dimensionality reduced embeddings produced for the sequences used to generate Fig. 3C, colored in Fig. 3D according to the sequence length percentiles and in Fig. 3E according to the Private/Public label of each sequence, showing the association between sequence length and sequences’ sharing status, and Figs. 3F and 3G show dimensionality reduced embeddings produced for 536,932 TCR beta sequences, colored in Fig. 3F according to Public/Private status and in Fig. 3G according to convergent recombination ranges, showing five convergent recombination ranges (including for each range a set of sequences according to their distribution in the dataset: 0-100 with 500,000 sequences, 100-200 with 30799 sequences, 200-300 with 4574 sequences, 300-400 with 1132 sequences, and 400 and above with 427 sequences), showing that the transformer captures publicness and convergent recombination simultaneously in latent space;

Figs. 4A to 4E demonstrate J-gene clustering in the latent embedding space of the BERT model according to possible embodiments, wherein Fig. 4A is a schematic representation of the structure of the CDR3 region of an mRNA transcript of a TCR beta chain, Fig. 4B is a schematic representation of the structure of the DNA used for the production of TCR beta chains prior to recombination, consisting of the variable (V), joining (J), constant (C) and diversity (D) regions, where a segment from each region, together with deletion/addition/replacement of nucleotides, generates the TCR through the process of VDJ recombination (marked J-gene areas are J1: 1-6 and J2: 1-7), Fig. 4C shows a bar plot representation of the number of CDR3 sequences in a dataset used in possible embodiments, according to their use of J-genes (all the sequences of TCRBJ02-04 and TCRBJ02-06 were taken and 9% of sequences from each of the remaining J-gene types were randomly selected, to create the represented embedding space, to provide meaningful representations for the visualization of all J-genes), and Figs. 4D and 4E show the embedding space colored in Fig. 4D by the corresponding public/private label of each sequence and in Fig. 4E by the different J-gene types;

Figs. 5A to 5G demonstrate alignment-free, pseudo-perplexity-based discovery of CDR3 sequences according to possible embodiments, wherein Fig. 5A shows equations used for calculating the pseudo-perplexity score provided by a language model, Fig. 5B shows a comparison of the BERT language model's perplexity scores with single cell pseudo-perplexity scores (a grammatically correct sentence would yield a low perplexity score, just as a genuine CDR3 sequence would result in a low pseudo-perplexity score), Fig. 5C shows extraction of CDR3 sequences from single-cell RNA sequences, where the RNA-seq data is translated into an amino acid representation and divided into kmers of lengths 11-19, in a process repeated for the three different reading frames, resulting in a total of (3 × Σ(X − k + 1) for k = 11 to 19) kmers, and the pseudo-perplexity score is calculated for each kmer, and the kmer with the lowest score is considered the real sequence, Fig. 5D shows the process of Fig. 5C on a large scale, and Figs. 5E to 5G show pseudo-perplexity analysis plots for 10x Genomics single-cell lung cancer data including 3643 single cells with both TRA and TRB chains, with Fig. 5E showing the pseudo-perplexity score distribution generated by the single cell NLP model, indicating that both TRA (light plots) and TRB (dark plots) chain types have low perplexity scores, Fig. 5F shows the match count (1) that represents the number of CDR3 sequences identified by the pseudo-perplexity score when considering the minimum score of all kmers, categorized by chain type (for sequences that did not precisely match the label (0), the lowest 10 scores were analyzed to determine if the label was among them), Fig. 5G shows the distribution of match count for the lowest 10 scores, divided by chain type, and Figs. 5F and 5G demonstrate a 95% accuracy in detecting the real CDR3 sequence using the pseudo-perplexity score. All the sequences that are noted in the figures are non-limiting, arbitrary (not “real”) sequences provided for demonstrative purposes only; and

Figs. 6A to 6C demonstrate use of embeddings of the NLP model for supervised classification tasks in possible embodiments, wherein Fig. 6A demonstrates use of deep neural networks (DNNs), xgBoost and LDA, for the task of binary classification of sequences for their public/private status, and DNN alone for the task of multiclass classification of the J-gene of each sequence (in all cases the input is the embeddings of each sequence, as produced by the trained NLP model), Fig. 6B demonstrates the ROC of the LDA, xgBoost and DNN classifiers trained over the task of binary classification of public and private sequences, where each algorithm was applied twice, once using the embeddings created by the trained NLP model and once using one-hot encoding, showing that classifiers over embeddings achieved higher scores compared to the one-hot representation (AUC of 0.89, 0.89 and 0.9 compared to 0.76, 0.81 and 0.8, respectively), and Fig. 6C shows multiclass classification results of J-gene type prediction using DNN on both the embeddings and one-hot vector representation of the sequences, in which the network was applied 3 times, and average result accuracies were 98.57% on the embeddings and 90.44% using one-hot encoding (all results are for the test set, i.e., previously unseen data);

Figs. 7A to 7F show the dimensionality reduced embedding space of MAIT cells and TCR beta sister sequences in bulk and single cell of possible embodiments, using the 10x Genomics single-cell lung cancer dataset to examine the distribution of MAIT cells in the embedding space (MAIT cell barcodes were labeled according to their TRA J- and V-genes: TRAV1-2 combined with TRAJ33/20/12, enabling the labeling of corresponding TRB sequences by MAIT barcodes in possible embodiments), wherein Fig. 7A shows a dimensionality reduced visualization of MAIT and non-MAIT single-cell embeddings generated using a single cell NLP model, Fig. 7B shows a dimensionality reduced visualization of MAIT and non-MAIT TRB sequences produced by the bulk NLP model (in both cases the MAIT cells did not cluster together), wherein eight 10x Genomics single-cell datasets were combined for a more comprehensive analysis, Fig. 7C shows the publicness distribution for 2508 MAIT and 2508 non-MAIT cells, revealing that over 60% of MAIT cells are Public, Fig. 7D shows the DNN architecture employed for binary classification of MAIT cells with embeddings as input, Fig. 7E shows results evaluated using three types of embeddings: TRA only, TRB only, and TRA combined with TRB (ROC AUC values were 0.83, 0.71, and 0.76, respectively), and Fig. 7F shows use of 100,000 single-cell sequences from the single cell database (see data availability), where TRB sequences co-expressed with the same TRA sequence, TRB sister sequences (see description hereinbelow), were grouped together (a dimensionality reduced visualization of the embedding space for these cells highlights TRB sister sequences belonging to TRA CAVMDSNYQLIW, CAVSGSQGNLIF, and CALNPRGNKLTF (see Figs. 8A to 8C, respectively), showing that they do not cluster together; the mean distance between them was calculated and compared to the distance between them and the rest of the (random) sequences, revealing no difference in mean distance). All the sequences that are noted in the figure are non-limiting, arbitrary (not “real”) sequences provided for demonstrative purposes only;

Figs. 8A to 8C show TRB sister sequences in embedding space for some possible embodiments;

Fig. 9 depicts publicness evolution in embedding space in possible embodiments;

Figs. 10A to 10D depict V-gene clusters in embedding space of some possible embodiments; and

Fig. 11 depicts J-gene classification results obtained in possible embodiments using shallow learning.

DETAILED DESCRIPTION OF EMBODIMENTS

One or more specific and/or alternative embodiments of the present disclosure will be described below with reference to the drawings, which are to be considered in all aspects as illustrative only and not restrictive in any manner. It shall be apparent to one skilled in the art that these embodiments may be practiced without such specific details. In an effort to provide a concise description of these embodiments, not all features or details of an actual implementation are described at length in the specification. Emphasis is instead placed upon clearly illustrating the principles of the invention such that persons skilled in the art will be able to make and use the identification/classification schemes disclosed herein. This invention may be provided in other specific forms and embodiments without departing from the essential characteristics described herein.

Since the CDR3 can be represented as a sequence (of either nucleic acids or amino acids), it may be approached as any other sequence of letters or words (for example, in language, a sentence is a sequence). A language model can therefore be applied to a CDR3-based data set.

Accordingly, the present disclosure describes methods and systems for identification/classification of TCR sequences in bulk repertoire and single cell samples by specially designed artificial intelligence tools. The identification/classification systems/techniques disclosed herein are useful for obtaining clinical information for an individual, or groups of individuals, requiring clinical therapy, e.g., diagnosis of a disease, information on the likely efficacy of a certain therapy, and/or information for the development of therapeutic compounds.

Generally, embodiments disclosed herein utilize an NLP model, which has been pre-trained in a self-supervised manner with a pre-training dataset to provide latent space representations for TCR sequence samples, and is thereafter utilized to process a training dataset to produce latent space embeddings for TCR sequence samples inputted thereto. The training dataset and the embeddings produced by the NLP model are then used for supervised training of one or more classifiers for identification/classification of TCR sequence samples. In possible embodiments a perplexity test developed based on the produced embeddings is utilized for identification/classification of the TCR sequence samples.

To provide an overview of several example features, process stages, and principles of the invention, the BERT model examples illustrated schematically and diagrammatically in the figures are used to provide an NLP model useful for producing meaningful embeddings for TCR sequence samples. These BERT-based systems are shown as one example implementation that demonstrates a number of features, processes, and principles used to provide the required embeddings, but other NLP models can be provided and used with different variations. Therefore, this description will proceed with reference to the shown examples, but with the understanding that the invention recited in the claims below can also be implemented in myriad other ways, once the principles are understood from the descriptions, explanations, and drawings herein. All such variations, as well as any other modifications apparent to one of ordinary skill in the art and useful in NLP based identification/classification applications, may be suitably employed, and are intended to fall within the scope of this disclosure.

Fig. 1A schematically illustrates an NLP based identification/classification system 10 according to possible embodiments. The system 10 generally comprises one or more processors 12 and memories 13 configured with a set of software modules discussed hereinbelow to receive data from one or more databases 11 for the construction and use of TCR sequence sample classification/identification tools. One or more of the databases 11 can be local databases, e.g., directly managed by the system 10, or remote to the system 10 and accessed via one or more data networks (not shown), e.g., the Internet.

The system 10 comprises a pre-training module 12p configured and operable to manage pre-training stage(s) of an NLP model (e.g., BERT). The pre-training module 12p is configured and operable to receive TCR sequence samples data from the one or more databases 11 and construct therefrom one or more pre-training datasets 13p for the pre-training of the NLP model. The pre-training datasets 13p prepared by the pre-training module 12p can be of private CDR3 TCR amino acid sequence samples and/or public CDR3 TCR amino acid sequence samples. In some embodiments the pre-training datasets 13p prepared by the pre-training module 12p comprise either bulk CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data or single cell CDR3 TCRα and/or CDR3 TCRβ amino acid sequence samples data. Optionally, but in some embodiments preferably, the pre-training datasets 13p prepared by the pre-training module 12p comprise either bulk CDR3 TCRβ amino acid sequence samples data or single cell CDR3 TCRβ amino acid sequence samples data. The pre-training module 12p is configured in some embodiments to arrange the bulk amino acid sequence samples data in a table listing the CDR3 sequences of each bulk amino acid sequence sample. If single cell amino acid sequence samples data is used, the pre-training module 12p is configured to concatenate amino acid sequence samples data of the same cell with a separator token placed therebetween, as illustrated in the non-limiting sketch below.
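By way of non-limiting illustration only, the following Python sketch shows one possible way to arrange bulk samples as a table of CDR3 sequences and to concatenate the CDR3 sequences of a single cell with a separator token; the token string, column names and toy sequences are arbitrary assumptions and not the actual format used by the pre-training module 12p.

```python
# Illustrative sketch only; token names and data layout are assumptions,
# not the exact format used by the pre-training module 12p.
import pandas as pd

SEP_TOKEN = "[SEP]"  # hypothetical separator token between chains of the same cell

def build_bulk_pretraining_table(samples: dict) -> pd.DataFrame:
    """Arrange bulk data as a table listing the CDR3 sequences of each sample."""
    rows = [(sample_id, cdr3) for sample_id, cdr3s in samples.items() for cdr3 in cdr3s]
    return pd.DataFrame(rows, columns=["sample_id", "cdr3"])

def build_single_cell_inputs(cells: dict) -> list:
    """Concatenate the CDR3 sequences of each cell with a separator token between them."""
    return [SEP_TOKEN.join(cdr3s) for cdr3s in cells.values()]

# Toy usage with arbitrary (non-real) sequences
bulk = {"s1": ["CASSLAPGATNEKLFF", "CASSIRSSYEQYF"], "s2": ["CASSPGQGDNEQFF"]}
cells = {"cell1": ["CAVMDSNYQLIW", "CASSLAPGATNEKLFF"]}
print(build_bulk_pretraining_table(bulk))
print(build_single_cell_inputs(cells))
```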

The pre-training datasets 13p are used by an NLP module 12b to carry out a self- supervised pre-training stage of an NLP model. Optionally, but in some embodiments preferably, the NLP model is a BERT model, that can be pre-trained with the pre-training datasets 13p as known in the art. For example, in possible embodiments the pre-training is conducted using a masking technique wherein about 10% to 20%, optionally about 15%, of each amino acid sequence of the pre-training dataset is masked for prediction and learning. Following the pre-training stage, the pre-trained NLP model 14 acquires the ability to provide latent space representations to amino acid sequences samples inputted thereto.
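The following non-limiting Python sketch illustrates the masking objective described above (masking about 15% of the amino acid positions of a sequence for prediction); the mask token string and the toy sequence are arbitrary assumptions, and an actual BERT implementation applies the standard masking recipe over tokenized batches.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask token

def mask_sequence(seq: str, mask_fraction: float = 0.15, seed: int = 0):
    """Mask ~15% of the amino acid positions; the model must predict the masked residues."""
    rng = random.Random(seed)
    tokens = list(seq)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    labels = {i: tokens[i] for i in positions}   # targets for the prediction loss
    for i in positions:
        tokens[i] = MASK_TOKEN
    return tokens, labels

masked, targets = mask_sequence("CASSLAPGATNEKLFF")
print(masked, targets)
```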

The system 10 further comprises a training module 12t configured and operable to manage training stage(s) utilizing the pre-trained NLP model 14. The training module 12t is configured and operable to receive TCR sequence samples data from the one or more databases 11 and construct therefrom one or more training datasets 13t for the training of one or more classifiers utilizing embedding data generated by the pre-trained NLP model. The training datasets 13t prepared by the training module 12t comprise in some embodiments private CDR3 TCR amino acid sequence samples and public CDR3 TCR amino acid sequence samples. Optionally, but in some embodiments preferably, the training datasets 13t prepared by the training module 12t comprise either bulk CDR3 TCRβ amino acid sequence samples data or single cell CDR3 TCRβ amino acid sequence samples data.

The NLP module 12b then feeds the training datasets 13t to the pre-trained NLP model 14, to thereby produce embedding data (also referred to as training embedding data) 13e therefor. The produced embedding data 13e provides n-dimensional (where n≥2 is an integer number, e.g., n=768) vector representation/encoding to portions/fragments of the amino acid sequence samples data of the training datasets 13t. Optionally, but in some embodiments preferably, a dimension reduction module 12d is used to reduce the high dimensional embedding data 13e into a lower (e.g., n=2) dimension data representation 13d (e.g., using uniform manifold approximation and projection - UMAP).

In possible embodiments an NLP performance test module 16 is used to analyze the embeddings data 13e and evaluate (e.g., using a perplexity test) the performance of the pre-trained NLP model 14. The NLP performance test module 16 is configured in some embodiments to produce test scores (perplexity scores) 16s for the embedding data 13e. The test scores 16s produced by the NLP performance test module 16 can be further used to carry out, or assist in, the sequence identification/classification tasks of the system 10, disclosed hereinbelow.

The training module 12t then uses the embedding data 13e together with the training datasets 13t to train one or more classifiers 15 to identify/classify features of TCR sequence sample data inputted thereto, as exemplified in Figs. 1B and 1C. Optionally, but in some embodiments preferably, the training module 12t utilizes the reduced dimension presentation 13d of the embedding data 13e produced by the dimension reduction module 12d together with the training datasets 13t to train the one or more classifiers 15. The reduced dimension presentation 13d of the embedding data 13e can then be used for presentation and analysis of the embedding data 13e. In some embodiments, the reduced dimension presentation 13d of the embedding data 13e is further used for determining one or more classification thresholds 15d for the one or more classifiers 15.

Figs. 1B and 1C exemplify sequence analysis units/modules 10b, 10c configured to use the one or more trained classifiers 15 to identify features of, and/or classify, TCR sequence samples data inputted thereto. The sequence analysis units/modules 10b, 10c are optionally part of the system 10 shown in Fig. 1A, or alternatively implemented in separate computer systems having their own processing units and memories.

Fig. 1B exemplifies a sequence analyzer 10b configured to use one or more classifiers 15 trained with training datasets 13t of bulk CDR3 TCR sample data, referred to herein as bulk classifier 15b, utilizing the NLP model pre-trained with the pre-training dataset 13p of bulk CDR3 TCR sample data, referred to herein as bulk pre-trained NLP model 14b. As seen, for each examined bulk sequence sample 17b inputted to the system the bulk pre-trained NLP model 14b generates respective embedding data 14eb, which is then fed to the bulk classifier 15b for identification/classification thereby.

Optionally, but in some embodiments preferably, a scores test module (e.g., perplexity scores test) 16t is additionally or alternatively used for the identification/classification of the examined bulk sequence sample 17b. The scores test module 16t is configured to carry out at least some of the identification/classification tasks of the system utilizing the embedding data 14eb generated by the bulk pre-trained NLP model 14b for the examined bulk sequence sample 17b, based on the embedding data 13e produced for the training dataset 13t of the bulk CDR3 TCR sample data, which is thus referred to herein as bulk embedding data 13eb. The scores test module 16t can be configured to generate one or more scores for the embedding data 14eb based on the bulk embedding data 13eb and determine based thereon one or more features and/or classifications of the examined bulk sequence sample 17b.

Fig. 1C exemplifies a sequence analyzer 10c configured to use one or more classifiers 15 trained with training datasets 13t of single cell CDR3 TCR sample data, referred to herein as single cell classifier 15s, utilizing the NLP model pre-trained with the pre-training dataset 13p of single cell CDR3 TCR sample data, referred to herein as single cell pre-trained NLP model 14s. As seen, for each examined single cell sequence sample 17s inputted to the system the single cell pre-trained NLP model 14s generates respective embedding data 14es, which is then fed to the single cell classifier 15s for identification/classification thereby.

Optionally, but in some embodiments preferably, a scores test module (e.g., perplexity scores test) 16t is additionally or alternatively used for the identification/classification of the examined single cell sequence sample 17s. The scores test module 16t is configured to carry out at least some of the identification/classification tasks of the system utilizing the embedding data 14es generated by the single cell pre-trained NLP model 14s for the examined single cell sequence sample 17s, based on the embedding data 13e produced for the training dataset 13t of the single cell CDR3 TCR sample data, which is thus referred to herein as single cell embedding data 13es. The scores test module 16t can be configured to generate one or more scores for the embedding data 14es based on the single cell embedding data 13es, and determine based thereon one or more features and/or classifications of the examined single cell sequence sample 17s.

The system can generate, based on the products from the bulk/single cell classifiers 15b/15s and/or from the scores test module 16t, for the examined bulk/single cell sequence sample 17b/17s, one or more outputs 17o indicative of identified features and/or classification of the examined sequence sample, such as, but not limited to, CDR3 or non-CDR3 sequence, TCRα or TCRβ sequence, private or public sequence, MAIT or non-MAIT sequence, V-gene or non-V-gene sequence, and/or J-gene or non-J-gene sequence.

Fig. 1D is a block diagram demonstrating a sequence analysis embodiment utilizing the bulk sample sequence analyzer 10b and the single cell sample sequence analyzer 10c, configured to exchange sequence sample data 17b, 17s thereby processed and respective output data 17o generated therefor. Particularly, the bulk sample sequence analyzer 10b can be configured and operable to receive the single cell sequence sample data 17s processed by the single cell sample sequence analyzer 10c, and the output data 17o thereby generated therefor, and update based thereon at least one of its bulk pre-trained NLP model 14b and/or its bulk classifier 15b. Similarly, the single cell sample sequence analyzer 10c can be configured and operable to receive the bulk sequence sample data 17b processed by the bulk sample sequence analyzer 10b, and the output data 17o thereby generated therefor, and update based thereon at least one of its single cell pre-trained NLP model 14s and/or its single cell classifier 15s.

EXAMPLES

Data collection for training

Bulk Sequencing Data

The dataset that was collected for the Count Von Count (CVC) model training includes information from 34 published papers. All these papers report T cell repertoire sequencing from bulk RNA (in contrast with single-cell data). The sequencing and the library preparation were done using multiple methods. All samples were human samples. While some of the bulk sample databases referenced herein reported alpha chain sequencing, as well as beta chain sequencing, only beta chain sequencing was included in the training dataset. Further, the metadata (such as tissue type) was not referred to in the analysis since the focus was on the sequences themselves. Therefore, only two items were taken from each of the papers: TCR sequences and sample identification. The collection finally included 4217 samples that held 221,176,713 sequences.

Single Cell Sequencing Data

The single cell dataset that was collected for the single cell Count Von Count (scCVC) model training and analysis includes information from 31 published experiments. All of the single cell databases referenced herein report T cell repertoire from single-cell RNA sequencing. The sequencing and the library preparation were done using multiple methods. All samples were human samples. As the main interest here is the sequences themselves, of both alpha and beta chain types, additional metadata was not referred to in the analysis. Therefore, only the following items were taken from each experiment: TCR sequence, sample identification, and unique cell identification. The collection finally included 458 samples that held 6,159,652 sequences.

NLP Models: CVC and scCVC

The CVC and scCVC models of the invention are NLP models based on the BERT model architecture, an NLP model shown to have state-of-the-art results on different NLP-related tasks. The CVC and scCVC models use a mechanism called attention to learn complex interactions within the input sequence and, in this case, the interactions and correlations between the amino acids. This allows the models, after some training, to understand the grammar of the amino acid language in an unsupervised manner. The difference between the CVC and scCVC models is mainly in the number of training samples each model was trained on, the input sequences themselves, and how they were presented during training.

CVC was trained on 5 million CDR3 TCR beta sequences, with an internal split of 2.5 million private and 2.5 million public sequences. The input was individual CDR3 TCR beta sequences taken from the Bulk Sequencing Data mentioned above. The training was achieved by using the masking technique: 15% of each sequence’s amino acids were masked, and the model had to predict them.

scCVC was trained on 2,120,565 single cells (comprising 4,200,335 TCR alpha and beta sequences) from the previously mentioned Single-Cell Sequencing Data. The input consisted of single cells represented by a concatenated representation of the CDR3 sequences that belong to them, joined by a separator token. The training process was achieved by first generating a random permutation of the sequences that constitute the single cells and then using the masking technique: 15% of each sequence’s amino acids were masked, and the model had to predict them. The randomization of sequence order was employed to ensure that the model did not assign any importance to a particular order.

The hyperparameters that were used were the following. Most of them were kept equal to the default BERT values:

• Hidden representation dimensionality: 768

• Intermediate representation dimensionality: 1536

• Number of attention heads: 12

• Number of transformer layers: 12

• Batch size: 1024

• Training epochs: 50

• Learning rate: 5e-5

• Maximum positional embedding: 64

• Optimizer: Adam

• Loss: NLL (negative log likelihood)

Because of the large computational needs, the models were trained (separately) on the Google Cloud Platform (GCP) with an NVIDIA Tesla A100 GPU and 120 GB of memory. With this hardware, it took about 6 days to train. Adding parallelization over 8 GPUs decreased the training time to about 2 days.

Once the training was complete, each model was ready to be used for embedding creation. The inputs of CVC were CDR3 TCR beta sequences, and the inputs of scCVC were either individual CDR3 sequences of both chain types, or single cells in the format explained above. The lengths (L) differed. Each sequence was then padded with a prefix token, C, and a suffix token, S. The padded input is passed to an embedding layer that maps each amino acid token into a 768-dimension vector. Along with position embeddings, all the embedded tokens were passed into a set of 12 layers that created the whole sequence embedding matrix with dimensions of (L+2) × 768. This matrix was then reduced to dimension 1 × 768 by calculating the mean of its embeddings. The method for this reduction could be changed, but the mean was set as the default method. This final embedding representation could then be used in various downstream tasks like the ones presented below.
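A minimal sketch of the default pooling step described above, reducing the per-token embedding matrix to a single sequence embedding by averaging; the array shapes are illustrative only.

```python
import numpy as np

def pool_sequence_embedding(token_embeddings: np.ndarray) -> np.ndarray:
    """Reduce an (L+2) x 768 token embedding matrix to a single 768-dimensional
    sequence embedding by averaging over the token dimension (the default pooling
    described above)."""
    return token_embeddings.mean(axis=0)

# Toy example: a sequence of length 14 plus prefix/suffix tokens
toy = np.random.rand(14 + 2, 768)
print(pool_sequence_embedding(toy).shape)  # (768,)
```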

Downstream clustering

CVC outputs embeddings with a dimension of 768. To view these high dimensional embeddings on a 2-D plot, dimensionality reduction had to take place. Uniform Manifold Approximation and Projection (UMAP) 23 was used in this case after the application of PCA 39. The scanpy package 40 was used to apply this technique, receiving an AnnData object consisting of the embeddings and producing the dimensionality reduction coordinates. There was also an attempt to use t-SNE, but the results were clearer and faster to obtain using UMAP.
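The following non-limiting sketch shows one possible scanpy-based workflow for this step (PCA followed by UMAP over the embeddings); the number of principal components and the neighbors parameters used in practice are not specified above and are assumptions here.

```python
import numpy as np
import scanpy as sc
from anndata import AnnData

# embeddings: one 768-dimensional CVC vector per sequence (random toy data here)
embeddings = np.random.rand(1000, 768).astype(np.float32)
adata = AnnData(embeddings)

sc.pp.pca(adata, n_comps=50)          # PCA first, as described above
sc.pp.neighbors(adata, use_rep="X_pca")
sc.tl.umap(adata)                     # 2-D coordinates stored in adata.obsm["X_umap"]

umap_coords = adata.obsm["X_umap"]
print(umap_coords.shape)              # (1000, 2)
```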

Pseudo-Perplexity Score

The pseudo-perplexity score is a modified version of the more traditional metric, perplexity, a metric used to evaluate classical language models - how well they learned the language and can predict, for example, the next sentence. For masked language models, such as BERT, perplexity is not well defined, which is why pseudo-perplexity is used. It is an efficient approximation of the perplexity and has been shown to be reliable in evaluating the performance of such models. Whereas the goal of traditional perplexity is to compute the likelihood of a whole test set of sentences, reflecting how well the model predicted the next word based on the preceding words, the goal of pseudo-perplexity here is to estimate the perplexity over a smaller test set (pseudo-test set). This subset is created by randomly subsampling a portion of the original test set, which greatly reduces the computational burden.

The equation for calculating the pseudo-perplexity score can be written in the following manner:

PseudoPerplexity = exp( -(1/N) * Σ log p(x_i) ), for i = 1 to N

where N is the number of samples in the pseudo-test set, x_i is the i-th sample in the set, and p(x_i) is the likelihood of the i-th sample according to the language model. The final pseudo-perplexity score is thus the reciprocal of the geometric mean of the sample likelihoods: the sum of the logarithms of the sample likelihoods is divided by N, negated, and exponentiated. The lower the score, the better the language model performed and “understands” the data 28.
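A minimal sketch of this calculation, assuming the per-sample log-likelihoods log p(x_i) have already been obtained from the language model:

```python
import numpy as np

def pseudo_perplexity(log_likelihoods: np.ndarray) -> float:
    """Pseudo-perplexity over a pseudo-test set: the per-sample log-likelihoods are
    averaged, negated, and exponentiated. Lower scores indicate a better model."""
    n = len(log_likelihoods)
    return float(np.exp(-np.sum(log_likelihoods) / n))

# Toy example: likelihoods p(x_i) for three samples (hypothetical values)
print(pseudo_perplexity(np.log(np.array([0.8, 0.6, 0.7]))))
```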

Classification models

In line with the guidelines of the recent DOME standard 41 for reporting results of supervised machine learning, the following supplementary table summarizes the supervised machine learning setup in the DOME format.

Supplementary Table: A machine learning summary table in the DOME format.

Input Data Presentation

For each of the models described below, the input was either the embeddings created by CVC or the one-hot encoding representation of the sequences. The one-hot encoding representation transformed each amino acid into a 1 × 20 dimensional one-hot vector.
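A non-limiting sketch of this one-hot representation; the amino acid alphabet ordering and the padding length are illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq: str, max_len: int) -> np.ndarray:
    """Encode a CDR3 sequence as a max_len x 20 one-hot matrix,
    zero-padded to the length of the longest sequence in the dataset."""
    mat = np.zeros((max_len, 20), dtype=np.float32)
    for i, aa in enumerate(seq):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

print(one_hot_encode("CASSLAPGATNEKLFF", max_len=25).shape)  # (25, 20)
```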

LDA

The LDA algorithm is a supervised dimensionality reduction technique that was used here to classify both the public/private label and the J gene of a given sequence. The Python package that was used to apply this algorithm was sklearn 42. It was used with its default hyperparameters. Hyperparameter tuning did not improve results.

xgBoost

The xgBoost algorithm is a well-known classification algorithm that gives high-accuracy results when applied to tabular data. Here it was used in a supervised manner to classify both the Public/Private label and the J gene of a given sequence. The sklearn package was again used for this algorithm with default hyperparameters. Changing the parameters did not improve the results.
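By way of non-limiting illustration, the following sketch applies LDA and a gradient-boosted classifier with default hyperparameters to embedding-like inputs; the use of the xgboost package's scikit-learn-compatible estimator is an assumption, and random arrays stand in for the actual CVC embeddings and labels.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier  # scikit-learn-compatible gradient boosting

# Toy stand-ins for CVC embeddings (768-d) with binary Public/Private labels
X = np.random.rand(2000, 768)
y = np.random.randint(0, 2, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()), ("xgBoost", XGBClassifier())]:
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    print(name, "ROC AUC:", roc_auc_score(y_te, scores))
```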

Deep Neural Network (DNN)

DNN was applied for performing the following classification tasks:

For the task of predicting the Public/Private label and MAIT cells, the best results were achieved by using a simple 3-layer network with dimensions 128, 32, and 1. The nonlinear function was ReLU for the first two layers and sigmoid for the last, with a learning rate of 1e-5, the Adam optimizer, and Binary Cross Entropy (BCE) loss. For Public/Private classification, a batch size of 1024 and 150 epochs were used, as opposed to a batch size of 256 and 80 epochs for classifying MAIT cells. For the task of predicting the J gene, the best results were achieved by using a simple 3-layer network with dimensions 64, 32, and 13 (13 types of J genes). The nonlinear function was ReLU, with a learning rate of 1e-5, the Adam optimizer, a batch size of 1024, Cross Entropy (CE) loss, and 80 epochs. Adding dropout and batch normalization did not improve the results.
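A minimal PyTorch sketch of the described Public/Private network (128, 32, 1 with ReLU/sigmoid, Adam at 1e-5 and BCE loss); the framework choice and the toy data are assumptions, provided for demonstrative purposes only.

```python
import torch
import torch.nn as nn

class PublicPrivateDNN(nn.Module):
    """3-layer classifier over 768-d embeddings: 128 -> 32 -> 1, ReLU/ReLU/sigmoid."""
    def __init__(self, in_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = PublicPrivateDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.BCELoss()

# One toy training step on random data (batch size 1024, as described above)
x = torch.rand(1024, 768)
y = torch.randint(0, 2, (1024,)).float()
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```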

The Bulk Sequencing database is a collection of public data of 4219 samples that correspond to 221,176,713 rows. The following is a list of the PMIDs (PubMed unique identifiers) of the samples that make up this database: 30560866, 29925993, 29419434, 28369043, 27261081, 22373576, 30550791, 29961579, 28332333, 29907691, 31011186, 29858286, 30674904, 30742122, 30120223, 29765028, 27757113, 26297338, 31359002, 28250417, 29066497, 25498120, 29312340, 28146579, 27311837, 25852054, 26968205, 29138313, 29176645, 27649554, 27587469, 22593176, 27942583, 28369038. For this project, only the TCR beta sequences data was used, which translates to 91,758,697 unique sequences.

The Single Cell Sequencing database is a collection of public data of 458 samples that correspond to 6,159,652 rows. The following is a list of the PMIDs of the samples that make up this database: 34767762, 33622974, 36028490, 35263569, 32539073, 32398875, 35803246, 34290408, 35484264, 29961579, 35803260, 34414188, 36065294, 33382973, 33514641, 35668194, 34521850, 35982235, 33785765, 33891889, 35421230, 34471285, 32747828, 35411048, 34813775, 34901832, 33504936, 34591653. For this project, the duplicate and low-quality sequences were filtered out. After filtering, there were 4,200,335 TCR alpha and beta sequences left, which translates to 2,120,565 cells.

The ImmuneCODE™ database (Nolan, S. et al. A large-scale database of T-cell receptor beta (TCRβ) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2. Res Sq (2020)) includes millions of TCR sequences from patients that were exposed to or infected with SARS-CoV-2. It includes over 1,400 different subjects. In this research, it was specifically used to stratify the embedding space by the V and J genes. To do so, 17 million sequences were randomly extracted from it and used for both tasks.

The 10x Genomics Dataset offers many different single-cell datasets that can be used for different research investigations. Overall, there were six datasets used in this research: the NSCLC tumor dataset, 20k bone marrow mononuclear cells, PBMCs of a healthy donor, 10k Human PBMCs, CD8+ T cells of Healthy Donor 1 and CD8+ T cells of Healthy Donor 2. These were chosen based on the number of cells they contained rather than for any other specific reason. The NSCLC tumor dataset, which was used for immune profiling, consists of about 3643 cells. This dataset and the rest of the datasets (20k bone marrow mononuclear cells, PBMCs and CD8+ T cells) were used for MAIT cell classification; the latter contain 19,737, 6,037, 14,632, 123,862 and 191,643 cells, respectively. More information and the datasets themselves can be found on the 10x Genomics website.

CVC and scCVC are based on the BERT architecture. CVC was trained by processing CDR3 TCRβ amino-acid sequences as input, while scCVC was trained on the combined CDR3 TCRα and TCRβ sequences according to their linked single-cell association. scCVC’s input is in the form of single cells, represented by their TCR (α & β) sequences joined by a separator token. This enables a more comprehensive analysis of T cell receptor behavior and features. Each amino acid corresponds to a word in the original BERT architecture, while the CDR3 sequence corresponds to a sentence. The model processes each input and outputs its embedding: a 768-dimensional numerical vector. The data collection used for training CVC includes 1590 TCR-beta samples that translate to 91,758,698 unique CDR3 sequences. Of these, 5 million CDR3 TCR-beta sequences were randomly selected for CVC’s unsupervised training, with a sub-division of 2.5 million Private and 2.5 million Public sequences to avoid bias. As for scCVC, a collection of single-cell data was used, including 2,120,565 single cells, for a total of 4,200,335 TCR sequences.

An unsupervised language model is trained by masking a certain percentage of the input and learns by predicting these masked items. In this case, 15% of each sequence's amino acids were masked, and the model predicted the missing information with feedback. Once the transformer was trained, TCR embeddings were produced for further analysis; pipeline visualizations illustrating how these embeddings are used are provided in Figs. 2A and 2B. The trained model receives amino acid CDR3 sequences to create their embeddings. The embedding space was visualized in 2D using UMAP 23. Each point in CVC represents a sequence, while each point in scCVC represents a cell. In the different visualizations, point color is used for the specific feature analyzed.

To evaluate if CVC encodes meaningful, latent information about a sequence’s biology in its embeddings, the transformer was fed with 1,000,000 randomly sampled sequences to obtain their embeddings. Among the 1,000,000 sequences, 15% were Public, and 85% were Private, keeping the original distribution of these labels across the entire dataset. Then, the 150,000 public and 850,000 private sequences were visualized (UMAP). The results are shown in Fig. 3A, where each sequence (each point) is colored according to its Public/Private label. A sequence is labeled as Public when it appears in more than one sample in the original database. Otherwise, it is labeled Private. From the visualized embedding space, it is apparent that sequences are clustered into roughly a dozen groups (unsupervised), with Public sequences clustered at the tips of each group.
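A non-limiting sketch of this labeling rule (a sequence is Public if it appears in more than one sample, otherwise Private); the column names are illustrative assumptions.

```python
import pandas as pd

def label_publicness(df: pd.DataFrame) -> pd.Series:
    """Label a CDR3 sequence as Public if it appears in more than one sample, else Private.
    Expects columns 'cdr3' and 'sample_id' (column names are illustrative assumptions)."""
    samples_per_seq = df.groupby("cdr3")["sample_id"].nunique()
    return samples_per_seq.map(lambda n: "Public" if n > 1 else "Private")

toy = pd.DataFrame({"cdr3": ["CASSLAPGATNEKLFF", "CASSLAPGATNEKLFF", "CASSPGQGDNEQFF"],
                    "sample_id": ["s1", "s2", "s1"]})
print(label_publicness(toy))
```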

There is a question that arises regarding the specific threshold that was chosen for tagging a sequence as either Public or Private, as there could be changes associated with other values of this threshold. Different thresholds for the definition of a Public sequence produced similar results, resulting in further clumping of public sequences. Regardless of the specific threshold, it is interesting to see if publicness, the frequency by which a sequence is prevalent in the samples’ population, is in-built into the transformer’s embeddings. For that, the number of appearances of each sequence was calculated. The correlation between publicness and sequence length can be observed in Fig. 3B (upper right inset), which shows that the sequence lengths are approximately normally distributed. These were divided into percentiles (10%, 25%, 50%, 75%, and 90%) that corresponded to lengths 13, 14, 15, 16, and 18. The publicness distribution is shown in Fig. 3B, colored based on sequence length percentiles. The x-axis represents the number of public appearances, and the y-axis represents the number of sequences (in log scale). As previously reported 24, the more public a sequence is, the shorter and the more unique it will be. As the figure indicates, different sequence lengths share similar publicness.

Using information from the distribution, the publicness values were divided into 24 bins of different sizes. To demonstrate how the different sequences are encoded by the transformer, the sequences were sampled from each bin, maintaining the ratio of the complete dataset, leading to 1,037,748 sequences. CVC was used to create embeddings from the sequences, exclusively from the sequences, without considering samples or other features. In Fig. 3D, a UMAP of the embeddings is displayed using a color code showing the size-bin affiliation. The figure shows that the spectrum of publicness is associated with directionality in the embedded space. High values of publicness are distant from low values of publicness. The more public a sequence is, the further it is from the private ones. Furthermore, roughly a dozen notable larger clusters again show the same behavior seen before. As an interim summary, the results showed that the embeddings created by CVC capture, in an unsupervised manner, biological features integral to the CDR3 sequence itself.

As demonstrated earlier, different lengths have different publicness. The question of whether the transformer embeddings capture length similarities in the latent space was investigated. Fig. 3C shows the (normal) distribution of the lengths, illustrating the same distribution as the entire data set and allowing the data to be divided into percentiles. 1,050,000 sequences were sampled while keeping the ratio of the different percentiles, and CVC was used to produce their embeddings. Fig. 3D (UMAP) displays sequence length percentiles, while Fig. 3E shows Public/Private status. Side by side, these figures show that the embeddings provide roughly a dozen clusters, each populated by sequences from every percentile. This is the same behavior observed in Fig. 3B of a gradient from larger percentiles to smaller ones. Contrasting this behavior with Fig. 3E, it can be seen that Public sequences populate the lower-mid percentiles, while Private sequences populate the high percentiles. This correlation agrees with the one seen in Fig. 9, indicating that embeddings created by CVC are also sensitive to the sequence length.

In addition to their length, public sequences have been associated with Convergent Recombination (CR), the phenomenon in which various nucleotide sequences code for the same amino acid sequence 10,25. Meaningful embeddings may reveal an association between the input sequence, its multiform CR options, and the sequence’s frequency within the population of samples. To see if the latent space provides this association, the CR level was calculated for each sequence by counting the different nucleotide sequences coding for the same amino acid sequence and, for visualization, the sequences were divided into 5 distinct groups: 0-100, 100-200, 200-300, 300-400, and 400 and above. The CR groups are not equally sized; the larger the CR range, the fewer sequences are associated with it. Then, the different groups were plotted using a random subsample from each group while maintaining their ratios, which totaled 536,932 sequences. Figs. 3F and 3G show the latent space of the embeddings, colored by the Private/Public labels and by CR ranges, respectively. As can be seen, the behavior of these two measurements is consistent with expectations: the combined perspective from these two panels indicates that CR increases with publicness. Therefore, it can be seen that the latent space contains information about CR behavior in tandem with Public status.
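A non-limiting sketch of the CR level computation and binning described above; the column names and toy nucleotide sequences are illustrative assumptions.

```python
import pandas as pd

def convergent_recombination_levels(df: pd.DataFrame) -> pd.Series:
    """Count, per amino acid CDR3, how many distinct nucleotide sequences encode it.
    Expects columns 'cdr3_aa' and 'cdr3_nt' (column names are illustrative assumptions)."""
    return df.groupby("cdr3_aa")["cdr3_nt"].nunique()

def cr_group(cr: int) -> str:
    """Bin CR levels into the five groups used for visualization above."""
    if cr < 100: return "0-100"
    if cr < 200: return "100-200"
    if cr < 300: return "200-300"
    if cr < 400: return "300-400"
    return "400+"

toy = pd.DataFrame({"cdr3_aa": ["CASSF", "CASSF", "CASSY"],
                    "cdr3_nt": ["TGTGCCAGCAGCTTT", "TGCGCCAGCAGCTTC", "TGTGCCAGCAGCTAT"]})
print(convergent_recombination_levels(toy).map(cr_group))
```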

2D dimensionality reduction of the embedded representation shows an intriguing partition into 12-13 large clusters. As a reminder, the embeddings were created unsupervised; that is, CDR3s were not tagged with any labels during self-supervision and were therefore not associated with their origin J gene.

As Figs. 4A and 4B show, the J gene region of the TCR gene lies within the CDR3 region and is of 13 types: J1:1-6 and J2:1-7. To show a substantial amount of J gene tags on a UMAP, the ImmuneCODE database 27 was used, which includes millions of TCR sequences from more than 1400 individuals, with high-quality information about the V and J gene sources of each CDR3 sequence. Seven million sequences were randomly selected. The distribution of the J genes is shown in Fig. 4C, with TCRBJ02-04 and TCRBJ02-06 showing the lowest frequency in the dataset, while the rest of the J genes differ slightly in their frequency. To level the representation, the data were down-sampled to 9% of the sequences from each of the J genes, except for TCRBJ02-04 and TCRBJ02-06, for which all available sequences were used.

Since the J gene visual clustering was previously observed on one dataset, it was tested using the unseen 1400-individuals dataset (the ImmuneCODE dataset) to test the reproducibility of the previous findings. Indeed, as Fig. 4D shows, the new embedding space, colored again by Public/Private labels, shows the same behavior that has been seen with the baseline dataset. To see whether the spatial stratification evident in the embedding space is associated with the different J genes, the sequences were embedded with CVC, and the dimensionality was reduced using UMAP, finally coloring each point (each sequence) according to its J gene. Results are shown in Fig. 4E. The apparent color coding of the different clusters reveals that the embedding space stratifies CDR3 sequences according to their J genes. These findings provide an additional level of meaning to CVC's output, as it offers a distinction between J genes, enhancing the utility of the model. Moreover, these results suggest that CVC can be used to classify the J gene of a CDR3 sequence.

The clear relevance of the J genes in the embedding space leads to a query about the role of V genes. Thus, the ImmuneCODE dataset was used again, focusing on the available V gene information. A total of 65 V genes from TCRBV1 - TCRBV30 were represented in the data. Roughly 2% of sequences from each type were used, and their embeddings were calculated and charted in Fig. 10A. Fig. 10B was created to see if the V genes are associated with the public status of sequences. The red line in the figure is at the 50% mark, meaning that any bars over that threshold are for V genes with a greater than 50% chance of being public. Based on the V genes of those bars, Figs. 10C and 10D were created, which display the embedding space with the corresponding V gene and Public/Private labels. In Fig. 10C, all the clusters contain all the types of V genes, with the sequences grouped together by the different types. Regarding the publicness of these genes, we see that the same behavior occurs (Fig. 10D) but with a larger presence of public sequences. All of this shows that the embeddings also capture the behavior of, and find similarities between, sequences with the same V gene.

scCVC, designed to work with single T cells, facilitates the learning of a combined CDR3 TCRα/TCRβ representation, as well as learning differences between the chain types. Initially, utilizing open-source data from 10x Genomics, the model’s capability to distinguish between TCRβ and TCRα sequences was evaluated using pseudo-perplexity scores. Briefly, pseudo-perplexity approximates perplexity, a metric utilized to evaluate language model performance. As illustrated in Figs. 5A and 5B, it relies on the probability outputs of the model to assign a score for each input, where lower scores indicate better predictions or a more accurate model 28.

Current tools for identifying CDR3 sequences, such as MiXCR 29 and TRUST 30 , depend on aligning gene segments from V and J genes to identify the boundaries of CDR3 regions. The alignment process, which is based on reconstructing contigs, necessitates longer sequences that can provide the building blocks for contigs containing the CDR3 segments. This approach is unable to handle reads that do not combine to form a sufficiently long contig for providing a V(D)J sequence, thus discarding potentially valuable data. With this context, the goal was to explore whether the model could, in theory, extract CDR3 sequences from arbitrary RNA-seq data without the need for preliminary V and J gene alignment steps.

To accomplish this, FASTQ files detailing single-cell RNA-seq data were used. Each RNA read was divided into kmers of length 11-19, simultaneously following three different frames (shifts), so that the total number of kmers per sequence can be represented with the following equation: Total Kmers = 3 * Σ (X - k + 1) for k = 11 to 19, where X denotes the sequence length. The pseudo-perplexity score was calculated per kmer, and the kmer with the lowest score was considered most likely to be of a CDR3 sequence. This process is illustrated in Fig. 5C and further demonstrated on a larger scale in Fig. 5D. Before proceeding, there was a need to ensure that scCVC effectively recognizes both α and β chain types of CDR3 sequences. As demonstrated in Fig. 5E, scCVC provides low (favorable) perplexity scores for both chain types, indicating its ability to differentiate between the two. In contrast, CVC, which was trained on β sequences, gives low scores only for TCRβ sequences. This differentiation allowed the developed approach to be pursued further.
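A non-limiting sketch of the kmer generation step; whether the frames correspond to shifted windows over the read (as assumed here) or to translated reading frames is not detailed above, so the exact kmer count of this sketch may differ slightly from the stated equation for short reads.

```python
def read_kmers(read: str, k_min: int = 11, k_max: int = 19, n_frames: int = 3):
    """Generate all kmers of length 11-19 from a read, over three frames (shifts).
    For long reads this approximates Total Kmers = 3 * sum over k of (X - k + 1)."""
    kmers = []
    for frame in range(n_frames):
        shifted = read[frame:]
        for k in range(k_min, k_max + 1):
            kmers.extend(shifted[i:i + k] for i in range(len(shifted) - k + 1))
    return kmers

toy_read = "CASSLAPGATNEKLFFCASSIRSSYEQYF"  # arbitrary residue string for illustration
print(len(read_kmers(toy_read)))
```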

As shown in Fig. 5F, an accuracy of only 60% in matching kmers with their true value was achieved when using only the kmer with the minimal score. However, when using the lowest ten scores, accuracy improved to 95% in detecting genuine CDR3s, as displayed in Fig. 5G. Further investigation of these top ten scores revealed that many of the sequences within the top ten originated from the same read region with varying lengths. This finding suggests the potential integration of this tool into sequencing workflows.

With clinical applications aiming to control specific TCR sequences in patients, the use of embeddings to expose sequence-based information that associates a TCR with its population-level quantities may greatly benefit clinical TCR uses. To determine if the embeddings could be used to tag sequences as Public or Private, 200,000 sequences were randomly selected, 100,000 from each type (Public/Private), and embedding vectors (768 dimensions) were produced through CVC. Then, these data (tabular, 200k × 768, label 0/1) were used for supervised binary classification. Multiple classification algorithms were tested, and eventually, the focus was narrowed to the following three algorithms: LDA, xgBoost, and a Deep Neural Network (see details in the Supplementary Table and in Fig. 6A for the DNN architecture), which showed AUCs (over the test set) of 0.89, 0.89, and 0.9, respectively. The models provided accuracies of 81.5%, 80.635%, and 81.7%.

To learn about the added information content provided by the transformer model, machine learning over a one-hot representation of the CDR3 sequences was used. In this approach, each amino acid was represented using a 20-dimensional binary vector containing 19 zeros and a single one placed at the index of the specific amino acid. To maintain an equal length for all sequences in the dataset, all one-hot transformations were set to be the length of the longest sequence (LS), while shorter sequences were padded with zeros. This led to a 200k × LS × 20 table as the algorithm's input. Using these data, the achieved AUCs were 0.76, 0.81, and 0.8, respectively. The accuracy of the models was 69.98%, 73.75%, and 72.7%. xgBoost did better here, but only by a small margin. The ROC curves can be seen below in Fig. 6B. These results demonstrate that the latent space of the embeddings provides information to successfully classify the sequences as Public or Private, with a very significant increase in AUC and accuracy.

Another assessment was whether classification, using the embeddings created by CVC, can be used to classify a sequence's J gene. That is, without prior knowledge of the composition of the TCR sequences, identify the underlying J gene from the CDR3 representation in embedding space. The same set of algorithms used to classify the Public/Private label was used here: xgBoost, LDA, and a modified DNN network (Fig. 6A), both on the embeddings and on the one-hot representation of the sequences. Fig. 6C displays the accuracies for the DNN network, while the other methods appear in Fig. 11. All methods did well in predicting the J gene of a sequence when it is represented by the embeddings, but also quite well when the sequences are represented by a one-hot encoding. TCR J genes begin at the end of the CDR3, as can be seen in Fig. 4A, which explains the relative success of the one-hot encoding. The embeddings, however, provided favorable scores, indicating that they encapsulate additional information from the sequence itself.

Single-cell immune profiling provides the knowledge of which TCRα and TCRβ chains are expressed in the same cell, allowing exploration of their cooccurrence and possible functional implications. To investigate this, two distinct examples were analyzed: (1) the study of Mucosal-Associated Invariant T (MAIT) cells; and (2) the analysis of TRB “sister” sequences. MAIT cells are a unique type of T cell identifiable by their alpha chain’s specific J and V genes: TRAV1-2 joined with TRAJ33/20/12. Using single-cell data, MAIT cells were tagged with this V/J information (available at the data source; the open source code contains the algorithm used to label the cells as MAIT). Figs. 7A and 7B show that MAIT cells do not cluster in either the single-cell embedding space (scCVC) or the TCRβ space (CVC). This behavior indicates that the unique transcriptional and functional characteristics of MAIT cells are driven primarily by their TCRα. To investigate the publicness of MAIT cells, the TCRβ embeddings at our disposal were used to classify MAIT cells as Public or Private according to their TCR beta sequences. A DNN classifier was used, and as can be seen in Fig. 7C, roughly 60% of the MAIT cells were classified as Public. Given the demonstrated success in classifying public sequences with CVC embeddings and the fact that many MAIT cells were public, it was important to explore whether MAIT cells could be classified as such using only their TCRβ CDR3 (using CVC embeddings) or only their TCRα CDR3 (scCVC embeddings), without any information about V or J genes. As Figs. 7D and 7E show, the results achieved were an AUC of 0.71 for beta-based classification and 0.83 for alpha-based classification. These results demonstrate that information about the cell type is strongly encoded into the CDR3 sequence, and by translating this sequence into transformer-based embeddings without any gene information, one can effectively classify MAIT cells. The differences in accuracy between the alpha-based and the beta-based classifications are expected, as the tagging itself is alpha-based. It is surprising to find that beta sequences hold relevant information about the MAIT status of the cell.
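A non-limiting sketch of the MAIT tagging rule described above (alpha chain TRAV1-2 combined with TRAJ33/20/12); this is a simplified stand-in for the labeling algorithm available in the open source code, and the gene-name string matching details are assumptions.

```python
MAIT_V = "TRAV1-2"
MAIT_J = {"TRAJ33", "TRAJ20", "TRAJ12"}

def is_mait(tra_v_gene: str, tra_j_gene: str) -> bool:
    """Tag a cell as MAIT when its alpha chain uses TRAV1-2 combined with TRAJ33/20/12
    (a simplified stand-in for the labeling algorithm in the open source code)."""
    return tra_v_gene == MAIT_V and tra_j_gene in MAIT_J

print(is_mait("TRAV1-2", "TRAJ33"))   # True
print(is_mait("TRAV12-1", "TRAJ33"))  # False
```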

In addition to MAIT cells, the single-cell data was used to analyze single T cells and identify patterns of cooccurrence between different TCRβ chains and the same TCRα chain in different cells, that is, to study TCRβ sequences appearing in different cells that share the same TCRα sequence. These beta sequences are referred to as TCRβ sisters. Using the single-cell data, TCRβ sisters were analyzed, and their embeddings were generated using CVC. To see if these TCRβ sisters occupy a contained area in embedding space, the distances between sister TCRβs were measured and compared with the distances between sister TCRβs and random TCRβ sequences. Also, the TCRβ sequences were projected onto a two-dimensional UMAP plot. As Fig. 7F indicates, the distances within the TRB sister groups and the distances to random sequences did not differ significantly. The same phenomenon can be seen in Figs. 8A, 8B, and 8C, which show TCRβ sisters scattered throughout the embedding space. These results indicate the diversity within sister TCRβ sequences.
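A non-limiting sketch of the distance comparison described above, computing the mean pairwise distance within a group of sister TCRβ embeddings and the mean distance from the sisters to random embeddings; the use of Euclidean distance and the toy arrays are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def sister_vs_random_distance(sister_emb: np.ndarray, random_emb: np.ndarray):
    """Compare the mean pairwise distance within a group of TRB sister embeddings
    with the mean distance from the sisters to random TRB embeddings."""
    within = pdist(sister_emb).mean()
    to_random = cdist(sister_emb, random_emb).mean()
    return within, to_random

sisters = np.random.rand(20, 768)   # toy stand-ins for CVC embeddings
randoms = np.random.rand(200, 768)
print(sister_vs_random_distance(sisters, randoms))
```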

It should also be understood that throughout this disclosure, where a process or method is shown or described, the steps/acts of the method may be performed in any order and/or simultaneously, and/or with other steps/acts not-illustrated/described herein, unless it is clear from the context that one step depends on another being performed first. In possible embodiments not all of the illustrated/described steps/acts are required.

The present disclosure also provides a computer program and/or computer program product for carrying out any of the methods described herein, and/or a (e.g., removable) computer readable medium (e.g., optical CD-ROM/DVD, magnetic disk drives, solid state drive, or the like) having stored thereon a program for carrying out any of the methods described herein. Flowcharts/block diagrams of embodiments hereof illustrate the architecture, functionality, and operation of some possible implementations of the disclosed subject matter. Each block in a flowchart/block diagram may accordingly represent a module, segment, function, and/or a portion of an operation or step, which may be implemented as program code or hardware (e.g., integrated circuit - IC, application specific integrated circuit - ASIC, field-programmable gate arrays - FPGA, or suchlike), or as a combination thereof.

A computer system suitable for embodiments disclosed herein may include, for example, one or more processors connected to a communication bus, one or more volatile memories (e.g., random access memory - RAM) or non-volatile memories (e.g., Flash memory). A secondary memory (e.g., a hard disk drive, a removable storage drive, and/or removable memory chip such as an EPROM, PROM or Flash memory) may be used for storing data, computer programs or other instructions, to be loaded into the computer system. For example, computer programs (e.g., computer control logic) may be loaded from the secondary memory into a main memory for execution by one or more processors of the computer system. Alternatively, or additionally, computer program/code may be received via a communication interface e.g., program code stored in a server data processing system may be downloaded over a network (e.g., the Internet) from the server to enable the computer system to perform certain features of the present disclosure.

As described hereinabove and shown in the associated figures, the present invention provides a system for TCR sequence identification and/or classification and related methods. While particular embodiments of the invention have been described, it will be understood, however, that the invention is not limited thereto, since modifications may be made by those skilled in the art, particularly in light of the foregoing teachings. As will be appreciated by the skilled person, the invention can be carried out in a great variety of ways, employing more than one technique from those described above, all without exceeding the scope of the claims.