Title:
METHODS FOR PREDICTING IMMUNOGENICITY OF MUTATIONS OR NEOANTIGENIC PEPTIDES IN TUMORS
Document Type and Number:
WIPO Patent Application WO/2023/089203
Kind Code:
A1
Abstract:
This disclosure provides a highly effective platform for predicting the ability of neoantigenic peptides (neo-peptides) and/or mutations associated with neoantigens or alleles thereof to induce tumor-specific immune responses. The disclosed platform employs novel features and a far more comprehensive feature set as compared to existing methods, which contributes to improved immunogenicity prediction for neo-peptides and/or mutations in neoantigens.

Inventors:
BASSANI-STERNBERG MICHAL (CH)
MULLER MARKUS (CH)
HUBER FLORIAN (CH)
Application Number:
PCT/EP2022/082845
Publication Date:
May 25, 2023
Filing Date:
November 22, 2022
Assignee:
CENTRE HOSPITALIER UNIV VAUDOIS (CH)
International Classes:
G16B20/00; G16B40/20
Domestic Patent References:
WO2020132235A1 2020-06-25
Foreign References:
US20160125129A1 2016-05-05
US20210142904A1 2021-05-13
US4656127A 1987-04-07
Other References:
SMITH CHRISTOF C. ET AL: "Machine-Learning Prediction of Tumor Antigen Immunogenicity in the Selection of Therapeutic Epitopes", CANCER IMMUNOLOGY RESEARCH, vol. 7, no. 10, 1 October 2019 (2019-10-01), US, pages 1591 - 1604, XP055833773, ISSN: 2326-6066, Retrieved from the Internet DOI: 10.1158/2326-6066.CIR-19-0155
S. KIM ET AL: "Neopepsee: accurate genome-level prediction of neoantigens by harnessing sequence and amino acid immunogenicity information", ANNALS OF ONCOLOGY, vol. 29, no. 4, 1 April 2018 (2018-04-01), NL, pages 1030 - 1036, XP055682958, ISSN: 0923-7534, DOI: 10.1093/annonc/mdy022
PYKE RACHEL MARTY ET AL: "Precision Neoantigen Discovery Using Large-scale Immunopeptidomes and Composite Modeling of MHC Peptide Presentation", MOLECULAR & CELLULAR PROTEOMICS, vol. 20, 1 January 2021 (2021-01-01), US, pages 100111, XP093028302, ISSN: 1535-9476, Retrieved from the Internet DOI: 10.1016/j.mcpro.2021.100111
WELLS DANIEL K ET AL: "Key Parameters of Tumor Epitope Immunogenicity Revealed Through a Consortium Approach Improve Neoantigen Prediction", CELL, ELSEVIER, AMSTERDAM NL, vol. 183, no. 3, 9 October 2020 (2020-10-09), pages 818, XP086325754, ISSN: 0092-8674, [retrieved on 20201009], DOI: 10.1016/J.CELL.2020.09.015
GARTNER, J. J. ET AL., NAT. CANCER, 2021, pages 1 - 12
WELLS, D. K. ET AL., CELL, vol. 183, 2020, pages 818 - 834
SHAPLEY, L. S., CONTRIBUTIONS TO THE THEORY OF GAMES, vol. 2, 1953, pages 307 - 317
BERGSTRA, J. ET AL., COMPUT. SCI. DISCOV., vol. 8, 2015, pages 014008
NG, PC ET AL., PLOS GEN., vol. 4, no. 8, 2008, pages 1 - 15
MERRIFIELD, SCIENCE, vol. 232, 1986, pages 341 - 347
BARANY; MERRIFIELD: "The Peptides", 1979, ACADEMIC PRESS, pages: 1 - 284
STEWART; YOUNG: "Solid Phase Peptide Synthesis", 1984
SAMBROOK ET AL.: "Molecular Cloning, A Laboratory Manual", 1989, COLD SPRING HARBOR LABORATORY
JANEWAY: "The Immune System in Health and Disease", 1997, CURRENT BIOLOGY PUBLICATIONS
TRAN, E. ET AL., SCIENCE, vol. 350, 2015, pages 1387 - 1390
PARKHURST, M. R. ET AL., CANCER DISCOV., vol. 9, 2019, pages 1022 - 1035
WELLS, D. K. ET AL., CELL, vol. 183, 2020, pages 818 - 834
ARNAUD, M. ET AL., NAT. BIOTECHNOL., vol. 40, 2022, pages 656 - 660
BASSANI-STERNBERG, M. ET AL., PLOS COMPUT. BIOL., vol. 13, 2017, pages e1005725
REYNISSON, B. ET AL., NUCLEIC ACIDS RES., vol. 48, 2020, pages W449 - W454
SCHMIDT, J ET AL., CELL REP. MED., vol. 2, 2021, pages 100194
HARNDAHL, M. ET AL., EUR. J. IMMUNOL., vol. 42, 2012, pages 1405 - 1416
NIELSEN, M ET AL., IMMUNOGENETICS, vol. 57, 2005, pages 33 - 41
ROGERS, M. F. ET AL., SCI. REP., vol. 7, 2017, pages 11597
MARTINEZ-JIMENEZ, F. ET AL., NAT. REV. CANCER, vol. 20, 2020, pages 555 - 572
COX, D. R. J. R, STAT. SOC. SER. B METHODOL., vol. 20, 1958, pages 215 - 242
BOSER, B. E. ET AL.: "In Proceedings of the Fifth Annual Workshop on Computational Learning Theory", 1992, ACM, pages: 144 - 152
CHEN, T.; GUESTRIN, C.: "Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining", 2016, ASSOCIATION FOR COMPUTING MACHINERY, pages: 785 - 794
MULLER, M. ET AL., FRONT. IMMUNOL., vol. 30, 2017
SHAPLEY, L. S, IN CONTRIBUTIONS TO THE THEORY OF GAMES, vol. 2, 1953, pages 307 - 317
WELLS, D. K. ET AL., CELL, vol. 183, pages 818 - 834
Attorney, Agent or Firm:
HGF (GB)
Claims:
CLAIMS

What is claimed is:

1. A method of identifying one or more mutations associated with a neoantigen or an allele thereof or identifying one or more neo-peptides, wherein the one or more mutations or the one or more neo-peptides are capable of inducing a tumor-specific immune response, the method comprising: associating each of the one or more mutations or each of the one or more neo-peptides with a set of features; determining a value of each of the set of features for each of the one or more mutations or each of the one or more neo-peptides; inputting the value of each of the set of features into a classifier or a machine learning model to determine a likelihood of inducing a tumor-specific immune response of each of the one or more mutations or each of the one or more neo-peptides; and identifying a set of mutations or a set of neo-peptides that has the likelihood of inducing the tumor-specific immune response greater than a reference value.

2. The method of claim 1, comprising ranking the one or more mutations or the one or more neo-peptides based on the likelihood of inducing a tumor-specific immune response.

3. The method of claim 1, wherein the classifier or machine learning model comprises a logistic regression (LR) classifier, or a XGBoost classifier.

4. The method of claim 3, wherein the classifier or machine learning model comprises a voting classifier that combines respective probabilities determined by the LR classifier and the XGBoost classifier.

5. The method of any one of the preceding claims, comprising optimizing hyperparameters of the classifier or machine learning model in a cross-validation loop.

6. The method of any one of the preceding claims, wherein the classifier or machine learning model has been trained on the set of features.

7. The method of any one of the preceding claims, wherein the set of features is ranked based on Shapley values.

8. The method of any one of the preceding claims, wherein the set of features is determined by one or more of MixMHCpred, PRIME, IpMSDB, CScape, GTEX, RNAseq, NetMHCpan, NetMHCstabpan, NetChop, HLA propensity, and DAI.

9. The method of any one of the preceding claims, comprising training the classifier or machine learning model on data that have been subject to missing value imputation, data normalization, conversion of categorical features into numerical features, or a combination thereof.

10. The method of any one of the preceding claims, wherein the data normalization comprises quantile normalization, power normalization, or a combination thereof.

11. The method of claim 10, wherein the data normalization comprises quantile normalization.

12. The method of any one of the preceding claims, comprising training the classifier or machine learning model on NCI-train and testing on at least one of NCI-test, TESLA, and HiTIDE datasets.

13. The method of any one of the preceding claims, comprising identifying the one or more neo-peptides based on the set of features comprising one or more features selected from:

(a) mutant_rank, mutant_rank_netMHCpan, mutant_rank_PRIME, rnaseq_alt_support, Sample_Tissue_expression_GTEx, TCGA_Cancer_expression, bestWTMatchOverlap_I, and seq_len;

(b) mut_Rank_Stab, TAP_score, mut_netchop_score_ct, mut_binding_score, mut_aa_coeff, DAI_MixMHC, DAI_NetStab, DAI_MixMHC_mbp, mutant_other_significant_alleles, bestWTMatchType_I, CSCAPE_score, CCF, and nb_same_mutation_Intogen; and

(c) mut_is_binding_pos, pep_mut_start, DAI_NetMHC, rnaseq_TPM, GTEx_all_tissues_expression_mean, bestWTMatchScore_I, bestMutationScore_I, bestWTPeptideCount_I, Clonality, mutation_driver_statement_Intogen, and gene_driver_Intogen.

14. The method of any one of the preceding claims, comprising identifying the one or more mutations based on the set of features comprising one or more features selected from:

(a) nb_mutations_in_gene_Intogen, TCGA_Cancer_expression, bestMutationScore_I, rnaseq_alt_support, COUNT_MUT_RANK_CI_netMHCpan, MIN_MUT_RANK_CI_MIXMHC, MIN_MUT_RANK_CI_PRIME, mut_Rank_EL_0, mut_Rank_Stab_2, DAI_0, and mut_netchop_score;

(b) rnaseq_TPM, bestWTPeptideCount_I, CCF, CSCAPE_score, Sample_Tissue_expression_GTEx, COUNT_MUT_RANK_CI_MIXMHC, WT_BEST_RANK_CI_MIXMHC, next_best_BA_mut_ranks, mut_Rank_EL_1, mut_Rank_EL_2, wt_Rank_EL_0, wt_Rank_EL_1, mut_Rank_Stab_0, mut_Rank_Stab_1, DAI_1, and mut_TAP_score_0; and

(c) nb_same_mutation_Intogen, mutation_driver_statement_Intogen, gene_driver_Intogen, bestWTMatchScore_I, bestWTMatchOverlap_I, Zygosity, Clonality, GTEx_all_tissues_expression_mean, COUNT_MUT_RANK_CI_PRIME, WT_BEST_RANK_CI_PRIME, wt_Rank_EL_2, and DAI_2.

15. The method of any one of the preceding claims, wherein the step of inputting the set of features comprises inputting a feature matrix or a feature vector generated from the set of features.

16. The method of any one of the preceding claims, comprising identifying the one or more mutations in sequencing data obtained from a sample of a subject.

17. The method of claim 16, wherein the sequencing data comprises exome, transcriptome, whole genome nucleotide sequencing data, or a combination thereof.

18. The method of any one of claims 16 to 17, wherein the subject has a cancer.

19. The method of claim 18, wherein the cancer is a carcinoma, a sarcoma, a lymphoma, a melanoma, a pediatric tumor, or a leukemia.

20. The method of claim 18, wherein the cancer is selected from adrenal gland tumors, biliary cancer, bladder cancer, brain cancer, breast cancer, carcinoma, central or peripheral nervous system tissue cancer, cervical cancer, colon cancer, endocrine or neuroendocrine cancer or hematopoietic cancer, esophageal cancer, fibroma, gastrointestinal cancer, glioma, head and neck cancer, Li-Fraumeni tumors, liver cancer, lung cancer, lymphoma, melanoma, meningioma, multiple neuroendocrine type I and type II tumors, nasopharyngeal cancer, oral cancer, oropharyngeal cancer, osteogenic sarcoma tumors, ovarian cancer, pancreatic cancer, pancreatic islet cell cancer, parathyroid cancer, pheochromocytoma, pituitary tumors, prostate cancer, rectal cancer, renal cancer, respiratory cancer, sarcoma, skin cancer, stomach cancer, testicular cancer, thyroid cancer, tracheal cancer, urogenital cancer, and uterine cancer.

21. A nucleic acid molecule comprising a polynucleotide sequence of a neoantigen, wherein the nucleic acid sequence comprises at least one of the one or more mutations or encodes at least one of the one or more neo-peptides identified according to the method of any one of the preceding claims.

22. A vector comprising the nucleic acid molecule of claim 21.

23. A polypeptide encoded by the nucleic acid molecule of claim 21.

24. A tumor vaccine comprising the nucleic acid molecule of claim 21, the vector of claim 22, the polypeptide of claim 23, or a combination thereof.

25. A method of preparing a tumor vaccine, comprising: identifying one or more mutations associated with a neoantigen or allele thereof or one or more neo-peptides that are capable of inducing a tumor-specific immune response according to the method of claims 1 to 20; and generating a nucleic acid molecule having a polynucleotide sequence encoding the neoantigen or allele thereof or at least one of the one or more neo-peptides.

26. A method of treating cancer in a subject, comprising: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method of claims 1 to 20; and administering to the subject a therapeutically effective amount of a therapeutic agent comprising a T cell that binds specifically to at least one of the one or more mutations or neo-peptides.

27. A method of selecting an antigen-specific T cell, comprising: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method of claims 1 to 20; and selecting a T cell having a receptor that is antigen specific to at least one of the one or more mutations or neo-peptides.

28. A T cell therapy, comprising a T cell selected according to the method of claim 27.

29. The T cell therapy of claim 28, wherein the T cell therapy is adoptive cell therapy.

Description:
METHODS FOR PREDICTING IMMUNOGENICITY OF MUTATIONS OR NEOANTIGENIC PEPTIDES IN TUMORS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 63/281,970, filed November 22, 2021. The foregoing application is incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

This invention relates to methods for predicting immunogenicity of neoantigenic peptides (neo-peptides) or mutations in tumors.

BACKGROUND OF THE INVENTION

In recent years it has been demonstrated across tumor types in patients receiving adoptive transfer of autologous in vitro cultured tumor infiltrating lymphocytes (TILs) that T cells recognizing specifically mutated neoantigens play a key role in mediating effective anti-tumor responses. Furthermore, neoantigens were found to be implicated in the therapeutic efficacy of immune checkpoint inhibitor antibodies, and several studies showed immune recognition following neoantigen-based vaccines, where patients experienced no major toxicity. Mutated proteins are processed and presented on tumor cells as human leukocyte antigen (HLA) binding peptides (HLAp) and are recognized by cognate T-cell receptors as “non-self.” Targeting such neoantigens would enable immune cells to distinguish between normal and cancerous cells, avoiding the risk of autoimmunity. Significant technological improvements in genomics, bioinformatics, and in silico HLA binding prediction tools have facilitated major breakthroughs in the discovery of neoantigens encoded by somatic indels, frameshifts, and non-synonymous single nucleotide variants (SNV). Furthermore, advanced immunological screening techniques have facilitated the detection and isolation of neoantigen reactive T cells.

The development of innovative clinical treatment options targeting neoantigens requires the sequencing of tumor DNA and RNA and the identification of neoantigens that are likely to be targeted by autologous T cells. These methods apply different algorithms that score mutations based on their probabilities of being expressed, processed, and presented on the patient’s HLA molecule and of being specifically recognized by high avidity T cells, compared with wild type counterpart sequences. While the performance of these algorithms has improved significantly in recent years, the sensitivity is still suboptimal. This is particularly important when only a limited set of neoantigens can be included, for example, in an mRNA vaccine targeting up to 20 mutations in a melanoma patient with a high tumor mutation burden.

Accordingly, there remains a need for novel methods for predicting or ranking immunogenicity of mutations and/or neo-peptides in tumors.

SUMMARY OF THE INVENTION

This disclosure addresses the need mentioned above in a number of aspects. In one aspect, this disclosure provides a method of identifying one or more mutations associated with a neoantigen or an allele thereof or identifying one or more neoantigenic peptides (neo-peptides), wherein the one or more mutations or the one or more neo-peptides are capable of inducing a tumor-specific immune response. In some embodiments, the method comprises: (i) associating each of the one or more mutations or each of the one or more neo-peptides with a set of features; (ii) determining a value of each of the set of features for each of the one or more mutations or each of the one or more neo-peptides; (iii) inputting the value of each of the set of features into a classifier or a machine learning model to determine a likelihood of inducing a tumor-specific immune response of each of the one or more mutations or each of the one or more neo-peptides; and (iv) identifying a set of mutations or a set of neo-peptides that has the likelihood of inducing the tumor-specific immune response greater than a reference value.

In some embodiments, the method comprises ranking the one or more mutations or the one or more neo-peptides based on the likelihood of inducing the tumor-specific immune response.

In some embodiments, the classifier or machine learning model comprises a logistic regression (LR) classifier, or a XGBoost classifier. In some embodiments, the logistic regression classifier and the XGBoost classifier may be used in combination. In some embodiments, the classifier or machine learning model comprises a voting classifier that combines respective probabilities determined by the LR classifier and the XGBoost classifier.

In some embodiments, the method comprises optimizing hyperparameters of the classifier or machine learning model in a cross-validation loop. In some embodiments, the classifier or machine learning model has been trained on the set of features. In some embodiments, the set of features is ranked based on Shapley values.
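As an illustration of ranking features by Shapley values, the following is a minimal sketch that computes exact Shapley values for a toy value function and sorts the features by them. The value function, its numeric contributions, and the interaction bonus are hypothetical; the text does not state how Shapley values are computed for the trained classifiers, and in practice they are usually estimated rather than enumerated.

```python
from itertools import combinations
from math import factorial

# Hypothetical per-feature contributions to a model score (illustration only).
CONTRIBUTION = {"mutant_rank": 0.30, "rnaseq_TPM": 0.10, "CCF": 0.05}


def toy_value(subset):
    """Hypothetical score of a model restricted to the features in `subset`."""
    score = sum(CONTRIBUTION[f] for f in subset)
    # Assumed interaction: two features are worth more together than apart.
    if "mutant_rank" in subset and "rnaseq_TPM" in subset:
        score += 0.10
    return score


def shapley_values(features, value):
    """Exact Shapley values: weighted average marginal contribution of each
    feature over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


phi = shapley_values(list(CONTRIBUTION), toy_value)
feature_ranking = sorted(phi, key=phi.get, reverse=True)
```

In this toy case the interaction bonus is split equally between the two interacting features, and the values sum to the score of the full feature set.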

In some embodiments, the set of features is determined by one or more of MixMHCpred, PRIME, IpMSDB, CScape, GTEX, RNAseq, NetMHCpan, NetMHCstabpan, NetChop, HLA propensity, and DAI.

In some embodiments, the method comprises training the classifier or machine learning model on data that have been subject to missing value imputation, data normalization, conversion of categorical features into numerical features, or a combination thereof. In some embodiments, the data normalization comprises quantile normalization, power normalization, or a combination thereof. In some embodiments, the data normalization comprises quantile normalization.
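The three preprocessing steps named above can be sketched as follows. This is only an illustration under stated assumptions: median imputation, rank-based mapping to empirical quantiles, and integer coding of categories are choices made here, and the example values and category labels are hypothetical. The disclosure names the steps but this passage does not specify their exact form.

```python
from statistics import median


def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def quantile_normalize(values):
    """Map each value to its empirical quantile in (0, 1).
    Tied values share the mean of their ranks."""
    n = len(values)
    out = []
    for v in values:
        below = sum(1 for o in values if o < v)
        equal = sum(1 for o in values if o == v)
        mean_rank = below + (equal + 1) / 2
        out.append(mean_rank / (n + 1))
    return out


def encode_categories(values):
    """Convert categorical labels to integer codes (sorted label order)."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical values for two of the listed features.
rnaseq_tpm = [12.0, None, 3.0, 40.0]
clonality = ["clonal", "subclonal", "clonal", "clonal"]

tpm_filled = impute_missing(rnaseq_tpm)
tpm_quantiles = quantile_normalize(tpm_filled)
clonality_codes = encode_categories(clonality)
```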

In some embodiments, the method comprises training the classifier or machine learning model on NCI-train and testing the classifier or machine learning model on at least one of NCI-test, TESLA, and HiTIDE datasets.

In some embodiments, the step of inputting the set of features comprises inputting a feature matrix or a feature vector generated from the set of features.

In some embodiments, the method comprises identifying the one or more neo-peptides based on the set of features comprising one or more features selected from:

(a) mutant_rank, mutant_rank_netMHCpan, mutant_rank_PRIME, rnaseq_alt_support, Sample_Tissue_expression_GTEx, TCGA_Cancer_expression, bestWTMatchOverlap_I, and seq_len;

(b) mut_Rank_Stab, TAP_score, mut_netchop_score_ct, mut_binding_score, mut_aa_coeff, DAI_MixMHC, DAI_NetStab, DAI_MixMHC_mbp, mutant_other_significant_alleles, bestWTMatchType_I, CSCAPE_score, CCF, and nb_same_mutation_Intogen; and

(c) mut_is_binding_pos, pep_mut_start, DAI_NetMHC, rnaseq_TPM, GTEx_all_tissues_expression_mean, bestWTMatchScore_I, bestMutationScore_I, bestWTPeptideCount_I, Clonality, mutation_driver_statement_Intogen, and gene_driver_Intogen.

In some embodiments, the method comprises identifying the one or more mutations based on the set of features comprising one or more features selected from:

(a) nb_mutations_in_gene_Intogen, TCGA_Cancer_expression, bestMutationScore_I, rnaseq_alt_support, COUNT_MUT_RANK_CI_netMHCpan, MIN_MUT_RANK_CI_MIXMHC, MIN_MUT_RANK_CI_PRIME, mut_Rank_EL_0, mut_Rank_Stab_2, DAI_0, and mut_netchop_score;

(b) rnaseq_TPM, bestWTPeptideCount_I, CCF, CSCAPE_score, Sample_Tissue_expression_GTEx, COUNT_MUT_RANK_CI_MIXMHC, WT_BEST_RANK_CI_MIXMHC, next_best_BA_mut_ranks, mut_Rank_EL_1, mut_Rank_EL_2, wt_Rank_EL_0, wt_Rank_EL_1, mut_Rank_Stab_0, mut_Rank_Stab_1, DAI_1, and mut_TAP_score_0; and

(c) nb_same_mutation_Intogen, mutation_driver_statement_Intogen, gene_driver_Intogen, bestWTMatchScore_I, bestWTMatchOverlap_I, Zygosity, Clonality, GTEx_all_tissues_expression_mean, COUNT_MUT_RANK_CI_PRIME, WT_BEST_RANK_CI_PRIME, wt_Rank_EL_2, and DAI_2.

In some embodiments, the method comprises identifying the one or more mutations in sequencing data obtained from a sample of a subject. In some embodiments, the sequencing data comprises exome, transcriptome, whole genome nucleotide sequencing data, or a combination thereof.

In some embodiments, the subject has a cancer. In some embodiments, the cancer is a carcinoma, a sarcoma, a lymphoma, a melanoma, a pediatric tumor, or a leukemia.

In some embodiments, the cancer is selected from adrenal gland tumors, biliary cancer, bladder cancer, brain cancer, breast cancer, carcinoma, central or peripheral nervous system tissue cancer, cervical cancer, colon cancer, endocrine or neuroendocrine cancer or hematopoietic cancer, esophageal cancer, fibroma, gastrointestinal cancer, glioma, head and neck cancer, Li-Fraumeni tumors, liver cancer, lung cancer, lymphoma, melanoma, meningioma, multiple neuroendocrine type I and type II tumors, nasopharyngeal cancer, oral cancer, oropharyngeal cancer, osteogenic sarcoma tumors, ovarian cancer, pancreatic cancer, pancreatic islet cell cancer, parathyroid cancer, pheochromocytoma, pituitary tumors, prostate cancer, rectal cancer, renal cancer, respiratory cancer, sarcoma, skin cancer, stomach cancer, testicular cancer, thyroid cancer, tracheal cancer, urogenital cancer, and uterine cancer.

In another aspect, this disclosure provides a nucleic acid molecule comprising a polynucleotide sequence of a neoantigen. In some embodiments, the nucleic acid sequence comprises at least one of the one or more mutations or encodes at least one of the one or more neo-peptides identified according to the method described herein.

Also within the scope of the disclosure are a vector comprising the nucleic acid molecule described herein, and a polypeptide encoded by the nucleic acid molecule described herein.

In another aspect, this disclosure also provides a tumor vaccine comprising the nucleic acid molecule, the vector, the polypeptide, or a combination thereof, as described herein.

In another aspect, this disclosure additionally provides a method of preparing a tumor vaccine. In some embodiments, the method comprises: identifying one or more mutations associated with a neoantigen or allele thereof or one or more neo-peptides that are capable of inducing a tumor-specific immune response according to the method described herein; and generating a nucleic acid molecule having a polynucleotide sequence encoding the neoantigen or allele thereof or at least one of the one or more neo-peptides.

In another aspect, this disclosure further provides a method of treating cancer in a subject. In some embodiments, the method comprises: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method described herein; and administering to the subject a therapeutically effective amount of a therapeutic agent comprising a T cell that binds specifically to at least one of the one or more mutations or neo-peptides.

In another aspect, this disclosure also provides a method of selecting an antigen-specific T cell. In some embodiments, the method comprises: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method described herein; and selecting a T cell having a receptor that is antigen specific to at least one of the one or more mutations or neo-peptides.

In another aspect, this disclosure additionally provides a T cell therapy, comprising a T cell selected according to the method described herein. In some embodiments, the T cell therapy is adoptive cell therapy.

The foregoing summary is not intended to define every aspect of the disclosure, and additional aspects are described in other sections, such as the following detailed description. The entire document is intended to be related as a unified disclosure, and it should be understood that all combinations of features described herein are contemplated, even if the combinations of features are not found together in the same sentence, or paragraph, or section of this document. Other features and advantages of the invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating specific embodiments of the disclosure, are given by way of illustration only, because various changes and modifications within the spirit and scope of the disclosure will become apparent to those skilled in the art from this detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

Figs. 1A, 1B, 1C, and 1D (collectively “Fig. 1”) show that using the classifier probability ranking, the rank_score evaluates how low the immunogenic neo-peps are ranked among all neo-peps of a patient. Fig. 1A shows leave-one-out CV rank_score on NCI-train_neo-pep. The boxes reflect the rank_score distributions over the 10 subsamples. The best possible total rank_score is 80.071. Fig. 1B shows number of neo-peps ranked in the top 20, 50, or 100 for leave-one-out CV on NCI-train_neo-pep. Immunogenic neo-peps were ranked per patient and the number of immunogenic neo-peps in the top 20, 50, or 100 ranks was calculated. The y-axis represents these numbers summed over all patients in NCI-train_neo-pep. The dashed horizontal line indicates 82, the total number of immunogenic neo-peps in NCI-train_neo-pep. Fig. 1C shows, for the NCI-test_neo-pep dataset, the number of immunogenic neo-peps ranked in the top 20, 50, or 100 among all neo-peps of a patient, summed over all patients in the dataset. The rankings obtained with the LR, XGBoost, and Voting classifiers are compared to the ranking from Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)). The dashed horizontal line indicates the total number of 21 immunogenic neo-peps in the NCI-test_neo-pep dataset. Fig. 1D shows LR leave-one-out CV rank_score on NCI-train_mut-seq for all features (‘ALL’), all features without ipMSDB features (‘No ipMSDB’), all features without Intogen features (‘No Intogen’), and all features without ipMSDB and Intogen features (‘No ipMSDB & Intogen’). The boxes reflect the rank_score distributions over the 10 Hyperopt runs. The best possible total rank_score is 103.718.

Figs. 2A, 2B, 2C, and 2D (collectively “Fig. 2”) show ranking of neo-peps or mutations. Fig. 2A shows CPU running times in seconds for the Hyperopt loop with 200 iterations. The boxes reflect the time distributions over the 10 subsamples. Fig. 2B shows number of neo-peps ranked in the top 20, 50, or 100 on NCI-test_neo-pep. Immunogenic neo-peps were ranked per patient and the number of immunogenic neo-peps in the top 20, 50, or 100 ranks was calculated. The y-axis represents these numbers summed over all patients in NCI-test_neo-pep. The dashed horizontal line indicates 21, the total number of immunogenic neo-peps in NCI-test_neo-pep. Fig. 2C shows the same as Fig. 2B, but for TESLA_neo-pep with 34 immunogenic neo-peps. Fig. 2D shows the same as Fig. 2B, but for HiTIDE_neo-pep with 38 immunogenic neo-peps.

Figs. 3A, 3B, 3C, and 3D (collectively “Fig. 3”) show ranking of neo-peps or mutations. Fig. 3A shows leave-one-out CV rank_score on NCI-train_mut-seq. The boxes reflect the rank_score distributions over the 10 repeated runs of the Hyperopt loop. The best possible total rank_score was 103.718. Fig. 3B shows number of neo-peps ranked in the top 20, 50, or 100 on TESLA_mut-seq. Immunogenic neo-peps were ranked per patient and the number of immunogenic neo-peps in the top 20, 50, or 100 ranks was calculated. The y-axis represents these numbers summed over all patients in TESLA_mut-seq. The dashed horizontal line indicates 36, the total number of immunogenic mut-seqs in TESLA_mut-seq. Fig. 3C shows number of neo-peps ranked in the top 20, 50, or 100 on HiTIDE_mut-seq. Immunogenic neo-peps were ranked per patient and the number of immunogenic neo-peps in the top 20, 50, or 100 ranks was calculated. The y-axis represents these numbers summed over all patients in HiTIDE_mut-seq. The dashed horizontal line indicates 35, the total number of immunogenic neo-peps in HiTIDE_mut-seq. Fig. 3D shows rank_score (α = 0.02) for LR and XGBoost and mut-seqs in NCI-test_mut-seq. The best possible total rank_score was 39.547.
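The top-20, top-50, and top-100 counts described for the figures above (candidates ranked per patient, immunogenic hits counted within the top k, then summed over patients) can be sketched as follows. The data layout, function name, and toy probabilities are assumptions made for illustration, and small cutoffs are used so the toy data stay readable.

```python
def top_k_hits(patients, ks=(20, 50, 100)):
    """Count immunogenic candidates ranked within the top k for each patient
    and sum the counts over all patients.

    `patients` maps a patient id to a list of (probability, is_immunogenic)
    pairs, one pair per neo-pep or mutation (hypothetical layout).
    """
    totals = {k: 0 for k in ks}
    for scored in patients.values():
        # Rank this patient's candidates by classifier probability, best first.
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        for k in ks:
            totals[k] += sum(1 for _, immunogenic in ranked[:k] if immunogenic)
    return totals


# Toy data: two patients with made-up probabilities and labels.
patients = {
    "pt1": [(0.9, True), (0.8, False), (0.1, True)],
    "pt2": [(0.7, False), (0.6, True)],
}
hits = top_k_hits(patients, ks=(1, 2, 3))
```

With real data the default cutoffs of 20, 50, and 100 would apply, and the summed counts could be compared against the total number of immunogenic candidates in the dataset, as the dashed lines in the figures do.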

DETAILED DESCRIPTION OF THE INVENTION

This disclosure provides a highly effective platform for predicting the ability of neoantigenic peptides (neo-peptides) and/or mutations associated with neoantigens or alleles thereof to induce tumor-specific immune responses. The disclosed platform employs novel features and a far more comprehensive feature set as compared to existing methods, which contributes to improved immunogenicity prediction for neo-peptides and/or mutations in neoantigens.

Identifying or Prioritizing Candidate Mutations and Candidate Neo-peptides

In one aspect, this disclosure provides a method of identifying one or more mutations associated with a neoantigen or an allele thereof or identifying one or more neo-peptides, wherein the one or more mutations or the one or more neo-peptides are capable of inducing a tumor-specific immune response. In some embodiments, the method may include: (i) associating each of the one or more mutations or each of the one or more neo-peptides with a set of features; (ii) determining a value of each of the set of features for each of the one or more mutations or each of the one or more neo-peptides; (iii) inputting the value of each of the set of features into a classifier or a machine learning model to determine a likelihood of inducing a tumor-specific immune response of each of the one or more mutations or each of the one or more neo-peptides; and (iv) identifying a set of mutations or a set of neo-peptides that has the likelihood of inducing the tumor-specific immune response greater than a reference value.

In some embodiments, the method may include ranking the one or more mutations or the one or more neo-peptides based on the likelihood of inducing the tumor-specific immune response.
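Steps (i) to (iv) and the ranking step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the logistic scoring function, its weights and bias, the candidate names, the feature values, and the reference value are all hypothetical, and only the two feature names are taken from the feature lists in this document.

```python
import math

# Hypothetical model parameters; a trained classifier would supply these.
WEIGHTS = {"mutant_rank": -0.8, "rnaseq_TPM": 0.03}
BIAS = 0.5


def likelihood(features):
    """Step (iii): turn feature values into a probability (logistic model)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def identify_and_rank(candidates, reference_value):
    """Step (iv) plus ranking: keep candidates whose likelihood exceeds the
    reference value, ordered from most to least likely."""
    scored = [(name, likelihood(feats)) for name, feats in candidates.items()]
    kept = [(name, p) for name, p in scored if p > reference_value]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Steps (i)-(ii): each candidate is associated with a value for each feature.
candidates = {
    "pep_A": {"mutant_rank": 0.1, "rnaseq_TPM": 40.0},
    "pep_B": {"mutant_rank": 2.0, "rnaseq_TPM": 5.0},
    "pep_C": {"mutant_rank": 5.0, "rnaseq_TPM": 0.0},
}
ranked = identify_and_rank(candidates, reference_value=0.2)
```

Here pep_C falls below the assumed reference value and is dropped, while pep_A and pep_B are returned in descending order of likelihood.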

As used herein, “neoantigen” refers to a class of tumor antigens which arise from tumor-specific changes in proteins. Neoantigens encompass, but are not limited to, tumor antigens which arise from, for example, substitution in the protein sequence, frame shift mutation, fusion polypeptide, in-frame deletion, insertion, expression of endogenous retroviral polypeptides, and tumor-specific overexpression of polypeptides.

As used herein, “allele” refers to one of several alternative forms of a gene or DNA sequence at a specific chromosomal location (locus). At each autosomal locus, an individual possesses two alleles, one inherited from the father and one from the mother.

As used herein, the terms “neo-peptide” and “neoantigenic peptide” are used interchangeably.

As used herein, “mutation” refers to a change of or difference in the nucleic acid sequence (nucleotide substitution, addition or deletion) compared to a reference. A “somatic mutation” can occur in any of the cells of the body except the germ cells (sperm and egg) and therefore is not passed on to children. These alterations can (but do not always) cause cancer or other diseases. A mutation may be a non-synonymous mutation. The term “non-synonymous mutation” refers to a mutation, such as a nucleotide substitution, which results in an amino acid change such as an amino acid substitution in the translation product. The term “mutation” includes point mutations, indels, fusions, chromothripsis, and RNA edits.

In some embodiments, the features are determined by one or more of MixMHCpred (PMID: 30429286), PRIME (PMID: 33665637), IpMSDB (PMID: 29104575), CScape (PMID: 28912487), GTEX (https://www.gtexportal.org/), RNAseq, NetMHCpan (PMID: 32406916), NetMHCstabpan (PMID: 22678897), NetChop (PMID: 15744535), HLA propensity, and differential agretopicity index (DAI).

In some embodiments, the classifier or machine learning model may include a logistic regression (LR) classifier or an XGBoost classifier. In some embodiments, the LR classifier may be used in combination with the XGBoost classifier.

In some embodiments, the classifier or machine learning model may include a voting classifier that combines respective probabilities determined by the LR classifier and the XGBoost classifier. In one example, the voting classifier may combine the probability p = predict_proba(x_k) of all base classifiers and then perform the ranking. In some embodiments, the weighted voting classifier includes a weight w for the probabilities p of classifiers c in two groups.
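The combination of base-classifier probabilities followed by ranking can be illustrated with a minimal sketch. It assumes a plain weighted average of the probabilities; the function names, the weights, and the probability values are hypothetical and are not taken from the disclosure, which does not fix the exact weighting scheme.

```python
from typing import Dict, List, Sequence


def weighted_vote(probabilities: Dict[str, Sequence[float]],
                  weights: Dict[str, float]) -> List[float]:
    """Combine per-classifier probabilities into one weighted average per candidate."""
    total_weight = sum(weights[name] for name in probabilities)
    n_candidates = len(next(iter(probabilities.values())))
    combined = []
    for k in range(n_candidates):
        weighted_sum = sum(weights[name] * probs[k]
                           for name, probs in probabilities.items())
        combined.append(weighted_sum / total_weight)
    return combined


def rank_candidates(scores: Sequence[float]) -> List[int]:
    """Return candidate indices ordered from highest to lowest combined score."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)


# Hypothetical predict_proba outputs of two base classifiers for four candidates x_k.
probabilities = {
    "lr": [0.10, 0.80, 0.40, 0.30],
    "xgb": [0.20, 0.60, 0.70, 0.10],
}
weights = {"lr": 1.0, "xgb": 1.0}  # illustrative equal weights (assumption)

combined = weighted_vote(probabilities, weights)
ranking = rank_candidates(combined)  # candidate indices, best first
```

With equal weights this reduces to the usual soft-voting average; unequal weights let one base classifier dominate the final ranking.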

XGBoost is an optimized distributed gradient boosting library. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the variety of hyperparameters that can be fine-tuned. XGBoost can be used for regression, classification (binary and multiclass), and ranking problems.

Logistic regression (LR) is a regression model for binary data from statistics where the logit of the probability that the dependent variable is equal to one is modeled as a linear function of the independent variables. In addition to logistic regression and XGBoost classifiers, other classifiers may be used to implement the disclosed methods. For example, the classifier may be selected from Support vector machines (SVM), Logistic regression (LR), Nearest neighbor (NN) classifier, Random forest (RF) classifier, AdaBoost, CART decision tree classifier, XGBoost, and CatBoost. In some embodiments, the machine learning model may be a deep neural network (DNN), a convolutional neural network (CNN), or a deep convolutional neural network (DCNN). In some embodiments, the DNN is TabNet.
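As a minimal sketch of the logistic regression definition given above, the following function evaluates the probability whose logit is a linear function of the input features. The coefficient and feature values are hypothetical and serve only to show the relationship between the logit and the probability.

```python
import math
from typing import Sequence


def logistic_probability(features: Sequence[float],
                         coefficients: Sequence[float],
                         intercept: float) -> float:
    """Return P(y = 1 | x), where logit(P) = intercept + sum(coef_i * x_i)."""
    logit = intercept + sum(c * x for c, x in zip(coefficients, features))
    # Numerically stable evaluation of the logistic (inverse logit) function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Hypothetical coefficients and feature values; the logit here is 0.1 + 0.5 - 0.5 = 0.1.
p_example = logistic_probability([1.0, 2.0], [0.5, -0.25], 0.1)
```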

As used herein, “neural network” refers to a machine learning model for classification or regression consisting of multiple layers of linear transformations followed by element-wise nonlinearities typically trained via stochastic gradient descent and back-propagation.

In some embodiments, the method may include training the classifier or machine learning model on data that have been subject to missing value imputation, data normalization, conversion of categorical features into numerical features, or a combination thereof. In some embodiments, the data normalization may include quantile normalization, power normalization, or a combination thereof. In some embodiments, the data normalization may include quantile normalization.
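The preprocessing steps named above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: missing values are filled with the median, quantile normalization is approximated by a rank-based mapping onto the interval (0, 1), and categorical labels are converted to integer codes. All function names and example values are hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def impute_median(values: Sequence[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def quantile_normalize(values: Sequence[float]) -> List[float]:
    """Map values onto (0, 1) by their average rank; tied values share a rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [(rank - 0.5) / n for rank in ranks]


def encode_categories(values: Sequence[str]) -> List[int]:
    """Convert categorical labels to integer codes in sorted label order."""
    codes = {label: code for code, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical expression values with one missing entry.
imputed = impute_median([5.0, None, 1.0, 100.0])
normalized = quantile_normalize(imputed)
clonality_codes = encode_categories(["clonal", "subclonal", "clonal"])
```

A rank-based transform makes heavy-tailed features such as expression levels comparable in scale to bounded features such as percentile ranks, which matters more for linear models than for tree ensembles.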

In some embodiments, the method may include training the classifier or machine learning model on NCI-train (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)), and testing the classifier or machine learning model on at least one of NCI-test (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)), TESLA (Wells, D. K. et al. Cell 183, 818-834.e13 (2020)), and HiTIDE datasets.

In some embodiments, the classifier or machine learning model has been trained on the set of features. In some embodiments, the set of features is ranked based on Shapley values (Shapley, L. S. In Contributions to the Theory of Games, 2, 307-317 (1953), the content of which is incorporated herein by reference). The Shapley value of a feature is its average expected marginal contribution to the prediction, where each feature is treated as a player and the average is taken over all possible combinations of features.
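The Shapley value can be illustrated with a small exact computation that averages marginal contributions over every ordering of the features. The value function below is a hypothetical toy (it is not a trained classifier), and exact enumeration is only practical for a handful of features; for real models an approximation would typically be used instead.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Average each player's marginal contribution over all orderings of players."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition: FrozenSet[str] = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {player: total / len(orderings) for player, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical model score obtained when only these features are available."""
    score = 0.0
    if "rnaseq_TPM" in coalition:
        score += 0.2
    if "mutant_rank" in coalition:
        score += 0.3
    if "rnaseq_TPM" in coalition and "mutant_rank" in coalition:
        score += 0.1  # interaction term, shared equally between the two features
    return score


features = ["mutant_rank", "rnaseq_TPM", "seq_len"]
contributions = shapley_values(features, toy_value)
feature_ranking = sorted(features, key=lambda f: contributions[f], reverse=True)
```

The contributions sum to the score of the full feature set, which is the property that makes Shapley values convenient for ranking features.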

In some embodiments, the method may include optimizing hyperparameters of the classifier or machine learning model in a cross-validation loop. In some embodiments, optimization of hyperparameters may include Hyperopt optimization (Bergstra, J., et al. Comput. Sci. Discov. 8, 014008 (2015)). For example, in the Hyperopt optimization a 5-fold cross-validation may be used and the rank score of the classifier on the validation set averaged over all 5 folds may be calculated. In some embodiments, optimization of hyperparameters may include cross-validation. In some embodiments, cross-validation may include Scikit-learn class StratifiedKFold.
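The idea of stratified 5-fold cross-validation with a score averaged over the folds can be sketched in plain Python. This is not the Scikit-learn StratifiedKFold implementation and does not involve Hyperopt; the function names are hypothetical, and the fold score used in the example is a placeholder rather than the rank score referred to above.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], n_splits: int = 5,
                     seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_members: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_members[position % n_splits].append(index)
    splits = []
    for fold in range(n_splits):
        validation = sorted(fold_members[fold])
        training = sorted(i for other in range(n_splits) if other != fold
                          for i in fold_members[other])
        splits.append((training, validation))
    return splits


def cross_validated_score(labels: Sequence[int],
                          score_fold: Callable[[List[int], List[int]], float],
                          n_splits: int = 5) -> float:
    """Average a user-supplied validation score over all folds."""
    scores = [score_fold(training, validation)
              for training, validation in stratified_folds(labels, n_splits)]
    return sum(scores) / len(scores)


# Hypothetical imbalanced data set: 10 immunogenic and 40 non-immunogenic candidates.
labels = [1] * 10 + [0] * 40
splits = stratified_folds(labels)
# Placeholder score: fraction of positives in each validation fold.
mean_score = cross_validated_score(
    labels, lambda training, validation: sum(labels[i] for i in validation) / len(validation))
```

Stratification matters here because immunogenic candidates are rare; without it, some validation folds could contain no positive examples at all.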

In some embodiments, the step of inputting the set of features may include inputting a feature matrix or a feature vector generated from the set of features.
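A feature matrix of this kind can be sketched as one row per candidate and one column per feature, in a fixed column order. The feature names below appear elsewhere in this disclosure, but the record layout, the function name, and the values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def build_feature_matrix(records: Sequence[Dict[str, float]],
                         feature_names: Sequence[str]) -> List[List[Optional[float]]]:
    """Arrange per-candidate feature values into rows with a fixed column order.

    Features absent from a record are reported as None so that a later
    imputation step can fill them in.
    """
    return [[record.get(name) for name in feature_names] for record in records]


feature_names = ["mutant_rank", "rnaseq_TPM", "seq_len"]
# Hypothetical candidates; the second one lacks an expression measurement.
records = [
    {"mutant_rank": 0.05, "rnaseq_TPM": 12.5, "seq_len": 9},
    {"mutant_rank": 1.20, "seq_len": 10},
]
feature_matrix = build_feature_matrix(records, feature_names)
feature_vector = feature_matrix[0]  # a single row is the feature vector of one candidate
```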

In some embodiments, the method may include identifying the one or more neo-peptides based on the set of features that may include one or more features selected from: mutant_rank, mutant_rank_netMHCpan, mutant_rank_PRIME, rnaseq_alt_support, Sample_Tissue_expression_GTEx, TCGA_Cancer_expression, bestWTMatchOverlap_I, and seq_len.

In some embodiments, the method may include identifying the one or more neo-peptides based on the set of features that may include one or more features selected from: mut_Rank_Stab, TAP_score, mut_netchop_score_ct, mut_binding_score, mut_aa_coeff, DAI_MixMHC, DAI_NetStab, DAI_MixMHC_mbp, mutant_other_significant_alleles, bestWTMatchType_I, CSCAPE_score, CCF, and nb_same_mutation_Intogen.

In some embodiments, the method may include identifying the one or more neo-peptides based on the set of features that may include one or more features selected from: mut_is_binding_pos, pep_mut_start, DAI_NetMHC, rnaseq_TPM, GTEx_all_tissues_expression_mean, bestWTMatchScore_I, bestMutationScore_I, bestWTPeptideCount_I, Clonality, mutation_driver_statement_Intogen, and gene_driver_Intogen.

In some embodiments, the method may include identifying the one or more mutations based on the set of features that may include one or more features selected from: nb_mutations_in_gene_Intogen, TCGA_Cancer_expression, bestMutationScore_I, rnaseq_alt_support, COUNT_MUT_RANK_CI_netMHCpan, MIN_MUT_RANK_CI_MIXMHC, MIN_MUT_RANK_CI_PRIME, mut_Rank_EL_0, mut_Rank_Stab_2, DAI_0, and mut_netchop_score.

In some embodiments, the method may include identifying the one or more mutations based on the set of features that may include one or more features selected from: rnaseq_TPM, bestWTPeptideCount_I, CCF, CSCAPE_score, Sample_Tissue_expression_GTEx, COUNT_MUT_RANK_CI_MIXMHC, WT_BEST_RANK_CI_MIXMHC, next_best_BA_mut_ranks, mut_Rank_EL_1, mut_Rank_EL_2, wt_Rank_EL_0, wt_Rank_EL_1, mut_Rank_Stab_0, mut_Rank_Stab_1, DAI_1, and mut_TAP_score_0.

In some embodiments, the method may include identifying the one or more mutations based on the set of features that may include one or more features selected from: nb_same_mutation_Intogen, mutation_driver_statement_Intogen, gene_driver_Intogen, bestWTMatchScore_I, bestWTMatchOverlap_I, Zygosity, Clonality, GTEx_all_tissues_expression_mean, COUNT_MUT_RANK_CI_PRIME, WT_BEST_RANK_CI_PRIME, wt_Rank_EL_2, and DAI_2.

In some embodiments, the method may include identifying the one or more mutations in sequencing data obtained from a sample of a subject. In some embodiments, the sequencing data may include exome, transcriptome, whole genome nucleotide sequencing data, or a combination thereof.

Peptides (e.g., neo-peptides) or neoantigens with mutations or mutated polypeptides arising from, for example, splice-site, frameshift, readthrough, or gene fusion mutations in tumor cells can be identified by sequencing DNA, RNA or protein in tumor versus normal cells.

Mutations can include previously identified tumor specific mutations. Known tumor mutations can be found in the Catalogue of Somatic Mutations in Cancer (COSMIC) database.

A variety of methods are available for detecting the presence of a particular mutation or allele in an individual’s DNA or RNA. Advancements in this field have provided accurate, easy, and inexpensive large-scale SNP genotyping. For example, several techniques have been described including dynamic allele-specific hybridization (DASH), microplate array diagonal gel electrophoresis (MADGE), pyrosequencing, oligonucleotide-specific ligation, the TaqMan system as well as various DNA “chip” technologies such as the Affymetrix SNP chips. These methods utilize amplification of a target genetic region, typically by PCR. Still other methods are based on the generation of small signal molecules by invasive cleavage followed by mass spectrometry or immobilized padlock probes and rolling-circle amplification. PCR based detection means can include multiplex amplification of a plurality of markers simultaneously. For example, it is well known in the art to select PCR primers to generate PCR products that do not overlap in size and can be analyzed simultaneously. Alternatively, it is possible to amplify different markers with primers that are differentially labeled and thus can each be differentially detected. Of course, hybridization-based detection means allow the differential detection of multiple PCR products in a sample. Other techniques are known in the art to allow multiplex analyses of a plurality of markers.

Several methods have been developed to facilitate analysis of single nucleotide polymorphisms in genomic DNA or cellular RNA. For example, a single base polymorphism can be detected by using a specialized exonuclease-resistant nucleotide, as disclosed, e.g., in Mundy, C. R. (U.S. Pat. No. 4,656,127).

Any suitable sequencing-by-synthesis platform can be used to identify mutations. As described above, four major sequencing-by-synthesis platforms are currently available: the Genome Sequencers from Roche/454 Life Sciences, the 1G Analyzer from Illumina/Solexa, the SOLiD system from Applied BioSystems, and the Heliscope system from Helicos Biosciences. Sequencing-by-synthesis platforms have also been described by Pacific BioSciences and VisiGen Biotechnologies. In some embodiments, a plurality of nucleic acid molecules being sequenced is bound to a support (e.g., solid support).

The term “genome” relates to the total amount of genetic information in the chromosomes of an organism or a cell. The term “exome” refers to part of the genome of an organism formed by exons, which are coding portions of expressed genes. The exome provides the genetic blueprint used in the synthesis of proteins and other functional gene products. It is the most functionally relevant part of the genome and, therefore, it is most likely to contribute to the phenotype of an organism. The exome of the human genome is estimated to comprise 1.5% of the total genome (Ng, PC et al, PLoS Gen., 4(8): 1-15, 2008). The term “transcriptome” relates to the set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA produced in one cell or a population of cells. In context of the present invention the transcriptome means the set of all RNA molecules produced in one cell, a population of cells, preferably a population of cancer cells, or all cells of a given individual at a certain time point.

In some embodiments, the subject has a cancer. In some embodiments, the cancer is a carcinoma, a sarcoma, a lymphoma, a melanoma, a pediatric tumor, or a leukemia.

In some embodiments, the cancer is selected from adrenal gland tumors, biliary cancer, bladder cancer, brain cancer, breast cancer, carcinoma, central or peripheral nervous system tissue cancer, cervical cancer, colon cancer, endocrine or neuroendocrine cancer or hematopoietic cancer, esophageal cancer, fibroma, gastrointestinal cancer, glioma, head and neck cancer, Li-Fraumeni tumors, liver cancer, lung cancer, lymphoma, melanoma, meningioma, multiple neuroendocrine type I and type II tumors, nasopharyngeal cancer, oral cancer, oropharyngeal cancer, osteogenic sarcoma tumors, ovarian cancer, pancreatic cancer, pancreatic islet cell cancer, parathyroid cancer, pheochromocytoma, pituitary tumors, prostate cancer, rectal cancer, renal cancer, respiratory cancer, sarcoma, skin cancer, stomach cancer, testicular cancer, thyroid cancer, tracheal cancer, urogenital cancer, and uterine cancer.

In some embodiments, a reference value of the likelihood of inducing the tumor-specific immune response may be determined from a reference sample, reference cell, reference tissue, control sample, control cell, or control tissue that does not contain mutations or neo-peptides as identified by the disclosed methods. In some embodiments, the sample may be a tumor sample (e.g., tumor tissue).

The terms “sample” or “biological sample,” as used herein, include any biological specimen obtained (isolated, removed) from a subject. Samples may include, without limitation, organ tissue (e.g., primary or metastatic tumor tissue), whole blood, plasma, serum, whole blood cells, red blood cells, white blood cells (e.g., peripheral blood mononuclear cells), saliva, urine, stool (feces), tears, sweat, sebum, nipple aspirate, ductal lavage, tumor exudates, synovial fluid, cerebrospinal fluid, lymph, fine needle aspirate, amniotic fluid, any other bodily fluid, exudate or secretory fluid, cell lysates, cellular secretion products, inflammation fluid, semen, and vaginal secretions. In some embodiments, a sample may be readily obtainable by non-invasive or minimally invasive methods, such as blood collection (“liquid biopsy”), urine collection, feces collection, tissue (e.g., tumor tissue) biopsy or fine-needle aspiration, allowing the provision/removal/isolation of the sample from a subject. The term “tissue,” as used herein, encompasses all types of cells of the body, including cells of organs but also including blood and other body fluids recited above. The tissue may be healthy or affected by pathological alterations, e.g., tumor tissue. The tissue may be from a living subject or may be cadaveric tissue. In some embodiments, useful samples are those known to comprise, expected or predicted to comprise, known to potentially comprise, or expected or predicted to potentially comprise tumor cells.

Neoantigens and Neo-peptides

Neoantigens can include nucleotides or polypeptides. For example, a neoantigen can be an RNA sequence that encodes for a polypeptide sequence. Neoantigens useful in vaccines can therefore include nucleotide sequences or polypeptide sequences.

Disclosed herein are isolated peptides that comprise tumor specific mutations identified by the methods disclosed herein, peptides that comprise known tumor specific mutations, and mutant polypeptides or fragments thereof identified by methods disclosed herein. Neoantigen peptides can be described in the context of their coding sequence where a neoantigen includes the nucleotide sequence (e.g., DNA or RNA) that codes for the related polypeptide sequence.

In some embodiments, the neoantigen or neo-peptide may include an amino acid sequence of SEQ ID NOs: 1-395 or include an amino acid sequence having at least 80% (e.g., 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%) identity to an amino acid sequence of SEQ ID NO: 1-395.

One or more neoantigens can be presented on the surface of a tumor. One or more neoantigens can be immunogenic in a subject having a tumor, e.g., capable of eliciting a T cell response or a B cell response in the subject.

The size of at least one neoantigenic peptide molecule can comprise, but is not limited to, about 5, about 6, about 7, about 8, about 9, about 10, about 11, about 12, about 13, about 14, about 15, about 16, about 17, about 18, about 19, about 20, about 21, about 22, about 23, about 24, about 25, about 26, about 27, about 28, about 29, about 30, about 31, about 32, about 33, about 34, about 35, about 36, about 37, about 38, about 39, about 40, about 41, about 42, about 43, about 44, about 45, about 46, about 47, about 48, about 49, about 50, about 60, about 70, about 80, about 90, about 100, about 110, about 120 or greater amino molecule residues, and any range derivable therein. In specific embodiments the neoantigenic peptide molecules are equal to or less than 50 amino acids. Neoantigenic peptides and polypeptides can be: for MHC Class I, 15 residues or less in length and usually consist of between about 8 and about 11 residues, particularly 9 or 10 residues; for MHC Class II, 15-24 residues.
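The typical MHC Class I length range of about 8 to about 11 residues suggests one way of enumerating candidate peptides around a mutated residue. The following is a minimal sketch under that assumption; the function name, the 0-based position convention, and the example sequence are hypothetical and do not describe the disclosed pipeline.

```python
from typing import Iterable, List, Tuple


def peptides_covering(sequence: str, position: int,
                      lengths: Iterable[int] = range(8, 12)) -> List[Tuple[int, str]]:
    """List (start, peptide) pairs for every window that contains `position`.

    `position` is the 0-based index of the mutated residue in `sequence`;
    the default lengths are 8 to 11 residues inclusive.
    """
    if not 0 <= position < len(sequence):
        raise ValueError("position must index a residue of the sequence")
    peptides = []
    for length in lengths:
        first_start = max(0, position - length + 1)
        last_start = min(position, len(sequence) - length)
        for start in range(first_start, last_start + 1):
            peptides.append((start, sequence[start:start + length]))
    return peptides


# Hypothetical 21-residue mutant protein fragment with the mutated residue at index 10.
fragment = "ACDEFGHIKLMNPQRSTVWYA"
candidates = peptides_covering(fragment, 10)
```

Each candidate keeps its start offset, so the position of the mutation within the peptide can be derived as `10 - start` for later use as a feature.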

Neoantigenic peptides and polypeptides can be presented on an HLA protein. In some aspects neoantigenic peptides and polypeptides are presented on an HLA protein with greater affinity than a wild-type peptide. In some embodiments, a neoantigenic peptide or polypeptide can have an IC50 of at least less than 5000 nM, at least less than 1000 nM, at least less than 500 nM, at least less than 250 nM, at least less than 200 nM, at least less than 150 nM, at least less than 100 nM, at least less than 50 nM or less.

In some embodiments, neoantigenic peptides and polypeptides do not induce an autoimmune response and/or invoke immunological tolerance when administered to a subject.

Also provided herein are compositions comprising one or more neoantigenic peptides. In some embodiments, the composition may include at least two distinct peptides. At least two distinct peptides can be derived from the same polypeptide. By distinct peptides is meant that the peptides vary by length, amino acid sequence, or both. The peptides are derived from any polypeptide known or found to contain a tumor specific mutation.

Neoantigenic peptides and polypeptides having a desired activity or property can be modified to provide certain desired attributes, e.g., improved pharmacological characteristics, while increasing or at least retaining substantially all of the biological activity of the unmodified peptide to bind the desired MHC molecule and activate the appropriate T cell. For instance, neoantigenic peptide and polypeptides can be subject to various changes, such as substitutions, either conservative or non-conservative, where such changes might provide for certain advantages in their use, such as improved MHC binding, stability or presentation. By conservative substitutions is meant replacing an amino acid residue with another which is biologically and/or chemically similar, e.g., one hydrophobic residue for another, or one polar residue for another. The substitutions include combinations such as Gly, Ala; Val, Ile, Leu, Met; Asp, Glu; Asn, Gln; Ser, Thr; Lys, Arg; and Phe, Tyr. The effect of single amino acid substitutions may also be probed using D-amino acids. Such modifications can be made using well known peptide synthesis procedures, as described in e.g., Merrifield, Science 232:341-347 (1986), Barany & Merrifield, The Peptides, Gross & Meienhofer, eds. (N.Y., Academic Press), pp. 1-284 (1979); and Stewart & Young, Solid Phase Peptide Synthesis, (Rockford, Ill., Pierce), 2d Ed. (1984).

In some embodiments, a neoantigen may include a nucleic acid (e.g., polynucleotide) that encodes a neoantigenic peptide or portion thereof. The polynucleotide can be, e.g., DNA, cDNA, PNA, CNA, RNA (e.g., mRNA), either single- and/or double-stranded, or native or stabilized forms of polynucleotides, such as, e.g., polynucleotides with a phosphorothioate backbone, or combinations thereof, and it may or may not contain introns. A still further aspect provides an expression vector capable of expressing a polypeptide or portion thereof. Expression vectors for different cell types are well known in the art and can be selected without undue experimentation. Generally, DNA is inserted into an expression vector, such as a plasmid, in proper orientation and correct reading frame for expression. If necessary, DNA can be linked to the appropriate transcriptional and translational regulatory control nucleotide sequences recognized by the desired host, although such controls are generally available in the expression vector. The vector is then introduced into the host through standard techniques. Guidance can be found e.g. in Sambrook et al. (1989) Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.

In another aspect, this disclosure provides a nucleic acid molecule comprising a polynucleotide sequence of a neoantigen. In some embodiments, the nucleic acid sequence may include at least one of the one or more mutations or may encode at least one of the one or more neo-peptides identified according to the method described herein.

Immunogenic Compositions

Also disclosed herein is an immunogenic composition, e.g., a vaccine composition, capable of raising a specific immune response, e.g., a tumor-specific immune response. Vaccine compositions typically comprise a plurality of neoantigens, e.g., selected using a method described herein. Vaccine compositions can also be referred to as vaccines.

In some embodiments, the immunogenic composition may include a neoantigen or neo-peptide that includes an amino acid sequence of SEQ ID NOs: 1-395 or includes an amino acid sequence having at least 80% (e.g., 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%) identity to an amino acid sequence of SEQ ID NO: 1-395.

A vaccine can contain between 1 and 30 peptides, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 different peptides, 6, 7, 8, 9, 10, 11, 12, 13, or 14 different peptides, or 12, 13 or 14 different peptides. Peptides can include post-translational modifications. A vaccine can contain between 1 and 100 or more nucleotide sequences, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 or more different nucleotide sequences, 6, 7, 8, 9, 10, 11, 12, 13, or 14 different nucleotide sequences, or 12, 13 or 14 different nucleotide sequences. A vaccine can contain between 1 and 30 neoantigen sequences, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 or more different neoantigen sequences, 6, 7, 8, 9, 10, 11, 12, 13, or 14 different neoantigen sequences, or 12, 13 or 14 different neoantigen sequences.

In one embodiment, different peptides and/or polypeptides or nucleotide sequences encoding them are selected so that the peptides and/or polypeptides are capable of associating with different MHC molecules, such as different MHC class I molecules. In some aspects, one vaccine composition comprises coding sequences for peptides and/or polypeptides capable of associating with the most frequently occurring MHC class I molecules. Hence, vaccine compositions can comprise different fragments capable of associating with at least 2 preferred, at least 3 preferred, or at least 4 preferred MHC class I molecules.

The vaccine composition can be capable of raising a specific cytotoxic T-cell response and/or a specific helper T-cell response.

A vaccine composition can further comprise an adjuvant and/or a carrier. Examples of useful adjuvants and carriers are given herein below. A composition can be associated with a carrier, such as a protein or an antigen-presenting cell such as a dendritic cell (DC), capable of presenting the peptide to a T-cell. Adjuvants are any substance whose admixture into a vaccine composition increases or otherwise modifies the immune response to a neoantigen. Carriers can be scaffold structures, for example, a polypeptide or a polysaccharide, to which a neoantigen is capable of being associated. Optionally, adjuvants are conjugated covalently or non-covalently. A carrier (or excipient) can be present independently of an adjuvant. The function of a carrier can for example be to increase the molecular weight of a particular mutant to increase activity or immunogenicity, to confer stability, to increase the biological activity, or to increase serum half-life. Furthermore, a carrier can aid in presenting peptides to T-cells. A carrier can be any suitable carrier known to the person skilled in the art, for example a protein or an antigen presenting cell. A carrier protein could be but is not limited to keyhole limpet hemocyanin, serum proteins such as transferrin, bovine serum albumin, human serum albumin, thyroglobulin or ovalbumin, immunoglobulins, or hormones, such as insulin or palmitic acid. For immunization of humans, the carrier is generally a physiologically acceptable carrier acceptable to humans and safe. However, tetanus toxoid and/or diphtheria toxoid are suitable carriers. Alternatively, the carrier can be dextrans, for example sepharose.

Methods of Treatment and Methods of Inducing Immune Responses

In another aspect, this disclosure further provides a method of treating cancer in a subject. In some embodiments, the method may include: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method described herein; and administering to the subject a therapeutically effective amount of a therapeutic agent comprising an immune cell therapy. In some embodiments, the immune cell therapy may include a T cell that binds specifically to at least one of the one or more mutations or neo-peptides.

In some embodiments, the immune cell therapy may include a tumor infiltrating lymphocyte. In some embodiments, the cancer therapy may include an adoptive cell therapy. In some embodiments, the adoptive cell therapy may include a T-cell receptor (TCR) T cell therapy or a chimeric antigen receptor (CAR) T cell therapy.

In some embodiments, the disclosed methods include administration of an adoptive cell therapy. As used herein, the term “adoptive cell therapy,” “ACT,” or “adoptive immunotherapy” are used interchangeably and refer to the administration of a modified immune cell to a subject with cancer. An “immune cell” (also interchangeably referred to herein as an “immune effector cell”) refers to a cell that is part of a subject’s immune system and helps to fight cancer in the body of a subject. Non-limiting examples of immune cells for use in the disclosed methods include T cells, tumor-infiltrating lymphocytes, and natural killer (NK) T cells. The immune cells may be autologous or heterologous to the subject undergoing therapy.

As used herein, a “T cell receptor” (TCR) refers to an isolated TCR polypeptide that binds specifically to a tumor-associated antigen (TAA), or a TCR expressed on an isolated immune cell (e.g., a T cell). TCRs bind to epitopes on small antigenic determinants (for example, comprised in a tumor associated antigen) on the surface of antigen-presenting cells that are associated with a major histocompatibility complex (MHC; in mice) or human leukocyte antigen (HLA; in humans) complex. TCR also refers to an immunoglobulin superfamily member having a variable binding domain, a constant domain, a transmembrane region, and a short cytoplasmic tail (see, e.g., Janeway et al., Immunobiology: The Immune System in Health and Disease, 3rd Ed., Current Biology Publications, 1997) capable of specifically binding to an antigen peptide bound to a MHC receptor.

As used herein, a “chimeric antigen receptor” or “CAR” refers to an antigen-binding protein that includes an immunoglobulin antigen-binding domain (e.g., an immunoglobulin variable domain) and a TCR constant domain or a portion thereof, which can be administered to a subject as chimeric antigen receptor T-cell (CAR-T) therapy. As used herein, a “constant domain” of a TCR polypeptide includes a membrane-proximal TCR constant domain, and may also include a TCR transmembrane domain and/or a TCR cytoplasmic tail. For example, in some embodiments, the CAR is a dimer that includes a first polypeptide comprising an immunoglobulin heavy chain variable domain linked to a TCRβ constant domain and a second polypeptide comprising an immunoglobulin light chain variable domain (e.g., a κ or λ variable domain) linked to a TCRα constant domain. In some embodiments, the CAR is a dimer that includes a first polypeptide comprising an immunoglobulin heavy chain variable domain linked to a TCRα constant domain and a second polypeptide comprising an immunoglobulin light chain variable domain (e.g., a κ or λ variable domain) linked to a TCRβ constant domain.

As used herein, the term “treatment” refers to a clinical intervention designed to alter the natural course of the individual or cell being treated during the course of clinical pathology. Desirable effects of treatment include decreasing the rate of disease progression, ameliorating or palliating the disease state, and remission or improved prognosis. For example, an individual is successfully “treated” if one or more symptoms associated with a cancer are mitigated or eliminated, including, but not limited to, reducing the proliferation of (or destroying) cancerous cells, reducing pathogen infection, decreasing symptoms resulting from the disease, increasing the quality of life of those suffering from the disease, decreasing the dose of other medications required to treat the disease, and/or prolonging survival of individuals. The phrases “treatment with a therapy,” “treating with a therapy,” “treatment with an agent,” “treating with an agent,” and the like refer to the administration of an effective amount of a therapy or agent, including a cancer therapy and optionally an agent (e.g., a cytotoxic agent or an immunotherapeutic agent), to a patient, or the concurrent administration of two or more therapies or agents, including cancer therapies or agents (e.g., two or more agents selected from cytotoxic agents and immunotherapeutic agents), in effective amounts to a patient. Also provided is a method of inducing a tumor specific immune response in a subject, vaccinating against a tumor, treating and/or alleviating a symptom of cancer in a subject by administering to the subject one or more neoantigens such as a plurality of neoantigens identified using methods disclosed herein.

In some aspects, a subject has been diagnosed with cancer or is at risk of developing cancer. A subject can be a human, dog, cat, horse or any animal in which a tumor specific immune response is desired.

In some embodiments, the cancer is a carcinoma, a sarcoma, a lymphoma, a melanoma, a pediatric tumor, or a leukemia.

As used herein, “cancer,” “tumor,” and “malignancy” all relate equivalently to hyperplasia of a tissue or organ. If the tissue is a part of the lymphatic or immune system, malignant cells may include non-solid tumors of circulating cells. Malignancies of other tissues or organs may produce solid tumors. The methods described herein can be used in the treatment of lymphatic cells, circulating immune cells, and solid tumors.

As used herein, the term “cancer” refers to a malignant neoplasm characterized by deregulated or unregulated cell growth. The term “cancer” includes primary malignant cells or tumors (e.g., those whose cells have not migrated to sites in the subject’s body other than the site of the original malignancy or tumor) and secondary malignant cells or tumors (e.g., those arising from metastasis, the migration of malignant cells or tumor cells to secondary sites that are different from the site of the original tumor). The term “metastatic” or “metastasis” generally refers to the spread of a cancer from one organ or tissue to another non-adjacent organ or tissue. The occurrence of the neoplastic disease in the other non-adjacent organ or tissue is referred to as metastasis.

Examples of cancer include but are not limited to carcinoma, lymphoma, blastoma, sarcoma, and leukemia or lymphoid malignancies. More particular examples of such cancers include without limitation: squamous cell cancer (e.g., epithelial squamous cell cancer), lung cancer including small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the lung, squamous carcinoma of the lung and large cell carcinoma of the lung, cancer of the peritoneum, hepatocellular cancer, gastric or stomach cancer including gastrointestinal cancer, pancreatic cancer, glioma, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, hepatoma, breast cancer, colon cancer, rectal cancer, colorectal cancer, endometrial cancer or uterine carcinoma, salivary gland carcinoma, kidney or renal cancer, prostate cancer, vulvar cancer, thyroid cancer, hepatic carcinoma, anal carcinoma, penile carcinoma, as well as CNS cancer, melanoma, head and neck cancer, bone cancer, bone marrow cancer, duodenum cancer, esophageal cancer, thyroid cancer, or hematological cancer.

In some embodiments, a neoantigen, a neo-peptide, or a composition thereof can be administered in an amount sufficient to induce a cytotoxic T lymphocyte (CTL) response. In some embodiments, a neoantigen, a neo-peptide, or a composition thereof can be administered alone or in combination with other therapeutic agents. In some embodiments, the therapeutic agent may include, for example, a chemotherapeutic agent, radiation, or immunotherapy. Any suitable therapeutic treatment for a particular cancer can be administered. In addition, a subject can be further administered an anti-immunosuppressive/immunostimulatory agent such as a checkpoint inhibitor. For example, the subject can be further administered an anti-CTLA antibody or anti-PD-1 or anti-PD-L1.

In another aspect, this disclosure additionally provides a method of preparing a tumor vaccine. In some embodiments, the method may include: identifying one or more mutations associated with a neoantigen or allele thereof or one or more neo-peptides that are capable of inducing a tumor-specific immune response according to the method described herein; and generating a nucleic acid molecule having a polynucleotide sequence encoding the neoantigen or allele thereof or at least one of the one or more neo-peptides.

In another aspect, this disclosure also provides a method of selecting an antigen-specific T cell. In some embodiments, the method may include: identifying one or more mutations or neo-peptides capable of inducing a tumor-specific immune response according to the method described herein; and selecting a T cell having a receptor that is antigen specific to at least one of the one or more mutations or neo-peptides.

In another aspect, this disclosure additionally provides a T cell therapy, comprising a T cell selected according to the method described herein. In some embodiments, the T cell therapy is adoptive cell therapy.

Additional Definitions

To aid in understanding the detailed description of the compositions and methods according to the disclosure, a few express definitions are provided to facilitate an unambiguous disclosure of the various aspects of the disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

In many embodiments, the terms “subject” and “patient” are used interchangeably irrespective of whether the subject has or is currently undergoing any form of treatment. As used herein, the terms “subject” and “subjects” may refer to any vertebrate, including, but not limited to, a mammal (e.g., cow, pig, camel, llama, horse, goat, rabbit, sheep, hamster, guinea pig, cat, dog, rat, mouse, a non-human primate (for example, a monkey such as a cynomolgus monkey, a chimpanzee, etc.), and a human). The subject may be a human or a non-human. In more exemplary aspects, the mammal is a human. As used herein, the expression “a subject in need thereof” or “a patient in need thereof” means a human or non-human mammal that exhibits one or more symptoms or indications of disorders, and/or who has been diagnosed with cancer. In some embodiments, the subject is a mammal. In some embodiments, the subject is human.

The term “machine learning,” as used herein, refers to a computer algorithm used to extract useful information from a database by building probabilistic models in an automated way.

The term “regression tree,” as used herein, refers to a decision tree that predicts values of continuous variables.

The term “supervised learning,” as used herein, refers to a data analysis using a well-defined (known) dependent variable. All regression and classification algorithms are supervised. In contrast, “unsupervised learning” refers to the collection of algorithms where groupings of the data are defined without the use of a dependent variable.

The term “immune response,” as used herein, refers to any type of immune response, including, but not limited to, innate immune responses (e.g., activation of Toll receptor signaling cascade), cell-mediated immune responses (e.g., responses mediated by T cells (e.g., antigen-specific T cells) and non-specific cells of the immune system) and humoral immune responses (e.g., responses mediated by B cells (e.g., via generation and secretion of antibodies into the plasma, lymph, and/or tissue fluids)). The term “immune response” is meant to encompass all aspects of the capability of a subject’s immune system to respond to antigens and/or immunogens (e.g., both the initial response to an immunogen (e.g., a pathogen) as well as acquired (e.g., memory) responses that are a result of an adaptive immune response).

As used herein, descriptions of “first,” “second,” etc. (“third” . . . and the like) indicate that entities are different from each other.

As used herein, the term “in vitro” refers to events that occur in an artificial environment, e.g., in a test tube or reaction vessel, in cell culture, etc., rather than within a multi-cellular organism.

As used herein, the term “in vivo” refers to events that occur within a multi-cellular organism, such as a non-human animal.

It is noted here that, as used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural reference unless the context clearly dictates otherwise.

The terms “including,” “comprising,” “containing,” or “having” and variations thereof are meant to encompass the items listed thereafter and equivalents thereof as well as additional subject matter unless otherwise noted.

The phrases “in one embodiment,” “in various embodiments,” “in some embodiments,” and the like are used repeatedly. Such phrases do not necessarily refer to the same embodiment, but they may unless the context dictates otherwise.

The term “and/or” means any one of the items, any combination of the items, or all of the items with which this term is associated.

The word “substantially” does not exclude “completely,” e.g., a composition which is “substantially free” from Y may be completely free from Y. Where necessary, the word “substantially” may be omitted from the definition of this disclosure.

As used herein, the term “approximately” or “about,” as applied to one or more values of interest, refers to a value that is similar to a stated reference value. In some embodiments, the term “approximately” or “about” refers to a range of values that fall within 25%, 20%, 19%, 18%, 17%, 16%, 15%, 14%, 13%, 12%, 11%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, or less in either direction (greater than or less than) of the stated reference value unless otherwise stated or otherwise evident from the context (except where such number would exceed 100% of a possible value). Unless indicated otherwise herein, the term “about” is intended to include values, e.g., weight percents, proximate to the recited range that are equivalent in terms of the functionality of the individual ingredient, the composition, or the embodiment.

It is to be understood that wherever values and ranges are provided herein, all values and ranges encompassed by these values and ranges are meant to be encompassed within the scope of the present disclosure. Moreover, all values that fall within these ranges, as well as the upper or lower limits of a range of values, are also contemplated by the present application.

As used herein, the term “each,” when used in reference to a collection of items, is intended to identify an individual item in the collection but does not necessarily refer to every item in the collection. Exceptions can occur if explicit disclosure or context clearly dictates otherwise.

The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of this disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of this disclosure.

All methods described herein are performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. In regard to any of the methods provided, the steps of the method may occur simultaneously or sequentially. When the steps of the method occur sequentially, the steps may occur in any order, unless noted otherwise.

In cases in which a method may include a combination of steps, each and every combination or sub-combination of the steps is encompassed within the scope of the disclosure, unless otherwise noted herein.

Each publication, patent application, patent, and other reference cited herein is incorporated by reference in its entirety to the extent that it is not inconsistent with the present disclosure. Publications disclosed herein are provided solely for their disclosure prior to the filing date of the present disclosure. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates, which may need to be independently confirmed.

It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims.

Examples

EXAMPLE 1

This example describes methods for Example 2.

Datasets

This example includes three cohort datasets consisting of whole-exome sequencing (WES) and bulk RNA sequencing (RNAseq) data from healthy and matched cancerous tissues/cells, together with information about the immunogenicity of somatic mutations. Because some elements are associated with the mutations and some with predicted neoantigens, the following naming conventions are used throughout the text: mut-seq refers to 25mer sequences with a mutation in the center; neo-pep refers to 8-12 (if not otherwise indicated) amino acid (AA) long subsequences of the 25mers that contain the mutation; and peptide is used if a statement applies to both mut-seqs and neo-peps. The different subsets of the data are named according to the scheme DATASET_PEPTIDETYPE_SUBSET. The DATASET is equal to either NCI, NCI-train, NCI-test, TESLA, HiTIDE, or empty if all datasets are addressed (information about the datasets is indicated below). DATASET is set to P if it refers to only one patient P. The PEPTIDETYPE is equal to either mut-seq (25mer sequences with a mutation in the center), wt-seq (wild type (WT) version of mut-seq), neo-pep (8-12 AA long subsequence of mut-seq containing the mutation), or wt-pep (WT version of neo-pep). The SUBSET is equal to either imm (immunogenic), non-imm (non-immunogenic, can be either screened experimentally or not), tested (experimentally screened), not-tested (not screened), or empty for the entire dataset. For example, TESLA_neo-pep_imm denotes all immunogenic neo-peps of the TESLA dataset.

NCI cohort

The largest dataset is a compilation of published datasets from the Rosenberg lab at the Surgery Branch of the National Cancer Institute (Tran, E. et al. Science 350, 1387-1390 (2015); Gartner, J. J. et al. Nat. Cancer 1-12 (2021); Parkhurst, M. R. et al. Cancer Discov. 9, 1022-1035 (2019)). It was downloaded from the dbGaP repository (https://dbgap.ncbi.nlm.nih.gov) under accession number phs001003.v1.p1. The NCI dataset contains mainly skin cutaneous melanoma, colon and rectum adenocarcinoma, lung adenocarcinoma, and breast invasive carcinoma. Immunogenicity assay information was obtained from Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)). At the time of download (December 2021), 112 patients, a cohort defined here as NCI_mut-seq, had matched WES and RNAseq data files as well as results from immunogenicity screens of somatic genomic mutations (non-synonymous single nucleotide variants (SNVs), InDels, and frameshifts). Filters based on RNAseq data were generally applied to prioritize mutations prior to the immunogenicity screening. In these screens, minigenes encoding the mutations and 12 flanking WT AAs on each side were transcribed in vitro and transfected into autologous antigen presenting cells (APCs), followed by co-culture with TIL cultures and IFN-γ ELISpot immunogenicity measurement. NCI_mut-seq_tested consists of the set of immunogenic (NCI_mut-seq_imm) and non-immunogenic (NCI_mut-seq_non-imm) mut-seqs.

For 80 of the 112 patients, a cohort defined as NCI_neo-pep, further immunogenicity screens were performed to identify the optimal neo-antigenic epitopes and their HLA restrictions. For the mut-seqs that tested positive in the minigene immunogenicity assay, the top-ranked neo-peps predicted by NetMHCpan were submitted to immunogenicity assays. Autologous APCs or APCs engineered to express the patient’s HLA-I alleles were pulsed with the selected neo-peps prior to co-culture with TILs and IFN-γ ELISpot readout. NCI_neo-pep_imm contains all neo-peps with a positive ELISpot readout. The same neo-pep annotations as provided by Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)) were used: all neo-peps of mut-seqs screened by minigenes that were not found to be immunogenic or were not tested are considered not immunogenic (NCI_neo-pep_non-imm). All neo-peps derived from mut-seqs that were not screened by minigenes are annotated as not tested (NCI_neo-pep_not-tested). The NCI_mut-seq and NCI_neo-pep cohorts were divided into a training set (89 patients for NCI_mut-seq, 57 patients for NCI_neo-pep) and a test set (23 patients each). The lower number of patients compared to Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)) (70 patients versus 57 for training and 26 patients versus 23 for testing) is due to missing RNAseq data on dbGaP. The pipeline used by Gartner et al. to process the WES and RNAseq data can be found in the supplementary material of Parkhurst et al. (Parkhurst, M. R. et al. Cancer Discov. 9, 1022-1035 (2019)).

TESLA

The TESLA consortium (Wells, D. K. et al. Cell 183, 818-834.e13 (2020)) shared tumor and normal WES and tumor RNAseq data of nine patients with 25 different scientific groups working in the field. The participants used their proprietary software pipelines to call the somatic mut-seqs (non-synonymous SNVs or short InDels) and rank the epitopes according to their immunogenicity potential. The TESLA consortium collected these ranked lists and compiled a list of highly or reproducibly ranked neo-peps for immunogenicity screening, which was performed by incubating HLA-I/peptide multimers with subject-matched TILs or peripheral blood mononuclear cells (PBMCs), by IFN-γ ELISpot assays, or by intracellular cytokine staining (Wells, D. K. et al. Cell 183, 818-834.e13 (2020)). For the first batch consisting of six patients (three melanoma and three NSCLC), 608 neo-peps (8-14mers) were tested, and 37 of them were found to be immunogenic. For the second batch of another three melanoma patients, a compilation of 310 neo-peps (9-11mers) was tested, resulting in four immunogenic epitopes. In total, datasets of eight patients (five with skin cutaneous melanoma, and three with NSCLC) were downloaded and processed. Annotations for the mut-seqs were inferred from the annotations of the neo-peps: a mut-seq is called immunogenic when at least one of its neo-peps was reported as immunogenic, non-immunogenic when at least one of its neo-peps was tested but none was found to be immunogenic, and not tested otherwise. The data were downloaded from the Synapse repository (https://www.synapse.org/) under accession number Synapse: syn21048999.
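By way of non-limiting illustration, the rule for inferring a mut-seq annotation from the annotations of its neo-peps may be sketched as follows. The function name and the string labels ("imm", "non-imm", "not-tested") are hypothetical and merely mirror the SUBSET naming used herein; they are not part of the disclosed software.

```python
from typing import Iterable


def annotate_mut_seq(neo_pep_labels: Iterable[str]) -> str:
    """Infer a mut-seq label from the labels of its neo-peps.

    Assumption: each neo-pep label is "imm", "non-imm" (tested, negative)
    or "not-tested".
    """
    labels = list(neo_pep_labels)
    if "imm" in labels:
        # At least one neo-pep was reported as immunogenic.
        return "imm"
    if "non-imm" in labels:
        # At least one neo-pep was tested, but none was immunogenic.
        return "non-imm"
    return "not-tested"
```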

HiTIDE

An in-house dataset, called HiTIDE, consists of WES and RNAseq data and immunogenicity screening results for 11 patients with metastatic melanoma, lung, kidney, and stomach cancers. Patients were enrolled under protocols approved by the institutional regulatory committee at Lausanne University Hospital (Ethics Committee, University Hospital of Lausanne-CHUV). All patients provided informed consent. Variant calling and RNAseq analysis were performed, and neo-peps were ranked based on the MixMHC and PRIME ranks, RNAseq gene expression and coverage, and ipMSDB scores to select for each patient a set of neo-peps that were then tested for immunogenicity by NeoScreen (Arnaud, M. et al. Nat. Biotechnol. 40, 656-660 (2022)). In the NeoScreen protocol, neo-peps (8-14mers) and engineered B cells were added to the digested tissue for TIL expansion. This promoted a more efficient expansion of neoantigen-specific TILs ex vivo and provided a more sensitive IFN-γ ELISpot detection. As for the TESLA dataset, the mut-seq immunogenicity annotation was inferred from the annotations of the neo-peps contained in the mut-seq.

Feature determination

High-confidence somatic variants affecting protein-coding genes were used to generate tumor-specific mut-seqs (25mers) and class-I neo-peps (8-12mers). The neo-peps were then processed by a set of available tools that predict HLA binding and immunogenicity-related features: MixMHCpred v2.1 (Bassani-Sternberg, M. et al. PLOS Comput. Biol. 13, e1005725 (2017)) and netMHCpan v4.1 (Reynisson, B., et al. Nucleic Acids Res. 48, W449-W454 (2020)) for HLA class I binding affinity prediction, PRIME v1.0.1 (Schmidt, J. et al. Cell Rep. Med. 2, 100194 (2021)) for antigen presentation and T-cell receptor (TCR) recognition, netMHCstabpan v4.1 (Harndahl, M. et al. Eur. J. Immunol. 42, 1405-1416 (2012)) for HLA class I binding stability, netchop v3.1 (Nielsen, M., et al. Immunogenetics 57, 33-41 (2005)) for C-terminal proteasomal cleavage, and netCTLpan v1.1 for recognition by the TAP transporter complex. For binding affinity and stability, rank scores were used, and differential agretopicity indexes (DAI) were calculated as log(neo-pep rank) - log(wt-peptide rank). HLA-binding anchor positions of the peptides were calculated based on MixMHCpred sequence motifs. Patient-specific HLA haplotypes were used as input when required. The oncogenic status of SNVs was predicted by CScape (Rogers, M. F., et al. Sci. Rep. 7, 11597 (2017)), which is an ML tool trained on data from the COSMIC database (http://cancer.sanger.ac.uk/cosmic/help/gene/analysis) that predicts the oncogenicity of a mutation based on sequence conservation at the mutation site, as well as genomic, proteomic and structural features. SNV cancer driver status annotations were obtained from the Integrative Onco Genomics (IntOGen) database (Martinez-Jimenez, F. et al. Nat. Rev. Cancer 20, 555-572 (2020)). The GTEx v8 (https://www.gtexportal.org/) and TCGA (https://www.cancer.gov/tcga) databases, which contain tissue-specific gene expression data, were used for the annotation of mutated gene expression.
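As a non-limiting sketch, the differential agretopicity index defined above as log(neo-pep rank) - log(wt-peptide rank) may be computed as shown below. The function name is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated.

```python
import math


def differential_agretopicity_index(neo_pep_rank: float, wt_pep_rank: float) -> float:
    """DAI = log(neo-pep rank) - log(wt-peptide rank).

    Assumption: natural logarithm; ranks are strictly positive percentile ranks.
    A negative value means the mutated peptide has the better (lower) rank.
    """
    return math.log(neo_pep_rank) - math.log(wt_pep_rank)


# Illustrative values only: the neo-pep ranks at 0.5, the wild type at 5.0.
example_dai = differential_agretopicity_index(0.5, 5.0)
```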

Sample-specific RNAseq data were additionally used to obtain mutated gene expression and mutation read coverage. The presentation level of a mut-seq or neo-pep was inferred based on ipMSDB information available for the corresponding wt-seq or wt-peptide. wt-peptide matches to ipMSDB were classified into the following subgroups: EXACT if the wt-peptide exactly matches an ipMSDB peptide, INCLUDED if the wt-peptide is fully included in a longer ipMSDB peptide, PARTIAL (PARTIAL MUT) if the match is partial (including the position of the mutation), and COVER if the wt-peptide fully covers a shorter ipMSDB peptide. To infer the presentation level of the source protein, the ‘ipMSDB Peptide Count’ score was assigned, which represents the number of unique peptides for a given protein found in ipMSDB, whereas the ‘ipMSDB Peptide Score’ counts the AAs of the unique peptides in ipMSDB that overlap with a query wt-peptide. The ‘ipMSDB Mutation Score’ counts the AAs of the unique peptides in ipMSDB that overlap with a query wt-peptide specifically at the mutation position, and the ‘ipMSDB Peptide Overlap’ calculates the fraction of the query peptide covered by peptides in ipMSDB (0: no overlap, 1: full coverage).

Preprocessing

Because only very few indel mutations were found to be immunogenic, only SNV mutations were considered in this example. Immunogenicity annotation of mut-seqs and neo-peps was performed by parsing the immunogenicity tables for the three datasets. In these preprocessing steps, missing value imputation, data normalization, and conversion of categorical features into numerical ones were performed, as indicated below. To prevent information from the test data from leaking into the training process, all preprocessing methods were always fitted only on the training data, and these fitted methods were then applied to the test data.

Missing value imputation

Missing values (MV) were treated differently for numerical and categorical features. For numerical features, missing values were set either to the maximum or minimum value of that feature in the training data. For example, if the feature represents the TPM values from RNAseq, a missing value means that no reads were detected for this gene, and the missing value was replaced by 0 (minimum). On the other hand, if the feature represents the rank of MixMHC binding prediction, a missing value means that none of the patient’s alleles was predicted to bind to the neo-pep by the respective binding affinity tool. Therefore, the missing value was replaced by the maximum binding rank (100). The rules for whether to choose minimum or maximum values were set manually for each feature. For categorical features, no imputation was performed, but the peptide was assigned to the category for MVs.
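A minimal sketch of the numerical imputation rule described above is given below for illustration. The feature names and the rule table are hypothetical; only the principle (fill with the training-set minimum or maximum, chosen manually per feature) is taken from the description.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

# Hypothetical rule table: which extreme replaces a missing value per feature.
IMPUTE_RULES = {"rnaseq_tpm": "min", "mixmhc_rank": "max"}


def fit_imputer(train_rows: List[Row], rules: Dict[str, str]) -> Dict[str, float]:
    """Learn the fill value of each feature from the training data only."""
    fill = {}
    for feature, rule in rules.items():
        observed = [r[feature] for r in train_rows if r[feature] is not None]
        fill[feature] = min(observed) if rule == "min" else max(observed)
    return fill


def apply_imputer(rows: List[Row], fill: Dict[str, float]) -> List[Row]:
    """Replace missing values (None) using the fitted fill values."""
    return [
        {f: (fill[f] if v is None and f in fill else v) for f, v in row.items()}
        for row in rows
    ]


train = [
    {"rnaseq_tpm": 0.0, "mixmhc_rank": 0.5},
    {"rnaseq_tpm": 12.3, "mixmhc_rank": 100.0},
]
fill_values = fit_imputer(train, IMPUTE_RULES)
test_imputed = apply_imputer([{"rnaseq_tpm": None, "mixmhc_rank": None}], fill_values)
```

Fitting on the training rows and applying the result to the test rows keeps the test data out of the fitting step, in line with the preprocessing principle stated above.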

Data normalization

Some numerical feature values may change between patients and datasets, and data normalization was required in order to harmonize them. Different data transformers from the Python Scikit-learn preprocessing toolkit were tested: QuantileTransformer, PowerTransformer, MinMaxScaler, and StandardScaler. The QuantileTransformer mapped all values to the interval [0, 1] such that the values were evenly distributed. The PowerTransformer applied a transformation to make the values Gaussian-like with zero mean and standard deviation of one. The MinMaxScaler transformed the values to the interval [0, 1] by an affine transformation. The distribution of the values remained the same apart from a constant shift and change of scale. The StandardScaler transformed the values to their z-score by subtracting the mean and dividing by the standard deviation. Each transformation was applied to the different numerical features of each patient after MV imputation.
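For illustration only, the two affine transformations described above (min-max scaling and z-scoring) may be sketched without external dependencies as follows. This is not the Scikit-learn implementation; the helper names and the example values are hypothetical, and the sketch merely shows parameters being fitted on one set of values and applied to another.

```python
import statistics
from typing import List, Tuple


def fit_standard_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the training values."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population standard deviation
    return mean, (std if std > 0 else 1.0)


def fit_min_max_scaler(train_values: List[float]) -> Tuple[float, float]:
    """Return (minimum, range) of the training values."""
    lo, hi = min(train_values), max(train_values)
    return lo, (hi - lo if hi > lo else 1.0)


def apply_scaler(values: List[float], params: Tuple[float, float]) -> List[float]:
    """Apply the affine map (v - offset) / scale with fitted parameters."""
    offset, scale = params
    return [(v - offset) / scale for v in values]


train_tpm = [0.0, 5.0, 10.0]
test_tpm = [5.0, 20.0]
z_test = apply_scaler(test_tpm, fit_standard_scaler(train_tpm))
mm_test = apply_scaler(test_tpm, fit_min_max_scaler(train_tpm))
```

Note that a test value outside the training range maps outside [0, 1] under min-max scaling, which is the expected consequence of fitting on the training data only.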

Conversion of categorical values to numerical

Categorical features can only take on a few predefined values, which cannot be compared or ordered in a natural way. For the XGBoost and CatBoost classifiers, which are able to deal with categorical features directly, categorical features were left unchanged. For all other classifiers, categorical features were turned into numerical ones by a process called target encoding. For each categorical feature i, the numerical value of its category j was set to the rate $r_{ij}$ of immunogenic mut-seqs/neo-peps in this category in the training data: $r_{ij} = \frac{\sum_{k=1}^{n} y_k \, [X_{ki} = j]}{\sum_{k=1}^{n} [X_{ki} = j]}$, where $y_k = 1$ if k is an immunogenic mut-seq/neo-pep and 0 otherwise, X is the n × m (n mut-seqs/neo-peps, m features) training data matrix, i.e., $X_{ki}$ is the value of feature i for mut-seq/neo-pep k, and [·] is a function that maps a boolean to an integer: [true] = 1 and [false] = 0.
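The target encoding described above, in which each category is replaced by the rate of immunogenic mut-seqs/neo-peps observed for that category in the training data, may be sketched as follows. The function names, the category values, and the default used for categories absent from the training data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def fit_target_encoding(categories: List[str], labels: List[int]) -> Dict[str, float]:
    """Rate of immunogenic items (label 1) per category in the training data."""
    positives: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, label in zip(categories, labels):
        totals[category] += 1
        positives[category] += label
    return {c: positives[c] / totals[c] for c in totals}


def apply_target_encoding(
    categories: List[str], rates: Dict[str, float], default: float = 0.0
) -> List[float]:
    """Map categories to fitted rates; unseen categories get `default` (assumption)."""
    return [rates.get(c, default) for c in categories]


train_categories = ["EXACT", "EXACT", "COVER", "COVER", "COVER", "NONE"]
train_labels = [1, 0, 0, 0, 1, 0]
rates = fit_target_encoding(train_categories, train_labels)
encoded_test = apply_target_encoding(["COVER", "EXACT", "UNSEEN"], rates)
```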

Classifiers

Logistic regression (LR) (Cox, D. R. J. R. Stat. Soc. Ser. B Methodol. 20, 215-242 (1958)) can be used to estimate the probabilities of binary responses. It assumes that the log-odds (logarithm of the class probability ratio) is a linear function of the feature values plus an offset, resulting in a linear class boundary. LR is a fast and scalable method that is robust to outliers or mislabeled data vectors. The scikit-learn implementation allows adding weight regularization and class weights and choosing different solvers for the gradient descent optimization. The Support Vector Machine (SVM) classifier was developed by Vapnik and collaborators (Boser, B. E., et al. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory 144-152 (ACM, 1992)). In its basic form, it fits a linear class boundary that maximizes the margin separating the two classes while minimizing the hinge loss of misclassified data vectors, which yields robust classification results. It can easily be extended to fitting nonlinear class boundaries by replacing the linear kernel with nonlinear ones, which makes it a very flexible classifier. The scikit-learn SVM implementation allows choosing the weight regularization parameter, class weights, and linear or various nonlinear kernels. Here, the SVM with linear (SVM-Linear) and radial basis function (SVM-RBF) kernels was used.

XGBoost (Chen, T. & Guestrin, C. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785-794 (Association for Computing Machinery, 2016)) is a gradient tree boosting method, where the class labels were approximated by a sum of regression trees. Overfitting was avoided by penalizing trees with many leaves and large values within the leaf nodes. Further, feature and row sampling and shrinkage of regression tree leaf values can be applied. It works on sparse data and allows missing value imputation. The method is made scalable to very large datasets by a series of algorithmic improvements and parallelization as well as GPU usage.

CatBoost (Prokhorenkova, et al. In Advances in Neural Information Processing Systems vol. 31 (Curran Associates, Inc., 2018)) is another recent gradient boosting classifier. It handles categorical features and their interactions efficiently during training. It prevents overfitting, especially for smaller datasets, by estimating the gradients on different data than the ones used to estimate the trees by a technique called ordered boosting. CatBoost implements so-called oblivious balanced trees as base predictors, which allow directly accessing the leaf node of a data vector via an index. An efficient implementation of the training procedure and an implementation tailored for GPUs make it a fast and scalable algorithm.

A trained classifier C ranks a mut-seq/neo-pep k among a set of mut-seqs/neo-peps by the predict_proba(x_k) function that is part of the programming interface of all classifiers used here (x_k is the feature vector of mut-seq/neo-pep k). This function returns the probability estimated by classifier C that mut-seq/neo-pep k is immunogenic. All mut-seqs/neo-peps in a test or validation set are then ranked according to this probability in decreasing order (rank 0 is best).

Subsampling of neo-peps for training

In order to limit class imbalance and computation time during training on neo-peps, the size of NCI-train_neo-pep_non-imm was limited by randomly sampling $N_{neg}$ non-immunogenic neo-peps from NCI-train_neo-pep_non-imm, while all immunogenic neo-peps in NCI-train_neo-pep_imm were retained. The data matrices of NCI-train_neo-pep_non-imm and NCI-train_neo-pep_imm were concatenated, and the order of the rows (neo-peps) was randomized. For the test datasets, no limitation on the size of the data matrices was enforced. Subsampling of non-immunogenic neo-peps was repeated 10 times when training the classifiers. For training on the smaller mut-seq data NCI-train_mut-seq_non-imm, no limit on the number of mut-seqs was used, but the order of rows (mut-seqs) was randomized.
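The subsampling procedure described above may be sketched as follows, by way of example only. The function name, the seed handling, and the toy row identifiers are hypothetical; the number of sampled non-immunogenic neo-peps (here 20) is an arbitrary illustrative value.

```python
import random
from typing import List, Tuple


def subsample_training_rows(
    imm_rows: List[str], non_imm_rows: List[str], n_neg: int, seed: int
) -> List[Tuple[str, int]]:
    """Keep all immunogenic rows, sample n_neg non-immunogenic rows, shuffle."""
    rng = random.Random(seed)
    sampled = rng.sample(non_imm_rows, min(n_neg, len(non_imm_rows)))
    rows = [(r, 1) for r in imm_rows] + [(r, 0) for r in sampled]
    rng.shuffle(rows)  # randomize the row order of the concatenated data
    return rows


imm = [f"imm_{i}" for i in range(5)]
non_imm = [f"neg_{i}" for i in range(100)]

# Ten independent subsampling replicates, one per training run.
replicates = [subsample_training_rows(imm, non_imm, n_neg=20, seed=s) for s in range(10)]
```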

Classifier hyper-parameter optimization

Every classifier algorithm has hyper-parameters that need to be set prior to training. For example, for an SVM classifier, the type of kernel and its parameters have to be set. Since these hyper-parameter settings can have a drastic effect on the classification performance, it was important to select hyper-parameters that worked well for a given dataset, especially when comparing different classifiers to each other. Therefore, hyper-parameter optimization was regarded as part of the learning process. Since testing all possible hyper-parameter combinations for each classifier would be very time consuming, the Hyperopt (Bergstra, J., et al. Comput. Sci. Discov. 8, 014008 (2015)) framework was used, which implements sequential model-based optimization (also known as Bayesian optimization). In this iterative approach, a new guess of hyper-parameter values was calculated based on the results of previously tested values using a tree-structured Parzen estimator approach, which calculates for each hyper-parameter the value with the greatest expected improvement. The user can define a loss function to be minimized and, for each hyper-parameter, the range and prior distribution of its values.

In the Hyperopt optimization, 5-fold cross-validation was used, and the rank_score of the classifier on the validation set, averaged over all 5 folds, was calculated. Since the datasets are unbalanced (many more non-immunogenic mut-seqs/neo-peps than immunogenic ones), stratified cross-validation (Scikit-learn class StratifiedKFold) was used to ensure that there was a similar proportion of immunogenic mut-seqs/neo-peps in both training and validation sets. The score that was optimized in Hyperopt is: $\mathrm{rank\_score} = \sum_{k} e^{-\alpha r_k}$, where the sum runs over the immunogenic mut-seqs/neo-peps k and $r_k$ is the rank of the immunogenic mut-seq/neo-pep k determined by a classifier’s predict_proba function (see above) among all other mut-seqs/neo-peps in a data set (best rank is 0). The higher the predicted probability, the lower the rank and the higher the contribution to rank_score. The factor α determines how much weight is given to low-ranking mut-seqs/neo-peps compared to high-ranking ones. For example, a value of α = 0.1 will contribute 0.3679 for rank 10 and 0.0067 for rank 50, while α = 0.01 will contribute 0.9048 for rank 10 and 0.6065 for rank 50. For training, α = 0.005 was used. Similarly, the rank score vector was defined as: $\mathrm{rank\_score\_vec} = (e^{-\alpha r_k})_k$, for immunogenic mut-seqs/neo-peps k.
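A minimal sketch of the ranking and scoring step follows. It assumes that each immunogenic mut-seq/neo-pep at rank r contributes exp(-α·r), which is the form implied by the worked values above (0.3679 for rank 10 at α = 0.1). The function names and the tie-breaking behavior (ties keep input order) are assumptions.

```python
import math
from typing import List


def ranks_from_probabilities(probs: List[float]) -> List[int]:
    """Rank items by decreasing predicted probability (rank 0 is best)."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    ranks = [0] * len(probs)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def rank_score_vec(probs: List[float], labels: List[int], alpha: float = 0.005) -> List[float]:
    """Per-item contributions exp(-alpha * rank) of the immunogenic items."""
    ranks = ranks_from_probabilities(probs)
    return [math.exp(-alpha * r) for r, y in zip(ranks, labels) if y == 1]


def rank_score(probs: List[float], labels: List[int], alpha: float = 0.005) -> float:
    """Sum of the contributions of all immunogenic items."""
    return sum(rank_score_vec(probs, labels, alpha))


# Toy example: the single immunogenic item has the second-highest probability.
toy_score = rank_score([0.9, 0.1, 0.5], [0, 0, 1], alpha=0.1)
```

In this toy case the immunogenic item receives rank 1, so the score equals exp(-0.1).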

Classifier leave-one-out cross-validation on training set

In order to evaluate the classifier performance on the NCI-train data, leave-one-out CV was performed for all patients p in NCI-train with immunogenic neo-peps: once a set of hyperparameters was learned in the Hyperopt loop, the classifier was retrained on NCI-train_neo-pep_tested \ p_neo-pep_tested (NCI-train_neo-pep_tested without patient p) using the optimal hyperparameters and tested on p_neo-pep. In this way, neo-pep rankings per patient were obtained, and the patient-wise rank_scores were calculated. The final rank_score was obtained by adding all patient-wise rank_scores. The same procedure was applied for mut-seqs.
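The patient-wise splitting underlying this leave-one-out procedure may be sketched as shown below. Only the index bookkeeping is illustrated; retraining and scoring of the classifier on each split are omitted. The function name and patient identifiers are hypothetical.

```python
from typing import Iterator, List, Tuple


def leave_one_patient_out(
    patient_ids: List[str], labels: List[int]
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out patient, training row indices, test row indices).

    Only patients with at least one immunogenic row (label 1) are held out.
    """
    with_imm = sorted({p for p, y in zip(patient_ids, labels) if y == 1})
    for patient in with_imm:
        train_idx = [i for i, p in enumerate(patient_ids) if p != patient]
        test_idx = [i for i, p in enumerate(patient_ids) if p == patient]
        yield patient, train_idx, test_idx


patients = ["A", "A", "B", "B", "C"]
row_labels = [1, 0, 0, 0, 1]
splits = list(leave_one_patient_out(patients, row_labels))
```

Patient B has no immunogenic rows in this toy input and is therefore never held out, although its rows still serve as training data for the other splits.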

Classifier evaluation on test set

In order to evaluate the classifier performance on the test data (NCI-test, TESLA, HiTIDE), the neo-peps were ranked separately for each patient in the test datasets, and the rank_scores were calculated. The total rank_score of a test dataset was then the sum of the rank_scores of its patients.

Voting classifier

The voting classifier simply added the probabilities p = predict_proba(x_k) of all base classifiers and then performed the ranking. The weighted voting classifier includes a weight w for the probabilities p of classifiers c in two groups.
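The unweighted voting step may be sketched as follows, for illustration. Each inner list holds the probabilities that one base classifier assigns to the same sequence of mut-seqs/neo-peps; the function name and the values are hypothetical. The weighted variant is not sketched because its exact form is not reproduced here.

```python
from typing import List, Tuple


def voting_rank(prob_lists: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Sum the per-classifier probabilities and order items by the sum.

    Returns the summed scores and the item indices from best to worst.
    """
    summed = [sum(per_item) for per_item in zip(*prob_lists)]
    order = sorted(range(len(summed)), key=lambda i: -summed[i])
    return summed, order


# Two hypothetical base classifiers scoring three neo-peps.
base_probs = [
    [0.1, 0.7, 0.4],
    [0.2, 0.9, 0.3],
]
summed_probs, ranking = voting_rank(base_probs)
```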

Optimal threshold method

For a given set of features and number of thresholds $N_{th}$ per feature, the $N_{th}$ quantiles for every feature were calculated. Then, all combinations of feature quantiles or thresholds were enumerated, and for each combination, the enrichment of immunogenic neo-peps was calculated in the subset of all neo-peps that passed all the thresholds with a Fisher test. The threshold combination with the highest enrichment (lowest Fisher test p-value) was retained. This method does not provide a ranking, but just a subset of neo-peps that are likely to be immunogenic.
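A dependency-free sketch of the threshold enumeration is given below. Several details are assumptions made for illustration: the one-sided form of the Fisher test (hypergeometric upper tail), the quantile rule, the per-feature direction flags ("le" for upper bounds, "ge" for lower bounds), and all names and values.

```python
import itertools
import math
from typing import Dict, List, Optional, Tuple


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher p-value: P(X >= k) when n of N items pass, K are immunogenic."""
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / math.comb(N, n)


def quantile_thresholds(values: List[float], n_th: int) -> List[float]:
    """n_th evenly spaced empirical quantiles (assumed quantile rule)."""
    s = sorted(values)
    return [s[round((j + 1) / (n_th + 1) * (len(s) - 1))] for j in range(n_th)]


def best_threshold_combination(
    rows: List[Dict[str, float]], labels: List[int], directions: Dict[str, str], n_th: int
) -> Optional[Tuple[float, Dict[str, float]]]:
    """Enumerate all threshold combinations; keep the lowest p-value."""
    features = list(directions)
    grids = [quantile_thresholds([r[f] for r in rows], n_th) for f in features]
    N, K = len(rows), sum(labels)
    best = None
    for combo in itertools.product(*grids):
        passed = [
            all(r[f] <= t if directions[f] == "le" else r[f] >= t for f, t in zip(features, combo))
            for r in rows
        ]
        n = sum(passed)
        if n == 0:
            continue
        k = sum(y for y, ok in zip(labels, passed) if ok)
        p = fisher_enrichment_p(k, n, K, N)
        if best is None or p < best[0]:
            best = (p, dict(zip(features, combo)))
    return best


toy_rows = [{"rank": v} for v in [0.1, 0.2, 0.3, 5, 10, 20, 50, 80]]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
best = best_threshold_combination(toy_rows, toy_labels, {"rank": "le"}, n_th=3)
```

In this toy input the threshold rank ≤ 0.3 selects exactly the three immunogenic items, giving the lowest p-value of 1/56.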

Feature importance

Measures of feature importance inform us about the contribution of each feature to the final outcome of the ML prediction. Shapley values (Shapley, L. S. In Contributions to the Theory of Games, 2, 307-317 (1953)) were originally introduced for game theory, but in this context, they measure the average gain in the performance g of a classifier when adding a feature f. Let F be a set of m features F = {f_1, f_2, ..., f_m}. For a feature vector x, the Shapley value Φ_f(x) was the difference in performance of the classifier on x with and without feature f, averaged over all possible subsets S ⊆ F \ {f}: Φ_f(x) = Σ_{S ⊆ F \ {f}} [|S|! (m − |S| − 1)! / m!] · (g_{S ∪ {f}}(x) − g_S(x)). This definition is mathematically consistent but very time consuming to calculate due to the very large number of subsets S for which the classifier needs to be trained. The Python package SHAP (Lundberg, S. M. & Lee, S.-I. In Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017)) implements different exact calculations and sampling approximations of the Shapley values (so-called Explainers). It provides additive explanations of the classifier performance g_F(x) compared to the expected performance E[g_F(x)] on a dataset: g_F(x) = E[g_F(x)] + Σ_{f ∈ F} Φ_f(x). The package provides many useful visualizations that show the impact of the different features for a specific feature vector or for the whole dataset.
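To make the subset averaging concrete, the following sketch computes exact Shapley values for a small set function using the standard game-theoretic weights |S|! (m − |S| − 1)! / m!. It is not the SHAP package and does not train any classifier: `value` is a hypothetical callable returning the performance obtained with a given feature subset, and the toy feature names are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a set function.
    value: hypothetical callable mapping a frozenset of features to the
    performance g_S obtained with that subset."""
    m = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

def toy_value(subset):
    """Toy performance: two useful features with a small interaction."""
    g = 0.5
    if "binding" in subset:
        g += 0.3
    if "expression" in subset:
        g += 0.1
    if "binding" in subset and "expression" in subset:
        g += 0.06
    return g

phi = shapley_values(["binding", "expression", "length"], toy_value)
```

In this toy case the values add up to the difference between the performance with all features and the performance with none, which is the additive property exploited by SHAP. The loop visits every subset, so the cost doubles with each added feature.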

Description of software on github

Requires Python 3.7-3.9, and pip >= 19.0, Ubuntu 16.04 or later (64-bit)

EXAMPLE 2

Using machine learning (ML) to prioritize immunogenic neoantigens for personalized immunotherapy

Comparison of datasets

In order to investigate how the NCI, TESLA, and HiTIDE datasets compare to each other, several statistics were calculated. The number of somatic SNV mutations called per patient was highest for the TESLA dataset, which contained only melanoma and NSCLC samples known for high mutational loads. The number of tested mutations was highest in the NCI dataset, where the mutations selected for screening were less biased compared to TESLA and HiTIDE. Binding affinity was used as a screening criterion for neo-peps in all three datasets. However, NCI-neo-pep_tested was unaffected by this bias, since it contained all non-immunogenic neo-peps regardless of binding affinity. The number of immunogenic neo-peps per patient correlated well with the total number of mut-seqs in a patient, and the number of immunogenic neo-peps was highest for TESLA, followed by the HiTIDE and NCI datasets. The RNAseq TPM values revealed small differences between datasets. In all datasets, mutations selected for testing had higher RNAseq gene expression, and this effect was strongest in the HiTIDE data and weakest in the NCI data. RNAseq mutation coverage was used as a screening criterion in all datasets, but it was most apparent in the TESLA dataset. The number of immunogenic neo-peps per mut-seq was higher for the tested mutations in the HiTIDE and TESLA datasets. Overall, the NCI dataset had the largest number of screened mutations with the least bias, where the NCI_mut_tested set resembled the unscreened sets TESLA_mut and HiTIDE_mut.

Features’ association with immunogenicity

To gain a deeper understanding of the features, their capacity to distinguish immunogenic mut-seqs/neo-peps from non-immunogenic ones was evaluated. First, it was demonstrated that features reflecting the binding affinity between a neo-pep and the HLA-I complex were among the strongest predictors of immunogenicity for all three datasets. NetMHCpan and MixMHCpred prediction ranks correlated strongly, but they also contained complementary information. For example, in the NCI_neo-pep dataset, five neo-peptides did not pass the strong binding threshold of rank 1 for NetMHCpan, but they passed as strong binders (rank <= 1) in MixMHCpred. Furthermore, features describing proteasomal cleavage, TAP import into the ER, and binding stability were correlated with immunogenicity in all three datasets, and they provided complementary information to binding affinity.

Notably, promiscuous neo-peps that were predicted to bind to multiple patient alleles were more likely to be immunogenic than neo-peps predicted to bind a single allele, likely because binding to multiple alleles increases the chance of HLA-I presentation and makes the presentation of neo-peps more resistant to loss of specific HLA-I alleles. Along the same lines, mut-seqs with a higher number of neo-peps weakly binding to a patient’s HLA-I allele were more likely to be immunogenic than mut-seqs with a lower number of weakly binding neo-peps. The PRIME prediction rank differences between immunogenic and non-immunogenic neo-peps were slightly less significant compared to MixMHCpred binding prediction. DAI values for binding prediction ranks were lower for immunogenic neo-peps. Notably, whether the mutation was located in an anchor position was not significant per se, but it became significant and important in combination with the DAI values, which were significantly lower for immunogenic mutations at anchor positions. Based on the analyzed data, there was no obvious tendency for mutations to be placed in the middle of a neo-pep, and the preferred positions depended on the dataset. The enrichment of immunogenic mutations in the middle of 10-mers reported by Wells et al. (Wells, D. K. et al. Cell 183, 818-834.e13 (2020)) for the TESLA dataset could not be confirmed for the NCI and HiTIDE datasets. However, immunogenic neo-peps were strongly enriched in the group of peptides of length 9 or 10 AA.

It has been demonstrated that gene or protein expression positively impacts HLA-I presentation and immunogenicity. In all three datasets, immunogenic neo-peps had higher gene expression and higher coverage of the mutation in the patient’s tumor bulk RNAseq data compared to non-immunogenic ones. To assess whether gene expression data from public datasets could substitute for sample-specific RNAseq data when the latter is not available, RNAseq expression data from the public databases TCGA and GTEx were also considered as additional features. For both immunogenic and non-immunogenic neo-peps, the TCGA gene expression correlated strongly (R = 0.818) with the expression in the patient’s cancer tissue. The gene expression level in GTEx correlated to a lower extent (R = 0.645), and the regression line for immunogenic neo-peps was shifted to higher RNAseq values compared to the regression line for non-immunogenic neo-peps. It was concluded that immunogenic mutated genes had higher gene expression levels in cancer tissues compared to the healthy tissues in GTEx. Lastly, cancer cell fraction, clonality, and zygosity were not associated with immunogenic neo-peps, and the results varied between the datasets.

The in-house ipMSDB database (Muller, M., et al. Front. Immunol. 8, (2017)) contains WT HLA-I ligands identified by mass spectrometry in multiple healthy and cancerous human tissues and cell lines with various HLA allotypes. It was used to infer the HLA-I presentation level of a mut-seq or neo-pep based on information available for the corresponding wt-seq or wt-pep. Notably, the number of ipMSDB peptides mapped to a protein ("ipMSDB Peptide Count") was significantly higher for proteins containing immunogenic mut-seqs across all three datasets, and almost all proteins with an immunogenic mut-seq had an "ipMSDB Peptide Count" greater than 0. These data indicate that immunogenic peptides in the three datasets preferably belong to proteins that are naturally processed and presented. The "ipMSDB Peptide Count" for a given protein correlated (R = 0.498) with mRNA expression of the corresponding gene, but this correlation could not fully explain the higher "ipMSDB Peptide Count" values for immunogenic mut-seqs for the NCI dataset. The "ipMSDB Peptide Score" measures the overlap between the wt-peptide and peptides within ipMSDB. Not surprisingly, the "ipMSDB Peptide Score" correlated with the "ipMSDB Peptide Count" (R = 0.435), i.e., proteins with overall more ipMSDB peptides had a better chance to cover the position of the mutation. However, the "ipMSDB Peptide Score" was higher for immunogenic neo-peps compared to non-immunogenic ones, and this shift was significant in all three datasets. These results indicated that wt-peptides of immunogenic neo-peps were preferably found in MS "hotspots", i.e., protein regions where MS peptides are denser. Since immunogenic neo-peps are true HLA binders, higher ipMSDB scores indicate a higher probability for HLA binding, and ipMSDB scores provided information complementary to RNA expression and binding affinity prediction.
A highly significant enrichment was also found for immunogenic neo-peps that mapped to EXACT wt-pep counterpart sequences in ipMSDB and, to a lower extent, for those that mapped to INCLUDED wt-pep counterparts. Therefore, sequence matching to ipMSDB is an efficient means to prioritize "true" HLA-I binding neo-peps as long as the mutation is not in an anchor position, since such mutations reduce the likelihood of the wt-peptide being found in ipMSDB.

Further, features that evaluate the impact of a mutation on the cellular or molecular function of the mutated protein were included. Notably, while CScape (Rogers, M. F., et al. Sci. Rep. 7, 11597 (2017)) is an oncogenicity predictor, it also has a predictive value for immunogenicity, possibly because oncogenic mutations often destabilize the protein structure, leading to rapid degradation of the protein and presentation on HLA-I. Mutation annotations from the Intogen database were also included. It was found that mutations annotated as oncogenic drivers were enriched for immunogenicity in all three datasets, and that there was a slight immunogenicity enrichment for neo-peps containing mutations with higher prevalence in the population according to Intogen annotation.

Comparing classifiers for neo-peps

The neo-pep ranking was performed by sorting the neo-peps according to their probability of belonging to the immunogenic class, which is learned by the classifier algorithms from the NCI training data (NCI-train). Placing all immunogenic neo-peps into low ranks can be difficult, since a true immunogenic neo-pep has to compete with potentially thousands of non-immunogenic ones with similar feature characteristics. Additionally, every mut-seq theoretically produces 50 neo-peps of length 8-12, which all share the same values for features that depend only on the mut-seq (such as RNAseq expression or CScape). This reduces the power of mut-seq features, since an immunogenic neo-pep will be "surrounded" by 49 non-immunogenic ones with exactly the same mut-seq feature values.
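The count of 50 neo-peps per mut-seq can be checked with a short enumeration. The sketch below assumes a 25-residue mut-seq with the mutation at the central position (index 12, counting from 0), which is an assumption made for illustration. The sequence is the mut-seq quoted later in this example, and the function name is hypothetical.

```python
def neo_peptides(mut_seq, mut_index, min_len=8, max_len=12):
    """All windows of length min_len..max_len that contain position mut_index."""
    peptides = []
    for length in range(min_len, max_len + 1):
        first = max(0, mut_index - length + 1)
        last = min(mut_index, len(mut_seq) - length)
        for start in range(first, last + 1):
            peptides.append(mut_seq[start:start + length])
    return peptides

# 25-residue example sequence; the central index 12 is an assumed
# mutation position used only for illustration.
peps = neo_peptides("NQEGSDNEKIALFQSLLDDANLRKN", 12)
print(len(peps))
```

With these assumptions, each length L contributes L windows, so the total is 8 + 9 + 10 + 11 + 12 = 50. A mutation closer to the end of the protein would yield fewer peptides.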

Before training the classifiers, three important parameters were evaluated, which need to be set before Hyperopt classifier optimization: the sample size, the number of Hyperopt iterations, and the normalization method. In order to limit the data volume and the class imbalance for classifier training, a specified number of non-immunogenic neo-peps from the training set NCI-train_neo-pep_tested were sampled. Seven sampling sizes (10k, 25k, 50k, 75k, 100k, 150k, and 200k) were tested with the LR classifier, and it was found that the LR classifier leave-one-out CV rank_score increased with higher training data sizes before starting to saturate at 100k sampled non-immunogenic neo-peps. Next, the number of Hyperopt iterations required to train an LR classifier was tested, and it was found that the Hyperopt loop required at least 50 iterations for LR. The data normalization method had a strong impact on the performance of the LR classifier, where quantile and power normalization performed best. The impact of data normalization was also assessed on the number of immunogenic neo-peps placed in the top 20, 50, or 100 ranks for LR. The use of quantile or power normalization nearly doubled the number of immunogenic neo-peps ranked in the top 20. Therefore, for the following analyses, a sample size of 100k, 200 Hyperopt iterations, and quantile normalization were used. All these optimal parameters were chosen on the cross-validated NCI-train data to avoid potential leakage from the test data.
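The two preprocessing steps discussed above, sampling of non-immunogenic neo-peps and quantile normalization, can be illustrated with a dependency-free sketch. The text does not state which implementation was used, so the mapping to a uniform [0, 1] scale, the mid-rank handling of ties, and all names below are assumptions.

```python
import bisect
import random

def fit_quantile_map(train_values):
    """Return a function mapping a value to its empirical quantile in [0, 1],
    estimated from the training values only."""
    s = sorted(train_values)
    n = len(s)

    def transform(v):
        # Mid-rank between the left and right insertion points handles ties.
        lo = bisect.bisect_left(s, v)
        hi = bisect.bisect_right(s, v)
        return (lo + hi) / (2.0 * n)

    return transform

def subsample_negatives(rows, labels, n_neg, seed=0):
    """Keep all immunogenic rows and a random sample of non-immunogenic rows."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = pos + rng.sample(neg, min(n_neg, len(neg)))
    return [rows[i] for i in keep], [labels[i] for i in keep]
```

Fitting the quantile map on the training data and reusing it unchanged on the test data keeps the test data out of the normalization, in line with the leakage concern raised above.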

To assess the performance of each of the classifiers, Hyperopt parameter optimization for LR, SVM-RBF, SVM-Linear, XGBoost, and CatBoost classifiers (see Table 2 for more details) was performed. Using the optimal hyperparameters, the classifiers were then trained on the sampled NCI-train_neo-pep_tested and tested on the NCI-train_neo-pep (using leave-one-out CV, see methods), and on the NCI-test_neo-pep, TESLA_neo-pep and HiTIDE_neo-pep datasets. Hyperopt optimization and training on NCI-train_neo-pep_tested were repeated 10 times in order to evaluate the reproducibility of the results. The results revealed that for NCI-train, LR outperformed XGBoost, followed by CatBoost and the SVMs, and that the performances of linear classifiers (LR and SVM-Linear) were less dependent on the data sampling. In addition, LR and XGBoost required the shortest CPU computation time for the Hyperopt optimization (Fig. 2). The impact of the classifier algorithm on the number of top-ranked neo-peps was less pronounced than the impact of data normalization but still highly significant. The LR classifier was able to rank 49.1% of the immunogenic NCI-train_neo-peps in the top 20, 62.2% in the top 50, and 75.6% in the top 100 (Fig. 1). LR also performed best on the NCI-test data, while XGBoost performed best on the TESLA and HiTIDE datasets (Fig. 2). The ranks for immunogenic neo-peps were similar within the linear (LR, SVM-Linear) and within the tree-based (XGBoost, CatBoost) classifiers, whereas a clear separation was found between the two groups, indicating that the linear and tree-based classifiers are, to a certain extent, complementary. Therefore, a voting classifier was constructed, which adds the predict_proba(x_k) outputs of all 10 LR and 10 XGBoost classifiers to perform the ranking. As demonstrated, the voting classifier had good performance, which was less dependent on the test dataset compared to LR and XGBoost (Fig. 2).

Next, the performance of the ML ranking methods was compared with other common ranking approaches. As a baseline, neo-pep rankings were produced by simply sorting with MixMHCpred or NetMHCpan first, and then by RNAseq expression to resolve the ties. Fig. 2 shows the superior performance of the ML classifiers compared to these simple ranking strategies. NetMHCpan ranking worked best for the NCI-test and TESLA datasets, where NetMHCpan was used to select neo-peps for screening. For HiTIDE, where MixMHCpred was used for the screening selection, MixMHCpred, as expected, ranked a slightly larger number of immunogenic neo-peps in the top 20. The "optimal threshold" method presented by Wells et al. (Wells, D. K. et al. Cell 183, 818-834.e13 (2020)) was implemented, which calculates the feature thresholds that provide the highest enrichment of immunogenic neo-peps. The "optimal threshold" method does not provide a ranking, but only a subset of neo-peps enriched for immunogenicity. The optimal thresholds on NCI-train_neo-pep_tested were determined and applied to each patient p in the test data, which gave an enriched subset S_p for every patient p containing a number N_imm,p of immunogenic neo-peps. In order to compare these results to the LR ranking, the minimal position in the LR ranking containing N_imm,p immunogenic neo-peps was determined. The minimal position in the LR ranking was always smaller than the size of subset S_p (on average 18.6 times smaller for LR). Finally, the results of the disclosed ML ranking methods were compared to the rankings published by Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)) for NCI-test_neo-pep_imm. LR, XGBoost, and the voting classifier ranked more immunogenic neo-peps in the top 20, 50, and 100 ranks for the 23 patients in NCI-test_neo-pep_imm compared with the ranking performed by Gartner et al. (Fig. 1 and Table 4).
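The baseline ranking and the two comparison measures used above (the number of immunogenic neo-peps in the top k, and the minimal ranking position that contains a given number of immunogenic neo-peps) can be sketched as follows. The dictionary keys and function names are hypothetical and chosen only for this illustration.

```python
def baseline_ranking(peptides):
    """Sort by binding prediction rank (lower is better), then by RNAseq
    expression (higher is better) to resolve ties.
    peptides: list of dicts with hypothetical keys 'binding_rank',
    'expression' and 'immunogenic' (0 or 1)."""
    return sorted(peptides, key=lambda p: (p["binding_rank"], -p["expression"]))

def count_in_top_k(ranked, k):
    """Number of immunogenic peptides among the first k of a ranking."""
    return sum(p["immunogenic"] for p in ranked[:k])

def minimal_position(ranked, n_imm):
    """Smallest number of top-ranked peptides that contains n_imm
    immunogenic ones, or None if the ranking holds fewer than n_imm."""
    if n_imm <= 0:
        return 0
    found = 0
    for position, p in enumerate(ranked, start=1):
        found += p["immunogenic"]
        if found >= n_imm:
            return position
    return None
```

Comparing `minimal_position` with the size of a thresholded subset S_p gives the kind of ratio reported above for the LR ranking.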

To gain insight into the contribution of each feature to the performance of the ML ranking, the Shapley values were calculated and explored. It was found that the strongest Shapley values for LR stemmed from the PRIME, MixMHCpred, and NetMHCpan rank features, followed by stability rank, RNAseq expression and variant coverage, peptide length, and ipMSDB overlap. A similar feature importance ranking was obtained for the XGBoost classifier, but less weight was given to PRIME rank and peptide length. Many features with smaller Shapley scores, such as "Intogen same Mutation Count" or "MixMHCpred log Rank DAI", were ranked consistently by both LR and XGBoost. Other features that had high importance in LR were replaced by correlated features in XGBoost (e.g., "ipMSDB Match Score" replaced "ipMSDB Peptide Match Overlap"). Many of the features used here contributed to the low rank of 24 of the neo-pep QDAAAFQLW (SEQ ID NO: 394) of patient 4014 in NCI-test_neo-pep_imm, which is better than rank 84 reported by Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)). Features that correlate well with immunogenicity can have low Shapley values, because other correlated features may replace them in the classification algorithm. For example, "ipMSDB Match Type" and "ipMSDB Match Score" were highly significant as single features, but had an almost zero Shapley value, likely because they were substituted by "ipMSDB Match Overlap" in LR classification. Immunogenic neo-peps with ranks above 100 (low priority) usually have poor predicted binding affinity in NetMHCpan, MixMHCpred, and PRIME. An example is the neo-pep DRNIVRHSW (SEQ ID NO: 395) from patient 4324 in NCI-test_neo-pep_imm, whose LR rank was 1849. The Shapley values of all important features were negative, and therefore the classification algorithm was unable to place this neo-pep in a low rank.

Next, it was investigated whether an LR classifier trained on the large NCI-train_neo-pep_tested cohort (LR_NCI) and applied to HiTIDE_neo-pep provides a better ranking of immunogenic neo-peps than an LR classifier trained on the smaller HiTIDE_neo-pep cohort itself (LR_HiTIDE) using leave-one-out CV. It was shown that LR classifiers trained on the large NCI-train cohort outperformed the LR classifiers trained on the HiTIDE cohort. The size of the training data thus has a positive impact on classification performance. On the other hand, when the LR_NCI and LR_HiTIDE classifiers were combined by weighted voting, the LR_NCI classifier was adapted to the specifics of the HiTIDE dataset. Such a weighted combination of the LR_NCI and LR_HiTIDE classifiers indeed improved the ranking performance on HiTIDE_neo-pep. Therefore, LR_NCI forms a good base classifier, which can be adapted to the specifics of a different, smaller dataset by this weighted combination strategy.

Comparing classifiers for mutations

To compare the classifier performance with regard to ranking mut-seqs, the LR and XGBoost classifiers were trained on NCI-train_mut-seq_tested with the features described in the methods section and in Table 1. On the cross-validated training data, XGBoost performed better than LR on all 10 Hyperopt runs. For the test data, the improvement was most pronounced on the TESLA dataset (Fig. 3). As for the neo-pep data, the LR and XGBoost results were complementary, and the voting classifier provided the overall best ranking (Fig. 3). For the NCI-test data, both LR and XGBoost performed better than the ranking published by Gartner et al. (Table 4), but the improvement was less pronounced compared to the NCI-test_neo-pep data (Fig. 1). Even though the MixMHCpred binding affinity rank was still the most important feature, the importance of non-HLA-binding related features increased compared to the neo-pep prioritization. The results demonstrated that the RNAseq expression and coverage, "ipMSDB Peptide Score," and "Intogen Gene Mutation Count" features were almost as important as binding affinity and stability. In the neo-pep data, most neo-peps are non-immunogenic even if they belong to an immunogenic mut-seq (see discussion above), and the importance of features that only depend on the mut-seq (but not on the specific neo-pep) was, therefore, reduced. Fig. 3 shows the example of the immunogenic HOOK3 gene mut-seq NQEGSDNEKIALFQSLLDDANLRKN (SEQ ID NO: 393), which was ranked 89th by Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)). Here, the XGBoost rank was 19th, in part due to the contribution of the "ipMSDB Peptide Score," indicating that the mut-seq is placed in a region of the HOOK3 protein that is commonly detected by MS.

The disclosed ML approach uses the large NCI-train cohort to train and the NCI-test, TESLA, and HiTIDE cohorts for testing. A robust machine learning methodology that makes it possible to train and test on data acquired by different labs with different protocols was developed. This robust methodology contains the characteristics that are further discussed below. This example shows the importance of data normalization and implementation of optimized normalization techniques. Quantile normalization with LR placed on average 40.3 immunogenic neo-peptides in the top 20 ranks, 51.0 in the top 50 and 62.0 in the top 100 ranks for NCI-train leave-one-out cross-validation, whereas only 18.5, 31.7 and 49.5 immunogenic neo-peptides were ranked in the top 20, 50, or 100 without normalization. This corresponds to an increase of 118%, 61%, and 25%, respectively.
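The quoted percentage increases follow directly from the average counts given above, as the short arithmetic check below shows. The helper name is illustrative.

```python
def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Average numbers of immunogenic neo-peptides in the top 20, 50 and 100 ranks.
with_norm = [40.3, 51.0, 62.0]      # LR with quantile normalization
without_norm = [18.5, 31.7, 49.5]   # LR without normalization

increases = [round(percent_increase(a, b))
             for a, b in zip(with_norm, without_norm)]
print(increases)
```

For instance, (40.3 − 18.5) / 18.5 is approximately 1.18, i.e., an increase of about 118% in the top 20.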

Various classifiers were explored to identify the classifiers most adapted for this purpose (Fig. 1). The LR classifier placed on average 40.3 immunogenic neo-peptides in the top 20 ranks, 51.0 in the top 50, and 62.0 in the top 100 ranks for NCI-train leave-one-out cross-validation, whereas only 29.6, 44.0 and 55.9 immunogenic neo-peptides were ranked in the top 20, 50, or 100 by the SVM-RBF classifier. This corresponds to an increase of 36%, 13%, and 11%, respectively.

In addition, the hyperparameters of all these classifiers were optimized in a cross-validation loop (Table 2).

The best-performing classifiers (e.g., logistic regression and XGBoost) were combined into a so-called voting classifier, which added the probabilities of all 10 replicates of LR and all 10 replicates of XGBoost to perform the ranking. The voting classifier obtained a better and more robust immunogenicity ranking in the three test datasets (Fig. 2).

Compared to the ranking from Gartner et al. (Gartner, J. J. et al. Nat. Cancer 1-12 (2021)) for NCI-test, the LR classifier performed better, yielding 27% more immunogenic peptides in the top 20 ranks (Fig. 1).

Model training for the LR and XGBoost classifiers was performed on NCI mutation and neo-peptide data separately. Therefore, independent models are provided herein to rank mutations and to rank neo-peptides. This is different from the approach in Gartner et al., where the mutation model depends on the neo-peptide model.

This example also shows that an LR classifier trained on the large NCI-train cohort outperformed an LR classifier trained on the smaller HiTIDE cohort itself. Also, a transfer learning mechanism was used to adapt the classifiers trained on NCI-train to the specificities of the HiTIDE data to further improve performance on HiTIDE.

Training on the large NCI-train cohort led to a 46% increase in top 20 peptides as compared to training on the smaller HiTIDE cohort. Transfer learning yielded a 44% increase as compared to LR trained on NCI-train and a 111% increase as compared to LR trained on HiTIDE.

Additionally, the use of Shapley values makes the disclosed ML approach more interpretable. This is important for the understanding and quality control of the results.

It has been shown that features other than binding affinity are useful to predict HLA presentation, including binding stability, RNAseq expression of the mutated gene, RNAseq coverage of the mutation, antigen processing features such as TAP affinity and proteasomal cleavage propensity, or the difference between wild-type and mutated peptides.

In addition to these standard features describing HLA presentation of neo-peptides, several novel features were included, and it was shown that they were able to distinguish immunogenic from non-immunogenic neo-peptides or mutations. For example, features derived from an in-house database (ipMSDB) containing a large collection of HLA peptides identified by mass spectrometry were tested (Fig. 1). On all three datasets, the ipMSDB features were able to significantly distinguish immunogenic from non-immunogenic neo-peptides.

In particular, as demonstrated in this example, the CScape score, which evaluates the oncogenicity of a mutation, was also predictive of immunogenicity.

Also, it was found that the Intogen database, which annotates mutations with regard to their role in oncogenesis, provided important information that can be used to characterize immunogenic mutations.

These p-values are based on the isolated features. In the context of a classifier, the Shapley values of some features were still significant (Table 1). Fig. 1 shows that the performance of the LR classifier ranking improved when including ipMSDB and Intogen derived features. Further, it was shown that neo-peptide promiscuity (the ability to bind several of the patient’s alleles) had a positive impact on immunogenicity. T-test p-values were highly significant for all three datasets: NCI: <1.00e-100, TESLA: 3.04e-75, HiTIDE: 1.20e-99.

For mut-seq, the number of neo-peps, which are contained in a mut-seq and bind to a patient’s allele (‘MixMHC Binding Peptide Count’), is also significant in all three datasets: NCI: 3.80e-14, TESLA: 1.39e-05, HiTIDE: 7.09e-04.

RNAseq gene expression in cancer biopsies was complemented by the expression of the gene in the TCGA database (t-test p-values NCI: 2.43e-29, TESLA: 2.46e-13, HiTIDE: 2.73e-08) and the GTEx database (t-test p-values NCI: 2.81e-20, TESLA: 1.48e-10, HiTIDE: 1.11e-04). Additionally, PRIME was also included, which can be a predictor for T-cell recognition (t-test p-values NCI: 3.19e-52, TESLA: 1.14e-22, HiTIDE: 1.31e-14).

Finally, it was shown that the combination of the disclosed features performed better than when the ranking was done only on NetMHCpan binding affinity and RNAseq in all datasets (Fig. 2).

The present disclosure is not to be limited in scope by the specific embodiments described herein. Indeed, various modifications of the invention in addition to those described herein will become apparent to those skilled in the art from the foregoing description and the accompanying figures. Such modifications are intended to fall within the scope of the appended claims.

Table 2. Hyperparameters optimized by Hyperopt

Table 4. Ranking of neo-peptides