Title:
METHOD OF PREDICTING SURVIVAL RATES FOR CANCER PATIENTS
Document Type and Number:
WIPO Patent Application WO/2020/157508
Kind Code:
A1
Abstract:
The present invention provides a method for providing a prognosis for a subject with lung cancer, the method comprising: (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1; (b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and (c) providing a prognosis for the lung cancer based on the risk score of the subject.

Inventors:
SWANTON ROBERT CHARLES (GB)
BISWAS DHRUVA (GB)
MCGRANAHAN NICHOLAS (GB)
BIRKBAK NICOLAI JUUL (DK)
Application Number:
PCT/GB2020/050221
Publication Date:
August 06, 2020
Filing Date:
January 30, 2020
Assignee:
THE FRANCIS CRICK INSTITUTE LTD (GB)
UNIV LONDON (GB)
International Classes:
C12Q1/6886
Domestic Patent References:
WO2010063121A1 2010-06-10
WO2015138769A1 2015-09-17
WO2001063121A1 2001-08-30
Foreign References:
US20100184063A1 2010-07-22
Other References:
PHILIPPE LAMBIN ET AL: "Decision support systems for personalized and participative radiation oncology", ADVANCED DRUG DELIVERY REVIEWS, vol. 109, 1 January 2017 (2017-01-01), AMSTERDAM, NL, pages 131 - 153, XP055528085, ISSN: 0169-409X, DOI: 10.1016/j.addr.2016.01.006
GERALD QUON ET AL: "Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction", GENOME MEDICINE, vol. 5, no. 3, 28 March 2013 (2013-03-28), pages 29, XP021151900, ISSN: 1756-994X, DOI: 10.1186/GM433
J. SUBRAMANIAN ET AL: "Gene Expression-Based Prognostic Signatures in Lung Cancer: Ready for Clinical Use?", JNCI JOURNAL OF THE NATIONAL CANCER INSTITUTE, vol. 102, no. 7, 7 April 2010 (2010-04-07), pages 464 - 474, XP055073612, ISSN: 0027-8874, DOI: 10.1093/jnci/djq025
SARKAR I N ET AL: "Characteristic attributes in cancer microarrays", JOURNAL OF BIOMEDICAL INFORMATICS, ACADEMIC PRESS, NEW YORK, NY, US, vol. 35, no. 2, 1 April 2002 (2002-04-01), pages 111 - 122, XP008101629, ISSN: 1532-0464, [retrieved on 20021028], DOI: 10.1016/S1532-0464(02)00504-X
DETTERBECK ET AL.: "Lung Cancer Stage Classification", CHEST, vol. 15, 2017, pages 193 - 203
VAN'T VEER ET AL.: "Enabling personalized cancer medicine through analysis of gene-expression patterns", NATURE, vol. 452, 2008, pages 564 - 570, XP002664142, DOI: 10.1038/nature06915
VARGAS ET AL.: "Biomarker development in the precision medicine era: lung cancer as a case study", NAT. REV. CANCER, vol. 16, 2016, pages 525 - 537
KUMAR-SINHA ET AL.: "Precision oncology in the age of integrative genomics", NAT. BIOTECHNOL., vol. 36, 2018, pages 46 - 60
BEER ET AL.: "Gene-expression profiles predict survival of patients with lung adenocarcinoma", NAT MED, vol. 8, 2002, pages 816 - 824
KRYSTANEK ET AL.: "A Robust prognostic gene expression signature for early stage lung adenocarcinoma", BIOMARK RES, vol. 4, 2016, pages 4
WISTUBA ET AL.: "Validation of a Proliferation Based Expression Signature as Prognostic Marker in Early Stage Lung Adenocarcinoma", CLIN CANCER RES, 2013
SUBRAMANIAN ET AL.: "Gene Expression-Based Prognostic Signatures in Lung Cancer; Ready for Clinical Use?", JNCI J NATL CANCER INST, vol. 102, 2010, pages 464 - 474, XP055073612, DOI: 10.1093/jnci/djq025
SHUKLA ET AL.: "Development of a RNA-seq Based Prognostic Signature in Lung Adenocarcinoma", JNCI J NATL CANCER INST, vol. 109, 2017
JAMAL-HANJANI ET AL.: "Tracking genomic cancer evolution for precision medicine: the lung TRACERx study", PLOS BIOL, vol. 12, 2014, XP055295763, DOI: 10.1371/journal.pbio.1001906
LI ET AL.: "Development and Validation of an Individualized Immune Prognostic Signature in Early-Stage Nonsquamous Non-Small Cell Lung Cancer", JAMA ONCOL, 2017
GYANCHANDANI ET AL.: "Intratumour Heterogeneity Affects Gene Expression Profile Test Prognostic Risk Stratification in Early Breast Cancer", CLIN CANCER RES, vol. 22, 2016, pages 5362 - 5369
JAMAL-HANJANI ET AL.: "Tracking the evolution of non-small cell lung cancer", N ENGL J MED, vol. 376, 2017, pages 2109 - 2121, XP055407143, DOI: 10.1056/NEJMoa1616288
BURRELL ET AL.: "The Causes and Consequence of Genetic Heterogeneity in Cancer Evolution", NATURE, vol. 501, 2013, pages 338 - 345
GEISS ET AL., NAT BIOTECHNOL., vol. 26, no. 3, March 2008 (2008-03-01), pages 317 - 25
DAS ET AL.: "NanoString expression profiling identifies candidate biomarkers of RAD001 response in metastatic gastric cancer", ESMO OPEN, 2016, pages 1 - 9
BLACKHALL ET AL.: "Stability and Heterogeneity of Expression Profiles in Lung Cancer Specimens Harvested Following Surgical Resection", NEOPLASIA, 2004
MLECNIK ET AL.: "Comprehensive Intrametastatic Immune Quantification and Major Impact of Immunoscore on Survival", JNCI J NATL CANCER INST, vol. 110, 2018, pages 97 - 108
YACHIDA ET AL.: "Distant metastasis occurs late during the genetic evolution of pancreatic cancer", NATURE, vol. 467, 2010, pages 1114 - 1117, XP055501147, DOI: 10.1038/nature09515
LOVE ET AL.: "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2", GENOME BIOL, vol. 15, 2014, pages 550, XP021210395, DOI: 10.1186/s13059-014-0550-8
GYANCHANDANI ET AL.: "Intratumor Heterogeneity Affects Gene Expression Profile Test Prognostic Risk Stratification in Early Breast Cancer", CLIN. CANCER RES., vol. 22, 2016, pages 5362 - 5369
FRIEDMAN ET AL.: "Regularised Paths for Generalized Linear Models via Coordinate Descent", J STAT SOFTW, vol. 33, 2010, pages 1 - 22, XP055480579, DOI: 10.18637/jss.v033.i01
SIMON ET AL.: "Regularisation Paths for Cox's Proportional Hazards Model via Coordinate Descent", J STAT SOFTW, vol. 39, 2011, pages 1 - 13
DOBIN ET AL.: "STAR: ultrafast universal RNA-seq aligner", BIOINFORMATICS, vol. 29, 2013, pages 15 - 21, XP055500895, DOI: 10.1093/bioinformatics/bts635
LI ET AL.: "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome", BMC BIOINFORMATICS, vol. 12, 2011, pages 323, XP021104619, DOI: 10.1186/1471-2105-12-323
DJUREINOVIC ET AL.: "Profiling cancer testis antigens in non-small-cell lung cancer", JCI INSIGHT, vol. 1, 2016, XP055586497, DOI: 10.1172/jci.insight.86837
KIM ET AL.: "TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions", GENOME BIOL, vol. 14, 2013, pages R36, XP021151894, DOI: 10.1186/gb-2013-14-4-r36
LIAO ET AL.: "The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote", NUCLEIC ACIDS RES, vol. 41, 2013, pages e108
DURINCK ET AL.: "Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor package biomaRt", NAT PROTOC, vol. 4, 2009, pages 1184 - 1191
KRATZ ET AL.: "A practical molecular assay to predict survival in resected non-squamous, non-small lung cancer: development and international validation studies", LANCET, vol. 379, 2012, pages 823 - 832, XP055160043, DOI: 10.1016/S0140-6736(11)61941-7
D. BISWAS ET AL.: "A clonal expression biomarker associates with lung cancer mortality", NATURE MEDICINE, vol. 25, 2019, pages 1540 - 1548, XP036901640, DOI: 10.1038/s41591-019-0595-z
GENTLES ET AL.: "The prognostic landscape of genes and infiltrating immune cells across human cancers", NAT MED, vol. 21, 2015, pages 938 - 945, XP055534033, DOI: 10.1038/nm.3909
JAMAL-HANJANI ET AL.: "Tracking the Evolution of Non-Small Cell Lung Cancer", N ENGL J MED, vol. 376, 2017, pages 2109 - 2121, XP055407143, DOI: 10.1056/NEJMoa1616288
Attorney, Agent or Firm:
APPLEYARD LEES IP LLP (GB)
Claims:
1. A method for providing a prognosis for a subject with lung cancer, the method comprising:

(a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1;

(b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and

(c) providing a prognosis for the lung cancer based on the risk score of the subject.

2. The method of claim 1 wherein determining a risk score of the subject comprises:

for each of the biomarkers, determining a score indicative of nucleic acid levels of expression in the tissue sample;

calculating a riskscore based on the determined scores, wherein the riskscore is calculated by summing weighted biomarker scores, wherein the biomarker scores are based on the determined scores and each biomarker score has an associated weight; and

comparing the riskscore to a threshold.

3. The method of claim 2 wherein the associated weight for each of the biomarker scores for GOLGA8A, SCPEP1, SLC46A3 and XBP1 has a negative value and the associated weight for the biomarker score for ANLN, ASPM, CDCA4, ERRFI1, FURIN, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SNX7 and TPBG has a positive value.

4. The method of claim 2 or claim 3, wherein the weighted sum for the riskscore is:

$$\text{riskscore} = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_n x_{ni}$$

where $x_{1i}, \ldots, x_{ni}$ are the biomarker scores for the four selected biomarkers for each subject $i$ and $\beta_1, \ldots, \beta_n$ are a set of associated weights for each biomarker score.

5. The method of claim 4, further comprising determining the weights for the weighted sum using a Cox proportional hazard model which is trained using training data comprising information on a plurality of biomarkers in a set of subjects.

6. The method of claim 5, further comprising identifying the plurality of biomarkers to be used in the Cox proportional hazard model, wherein the plurality of biomarkers are selected from the group comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.

7. The method of any one of claims 2 to 6, wherein the threshold is the median riskscore for the training data.

8. The method of any one of the preceding claims, wherein determining a score indicative of a level of the biomarker comprises determining a scaled intensity score.

9. The method of claim 8, wherein the biomarker score is based on the scaled intensity score which has been adjusted by subtracting an adjustment factor.

10. The method of any one of the preceding claims, wherein determining a score indicative of a level of the biomarker comprises awarding a first value when the level is above a threshold and a second value when the level is below the threshold.

11. The method of any one of the preceding claims, wherein determining a score indicative of a level of the biomarker comprises awarding a first value when the level is above an upper threshold, a second value when the level is below the upper threshold but above a lower threshold and a third value when the level is below the lower threshold.

12. The method of any one of the preceding claims, wherein the reagents are nucleic acids.

13. The method of any one of the preceding claims, wherein the lung cancer is non-small cell lung cancer (NSCLC).

14. The method of claim 13 wherein the NSCLC is selected from invasive adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), large cell carcinoma, adenosquamous carcinoma, carcinosarcoma or large cell neuroendocrine.

15. The method of claim 13 or 14 wherein the NSCLC is stage I, stage II, stage III or stage IV.

16. The method of any one of the preceding claims, wherein the sample is from a surgically resected tumour.

17. The method of any one of the preceding claims, wherein the sample is from lung tissue or a lung tumour biopsy.

18. The method of any one of the preceding claims, wherein the prognosis provides a risk assessment.

19. The method of any one of the preceding claims, wherein the method further comprises determining a treatment.

20. The method of claim 19, wherein said treatment is selected from surgical treatment, chemotherapy, surgery, radiotherapy, immunotherapy or CAR-T therapy.

21. A method for determining a treatment for a subject, the method comprising the method of any one of claims 1 to 18 and further comprising the further step of determining a treatment.

22. The method of claim 21, wherein said treatment is selected from surgical treatment, chemotherapy, surgery, radiotherapy, immunotherapy or CAR-T therapy.

23. A composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers comprising or consisting of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

24. A kit comprising reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

25. The composition of claim 23 or the kit of claim 24 wherein the reagents are nucleic acids.

26. Use of composition of claim 23 or 25 or a kit of claims 24 and 25 in a method for providing a prognosis for a subject with lung cancer according to any one of claims 1 to 18.

27. Use of a composition of claim 23 or 25 or a kit of claims 24 and 25 in a method for providing a treatment for a subject with lung cancer according to claim 19 or 20.

28. Use of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a prognosis for a subject with lung cancer according to any one of claims 1 to 18.

29. Use of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a treatment for a subject with lung cancer according to claim 19 or 20.

30. A method of treatment of a subject with lung cancer comprising the steps of predicting a level of risk of mortality for a subject with lung cancer the method comprising: (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1;

(b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples;

(c) comparing the riskscore to a threshold to predict whether the subject is high risk of mortality;

(d) selecting a treatment; and

(e) administering the treatment.

31. A method for generating a biomarker signature for a subject with cancer, the method comprising:

generating training data from a plurality of subjects who have had cancer, the training data comprising gene expression data for a plurality of genes for each of the plurality of subjects;

calculating both an intra-tumour heterogeneity measure and an inter-tumour heterogeneity measure for each gene in the plurality of genes based on the gene expression data; and

applying a heterogeneity filter to select genes having both an intra-tumour heterogeneity below an intra-tumour heterogeneity threshold and an inter-tumour heterogeneity above an inter-tumour heterogeneity threshold;

wherein the biomarker signature comprises at least some of the selected genes.

32. The method of claim 31, further comprising:

calculating a concordance score for each gene; and

applying a concordance filter to select genes having a concordance score below a concordance threshold.

33. The method of claim 32, wherein the concordance score is calculated for the selected genes after applying the heterogeneity filter.

34. The method of any one of claims 31 to 33, wherein the intra-tumour heterogeneity measure for each gene is calculated by:

obtaining values for the gene expression of each gene at multiple locations within the same tumour,

calculating, for each tumour, a measure which is indicative of the obtained gene expression values of each gene, and

obtaining the intra-tumour heterogeneity measure as the average value of the indicative measure for each gene in each tumour.

35. The method of claim 34, wherein the measure which is indicative of the gene expression values is selected from the standard deviation, the median absolute deviation and the coefficient of variation.

36. The method of any one of claims 31 to 35, wherein the inter-tumour heterogeneity measure is calculated by:

obtaining values for the gene expression at one of multiple regions in a tumour for each subject; and

taking the standard deviation across the obtained values.

37. The method of claim 36, further comprising iterating the obtaining and taking steps multiple times and averaging the standard deviation across iterations to obtain the inter-tumour heterogeneity measure.

38. The method of any one of claims 31 to 37, wherein the biomarker signature is prognostic, and the method further comprises:

generating training data comprising associated survival data for each of the plurality of subjects;

calculating a prognostic measure for each of the plurality of genes based on the survival data; and

applying a prognostic filter to select genes having a prognostic measure above a prognostic threshold.

39. The method of claim 38, wherein the prognostic measure is calculated using Cox univariate regression analysis.

40. The method of any one of claims 31 to 37, wherein the biomarker signature is predictive for a response of a subject to a particular treatment, and the method further comprises:

generating training data comprising associated response data for each of the plurality of subjects;

calculating a predictive measure for each of the plurality of genes based on the response data; and

applying a predictive filter to select genes having a predictive measure above a predictive threshold.

41. A method for providing a prognosis for a subject with cancer, the method comprising:

contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers in the signature generated using the method of any one of claims 31 to 40;

determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and

providing a prognosis for the cancer based on the risk score of the subject.

42. A method for determining a treatment for a subject, the method comprising:

the method of providing a prognosis of claim 41; and

further comprising the further step of determining a treatment.

43. A composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers in the signature generated using the method of any one of claims 31 to 40.

44. A kit comprising reagents that specifically bind to each member of a panel of biomarkers in the signature generated using the method of any one of claims 31 to 40.

45. Use of the biomarkers in the signature generated using the method of any one of claims 31 to 39 in a method for providing a prognosis for a subject with cancer.

46. Use of the biomarkers in the signature generated using the method of any one of claims 31 to 40 in a method for providing a treatment for a subject with cancer.

47. A method of treatment of a subject with cancer comprising the steps of predicting a level of risk of mortality for a subject with cancer the method comprising:

contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers in the signature generated using the method of any one of claims 31 to 40;

determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples;

comparing the riskscore to a threshold to predict whether the subject is high risk of mortality;

selecting a treatment; and

administering the treatment.

48. A method for providing a prognosis for a subject with lung cancer, the method comprising:

(a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers, the panel comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1;

(b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and

(c) providing a prognosis for the lung cancer based on the risk score of the subject.

49. A composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers comprising or consisting of at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

50. A kit comprising reagents that specifically bind to each member of a panel of biomarkers comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

51. The composition of claim 48 or the kit of claim 50 wherein the reagents are nucleic acids.

52. Use of composition of claim 49 or a kit of claims 50 and 51 in a method for providing a prognosis for a subject with lung cancer according to claim 48.

53. Use of at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a prognosis for a subject with lung cancer according to claim 48.

54. A method of treatment of a subject with lung cancer comprising the steps of predicting a level of risk of mortality for a subject with lung cancer the method comprising:

(a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers, the panel comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1;

(b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples;

(c) comparing the riskscore to a threshold to predict whether the subject is high risk of mortality;

(d) selecting a treatment; and

(e) administering the treatment.
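
The signature-generation steps recited in claims 31 and 34 to 37 can be sketched as follows. This is only an illustrative reading of the claim language, not the applicant's implementation: the gene names, expression values, thresholds, iteration count and the use of the sample standard deviation are all assumptions made for the example.

```python
import random
from statistics import mean, stdev

def intra_tumour_heterogeneity(tumours):
    """Average, over tumours, of the SD of one gene across regions of a tumour.

    `tumours` is a list with one entry per tumour; each entry is the list of
    expression values of a single gene at multiple regions of that tumour.
    """
    return mean(stdev(regions) for regions in tumours)

def inter_tumour_heterogeneity(tumours, iterations=200, seed=0):
    """Pick one region per tumour, take the SD across tumours, repeat, average."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sds = []
    for _ in range(iterations):
        one_region_each = [rng.choice(regions) for regions in tumours]
        sds.append(stdev(one_region_each))
    return mean(sds)

def heterogeneity_filter(expression, intra_max, inter_min):
    """Keep genes with low intra-tumour and high inter-tumour heterogeneity."""
    return [
        gene
        for gene, tumours in expression.items()
        if intra_tumour_heterogeneity(tumours) < intra_max
        and inter_tumour_heterogeneity(tumours) > inter_min
    ]

# Hypothetical expression values: gene -> tumours -> regions.
expression = {
    "GENE_A": [[1.0, 1.1, 0.9], [5.0, 5.1, 4.9], [9.0, 9.1, 8.9]],  # stable within, varied between
    "GENE_B": [[1.0, 9.0, 5.0], [2.0, 8.0, 5.0], [1.0, 9.0, 4.0]],  # varied within tumours
    "GENE_C": [[3.0, 3.1], [3.0, 2.9], [3.1, 3.0]],                 # flat everywhere
}
selected = heterogeneity_filter(expression, intra_max=0.5, inter_min=1.0)
print(selected)
```

Only the hypothetical GENE_A passes both thresholds here, because it is the only gene whose expression is homogeneous within each tumour yet differs between tumours.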

Description:
Method of predicting survival rates for cancer patients

FIELD

[01] The invention relates to a method of determining the prognosis of cancer patients and/or predicting treatment response to guide therapeutic options and/or survival rates and/or survival risks and/or clinical outcomes for cancer patients and/or a method of determining whether a therapy is appropriate for a particular cancer patient and/or a method of determining the treatment course (such as a method for stratification of therapy regimen) for cancer patients, particularly those with lung cancers such as non-small cell lung cancer.

BACKGROUND

[02] Lung cancer is the leading cause of global cancer mortality, with non-small cell lung cancer (NSCLC) accounting for 85-90% of cases diagnosed worldwide. As described in “Lung Cancer Stage Classification” by Detterbeck et al published in CHEST 15, 193-203, 2017, tumour stage helps inform the clinical decision to administer adjuvant chemotherapy. However, as described in “Biomarker development in the precision medicine era: lung cancer as a case study” by Vargas et al published in Nat Rev Cancer (2016), TNM stage is an imperfect predictor of survival risk, as patients with the same tumour stage can have markedly different clinical outcomes.

[03] There have been suggestions that cancer patients may be stratified into more precise disease subtypes by incorporating molecular biomarkers, such as gene-expression based correlates of tumour aggressiveness, into current diagnostic criteria. Examples are described in “Enabling personalized cancer medicine through analysis of gene-expression patterns” by Van’t Veer et al in Nature 452, 564-570 (2008), “Biomarker development in the precision medicine era: lung cancer as a case study” by Vargas et al in Nat. Rev. Cancer 16, 525-537 (2016) and “Precision oncology in the age of integrative genomics” by Kumar-Sinha et al in Nat. Biotechnol. 36, 46-60 (2018). Accurate identification of patients at high risk of NSCLC recurrence after surgery may have considerable clinical utility, helping to inform decisions such as whether to administer adjuvant chemotherapy or the required intensity of patient follow-up after surgical resection.

[04] Multiple attempts have been made over the last two decades to derive a prognostic gene expression signature for patients with lung adenocarcinoma (LUAD), the most common histological subtype of NSCLC. Examples are described in “Gene-expression profiles predict survival of patients with lung adenocarcinoma” by Beer et al in Nat Med 8, 816-824 (2002), “A Robust prognostic gene expression signature for early stage lung adenocarcinoma” by Krystanek et al published in Biomark Res 4, 4 (2016) and “Validation of a Proliferation Based Expression Signature as Prognostic Marker in Early Stage Lung Adenocarcinoma” by Wistuba et al published in Clin Cancer Res (2013). However, these efforts have been hindered by poor reproducibility, or limited prognostic power independent of existing clinicopathological risk factors as described for example in “Gene Expression-Based Prognostic Signatures in Lung Cancer; Ready for Clinical Use?” by Subramanian et al in JNCI J Natl Cancer Inst 102, 464-474 (2010).

[05] Figures 1a to 1d illustrate some of the problems associated with known signatures. Figure 1a shows a lung 10 comprising a tumour 12. There are multiple regions R1, R2, R3 and R4 from which a biopsy of the lung can be taken. However, as schematically illustrated in red and blue, biopsies taken from regions R1, R2 and R3 will result in a high risk classification using known prognostic biomarkers and a biopsy taken from region R4 will result in a low risk classification. Typically, in routine clinical practice, diagnosis or prognosis is made using a single biopsy 14. Accordingly, the hypothetical prognostic signature illustrated in Figure 1a will exhibit a discordant risk classification of the tumour as the result of a biopsy from region R4 is not aligned with the result that would be achieved if the sample was taken from a different region. Thus, the read-out of the signature is vulnerable to tumour sampling bias.

[06] Figure 1b illustrates the effect of the tumour sampling bias on a patient population. A plurality of lung tumours 20, 22, 24, 26, 28, 30 each having multiple regions (e.g. R1 to R5) for sampling are shown. A prognostic biomarker applied to a biopsy from one of the regions stratifies lung cancer patients 40, 42, 44, 46, 48, 50 into more precise disease subtypes based on estimated survival risk and this may help inform therapeutic decision making. Correctly distinguishing high-risk patients in need of adjuvant chemotherapy from low risk patients for whom surgery alone is curative is important.

[07] In each of the regions of the lung tumours 20, 22, a biopsy would correctly result in a low-risk classification for the associated patients 40, 42 and thus these patients would be classified as being suitable for treatment by surgical resection alone. Similarly, in each of the regions of the lung tumours 28, 30, a biopsy would correctly result in a high-risk classification for the associated patients 48, 50 and thus these patients would be classified as requiring treatment by surgical resection and adjuvant chemotherapy. However, a third patient 44 has a lung tumour 24 similar to that illustrated in Figure 1a. As illustrated, the biopsy from region R4 results in a classification as low risk which is not in line with the classification which results from biopsies from other regions within the lung tumour 24. This is significant because the patient is unlikely to receive adjuvant treatment based on this diagnosis and is thus not receiving sufficient treatment. Thus, the patient has sub-optimal treatment and follow-up. Similarly, a fourth patient 46 has a lung tumour 26 which will produce different results depending on where the biopsy is sampled from. In this illustration, the biopsy provides a high risk classification which may result in the patient being subject to unnecessary treatment and thus subject to the side effects of chemotherapy.

[08] Figure 1c shows the result of analysing a known signature for LUAD described in “Development of a RNA-seq Based Prognostic Signature in Lung Adenocarcinoma” by Shukla et al in JNCI J Natl Cancer Inst 109 (2017). The signature is analysed using information from the TRACERx lung trial which is the world’s largest multi-region sequencing study, enabling detailed exploration of tumour evolution. The study is described for example in “Tracking genomic cancer evolution for precision medicine: the lung TRACERx study” by Jamal-Hanjani et al published in PLoS Biol 12 (2014). In Figure 1c, 89 tumour regions from 28 patients within the TRACERx study are analysed and as plotted on the graph, each patient is ordered by predicted survival “riskscore” with the lowest risk patients at the left of the graph. To calculate the “riskscore” in this example, regression coefficients were re-derived from supplementary data provided in the original publication by fitting a linear model without intercept through regression of the calculated riskscore on expression values of the four genes in the signature. Each point on Figure 1c represents a single tumour region and the vertical lines display the range of the riskscore for each patient. 11 patients are classified as low risk and 5 are classified as high risk regardless of the location of the biopsy. However, there are 12 discordant patients where the riskscore depends on the location of the biopsy.
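
The three-way split into low risk, high risk and discordant patients described above can be sketched as follows. This is a minimal illustration only: the patient identifiers, riskscores and the threshold of 0.0 are hypothetical, and treating a riskscore strictly above the threshold as high risk is an assumption.

```python
def classify_patient(region_scores, threshold):
    """Label one patient from the riskscores of all sampled tumour regions."""
    if not region_scores:
        raise ValueError("at least one tumour region is required")
    # Assumption: a riskscore strictly above the threshold counts as high risk.
    calls = {"high" if score > threshold else "low" for score in region_scores}
    if len(calls) > 1:
        return "discordant"  # the call depends on which region is biopsied
    return "concordant " + next(iter(calls))

def summarise(cohort, threshold):
    """Count patients in each category across a multi-region cohort."""
    counts = {"concordant low": 0, "concordant high": 0, "discordant": 0}
    for region_scores in cohort.values():
        counts[classify_patient(region_scores, threshold)] += 1
    return counts

# Hypothetical multi-region riskscores, one list per patient.
cohort = {
    "P1": [-1.2, -0.8, -0.5],  # every region below the threshold
    "P2": [0.4, 0.9],          # every region above the threshold
    "P3": [-0.3, 0.6, 0.1],    # regions fall on both sides
}
summary = summarise(cohort, threshold=0.0)
print(summary)
```

A patient is discordant exactly when the range of regional riskscores, drawn as a vertical line in Figure 1c, crosses the threshold.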

[09] Figure 1d presents the data in Figure 1c as a bar chart with the percentages of low risk, high risk and discordant patients. Figure 1e is a similar bar chart for a different signature based on immune-related gene pairs which is described in “Development and Validation of an Individualized Immune Prognostic Signature in Early-Stage Nonsquamous Non-Small Cell Lung Cancer” by Li et al published in JAMA Oncol (2017). In both cases, there is a significant proportion of discordant patients - 43% or 29% - whereby different regions from the same tumour may be classified as harbouring distinct profiles of molecular risk. The high proportion of patients vulnerable to tumour sampling bias potentially limits the clinical utility of such prognostic assays.
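
As a check on the arithmetic, and assuming the 43% figure refers to the patient counts reported for Figure 1c (11 low risk, 5 high risk and 12 discordant out of 28 patients), the percentages work out as:

```latex
\[
\frac{12}{28} \approx 42.9\% \approx 43\%, \qquad
\frac{11}{28} \approx 39.3\%, \qquad
\frac{5}{28} \approx 17.9\%.
\]
```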

[10] To date the majority of gene expression based prognostic signatures in LUAD have been defined using microarray expression profiling, rather than RNA-sequencing. Figure 1f shows the concordance results for nine published LUAD prognostic signatures detailed in the table below. The number of patients n in each paper is indicated. Hierarchical clustering was performed for each prognostic signature using the Ward method on the Manhattan metric as described in “Intratumour Heterogeneity Affects Gene Expression Profile Test Prognostic Risk Stratification in Early Breast Cancer” by Gyanchandani et al in Clin Cancer Res 22, 5362-5369 (2016). For a given number of clusters, clustering concordance is quantified as the percentage of patients with all tumour regions in the same cluster. The results are plotted as the percentage of patients with tumour regions clustering together against the number of clusters. Vertical dashed lines mark the range of clusters (2, 3, 14 and 28):

[11] At 28 clusters, the median clustering discordance rate was 50% (15.5/28 LUAD tumours) indicating that half the tumour regions would be at risk of misclassification due to sampling bias. The range was between 18-82% indicating that some signatures perform significantly better than others. Taken together, Figures 1a to 1f illustrate that sampling bias can confound the use of molecular biomarkers in several cancer types. As described in “Tracking the evolution of non-small cell lung cancer” by Jamal-Hanjani et al in N Engl J Med 376, 2109-2121 (2017), ITH and chromosomal instability (CIN) are common features of NSCLC and other types of cancer. Furthermore, genetic intra-tumour heterogeneity (ITH) is prevalent across cancer types as described in “The Causes and Consequence of Genetic Heterogeneity in Cancer Evolution” by Burrell et al published in Nature 501, 338-345 (2013).
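The clustering concordance used for Figure 1f, namely the percentage of patients with all tumour regions in the same cluster, can be sketched as follows. The sketch assumes that cluster labels have already been obtained from a separate hierarchical clustering step; the function name and the example labels are hypothetical.

```python
def clustering_concordance(region_clusters):
    """Percentage of patients whose tumour regions all share one cluster label.

    region_clusters maps a patient id to a list of cluster labels,
    one label per tumour region (labels assumed to come from a prior
    hierarchical clustering step that is not shown here).
    """
    concordant = sum(
        1 for labels in region_clusters.values() if len(set(labels)) == 1
    )
    return 100.0 * concordant / len(region_clusters)

# Hypothetical cluster assignments (illustrative only).
assignments = {"P1": [0, 0, 0], "P2": [1, 2], "P3": [2, 2], "P4": [0, 1, 1]}
concordance = clustering_concordance(assignments)
discordance = 100.0 - concordance
```

The discordance rate is simply the complement of the concordance for a given number of clusters.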

[12] Background information on previous lung cancer prognostic signatures can be found in International Patent Publication W0201/063121 (describes using a 16-gene prognostic signature to classify non-small cell lung cancer (NSCLC) patients into risk groups); US Patent Publication US2010/184063 (describes using a 15-gene prognostic and predictive signature to classify NSCLC patients into risk groups); and International Patent Publication WO2015/138769 (describes using a 9-gene prognostic signature to classify NSCLC patients into risk groups).

[13] The present applicant has recognised the need for improved gene signatures to assist clinicians to refine prognostic accuracy to help inform therapeutic decision-making, e.g. to choose between surgical resection alone or surgical resection followed by chemotherapy or another adjuvant treatment.

SUMMARY

[14] According to the present invention there is provided an apparatus and method as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows.

[15] We describe a method for providing a prognosis for a subject with lung cancer, the method comprising: (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1; (b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and (c) providing a prognosis for the lung cancer based on the riskscore of the subject.

[16] Determining a riskscore of the subject may comprise: for each of the biomarkers, determining a score indicative of nucleic acid levels of expression in the tissue sample; calculating a riskscore based on the determined scores, wherein the riskscore is calculated by summing weighted biomarker scores, wherein the biomarker scores are based on the determined scores and each biomarker score has an associated weight; and comparing the riskscore to a threshold. In this way, each subject may for example be stratified into a high risk group (e.g. a riskscore above the threshold) or a low risk group (e.g. a riskscore equal to or below the threshold). For example, when considering all types of lung cancer, the high risk group may have a low survival outcome and the low risk group may have a good chance of survival. Alternatively, when considering early stage cancers, the high risk group may be more likely to relapse than the low risk group. The associated weight for each of the biomarker scores for GOLGA8A, SCPEP1, SLC46A3 and XBP1 may have a negative value indicating that they are genes which are favourable. The associated weight for the biomarker score for ANLN, ASPM, CDCA4, ERRFI1, FURIN, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SNX7 and TPBG may have a positive value.

[17] The weighted sum for the riskscore may be determined from:

$\mathrm{riskscore} = b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni}$

where $x_{1i}, \dots, x_{ni}$ are the biomarker scores for the n selected biomarkers for each subject i and $b_1, \dots, b_n$ are a set of associated weights for each biomarker score.
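As an illustration of the weighted sum above, the following minimal sketch computes a riskscore from a dictionary of biomarker scores and a dictionary of weights. The weights and scores are hypothetical values chosen for easy arithmetic, not coefficients from the description; only the signs follow the text (a negative weight for the favourable gene XBP1, positive weights for ANLN and PLK1).

```python
def riskscore(biomarker_scores, weights):
    """Weighted sum b_1*x_1 + ... + b_n*x_n for one subject."""
    missing = set(weights) - set(biomarker_scores)
    if missing:
        # Every weighted biomarker needs a score for the sum to be defined.
        raise KeyError(f"missing biomarker scores: {sorted(missing)}")
    return sum(weights[gene] * biomarker_scores[gene] for gene in weights)

# Hypothetical weights and scores (illustrative only).
weights = {"ANLN": 0.5, "PLK1": 0.25, "XBP1": -0.5}
scores = {"ANLN": 2.0, "PLK1": 4.0, "XBP1": 1.0}
score = riskscore(scores, weights)  # 0.5*2.0 + 0.25*4.0 - 0.5*1.0
```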

[18] The method may further comprise determining the weights for the weighted sum using a Cox proportional hazard model which is trained using training data comprising information on a plurality of biomarkers in a set of subjects. The method may comprise identifying the plurality of biomarkers to be used in the Cox proportional hazard model, wherein the plurality of biomarkers are selected from the group comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.

[19] The threshold may be the median riskscore for the training data.
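A minimal sketch of using the median riskscore of the training data as the threshold, and then stratifying subjects, might look as follows. The training riskscores are hypothetical; the rule that a riskscore equal to the threshold is low risk follows the example given earlier in the description.

```python
from statistics import median

def fit_threshold(training_riskscores):
    """Threshold taken as the median riskscore of the training data."""
    return median(training_riskscores)

def risk_group(score, threshold):
    """High risk above the threshold; low risk when equal to or below it."""
    return "high" if score > threshold else "low"

# Hypothetical training riskscores (illustrative only).
training = [0.25, 1.5, 0.75, 2.0, 1.0]
threshold = fit_threshold(training)
groups = [risk_group(s, threshold) for s in training]
```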

[20] Determining a score indicative of a level of the biomarker may comprise determining a scaled intensity score. The biomarker score may be based on the scaled intensity score which has been adjusted by subtracting an adjustment factor. Determining a score indicative of a level of the biomarker may comprise awarding a first value when the level is above a threshold and a second value when the level is below the threshold. Determining a score indicative of a level of the biomarker may comprise awarding a first value when the level is above an upper threshold, a second value when the level is below the upper threshold but above a lower threshold and a third value when the level is below the lower threshold.
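The three scoring options in the preceding paragraph can be sketched as small functions. The function names and the default values awarded (1/0 and 2/1/0) are assumptions made for illustration; the description does not specify them, nor how a level exactly equal to a threshold is treated (here it falls into the lower category).

```python
def adjusted_score(scaled_intensity, adjustment):
    """Scaled intensity score adjusted by subtracting an adjustment factor."""
    return scaled_intensity - adjustment

def two_level_score(level, threshold, above=1, below=0):
    """First value when the level is above the threshold, second value otherwise."""
    return above if level > threshold else below

def three_level_score(level, lower, upper, values=(2, 1, 0)):
    """First, second or third value for levels above upper, between, or below lower."""
    first, second, third = values
    if level > upper:
        return first
    if level > lower:
        return second
    return third

# Illustrative calls with hypothetical thresholds.
binary = two_level_score(5.0, threshold=3.0)
ternary = [three_level_score(v, lower=2.0, upper=4.0) for v in (5.0, 3.0, 1.0)]
```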

[21] The reagents may be nucleic acids.

[22] As used herein, the words "nucleic acid", "nucleic acid sequence", "nucleotide", "nucleic acid molecule" or "polynucleotide" are intended to include DNA molecules (e.g., cDNA or genomic DNA), RNA molecules (e.g., mRNA, miRNA, lncRNA), naturally occurring, mutated, synthetic DNA or RNA molecules, and analogues of the DNA or RNA generated using nucleotide analogues. Nucleic acids can be single-stranded or double-stranded. Such nucleic acids or polynucleotides include, but are not limited to, coding sequences of structural genes, anti-sense sequences, and non-coding regulatory sequences that do not encode mRNAs or protein products. These terms also encompass a gene. The term "gene", "allele" or "gene sequence" is used broadly to refer to a DNA nucleic acid associated with a biological function. Thus, genes may include introns and exons as in the genomic sequence, or may comprise only a coding sequence as in cDNAs, and/or may include cDNAs in combination with regulatory sequences. Thus, according to the various aspects of the invention, genomic DNA, cDNA or coding DNA may be used. In one embodiment, the nucleic acid is cDNA or coding DNA.

[23] Analysis of nucleic acids may be carried out using suitable techniques, for example techniques for measuring gene expression, including but not limited to digital PCR, qPCR, microarrays, RNA-Seq or nanostring® assays. In certain embodiments described herein, gene expression is measured by quantifying RNA, including RNA-Seq or nanostring® assays. It will be understood that more than one technique for measuring gene expression may be used.

[24] RNA sequencing (RNA-Seq) is a transcriptome profiling technology based on next-generation sequencing (NGS) platforms. In RNA-Seq, RNA transcripts are reverse-transcribed into cDNA, and adapters are ligated to each end of the cDNA. Sequencing can be either unidirectional (single-end sequencing) or bidirectional (paired-end sequencing), and the reads are then aligned to a reference genome database or assembled to obtain de novo transcripts, providing a genome-wide expression profile. RNA-Seq can qualitatively and quantitatively investigate any RNA type including messenger RNAs (mRNAs), microRNAs, small interfering RNAs, and long noncoding RNAs.

[25] RNA can be analysed using the NanoString nCounter gene expression assay. NanoString is a relatively new molecular profiling technology that can generate accurate genomic information from small amounts of fixed patient tissues. The NanoString platform uses digital, colour-coded barcodes or code sets tagged to sequence-specific probes, allowing quantification of mRNA expression (Geiss et al, Nat Biotechnol. 2008 Mar;26(3):317-25; Das et al, NanoString expression profiling identifies candidate biomarkers of RAD001 response in metastatic gastric cancer, ESMO Open 2016, 1-9). The NanoString system hybridizes two probes to each target transcript: a biotin-labeled capture probe and a fluorescent barcode-labeled reporter probe. Reporter probes hybridize with specific RNAs in a sample and capture probes lock them via avidin onto a static surface. The NanoString nCounter Analysis System counts the immobilized RNAs using their barcodes.

[26] The lung cancer may be non-small cell lung cancer (NSCLC). The NSCLC may be selected from invasive adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), large cell carcinoma, adenosquamous carcinoma, carcinosarcoma, large cell neuroendocrine, undifferentiated non-small cell lung cancer or bronchioalveolar. LUAD and LUSC make up the majority of NSCLC cases and the other types tend to be grouped together. The NSCLC may be stage I, stage II, stage III or stage IV.

[27] The sample may be from a surgically resected tumour. The sample may be from lung tissue or a lung tumour biopsy.

[28] The prognosis may provide a risk assessment.

[29] The method may further comprise determining a treatment. Thus, we also describe a method for determining a treatment for a subject, the method comprising the method described above and the further step of determining a treatment. Said treatment may be selected from surgical treatment, chemotherapy, surgery, radiotherapy, immunotherapy or CAR-T therapy. Such treatments are known in the art. It will be appreciated that there are various types of immunotherapies such as immune checkpoint inhibitors, oncolytic virus therapy, T cell therapy and cancer vaccines. The appropriate therapy may be selected.

[30] We also describe a composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers comprising or consisting of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

[31] We also describe a kit comprising reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

[32] The reagents may be nucleic acids in the composition or kit described above. We also describe use of a composition or a kit in a method for providing a prognosis for a subject with lung cancer as described above. We also describe use of a composition or a kit in a method for providing a treatment for a subject with lung cancer as described above.

[33] We also describe use of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a prognosis for a subject with lung cancer as described above. We also describe use of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a treatment for a subject with lung cancer as described above.

[34] We also describe a method of treatment of a subject with lung cancer comprising the steps of predicting a level of risk of mortality for a subject with lung cancer, the method comprising (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1; (b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; (c) comparing the riskscore to a threshold to predict whether the subject is at high risk of mortality; (d) selecting a treatment; and (e) administering the treatment.

[35] We also describe a method for generating a biomarker signature for a subject with cancer, the method comprising: generating training data from a plurality of subjects who have had cancer, the training data comprising gene expression data for a plurality of genes for each of the plurality of subjects; calculating both an intra-tumour heterogeneity measure and an inter-tumour heterogeneity measure for each gene in the plurality of genes based on the gene expression data; and applying a heterogeneity filter to select genes having both an intra-tumour heterogeneity below an intra-tumour heterogeneity threshold and an inter-tumour heterogeneity above an inter-tumour heterogeneity threshold; wherein the biomarker signature comprises at least some of the selected genes. Such a method may be applicable to a variety of different cancers, especially those associated with ITH.
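The heterogeneity filter can be sketched as a simple selection over per-gene measures. The sketch below assumes that the intra-tumour and inter-tumour heterogeneity measures have already been calculated for each gene; the gene identifiers, measure values and thresholds are hypothetical.

```python
def heterogeneity_filter(intra, inter, intra_threshold, inter_threshold):
    """Select genes with low intra-tumour and high inter-tumour heterogeneity.

    intra and inter map a gene name to its precomputed heterogeneity measure.
    """
    return sorted(
        gene
        for gene in intra
        if gene in inter
        and intra[gene] < intra_threshold
        and inter[gene] > inter_threshold
    )

# Hypothetical per-gene measures (illustrative only).
intra = {"GENE_A": 0.2, "GENE_B": 0.9, "GENE_C": 0.1, "GENE_D": 0.3}
inter = {"GENE_A": 1.5, "GENE_B": 1.8, "GENE_C": 0.4, "GENE_D": 1.1}
selected = heterogeneity_filter(intra, inter, intra_threshold=0.5, inter_threshold=1.0)
```

In this example GENE_B is rejected for varying too much within tumours and GENE_C for varying too little between tumours.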

[36] The method may further comprise: calculating a concordance score for each gene; and applying a concordance filter to select genes having a concordance score below a concordance threshold. The concordance filter may be considered to be a type of heterogeneity filter that removes noisy genes. The concordance score may be calculated for the selected genes after applying the heterogeneity filter. Alternatively, the concordance filter may be applied before calculating both the intra-tumour heterogeneity measure and the inter-tumour heterogeneity measure.

[37] The intra-tumour heterogeneity measure for each gene may be calculated by: obtaining values for the gene expression of each gene at multiple locations within the same tumour, calculating, for each tumour, a measure which is indicative of the obtained gene expression values of each gene, and obtaining the intra-tumour heterogeneity measure as the average value of the indicative measure for each gene in each tumour. The measure which is indicative of the gene expression values may be selected from the standard deviation, the median absolute deviation and the coefficient of variation.
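A minimal sketch of the intra-tumour heterogeneity measure for a single gene is given below, using the standard deviation as the indicative measure. It assumes the sample standard deviation and skips tumours with fewer than two regions; both choices, the function name and the expression values are assumptions for illustration.

```python
from statistics import mean, stdev

def intra_tumour_heterogeneity(expression_by_tumour):
    """Average, over tumours, of the per-tumour standard deviation of one gene.

    expression_by_tumour maps a tumour id to the expression values of the
    gene measured at multiple regions of that tumour.
    """
    per_tumour = [
        stdev(regions)  # sample standard deviation within one tumour
        for regions in expression_by_tumour.values()
        if len(regions) >= 2  # a single region carries no within-tumour spread
    ]
    return mean(per_tumour)

# Hypothetical expression values for one gene (illustrative only).
gene_expression = {"T1": [1.0, 3.0], "T2": [2.0, 2.0, 2.0]}
intra = intra_tumour_heterogeneity(gene_expression)
```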

[38] The inter-tumour heterogeneity measure may be calculated by: obtaining values for the gene expression of each gene for each subject at one of multiple regions in a tumour; and taking the standard deviation across the obtained values. The method may further comprise iterating the obtaining and taking steps multiple times and averaging the standard deviation across iterations to obtain the inter-tumour heterogeneity measure. It will be appreciated that measures other than standard deviation may also be used, for example coefficient of variation and median absolute deviation.
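The iterated sampling described above can be sketched for a single gene as follows. The sketch assumes that the region used for each subject is drawn uniformly at random in each iteration; the number of iterations, the seed, the function name and the expression values are all hypothetical.

```python
import random
from statistics import mean, stdev

def inter_tumour_heterogeneity(expression_by_tumour, iterations=100, seed=0):
    """Average, over iterations, of the standard deviation across subjects.

    In each iteration one region per tumour is sampled at random (an assumed
    sampling scheme) and the sample standard deviation is taken across tumours.
    """
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    sds = []
    for _ in range(iterations):
        sampled = [rng.choice(regions) for regions in expression_by_tumour.values()]
        sds.append(stdev(sampled))
    return mean(sds)

# Hypothetical expression values for one gene (illustrative only).
gene_expression = {"T1": [1.0, 1.2], "T2": [3.0, 3.1], "T3": [5.0, 4.8]}
inter = inter_tumour_heterogeneity(gene_expression)
```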

[39] The biomarker signature may be prognostic. The method may further comprise: generating training data comprising associated survival data for each of the plurality of subjects; calculating a prognostic measure for each of the plurality of genes based on the survival data; and applying a prognostic filter to select genes having a prognostic measure above a prognostic threshold. The prognostic measure may be calculated using Cox univariate regression analysis.
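As an illustration of a univariate Cox regression for one gene, the following pure-Python sketch fits the single coefficient by Newton-Raphson on the partial likelihood, treating tied event times in a Breslow-style manner. It is a didactic sketch under stated assumptions rather than the implementation used for the signature; in practice an established survival analysis library would normally be used. The survival times, event indicators and expression values are hypothetical.

```python
import math

def cox_univariate(times, events, x, iterations=50, tol=1e-9):
    """Fit a one-covariate Cox model; returns the log hazard ratio (coefficient)."""
    beta = 0.0
    for _ in range(iterations):
        grad, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored subjects contribute only through risk sets
            # Risk set: all subjects still under observation at time t_i.
            risk = [x[j] for j, t_j in enumerate(times) if t_j >= t_i]
            w = [math.exp(beta * v) for v in risk]
            s0 = sum(w)
            s1 = sum(wi * v for wi, v in zip(w, risk))
            s2 = sum(wi * v * v for wi, v in zip(w, risk))
            grad += x[i] - s1 / s0               # score contribution
            info += s2 / s0 - (s1 / s0) ** 2     # observed information
        if info == 0.0:
            break  # no variation in any risk set; coefficient not identifiable
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical data: higher expression tends to go with earlier events.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
expression = [2.0, 1.0, 2.5, 0.5, 1.5, 0.2]
beta = cox_univariate(times, events, expression)
hazard_ratio = math.exp(beta)
```

A positive coefficient (hazard ratio above 1) indicates that higher expression is associated with worse survival in this toy data; a prognostic filter could then compare a measure derived from such a fit against a prognostic threshold.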

[40] The biomarker signature may be predictive for a response of a subject to a particular treatment, e.g. immunotherapy. The method may further comprise: generating training data comprising associated response data (e.g. outcome from the particular treatment) for each of the plurality of subjects; calculating a predictive measure for each of the plurality of genes based on the response data; and applying a predictive filter to select genes having a predictive measure above a predictive threshold. The predictive measure may be calculated using regression analysis, correlating gene expression with response to treatment, or proxy measures of treatment response. Such a method may be used to create a predictive signature of treatment response, to help stratify patients for the most appropriate treatment regime. There is thus the potential for a biomarker signature generated as described above to differentiate between cancer subtypes and to determine treatment strategy on the basis of the cancer subtype. It will be appreciated that the method of providing a prognosis, the method for determining a treatment for a subject, the composition, the kit, the method of treatment and the uses described above can be applied to any signature which is generated as described above.

[41] We also describe a method for providing a prognosis for a subject with cancer, the method comprising: contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers in the signature generated as described above; determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and providing a prognosis for the cancer based on the riskscore of the subject. We also describe a method for determining a treatment for a subject, the method comprising the method of providing a prognosis and the further step of determining a treatment. We also describe a composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers in the signature generated as described above. We also describe a kit comprising reagents that specifically bind to each member of a panel of biomarkers in the signature generated as described above.

[42] We also describe use of the biomarkers in the signature generated as described above in a method for providing a prognosis for a subject with cancer. We also describe use of the biomarkers in the signature generated as described above in a method for providing a treatment for a subject with cancer. We also describe a method of treatment of a subject with cancer comprising the steps of predicting a level of risk of mortality for a subject with cancer, the method comprising contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers in the signature generated as described above; determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; comparing the riskscore to a threshold to predict whether the subject is at high risk of mortality; selecting a treatment; and administering the treatment.

[43] There may also be a computer device comprising at least one processor; and instructions that, when executed by the at least one processor cause the computer device to perform any of the determining, calculating and comparing steps of the methods described above. There may also be a tangible non-transient computer-readable storage medium having recorded thereon instructions which, when implemented by a computer device, cause the computer device to be arranged as described above and/or which cause the computer device to perform any of the relevant steps of the methods as described above. There may also be a kit comprising the computer device and a microarray for the tissue sample and/or one or more reagents to determine the presence of the biomarkers.

[44] Thus far, we have described using a panel of biomarkers comprising or consisting of 23 specific biomarkers. We now describe embodiments using a panel of biomarkers comprising two or more biomarkers selected from the 23 specific biomarkers.

[45] We also describe a method for providing a prognosis for a subject with lung cancer, the method comprising: (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers, the panel comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1; (b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and (c) providing a prognosis for the lung cancer based on the riskscore of the subject.

[46] Determining a riskscore of the subject may comprise: for each of the selected biomarkers, determining a score indicative of nucleic acid levels of expression in the tissue sample; calculating a riskscore based on the determined scores, wherein the riskscore is calculated by summing weighted biomarker scores, wherein the biomarker scores are based on the determined scores and each biomarker score has an associated weight; and comparing the riskscore to a threshold. In this way, each subject may for example be stratified into a high risk group (e.g. a riskscore above the threshold) or a low risk group (e.g. a riskscore equal to or below the threshold). For example, when considering all types of lung cancer, the high risk group may have a low survival outcome and the low risk group may have a good chance of survival. Alternatively, when considering early stage cancers, the high risk group may be more likely to relapse than the low risk group. The associated weight for each of the biomarker scores for GOLGA8A, SCPEP1, SLC46A3 and XBP1 may have a negative value indicating that they are genes which are favourable. The associated weight for the biomarker score for ANLN, ASPM, CDCA4, ERRFI1, FURIN, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SNX7 and TPBG may have a positive value.

[47] The weighted sum for the riskscore may be determined from:

$\mathrm{riskscore} = b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni}$

where $x_{1i}, \dots, x_{ni}$ are the biomarker scores for the n selected biomarkers for each subject i and $b_1, \dots, b_n$ are a set of associated weights for each biomarker score.

[48] The method may further comprise determining the weights for the weighted sum using a Cox proportional hazard model which is trained using training data comprising information on a plurality of biomarkers in a set of subjects. The method may comprise identifying the plurality of biomarkers to be used in the Cox proportional hazard model, wherein the plurality of biomarkers are selected from the group comprising ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.

[49] The threshold may be the median riskscore for the training data.

[50] Determining a score indicative of a level of the biomarker may comprise determining a scaled intensity score. The biomarker score may be based on the scaled intensity score which has been adjusted by subtracting an adjustment factor. Determining a score indicative of a level of the biomarker may comprise awarding a first value when the level is above a threshold and a second value when the level is below the threshold. Determining a score indicative of a level of the biomarker may comprise awarding a first value when the level is above an upper threshold, a second value when the level is below the upper threshold but above a lower threshold and a third value when the level is below the lower threshold.

[51] The reagents may be nucleic acids.

[52] The lung cancer may be non-small cell lung cancer (NSCLC). The NSCLC may be selected from invasive adenocarcinoma (LUAD), squamous cell carcinoma (LUSC), large cell carcinoma, adenosquamous carcinoma, carcinosarcoma, large cell neuroendocrine, undifferentiated non-small cell lung cancer or bronchioalveolar. LUAD and LUSC make up the majority of NSCLC cases and the other types tend to be grouped together. The NSCLC may be stage I, stage II, stage III or stage IV.

[53] The sample may be from a surgically resected tumour. The sample may be from lung tissue or a lung tumour biopsy.

[54] The prognosis may provide a risk assessment.

[55] The method may further comprise determining a treatment. Thus, we also describe a method for determining a treatment for a subject, the method comprising the method described above and the further step of determining a treatment. Said treatment may be selected from surgical treatment, chemotherapy, surgery, radiotherapy, immunotherapy or CAR-T therapy. Such treatments are known in the art. It will be appreciated that there are various types of immunotherapies such as immune checkpoint inhibitors, oncolytic virus therapy, T cell therapy and cancer vaccines. The appropriate therapy may be selected.

[56] We also describe a composition comprising a panel of reagents that specifically bind to each member of a panel of biomarkers comprising or consisting of at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1.

[57] We also describe a kit comprising reagents that specifically bind to each member of a panel of biomarkers comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1. The reagents may be nucleic acids in the composition or kit described above. We also describe use of a composition or a kit in a method for providing a prognosis for a subject with lung cancer as described above. We also describe use of a composition or a kit in a method for providing a treatment for a subject with lung cancer as described above.

[58] We also describe use of at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a prognosis for a subject with lung cancer. We also describe use of at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG and XBP1 in a method for providing a treatment for a subject with lung cancer as described above.

[59] We also describe a method of treatment of a subject with lung cancer comprising the steps of predicting a level of risk of mortality for a subject with lung cancer, the method comprising: (a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers, the panel comprising at least two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1; (b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; (c) comparing the riskscore to a threshold to predict whether the subject is at high risk of mortality; (d) selecting a treatment; and (e) administering the treatment.

[60] In each of the embodiments of the invention where the panel of biomarkers comprises a selection of biomarkers, the skilled person will understand that the panel of biomarkers may comprise or consist of at least three biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least four biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least five biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least six biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least seven biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least eight biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.
It will be understood that the panel of biomarkers may comprise or consist of at least nine biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least ten biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least eleven biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least twelve biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least thirteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least fourteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.
It will be understood that the panel of biomarkers may comprise or consist of at least fifteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least sixteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least seventeen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least eighteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least nineteen biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least twenty biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.
It will be understood that the panel of biomarkers may comprise or consist of at least twenty-one biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of at least twenty-two biomarkers selected from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1.

[61 ] In each of the embodiments of the invention where the panel of biomarkers comprises a selection of biomarkers, the skilled person will understand that the panel of biomarkers may comprise or consist of ANLN and at least one of ASPM, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of ASPM and at least one of ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of CDCA4 and at least one of ASPM, ANLN, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of ERRFI1 and at least one of ASPM, ANLN, CDCA4, FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of FURIN and at least one of ASPM, ANLN, CDCA4, ERRFI1 , GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of GOLGA8A and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of ITGA6 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . 
It will be understood that the panel of biomarkers may comprise or consist of JAG1 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of LRP12 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of MAFF and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of MRPS17 and at least one of ASPM, ANLN, CDCA4, ERRFI 1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of PLK1 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of PNP and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of PPP1 R13L and at least one of ASPM, ANLN, CDCA4, ERRFI 1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. 
It will be understood that the panel of biomarkers may comprise or consist of PRKCA and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of PTTG1 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of PYGB and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , RPP25, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of RPP25 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, SCPEP1 , SLC46A3, SNX7, TPBG, XBP1. It will be understood that the panel of biomarkers may comprise or consist of SCPEP1 and at least one of ASPM, ANLN, CDCA4, ERRFI 1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SLC46A3, SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of SLC46A3 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SNX7, TPBG, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of SNX7 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, TPBG, XBP1 . 
It will be understood that the panel of biomarkers may comprise or consist of TPBG and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, XBP1 . It will be understood that the panel of biomarkers may comprise or consist of XBP1 and at least one of ASPM, ANLN, CDCA4, ERRFI1 , FURIN, GOLGA8A, ITGA6, JAG1 , LRP12, MAFF, MRPS17, PLK1 , PNP, PPP1 R13L, PRKCA, PTTG1 , PYGB, RPP25, SCPEP1 , SLC46A3, SNX7, TPBG.

[62] The skilled person would understand that any combination of two or more biomarkers from ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1 may be sufficient to provide a prognosis for a subject with lung cancer or to determine a treatment.

[63] Although a few preferred embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes and modifications might be made without departing from the scope of the invention, as defined in the appended claims.

BRIEF DESCRIPTION OF DRAWINGS

[64] For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example only, to the accompanying diagrammatic drawings in which:

[65] Figure 1a is a schematic illustration of a lung tumour showing sampling sites which illustrate the tumour sampling bias problem;

[66] Figure 1b is a schematic illustration of the steps of a prediction method and the clinical implications of the tumour sampling bias problem of Figure 1a;

[67] Figure 1c plots the riskscore against patient using a particular known signature;

[68] Figures 1d and 1e are bar charts showing the proportions of low, high and discordant risk patients using two known signatures;

[69] Figure 1f plots percentage of patients with tumour regions clustering together against the number of clusters for 9 known signatures;

[70] Figure 2a is a flowchart showing the steps of the method for developing and validating a prognostic signature;

[71] Figure 2b is a flowchart showing the steps of the prognostic method;

[72] Figure 2c is a schematic box diagram showing the components of the system for implementing the method of Figure 2b;

[73] Figures 2d and 2e plot the proportion of patients with all samples in the same cluster against the number of clusters for two genes CKMT2 and HOXC11, respectively;

[74] Figure 2f plots the hierarchical clustering concordance against each gene;

[75] Figure 3a illustrates the steps in calculating RNA intra-tumour heterogeneity for a plurality of genes;

[76] Figure 3b plots the median absolute deviation (MAD) against the standard deviation scores for each gene;

[77] Figure 3c plots the coefficient of variation (CV) against the standard deviation scores for each gene;

[78] Figure 3d illustrates the random sampling process to calculate the RNA inter-tumour heterogeneity measure;

[79] Figure 3e plots values for RNA intra-tumour heterogeneity (y-axis) against values for RNA inter-tumour heterogeneity (x-axis) for a plurality of genes;

[80] Figure 4a plots the prognostic value for each of three signatures in a validation cohort;

[81] Figure 4b shows a Forest plot with predictive value over known risk factors;

[82] Figures 4c, 4d and 4e plot prognostic value of substaging criteria, current clinical guidelines for chemotherapy and the output signatures in stage I patients where improved risk prediction could impact clinical decision-making;

[83] Figure 4f plots the riskscore from the output signature for a plurality of patients;

[84] Figure 4g plots prognostic value assessment using a RNA-Seq dataset and four microarray datasets;

[85] Figure 4h shows a plot indicating that any subset of the ORACLE signature may have a prognostic value;

[86] Figure 5a plots the prognostic association of RNA heterogeneity quadrants across cancer types;

[87] Figure 5b compares the performance of genes within each quadrant for a plurality of cancer types to determine whether the quadrants are enriched or depleted in prognostic genes;

[88] Figure 6a plots gene expression ITH against copy number ITH;

[89] Figure 6b shows the expression difference for subclonal chromosomal copy-number changes (losses and gains respectively);

[90] Figure 6c plots clonal copy-number gains by RNA heterogeneity quadrant; and

[91] Figure 6d plots the enriched reactome pathways in RNA heterogeneity quadrant Q4.

DESCRIPTION OF DRAWINGS

[92] As explained above, Figures 1a to 1f illustrate that sampling bias can confound the use of molecular biomarkers in several cancer types. This is because intratumour heterogeneity (i.e. spatial variation in genetic and transcriptomic features within an individual tumour), as a substrate for tumour evolution, affects the outcome of applying molecular biomarkers: the result may depend on the location of the tumour sample which is being tested. Multiple solutions to the tumour sampling bias problem have been proposed, including leveraging multi-region sequencing (i) to pool multiple biopsies in order to gain a global molecular risk estimate for an individual tumour (as described in “Stability and Heterogeneity of Expression Profiles in Lung Cancer Specimens Harvested Following Surgical Resection” by Blackhall et al published in Neoplasia (2004)); or (ii) to identify the “lethal” subclone with maximal immune evasive (e.g. as described in “Comprehensive Intrametastatic Immune Quantification and Major Impact of Immunoscore on Survival” by Mlecnik et al published in JNCI J Natl Cancer Inst 110, 97-108 (2018)) or metastatic potential (e.g. as described in “Distant metastasis occurs late during the genetic evolution of pancreatic cancer” by Yachida et al published in Nature 467, 1114-1117 (2010)). However, in the clinical setting multi-region sequencing is currently impractical.

[93] Figure 2a is a flowchart of the steps in developing a set of biomarkers (or gene-expression signature) which produces a reliable prognostic result that is applicable to single region tumour samples which are routinely collected in clinical practice. As explained in more detail below, the set of biomarkers comprises genes with low intra-tumour heterogeneity but high inter-tumour heterogeneity which minimises the confounding effects of sampling bias but maximises discriminatory power between patients.

[94] The first step S100 is to collect training data, for example gene expression and survival data from the Cancer Genome Atlas (TCGA) for 959 NSCLC patients who are at stages I to III (469 LUAD patients and 490 LUSC patients). This data forms a training dataset which is used to derive the signature as described below. The downloaded data may be processed as per standard techniques in an RNA-seq pre-processing pipeline to form the training data. For example, alignment to the human genome may be performed, e.g. using the MapSplice package described in 67. Gene expression may then be quantified, e.g. using the GenomicFeatures and GenomicRanges packages from Bioconductor. An expression filter may then be applied keeping genes with at least 0.5 CPM in at least 2 tumour samples, as shown in step S101. Normalised count values are then obtained for filtered genes using a variance stabilizing transformation from the DESeq2 package described in “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2” by Love et al published in Genome Biol 15, 550 (2014). It will be appreciated that data for different patients could also be collected when developing a prognostic signature for a different disease.
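
By way of illustration only, the expression filter of step S101 can be sketched in Python as below. The function names and the representation of each sample as a dictionary of raw read counts are assumptions made for this sketch; they are not taken from the packages named above.

```python
def cpm(counts):
    """Convert one sample's raw counts {gene: count} to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: a sample with no reads contributes zero CPM for every gene.
        return {gene: 0.0 for gene in counts}
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def expression_filter(samples, min_cpm=0.5, min_samples=2):
    """Keep genes with at least min_cpm CPM in at least min_samples samples."""
    per_sample_cpm = [cpm(sample) for sample in samples]
    genes = set().union(*(sample.keys() for sample in samples))
    kept = []
    for gene in genes:
        n_expressed = sum(1 for s in per_sample_cpm if s.get(gene, 0.0) >= min_cpm)
        if n_expressed >= min_samples:
            kept.append(gene)
    return sorted(kept)
```

The defaults mirror the thresholds stated above (0.5 CPM in at least 2 tumour samples); both are parameters so that other cut-offs can be tried.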

[95] The next step S102 is to calculate a prognostic measure for each gene which identifies significantly prognostic genes. A first filtering step S104 is then applied to remove genes based on their prognostic effect (i.e. to select the genes having a prognostic measure above a threshold). Each of these genes has an unknown impact on the overall survival for each patient. The prognostic measure may be calculated using any suitable technique.

[96] For example, Cox univariate regression analysis may be applied. The Cox model is expressed by the hazard function, denoted by h(t), which can be interpreted as the risk of dying at time t. It can be estimated as follows:

$$h(t) = h_0(t) \times \exp(b_1 x_1 + b_2 x_2 + \dots + b_n x_n)$$

where t represents the survival time, h(t) is the hazard function determined by a set of n covariates $(x_1, x_2, \dots, x_n)$ (in this case the genes), the set $(b_1, b_2, \dots, b_n)$ are weights (or coefficients) for each covariate and the term $h_0$ is called the baseline hazard, which corresponds to the value of the hazard if all the $x_j$ are equal to zero (the quantity exp(0) equals 1). The t in h(t) reminds us that the hazard may vary over time. However, the time variance can be removed so that the model can be rewritten in linear form by taking the log of the hazard ratio for patient i to the reference group and this may be written as:

$$\log\left(\frac{h_i(t)}{h_0(t)}\right) = b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni}$$

This linear equation is known as the Cox Proportional Hazards model with a set of n covariates (i.e. genes) $(x_{1i}, x_{2i}, \dots, x_{ni})$ for each patient i and a set $(b_1, b_2, \dots, b_n)$ of weights which optimise the model for all patients. A univariate analysis means considering each variable in turn. Typically for each variable, the coefficient is calculated together with the lower and upper limits for the 95% confidence interval around the coefficient (CI95L and CI95U respectively). The P-value is a measure of the statistical significance of the variable and is calculated either using the Wald-test or the Log-rank test. The Q-value is an adjusted P-value using the Benjamini & Hochberg method.
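
As an illustrative sketch of the final adjustment step only, the Benjamini & Hochberg correction can be computed from a list of per-gene P-values as follows. The function name is hypothetical, and the P-values are assumed to have been obtained already from the univariate analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini & Hochberg adjusted P-values (Q-values) in input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that Q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```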

[97] As shown in step S104, more than one prognostic filter may be applied. For example, a first filter may comprise filtering all genes based on a prognostic significance threshold, e.g. with P<0.05, which in this example may reduce the number of genes from 19026 to 4240. A second filter may be applied to filter genes based on a median threshold, e.g. all genes which have a value for the prognostic measure which is below the median threshold may be removed. In this example, this may reduce the number of genes from 19026 to 9512. The two thresholds together may be considered a prognostic threshold and thus overall the first filtering step may reduce the number of genes from 19026 to 2023.
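
A minimal sketch of how the two prognostic filters might be combined is given below. It assumes that each gene already has a P-value and a single numeric prognostic measure; the function name and the data layout are hypothetical.

```python
from statistics import median


def prognostic_filter(stats, p_cutoff=0.05):
    """Keep genes passing both filters.

    stats maps gene -> (p_value, prognostic_measure). A gene is kept when its
    P-value is below p_cutoff and its measure is not below the cohort median.
    """
    median_measure = median(measure for _, measure in stats.values())
    return sorted(
        gene
        for gene, (p_value, measure) in stats.items()
        if p_value < p_cutoff and measure >= median_measure
    )
```

Only genes passing both filters are retained, which is why the combined count (2023 in the example above) is smaller than either filter gives alone.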

[98] A second filtering step S106 may then be applied. This filter may be termed a clonal expression filter or heterogeneity filter. As explained in more detail below, the clonal expression filter may remove the genes which do not have both low intra-tumour heterogeneity and high inter-tumour heterogeneity (i.e. select the genes which have both low intra-tumour heterogeneity and high inter-tumour heterogeneity). In this example, this may reduce the number of genes from 2023 to 176.

[99] A third filtering step S108 may then be applied. This filter, which may be termed a concordance filter, may short-list the remaining genes based on gene-wise clustering concordance scores. The clustering concordance score may be calculated using any suitable technique. For example, concordance may be determined through hierarchical clustering analysis on cancer expression data where multiple samples have been obtained from each tumour, e.g. using the Ward method on the Manhattan metric as described in “Intratumor Heterogeneity Affects Gene Expression Profile Test Prognostic Risk Stratification in Early Breast Cancer” by Gyanchandani et al published in Clin. Cancer Res. 22, 5362-5369 (2016). Concordance is determined on a per gene level as the percent of tumours where all samples cluster together. The clustering analysis may be run iteratively from 2 to the total number of patients (e.g. 28 in this TRACERx LUAD cohort). For each gene, a curve may be plotted for the number of patients with all regions in the same cluster against the number of clusters. For example, as shown in Figures 2d and 2e, the proportion of patients with all samples in the same cluster has been plotted against the number of clusters (2 to 28) for two genes CKMT2 and HOXC11. The clustering concordance score for each gene is then summarised as the area under the curve. The concordance scores for each gene may be plotted as shown in Figure 2f which shows the hierarchical clustering concordance for each gene. Once the concordance scores for each gene have been calculated, all genes having a concordance score below a concordance threshold may be removed. The concordance threshold (i.e. cut-off) may be determined using ten-fold cross-validation. In this example, this may reduce the number of genes from 176 to 90.
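
The scoring part of the concordance filter can be sketched as follows, assuming that cluster labels for each sample have already been obtained for each number of clusters (the clustering itself is not shown). The function names are hypothetical, and the use of the trapezoidal rule for the area under the curve is an assumption, since the integration method is not specified above.

```python
def concordant_fraction(cluster_of, patient_of):
    """Fraction of patients whose samples all fall in a single cluster.

    cluster_of maps sample -> cluster label; patient_of maps sample -> patient.
    """
    clusters_by_patient = {}
    for sample, cluster in cluster_of.items():
        clusters_by_patient.setdefault(patient_of[sample], set()).add(cluster)
    n_concordant = sum(1 for c in clusters_by_patient.values() if len(c) == 1)
    return n_concordant / len(clusters_by_patient)


def concordance_score(fraction_by_k):
    """Area under the curve of concordant fraction against number of clusters.

    fraction_by_k maps number of clusters -> concordant fraction.
    Assumption: trapezoidal integration between consecutive cluster numbers.
    """
    ks = sorted(fraction_by_k)
    area = 0.0
    for k0, k1 in zip(ks, ks[1:]):
        area += (fraction_by_k[k0] + fraction_by_k[k1]) / 2 * (k1 - k0)
    return area
```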

[100] The number of genes may still be too high for a practical prognostic kit and thus the number of genes may be optionally further reduced using standard techniques such as Lasso regression (S110). Lasso regression may be applied in the R software environment using the glmnet package described in “Regularization Paths for Generalized Linear Models via Coordinate Descent” by Friedman et al published in J Stat Softw 33, 1 to 22 (2010) for a Cox’s Proportional Hazard Model (e.g. as described in “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent” by Simon et al published in J Stat Softw 39, 1-13 (2011)) applying the lasso penalty (alpha=1). In this example, this may reduce the number of genes from 90 to 23. The resulting set of 23 genes (i.e. signature) is then output (S112). The resulting signature may be termed an ORACLE signature (Outcome Risk Associated Clonal Lung Expression). The prognostic accuracy of the output signature may be evaluated using validation data (S114).
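
For orientation, the penalised objective that this kind of regression maximises can be sketched as below. The notation is an assumption made for this sketch: $\ell(b)$ denotes the Cox log partial likelihood, $N$ the number of patients, $p$ the number of candidate genes (90 in this example) and $\lambda$ the penalty strength; the scaling of the likelihood term may differ between implementations.

$$\hat{b} = \arg\max_{b} \left[ \frac{2}{N}\,\ell(b) - \lambda \left( \alpha \sum_{j=1}^{p} |b_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} b_j^2 \right) \right]$$

With alpha=1 the quadratic term vanishes and the penalty reduces to $\lambda \sum_{j} |b_j|$, which drives some weights exactly to zero; the genes with non-zero weights form the reduced set.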

[101] It will be appreciated that each of the filtering steps in Figure 2a may be applied in any order. The order shown is merely exemplary and not intended to be limiting.

[102] The prognostic biomarker signature comprises the following genes: ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1. There are five genes related to cell proliferation (ANLN, ASPM, CDCA4, PLK1, PRKCA) and six genes relating to oncogenic signalling pathways (ERRFI1, FURIN, ITGA6, JAG1, PPP1R13L, PTTG1). Only seven of the genes appear to have been previously used in LUAD prognostic signatures, namely ASPM, FURIN, PLK1, PNP, PRKCA, PTTG1 and TPBG. Prognostic biomarkers predict survival risk independent of therapy.

[103] A method for providing a prognosis or predicting a level of risk for a subject with lung cancer, the method comprising:

a) contacting a biological sample from the subject with reagents that specifically bind to each member of a panel of biomarkers comprising or consisting of ANLN, ASPM, CDCA4, ERRFI1, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PTTG1, PYGB, RPP25, SCPEP1, SLC46A3, SNX7, TPBG, XBP1;

b) determining a riskscore of the subject based on the nucleic acid levels of expression of the biomarkers in the samples; and

c) providing a prognosis for the lung cancer based on the risk score of the subject.

[104] The method may also comprise obtaining the sample from the patient. The sample may be a tumour sample. The reagent used in the methods, kits and compositions provided herein may be a nucleic acid, for example an oligonucleotide or primer.

[105] Prognosis as used herein relates to a clinical outcome, such as overall survival, medium or long term mortality (e.g. 1, 2, 3, 4 or 5 years) or disease free survival.

[106] It will be appreciated that Figure 2a shows a method for generating a signature which is prognostic, but the method may be readily adapted for prediction by replacing the prognostic measures with predictive measures.

[107] Figure 2b provides an embodiment of the method for prognosis. It shows the steps which may be carried out in a prognosis method using the output signature. The first step is contacting a biological sample (step S200). The biological sample may be a tumour sample, which may be obtained using any appropriate method, e.g. from a donor sample obtained using a biopsy. A value for each of the 23 genes within the tumour sample is then determined (step S202) using standard techniques.

[108] The next step (S204) is to determine the risk score from a weighted sum of the values for each of the 23 genes. The riskscore may thus be calculated from:

$$\text{riskscore}_i = b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni}$$

where $x_{1i}, x_{2i}, \dots, x_{ni}$ are the values for the 23 selected genes for each patient i and $b_1, b_2, \dots, b_n$ are a set of associated weights for each gene. The weights may be determined using Lasso regression as described above.
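
The weighted sum can be sketched in Python as below. The function name is hypothetical, and any weights passed to it in an example are placeholders chosen for illustration, not the weights of the signature.

```python
def riskscore(expression, weights):
    """Weighted sum of expression values over the genes listed in weights.

    expression maps gene -> expression value for one patient;
    weights maps gene -> weight (coefficient) for that gene.
    """
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return sum(weights[gene] * expression[gene] for gene in weights)
```

Genes present in the sample but absent from the weights are ignored, so the same function can be applied to a full expression profile.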

[109] As an example, suitable weights are shown below for each of the genes in the signature. Genes with a positive beta coefficient are associated with a hazard ratio > 1 (i.e. are “unfavourable genes”, predicting worse survival) and vice versa for genes with negative coefficients (“favourable genes”). It will be appreciated that these weights are indicative of suitable values and not limiting.

[110] Returning to Figure 2b, once the riskscore has been determined, it is compared to a riskscore threshold (step S206). If the riskscore is equal to or above the threshold, there is deemed to be a high risk that the patient will not survive and thus the patient is classified as a high risk patient (step S210). Conversely if the riskscore is below the threshold, there is deemed to be a low risk that the patient will not survive and thus the patient is classified as a low risk patient (step S212). The threshold may for example be the median riskscore of the data which was used to derive the signature and/or the median riskscore of the most significant splits (log-rank P<0.01). In other words, the threshold may be the riskscore which most significantly splits the training cohort into relapsing and not-relapsing (i.e. cured) patients.
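
A minimal sketch of the comparison of steps S206 to S212 follows, using the median riskscore of a training cohort as the threshold. The function names are hypothetical.

```python
from statistics import median


def median_threshold(training_riskscores):
    """Threshold taken as the median riskscore of the training cohort."""
    return median(training_riskscores)


def classify(score, threshold):
    """High risk if the riskscore is equal to or above the threshold, else low."""
    return "high" if score >= threshold else "low"
```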

[111] As an alternative to step S208, the riskscore may be compared to an upper and a lower threshold. If the riskscore is equal to or above the upper threshold, the patient is classified as a high risk patient. If the riskscore is below the lower threshold, the patient is classified as a low risk patient. If the riskscore is between the two thresholds, the patient is classified as an intermediate risk patient. The upper and lower thresholds may be determined as the tertiles of the riskscores determined from the training cohort as explained below.
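
The three-tier alternative can be sketched in the same way. The function names are hypothetical, and the tertile cut points are taken from Python's `statistics.quantiles` with its default interpolation, which is an assumption; other quantile conventions give slightly different cut points.

```python
from statistics import quantiles


def tertile_thresholds(training_riskscores):
    """Return (lower, upper) cut points splitting the training cohort in thirds."""
    lower, upper = quantiles(training_riskscores, n=3)
    return lower, upper


def classify_three_tier(score, lower, upper):
    """High if score >= upper, low if score < lower, otherwise intermediate."""
    if score >= upper:
        return "high"
    if score < lower:
        return "low"
    return "intermediate"
```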

[112] Once the riskscore has been determined, this may optionally be used to decide on the most appropriate treatment. For example, for a high risk patient, adjuvant chemotherapy is recommended to supplement the surgery. Such treatment results in a better overall survival rate than chemotherapy alone. This is especially relevant for stage I patients, where a clinical metric for identifying high-risk patients is lacking. Currently stage I patients tend not to receive chemotherapy, resulting in undertreatment of approximately 25% of stage I patients who recur within 5 years. By contrast, for a low risk patient, the treatment can be selected from either surgery alone or the combined surgical approach specified above. Both options are equally effective in such cases.

[113] A schematic of an associated system for performing the method is shown in Figure 2c. The system comprises a computing device 210, which could be a handheld device that is portable for a clinician to transport from patient to patient; an app could be loaded onto the device for calculation of the riskscore. The computing device 210 comprises the standard components such as a processing unit or processor 220, a user interface unit 222 for allowing a user to input information, e.g. determined scores, and a memory 224 for storing the code to perform the calculation and/or the threshold for comparing the calculated riskscore. The user interface may display information or alternatively, there may be a display 224 for displaying information to a user, e.g. a calculated riskscore and/or a suggestion for treatment as described above. There may also be a communications module 228 for communicating with other devices and/or accessing the cloud 240, e.g. to process the riskscore. The tissue sample 230 is also shown schematically.

[114] This schematic system may be constructed, partially or wholly, using dedicated special-purpose hardware. Terms such as ‘module’ or ‘unit’ used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components discussed herein, such functional elements may be combined into fewer elements or separated into additional elements.

[115] Figures 3a to 3e illustrate how a clonal expression filter may be created. For each gene an RNA intra-tumour heterogeneity value and an RNA inter-tumour heterogeneity value is calculated. These per-gene metrics may quantify variability by the standard deviation between regions within the same tumour to generate a value for intra-tumour heterogeneity and quantify variability by the standard deviation between the same tumour regions from different tumours to generate a value for inter-tumour heterogeneity. They may be calculated using multi-region RNAseq data (normalised count values).

[116] Figures 3a to 3e plot the data for tumour samples from the dataset collected from 100 NSCLC patients enrolled in the TRACERx lung cancer study sponsored by University College London. Multiregion sampling was performed to obtain DNA and RNA sequentially from the same tissue. Whole exome sequencing was performed on DNA samples. Of the cohort of 100 tumours, RNA samples of sufficient quality were obtained from 174 regions of 68 tumours. Of these, at least two samples were available from 48 tumours.

[117] Further processing may be performed as necessary. For example, alignment was performed using the STAR package described in “STAR: ultrafast universal RNA-seq aligner” by Dobin et al published in Bioinformatics 29, 15 to 21 (2013) to map reads to the human genome. Transcript expression was quantified, for example using the RSEM package described in “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome” by Li et al published in BMC Bioinformatics 12, 323 (2011) to generate count and transcript per million (TPM) expression values. An expression filter was applied, keeping genes with at least 1 TPM in at least 20% (30/156) of tumour samples. Lastly, a variance stabilizing transformation was applied to counts from filtered genes (assuming a negative binomial distribution for count values) using the DESeq2 package described above. Homoscedastic and library size normalised count values were output to be used as described below. In this example, there may be 19206 genes to consider.

[118] As shown in Figure 3a, for each patient (e.g. CRUK0003) the gene expression at multiple locations (e.g. R1 to R8) for each gene is determined. The plots on the left side of Figure 3a show the gene expression at a plurality of locations for EDC4, CALM2 and PROM1 as examples. For a given tumour, the standard deviation of expression values for a particular gene across tumour regions may be calculated, yielding a gene-specific, patient-specific measure of RNA intra-tumour heterogeneity ($\sigma_{g,p}$). For the example patient, these are shown in the table in the centre of Figure 3a and for the three genes are 0.075, 0.552 and 2.248 respectively. Thus, EDC4 has little variation across the tumour but PROM1 has a significant variation. This may then be repeated for all genes and then for all tumours, generating a matrix of $\sigma_{g,p}$ values which is depicted as a table showing the patients (p) in columns and the genes (g) in rows in this example.

[119] Gene-wise RNA intra-tumour heterogeneity values may be summarised as the average (median) value per gene across all tumours in the cohort ($\sigma_g$). These values may be determined for example by plotting graphs such as those shown on the right side of Figure 3a. For the three example genes, the $\sigma_g$ values are 0.096, 0.246 and 1.380 respectively. Alternatively, patient-wise RNA-ITH values may be summarised as the average (median) value per tumour across all expressed genes in the cohort ($\sigma_p$).
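
The per-gene, per-patient standard deviations and their gene-wise median can be sketched as follows. The function names and the nested-dictionary layout are assumptions for illustration, as is the use of the sample standard deviation (n-1 denominator); tumours with fewer than two regions for a gene are skipped because a standard deviation cannot be computed for them.

```python
from statistics import median, stdev


def intra_tumour_sd(expression):
    """Standard deviation across regions for each gene within each tumour.

    expression maps patient -> gene -> list of values, one per tumour region.
    Returns gene -> patient -> standard deviation.
    """
    sd = {}
    for patient, genes in expression.items():
        for gene, values in genes.items():
            if len(values) >= 2:  # a standard deviation needs two or more regions
                sd.setdefault(gene, {})[patient] = stdev(values)
    return sd


def gene_wise_ith(sd):
    """Median standard deviation per gene across all tumours."""
    return {gene: median(by_patient.values()) for gene, by_patient in sd.items()}
```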

[120] Figure 3b plots the median absolute deviation (MAD) against the standard deviation scores for each gene. Similarly, Figure 3c plots the coefficient of variation (CV) against the standard deviation scores for each gene. Figures 3b and 3c show that MAD and CV are alternative metrics for the quantification of gene-wise RNA-ITH which show good agreement with standard deviation scores.

[121] As shown in Figure 3d, an inter-tumour heterogeneity measure may be derived for each gene by randomly sampling one region per patient, for example R1 for patient CRUK001, R2 for patient CRUK002 and so on. The standard deviation across the resulting single-biopsy cohort may then be taken. The random sampling and calculating of the standard deviation may be repeated multiple times (e.g. ten) to take the average score across iterations. As a check, the same method may also be applied to the TCGA NSCLC data-set which is a true single-biopsy cohort. Such a check found good agreement with scores calculated within the TRACERx cohort (PMCC=0.94, P<0.001), which indicates calculation of inter-tumour heterogeneity scores is reproducible.
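By way of non-limiting illustration only, the random single-region sampling described above may be sketched as follows for one gene. The function name, the seed parameter and the toy values are assumptions of this sketch; the mean is used as the average across iterations, and the sample standard deviation is assumed.

```python
import random
from statistics import mean, stdev


def inter_tumour_heterogeneity(gene_expression, n_iterations=10, seed=0):
    """gene_expression: {patient: [value per tumour region]} for a single gene.

    Each iteration draws one region per patient at random and takes the
    standard deviation across the resulting single-biopsy cohort; the
    returned score is the mean over iterations.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iterations):
        sampled = [rng.choice(regions) for regions in gene_expression.values()]
        scores.append(stdev(sampled))
    return mean(scores)


# Toy data (hypothetical values for one gene in three patients).
toy_gene = {"P1": [1.0, 1.2], "P2": [3.0, 2.8, 3.1], "P3": [5.0, 5.3]}
toy_score = inter_tumour_heterogeneity(toy_gene)
```

A fixed seed is used here only so that repeated runs of the sketch give the same score.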

[122] Figure 3e plots values (also termed scores and the terms may be used interchangeably) for RNA intra-tumour heterogeneity (y-axis) against values for RNA inter-tumour heterogeneity (x-axis) for each gene. The plot in Figure 3e is split into quadrants by the mean intra-tumour heterogeneity value (dashed horizontal line) and mean inter-tumour heterogeneity value (dashed vertical line). The quadrants are numbered Q1, Q2, Q3 and Q4 with the number of genes per quadrant indicated. Q1 represents low inter-tumour heterogeneity and high intra-tumour heterogeneity value genes and contains 798 genes. Q2 represents low inter-tumour heterogeneity and low intra-tumour heterogeneity value genes and contains 9642 genes. Q3 represents high inter-tumour heterogeneity and high intra-tumour heterogeneity value genes and contains 4766 genes. Q4 represents high inter-tumour heterogeneity and low intra-tumour heterogeneity value genes and contains 1080 genes. Genes in Q2 and Q4 exhibit homogeneous expression within tumours (i.e. low intra-tumour heterogeneity) which may restrict sampling bias. However, in Q2 the genes also have low inter-tumour heterogeneity which means that they exhibit homogeneous expression between different tumours and are thus not informative for patient stratification into high/low risk groups. Accordingly, the group of genes in Q4 is the more useful and thus the clonal expression filter may filter out all genes outside the Q4 quadrant, i.e. may retain only genes having both an inter-tumour heterogeneity value above the inter-tumour threshold (e.g. the median value) and an intra-tumour heterogeneity value below the intra-tumour threshold (e.g. the median value).
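By way of non-limiting illustration only, the quadrant assignment and the clonal expression filter may be sketched as follows. The function names and toy scores are assumptions of this sketch. The mean is used as the threshold here, following the dashed lines described for the plot; a median threshold could be substituted. Treating a value equal to the threshold as "low" is likewise an assumption.

```python
from statistics import mean


def assign_quadrants(intra, inter):
    """intra, inter: {gene: heterogeneity score}. Returns {gene: 'Q1'..'Q4'}."""
    intra_cut = mean(list(intra.values()))
    inter_cut = mean(list(inter.values()))
    quadrants = {}
    for gene in intra:
        high_intra = intra[gene] > intra_cut
        high_inter = inter[gene] > inter_cut
        if high_intra:
            quadrants[gene] = "Q3" if high_inter else "Q1"
        else:
            quadrants[gene] = "Q4" if high_inter else "Q2"
    return quadrants


def clonal_expression_filter(intra, inter):
    """Retain only Q4 genes: low intra-tumour, high inter-tumour heterogeneity."""
    quadrants = assign_quadrants(intra, inter)
    return [gene for gene, label in quadrants.items() if label == "Q4"]


# Toy scores (hypothetical): one gene per quadrant.
toy_intra = {"A": 0.1, "B": 0.1, "C": 0.9, "D": 0.9}
toy_inter = {"A": 0.2, "B": 1.0, "C": 0.2, "D": 1.0}
toy_quadrants = assign_quadrants(toy_intra, toy_inter)
toy_q4 = clonal_expression_filter(toy_intra, toy_inter)
```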

[123] Figures 4a to 4e illustrate example results of the final optional step of the method of Figure 2a which is to evaluate the prognostic accuracy of the output (ORACLE) signature using validation data. In this example, the validation data is taken from the “Uppsala II” dataset which is an independent cohort of early stage LUAD patients (UII, n=103, stage I to III). The validation data comprises pre-processed Uppsala RNAseq and clinical data downloaded for 170 NSCLC patients (103 LUAD + 67 LUSC) enrolled in the Uppsala NSCLC II cohort from the Gene Expression Omnibus. The cohort is described in “Profiling cancer testis antigens in non-small-cell lung cancer” by Djureinovic et al published in JCI Insight 1 (2016).

[124] Gene information was extracted from the data set using known steps. For example, alignment to the human genome was performed, e.g. using the TopHat package described in “TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions” by Kim et al published in Genome Biol 14, R36 (2013). Raw reads were then calculated, for example using the Subread package described in “The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote” by Liao et al published in Nucleic Acids Res 41, e108 (2013). Gene IDs were converted to HGNC IDs using the biomaRt package described in “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor package biomaRt” by Durinck et al published in Nat Protoc 4, 1184-1191 (2009). Max values were then selected for multi-mapping probes. Lowly expressed genes, which were identified in the training dataset described above, were filtered from the validation dataset and a variance stabilizing transform was applied using the DESeq2 package described above to output normalized count values. Additional clinical information (e.g. treatment status and tumour size) was also obtained.

[125] Figure 4a compares the performance of the output 23 gene signature to similar signatures based on known papers. Signature A is based on a signature construction pipeline described in “Development of a RNA-seq Based Prognostic Signature in Lung Adenocarcinoma” by Shukla et al in JNCI J Natl Cancer Inst 109 (2017). The signature is derived from the genes identified in the Shukla paper and uses standard techniques to select several genes for the signature. For example, using the training dataset from the TCGA database, in particular the LUAD patients, a univariate Cox regression analysis is performed, and a primary prognostic filter is applied (univariate Cox analysis P<0.00025) to reduce the number of genes identified in the Shukla paper to 108. Another prognostic filter, this time univariate Cox analysis FDR<0.02, reduces the 108 genes to 15. Finally, a forward conditional stepwise regression is applied to yield a 6-gene signature. The steps outlined in the Shukla paper have thus been followed but the different training data has yielded a 6-gene signature rather than the prognostic model containing 4 genes on page 4 of Shukla.

[126] Signature B is based on the signature construction pipeline described in “A practical molecular assay to predict survival in resected non-squamous, non-small-cell lung cancer: development and international validation studies” by Kratz et al published in the Lancet 379, 823-832 (2012). In the development of signature B, all the genes identified in the papers listed in the table in the background section are first collated in a list. Using the training dataset from the TCGA database, in particular the LUAD patients, a univariate Cox regression analysis is performed, and a primary prognostic filter is applied (univariate Cox analysis P<0.00025) to reduce the number of genes identified to 249. A secondary prognostic filter is applied by short-listing only the genes which are cancer-related to reduce the number from 249 to 56. Finally, a lasso regression is applied to yield a 24 gene prognostic signature. As with signature A, this signature B is thus derived using the methodology described in the Kratz paper but results in a different selection of genes because of the training cohort. Both signatures are comparable with the 23 gene signature described above.

[127] Figure 4a plots the prognostic value for each of the three signatures. The prognostic accuracy of the three signatures is tested using the validation data from the Uppsala dataset. As shown, the signature resulting from the process shown in Figure 2a predicted significant survival risk (log-rank P = 0.006) and outperformed both signatures A and B. In other words, Figure 4a shows that using the riskscore calculated using the signature derived by the Figure 2a process, patients in the validation cohort may be more successfully split into subgroups with significantly different survival times than using the riskscores from signatures A and B.

[128] Figure 4b is a forest plot and shows the predictive value of the new signature when combined with other known risk factors. In Figure 4b, a multivariate (rather than the previous univariate) analysis is performed to demonstrate that the calculated riskscore (which is input as a continuous variable) maintains significance even when clinical information is integrated to predict survival. The relative risk of death (hazard ratio) is shown as the solid block as a combined function of tumour stage (e.g. I to III), therapy status (no or some adjuvant treatment) and the risk score calculated using the output (ORACLE) signature. The 95% confidence interval is also indicated by the bars. The higher the hazard ratio, the greater the risk of death and, unsurprisingly, the stage III patients have the highest values. Figure 4b shows that the output signature is significant when using multivariate analysis with tumour stage and treatment status (Cox MVA P=0.0247) because the signature provides additional prognostic information.

[129] Figures 4c to 4e show the clinically actionable information in stage I patients. There are approximately 60 such patients in the validation dataset. Figure 4c shows the stage I patients divided into two groups: IA (n=42) and IB (n=18) according to substaging criteria (log-rank P=0.52). Classifying the patients in this way does not effectively stratify the patients into those having a good overall survival rate and a lower overall survival rate. Similarly, Figure 4d shows the stage I patients classified into high and low risk patients based on tumour size. Current clinical guidelines for stage I LUAD patients are that patients having stage IB tumours greater than 4cm in size are thought to be high-risk and other patients (i.e. with stage IA tumours or stage IB tumours less than 4cm) are low risk. In the group of 60 patients, just five patients are classified as high risk. As shown in Figure 4d, the patients are not well stratified into those having a good overall survival rate and a lower overall survival rate.

[130] Figure 4e shows the stage I patients divided into two groups: a high risk group (red) and a low risk group (blue) using the output (ORACLE) signature. As shown, this split is much more effective in predicting the survival rate of the patients.

[131] Figure 4f shows the effect of tumour sampling bias on the output (ORACLE) signature. Tumour regions are classified as either “high-risk” or “low-risk” using the calculated risk scores. Individual patients are then assessed for discordant classification whereby different regions from the same tumour may be classified as harbouring distinct profiles of molecular risk. As shown, just 3/28 patients (i.e. 11%) are discordant and this is a much lower rate of discordance than those shown in Figures 1c and 1d.
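By way of non-limiting illustration only, the assessment of discordant classification across tumour regions may be sketched as follows. The function name, the cutoff value and the toy risk scores are assumptions of this sketch; the text does not specify the cutoff used to separate high-risk from low-risk regions.

```python
def discordance_rate(region_scores, cutoff):
    """region_scores: {patient: [risk score per tumour region]}.

    A patient is discordant when their regions do not all fall on the same
    side of the cutoff. Returns (list of discordant patients, fraction).
    """
    discordant = []
    for patient, scores in region_scores.items():
        labels = {"high" if s > cutoff else "low" for s in scores}
        if len(labels) > 1:
            discordant.append(patient)
    return discordant, len(discordant) / len(region_scores)


# Toy data (hypothetical risk scores, two regions per patient).
toy_scores = {
    "P1": [0.2, 0.3],  # both low
    "P2": [0.8, 0.9],  # both high
    "P3": [0.4, 0.7],  # one low, one high: discordant
    "P4": [0.1, 0.2],  # both low
}
toy_discordant, toy_rate = discordance_rate(toy_scores, cutoff=0.5)
```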

[132] Figure 4g plots prognostic value assessment using an RNA-Seq dataset and four microarray datasets. To investigate concordance across multiple cohorts, the output (ORACLE) signature was applied to four microarray datasets. Specifically, prognostic value of the output (ORACLE) signature was assessed in a meta-analysis across five validation cohorts of patients with LUAD (n=904 patients with stage I-III LUAD). Univariate Cox analysis was performed in one RNA-Seq dataset and four microarray datasets. In the microarray cohorts, 19 of the 23 genes were available for analysis (ASPM, CDCA4, FURIN, GOLGA8A, ITGA6, JAG1, LRP12, MAFF, MRPS17, PLK1, PNP, PPP1R13L, PRKCA, PYGB, SCPEP1, SLC46A3, SNX7, TPBG, XBP1). Hazard ratios with a 95% confidence interval are shown for each cohort and are plotted on a natural log scale. It was expected that the output signature’s performance would be poorer, as only 19 of the 23 genes were matched to the microarray probe sets, and signature weights that were trained on RNA-Seq data were used. However, ORACLE was significantly associated with survival in three of the four microarray datasets. Meta-analysis considered all validation cohorts (the diamond indicates the hazard ratio for the meta-analysis of the five validation cohorts) and shows that ORACLE was significantly associated with the outcome with an overall hazard ratio of 3.57. These data indicate that survival associations resistant to differences in expression profiling technology can be obtained by controlling for RNA-ITH in biomarker design. Further information on this analysis can be found in “A clonal expression biomarker associates with lung cancer mortality” by D. Biswas et al, Nature Medicine 25, 1540-1548 (2019).
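By way of non-limiting illustration only, one way of pooling per-cohort hazard ratios is sketched below. The text does not state which meta-analysis model was used; this sketch assumes a fixed-effect, inverse-variance pooling on the natural log scale, with each standard error recovered from the reported 95% confidence interval. The function name and the example hazard ratios are hypothetical and are not the values shown in Figure 4g.

```python
import math


def pool_hazard_ratios(cohorts, z=1.96):
    """cohorts: list of (hazard_ratio, ci_lower, ci_upper) tuples.

    Returns (pooled_hr, pooled_ci_lower, pooled_ci_upper) under a
    fixed-effect inverse-variance model (an assumption of this sketch).
    """
    weights = []
    weighted_logs = []
    for hr, lower, upper in cohorts:
        # The CI spans 2*z standard errors on the log scale.
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        w = 1.0 / se ** 2
        weights.append(w)
        weighted_logs.append(w * math.log(hr))
    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - z * pooled_se),
        math.exp(pooled_log + z * pooled_se),
    )


# Hypothetical cohorts: (HR, lower 95% bound, upper 95% bound).
toy_cohorts = [(2.0, 1.0, 4.0), (2.0, 1.25, 3.2)]
pooled_hr, pooled_lo, pooled_hi = pool_hazard_ratios(toy_cohorts)
```

Cohorts with narrower confidence intervals receive more weight, so the pooled interval is tighter than that of any single cohort.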

[133] Figure 4h plots the prognostic value for combinations of 1 to 23 genes selected from the ORACLE signature. Two procedures for selecting gene combinations from the full ORACLE signature were considered, as computationally efficient alternatives to an exhaustive search through every combination of the 23 genes. Backward subsetting began with the full model containing all 23 genes; all 22-gene combinations were evaluated, and the combination with the highest prognostic significance was chosen. This procedure continued iteratively, removing genes one-at-a-time, until one gene remained. Forward subsetting began with a model containing no genes, and then added genes yielding the highest prognostic significance to the model, one-at-a-time, until all 23 genes were included. Importantly, the weights for each gene were not re-trained, so each combination was evaluated as a subset of the full ORACLE signature defined above. These data indicate that any combination of two or more of the 23 genes of the ORACLE signature may have a prognostic value. Data from these two procedures are shown in Appendix A.
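By way of non-limiting illustration only, the backward and forward subsetting procedures may be sketched as greedy searches. The function names, the `score` callable and the toy weights are assumptions of this sketch: `score` stands in for whatever measure of prognostic significance is used, with higher values taken to be better, and is evaluated on each candidate subset without re-training any weights.

```python
def backward_subsetting(genes, score):
    """Start from all genes and greedily drop one gene at a time.

    Returns the chosen subset at each size, from len(genes) down to 1.
    """
    current = tuple(genes)
    path = [current]
    while len(current) > 1:
        candidates = [tuple(g for g in current if g != drop) for drop in current]
        current = max(candidates, key=score)
        path.append(current)
    return path


def forward_subsetting(genes, score):
    """Start from no genes and greedily add one gene at a time.

    Returns the chosen subset at each size, from 1 up to len(genes).
    """
    current = ()
    remaining = list(genes)
    path = []
    while remaining:
        best = max(remaining, key=lambda g: score(current + (g,)))
        current = current + (best,)
        remaining.remove(best)
        path.append(current)
    return path


# Hypothetical fixed weights and a stand-in score (sum of absolute weights).
toy_weights = {"A": 0.5, "B": -0.1, "C": 0.3}


def toy_score(subset):
    return sum(abs(toy_weights[g]) for g in subset)


backward_path = backward_subsetting(["A", "B", "C"], toy_score)
forward_path = forward_subsetting(["A", "B", "C"], toy_score)
```

For n genes each procedure evaluates on the order of n squared subsets, rather than the 2 to the power n subsets of an exhaustive search.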

[134] Figures 5a and 5b illustrate that the method described above in Figure 2a may hold prognostic relevance across other cancer types. The clonal expression filter described in Figures 3a to 3f was generated by leveraging the full multi-region RNAseq dataset from the TRACERx lung cohort incorporating data from multi-region LUSC tumours and other NSCLC histologies. This data was then used to calculate an intra-tumour heterogeneity score and an inter-tumour heterogeneity score for each gene using the same gene-wise metric described above. The genes are divided into four quadrants as described above.

[135] The proportion of genes in each of the quadrants which give a pan-cancer significant prognostic value was then assessed and is displayed in Figure 5a. For example, pan-cancer gene-wise prognostic values may be downloaded from the PRECOG resource described in “The prognostic landscape of genes and infiltrating immune cells across human cancers” by Gentles et al published in Nat Med 21, 938-945 (2015). The PRECOG resource is a meta-dataset summarizing 166 microarray datasets covering 39 distinct malignant histologies. The dataset comprises Z-scores which have previously been calculated using a Cox univariate regression analysis. The genes having a |z| score > 1.96 (which is equivalent to a two-sided P<0.05) are selected. Consistent with the analysis in Figures 3a to 3f, genes in quadrant Q4 (i.e. high inter-tumour heterogeneity and low intra-tumour heterogeneity value genes) exhibited a significantly higher pan-cancer Z score (reflecting significant prognostic ability) compared to all other quadrants.
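By way of non-limiting illustration only, the equivalence between an absolute Z-score above 1.96 and a two-sided P<0.05 under a standard normal distribution, and the resulting gene selection, may be sketched as follows. The function names and the toy Z-scores are assumptions of this sketch and are not taken from the PRECOG resource.

```python
import math


def two_sided_p(z):
    """Two-sided P value for a standard normal Z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def prognostic_genes(z_scores, threshold=1.96):
    """Select genes whose absolute Z-score exceeds the threshold."""
    return [gene for gene, z in z_scores.items() if abs(z) > threshold]


# Hypothetical gene-wise Z-scores.
toy_z = {"G1": 2.5, "G2": -3.1, "G3": 0.4}
toy_selected = prognostic_genes(toy_z)
p_at_threshold = two_sided_p(1.96)
```

The sign of the Z-score is ignored in the selection, so genes associated with either better or worse survival are retained.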

[136] Figure 5b also compares the performance of genes within each quadrant to determine whether the genes are enriched or depleted in prognostic genes. Each point in Figure 5b corresponds to one out of 39 cancer types sourced from the PRECOG database. The number of prognostically significant genes (|z| score > 1.96) per NSCLC RNA heterogeneity quadrant is indicated for each cancer type as non-significant (gray), significantly enriched (red) or significantly depleted (blue). As shown in Figure 5b, the genes in Q4 were significantly enriched in prognostic genes in 49% (19/39) of cancer types and only significantly depleted in head and neck cancer, i.e. 3% (1/39). Conversely, genes in Q1 (low inter-tumour heterogeneity and high intra-tumour heterogeneity value genes) were not significantly enriched in any cancer types and depleted in 56% (22/39). Genes in Q2 (low inter-tumour heterogeneity and low intra-tumour heterogeneity value genes) and Q3 (high inter-tumour heterogeneity and high intra-tumour heterogeneity value genes) showed similar numbers of depleted and enriched cancer types.

[137] Figures 6a to 6c explore the genomic mechanisms underpinning RNA-ITH. The relationship between RNA-ITH scores calculated as described above using the multi-region RNAseq data and copy number heterogeneity quantified using multi-region WES data described in “Tracking the Evolution of Non-Small Cell Lung Cancer” by Jamal-Hanjani et al published in N Engl J Med 376, 2109-2121 (2017) is first considered. Figure 6a correlates gene expression ITH with copy number ITH. From the TRACERx LUAD cohort, patient-wise RNA-ITH scores are plotted against patient-wise SCNA-ITH scores. Figure 6a shows that there is a significant correlation between the median RNA-ITH score per patient and the percentage of subclonal SCNA events per patient (Rs=0.48, P=0.0162). This indicates that SCNA-ITH may contribute to transcriptomic heterogeneity.
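By way of non-limiting illustration only, a rank correlation between patient-wise RNA-ITH scores and the percentage of subclonal SCNA events may be sketched as follows. This sketch assumes that the reported correlation coefficient is a Spearman rank correlation; the function names and toy values are hypothetical, and the calculation of the associated P value is not reproduced.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical patient-wise values.
toy_rna_ith = [0.10, 0.15, 0.22, 0.30]
toy_scna_pct = [5.0, 12.0, 9.0, 40.0]
toy_rs = spearman(toy_rna_ith, toy_scna_pct)
```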

[138] Figure 6b shows that there is a highly significant association between subclonal copy-number gains and increased expression as well as between subclonal copy-number losses and decreased expression (P<0.001). This data indicates an association between chromosomal copy number gains and losses at the subclonal level and gene transcription and that RNA-ITH reflects on-going CIN and likely selection of heterogeneous DNA copy number events.

[139] Figure 6c shows the clonal copy number gains odds ratio for each quadrant. Figure 6c assesses the relative enrichment of genes within each quadrant most commonly subjected to clonal copy number gain (upper quartile) versus the genes that rarely showed clonal copy number gain (lower quartile) in the TRACERx cohort. Figure 6c shows that there is a highly significant enrichment of Q4 genes subject to clonal copy number gain events in TRACERx (P=1.18e-05, Fisher’s exact test) and to a lesser extent Q3 genes (P=0.000109, Fisher’s exact test). By contrast, there is a depletion of Q2 genes (P=6.86e-08, Fisher’s exact test). This data suggests that homogeneous expression across a tumour is likely rooted in clonal DNA copy number alterations, selected early in tumour evolution.
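By way of non-limiting illustration only, an odds ratio and a two-sided Fisher's exact test for a 2x2 table may be sketched as follows. The layout of the table (genes inside versus outside a quadrant, against upper versus lower quartile of clonal copy number gain frequency) and the toy counts are assumptions of this sketch; the two-sided P value sums the probabilities of all tables that are no more likely than the observed one, and the sample odds ratio is reported.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """2x2 table [[a, b], [c, d]]; returns (sample odds ratio, two-sided P).

    Example layout (an assumption): rows are genes in / not in a quadrant,
    columns are upper / lower quartile of clonal copy number gain frequency.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Toy counts (hypothetical).
toy_odds, toy_p = fisher_exact(3, 1, 1, 3)
```

An odds ratio above 1 in this layout would indicate enrichment of the quadrant among frequently gained genes, and a value below 1 would indicate depletion.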

[140] Figure 6d shows the enriched reactome pathways in Q4 and these relate to cell proliferation, including mitosis, nucleosome assembly and epigenetic regulation. By contrast, the same analysis of the pathways for the genes in the other quadrants showed that the Q1 genes showed no significant enrichment, the Q2 genes showed involvement in RNA splicing processing and the Q3 genes showed involvement in GPCR ligand binding and extracellular matrix organisation. This analysis shows that the Q4 genes may be linked to specific biological features of tumour aggressiveness that might explain their prognostic discriminatory pathways.

[141] Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term “comprising” or “comprises” means including the component(s) specified but not to the exclusion of the presence of others.

[142] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.

[143] All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.

[144] Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[145] The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Appendix A - Data showing particular combinations of biomarkers that have a prognostic value, as obtained using the forward and backward subsetting procedures of Figure 4h.

Forward Analysis

Backward Analysis