

Title:
MICROSATELLITE MARKERS
Document Type and Number:
WIPO Patent Application WO/2023/052795
Kind Code:
A1
Abstract:
The invention provides novel methods for evaluating levels of microsatellite instability in a sample and evaluating the biological significance of sequence variations identified in a sample during sequencing. The invention further relates to the use of novel microsatellite instability markers for evaluating levels of microsatellite instability in a sample and evaluating the biological significance of sequence variations identified in a sample during sequencing. Corresponding kits are also provided.

Inventors:
BURN JOHN (GB)
JACKSON MICHAEL (GB)
SANTIBANEZ-KOREF FRANCISCO (GB)
GALLON RICHARD (GB)
Application Number:
PCT/GB2022/052500
Publication Date:
April 06, 2023
Filing Date:
October 03, 2022
Assignee:
CANCER RESEARCH TECH LTD (GB)
International Classes:
C12Q1/6886; C12Q1/6827
Domestic Patent References:
WO2018037231A1 2018-03-01
WO2021019197A1 2021-02-04
WO2020178400A1 2020-09-10
WO2019011971A1 2019-01-17
Other References:
LISA REDFORD ET AL: "A novel panel of short mononucleotide repeats linked to informative polymorphisms enabling effective high volume low cost discrimination between mismatch repair deficient and proficient tumours", PLOS ONE, vol. 13, no. 8, 29 August 2018 (2018-08-29), pages e0203052, XP055651331, DOI: 10.1371/journal.pone.0203052
GONZÁLEZ-ACOSTA MARIBEL ET AL: "High-sensitivity microsatellite instability assessment for the detection of mismatch repair defects in normal tissue of biallelic germline mismatch repair mutation carriers", JOURNAL OF MEDICAL GENETICS, vol. 57, no. 4, 7 September 2019 (2019-09-07), GB, pages 269 - 273, XP055930862, ISSN: 0022-2593, DOI: 10.1136/jmedgenet-2019-106272
RICHARD GALLON ET AL: "A sensitive and scalable microsatellite instability assay to diagnose constitutional mismatch repair deficiency by sequencing of peripheral blood leukocytes :", HUMAN MUTATION, vol. 40, no. 5, 1 May 2019 (2019-05-01), US, pages 649 - 655, XP055618211, ISSN: 1059-7794, DOI: 10.1002/humu.23721
PEREZ-VALENCIA JUAN A ET AL: "Constitutional mismatch repair deficiency is the diagnosis in 0.41% of pathogenic/variant negative children suspected of sporadic neurofibromatosis type 1", GENETICS IN MEDICINE, vol. 22, no. 12, 10 August 2020 (2020-08-10), pages 2081 - 2088, XP037309385, ISSN: 1098-3600, DOI: 10.1038/S41436-020-0925-Z
GALLON ET AL., HUM MUTAT, vol. 40, no. 5, May 2019 (2019-05-01), pages 649 - 655
GONZALEZ-ACOSTA ET AL., J MED GENET, vol. 57, no. 4, April 2020 (2020-04-01), pages 269 - 273
SAMBROOK ET AL.: "Molecular Cloning: A Laboratory Manual", 1989, COLD SPRING HARBOR PRESS
AUSUBEL ET AL.: "Current Protocols in Molecular Biology (Supplement 47)", 1999, JOHN WILEY & SONS
SINGLETON AND SAINSBURY: "Dictionary of Microbiology and Molecular Biology", 1994, JOHN WILEY AND SONS
HALE AND MARHAM: "The Harper Collins Dictionary of Biology", 1991, HARPER PERENNIAL
BOYLE ET AL., BIOINFORMATICS, vol. 30, no. 18, 15 September 2014 (2014-09-15), pages 2670 - 2
GALLONPEREZ-VALENCIA ET AL., GENET MED, vol. 22, no. 12, 2019, pages 2081 - 2088
REDFORD ET AL., PLOS ONE, vol. 13, no. 8, 29 August 2018 (2018-08-29), pages e0203052
GALLON ET AL., HUM MUTAT., vol. 41, no. 1, January 2020 (2020-01-01), pages 332 - 341
Attorney, Agent or Firm:
HGF (GB)
Claims:
CLAIMS

1. A method for evaluating levels of microsatellite instability in a sample, comprising: a) analyzing the sample’s DNA to determine the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is selected from Table A; and b) comparing the nucleotide sequence to a predetermined sequence, and determining any deviation, indicative of instability, from the predetermined sequences.

2. The method of claim 1, wherein the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or more, microsatellite markers selected from Table A.

3. The method of claim 2, wherein at least one of the microsatellite markers is selected from Table B or Table D, optionally wherein at least one of the markers is selected from the top 21 markers listed in Table B.

4. The method of claim 2, wherein at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23, at least 24, or more microsatellite markers are selected from Table B or Table D, optionally wherein the at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or 21 microsatellite markers are selected from the top 21 markers listed in Table B.

5. The method of claim 1 or 2, wherein the one or more microsatellite markers selected from Table A is selected from the group of microsatellite markers listed in Table C.

6. The method of claim 5, wherein at least one of the microsatellite markers is selected from Table D.

7. The method of claim 5, wherein at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23 or 24 microsatellite markers are selected from Table D.

8. The method of claim 1 or 2, wherein at least one of the markers is selected from the group consisting of AKMmono10v2, LMmono05v2, AKMmono05 and EJmono12_SNP1.

9. The method of claim 1, wherein the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 28 or more, microsatellite markers selected from Table H, optionally wherein the one or more microsatellite markers are the 32 markers listed in Table H.

10. The method of claim 1, wherein the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or 11 markers selected from Table I.

11. The method of claim 10, wherein the method further comprises determining the nucleotide sequence of one or more microsatellite markers selected from Table G.

12. The method of claim 11, wherein the one or more microsatellite markers from Table G are LR36, GM07 and LR44.

13. The method of claims 10 to 12, wherein the method further comprises determining the nucleotide sequence of a cancer hotspot.

14. The method of any preceding claim, wherein the method comprises the step of amplifying from the sample one or more microsatellite marker selected from Table A to generate microsatellite markers amplicons prior to step a).

15. A method for evaluating the biological significance of sequence variation identified during sequencing, comprising: a) amplifying from the sample one or more microsatellite marker selected from Table E to generate microsatellite markers amplicons, wherein each microsatellite loci has a single nucleotide polymorphism (SNP) within a short distance of the microsatellite marker and said amplifying step amplifies both the microsatellite marker and associated SNP in a single amplicon; b) sequencing the amplicons; and c) comparing the sequences from the amplicons to predetermined sequences and determining any deviation, indicative of instability, from the predetermined sequences; and d) for heterozygous SNPs, determining whether there is a bias between indel frequencies for the two alleles.

16. The method of claim 15, wherein the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 microsatellite markers.

17. The method of claim 15 or 16, wherein the one or more markers selected from Table E may be AKMmono10v2 or LMmono05v2.

18. The method of any preceding claim, wherein the sample is a fluid sample or a solid sample, optionally wherein the fluid sample is a blood sample, urine sample, or part thereof.

19. The method of claim 10, wherein the part is peripheral blood leukocytes (PBLs).

20. The method of any preceding claim, wherein the subject has, is at risk of having, or is predisposed to a condition associated with microsatellite instability.

21. The method of claim 20, wherein the condition associated with microsatellite instability is cancer, CMMRD, Lynch syndrome, and/or Muir-Torre syndrome; preferably cancer or CMMRD.

22. The method of claim 21, wherein the cancer is selected from the group consisting of colon cancer, endometrium cancer, gastric cancer, ovarian cancer, hepatobiliary tract cancer, urinary tract cancer, stomach cancer, small intestine cancer, brain cancer, skin cancer, and haematological cancer.

23. A kit for amplifying one or more microsatellite marker selected from Table A, wherein the kit comprises primers and/or probes for specifically amplifying the one or more microsatellite marker.

24. A kit of claim 23, wherein the microsatellite marker is associated with a SNP and wherein the primers and/or probes are for specifically amplifying the one or more microsatellite marker and the associated SNP, optionally wherein the primers and/or probes have a sequence as shown in Table F, Table I, and/or Table 4.

25. Use of one or more microsatellite markers selected from Table A for evaluating levels of microsatellite instability in a sample.

26. Use of one or more microsatellite markers selected from Table E for evaluating the biological significance of sequence variation identified during sequencing of a sample.

Description:
MICROSATELLITE MARKERS

Field of the invention

The invention provides novel methods for evaluating levels of microsatellite instability in a sample and evaluating the biological significance of sequence variations identified in a sample during sequencing. The invention further relates to the use of novel microsatellite instability markers for evaluating levels of microsatellite instability in a sample and evaluating the biological significance of sequence variations identified in a sample during sequencing. Corresponding kits are also provided.

Background

The DNA mismatch repair (MMR) system maintains the sequence of the human genome by correcting errors made during DNA replication prior to cell division. MMR deficiency can occur in cancers and results in an increased mutation rate, a high tumor mutation burden, and distinct mutational signatures. Microsatellite instability (MSI), i.e. the increased frequency of insertion and deletion mutations (indels) in short tandem repeats found throughout the human genome, is a well-known and long-used hallmark feature of the mutator phenotype associated with MMR deficiency.

Whilst typically observed in neoplastic cells, MMR deficiency has also been described as a very rare constitutional condition associated with childhood cancer predisposition. This Constitutional MMR deficiency (CMMRD) is caused by germline bi-allelic pathogenic variants affecting one of four MMR genes, and results in a high risk of developing a broad spectrum of malignant tumors within the first three decades of life. Non-malignant clinical features, of which skin pigmentation alterations are the most prevalent, are found in nearly all CMMRD patients and are important diagnostic markers.

Timely diagnosis of CMMRD is critical as it allows patients to benefit from personalized treatment, cancer surveillance, and cancer prevention. Families of CMMRD patients may benefit from identification of affected relatives, and provision of genetic counselling. Due to these important implications, a clinical diagnosis of suspected CMMRD needs confirmation by a molecular diagnosis. However, a definitive genetic diagnosis may be precluded by limitations inherent to any mutational analysis method, specific limitations due to pseudogenes of the PMS2 MMR gene, and variants of uncertain significance (VUS). Hence, complementary functional assays are needed to confirm or refute the diagnosis when genetic analysis fails to render a definite diagnosis.

MSI analysis has been used to detect MMR deficiency in cancers since the discovery of this tumour phenotype in the early 1990s. This test informs the prognosis of the cancer patient, can be used to screen for Lynch syndrome, and may inform use of immunotherapy, such as the immune checkpoint blockade inhibitor pembrolizumab. A wide variety of highly sensitive and specific MSI assays have been developed for tumour diagnostics. Widespread assays include fragment length analysis and software to determine MSI status from high throughput sequencing reads. An example of a commercial kit based on fragment length analysis is the Promega MSI Analysis System, which uses PCR to amplify 5 mononucleotide repeat microsatellite markers, followed by analysis of fluorescently tagged amplicons using capillary electrophoresis to identify microsatellite indels. MSI status is determined by the proportion of microsatellite markers that contain indels. Sequencing-based MSI analysis software uses a variety of classification methods and a variety of microsatellites captured by targeted through to whole genome sequencing.
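
By way of illustration only, the proportion-based determination of MSI status described above may be sketched in Python as follows. The marker names and the 0.4 cut-off (2 of 5 markers) are hypothetical values chosen for the example and are not taken from any kit's documentation.

```python
def unstable_marker_fraction(marker_has_indel):
    """Return the proportion of markers in which an indel was observed."""
    if not marker_has_indel:
        raise ValueError("at least one marker is required")
    unstable = sum(1 for has_indel in marker_has_indel.values() if has_indel)
    return unstable / len(marker_has_indel)


def classify_msi(marker_has_indel, min_unstable_fraction=0.4):
    """Label a sample from the proportion of unstable markers.

    min_unstable_fraction is a hypothetical cut-off used for illustration.
    """
    fraction = unstable_marker_fraction(marker_has_indel)
    return "MSI" if fraction >= min_unstable_fraction else "MSS"


# Hypothetical indel calls for a five-marker panel (True = indel detected).
sample = {"marker_1": True, "marker_2": True, "marker_3": False,
          "marker_4": False, "marker_5": False}
print(unstable_marker_fraction(sample), classify_msi(sample))
```

In practice the cut-off would be set according to the validated interpretation rules of the particular assay.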

In 2019, the inventors were the first to show that sequencing-based MSI analysis, using single molecule molecular inversion probe (smMIP) amplification of 24 mononucleotide repeats and amplicon sequencing, was able to detect MSI in the non-neoplastic tissues of CMMRD patients (Gallon et al. Hum Mutat. 2019 May; 40(5):649-655, DOI: 10.1002/humu.23721, PMID: 30740824). Prior to this, the weak MSI signal in non-neoplastic CMMRD tissues was only detectable by laborious techniques such as small pool PCR and culturing of lymphoblastoid cell lines, or by fragment length analysis of dinucleotide repeat markers, which are insensitive to MSH6 deficiency and, therefore, ~25% of CMMRD cases. Other MSI analysis methods used routinely for tumours could not detect this signal.

The inventors’ smMIP and sequencing-based MSI assay was initially developed for cancer diagnostics, and hence its 24 mononucleotide repeat markers (herein referred to as the “original markers” which are described in WO2021019197) had been selected from MMR deficient tumour data. Whilst the assay was 98% sensitive and 100% specific for CMMRD detection, there was poor separation of some CMMRD samples from controls (Gallon et al. 2019). A more recent sequencing-based MSI assay has been developed that has a much greater separation of CMMRD from control samples (Gonzalez-Acosta et al. J Med Genet. 2020 Apr;57(4):269-273, DOI: 10.1136/jmedgenet-2019-106272; PMID: 31494577). It also uses microsatellite markers selected from MMR deficient tumour data, and improves detection of CMMRD by using exceptionally high read depths per marker (20,000x), and a very large number of microsatellite markers (186 mononucleotide repeats). This second MSI assay for CMMRD detection is therefore limited by a high cost and reliance on high capacity sequencing platforms.

Accordingly, there remains a need for further improved methods for identifying microsatellite instability in a sample.

Summary of the invention

The present invention is based on the inventors’ development of a novel panel of MSI markers (listed in Table A below). These markers have been tested and validated in CMMRD samples and surprisingly were found to differentiate between CMMRD and control samples with 100% sensitivity and 100% specificity as shown in the Examples section of the present application. The present inventors have also found that this novel panel of markers is very useful in the context of evaluating MSI in tumours, and therefore can be used to differentiate microsatellite stable (MSS) and MSI cancers. As shown in the Examples section of the present application, the inventors have found that MSI classification of colorectal cancers (CRCs) using the top 24 markers of the new microsatellite marker panel had 100% sensitivity and 100% specificity and provided a very clear separation between microsatellite instability-high (MSI-H) and MSS samples.

The present inventors have found that even just one marker from the novel panel of markers described herein may be sufficient to identify microsatellite instability in a sample. This is because the markers described herein individually have a very high sensitivity and specificity as shown by the markers' high AUC ROC scores. Most markers described herein have an AUC ROC score greater than 0.9 (for example 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, or even 1). Merely by way of example, Figure 8A shows that the marker AKMmono10v2, when analysed on its own, allows separation between CMMRD and control samples. However, it will be appreciated that similar separation of the two types of samples may be expected when analysing any of the markers of the present invention.

Thus, the markers disclosed herein can be used in low cost and scalable MSI assays with improved accuracy for detecting microsatellite instability.

Furthermore, the inventors have surprisingly found that the markers described herein can identify microsatellite instability in a blood sample or part thereof (such as peripheral blood leukocytes). Microsatellite markers that are particularly useful in this context are provided in Table H of the present disclosure.

Moreover, the inventors have developed a set of microsatellite markers which may be particularly useful in a diagnostic context, as the set is optimised for use in a single-round multiplex PCR reaction. The inventors also developed primers that may be used in such a single-round multiplex PCR reaction. These markers and primers are provided in Table I.

Accordingly, in one aspect the present invention provides a method for evaluating levels of microsatellite instability in a sample, the method comprising the steps of: a) analyzing the sample’s DNA to determine the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is selected from Table A; b) comparing the nucleotide sequence to a predetermined sequence, and determining any deviation, indicative of instability, from the predetermined sequence.

Suitably, the one or more microsatellite markers may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or more, microsatellite markers selected from Table A.

Suitably, at least one of the microsatellite markers may be selected from Table B or Table D.

Suitably, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23, at least 24, or more microsatellite markers may be selected from Table B or Table D.

Suitably, at least one of the markers may be selected from the top 21 markers listed in Table B.

Suitably, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or 21 of the markers are selected from the top 21 markers listed in Table B.

Suitably, the one or more microsatellite markers selected from Table A may be selected from Table C, optionally wherein at least one of the microsatellite markers may be selected from Table D, further optionally wherein 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 or 24 microsatellite markers may be selected from Table D.

Suitably, at least one of the markers may be selected from the group consisting of AKMmono10v2, LMmono05v2, AKMmono05, and EJmono12_SNP1.

Suitably, the method may comprise the step of amplifying from the sample one or more microsatellite marker selected from Table A to generate microsatellite markers amplicons prior to step a).

In one aspect the present invention provides a method for evaluating the biological significance of sequence variation identified during sequencing, comprising: a) amplifying from the sample one or more microsatellite marker selected from Table E to generate microsatellite markers amplicons, wherein each microsatellite loci has a single nucleotide polymorphism (SNP) within a short distance of the microsatellite marker and said amplifying step amplifies both the microsatellite marker and associated SNP in a single amplicon; b) sequencing the amplicons; and c) comparing the sequences from the amplicons to predetermined sequences and determining any deviation, indicative of instability, from the predetermined sequences; and d) for heterozygous SNPs, determining whether there is a bias between indel frequencies for the two alleles.
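
Merely as a sketch of how step d) might be carried out, the following Python compares indel frequencies between the two alleles of a heterozygous SNP. The use of a two-sided Fisher's exact test, the function names, and the 0.05 significance level are assumptions made for illustration; the present disclosure does not prescribe a particular statistical test at this point.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


def allelic_indel_bias(indel_a, ref_a, indel_b, ref_b, alpha=0.05):
    """Compare indel frequencies of reads phased to SNP allele A and allele B.

    indel_* = reads carrying a microsatellite indel; ref_* = reads carrying
    the reference-length microsatellite. alpha is an illustrative threshold.
    """
    freq_a = indel_a / (indel_a + ref_a)
    freq_b = indel_b / (indel_b + ref_b)
    p_value = fisher_exact_two_sided(indel_a, ref_a, indel_b, ref_b)
    return {"freq_a": freq_a, "freq_b": freq_b,
            "p_value": p_value, "biased": p_value < alpha}


# Hypothetical read counts: 30/100 indel reads on allele A, 5/100 on allele B.
print(allelic_indel_bias(30, 70, 5, 95))
```

Because the microsatellite and the SNP lie on the same amplicon, each read can be assigned to one allele, which is what makes the per-allele counts used here available.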

Suitably, the one or more markers may be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 markers selected from Table E.

Suitably, at least one of the one or more markers selected from Table E may be AKMmono10v2 or LMmono05v2.

Suitably, the sample may be a fluid sample or a solid sample.

Suitably, the subject may have, be at risk of having, or be predisposed to a condition associated with microsatellite instability.

Suitably, the condition associated with microsatellite instability may be selected from cancer, CMMRD, Lynch syndrome, and Muir-Torre syndrome; preferably cancer or CMMRD.

Suitably, the cancer may be selected from the group consisting of colon cancer, endometrium cancer, gastric cancer, ovarian cancer, hepatobiliary tract cancer, urinary tract cancer, stomach cancer, small intestine cancer, brain cancer, skin cancer, and haematological cancer.

In one aspect, the present invention provides a kit for amplifying one or more microsatellite marker listed in Table A, wherein the kit comprises primers and/or probes for specifically amplifying the one or more microsatellite marker.

Suitably, the microsatellite marker may be associated with a SNP (i.e. is a marker selected from Table E) and wherein the primers and/or probes are for specifically amplifying the one or more microsatellite marker and associated SNP.

In one aspect, the present invention provides use of one or more microsatellite markers selected from Table A for evaluating levels of microsatellite instability in a sample.

In one aspect, the present invention provides use of one or more microsatellite markers selected from Table E for evaluating the biological significance of sequence variation identified during sequencing of a sample.

Throughout the description and claims of this specification, the words “comprise” and “contain” and variations of them mean “including but not limited to”, and they are not intended to (and do not) exclude other moieties, additives, components, integers or steps.

The following terms or definitions are provided solely to aid in the understanding of the invention. Unless specifically defined herein, all terms used herein have the same meaning as they would to one skilled in the art of the present invention. Practitioners are particularly directed to Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd ed., Cold Spring Harbor Press, Plainsview, N.Y. (1989); and Ausubel et al., Current Protocols in Molecular Biology (Supplement 47), John Wiley & Sons, New York (1999), for definitions and terms of the art. As a further example, Singleton and Sainsbury, Dictionary of Microbiology and Molecular Biology, 2d Ed., John Wiley and Sons, NY (1994); and Hale and Marham, The Harper Collins Dictionary of Biology, Harper Perennial, NY (1991) provide those of skill in the art with a general dictionary of many of the terms used in the invention. Although any methods and materials similar or equivalent to those described herein find use in the practice of the present invention, the preferred methods and materials are described herein.

Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise. Accordingly, as used herein, the singular terms "a", "an," and "the" include the plural reference unless the context clearly indicates otherwise.

Unless otherwise indicated, nucleic acids are written left to right in 5' to 3' orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively. It is to be understood that this invention is not limited to the particular methodology, protocols, and reagents described, as these may vary, depending upon the context they are used by those of skill in the art.

Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with a particular aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith.

The entire disclosures of the issued patents, published patent applications, and other publications that are cited herein are hereby incorporated by reference to the same extent as if each was specifically and individually indicated to be incorporated by reference. In the case of any inconsistencies, the present disclosure will prevail.

Various aspects of the invention are described in further detail below.

Brief Description of the Drawings

In order to provide a better understanding of the present invention, embodiments will be described by way of example only with reference to the following figures, in which:

Figure 1 shows the ROC AUC for each mononucleotide repeat marker using Reference Allele Frequency (RAF) to classify MMR deficiency in CRC samples. The results are based on samples from the pilot cohort, which contained 8 CMMRD peripheral blood leukocyte genomic DNA samples, 38 control peripheral blood leukocyte genomic DNA samples, 8 MMR deficient CRC genomic DNA samples and 8 MMR proficient CRC genomic DNA samples.

Figure 2 shows the difference in the median RAF of MMR proficient and MMR deficient CRC samples for each mononucleotide repeat marker. The results are based on samples from the pilot cohort, which contained 8 CMMRD peripheral blood leukocyte genomic DNA samples, 38 control peripheral blood leukocyte genomic DNA samples, 8 MMR deficient CRC genomic DNA samples and 8 MMR proficient CRC genomic DNA samples.

Figure 3 shows the ROC AUC for each mononucleotide repeat marker using RAF to classify CMMRD versus control samples. The results are based on samples from the pilot cohort, which contained 8 CMMRD peripheral blood leukocyte genomic DNA samples, 38 control peripheral blood leukocyte genomic DNA samples, 8 MMR deficient CRC genomic DNA samples and 8 MMR proficient CRC genomic DNA samples.
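
As a hedged illustration of the per-marker ROC AUC referred to for Figures 1 and 3, the sketch below computes the AUC directly from RAF values using the pairwise (Mann-Whitney) formulation. It assumes that a lower RAF indicates greater instability, so that MMR deficient samples are expected to score below MMR proficient samples; the function name and example values are hypothetical.

```python
def roc_auc_from_raf(deficient_rafs, proficient_rafs):
    """ROC AUC of RAF as a classifier of MMR deficiency.

    Equals the probability that a randomly chosen deficient sample has a
    lower RAF than a randomly chosen proficient sample (ties count as 0.5).
    """
    if not deficient_rafs or not proficient_rafs:
        raise ValueError("both groups must contain at least one sample")
    wins = 0.0
    for deficient in deficient_rafs:
        for proficient in proficient_rafs:
            if deficient < proficient:
                wins += 1.0
            elif deficient == proficient:
                wins += 0.5
    return wins / (len(deficient_rafs) * len(proficient_rafs))


# Hypothetical RAFs for one marker: complete separation of the two groups.
print(roc_auc_from_raf([0.60, 0.70], [0.90, 0.95]))
```

An AUC of 1 corresponds to every deficient sample having a lower RAF than every proficient sample, and 0.5 corresponds to no discrimination.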

Figure 4 shows the difference in the minimum control RAF and the maximum CMMRD RAF for each mononucleotide repeat marker. A more negative difference represents increasing overlap between CMMRD and control RAFs. A more positive difference represents increasing separation between CMMRD and control RAFs. The results are based on samples from the pilot cohort, which contained 8 CMMRD peripheral blood leukocyte genomic DNA samples, 38 control peripheral blood leukocyte genomic DNA samples, 8 MMR deficient CRC genomic DNA samples and 8 MMR proficient CRC genomic DNA samples.
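
The quantity plotted in Figure 4 may be sketched as follows. The example assumes, for illustration only, that the RAF of a marker is the proportion of reads whose microsatellite length matches the reference allele; the function names and values are hypothetical.

```python
def reference_allele_frequency(read_lengths, reference_length):
    """Proportion of reads whose microsatellite length equals the reference."""
    if not read_lengths:
        raise ValueError("no reads supplied")
    matching = sum(1 for length in read_lengths if length == reference_length)
    return matching / len(read_lengths)


def raf_separation(control_rafs, cmmrd_rafs):
    """Minimum control RAF minus maximum CMMRD RAF for one marker.

    Positive values indicate separation of the groups; negative values
    indicate overlap.
    """
    return min(control_rafs) - max(cmmrd_rafs)


# Hypothetical reads for a 12-nucleotide repeat: three of four match.
print(reference_allele_frequency([12, 12, 11, 12], 12))
# Hypothetical per-sample RAFs for one marker.
print(raf_separation([0.90, 0.95], [0.60, 0.70]))
```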

Figure 5 shows the MSI assay score of the blinded cohort and known controls using 32 of the new mononucleotide repeat markers, and the scoring method described by Gallon et al. 2019 and Perez-Valencia et al. 2020. The results are based on samples from the large blinded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).

Figure 6 shows a comparison of MSI assay score from the blinded cohort and known controls using either the original 24 mononucleotide repeat markers or new 32 mononucleotide repeat markers, and the scoring method described by Gallon et al. 2019 and Perez-Valencia et al. 2020. The dotted lines represent the minimum CMMRD score, and the solid lines represent the maximum control or LS score. The results are based on samples from the large blinded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).

Figure 7 shows a comparison of microsatellite marker length (in nucleotides) and ROC AUC (using single molecule sequence [smSequence] RAF as a measure of MSI) to detect CMMRD in the blinded cohort and known controls. The results are based on samples from the large blinded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).

Figure 8 (A) shows a summary of MSI assay scores from the blinded cohort and known controls using different numbers of markers from ranking both new and original marker sets. It can be seen that use of a single marker provided separation between all CMMRD and control samples. (B) shows the same results as Figure 8A, but the y axis has been limited to show the separation of CMMRD and control scores at low marker numbers from ranking both new and original marker sets. (C) shows a summary of MSI assay scores from the blinded cohort and known controls using different numbers of markers from ranking the original marker set only. A persistent overlap between CMMRD and control samples can be seen. (D) shows the same data as Figure 8C, but the y axis has been limited to show the overlap of CMMRD and control scores with any marker combination from ranking the original marker set only. The results are based on samples from the large blinded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).

Figure 9 (A) shows the range of MSI assay scores in control and CMMRD samples, as well as the margin difference (minimum CMMRD score - maximum control score) and median difference (median CMMRD score - median control score) in CMMRD and control scores, in the blinded cohort and known controls using different numbers of markers from ranking both new and original marker sets. (B) shows the range of MSI assay scores in control and CMMRD samples, as well as the margin difference (minimum CMMRD score - maximum control score) and median difference (median CMMRD score - median control score) in CMMRD and control scores, in the blinded cohort and known controls using different numbers of markers from ranking the original marker set only. The results are based on samples from the large blinded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).

Figure 10 (A) shows the normalised margin difference ((minimum CMMRD score - maximum control score) / range control score) and the normalised median difference ((median CMMRD score - median control score) / range control score) in MSI assay score in control and CMMRD samples from the blinded cohort and known controls using different numbers of markers from ranking both new and original marker sets. (B) shows normalised margin difference ((minimum CMMRD score - maximum control score) / range control score) and the normalised median difference ((median CMMRD score - median control score) / range control score) in MSI assay score in control and CMMRD samples from the blinded cohort and known controls using different numbers of markers from ranking the original marker set only. The results are based on samples from the expanded cohort, which contained 30 CMMRD peripheral blood leukocyte genomic DNA samples, 73 control peripheral blood leukocyte genomic DNA samples (43 blinded and 30 known controls).
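
The margin difference, median difference, and their normalised forms as defined for Figures 9 and 10 can be computed as in the following minimal sketch. The function name and the example scores are hypothetical; only the formulas are taken from the figure descriptions.

```python
from statistics import median


def separation_metrics(cmmrd_scores, control_scores):
    """Margin and median differences between CMMRD and control MSI scores,
    raw and normalised by the range of the control scores."""
    control_range = max(control_scores) - min(control_scores)
    if control_range == 0:
        raise ValueError("control scores have zero range; cannot normalise")
    margin = min(cmmrd_scores) - max(control_scores)
    median_difference = median(cmmrd_scores) - median(control_scores)
    return {
        "margin_difference": margin,
        "median_difference": median_difference,
        "normalised_margin_difference": margin / control_range,
        "normalised_median_difference": median_difference / control_range,
    }


# Hypothetical MSI assay scores for three CMMRD and three control samples.
print(separation_metrics([10, 12, 14], [1, 2, 3]))
```

A positive margin difference indicates that every CMMRD score exceeds every control score; normalising by the control range allows marker sets with differently scaled scores to be compared.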

Figure 11 shows ROC AUC for each mononucleotide repeat marker from both new and original microsatellite marker sets, calculated from read RAFs of 50 MSI-H and 52 MSS CRCs.

Figure 12 shows a comparison of MSI assay score from 50 MSI-H and 52 MSS CRCs using either the original 24 mononucleotide repeat markers or top 24 new mononucleotide repeat markers, and the classification method described by Redford et al. 2018 and used by Gallon et al. 2020. The dotted lines represent the minimum MSI-H CRC score, and the solid lines represent the maximum MSS CRC score.

Figure 13 shows a comparison of microsatellite marker length (in nucleotides) and ROC AUC, calculated from read RAF of 50 MSI-H and 52 MSS CRCs, of both new and original microsatellite marker sets.

Figure 14 shows a summary of MSI assay scores from 50 MSI-H and 52 MSS CRCs from ranking the new microsatellite marker set and classification using different numbers of the top ranked markers (A) and using the original microsatellite marker set (B).

Figure 15 shows a summary of margin difference (minimum MSI-H CRC score - maximum MSS CRC score), median difference (median MSI-H CRC score - median MSS CRC score), and range in MSI assay scores from 50 MSI-H and 52 MSS CRCs from ranking the new microsatellite marker set and classification using different numbers of the top ranked markers (A) and using the original microsatellite marker set (B).

Figure 16 shows the normalised margin difference ((minimum MSI-H CRC score - maximum MSS CRC score) / range MSS CRC scores) and the normalised median difference ((median MSI-H CRC score - median MSS CRC score) / range MSS CRC scores) in MSI assay score of 50 MSI-H and 52 MSS CRCs from ranking the new microsatellite marker set and classification using different numbers of the top ranked markers (A) and using the original microsatellite marker set (B).

Figure 17 Whole genome sequencing and pilot amplicon sequencing to select MSI markers. The frequency of variant microsatellites by motif size in whole genome sequence data from blood, including the raw count of microsatellites containing a variant for each sample (A), and the relative frequency of non-germline microsatellite variants for each sample (B). Candidate MSI marker performance in amplicon sequence data from a pilot cohort of peripheral blood leukocyte (PBL) and colorectal cancer (CRC) samples, quantified by the receiver operator characteristic area under curve (ROC AUC) of microsatellite reference allele frequency (RAF) to discriminate between MMR-deficient and -proficient samples (C), and by the difference in median RAF between MMR-deficient and MMR-proficient samples (D).

Figure 18 shows sample MSI scores. The MSI scores of a blinded cohort of 56 CMMRD, 8 CMMRD-negative, and 43 control peripheral blood leukocyte (PBL) gDNAs, 80 reference control PBL gDNAs, and 40 Lynch syndrome PBL gDNAs using 32 new MSI markers (A). CMMRD-negative refers to patients with a CMMRD-like phenotype but no MMR variants at germline analysis. A comparison of initial and repeat MSI scores of 26 CMMRD and 33 control PBL gDNAs (B). scores of blood samples by sequencing batch Data for repeat amplification and sequencing of samples are shown.

Receiver operator characteristic area under curve (ROC AUC) values calculated from the ability of each MSI marker to separate CMMRD blood from control samples using microsatellite reference allele frequency (RAF), comparing new and original marker sets.
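As an illustration of how a per-marker ROC AUC could be derived from RAF values, the following Python sketch computes the AUC as the proportion of deficient/proficient sample pairs that are ordered correctly. It assumes that a lower RAF points towards MMR deficiency; the function name `roc_auc_from_raf` and the example values are hypothetical, and this is not presented as the inventors' implementation.

```python
def roc_auc_from_raf(deficient_raf, proficient_raf):
    """ROC AUC for one marker, treating a lower RAF as evidence of MMR deficiency."""
    if not deficient_raf or not proficient_raf:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for d in deficient_raf:
        for p in proficient_raf:
            if d < p:
                wins += 1.0   # pair ordered as expected
            elif d == p:
                wins += 0.5   # ties count as half
    return wins / (len(deficient_raf) * len(proficient_raf))

# Hypothetical per-sample RAFs for a single marker, not data from the study.
auc = roc_auc_from_raf([0.80, 0.85, 0.95], [0.93, 0.97, 0.99])
```

An AUC of 1.0 means the marker separates the two groups completely, while 0.5 corresponds to no discrimination. Markers can then be ranked by this value.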

Figure 21 MSI marker characteristics and performance. A comparison of the length of each MSI marker and its receiver operator characteristic area under curve (ROC AUC) to discriminate between CMMRD and control PBL samples (A). A comparison of MSI score of 50 CMMRD and 75 control PBL samples using either the original 24 tumour-derived MSI markers or an equivalent number of the most discriminatory of the new blood-derived MSI markers (B). MSI scores of blood samples using reduced panels of the most discriminatory N of the original MSI markers (left panel) and most discriminatory N of the new MSI markers (right panel).

shows MSI scores

As a further test of diagnostic utility, a larger panel of 54 new MSI markers was smMIP-amplified and sequenced in 192 colorectal cancers (CRCs) of known MSI status (MSI Analysis System v1.2, Promega) as a biomarker of MMR function. A larger panel of MSI markers could be used as, previously, the inventors have shown that smSequences provide no benefit to CRC MSI classification. Therefore, lower read depths of 3000x can be used, and hence more MSI markers assessed for equivalent cost. Custom R scripts were used to extract microsatellite variants from reads. The microsatellite deletion frequencies and allelic bias (if a heterozygous neighbouring SNP was available to discriminate between paternal and maternal alleles) in sequence reads generated from a training cohort of 50 MSI-H and 52 MSS CRCs were used to train a naive Bayesian classifier according to Redford et al. The remaining 90 CRCs (46 MSI-H, 44 MSS) formed the validation cohort. A tumour-MSI score was generated for each sample using the trained classifier. Tumour-MSI scores >0 indicate a higher probability the sample is MMR deficient than MMR proficient, and the inverse for scores <0.

Tumour-MSI scoring achieved 100% sensitivity (50/50; 95% CI: 92.9-100.0%) and 100% specificity (52/52; 95% CI: 93.2-100.0%) in the training cohort and 100% sensitivity (46/46; 95% CI: 92.3-100.0%) and 100% specificity (44/44; 95% CI: 92.0-100.0%) in the validation cohort (A). Training cohort samples were also analysed by the original MSI markers. Each marker’s ability to separate MMR deficient and MMR proficient CRCs by microsatellite reference allele frequency (RAF) in the training cohort data was assessed. RAF ROC AUCs of the new MSI markers were greater than the RAF ROC AUCs of the originals (p=8.31x10^-5) (B). To compare tumour-MSI classification by marker set with an equivalent number of MSI markers, the new MSI markers were ranked by ROC AUC and the most discriminatory 24 were used to re-score the training cohort samples, achieving 100% accuracy as for the full 54 marker panel (C). Scoring of the training cohort by the original MSI markers misclassified two CRCs, one MMR deficient (49/50; 98% sensitivity, 95% CI: 89.4-99.9%) and one MMR proficient (51/52; 98% specificity, 95% CI: 89.7-99.9%) (C). MMR deficient CRCs had more positive tumour-MSI scores when using new versus original MSI markers (p=3.16x10^-4) and MMR proficient CRCs had more negative scores when using new versus original MSI markers (p=2.23x10^-14), demonstrating a greater score-separation with the new MSI markers. The most discriminatory 24 new MSI markers also classified the validation cohort with 100% accuracy as for the full 54 marker panel (D).
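The 95% confidence intervals quoted above appear consistent with an exact (Clopper-Pearson) binomial interval, although the text does not name the method used. The following Python sketch is therefore an assumption about how such intervals could be reproduced; the function names are hypothetical and the bounds are found by simple bisection using only the standard library.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect(func, target, increasing):
    """Find p in [0, 1] with func(p) == target for a monotonic func."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(successes, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for successes out of n."""
    # Lower bound: p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else _bisect(
        lambda p: 1 - binom_cdf(successes - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == n else _bisect(
        lambda p: binom_cdf(successes, n, p), alpha / 2, increasing=False)
    return lower, upper

low_50, high_50 = clopper_pearson(50, 50)   # all 50 samples classified correctly
low_49, high_49 = clopper_pearson(49, 50)   # one sample misclassified
```

With a perfect result the upper bound is 100% by definition, so the width of the interval is driven entirely by the cohort size.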

Figure 24 shows sample MSI scores by patient genotype. The MSI scores of CMMRD patients by whether they have at least one MMR missense variant in their germline (A). A pair-wise comparison of the MSI scores of CMMRD patients who share the same MMR genotype (B).

Figure 25 shows associations of disease phenotype with MSI score or MMR genotype. The MSI score and age of first tumour of 50 CMMRD patients (A). The age of first tumour of 50 CMMRD patients by whether they have at least one MMR missense variant in their germline (B). Figure 26 shows sample MSI scores compared to patient age and presence of tumour at sample collection. The MSI score and age of sample collection of 30 CMMRD patients (A). The MSI score by whether the patient had a tumour at sample collection for 27 CMMRD patients (B).

Figure 27 MSI scores of a training and validation cohort of FFPE CRCs, NEQAS standards, and cancer cell lines (A). The microsatellite allele length and allele frequency distribution of the 24 original and 32 new MSI markers in 75 control blood samples, 50 CMMRD blood samples, 52 microsatellite stable (MSS) colorectal cancers (CRCs), and 50 MSI-high (MSI-H) CRCs for which sequence data from both marker sets were available (B).

Detailed description

The present invention is based on the inventors’ identification of new, highly accurate markers for evaluating microsatellite instability (MSI). The identification of these new markers allows the design and implementation of new MSI screening methods using a smaller number of microsatellite markers than previously thought possible. For example, prior to the identification of the markers disclosed herein, differentiation between CMMRD and control samples required analyzing 186 MSI markers (Gonzalez-Acosta et al. 2020). Surprisingly, using the markers disclosed herein, this may be achieved by analyzing just one of the microsatellite markers listed in Table A (for example using marker AKMmono10v2, LMmono05v2, AKMmono05 or EJmono12_SNP1). Furthermore, the inventors found that these markers are not only highly accurate in the context of detecting MSI associated with CMMRD, but may also be superior to previously disclosed microsatellite markers for differentiating between MSS and MSI cancers. Additionally, the inventors have surprisingly found that these microsatellite markers enable the evaluation of microsatellite instability not only in a solid sample (such as a solid tumour sample), but also in a fluid sample (such as a blood sample or urine sample).

Accordingly, in one aspect, provided herein is a method for evaluating levels of microsatellite instability in a sample, comprising: a) analyzing the sample’s DNA to determine the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is selected from Table A; b) comparing the nucleotide sequence to a predetermined sequence, and determining any deviation, indicative of instability, from the predetermined sequences.

In addition, some of the 62 markers are associated with a single nucleotide polymorphism (SNP) located within a short distance of the marker. Using markers associated with such SNPs can differentiate between amplification and/or sequencing errors, and MSI induced indels/mutations. Such SNPs are typically within 80 base pairs of the associated microsatellite marker, for example 50 base pairs, 40 base pairs, or 30 base pairs. Suitably, the single SNP has a minor allele frequency of above 0.05. Suitably, the SNP has a high heterozygosity. Accordingly, the invention also provides novel methods for evaluating the biological significance of sequence variation identified in a microsatellite marker listed in Table E.
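One way a linked heterozygous SNP could help separate genuine instability from technical error is to compare the indel frequency between the two SNP alleles: errors introduced during amplification or sequencing would be expected to affect both alleles similarly, whereas a real event tends to sit on one allele. The Python sketch below illustrates that idea only. The function name, the input layout of `(snp_allele, repeat_length)` tuples, and the use of deletions alone are assumptions, not the inventors' method.

```python
from collections import defaultdict

def allelic_deletion_bias(reads, reference_length):
    """Return per-allele deletion frequencies and their absolute difference.

    reads: iterable of (snp_allele, repeat_length) tuples, one per read that
    spans both the microsatellite and the neighbouring SNP (assumed layout).
    """
    counts = defaultdict(lambda: [0, 0])  # allele -> [deleted reads, total reads]
    for allele, length in reads:
        counts[allele][1] += 1
        if length < reference_length:
            counts[allele][0] += 1
    if len(counts) != 2:
        raise ValueError("expected reads from exactly two SNP alleles")
    freqs = {allele: deleted / total for allele, (deleted, total) in counts.items()}
    first, second = freqs.values()
    return freqs, abs(first - second)

# Hypothetical reads for a 12-base homopolymer linked to an A/G SNP.
reads = [("A", 12)] * 8 + [("A", 11)] * 2 + [("G", 12)] * 5 + [("G", 11)] * 5
freqs, bias = allelic_deletion_bias(reads, reference_length=12)
```

In this made-up example the deletion is seen in 20% of reads carrying allele A and 50% of reads carrying allele G, giving a bias of 0.3.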

In general, microsatellites are mono-, di-, tri-, tetra-, penta-, or hexanucleotide repeats found in DNA, consisting of at least two units and with a minimal length of 6 bases. Homopolymers are a particular subclass of microsatellites, which are mononucleotide repeats of at least 6 bases; in other words, a stretch of at least 6 consecutive A, C, T or G residues if looking at the DNA level. The microsatellite markers disclosed herein are homopolymers. The terms “microsatellite marker”, “microsatellite instability marker”, and “marker” are used herein interchangeably and have the same meaning.
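The homopolymer definition above (a run of at least 6 identical A, C, G or T residues) can be illustrated with a short Python sketch. The function name `find_homopolymers` and the example sequence are hypothetical.

```python
import re

# One base followed by at least five further copies of the same base.
HOMOPOLYMER = re.compile(r"([ACGT])\1{5,}")

def find_homopolymers(sequence):
    """Return (start, base, length) for each run of at least 6 identical bases."""
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in HOMOPOLYMER.finditer(sequence.upper())]

# Hypothetical sequence: an A run of 8, a C run of 5 (too short), a T run of 6.
sequence = "GC" + "A" * 8 + "GT" + "C" * 5 + "T" * 6 + "G"
runs = find_homopolymers(sequence)
```

Here the run of five C residues is not reported because it falls below the 6-base minimum.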

Microsatellite instability (MSI) as used herein refers to a unique molecular alteration and hyper-mutable phenotype, which is the result of a defective DNA mismatch repair (MMR) system, and can be defined as the presence of alternate sized repetitive DNA sequences as compared to a predetermined (for example reference) sequence. Suitably, in the context of the present disclosure, DNA may refer to genomic DNA. Suitably, the DNA may be cell free DNA. An alternate sized repetitive DNA sequence may be due to an “indel”. An “indel” as used herein refers to a mutation class that includes insertions, deletions, or a combination thereof. An indel in a microsatellite region results in a net gain or loss of nucleotides. The presence of an indel can be established by comparing it to DNA in which the indel is not present (e.g. comparing DNA from a tumour sample to germline DNA from the subject with the tumour), or by comparing it to a reference (predetermined) length of the microsatellite (e.g. human reference genomes). Comparison may involve counting the number of repeated units. In the context of the present disclosure, a deviation indicative of instability is an alternate sized repetitive DNA sequence, for example due to an indel.

The term “evaluating levels” as used herein refers to determining the presence or absence of microsatellite instability in a subject or sample obtained from the subject. Suitably, when the presence of microsatellite instability has been determined in a sample, the MSI status may be then determined by calculating the percentage of microsatellite markers that were found to have a deviation indicative of instability. MSI status can be one of two discrete classes: MSI-H (also referred to as MSI-high, MSI positive or MSI) or MSI-L (also referred to as MSI-low). Typically, to be classified as MSI-H, at least 30% of the markers used to classify MSI status need to score positive (i.e. have a deviation indicative of instability). If an intermediate number of markers scores positive (that is less than 30% but more than 0%), then the MSI status is classified as MSI-L. An absence of microsatellite instability may also be referred to as microsatellite stability (MSS).
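The percentage-based classification described above can be sketched in Python as follows. The function name `msi_status` and the boolean-per-marker input are hypothetical; the thresholds (at least 30% for MSI-H, above 0% but below 30% for MSI-L, none for MSS) follow the text.

```python
def msi_status(marker_unstable):
    """Classify MSI status from one boolean per marker (True = deviation found)."""
    total = len(marker_unstable)
    if total == 0:
        raise ValueError("at least one marker is required")
    unstable = sum(1 for flag in marker_unstable if flag)
    # Integer comparison of unstable / total >= 30% avoids floating-point rounding.
    if 10 * unstable >= 3 * total:
        return "MSI-H"
    if unstable > 0:
        return "MSI-L"
    return "MSS"

# Hypothetical panels of 10 markers.
high = msi_status([True] * 3 + [False] * 7)   # 30% unstable
low = msi_status([True] * 2 + [False] * 8)    # 20% unstable
stable = msi_status([False] * 10)             # 0% unstable
```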

As used herein, the noun “subject” refers to an individual vertebrate, more particularly an individual mammal, most particularly an individual human being. Suitably, the subject may be a human, but can also be a different mammal, particularly a domestic animal such as cat, dog, rabbit, guinea pig, ferret, rat, mouse, and the like, or a farm animal like horse, cows, pig, goat, sheep, llama, and the like. A subject can also be a non-mammalian vertebrate, like a fish, reptile, amphibian or bird; in essence any animal which can develop cancer fulfils the definition. Suitably, the subject has, is suspected of having, is at risk of having or is predisposed to a condition associated with microsatellite instability. Conditions associated with microsatellite instability can include one or more of: cancer conditions (e.g., colon cancer, gastric cancer, endometrium cancer, ovarian cancer, hepatobiliary tract cancer, urinary tract cancer, stomach cancer, small intestine cancer, brain cancer, skin cancer, haematological cancer, or any other solid or liquid malignant neoplasia); CMMRD, Lynch syndrome; Muir-Torre syndrome; and/or any other suitable conditions associated with mismatch repair deficiency. Haematological cancers can acquire MMR deficiency in therapy-resistant clones and therefore MSI analysis may be relevant to relapsed tumours even though MSI/MMR deficiency is rare in primary tumours. Lynch syndrome as used herein refers to an autosomal dominant genetic condition which has a high risk of colon cancer as well as other cancers including endometrium, ovary, stomach, small intestine, hepatobiliary tract, upper urinary tract, brain, and skin cancer. The increased risk for these cancers is due to inherited mutations that impair DNA mismatch repair. The old name for the condition is Hereditary Non-Polyposis Colorectal Cancer (HNPCC).

The term “sample” as used herein refers to samples comprising biological material and, in particular, DNA of the subject (or subject’s cancer). Suitably, the sample may be a fluid sample (such as blood, plasma, serum, saliva or urine, or part thereof), or a solid sample (such as a tissue biopsy for example of a tumour). Suitably, the solid sample may be formalin-fixed paraffin-embedded. Techniques for obtaining and preparing the aforementioned types of biological samples are well known in the art. In the context of the present disclosure, a part of a fluid sample includes cells that are present within the fluid sample. By way of example, when the fluid sample is a blood sample, a part of the blood sample may be peripheral blood leukocytes and/or cell free DNA present within the blood sample. Thus, in a suitable embodiment, the sample may be a peripheral blood leukocyte sample. Such a sample may be particularly suitable in a method of the invention where the microsatellite marker is selected from Table H. Testing biological samples using the methods described herein may be particularly useful e.g. for early cancer detection in those at high risk of cancer (for example diagnosed with CMMRD) or monitoring for disease recurrence (by assessing circulating tumour or cell free DNA). The term “cancer” as used herein refers to a disease involving unregulated cell growth, also referred to as malignant neoplasm. The term “tumour” is used as a synonym in the application. It is envisaged that this term covers all solid tumour types (carcinoma, sarcoma, blastoma), but it also explicitly encompasses non-solid cancer types such as leukemia. Thus, a “tumour sample” encompasses both solid tumour samples (e.g. tissue biopsies) as well as biological fluid samples (e.g. those that have been obtained or isolated from a bodily fluid such as urine, blood, plasma, serum etc).
As would be clearly understood by a person of skill in the art, the sample can be described as a “sample of tumour DNA”. The tumour DNA may be present within a bodily fluid such as urine, blood, plasma, serum etc and may be isolated from the bodily fluid prior to performing the methods described herein. Any appropriate method for obtaining or isolating the tumour DNA may be used. Several appropriate methods are well known in the art. Typically, a sample of tumour DNA has at one point been isolated from a subject, particularly a subject with cancer. Optionally, it has undergone one or more forms of pre-treatment (e.g. lysis, fractionation, separation, purification) in order for the DNA to be sequenced, although it is also envisaged that DNA from an untreated sample may be sequenced.

In the context of the present disclosure, the nucleotide sequence may be determined by sequencing (for example genomic DNA sequencing or amplicon sequencing). As used herein, “sequencing” refers to biochemical methods for determining the order of the nucleotide bases, adenine, guanine, cytosine, and thymine, in a DNA oligonucleotide. Methods of sequencing will be well known to those skilled in the art. Merely by way of example, sequencing may be by a method selected from the group consisting of high throughput sequencing, next generation sequencing, sequencing-by-synthesis, ion semiconductor sequencing and/or pyrosequencing.

Suitably, prior to determining the nucleotide sequence of the one or more microsatellite marker (for example by sequencing), the microsatellite marker may be amplified. In such embodiments, the methods provided herein may compare the sequences from the microsatellite amplicons to predetermined sequences and determine any deviation, indicative of instability, from the predetermined sequences. Methods for detecting an insertion or deletion are well known in the art.

Accordingly, the method for evaluating levels of microsatellite instability in a sample, may comprise: a) amplifying from the sample one or more microsatellite marker selected from Table A to generate microsatellite markers amplicons, b) analyzing the amplicons to determine the nucleotide sequence of one or more microsatellite marker; c) comparing the nucleotide sequence to a predetermined sequence, and determining any deviation, indicative of instability, from the predetermined sequences.

Although the invention is exemplified herein using molecular inversion probes (MIPs; e.g. single-molecule molecular inversion probes (smMIPs)) to amplify the selected markers, any other appropriate technique for amplifying the selected loci may be used. Alternative appropriate methods are well known in the art and include conventional PCR. In other words, the methods may use any appropriate nucleic acid sequence (e.g. primer and/or probe) that enables amplification of the selected markers. The amplification step may amplify each selected microsatellite marker individually (in a separate reaction), or may comprise coamplifying some or all of the selected markers in a multiplex amplification reaction. Suitable primers and/or probes may be selected for the chosen method using standard techniques. Suitably, in methods wherein a single nucleotide polymorphism (SNP) within a short distance of the selected microsatellite marker is to be amplified together with the marker in order to generate a single amplicon encompassing the marker and the SNP, primers and/or probes that amplify both the microsatellite marker and the SNP within a short distance of the microsatellite marker need to be used.

The primers and/or probes may contain a sequence of sufficient length and complementarity to a corresponding DNA region to specifically hybridize with that region under suitable hybridization conditions. The corresponding DNA region may be the region of the microsatellite marker itself, or a region up or downstream of the microsatellite marker (or marker and SNP). Sequences of exemplary probes are provided in Table F. These probes give rise to the kits of the present disclosure, which are described in more detail elsewhere in the present specification.

In the context of the present disclosure, multiplex amplification and sequencing techniques may be particularly advantageous because they allow for automated sequence analysis and high throughput diagnostics. However, as would be clear to a person of skill in the art, any other suitable means for amplifying and sequencing the informative MSI markers described herein may also be used (e.g. conventional PCR may be used).

Upon determination of the nucleotide sequence of one or more microsatellite marker, the nucleotide sequence is compared to a predetermined sequence in order to determine any deviation indicative of instability from the predetermined sequences. The deviation may be an indel when compared to the predetermined sequences. The predetermined sequence (also referred to as a reference sequence) may be a sequence of said microsatellite marker in a healthy control, for example a subject or group of subjects believed or known to not have, not be at risk of, or not be predisposed to a microsatellite instability associated condition. However, one of the advantages of the methods described herein is that accurate MSI classification using the methods provided herein does not require control normal DNA and MSI may be determined by simply counting the number of repeats.
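The comparison by repeat counting can be illustrated with a simplified Python sketch. It measures the longest run of the repeated base in each read, reports the deviation from a predetermined reference length, and derives the fraction of reads that match the reference (the reference allele frequency, RAF). The function names and example reads are hypothetical, and the approach is deliberately reduced: a practical pipeline would locate the microsatellite using its flanking sequence rather than taking the longest run.

```python
def longest_run(sequence, base):
    """Length of the longest uninterrupted run of `base` in `sequence`."""
    best = current = 0
    for nucleotide in sequence.upper():
        current = current + 1 if nucleotide == base else 0
        best = max(best, current)
    return best

def reference_allele_frequency(reads, base, reference_length):
    """Return (RAF, per-read deviation in bases from the reference length)."""
    if not reads:
        raise ValueError("at least one read is required")
    deviations = [longest_run(read, base) - reference_length for read in reads]
    raf = sum(1 for d in deviations if d == 0) / len(reads)
    return raf, deviations

# Hypothetical reads covering a 10-base A homopolymer; one carries a 1-base deletion.
reads = [
    "GC" + "A" * 10 + "GT",
    "GC" + "A" * 10 + "GT",
    "GC" + "A" * 9 + "GT",
    "GC" + "A" * 10 + "GT",
]
raf, deviations = reference_allele_frequency(reads, "A", reference_length=10)
```

A negative deviation corresponds to a deletion and a positive deviation to an insertion relative to the predetermined length.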

The methods of the present invention comprise determining the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is listed in Table A.

Suitably, the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 28, 32 or more microsatellite markers listed in Table A.

Suitably, at least one of the microsatellite markers is selected from Table B, Table D, Table H, or Table I; or at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23, at least 24, or more microsatellite markers are selected from Table B, Table D, Table H or Table I.

More suitably, at least one of the markers is selected from the top 21 markers listed in Table B. Suitably, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or 21 are selected from the top 21 markers listed in Table B. Suitably, the one or more markers selected from the top 21 markers listed in Table B may be in combination with one or more other markers listed in Tables A, B, C, D, H or I.

Suitably, the one or more microsatellite markers listed in Table A is selected from the group of microsatellite markers listed in Table C, optionally at least one of the microsatellite markers is selected from Table D, or at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23, or 24 microsatellite markers are selected from Table D.

Suitably, when the method comprises the step of determining the nucleotide sequence of one microsatellite marker, the microsatellite marker may be any one of the markers listed in Table A, B, C, D, H or I. Suitably, the methods of the present invention comprise determining the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is listed in Table H. Suitably, the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 28, or more microsatellite markers listed in Table H. More suitably, the one or more microsatellite markers is 24 or more microsatellite markers listed in Table H. More suitably, the one or more microsatellite markers is 32 markers listed in Table H. In an embodiment where the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from Table H (for example 24 or more, or all 32 markers listed in Table H), the sample may be a fluid sample (such as a blood sample, or part thereof, for example peripheral blood leukocytes). Suitably, when the one or more microsatellite marker is listed in Table H, the sample is a blood sample or part thereof, for example PBLs.

Suitably, the methods of the present invention comprise determining the nucleotide sequence of one or more microsatellite marker, wherein the one or more microsatellite marker is listed in Table I. Suitably, the one or more microsatellite markers is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 microsatellite markers listed in Table I. More suitably, the one or more microsatellite markers is 10, 11, 12, 13 or 14 microsatellite markers listed in Table I. More suitably, the one or more microsatellite markers is the 14 microsatellite markers listed in Table I.

Suitably, the microsatellite markers disclosed herein may be amplified in a multiplex PCR reaction. Suitably, the multiplex PCR method may be a single-round or two-round multiplex PCR method, more suitably a single-round multiplex PCR method. Suitably, the single-round multiplex PCR may involve amplifying from a sample one or more marker (for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 14 microsatellite markers) listed in Table I. Suitably, the markers may be amplified using the primers comprising or consisting of the sequences as shown in Table I, prior to determining the nucleotide sequences of the markers.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of AKMmono10v2, LMmono05v2, AKMmono05 and EJmono12_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1 and MSJmono22_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1 and EJmono14v2_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1 and MSJmono20_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1 and AKMmono07_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1 and AKMmono05_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1 and LMmono09_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1 and AKMmono02_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1 and AKMmono13_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1 and LMmono08_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1 and MSJmono39_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1 and LMmono03_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1 and AKMmono03_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1 and MSJmono27_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1 and MSJmono46_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1 and MSJmono11_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1 and AKMmono12_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1, AKMmono12_SNP1 and MSJmono40_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1, AKMmono12_SNP1, MSJmono40_SNP1 and EJmono03_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1, AKMmono12_SNP1, MSJmono40_SNP1, EJmono03_SNP1 and AKMmono17v2_SNP1.

Suitably, when the method comprises the step of determining the nucleotide sequence of one or more microsatellite markers, the one or more microsatellite markers may be selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1, AKMmono12_SNP1, MSJmono40_SNP1, EJmono03_SNP1, AKMmono17v2_SNP1 and AKMmono16_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of EJmono12_SNP1, LMmono05v2_SNP1, AKMmono14_SNP1, MSJmono22_SNP1, EJmono14v2_SNP1, MSJmono20_SNP1, AKMmono07_SNP1, AKMmono05_SNP1, LMmono09_SNP1, AKMmono02_SNP1, AKMmono13_SNP1, LMmono08_SNP1, MSJmono39_SNP1, LMmono03_SNP1, AKMmono03_SNP1, MSJmono27_SNP1, MSJmono46_SNP1, MSJmono11_SNP1, AKMmono12_SNP1, MSJmono40_SNP1, EJmono03_SNP1, AKMmono17v2_SNP1, AKMmono16_SNP1 and LMmono10v2_SNP1.

Suitably, the method comprises determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of AKMmono02_SNP1, AKMmono03_SNP1, AKMmono04_SNP1, AKMmono07_SNP1, AKMmono12_SNP1, AKMmono13_SNP1, AKMmono16_SNP1, EJmono12_SNP1, MSJmono20_SNP1, MSJmono39_SNP1, and MSJmono45_SNP1. Optionally, the method may further comprise determining the nucleotide sequence of one or more microsatellite marker selected from the group consisting of LR36, GM07, and LR44.

Suitably, the methods of the invention may comprise determining and comparing the nucleotide sequence of one or more microsatellite marker (for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 or more) selected from Table A, B, C, D, E, H, or I in combination with one or more marker described in WO2021019197, which is incorporated herein by reference. More suitably, the one or more marker selected from WO2021019197 may be selected from the group consisting of LR36, GM07, LR48, LR44, and LR52 (the details of which are provided in Table G hereinbelow), more suitably LR36, GM07, and LR44. An example of such a suitable combination of markers is shown in Table I. Additionally or alternatively, the methods of the invention may comprise determining and comparing the nucleotide sequence of one or more microsatellite marker (for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 or more) selected from Table A, B, C, D, E, H, or I in combination with the nucleotide sequence of one or more tumour mutation hotspots. Exemplary tumour mutation hotspots are provided in the Examples section of the present application. These hotspots may be particularly relevant in the context of CRC. Other suitable tumour mutation hotspots will be known to those skilled in the art. Suitable tumour mutation hotspots are, for example, described in Modest et al., 2016 (doi: 10.1093/annonc/mdw261), which is incorporated herein by reference.

Suitably, the methods of the present invention may involve determining the nucleotide sequence of less than 63, less than 62, less than 61, less than 60, less than 59, less than 58, less than 57, less than 56, less than 55, less than 54, less than 53, less than 52, less than 51, less than 50, less than 49, less than 48, less than 47, less than 46, less than 45, less than 44, less than 43, less than 42, less than 41, less than 40, less than 39, less than 38, less than 37, less than 36, less than 35, less than 34, less than 33, less than 32, less than 31, less than 30, less than 29, less than 28, less than 27, less than 26, less than 25, less than 24, less than 23, less than 22, less than 21, less than 20, less than 19, less than 18, less than 17, less than 16, less than 15, less than 14, less than 13, less than 12, less than 11, less than 10, less than 9, less than 8, less than 7, less than 6, less than 5, less than 4, or less than 3 microsatellite markers. For avoidance of doubt, when the method of the invention comprises determining the nucleotide sequence of, for example, less than 6 microsatellite markers, it may involve one or more, but less than 6, microsatellite markers (so for example 1, 2, 3, 4, or 5 microsatellite markers).

Although the markers disclosed herein may provide accurate differentiation between MSI and MSS when analysed individually, it will be appreciated by a person of skill in the art that the addition of further microsatellite markers may further improve the accuracy and/or robustness of the methods of the invention. It will also be appreciated by a person of skill in the art that some microsatellite markers and/or microsatellite marker combinations may be more informative than others. Tables B, C and D provide lists of markers that have been ranked from the most to the least informative. It will therefore be understood by a person skilled in the art that more highly ranked markers and/or combinations of more highly ranked markers may be more informative than markers or combinations with lower rankings.

Advantageously, the markers and/or marker combinations provided herein allow for an MSI classification accuracy of at least 0.9, preferably at least 0.95, more preferably at least 0.999 or 1. The marker combinations provided herein can therefore achieve a clinically acceptable MSI classification accuracy with significantly fewer markers than was previously understood to be necessary, meaning that the associated methods and kits can be significantly cheaper and more efficient. The marker combinations provided herein are therefore particularly advantageous in achieving a clinically acceptable MSI classification accuracy.

As mentioned hereinabove, in some embodiments, the methods described herein may be performed using a multiplex PCR method (for example a single-round or two-round multiplex PCR method). Such a multiplex PCR may be utilised in the amplification step of one or more markers (such as those listed in Table I) from the sample to generate microsatellite marker amplicons prior to step a).

The term “about” as used herein, for example with reference to the thermocycler programme, means plus or minus 10% or less. For example, plus or minus 9%, plus or minus 8%, plus or minus 7%, plus or minus 6%, or less. For example, plus or minus 5%, plus or minus 4%, plus or minus 3%, plus or minus 2%, or plus or minus 1%, or less.

The methods described herein may include the step of determining allelic imbalance. Assessing whether length variants are concentrated in sequence reads from one SNP allele offers an additional criterion to differentiate between PCR artefacts and mutations that occur in vivo, and can provide additional discrimination between MSI and MSS samples. This is because PCR artefacts are likely to affect both alleles equally, whereas microsatellite instability is a stochastic event affecting a single allele at a time. This can lead to bias in the levels of instability observed between the alleles at a single microsatellite marker, even if both are unstable. As mentioned elsewhere in the present specification, some of the novel markers identified by the inventors and listed in Table A are associated with SNPs. These markers may be useful in the context of a method for evaluating the biological significance of any microsatellite instability in the sample, the method comprising amplifying both the microsatellite marker and an SNP within a short distance of it in a single amplicon (by e.g. using primers and/or probes), and for heterozygous SNPs, determining whether there is a bias between indel frequencies for the two alleles of the sample.

Accordingly, in a further aspect, provided herein is a method for evaluating the biological significance of sequence variation identified during sequencing, comprising: a) amplifying from the sample one or more microsatellite marker listed in Table E to generate microsatellite marker amplicons, wherein each microsatellite locus has a single nucleotide polymorphism (SNP) within a short distance of the microsatellite marker and said amplifying step amplifies both the microsatellite marker and associated SNP in a single amplicon; b) sequencing the amplicons; c) comparing the sequences from the amplicons to predetermined sequences and determining any deviation, indicative of instability, from the predetermined sequences; and d) for heterozygous SNPs, determining whether there is a bias between indel frequencies for the two alleles.
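By way of illustration only, step d) may be sketched as follows. The sketch assumes that sequence reads have already been assigned to one of the two SNP alleles and counted as either indel-containing or reference-length. The use of Fisher's exact test, and the function names `fisher_exact_two_sided` and `indel_bias`, are illustrative assumptions and are not specified in the present disclosure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of every table sharing the observed margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Sum the tables that are no more likely than the observed one.
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))


def indel_bias(allele_a, allele_b):
    """Each argument is (indel_reads, reference_length_reads) for one SNP allele.

    Returns the indel frequency of each allele and a p-value for the
    hypothesis that indels are distributed equally between the alleles.
    """
    freq_a = allele_a[0] / sum(allele_a)
    freq_b = allele_b[0] / sum(allele_b)
    p_value = fisher_exact_two_sided(allele_a[0], allele_a[1],
                                     allele_b[0], allele_b[1])
    return freq_a, freq_b, p_value
```

Under these assumptions, similar indel frequencies on both alleles would be consistent with PCR artefacts, whereas a marked bias towards one allele would be consistent with instability arising in vivo.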

Suitably, the one or more microsatellite marker may be any 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 or all 14 markers in Table E. Suitably, at least one of the markers selected from Table E may be AKMmono10v2 or LMmono05v2.

Suitably, the SNP is within 100 base pairs, more suitably within 50 base pairs, most suitably within 30 base pairs from the microsatellite marker.

It will be appreciated that the embodiments described herein in the context of a method for evaluating microsatellite instability may equally apply to the method for evaluating the biological significance of sequence variation.

A method as above may be useful for identifying mismatch repair defects, wherein deviation from the predetermined sequences for one or more (for example 2, 3, 4, 5, 6 or more) microsatellite markers is indicative of a mismatch repair defect.

A method as above may be useful for identifying MSI, wherein deviation from the predetermined sequences for one or more (for example 2, 3, 4, 5, 6 or more) microsatellite markers is indicative of the sample having MSI.

In a further aspect, the invention provides a kit for use in the methods of the invention. The kit may comprise primers and/or probes for amplifying microsatellite markers and/or microsatellite markers with their associated SNPs in accordance with the above.

The kit may also comprise a thermostable polymerase and/or labelled dNTPs or analogs thereof. The labelled dNTPs or analogs thereof may be fluorescently labelled. Suitably, the kit may comprise, as well as the primers and/or probes for amplifying the microsatellite markers and/or microsatellite markers with their associated SNPs, reagents necessary for carrying out the methods of the invention, for example enzymes, dNTP mixes, buffers, PCR reaction mixes, chelating agents and/or nuclease-free water. The kit may comprise instructions for carrying out a method of the invention.

The primers and/or probes for amplifying microsatellite markers and/or microsatellite markers with their associated SNPs in accordance with the above may have sequences as provided in Table F and/or I. The kit may comprise primers and/or probes for amplifying one or more microsatellite markers and/or microsatellite markers with their associated SNPs listed in Table A. Suitably, the kit may comprise primers and/or probes for amplifying 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or more microsatellite markers and/or microsatellite markers with their associated SNPs listed in Table A. Suitably, the kit may comprise primers and/or probes for amplifying combinations of microsatellite markers provided elsewhere in the present specification (for example Table F). Suitably, the kit may comprise primers and/or probes for amplifying combinations of markers provided elsewhere in the present specification (for example Table I and/or Table 4).

Suitably, the kit may be a kit comprising reagents necessary for carrying out a single-round multiplex PCR reaction. Suitably, such a kit may comprise a buffer (for example 5x HS VeriFi Buffer), a polymerase (for example HS VeriFi DNA Polymerase), and optionally a multiplex primer mix and/or molecular grade H2O. Suitably, the primer mix may comprise or consist of one or more primers listed in Table I and/or Table 4.

It will be appreciated that, depending on the location of the microsatellite markers, more than one microsatellite marker (optionally with an associated SNP) may be amplified in a single amplicon. Typically this may be the case for markers that are found within close vicinity of one another. Thus more than one marker (optionally with an associated SNP) may be amplified using one or a pair of the same probes/primers.

Throughout the present disclosure marker names such as “EJmono12_SNP1” and “EJmono12” refer to the same marker. In the tables below “rsXXXXXXX” indicates that there is no SNP associated with the marker.

TABLE A 62 Microsatellite Instability Markers

Table B 28 microsatellite instability markers

Table C 54 Microsatellite Instability Markers

Table D 24 Microsatellite Instability Markers

Table E Microsatellite Markers associated with SNP

Table F Microsatellite Instability Marker Probe Sequences

EJmono13v2 CGGTCTTTCCACATAGGATAATTGGGGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNTGAACGCAAGTGGCAA

EJmono14v2 AAAGTGCAAGTAAATATTAAACAACTGGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNCCTCAGTTCCTTCTTTA

EJmono16 ACCAACCAACAAACAAAAAGCTGAAACCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGTGGGTGATTGCAATTG

EJmono21v2 SEQ ID NO: 27 GCTTATAAAAACCTCAGCTAGGTCTCAGANNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNTATAGTGCAGTTTGGT

HGtetra23ms2 SEQ ID NO CTTTGTACTTGTATCTCTGGATGCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGGACAAGAGTGAAGCTTCAT

LMmono01 SEQ ID NO: 29 CTATGGGATTATGTAGAAAGACTGAACCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNCTGAAATAAGATACACT

LMmono03 SEQ ID NO: 30 TAATAGAGTTGACCATCACAACGAATGGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGTTAGTATCACTAGGGC

LMmono04v2 SEQ ID NO: 31 TCTTCATTCCACGTAACCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNTGCCATGTTCCAATCAATTCTAAATC

LMmono05v2 SEQ ID NO: 32 GAGTACTTACGATGTGCCAAATACNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNAGGCACAAAAAGATAAAA

LMmono07 SEQ ID NO: 33 CCCAGTCTCAACTATTATGTAATAGCAGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGCTAGCTCCTTGATCTT

LMmono08 TGGCTGGAAATTTTCCAAACTTGATGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNAAGAAGCCCAATACACAGG

LMmono09 GTTGACCTCGAACTCCAGTCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGCGTCCTCCCTTTATGTTTTGTTG

LMmono10v2 GAAGTGTGGGAAAACTGGCACTCTGGCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGAAAGCAGGAGACTACG

LMmono12 TCCACAGTGACACAAACCAATTCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGATGCACTTTGAAACAGCCTT

LMmono16 CATCCTTAAGAGAGACAAACCTCTGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGGGACTTTCCAGGATTTGCC

MSJcom06ms1 + ms2 SEQ ID NO: 39 TGGGTGCGGTGGCCCACACCTGTAATTNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNCCTCACTGAGTAGTTTTT

MSJmono10 SEQ ID NO: 40 GTACATCAATTTGGGGAGAATTTGCATCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNAAACAACCTTGTCTGT

MSJmono11 SEQ ID NO: 41 TCCCCCTTCTCTCTCTTTCTCCTGCTANNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNAGCCGAGATCACACCTGG

MSJmono15 SEQ ID NO: 42 GTGAAGCCTGACCAATGAAGACATCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGGGCAACAGAGAGAGATGC

MSJmono17 SEQ ID NO: 43 GTGGGTGTACTAAACATATTTGATACCTNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNTAGCTTGGGTGACGGAG

MSJmono19ms1 + ms2 SEQ ID NO: 44 CACAAATTGGTAACACTGATCCATCTNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNGAGAGATTCTGTCTCTACC

MSJmono20 SEQ ID NO: 45 GACTTGAGGATATCCTCCAGGAAAATGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNTAGCACTGCAGTGAGCTG

MSJmono22 SEQ ID NO: 46 GTGTTTCAGATACGTCGGTAACNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNCGTGCCATTGCACTCTATCCTGG

MSJmono23v2 SEQ ID NO: 47 TTCCCACCTCAGCCTCCTGAGTAGCTAGCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNCACATCCTACACTCCA

MSJmono26 SEQ ID NO: 48 GCGGAGTCTCGCTCTGTCGCCCATGCTGGNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNATACCCACATGATCAT

MSJmono27 SEQ ID NO: 49 TGGTGGGATTGTTCACACCTGTAATCCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNACTGTAACCTGGCCAAC

MSJmono30v2 SEQ ID NO: 50 TCCTTTATAAATTACCCAGTCTCGGCCNNNNCTTCAGCTTCCCGATATCCGACGGTAGTGTNNNNAAATAAAGTGGTTAAGAA
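Each probe sequence listed above contains the same central sequence flanked on both sides by NNNN. The following sketch, provided by way of illustration only, splits a probe into the two marker-specific flanking arms on that basis. Treating the central sequence as a common backbone and the NNNN runs as degenerate tags is an interpretation of the listed sequences, and the names `BACKBONE`, `TAG` and `split_probe` are hypothetical.

```python
# Central sequence shared by every probe listed in Table F (observed, not named in the source).
BACKBONE = "CTTCAGCTTCCCGATATCCGACGGTAGTGT"
# Run of four degenerate bases found on either side of the shared sequence.
TAG = "NNNN"


def split_probe(probe):
    """Return (five_prime_arm, three_prime_arm) of a probe written 5' to 3'.

    Whitespace introduced by line wrapping is removed before matching.
    """
    seq = "".join(probe.split()).upper()
    left, found, right = seq.partition(TAG + BACKBONE + TAG)
    if not found or not left or not right:
        raise ValueError("probe does not match the arm-NNNN-backbone-NNNN-arm layout")
    return left, right
```

For example, applying `split_probe` to the LMmono09 entry would return the arms GTTGACCTCGAACTCCAGTCC and GCGTCCTCCCTTTATGTTTTGTTG.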

Table G Additional microsatellite markers that may be used in combination with the markers of the present disclosure

Table H 32 Microsatellite Instability Markers

Table I Markers and primers selected for single-round multiplex PCR

Examples

Example 1

In the context of the present description, including the Examples provided below, when reference is made to original marker or original MSI assay, this refers to the markers and/or assay as described in WO2021019197.

1. Identification of candidate microsatellite markers from whole genome sequence data

1.1. Identification of microsatellite loci using whole genome sequencing

Samples:

3 CMMRD peripheral blood leukocyte genomic DNAs; 1 LS peripheral blood leukocyte genomic DNA and 2 control peripheral blood leukocyte genomic DNAs were used.

Method and Analysis:

PCR amplification of samples using NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (New England Biolabs) was performed, followed by high depth (120x) genome sequencing on a NovaSeq (Illumina). A custom bioinformatics pipeline was then used to detect microsatellite variants in the genome sequence and microsatellite loci with variant allele frequencies suggestive of somatic instability in the CMMRD and LS samples, but not in the control samples were selected.

Results:

191 microsatellite loci, including mono-, di-, tri-, tetra- and pentanucleotide repeats, as well as complex microsatellites containing multiple motifs, were identified from the whole genome sequence data.

1.2. Design and confirmation of smMIPs to capture candidate microsatellite markers

Samples:

3 FFPE tumour genomic DNAs and 3 control peripheral blood leukocyte genomic DNAs were used.

Method and Analysis:

smMIPs to capture the identified microsatellite loci were designed using MIPgen software (Boyle et al. Bioinformatics. 2014 Sep 15;30(18):2670-2, DOI:10.1093/bioinformatics/btu353. PMID: 24867941). smMIP-based amplification of microsatellite loci from samples was then performed, followed by high depth (1000x) amplicon sequencing on a MiSeq (Illumina). Finally, the read depth achieved for each smMIP was calculated as a check of smMIP performance.

Results:

MIPgen failed to design a smMIP for 21 of the 191 microsatellite loci. Therefore, 170 microsatellite loci had a smMIP available to be checked by amplicon sequencing. Some loci contained multiple distinct microsatellites and, therefore, these 170 smMIPs capture a total of 213 candidate microsatellite markers. smMIPs were analysed in batches and smMIPs generating read counts >10% of the median read depth passed the smMIP quality check, totalling 133 of the 170 smMIPs. These 133 smMIPs capture 155 candidate microsatellite markers, including mono-, di-, tri-, tetra- and pentanucleotide repeats, as well as complex microsatellites containing multiple motifs.
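The quality check described above (read counts >10% of the median read depth) may be sketched as follows for a single batch. This is a minimal illustration only: the function name `passing_smmips` and the dictionary input format are assumptions, and the batching used by the inventors is not reproduced.

```python
from statistics import median


def passing_smmips(read_counts, fraction=0.10):
    """Return the names of smMIPs whose read count exceeds `fraction` of the batch median.

    `read_counts` maps a smMIP name to its read count within one batch.
    """
    threshold = fraction * median(read_counts.values())
    return {name for name, count in read_counts.items() if count > threshold}
```

With hypothetical counts of 1000, 900, 50, 1100 and 95 reads, the median is 900, the threshold is 90, and only the smMIP with 50 reads would fail.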

Mononucleotide repeat markers are of particular interest as they are sensitive to deficiency of any MMR protein, whereas longer motif microsatellite markers (such as dinucleotide repeats) are not sensitive to MSH6 deficiency. 91 of the 133 smMIPs that passed the quality check capture at least one mononucleotide repeat and, in total, capture 98 mononucleotide repeats between them.

All 91 smMIPs and the candidate 98 mononucleotide repeat markers they capture were taken forward for additional analysis using smMIP-based amplification and amplicon sequencing of a pilot cohort of samples.

2. Candidate marker selection using a pilot cohort of blood and tumour DNA samples

Samples:

8 CMMRD peripheral blood leukocyte genomic DNAs; 38 control peripheral blood leukocyte genomic DNAs; 8 MMR deficient CRC genomic DNAs and 8 MMR proficient CRC genomic DNAs were used.

Method and Analysis:

smMIP-based amplification of the candidate 98 mononucleotide repeat markers from samples, followed by high depth (5000x) amplicon sequencing on a MiSeq (Illumina) was performed.

A custom bioinformatics pipeline to extract microsatellite allele frequencies from the amplicon sequence data was used. Estimation of the frequency of germline length variants in each candidate marker using the blood samples was performed. Finally, assessment of microsatellite allele distribution by visual inspection of graphs of microsatellite allele frequencies was carried out. Many aspects of the distributions were considered to determine if the marker had unambiguous signals of increased MSI in the MMR deficient tumour or blood samples. Markers were then assigned to groups, where group 1 contained markers with the clearest MSI signal, down to group 4 which contained markers with no clear MSI signal.

Assessment of microsatellite reference allele frequency (RAF) as a measure of MSI in different sample types, and generation of receiver operator characteristic area under curve (ROC AUC) statistics based on RAF as a measure of marker sensitivity and specificity for MMR deficiency was performed. Analyses of tumour DNA samples used read frequencies, whereas analyses of blood DNA samples used smSequence frequencies (“read” and “smSequence” are defined as per Gallon et al. 2019).
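By way of illustration only, RAF and a RAF-based ROC AUC for a single marker may be computed as sketched below. The sketch assumes that RAF is the fraction of reads (or smSequences) carrying the reference-length allele and that a lower RAF indicates greater instability. The function names and input formats are hypothetical, and the custom pipeline of the inventors is not reproduced.

```python
def reference_allele_frequency(length_counts, reference_length):
    """Fraction of reads (or smSequences) whose microsatellite length equals the reference."""
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("no reads for this marker")
    return length_counts.get(reference_length, 0) / total


def roc_auc(deficient_rafs, proficient_rafs):
    """ROC AUC for detecting MMR deficiency from RAF.

    Equals the probability that a randomly chosen MMR deficient sample has a
    lower RAF than a randomly chosen MMR proficient sample (ties count half).
    """
    wins = 0.0
    for d in deficient_rafs:
        for p in proficient_rafs:
            if d < p:
                wins += 1.0
            elif d == p:
                wins += 0.5
    return wins / (len(deficient_rafs) * len(proficient_rafs))
```

An AUC of 1.0 under this definition means that every MMR deficient sample has a lower RAF than every MMR proficient sample for that marker.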

Candidate mononucleotide repeat markers for inclusion in a version 2 MSI assay to be assessed in larger sample cohorts were selected. For a version 2 tumour MSI assay we aimed for approximately 50 markers, and for a version 2 CMMRD MSI assay (described herein) we aimed for approximately 30 markers (MSI analysis to detect CMMRD requires much higher read depths, so fewer markers were used to reduce the total reads needed and, therefore, the cost of sequencing). In the context of the present disclosure, the term “version 2 MSI assay” refers to the assays and/or MSI markers that are described herein.

Results:

All samples, except 7 control samples, had previously been analysed by the original 24 mononucleotide markers using the same smMIP-based method (disclosed in WO2021019197), allowing comparison of the new candidate mononucleotide repeat markers and original mononucleotide repeat markers. For these pilot comparisons, it should be noted that the original markers have already been through several rounds of selection to optimise the panel for the detection of MMR deficiency in CRCs and so also contain good markers for the detection of CMMRD in non-neoplastic tissues.

Many of the new candidate mononucleotide repeat markers were equivalent or superior to the original mononucleotide repeat markers for the detection of MMR deficiency in different sample types. For example:

o Using RAF as a measure of MSI to detect MMR deficiency in CRCs, 59/98 (60.2%) of the new candidate markers and 17/24 (70.8%) of the original markers had a ROC AUC >0.95 (Figure 1).

o The difference in the median RAF of the MMR proficient versus MMR deficient CRCs (median difference) was generally greater among the new candidate markers versus the original markers (Figure 2, Mann Whitney U Test p = 6.5 x 10^-5).

o Using RAF as a measure of MSI to detect CMMRD, 49/98 (50.0%) of the new candidate markers and 12/24 (50.0%) of the original markers had a ROC AUC >0.95 (Figure 3).

o The difference in the minimum RAF of the controls and the maximum RAF of the CMMRDs (margin difference) ranged up to 0.025 among the new candidate markers, whereas the greatest margin difference observed with the original markers was 0.004 (Figure 4).

The 62 best of the new candidate mononucleotide repeat markers (captured by 60 smMIPs) were selected from the smMIP amplicon sequence data, based on microsatellite allele distribution, RAF ROC AUC for the detection of MMR deficiency in both tumour and blood samples, and frequency of germline length variants (Table 1). All 62 mononucleotide repeat markers were taken forward for additional analysis using smMIP-based amplification and amplicon sequencing of a large colorectal cancer cohort.

These 62 mononucleotide repeat markers were further ranked by RAF ROC AUC and margin difference for the detection of CMMRD versus control samples. The best 32 (captured by 31 smMIPs) were selected (Table 1) for further assessment using smMIP-based amplification and amplicon sequencing of a large blinded cohort of CMMRD and control samples.

Table 1: Selection of candidate mononucleotide repeat markers to create version 2 MSI assays.

Selection criteria for version 2 tumour MSI assay included: Germline variant frequency <0.10, ROC AUC >0.90 for detection of MMR deficiency in either blood or tumour samples, and placement in allele distribution group 1 or 2 (see Methods and Analysis).

Further selection criteria for the version 2 CMMRD assay included: Germline variant frequency <0.05, ROC AUC >0.95 for detection of MMR deficiency in blood samples, minimum control blood RAF >0.88, and margin difference (minimum control RAF - maximum CMMRD RAF) >-0.02.
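The selection criteria stated above may be expressed, by way of illustration only, as two filters over per-marker summary statistics. The `MarkerStats` record and its field names are hypothetical, and treating the CMMRD criteria as applying in addition to the tumour criteria is an assumption based on the word "further".

```python
from dataclasses import dataclass


@dataclass
class MarkerStats:
    """Hypothetical per-marker summary; field names are illustrative only."""
    name: str
    germline_variant_freq: float
    blood_auc: float
    tumour_auc: float
    allele_group: int      # allele distribution group, 1 (clearest signal) to 4
    control_rafs: list     # RAF of each control blood sample
    cmmrd_rafs: list       # RAF of each CMMRD blood sample


def margin_difference(m):
    """Minimum control RAF minus maximum CMMRD RAF."""
    return min(m.control_rafs) - max(m.cmmrd_rafs)


def passes_tumour_criteria(m):
    return (m.germline_variant_freq < 0.10
            and max(m.blood_auc, m.tumour_auc) > 0.90
            and m.allele_group in (1, 2))


def passes_cmmrd_criteria(m):
    # Assumption: applied on top of the tumour assay criteria.
    return (passes_tumour_criteria(m)
            and m.germline_variant_freq < 0.05
            and m.blood_auc > 0.95
            and min(m.control_rafs) > 0.88
            and margin_difference(m) > -0.02)
```

A marker with, for example, a germline variant frequency of 0.07 could pass the tumour assay filter while failing the stricter CMMRD filter.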

3. Selected markers improve CMMRD detection using a large blinded cohort

Samples:

30 CMMRD peripheral blood leukocyte genomic DNAs (blinded); 43 control peripheral blood leukocyte genomic DNAs (blinded); and 30 control peripheral blood leukocyte genomic DNAs (known) were used.

Method and Analysis:

smMIP-based amplification of 32 mononucleotide repeat markers from samples, followed by high depth (5000x) amplicon sequencing on a MiSeq (Illumina) was performed. A custom bioinformatics pipeline to extract microsatellite allele frequencies from the amplicon sequence data was used. Sample scoring using the method of our original MSI assay to detect CMMRD (Gallon et al. 2019; Perez-Valencia et al. Genet Med. 2020 Dec;22(12):2081-2088, DOI:10.1038/s41436-020-0925-z, PMID: 32773772) was carried out. The 30 known controls were used as a reference set for sample scoring. A higher score indicates increased MSI and a higher probability the sample is from an individual with CMMRD.

Results:

All samples, except 3 CMMRD and 1 control sample from the blinded cohort, had previously been analysed by the original 24 mononucleotide repeat markers (described in Gallon et al. 2019. Further information on these markers may also be found in WO2021019197 and WO2018037231) using the same method, allowing comparison of the new and original marker sets.

Sample un-blinding showed that the scoring method detected CMMRD samples with 100% sensitivity (95% CI: 88.4-100.0%) and 100% specificity (95% CI: 95.1-100.0%), with a very large separation (score difference = 64.7) between CMMRD and control samples (Figure 5). This score separation was far greater than that of the original mononucleotide repeat marker set, which had overlapping scores from CMMRD and control samples (Figure 6). This supports that the process of selection from genome sequence data of individuals with CMMRD has identified exceptional markers for MSI analysis.
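The method used to calculate the confidence intervals is not stated. As a cross-check only, the sketch below computes the lower bound of a two-sided exact (Clopper-Pearson) interval for the special case in which all n samples are classified correctly, where the bound reduces to (alpha/2)^(1/n). The choice of interval and the sample sizes (30 CMMRD samples; 73 controls, being 43 blinded plus 30 known) are assumptions inferred from the Samples list above.

```python
def exact_lower_bound_all_correct(n, confidence=0.95):
    """Lower Clopper-Pearson bound when n out of n samples are classified correctly.

    For x = n successes the exact lower limit solves p**n = alpha / 2,
    and the upper limit is 1.
    """
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


# Under the stated assumptions these reproduce the reported lower limits.
sensitivity_lower = exact_lower_bound_all_correct(30)   # about 0.884
specificity_lower = exact_lower_bound_all_correct(73)   # about 0.951
```

Agreement with the reported values suggests, but does not establish, that an exact binomial interval was used.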

Two CMMRD samples (ID210 and ID224) from the blinded cohort had smSequence counts <100 in 18 markers. Whilst the remaining 14 markers clearly showed these two samples to have increased MSI, the samples were excluded from further analyses due to the unreliability of data from them in some markers.

Exploration of microsatellite marker structure across both the original and new mononucleotide repeat markers showed a strong correlation of ROC AUC (using smSequence RAF as measure of MSI to detect CMMRD) and microsatellite length (Figure 7, Spearman Rho = 0.743, p = 5.4 x 10^-11). New mononucleotide repeat markers were generally longer than the original mononucleotide repeat markers (Figure 7, Mann Whitney U Test p = 2.5 x 10^-9), suggesting the improved performance of the new markers may be a function of microsatellite structure rather than the selection process. However, a comparison of the new markers of 11 to 12 nucleotides in length with the original markers of the same length showed significantly higher ROC AUCs in the new markers (Figure 7, Mann Whitney U Test p = 5.2 x 10^-5). This further supports that the process of selection from genome sequence data of individuals with CMMRD has identified exceptional markers for MSI analysis.

The effect of reducing marker number on assay score distributions was assessed by first ranking all microsatellite markers (from both the new 32 mononucleotide repeats and original 24 mononucleotide repeats) based on their ability to detect CMMRD using smSequence RAF in the blinded cohort, and then repeating scoring using the top n markers, from n = 1 through to n = 30. Only 2 of the original microsatellite markers were included in the top 30 markers at rank 22 and rank 28 (Table 2). Separation of all CMMRD from all control samples by MSI scoring was achieved with all marker sets, including scoring by the top, single microsatellite marker (Figures 8A and 8B). An equivalent analysis that included only the original mononucleotide repeat markers showed persistent overlap of CMMRD and control sample scores (Figures 8C and 8D), again supporting that the method of selection has identified exceptional markers for MSI analysis.

As the marker number increased, the separation between CMMRD and control samples increased, as measured by the difference in score margins and medians between CMMRD and control samples (Figure 9A). The range of scores within control samples and within CMMRD samples also increased as marker number increased (Figure 9A). An equivalent analysis that included only the original mononucleotide repeat markers showed a similar trend for increasing range in the data as marker number increases, but the separation of CMMRD from control sample scores is much poorer than when the new markers are included in the ranking (Figure 9B).

To make the change in the differences in score margins and medians comparable between marker sets, the margin score difference and median score difference were normalised by the control score range for each marker set. A steep increase in the normalised margin and median score differences was observed from the top 1 to top 5 microsatellite markers, followed by a more gradual increase from adding additional markers to the set (Figure 10A). This suggests that increasing marker number of any of the new markers will increase the ability of the MSI assay and scoring method to detect CMMRD. However, as few as 5 of the new markers will achieve separation of CMMRD from control samples that is almost equivalent to the separation achieved with much greater numbers of microsatellite markers. That so few microsatellite markers can achieve such separation of CMMRD and control samples is novel and unexpected: 5 microsatellite markers is far fewer than the 24 of our original MSI assay (Gallon et al. 2019) or the 186 of the MSI assay of Gonzalez-Acosta et al. 2020, and is equivalent in number to fragment length analysis-based techniques used for tumour MSI analysis. An equivalent analysis that included only the original mononucleotide repeat markers showed a similar trend for increasing normalised differences as marker number increases, but, again, the separation of CMMRD from control sample scores is much poorer than when the new markers are included in the ranking (Figure 10B).

Table 2: Ranking of microsatellite markers from both new and original microsatellite marker sets using data from the blinded cohort and known controls. Markers with a ROC AUC <0.90 (using smSequence RAF as a measure of MSI to detect CMMRD) or a germline variant frequency >0.05 were excluded. Remaining markers were grouped by the normalised margin difference ((minimum control RAF - maximum CMMRD RAF) / range control RAF): groups included margin difference >0.00, >-0.25, >-0.50, and <-0.50. Subsequently, markers were ranked by normalised median difference ((median control RAF - median CMMRD RAF) / range control RAF) within each group. Original markers are indicated with an asterisk.

*Markers from the original mononucleotide repeat marker set.
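The two ranking statistics defined in the caption of Table 2 can be illustrated with a minimal Python sketch. The function names and the example RAF values are hypothetical and are not taken from the application; the handling of a margin difference of exactly -0.50, which the caption leaves unassigned, is also an assumption.

```python
from statistics import median


def normalised_differences(control_raf, cmmrd_raf):
    """Return (margin difference, median difference), each divided by the control RAF range."""
    control_range = max(control_raf) - min(control_raf)
    if control_range == 0:
        raise ValueError("control RAFs are identical, so the range is zero")
    margin = (min(control_raf) - max(cmmrd_raf)) / control_range
    med = (median(control_raf) - median(cmmrd_raf)) / control_range
    return margin, med


def margin_group(margin):
    """Assign a marker to one of the four margin-difference groups named in the caption."""
    if margin > 0.0:
        return ">0.00"
    if margin > -0.25:
        return ">-0.25"
    if margin > -0.50:
        return ">-0.50"
    # Assumption: a value of exactly -0.50 is placed in the lowest group.
    return "<-0.50"


# Hypothetical RAFs for one marker (illustrative values only).
control = [0.98, 0.97, 0.99, 0.96]
cmmrd = [0.90, 0.85, 0.92]
margin, med = normalised_differences(control, cmmrd)
print(margin_group(margin), round(margin, 2), round(med, 2))
```

A positive margin difference means that every control sample has a higher RAF than every CMMRD sample for that marker, which is why the >0.00 group is ranked first.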

4. Selected markers improve MSI classification of colorectal cancers

Samples:

50 MSI-H colorectal cancers DNAs (from formalin fixed and paraffin embedded tissue) and 52 MSS colorectal cancers DNAs (from formalin fixed and paraffin embedded tissue) were used.

Methods: smMIP-based amplification of 54 of the 62 mononucleotide repeat markers (7 markers missed, but data from the other 54 are sufficient to show marker efficacy), followed by high depth (2000-3000x) amplicon sequencing on a MiSeq (Illumina) was carried out. A custom bioinformatics pipeline to extract microsatellite allele frequencies from the amplicon sequence data was used. Sample classification using the method of our original MSI assay to determine tumour MSI status (Redford et al. PLoS One. 2018 Aug 29;13(8):e0203052, DOI:10.1371/journal.pone.0203052, PMID: 30157243; Gallon et al. Hum Mutat. 2020 Jan;41(1):332-341, DOI:10.1002/humu.23906, PMID: 31471937) was performed. The classifier was trained using the same sample cohort of 50 MSI-H and 52 MSS CRCs. A score >0 indicates a higher probability the sample is MSI-H, and a score <0 indicates a higher probability the sample is MSS.

Results:

All samples had previously been analysed by the original 24 mononucleotide repeat markers using the same method, allowing comparison of the new and original marker sets.

Markers for both new and original marker sets had ROC AUCs calculated for the separation of MSI-H CRCs from MSS CRCs based on read RAF. Potential germline variants were included in the ROC AUC calculation, and therefore the influence of marker polymorphism on its ability to discriminate between MSI-H and MSS CRCs was accounted for in this one value (Table 3A, Table 3B). The RAF ROC AUCs of the new microsatellite marker set were greater than those of the original microsatellite marker set (Figure 11, Mann Whitney U Test p = 8.3x10^-5).
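As an illustration of a per-marker ROC AUC based on RAF, the following Python sketch uses the rank-based (Mann-Whitney) definition: the AUC is the probability that a randomly chosen MSS sample has a higher RAF than a randomly chosen MSI-H sample, with ties counted as one half. The function name and example values are hypothetical, and this is not necessarily the implementation used to produce Tables 3A and 3B.

```python
def roc_auc(mss_raf, msih_raf):
    """Rank-based ROC AUC for separating MSI-H from MSS samples by RAF.

    A stable (MSS) sample is expected to have a higher reference allele
    frequency than an unstable (MSI-H) sample, so each MSS/MSI-H pair in
    which the MSS RAF is higher counts as one correct ordering.
    """
    if not mss_raf or not msih_raf:
        raise ValueError("both sample groups must be non-empty")
    correct = 0.0
    for mss in mss_raf:
        for msih in msih_raf:
            if mss > msih:
                correct += 1.0
            elif mss == msih:
                correct += 0.5  # ties contribute half a correct ordering
    return correct / (len(mss_raf) * len(msih_raf))


# Hypothetical RAFs for a single marker (illustrative values only).
print(roc_auc([0.99, 0.98, 0.97], [0.80, 0.90, 0.975]))
```

Because a germline length variant lowers the RAF of an otherwise stable sample, a polymorphic marker produces more mis-ordered pairs and hence a lower AUC, which is how this single value accounts for marker polymorphism.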

As there was a much greater number of new markers (n = 54) than original markers (n = 24), the top 24 markers from the new marker set were first identified to allow a fair comparison of classification using these different marker sets. Markers in the new set were ranked using ROC AUC for the separation of MSI-H CRCs from MSS CRCs based on read RAF (as described above), and the top 24 markers were selected (Table 3A).

MSI classification of CRCs using the top 24 markers of the new microsatellite marker set had 100% sensitivity (95% CI: 92.9-100.0%) and 100% specificity (95% CI: 93.2-100.0%), with a clear separation (score difference = 35.4) between MSI-H and MSS samples (Figure 12).

MSI classification of CRCs using the original 24 microsatellite marker set had 98% sensitivity (95% CI: 89.4-100.0%) and 98% specificity (95% CI: 89.7-100.0%), with overlapping scores (score difference = -11.0) between MSI-H and MSS samples due to misclassification of one MSI-H and one MSS CRC (Figure 12).
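The confidence intervals quoted above for 100% sensitivity and specificity are consistent with an exact binomial (Clopper-Pearson) interval. The Python sketch below computes such an interval with the standard library only; the choice of the Clopper-Pearson method and the function names are assumptions made for illustration and are not stated in the text.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, lo=0.0, hi=1.0, iterations=100):
    """Bisection for the root of a function that decreases from positive to negative."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, total, alpha=0.05):
    """Exact two-sided binomial confidence interval for a proportion."""
    half = alpha / 2
    if successes == 0:
        lower = 0.0
    else:
        # Lower bound: the p at which P(X >= successes) equals alpha/2.
        lower = _root_of_decreasing(
            lambda p: half - (1 - binom_cdf(successes - 1, total, p))
        )
    if successes == total:
        upper = 1.0
    else:
        # Upper bound: the p at which P(X <= successes) equals alpha/2.
        upper = _root_of_decreasing(lambda p: binom_cdf(successes, total, p) - half)
    return lower, upper


for successes, total in [(50, 50), (52, 52)]:
    low, high = clopper_pearson(successes, total)
    print(f"{successes}/{total}: {100 * low:.1f}-{100 * high:.1f}%")
# 50/50: 92.9-100.0%
# 52/52: 93.2-100.0%
```

When every sample is classified correctly the lower bound has the closed form (alpha/2)^(1/n), which gives 92.9% for 50 MSI-H samples and 93.2% for 52 MSS samples.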

Exploration of microsatellite marker structure across both the original and new mononucleotide repeat markers showed a correlation of ROC AUC (calculated from read RAF of 50 MSI-H and 52 MSS CRCs as described above) and microsatellite length (Figure 13, Spearman Rho = 0.41, p = 1.9x10^-4). New mononucleotide repeat markers were generally longer than the original mononucleotide repeat markers (Figure 13, Mann Whitney U Test p = 2.3x10^-9). Unlike for the detection of CMMRD by increased MSI in blood, a comparison of the new markers of 11 to 12 nucleotides in length with the original markers of the same length showed no difference in ROC AUC between the two sets using either all 54 of the new markers (Mann Whitney U Test p = 0.94), or just the top 24 new markers (Mann Whitney U Test p = 0.11). It is worth noting that there was less room for improvement on the original markers for these tumour analyses compared to the CMMRD analyses (see page 17 and Figure 7), as the original markers already had high ROC AUCs for tumour-based MSI testing.

The effect of reducing marker number on MSI assay score distributions was assessed by classifying the 50 MSI-H and 52 MSS CRCs by different marker combinations, starting with the single top ranked marker, then the top two ranked markers, and so on, until all 24 markers were included (Table 3A, Table 3B). With a minimum of 4 of the new microsatellite markers, and any combination with more than 4 markers, separation between all MSI-H and MSS CRCs was achieved. Two MSS CRCs (IDs 296151 and 296213) had consistently high scores across different marker combinations and were responsible for the misclassifications at low marker numbers. The same two MSS CRCs (IDs 296151 and 296213) that were frequently misclassified by the new marker set also had consistently high scores contributing to misclassifications across nearly all combinations of the original markers. In addition, one MSI-H CRC (ID 215320) had consistently low scores with the original marker combinations, but was correctly classified with all combinations of the new markers. This, again, supports that the method of selection has identified exceptional markers for MSI analysis.

As the marker number increased, the separation between MSI-H and MSS CRC scores consistently increased using the new microsatellite marker set (Figure 15A). The range of scores within each sample type also increased as marker number increased (Figure 15A). An equivalent analysis for the original microsatellite markers showed a similar trend for increasing range in the data as marker number increases, but the separation of MSI-H and MSS CRC scores is much poorer than for the new markers: As more of the original markers are added, the score margin between MSI-H and MSS CRCs decreases (Figure 15B).

To make the change in the differences in score margins and medians comparable between marker sets, the margin score difference and median score difference were normalised by the MSS CRC score range for each marker set. A steep increase in the normalised margin and median score differences was observed from the top 1 to top 6 microsatellite markers for both new and original microsatellite marker sets (Figure 16A and 16B, respectively). For the new microsatellite markers, additional markers steadily increase both normalised margin and median score differences (Figure 16A). However, this is not true of the original microsatellite markers, as both normalised margin and median score differences initially decrease with additional markers after the top 6, and subsequently level off (Figure 16B). The normalised margin and normalised median differences for the new microsatellite markers are generally higher than for the original microsatellite markers (compare Figure 16A and 16B). We have previously reported that a minimum of 6 microsatellite markers from the original set can be used to achieve accurate MSI classification of CRCs (Gallon et al. 2020), and have reproduced this result in this new cohort of CRC samples for both new and original microsatellite marker sets. However, these data show that a much larger proportion of the new microsatellite markers will improve classification when added into the marker set, further confirming that the method of selection has identified exceptional markers for MSI analysis, which can be used even on their own.

Table 3A: Ranking of microsatellite markers from the new microsatellite marker set using ROC AUCs calculated from read RAF from 52 MSS and 50 MSI-H CRCs.

Table 3B: Ranking of microsatellite markers from the original microsatellite marker set using ROC AUCs calculated from read RAF from 52 MSS and 50 MSI-H CRCs.

Example 2

Introduction

The DNA MMR system is conserved across all three kingdoms of life. Primarily, it mediates the repair of base-to-base mismatches and small insertion-deletion loops generated during DNA replication, as well as a variety of base modifications such as cytosine deamination and guanine methylation, by excision of the affected DNA strand for resynthesis whilst signalling to the wider DNA damage response (DDR). MMR function can be lost in a wide variety of neoplasia, affecting approximately 1 in 4 endometrial cancers (ECs) and 1 in 7 CRCs. MMR deficient tumours are often hyper-mutated, with >10 mutations per megabase, and display high levels of MSI, a molecular phenotype defined as the accumulation of insertion and deletion (indel) mutations in short tandem repeat sequences scattered throughout the genome. An elevated mutation rate in the absence of MMR has also been demonstrated using human cell line, mouse, yeast, and bacterial models, and has been proposed to drive tumorigenesis through secondary mutation of onco- and tumour suppressor genes. Indeed, functional studies have demonstrated that frameshifts caused by coding microsatellite indels promote malignant cell growth. Furthermore, an over-representation of disruptive C>T transitions associated with defective MMR, and of coding microsatellite frameshifts in the APC tumour suppressor gene, has been observed in MMR deficient compared to proficient CRCs. The distinct patterns of recurrent coding microsatellite frameshift mutations between different tumour types also suggest tissue-specific positive selection of MMR deficiency-associated mutations during tumorigenesis.

Individuals with LS carry a germline pathogenic variant in one of the four principal MMR genes, MLH1, MSH2, MSH6, or PMS2, and have an increased risk of cancer, in particular CRC, EC, and other tumours of the gastrointestinal and genitourinary tracts. LS is one of the most common hereditary causes of cancer, affecting approximately one in 300 individuals in the general population. CMMRD is a far rarer childhood cancer syndrome caused by germline variants affecting both alleles of MLH1, MSH2, MSH6, or PMS2, with an estimated birth incidence of one per million. The loss of MMR function in all constitutional tissues is associated with an exceptionally high cancer risk, with a median age of onset less than 10 years. This includes LS cancers, which occur in approximately one third of cases, and, more commonly, high grade brain tumours and haematological malignancies. CMMRD is also associated with several non-neoplastic features, the most distinctive of which are cafe au lait macules (CALM) and skin-fold freckling reminiscent of neurofibromatosis type 1 (NF1). Other features include localised skin hypopigmentation, defective immunoglobulin class switch recombination, pilomatrixoma, and multiple developmental venous anomalies. Presentation may depend on which MMR gene is affected in the patient’s germline. In a review of 146 published cases, haematological malignancies were more prevalent in MLH1- or MSH2- than PMS2-associated CMMRD (p=0.04), whilst the opposite was true of brain tumours (p=0.01). Furthermore, MLH1- or MSH2-associated CMMRD cancers tended to occur earlier, which correlates with the earlier onset of MLH1- and MSH2-associated LS.

Given the role of MMR deficiency in tumour progression, it is possible the malignant and non-malignant clinical features of CMMRD are, to varying extent, linked to an increased constitutional mutation rate. Increased MSI in non-neoplastic tissues is a highly specific feature of CMMRD that is detectable by high-depth amplicon sequencing or low-pass whole genome sequencing but not by traditional MSI analysis methods. Sequencing-based microsatellite analysis can quantify the proportion of microsatellites demonstrating instability and the frequency of variant alleles at each microsatellite (collectively referred to here as MSI-burden) to approximate constitutional mutation rate.

Previously, using high-depth amplicon sequencing, the inventors observed a relatively low MSI-burden in the peripheral blood leukocytes (PBLs) of CMMRD cases homozygous for a hypomorphic PMS2 variant (c.2002A>G p.(Ile668Val)) typified by an attenuated phenotype more similar to early-onset LS than classical CMMRD. This observation suggested constitutional MSI-burden may correlate with MMR genotype and/or CMMRD disease phenotype. However, more comprehensive analyses were precluded by the limited cohort size of 32 patients and an MSI assay that only minimally separated CMMRD samples from controls. Further exploration of such correlations could broaden our understanding of how MMR deficiency contributes to malignant transformation, aid variant interpretation, and allow risk stratification to guide clinical management of CMMRD.

Here the inventors aimed to enhance methods to quantify constitutional MSI-burden, and subsequently explore its association with CMMRD genotype and phenotype using a relatively large cohort. One limitation of the previous method was its use of markers selected for MSI analysis of tumours. Dysregulated replication, a possible mutator phenotype, and a common lineage, whereby cancer subclones are more likely to share mutations than the thousands of clones represented in healthy peripheral blood, may contribute to different mechanisms and frequencies of microsatellite mutation in cancers compared to non-neoplastic blood. Therefore, new MSI markers selected for instability in blood were desirable. Here, the inventors identified potentially informative MSI markers from high depth genome sequencing of CMMRD blood, and used amplicon sequencing to refine a panel of markers with highest sensitivity to MMR deficiency and to quantify constitutional MSI-burden in over 50 CMMRD patients.

Materials and Methods

Patient samples and ethical approval

Anonymised CMMRD PBL gDNAs were sourced from the Medical University of Innsbruck, Innsbruck, Austria (MUI), the University of Manchester, Manchester, UK (UM), the Gustave Roussy Cancer Campus, Villejuif, France (GR), the Institut Curie, Universite de Recherche Paris Sciences et Lettres, Paris, France (IC), and the Cancer Centre de Recherche Saint-Antoine, Sorbonne University, Paris, France (CRSA). MMR variants were classified according to InSiGHT criteria v2.4 and reference to ClinVar and InSiGHT databases. For patients with one or more VUS, the diagnosis had been confirmed by assessment of MMR function in non-neoplastic tissues, including assays of germline/constitutional MSI, and/or ex vivo MSI and methylation tolerance. PBL gDNAs from eight patients with a CMMRD-like phenotype but who tested negative for germline MMR pathogenic variants were sourced from MUI. Patient samples were analysed with consent and ethical approval by the review boards of the respective centres.

Anonymised control PBL gDNAs were extracted from discard blood samples of patients tested for non-cancer related conditions from Newcastle-upon-Tyne Hospitals NHS Foundation Trust, Newcastle-upon-Tyne, UK (NuTH) and MUI, following ethical review by the NHS Health Research Authority (REC reference 13/LO/1514) and the MUI review board, respectively.

Anonymised genetically-diagnosed LS PBL gDNAs were sourced from the CaPP3 clinical trial (ISRCTN16261285) biobank with participant consent for sample-use in research, and analysed following an ethical review by the NHS Health Research Authority (REC reference 13/LO/1514).

PBL samples were divided across three cohorts. High quantity (>2µg) and quality samples from three CMMRD patients (2 MUI, 1 UM), one LS carrier (CaPP3), and two controls (NuTH) were whole genome sequenced. Eight CMMRD (MUI) and 38 control (NuTH) samples were analysed in a pilot cohort. Fifty-seven CMMRD (31 MUI, 9 GR, 4 IC, and 13 CRSA), eight CMMRD-negative (MUI), and 43 control (MUI) samples were analysed as a blinded cohort, alongside 80 known controls (30 MUI, 50 NuTH) to provide reference samples for MSI scoring and 40 LS samples (CaPP3).

CRC samples were sourced from NuTH as 10µm FFPE tissue curls of resected tumours or pre-extracted gDNAs from non-fixed endoscopic biopsies, following an ethical review by the NHS Health Research Authority (REC reference 13/LO/1514). FFPE CRC gDNAs were extracted using GeneRead DNA FFPE Kit (QIAGEN). Eight MMR deficient and 8 MMR proficient CRC endoscopic biopsies were included in a pilot cohort, and a further 96 MMR deficient and 96 MMR proficient FFPE resected CRCs were analysed to train and validate a naive Bayesian classifier.

Genome sequencing and variant analysis

Samples were prepared for whole genome sequencing by 3 cycle PCR amplification using the NEBNext® Ultra™ II DNA Library Prep Kit for Illumina (New England Biolabs), and were sequenced to >120x coverage on a NovaSeq (Illumina). Reads were aligned to human reference genome build hg19 using BWA mem and BAM files generated using SAMtools view, sort, and index. Variants were called by a somatic variant calling pipeline and panel of reference control genomes using GATK 4 MuTect2, followed by GetPileupSummaries, CalculateContamination, and FilterMutectCalls, with PCR_indel_model set to NONE. Variants were classed as germline if the probability of variant allele frequency equalling the 1:1 or 1:0 ratio expected of a germline variant was >10^-7.

For MSI marker selection, microsatellite variants flagged as germline and/or identified in the panel of reference genomes were excluded. Variants annotated as clustered events, multiallelic, slippage, or PASS, and where the total variant allele frequency was <0.25 (to further exclude potential germline variants) were retained and visually inspected using IGV. Microsatellites with variants captured by high quality read-alignments, not embedded within conserved repetitive elements, and that had higher variant allele frequencies in CMMRD patients than in controls were selected for further assessment by amplicon sequencing.
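The retention rule described above can be sketched in Python as follows. The record layout (a dictionary with the keys `filters`, `is_germline`, `in_reference_panel`, and `total_vaf`) is hypothetical, as is the reading that a variant is retained only when all of its filter annotations fall within the four listed categories. The subsequent manual steps (visual inspection in IGV, exclusion of conserved repetitive elements, and comparison of CMMRD and control allele frequencies) are not represented.

```python
# Filter annotations that the text lists as retained (spelling of the labels is assumed).
RETAINED_FILTERS = {"clustered_events", "multiallelic", "slippage", "PASS"}


def keep_for_marker_selection(variant, max_total_vaf=0.25):
    """Return True if a microsatellite variant passes the automated retention rule.

    `variant` is a hypothetical record with the keys:
      filters            - list of filter annotations on the call
      is_germline        - True if the variant was classed as germline
      in_reference_panel - True if it was seen in the panel of reference genomes
      total_vaf          - total variant allele frequency at the locus
    """
    if variant["is_germline"] or variant["in_reference_panel"]:
        return False
    filters = set(variant["filters"])
    # Assumption: at least one annotation is present and all are in the retained set.
    if not filters or not filters <= RETAINED_FILTERS:
        return False
    # A total VAF below 0.25 further excludes potential germline variants.
    return variant["total_vaf"] < max_total_vaf


example = {
    "filters": ["slippage"],
    "is_germline": False,
    "in_reference_panel": False,
    "total_vaf": 0.08,
}
print(keep_for_marker_selection(example))
```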

Single molecule molecular inversion probe design and amplicon sequencing

Single molecule molecular inversion probes (smMIPs) were designed using MIPgen to amplify MSI markers with capture sizes between 100bp and 160bp, and a molecular barcode of 4N at both extension and ligation arms.

MSI markers were amplified from samples using a published smMIP and high fidelity polymerase-based protocol. Amplicons were purified using AMPure XP beads (Beckman Coulter), quantified using a QuBit fluorometer 2.0 (Invitrogen), diluted to 4nM using 10mM pH8.5 Tris-HCl buffer, and pooled into 4nM sequencing libraries. Sequencing libraries were sequenced using custom sequencing primers on a MiSeq (Illumina) to a target depth of 5000x, following manufacturer’s protocols.

Microsatellite amplicon sequence analysis and microsatellite instability scoring

Amplicon sequence reads were aligned to human reference genome build hg19 using BWA mem and further processed and analysed as previously described. In brief, to reduce PCR and sequencing error for low frequency variant detection, reads sharing the same molecular barcode were grouped and the microsatellite length represented in the majority of reads was defined as the single molecule sequence (smSequence) for each group. Groups containing only one read or without a majority were discarded. Microsatellite reference allele frequencies (RAFs) in smSequences were used to generate an MSI score (equivalent to MSI-burden) for each sample by comparison to RAFs of 80 known control samples. For any sample, MSI markers with a RAF <0.75 (probable germline variants) or with <100 smSequences were excluded from MSI scoring.
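The barcode grouping and RAF calculation described above can be illustrated with a minimal Python sketch. The input format (pairs of molecular barcode and observed microsatellite length for one marker), the function names, and the interpretation of "majority" as more than half of the reads in a group are assumptions made for illustration. The MSI score itself, which compares RAFs to the 80 known controls, is not reproduced here.

```python
from collections import Counter, defaultdict


def sm_sequences(reads):
    """Collapse reads into smSequences for a single marker.

    `reads` is an iterable of (barcode, microsatellite_length) pairs.
    Returns one consensus length per barcode group that has at least two
    reads and a length supported by more than half of them.
    """
    groups = defaultdict(list)
    for barcode, length in reads:
        groups[barcode].append(length)
    consensus = []
    for lengths in groups.values():
        if len(lengths) < 2:
            continue  # groups containing only one read are discarded
        length, count = Counter(lengths).most_common(1)[0]
        if 2 * count > len(lengths):  # strict majority (assumed threshold)
            consensus.append(length)
    return consensus


def reference_allele_frequency(consensus_lengths, reference_length):
    """Fraction of smSequences carrying the reference microsatellite length."""
    if not consensus_lengths:
        raise ValueError("no smSequences available for this marker")
    matches = sum(1 for length in consensus_lengths if length == reference_length)
    return matches / len(consensus_lengths)


def marker_is_scorable(consensus_lengths, reference_length, min_raf=0.75, min_count=100):
    """Apply the stated exclusions: RAF <0.75 or <100 smSequences."""
    if len(consensus_lengths) < min_count:
        return False
    return reference_allele_frequency(consensus_lengths, reference_length) >= min_raf


# Hypothetical reads for a 12bp marker (illustrative values only).
reads = [
    ("AAAA", 12), ("AAAA", 12), ("AAAA", 11),  # majority length 12
    ("CCCC", 12),                               # single read: discarded
    ("GGGG", 11), ("GGGG", 12),                 # no majority: discarded
    ("TTTT", 11), ("TTTT", 11),                 # majority length 11
]
print(sorted(sm_sequences(reads)))
```

Requiring a majority within each barcode group means that an error introduced in a later PCR cycle or during sequencing is outvoted by the other copies of the same original molecule, which is what allows low frequency variants to be detected.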

Statistical analyses and data availability

All analyses used R version 4.0.2. Comparisons of two sample groups used the Mann-Whitney test. Comparisons of more than two sample groups used the Kruskal-Wallis test. Correlation of variables where a linear relationship could or could not be assumed used Pearson’s R or Spearman’s rho, respectively. Confidence intervals for sensitivity and specificity estimates used a binomial distribution.

Genome sequence BAM and amplicon sequence FASTQ files are available from the European Nucleotide Archive using Study IDs PRJEB39601 and PRJEB53321, respectively.

Results

Genome sequencing of blood identifies high sensitivity MSI markers

Three CMMRD (two PMS2- and one MSH6-associated), one LS (MLH1-associated), and two control blood samples were whole genome sequenced. An LS sample was included as highly sensitive MSI analysis and single-base-mismatch repair assays have previously detected reduced MMR function in blood and cell lines with one dysfunctional MMR allele. There was a marginal increase in the frequency of mononucleotide repeat (MNR) variants in both PMS2-associated and MSH6-associated CMMRD bloods relative to control and LS bloods, but an increase in variants of longer motif microsatellites was only observed in the PMS2-associated CMMRD bloods (Figure 17A). These variants include PCR error, sequencing error, germline variants, and somatic variants. To enhance the somatic signal, probable germline variants were identified, and the relative frequency of non-germline variants was assessed. In both PMS2-associated and MSH6-associated CMMRD bloods there was an increase in the relative frequency of non-germline MNR variants compared to LS and control bloods, but, again, an increase in longer motif microsatellites was observed only in the PMS2-associated CMMRD bloods (Figure 17B). This is consistent with the role of MSH6 in the repair of single nucleotide indels, mismatches, and modifications, but not multiple nucleotide indels.

Microsatellites with a potential to enhance MSI analysis in blood were selected from the blood genome sequence data (see Methods), with review of over 2000 microsatellites, the majority of which were 11-16bp A-homopolymers. Since MSH6 deficiency causes 20% of CMMRD and MNR instability was increased in both MSH6- and PMS2-associated CMMRD samples in genome sequence analysis, 121 MNRs were short-listed as candidate MSI markers for further assessment by amplicon sequencing. These were smMIP amplified and sequenced from three control bloods, and 91 smMIPs (covering 98 candidate markers) generated read counts >10% of the median read depth and were taken forward. The ability of candidate markers to discriminate between MMR deficient and MMR proficient tissues was assessed by smMIP-amplicon sequencing a pilot cohort of eight CMMRD and 38 control blood gDNAs, as well as eight MMR deficient and eight MMR proficient CRC gDNAs. All except seven control samples had been previously analysed using the 24 tumour-derived MNRs of the original MSI assay, allowing comparison of marker sets. Twenty-seven of the 98 new blood-derived MSI markers were excluded as >10% of the pilot PBL samples had a RAF <0.75 indicative of a germline length variant (see Methods). There was no difference in the receiver operator characteristic (ROC) area under curve (AUC) values based on microsatellite RAF between the remaining 71 new and 24 original MSI markers to detect MMR deficiency in either pilot CRCs (p=0.439) or pilot PBLs (p=0.530, Figure 17C). However, the difference between the median RAFs of MMR deficient and MMR proficient samples was significantly greater for the new markers in both CRCs (p=1.81x10^-5) and in PBLs (p=2.18x10^-8, Figure 17D), indicating they are more sensitive to MMR deficiency. Based on these data and visual inspection of microsatellite allele distributions, the candidate markers were refined to a panel of the most discriminatory 32 MNRs (Table H).

New MSI markers enhance the detection of CMMRD

The 32 new MSI markers were amplified and sequenced from 80 control PBL gDNAs to provide a reference for MSI scoring, and a blinded cohort of 57 CMMRD, 8 CMMRD-negative (patients with a CMMRD-like phenotype but no germline MMR variants), and 43 control PBL gDNAs. Forty LS PBL gDNAs (10 for each MMR gene) were also analysed to investigate if increased MSI in blood is specific to biallelic loss of MMR function. One sample from the blinded cohort failed to amplify, and was later revealed to be a CMMRD case. All other sample amplicons were sequenced and an MSI score generated for each. Markers with low (<100) smSequence counts were observed in only four samples: two had a single low-count marker, whilst the other two had <100 smSequences in >17 MSI markers with equivalent results upon repeat amplification and sequencing, suggesting poor sample quality. On un-blinding, the latter two samples were revealed to be CMMRD cases.

The blood MSI score identified CMMRD with 100% sensitivity (56/56; 95% CI: 93.6-100.0%) and 100% specificity (171/171; 95% CI: 97.9-100.0%), including the two CMMRD samples with exceptionally low smSequence counts, and with a clear separation from control, LS, and CMMRD-negative samples (Figure 18A). MSI score was associated with affected MMR gene (p=1.15x10^-3); patients with MSH6 deficiency had significantly lower MSI scores than patients with MSH2 deficiency (p=2.38x10^-4) or PMS2 deficiency (p=6.01x10^-3), and a trend for lower MSI scores than patients with MLH1 deficiency (p=5.30x10^-2, multiple testing significance at p<1.67x10^-2). LS MSI scores were not significantly different from controls (p=0.169), but it was notable that six (3.7-11.3) were greater than the highest control (3.6). CMMRD-negative samples generally had higher MSI scores than controls (p=0.0188). However, small but significant differences were observed between controls of different amplification and sequencing batches (p=1.23x10^-8, Figure 18), and 7/8 CMMRD-negative samples were analysed in a single batch. For these seven, there was no significant difference in MSI score compared to controls from the same batch (p=0.0958). However, it was notable that two CMMRD-negative MSI scores (4.1, 5.3) were greater than the highest control (3.6). As these high scoring LS and CMMRD-negative samples had much lower MSI scores than the CMMRDs they were not analysed further. To assess MSI assay reproducibility, residual DNA samples available from 26 CMMRD patients and 33 controls were re-amplified, sequenced, and scored, and a strong correlation was found between initial and repeat MSI scores (R=0.994, p<10^-15, Figure 18B).

Fifty CMMRD and 75 control samples were also analysed using the original 24 MSI markers. The new MSI markers had greater RAF-based ROC AUCs for CMMRD detection than the original set (p=9.00x10^-14, Figure 19). The new MSI markers were longer (range 11-15bp versus 7-12bp, p=1.93x10^-7) and there was a strong positive correlation between marker length and ROC AUC (rho=0.730, p=1.79x10^-10). However, comparing markers of equivalent size (11-12bp) found higher ROC AUCs for the new markers than the original (p=2.52x10^-4, Figure 21A). The new MSI markers were ranked by RAF ROC AUC to separate CMMRD from control samples (Table H) and the most discriminatory 24 new MSI markers maintained a large MSI score separation of 15.3 between CMMRD and control samples, compared to the 0.1 MSI score overlap when using the original 24 MSI markers (Figure 21B). Using only three new MSI markers gave 100% accurate CMMRD detection (Figure 22). The new MSI markers also enhanced MSI classification of CRCs compared to the original set (Figure 23A-D) and there was a strong correlation between their RAF ROC AUCs in CRCs compared to bloods (rho=0.715, p=9.01x10^-5).

CMMRD constitutional MSI-burden is associated with MMR genotype but not age of tumour onset

There was a breadth of MSI scores between CMMRD patients with deficiency of the same MMR gene suggesting potential genotype or phenotype correlations with constitutional MSI-burden. CMMRD patients with one or more missense MMR variant had significantly lower MSI scores than those without (p=8.81x10^-4, Figure 24A), whilst the frequency of missense variants was equivalent between MMR genes indicating this was not due to an over-representation of missense variants in any one gene group (p=0.55). To further assess whether MMR variants associate with constitutional MSI-burden, pair-wise comparisons of MSI score between patients sharing the same genotype were made. This included twelve pair-wise comparisons between siblings of eight CMMRD families, and ten between five unrelated patients homozygous for the recurrent PMS2 c.2007-2A>G variant, finding MSI scores were strongly correlated between pairs (R=0.744, p=7.13x10^-5, Figure 24B).

A clinical history of tumour diagnoses was available for CMMRD patients. Five patients had no cancer history, and for another the age of tumour diagnosis was unknown. Despite the strong genotype correlations with constitutional MSI-burden, no correlation was observed between age of first tumour and MSI score overall (Rho=-0.154, p=0.287, Figure 25A), or for subgroup analysis of MSH6-deficient cases (Rho=-0.342, p=0.195) and PMS2-deficient cases (Rho=-1.31x10^-2, p=0.95). It is possible that constitutional MSI-burden is associated with the onset of specific tumour types as both sporadic and CMMRD-related brain and haematological malignancies have a reduced frequency of MSI compared to cancers within the LS spectrum. However, no correlation was found between MSI score and the age of onset of brain tumours (Rho=-0.167, p=0.318), haematological malignancies (Rho=-0.285, p=0.268), or LS-associated tumours (Rho=-0.143, p=0.582). There was also no association of age of first tumour with affected MMR gene (p=0.483) or with whether the CMMRD patient had at least one missense MMR variant (p=0.457, Figure 25B).

Other factors that might affect constitutional MSI-burden include age at sample collection and contaminating tumour DNA. Age at sample collection was not correlated with MSI score among 30 CMMRD patients with data available (Rho=-0.310, p=9.9x10^-2, Figure 26A) but was correlated with age of first tumour (R=0.727, p=3.87x10^-5) as expected given CMMRD diagnoses will typically be made at presentation of a malignancy. Similarly, MSI score was not associated with age at sample collection in 50 controls with data available (p=0.652). For 27 CMMRD patients it was also known if a tumour was present at the time of sample collection; the MSI scores of the 18 patients with a tumour were equivalent to those without (p=0.495, Figure 26B).

Discussion

Novel MSI markers were selected in this study from blood WGS to enhance an existing amplicon sequencing-based MSI assay, achieving excellent separation of CMMRD samples from controls. Sequencing-based MSI analysis to detect CMMRD has now been demonstrated with a variety of methods. However, the method used here has a particularly low cost and is scalable from functional testing of a few samples to high throughput screening, as demonstrated when screening for CMMRD in cancer-free children with NF1-like phenotypes but negative for NF1 or SPRED1 germline variants. Functional assays also support ambiguous genetic test results, such as MMR VUS and analysis of PMS2 (the MMR gene affected in the majority of CMMRD patients), which otherwise needs specialist techniques to avoid its pseudogenes. The inventors’ results provide data to support reclassification of 17 MMR VUS as pathogenic, at least in the context of CMMRD. The new MSI markers were found to be longer than the original set, ranging between 11bp and 15bp, which is equivalent to the most sensitive and specific A-homopolymers identified in TCGA tumour exome sequencing data. This suggested that a microsatellite’s diagnostic utility may simply be a function of its length. However, a comparison of 11-12bp markers showed the new blood-derived MSI markers have significantly higher ROC AUCs than the original tumour-derived set, confirming this new selection had identified exceptional markers. The new MSI markers also enhanced detection of MMR deficiency in CRCs, suggesting that they will be sensitive irrespective of tissue despite our initial hypothesis that some microsatellite markers may be more sensitive in blood than in tumours. However, the original tumour-derived set analysed here had also been selected to be <12bp and have a SNP within 30bp, and so these differences in selection criteria may mask tissue specificity.

A CMMRD patient’s MSI score was associated with their genotype. Previously, a reduced MSI-burden in MSH6- versus MSH2-associated CMMRD cases was found using an alternative amplicon sequencing assay. Here, the inventors have shown that this extends to PMS2-associated CMMRD and that there is a similar trend comparing MSH6- to MLH1-associated CMMRD. A reduced MSI-burden of MSH6- compared to MSH2-associated CMMRD was also observed in our genome sequence data, and is consistent with genome sequence data from CRISPR-knockout cell lines that show a reduced indel frequency in MSH6- compared to MLH1-, MSH2- or PMS2-deficient cells. The redundancy for 1bp indel repair between MSH2-MSH6 (MutSα) and MSH2-MSH3 (MutSβ) MMR heterodimers likely explains the reduced frequency of MNR variants in the constitutional tissues of MSH6-associated CMMRD. The inventors also observed genotype-phenotype correlations with respect to the type of MMR variant, with CMMRD patients carrying one or more MMR missense variants having lower MSI scores than those without. To the inventors’ knowledge, this is a novel observation for MMR genes and could have implications for our understanding of how MMR genotype influences mutation rate. It would be interesting, for example, to explore if MMR missense variants are associated with reduced MSI in MMR deficient tumours, and whether this has any association with clinical course. These strong correlations between genotype and MSI-burden did not translate to differences in disease phenotype among the 56 CMMRD patients analysed, with no observed correlation of MMR genotype or MSI score with age of first tumour. Subtle differences in the incidence of CNS tumours and age of first tumour by affected MMR gene in CMMRD have previously been observed, but in a larger cohort of 146 patients. In LS it is well established that the MMR genes are associated with distinct cancer spectra and risks. Previously, it was also found that both CRC and EC occurred earlier in carriers of PMS2 variants that cause loss of RNA expression compared to those that retain expression. However, there is otherwise very limited data supporting an effect for type or position of MMR variants on clinical phenotype. Regardless, it is likely that disease phenotype correlates with MMR genotype in CMMRD, but the association is far weaker than that between constitutional MSI-burden and MMR genotype.

The question then remains, why is there an apparent disconnect between constitutional MSI-burden and disease phenotype in CMMRD? There are several plausible explanations, and a key limitation of our study is that cohort size restricted the subgroup or multivariate analyses that might disentangle possible confounders. Constitutional MSI-burden is a combination of mutation rate and patient age at sampling. As age at sampling is positively correlated with age of first tumour, patients with less severe phenotypes will have had more time to accumulate microsatellite variants, as has been suggested for MSI in the general population and LS. Patient age, therefore, may confound associations between constitutional MSI-burden and disease penetrance. Analysing constitutional mutation rate directly would be superior, but would require alternative methods to quantify, for example, serial sampling of individuals or use of models, which have their own limitations. Furthermore, repair of microsatellite indels is only one of several functions of the MMR system within the DDR. In particular, both MMR-deficiency related single base substitutions (SBS) and indel mutations appear to drive MMR deficient tumorigenesis, and, whilst indel frequency was reduced, MSH6-deficient tissues have an equivalent increase in SBS frequency relative to MLH1-, MSH2-, and PMS2-deficient tissues in CRISPR knockouts of these genes. It is also possible that these mechanisms of mutation contribute to differing degrees in different tissues. For example, CMMRD brain and haematological malignancies have a reduced MSI signal compared to CMMRD LS-related carcinomas, increased MSI is common to LS-associated carcinomas but not brain or haematological malignancies in the sporadic population, and CMMRD brain tumours are typically ultra-hypermutated with >100 mutations/Mb associated with concurrent deficiencies of polymerase proofreading and MMR. Genetic and environmental backgrounds may determine the degree to which MSI or SBS contribute to tumorigenesis, with traditional PCR and fragment length analysis finding only 40% of gastrointestinal tumours to be MSI-H in CMMRD whilst >90% are in LS. The MMR system also signals to the wider DDR, for example to induce cell cycle arrest and apoptosis, and some MMR variants may promote tumorigenesis through these pathways rather than through, or in combination with, reduced repair capacity.

Environmental and genetic modifiers of cancer risk are also unaccounted for by the MSI score used in this study. Familial modifiers are known to have large effects on cancer risk in LS, and genetics may be of particular importance in CMMRD given that parental consanguinity is seen in approximately half of CMMRD families. Familial risk factors might also explain the strong correlation in MSI score between patients sharing the same genotype observed in this study. With respect to tumorigenesis, this could imply that other factors have a more significant contribution to tumour initiation or progression than MMR deficiency, consistent with early models of LS colorectal tumorigenesis. We also explored whether the presence of a tumour at the time of sampling is associated with MSI score, but no difference was found. It was interesting, however, that some CMMRD-negative and LS samples showed marginally increased MSI scores. Though beyond the scope of this study, further exploration of the effect of contaminating MSI-H circulating tumour cells on blood MSI analysis may be warranted.

In summary, we have analysed the constitutional MSI-burden of one of the largest cohorts of CMMRD patients in the scientific literature thus far, combining novel MSI markers and a simple method that could enhance CMMRD diagnostics. Our data show a strong association of constitutional MSI-burden with MMR genotype.

Example 3 - Development of markers optimised for multiplex PCR

Background

For MSI assay development, molecular inversion probes (MIPs) were used to facilitate robust multiplex amplification of MSI markers, and other genetic loci of clinical interest such as tumour mutation hotspots, without apparent limitation to the number of loci analysed. However, MIPs have limitations. In particular, MIPs require a minimum reaction input of ~25ng of sample DNA for reliable amplification. We have found that, in diagnostic practice at the Northern Genetics Service (Newcastle-upon-Tyne Hospitals NHS Foundation Trust), 14% of tumour DNA samples are of a quality/quantity too low to be analysed by MIPs and must be analysed by a “salvage” pathway. Furthermore, the MIP protocol is typically run over 2 days, which restricts sequencing to two batches per week with a median turnaround time of 10 days from sample receipt to report. Multiplex amplification by traditional PCR methods is limited by primer-primer and primer-amplicon cross reactivity between target loci, and hence the number of loci that can be amplified is very variable (depending on loci, primer design, etc.) and often limited to 10 loci or fewer. However, multiplex PCR can amplify from <1ng of sample DNA, and its use instead of MIPs would, therefore, remove the need for a salvage pathway and streamline the diagnostic pipeline in practice. Multiplex PCR amplification also requires a shorter (<1 day) protocol, which would allow 3 (or more) sequencing batches per week, increasing throughput and cutting overall turnaround time to <7 days from sample receipt to report.

The inventors have recently demonstrated that a two-round multiplex PCR assay of 12 of the best MSI markers described in WO/2018/037231 can be used to accurately test for MSI in resected tumour DNA samples as well as low quantity/quality samples, including genomic DNA extracted from endoscopic biopsies of colorectal cancers and cell free DNA extracted from urine (Phelps et al 2022, doi: 10.3390/cancers14153838). Multiplex PCR can, therefore, provide an accurate alternative assay of our MSI markers that overcomes the limitations of MIPs.

The two-round multiplex PCR MSI assay described in Phelps et al 2022 had the potential to be simplified further to a single round of multiplex PCR, which is an even shorter protocol. The published two-round multiplex PCR MSI assay also used the inventors’ original MSI markers (described in WO/2018/037231), which the inventors have demonstrated are less sensitive to MMR deficiency in both blood and tumour tissues than the new MSI markers when analysed by MIP amplification and sequencing, as described herein. Therefore, the inventors developed an MSI assay using a single-round multiplex PCR assay of 14 MSI markers (3 original and 11 new), as well as BRAF and RAS mutation hotspots. It will be appreciated that the addition of mutation hotspots to the assay is optional. Furthermore, the selection of hotspots may depend upon the type of cancer investigated.

Marker selection

PCR primers for the 62 new MSI markers, described in Table A, were first designed and tested using the two-round multiplex PCR assay, which has a much lower setup cost than the single-round multiplex PCR method. PCR primer design followed the protocol of Phelps et al (2022). In brief, PCR primers were designed with 8N molecular barcodes (4N in each primer) using PCRTiler v1.42 with GRCh37/hg19 as reference and a melting temperature range of 57-61°C. Amplicon size was initially set at a maximum of 90bp, and then increased incrementally by 10bp if no usable primer pairs were obtained. Multiplex Manager was used to select primers which minimised primer interactions within the multiplex. For 26 MSI markers (Table 4), two-round multiplex PCR primers were successfully designed and produced amplicons in initial tests following the two-round multiplex PCR protocol (Phelps et al 2022).

Table 4 - Successful two-round multiplex PCR primer designs for new MSI markers. These primers are used in the first round of PCR. Ns in the primer sequence represent molecular barcodes. The common sequences 5’ of the molecular barcode (TCCGACGGTAGTGT for forward primers, TCGGGAAGCTGAAG for reverse primers) act as annealing sites for the universal amplification primers in the second PCR.
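By way of non-limiting illustration, the primer layout described for Table 4 (common sequence, then a 4N molecular barcode, then the locus-specific sequence) and the stepwise increase of the maximum amplicon size described above could be sketched as follows. The function names are hypothetical, `design_fn` is a stand-in for a call to a primer design tool (it is not the PCRTiler interface), and the 150bp ceiling is an assumption made only so that the loop terminates; the source does not state an upper limit.

```python
# Common sequences taken from the Table 4 caption.
FORWARD_COMMON = "TCCGACGGTAGTGT"
REVERSE_COMMON = "TCGGGAAGCTGAAG"
BARCODE_4N = "N" * 4  # 4N per primer, giving an 8N barcode per amplicon


def tailed_primer(locus_specific, direction):
    """Return 5'-common sequence + 4N barcode + locus-specific sequence-3'."""
    common = {"forward": FORWARD_COMMON, "reverse": REVERSE_COMMON}[direction]
    seq = locus_specific.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("locus-specific sequence must contain only A, C, G, T")
    return common + BARCODE_4N + seq


def design_with_size_escalation(design_fn, start_bp=90, step_bp=10, ceiling_bp=150):
    """Call design_fn(max_amplicon_bp) with an increasing size limit.

    design_fn is a hypothetical callable returning a list of usable primer
    pairs (empty if none). Returns (size_used, pairs), or None if nothing
    usable is found up to ceiling_bp (an assumed stopping point).
    """
    size = start_bp
    while size <= ceiling_bp:
        pairs = design_fn(size)
        if pairs:
            return size, pairs
        size += step_bp
    return None
```

For example, `tailed_primer("acgtacgt", "forward")` gives the common forward sequence followed by `NNNN` and the upper-cased locus-specific bases.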

Sequencing of amplicons generated by two-round multiplex PCR amplification of different combinations of these 26 new MSI markers, along with original MSI markers and, optionally, BRAF and RAS mutation hotspots, identified a set of 19 new MSI markers that were most robust to multiplexing. It shall be appreciated by the skilled person that mixing many primers together in one reaction (i.e. a multiplex) may alter their performance compared to when primers are in singleplex. Therefore, the 26 markers that had initially been selected by singleplex analysis were then reduced to 19 based on performance in the two-round multiplex PCR method, to see which primers worked best in a multiplex format. To do this, the inventors mixed the primers in different combinations and assessed them by gel electrophoresis as per the singleplex analysis, but also sequenced the amplicons to look at per marker read depths from the different primer combinations. The inventors selected the MSI markers with the highest read depths that behaved most consistently across multiplexes.
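Purely as an illustrative sketch, one way to rank markers by read depth and by consistency across multiplexes is shown below. The source does not state how depth and consistency were measured or combined, so the use of the median depth, the coefficient of variation, and the sort order are all assumptions, as are the function names.

```python
from statistics import mean, median, pstdev


def summarise_marker(depths):
    """Summarise one marker's read depths across several primer combinations."""
    avg = mean(depths)
    # Coefficient of variation as an assumed measure of consistency.
    cv = pstdev(depths) / avg if avg > 0 else float("inf")
    return {"median_depth": median(depths), "cv": cv}


def rank_markers(depths_by_marker, keep):
    """Return the `keep` markers with highest median depth, ties broken by lower CV."""
    summaries = {name: summarise_marker(d) for name, d in depths_by_marker.items()}
    ranked = sorted(
        summaries,
        key=lambda name: (-summaries[name]["median_depth"], summaries[name]["cv"]),
    )
    return ranked[:keep]
```

In this sketch `depths_by_marker` maps a marker name to its read depths in each tested multiplex, for example `{"markerA": [1000, 1100, 900], "markerB": [1000, 50, 2000]}`; the marker names and values here are invented for illustration.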

These 19 robust new MSI markers, combined with the best 6 original MSI markers, were amplified by two-round multiplex PCR and sequenced from a cohort of 72 MSI-H and 72 MSS CRCs. The reference method for MSI status of samples was the MSI Analysis System v1.2 (Promega).

For each MSI marker, its ability to separate MSI-H from MSS CRCs was defined as the receiver operator characteristic area under curve (ROC AUC) calculated from sample reference allele frequencies (RAF, i.e. the proportion of reads containing the reference or wild type length of microsatellite). 16/19 of the new MSI markers and 4/6 original MSI markers achieved RAF ROC AUCs >0.95 using the two-round multiplex PCR assay (Table 5), demonstrating high accuracy for MSI detection using multiplex PCR.

Table 5 - Reference allele frequency (RAF) receiver operator characteristic area under curve (ROC AUC) values for the ability of MSI markers (19 new, 6 original) to discriminate between 72 MSI-H and 72 MSS colorectal cancers (CRCs). MSI markers were amplified and sequenced from CRCs by the two-round multiplex PCR protocol described by Phelps et al (2022).
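As a minimal sketch of the per-marker RAF ROC AUC described above, the AUC can be computed as the probability that a randomly chosen MSI-H sample has a lower RAF than a randomly chosen MSS sample, with ties counted as one half. This pairwise formulation is a standard equivalent of the ROC AUC and is an assumption about implementation, not the inventors' stated code; the function names are hypothetical.

```python
def reference_allele_frequency(reference_reads, total_reads):
    """Proportion of reads carrying the reference (wild type) microsatellite length."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return reference_reads / total_reads


def raf_roc_auc(raf_msi_h, raf_mss):
    """ROC AUC for one marker, treating a lower RAF as evidence of MSI-H.

    Counts, over all MSI-H/MSS sample pairs, how often the MSI-H sample has
    the lower RAF (ties score 0.5) and divides by the number of pairs.
    """
    if not raf_msi_h or not raf_mss:
        raise ValueError("both groups need at least one sample")
    score = 0.0
    for h in raf_msi_h:
        for s in raf_mss:
            if h < s:
                score += 1.0
            elif h == s:
                score += 0.5
    return score / (len(raf_msi_h) * len(raf_mss))
```

A marker whose MSI-H samples all have lower RAFs than every MSS sample returns 1.0, and a marker with no discriminating ability returns approximately 0.5.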

The two-round multiplex PCR primers were redesigned to incorporate the universal amplification primers such that amplification could be completed in a single round of multiplex PCR. The single-round multiplex PCR protocol is unpublished. In brief, each reaction contains 5µl of 5x HS VeriFi Buffer (PCR Biosystems), 0.25µl of 2U/µl HS VeriFi DNA Polymerase (PCR Biosystems), 1µl of multiplex primer mix with each primer at 1µM in the stock, 1-5µl of DNA sample, and molecular grade H2O to achieve a total reaction volume of 25µl. Reactions are incubated in a thermocycler using the following programme:

Heat activation:

95°C 1 min

Amplification (30 cycles):

95°C 30sec

57°C 90sec

72°C 60sec

Final extension:

72°C 2min

Hold:

4°C
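The reaction set-up and thermocycler programme above can be sketched, for illustration only, as follows. Volumes are in microlitres. The function and variable names are hypothetical, and the time estimate counts only the programmed hold times (it ignores ramping between temperatures and the final 4°C hold, whose duration is not specified).

```python
TOTAL_VOLUME_UL = 25.0
FIXED_COMPONENTS_UL = {
    "5x HS VeriFi Buffer": 5.0,
    "HS VeriFi DNA Polymerase": 0.25,
    "multiplex primer mix": 1.0,
}


def reaction_mix(dna_ul):
    """Return component volumes (in microlitres) for one 25 microlitre reaction."""
    if not 1.0 <= dna_ul <= 5.0:
        raise ValueError("DNA sample volume is given as 1-5 microlitres")
    mix = dict(FIXED_COMPONENTS_UL)
    mix["DNA sample"] = dna_ul
    # Water makes up the remainder of the total reaction volume.
    mix["molecular grade H2O"] = TOTAL_VOLUME_UL - sum(mix.values())
    return mix


# (stage, [(temperature in degrees C, seconds), ...], number of cycles)
PROGRAMME = [
    ("heat activation", [(95, 60)], 1),
    ("amplification", [(95, 30), (57, 90), (72, 60)], 30),
    ("final extension", [(72, 120)], 1),
]
HOLD_TEMPERATURE_C = 4


def programmed_seconds(programme):
    """Sum of programmed step times, excluding ramping and the final hold."""
    return sum(cycles * sum(sec for _, sec in steps) for _, steps, cycles in programme)
```

With these values the programmed steps add up to 5580 seconds (93 minutes), and a reaction using 1µl of DNA sample needs 17.75µl of water.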

Amplicon library preparation and sequencing follow established protocols (Phelps et al 2022).

Initial testing of primer multiplexes to amplify different combinations of MSI markers defined the final marker panel, containing 11 new and 3 original MSI markers (Table I), as well as 7 optional BRAF and RAS tumour mutation hotspots relevant to CRC care (not shown in Table I). In the Primer Name column of Table I, “xxx” is a unique sample index number, and each primer must be purchased for each sample index. In the Primer Sequence column, [Index8N] is the 8 base sequence of the sample index.

The finalised single-round multiplex PCR assay of 11 new and 3 original MSI markers, plus 7 BRAF and RAS mutation hotspots, was subsequently validated using FFPE CRC DNA samples, NEQAS standards (https://ukneqas.org.uk/), and cancer cell lines. This included a training cohort of 50 MSI-H and 50 MSS CRCs to train the naive Bayesian MSI classifier previously used for tumour analysis (Redford et al 2018, PLoS One 13(8):e0203052. doi: 10.1371/journal.pone.0203052, PMID: 30157243) and a validation cohort of 55 MSI-H and 83 MSS CRCs, as well as 4 MSI-H and 4 MSS NEQAS standards and 3 MSI-H and 3 MSS cancer cell lines. The CRC validation cohort deliberately contained samples of very low quantity and samples that had previously failed sequence analysis by MIPs to challenge the single-round multiplex PCR assay. The reference method for MSI status of samples was the MSI Analysis System v1.2 (Promega) or the MIP-based MSI assay (Gallon et al 2020, Human Mutation 41(1):332-341. doi: 10.1002/humu.23906, PMID: 31471937). Once trained, the naive Bayesian MSI classifier generates an MSI score for each sample, with MSI scores >0 classifying the sample as MSI-H and MSI scores <0 classifying the sample as MSS.

Quality control (QC) thresholds were set for the single-round multiplex PCR MSI assay, requiring a median of at least 100 reads across the MSI markers for a sample to pass QC. NEQAS standards and cancer cell lines all passed QC and were correctly classified (Figure 27). 97 MSI-H and 110 MSS CRCs passed QC, and among these the MSI assay achieved 99.0% sensitivity (96/97) and 100.0% specificity (110/110) (Figure 27). 8 MSI-H and 23 MSS CRCs from the validation cohort failed QC, with 6 of these having read depths too low to generate an MSI score. Despite failing QC, the remaining 25 QC-fail CRCs were all correctly classified, although their MSI scores clustered around 0 (an indeterminate score). Subsequently, this was demonstrated to be an issue of sample processing: nearly all of these samples came from a small number of DNA extraction batches, and purification or dilution followed by a repeat assay improved QC metrics and moved MSI scores away from 0 (i.e. MSI-H CRC scores increased and MSS CRC scores decreased, data not shown).
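The QC rule, the score-based classification and the reported accuracy figures can be illustrated with the short sketch below. It assumes that the 100-read threshold is a minimum on the median per-marker read depth, and it treats a score of exactly 0 as indeterminate; the function names are hypothetical, and the naive Bayesian scoring itself (Redford et al 2018) is not reproduced here.

```python
from statistics import median

MIN_MEDIAN_READS = 100  # assumed to be a minimum on the median marker depth


def passes_qc(marker_read_depths):
    """True if the median read depth across MSI markers meets the threshold."""
    return median(marker_read_depths) >= MIN_MEDIAN_READS


def classify(msi_score):
    """Map a classifier MSI score to a call using the sign of the score."""
    if msi_score > 0:
        return "MSI-H"
    if msi_score < 0:
        return "MSS"
    return "indeterminate"


def sensitivity(true_pos, false_neg):
    """Fraction of reference MSI-H samples called MSI-H."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of reference MSS samples called MSS."""
    return true_neg / (true_neg + false_pos)
```

Using the counts reported above, `sensitivity(96, 1)` rounds to 99.0% and `specificity(110, 0)` is 100.0%.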

Table 6 - Examples of hotspots and associated primers suitable for a single-round multiplex PCR reaction as described herein