

Title:
METHOD FOR IDENTIFYING AND OBTAINING LIVE BACTERIAL ISOLATES
Document Type and Number:
WIPO Patent Application WO/2023/214186
Kind Code:
A1
Abstract:
A method of isolating and/or identifying one or more strains of bacteria in a sample, said method comprising the steps of: extracting DNA from the sample, preparing a sequencing library using a long-read sequencing technique, de novo assembly of reads into one or more sequence contigs or assemblies, analysis of sequence contigs or assemblies to identify one or more conserved and/or typing gene operon sequences, analysis of sequence contigs or assemblies to identify PCR primer sites upstream of the one or more conserved and/or typing gene operon sequences, taxonomic classification of upstream PCR primer sites, and/or taxonomic classification of conserved and/or typing gene operon sequences.

Application Number:
PCT/GB2023/051210
Publication Date:
November 09, 2023
Filing Date:
May 09, 2023
Assignee:
QUADRAM INST BIOSCIENCE (GB)
International Classes:
C12Q1/6806; C12Q1/6869; C12Q1/689; G16B30/00
Other References:
CHARALAMPOUS THEMOULA ET AL: "Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection", NATURE BIOTECHNOLOGY, NATURE PUBLISHING GROUP US, NEW YORK, vol. 37, no. 7, 24 June 2019 (2019-06-24), pages 783 - 792, XP036845543, ISSN: 1087-0156, [retrieved on 20190624], DOI: 10.1038/S41587-019-0156-5
KYRGYZOV OLEXIY ET AL: "Binning unassembled short reads based on k-mer abundance covariance using sparse coding", GIGASCIENCE, vol. 9, no. 4, 29 March 2020 (2020-03-29), XP093082447, Retrieved from the Internet DOI: 10.1093/gigascience/giaa028
MATSUO YOSHIYUKI ET AL: "Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION(TM) nanopore sequencing confers species-level resolution", BMC MICROBIOLOGY, vol. 21, no. 1, 26 January 2021 (2021-01-26), XP093054437, Retrieved from the Internet DOI: 10.1186/s12866-021-02094-5
ABELLAN-SCHNEYDER ISABEL ET AL: "Primer, Pipelines, Parameters: Issues in 16S rRNA Gene Sequencing", MSPHERE, vol. 6, no. 1, 24 February 2021 (2021-02-24), XP093081504, Retrieved from the Internet DOI: 10.1128/mSphere.01202-20
GILPATRICK, TIMOTHYISAC LEEJAMES E. GRAHAMETIENNE RAIMONDEAUREBECCA BOWENANDREW HERONBRADLEY DOWNSSARASWATI SUKUMARFRITZ J. SEDLAZ: "Targeted Nanopore Sequencing with Cas9-Guided Adapter Ligation", NATURE BIOTECHNOLOGY, 2020, Retrieved from the Internet
GOPALAKRISHNAN, VANCHESWARANBETH A. HELMINKCHRISTINE N. SPENCERALEXANDRE REUBENJENNIFER A. WARGO: "The Influence of the Gut Microbiome on Cancer, Immunity, and Cancer Immunotherapy", CANCER CELL, vol. 33, no. 4, 2018, pages 570 - 80, XP085376835, Retrieved from the Internet DOI: 10.1016/j.ccell.2018.03.015
JOHNSON, JETHRO S., DANIEL J. SPAKOWICZ, BO-YOUNG HONG, LAUREN M. PETERSEN, PATRICK DEMKOWICZ, LEI CHEN, SHANA R. LEOPOLD: "Evaluation of 16S RRNA Gene Sequencing for Species and Strain-Level Microbiome Analysis", NATURE COMMUNICATIONS, vol. 10, no. 1, 2019, pages 1 - 11, Retrieved from the Internet
TIMMY SCHWEER, JORG PEPLIES, CHRISTIAN QUAST, MATTHIAS HORN, FRANK OLIVER GLOCKNER: "Evaluation of General 16S Ribosomal RNA Gene PCR Primers for Classical and Next-Generation Sequencing-Based Diversity Studies", NUCLEIC ACIDS RESEARCH, vol. 41, no. 1, 2013, pages e1, XP055253067, Retrieved from the Internet DOI: 10.1093/nar/gks808
KOKOT, MAREKMACIEJ DLUGOSZSEBASTIAN DEOROWICZ: "KMC 3: Counting and Manipulating k-Mer Statistics", BIOINFORMATICS, vol. 33, no. 17, 2017, pages 2759 - 61, Retrieved from the Internet
KOLMOGOROV, MIKHAILJEFFREY YUANYU LINPAVEL A. PEVZNER: "Assembly of Long, Error-Prone Reads Using Repeat Graphs", NATURE BIOTECHNOLOGY, vol. 37, no. 5, 2019, pages 540 - 46, XP036773011, Retrieved from the Internet DOI: 10.1038/s41587-019-0072-8
LI, HENG: "Minimap2: Pairwise Alignment for Nucleotide Sequences", BIOINFORMATICS, vol. 34, no. 18, 2018, pages 3094 - 3100, Retrieved from the Internet
LI, HENGBOB HANDSAKERALEC WYSOKERTIM FENNELLJUE RUANNILS HOMERGABOR MARTHGONCALO ABECASISRICHARD DURBIN: "The Sequence Alignment/Map Format and SAMtools", BIOINFORMATICS, vol. 25, no. 16, 2009, pages 2078 - 2079, XP055229864, DOI: 10.1093/bioinformatics/btp352
MAKENDI, CARINE, ANDREW J PAGE, BRENDAN W WREN, TU LE THI PHUONG, SIMON CLARE, CHRISTINE HALE, DAVID GOULDING ET AL.: "A Phylogenetic and Phenotypic Analysis of Salmonella Enterica Serovar Weltevreden, an Emerging Agent of Diarrheal Disease in Tropical Regions", PLOS NEGLECTED TROPICAL DISEASES, 2016
PARKS, DONOVAN H., MARIA CHUVOCHINA, PIERRE-ALAIN CHAUMEIL, CHRISTIAN RINKE, AARON J. MUSSIG, AND PHILIP HUGENHOLTZ: "Selection of Representative Genomes for 24,706 Bacterial and Archaeal Species Clusters Provide a Complete Genome-Based Taxonomy", BIORXIV, November 2019 (2019-11-01), pages 771964, Retrieved from the Internet
PIETERS, ZOENEIL J SAADMARINA ANTILLONVIRGINIA E PITZERJOKE BILCKE: "Case Fatality Rate of Enteric Fever in Endemic Countries: A Systematic Review and Meta-Analysis", CLINICAL INFECTIOUS DISEASES: AN OFFICIAL PUBLICATION OF THE INFECTIOUS DISEASES SOCIETY OF AMERICA, vol. 67, no. 4, 2018, pages 628 - 38, Retrieved from the Internet
QUAN, JENAICHARLES LANGELIERALISON KUCHTAJOSHUA BATSONNOAM TEYSSIERAMY LYDENSAHARAI CALDERA ET AL.: "FLASH: A next-Generation CRISPR Diagnostic for Multiplexed Detection of Antimicrobial Resistance Sequences", NUCLEIC ACIDS RESEARCH, vol. 47, no. 14, 2019, pages e83 - e83, XP055652008, Retrieved from the Internet DOI: 10.1093/nar/gkz418
QUAST, CHRISTIANELMAR PRUESSEPELIN YILMAZJAN GERKENTIMMY SCHWEERPABLO YARZAJORG PEPLIESFRANK OLIVER GLOCKNER: "The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools", NUCLEIC ACIDS RESEARCH, vol. 41, no. D1, 2013, pages D590 - 96, XP055252806, Retrieved from the Internet DOI: 10.1093/nar/gks1219
REKDAL, VAYU MAINIELIZABETH N. BESSJORDAN E. BISANZPETER J. TURNBAUGHEMILY P. BALSKUS: "Discovery and Inhibition of an Interspecies Gut Bacterial Pathway for Levodopa Metabolism", SCIENCE, vol. 364, 2019, pages 6445, Retrieved from the Internet
RICE, PETERIAN LONGDENALAN BLEASBY: "EMBOSS: The European Molecular Biology Open Software Suite", TRENDS IN GENETICS, vol. 16, no. 6, 2000, pages 276 - 77, XP004200114, Retrieved from the Internet DOI: 10.1016/S0168-9525(00)02024-2
UNTERGASSER, ANDREASIOANA CUTCUTACHETRIINU KORESSAARJIAN YEBRANT C. FAIRCLOTHMAIDO REMMSTEVEN G. ROZEN: "Primer3—New Capabilities and Interfaces", NUCLEIC ACIDS RESEARCH, vol. 40, no. 15, 2012, pages e115, XP055982973, Retrieved from the Internet DOI: 10.1093/nar/gks596
WOOD, DERRICK EJENNIFER LUBEN LANGMEAD: "Improved Metagenomic Analysis with Kraken 2", GENOME BIOLOGY, vol. 20, no. 1, 2019, pages 257, Retrieved from the Internet
NGUYEN, LAM-TUNG, HEIKO A. SCHMIDT, ARNDT VON HAESELER, BUI QUANG MINH: "IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies", MOLECULAR BIOLOGY AND EVOLUTION, vol. 32, no. 1, 2014, pages 268 - 274
MINH, B. Q., NGUYEN, M. A. T., VON HAESELER, A.: "Ultrafast Approximation for Phylogenetic Bootstrap", MOLECULAR BIOLOGY AND EVOLUTION, vol. 30, no. 5, 2013, pages 1188 - 1195
SEEMANN T, SNIPPY: FAST BACTERIAL VARIANT CALLING FROM NGS READS., 2015, Retrieved from the Internet
TIERNEY DCOPSEY SDMORRIS TPERRY JD: "A new chromogenic medium for isolation of Bacteroides fragilis suitable for screening for strains with antimicrobial resistance", ANAEROBE, vol. 39, 2016, pages 168 - 172
Attorney, Agent or Firm:
BAILEY WALSH & CO. LLP (GB)
Claims:

1. A method of isolating and/or identifying one or more strains of bacteria in a sample, said method comprising the steps of:

Extracting DNA from the sample;

Preparing a sequencing library using a long-read sequencing technique;

De novo assembly of reads into one or more sequence contigs or assemblies;

Analysis of sequence contigs or assemblies to identify one or more conserved and/or typing gene operon sequences;

Analysis of sequence contigs or assemblies to identify PCR primer sites upstream of the one or more conserved and/or typing gene operon sequences;

Taxonomic classification of upstream PCR primer sites; and/or

Taxonomic classification of conserved and/or typing gene operon sequences.

2. A method according to claim 1 wherein the conserved operons are rRNA operons.

3. A method according to claim 2 wherein the conserved operons are 16S operons or parts thereof.

4. A method according to claims 1-3 wherein a table or spreadsheet is created including primers for the conserved and/or typing gene operon sequences or rRNA operon and the likely species they came from.

5. A method according to claim 1 wherein a serial dilution is performed with an aliquot of the original sample, to progressively reduce the number of cells.

6. A method according to claim 5 wherein each dilution is subject to real time or qPCR of the bacteria (identified or unidentified) of choice.

7. A method according to claim 6 wherein when the PCR goes from positive to negative, the previous dilution before the negative result is taken.

8. A method according to claim 7 wherein the cells are cultured from the PCR positive dilution.

9. A method according to claim 8 wherein the cells are cultured in anaerobic conditions, or aerobic conditions, on a variety of media.

10. A method according to claim 9 wherein all colonies are selected and plated onto new media.

11. A method according to claim 10 wherein each colony is then subjected to qPCR using primers specific for the cognate ‘proximal’ sites.

12. A method according to claims 9-11 wherein if there are multiple closely related bacteria, multiple primers can be used as they are self-barcoding, with different combinations of primers used to distinguish between strains.

13. A method according to any preceding claim wherein the bacterial isolate is sequenced.

14. A method according to claim 13 wherein the isolate sequence is subject to qPCR to check or confirm the bacteria isolated is qPCR positive.

15. A method according to claims 13 or 14 wherein the user can assess if the selected bacteria is from a known species or a selection from an unknown species is present.

16. A method according to any preceding claim wherein the sample is a biological metagenomic sample.

17. A method according to any preceding claim wherein the DNA is extracted from the sample using bead-beating and/or non-bead beating lysis methods.

18. A method according to claim 17 wherein the extracted DNA fragments are subject to size selection.

19. A method according to claim 17 wherein the sequencing library is prepared and sequenced to a high depth using a long-read sequencing instrument or technique.

20. A method according to claim 19 wherein the long-read sequence spans a conserved and/or typing gene operon sequence.

21. A method according to claim 20 wherein the long-read sequence spans a typing gene (16S) and the surrounding (proximal site; PSI site containing PCR primers) DNA flanking the same.

22. A method according to claim 21 wherein the DNA flanking the typing gene, also known as flanking site DNA, is used to design species-specific PCR primers.

23. A method according to any preceding claim 17 wherein the frequency of all k-mers in the assembly is calculated.

24. A method according to claim 23 wherein the 16S rRNA sequence and an upstream region of substantially up to the mean fragment length is identified.

25. A method according to claim 23 or 24 wherein k-mers are identified in the upstream region for each rRNA operon and used to identify subregions which are suitable for primer creation.

26. A method according to claim 17 wherein regions of the upstream sequence are searched for high quality PCR primer sites.

27. A method according to any preceding claim wherein in-silico PCR is performed with primers against the whole assembly, enabling the counting of the number of off target hits.

28. A method according to claim 27 wherein the number of individual reads which fully overlap the whole conserved and/or 16S region, and the PCR primers is recorded.

29. A method according to any preceding claim wherein the taxonomic classification is performed using one or more databases.

30. A method according to claim 29 wherein the upstream region is separately taxonomically classified using one or more databases.

31. A method of isolating and/or identifying one or more strains of bacteria in a sample, said method comprising the steps of:

Extracting DNA from the sample;

Preparing a sequencing library using a short-read sequencing technique;

Performing short read shotgun metagenomics using a sequencing instrument and using the short-read data to calculate the relative abundances of the k-mers in the sample.

32. A method according to claims 1 and 31 wherein short and long reads are mapped to the assembled data to allow for normalisation of the yield from the different sequencing techniques.

Description:
The invention to which this application relates is a new Proximal Site Identification (PSI/W) method that enables the selection of live bacterial isolates from a sample.

Although the following description refers exclusively to selecting and isolating bacteria from a metagenomic sample, the person skilled in the art will appreciate that any sample type could be used and is not limited to metagenomic samples, faecal samples and/or the like. Indeed, the present method enables the skilled person to selectively isolate live bacterial organisms of their choosing from a sample and to distinguish between closely related bacteria down to strain level. It is also universal to all bacteria and requires no prior knowledge of the bacteria targeted for isolation.

Most bacteria around us (present in the gut, soil, environment, for example) are uncharacterised, yet they may have potential as therapeutic agents in their own right or may generate metabolites that have therapeutic potential, promising the next generation of drug discovery.

Currently a brute force culturomics approach is taken to isolate novel organisms as candidates for therapeutic study. Using this approach, large numbers of bacteria are grown in the expectation that novel strains and/or interesting properties will be identified. The culturomics approach relies on the concept that everything that can be grown has its DNA sequence determined, thus characterising novel organisms (and the biochemical pathways that they harbour). This approach suffers from exponentially diminishing returns: it takes years to build up a catalogue of strains by growing and sequencing everything, and as the work progresses, more and more of what is grown has been seen before. To find a single new species/subspecies in this manner potentially requires the culturing and sequencing of tens (or indeed hundreds) of thousands of bacteria.

It is therefore an aim of the present invention to provide a method that addresses the abovementioned problems.

In a first aspect of the invention there is provided a method of isolating and/or identifying one or more strains of bacteria in a sample, said method comprising the steps of:

Extracting DNA from the sample;

Preparing a sequencing library using a long-read sequencing technique;

De novo assembly of reads into one or more sequence contigs or assemblies;

Analysis of sequence contigs or assemblies to identify one or more conserved and/or typing gene operon sequences;

Analysis of sequence contigs or assemblies to identify PCR primer sites upstream of the one or more conserved and/or typing gene operon sequences;

Taxonomic classification of upstream PCR primer sites; and/or

Taxonomic classification of conserved and/or typing gene operon sequences.

Preferably the conserved operons are rRNA operons. Further preferably the conserved operons are 16S operons or parts thereof. In a preferred embodiment a table or spreadsheet is created including primers for the conserved and/or typing gene operon sequences or rRNA operon and the likely species they came from. Typically, unidentified strains are also included.

In a preferred embodiment a serial dilution is performed with an aliquot of the original sample, to progressively reduce the number of cells. Typically, each dilution is subject to real time or qPCR of the bacteria (identified or unidentified) of choice. Further typically, when the PCR goes from positive to negative, the previous dilution before the negative result is taken. As such, the number of target bacteria in the tube is known, or can be estimated, with a minimal number of cells.

In a preferred embodiment the cells are cultured from the PCR positive dilution. Typically the cells are cultured in anaerobic conditions and/or aerobic conditions. Further typically the cells are cultured in anaerobic conditions and/or aerobic conditions on a variety of media.

In one embodiment all colonies are selected and plated onto new media. Typically, each colony is then subjected to qPCR using primers specific for the cognate ‘proximal’ sites. This process allows for the identification of the colonies which contain the target bacteria. The identified colonies are plated onto new media as a pure isolate.

In one embodiment, if there are multiple closely related bacteria, multiple primers can be used as they are self-barcoding, with different combinations of primers used to distinguish between strains. In one embodiment the bacterial isolate is sequenced. Typically the isolate sequence is subject to qPCR to check or confirm the bacteria isolated is qPCR positive. Further typically the user can assess if the selected bacteria is from a known species or a selection from an unknown species is present.

In one embodiment the sample is a biological metagenomic sample. For example such a sample could include human faeces, soil etc.

Typically the DNA is extracted from the sample using bead-beating and/or non-bead beating lysis methods.

In a preferred embodiment both bead-beating and non-bead beating lysis methods are used. Typically using both approaches to extract the DNA allows the recovery of long DNA fragments from cells that are easy to lyse as well as cells (and spores) that are more difficult to break open.

In one embodiment the extracted DNA fragments are subject to size selection. Typically the fragments are size selected to remove very short fragments.

In one embodiment the sequencing library is prepared and sequenced to a high depth using a long-read sequencing instrument or technique. Typically the long-read sequencing instrument is a PromethION™ from Oxford Nanopore Technologies or a Sequel II™ from PacBio.

Preferably the long-read sequence spans a conserved and/or typing gene operon sequence. Further preferably the long-read sequence spans a typing gene (16S) and the surrounding (proximal site; PSI site containing PCR primers) DNA flanking the same.
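As an illustration only, the requirement that a single long read spans both the typing gene and its flanking proximal site can be expressed as an interval check on read alignments. The sketch below assumes the alignments are already available as (read name, start, end) tuples on one contig, using 0-based half-open coordinates. The function name, the tuple layout and the example coordinates are hypothetical and are not taken from the application.

```python
def spanning_reads(alignments, region_start, region_end):
    """Return names of reads whose alignment covers the whole region.

    alignments: iterable of (read_name, aln_start, aln_end) tuples on one
    contig, 0-based half-open (hypothetical layout for this sketch).
    region_start, region_end: bounds of the flanking site plus typing gene.
    """
    return [
        name
        for name, aln_start, aln_end in alignments
        if aln_start <= region_start and aln_end >= region_end
    ]


# Hypothetical example: the region runs from a primer site at 2,000 to the
# end of a 16S gene at 7,500.
example_alignments = [
    ("read_a", 1500, 9000),  # covers the whole region
    ("read_b", 3000, 9000),  # starts inside the region, so it is excluded
    ("read_c", 2000, 7500),  # an exact cover counts as spanning
]
spanning = spanning_reads(example_alignments, 2000, 7500)
```

The number of names returned is the count of individual reads that physically link the typing gene to the candidate primer sites.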

Typically the DNA flanking the typing gene, also known as flanking site DNA, is used to design species-specific PCR primers.

In one embodiment the sequence read data is quality checked. Typically the quality checking is automatic wherein low quality reads are removed, chimeras split and/or adapters cut with de novo assembly software.

In one embodiment the quality checking is performed using Flye software.

In a preferred embodiment of the invention the frequency of all k-mers in the assembly is calculated. This enables the user to calculate the relative abundance.
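A minimal sketch of counting k-mer frequencies across an assembly is given below, purely to illustrate the idea. It uses plain Python rather than a dedicated k-mer counting tool, counts the forward strand only, and the function names and the choice of k are assumptions made for this example.

```python
from collections import Counter


def count_kmers(contigs, k):
    """Count every k-mer (forward strand only) across all contig sequences."""
    counts = Counter()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def relative_abundance(counts):
    """Convert raw k-mer counts into fractions of the total k-mer count."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Toy example with a single short "contig" and k = 4.
kmer_counts = count_kmers(["ACGTACGT"], 4)
kmer_fractions = relative_abundance(kmer_counts)
```

For real assemblies a purpose-built k-mer counter would normally be preferred for speed and memory use; the sketch only shows what is being computed.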

In one embodiment each operon is identified using identification or prediction software. Typically the rRNA operon is identified using Barrnap or such similar software that can identify or predict the location of rRNA genes in sequences and/or genomes.

In one embodiment the 16S rRNA sequence and an upstream region of substantially up to the mean fragment length are identified. Typically k-mers are identified in the upstream region for each rRNA operon. Further typically the k-mer global abundance is used to identify subregions which are suitable for primer creation. The skilled person will appreciate that in some genomes rRNA operons are close together and there may not be enough non-operon nucleotides upstream to produce PCR primers. In other cases the assembler has broken the contig beside an rRNA operon, disrupting the upstream region. This may be because most bacterial chromosomes are circular, and the break has been introduced as a consequence of an ambiguity in the underlying assembly graph.
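The selection of an upstream window, including the two failure cases just described, might be sketched as follows. This is an assumption-laden illustration: the coordinate convention (0-based, half-open), the `min_len` threshold of 200 bases and the function name are all hypothetical, and neighbouring operons would need an additional overlap check that is omitted here.

```python
def upstream_window(contig_len, gene_start, gene_end, strand, max_len, min_len=200):
    """Return (start, end) of the region upstream of a 16S gene, or None.

    Coordinates are 0-based half-open on the contig. max_len would be set
    to roughly the mean fragment length. None is returned when the contig
    ends too close to the gene to leave min_len bases (an assumed cut-off).
    """
    if strand == "+":
        start, end = max(0, gene_start - max_len), gene_start
    elif strand == "-":
        # For a reverse-strand gene, "upstream" lies at higher coordinates.
        start, end = gene_end, min(contig_len, gene_end + max_len)
    else:
        raise ValueError("strand must be '+' or '-'")
    if end - start < min_len:
        return None
    return start, end


full_window = upstream_window(10000, 6000, 7500, "+", 4000)
clipped_window = upstream_window(10000, 500, 2000, "+", 4000)
broken_contig = upstream_window(10000, 100, 1600, "+", 4000)
reverse_window = upstream_window(10000, 6000, 7500, "-", 4000)
```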

In one embodiment regions of the upstream sequence are searched for high quality PCR primer sites. Typically the search begins with the lowest global abundance and progressively increases.

Typically the searching is an iterative process. Further typically when a high-quality primer is identified the search of this operon ends.
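The iterative search described above could be sketched as below: candidate windows in the upstream region are ranked by the mean global abundance of their k-mers, tried from lowest to highest, and the search stops at the first acceptable one. The acceptance test here is only a GC-content check standing in for a real primer design tool, and all names, window sizes and thresholds are assumptions for illustration.

```python
def window_scores(region, kmer_counts, k, window):
    """Return (mean k-mer abundance, start) for every window, lowest first."""
    scores = []
    for start in range(len(region) - window + 1):
        sub = region[start:start + window]
        kmers = [sub[i:i + k] for i in range(window - k + 1)]
        mean = sum(kmer_counts.get(km, 0) for km in kmers) / len(kmers)
        scores.append((mean, start))
    return sorted(scores)


def passes_basic_checks(seq, gc_min=0.4, gc_max=0.6):
    """Placeholder quality test: GC fraction within an assumed range."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_min <= gc <= gc_max


def first_acceptable_site(region, kmer_counts, k, window):
    """Try windows in order of increasing abundance; stop at the first pass."""
    for mean, start in window_scores(region, kmer_counts, k, window):
        candidate = region[start:start + window]
        if passes_basic_checks(candidate):
            return start, candidate, mean
    return None


# Toy example: the A-rich k-mer is common in the assembly, the rest are rare.
toy_counts = {"AA": 100, "AC": 1, "CG": 1, "GT": 1}
best_site = first_acceptable_site("AAAAACGT", toy_counts, k=2, window=4)
no_site = first_acceptable_site("AAAAAAAA", {"AA": 5}, k=2, window=4)
```

Ending the search at the first passing window mirrors the statement that the search of an operon ends once a high-quality primer is identified.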

In one embodiment the primers are tuned to work with quantitative PCR, using standard methods.

In a preferred embodiment the in-silico PCR is performed with the primers against the whole assembly. This enables the counting of the number of off target hits. Typically the result is recorded. Further typically the number of individual reads which fully overlap the whole conserved and/or 16S region, and the PCR primers is recorded. This gives the user confidence that there is underlying genomic DNA linking the two and is not simply an algorithmic error.
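A simplified in-silico PCR can be sketched as an exact-match search for the forward primer and the reverse complement of the reverse primer on both strands of every contig. Real tools also tolerate mismatches and model annealing; this sketch does neither. The function names, the `max_product` limit and the toy sequences are assumptions for this example.

```python
def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Start positions of every (possibly overlapping) exact match."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def insilico_pcr(contigs, fwd, rev, max_product=5000):
    """Return (contig, start, end, strand) for each predicted product.

    Coordinates for "-" strand products refer to the reverse-complemented
    contig, which is sufficient for counting hits.
    """
    rev_site = revcomp(rev)
    products = []
    for name, seq in contigs.items():
        for strand, template in (("+", seq), ("-", revcomp(seq))):
            for f in find_all(template, fwd):
                for r in find_all(template, rev_site):
                    end = r + len(rev_site)
                    if r >= f + len(fwd) and end - f <= max_product:
                        products.append((name, f, end, strand))
    return products


def off_target_count(products, intended):
    """Number of predicted products other than the intended one."""
    return sum(1 for p in products if p != intended)


target = "TTTACGGAAAAAAAACCTGATTT"
toy_assembly = {"c1": target, "c2": "GG" + target}
products = insilico_pcr(toy_assembly, fwd="ACGG", rev="TCAGG")
off_targets = off_target_count(products, ("c1", 3, 20, "+"))
```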

In a preferred embodiment the taxonomic classification is performed using one or more databases. In one embodiment the 16S region is taxonomically classified using Kraken2 and/or the SILVA database. In a preferred embodiment the upstream region is separately taxonomically classified using one or more databases. In one embodiment the upstream region is classified using Kraken2 and the GTDB database. Typically the databases are mostly mutually exclusive.

Typically, the 16S classification will give at most Genus level taxonomic information and the upstream region will give down to species/ strain level resolution. Further typically the higher resolution is at the expense of greater errors due to the reliance on short-read MAGs.

In one embodiment the output of the classification gives a selection of primers for each conserved region or rRNA operon and the likely species they came from. Typically the information is provided in a table or a spreadsheet of primers for each rRNA operon and the likely species they came from.
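One possible shape for such a table is a CSV file with one row per rRNA operon. The sketch below is illustrative only: the column names are assumptions, and the primer sequences and taxon labels are placeholders rather than results.

```python
import csv
import io

# Hypothetical column layout for the primer table.
FIELDS = ["operon_id", "forward_primer", "reverse_primer",
          "taxon_16s", "taxon_upstream"]


def write_primer_table(rows, handle):
    """Write one CSV row per operon to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder rows; an unidentified strain is kept in the table as well.
rows = [
    {"operon_id": "operon_1", "forward_primer": "ACGTACGTACGTACGTAC",
     "reverse_primer": "TGCATGCATGCATGCATG", "taxon_16s": "Genus A",
     "taxon_upstream": "Genus A species 1"},
    {"operon_id": "operon_2", "forward_primer": "GATCGATCGATCGATCGA",
     "reverse_primer": "CTAGCTAGCTAGCTAGCT", "taxon_16s": "unclassified",
     "taxon_upstream": "unclassified"},
]
buffer = io.StringIO()
write_primer_table(rows, buffer)
table_text = buffer.getvalue()
```

Keeping the 16S and upstream classifications in separate columns makes disagreements between the two visible at a glance.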

In one embodiment where a study involves the comparison of multiple metagenomic samples and/or samples from multiple sources (such as microbiome data concerning cases and controls, and/or disease vs healthy samples), established statistical methods can be employed to highlight interesting species, and the PCR primers are available to identify them from the other cells in the microbial community.

Typically PCR primers are created using the classification sequences output. Further typically PCR primers are created using the sequences selected from the table or spreadsheet. In a preferred embodiment a serial dilution is performed with an aliquot of the original sample, to progressively reduce the number of cells. Typically a 10-fold serial dilution is performed.

Typically each dilution is subject to qPCR of the bacteria selected from the table. Further typically when the qPCR result goes from positive to negative, the dilution prior to the negative result is selected. The number of target bacteria in the selected dilution is known, with a minimal number of cells.
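The selection rule for the dilution series can be written down in a few lines. In this sketch the qPCR calls are assumed to have already been reduced to positive or negative per dilution, ordered from least to most dilute; the function name and the example series are hypothetical.

```python
def last_positive_dilution(results):
    """Return the dilution factor immediately before the first negative.

    results: list of (dilution_factor, is_positive), ordered from least to
    most dilute. Returns None if the first dilution is already negative or
    if the series never turns negative (it would need extending).
    """
    previous = None
    for factor, positive in results:
        if not positive:
            return previous
        previous = factor
    return None


# Hypothetical 10-fold series: positive down to 1:100, negative from 1:1000.
series = [(1, True), (10, True), (100, True), (1000, False), (10000, False)]
selected = last_positive_dilution(series)
```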

In one embodiment the cells are cultured using anaerobic and/or aerobic conditions on a variety of media. Typically the cells are cultured using both anaerobic and aerobic conditions on a variety of media.

In one embodiment the colonies are picked and plated onto new media. Typically these colonies are then subjected to qPCR using primers specific for the cognate ‘proximal’ sites. Further typically this process allows for the identification of the colonies which contain the target bacteria.

Typically, if there are multiple closely related bacteria in a metagenomic sample, the multiple primers can be used as they are self-barcoding. The skilled person or user has now isolated the live target bacteria.
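The self-barcoding idea can be illustrated by treating each strain as the set of PSI primer pairs that give a positive qPCR result, then checking that no two strains share the same pattern. The strain names, primer names and data layout below are hypothetical.

```python
def primer_barcodes(profiles):
    """Turn per-strain sets of positive primers into presence/absence tuples.

    profiles: dict mapping strain name -> set of primer names that are
    qPCR positive for that strain.
    """
    primers = sorted(set().union(*profiles.values()))
    barcodes = {
        strain: tuple(p in hits for p in primers)
        for strain, hits in profiles.items()
    }
    return primers, barcodes


def all_distinguishable(barcodes):
    """True when every strain has a unique presence/absence pattern."""
    return len(set(barcodes.values())) == len(barcodes)


# Hypothetical profiles for three closely related strains.
profiles = {
    "strain_A": {"psi_1", "psi_2"},
    "strain_B": {"psi_1", "psi_3"},
    "strain_C": {"psi_1"},
}
primer_order, barcodes = primer_barcodes(profiles)
unique = all_distinguishable(barcodes)
```

Here no single primer separates all three strains, but the combination of results does.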

In one embodiment the identity of the bacterial isolate is confirmed. Typically the bacterial isolate is sequenced and subject to qPCR to confirm the identity. In an alternative embodiment the DNA is extracted as previously described and a subsample is fragmented to 500 bases to prepare short read sequencing libraries. Typically short read shotgun metagenomics is performed using a sequencing instrument. Further typically this short-read data is used to calculate the relative abundances of the k-mers in the sample.

Typically another subsample of DNA is subject to a CRISPR-Cas9 system to cut at the 3’ end of a conserved region.

In one embodiment the universal primer GGGTYKCGCTCGTTR is used to target the end of the 16S region at position 1100-1114.
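The sequence above contains the IUPAC degenerate bases Y (C or T), K (G or T) and R (A or G), so it stands for a small family of concrete sequences. A short sketch of how it could be expanded or matched against a 16S sequence is shown below; the helper names are assumptions for illustration and the sketch matches one strand only.

```python
import re
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_regex(primer):
    """Compile a regular expression that matches any expansion of the primer."""
    return re.compile("".join("[" + IUPAC[b] + "]" for b in primer.upper()))


def expand_primer(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    options = [IUPAC[b] for b in primer.upper()]
    return ["".join(combo) for combo in product(*options)]


universal = "GGGTYKCGCTCGTTR"
pattern = primer_regex(universal)
variants = expand_primer(universal)
```

With three two-fold degenerate positions, the primer expands to eight concrete 15-base sequences.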

In a preferred embodiment long read sequencing libraries are prepared. Typically the result is that the conserved region and the upstream complex sequence are included and substantially nothing else. Further typically the result is the near full-length 16S sequence and a few thousand bases upstream of the same.

In one embodiment long read sequencing is performed and as the targeted region is small in each sample, a number of samples can be multiplexed (barcoded) on a single flowcell.

In one embodiment bioinformatic classification analysis is performed substantially as previously described.

Typically the total k-mer abundance is derived in this alternative embodiment from the short-read data instead of from a long read de novo assembly. Further typically short and long reads are mapped to the assembled data to allow for normalisation of the yield from the different sequencing experiments.

Specific embodiments of the invention are now described with reference to the following figures, wherein:

Figure 1 shows a schematic of sample preparation and bioinformatics pre-processing according to one embodiment of the invention;

Figure 2 shows the identification of the 16S conserved region and upstream regions with taxonomic classification in accordance with one embodiment of the invention;

Figure 3 shows a schematic of the iterative primer identification in the upstream regions in accordance with one embodiment of the invention;

Figure 4 shows the proximal site identifier for a single 16S with the relative coverage of the k-mers in the assembly;

Figure 5 illustrates the selection of a bacteria to target and growth of the isolates in accordance with one embodiment of the invention;

Figure 6 illustrates the location of PSI primers within a chromosome with multiple ribosomal operons;

Figure 7 is a graph of the distribution of the number of operons found in 1588 representative genomes from RefSeq, which is a good representation of species diversity across the spectrum of cultured bacteria;

Figure 8 is a graph of the correlation of the qPCR results to the depth of read coverage of the corresponding 16S sequence in accordance with one embodiment of the invention;

Figure 9 shows a phylogenetic tree of the 18 B. stercoris plus the reference genome;

Figure 10 shows the phylogenetic tree of the top cluster; and

Figure 11 shows the phylogenetic tree of the bottom cluster.

Most species of bacteria around us (in the gut, in the soil, in the environment etc that constitute the microbiome) are uncharacterised and have never been grown or studied in the laboratory, yet, as part of the overall microbial ecology of any environment, it is likely that all microorganisms make some sort of contribution(s) to the survival and growth of the surrounding microbial consortia of which they form a part. Importantly, these (as yet uncharacterised) organisms may harbour genes encoding biochemical pathways that will have utility as commercial products. Broadly, this new area of science considers that novel bacteria harbouring relevant biochemical pathways may be used therapeutically in three broad ways:

(i) ‘Bugs as Drugs’, where the microorganism is delivered as a drug;

(ii) ‘Metabolites as Drugs’, where biochemical entities derived from the novel organisms may have therapeutic/commercial potential; or

(iii) ‘Bugs as Targets’, where the novel microorganism is selectively targeted by a specific drug that modulates its behaviour so as to generate a therapeutic/commercial output.

Characterising those novel organisms in any selected microbiome that have been demonstrated to have a causal relationship with the clinical condition promises to power the next generation of drug discovery. The importance of this is evidenced by the recent findings that the Parkinson’s Disease drug L-dopa (Rekdal et al. 2019) works through the microbiome, with (S)-α-fluoromethyltyrosine preventing L-dopa decarboxylation by gut commensals, and that modulating the gut microbiome may impact responses to numerous forms of cancer therapy (Gopalakrishnan et al. 2018).

The main problem with this exciting new area of drug discovery is that the methods required to identify, grow and characterise the microbial organisms that are responsible for the observed phenotype are currently expensive, time consuming and not amenable to high-throughput assays.

In the genomic era, research has focused on high-level population diversity using conserved markers that are found across all bacteria. Typically, the genes encoding the 16S ribosomal RNA (rRNA) are targeted by conserved polymerase chain reaction (PCR) primers and the amplified product is resolved by DNA sequencing. Unfortunately, targeting short variable regions within the 16S rRNA gene (V1-V9) is not very sensitive (Johnson et al. 2019), and only gives Family or Genus level resolution, which is not good enough to deliver the selectivity required for strain-level identification. Despite these limitations, 16S amplicon sequencing is widely used for population diversity experiments but can only produce high-level associations. Bacteria usually contain multiple copies of 16S, with the variation between copies in a single genome often exceeding the variation between species. This means that when using short-read sequencing to interrogate 16S, a consensus is often observed rather than the true sequence of each 16S copy, due to the repetitive nature of the region. With short-read sequencing, due to technological limitations, short hypervariable regions (V1-V9) are used rather than the whole 16S. This complicates the use of the data further, introducing more error. Based on a 16S sequence for a bacterium of interest, it is not feasible to define PCR primers with sufficient selectivity to identify the bacterium at the strain level for subsequent isolation.

An alternative approach to 16S sequencing is to use short-read shotgun metagenomics. While this approach is cheap, it leads to highly fragmented and often erroneous genome assemblies (metagenomically assembled genomes; MAGs). A more recent approach is to use long-read shotgun metagenomics; however, this approach is very expensive and has a high base-level error rate, so it is unlikely to be broadly applicable to general microbiome research. Metagenome assemblies based on short reads are well known to be erroneous. Short fragments of the genomes are assembled, then binned together to form MAGs using techniques such as relative coverage and GC% content; however, there can often be tens of thousands of short unconnected fragments forming a single genome, with fragments often only containing a single gene (or partial gene). It is accepted that error rates of 10% or more are likely and researchers treat these genomes as very low quality, placing them in separate databases to avoid polluting higher quality genome assemblies derived from cultured isolate assemblies. The 16S region cannot be assembled using short-read metagenomic sequencing data due to the long repeats, and so this information is missing from the MAGs. One way to specifically identify candidate microorganisms is to generate ‘unique’ PCR primers corresponding to a novel target genome. In practice, this could be achieved from genomic information derived from either short or long-read metagenomic assemblies. However, for this approach to be successful, the novel region(s) must be rigorously selected, and must be able to differentiate between multiple closely-related species within the same sample (a common issue). This protocol is the route that is taken for diagnosing pathogens, where specific PCR primers are created to differentiate between different pathogens.

Currently if strain level isolates are to be obtained, a brute force culturomics approach is taken where vast numbers of bacteria can be cultured (often in a variety of growth media) in the expectation that something novel and interesting (with respect to pharmaceutical or commercial interest) will be identified. However, culturomics-based approaches require significant resources and offer exponentially diminishing returns and exponentially rising costs.

The challenge is to develop a method that is as cost effective as 16S sequencing and gives as much resolution as long-read metagenomics without the associated cost, and at the same time can permit the isolation of live bacterial strains. Strain-level identification is key, as the phenotype of bacteria can vary greatly between species and subspecies. For example, Salmonella serovars Typhi/Paratyphi/Enteritidis/Typhimurium are highly invasive, killing hundreds of thousands of people per year (Pieters et al. 2018), yet other serovars, such as Salmonella Weltevreden (Makendi et al. 2016), which are common and globally distributed, cause hardly any invasive disease. We have developed the Proximal Site Identification (PSI/T) method, which allows interrogation of a metagenomic sample and permits the identification of live bacterial isolates of choice. We describe a method that utilises single long reads that span universally conserved regions in bacteria and are adjacent to nearby non-conserved regions (the so-called proximal sites). These unique proximal sites (PSI sites) are used for the identification of PCR-primer binding sites that can be used to identify specific microbial strains. In effect, combinations of these PSI-sites allow for self-barcoding of bacteria to give strain level discrimination.

Strain level selection is important because closely related bacteria can have quite different phenotypes, for example, some strains of Escherichia coli are harmless commensals in our gut whilst others are deadly pathogens. 16S cannot distinguish between the two. The method we describe here allows for only bacteria of interest to be targeted. This allows us to move from high level associations to causality. Thus the PSI method can be used for pathogen diagnostics.

Our method uses long read assemblies, which are assembled into large fragments (or even whole chromosomes). Additionally, we use the conserved 16S regions as a quality control measure to ensure we are actually identifying what we expect. We look for primers in the complex non-conserved regions upstream of the conserved 16S region, ensuring that these regions are supported by individual long reads spanning both the conserved and non-conserved regions.

A variant of the method utilises the CRISPR cas9 system to cut at the end of the 16S region with extraction, allowing us to target and purify specific ribosomal areas of the microbial chromosome. Once these sequences are purified away from the rest of the microbial DNA (using methods well known in the art) they may be subjected to DNA sequence analysis to determine the identity of novel PSI-sites. Specifically identifying both 16S and PSI-sites in this way will generate considerable savings and will likely reduce sequencing costs by 98%.

Method

1) A biological metagenomic sample (such as human faeces, soil etc.), is selected and the DNA extracted using both bead-beating and non-bead beating lysis methods. Using both approaches allows the recovery of long DNA fragments from cells that are easy to lyse as well as cells (and spores) that are more difficult to break open.

2) The fragments are size selected to remove very short fragments.

3) A sequencing library is then prepared with the DNA and it is sequenced to a high depth using a long-read sequencing instrument such as a PromethION from Oxford Nanopore Technologies or a Sequel II from PacBio. This is shown in Figure 1. This gives single reads which can span a typing gene (16S) and the surrounding (proximal site) flanking DNA.

4) The unique flanking-site DNA is used to design species-specific PCR primers.

5) All of the sequence read data is automatically quality checked, with low quality reads removed, chimeras split and adapters cut with de novo assembly using Flye.

6) The frequency of all k-mers in the assembly is calculated to work out the relative abundance.

7) Each rRNA operon is identified using barrnap. The 16S sequence and an upstream region of up to the mean fragment length is identified as shown in Figure 2. k-mers are identified in the upstream region for each rRNA operon, with their global abundance used to identify subregions which are suitable for primer creation. In some genomes rRNA operons are close together and there may not be enough non-operon nucleotides upstream to allow the design of unique PCR primers. In other cases the assembler has broken the contig beside an rRNA operon, disrupting the upstream region. This may be because most bacterial chromosomes are circular, and the break has been introduced as a consequence of an ambiguity in the underlying assembly graph.

8) In an iterative process, regions of the upstream sequence are searched for high quality PCR primer sites beginning with the lowest global abundance and progressively increasing as shown in Figure 3.

9) When a single high-quality primer is identified (by primer3) the search of this operon ends as shown in Figure 4. The primers are tuned to work with qPCR using standard methods.

10) An in-silico PCR is performed with the primers against the whole assembly to count the number of off target hits, with the result recorded. The number of individual reads which fully overlap the whole 16S region and the PCR primers is recorded. This gives a researcher confidence that what they are seeing is not simply an algorithmic error and there is underlying genomic DNA linking the two.

11) The 16S region is taxonomically classified using Kraken2 and the SILVA database.

12) The upstream region is separately classified using Kraken2 and the GTDB database. These databases are mostly mutually exclusive. The 16S will give at most Genus level taxonomic information and the upstream region will give down to species/strain level resolution, at the expense of greater errors due to the reliance on short-read MAGs.

13) This gives a spreadsheet of primers for each rRNA operon and the likely species they came from. Where a study involves the comparison of microbiome data from multiple sources (such as cases and controls or disease vs healthy), established statistical methods can be employed to highlight interesting species, and the PCR primers are available to identify them from the other cells in the microbial community.

14) PCR primers are created using the sequences selected from the spreadsheet.

15) A 10-fold serial dilution is performed with an aliquot of the original sample, to progressively reduce the number of cells as shown in Figure 5.

16) Each dilution is subject to qPCR using PCR primers designed to target the bacteria of choice. The dilution series is scored as either positive or negative and when the dilution point returns a negative result, the dilution just before the negative result is selected for further analysis.

17) The cells from the ‘last’ positive PCR score are cultured using both anaerobic and aerobic conditions on a variety of media.

18) The colonies are picked and plated onto new media.

19) These colonies are then subjected to qPCR using primers specific for the cognate ‘proximal’ sites. This process allows for the identification of the colonies which contain the target bacteria.

20) If there are multiple closely related bacteria, the multiple primers can be used in combination to distinguish between strains, in effect self-barcoding the strains. You now have the live target bacteria and can perform work on it.

21) We subsequently sequence the bacterial isolate and check with qPCR.
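The dilution choice in step 16 can be sketched in a few lines of Python. This is only an illustration: the function name `select_dilution` and the example scores are hypothetical, and the fallback used when every dilution is positive is an assumption, since the method does not state what happens in that case.

```python
from typing import Optional, Sequence


def select_dilution(results: Sequence[bool]) -> Optional[int]:
    """Return the index of the last positive dilution before the first negative.

    results[i] is True when dilution step i (ordered from least to most
    dilute) gave a positive qPCR result. Returns None when the first step
    is already negative or when no results are supplied.
    """
    for i, positive in enumerate(results):
        if not positive:
            return i - 1 if i > 0 else None
    # Assumption: if every dilution is positive, use the most dilute step.
    return len(results) - 1 if results else None


# Hypothetical qPCR scores for dilutions 10^-1 to 10^-6.
scores = [True, True, True, True, False, False]
chosen = select_dilution(scores)  # index 3, i.e. the 10^-4 dilution
```

The returned index identifies the dilution that is carried forward to culturing in step 17.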

Variant of method using CRISPR CAS9

In this method we use 16S as an example; however, any conserved region can be used (such as IS elements, AMR genes etc.).

1) Prepare the DNA as previously described.

2) Take a subsample and fragment to 500 bases. Prepare short read sequencing libraries according to manufacturer’s instructions.

3) Perform short read shotgun metagenomics using an Illumina sequencing instrument. This short-read data is used to calculate the relative abundances of the κ-mers in the sample.

4) Take another subsample of DNA and dephosphorylate.

5) Use a CRISPR cas9 system to cut at the 3’ end of a conserved region (Klindworth et al. 2013), such as using the universal primer GGGTYKCGCTCGTTR targeting the end of the 16S region at position 1100-1114. This method has been shown to work on other genes (Gilpatrick et al. 2020; Quan et al. 2019) and allows for the ligation of sequencing adaptors just to this targeted region.

6) Prepare long read sequencing libraries in the standard fashion.

7) Thus we have the targeted conserved region and the upstream complex sequence and nothing else. In this case it is the near full- length 16S and a few thousand bases upstream.

8) Perform long read sequencing using the MinION instrument (Oxford Nanopore Technologies). As the targeted region is small in each sample, 48 samples can be multiplexed (barcoded) on a single flowcell.

9) Perform bioinformatics analysis as previously described with some variation.

10) The total κ-mer abundance is derived from the short-read data instead of from a long read de novo assembly.

11) Short and long reads are mapped to the assembled data to allow for normalisation of the yield from the different sequencing experiments.

12) All following steps are as previously described.

Detailed example

An anonymous healthy adult volunteer donated a fresh faecal sample, with an aliquot retained for sequencing and the remainder deposited in the Norwich Research BioRepository as required by the Ethics committee.

DNA extractions

Beginning with a metagenomic sample (human stool in this case), DNA was extracted using the Promega GMO Pure food kit including the RNase step, with and without bead beating. Bead beating was performed on a FastPrep instrument using Lysing Matrix E tubes for 30 seconds at a speed of 6 m/s.

Quantification of extracted DNA was performed using a Qubit.

Shearing and size selection of fragments of DNA was performed using Covaris gTubes to approximately 10 Kbases.

DNA clean-up

DNA was washed to remove small fragments at a 0.4 ratio bead wash, resulting in a minimum fragment size of approximately 1,000 bases.

Library preparation for PromethION sequencing

The DNA was prepared for sequencing by creating multiplexed native barcoded libraries (SQK-LSK109), one each for bead beating, non-bead beaten and a negative control. During the library preparation a long fragment buffer was used during final stage clean-up to remove most fragments below 3,000 bases.

Table 1: The final QC results for DNA concentration and fragment size.

Sequencing was performed on the PromethION long read sequencer (Oxford Nanopore Technologies) using a single flowcell according to manufacturer’s instructions for 3 days.

Basecalling

The raw signal data (squiggles) were converted to nucleotide sequences using Guppy v.3.4.3 (Oxford Nanopore Technologies) in high quality mode, on a private OpenStack cloud with two Ubuntu v18.04 virtual machines running Nvidia T4 GPUs. This process took 19 hours and produced a single FASTQ file. The sequencing data was barcoded (multiplexed) into bead beaten, non-bead beaten and a control, and were demultiplexed using qcat v1.1.0 (Oxford Nanopore Technologies, https://github.com/nanoporetech/qcat) to produce 3 FASTQ files.

Table 2: Overview of the raw sequencing experiment results.

Read filtering

The bead beaten and non-bead beaten reads were merged into a single FASTQ file using cat.

The sequencing reads were filtered using Filtlong v0.2.0 (https://github.com/rrwick/Filtlong) to remove reads less than 2000 bases in length, which are generally of lower quality, with a FASTQ file outputted containing 11,213,742 reads.
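The effect of the length filter can be illustrated with a short Python sketch. It is not Filtlong itself and does not reproduce Filtlong's options; `parse_fastq` and `filter_by_length` are hypothetical helper names, and the input is assumed to be a simple four-lines-per-record FASTQ stream.

```python
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_by_length(records: Iterable[FastqRecord],
                     min_length: int = 2000) -> List[FastqRecord]:
    """Keep only reads of at least min_length bases."""
    return [rec for rec in records if len(rec[1]) >= min_length]


# Toy input: one read of 2,500 bases and one of 100 bases.
demo = ["@read1", "A" * 2500, "+", "I" * 2500,
        "@read2", "A" * 100, "+", "I" * 100]
kept = filter_by_length(parse_fastq(demo))
```

Only `read1` survives the 2,000 base cut-off in this toy input.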

Porechop v0.2.4 (https://github.com/rrwick/Porechop) was used to remove adapters in reads. If the adaptors were at the ends, they were trimmed. 29,981 reads contained adaptors in the middle of the read and were regarded as chimeric and split into 2 separate reads. Read qualities were removed and the reads were outputted as a single FASTA file.

Table 3: Read data after different filtering steps were applied.

De novo assembly

The pre-processed reads were provided as input to the Flye assembler v2.5 (Kolmogorov et al. 2019) with the metagenome and plasmid flags set and an estimated genome size of 600 Mbases. It was run on a HPC cluster using 32 processors, 0.5 Tbytes of RAM and took 14.5 hours. This produced a de novo assembly of contigs in FASTA format.

Table 4: Details of the de novo assembly which can be used to assess quality.

Mapping

The pre-processed reads were aligned to the de novo assembly using Minimap2 v2.12 (Li 2018) with the ‘map-ont’ setting and converted to a sorted and indexed CRAM file using Samtools v1.9 (Li et al. 2009).

Table 5: The number of reads mapping back to the de novo assembly giving an indication of how representative the assembly is of the underlying data.

Primer identification

Software written in Python3 implements the core proximal site identification method. The software was provided with the assembly, the reads aligned to the assembly and 2 taxonomic databases for classification and run with default parameters (upstream size of 2,000 bases, primer size of 20 bases).

The total k-mer abundance of the assembly was calculated using KMC v3.0 (Kokot, Dlugosz, and Deorowicz 2017). To minimise memory usage, single-copy k-mers were not stored and were instead treated as implicitly present.
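As a rough illustration of the abundance table, the following Python sketch counts k-mers and then drops single-copy entries so that any k-mer absent from the table is read back as a count of 1. It is not KMC and makes no claim about KMC's internals; counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption made here, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(sequences: Iterable[str], k: int) -> Counter:
    """Count canonical k-mers, skipping windows with ambiguous bases."""
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if _BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


def drop_singletons(counts: Dict[str, int]) -> Dict[str, int]:
    """Store only k-mers seen more than once to save memory."""
    return {kmer: n for kmer, n in counts.items() if n > 1}


def abundance(table: Dict[str, int], kmer: str) -> int:
    """Look up a k-mer; anything not stored is implicitly single copy."""
    return table.get(canonical(kmer), 1)


table = drop_singletons(count_kmers(["ACGTACGT"], k=4))
```

In the toy sequence the k-mer GTAC occurs once, so it is not stored, yet `abundance(table, "GTAC")` still returns 1.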

Each rRNA operon was identified with Barrnap v0.9 (https://github.com/tseemann/barrnap), with the coordinates and direction of the 16S/23S/5S regions output in GFF3 format. The assembly had 537 identifiable ribosomal RNA operons. The region of up to 2,000 nucleotides (adjustable) upstream of the conserved 16S region was extracted from the assembly for each operon. A check was performed to ensure that this did not overlap with another rRNA operon, as this has been observed in some organisms such as Staphylococcus aureus.
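The coordinate arithmetic behind the upstream extraction can be sketched as follows. This is a minimal illustration assuming 1-based inclusive GFF3 coordinates; the function names are hypothetical, and how an overlapping operon is handled afterwards (here the region is simply reported as not clear) is an assumption rather than the behaviour of the actual software.

```python
from typing import Iterable, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive, as in GFF3


def upstream_interval(start: int, end: int, strand: str,
                      contig_length: int, size: int = 2000) -> Optional[Interval]:
    """Return up to `size` bases upstream of a 16S feature, or None if none exist.

    On the '+' strand upstream lies before `start`; on the '-' strand it
    lies after `end`. The interval is clipped at the contig boundaries.
    """
    if strand == "+":
        lo, hi = max(1, start - size), start - 1
    else:
        lo, hi = end + 1, min(contig_length, end + size)
    return (lo, hi) if lo <= hi else None


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def upstream_is_clear(region: Interval, other_operons: Iterable[Interval]) -> bool:
    """True when the upstream region does not touch any other rRNA operon."""
    return not any(overlaps(region, operon) for operon in other_operons)


region = upstream_interval(5000, 6500, "+", contig_length=100000)
clear = upstream_is_clear(region, [(10000, 15000)])
```

For a feature on the '-' strand the extracted sequence would additionally need to be reverse complemented before primer design.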

The k-mers were extracted for each upstream region, with the total k-mer abundance of the assembly substituted for each k-mer. Where a k-mer was not found it was implicitly assigned a frequency of 1. Starting at a threshold of 1, any k-mers in the upstream region with an abundance above the threshold were masked out with N’s.
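The masking step can be illustrated with a small Python sketch. The function name `mask_upstream` is hypothetical, and for brevity the abundance table is keyed by the k-mer exactly as it appears in the region (no reverse-complement handling), which is a simplifying assumption.

```python
from typing import Dict


def mask_upstream(region: str, abundance: Dict[str, int],
                  k: int, threshold: int) -> str:
    """Replace every k-mer whose abundance exceeds `threshold` with N's.

    k-mers missing from `abundance` are treated as having a frequency of 1.
    """
    region = region.upper()
    masked = list(region)
    for i in range(len(region) - k + 1):
        if abundance.get(region[i:i + k], 1) > threshold:
            masked[i:i + k] = "N" * k
    return "".join(masked)


# Toy example: only CCCC is repeated elsewhere in the assembly.
example = mask_upstream("AAAACCCCGGGG", {"CCCC": 5}, k=4, threshold=1)
```

With a threshold of 1 the repeated CCCC block is masked and the single-copy flanks remain available for primer design.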

PCR primers with a size of 20 bases were generated using Primer3 v2.3.7 (Untergasser et al. 2012) from the masked upstream region with a target product size of 100 bases (+-). If no primers were identified, the k-mer abundance threshold was increased by 1, and the process of masking and identifying PCR primers was repeated. This was repeated until primers were found or no masked regions remained in the upstream region.
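The iterative search can be sketched as a loop around a primer finder. In this illustration the finder is a hypothetical stand-in (`first_clean_window`, which simply returns the first N-free window), not Primer3, and the stopping rule is one reading of the text: the loop ends when a primer is found or when raising the threshold no longer leaves anything masked.

```python
from typing import Callable, Dict, Optional, Tuple


def mask(region: str, abundance: Dict[str, int], k: int, threshold: int) -> str:
    """Mask k-mers with abundance above `threshold`; absent k-mers count as 1."""
    out = list(region)
    for i in range(len(region) - k + 1):
        if abundance.get(region[i:i + k], 1) > threshold:
            out[i:i + k] = "N" * k
    return "".join(out)


def first_clean_window(masked: str, size: int = 8) -> Optional[str]:
    """Hypothetical stand-in for a primer designer: first window without N."""
    for i in range(len(masked) - size + 1):
        window = masked[i:i + size]
        if "N" not in window:
            return window
    return None


def search_primers(region: str, abundance: Dict[str, int], k: int,
                   find_primers: Callable[[str], Optional[str]]
                   ) -> Optional[Tuple[int, str]]:
    """Raise the abundance threshold until a primer is found or masking ends."""
    region = region.upper()
    threshold = 1
    while True:
        masked = mask(region, abundance, k, threshold)
        hit = find_primers(masked)
        if hit is not None:
            return threshold, hit
        if masked == region:
            return None  # nothing left to unmask, so give up on this operon
        threshold += 1


toy_abundance = {"AAAA": 3, "CCCC": 2, "GGGG": 2}
result = search_primers("AAAACCCCGGGG", toy_abundance, k=4,
                        find_primers=first_clean_window)
```

In the toy case nothing usable remains at a threshold of 1, and the first acceptable window appears once the threshold reaches 2.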

In-silico PCR was performed using primersearch which is part of the EMBOSS suite v6.5.7.0 (Rice, Longden, and Bleasby 2000) to calculate the number of contigs hit by each primer indicating the potential for off target hits. This bioinformatics process removes false hits.
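To illustrate what counting the contigs hit by a primer pair means, the sketch below performs a naive exact-match in-silico PCR in Python. It is not primersearch and does not model mismatches; the function names, the `max_product` limit and the toy sequences are all hypothetical.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(text: str, pattern: str) -> List[int]:
    """Start positions of every (possibly overlapping) exact match."""
    positions, i = [], text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions


def amplifies(template: str, forward: str, reverse: str, max_product: int) -> bool:
    """True if the primer pair would give a product on this strand."""
    rc_reverse = revcomp(reverse)
    for f in find_all(template, forward):
        for r in find_all(template, rc_reverse):
            product = r + len(rc_reverse) - f
            if len(forward) + len(reverse) <= product <= max_product:
                return True
    return False


def contigs_hit(contigs: Dict[str, str], forward: str, reverse: str,
                max_product: int = 1000) -> List[str]:
    """Names of contigs where the pair amplifies on either strand."""
    hits = []
    for name, seq in contigs.items():
        seq = seq.upper()
        if (amplifies(seq, forward, reverse, max_product)
                or amplifies(revcomp(seq), forward, reverse, max_product)):
            hits.append(name)
    return hits


target = "TTT" + "ACGTTGCA" + "C" * 10 + "GGATCCAA" + "TTT"
toy_contigs = {"ctg1": target, "ctg2": "A" * 40, "ctg3": revcomp(target)}
hits = contigs_hit(toy_contigs, forward="ACGTTGCA", reverse=revcomp("GGATCCAA"))
```

A primer pair that hits exactly one contig is the ideal case; more than one hit indicates potential off-target amplification.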

Table 6: Number of contigs hit by each primer set.

Table 7: Summary of 16S regions with no primers.

The primers missing due to extreme GC all originated from a single strain on a single contig and 71% had alternative primers available from the other operons in the same species, allowing for their identification.

The proximal site identification method can make use of the fact that most bacteria have multiple copies of 16S. The use of 16S in its own right can only get resolution down to the Family/Genus level. As a consequence, identifying and generating PCR primers that can discriminate different variants of 16S is error prone as the 16S sequences are too similar. However, by using genomic information that is upstream of the 16S (into the highly variable complex ‘proximal’ sequences), unique PCR primers can be generated. There are multiple 16S sites in any given bacterium and if PCR primers upstream of one 16S element cannot be identified, there is usually sufficient redundancy in the approach to allow a backup analysis using a different copy of 16S resident in the target strain. At its core, the method relies on DNA sequence information that spans the full 16S up to the PSI-sites (spanning the unique PCR primers), ensuring that the primers are related to the 16S. Databases can also be interrogated to aid in the categorisation of the upstream regions independently (at the species level).

Self-barcoding

We have developed a universal method for self-barcoding bacterial genomes to allow for strain level resolution. To achieve this we use multiple conserved regions within a genome as anchors and extend upstream into non-conserved regions (which we term proximal sites) where we can create unique PCR primers. As bacteria usually have multiple conserved regions (such as 16S) we have multiple primers, as shown in Figure 6. This allows us to differentiate bacteria down to strain level, by using combinations of primers. The approach is universal and agnostic with respect to the bacteria being targeted. We do not require knowledge of the bacteria in advance or the number of conserved regions (such as 16S). If there is a single sub-species of interest (such as E. coli) in the metagenomic sample, then all the primers will lead to the same organism. If there are multiple closely related sub-species from the same species, then some will share primers, and others won’t. This self-barcoding allows us to differentiate between organisms that are closely related. When performing bioinformatics analysis for Socru (Page and Langridge 2019) we observed in 10 common pathogens that core conserved genes rarely fell near the rRNA operons (searching out to 50 Kbases from the operons). This provides enough uniqueness in the upstream primer regions to allow for the classification down to the subspecies level. 12 randomly selected E. coli reference genomes of different subspecies (multi-locus sequence types) were selected and a mock metagenome assembly created as input to the bioinformatics method. From the self-barcoding, we were able to distinguish between each of the sequence types.

Strain identification example

Given an unknown metagenomic sample, a single genus of bacteria was selected for further study. No information about the bacteria was known in advance (such as how many species or subspecies occurred, how many rRNA operons were harboured in the target genome, or whether the number of rRNA operons varies). All the primers for the Genus (as described in Table 8) were selected, and dilutions and growth of the bacteria were performed as described above.

Table 8: All primers for a given genus and the number of contigs hit by each primer.

From the qPCR results a range of ‘hits’ is returned from different colonies on the plates as shown in Table 9. From the pattern of hits to different primers we derive a unique barcode which can differentiate between different colonies of different strains. In this case there are 3 different strains, each with a primer unique to a single colony which can be used for identification.
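Deriving a barcode from the pattern of hits can be sketched as grouping colonies by their presence/absence profile across the primers. The colony and primer names below are hypothetical and the data are invented purely for illustration; they are not the results in Table 9.

```python
from collections import defaultdict
from typing import Dict, List


def barcode_colonies(hits: Dict[str, Dict[str, bool]],
                     primers: List[str]) -> Dict[str, List[str]]:
    """Group colonies by their pattern of positive (1) and negative (0) primers."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for colony, result in hits.items():
        code = "".join("1" if result.get(p, False) else "0" for p in primers)
        groups[code].append(colony)
    return dict(groups)


# Invented qPCR calls for three colonies and two primers.
toy_hits = {
    "colony_1": {"primer_A": True, "primer_B": False},
    "colony_2": {"primer_A": True, "primer_B": False},
    "colony_3": {"primer_A": False, "primer_B": True},
}
barcodes = barcode_colonies(toy_hits, ["primer_A", "primer_B"])
```

Colonies sharing a barcode are candidates for being the same strain, while distinct barcodes separate strains.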

Table 9: qPCR results for different colonies when cultured.

Example: A more complex case could exist where there are no unique PCR primers for a subspecies. Table 10 shows the results from PSI for the primers. At first glance this looks like there are 2 subspecies with 3 rRNA operons.

Table 10: All primers for a given genus and the number of contigs hit by each primer, where there is no single contig hit primer.

However, after growing the organisms and performing PCR, the patterns show that we have 3 subspecies, each with 2 rRNA operons, rather than 2 subspecies. This allows us to identify and isolate each of the subspecies and underlying diversity instead of just assuming they are all just the same thing (potentially missing an important organism).

Table 11: PCR results for different colonies when cultured.

Taxonomic classification

Taxonomic classification was performed using Kraken2 v2.0.7 (Wood, Lu, and Langmead 2019). The classification of the 16S regions was performed with the Silva 16S database (version 18/4/2019) (Quast et al. 2013) and presented to the genus taxonomic level. The classification of the upstream regions was performed using the GTDB database (version R89_54k) (Parks et al. 2019) to the species taxonomic level.

Table 12: The maximum taxonomic categories identified from each region.

Individual long reads spanning PSI-site and 16S regions

For each primer and 16S sequence, individual reads which fully contained the full length of the primers and the full length of the 16S sequence were identified from the input CRAM file using Samtools v1.9 (Li et al. 2009). The mean number of reads overlapping the full length of the region was half the depth of coverage, as expected based on the fragment length and length of upstream region.
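The spanning-read count reduces to an interval containment test. The sketch below works on plain (start, end) alignment coordinates rather than a CRAM file, so it does not call Samtools; the function names and coordinates are hypothetical, and inclusive coordinates in a single shared system are assumed.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive reference coordinates


def target_region(primer: Interval, rrna_16s: Interval) -> Interval:
    """Smallest interval covering both the primer site and the 16S gene."""
    return min(primer[0], rrna_16s[0]), max(primer[1], rrna_16s[1])


def count_spanning_reads(alignments: Iterable[Interval], region: Interval) -> int:
    """Count alignments that fully contain the region."""
    start, end = region
    return sum(1 for a_start, a_end in alignments
               if a_start <= start and a_end >= end)


region = target_region(primer=(1000, 1019), rrna_16s=(3000, 4500))
reads = [(500, 5000), (1200, 6000), (900, 4500), (1000, 4499)]
spanning = count_spanning_reads(reads, region)
```

Two of the four toy alignments cover the primer site and the whole 16S gene in a single read.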

Table 13: Covering reads, where a single read spans both the upstream primers and the full length 16S conserved region.

Primers

Supplementary Table 1 lists the primers for each 16S region. From the results 27 primers were selected, representing a cross section of abundance in the underlying sample, species, genus, and public information. 51% (n=14) were classified as ‘uncultured’ at the genus level based on the 16S data; the genus has not even been assigned a name, and thus information on this branch of life is limited. We selected 6 (22%) species as they have only been observed in short read shotgun assemblies and never been cultured, where we define uncultured as there being no corresponding type strain in a recognised tissue collection (search performed at bacdive.dsmz.de). As a control, 7 (25%) species were selected as they have previously been deposited in collections of tissue cultures, so we can purchase type strains and culturing information for comparison during isolation experiments.

These primers were ordered from a commercial supplier.

Identification and culturing

qPCR of the target organisms from the sequencing data was performed.

This was to validate the primers worked as expected.

Table 14: Primers of selected organisms and their effectiveness with qPCR.

Two organisms were selected (highlighted in yellow in Table 14), a control (scaffold_2458) and an unknown (contig_348) where no representative of the Family/ Genus/Species had ever been cultured and described.

The control was Dorea longicatena (0.62% abundance), which is known to be culturable and for which a type strain is available, and the unknown was UBA11524 sp000437595 (0.54% abundance), which has never been cultured and for which no type strain is available. The uncultured organism is part of the Christensenellales order, and no member of the family (CAG-74) has been cultured, with the family, genus and species all being given computer generated ID numbers instead of proper names as they have yet to be studied. The never cultured bacterial assembly produced a single circularised chromosome of 3.5 Mbases, containing 5 copies of 16S.

Serial dilution of stool

100 mg of stool was taken from the original sample and homogenised in 900 µl PBS and diluted down from 10^-1 to 10^-10 in increments of 10-fold dilutions. These dilutions contain live cells.

DNA was extracted as previously described. qPCR was performed for each dilution using 2 µl for the 2 target organisms. A 16S rRNA qPCR assay was used for the relative quantification of bacteria in the samples. The higher the Cq value, the less starting template was present.

Table 15: The qPCR results for each sample in each dilution. Red indicates the final dilution where cells from the target organisms are present.

In order to determine which dilution factor to culture the targeted organisms from, qPCR was used to detect the relative abundance of those targets in the dilutions. 16S was used alongside to determine the overall abundance of bacterial cells. From this example, dilution 10^-4 would be chosen for culture based on the Cq values generated, which suggest there are 1-10 cells of the target bacteria in the qPCR reaction.

The chosen dilution (10^-4) was plated onto multiple media plates and grown at 37 °C in an anaerobic cabinet (as the input material was sourced from an anaerobic environment). The colonies were picked onto a fresh media plate, followed by colony boil-prep qPCR, which identifies the colonies containing the target organisms. The newly inoculated plate was reincubated under the same environmental conditions. This can be done in parallel for multiple organisms at once.

Once the target organism was identified, we went back to the pure colony culture (live bacteria) and extracted DNA for qPCR and whole genome sequencing for verification and quality control.

The total time from start to finish is 24 days.

Targeted isolation and sequencing of a single species

Background:

As an exemplification of the method, we took a single species Bacteroides stercoris found as a commensal in the human gut to see if we could extract a community of live isolates from a single individual, capturing diversity in the community of that species. Extracting a community of similar strains of the same species allows for further experimental work to find a panel of potential live therapeutics.

Short read shotgun metagenomic sequencing and analysis:

A collection of human gut samples from a longitudinal study was sequenced using short read shotgun sequencing. The reads were metagenomically assembled (MAGs) and binned, and compared to the ATCC 43183 reference genome. Three donors had genomes within 98% identity to the reference genome, with each having at least 2 distinct strains falling into different bins, as shown in Table 16.

Table 16: Statistics for bins compared to the ATCC 43183 reference genome indicating high similarity, with identity calculated using FastANI following assembly with MEGAHIT.

The reads were classified using Kraken2 with the GTDB database (v202) and Donor 1 was found to have the highest percentage of reads classified as B. stercoris, at 7.07%. This is species level classification and is likely an underestimate of the total reads belonging to the species, as regions of similarity between species would only be classified at the Genus level or higher. The presence of bins classified as B. stercoris suggests a set of samples with a detectable amount of gDNA from the species of interest, as the ability to produce a reasonably complete bin relies on adequate sequence coverage. Cross-checking this finding with Kraken2 suggests that Donor 1 is a good source for experimentation.

PSI method:

Donor 1 provided another fresh sample. This sample underwent the PSI method of the present invention, which produced a list of 183 PCR primers. Five of these primers were for B. stercoris (see Table 17).

Table 17: Primer sequences for B. stercoris. The primer utilised for the following experiments is highlighted in yellow.

The sample was grown on a Bacteroides selective medium (Bile Esculin Agar to which 100 µg/ml gentamicin, 100 µg/ml fosfomycin, 25 µg/ml glucose-6-phosphate and 5% defibrinated horse blood were added [Tierney et al., 2016]) using the dilution and plating method. For each colony, two qPCRs (SYBR green method) were performed simultaneously using a primer pair for the amplification of the 16S rRNA V3-V4 region and a primer pair specific for B. stercoris (Table 17, highlighted in yellow). An example of the method is shown in Table 18. An estimated 17% of colonies were B. stercoris, suggesting a 2-fold enrichment by the selective medium. Twenty-one colonies which were positive using the B. stercoris PCR primers were picked and replated onto individual plates, then subjected to short read whole genome sequencing. The resulting genomic reads were classified with Kraken2 using GTDB (v202), which identified 18 (85.7%) as B. stercoris. This shows that the primers are highly specific. The other samples were identified as one Phocaeicola vulgatus (formerly Bacteroides vulgatus) and two Bacteroides uniformis, both of which are closely related to B. stercoris.

Table 18: Identification of B. stercoris species/strains based on colony qPCR. In this example, 94 colonies were screened for the identification of B. stercoris. B. stercoris ATCC 43183 was used as a positive control for each plate and is highlighted in green. Differences between crossing points for B. stercoris-specific primers (CPs) and bacterial V3-V4 primers (CPv) similar to the ones obtained for the positive control (close to zero) are highlighted in red (both CPs and CPv with a value of around 20). *Pp: primer pair; S: species-specific primers; V: V3-V4 16S rRNA primers.
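The colony screen compares the crossing point of the species-specific primers (CPs) with that of the V3-V4 primers (CPv) and looks for a difference similar to the positive control. A minimal Python sketch of that comparison is given below; the function name, the tolerance of 2 cycles and the example values are assumptions made for illustration and are not taken from Table 18.

```python
from typing import Optional


def is_target_colony(cp_species: Optional[float], cp_v3v4: Optional[float],
                     control_delta: float, tolerance: float = 2.0) -> bool:
    """Flag a colony whose CPs - CPv difference resembles the positive control.

    A crossing point of None means that no amplification was detected.
    The tolerance (in cycles) is a hypothetical cut-off.
    """
    if cp_species is None or cp_v3v4 is None:
        return False
    delta = cp_species - cp_v3v4
    return abs(delta - control_delta) <= tolerance


# Invented values: the control amplifies equally with both primer pairs.
control_delta = 0.0
likely_target = is_target_colony(21.0, 20.5, control_delta)
off_target = is_target_colony(35.0, 20.0, control_delta)
no_signal = is_target_colony(None, 20.0, control_delta)
```

A large positive difference means the species-specific product appears much later than the general bacterial product, which argues against the colony being the target species.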

Phylogenetic analysis

The 18 samples classified as B. stercoris were analysed using the bioinformatics Galaxy system. The only representative reference genome for B. stercoris is ATCC 43183 and this was downloaded in FASTA format from NCBI [https://www.ncbi.nlm.nih.gov/genome/1000?genome_assembly_id=170123]. The reads for each sample were individually mapped against the reference genome and had variants detected using Snippy (Seemann, 2015). Each result set was combined using snippy-core to produce a core set of genomic data. A pairwise comparison was undertaken between each sample using snp-dists (Seemann, 2019), which showed that there were 2 distinct clusters of strains, neither of which were close to the reference (see Table 19).
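The idea behind the pairwise comparison can be shown with a short Python sketch that counts differing positions between aligned sequences. It is written in the spirit of a SNP distance matrix and is not snp-dists itself; ignoring positions where either sequence is not A, C, G or T is an assumption made here, and the sequences are invented.

```python
from itertools import combinations
from typing import Dict

_BASES = frozenset("ACGT")


def snp_distance(a: str, b: str) -> int:
    """Number of aligned positions where two sequences carry different bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a.upper(), b.upper())
               if x != y and x in _BASES and y in _BASES)


def pairwise_matrix(alignment: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Symmetric matrix of SNP distances between every pair of samples."""
    names = list(alignment)
    matrix = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


toy_alignment = {
    "reference": "ACGTACGT",
    "isolate_1": "ACGTACGA",
    "isolate_2": "ACCTACGN",
}
distances = pairwise_matrix(toy_alignment)
```

Small distances within a group and larger distances to the reference are what produce the distinct clusters in a matrix of this kind.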

Table 19: A pairwise SNP comparison between each sample, where the number indicates the number of SNP differences. A heat map is applied to the numbers.

A phylogenetic tree was constructed for the 18 B. stercoris samples, plus the reference genome for context, using IQ-Tree (Nguyen, 2015; Minh, 2013) showing that there are 2 distinct clusters, each highly similar and each distinct from the reference genome (Figure 9).

Looking closer at the clusters (Figures 10 and 11), it is clear that there is diversity between the strains isolated within the individual. This shows that it is possible to use PSI to undertake targeted identification of a species and extract multiple different, diverse strains for further work.

References

Gilpatrick, Timothy, Isac Lee, James E. Graham, Etienne Raimondeau, Rebecca Bowen, Andrew Heron, Bradley Downs, Saraswati Sukumar, Fritz J. Sedlazeck, and Winston Timp. 2020. ‘Targeted Nanopore Sequencing with Cas9-Guided Adapter Ligation’. Nature Biotechnology, February, 1-6. https://doi.org/10.1038/s41587-020-0407-5.

Gopalakrishnan, Vancheswaran, Beth A. Helmink, Christine N. Spencer, Alexandre Reuben, and Jennifer A. Wargo. 2018. ‘The Influence of the Gut Microbiome on Cancer, Immunity, and Cancer Immunotherapy’. Cancer Cell 33 (4): 570-80. https://doi.org/10.1016/j.ccell.2018.03.015.

Johnson, Jethro S., Daniel J. Spakowicz, Bo-Young Hong, Lauren M. Petersen, Patrick Demkowicz, Lei Chen, Shana R. Leopold, et al. 2019. ‘Evaluation of 16S RRNA Gene Sequencing for Species and Strain-Level Microbiome Analysis’. Nature Communications 10 (1): 1-11. https://doi.org/10.1038/s41467-019-13036-1.

Klindworth, Anna, Elmar Pruesse, Timmy Schweer, Jorg Peplies, Christian Quast, Matthias Horn, and Frank Oliver Glockner. 2013. ‘Evaluation of General 16S Ribosomal RNA Gene PCR Primers for Classical and Next-Generation Sequencing-Based Diversity Studies’. Nucleic Acids Research 41 (1): e1. https://doi.org/10.1093/nar/gks808.

Kokot, Marek, Maciej Dlugosz, and Sebastian Deorowicz. 2017. ‘KMC 3: Counting and Manipulating k-Mer Statistics’. Bioinformatics 33 (17): 2759-61. https://doi.org/10.1093/bioinformatics/btx304.

Kolmogorov, Mikhail, Jeffrey Yuan, Yu Lin, and Pavel A. Pevzner. 2019. ‘Assembly of Long, Error-Prone Reads Using Repeat Graphs’. Nature Biotechnology 37 (5): 540-46. https://doi.org/10.1038/s41587-019-0072-8.

Li, Heng. 2018. ‘Minimap2: Pairwise Alignment for Nucleotide Sequences’. Bioinformatics 34 (18): 3094-3100. https://doi.org/10.1093/bioinformatics/bty191.

Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. 2009. ‘The Sequence Alignment/Map Format and SAMtools’. Bioinformatics 25 (16): 2078-2079.

Makendi, Carine, Andrew J Page, Brendan W Wren, Tu Le Thi Phuong, Simon Clare, Christine Hale, David Goulding, et al. 2016. ‘A Phylogenetic and Phenotypic Analysis of Salmonella Enterica Serovar Weltevreden, an Emerging Agent of Diarrheal Disease in Tropical Regions’. PLOS Neglected Tropical Diseases.

Parks, Donovan H., Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J. Mussig, and Philip Hugenholtz. 2019. ‘Selection of Representative Genomes for 24,706 Bacterial and Archaeal Species Clusters Provide a Complete Genome-Based Taxonomy’. BioRxiv, November, 771964. https://doi.org/10.1101/771964.

Pieters, Zoe, Neil J Saad, Marina Antillon, Virginia E Pitzer, and Joke Bilcke. 2018. ‘Case Fatality Rate of Enteric Fever in Endemic Countries: A Systematic Review and Meta-Analysis’. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America 67 (4): 628-38. https://doi.org/10.1093/cid/ciy190.

Quan, Jenai, Charles Langelier, Alison Kuchta, Joshua Batson, Noam Teyssier, Amy Lyden, Saharai Caldera, et al. 2019. ‘FLASH: A Next-Generation CRISPR Diagnostic for Multiplexed Detection of Antimicrobial Resistance Sequences’. Nucleic Acids Research 47 (14): e83-e83. https://doi.org/10.1093/nar/gkz418.

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2013. ‘The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools’. Nucleic Acids Research 41 (D1): D590-96. https://doi.org/10.1093/nar/gks1219.

Rekdal, Vayu Maini, Elizabeth N. Bess, Jordan E. Bisanz, Peter J. Turnbaugh, and Emily P. Balskus. 2019. ‘Discovery and Inhibition of an Interspecies Gut Bacterial Pathway for Levodopa Metabolism’. Science 364 (6445). https://doi.org/10.1126/science.aau6323.

Rice, Peter, Ian Longden, and Alan Bleasby. 2000. ‘EMBOSS: The European Molecular Biology Open Software Suite’. Trends in Genetics 16 (6): 276-77. https://doi.org/10.1016/S0168-9525(00)02024-2.

Untergasser, Andreas, Ioana Cutcutache, Triinu Koressaar, Jian Ye, Brant C. Faircloth, Maido Remm, and Steven G. Rozen. 2012. ‘Primer3 — New Capabilities and Interfaces’. Nucleic Acids Research 40 (15): e115. https://doi.org/10.1093/nar/gks596.

Wood, Derrick E., Jennifer Lu, and Ben Langmead. 2019. ‘Improved Metagenomic Analysis with Kraken 2’. Genome Biology 20 (1): 257. https://doi.org/10.1186/s13059-019-1891-0.

Nguyen, Lam-Tung, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. 2015. ‘IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies’. Molecular Biology and Evolution 32 (1): 268-74. https://doi.org/10.1093/molbev/msu300.

Minh, B. Q., M. A. T. Nguyen, and A. von Haeseler. 2013. ‘Ultrafast Approximation for Phylogenetic Bootstrap’. Molecular Biology and Evolution 30 (5): 1188-95. https://doi.org/10.1093/molbev/mst024.

Seemann, Torsten (2019). snp-dists.

Seemann T (2015). snippy: fast bacterial variant calling from NGS reads. https://github.com/tseemann/snippy.

Tierney D, Copsey SD, Morris T, Perry JD. A new chromogenic medium for isolation of Bacteroides fragilis suitable for screening for strains with antimicrobial resistance. Anaerobe. 2016;39:168-172. doi:10.1016/j.anaerobe.2016.04.003