
Title:
IMPROVED MICROBIOME SEQUENCING ANALYSES
Document Type and Number:
WIPO Patent Application WO/2023/237594
Kind Code:
A1
Abstract:
The invention relates to a computer-implemented method for normalization of sequencing data from a microbiome sample. The invention also relates to one or more non-transitory computer readable media storing instructions for carrying out the method; and a computer program product comprising instructions to carry out the method.

Inventors:
CABALLERO-LIMA DAVID (IE)
O'SULLIVAN COLIN (IE)
O'SULLIVAN JAMES (IE)
Application Number:
PCT/EP2023/065204
Publication Date:
December 14, 2023
Filing Date:
June 07, 2023
Assignee:
RINOCLOUD LTD (IE)
International Classes:
G16B30/00
Domestic Patent References:
WO2020073946A12020-04-16
Foreign References:
US20220016188A12022-01-20
Other References:
SMERCINA DARIAN N ET AL: "Impacts of nitrogen addition on switchgrass root-associated diazotrophic community structure and function", FEMS MICROBIOLOGY ECOLOGY, vol. 96, no. 12, 10 October 2020 (2020-10-10), XP093079441, Retrieved from the Internet DOI: 10.1093/femsec/fiaa208
ZEWEI SONG ET AL: "Fungal endophytes as priority colonizers initiating wood decomposition", FUNCTIONAL ECOLOGY, JOHN WILEY & SONS, INC, HOBOKEN, USA, vol. 31, no. 2, 28 September 2016 (2016-09-28), pages 407 - 418, XP071555702, ISSN: 0269-8463, DOI: 10.1111/1365-2435.12735
KIM HYOJUNG ET AL: "Instruction of microbiome taxonomic profiling based on 16S rRNA sequencing", THE JOURNAL OF MICROBIOLOGY, THE MICROBIOLOGICAL SOCIETY OF KOREA // HAN-GUG MISAENGMUL HAG-HOE, KR, vol. 58, no. 3, 27 February 2020 (2020-02-27), pages 193 - 205, XP037044467, ISSN: 1225-8873, [retrieved on 20200227], DOI: 10.1007/S12275-020-9556-Y
Attorney, Agent or Firm:
HGF (GB)
Claims:
CLAIMS

1. A computer-implemented method for normalization of sequencing data from a microbiome sample, the method comprising: a) sub-sampling reads or sequences in the microbiome sample by randomly selecting a plurality of n reads or n sequences from the sample forming a plurality of sub-samples each consisting of n reads or n sequences; b) calculating an OTU frequency table for each sub-sample; c) calculating a median OTU frequency table from the OTU frequency tables from each sub-sample, wherein the median OTU frequency table is representative of the sequencing data of the microbiome sample; optionally calculating the median alpha-diversity from the OTU frequency table for each sub-sample or from the median OTU frequency table; and d) identifying the OTUs in the OTU frequency table for each sub-sample; or the OTUs in the median OTU frequency table, as being from a particular microbe.

2. The method of claim 1, wherein the median alpha-diversity is calculated.

3. The method of claim 2, wherein after step b) and before step c), the OTU frequency table calculated for each sub-sample is used to calculate alpha-diversity for each sub-sample.

4. The method of claim 3, wherein sub-samples which are outliers are identified by performing a maximum normalized residual test on the sub-sample alpha-diversity values and optionally wherein one or more sub-samples which are identified as outliers are removed prior to calculating the median OTU frequency table in step c).

5. The method of any of the preceding claims, wherein before step a), n is selected by comparing the number of observed sequence variants at different sizes of n.

6. The method of claim 5, wherein the number of observed sequence variants is determined by: a) calculating the alpha-diversity for a plurality of sub-samples with different sizes of n; and/or b) calculating an OTU frequency table for a plurality of sub-samples with different sizes of n; and identifying the OTUs in the OTU frequency tables as being from a particular microbe.

7. The method of any of the preceding claims, wherein the number of sub-samples is at least 100.

8. The method of any of the preceding claims, wherein the sequencing data is 16S rRNA sequencing data.

9. The method of any of the preceding claims, wherein the starting value for n is selected according to the source of the microbiome data.

10. The method of any of the preceding claims, wherein the median OTU frequency table from multiple samples is used to calculate beta-diversity.

11. The method of any of the previous claims, wherein the microbiome sequencing data is from a skin microbiome sample.

12. The method of claim 11, wherein the median alpha-diversity is calculated, and the median alpha-diversity and median OTU frequency table are used to classify the skin microbiome sample.

13. The method of any of claims 1 to 11, wherein the median alpha-diversity and median OTU frequency table are input into an AI model to classify the skin microbiome sample.

14. One or more non-transitory computer readable media storing machine-readable instructions which, when executed, cause one or more processors to perform the method of any of claims 1 to 13.

15. A computer program product comprising instructions, which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 13.

Description:
Improved microbiome sequencing analyses

Field of the Invention

The present invention relates to a computer-implemented method for normalization of sequencing data from a microbiome sample. The invention also relates to a non-transitory computer readable media storing instructions to perform the method, and a computer program product comprising instructions which when executed cause the computer to carry out the method.

Background

Sequencing of the 16S rRNA gene and of the nuclear ribosomal internal transcribed spacer (ITS) are common amplicon sequencing methods that have been used to study microbial communities (the microbiome) for decades.

High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionised genomic research, with large amounts of data being analysed more quickly and at lower cost than with the previous Sanger sequencing method. NGS produces shorter gene sequences (150-600 bases), and different sub-regions of the genes are therefore targeted, ranging from single variable regions, such as V4 or V6, to three variable regions, such as V1-V3 or V3-V5.

Due to its high scalability and low costs, NGS is now used for metagenomics, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Both optimising output and reducing variability are important aspects of all metagenomic studies.

A variety of sampling methods are used in this field depending on the type of sample being collected. For instance, the most common methods for isolating the skin microbiome are swabbing, scrub-washing and tape-stripping. Variability and bias can be introduced with different sampling methods, with repeated sampling using the same method, and even when an experienced operator is conducting the sampling. This variability, as part of the sequencing process, will influence the number of reads from sample to sample, introducing bias in the biodiversity of the samples and the taxonomic identification of microbial groups.

This issue becomes apparent when analysing samples from a subject over time. If there is variability within a single sample, then monitoring change over time is less accurate. Reducing sampling variability is a well-known problem even for clinical studies with very tight quality controls, which makes it a major issue for Direct-to-Consumer (DTC) services, where the sampling methods are not as tightly monitored.

Summary of the Invention

The present invention addresses this need by providing a new method for alpha-rarefaction.

In a first aspect of the invention, there is provided a computer-implemented method for normalization of sequencing data from a microbiome sample, the method comprising: a) sub-sampling reads or sequences in the microbiome sample by randomly selecting a plurality of n reads or n sequences from the sample forming a plurality of sub-samples each consisting of n reads or n sequences; b) calculating an OTU frequency table for each sub-sample; c) calculating a median OTU frequency table from the OTU frequency tables from each subsample, wherein the median OTU frequency table is representative of the sequencing data of the microbiome sample; optionally calculating the median alpha-diversity from the OTU frequency table for each subsample or from the median OTU frequency table; and d) identifying the OTUs in the OTU frequency table for each sub-sample; or the OTUs in the median OTU frequency table, as being from a particular microbe.

In a further aspect of the invention there is provided one or more non-transitory computer readable media storing machine-readable instructions which, when executed, cause one or more processors to perform the aforementioned method.

In a further aspect of the invention there is provided a computer program product comprising instructions, which, when the program is executed by a computer, cause the computer to carry out the aforementioned method.

Detailed description

Method

Computer implemented

Computer-implemented means that the method involves the use of a computer, computer network or other programmable apparatus, where one or more features are realised wholly or partly by means of a computer program.

Normalization

Normalization across samples of sequencing data is performed to account for differences in sequencing depths. Normalization allows all samples to be put on a common basis. This allows the normalized data to be compared.

The normalization allows accurate microbe identification in a microbiome sample.

Sequencing data

The sequencing data may be DNA sequencing data. The sequencing data may be DNA sequencing data from bacteria. The sequencing data may be from priming the 16S rRNA genes. That is, the sequencing data may be bacterial 16S rRNA sequencing data.

However, the sequencing data may also be from any other microbe within the microbiome sample.

When 16S rRNA gene sequencing data is used, the data may be from priming any of the single variable regions, such as V4 or V6, and/or from priming for example three variable regions, such as V1-V3 or V3-V5.

The sequencing data may also be ITS (Internal Transcribed Spacer) region sequencing data. This is used as a fungal identifier. Alternatively or additionally, the 16S rRNA and 18S rRNA genes may be primed for the fungal sequencing data.

The DNA sequencing may be next generation sequencing data, for example using reads from fragments of 100-5000bp in length. The sequencing data may also comprise RNA sequencing data. For example, viral RNA sequencing data, metagenomic data, transcriptome data or metatranscriptome data.

Microbiome sample

The sample may be any microbiome sample. By microbiome is meant the collection of all microbes, such as bacteria, fungi, viruses, and their genes, that naturally live on or inside mammalian bodies.

For example, the sample may be a skin sample. The sample may be taken from the cheek or the scalp. Other parts of the body that may be sampled include: back or elbow. Samples may be taken using a swab.

Sub-sample/sub-sampling/n reads

By sub-sample is meant a division of the whole microbiome sequencing sample. That is, a selection of a certain number of reads or sequences (n reads or n sequences) from the whole of the number of reads or sequences obtained for that sample.

By read is meant the series of nucleotides which are a read out from a sequencing machine.

By sequence is meant the merged paired-end reads. This is also referred to as a feature.

A minimum of 100 sub-samples may be taken, each sub-sample consisting of n reads or sequences. For example, 100-1000 sub-samples may be taken. n may be selected by an initial tuning step, which can be performed for each sequencing run (i.e. before step a), n is selected by comparing the number of observed sequence variants at different sizes of n). This tuning identifies the smallest number of reads that still captures the maximum diversity. This can be done by calculating the alpha-diversity and comparing it across different values of n. This is shown in Figure 2. Here it can be seen that above 10,000 reads, the diversity does not change substantially. As a result, going above a value of n = 10,000 will increase time and computing costs for little value.

Alternatively or additionally, the number of observed sequence variants is determined by calculating an OTU frequency table for a plurality of sub-samples with different sizes of n; and identifying the OTUs in the OTU frequency tables as being from a particular microbe. n may be adjusted according to the source of the microbiome data. For example, n is adjusted according to whether the sample is from the elbow or the back, as different sites will have different diversity. Sites with larger diversity will require larger n to capture all of the individual microbial sequences present in the sample. n may be calculated starting from the standard minimum number of reads for that body site, which is then tuned for each sequencing run as described above.

The reads sampled may be the raw reads. Alternatively, instead of the reads, the paired-end sequences may be sub-sampled. Further filtering of the paired-end reads may also be applied prior to sub-sampling, for example with QIIME2.

Randomly

By randomly is meant that there is no design to the selection: n reads are selected to create the sub-sample from the sample. No selection of particular reads is carried out. Any read or sequence in the sample may end up part of the sub-sample.

This is described in Example 1: Calculating n (the number of reads in each sub-sample).

For example, Seqtk, an open-source tool for processing sequencing data in FASTA or FASTQ format, may be used to generate sub-samples from the total number of reads.
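The random sub-sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `draw_subsamples`, the one-seed-per-sub-sample scheme and the toy read identifiers are all hypothetical, and sampling is assumed to be without replacement within each sub-sample.

```python
import random


def draw_subsamples(reads, n, n_subsamples=100, base_seed=100):
    """Return n_subsamples lists, each holding n reads drawn at random.

    Assumption: reads are drawn without replacement within a sub-sample,
    and each sub-sample uses its own seed so the draw is reproducible.
    """
    if n > len(reads):
        raise ValueError("n cannot exceed the number of reads in the sample")
    subsamples = []
    for k in range(n_subsamples):
        rng = random.Random(base_seed + k)  # hypothetical seeding scheme
        subsamples.append(rng.sample(reads, n))
    return subsamples


# Toy data: 50 reads identified by made-up labels.
reads = [f"read_{i}" for i in range(50)]
subs = draw_subsamples(reads, n=10, n_subsamples=100)
```

Every read has the same chance of entering a given sub-sample, which matches the definition of "randomly" given above; fixing the seeds only makes the draws repeatable.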

Alpha-diversity

The alpha-diversity calculation provides an indication of diversity in a microbiome sample. The alpha-diversity may measure any one or more of the following: a) the number (or count) of species in the sample; and/or b) the inequality between species abundances (species proportional abundance). Both may be measured in the calculation of alpha-diversity.

The Shannon diversity index is one method which may be used to calculate alpha-diversity. The Shannon index measures both a) and b). That is, it weights the importance of each species in the sample.

The Shannon index is calculated as follows:

$$H = -\sum_{i} p_i \log(p_i)$$

where:

H - Shannon diversity index; $p_i$ - proportion of individuals of the i-th species in the whole community, $p_i = n / N$, where n is the number of individuals of a given type/species and N is the total number of individuals in the community; Σ - sum symbol; and log - usually the natural logarithm, but the base of the logarithm is arbitrary (base-10 and base-2 logarithms are also used).
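As a minimal sketch of the formula above, the Shannon index can be computed from a vector of per-species (or per-OTU) counts with NumPy. The function name `shannon_index` and the example counts are illustrative assumptions; zero counts are dropped because they contribute nothing to the sum.

```python
import numpy as np


def shannon_index(counts, base=np.e):
    """Shannon diversity H = -sum(p_i * log(p_i)) for a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]          # absent species contribute 0 to the sum
    p = counts / counts.sum()            # proportions p_i = n / N
    return float(-(p * np.log(p)).sum() / np.log(base))


# Four equally abundant species: H equals log(4) in the chosen base.
h_natural = shannon_index([10, 10, 10, 10])
h_base2 = shannon_index([10, 10, 10, 10], base=2)
```

With four equally abundant species the index equals the logarithm of 4 in the chosen base, and it falls to zero when a single species accounts for every read.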

The median alpha-diversity may be used alongside the median OTU frequency table as input to classify a microbiome sample or to train a model to classify a microbiome sample.

OTU

Operational taxonomic unit or OTU is considered as the basic unit used in numerical taxonomy. These units may refer to an individual, species, genus, or class. They are groups of organisms clustered by sequence similarity, for example sequence similarity of the 16S rRNA gene or parts of it (or any of the other areas for priming described above in the section describing the sequencing data). OTU may be defined as 97 or 100% identical (calculated using BLAST for example) to other members within the same OTU. Identity is calculated based on for example comparison of 16S rRNA gene data, 18S rRNA gene data or ITS data or any of the other data described above under the section entitled “Sequencing Data”.

The OTU frequency is the number of reads or sequences for each OTU identified in the subsample.
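To make the definition concrete, the sketch below counts reads per OTU for one sub-sample. It assumes each read has already been assigned an OTU label by an upstream clustering step; the labels and the helper name `otu_frequency_table` are hypothetical.

```python
from collections import Counter


def otu_frequency_table(otu_assignments):
    """Count how many reads (or sequences) in one sub-sample fall into each OTU."""
    return dict(Counter(otu_assignments))


# Made-up OTU labels for six reads in one sub-sample.
subsample_assignments = ["OTU_A", "OTU_B", "OTU_A", "OTU_C", "OTU_A", "OTU_B"]
table = otu_frequency_table(subsample_assignments)
```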

By OTU frequency table is meant a table listing each OTU against the number (abundance) of reads or sequences assigned to that OTU. The OTU frequency table can be calculated from an Amplicon Sequence Variant (ASV) table by binning the sequence variants according to their percentage identity with each other to form OTU bins. The median OTU frequency table is then calculated from the OTU frequency tables from all sub-samples.

Alternatively, at step b) an ASV table can be calculated and the OTU binning is only carried out after calculating the median OTU frequency table. Therefore, step b) of claim 1 may read calculating a variant frequency table for each sub-sample wherein the variant is either an amplicon sequence variant or an OTU. The table would then be an ASV table or an OTU frequency table.

Identified as being from a particular microbe

The OTUs are identified as being from a particular microbe. For example, the sub-sample OTU frequency table or the median OTU frequency table may be processed to identify the particular OTUs with a microbe, i.e. the OTUs are taxonomically identified. Identification requires the OTU to have a percentage identity with a known microbial sequence.

This taxonomic identification may be carried out using QIIME2 and a naive Bayes classifier trained using the scikit-learn Python library. Other taxonomic classification methods may also be used.

Taxonomic identification may be to the genus or species level or family level. Taxonomic identification of the OTU may require at least 80, 85, 90, 95, 97 or 100% identity to a known microbial sequence for identification.

Median

By median is meant the middle number in a sorted, ascending or descending, list of numbers.

The median alpha diversity may be calculated by calculating the alpha diversity for each sub-sample (from the OTU frequency table from each sub-sample) and then calculating the median of these sub-sample values. Alternatively, the median alpha diversity may be calculated from the median OTU frequency table.

The median alpha diversity may be calculated using Python and its NumPy library. The median OTU frequency table can be calculated by combining the OTU frequency tables from each sub-sample into one median table. This may be performed, for example, using Pandas, a Python library for data analysis.
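One way the two medians could be computed with NumPy and Pandas is sketched below. The data are invented for illustration, and treating an OTU that is absent from a sub-sample as a count of zero is an assumption that the text does not state.

```python
import numpy as np
import pandas as pd


def median_otu_table(tables):
    """Combine per-sub-sample OTU frequency tables into one median table.

    Rows are sub-samples and columns are OTUs. Assumption: an OTU missing
    from a sub-sample is counted as 0 before the median is taken.
    """
    df = pd.DataFrame(tables).fillna(0)
    return df.median(axis=0)


# Made-up OTU frequency tables from three sub-samples of one sample.
tables = [
    {"OTU_A": 60, "OTU_B": 30, "OTU_C": 10},
    {"OTU_A": 58, "OTU_B": 32, "OTU_C": 10},
    {"OTU_A": 62, "OTU_B": 35, "OTU_D": 3},
]
median_table = median_otu_table(tables)

# Median alpha-diversity from made-up per-sub-sample Shannon values.
median_alpha = float(np.median([3.9, 4.1, 4.0]))
```

In this toy case OTU_D appears in only one of three sub-samples, so its median count is zero, which shows how the median damps features that a single sub-sample happens to pick up.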

Optionally before calculation of the median values, outlier sub-samples may be removed. This is described below.

Representative of the sequencing data of the microbiome sample

The median alpha-diversity and median OTU frequency table are not the entire sample, but instead are normalized data which are a true statistical representation of the data as a whole.

Sub-sample outliers and Maximum normalized residual test

The alpha-diversity of the sub-samples should be normally distributed.

This can be plotted for example using SciPy open-source Python library.

Sub-samples which do not fall within this normal or Gaussian distribution are outliers.

It is possible to remove these outliers before using them to calculate the median OTU frequency table and/or median alpha-diversity.

The removal of outliers can be done using a maximum normalized residual test, also known as Grubbs’ method.

Essentially, this removes biased sub-sampling. Grubbs’ method is applied iteratively until all outliers are removed and the population fits a Gaussian distribution.

The maximum normalized residual test is defined for the hypotheses:

H0: There are no outliers in the data set

Ha: There is exactly one outlier in the data set

The Grubbs test statistic is defined as:

$$G = \frac{\max_{i} |Y_i - \bar{Y}|}{s}$$

with $\bar{Y}$ and $s$ denoting the sample mean and standard deviation, respectively. The Grubbs test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation.

Software that may be used to apply Grubbs’ method includes the NumPy and SciPy open-source Python libraries (see, for example, Mastering data mining with Python - Find patterns hidden in your data by M. Squire, 2016).

Therefore, by calculating the alpha-diversity of the sub-samples, and plotting these, any which do not follow a Gaussian distribution can be removed. The median alpha-diversity and median OTU frequency table may then be calculated without these outlier sub-samples. The effect of this is shown in Figure 6 and described in Example 3.
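The iterative outlier removal can be sketched with NumPy and SciPy as follows. This is a hedged illustration rather than the applicant's code: the two-sided critical value uses the standard formula based on the t-distribution, the significance level of 0.05 is an assumption, and the alpha-diversity values are made up.

```python
import numpy as np
from scipy import stats


def grubbs_statistic(values):
    """Largest absolute deviation from the mean, in units of the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return float(np.max(np.abs(values - values.mean())) / values.std(ddof=1))


def grubbs_critical_value(n, alpha=0.05):
    """Two-sided critical value for a sample of size n (alpha is an assumed level)."""
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    return float((n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2)))


def remove_outliers(values, alpha=0.05):
    """Apply the test repeatedly, removing one outlier per pass until none is detected."""
    kept = [float(v) for v in values]
    removed = []
    while len(kept) > 2:
        arr = np.asarray(kept)
        if arr.std(ddof=1) == 0:
            break  # all values identical: nothing can be an outlier
        if grubbs_statistic(arr) <= grubbs_critical_value(len(arr), alpha):
            break  # no outlier detected at this level
        idx = int(np.argmax(np.abs(arr - arr.mean())))
        removed.append(kept.pop(idx))
    return kept, removed


# Made-up Shannon values for eight sub-samples of one sample.
alpha_values = [4.2, 4.1, 4.3, 4.2, 4.15, 4.25, 4.2, 3.5]
kept, removed = remove_outliers(alpha_values)
```

Only the retained sub-samples would then be used when computing the median alpha-diversity and the median OTU frequency table.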

Beta-diversity

The Beta-diversity calculation provides a comparison of diversity between different microbiome samples.
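As a rough illustration of such a between-sample comparison, the sketch below computes pairwise Bray-Curtis dissimilarities with SciPy. The choice of Bray-Curtis here, and the count matrix (rows are samples, columns are OTUs), are assumptions made only for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up median OTU counts: three samples (rows) by three OTUs (columns).
otu_counts = np.array(
    [
        [6, 4, 0],
        [4, 6, 0],
        [0, 0, 10],
    ],
    dtype=float,
)

# Square matrix of pairwise Bray-Curtis dissimilarities between samples.
bray_curtis = squareform(pdist(otu_counts, metric="braycurtis"))
```

A value of 0 means two samples have identical OTU counts, and 1 means they share no OTUs; a matrix like this is the usual input to an ordination such as PCoA.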

Dissimilarity matrices based on indexes such as Bray-Curtis or Jaccard, together with PCoA, may be used to calculate the beta-diversity.

n is selected by comparing the number of observed sequence variants at different sizes of n

This can be carried out by calculating the alpha diversity of samples with different values of n reads. This result can be seen in Figures 1 and 2.

This step can also be carried out by calculating the OTU frequency at different values of n reads and classifying the OTUs (i.e. taxonomically identifying the OTUs). This is shown in Figure 3. By classifying or identifying the OTUs, this allows the data quality to be judged. For example, there may be many reads but these cannot be identified due to the poor data. Therefore, basing the calculation of n on the alpha diversity alone may not be sufficient. By identifying the OTUs, the real bacterial diversity that can be identified in the sub-sample can be seen. This further allows n to be calibrated not only on the number of reads but also on the quality of the reads. In this case it is the number of classified OTUs at different sizes of n which are compared. Classification (i.e. taxonomical identification) of the OTUs may be as described above in the section entitled “Identified as being from a particular microbe”. Alternatively, both the alpha diversity and the classified output from the OTU frequency table can be used to decide on n.

By observed sequence variants is meant the number of OTUs, or when classification is carried out by calculating an OTU frequency table for a plurality of sub-samples with different sizes of n; and identifying the OTUs in the OTU frequency tables, the observed sequence variants are OTUs which have been taxonomically identified as being from a particular microbe.
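The comparison across sizes of n could be automated along the following lines. This is a sketch under assumptions: the plateau rule (pick the smallest n whose diversity lies within a relative tolerance of the value at the largest n tested), the tolerance of 2% and the diversity values are all hypothetical and are not taken from the figures.

```python
def choose_n(diversity_by_n, tolerance=0.02):
    """Return the smallest n whose diversity is within `tolerance`
    (relative) of the diversity observed at the largest n tested."""
    depths = sorted(diversity_by_n)
    reference = diversity_by_n[depths[-1]]
    for n in depths:
        if abs(reference - diversity_by_n[n]) <= tolerance * abs(reference):
            return n
    return depths[-1]


# Made-up Shannon values (or counts of classified OTUs) at each depth tested.
curve = {5000: 3.6, 10000: 3.95, 20000: 3.98, 30000: 4.0}
selected_n = choose_n(curve)
```

The same function could be fed the number of taxonomically identified OTUs at each depth instead of the alpha-diversity, so that n is calibrated on read quality as well as read count.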

Classify skin

The skin may be classified based on the median alpha-diversity and/or median OTUs. Classification may be carried out using the alpha diversity and/or median OTUs by creating a population distribution which is fractionated into value ranges for each skin type. The fractionation may be guided by the presence, absence and abundance of certain features identified by an AI model as characteristic of certain skin types.

AI model to classify skin

The median alpha-diversity and median OTUs can be used as input into an AI model to classify the skin by type, for example into dry, sebaceous or oily. Other data may also be used as input into the AI model.

Products

Non-transitory computer readable media storing machine-readable instructions

Machine readable program instructions may be provided on a transitory medium such as a transmission medium or on a non-transitory medium such as a storage medium. Such machine readable instructions (computer program code) may be implemented in a high level procedural or object oriented programming language. However, the program(s) may be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations. Program instructions may be executed on a single processor or on two or more processors in a distributed manner.

Throughout the specification, unless the context demands otherwise, the terms ‘comprise’ or ‘include’, or variations such as ‘comprises’ or ‘comprising’, ‘includes’ or ‘including’ will be understood to imply the method or kit includes a stated integer or group of integers, but not the exclusion of any other integer or group of integers.

Each document, reference, patent application or patent cited in this text is expressly incorporated herein in their entirety by reference, which means it should be read and considered by the reader as part of this text. That the document, reference, patent application or patent cited in the text is not repeated in this text is merely for reasons of conciseness. Reference to cited material or information contained in the text should not be understood as a concession that the material or information was part of the common general knowledge or was known in any country.

Description of the Figures

Figure 1 shows Alpha-rarefaction curve for Shannon diversity from human cheek microbiome.

Figure 2 shows Alpha-rarefaction tuning per run.

Figure 3 shows Genus level taxonomic classification human cheek microbiomes, faceted by number of reads subsampled. Sam = samples. Each sample is from a different person. All are from cheek.

Figure 4 shows Distribution of Shannon alpha diversity on whole population (n = 91 samples) following random subsampling of reads at a depth of 10,000, 8 times (A:H).

Figure 5 shows Shannon alpha diversity per sample showing the range of alpha diversity results per sample depending on the subsample of 10,000 reads used.

Figure 6 shows the Effect of outliers removal on Taxonomic classification.

Figure 7 shows the overall Bioinformatics pipeline

Examples

Aspects of the present invention will now be illustrated by way of example only and with reference to the following experimentation.

General methods

16S Metagenomics Workflow

Procedure:

1. Microbiome samples are collected from the users’ skin: the cheek for skin care recommendations, the scalp for hair care recommendations, and other parts of the body (back, elbow, thigh) for medical applications.

2. DNA extraction is carried out as follows:

2.1. Sample swabs are treated with Proteinase K.

2.2. Bacterial cell walls are disrupted by a mechanical method using a cell homogeniser in deep 96 well plates.

2.3. 96 well plates are centrifuged to pellet cell debris.

2.4. Supernatant is transferred to a fresh 96 well plate.

2.5. DNA extraction is completed in a QIAcube liquid handling robot using a Qiagen PowerSoil DNA extraction kit.

3. DNA amplification is carried out using primers 16S 27F and 16S 534R to amplify variable regions V1-V3 of the 16S gene.

4. PCR products are purified using magnetic beads.

5. DNA is run on a gel to assess quality and to quantify.

6. If passed, V1-V3 fragments are barcoded using the Nextera XT DNA Library Preparation Kit from Illumina.

7. PCR products are purified using magnetic beads.

8. DNA is run on a gel to assess quality and to quantify.

9. If passed, up to 192 samples, including blanks and positive controls, are pooled together at equal concentrations.

10. The pooled library is loaded into the MiSeq cartridge and the sequencer is run using a flow cell and cartridge from a MiSeq Reagent Kit v3 (600 cycles).

11. At the end of the run, run quality is assessed.

12. If passed, raw data is uploaded to the bioinformatics pipeline servers.

Example 1: Calculating n (the number of reads in each sub-sample)

The sub-sample n number can also be tuned for each sequencing run ensuring that rare groups in different samples are still represented in the majority of sub-samples.

The bioinformatics pipeline for calculating n is shown in Figure 7 (Left hand side, resulting in alpha rarefaction tuning). Each stage is discussed in detail below.

1. Initial analysis of the data (Qiime2, Dada2, Scikit-Learn classifiers)

2. Alpha rarefaction was initially calculated for the entire population (the complete data set in our database) and then tuned for each run as follows (Figure 1).

3. On the population of interest (the actual sequencing run or user batch), Seqtk is used to randomly subsample 5000, 10000, 20000 and 30000 reads from the paired fastq files with seed 100.

4. Genus level taxonomic results (Figure 3) and Shannon alpha diversity (Figure 2) are downloaded and plotted in R using ggplot2.
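The subsampling in step 3 can be scripted. The sketch below only builds the Seqtk command lines, following Seqtk's documented `seqtk sample -s<seed> <file> <n>` usage; the file names and the helper `seqtk_sample_commands` are hypothetical. The same seed is used for both files of a pair so that the mates stay matched.

```python
def seqtk_sample_commands(fastq_r1, fastq_r2, depths, seed=100):
    """Build one `seqtk sample` command per depth and per paired file.

    Seqtk writes the sub-sampled reads to standard output, so each
    command would normally be redirected to its own output file.
    """
    commands = []
    for n in depths:
        for fastq in (fastq_r1, fastq_r2):
            commands.append(["seqtk", "sample", f"-s{seed}", fastq, str(n)])
    return commands


# Hypothetical input files for one sample; depths and seed as in step 3.
commands = seqtk_sample_commands(
    "sample_R1.fastq", "sample_R2.fastq", depths=[5000, 10000, 20000, 30000]
)
```

Each list could be passed to `subprocess.run` with `stdout` pointed at an output file, assuming Seqtk is installed on the machine.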

QIIME2 takes the raw sequencing reads and organizes the paired-end reads, if used. Classification can also be done in QIIME2. It can also demultiplex and quality-filter the data.

Dada2 takes sequencing data and outputs a table with sequence variants (i.e. different sequences or features) and their sample-wise abundances. This is called an amplicon sequence variant or ASV table. It also denoises the data.

Results:

The results can be seen in Figures 1-3.

As can be seen from Figure 1, sampling at n values below around 2000 for this sample reduces the observable alpha-diversity.

Further analyses of the data can be seen in Figures 2 and 3. The most abundant genera remain consistent even at a low sampling depth of 5000 reads, resulting in ~2500 ASVs following Dada2. Some low-abundance genera are lost below 10,000 reads.

Therefore, n was chosen as 10,000 for this data.

Example 2: Iterative random sampling of sequencing raw data (taking n reads from the sample a plurality of times to form a plurality of sub-samples each consisting of n reads)

Random sampling of each sample’s raw sequencing data is carried out to produce a number of sub-samples.

The number of reads (n) for each sub-sample is obtained as explained in Example 1.

The generation of a set of sub-samples is used to avoid bias on the samples containing the lower number of reads and to ensure coverage of all rare microbial species.

To determine whether subsampling at a minimum depth had an impact on Shannon diversity, 10,000 reads are randomly subsampled. QIIME2 and DADA2 are used to analyse and organise the data. Shannon diversity is then plotted for each sub-sample of 10,000 reads.

Results:

The results can be seen in Figure 4.

This shows examples with seeds of 100, 95, 87, 82, 76, 62, 59 and 48 (labelled a, b, c, d, e, f, g and h). Overall there is slight variation in population alpha diversity, with subsampled group g having slightly lower alpha diversity.

As can be seen from Figure 5, alpha diversity varies across the different sub-samples. Taken together, the sub-samples cover the entire sequencing space while using smaller sample sizes, so less computing power is used. Because multiple sub-samples are drawn (and n is tuned as explained in Example 1), species present at low abundance in the samples are still retrieved.

Example 3: Removing outliers from the sub-samples

On an individual level, many samples remain consistent and the Shannon diversity of the subsamples is normally distributed.

However, some samples do change dramatically (losing normality) depending on the subsample used, e.g. a red sample in subgroup a has a Shannon value of ~4.2, which drops to ~3.5 in b, trends back towards 4.2 in c, d and e, before dropping to ~3.5 again in f (see arrow in Figure 5).

For this reason, it is possible to remove outliers using Grubbs’ method.

Grubbs’ method, or the maximum normalized residual test, assumes that sub-sample values should have a normal distribution. Values that do not fit are removed by Grubbs’ method. Essentially, this removes biased sub-sampling. Grubbs’ method is applied iteratively until all outliers are removed and the population fits a Gaussian distribution.

Results:

The results can be seen in Figure 6 and Tables 1 and 2 below.

These data show an example run in which the Shannon index is calculated for each of the subsamples. The majority of samples fitted a Gaussian distribution when a normality test was run on the set of Shannon index values of all subsamples from each sample. The samples which did not fit a normal distribution (Table 1A) underwent removal of outliers using Grubbs’ method until they passed the normality test (Table 1B).

Table 2 shows how outlier removal affects the median Shannon diversity (H’) and the Skin Trust Club Microbiome Score (S). The identification and removal of outliers using this method was subsequently applied to the taxonomic microbial classification. Changes in the relative amounts of certain microbes can be observed in Figure 6.

Table 1 (panels A and B). Example of subsample data normalization by removal of outliers using Grubbs’ method.

Table 2. Effect of outlier removal on Shannon α-diversity and Skin Trust Club (STC) microbiome balance score for the same samples shown above. H’ = Median Shannon α-diversity. H’o = Median Shannon α-diversity without outliers. S = Median STC microbiome balance score. So = Median STC microbiome balance score without outliers.