EXPRESSION PROFILING - GARVAN INSTITUTE OF MEDICAL RES

Title:

EXPRESSION PROFILING

Document Type and Number:

WIPO Patent Application WO/2020/061643

Kind Code:

Abstract:

This disclosure relates to a method for determining a state of a biological sample using streaming data from a sequencer, such as, but not limited to, diagnosing sepsis using sequencing data. A processor generates an expression profile for the sample. The expression profile comprises for each of the multiple sequences an indication of abundance of that sequence in the sample. While the processor receives further sequences for the sample, the processor updates the expression profile for the sample, performs a comparison of the expression profile for the sample to stored expression profiles to determine a matching stored expression profile, and determines the state of the sample as the state associated with the matching stored expression profile (such as sepsis). Upon determining the state of the sample, the processor terminates the receiving of the further sequences before the full sequencing data has been received.

Inventors:

BUNADI DENNIS (AU)
SMITH MARTIN (AU)
FERGUSON JAMES (AU)
CARSWELL SHAUN (AU)

Application Number:

PCT/AU2019/051049

Publication Date:

April 02, 2020

Filing Date:

September 27, 2019


Assignee:

GARVAN INSTITUTE OF MEDICAL RES (AU)

International Classes:

G16B25/10; C12Q1/68; G16B20/00

Domestic Patent References:

WO2011106536A2	2011-09-01
WO2017106918A1	2017-06-29

Foreign References:

US9322820B2	2016-04-26

Attorney, Agent or Firm:

FB RICE PTY LTD (AU)


Claims:

CLAIMS:

1. A method for diagnosis of sepsis in a sample from a patient using streaming data from a sequencer, the method comprising:

receiving multiple sequences of the sample from the sequencer;

generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;

receiving further sequences as streaming data from the sequencer;

while receiving the further sequences performing the steps of:

updating the expression profile for the sample;

performing a comparison of the expression profile for the sample to a stored expression profile indicative of an abundance of sequences when sepsis is present;

determining whether the patient has sepsis based on the comparison; and upon determining whether the patient has sepsis terminating the receiving of the further sequences.

2. A method for determining a state of a biological sample using streaming data from a sequencer, the method comprising:

receiving multiple sequences of the sample from the sequencer;

generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;

receiving further sequences as streaming data from the sequencer;

while receiving the further sequences performing the steps of:

updating the expression profile for the sample;

performing a comparison of the expression profile for the sample to one or more stored expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample; determining the state of the sample as the state associated with the matching stored expression profile; and

upon determining the state of the sample terminating the receiving of the further sequences.

3. The method of claim 2, wherein the state of the sample comprises a tissue of origin as determined as the tissue of origin associated with the matching stored expression profile.

4. The method of claim 1, 2 or 3, wherein the sequencing data comprises a stream of consecutive information including the read sequences.

5. The method of any one of the preceding claims, wherein the sequencer comprises a nanopore continuously generating the sequencing data.

6. The method of any one of the preceding claims, wherein the expression profiles comprise a representation of an electric signal in a time-domain that corresponds to a read direction along the sequence.

7. The method of any one of the preceding claims, wherein the comparison is based on comparing the sequences in the expression profiles.

8. The method of claim 7, wherein comparing the sequences is based on comparing features extracted from the sequences.

9. The method of any one of the preceding claims, wherein the expression profiles comprise a list of sequences that is ordered by the respective abundances.

10. The method of any one of the preceding claims, wherein performing the comparison comprises calculating a matching score between the expression profile for the sample and the one or more stored expression profiles.

11. The method of claim 10, wherein the matching score is based on an order of sequences in the expression profiles by respective abundances.

12. The method of claim 11, wherein the matching score is based on a difference in a position of a sequence within the ordered sequences between the expression profiles.

13. The method of claim 10 or 11, wherein the matching score is based on a rank correlation coefficient.

14. The method of any one of the preceding claims, wherein the state of the sample is determined and the receiving of the further sequences is terminated when a matching score determined by the comparison meets a pre-defined threshold.

Description:

EXPRESSION PROFILING

Related application

[0001] This application claims priority from Australian application 2018903657, filed on 27 September 2018, which is incorporated herein by reference.

Technical Field

[0002] This disclosure relates to a method for determining a state of a biological sample using streaming data from a sequencer, such as, but not limited to, diagnosing sepsis using sequencing data.

Background

[0003] The genome produces a diverse multitude of protein-coding (mRNA) and non-protein-coding (ncRNA) transcripts that, collectively, embody the transcriptome. A transcriptome represents a snapshot of global genetic activity from a single cell or a population of cells (e.g. a tissue), which can be decomposed into thousands of individual genes and gene products that are each produced (or expressed) at different levels. The nature and relative quantities of expressed genes are very dynamic and vary as a function of ‘cellular states’, e.g. tissue-specificity, developmental processes, differentiation, disease, drugs, and environment. Hence, measuring and observing transcriptomes via high-throughput sequencing provides an informative, high-resolution molecular profile (or ‘snapshot’) of cellular states.

[0004] However, sequencing datasets are generally large so that an upload of the full dataset generally requires a long time, such as three days. For many diagnostic applications, especially emergency applications, this is unacceptably long.

[0005] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.

[0006] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.

Summary

[0007] Disclosed herein is a method for analysing sequences by matching the abundances (i.e. expression levels) against known profiles. This is achieved without the entire sequencing data-set but on the fly as the sequences become available. Once a match is found, the process can be stopped, which results in a significantly reduced time required to come to a decision.

[0008] In this sense, there is provided a method for determining a state of a biological sample using streaming data from a sequencer. The method comprises:

receiving multiple sequences of the sample from the sequencer;

generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;

receiving further sequences as streaming data from the sequencer;

while receiving the further sequences performing the steps of:

updating the expression profile for the sample;

determining the state of the sample as the state associated with the matching stored expression profile; and upon determining the state of the sample terminating the receiving of the further sequences.

[0009] There is also provided a method for determining a state of a biological sample using streaming data from a sequencer. The method comprises:

receiving multiple sequences of the sample from the sequencer;

generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;

receiving further sequences as streaming data from the sequencer;

while receiving the further sequences performing the steps of:

updating the expression profile for the sample;

ordering the sequences in the expression profile for the sample by the respective abundances;

performing a comparison of the expression profile for the sample to one or more stored expression profiles based on a difference in a position of a sequence within the ordered sequences between the expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample and ordered by the respective abundances;

determining the state of the sample as the state associated with the matching stored expression profile; and

upon determining the state of the sample terminating the receiving of the further sequences.

Brief Description of Drawings

[0010] An example will now be provided with reference to the following drawings:

[0011] Fig. 1 illustrates a sorted X-profile being generated using nanopore sequencing and a database of previously generated X-profiles against which the native X-profile is compared.

[0012] Fig. 2 illustrates an example of comparative X-profiles for determining tissue of origin.

[0013] Fig. 3 illustrates an example of X-profile comparison approach.

[0014] Fig. 4 illustrates a comparison of an unknown sample to known samples. Mouse RNAseq data from a blind sample (Sample X) was used to generate progressively larger X-profiles, which are compared to 3 reference X-profiles from known tissues (Brain, Kidney, Testes). Sample X was predicted to be mouse brain, which was subsequently confirmed by the technician who produced the sample.

[0015] Fig. 5 illustrates a method for diagnosis of sepsis in a sample from a patient.

[0016] Fig. 6 illustrates a method for determining a state of a biological sample.

Description of Embodiments

[0017] Nanopore sequencing enables real-time analysis of genomic and transcriptomic data. In particular, the real-time acquisition of data enables interactive, selective sequencing applications premised on instantaneous analysis of sequencing data. A molecule can be ejected by reversing the flow of current across the nanopore if the analysis of the sequence reveals it to be undesired. Conversely, the molecule may continue to be sequenced if analysis of the sequence reveals it to be desirable. Oxford Nanopore Technologies have pioneered such applications with their ‘read-until’ functionality.

[0018] For RNA sequencing (a.k.a. transcriptomics) it can be beneficial to selectively reject abundant and highly similar transcripts, such as mRNA sequences of the same genes. Indeed, some highly-expressed genes compose the majority of mRNA sequences in a transcriptome. These abundant molecules can saturate a sequencing experiment, and provide little qualitative information after an initial subset of sequencing reads have been generated. It is thus desirable to reject these reads once they have been sequenced sufficiently to determine the composition and diversity of their primary structure. Indeed, less abundant transcripts, such as regulatory ncRNAs, can provide distinguishing information about the nature of a sample. However, retaining the relative abundances of all transcripts can nonetheless provide distinguishing information about the nature of the sample.

[0019] This disclosure provides a method to characterize cellular states by generating qualitative and quantitative expression profiles (X-profiles) using a data format compatible with real-time nanopore sequencing. We describe the utility of X-profiles for processing transcriptomic data in real-time, including the comparative analysis of X-profiles. We demonstrate how comparative X-profile analysis can be used to identify the source of an unknown RNA sequencing sample by comparing it to a database of annotated X-profiles. This approach can be extended to clinical applications, such as the identification of tissue of origin for metastatic cancers of unknown primary (CUPs), or the stratification of sepsis patients based on signatures of gene expression (i.e. ‘cellular states’). Furthermore, the nature of X-profiles enables real-time comparisons to other X-profiles generated a priori, enabling real-time classification of biological and clinical samples, which can drastically reduce the turnaround time for clinical tests.

[0020] An “expression profile” (X-profile) is a database that stores biological sequencing information in signal form, alongside a quantification of said signal abundance as described in PCT/AU2018/050265, which is incorporated herein by reference. An X-profile can be sorted by the relative abundance (i.e. quantification of signal), from most common to least common [Fig. 1]. Collections of expression profiles for disparate tissue / sample types may be loaded into cloud-computing instances, allowing comparisons between expression profiles to determine match similarity via rank correlation. A processor of a computing system receives multiple sequences of a sample from the sequencer, such as in the form of a file generated by the sequencer. Each sequence can be considered as being a ‘read’, that is, one contiguous stream of sequencing data, noting that for nanopore sequencing the reads are relatively long compared to Illumina sequencing, for example. The processor then generates an expression profile for the sample. The expression profile comprises, for each of the multiple sequences, an indication of abundance of that sequence in the sample.

[0021] Fig. 1 illustrates an expression profile (X-profile) 101, which is sorted in this example. The solid bars in each row of profile 101 indicate the abundance of that sequence in the sense that longer bars indicate a higher number of sequences being read. In this example, the processor has generated the profile 101 using nanopore sequencing. It is noted that at the moment in time of Fig. 1, the profile 101 is not yet complete but rather a ‘work in progress’, as the processor is still building the profile 101 because the entire sequencing data has not yet been received. In this sense, profile 101 could be referred to as partial, incomplete, fragmentary or unfinished. Nevertheless, the processor can already use the partial or intermediate profile 101 as described below. There is also a database 102 of previously generated X-profiles 103, 104, 105 against which the native X-profile is compared.

[0022] In this sense, the processor receives further sequences 106 as streaming data from the sequencer as shown at the left hand side of Fig. 1. While the processor receives the further sequences 106, the processor performs the steps below. This means that the processor may perform the below steps during the sequencing, as the signal or the individual bases arrive at the processor, or at the end of each read where the profile 101 is updated, or after every 10 or 100 reads. Importantly, the processor performs the below steps multiple times before the entire sequencing data is available.

[0023] The steps repeated by the processor include updating the expression profile 101 for the sample, so that the stored abundances reflect the number of reads received so far for each stored read. The processor then performs a comparison of the expression profile 101 for the sample to a stored expression profile (103, 104, 105), noting that the stored profiles 103, 104, 105 are associated with a respective predefined state of the sample. For example, the profile may be indicative of an abundance of sequences when sepsis is present. The processor then determines the state of the sample as the state associated with the matching stored expression profile. For example, when the stored sepsis profile matches with the current profile 101, the processor determines that the patient has sepsis. Importantly, upon determining the state of the sample (i.e. sepsis is present), the processor terminates the receiving of the further sequences.

[0024] Furthermore, the database can be reduced to only retain features/sequences/entries of noted significance or interest. Conversely, X-profiles can be extended with other features arising from the signal that can feed into a maximum likelihood model or classifier system, including but not limited to transformations of the signal from the time domain to the frequency domain, signal time series averages, peak co-ordinates, auto-correlates, zero-crossing derivative vectors, etc.; see Fig. 2. While Fig. 2 provides some examples of features (events, FFT, PSD, Matched signal abundance), a combination of those or others not mentioned here may equally be used. In one example, the method uses a model for each tissue of interest, or biological data signatures in k-mer space.

[0025] X-profiles can be generated using different sequencing technologies and can be converted between formats. Public RNA sequencing datasets using the Illumina short read platform are plentiful in repositories such as TCGA, GTEx, MiTranscriptome, etc. An example of how they can be used to generate X-profiles follows:

1. Generate reference transcripts to be used as qualitative X-profile features from ab initio transcriptome assembly of short reads using tools such as Trinity, Trans-ABySS, SeqMan NGen, SOAPdenovo-Trans, Velvet/Oases, etc. Comparably, this can be done with de novo assembly tools such as Cufflinks, StringTie, etc.

2. Quantify the assembly with tools like Kallisto, Salmon, Sailfish, HTSeq-count, generating abundances for the assembled sequences;

3. Sort the sequences by decreasing abundance.

Another example:

1. Extract sequences corresponding to CAGEseq peaks from FANTOM5, representing the 5’ end of mRNAs, as the qualitative feature of the X-profile;

2. Assign CAGE peak abundance as the quantitative feature of the extracted mRNA sequence in the X-profile;

3. Sort by decreasing abundance.

[0026] X-profiles can also be converted between formats, sequencing technologies, platforms, or data sets, enabling the generation of a normalized, unified and centralized database of gene expression profiles. For example, an X-profile generated with sequence information as the qualitative feature can be converted to signal features using a tool like Scrappie or DeepSimulator, which convert between sequence and nanopore signal data, in this example. The abundances from the original profiles can thus be interchangeable across datasets of different qualitative natures, facilitating normalization across different sequencing platforms.

[0027] One or more X-profile can be used to generate a representative X-profile for a given sample, tissue, biological or physical feature of interest. For example, two or more X-profiles can be merged by creating a meta X-profile that represents a consensus of the two or more profiles. Similarly, two or more X-profiles can be merged by extracting the common or discriminative profiles.

Comparing expression profiles

Normalize signals across samples

[0028] To compare X-profiles, the query X-profile is normalized against the reference X-profiles so that like is compared to like. This can, for example, be done by subtracting the mean and dividing by the standard deviation of the residuals or, as another example, by mapping the values onto the range [0, 1].
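The two normalization options mentioned above can be sketched as follows. This is a minimal illustration only; the function names `zscore` and `minmax` are hypothetical and not taken from the disclosure, and the handling of constant signals is an assumption.

```python
from statistics import mean, pstdev


def zscore(values):
    """Subtract the mean and divide by the standard deviation of the residuals."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant signal is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Map the values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant signal is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = minmax([2, 4, 6])
standardised = zscore([1, 2, 3])
```

Either function would be applied in the same way to the query and to the reference signals before they are compared.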

Nearest neighbor rank correlation of profiles

[0029] The table below provides an example where each row from profile 101 is annotated with the best matching row from profile 102 with the corresponding rank. The number values in the table below do not directly correspond to the example in Fig. 1 but constitute a different example.

[0030] Fig. 3 illustrates how the processor compares two expression profiles 301 and 302. The processor takes two expression profiles - A 101 and B 103. Each profile is ordered by descending abundance. The processor then takes the first signal in A 101 and compares it to the first signal in B by applying a signal comparison function as indicated by the arrows in Fig. 3. If the very first signals match, it can be said that A rank 1 matches to B rank 1, resulting in a score of 1. If they do not match, the processor continues comparing for A’s next N neighbors in B (if there is no match, an N+1 rank scoring penalty applies).

[0031] In the example of Fig. 3, the first signal in A 101 matches to the sixth signal in B 103, which results in a score of 6. The second signal in A 101 matches with the fifth signal in B 103 resulting in a score of 5 and for the third signal in A a score of 3.

[0032] First (top), the most abundant sequence/signal from X-profile A is compared to the most abundant sequence/signal from X-profile B. The rank of a ‘match’ is returned. The same applies for the 2nd (middle) and 3rd (bottom) most abundant signals from X-profile A. In Fig. 3, the AB rank sum for this example (top 3 from profile A) would be 6+5+3=14. A less similar X-profile C would produce a rank sum >>14, while a more similar one would produce a rank sum <14.

[0033] Pseudo-code:

for read1, read2 in product(reads1, reads2):
    distance = mlpy.dtw_std(reads1[read1], reads2[read2], dist_only=True)
    if distance within THRESH and not exceed N_TRIES:
        # signals match
        read2_idx.append(reads1.index(read))
        break
# at the end have two X ranks for read1, read2; can rank correlate / rank sum etc.

[0034] The result is a vector of rank-matches between A & B - A has a natural vector (just the indices ordered by abundance), while we’ve returned the vector of B in relation to A.

[0035] It is then possible to apply a rank correlation coefficient (Kendall Tau, etc.) to assess the ordinal association between the profiles, e.g. whether the transcript abundances of these signals vary together, as measured by tau and its p-value.
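The rank matching and rank correlation described above can be sketched as follows, reproducing the Fig. 3 rank sum of 14. This is a hedged illustration: the function names are hypothetical, exact string equality stands in for the signal comparison function, the search is assumed to cover the top N entries of B, and the coefficient shown is the simple Kendall tau-a without a p-value.

```python
from itertools import combinations


def match_ranks(profile_a, profile_b, is_match, n_tries):
    """For each entry of A (ordered by abundance), return the 1-based rank of
    its first match among the top n_tries entries of B, or n_tries + 1 as a
    penalty when no match is found."""
    ranks = []
    for entry_a in profile_a:
        rank = n_tries + 1  # penalty if nothing matches
        for position, entry_b in enumerate(profile_b[:n_tries], start=1):
            if is_match(entry_a, entry_b):
                rank = position
                break
        ranks.append(rank)
    return ranks


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Toy data laid out like Fig. 3: A's 1st, 2nd and 3rd entries sit at
# ranks 6, 5 and 3 of B. The labels are placeholders for signals.
profile_a = ["s1", "s2", "s3"]
profile_b = ["u1", "u2", "s3", "u4", "s2", "s1"]

ranks_b = match_ranks(profile_a, profile_b, lambda a, b: a == b, n_tries=10)
rank_sum = sum(ranks_b)
tau = kendall_tau(list(range(1, len(profile_a) + 1)), ranks_b)
```

In this toy case the order of the three signals is fully reversed between A and B, so the tau value is strongly negative even though all three signals were found.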

Sequence data

[0036] The stored signal data can be obtained directly from a sequencing machine (e.g. Oxford Nanopore devices such as MinION, GridION, PromethION, etc.) or indirectly by taking sequence data in basespace, such as generated by short read sequencing (Illumina), or from transcriptome annotations generated from de novo assembly of data, or cDNA sequencing using other technologies, and converting the nucleotide sequence into a similar ‘squiggle’ signal format with tools like DeepSimulator or Scrappie (Mozilla Public License Version 2.0).

Approaches to quantifying signal abundance / signal comparison

Dynamic time warping

[0037] Signal-to-signal alignments can be mapped via dynamic time warping (DTW), which has a complexity of O(N1·N2), where N1 and N2 are the lengths of the two sequences being compared, noting that this example relates to long reads. DTW additionally accommodates the discrepancy between the sampling rate of the electrical current measurements and the speed of the molecule passing through the pore. A DTW distance between two signals below a bootstrapped threshold constitutes a match. If no matching signal is present in the SQUID DB, the signal is recorded in the SQUID DB with a corresponding integer count; otherwise, if a matching signal is found, its abundance count is incremented.
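The DTW-based abundance counting described above can be sketched in plain Python. This is an illustrative assumption rather than the disclosed implementation: the in-memory list `db` stands in for the SQUID DB, the threshold is a fixed value rather than a bootstrapped one, and the names `dtw_distance` and `add_signal` are hypothetical.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance using the
    absolute difference as local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cost matrix
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def add_signal(db, signal, threshold):
    """Increment the count of the first stored signal within the DTW
    threshold, or store the signal as a new entry with count 1."""
    for entry in db:
        if dtw_distance(entry[0], signal) < threshold:
            entry[1] += 1
            return
    db.append([signal, 1])


db = []
add_signal(db, [1, 2, 3], threshold=1.0)
# Same shape read at a different speed: DTW absorbs the time stretch.
add_signal(db, [1, 1, 2, 2, 3], threshold=1.0)
# A clearly different signal becomes a new entry.
add_signal(db, [9, 9, 9], threshold=1.0)
```

The second signal is a time-stretched copy of the first, so it is counted as the same entry even though the two differ in length.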

Machine learning via signal processing and feature extraction

[0038] After obtaining the signal, it is possible to clean the signal through a 1D wavelet filter, recapitulate the signal, apply signal processing techniques to the regenerated signal (fast Fourier transform, power spectral density, auto-correlate, etc) to obtain a feature set of the signal. These features, when used with well-labelled, accurate training sets (can be on Illumina data transformed into signal-space) can be used as input for a classifier / model in established ML techniques.
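A few of the signal features named above (power spectrum, auto-correlation, peak co-ordinates) can be sketched with the standard library only. This is a simplified illustration under stated assumptions: the wavelet de-noising step is omitted, a naive discrete Fourier transform replaces a fast Fourier transform, and all function names are hypothetical.

```python
import cmath


def power_spectrum(signal):
    """Power at each non-negative frequency bin, via a naive O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def autocorrelation(signal, lag):
    """Normalised auto-correlation of the mean-centred signal at one lag."""
    mu = sum(signal) / len(signal)
    centred = [x - mu for x in signal]
    denom = sum(c * c for c in centred)
    if denom == 0:
        return 0.0
    return sum(centred[t] * centred[t + lag]
               for t in range(len(signal) - lag)) / denom


def peaks(signal):
    """(x, y) co-ordinates of strict local maxima."""
    return [(t, signal[t]) for t in range(1, len(signal) - 1)
            if signal[t - 1] < signal[t] > signal[t + 1]]


def feature_vector(signal, n_lags=3):
    """Concatenate spectrum, a few auto-correlation lags and the peak count."""
    lags = [autocorrelation(signal, lag) for lag in range(1, n_lags + 1)]
    return power_spectrum(signal) + lags + [float(len(peaks(signal)))]


toy_signal = [0, 1, 0, -1] * 2  # two cycles of a period-4 wave
features = feature_vector(toy_signal)
```

Such a feature vector, computed per signal, is the kind of input a classifier could be trained on; the choice of features and model is left open here, as it is in the text.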

[0039] The model can be included with the SQUID DB for different samples / tissues, so that we can extract features from newly sequenced signals and classify them according to our trained models.

• Extract signal

• 1D wavelets -> de-noised signal, FFT, PSD, AC, (x,y) co-ordinates for peaks

• Build feature set

• Train model

• Assess model ability to differentiate signals

Application example

Mouse tissue

[0040] RNA was extracted from 3 mouse tissues (Brain, Kidney, Testes) and 4 samples were sequenced: one from each tissue and one unknown sample (blind control). Each sample was sequenced on 4 Oxford Nanopore MinION R9.4.1 flowcells using a cDNA + PCR library preparation protocol.

• Mouse brain read count: 983,348

• Mouse kidney read count: 875,066

• Mouse testes read count: 1,749,002

• Blind control read count: 706,115

[0041] Base called data (sequences) for the 3 known samples (samples B, K & T) were used to generate X-profiles as follows:

1. Each reference sequence of the mouse reference transcriptome (ftp://ftp.ensembl.org/pub/release-93/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa) is used as a database entry (e.g. the first column/qualitative feature of the X-profile examples above);

2. Sequences were aligned to database entries using the Minimap2 software;

3. The most similar database entry to a base called sequence as determined by Minimap2 has the associated counter incremented;

4. Repeat (2.) until all base called sequences have been aligned.

5. Sort the database entries decreasingly by their abundance.
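The counting and sorting steps above can be sketched as follows. The sketch assumes the best-matching reference transcript for each read has already been determined by an aligner such as Minimap2 and is supplied as a list of identifiers; the function name `build_x_profile` and the toy identifiers are hypothetical.

```python
from collections import Counter


def build_x_profile(reference_ids, best_hits):
    """Count reads per reference entry and sort entries by decreasing abundance.

    reference_ids: identifiers of the database entries (reference transcripts).
    best_hits: one best-matching identifier per read, or None if unaligned.
    """
    counts = Counter({ref: 0 for ref in reference_ids})
    for hit in best_hits:
        if hit in counts:  # skip unaligned reads and unknown identifiers
            counts[hit] += 1
    # sorted() is stable, so ties keep the order of reference_ids.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


references = ["T1", "T2", "T3"]
hits = ["T2", "T1", "T2", None, "T2", "T3", "T1"]
x_profile = build_x_profile(references, hits)
```

Because the counter can be incremented one read at a time, the same structure also suits the growing profiles built from streaming data.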

[0042] A fourth X-profile (sample X) was then generated using an increasing number of reads. A first X-profile was generated as described above with the first 1000 base called reads from sample X (Xp-1k), then compared to samples B, K, & T using a rank sum correlation. The respective values are plotted in Figure 4.

[0043] A second X-profile (Xp-10k) was then generated by sampling a further 9000 base called reads from sample X (10,000 total abundance) and adding them to Xp-1k. Xp-10k was then compared to the 3 X-profiles from known samples as previously described, and plotted in Figure 4.

[0044] This was also performed for 50k and 100k total reads. The increasing size of the presented X-profiles represents a growing X-profile during the acquisition of streaming data, such as produced by real-time sequencing platforms. As demonstrated in Fig. 4, Sample X can rapidly be classified as Sample B, or brain tissue, by comparing the relative similarity scores (here, the rank sum correlation) across reference X-profiles.

[0045] A final X-profile (Xp-F) including all base called sequences from sample X was compared to the 3 X-profiles from known samples, generating a match to sample B (brain) with a P-value of 0.02 (Tau test, τ ~ 0.1). This result was found to be discriminatory, as matches to the X-profiles of the other tissues did not result in a significantly correlated ranking (τ ~ 0.1, P-values > 0.65).

[0046] Other application examples

[0047] Sepsis stratification

(1) Sequence patients with and without sepsis to generate X-profiles, labelled for clinical data such as severity of infection, nature of pathogen, source of infection, patient age, health outcomes, demographics, date;

(2) Classify X-profiles into reference categories based on discriminatory features of interest, such as acute sepsis versus non-sepsis profiles;

(3) Sequence blood of a patient with unknown status to generate an X-profile in real time;

(4) Compare the X-profile generated in real time to reference X-profiles to determine the most similar category;

(5) Use comparative X-profile scores to make a clinical diagnosis, stratification of patient risk, or treatment recommendation.

[0048] Another example: (Cancer of unknown primary)

(1) Generate X-profiles for normal human tissues and tumours;

(2) Sequence a biopsy of a carcinoma of unknown primary (a metastatic tumour with an unknown tissue of origin);

(3) Compare X-profile from biopsy to X-profiles from normal tissues to identify tissue of origin and help guide subsequent treatment.

(4) Alternatively, any other tumour can be compared to previously sequenced tumours to find a match.

[0049] Another example: (Sample identification/validation)

(1) Generate X-profiles of various cell lines to validate or identify cell lines or contamination of cell lines.

[0050] General case:

(1) Can be used to test a query transcription profile against a set of reference profiles to identify some transcription level difference/similarity.

(2) Can be used to assess a change in transcription, for example, by the host response to a disease, a pathogen, or a treatment.

Identifying tissue of origin for cancers of unknown primary

[0051] By efficient/accurate iterative clustering of nanopore read raw signal binned on similarity via (dtw/CNN/hashing/metric), we performed long read quantification of RNA sequencing full length transcripts, resulting in expression profiles (datastore of transcript signal and abundance) for 4 samples.

[0052] Further validation can be performed via construction of synthetic expression profiles from publicly available Illumina data, where nucleotide sequences are converted into synthetic nanopore signals and (kallisto/de novo whole transcriptome assembly) used to quantify transcript abundance. These matched tissue/sample differences were then compared to similar tissues/samples sequenced with ONT, and concordance was found between the meta expression profile analyses.

Real-time sequencing

[0053] There are clinical applications where a diagnosis should be available within a relatively short time window. For example, a patient presenting with sepsis may arrive at an emergency department of a hospital and a treatment needs to be commenced before a time-consuming sequencing process can be performed. Even the download of a full data file of the sequencing result may take too long for this situation.

[0054] With the method proposed herein, the profiles are formatted such that they are compatible with a real-time processing of the sequencing data stream. That is, the sequencing signal is received and while the sequencing signal is being received (before the full data is available), a diagnosis can be made by the proposed method. In this sense, the indication of abundances in the profiles is continuously updated and after every update or periodically (such as every minute or every 5 minutes) the profile is matched against the stored profiles. In particular, one of the stored profiles may be the typical profile of a sepsis patient and a good match indicates sepsis as a diagnosis and treatment can be commenced straight away and within a short time window, such as within 10 minutes or within 30 minutes. This also means that the receiving of the sequencing data can be stopped before the full data has been received and as soon as a diagnosis has been provided.

[0055] In this sense, the data stream is processed in real time, while the stream is being generated. For example, a whole genome sequencing such as Illumina sequencing may be performed off-site but the dataset is too large to transmit via a relatively slow internet connection. For example, it may take three days to transmit the entire dataset which is too long for some diagnoses, such as sepsis.

[0056] It is an advantage that the sequences are ordered by abundance and the matching score represents the difference in the position of the sequence within the ordered sequences, because the most abundant sequences are likely to be sequenced in larger numbers early and therefore provide a robust diagnosis. In other words, the diagnosis is performed based on the most abundant (i.e. most accurate) sequences. In one example, the comparison between profiles is not performed on all available sequences but only on the most abundant sequences (such as the top 10 or top 100 sequences).
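One possible form of such a position-based matching score is sketched below for illustration only. The specific formula (sum of absolute rank differences over the top `top_n` stored sequences, with a sequence absent from the sample treated as sitting at rank `top_n`) and the names `ranked` and `rank_difference_score` are assumptions of this sketch; the disclosure states only that the score represents the difference in position within the ordered sequences.

```python
from typing import Dict, List


def ranked(profile: Dict[str, float], top_n: int) -> List[str]:
    """Sequence ids ordered from most to least abundant, truncated to top_n.

    Ties are broken alphabetically so that the ordering is deterministic.
    """
    return sorted(profile, key=lambda s: (-profile[s], s))[:top_n]


def rank_difference_score(sample: Dict[str, float],
                          stored: Dict[str, float],
                          top_n: int = 10) -> int:
    """Sum of absolute rank differences; a lower score is a better match."""
    stored_pos = {s: i for i, s in enumerate(ranked(stored, top_n))}
    sample_pos = {s: i for i, s in enumerate(ranked(sample, len(sample)))}
    score = 0
    for seq_id, pos in stored_pos.items():
        # Assumption: a sequence not yet observed in the sample is placed at
        # rank top_n, i.e. just outside the compared window.
        score += abs(sample_pos.get(seq_id, top_n) - pos)
    return score


if __name__ == "__main__":
    stored = {"A": 50, "B": 30, "C": 20}
    sample = {"B": 5, "A": 3, "C": 1}
    print(rank_difference_score(sample, stored, top_n=3))
```

Only the ordering of abundances enters the score, so raw counts from a partially received stream can be compared against a stored profile without normalisation.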

[0057] In one example, there is a threshold on the matching score and the analysis (i.e. receiving of further sequences) is stopped as soon as the threshold is met. For example, where a higher matching score indicates a worse match, the analysis is stopped as soon as the matching score is below the threshold (such as 100 in the example of Fig. 3).

[0058] While the sequences may comprise base calls, it is also possible that they comprise a time domain electrical signal, also referred to as squiggle, which may be indicative of the current through a nanopore while the bases pass through the nanopore. The advantage of using squiggles is that it is not necessary to call bases from the squiggle (i.e. convert the squiggle into sequence), which speeds up the process and increases reliability as approximations are removed. It is also possible to use BLAST or minimap2 for sequence matching instead of DTW for squiggle matching.
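For illustration, squiggle-to-squiggle comparison may be sketched with a textbook dynamic time warping distance, on the assumption that the squiggle matching referred to is dynamic time warping (written "dtw" in [0051]). The function name `dtw_distance`, the absolute-difference cost, and the unconstrained quadratic-time formulation are choices made for this sketch; a practical implementation would likely normalise the signals and restrict the warping window.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping distance between two signals.

    cost[i][j] holds the cheapest alignment cost of a[:i] against b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


if __name__ == "__main__":
    # The second signal dwells longer on the middle level; warping absorbs it.
    print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))
```

The warping step tolerates differences in dwell time between two reads of the same molecule type, which is why no base calling is needed before comparison.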

[0059] It is noted that the method described herein is performed by a computer system comprising an input port to receive the sequences (such as USB) and a processor to create/update the expression profiles and to compare the expression profile against the database. The database may be local or remote and the comparison (i.e. calculating a matching score) may be performed remotely, such as in a cloud computing environment. It is noted that the bandwidth required for the cloud computing implementation is minimal because it is not necessary to upload the entire sequencing data set at once but only as it is generated by the sequencer. In that case, the library of expression profiles would also be stored in the cloud and matched there. This allows the use of relatively large libraries without the need for local data storage and without the need for full transfer of the entire sequencing data set as an upload from the sequencer. This has the significant technical advantage that the analysis of the sequencing data can be performed much faster because it is not necessary to wait for the upload to finish.

[0060] Fig. 5 illustrates a method 500 for diagnosis of sepsis in a sample from a patient using streaming data from a sequencer. The method comprises receiving 501 multiple sequences of the sample from the sequencer and generating 502 an expression profile for the sample. The expression profile comprises for each of the multiple sequences an indication of abundance of that sequence in the sample.

[0061] Method 500 also comprises receiving 503 further sequences as streaming data from the sequencer and while receiving 504 the further sequences, the method 500 comprises performing the steps of:

• updating 505 the expression profile for the sample;

• performing 506 a comparison of the expression profile for the sample to a stored expression profile indicative of an abundance of sequences when sepsis is present;

• determining 507 whether the patient has sepsis based on the comparison; and

• upon determining whether the patient has sepsis terminating 508 the receiving of the further sequences.
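Purely as an illustrative sketch, steps 501 to 508 may be arranged in a single loop as follows. The function name `diagnose_from_stream`, the parameters `threshold` and `check_every`, and the use of a sum of absolute differences in relative abundance as the comparison are assumptions of this sketch and stand in for whichever matching score is used in practice. A return value of `None` indicates that the stream ended without a determination.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def diagnose_from_stream(reads: Iterable[str],
                         sepsis_profile: Dict[str, float],
                         threshold: float,
                         check_every: int = 100) -> Tuple[Optional[bool], int]:
    """Return (True, reads consumed) once the sample matches the sepsis profile."""
    counts = Counter()
    consumed = 0
    for seq_id in reads:                     # 501 / 503: receive sequences
        counts[seq_id] += 1                  # 502 / 505: generate / update profile
        consumed += 1
        if consumed % check_every != 0:
            continue
        total = sum(counts.values())
        keys = set(counts) | set(sepsis_profile)
        score = sum(abs(counts[k] / total - sepsis_profile.get(k, 0.0))
                    for k in keys)           # 506: compare (lower is better)
        if score < threshold:                # 507: determine
            return True, consumed            # 508: stop pulling further reads
    return None, consumed                    # stream exhausted, no determination


if __name__ == "__main__":
    stream = iter(["A", "B"] * 50)
    print(diagnose_from_stream(stream, {"A": 0.5, "B": 0.5}, 0.1, check_every=10))
    print(len(list(stream)), "reads were never requested")
```

Returning from inside the loop is what implements the termination step: the remaining items of the stream are never requested from the sequencer.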

[0062] Fig. 6 illustrates method 600 for determining a state of a biological sample using streaming data from a sequencer. Method 600 comprises receiving 601 multiple sequences of the sample from the sequencer and generating 602 an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample.

[0063] Method 600 further comprises receiving further sequences as streaming data from the sequencer and while receiving 604 the further sequences performing the steps of:

• updating 605 the expression profile for the sample;

• performing 606 a comparison of the expression profile for the sample to one or more stored expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample;

• determining 607 the state of the sample as the state associated with the matching stored expression profile; and

• upon determining the state of the sample terminating 608 the receiving of the further sequences.
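The comparison 606 and determination 607 against a library of stored profiles may, as one hedged example, look as follows. The library layout (a mapping from state label to stored profile), the function names `l1_distance` and `determine_state`, the distance used, and the acceptance `threshold` are all assumptions made for this sketch.

```python
from typing import Dict, Optional


def l1_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Sum of absolute abundance differences over all sequences in either profile."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def determine_state(sample_profile: Dict[str, float],
                    library: Dict[str, Dict[str, float]],
                    threshold: float) -> Optional[str]:
    """Return the state of the closest stored profile, or None if none is close enough."""
    best_state, best_score = None, float("inf")
    for state, stored_profile in library.items():
        score = l1_distance(sample_profile, stored_profile)
        if score < best_score:
            best_state, best_score = state, score
    return best_state if best_score < threshold else None


if __name__ == "__main__":
    library = {
        "sepsis": {"A": 0.7, "B": 0.3},
        "healthy": {"A": 0.2, "B": 0.8},
    }
    print(determine_state({"A": 0.65, "B": 0.35}, library, threshold=0.2))
```

A caller would invoke `determine_state` after each profile update and terminate 608 the receiving of further sequences as soon as the result is not `None`.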

[0064] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
