Title:
METHODS FOR THE MAINTENANCE AND ANALYSIS OF BIOLOGICAL DATA
Document Type and Number:
WIPO Patent Application WO/2007/001195
Kind Code:
A1
Abstract:
The invention relates to a computer program for assisting analysis of biological data, the software being adapted to receive biological data indicating two or more sequences, and the software comprising: a comparison module for conducting an alignment between two or more sequences to determine regions of similarity, a phylogenetic tree constructor for constructing a phylogenetic tree from sequence data, a first visualisation tool for displaying sequences and alignments between sequences, and a second visualisation tool for providing a graphical representation of a phylogenetic tree based on output from the tree constructor, wherein the biological sequence data can comprise one or more of whole genome data, nucleotide sequence data and protein sequence data.

Inventors:
DRUMMOND ALEXEI (NZ)
Application Number:
PCT/NZ2006/000165
Publication Date:
January 04, 2007
Filing Date:
June 27, 2006
Assignee:
BIOMATTERS LTD (NZ)
DRUMMOND ALEXEI (NZ)
International Classes:
G16B50/30; G16B10/00; G16B30/10; G16B45/00
Domestic Patent References:
WO2004053769A22004-06-24
WO2004021146A22004-03-11
WO2002025564A12002-03-28
Foreign References:
US20050196807A12005-09-08
US20040139051A12004-07-15
Other References:
WU ET AL.: "From Sequence to Structure To Literature: The Protocol Approach to BioInformation", PACIFIC SYMPOSIUM ON BIOCOMPUTING, 1998, pages 747 - 758, XP008074573
Attorney, Agent or Firm:
ADAMS, Matthew, D et al. (6th Floor Huddart Parker Building Po Box 94, Wellington 6015, NZ)
Claims:
CLAIMS

1. A computer program for assisting analysis of biological data, the software adapted to receive biological data indicating two or more sequences, and the software comprising: a comparison module for conducting an alignment between two or more sequences to determine regions of similarity, a phylogenetic tree constructor for constructing a phylogenetic tree from sequence data, a first visualisation tool for displaying sequences and alignments between sequences, and a second visualisation tool for providing a graphical representation of a phylogenetic tree based on output from the tree constructor, wherein the biological sequence data can comprise one or more of whole genome data, nucleotide sequence data and protein sequence data.

2. A computer program according to claim 1 wherein the computer program is adapted to receive biological data from a database that has been compiled from a plurality of databases.

3. A computer program according to any preceding claim wherein the computer program is adapted to update sequence data by conducting a search of biological databases.

4. A computer program according to any preceding claim further comprising a search module adapted to query databases containing research publication data and providing the publication data to a user.

5. A computer program according to any preceding claim adapted to conduct searches on both remote and local databases.

6. A computer program according to any preceding claim adapted to conduct BLAST searches on the nucleotide sequence and/or protein sequence databases.

7. A computer program according to any preceding claim further adapted to filter and/or sort data in real-time as the data is retrieved.

8. A computer program according to any preceding claim further comprising a module to display and analyse instrumental data relating to sequences.

9. A computer program according to any preceding claim, wherein the computer program is adapted to store one or more of the following: a) nucleotide sequences, b) protein sequences, c) phylogenetic trees, d) three-dimensional molecular structures, e) sequence alignments, f) publication data, g) documents.

10. A computer program according to any preceding claim further comprising an application program interface for integration with external software applications.

11. A computer program according to any preceding claim further comprising a graphing module to visualise similarities between sequences by way of a graph.

12. A computer program according to any preceding claim further comprising a visualisation module for rendering three dimensional molecule structures.

13. A computer program according to any preceding claim further comprising a published article visualisation module for displaying published articles, citation data and abstracts to a user.

14. A computer program according to any preceding claim wherein the sequence and alignment visualisation module displays similarity across all sequences for every position.

15. A computer program according to any preceding claim wherein the sequence and alignment visualisation module displays hydrophobicity of each element of a protein sequence.

16. A computer program according to any preceding claim wherein the sequence and alignment visualisation module displays the isoelectric point of a protein at every element position.

17. A computer program according to any preceding claim wherein the sequence and alignment visualisation module determines and displays statistics relating to nucleotide or amino acid elements.

18. A computer program according to any preceding claim wherein the sequence and alignment visualisation module displays statistics for the number of sites that are identical across multiple sequences in an alignment.

19. A computer program according to any preceding claim wherein the sequence and alignment visualisation module displays annotations to a sequence and/or alignment.

20. A computer program according to any preceding claim wherein sequences and alignments can be saved and edited.

21. A computer program according to any preceding claim adapted to translate nucleotide sequences to protein sequences.

22. A computer program according to any preceding claim adapted to locate genes within a nucleotide sequence.

23. A computer program according to any preceding claim wherein the biological data is one or more of nucleotide sequence data, amino acid sequence data, protein structure data, nucleotide expression data, protein expression data, ligand-ligand interaction data comprising protein-ligand, enzyme-substrate and enzyme-inhibitor interaction data, genomic data, morphological or taxonomic data, metabolomic data, proteomic data, pharmacological data, pharmacogenomic data, pharmacokinetic data, and citation and/or publication data.

24. A method of compiling biological data from data storage devices connected together within a distributed network, the method comprising the steps of: selecting a first storage device within the network, the storage device having biological data stored thereon; retrieving at least one biological data set from the first storage device; selecting a second storage device within the network, the storage device having biological data stored thereon; retrieving at least one biological data set from the second storage device using search criteria derived at least partly from the biological data set(s) retrieved from the first storage device; merging the biological data set(s) retrieved from the first storage device with the biological data set(s) retrieved from the second storage device; and storing the merged biological data set(s) in a destination storage device within the network.

25. A method according to claim 24 wherein the step of merging the biological data set(s) comprises the steps of maintaining a data structure in the destination storage device; and conforming the retrieved biological data set(s) to the data structure maintained in the destination storage device.

26. A method according to claim 24 or 25 further comprising the steps of monitoring one or more of the data storage devices within the network for modification notifications; and selecting the first and/or second storage device from the data storage device(s) for which modification notifications have issued.

27. A method according to any one of claims 24 to 26 wherein said biological data is selected from the group comprising nucleotide sequence data, amino acid sequence data, protein structure data, nucleotide expression data, protein expression data, ligand-ligand interaction data comprising protein-ligand, enzyme-substrate, and enzyme-inhibitor interaction data, genomic data, morphological or taxonomic data, metabolomic data, proteomic data, pharmacological data, pharmacogenomic data, pharmacokinetic data, and citation and/or publication data.

28. A computer program for compiling biological data from data storage devices connected together within a distributed network, the computer program adapted to: select a first storage device within the network, the storage device having biological data stored thereon; retrieve at least one biological data set from the first storage device; select a second storage device within the network, the storage device having biological data stored thereon; retrieve at least one biological data set from the second storage device using search criteria derived at least partly from the biological data set(s) retrieved from the first storage device; merge the biological data set(s) retrieved from the first storage device with the biological data set(s) retrieved from the second storage device; and store the merged biological data set(s) in a destination storage device within the network.

29. A computer program according to claim 28 wherein merging the biological data set(s) comprises the steps of maintaining a data structure in the destination storage device; and conforming the retrieved biological data set(s) to the data structure maintained in the destination storage device.

30. A computer program according to claim 28 or 29 further adapted to monitor one or more of the data storage devices within the network for modification notifications; and selecting the first and/or second storage device from the data storage device(s) for which modification notifications have issued.

Description:

METHODS FOR THE MAINTENANCE AND ANALYSIS OF BIOLOGICAL DATA

FIELD OF INVENTION

The present invention provides systems, software and methods for the analysis, integration and management of information, and more particularly to systems and methods for the analysis, integration and management of bioinformatics including information relating to the life sciences, biotechnology, therapeutics, diagnostics, pharmaceuticals, although the systems and methods may be applied to other scientific, business and information-oriented endeavours.

BACKGROUND

It is widely accepted that the rate at which data is generated worldwide is growing exponentially. Using a single data repository as an example, in the six months from GenBank Release 144 in October 2004 to GenBank Release 147 in April 2005, the GenBank database grew from 38 million sequence entries totalling more than 43 billion base pairs, to more than 44 million sequence entries totalling over 48 billion base pairs.

Current estimates indicate that biological data volumes are doubling in size every 12 months. Problematically, biological data are heterogeneous, are contained in heterogeneous databases that are not normalised, utilise different formats and file-naming conventions for storage and require different systems for access, and may contain duplicated, incomplete or inaccurate information. Further, web-based systems for accessing data lack persistence, return search results that are not readily parsable by the human eye, and lack the ability to notify researchers of newly available information pertinent to their research. The diversity of biological data, of the systems required to access and analyse that data, and of the output of such analyses places a substantial burden on those seeking to convert data to knowledge. The lack of unified, global, real-time or near-real-time, updatable data access and analysis is detrimental to the generation of the knowledge required for research decisions and other critical business decisions to be soundly made. For example, the difficulties in new product discovery, product development, lead compound identification, product testing and validation, and time-to-market would all be eased if such data access and analysis were available.

The functional integration of heterogeneous biological data, including, for example, nucleotide sequence information, taxonomic data, protein expression data, chemical and protein secondary structure data, whole genome data, scientific and patent publication data and medical/biological publication data, is crucial.

Current data management and analysis software is generally based on either the web forms of databases such as NCBI and EBI themselves, academic packages that tackle one specific function only, or proprietary platforms for network and applications analysis, and utilises platform technologies such as SQL with open database connectivity (ODBC), component object model (COM), Object Linking and Embedding (OLE) and/or proprietary applications for analysis, as evidenced in issued patents, for example patents issued or assigned to such companies as Sybase, Kodak, IBM, and Cellomics in U.S. Pat. Nos. 6,161,148, 6,132,969, 5,989,835, and 5,784,294, for data management and analysis, each of which is hereby incorporated by reference.

The heterogeneity of biological data has rendered it largely intractable to ready management and analysis by current methodologies and technologies, particularly within appropriate research or development timeframes, and with desired levels of security.

Moreover, the rate at which data is generated coupled with a lack of continuously updatable datasets means any analysis performed on an existing dataset will rapidly become outdated and frequently irrelevant. Any ongoing analysis of a given dataset cannot currently take advantage of data generated after the analysis was begun. This means both that the results of any such analysis may not reflect current data, and that to make use of the most up-to-date data so as to provide an up-to-date result, one must constantly restart the analysis. Clearly, this is undesirable.

SUMMARY OF INVENTION

There is, therefore, a growing and unmet need for methodologies for bi-directional transfer, analysis, and continuous updating of heterogeneous data from various, distributed sources in real-time, that are capable of functionally integrating this heterogeneous data, preferably without manual intervention.

There remains a need for systems, methods, and architectures that overcome these problems and limitations.

It is an object of the present invention to address the foregoing problems or at least to provide the public with a useful choice.

In one form the invention comprises a method of compiling biological data from data storage devices connected together within a distributed network, the method comprising the steps of selecting a first storage device within the network, the storage device having biological data stored thereon; retrieving at least one biological data set from the first storage device; selecting a second storage device within the network, the storage device having biological data stored thereon; retrieving at least one biological data set from the second storage device using search criteria derived at least partly from the biological data set(s) retrieved from the first storage device; merging the biological data set(s) retrieved from the first storage device with the biological data set(s) retrieved from the second storage device; and storing the merged biological data set(s) in a destination storage device within the network.

Preferably the step of merging the biological data set(s) comprises the steps of maintaining a data structure in the destination storage device; and conforming the retrieved biological data set(s) to the data structure maintained in the destination storage device.

Preferably the method further comprises the steps of monitoring one or more of the data storage devices within the network for modification notifications; and selecting the first and/or second storage device from the data storage device(s) for which modification notifications have issued.
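By way of illustration only, the compiling steps set out above (retrieve from a first storage device, derive search criteria from what was retrieved, retrieve from a second storage device, merge by conforming to a maintained data structure, and store) might be sketched in Python as follows. The names `Record`, `Store`, `conform` and `compile_data`, and the field names "organism" and "gene", are hypothetical and are not taken from the specification; the in-memory `Store` merely stands in for a networked storage device.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One biological data set (hypothetical minimal shape)."""
    accession: str
    source: str
    fields: dict = field(default_factory=dict)


class Store:
    """Hypothetical in-memory stand-in for a storage device on the network."""

    def __init__(self, name, records=None):
        self.name = name
        self.records = list(records or [])

    def search(self, predicate):
        return [r for r in self.records if predicate(r)]


def conform(record, schema):
    """Conform a record to the destination data structure:
    keep only the schema's keys and fill absent ones with None."""
    return Record(record.accession, record.source,
                  {key: record.fields.get(key) for key in schema})


def compile_data(first, second, destination, organism, schema):
    # Retrieve data sets from the first storage device.
    first_hits = first.search(
        lambda r: organism in (r.fields.get("organism") or ""))
    # Derive search criteria for the second device from the first results
    # (an assumption for this sketch: the gene names found so far).
    genes = {r.fields["gene"] for r in first_hits if r.fields.get("gene")}
    # Retrieve data sets from the second storage device using those criteria.
    second_hits = second.search(lambda r: r.fields.get("gene") in genes)
    # Merge: conform every record to the destination schema, de-duplicating
    # on (source, accession).
    merged = {}
    for record in first_hits + second_hits:
        merged[(record.source, record.accession)] = conform(record, schema)
    # Store the merged data sets in the destination storage device.
    destination.records.extend(merged.values())
    return list(merged.values())
```

In this sketch the derived criteria are simply a set of gene names; any attribute of the first result set could play the same role.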

In one embodiment, said biological data is selected from the group comprising nucleotide sequence data, amino acid sequence data, protein structure data, nucleotide expression data, protein expression data, ligand-ligand interaction data comprising protein-ligand, enzyme-substrate, and enzyme-inhibitor interaction data, genomic data, morphological or taxonomic data, metabolomic data, proteomic data, pharmacological data, pharmacogenomic data, pharmacokinetic data, and citation and/or publication data.

In another form the invention comprises a computer program for compiling biological data from data storage devices connected together within a distributed network, the computer program adapted to: select a first storage device within the network, the storage device having biological data stored thereon; retrieve at least one biological data set from the first storage device; select a second storage device within the network, the storage device having biological data stored thereon; retrieve at least one biological data set from the second storage device using search criteria derived at least partly from the biological data set(s) retrieved from the first storage device; merge the biological data set(s) retrieved from the first storage device with the biological data set(s) retrieved from the second storage device; and store the merged biological data set(s) in a destination storage device within the network.

In another form the invention comprises a computer program for assisting analysis of biological data, the software adapted to receive biological data indicating two or more sequences, and the software comprising: a comparison module for conducting an alignment between two or more sequences to determine regions of similarity, a phylogenetic tree constructor for constructing a phylogenetic tree from sequence data, a first visualisation tool for displaying sequences and alignments between sequences, and a second visualisation tool for providing a graphical representation of a phylogenetic tree based on output from the tree constructor, wherein the biological sequence data can comprise one or more of whole genome data, nucleotide sequence data and protein sequence data.
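As a non-limiting illustration of how a phylogenetic tree constructor might derive a tree from aligned sequence data, the following Python sketch clusters sequences by UPGMA using the proportion of differing sites as the distance. UPGMA and the p-distance are choices made for this sketch only; the specification does not state which tree-building method is used, and the function names `p_distance` and `upgma` are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """Cluster aligned sequences (name -> sequence) by UPGMA and return
    the tree topology as a Newick string without branch lengths."""
    sizes = {name: 1 for name in aligned}  # cluster label -> leaf count
    dist = {frozenset((a, b)): p_distance(aligned[a], aligned[b])
            for a, b in combinations(aligned, 2)}
    while len(sizes) > 1:
        # Join the two closest clusters.
        a, b = sorted(min(dist, key=dist.get))
        joined = f"({a},{b})"
        joined_size = sizes[a] + sizes[b]
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two members.
        for other in sizes:
            if other in (a, b):
                continue
            dist[frozenset((joined, other))] = (
                dist[frozenset((a, other))] * sizes[a]
                + dist[frozenset((b, other))] * sizes[b]
            ) / joined_size
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        del sizes[a], sizes[b]
        sizes[joined] = joined_size
    return next(iter(sizes)) + ";"
```

The Newick string returned here is the kind of output that a second visualisation tool could render graphically.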

Embodiments of the invention are directed toward an extensible architecture for the functional integration of diverse applications, themselves able to access, analyse and continuously update distributed, heterogeneous datasets. Preferably, all these functions are integrated into a unified software "research engine". This framework allows a user to:

o access and retrieve data from heterogeneous databases, comprising genome/nucleotide sequence data, protein sequence and structure data, and publication/bibliographic data; and

o compare nucleotide and/or protein sequences between the same species and/or different species for the purposes of determining the evolutionary relationships between them, comprising carrying out pairwise alignments and/or multiple alignments to determine the regions of similarity (homology) between sequences, and constructing phylogenetic trees from the sequence data.
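The pairwise alignment referred to above can be illustrated with a minimal global alignment by dynamic programming (the Needleman-Wunsch approach). This is a sketch under stated assumptions: the specification does not name the alignment algorithm, and the scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name `needleman_wunsch` are chosen here purely for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b.
    Returns (aligned_a, aligned_b, score) with '-' marking gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]
```

Columns in which the two returned strings carry the same residue correspond to the regions of similarity mentioned above; a multiple alignment would typically be built up from such pairwise comparisons.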

Preferably, all functionality, from data access and retrieval through to data storage, data analysis and results management, is seamlessly unified under a single user-interface. The user-interface provides access to visualisation tools allowing the user to view nucleotide sequences, protein sequences and structures, sequence alignments and annotations, and phylogenetic trees.

The invention preferably provides a framework to allow a range of bioinformatics computing tasks to be performed on a continuously updatable data-set. The range of functions, types of data, and types of visualisations described above and in the following descriptions are typical of those toward which the invention is directed, but it should be clear that the invention is not limited to these types of data and functions. A core feature of the invention is that the framework is extensible, allowing for the integration of any desired application for accessing, analysing and/or updating relevant data-sets, or indeed the integration of any software tool that might enhance the performance or usability of the research engine.

Examples of applications and tools that could be integrated into the framework comprise (but are not limited to):

o Applications that provide the ability to continuously and automatically update data through user-defined, automated searches (such applications are sometimes referred to as "Agents").

o Applications that allow publications such as articles from the biomedical literature to be stored, indexed and searched.

o Specialised applications to run specific types of technical searches on databases such as nucleotide and/or protein sequence databases, for example BLAST searches.

o Applications that can run searches on data stored in both external databases and the local database.

o Applications that allow data to be filtered and sorted during searches by user-defined criteria, comprising filtering "on-the-fly" (i.e. as the data is being downloaded from an external networked database).

o Applications enabling Peer-to-Peer functionality and grid-computing.

o Visualization tools to view graphs comparing similarity between sequences (for example, dotplots).

o Visualization tools to view three-dimensional molecular structures.

o Visualization tools allowing the user to view details of published articles, comprising citation data, abstracts, full articles, and a link to Google Scholar.

o Applications that can process data generated directly by gene and protein sequencing equipment (e.g. chromatograms).

o Visualization tools that display the hydrophobicity of proteins at every position along the sequence, and the average hydrophobicity when there are multiple sequences.

o Visualization tools that display the isoelectric point of proteins at every position along the sequence, and the average isoelectric point when multiple sequences are being viewed.

o Visualization tools that display statistics for the frequency of each nucleotide or amino acid over the entire length of the sequence, or over a user-selected length of the sequence, and that display statistics for the number of sites that are identical across multiple sequences in an alignment.

o Tools that display annotations to the sequences and/or the alignments, and which allow for new annotations to sequences and alignments to be added and saved.

o Tools allowing sequences to be manually edited.

o Tools for the translation of nucleotide sequences to protein sequences.

o Tools for finding genes in nucleotide sequences.
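Three of the simpler tools listed above (translation of nucleotide sequences to protein sequences, residue frequency statistics over a whole or user-selected length, and the count of sites identical across an alignment) might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the standard genetic code and translation in the first reading frame only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nucleotides):
    """Translate a nucleotide sequence in frame 1, stopping at the first
    stop codon; unrecognised codons are reported as 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def residue_frequencies(seq, start=0, end=None):
    """Relative frequency of each nucleotide or amino acid over the whole
    sequence, or over a selected window seq[start:end]."""
    window = seq[start:end]
    if not window:
        return {}
    return {res: n / len(window) for res, n in Counter(window).items()}


def identical_sites(alignment):
    """Number of alignment columns in which every sequence carries the
    same residue (all-gap columns are not counted)."""
    return sum(1 for column in zip(*alignment)
               if len(set(column)) == 1 and column[0] != "-")
```

For instance, `translate("ATGCTGTATGAATAA")` yields `"MLYE"` under the standard code.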

Such applications and tools as those described above may be purpose-written to work within the framework of the invention. Alternatively such applications may be third-party existing commercial or open-source modules which can be integrated into the extensible architecture of the invention as "plugins". In order to facilitate the functional integration of such plugin modules, the invention may contain an Application Programming Interface (API), which allows both in-house and third-party programmers to write applications to the framework of the invention.
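One possible shape for such a plugin interface is sketched below in Python. This is an assumption made for illustration, not the API of the invention: the class names `Plugin`, `PluginRegistry` and `ReverseComplement`, and the representation of a document as a dictionary with "type" and "sequence" keys, are all hypothetical.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Hypothetical contract that every plugin module implements."""
    name: str

    @abstractmethod
    def accepts(self, document) -> bool:
        """Return True if this plugin can operate on the document."""

    @abstractmethod
    def run(self, document):
        """Perform the plugin's operation and return its result."""


class PluginRegistry:
    """Holds the plugins known to the framework and dispatches to them."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin):
        if plugin.name in self._plugins:
            raise ValueError(f"plugin already registered: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def available_for(self, document):
        """Names of the plugins applicable to the given document."""
        return [p.name for p in self._plugins.values() if p.accepts(document)]

    def run(self, name, document):
        return self._plugins[name].run(document)


class ReverseComplement(Plugin):
    """Example third-party plugin operating on nucleotide documents."""
    name = "reverse-complement"

    def accepts(self, document):
        return document.get("type") == "nucleotide"

    def run(self, document):
        complement = str.maketrans("ACGT", "TGCA")
        return document["sequence"].translate(complement)[::-1]
```

A framework built this way could offer the user only those plugins whose `accepts` check passes for the currently selected document, which is one way of keeping third-party modules decoupled from the core.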

It is towards such an extensible architecture for the functional integration of diverse applications themselves able to access, analyse and continuously update distributed, heterogeneous datasets, that the present invention is directed. Further aspects and advantages of the present invention will become apparent from the ensuing description which is given by way of example only.

The term "comprising" as used in the claims means "consisting at least in part of". When interpreting the claims which include that term, the features, prefaced by that term in each statement, may be present but other features can also be present. Related terms such as "comprise" and "comprised" are to be interpreted in the same manner.

In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, reference to such external documents is not to be construed as an admission that such documents, or such sources of information, in any jurisdiction, are prior art, or form part of the common general knowledge in the art.

To those skilled in the art to which the invention relates, many changes in construction and widely differing embodiments and applications of the invention will suggest themselves without departing from the scope of the invention as defined in the appended claims. The disclosures and the descriptions herein are purely illustrative and are not intended to be in any sense limiting.

BRIEF DESCRIPTION OF THE FIGURES

Figure 1 shows a graphical representation of the various component databases comprising the UniProt/Swiss-Prot database.

Figure 2 shows a graphical representation of a computer system 100 suitable for use in the invention.

Figure 3 shows a graphical representation of a distributed network 200 suitable for use in the invention.

Figure 4 shows a general overview of a distributed computing environment of the invention.

Figure 5 is a screenshot showing a user interface of the invention.

Figure 6 is a screenshot showing plugins that are available for use.

Figure 7 is a screenshot showing the displayed results of a user search.

Figure 8 is a screenshot showing the user interface for creating a user-specified agent.

Figure 9 is a screenshot of a user interface showing a sequence alignment.

Figure 10 is a flow diagram showing how a multiple sequence alignment is carried out.

Figure 11 is a screenshot showing a text view of a sequence alignment.

Figure 12 is a screenshot showing a graphical view of a sequence alignment.

Figure 13 is a screenshot showing a plug-in module for creating a phylogenetic tree.

Figure 14 is a screenshot showing a graphical representation of a phylogenetic tree.

Figure 15 is a screenshot showing a chromatogram analysis.

Figure 16 is a flow diagram showing a BLAST search.

Figure 17 is a screenshot showing a dotplot.

Figure 18 is a screenshot showing a graphical display of a three-dimensional structure visualisation.

DETAILED DESCRIPTION OF THE INVENTION

Obtaining Information from Heterogeneous Data Sources

The rate of biological data generation is growing exponentially. For example, the development of methodologies such as high-throughput sequencing, DNA, RNA and protein microarrays, and large scale protein structure determination has led to an almost unimaginable rate of creation of sequence and functional information. Genomics, metabolomics, proteomics, bioinformatics and related disciplines generate and utilise biological information in ever-increasing levels. For example, in 1982, the number of biological sequences deposited in GenBank at the National Center for Biotechnology Information (NCBI) was 606, comprised of only 680,338 base pairs. As of April 2005, approximately 44 million sequence entries totalling over 48 billion base pairs are deposited in GenBank.

Biological data is extraordinarily heterogeneous in nature. Biological data, as used herein, comprises any data or information relating to a biological entity or system, such as, for example, nucleotide sequence data, amino acid sequence data, protein structure data, nucleotide expression data, protein expression data, ligand-ligand interaction data comprising protein-ligand, enzyme-substrate, and enzyme-inhibitor interaction data, genomic data, morphological or taxonomic data, metabolomic data, proteomic data, pharmacological data, pharmacogenomic data, pharmacokinetic data, and citation and/or publication data.

Biological data is stored in a great number of diverse databases distributed across the globe. Three examples of such repositories of biological data are given below.

The Entrez Nucleotides database is a collection of sequences from several sources, comprising GenBank, RefSeq, and PDB, and is perhaps the most widely used biological database. The data collected is highly heterogeneous, and constantly changing, as described above. A record for a given database may contain a generalised format for the data contained therein, while the format and content of records of the various databases can differ greatly. An example of a sample GenBank record is available at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html. A further example of a GenBank record is presented below.

BC039879. Reports Homo sapiens tRNA...[gi:25058849] Links

LOCUS       BC039879     1215 bp    mRNA    linear   PRI 07-OCT-2003
DEFINITION  Homo sapiens tRNA selenocysteine associated protein, mRNA (cDNA
            clone MGC:46069 IMAGE:5769935), complete cds.
ACCESSION   BC039879
VERSION     BC039879.1  GI:25058849
KEYWORDS    MGC.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1  (bases 1 to 1215)
  AUTHORS   Strausberg,R.L., Feingold,E.A., Grouse,L.H., Derge,J.G.,
            Klausner,R.D., Collins,F.S., Wagner,L., Sheranen,C.M.,
            Schuler,G.D., Altschul,S.F., Zeeberg,B., Buetow,K.H.,
            Schaefer,C.F., Bhat,N.K., Hopkins,R.F., Jordan,H., Moore,T.,
            Max,S.I., Wang,J., Hsieh,F., Diatchenko,L., Marusina,K.,
            Farmer,A.A., Rubin,G.M., Hong,L., Stapleton,M., Soares,M.B.,
            Bonaldo,M.F., Casavant,T.L., Scheetz,T.E., Brownstein,M.J.,
            Usdin,T.B., Toshiyuki,S., Carninci,P., Prange,C., Raha,S.S.,
            Loquellano,N.A., Peters,G.J., Abramson,R.D., Mullahy,S.J.,
            Bosak,S.A., McEwan,P.J., McKernan,K.J., Malek,J.A.,
            Gunaratne,P.H., Richards,S., Worley,K.C., Hale,S., Garcia,A.M.,
            Gay,L.J., Hulyk,S.W., Villalon,D.K., Muzny,D.M., Sodergren,E.J.,
            Lu,X., Gibbs,R.A., Fahey,J., Helton,E., Ketteman,M., Madan,A.,
            Rodrigues,S., Sanchez,A., Whiting,M., Madan,A., Young,A.C.,
            Shevchenko,Y., Bouffard,G.G., Blakesley,R.W., Touchman,J.W.,
            Green,E.D., Dickson,M.C., Rodriguez,A.C., Grimwood,J.,
            Schmutz,J., Myers,R.M., Butterfield,Y.S., Krzywinski,M.I.,
            Skalska,U., Smailus,D.E., Schnerch,A., Schein,J.E., Jones,S.J.
            and Marra,M.A.
  TITLE     Generation and initial analysis of more than 15,000 full-length
            human and mouse cDNA sequences
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 99 (26), 16899-16903 (2002)
   PUBMED   12477932
REFERENCE   2  (bases 1 to 1215)
  AUTHORS   Strausberg,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2002) National Institutes of Health, Mammalian
            Gene Collection (MGC), Cancer Genomics Office, National Cancer
            Institute, 31 Center Drive, Room 11A03, Bethesda, MD 20892-2590,
            USA
  REMARK    NIH-MGC Project URL: http://mgc.nci.nih.gov
COMMENT     Contact: MGC help desk

Email: cgapbs-r@mail.nih.gov Tissue Procurement: Life Technologies, Inc. cDNA Library Preparation: Life Technologies, Inc. cDNA Library Arrayed by: The I.M.A. G. E. Consortium (LLNL)

DNA Sequencing by: National Institutes of Health Intramural Sequencing Center (NISC) , Gaithersburg, Maryland;

Web site: http: //www.nisc.nih. qov/

Contact: nisc_mgc@nhgri.nih. gov

Akhter,N., Ayele,K., Beckstrom-Sternberg,S.M., Benjamin,B., Blakesley,R.W., Bouffard,G.G., Breen,K., Brinkley,C., Brooks,S., Dietrich,N.L., Granite,S., Guan,X., Gupta,J., Haghighi,P., Hansen,N., Ho,S.-L., Karlins,E., Kwong,P., Laric,P., Legaspi,R., Maduro,Q.L., Masiello,C., Maskeri,B., Mastrian,S.D., McCloskey,J.C., McDowell,J., Pearson,R., Stantripop,S., Thomas,P.J., Touchman,J.W., Tsurgeon,C., Vogt,J.L., Walker,M.A., Wetherby,K.D., Wiggins,L., Young,A., Zhang,L.-H. and Green,E.D.

Clone distribution: MGC clone distribution information can be found through the I.M.A.G.E. Consortium/LLNL at: http://image.llnl.gov

Series: IRAK Plate: 79 Row: p Column: 6

This clone was selected for full length sequencing because it passed the following selection criteria: matched mRNA gi: 8923459.

Differences found between this sequence and the human reference genome (build 35) are described in misc_difference features below.

FEATURES Location/Qualifiers

source 1..1215

/organism="Homo sapiens"

/mol_type="mRNA"

/db_xref="taxon:9606"

/clone="MGC:46069 IMAGE:5769935"

/tissue_type="Brain, fetal, whole pooled"

/clone_lib="NIH_MGC_121"

/lab_host="DH10B"

/note="Vector: pCMV-SPORT6"

gene 1..1215

/gene="SECP43"

/note="synonyms: FLJ20503, PRO1902"

/db_xref="GeneID:54952"

CDS 491..1024

/gene="SECP43"

/codon_start=1

/product="SECP43 protein"

/protein_id="AAH39879.1"

/db_xref="GI:25058850"

/db_xref="GeneID:54952"

/translation="MLYEFFVKVYPSCRGGKVVLDQTGVSKGYGFVKFTDELEQKRALTECQGAVGLGSKPVRLSVAIPKASRVKPVEYSQMYSYSYNQYYQQYQNYYAQWGYDQNTGSYSYSYPQYGYTQSTMQTYEEVGDDALEDPMPQLDVTEANKEFMEQSEELYDALMDCHWQPLDTVSSEIPAMM"

misc_feature 491..670

/gene="SECP43"

/note="rrm; Region: RNA recognition motif (a.k.a. RRM, RBD, or RNP domain). The RRM motif is probably diagnostic of an RNA binding protein. RRMs are found in a variety of RNA binding proteins, including various hnRNP proteins, proteins implicated in regulation of alternative splicing, and protein components of snRNPs. The motif also appears in a few single stranded DNA binding proteins. The RRM structure consists of four strands and two helices arranged in an alpha/beta sandwich, with a third helix present during RNA binding in some cases. The C-terminal beta strand (4th strand) and final helix are hard to align and have been omitted in the SEED alignment. The LA proteins have a N terminus rrm which is included in the seed. There is a second region towards the C terminus that has some features of a rrm but does not appear to have the important structural core of a rrm. The LA proteins are one of the main autoantigens in Systemic lupus erythematosus (SLE), an autoimmune disease"

/db_xref="CDD:pfam00076"

misc_difference 1133..1215

/gene="SECP43"

/note="polyA tail: 83 bases do not align to the human genome."

ORIGIN

1 cggtgcgcgg gtatggcggc cagcctgtgg atgggcgacc tggaacccta catggatgag

61 aacttcatct ccagagcctt tgccaccatg ggggagaccg taatgagcgt caaaattatc

121 cgaaaccgcc tcactgggat cccagctggc tactgctttg tagaatttgc agatttggcc

181 acagctgaga agtgtttgca taaaattaat gggaaacccc ttccaggagc cacaccttta

241 cttagcttac agctgcacca gctggcacac ctgggctcat agaaccatgg agctggcagt

301 gcccttagcg gtcatccgtg caaccccctc attttataca ggagaaaaag ctgaggctta

361 gagaggggga gatgttttgg ccaaggcgaa acgttttaaa ctgaactatg ccacttacgg

421 gaaacaacca gataacagcc ctgagtattc cctctttgtg ggggacctga ccccggacgt

481 ggatgatggc atgctgtatg aattcttcgt caaagtctac ccctcctgtc ggggaggcaa

541 ggtggttttg gaccagacag gcgtgtctaa gggttatggt tttgtgaaat tcacagatga

601 actggaacag aagcgagccc tgacggagtg ccagggagca gtgggactgg ggtctaagcc

661 tgtgcggctg agcgtggcaa tccctaaagc gagccgtgta aagccagtgg aatatagtca

721 gatgtacagt tatagctaca accagtatta tcagcagtac cagaactact atgctcagtg

781 gggctatgac cagaacacag gcagctacag ctacagttac ccccagtatg gctataccca

841 gagcaccatg cagacatatg aagaagttgg agatgatgca ttggaagacc ccatgccaca

901 gctggatgtg actgaggcca acaaggagtt catggaacag agtgaggagc tgtatgacgc

961 tctgatggac tgtcactggc agcccctgga cacagtgtct tcagagatcc ctgccatgat

1021 gtagccaggc caaaggacaa gccaggttgc atgatgtgag ggagatgaga gactcctttt

1081 taaaaattgt gaaacctttt tggaaatatg atttgtaaga ttttaataat gaaaaaaaaa

1141 aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa

1201 aaaaaaaaaa aaaaa

//

The record contains heterogeneous data distributed amongst a number of fields. Nucleotide sequence data, publication data, source organism, related and derived sequences, relationships with other records and data components of such records, together with relationships with the database as a whole, are present in a single GenBank record. Definitions of the various fields are available via the abovementioned URL, and are incorporated herein in their entirety.
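By way of illustration only, the following Python sketch shows how two of the fields of the above record relate to one another: the ORIGIN block is parsed into a contiguous sequence and the region beginning at the CDS start coordinate (491) is translated using the standard genetic code. The function names parse_origin and translate are hypothetical and form no part of the GenBank format or of the invention; only two ORIGIN lines are used, so the translation covers only the start of the coding region.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered t, c, a, g.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_origin(lines):
    """Return (position of the first base, concatenated sequence) for ORIGIN lines."""
    first_position = None
    chunks = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if first_position is None:
            first_position = int(parts[0])  # leading number is a 1-based coordinate
        chunks.append("".join(parts[1:]))
    return first_position, "".join(chunks)


def translate(sequence, first_position, cds_begin):
    """Translate complete codons starting at the 1-based coordinate cds_begin."""
    cds = sequence[cds_begin - first_position:]
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


# Two ORIGIN lines copied from the record above (bases 481 to 600).
ORIGIN_LINES = [
    "481 ggatgatggc atgctgtatg aattcttcgt caaagtctac ccctcctgtc ggggaggcaa",
    "541 ggtggttttg gaccagacag gcgtgtctaa gggttatggt tttgtgaaat tcacagatga",
]

first_position, fragment = parse_origin(ORIGIN_LINES)
protein_start = translate(fragment, first_position, cds_begin=491)
print(protein_start)
```

The printed fragment begins MLYEFFVKVY, in agreement with the start of the /translation qualifier of the CDS feature.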

A further example of a database suitable for integration and analysis using the systems, methods and architectures of the present invention is PubMed, also available via the National Center for Biotechnology Information (NCBI) Entrez retrieval system (at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed). PubMed provides access to citations from biomedical literature, and incorporates the functionality of LinkOut, providing access to full-text articles at journal Websites and other related Web resources. PubMed also provides access and links to the other Entrez molecular biology resources. A PubMed record comprises Author, Journal, Title, Abstract, Date of Publication and other data relating to a given article. An example of a PubMed record is presented below.

Proc Natl Acad Sci U S A. 2002 Dec 24;99(26):16899-903. Epub 2002 Dec 11.

Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences.

Strausberg RL, Feingold EA, Grouse LH, Derge JG, Klausner RD, Collins FS,

Wagner L, Shenmen CM, Schuler GD, Altschul SF, Zeeberg B, Buetow KH, Schaefer CF, Bhat NK, Hopkins RF, Jordan H, Moore T, Max SI, Wang J, Hsieh F, Diatchenko L, Marusina K, Farmer AA, Rubin GM, Hong L, Stapleton M, Soares MB, Bonaldo MF, Casavant TL, Scheetz TE, Brownstein MJ, Usdin TB, Toshiyuki S, Carninci P, Prange C, Raha SS, Loquellano NA, Peters GJ,

Abramson RD, Mullahy SJ, Bosak SA, McEwan PJ, McKernan KJ, Malek JA, Gunaratne PH, Richards S, Worley KC, Hale S, Garcia AM, Gay LJ, Hulyk SW, Villalon DK, Muzny DM, Sodergren EJ, Lu X, Gibbs RA, Fahey J, Helton E, Ketteman M, Madan A, Rodrigues S, Sanchez A, Whiting M, Madan A, Young AC, Shevchenko Y, Bouffard GG, Blakesley RW, Touchman JW, Green ED,

Dickson MC, Rodriguez AC, Grimwood J, Schmutz J, Myers RM, Butterfield YS, Krzywinski MI, Skalska U, Smailus DE, Schnerch A, Schein JE, Jones SJ, Marra MA; Mammalian Gene Collection Program Team.

National Cancer Institute, Bethesda, MD 20892-2580, USA. rls@nih.gov

The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).

PMID: 12477932 [PubMed - indexed for MEDLINE]

Yet another example of databases comprising records of interest is the UniProtKB/Swiss-Prot databases, available at http://us.expasy.org/sprot/. A sitemap of the various databases contained therein is presented in Figure 1. The data contained in any given record can vary greatly, and may comprise protein sequence information, which itself may comprise data relating to protein processing, comprising identification of initiation methionines, signal peptides, and propeptides; secondary structural elements such as helices, turns, and β-strands; protein structure regions including transmembrane domains and 3D structural motifs such as DNA binding regions, zinc-fingers, coiled-coils and the like; active sites including metal binding sites and ligand binding sites; sites of amino acid modification; and sites of natural variation, in addition to 3D protein structure data including X-ray diffraction co-ordinates or NMR data, 3D images of proteins, 2D polyacrylamide gel electrophoresis data, enzyme nomenclature, protein models, and germ cell differentiation information.

An example of a UniProtKB/Swiss-Prot database record is presented below.

Entry name Q86SU7_HUMAN

Primary accession number Q86SU7

Secondary accession numbers None

Entered in TrEMBL in Release 24, June 2003

Sequence was last modified in Release 24, June 2003

Annotations were last modified in Release 26, March 2004

Name and origin of the protein

Protein name SECP43 protein

Synonyms None

Gene name None

From Homo sapiens (Human)

Taxonomy Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini; Hominidae; Homo.

[1] NUCLEOTIDE SEQUENCE.

TISSUE=Brain;

DOI=10.1073/pnas.242603899; PubMed=12477932 [NCBI, ExPASy, EBI, Israel, Japan]

Strausberg R.L., Feingold E.A., Grouse L.H., Derge J.G., Klausner R.D., Collins F.S., Wagner L., Shenmen C.M., Schuler G.D., Altschul S.F., Zeeberg B., Buetow K.H., Schaefer C.F., Bhat N.K., Hopkins R.F., Jordan H., Moore T.,

Max S.I., Wang J., ESI, Marra M.A.;

"Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences."; Proc. Natl. Acad. Sci. U.S.A. 99:16899-16903(2002).

[2] NUCLEOTIDE SEQUENCE.

TISSUE=Brain; Strausberg

Submitted (NOV-2002) to the EMBL/GenBank/DDBJ databases.

SUBCELLULAR LOCATION: Nuclear (By similarity).

EMBL BC039879; AAH39879.1; -; mRNA. [EMBL / GenBank / DDBJ] [CoDingSequence]

GO GO:0003676; Molecular function: nucleic acid binding (inferred from electronic annotation).

QuickGo view.

InterPro IPR000504; RNP1_RRM.

Graphical view of domain structure.

PF00076; RRM_1;

Pfam graphical view of domain structure.

RRM;

PROSITE graphical view of domain structure (profiles).

ProDom [Domain structure / List of seq. sharing at least 1 domain]

HOVERGEN [Family / Alignment / Tree]

PRESAGE

SWISS-2DPAGE Get region on 2D PAGE.

UniRef View cluster of proteins with at least 50% / 90% identity.

mRNA processing; mRNA splicing.

None

Length: 177 AA Molecular weight: 20403 Da CRC64: 5E08BFD70AA68A7C [This is a checksum on the sequence]

        10         20         30         40         50         60
MLYEFFVKVY PSCRGGKVVL DQTGVSKGYG FVKFTDELEQ KRALTECQGA VGLGSKPVRL

        70         80         90        100        110        120
SVAIPKASRV KPVEYSQMYS YSYNQYYQQY QNYYAQWGYD QNTGSYSYSY PQYGYTQSTM

       130        140        150        160        170
QTYEEVGDDA LEDPMPQLDV TEANKEFMEQ SEELYDALMD CHWQPLDTVS SEIPAMM

Q86SU7 in FASTA format

Again, the record contains heterogeneous data distributed amongst a number of fields.

It will be apparent to those skilled in the field that the above records are interrelated, each containing biological data relating to the same biological entity, in this case a human protein. Using the methods and systems of the present invention, biological data obtained from one database can be used as search criteria to interrogate any other database. The biological data sets obtained from multiple databases are then merged and stored on a destination storage device, thereby facilitating synchronisation.
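By way of illustration only, the following Python sketch groups records from different databases that share at least one identifier, which is one simple way in which interrelated records could be brought together before merging. The identifiers are taken from the example records above (PubMed 12477932, protein AAH39879.1, UniProt Q86SU7); the record layout, the "Unrelated" entry and the function name group_by_shared_ids are assumptions made for the example and do not describe the invention's actual merge procedure.

```python
from collections import defaultdict


def group_by_shared_ids(records):
    """Group records that share at least one identifier, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}
    for index, record in enumerate(records):
        for identifier in record["ids"]:
            if identifier in first_owner:
                # Same identifier seen before: link the two records.
                parent[find(index)] = find(first_owner[identifier])
            else:
                first_owner[identifier] = index

    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[find(index)].append(record["source"])
    return sorted(sorted(group) for group in groups.values())


records = [
    {"source": "GenBank", "ids": {"PubMed:12477932", "protein:AAH39879.1"}},
    {"source": "PubMed", "ids": {"PubMed:12477932"}},
    {"source": "UniProtKB", "ids": {"UniProt:Q86SU7", "protein:AAH39879.1"}},
    {"source": "Unrelated", "ids": {"example:0000"}},  # hypothetical unrelated record
]

linked = group_by_shared_ids(records)
print(linked)
```

In this sketch the GenBank, PubMed and UniProtKB records fall into one group because each shares an identifier with at least one of the others.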

For example, biological data extracted from a database and comprising a dataset can be used as a search term to query the same or another database. Using the example above, the amino acid sequence of the protein can be used to search the same or other databases for biological data relating to said protein, for example by using a search program capable of determining polypeptide sequence similarity or identity, such as BLASTP. The subject polypeptide sequence is compared to a candidate polypeptide sequence using BLASTP (from the BLAST suite of programs, version 2.2.10 [Oct 2004]) in bl2seq, which is publicly available from NCBI (ftp://ftp.ncbi.nih.gov/blast/). The default parameters of bl2seq are utilized except that filtering of low complexity regions should be turned off.

Polypeptide sequence identity may also be calculated over the entire length of the overlap between candidate and subject polypeptide sequences using global sequence alignment programs. EMBOSS-needle (available at http://www.ebi.ac.uk/emboss/align/) and GAP (Huang, X. (1994) On Global Sequence Alignment. Computer Applications in the Biosciences 10, 227-235.) as discussed above are also suitable global sequence alignment programs for calculating polypeptide sequence identity.

Use of BLASTP as described above is preferred for use in the determination of polypeptide variants according to the present invention. Such polypeptide variants are an example of biological data relating to said protein.

Polypeptide variants also encompass those which exhibit a similarity to one or more of the specifically identified sequences that is likely to preserve the functional equivalence of those sequences and which could not reasonably be expected to have occurred by random chance. Such sequence similarity with respect to polypeptides may be determined using the publicly available bl2seq program from the BLAST suite of programs (version 2.2.10 [Oct 2004]) from NCBI (ftp://ftp.ncbi.nih.gov/blast/). The similarity of polypeptide sequences may be examined using the following UNIX command line parameters:

bl2seq -i peptideseq1 -j peptideseq2 -F F -p blastp

Variant polypeptide sequences preferably exhibit an E value of less than 1 x 10^-5, more preferably less than 1 x 10^-6, more preferably less than 1 x 10^-9, more preferably less than 1 x 10^-12, more preferably less than 1 x 10^-15, more preferably less than 1 x 10^-18 and most preferably less than 1 x 10^-21 when compared with any one of the specifically identified sequences.

The parameter -F F turns off filtering of low complexity sections. The parameter -p selects the appropriate algorithm for the pair of sequences. This program finds regions of similarity between the sequences and for each such region reports an "E value", which is the number of times one could expect to see such a match by chance in a database of a fixed reference size containing random sequences. For small E values, much less than one, this is approximately the probability of such a random match.
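By way of illustration only, the following Python sketch shows why a small E value approximates a probability. It assumes the commonly used Poisson model for chance matches, under which the probability of at least one chance match is 1 - exp(-E); that model, and the function names, are assumptions of the sketch and are not stated in the text above. The list of thresholds is the list of preferred E values given above.

```python
import math

# Preferred E value thresholds listed above, from least to most stringent.
PREFERRED_THRESHOLDS = [1e-5, 1e-6, 1e-9, 1e-12, 1e-15, 1e-18, 1e-21]


def evalue_to_probability(e_value):
    """Probability of at least one chance match, assuming a Poisson model."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def most_stringent_threshold_met(e_value):
    """Return the smallest listed threshold that e_value is below, or None."""
    met = [t for t in PREFERRED_THRESHOLDS if e_value < t]
    return min(met) if met else None


for e_value in (10.0, 0.1, 1e-5):
    print(e_value, evalue_to_probability(e_value))
```

For E values well below one the two quantities are nearly equal, whereas for large E values the probability saturates towards one while the E value continues to grow.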

In another example, the biological data comprising the GenBank accession number, or the biological data comprising the PubMed publication identifier can be used to interrogate other databases to identify any publications citing said GenBank accession number and/or said PubMed publication identifier. The methods and systems of the present invention allow continuous, on-going interrogation thereby enabling the updating of biological data. Moreover, using the methods and systems of the present invention, users can be notified of any amendments to or updates of biological data of interest resident in a database.

In other examples, proprietary data generated within various collaborating host institutions or by research collaborators at different institutions are likely to be stored in databases of differing structure. Collaboration between such researchers, institutions, companies and the like would be greatly facilitated if the heterogeneous data comprising such diverse datasets could be readily compared, extracted, analysed, and updated.

The heterogeneity and the sheer volume of biological data are problematic, as the conversion of data and information into knowledge requires the analysis and synthesis of this disparate data. The methods and systems of the present invention allow the functional integration of heterogeneous biological data, including, for example, nucleotide sequence information, single nucleotide polymorphism (SNP) data, protein expression data, chemical and protein structure data, bioassay data, scientific and patent publication data and clinical text data, from disparate data sources.

A particularly preferred embodiment of the present invention is as follows. Figure 2 shows a computer system 200 suitable for implementation of a method of integration and compilation of bioinformatics and biological data. The computer system 200 is referred to as a processing unit or PU. The system 200 comprises one or more processors 205 that receive data and program instructions from a temporary data storage device such as a memory device 210 over a communications bus 215. A memory controller 220 governs the flow of data into and out of the memory device 210. The system 200 also comprises a persistent data storage device such as a disk drive 225 that stores data in a manner prescribed by a disk controller 230. One or more input devices 235 such as a mouse and a keyboard and output devices 240 such as a monitor and a printer allow the computer system to interact with a human user and with other processing units networked together over a distributed network. The typical processing unit 200 comprises a network interface 245 to enable the processing unit to interconnect with other similar processing units.

Figure 3 illustrates a distributed system 300 including network(s) 305 in which client processing unit 200 is interconnected. Client PU 200 is connected over a network or networks to further processing units 320, 340, and 370. These processing units 320, 340 and 370 store biological data that is then accessed by client PU 200. For this reason they are referred to as storage devices. In practice, each storage device 320, 340, and 370 will have some or all of the components of client processing unit 200, for example a processor, memory controller, persistent memory disc drive, input device and/or output device. The preferred form of network arrangement is a peer-to-peer file sharing network in which participating processing units interconnected on the network upload data to a server connected to the network in order to be accessed by one or more other processing units connected to the networks 305.

Storage device 320 represents a large commercial database of biological data. The storage processing unit or device 320 is interfaced to disk storage devices 325 indicated as databases. Data stored in databases 325 could be stored in any data format. It is expected that the data in the storage device 320 and/or database 325 will be updated frequently. As data is updated it is important that the updated data be made available to the users of the commercial database. Commercial users of this database are recorded in a membership list 330. As write functions that modify the contents of the database are performed, a modification log is generated. A modification notifier 335 retrieves from the membership list 330 the details of the individual members that should be notified about the database modifications. If client PU 200 is a commercial user of the database, and is recorded in the membership list as a member that requires notification of database changes, the modification notifier 335 will notify client processing unit 200 that the database has been modified and may provide a modification log. Client PU 200 is then able to select storage device 320 as a storage device that should be checked to ensure that data on or associated with storage device 320 is synchronised with data stored on or associated with client PU 200.
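By way of illustration only, the push-style notification described for storage device 320 may be sketched in Python as follows. The class names Member and NotifyingStore, the in-memory outbox standing in for network delivery, and the choice to log only the key of each modified record are all assumptions of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    wants_notifications: bool


@dataclass
class NotifyingStore:
    """A store that logs writes and notifies the members that asked for it."""
    members: list
    records: dict = field(default_factory=dict)
    modification_log: list = field(default_factory=list)
    outbox: list = field(default_factory=list)  # (member name, log entries) pairs

    def write(self, key, value):
        # Every write that modifies the store adds an entry to the log.
        self.records[key] = value
        self.modification_log.append(key)

    def notify(self):
        # Send the pending log only to members flagged for notification.
        if not self.modification_log:
            return
        entries = list(self.modification_log)
        for member in self.members:
            if member.wants_notifications:
                self.outbox.append((member.name, entries))
        self.modification_log.clear()


store = NotifyingStore([Member("client-200", True), Member("other-user", False)])
store.write("BC039879", "revised annotation")
store.notify()
print(store.outbox)
```

On receiving such a notification, the client would then mark the notifying storage device as one to be checked for synchronisation.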

Storage device 340 is another example of a commercial database on which biological data can be stored in database or databases 345. Like storage device 320, storage device 340 generates a modification log indicating the changes that have been made to the data stored in database 345 following write operations. It differs from storage device 320 in that, instead of transmitting the modification log to all members of a membership list, it makes the modification log 350 available for download to other processing units interconnected over networks 305.

Client PU 200 has installed on it a modification log checker 360 that periodically polls those storage devices with associated modification logs to check whether data associated with individual storage devices needs to be synchronised with data stored on or associated with client PU 200.
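By way of illustration only, one polling pass of a modification log checker may be sketched in Python as follows. The sketch assumes that each published log is append-only and that the client remembers how many entries of each log it has already processed; both assumptions, and the function name devices_needing_sync, are hypothetical.

```python
def devices_needing_sync(published_logs, entries_seen):
    """Return, per storage device, the log entries not yet processed by the client.

    published_logs: device name -> list of log entries (assumed append-only)
    entries_seen:   device name -> number of entries already processed
    """
    pending = {}
    for device, log in published_logs.items():
        seen = entries_seen.get(device, 0)
        if len(log) > seen:
            pending[device] = log[seen:]
    return pending


published_logs = {
    "storage-340": ["update BC039879", "insert Q86SU7"],
    "storage-341": ["update record-1"],  # hypothetical second device
}
entries_seen = {"storage-340": 1, "storage-341": 1}

pending = devices_needing_sync(published_logs, entries_seen)
print(pending)
```

A checker of this kind would be run on a timer, and only the devices returned would be selected for synchronisation.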

Storage device 370 also has stored on it biological data. In this case storage device 370 is not a commercial database but is a work station maintained by a researcher in the biological area. The storage device 370 has stored on it the results of biological data compilation and analysis. It is more difficult with storage device 370 to assess whether or not modifications have been made to that data than it is for storage device 320 and storage device 340. A record analyser 380 associated with client PU 200 periodically checks storage devices of the type indicated as storage device 370. Key records are retrieved from the storage device 370 and compared with records already retrieved and stored with client processing unit 200. Preferably a small sample of records is retrieved. If there are significant differences between the sample records and cross-referenced records in the client PU then further records are retrieved from the storage device 370 and synchronised with client PU 200.
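By way of illustration only, the sampling check performed by a record analyser may be sketched in Python as follows. The use of SHA-256 digests for comparison, the sample size, the mismatch threshold defining a "significant" difference, and the function names are all assumptions of the sketch.

```python
import hashlib
import random


def digest(record_text):
    """Fingerprint a record so that copies can be compared cheaply."""
    return hashlib.sha256(record_text.encode("utf-8")).hexdigest()


def needs_full_sync(remote, local, sample_size=3, max_mismatch_fraction=0.0, rng=None):
    """Compare a small random sample of remote records with the local copies.

    Returns True when the fraction of sampled records that are missing or
    different locally exceeds max_mismatch_fraction.
    """
    rng = rng or random.Random(0)
    keys = sorted(remote)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    if not sample:
        return False
    mismatches = sum(
        1 for key in sample
        if key not in local or digest(remote[key]) != digest(local[key])
    )
    return mismatches / len(sample) > max_mismatch_fraction


local_copy = {"rec-a": "acgt", "rec-b": "ttga", "rec-c": "ggcc"}
unchanged_remote = dict(local_copy)
edited_remote = {key: text + "a" for key, text in local_copy.items()}

print(needs_full_sync(unchanged_remote, local_copy))
print(needs_full_sync(edited_remote, local_copy, sample_size=2))
```

Only when the sample indicates a difference would the further records be retrieved and synchronised.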

Processing units 200, 320, 340 and 370 will not necessarily store corresponding data in a consistent data format. Such disparate data structures are often difficult to synchronise without the benefit of the method and system described above.

Computer Software and Apparatus for Utilising Information from Heterogeneous Data Sources

In a preferred embodiment, the invention has a modular architecture providing a framework to simplify and standardize access to and analysis of biological data. In such an embodiment, the invention creates a framework in which many different programs and algorithms and processes designed for the analysis of biological data may freely exchange data. In this embodiment, the invention has been designed in such a manner as to allow the addition or removal of further functional modules, known as plugins.

Figure 4 is a functional diagram of a distributed computing environment where the system provides for flexible access to and analysis of sequence and structural data from heterogeneous data sources. The application^ ) displays the data and provides a means of comparing sequences for determining the evolutionary relationships between them. Data sources can be both local(3) and remote(5). With the aid of certain plugins(2), the invention has the ability to retrieve data from a variety of external data sources such as NCBI or EMBL. Data that has been retrieved from an external data source may be saved locally.

The local data repository is a collection of files known as Documents. Documents can include, but are not limited to, nucleotide or protein sequences, three dimensional structures, sequence alignments, and phylogenetic tree documents. Any data retrieved from a remote source is converted to the appropriate format and saved as a document. In one embodiment the documents may be stored as files on the user's hard drive. They could also be stored in a relational database. Whatever the embodiment, it is preferred that the Document format is retained.
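By way of illustration only, a local repository in which each Document is stored as a file may be sketched in Python as follows. The field names, the choice of JSON as the file format and the function names are assumptions of the sketch; the Document format actually used by the invention is not specified here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Document:
    name: str
    kind: str            # e.g. "nucleotide", "protein", "alignment", "tree"
    content: str
    source: str = "local"


def save_document(document, folder):
    """Write one Document to its own file and return the file path."""
    path = Path(folder) / f"{document.name}.json"
    path.write_text(json.dumps(asdict(document)), encoding="utf-8")
    return path


def load_document(path):
    """Read a Document back from a file written by save_document."""
    return Document(**json.loads(Path(path).read_text(encoding="utf-8")))


# Data retrieved from a remote source is converted to a Document and saved.
with tempfile.TemporaryDirectory() as folder:
    retrieved = Document("Q86SU7", "protein", "MLYEFFVKVY", source="UniProtKB")
    restored = load_document(save_document(retrieved, folder))

print(restored)
```

Because every kind of data is wrapped in the same Document structure, the same save and load routines could equally be backed by a relational database.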

Figure 5 is a screenshot showing, by way of example, a user interface that allows a user to easily and conveniently access all the functionality of the invention. The interface in this example consists of six main panels.

The Services Panel contains the services and functions offered by the application, including local documents, sample documents, an incoming document store, links to web-based databases, and agents. The Services Panel in this example shows a tree that concisely displays the functions of the application and documents stored in the local database. In a preferred embodiment of the invention, the Services Panel also displays databases available on collaborator machines in a networked configuration, where database-sharing is enabled through Peer-to-Peer functionality.

The Document Table displays summaries of downloaded data such as DNA sequences, protein sequences, journal articles, sequence alignments, and trees. In this panel, the user can search data with "Search" and "Advanced Search" tabs. Here, downloaded data can also be filtered using "Filter", "Text match" and "Sort by similarity to selected sequence" features. Selecting a document in the Document Table will display its details in the Document Viewer Panel. Double-clicking a document in the Document Table displays the same view in a separate window.

The Document Viewer Panel is where sequences, alignments, trees, and journal article abstracts may be shown graphically or as plain text. In this example, this panel also offers various options while visualizing protein and nucleotide sequences. These options include zooming, color and layout selection, and annotations. When viewing trees, there are additional options for branch and leaf labeling, and controlling tree layout. When viewing journal articles, this panel may include a direct link to web-based citation services such as Google Scholar.

The Help Panel comprises a tutorial, and a short help message pops up when any of the tasks on the Services panel are clicked. The user may choose whether or not to display this panel in this example.

In this example the Toolbar displays seven large icons: "Back", "Forward", "Add Note", "Build Tree", "Alignment", "Preferences" and "Help", together with a drop-down arrow between the "Back" and "Forward" icons. The Toolbar also has "Text match" and "Filter" controls. The Add Note icon is available when one or more documents are selected. The Alignment icon only becomes available when two or more protein or nucleotide sequences are selected in the Document Table. The Build Tree icon becomes available when an alignment or a set of sequences is selected. The Toolbar also allows the user to hide icons when they are not in use.

The Menu Bar in this example has four main menus: "File", "Edit", "Tools" and "Help".

Preferably, as in this example, the client is a Java application capable of running on any platform that supports a Java interpreter, such as Windows, Linux or Mac.

Plugins

Plugins may be integrated as modules into the invention via a freely available Application Programming Interface (API). The API allows programmers to construct modules that fit into the framework of the invention. Users are able to choose which plugins they desire to have functional at any time. Some plugins may be dependent on third party plugins being installed on the user's computer. Some may be reliant on access to the internet. Others may have the functional code included within the plugin.

In a preferred embodiment an interface is provided where a user can select which plugins to use. Figure 6 is a screenshot showing, by way of example, how the user interface may be constructed so that plugins are available to be selected by the user as optional modules to carry out various tasks. Plugin modules such as the NCBI plugin or the EMBL database plugin allow the invention to search the databases publicly available on the NCBI and EMBL websites. Other modules allow for searching other databases, carrying out of specialised data searches, various forms of data comparison and data analysis, and the performance of diverse bioinformatics computing tasks. Plugin modules can access data stored in the local database or in other databases in a distributed or peer-to-peer collaborative computing environment.
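By way of illustration only, a plugin framework in which modules are registered and then enabled or disabled by the user may be sketched in Python as follows. The class PluginRegistry and the two toy plugins are hypothetical and do not reproduce the API of the invention; the preferred client is described above as a Java application, and Python is used here only for brevity.

```python
class PluginRegistry:
    """Keeps track of installed plugins and which ones the user has enabled."""

    def __init__(self):
        self._handlers = {}
        self._enabled = set()

    def register(self, name, handler):
        # Installing a plugin makes it available but does not enable it.
        self._handlers[name] = handler

    def enable(self, name):
        if name not in self._handlers:
            raise KeyError(f"unknown plugin: {name}")
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def run(self, name, *args):
        if name not in self._enabled:
            raise LookupError(f"plugin not enabled: {name}")
        return self._handlers[name](*args)


registry = PluginRegistry()
registry.register("length", len)                      # toy analysis plugin
registry.register("reverse", lambda seq: seq[::-1])   # toy transformation plugin
registry.enable("length")

print(registry.run("length", "acgt"))
```

Keeping the enabled set separate from the installed handlers mirrors the user's ability to choose which plugins are functional at any time.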

The invention is capable of, but not limited to, carrying out the following functions on the nucleotide and/or protein sequences contained within its local data store, as well as other kinds of data sets. In many examples the function is performed by a plugin, illustrating the modular nature of the preferred embodiment of the invention.

Retrieving data from remote databases

Figure 7 is a screenshot showing, by way of example, how the user interface may be constructed to display the results of a user-specified search on a remote database. In this example a plugin allows the remote web-based database to be searched. The results are listed and a user may move any of the resulting sequence documents to the local files.

While there are numerous publicly available databases containing a large amount of biological data, one purpose of the invention is to enable a user to search for, filter and store only the data relevant to the user.

Examples of publicly available databases that may be accessed and searched by means of the invention include, but are not limited to:

The Entrez Genome database. This provides views of a variety of genomes, complete chromosomes, sequence maps with contigs (contiguous sequences), and integrated genetic and physical maps.

The Entrez Nucleotide database. This database in GenBank contains three separate components that are also searchable databases: "EST", "GSS" and "CoreNucleotide". The core nucleotide database brings together information from three other databases: GenBank, EMBL, and DDBJ. These are part of the International Nucleotide Sequence Database Collaboration. This database also contains RefSeq records, which are NCBI-curated, non-redundant sets of sequences.

The Entrez Popset database. This database contains sets of aligned sequences that are the result of population, phylogenetic, or mutation studies. These alignments usually describe evolution and population variation. The PopSet database contains both nucleotide and protein sequence data, and can be used to analyse the evolutionary relatedness of a population.

The Entrez Protein database. This database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to the Protein Information Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).

The Entrez Structure database. This is NCBI's structure database and is also called MMDB (Molecular Modeling Database). It contains three-dimensional, biomolecular, experimentally determined structures obtained from the Protein Data Bank.

The PubMed database. This is a service of the U.S. National Library of Medicine that includes over 16 million citations from MEDLINE and other life science journals. This archive of biomedical articles dates back to the 1950s. PubMed includes links to full text articles and other related resources, with the exception of those journals that need licenses to access their most recent issues.

Entrez Taxonomy. This database contains the names of all organisms that are represented in the NCBI genetic database. Each organism must be represented by at least one nucleotide or protein sequence.

UniProt. This database is a comprehensive catalogue of protein data. It includes protein sequences and functions from Swiss-Prot, TrEMBL, and PIR. It has three main components, each optimized for a particular purpose.
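By way of illustration only, a plugin searching one of the Entrez databases listed above might construct its request as sketched below in Python. The sketch assumes the NCBI E-utilities ESearch endpoint and its db, term, retmax and retmode parameters, none of which is described in the text above; it only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not described in this document).
ESEARCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(database, term, max_results=20):
    """Build a search URL for one Entrez database without sending the request."""
    query = urlencode({
        "db": database,
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    })
    return f"{ESEARCH_ENDPOINT}?{query}"


pubmed_query = esearch_url("pubmed", "12477932")
protein_query = esearch_url("protein", "SECP43", max_results=5)
print(pubmed_query)
print(protein_query)
```

Separating URL construction from the network call allows the same search definition to be reused for repeated or scheduled searches.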

Search Filters

In a preferred embodiment, the invention allows data to be filtered to remove redundant data and partial matches during searching, so that relevant information can be located. Search functionality can be made available on all locally stored documents. The advanced search functionality allows searching on multiple different types of terms at once, for example a search for titles containing word "x" and content containing term "y".
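By way of illustration only, an advanced search combining a title criterion with a content criterion, preceded by removal of redundant documents, may be sketched in Python as follows. The document layout, the "accession" key used to detect redundancy and the function names are assumptions of the sketch.

```python
def field_contains(field_name, word):
    """Build a criterion that is true when the given field contains the word."""
    needle = word.lower()
    return lambda document: needle in document.get(field_name, "").lower()


def drop_redundant(documents, key="accession"):
    """Keep only the first document seen for each value of the key field."""
    seen = set()
    unique = []
    for document in documents:
        if document[key] not in seen:
            seen.add(document[key])
            unique.append(document)
    return unique


def advanced_search(documents, *criteria):
    """Return the documents that satisfy every criterion."""
    return [doc for doc in documents if all(test(doc) for test in criteria)]


documents = [
    {"accession": "doc-1", "title": "Generation and initial analysis of cDNA sequences",
     "content": "clones containing a complete ORF"},
    {"accession": "doc-1", "title": "Generation and initial analysis of cDNA sequences",
     "content": "clones containing a complete ORF"},  # redundant copy
    {"accession": "doc-2", "title": "On Global Sequence Alignment",
     "content": "global alignment of two sequences"},
]

hits = advanced_search(
    drop_redundant(documents),
    field_contains("title", "cDNA"),     # title containing word "x"
    field_contains("content", "ORF"),    # content containing term "y"
)
print([doc["accession"] for doc in hits])
```

Expressing each criterion as an independent test makes it straightforward to add further term types without changing the search routine.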

An interface for searching and advanced searches may also be included in plugins that interface with remote data sources.

The search function is able to sort through multiple data formats, for example sequence data, three-dimensional molecular structures, word documents, pdf documents, and others.

The invention also allows data filtering to be carried out "on-the-fly", for example while searching for data or documents in public web-based databases. This feature is particularly useful when a search is returning a large number of results, because it allows the user to refine the search without stopping and recommencing.
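By way of illustration only, filtering that is applied while results are still arriving may be sketched in Python as follows. The class LiveFilter is hypothetical; it keeps every result received so far, so that changing the filter text refines the visible list without the search having to be stopped and recommenced.

```python
class LiveFilter:
    """Holds results as they arrive and re-applies the current filter on demand."""

    def __init__(self):
        self._received = []
        self._text = ""

    def receive(self, title):
        # Called for each result as it arrives from the remote search.
        self._received.append(title)

    def set_text(self, text):
        # Changing the filter affects past and future results alike.
        self._text = text.lower()

    def visible(self):
        return [title for title in self._received if self._text in title.lower()]


live = LiveFilter()
live.receive("SECP43 protein [Homo sapiens]")
live.receive("Unrelated hypothetical protein")
live.set_text("secp43")                       # user refines mid-search
live.receive("SECP43 mRNA, complete cds")     # search is still running

print(live.visible())
```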

Agents

In a preferred embodiment of the invention, advanced searches may be saved and automated as "Agents". Agents save search details and require parameters detailing where results are to be saved and how often a search is to be repeated. This allows working data to be continually updated without input from the user.

Figure 8 is a screenshot showing, by way of example, how the user interface may be constructed to allow the user to create a user-specified Agent. The user can select a set of search criteria and the destination database to which the data will be stored. Once an Agent has been set up, it can be monitored, disabled, enabled, edited or deleted.
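
The behaviour of an Agent as described above (saved search criteria, a destination database, a repeat interval, and the ability to be enabled or disabled) might be modelled along the following lines. This is a minimal sketch; the class, its fields, and the `search` and `store` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """Hypothetical saved search that is repeated at a fixed interval."""
    query: str
    destination: str                  # database that receives the results
    interval_hours: float             # how often the search is repeated
    enabled: bool = True
    last_run: Optional[float] = None  # time of the last run, in hours

    def is_due(self, now_hours):
        """True when the Agent is enabled and its interval has elapsed."""
        if not self.enabled:
            return False
        if self.last_run is None:
            return True
        return now_hours - self.last_run >= self.interval_hours

    def run(self, search, store, now_hours):
        """Run the saved search and store the results in the destination."""
        results = search(self.query)
        store(self.destination, results)
        self.last_run = now_hours
        return len(results)


saved = {}  # stand-in for local databases, keyed by name


def fake_search(query):
    """Stand-in for a real remote search."""
    return [f"{query} hit {i}" for i in range(3)]


def store(database, results):
    saved.setdefault(database, []).extend(results)


agent = Agent(query="cytochrome b", destination="my_project", interval_hours=24)
if agent.is_due(now_hours=0):
    agent.run(fake_search, store, now_hours=0)
```

A scheduler would call `is_due` periodically; disabling the Agent simply makes `is_due` return False until it is enabled again.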

Bibliographic and research publication data

In addition to other forms of biological data, the invention may also search for and store scientific publication data, such as citations, abstracts, and entire research articles. This allows easy correlation of data and scientific reviews. The search function can handle a number of publication formats, including pdf, the most commonly used format for scientific publications.

Pairwise alignment

Figure 9 is a screenshot showing, by way of example, how the user interface may be constructed to allow the user to perform a sequence alignment and view results. In this example a sequence alignment plugin module allows the invention to perform pairwise alignments. Pairwise alignment assists in the location of regions of similarity between two sequences.

Any two like sequence documents (nucleotide-nucleotide or protein-protein) can be selected from the stored local documents. The alignment tool is selected and allows the user to enter the cost matrix and specify the gap penalties to be used. The application then undertakes a pairwise alignment of the two sequences, and creates and displays an alignment document. The alignment can be either local (an alignment of two sub-regions of a pair of sequences) or global (alignment over the entire lengths of the sequences).
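
The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. It is not the alignment method of the described plugin: it assumes a single match score, a single mismatch score and a linear gap penalty in place of a full cost matrix, and it returns only the best score rather than the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best pairwise alignment score by dynamic programming.

    local=False: global alignment over the entire lengths of both
                 sequences (Needleman-Wunsch style).
    local=True:  local alignment of the best-scoring pair of
                 sub-regions (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))
print(alignment_score("TTTGATTACATTT", "CCCGATTACACCC", local=True))
```

In the second call the local alignment finds the shared `GATTACA` region and ignores the dissimilar flanking residues, which a global alignment would have to include.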

Multiple sequence alignment

A sequence alignment plugin may also allow for the creation of multiple sequence alignments, which allow a user to determine regions of similarity across multiple sequences. When any three or more like (nucleotide-nucleotide or protein-protein) documents stored locally are selected, a multiple sequence alignment can be created.

Figure 10 is a flow diagram showing how a multiple sequence alignment may be created within the invention. First the user selects a database which contains relevant sequence data. A search is then performed on the data within the database, in order to retrieve the relevant sequences required by the user. The user then selects three or more sequences from those retrieved by the search, and sets the alignment parameters to be used in the alignment. In this step the user is presented with a variety of default parameters for the alignment which can be changed. The multiple sequence alignment is then created and displayed.

Sequence and alignment visualisation

Upon selection of a nucleotide or protein sequence document, the sequence information is parsed and displayed in either a textual format by a Text Viewer plugin or in a graphical format by a Graphical Sequence View plugin.

Figure 11 is a screenshot showing, by way of example, a text view of a sequence alignment. For sequences, a text view consists of a sequence of letters denoting the nucleotides or amino acids making up the sequence. Name and position notations are also provided. For alignments, the textual view consists of two aligned strings of text, one for each sequence, with the name and position notation included.

Figure 12 is a screenshot showing, by way of example, a graphical view of a sequence alignment. For sequences, the graphical view is composed of ribbons denoting functional units accompanied by letters denoting the nucleotides or amino acids within the sequences. For alignments, the graphical view is composed of aligned ribbons denoting functional units within the sequences accompanied by letters denoting the nucleotides or amino acids within the sequences.

For all sequences and alignments, the graphical interface provides a variety of viewing tools such as zoom, layout, colour, statistics, annotations and graphs.

Degrees of similarity between sequences may also be displayed for every position by the visualisation tool. For example, the colour green may be used to show that the residue at the position is the same across all sequences, yellow when there is less than complete similarity, and red when there is very low similarity for the given position.
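
A per-position colouring of this kind might be computed as in the following sketch. The similarity measure (the fraction of sequences sharing the most common residue in a column) and the 0.5 threshold separating yellow from red are assumptions chosen for illustration; the text above does not specify them. Gap characters are treated like any other symbol.

```python
from collections import Counter


def column_colours(aligned, low=0.5):
    """Assign a colour to each column of an alignment.

    green:  the same residue in every sequence
    yellow: less than complete similarity, but at least `low`
    red:    similarity below `low`
    """
    colours = []
    for column in zip(*aligned):
        top_count = Counter(column).most_common(1)[0][1]
        fraction = top_count / len(column)
        if top_count == len(column):
            colours.append("green")
        elif fraction >= low:
            colours.append("yellow")
        else:
            colours.append("red")
    return colours


alignment = ["ACGT", "ACGA", "ACTG"]
print(column_colours(alignment))
```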

The visualisation tool may also display the hydrophobicity of a protein at every residue position, or the average hydrophobicity at each position when there are multiple sequences.
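
As a hedged illustration, a per-position hydrophobicity profile could be computed as below. The Kyte-Doolittle hydropathy scale is used here as an assumption; the text above does not say which scale the visualisation tool uses. Gaps and unrecognised symbols are ignored when averaging.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(aligned):
    """Average hydropathy per column of one or more aligned proteins.

    A column with no scorable residue (for example all gaps) yields None.
    """
    profile = []
    for column in zip(*aligned):
        values = [KYTE_DOOLITTLE[r] for r in column if r in KYTE_DOOLITTLE]
        profile.append(sum(values) / len(values) if values else None)
    return profile


print(mean_hydrophobicity(["IL"]))         # a single sequence
print(mean_hydrophobicity(["IK", "V-"]))   # two aligned sequences
```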

Similarly the isoelectric point of a protein may be displayed at every position along a sequence, or the average isoelectric point when multiple sequences are displayed.

Statistics about the sequences being viewed can also be displayed by the visualization tool. These statistics may correspond to the sequence or alignment being viewed or the highlighted part of the sequence and alignment. Such statistics can include the frequency of each nucleotide or amino acid over the entire length of a sequence, including gaps, and the number of sites that are identical across all sequences when viewing multiple sequences or alignments.
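
The two statistics named above can be sketched in a few lines. The function names are hypothetical, and gap characters are counted as symbols in their own right, in line with "including gaps" above.

```python
from collections import Counter


def residue_frequencies(sequence):
    """Relative frequency of each symbol in a sequence, gaps included."""
    if not sequence:
        return {}
    counts = Counter(sequence)
    return {symbol: n / len(sequence) for symbol, n in counts.items()}


def identical_sites(aligned):
    """Number of alignment columns in which every sequence has the same symbol."""
    return sum(1 for column in zip(*aligned) if len(set(column)) == 1)


print(residue_frequencies("AAC-"))
print(identical_sites(["ACGT", "ACGA", "ACTG"]))
```

To report statistics for a highlighted region only, the same functions can be applied to a slice of each sequence.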

In a preferred embodiment of the invention, it is also desirable that the user should be able to view any annotations that have been made to sequences or sequence alignments. In this example, annotations are displayed with the sequence or alignment to which they refer. Annotations may furthermore be added, deleted, or edited by the user.

The visualization tool in this example also comprises the ability to translate selected nucleotide sequences to protein sequences, including annotations.
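
Translation of a nucleotide sequence to a protein sequence can be sketched as follows, assuming the standard genetic code and reading frame 1. Handling of annotations is not shown. Codons containing ambiguous bases are rendered as "X", which is a convention chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate DNA or RNA in frame 1; '*' marks a stop codon.

    Trailing bases that do not form a complete codon are ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))
```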

Phylogenetic tree construction

Figure 13 is a screenshot showing, by way of example, how a plugin module may be used to allow a phylogenetic tree to be constructed from nucleotide or protein sequence data. The user interface allows a user to select any locally stored alignment between three or more sequences and request a phylogenetic tree. Upon selection of the tree command, a variety of options are made available. The algorithm for constructing the tree is selected (Jukes Cantor, HKY, etc). The tree building method must also be selected (e.g. Neighbour joining), as must whether or not resampling is to be used. The application then constructs the tree and creates a tree document that can then be viewed.
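
As an illustration of the first of these options, Jukes-Cantor is commonly described as a nucleotide substitution model: it converts the observed proportion p of differing sites between two aligned sequences into an estimated distance d = -(3/4) ln(1 - 4p/3). The sketch below computes a pairwise distance matrix of the kind a tree building method such as neighbour joining takes as input. It is an assumption-laden example, not the described plugin: columns containing a gap are skipped, and the tree building step itself is not shown.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(a, b):
    """Jukes-Cantor corrected distance between two aligned nucleotide sequences."""
    p = p_distance(a, b)
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(aligned)
    return [
        [0.0 if i == j else jukes_cantor_distance(aligned[i], aligned[j])
         for j in range(n)]
        for i in range(n)
    ]


matrix = distance_matrix(["ACGT", "ACGA", "ACTA"])
for row in matrix:
    print(["%.3f" % value for value in row])
```

The corrected distance is always at least as large as p, because the model allows for multiple substitutions at the same site.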

Phylogenetic tree visualisation

Figure 14 is a screenshot showing, by way of example, a visualisation tool that provides a graphical representation of a phylogenetic tree.

Upon completion of the creation of a tree document or the selection of a tree document, a tree is displayed in a results pane. Each tree document allows the user to view the information in a graphical, textual or sequence view. A graphical view provides a picture of the tree in one of several styles. Several display options are available for the tree, including zoom and tip, branch and node labels. The textual view provides an ASCII tree and a textual display of the alignment. The sequence view provides a graphical view of the alignment used to create the tree in exactly the same manner as the viewing of alignments.

Chromatogram sequence data

In a preferred embodiment of the invention, the data management and visualisation tools may allow data to be imported from sequencing instruments, displayed, edited, and used in other analyses.

Figure 15 is a screenshot showing, by way of example, how chromatogram data may be usefully displayed. In this example, the instrumental data is displayed and aligned with the corresponding nucleotide data. In one form of displaying the data, shown in the upper window of the screenshot, the quality of the instrumental data is analysed and displayed to the user by using different letter sizes to qualitatively signify the certainty of the nucleotide assignments to the instrumental data. The user may edit the nucleotide assignments as desired.

BLAST searches

In a preferred embodiment of the invention, additional functionality will be integrated into the data search and retrieval applications to allow specialised search types. An example of such functionality is the ability to perform searches using a Basic Local Alignment Search Tool (BLAST). BLAST is a widely used tool to find related nucleotide or protein sequences in a database of disparate organisms. If it is desirable to carry out BLAST searching on the local database, or in a database on another machine in a collaborative peer-to-peer computing environment, the code required to carry out BLAST searching should be written into the code of the operating framework. It is also possible to integrate a plugin module able to access third-party BLAST searching functionality in external databases. An example of the latter method is shown in Figure 16, which is a block diagram showing how the invention may be used to perform a BLAST search on a web-based database. Within the application framework of the invention, the user selects a database, then selects the type of BLAST search required, enters a sequence or accession number, and then submits the search request. The application then supplies the search request to the third-party web-based search service. The application may communicate useful information to the user during the search, such as estimated search times. The results of the search on the web-based database are retrieved and downloaded to the user's system, where they are displayed by the application and may be saved to a local database.
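
The step in which the type of BLAST search is selected can be illustrated with the sketch below, which maps the kind of query and the kind of database to one of the standard BLAST program names. The request dictionary and the function name are hypothetical and do not correspond to the parameter names of any particular web service; submitting the request is deliberately left out.

```python
# Standard BLAST programs by (query type, database type).
# tblastx (translated query against translated database) is omitted here.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}


def build_blast_request(query, query_type, database, database_type):
    """Assemble a hypothetical search request for a remote BLAST service.

    `query` may be a raw sequence or an accession number.
    """
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return {
        "program": BLAST_PROGRAMS[key],
        "database": database,
        "query": query.strip(),
    }


request = build_blast_request(
    " ATGGCCATTGTAATG ", "nucleotide", "example_protein_db", "protein"
)
print(request)
```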

Dotplot viewing

In a preferred embodiment of the invention, nucleotide or protein sequence data may be compared using dotplots. A dotplot is a special viewer that appears when two sequences are chosen. A dotplot compares two sequences to find regions of similarity. Each axis (X and Y) on the plot represents one of the sequences being compared. Figure 17 is a screenshot showing, by way of example, how a dotplot may be displayed within the framework of the invention.
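
A dotplot of this kind can be sketched as a simple text matrix, with one sequence along each axis and a mark wherever the two sequences match. The exact-match rule and the `window` parameter are assumptions made for this example; the described viewer may use a different scoring scheme.

```python
def dotplot(seq_x, seq_y, window=1):
    """Return the rows of a text dotplot.

    Row y, column x holds '*' when the `window`-long stretches starting
    at seq_x[x] and seq_y[y] are identical, and '.' otherwise. Larger
    windows suppress the scattered single-residue matches that occur by
    chance.
    """
    rows = []
    for y in range(len(seq_y) - window + 1):
        row = ""
        for x in range(len(seq_x) - window + 1):
            same = seq_x[x:x + window] == seq_y[y:y + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


for line in dotplot("GATTACA", "GATTACA"):
    print(line)
```

Regions of similarity appear as diagonal runs of marks; comparing a sequence with itself, as here, always produces the main diagonal, and off-diagonal marks indicate repeated residues.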

Three-dimensional protein structure visualisation

In a preferred embodiment of the invention, three-dimensional protein structures may be usefully viewed. Figure 18 is a screenshot showing, by way of example, how such three-dimensional structures may be usefully displayed by means of a visualisation tool built into the framework of the invention. The data required to create such a visualisation may be searched for, downloaded, and stored in a local database using the search functions of the invention. The visualisation tool in this example is a third-party plugin module, providing options such as changing the way in which atoms and bonds are displayed, and allowing the molecule to be rotated along any axis. The secondary structure of the molecule may also be viewed as ribbons.

The foregoing describes the invention including preferred forms thereof. Alterations and modifications as will be obvious to those skilled in the art are intended to be incorporated within the scope hereof.