

Title:
COLLABORATIVE CURATION OF KNOWLEDGE FROM TEXT AND ABSTRACTS
Document Type and Number:
WIPO Patent Application WO/2007/139999
Kind Code:
A2
Abstract:
A networked system is disclosed that operates to facilitate collaborative curation, particularly of biomedical knowledge from biomedical text and abstracts. When a researcher uses a client computer to access a reference in a database, such as the PubMed database, the client computer automatically connects to a curation server, such as CBioC, and attempts to locate existing curated knowledge regarding that reference. Absent existing curated knowledge, the curation server retrieves a copy of the reference for itself and automatically extracts curation information through textual analysis of the reference. The curation information is sent to and displayed by the client computer. The user may be permitted to vote about the correctness of the displayed information, to revise the information, and/or to add new information regarding the reference. This new or revised information is stored by the curation server for later access.

Inventors:
BARAL CHITTA (US)
Application Number:
PCT/US2007/012624
Publication Date:
December 06, 2007
Filing Date:
May 25, 2007
Assignee:
UNIV ARIZONA STATE (US)
BARAL CHITTA (US)
International Classes:
G06F7/00; G16B50/30
Foreign References:
US6778979B2
Attorney, Agent or Firm:
STECK, Jeffrey, A. (300 South Wacker Drive, Suite 310, Chicago, IL, US)
Claims:

CLAIMS

1. A collaborative curation method comprising: determining that a user has accessed a first reference through a first source; providing to the user one or more knowledge elements related to the first reference, wherein the knowledge elements are accessed through a second source; in an interactive user interface, enabling a user to provide feedback regarding the knowledge elements.

2. The collaborative curation method of claim 1, further comprising: automatically extracting a plurality of knowledge elements to populate a database of the second source.

3. The collaborative curation method of claim 1, wherein the automatic extraction operates on demand such that a user may request automatic extraction of knowledge elements relating to the first reference.

4. The collaborative curation method of claim 1, further comprising: storing the user feedback in the second source.

5. The collaborative curation method of claim 1, wherein the feedback comprises a vote regarding applicability of a particular knowledge element.

6. The collaborative curation method of claim 1, wherein the feedback comprises a numeric value associated with the applicability of a particular knowledge element.

7. The collaborative curation method of claim 1, wherein the feedback comprises a ranking of two or more of the knowledge elements.

8. The collaborative curation method of claim 1, wherein the feedback comprises modifying a particular knowledge element.

9. The collaborative curation method of claim 1, further comprising: in the interactive user interface, enabling the user to modify one or more of the knowledge elements.

10. The collaborative curation method of claim 1, wherein the feedback comprises adding a new knowledge element.

11. The collaborative curation method of claim 1, wherein feedback is provided through a tool consisting of a button, a drag and drop mechanism, a link, a slider, or a text box.

12. The collaborative curation method of claim 1, further comprising: in the interactive user interface, enabling the user to add new knowledge elements.

13. The collaborative curation method of claim 1, wherein the first source comprises a database of biomedical texts and abstracts.

14. The collaborative curation method of claim 1, wherein the first and second sources are accessible through a packet network.

15. The collaborative curation method of claim 13, wherein the packet network is the Internet.

16. A computer readable medium for facilitating a collaborative curation, the computer-readable medium carrying one or more sequences of one or more instructions which, when executed by one or more processors, cause the one or more processors to perform the computer-implemented steps of: determining that a user has accessed a first reference through a first source; providing to the user one or more knowledge elements related to the first reference, wherein the knowledge elements are accessed through a second source; in an interactive user interface, enabling a user to provide feedback regarding the knowledge elements.

17. The computer readable medium of claim 16, further comprising instructions for implementing a toggle interface on a network browser, wherein the toggle interface operates to turn on a knowledge element user interface for interactively displaying the one or more knowledge elements.

18. The computer readable medium of claim 16, wherein the first source comprises PubMed; wherein the first reference comprises an abstract of a biotechnology article; wherein the knowledge elements comprise references to papers in PubMed that are to some extent related to the first reference.

19. A user interface for collaborative curation comprising: a reference database interface configured to display information related to a reference; and a knowledge element interface, wherein the reference interface is configured to interactively display one or more knowledge elements related to the reference.

20. The user interface of claim 19, further comprising a browser that includes both the reference database interface and the knowledge element interface.

21. The user interface of claim 19, further comprising a toggle button for enabling and disabling the knowledge element interface.

22. The user interface of claim 19, wherein the knowledge element interface includes a mechanism for providing user feedback regarding the knowledge elements.

23. A data source comprising: a plurality of knowledge element records, wherein each record is associated with a reference stored in a separate source, and wherein each knowledge element record includes an indication of a quality of the association; and a mechanism for modifying said quality of the association based on a user feedback regarding the knowledge elements.

24. The data source of claim 23, wherein the indication of a quality of the association comprises a number of votes.

25. The data source of claim 23, further comprising a mechanism for returning knowledge element records associated with a first reference.

26. The data source of claim 25, wherein through a query, the mechanism is configured to return a predetermined maximum number (n) of records, and wherein the records returned include the records most closely related to the first reference.

27. The data source of claim 25, wherein only a portion of the records are returned.

28. A system comprising: a first computer including a browser; a first source; a second source; and a network configured to communicatively connect the first computer with the first and second sources, wherein software at the first computer is configured to determine when the browser is directed to retrieving references stored at the first source, and wherein said software is further configured to query the second source for knowledge elements related to a first reference record retrieved from the first source, and wherein said software is further configured to provide feedback to the second source, wherein said feedback is indicative of a user's perception of relevance of the knowledge element in relation to the first reference.

29. The system of claim 28, further comprising an extractor system configured to populate the second source using an automatic search of the first source.

30. The system of claim 28, further comprising a query based mechanism at the second source for incorporating received feedback into records associated with the respective knowledge element.

Description:

COLLABORATIVE CURATION OF KNOWLEDGE FROM TEXT AND ABSTRACTS

BACKGROUND

This application claims the priority of U.S. Provisional Patent Application No. 60/808,391, filed May 25, 2006.

The present invention relates to collaborative reference curation. In addition to the data that exists in various public and private databases, there is a much larger and ever increasing amount of information buried in existing biomedical articles. It is beyond human ability to read the various relevant articles and recall relevant findings of these articles for further research. The sheer volume of the articles and their constant growth makes it prohibitively expensive to employ (and monetarily compensate) human curators to read through the articles and cull the useful information buried in them. The volume of existing biomedical articles is huge and it grows day by day. From 1994 to 2004, close to 3 million biomedical articles were published by US and European researchers alone. Added to the approximately 15 million abstracts already in PubMed, this represents over 800 new articles per day and a myriad of individual new facts to survey for information relevant to a particular research question.

Currently two approaches are pursued to extract and combine facts from biomedical publications. The first approach of hiring human curators is expensive, and thus does not scale up. It is also subject to bias. The second approach of using automated information extraction systems only has a recall and precision of around 60%.

Nevertheless, human curation has been tried for specific domains. Due to the issue of cost, many of the curated databases are proprietary and offer only limited coverage. The following are examples of such efforts:

• Bader, G.D., Donaldson, I., Wolting, C., Ouellette, B.F., Pawson, T., and Hogue, C.W. (2001) BIND-The Biomolecular Interaction Network Database. Nucleic Acids Research 29: 242-245.

• BIND: Biomolecular Interaction Network Database, available at <http://www.bind.ca>

• Stein, Lincoln (2002), Creating a bioinformatics nation, Nature, 417: 119-120.

• Xenarios, I. and Eisenberg, D. (2001) Protein interacting databases. Current Opinion in Biotechnology, 12: 334-339.

• KEGG: Kyoto Encyclopedia of Genes and Genomes, available at <http://www.genome.jp/kegg/>

• HPRD: Human Protein Reference Database, available at <http://www.hprd.org/>

In recent years, an alternative approach of using automatic text extraction systems has been proposed. Such efforts are described in the following references:

• Rzhetsky, A. et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data, Journal of Biomedical Informatics 37: 43-53.

• Fukuda, K., Tamura, A., Tsunoda, T., and Takagi, T. (1998) Toward information extraction: identifying protein names from biological papers, PSB 1998, 707-718.

• Tanabe, L. and Wilbur, W.J. (2002) Tagging gene and protein names in biomedical text, Bioinformatics 18(8): 1124-1132.

• Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. (1999) Automatic extraction of biological information from scientific text: protein-protein interactions, Proceedings of the International Conference on Intelligent Systems for Molecular Biology 1999, 60-67.

• Ono, T., Hishigaki, H., Tanigami, A., and Takagi, T. (2001) Automated extraction of information on protein-protein interactions from the biological literature, Bioinformatics 17(2): 155-161.

• Novichkova, S., Egorov, S., and Daraselia, N. (2003) MedScan, a natural language processing engine for MEDLINE abstracts, Bioinformatics 19(13), 1699-1706.

• Friedman, C., Kra, P., Yu, H., Krauthammer, M., Rzhetsky, A. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles, Bioinformatics 2001, 17 Suppl. 1: S74-82.

• Rzhetsky, A., Iossifov, I., Koike, T., Krauthammer, M., Kra, P., Morris, M., Yu, H., Duboue, P.A., Weng, W., Wilbur, W.J., Hatzivassiloglou, V., and Friedman, C. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform. 2004 February, 37(1), 43-53.

• Corney, D.P., Buxton, B. F., Langdon, W.B., and Jones, D.T. (2004) BioRAT: extracting biological information from full-length papers, Bioinformatics 20(17): 3206-3213.

• Temkin, J.M. and Gilder, M.R. (2003) Extraction of protein interaction information from unstructured text using a context-free grammar, Bioinformatics 19(16): 2046-2053.

• Chiang, J.H., Yu, H.C., and Hsu, H.J. (2004) GIS: a biomedical text-mining system for gene information discovery, Bioinformatics 20(1): 120-121.

• Craven, M. and Kumlien, J. (1999) Constructing biological knowledge bases by extracting information from text sources, Proceedings of the International Conference on Intelligent Systems for Molecular Biology 1999, 77-86.

• Bunescu, R., Ge, R., Kate, R.K., Marcotte, E.M., Mooney, R.J., Ramani, A.K., and Wong, Y. W. (2004) Comparative Experiments on Learning Information Extractors for Proteins and their Interactions. Journal Artificial Intelligence in Medicine 2004.

• Ding, J., Berleant, D., Xu, J., and Fulmer, A. (2003) Extracting biochemical interactions from MEDLINE using a link grammar parser, Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), 467.

Systems of automatic text extraction, however, at times infer incorrect information or miss out on important information. Moreover, most existing systems focus on simpler data forms, such as identification of gene or protein names, or simple interactions without context. Sometimes such simplicity may lead to inconsistency.

SUMMARY

A Web-based system is disclosed that operates to facilitate collaborative curation of biomedical knowledge from biomedical text and abstracts. More generally, the system may take other, non-web-based forms, and may also be used for applications other than those directly relating to biomedical knowledge or research.

In a first embodiment, a Web-based software system is configured on a computer that is connected to a public network, such as the Internet. When a database or repository site such as PubMed is accessed through the computer, software will open a side-by-side frame interface. When a particular abstract or article is explored, the system will connect to a server and attempt to locate existing curated knowledge regarding that article or abstract. Any retrieved knowledge may be displayed in the new frame, and the user may be permitted to vote about the correctness of each displayed knowledge element. Further, the user may be permitted to add new knowledge elements and/or also revise existing knowledge elements.

A registration process may be required, and users may be assigned categories and their permission levels modified according to user categories. For example, a professor may be allowed to create new knowledge schemas, while undergraduate students may only be allowed to browse the knowledge. To motivate researchers to participate, the system may use automatic text extraction programs to extract knowledge for all articles as a bootstrap. Thus an initial visit to an article may not result in a blank knowledge-base. Rather, the user may see automatically extracted knowledge and will be able to vote on its relevance and/or value. The voting is preferred, as the automatic extraction systems are not necessarily accurate, and even the best automatic extraction systems do not have perfect recall or precision.

This approach can be used to solve the problem of dealing with the large number of existing articles and the explosion of new articles, because the curation is done by the whole community. As the number of articles increases, there is also an increase in the number of researchers who write those articles. Since these researchers will benefit from the knowledge repository created by our system, those researchers may participate in this collaborative curation effort. Thus the number of articles may be matched by the number of curators. In one aspect, the system may be an open system accessible for free to the research community or the public. However, other fee-based and controlled systems are also contemplated.

The foregoing as well as other aspects, advantages, and alternatives will become apparent to those of ordinary skill in the art by reading the following detailed description and claims, with reference where appropriate to the accompanying drawings. Although this summary speaks of specific embodiments, it should not be read to limit the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram illustrating the functional architecture of a collaborative curation system.

Figure 2 is a block diagram illustrating the functional architecture of a collaborative curation system.

Figure 3 is a block diagram illustrating another architecture of a collaborative curation system.

Figure 4 is a flow diagram illustrating operations performed in initiating use of a collaborative curation system.

Figure 5 is a flow diagram illustrating operations performed in using a collaborative curation system as a research tool.

Figure 6 is a wireframe illustration of a user interface displayed at a client computer, including a reference interface and a Web band.

Figure 7 is a wireframe illustration of the Web band portion of the user interface, showing the tabular display of curation information.

Figure 8a is a wireframe illustration of a user interface for rating confidence levels in an element of curation information.

Figure 8b is a wireframe illustration of a display in which users are provided with a history of confidence level ratings.

Figure 9 is a flow diagram illustrating steps taken in a collaborative curation method.

Figure 10 is a block diagram illustrating the functional architecture of a collaborative curation system.

Figure 11 is a schematic wireframe illustration of a user interface displayed at a client computer, including a reference interface and a Web band.

DETAILED DESCRIPTION

A. System Architecture

Figure 1 is a block diagram illustrating an architecture of a collaborative curation system in accordance with a preferred embodiment. In the figure, a packet network 102, such as the Internet, interconnects a first source 106, a second source 108, and a client computer, such as a personal computer 104. The personal computer 104 is preferably equipped with an Internet browser or other software that can be used as a research tool for seeking out references, such as publications stored electronically in an on-line database.

The client computer is equipped with watching software that operates to determine if and/or when the client computer is being used to seek out references. In a preferred embodiment, this watching software is in the form of a browser extension; in this case, the watching software is referred to as a browser helper. Alternatively, the watching software may be an element of software external to the browser, or it may be a part of a stand-alone software research system.

The browser helper watches network activity at the client computer 104 to determine when the browser is being used to seek out references over the network. The browser software provides the user interface for control of the watching software. In a preferred embodiment, the first source 106 contains reference information such as publication text, keywords, titles, citation information, and/or abstracts, for instance, and the first source makes at least some of this information available over a network. (Although such availability may require a subscription or be otherwise subject to limited access rights.) These references may be, but are not necessarily, research articles, such as biomedical research articles. The first source can be referred to as a reference server. The term "server" as used herein is intended not to be limited to implementations on a single computer, but rather encompasses other implementations, such as those on multiple computers, which may be distributed at different locations.

In a preferred embodiment, the second source 108 contains knowledge elements associated with references in the first source 106, and the second source makes at least some of this information available over a network. For example, when the first source is a repository of biomedical research articles, the knowledge elements may include, for each research article, information identifying the various biological entities, diseases, biological processes — and interactions therebetween — discussed in the article. Biological entities may include proteins, genes, and organs, among others. Because the second source includes information that can be used to organize information in the first source, it can be referred to as a curation server, which again may be implemented on one or more than one computer, which may or may not be separate from the reference server.

Although the reference server 106 and the curation server 108 are shown as separate elements in Figure 1, it is contemplated that the two sources may be configured on the same physical hardware and even within the same database. In a particular example, the reference server 106 may be the PubMed system, and the curation server 108 may be fully incorporated into the PubMed database system. Further, one or more of the sources may temporarily or permanently be included on the client computer 104 so that access across the packet network 102 is not necessary.

Figure 2 provides another look at the architecture of a collaborative curation system at the database level. A client computer 204 (which may be a personal computer or other computing device, including a PDA or handheld device) may be connected to various reference databases 206 as well as a curation database 202. When the client computer 204 retrieves references (such as texts, titles, citations, abstracts, etc.) from the reference database, the curation database is used to locate supporting information regarding the retrieved reference.

A more detailed architecture of one possible curation system is illustrated in Figure 3. References of interest to researchers are stored in one or more databases 302. A download agent 304 is capable of automatically retrieving references from the reference databases 302 and storing the reference text in a text database 306. An extractor system consults this stored text and operates software to analyze the text and to extract curation information from the text. Other databases 310 may likewise include curation information relating to the references stored in the reference databases 302. The system can benefit from this information by implementing another download agent 312. This agent populates another database 314 with the downloaded curation information. Because the downloaded information may not be in the same format as that output by the extractor systems 308, a data format exchange system 316 operates software to translate data formats if necessary. This translated information, together with extracted information, can be combined in a curation database, such as the CBioC database 318. At the client side, a user interface 320 enables a user 322 to browse through the facts that comprise the curation information, to vote on the accuracy of those facts, to add new facts or modify facts, to invoke the extraction system, or to handle user account information.
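The translate-and-combine step of Figure 3 can be sketched as follows. This is only an illustrative sketch: the internal fact layout, the field names of the downloaded records (`pmid`, `interactor_a`, `interaction_type`, `interactor_b`), and the choice to drop duplicate facts are all assumptions made for the example and are not specified in the description.

```python
def translate_external(record):
    """Map a downloaded record (hypothetical external layout) onto the
    internal fact layout assumed for this sketch."""
    return {
        "reference_id": record["pmid"],
        "subject": record["interactor_a"],
        "relation": record["interaction_type"],
        "object": record["interactor_b"],
        "origin": "downloaded",
    }


def combine_facts(extracted, downloaded):
    """Combine automatically extracted facts with translated downloaded facts.

    Assumption: two facts are the same if reference, subject, relation and
    object all match; the first occurrence (extracted facts come first) wins.
    """
    combined = {}
    translated = [translate_external(r) for r in downloaded]
    for fact in list(extracted) + translated:
        key = (fact["reference_id"], fact["subject"],
               fact["relation"], fact["object"])
        combined.setdefault(key, fact)
    return list(combined.values())
```

Keying the combined store on the full fact tuple keeps the merge idempotent, so a download agent could be re-run without creating duplicate entries.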

B. Operation of an Exemplary Curation System

An exemplary curation system makes use of the interactions between an end user, that user's client computer running browser software with a browser helper, a curation server, and a reference server. Such interactions are described herein with respect to one particular embodiment. In this particular embodiment, the curation server is a server affiliated with the CBioC project, described at <www.cbioc.org>, and the reference server is a server affiliated with the PubMed service, available at <www.pubmed.gov>. It should be understood that these examples are chosen for the sake of clarity, but that the invention is not limited to these particular services or the servers and databases affiliated with them.

1. Automatic Session Setup

Figure 4 illustrates how a user's attempt to access the PubMed database automatically triggers the client computer to begin setting up a session with the CBioC server. In step 2, the user attempts to navigate to the PubMed Web site by, for example, selecting an appropriate bookmark or hyperlink, or typing a uniform resource locator (URL) addressing the PubMed Web site. In response, in step 4, the browser software attempts to retrieve the PubMed welcome page by sending a request to the PubMed server. The request may be, for example, a GET request in the hypertext transfer protocol (HTTP).

The browser helper, in step 6, detects that the browser has requested a page from the PubMed Web site. It may do this by, for example, monitoring HTTP requests generated by the browser software and determining whether such a request indicates that the user is attempting to navigate to the PubMed Web site. In other embodiments, software at the client computer may use other techniques to detect a user's attempt to access the PubMed Web site.

In response to detecting that the user is navigating to the PubMed Web site, the browser helper in step 8 requests the browser to open a curation region of the browser's user interface. This curation region may be, for example, a new window, a new tab, or a new frame. In a preferred embodiment, this curation region is in the form of a "Web band," which appears as a separate region below a main window of the browser's client area. In response to the request, the browser in step 10 opens the Web band.

In parallel with the actions taken by the browser helper, the request for the PubMed welcome page that was generated by the browser in step 4 is received by the PubMed server, and the PubMed server, in response, sends the PubMed welcome page to the client computer in step 12. The computer receives the welcome page and displays it in step 14 in the main window of the browser's client area.

Throughout the operation of the curation system, it is preferable that the main window of the client area operates as it would if the browser helper were not present. For example, the user can make use of a browser's "address bar" or navigation buttons, and otherwise navigate through the Web (including to Web sites unrelated to research) without interference from the browser helper. Conversely, some (but not all) interactions taking place through the main window cause the browser helper to take action, and, where appropriate, to display relevant information in the Web band.

With the detection of a request to access the PubMed Web site, the browser helper in step 16 requests a welcome page from the CBioC server. In step 18, the CBioC server receives the request and sends the CBioC welcome page to the client computer, where the browser software causes it to be displayed in the Web band. In this example, the user has not yet logged in to the CBioC service, so the welcome page prompts the user to enter his or her username (which may be an email address) and password. The user obliges in step 22, and the client computer sends the username and password to the CBioC server (step 24) for authentication (step 26). Assuming authentication is successful, the CBioC server initiates a session with the user in step 28. The establishment of a session may involve setting up a session identifier stored as a "cookie" at the client computer. In future interactions within the session, the client computer can send the session identifier to the CBioC server, so that the user need not re-enter a username and password with each transaction.

In some embodiments, particularly those dealing with non-critical or non-controversial data, the level of authentication required may be minimal or nonexistent. Moreover, authentication may permit different levels of access for different users. For example, no authentication may be required for read-only access, whereas some authentication may be required for read/write access.
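One way such tiered access might be expressed is sketched below. The action names (`browse`, `vote`, `add`, `modify`) and the two-level scheme are assumptions chosen to mirror the read-only versus read/write example; an actual embodiment could define more categories, such as the user categories mentioned in the Summary.

```python
from enum import IntEnum


class Access(IntEnum):
    READ_ONLY = 1
    READ_WRITE = 2


# Hypothetical mapping from user actions to the access level they need.
REQUIRED_ACCESS = {
    "browse": Access.READ_ONLY,
    "vote": Access.READ_WRITE,
    "add": Access.READ_WRITE,
    "modify": Access.READ_WRITE,
}


def access_for(authenticated):
    """Unauthenticated users get read-only access; authenticated users
    get read/write access."""
    return Access.READ_WRITE if authenticated else Access.READ_ONLY


def is_allowed(action, authenticated):
    """Return True if the action is known and the user's level suffices."""
    required = REQUIRED_ACCESS.get(action)
    if required is None:
        return False  # unknown actions are rejected
    return access_for(authenticated) >= required
```

Here an anonymous visitor could browse curation information, while voting on or editing knowledge elements would require a logged-in session.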

When a session has begun, the CBioC server sends to the client computer a page indicating that the login has been successful. In a presently preferred embodiment, this page includes statistical information on the use of the CBioC service (step 30). In step 32, the client computer displays this statistical information in the Web band.

2. Automatic Retrieval of Curation Information

Once a user is set up in a session with the CBioC server, the user's attempts to access a reference in the PubMed database can trigger the display of curation information relating to that reference. This display preferably appears only in the Web band, allowing the user to search the PubMed database using prior art techniques if he or she so desires, or to review and interact with the information in the Web band if the user believes it would be helpful. Figure 5 provides an example of the steps that might take place in a typical user interaction with the curation system, although it should be understood that other types of interaction are possible as well. In one such interaction, the user types search terms in the PubMed search box (step 34) and submits the search by clicking "Go" (step 36). This causes the browser software to send these search terms (possibly with other parameters) in an HTTP GET request to the PubMed server (step 38). The PubMed server receives these parameters in step 40 and, in response, runs a search of the PubMed database (step 42). In step 44, the PubMed database sends the results of the search to the client computer, where they are displayed in step 46 by the browser software.

In a preferred embodiment, the search results are in the form of article citations, and each of these citations is hyperlinked to retrieve the abstract of the article. If the user clicks on such a hyperlinked citation, as in step 48, the browser sends a request for the selected reference (step 50). The PubMed server receives and processes the request and sends the selected reference to the client computer (step 52), where it is displayed, as expected by the user, in the main window (step 54).

In the meantime, the browser helper also determines that the user has attempted to view a particular reference, and this determination calls the browser helper into action. In this example, the browser helper determines that the user is attempting to view a particular reference by detecting in step 56 the reference request that was generated in step 50. In a preferred embodiment, the browser helper parses HTTP GET requests generated by the browser to determine whether a particular reference has been requested, and, if so, the identity of that reference. As one example, where the reference server is the PubMed server, the browser helper looks for GET requests directed to the host "www.ncbi.nlm.nih.gov", where the requested resource path begins with "/entrez" and the search parameters of the request include the parameter "list_uids=" followed by a valid PubMed article identification number (e.g., 15051730).
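The request-matching rule just described can be sketched in a few lines. The host, the "/entrez" path prefix, and the "list_uids" parameter come from the description; the function name, the treatment of a "valid" identifier as a non-empty string of digits, and the sample URL path used in the usage note are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlsplit

PUBMED_HOST = "www.ncbi.nlm.nih.gov"
PUBMED_PATH_PREFIX = "/entrez"


def extract_pubmed_id(url):
    """Return the PubMed article ID requested by `url`, or None if the
    URL does not look like a request for a single PubMed reference."""
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != PUBMED_HOST:
        return None
    if not parts.path.startswith(PUBMED_PATH_PREFIX):
        return None
    values = parse_qs(parts.query).get("list_uids", [])
    # Assumption: a valid identifier is a non-empty run of ASCII digits,
    # and only single-reference requests are of interest here.
    if len(values) == 1 and values[0].isascii() and values[0].isdigit():
        return values[0]
    return None
```

For example, under these assumptions a URL such as `http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&list_uids=15051730` yields `"15051730"`, while a request to any other host, or one lacking a numeric `list_uids` value, yields `None` and the browser helper would take no action.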

In other embodiments, the browser helper, watching software, or other client-side software may use other techniques to determine automatically when the user is attempting to view a reference. For example, instead of parsing outgoing request messages, the software could watch for incoming data that is indicative of a reference. Alternatively, the client-side software can provide a front-end user interface through which the user searches for and/or selects references; in this case, the client-side software reacts to the user's selections by generating the appropriate requests to the reference server and to the curation server.

Having detected in step 56 the request for a reference, the browser helper generates a request for curation information and sends that request to the CBioC server in step 60. If curation information for that reference is already available (for example, it is stored in a curation database), it is retrieved in step 62 and sent to the client computer in step 64.

If, on the other hand, no curation information is already available to the CBioC server, that server operates to extract useful information automatically. In step 66, the CBioC server requests its own copy of the reference being accessed by the user. The PubMed server sends that reference to the CBioC server (step 68), and the CBioC server automatically extracts curation information relating to the reference. The automatically extracted curation information may include, for example, the names of proteins, organs, diseases, genes, and biological processes, and the interactions between them, that are mentioned in the article. This automatically extracted curation information is then stored in a database by the CBioC server (step 72) and is used to respond to the current request (from step 58) and to respond to later requests relating to the same reference. Once the curation information has been sent to the client computer, it is displayed in the Web band of the browser client area (step 74).
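The lookup-then-extract behavior of the curation server can be sketched as below. This is a simplified illustration under stated assumptions: the class `CurationServer`, the in-memory dictionary standing in for the curation database, and the injected `fetch_reference` and `extract` callables are all hypothetical and are not part of the CBioC implementation.

```python
from typing import Callable, Dict, List

Fact = Dict[str, str]  # hypothetical representation of one piece of curation information


class CurationServer:
    """Return stored curation information, extracting and storing it on first request."""

    def __init__(
        self,
        fetch_reference: Callable[[str], str],   # stand-in for requesting the reference (steps 66-68)
        extract: Callable[[str], List[Fact]],    # stand-in for the automatic extractor
    ) -> None:
        self._fetch_reference = fetch_reference
        self._extract = extract
        self._db: Dict[str, List[Fact]] = {}     # stand-in for the curation database

    def curation_info(self, reference_id: str) -> List[Fact]:
        # Existing curation information is returned directly (steps 62-64).
        if reference_id in self._db:
            return self._db[reference_id]
        # Otherwise the server obtains its own copy of the reference and extracts from it.
        text = self._fetch_reference(reference_id)
        facts = self._extract(text)
        # The result is stored so later requests for the same reference skip extraction (step 72).
        self._db[reference_id] = facts
        return facts
```

Passing the fetch and extraction functions in as parameters keeps the sketch independent of any particular reference server or extractor; a second request for the same reference identifier is answered from the stored copy.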

3. Interpreting the Curation Information

In a preferred embodiment, users of the curation system not only benefit from receiving curation information associated with a reference, they are also able to contribute to the curation information, for example by adding new information and by judging the accuracy of the information already present.

Figure 6 is a schematic illustration showing the layout of a user interaction screen displayed by the browser and browser helper software. A client area 600 is divided into a reference interface 602, and a Web band 604. In this example, the PubMed welcome page is displayed in the main window. The PubMed welcome page includes a text box 606 in which a user types search parameters, such as key words, together with command buttons such as the "Go" button (608) to submit the search parameters. In this illustration, the Web band displays text boxes prompting the user to enter his or her email address (box 610) and password (box 612) and a command button 614 that causes the address and password to be sent to the CBioC server, in order to log in to the server.

Figure 7 illustrates curation information that is displayed in the Web band 604 when the user navigates to an abstract on the PubMed server. As an example, Figure 7 may illustrate the outcome of step 74 in Figure 5, when curation information is received by the client computer and displayed in the Web band. In this example, the curation information relating to the requested reference is illustrated in the format of a tabbed table. Through the selection of various tabs, the user can view curation information relating to Protein/Protein interactions (tab 702), Gene/Disease interactions (tab 704), Gene/Organ interactions (tab 706), and Gene/Bio Process interactions (tab 708). In a preferred embodiment, the curation information under each tab, relating to each class of interaction, is organized in a fashion analogous to the Protein/Protein interactions illustrated under tab 702. Accordingly, for the sake of simplicity, only the contents of tab 702 are illustrated herein.

Under tab 702, curation information is provided relating to interactions between proteins discussed in the reference. The curation information is organized in a table in which each row describes an interaction. Within each row, the column "Protein 1" names a protein that has some role in the interaction. The column "Interaction" names the type of interaction or relationship between proteins (e.g., "regulator," "binds," "inhibits," "interacts," "stimulates," "depletion," "phenotypes," "repressed," "interact," "expressed," among others). The column "Protein 2" names a second protein involved in the interaction. Although it is not necessary, it is desirable for these first three columns to read as a declarative sentence relating to a protein/protein interaction. For example, at the fifth row of the table, the first three columns together read "KAP1 stimulates p53 HDAC1 complex," which conveys information in a form easy for a researcher to identify and understand. Where the interaction between proteins discussed in a reference is less clear, it may not be possible or desirable to present information in such a straightforward way. In such a case, the nature of the interaction may be described more broadly, as in the fourth row, which reads "MDM2 interacts KAP1." With this information, researchers looking into the relationship between MDM2 and KAP1 can at least identify the reference as one that refers to such an interaction, even where no more precise formulation would be appropriate.

In a row corresponding to an interaction, the column "Source" identifies the person or other source responsible for entering the interaction in the curation system. For example, where a human user reads a reference and enters an interaction into the curation system, the "Source" may be the screen name or other identifier associated with the human user (e.g., "reader1"). Alternatively, where the interaction is entered into the curation system automatically, using software that extracts interaction information from reference text, the name of the software or other automatic agent may be identified as the "Source." For example, interactions may be entered automatically by the IntEx system, described in S. T. Ahmed et al., "IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Biomedical Text." Such automatic entry of interactions may take place during, for example, step 70 of Figure 5. Where the IntEx system has automatically entered information on an interaction, "IntEx" may be named as the "Source" of the interaction. The source information may differentiate between versions of an automatic extraction system (e.g., "IntEx3").
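One way to picture a row of the table under tab 702 is as a small record whose first three fields read as a sentence. The sketch below is only illustrative: the class name `InteractionRow` and its field names are hypothetical, and the pairing of the example interaction with the source "reader1" is made up for the example rather than taken from Figure 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionRow:
    """One row of the interaction table (hypothetical field names)."""

    protein_1: str    # "Protein 1" column
    interaction: str  # "Interaction" column, e.g. "binds", "inhibits", "stimulates"
    protein_2: str    # "Protein 2" column
    source: str       # "Source" column: a screen name or an extractor name such as "IntEx"

    def sentence(self) -> str:
        # The first three columns are meant to read as a declarative sentence.
        return f"{self.protein_1} {self.interaction} {self.protein_2}"


# Source assignment here is illustrative only.
row = InteractionRow("KAP1", "stimulates", "p53 HDAC1 complex", source="reader1")
text = row.sentence()
```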

4. Contributing to the Curation Information

A user with read-write access to the curation information can not only review information that has been entered by others (or automatically); he or she also has the capability of adding new curation information. For example, a user may review a reference and note that a particular interaction is not listed with the curation information. That user has the capability to manually add information on that interaction to the corpus of curation information. In the example of Figure 7, text boxes are provided for a user to enter information relating to a first protein (box 710), a second protein (box 712), and the nature of the interaction (box 714). By clicking an "Add" button (command button 716), the user causes the data he or she has entered to be transmitted to the curation server (e.g., the CBioC server) for storage and later retrieval.

Such a user is further provided with the ability to modify curation information already stored by the curation server. The user can initiate the making of these modifications by clicking the "Modify" command button (e.g., button 718) in the row of interest. The history of modifications to curation information is stored, and changes can be accessed by users for review.

5. Rating the Curation Information

Much of the curation information may have been generated automatically, or it may have been generated by individuals whose qualifications and biases are not known to a particular user. To encourage an appropriate level of confidence in the curation system, the system permits users to rate the accuracy of the curation information relating to each interaction. This rating information is displayed (preferably in summary form) along with other interaction information, providing for each user a sense of whether this information has been accepted by the research community at large.

At least two types of rating information can be provided. First, a user can provide feedback on whether the interaction has been accurately characterized. An interaction may be inaccurately characterized if, for example, an automatic interaction extractor erroneously identifies an "interaction" between two non-interacting proteins discussed in the reference. Similarly, a human user may make a typographical or other error while typing interaction information. A second kind of feedback a user can provide relates to how well-documented a particular interaction is in the associated reference. For example, a reference may refer in passing to an observation that MDM2 is a regulator of p53, and that interaction may be properly extracted in the curation information, but it would be helpful for researchers to know that the reference provides little discussion of or support for the interaction.

To provide the first kind of feedback (on whether an interaction has been correctly extracted), a column labeled "Correctly Extracted?" is provided, and within that column are command buttons "Yes" (720) and "No" (722) corresponding to each interaction. A user may select one of the buttons to vote on whether he or she believes the interaction has been correctly extracted, and each user's vote is sent to and stored by the curation server. Preferably, each vote is associated with a user identifier. This association has at least two benefits. First, it can be used to prevent a user from (possibly inadvertently) voting more than once on the same interaction. Second, the association can be used to permit a user to change his or her vote, possibly as a result of gaining greater understanding of the subject matter of the interaction.

The outcome of this first kind of feedback is displayed in columns entitled "% Approval" and "Votes." In the "% Approval" column, the proportion of "yes" votes may be illustrated by a bar graphic (e.g., graphic 724), among other means. The "Votes" column indicates the number of votes entered regarding each interaction.
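The vote handling and the "% Approval" and "Votes" columns can be sketched as follows. This is a minimal illustration, assuming votes are keyed by user identifier so that a repeated vote replaces the earlier one; the class name `VoteTally` and its methods are hypothetical.

```python
from typing import Dict, Optional


class VoteTally:
    """Tracks "Correctly Extracted?" votes for one interaction, one vote per user."""

    def __init__(self) -> None:
        self._votes: Dict[str, bool] = {}  # user identifier -> True for "Yes", False for "No"

    def vote(self, user_id: str, correct: bool) -> None:
        # Keying by user identifier prevents double counting and lets a user change a vote.
        self._votes[user_id] = correct

    @property
    def votes(self) -> int:
        """Value for the "Votes" column."""
        return len(self._votes)

    @property
    def percent_approval(self) -> Optional[float]:
        """Value for the "% Approval" column, or None when nobody has voted."""
        if not self._votes:
            return None
        return 100.0 * sum(self._votes.values()) / len(self._votes)
```

Returning `None` for an interaction with no votes, rather than 0, is a design choice of this sketch; it lets a display distinguish "not yet voted on" from "unanimously rejected".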

To provide the second kind of feedback (on the level of support for an interaction), a column entitled "Evidence Level" is provided. By clicking on the "Rate" command button (e.g., button 726), the user is provided with the ability to enter feedback on the level of confidence in a particular interaction. In one embodiment, this feedback is provided through a feedback interface such as a pop-up box 800, illustrated in Figure 8a, which appears over the curation region of the browser's user interface. In the example of Figure 8a, a user may click on icons indicating zero through five stars to enter his or her opinion on the level of support for the selected interaction. As examples, clicking on star 804 indicates a selection of "two stars," while clicking on the crossed-out star 808 indicates "no stars." In a preferred embodiment, a textual description 806 of each possible star rating appears when a user's cursor is held over one of the star icons (e.g., when the user "mouses over" an icon). Such a textual description can provide short normative guidelines as to the meaning of the star ratings. As examples, descriptions may be provided as follows:

No stars: "No rating"

One star: "Author statement, high-throughput data, or Inferred from Electronic Annotation"

Two stars: "Traceable Author Statement"

Three stars: "Inferred from Sequence Similarity or Reviewed Computational Analysis"

Four stars: "Indirect experimental evidence"

Five stars: "Direct experimental evidence"

In a text box 802, a user may optionally type a brief reason or clarification for his or her rating. In response to selection of the "Submit" link, a user's rating is sent to the curation server for storage. From the curation region of the user interface, as illustrated in Figure 7, a user may review the feedback relating to each interaction. In the column entitled "Average Confidence," the curation information indicates whether an interaction has been rated at all and, if so, what the average of those ratings is. Where an average is provided, e.g. the rating "3" indicated at entry 728, the user may access details on that rating by clicking on the rating itself. In response, the client computer obtains from the curation server and displays information available on that rating. This information may be displayed in tabular format, as illustrated in Figure 8b. In Figure 8b, table 810 provides details on the confidence level. Each row of the table provides information on a rating provided by a user. For each rating, the table identifies the user (by screen name, for example), the rating level, the time at which the rating was entered, and any comments entered by the user. The tabular format illustrated in Figure 8b may correspond to the structure of a relational database in which this information is stored by the curation server.
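The rating table of Figure 8b and the "Average Confidence" value can be sketched with a small relational table. This is an illustration only: the table and column names are hypothetical, the timestamps are expected as plain text, and the choice to leave zero-star ("No rating") entries out of the average is an assumption, since the text does not say how such entries are counted.

```python
import sqlite3
from typing import List, Optional, Tuple


def create_ratings_table(conn: sqlite3.Connection) -> None:
    # One row per rating: who rated, the star level, when, and an optional comment.
    conn.execute(
        """CREATE TABLE ratings (
               interaction_id INTEGER NOT NULL,
               screen_name    TEXT    NOT NULL,
               stars          INTEGER NOT NULL CHECK (stars BETWEEN 0 AND 5),
               entered_at     TEXT    NOT NULL,
               comment        TEXT
           )"""
    )


def add_rating(conn: sqlite3.Connection, interaction_id: int, screen_name: str,
               stars: int, entered_at: str, comment: Optional[str] = None) -> None:
    conn.execute(
        "INSERT INTO ratings VALUES (?, ?, ?, ?, ?)",
        (interaction_id, screen_name, stars, entered_at, comment),
    )


def average_confidence(conn: sqlite3.Connection, interaction_id: int) -> Optional[float]:
    # Assumption: zero-star ("No rating") rows do not contribute to the average.
    row = conn.execute(
        "SELECT AVG(stars) FROM ratings WHERE interaction_id = ? AND stars > 0",
        (interaction_id,),
    ).fetchone()
    return row[0]  # None when the interaction has not been rated


def rating_details(conn: sqlite3.Connection, interaction_id: int) -> List[Tuple]:
    # Rows for a detail table like table 810: user, rating level, time, comment.
    return conn.execute(
        "SELECT screen_name, stars, entered_at, comment FROM ratings "
        "WHERE interaction_id = ? ORDER BY entered_at",
        (interaction_id,),
    ).fetchall()
```

With an in-memory database (`sqlite3.connect(":memory:")`), ratings of two and four stars on the same interaction give an average of 3.0, and an interaction with no rows gives `None`.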

Other curation information illustrated in Figure 7 may be hyperlinked to enable users to learn more about an area of interest. Hyperlinked information is illustrated in Figure 7 with underlining. Thus, protein names may be hyperlinked, and a user's selection of a protein name can cause other curation information relating to that protein to be retrieved from the curation server. Each interaction may be accompanied by a "Related Articles" hyperlink, the selection of which causes the curation server to identify other articles relating to the interaction.

6. The User Experience

Through the techniques described herein, the community of researchers that writes and reads biomedical texts (or texts in other specialized fields) will be able to contribute to the curation process. Automated text extraction is used as a starting point to "bootstrap" the database, offering encouragement for researchers to participate at an early stage in the process. Biologists and other researchers can then improve upon the extracted data, ironing out inconsistencies by subsequent edits on a massive scale. Researchers can correct information that is automatically extracted from the biomedical texts, vote on the accuracy of the extraction, and rate the reliability of the extracted facts based on the evidence presented by the author.

The CBioC system described herein, for example, runs as a Web browser extension and allows unobtrusive use of the system during the regular course of research in PubMed. However, this system can also be accessed directly, without having to install a browser plug-in.

The CBioC system described herein allows users to search the curation database for all facts related to a particular protein, gene, disease, or interaction word by typing the relevant term in a Search box within the CBioC Web band. CBioC automatically expands search terms with known synonyms of the terms. The facts available for a set of abstracts can be displayed by typing a comma-separated list of their PubMed identification numbers in the search box.
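The two search behaviors described above (synonym expansion and lookup by a comma-separated list of PubMed identification numbers) can be sketched as below. The function names, the lower-casing of terms, and the synonym table passed in by the caller are all assumptions made for illustration; the text does not describe how CBioC stores or matches synonyms.

```python
import re
from typing import Dict, List, Set


def parse_pubmed_id_list(query: str) -> List[str]:
    """Return the IDs if the query is a comma-separated list of numbers, else an empty list."""
    parts = [p.strip() for p in query.split(",")]
    if all(re.fullmatch(r"[0-9]+", p) for p in parts):
        return parts
    return []


def expand_term(term: str, synonyms: Dict[str, Set[str]]) -> Set[str]:
    """Return the search term together with its known synonyms (case-folded to lower case)."""
    key = term.strip().lower()
    # `synonyms` is a hypothetical lookup table keyed by lower-case term.
    return {key} | {s.lower() for s in synonyms.get(key, set())}
```

A search box handler could first try `parse_pubmed_id_list`; if it returns an empty list, the query would be treated as a term and expanded with `expand_term` before the curation database is searched.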

These features as described with respect to the CBioC system embodiment may be offered in or omitted from other embodiments, and they may be adapted as appropriate for use in non-biomedical fields of research.

C. Additional Embodiments

For the sake of clarity, a curation system was described in Section B, above, with reference to the particular example of the CBioC service used in conjunction with the PubMed database. Such a combination is not the only possible implementation of the systems described herein. More generally, a method implemented by the systems described herein is illustrated in Figure 9. This method may be implemented by software operating on a user's computer. At step 902, a determination is made that a user has accessed a first reference. At step 904, knowledge elements related to the first reference are provided to the user. The knowledge elements may be provided through a graphic user interface at a display screen, for instance. At step 906, the user is enabled to provide feedback regarding the knowledge elements. According to a preferred embodiment, the feedback provides user input on the relevance of the knowledge elements to the reference.

Figure 10 illustrates the functional architecture of a curation system that is not necessarily tied to the CBioC or PubMed services. Although Figure 10 is not limited to the CBioC or PubMed services, it should be noted that Figure 10 also provides further detail on the type of curation system that implements the methods described with reference to those services in Figures 4 through 7.

In Figure 10, client computers 504 are in communication with a network 506. Through the network 506, those client computers interact with a curation system 508, which may be implemented on one or more servers, whether at a single location or spread among different locations on the same or on different networks.

The system provides a browser helper 520, which preferably is sent as software (for example, as a dynamically-linked library, or DLL file) for installation on those computers. The browser helper includes text watcher logic 522 to monitor the clients' requests for reference data and communication logic 524 to communicate the output from the text watcher logic 522. Provided to the client computers either together with or separately from the browser helper logic 520 is Web band logic 526, which includes the Web band object logic 528 and communication logic 532. In a preferred embodiment, the browser helper logic is installed on each client computer and, whenever the user navigates to a Web page from where he or she can access an article or an abstract, the Web band logic is invoked. The browser helper logic 520 may also cause the user's Web browser to display a toggle button, through which the user may activate or deactivate the Web band logic. The browser helper may invoke the Web band logic by causing the client computer to download the Web band logic 526 in the form of, for example, a Web page with JavaScript or ActiveX controls.

At least one server operates a Web application 510 offering Web forms 512, server controls 514, an extractor system adapter 516, and a database connection adapter 518. Knowledge element databases 534 store knowledge elements relating to references in one or more reference databases (not shown), and extractor systems 536 operate to automatically extract knowledge elements from those databases.

When the user of a client computer 504 attempts to access a reference on his or her Web browser, the browser helper logic 520 informs the Web band logic 526 of this fact. In response, the Web band logic 526 triggers the Web application 510 to provide relevant knowledge elements, such as curation information. If such knowledge elements are available in a knowledge element database 534, the Web application 510 retrieves those elements and sends them to the client computer 504. If they are not available, the Web application invokes an extractor system 536 to automatically generate knowledge elements and to store those elements in the database 534.

The references and knowledge elements obtained by the user of a client computer 504 may be displayed as illustrated in Figure 11. Figure 11 illustrates a user interface 400 such as may be displayed in the client area of a Web browser on a client computer 504. During operation of this embodiment, the user interface may take any number of forms as the system is used for its various purposes such as browsing facts, voting for facts, ranking facts, adding and modifying facts, adding or modifying a schema, invoking the extractor system, user management, searching, and so on. In the form illustrated in Figure 11, a reference database interface 402 may contain the text of a reference, or other bibliographic information regarding the reference, such as the title, author, source, etc. The reference database interface 402 may additionally/alternatively contain search functionality, such as a search box or other query form that may be used to search one or more reference databases (as in box 606 of Figure 6). A knowledge element interface 418 is provided and displays information regarding various knowledge facts (or, perhaps more generally, knowledge elements). Three knowledge facts 408, 410, and 412 are shown in order of importance and given ranks 1, 2, and 3. Each knowledge fact may include one or more mechanisms for learning more about the particular fact. Here, that mechanism is a link 420. A user may select the link to browse through information regarding the fact. Another interface portion may provide facilities for modifying a fact 414, adding a new fact 416, or voting on whether a particular fact is useful and/or relevant. According to a further embodiment, a user may "drag and drop" the facts to modify a ranking of the utility and/or relevance of the facts. Various other examples of feedback techniques include a button, a link, a slider, or a text box.

This information provided by the user may be related to the source containing the knowledge facts and updated therein.

According to a particular embodiment, once the appropriate software is installed on a user's PC, the system watches the user's access of the web through a browser's windows. Whenever the researcher accesses a web page from where she can access an article or an abstract, the CBioC system is invoked and an interaction frame is created, as shown in Figure 5. A toggle button may be provided in the browser to toggle the interaction frame on and off. Just a few of the many possible examples of databases that can be used as alternatives to or in addition to the PubMed database are Nature and Science. The extractor system can operate its own extraction software, e.g., by using a download agent and parsing/matching text with a textual database. Other known methods can also be used to automatically determine a plurality of knowledge elements to associate with various references. Other databases (e.g., BioPax, DIF, Reactome) may also be used to automatically populate the knowledge element database. In such an embodiment, a data format exchange server may be used to download relevant data from the databases and convert them to the proper format.

In addition to the knowledge elements added to the CBioC database through the extractor systems and the data format exchange server, a user may be encouraged to add new elements to the CBioC database and also to rank various elements that have been proposed by the system or by other users.

A variety of embodiments have been described above. More generally, those skilled in the art will understand that changes and modifications may be made to these embodiments without departing from the true scope and spirit of the present invention, which is defined by the claims. For example, elements may be added or removed from the system architecture without eliminating usefulness of the embodiments. Elements described as hardware may be implemented as software or firmware. Likewise, elements described as software may be implemented as firmware or hardware. Generally, discussion of the Internet may be applied to any network or set of networks that serve to interconnect various machines or portions of a machine. Background elements that are well known to those skilled in the art were not necessarily fully described although they may be a part of any of the embodiments.