Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
ARRANGEMENT AND METHOD FOR FINDING RELATIONSHIPS AMONG DATA
Document Type and Number:
WIPO Patent Application WO/2011/151500
Kind Code:
A1
Abstract:
Electronic arrangement, such as a computer apparatus,and related method for finding relationships among data are presented. The proposed solution comprises: receiving data records from a number of data sources such as databases relating to a predetermined application domain such as biological or biomedical domain (604), indexing data records contained in said plurality of data sources, said index further comprising associations between said records based on indications in the data sources, wherein the utilized data model comprises a weighted graph having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship between two nodes, an edge being associated with a weight (606), receiving a user-defined query defining a number of concepts of the domain and associating the concepts with the corresponding one or more query nodes of the graph (608),determining a subgraph from the graph best associating the one or more query nodes with one or more other nodes according to a number of predetermined criteria utilizing the weights and a predetermined subgraph extraction technique (610), and creating a graphical visualization of the subgraph for illustration on a display, wherein said one or more query nodes, a number of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indi-25 rect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes (612), such as verifying a hypothesis between a phenotype and a related gene.

Inventors:
ERONEN LAURI (FI)
HINKKA ATTE (FI)
HINTSANEN PETTERI (FI)
KASARI MELISSA (FI)
KULOVESI KIMMO (FI)
LANGOHR LAURA (FI)
SEVON PETTERI (FI)
TOIVONEN HANNU (FI)
Application Number:
PCT/FI2010/050441
Publication Date:
December 08, 2011
Filing Date:
May 31, 2010
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
HELSINGIN YLIOPISTO (FI)
ERONEN LAURI (FI)
HINKKA ATTE (FI)
HINTSANEN PETTERI (FI)
KASARI MELISSA (FI)
KULOVESI KIMMO (FI)
LANGOHR LAURA (FI)
SEVON PETTERI (FI)
TOIVONEN HANNU (FI)
International Classes:
G06F17/30; G16B5/20; G16B45/00; G16B50/10; G16B50/20
Other References:
SEVON P ET AL.: "Link Discovery in Graphs Derived from Biological Databases", DATA INTEGRATION IN THE LIFE SCIENCES, LECTURE NOTES IN COMPUTER SCIENCE, vol. 4075, 2006, pages 35 - 49, Retrieved from the Internet [retrieved on 20110303]
TOIVONEN H: "Biomine search engine for probabilistic graphs", 3 February 2010 (2010-02-03), Retrieved from the Internet [retrieved on 20110303]
SEVON P ET AL.: "Subgraph queries by context-free grammar", JOURNAL OF INTEGRATIVE BIOINFORMATICS, vol. 5, no. 2, 2008, pages 100, Retrieved from the Internet [retrieved on 20110315]
HINTSANEN P ET AL.: "Finding Reliable Subgraphs from Large Probabilistic Graphs", DATA MINING AND KNOWLEDGE DISCOVERY, vol. 17, no. 1, 2008, pages 3 - 23, Retrieved from the Internet [retrieved on 20110303]
LANGOHR L: "Finding a diverse set of nodes in probabilistic graphs", SEMINAR: GRAPH MINING (3 CR), SPRING 2010, 21 May 2010 (2010-05-21), Retrieved from the Internet [retrieved on 20110309]
LANGOHR L ET AL.: "Finding representative nodes in probabilistic graphs", WORKSHOP ON EXPLORATIVE ANALYTICS OF INFORMATION NETWORKS AT ECML PKDD, 65-76, September 2009 (2009-09-01), BLED, SLOVENIA, pages 65 - 76, Retrieved from the Internet [retrieved on 20110309]
HINTSANEN P: "The most reliable subgraph problem", PROCEEDINGS OF THE 11TH EUROPEAN CONFERENCE ON PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, 2007, pages 471 - 478, Retrieved from the Internet [retrieved on 20110315]
WANG X ET AL.: "A Study of Methods for Negative Relevance Feedback", SIGIR '08: PROCEEDINGS OF THE 31ST ANNUAL INTERNATIONAL ACM, SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2008, NEW YORK, NY, USA, pages 219 - 226, Retrieved from the Internet [retrieved on 20110311]
ZHOU F ET AL.: "Review of Network Abstraction Techniques", WORKSHOP ON EXPLORATIVE ANALYTICS OF INFORMATION NETWORKS AT ECML PKDD, September 2009 (2009-09-01), BLED, SLOVENIA, pages 50 - 63, Retrieved from the Internet [retrieved on 20110315]
TOIVONEN H ET AL.: "A framework for path-oriented network simplification", THE NINTH INTERNATIONAL SYMPOSIUM ON INTELLIGENT DATA ANALYSIS (IDA), 19 May 2010 (2010-05-19) - 21 May 2010 (2010-05-21), TUCSON, ARIZONA, US, Retrieved from the Internet [retrieved on 20110311]
NIKRAVESH M: "Concept-based search and questionnaire system", SOFT COMPUTING, vol. 12, 2008, pages 301 - 314, Retrieved from the Internet [retrieved on 20110302]
TOMIYAMA T ET AL.: "Concept-based Web communities for Google search engine", THE 12TH IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, 2003. FUZZ, 2003, pages 1122 - 1128, Retrieved from the Internet [retrieved on 20110302]
Attorney, Agent or Firm:
IPR PARTNERS OY (Helsinki, FI)
Download PDF:
Claims:
Claims

1. An electronic arrangement, such as a server (104, 1 18, 120), comprising: -a communication entity (1 12, 130) configured to receive data records from a number of data sources (108a, 108b, 108c) such as databases relating to a predetermined application domain such as biological or biomedical domain,

-an indexing entity (122) configured to create an index (105) for the data records contained in the data sources, said index further comprising associations between said records based on indications in the data sources, wherein the utilized data model comprises a weighted graph, such as a probabilistic graph, having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship between two nodes, an edge being associated with a weight, wherein the weight optionally indicates a probability,

-a search interface entity (124) configured to receive a user-defined query relating to a number of concepts of the domain and to associate the concepts with the corresponding one or more query nodes of the graph,

-a search engine entity (126) configured to determine a subgraph (103, 202, 302) from the graph associated with said one or more query nodes according to a number of predetermined criteria utilizing the weights, edge types and/or other characteristic of the edges, and a predetermined subgraph extraction technique, and

-a visualization entity (128) configured to control the graphical visualization of the subgraph on a display device, wherein said one or more query nodes, a num- ber of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes, such as verifying a hypothesis between a phenotype and a related gene.

2. The arrangement of claim 1, wherein each edge of the graph is associated with a characteristic type indicative of the nature of the connection between the edge ends, said type being optionally indicative of similarity, interaction and/or causal relationship.

3. The arrangement of any preceding claim, wherein an edge weight is deter- mined on the basis of at least one factor selected from the group consisting of: reliability indication of related data source, considered relevance of the edge type or of other characteristic attribute of the edge, and rarity based on the degree of an end node of the edge. 4. The arrangement of any preceding claim, configured to determine or adapt the subgraph indicative of the neighborhood of a query node on the basis of weight-related path strength criterion between the query node and nodes connected thereto directly or via intermediate nodes. 5. The arrangement of any preceding claim, configured to determine or adapt the subgraph by forming a number of clusters and/or cliques therefrom, optionally utilizing edge and/or path strengths at least partially determined using the weights as criterion. 6. The arrangement of any preceding claim, configured to utilize, in connection with at least two query nodes, at least one subgraph extraction precept selected from the group consisting of: a subgraph spanned by most likely acyclic paths, subgraph spanned by minimum paths, most reliable sub-network, and limiting the connecting paths according to node and/or edge type(s) using a con- text-free grammar.

7. The arrangement of any preceding claim, configured to determine the subgraph from the graph preferably on the basis of two query nodes, wherein the utilized method includes gathering a candidate set of paths from a larger set of paths, such as all paths, between the query nodes, whereafter an optimal subset of the candidate paths is determined according to predetermined criteria optionally including the edge budget, and wherein Monte Carlo simulation is preferably utilized. 8. The arrangement of any of claims 1-6, configured to determine the subgraph from the graph on the basis of two or more, such as three or more, query nodes, wherein the utilized method includes forming spanning trees from the graph connecting the query nodes and having the query nodes as leaves thereof.

9. The arrangement of any preceding claim, configured to combine at least two nodes of the subgraph having an identical or substantially similar neighborhood, such as similar connections with a common end point, according to a number of predetermined criteria together so as to elevate the informative value of the subgraph from the standpoint of the user.

10. The arrangement of any preceding claim, configured to cluster nodes of the graph or subgraph (502) utilizing a predetermined similarity measure and select, from any or each cluster, a representative node (504), such as the node that has maximum similarity relative to other nodes, and preferably indicate the representative node visually to facilitate the visual investigation of the graph or subgraph, wherein optionally k-medoids or hierarchical clustering is utilized.

1 1. The arrangement of any preceding claim, configured, on the basis of the one or more query nodes, to determine a number of other nodes (514) relevant with respect to the query nodes and non-redundant with respect to each other according to at least one probabilistic proximity measure (520, 530), said number of other nodes, said one or more query nodes and intermediate connecting nodes and edges optionally forming or being at least included in the determined sub- graph and/or visualization thereof.

12. The arrangement of claim 1 1, configured to support a negative query node so as to exclude it from the determined subgraph and optionally one or more nodes close thereto according to a predetermined criterion.

13. The arrangement of claim 1 1 or 12, configured to apply a greedy method for determining the subgraph in view of the at least one probabilistic proximity measure, wherein a selected node fulfilling the measure and selected during an iteration round is treated as a negative query node during a subsequent round.

14. The arrangement of any preceding claim, configured to estimate the unexpectedness of a relationship between two nodes utilizing predetermined criterion optionally based on link types, and optionally to utilize the obtained unexpectedness information for determining and/or visualizing the subgraph.

15. The arrangement of any preceding claim, configured to control the graphical visualization of the subgraph on a display device based on user input for providing interactivity to the analysis of the subgraph.

16. The arrangement of claim 15, wherein the user selection of a visualized element, such as a graph, subgraph, node, edge, icon, other symbol, or space between them, triggers a function selected from the group consisting of: visualization of the element detail(s), access to the related data source, change of a related type or attribute, change of a query concept or other parameter, editing of the element data, view zoom, view unzoom, rotation of the view, panning the view, view angle selection, visual expansion of the element such as expansion of a graph to constituents thereof, visual collapse of the element, and addition of supplementary information.

17. The arrangement of any preceding claim, configured to apply a user- controllable criterion or emphasis factor for subgraph extraction based on at least one element selected from the group consisting of: target edge type and/or other attribute, target node type and/or other attribute, data source, and size of the tar- get subgraph.

18. A method for finding relationships among data to be performed by a computer device, comprising -receiving data records from a number of data sources such as databases relating to a predetermined application domain such as biological or biomedical domain (604),

-indexing data records contained in said plurality of data sources, said index fur- ther comprising associations between said records based on indications in the data sources, wherein the utilized data model comprises a weighted graph having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship between two nodes, an edge being associated with a weight (606),

-receiving a user-defined query defining a number of concepts of the domain and associating the concepts with the corresponding one or more query nodes of the graph (608), -determining a subgraph from the graph best associating the one or more query nodes with one or more other nodes according to a number of predetermined criteria utilizing the weights and a predetermined subgraph extraction technique (610), and -controlling a graphical visualization of the subgraph, such as creating visualization data for a display device, wherein said one or more query nodes, a number of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes (612), such as verifying a hypothesis between a phenotype and a related gene. 19. A computer program comprising code means adapted, when run on a computer, to execute the method of claim 18.

20. A carrier medium comprising the computer program of claim 19.

Description:
ARRANGEMENT AND METHOD FOR FINDING RELATIONSHIPS AMONG DATA

FIELD OF THE INVENTION

Generally the invention pertains to information and computer sciences. In partic- ular, the invention concerns the creation of a searchable aggregate knowledge base from a number of information sources.

BACKGROUND The amount of information available in the modern world is unquestionably huge and increases continuously. Further, more and more information has become easily accessible to basically everyone via the Internet and the information sources connected and search tools associated therewith, for instance. Indeed, different search engines, crawlers, semantic intelligence logic engines etc. have been set up to cope with the vast amount of ever-growing data space such that the explorers thereof do not get overly confused about it in view of each particular, more specific, information retrieval task in question. Nevertheless, despite of the feeling a modern person often encounters, i.e. a feeling of being flooded with information and particularly unnecessary nonsense, everything interesting or useful has certainly not been discovered yet, whereupon e.g. the scientists and engineers admittedly have much to learn also in the future for augmenting the collective knowledge of mankind in the various fields of science and technology.

For example, the amount of publicly available biological data is growing at a tremendous pace as new information about genomes, proteomes, interactomes etc. is published daily. Despite the large amount of the already available information, it is clear that it only represents a tiny fraction of the biological knowledge that could be discovered. For instance, when considering the functions of genes according to a selected database, such as the Gene Ontology (http://www.geneontology.com) database, a relatively high percentage of the gene products comprise, due to being unknown, no informative annotation for a molecular function. There evidently seems to be a lot of work left in the light of associated biological studies so that the missing pieces of information can be ultimately filled in.

One interesting and important, both alike, source of new information might arise from the existing connections between two or more known concepts. As a simple example, one could consider a (web) search for two countries, such as "Finland" and "United Kingdom". A typical web search will provide results that may mention both terms, such as the page of Finnish embassy in the UK or e.g. a football match between these two nations. Very roughly, this type of traditional informa- tion retrieval/web search mainly looks for the intersection of items relevant to both Finland and UK, and constructs the results list accordingly.

In the aforementioned field of biology, different biological databases may be separately searched e.g. via web interfaces using a plurality of search terms such that the top priority search results finally incorporate a maximum number of any or all them. For example, a gene mapping for a particular phenotype could be considered. The mapping may have resulted in a large set of candidate genes. When further extensive analyses are planned for the wet lab, the researchers may first compare the candidates in the light of what is disclosed about them in the public databases and literature, hoping to be able to concentrate the efforts and resources on the most promising candidates. This slow and laborious task is mostly done manually by browsing the databases, which inevitably limits the extent and coverage of the search and easily lowers the quality of obtained results. Thus, notwithstanding the benefits the contemporary solutions undoubtedly provide in finding desired information from a larger information repository, or a plurality of repositories, some defects still exist therewith. Namely, every now and then one may come up with a new idea or maybe just a thin hunch involving a number of concepts that may be formulated as search terms for a search engine, for example. However, the current search interfaces and underlying search engines mostly look for the above intersections, i.e. certain documents comprising an explicit reference to the concept(s) or semantically corresponding entities, which is certainly feasible, but potentially misses the weaker clues and bigger picture that might provide valuable information based on the existing interrela- tionships between the concepts. Yet, information contained in a number of large data structures such as network structure(s) may be difficult to view and handle by the user. SUMMARY OF THE INVENTION

The objective is to alleviate one or more of the aforesaid defects present in prior art solutions and to provide at least a feasible alternative for finding information.

The objective is met by an arrangement and method in accordance with the present invention. Certain embodiments of the present solution may be applied to implement a search engine regarding a predetermined domain or domains such as biological, biomedical and/or medical domains. An index may be pre-constructed on the basis of a plurality of data sources such as databases relating to the domain. The data sources may include heterogeneous data. Preferably the index integrates data from the sources into a local repository represented by a weighted, such as probabilistic, model, advantageously a graph, wherein nodes represent data records and edges represent their interrelations (based on associations be- tween the data records indicated e.g. in the data sources). Thereafter, the user may query the index, e.g. via a web interface, by picking up one or more preferred query nodes such that a subgraph comprising edges advantageously connecting these nodes and/or other relevant nodes is provided in response for visual contemplation and analysis. In minimum case, the obtained subgraph may con- tain only a single node, while in the maximum case it may contain all the nodes of the original source graph. The solution is applicable for exposing indirect and therefore commonly unknown associations between two or more data records and related concepts, for instance. As one merely exemplary use case, the proposed solution may be utilized for determining the subgraph and/or path best connecting two or more query nodes according to a number of predetermined criteria, such as maximization of subgraph or path reliability (probability).

Thereby, in one aspect, an electronic arrangement, such as a server arrangement, comprises:

-a communication entity configured to receive data records from a number of, such as plurality of, data sources such as databases relating to a predetermined application domain such as biological or biomedical domain, -an indexing entity configured to populate an index for the data records contained in said number of data sources, said index further comprising associations between said records based on indications in the data sources, wherein the utilized data model comprises a weighted graph, such as a probabilistic graph, having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship (association) between two nodes, an edge being associated with a weight,

-a search interface entity configured to receive a user-defined query relating to a number of concepts of the domain and to associate the concepts with the corresponding one or more query nodes of the graph, -a search engine entity configured to determine a subgraph from the graph associated with said one or more query nodes according to a number of predetermined criteria utilizing the weights, edge type and/or other characteristic of the edges, and a predetermined subgraph extraction technique, and -a visualization entity configured to control the graphical visualization of the subgraph on a display device, wherein said one or more query nodes, a number of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes, such as facilitating verifying a hypothesis between a phenotype and a related gene.

In one embodiment the arrangement may be configured to utilize a proprietary or public database, e.g. a loadable file version thereof, as a data source. The data- base may be a scientific, engineering, or some other type of a database. For example, a biological database, e.g. EntrezGene, UniProt, Swiss-Prot, EntrezProte- in, GO(A) (Gene Ontology (Annotation)), Interpro, STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) or (O)MIM ((Online) Mendelian Inheritance in Man) may be applied as a data source in the case of at least partly biological target domain. A data record of a data source such as of a database may be turned into a node of the applied probabilistic model, whereas a cross- reference between records may be turned into an edge thereof.

In another, either supplementary or alternative, embodiment a node and/or an edge of the data model (graph) may be associated with at least one characteristic type and/or other characteristic attribute. For example, an edge type may generally confer similarity, interaction, and/or causal relationship between the end nodes thereof. A node may be of gene or protein type in a biological context, for exam- pie. Generic types incorporating a plurality of sub-types may be provided. For instance, a sequence may subsume both gene and protein. The aforesaid weight attribute may be used to indicate the estimated edge strength and/or probability of the edge's actual existence, for instance.

In a further, either supplementary or alternative, embodiment the arrangement may be configured to identify common concepts from a plurality of data sources such as databases such that potential naming practice differences relative to the same concept are detected, on the basis of e.g. synonym information and/or se- mantic analysis, and the related data are combined to the same record associated e.g. with a certain node of the established index, for example.

Yet in a further, either supplementary or alternative, embodiment a weight may be assigned to an edge by applying a function of a number of aspects. For in- stance, three factors may be used. For example, the weight may indicate edge goodness on the basis of e.g. a) predetermined data reliability factor associated with each data source (e.g. edge derived on the basis of data source a, receives a reliability factor x, edge derived o the basis of data source b, receives a reliability factor y, etc.), b) relevance of an edge type assigned e.g. query-specifically (may be user-controllable), and/or c) rarity associated with e.g. the degree of the incident node of an edge (lower degree of node -> more specific and potentially more informative link behind the related edge, for example).

Still in a further, either supplementary or alternative, embodiment indexing may be performed periodically, e.g. in a timed manner, or upon some more specific triggering condition, e.g. detection of change(s) in a data source or in the indexed data, and/or by external trigger, e.g. by any of the data sources or a client device.

In a further, either supplementary or alternative, embodiment the query may di- rectly include or at least implicitly indicate only a single query node of the graph. In this case, the subgraph may include so-called neighbor nodes best relating to the queried node according to a number of predetermined criteria, for example. For example, a path strength at least partly defined by the product of path edge weights (path probability in the case of given edge probabilities) may be used as criterion for determining the connection between the query node and some other node. In a further, either supplementary or alternative, embodiment the search interface entity may be configured to utilize a user-defmed search query term (concept) directly as a query node (or other distinguishable member) of the graph. This may happen when a direct match between the concept and a node of the index graph can be found. Alternatively, the search interface entity may be configured to find a best-matching node or other distinguishable member of the graph relative to the concept, or a preferably ranked list of suggestions, if the concept as such is not found. This may occur in connection with clerical errors, synonyms, etc. The interface entity may be configured to ask for confirmation in the case of unob- vious match between the user-defmed search concept and query node or other member of the graph.

In a further, either supplementary or alternative, embodiment determination of the target subgraph, particularly in connection with two query nodes, may be executed using at least one subgraph extraction precept selected from the group consisting of: a subgraph spanned by most likely acyclic paths (strongest paths connecting the query nodes may be selected until a desired size of the subgraph is reached, for example), subgraph spanned by minimum paths (paths the sub- paths of which do not connect given nodes may be selected until a desired size is reached), most reliable subgraph (a subgraph of selected size may be determined such that it, as a whole, connects the query nodes as effectively as possible, i.e. it is maximally likely that there is a connection between the query nodes), and limiting the connecting paths according to node and/or edge type(s) using a context-free grammar. In scenarios with more than two query nodes, e.g. the aforesaid subgraph spanned by most likely acyclic paths may be applied.

Further, as one preferable option, a so-called Path Covering (PC) method is proposed for use in determining the most reliable subgraph between e.g. two query nodes. Formally, the given weighted graph may be seen as a Bernoulli random graph G, for example, and the objective may be to find a subgraph H c G with at most B edges such that the probability that a path exists in H between the given two terminal nodes (s and f) is maximized. The algorithm is based on an efficient stochastic search of candidate paths, and the use of Monte Carlo simulation to cast the problem as a set cover problem.

PC may be implemented as a two-phase process. In a path sampling phase, a relatively small set C of candidate paths may be gathered from the set of all s-t- paths in a selected graph. Then, in a subgraph construction phase, the aim may be to choose an optimal subset 'F of the candidate paths in C, according to e.g. edge budget.

Alternatively, a modification of the PC method may be applied to cases with two or more query nodes. Some PC principles may remain unchanged, such as the two phases and use of Monte Carlo simulation. However, whereas the PC uses paths connecting the two given query nodes as its building blocks (set C described hereinafter) in the subgraph construction phase, the proposed method advantageously considers spanning trees connecting the k query nodes, with 2 < k < |V| (V refers to the overall set of nodes). Similarly, set S in the objective function to be reviewed hereinlater consists of spanning trees instead of paths as in PC.

In a further, either supplementary or alternative, embodiment the obtained (sub)graph may be cultivated by aggregating (grouping and/or combining) nodes having e.g. an identical or substantially similar neighborhood (such as similar connections with common end points) according to a number of predetermined criteria together so as to elevate the informative value of the subgraph from the standpoint of the user. Alternatively or additionally, other techniques for grouping and potentially combining nodes and/or (sub-)graphs may be applied. For in- stance, edge and/or path strength of sufficient (predetermined level) between nodes may be utilized as a feasible criterion in addition to or instead of other criteria, whereby e.g. clusters and/or cliques may be determined accordingly.

In a further, either supplementary or alternative, embodiment the graphical visua- lization of the subgraph may be performed utilizing a Java-based entity such as application or applet.

In a further, either supplementary or alternative, embodiment the graphical visualization may preferably include node labels (identifiers) and/or node and/or edge types, preferably also supplementary or alternative information such as various attributes, e.g. edge weights.

In a further, either supplementary or alternative, embodiment the graphical visualization is interactive. The illustrated graph may be zoomed, rotated and/or oth- erwise controlled. An element such as a node or edge may be selected by the user via the user interface (UI) to activate an associated feature. The selection feature may trigger a function selected from the group consisting of: visualization of the element detail(s), access to the corresponding data source using e.g. HTTP/WWW, change of a related type or attribute, change of a query criterion or other parameter, editing of the element, at least visually expanding the element (if e.g. a subgraph represented by a single entity), at least visually collapsing the element (vice versa, a number of nodes and/or edges may be represented by a node and a number of related edges), and addition of supplementary information. Selection may trigger execution of a related function automatically or after a second action such as additional verification, selection, or e.g. click action in the case of a double-click type selection action. The change of the query may indicate narrowing, expanding or otherwise changing the query in terms of the re- lated index search criteria.

In a further, either supplementary or alternative, embodiment the arrangement may be configured to automatically hide details, such as nodes, edges, node- related information, and/or edge-related information, deemed as uninteresting from the standpoint of the user according to predetermined criterion such as rules based on query terms (directly user-de fined and/or related concepts).

In a further, supplementary or alternative, embodiment the arrangement may be configured to determine a number of representative nodes in a graph such as the original source graph (index) or obtained subgraph thereof. According to one feasible method, the given nodes are first clustered utilizing e.g. a desired similarity measure and then a representative is preferably selected from each cluster.

In a further, supplementary or alternative, embodiment the user may be allowed to specify query nodes (concepts), whereupon the arrangement is configured to find other nodes (concepts) that are relevant with respect to the query nodes, but advantageously non-redundant with respect to each other. Such embodiment may be motivated by biological graphs, for example, where similar problems arise in connection with summarizing existing knowledge concisely, as well as in disco- vering and highlighting a selected set of potentially novel, less obviously related concepts. The arrangement is configured to utilize a probabilistic proximity measure for defining a relevance function for a node with respect to (positive) query nodes. In a further, supplementary or alternative, embodiment the measure is expanded to take negative query nodes to be avoided into account. The ar- rangement may be generally configured to measure the relevance and non- redundancy of a set of retrieved nodes. In a further, supplementary or alterative, embodiment the arrangement may be configured to estimate, in addition to or instead of relationship strength (e.g. path strength) between e.g. two nodes/concepts, also the unexpectedness of such a relationship. The resulting subgraph might be constructed so as to include and/or highlight nodes the connection of which is strong, but inevident. The used criterion may be based on link types, for example.

In another aspect, a method for finding relationships among data, comprising -receiving data records from a number of, such as plurality of, data sources such as databases relating to a predetermined application domain such as biological or biomedical domain,

-indexing data records contained in said plurality of data sources, said index fur- ther comprising associations between said records based on indications in the data sources, wherein the utilized data model comprises a weighted graph having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship between two nodes, an edge being associated with a weight,

-receiving a user-defined query defining a number of concepts of the domain and associating the concepts with the corresponding one or more query nodes of the graph, -determining a subgraph from the graph best associating the one or more query nodes with one or more other nodes according to a number of predetermined criteria utilizing the weights and a predetermined subgraph extraction technique, and -creating a graphical visualization of the subgraph for illustration on a display, wherein said one or more query nodes, a number of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes, such as verifying a hypothesis between a phe- notype and a related gene. The previously presented considerations concerning the various embodiments of the arrangement may be flexibly applied to the different embodiments of the provided method mutatis mutandis and vice versa, as being appreciated by a person skilled in the art.

The utility of the present invention follows from a plurality of issues. The invention may be applied as a highly automatic technical aid for digging up new information on the basis of found implicit, indirect, often therefore previously unknown or unrecognized, connection(s), i.e. paths or "chains" of associations, be- tween given concepts in a certain domain for visually verifying, producing and/or ranking related hypothesis from the standpoint of a life scientist, for example. The suggested PC method is efficient and scalable in connection with large source graphs, for example. The suggested modified PC method for reliable subgraph extraction in connection with two or more query nodes is also both effec- tive and extraordinarily scalable in comparison with prior art methods. The invention may provide information on which concepts are related, how they are related, and/or how strong are the connections. Single paths, complexes and distant connections relative to the queried concepts may be identified. The concepts and/or their interrelations may be then prioritized for future research, for exam- pie.

Thus the relevant information may be found from the data sources faster and easier than before. Heterogeneous information may be cleverly integrated in a single, searchable index. The domain may preferably include abundant data, uncer- tain/weighted direct relations among data, some incompleteness (e.g. in the light of relations), and/or heterogeneity for maximum applicability in view of the present invention. The present invention may traverse whole network(s) of relations on the basis of a formal, probabilistic model for the purpose. Finding and exploring non-trivial, possibly surprising, associations is especially important in many scientific and engineering fields, where large bodies of domain knowledge are available, and research and development has a fast pace and is very competitive. Biomedical domain is an example of a very promising that kind of domain. For instance, a biomedical researcher may construct a hypothesis or guesstimate around a phenotype and potential genes related thereto, whereupon the hypothesis may be conveniently verified using the suggested arrangement and method. Representative concepts may be determined on the basis of a larger initial group for reducing redundancy and facilitating identifying complementary or alternative entities, for example.

In addition, e.g. millions of records of a certain domain may be indexed together with their relations reported in the data sources. The used data sources are preferably structured but additionally or alternatively, also only partly structured or even unstructured sources, e.g. books, journals and/or web pages, may be applied with the cost of additional processing likely required for mining it, e.g. via textual/semantic analysis, and the related uncertainty.

The present invention may be used to graphically illustrate the found links between interesting concepts preferably with interactivity and/or editability, which facilitates understanding the found relationships instead of merely numerical or textual output. Additional query formats may further facilitate providing new in- formation to the user. For example, a query may be constructed for finding other relevant concepts related to the queried concept, to determine the surprising ones according to the used criterion, and/or for finding analogical/similar relations.

The expression "a number of refers herein to any positive integer starting from one (1), e.g. to one, two, or three.

The expression "a plurality of refers herein to any positive integer starting from two (2), e.g. to two, three, or four. The terms "network" and "graph" are used herein interchangeably.

Different embodiments of the present invention are disclosed in the dependent claims. BRIEF DESCRIPTION OF THE RELATED DRAWINGS

Next the invention is described in more detail with reference to the appended drawings in which Fig. la illustrates an embodiment of the arrangement in accordance with the present invention.

Fig. lb is a block diagram of an embodiment of the arrangement. Fig. 2 illustrates a first example of a subgraph obtained utilizing an embodiment of the present invention.

Fig. 3 illustrates a second example of a subgraph obtained utilizing an embodiment of the present invention.

Fig. 4 visualizes test results in connection with an embodiment of a modified path covering method for discovering the most reliable sub-network in the light of two or more queried concepts.

Fig. 5a illustrates an embodiment of finding representative nodes in a graph in accordance with the present invention.

Fig. 5b illustrates an embodiment of finding nodes particularly relevant to the query node(s).

Fig. 5c is another illustration of the embodiment for finding nodes particularly relevant to the query node(s).

Fig. 6 discloses a flow diagram of an embodiment of a method according to the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Figure la illustrates the concept of the present invention in accordance with an embodiment thereof.

In the visualized scenario a terminal 102, such as a desktop, laptop or mobile (hand-held) computer, is functionally connected to a server arrangement 104 e.g. via at least one communication network 106 such as the Internet. The server ar- rangement 104 offers a search service, e.g. a web service, to its clients such that it constructs an index preferably representable in the form of a weighted (e.g. probabilistic) graph 105 based on data records available in a plurality of optionally heterogeneous, external data sources 108a, 108b and 108c, and serves a search query provided by the user 102a via the terminal 102 through the provision of appropriate subgraph 103 and/or other data in return. Alternatively, the server arrangement 104 may be implemented, as a logical entity, also directly in the utilizing terminal device 102, being then typically best suitable for local use.

The broken arrows between the data sources 108a, 108b, 108c represent the pro- vision of data to the arrangement 104 for indexing. Data may be transferred also in the reverse direction. The arrangement 104 may poll a source 108a, 108b, 108c upon need, or the source 108a, 108b, 108c may be configured to send the data to the arrangement 104 automatically. Figure lb represents, by way of example only, a block diagram of the server arrangement 104, terminal 102 or other similar arrangement incorporating one or more devices and configured to provide the search service in accordance with the present invention.

The entity in question is typically provided with one or more processing devices capable of processing instructions and other data, such as one or more microprocessors, micro-controllers, DSPs (digital signal processor), programmable logic chips, etc. The processing entity 1 10 may thus, as a functional entity, physically comprise a plurality of mutually co-operating processors and/or a number of sub- processors connected to a central processing unit, for instance. The processing entity 1 10 may be configured to execute the code stored in a memory 1 16, which may refer to search engine software 1 18 in accordance with the present invention and other applicable software applications.

The logic for the search engine and related functionalities may indeed be implemented as software stored in the memory entity 1 16 and executed by the processing entity 1 10. The search entity may include a number of applications, modules, and/or other software preferably having at least a functional interconnection. Software 1 18 may utilize a dedicated or a shared processor for executing the tasks thereof. Similarly, the memory entity 1 16 may be divided between one or more physical memory chips or other memory elements. The memory 1 16 may further refer to and include other storage media such as a preferably detach- able memory card, a floppy disc, a CD-ROM, or a fixed storage medium such as a hard drive. The memory 326 may be non-volatile, e.g. ROM (Read Only Memory), and/or volatile, e.g. RAM (Random Access Memory), by nature.

The UI (user interface) 1 14 may comprise a display, e.g. an (O)LED (Organic LED) or LCD (liquid crystal display) display, and/or a connector to an external display or other type of a display device such as a data projector, and a key- board/keypad/mouse/touchpad and/or other applicable control input means (e.g. touch screen or voice control input, or separate keys/buttons/knobs/switches) configured to provide the user of the entity with practicable data, e.g. graph, vi- sualization and control means. The UI 1 14 may further include one or more loudspeakers and associated circuitry such as D/A (digital-to-analogue) converters) for sound output, and a microphone with A/D converter for sound input. In addition, the entity may comprise an interface 1 12 such as at least one transceiver incorporating e.g. a radio part including a wireless transceiver, such as WLAN (Wireless LAN), Bluetooth or a cellular like GSM (Global System for Mobile Communications)/UMTS (Universal Mobile Telecommunication System) transceiver, for general communications with external devices and/or a network in- frastructure, and/or other wireless or wired data connectivity means such as one or more wired interfaces (e.g. Firewire, LAN such as Ethernet, or USB (Universal Serial Bus)) for communication with other devices such as terminal devices, control devices, server devices, peripheral devices such as external sensors, and/or network infrastructure(s).

It is clear to a skilled person that the entity may comprise few or numerous additional functional and/or structural elements for providing beneficial communication, processing or other features, whereupon this disclosure is not to be construed as limiting the presence of the additional elements in any manner.

A carrier medium such as an optical disk, a floppy disk, a memory card, a hard disk, a memory chip, or a memory stick may be configured to comprise computer code, e.g. a computer program product, for performing at least part of the tasks described herein. The program code and/or related data such as model or repre- sentation data may be provided on a signal carrier. The code and/or the data may be at least partially encrypted using a selected encryption method such as AES (Advanced Encryption Standard).

In some embodiments the entity, such as the terminal 102 or the server 104, may be self-contained and include all the necessary functionality from obtaining the data from a number of sources to providing related search services for finding indirect relationships. Alternatively, tasks may be shared and distributed among available devices 102, 104, and/or optionally further devices embodiment- specifically as understood by a skilled person. The server arrangement 104 may in practice contain a single (computer) device or a plurality of at least functionally interconnected devices fulfilling the required functions as an aggregate server entity.

One example of a possible logical division between the at least conceptual ele- ments of the suggested solution is provided in the lower section of the figure. Many of the elements may be implemented as a computer executable code, which is highlighted in the figure by broken lines drawn between the application block 1 18 and expansion portion 140. Naturally the necessary hardware such as processing element(s), memory, UI and communication adapter(s) are still needed for interfacing the logic with a physical world.

The I/O entity 130 may take care of data input and output between external enti- ties and the suggested solution. A control block 120 may generally manage the execution of tasks according to requests received via the I/O entity 130 and internal logic, for example. Indexing entity 122 takes care of producing the integrated, searchable index from a number of, likely a plurality of, data sources such as databases. Search interface entity 124 may turn a received search query into an in- dex query, which may refer to receiving the search query and identifying the nodes of the index (graph) either directly or implicitly identified therein. For example, sometimes a user may construct a search query including a term not directly identified as such as a searchable node (and/or edge/edge type and/or characteristic) or other member of the index (graph), whereupon the entity 124 may be configured to provide a list of suggestions for the query node based on inherent estimation logic such as semantic logic trying to associate the search term with a searchable entity of the index. A data structure such as a table may be applied to contain synonyms of different concepts for rapid concept term-to-graph element (e.g. node) term transformation.

Visualization entity 128 may be configured to control the visualization (views) of the graphs, subgraphs, and related entities to the user 102a. It may be configured to create related visualization data, such as graphical image or video (animation) data and/or related instructions, to be provided to the terminal 102 for illustration via an available data visualization device such as a display.

Search engine/subgraph formation entity 126 may be configured to traverse the index database graph and extract a target subgraph therefrom for future visualization and/or other purposes.

Any aforesaid entity may be implementation-specifically realized as divided into a number of sub-entities or integrated with some other entity.

Advantageously, the user 102a may be provided with various means by the UI 1 14 for controlling the searches and/or visualization of the search result. For example, edge type, node type and/or data source -specific weighting may be user- controllable or -selectable in connection with executing searches, as may be the size on the search result (subgraph), desired node types (from which representa- tives may be selected and/or that should be located at the border of the network), and/or other node and/or edge characteristics affecting the search.

In addition to or instead of graphical output, a (purely) textual result e.g. in hypertext (e.g. HTML) or XML (extensible Markup Language) format may be provided.

Figure 2 illustrates a first example of a subgraph 202 obtained utilizing an embodiment of the present invention. In particular, a fictional example of a subgraph summarizing the link between a gene and a phenotype has been depicted. The user has specified the search concepts (source/origin node: Gene(S) and destination node: Phenotype(T)) used as query nodes, and the class of path types of interest (path types suggesting a causal relationship). Indeed, instead of all paths connecting two nodes, the user may be interested in paths with specific semantics, e.g. paths that confer similarity or paths that suggest a causal relationship. With labeled graphs, it is natural to base queries on the path type— the string of node and/or edge types on a path. A path class corresponding to a type of relationship may be defined as the set of path types that sug- gest that type of link between the end nodes. In the shown example, the subgraph results from a query for causal links from a given gene to a given phenotype.

A context-free grammar may be used as a means of defining path classes. A CFG may define a language of strings (path types in our framework) of terminal sym- bols (edge and vertex types) that can be derived from a distinguished (starting) non-terminal symbol using a set of production rules and a set of other nonterminal symbols. Each non-terminal symbol defines a path class, and paths of a given class are queried by specifying the non-terminal corresponding to that class as the starting symbol. Compared to another common formalism for defining string classes, r expressions, CFGs are more expressive, and provide a natural means of naming path classes. A subgraph querying system may rely on a background CFG, defining a comprehensive domain-specific list of path classes. The complexity of using CFGs may be hidden from the end-user: subgraphs can be queried simply by giving either one or two (sets of) query nodes (vertices) and the path class of interest. Alternatively, the user can pose more complex queries by manually defining the top level production rules and/or any auxiliary production rules. The implemented system may support queries such as: 1) connection subgraph queries between two given nodes (or sets of nodes), returning the sub- graph linking the nodes together, and 2) neighborhood queries, returning the subgraph induced by the set of paths starting from a given node and matching the query. For example, publication "Subgraph Queries by Context-free Grammars " (Petteri Sevon and Lauri Eronen. Journal of Integrative Bioinformatics 5 (2): 100. 2008), which is incorporated herein by reference in its entirety, describes one possibility of exploiting CFGs generally as explained above for defining and serving queries including path parsing, see in particular chapters 1-3 (pages 1-9) thereof, in connection with graphs.

Fig. 3 illustrates a second example of a subgraph obtained utilizing an embodiment of the present invention. As a background story, a life scientist may have been studying the etiology of Alzheimer disease (AD) for drug development or diagnostic purposes, for example. Assume that she have come up with a hypothesis based e.g. on wet lab results that DCDC2 gene could be associated with AD. Instead of manually browsing the biomedical databases available via web, she has exploited the embodiment in order to identify how DCDC2 could indeed be involved in mechanisms behind AD and optionally how strong the link between the two concepts is. A related query of two concepts (nodes) "DCDC2 AD" may have resulted in the shown graph. Each unit of information in the graph may, in principle, be previously known from a data source or another, but the chains of relations may be far from obvious even to a skilled person. Advantageously only the strongest advantageously indirect relationships between the query nodes have been shown to the user for improving the clarity of the obtained subgraph. In addition to edge types, edge directions, and node labels, also weights (e.g. probabilities) indicative of link "goodness" are shown.

Generally the arrangement may be configured to produce a number of preferably user-selectable views on the (sub-)graphs. For example, colors and/or other type of highlighting may be applied to distinguish between different properties of nodes and/or edges. The view may be generated on the basis of user-provided configuration. For example, gene expression information may be used for color (tone) scaling. Further generally, edge visualization, such as color, style, size and/or shape, may indicate attribute such as a type and/or weight thereof, whereupon textual descriptions are not necessary.

The weights may be determined using a selected method. For example, three factors may be utilized: data reliability, relevance, and rarity. The data reliabilities of edges may be defmed using a set of rules, such as: if the edge is derived from a predetermined data source A, e.g. Swiss-Prot, then its reliability is X (6 [0, 1]), whereas if the edge is derived e.g. from the computer-annotated TrEMBL database, then its reliability may be Y (lower than A, for example), etc. The interpretation of edge reliability is the degree of belief the investigator has for the edge being correctly annotated. If there is a value associated with an edge that reflects similarity or confidence, such as a homology score, the value can be transformed into a [0, l]-similarity value. With the interpretation that the similarity of nodes u and v is the probability that any relationship between u and a third node t is also true for v and t, the similarity can be multiplied into the reliability of the edge.

Considering relevance, the relevance of an edge type may be defined as the degree of the user's belief that edges of that type represents a relevant connection with respect to the query. In a practical system, the user may have a basic configuration— a set of default relevance values for edge types— and only few adjustments are needed for a typical query. The relevance values may sometimes be easier to give in terms of node types instead of edge types. Then, relevance q(x) for a node type τ can be decomposed into coefficients for edge types by multiplying all edge types with one end-node of type τ by square(q(x)), and edge types with both end-vertices of type τ by q(x). As a path relevance may be defmed as a product of included edge relevances, this gives the desired outcome: the relevance of any path visiting a node of type τ is multiplied by q(x).

Regarding rarity, it may be desired to give lower scores for paths that visit nodes with high degrees: the higher the degree of node v, the less likely it is that any two neighbors of v actually have an interesting connection through v. Rarity d(v) may be defmed first for nodes: d(v) is the probability that any two edges incident on v are related to each other and represent a meaningful path. We propose the ad hoc formula d(v) =(|Γ(ν)|+ 1) ~K 6 [0, 1], with a > 0, to determine the penalty for the degree |Γ(ν)| of node v; smaller values mean larger penalty. The parame- ter a determines how steeply the penalty increases with the degree. With a = 1, rarity d(v) = 1/(|Γ(ν)|+1) has a natural probabilistic interpretation. One could consider a random walker who, at any node, is equally likely to follow any edge, or stop at the node. Then, given a path p through vertices vi ,V2 ,...,vk , rarity d(vi) is the probability that a random walker who has so far traversed nodes vi,...,vi , will next stay on the path and visit node vi+i. Although lower values of a do not give equally attractive interpretations as random walk probabilities, they can be useful in practice to give relevant penalties for node degree that reward parallel paths more than a standard random walker. The maximum value of d(v) for an non- terminal node v of a path is 3 ~iK , Rarity values of the terminal nodes may be ignored; they would only add a constant factor to all paths. In principle, the values of a could be set separately for each node type, but in this exemplary embodiment a single value for all nodes is used. As with relevance above, the rarity val- ues are decomposed into edge-specific coefficients by taking the square root of them. Ideally, in the context of analysis of connection subgraphs, the relatedness of edges incident on a node should be tested for each pair of edges separately and independently. With the rarity values of nodes decomposed on the incident edges, this is clearly not the case. The approximation is used in order to avoid the quadratic computational cost for each node. It has no effect on evaluation of the goodness of a single path.

In the light of total edge weight, now that we have defmed all the components of edge weight, edge weight w(e) may be simply defmed as a product of those fac- tors: w(e) = r(e)q(e)d(e), where r(e) 6[0, 1], q(e) £ [0, 1], and d(e) £ [0, 1] are the reliability, relevance, and rarity of edge e, respectively. Under the assumption that they are probabilities for mutually independent necessary conditions for the edge, the weight w(e) is the probability that edge e exists. Path goodness between two end nodes may be defmed as the product of associated edge weights, for instance. The probabilities may be converted into distances by taking the negative logarithm of the goodness so that a selected discovery method for finding shortest paths may be utilized. For example, a publication "Link Discovery in Graphs Derived from Biological Databases " (Petteri Sevon, Lauri Eronen, Petteri Hintsanen, Kimmo Kulovesi, Hannu Toivonen. 3rd International Workshop on Data Integration in the Life Sciences 2006 (DILS'06), LNBI 4705, 35-49, Hinxton, UK, July 2006. Springer), which is incorporated herein by reference in its entirety, describes, see particular- ly chapters 3.2-3.4 (pages 7-10), few applicable methods for finding and evaluating links in connection with large graphs.

One feasible subgraph extraction method relies on the principle of determining the most reliable subgraph of selected size constructed such that it, as a whole, connects the query nodes as effectively as possible, i.e. it is maximally likely that there is a connection between the query nodes. Applicable algorithms e.g. for the scenarios of two query nodes include "Best Paths Incremental" (BPI) and "Series-Parallel Augmentation" (SPA), which are described in the publication "Finding Reliable Subgraphs from Large Probabilistic Graphs " (Petteri Hintsa- nen, Hannu Toivonen. Data Mining and Knowledge Discovery 17 (1): 3-23. 2008. Springer) incorporated herein by reference in its entirety, see especially pages 5 and 7-15. The BPI algorithm is based on the following idea: find the most probable, or "best", paths between the terminal nodes s and t, and let them span a subgraph. The BPI algorithm adds best paths to the solution until it has at least ||G|| - K edges. In order to have exactly ||G|| - K edges, it then calls Monte Carlo Pruning to remove the possible excess edges. The SPA algorithm is based on direct optimization of the reliability of the result subgraph H in a greedy, iterative manner. The cost of evaluating reliability is greatly reduced by constructing series-parallel graphs, a restricted class of graphs for which the reliability can be evaluated efficiently.

Fig. 4 visualizes few, merely exemplary, test results in connection with an embo- diment of a novel, modified PC method for discovering the most reliable subnetwork especially in scenarios with two or more target concepts (query nodes).

The problem of finding the most reliable k-terminal subgraph may be defined as follows. Let G = (V, E) be an undirected graph where V is the set of nodes and E the set of edges. G is a Bernoulli random graph where each edge e has an associated probability pe. The interpretation is that edge e 6 E exists with probability pe, and conversely e does not exist, or is not true with probability 1-pe. Given edge probabilities, the states of edges are mutually independent. Nodes are static. Given a set Q V of nodes, or terminals, the network reliability (G,Q) of G is defined as the probability that Q is connected, i.e., that any node in Q can be reached from any other node in Q. In the most reliable subgraph problem the one is looking for a subgraph H G connecting the terminals in Q such that H has at most B edges and a maximal reliability with respect to the terminals, i.e., find H * = arg max HCG,||H||<B R(H,Q). The problem can be defined for directed graphs as well. This problem, like reliability problems is inherently challenging: efficient solutions are available only for restricted classes of graphs, but cases on general graphs are most likely intractable. Given a graph G, V(G) is the node set of G and E(G) the edge set of G. Given a set of edges ScE, we say S induces a graph G(S) = (V, S), where V consists of the endpoints of the edges in S. The union between two graphs Gi = (Vi, El) and G 2 = (V2, E2) is a new graph H = Gi U G2 = (Vi U V2, El U E2). Other set op- erations for graphs may be defined analogously. For notational convenience, we treat paths, trees and edges as (induced) graphs when there is no risk of confusion. This makes it notationally easier, e.g., to add a path P to a graph G by writing simply G u P instead of G u G(P), or to denote the edges of a tree T as E(T ) instead of E(G(T)). Finally, a path with endpoints u and v is said to be a u-v- path.

A Monte-Carlo sampling based algorithm Path Covering (PC may be applied for the two-terminal case, for example. The algorithm has two phases: a path sam- pling phase and a subgraph construction phase. In the path sampling phase, the goal is to identify a small set C of paths that have high probabilities and are relatively independent of each other. The goal is achieved by approximately maximizing the probability that at least one path P £ C is true. This probability may be denoted by Pr(C) = Pr(V ' pecP). In the subgraph construction phase, PC chooses a set of solution paths ScC such that the probability Pr(S) = Pr(VpesP) is maximized and ||G(S)|| < B, where ||G|| denotes the number of edges in G. PC does not maximize (G(S)) directly, but works on its lower bound Pr(S) instead. Concisely put, PC generates S iteratively by choosing at each iteration the path P * by the objective function, which gives the maximal per-edge increase to the (esti- mated) probability Pr(S), that is

Prf S U Ϊ P S)

are; max

E(P) \ E(H)

(1) where H = G(S) is the result subgraph being constructed. During each iteration round, paths that become included into H are removed from C. To satisfy the budget constraint, paths PeC for which ||H|| + |E(P) \ E(H)| > B are also removed. The algorithm stops when ||H|| = B or C \ S = 0, and returns the subgraph H.

Monte Carlo sampling is briefly described: in the crude Monte-Carlo method for estimating network reliability between terminals s and t, one draws N independent samples, or realizations Gi, 1 < i < N, from the random graph Q. Each realization may be generated by simulating the existence of each edge of g randomly and independently. To estimate reliability R = R(g), one counts the number K of those realizations Gi which contain at least one .s-t-path. Then = K/N is an un- biased estimator for R with variance R(l - R)/N.

Next, the above-introduced PC algorithm is reviewed in more detail. In the first phase, i.e. path sampling phase, an iterative approach may be applied for constructing the set C of candidate paths efficiently between terminals (query nodes) s and t. In the path sampling phase, PC gathers a relatively small set C of candidate paths from the set of all s-t paths in Q. Then, in the subgraph construction phase, PC aims to choose an optimal subset :P of the candidate paths in C, according to the edge budget B, and returns the subgraph G( ) induced by !P (it is said that a set of paths IP induces a graph G( ) = (V,E), where V = {u : {u, v} E P, P E IP } and E = {e E P : P E iP }).

Both the phases may be considered to address the same general problem: selection of a subset of available s-t-paths to induce a reliable subgraph. Furthermore, both phases use a similar strategy to achieve this goal by iteratively and greedily maximizing Pr( ) = P), that is, the probability that at least one of the paths in is true. The main difference is that the path sampling phase scales to large inputs with exponentially many paths, while the subgraph construction phase produces a better optimized subgraph G(P) with a larger computational cost per path. To begin with path sampling, the most probable, or best, s-t path is used as the initial candidate path. Then C is augmented in each iteration with a path P such that Pr(C V P) is approximately maximized. Let C denote an event where none of the paths in C exists. Since

Pr(C vP) = Pr(C v (£ AP)) Pr(C) + Pr(£ A P),

(2) the most probable s-t-path P is looked for under the condition that all current paths in C fail. This can be implemented with Monte Carlo simulation by randomly realizing edges in each iteration according to their probabilities: an edge e is decided to exist with probability p(e) and to not exist otherwise. A cut C is found if every candidate path has at least one edge that does not exist. Then a new s-t-path may be added to C, if one exists. Table 1 Path sam li

Input: Random g a h Q — (F, B), terminal podes s and number of candidate at s N

Output: Set C of $-t-p h&

k C the best s-t- th tan Q

2: Set w(e) ^ - log(p(e)) for all e€ E

3: re eat

4: Reset all e€ E as undecided

5: tor ail P€ C do

6: for all e€ F do

7: if e has not been decided then

8: Decide e as successful with probability p(e), felled otherwise

9; If € has failed then

10: €i>titliaiia from line 5 with next P

1 1 : go to 4 { P exists}

1.2: Find the best s-t-patb F from £ using edge weights w * deciding edges as necessary

13: if p≠ then

14: ^ C -~ C U P

16: return C

The algorithm of Table 1 may be used to implement the path sampling phase. It is mostly looking for a cut event C in each of its iterations (lines 3-15). It realizes edges during the process only when needed, and applies the following two rules to avoid unnecessary work. When checking if a path exists, the rest of the path can be ignored as soon as a failed edge is encountered (line 10). When an existing path has been found, the remaining paths can be ignored (line 1 1). When a cut C does take place, we find a new candidate path among the non- failed edges (line 12). Any shortest path algorithm can be applied with edge weights w(e) = - \og(p(e)). Namely, let P be the shortest path found. Then

log(Pr(P)).

(3) and since the logarithm function is strictly increasing and w(P) is minimized, Pr( ) is maximized. Edge decisions can be integrated into best path search, too, and carried out whenever an undecided edge is encountered. A potential challenge in the path sampling algorithm is that cut events C are rarely found when Pr(Q is high. In such cases the algorithm could require unaccept- ably many iterations before stopping. A more effective approach would be to draw random realizations of Q directly under the condition that no path in C is true. Unfortunately, given the potential dependencies between paths in C, it may be more challenging to do this exactly. As an alternative algorithmic variant, the following approximation is proposed where one may deliberately "fail" edges of candidate paths until all candidate paths have been broken. First, edges are realized until all paths in C have been decided— even if some paths are found to exist. Then, if some paths exist, we iteratively and greedily fail the edge e which in- tersects the largest number of true paths in C until no true paths remain in C. If there is more than one such edge, the one with the smallest probability p(e) may be chosen. This modification may be implemented by removing line 1 1 of the path sampling algorithm and adding the cut sampler (Table 2) just before line 12. a le 2 Cut sa pler

1 : " *~ {P€ C ; P is true}

2: while if ≠% ®

3: Let E(F) = f e€ P : P€ \

4: Let e ~ ar oi ^ ( < \ H P€ : e€ P

51 Re-decide as faile

To proceed further, Table 3 discloses an exemplary algorithm for subgraph construction phase. Table 3 Paih Selection

Input: Set C of s™Fpatii$ > inte er B

Output A reliable subgraph Ή C G(C) with at most B edges

I : *™

2: Remo e , all paths with o»re than B edges from C

3: Generate N .realizations 6\· of G(C)

4: For each F€ C, let C(F) = fi ; F€€¾

5: while S > 0 Λ ≠ 0 do '

6: Choose F * — arg m x c^ KP) using (3

7: if s(p*) = 0 tlt i

8: go to 4

: B *~ B ~~ w(P * )

10: Add F * to P and remove it from C

I I : Remove all paths F from C s. tt'(F) >

12: Remove al paths Q from s.t. F * >·· Q

13: return H = G(V)

Indeed, in the second phase of PC, the executing element first obtains the set C of candidate paths generated in the first phase, chooses a subset c C having at most 5 unique edges in total, and return the subgraph G(F) c g induced by them. The objective is to choose the set of paths such that the reliability R(G( P)) is maximized. Exhaustive search and evaluation of all feasible subsets is often intractable, even though the number of candidate paths C is assumed to be relative - ly small. We may relax the problem by maximizing the probability Pr( ) = Pr(V e ~P) instead. It is a lower bound of R(G(!P)) and is easier to evaluate— but still requires exponential time in the worst case.

Therefore we may resort to Monte Carlo approximation of probabilities. This choice also allows casting the path selection task as a set cover problem. In the path selection algorithm (Table 3), N random realizations & of the whole graph G(C) (line 3) are first drawn. Let C(P) = {i : P 6 Gi} be the cover set of each path P 6 C, i.e. the indexes of those Monte-Carlo realizations where P did exist (line 4). Given = {Pi, . . . , Pk} , the cover sets can be used to estimate Pr(:P) ~ \C(Pi) U ... U C(/Y)|/N in O(kN) time. This is a substantial improvement over Θ(2*) time required for exact computation. With cover sets, the path selection problem reduces to an instance of a specialized SET COVER problem (hence the name PATH COVERING), where the goal is to choose a set of paths such that ~ C(P)\ is maximized and ||CJ(-P)| | < B, where ||G(y)|| denotes the number of edges in G( ). This problem differs from the ordinary SET COVER in three ways: it does not require the entire universe (the set of all positive realizations) to be covered, it is weighted (via budget B), and the weights are dynamic (different choices of paths affect the cost of individual paths). To solve the path selection problem, the executing entity may use a greedy approach where one path is added at a time to an initially empty (lines 5-12). The best possible addition may be always chosen from C, until the budget B has been exhausted. Here, "best possible" means the one adding most Monte Carlo realizations to the cover per edge added to IP. Formally, the cost w(P) of a path P is the number of new edges added to the solution subgraph G( ): w(P) = \\P \ G( )\\. The score s(P) of a path is defined as the ratio of the improvement in probability over its cost s( ) =(Pr( U P) - ?r( ))/w(P). Note that both s and w vary in each iteration. With cover sets, the score function has an estimate s(P) = (\C(P)\ C( )\)/w(P), where C(P) = U/ > < - C(P)\.

In an extreme case, the cost of a path becomes zero if all of its edges have already been included in the solution. As an additional optimization, one may implement a look-ahead to take advantage of such situations: if the addition of path P would make another path Q £ C completely included in G( ), then the cover set C(P) is extended with the cover set C(Q). More formally, we say that P dominates Q if an inclusion of P into implies Q 6 G( ), and denote this relation by P > Q. Clearly, relation >- is reflexive and transitive. An improved estimate of the above score function estimate with dominating paths is thus

<S { f ' ? --- ; ,

" w{P (4)

At each iteration, the path P with the maximum §(P) is then chosen, and P is added into (line 6). It may be assumed that after adding a path P to , all paths dominated by P are removed from C (line 12). During successive iterations, the enumerator in Equation 4 approaches zero as the proportion of realizations covered by paths in ? ' increases. Eventually all realizations may become covered so that s(P) = 0, and the choice of the remaining paths becomes somewhat arbitrary. In these (rare) situations our implementation "restarts" by considering all realizations uncovered (line 8) and tries to recover them with additional paths from C as before. In each iteration, paths that are too expensive to be added to 'P may be removed (line 1 1).

Reverting now to the additionally suggested, both efficient and scalable, modification of the PC method particularly applicable also for scenarios with three or more query nodes, in the first phase thereof, the algorithm extracts a set of trees from the original graph G. Each of the trees connects the given k query nodes; by construction, they are spanning trees having the query nodes as leaves. In the second phase, these trees are used as building blocks to construct the result of the algorithm just like PC uses paths as its building blocks. T¾l>te 4 Al rit m 1 of the o osed modified PC o.udhod Algorithm 1 Sample trees

Input: Random graph G t set Q C V of query nodes, number of trees to be g nerated Output: Collection C of trees connecting ail. query nodes

1 : C ·····

2: for each node pair v ¾ ε Q, t?t≠ ¾ do

3: Add the best Vi ~ t¾— ath from G to C

4: repeat

5: Decide each e 6 E(C) by flipping a coin biased by the probability of e

6: Let E$ contain the successful edges and E F contain the failed edges

7: T$ *~ 0

11: continue at line 5

12: Γ,. · C,

13: continue at line 17

1 ; Randomly select u and v€: Q. u≠- v

15: Add the best ¾~¾ » -path from G E F to C

16: continue at line 5

17: Set the probability of to 1 for all e€ E'{Ts)

18: Randomly ei¾¾¾c w 6 Q Π ViTs) and v Q \ V{T$)

19: P$ <— the best «--~? paih from G— E F

20: Add all e G E(P S ) to E(T S ) and all v ε V( P S ) to F(Z¾)

21; Reset edge weights for ail e€ E(Ts)

22: until the stopping condition is satisfied

23: return C (after removing all incom le e trees) Input and output data: The first phase of the algorithm, hereinafter Algorithm 1 (see Table 4), takes a random graph G and a set Q <= V of query nodes as input. The algorithm outputs a set C of trees such that each tree connects the query nodes. These trees are called herein candidate trees. C is used as an input in the second phase of the algorithm, hereinafter Algorithm 2 (see Table 5).

Producing candidate trees: At the first iteration, (|Q| 2 - |Q|)/2 new candidate trees are generated (See Lines 2-3 of Table 4). Each tree connects one pair of query nodes and each query node pair is connected by one tree. Later, as the algorithm proceeds, new trees are added one by one; each of the later trees is also initially formed as a path between two query nodes. During the algorithm, individual trees are created and grown iteratively. In each iteration, either a branch is added to an existing incomplete tree (a tree that does not yet connect all query nodes) so that a new query node is connected to the tree, or a new initial tree is generated. At the end of the algorithm only complete trees (trees that connect all query nodes) are output.

Edge sampling: The algorithm is stochastic. At each iteration round, it randomly decides, according to the probabilities pe, which edges exist and which do not (Line 5). Only edges that are included in at least one candidate tree are decided. All other edges are considered to exist. The next step is to determine if any of the previous candidate trees exist in the current graph realization (Line 9). If one does not exist a new candidate tree is generated (Lines 14-15). If a previously discovered tree exists, the first such tree is taken into examination (Line 10). If the tree is complete, the algorithm proceeds directly to the next iteration. Otherwise the tree is extended (Lines 17-21) before continuing to the next iteration.

Tree construction: a new tree is formed by the best path connecting two query nodes (Line 15). A previously established incomplete tree is extended by con- necting a new query node to it with the best path between some node in the tree and the new query node (Lines 18-20). The probabilities of all edges in the tree are set to 1 prior to the search of the best path (Line 17), while the probabilities of other edges remain the same. As a result the new branch is formed by the best path between the new query node and the tree. Edges that do not exist at the ite- ration are not used. All edge weights are set to their original values before proceeding to the next iteration (Line 21). Choosing query nodes: When a new tree is formed, the algorithm decides randomly which two query nodes are included in the initial tree. Later on when an incomplete tree is extended, the algorithm again randomly selects the new query node to connect to the tree. This is to avoid unnecessary complexity: in con- ducted experiments this solution produced better results and shorter running times than selecting the node to be added based on its distance from the tree (results not shown).

Discovering strong trees: The collection C of candidate trees is organized in a queue, i.e., the oldest candidate trees are always considered first. This drives the algorithm to complete some trees first (the oldest ones) rather than extending them in random order and not necessarily up to a completion. On the other hand, the stochasticity of the algorithm favors strong trees: they are more likely to be true at any given iteration and thus more likely to be extended. The algorithm al- so has a tendency to avoid similar trees: when two (partial) trees are true at the same time, only the oldest one is potentially extended.

Stopping condition: The number |C| of complete trees generated is used as the stopping condition for the first phase. The number of iterations would be another alternative. Using the number of trees seems a better choice than using the number of iterations, since the minimum number of iterations needed to produce a single complete tree increases when the number of query nodes increase.

Table 5 Algorithm 2 of he r sed m di fled PC method Algorithm 2 Select trees

Input: Random graph G, collection C of trees, budget B€· M

Output: Subgraph H C G with at most B edges

1: S :

2; while \GiS) \\ < B and C \ S≠ 0 do

4: S S U T

5: for all T€ C do { remove useless trees}

6: if T C G(S) ur \ \G($) \\ + \E(T) \ E(G(S) )\ > B then

7: C < ■■■■ C \ [ Ί }

8: return H = G(S)

Tests: A large index of various interlinked public biological databases, such as EntrezGene, UniProt, InterPro, GO, and STRING, was constructed in accordance with the principles of the present invention described herein. The index enables representing and viewing the contents of the aforesaid databases as a large, heterogeneous biological graph. Nodes in this graph represent biological entities (records) in the original databases, and edges represent their annotated relationships. Edges have weights interpreted as probabilities. The proposed method was evaluated using six source graphs of varying sizes (Table 6) and a set of up to ten query nodes. They were obtained as follows.

Table 6 Test set- TO

Mai e of G— (V, E) 400 500 700 1000 2000 5000

Number of ed es. \E\ = \ \G\\ 395 499 717 1046 2 1 4998 Hn!Bber of nodes. ) v\ 153 189 260 336 579 1536 Reliability of G w 0,46 (148 0.50 0.56 0.59 0.60

First, the largest subgraph, consisting of approximately 5000 edges and 1500 nodes, was retrieved from the index database using Crawler, a proprietary subgraph retrieval component described in more detail hereinafter. For this initial re- trieval, we used eight randomly selected query nodes. Second, a set of ten query nodes to be used in the experiments was defined by randomly picking nodes from the subgraph of 5000 edges. The query node identifiers are EntrezGene:348, En- trezGene:29244, EntrezGene:6376, EntrezGene:4137, UniProt:P51 149, Uni- Prot:Q91ZX7, EntrezGene: 14810, UniProt: P49769, EntrezGene: l 1810, and Un- iProt:P98156. Third, smaller subgraphs were retrieved with Crawler by a sequence of subgraph retrievals, always extracting the next smaller subgraph from the previous subgraph, using the ten query nodes given above. We also used two additional source graphs when evaluating the scalability of the algorithm to large source graphs. The smaller one consisted of 51,448 edges and 15,862 nodes. The size of the larger graph was 130,761 edges and 66,990 nodes. The smaller graph was a subgraph of the larger one and all other source graphs used in the experiments were subgraphs of both.

Crawler. The subgraph retrieval component of the suggested indexing system, "Crawler", was used to extract the source graphs, and it will also be used below in a comparative experiment to assess the effectiveness of the proposed algorithm. Given a source graph, a set of query nodes, and a node budget, it works roughly as follows. It first finds a large set of best paths between all pairs of query nodes. It then picks those paths sequentially between the node pairs, until the subgraph induced by the chosen paths reaches the specified number of nodes. The method is somewhat similar to the afore-explained BPI algorithm, but works for multiple query nodes. Even though the Crawler works with random graphs, it does not try to optimize the k-terminal reliability.

Stopping condition: The number of complete candidate trees generated was used as the stopping condition for Algorithm 1. Another alternative would have been be the number of iterations. Neither condition is totally perfect: for instance, the number of query nodes has a strong e.ect on the number of trees needed to find a good subgraph. On the other hand, the number of query nodes has also a strong effect on the number of iterations needed to produce a sufficient amount of trees. A single fixed number of candidate paths is a suitable stopping condition for the two-terminal case but it is problematic in the k-terminal case where the building blocks are trees consisting of multiple branches. For the done experiments, a fixed number of candidate trees gave a fair impression of the performance of the method, however.

Parameters: The experiments have been performed using the following parameter values; the default values we have used throughout the experiments, unless otherwise indicated, are shown as bolded.

- Size of source graph G (Table 6): ||G|| = 400, 500, 700, 1000, 2000, 5000 - Number of query nodes: |Q| = 1, 2, 3, 4, . . ., 10

- Size of extracted subgraph: ||H|| = 10, 20, 30, . . ., 60, . . ., 100, 150, 200

- Number of complete trees (stopping condition of Algorithm 1): |C| = 10, 20, 30, . . ., 100, 150, 200

To control random variation, the values reported below are averages over 10 independent runs.

Results: At 402 of Figure 4, it is illustrated how the proposed method succeeded in extracting a reliable subgraph. For few query nodes, such as three or four query nodes (|Q|), a subgraph of relatively low number of edges, such as 20-30 edges, managed to capture about 80% of the reliability of the original source graph of 500 edges. For a large number of query nodes, the problem seems more challenging. Larger subgraphs are preferably needed for a larger number of query nodes, if the reliability is to be preserved. The number |C| of candidate trees produced in the first phase of the algorithm has an effect on the reliability of the extracted subgraph, but sampling a relatively small number of trees is enough to produce good subgraphs (approximately 50 trees for four query nodes; results not shown). An experimental analysis of the running time indicates that the method scales linearly with respect to the number of candidate trees generated.

The scalability of the proposed algorithm to large source graphs (see the diagram at 404) may be considered as superior to previous methods. Source graphs of thousands of edges may be handled within a second or two. Scalability is close to linear, which was expected: the running time of the algorithm is dominated by Monte Carlo simulation, whose complexity grows linearly with respect to the input graph size and the number of iterations. Limiting the length of tree branches may shorten the running times in some cases.

The original reliability of the growing source graph as well as the reliability of the extracted subgraph (of a fixed size of 60 edges), are illustrated at 406. The relative difference in reliability is less than 20% in all cases, emphasizing the ability of the algorithm to preserve strong connectivity between the query nodes. These results suggest that the algorithm is competitive for interactive visualization.

Further, the proposed method was briefly compared against the aforesaid Crawler using four query nodes. The proposed method reached 80% of the original relia- bility with only 30 edges whereas the Crawler needed 60 edges for the same.

Thus reliable k-terminal subgraphs of fairly small sizes seem to capture essential connectivity well. Second, the proposed method extracts a reliable subgraph very rapidly, e.g. in a matter of seconds depending on the available computational power and memory capacity, even from a source graph of thousands of edges; the time complexity seems to be linear with respect to the size of the original graph. There are many possible variants of the approaches described herein that could be explored depending on each particular use scenario and embodiment. For instance, different techniques to select which partial tree to expand and how to ex- pand it, or how to efficiently use partial trees also in the second phase, may be applied by a skilled person. Another embodiment could imply using (approximated) Steiner trees as spanning trees. Algorithm parameters may be case- specifically selected to provide optimum results. Alternatively, one or more ge- neric sets performing reliably over wider range of input graphs and query nodes may be determined. The solution may be further tailored for use with directed graphs. Figure 5a illustrates an embodiment of finding representative nodes. The arrangement 104 may be configured to extract and/or (visually) highlight a number of representative concepts (nodes) in a subject graph, such as the original index graph or a search result subgraph, so that the representative concepts are (maximally) relevant to the search, but mutually (maximally) different. Thus, a com- pact and at the same time extensive representation may be obtained and visualized.

Identification of few representative nodes may be considered as one applicable approach to help users make sense of large graphs. The visualization/perception problems typically start already with only dozens of nodes. As an exemplary scenario, one could consider link discovery. Given a large number of predicted links, it would be useful to present only a small number of representative ones to the user. Or, the representatives could be used to abstract a large set of nodes, e.g. all nodes fulfilling some user-specified criteria of relevance, into a smaller but representative sample. In connection with genetics different wet lab techniques are often used for identifying numerous genes, proteins, and/or something else as potentially interesting, e.g., by the statistical significance of their expression, or association with a phenotype such as disease. Finding representative genes among the potentially interesting ones would be useful in several ways. First, it could be used to remove redundancy, when several genes are closely related and showing all of them adds no value. Second, representatives might be helpful in identifying complementary or alternative components in biological mechanisms.

The probabilistic interpretation of edge weights p(e) gives natural similarity measures for indirect relationships between nodes.

Probability of a path: Given a path P consisting of edges ei,...,ek, the probability p(P) of the path may be defined as the product p(ei)-...-p(ek) as mentioned herei- nearlier. This corresponds to the probability that the path exists, i.e., that all of its edges exist. Probability of the best path: Given two nodes u, v V, a measure of their connectedness or similarity may be defined as the probability of the best path connecting them:

p is a path from u to v (5)

It shall be noticed that this is not necessarily the path with the least number of edges. For example, this kind of similarity function s( ) may be used for finding representatives.

According to the embodiment, finding representatives in networks incorporates clustering the given nodes, using the similarity measure defined above, and then selecting a representative from each cluster. The method execution can be characterized as follows:

Input: Set S of nodes, graph G, number k of representatives

Output: k representative nodes from S

1 : Find k clusters of nodes in S using similarities s(-) in graph G, and

2: For each of the k clusters, output its most central node (the node with the maximum similarity to other nodes in the cluster).

Thus the aim is to have representatives that are similar to the nodes they represent (i.e., to other members of the cluster), and also to have diverse representatives (from different clusters). For clustering, k-medoids or hierarchical clustering may be applied, for example. k-medoids is similar to the k-means method, but better suited for clustering nodes in a graph. Given k, the number of clusters to be constructed, the k-medoids method iteratively chooses cluster centers (medoids) and assigns all nodes to the cluster identi ed by the nearest medoid. The difference to the k-means clustering method is that instead of using the mean value of the objects within a cluster as cluster center, k-medoids uses the best object as a cluster center. This is a practical approach when working with graphs, since there is no well defined mean for a set of nodes. The k-medoids method also immediately gives the representatives. For very large graphs, a straight forward implementation of k-medoids is not necessarily the most efficient. Tools to facilitate faster clustering may be utilized.

Given a set S of nodes, e.g. biological entities, to be clustered, and k, the number of clusters to be constructed, an embodiment of the method may proceed as follows. First, the index database may be queried for a graph G of at most e.g. 1000 (or other predetermined number) nodes cross-connecting nodes in S as strongly as possible. The pairwise similarities between nodes may be then calculated as the best path probabilities in G.

To begin with the actual clustering, k nodes from S may be chosen randomly as initial medoids. Each remaining node in S is then clustered to the most similar medoid. the pairwise similarity between a node and all medoids equals zero, the node will be considered an outlier and is not assigned to any medoid in this itera- tion. Then, a new medoid is calculated for each cluster. The node that has a maximal product of similarities between each other node in the cluster and itself is chosen the new medoid. The last two steps are then repeated until the clustering converges or the maximum number of iterations is reached. As an example, k-medoids was run with k = 3 and a set of nine genes. The genes belong to three known groups, each group of three genes being associated to the same phenotype. The three OMIM phenotypes used in the example are a pigmentation phenotype (MIM:227220), lactase persistence (MIM:223100), and Alzheimer disease (MIM: 104300).

The algorithm converged in this case after two iterations. The result of the example run is shown 502 in the figure. Clusters (diamonds, boxes, ellipses) and representatives (double borders) 504 of nine given nodes, and some connecting nodes 506(circles) on best paths between them. Lines represent edges between two nodes, dotted lines represent best paths with several nodes. Generally, the clustering produced the expected partitioning: each gene was assigned to a cluster close to its corresponding phenotype with the exception of EntrezGene: 1627. The three representatives (medoids) are genes assigned to different phenotypes. Hence, the medoids can be considered representative for the nine genes.

Alternatively, hierarchical clustering could be applied. With the k-medoids approach it may happen that it discovers star-shaped clusters, where cluster members are connected mainly through the medoid. To give more weight on cluster coherence, we may use the average linkage method, as follows. In the practical implementation, we again start by querying the index database for a graph G of at most 1000 nodes connecting the given nodes S, and compute similarities of nodes in S as the probabilities of the best paths connecting them in G. The hie- rarchical clustering proceeds in the standard, iterative manner, starting with having each node in a cluster of its own. In each iteration, those two clusters are merged that give the best merged cluster as a result, measured by the average similarity of nodes in the merged cluster. The clustering is finished when exactly k clusters remain. After the clusters have been identified, a medoid may be found in each cluster (as in the k-medoids method) and be returned as a representative.

Figure 5b illustrates an embodiment of finding and visualizing nodes (and associated target subgraph(s)) deemed as particularly relevant to the query nodes included in a weighted source graph. A user may initially specify some query nodes (concepts), whereupon the arrangement in configured to identify a number of other nodes (concepts) that are relevant with respect to the query nodes, but still preferably non-redundant with respect to each other.

A user who wants to know how query nodes, such as Barcelona 510 and Helsin- ki 512, are related might already know that Barcelona is a city in Spain, that Helsinki is the capital of Finland, and that both Spain and Finland are in Europe and also are members of the European Union (EU). Other, perhaps less obvious relations, may be initially more interesting. For example, the user might have not known that architects Antoni Gaudi (who lived in Barcelona) and Alvar Aalto (who lived in Helsinki) both have exhibited at a world's fair (also known as Expo). Another user again might not know that soccer player Jari Litmanen has played for FC Barcelona as well as for HJK Helsinki. Given Barcelona and Helsinki as query nodes, the goal is to identify a non-redundant set of nodes relevant to both cities. In the small example graph EU is highly relevant to both and there - fore the first choice to be included in the result. Europe is highly relevant, too, but being closely related to EU it would be a redundant choice. Non-redundant but still relevant concepts include world's fair and Jari Litmanen. Relevant and non-redundant nodes could be used to form the target subgraph. Alternatively, they could be at least visually highlighted or otherwise indicated relative to a larger graph structure.

Note that relevance with respect to query nodes might be considered as highest for node(s) between the query nodes. A relevant node typically is a central node for the (indirect) relation between the query nodes. A non-redundant set of relevant nodes then highlights distinct relations between the query nodes. For more than two query nodes, we may consider a node to be more relevant if it is relevant to all the query nodes, i.e., if it helps to connect all of them. For a single query node the relevance may be directly related to proximity to the query node and may seem less interesting. However, finding a set of non-relevant nodes may help in giving a summary of the neighborhood. For Barcelona, for instance, a relatively non-redundant set of relevant nodes consists of Spain, FC Barcelona, and Antoni Gaudi.

As a more scientific example, e.g. the field of bioinformatics could be considered. Consider a scientist who researches a disease and has identified from high- throughput techniques numerous genes related to the disease. The scientist is interested in knowing know how these genes relate to each other, in order to study possible mechanisms behind the disease. Relevant but non-redundant concepts could now potentially help in identifying a representative set of shared connections. Further on, if the relevant (and non-redundant) nodes are ranked, the scientist could start study them from the top, and decide for herself when the relevance becomes too low or the cost of further studies of these candidates too high.

Again, a weight associated with an edge e is its probability p(e) (or can be at least interpreted as a probability): edge e exists with probability p(e), and conversely e does not exist, or is not true, with probability l-p(e). Edges are assumed mutually independent. Probability of a path may be still considered as the product of asso- ciated edge probabilities. Best path may maximize the path probability between two query nodes (refer to the equation 5 set forth hereinearlier, for example). Probability of the best path may be utilized as a natural measure of the nodes' proximity (s(-)). Reverting to the figure, the best path between Barcelona and Helsinki is then Barcelona - Spain - EU - Finland - Helsinki with a probability of 0.8-0.8-0.8-0.9 = 0.4608. Note that the most probable path is not necessarily the path with the least number of edges. Alternatively, other proximity functions could be used, too. In particular, network reliability is an interesting alternative. This is the probability that there exists at least one path (not necessarily the best one) between the given query nodes. The potential benefit of applying network reliability is that it uses more information than just the single best path. It has a computational and a conceptual downside, however. Computing network reliability is NP-hard, but it could be estimated by Monte Carlo approaches though.

Given a positive query node qEV, the relevance of node ueV with respect q is simply s(u, q). In other words, relevance directly depends on proximity. Given a non-empty set Qp > <%,} c V of positive query nodes, the relevance of u is the product of its proximities to the query nodes: relpiu.

In the figure, the relevance of EU to query nodes Barcelona and Helsinki is 0.8·0.8·0.8·0.9=0.4608, the relevance of Europe is 0.8-0.8-0.9-0.9-0.8-0.9 ~ 0.373, and the relevance of Jari Litmanen is 0.8-0.6-0.7-0.7 ~ 0.235. This corresponds to the intuition that EU and Europe are more strongly connected with both query nodes than Jari Litmanen is.

Several nodes may have equal relevances. This happens, in particular, when there are two query nodes: all nodes on the best path between the query nodes have identical relevance, equal to the probability of the best path. For instance, Spain, EU and Finland all have relevance 0.4608 in the figure. Nodes that are roughly equally far from each query node in terms of relevance (proximity) could be preferred. Such ties may be thus solved by giving highest priority to nodes with the smallest sum of squared proximities to the query nodes, i.e., by ranking the tied nodes u in ascending order by

(7) Figure 5c illustrates an extreme case. There are two positive query nodes, Barcelona and Helsinki, again, and three nodes, Pablo Picasso, Picasso's art, and Ate- neum 514 (museum of classical art in Helsinki) along the best path. If path lengths would be considered, Picasso's art would be preferred. Based on the weights it is, however, much more closely related with Barcelona (relevance 0.9) than with Helsinki (relevance 0.56). With the squared proximities, on the other hand, Ateneum 514 is preferred. Among the nodes on the best path, it has the most equal relevance to both query nodes (0.63 and 0.8). In some embodiments, the provided arrangement may further support negative search terms (concepts), i.e. nodes to be preferably avoided (subjective irrelevancy) in the search result (subgraph), which may be further associated with a search radius to be avoided around the excluded concepts, etc. Negative query nodes may be applied in the definition of relevance, to specify which neighborhoods are less relevant. Noting that redundancy between nodes is based on a similar effect of repellance as that of negative query nodes, it is proposed, as one feasible option, to treat these effects technically in the same way. The result is a relatively simple function that tries to find a balance between relevance with respect to positive query nodes, avoidance of negative query nodes, and mutual non- redundancy of nodes in the result, by treating them all in somewhat uniform manner.

As one logical approach, the closer a node is to a negative query node, the less relevant it is. Therefore, the relevance in the presence of a negative query node may be defined as the reverse of their mutual proximity: the relevance of node u V in the presence of a negative query node q€ V is l-s(u,tf).

In the graph of Figure 5b, assume that the user specifies Europe as a negative query node. Its effect is that the EU has a low relevance of 0.1 (when not consi- dering any positive query nodes), and Spain and Finland both have relevance 1- (0.8· 0.9) = 0.28. The most relevant concept now is architect.

For a set QN c V of several negative query nodes, we may define relevance relN(u, QN) in their presence again as a product of the individual relevances:

Given both positive and negative query nodes (sets QP and QN, respectively), the overall relevance of node u is defined as the product

For example, consider again the use of Europe as a negative query node in Figure 5b, in conjunction with Barcelona and Helsinki as the positive query nodes. Now the most relevant nodes is Jari Litmanen. It is relevant to both Barcelona and Helsinki while irrelevant to Europe. As the example above illustrates, the most relevant nodes can be close neighbors. While we would like to retrieve a list of relevant nodes, we would also like them to be non-redundant or complementary. The result then would obviously be more informative to the user.

We may define non-redundancy identically with relevance in the presence of negative query nodes: the non-redundancy between nodes u, v€ V is 1 - s(u, v). For convenience, the reverse s(u, v) may be called as their redundancy. It may seem curious that redundancy and relevance have identical definitions. This follows, some extent, from having just a single measure of edge weight. The benefit is that these different aspects are handled uniformly. And, as it will be shortly illustrated with small examples, using these measures allows quite powerful queries and selection of a diverse set of results.

Given a set c V of nodes, we may define their non-redundancy as the product all pairwise non-redundancies:

A query may be converted into a task of finding a diverse set of relevant nodes. Using the definitions above, the problem may be determined formally. In addition to the positive (QP) and negative (QN) query nodes, assume the user also specifies the number k of nodes in the output, or the number is specified otherwise, e.g. on the basis of predefined settings. An alternative formulation that may be preferable in practice, is to rank the given nodes in V instead. Informally, the goal would be that any top k nodes would constitute a good solution to the original problem, whatever the value of k is. The user could then explore the top results and set the cut-off after seeing the results. At the same time it is obvious that with a single ranking the solutions cannot be optimal for all values of k. For practical convenience, we let the user also give a set V £ V of nodes among which to select the output. Let R denote the output. The problem of retrieving a relevant and non-redundant set of nodes now is to identify a set R £ V that maximizes rel( ' i, Qp, Q N ) ~ Π r (u, Q P , Q N )nonred(R)

The first form highlights how relevance and non-redundancy are defined independently. The second form shows, in turn, how relevances and redundancies are handled in a uniform way. For example, considering Barcelona and Helsinki as positive, and Europe as negative query nodes. A set R of k nodes that maximizes the relevance and nonredundancy measure (equation 1 1) would be { Jari Litma- nen, world's fair,architect }, where the nodes are relevant to both, Barcelona and Helsinki, irrelevant to Europe, and non-redundant to each other. Regarding the potential algorithms for finding relevant and non-redundant nodes, two examples are provided hereinafter.

Algorithm 1, see Table 7, produces a ranked list of nodes in an incremental, greedy fashion. In each iteration, it finds the currently most representative node (Line 3) and outputs it. In case of a tie, the node minimizing Equation 4 is prioritized (Line 7). One clue of the algorithm is in treating nodes selected and output during previous iterations as negative query nodes. This leads to the desired property that the ith node output by Algorithm 1 is non-redundant with respect to first i - 1 nodes already output. Algorithm 1 actually makes in each iteration an optimal choice with respect to Equation 1 1, given the previously selected nodes. As a preprocessing step, it first computes all proximities s(-) in a single batch (Line 1). ' Table 7 Al orith I for amks reJe ¾ Q5 ¾d $mJa sets

Algorithm 1 Flanking nodes to produce relevant and non-redundant sets input: G = (V, J5>), a weighted graph,

Qp C V, Qp≠0, & set of positive query nodes,

QN C V, a se of negative query nodes, and

V' C V, the set of nodes to be ranked

Output: nodes of V' in a ranted order of relevance .r.t. Qp, QN an mutual non- i'edandfiiice

1; compute u. v) tor ali u€ V and ali v€ Qp U Qjv U V"

2: while V≠ ø do

3: /? ·*— · ¾ « i «€ V and product Jj c | FJ (1— is maximal}

4: if = 1 then

5: r *~ ii such that {¾} = R

(r, else

T: r «- arg mii -¾( : :J y Y *(«, <jf) 2

8: end if

9: output r

10: ..' v <-- U fr>

11 : V V \ { r ;

12; end while

Even though Algorithm 1 is locally optimal with respect to Equation 1 1 in each step, the set of top k nodes is not necessarily optimal for any k except k = 1. Algorithm 2, in turn, produces a non-redundant set of k relevant nodes, where k is given as a parameter, see Table 8. The algorithm also takes k nodes as input, used as an initial solution that is then iteratively improved in the algorithm. In each iteration, the algorithm takes one of the k nodes and replaces it by the op- timal one, given the k-1 other current nodes. When no improvements can be achieved, the algorithm stops.

T&li e 8 Alg rithms 2 for of a.folrsg reiev&ftt/ftoi r&iEeitci&itt set

Algorithm 2 Finding a relevant- and non-redundant set of size k

input: G = ( V, .£ ), & weighted graph.

Qp C V, Qp≠ ø, a set of positive query nodes.

QN V, set of negative query nodes

V C V, a set of admissible nodes,

:, the number of nodes to be returned, and

JR. C V", a se t of k nodes

Output: a non-redundant- set of k nodes of V" relevant w.r.t. QF . QK

I compute for ail u€ V and all v 6 Q U QN U V"

2: {keep ft. ordered and treat it as a queue}

3: repeat

-4: remove the foremost node r of R

5: r' arg max^ y-' fj s(u, q) Ff (1 - | ~ f (1 -

?SQ ·;: . '.,-·;·.,· :· ·. .V

6: {if there is a tie involving r, choose r' = r\

7: append r f to R

8: until J? converges (each node in i gets replaced by itseif)

9: return R

Thus in the above the definitions of relevance, irrelevance and non-redundancy were based on node proximity, using a probabilistic interpretation of edge weights. Two, merely exemplary, algorithms were established for solving the problem: one that ranks a given set of nodes for their relevance and mutual non- redundancy, another one for finding an optimal set of nodes when the size of the set is fixed.

In some embodiments, finding relevant and non-redundant target nodes relative to query nodes may form the main objective of the query, whereas the extracted subgraph then mainly serves for visualizing the target nodes' relationships with the query nodes.

Figure 6 discloses a flow diagram of an embodiment of a method in accordance with the present invention. At start-up 602, the applied arrangement and associated device(s) thereof and optionally connected thereto, such as a single, self- contained device, a terminal device and a server device, or a server entity com- prising a plurality of at least functionally interconnected devices, may be obtained and configured, for example, via installation and execution of related software. At 604, data is obtained from a number of, typically a plurality of, data sources such as databases for constructing an index. At 606 the index is created by preferably establishing a weighted (e.g. probabilistic), optionally directed, graph structure or a corresponding data structure preferably substantially also re- presentable visually in the form of a graph. Actions behind items 604 and 606 may be later at least partially re-executed upon need, e.g. in connection with updating the data in the data source(s) and/or adding/modifying/removing data relative to the index. Data retrieval from each data source may be executed in a timed manner, for example. Optionally also the user and/or operator of the index may update data therein. Supplementary data that is not common to all uses, such as personal annotations relative to graph information like nodes, edges, or complete (sub-)graphs may be stored with reference to the index so that the arrangement may manage and visualize the user data flexibly together with the generally available data, if needed.

Dotted horizontal line is used to highlight the logical division between preparatory actions and the actual subgraph determination in the figure. At 608, a query is received from the user and converted into a format applicable for traversing the index using a number of suitable methods as reviewed hereinbefore. The queried concepts (query terms) provided by the user may be directly suitable for association with the query nodes, or optional logics may be first used according to the guidelines set forth in this text, for example. E.g. misspelling, synonym, and general availability checks may be executed.

At 610, at least one subgraph fulfilling the search criteria is determined. The emphasis may be on finding the relationships between query nodes or on finding the relevant and non-redundant target nodes relative to the query node(s), for exam- pie. Nevertheless, the relevant subgraph(s) is advantageously provided as output.

At 612, the output is visualized to the user preferably in conjunction with the provision of interactive controls enabling the user to change the visualization details (zoom, angle of view, rotation, panning, shown node/edge details, etc.), re- vise query details, add/remove/change visualized information, and/or export result data, for example. Advantageously the arrangement controls the visualization by providing related visualization data, e.g. (graphical) image data or related instructions, to a local display device or an external client device comprising or being connected with a display, for instance. Naturally one or more visualization aspects may be client terminal/display device -specific, whereupon the arrangement and/or the display device itself may adapt the display data according to these aspects such as display properties, e.g. maximum or preferred resolution. The broken loop-back arrow depicts the potentially repetitive nature of the various method items. Multiple queries may be sequentially or even simultaneously served depending on the computational power and memory capability of the executing device, and the index database may be flexibly updated at intervals or not until actual need, for instance.

Consequently, a skilled person may on the basis of this disclosure and general knowledge apply the provided teachings in order to implement the scope of the present invention as defined by the appended claims in each particular use case with necessary modifications and additions. Various target domains such as biological, medical, biomedical, computer (network), and even social networking domains may be analyzed with the suggested solution.

Furthermore, the arrangement may be configured to provide at least one addi- tional feature selected from the group consisting of: user or user group-specific log-in functionality, support for importing user data to be included in the subgraph extraction and/or related analysis, store functionality for personal (search) history and/or results, annotation and/or editing functionality of history data, automatic notices such as e-mail notices relative to predetermined features like availability of new information in view of a query stored by a user (e.g. the query result could/would change), publication of results to the public or selected parties, optionally with editing and/or execution rights, using a selected domain such as predetermined server like the index server, team work support (e.g. result, annotation and/or settings adoption between team members), and social networking support (e.g. discussion forum, messaging, and/or contact board).

In some embodiments, the arrangement 104 may be configured to estimate, in addition to or instead of relationship strength (e.g. path strength) between two or more (query) nodes, also the unexpectedness of such a relationship as mentioned hereinbefore. The resulting subgraph might be adapted so as to include and/or highlight nodes the connection of which is strong, but inevident in the light of the used criteria such as link types. For instance, it is not generally surprising that a gene may code a protein, but some other link type could imply more non-obvious relationship instead. Text mining of available data sources, e.g. literature on the domain, may be further used to determine obviousness of connections. Accordingly, finding truly surprising information and associations could be facilitated. Yet, in some embodiments the obtained subgraph may be generally determined or modified so as to exclude uninteresting and/or obvious border nodes/edges according to a number of predetermined criteria. An interesting path may include an obvious intermediate sub-path, which should advantageously still remain in the subgraph for clarity/visualization purposes due to its role as a connecting entity between the interesting entities, but e.g. an obvious sub-path at the end of an interesting path could be omitted from the subgraph and/or visualization thereof.