A SYSTEM AND METHODS FOR DIFFERENTIATING ENTITIES USING COMBINATORIAL FEATURE EXTRACTION

Document Type and Number:

WIPO Patent Application WO/2015/134518

Abstract:

A method and system to extract differentiating features across a plurality of groups comprising a method to generate comparable semantic features, and a system and method to process a plurality of experiments to extract differentiating features of interest.

Inventors:

FARLEY TONI R (US)
MOUSSES SPYRO (US)
YOO CHRISTOPHER (US)

Application Number:

PCT/US2015/018516

Publication Date:

September 11, 2015

Filing Date:

March 03, 2015

Assignee:

SYSTEMS IMAGINATION INC (US)

International Classes:

G06F17/00

Foreign References:

US20090138415A1	2009-05-28
US20030208473A1	2003-11-06
US20130013221A1	2013-01-10

Claims:

What is claimed:

1. A method to generate comparable features for a plurality of entities comprising:

a. providing a plurality of entities comprising descriptors;

b. providing a common format for the descriptors comprising a name, a value, and a quantity of measure, if the value is non-numeric, the quantity of measure is optional;

c. translating disparate data inputs to descriptors in the common format;

d. providing a common format for equivalence rules;

e. providing equivalence rules comprising a rule to generate comparable features from disparate types of data and a rule to generate comparable features from disparate measures with the same semantic meaning;

f. choosing an equivalence rule to apply to each entity based on the descriptors;

g. applying the chosen rule;

h. generating features for each descriptor;

i. associating the features generated for each descriptor with entities;

j. repeating steps d, e, f and g as required to generate a feature set associated with each entity.

2. A method according to claim 1 wherein said values further comprise:

a. disparate types of data associated with the same name;

b. disparate quantities of measure with the same semantic meaning; and

wherein the generated features for each descriptor are semantically comparable features in a common format.

3. A method according to claim 1 wherein the equivalence rules are dynamically generated comprising:

a) providing a plurality of descriptors;

b) determining a rule type from the descriptors; and

c) dynamically generating a new rule.

4. A system to iteratively execute one or more experiments to extract differentiating features among a plurality of entities comprising:

a) providing a plurality of entities each defined by a set of features;

b) providing a structured means to define an experiment design construct;

c) providing one or more experiment designs;

d) applying an experiment construct to each of the plurality of entities;

e) using set theoretic techniques to analyze each experiment construct;

f) analyzing each feature set with respect to the one or more experiment designs to extract differentiating features;

g) generating a result based on each experiment construct, wherein the result for each experiment construct is a set of features that differentiate the plurality of entities;

h) a system to iteratively apply constructs of the one or more experiment design to extract differentiating features wherein the features may be derived using any method that results in the representation of an entity by a set of features.

5. A system according to claim 4 further comprising a singular experiment defined by:

a) specifying a feature to categorize the plurality of entities into a plurality of groups;

b) specifying a plurality of features for each group; and

c) specifying a condition for each feature as one of 0, 1, x, where 0 denotes the absence of a feature, 1 denotes the presence of a feature, and x denotes a feature may be absent or present.

6. The system of claim 4 comprising a plurality of experiments comprising a plurality of condition sets in the one or more experiment designs.

Description:

A SYSTEM AND METHODS FOR DIFFERENTIATING ENTITIES USING COMBINATORIAL FEATURE EXTRACTION

FIELD OF THE INVENTION

[0001] The present disclosure relates to data analysis, and in particular a system and methods to extract differentiating features across a plurality of entity groups.

BACKGROUND OF THE INVENTION

[0002] An entity can be described as a set of descriptors, where a descriptor consists of a measured characteristic (attribute) and its associated value. Advancing technology is rapidly increasing data generation and storage capabilities, resulting in increasingly large numbers of disparate types of descriptors that can be associated with a single entity. An exhaustive comparison across a set of entities requires looking at all possible ways the entities can be grouped based on descriptors, and all possible combinations of descriptors. This analysis becomes computationally intensive as the number of descriptors increases.

[0003] Values for descriptors with the same semantic meaning can be measured in different ways, and stored using disparate data types, resulting in descriptors that are incompatible for comparison. Disparate types and measures may be defined as:

[0004] disparate data types: numeric (integer, floating point), string (alphanumeric), etc.

[0005] disparate measures: quantitative vs qualitative, count vs percentage, age in days vs age in years, etc.

[0006] There exists an exponential number of combinatorial ways to compare a plurality of entities across a plurality of features. Existing methods for comparing entities based on large sets of descriptors reduce the scope and complexity of the analysis by limiting the input data to a pre-defined subset of compatible descriptors applicable to testing a single hypothesis. For example, statistical analysis methods operate on pre-determined subsets of descriptors with numeric values. Using existing methods, an exhaustive comparison across a set of entities requires numerous iterations of the same analysis across different subsets of descriptors, and data pre-processing steps to transform descriptor values to compatible types.

SUMMARY OF THE INVENTION

[0007] In order to overcome the challenges associated with comparisons across large data sets, the present disclosure provides a system and methods to extract differentiating features across a plurality of entities defined by large sets of descriptors. Inputs to the system include data descriptors, semantic features, or any other aspects that can be considered a set of data that describes the characteristics of an entity. Inputs are processed based on experimental designs that can be user driven, or fully automated. The system outputs results based on comparing and contrasting groups of entities across a spectrum of descriptors and analysis specifications. The outputs may be delivered in a variety of formats to enable interoperability with other systems, human interpretation, and persisting new information to a data store.

[0008] The system may include a database to store input, intermediate, and output values at all stages of processing. An interface may be provided for users to interact with the system, including the ability to upload data and experimental designs, and to view and interact with results. The processing method includes a feature generator that encompasses methods to translate disparate types of descriptors to abstract features that can be compared and computed across. An analysis system automatically runs a series of experiments across the features to generate results. This system provides a straightforward approach to iteratively executing a series of combinatorial analyses to differentiate groups of entities.

[0009] Example use cases are provided in the fields of molecular medicine, retail, transportation, and public health. The example use cases illustrate the necessity of systems that can deal with large volumes of data, and the utility of the present disclosure in extracting differentiating features across a plurality of entities.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

[0011] FIG. 1 shows a logical block diagram of a Feature Analysis System;

[0012] FIG. 2 shows a logical block diagram of an Interface for the Feature Analysis System of FIG. 1;

[0013] FIG. 3 shows a logical block diagram of a Feature Generator for the Feature Analysis System of FIG. 1;

[0014] FIG. 4 shows an example method to apply rules using the Feature Generator of FIG. 3;

[0015] FIG. 5 shows an example method to dynamically generate rules using the Feature Generator of FIG. 3;

[0016] FIG. 6 shows a logical block diagram of an Analyzer for the Feature Analysis System of FIG. 1;

[0017] FIG. 7 shows an example system comprising feature generation and combinatorial analysis components;

[0018] FIG. 8 shows an example described in Example Use 1 for populating the Experiment Design Table of FIG. 2; and

[0019] FIG. 9 shows an example described in Example Use 2.

DETAILED DESCRIPTION OF THE INVENTION

[0020] The arrangement in FIG. 1 shows an exemplary arrangement of a preferred embodiment. FIG. 1 shows a logical block diagram of a Feature Analysis System 100. The Feature Analysis System 100 may include a User 101 that interacts with the Analysis System 102.

[0021] The Analysis System 102 may include a Database 105, an Interface 110, a Feature Generator 130, an Analyzer 150, and an Output Generator 170. The Database 105 may store data for input to the Feature Generator 130. Additionally, in some embodiments, the Database 105 may store the results of User 101 interactions with the Interface 110, output from the Feature Generator 130, output from the Analyzer 150, and/or output from the Output Generator 170.

[0022] The arrangement in FIG. 2 shows an exemplary arrangement of a preferred embodiment. FIG. 2 shows a logical block diagram of an Interface 110 for the Feature Analysis System of FIG. 1. The Interface 110 may enable the User 101 to interact with the Analysis System 102. The Interface 110 may include an Experiment Design Table 111 and a Process Button 112. The Experiment Design Table 111 may permit the User 101 to specify groups of entities, features of interest, and experiments to run by pressing the Process Button 112. The Experiment Design Table 111 may include an area to specify a plurality of groups denoted Group x in 111, where x = [1 . . . n]. The Experiment Design Table 111 may also include an area to specify a plurality of features on each group, denoted Fx in 111, where x = [1 . . . n].

[0023] The Experiment Design Table 111 may also include an area to specify a plurality of experiment constructs, denoted Experiment x in 111, where x = [1 . . . n]. The empty boxes in the Experiment Design Table 111 may be populated with the values 1, 0, or x, where 1 may specify that a feature must exist, 0 may specify that a feature must not exist, and x may specify that it does not matter whether a feature exists or not.
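The 1, 0, x conditions described above can be sketched in Python. This is a minimal illustration only: the dictionary layout, the feature names F1 to F3, and the helper `satisfies` are assumptions made for the sketch and are not taken from the disclosure.

```python
from typing import Dict, Set

# Hypothetical encoding: for one experiment, each group maps feature names
# to a condition: "1" (must exist), "0" (must not exist), "x" (don't care).
Experiment = Dict[str, Dict[str, str]]

experiment_1: Experiment = {
    "Group 1": {"F1": "1", "F2": "0", "F3": "x"},
    "Group 2": {"F1": "0", "F2": "1", "F3": "x"},
}


def satisfies(present: Set[str], conditions: Dict[str, str]) -> bool:
    """Return True if a set of present features meets every condition."""
    for feature, cond in conditions.items():
        if cond == "1" and feature not in present:
            return False
        if cond == "0" and feature in present:
            return False
        # "x" places no constraint on the feature.
    return True
```

For example, `satisfies({"F1", "F3"}, experiment_1["Group 1"])` holds because F1 is present, F2 is absent, and F3 is unconstrained.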

[0024] The arrangement in FIG. 3 shows an exemplary arrangement of a preferred embodiment. FIG. 3 shows a logical block diagram of a Feature Generator 130 for the Feature Analysis System of FIG. 1. The Feature Generator 130 may include a set of Descriptors 135 and a Rule Set 140, that are combined by a Processor 142 to generate a set of Features 145.

[0025] The Descriptors 135 may be represented by a set of tuples, where each tuple is one of:

(name, numeric_value, measure) (1)

(name, string_value[, measure]) (2)

[0026] The elements of the tuples in (1) and (2) may be defined as:

[0027] name: a string that identifies the descriptor (e.g. age, color, price, a particular gene name, etc.)

[0028] numeric_value: a numeric value

[0029] string_value: a string of alphanumeric characters

[0030] measure: a particular measure (quantity) on the value (e.g. years, Euros, PPM, kg/mol)

[0031] The name in (1) and (2) may be any attribute or measure that represents an aspect of an entity. The values in (1) and (2) may be any type of data, including numeric, alpha, alphanumeric, or a reference to a location that contains a media file (e.g. image, audio, video, etc.). In some embodiments, the location referenced by a value may refer to a location in an external database or web server. This model may map to a relational database model, as (field, value, "table"), when a row in a table relates to an entity, and different tables are used for different measures with the same semantic, on the same entities.
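One way to hold descriptors in the common format of tuples (1) and (2) is sketched below. The class name `Descriptor` and its validation are illustrative assumptions; the disclosure only specifies the tuple shapes.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Descriptor:
    """One descriptor in the common format of tuples (1) and (2)."""
    name: str
    value: Union[int, float, str]
    measure: Optional[str] = None  # may be omitted only for string values

    def __post_init__(self) -> None:
        # Tuple (1): a numeric value must carry a measure.
        if not isinstance(self.value, str) and self.measure is None:
            raise ValueError("numeric descriptors require a measure")


age = Descriptor("age", 42, "years")   # tuple (1)
color = Descriptor("color", "red")     # tuple (2), measure omitted
```

Keeping the measure optional only for string values mirrors the bracketed `[, measure]` in tuple (2).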

[0032] Defining descriptors in this way is optimally concise and provides consistency across data sets. Being optimally concise provides the maximum flexibility to support disparate data, and the minimum space to support scalability and portability. Having this consistent format across all data sets reduces the necessary complexity of a rule set that operates on the descriptors.

[0033] The Descriptors 135 may be transferred from the Database 105, or uploaded and transferred using the Interface 110. In some embodiments, the Descriptors 135 may be transferred from an external database or web server.

[0034] The Rule Set 140 may include equivalence rules for translating Descriptors 135 to Features 145. Rules may be defined based on the domain of the input data (e.g. equivalence rules for molecular data). Rules may form associated equivalencies for disparate types and measures. A rule may exist for all string_values, and all measures (quantities), wherein the first requirement handles disparate data types (e.g. associating string values to equivalent numeric values - equivalent values), and the second requirement handles disparate measures with the same semantic meaning (e.g. associating numeric values with the same semantic meaning that were measured in a different way - equivalent quantities).

[0035] As an example of a rule, descriptors that have discrete values may translate directly to features. As a further example of a rule, descriptors that have discrete values may translate into features that represent finite ranges or sets of discrete values. As a further example of a rule, descriptors that have continuous values may have associated rules that apply statistical techniques to categorize the values into discrete ranges or sets, where each set is a unique feature on the descriptor.
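A minimal sketch of the two kinds of equivalence, equivalent values and equivalent quantities, is given below. The rule tables, the color synonyms, the feature name `age_years`, and the function `to_feature` are all hypothetical and chosen only to illustrate the idea.

```python
from typing import Optional, Tuple

# Hypothetical rule tables; the entries are illustrative only.
EQUIVALENT_VALUES = {
    ("color", "crimson"): ("color", "red"),
    ("color", "scarlet"): ("color", "red"),
}
# Factors converting a measure of age into years (approximate calendar).
AGE_IN_YEARS = {"years": 1.0, "days": 1.0 / 365.25}


def to_feature(name: str, value, measure: Optional[str] = None) -> Tuple[str, object]:
    """Translate one descriptor into a (name, value) feature."""
    if isinstance(value, str):
        # Equivalent values: map synonymous strings to one feature;
        # discrete values without a rule translate directly.
        return EQUIVALENT_VALUES.get((name, value), (name, value))
    if name == "age" and measure in AGE_IN_YEARS:
        # Equivalent quantities: same semantic, different measure.
        return ("age_years", int(value * AGE_IN_YEARS[measure]))
    return (name, value)
```

Under these assumed rules, an age of 3653 days and an age of 10 years both translate to the same feature, which makes the two descriptors comparable.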

[0036] Examples of some rules are shown in Table 1.

Table 1 Example equivalence rules

[0037] The first two rules in Table 1 equate a numeric value to a semantically equivalent string value for a given quantity (count). The next two rules equate a string value to another semantically equivalent string value. The last two rules equate disparate quantities to a semantically equivalent feature (age). In the last case, more concise rules may exist to create age ranges.

[0038] The rules in Table 1 include statistically derived values that may be computed as:

[0039] mean: a statistical mean

[0040] std: one standard deviation
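A rule built on the mean and std might look like the sketch below. The thresholds (mean minus or plus one standard deviation) and the labels low, normal, and high are assumptions made for illustration, not a restatement of the rules in Table 1.

```python
from statistics import mean, stdev
from typing import Dict


def categorize(name: str, values: Dict[str, float]) -> Dict[str, str]:
    """Assign each entity a discrete feature from a continuous descriptor.

    Assumed thresholds: below mean - std is "low", above mean + std is
    "high", otherwise "normal". Needs at least two values.
    """
    m = mean(values.values())
    s = stdev(values.values())  # sample standard deviation
    features = {}
    for entity, v in values.items():
        if v < m - s:
            features[entity] = f"{name}:low"
        elif v > m + s:
            features[entity] = f"{name}:high"
        else:
            features[entity] = f"{name}:normal"
    return features


counts = {"e1": 1.0, "e2": 10.0, "e3": 11.0, "e4": 12.0, "e5": 30.0}
labels = categorize("count", counts)
```

Here the mean is 12.8 and the sample standard deviation is roughly 10.6, so only e1 falls below the lower threshold and only e5 exceeds the upper one.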

[0041] The Rule Set 140 may be transferred from the Database 105 or the Interface 110. In some embodiments, the Rule Set 140 may be transferred from an external database or web server. The Rule Set 140 may be represented using any formal logic system, or decision system, or related notational forms or rule systems. For example, in some embodiments, a rule may be defined by a decision tree.

[0042] The simplicity of a rule set defined in this way is permitted by the preferred method of representing descriptors using the tuples (1) and (2). The tuples and the rule set combine to form a generalized format for providing input to a Processor 142 for feature generation. The generalization provides maximum flexibility as it captures descriptors and rules in a straightforward and concise fashion. Any method that does not rely on generalizing the inputs would require significantly more complex rule sets to capture the same semantics of feature generation, resulting in a system that is not flexible and does not scale. The present approach overcomes the complexities associated with harmonizing across disparate data sets, thereby alleviating the need to perform data pre-processing steps required by other approaches.

[0043] The Processor 142 may include any method that applies the Rule Set 140 to the Descriptors 135 to generate a set of Features 145. In some embodiments, the Processor 142 may consist of any combination of computing hardware and/or software. The Features 145 may be transferred to the Analyzer 150 directly, or in some embodiments, stored in the Database 105.

[0044] An exemplary processor includes a method that finds rules in priority order: [name, string_value, rule, measure], and generates features that may be identified by a (name[, value]) pair, where the value is optional. The method may be defined as shown in FIG. 4.
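Since FIG. 4 is not reproduced here, the following is only one possible reading of a priority-ordered rule lookup: try a rule keyed by name first, then by string value, then by measure. The key layout, `find_rule`, and the sample rules are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Feature = Tuple[str, Optional[str]]
Rule = Callable[[str, object, Optional[str]], Feature]


def find_rule(rules: Dict[Tuple[str, str], Rule], name: str, value: object,
              measure: Optional[str]) -> Optional[Rule]:
    """Return the first matching rule, trying keys in priority order."""
    candidates = [("name", name)]
    if isinstance(value, str):
        candidates.append(("string_value", value))
    if measure is not None:
        candidates.append(("measure", measure))
    for key in candidates:
        if key in rules:
            return rules[key]
    return None  # no rule present for this descriptor


# Assumed sample rules; each returns a (name[, value]) feature pair.
rules: Dict[Tuple[str, str], Rule] = {
    ("string_value", "amplified"): lambda n, v, m: (n, "amplified"),
    ("measure", "years"): lambda n, v, m: ("age", None),
}
rule = find_rule(rules, "age", 42, "years")
feature = rule("age", 42, "years") if rule else None
```

Returning `None` when no key matches leaves room for a caller to fall back to some other handling of descriptors that have no rule.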

[0045] In some embodiments, the processor may include methods to dynamically generate new rules when a rule is not present. An example method for dynamic rule generation is shown in FIG. 5, wherein the statistical and discrete values approaches can be any predefined methods for generating rules in different ways.
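Dynamic rule generation can be sketched as choosing a rule type from the observed descriptor values. FIG. 5 is not reproduced here, so the choice made below (numeric values get a statistical rule, everything else a discrete values rule) and the function `generate_rule` are assumptions.

```python
from statistics import mean, stdev
from typing import Callable, List, Union

Value = Union[int, float, str]


def generate_rule(values: List[Value]) -> Callable[[Value], str]:
    """Build a new rule from observed descriptor values."""
    if all(isinstance(v, (int, float)) for v in values):
        # Statistical approach (assumed): bin around mean +/- one std.
        # stdev needs at least two observed values.
        m, s = mean(values), stdev(values)

        def statistical_rule(v: Value) -> str:
            return "low" if v < m - s else "high" if v > m + s else "normal"

        return statistical_rule

    # Discrete values approach (assumed): each distinct value is a feature.
    known = {str(v) for v in values}

    def discrete_rule(v: Value) -> str:
        return str(v) if str(v) in known else "unknown"

    return discrete_rule
```

The returned closure can then be stored in a rule set so that later descriptors with the same name reuse it.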

[0046] The arrangement in FIG. 6 shows an exemplary arrangement of a preferred embodiment. FIG. 6 shows a logical block diagram of a system for combinatorial feature extraction, comprising an Analyzer 150 for the Feature Analysis System of FIG. 1. The Analyzer 150 may include a set of Features 145, a set of Experiment Designs 151, an Analysis Processor 155, and a set of Results 160. The Features 145 may be transferred from the Database 105 or the Interface 110. In some embodiments, the Features 145 may be transferred from an external database or web server. The Experiment Designs 151 may be transferred from the Interface 110. In some embodiments, the Experiment Designs 151 may be transferred from the Database 105 or an external database or web server.

[0047] The Analysis Processor 155 may comprise a method to apply the Experiment Designs 151 to the Features 145 to generate a set of Results 160, where the Features 145 may be generated using the previously defined method, or any other method that describes entities as a set of features.

[0048] In a preferred embodiment, the Analysis Processor 155 may apply techniques from set theory to evaluate the Features 145 to extract differentiating features (as each entity either has a feature, or does not) for each experiment defined in the Experiment Designs 151. Examples of set theory techniques include combinations of set intersection, set union, and set difference. Entity groups in the Experiment Design Table 111 may be derived based on a feature, for example, creating two groups wherein the first entity group has a feature, and the second does not.
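The set theory techniques named above can be combined as in the following sketch, which finds features held by every entity in one group and by no entity in another. The function name and the toy patient data are hypothetical.

```python
from typing import Dict, Set


def differentiating_features(group_1: Dict[str, Set[str]],
                             group_2: Dict[str, Set[str]]) -> Set[str]:
    """Features shared by every entity in group_1 and absent from group_2.

    Both groups are assumed to be non-empty.
    """
    # Set intersection: features common to all entities in group 1.
    in_all_1 = set.intersection(*group_1.values())
    # Set union: features present in any entity of group 2.
    in_any_2 = set.union(*group_2.values())
    # Set difference: keep only what group 2 never shows.
    return in_all_1 - in_any_2


sensitive = {
    "p1": {"G1:low", "G2:high", "G3:amplified"},
    "p2": {"G1:low", "G3:amplified"},
}
resistant = {"p3": {"G3:amplified", "G2:high"}}
result = differentiating_features(sensitive, resistant)
```

In this toy data, `G3:amplified` is shared by both groups and `G2:high` is missing from one sensitive patient, so only `G1:low` differentiates the groups.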

[0049] In some embodiments, the Analysis Processor 155 may consist of any combination of computing hardware and/or software. The Results 160 may be transferred to the Output Generator 170 directly, or in some embodiments, stored in the Database 105.

[0050] The Output Generator 170 may take the Results 160 and generate a human readable report in a common format, or a machine readable output in a portable document format. In some embodiments, the Output Generator 170 may transfer results to the Database 105, Interface 110, or an external database or web server. In some embodiments, the Interface 110 may include methods to generate graphical visualization from data transferred from the Output Generator 170.

[0051] In some embodiments, the User 101 is not required for the Feature Analysis System 100. For example, if a User 101 does not provide analysis specifications, the Analysis System 102 may process all possible combinations of analyses.
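Processing all possible combinations can be pictured as enumerating every assignment of 0, 1, or x to each cell of the Experiment Design Table. The sketch below uses hypothetical group and feature names, and the function `all_experiments` is an assumption.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple


def all_experiments(groups: List[str],
                    features: List[str]) -> Iterator[Dict[Tuple[str, str], str]]:
    """Yield every assignment of 0, 1, or x to each (group, feature) cell."""
    cells = [(g, f) for g in groups for f in features]
    for conditions in product("01x", repeat=len(cells)):
        yield dict(zip(cells, conditions))


designs = list(all_experiments(["Group 1", "Group 2"], ["A", "B"]))
```

With two groups and two features there are four cells and therefore 3^4 = 81 designs. The count grows exponentially with the number of cells, which is why the function is written as a generator rather than building the full list by default.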

[0052] In some embodiments, outputs from more than one Feature Analysis System 100 may be combined for further analyses.

[0053] Embodiments of the present invention may be used to differentiate patients based on features derived from biomedical data including: identifying features for drug targeting, matching patients to clinical trials, discovering treatments for personalized medicine, and identifying optimal mates for sexual reproductive purposes.

[0054] Embodiments of the present invention may be used in an interactive automated service, as in the example shown in FIG. 7, following a process such as:

1. User signs up for account

2. User requests n new sample ids

3. User creates experiment groups based on observed phenotypes using table on web interface, for instance:

Inquiry 1: Group A, Group B

Inquiry 2: Group A, Group B, Group C

4. User adds sample ids to groups

5. User chooses molecular tests for each inquiry (DNA/RNA/Protein)

6. Service generates sample labels and mailing label

7. User mails samples to bio lab

8. Bio lab runs tests, stores results on service file server

9. Service does pre-processing (variant calls, etc.)

10. Service completes combinatoric analysis by groups for each inquiry

11. Service stores results on Web App server

12. Service makes results available to user

[0055] Embodiments of the present invention may be used in a network where data linked to networked resources or data content stored in networked databases may include sets of disparate descriptors, which could be used to extract interoperable features that then are used in mobile network packets to achieve situational and content awareness across the network. The extracted features from distributed data in various types of resources across a specific network, or across the entire Internet, could therefore be applied to create semantically meaningful and interoperable mobile packets of situational knowledge. These extracted and mobile features could in turn be used to more intelligently search across a network or the Internet to differentiate and recover entities that may otherwise not be easily distinguished or found by searching at the data level across the network. Since the features extracted from each network resource or networked database will enable disparate types of content and situational data to be described by unifying features and enable comparison, combinatorial context models could be modeled/designed, so multi-feature contexts that comprise a plurality of specific types of features could be generated and queried for sophisticated searching across distributed databases and resources. The extracted features could be indexed to support such intelligent searching across the network, or applied as mobile packets in a software defined network overlay to enable more sophisticated content aware network routing and switching. These packets could serve as semantically meaningful network packets for intelligent network operating systems or other network systems where interoperable situational and content awareness is useful.

[0056] In some embodiments, intelligently federated networks of knowledge bases could be formed by creating a knowledge defined network operating system, comprising distributed stores of knowledge and a system for extracting and mobilizing packetized features that represent the abstracted layer of content within each knowledge base. These knowledge stores could be disparate types of semantic databases, semantic networks, or could include more advanced hierarchically structured knowledge graphs, nested hypergraph representations, or any other system that stores knowledge as a graph structure, or as a structure that can be translated to a graph structure. Multiple features could be extracted from such knowledge base content and be applied to federate content into a content aware ecosystem that can support intelligent search and recovery of knowledge, as well as interoperable learning across the federated network.

[0057] Embodiments of the present invention may be used to match the transportation needs of individuals to perform collective tasks. Aggregate information about an individual may come from calendar entries, to-do lists, and data collected by geographic and other applications on computing devices. This information may be used as the descriptors to generate features and create a profile of features of an individual. Individuals can then be compared to recover matches to other individuals having similar transportation needs and enable individuals to collectively complete tasks by differentiating groups of individuals based on those features to find carpooling opportunities. Other examples in transportation include planning public transportation routes and schedules; scheduling taxis; and city planning related to roads, sidewalks, and other thoroughfares.

[0058] Embodiments of the present invention may be used to detect large-scale environmental changes based on human interactions monitored by their electronic devices. Features may be derived from the behavior patterns of groups of individuals. Group differentiation may then detect changes in the environment that are subtle, or hard to quantify via existing measurement systems (e.g. chemical or biological based). Other examples in public health and welfare include detecting and predicting disease outbreaks; and detecting and predicting times and areas of civil unrest.

[0059] In addition to permitting the differentiation of groups of entities, feature generation according to the present invention can be used as a cornerstone to building knowledge defined networks. Features represent semantic concepts that can be shared and computed across at higher levels of abstraction. This enables a generalized harmonization on distributed networks of disparate data collections. Semantic features may be used to represent knowledge, and a network may then be defined by how knowledge is linked.

[0060] Embodiments of the present invention useful in the retail environment include: enabling accurate comparisons across unlimited, heterogeneous databases of products from retail companies; and extracting features to differentiate shoppers to drive logistics planning for sales and marketing, such as date/time of sale, products to discount, groups to market to.

[0061] Embodiments of the present invention are useful in computer and network security, wherein features represent events, and methods for combinatorial differentiation can contribute to signature recognition and anomaly detection systems.

Example Use 1

[0062] In an example use of the Feature Analysis System 100, biomedical data may include clinical and molecular descriptors. Clinical descriptors may include diagnostic measures, patient characteristics, and treatment outcomes. Molecular descriptors may include measures of gene expression, amplification, deletion, mutation, and cytogenetic factors. The combined set of descriptors on a patient may come from multiple resources, and/or exist in multiple formats. The molecular descriptors for a single patient may number in the tens of thousands. This data may also be spread across a distributed network of data sources, such as clinics, hospitals, and medical labs.

[0063] In the present example, a plurality of patients (entities) exist, wherein descriptors on the patients comprise measures of gene expression and amplification across tens of thousands of genes. The descriptors are defined by tuples, where a tuple is one of:

(gene_symbol, numeric_value, RPKM) (3)

(gene_symbol, numeric_value, log2FC) (4)

(gene_symbol, amplified) (5)

(gene_symbol, amplification) (6)

(gene_symbol, numeric_value, aCGH) (7)

[0064] where RPKM is reads per kilobase per million reads, which is a count of a specific transcript's RNA sequences generated by a sequencing instrument; log2FC is a measure of a specific gene transcript's relative fold change; amplification is a classification that can be assigned to specific genes, or regions in DNA, that have more than the normal number of gene copies; and aCGH stands for array Comparative Genomic Hybridization and is a measure of the specific number of copies of a gene.

[0065] The tuples (3) and (4) are semantically equivalent measures of gene expression, measured via different quantities. The tuples (5) and (6) are semantically equivalent measures of gene amplification, denoted by different string values. The tuple (7) is a semantically equivalent measure to the tuples (5) and (6) when the fifth rule in Table 2 is applied.

[0066] In the present example, molecular features are generated based on the rules shown in Table 2, where the first four rules handle disparate measures for semantically equivalent features, the fifth rule handles disparate data types for a semantically equivalent feature, and the sixth rule handles disparate string values for a semantically equivalent feature.

Table 2 Example equivalence rules

[0067] In Table 2, the statistical mean and std are computed across the plurality of measures for each entity and each gene, and gene expression is derived from disparate measures (RPKM, log2FC), which are otherwise incomparable. The last two rules associate semantically equivalent values and measures of gene amplification.

[0068] The User 101 in this example may interact with the Interface 110 to populate the Experiment Design Table 111 as shown in FIG. 8. The Example Use Experiment Design Table 115 shows two groups, denoted Group 1 and Group 2, three features, denoted A, B, and C, and three experiments denoted Experiment 1, Experiment 2, and Experiment 3. Group 1 may represent a set of patients that have sensitivity to a drug. Group 2 may represent a set of patients that do not have sensitivity to a drug. The experiments may be designed to determine which features relate to drug sensitivity.

[0069] In the Example Use Experiment Design Table 115, A may represent the feature that a gene has low expression. B may represent the feature that a gene has high expression, and C may represent the feature that a gene is amplified. In this case, Experiment 1 specifies a set of genes that have low expression in Group 1 and high expression in Group 2, Experiment 2 specifies a set of genes that have low expression in Group 1 and high expression and amplification in Group 2, and Experiment 3 specifies a set of genes that do not have high expression and amplification in Group 1 and do have high expression and amplification in Group 2.
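The three experiments described above can be encoded and evaluated as in the sketch below. FIG. 8 is not reproduced here, so the cell values are inferred from the description, with unmentioned cells assumed to be x (omitted). Experiment 3 is read as B = 0 and C = 0 in Group 1. The gene names and group-level feature sets are invented for illustration.

```python
from typing import Dict, List, Set

# Conditions per experiment and group, inferred from the description;
# cells the text does not mention are treated as "x" and left out.
EXPERIMENTS: Dict[str, Dict[str, Dict[str, str]]] = {
    "Experiment 1": {"Group 1": {"A": "1"}, "Group 2": {"B": "1"}},
    "Experiment 2": {"Group 1": {"A": "1"}, "Group 2": {"B": "1", "C": "1"}},
    "Experiment 3": {"Group 1": {"B": "0", "C": "0"},
                     "Group 2": {"B": "1", "C": "1"}},
}

# Hypothetical group-level features per gene (A low, B high, C amplified).
GENE_FEATURES: Dict[str, Dict[str, Set[str]]] = {
    "GENE1": {"Group 1": {"A"}, "Group 2": {"B"}},
    "GENE2": {"Group 1": {"A"}, "Group 2": {"B", "C"}},
    "GENE3": {"Group 1": set(), "Group 2": {"B", "C"}},
}


def matches(present: Set[str], conditions: Dict[str, str]) -> bool:
    """Check 1 (must exist) and 0 (must not exist); skip x."""
    return all((f in present) == (c == "1")
               for f, c in conditions.items() if c != "x")


def run(experiment: str) -> List[str]:
    """Return the genes that satisfy the experiment in every group."""
    design = EXPERIMENTS[experiment]
    return sorted(gene for gene, by_group in GENE_FEATURES.items()
                  if all(matches(by_group[g], cond)
                         for g, cond in design.items()))
```

With this toy data, `run("Experiment 2")` keeps only GENE2, the one gene with low expression in Group 1 and both high expression and amplification in Group 2.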

Example Use 2

[0070] In another example use of the Feature Analysis System 100, data for online stores may include sets of descriptors on purchasable items and customers. Customer data may include descriptors of demographics and shopping habits. The descriptors on items and customers may number in the tens of thousands. These descriptors may be distributed in databases across a network of store locations.

[0071] The User 101 in this example may interact with the Interface 110 to populate the Experiment Design Table 111 as shown in FIG. 9. The Example 2 Experiment Design Table 116 shows three groups, denoted Group A, Group B, and Group C, four features, denoted P, 1, 2, and 3, and two experiments denoted Experiment 1 and Experiment 2. Group A, Group B, and Group C may represent sets of customers in different geographic regions. The experiments may be designed to determine which items are exclusively purchased within each of the different regions.

[0072] In the Example 2 Experiment Design Table 116, P may represent the feature that an item was purchased, and 1, 2, and 3 may represent the feature that a customer rated the item with the respective score. In this case, Experiment 1 specifies a set of items that were purchased by Group A and not purchased by Group B or Group C. Experiment 2 specifies a set of items that were purchased and given a rating of 3 by Group A, and were purchased and given a rating of 1 by Group B, and were not purchased by Group C.
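Both experiments reduce to a few set operations once each group's items are collected. In the sketch below the item names and rating sets are invented, and an item counts as purchased by a group if any customer in that group bought it, which is an assumption the text does not settle.

```python
from typing import Dict, Set

# Hypothetical item sets per group: P = purchased, R1/R3 = rated 1 or 3.
P: Dict[str, Set[str]] = {
    "A": {"kettle", "scarf", "lamp"},
    "B": {"lamp", "mug"},
    "C": {"mug"},
}
R3: Dict[str, Set[str]] = {"A": {"lamp", "scarf"}, "B": set(), "C": set()}
R1: Dict[str, Set[str]] = {"A": set(), "B": {"lamp"}, "C": {"mug"}}

# Experiment 1: purchased by Group A, not by Group B or Group C.
experiment_1 = P["A"] - (P["B"] | P["C"])

# Experiment 2: purchased and rated 3 by A, purchased and rated 1 by B,
# and not purchased by C.
experiment_2 = ((P["A"] & R3["A"]) & (P["B"] & R1["B"])) - P["C"]
```

Here Experiment 1 yields the kettle and the scarf, and Experiment 2 yields only the lamp, which Group A rated 3 and Group B rated 1.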
