Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
PROACTIVE DATA MODELING
Document Type and Number:
WIPO Patent Application WO/2019/209571
Kind Code:
A1
Abstract:
In implementations of the present disclosure, a method, device and computer program product for proactive data modeling for a dataset are provided. For a given dataset, a first subset is selected proactively to generate a first model with at least a first variable as an independent variable, and a second subset is selected proactively to generate a second model with at least a second variable as an independent variable. Then, the first model and the second model are combined to generate a target model. In implementations of the present disclosure, a plurality of data subsets are selected proactively so as to generate a plurality of models for a plurality of independent variables, and a final target model is generated by combining the plurality of models. Therefore, implementations of the present disclosure can reduce the number of independent variables during the modeling process, thereby improving modeling efficiency for the dataset.

Inventors:
SHAO BIN (US)
XIA HUANHUAN (US)
LIU TIE-YAN (US)
Application Number:
PCT/US2019/027570
Publication Date:
October 31, 2019
Filing Date:
April 16, 2019
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
MICROSOFT TECHNOLOGY LICENSING LLC (US)
International Classes:
G06N3/12; G06N20/00
Foreign References:
US20090018982A12009-01-15
Other References:
AMIR HOSSEIN GANDOMI ET AL: "Multi-stage genetic programming: A new strategy to nonlinear system modeling", INFORMATION SCIENCES, AMSTERDAM, NL, vol. 181, no. 23, 5 July 2011 (2011-07-05), pages 5227 - 5239, XP028292526, ISSN: 0020-0255, [retrieved on 20110723], DOI: 10.1016/J.INS.2011.07.026
SLAPAK MARTIN ET AL: "Matching subtrees in genetic programming crossover operator", 2017 13TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), IEEE, 29 July 2017 (2017-07-29), pages 208 - 213, XP033363172, DOI: 10.1109/FSKD.2017.8393091
Attorney, Agent or Firm:
MINHAS, Sandip S. et al. (US)
Download PDF:
Claims:
CLAIMS

1. A computer-implemented method, comprising:

generating a first model with at least a first variable as an independent variable based on a first subset in a dataset, the first model indicating a first constraint satisfied by data in the first subset;

generating a second model with at least a second variable as an independent variable based on a second subset in the dataset, the second model indicating a second constraint satisfied by data in the second subset, and both the first variable and the second variable being variables in the dataset; and

generating, by combining at least the first model and the second model, a target model indicating a third constraint satisfied by data in the dataset, the target model being used for making a predi ction based on the dataset.

2. The method according to claim 1, wherein the generating the first model comprises:

generating a first submodel with at least the first variable as an independent variable based on a first group of data in the first subset, a value of the second variable in the first group of data being a first value;

generating a second submodel with at least the first variable as an independent variable based on a second group of data in the first subset, a value of the second variable in the second group of data being a second value different from the first value; and

generating the first model based on the first submodel and the second submodel.

3. The method according to claim 2, wherein in the first group of data, only a value of the first variable changes, and the generating the first submodel comprises:

generating the first submodel with only the first variable as an independent variable based on the first group of data.

4. The method according to claim 3, further comprising:

sampling the first group of data by fixing values of other variables other than the first variable in the dataset.

5. The method according to claim 1, wherein the generating the target model comprises:

generating a first tree representing the first model and a second tree representing the second model; and

generating the target model by matching the first tree and the second tree.

6. The method according to claim 1, wherein the generating the target model comprises:

generating a model template comprising an unknown parameter based on the first model and the second model;

resolving the unknown parameter in the model template using the dataset; and determining the target model based on the model template and the unknown parameter.

7. The method according to claim 1, wherein the generating the target model comprises:

generating a first candidate model and a second candidate model for the dataset by combining the first model and the second model, the first candidate model and the second candidate model both taking at least the first variable and the second variable as independent variables;

evaluating the first candidate model and the second candidate model using the dataset; and

determining the target model based on the evaluation for both the first candidate model and the second candidate model.

8. The method according to claim 1, further comprising:

generating a third model with at least a third variable as an independent variable based on a third subset in the dataset, the third model indicating a fourth constraint satisfied by data in the third subset,

and the generating the target model comprises: generating the target model by combining the first model, the second model and the third model.

9. An electronic device, comprising:

a processing unit; and

a memory coupled to the processing unit and storing instructions, the instructions, when executed by the processing unit, performing the following acts of:

generating a first model with at least a first variable as an independent variable based on a first subset in a dataset, the first model indicating a first constraint satisfied by data in the first subset;

generating a second model with at least a second variable as an independent variable based on a second subset in the dataset, the second model indicating a second constraint satisfied by data in the second subset, and both the first variable and the second variable being variables in the dataset; and

generating, by combining at least the first model and the second model, a target model indicating a third constraint satisfied by data in the dataset, the target model being used for making a prediction based on the dataset.

10. The device according to claim 9, wherein the generating the first model comprises:

generating a first submodel with at least the first variable as an independent variable based on a first group of data in the first subset, a value of the second variable in the first group of data being a first value;

generating a second submodel with at least the first variable as an independent variable based on a second group of data in the first subset, a value of the second variable in the second group of data being a second value different from the first value; and

generating the first model based on the first submodel and the second submodel.

11. The device according to claim 10, wherein in the first group of data, only a value of the first variable changes, and the generating the first submodel comprises:

generating the first submodel with only the first variable as an independent variable based on the first group of data.

12. The device according to claim 11, the acts further comprising:

sampling the first group of data by fixing values of other variables other than the first variable in the dataset.

13. The device according to claim 9, wherein the generating the target model comprises:

generating a first tree representing the first model and a second tree representing the second model; and

generating the target model by matching the first tree and the second tree.

14. The device according to claim 9, wherein the generating the target model comprises:

generating a model template comprising an unknown parameter based on the first model and the second model;

resolving the unknown parameter in the model template using the dataset; and determining the target model based on the model template and the unknown parameter.

15. A computer program product tangibly stored on a non-transient computer readable medium and including machine executable instructions, the machine executable instructions, when executed by a device, causing the device to:

generate a first model with at least a first variable as an independent variable based on a first subset in a dataset, the first model indicating a first constraint satisfied by data in the first subset;

generate a second model with at least a second variable as an independent variable based on a second subset in the dataset, the second model indicating a second constraint satisfied by data in the second subset, and both the first variable and the second variable being variables in the dataset; and

generate, by combining at least the first model and the second model, a target model indicating a third constraint satisfied by data in the dataset , the target model being used for making a prediction based on the dataset.

Description:
PROACTIVE DATA MODELING

BACKGROUND

[0001] Data modeling refers to generating a model based on the data. By analyzing the data objects in a given dataset, the relations or constraints between these data objects may be determined, and then a model that best fits the given dataset is generated. The method of data modeling includes regression analysis, statistical analysis, machine learning, deep learning, grey prediction, principal component analysis, neural networks, time sequence analysis and so on.

[0002] Regression analysis, as the most common modeling method, is used to find the relations between dependent variables and independent variables. Depending on the number of involved independent variables, the regression analysis may be divided into one- variable regression analysis and multi-variable regression analysis. Depending on the number of dependent variables, the regression analysis may be divided into simple regression analysis and multiple regression analysis. Depending on the type of relations between the independent variables and the dependent variables, the regression analysis may be divided into linear regression analysis and non-linear regression analysis. Symbolic regression is a type of regression analysis that finds a model (such as function) that best fits a given dataset by means of evolutionary search (such as genetic programming). The goal of symbolic regression is to find one or more patterns, constraints or laws in a dataset automatically.

SUMMARY

[0003] In implementations of the present disclosure, there are provided a method, device and computer program product for proactive symbolic modeling for a dataset. For a given dataset, a first subset is selected proactively to generate a first model with at least a first variable as an independent variable, and a second subset is selected proactively to generate a second model with at least a second variable as an independent variable. Then, the first model and the second model are combined to generate a target model indicating data constraint in the dataset for making a prediction based on the dataset. In implementations of the present disclosure, a plurality of data subsets are selected proactively so as to generate a plurality of models for a plurality of independent variables, and a final target model is generated by combining the plurality of models. Therefore, implementations of the present disclosure can reduce the number of independent variables during the modeling process, thereby improving modeling efficiency for the dataset. [0004] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] With reference to the drawings and the detailed description below, the above and other features, advantages and aspects of the implementations of the present disclosure will become more apparent. Throughout the drawings, the same or similar reference symbols refer to the same or similar elements, in which:

[0006] Fig. 1 illustrates a block diagram of a computing device/server in which one or more implementations of the present disclosure may be implemented;

[0007] Fig. 2 illustrates a flowchart of a method for proactive data modeling in accordance with implementations of the present disclosure;

[0008] Fig. 3A illustrates a flowchart of a method for generating a first model in accordance with implementations of the present disclosure;

[0009] Fig. 3B illustrates a flowchart of a method for generating a second model in accordance with implementations of the present disclosure;

[0010] Fig. 3C illustrates a flowchart of a method for generating a target model through tree matching in accordance with implementations of the present disclosure;

[0011] Fig. 4A illustrates a schematic diagram of uniformly accelerated rectilinear motion in accordance with implementations of the present disclosure;

[0012] Fig. 4B illustrates a schematic diagram of a dataset associated with uniformly accelerated rectilinear motion in accordance with implementations of the present disclosure;

[0013] Fig. 4C illustrates a schematic diagram of a data subset in the dataset shown in Fig. 4B;

[0014] Fig. 4D illustrates a schematic diagram for generating trees for representing various models in accordance with implementations of the present disclosure;

[0015] Fig. 4E illustrates a schematic diagram of a target tree generated by matching various trees in Fig. 4D; and

[0016] Fig. 5 illustrates a schematic diagram of comparison between experiment results of proactive modeling in accordance with the present disclosure and the deep learning method.

DETAILED DESCRIPTION

[0017] The implementations of the present disclosure will be described in details with reference to the drawings. Although the drawings demonstrate some implementations of the present disclosure, it is to be understood that the present disclosure can be implemented in various manners other than the ones described herein. To the contrary, these implementations are provided for a more thorough and comprehensive understanding of the present disclosure. It is to be understood that the drawings and implementations of the present disclosure are only for the purpose of illustration, rather than suggesting any limitations on the scope of the present disclosure

[0018] As used herein, the term“comprises” and its variants are to be read as open terms that mean“comprises, but is not limited to.” The term“based on” is to be read as“based at least in part on.” The term“one implementation” and“an implementation” are to be read as“at least one implementation.” The term“another implementation” is to be read as “at least one other implementation”; the term“some implementations” is to be read as“at least some implementations.” Other definitions, explicit and implicit, may be included below.

[0019] Conventionally, for a given dataset, a model is generated by means of evolutionary search so that it best fits the given dataset. Generally, for a single independent variable, a target model is easy to be obtained through one-variable regression analysis. However, there may exist multiple independent variables in the dataset while it is generally complex to resolve multi-variable linear regression, which involves resolving a plurality of parameters using, for instance, least square method, or it is even impossible to find an accurate target model that fits the given dataset. An improvement over the traditional approach is to use deep learning method for function fitting, which utilizes a complicated neural network structure for learning and training. However, it takes a large amount of time and computing resources to perform deep learning and the obtained model is probably not accurate enough. Hence, since the conventional modeling method only uses data in the dataset passively, its data modeling efficiency is very low.

[0020] To this end, implementations of the present disclosure provide a proactive data modeling method for a dataset. In implementations of the present disclosure, a plurality of data subsets are selected proactively so as to generate a plurality of models for a plurality of independent variables, and a final target model is generated by combining the plurality of models. Therefore, implementations of the present disclosure can reduce the number of independent variables during the modeling process, thereby effectively improving modeling efficiency for the dataset.

[0021] Basic principles and various example implementations of the present disclosure will now be described with reference to Figs. 1-5. Fig. 1 illustrates a block diagram of a computing device/server 100 in which one or more implementations of the present disclosure can be implemented. It is to be understood that the computing device/server 100 as described in Fig. 1 is merely illustrative, rather than to form any limitation to the function and scope of implementations of the present disclosure.

[0022] As shown in Fig. 1, the computing device/server 100 is in the form of a general computing device. Components of the computing device/server 100 include, but are not limited to, one or more processors or processing units 110, a memory 120, a storage device 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160. A processing unit 110 may be a physical or virtual processor and may execute various processes based on the programs stored in the memory 120. In a multi- processor system, multiple processing units perform computer-executable instructions in parallel to improve the parallel processing capacity of the computing device/server 100.

[0023] The computing device/server 100 typically includes a plurality of computer storage media, which can be any available media accessible by the computing device/server 100, including but not limited to volatile and non-volatile media, and removable and non removable media. The memory 120 can be a volatile memory (for example, a register, cache, Random Access Memory (RAM)), non-volatile memory (for example, a Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or any combination thereof. The storage device 130 may be any removable or non-removable media and may include machine-readable media, such as a memory, flash drive, disk, and any other media, which can be used for storing information and/or data (for example, dataset 180) and accessed in the computing device/server 100.

[0024] The computing device/server 100 may further include additional removable/non removable, volatile/non-volatile memory media. Although not shown in Fig. 1, a disk drive is provided for reading and writing a removable and non-volatile disk (such as“floppy disk”) and a disc drive is provided for reading and writing a removable non-volatile disc. In these cases, each drive is connected to the bus (not shown) via one or more data media interfaces. The memory 120 may include a modeling engine 125 having one or more sets of program modules configured to perform methods or functions of various implementations described herein.

[0025] The communication unit 140 communicates with a further computing device via communication media. Additionally, functions of components in the computing device/server 100 may be implemented by a single computing cluster or multiple computing machines connected communicatively for communication. Therefore, the computing device/server 100 can be operated in a networking environment using a logical link with one or more other servers, network personal computers (PCs) or another general network node.

[0026] The input device 150 may include one or more input devices, such as a mouse, keyboard, tracking ball and the like. The output device 160 may include one or more output devices, such as a display, loudspeaker, printer, and the like. As required, the computing device/server 100 can also communicate via the communication unit 140 with one or more external devices (not shown) such as a storage device, display device and the like, one or more devices that enable users to interact with the computing device/server 100, or any devices that enable the computing device/server 100 to communicate with one or more other computing devices (for example, a network card, modem, and the like). Such communication is performed via an input/output (I/O) interface (not shown).

[0027] As illustrated in Fig. 1, the storage device 130 stores a dataset 180 which includes a large amount of data involving multiple variables. In accordance with an implementation of the present disclosure, the modeling engine 125 proactively selects a plurality of subsets in the dataset 180 to generate a plurality of models and combines the plurality of models so as to create a model that fits the dataset 180. Example implementations that the modeling engine 125 generates a model based on the dataset 180 are described below in detail with reference to Figs. 2-5.

[0028] Fig. 2 illustrates a flowchart of a method 200 for proactive symbolic modeling in accordance with implementations of the present disclosure. It is to be understood that method 200 may be performed by the computing device/server 100 as illustrated in Fig. 1.

[0029] At 202, a first model with at least a first variable as an independent variable is generated based on a first subset in the dataset. The dataset represents a set of data and may include various types of data, such as traffic data, medical data, financial data, industrial data and so on. As an example, in the scenario where an object is in uniformly accelerated rectilinear motion, the dataset may include data related to the object itself and/or movement of the obj ect. The dataset may involve multiple variables, including a mass m of the obj ect, an external force F received by the object, an initial speed vo of the object and a motion time t of the object. An example of the dataset 180 will be further described below with reference to Fig. 4B.

[0030] In some embodiments, the first subset may include a part of data that meet a predefined rule in the dataset. For example, the first subset may include multiple groups of data, and the value(s) of only one variable (such as the first variable) or a part of variables in each group of data change(s) while the values of other variables are constant. Continuing with the scenario where the object is in uniformly accelerated rectilinear motion, each group of data in the first subset may be the group of data for which only the time t changes (other variables, such as the initial speed vo, in each group of data remain unchanged). An example of the dataset 410 will be depicted below with reference to Fig. 4B.

[0031] In accordance with implementations of the present disclosure, the first model indicates a first constraint satisfied by data in the first subset. For example, in the scenario where the object is in uniformly accelerated rectilinear motion, the first model generated based on the first subset of multiple groups of data for which only time t changes may be a constraint (such as a function relation) between the displacement d and the time t of the object. Therefore, the first model may indicate the function relation between the dependent variable d and the independent variable t when other variables (such as the initial speed vo) remain unchanged.

[0032] At 204, a second model with at least the second variable as an independent variable is generated based on the second subset in the data set. For example, the second subset may also include multiple groups of data, and the value(s) of only one variable (such as the second variable) or a part of variables in each group of data change(s) while the values of other variables are constant. The second model indicates a second constraint satisfied by data in the second subset.

[0033] For example, in the scenario where the object is in uniformly accelerated rectilinear motion, the second model generated based on the second subset of multiple groups of data for which only initial speed vo changes may be a constraint (such as a function relation) satisfied by displacement d and initial speed vo. Thus, the second model may indicate the function relation between the dependent variable d and the independent variable initial speed vo when other variables (such as time t) remain unchanged. It is to be understood by those skilled in the art that the blocks 202 and 204 may be executed in turn or in parallel.

[0034] At 206, a target model indicating a third constraint satisfied by data in the dataset is generated by combining at least the first model and the second model, and the target model is used for making a prediction based on the dataset. The first subset and the second subset are a part in the dataset, and the constraint in each subset satisfies the constraint of the whole dataset. In some implementations, by generating a corresponding model for each variable or each group of variables and by combining the generated plurality of models through pattern matching or tree alignment, a target model for the whole dataset can be generated, and the target model may be represented as a function, an equation and so on. The generated target model may be used for making the prediction, for instance, based on the target model generated with the sampled traffic datasets, the future traffic condition at a certain time may be predicted.

[0035] For example, in the scenario where the object is in uniformly accelerated rectilinear motion, the first model (such as first function relation) satisfied by the displacement d and the time t and the second model (such as second function relation) satisfied by the displacement d and the initial speed vo are combined so as to generate a target model (such as target function relation) satisfied by the displacement d , the time t and the initial speed vo. For example, an example implementation of combining a plurality of models 455 to generate a target model 495 will be described below with reference to Figs. 4D and 4E.

[0036] With the method 200 in accordance with implementations of the present disclosure, a plurality of datasets are selected proactively to generate a plurality of models respectively, and the plurality of models are combined to generate a target model for the dataset. Therefore, the method 200 according to implementations of the present disclosure may replace the traditional passive modeling fitting with the proactive data modeling, thereby looking for data proactively to generate a model.

[0037] Example implementations of method 200 in Fig. 2 will be described below with reference to Figs. 3A-3C.

[0038] Fig. 3A illustrates a flowchart of a method 300 for generating a first model in accordance with implementations of the present disclosure. The method 300 may be implemented by the computing device/server 100 as depicted in Fig. 1 and may be an example implementation of block 202 in Fig. 2. For the sake of convenience, some example implementations will be described below with reference to the dataset related to the uniformly accelerated rectilinear motion in Figs. 4A-4E and Fig. 5 as an example.

[0039] At 302, a first submodel with at least the first variable as an independent variable is generated based on the first group of data in the first subset of the dataset. At 304, a second submodel with at least the first variable as an independent variable is generated based on the second group of data in the first subset. The value of the second variable in the first group of data is the first value while the value of the second variable in the second group of data is the second value, and the first value and the second value are different. [0040] In some implementations, in the first group of data, only the value of the first variable changes. Therefore, the generated first submodel only takes the first variable as the independent variable, thereby ensuring that each submodel can be generated easily, quickly and accurately. It is to be understood that the first group of data may be not only a group of data selected from the dataset in which only the first variable changes but also may be sampled proactively by fixing values of other variables other than the first variable in the dataset, thereby ensuring that the group of data in which only the first variable changes can be obtained. Alternatively, in the first group of data, it is also possible that values of a plurality of variables (not all the variables in the dataset) change so that the generated first submodel only takes a plurality of variables (such as the first variable and the third variable) as independent variables. It is to be understood that when the first submodel involves a plurality of independent variables, as the number of variables in the first group of data is reduced, the efficiency of data modeling still can be improved.

[0041] For example, reference is made to a schematic diagram 400 of uniformly accelerated rectilinear motion as illustrated in Fig. 4A. It is to be understood that Figs. 4A- 4E are only some examples for facilitating understanding implementations of the present disclosure and should not be construed as limiting the scope of the present disclosure. As illustrated, on the road 402, the object 405 is driving in uniformly accelerated rectilinear motion. Fig. 4A illustrates the displacement position of the object 405 at different times. For example, the object 405 is displaced by 2 meters at the I st second, and the object 405 is displaced by 8 meters at the 2 nd second. During the uniformly accelerated rectilinear motion, the displacement of the object 405 is associated with a number of factors.

[0042] Fig. 4B illustrates sampling data of a plurality of factors of the uniformly accelerated rectilinear motion to form a dataset 180. As shown in Fig. 4B, the dataset 180 contains the values of initial speed vo of the object, the external force F received by the object, the mass m of the object, the motion time t and the displacement d under different conditions, where time t may be referred to as the first variable, the initial speed vo may be referred to as the second variable, the external force F may be referred to the third variable, and the mass m may be referred to as fourth variable. For example, the physical meaning of the first line of data in the dataset 180 as shown in Fig. 4B means: when the initial speed vois O. lm/s, the external force F is 0.5 newton, the mass m is 1.0 kg, the time / is 1.0 second, the corresponding displacement d of the object is 0.35 m.

[0043] Different from using the whole dataset 180 passively to perform regression analysis or deep learning, implementations of the present disclosure utilize proactive modeling method to select the data subsets (namely, a part of data) to generate the model of each data subset, and then combine all the generated models to generate a model that fits the dataset 180.

[0044] For example, Fig. 4C shows a first subset 410 of the dataset 180 in Fig. 4B, which includes a first group of data 411 and a second group of data 412. In the first group of data 411, only the first variable t among the plurality of independent variables changes while the values of other variables remain constant, where the value of the second variable vo i s 0.5. In the second group of data 411, only the first variable t among the plurality of independent variables changes, while the values of other variables remain constant, where the value of the second variable vois 1.0.

[0045] Since there only exist one independent variable t and one dependent variable d in the first group of data 411, while the values of other variables vo, F and m remain unchanged (where v 0 = 0.5, F = 1.5, m = 2.0). Through simple one-variable symbolic regression, it may be determined that the first submodel satisfied by the first group of data 411 is the following equation (1):

f t (t = 0.5 t + 0.375 t 2 ( 1 )

[0046] Generally, during the process of generating a submodel through one-variable symbolic regression, the model with the optimal fitting degree may be selected. If the fitting degrees of a plurality of candidate submodels are the same, the equation with the shortest length may be selected as submodel. Additionally, as there only exist one independent variable t and one dependent variable d in the second group of data 412, while the values of the other variables vo, F and m remain unchanged (where v 0 = 1.0, F = 1.5, m = 2.0). Through one-variable symbolic regression, it may be determined that the second submodel satisfied by the second group of data 412 is the following equation (2):

/ t (t) = l.Ot + 0.375 t 2 ( 2)

[0047] Returning to Fig. 3A, at 306, the first model is generated based on the first submodel and the second submodel. For example, in the data subset 410 as shown in Fig. 4C, depending on the model (such as equation (1)) generated based on the first group of data 411 and the model (such as equation (2)) generated based on the second group of data 412, it may be determined that the data in the data subset 410 satisfies the following equation (3), namely, the first model. That is, by selecting multiple groups of data in the first subset to run multiple rounds, the first model (such as function) for the first variable may be obtained:

f t (t) = X Q t + X 1 t 2 ( 3 ) where / t (t) denotes the displacement function with time / as independent variable and X 0 and X 4 are unknown parameters.

[0048] Fig. 3B illustrates a flowchart of a method 350 for generating a second model in accordance with implementations of the present disclosure. It is to be understood that the method 350 may be implemented by the computing device/server 100 as depicted in Fig. 1 and may be an example implementation of block 204 in Fig. 2.

[0049] At 308, a third submodel with at least the second variable as an independent variable is generated based on the third group of data in the second subset of the dataset. At 310, a fourth submodel with at least the second variable as an independent variable is generated based on the fourth group of data in the second subset. At 312, the second model is generated based on the third submodel and the fourth submodel. Similar to the generation of the first model at 302-306, it is possible to proactively select the second subset in which only the second variable changes, and generate the second model satisfied by the second subset. For example, the second model with the initial speed vo as independent variable is generated in the dataset 180 as depicted in Fig. 4B, for instance, as the following equation (4):

where f Vo (t¾) denotes the displacement function with the initial speed vo as independent variable and X 2 and X 3 are unknown parameters.

[0050] Optionally, in some implementations, other than generating the first model and the second model, it is further possible to generate a third model with at least a third variable as an independent variable based on the third subset in the dataset. For example, in the dataset 180 as depicted in Fig. 4B, it is possible to generate a third model with the external force F as an independent variable, such as the following equation (5):

F) = x 4 + x 5 F ( 5 ) where / F ( ) denotes the displacement function with the external force F as independent variable, and X 4 and X 5 are unknown parameters.

[0051] In some implementations, it is further possible to generate a fourth model with at least a fourth variable as an independent variable based on the fourth subset of the dataset. For example, in the dataset 180 as depicted in Fig. 4B, the fourth model with mass m as independent variable may be generated, such as the following equation (6): where f m (jn ) denotes displacement function with mass w as independent variable, and X 6 and X 7 are unknown parameters.

[0052] Fig. 3C illustrates a flowchart of a method 390 for generating a target model through tree matching in accordance with implementations of the present disclosure. It is to be understood that method 390 may be implemented by the computing device/server 100 depicted in Fig. 1 and may be an example implementation of block 206 as shown in Fig. 2.

[0053] At 314, a first tree representing the first model and a second tree representing the second model are generated. For example, Fig. 4D is a schematic diagram 450 for generating a tree representing each model in accordance with implementations of the present disclosure. After a plurality of models (such as equations (3)-(6)) are generated, the corresponding tree structure may be generated for each model, for instance, a tree 460 representing the first submodel, a tree 470 representing the second submodel, a tree 480 representing the third submodel and a tree 490 representing the fourth submodel. By using the tree structure for matching, it is possible to combine the plurality of models quickly and efficiently.

[0054] At 316, a target model is generated by at least matching the first tree and the second tree. For example, a model template comprising unknown parameter(s) may be generated at least based on the first model and the second model. Fig. 4E shows a target tree 499 generated by matching all the trees in Fig. 4D, which is obtained by sub-graph matching for the trees (such as trees 460, 470, 480 and 490) of each model. Next, a target template 495 comprising unknown parameters may be generated based on the target tree 499, as shown by equation (7):

where f(t, v 0 , F, m ) denotes the displacement function with time /, initial speed vo, external force F and mass m as independent variables, and Y 0 and Y 1 are unknown parameters.

[0055] After obtaining the model template (such as equation (7)) comprising unknown parameters Y 0 and Y x , it is possible to resolve the unknown parameters Y 0 and Y 1 using data in the dataset 180. In this example, the value of Y 0 may be determined as 1 while the value of Y 1 may be determined as 1/2. Therefore, after the values of the unknown parameters Y 0 and Y 1 are determined, it is possible to determine that the final target model satisfied by the dataset 180 is equation (8): d = v 0 t + t 2 ( 8 )

[0056] Actually, the equation (8) is the standard displacement calculation equation of uniformly accelerated rectilinear motion. Therefore, based on the proactive modeling method for the dataset 180, it is possible to find data laws in the dataset quickly and efficiently and generate the most suitable target model.

[0057] During the combination process of various models, a plurality of candidate models may be generated, where each candidate model at least takes the first variable and the second variable as independent variables. In some implementations, the dataset may be used to evaluate each candidate model to determine the fitting degree of each model with the dataset, and to select the final target model based on the evaluation. In this manner, when the models may be combined to form a plurality of target models, it is possible to select a model with the highest accuracy based on the dataset as the target model.

[0058] Fig. 5 illustrates a schematic diagram 500 of comparison between experiment results of proactive symbolic modeling in accordance with the present disclosure and deep learning method. The experiment result 510 of the proactive modeling method in accordance with the present disclosure demonstrates that the accuracy is improved at a leap and the error at 2 I st second at point 511 is 0, that is, the most accurate target model is found. In comparison, under the condition that GPU having more cores than CPU is used to optimize machine learning algorithm, the experiment result 520 of deep learning method shows that the accuracy is improved smoothly, and the error at the lOO* 11 second at point 521 is 0.18 and at 181 st second at point 522 is 0.09. Thus, as compared with deep learning method, the proactive modeling method in accordance with implementations of the present disclosure not only has faster modeling speed but also achieves higher accuracy.

[0059] The functionally described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

[0060] Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

[0061] In the context of this disclosure, a machine readable medium may be any tangible medium that may contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include but not limited to an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

[0062] Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of the present disclosure, but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub- combination.

[0063] Some exemplary implementations of the present disclosure are listed below.

[0064] In one aspect, a computer-implemented method is provided. The method comprises generating a first model with at least a first variable as an independent variable based on a first subset in a dataset, wherein the first model indicates a first constraint satisfied by data in the first subset. The method further comprises generating a second model with at least a second variable as an independent variable based on a second subset in the dataset, wherein the second model indicates a second constraint satisfied by data in the second subset, wherein both the first variable and the second variable are variables in the dataset. The method further comprises generating a target model indicating a third constraint satisfied by data in the dataset by combining at least the first model and the second model, wherein the target model is used for making a prediction based on the dataset.

[0065] In some implementations, the generating the first model comprises: generating a first submodel with at least the first variable as an independent variable based on a first group of data in the first subset, wherein a value of the second variable in the first group of data is a first value; generating a second submodel with at least the first variable as an independent variable based on a second group of data in the first subset, wherein a value of the second variable in the second group of data is a second value, and the first value and the second value are different; and generating the first model based on the first submodel and the second submodel.

[0066] In some implementations, in the first group of data, only a value of the first variable changes, and the generating the first submodel comprises generating the first submodel only with the first variable as an independent variable based on the first group of data.

[0067] In some implementations, the method further comprises sampling the first group of data by fixing values of other variables other than the first variable in the dataset.

[0068] In some implementations, the generating the target model comprises: generating a first tree representing the first model and a second tree representing the second model; and generating the target model by matching the first tree and the second tree.

[0069] In some implementations, the generating the target model comprises: generating a model template comprising a unknown parameter based on the first model and the second model; resolving the unknown parameter in the model template using the dataset; and determining the target model based on the model template and the unknown parameter.

[0070] In some implementations, the generating the target model comprises: generating a first candidate model and a second candidate model for the dataset by combining the first model and the second model, wherein the first candidate model and the second candidate model both take at least the first variable and the second variable as independent variables; evaluating the first candidate model and the second candidate model using the dataset; and determining the target model based on evaluation of the first candidate model and the second candidate model.

[0071] In some implementations, the method further comprises: generating a third model with at least a third variable as an independent variable based on a third subset in the dataset, wherein the third model indicates a fourth constraint satisfied by data in the third subset, and the generating the target model comprises generating the target model by combining the first model, the second model and the third model.

[0072] In another aspect, there is provided an electronic device, the electronic device comprises a processing unit and a memory coupled to the processing unit and storing instructions. The instructions, when executed by the processing unit, perform the acts of generating a first model with at least a first variable as an independent variable based on a first subset in a dataset, wherein the first model indicates a first constraint satisfied by data in the first subset. The acts further comprise generating a second model with at least a second variable as an independent variable based on a second subset in the dataset, wherein the second model indicates a second constraint satisfied by data in the second subset, and both the first variable and the second variable are variables in the dataset. The acts further comprise generating a target model indicating a third constraint satisfied by data in the dataset by combining at least the first model and the second model, wherein the target model is used for making a prediction based on the dataset.

[0073] In some implementations, the generating the first model comprises: generating a first submodel with at least the first variable as an independent variable based on a first group of data in the first subset, wherein a value of the second variable in the first group of data is a first value; generating a second submodel with at least the first variable as an independent variable based on a second group of data in the first subset, wherein a value of the second variable in the second group of data is a second value, and the first value and the second value are different; and generating the first model based on the first submodel and the second submodel.

[0074] In some implementations, in the first group of data, only a value of the first variable changes, and the generating the first submodel comprises generating the first submodel only with the first variable as an independent variable based on the first group of data.

[0075] In some implementations, the acts further comprise sampling the first group of data by fixing values of other variables other than the first variable in the dataset.

[0076] In some implementations, the generating the target model comprises: generating a first tree representing the first model and a second tree representing the second model; and generating the target model by matching the first tree and the second tree.

[0077] In some implementations, the generating the target model comprises: generating a model template comprising an unknown parameter based on the first model and the second model; resolving the unknown parameter in the model template using the dataset; and determining the target model based on the model template and the unknown parameter.

[0078] In some implementations, the generating the target model comprises: generating a first candidate model and a second candidate model for the dataset by combining the first model and the second model, wherein the first candidate model and the second candidate model both take at least the first variable and the second variable as independent variables; evaluating the first candidate model and the second candidate model using the dataset; and determining the target model based on evaluation of the first candidate model and the second candidate model.

[0079] In some implementations, the acts further comprise: generating a third model with at least a third variable as an dependent variable based on a third subset in the dataset, wherein the third model indicates a fourth constraint satisfied by data in the third subset, and the generating the target model comprises: generating the target model by combining the first model, the second model and the third model.

[0080] In still another aspect, there is provided a computer program product tangibly stored on a non-transient computer readable medium and including machine executable instructions. The machine executable instructions, when executed by a device, cause the device to generate a first model with at least a first variable as an independent variable based on a first subset in a dataset, wherein the first model indicates a first constraint satisfied by data in the first subset. The instructions further cause the device to generate a second model with at least a second variable as an independent variable based on a second subset in the dataset, wherein the second model indicates a second constraint satisfied by data in the second subset, and both the first variable and the second variable are variables in the dataset. The instructions further cause the device to generate a target model indicating a third constraint satisfied by data in the dataset by combining at least the first model and the second model, wherein the target model is used for making a prediction based on the dataset.

[0081] In some implementations, the generating the first model comprises: generating a first submodel with at least the first variable as an independent variable based on a first group of data in the first subset, wherein a value of the second variable in the first group of data is a first value; generating a second submodel with at least the first variable as an independent variable based on a second group of data in the first subset, wherein a value of the second variable in the second group of data is a second value, and the first value and the second value are different; and generating the first model based on the first submodel and the second submodel.

[0082] In some implementations, in the first group of data, only a value of the first variable changes, and the generating the first submodel comprises generating the first submodel only with the first variable as an independent variable based on the first group of data.

[0083] In some implementations, the machine executable instructions, when performed in the device, further cause the device to sample the first group of data by fixing values of other variables other than the first variable in the dataset.

[0084] In some implementations, the generating the target model comprises: generating a first tree representing the first model and a second tree representing the second model; and generating the target model by matching the first tree and the second tree.

[0085] In some implementations, the generating the target model comprises: generating a model template comprising an unknown parameter based on the first model and the second model; resolving the unknown parameter in the model template using the dataset; and determining the target model based on the model template and the unknown parameter.

[0086] In some implementations, the generating the target model comprises: generating a first candidate model and a second candidate model for the dataset by combining the first model and the second model, wherein the first candidate model and the second candidate model both take at least the first variable and the second variable as independent variables; evaluating the first candidate model and the second candidate model using the dataset; and determining the target model based on evaluation of the first candidate model and the second candidate model.

[0087] In some implementations, the machine executable instructions, when performed in the device, further cause the device to generate a third model with at least a third variable as an independent variable based on a third subset in the dataset, wherein the third model indicates a fourth constraint satisfied by data in the third subset, and the generating the target model comprises generating the target model by combining the first model, the second model and the third model.

[0088] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter specified in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.