

Title:
DEVICES AND METHODS FOR GENERATION AND COMPILATION OF DECISION TREE SOFTWARE CODE
Document Type and Number:
WIPO Patent Application WO/2024/046574
Kind Code:
A1
Abstract:
A data processing apparatus (100) configured to generate a software code (170) based on a first data set (140). The software code (170) comprises a plurality of conditionals for implementing a decision tree (160). The data processing apparatus is further configured to execute the software code (170) for determining a result of the decision tree (160) for a first subset of a second data set (150), collect statistical data based on a second subset of the second data set (150) and modify the software code (170) based on the collected statistical data by one or more of a plurality of code modification operations.

Inventors:
KASTRATI FISNIK (DE)
Application Number:
PCT/EP2022/074417
Publication Date:
March 07, 2024
Filing Date:
September 02, 2022
Assignee:
HUAWEI TECH CO LTD (CN)
KASTRATI FISNIK (DE)
International Classes:
G06F8/41
Foreign References:
US20140089908A1 (2014-03-27)
Other References:
TOKUNAGA, TAKAMASA et al.: "Profiling with helper threads", Proceedings of the IASTED International Conference on Applied Informatics / International Symposium on Parallel and Distributed Computing and Networks, vol. 23, 15 February 2005, pages 1-6, XP009090786
ROSS, KENNETH A.: "Conjunctive selection conditions in main memory", Proceedings of the 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2002), Madison, WI, 3-5 June 2002, ACM, New York, NY, pages 109-120, XP058404863, ISBN 978-1-58113-507-7, DOI 10.1145/543613.543628
CARRIBAULT, PATRICK et al.: "Branch Strategies to Optimize Decision Trees for Wide-Issue Architectures", Languages and Compilers for High Performance Computing, 17th International Workshop (LCPC 2004), Lecture Notes in Computer Science, Springer-Verlag, Berlin/Heidelberg, 23 August 2005, pages 439-454, ISBN 978-3-540-28009-5, XP019013691
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Claims:
CLAIMS

1. A data processing apparatus (100) configured to perform the following operations:

generating a software code (170) based on a first data set (140), wherein the software code (170) comprises a plurality of conditionals for implementing a decision tree (160);

executing the software code (170) for determining a result of the decision tree (160) for a first subset (307) of a second data set (150);

collecting statistical data based on a second subset (309) of the second data set (150); and

modifying the software code (170) based on the collected statistical data by one or more of a plurality of code modification operations.

2. The data processing apparatus (100) of claim 1, wherein the data processing apparatus (100) is further configured to compile the software code (170) into a machine code version of the software code (170) and to execute the machine code version of the software code (170) for determining a result of the decision tree (160) for the first subset (307) of the second data set (150).

3. The data processing apparatus (100) of claim 1 or 2, wherein the data processing apparatus (100) is configured to execute the software code (170) using at least one first thread and to collect the statistical data using at least one second thread and wherein the data processing apparatus (100) is configured to execute the at least one first thread and the at least one second thread substantially in parallel.

4. The data processing apparatus (100) of any one of the preceding claims, wherein the data processing apparatus (100) is configured to perform the operations of executing the software code (170), collecting statistical data, and modifying the software code (170) based on the collected statistical data as a continuous process.

5. The data processing apparatus (100) of any one of the preceding claims, wherein for collecting the statistical data the data processing apparatus (100) is configured to collect statistics about a respective result of each of the plurality of conditionals of the software code (170) implementing the decision tree (160).

6. The data processing apparatus (100) of any one of the preceding claims, wherein at least one of the plurality of conditionals comprises at least two Boolean conditions connected by an AND operator and wherein for modifying the software code (170) based on the statistical data the one or more of the plurality of code modification operations comprise reordering the at least two Boolean conditions.

7. The data processing apparatus (100) of any one of the preceding claims, wherein at least one conditional of the plurality of conditionals is nested within at least one further conditional of the plurality of conditionals and wherein for modifying the software code (170) based on the statistical data the one or more of the plurality of code modification operations comprise reordering the at least one conditional and the at least one further conditional of the plurality of conditionals of the software code (170) implementing the decision tree (160).

8. The data processing apparatus (100) of any one of the preceding claims, wherein the data processing apparatus (100) is further configured to compile the software code (170) into a machine code version of the software code (170) and wherein at least one of the plurality of conditionals comprises at least two Boolean conditions connected by a first type of AND operator, which in the machine code version of the software code (170) results in a branching portion of the machine code version of the software code (170), wherein for modifying the software code (170) based on the statistical data the one or more of the plurality of code modification operations comprise replacing the first type of AND operator with a second type of AND operator, which in the machine code version of the software code (170) results in a non-branching portion of the machine code version of the software code (170).

9. The data processing apparatus (100) of any one of the preceding claims, wherein the data processing apparatus (100) is further configured to compile the software code (170) into a machine code version of the software code (170) and wherein at least one of the plurality of conditionals comprises at least two Boolean conditions connected by a second type of AND operator, which in the machine code version of the software code (170) results in a non-branching portion of the machine code version of the software code (170), wherein for modifying the software code (170) based on the statistical data the one or more of the plurality of code modification operations comprise replacing the second type of AND operator with a first type of AND operator, which in the machine code version of the software code (170) results in a branching portion of the machine code version of the software code (170).

10. The data processing apparatus (100) of any one of the preceding claims, wherein for modifying the software code (170) based on the collected statistical data by the one or more of the plurality of code modification operations the data processing apparatus (100) is configured to determine the one or more of the plurality of code modification operations based on the statistical data using a loss function, wherein the loss function is configured to minimize an execution time of the software code (170) implementing the decision tree (160).

11. A data processing method (700), wherein the data processing method (700) comprises:

generating (701) a software code (170) based on a first data set (140), wherein the software code (170) comprises a plurality of conditionals for implementing a decision tree (160);

executing (703) the software code (170) for determining a result of the decision tree (160) for a first subset (307) of a second data set (150);

collecting (705) statistical data based on a second subset (309) of the second data set (150); and

modifying (707) the software code (170) based on the collected statistical data by one or more of a plurality of code modification operations.

12. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method of claim 11, when the program code is executed by the computer or the processor.

Description:
Devices and methods for generation and compilation of decision tree software code

TECHNICAL FIELD

The present disclosure relates to code generation and compilation. More specifically, the present disclosure relates to devices and methods for generation and compilation of software code implementing a decision tree.

BACKGROUND

Decision Trees (DT) are a popular predictive model used in statistics, data mining and machine learning. A decision tree is typically used to go from observations about an item, represented in the branches of the tree, to a conclusion about the item's target value, represented in the leaves of the tree. Decision trees can also be used to represent Boolean expressions, whereby internal nodes represent Boolean conditions and leaves the outcomes, e.g. true or false, yes or no. Indeed, decision trees are used as the underlying structure by many popular machine learning ensemble algorithms such as XGBoost, Random Forests, AdaBoost, and the like. Decision trees can be used for both classification and regression tasks.

SUMMARY

It is an objective of the present disclosure to provide improved devices and methods for generation and compilation of software code implementing a decision tree.

The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect a data processing apparatus is provided. The data processing apparatus, which may be implemented as a server, may comprise processing circuitry, such as one or more processors for performing the following operations.

The data processing apparatus is configured to generate a software code based on a first data set (herein also referred to as the training data set), wherein the software code comprises a plurality of conditionals, e.g. conditional statements and/or conditional expressions for implementing a decision tree. Moreover, the data processing apparatus is configured to execute the software code for determining, i.e. predicting a respective result of the decision tree for a first subset of a second data set (herein also referred to as the prediction data set). For executing the software code, the data processing apparatus may be configured to compile a source code version of the software code into a machine code version of the software code.

The first data set may comprise data tuples comprising data for evaluating the conditionals. The first data set may be read from a memory of the data processing apparatus and/or may be received by a communication interface of the data processing apparatus from a further server, in particular a cloud server. The first data set may, for example, be stored in a repository file, a database, or a storage object.

The data processing apparatus is further configured to collect statistical data based on a second subset of the second data set, i.e. the prediction data set. As used herein, collecting statistical data based on the second subset of the second data set may comprise determining, i.e. predicting, a respective result of the decision tree for the second subset of the second data set and/or collecting statistical data about the respective result of the plurality of conditionals, e.g. conditional statements and/or conditional expressions, implementing the decision tree.

Moreover, the data processing apparatus is configured to modify, i.e. optimize the software code based on the collected statistical data by one or more of a plurality of code modification operations for reducing an execution time of the compiled modified software code. The code modification operations may for example comprise a rearrangement of parts of the software code based on the collected statistical data, in particular based on an execution likeliness or execution probability derived from the statistical data and relating to data processing steps of the rearranged parts of the software code.

Advantageously, the data processing apparatus makes it possible to improve the performance of the prediction for decision trees, in particular by dynamically generating highly optimized code and Just-in-time (JIT) compiling it. This improves access locality and cache efficiency, eliminates the interpretation overhead and dynamic dispatching, reduces the code size and therefore improves CPU cache efficiency, reduces the error on condition selectivities and optimizes Boolean conditions, thereby reducing the so-called branch misprediction penalty (which may be considered as a loss function).

In a further possible implementation form, the data processing apparatus is further configured to compile a source code version of the software code into a machine code version of the software code and to execute the machine code version of the software code for determining, i.e. predicting, a respective result of the decision tree for the first subset of the second data set, i.e. the prediction data set.

In a further possible implementation form, the data processing apparatus is configured to execute the software code using at least one first thread and to collect the statistical data using at least one second thread, wherein the data processing apparatus is configured to execute the at least one first thread and the at least one second thread substantially in parallel.

In a further possible implementation form, the data processing apparatus is configured to perform the operations of executing the software code, collecting statistical data, and modifying the software code based on the collected statistical data as a continuous process until all data of the second data set has been processed.

In a further possible implementation form, for collecting the statistical data the data processing apparatus is configured to collect statistical data about a respective result of each of the plurality of conditionals of the software code implementing the decision tree.
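By way of illustration only, the following C++ sketch shows one possible way to maintain such per-conditional statistics; the type and function names are assumptions made for this example and are not taken from the disclosure.

```cpp
#include <cstdint>

// Illustrative sketch only: per-conditional counters from which the
// selectivity (fraction of true outcomes) of each condition of the decision
// tree can be derived. Names are assumptions, not from the disclosure.
struct ConditionStats {
    std::uint64_t seen = 0;   // how often the condition was evaluated
    std::uint64_t taken = 0;  // how often it evaluated to true

    double selectivity() const {
        return seen ? static_cast<double>(taken) / static_cast<double>(seen)
                    : 0.0;
    }
};

// The statistics-collecting thread may wrap each Boolean condition like this,
// so that evaluating the condition also updates its counters.
inline bool observe(bool outcome, ConditionStats& s) {
    ++s.seen;
    s.taken += outcome ? 1 : 0;
    return outcome;
}
```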

In a further possible implementation form, at least one of the plurality of conditionals of the software code implementing the decision tree comprises at least two Boolean conditions connected by an AND operator, wherein for modifying the software code based on the statistical data the one or more of the plurality of code modification operations comprise reordering the at least two Boolean conditions.

In a further possible implementation form, at least one conditional of the plurality of conditionals of the software code implementing the decision tree is nested within at least one further conditional of the plurality of conditionals, wherein for modifying the software code based on the statistical data the one or more of the plurality of code modification operations comprise reordering the at least one conditional and the at least one further conditional of the plurality of conditionals of the software code implementing the decision tree so that the at least one further conditional becomes nested within the at least one conditional of the plurality of conditionals of the software code implementing the decision tree.

In a further possible implementation form, the data processing apparatus is further configured to compile a source code version of the software code into a machine code version of the software code, wherein at least one of the plurality of conditionals of the software code implementing the decision tree comprises at least two Boolean conditions connected by a first type of AND operator (for instance the bitwise AND operator &), which in the machine code version of the software code results in a non-branching portion of the machine code version of the software code, wherein for modifying the software code based on the statistical data the one or more of the plurality of code modification operations comprise replacing the first type of AND operator with a second type of AND operator (for instance the logical AND operator &&), which in the machine code version of the software code results in a branching portion of the machine code version of the software code.

In a further possible implementation form, the data processing apparatus is further configured to compile a source code version of the software code into a machine code version of the software code, wherein at least one of the plurality of conditionals comprises at least two Boolean conditions connected by a second type of AND operator (for instance the logical AND operator &&), which in the machine code version of the software code results in a branching portion of the machine code version of the software code, wherein for modifying the software code based on the statistical data the one or more of the plurality of code modification operations comprise replacing the second type of AND operator with a first type of AND operator (for instance the bitwise AND operator &), which in the machine code version of the software code results in a non-branching portion of the machine code version of the software code.
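The following C++ fragment illustrates the two operator types referred to above under standard C/C++ semantics; the concrete functions are illustrative only.

```cpp
// Illustrative only. The logical AND (&&) short-circuits: the second
// condition is evaluated only if the first one is true, which compilers
// typically implement with a conditional jump, i.e. a branch the CPU must
// predict:
bool logical_and(int a, int b) {
    return a < 10 && b > 5;    // may compile to a branching portion
}

// The bitwise AND (&) applied to the 0/1 results of both comparisons always
// evaluates both conditions and combines them arithmetically, which typically
// compiles to branch-free code (e.g. two setcc instructions and an AND):
bool bitwise_and(int a, int b) {
    return (a < 10) & (b > 5); // may compile to a non-branching portion
}
```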

In a further possible implementation form, for modifying, e.g. optimizing, the software code based on the collected statistical data by the one or more of the plurality of code modification operations the data processing apparatus is configured to determine the one or more of the plurality of code modification operations based on the statistical data using a loss function, wherein the loss function is configured to minimize an execution time of the software code implementing the decision tree. The loss function may comprise a branch misprediction penalty of a branch of the decision tree.
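A minimal sketch of such a loss function is given below. The quadratic cost shape is an assumption that merely mirrors the penalty curve of figure 6 (near zero for selectivities close to 0 or 1, maximal around 0.5); the disclosure does not fix a concrete formula, and all names are illustrative.

```cpp
// Hedged sketch of a possible loss function: the expected branch
// misprediction cost of a condition with selectivity p. The shape
// 2*p*(1-p)*penalty is an assumption mirroring figure 6; penalty_cycles
// could be e.g. about 16.5 cycles on Skylake or 19 cycles on Zen1, as
// cited in the description.
double expected_branch_cost(double p, double penalty_cycles) {
    return 2.0 * p * (1.0 - p) * penalty_cycles;
}

// A code modification operation may then be selected by comparing costs,
// e.g. preferring the non-branching & whenever the branch would be badly
// predicted (eval_cost: extra cost of always evaluating both conditions).
bool prefer_nonbranching(double p, double penalty_cycles, double eval_cost) {
    return expected_branch_cost(p, penalty_cycles) > eval_cost;
}
```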

According to a second aspect a computer-implemented data processing method is provided. The data processing method according to the second aspect comprises the steps of: generating a software code based on a first data set, e.g. a training data set, wherein the software code comprises a plurality of conditionals for implementing a decision tree; executing the software code for determining, i.e. predicting a respective result of the decision tree for a first subset of a second data set, e.g. a prediction data set; collecting statistical data based on a second subset of the second data set, e.g. the prediction data set; and modifying the software code based on the collected statistical data by one or more of a plurality of code modification operations for reducing an execution time of the compiled modified software code.

The method according to the second aspect of the present disclosure can be performed by the data processing apparatus according to the first aspect of the present disclosure. Thus, further features of the method according to the second aspect of the present disclosure result directly from the functionality of the data processing apparatus according to the first aspect of the present disclosure as well as its different implementation forms described above and below.

According to a third aspect a computer program product is provided, comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method according to the second aspect, when the program code is executed by the computer or the processor.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:

Fig. 1 schematically shows a data processing apparatus according to an embodiment;

Fig. 2 schematically shows an exemplary decision tree;

Fig. 3a illustrates operations performed by a data processing apparatus according to an embodiment;

Fig. 3b illustrates parallelized operations performed by a data processing apparatus according to an embodiment;

Fig. 4 shows a flow diagram illustrating steps performed by an algorithm implemented by a data processing apparatus according to an embodiment;

Fig. 5a-b show, by way of example, different versions of a software code processed by a data processing apparatus according to an embodiment;

Fig. 6 shows a graph diagram illustrating an exemplary misprediction penalty curve; and

Fig. 7 shows a flow diagram illustrating a data processing method according to an embodiment.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 1 schematically shows a data processing apparatus 100 according to an embodiment. As illustrated in figure 1, the data processing apparatus 100 may comprise a processing circuitry 110 and a communication interface 120, for example a wired and/or wireless communication interface 120. The processing circuitry 110 may be implemented in hardware and/or software and may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The data processing apparatus 100 may further comprise a memory 130 configured to store executable program code which, when executed by the processing circuitry 110, causes the data processing apparatus 100 to perform the functions and methods described herein.

As further illustrated in figure 1 and as will be described in more detail below, the data processing apparatus 100 is configured to generate, based on a first data set (herein also referred to as training data set) 140 and a second data set (herein also referred to as prediction data set) 150 as inputs, which are for example received by the communication interface 120 and/or stored in the memory 130, a software code 170 which, when executed, implements a decision tree 160.

Before describing different embodiments of the data processing apparatus 100 in more detail, in the following some technical background as well as terminology will be introduced making use of one or more of the following abbreviations and concepts:

D-Cache CPU data cache

DT Decision Tree

I-Cache CPU instruction cache

JIT Just-in-time

RDBMS Relational Database Management System

SQL Structured Query Language

JIT code generation refers to Just-in-time code generation, meaning immediate code generation.

JIT compilation refers to Just-in-time compilation, meaning immediate code compilation, i.e. translation of code written in a higher programming language into machine code.

Boolean condition refers to an atomic Boolean conditional. That is, a condition evaluates to either true or false and can be connected to other conditions by means of logical AND and OR operators to form a Boolean expression. A Boolean condition may also be referred to as a predicate. An example of a Boolean condition is weather = "Sunny". An example of a Boolean expression is weather = "Sunny" AND temperature < 30 OR city = "Berlin".

Selectivity refers to the probability of a Boolean predicate evaluating to true.

Query plan refers to a sequence of steps used by an RDBMS to fetch the results for a given query. A query can be written by users in a declarative (non-procedural) query language such as SQL.

Branch misprediction penalty refers to the fact that modern CPUs try to predict the path of conditional statements based on previous outcomes. If, however, the wrong path is predicted, the entire CPU pipeline has to be flushed. This can result in many wasted CPU cycles.

Training data set refers to data used to train an algorithm, i.e. to build the decision tree.

Prediction data set refers to data which is used to make predictions.

A Decision Tree (DT) refers to a tree structure composed of (Boolean) conditions, i.e. internal nodes, which can branch into further conditions. The leaf nodes of the decision tree represent the outcomes.

Decision Trees (DT) may relate to predictive models used, for example, in statistics, data mining and machine learning. Decision trees are typically used to go from observations about an item, which are represented in the branches of the decision tree, to a final outcome about the item's target value. The latter is represented in the leaves of the decision tree.

Decision trees can also be used to represent Boolean expressions, whereby internal nodes represent Boolean conditions and leaves the outcome, e.g. true or false, yes or no.

In Figure 2 an exemplary decision tree 10 is schematically illustrated, which by way of example can be used to decide whether one should play soccer depending on the weather conditions. As illustrated by the exemplary decision tree 10, the internal nodes 11a-11c together with their respective branches 13a-f can represent conditions, whereas the leaf nodes 15a-b, 17a-b can represent the outcomes (also referred to as results herein). As illustrated in figure 2, a first outcome 15a-b may for example be "yes" and a second outcome 17a-b may for example be "no". Decision trees are very popular and used in many domains such as banking, operations research, autonomous driving, etc. Moreover, Boolean expressions in a query, e.g. an SQL query, can be represented as a decision tree structure too. Thus, optimization of decision trees can also be used for optimization of Boolean expressions in a database system in general. Indeed, decision trees are used as the underlying structure by many popular machine learning ensemble algorithms such as XGBoost, Random Forests, AdaBoost, etc. Decision trees can be used for both classification and regression tasks.
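By way of illustration, generated code for a decision tree such as the decision tree 10 of figure 2 may take the form of nested if/else conditionals. The following hand-written C++ analogue is a sketch only; the record layout and the feature encoding are assumptions made for this example.

```cpp
// Hand-written C++ analogue of code that might be generated for the decision
// tree 10 of figure 2 ("should one play soccer?"). The record layout and the
// feature encoding below are assumptions made for this sketch only.
struct Row {
    int outlook;        // assumed encoding: 0 = sunny, 1 = overcast, 2 = rainy
    bool high_humidity; // humidity condition
    bool windy;         // wind condition
};

// Internal nodes 11a-11c become conditionals; leaves 15a-b ("yes") and
// 17a-b ("no") become the returned outcomes.
bool predict(const Row& r) {
    if (r.outlook == 0) {           // sunny branch
        return !r.high_humidity;    // play only if humidity is normal
    } else if (r.outlook == 1) {    // overcast branch
        return true;                // always play
    } else {                        // rainy branch
        return !r.windy;            // play only if it is not windy
    }
}
```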

Figure 3a illustrates operations performed by the data processing apparatus 100 according to an embodiment. As will be described in more detail below, the operations comprise generating the software code 170 based on the first data set 140, e.g. the training data set 140. The software code 170 comprises a plurality of conditionals, e.g. conditional statements or conditional expressions, for implementing the decision tree 160. The operations further comprise executing the compiled software code 170, i.e. its machine code version, for determining, i.e. predicting, a respective result of the decision tree 160 for a first subset 307 of the second data set 150, i.e. the prediction data set 150, collecting statistical data based on a second subset 309 of the second data set 150, and modifying, i.e. optimizing, the software code 170 based on the collected statistical data by one or more of a plurality of code modification operations for reducing an execution time of the compiled modified software code 170. The second data set 150 may comprise at least one further subset 311 which may be used for collecting further statistical data following the collection of statistical data based on the second subset 309.

Thus, according to embodiments disclosed herein, the data processing apparatus 100 may be configured for dynamic JIT code generation and compilation for decision tree programs. The data processing apparatus 100, in particular the processing circuitry 110, may be configured to implement a code generator that generates the software code 170 containing tight loops for evaluating the decision tree 160, i.e. evaluating a Boolean expression. This way the code locality is improved, thus yielding better CPU cache behavior. Moreover, due to code compilation, dynamic dispatching is eliminated, while the compilation overhead is very small (negligible). Furthermore, the data processing apparatus 100 makes it possible to dynamically JIT generate and compile optimized code for prediction/evaluation of the not yet seen data. This may require building the decision tree 160 only once. In contrast, traditional approaches typically generate a static prediction model based solely on the training data set 140, whereby the prediction data set 150 - the not yet seen data - does not play any role, thus yielding suboptimal code and leaving a large optimization potential unharvested.

The data processing apparatus 100 allows dynamically scanning ahead the prediction data set 150 in order to minimize the error on the generated code. As further detailed below, a parallel task may read the data ahead in order to collect selectivities of Boolean conditions of the code in order to reduce the branch misprediction penalty (also referred to as loss function) on the generated software code 170. Generally, branch misprediction causes expensive CPU pipeline stalls. For example, on Intel's Skylake architecture this costs about 16.5 cycles, whereas on AMD's Zen1 the cost is about 19 cycles. Branch predictors are good at predicting branches that are nearly always or nearly never taken. However, as can be taken from figure 6, the branch misprediction penalty is highest for branches taken randomly, for example 50% of the time. As further detailed below, the data processing apparatus 100 according to an embodiment is configured to reduce the branch misprediction penalty by reordering conditionals of the software code 170 implementing the decision tree 160 and by using branching (further referred to by "&&") and non-branching (further referred to by "&") connections depending on their Boolean condition selectivities.

Figure 3b illustrates parallelized operations performed by the data processing apparatus 100 according to an embodiment. In figure 3b the first to fifth step 313a-e implemented by the data processing apparatus 100 are schematically illustrated, some of which may be invoked by the data processing apparatus 100 in parallel and/or recurringly, as described below.

According to an embodiment, in the first step 313a, the data processing apparatus 100, in particular the processing circuitry 110, may read the first data set 140, i.e. training data set 140.

In the second step 313b, the data processing apparatus 100 may construct the decision tree 160 based on the first data set 140 read in the first step 313a.

In the third step 313c, the data processing apparatus 100 may JIT generate and compile the software code 170 of the decision tree 160 constructed in the second step 313b.

In the fourth step 313d, the data processing apparatus 100 may predict the first subset 307 of the second data set 150, i.e. the prediction data set 150, using the initial JIT compiled software code 170 from the third step 313c.

In the task-parallel fifth step 313e, the data processing apparatus 100 may read ahead, i.e. scan ahead, the prediction data set, i.e. the second subset 309 of the second data set 150, collect the Boolean condition statistics (i.e. the statistical data), optimize the Boolean conditions of the software code 170 based on the newly collected statistical data and invoke the third step 313c to JIT (re)generate and (re)compile the newly optimized software code 170. The fourth step 313d may continue predicting until the chunk is reached which was read ahead in the fifth step 313e, i.e. the second subset 309 of the second data set 150, and then switch prediction to using the newly JIT compiled software code 170 which resulted from the fifth step 313e. At this point, the fifth step 313e may be invoked again in parallel to read ahead the new chunk, i.e. to read the at least one further subset 311. This process may be repeated until the entire prediction data set 150 has been evaluated, as sketched below.
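The following C++ sketch outlines this pipeline; the types and helper functions are placeholders assumed for this example, not APIs from the disclosure.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Placeholder types and helpers (assumptions, not from the disclosure).
struct Chunk { /* one subset 307-311 of the prediction data set */ };
struct Stats { /* collected Boolean condition selectivities */ };
struct CompiledTree { /* handle to JIT compiled machine code */ };

CompiledTree jit_compile(const Stats* stats) { return {}; }  // step 313c
void predict_chunk(const CompiledTree&, const Chunk&) {}     // step 313d
Stats collect_stats(const Chunk&) { return {}; }             // step 313e

void run(const std::vector<Chunk>& chunks) {
    // Initial, statistics-free compilation of the decision tree code.
    CompiledTree code = jit_compile(nullptr);
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        // Read the next chunk ahead in a parallel task, collecting its
        // statistics while the current chunk is still being predicted.
        std::future<Stats> ahead;
        if (i + 1 < chunks.size())
            ahead = std::async(std::launch::async, collect_stats,
                               std::cref(chunks[i + 1]));

        predict_chunk(code, chunks[i]);

        // Recompile so the next chunk is predicted with code optimized
        // using the statistics collected over that very chunk.
        if (ahead.valid()) {
            Stats s = ahead.get();
            code = jit_compile(&s);
        }
    }
}
```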

According to a further embodiment relating to massively parallel prediction, in the first step 313a, the data processing apparatus 100, in particular the processing circuitry 110, may read the first data set 140, i.e. training data set 140.

In the second step 313b, the data processing apparatus 100 may construct the decision tree 160 based on the first data set 140 read in the first step 313a.

In the third step 313c, the data processing apparatus 100 may JIT generate and compile the software code 170 of the decision tree 160 constructed in the second step 313b.

In the fourth step 313d, the data processing apparatus 100 may predict the first subset 307 of the second data set 150, i.e. the prediction data set 150, using the initial JIT compiled software code 170 from the third step 313c.

In the fifth step 313e, the data processing apparatus 100 may split the prediction data set 150, i.e. the second data set 150, into equal-sized chunks, i.e. subsets 307-311.

In a sixth step (not illustrated in figure 3b), for each chunk, i.e. subset 307-311, a thread (parallel task) may read ahead its assigned chunk, i.e. subset 307-311, and collect Boolean condition statistics, i.e. statistical data. Moreover, the data processing apparatus 100 may optimize the Boolean conditions of the software code 170 based on the newly collected statistical data and invoke the third step 313c to JIT (re)generate and (re)compile the newly optimized software code 170.

In a seventh step (not illustrated in figure 3b), when a thread is done collecting statistical data, it may re-read its assigned chunk, i.e. subset 307-311, and predict using the newly JIT compiled software code 170 from the sixth step. As will be appreciated, the prediction code in the sixth step was compiled for the particular chunk, i.e. subset 307-311, that the thread now performs prediction over; hence a thread always predicts using optimized JIT compiled code.

Figure 4 shows a flow diagram illustrating steps performed by the data processing apparatus 100 according to an embodiment.

In step 401, a user may initiate a training task, whereby the training data set, i.e. the first data set 140, is read from a repository where the training data set 140 has been stored. The training data set 140 can be read from any medium or repository, such as the memory 130, which supports read operations. Some examples of such repositories are files stored on disk, database systems, storage objects in the cloud, and the like.

In step 403, once the training data set 140 has been read, the data processing apparatus 100 may build the decision tree 160. In machine learning terms, the building of the decision tree 160 corresponds to the training phase.

Once the decision tree 160 has been built, the data processing apparatus 100 may trigger step 405, where the code 170 of the already built decision tree 160 (as built in step 403) may be JIT generated as nested if/else expressions and compiled into machine code. At this phase, a complete decision tree program 170 is provided which can be used for prediction.
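One common way to realize such runtime generation and compilation, sketched here merely as an assumption since the disclosure does not prescribe a mechanism, is to write the generated source to a file, compile it into a shared library with the system compiler and load it dynamically (POSIX dlopen/dlsym):

```cpp
#include <cstdlib>
#include <fstream>
#include <string>
#include <dlfcn.h>  // POSIX dynamic loading

// Sketch of one possible JIT mechanism (an assumption): write the generated
// nested if/else code to a C source file, compile it into a shared library,
// and load the resulting machine code. The exported symbol name "predict"
// and the signature are illustrative.
using PredictFn = int (*)(const double* features);

PredictFn jit_compile(const std::string& generated_source) {
    std::ofstream("tree.c") << generated_source;  // e.g. "int predict(...){...}"
    if (std::system("cc -O2 -shared -fPIC tree.c -o tree.so") != 0)
        return nullptr;                           // compilation failed
    void* lib = dlopen("./tree.so", RTLD_NOW);    // load the machine code
    if (!lib)
        return nullptr;
    return reinterpret_cast<PredictFn>(dlsym(lib, "predict"));
}
```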

In step 407, once the code generation and compilation in step 405 completes, the prediction phase commences. That is, the data processing apparatus 100 may read the data to be predicted one chunk at a time, i.e. one sequential block or subset 307-311 of the second data set 150, and use the already compiled code 170 to perform prediction over that chunk of data, i.e. subset 307-311.

In similar fashion to the first data set 140 used for training, the second data set 150 for prediction can be stored in any medium, such as the memory 130, which supports read operations, e.g. files, database systems, storage objects in the cloud, and the like. If the repository containing the prediction data set 150, i.e. the data over which the algorithm is invoked to make predictions, does not contain more data than a single chunk, i.e. subset 307-311, the data processing apparatus 100 may predict the data in that subset 307-311 and terminate, as shown in steps 409 and 417.

In step 411, if the prediction data set 150 is larger than a single chunk, i.e. subset 307-311, the data processing apparatus 100 may, while performing step 407, spawn a parallel task, e.g. a thread, to read the next subset 307-311 of data ahead.

This is further illustrated in figure 3a, whereby the first subset 307 corresponds to the current chunk that the data processing apparatus 100 is reading and making predictions on. As will be appreciated, the decision tree 160 was already compiled in step 407; thus predictions based on that compiled software code 170 can be made. The second subset 309 corresponds to the chunk of data that is read ahead in parallel.

In step 413, from the chunk that is read ahead, i.e. the second subset 309, the selectivities of the Boolean conditions in the decision tree 160, i.e. the statistical data, may be collected.

In step 415, once the statistical data, e.g. in the form of the selectivities have been collected, the algorithm may optimize the Boolean expressions of the software code 170 based on the newly collected selectivities, i.e. statistical data. That is, the data processing apparatus 100 may perform modifications of the software code 170, such as optimizing the Boolean conditions of the software code 170 in order to reduce the branch misprediction penalty (illustrated in figure 6) and/or reordering and connecting the Boolean conditions using branching-AND (&&) and non-branching-AND (&) operations depending on their corresponding selectivities. After the optimization is done, the data processing apparatus 100 may retrigger step 405 in order to JIT generate and compile the newly optimized software code 170, such that when the prediction phase moves to the new chunk of data, i.e. the second subset 309, it will perform the prediction operation over the optimized software code 170.

Figure 5a shows an exemplary first version 507a of the software code 170 generated based on the first data set 140 and a second modified, i.e. optimized, version 509a of the software code 170 based on the collected statistical data by one or more of the plurality of code modification operations. As illustrated in figure 5a, the code modification operations may comprise changing a Boolean condition from a branching-AND (&&) to a non-branching-AND (&), which may optimize the compiled machine code of the software code 170, i.e. reduce an execution time of the compiled machine code of the software code 170.
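Since the figures are not reproduced here, the following hypothetical pair of C++ fragments illustrates this kind of modification in the spirit of the first version 507a and the second version 509a:

```cpp
// Hypothetical first version (507a-like), as generated from the training
// data: the logical AND short-circuits and introduces an extra branch
// between the two conditions.
int first_version(int x, int y) {
    return (x < 10 && y > 5) ? 1 : 0;
}

// Hypothetical modified version (509a-like), after the collected statistics
// indicate that the short-circuit branch is badly predictable (taken roughly
// half the time): the bitwise AND removes that branch at the price of always
// evaluating both conditions.
int second_version(int x, int y) {
    return ((x < 10) & (y > 5)) ? 1 : 0;
}
```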

Figure 5b shows an exemplary further first version 507b of the software code 170 generated based on the first data set 140 and a further second modified, i.e. optimized, version 509b of the software code 170 based on the collected statistical data by one or more of the plurality of code modification operations. As illustrated in figure 5b, at least one conditional of the plurality of conditionals of the further first version 507b is nested within at least one further conditional of the plurality of conditionals. Thus, for modifying the software code 170 based on the statistical data, the one or more of the plurality of code modification operations may further comprise reordering the at least one conditional and the at least one further conditional of the plurality of conditionals of the software code 170 implementing the decision tree 160, as shown by way of example by the further second version 509b in comparison with the further first version 507b of the software code 170. Thus, "unlikely" code can be kept away from "likely" code, thereby improving the processing time of the software code 170.
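Again by way of a hypothetical example (the actual figure content is not reproduced here), reordering nested conditionals may look as follows, with the assumed selectivities given in the comments:

```cpp
// Hypothetical further first version (507b-like): the weakly selective
// condition is the outer one, so the inner test is reached for almost
// every tuple.
int before(int a, int b) {
    if (a < 100) {      // assumed true for ~99% of tuples
        if (b == 3) {   // assumed true for ~1% of tuples
            return 1;
        }
    }
    return 0;
}

// Hypothetical further second version (509b-like): the highly selective
// condition moves outward, so most tuples are rejected by a single test and
// the "unlikely" code is kept away from the "likely" code.
int after(int a, int b) {
    if (b == 3) {
        if (a < 100) {
            return 1;
        }
    }
    return 0;
}
```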

Figure 6 shows a graph diagram illustrating a misprediction penalty curve 600. The curve 600 relates the likelihood of a Boolean expression evaluating to true to the expected branch misprediction penalty (also referred to as loss function). By rearranging or reformulating the Boolean expressions of the software code 170, i.e. modifying the software code 170 based on the collected statistical data, the data processing apparatus 100 can optimize away the branch misprediction penalty, i.e. optimize the software code 170 towards Boolean expressions with a higher prediction significance and a lower expected misprediction penalty.

Figure 7 shows a flow diagram illustrating a computer-implemented data processing method 700 according to an embodiment.

The data processing method 700 comprises a step 701 of generating the software code 170 based on the first data set 140, e.g. the training data set 140, wherein the software code 170 comprises a plurality of conditionals for implementing the decision tree 160.

The data processing method 700 further comprises a step 703 of executing the compiled software code 170 for determining, i.e. predicting, a respective result of the decision tree 160 for the first subset 307 of the second data set 150, i.e. the prediction data set 150.

The data processing method 700 further comprises a step 705 of collecting statistical data based on the second subset 309 of the second data set 150.

The data processing method 700 further comprises a step 707 of modifying the software code 170 based on the collected statistical data by one or more of a plurality of code modification operations for reducing an execution time of the compiled modified software code 170.

As the data processing method 700 can be implemented by the data processing apparatus 100, further features of the data processing method 700 result directly from the functionality of the data processing apparatus 100 and its different embodiments described above.

The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.