Title:
A DATA PROCESSING DEVICE AND A METHOD OF OPERATING THE DATA PROCESSING DEVICE
Document Type and Number:
WIPO Patent Application WO/2017/058042
Kind Code:
A1
Abstract:
The invention relates to a data processing device (100) configured to extract data from a data repository (107) in response to a data query. The data processing device (100) comprises a converter (101) configured to convert the query into a high-level language query and an optimizer (103) configured to optimize the high-level language query on the basis of a configuration of the repository (107) to generate an optimized query. Moreover, the invention relates to a method of operating such a data processing device (100).

Inventors:
SLESARENKO ALEXANDER VLADIMIROVICH (CN)
KNIZHNIK KONSTANTIN ALEXANDROVICH (CN)
Application Number:
PCT/RU2015/000618
Publication Date:
April 06, 2017
Filing Date:
September 28, 2015
Assignee:
HUAWEI TECH CO LTD (CN)
International Classes:
G06F17/30; G06F9/455
Foreign References:
RU2015000020W, 2015-01-19
Other References:
YANNIS KLONATOS ET AL: "Building efficient query engines in a high-level language", PROCEEDINGS OF THE VLDB ENDOWMENT, vol. 7, no. 10, 1 June 2014 (2014-06-01), New York, NY, pages 853 - 864, XP055279415, ISSN: 2150-8097, DOI: 10.14778/2732951.2732959
RUI ZHANG ET AL: "Micro-specialization: Dynamic Code Specialization of Database Management Systems", PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CHO '12, 1 January 2012 (2012-01-01), New York, New York, USA, pages 63, XP055279425, ISBN: 978-1-4503-1206-6, DOI: 10.1145/2259016.2259025
CHRISTOPH KOCH ET AL: "Abstraction Without Regret in Database Systems Building: a Manifesto", IEEE DATA ENGINEERING BULLETIN 37(1), MARCH 2014, 1 March 2014 (2014-03-01), XP055279439, Retrieved from the Internet [retrieved on 20160609]
TIARK ROMPF ET AL: "Go Meta! A Case for Generative Programming and DSLs in Performance Critical Systems", 1ST SUMMIT ON ADVANCES IN PROGRAMMING LANGUAGES, SNAPL'15, MAY 3-6, 2015, ASILOMAR, CALIFORNIA, US, 1 May 2015 (2015-05-01), pages 238 - 261, XP055279441, Retrieved from the Internet [retrieved on 20160609], DOI: 10.4230/LIPIcs.SNAPL.2015.238
ALEXANDER SLESARENKO ET AL: "First-class isomorphic specialization by staged evaluation", GENERIC PROGRAMMING, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 26 August 2014 (2014-08-26), pages 35 - 46, XP058054974, ISBN: 978-1-4503-3042-8, DOI: 10.1145/2633628.2633632
Attorney, Agent or Firm:
LAW FIRM "GORODISSKY & PARTNERS" LTD (RU)
Claims:
CLAIMS

1. A data processing device (100) configured to extract data from a repository (107) in response to a query, comprising:

a converter (101) configured to convert the query into a high-level language query; and

an optimizer (103) configured to optimize the high-level language query on the basis of a configuration of the repository (107) to generate an optimized query.

2. The data processing device (100) of claim 1, wherein the optimizer (103) is configured to optimize the high-level language query on the basis of the configuration of the repository (107) by performing an isomorphic specialization of the high-level language query.

3. The data processing device (100) of claim 2, wherein the optimizer (103) is configured to optimize the high-level language query on the basis of the configuration of the repository (107) by performing an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query.

4. The data processing device (100) of claim 3, wherein the optimizer (103) is configured to perform an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query by using a graph-based intermediate representation of the high-level language query.

5. The data processing device (100) of claim 4, wherein the optimizer (103) is further configured to perform further low-level optimization steps of the graph-based intermediate representation of the high-level language query, in particular common subexpression elimination, code fusion, loop unrolling, data structure transformations, aggressive inlining, constant propagation and/or dead code elimination.

6. The data processing device (100) of claim 5, wherein the optimizer (103) is further configured to perform the further low-level optimization steps of the graph-based intermediate representation of the high-level language query in the context of a compilation and code generation framework.

7. The data processing device (100) of any one of the preceding claims, wherein the optimizer (103) is further configured to perform further high-level optimization steps of the high-level language query, in particular predicate pushdown optimization.

8. The data processing device (100) of any one of the preceding claims, wherein the optimizer (103) is configured to generate the optimized query in the form of a high-level language optimized query.

9. The data processing device (100) of any one of the preceding claims, wherein the data processing device (100) further comprises an executer (105) configured to extract data from the repository (107) by executing the optimized query.

10. The data processing device (100) of any one of the preceding claims, wherein the configuration of the repository (107) is defined by the configuration of the tables of data within the repository (107).

11. The data processing device (100) of any one of the preceding claims, wherein the optimizer (103) is further configured to obtain information about the configuration of the repository (107) from data stored along with the data of the repository (107) or from a separate metadata repository.

12. The data processing device (100) of any one of the preceding claims, wherein the high-level language query is a Scala query.

13. The data processing device (100) of any one of the preceding claims, wherein the query is a SQL query.

14. A method (200) of operating a data processing device (100) configured to extract data from a repository (107) in response to a query, wherein the method (200) comprises the steps of:

converting (201) the query into a high-level language query; and optimizing (203) the high-level language query on the basis of a configuration of the repository (107) to generate an optimized query.

15. A computer program comprising program code for performing the method (200) of claim 14 when executed on a computer.

Description:
A DATA PROCESSING DEVICE AND A METHOD OF OPERATING THE DATA PROCESSING DEVICE

TECHNICAL FIELD

The present invention relates to a data processing device and a method of operating a data processing device. In particular, the present invention relates to a data processing device for extracting data from a data repository, such as a database, in response to a data query.

BACKGROUND

Since the very early days of relational databases, several attempts have been made to improve the performance of a database by compiling a data query to native code rather than interpreting the query. The main goal of these attempts was to reduce the interpretation overhead and thereby increase the execution speed of a query. Usually, the query is compiled to native machine code using a pattern-based code generator. Generally, this approach is very complex and non-portable, and it does not allow performing further optimizations of the generated native code, such as common subexpression elimination, code fusion, loop unrolling, and the like.

An extensible and parallel query evaluation system, called Volcano, was developed several years ago; it provides an environment for research in database systems design, heuristics for query optimization, parallel query execution, and resource allocation. The Volcano system uses two meta-operators, namely the choose-plan meta-operator, which supports dynamic query evaluation plans that allow delaying selected optimization decisions until run-time, e.g., for embedded queries with free variables, and the exchange meta-operator, which supports intra-operator parallelism on partitioned datasets and both vertical and horizontal inter-operator parallelism, translating between demand-driven dataflow within processes and data-driven dataflow between processes.

The more recent development of many high-level programming languages with smart compilers producing machine code of acceptable performance has led to several attempts to translate query execution plans into such high-level programming languages instead of native code. One of these attempts is known as LegoBase, an in-memory query execution engine written in the high-level programming language Scala that implements the concept of abstraction without regret using generative programming. A data query in LegoBase is written in the Scala language and is then transformed into specialized Scala or C code. In this way, the query engine structures themselves can be adapted to a particular query. LegoBase has been evaluated, for instance, according to the TPC-H benchmark and significantly outperforms most of the commercial in-memory database systems as well as existing query compilers. Moreover, these performance improvements require programming just a few hundred lines of high-level language code instead of the complicated low-level language code required by other query compilers.

Although the attempts described above already lead to improved performance, there is still a need for further improvements, in particular with respect to the execution performance of a data query for extracting data from a database.

Thus, there is a need for an improved data processing device and an improved method of operating such a data processing device, in particular a data processing device for extracting data from a data repository, such as a database, in response to a data query.

SUMMARY

It is an object of the invention to provide an improved data processing device and an improved method of operating such a data processing device, in particular a data processing device for extracting data from a data repository, such as a database, in response to a data query.

The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect, a data processing device is provided configured to extract data from a repository or storage in response to a data query. The data processing device comprises a converter configured to convert the query into a high-level programming language query and an optimizer configured to optimize the high-level programming language query on the basis of a configuration of the repository to generate an optimized query. The repository or storage can be a database. Alternatively, the repository or storage can be any other device for storing data at least temporarily, such as a ROM, a RAM, a data communication channel, and the like. A query in a high-level programming language is defined by a higher level of abstraction than the same query in a low-level programming language.

By taking into account the specific configuration of the repository, i.e. adapting a query to the specific configuration of the repository, it is advantageously possible to obtain queries, which are better optimized than queries optimized using conventional methods. This is of great advantage in particular for databases, which provide many different models of organizing data, such as sharding, columnar store, vertical and horizontal distribution, replication and the like. Thus, an improved data processing device is provided, in particular a data processing device for extracting data from a data repository, such as a database, in response to a data query.

In a first possible implementation form of the data processing device according to the first aspect of the invention the optimizer is configured to optimize the high-level language query on the basis of the configuration of the repository by performing an isomorphic specialization of the high-level language query.

An isomorphic specialization of the high-level language query is a specialization of the query for the configuration of the repository using at least one isomorphism between different implementations of tables used for storing data in the repository. By performing an isomorphic specialization of the high-level language query it is possible to obtain queries, which are better optimized than queries optimized using conventional methods.
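
By way of a purely illustrative sketch (plain Scala with hypothetical names such as Iso, RowTable and ColTable; this is not the Scalan machinery actually used by the invention), such an isomorphism can be pictured as a pair of mutually inverse conversion functions between a row-oriented and a column-oriented representation of the same table. A query written against one representation can then be specialized to the other by composing it with these conversions; a real specializer would subsequently fuse the conversions away:

// Hypothetical sketch of an isomorphism between two table representations.
// A row-oriented table stores records; a column-oriented table stores one
// sequence per attribute. "to" and "from" are mutually inverse.
case class RowTable(rows: Seq[(Int, Double)])             // (key, price) records
case class ColTable(keys: Seq[Int], prices: Seq[Double])  // one column per attribute

trait Iso[A, B] {
  def to(a: A): B
  def from(b: B): A
}

object RowColIso extends Iso[RowTable, ColTable] {
  def to(r: RowTable): ColTable   = ColTable(r.rows.map(_._1), r.rows.map(_._2))
  def from(c: ColTable): RowTable = RowTable(c.keys.zip(c.prices))
}

object IsoSketch {
  // A query formulated against the row representation ...
  def sumExpensive(t: RowTable): Double =
    t.rows.filter(_._2 > 100.0).map(_._2).sum

  // ... can be specialized to the column representation via the isomorphism.
  def sumExpensiveCols(c: ColTable): Double =
    sumExpensive(RowColIso.from(c))

  def main(args: Array[String]): Unit = {
    val cols = ColTable(Seq(1, 2, 3), Seq(50.0, 150.0, 200.0))
    println(sumExpensiveCols(cols))  // 350.0
  }
}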

In a second possible implementation form of the data processing device according to the first implementation form of the first aspect of the invention the optimizer of the data processing device is configured to optimize the high-level language query on the basis of the configuration of the repository by performing an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query.

A staged evaluation is a special kind of query evaluation, which takes an abstract syntax tree (AST) representation of the query and produces a graph-based intermediate representation (IR) of the query. Alternatively, staged evaluation can be defined as a standard evaluation of staged code, where the staged code of a query Q is a query Q' such that the standard evaluation of Q' produces an intermediate representation which is semantically equivalent to the query Q. A standard evaluation (or interpretation) of a query is a process which takes an intermediate representation and some data as input and produces new data as output. Staged evaluation makes it possible to implement further optimization measures that are not available in the prior-art methods for optimizing queries, in particular queries written in SQL.
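
The following minimal sketch (plain Scala, hypothetical names; not the actual Scalan or LMS implementation) illustrates the principle of staged evaluation: the staged query operators do not touch any data, they merely build an intermediate representation that a later stage can optimize and compile:

// Minimal illustrative sketch of staged evaluation (hypothetical names).
// Evaluating the "staged" query does not process any data; it only builds an IR.
sealed trait Exp
case class Scan(table: String)                  extends Exp
case class Filter(src: Exp, pred: String)       extends Exp
case class Project(src: Exp, cols: Seq[String]) extends Exp

object StagingSketch {
  // Staged operators: they construct IR nodes instead of computing results.
  def scan(t: String): Exp          = Scan(t)
  def where(e: Exp, p: String): Exp = Filter(e, p)
  def select(e: Exp, cs: String*): Exp = Project(e, cs)

  def main(args: Array[String]): Unit = {
    // "Standard evaluation" of this staged code yields an IR of the query,
    // which a later stage can optimize and turn into executable code.
    val ir = select(where(scan("lineitem"), "l_shipdate <= 19981201"),
                    "l_returnflag", "l_linestatus")
    println(ir)
    // Project(Filter(Scan(lineitem),l_shipdate <= 19981201),List(l_returnflag, l_linestatus))
  }
}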

In a third possible implementation form of the data processing device according to the second implementation form of the first aspect the optimizer is configured to perform an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query by using a graph-based intermediate representation of the high-level language query.

In a fourth possible implementation form of the data processing device according to the third implementation form of the first aspect the optimizer is further configured to perform further low-level optimization steps of the graph-based intermediate representation of the high-level language query, in particular common subexpression elimination, code fusion, loop unrolling, data structure transformations, aggressive inlining, constant propagation and/or dead code elimination.

This provides for a further optimization of the code.

In a fifth possible implementation form of the data processing device according to the fourth implementation form of the first aspect the optimizer is further configured to perform the further low-level optimization steps of the graph-based intermediate representation of the high-level language query in the context of a compilation and code generation framework. The output can be further compiled and executed. A possible domain-specific compilation and code generation framework is Lightweight modular staging (LMS).

In a sixth possible implementation form of the data processing device according to the first aspect of the invention as such or any one of the first to fifth implementation form thereof the optimizer is further configured to perform further high-level optimization steps of the high-level language query, in particular predicate pushdown optimization.

This implementation form provides a further high-level optimization of the high-level language query, in particular using predicate pushdown optimization. Predicate pushdown optimization moves filtering operators towards the leaves of the query execution tree, i.e. as close as possible to the data sources, to reduce the amount of data processed by subsequent operators.
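
By way of a purely illustrative sketch (a toy plan algebra with hypothetical names, not the transformation rules of the invention), predicate pushdown can be expressed as a rewrite that moves a filter below a join whenever its predicate refers to only one input of the join:

// Illustrative sketch of a predicate-pushdown rewrite on a toy query plan.
sealed trait Plan
case class TableScan(table: String)                    extends Plan
case class Where(src: Plan, pred: String, on: String)  extends Plan  // "on": table the predicate refers to
case class Join(left: Plan, right: Plan, cond: String) extends Plan

object PushdownSketch {
  // Push a filter below a join when its predicate only mentions one side.
  def pushDown(p: Plan): Plan = p match {
    case Where(Join(l, r, c), pred, on) if refersTo(l, on) =>
      Join(pushDown(Where(l, pred, on)), pushDown(r), c)
    case Where(Join(l, r, c), pred, on) if refersTo(r, on) =>
      Join(pushDown(l), pushDown(Where(r, pred, on)), c)
    case Where(src, pred, on) => Where(pushDown(src), pred, on)
    case Join(l, r, c)        => Join(pushDown(l), pushDown(r), c)
    case s: TableScan         => s
  }

  private def refersTo(p: Plan, table: String): Boolean = p match {
    case TableScan(t)    => t == table
    case Where(s, _, _)  => refersTo(s, table)
    case Join(l, r, _)   => refersTo(l, table) || refersTo(r, table)
  }

  def main(args: Array[String]): Unit = {
    val plan = Where(Join(TableScan("lineitem"), TableScan("part"), "l_partkey = p_partkey"),
                     "l_shipdate <= 19981201", "lineitem")
    println(pushDown(plan))
    // Join(Where(TableScan(lineitem),l_shipdate <= 19981201,lineitem),TableScan(part),l_partkey = p_partkey)
  }
}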

In a seventh possible implementation form of the data processing device according to the first aspect of the invention as such or any one of the first to sixth implementation form thereof the optimizer is configured to generate the optimized query in the form of a high-level language optimized query.

According to this implementation form an optimized query can be provided, for instance, in Scala or C++, which allows for further compilation of the high-level language optimized query.

In an eighth possible implementation form of the data processing device according to the first aspect of the invention or any one of the first to seventh implementation form thereof the data processing device further comprises an executer configured to extract data from the repository by executing the optimized query.

Using the optimized query for performing the search for data leads to a better performance of the data processing device.

In a ninth possible implementation form of the data processing device according to the first aspect of the invention or any one of the first to eighth implementation form thereof the configuration of the repository is defined by the configuration of tables of data within the repository. In a possible implementation form the tables of data in the repository could be configured as "LocalRowTable", "LocalPairTable", "ShardedRowTable" or "ShardedPairTable". These different configurations can be defined using type declarations.
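
As a hedged illustration of configuration by type declaration (hypothetical Scala declarations modelled on the names above; figure 5 and the exact Scalan declarations are not reproduced here), selecting one concrete alias fixes the repository configuration for which the query is specialized:

// Hypothetical sketch of configuration-by-type-declaration (illustrative only).
trait TableRepr[Row]
class LocalRowTable[Row]    extends TableRepr[Row]  // row-oriented, single node
class LocalPairTable[Row]   extends TableRepr[Row]  // column-oriented, single node
class ShardedRowTable[Row]  extends TableRepr[Row]  // row-oriented, partitioned
class ShardedPairTable[Row] extends TableRepr[Row]  // column-oriented, partitioned

object ConfigSketch {
  case class Lineitem(l_orderkey: Int, l_extendedprice: Double)

  // The repository configuration is selected by a single type declaration;
  // the optimizer would specialize the query for this choice.
  type LineitemTable = ShardedRowTable[Lineitem]

  def main(args: Array[String]): Unit =
    println(new LineitemTable)  // prints the chosen representation instance
}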

In a tenth possible implementation form of the data processing device according to the first aspect of the invention or any one of the first to ninth implementation form thereof the optimizer is further configured to obtain information about the configuration of the repository from data stored along with the data of the repository or from a separate metadata repository.
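
As a simple illustrative sketch (hypothetical names and metadata layout; the invention leaves the metadata format open), the optimizer's configuration lookup can be pictured as follows:

// Hedged sketch: obtaining the repository configuration from a metadata store.
case class TableConfig(name: String, layout: String, shards: Int)

object MetadataSketch {
  // Metadata may live next to the data or in a separate metadata repository.
  private val catalog = Map(
    "lineitem" -> TableConfig("lineitem", layout = "sharded-row", shards = 4),
    "part"     -> TableConfig("part", layout = "local-column", shards = 1)
  )

  def configOf(table: String): Option[TableConfig] = catalog.get(table)

  def main(args: Array[String]): Unit =
    println(configOf("lineitem"))  // Some(TableConfig(lineitem,sharded-row,4))
}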

In an eleventh possible implementation form of the data processing device according to the first aspect of the invention or any one of the first to tenth implementation form thereof the high-level language query is a Scala query.

In a twelfth possible implementation form of the first aspect of the invention as such or any one of the first to eleventh implementation form thereof the query is a SQL query.

In a further implementation form of the first aspect of the invention as such or any one of the first to twelfth implementation form thereof the data can be distributed between the nodes of the repository by range partitioning, round robin, hash partitioning and the like.
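
For illustration only (a sketch with hypothetical helper names, not part of the claimed device), these distribution schemes can be expressed as partitioning functions that map a record key, or nothing at all in the round-robin case, to a node index:

// Illustrative sketch of hash partitioning, range partitioning and round robin.
object PartitioningSketch {
  val nNodes = 4

  // Hash partitioning: node chosen by the key's hash, kept non-negative.
  def hashPartition(key: Int): Int = ((key.hashCode % nNodes) + nNodes) % nNodes

  // Range partitioning with explicit split points (3 splits -> 4 ranges).
  val splitPoints = Seq(1000, 2000, 3000)
  def rangePartition(key: Int): Int = splitPoints.indexWhere(key < _) match {
    case -1 => nNodes - 1
    case i  => i
  }

  // Round robin ignores the key and cycles through the nodes.
  private var next = -1
  def roundRobin(): Int = { next = (next + 1) % nNodes; next }

  def main(args: Array[String]): Unit = {
    println(hashPartition(12345))       // node chosen by hash
    println(rangePartition(1500))       // 1 (second range)
    println(Seq.fill(6)(roundRobin()))  // List(0, 1, 2, 3, 0, 1)
  }
}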

In a further implementation form of the first aspect of the invention as such or any one of the first to twelfth implementation form thereof the optimizer is further configured to optimize the high-level programming language query on the basis of a configuration of the repository to generate an optimized query by generating at least two queries for at least two different configurations of the repository and by using a weighted average of the at least two queries for generating the optimized query. Instead of a weighted average it is also possible to generate the optimized query by optimizing a cost function, such as a weighted linear functional, a quadratic function, or another mathematical optimization model.
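
A minimal sketch of this idea is given below (hypothetical cost components, weights and names; the invention does not prescribe a concrete cost model): one candidate query is generated per configuration, a weighted average of estimated cost components is computed for each candidate, and the candidate with the lowest cost is selected:

// Hedged sketch of configuration selection by a weighted cost model.
object CostModelSketch {
  case class Candidate(config: String, cpuCost: Double, ioCost: Double, networkCost: Double)

  // Weighted average of the individual cost components.
  def weightedCost(c: Candidate, wCpu: Double, wIo: Double, wNet: Double): Double =
    (wCpu * c.cpuCost + wIo * c.ioCost + wNet * c.networkCost) / (wCpu + wIo + wNet)

  def pickBest(cs: Seq[Candidate]): Candidate =
    cs.minBy(weightedCost(_, wCpu = 1.0, wIo = 2.0, wNet = 4.0))

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      Candidate("LocalRowTable",   cpuCost = 10, ioCost = 8, networkCost = 0),
      Candidate("ShardedRowTable", cpuCost = 4,  ioCost = 3, networkCost = 2)
    )
    println(pickBest(candidates).config)  // ShardedRowTable under these weights
  }
}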

According to a second aspect the invention relates to a method of operating a data processing device configured to extract data from a repository in response to a query, wherein the method comprises the steps of: converting the query into a high-level language query; and optimizing the high-level language query on the basis of a configuration of the repository to generate an optimized query.

The method according to the second aspect of the invention can be performed by the data processing device according to the first aspect of the invention. Further features of the method according to the second aspect of the invention result directly from the functionality of the data processing device according to the first aspect of the invention and its different implementation forms described above.

According to a third aspect the invention relates to a computer program comprising program code for performing the method according to the second aspect of the invention when executed on a computer.

The invention can be implemented in hardware and/or software.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which:

Fig. 1 shows a schematic diagram of a data processing device configured to extract data from a repository in response to a query according to an embodiment;

Fig. 2 shows a schematic diagram illustrating steps of a method of operating a data processing device according to an embodiment;

Fig. 3 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment;

Fig. 4 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment;

Fig. 5 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment;

Fig. 6 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment;

Fig. 7 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment;

Fig. 8 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment; and

Fig. 9 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following detailed description, reference is made to the accompanying drawings, which form a part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be practiced. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims.

For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 1 shows a schematic diagram of a data processing device 100 configured to extract data from a data repository 107 in response to a query according to an embodiment. Although in figure 1 the repository 107 is illustrated as a database, the present invention is applicable to other types of repositories as well, which are configured to store data at least temporarily, such as a ROM, a RAM, a data communication channel, and the like.

The data processing device 100 comprises a converter 101 configured to convert the query into a high-level programming language query. The query in the high-level programming language is defined by a higher level of abstraction than the original query. In an embodiment, the query is a SQL query. In an embodiment, the high-level language query is a query in the high-level programming language Scala.
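
As a simplified, hedged illustration of this conversion step (plain Scala collections rather than the Scalan SqlDSL actually produced by the converter), a SQL query such as SELECT l_extendedprice FROM lineitem WHERE l_shipdate <= 19981201 corresponds to the following high-level language formulation:

// Hedged illustration of the SQL-to-high-level-language conversion step,
// using plain Scala collections instead of the Scalan SqlDSL.
object ConverterSketch {
  case class Lineitem(l_shipdate: Int, l_extendedprice: Double)

  // SELECT l_extendedprice FROM lineitem WHERE l_shipdate <= 19981201
  def query(lineitem: Seq[Lineitem]): Seq[Double] =
    lineitem.filter(_.l_shipdate <= 19981201).map(_.l_extendedprice)

  def main(args: Array[String]): Unit = {
    val data = Seq(Lineitem(19980101, 100.0), Lineitem(19990101, 200.0))
    println(query(data))  // List(100.0)
  }
}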

Moreover, the data processing device 100 comprises an optimizer 103 for optimizing the high-level programming language query provided by the converter 101 and thereby creating an optimized query. As schematically indicated by the arrow in dashed lines in figure 1, the optimizer 103 is configured to optimize the high-level programming language query on the basis of a configuration of the repository 107 to generate the optimized query.

In an embodiment, the data processing device 100 further comprises an executer 105 configured to extract data from the repository 107 by executing the optimized query provided by the optimizer 103.

In an embodiment, the optimizer 103 is configured to optimize the high-level language query on the basis of the configuration of the repository 107 by performing an isomorphic specialization of the high-level language query. An isomorphic specialization of the high-level language query is a specialization of the query for the configuration of the repository using at least one isomorphism between different implementations of tables used for storing data in the repository, as will be described in more detail in the context of figures 4 and 5 further below.

In an embodiment, the data processing device 100, i.e. the optimizer 103, is configured to optimize the high-level language query on the basis of the configuration of the repository 107 by performing an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query. A staged evaluation is a special kind of query evaluation, which takes an abstract syntax tree (AST) representation of the query and produces a graph-based intermediate representation (IR) of the query. Alternatively, staged evaluation can be defined as a standard evaluation of staged code, where the staged code of a query Q is a query Q' such that the standard evaluation of Q' produces an intermediate representation which is semantically equivalent to the query Q. A standard evaluation (or interpretation) of a query is a process which takes an intermediate representation and some data as input and produces new data as output. Staged evaluation makes it possible to implement further optimization measures that are not available in the prior-art methods for optimizing queries, in particular queries written in SQL.

In an embodiment, the optimizer 103 is configured to perform an isomorphic specialization of the high-level language query as part of a staged evaluation of the high-level language query by using a graph-based intermediate representation of the high-level language query. The staged evaluation including a graph-based intermediate representation (IR) can be implemented as described in document WO 2015/012711 Al, which is herein fully incorporated by reference.

In an embodiment, the optimizer 103 is further configured to perform further low-level optimization steps of the graph-based intermediate representation of the high-level language query, in particular common subexpression elimination, code fusion, loop unrolling, data structure transformations, aggressive inlining, constant propagation and/or dead code elimination.

In an embodiment, the optimizer 103 is further configured to perform the further low-level optimization steps of the graph-based intermediate representation of the high-level language query in the context of a compilation and code generation framework. The output can be further compiled and executed. A possible domain-specific compilation and code generation framework is Lightweight modular staging (LMS).
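
For illustration (hand-written before/after Scala, not actual LMS output), the effect of such low-level optimizations can be pictured as follows: the chained collection operations are fused into a single loop, the known constant is propagated, and the repeated subexpression is computed only once:

// Illustrative before/after sketch of code fusion, constant propagation and
// common subexpression elimination; hand-written for clarity.
object LowLevelOptSketch {
  val discountLimit = 0.05  // known constant at code-generation time

  // Before: two passes over the data, the limit read indirectly, a repeated subexpression.
  def before(prices: Array[Double], discounts: Array[Double]): Double =
    prices.indices
      .filter(i => discounts(i) > discountLimit)
      .map(i => prices(i) * (1.0 - discounts(i)) + prices(i) * (1.0 - discounts(i)) * 0.1)
      .sum

  // After: one fused loop, the constant 0.05 propagated, the discounted price computed once.
  def after(prices: Array[Double], discounts: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < prices.length) {
      if (discounts(i) > 0.05) {
        val discounted = prices(i) * (1.0 - discounts(i))
        sum += discounted + discounted * 0.1
      }
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val p = Array(100.0, 200.0); val d = Array(0.1, 0.01)
    println(before(p, d) == after(p, d))  // true
  }
}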

In an embodiment, the optimizer 103 is further configured to perform further high-level optimization steps of the high-level language query, in particular predicate pushdown optimization. Predicate pushdown optimization moves filtering operators towards the leaves of the query execution tree, i.e. as close as possible to the data sources, to reduce the amount of processed data.

In an embodiment, the optimizer 103 is configured to generate the optimized query in the form of a high-level language optimized query. An optimized query can be provided by the optimizer 103, for instance, in Scala or C++, which allows for further compilation of the high-level language optimized query.

In an embodiment, the configuration of the repository 107 is defined by the configuration of tables of data within the repository. In a possible implementation form the tables of data in the repository could be configured as "LocalRowTable", "LocalPairTable", "ShardedRowTable" or "ShardedPairTable", as will be described in more detail in the context of figures 4 and 5 further below. These different configurations can be defined using type declarations in, for instance, the Scalan framework.

In an embodiment, the optimizer 103 is further configured to obtain information about the configuration of the repository 107 from data (or metadata) stored along with the data of the repository 107 or from a separate metadata repository.

In an embodiment, the data can be distributed between the nodes of the repository 107 by range partitioning, round robin, hash partitioning and the like.

In an embodiment, the optimizer 103 is further configured to optimize the high-level programming language query on the basis of a configuration of the repository 107 to generate an optimized query by generating at least two queries for at least two different configurations of the repository 107 and by using a weighted average of the at least two queries for generating the optimized query. A possible method for selecting the best configuration is disclosed in PCT/RU2015/00020. Instead of a weighted average it is also possible to generate the optimized query by optimizing a cost function, such as by using a weighted linear functional, quadratic, or other mathematical models of optimization.

Figure 2 shows a schematic diagram illustrating steps of a method 200 of operating a data processing device, for instance the data processing device 100 shown in figure 1, according to an embodiment. The method 200 comprises a step 201 of converting the query into a high-level language query and a step 203 of optimizing the high-level language query on the basis of a configuration of the repository to generate an optimized query.

In the following, further implementation forms, embodiments and aspects of the data processing device 100 and the method 200 will be described.

By way of example, figure 3 shows a schematic diagram illustrating different aspects of the present invention according to an embodiment. As a first step, a query in standard SQL can be generated. Then a database configuration can be defined using the Scala SQL DSL (domain-specific language). The code of the query and the database configuration can be mixed together, as will be described in more detail further below. A staged evaluation of this code produces a query execution plan that can be represented as a graph-based intermediate representation. As a result, this execution plan is a specialized version of the original plan in which the specifics of the particular database configuration are taken into account. This plan can then be further optimized by applying various high-level query optimization rules, such as predicate pushdown. Finally, lightweight modular staging (LMS) can be used to generate Scala code free from any high-level abstractions, and low-level optimizations can be applied, such as constant propagation, common subexpression elimination, code fusion and the like.
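
The stages just described can be summarized as a pipeline. The following sketch uses hypothetical function signatures purely to make the data flow explicit; it is not the Scalan or LMS API:

// Hedged sketch of the overall pipeline of figure 3 (hypothetical signatures).
object PipelineSketch {
  type SqlQuery  = String
  type DbConfig  = String
  type GraphIR   = String
  type ScalaCode = String

  def toScalaDsl(q: SqlQuery): ScalaCode               = s"dsl($q)"
  def stagedEval(q: ScalaCode, cfg: DbConfig): GraphIR = s"ir($q, $cfg)"   // isomorphic specialization happens here
  def highLevelOptimize(ir: GraphIR): GraphIR          = s"pushdown($ir)"
  def lmsCompile(ir: GraphIR): ScalaCode               = s"lowlevel($ir)"  // fusion, CSE, inlining, ...

  def compileQuery(q: SqlQuery, cfg: DbConfig): ScalaCode =
    lmsCompile(highLevelOptimize(stagedEval(toScalaDsl(q), cfg)))

  def main(args: Array[String]): Unit =
    println(compileQuery("SELECT ... FROM lineitem", "ShardedRowTable"))
}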

A further embodiment of the method 200 comprises the following steps. In an embodiment, a database schema is defined using SQL DDL (data definition language), i.e. tables and indexes are created. In an embodiment, queries can be written using SQL DML (data manipulation language). In an embodiment, a SQL-to-SqlDSL converter is used to convert these SQL statements to Scalan SqlDSL code. In an embodiment, this code uses Scalan SqlDSL abstractions, such as Table[T], PrimaryIndex[K,T], and the like. In an embodiment, the database configuration can be defined using Scalan SqlDSL and generated code. In an embodiment, a particular implementation for each declared table is provided and indexes are defined. In an embodiment, code is written in Scalan, which loads data into the database and executes the queries for this data. In an embodiment, the code from the previous step is evaluated by Scalan in a staged evaluation mode. In an embodiment, isomorphic specialization rules are configured to be applied during staged evaluation. As a result of the staged evaluation, the original query code is optimized with respect to the database configuration and converted to plain Scala code. Also during this step various high-level SQL optimizations can be applied (for example predicate pushdown). In an embodiment, such optimizations are defined as graph transformation rules in Scalan. In an embodiment, the final Scala code can be further transformed using lightweight modular staging (LMS). LMS can perform various low-level code optimizations and transformations, such as inlining, code fusion, dead code elimination, common subexpression elimination, loop unrolling, etc. In an embodiment, LMS can generate both Scala and C++ code.

As SQL is the de-facto standard for database access, embodiments of the present invention allow the database schema to be described and data queries to be written in standard SQL. Below is an example of a definition of a "lineitem" table with attached indexes, followed by the so-called Q1 query defined by the well-known TPC-H benchmark:

create table lineitem (
    l_orderkey integer,
    l_partkey integer,
    l_suppkey integer,
    l_linenumber integer,
    l_quantity real,
    l_extendedprice real,
    l_discount real,
    l_tax real,
    l_returnflag char,
    l_linestatus char,
    l_shipdate date,
    l_commitdate date,
    l_receiptdate date,
    l_shipinstruct varchar,
    l_shipmode varchar);

create index lineitem_pk on lineitem (l_orderkey, l_linenumber);
create index lineitem_order_fk on lineitem (l_orderkey);
create index lineitem_supp_fk on lineitem (l_suppkey);
create index lineitem_part_fk on lineitem (l_partkey);
create index lineitem_ps_fk on lineitem (l_partkey, l_suppkey);
create index part_pk on part (p_partkey);

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= 19981201
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus;

In an embodiment, the above query is translated by the SQL-to-Scalan converter into the following code:

def Q1(lineitem: Table[Lineitem]) = ReadOnlyTable(lineitem
  .where(r1 => (r1.l_shipdate <= toRep(19981201)))
  .mapReduce(r =>
      Pair(Pair(r.l_returnflag, r.l_linestatus),
        Pair(r.l_quantity,
          Pair(r.l_extendedprice,
            Pair((r.l_extendedprice * (toRep(1.0) - r.l_discount)),
              Pair(((r.l_extendedprice * (toRep(1.0) - r.l_discount)) * (toRep(1.0) + r.l_tax)),
                Pair(r.l_quantity,
                  Pair(r.l_extendedprice,
                    Pair(r.l_discount, 1)))))))),
    (s1: Rep[(Double, (Double, (Double, (Double, (Double, (Double, (Double, Int)))))))],
     s2: Rep[(Double, (Double, (Double, (Double, (Double, (Double, (Double, Int)))))))]) =>
      (s1._1 + s2._1, s1._2 + s2._2, s1._3 + s2._3, s1._4 + s2._4,
       s1._5 + s2._5, s1._6 + s2._6, s1._7 + s2._7, s1._8 + s2._8))
  .toArray
  .map(r =>
    Pair(r.head._1,
      Pair(r.head._2,
        Pair(r.tail._1,
          Pair(r.tail._2,
            Pair(r.tail._3,
              Pair(r.tail._4,
                Pair((r.tail._5.toDouble / r.tail._8.toDouble),
                  Pair((r.tail._6.toDouble / r.tail._8.toDouble),
                    Pair((r.tail._7.toDouble / r.tail._8.toDouble),
                      r.tail._8)))))))))))
  .orderBy(r => Pair(r._1, r._2))

In this example, "Lineitem" is the Scala type generated for the "lineitem" table, and Table[Lineitem] and ReadOnlyTable are Scalan DSL classes. The query is thus given the abstract table "lineitem" (abstract here means that the concrete implementation of the table might not be known at the time of writing the query) and returns the results of the query as a temporary table.

By way of example, figure 4 shows different isomorphic representations of an exemplary table of data that could be used for storing data in the database 107: a row-based table, a column-based table and a mixed table, which are all isomorphic to the abstract table shown at the top of figure 4. In practice, a given repository or database, such as the database 107 shown in figure 1, will use one of the exemplary representations shown in figure 4 or a further representation for storing data therein and thereby define the configuration of the database 107 in the sense of the present invention. Generally, the present invention is applicable to many different ways of distributing data between the nodes of the repository, such as range partitioning, round robin, hash partitioning and the like.

By way of example, the exemplary isomorphic representations of a data table shown in figure 4, which define different configurations of the database 107 in the sense of the present invention, can be defined in Scalan using type declarations as illustrated in figure 5. For instance, by selecting a column/row table representation and a sharded/local distribution model, the four configurations shown in figure 5 can be created using, for instance, Scalan SqlDSL. By way of example, these four different configurations can be defined in Scalan using the SQL DSL in the following way:

LocalRowTable: local row-oriented table

lazy val TPCH_Q1_hor_seq = fun { in: Array[Array[String]] =>
  val data = in.map(r => parseLineitem(r))
  val lineitem = ReadOnlyTable(data)
  Q1(lineitem)
}

LocalPairTable: local column-oriented table

lazy val TPCH_Q1_ver_seq = fun { in: Arr[Array[String]] =>
  val data = in.map(r => parseLineitem(r))
  val lineitem = createLineitem("Lineitem").insertFrom(data)
  Q1(lineitem)
}

ShardedRowTable: partitioned row-oriented table

lazy val TPCH_Q1_hor_par = fun { in: Arr[Array[String]] =>
  val nJobs = 4
  val data = in.map(r => parseLineitem(r))
  val lineitem = ShardedTable.create[Lineitem]("dist", nJobs,
    (node: Rep[Int]) => Table.createShard[Lineitem](node),
    (r: Rep[Lineitem]) => r._1.hashcode).insertFrom(data)
  Q1(lineitem)
}

ShardedPairTable: partitioned column-oriented table

lazy val TPCH_Q1_ver_par = fun { in: Arr[Array[String]] =>
  val nShards = 4
  val data = in.map(r => parseLineitem(r))
  val lineitem = ShardedTable.create[Lineitem]("dist", nShards,
    (node: Rep[Int]) => createLineitem(node.toStr),
    (r: Rep[Lineitem]) => r._1.hashcode).insertFrom(data)
  Q1(lineitem)
}

In the above example, the function "ReadOnlyTable(data)" creates a horizontal, non-sharded, read-only table with the given input data (provided as a Scala array). The function "createLineitem" is generated by the SQL-to-Scalan converter and constructs a vertical representation of this table. The function "ShardedTable.create" creates a sharded table with a given number of shards and a specified sharding key, i.e. the key by which records are distributed between shards. In an embodiment, the nodes of sharded tables can have a horizontal or a vertical representation.

In an embodiment, given the original query and the concrete configuration of the repository 107, a framework such as Scalan can be used to optimize the code on the basis of this concrete configuration. In an embodiment, various SQL optimizations, of the kind typically performed by database optimizers, can be applied during this stage. In an embodiment, the SQL optimizations can be defined as Scalan transformation rules. In an embodiment, the SQL optimizations comprise a predicate pushdown optimization. In such an embodiment, if the query performs a join of several tables and there are filter conditions on the content of the resulting table, then the predicates should be checked as early as possible, reducing the amount of data produced at each stage of query processing. An example of such a query optimization is illustrated in figure 6, where the box in the middle contains the non-optimized query code and the box at the bottom of figure 6 contains the optimized query code.

In an embodiment, a framework such as Scalan can eliminate high-level abstractions from the query code during isomorphic specialization, leaving only operations with basic Scala collections. In an embodiment, further compilation steps can follow. In an embodiment, unlike general-purpose compilers, which produce machine code, a framework such as Scalan can produce high-level Scala code which is represented using a graph-based intermediate representation (IR). This IR can be further optimized. In an embodiment, lightweight modular staging (LMS) can be used for final processing, but other compilation frameworks can be used as well. In an embodiment, LMS can perform several domain-specific optimizations. By using information about the domain of the program, LMS can perform optimizations that could not be performed by a standard Scala compiler. These optimizations can include aggressive inlining, code fusion, loop unrolling, data structure transformations and the like. Moreover, LMS can perform constant propagation as well as dead code and common subexpression elimination. Figure 7 shows an example of an original query code and a final query code processed by LMS.

The performance of exemplary embodiments of the present invention has been compared with the respective performance of existing commercially available databases and research projects like LegoBase. As a benchmark, the well-known TPC-H decision support benchmark has been chosen, which consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have a broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity and give answers to critical business questions. A scale factor of 1 has been used, which generates tables of about 1 GB. The results of executing the query Q1 are given in the table shown in figure 8, wherein the last four entries of the table illustrate the respective performance of different embodiments of the present invention.

Figure 9 shows a table illustrating the respective performance, i.e. the execution time of the query in milliseconds, of different embodiments of the present invention for different queries and different configurations of the database 107. Embodiments of the invention allow straightforward testing of the efficiency of many different configurations for a particular set of queries and an automatic selection of the configuration resulting in the best query performance, both for each query individually and for a combination of several queries, i.e. on average. The label "row+seq" can refer to horizontal or columnar tables with sequential processing; correspondingly, "row+par" can refer to parallel processing. The queries follow the definitions provided in the TPC-H benchmark.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with their derivatives, may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless of whether they are in direct physical or electrical contact or are not in direct contact with each other.

Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.