BATCHED QUERY PROCESSING AND OPTIMIZATION

Title:

BATCHED QUERY PROCESSING AND OPTIMIZATION

Document Type and Number:

WIPO Patent Application WO/2020/206288

Kind Code:

Abstract:

Disclosed are various embodiments for batched query processing and optimization in database management systems. A single algebraic expression is generated based at least in part on applying equivalence rules to algebraic expressions for a plurality of database queries of a database comprising a set of relations. The equivalence rules involve relational operators comprising Psi (Ψ) operators. The database can be queried using a single database query to create a result that is equivalent to the plurality of database queries.

Inventors:

TU YICHENG (US)
ESLAMI MEHRAD (US)

Application Number:

PCT/US2020/026624

Publication Date:

October 08, 2020

Filing Date:

April 03, 2020

Export Citation:

Click for automatic bibliography generation Help

Assignee:

UNIV SOUTH FLORIDA (US)

International Classes:

G06F16/24; G06F16/00; G06F16/245

Foreign References:

US20150220597A1	2015-08-06
US6032144A	2000-02-29
US20110055199A1	2011-03-03
US20140201166A1	2014-07-17
US6748392B1	2004-06-08

Attorney, Agent or Firm:

MULLIN, Gregory J. (US)

Download PDF:

View/Download PDF PDF Help

Claims:

CLAIMS

Therefore, at least the following is claimed:

1. A system comprising:

a database;

at least one computing device in communication with the database, the at least one computing device being configured to at least:

obtain a plurality of database queries;

generate a single expression based at least in part on the plurality of database queries; and

query the database based at least in part on the single expression to return a global relation corresponding to relations for the plurality of database queries.

2. The system of claim 1 , wherein the single expression comprises data necessary for the relations for each of the plurality of database queries.

3. The system of claim 1 , wherein the single expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

4. The system of claim 1 , wherein the at least one computing device is further configured to at least:

modify at least one aspect of at least one table in the database; and

generate a second single expression corresponding to a second plurality of database queries by applying equivalence rules for the plurality of second database queries.

5. A method comprising:

obtaining, via at least one computing device, a plurality of database queries;

generating, via the at least one computing device, a single expression based at least in part on the plurality of database queries; and

querying, via the at least one computing device, a database based at least in part on the single expression.

6. The method of claim 5, wherein the single expression comprises data necessary for each of the plurality of database queries.

7. The method of claim 5, wherein the single expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

8. The method of claim 5, further comprising:

modifying at least one aspect of at least one table in the database; and

generating a second single expression corresponding to a second plurality of database queries by applying equivalence rules for the plurality of second database queries.

9. A system comprising:

a data store comprising a set of relations;

at least one computing device in communication with the data store, the at least one computing device being configured to at least:

obtain a plurality of database queries associated with the set of relations, each of the relations comprising attributes;

generate, based at least in part on the set of relations, a single algebraic expression whose result is a global relation comprising a set of tuples; and

provide each tuple in the global relation to a plurality of filters to generate output relations corresponding to the plurality of database queries.

10. The system of claim 9, wherein the plurality of database queries are registered in a database management system.

11. The system of claim 9, wherein the plurality of database queries are registered to be executed concurrently.

12. The system of claim 9, wherein at least one of the plurality of database queries is associated with an algebraic expression that returns a vector of the relations.

13. The system of claim 9, wherein the single algebraic expression is associated with a global relation comprising data necessary for the plurality of database queries.

14. The system of claim 13, wherein the data does not include any attributes unless the attributes have been used in the plurality of database queries.

15. The system of claim 13, wherein there is not any row of the data that is not used by the plurality of database queries.

16. The system of claim 9, wherein the single algebraic expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

17. The system of claim 16, wherein the equivalence rules involve a plurality of relational operators comprising a plurality of ip operators.

18. The system of claim 9, wherein the at least one computing device is further configured to at least modify at least one aspect of at least one table in the data store.

19. The system of claim 9, wherein the at least one computing device is further configured to at least generate a single database query based at least in part on the single algebraic expression.

20. The system of claim 19, wherein the at least one computing device is further configured to at least query the data store using the single database query to return data necessary for the plurality of database queries.

Description:

BATCHED QUERY PROCESSING AND OPTIMIZATION

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims priority to and the benefit of, U.S. Provisional Application No. 62/828,834, filed on April 3, 2019, entitled“BATCHED QUERY PROCESSING AND OPTIMIZATION,” the entire contents of which is hereby incorporated herein by reference.

GOVERNMENT LICENSE RIGHTS

[0002] This invention was made with government support under project IIS- 1253980 awarded by National Science Foundation. The government has certain rights in the invention.

BACKGROUND

[0003] Conventional query processing for database management systems (DBMSs) can follow a one-query-at-a-time query processing model, in which queries in a workload are optimized and executed largely in an independent manner. Design of some DBMSs provide that query results should be combined together using the union (u) operator resulting in systems that exclude operands that are not union compatible. New approaches are needed that can cover optimization techniques adopted in existing batch processing systems while providing for new optimization opportunities.

SUMMARY [0004] A system can include a database and a computing device in communication with the database. The computing device can be configured to obtain a plurality of database queries, generate a single expression based at least in part on the plurality of database queries, and query the database based on the single expression to return a global relation corresponding to relations for the plurality of database queries.

[0005] A method can include obtaining, via a computing device, a plurality of database queries. The method can also include generating, via the computing device, a single expression based on the plurality of database queries, and querying, via the computing device, a database based at least in part on the single expression.

[0006] A system can include a data store comprising a set of relations, and a computing device that is in communication with the data store. The computing device can be configured to obtain a plurality of database queries associated with the set of relations, with each the relations comprising attributes. The computing device can be further configured to generate, based at least in part on the set of relations, a single algebraic expression whose result is a global relation comprising a set of tuples, and provide each tuple in the global relation to a plurality of filters to generate output relations corresponding to the plurality of database queries.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] For a more complete understanding of the embodiments and the advantages thereof, reference is now made to the following description, in conjunction with the accompanying figures briefly described as follows: [0008] FIG. 1 is a diagram of a query processing system model with

T _l T ₂, ...., Tm being input relations and P _l P ₂, ..., P _n being output relations according to various embodiments of the present disclosure.

[0009] FIG. 2 illustrates a resulting relation of a full outer join between two tables and T ₂ according to various embodiments of the present disclosure.

[0010] FIG. 3 is a graph of a speedup of the present disclosure over SQL Server under different database sizes and query numbers according to various embodiments of the present disclosure.

[0011] FIG. 4 is a graph of the time for processing Q and tuple distribution under SF100 according to various embodiments of the present disclosure.

[0012] FIG. 5 is a graph of a speedup of the present disclosure over MySQL according to various embodiments of the present disclosure.

[0013] FIG. 6 shows two forms of consecutive full outer joins over three tables and the seven types of tuples in them according to various embodiments of the present disclosure.

[0014] FIG. 7 shows components needed for recovering table T ₂ from the two resulting tables shown in FIG. 6 according to various embodiments of the present disclosure.

[0015] FIG. 8 shows graphs of execution time of the present disclosure and SQL Server under different database sizes and query numbers according to various embodiments of the present disclosure.

[0016] FIG. 9 is a graph of the time for processing Q and tuple distribution under SF1 according to various embodiments of the present disclosure.

[0017] FIG. 10 is a graph of the time for processing Q and tuple distribution under SF10 according to various embodiments of the present disclosure. [0018] FIG. 11 A is a schematic block diagram of a system and networked environment according to various embodiments of the present disclosure.

[0019] FIG. 11 B is a schematic block diagram that illustrates an example computing device employed in the various embodiments described herein.

[0020] FIG. 12 illustrates an example flowchart of certain functionality implemented by portions of a computing device in the system and networked environment of FIG. 11 A according to various embodiments of the present disclosure.

[0021] FIG. 13 illustrates an example flowchart of certain functionality implemented by portions of a computing device in the system and networked environment of FIG. 11 A according to various embodiments of the present disclosure.

[0022] The drawings illustrate only example embodiments and are therefore not to be considered limiting of the scope described herein, as other equally effective embodiments are within the scope and spirit of this disclosure. The elements and features shown in the drawings are not necessarily drawn to scale, emphasis instead being placed upon clearly illustrating the principles of the embodiments. Additionally, certain dimensions may be exaggerated to help visually convey certain principles. In the drawings, similar reference numerals between figures designate like or corresponding, but not necessarily the same, elements.

DETAILED DESCRIPTION

[0023] In the following paragraphs, the embodiments are described in further detail by way of example with reference to the attached drawings. In the description, well known components, methods, and/or processing techniques are omitted or briefly described so as not to obscure the embodiments. As used herein, the“present disclosure” refers to any one of the embodiments described herein and any equivalents. Furthermore, reference to various feature(s) of the “present embodiment” is not to suggest that all embodiments must include the referenced feature(s).

[0024] Among embodiments, some aspects of the present disclosure are implemented by a computer program executed by one or more processors, as described and illustrated. As would be apparent to one having ordinary skill in the art, one or more embodiments may be implemented, at least in part, by computer-readable instructions in various forms, and the present disclosure is not intended to be limiting to a particular set or sequence of instructions executed by the processor.

[0025] The embodiments described herein are not limited in application to the details set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced or carried out in various ways. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having" and variations thereof herein is meant to encompass the items listed thereafter, additional items, and

equivalents thereof. The terms "connected" and "coupled" are used broadly and encompass both direct and indirect connections and couplings. In addition, the terms“connected” and“coupled” are not limited to electrical, physical, or mechanical connections or couplings. As used herein the terms“machine,” “computer,”“server,” and“workstation” are not limited to a device with a single processor, but may encompass multiple devices (e.g., computers) linked in a system, devices with multiple processors, special purpose devices, devices with various peripherals and input and output devices, software acting as a computer or server, and combinations of the above.

[0026] Systems and methods are discussed herein that include generating a single expression that corresponds to a set of database queries. The single expression includes a set of relations containing all of the data necessary for the set of database queries. Equivalence rules can be applied to algebraic expressions to convert the set of database queries into a single expression. A computing device can receive the set of database queries from one or more computing devices. The database can be queried using the single expression.

If the database is modified, a single expression can still be generated for sets of database queries. This is due to the use of equivalence rules and algebraic expressions.

[0027] Techniques based on sharing data and computation among queries have been an active research topic in database systems. While work in this area developed algorithms and systems that have shown to be effective, there is a lack of logical foundation for query processing and optimization. This disclosure presents a system model for processing a large number of database queries in a batch. One example of the system model or system is referred to herein as PsiDB. A key idea is to generate a single query expression that returns a global relation containing all of the data needed for individual queries. For that, use of a group of relational operators called ^-operators is proposed. The ^-operators can be used to combine the individual queries into the global expression and study their features. The algebraic optimization problem in the system is tackled by developing equivalence rules to transform concurrent queries with the purpose of revealing query optimization opportunities. Centering around the y- operator, the disclosed rules cover many optimization techniques adopted in existing batch processing systems, and reveal new optimization opportunities. Also, a vision for the development of a query optimizer following the disclosed strategy is presented. Experiments conducted on the system showed a performance improvement of up to 36X the performance over a mainstream commercial DBMS.

[0028] Traditional database management systems (DBMSs) follow a one- query-at-a-time query processing model, in which queries in a workload are optimized and executed largely in an independent manner. Today’s database systems often need to serve a multitude of user queries in a short period of time. For example, in a typical Online Analytical Processing (OLAP) system it can be desirable to process a workload at the 10,000 query per second (qps) level or above. Under such high demand, the one-query-at-a-time model falls short in meeting performance requirements because it can easily lead to resource contentions. On the other hand, the database community observed that queries in a workload may have a significant overlap in the data access paths and intermediate query results. Plausible efforts to harness computation and data access sharing among concurrent queries have been reported. This disclosure refers to the many flavors of such work as concurrent database management systems (CDBMS), although the present disclosure is not necessarily limited to CDBMS.

[0029] In CDBMS’s, a main task in query optimization is essentially to find a global execution plan that combines the necessary computations to derive the results of multiple concurrent queries with the smallest resource consumption. With query optimization in traditional DBMSs already a very challenging problem, the involvement of multiple queries in CDBMSs adds extra difficulties. Therefore, related work in this field has focused on heuristics or rules to locate viable plans or sub-plans for query processing. CDBMS systems are shown to significantly outperform traditional relational database management systems (RDBMSs) in both throughput and query response time. While such work developed effective algorithms and systems, there is a lack of theoretical foundation for CDBMS query optimization. In particular, existing methodologies involve developing rules for physical plan generation and selection, without considering the

representation and transformation of the queries at the logical level. As a result, such systems can work well for workloads with a fixed query pattern.

[0030] Algebraic representation of queries is often seen as the first step toward query optimization. Following that, equivalence rules are used to transform the initial algebraic expression into others that also return the same results. Such rules are essential in deriving a rich set of query plans with different performance profiles to choose the next steps in query optimization. In this disclosure, it is argued that the rules adopted in traditional DBMSs are insufficient to capture optimization opportunities in current query processing.

One problem is that an algebraic expression, taking a set of relations as input, returns a single relation in traditional RDBMSs, however, in CDBMSs it returns multiple relations.

[0031] A system for query processing in CDBMSs is presented. The key idea is to generate a single query expression that returns a global relation containing all the data needed for individual queries. This disclosure studies the features of a group of relational operators called ^-operators for gluing the individual queries into the global expression.

[0032] Equivalence rules to transform algebraic expressions within the system framework are developed. Centering around the ^-operator, the disclosed rules not only cover many optimization techniques adopted in existing CDBMSs, but also reveal new optimization opportunities.

[0033] This disclosure introduces a system model (also referred to herein as a framework or the PsiDB Framework) for concurrent query processing including all the definitions and terminology that are used throughout the disclosure. Key differences between some conventional CDBMSs, such as DataPath and

ShareDB, are also highlighted below. ShareDB is sometimes referred to as SharedDB.

[0034] Multi-Query-One-Execution Model. A traditional RDBMS processes one query at a time and returns a relation as the query result. The query processor of a DBMS needs to transform a query written in a declarative language such as Structured Query Language (SQL) to a relational algebraic representation for further consideration in the query optimization stage. The same relational algebraic representation is needed for CDBMSs, the major difference is that there are more than one query to be executed at a time. The present disclosure is useful for CDBMS systems that organize queries into batches and each batch consists of all queries that arrive during a period of time. Let us first define notations to set up the theoretical framework.

[0035] Definition 1. A table schema R is a set of attributes, each of which defines a domain of values. A relation (e.g., table) r comprises a set of tuples (e.g., rows), each tuple t (denoted as t e r) has value following the schema R. [0036] Definition 2. A query q _t is a relational algebraic expression over a set of relations, and the result of a query is also a relation. The set q =

{q _l q ₂, , q _n} indicates the list of registered queries in the system to be executed concurrently.

[0037] Definition 3. An attribute is a column of a relation and is denoted as A _{k .} The resulting relation of query q _t contains a set of attributes L _t =

{A ₁,A ₂, ...,A _k}. Define a global attribute set containing all attributes in the resulting relations of all the n queries as L _q = u L ₂ ... u L _n.

[0038] A database contains a set of relations T = {T _l T ₂, ...., T _m}, each of which contains different attributes. Each query will take one or a combination of such relations as inputs. As in many relational algebraic studies, one can assume the relations in T are designed following at least the 1 st Normal Form.

[0039] Example 1. An example database is presented along with a few queries to be used for examples throughout the disclosure. The database consists of three tables: Part, PartSupp and Supplier. They are referred to hereafter as P, PS and S, respectively.

[0040] There exists the following foreign keys: PS.pkey P.pkey and PS.skey S.skey, and (pkey, skey) serves as the primary key of PS. The query workload consists of five queries shown in Table 2 below.

[0041] In a CDBMS, an expression returns a vector of relations (vs. one single relation in RDBMS) as output. However, the existence of multiple output relations makes mathematical reasoning difficult, as one may need to combine all the single query expressions into one. One approach is to introduce a global relation T, which is a single relation that contains all the data needed to derive the final results for all queries in the workload. Therefore, a goal is to transform the individual queries (e.g., expressions) q _lt q ₂, - - - , q _n into ^a single algebraic expression Q whose resulting relation is T. Each tuple in T will be distributed to filter functions that belong to individual queries to generate the final result for the latter.

[0042] Turning now to the drawings, exemplary embodiments are described in detail. With reference to FIG. 1 , shown is an example design of a system 100 according to various example embodiments. FIG. 1 is a diagram of a system 100 for concurrent query processing. The system 100 is also referred to herein as PsiDB or the PsiDB system model. The system 100 can include a data store or database comprising a plurality of relations, with T ₁, T ₂, .. . ., Tm being input relations and P1, P2, ..., Pn being output relations. Key components of the system 100 are as follows.

Ts s ti Bstisfess® ScfciiBsa

T&fefe 2:: Osieries iii sasap : wefeksasl

Combining Multiple Relations

[0043] With the concept of a global relation T, it is necessary to develop an approach to combine results of individual relational expressions into T.

Specifically, one can define a binary relational operator for such purposes.

Intuitively, such an operator: (i) should generate a global output relation T that contains all the data of the operands; (ii) one can apply a filtering function to recover the exact individual operands; (iii) Furthermore, to apply the operation to a series of operands, the order of such sequence should not change the recoverability of all operands from T. It turns out more than one relational operators satisfy such conditions. Let us first formalize the definition of such operators.

[0044] Definition 4. Given a binary relational operator O and for T = 7 O T ₂, the operator is a ^-family operator (or simply ^-operator) if all of the following conditions are satisfied:

(i) Operand Recoverability: There exist relational algebraic functions and f ₂ to recover the two input tables 7 and T ₂ exactly from T, e.g., one must have =

A (?) and T 2 = f2(J>) (ii) Weak Commutativity: Given P' = T ₂ O 7 , one can use the same functions /j and f ₂ to recover from

(iii) Weak-Associativity: The same functions f _{l t} f ₂, and f ₃ can be used to exactly recover T _{l t} T ₂, T ₃ from P = 7 O (T ₂ Q T ₃) and also from P ' = (T Q T ₂) Q T ₃.

[0045] Note that the common concepts of commutativity (e.g., T ₂ =

T ₂ O 7 ) and associativity (e.g., 7 O {T ₂ Q T ₃) = (7 Q T ₂) Q T ₃) are strong forms of conditions (ii) and (iii), respectively. In other words, when one reorders the sequence of operands of the ^-operator, a unique global relation is not required. Instead, some embodiments of this disclosure require that the same filter functions can be used to generate the original relations. Now individual members of the ^-family are studied.

[0046] Theorem 1. A Cartesian (or cross) product between two nonempty relations 7 and T ₂, defined as

i x T ₂ = {< t _lt t ₂ > I vti e 7\ A vt ₂ e r ₂}

is a ^-operator.

[0047] Proof. Operand Recoverability: suppose 7 x T ₂ = P and L _Tv L _T2 are the set of attributes for relations 7 and T ₂, respectively. The exact input relations of the product operator can be easily recovered from P as

[0048] Commutativity and Associativity: It is easy to see that cross product is (strong) commutative and (strong) associative.

[0049] Another example ^-operator involves the concept of outer join. Let us first start with some definitions. [0050] Definition 5. Selection condition Q on a relation r compares two atomic values

valid value for the attribute A _{t .}

[0051] Definition 6. A selection operation over a relation r is to apply a set of conditions as follows: s _q (r ) = {t 1 t e r A 0(t ) = TRUE}.

[0052] Definition 7. The (inner) join between two relations by a common condition c is

[0053] Definition 8. A left outer join between two relations is denoted as 7 xi _j where c is the join condition. Let L _Tl be the attribute list of table and let w ₂ = {(null, null )} be the singleton on all attributes of T ₂ , the left outer join can be described as follows:

[0054] Similarly, the right outer join can be defined as:

[0055] Definition 9. A full outer join is a union between both left and right outer join (definition 8) on the same relations:

[0056] Theorem 2. The full outer join ( X ) is a ^-operator.

[0057] Proof. Consider = 7 x] _c T ₂ with any general join conditions c (FIG. 2a). FIG. 2a depicts an outer join with general join conditions.

[0058] Operand Recoverability: The resulting table of a full outer join essentially consists of three parts: the inner join results, the tuples that do not join, and those of T ₂ that do not join. Since the schema of the resulting relation consists of all attributes in both tables, one only needs to project out the corresponding set of attributes for 7 (or T ₂) to recover (or T ₂). Formally,

where A _t is an attribute in L _Tl or L _TT The selection step after projection is to remove the tuples that contain null values in all attributes (e.g., unshaded regions in FIG. 2a).

[0059] Commutativity: By its definition, it is easy to show that full outer join is (strong) commutative: among the three sets of tuples in the resulting table, the inner join is known to be commutative; that also leaves the tuples from 7 and T ₂ that do not appear in the inner join unchanged when operands are switched.

[0060] Associativity: The full outer join is strong associative with certain assumptions made with respect to the join conditions. Flowever, this is not true under any general join conditions. It is proved that it is weak associative under any conditions.

[0061] Comparison: The way to combine queries together is a vital design decision of a CDBMS. Systems such as ShareDB and multiquery optimization (MQO) meet this challenge by combining query results with the union (u) operator. Flowever, this excludes operands that are not union compatible and thus is not as general a solution as ip. Certainly, there are practical issues with the use oUp, one major concern being the size of its output table. This disclosure develops techniques to reduce the size of the output table.

[0062] Size of Resulting Relation. With the ^-operator as a key concept in the disclosed framework, the size of its resulting table becomes an important query optimization consideration. In that sense, the cross product is in obvious disadvantage over outer join. Another disadvantage of the cross product is that neither operand can be an empty relation (Theorem 1 ). Therefore, unless specified otherwise, discussions herein will be focused on the outer join. An interesting observation is that one can control the size of outer join results via different join conditions. Specifically, the selectivity of the join condition determines how many times a tuple or T ₂ will appear in the final result (middle region shown in FIG. 2a). To reduce such redundancy, one can apply a dummy condition to disallow any joined tuples.

[0063] Definition 10. Given any two tables and T ₂, an anti-join condition Q ensures 7 rxi g T ₂ = 0, and 7 TXg T ₂ = 7 x w ₂.

[0064] In practice, the anti-join condition can be implemented as a system artifact or a real join condition that never returns true in any tuple (e.g., comparing an attribute with infinity).

[0065] Definition 11. For any two input tables, an outer join with join conditions that never returns a tuple in the inner join section is called an exclusive outer join; otherwise it is called an inclusive outer join.

[0066] In the remainder of this disclosure, unless stated otherwise, when an outer join without a join condition is mentioned, the intended meaning is that the reasoning is true under arbitrary conditions.

[0067] Another factor to consider is the size of the w values - the storage cost of a NULL value is 1 bit in modern DBMSs. Thus, tuples appear in the inner join section can potentially decrease the number of NULL values. In an extreme case, all tuples in and T ₂ participate in a 1 -to-1 join, and the size of the outer join is minimized (the same as the total size of both input tables). In practice, for equi-joins or more strictly, natural joins, one could store only one copy of the common attributes to further save storage space (FIG. 2b). FIG. 2b depicts an outer join with natural join conditions.

[0068] In considering the use of ip to combine different query expressions, the size of the final table can be effectively controlled via the adaptation of various equivalence rules in query optimization.

Filtering the Global Query

[0069] As shown earlier, the global query Q can be written as qi Y qi y y CJm º Q

[0070] For convenience, the PsiDB name is given to the disclosed framework of the system 100 to highlight the role the ip operator. Upon generating the global relation T, PsiDB can use a result transformation application to send the tuples of T to query filters associated with each original query q _t. The result transformation application will check each tuple with recovering functions and make on-the-fly decisions on what tuples/columns belong to its corresponding query result p _t. In some embodiments, the recovering functions can be predefined.

[0071] Comparison: In ShareDB and DataPath, each tuple in the global relation is associated with a bitmap describing which queries the tuple belongs to. Then the tuple is only multi-casted to corresponding queries. The system 100 does not need to tag the tuples with its destination queries, therefore the tuples can be broadcasted to all queries.

EQUIVALENCE RULES

[0072] An equivalence rule essentially transforms a relational algebraic expression to another logically equivalent expression. They enable different query execution plans to be generated and evaluated thus form the foundation for query optimization. This disclosure extends work in traditional relational algebra to consider equivalence rules related to ip operators and concurrent queries. In preparing this disclosure, popular equivalence rules that are widely adopted in traditional DBMSs were studied. Based on such rules, new rules that work particularly for batched queries were developed. The disclosed rules can cover optimization techniques used in SharedDB and DataPath, and also provide extra optimization paths that are not found in such systems.

[0073] Definition 12. Suppose a relational algebraic expression £ ₁ contains a series of ip operations over multiple operands, e.g., £ ₁ = e ₁^e ₂^ ... e _m, the expression is said to be weakly equivalent to another expression £ ₂ , denoted as £ ₁ £ ₂ , if all the operands e _it e ₂, ... e _m of £ can be exactly recovered from the resulting relation of £ ₂.

[0074] Via weak equivalence, an expression can be transformed into one that, although does not derive the same exact resulting relation, carries all the data and a means to recover the operands (of the original expression). Note that such transformations are not allowed in traditional DBMSs. In practice, it is preferred that a common function be used to recover the operands from both £ and £ ₂ in the disclosed equivalence rules. The following definitions are needed for describing the rules.

[0075] Definition 13. Suppose a query q _t applies a selection with a set of conditions in conjunctive form (denoted as Q{) over a relation. Such selection conditions in all n concurrent queries over a relation is then denoted as Q _t =

Q _c u 6> ₂ u ... u Q _h.

[0076] Example 2. The conditions sets for query workload as shown in Table

2 are as follows. Q-, — {{ retprice <40} }

02— {(supprice<retprice),(supprice<4it}JC{s ps _t (Qps.pfi}

03 ” { ( s itpp rice < retprice), (ai*iqtij>$ (Q^p$ ₎),iC{ps.P\)} q ₄ = {(supprice < re t price), (state-FL), (Qs _{, p}s)MQ PS. P _! )} q ₅ = {(state FI), (C _iPS,s)) }· where C(A,B) represents equality join conditions set on foreign keys between tables A and B for easier presentation, e.g., C(S,PS) = {S.skey = PS.skey}. The total set of conditions for all 5 queries is

0·/· ^“ { ( ^' r e tpr i <: e < 40), (s u pp r i ce < r etp r i e e),

Definition 14 The common selection conditions among aM n concurrent queries over a relation is denoted a 0? = O _¾ q·.

We also defin e Q) as the set of selection conditions for a query % (over a relation} that is not found in Q _/, Le., Q, = © _f· - &i .

Definition 15. We can write an expression ¾· with ail mem bers of in a conjunctive form, ie. _t ¾ = A«H( )· the same way we define mi expression QI - A iii®',)· Furthermore, for

Rule 1 : Sequence of Selections

[0077] In traditional DBMSs,

[0078] In other words, a selection over a relation can be broken down to conjunctive conditions. Furthermore, and this essentially means the conditions can be applied to the relation one by one in any order. Such rules reveal the high level of flexibility in reordering different selection conditions over the same relation of the same (only) query. This is beneficial to query optimization as it provides a rich set of computational paths (e.g., execution plans) with different costs for the same set of selections. For example, a common practice is to process conditions with higher selectivity first to reduce the size of intermediate results. This disclosure is interested in the flexibility of doing the same while processing batched queries. First, observe that reordering of selection conditions cannot be done freely across different queries. In other words,

and the above is true for any ordering of s _{q ί} on the right-hand side (RFIS).

Fortunately, the following theorem shows that there is room for rearranging the selection conditions among concurrent queries.

[0079] Theorem 3 (Sequence of Selections). For n concurrent queries, each of which applies a selection to the same table r,

[0080] Proof. By the definition of weak-equivalence one needs to show that each operand on the left-hand side (LHS) must be recoverable from the resulting relation of both sides. Let us recall that table r follows schema R. For the LHS, its resulting table T is a wide relation with repetitive occurrences of R in its schema, one for each operand. Let us denote such occurrences as R _±, R ₂, · · · , R _n, one can recover each individual operand via

Note that the above is true for both cross product and full outer join under any join conditions. [0081] For the RHS, denote its resulting table as T', it is shown that one can use the function f( ( T' ) = s _{q ί} ( T' ) to recover all operands of the LHS as follows.

The recoverability of all operands of LHS concludes the proof.

[0082] From the above definitions, the selection conditions over a relation for query i can be written as ft = L _bίί{Q;} = (L _OHQ;) L (L _bίίq/) = q:' L q; (/ή

[0083] Therefore, the RHS of Eq. (5) becomes ^&¾ ^L¾ · ^? and one can follow the traditional equivalence rules to make the following claims.

[0084] Corollary 1. The selection conditions within Q ₁ can be applied sequentially on the relation r, according to Eq. (5).

[0085] Corollary 2. The selection condition sets 0 _D and 0 ₇ can be applied sequentially, and in any order. f ^f¾A¾,tr) = ^£¾,(<¾( ^? )} = ¾(ø¾(* ^*>}

[0086] Example 3. In one example database, taking a subset of the queries q = { q ₂ , q ₃, q ₄} as the workload, one gets

{{sfafe = FL}\ therefore the non-common conditions are

§ _D = {( pprice > 40) V (avlqty > 0) V (state = FL}}

[0087] Significance: Rule 1 shows an important way to reduce the size of the results of the ip operator. Consider the full outer join as ip, the total size for the

LHS of Eq. (5) is /2(|r|n) where |r| denotes the number of tuples (cardinality) of table r. By following Theorem 3, one gets what is essentially the union among different subsets of r, and the total size is 0(|r|). This rule was also applied in ShareDB.

[0088] The above two corollaries show that the query optimizer can try either way of ordering 0 _D and 0 _{7 .} Corollary 1 says that reordering the (conjunctive) terms in 0 ₇ will also lead to plans of a different cost. Note that the existence of non-empty 6> ₇ is very common in real-world workloads. For example, in TPC-H, all joins are done based on equality of foreign keys, as shown in the example database.

Rule 2: Sequence of Projections

[0089] In traditional DBMSs,

This means in a query with a sequence of projections over a relation, only the last one matters. If one considers the projections on the same relation from concurrent queries, the rule is of a different form.

[0090] Theorem 4 (Sequence of projections). For n concurrent queries, each applies a projection over the same table r,

where L _t is a subset of attributes of r required by query q _{t .}

[0091] Proof. Show that, for any operand n _Li (r) of the LHS, all its tuples will be found in the resulting table (named ') of RFIS. While ' may drop some rows out of r due to the generation of duplicates, any duplicate in it is also a duplicate in n _Li (r) as L _t is a subset of the attributes of '. Given that, n _Li (r) can be exactly recovered from ' by simply applying the projection using its own attribute set: 1¾ίί·) = /H ') = ¾{?>'), Vi

[0092] In fact, the same function / can be used to recover the operands from the resulting relation of the LHS.

[0093] Significance: Similar to Rule 1 , one can also avoid the large size resulted from a series of ip operations over different parts of the same table. In Rule 5, this idea is extended to more general situations. Note that neither ShareDB nor DataPath applied optimization on projection like Rule 2.

[0094] Example 4. Referring to the example database, one can write another query similar to q on table PART.

q[ SELECT pkey, mfgr, retprice FROM part P WHERE mfgr=mf2

[0095] When q = {q _l qr·}, the projected attributes for each of the queries are - _L = {pkey, name, retprice}, L} = {pkey, mfgr, retprice}; then the RHS returns the table PART with Lq = L ₁ U L} = {pkey, name, mf gr . retprice} attributes.

Rule 3: Distributive Selection

[0096] In traditional DBMSs, selection is distributive over join if the selection conditions can be grouped into two parts, each of which involves attributes in only one relation, e.g.,

where c is any join condition and all the conditions in q and q ₂ involve attributes only in and T ₂, respectively. There are similar findings regarding ip.

[0097] Theorem 5 (Distributive Selection over ip). With the same setup shown in Eq. (9),

ϊ¾L¾ ίTiyT^) = (<"¾ A Tv rVi [0098] Proof. First, both LHS and RHS result in relations with the same schema containing all attributes of both 7 and T ₂. Second, one sees that only tuples that pass selection condition will appear in the resulting tables of both LHS and RHS. For full outer join, the resulting table on both sides have three types of tuples (FIG. 2a): T ₁w ₂, T^, and w^. Due to the same join conditions applied for LHS and RHS, one has the same set of tuples for all three types.

[0099] The above results can be extended to more than two tables.

[00100] Corollary 3. For a series of tables 7 \, T ₂, . . . ,T _n and is a set of join conditions that only involve attributes in table T _{i t}

[00101] The proof can be achieved by reduction given the proof of Theorem 5.

[00102] Significance: Generally, when one applies the ip operator first, one gets a larger relation than if one applies the single-table selection followed by the y operator. This technique is popular in modern DBMSs, and was also adopted by ShareDB. In the system 100, it can also be used to transform an intermediate expression.

[00103] Example 5. In the example database, if q = {q ₂, q ₃, q ₄} and Q = {(supprice < retprice ), ( supprice < 40), ( avlqty > 0), ( state =

FL), C(P, PS), C(PS,S)}·, the selection can be distributed as

Rule 4: Distributive Projection

[00104] In traditional DBMSs, a projection is distributive over joins.

where L _t is a subset of the attributes of table 7 The above rule is a common query optimization strategy by filtering out attributes that will not be in the final relation T. In CDBMS, the following theorem shows that projection is also distributive over the ip operator.

[00105] Theorem 6 (Distributive projection). Suppose one has n concurrent queries, and query q _t projects L _t out of table T _{t .} If one considers an exclusive outer join as the ip operator,

[00106] Proof. Firstly, both sides generate a relation with the exact same schema u ₌₁ *. When one considers the exclusive outer join, one can see that the same set of tuples will also be generated on both sides. For such a claim, it suffices to show it for any two tables and T ₂, then the results can be easily extended to n tables by reduction. Specifically, for both LHS and RFIS, all tuples of both 7 and T ₂ will show up in the final result. By applying the exclusive outer join, each tuple in 7 will appear once (with w ₂) for both LHS and RFIS. This concludes the proof.

[00107] Eq. (10) does not hold true for an outer join with arbitrary conditions (e.g., inclusive outer joins). This is because different tuples could appear in the inner join section of the final result. In particular, for the RFIS of Eq. (10), the join condition can only be applied to attributes in u ₌₁ L _t. For the LHS, the join condition could include attributes that are not in u ₌₁ L _t. Therefore, the total number and content of tuples could be different. The applicability of Theorem 6 can be extended as follows. [00108] Corollary 4. By considering inclusive outer join as the y operator, Eq. (10) holds true only if all the join conditions of LHS involve attributes that are only found in u ₌₁ *.

[00109] Significance: Similar to Rule 3, optimization opportunities exist when performing the projection on individual tables first and then start the outer join - one avoids handling a large intermediate relation. More details of this can be found in Eq. (13), below. This rule was never mentioned in ShareDB and

DataPath.

Rule 5: Skeleton Join

[00110] Using ip to combine multiple queries could result in a relation that is excessively large. The following theorem shows one way to effectively control the size. Unlike previous rules, this disclosure considers more general Selection- Projection-Join (SPJ) queries.

[00111] Theorem 7 (Skeleton Join). Suppose one has n concurrent queries, each of which involves a join among a subset of tables of the database HqGq s _{q ί} could contain conditions related to any table’s attributes. Let T be the collection of all tables involved in the n queries, e.g., T = u _Vi T _l and C be the disjunctive form of all the conditions applied in individual queries, e.g., ί = q ₁ n ¾ n · · · n q„,

when x _c is a full outer join with C as the join condition.

[00112] Proof. Show the LHS operands are recoverable from the RHS results

7 =

via two cases. Let us first define , which is the cross product of all tables accessed by the n queries. [00113] Case 1 : Consider a special case in which the same set of tables are accessed by all queries, e.g., = T ₂ = = T. By directly applying Rule 1

(Theorem 3),

[00114] Consider the full outer join as ip, the resulting table of RHS, denoted as T, consists of two different sets: (i) all tuples in o _cZ, in which there is no w values; (ii) those with w values as a result of some 7} not joining with at least one other table. One can fully recover the LHS operands via the following functions:

[00115] Note that the function f _t will return no tuples by working on subset (ii) of T, as the selection condition will exclude any tuple with a w value in it. On the other hand, one can recover all tuples needed from subset (i) since ^s _qi ( ^acZ) = s _{q ί}Z.

[00116] Case 2: Consider the general case in which there are queries that only access a subset of T, e. g. , 3j, T _j <º T.

[00117] For such a query q _j, one can generate a different query

[00118] In other words, one can concatenate each tuple in the result of q _j with NULL values to generate a relation that still follows the schema of Z. With this transformation, one can follow the same reasoning in Case 1 to show that Eq.

(11 ) holds true if one replaces q _j with q _j' on the LHS, meaning q _j' can be recovered from the RHS results. Given q _j', one can easily recover q _j by projecting out the attributes that belong to q _j (e.g., non-NULL attributes).

[00119] Significance: Rule 5 carries significant value in query optimization. First, it shows that if a table is accessed by many queries (via joins), there is no need to join it many times (according to the definition of ip). The RHS of Eq. (11 ) is a formula in which each database table only appears once regardless of the number of queries. Since the results of every query can be recovered by the RHS of Eq. (11 ), call its resulting table the skeleton table.

[00120] An interesting observation is that Case 1 still works if one considers cross product as the ip operator, and ip _c on the RHS is essentially an inner join, which yields a smaller resulting table as compared to using the outer join as ip in those cases. This is clearly an optimization path.

[00121] Case 2 in the above proof shows that, when a query accesses only a subset of the tables in T, one could piggyback it to the skeleton table in query processing. And the full outer join has to be used to recover the results of the piggybacked q _j queries. Plus, one can work with the outer join to reduce the skeleton size as follows.

[00122] Reducing Size of Skeleton Tables: For q _j mentioned in Case 2 of the proof of Theorem 7, there is potentially large storage overhead - in the skeleton, each tuple in q _j will join with all other qualified tuples from all tables that q _j does not access (e.g., those in T - T _j). To reduce that cost, one needs to (or preferably can) use the anti-join condition mentioned in Definition 10 to disallow such joined tuples. [00123] With that, for aforementioned query q _{j t} replace its join condition with 6 _j L Q. As a result, its tuples will not appear in the inner join segment of the skeleton table. Instead, they will only appear in the outer join segment in the format of q _j' as mentioned above. Revisiting Eq. (11 ), replace the RHS join condition C = ^{/ ΐ? 1} ' . It is easy to see that the existence of Q conditions does not change the correctness of Eq. (11 ) - it just gets a smaller skeleton table.

[00124] Example 6. Using the example database and queries and considering the workload of q = {q _lt q ₂},

[00125] By applying Rule 5, one gets an RHS of by distributing the conditions in C to the relevant tables. Specifically, ^ = {( retprice < 40 L

QUERY OPTIMIZATION IN PSIDB

[00126] The present disclosure can provide query optimization with a focus on algebraic optimization techniques. Examples below also briefly comment on optimization at the physical plan level.

Algebraic Optimization Towards Q

[00127] With the introduction of full outer join as a cost-effective ip operator, as well as the skeleton joins given by Theorem 7, many algebraic optimization can start from the following recommended expression for a workload.

[00128] Following the notations in Theorem 7, for n concurrent queries q ₁, q ₂, . . . , q _n, and each query follows the SPJ pattern, e.g., q _t = fl _Li ^s _bi ( ^cnt _ETί T), the global query can be rewritten by starting from Rule 5 and then applying Rule 2. The expression is

where L _t is the set of attributes requested by query q _t in its resulting table, and L _c is the set of attributes involved in condition set C. Note that the RHS of Eq.

(12) is basically that of Eq. (11 ) with extra projections for all the attributes needed by the queries. One can also project out the attributes in C for recovering the individual query results. It is easy to see that the projection does not change the correctness of Theorem 7 (Rule 5).

[00129] Eq. (12) serves as one starting point for query optimization in the system 100. It can be made more general by considering the following two scenarios.

[00130] Corollary 5. Modify Eq. (12) to handle cases of a cross product between two tables being requested by a query. Note that a cross product is essentially a join with an implicit condition that always returns true. Define such a condition as fly with respective to the two involved tables 7) and 7}. Then just add fly to condition set C as a disjunctive term on the RHS.

[00131] Corollary 6. If a self-join between the same relation 7) is requested by a query, Eq. (12) is still valid upon adding another copy of 7) to the set T and adding the self-join condition to C as an extra disjunctive item.

[00132] Further Optimizations of Q are provided. The disclosed equivalence rules allow one to rewrite Q to explore other optimization opportunities such as the follows. [00133] (1 ) According to Rule 4, for a set of attributes A that: (i) includes all attributes from table 7) that appeared in u _{=1 7}; and (ii) includes all attributes from table 7} that appeared in C, project out A from 7) first, e.g., gets

[00134] (2) Eq. (12) does not consider the order the tables are (outer) joined, as the order will not change the recoverability of the results according to the definition of ip. However, the query optimizer will evaluate different sequences of the joined tables.

[00135] (3) For the join condition C, locate a common component 0 ₇ among all q[ (Definition 15) such that C = (0j v · · · v ¾) L Q, = 0 _D A 0 ₇, and follow Rule 1 (Corollary 2) to reorder the selections.

[00136] (4) Following the discussions in (3), if the common conditions among all queries 0 ₇ can be further written as 0 ₇ = q _Ti A Q _T2 . . . q _Tpi where q _Tί is a set of selection condition over input table 7), one can apply Rule 3 to push down the selection condition q _Tn

[00137] Tuples for individual queries can be recovered. One can assume that tuples of T can be delivered to the query filters in a streaming or batching manner. This can be done by mechanisms such as cursors supported by modern DBMSs.

[00138] A recovering function can be used to filter each tuple generated from Eq. (12) or Eq. (13). Note that C = v q ₂ v · · · v q _h, therefore q _t can be obtained by

qi = p _£.;.s¾( ) {14; where T is the global query result. Interesting thing is that the recovering function is almost identical to the query expression itself (recall q _t =

n _Li ^ae _i ( ^xv _Te _Ti T )). This brings convenience in implementation as no extra guidance is needed for each query to recover the tuples. Now one can see why L _c should be included in Eq. (12) - as each tuple arrives, one can also apply the selection with predicates involving attributes in but not

[00139] Eq. (14) shows, due to the streaming nature of selection and projection, each tuple can be processed on-the-fly and discarded afterwards. Note that if duplicate elimination is required, more storage may be needed, e.g., according to’Group By’ as discussed herein.

[00140] Comparison: By the first glance, the system 100 might be less efficient in filtering the tuples as compared to ShareDB and DataPath. The system 100 may need to broadcast each tuple to all queries while ShareDB and DataPath only sends a tuple to queries it belongs to (due to the existence of query IDs for each tuple). However, it is believed that the system 100 approach performs better. For example:

(1 ) One can index the predicates in the selection conditions of all queries ( in Eq. (14)). Via such indexes, a tuple can be sent to only relevant queries;

(2) For n queries, each tuple in ShareDB and DataPath needs n bits for the query list. Lack of such memory overhead in the system 100 allows the tuple to reside at higher levels of cache in modern CPUs/co-processors; and

(3) For each tuple, it requires O (n) time to generate the query ID list in

ShareDB and DataPath. Such a cost does not exist in the system 100.

[00141] Example 7. Now create the global query Q for all queries and the database mentioned in Example 1. The resulting global table T for all five queries is shown in Table 3. By applying Eq. (14) on each tuple, one can generate the resulting tables for all five queries (details not shown here).

[00142] Table 3: Global output table for the five queries in Example 1 . Here w represents a single NULL value.

Feasibility of the PsiDB Approach

[00143] Even after the algebraic optimization, the main concern over the system 100 remains: the skeleton join could still be too expensive to compute, and outputs a relation that is excessively large. Cost of computing T In Eq. (12), the skeleton join condition C combines many query-level conditions in a disjunctive form (e.g., C = v q ₂ v · · · v q _h). This could lead to scanning an entire (inner) table in processing the join. If processed in a traditional DBMS, each join becomes more selective therefore index-based join algorithms can be utilized. However, for a large number of queries, the combined number of tuples accessed in the table are comparable to the whole table therefore it is efficient to conduct a table scan.

[00144] Another important factor is the performance advantage of sequential I/O achieved in table scans. A recent work reported a break-even point (in selectivity) of less than 1 % for an index-based scan to outperform table scan - indexed scan wins only when it accesses less than 1 % of the data in a table, and this is highly unlikely in a CDBMS. Note that the above reasoning holds true for a database that is disk based, and a database that is in memory, although the advantage of scanning is much bigger in disk-based systems.

[00145] Size of T Another concern is the size of T could be excessively large, thus increasing the cost of the query result filtering. This is not the case with the development of the disclosed equivalence rules, as shown in the following lemma.

[00146] Lemma 8. Every tuple in the relation T will appear in the resulting table of at least one query.

[00147] Proof. The above can be seen from Eq. (12): if a tuple passed the condition C from the intermediate table resulted from xt _c7), it means for some query q _j, its join condition 6 _j is evaluated as true for that tuple, therefore the tuple will be in the final result of q _j.

[00148] Denoting the cardinality of a table T as |G|, the above lemma derives ^i=i ¹. Denoting the degree and total size of a table as D and S, respectively,

where D _p is the average degree of the resulting tables of all queries, and a is the average number of queries a tuple in T is used. Quantity a represents the overlap among the resulting tables of different queries, and Lemma 8 basically says a ³ 1. Such overlaps could lead to significant resource savings.

EMPIRICAL EVALUATIONS

[00149] One example of the system 100 was implemented using TSQL, the procedural language supported by SQL Server. One component of the system 100 is the Query Transform module, which takes a batch of queries from the Query Generator and transforms all such queries into a global expression Q. In particular, the system 100 can apply the disclosed equivalence rules to reach a level of optimization shown in Eq. (13). The query Q is then transformed into an SQL statement and sent back to SQL Server to compute the final table T. The framework of the present example collects the tuples of T in batches, and each batch is sent to an in-memory data filtering module that applies Eq. (14) to generate the individual query results. In short, this implementation does not explicitly consider any physical level optimizations.

[00150] In some experiments, a setup was provided as follows. A query workload was also provided. Synthetic data were used and query sets generated from the TPC-H benchmark for experiments. To investigate system behavior under different data sizes, three databases under TPC-H scale factors (SF) 1 ,

10, and 100, respectively, were built. Indexes for all attributes involved in any selection condition in the queries were built. Under SF1 , the entire database was first loaded (by scanning tables) into the buffer pool thus it essentially became an in-memory database. The SF100 database is a typical disk-based system as the buffer pool can only hold a small part of the database, indexes, and runtime states. The SF10 database sits in the middle with a significant part of data and states in memory.

[00151] A query generator that outputs query workloads under different numbers of tables involved, attributes to apply a selection, the selectivity for selection operations, and total number of queries in a workload was developed.

In particular, all TPC-H queries were transformed into“seed” queries by removing the aggregates and nested queries. The seed queries are essentially a skeleton join over multiple tables. Note that there are up to five tables joined in the same query in TPC-H. The seed queries serve as the template for

generating actual queries with selection conditions (e.g., WHERE clause in SQL) applied to randomly-picked attributes and a random subset of the attributes to be projected out (e.g., SELECT clause in SQL) from the skeleton table. The selection conditions of each attribute are randomly generated towards a desired selectivity (and thus the size of the resulting table T).

[00152] Results are reported under different workload intensity (e.g., total number of queries), database sizes, and also query selectivity. For the latter, the selectivity of selection conditions of the queries is controlled such that the final result table T reaches different sizes. In particular, four cases were tested in which the size of T is approximately 1 %, 10%, 50% and 100% of the resulting table of a skeleton join (as shown in Eq. (11 )) among all five tables in the database. This table is generated by applying all applicable foreign key equality conditions but no selection conditions over any single attribute. Queries can be viewed as a "shrunk" version of this skeleton join: they either have selection conditions over some attributes, or involve less than five tables. Therefore, this table can be viewed as the superset of data that will be touched by workloads.

[00153] FIG. 3 shows the speedup of the system 100 over SQL Server in processing the same workload. Exact running time of such experiments are shown in FIG. 8. In general, the system 100 outperforms SQL Server in all cases, with the smallest speedup being 1 3X and highest being 36X. The speedup increases with the following factors: (1 ) more queries; (2) higher selectivity (smaller output table size); and (3) larger database sizes. For the first two factors, it is easy to understand because the amount of computation and data accesses shared among queries increase. Factor (2) is related to the size of T, and costs of filtering tuples from T are found to be the dominating factor in the total workload processing time (more details in the next paragraph). Note that, the cases of 1 % (with up to 36X speedup) and 100% (up to 2.6X) data coverage are put here as extreme cases. Real-world workloads are likely to fall into the 10% - 50% range. For factor (3), when the database size increases from SF1 to SF100, example designs transition from an in-memory database setup to a more traditional disk-based database setup. To further investigate the effects of such transition, query execution plans were retrieved. In computing the global query Q, SQL Server always chose table scan for every node that involves accessing a table (even though matching indexes do exist). On the other hand, when it processes queries one-by-one, index scan was used for the same. This confirms that when a system is bound by disk I/O, the scan-based algorithm can show advantages over indexed access. With reference to FIG. 4, shown is a time for processing Q and tuple distribution under SF100 according to various embodiments of the present disclosure.

[00154] Cost of computing T. By looking at the break-down of the workload processing time for the system 100 (FIG. 4), the time for processing the global query Q is very stable under different workload intensities. This is, again, because table scan based algorithms are chosen for processing Q even under 64 queries. This is also the foundation for the major claim of constant workload response time made by ShareDB. On the other hand, as the number of queries increases, the cost of filtering tuples from T increases (in a linear manner). Such costs become dominant in cases with higher number of queries. This is a very interesting finding: it means the system 100 becomes a computation-bound system even under large databases. In other words, as the cost of computing T is insignificant, the system 100 effectively removes the I/O bottleneck. As to the tuple distribution, there are abundant optimization opportunities (e.g., predicate indexing, SIMD). This means the potential of the system 100 could be even bigger than what are shown here. Similar trends are found in smaller databases (e.g., SF1 and SF10): the cost of computing T remains insignificant. Shown in FIG. 5 is a speedup of the system 100 over MySQL according to various embodiments of the present disclosure.

[00155] Comparison to ShareDB. To compare the system 100 with ShareDB, the environment is evaluated and the speedup of both systems over the same baseline is compared. In particular, examples generate a new set of workloads in which a query joins up to three tables in a in-memory database setup. As the selectivity of queries was on the high side (e.g., less tuples returned) in the ShareDB experiments, experiments were run under 1 % and 10% coverage over the skeleton table. As can be seen in FIG. 5, the system 100 achieved double digit speedup for most cases and the highest is more than 40X. This is much higher than the 1.2-1 OX that some publications have reported for ShareDB. Therefore, it is believed that the more extensive set of optimization techniques adopted in the system 100 translated into better performance (than ShareDB).

[00156] One contribution of this disclosure lies in the formal reasoning of concurrent query processing at the level of relational algebra. This disclosure has developed equivalence rules to guide the creation of query plans for processing a batch of queries together. Therefore, unlike SharedDB that currently focuses on one static plan, the present disclosure will pave the way for a systematic consideration of optimizing multiple queries in CDBMSs. Conclusions

[00157] A framework for multiple query processing via query batching is introduced herein. This disclosure develops relational algebraic equivalence rules to transform multiple queries into a single global query. The results of the global query are distributed among all the queries on-the-fly. This disclosure describes the system 100, and shows many optimization opportunities that are captured by the disclosed system 100 and equivalence rules. This disclosure then recommends initial algebraic plans for query optimization and evaluates feasibility of the framework in comparison to similar systems.

[00158] On the theoretical side, development of more equivalence rules to reveal other resource sharing scenarios is beneficial, especially for the Group-By and Aggregate operators. A more fundamental work is to model the total cost of processing the large number of queries in the manner of the system 100. This is much needed to reveal the actual performance advantage of scanning-based data processing system in large. The problem of optimal batching of queries can be a critical issue for the proposed framework of the system 100, deeper understanding of the problem structure and features is needed for the

development of efficient algorithms. Heuristic solutions with guaranteed performance is the way to pursue in such studies. The system 100 can take advantage of existing query optimizers towards many tasks it generates. There is significant opportunity for efficient processing of the ip operators, parallel computing for query filtering, and cost modeling of query plans.

WEAK-ASSOCIATIVITY OF FULL OUTER JOIN

[00159] Start by studying the difference between two associative forms among three tables T _l T ₂ and T ₃, as illustrated in FIG. 6. Here represents a tuple following the schema of but with NULL values in all attributes. Note that the tuples in the resulting relation of either associative form should come from the skeleton relation S = T[ x T ₂ x T ₃ where T[ = T _t u oi _t. In particular, a series of conditions are applied to select tuples out of 5. Without loss of generality, such conditions are shown as

®: those between tables and T ₃ only.

[00160] Note that any condition that involves attributes from only one table can be irrelevant for these discussions thus are not shown here. Now consider the seven different types of tuples in the resulting table of either form. Such types differ by the appearance of w segments. For example, tuples of type T ₁T ₂w ₃ consist of a real tuple from T _{l t} a real tuple from T ₂, and NULL values in all attributes of the T ₃ section. By analyzing the order of outer join operations between tables in both associative forms, one can derive how the three sets of conditions are evaluated in each type of tuples, as shown in FIG. 6. For example, to get the T ₁T ₂w ₃ tuples, one must have conditions © evaluated as true and 2 as false in the left formula. For the right formula, such tuples are generated as a result of conditions © being true plus either © or © is false.

Note that for tuple types T ₁w ₂w ₃ for the left form and w ₁w ₂T ₃ on the right form, each is generated under three different scenarios.

[00161] In FIG. 6, shown are two forms of consecutive full outer joins over three tables and the seven types of tuples in them according to various embodiments of the present disclosure. [00162] Clearly, the two forms are not equivalent - multiple tuple types have different evaluations of all conditions thus contain different sets of tuples.

However, one only needs to show that all three tables can be exactly recovered from both sides.

[00163] This disclosure shows that by applying the recovering function shown in Eq. (2) to the resulting table of both forms, one can get all three input tables back. Recall the function to recover T _t is

[00164] For that, gather all the relevant components for a table shown in FIG. 6. For example, all components needed for recovering T ₂ is shown in FIG. 7. The set of T ₂ tuples that can be recovered on the left form can be obtained by

f(i) U f(ii) U f(iii) U f ⁽iv)

[00165] In each component, an associated condition evaluation defines a subset of the T ₂ tuples. By working on all four components, the total recoverable set for T ₂ can be computed as a selection over T ₂ with the following conditions:

[00166] This means that all tuples of T ₂ will be recovered by f. Similarly, for the right form, one gets the set of T ₂ tuples via f (i ') u / (it ') u / ( ') u / ( iv '), which is a select over T ₂ by the following conditions:

[00167] This means that all tuples of T ₂ will be recovered by /.

[00168] Similarly, for the right form, one gets the set of T ₂ tuples via / (T) u / (it') u / (tit') u / ( iv '), which is a select over T ₂ by the following conditions:

[00169] With reference to FIG. 7 shown are components for recovering table T ₂ from the two resulting tables shown in FIG. 6. The same results can be found for 7 and T _3.

[00170] Various experimental results can be found in FIGS. 8 to 10 according to various embodiments of the present disclosure. FIG. 8 shows execution times of PsiDB and SQL Server under different database sizes and query numbers. FIG. 9 shows a time for processing Q and tuple distribution under SF1. FIG. 10 shows a time for processing Q and tuple distribution under SF10.

[00171] Turning to FIG. 11 A, shown is a system 100 of FIG. 1 according to various embodiments. The system 100 includes computing device(s) 1100, computing device(s) 103, and database management system 106 which are in data communication with each other via a network 109. The network 109 includes, for example, the Internet, intranets, extranets, wide area networks (WANs), local area networks (LANs), wired networks, wireless networks, or other suitable networks, etc., or any combination of two or more such networks. For example, such networks may comprise satellite networks, cable networks, Ethernet networks, and other types of networks.

[00172] The computing device 1100 may comprise, for example, a server computer or any other system providing computing capability. Alternatively, the computing devices 1100 may employ a plurality of computing devices that may be arranged, for example, in one or more server banks or computer banks or other arrangements. Such computing devices may be located in a single installation or may be distributed among many different geographical locations. For example, the computing devices 1100 may include a plurality of computing devices that together may comprise a hosted computing resource, a grid computing resource and/or any other distributed computing arrangement. In some cases, the computing devices 1100 may correspond to an elastic computing resource where the allotted capacity of processing, network, storage, or other computing-related resources may vary over time.

[00173] Various applications and/or other functionality may be executed in the computing devices 1100 according to various embodiments. Also, various data is stored in a data store 112 that is accessible to the computing device 1100. The data store 112 may be representative of a plurality of data stores 112 as can be appreciated. The data stored in the data store 112, for example, is associated with the operation of the various applications and/or functional entities described herein. The data stored in the data store 112 includes, for example, queries q ₁, q ₂, .. . , q _n, equivalence rules 121 , filters /, global query/queries Q, global relation(s) , and potentially other data, control structures, and/or operations. [00174] The components executed on the computing device 1100, for example, include a query transformation application 115, a result transformation application 118, and other applications, services, processes, systems, engines, or functionality not discussed in detail herein. The query transformation application 115 is executed to facilitate transformation of queries of the database management system 106 into a global query Q according to techniques described herein. The result transformation application 118 is executed to facilitate transformation of a result (e.g., global relation T) of the global query Q comprising data from the data store 113 according to techniques described herein. The computing device 1100 including the result transformation application 118 can also comprise an OLAP database and associated features for online analytical processing.

[00175] In some embodiments, the query transformation application 115 or the result transformation application 118 run as a plug-in application to the database management system 106. In this embodiment, the data store 112 can be a data store of the database management system 106, such as the data store 113.

[00176] Various data associated with the database management system 106 is stored in the data store 113 that is accessible to the computing device 1100. The data store 113 may be representative of a plurality of data stores 113 as can be appreciated. The data stored in the data store 113, for example, is associated with the various applications and/or functional entities described herein, including a relational database managed by the database management system 106.

[00177] The computing device 103 is representative of a plurality of computing devices 103 that may be coupled to the network 109. The computing devices 103 may comprise, for example, a processor-based system such as a computer system. Such a computer system may be embodied in the form of a desktop computer, a laptop computer, personal digital assistants, cellular telephones, smartphones, web pads, tablet computer systems, wearable devices, or other devices with like capability. The computing devices 103 may include respective displays. The displays may comprise, for example, one or more devices such as liquid crystal display (LCD) displays, gas plasma-based flat panel displays, organic light emitting diode (OLED) displays, electrophoretic ink (E ink) displays, LCD projectors, or other types of display devices, etc. In some embodiments, the computing device 103 may comprise, for example, a server computer or any other system providing computing capability.

Alternatively, the computing devices 1100 may employ a plurality of computing devices that may be arranged, for example, in one or more server banks or computer banks or other arrangements.

[00178] The computing devices 103 may be configured to execute various applications such as an application 124 or other applications. The application 124 may be executed in a respective computing device 103, for example, to provide queries associated with the database management system 106, to query the data in the data store 113, or to access network content served up by the computing device 1100 and/or other servers (e.g., to render a user interface on a display of the computing device 103).

[00179] To this end, the application 124 may comprise, for example, a dedicated application for the database management system 106, a query application, an enterprise application, an Online Analytical Processing (OLAP) client application, a browser, etc. The computing devices 103 may be configured to execute applications beyond the application 124 such as, for example, email applications, word processors, spreadsheets, programming applications or environments, data visualization applications, data model development applications, or other applications.

[00180] Next, a general description of the operation of the various

components of the system 100 is provided. To begin, the query transformation application 115 is configured to query the data store 113 of the database management system 106. The application 124, or users of the application 124 can provide queries q^ q ₂, . .., q _n, which are queries of the data store 133. In some embodiments, the database management system 106 can store the queries q^ q ₂, .. ., q _n ^as registered queries q to indicate that the queries are to be executed concurrently by the database management system 106. The query transformation application 115 can obtain the queries q^ q ₂, . . ., q _n and generate a single expression based at least in part on the queries q^ q ₂, . .. , q _n. The query transformation application 115 can query the data store 113 based at least in part on the single expression to return a result corresponding to results for the queries q^ q ₂, .. ., q _n. The query transformation application 115 can generate a single global query Q , which can include a single database query that is based at on the single algebraic expression, and can query the data store 113 using the single global query Q to return data necessary for the q^ q ₂, . . ., q _n.

[00181] In some examples, the data store 113 comprises a set of relations T = {T _lt T ₂, each of which can contain different attributes. The query transformation application 115 can obtain the queries q^ q ₂, . . ., q _n associated with the set of relations T and the different attributes. The query transformation application 115 can generate, based at least in part on the set of relations T, a single algebraic expression whose result is a relation T comprising a set of tuples as described herein. The single algebraic expression can be associated with a result (e.g., the global relation T) comprising data necessary for the queries q^ q ₂, .. ., q _{n .} In some examples, the data does not include any attributes unless those attributes have been used in queries q^ q ₂, .. ., q _n·

[00182] The query transformation application 115 can generate the single algebraic expression based on applying equivalence rules 121 to algebraic expressions for the queries q^ q ₂, . .. , q _n. The equivalence rules 121 stored in the data store 112 can be the equivalence rules described in detail herein. The equivalence rules 121 can involve a plurality of relational operators comprising a plurality of ip operators. The result transformation application 118 can provide each tuple in T to a plurality of filters / to generate output relations

corresponding to the queries q^ q ₂, . .. , q _n, among other result transformation features and operations described in detail herein.

[00183] Referring now to FIG. 11 B, an example hardware diagram of a computing device 1100 is illustrated. Any of the functionality discussed herein may be implemented, in part, using one or more elements of the computing device 1100. The computing device 1100 can include one or more of a processor 1110, a storage device or Random Access Memory (“RAM”) 1120, a Read Only Memory (“ROM”) 1130, a network interface 1150, and an Input Output (“I/O”) interface 1160. The elements of the computing device 1100 are communicatively coupled via a bus 1102.

[00184] The processor 1110 can include an arithmetic processor, Application Specific Integrated Circuit (“ASIC”), or other types of hardware or software processors. The RAM 1120 and the ROM 1130 can include a memory that stores computer-readable instructions to be executed by the processor 1110.

The RAM 1120 or the ROM 1130 can store computer-readable instructions thereon that, when executed by the processor 1110, direct the processor 1110 to execute the query transformation application 115, the result transformation application 118, the application 124, or various other aspects of the present disclosure described herein. Also stored in the RAM 1120 may be a data store 112 and other data. When the processor 1110 includes an ASIC, the processes described herein may be executed by the ASIC according to an embedded circuitry design of the ASIC, by firmware of the ASIC, or both an embedded circuitry design and firmware of the ASIC. As a non-limiting example group, the ROM 1130 comprises one or more of an optical disc, a magnetic disc, a semiconductor memory (e.g., a semiconductor, floating gate, or similar flash based memory), a magnetic tape memory, a removable memory, combinations thereof, or any other known memory means for storing computer-readable instructions. The network interface 1150 can include hardware interfaces to communicate over data networks such as the network 109 (FIG. 11A). The I/O interface 1160 can include device input and output interfaces such as keyboard, pointing device, display, communication, and other interfaces. The bus 1102 can electrically and communicatively couple the processor 1110, the RAM 1120, the ROM 1130, the network interface 1150, and the I/O interface 1160, so that data and instructions may be communicated among them.

[00185] In operation, the processor 1110 is configured to retrieve computer- readable instructions stored on a storage device, the RAM 1120, the ROM 1130, or another storage means, and copy the computer-readable instructions to the RAM 1120 or the ROM 1130 for execution, for example. The processor 1110 is further configured to execute the computer-readable instructions to implement various aspects and features of the present disclosure. For example, the processor 1110 may be adapted and configured to execute the processes described above. Also, a storage device (e.g., the storage device 1120) may store the databases and/or the data stored in the databases discussed herein.

[00186] With respect to FIG. 12, shown is a process 1200 according to various embodiments of the present disclosure. At box 1203, the process 1200 includes obtaining a plurality of database queries q ₂, . .. , q _n for querying a database or the data store 113. The database queries q^ q ₂, .. ., q _n, can be obtained or received by a computing device such as the computing device 1100. In one embodiment, the database queries q^ q ₂, . .. , q _n ^are registered queries q or queries that are registered to be executed currently in the database

management system 106 that comprises one or more computing devices.

[00187] At box 1206, the process 1200 can generate a single expression (e.g., a single expression that includes data necessary for each of the plurality of database queries q^ q ₂, . q _n ) The single expression can be generated based on applying equivalence rules 121 to algebraic expressions for the plurality of database queries q _lt q ₂, . . ., q _n.

[00188] At box 1209, the process 1200 can include querying the database or the data store 113 based on the single expression. In some embodiments, the process 1200 can include modifying an aspect of a table in the database or the data store 113. The process 1200 can further include generating a second single expression corresponding to a second plurality of database queries by applying equivalence rules for the plurality of second database queries.

Thereafter, the process can proceed to completion. [00189] With respect to FIG. 13, shown is a process 1300 according to various embodiments of the present disclosure. At box 1303, the process 1300 includes obtaining database queries q ₂, . .. , q _n associated with a database or data store 1 13 having a set of relations T = {7 , T ₂, .... , T _m}, where each of the relations comprises attributes. In some embodiments, the database

queries q^ q ₂, .. . , q _n ^are registered in the database management system 106 as registered queries q. The registered queries q can for example include queries q^ q ₂, .. . , q _n being registered to be executed concurrently by the database management system 106, another system, or in one or more computing devices. In some examples, the process 1300 can associate at least one of the database queries q^ q ₂, .. ., q _n with an algebraic expression that returns a vector of the relations.

[00190] At box 1306, the process 1300 includes generating a single algebraic expression based on the set of relations T. The process 1300 includes generating the single algebraic expression so the result of the single algebraic expression is a relation T comprising a set of tuples. The single algebraic expression can also be generated based on applying equivalence rules 121 to algebraic expressions for the plurality of database queries q^ q ₂, .. . , q _n. For example, the equivalence rules 121 can involve relational operators comprising a plurality of ip operators.

[00191] In some embodiments, the single algebraic expression is associated with a global relation T comprising data necessary for the database queries q^ q ₂, .. . , q _n. In some examples, the data does not include any attributes unless those attributes have been used in the plurality of database queries q^ q ₂, .. . , q _n· Some examples can also include a case where there is not any row of the data that is not used by the database queries q^ q ₂, · · ·, q _n-

[00192] At box 1309, the process 1300 includes providing each tuple in T to a plurality of filters / to generate output relations corresponding to the plurality of database queries q^ q ₂, ..., q _n. The process 1300 can also include generating a single database query Q based on the single algebraic expression, and querying the database or the data store 103 using the single database query Q to return data necessary for the plurality of database queries q^ q ₂, ..., q _n. In some embodiments, the process 1300 includes modifying at least one aspect of at least one table in the database or the data store 113. Thereafter, the process can proceed to completion.

[00193] A phrase, such as“at least one of X, Y, or Z,” unless specifically stated otherwise, is to be understood with the context as used in general to present that an item, term, etc., can be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Similarly,“at least one of X, Y, and Z,” unless specifically stated otherwise, is to be understood to present that an item, term, etc., can be either X, Y, and Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, as used herein, such phrases are not generally intended to, and should not, imply that certain embodiments require at least one of either X, Y, or Z to be present, but not, for example, one X and one Y. Further, such phrases should not imply that certain embodiments require each of at least one of X, at least one of Y, and at least one of Z to be present.

[00194] Although embodiments have been described herein in detail, the descriptions are by way of example. The features of the embodiments described herein are representative and, in alternative embodiments, certain features and elements may be added or omitted. Additionally, modifications to aspects of the embodiments described herein may be made by those skilled in the art without departing from the spirit and scope of the present disclosure defined in the following claims, the scope of which are to be accorded the broadest

interpretation so as to encompass modifications and equivalent structures.

[00195] In addition to the foregoing, the various embodiments of the present disclosure include, but are not limited to, the embodiments set forth in the following clauses:

[00196] Clause 1. A system comprising: a database; at least one computing device in communication with the database, the at least one computing device being configured to at least: obtain a plurality of database queries; generate a single expression based at least in part on the plurality of database queries; and query the database based at least in part on the single expression to return a global relation corresponding to relations for the plurality of database queries.

[00197] Clause 2. The system of clause 1 , wherein the single expression comprises data necessary for the relations for each of the plurality of database queries.

[00198] Clause 3. The system of clause 1 or 2, wherein the single expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

[00199] Clause 4. The system of any of clauses 1 -3, wherein the at least one computing device is further configured to at least: modify at least one aspect of at least one table in the database; and generate a second single expression corresponding to a second plurality of database queries by applying equivalence rules for the plurality of second database queries. [00200] Clause 5. A method comprising: obtaining, via at least one computing device, a plurality of database queries; generating, via the at least one computing device, a single expression based at least in part on the plurality of database queries; and querying, via the at least one computing device, a database based at least in part on the single expression.

[00201] Clause 6. The method of clause 5, wherein the single expression comprises data necessary for each of the plurality of database queries.

[00202] Clause 7. The method of clause 5 or 6, wherein the single expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

[00203] Clause 8. The method of any of clauses 5-7, further comprising: modifying at least one aspect of at least one table in the database; and generating a second single expression corresponding to a second plurality of database queries by applying equivalence rules for the plurality of second database queries.

[00204] Clause 9. A system comprising: a data store comprising a set of relations; at least one computing device in communication with the data store, the at least one computing device being configured to at least: obtain a plurality of database queries associated with the set of relations, each of the relations comprising attributes; generate, based at least in part on the set of relations, a single algebraic expression whose result is a global relation comprising a set of tuples; and provide each tuple in the global relation to a plurality of filters to generate output relations corresponding to the plurality of database queries.

[00205] Clause 10. The system of any of clauses 1 -4, or 9, wherein the plurality of database queries are registered in a database management system. [00206] Clause 11. The system of any of clauses 1 -4, 9, or 10, wherein the plurality of database queries are registered to be executed concurrently.

[00207] Clause 12. The system of any of clauses 1 -4 or 9-11 , wherein at least one of the plurality of database queries is associated with an algebraic expression that returns a vector of the relations.

[00208] Clause 13. The system of any of clauses 1 -4 or 9-12, wherein the single algebraic expression is associated with a global relation comprising data necessary for the plurality of database queries.

[00209] Clause 14. The system of any of clauses 1 -4 or 9-13, wherein the data does not include any attributes unless the attributes have been used in the plurality of database queries.

[00210] Clause 15. The system of any of clauses 1 -4 or 9-14, wherein there is not any row of the data that is not used by the plurality of database queries.

[00211] Clause 16. The system of any of clauses 1 -4 or 9-15, wherein the single algebraic expression is generated based at least in part on applying equivalence rules to algebraic expressions for the plurality of database queries.

[00212] Clause 17. The system of any of clauses 1 -4 or 9-16, wherein the equivalence rules involve a plurality of relational operators comprising a plurality of y operators.

[00213] Clause 18. The system of any of clauses 1 -4 or 9-17, wherein the at least one computing device is further configured to at least modify at least one aspect of at least one table in the data store.

[00214] Clause 19. The system of any of clauses 1 -4 or 9-18, wherein the at least one computing device is further configured to at least generate a single database query based at least in part on the single algebraic expression. [00215] Clause 20. The system of any of clauses 1 -4 or 9-19, wherein the at least one computing device is further configured to at least query the data store using the single database query to return data necessary for the plurality of database queries.

Previous Patent: SEQUENTIALLY COLORED LOOPY YARN AND METHOD OF KNITTING NON-RANDOM DISCREET DESIGNS

Next Patent: SELECTIVE INHIBITORS OF PROTEIN ARGININE METHYLTRANSFERASE 5