

Title:
INTELLIGENT DATABASE SYSTEM AND METHOD
Document Type and Number:
WIPO Patent Application WO/2021/165576
Kind Code:
A1
Abstract:
The invention relates to an intelligent database system, i.e. a system in which artificial intelligence functionality and database functionality are tightly integrated, and associated methods. The system of the invention includes a database sub-system and an AI sub-system. The database sub-system includes database storage, in which the data is stored along with a bitmap index which indexes the features of the data. The AI subsystem is configured to perform statistical calculations on the bitmap index and provide statistical insights in response to queries.

Inventors:
RAUHALA ANTTI (FI)
Application Number:
PCT/FI2021/050115
Publication Date:
August 26, 2021
Filing Date:
February 18, 2021
Assignee:
AITO INTELLIGENCE OY (FI)
International Classes:
G06F16/2458; G06F16/22; G06N7/00
Foreign References:
US20070192341A12007-08-16
US20160364420A12016-12-15
Other References:
FAVRE CÉCILE ET AL: "Bitmap Index-Based Decision Trees", LECTURE NOTES IN COMPUTER SCIENCE, May 2005, XP055799626, DOI: 10.1007/11425274_7
M.F. PORTER: "An algorithm for suffix stripping", PROGRAM, vol. 14, no. 3, 1980, pages 130 - 137, XP008030462
O'NEIL; CHENG; GAWLICK; O'NEIL: "The log-structured merge-tree (LSM-tree)", ACTA INFORMATICA, vol. 33, no. 4, June 1996 (1996-06-01), pages 351 - 385, XP009122594
S. CHAMBI; D. LEMIRE; O. KASER; R. GODIN: "Better bitmap performance with Roaring bitmaps", SOFTWARE: PRACTICE AND EXPERIENCE, vol. 46, no. 5, May 2016 (2016-05-01), pages 709 - 719
Attorney, Agent or Firm:
BOCO IP OY AB (FI)
Claims:
1. A computer-implemented method comprising: receiving a plurality of data records, wherein each data record comprises a plurality of fields; in response to receiving the plurality of data records, generating bitmap indexes describing the fields of the plurality of data records; generating a machine learning model based on the bitmap indexes; and, based on the machine learning model, outputting a prediction of one or more of: the value of one or more of the fields; one or more recommended data records; and one or more data records which match a query.

2. The computer-implemented method of claim 1, wherein the method further comprises, prior to receiving the plurality of data items, receiving a schema describing the plurality of data items, wherein the bitmap index is optionally generated based on the received schema, and wherein the method optionally further comprises, in response to receiving the plurality of data records and prior to generating the machine learning model, generating links between database tables corresponding to references defined in the database schema, and generating bitmap indexes based on the linked database tables.

3. The computer-implemented method of any preceding claim, wherein the method further comprises receiving a query containing a request for at least one of: a field prediction, one or more recommended data records, and/or one or more matching data records; wherein: the step of generating a machine learning model is performed in response to the query to produce a query-specific machine learning model; and outputting the prediction is performed based on the query-specific machine learning model, and the output prediction corresponds to the request of the query; and optionally, the method further comprises destroying the query-specific machine learning model after outputting the prediction.

4. The computer-implemented method of claim 3, wherein the machine learning model is a naïve Bayes model, and wherein generating the naïve Bayes model comprises: calculating statistics based on the bitmap indexes; and generating Bayesian lifts based on the calculated statistics.

5. The computer-implemented method of claim 4, wherein calculating statistics is performed using bitwise logical operations.

6. The computer-implemented method of claim 4 or 5, wherein prior to calculating statistics, the bitmap indexes are sparsely optimised.

7. The computer-implemented method of any one of claims 4 to 6, wherein the bitmap indexes comprise conditional bits, which may have a defined value of 0 or 1, or be undefined, and wherein prior to calculating statistics, the conditional bits are filtered to determine the bits which have a defined value, and wherein filtered statistics are calculated based on the filtered conditional bits.

8. The computer-implemented method of any one of claims 4 to 7, wherein the prediction output includes one or more Bayesian lifts associated with the prediction.

9. The computer-implemented method of any one of claims 3 to 8, wherein the query further comprises a request for query validation, the request for query validation comprising a ratio describing a split of data records to be used for training and validation and/or an indication of a validation method.

10. The computer-implemented method of claim 9, wherein the method further comprises performing validation of the query according to the request for query validation by: training the machine learning model based on the bitmap indexes of the data records to be used for training, as indicated by the request for query validation; and generating an accuracy measure of the query-specific machine learning model by testing the model on the bitmap indexes of the data records to be used for validation, as indicated by the request for validation; wherein the accuracy measure provides an indication of the accuracy of the requested value of one or more of the fields, one or more recommended data records, and/or one or more data records which match the query; and wherein the method further comprises outputting the accuracy measure of the query-specific machine learning model in response to the query.

11. The computer-implemented method of any preceding claim, wherein the bitmap indexes are generated based on identified features derived from the fields of the plurality of data records, wherein generating the machine learning model further comprises filtering the features to ensure or improve statistical independence between features, and optionally wherein filtering comprises: recognising a pattern of features derived from a field across several data records; and replacing the features identified as belonging to the pattern with synthetic features which are conditional on the absence of the pattern in the field.

12. The computer-implemented method of any preceding claim, wherein the method further comprises receiving a second plurality of data items and, in response to receiving the second plurality of data items, generating a second bitmap index describing the features of the second plurality of data items, and wherein the method optionally further comprises storing the second plurality of data items and the second bitmap index as a database table segment.

13. A data processing system comprising a processor configured to perform the method of any one of claims 1 to 12.

14. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any one of claims 1 to 12.

15. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 12.

Description:
INTELLIGENT DATABASE SYSTEM AND METHOD

Technical Field

The present invention relates to an intelligent database system, i.e. a system in which artificial intelligence functionality and database functionality are tightly integrated, and associated methods.

Background

In a typical AI or data science project, a team of software engineers and data scientists collaborate to build a system that is capable of handling, preparing and analysing large datasets. The conventional approach for storing large amounts of data is to store it in a database. A separate AI system retrieves data from the database in order to carry out the necessary statistical calculations to provide some insight. This traditional separation of the AI from the database can lead to a number of issues.

As an example, consider the problem of building a content-based personalization system for an aggregated news platform. The news platform backend may orchestrate the cooperation between the database and the AI by requesting an article list from the database and asking the AI to order the articles by personal preference. If there are ten thousand articles, spread across different news outlets, present in the database, and each article's ID is a 12-byte hash (occupying approximately 20 bytes in Base64 with JSON decorations), then every request requires 200 KB of data to be sent from the database to the server and from the server to the AI. This is 400 KB of traffic per request, which leads to network congestion, effectively degrading both the responsiveness and the throughput of the system.

Furthermore, the AI system must maintain all of the content in order to do content-based personalization. Thus, the AI system ends up duplicating much of the database content, essentially acting as another – lesser – database, which is always slightly out of sync with the main database.

A further problem arises from the very high level of understanding of data science and statistics that is required to develop an AI system.
Given the relative rarity of these skills, development of such systems can be difficult, especially for smaller and less sophisticated organisations.

One prior art system, BayesDB, attempts to solve this problem by providing AI functionality built on top of a SQLite3 database. BayesDB provides a query language, similar to SQL, which can be used by a relatively untrained user to obtain data science insights from the data stored in the database. As such, BayesDB makes it easier for users without statistics training to search, clean, and model multivariate databases. However, BayesDB, like other systems built using traditional databases and their APIs, is unable to quickly calculate the required statistics for machine learning on large datasets; it is inefficient and scales poorly with the quantity of data that is stored in the database and analysed.

Furthermore, if the news platform has tens of thousands of users, there are two clear options for providing a model of personalized article recommendations in the AI system. One very generic model could be created for all the users, but this would not be very personalized. Alternatively, rough, user-specific models could be created, but this would raise the problem of maintaining tens of thousands of models at least.

Finally, a further problem arises when scoring lots of content with a high level of detail. If a user of the news platform sees tens to hundreds of articles for every article that he/she engages with, the impression data is one or two orders of magnitude larger than the click dataset. Thus, for 1 million clicks across the news platform by different users, there may be tens of millions of impressions. If 1 million clicks occupy approximately 1 GB of memory, the impressions will take 30 GB, which no longer fits in an average computer's memory.
Furthermore, calculating the required statistics for a machine learning model based on millions of clicks and impressions takes a long time, slowing down the speed at which such models can be updated, for example.

Summary of the Invention

The system and method of the present invention solve the above problems by tightly integrating the AI system and the database. By implementing the AI as database sorting, the network congestion problem is solved. Furthermore, the data integration, redundancy, synchronization and storage problems are solved, because the database content can simply be used directly in the statistical calculations. This integration is enabled through the use of a data structure – essentially a database index – which is used to calculate the necessary statistics in order to create the machine learning models. With the right optimizations, single statistics can be calculated on the scale of a microsecond. This allows for the calculation of thousands or even tens of thousands of statistical properties within just tens of milliseconds, which is fast enough to create individual ad-hoc machine learning models, avoiding the downsides of a generic model while allowing individual models to be produced on-demand, saving storage space and time. In addition, organizing the data by features and using sparse optimizations further increases the speed of model generation. These optimized data structures are, again, basically database indexes.

According to a first aspect of the invention, a computer-implemented method is provided. The method comprises receiving a plurality of data records, wherein each data record comprises a plurality of fields, and in response to receiving the plurality of data records, generating bitmap indexes describing the fields of the plurality of data records.
The method also comprises generating a machine learning model based on the bitmap indexes and, based on the machine learning model, outputting a prediction of one or more of: the value of one or more of the fields, one or more recommended data records, and one or more data records which match a query. Since machine learning models and query responses are generated based on the bitmap indexes, by generating bitmap indexes in response to the receipt of new data, new queries can immediately take the new data into account.

The method may further comprise, prior to receiving the plurality of data items, receiving a schema describing the plurality of data items. The bitmap index may be generated based on the received schema. In response to receiving the plurality of data records and prior to generating the machine learning model, the method may further comprise generating links between database tables corresponding to references defined in the database schema, and generating bitmap indexes based on the linked database tables. These links allow the machine learning model to take into account features of the data that are spread across multiple linked database tables.

The query may contain a request for at least one of: a field prediction, one or more recommended data records, and/or one or more matching data records. The step of generating a machine learning model may be performed in response to the query to produce a query-specific machine learning model, and outputting the prediction may be performed based on the query-specific machine learning model, such that the output prediction corresponds to the request of the query. By generating machine learning models in response to the specific query, individual ad-hoc machine learning models are created, avoiding the downsides of a generic model and solving the technical problems of storage space and processing time.
The method may further comprise destroying the query-specific machine learning model after outputting the prediction. Destroying the machine learning model after outputting the prediction is made possible by the system's ability to rapidly generate new machine learning models, and solves the technical problem of reducing storage requirements.

The machine learning model may be a naïve Bayes model. Generating the naïve Bayes model may comprise calculating statistics based on the bitmap indexes and generating Bayesian lifts based on the calculated statistics. Calculating statistics may be performed using bitwise logical operations. Since bitwise logical operations, such as AND, OR and NOT, are intrinsically fast and efficient operations for a processor to carry out, and since modern processors are optimised for performing other logical bitwise operations such as POPCOUNT, the statistics can be calculated exceptionally quickly.

Prior to calculating statistics, the bitmap indexes may be sparsely optimised. Sparse optimisation of the bitmap indexes further increases the speed and efficiency of model generation.

The bitmap indexes may comprise conditional bits, which may have a defined value of 0 or 1, or be undefined. Prior to calculating statistics, the conditional bits may be filtered to determine the bits which have a defined value, and the filtered statistics may be calculated based on the filtered conditional bits.

The prediction output may include one or more Bayesian lifts associated with the prediction. As such, the statistical reasoning behind the prediction can be presented to and understood by a user of the system.

The query may further comprise a request for query validation, the request for query validation comprising a ratio describing a split of data records to be used for training and validation and/or an indication of a validation method.
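The filtering of conditional bits before calculating statistics can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the representation of conditional bits as a pair of bitmaps (a "defined" mask plus a "value" bitmap) is an assumption made for the example.

```python
# Illustrative sketch (assumed representation): a conditional bit is modelled
# as a pair of plain bitmaps, a "defined" mask and a "value" bitmap.
# Statistics are then calculated only over bits that have a defined value.

def filtered_counts(value_bits: int, defined_bits: int, n_rows: int):
    """Count the defined bits, and the 1s among them, using bitwise ops."""
    mask = (1 << n_rows) - 1
    defined = defined_bits & mask
    ones = value_bits & defined      # a bit counts as 1 only if it is defined
    return bin(defined).count("1"), bin(ones).count("1")

# Rows 0..3: bit 2 is undefined, the remaining bits are defined.
defined, ones = filtered_counts(value_bits=0b1010, defined_bits=0b1011, n_rows=4)
# defined == 3, ones == 2 (the undefined bit is excluded from both counts)
```

The key point is that the filter itself is a bitwise AND, so excluding undefined bits costs no more than any other bitmap statistic.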
The method may further comprise performing validation of the query according to the request for query validation by training the machine learning model based on the bitmap indexes of the data records to be used for training, as indicated by the request for query validation, and generating an accuracy measure of the query-specific machine learning model by testing the model on the bitmap indexes of the data records to be used for validation, as indicated by the request for validation. The accuracy measure provides an indication of the accuracy of the requested value of one or more of the fields, one or more recommended data records, and/or one or more data records which match the query. The method may further comprise outputting the accuracy measure of the query-specific machine learning model in response to the query. As such, the user of the system can understand the accuracy of the model used to generate the predictions.

The bitmap indexes may be generated based on identified features derived from the fields of the plurality of data records. Generating the machine learning model may further comprise filtering the features to ensure or improve statistical independence between features. The filtering may comprise recognising a pattern of features derived from a field across several data records and replacing the features identified as belonging to the pattern with synthetic features which are conditional on the absence of the pattern in the field.

The method may further comprise receiving a second plurality of data items and, in response to receiving the second plurality of data items, generating a second bitmap index describing the features of the second plurality of data items. The second plurality of data items and the second bitmap index may be stored as a database table segment.

According to a second aspect of the invention, a data processing system is provided. The data processing system comprises a processor configured to perform the method described above.
According to a third aspect of the invention, a computer program is provided, the computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method described above. According to a fourth aspect of the invention, a computer-readable medium is provided, the computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method described above.

Brief Description of the Drawings

Figure 1 depicts a method for adding new data to a new database according to the present invention. Figure 2 depicts a method for adding new data to an existing database according to the present invention. Figure 3 depicts a method for executing a query against the database according to the present invention. Figure 4 depicts a computer system on which the method of the present invention may be performed.

Detailed Description

Figure 1 depicts a process 100 for inserting new data into a new database in the system of the present invention. As with conventional database systems, a schema is received by the system at step 101. The schema defines the structure of the database, such as the organisation of data items within the database (i.e. database rows/records) into tables, the fields within each row (i.e. the columns of the database), and the relationships between tables and fields. As described in more detail with respect to step 103, the schema is also used to define how features are selected for encoding in the bitmap indexes. The schema may be generated by a user and transmitted over a network to the database system; however, the precise details of how the schema is created or received by the system are not relevant to the invention.

At step 102, subsequent to receiving the schema, the system receives data to be inserted as a row (or rows) into the table (or tables) of the database.
The data is structured in a manner corresponding to the schema received at step 101, such that within the data received by the system each individual data item comprises some or all of the fields defined in the schema, and each data item can be inserted as a row into the database.

At step 103, the data received by the system is analysed in order to create a bitmap index for the received data. A bitmap index comprises an array of bits for each database row in the received data, where each bit represents the presence of a single feature of the data with a 1 and the absence (or unknown presence) of the feature with a 0. For example, for a very simple database table as follows: [table not reproduced in this text] A bitmap index (or "bit row") for a simple feature (word unigrams) from the job title field of the database may be formed as follows: [table not reproduced in this text]

In a conventional database, such a bitmap index is used to answer queries by performing bitwise logical operations on these bitmaps. For example, to fetch all rows where the job title contains the word "developer", a bitwise logical AND operation can be performed between each column of the index (which correspond to rows of the table) and the bit array "0010" to identify rows containing the word. Additionally, or alternatively, another bitmap index may be created which relates each of the features in the database table to each of the rows, i.e. a "bit column", for example: [table not reproduced in this text] However, instead of using bitmap indexes to answer such queries, the system and methods of the present invention use the bitmap indexes to train machine learning models, as explained in more detail below with respect to Figure 3.

The bit columns and/or bit rows are defined by the featurization, which is in turn defined by the schema. For example, where the schema defines a field that can take a number of discrete values, the featurization may include a feature corresponding to each of those values, i.e. if the schema defines a Boolean field X, the featurization may include a first feature X=True and a second feature X=False. The schema may also define a particular type of analyser that is to be used for a given field, for example by specifying the language to be used to analyse a text field. As another example, where the schema defines a field that can include text, the text is split into 'terms', which are normalized, e.g. by stemming, which may be carried out using existing software libraries, such as Snowball. An example stemming algorithm is described in more detail in M.F. Porter, 1980, An algorithm for suffix stripping,

Program, 14(3), pp. 130-137. For example, the text 'horses are riding' may be converted into the terms 'hors', 'is', 'rid'. The exact splitting and normalization of the words depend on the precise implementation of the text normalization, which is not important to the function of the invention - only that the text is split into these features. Of course, the text that is input into each field is not necessarily known at the time the schema is received, or based on the schema alone; thus additional features may be created in the bitmap indexes with each new set of data that is input. These additional features are therefore encoded in the bitmap index for each new table segment that is generated (which is described in more detail below).
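The construction of a bit row per word-unigram feature, and its use in a conventional containment query, can be sketched as follows. This is an illustrative sketch with hypothetical data (the example table from the application is not reproduced here), and the naive whitespace featurization stands in for the term normalization described above.

```python
# A minimal sketch (hypothetical data): build a "bit row" per word-unigram
# feature of a job-title field, then answer a containment query with a
# bitwise AND. A real system would normalize/stem the terms first.

rows = ["senior developer", "sales manager", "developer", "product manager"]

bit_rows = {}                            # feature -> bitmap over rows
for i, title in enumerate(rows):
    for term in title.split():           # naive featurization
        bit_rows[term] = bit_rows.get(term, 0) | (1 << i)  # set bit i

# All rows whose job title contains "developer":
hits = bit_rows["developer"]             # 0b0101 -> rows 0 and 2
matching = [rows[i] for i in range(len(rows)) if hits >> i & 1]
# matching == ["senior developer", "developer"]
```

The invention uses the same bitmaps not (only) to answer such queries but as the raw material for statistical calculations, as described with respect to Figure 3.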

Furthermore, the schema may also define links from fields in a first database table to another field of the first database table or to a field of a second database table. For example, in a system for tracking user engagement with online content, a database may include a table for content impressions, another table for content items and another for users. The fields of the content impressions table may include a link to the user ID of a user in the users table, and a link to the content ID of content in the content items table.

After the bitmap indexes have been generated at step 103, the bitmap indexes and the new row (or rows) of the database are stored in the long-term memory of the system, e.g. a file system on a hard disk or solid-state drive, an in-memory map or a remote network service, such as Amazon S3. It is not necessary for the function of the invention that the rows and bitmap indexes are written to the long-term memory at the same time - for example, the rows could be written first, before or while the bitmap indexes are generated. However, since most operations cannot be carried out on row data stored without its supporting indexes, there is little to be gained from staggering the storage in this way.

Alternatively, where the short-term memory of the system is sufficiently large, e.g. when the system has a large amount of RAM, the row data and the indexes may not need to be written to the long-term memory at all. Indeed, the present invention can function wherever the row data and indexes are stored in the system's memory hierarchy. However, at present, where RAM is often limited and expensive and long-term storage, such as hard disks and solid-state drives, is more available and less expensive, the use of long-term memory may be desirable.

In addition to the row data and bitmap indexes, the original data received by the system may also be stored. For example, where the data is received in JSON format, from which the row data and bitmap indexes are created, the original JSON data may also be stored. In this way, should the format of the database or long-term storage change, the database can be rewritten based on the original data.

The data (i.e. database rows and bitmap indexes) is stored as a set of binary large objects (blobs), called 'the blob store', and it is immutable. The binary objects are referred to using hash/digest codes that are long enough to be unique and not have collisions. The hashes may also be used for garbage collection. The database state is determined by a hash called HEAD, which points to the database root object state.

In practice, the schemas of most databases used for real-world applications contain links from fields in a first database table to another field of the first database table or a field of a second database table. These links must be reflected in the stored database so that the machine learning models can follow these links quickly and efficiently. However, since the blob store is immutable, it cannot contain cyclic hard references between objects, i.e. the hard references always span an acyclic graph.
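The immutable, content-addressed blob store with a HEAD hash can be sketched as follows. This is an illustrative sketch of the general idea only; the in-memory dictionary, SHA-256 digests and JSON serialisation are assumptions made for the example, not details from the application.

```python
# An illustrative sketch (assumed design): an immutable, content-addressed
# blob store in which objects are referred to by their hash, and the
# database state is determined by a single HEAD hash.
import hashlib
import json

class BlobStore:
    def __init__(self):
        self._blobs = {}                 # digest -> immutable serialized blob
        self.head = None                 # hash of the database root object

    def put(self, obj):
        blob = json.dumps(obj, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()  # long enough to avoid collisions
        self._blobs[digest] = blob       # immutable: identical content, same key
        return digest

    def get(self, digest):
        return json.loads(self._blobs[digest])

store = BlobStore()
seg = store.put({"rows": [[1, "dev"]], "indexes": [5]})       # a table segment
store.head = store.put({"tables": {"jobs": seg}})             # root references segment
```

Because a hard reference is a digest of immutable content, the reference graph is necessarily acyclic, which is why the cyclic 'backwards' links described below must be implemented as symbolic links instead.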
As such, the hard references between objects in the blob store always refer to an immutable object, and the target of a hard reference cannot change. Thus, in order to have cyclic links and 'backwards' references, the system also implements symbolic linking. The symbolic links are typically resolved at write time or load time (before queries are run). The symbolic links operate in a file system-like tree structure which spans the blob store, e.g. the file system directories correspond to objects in the blob store. These symbolic links are essentially strings (e.g. "linked-table-name"). Symbolic linking is context dependent, so the symbol target may change or resolve differently in different contexts or environments.

Figure 2 shows a second process 200 for inserting new data into an existing database in the system of the present invention. As explained above with respect to Figure 1, the database has already been created based on the database schema, and possibly some data has already been entered into the database. At step 201 of the method 200, new data is received by the system. As before, the received data is structured in a manner corresponding to the schema, such that within the data received by the system each individual data item comprises some or all of the fields defined in the schema, and each data item can be inserted as a row into the database. At step 202, the data received by the system is analysed in order to create a bitmap index for the received data, in the same way as for any data received in the method 100.

At step 203, the new rows and bitmap indexes are collected into a table segment, which is, at least initially, independent of any other table segments. The table itself is composed of one or more such segments, which can be treated as a concatenated whole, e.g. using a log-structured merge tree.
The use of table segments and log-structured merge trees is described in "The log-structured merge-tree (LSM-tree)", O'Neil, Cheng, Gawlick, O'Neil, Acta Informatica 33 (4), pages 351-385, June 1996. At step 204, the table segments are written to the system's long-term memory.

Importantly, since the machine learning models produced by the system are generated based on the bitmap indexes, as soon as the bitmap indexes are generated and accessible by the process responsible for generating the machine learning models, future models immediately take changes in the underlying data into account. Furthermore, since the system of the present invention generates models in response to each query, as explained in more detail below, this means that the updated training information is available to every model – and is thus used to generate responses to queries – immediately after the new data is entered into the system. Furthermore, there is no need for users of the system to maintain or update their queries to take the new data into account - the same queries continue to work and will now take the new data and the statistical evidence into account.

At write time, the system may also prepare infrastructure which enables the links between database tables to be accessed and followed quickly. The system resolves symbolic links into hard references (e.g. hashes) and, if the system recognizes that the target objects have changed since the last preparation, it updates the link infrastructure as follows:

· The row-level linking between a first table and a second table is determined in order to prepare a data structure that maps each row in the first table to rows in the second table. This data structure also contains information indicating whether the link is undefined for the row, which could occur if the link does not exist or the linked row cannot be found.

· It is determined which fields in the linked table (i.e. the second table) are defined, and which are not.
This enables conditional bits, which are discussed in more detail below, to be defined.

· Linked bitmap indexes are created. In essence, the linked bitmap indexes are created as if the linking table, i.e. the first table, were joined with the linked table, i.e. the second table. For example, if the engagement table defines a product field, and the product table has the field 'price', then the joined table includes a field 'product.price'. In practice, the system may or may not actually perform a join operation between the tables in order to produce the linked indexes – the example of joining is provided simply to illustrate the result of the process. Bitmap indexes are created for the 'joined' table rows and columns as described above.

Figure 3 depicts a method 300 for responding to a query received by the system of the present invention. At step 301, the system receives a query, for example from a user of the system or from a back-end process of another system, to which a response is required. Simple queries, such as those that resemble queries on conventional databases, e.g. a SQL SELECT query, can be handled by the system in a straightforward manner, similar to the way in which a conventional database operates, in which case the process 300 is not followed. However, a query received by the system of the present invention may also include unknown questions to which an answer is required. For example, the following query requests, from the database, a cheap laptop for a particular user that maximizes the probability that the user will click on the laptop. In the query, the examined data set is selected (from), the known is declared (where), the unknown is requested (get), the results are sorted (orderBy), the returned fields are selected (select), and then the results are paged (offset, limit).
{
  "from": "impressions",
  "where": {
    "user": 345,
    "query": "cheap laptop"
  },
  "get": "product",
  "orderBy": { "$p": { "$context": { "click": true } } },
  "select": [ "id", "title", "$p", "$why", "$highlight" ],
  "offset": 10,
  "limit": 10
}

Other queries may specify other unknowns which are to be filled in, e.g. predictions of a missing value, the rows that are the best answers to a query, and/or the statistical relations between fields or features. This particular query is a request for a prediction, since the term "$p": { "$context": { "click": true } } instructs the system to predict the likelihood that the field "click" will have the value "true" for each product. A predict query may also be of the form:

{
  "from": "impressions",
  "where": {
    "user": 345,
    "product.id": "54651"
  },
  "predict": "click"
}

In this example query, a single product is specified, for which a prediction will be made regarding the probability of it being clicked, based on existing data describing purchases in the "impressions" database table.

Alternatively, instead of a request for a prediction, a query may include a request for matches or recommendations, as described in more detail below, although the overall process depicted in Figure 3 remains the same.

At step 302, after the query is received, the system generates a naive Bayesian model of the problem - in the present example, the probability that a user will click on a given product given the evidence, e.g. the user's history of clicks. Other machine learning models, such as decision trees, may be generated instead of, or in addition to, the naive Bayesian model, based on the bitmap indexes. Any such models will enjoy the same advantages of increased speed, efficiency and scalability as discussed below for the naive Bayesian model.

The system selects the best features from the data stored in the database that can be used to distinguish/discriminate the results. To do this efficiently, sparse optimizations may be used, such as the Roaring compressed bitmap format, described in more detail in "Better bitmap performance with Roaring bitmaps", S. Chambi, D. Lemire, O. Kaser, R. Godin, Software: Practice and Experience, Volume 46, Issue 5, pages 709–719, May 2016.

By using a naive Bayesian model, rather than full Bayesian inference, the system is able to calculate the required statistics faster. However, for a naive Bayesian model to hold, the features used as evidence must be independent. Therefore, for data which contains features which are not independent, the system may select a subset of features, or create synthetic features, that are, or are close to being, statistically independent. Alternatively, or in addition to selecting intrinsic features of the stored state which are statistically independent, the system may transform the intrinsic features into derived independent features using query-time representation learning.

Representation learning in the system is carried out in the following way:

1. A pattern is recognised in the features, for example the unigram title:"software" is often followed by the unigram title:"developer" in a job title field.

2. A new extrinsic feature E is introduced to represent the recognised pattern in the title field, for example E = title:"software" & title:"developer".

3. The introduction of the new feature leads to duplication in the representation, since the information is encoded in both the new feature and the old features. The duplicated information in the old features is therefore eliminated by replacing the old features with conditional features, for example replacing the title:"software" feature with the conditional feature title:("software" | !E) and replacing title:"developer" with the feature title:("developer" | !E).

This process reduces the representation's description length, in accordance with the minimum description length principle, which states that the best hypothesis to describe the regularities in data is also the one that is able to compress the data most.

At step 303, the necessary statistics for the naïve Bayesian model (e.g. the lift in the probability of a click on a given laptop given the evidence for each laptop stored in the database) are calculated using bitwise logical operations on the bitmap indexes. At the simplest level, the naïve Bayesian model includes Bayesian lifts for a value X given the evidence E. Bayes' theorem states:

P(X|E) = P(X) × P(E|X) / P(E)

which can be stated as: the degree of belief in X given the evidence E is equal to the prior degree of belief in X (i.e. P(X)) times the lift (or support) in light of the evidence E. The term P(E|X)/P(E) is called the lift (or support), since it represents the change in the belief in value X in light of the evidence E.
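The representation-learning steps described above can be sketched as follows. This is a minimal sketch with assumed names, not the patent's implementation: features are stored as integer bitmaps over rows, a recognised co-occurrence pattern becomes a new feature E, and the old features are replaced by their conditional leftovers (only "on" where E is not).

```python
def introduce_pattern_feature(features, a, b):
    """Replace features a and b with E = a & b plus conditional leftovers."""
    E = features[a] & features[b]       # step 2: new extrinsic feature E
    features[f"({a} & {b})"] = E
    features[a] &= ~E                   # step 3: a becomes (a | !E)
    features[b] &= ~E                   # step 3: b becomes (b | !E)
    return features

# Rows 0..3 of a job table; bit i set -> unigram present in row i's title.
rows = {
    "title:software": 0b1011,
    "title:developer": 0b0011,
}
introduce_pattern_feature(rows, "title:software", "title:developer")
# rows["title:software"] is now 0b1000, rows["title:developer"] is 0b0000,
# and the new feature "(title:software & title:developer)" is 0b0011.
```

After the replacement, the information encoded by the bigram is stored exactly once, shortening the representation's description length as the minimum description length principle suggests.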
The conditional probability P(E|X) = P(E ∩ X)/P(X), so the two values P(E|X) and P(E) can be calculated based on the data stored in the database, by counting the following statistics: the number of database rows in which E occurs (independent of X), the number of database rows in which X occurs (independent of E), and the number of database rows in which X and E occur together.

However, in the system of the present invention the calculation of these statistics is significantly sped up by performing the calculations on the database indexes, rather than on the data itself. Furthermore, since the indexes are bitmap indexes, the calculations can be sped up further by using bitwise logical operations, e.g. bitwise AND, OR and NOT, which are very fast operations, and other bitwise operations such as POPCOUNT, for which modern CPUs often provide hardware support. Bitmap indexes can also be used to identify rows exceptionally quickly when sampling data by rows, or when assigning probability estimates to rows (e.g. during recommendation or matching operations).

However, since it is not necessary for every field within a given row to be defined, it is possible that for some rows it may be unknown whether X or E (or both) occurs. Thus, conditional bits, which record for each indexed value whether it is true, false or not defined, can be used. In order to calculate the Bayesian lift, it must first be determined in which rows of the database X, E or both X and E are defined. The statistics (i.e. the number of database rows in which E occurs (independent of X), the number of database rows in which X occurs (independent of E), and the number of database rows in which X and E occur together) can then be calculated by taking into account only the rows for which the values of X and/or E are defined.
Due to the fast nature of the bitwise operations, single statistics can be calculated on the microsecond scale, which allows thousands or tens of thousands of statistics to be calculated within milliseconds. This enables models to be generated on-the-fly, in response to individual queries. Consequently, the system does not need to store models after the query has been answered, significantly reducing the storage/memory requirements of the system. Furthermore, the models do not need to be updated in response to new information added to the database, and since the processes for generating the models and responding to predictions are so tightly integrated with the database itself, every model is generated based on the most up-to-date information available in the database, without requiring the inefficient and network-traffic-heavy synchronisation of data between separate database and AI systems seen in the prior art. Finally, by performing the statistical calculations on the indexes, which occupy less memory than the data itself, more information can be used in the calculation of each statistic, allowing further improvements in accuracy. These features of the system therefore result in a synergistic effect by which machine learning models can be produced rapidly and efficiently, while at the same time providing more accurate predictions in response to queries.

At step 304, the system generates and provides a query response to the user (or other requestor). In the example of the laptop recommendation, the query response includes a number of products which match the query for "cheap laptop", ranked in order of the probability that user 345 will click on that product when it is displayed to them. Where other types of unknowns form part of the initial query, the results include the corresponding predictions of a missing value, the rows that are the best answers to a query, and/or the statistical relations between fields or features.
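As a minimal sketch of the ranking and paging performed at step 304 (the data and names are illustrative, not from the patent): candidate products matching the query are ordered by their predicted click probability, then the offset/limit paging from the query is applied.

```python
def rank_results(candidates, limit=10, offset=0):
    """candidates: list of (product_id, predicted_click_probability)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return ranked[offset:offset + limit]

candidates = [("p17", 0.12), ("p03", 0.31), ("p44", 0.22)]
print(rank_results(candidates, limit=2))  # [('p03', 0.31), ('p44', 0.22)]
```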
The query response may also include details of the Bayesian lifts for each of the results and/or predictions provided. As such, the query response provides an explanation of the results that is comprehensible to a human user.

A predict query specifies a database table which is used to generate the naïve Bayesian model, the evidence E, which defines the values of fields for a row of the database table, and an indication of the field of the database table for which a value is to be predicted. In evaluating a predict query, the system uses data in the indicated database table to provide a prediction of the unknown field value of a database row given the values of one or more other fields in the database row specified in the query. Using a naïve Bayesian model, as described above with respect to Figure 3, the probability that the field to be predicted p has a given value v, based on the one or more other fields w_i specified for the database row, can be expressed as follows:

P(p = v | ∩_i w_i) = P(p = v) × ∏_i P(w_i | p = v) / P(∩_i w_i)

Each of the terms P(w_i | p = v), P(p = v) and P(∩_i w_i) can be quickly and efficiently determined using logical bitwise operations on the bitmap indexes, as described above. Where the indicated database table links to other tables in the database, those links may be followed, using linked indexes as described above, so that properties of the linked database rows are also taken into account by the Bayesian model, thereby improving the accuracy of the result while maintaining fast and efficient calculations.

A match query specifies a database table that contains previous matches between rows of a second database table and rows of a third database table, and requests a prediction of the best match from the second database table for a given row of the third database table. Using a naïve Bayesian model, the probability that row X of the second database table matches row Y of the third database table is:

P(match | X ∩ Y) = P(match) × P(X ∩ Y | match) / P(X ∩ Y)

By representing X and Y as bags of features, i.e.
X = {x_0, x_1, …, x_n}, where x_i is the ith feature of the element X, and Y = {y_0, y_1, …, y_m}, where y_j is the jth feature of the element Y, this can be re-expressed as

P(match | X ∩ Y) ≈ P(match) × ∏_i ∏_j P(x_i ∩ y_j | match) / P(x_i ∩ y_j)

This approximation relies heavily on the independence assumption within the features x_i and y_j. Therefore, it is important to manage the dependencies between the features as described above with respect to representation learning.

A recommend query specifies a database table in which each database row records previous pairings between rows of a second database table and rows of a third database table, along with at least one further field containing information which relates to the pairing. The recommend query requests a prediction of one of the rows from the second or third database table which optimises a goal, which is based on the value of the at least one further field of the database rows. The difference between recommend and match queries is that while matching seeks to imitate past behaviour represented in the database by previous matches, recommendation seeks to optimise it. Using a naïve Bayesian model, the probability that row s of the second database table S and row t of the third database table T achieves the goal G is:

P(G | s ∩ t) = P(G) × P(s ∩ t | G) / P(s ∩ t)

By representing s and t as bags of features, i.e. s = {s_1, s_2, …, s_n}, where s_i is the ith feature of the row s, and t = {t_1, t_2, …, t_m}, where t_j is the jth feature of the row t, this can be re-expressed as

P(G | s ∩ t) ≈ P(G) × ∏_i ∏_j P(s_i ∩ t_j | G) / P(s_i ∩ t_j)

This equation has n × m components. The statistics for these components can be quickly calculated using bitwise operations and POPCOUNTs on the bitmap indexes. Again, this approximation relies heavily on the independence assumption within the features s_i and t_j. Therefore, it is important to manage the dependencies between the features as described above with respect to representation learning.

A query may also include a request for validation of the query, i.e.
for a measure of accuracy of the predictions produced by the machine learning model generated in response to the query. In the context of the present invention, in which the AI and the database are tightly integrated, this enables the contents of the database to be used for both training and testing.

In conventional applications, validation methods are used to provide an indication of how well adapted a particular machine learning model is to the underlying system whose behaviour it is attempting to predict. This validation would then be used to tweak the machine learning model to attempt to improve its accuracy. In contrast, in the query-focused methods of the present invention, in which single-use query-specific machine learning models are generated, the validation instead provides a measure of accuracy of the query-specific machine learning model, and thus provides an indication of the accuracy of the specific query output, i.e. of the value of one or more fields, one or more recommended data records, and/or one or more data records which match the query.

A method for handling a request for validation of the query may also be provided. At a first step, a query is received by the system, the query including a request for query validation. This query may be the same as the query received at step 301 of the method 300. The request for query validation may include an indication of the proportion of data records within the database that are to be used as training data and the proportion to be used as testing data. For example, the request may include an indication that the ratio of data records to be used for training to those used for testing is 4:1, in which case 4/5 of the data records are used for training the machine learning model and 1/5 are used for testing. It will be appreciated that the ratio of training to testing data may be indicated in any suitable manner in the request, and need not be expressed explicitly as a ratio, e.g. 4:1.
Alternatively, the request might not specify a specific ratio, in which case a default ratio for validation could be used. Preferably, 1-fold cross validation is used as the validation method, since this allows for rapid model generation and query validation, as only a single model must be trained. Alternatively, a different testing method may be indicated in the request for query validation. For example, the request may specify that leave-p-out, k-fold, hold-out or Monte Carlo cross validation is to be used.

At a second step, the query-specific machine learning model is generated based on the data records of the database indicated in the request for query validation, i.e. based on the ratio of training to testing data. The second step corresponds to steps 302 and 303 of Figure 3 described above, differing only in that the data used to generate the machine learning model is selected according to the ratio of training to testing data. The testing/training data is preferably selected at random from the data records according to the defined ratio. Where 1-fold cross validation is used as the validation method, the machine learning model generated at the second step based on the training data may also be used to generate the query response, as described above with respect to Figure 3.

At a third step, the accuracy measure is generated according to the validation method, i.e. by generating predicted values of the requested fields for the testing data and comparing them with the actual known values of those fields contained within the testing data. At a fourth step, the accuracy measure is output, e.g. by transmitting it to the source of the query in response to the query. As such, the accuracy measure may also form part of the query response that is returned in the method 300 at step 304.

Figure 4 depicts an exemplary computer system 400 on which the method described above may operate.
In the simplest form, the system 400 is a general-purpose computer or server, which includes inter alia one or more CPUs 401, fast short-term memory 402, e.g. RAM, long-term memory 403, e.g. hard disks or solid state drives, and a network interface 404, through which the system 400 is connected to a network 410 such as the Internet and may communicate with other computer systems. Each of the elements 401 to 404 of the system 400 is internally communicatively coupled to the others, either directly or indirectly. Furthermore, while the system 400 is described with respect to a conventional memory hierarchy comprising short-term memory 402, such as RAM, and long-term memory 403, such as hard disks or solid state drives, the system 400 could alternatively, or additionally, employ a sufficient quantity of non-volatile random access memory which acts as both fast short-term memory 402 and long-term memory 403.

Optionally, the system on which the inventive method operates may comprise a plurality of computer systems, such as the computer system 400, connected together via network interfaces 404 or by other means. The computer systems may be located physically close to one another and be connected over a local area network, or may be located in disparate physical locations and be connected over a wide area network such as the Internet. Whether a single computer system 400 or multiple computer systems are used, the system may communicate over network 410 with one or more client devices, such as personal computers, servers, or any other suitable device. The data, schemas and queries received by the system may be received via network 410 from the client devices, and the output of the method, i.e. predictions, query results and result explanations, may be transmitted via network 410 to the same or to different client devices.