Title:
SYSTEM AND METHOD FOR DATA COMPRESSION AND STORAGE ALLOWING FAST RETRIEVAL
Document Type and Number:
WIPO Patent Application WO/2009/001174
Kind Code:
A1
Abstract:
The invention relates to the field of data compression and database management systems, and more particularly to systems and methods for compressing, storing and retrieving data from the compressed form. The proposed data compression method allowing fast retrieval comprises the steps of representing a set of strings as a context-free grammar and encoding the context-free grammar into a compressed file, wherein the said context-free grammar is an LM-grammar, which has one or more starting non-terminals, and wherein the right-hand side of every production rule of the said context-free grammar is either an empty string, or the leftmost grammar symbol of the right-hand side represents only one non-empty string, and for every two alternatives with the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.

Inventors:
FREIVALDS KARLIS (LV)
KIKUSTS PAULIS (LV)
RUCEVSKIS PETERIS (LV)
Application Number:
PCT/IB2007/052508
Publication Date:
December 31, 2008
Filing Date:
June 28, 2007
Assignee:
SMARTIMAGE SOLUTIONS SIA (LV)
FREIVALDS KARLIS (LV)
KIKUSTS PAULIS (LV)
RUCEVSKIS PETERIS (LV)
International Classes:
H03M7/30
Foreign References:
EP1122655A2 (2001-08-08)
Other References:
BOTTCHER S ET AL: "XML index compression by DTD subtraction", ICEIS 2007. PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS, VOLUME DISI INSTICC PORTUGAL, 12 June 2007 (2007-06-12), pages 86 - 94, XP009097565, ISBN: 978-972-8865-88-7
LOHREY ET AL: "The complexity of tree automata and XPath on grammar-compressed trees", THEORETICAL COMPUTER SCIENCE, AMSTERDAM, NL, vol. 363, no. 2, 28 October 2006 (2006-10-28), pages 196 - 210, XP005685578, ISSN: 0304-3975
WOLFF J E ET AL: "Searching and browsing collections of structural information", ADVANCES IN DIGITAL LIBRARIES, 2000. PROCEEDINGS. IEEE WASHINGTON, DC, USA 22-24 MAY 2000, LOS ALAMITOS, CA, USA, IEEE COMPUT. SOC, US, 22 May 2000 (2000-05-22), pages 141 - 150, XP010501112, ISBN: 0-7695-0659-3
COOPER B F ET AL: "A fast index for semistructured data", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON VERY LARGE DATA BASES, XX, XX, 2001, pages 341 - 350, XP002303292
Attorney, Agent or Firm:
FORTUNA, Jevgenijs (Raina boulevard 19, Riga, LV)

Claims

1. A data compression method allowing fast retrieval, comprising: representing a set of strings as a context-free grammar, encoding the context-free grammar into a compressed file, characterized in that

- the said context-free grammar is an LM-grammar, which has one or more starting non-terminals, wherein

- the right-hand side of every production rule of the said context-free grammar is either an empty string, or the leftmost grammar symbol of the right-hand side represents only one non-empty string and

- for every two alternatives with the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.

2. The method according to claim 1, characterized in that for every two alternatives with the same left-hand side and both non-empty right-hand sides, the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of one alternative is different from the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.

3. The method according to claim 1 or 2, characterized in that representing a set of strings as a context-free grammar comprises:

- forming a PATRICIA trie for each set of strings coming from one index (401);

- creating a collection of unique strings from all the labels of the edges of all the formed PATRICIA tries (402);

- creating a common context-free grammar representing those unique strings coming from one index (403);

- combining the formed PATRICIA tries with the created common context-free grammar to form an LM-grammar (404).

4. The method according to any of claims 1 to 3, characterized in that the step of encoding the context-free grammar into a compressed file comprises:

- dividing grammar rules into packets of a specified size (501);

- encoding each packet of grammar rules into the compressed file (503);

- forming a list of packet offsets (704), which consists of positions of each packet in the compressed file (103);

- encoding the list of packet offsets into the file (705);

- forming a list of packet rule counts (706) consisting of integer numbers representing the number of grammar symbols in each packet;

- encoding the list of packet rule counts into the compressed file (707).

5. A method of retrieval of data compressed according to any of claims 1 to 4, comprising:

- selecting an appropriate index allowing the query to be executed in the fastest way (901) and the starting symbol of the LM-grammar according to the selected index;

- performing a recursive search process which generates only the strings that match the query (903);

- decoding the strings by using the inverse of the encoding transformation performed yielding the original data rows (905).

6. The method according to claim 5, characterized in that the step of performing a recursive search process which generates only the strings that match the query comprises:

- identifying the longest prefix of the current partial string containing only terminal symbols (1002);

- based on the identified prefix, testing if the current string can match the query (1003);

- replacing the first non-terminal symbol in the current string with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol (1009);

- recursively continuing the search process on the modified string (1010).

7. The method according to claim 6, characterized in that all the occurrences of the first non-terminal symbol in the current string are replaced with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol.

8. The method according to claim 6 or 7, characterized in that the step of testing, based on the identified prefix, whether the current string can match the query comprises:

- dividing the prefix into fields that are between delimiters (1102);

- identifying the fully decoded fields and the partially decoded field (1104);

- testing if the fully decoded fields match the query (1105);

- deriving an interval from the partially decoded field (1110);

- testing if the said interval intersects with the interval derived from the query criteria (1111).

9. A data storage system for storing data in compressed form and for fast data retrieval, comprising a microprocessor and memory, characterized in that the memory is holding data, including data in the form of a context-free LM-grammar, which has one or more starting non-terminals, wherein

- the right-hand side of every production rule of the said context-free LM-grammar is either an empty string, or the leftmost grammar symbol of the right-hand side represents only one non-empty string and

- for every two alternatives with the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.

10. The data storage system according to claim 9, characterized in that for every two alternatives with the same left-hand side and both non-empty right-hand sides, the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of one alternative is different from the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative.

11. The data storage system according to claim 9 or 10, characterized in that the microprocessor is executing instructions, including instructions that cause the machine to perform the steps of:

- processing a set of strings to represent them as a context-free LM-grammar,

- performing encoding of the context-free LM-grammar into a compressed file.

12. The data storage system according to claim 11, characterized in that at the step of processing a set of strings to represent them as a context-free LM-grammar, the microprocessor is executing instructions that cause the machine to perform the steps of:

- forming a PATRICIA trie for each set of strings coming from one index (401);

- creating a collection of unique strings from all the labels of the edges of all the formed PATRICIA tries (402);

- creating a common context-free grammar representing those unique strings coming from one index (403);

- combining the formed PATRICIA tries with the created common context-free grammar to form an LM-grammar (404).

13. The data storage system according to claim 11 or 12, characterized in that, at the step of encoding the context-free LM-grammar into a compressed file, the microprocessor is executing instructions that cause the machine to perform the steps of:

- dividing grammar rules into packets of a specified size (501);

- encoding each packet of grammar rules into the compressed file (503);

- forming a list of packet offsets (704), which consists of positions of each packet in compressed file (103);

- encoding the list of packet offsets into the file (705);

- forming the list of packet rule counts (706) consisting of integer numbers representing the number of grammar symbols in each packet;

- encoding the list of packet rule counts into the compressed file (707).

14. The data storage system according to any of claims 9 to 13, characterized in that the microprocessor is executing instructions, including instructions that cause the machine to perform the steps of:

- selecting an appropriate index allowing the query to be executed in the fastest way (901) and the starting symbol of the LM-grammar according to the selected index;

- performing a recursive search process which generates only the strings that match the query (903);

- decoding the strings by using the inverse of the encoding transformation performed yielding the original data rows (905).

15. The data storage system according to claim 14, characterized in that the step of performing a recursive search process which generates only the strings that match the query comprises:

- identifying the longest prefix of the current partial string containing only terminal symbols (1002);

- based on the identified prefix, testing if the current string can match the query (1003);

- replacing the first non-terminal symbol in the current string with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol (1009);

- recursively continuing the search process on the modified string (1010).

16. The data storage system according to claim 15, characterized in that all the occurrences of the first non-terminal symbol in the current string are replaced with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol.

17. The data storage system according to claim 15 or 16, characterized in that, at the step of testing, based on the identified prefix, whether the current string can match the query, the microprocessor is executing instructions that cause the machine to perform the steps of:

- dividing the prefix into fields that are between delimiters (1102);

- identifying the fully decoded fields and the partially decoded field (1104);

- testing if the fully decoded fields match the query (1105);

- deriving an interval from the partially decoded field (1110);

- testing if the said interval intersects with the interval derived from the query criteria (1111).

18. A machine-readable medium having stored thereon a plurality of instructions, the plurality of instructions including instructions that, when executed by a machine, cause the machine to perform a method as claimed in any of claims 1 - 8.

Description:

System and method for data compression and storage allowing fast retrieval

Technical Field

The present invention relates to the field of data compression and database management systems, and more particularly to systems and methods for data compression, storage and retrieval of data from the compressed form.

Background Art

To organize, store and retrieve different kind of logically related data it is common to use a database. Typically, data in a database is logically represented as a collection of one or more tables. Each table is composed of a series of rows and columns. Each row in the table represents a collection of related data. Each column in the table represents a particular type of data. Thus, each row is composed of a series of data values, one data value from each column.

The larger the volume of data to be stored in a database, the more expensive the data storage. Data compression is a common technique in computer systems to reduce data storage costs. Commonly used compression tools such as zip or gzip compress a given file or collection of files into a smaller representation of the same data. A drawback of such an approach is that access to data is provided at file granularity: the whole file has to be decompressed even if only a small part of that file is actually needed by the user. In database systems access is typically needed at the granularity of individual records. A database system may contain large quantities of small records, and applying compression to each record individually does not yield a significant reduction of data size.

For effective searching of data within a database it is common to create access paths to the same data, that is, to perform data indexing. An index is a search pattern which ultimately locates the data. Its main disadvantage is that it requires space and maintenance.

Normally, the nature of the indexing technique, as well as the volume of the data held in the files, determines the number of I/O operations that are required to retrieve, insert, delete or modify a given data record.

Various types of indexing schemes have been developed. One of the most widely used indexing algorithms is the B-tree (Rudolf Bayer, Binary B-Trees for Virtual Memory, ACM-SIGFIDET Workshop 1971, San Diego, California, Session 5B, pp. 219-235), in which the keys are kept in a balanced tree structure and the lowest level points at the data itself. Another indexing technique uses the trie data structure, which enables fast search whilst avoiding data duplication in indices. The trie index file has the structure of a tree wherein the search is partitioned according to search key portions - digits or bits. The trie structure affords memory space savings, since the search key is not held, as a whole, in interim nodes and hence duplication is avoided. At the same time, the memory space savings using the trie data structure are not significant compared to the memory space savings using some tried-and-true data compression techniques, for example the RAR or zip compression techniques.

There are several approaches to implementing data compression inside a database management system (Cockshott, W. P., McGregor, D., Kotsis, N., and Wilson, J. "Data Compression in Database Systems". In Proceedings of the 1998 International Symposium on Database Engineering & Applications, IDEAS 1998, IEEE Computer Society). In such systems data is compressed at page level, yielding better compression than compressing each record separately, while relatively fast access time is retained. The drawback of such an approach is that database indices are not compressed. In typical cases indices can occupy as much as three times the space of the data itself, making the overall space saving insignificant. Also, such an approach does not reach the compression ratio of the RAR or zip compression tools.

A method for implementing storage and retrieval of data in a computing system disclosed in WO03096230 by Potapov et al., entitled "Method and Mechanism of Storing and Accessing Data and Improving Performance of Database Query Language Statements", provides compression of stored data by reducing or eliminating duplicated values in a database block. The on-disk data is configured to reference a copy of each duplicated data value stored in a symbol table. Information describing data duplication within a data block is maintained. The disclosed method also provides the ability to search compressed data. The decompression method disclosed in WO03096230 has a disadvantage which limits its usefulness. Each data block is compressed individually, hence the compression ratio is significantly lower than the compression ratio achieved by the RAR or zip compression tools. Besides that, no index compression is disclosed in the said method, hence the overall size reduction is limited.

Another compression approach is disclosed in GB 1280486 by Liozides et al., entitled "Multilevel Compressed Index Generation". Liozides et al. disclose a method for generating a multilevel compressed index for data retrieval. The index is partitioned into blocks, and care is taken to minimize the number of blocks accessed during search. This compression method provides only index compression; no data compression is performed. Therefore, the overall data size reduction is small. Although it is possible to use index compression methods simultaneously with data compression methods, it is not efficient since there is data duplication between the compressed index and the compressed data, making such an approach suboptimal.

There is a known method for performing lossless data compression and searching the compressed data by consulting locally stored archive metadata extracted from the files during the compression process (US patent application 20050187962, publication date 25.08.2005, "Searchable archive"). According to the said method, searching of compressed data is performed using search agents. Search agents operate by checking archive metadata to identify the compressed segments which may contain the required data. Such a search method is slow since in certain cases many segments have to be identified and decompressed to obtain the query result.

There is also a known method for performing lossless data compression and decompression (US patent 6,400,289, 04.06.2002, "System and Method for Performing Lossless Data Compression and Decompression"). The data compression method comprises: parsing an input string into an irreducible grammar using a trie-type data structure that represents variables of the irreducible grammar; updating said variables of said irreducible grammar based on at least one character to be parsed in the input string; and encoding said irreducible grammar into a string of bits. The method for performing data decompression comprises: decoding the input bit stream based on the irreducible grammar; and updating the irreducible grammar based on at least one character to be parsed in the input string; the decoding and updating steps being performed to substantially prevent any bit error in the input bit stream from adversely affecting operation of the decoding system. The method provides compression of a data stream corresponding to the input file. The disadvantage of this method is that it cannot be applied to databases having a large number of small strings that have to be compressed and decompressed individually. Additionally, this method does not allow fast searching inside the compressed data, i.e. searching that would be faster than reading and processing the whole compressed file; hence this method cannot be applied to databases.

The known method for performing lossless data compression uses a context-free grammar (US 6,400,289; C.G. Nevill-Manning, I.H. Witten and D.L. Maulsby. Compression by induction of hierarchical grammars. In J. A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah, March 1994. IEEE Computer Society Press, Los Alamitos, California). The context-free grammar as a structure is well known in the computer science field (Seymour Ginsburg, "The Mathematical Theory of Context Free Languages", McGraw-Hill, 1966). A context-free grammar G can be defined as a 4-tuple: G = (Vt, Vn, P, S), where

(g1) Vt is a finite set of terminals

(g2) Vn is a finite set of non-terminals

(g3) P is a finite set of production rules

(g4) S is an element of Vn, the distinguished starting non-terminal,

(g5) elements of P are of the form V -> w, where V is a non-terminal symbol and w is a string consisting of terminals and/or non-terminals, or w is an empty string denoted by ε.

In the case when there are several production rules with the same left-hand side, they are called alternatives and are grouped together. For example, the alternatives S -> a and S -> b are written as S -> a | b.

Grammar production rules are used for transforming strings. To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules any number of times, in any order, to rewrite this string. The rewriting process terminates when the string contains only terminal symbols. The language corresponding to a particular grammar consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string of the language. A grammar is called finite if it generates only a finite set of strings. The string generation process may be started with any grammar symbol, not only the starting symbol S, to generate the set of strings represented by this grammar symbol.

An example of a context-free grammar is shown below:

Vt = {a, b, c}, Vn = {S, A, B, C},

Production rules: S -> aA

A -> BC | cC | aB

B -> bc

C -> aa

This example of a finite context-free grammar generates a language consisting of three strings: "abcaa", "acaa", "aabc".
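The rewriting process described above can be checked mechanically. The following is an illustrative sketch (helper and variable names are ours, not from the patent) that exhaustively expands a finite grammar and collects the generated language:

```python
# Exhaustively expand a finite context-free grammar: repeatedly replace
# the first non-terminal in the sentential form by each of its
# alternatives, and collect the strings that contain only terminals.
def generate(grammar, symbols):
    for i, sym in enumerate(symbols):
        if sym in grammar:  # first non-terminal found
            strings = []
            for alt in grammar[sym]:
                strings += generate(grammar, symbols[:i] + alt + symbols[i + 1:])
            return strings
    return ["".join(symbols)]  # only terminals remain

# The example grammar: S -> aA, A -> BC | cC | aB, B -> bc, C -> aa.
grammar = {
    "S": [["a", "A"]],
    "A": [["B", "C"], ["c", "C"], ["a", "B"]],
    "B": [["b", "c"]],
    "C": [["a", "a"]],
}
print(sorted(generate(grammar, ["S"])))  # ['aabc', 'abcaa', 'acaa']
```

Since the grammar is finite, the recursion terminates and produces exactly the three strings listed above.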

Traditionally, grammar-based compression deals with only one string, making a context-free grammar whose starting symbol generates this string. It is evident to the person skilled in the art that these algorithms can work on a set of strings, producing a grammar with several start symbols S1 .. Sk, where each start symbol generates one string of the string set.

The known methods either do not allow data compression yielding a high compression ratio, or do not allow compression to be applied to databases having a large number of small strings that have to be compressed and decompressed individually, or do not allow fast searching inside the compressed data, that is, searching which would be faster than reading and processing the whole compressed file.

An effective data compression method allowing fast retrieval is required also for small electronic devices such as mobile phones, PDAs (Personal Digital Assistants), palmtop computers, GPS (Global Positioning System) receivers and car navigation systems, where the amount of available memory is limited. Such devices must hold large amounts of data; for example, a car navigation system has to hold the maps for several countries, which can occupy several gigabytes of storage space. Compressing the maps would decrease the costs of navigation systems or increase the number of maps that can be stored on the same device. Conventional compression methods such as zip or gzip cannot be used in this case since access to the map is required by geographical coordinates, which is not supported by these methods. Small electronic devices usually have limited computation power, a small amount of available RAM for processing and small program memory for storing the program code. The existing compressed database methods are not suited to such environments.

Thus, the known methods and systems do not achieve compression ratios similar to those achieved by the best archivers while at the same time permitting fast data searching and retrieval similar to current database systems.

A need therefore exists for an improved system and method for data compression and storage yielding high compression ratio and allowing fast access to data in databases.

Disclosure of Invention

To overcome the limitations in the prior art described above, the present invention discloses a method and system for compressing and storing data in compressed form, yielding a high compression ratio, and for efficiently retrieving the required data directly from the compressed format.

It is an object of the present invention to provide a method for lossless data compression yielding high compression ratio.

It is another object of the present invention to provide a data storage system for storing data in compressed form and allowing for fast retrieval.

It is another object of the present invention to provide a method and system that allows for efficient searching and provides fast data retrieval on a record granularity.

The present invention utilizes a specifically developed context-free grammar as a data structure for holding compressed data, and improved algorithms for data compression and retrieval. The proposed compression method unifies data compression with index compression in a common data structure, resulting in a high compression ratio and fast retrieval. A context-free grammar according to the proposed method is used not only for compression but also for fast searching of data. For that purpose a special kind of grammar is proposed. This special kind of grammar, hereinafter referred to as the LM-grammar (L stands for left, M stands for multi-entry), will be described below in greater detail. The right-hand side of any rule of the LM-grammar is in a special form. This grammar can have multiple starting non-terminals.

An LM-grammar G can be defined as a 4-tuple:

G = (Vt, Vn, P, {S1, S2, ... Sn}), where

(LMl) Vt is a finite set of terminals

(LM2) Vn is a finite set of non-terminals

(LM3) P is a finite set of production rules

(LM4) S1, S2, ... Sn are elements of Vn, the distinguished starting non-terminals,

(LM5) elements of P are of the form V -> w, and the following properties hold

(a) the right-hand side of every production rule is either an empty string, or the leftmost grammar symbol of the right-hand side represents only one non-empty string; i.e. every production rule of the context-free grammar is of the form A -> ε or A -> BC, where B is a grammar symbol that represents exactly one non-empty string and C denotes any (possibly empty) sequence of grammar symbols, and

(b) for every two alternatives with the same left-hand side and both non-empty right-hand sides, the string represented by the leftmost grammar symbol of the right-hand side of one alternative is not a prefix of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative; i.e. for every two alternatives with the same left-hand side A -> BC and A -> DE, the string represented by grammar symbol B is not a prefix of the string represented by grammar symbol D and the string represented by grammar symbol D is not a prefix of the string represented by grammar symbol B; C and E denote any (possibly empty) sequences of grammar symbols.

Properties (a) and (b) provide a way for fast searching. The search process is recursive traversal of the grammar and these properties allow traversing only the relevant parts of the grammar. Property (a) ensures that the string represented by the leftmost symbol of the right-hand side of any rule is unique. Property (b) ensures the ability to distinguish between alternatives for finding the relevant ones by looking at the strings represented by the first grammar symbols of the right-hand sides.

Alternatively, in another embodiment of the invention, property (b) is replaced by property (b1), where for every two alternatives with the same left-hand side and both non-empty right-hand sides, the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of one alternative is different from the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of the other alternative; i.e. for every two alternatives with the same left-hand side A -> BC and A -> DE, the first symbol of the string represented by grammar symbol B differs from the first symbol of the string represented by grammar symbol D; C and E denote any (possibly empty) sequences of grammar symbols.

The requirement (b1) is stricter than (b), and the compression ratio in this case can be worse. On the other hand, searching becomes faster since the non-relevant parts of the grammar can be identified more quickly. If (b1) is satisfied, one can find the right alternative by checking only the first symbol of the string represented by the leftmost grammar symbol of the right-hand side of a grammar rule. Property (b1) is used in the preferred embodiment.
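The pruning enabled by property (b1) can be sketched as follows. This is an illustrative simplification under our own naming, not the patent's implementation: `first_char` relies on property (a), by which the leftmost symbol of every non-empty right-hand side represents exactly one string, so any of its alternatives yields the same first character. The example grammar from the previous section (which happens to satisfy (b1)) is used as test data.

```python
# Recursive search over a grammar satisfying properties (a) and (b1):
# enumerate only the strings that start with a given prefix. Alternatives
# whose leftmost symbol cannot produce the next required character are
# discarded after a single character comparison.
def first_char(grammar, sym):
    # By property (a) the symbol represents exactly one string, so any
    # alternative yields the same first character.
    while sym in grammar:
        sym = grammar[sym][0][0]
    return sym

def search(grammar, form, prefix, out):
    i = 0
    while i < len(form) and form[i] not in grammar:
        i += 1
    decoded = "".join(form[:i])  # fully decoded terminal prefix
    if decoded[:len(prefix)] != prefix[:len(decoded)]:
        return  # decoded part already contradicts the query
    if i == len(form):
        out.append(decoded)
        return
    for alt in grammar[form[i]]:
        if alt and len(decoded) < len(prefix):
            # property (b1): one character decides between alternatives
            if first_char(grammar, alt[0]) != prefix[len(decoded)]:
                continue
        search(grammar, form[:i] + alt + form[i + 1:], prefix, out)

g = {
    "S": [["a", "A"]],
    "A": [["B", "C"], ["c", "C"], ["a", "B"]],
    "B": [["b", "c"]],
    "C": [["a", "a"]],
}
hits = []
search(g, ["S"], "aa", hits)
print(hits)  # ['aabc']
```

Only the alternative A -> aB is ever expanded for the prefix "aa"; the alternatives beginning with 'b' and 'c' are rejected by the single-character test.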

According to the present invention, the LM-grammar is used as the primary mechanism for storing compressed data. Each starting non-terminal of an LM-grammar corresponds to one database index. Each starting non-terminal generates a set of strings which corresponds to the rows of the database table. If the system comprises several indices, each database row can be obtained through its corresponding string from every index. Such storage organization allows searching by multiple criteria. At the same time, the storage space overhead is negligible since the generation process can use common grammar rules.

An example of an LM-grammar representing the two sets of strings {"abcaa", "acaa", "aabc"} and {"baac", "bbca", "aacaa"} is shown below:

Vt = {a, b, c}, Vn = {S1, S2, A, B, C, D, E}, starting non-terminals = {S1, S2}

Production rules:

S1 -> aA

A -> BC | cC | aB

B -> bc

C -> aa

S2 -> bD | EC

D -> E | Ba

E -> Cc
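Assuming the rule S1 -> aA for the first start symbol (consistent with the earlier single-set example), the claim that each starting non-terminal generates its own string set can be checked with a short illustrative expansion:

```python
# Expand each starting non-terminal of the two-entry LM-grammar example
# and collect the terminal strings it generates. The rule S1 -> aA is an
# assumption carried over from the earlier single-set example.
def expand(grammar, symbols):
    for i, sym in enumerate(symbols):
        if sym in grammar:
            out = []
            for alt in grammar[sym]:
                out += expand(grammar, symbols[:i] + alt + symbols[i + 1:])
            return out
    return ["".join(symbols)]

lm = {
    "S1": [["a", "A"]],  # assumed rule, see lead-in
    "S2": [["b", "D"], ["E", "C"]],
    "A": [["B", "C"], ["c", "C"], ["a", "B"]],
    "B": [["b", "c"]],
    "C": [["a", "a"]],
    "D": [["E"], ["B", "a"]],
    "E": [["C", "c"]],
}
print(sorted(expand(lm, ["S1"])))  # ['aabc', 'abcaa', 'acaa']
print(sorted(expand(lm, ["S2"])))  # ['aacaa', 'baac', 'bbca']
```

Note that the rules for B and C are shared between both start symbols, which is what keeps the storage overhead of multiple indices low.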

Another aspect of the invention concerns using the PATRICIA trie data structure (D. R. Morrison. PATRICIA - Practical Algorithm To Retrieve Information Coded In Alphanumeric. Journal of the ACM, 15(4), pp. 514-534, Oct. 1968). For a given set of strings, the PATRICIA trie has one node for every maximal common prefix. Leaves of the PATRICIA trie correspond to input strings. Edges are labeled with strings such that the path from the root to any node spells out the prefix associated with that node. According to the present invention, the PATRICIA trie is used as an intermediate step for generating prefixes of the set of strings in the LM-grammar creation process.
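A compact-trie construction in the spirit of PATRICIA can be sketched as follows. This is an illustrative simplification, not the patent's implementation: nodes are plain dicts mapping edge labels to subtrees, and an edge is split whenever a new string shares only part of its label.

```python
import os

# Insert a string into a compact trie: edges carry string labels, and an
# edge is split at the shared portion whenever a new string matches only
# part of its label, so every internal node corresponds to a maximal
# common prefix of the input strings.
def insert(node, s):
    for label in list(node):
        p = os.path.commonprefix([label, s])
        if not p:
            continue
        if p == label:                       # edge fully matched: descend
            insert(node[label], s[len(p):])
        else:                                # split the edge at the prefix
            child = node.pop(label)
            node[p] = {label[len(p):]: child, s[len(p):]: {}}
        return
    node[s] = {}                             # no shared prefix: new edge

trie = {}
for s in ["abcaa", "acaa", "aabc"]:
    insert(trie, s)
print(trie)  # {'a': {'bcaa': {}, 'caa': {}, 'abc': {}}}
```

The resulting edge labels ("a", "bcaa", "caa", "abc") are exactly the kind of unique label strings that the method later collects (step 402) as input for the common grammar.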

The present invention may be embodied in a computer-based data storage system comprising at least a microprocessor and memory. The said memory can be random access memory, a hard or fixed disk, flash memory, CD-ROM, internal memory of the microprocessor, or any other device capable of holding digital data. A typical system optionally also comprises an input-output controller, a keyboard, a display and a pointing device. According to the present invention, the said memory holds a digital representation of the LM-grammar. The microprocessor, in turn, executes instructions for creating the LM-grammar, for encoding the LM-grammar into a stream of bits forming the compressed file, or for searching by using the LM-grammar and retrieving data matching a given query.

The present invention can also be embodied in the form of instructions for a microprocessor, or program code in some programming language, on a machine-readable storage medium, such as floppy diskettes, flash memory, CD-ROM or a hard or fixed disk, wherein, when loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the instructions or program code segments combine with the processor to provide a unique device that performs the disclosed methods and operates analogously to specific logic circuits.

Brief Description of Drawings

Fig. 1 illustrates a database table in conventional non-compressed form, one or more indices, and an overview of the method of data and index compression, yielding a compressed file;

Fig. 2 is a flowchart of the process of creation of the sets of strings;

Fig. 3 is a flowchart of one embodiment of the method for converting a data row into a string;

Fig. 4 is a flowchart of one embodiment of the method for building the LM-grammar from the set of strings;

Fig. 5 illustrates an example of tries for each set of strings coming from one index, generated from the fields of the database table shown in Fig. 1;

Fig. 6 is a flowchart of the process of encoding the LM-grammar into the compressed file;

Fig. 7 is a flowchart of one embodiment of the method for writing a grammar rule to file;

Fig. 8 is a block diagram (overview) of the retrieval process;

Fig. 9 is a flowchart of the top-level process of retrieval according to an embodiment of the present invention;

Fig. 10 is a flowchart of a recursive search process which generates only the strings that match the query;

Fig. 11 is a flowchart of the process of testing whether the given partial string can match the query criteria;

Fig. 12 is a flowchart of the steps that are performed when the compressed file is opened for reading;

Fig. 13 is a flowchart of the process of reading the rule referenced by its identifier.

Fig. 1 illustrates a database table 101 in conventional non-compressed form, one or more indices 102, and the method of data and index compression (blocks 105-107), yielding a compressed file 103. Database table 101 consists of data rows, each row having one or more fields. This example database table 101 has three columns; the fields in column 1 and column 2 are of string type and the field in column 3 is of integer type. According to one embodiment of the invention, each index 102 is an ordered list of table column names. This example database table 101 is indexed by two indices - the first by Column 1, the second by Column 2. Before starting the process of compression of database table 101 together with indices 102, a set of strings is created at block 104. The process of creation of a set of strings is illustrated in Fig. 2. As the first step 200, index information and table column information is written to the compressed file 103. The table 101 column information contains the number of columns, the column names and data types. Index information contains the number of indices and the ordered list of column names for each index 102. As the next step 201, a loop is carried out over each row and each index, and each row is converted into a string corresponding to each of the indices as described in Fig. 3. The process of converting a row into a string for a given index starts at steps 202-203 where each field of the row is converted into a string, depending on the type of the field. A string type field is represented as a string of bytes in ASCII, UTF-8, UTF-16 or any other encoding. A 32-bit integer is represented as a fixed-length string of 4 bytes, highest byte first. Similarly, a 64-bit integer is stored as a string of 8 bytes, highest byte first. Positive double values are represented as their raw bits in IEEE 754 floating-point format with the highest bit inverted. Negative double values are represented as all their raw bits in IEEE 754 floating-point format inverted.
Such storage of double values ensures that the resulting lexicographic ordering of the strings is the same as the ordering of the double values. The next step 204 is concatenation of the individual field strings to form the row string, in the order specified in the index, with a special delimiter symbol inserted between the fields to delimit them. The delimiter is a character that does not appear elsewhere in the data. If the index does not contain all fields of the row, the missing fields are appended at the end in arbitrary order. If some field is of fixed length, the field delimiter after this field can be omitted.
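The order-preserving field encodings described above can be sketched as follows. This is a minimal illustration; the helper names `encode_int32` and `encode_double` are assumptions, not names from the patent, and non-negative integers are assumed since the patent does not address the sign of integer fields:

```python
import struct

def encode_int32(n):
    # A 32-bit integer as a fixed-length string of 4 bytes, highest byte first.
    # (Non-negative values assumed; the patent does not address negative integers.)
    return struct.pack(">I", n)

def encode_double(x):
    # Raw IEEE 754 bits of the double, highest byte first.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits >> 63:
        bits ^= (1 << 64) - 1   # negative value: invert all raw bits
    else:
        bits ^= 1 << 63         # non-negative value: invert only the highest (sign) bit
    return struct.pack(">Q", bits)
```

With this encoding, byte-wise lexicographic order of the encoded doubles coincides with their numeric order, which is exactly the property the paragraph above relies on.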

For example, two sets of strings are generated from the fields of database table 101: for index 1 - {"abb#ca#005", "caa#bbb#017", "abc#bab#100"} and for index 2 - {"ca#abb#005", "bbb#caa#017", "bab#abc#100"}. The field delimiter is denoted by #. For simplicity of presentation it is assumed that integers are represented in decimal notation instead of bytes, and that the maximum length of a decimal number is 3 digits.

After the set of strings is created, the process transfers to block 105 (Fig. 1) where the set of strings is represented as a context-free LM-grammar 106. After that, the context-free LM-grammar 106 is encoded into a stream of bits (block 107), yielding a compressed file 103.

Fig. 4 illustrates a flowchart of one embodiment of the method for building the LM-grammar 106 from the set of created strings. The process of forming an LM-grammar starts at block 401 by forming a separate PATRICIA trie for each collection of strings coming from the same index. Tries for each set of strings coming from one index, generated from the fields of the database table 101 shown in Fig. 1, are illustrated in Fig. 5. The next step 402 of building the LM-grammar is the creation of a collection of unique strings containing the strings from all the labels of the edges of all the tries. If several edges have the same label, only one string is formed for all of them. The following collection of unique strings is created from the trie edge labels shown in Fig. 5: {"ab", "caa#bbb#017", "b#ca#005", "c#bab#100", "b", "ca#abb#005", "ab#abc#100", "bb#caa#017"}.
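The edge labels of a PATRICIA (radix) trie can be collected with a short recursive procedure. The sketch below is illustrative only (the function name `trie_edges` is an assumption, not from the patent); it reproduces the collection of unique strings listed above for the two example string sets:

```python
def trie_edges(strings):
    """Collect the edge labels of a PATRICIA trie built over `strings`."""
    edges = []

    def build(suffixes):
        groups = {}
        for s in suffixes:
            groups.setdefault(s[0], []).append(s)   # branch on the first character
        for group in groups.values():
            lcp = group[0]                          # longest common prefix of the group
            for s in group[1:]:
                i = 0
                while i < min(len(lcp), len(s)) and lcp[i] == s[i]:
                    i += 1
                lcp = lcp[:i]
            edges.append(lcp)                       # one edge per maximal common prefix
            rest = [s[len(lcp):] for s in group if len(s) > len(lcp)]
            if rest:
                build(rest)

    build(strings)
    return edges

index1 = ["abb#ca#005", "caa#bbb#017", "abc#bab#100"]
index2 = ["ca#abb#005", "bbb#caa#017", "bab#abc#100"]
labels = set(trie_edges(index1)) | set(trie_edges(index2))
```

Duplicate labels across the two tries are merged by taking the set union, mirroring step 402.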

As the next step 403, a common context-free grammar having grammar symbols U1..Uk representing the given unique strings is formed. To achieve compression, this grammar should be as short as possible. Any of the known methods for performing data compression using a context-free grammar can be used for this purpose. The method described in C.G. Nevill-Manning, I.H. Witten and D.L. Maulsby, "Compression by induction of hierarchical grammars", in J.A. Storer and M. Cohn (editors), Proc. IEEE Data Compression Conference, pages 244-253, Snowbird, Utah, March 1994, IEEE Computer Society Press, Los Alamitos, California, is used in the preferred embodiment. In the case of the example, a possible common context-free grammar of these strings with starting non-terminals U1 - U8 is as follows:

U1 -> ab

U2 -> KDJL

U3 -> MFH

U4 -> cDaMI

U5 -> b

U6 -> FU1MH

U7 -> U1#U1c#I

U8 -> JK#L

C -> ca

D -> #b

E -> 00

F -> C#

H -> E5

I -> 1E

J -> bM

K -> Ca

L -> 017

M -> b#

Then, at block 404, the created tries are combined with the generated common context-free grammar to form the LM-grammar. For this purpose, tries are processed one by one; a non-terminal symbol is formed for each non-leaf trie node. A grammar rule of the form R -> AB is created for each trie edge, having the non-terminal symbol R corresponding to the source node of this edge as the left-hand side of the rule. The right-hand side of the rule is the concatenation of the string of grammar symbols denoted by A and the non-terminal symbol B corresponding to the target node of this edge, where the string A is either the right-hand side of the grammar rule Ui -> A of the common grammar corresponding to the label of the trie edge, if the grammar symbol Ui does not occur anywhere in the right-hand side of any other rule, or A is the grammar symbol Ui itself, if Ui occurs in the right-hand side of some other grammar rule. The target non-terminal B in the created rule is omitted if the target node of the edge is a leaf in the trie. The LM-grammar includes all the referenced rules of the common grammar and the newly generated rules. The starting non-terminals of the LM-grammar correspond to the root nodes of the tries.

The obtained grammar for the database table 101 illustrated in Fig. 1 is the following LM-grammar with starting symbols S1 and S2:

S1 -> U1A | KDJL

S2 -> bB | FU1MH

A -> MFH | cDaMI

B -> U1#U1c#I | JK#L

U1 -> ab

C -> ca

D -> #b

E -> 00

F -> C#

H -> E5

I -> 1E

J -> bM

K -> Ca

L -> 017

M -> b#
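The example LM-grammar can be checked mechanically by expanding every starting symbol and comparing the result with the original string sets. The sketch below encodes the rules as lists of grammar symbols, with the rule S1 -> U1A | KDJL reconstructed from the worked retrieval example later in the text; the names `RULES` and `expand` are illustrative assumptions:

```python
RULES = {
    "S1": [["U1", "A"], ["K", "D", "J", "L"]],
    "S2": [["b", "B"], ["F", "U1", "M", "H"]],
    "A": [["M", "F", "H"], ["c", "D", "a", "M", "I"]],
    "B": [["U1", "#", "U1", "c", "#", "I"], ["J", "K", "#", "L"]],
    "U1": [["a", "b"]], "C": [["c", "a"]], "D": [["#", "b"]],
    "E": [["0", "0"]], "F": [["C", "#"]], "H": [["E", "5"]],
    "I": [["1", "E"]], "J": [["b", "M"]], "K": [["C", "a"]],
    "L": [["0", "1", "7"]], "M": [["b", "#"]],
}

def expand(symbol):
    """Return all terminal strings represented by `symbol`."""
    if symbol not in RULES:
        return [symbol]                 # terminal symbol represents itself
    results = []
    for alt in RULES[symbol]:
        parts = [""]
        for sym in alt:                 # concatenate expansions of each symbol
            parts = [p + e for p in parts for e in expand(sym)]
        results.extend(parts)
    return results
```

Expanding S1 and S2 yields exactly the two string sets built from the indices of table 101.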

The next step 107 of the process of data compression is illustrated in Fig. 6 in greater detail. First, the grammar rules are divided into packets of a specified size at block 501. All the alternatives with the same left-hand side are considered as one grammar rule when storing the LM-grammar into the compressed file 103. In this way each grammar rule is uniquely identified by its left-hand side grammar symbol. The rules are sorted and divided into contiguous chunks according to that sorting. There are several alternative ways of sorting the rules for achieving different goals. In the preferred embodiment, to achieve the most efficient coding, rules are sorted by their reference counts. The reference count is the number of times the left-hand side symbol of a rule appears in the right-hand sides of other grammar rules. Then, the required frequency model can be built and encoded efficiently. In some other embodiment, to achieve faster search time, rules can be sorted topologically in the underlying reference graph or by distance from the root, thus achieving that commonly accessed rules are together in one packet or in a few packets. The goal of dividing rules into packets is to be able to read from file and decode the grammar rules of each packet independently, based on the query to be performed. The size of each packet is selected to balance read performance with the overhead of storing utility information for each packet. The preferred size of a packet is 1 to 8 kilobytes. A size of 4 kilobytes is used in the preferred embodiment.

The next step 502 of the compression process is forming the encoding frequency model from the rule reference counts, encoding the model and writing it to the compressed file 103. The model is an array of frequencies of occurrence of each grammar rule in the sorted order. The frequencies can be represented exactly and written into the file as 4-byte integer numbers, or with a variable-length integer code such as the Golomb code, Elias delta code, Elias gamma code, Elias omega code, Fibonacci code or Rice code, preferably the Elias delta code. In some other embodiment frequencies are represented approximately by quantization into a fixed number of bits. In another embodiment frequencies are encoded with a wavelet transform (Charles K. Chui, An Introduction to Wavelets, Academic Press, San Diego, 1992, ISBN 0121745848) and the wavelet coefficients are saved into the file. Alternatively, in the case where rules are sorted by frequency in decreasing order, the frequencies are approximated with a power-law function and only the coefficients of the function are written to the file. In the case when entropy coding such as arithmetic or Huffman coding is not used in encoding grammar rules, the step 502 of forming and encoding the frequency model may be omitted. In the preferred embodiment the frequency model obtained by approximation with a power-law function is used.
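As an illustration of the variable-length integer codes mentioned above, the Elias delta code of a positive integer writes the bit length of the integer in Elias gamma code, followed by the binary digits of the integer without its leading 1. A minimal bit-string sketch (a real implementation would pack these bits into bytes):

```python
def elias_gamma(n):
    # floor(log2 n) zero bits, then the binary representation of n
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_delta(n):
    # gamma code of the bit length of n, then the digits of n without the leading 1
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]
```

Small values get short codewords (the code for 1 is a single bit), which suits frequency lists sorted in decreasing order.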

Then, at block 503, all grammar rules of each packet are encoded into the compressed file 103 according to the process illustrated in Fig. 7. All alternatives of the given rule are encoded into the compressed file 103. A pointer to the current alternative is allocated and set to the first alternative of the rule at block 604. Next, at block 605, a pointer to the current symbol is allocated and set to the first symbol of the current alternative. Then the current symbol is encoded and emitted to the file at block 606. According to the preferred embodiment, encoding is done using arithmetic coding (I.H. Witten, R. Neal and J.G. Cleary, "Arithmetic Coding for Data Compression", Communications of the ACM, Vol. 30, pp. 520-540, 1987), which utilizes the frequency model built in step 502. Alternatively, grammar symbols can be encoded using a fixed number of bits, or using Huffman coding (D. A. Huffman, "A method for the construction of minimum-redundancy codes", Proceedings of the I.R.E., September 1952, pp. 1098-1102), or variable-length integer codes such as the Golomb code, Elias code, Fibonacci code or Rice code. The next step is carrying out a test 607 whether the current symbol is the last one of the alternative. If it is not, the next symbol is set to current at block 608, and the process is transferred to block 606. In the opposite case, a special symbol NEXT_ALTERNATIVE is written to the compressed file 103 at block 609, indicating that the current alternative has ended and the corresponding decoder should switch to the next alternative. The next step 610 is testing whether the current alternative is the last one of the rule. If it is not, the next alternative is set to current at block 611, and the process is transferred to block 605. In the opposite case a special symbol NEXT_RULE is written to the compressed file 103 at block 612, indicating that the current rule has ended and the corresponding decoder should switch to the next rule of the packet.
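Ignoring the entropy-coding layer, the packet layout produced by blocks 604-612 can be modelled as a flat symbol stream with the two sentinel symbols. A round-trip sketch; the names `encode_packet` and `decode_packet` are assumptions for illustration:

```python
NEXT_ALTERNATIVE, NEXT_RULE = "<NA>", "<NR>"

def encode_packet(rules):
    """rules: list of rules; each rule is a list of alternatives;
    each alternative is a list of grammar symbols."""
    stream = []
    for alternatives in rules:
        for alt in alternatives:
            stream.extend(alt)
            stream.append(NEXT_ALTERNATIVE)   # block 609: alternative has ended
        stream.append(NEXT_RULE)              # block 612: rule has ended
    return stream

def decode_packet(stream):
    """Inverse of encode_packet: rebuild the rules from the symbol stream."""
    rules, alts, cur = [], [], []
    for sym in stream:
        if sym == NEXT_ALTERNATIVE:
            alts.append(cur); cur = []
        elif sym == NEXT_RULE:
            rules.append(alts); alts = []
        else:
            cur.append(sym)
    return rules
```

In the patent the symbols of the stream are additionally entropy coded (arithmetic coding in the preferred embodiment); here they are kept as plain tokens to show the framing only.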

The next step 704 of the process of encoding (Fig. 6) is forming a list of packet offsets. The list consists of the positions of each packet in the compressed file 103. The next step 705 is encoding this list of packet offsets into the compressed file 103. Each offset is written using a fixed number of bits, or with a variable-length integer code such as the Golomb code, Elias code, Fibonacci code or Rice code. Preferably the Elias code is used to keep the offset list compact. The next step 706 is forming the list of packet rule counts. The list consists of integer numbers representing the number of grammar rules in each packet. The next step 707 is encoding this list of packet rule counts into the compressed file 103. Each integer is written using a fixed number of bits, or with a variable-length integer code. Preferably the Elias code is used.

Data Retrieval

Fig. 8 illustrates an overview of the data retrieval process. Given a compressed file 103 that is created according to the disclosed compression method and a query 801, the retrieval process 802 retrieves and decompresses the data 803 matching the query. The search query consists of criteria specified for each field of the database table and is written down in some query language such as the Structured Query Language (SQL). Only queries which have separate criteria on each field combined with AND are described, but it is obvious for a person skilled in the art that the present invention can be applied to more complicated queries, too.

The top-level process of retrieval is illustrated in Fig. 9. As the first step 901, an appropriate index is selected which allows the query to be executed in the fastest way. More than one known technique for selecting the best index may be adopted for realization of the search process in the offered method. As an example, the index which starts with the field having the strictest criterion can be selected. The correctness of the method does not depend on the selected index; only performance is affected. The starting grammar symbol of the LM-grammar which corresponds to the selected index is selected. The next step 902 is the creation of a list for holding the strings that match the query criteria. Initially the list is empty. Then, at block 903, the search process described in Fig. 10 is performed, passing a parameter which is a string consisting of the selected starting grammar symbol. When the search process finishes, the resultList contains the strings of the rows matching the query. At blocks 904-907 a loop is carried out over all strings. For that purpose, at block 904 a pointer to the current string is allocated and set to the first string in resultList. The next step 905 is decoding the string by using the inverse of the encoding transformation performed as indicated in Fig. 3, yielding the original data rows. The decoded rows are reported to the user. The next step 906 is checking if the current string is the last in the resultList. If yes, the process terminates. If no, the next string in resultList is set to current at step 907 and the process is transferred to block 905.

Fig. 10 illustrates a recursive search process which generates only the strings that match the query. The process takes a parameter partial_string (block 1001) which initially holds one symbol that is one of the starting symbols of the LM-grammar. Partial_string is modified during the process to hold the current partially decoded string. At the next step 1002, the longest prefix of partial_string containing only terminal symbols is identified, starting at the beginning of the string and ending before the first non-terminal symbol. The whole partial_string is taken as the prefix if it has no non-terminals. The next step 1003 of the process is, based on the identified prefix, testing whether the current partial_string can match the query criteria. That is illustrated in Fig. 11. If there is no match, the process ends. If there is a match, a test 1004 is carried out whether partial_string contains non-terminal symbols. If no, a matching row is found, partial_string is added to the resultList at block 1005, and the process terminates. If partial_string contains non-terminal symbols, the first non-terminal symbol in partial_string, denoted by R, is identified at block 1006. Next, in block 1007, a pointer to the current alternative is allocated and set to the first alternative of the rule with the left-hand side equal to R. Then the process is transferred to blocks 1008 and 1009 where a copy of partial_string (named partial_string1) is made and the first occurrence of symbol R in partial_string1 is replaced with the right-hand side of the current alternative. Then, the process is transferred to block 1010 where the search process is continued recursively on partial_string1. After returning from the recursive call, the process is transferred to block 1011, where a test is carried out if the current alternative is the last one. If yes, the process terminates. If no, the process continues to block 1012 where the next alternative is taken and the process is further transferred to block 1008. An alternative embodiment is possible where all occurrences of symbol R are replaced with the current alternative in block 1009.
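The recursion of Fig. 10 can be sketched directly. The fragment below uses the example LM-grammar of table 101 (with the rule S1 -> U1A | KDJL as reconstructed from the worked retrieval example later in the text) and a simplified matcher that checks only an equality criterion on the first field; the helper names are illustrative assumptions, not from the patent:

```python
RULES = {
    "S1": [["U1", "A"], ["K", "D", "J", "L"]],
    "S2": [["b", "B"], ["F", "U1", "M", "H"]],
    "A": [["M", "F", "H"], ["c", "D", "a", "M", "I"]],
    "B": [["U1", "#", "U1", "c", "#", "I"], ["J", "K", "#", "L"]],
    "U1": [["a", "b"]], "C": [["c", "a"]], "D": [["#", "b"]],
    "E": [["0", "0"]], "F": [["C", "#"]], "H": [["E", "5"]],
    "I": [["1", "E"]], "J": [["b", "M"]], "K": [["C", "a"]],
    "L": [["0", "1", "7"]], "M": [["b", "#"]],
}

def first_field_equals(value):
    """Matcher for the query 'Column 1 = value' on the terminal prefix."""
    def test(prefix):
        field = prefix.split("#", 1)[0]
        if "#" in prefix:                 # first field fully decoded: exact test
            return field == value
        return value.startswith(field)    # partially decoded: a completion may match
    return test

def search(partial, match_prefix, result):
    i = 0
    while i < len(partial) and partial[i] not in RULES:
        i += 1                            # block 1002: longest terminal prefix
    prefix = "".join(partial[:i])
    if not match_prefix(prefix):
        return                            # block 1003: prune, no completion matches
    if i == len(partial):
        result.append(prefix)             # block 1005: fully decoded matching string
        return
    for alt in RULES[partial[i]]:         # blocks 1007-1012: try every alternative
        search(partial[:i] + alt + partial[i + 1:], match_prefix, result)

result = []
search(["S1"], first_field_equals("abc"), result)
```

Running the sketch reproduces the worked example: only "abc#bab#100" reaches the result list, while branches such as "abb#..." and "caa..." are pruned after a short prefix is decoded.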

The process of testing whether the prefix of the partial string can satisfy the query criteria will now be described (Fig. 11). As the first step, at block 1101, the parameter prefix is taken. Next, at block 1102, the prefix is divided into the fields that are between field delimiters. There are some fields that are fully decoded, and there is a prefix of the last decoded field. Next, the process is transferred to block 1103 where a pointer to the current field is allocated and set to the first decoded field. Next, in block 1104 a check is performed on whether the current field is completely decoded. If yes, the current field is tested at block 1105 on whether it satisfies the query criteria. If the criteria are not satisfied, the process ends with result FALSE at block 1106. If the current field satisfies the query criteria, the process is transferred to block 1107 where a test is carried out if the current field is the last one. If it is the last, the process ends with result TRUE at block 1108. If the current field is not the last one, step 1109 is taken where the next field is set to current and the process is returned to step 1104. If the test at step 1104 returns No, the process is transferred to block 1110, where the interval of possible values that the field can take in order to have the given prefix is derived from the partially decoded field. Next, at step 1111, the derived interval is checked on whether it intersects with the interval defined by the query criteria. If there is no intersection, the process terminates with result FALSE (block 1106). If there is an intersection, the process is transferred to block 1107.
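The test of Fig. 11 can be sketched as follows, representing the query criteria as closed intervals per field. The function name `can_match`, the `criteria` dictionary and the `top` sentinel character (playing the role of the "z" in the worked example) are illustrative assumptions:

```python
def can_match(prefix, criteria, delim="#", top="\uffff"):
    """Can some completion of `prefix` satisfy the per-field interval criteria?

    `criteria` maps a field index to a closed interval (low, high)."""
    fields = prefix.split(delim)
    done, partial = fields[:-1], fields[-1]
    for i, value in enumerate(done):               # fully decoded fields: exact test
        if i in criteria:
            low, high = criteria[i]
            if not (low <= value <= high):
                return False
    i = len(done)                                  # partially decoded last field
    if i in criteria:
        low, high = criteria[i]
        # completions of `partial` lie in the interval [partial, partial + top...]
        if partial > high or partial + top * 16 < low:
            return False                           # derived interval misses the criteria
    return True
```

For the query "Column 1 = abc" this reports that the prefix "ab" can still match, while "abb#ca" cannot, mirroring the pruning in the worked example.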

It is evident that the process of testing whether the partial string can satisfy the query criteria described above can be improved for faster performance. The observation is that, if a field is completely decoded and satisfies the query criteria, this field need not be checked by subsequent match processes. Therefore one can consider only the field that is partially decoded. For that purpose one keeps a position in the prefix where the partially decoded field starts and does not consider symbols before that position. This position is updated when a field is completely decoded. Another observation for improving performance is that matching should be performed only when the prefix has changed.

Another aspect of the invention is how grammar rules are loaded and decompressed from the compressed file 103. Fig. 12 shows the steps that are performed when the compressed file 103 is opened for reading. The process starts by reading the table column information from the compressed file 103 at block 1201. Column information includes the number of columns and the column names. The next step 1202 is reading the index information. The index information includes the number of indices, the number of fields in each index and the ordered list of table column names for each index. Then the arithmetic coding frequency model is read from the compressed file 103 at block 1203. The frequency model reading step is omitted if rules are not encoded by arithmetic or Huffman coding. The next step 1204 is reading the packet information from the compressed file 103. The packet information contains the number of packets, the list of packet offsets and the list of the number of rules in each packet. This process is performed only once, when the compressed file 103 is opened. It is not time consuming, since the size of the read information is relatively small compared to the total size of the compressed file 103. After opening the compressed file 103, an arbitrary number of queries can be executed. To be able to execute the processes illustrated in Fig. 10 and Fig. 11 efficiently, the grammar rules must be accessible in any order. For this purpose, the division into packets is exploited, which makes it possible to decompress any given rule with only a small overhead. Rules are identified by their numbers in the sorting obtained at block 501 (Fig. 6). Each rule is loaded from the compressed file 103 into main memory on demand, as soon as its right-hand side is accessed. Fig. 13 illustrates the process of reading the rule referenced by its identifier.

At first (block 1301), it is tested whether the rule with the given identifier is already in memory; if it is, the rule is returned at block 1302 and the process ends. If not, the process is transferred to block 1303 where the packet in which the rule is located is identified by using binary search in the list of offsets. The entire packet in which the rule is located is read from the disk and all rules of the packet are decoded. The starting offset in the compressed file 103 and the number of rules that have to be decoded are determined from the lists stored at compression steps 704 and 706. The rule decoding process starts by allocation of a new empty rule in memory at block 1304. The next step 1305 is creating an empty alternative of the rule. Then a symbol is read from the file using an arithmetic coder (block 1306) with the frequency model that was read at block 1203. If some other encoding method was used in encoding step 606, decoding is done using an appropriate decoder corresponding to the encoder. Next, in block 1307, the input symbol is tested on whether it is the special NEXT_RULE symbol. If it is, the rule is completely decoded and the process is transferred to block 1308, where a test is carried out on whether all rules of the packet are decoded. If yes, the decoded rule with the requested identifier is returned at block 1309 and the process terminates. If no, the process is transferred to block 1304 for decoding of the next rule. If the test in block 1307 returns No, the process is transferred to block 1310 where a test is carried out if the symbol is the NEXT_ALTERNATIVE symbol. If yes, the current alternative is finished and decoding of the next alternative of the current rule is started by transferring to block 1305. If no, the symbol is recognized as an identifier of some grammar symbol (either terminal or non-terminal), the grammar symbol is appended to the current alternative of the current rule at block 1311 and the process is transferred to block 1306.
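Mapping a rule identifier to its packet can be done with a binary search over the cumulative rule counts read from the lists stored at steps 704 and 706. A sketch; the function and variable names are assumptions for illustration:

```python
import bisect

def locate_rule(rule_id, packet_rule_counts, packet_offsets):
    """Map a rule identifier to (packet index, file offset, rules to skip)."""
    starts = [0]                       # starts[i]: identifier of the first rule in packet i
    for count in packet_rule_counts:
        starts.append(starts[-1] + count)
    packet = bisect.bisect_right(starts, rule_id) - 1
    # decode `rule_id - starts[packet]` rules past the packet start to reach the rule
    return packet, packet_offsets[packet], rule_id - starts[packet]
```

For example, with packets holding 10, 10 and 5 rules at offsets 0, 4096 and 8192, rule 12 is the third rule of the second packet.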

Continuing the example of the compressed data of the database table illustrated in Fig. 1, the data retrieval process will now be described by way of example. Let the query be that all rows of the database where Column 1 equals "abc" have to be found.

In this case the first index is selected at block 901 (Fig. 9) and its starting non-terminal S1 is selected. Initially a partial string "S1" is created. There are two alternatives of how to rewrite this string; the first one is used first, obtaining a partial string "U1A". This partial string is tested on whether it matches the query criteria. In this case the prefix until the first non-terminal is empty; there is one partially decoded field with an infinitely wide interval that intersects with the query interval, and it is reported that it matches the query. The next step is rewriting the first non-terminal symbol U1 in the partial string "U1A". The partial string "abA" is obtained. In the matching process it is found that the prefix is "ab", the interval is from "ab" to "abz" (assuming that the largest terminal symbol is "z"), and it is concluded that there is a match. The next replacement in the partial string uses the first alternative of A, obtaining the partial string "abMFH". The match in this case is the same as in the previous case, since the prefix is the same. After replacing M, the partial string "abb#FH" is obtained. Now the first field is completely decoded, and a check is made whether it equals the string specified in the query. Since it is not equal, the process goes back to the last point where there are alternatives, namely to the partial string "abA", and uses the next alternative of A, obtaining a partial string "abcDaMI". In the matching process an interval from "abc" to "abcz" is derived and it is concluded that there is a match. The process continues in the same way until the string contains only terminals. The string will be "abc#bab#100". This string is added to the resultList and the process retreats to the point where there were alternatives, namely to the second alternative of S1, giving the string "KDJL". After two replacements the partial string "caaDJL" is obtained, which does not match the query, and the grammar processing ends. Thus, there is only one string "abc#bab#100" in the resultList. This string is decoded and the following table row is obtained:

From this example it is evident that properties (a), (b) and (b1) of the LM-grammar facilitate fast searching, since only a relatively small part of the grammar was decoded for finding the required string. For the strings that do not match the query, only a short prefix is decoded, and from that prefix it is concluded that further decoding is not necessary.

Data storage system for storing data in compressed form and fast retrieval

The system for realization of the data compression method allowing fast retrieval and subsequent data retrieval will now be described in greater detail.

The offered data storage system for storing data in compressed form and fast data retrieval comprises at least a microprocessor and memory. Besides other data, the memory holds data in the form of the context-free LM-grammar, according to properties (a) and (b) or (b1), as described earlier. The LM-grammar according to property (b1) is used in the preferred embodiment.

For compression, the microprocessor executes instructions that cause the machine to perform the steps of processing a set of strings to represent them as a context-free LM-grammar and performing encoding of the context-free LM-grammar into the compressed file 103. At the step of processing a set of strings to represent them as a context-free LM-grammar, the microprocessor executes instructions that cause the machine to perform the steps of forming a PATRICIA trie for each set of strings coming from one index (block 401), creating a collection of unique strings from all the labels of the edges of all the formed PATRICIA tries (block 402), creating a common context-free grammar representing those unique strings (block 403), and combining the formed PATRICIA tries with the created common context-free grammar to form an LM-grammar (block 404).

At the step of performing encoding of the context-free LM-grammar into the compressed file 103, the microprocessor executes instructions that cause the machine to perform the steps of dividing the grammar rules into packets of a specified size (block 501), encoding each packet of grammar rules into the compressed file 103 (block 503), forming a list of packet offsets (block 704), which consists of the positions of each packet in the compressed file 103, encoding the list of packet offsets into the file (block 705), forming a list of packet rule counts (block 706) consisting of integer numbers representing the number of grammar rules in each packet, and encoding the list of packet rule counts into the compressed file 103 (block 707).

For data retrieval the microprocessor further executes instructions that cause the machine to perform the steps of selecting an appropriate index allowing the query to be executed in the fastest way (block 901) and selecting the starting symbol of the LM-grammar according to the selected index, performing a recursive search process which generates only the strings that match the query (block 903), and decoding the strings by using the inverse of the encoding transformation performed, yielding the original data rows (block 905).

In its turn, the step of performing a recursive search process which generates only the strings that match the query comprises identifying the longest prefix of the current partial string containing only terminal symbols (block 1002), testing, based on the identified prefix, whether the current string can match the query (block 1003), replacing the first non-terminal symbol in the current string with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol (block 1009), and recursively continuing the search process on the modified string (block 1010).

In another embodiment, all the occurrences of the first non-terminal symbol in the current string are replaced with one of the alternatives of the grammar rule with the left-hand side equal to the said first non-terminal symbol.

At the step of testing, based on the identified prefix, whether the current string can match the query, the microprocessor executes instructions that cause the machine to perform the steps of dividing the prefix into the fields that are between delimiters (block 1102), identifying the fully decoded fields and the partially decoded field (block 1104), testing whether the fully decoded fields match the query (block 1105), deriving an interval from the partially decoded field (block 1110) and testing whether the said interval intersects with the interval derived from the query criteria (block 1111).

The effectiveness of the offered method and system (in terms of compression ratio and data retrieval time) compared to the prior art has been evaluated by carrying out two experiments.

In the first experiment, the method of this invention according to the (b1) version of the LM-grammar (claim 2) was compared to RAR compression. The RAR compression method is currently one of the best and in most cases outperforms zip or gzip compression significantly. Also, the data size is given after loading into a MySQL database. Experiments were carried out on log-file data collected on an internet provider server. Within 50 days about 38 GB of data had been collected on the server. The data contained World Wide Web addresses, IP addresses, dates and the text of queries. These data had been indexed for searching by the column "World Wide Web address". A search for one specific World Wide Web address in this database has been performed. The time of searching has been measured for performing the search for the first entry corresponding to the query. The time of searching practically did not depend on the size of the compressed database. When more than one data row was retrieved from the compressed file, data was decompressed and retrieved at a speed of about 2 MB per second.

In the second experiment, an e-mail folder containing 9223 e-mail messages with a total size of 26.7 MB was used. This e-mail folder was stored in the same three ways - with the disclosed method, RAR and MySQL. When compressing data using the disclosed method, indexing by the column "sender's name" was performed.

The tests were carried out on a Pentium IV 3GHz processor with 512 MB of RAM. The comparative results of both experiments are shown in Table 1 and Table 2.

Table 1.

Size of compressed and stored data (MB)

Table 2.

Time of searching (seconds)

Thus, the offered system and method for data compression and storage allowing fast retrieval yields a high data compression ratio, similar to those achieved by the best archivers, at the same time permitting fast data searching and retrieval similar to current database systems. Additionally, the retrieval method, when implemented in the Java programming language, amounts to only about 30 KB of object code and uses very little memory for operation; hence it is well suited for use in mobile devices.