IMPROVEMENTS RELATING TO HASH TABLES

Title:

IMPROVEMENTS RELATING TO HASH TABLES

Document Type and Number:

WIPO Patent Application WO/2011/073680

Kind Code:

Abstract:

Methods for inserting objects into a hash table, searching for objects in a hash table, and deleting objects from a hash table. The hash table comprising a multiplicity of buckets. The hash table has a corresponding hash function, and a probe sequence that defines for a given hash value a sequence of buckets in the hash table. Each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value. The invention is particularly applicable to mapping genome sequences onto reference genome sequences.

Inventors:

LUNTER GERTON ANTON (GB)

Application Number:

PCT/GB2010/052136

Publication Date:

June 23, 2011

Filing Date:

December 17, 2010

Assignee:

ISIS INNOVATION (GB)
LUNTER GERTON ANTON (GB)

International Classes:

G16B50/00; G06F17/30; G16B30/10

Domestic Patent References:

WO2001069507A2

2001-09-20

Foreign References:

US20040083347A1

2004-04-29

Other References:

AMBLE O ET AL: "Ordered hash tables", COMPUTER JOURNAL UK, vol. 17, no. 2, May 1974 (1974-05-01), pages 135 - 142, XP002623162, ISSN: 0010-4620
BURKHARD ET AL: "Double hashing with passbits", INFORMATION PROCESSING LETTERS, AMSTERDAM, NL, vol. 96, no. 5, 16 December 2005 (2005-12-16), pages 162 - 166, XP005123404, ISSN: 0020-0190, DOI: DOI:10.1016/J.IPL.2005.08.005
NING Z: "SSAHA: a fast search method for large DNA databases", GENOME RESEARCH, COLD SPRING HARBOR LABORATORY PRESS, WOODBURY, NY, US, vol. 11, no. 10, 1 October 2001 (2001-10-01), pages 1725 - 1729, XP002983796, ISSN: 1088-9051, DOI: DOI:10.1101/GR.194201
PAGH R; RODLER FF: "Cuckoo hashing", J. ALGORITHMS, vol. 51, 2004, pages 122 - 144
BURROWS M; WHEELER DJ: "A Block-sorting Lossless Data Compression Algorithm", SRC RESEARCH REPORT, 1994, pages 124
FERRAGINA P; MANZINI G: "Indexing compressed text", J. ACM, vol. 52, no. 4, 2005, pages 552 - 581
LI H; DURBIN R: "Fast and accurate short read alignment with Burrows-Wheeler transform", BIOINFORMATICS, vol. 25, no. 14, 2009, pages 1754 - 1760
LANGMEAD B; TRAPNELL C; POP M; SALZBERG S: "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome", GENOME BIOLOGY, vol. 10, no. 3, 2009, pages R25
LI R; YU C; LI Y; LAM TW; YIU SM; KRISTIANSEN K; WANG J: "SOAP2: an improved ultrafast tool for short read alignment", BIOINFORMATICS, vol. 25, no. 15, 2009, pages 1996 - 7
BRENT RP: "Reducing the retrieval times of scatter storage techniques", COMM. ACM, vol. 16, no. 2, 1973, pages 105 - 109
GONNET GH; MUNRO JI: "Efficient ordering of hash tables", SIAM J COMPUT, vol. 8, no. 3, 1979, pages 463 - 478

Attorney, Agent or Firm:

BURT, Matthew, Thomas (20 Red Lion Street, London, Greater London WC1R 4PQ, GB)

Claims:

1. A computer-implemented method of inserting an object into a hash table comprising a multiplicity of buckets, wherein the hash table has a corresponding hash function and a probe sequence that defines for a given hash value a sequence of buckets in the hash table, and wherein each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value, the method comprising the steps of:

i) computing the hash value of the object to be inserted using the hash function;

ii) searching the hash table for an available bucket in the probe sequence for the hash value;

iii) storing the object to be inserted in the available bucket;

iv) in the case that there exists a preceding bucket in the probe sequence that contains an object with the same hash value as the object to be inserted, marking the extend flag of the preceding bucket to indicate that there is a subsequent object in the hash table with the same hash value.

2. A method as claimed in claim 1, wherein steps i to iv comprise the steps of:

a) computing the hash value of the object to be inserted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) checking whether the current bucket contains an object;

d) if the current bucket contains an object:

d1) computing the hash value of the stored object;

d2) if the hash value of the stored object is equal to the hash value of the object to be inserted, recording the current bucket location;

d3) moving to the next bucket in the probe sequence;

d4) returning to step c;

e) if the bucket is empty:

e1) storing the object to be inserted in the current bucket;

e2) if a bucket location has been recorded, setting the extend flag of the recorded bucket to indicate that there is a subsequent object in the hash table with the same hash value.

3. A method as claimed in claim 1 or 2, wherein each bucket in the hash table has a present flag to indicate if there is an object in the hash table with the hash value of the bucket, and wherein the method further comprises the step of setting the present flag of the first bucket given by the probe sequence to indicate that there is an object in the hash table with the hash value of the bucket.

4. A method as claimed in claim 2 or 3, wherein step e2 further comprises the step of setting the extend flag of the current bucket to the previous value of the extend flag of the recorded bucket, and further comprising the step:

e3) if a bucket location has not been recorded, setting the extend flag of the current bucket to the value of the present flag of the first bucket given by the probe sequence.

5. A computer-implemented method of searching for objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing an object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the search.

6. A method as claimed in claim 5, wherein steps i to iii comprise the steps of:

a) computing the hash value of the object to be found using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be found, and if so outputting the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

g) moving to the next bucket in the probe sequence;

h) returning to step c.

7. A method as claimed in claim 5 or 6, wherein the object to be found matches an object stored in a bucket only when the objects are the same.

8. A method as claimed in claim 5 or 6, wherein the object to be found may match multiple different objects.

9. A method as claimed in any of claims 6 to 8 when dependent on claim 3, wherein step c further comprises checking if the present flag of the first bucket indicates that there is no object in the hash table with the hash value of the first bucket, and if so aborting the search.

10. A method as claimed in any of claims 6 to 9, wherein step c further comprises checking if the contents of the current cell have been deleted, and if so recording the current bucket location if a bucket location has not already been recorded.

11. A method as claimed in claim 10, further comprising the steps:

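d1) if a bucket location has been recorded, storing the object in the current bucket in the recorded bucket, and setting the extend flag of the recorded bucket to the value of the extend flag of the current bucket;

d2) deleting the object in the current bucket, and setting the extend flag of the current bucket to indicate that there are no subsequent objects in the hash table with the same hash value;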

d3) checking if the object in the recorded bucket matches the object to be found, and if so outputting the recorded bucket;

d4) checking if the extend flag of the recorded bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

d5) skipping to step g.

12. A computer-implemented method of deleting objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing an object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the deletion.

13. A method as claimed in claim 12, wherein steps i to iii comprise the steps of:

a) computing the hash value of the object to be deleted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be deleted, and if so deleting the object in the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the deletion;

g) moving to the next bucket in the probe sequence;

h) returning to step c.

14. A method as claimed in claim 13, wherein step g further comprises recording the current bucket location, and step e further comprises checking if a bucket location has been recorded, if so setting the extend flag of the recorded bucket to the value of the extend flag of the current bucket, and if not setting the present flag of the first bucket to the value of the extend flag of the current bucket.

15. A computer program product arranged, when executed, to perform any of methods 1 to 14.

16. A method of creating a reference genome hash table, comprising the steps of:

obtaining a reference genome sequence;

inserting sequence fragments from the reference genome sequence into a hash table using the method of any of claims 1 to 4.

17. A method of mapping a genome onto a reference genome, comprising the steps of:

obtaining a reference genome hash table created using the method of claim 16;

obtaining the sequence of the genome in fragmented form using a sequencing machine;

mapping the genome sequence fragments onto the reference genome by searching for reference genome sequence fragments matching the genome sequence fragments using the methods of any of claims 5 to 11.

Description:

Improvements relating to hash tables

Field of the Invention

The present invention concerns methods for inserting objects into a hash table, searching for objects in a hash table, and deleting objects from a hash table. The invention is particularly, but not exclusively, applicable to mapping genome sequences onto reference genome sequences.

Background of the Invention

The following definitions are used herein:

Genome - the nuclear or organellar DNA content of a biological individual or sample;

DNA - deoxyribonucleic acid;

RNA - ribonucleic acid;

Sequence - a representation of the order in which the nucleotide bases are arranged within a nucleic acid sequence;

Sequencing machine - a machine taking as input a sample of nucleic acid, either DNA or RNA, and which by a process of analysis, can output the sequence of the sample in the form of a large number of short sequences ("reads");

Read - a short and possibly imperfect fragment of sequence produced by a sequencing machine representing a fragment of DNA in the original biological sample;

Read quality score - a score representing the estimated probability that a corresponding base in fact represents the original biological sample;

Paired-end read - a pair of reads that are associated by the sequencing machine, which originate from a single larger fragment of library DNA, and which are separated on this original larger fragment by an unknown but tightly constrained distance;

Library - the result of preparing a sample of DNA or RNA for sequencing by a sequencing machine: a solution of possibly modified DNA molecules and other chemicals.

The development of new DNA sequencing technologies over the last few years has led to a rapid reduction in the cost of sequencing the DNA of an individual biological organism (for instance a human being). However, a weakness of current technology is that DNA sequences are obtained in relatively short stretches or "reads". The process of locating the original position of the reads within a reference genome is called "mapping". Reads generally need to be mapped onto a reference genome in order to identify, for instance, any mutations that an individual may have that may affect its biological function (for instance that predispose a human individual to disease).

A known method for mapping a genome is to use a hash table. A hash table is a table of objects (i.e. data items, such as genome sequence fragments) stored in "buckets". The hash table has a hash function, which for any object provides a hash value which maps to a bucket in the table. It is usual for the hash function to map more than one object to a single bucket, which is known as "hash collision". A known strategy to allow for this is "open addressing". This uses a probe sequence, which for any hash value defines a sequence of buckets in the table. When inserting a new object in the hash table, if the bucket indicated by the hash function is already filled, the probe sequence can be used to find the next empty bucket in which to store the object.
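
As a minimal sketch of open addressing, the following Python fragment assumes linear probing as the probe sequence and a toy modulo hash on integers. The names `hash_value`, `probe_sequence`, `insert` and `lookup` are illustrative and are not taken from the source.

```python
from typing import Iterator, List, Optional


def hash_value(obj: int, size: int) -> int:
    """Toy hash function (an assumption): the bucket index is obj mod size."""
    return obj % size


def probe_sequence(h: int, size: int) -> Iterator[int]:
    """Linear probing (an assumption): h, h+1, h+2, ... wrapping around."""
    for step in range(size):
        yield (h + step) % size


def insert(table: List[Optional[int]], obj: int) -> int:
    """Store obj in the first empty bucket of its probe sequence."""
    for i in probe_sequence(hash_value(obj, len(table)), len(table)):
        if table[i] is None:
            table[i] = obj
            return i
    raise RuntimeError("hash table is full")


def lookup(table: List[Optional[int]], obj: int) -> Optional[int]:
    """Return the bucket holding obj, or None; an empty bucket ends the scan."""
    for i in probe_sequence(hash_value(obj, len(table)), len(table)):
        if table[i] is None:
            return None
        if table[i] == obj:
            return i
    return None
```

With eight buckets, inserting 3 and then 11 (both of which hash to bucket 3) places 11 in bucket 4, the next empty bucket in the probe sequence.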

Open addressing is popular as it has relatively little memory overhead, and uses one fewer level of indirection, compared to chaining (another known method for resolving hash collision). This leads to low memory requirements, good cache usage, and fast search times. However, a drawback of open addressing is a sharply increased search time as the proportion of occupied hash buckets (the "load factor" $\alpha$) increases towards 1, as now explained.

Search times are commonly measured under two usage scenarios, successful search and unsuccessful search. In a standard open-addressing hash table, average lookup times for successful searches grow logarithmically in $1-\alpha$, and thus remain reasonably low even as loads approach unity. In contrast, an unsuccessful search must scan the probe sequence until an unoccupied bucket is encountered. Since a proportion $1-\alpha$ of buckets is unoccupied, this results in an average run-time of $O((1-\alpha)^{-1})$.
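
For a sense of scale, the sketch below evaluates the textbook estimates for uniform hashing, an idealisation that is assumed here and is not taken from the source: about 1/(1-load) probes for an unsuccessful search and (1/load) ln(1/(1-load)) probes for a successful one, where `load` is the load factor. The function names are illustrative.

```python
import math


def unsuccessful_probes(load: float) -> float:
    """Expected probes for an unsuccessful search under uniform hashing."""
    if not 0.0 <= load < 1.0:
        raise ValueError("load factor must be in [0, 1)")
    return 1.0 / (1.0 - load)


def successful_probes(load: float) -> float:
    """Expected probes for a successful search under uniform hashing."""
    if not 0.0 < load < 1.0:
        raise ValueError("load factor must be in (0, 1)")
    return math.log(1.0 / (1.0 - load)) / load


if __name__ == "__main__":
    # Compare the two estimates as the table fills up.
    for load in (0.5, 0.9, 0.99):
        print(load, unsuccessful_probes(load), successful_probes(load))
```

At a load factor of 0.9 these estimates give roughly 10 probes for an unsuccessful search against about 2.6 for a successful one.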

In certain applications, a third usage scenario brings out this weakness of open addressing hashes even more clearly. For example, a commonly used strategy for inexact string matching in a large text is to look for matches to substrings of the query string, and perhaps to substrings at small edit distances of these. The objects in the hash correspond to text substrings, and the payload is their location. In this case, the hash implements a multiset, since any text substring may occur many times. In addition, unsuccessful searches are common, because of inexact matches or non-existing mutants. For a multiset, a search operation returns all objects matching the hash, rather than terminating as soon as an exact match is found, similarly to the "unsuccessful search" operation. A "multiset search" for an existing element (searching a multiset for a non-existing element is a simple unsuccessful search) may be supposed to visit hash buckets with a probability proportional to the number of elements stored in them, causing long chains to be visited more often than short ones. For that reason, "multiset search" tends to be slower than either successful or unsuccessful searches for standard open-addressing hash tables.

Another drawback of standard open addressing is that it is difficult to delete entries. Clearing entries would break probe chains prematurely and cause subsequent elements to become inaccessible. Instead these entries must be marked "deleted", which has the side effect of increasing search times when the hash table becomes saturated with marked entries. Entries can be cleared by rebuilding the hash, but this can be an expensive operation.
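
The deletion problem can be illustrated with a short sketch of a standard open-addressing table. Linear probing, integer keys hashed by modulo, and the sentinel names `EMPTY` and `DELETED` are all assumptions made for the example.

```python
from typing import List

EMPTY = object()    # bucket that has never held an object
DELETED = object()  # marker left behind by a deletion


def insert(table: List[object], key: int) -> int:
    """Place key in the first empty or deleted bucket of its probe sequence."""
    size = len(table)
    for step in range(size):
        i = (key % size + step) % size
        if table[i] is EMPTY or table[i] is DELETED:
            table[i] = key
            return i
    raise RuntimeError("hash table is full")


def contains(table: List[object], key: int) -> bool:
    """Scan the probe sequence; deleted buckets are skipped, an empty one ends the scan."""
    size = len(table)
    for step in range(size):
        slot = table[(key % size + step) % size]
        if slot is EMPTY:
            return False
        if slot is not DELETED and slot == key:
            return True
    return False


def delete(table: List[object], key: int) -> bool:
    """Replace key by a deletion marker so that later entries stay reachable."""
    size = len(table)
    for step in range(size):
        i = (key % size + step) % size
        if table[i] is EMPTY:
            return False
        if table[i] is not DELETED and table[i] == key:
            table[i] = DELETED
            return True
    return False
```

If 3 and 11 are inserted into an eight-bucket table and 3 is then deleted, 11 remains reachable because bucket 3 holds a marker. Resetting bucket 3 to `EMPTY` instead would end the scan for 11 at its first probe.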

The above-mentioned disadvantages can be particularly relevant in the context of genome sequence mapping, as small insertions or deletions in the DNA sequence ("indels") dramatically reduce the efficiency of the mapping. Since indels are the most likely candidates for mutations that may cause disease (for instance, by their propensity to cause frame-shifts within exons resulting in large aberrations in the encoded protein), the regions of the genome which are most likely to be interesting are more likely to be missed out in the genome sequence which results.

Alternatives to open addressing hash tables have recently been proposed. Cuckoo hashing (see Pagh R, Rodler FF: Cuckoo hashing. J. Algorithms 2004, 51:122-144) is one such alternative, that guarantees both successful and unsuccessful searches in constant time. The standard algorithm uses two hash functions and requires the hash table to be at most half-full. Modifications of the original proposal improve this, at the cost of adding more hash functions, and more memory accesses per search operation. Cuckoo hashes cannot however be used to implement multisets, as they rely on the hash functions to avoid collisions, which makes them unsuitable for inexact string matching.

A powerful data structure particular to substring search is the suffix tree. Derivatives of this data structure that require less memory include suffix arrays and the Burrows-Wheeler transform (see Burrows M, Wheeler DJ: A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 1994, 124; Ferragina P, Manzini G: Indexing compressed text. J. ACM 2005, 52(4):552-581). The last data structure in particular supports efficient substring searches, and has very good memory usage through the use of compression. With some modifications it can also be used for inexact string matching (see Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760; Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3):R25; Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(15):1996-7). Since approximate string matching algorithms based on hashes and based on the Burrows-Wheeler transform are intrinsically different, and practical implementations take various incomparable and heuristic cutoffs, it is difficult to decide which of the two approaches is faster in principle. Informally, current cited implementations using the Burrows-Wheeler transform appear both faster and less sensitive than the best hash-based approaches. Another difference between the two approaches is that Burrows-Wheeler transforms require the search text to be static, whereas hash-based approaches allow changes at run-time.

Another known solution to improve hash table search times is to use a key arrangement scheme (see Brent RP: Reducing the retrieval times of scatter storage techniques. Comm. ACM 1973, 16(2):105-109; Gonnet GH, Munro JI: Efficient ordering of hash tables. SIAM J Comput 1979, 8(3):463-478), and a solution that could be used in conjunction with a key arrangement scheme would be advantageous.

The present invention seeks to mitigate the above-mentioned problems, by providing methods that allow hash tables to be efficiently searched.

Summary of the Invention

In accordance with a first aspect of the invention there is provided a computer-implemented method of inserting an object into a hash table comprising a multiplicity of buckets, wherein the hash table has a corresponding hash function and a probe sequence that defines for a given hash value a sequence of buckets in the hash table, and wherein each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value, the method comprising the steps of:

i) computing the hash value of the object to be inserted using the hash function;

ii) searching the hash table for an available bucket in the probe sequence for the hash value;

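iii) storing the object to be inserted in the available bucket;

iv) in the case that there exists a preceding bucket in the probe sequence that contains an object with the same hash value as the object to be inserted, marking the extend flag of the preceding bucket to indicate that there is a subsequent object in the hash table with the same hash value.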

Thus, when an object is inserted into the hash table, the preceding bucket in the probe sequence that contains an object with the same hash value will have its extend flag set to indicate that there is a subsequent object with the same hash value. Consequently, when a search is performed, if an object with the same hash value is found, but the extend flag for the bucket containing that object is set to False, there must be no further object with the same hash value in the probe sequence, and so the search can be aborted. The search does not need to continue until an empty bucket is found, and so an unsuccessful search in most cases takes much less time.

Preferably, steps i to iv comprise the steps of:

a) computing the hash value of the object to be inserted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) checking whether the current bucket contains an object;

d) if the current bucket contains an object:

d1) computing the hash value of the stored object;

d2) if the hash value of the stored object is equal to the hash value of the object to be inserted, recording the current bucket location;

d3) moving to the next bucket in the probe sequence;

d4) returning to step c;

e) if the bucket is empty:

e1) storing the object to be inserted in the current bucket;

e2) if a bucket location has been recorded, setting the extend flag of the recorded bucket to indicate that there is a subsequent object in the hash table with the same hash value.

This is an exemplary method for carrying out the invention, in which the buckets in the probe sequence and their contents are considered in turn.
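
The insertion steps a to e2 can be sketched in Python as follows. This is an illustration only: it assumes linear probing, integer objects, and a toy modulo hash, and the names `Bucket`, `hash_value` and `insert` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    obj: Optional[int] = None  # None marks an empty bucket
    extend: bool = False       # True if a later bucket holds the same hash value


def hash_value(obj: int, size: int) -> int:
    """Toy hash function (an assumption): obj mod table size."""
    return obj % size


def insert(table: List[Bucket], obj: int) -> int:
    """Insert obj and return the index of the bucket that received it."""
    size = len(table)
    h = hash_value(obj, size)                  # step a
    recorded: Optional[int] = None             # last bucket seen with hash value h
    for step in range(size):
        i = (h + step) % size                  # steps b and d3 (linear probing assumed)
        bucket = table[i]
        if bucket.obj is None:                 # step e: the bucket is empty
            bucket.obj = obj                   # step e1
            if recorded is not None:
                table[recorded].extend = True  # step e2
            return i
        if hash_value(bucket.obj, size) == h:  # steps d1 and d2
            recorded = i
    raise RuntimeError("hash table is full")
```

In an eight-bucket table, inserting 3, 4 and then 11 stores 11 in bucket 5 and sets the extend flag of bucket 3, since 3 and 11 share the hash value 3 while 4 does not.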

Preferably, each bucket in the hash table has a present flag to indicate if there is an object in the hash table with the hash value of the bucket, and wherein the method further comprises the step of setting the present flag of the first bucket given by the probe sequence to indicate that there is an object in the hash table with the hash value of the bucket. If the present flag is set to False there must be no object in the hash table with the hash value of the bucket, and so any search can immediately be aborted.
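
A small sketch of the present flag follows, again assuming linear probing and a toy modulo hash on integers. The helper `may_contain` is a hypothetical name for the check that lets a search stop before probing any bucket contents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    obj: Optional[int] = None  # None marks an empty bucket
    extend: bool = False       # a later bucket holds an object with the same hash value
    present: bool = False      # some stored object has this bucket as its hash value


def hash_value(obj: int, size: int) -> int:
    """Toy hash function (an assumption): obj mod table size."""
    return obj % size


def insert(table: List[Bucket], obj: int) -> int:
    """Insert obj, maintaining both the extend flags and the present flag."""
    size = len(table)
    h = hash_value(obj, size)
    recorded: Optional[int] = None
    for step in range(size):
        i = (h + step) % size          # linear probing (an assumption)
        bucket = table[i]
        if bucket.obj is None:
            bucket.obj = obj
            if recorded is not None:
                table[recorded].extend = True
            table[h].present = True    # flag lives on the first bucket of the probe sequence
            return i
        if hash_value(bucket.obj, size) == h:
            recorded = i
    raise RuntimeError("hash table is full")


def may_contain(table: List[Bucket], obj: int) -> bool:
    """False means no stored object shares obj's hash value, so a search can abort."""
    return table[hash_value(obj, len(table))].present
```

Note that the present flag describes the hash value, not the bucket contents: after inserting 3 and 11 into an eight-bucket table, bucket 4 is occupied by 11, yet its present flag stays False because no stored object hashes to 4.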

Advantageously, step e2 further comprises the step of setting the extend flag of the current bucket to the previous value of the extend flag of the recorded bucket, and further comprising the step:

e3) if a bucket location has not been recorded, setting the extend flag of the current bucket to the value of the present flag of the first bucket given by the probe sequence.

This allows the hash table to operate when objects in the hash table are deleted.

In accordance with a second aspect of the invention there is provided a computer-implemented method of searching for objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

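i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing an object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the search.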

Thus, the buckets in the probe sequence are searched, and if an object has the same hash value as the object to be found, but does not match the object, the extend flag can be checked to see if there are any subsequent objects with the same hash value in the probe sequence. If not, the search can be aborted without needing to continue until an empty bucket is found.

Preferably, steps i to iii comprise the steps of:

a) computing the hash value of the object to be found using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be found, and if so outputting the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

g) moving to the next bucket in the probe sequence;

h) returning to step c.

This is an exemplary method for carrying out the invention, in which the buckets in the probe sequence and their contents are considered in turn.
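
The search steps a to h can be sketched as below, together with the insertion routine needed to build a table. Linear probing, integer objects, a toy modulo hash, exact equality as the matching rule, and stopping at an empty bucket are assumptions of this sketch. The probe counter is added only to make the early abort visible.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bucket:
    obj: Optional[int] = None  # None marks an empty bucket
    extend: bool = False       # True if a later bucket holds the same hash value


def hash_value(obj: int, size: int) -> int:
    """Toy hash function (an assumption): obj mod table size."""
    return obj % size


def insert(table: List[Bucket], obj: int) -> int:
    size = len(table)
    h = hash_value(obj, size)
    recorded: Optional[int] = None
    for step in range(size):
        i = (h + step) % size
        if table[i].obj is None:
            table[i].obj = obj
            if recorded is not None:
                table[recorded].extend = True
            return i
        if hash_value(table[i].obj, size) == h:
            recorded = i
    raise RuntimeError("hash table is full")


def search(table: List[Bucket], query: int) -> Tuple[List[int], int]:
    """Return (indices of matching buckets, number of buckets inspected)."""
    size = len(table)
    h = hash_value(query, size)                # step a
    found: List[int] = []
    probes = 0
    for step in range(size):
        bucket = table[(h + step) % size]      # steps b and g (linear probing assumed)
        probes += 1
        if bucket.obj is None:                 # assumption: an empty bucket ends the scan
            break
        if hash_value(bucket.obj, size) != h:  # steps c and d: other hash value, skip
            continue
        if bucket.obj == query:                # step e: output the matching bucket
            found.append((h + step) % size)
        if not bucket.extend:                  # step f: no later object with this hash
            break
    return found, probes
```

With 3, 4, 5 and 6 stored in an eight-bucket table, a search for the absent object 19 stops after one probe, because bucket 3 holds an object with hash value 3 whose extend flag is not set. A plain open-addressing scan would continue to the empty bucket 7.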

The object to be found may match an object stored in a bucket only when the objects are the same. Alternatively, the object to be found may match multiple different objects. Thus, the search may be completed when an object is found matching the object to be found, or may continue to find multiple objects matching the object to be found.

Preferably, step c further comprises checking if the present flag of the first bucket indicates that there is no object in the hash table with the hash value of the first bucket, and if so aborting the search. If the present flag is set to False there must be no object in the hash table with the hash value of the bucket, and so the search can immediately be aborted.

Advantageously, step c further comprises checking if the contents of the current cell have been deleted, and if so recording the current bucket location if a bucket location has not already been recorded. Further, the method advantageously comprises the steps:

d1) if a bucket location has been recorded, storing the object in the current bucket in the recorded bucket, and setting the extend flag of the recorded bucket to the value of the extend flag of the current bucket;

d2) deleting the object in the current bucket, and setting the extend flag of the current bucket to indicate that there are no subsequent objects in the hash table with the same hash value;

d3) checking if the object in the recorded bucket matches the object to be found, and if so outputting the recorded bucket;

d4) checking if the extend flag of the recorded bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

d5) skipping to step g.

This allows the search to be performed on a hash table in which objects have been deleted. Further, by storing the location of the most recent bucket marked as deleted, and moving a subsequent object with the same hash value to the recorded bucket, subsequent searches are made more efficient.

In accordance with a third aspect of the invention there is provided a computer-implemented method of deleting objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

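ii) searching the hash table for a bucket in the probe sequence containing an object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the deletion.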

Thus, the object to be deleted is searched for and deleted when found, and the search is aborted if it can be seen from the extend flags that it will not be possible for the object to be found.
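
One possible reading of deletion with extend and present flags is sketched below. It follows the flag updates worded in claims 4 and 14 of this document, treating the flags as an implicit linked list along the probe sequence. Linear probing, integer objects, the toy modulo hash, the `deleted` marker field, and deleting only the first matching object are assumptions of the sketch, not details confirmed by the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bucket:
    obj: Optional[int] = None  # None: bucket is available (empty or deleted)
    deleted: bool = False      # True: bucket once held an object that was deleted
    extend: bool = False       # a later bucket holds an object with the same hash value
    present: bool = False      # some stored object has this bucket as its hash value


def insert(table: List[Bucket], obj: int) -> int:
    size = len(table)
    h = obj % size                                   # toy hash (an assumption)
    recorded: Optional[int] = None
    for step in range(size):
        i = (h + step) % size
        b = table[i]
        if b.obj is None:                            # empty or deleted bucket is available
            b.obj, b.deleted = obj, False
            if recorded is not None:
                b.extend = table[recorded].extend    # inherit the old link (step e2)
                table[recorded].extend = True
            else:
                b.extend = table[h].present          # step e3
            table[h].present = True
            return i
        if b.obj % size == h:
            recorded = i
    raise RuntimeError("hash table is full")


def search(table: List[Bucket], obj: int) -> List[int]:
    size = len(table)
    h = obj % size
    found: List[int] = []
    if not table[h].present:                         # nothing stored with this hash value
        return found
    for step in range(size):
        i = (h + step) % size
        b = table[i]
        if b.obj is None:
            if b.deleted:
                continue                             # skip deleted buckets
            break                                    # a never-used bucket ends the scan
        if b.obj % size != h:
            continue
        if b.obj == obj:
            found.append(i)
        if not b.extend:
            break
    return found


def delete(table: List[Bucket], obj: int) -> bool:
    """Delete the first object equal to obj; return False if none is found."""
    size = len(table)
    h = obj % size
    if not table[h].present:
        return False
    prev: Optional[int] = None                       # previous bucket with hash value h
    for step in range(size):
        i = (h + step) % size
        b = table[i]
        if b.obj is None:
            if b.deleted:
                continue
            return False
        if b.obj % size != h:
            continue
        if b.obj == obj:
            if prev is not None:
                table[prev].extend = b.extend        # unlink from the predecessor
            else:
                table[h].present = b.extend          # no predecessor: update present flag
            b.obj, b.deleted, b.extend = None, True, False
            return True
        if not b.extend:                             # no later object with this hash value
            return False
        prev = i
    return False
```

In this sketch a deleted bucket can be reused by a later insertion, and the extend flag inherited at insertion keeps any later objects with the same hash value reachable.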