Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
IMPROVEMENTS RELATING TO HASH TABLES
Document Type and Number:
WIPO Patent Application WO/2011/073680
Kind Code:
A1
Abstract:
Methods for inserting objects into a hash table, searching for objects in a hash table, and deleting objects from a hash table. The hash table comprising a multiplicity of buckets. The hash table has a corresponding hash function, and a probe sequence that defines for a given hash value a sequence of buckets in the hash table. Each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value. The invention is particularly applicable to mapping genome sequences onto reference genome sequences.

Inventors:
LUNTER GERTON ANTON (GB)
Application Number:
PCT/GB2010/052136
Publication Date:
June 23, 2011
Filing Date:
December 17, 2010
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
ISIS INNOVATION (GB)
LUNTER GERTON ANTON (GB)
International Classes:
G16B50/00; G06F17/30; G16B30/10
Domestic Patent References:
WO2001069507A22001-09-20
Foreign References:
US20040083347A12004-04-29
Other References:
AMBLE O ET AL: "Ordered hash tables", COMPUTER JOURNAL UK, vol. 17, no. 2, May 1974 (1974-05-01), pages 135 - 142, XP002623162, ISSN: 0010-4620
BURKHARD ET AL: "Double hashing with passbits", INFORMATION PROCESSING LETTERS, AMSTERDAM, NL, vol. 96, no. 5, 16 December 2005 (2005-12-16), pages 162 - 166, XP005123404, ISSN: 0020-0190, DOI: DOI:10.1016/J.IPL.2005.08.005
NING Z: "SSAHA: a fast search method for large DNA databases", GENOME RESEARCH, COLD SPRING HARBOR LABORATORY PRESS, WOODBURY, NY, US, vol. 11, no. 10, 1 October 2001 (2001-10-01), pages 1725 - 1729, XP002983796, ISSN: 1088-9051, DOI: DOI:10.1101/GR.194201
PAGH R; RODLER FF: "Cuckoo hashing", J. ALGORITHMS, vol. 51, 2004, pages 122 - 144
BURROWS M; WHEELER DJ: "A Block-sorting Lossless Data Compression Algorithm", SRC RESEARCH REPORT, 1994, pages 124
FERRAGINA P; MANZINI G: "Indexing compressed text", J. ACM, vol. 52, no. 4, 2005, pages 552 - 58I
LI H; DURBIN R: "Fast and accurate short read alignment with Burrows-Wheeler transform", BIOINFORMATICS, vol. 25, no. 14, 2009, pages 1754 - 1760
LANGMEAD B; TRAPNELL C; POP M; SALZBERG S: "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome", GENOME BIOLOGY, vol. 10, no. 3, 2009, pages R25
LI R; YU C; LI Y; LAM TW; YIU SM; KRISTIANSEN K; WANG J: "SOAP2: an improved ultrafast tool for short read alignment", BIOINFORMATICS, vol. 25, no. 15, 2009, pages 1996 - 7
BRENT RP: "Reducing the retrieval times of scatter storage techniques", COMM. ACM, vol. 16, no. 2, 1973, pages 105 - 109
GONNET GH; MUNRO JI: "Efficient ordering of hash tables", SIAM J COMPUT, vol. 8, no. 3, 1979, pages 463 - 478
Attorney, Agent or Firm:
BURT, Matthew, Thomas (20 Red Lion StreetLondon Greater, London WC1R 4PQ, GB)
Download PDF:
Claims:
Claims

1. A computer-implemented method of inserting an object into a hash table comprising a multiplicity of buckets, wherein the hash table has a corresponding hash function and a probe sequence that defines for a given hash value a sequence of buckets in the hash table, and wherein each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value, the method comprising the steps of:

i) computing the hash value of the object to be inserted using the hash function;

ii) searching the hash table for an available bucket in the probe sequence for the hash value;

iii) storing the object to be inserted in the available bucket;

iv) in the case that there exists a preceding bucket in the probe sequence that contains an object with the same hash value as the object to be inserted, marking the extend flag of the preceding bucket to indicate that there is a subsequent object in the hash table with the same hash value .

2. A method as claimed in claim 1, wherein steps i to iv comprise the steps of:

a) computing the hash value of the object to be inserted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value; c) checking whether the current bucket contains an ob ect;

d) if the current bucket contains an object:

dl) computing the hash value of the stored object; d2 ) if the hash value of the stored object is equal to the hash value of the object to be inserted, recording the current bucket location;

d3) moving to the next bucket in the probe sequence;

d4) returning to step c;

e) if the bucket is empty:

el) storing the object to be inserted in the current bucket;

e2) if a bucket location has been recorded, setting the extend flag of the recorded bucket to indicate that there is a subsequent object in the hash table with the same hash value.

3. A method as claimed in claim 1 or 2, wherein each bucket in the hash table has a present flag to indicate if there is an object in the hash table with the hash value of the bucket, and wherein the method further comprises the step of setting the present flag of the first bucket given by the probe sequence to indicate that there is an object in the hash table with the hash value of the bucket.

4. A method as claimed in claim 2 or 3, wherein step e2 further comprises the step of setting the extend flag of the current bucket to the previous value of the extend flag of the recorded bucket, and further comprising the step: e3) if a bucket location has not been recorded, setting the extend flag of the current bucket to the value of the present flag of the first bucket given by the probe

sequence .

5. A computer-implemented method of searching for objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing on object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the search.

6. A method as claimed in claim 5, wherein steps i to iii comprise the steps of:

a) computing the hash value of the object to be found using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket; d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be found, and if so outputting the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

g) moving to the next bucket in the probe sequence; h) returning to step c.

7. A method as claimed in claim 5 or 6, wherein the object to be found matches an object stored in a bucket only when the objects are the same.

8. A method as claimed in claim 5 or 6, wherein the object to be found may match multiple different objects. 9. A method as claimed in any of claims 6 to 8 when

dependent on claim 3, wherein step c further comprises checking if the present flag of the first bucket indicates that there is no object in the hash table with the hash value of the first bucket, and if so aborting the search.

10. A method as claimed in any of claims 6 to 9, wherein step c further comprises checking if the contents of the current cell have been deleted, and if so recording the current bucket location if a bucket location has not already been recorded.

11. A method as claimed in claim 10, further comprising the steps :

dl) if a bucket location has been recorded, storing the object in the current bucket in the recorded bucket, and setting the extend flag of the recorded bucket the value of the extend flag of the current bucket;

d2 ) deleting the object in the current bucket, and setting the extend flag of the current bucket to indicate that there are no subsequent objects in the hash table with the same hash value;

d3) checking if the object in the recorded bucket matches the object to be found, and if so outputting the recorded bucket;

d4) checking if the extend flag of the recorded bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

d5) skipping to step g.

12. A computer-implemented method of deleting objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing on object that matches the object to be found;

iii) in the case that an object is found in a bucket the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the deletion.

13. A method as claimed in claim 12, wherein steps i to iii comprise the steps of:

a) computing the hash value of the object to be deleted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be deleted, and if so deleting the object in the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the deletion;

g) moving to the next bucket in the probe sequence; h) returning to step c.

14. A method as claimed in claim 13, wherein step g further comprises recording the current bucket location, and step e further comprises checking if a bucket location has been recorded, if so setting the extend flag of the recorded bucket to the value of the extend flag of the current bucket, and if not setting the present flag of the first bucket to the value of the extend flag of the current bucket .

15. A computer program product arranged, when executed, to perform any of methods 1 to 14.

16. A method of creating a reference genome hash table, comprising the steps of:

obtaining a reference genome sequence;

inserting sequence fragments from the reference genome sequence into a hash table using the method of any of claims 1 to 4.

17. A method of mapping a genome onto a reference genome, comprising the steps of:

obtaining a reference genome hash table created using the method of claim 16;

obtaining the sequence of the genome in fragmented form using a sequencing machine;

mapping the genome sequence fragments onto the

reference genome by searching for reference genome sequence fragments matching the genome sequence fragments using the methods of any of claims 5 to 11.

Description:
Improvements relating to hash tables

Field of the Invention The present invention concerns methods for inserting objects into a hash table, searching for objects in a hash table, and deleting objects from a hash table. The

invention is particularly, but not exclusively, applicable to mapping genome sequences onto reference genome sequences.

Background of the Invention

The following definitions are used herein:

Genome - the nuclear or organellar DNA content of a biological individual or sample;

DNA - deoxyribonucleic acid;

RNA - ribonucleic acid;

Sequence - a representation of the order in which the nucleotide bases are arranged within a nucleic acid

sequence ;

Sequencing machine - a machine taking as input a sample of nucleic acid, either DNA or RNA, and which by a process of analysis, can output the sequence of the sample in the form of a large number of short sequences ("reads") ;

Read - a short and possibly imperfect fragment of sequence produced by a sequencing machine representing a fragment of DNA in the original biological sample; Read quality score - a score representing the estimated probability that a corresponding base in fact represents the original biological sample;

Paired-end read - a pair of reads that are associated by the sequencing machine, which originate from a single larger fragment of library DNA, and which are separated on this original larger fragment by an unknown but tightly constrained distance;

Library - the result of preparing a sample of DNA or RNA for sequencing by a sequencing machine a solution of possibly modified DNA molecules and other chemicals.

The development of new DNA sequencing technologies over the last few years has led to a rapid reduction in the cost of sequencing the DNA of an individual biological organism (for instance a human being) . However, a weakness of current technology is that DNA sequences are obtained in relatively short stretches or "reads". The process of locating the original position of the reads within a reference genome is called "mapping". Reads generally need to be mapped onto a reference genome in order to identify, for instance, any mutations that an individual may have that may affect its biological function (for instance that predispose a human individual to disease) .

A known method for mapping a genome is to use a hash table. A hash table is a table of objects (i.e. data items, such as genome sequence fragments) stored in "buckets". The hash table has a hash function, which for any object provides a hash value which maps to a bucket in the table. It is usual for the hash function to map more than one object to a single bucket, which is known as "hash

collision". A known strategy to allow for this is "open addressing". This uses a probe sequence, which for any hash value defines a sequence of buckets in the table. When inserting a new object in the hash table, if the bucket indicated by the hash function is already filled, the probe sequence can be used to find the next non-empty bucket in which to store the object.

Open addressing is popular as it has relatively little memory overhead, and uses one less level of indirection, compared to chaining (another known method for resolving hash collision) . This leads to low memory requirements, good cache usage, and fast search times. However, a drawback of open addressing is a sharply increased search time as the proportion of occupied hash buckets (the "load factor" a) increases towards 1, as now explained.

Search times are commonly measured under two usage scenarios, successful search and unsuccessful search. In a standard open-addressing hash table, average lookup times for successful searches grow logarithmically in 1-a, and thus remain reasonably low even as loads approach unity. In contrast, an unsuccessful search must scan the probe sequence until an unoccupied bucket is encountered. Since a proportion 1-a of buckets is unoccupied, this results in an average run-time of 0( (1-a) '1 ) .

In certain applications, a third usage scenario brings out this weakness of open addressing hashes even more clearly. For example, a commonly used strategy for inexact string matching in a large text is to look for matches to substrings of the query string, and perhaps to substrings at small edit distances of these. The objects in the hash correspond to text substrings, and the payload is their location. In this case, the hash implements a multiset, since any text substring may occur many times. In addition, unsuccessful searches are common, because of inexact matches or non-existing mutants. For a multiset, a search operation returns all objects matching the hash, rather than

terminating as soon as an exact match is found, similarly to the "unsuccessful search" operation. A "multiset search" for an existing element (searching a multiset for a non- existing element is a simple unsuccessful search) may be supposed to visits hash buckets with a probability

proportional to the number of elements stored in them, causing long chains to be visited more often than short ones. For that reason, "multiset search" tends to be slower than either successful or unsuccessful searches for standard open-addressing hash tables.

Another drawback of standard open addressing is that it is difficult to delete entries. Clearing entries would break probe chains prematurely and cause subsequent elements to become inaccessible. Instead these entries must be marked "deleted", which has the side effect of increasing search times when the hash table becomes saturated with marked entries. Entries can be cleared by rebuilding the hash, but this can be an expensive operation.

The above-mentioned disadvantages can be particularly relevant in the context of genome sequence mapping, as small insertions or deletions in the DNA sequence ("indels") dramatically reduce the efficiency of the mapping. Since indels are the most likely candidates for mutations that may cause disease (for instance, by their propensity to cause frame-shifts within exons resulting in large aberrations in the encoded protein) , the region of the genome which are most likely to be interesting are more likely to be missed out in the genome sequence which results.

Alternatives to open addressing hash tables have recently been proposed. Cuckoo hashing (see Pagh R, Rodler FF: Cuckoo hashing. J. algorithms 2004, 51:122-144) is one such alternative, that guarantees both successful and unsuccessful searches in constant time. The standard algorithm uses two hash functions and requires the hash table to be at most half-full. Modifications of the original proposal improve this, at the cost of adding more hash functions, and more memory accesses per search

operation. Cuckoo hashes cannot however be used to

implement multisets, as they rely on the hash functions to avoid collisions, which make them unsuitable for inexact string matching.

A powerful data structure particular to substring search are suffix trees. Derivatives of this data structure that require less memory include suffix arrays and the Burrows-Wheeler transform (see Burrows M, Wheeler DJ: A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 1994, 124; Ferragina P, Manzini G: Indexing compressed text. J. ACM 2005, 52 ( 4 ) : 552-581. ) . The last data structure in particular supports efficient substring searches, and has very good memory usage through the use of compression. With some modifications it can also be used for inexact string matching (see Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14) :1754-1760; Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory- efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10(3) :R25; Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2 : an improved ultrafast tool for short read alignment. Bioinformatics 2009, 25(15) : 1996-7) . Since approximate string matching algorithms based on hashes and based on the Burrows-Wheeler transform are intrinsically different, and practical implementations take various incomparable and heuristic cutoffs, it is difficult to decide which of the two

approaches is faster in principle. Informally, current cited implementations using the Burrows-Wheeler transform appear both faster and less sensitive than the best hash- based approaches. Another difference between the two approaches is that Burrows-Wheeler transforms require the search text to be static, whereas hash-based approaches allow changes at run-time.

Another known solution to improve hash table search times is to use a key arrangement scheme (see Brent RP:

Reducing the retrieval times of scatter storage techniques. Comm. ACM 1973, 16 (2 ): 105-109; Gonnet GH, Munro JI :

Efficient ordering of hash tables. SIAM J Comput 1979, 8(3) :463-478), and a solution that could be used in

conjunction with a key arrangement scheme would be

advantageous .

The present invention seeks to mitigate the above- mentioned problems, by providing methods that allow hash tables to be efficiently searched. Summary of the Invention In accordance with a first aspect of the invention there is provided a computer-implemented method of inserting an object into a hash table comprising a multiplicity of buckets, wherein the hash table has a corresponding hash function and a probe sequence that defines for a given hash value a sequence of buckets in the hash table, and wherein each bucket in the hash table has an extend flag to indicate if there are subsequent objects in the hash table with the same hash value, the method comprising the steps of:

i) computing the hash value of the object to be inserted using the hash function;

ii) searching the hash table for an available bucket in the probe sequence for the hash value;

iii) storing the object to be inserted in the available bucket;

iv) in the case that there exists a preceding bucket in the probe sequence that contains an object with the same hash value as the object to be inserted, marking the extend flag of the preceding bucket to indicate that there is a subsequent object in the hash table with the same hash value.

Thus, when an object is inserted into the hash table, the extend flag of the preceding bucket in the probe sequence that contains object in the probe sequence with the same hash value will have its extend flag set to indicate that there is a subsequent object with the same hash value. Consequently, when a search is performed, if an object with the same hash value is found, but the extend flag for the bucket containing that object is set to False, there must be no further object with the same hash value in the probe sequence, and so the search can be aborted. The search does not need to continue until an empty bucket is found, and so an unsuccessful search in most cases takes much less time.

Preferably, steps i to iv comprise the steps of:

a) computing the hash value of the object to be inserted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) checking whether the current bucket contains an obj ect;

d) if the current bucket contains an object:

dl) computing the hash value of the stored object; d2 ) if the hash value of the stored object is equal to the hash value of the object to be inserted, recording the current bucket location;

d3) moving to the next bucket in the probe sequence;

d4) returning to step c;

e) if the bucket is empty:

el) storing the object to be inserted in the current bucket;

e2) if a bucket location has been recorded, setting the extend flag of the recorded bucket to indicate that there is a subsequent object in the hash table with the same hash value. This is an exemplary method for carrying out the invention, in which the buckets in the probe sequence and their contents are considered in turn.

Preferably, each bucket in the hash table has a present flag to indicate if there is an object in the hash table with the hash value of the bucket, and wherein the method further comprises the step of setting the present flag of the first bucket given by the probe sequence to indicate that there is an object in the hash table with the hash value of the bucket. If the present flag is set to False there must be no object in the hash table with the hash value of the bucket, and so any search can immediately be aborted.

Advantageously, step e2 further comprises the step of setting the extend flag of the current bucket to the previous value of the extend flag of the recorded bucket, and further comprising the step:

e3) if a bucket location has not been recorded, setting the extend flag of the current bucket to the value of the present flag of the first bucket given by the probe

sequence. This allows the hash table to operate when objects in the hash table are deleted.

In accordance with a second aspect of the invention there is provided a computer-implemented method of searching for objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function; ii) searching the hash table for a bucket in the probe sequence containing on object that matches the object to be found;

iii) in the case that an object is found in a bucket in the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the search.

Thus, the buckets in the probe sequence are searched, and if an object has the same hash value as the object to be found, but does not match the object, the extend flag can be checked to see if there are any subsequent objects with the same has value in the probe sequence. If not, the search can be aborted without needing to continue until an empty bucket is found.

Preferably, steps i to iii comprise the steps of:

a) computing the hash value of the object to be found using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be found, and if so outputting the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

g) moving to the next bucket in the probe sequence; h) returning to step c.

This is an exemplary method for carrying out the invention, in which the buckets in the probe sequence and their contents are considered in turn.

The object to be found may match an object stored in a bucket only when the objects are the same. Alternatively, the object to be found may match multiple different objects. This, the search may be completed when an object is found matching the object to be found, or may continue to find multiple objects matching the object to be found.

Preferably, step c further comprises checking if the present flag of the first bucket indicates that there is no object in the hash table with the hash value of the first bucket, and if so aborting the search. If the present flag is set to False there must be no object in the hash table with the hash value of the bucket, and so the search can immediately be aborted.

Advantageously, step c further comprises checking if the contents of the current cell have been deleted, and if so recording the current bucket location is a bucket

location has not already been recorded. Further, the method advantageously comprising the steps:

dl) if a bucket location has been recorded, storing the object in the current bucket in the recorded bucket, and setting the extend flag of the recorded bucket the value of the extend flag of the current bucket; d2 ) deleting the object in the current bucket, and setting the extend flag of the current bucket to indicate that there are no subsequent objects in the hash table with the same hash value;

d3) checking if the object in the recorded bucket matches the object to be found, and if so outputting the recorded bucket;

d4) checking if the extend flag of the recorded bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the search;

d5) skipping to step g. This allows the search to be performed on a hash table in which objects have been deleted. Further, by storing the location of the most recent bucket marked as deleted, and moving a subsequent object with same hash value to the recorded bucket,

subsequent searches are made more efficient

In accordance with a third aspect of the invention there is provided a computer-implemented method of deleting objects in a hash table as defined in any of claims 1 to 4, comprising the steps of:

i) computing the hash value of the object to be found using the hash function;

ii) searching the hash table for a bucket in the probe sequence containing on object that matches the object to be found;

iii) in the case that an object is found in a bucket the probe sequence with the same hash value as the object to be found, and the extend flag of the bucket in which the object is stored is not marked to indicate that there is a subsequent object in the hash table with the same hash value, aborting the deletion.

Thus, the object to be deleted is searched for and deleted when found, and the search is aborted if it can be seen from the extend flags that it will not be possible for the object to be found.

Preferably, steps i to iii comprise the steps of:

a) computing the hash value of the object to be deleted using the hash function;

b) moving to the first bucket given by the probe sequence for the hash value;

c) computing the hash value of the object stored in the current bucket;

d) checking if the hash value of the stored object is different from the hash value of the object to be found, and if so skipping to step g;

e) checking if the stored object matches the object to be deleted, and if so deleting the object in the current bucket;

f) checking if the extend flag of the current bucket indicates that there is a subsequent object in the hash table with the same hash value, and if not aborting the deletion;

g) moving to the next bucket in the probe sequence; h) returning to step c.

This is an exemplary method for carrying out the invention, in which the buckets in the probe sequence and their contents are considered in turn.

Preferably, step g further comprises recording the current bucket location, and step e further comprises checking if a bucket location has been recorded, if so setting the extend flag of the recorded bucket to the value of the extend flag of the current bucket, and if not setting the present flag of the first bucket to the value of the extend flag of the current bucket. This allows the hash table to operate when objects in the hash table are deleted.

In accordance with a fourth aspect of the invention there is provided a computer program product arranged, when executed, to perform any of methods described above.

In accordance with a fifth aspect of the invention there is provided a method of creating a reference genome hash table, comprising the steps of:

obtaining a reference genome sequence;

inserting sequence fragments from the reference genome sequence into a hash table using any of methods described above .

In accordance with a sixth aspect of the invention there is provided a method of mapping a genome onto a reference genome, comprising the steps of:

obtaining a reference genome hash table created using the method described above;

obtaining the sequence of the genome in fragmented form using a sequencing machine;

mapping the genome sequence fragments onto the reference genome by searching for reference genome sequence fragments matching the genome sequence fragments using any of the methods of searching for objects described above.

The apparatus of a sequencing machine and that of a computer are used to enable the production of the partial or complete DNA sequence of a biological sample. The biological sample may consist of the complete or partial genome of an individual biological individual.

Alternatively, the biological sample may consist of the complete or partial reverse transcribed RNA complement of an individual biological individual. A software module may be used in order to map sequence reads onto a reference genome. The practice of the aspect of the invention may employ conventional methods of microbiology, molecular biology, and computer science within the reach of a person having

ordinary skill in the art.

It will of course be appreciated that features

described in relation to one aspect of the present invention may be incorporated into other aspects of the present invention .

Description of the Drawings

Embodiments of the present invention will now be described by way of example only with reference to the accompanying schematic drawings of which:

Figure 1 is a flow chart describing the insertion of an object into a hash table, in accordance with a first embodiment of the present invention;

Figure 2 is a flow chart describing the searching for an object in a hash table, in accordance with the first embodiment; Figure 3 is a flow chart describing the insertion of an object into a hash table, in accordance with a second embodiment of the present invention;

Figure 4 is a flow chart describing the deletion of an object from a hash table, in accordance with the second embodiment;

Figure 5 is a flow chart describing the searching for an object in a hash table, in accordance with the second embodiment.

Detailed Description

Embodiments of the invention that provide methods of inserting objects into a hash table, searching for objects in a hash table, and deleting objects from a hash table, are now described, with reference to the figures.

An unsuccessful search in a standard open addressing hash table is slow because the probe sequence is traversed until an empty bucket is found, and the density of these is low at high load factors. In accordance with certain embodiments of the invention, this is addressed by adding flag bits indicating that more entries exist in the chain, here defined as the smallest prefix of a probe sequence that contains all elements inserted in a hash bucket. (This flag bit is referred to as the "extend" flag below.) An unset bit signals that the search can be aborted at this point. This reduces the average time for an unsuccessful search to be the same as that for a successful search of the last element in the chain. As the flag bit will only be present for chains that include at least one element, in certain embodiments an additional sentinel bit is used ensure that the search algorithm does not enter empty chains. (This flag bit is referred to as the "present" flag below.) (Such searches could instead be terminated by empty buckets, but this would negate any gain from the first modification.)

The order in which hash buckets are probed is given by the probe sequence h(k,i). For standard hash tables this can be an arbitrary function of the object k and probe index i, as long as h (k, · ) : { 0, ... ,m} -> { 0, ... ,m} is a permutation. However, the use of the sentinel bit stipulates that the probe sequence be determined by h' (k) =h (k , 0) . Of the standard probe sequences, linear and quadratic probing satisfy this requirement, but double hashing does not.

Another choice, which minimizes clustering but hash bad hash performance, is random hashing, which in its simplest incarnation takes the form h (k, i) =h r (k) +h' ' (i) mod n, where h' ' (i) is a permutation.

Figure 1 is a flow chart showing the insertion of an object into a hash table based the ideas set out above, in accordance with a first embodiment of the invention. The hash table is initialized by setting T[j].obj=NIL and

T[j ] .present=False for each bucket j.

First, the hash value of the object to be inserted into the hash table is calculated using the hash function, and the first bucket in the probe sequence for that value is located (step 1) . The bucket is checked to see if it is empty (step 2) . If not, the hash value of the object stored in the bucket is calculated, and compared to the hash value of the object to be inserted (step 3) . If they are the same, the current bucket location is recorded (step 4) . The next bucket in the probe sequence is then found (step 5) and the process is repeated from step 2.

If at step 2 the current bucket is empty, the object to be inserted is stored in the bucket (step 6), and the present flag for the first bucket is set to True (step 7), indicating that there are objects in the hash table with the hash value of the first bucket. Then, if a bucket location has been recorded in step 4 (step 8), the extend flag for the recorded bucket is set to True (step 9), indicating that there are objects with the same hash value as the object stored in the recorded bucket further on in the probe sequence.

The insert method can alternatively be represented by the following pseudo-code:

Figure 2 is a flow chart showing the search for an object in a hash table in accordance with the first

embodiment of the invention. First, the hash value of the object to be found in the hash table is calculated using the hash function, and the first bucket in the probe sequence for that value is located (step 21) . The present flag for the bucket is checked (step 22) . If it is False, the search is aborted (step 23) . If it is True, the hash value of the object stored in the bucket is calculated, and compared to the hash value of the object to be found (step 24) . If they are the same, the object in the bucket is compared to the object to be found (step 25) . If they match (for example if they are equal, or if they are neighbours to within a chosen distance) , the current bucket is returned as the result of the search (step 26) . If not, the extend flag of the current bucket is checked (step 27) . If it is False, the search is aborted (step 28) . If it is True, the next bucket in the probe sequence is found (step 29) and the process is repeated from step 24.

The search method can alternatively be represented by the following pseudo-code:

Thus, it can be seen that the use of the extend flag means it can be determined when the last object in the hash table with the same hash value as the object to be found has been reached, without the search needing to continue until an empty bucket is found. The use of the present flag means it can be determined when there are no objects at all in the hash table with the same hash as the object to be found, as in that case there is no last object which can have its extend flag checked.

It will be appreciated that the search method could be adapted to return multiple objects matching the object to be found .

Figure 3 is a flow chart showing the insertion of an object into a hash table based the ideas set out above, in accordance with a second embodiment of the invention. The proposed hash table has multiple parallel interleaved probe chains, and this structure can be used to efficiently implement the deletion of objects in the hash table.

As in the preceding embodiment, the hash table is initialized by setting T[j].obj=NIL and T[j ] .present=False for each bucket j.

First, the hash value of the object to be inserted into the hash table is calculated using the hash function, and the first bucket in the probe sequence for that value is located (step 41) . The bucket is checked to see if it is empty, or has previously been filled but the contents have been deleted (step 42) . If not, the hash value of the object stored in the bucket is calculated, and compared to the hash value of the object to be inserted (step 43) . If they are the same, the current bucket location is recorded (step 4) . The next bucket in the probe sequence is then found (step 5) and the process is repeated from step 42.

If at step 42 the current bucket is empty, the object to be inserted is stored in the bucket (step 46) . Then, it is checked whether a bucket location has been recorded in step 44 (step 47) . If so, the extend flag for the current bucket is set to the value of the extend flag for the recorded bucket (step 48), and the extend flag for the recorded bucket is set to True (step 49) . If not, the extend flag for the current bucket is set to the value of the present flag for the first bucket (step 49) . The present flag for the first bucket is then set to True (step 51), indicating that there are objects in the hash table with the hash value of the first bucket.

Thus, it can be seen that the insert method of the second embodiment operates in a similar manner to the insert method of the first embodiment, except that the extend flag of the bucket in which the object is stored is also set to take account of deleted objects.

The insert method can alternatively be represented by the following pseudo-code:

Figure 4 is a flow chart showing the deletion of an object from a hash table in accordance with the second embodiment of the invention.

First, the hash value of the object to be deleted from the hash table is calculated using the hash function, and the first bucket in the probe sequence for that value is located (step 61) . The present flag for the bucket is checked (step 62) . If it is False, the search is aborted (step 63) . If it is True, the hash value of the object stored in the bucket is calculated, and compared to the hash value of the object to be deleted (step 64) . If the hash values are different, the next bucket in the probe sequence is found (step 69) and the process is repeated from step 64.

If at step 64 the hash values are the same, it is checked whether the object in the bucket matched the object to be deleted (step 65) . (In the case that objects are to be deleted it is likely that the level of matching required is identity between the objects.) If they do not match, the extend flag of the current location is checked (step 66), and if not the search is aborted (step 67) . If it is set to True, the location of the current bucket is recorded (step 68) , the next bucket in the probe sequence is found (step 69 again) and the process is repeated from step 64.

If at step 66 the objects do not match, it is checked whether a bucket location has been recorded (step 70) . If so, the extend flag of the recorded bucket is set to the extend flag of the current bucket (step 71); if not, the present value of the first bucket is set to the extend value of the current bucket (step 72) . The contents of the current bucket is then deleted (step 73), and the location of the current bucket is returned (step 74) .

Thus, the delete method of the second embodiment searches for the object to be deleted, and when it is found deletes that object, and also sets extend or present flags to take account of that deletion.

The deletion method can alternatively be represented by the following pseudo-code:

Figure 5 is a flow chart showing the search for an object in a hash table in accordance with the second embodiment of the invention. As well as searching for objects, the method ensures that any deleted entries in a probe chain "bubble" towards the end of that chain. Each time a deleted entry is moved, it has a positive probability of landing beyond the end of all of the m chains. When this happens, it no longer slows down future search operations. While the deleted entry is not marked as empty, the insert method is able to use deleted entries for new storage, so for practical purposes it can be considered to be empty. (The search could of course be done without moving any deleted entries, but doing so makes future searches more efficient . )

First, the hash value of the object to be found in the hash table is calculated using the hash function, and the first bucket in the probe sequence for that value is located (step 81) . The present flag for the bucket is checked (step 82) . If it is False, the search is aborted (step 83) . If it is True, it is checked whether the current bucket is marked as deleted (step 84) . If so, it is checked whether a bucket location has been recorded (step 85), and if not the current bucket location is recorded (step 86) . The next bucket in the probe sequence is found (step 87) and the process is repeated from step 84.

If in step 84 the current bucket is not marked as deleted, the hash value of the object stored in the bucket is calculated, and compared to the hash value of the object to be found (step 88) . If they are not the same, the next bucket in the probe sequence is found (step 87 again) and the process is repeated from step 84.

If in step 88 the hash values are the same, it is checked whether a bucket location has been recorded (step 89) . If not, the object in the current bucket is compared to the object to be found (step 90) . If they match, the current bucket is returned as the result of the search (step 91) . If not, the extend flag of the current bucket is checked (step 92) . If it is False, the search is aborted (step 93) . If it is True, the next bucket in the probe sequence is found (step 87 again) and the process is repeated from step 84. If in step 89 a bucket location has been recorded, the object stored in the current bucket is moved to the recorded bucket (step 94), and the current bucket is marked as deleted and the extend flag of the current bucket is set to False (step 95) . The object in the recorded bucket is then compared to the object to be found (step 96) . If they match, the recorded bucket is returned as the result of the search (step 97) . If not, the extend flag of the recorded bucket is checked (step 98) . If it is False, the search is aborted (step 99) . If it is True, the next bucket in the probe sequence is found (step 87 again) and the process is repeated from step 84.

Thus, it can be seen that the search method operates similarly to that of the first embodiment, with the addition that the location of the most recent bucket marked as deleted is recorded (at step 86) ; then, when a subsequent object with same hash value is found, it is moved to recorded bucket, thus making subsequent searches more efficient .

The search method can alternatively be represented by the following pseudo-code:

In certain embodiments, the invention is used to map a genome sequence onto a reference genome sequence. In this case, the overall method of the embodiment is as follows. First, an operator creates a sample of DNA in the form required by the sequencing machine (a "library") using conventional molecular biological techniques. Second, a sequencing machine is employed in a standard manner to obtain the sequence of the DNA represented in the library in a fragmented form (reads consisting of between 10 and 500 bases with existing technologies) . Third, the mapping system "maps" all of the reads onto a reference sequence, i.e. infers the original positions of the reads within said reference sequence, the reference sequence being for example the human reference sequence. This mapping procedure is described in more detail below. The end result consists of a coordinate indicating the chromosome and base pair coordinate of the read relative to the reference sequence, in addition to various other data including, but not limited to, the confidence assigned by the mapping system in the correctness of the inferred coordinate.

The mapping procedure discussed above comprises the following general steps:

• Create a genome hash table;

• For a single read, make all k-character infix

words, and make neighbours within a given maximum edit distance b;

• For every neighbour, lookup candidate genome locations in the hash;

• Filter candidates using local similarity, and

remove duplicates;

• Align and score the read at each candidate location, and sort by score;

• Re-align and re-score top hits using sensitive but slower (non-banded affine-gap) alignment;

• Compute the mapping posterior quality score from the candidate list. For paired-end reads, these steps are followed for each read independently, followed by a post-processing stage specific to paired-end reads. These steps are explained in more detail below.

Step: Creating a genome hash table. The genome is stored as a single text consisting of A, C, G and T

characters, encoded by two bits per character. Chromosomes or contigs are separated by runs of a pre-set character, and their starting location is stored in a separate lookup table. At every nth location in the resulting text, where n is a pre-set step size, a .fc-character infix is obtained, and first encoded in 2k bits. The reverse-complement symmetry is divided out, by considering the 2k- t codes for the infix and its reverse complement, and from each such pair choosing a unique exemplar by some rule, and encoding each of the resulting 2 2k~1 exemplars in 2k-l bits. Several examples of such rules and corresponding encodings exist. Finally, the 2k-l-bit code is randomized, by subjecting it to a permutation of {0 , ... , 2 2k~1 -l} that can be computed quickly, in order to reduce primary clustering in the hash. Again several equivalent such permutations exist.

The location divided by n is considered the "object", and the randomized 2k-l-bit code its hash value. The hash value of each "object" can be calculated by looking up the .fc-character infix at the corresponding location in the genome and calculating the hash value as described above.

As a first step, the multiplicity of every hash value is calculated, by calculating the hash of every considered position (at multiples of n) and keeping occupancy counts. The buckets with occupancies exceeding a given bound are flagged by a special hash code signifying a "high count". The rationale is that motifs that hash to "high count" buckets will not be informative of the true genomic location of the read, but would increase the hash size and slow down the algorithm if these were not removed. All the considered locations that do not hash to a "high count" entry are then stored in the hash table, following any of the key

arrangement schemes described above or known in the

literature .

For hashing, any probe sequence that satisfies the requirement for the fast hash algorithm can be used; the preferred implementation uses quadratic probing, which has both good cache properties, and has low secondary

clustering. To improve cache properties for the relatively few but relatively often-accessed buckets with high

occupancy (but with occupancies under the "high count" limit) , when bucket occupancy exceeds a pre-set threshold, linear probing is used instead of quadratic probing. To signal that linear probing is to be used, a second special hash code is stored in the chain's primary entry, with the actual bucket members stored in subsequent positions. In cases where the chain's primary entry is already occupied by an entry from another chain, the algorithm falls back to using quadratic probing.

Step: Make all .fc-character infix words. All

consecutive k- ers that fall within the read are stored in a list, with the exception of reads that contain more than a specified number of N characters, signifying un-called and uninformative bases. To increase efficiency, positions where the quality of bases fall below a set threshold can also be skipped in this step. For each of the resulting infixes, all neighbours within edit distance b are added to the same list. When infixes contain N characters, instead all possible nucleotides are tried for that position.

Step: Lookup candidate genome locations in hash.

Following the algorithm described in detail above, the genomic locations ("candidate") of each infix and infix- neighbour are looked up in the hash, and stored in a second list. By comparing the text at the genomic location with the infix or infix-neighbour, the read direction is established. When an infix or infix-neighbour hashes to a "high count" entry, the search is aborted. The algorithm keeps track of the number of "high count" entries encountered, separately for infixes and infix neighbours.

Step: Candidate filtering. A pre-set number of

characters outside the infix or infix-neighbour, but inside the read, are compared to the corresponding positions in the genome. Based on summary statistics that are easy to calculate, a pseudo-distance is computed. This pseudo- distance is chosen in such a way that it assigns short distances to reads that differ from the genome position by read errors or common polymorphisms or substitutions, such as single-nucleotide polymorphisms or short indels. A possible implementation uses the counts of A, C, G and T characters as statistics, and computes a distance as the sum of absolute differences of the read and genome counts. When this pseudo-distance exceeds a set threshold, the candidate is removed.

Next, each candidate is assigned a normalized position, taking account of the infix location in the read. The list of candidates is sorted according to this normalized position, and duplicates are removed.

Step: Alignment, scoring and sorting. For each candidate, a segment of the genome text is obtained in a pre-set window around the candidate, and the read is aligned to this text using a fast, banded affine-gap alignment algorithm. A possible implementation would use vector instructions and a linear-memory Viterbi algorithm, allowing the dynamic programming table to be contained within processor registers, keeping run-time requirements low.

This algorithm is sufficient for calculating the likelihood of the Viterbi alignment; for later reference, the algorithm also stores back pointers allowing for Viterbi traceback to recover the actual alignment.

The likelihood is expressed in logarithmic units and serves as the alignment score. This score depends on the read quality at each position, gap opening and gap extension penalties, and the genome and read texts. After aligning and scoring, the resulting list is sorted by score.

Step: Re-align and re-score top hits. A pre-set fraction or number of top-scoring hits is re-aligned and re- scored using a slower but more sensitive alignment algorithm that is not banded, and thus allows larger gaps to occur.

Step: Compute mapping posterior. The resulting scores have an interpretation as likelihoods, and the list of candidate locations thus allows the posterior probability of a candidate location with sub-maximal likelihood to be the true mapping location to be calculated in the standard fashion. In addition, the probability that the list of candidates does not contain the true mapping location is estimated, using the counts of infix and infix-neighbours that hash to "high-count" entries. As high-count entries are skipped, it becomes more likely that particular

configurations of read errors and true substitutions and polymorphisms changes all infix locations beyond the edit distance of b. Offline, a simulation is used to estimate this probability for various combinations of read length, maximum edit distance, .fc-mer infix length, step size n, infix "high-count" counts, and infix-neighbour "high-count" counts. The mapping algorithm looks up the required value and adjusts the final mapping posterior accordingly.

Step: Paired-end processing. For paired-end sequences, two lists of candidates are computed according to the algorithm set out above, independently for each read. When the two lists are empty, the reads are output as

"unmappable" and the algorithm is finished. When one of the lists is empty, the read associated to the empty list is aligned against the genome text in a segment around the expected location, based on the location of the other read (the "mate") , and a pre-set multiple of the standard deviation of the distances between the 5' ends of the reads (the "insert size") . The latter is either pre-set, or is obtained automatically from the mapping results, using a standard robust estimator. This results in a list of pairs of candidates, and pairs of scores. Each candidate pair is associated with a single score, computed as the sum of the single-read alignment scores, and a term for the likelihood of the observed insert size. The latter is based on the pre-set or observed distribution of insert sizes, a prior probability for "stretched pairs" defined as pairs at non- canonical distances but within a pre-set threshold of each other, and a prior probability for "broken pairs", defined as pairs whose distance is beyond this pre-set threshold.

When both lists are populated, a list of pairs is constructed by pairing a pre-set fraction or number of candidates from each list, or a subset of candidates determined in another fashion. In addition, each candidate from one of the subsets is paired to a location obtained in the first paragraph of this section, i.e. by considering the read as an orphan. Pairs obtain in both of these ways are scored following the same formula as described above.

The resulting list of paired candidates is sorted by their score, and the top candidate is reported. Mapping posteriors are calculated in the same way as for single reads.

In certain embodiments, the invention may be used in together with a key arrangement scheme, as follows. Key arrangement schemes are intended to shorten the average chain length, which determines both successful and

unsuccessful search times. For example, the Brent method (see Brent RP : Reducing the retrieval times of scatter storage techniques. Comm. ACM 1973, 16 ( 2 ) : 105-109 ) works by considering the chain that would be created by inserting an element normally, and all other chains that intersect it.

It then considers moving each of the intersecting elements up on their chain, which would free the intersecting hash entry to receive the new element. The configuration that minimizes the total chain length is then chosen. This elegant idea leads to expected (successful) search times that are bounded as the load factor tends to 1. Using more potential chains even higher efficiencies can be achieved (see Gonnet GH, Munro JI : Efficient ordering of hash tables. SIAM J Comput 1979, 8 ( 3 ) : 463-478 ) .

These key arrangement schemes can be applied to the proposed data structure as well. For this it is necessary that the present and extend bits be updated, and this can be achieved without difficulty; the precise algorithm is not given here. The result is a hash table that supports constant-time successful and unsuccessful searches, assuming a perfect (uniformly random) distribution of hashes.

Generally, high-occupancy buckets will be visited more often, and their chains will be longer too. It is therefore particularly important to keep these chains as short as possible. A simple key arrangement scheme is to insert elements normally, but in reverse order of bucket occupancy. This ensures that the most-often used chains will be

compact, while those containing just one element are the longest. This scheme requires a temporary table to hold bucket occupancies, but is very simple to implement and gives good results, especially when the distribution of hashes is severely skewed.

Whilst the present invention has been described and illustrated with reference to particular embodiments, it will be appreciated by those of ordinary skill in the art that the invention lends itself to many different variations not specifically illustrated herein.