LAYERED LOCALITY SENSITIVE HASHING (LSH) PARTITION INDEXING FOR BIG DATA APPLICATIONS

Title:

LAYERED LOCALITY SENSITIVE HASHING (LSH) PARTITION INDEXING FOR BIG DATA APPLICATIONS

Document Type and Number:

WIPO Patent Application WO/2019/165546

Abstract:

A system and method of partitioning a plurality of data objects that are each represented by a respective high dimensional feature vector are described, including performing a hashing function on each high dimensional feature vector to generate a respective lower dimensional binary compact feature vector for the data object that is represented by the high dimensional feature vector; performing a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partitioning the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

Inventors:

LU YANGDI (CA)
HE WENBO (CA)
NABATCHIAN AMIRHOSEIN (CA)

Application Number:

PCT/CA2019/050228

Publication Date:

September 06, 2019

Filing Date:

February 26, 2019

Assignee:

HUAWEI TECH CANADA CO LTD (CA)

International Classes:

G06F16/901; G06F7/00; G06F16/903

Domestic Patent References:

WO2017011768A1

2017-01-19

Foreign References:

US6745205B2	2004-06-01
CN104035949B	2017-05-10

Attorney, Agent or Firm:

RIDOUT & MAYBEE LLP (CA)

Claims:

CLAIMS

1. A method of partitioning a plurality of data objects that are each represented by a respective high dimensional feature vector, comprising: performing a hashing function on each high dimensional feature vector to generate a respective lower dimensional binary compact feature vector for the data object that is represented by the high dimensional feature vector; performing a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partitioning the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

2. The method of claim 1 wherein the hashing function performed on each high dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on each compact feature vector is also an LSH function.

3. The method of claim 2 wherein the hashing function and the further hashing function are orthogonal angle hashing functions.

4. The method of claim 3 comprising generating a searchable sub-index structure for each of the respective partition groups.

5. The method of claim 4 wherein each compact feature vector is partitioned into only a single one of the partition groups.

6. The method of claim 4 or 5 comprising storing the sub-index structures as independently searchable structures enabling the sub-index structures to be searched concurrently with each other.

7. The method of any one of claims 4 to 6 wherein generating a searchable sub-index structure for each of the respective partition groups comprises, for each partition group: generating a plurality of twisted compact feature vector sets for the compact feature vectors of the partition group, each of the twisted compact feature vector sets being generated by applying a respective random shuffling permutation to the compact feature vectors of the partition group; for each twisted compact feature vector set, generating an index table for the data objects represented by the compact feature vectors of the partition group based on sequences of the hashed values in the twisted compact feature vector set; and including the index tables generated for each of the twisted compact feature vector sets in the searchable sub-index structure for the partition group.

8. A system for partitioning data objects that are each represented by a respective high dimensional feature vector, comprising: one or more processing units; a system storage device coupled to each of the processing units, the system storage device tangibly storing thereon executable instructions that, when executed by the one or more processing units, cause the system to: perform a hashing function on each high dimensional feature vector to generate a respective lower dimensional binary compact feature vector for the data object that is represented by the high dimensional feature vector; perform a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partition the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

9. The system of claim 8 wherein the hashing function performed on each high dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on each compact feature vector is also an LSH function.

10. The system of claim 9 wherein the hashing function and the further hashing function are orthogonal angle hashing functions.

11. The system of claim 10 comprising generating a searchable sub-index structure for each of the respective partition groups, wherein each compact feature vector is partitioned into only a single one of the partition groups.

12. The system of claim 11 wherein the executable instructions, when executed by the one or more processing units, cause the system to store the sub-index structures in one or more storages as independently searchable structures, enabling the sub-index structures to be searched concurrently with each other.

13. The system of claim 11 or 12 wherein the executable instructions, when executed by the one or more processing units, cause the system to generate the searchable sub-index structure for each of the respective partition groups by: generating a plurality of twisted compact feature vector sets for the compact feature vectors of the partition group, each of the twisted compact feature vector sets being generated by applying a respective random shuffling permutation to the compact feature vectors of the partition group; for each twisted compact feature vector set, generating an index table for the data objects represented by the compact feature vectors of the partition group based on sequences of the hashed values in the twisted compact feature vector set; and including the index tables generated for each of the twisted compact feature vector sets in the searchable sub-index structure for the partition group.

14. A computer program product comprising a medium tangibly storing thereon executable instructions that, when executed by a digital processing system, cause the digital processing system to: perform a hashing function on each of a plurality of high dimensional feature vectors to generate respective lower dimensional binary compact feature vectors, the high dimensional feature vectors each representing a respective data object; perform a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partition the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

15. A method of searching for data objects that are similar to a query object, comprising: converting the query object into a d-dimensional feature vector; performing a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; performing a further hashing function on the query vector to determine a sub-index ID for the query vector; and searching, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

16. The method of claim 15 wherein the hashing function performed on the d-dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on the compact feature query vector is also an LSH function.

17. The method of claim 16 wherein the hashing function and the further hashing function are orthogonal angle hashing functions.

18. The method of any one of claims 15 to 17 further comprising: determining a set of further sub-index IDs that fall within a similarity threshold for the sub-index ID for the query vector; and searching further sub-index structures that correspond to the further sub-index IDs for compact feature vectors that are similar to the query vector.

19. The method of claim 18 wherein the similarity threshold is a threshold level of different bit values in the further sub-index IDs relative to the sub-index ID of the query vector.

20. The method of claim 18 or 19 wherein the searching of further sub-index structures is terminated if a threshold number of search results is reached before all of the sub-index structures that correspond to the further sub-index IDs have been searched.

21. The method of any one of claims 15 to 20 comprising, concurrent with searching in a sub-index structure that corresponds to the sub-index ID: searching a further sub-index structure for compact feature vectors that are similar to a further query vector for which a further sub-index ID has been determined.

22. A system for searching for data objects that are similar to a query object, comprising: one or more processing units; a system storage device coupled to each of the one or more processing units, the system storage device tangibly storing thereon executable instructions that, when executed by the one or more processing units, cause the system to: convert the query object into a d-dimensional feature vector; perform a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; perform a further hashing function on the query vector to determine a sub-index ID for the query vector; and search, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

23. The system of claim 22 wherein the hashing function performed on the d-dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on the compact feature query vector is also an LSH function.

24. The system of claim 23 wherein the hashing function and the further hashing function are orthogonal angle hashing functions.

25. The system of any one of claims 22 to 24 wherein the executable instructions further cause the system to: determine a set of further sub-index IDs that fall within a similarity threshold for the sub-index ID for the query vector; and search further sub-index structures that correspond to the further sub-index IDs for compact feature vectors that are similar to the query vector.

26. The system of claim 25 wherein the similarity threshold is a threshold level of different bit values in the further sub-index IDs relative to the sub-index ID of the query vector.

27. The system of claim 25 or 26 wherein the searching of further sub-index structures is terminated if a threshold number of search results is reached before all of the sub-index structures that correspond to the further sub-index IDs have been searched.

28. The system of any one of claims 22 to 27 wherein the executable instructions further cause the system to, concurrent with searching in a sub-index structure that corresponds to the sub-index ID: search a further sub-index structure for compact feature vectors that are similar to a further query vector for which a further sub-index ID has been determined.

29. A computer program product comprising a medium tangibly storing thereon executable instructions that, when executed by a digital processing system, cause the digital processing system to search for data objects that are similar to a query object by: converting the query object into a d-dimensional feature vector; performing a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; performing a further hashing function on the query vector to determine a sub-index ID for the query vector; and searching, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

Description:

LAYERED LOCALITY SENSITIVE HASHING (LSH) PARTITION INDEXING FOR BIG DATA APPLICATIONS

Related Applications

[0001] This application claims benefit of and priority to United States Provisional Patent Application No. 62/637,278 filed March 1, 2018, and United States Utility Patent Application No. 16/044,362 filed July 24, 2018, the contents of which are both incorporated herein by reference.

Field

[0002] The present disclosure relates generally to indexing and searching of databases, and in particular, to partition indexing of unstructured data.

Background

[0003] The volume of unstructured multimedia data objects, including for example image data, video data, audio data, text data and other sophisticated digital objects, that is stored in digital information repositories such as online Internet and cloud-based databases is growing dramatically. Processing search queries for unstructured data in an accurate and resource efficient manner presents technical challenges.

[0004] Similarity searching is a type of data searching in which unstructured data objects are searched based on a comparison of similarities between a query object and the data objects in a search database. Similarity searching typically involves creating metadata for each of the data objects stored in a database, creating metadata for a query object and then comparing the metadata for the query object with the metadata of the data objects. The metadata for each object can take the form of a feature vector, which is a multi-dimensional vector of numerical features that represent the object. In this regard, similarity searching can be defined as finding a feature vector from among multiple feature vectors stored in a database that is most similar to a given feature vector (e.g. query vector). Similarity search algorithms can be used in pattern recognition and classification, recommendation systems, statistical machine learning and many other areas.

[0005] Thus, a similarity search generally involves translating (converting) a query object (e.g. an image, video sample, audio sample or text) into a query feature vector which is representative of the query object, using a feature extraction algorithm. The query feature vector is then used for searching a database of feature vectors to locate one or more data object feature vectors (e.g. a feature vector for a data object stored in the database) that are most similar to the query feature vector.
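As an illustration only, the exact (non-indexed) form of such a similarity search can be sketched as a linear scan over stored feature vectors. The function names, the object IDs and the choice of Euclidean distance as the similarity measure are assumptions made for this sketch and are not taken from the disclosure.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query_vector, database, k=1):
    """Return the IDs of the k stored vectors closest to the query vector.

    `database` maps a hypothetical object ID to its feature vector.
    Every stored vector is compared, so the cost grows with database size.
    """
    ranked = sorted(database.items(),
                    key=lambda item: euclidean(query_vector, item[1]))
    return [object_id for object_id, _ in ranked[:k]]


# Hypothetical three-dimensional feature vectors.
database = {
    "obj-1": [0.10, 0.80, 0.30],
    "obj-2": [0.90, 0.20, 0.70],
    "obj-3": [0.15, 0.75, 0.35],
}
nearest = most_similar([0.12, 0.78, 0.32], database, k=2)
```

A scan of this kind compares the query against every stored vector, which is the cost that index-based approaches aim to avoid.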

[0006] In the context of unstructured data objects, the feature vectors are often high-dimensional vectors. In a high dimensional feature space, data for a given dataset becomes sparse, so distances and similarities lose statistical significance, with the result that query performance declines exponentially with an increasing number of dimensions. This is referred to as the “Curse of Dimensionality” problem.

[0007] One method to address the “Curse of Dimensionality” problem includes applying a dimensionality reduction algorithm to each feature vector stored in the database to generate a shorter version of each feature vector (e.g. a compact feature vector). After generating a compact feature vector for each feature vector for each object stored in the database, a search index is generated from the compact feature vectors using an index generation algorithm. The dimensionality reduction algorithm is also applied to the query feature vector to generate a shorter version of the query feature vector (e.g. compact query feature vector). A similarity search can then be performed by providing the compact query vector and the search index to a search algorithm to find candidate data object feature vectors that are most similar to the query feature vector.

[0008] One method for converting a feature vector having a large number of vector dimensions into a compact feature vector with a reduced number of vector dimensions and generating a corresponding search index is to apply hashing-based approximate nearest neighbor (ANN) algorithms. For example, locality sensitive hashing (LSH) can be used to reduce the dimensionality of high-dimensional data. LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items). In particular, a feature vector can be hashed using an LSH algorithm to produce an LSH hash value that functions as the compact feature vector.
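By way of a hedged illustration, one widely used LSH family for angular similarity is sign random projection, in which each hash bit records the side of a random hyperplane on which the vector falls. The sketch below uses that family as a stand-in; it is not asserted to be the hashing function of the present disclosure, and the names `make_hyperplanes` and `lsh_hash` and the parameter values are hypothetical.

```python
import random


def make_hyperplanes(d, m, seed=0):
    """Draw m random Gaussian direction vectors in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def lsh_hash(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the dot product is non-negative."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of bit positions in which two compact vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Hash a hypothetical 8-dimensional feature vector down to 4 bits.
planes = make_hyperplanes(d=8, m=4, seed=42)
v = [0.9, 0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5]
compact = lsh_hash(v, planes)
```

With this family, vectors separated by a small angle agree on most bits, so the Hamming distance between compact vectors can serve as a proxy for dissimilarity between the original vectors.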

[0009] However, a problem with existing LSH-ANN based indexing and search algorithms is that they can result in search queries that are overly biased towards similarities between the most significant bits (MSB) of the compact feature vectors. In particular, existing index generation methods may use the first several bits (or other groups of consecutive bits such as the final several bits) of compact feature vectors to identify similar feature vectors. However, these bits may be a poor indicator of similarity, resulting in inaccurate searching and inefficient use of computing resources.

[0010] An example of this MSB problem is illustrated in FIG. 1A, which shows an example of an LSH-based index and search method 100. In the example of FIG. 1A, an index 102 points to different slots or buckets 104(1), 104(2) that each include a respective set of hash values in the form of compact feature vectors Ki. The compact feature vectors Ki are grouped in respective buckets 104(1), 104(2) based on a longest length of common prefix (LLCP) or other defined distance measurement approach. As depicted in FIG. 1A, the compact feature vector K1 is more similar to compact feature vector K2 than to compact feature vector K3 based on Euclidean distance. However, based on a comparison of the first two (2) components (for example the first 2 bits) of the compact feature vector K1 to compact feature vectors K2 and K3, the index generation method of FIG. 1A divides the compact feature vectors K1 and K2 into different buckets 104(1) and 104(2), and combines compact feature vectors K1 and K3 into the same bucket 104(2). When a compact query feature vector q comes in, based on the first two components, the compact query feature vector q would be closer to the first bucket 104(1) and hence compact feature vectors K1 and K3 are returned as candidate nearest neighbors, where ideally compact feature vectors K1 and K2 should be returned as the nearest neighbors to compact query feature vector q. This error results from the fact that the left components or bits are granted priority in partitioning although there was no preference for the components or bits when selecting the hash functions. This affects the accuracy when using the generated search index for similarity searching.

[0011] In environments where multiple concurrent search queries are run against large volumes of unstructured data objects stored in digital information repositories, partition strategies can be used to divide data indexes into groups. For example, in order to facilitate searching, indexes can be partitioned or divided into partition groups (which can include slots or buckets) with purportedly similar objects being assigned to the same partition group. Similar to the MSB problem described above, existing partition methods use a fixed number of leading bits in a compact feature vector to partition the compact feature vectors into partition groups. When a query is performed, the search is conducted only in respect of one partition group, which can yield a large error. FIG. 1B shows an example of a conventional (not content-based) partition method. Based on their leading 2 bits, the compact feature vectors K2 and K3 are placed in partition group 11, and the compact feature vectors K1 and K4 are placed in partition group 01. Although the hash values K1 and K2 are almost identical except for their first bits, the conventional partitioning method places the hash values K1 and K2 into different partition groups. Also, the conventional partitioning method places the extremely different hash values K2 and K3 into the same partition group. Accordingly, similar compact feature vectors are likely to be placed into different sub-indexes (e.g. partition groups), which affects the accuracy and consistency of similarity searching.
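The leading-bit partitioning problem described above can be reproduced with a few lines of code. The 8-bit hash values assigned to K1 to K4 below are hypothetical, chosen only to match the grouping described for the conventional method; they are not the values shown in the figure.

```python
def hamming(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def prefix_partition(codes, prefix_len):
    """Group hash values by their leading prefix_len bits (conventional method)."""
    groups = {}
    for name, code in codes.items():
        groups.setdefault(code[:prefix_len], []).append(name)
    return groups


# Hypothetical compact feature vectors written as bit strings.
codes = {
    "K1": "01101101",
    "K2": "11101101",  # differs from K1 only in the first bit
    "K3": "11010010",  # shares K2's two leading bits, differs elsewhere
    "K4": "01000110",
}
groups = prefix_partition(codes, prefix_len=2)

k1_k2 = hamming(codes["K1"], codes["K2"])  # near-identical, yet separated
k2_k3 = hamming(codes["K2"], codes["K3"])  # very different, yet grouped
```

In this example K1 and K2 differ in a single bit but land in different groups, while K2 and K3 differ in six of eight bits and share a group.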

[0012] Accordingly, methods and systems are disclosed herein that address the aforementioned partitioning problem to improve the accuracy and efficiency of searching large scale unstructured data stored in digital information repositories, including systems and methods that can improve both the computational efficiency and the accuracy of searching.

Summary

[0013] Illustrative embodiments are disclosed by way of example in the description and claims. According to one example aspect is a system and method of generating an index structure for indexing a plurality of unstructured data objects, comprising: generating a set of compact feature vectors, the set including a compact feature vector for each of the data objects, the compact feature vector for each data object including a sequence of hashed values that represent the data object; and indexing the compact feature vectors into partition groups based on content of the compact feature vector.

[0014] According to a first example aspect, a method of partitioning a plurality of data objects that are each represented by a respective high dimensional feature vector is described. The method includes performing a hashing function on each high dimensional feature vector to generate a respective lower dimensional binary compact feature vector for the data object that is represented by the high dimensional feature vector; performing a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partitioning the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

[0015] In some example embodiments, the hashing function performed on each high dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on each compact feature vector is also an LSH function. In some examples, the hashing function and the further hashing function are orthogonal angle hashing functions. In some examples the method includes generating a searchable sub-index structure for each of the respective partition groups.

[0016] In some examples, each compact feature vector is partitioned into only a single one of the partition groups. In some examples, the sub-index structures are stored as independently searchable structures enabling the sub-index structures to be searched concurrently with each other.
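The two-layer partitioning of the first example aspect can be sketched as follows. This is a minimal illustration under stated assumptions: both layers use sign random projection as a stand-in LSH function (the orthogonal angle hashing functions referred to above are not reproduced here), the compact bits are mapped to -1/+1 before the second layer, and the names `partition`, `random_planes` and `sign_bits` as well as the values of m and s are hypothetical.

```python
import random


def random_planes(dim, count, rng):
    """Draw `count` random Gaussian direction vectors of length `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def sign_bits(vector, planes):
    """One bit per plane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(p, vector)) >= 0.0 else 0
        for p in planes
    )


def partition(feature_vectors, m, s, seed=0):
    """Hash d-dim vectors to m bits, then hash those m bits to an s-bit sub-index ID."""
    rng = random.Random(seed)
    d = len(next(iter(feature_vectors.values())))
    layer1 = random_planes(d, m, rng)   # first layer: d dimensions -> m bits
    layer2 = random_planes(m, s, rng)   # second layer: m bits -> s bits
    groups = {}
    for object_id, v in feature_vectors.items():
        compact = sign_bits(v, layer1)
        centred = [2 * b - 1 for b in compact]  # assumption: bits as -1/+1
        sub_id = sign_bits(centred, layer2)
        # Each compact vector goes into exactly one partition group.
        groups.setdefault(sub_id, []).append((object_id, compact))
    return groups


vectors = {
    "obj-1": [0.10, 0.80, 0.30, 0.55],
    "obj-2": [0.90, 0.20, 0.70, 0.05],
    "obj-3": [0.15, 0.75, 0.35, 0.50],
}
groups = partition(vectors, m=16, s=2, seed=7)
```

Because the sub-index ID is computed from all m bits of the compact vector rather than from a fixed prefix, no single bit position is given priority when assigning a partition group.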

[0017] In some example embodiments, generating a searchable sub-index structure for each of the respective partition groups comprises, for each partition group: generating a plurality of twisted compact feature vector sets for the compact feature vectors of the partition group, each of the twisted compact feature vector sets being generated by applying a respective random shuffling permutation to the compact feature vectors of the partition group; for each twisted compact feature vector set, generating an index table for the data objects represented by the compact feature vectors of the partition group based on sequences of the hashed values in the twisted compact feature vector set; and including the index tables generated for each of the twisted compact feature vector sets in the searchable sub-index structure for the partition group.
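A rough sketch of the twisted-set construction follows. It assumes that each random shuffling permutation reorders the bit positions of every compact feature vector in the partition group in the same way, and it uses a sorted list as a simple stand-in for the index table, whose actual layout is not specified in this passage. The name `build_sub_index` and the table representation are hypothetical.

```python
import random


def build_sub_index(compact_vectors, num_tables, seed=0):
    """Build one index table per random permutation of the bit positions.

    `compact_vectors` maps an object ID to a tuple of bits; all tuples
    are assumed to have the same length.
    """
    rng = random.Random(seed)
    m = len(next(iter(compact_vectors.values())))
    tables = []
    for _ in range(num_tables):
        perm = list(range(m))
        rng.shuffle(perm)  # a respective random shuffling permutation
        # Twisted set: every vector in the group reordered by the same perm,
        # then sorted so that similar bit sequences end up adjacent.
        entries = sorted(
            (tuple(code[i] for i in perm), object_id)
            for object_id, code in compact_vectors.items()
        )
        tables.append({"perm": perm, "entries": entries})
    return tables


codes = {"a": (0, 1, 1, 0), "b": (1, 1, 0, 0), "c": (0, 0, 1, 1)}
tables = build_sub_index(codes, num_tables=3, seed=1)
```

Because each table places a different set of bits at the front, a pair of vectors that differs in the leading bits under one ordering can still share a long common prefix under another.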

[0018] According to a second example aspect, a system for partitioning data objects that are each represented by a respective high dimensional feature vector is described. The system includes one or more processing units and a system storage device coupled to the processor system. The system storage device stores executable instructions that, when executed by the one or more processing units, cause the system to: perform a hashing function on each high dimensional feature vector to generate a respective lower dimensional binary compact feature vector for the data object that is represented by the high dimensional feature vector; perform a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partition the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

[0019] According to a third example aspect is a computer program product comprising a medium tangibly storing thereon executable instructions that, when executed by a digital processing system, cause the digital processing system to: perform a hashing function on each of a plurality of high dimensional feature vectors to generate respective lower dimensional binary compact feature vectors, the high dimensional feature vectors each representing a respective data object; perform a further hashing function on each compact feature vector to assign a sub-index ID to the compact feature vector; and partition the compact feature vectors into respective partition groups that correspond to the sub-index IDs assigned to the compact feature vectors.

[0020] According to a fourth example aspect is a method of searching for data objects that are similar to a query object. The method includes: converting the query object into a d-dimensional feature vector; performing a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; performing a further hashing function on the query vector to determine a sub-index ID for the query vector; and searching, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

[0021] In example embodiments of the fourth aspect, the hashing function performed on the d-dimensional feature vector is a locality sensitive hashing (LSH) function, and the further hashing function performed on the compact feature query vector is also an LSH function. In some examples, the hashing function and the further hashing function are orthogonal angle hashing functions.

[0022] In example embodiments of the fourth aspect, the method includes: determining a set of further sub-index IDs that fall within a similarity threshold for the sub-index ID for the query vector; and searching further sub-index structures that correspond to the further sub-index IDs for compact feature vectors that are similar to the query vector. In some examples, the similarity threshold is a threshold level of different bit values in the further sub-index IDs relative to the sub-index ID of the query vector.
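Where the similarity threshold is a number of differing bit values, the set of further sub-index IDs can be enumerated by flipping bits of the query's sub-index ID. The following is a sketch only; the function name `ids_within` and the nearest-first ordering are assumptions made for illustration.

```python
from itertools import combinations


def ids_within(sub_id, max_diff):
    """Yield sub-index IDs differing from sub_id in 1..max_diff bit positions.

    IDs with fewer differing bits are yielded first.
    """
    s = len(sub_id)
    for k in range(1, max_diff + 1):
        for positions in combinations(range(s), k):
            flipped = list(sub_id)
            for p in positions:
                flipped[p] ^= 1  # flip the selected bit
            yield tuple(flipped)


# Further sub-index IDs within one differing bit of a 3-bit query sub-index ID.
neighbours = list(ids_within((1, 0, 1), max_diff=1))
```

For an s-bit sub-index ID and a threshold of t differing bits, this produces the sum of C(s, k) for k from 1 to t candidate IDs, so small thresholds keep the number of additional sub-indexes to search modest.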

[0023] In some example embodiments of the fourth aspect, the searching of further sub-index structures is terminated if a threshold number of search results is reached before all of the sub-index structures that correspond to the further sub-index IDs have been searched.
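The early-termination behaviour can be sketched as a loop over candidate sub-indexes that stops once enough results have been collected. In this sketch each sub-index is modelled as a plain list of (object ID, compact vector) pairs, a match is any compact vector within a Hamming-distance limit of the query, and the names `stepwise_search`, `max_dist` and `enough` are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two compact vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def stepwise_search(query, ordered_ids, sub_indexes, max_dist, enough):
    """Search sub-indexes in the given order, stopping early when possible.

    Returns the matching object IDs and the list of sub-index IDs
    that were actually searched.
    """
    results = []
    searched = []
    for sub_id in ordered_ids:
        searched.append(sub_id)
        for object_id, code in sub_indexes.get(sub_id, []):
            if hamming(query, code) <= max_dist:
                results.append(object_id)
        if len(results) >= enough:
            break  # threshold number of results reached; skip the rest
    return results, searched


sub_indexes = {
    (0, 0): [("a", (0, 0, 0, 0)), ("b", (0, 0, 0, 1))],
    (0, 1): [("c", (0, 0, 1, 1))],
    (1, 0): [("d", (1, 1, 1, 1))],
}
order = [(0, 0), (0, 1), (1, 0)]  # query's own sub-index first, then further ones
results, searched = stepwise_search((0, 0, 0, 0), order, sub_indexes,
                                    max_dist=1, enough=2)
```

Here the first sub-index already yields two matches, so the two further sub-indexes are never visited.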

[0024] In some example embodiments of the fourth aspect, the method includes, concurrent with searching in a sub-index structure that corresponds to the sub-index ID: searching a further sub-index structure for compact feature vectors that are similar to a further query vector for which a further sub-index ID has been determined.

[0025] According to a fifth example aspect, a system for searching for data objects that are similar to a query object is described. The system includes: one or more processing units; and a system storage device coupled to each of the one or more processing units. The system storage device tangibly stores executable instructions that, when executed by the one or more processing units, cause the system to: convert the query object into a d-dimensional feature vector; perform a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; perform a further hashing function on the query vector to determine a sub-index ID for the query vector; and search, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

[0026] According to a sixth example aspect is a computer program product comprising a medium tangibly storing thereon executable instructions that, when executed by a digital processing system, cause the digital processing system to search for data objects that are similar to a query object by: converting the query object into a d-dimensional feature vector; performing a hashing function on the d-dimensional feature vector to generate an m-dimensional binary compact query vector for the query object, where m<d; performing a further hashing function on the query vector to determine a sub-index ID for the query vector; and searching, in a sub-index structure that corresponds to the sub-index ID, for compact feature vectors that are similar to the query vector, the sub-index structure comprising an index of compact feature vectors that each represent a respective data object.

Brief Description of the Drawings

[0027] Examples of embodiments of the invention will now be described in greater detail with reference to the accompanying drawings.

[0028] FIG. 1A is a diagram showing an example of a prior art locality sensitive hashing (LSH) based index and search method.

[0029] FIG. 1B is a diagram showing an example of a prior art partitioning method.

[0030] FIG. 2 is a flow diagram illustrating index generation and similarity search methods according to example embodiments.

[0031] FIG. 3 is a pseudo-code representation of a method for generating hash value functions according to example embodiments.

[0032] FIG. 4 is a pseudo-code representation of a method for generating compact feature vectors based on the functions generated by the method of FIG. 3.

[0033] FIG. 5 illustrates a first layer LSH hash value table according to an example embodiment.

[0034] FIG. 6 shows an index structure generation process of the index generation method of FIG. 2 according to example embodiments.

[0035] FIG. 7 shows an example of a random hash value shuffling process according to example embodiments.

[0036] FIG. 8 shows an example of an LSH index table generation task of the process of FIG. 6.

[0037] FIG. 9 illustrates variable length scaling for different d-nodes in an LSH index table.

[0038] FIG. 10 is an example of a digital processing system that can be used to implement methods and systems described herein.

[0039] FIG. 11A shows an example of an index generation method that includes partitioning according to an example embodiment.

[0040] FIG. 11B shows an example of the partitioning method included in the index generation method of FIG. 11A according to example embodiments.

[0041] FIG. 12 shows a schematic representation of the index generation method of FIG. 11A, including the partitioning method.

[0042] FIG. 13 is a pseudo-code representation of the partitioning method of FIG. 11B for assigning sub-index IDs to compact feature vectors.

[0043] FIG. 14 is a flowchart of a step-wise search using the partitioned index.

[0044] FIG. 15 illustrates a delta-step sub-index calculation.

Detailed Description

[0045] FIG. 2 is a flow diagram illustrating index generation and similarity search methods 202, 204 according to example embodiments. In example embodiments index generation method 202 and similarity search method 204 are performed by software implemented on one or more digital processing systems. In example embodiments, the index generation method 202 and similarity search method 204 enable their host digital processing system(s) to function in a more efficient and accurate manner. For example, the methods and systems described herein may in some applications use fewer processing resources and deliver search results of similar or better accuracy than previously available similarity search methodologies.

[0046] As illustrated in FIG. 2, in example embodiments the index generation method 202 is periodically performed to index unstructured data objects 208 that are stored in an object database 206. For example, index generation method 202 could be performed when a threshold level of changes occurs in the object database 206 through the addition, modification or deletion of objects 208 stored in the object database 206. Additionally, or alternatively, index generation method 202 could be performed based on a predefined schedule, for example hourly or daily or weekly. In example embodiments, similarity search 204 is performed when a query object is received. In some example embodiments, object database 206 may be a distributed database that includes complex data objects 208 stored across multiple digital repositories that are hosted on different real or virtual machines at one or more locations.

[0047] Index generation method 202, which generates an index structure 219 for n objects 208 stored in object database 206, will now be described in greater detail according to example embodiments. Index generation method 202 begins with a feature extraction process 210 during which information is extracted from the unstructured data objects 208 that are included in database 206 to produce a corresponding raw feature vector V_j for each one of the n data objects 208. The unstructured data objects 208 that are included in database 206 may for example be one of video data objects, audio data objects, image data objects, text data objects, and other unstructured data objects. For example, image objects 208 may each be represented by a respective raw feature vector V_j derived from a color histogram of the raw image data, and video objects 208 may each be represented by a respective raw feature vector V_j derived from a scale-invariant feature transform (SIFT) or 3D-SIFT of the raw video data or from discriminate video descriptors (DVD). A number of different feature vector formats are known for representing different classes of data objects, and any of these formats are suitable for feature extraction process 210 to convert data objects 208 into respective raw feature vectors V_1 to V_n. In the example of FIG. 2, the raw feature vectors V_1 to V_n (for a total of n data objects) are stored in a main table 250. In main table 250, each raw feature vector V_1 to V_n is stored as an objectID and a corresponding d-dimensional feature list that includes d normalized feature values fv_1 to fv_d (e.g. V_j = {fv_1, fv_2, ..., fv_d}), where each feature value fv_1 to fv_d is normalized between 0 and 1. The objectID can directly or indirectly point to the storage locations in the object database where the unstructured data objects 208 that the raw feature vectors V_1 to V_n represent are stored.

[0048] A dimensionality reduction process 214 is then performed on each of the raw feature vectors V_1 to V_n to convert the high-dimensional raw feature vectors to respective low-dimensional compact feature vectors K_1 to K_n. Although different reduction algorithms are possible, in at least one example embodiment, dimensionality reduction process 214 applies a locality sensitive hashing (LSH) algorithm that uses orthogonal angle hash functions to convert d-dimensional raw feature vectors V_1 to V_n to respective m-dimensional compact feature vectors K_1 to K_n. In this regard, FIG. 3 shows a pseudo-code representation of an algorithm for generating the orthogonal angle hash functions that are then applied during dimensionality reduction process 214 to convert raw feature vectors to respective compact feature vectors. The algorithm of FIG. 3 may be performed as a configuration step prior to index generation method 202 and the resulting hash functions stored as LSH function tables for future use.

[0049] The algorithm of FIG. 3 is provided with predefined inputs that include: the number (d) of dimensions of the raw feature vector V_j that the hash functions will be applied to (data point dimension = d); the number (m) of hash functions that will be included in each orthogonal angle hash function chain G_i; and the total hash family size F_s (e.g. the total number of hash functions that the m hash functions are chosen from). The output of the algorithm of FIG. 3 is a set of L orthogonal angle hash function chains G_i, where i = 1 to L. Each orthogonal angle hash function chain G_i includes m hash functions h_j (denoted as G_i = (h_1, h_2, ..., h_m), where h_1, h_2, ..., h_m are randomly picked hash functions from the family of F_s hash functions). As represented in FIG. 3, a random L by d matrix H is generated, with the elements x of matrix H sampled independently from the normal distribution. A QR decomposition of matrix H is then performed (where H = QR, and assuming d ≤ F_s) to determine the orthogonal matrix Q. After QR decomposition, each column in the resulting m by L matrix Q provides an orthogonal vector (namely an orthogonal angle hash function chain G_i) of m elements. Accordingly, each column in the matrix Q provides a respective orthogonal angle hash function chain G_i (also referred to as an LSH table) that includes m hash functions h_j, where 1 ≤ j ≤ m (G_i = (h_1, h_2, ..., h_m)). FIG. 3 provides one example of a suitable hash function generation algorithm, and in other example embodiments different known hash generation algorithms could be used in place of the algorithm of FIG. 3 to generate suitable compound LSH function chains for use in the index generation and searching processes described herein.
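The generation of one orthogonal hash function chain can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the algorithm of FIG. 3: it treats each hash function h_j as a d-dimensional direction vector, and it uses Gram-Schmidt orthogonalization as a stand-in for the QR decomposition (both produce an orthonormal set from normally distributed samples). The name generate_hash_chain and the seed parameter are hypothetical.

```python
import random


def generate_hash_chain(d, m, seed=0):
    """Return m mutually orthogonal unit vectors of length d (assumes m <= d)."""
    if m > d:
        raise ValueError("need m <= d to obtain m orthogonal directions")
    rng = random.Random(seed)
    basis = []
    while len(basis) < m:
        # Sample a candidate direction from the standard normal distribution.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the directions already accepted.
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:  # skip a degenerate draw, which is very unlikely
            basis.append([a / norm for a in v])
    return basis
```

Calling generate_hash_chain once per chain, with a different seed each time, would give L independent chains G_1 to G_L under the same assumptions.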

[0050] Once the orthogonal angle hash function chains G_i are generated, the hash functions are available for use in dimensionality reduction process 214 to reduce each d-dimension raw feature vector V_j to a respective m-dimension compact feature vector K_j. In this regard, FIG. 4 shows a pseudo-code representation of an algorithm for generating hash value matrix E of compact feature vectors K_1 to K_n.

[0051] In example embodiments, the feature vector values stored in main table 250 for each of the raw feature vectors V_1 to V_n are already normalized. For each of the feature vector values, the inner product between the hash function and the feature vector value is directly calculated. The result is the cos(hash function, feature vector value), which is called the angular distance. To determine which hyperplane the feature vector value lies in, a sign() operation is applied to the result, providing an output for each hash function on a feature vector value of -1 or 1. To simplify digital storage, a hash value of -1 is treated as a 0. The algorithm shown in FIG. 4 is an example of one suitable hashing algorithm for obtaining compound hash values, and other orthogonal hashing algorithms that reduce d-dimensional vectors to m-sized vectors may be used in other example embodiments.

[0052] Accordingly, dimensionality reduction process 214 applies LSH to reduce each d-length raw feature vector to an m-length binary sequence, as represented by the compact feature value K_j = G_i(V_j) = {h_1(V_j), h_2(V_j), ..., h_m(V_j)}. Each binary value in the binary sequence of the compact feature value K_j is the hash function result of all the feature values fv_1 to fv_d of a feature vector V_j with a respective one of the m hash functions (h_1, h_2, ..., h_m) of hash function chain G_i. For example, the first binary value in compact feature vector K_j is the hash of hash function h_1 with the feature values fv_1 to fv_d of raw feature vector V_j. FIG. 5 shows the resulting compact feature vector set 502, which is shown as a table of hash values in which each row represents a respective compact feature vector K_j. Each compact feature vector has a respective identifier (ID) K_j, where 1 ≤ j ≤ n, and a sequence of m binary values. In FIG. 5, m=32. In example embodiments, the ID K_j is a memory pointer that points to a list of the m binary hash values that make up compact feature vector 216. In example embodiments, each compact feature vector K_j is associated with or includes a pointer (for example objectID) that points to the raw feature vector V_j that the compact feature vector K_j represents.
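The sign-of-inner-product hashing described above can be illustrated with a short sketch. It is not the algorithm of FIG. 4; the function name compact_feature_vector is hypothetical, and mapping an inner product of exactly zero to 1 is an assumption, since the text does not say how sign(0) is handled.

```python
def compact_feature_vector(raw_vector, hash_chain):
    """Reduce a d-length raw feature vector to an m-length list of bits.

    hash_chain holds m direction vectors of length d (one per hash function).
    """
    bits = []
    for h in hash_chain:
        # Inner product between the hash function and the feature vector.
        projection = sum(a * b for a, b in zip(h, raw_vector))
        # sign() gives -1 or 1; -1 is stored as 0 (a zero projection is
        # treated as positive here, which is an assumption).
        bits.append(1 if projection >= 0 else 0)
    return bits


# Toy example with d = 3 and m = 2 (values chosen for illustration only).
example_chain = [[0.6, -0.8, 0.0], [0.8, 0.6, 0.0]]
example_bits = compact_feature_vector([0.2, 0.9, 0.4], example_chain)
```

In the toy example the two projections are -0.6 and 0.7, so the resulting compact feature vector is [0, 1].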

[0053] Referring again to FIG. 2, after the compact feature vector set 502 is generated, a corresponding index structure 219 is then generated by random draw forest (RDF) index structure generation process 218. In this regard, FIG. 6 illustrates steps that are performed during the RDF index structure generation process 218 according to example embodiments.

[0054] For ease of reference, Table 1 below provides a summary of parameters relevant to RDF index structure generation process 218.

Table 1 :

[0055] As indicated in step 602, random shuffling permutations SP(1) to SP(n_s) are applied to the compact feature vector set 502 to generate n_s twisted compact feature vector sets THV Set(1) to THV Set(n_s). An example of step 602 is illustrated in FIG. 7. Shuffling permutations SP(1) to SP(n_s) are randomly generated, and then applied to randomly shuffle the column positions of the hash values in the compact feature vector set 502 to different column positions in respective twisted compact feature vector sets THV Set(1) to THV Set(n_s). As noted above, each compact feature vector K_j includes m binary values. In one example embodiment, a first subset of s bits of each compact feature vector K_j of the compact feature set 502 is used as a Segment ID, and only (m-s) bits of each compact feature vector K_j are shuffled during step 602. Accordingly, in example embodiments, each shuffling permutation SP(1) to SP(n_s) specifies a random re-shuffling order for the (m-s) shuffled bits of the compact feature vectors. By way of example, in FIG. 7 each of the positions in the shuffling permutation SP(1) to SP(n_s) corresponds to a bit position column in the corresponding twisted compact feature vector sets THV Set(1) to THV Set(n_s), and the value c in the position refers to a bit position column c+s of the compact feature set 502 to use as the source binary value to fill the column in the twisted compact feature vector set THV Set(i).

[0056] For example, in FIG. 7, m=32 and s=4. The first value in the first position of shuffling permutation SP(1) is 15, meaning that the 19th (15+s) hash value bit for compact feature vector K_1 in compact feature set 502 (which is a “1”) is to be relocated to the first shuffled hash value bit position for compact feature vector K_1 in THV Set(1), as indicated by line 702. Accordingly, random shuffling permutation step 602 generates n_s twisted hash value versions of the compact feature vectors K_1 to K_n. In each twisted hash value version, the hash value bit order is randomly shuffled with respect to the order of the compact feature set 502; however, within each THV Set the random shuffling order is the same for all of the compact feature vectors K_1 to K_n such that column-wise similarities are maintained throughout the shuffling process. By generating n_s twisted versions of the compact feature vector set 502 the MSB problem noted above can be mitigated as there is no longer any bias to any particular hash value bit order grouping. As shown in the THV sets of FIG. 7, in example embodiments, the s bits of the SegmentID are pre-pended to the front of the (m-s) shuffled bits of each of the compact feature vectors K_j within each of the THV Sets. Using the first s bits of the compact feature vectors K_j as a SegmentID supports parallelism for the indexing described below - in particular, the number of possible segment IDs is 2^s.
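The shuffling step can be sketched as below. The sketch assumes, following the FIG. 7 example, that each permutation value c is a 1-based position among the (m-s) shuffled bits, so that the source bit is the (c+s)-th bit of the compact feature vector. The names make_permutation and twist are hypothetical.

```python
import random


def make_permutation(m, s, rng):
    """Random ordering of the 1-based positions 1..(m - s)."""
    perm = list(range(1, m - s + 1))
    rng.shuffle(perm)
    return perm


def twist(bits, perm, s):
    """Keep the first s bits as the segment ID and shuffle the remaining bits.

    bits is a list of 0/1 values; perm[c] names the source position (1-based,
    counted after the segment ID) that fills shuffled position c.
    """
    segment = bits[:s]
    shuffled = [bits[s + p - 1] for p in perm]
    return segment + shuffled


# The same permutation is applied to every vector of a set, so column-wise
# similarities between vectors are preserved.
example_twisted = twist([1, 0, 0, 1, 1, 0, 0, 1], [3, 1, 6, 2, 5, 4], s=2)
```

With m=32 and s=4, a permutation whose first value is 15 copies bits[18], the 19th bit, into the first shuffled position, which matches the example in the text.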

[0057] Referring again to FIG. 6, the next task (604) in RDF index structure generation process 218 is to generate a respective LSH index table T(1) to T(n_s) for each of the twisted compact feature vector sets THV Set(1) to THV Set(n_s). LSH Index Table Generation Task 604, which is shown as steps 610 to 622 in FIG. 6, is repeated for each of the twisted compact feature vector sets THV Set(1) to THV Set(n_s), resulting in n_s LSH index tables.

[0058] LSH Index Table Generation Task 604 will now be described in the context of a twisted compact feature vector set THV Set(y) (where 1 ≤ y ≤ n_s) and in conjunction with FIG. 8, which graphically illustrates the steps of LSH Index Table Generation Task 604 being performed in respect of compact feature vector set THV Set(y) to generate a corresponding LSH index table T(y). FIG. 8 illustrates intermediate stages 801A, 801B, 801C and 801D of the LSH index table T(y) as it is being generated. Table 802 is a decimal representation of the compact feature vector set THV Set(y) that is indexed in LSH index table T(y). In particular, in table 802, the column “SEG” is the decimal value of the first 4 bits (e.g. Segment ID) of the respective twisted compact feature vector K_j, the column “level 1” is the decimal value of the next 7 bits (e.g. the first 7 shuffled bits), the column “level 2” is the decimal value of the next 7 bits, the column “level 3” is the decimal value of the next 7 bits, and the column “level 4” is the decimal value of the next 7 bits. Thus, in the example of FIG. 8 where m=32, s=4 and the number of shuffled bits per twisted compact feature vector K_j is m-s=28, the number of 7-bit levels is 4. In the example of FIG. 8, the Segment ID bits are “1001”, providing a decimal Segment ID=9.

[0059] As shown in FIG. 8, LSH index table T(y) is an index tree structure that comprises two types of nodes, denoted as k-nodes and d-nodes. LSH index table T(y) as shown at the bottom of FIG. 8 includes two levels of d-nodes (a first level or root d-node (d-node(1)) and a second level d-node (d-node(2))), and five k-nodes (k-node(1) to k-node(5)). Each k-node(1) to k-node(5) corresponds to a respective compact feature vector K_1 to K_5 of the compact feature vector set THV Set(y). In example embodiments, each LSH index table T(y) includes n k-nodes, where n is the number of compact feature vectors K_j.

[0060] Each d-node(i) is an integer array of l_i slots (denoted as Slot() in the Figures, and numbered as Slot(0) to Slot(127) in FIG. 8 in which l=128), where l_i is less than or equal to a predefined slot maximum l. The number of slots l_i per d-node level is mutable. Each d-node Slot() corresponds to a bucket of compact feature vectors K_j that have been identified as meeting a similarity threshold with respect to each other. Each k-node contains two fields, namely KEY 804 and POINT 806. KEY 804 is an objectID that points to the raw feature vector (for example K_1 points to V_1), and POINT 806 stores the offset, if any, of the next k-node in the same Slot. A d-node Slot is used to store either a pointer to the first k-node associated with the Slot (provided that the number of k-nodes associated with the Slot does not exceed threshold Th), or a further d-node level (if the number of k-nodes associated with the Slot does exceed the threshold Th).

[0061] As indicated in step 610 of FIG. 6, LSH index table generation task 604 commences with the initialization of an l-long d-node as a first level or root d-node(1). As noted above, to support parallelism, the first s bits of each compact feature vector K_j are treated as a SegmentID, which allows 2^s segments. This is a sufficient number to maximize parallelism for each twisted compact feature vector set THV Set(y). In example embodiments, the number of hash value bits in each twisted compact feature vector K_j used to classify or locate the corresponding data object into a respective d-node slot is determined as log2(l) and the maximum number of d-node levels is (m-s)/log2(l). As will be described below, task 604 classifies twisted compact feature vectors K_j into respective d-node slots based on the similarities between log2(l) length groupings of successive twisted hash bits. In this regard, the log2(l) bit set acts as a similarity threshold.
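The quantities named in this paragraph follow directly from m, s and l, as the short sketch below illustrates. It assumes l is a power of two so that log2(l) is an integer; the function name index_geometry is hypothetical.

```python
def index_geometry(m, s, l):
    """Derive the segment count, bits per d-node level and maximum depth."""
    bits_per_level = l.bit_length() - 1  # equals log2(l) for a power of two
    if 1 << bits_per_level != l:
        raise ValueError("l is assumed to be a power of two")
    return {
        "segments": 2 ** s,                        # possible SegmentID values
        "bits_per_level": bits_per_level,          # log2(l)
        "max_levels": (m - s) // bits_per_level,   # (m - s) / log2(l)
    }


# Values of the FIG. 8 example: m=32, s=4, l=128.
fig8_geometry = index_geometry(32, 4, 128)
```

For the FIG. 8 values this gives 16 segments, 7 bits per level and a maximum of 4 d-node levels.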

[0062] In example embodiments, the threshold Th represents the number of data objects that can be classified into a single Slot without further sub-classification. When the threshold Th is exceeded, further classification or sorting is required, which is accomplished by adding a further d-node level, and the twisted compact feature vectors can then be further classified based on a further set of log2(l) bits. Thus, progressively more bits from the hash value of a compact feature vector can be used to provide more d-node indexing levels. When there are more than Th k-nodes under the same Slot, they are redistributed to the next d-node level of the hash tree structure of LSH index table T(y).

[0063] In the example represented in FIG. 8, l=128; Th=3; s=4; m=32; m-s=28; log2(l)=7; the 28 values of shuffling permutation SP(y) are {15, 7, 3, 4, 21, 6, 20, 14, 16, 26, 19, 28, 25, 18, 24, 13, 22, 9, 17, 27, 5, 2, 1, 11, 8, 10, 23, 12}; and the resulting 32 bit binary sequence of the first twisted compact feature vector in THV Set(y) is:

Twisted compact feature vector K_1 = 10010011010000100011011010000101

(including the 4 bit segmentID followed by 28 shuffled bits). (Note that the examples of K_j in FIG. 8 are not the same binary sequences as the examples shown in FIGs. 5 and 7).

[0064] Accordingly, in step 610, the first level or root d-node(1) is initialized to have a length of l=128 slots (as shown in intermediate stage 801A of FIG. 8). As indicated in step 612 in FIG. 6, the next available twisted compact feature vector K_j is obtained from the twisted compact feature vector set THV Set(y). The first time step 612 is performed for a twisted compact feature vector set, the next available twisted compact feature vector will be the first compact feature vector in THV Set(y), namely K_1. It will be appreciated that steps 602 and 612 can be combined and the twisted hash values for a particular compact feature vector K_j could be determined as part of step 612, rather than pre-calculated in step 602.

[0065] As indicated in step 613, a respective k-node(j) is initialized for the compact feature vector K_j. As noted above the k-node(j) includes two fields, namely KEY 804 and POINT 806. Accordingly, in the example of twisted compact vector K_1, the KEY 804 field of k-node(1) is set to point to the respective raw feature vector V_1. In the case when a new k-node is initialized, its POINT 806 field is initially set to null.

[0066] As indicated in step 614, a segmentID and SlotID are then extracted from the twisted compact feature vector K_j. In the present example of twisted compact feature vector K_1, the first four bits provide SegmentID=(1001)b=9. The next log2(l)=7 bits of K_1 are (0011010)b=26, providing a level 1 d-node(1) SlotID of 26.
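The extraction of the SegmentID and the per-level SlotIDs can be sketched as follows, using the 32-bit sequence given above for K_1 in paragraph [0063]. The sketch assumes a fixed l (a power of two) for every level and a bit string representation; the function name segment_and_slots is hypothetical.

```python
def segment_and_slots(bits, s, l):
    """Split a twisted bit string into a SegmentID and one SlotID per level."""
    width = l.bit_length() - 1  # log2(l) bits select a slot at each level
    if 1 << width != l:
        raise ValueError("l is assumed to be a power of two")
    segment_id = int(bits[:s], 2)
    payload = bits[s:]
    slot_ids = [
        int(payload[i:i + width], 2)
        for i in range(0, len(payload) - width + 1, width)
    ]
    return segment_id, slot_ids


# Twisted compact feature vector K_1 from the FIG. 8 example (s=4, l=128).
K1_BITS = "10010011010000100011011010000101"
k1_segment, k1_slots = segment_and_slots(K1_BITS, s=4, l=128)
```

For this bit sequence the SegmentID is 9 and the four 7-bit groups decode to 26, 8, 109 and 5; only the slot for the current d-node level is consulted at each step of the insertion.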

[0067] As indicated at step 616, a determination is made as to whether the identified d-node Slot(SlotID) is empty. If the Slot has not been occupied, as indicated in step 618 and illustrated by stage 801A in FIG. 8, the value in the corresponding Slot (e.g. Slot(26) of root d-node(1)) is updated to point to an address of the respective k-node location (e.g. k-node(1)) in system storage, such as system storage device 1408 described below (as noted above, the k-node(j) itself points to the address of the corresponding raw feature vector V_j).

[0068] After updating the respective d-node Slot, as indicated in step 619, a determination is made if all n of the compact feature vectors in the twisted compact feature vector set THV Set(y) have been classified into the LSH index table T(y). If so, the LSH index table T(y) is complete and task 604 can be terminated for the THV Set(y). If not, task 604 repeats. As indicated in step 612, the next compact feature vector K_j is retrieved from the THV Set(y). In the example of FIG. 8, the next compact feature vector is K_2. As illustrated in stage 801B in FIG. 8 and steps 613 and 614 of FIG. 6, a second k-node(2) is initialized for the compact feature vector K_2, and the segmentID and level 1 SlotID are extracted (as shown in table 802, in the present example the K_2 segmentID = 9 and level 1 SlotID = 26, the same as K_1). In the case of compact feature vector K_2, in step 616 a determination is made that the d-node Slot(SlotID) (e.g. Slot(26)) is occupied. Accordingly, as indicated at step 620, a determination is then made as to whether the number of k-nodes that are allocated to the Slot(SlotID) without an intervening d-node layer exceeds the threshold Th. If the number of k-nodes under the d-node Slot(SlotID) is equal to or less than Th, then the new k-node can be included under this Slot in the hash tree of the LSH index table T(y). In particular, as indicated at step 622, the value in the Slot(SlotID) is set to point to the current k-node(j), and the POINT field of the current k-node(j) is set to point to the address of the k-node that was previously referenced by the Slot(SlotID).

[0069] In FIG. 8, an example of step 622 is represented in stage 801B, which shows the value of Slot(26) being updated to point to k-node(2). In turn, the POINT 806 field of k-node(2) is set to point to k-node(1) (which was previously identified in Slot(26)).

[0070] In the example of FIG. 8, the k-node(3) that is created for twisted compact feature vector K_3 also has segmentID = 9 and level 1 SlotID = 26. As illustrated in stage 801C of FIG. 8, when twisted compact feature vector K_3 is processed, k-node(3) is initialized with its KEY 804 field pointing to the objectID of the raw feature vector V_3 (as per step 613) and, as per step 622, the value in d-node(1) Slot(26) is updated to point to k-node(3), and the POINT 806 field of k-node(3) is set to point to k-node(2).

[0071] In the example of FIG. 8, the k-node(4) that is created for twisted compact feature vector K_4 has segmentID = 9, and level 1 SlotID = 1 (different than that of K_1 to K_3). Accordingly, as illustrated in stage 801D of FIG. 8, in step 616 a determination is made that Slot(1) is empty, and in step 618 the value in d-node(1) Slot(1) is updated to point to k-node(4).

[0072] In the example of FIG. 8, the k-node(5) that is created for twisted compact feature vector K_5 also has segmentID = 9, and level 1 d-node SlotID = 26 (again, the same as that of K_1 to K_3). In this case, in step 620, a determination is made that the number of k-nodes under the level 1 d-node Slot(26) exceeds the threshold Th. As indicated in step 624 and illustrated in the final version of LSH index table T(y) at the bottom of FIG. 8, the insertion of k-node(5) into the LSH index table requires that an additional d-node level (e.g. 2nd level d-node(2)) be generated and the k-nodes under the upper level d-node Slot be redistributed among the Slots of the lower level d-node. As noted above, the use of multiple d-node levels effectively allows objects that are similar enough to be classed into a single d-node level Slot, as determined by a matching group of twisted hash value bit values, to be further sorted into different sub-buckets.

[0073] In the example of k-node(5) in FIG. 8, step 624 is carried out by initializing second level d-node(2) to have a length of l=128 Slots. The value of first level d-node(1) Slot(26) is set to point to the system storage address of d-node(2) (rather than directly to a k-node). The assignment of k-nodes (1), (2), (3) and (5) to the Slots of second level d-node(2) is similar to that described above in respect of the first level; however, a different group of twisted hash bits from the twisted compact feature vectors is used to determine the second level SlotID than the first level SlotID. In particular, the next log2(l) group of hashed bits in each of the twisted compact feature vectors K_1, K_2, K_3 and K_5 is used. Thus, in the example of K_1 = 10010011010000100011011010000101, the first four bits provide SegmentID=(1001)b=9, the next log2(l)=7 bits (0011010)b=26 provide a level 1 d-node(1) SlotID of 26, and the next log2(l)=7 bits (0001000)b=8 provide a level 2 d-node(2) SlotID of 8. In the example of FIG. 8, k-nodes (1), (2) and (3) all have the same second level SlotID of 9 (as illustrated in table 802), and accordingly are all assigned to second level d-node(2) Slot(9). In particular, d-node(2) Slot(9) points to k-node(3), which in turn points to k-node(2), which in turn points to k-node(1). However, hashed bits 12 to 18 of K_5 identify a 2nd layer d-node SlotID of 4, and accordingly, k-node(5) is assigned to 2nd layer d-node Slot(4).

[0074] The steps 610 to 622 of LSH Index Table Generation Task 604 are repeated until all of the compact feature vectors K_1 to K_n in a twisted compact vector set THV Set(y) are indexed into a respective LSH index table T(y). As represented by the 4 columns level 1 to level 4 in table 802, in the example of FIG. 8 the maximum level (Dmax) of d-nodes is 4. In some example embodiments, when the maximum level (Dmax) of d-nodes for a Slot in a LSH index table T is reached, the threshold Th is ignored and the length of k-node chains in the Dmax d-node level is unlimited.

[0075] LSH Index Table Generation Task 604 is repeated for all of the n_s Twisted Compact Vector Sets THV Set(1) to THV Set(n_s) to generate n_s respective LSH index tables T(1) to T(n_s), which are collectively stored in system storage as index structure 219.

[0076] In example embodiments, the index generation method 202 described above can be summarized by the following general steps that follow feature extraction process 210. Step 1: Calculate the LSH hash value of an input raw feature vector V_j to produce a corresponding compact feature vector K_j. The first s bits of the compact feature vector K_j are used as a SegmentID. Then, the next log2(l) bits of the compact feature vector K_j following the SegmentID, as shuffled by a random shuffling permutation, are used to generate an integer in the range 0 to l-1 as the slotID for a slot of the first level (e.g. d-node(1)) of an index table (e.g. LSH index table T(y)). Step 2: If the slot is not occupied, it is updated to point to the address of raw feature vector V_j. Step 3: If the slot has been occupied, and the number of objects under this slot is equal to or less than Th, then a k-node is added under the slot. If the number of objects under this slot is larger than Th, then a new d-node level is added under the slot, followed by Step 4: The next log2(l) items from the shuffling permutation are used to provide the corresponding log2(l) bits of a compact feature vector K_j as the slotID in the new d-node, and the k-nodes are redistributed in this new d-node.
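The general steps above can be sketched as a small in-memory hash tree. This is an illustrative sketch only: the class and function names are hypothetical, Python object references stand in for storage addresses and offsets, each k-node keeps a copy of its twisted bits so that it can be re-slotted (an assumption), and the threshold test is interpreted as "the slot would hold more than Th k-nodes once the new one is added", which reproduces the FIG. 8 behaviour with Th=3.

```python
class KNode:
    """Leaf entry: KEY identifies the object, POINT links to the next k-node."""
    def __init__(self, key, bits):
        self.key = key
        self.bits = bits    # twisted bit string, kept for redistribution
        self.point = None


class DNode:
    """Array of l slots; a slot is empty, a k-node chain head or a child d-node."""
    def __init__(self, l):
        self.slots = [None] * l


def chain_length(head):
    count = 0
    while head is not None:
        count, head = count + 1, head.point
    return count


def insert(node, knode, s, th, level=0):
    width = len(node.slots).bit_length() - 1    # log2(l) bits per level
    d_max = (len(knode.bits) - s) // width      # maximum number of levels
    while True:
        start = s + level * width
        slot_id = int(knode.bits[start:start + width], 2)
        entry = node.slots[slot_id]
        if isinstance(entry, DNode):            # descend to the next level
            node, level = entry, level + 1
            continue
        if chain_length(entry) < th or level + 1 >= d_max:
            knode.point = entry                 # new k-node becomes chain head
            node.slots[slot_id] = knode
            return
        child = DNode(len(node.slots))          # threshold exceeded: add a level
        node.slots[slot_id] = child
        old = []
        while entry is not None:
            old.append(entry)
            entry = entry.point
        for k in reversed(old):                 # re-insert oldest first
            insert(child, k, s, th, level + 1)
        node, level = child, level + 1


def make_bits(segment_id, slot_ids, s=4, width=7):
    return format(segment_id, f"0{s}b") + "".join(
        format(x, f"0{width}b") for x in slot_ids)


def build_example():
    """Five vectors with hypothetical slot values loosely following FIG. 8."""
    root = DNode(128)
    rows = {"K1": [26, 9, 0, 0], "K2": [26, 9, 1, 0], "K3": [26, 9, 2, 0],
            "K4": [1, 0, 0, 0], "K5": [26, 4, 0, 0]}
    for key, slots in rows.items():
        insert(root, KNode(key, make_bits(9, slots)), s=4, th=3)
    return root
```

After build_example(), Slot(26) of the root holds a second level d-node whose Slot(9) chains K3, K2 and K1 and whose Slot(4) holds K5, while Slot(1) of the root holds K4.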

[0077] In example embodiments, the number of slots l_i can be set at a different value for each d-node level in LSH index table T(y), as illustrated in FIG. 9. The variable l_i controls the number of bits used to locate the objects in different d-node levels of the hash tree defined by LSH index table T(y). For instance, in one example l=32, log2(l)=5, and 5 bits of the compact feature vector are used to determine the slots for all d-node levels. By this design, each d-node level is treated with the same degree of resolution. Alternatively, different resolutions can be used for different levels. For example, for first level d-node(1), a shorter l_1 could be used, which enables datasets with small numbers of similar objects to gain enough efficient candidates. In lower levels, the number of bits can be gradually increased, with l_1 ≤ l_2 ≤ l_3. The only condition for objects going deeper is the number of the “similar” objects under the same slot being equal to or larger than Th. Therefore, for the second level, the resolution should be increased to make these “similar” objects be divided into different “similar” groups with higher similarities.
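Variable-length levels change only how many bits are consumed at each level. The sketch below assumes each l_i is a power of two and that the levels consume successive bits after the SegmentID; the function name slot_ids_per_level and the slot counts 32, 64 and 128 are hypothetical values chosen for illustration.

```python
def slot_ids_per_level(bits, s, slot_counts):
    """Return one SlotID per level, using log2(l_i) bits at level i."""
    slot_ids = []
    pos = s  # skip the SegmentID bits
    for l_i in slot_counts:
        width = l_i.bit_length() - 1
        if 1 << width != l_i:
            raise ValueError("slot counts are assumed to be powers of two")
        if pos + width > len(bits):
            raise ValueError("not enough hash bits for this many levels")
        slot_ids.append(int(bits[pos:pos + width], 2))
        pos += width
    return slot_ids


# Coarse first level (5 bits), finer second (6 bits) and third (7 bits) levels,
# applied to the 32-bit sequence given for K_1 in the FIG. 8 example.
variable_slots = slot_ids_per_level(
    "10010011010000100011011010000101", 4, [32, 64, 128])
```

With these slot counts the three levels use 5, 6 and 7 bits and select slots 6, 33 and 13, compared with slots 26 and 8 for the first two levels when every level uses 7 bits.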

[0078] Thus, in example embodiments, index structure generation process 218 implements a random draw that produces random draw forest (RDF) index structure 219 in which each LSH index table T(y) represents a respective tree in the RDF index structure 219. The random draw performed during index structure generation process 218 is a function of the randomly generated shuffling permutations (sp).

[0079] Referring again to FIG. 2, similarity search method 204 will now be described. A query object 220 is received. In an example embodiment, the query object 220 is unstructured object data such as an image file, a video sample, an audio sample, or a text string. As indicated in feature extraction process 222, query object 220 is converted to a raw query feature vector Qv in the same manner that data objects 208 were converted to raw feature vectors in feature extraction process 210. The resulting raw query feature vector Qv is then converted at dimensionality reduction process 226 to an m-length binary sequence compact query vector Qk using the same process and previously generated hash functions as described above in respect of dimensionality reduction process 214.

[0080] The compact query vector Qk is then processed in combination with the index structure 219 for search process 230. In an example embodiment, n_s shuffled versions Qks(1) to Qks(n_s) of the compact query vector Qk are generated by applying each of the above mentioned shuffling permutations SP(1) to SP(n_s) to the compact query vector Qk. Each of these n_s shuffled versions Qks(1) to Qks(n_s) is used to search a respective LSH index table T(1) to T(n_s). For example, compact query vector Qks(y), which has been shuffled according to shuffling permutation SP(y), is used to search corresponding LSH index table T(y). In particular, the first group of log2(l_1) bits of compact query vector Qks(y) (excluding the s bits used for SegmentID) are used to determine a SlotID for the root (e.g. first level) d-node(1) of LSH index table T(y). If the matching slot of the first level d-node(1) points to a k-node, then all of the data objects 208 that are addressed in the k-nodes under the slot are returned as candidate result objects 232. In the event that the matching slot of the first level d-node(1) points to a second level d-node, then the next group of log2(l_2) bits of compact query vector Qks(y) are used to determine a SlotID for the second level d-node(2) of LSH index table T(y), and any data objects 208 that are addressed in the k-nodes directly under the matching d-node(2) slot without an intervening d-node are returned as candidate result objects 232. In the event that the matching d-node(2) slot points to a further, third level d-node(3), the process of determining additional lower level slotIDs from successive bits of the compact query vector Qks(y) is repeated until all k-nodes under any matching slots are processed and all candidate result objects 232 returned.
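The search walk can be sketched with a deliberately light data layout, which is an assumption made for brevity: a d-node is a Python list of slots, and a k-node chain is a nested tuple (key, next). The function names search_table and search_forest are hypothetical, and permutation values are taken to be 1-based positions after the SegmentID, as in the FIG. 7 example.

```python
def search_table(root, query_bits, s, widths):
    """Follow the query bits through one index table and return candidate keys.

    widths[i] is the number of bits, log2(l_i), used at d-node level i + 1.
    """
    node, pos = root, s
    for width in widths:
        slot_id = int(query_bits[pos:pos + width], 2)
        entry = node[slot_id]
        pos += width
        if isinstance(entry, list):     # slot points to a deeper d-node
            node = entry
            continue
        keys = []                       # slot is empty or heads a k-node chain
        while entry is not None:
            key, entry = entry
            keys.append(key)
        return keys
    return []


def search_forest(tables, permutations, query_bits, s, widths):
    """Shuffle the query once per table and merge the candidates."""
    candidates = []
    for root, perm in zip(tables, permutations):
        twisted = query_bits[:s] + "".join(query_bits[s + p - 1] for p in perm)
        for key in search_table(root, twisted, s, widths):
            if key not in candidates:
                candidates.append(key)
    return candidates


# Toy table with s=1 and two 2-bit levels (l_1 = l_2 = 4).
toy_child = [None, ("C", None), None, None]
toy_root = [None, None, ("A", ("B", None)), toy_child]
toy_hits = search_forest([toy_root, toy_root],
                         [[1, 2, 3, 4], [1, 3, 2, 4]], "01101", 1, [2, 2])
```

In the toy example the first permutation leads through the second level d-node to object C, and the second permutation lands on the chain holding A and B, so the merged candidates are C, A and B.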

[0081] Accordingly, at the completion of search process 230, the candidate results 232 include data objects 208 that correspond to each of the shuffled query vectors Qks(1) to Qks(n_s) as identified in the respective LSH index tables T(1) to T(n_s). As indicated by items 232 to 240 in FIG. 2, the candidate results 232 can then be filtered using a filtering process 234 to produce filtered results 236 that can be ranked using a ranking process 238 to produce a ranked list of objects as the final results 240. The methodologies applied in filtering process 234 and ranking process 238 may for example be similar to those used in existing similarity searching processes.
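The text leaves the filtering and ranking methodologies open. One common choice, assumed here purely for illustration, is to drop duplicate candidates, discard those whose compact vectors are too far from the compact query vector in Hamming distance, and rank the remainder by that distance. The function names and the max_distance and top_k parameters are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def filter_and_rank(candidates, query_bits, compact_vectors, max_distance, top_k):
    """Deduplicate, filter by Hamming distance, then rank closest first."""
    unique = dict.fromkeys(candidates)  # removes duplicates, keeps first-seen order
    scored = [(hamming(query_bits, compact_vectors[key]), key) for key in unique]
    kept = [(dist, key) for dist, key in scored if dist <= max_distance]
    kept.sort()                         # ties are broken by key
    return [key for _, key in kept[:top_k]]


# Toy data: object IDs mapped to 4-bit compact vectors (illustrative values).
toy_vectors = {"A": "1100", "B": "1111", "C": "0000"}
toy_ranked = filter_and_rank(["A", "B", "A", "C"], "1101", toy_vectors,
                             max_distance=2, top_k=2)
```

Here A and B are each one bit away from the query and are kept, while C is three bits away and is filtered out.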

[0082] As described above, the index generation method 202 and similarity search method 204 use a random draw forest (RDF) index structure that overcomes the MSB problem. Using the RDF index structure 219 described above for similarity searching may in at least some applications result in faster and more accurate similarity searches than prior methods. By improving the quality of the candidates included in the candidate results, the index structure 219, when used in a similarity search, may in at least some applications achieve better approximate nearest neighbor performance (accuracy and quality of results) and better time performance than at least some prior methods.

[0083] In example embodiments, the index generation and similarity searching methods based on RDF (random draw forest) described above include: Step 1: Based on the input raw feature vectors, by using locality sensitive hashing, produce hash values; Step 2: Based on the hash values, by using random draw, produce the twisted hash values; Step 3: Based on the twisted hash values, by following the adaptive hash tree building steps, produce the random draw forest (multiple hash trees); Step 4: Based on the query’s raw feature, by using locality sensitive hashing, produce the query’s hash value; and Step 5: Combine the query’s hash value and the random draw forest as input information and, by following the similarity search strategy, produce the query’s similar objects from the dataset.

[0084] As noted above, in example embodiments index generation method 202 and similarity search method 204 are performed by software (that may include one or more software modules) that is implemented on one or more digital processing systems. In some examples, instances of index generation method 202 or similarity search method 204 may be implemented on one or more digital processing systems that are implemented as virtual machines using one or more physical computing systems.

[0085] FIG. 10 illustrates an example of a digital processing system 1410 that could be used to implement one or both of index generation method 202 and similarity search method 204. As shown in FIG. 10, the system 1410 includes at least one processing unit 1400. The processing unit 1400 implements various processing operations of the system 1410. For example, the processing unit 1400 could perform data processing, power control, input/output processing, or any other functionality enabling the system 1410 to operate. The processing unit 1400 may also be configured to implement some or all of the functionality and/or embodiments described in more detail above. Each processing unit 1400 includes any suitable processing or computing device configured to perform one or more operations. Each processing unit 1400 could, for example, include a microprocessor, microcontroller, digital signal processor, field programmable gate array, or application specific integrated circuit, and combinations thereof.

[0086] The system 1410 further includes one or more input/output devices 1406 or interfaces (such as a wired or wireless interface to the internet or other network). The input/output devices 1406 permit interaction with a user or other devices in a network. Each input/output device 1406 includes any suitable structure for providing information to or receiving information from a user, such as a speaker, microphone, keypad, keyboard, display, or touch screen, including network interface communications for receiving query objects and communicating search results.

[0087] In addition, the system 1410 includes at least one system storage device 1408. The system storage device 1408 stores instructions and data used, generated, or collected by the system 1410. For example, the system storage device 1408 could store software instructions or modules configured to implement some or all of the functionality and/or embodiments described above and that are executed by the processing unit(s) 1400. System storage device(s) 1408 can also include storage for one or more object databases 206, main tables 250, compact feature vector sets 502 and index structures 219. System storage device(s) 1408 can include any suitable volatile and/or non-volatile storage and retrieval device(s). Any suitable type of memory may be used, such as random access memory (RAM), read only memory (ROM), hard disk, solid state disc, optical disc, subscriber identity module (SIM) card, memory stick, secure digital (SD) memory card, and the like.

[0088] In the examples described above, index generation method 202 generates an RDF index structure 219 for the compact feature vector set 502 that represents n objects 208 stored in object database 206. In the above example, the compact feature vector set 502 is treated as a single partition group and indexed using a single RDF index structure 219. However, in some examples, the volume of data objects that need to be indexed is so large that representing the corresponding compact feature vector set in a single index structure can lead to system latency and inefficiency, especially in the context of concurrent search query processing. As noted in the background above, partitioning can be used to break groups of data objects into smaller groups of similar data objects for indexing and searching purposes.

[0089] As also noted above, in addition to the MSB problem that can be created when indexing compact feature vectors, errors can also be introduced through sub-index partitioning issues. Partitioning can be an important part of hash based index generation methods and, as mentioned in the background, existing partition methods use a fixed number of head bits to divide the hash values (e.g. put hash values into different partitions). These existing methods might place very similar feature vectors into different partitions, or put extremely different hash values into the same partition, simply because they rely on a limited number of bits. Dividing the hash values into the wrong sub-indexes (e.g. partitions) affects the accuracy and consistency of similarity searching. The following is a description of an improved partitioning method to mitigate problems with conventional partitioning methods. In the presently described embodiment, a partitioning method is used to generate partition groups that are each then respectively indexed using the RDF index structure generation process 218 described above. However, the partitioning method described herein is not limited to being used in combination with the RDF index structure generation process but rather, in other example embodiments, may be used to produce partition groups that can be respectively indexed using known or suitable indexing methods.

[0090] The partitioning method described herein uses multiple layers of LSH which use orthogonal angle hash functions, and can be used in conjunction with the index generation and search methods described above in respect of FIGs. 2 to 9. In example embodiments that will now be described, during the index generation method, compact feature vector set 502 is divided into multiple partition groups before being indexed. A corresponding sub-index structure is then created for each partition group. In this regard, FIG. 11A shows an alternative example of index generation method 202A that is similar to index generation method 202 discussed above except that the index generation method 202A includes an additional procedure (process 1100 in FIG. 11A) of partitioning the compact feature vector set 502 into a total of 2^M partition groups 1 to 2^M. The partition groups 1 to 2^M are then each subjected to a respective RDF index structure generation process 218(1) to 218(2^M) to generate respective sub-index structures 219(1) to 219(2^M).

[0091] As will be explained in greater detail below, the partition method uses a distributed layered LSH method that enables the indexing and search methods to be parallelized. It is a content-based partition strategy that enables each search query to be mapped to only one partition group. The orthogonal hash family is used to partition objects (as represented by compact feature vectors) more accurately. A step-wise search, described below, provides a more accurate way to search over the sub-indexes that correspond to the respective partition groups.

[0092] Index generation method 202A will now be explained in greater detail with reference to FIG. 11A, which provides an overview of the entire index generation method 202A, and FIG. 11B, which shows the partitioning process 1100 in greater detail. Reference will also be made to FIG. 12, which schematically illustrates parts of the index generation method 202A for the specific example of m=6 when the number of sub-index partition groups is 4 (i.e. 2^M=4, M=2).

[0093] As indicated in FIG. 11A, index generation method 202A includes preliminary operations that are the same as those of index generation method 202 described above, namely feature extraction process 210 and dimensionality reduction process 214. In particular, feature extraction process 210 processes n unstructured data objects 208 to generate n corresponding representative d-dimensional raw feature vectors V_1 to V_n that are stored, for example, in a main table 250 that includes the raw feature vectors V_1 to V_n with pointers (for example an object ID) to their respective unstructured data objects 208.

[0094] Dimensionality reduction process 214 applies a first layer LSH to process the n d-dimensional raw feature vectors V_1 to V_n and generate n corresponding m-dimensional compact feature vectors K_1 to K_n, that are stored, for example, as a compact feature vector set 502 that includes the compact feature vectors K_1 to K_n with pointers (for example an object ID) to one or both of their respective raw feature vectors V_1 to V_n and unstructured data objects 208.

[0095] In example embodiments, the LSH based dimensionality reduction process 214 of index generation method 202A uses the orthogonal angle hash functions h described above in respect of the index generation method 202, which have better performance than original angle hash functions. As described above, using the generated orthogonal hash functions, the hash values that form the compact feature vectors K_1 to K_n are generated for each raw feature vector V_1 to V_n associated with an object. Each compact feature vector K_j is an m long sequence of 0’s and 1’s. By way of example, the illustrated dimensionality reduction process 214 of FIG. 12, where m=6, demonstrates the hashing of raw feature vector V_1 = {fv_1, fv_2, ..., fv_d} with the m-length hash function chain G_1 = {h_1, h_2, ..., h_6} to generate the m-length binary sequence compact feature vector K_1 = G_1(V_1) = {h_1(V_1), h_2(V_1), h_3(V_1), h_4(V_1), h_5(V_1), h_6(V_1)} = {0,0,1,0,1,0}.
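A first layer hash chain of this kind can be sketched as below. The construction shown (random Gaussian directions made mutually orthogonal by Gram-Schmidt, with one sign bit per direction) is one common way of building orthogonal angle hash functions and is an assumption here, since this passage does not specify how the functions h are constructed. The function names orthogonal_hyperplanes and hash_chain are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthogonal_hyperplanes(m: int, d: int, seed: int = 0) -> List[Vector]:
    """Draw m Gaussian vectors in R^d and orthonormalize them (Gram-Schmidt).
    Assumed construction; requires m <= d."""
    if m > d:
        raise ValueError("need m <= d for mutually orthogonal hyperplanes")
    rng = random.Random(seed)
    planes: List[Vector] = []
    while len(planes) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for p in planes:                      # remove components along earlier planes
            c = dot(v, p)
            v = [x - c * y for x, y in zip(v, p)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:                       # skip degenerate draws
            planes.append([x / norm for x in v])
    return planes


def hash_chain(planes: List[Vector], v: Vector) -> List[int]:
    """Apply the chain {h_1, ..., h_m}: bit i is 1 when v lies on the
    non-negative side of hyperplane i, otherwise 0."""
    return [1 if dot(p, v) >= 0.0 else 0 for p in planes]


# Usage: a 16-dimensional raw vector reduced to a 6-bit compact vector.
planes = orthogonal_hyperplanes(m=6, d=16, seed=1)
raw_vector = [float(i + 1) for i in range(16)]
compact_vector = hash_chain(planes, raw_vector)
```

Because each bit depends only on the side of a hyperplane, rescaling the raw vector leaves the compact vector unchanged, which is the angle-based behaviour the paragraph relies on.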

[0096] Following the first layer LSH dimensionality reduction process 214, the compound hash values (i.e. compact feature vectors K_1 to K_n) of compact feature vector set 502 are then partitioned into sub-index partition groups by partitioning process 1100, which will now be described in greater detail with reference to FIG. 11B. The partitioning process 1100 functions to assign compact feature vectors K_j that are sufficiently similar into respective partition groups.

[0097] In order to partition similar objects (each represented by a respective compact feature vector K_j) into respective partition groups, a new LSH index layer is introduced, which is called the partition layer LSH index. The principle behind the partition layer LSH index is that: similar objects (as represented by raw feature vectors) have a high possibility p1 of having similar hash values after a first layer LSH has been performed; and similar compact feature vectors have a high possibility p2 of having similar hash values after a second, partition layer LSH is performed. Therefore, after two layers of LSH, similar objects have a possibility of p1*p2 of having similar compact feature vectors. This principle is the basis for defining partition groups and generating a sub-index ID (SubID) for each partition group, as shown in FIG. 11A and FIG. 12. In at least some examples, each compact feature vector K_j is included in only one partition group. Accordingly, at search time, each search query needs to access the sub-index structure for only a single partition group, which improves the speed of similarity searching. Furthermore, the ability of the partitioning method to handle concurrency can be easily controlled by a single parameter, M, where M is the number of bits used for partitioning into partition groups.

[0098] As shown in FIG. 11B, the partitioning process 1100 is repeated for each of the n compact feature vectors K_j that are contained in the compact feature vector set 502, and at the completion of partitioning process 1100 each of the n compact feature vectors K_j is assigned to a respective partition group 1 to 2^M of similar compact feature vectors K_j, where similarity is a function of a partition layer LSH process 1104. The number of partition groups is 2^M and each partition group and its respective sub-index structure 219(SubID) is mapped to a unique M-bit sub-index ID (SubID).

[0099] As indicated in block 1102, each repetition of partitioning process 1100 begins with getting the next compact feature vector K_j from the compact feature vector set 502. As indicated at process block 1104, a partition layer LSH is then performed on the compact feature vector K_j to generate a sub-index ID (SubID) and thereby assign the compact feature vector K_j to a respective one of the partition groups 1 to 2^M. In example embodiments, applying a partition layer LSH comprises hashing the compact feature vector K_j with a hash function chain G’ that includes M orthogonal locality sensitive hash functions (e.g. SubID for K_j = G’(K_j) = {h_1(K_j), h_2(K_j), ..., h_M(K_j)}). FIG. 13 is a pseudo-code representation of the process blocks 1102 and 1104 of partitioning process 1100, in which compact feature vectors K_j (represented in FIG. 13 as “Hash value matrix E[j,i]”) are each assigned a respective sub-index ID (SubID). FIG. 12 illustrates an example of the LSH partitioning process applied to 6-bit compact feature vector K_1 = {0,0,1,0,1,0} at process block 1104. The m=6 bit compact feature vector K_1 is hashed with the function chain G’ = {h_1, h_2} (M=2) to output a 2-bit sub-index ID (SubID) = G’(K_1) = {h_1(K_1), h_2(K_1)} = {1,0}. The first binary value of the sub-index ID is the output of the orthogonal hash function h_1 applied to the 6-bit compact feature vector K_1 = {0,0,1,0,1,0}, and the second binary value of the sub-index ID is the output of the orthogonal hash function h_2 applied to the same compact feature vector.
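The SubID computation for the FIG. 12 example can be illustrated with the following sketch. The two hyperplanes in G_PRIME are hypothetical and were picked by hand so that the output matches the {1,0} value given above; they are not the hash functions of the described embodiment. Mapping the bits 0/1 to -1/+1 before projecting is likewise an assumption made for illustration.

```python
from typing import List, Sequence


def sub_index_id(compact: Sequence[int],
                 chain: Sequence[Sequence[float]]) -> List[int]:
    """Partition layer LSH: hash an m-bit compact feature vector with an
    M-function chain and return the M-bit SubID."""
    # Assumption: map bits {0, 1} to {-1, +1} so the vector has a direction.
    x = [2 * b - 1 for b in compact]
    return [1 if sum(a * xi for a, xi in zip(plane, x)) >= 0 else 0
            for plane in chain]


# Hypothetical orthogonal chain for m = 6, M = 2 (illustrative values only).
G_PRIME = [
    [0, 0, 1, 0, 1, 0],  # stands in for h_1
    [1, 1, 0, 0, 0, 0],  # stands in for h_2; its dot product with h_1 is 0
]

K1 = [0, 0, 1, 0, 1, 0]
print(sub_index_id(K1, G_PRIME))  # [1, 0]
```

The same function would be applied unchanged to a compact query vector at search time, so that a query and the stored vectors it resembles tend to receive the same SubID.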

[00100] As indicated by process block 1108 in FIG. 11B, once a sub-index ID is determined for a compact feature vector K_j, the compact feature vector K_j is added to the corresponding partition group 1 to 2^M. In the example of FIG. 12, the 6-bit compact feature vector K_1 = {0,0,1,0,1,0} is added to partition group 3, as identified by its binary sub-index ID, SubID=10b. Thus, each compact feature vector K_j (and its corresponding raw feature vector V_j and unstructured data object 208) is individually assigned to a sub-index partition group of similar vectors.
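The assignment step can be sketched as a simple grouping by SubID. The mapping from SubID to group number used here (binary value plus one) is inferred from the FIG. 12 example, in which SubID 10b corresponds to partition group 3 and groups are numbered from 1; the function names and the dictionary-based storage are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_number(sub_id: Sequence[int]) -> int:
    """Map an M-bit SubID to a partition group number in 1..2^M
    (inferred convention: binary value of the SubID plus one)."""
    value = 0
    for bit in sub_id:
        value = (value << 1) | bit
    return value + 1


def partition(sub_ids: Dict[int, Sequence[int]]) -> Dict[int, List[int]]:
    """Group object IDs by partition group.
    sub_ids maps an object ID to the SubID bits of its compact vector."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for object_id, sub_id in sub_ids.items():
        groups[group_number(sub_id)].append(object_id)
    return dict(groups)


print(group_number([1, 0]))  # 3
```

Each resulting group would then be handed to its own index structure generation process.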

[00101] At the completion of partitioning process 1100, the compact feature vectors K_1 to K_n of compact feature vector set 502 are distributed among 2^M partition groups, each of which is a subset of the compact feature vectors K_1 to K_n. As indicated in FIGs. 11A and 12, each one of the 2^M partition groups is then processed using a respective RDF index structure generation process 218(1) to 218(2^M) to generate a respective RDF sub-index structure 219(1) to 219(2^M). Each RDF index structure generation process 218(1) to 218(2^M) processes its respective sub-index partition group in the same manner as described above with reference to Figures 6 to 9 in respect of the processing of compact feature vector set 502 by index structure generation process 218. Each of the respective RDF sub-index structures 219(1) to 219(2^M) includes respective LSH index tables T(1) to T(n_s), where n_s can be individually selected for each of the RDF sub-index structures 219(1) to 219(2^M).

[00102] As illustrated by the dashed boxes labelled “Machine(1)” to “Machine(2^M)” in Figure 11A, in at least some example embodiments, each of the RDF sub-index structures 219(1) to 219(2^M) is hosted or stored at a different digital processing system to support concurrent queries. In some examples, the multiple different digital processing systems may include multiple virtual machines implemented on a common digital processing system (for example digital processing system 1410), or may be physically different machines (for example multiple digital processing systems 1410). The size of M determines the number of sub-index partition groups, which affects the ability to support concurrent query requests. The larger the size of M, the greater the ability to handle concurrent searches. Accordingly, in example embodiments each of the sub-index structures 219(1) to 219(2^M) is stored as an independent, searchable structure, enabling concurrent searching of the sub-index structures.

[00103] Searching of RDF sub-index structures 219(1) to 219(2^M) will now be described with reference to FIG. 14, which shows a similarity search method 204A according to example embodiments. Similarity search method 204A is similar to the similarity search method 204 described above in respect of FIG. 2, except that similarity search method 204A includes additional processes of generating a sub-index ID for the compact feature query vector Qk (process 1450) and, in at least some example embodiments, conducting a step-wise search of index structures with sub-index IDs similar to that of the compact feature query vector Qk (process 1454). As indicated in FIG. 14, the similarity search method 204A includes feature extraction process 222 to convert a query object into a d-dimensional raw feature query vector Qv, and LSH dimensionality reduction process 226 to reduce the d-dimensional raw feature query vector Qv to an m-dimensional compact feature query vector Qk = G_1(Qv) = {h_1(Qv), h_2(Qv), ..., h_m(Qv)}.

[00104] An additional LSH level is applied at process 1450 to determine the appropriate RDF sub-index structure 219(SubID) for searching for compact feature vectors K_j that are similar to the compact feature query vector Qk. In particular, the same operation of applying a second LSH layer described above in respect of process 1104 is applied to the query vector Qk. That is, a sub-index ID (SubID) is determined for the query vector Qk by applying the orthogonal angle hash function chain G’ as follows:

SubID for query vector Qk = G’(Qk) = {h_1(Qk), h_2(Qk), ..., h_M(Qk)}.

[00105] As indicated by process 1452 in FIG. 14, the SubID for the compact feature query vector Qk is used to identify the RDF sub-index structure 219(SubID) for the sub-index partition group that is most likely to include objects similar to the search query object. The same search process 230 as described above in respect of FIG. 2 is then applied to identify candidate results 232 from the RDF sub-index structure 219(SubID).

[00106] Ideally, a partition method strives to divide all similar objects into one sub-index partition group. However, due to the approximate nature of applying a partition layer LSH to assign a partition group sub-index ID, it is possible that in at least some applications similar objects will still be divided into different partition groups, which can affect the accuracy and consistency of similarity searches using the generated sub-index structures. Accordingly, to increase search accuracy, in example embodiments, a step-wise search approach is implemented based on another LSH property. An example of the additional steps required to implement a step-wise search approach is illustrated in the process block 1454 (“Step-wise Search of Index Structures with Similar Sub-Index IDs”) in FIG. 14 and the step-wise search diagram shown in FIG. 15.

[00107] The step-wise search approach is based on the assumption that the sub-index structures that are one step away from the sub-index structure of the search query are more likely to contain compact feature vectors that are close to the compact feature vector of the search query than the sub-index structures that are two steps away. Because there are only two possible values 0/1 in each bit of a sub-index ID, the Hamming distance between two sub-index IDs can be denoted as delta steps, and the maximum number of delta steps is M steps.

[00108] In example embodiments, as indicated by process 1452, initially, the sub-index structure 219(SubID) that corresponds to the sub-index ID generated for the compact feature query vector Qk is searched. However, to increase accuracy, the 1-step sub-index structures are also searched, at a cost in time efficiency that grows with the number of searched sub-indexes. In some example embodiments, the number of 1-step sub-index structures for searching is set at M (i.e. the same number of bits used for the sub-index ID). Using this approach, a higher accuracy may in some cases be achieved by searching within a reasonable number of sub-index structures.

[00109] To identify the delta-step sub-index structures for a particular SubID, +1 (for bit=0) or -1 (for bit=1) is applied to delta bits of the original sub-index ID. For example, if the original sub-index ID of Qk is SubID = G’(Qk) = {h_1(Qk), h_2(Qk), ..., h_M(Qk)}, a 1-step sub-index ID is determined by applying the +1/-1 operation on one random bit of G’(Qk), a 2-step sub-index ID is determined by applying the +1/-1 operation on two random bits of SubID = G’(Qk), and so on. For example, as can be seen from FIG. 15, if M = 3 and the original sub-index ID is 010, the 1-step sub-index IDs are 110, 000, 011, the 2-step sub-index IDs are 100, 111, 001, and the 3-step sub-index ID is 101.

[00110] Accordingly, in example embodiments, the process block 1454 (“Step-wise Search of Index Structures with Similar Sub-Index IDs”) includes determining, as indicated in process block 1456, the sub-index IDs for all of the sub-index structures 219(SubID) that are within a threshold similarity of the “original” or “Step-0” sub-index ID (where the “original” sub-index ID is the SubID of the compact feature query vector Qk). In example embodiments, the threshold is a maximum number of steps (e.g. bit changes relative to the original SubID), which can be as large as M. Accordingly, in the example of FIG. 15 where M=3 and the original SubID = {0,1,0}, there will be three “1-step” SubIDs that have one bit different than the original SubID, namely: {1,1,0}, {0,0,0}, {0,1,1}; three “2-step” SubIDs that have two bits different than the original SubID, namely: {0,0,1}, {1,1,1}, {1,0,0}; and one “3-step” SubID that has three bits different than the original SubID, namely: {1,0,1}.
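The enumeration of step-wise SubIDs amounts to listing all bit strings at a given Hamming distance from the original SubID. A minimal sketch follows; the function name step_sub_ids and the string representation of SubIDs are hypothetical choices made for readability.

```python
from itertools import combinations
from typing import Dict, List


def step_sub_ids(sub_id: str, max_steps: int) -> Dict[int, List[str]]:
    """For each step count 1..max_steps, list the SubIDs whose Hamming
    distance from sub_id equals that step count."""
    result: Dict[int, List[str]] = {}
    for step in range(1, min(max_steps, len(sub_id)) + 1):
        neighbours = []
        for positions in combinations(range(len(sub_id)), step):
            bits = list(sub_id)
            for p in positions:
                # The +1 (for bit=0) / -1 (for bit=1) operation is a bit flip.
                bits[p] = "1" if bits[p] == "0" else "0"
            neighbours.append("".join(bits))
        result[step] = neighbours
    return result


print(step_sub_ids("010", 3))
# {1: ['110', '000', '011'], 2: ['100', '111', '001'], 3: ['101']}
```

For M bits there are C(M, delta) SubIDs at delta steps, so the number of sub-index structures to visit grows quickly with the step limit, which is why a threshold is applied.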

[00111] As illustrated in process block 1458, each of the respective sub-index structures 219(SubID) that are identified as falling within the maximum step size is then individually searched to identify any compact feature vectors K_j that are similar to the compact feature query vector Qk. In example embodiments, such searching is conducted using the search process 230 described above and returns a set of candidate results 232 for each searched sub-index structure 219(SubID). In example embodiments, the candidate search results may be subjected to filtering and ranking.

[00112] In at least some examples, decisions to perform step-wise searching and the extent of such searching may be individually determined by the processing system 1410 for each compact feature query vector Qk based on predetermined search result thresholds. For example, if a threshold number of candidate search results is met after the search of the sub-index structure that corresponds to the original sub-index ID, then additional step-searching (i.e. process block 1454) need not be performed. Similarly, if additional step-searching is performed, the step-searching of additional sub-index structures can be terminated if the threshold number of candidate search results is reached before the maximum number of step searches is completed.
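The threshold-based termination described here can be sketched as a loop that visits sub-index structures in order of increasing step count and stops once enough candidates have been collected. The function stepwise_search, its parameters, and the callable search_sub_index (standing in for search process 230 applied to one sub-index structure) are hypothetical.

```python
from itertools import combinations
from typing import Callable, List


def stepwise_search(sub_id: str,
                    search_sub_index: Callable[[str], List[int]],
                    min_candidates: int,
                    max_steps: int) -> List[int]:
    """Search the original sub-index first, then 1-step, 2-step, ... sub-indexes,
    stopping as soon as min_candidates results have been collected."""
    candidates: List[int] = []
    m = len(sub_id)
    for step in range(min(max_steps, m) + 1):       # step 0 is the original SubID
        for positions in combinations(range(m), step):
            flipped = "".join(
                ("1" if b == "0" else "0") if i in positions else b
                for i, b in enumerate(sub_id)
            )
            candidates.extend(search_sub_index(flipped))
            if len(candidates) >= min_candidates:
                return candidates                   # threshold met, stop early
    return candidates


# Usage with a toy stand-in for the per-sub-index search.
toy_index = {"010": [1], "110": [2, 3], "000": [4], "011": [5]}
print(stepwise_search("010", lambda s: toy_index.get(s, []), 3, 3))  # [1, 2, 3]
```

Setting max_steps to 0 reduces this to a search of the original sub-index structure only.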

[00113] As noted above, in at least some example embodiments, each of the RDF sub-index structures 219(1) to 219(2^M) is hosted or stored at a different digital processing system to support concurrent queries. These systems can support concurrent queries based on different object queries, or concurrent step-wise queries based on the same object query.
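Concurrent step-wise queries for a single query object can be illustrated with a thread pool, where each worker stands in for a request to the machine hosting one sub-index structure. This is a simplified, single-process sketch under that assumption; search_concurrently and search_sub_index are hypothetical names, and a real deployment would replace the callable with a network call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def search_concurrently(sub_ids: Sequence[str],
                        search_sub_index: Callable[[str], List[int]],
                        max_workers: int = 4) -> Dict[str, List[int]]:
    """Search several sub-index structures in parallel and collect the
    candidate lists keyed by SubID."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        results = pool.map(search_sub_index, sub_ids)
        return dict(zip(sub_ids, results))


# Usage: the original SubID and two of its 1-step neighbours.
toy_index = {"010": [1], "110": [2, 3]}
print(search_concurrently(["010", "110", "000"],
                          lambda s: toy_index.get(s, [])))
# {'010': [1], '110': [2, 3], '000': []}
```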

[00114] In at least some example embodiments the methods and systems described above may address some of the time and processing inefficiencies that are inherent in existing large volume unstructured data storage systems, indexing systems, and searching systems, thereby improving one or more of search accuracy, search speed, and use of system resources including processor time and power consumption.

[00115] The previous description of some embodiments is provided to enable any person skilled in the art to make or use an apparatus, method, or computer readable medium according to the present disclosure.

[00116] Various modifications to the embodiments described herein may be readily apparent to those skilled in the art, and the generic principles of the methods and devices described herein may be applied to other embodiments. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[00117] For example, although embodiments are described with reference to bits, other embodiments may involve non-binary and/or multi-bit symbols.
