

Title:
NEAREST NEIGHBOR SEARCH WITH RANDOM PROJECTION TREES
Document Type and Number:
WIPO Patent Application WO/2011/050412
Kind Code:
A1
Abstract:
The invention concerns a hyperplane-partitioned spatial indexing tree, such as a random projection tree, that indexes data having a high dimensional ambient space. The invention also concerns estimating at least one nearest neighbor data point of a given query point using the tree. In particular the invention searches a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point. The tree indexes data having a high dimensional ambient space and low intrinsic dimensionality. The method recursively searches through nodes of the tree and, at each node searched, updates as appropriate the current nearest neighbor identified. Also, at each node, a shortest distance (QB) on the intrinsic space between the query point and a splitting hyperplane bounding that node is determined. If this shortest distance is greater than a predetermined value (QG), any node that is also bounded by that splitting hyperplane is disregarded in the search. Aspects of the invention include a method, computer system and software.

Inventors:
ZVEDENIOUK ILIA ISAKOVICH (AU)
Application Number:
PCT/AU2010/001439
Publication Date:
May 05, 2011
Filing Date:
October 27, 2010
Assignee:
UNIV SYDNEY (AU)
ZVEDENIOUK ILIA ISAKOVICH (AU)
International Classes:
G06F17/30
Foreign References:
US7475071B12009-01-06
Other References:
WANG, G. et al., "A hyperplane based indexing technique for high-dimensional data", Journal of Information Sciences, vol. 177, no. 11, 1 June 2007, pages 2255-2268, XP005927890, DOI: 10.1016/j.ins.2006.12.018
SPROULL, R. F., "Refinements to nearest-neighbour searching in k-dimensional trees", Algorithmica, vol. 6, no. 4, 1991, pages 579-589
MCNAMES, J., "A fast nearest neighbour algorithm based on a principal axis search tree", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, September 2001, pages 964-976
DASGUPTA, S., "Random projection trees and low dimensional manifolds", Proceedings of STOC, 17 May 2008, pages 537-546
ARYA, S. et al., "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions", Journal of the ACM, vol. 45, no. 6, 1998, pages 891-923, XP058146321, DOI: 10.1145/293347.293348
YU, C. et al., "Indexing the distance: An efficient method to KNN processing", Proceedings of the 27th VLDB Conference, Italy, 11-14 September 2001
Attorney, Agent or Firm:
F B RICE & CO (44 Market StreetSydney, NSW 2000, AU)
Claims:
CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:

1. A method of searching a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point, wherein the tree indexes data having a high dimensional ambient space and low intrinsic dimensionality, the method comprising recursively searching through nodes of the tree and at each node searched updating as appropriate the current nearest neighbor identified, the method comprising the following step:

at each node, determining a shortest distance (QB) on the intrinsic space between the query point and a splitting hyperplane bounding that node; and if this shortest distance is greater than a predetermined value (QG), disregarding in this search any node that is also bounded by that splitting hyperplane.

2. The method of claim 1, wherein the method further comprises determining the predetermined value (QG) by determining a substantially shortest distance between the query point and the current nearest neighbor identified.

3. The method of claim 1 or 2, wherein determining substantially the shortest distance (QB) comprises estimating the angle (a) between the splitting hyperplane defining that node and a data plane of data represented by the node.

4. The method of claim 1, 2 or 3, wherein determining substantially the shortest distance (QB) comprises determining a further substantially shortest distance (QC) between the query point and the splitter hyperplane defining that node.

5. The method of claim 4, wherein determining the further substantially shortest distance (QC) between the query point and the splitter hyperplane defining that node is based on a projection of a normal to the splitting hyperplane onto the data plane of data represented by that node.

6. The method of claim 3 limited by claim 4 or 5, wherein determining substantially the shortest distance (QB) comprises dividing the further shortest distance (QC) by the sine of the angle (sin(a)).

7. The method of claim 4 or 5, wherein determining the angle (a) is based on determining a mean distribution of the data represented by that node.

8. The method of claim 4 or 5, wherein determining the angle (a) is based on a principal component analysis (PCA) of the data represented by that node where the data plane is defined by a number of first eigenvectors.


9. The method of claim 4 or 5, wherein determining the angle (a) is based on a number of vectors from the centre of the data represented by that node to a number of randomly selected data points represented by that node.

10. The method according to any one of claims 4 to 9, wherein if the angle (a) is equal to π/2, the method further comprises assessing whether other nodes should be disregarded in this search using a hypersphere.

11. The method of any one of the preceding claims, wherein if this distance (QB) is smaller than the predetermined value, the method further comprises performing further steps to determine whether nodes also partitioned by that splitting hyperplane can be disregarded.

12. The method of any one of claims 1 to 11, wherein if this distance (QB) is smaller than the predetermined value, the method further comprises searching nodes also partitioned by that splitting hyperplane.

13. The method according to any one of the preceding claims, wherein the disregarded node is a subtree.


14. The method according to any one of the preceding claims, wherein the query is an image, document or other electronic file.

15. The method according to any one of the preceding claims, wherein the tree is a random projection tree.

16. Software, being computer readable instructions stored on a computer readable medium, that when executed causes a computer to perform the method of any one of the preceding claims.


17. A computer system for searching a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point, having a searching module to perform the method of any one of claims 1 to 15.

18. The steps, features, integers, compositions and/or compounds disclosed herein or indicated in the specification of this application individually or collectively, and any and all combinations of two or more of said steps or features.

Description:
Title

Nearest Neighbor Search with Random Projection Trees

Technical Field

The invention concerns a hyperplane-partitioned spatial indexing tree, such as a random projection tree that indexes data having a high dimensional ambient space. The invention also concerns estimating at least one nearest neighbor data point of a given query point using the tree. Aspects of the invention include a method, computer system and software.

Background Art

A k-dimensional (k-d) tree is a spatial data structure that partitions R^D into hyperrectangular cells. It is built in a recursive manner, splitting along one coordinate direction at a time. The succession of splits corresponds to a binary tree whose leaf nodes contain the individual cells in R^D, and a splitter is represented as a fork in a branch of the tree. In this way a cell/node is bounded by a splitter. At each level of the tree, a different dimension is used to split the remaining points.
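By way of illustration only, the following Python sketch shows one way such a recursive construction can be written. The names (KDNode, build_kd_tree) and the leaf size are illustrative and not taken from this specification; the data points are assumed to be an (n, D) numpy array.

import numpy as np

class KDNode:
    def __init__(self, axis=None, split=None, left=None, right=None, points=None):
        self.axis, self.split = axis, split    # splitting coordinate direction and value
        self.left, self.right = left, right    # the two child cells bounded by this splitter
        self.points = points                   # data points stored at a leaf cell

def build_kd_tree(points, depth=0, leaf_size=10):
    # points: an (n, D) numpy array of data points in R^D
    if len(points) <= leaf_size:
        return KDNode(points=points)
    axis = depth % points.shape[1]             # cycle through the D coordinate directions
    split = np.median(points[:, axis])         # split at the median along that direction
    left = points[:, axis] <= split
    if left.all() or not left.any():           # degenerate split: keep the cell as a leaf
        return KDNode(points=points)
    return KDNode(axis=axis, split=split,
                  left=build_kd_tree(points[left], depth + 1, leaf_size),
                  right=build_kd_tree(points[~left], depth + 1, leaf_size))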

It is an empirical observation that the usefulness of k-d trees diminishes as the dimension D increases: k-d trees are susceptible to the same curse of dimensionality that has been the bane of other nonparametric statistical methods. However, it has recently been noted that a majority of real-world data sets actually lie close to a low dimensional manifold structure.

Generating a k-d tree is part of pre-processing a set of n points in D dimensional Euclidean space into a data structure so that, given a query point q, its k nearest neighbors can be found quickly. This is a fundamental problem in various computer science fields such as pattern recognition, statistical classification and computer vision. To understand the application of these spatial indexing trees, consider Fig. 1, where all the dots are points in a database and the cross is the query point q. The node/cell containing q, i.e. cell(q), can be quickly identified by moving q down the tree. If the diameter of cell(q) is small (where the diameter is taken to be the distance between the furthest pair of data points in the cell), then the points in it can be expected to have similar properties, for instance similar labels.

The two most popular algorithms to solve this problem are cover trees and Locality Sensitive Hashing (LSH). Cover trees offer an exact search with run-time depending only on the intrinsic dimension (as well as being adaptable for approximation), but suffer from high pre-processing requirements, as well as O(n) space usage. LSH is a probabilistic method that is very fast and achieves high search accuracy; however, LSH methods need to use a large number of hash tables and a large amount of available memory.

A random projection tree (rp-tree), shown in Fig. 2, is the same as a k-d tree except for the nature of the splits. Instead of splitting along coordinate directions at the median, we split along a random direction in S^(D-1) (the unit sphere in R^D), and instead of splitting exactly at the median, a small amount of "jitter" is added. Put another way, the rp-tree splits the data with randomly inclined planes of the form N(0, 1)^D, meaning it splits along a random combination of all the axes, none of which are zero. rp-trees are much more effective than kd-trees, especially where the data is sparse: rp-trees automatically adapt to this intrinsic low-dimensional structure of the data. Specifically, Dasgupta et al. [1] prove that the diameter of the cells (as defined by the distance between the two furthest points within a cell) reduces by a factor of two every d log d levels in the rp-tree, with no dependence on the ambient dimension D. This is the major shortcoming of regular kd-trees, which require up to D levels. This reduction in cell diameter is the key property determining the effectiveness of spatial partitioning structures.

Summary of the invention

In a first aspect the invention is a method of searching a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point, wherein the tree indexes data having a high dimensional ambient space and low intrinsic dimensionality, the method comprising recursively searching through nodes of the tree and at each node searched updating as appropriate the current nearest neighbor identified, the method comprising the following step:

at each node, determining a shortest distance (QB) on the intrinsic space between the query point and a splitting hyperplane bounding that node; and if this shortest distance is greater than a predetermined value (QG), disregarding in this search any node that is also bounded by that splitting hyperplane.

The invention represents a trade-off between accuracy and time in estimating the nearest neighbour by exploiting the intrinsic low dimensionality of the data. As a result it is an advantage of the invention that it allows for fast approximate and exact nearest neighbor search in a very high dimensional space. It is a further advantage of the invention that finding the shortest distance between the query point and the splitting hyperplane is easier than learning about the manifold.

A node is bounded by a splitting hyperplane if the node stems from an immediate fork in a branch of the tree that represents that splitting hyperplane.

The method further may comprise determining the predetermined value (QG) by determining a substantially shortest distance between the query point and the current nearest neighbor identified.

Determining substantially the shortest distance (QB) may comprise estimating the angle (a) between the splitting hyperplane defining that node and a data plane of data represented by the node. It is an advantage of at least one embodiment of the invention that estimating the angle is easier than learning about the manifold.

Determining substantially the shortest distance (QB) may comprise determining a further substantially shortest distance (QC) between the query point and the splitter hyperplane defining that node.

Determining the further substantially shortest distance (QC) between the query point and the splitter hyperplane defining that node may be based on a projection of a normal to the splitting hyperplane onto the data plane of data represented by that node.

Determining substantially the shortest distance (QB) may comprise dividing the further shortest distance (QC) by the sine of the angle (sin(a)).

Determining the angle (a) may be based on determining a mean distribution of the data represented by that node.

Determining the angle (a) may be based on a principal component analysis (PCA) of the data represented by that node, where the data plane is defined by a number of first eigenvectors. It is an advantage of this embodiment of the invention that the most important (most common) vectors in the data can be identified and noise removed.

Determining the angle (a) may be based on a number of vectors from the centre of the data represented by that node to a number of randomly selected data points represented by that node.

If the angle (a) is equal to π/2, the method may further comprise assessing whether other nodes should be disregarded in this search using a hypersphere.

If this distance (QB) is smaller than the predetermined value, the method may further comprise performing further steps to determine whether nodes also partitioned by that splitting hyperplane can be disregarded.

If this distance (QB) is smaller than the predetermined value, the method may further comprise searching nodes also partitioned by that splitting hyperplane.

The disregarded node may be a subtree. The tree may be a random projection tree. The query may be an image, document or other electronic file. It is an advantage of this embodiment of the invention that a similar image or document can be found quickly without the use of textual descriptors, such as a file name of the file or tags associated with a file.

In a further aspect, the invention provides software, being computer readable instructions stored on a computer readable medium, that when executed causes a computer to perform the method described above.

In yet a further aspect, the invention provides a computer system for searching a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point, having a searching module to perform the method described above.

Brief Description of the Drawings

Fig. 1 is a schematic representation of a spatial partitioning of R^2 induced by a kd-tree with three levels.

Fig. 2 is a schematic representation of a spatial partitioning of R^2 induced by an rp-tree.

Fig. 3 schematically shows points that lie close to an affine subspace of dimension d.

Fig. 4 schematically shows cells of a kd-tree intersected by a hypersphere.

An example of the invention will now be described with reference to the accompanying drawings, in which:

Fig. 5 schematically shows the intersection of two hyperplanes.

Fig. 6 shows how the intersection of two hyperplanes affects a nearest neighbour search in the rp-tree.

Fig. 7(a) shows the pruning criterion of a kd-nearest neighbour search (prior art) and Fig. 7(b) shows the pruning criterion of the present example.

Fig. 8 assists in showing the proof of Theorem 1.

Fig. 9 schematically shows the volume of the hypersphere segment as a proportion of the volume of the entire hypersphere.

Fig. 10 schematically shows the probability of missing the region is sensitive to both the width of the region and the dimensionality.

Fig. 11 schematically shows the relationships between the angles θ, ω and ψ.

Best Modes of the Invention

Intrinsic Dimension

A recent positive development in machine learning has been the realisation that a lot of data which superficially lie in a very high-dimensional space actually have low intrinsic dimension, in the sense of lying close to a manifold of dimension d << D. Algorithms exist that learn this manifold from data. By applying these algorithms, high dimensional data can be transformed into this low-dimensional space in which standard methods will work well. Intrinsic dimensionality represents the number of variables that are needed to capture the variance of the data. If data has a low intrinsic dimension, it means there is tremendous redundancy in using D variables - which can be in the tens of thousands - for each data point.

Various definitions of intrinsic dimensionality have been proposed, including doubling (or Assouad) dimension, low dimensional manifolds, and some others. In this specification we primarily refer to intrinsic dimension as data lying in a low-dimensional affine subspace. Data in R^D has covariance dimension (d, ε) if its covariance matrix has eigenvalues λ1 ≥ λ2 ≥ ... ≥ λD satisfying

λ1 + λ2 + ... + λd ≥ (1 - ε)(λ1 + λ2 + ... + λD).

That is, "there is a d-dimensional affine subspace such that (avg dist 2 from subspace) <(avg dist 2 from mean). We are interested in distributions that locally have this property, ie. for some partition of ^ , the restriction of the distribution to each region of the partition has covariance dimension

Fig. 3 schematically shows points 30 that lie close to an affine subspace 32 of dimension d .

For example, within a particular class of images (say of faces) a lot of things stay the same, and there is only a limited number of degrees of freedom (size of mouth, distance between eyes etc). Another example is images of dogs. Even though pictures of Chihuahuas and Great Danes look very different, they still have a lot in common and only differ in a few respects (size, type of fur, ears etc). This suggests that there is a lot of redundancy in using one million dimensions to describe the differences between such images, and that in a lower dimensional space it would be easier to search for similarity.

Nearest Neighbour Search

Typically, kd-tree nearest neighbor (NN) search aims to find the point in the tree which is nearest to a given input query point. Searching for a nearest neighbor in a kd-tree is done recursively as follows. First, we proceed down the branches of the tree as though inserting the query point q into the structure.

At each leaf node/cell, we calculate the distance of q to all of the data points found at that leaf node, these being the most likely candidates for the nearest neighbours. Intuitively, the data represented in the same node/cell is on the same side of a number of splitting planes (i.e. branches), and so should share similar properties. If the shortest distance calculated at this node is less than the distance to the current best neighboring point previously identified, then the closest point in the current node becomes the current best neighbour. Next we must decide which, if any, neighboring cells/nodes we need to check as well. As shown in Fig. 4, for a kd-tree, this is done by forming a hypersphere 40 centered at the query point 42, with the radius of the hypersphere 40 being the calculated distance between the query 42 and the current best nearest neighbor 44, and then checking whether this sphere 40 intersects any of the splitting planes (i.e. branches) that form the cell/node (in a regular kd-tree this is simple as the splits are all orthogonal to some axis, so we simply compare the distance to the nearest neighbor so far with the difference between the splitter's value and the query point's value along that axis). If it does, it means there is a possibility that there is data in cells on the other side of the splitter (i.e. a cell/node that is also bounded by the splitter), and this other cell needs to be checked. [2] estimates this checking to take up to 95% of the search time.
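The standard recursive search just described can be sketched as follows (again illustrative Python, reusing the KDNode structure sketched in the Background; not an implementation from this specification). The cell containing q is searched first, and the far child of a fork is visited only if the hypersphere around q crosses that fork's splitter.

import numpy as np

def kd_nearest(node, q, best=(None, np.inf)):
    # Returns (point, distance) for the nearest neighbour of q found so far.
    if node.points is not None:                          # leaf cell: scan its candidate points
        for p in node.points:
            dist = np.linalg.norm(q - p)
            if dist < best[1]:
                best = (p, dist)
        return best
    diff = q[node.axis] - node.split                     # signed distance of q to this splitter
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    best = kd_nearest(near, q, best)                     # search the cell containing q first
    if abs(diff) < best[1]:                              # the hypersphere crosses the splitter,
        best = kd_nearest(far, q, best)                  # so the far cell may hold a closer point
    return best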

Example

In this example, data is dispersed throughout the ambient space in all directions but has a low covariance dimension and lies close to a low dimensional hyperplane. If our data is stretched throughout the high dimensional space, then for any splitting hyperplane we can find a vector within the data that is normal to the splitter. If, however, the data lies in some intrinsic plane (we do not know its axes), then even if we take an infinite number of vectors within our data, the angle that any such vector makes with the splitter will be bounded by the dihedral angle between the two planes of data and splitter. This can be used to help break the curse of dimensionality and do a fast nearest-neighbour search in high dimensional data, provided the data lies on, or close to, a low dimensional plane.

An example of searching a hyperplane-partitioned spatial indexing tree for at least one nearest neighbor of a given query point will now be described. In this example, the query is an electronic image comprised of thousands of pixels. A database is stored in the memory of a computer system, such as a hard drive. The database is comprised of millions of electronic images, such as photos, each also comprised of thousands of pixels. The photos represent data with a high dimensional ambient space, each pixel representing a different dimension. The photos also have a low intrinsic dimensionality (as described above).

The processor of the computer system is comprised of a pre-processor module, a searching module and an output module. The pre-processor module is in communication with the database. As part of the pre-processing phase the pre-processor module calculates a hyperplane-partitioned spatial indexing tree structure, in this case a random projection tree. The pre-processor module then operates to store the random projection tree in memory.
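By way of illustration of the kind of structure the pre-processor module might build, the following Python sketch constructs a random projection tree. The names (RPNode, build_rp_tree), the leaf size and the particular choice of jitter are illustrative assumptions, not the implementation of this specification; each image is assumed to be flattened into one row of an (n, D) numpy array.

import numpy as np

class RPNode:
    def __init__(self, direction=None, threshold=None, left=None, right=None, points=None):
        self.direction, self.threshold = direction, threshold
        self.left, self.right = left, right
        self.points = points

def build_rp_tree(points, leaf_size=10, rng=np.random.default_rng(0)):
    # points: an (n, D) numpy array; each image is flattened into one row of D pixel values.
    if len(points) <= leaf_size:
        return RPNode(points=points)
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)         # random direction on the unit sphere S^(D-1)
    proj = points @ direction
    jitter = rng.uniform(-1, 1) * np.std(proj) / np.sqrt(len(points))   # illustrative jitter choice
    threshold = np.median(proj) + jitter           # split near, but not exactly at, the median
    left = proj <= threshold
    if left.all() or not left.any():
        return RPNode(points=points)
    return RPNode(direction=direction, threshold=threshold,
                  left=build_rp_tree(points[left], leaf_size, rng),
                  right=build_rp_tree(points[~left], leaf_size, rng))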

The computer system further comprises at least one input port and at least one output port. In this example one input port and one output port are in communication with the internet and are able to send and receive communications therethrough. At the input port the computer system receives a user query, that is, an electronic image. The user wishes to identify k similar images stored in the database.

The searcher module performs this query by searching the tree for the k nearest neighbours of the query image within the tree. The search involves recursively searching through nodes of the tree and, at each node searched, updating as appropriate the current nearest neighbor identified. In this example, at each node, a shortest distance (QB) on the intrinsic space between the query point and a splitting hyperplane bounding that node is determined; and if this shortest distance is greater than a predetermined value (QG), any node that is also bounded by that splitting hyperplane is disregarded in the search.

The searching module having identified the k nearest neighbors, the output module retrieves the identified image files from the database and passes them to the output port for receipt by a user, such as communicating the images over the internet to a user's computer or for immediate display on a display device, such as a monitor that is connected to an output port.

Let us assume we have an infinite number of data points. If the data is dispersed in all directions, then for any splitting hyperplane (splitter) we can find a pair of points within the data that define a vector that is normal to the splitter. If, however, the data is restricted to an intrinsic d-dimensional hyperplane (I) as in Fig. 5, then the angle that any vector defined by a pair of data points makes with the splitter is bounded by the dihedral angle between the two hyperplanes, the splitter hyperplane 50 and the hyperplane 52 formed by the data. This angle can more easily be calculated by considering the vector normal to the splitter (since the splitter is of dimension D - 1 in R^D, it has a unique normal vector). No vector within the data can make an angle smaller than a certain bound with this normal.

It can now be seen that the vector within I that minimizes the angle with the normal of the splitter (and hence is complementary to the dihedral angle a between the two hyperplanes 50 and 52) is in fact the projection of the splitter normal onto I.

In Fig. 5, within I the segment qr is perpendicular to the intersection of the splitter with I, and so clearly qr < qs for any other point s of the splitter within I, by the triangle inequality. So qr is the shortest path from q to the splitter, if we restrict our path to within I.

In Fig. 6, Cell 1 60 and Cell 2 62 are divided by the splitter 64. The query point is O, and G and F are two possible cases of the nearest neighbour found so far in Cell 1 60. The dotted line 66 is the projection of the splitter normal onto I, extended in both directions. B is the point where the splitter intersects this line. The line 66 makes the positive dihedral angle a with the splitter 64.

In Fig. 6, both spheres 68 and 70, going through G and F respectively, intersect the splitter 64. However, only the sphere 70 going through F suggests needing to check Cell 2 62. This is because, even though there is space on the other side of the splitter 64 in Cell 2 62 and inside the sphere going through G, this space does not intersect with the data space. The region BE is within our intrinsic data space, inside the larger sphere, and also on the other side of the splitter, and so only the larger sphere suggests checking Cell 2 62.

OB (equal to qr in Fig. 5) is the shortest path from O to the splitter 64, if we restrict our path to within I. So then if OB > OG, the intrinsic space cannot intersect with the sphere 68 with radius OG. No data point can lie both within this sphere 68 and on the other side of the splitter 64 from O. Hence, Cell 2 62 (or the subtree represented by that space, branching from the branch represented by splitter 64) can be pruned so that it is disregarded in further recursive searching of the tree.


So now we modify the pruning algorithm of the kd-tree nearest neighbour search from the algorithm shown in Fig. 7(a) to the algorithm shown in Fig. 7(b). That is, if a is very small (in high dimensions it usually is), OB is many times greater than OC. So, knowing a, we now prune all the cells that can be pruned, and the adaptation of the rp-tree to the low intrinsic dimension of the data is fully utilised. Note that the requirement that the covariance dimension be (d, 0) is unlikely in practice for small d in real world data sets. What is likely is that the majority of variance is contained in the first few eigenvectors of the data (at least locally), and so ε is usually some small non-zero value (even for small d). As we will see, this is usually good enough, though there is a small probability of error associated with the ε.
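A minimal sketch of the modified pruning test of Fig. 7(b) is given below (illustrative Python following the reasoning above, not a definitive implementation). It assumes that an estimate of sin(a) is available for the node and that the splitter is given by a unit normal and an offset: the ambient distance QC from the query to the splitter is divided by sin(a) to obtain QB, and the subtree on the far side is checked only if QB is less than the current best distance QG.

import numpy as np

def should_check_far_cell(q, splitter_normal, splitter_offset, sin_a, best_dist):
    # splitter_normal is assumed to be a unit vector; sin_a is the node's estimate of sin(a).
    qc = abs(np.dot(q, splitter_normal) - splitter_offset)   # ambient distance from q to the splitter
    qb = qc / sin_a                                          # shortest distance within the intrinsic plane I
    return qb < best_dist    # otherwise the subtree on the far side of the splitter is disregarded

With sin_a = 1 this reduces to the ordinary test of Fig. 7(a); the smaller the dihedral angle a, the more aggressively cells are pruned.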

Distribution of Dihedral Angle

In order for the result in the previous section to have any significance, we must show that the dihedral angle a is not π/2 for most data sets when the rp-tree is built, since if it were, OB = OC in Fig. 6. In fact, the case a = π/2 is a special case of our method where the multiplier becomes one, and we do a regular kd-tree nearest neighbour search.

Theorem 1: Given a random d-dimensional hyperplane (I) within R^D and a random splitting affine hyperplane of dimension D - 1 (splitter) defined by its perpendicular vector V, let a be the dihedral angle between them. If D is large (D > 100), then sin(a) converges to χ_d / √(D - 1/2), where χ_d denotes a chi-distributed random variable with d degrees of freedom.

Proof: a is complementary to ψ, the angle between the normal to the splitter (denoted by V) and I. By Yao's principle, we fix I to be the hyperplane spanned by unit vectors in the first d axes. ψ is then equal to the angle between V and ΠV, the projection of V onto I. This is illustrated in Fig. 8.

Let V = (x1, x2, ..., xD), where all xi ~ N(0, 1). Then ΠV = (x1, x2, ..., xd, 0, 0, ...), and

sin(a) = cos(ψ) = cos∠(V, ΠV) = ||ΠV|| / ||V|| = √(x1² + ... + xd²) / √(x1² + ... + xD²).

So the numerator and denominator are chi-distributed with d and D degrees of freedom respectively. R. A. Fisher showed that the denominator is approximately normally distributed with mean √(D - 1/2) when D is large. Since D is large, the denominator's variance becomes insignificant relative to its mean, and so we can replace it by its mean, giving sin(a) ≈ χ_d / √(D - 1/2).

This shows quite clearly the effect of the increasing dimensionality on the dihedral angle, and hence on the ratio between OC and OB in Fig. 6. This effect also exists in lower dimensions, though it is much less noticeable.
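The following Monte Carlo sketch (illustrative Python, not part of the specification) shows this effect numerically: random splitter normals of the form N(0, 1)^D are drawn, their projections onto a fixed d-dimensional subspace are taken, and the resulting values of sin(a) are compared with the χ_d / √(D - 1/2) prediction of Theorem 1.

import numpy as np
from scipy.special import gamma

D, d, trials = 1000, 5, 5000
rng = np.random.default_rng(2)
V = rng.standard_normal((trials, D))                        # random splitter normals of the form N(0,1)^D
sin_a = np.linalg.norm(V[:, :d], axis=1) / np.linalg.norm(V, axis=1)
# Theorem 1 prediction: sin(a) ~ chi_d / sqrt(D - 1/2), with E[chi_d] = sqrt(2)*Gamma((d+1)/2)/Gamma(d/2)
predicted_mean = np.sqrt(2) * gamma((d + 1) / 2) / gamma(d / 2) / np.sqrt(D - 0.5)
print(sin_a.mean(), predicted_mean)                         # the two values should be close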

The previous result suggests that we can estimate the dihedral angle directly by taking the mean of the distribution found above. Then we can implement our rp-tree, and use this estimate in Fig. 7(b). In fact this very naive technique sometimes does not yield such bad results. However, it is often the case that locally and globally the intrinsic dimensionality differs. For example, the surface of a manifold can be locally flat. So we would like to find the dihedral angle to the splitter locally in each cell. Additionally, the intrinsic dimension is often not known, although techniques exist for estimating it.

Finding the Dihedral Angle

Ideally we would like to run principal component analysis (PCA) on the data, and then consider the hyperplane spanned by the first d eigenvectors. However, PCA is computationally expensive and should not be necessary if we are only looking for the angle in question. Calculating the angle would be simple if the covariance dimension was (d, 0). We would choose d + 1 points within our data such that they define I. We would then subtract one from the rest, to obtain a set of d independent vectors that lie on I. We would then use the Gram-Schmidt process to orthogonalise the vectors. Then we would project the perpendicular vector of the splitter onto I by projecting onto each of these vectors, and calculate ψ exactly.
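A sketch of this exact computation is given below (illustrative Python), under the assumption that the covariance dimension is (d, 0), so the chosen points lie exactly on I and are in general position: the d spanning vectors are orthonormalised by Gram-Schmidt, the splitter normal V is projected onto their span, and ψ (and hence a) is read off.

import numpy as np

def dihedral_angle_exact(points_on_I, V):
    # points_on_I: d + 1 points assumed to lie exactly on I and to be in general position;
    # V: the perpendicular (normal) vector of the splitter.
    vectors = points_on_I[1:] - points_on_I[0]          # d independent vectors lying on I
    basis = []
    for v in vectors:                                   # Gram-Schmidt orthonormalisation
        for b in basis:
            v = v - np.dot(v, b) * b
        basis.append(v / np.linalg.norm(v))
    proj = sum(np.dot(V, b) * b for b in basis)         # projection of V onto I
    cos_psi = np.linalg.norm(proj) / np.linalg.norm(V)  # psi is the angle between V and its projection
    return np.arcsin(np.clip(cos_psi, 0.0, 1.0))        # a is complementary to psi, so sin(a) = cos(psi)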

However, since our data points only lie close to I and not directly on it, it is unclear whether a Gram-Schmidt type of process can yield the orthogonal vectors of I. So we employ a randomized method for estimating the dihedral angle. We take the center of our data cell, and sample some constant number of random data points within the cell (say, 5000). We then subtract the center from all of these points, to obtain 5000 random vectors within our data. We calculate the angle that each of these vectors makes with the vector V, which is the normal to the splitter. The smallest of these 5000 angles will be within a small range of ψ, the true angle between V and I. In order for this to be true, we need at least one of our random vectors to be arbitrarily close to some random vector within our data which represents ΠV. This depends on the dimensionality of I, as we will show.

Lemma 1: In R^D Euclidean space where D is large, the cosine of the angle between two random vectors of the form N(0, 1)^D converges to a normal distribution with mean zero and standard deviation approximately 1/√D.

Proof: By Yao's principle, we fix the first random vector v1 to be (1, 0, 0, ..., 0), and write v2 = (x1, x2, ..., xD) with all xi ~ N(0, 1). Then

cos∠(v1, v2) = x1 / ||v2|| = x1 / √(x1² + ... + xD²).

As in Theorem 1, R. A. Fisher showed that the denominator converges to √(D - 1/2), and the variance becomes insignificant relative to the mean. So, replacing the denominator by its mean, we get cos∠(v1, v2) ≈ x1 / √(D - 1/2), which is normally distributed with mean zero and standard deviation 1/√(D - 1/2).

This suggests that for large D, in order for the cosine of the angle between two random vectors to be large (near 1, for example), it must be around √D standard deviations from the mean of a normal distribution. Intuitively this means that two random vectors in high dimensional space are much more likely to be near-perpendicular than near-parallel. It also means that when the dimensionality is large, the probability of two random vectors being within some small angle of each other becomes extremely small, and so our random sampling strategy from above will not work. However, we will show that for small intrinsic dimension d it is not impractical.

Evaluation of Random Sampling of Vectors Strategy

In order to evaluate this strategy, we form the following problem. We assume our data is evenly distributed within a d-dimensional hypersphere centered at the origin, and we have some fixed vector ΠV within this space. We then generate 5000 random vectors of the form N(0, 1)^d within our space, and calculate the probability that not one of these is within some small angle θ of ΠV.

We denote the volume of the sphere segment spanning the vectors that are close to ΠV by s, and the volume of the entire hypersphere by S.

The probability that all of our 5000 vectors miss the region s is then (1 - s/S)^5000. We have evaluated exactly this probability for several different values of θ and dimensions from two to ten by using available formulas for the volume of a hypersphere cap and cone.
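A sketch of this calculation is given below (illustrative Python). It uses one standard form of such a formula, the regularized incomplete beta expression for the fraction of the unit sphere in d dimensions lying within a cap of half-angle θ, and treats the region around ΠV as a single cap; the particular choice of 5000 samples mirrors the example above.

import math
from scipy.special import betainc

def miss_probability(d, theta, n_samples=5000):
    # Probability that none of n_samples random directions in R^d falls within angle theta
    # of a fixed direction (theta <= pi/2).
    cap_fraction = 0.5 * betainc((d - 1) / 2, 0.5, math.sin(theta) ** 2)   # s / S
    return (1 - cap_fraction) ** n_samples

for d in range(2, 11):
    print(d, miss_probability(d, math.radians(25)))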

Up until dimensionality four, the probability of missing even a region as narrow as 2 x 25° can be made arbitrarily small by increasing the number of random samples. Beyond that, we must accept a larger error if we want to have a high degree of certainty of sampling at least one vector within the region.

Effect of Error on Pruning Calculation

The next question we must answer is: what effect does this error have on our pruning calculation from Fig. 7(b)?

Theorem 2: In Fig. 11, cos ω = cos ψ cos θ, regardless of dimensionality.

Proof: Take V, ΠV and r to be unit vectors, where ΠV is the normalized projection of V onto the first d axes (i.e. onto I), r is a vector within I making angle θ with ΠV, and ω is the angle between r and V. Write r = ΠV + g, so that g = r - ΠV and cos θ = ΠV · r. Then

cos ω = V · r = V · (ΠV + g) = V · ΠV + V · g     (the dot product is distributive)

cos ω = cos ψ + V · g.     (1.1)

Since ΠV is a normalized projection onto the first d axes, it is of the form (1/||Π(V)||)(v1, v2, ..., vd, 0, 0, ...), where Π(V) denotes the unnormalized projection of V onto I, with ||Π(V)|| = cos ψ. Since g = r - ΠV also lies within I, its last D - d coordinates are zero, and only the component of V within I contributes to V · g. So

V · g = Π(V) · g = ||Π(V)|| (ΠV · g) = cos ψ (ΠV · r - ΠV · ΠV) = cos ψ (cos θ - 1).

Then equation (1.1) becomes

cos ω = cos ψ + cos ψ (cos θ - 1) = cos ψ cos θ,

as required.
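A quick numerical check of Theorem 2 is given below (illustrative Python, with I taken to be the span of the first d axes as in the proof of Theorem 1): a unit vector r is constructed within I at angle θ from ΠV, and cos ω is compared with cos ψ cos θ.

import numpy as np

rng = np.random.default_rng(3)
D, d, theta = 50, 5, 0.3
V = rng.standard_normal(D); V /= np.linalg.norm(V)               # unit splitter normal
PiV = np.zeros(D); PiV[:d] = V[:d]; PiV /= np.linalg.norm(PiV)   # normalized projection of V onto I
u = np.zeros(D); u[:d] = rng.standard_normal(d)                  # a second direction within I
u -= (u @ PiV) * PiV; u /= np.linalg.norm(u)                     # made orthogonal to PiV, still within I
r = np.cos(theta) * PiV + np.sin(theta) * u                      # unit vector in I at angle theta from PiV
print(V @ r, (V @ PiV) * np.cos(theta))                          # cos(omega) and cos(psi)cos(theta) agree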

Now, since ψ is complementary to a (Fig. 5), the pruning multiplier 1/sin(a) equals 1/cos(ψ). We do not want to overestimate this multiplier, or we will be pruning populated regions. So, we do not want to underestimate cos ψ.

Now, given that our smallest observed angle to V is φ, and we are certain that at least one of the sampled vectors lies within θ of ΠV, we know from Theorem 2 that cos φ ≥ cos θ cos ψ, and hence cos φ / cos θ ≥ cos ψ. So the multiplier that we will use will be 1/(cos φ / cos θ) = cos θ / cos φ. So by this transformation we are now certain that we will not be pruning any populated regions.
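Pulling these pieces together, the following sketch (illustrative Python with hypothetical parameter choices, not the patent's implementation) estimates the safe pruning multiplier for a cell: random vectors are sampled from the centre of the cell, the smallest angle φ they make with the splitter normal V is found, and cos θ / cos φ is returned, so that, provided the sampling guarantee θ holds, the multiplier never exceeds the true value 1/sin(a). Taking the absolute value of the dot products treats a sampled vector and its negation as making the same angle with the line through V, which is a modelling choice made here for simplicity.

import numpy as np

def pruning_multiplier(cell_points, V, theta, n_samples=5000, rng=np.random.default_rng(4)):
    # Returns cos(theta) / cos(phi), where phi is the smallest sampled angle to the splitter normal V.
    centre = cell_points.mean(axis=0)
    idx = rng.integers(0, len(cell_points), size=n_samples)
    vectors = cell_points[idx] - centre                  # random vectors within the data of this cell
    norms = np.linalg.norm(vectors, axis=1)
    keep = norms > 0
    vectors = vectors[keep] / norms[keep][:, None]
    cos_angles = np.abs(vectors @ (V / np.linalg.norm(V)))
    cos_phi = cos_angles.max()                           # smallest observed angle phi to the normal V
    return np.cos(theta) / cos_phi                       # never exceeds the true multiplier 1/sin(a)

# Used with the earlier test: QB = QC * pruning_multiplier(...), and the far cell is pruned if QB > QG.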

This also shows the importance of minimizing the error angle, since the larger our multiplier the more aggressive our pruning, and it is very possible that there are superior sampling techniques to do this.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

References

[1] Dasgupta, S. and Freund, Y., "Random projection trees for vector quantization", 46th Annual Allerton Conference on Communication, Control, and Computing, September 2008.

[2] Liu, T., Moore, A. W., Gray, A. and Yang, K., "An Investigation of Practical Approximate Nearest Neighbor Algorithms", NIPS, 2004.