Title:
SEEDING METHOD FOR K-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS
Document Type and Number:
WIPO Patent Application WO/2008/022341
Kind Code:
A3
Abstract:
Techniques for determining a collection of seeds for use in a clustering algorithm. After one or more seeds are selected by any of various methods, additional seeds are selected by randomly selecting a point from a set of data points with the probability of selecting a given point being proportional to the distance from the given point to the nearest previously selected seed. An abundant number of seeds may be initially selected, and the collection of seeds culled by an iterative process.

Inventors:
OSTROVSKY RAFAIL (US)
RABANI YUVAL (IL)
SCHULMAN LEONARD J (US)
SWAMY CHAITANYA (CA)
Application Number:
PCT/US2007/076258
Publication Date:
December 31, 2008
Filing Date:
August 18, 2007
Assignee:
UNIV CALIFORNIA (US)
TECHNION RES & DEV FOUNDATION (IL)
CALIFORNIA INST OF TECHN (US)
OSTROVSKY RAFAIL (US)
RABANI YUVAL (IL)
SCHULMAN LEONARD J (US)
SWAMY CHAITANYA (CA)
International Classes:
G06K9/62; G06K9/00
Foreign References:
US20040022442A12004-02-05
US6973212B22005-12-06
US6944338B22005-09-13
Other References:
KANUNGO ET AL.: "The Analysis of a Simple k-Means Clustering Algorithm", PROC. 16TH ACM SYMP. ON COMPUTATIONAL GEOMETRY, 2000, pages 101 - 109
KANUNGO ET AL.: "An efficient k-Means Clustering Algorithm: Analysis and Implementation", IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 24, no. 7, July 2002 (2002-07-01), pages 1 - 12, 881 - 892, XP011094536, DOI: doi:10.1109/TPAMI.2002.1017616
Attorney, Agent or Firm:
SERAFINI, Andrew, T. et al. (Silicon Valley Center, 801 California Street, Mountain View, CA, US)
Claims:
What is Claimed:

1. A method for selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the method comprising: selecting at least one seed; and selecting at least one additional seed by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to a distance from the given point to the closest previously selected seed.

2. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting a point at random from a set of data points wherein the probability of selecting a given point from the set of data points is determined by the distance of the given point from the centroid of the set of data points.

3. The method as recited in claim 2, wherein the probability of selecting a given point from the set of data points is proportional to a constant plus the ratio of the distance between the given point and the centroid to the average over all points in the data set of the distance between each point and the centroid.

4. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting the centroid of a set of at least some of the points in the set of data points as a seed.

5. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting at least one point uniformly at random from the set of data points.

6. The method as recited in claim 1 wherein the distance between two points is computed as the squared Euclidean distance between the two points in the multi-dimensional space.

7. A method for selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the method comprising: determining a desired number of seeds; selecting a set of seeds wherein the number of seeds in the set of seeds is greater than the desired number of seeds, and wherein at least one of the seeds in the initial set of seeds is selected by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to the closest previously selected seed; and iteratively removing seeds from the set of seeds until the number of seeds in the set of remaining seeds is the desired number of seeds.

8. The method as recited in claim 7, wherein iteratively removing seeds comprises: initializing a current set of seeds as the selected set of seeds; and repeating the steps of: determining for each seed x in the current set of seeds a cost of clustering in relation to a clustering based on a set of seeds consisting of the current set of seeds except x; identifying a seed x₀ for which its determined cost of clustering was minimum among the determined costs; removing the seed x₀ from the current set of seeds to obtain an updated set of seeds; determining a clustering based on the updated set of seeds; and relocating each seed in the updated set of seeds to the centroid of the cluster in the clustering based on the updated set of seeds containing the seed; until the updated set of seeds contains the desired number of seeds.

9. A computer readable medium comprising computer executable instructions for selecting a set of seeds for use in clustering data points, the data points represented by points in a multidimensional space, the computer executable instructions comprising computer executable instructions for: selecting at least one seed; and selecting at least one additional seed by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to a closest previously selected seed.

10. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting a first seed comprise computer executable instructions for: selecting a point at random from a set of data points wherein the probability of selecting a given point from the set of data points is determined by the distance of the given point from the centroid of the set of data points.

11. The computer readable medium as recited in claim 10, wherein the probability of selecting a given point from the set of data points is proportional to a constant plus the ratio of the distance between the given point and the centroid to the average over all points in the data set of the distance between each point and the centroid.

12. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting at least one seed comprise computer executable instructions for: selecting the centroid of a set of at least some of the points in the set of data points.

13. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting at least one seed comprise computer executable instructions for: selecting a point uniformly at random from the set of data points.

14. A computer readable medium comprising computer executable instructions for use in selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the computer executable instructions comprising computer executable instructions for: determining a desired number of seeds; selecting a set of seeds wherein the number of seeds in the set of seeds is greater than the desired number of seeds, and wherein at least one of the seeds in the initial set of seeds is selected by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to a closest previously selected seed; and iteratively removing seeds from the set of seeds until the number of seeds in the set of remaining seeds is the desired number of seeds.

16. The computer readable medium as recited in claim 15, wherein the computer executable instructions for iteratively removing seeds comprise computer executable instructions for: initializing a current set of seeds as the selected set of seeds; and repeating the steps of: determining for each seed x in the current set of seeds a cost of clustering in relation to a clustering based on a set of seeds consisting of the current set of seeds except x; identifying a seed x₀ for which its determined cost of clustering was minimum among the determined costs; removing the seed x₀ from the current set of seeds to obtain an updated set of seeds; determining a clustering based on the updated set of seeds; and relocating each seed in the updated set of seeds to the centroid of the cluster in the clustering based on the updated set of seeds containing the seed; until the updated set of seeds contains the desired number of seeds.

Description:

SEEDING METHOD FOR K-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application No. 60/822,903, filed August 18, 2006 and entitled "SEEDING METHOD FOR K-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS", hereby incorporated by reference in its entirety.

BACKGROUND

[0002] Clustering is performed in many applications, including the areas of pattern recognition, image analysis, data mining, search engines, bioinformatics, fraud detection, and the like, in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar. Generally, in clustering, it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.

[0003] To perform clustering on a data set of data items, each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space, and then clusters of points, sometimes called clouds or groups of points, are identified within such space. That is, given enough points, distinct groups of points are identified, where each group has a number of points in relatively close proximity such that the group is a clustering of such points.

[0004] Notably, the number of points derived from a data set can be relatively large, perhaps even on the order of millions, tens of millions, and perhaps even hundreds of millions. Accordingly, identifying clusters within a space of such data points is not a trivial matter. One method for doing so includes the use of Lloyd-type k-means clustering.

[0005] In a grossly simplified example in connection with such Lloyd-type k-means clustering, it is to be assumed that a data set produces two clusters. Note, however, that the number of clusters is typically more, and can either be pre-defined or can be derived from the data set in multiple fashions. Regardless, with the assumed two clusters, the center of each such cluster within the high-dimensional space is found, and then each point is associated with one or the other of the clusters based on distance within such space to the found center of each cluster. Typically each point is associated with the cluster whose center is closest to such point. Distance within the space may be computed by any suitable metric. We will often make use of the squared Euclidean distance as the measure of the distance between two points.

[0006] Finding the center of each cluster is not a trivial matter. Accordingly, Lloyd-type k-means clustering may be employed to do so. Briefly, in classic Lloyd-type k-means clustering, and again presuming two clusters merely for the sake of explanation, a potential center for each cluster is defined, based on some set of parameters, or even randomly, where each such potential center is in effect a seed to initialize an iterative process. In such iterative process, some number of points closest to each seed are located, and for the located points for each seed, the center of such located points is found in a known manner, and the seed is relocated to such found center. The iterative process is repeated a number of times until some terminating event or condition occurs. Such event or condition can for example be that the movement of each seed effectively settles down to a more-or-less stationary location or settled center, or can be a fixed number of iterations. Of course, it may be that at least some of the seeds never settle down to a settled center, in which case the iterations may be stopped after some predetermined number of iterations. In such a case, it may be that the seeds were initially poorly chosen, in which case new seeds may be picked and the process repeated.

SUMMARY

[0007] An issue arises in connection with Lloyd-type k-means clustering in that it may be that two seeds as chosen are effectively in one 'true' cluster and none are in the other 'true' cluster, where each true cluster is defined as the cluster that ought to be found. In such a case, and as may be appreciated, the seeds may result in two settled centers within the one true cluster and no settled centers in the other true cluster, and the points in the true clusters likely end up being assigned to the two settled centers in some undesirable fashion.

[0008] Accordingly, a need exists for systems and methods for performing Lloyd-type k-means clustering that more accurately define an initial seed for each cluster prior to performing the iterative process of such Lloyd-type k-means clustering. More particularly, a need exists for such systems and methods that employ a procedure to at least roughly identify the initial location of each true cluster based on distances between at least a sampling of all the points in the data set.

[0009] The aforementioned needs are satisfied at least in part by a method of seeding that is performed with regard to performing clustering on a data set of data items, each of which has been transformed into a point in a multi-dimensional space. A number of clusters are defined within the space, and a center of each cluster is initially located by selecting an initial seed as a potential center for each cluster. An iterative process is then performed to move each seed to what eventually becomes the located center of a cluster.

[0010] In some embodiments of the method, a first initial seed for a first cluster is selected by:

(i) receiving a number $n$ of the points, each of which is numbered $D_i$ for $i = 1, 2, 3, \ldots, n$;

(ii) calculating a center of gravity, or centroid, $C$ of the $n$ points by taking the average on each coordinate over the $n$ points;

(iii) calculating, for each of the $n$ points, a squared distance $C_i$ of such point $D_i$ to $C$;

(iv) calculating an average squared distance as $\bar{C} = \frac{1}{n}\sum_{i=1}^{n} C_i$;

(v) selecting a point $\mu_1$ from among the $n$ points at random as the first initial seed. The selection may, in some embodiments, be performed randomly with a uniform probability of picking any point, or in other embodiments may be based on an assigned probability distribution with, for example, the probability of picking a point $D_i$ as $\mu_1$ being $\frac{\bar{C} + C_i}{2n\bar{C}}$.

In still other embodiments, the first initial seed may be selected as the centroid of the data points or the centroid of a sampling of the data points.

[0011] Thereafter, an initial seed for each successive cluster may be selected by:

(i) receiving an integer $k$ between 2 and $n - 1$ as the number of seeds that are to be produced; and

(ii) selecting points $\mu_2, \ldots, \mu_k$ as seeds according to an iterative process, where, supposing $\mu_1, \ldots, \mu_{j-1}$ have already been chosen, $\mu_j$ is selected by:

- for each $1 \le i \le n$, calculating $C'_i$ as the squared distance between $D_i$ and $\mu_{j-1}$;

- for each $1 \le i \le n$, letting $C_i$ denote the minimum of the previous $C_i$ and $C'_i$, so that $C_i$ is the squared distance from $D_i$ to its nearest previously selected seed;

- updating $\bar{C}$ to be the average of $C_i$ over all $1 \le i \le n$; and

- picking $\mu_j = D_i$ with assigned probability $\frac{C_i}{n\bar{C}}$.
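By way of example and not limitation, the following is a minimal Python sketch of the seeding process set forth above; the function name select_seeds and the NumPy-based implementation are illustrative choices rather than required features.

```python
import numpy as np

def select_seeds(points, k, rng=None):
    """Select k seeds from an n x d array as described above: the first seed is
    drawn with probability (C_bar + C_i) / (2 n C_bar), and each further seed is
    drawn with probability proportional to the squared distance from the point
    to its nearest previously selected seed."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)

    # Steps (i)-(v): centroid, per-point squared distances, their average, first seed.
    centroid = points.mean(axis=0)
    sq_dist = ((points - centroid) ** 2).sum(axis=1)   # C_i
    avg = sq_dist.mean()                               # C_bar
    probs = (avg + sq_dist) / (2.0 * n * avg)
    seeds = [points[rng.choice(n, p=probs)]]

    # Successive seeds: C_i becomes the squared distance to the nearest previously
    # selected seed, and D_i is picked with probability C_i / (n * C_bar).
    nearest = ((points - seeds[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        seeds.append(points[rng.choice(n, p=nearest / nearest.sum())])
        # Update of the nearest-seed squared distances after adding the new seed.
        nearest = np.minimum(nearest, ((points - seeds[-1]) ** 2).sum(axis=1))
    return np.array(seeds)
```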

[0012] In some embodiments, more than one initial seed may be selected before undertaking the iterative process described above for determining remaining seeds. In some embodiments, an abundance of seeds may be selected and later culled according to processes such as described below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The foregoing summary, as well as the following detailed description of various embodiments of the present invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the embodiments, there are shown in the drawings embodiments which are presently preferred. As should be understood, however, the embodiments of the present invention are not limited to the precise arrangements and instrumentalities shown. In the drawings:

[0014] Fig. 1 is a block diagram of an example of a computing environment within which various embodiments of the present invention may be implemented;

[0015] Fig. 2 is a conceptual diagram showing a space within which a number of data points from a data set are located;

[0016] Fig. 3 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters;

[0017] Fig. 4 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters; and

[0018] Fig. 5 is a flow diagram showing steps performed when selecting the initial seeds for clusters in accordance with various embodiments presented herein.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

Example Computing Environment

[0019] Fig. 1 is set forth herein as an exemplary computing environment in which various embodiments of the present invention may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality. Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multi-processor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.

[0020] Computer-executable instructions such as program modules executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.

[0021] With reference to Fig. 1, an exemplary system for implementing aspects described herein includes a computing device, such as computing device 100. In its most basic configuration, computing device 100 typically includes at least one processing unit 102 and memory 104. Depending on the exact configuration and type of computing device, memory 104 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in Fig. 1 by dashed line 106. Computing device 100 may have additional features/functionality. For example, computing device 100 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in Fig. 1 by removable storage 108 and non-removable storage 110.

[0022] Computing device 100 typically includes or is provided with a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.

[0023] Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 104, removable storage 108, and non-removable storage 110 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computing device 100. Any such computer storage media may be part of computing device 100.

[0024] Computing device 100 may also contain communications connection(s) 112 that allow the device to communicate with other devices. Each such communications connection 112 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.

[0025] Computing device 100 may also have input device(s) 114 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 116 such as a display, speakers, printer, etc. may also be included. All these devices are generally known to the relevant public and therefore need not be discussed in any detail herein except as provided.

[0026] Notably, computing device 100 may be one of a plurality of computing devices 100 inter-connected by a network 118, as is shown in Fig. 1. As may be appreciated, the network 118 may be any appropriate network, each computing device 100 may be connected thereto by way of a connection 112 in any appropriate manner, and each computing device 100 may communicate with one or more of the other computing devices 100 in the network 118 in any appropriate manner. For example, the network 118 may be a wired or wireless network within an organization or home or the like, and may include a direct or indirect coupling to an external network such as the Internet or the like.

[0027] It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter. In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application-program interface (API), reusable controls, or the like. Such programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.

[0028] Although exemplary embodiments may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network 118 or a distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices in a network 118. Such devices might include personal computers, network servers, and handheld devices, for example.

Clustering - Overview

[0029] It is to be appreciated that clustering is performed in many applications in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar. Generally, in clustering, it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion. One example of the use of clustering is fraud detection such as that which may be performed by a bank or other financial institution to find transactions that are improper and which should be prevented. Typically, in such fraud detection, patterns of proper transactions are identified, and transactions that fall outside of or are dissimilar to such proper transactions are identified and then scrutinized for possible identification as being fraudulent. Of course, clustering may be performed with regard to any other type of data set without departing from the spirit and scope of the present invention.

[0030] To perform clustering on a data set of data items such as the aforementioned transactions, each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space. Fig. 2 is an example of a collection of points 10 and a set of two clusters 12, 14 delineated by the dotted curves. A goal of a clustering algorithm is to identify within a collection of points 10 a set of clusters 12, 14 of points. The set of clusters may partition the collection of points 10 into a set of distinct clusters 12, 14. Note that by its nature the drawing of Fig. 2 is two-dimensional. Nevertheless, it is to be understood that the objects shown in the two-dimensional drawing of Fig. 2 can readily be extended to a greater number of dimensions.

[0031] Distinct clusters 12, 14 are identified, where each cluster 12, 14 has a number of points in relatively close proximity to each other. As seen in Fig. 2, it may be that there are some points that are relatively close to each other, in which case such points and the data items represented by them are relatively similar in some sense. Likewise, it may be that there are points 10 that are relatively far from each other, in which case such points and the data items represented by them are relatively dissimilar.

[0032] Points that are relatively closer to the center of a cluster 12, 14 tend to be more closely identifiable as being part of such cluster 12, 14, while points that are relatively farther from the center of a cluster 12, 14 tend to be less identifiable as being part of such cluster 12, 14. With regard to the aforementioned fraud detection example, points that are relatively far from the center of any cluster that represents legitimate transactions may be more closely scrutinized as being at least potentially fraudulent.

[0033] Lloyd-type k-means clustering is used to find a center of each cluster, such that each point can then be associated with a particular cluster based on distance within such space between the point and the center of the cluster. In particular, each point is associated with the cluster whose center is nearest the point. Finding distances within high-dimensional space and associating a point with a particular cluster is known or should be apparent to the relevant public and therefore need not be set forth herein in detail. Accordingly, any method for finding such distance within such space and associating such point with a particular cluster may be employed without departing from the spirit and scope of the present invention.

[0034] Briefly, classic Lloyd-type k-means clustering involves selecting an initial collection of seeds as potential centers for each cluster. The initial selection may be based on some set of parameters, or may even be performed randomly. Various techniques for this initial selection are described below. An iterative process is then performed wherein the following steps are repeated: each point is assigned to a cluster associated with the potential cluster center nearest the point, and then the centroid of all points (or some sampling of points) in each cluster is computed and a new cluster center for that cluster is located at the centroid of that cluster. The iterative process is repeated until some termination condition is met. For example, the termination condition might be that the cluster centers move less than a predetermined amount from one iteration to the next, or that some predetermined number of iterations is performed.
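By way of example and not limitation, the following is a minimal Python sketch of the iterative process just described; the function name and the particular termination test (a fixed iteration cap combined with a small movement threshold) are illustrative choices.

```python
import numpy as np

def lloyd_iterations(points, centers, max_iters=100, tol=1e-6):
    """Repeat the two steps described above: assign each point to its nearest
    center, then move every center to the centroid of its cluster, stopping when
    the centers settle (move less than tol) or after max_iters passes."""
    centers = np.asarray(centers, dtype=float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: nearest center under squared Euclidean distance.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Relocation step: each center moves to the centroid of its cluster;
        # a center with no assigned points is left where it is.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:
            break
    return centers, labels
```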

[0035] Fig. 3 provides an example of an initial stage of a Lloyd-type k-means clustering algorithm for two clusters. A collection of points 30 is provided. Initial seeds μ1 and μ2 have been chosen and are designated as initial potential centers for the two clusters 32, 34. Techniques for selecting the initial seeds are discussed below. Each point in the collection of points 30 is then associated with the cluster whose center is closest to the point. Thus points nearer to μ1 than μ2, i.e., points above line 36, are associated with one cluster 32 and points nearer to μ2 than μ1, i.e., points below line 36, are associated with the other cluster 34. The centroid of the points in each cluster 32, 34 is computed and new cluster centers, V1, V2 respectively, are located at those centroids.

[0036] Referring now to Fig. 4, each point in the collection of points 30 is once again associated with the cluster whose center is closest to the point. Thus points nearer to V1 than V2, i.e., points above and to the right of line 38, are associated with one cluster 40 and points nearer to V2 than V1, i.e., points below and to the left of line 38, are associated with the other cluster 42. The process is repeated, determining new cluster centers and perforce new clusters, until some termination condition is met.

[0037] Such Lloyd-type k-means clustering is generally expected to work well if the initial seeds are well-chosen. However, if, for example, multiple seeds are chosen within one 'true' cluster (i.e., a cluster that ought to be found) or if multiple seeds iteratively migrate to one true cluster, the process can result in multiple cluster centers within the one true cluster and no cluster centers in another true cluster, and the points in the true clusters may end up being assigned to the clusters in some false and perhaps even bizarre fashion.

[0038] Clustering in general is a venerable statistical problem of both practical and theoretical interest. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures. For decades, practitioners have been using heuristics of great speed but uncertain merit; the latter should not be surprising since the problem is hard in almost any formulation. However, in the last few years, algorithms researchers have made considerable innovations, and even obtained polynomial-time approximation schemes (PTAS's) for some of the most popular clustering formulations. Yet these contributions have not had a noticeable impact on practice. Practitioners instead continue to use a variety of iterative heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees.

[0039] There are two ways to approach this disjuncture. One is to continue developing new techniques until they are so good— down to the implementations— that they displace entrenched methods. The other is to look toward popular heuristics and ask whether there are reasons that justify their extensive use, but elude the standard theoretical criteria; and in addition, whether theoretical scrutiny suggests improvements in their application. This is the approach taken in embodiments of the present invention.

[0040] As in other prominent cases such an analysis involves some abandonment of the worst-case inputs criterion. In fact, part of the challenge is to identify simple conditions on the input which allow one to prove a performance guarantee of wide applicability. Our starting point is the notion that one should be most concerned with clustering data that possesses a meaningful k-clustering. Two examples of settings where one would intuitively not consider the data to possess a meaningful k-clustering are as follows. If nearly optimum cost can be achieved by two very different k-way partitions of the data then the identity of the optimal partition carries little meaning. For example, if the data was generated by random sampling from a source, then the optimal cluster regions might shift drastically upon resampling. Alternatively, if a near-optimal k-clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal k-clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k-clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.

[0041] Specifically, we consider the following k-means formulation of clustering: given a finite set $X \subset \mathbb{R}^d$, find $k$ points or centers to minimize the sum over all points $x \in X$ of the squared distance between $x$ and the center to which it is assigned. This sum will sometimes be referred to as the cost of the clustering.

[0042] In an optimal solution, each center is associated with all data in its Voronoi region, and is located at the center of mass of this data. Perhaps the most popular heuristic used for this problem is Lloyd's method, which consists of the following two phases: (a) 'Seed' the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) Iterate the following Lloyd step until the clustering is 'good enough': cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.

[0043] Although Lloyd-style methods are widely used, to our knowledge, there is no known mathematical analysis that attempts to explain or predict the performance of these heuristics on high-dimensional data. If the data is well-clusterable according to a certain 'clusterability' or 'separation' condition that we discuss below, then various Lloyd-style methods do indeed perform well and return a near-optimal clustering. Our contributions are threefold:

[0044] (1) We introduce a separation condition and justify it as a reasonable abstraction of well-clusterability for the analysis of k-means clustering algorithms. Our condition is simple, and abstracts a notion of well-clusterability alluded to earlier: letting $\Delta_k^2(X)$ denote the cost of an optimal k-means solution of input $X$, we say that $X$ is ε-separated for k-means if $\Delta_k^2(X) / \Delta_{k-1}^2(X) \le \varepsilon^2$.

[0045] One motivation for proposing this condition is that a significant drop in the k-clustering cost is already used in practice as a diagnostic for choosing the value of k. Furthermore, we show below that: (i) The data satisfies our separation condition if and only if it satisfies the other intuitive notion of well-clusterability suggested earlier, namely that any two low-cost k-clusterings disagree on only a small fraction of the data; and (ii) The condition is robust under noisy (even adversarial) perturbation of the data.

[0046] (2) We present a novel and efficient sampling process for seeding Lloyd's method with initial centers. The disclosed seeding procedures will allow us to prove the effectiveness of these methods.

[0047] (3) We demonstrate the effectiveness of our variants of the Lloyd heuristic under the separation condition. Specifically: (i) Our simplest variant uses only the new seeding procedure, requires a single Lloyd-type descent step, and achieves a constant-factor approximation in time linear in |X|. This algorithm has success probability exponentially small in k, but we show that (ii) a slightly more complicated seeding process based on our sampling procedure yields a constant-factor approximation guarantee with constant probability, again in linear time. Since only one run of seeding+descent is required in both algorithms, these are candidates for being faster in practice than currently used Lloyd variants, which are used with multiple re-seedings and many Lloyd steps per re-seeding. (iii) We also give a PTAS (polynomial time approximation scheme) by combining our seeding process with a sampling procedure whose running time is linear in |X| and the dimension, and exponential in k. This PTAS is significantly faster, and also simpler.

Literature And Problem Formulation

[0048] Let $X \subset \mathbb{R}^d$ be the given point set and $n = |X|$. In the k-means problem, the objective is to partition $X$ into $k$ clusters $X_1, \ldots, X_k$ and assign each point in every cluster $X_i$ to a common center $c_i \in \mathbb{R}^d$, so as to minimize the k-means cost, defined as $\sum_{i=1}^{k} \sum_{x \in X_i} \|x - c_i\|^2$, where $\|\cdot\|$ denotes the $\ell_2$ norm. We let $\Delta_k^2(X)$ denote the optimum k-means cost. Observe that given the centers $c_1, \ldots, c_k$, it is easy to determine the best clustering corresponding to these centers: cluster $X_i$ simply consists of all points $x \in X$ for which $c_i$ is the nearest center, breaking ties arbitrarily. Conversely, given a clustering $X_1, \ldots, X_k$, the best centers corresponding to this clustering are obtained by setting $c_i$ to be the center of mass (centroid) of cluster $X_i$, that is, setting $c_i = \frac{1}{|X_i|}\sum_{x \in X_i} x$. It follows that both of these properties simultaneously hold in an optimal solution, that is, $c_i$ is the centroid of cluster $X_i$, and each point in $X_i$ has $c_i$ as its nearest center.

Preliminaries

[0049] We use the following notation throughout. For a point set $S$, we use $\mathrm{ctr}(S)$ to denote the center of mass of $S$. Let the partition $X_1 \cup \ldots \cup X_k = X$ be an optimal k-means clustering of the input $X$, and let $c_i = \mathrm{ctr}(X_i)$ and $c = \mathrm{ctr}(X)$. So $\Delta_k^2(X) = \sum_{i=1}^{k} \sum_{x \in X_i} \|x - c_i\|^2$. Let $n_i = |X_i|$, $n = |X|$, and $r_i^2 = \Delta_1^2(X_i)/n_i$, that is, $r_i^2$ is the 'mean squared error' in cluster $X_i$. Define $D_i = \min_{j \ne i} \|c_i - c_j\|$. We assume throughout that $X$ is ε-separated for k-means, that is, $\Delta_k^2(X) \le \varepsilon^2\,\Delta_{k-1}^2(X)$, where $0 < \varepsilon \le \varepsilon_0$ with $\varepsilon_0$ being a suitably small constant. We use the following basic lemmas quite frequently.

[0050] Lemma 2.1: For every $x$, $\sum_{y \in X} \|x - y\|^2 = \Delta_1^2(X) + n\,\|x - c\|^2$. Hence $\sum_{\{x,y\} \subseteq X} \|x - y\|^2 = n\,\Delta_1^2(X)$.

[0051] Lemma 2.2: Let $S \subset \mathbb{R}^d$, and let $S_1 \cup S_2$ be a partition of $S$ with $S_1, S_2 \ne \emptyset$. Let $\bar{s}, \bar{s}_1, \bar{s}_2$ denote respectively $\mathrm{ctr}(S), \mathrm{ctr}(S_1), \mathrm{ctr}(S_2)$. Then:

(i): $\Delta_1^2(S) = \Delta_1^2(S_1) + \Delta_1^2(S_2) + \frac{|S_1|\,|S_2|}{|S|}\,\|\bar{s}_1 - \bar{s}_2\|^2$, and

(ii): $\|\bar{s}_1 - \bar{s}\|^2 \le \frac{\Delta_1^2(S)}{|S|} \cdot \frac{|S_2|}{|S_1|}$.

The 2-Means Problem

[0052] We first consider the 2-means case. We assume that the input X is ε-separated for 2-means. We present an algorithm that returns a solution of cost at most $(1 + f(\varepsilon))\,\Delta_2^2(X)$ in linear time, for a suitably defined function $f$ that satisfies $\lim_{\varepsilon \to 0} f(\varepsilon) = 0$. An appealing feature of our technique is its simplicity, both in description and analysis. When we consider the k-means case below, we will build upon this technique to obtain both a linear-time constant-factor (of the form $1 + f(\varepsilon)$) approximation algorithm and a PTAS with running time exponential in k, but linear in n and d.

[0053] Our 2-means technique provides a novel non-uniform sampling process to pick two seed centers. In our sampling process we pick the pair $x, y \in X$ with probability proportional to $\|x - y\|^2$. This biases the distribution towards pairs that contribute a large amount to $\Delta_1^2(X)$ (noting that $n\,\Delta_1^2(X) = \sum_{\{x,y\} \subseteq X} \|x - y\|^2$). This seeding method is new to the vast literature on the k-means problem. By putting more weight on pairs that contribute a lot to $\Delta_1^2(X)$, the sampling process aims to pick the initial centers from the cores of the two optimal clusters. We define the core of a cluster precisely later, but loosely speaking, it consists of points in the cluster that are significantly closer to this cluster center than to any other center. Lemmas 3.1 and 3.2 below make the benefits of this approach precise. Thus, in essence, we are able to leverage the separation condition to nearly isolate the optimal centers. Once we have the initial centers within the cores of the two optimal clusters, we show that a simple Lloyd-like step, which is also simple to analyze, yields a good performance: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers.

[0054] In particular, randomly select a pair of points from the set X to serve as the initial centers, picking the pair $x, y$ in X with probability proportional to $\|x - y\|^2$. Let $\bar{c}_1, \bar{c}_2$ denote the two picked centers. For each $\bar{c}_i$, consider the ball of radius $\|\bar{c}_1 - \bar{c}_2\|/3$ around $\bar{c}_i$ and compute the centroid $\hat{c}_i$ of the portion of X in this ball. Return $\hat{c}_1, \hat{c}_2$ as the final centers.

[0055] The entire algorithm runs in time $O(nd)$. Step 2 clearly takes only $O(nd)$ time. We show that the sampling step can be implemented to run in $O(nd)$ time. Consider the following two-step sampling procedure:

(a) first pick center $\bar{c}_1$ by choosing a point $x \in X$ with probability equal to $\frac{\Delta_1^2(X) + n\,\|x - c\|^2}{2n\,\Delta_1^2(X)}$ (using Lemma 2.1);

(b) pick the second center by choosing a point $y \in X$ with probability equal to $\frac{\|y - \bar{c}_1\|^2}{\Delta_1^2(X) + n\,\|c - \bar{c}_1\|^2}$.
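By way of example and not limitation, the following Python sketch implements the two-step sampling of steps (a) and (b) together with the ball step of the preceding paragraph; the function name and the NumPy-based implementation are illustrative.

```python
import numpy as np

def two_means_seed_and_ball(points, rng=None):
    """Pick two initial centers with the two-step sampling above (equivalent to
    choosing a pair x, y with probability proportional to ||x - y||^2), then
    replace each center by the centroid of the points of X within distance
    ||c1 - c2|| / 3 of it."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    c = points.mean(axis=0)
    delta1 = ((points - c) ** 2).sum()                 # Delta_1^2(X)

    # (a) First center: probability (Delta_1^2(X) + n ||x - c||^2) / (2 n Delta_1^2(X)).
    p1 = (delta1 + n * ((points - c) ** 2).sum(axis=1)) / (2.0 * n * delta1)
    c1 = points[rng.choice(n, p=p1)]

    # (b) Second center: probability ||y - c1||^2 / (Delta_1^2(X) + n ||c - c1||^2).
    d_to_c1 = ((points - c1) ** 2).sum(axis=1)
    p2 = d_to_c1 / (delta1 + n * ((c - c1) ** 2).sum())
    c2 = points[rng.choice(n, p=p2)]

    # Ball step: centroid of the portion of X in the ball of radius ||c1 - c2|| / 3.
    radius2 = ((c1 - c2) ** 2).sum() / 9.0
    finals = []
    for ci in (c1, c2):
        ball = points[((points - ci) ** 2).sum(axis=1) <= radius2]
        finals.append(ball.mean(axis=0) if len(ball) else ci)
    return np.array(finals)
```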

[0056] The analysis hinges on the important fact that under the separation condition, the radius $r_i$ of each optimal cluster is substantially smaller than the inter-cluster separation $\|c_1 - c_2\|$ (Lemma 3.1 below). This allows us to show in Lemma 3.2 that with high probability, each initial center $\bar{c}_i$ lies in the core (suitably defined) of a distinct optimal cluster, say $X_i$, and hence $\|\bar{c}_1 - \bar{c}_2\|$ is much larger than the distances $\|\bar{c}_i - c_i\|$ for $i = 1, 2$. Assuming that $\bar{c}_1, \bar{c}_2$ lie in the cores of the clusters, we show in Lemma 3.3 that the ball around $\bar{c}_i$ contains only, and most of, the mass of cluster $X_i$, and therefore the centroid $\hat{c}_i$ of this ball is very 'close' to $c_i$. This in turn implies that the cost of the clustering around $\hat{c}_1, \hat{c}_2$ is relatively small.

[0057] Lemma 3.1: $\max(r_1^2, r_2^2) = O(\varepsilon^2)\,\|c_1 - c_2\|^2$.

Let $\rho = \frac{100\,\varepsilon^2}{1 - \varepsilon^2}$ (where ε is sufficiently small so that $\rho < 1$). We define the core of cluster $X_i$ as the set $X_i^{cor} = \{x \in X_i : \|x - c_i\|^2 \le r_i^2/\rho\}$. By Markov's inequality, $|X_i^{cor}| \ge (1 - \rho)\,n_i$.

[0058] Lemma 3.2: $\Pr[\{\bar{c}_1, \bar{c}_2\} \cap X_1^{cor} \ne \emptyset \text{ and } \{\bar{c}_1, \bar{c}_2\} \cap X_2^{cor} \ne \emptyset] = 1 - O(\rho)$.

[0059] Lemma 3.3: For each $i$, we have $X_i^{cor} \subseteq B_i \subseteq X_i$, where $B_i$ is the ball around $\bar{c}_i$ considered above. Hence, $\|\hat{c}_i - c_i\|^2 \le \frac{\rho}{1-\rho}\,r_i^2$.

[0060] Theorem 3.4: The above algorithm returns a clustering of cost at most $\frac{\Delta_2^2(X)}{1-\rho}$ with probability at least $1 - O(\rho)$ in time $O(nd)$, where $\rho = O(\varepsilon^2)$.

THE k-MEANS PROBLEM

[0061] We now consider the k-means setting. We assume that $\Delta_k^2(X) \le \varepsilon^2\,\Delta_{k-1}^2(X)$. We describe a linear-time constant-factor approximation algorithm, and a PTAS that returns a $(1+\omega)$-optimal solution in time $O(2^{O(k/\omega)}\,nd)$. The techniques consist of various ingredients, which we describe separately first for ease of understanding, before gluing them together to obtain the final algorithm.

[0062] Conceptually both techniques proceed in two stages. The first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are k seed centers positioned at nearly the right locations. By this we mean that if we consider distances at the scale of the inter-cluster separation, then at the end of this stage, each optimal center has a (distinct) initial center located in close proximity. This is precisely the leverage that we obtain from the k-means separation condition (as in the 2-means case). We employ three simple seeding procedures, with varying time vs. quality guarantees, that exploit this separation condition to seed the k centers at locations very close to the optimal centers. We consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the k initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in k.

[0063] We present a very simple deterministic greedy deletion procedure, where we start off with all points in X as the centers and then greedily delete points (and move centers) until there are k centers left. The running time here is $O(n^3 d)$. For the k-means problem, if $\Delta_k^2(X) \le \varepsilon^2\,\Delta_{k-1}^2(X)$, we show that our greedy deletion procedure followed by a clean-up step (in the second stage) yields a $(1+f(\varepsilon))$-approximation algorithm. Finally, we combine the sampling and deletion procedures to obtain an $O(nkd + k^3 d)$-time initialization procedure. We sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and then run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers.

[0064] Once the initial centers have been positioned sufficiently close to the optimal centers, we can proceed in two ways in the second stage. One option is to use a ball-k-means step, as in 2-means, which yields a clustering of cost $(1 + f(\varepsilon))\,\Delta_k^2(X)$ due to exactly the same reasons as in the 2-means case. Thus, combined with the initialization procedure, this yields a constant-factor approximation algorithm with running time $O(nkd + k^3 d)$.

[0065] Another option, which yields a PTAS, is, for each initial center, to compute a list of candidate centers for the corresponding optimal cluster as follows: we sample a small set of points uniformly at random from a slightly expanded Voronoi region of the initial center, and consider the centroid of every subset of the sampled set of a certain size as a candidate. We exhaustively search for the k candidates (picking one candidate per initial center) that yield the least cost solution, and output these as our final centers. The fact that each optimal center $c_i$ has an initial center in close proximity allows us to argue that the entire optimal cluster $X_i$ is contained in the expanded Voronoi region of this initial center, and moreover that $|X_i|$ is a significant fraction of the total mass in this region. Given this property, a random sample from the expanded Voronoi region also (essentially) yields a random sample from $X_i$, which allows us to compute a good estimate of the centroid of $X_i$, and hence of $\Delta_1^2(X_i)$.

[0066] We obtain a $(1+\omega)$-optimal solution in time $O(2^{O(k/\omega)}\,nd)$ with constant probability. Since we incur an exponential dependence on k anyway, we just use the simple sampling procedure in the first stage to pick the k initial centers. Although the running time is exponential in k, it is significantly better than others and we obtain a simpler PTAS. Both of these features can be traced to the separation condition, which enables us to nearly isolate the positions of the optimal centers in the first stage.

Seeding Procedures Used in Stage I

[0067] Sampling. We pick k initial centers as follows: first pick two centers $\bar{c}_1, \bar{c}_2$ as in the 2-means case, that is, choose $x, y \in X$ with probability proportional to $\|x - y\|^2$. Suppose we have already picked $i \ge 2$ centers $\bar{c}_1, \ldots, \bar{c}_i$. Now pick a random point $x \in X$ with probability proportional to $\min_{j \in \{1,\ldots,i\}} \|x - \bar{c}_j\|^2$ and set that as center $\bar{c}_{i+1}$.

[0068] Running time. The sampling procedure consists of k iterations, each of which takes $O(nd)$ time. This is because after sampling a new point $\bar{c}_{i+1}$, we can update the quantity $\min_{j \in \{1,\ldots,i+1\}} \|x - \bar{c}_j\|^2$ for each point $x$ in $O(d)$ time. So the overall running time is $O(nkd)$.

[0069] Analysis. Let $\varepsilon^2 \ll \rho < 1$ be a parameter that we will set later. As in the 2-means case, we define the core of cluster $X_i$ as $X_i^{cor} = \{x \in X_i : \|x - c_i\|^2 \le r_i^2/\rho\}$. We show that under our separation assumption, the above sampling procedure will pick the k initial centers to lie in the cores of the clusters $X_1, \ldots, X_k$ with probability $(1 - O(\rho))^k$. We also show that if we sample more than k, but still O(k), points, then with constant probability, every cluster will contain a sampled point that lies in a somewhat larger core that we call the outer core of the cluster. This analysis will be useful below.

[0070] Lemma 4.1: With probability $1 - O(\rho)$, the first two centers $\bar{c}_1, \bar{c}_2$ lie in the cores of different clusters, that is, $\Pr[\bigcup_{i \ne j}(\bar{c}_1 \in X_i^{cor} \text{ and } \bar{c}_2 \in X_j^{cor})] = 1 - O(\rho)$.

Now inductively suppose that the first $i$ centers picked, $\bar{c}_1, \ldots, \bar{c}_i$, lie in the cores of clusters $X_{j_1}, \ldots, X_{j_i}$. We show that, conditioned on this event, $\bar{c}_{i+1}$ lies in the core of some cluster $X_t$ with $t \notin \{j_1, \ldots, j_i\}$ with probability $1 - O(\rho)$. Given a set $S$ of points, we use $d(x, S)$ to denote $\min_{y \in S} \|x - y\|$.

[0071] Lemma 4.2:

$\Pr[\bar{c}_{i+1} \in \bigcup_{t \notin \{j_1,\ldots,j_i\}} X_t^{cor} \mid \bar{c}_1, \ldots, \bar{c}_i \text{ lie in the cores of } X_{j_1}, \ldots, X_{j_i}] = 1 - O(\rho)$.

[0072] Next, we analyze the case when more than k points are sampled. Let $\rho_1 = \rho^3$. Define the outer core of $X_i$ to be $X_i^{out} = \{x \in X_i : \|x - c_i\|^2 \le r_i^2/\rho_1\}$. Note that $X_i^{cor} \subseteq X_i^{out} \subseteq X_i$. Let $N = \frac{k}{1-5\rho} + \frac{O(\log(1/\delta))}{(1-5\rho)^2}$, where $0 < \delta < 1$ is a desired error tolerance. One can show that at every sampling step, there is a constant probability that the sampled point lies in the core of some cluster whose outer core does not contain a previously sampled point. Observe that Lemma 4.3 below gives an unconditional bound. This allows us to argue, via a straightforward martingale analysis, that if we sample N points from X, then with some constant probability, each outer core $X_i^{out}$ will contain a sampled point. Lemma 4.4 and Corollary 4.5 below summarize the results.

[0073] Lemma 4.3: Suppose that we have sampled $i$ points $\{\bar{x}_1, \ldots, \bar{x}_i\}$ from X. Let $X_1, \ldots, X_m$ be all the clusters whose outer cores contain some sampled point $\bar{x}_j$. Then

$\Pr[\bar{x}_{i+1} \in \bigcup_{t=m+1}^{k} X_t^{cor}] \ge 1 - 5\rho$.

[0074] Lemma 4.4: Suppose we sample N points $\bar{x}_1, \ldots, \bar{x}_N$ from X using the above sampling procedure. Then $\Pr[\forall i = 1, \ldots, k, \text{ there exists some } \bar{x}_j \in X_i^{out}] \ge 1 - \delta$.

[0075] Corollary 4.5:

(i) If we sample k points $\bar{c}_1, \ldots, \bar{c}_k$, then with probability $(1 - O(\rho))^k$, where $\rho = \omega(\varepsilon^{2/3})$, each point $\bar{c}_i$ lies in a distinct core $X_i^{cor}$, so $\|\bar{c}_i - c_i\| \le r_i/\sqrt{\rho}$.

(ii) If we sample N points $\bar{c}_1, \ldots, \bar{c}_N$ with $\rho = \sqrt{\varepsilon}$, then with probability $1 - O(\rho)$, each outer core $X_i^{out}$ contains some $\bar{c}_j$, so $\|\bar{c}_j - c_i\| \le r_i/\sqrt{\rho^3}$.

Deletion / Merger Techniques

[0076] In some embodiments, it may be desirable to initially determine some abundant number of cluster centers, i.e., some number greater than k, the number of cluster centers desired at the end of the process, and then delete (sometimes called "merge") some cluster centers until k cluster centers remain. The initial abundant collection of cluster centers may be determined by any of the methods described herein, or by a hybrid of two or more of the methods. Some of the initial cluster centers may be selected by random methods and some may be selected by deterministic methods. Some may be chosen with respect to previously selected cluster centers and some may be chosen independently of previously selected cluster centers.

[0077] Once some abundant number of cluster centers has been selected, centers may be deleted or merged, for example by the techniques described below, until the desired number of cluster centers remain.

[0078] Greedy deletion procedure. We maintain a set of centers $\bar{C}$ that are used to cluster X. For purposes of the analysis below, we initialize $\bar{C} \leftarrow X$, i.e., the initial centers are taken to be all of the points of X, and delete centers until k centers remain. It should be understood, however, that for purposes of the techniques disclosed herein, the initial set of centers $\bar{C}$ may be selected in a variety of manners, including, by way of example, the ways described above and by the sampling procedure described below. For any point $x \in \mathbb{R}^d$, let $R(x) \subseteq X$ denote the points of X in the Voronoi region of $x$ (given the set of centers $\bar{C}$). We call $R(x)$ the Voronoi set of $x$. Repeat the following steps until $|\bar{C}| = k$:

(i) Compute $T = \sum_{x \in \bar{C}} \sum_{y \in R(x)} \|y - x\|^2$, the cost of clustering X around $\bar{C}$. For every $x \in \bar{C}$, compute $T_x = \sum_{z \in \bar{C}\setminus\{x\}} \sum_{y \in R_{-x}(z)} \|y - z\|^2$, where $R_{-x}(z)$ is the Voronoi set of $z$ given the center set $\bar{C} \setminus \{x\}$.

(ii) Pick the center $y \in \bar{C}$ for which $T_y - T$ is minimum and set $\bar{C} \leftarrow \bar{C} \setminus \{y\}$.

(iii) Recompute the Voronoi sets $R(x) \subseteq X$ for each (remaining) center $x \in \bar{C}$. Now we "move" the centers to the centroids of their respective (new) Voronoi sets, that is, for every set $R(x)$, we update $\bar{C} \leftarrow (\bar{C} \setminus \{x\}) \cup \{\mathrm{ctr}(R(x))\}$.

[0079] Running time. There are $n - k$ iterations of the previous three steps. Each iteration takes $O(n^2 d)$ time: computing $T$ and the sets $R(x)$ for each $x$ takes $O(n^2 d)$ time, and we can then compute each $T_x$ in $O(|R(x)|\,d)$ time (since while computing $T$, we can also compute for each point its second-nearest center in $\bar{C}$). Therefore the overall running time is $O(n^3 d)$.
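By way of example and not limitation, the following Python sketch implements steps (i)-(iii) of the greedy deletion procedure above, using a straightforward recomputation of all distances in each iteration rather than the optimized bookkeeping noted in the running-time discussion; the function name is illustrative.

```python
import numpy as np

def greedy_delete(points, centers, k):
    """Starting from an abundant set of centers, repeatedly delete the center
    whose removal increases the clustering cost the least (steps (i)-(ii)), then
    move every remaining center to the centroid of its Voronoi set (step (iii)),
    until only k centers remain."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    while len(centers) > k:
        C = np.array(centers)
        d2 = ((points[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        order = np.argsort(d2, axis=1)
        nearest, second = order[:, 0], order[:, 1]
        # T_x - T for each center x: the points in R(x) are re-served by their
        # second-nearest center once x is removed.
        increase = np.zeros(len(centers))
        idx = np.arange(len(points))
        np.add.at(increase, nearest, d2[idx, second] - d2[idx, nearest])
        # (ii) Delete the center y for which T_y - T is minimum.
        centers.pop(int(np.argmin(increase)))
        # (iii) Recompute the Voronoi sets and move each center to its centroid.
        C = np.array(centers)
        labels = ((points[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = [points[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                   for j in range(len(centers))]
    return np.array(centers)
```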

[0080] A linear time seeding procedure. We now combine the sampling idea with the deletion procedure to obtain a seeding procedure that runs in time $O(nkd + k^3 d)$ and succeeds with high probability. We first sample O(k) points from X using the sampling procedure. Then we run the deletion procedure on an $O(k)$-size instance consisting of the centroids of the Voronoi regions of the sampled points, with each centroid having a weight equal to the mass of X in its corresponding Voronoi region. The sampling process ensures that with high probability, every cluster $X_i$ contains a sampled point that is close to its center $c_i$. This will allow us to argue that the $\Delta_k^2(\cdot)$ cost of the sampled instance is much smaller than its $\Delta_{k-1}^2(\cdot)$ cost, and that the optimal centers for the sampled instance lie near the optimal centers for X. By the analysis of the previous section, one can then show that after the deletion procedure the k centers are still close to the optimal centers for the sampled instance, and hence also close to the optimal centers for X.

[0081] Sampling. Sample N points from X using the sampling procedure above, where N is as defined above. Let S denote the set of sampled points.

[0082] Deletion phase. For each $x \in S$, let $R(x) = \{y \in X : \|y - x\| = \min_{z \in S} \|y - z\|\}$ be its Voronoi set (with respect to the set S). We now ignore X, and consider a weighted instance with point set $\hat{S} = \{\hat{x} = \mathrm{ctr}(R(x)) : x \in S\}$, where each $\hat{x}$ has a weight $w(\hat{x}) = |R(x)|$. Run the deletion procedure above on $\hat{S}$ to obtain k centers $\bar{c}_1, \ldots, \bar{c}_k$.

[0083] Running time: Sampling takes $O(nNd) = O(nkd)$ time. The run-time analysis of the deletion phase above shows that the deletion phase takes $O(N^3 d) = O(k^3 d)$ time. So the overall running time is $O(nkd + k^3 d)$.
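By way of example and not limitation, the following Python sketch builds the $O(k)$-size weighted instance described in the deletion phase above (the centroids of the Voronoi regions of the sampled points, each weighted by the mass of X it contains); the function name is illustrative. The greedy deletion sketch given earlier can then be run on this instance, with each squared distance scaled by the corresponding weight.

```python
import numpy as np

def build_weighted_instance(points, sampled):
    """For each sampled point x, compute its Voronoi set R(x) with respect to the
    sampled set S, and return the centroid ctr(R(x)) together with the weight
    w = |R(x)|; empty regions are dropped."""
    d2 = ((points[:, None, :] - sampled[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)                 # nearest sampled point for each x in X
    reps, weights = [], []
    for j in range(len(sampled)):
        region = points[owner == j]
        if len(region):
            reps.append(region.mean(axis=0))  # ctr(R(x))
            weights.append(len(region))       # w(ctr(R(x))) = |R(x)|
    return np.array(reps), np.array(weights)
```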

[0084] Lemma 4.9: $\Delta_k^2(\hat{S}) = O(\varepsilon^2)\,\Delta_{k-1}^2(\hat{S})$. For each optimal center $c_i$, there is some $\hat{s} \in \hat{S}$ such that $\|\hat{s} - c_i\| = O(r_i/\sqrt{\rho^3})$.

[0085] Lemma 4.10: With probability $1 - O(\sqrt{\varepsilon})$, for each optimal center $c_i$, there is a distinct center $\bar{c}_j$ such that $\|\bar{c}_j - c_i\| < D_i/10$.

Procedures Used in Stage II

[0086] Given k seed centers $\bar{c}_1, \ldots, \bar{c}_k$ located sufficiently close to the optimal centers after stage I, we use two procedures in stage II to obtain a near-optimal clustering: the ball-k-means step, which yields a $(1+f(\varepsilon))$-approximation algorithm, or the centroid estimation step, which yields a PTAS with running time exponential in k. Define $\bar{d}_i = \min_{j \ne i} \|\bar{c}_j - \bar{c}_i\|$.

[0087] Ball-k-means step. Let $B_i$ be the points of X in a ball of radius $\bar{d}_i/3$ around $\bar{c}_i$, and let $\hat{c}_i$ be the centroid of $B_i$. Return $\hat{c}_1, \ldots, \hat{c}_k$ as the final centers.
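By way of example and not limitation, the following Python sketch implements the ball-k-means step just described; it assumes `seeds` is a k x d array of the stage-I centers, and the function name is illustrative.

```python
import numpy as np

def ball_kmeans_step(points, seeds):
    """For each seed c_i, let d_i be the distance to the nearest other seed,
    gather the points of X within d_i / 3 of c_i, and return the centroid of
    that ball as the final center (the seed is kept if the ball is empty)."""
    seeds = np.asarray(seeds, dtype=float)
    finals = []
    for i, ci in enumerate(seeds):
        others = np.delete(seeds, i, axis=0)
        d_i = np.sqrt(((others - ci) ** 2).sum(axis=1)).min()
        ball = points[((points - ci) ** 2).sum(axis=1) <= (d_i / 3.0) ** 2]
        finals.append(ball.mean(axis=0) if len(ball) else ci)
    return np.array(finals)
```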

[0088] Centroid estimation. For each i, we will obtain a set of candidate centers for cluster X_i. Fix β = 25/(25 + 256ε²). Let R'_i = {x ∈ X : ‖x − c̄_i‖ ≤ min_j ‖x − c̄_j‖ + D_i/4} be the expanded Voronoi region of c̄_i. Sample 4/(βω) points independently and uniformly at random from R'_i, where ω is a given input parameter, to obtain a random subset S_i ⊆ R'_i. Compute the centroid of every subset of S_i of size 2/ω; let T_i be the set consisting of all these centroids. Select the candidates ĉ_1 ∈ T_1, ..., ĉ_k ∈ T_k that yield the least-cost solution, and return these as the final centers.
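
The centroid estimation step may be sketched as follows (Python; names are illustrative, and the sketch is practical only for very small k since the final search is exhaustive). The sample size 4/(βω) and subset size 2/ω follow the description above; the random number generator and the fallback for empty regions are implementation choices, not part of the described procedure.

```python
import itertools
import numpy as np

def centroid_estimation(X, C, omega, beta, rng=None):
    """Centroid estimation: for each seed c_i, sample points from its expanded
    Voronoi region R'_i, take the centroid of every small subset as a candidate
    center, and exhaustively choose the combination of candidates (one per
    cluster) with least k-means cost.  Exponential in k; a sketch only."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float)
    k = len(C)
    dist = np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1))   # n x k
    cc = np.sqrt(((C[:, None, :] - C[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(cc, np.inf)
    D = cc.min(axis=1)                                    # D_i: nearest other seed
    n_sample = max(1, int(np.ceil(4.0 / (beta * omega))))
    subset_size = max(1, int(np.ceil(2.0 / omega)))
    candidates = []
    for i in range(k):
        # Expanded Voronoi region R'_i of seed c_i.
        region = X[dist[:, i] <= dist.min(axis=1) + D[i] / 4.0]
        if len(region) == 0:
            region = X                                    # degenerate fallback
        Si = region[rng.integers(0, len(region), size=n_sample)]
        size = min(subset_size, len(Si))
        cands = [Si[list(sub)].mean(axis=0)
                 for sub in itertools.combinations(range(len(Si)), size)]
        candidates.append(cands)
    best, best_cost = None, np.inf
    for combo in itertools.product(*candidates):          # exhaustive search
        centers = np.vstack(combo)
        cost = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1).sum()
        if cost < best_cost:
            best, best_cost = centers, cost
    return best
```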

[0089] Lemma 4.11: Let Y_i = {x ∈ X_i : ‖x − c_i‖² ≤ r_i²/ρ}. If ‖c̄_i − c_i‖ ≤ D_i/10 for each i, then Y_i ⊆ B_i ⊆ X_i, and ‖ĉ_i − c_i‖² ≤ (ρ/(1 − ρ))·r_i².

[0090] Lemma 4.12: Suppose ‖c̄_i − c_i‖ ≤ D_i/10 for each i. Then X_i ⊆ R'_i, where R'_i is as defined above, and |X_i| ≥ β·|R'_i|.

A linear time constant-factor approximation technique

[0091] A linear time constant-factor approximation technique may be accomplished by: 1 - Execute the seeding procedure above to obtain k initial centers c̄_1, ..., c̄_k. 2 - Run the ball-k-means step to obtain the final centers.

[0092] The running time is O(nkd + k³d). By Lemma 4.10, we know that with probability 1 − O(√ε), for each c_i, there is a distinct center c̄_i such that ‖c̄_i − c_i‖ ≤ D_i/10. Combined with Lemma 4.11, this yields the following theorem.

[0093] Theorem 4.13: If X is ε-separated for k-means, i.e., Δ²_k(X) ≤ ε²·Δ²_{k−1}(X), for a small enough ε, the above algorithm returns a solution of cost at most (1/(1 − 37ε²))·Δ²_k(X) with probability 1 − O(√ε) in time O(nkd + k³d).

A PTAS for any fixed k

[0094] The PTAS combines the sampling procedure with the centroid estimation step: 1 - Use the procedure above to pick k initial centers c̄_1, ..., c̄_k. 2 - Run the centroid estimation procedure to obtain the final centers.

[0095] The running time is dominated by the exhaustive search in the centroid estimation procedure, which takes time O(2^{4k/(βω)}·nd). We show that the cost of the final solution is at most (1 + ω)·Δ²_k(X) with probability γ^k for some constant γ. By repeating the procedure O(γ^{−k}) times, we can boost the success probability to a constant.

[0096] Theorem 4.14: Assuming that X is ε-separated for k-means for a small enough ε, there is a PTAS for the k-means problem that returns a (1 + ω)-optimal solution with constant probability in time O(2^{O(k(1+ε²)/ω)}·nd).

The separation condition

[0097] We show that our separation condition implies, and is implied by, the condition that any two near-optimal k-clusterings disagree on only a small fraction of the data.

[0098] Let cost(x_1, ..., x_k) denote the cost of clustering X around the centers x_1, ..., x_k in R^d. We use R(x) to denote the Voronoi region of point x (the centers will be clear from the context). Let S_1 ⊕ S_2 = (S_1 \ S_2) ∪ (S_2 \ S_1) denote the symmetric difference of S_1 and S_2.

[0099] Theorem 5.1: Let α² = (1 − 401ε²)/400. Suppose that X ⊂ R^d is ε-separated for k-means for a small enough ε. The following hold:

(i) If there are centers c_1, ..., c_k such that cost(c_1, ..., c_k) ≤ α²·Δ²_{k−1}(X), then for each c_i there is a distinct optimal center c_{σ(i)} such that |R(c_i) ⊕ X_{σ(i)}| ≤ 28ε²·|X_{σ(i)}|;

(ii) a corresponding statement holds when X is obtained by perturbing each x in X_i by a bounded distance; the precise statement is set forth in the incorporated references identified below.

[0100] Notice that part (i) also yields that if X is ε-separated for k-means, then any k-clustering of cost at most α²·Δ²_{k−1}(X) has small Hamming distance to the optimal k-clustering (more strongly, each cluster has small Hamming distance to a distinct optimal cluster). We now show the converse.

[0101] Theorem 5.2: Let ε ≤ 1/3. Suppose that for every k-clustering X_1, ..., X_k of X of cost at most α²·Δ²_{k−1}(X):

(i) there exists a bijection σ such that for all i, |X_i ⊕ X_{σ(i)}| ≤ ε·|X_{σ(i)}|. Then X is α-separated for k-means;

(ii) there is a bijection σ such that |X_i ⊕ X_{σ(i)}| ≤ (2ε/(k − 1))·|X| for each i. Then X is α-separated for k-means.

Proofs

[0102] Proofs relating to the above Lemmas, Corollaries, and Theorems are believed to be ancillary to the various embodiments of the present invention as set forth in the claims, and therefore not necessary for enablement of such various embodiments of the present invention. Accordingly, such proofs have been omitted from the present disclosure. However, such proofs are available from U.S. Provisional Application No. 60/822,903, filed August 18, 2006 and entitled "SEEDING METHOD FOR k-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS", hereby incorporated by reference in its entirety. Additionally, such proofs are also available from "The Effectiveness of Lloyd-Type Methods for the k-Means Problem" as published in the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06) at pages 165-176, also hereby incorporated by reference in its entirety.

[0103] The various embodiments of the present invention are directed in particular toward data analysis and methods of clustering data points into clusters that represent groups of interest. One method of performing clustering is the so-called k-means clustering heuristic. However, such a heuristic tends to be sensitive to the way initial cluster centers or seeds are chosen, which is significant given that some methods for randomly selecting seeds can lead to results of variable quality. Accordingly, in various embodiments of the present invention, a procedure for choosing initial seeds to find centers of clusters is provided which leads to more effective clustering procedures, both in terms of the quality of results and running time.

[0104] The details of various seeding procedures for the clustering procedures and the clustering procedure itself, including justifications and method steps, are set forth in great detail above. We set forth below and in Fig. 5 detailed steps for implementing some embodiments of the teachings herein.

[0105] Preliminarily it is to be understood that the seeding procedure and clustering procedure are performed with regard to a data set of items, each of which must be in an analyzable form that may be best understood as a single point in a multi-dimensional space. Accordingly, such data set is procured, obtained, or otherwise provided, as a collection of points 500. As should be understood, such procurement and transformation may be performed in any appropriate manner without departing from the spirit and scope of the present invention.

[0106] With such collection of points it is to be appreciated that the seeding procedure for the clustering procedure receives as inputs a number n of the points from the collection, each of which is denoted D_i for i = 1, 2, 3, ..., n, and each of which resides in an m-dimensional space. In addition, the seeding procedure receives as an input an integer k between 2 and n−1 which represents the number of seeds that are to be produced for the clustering procedure. As should be understood, based thereon, the seeding procedure will output k points out of the n input points that will serve as seeds.

Picking the first seed

[0107] In a seeding procedure for picking an initial collection of cluster centers, various techniques may be employed for selecting the at least one initial cluster center, i.e., seed 502. In some embodiments, the first seeds may be determined by selecting one or more points at random from the collection of data points. The random selection may be made from any probability distribution assigned to the data points. In some embodiments the first seeds are selected uniformly at random from the data points.

[0108] The center of gravity C of the n points may be calculated, for example, by taking the average over the n points of each of the m coordinates, i.e., C = (1/n)·(D_1 + D_2 + ... + D_n). In some embodiments, this center of gravity C, sometimes called the centroid, is used as the first seed.

[0109] In some embodiments, the first seed may be chosen by the following procedure:

(i) For each of the n points, the distance C_i^0 of such point D_i to C is calculated. Such distance C_i^0 may be calculated in any appropriate manner without departing from the spirit and scope of the teachings herein. For example, the distance may be calculated as the squared Euclidean distance between such point D_i and the center of gravity C.

(ii) An average squared distance is then calculated as C̄⁰ = (C_1^0 + C_2^0 + ... + C_n^0)/n, and with such distances C_1^0, ..., C_n^0 and C̄⁰, a point μ_1 from among the n points is selected at random as the first seed, with the probability of picking any point D_i as μ_1 given by Pr[μ_1 = D_i] = (C̄⁰ + C_i^0)/(2nC̄⁰). Note that these probabilities sum to one.
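
To make this first-seed selection concrete, the following sketch (Python with NumPy; the function name is illustrative and not part of the claimed method) computes the center of gravity, the squared distances C_i^0, their average C̄⁰, and then draws the first seed with the probability given above.

```python
import numpy as np

def pick_first_seed(D, rng=None):
    """Pick the first seed mu_1 from the n points D (an n x m array) with
    Pr[mu_1 = D_i] = (Cbar0 + C_i0) / (2 * n * Cbar0), where C_i0 is the
    squared Euclidean distance from D_i to the center of gravity and Cbar0
    is the average of the C_i0."""
    if rng is None:
        rng = np.random.default_rng()
    D = np.asarray(D, dtype=float)
    n = len(D)
    center = D.mean(axis=0)                    # center of gravity C
    C0 = ((D - center) ** 2).sum(axis=1)       # squared distances C_i^0
    Cbar0 = C0.mean()                          # average squared distance
    if Cbar0 == 0.0:                           # all points coincide
        p = np.full(n, 1.0 / n)
    else:
        p = (Cbar0 + C0) / (2.0 * n * Cbar0)   # probabilities (sum to one)
    i = int(rng.choice(n, p=p))
    return i, D[i]
```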

Picking subsequent seeds

[0110] With μ_1 selected, points μ_2, ..., μ_k are then selected as seeds according to the following iterative process. Supposing μ_1, ..., μ_{j−1} have already been chosen (initially j = 2 and only μ_1 has been chosen), we then pick μ_j by:

(i) For each 1 ≤ i ≤ n and each 1 ≤ t ≤ j − 1, calculate C_i^t, the distance between D_i and μ_t. As before, each distance may be calculated in any appropriate manner, such as the aforementioned squared Euclidean distance. Let C_i denote the minimum of the C_i^t over 1 ≤ t ≤ j − 1, i.e., the distance between point D_i and its closest previously selected seed (step 504).

(ii) Update C̄ to be the average of C_i over all 1 ≤ i ≤ n, and pick μ_j = D_i with assigned probability Pr[μ_j = D_i] = C_i/(nC̄) (step 506).

[0111] If more seeds are required (the "Yes" branch out of decision 508), steps 504 and 506 are repeated until all of the required seeds μ_1, ..., μ_k have been selected. Thereafter, μ_1, ..., μ_k, as the selected seeds, are supplied (step 510) to a clustering procedure, and the clustering procedure iteratively moves the seeds within the space until a termination condition is reached and cluster centers are defined.
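
The full iterative selection of μ_1, ..., μ_k may be sketched as follows (Python with NumPy; names are illustrative). For brevity, the first seed is drawn uniformly at random, which is one of the options described above; any of the other first-seed techniques could be substituted.

```python
import numpy as np

def pick_seeds(D, k, rng=None):
    """Pick k seeds from the n points D: the first uniformly at random, and
    each subsequent seed mu_j with Pr[mu_j = D_i] = C_i / (n * Cbar), where
    C_i is the squared Euclidean distance from D_i to its closest previously
    selected seed and Cbar is the average of the C_i (steps 504-508).
    Assumes the data contain at least k distinct points."""
    if rng is None:
        rng = np.random.default_rng()
    D = np.asarray(D, dtype=float)
    n = len(D)
    seeds = [D[rng.integers(n)]]                       # first seed mu_1
    min_d2 = ((D - seeds[0]) ** 2).sum(axis=1)         # C_i for each point
    for _ in range(1, k):
        p = min_d2 / min_d2.sum()                      # equals C_i / (n * Cbar)
        i = int(rng.choice(n, p=p))
        seeds.append(D[i])
        # Step 504: refresh each point's distance to its closest selected seed.
        min_d2 = np.minimum(min_d2, ((D - D[i]) ** 2).sum(axis=1))
    return np.vstack(seeds)
```

For example, seeds = pick_seeds(data, k) could then be supplied at step 510 to a clustering procedure.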

[0112] In some embodiments, more than one seed may be selected by techniques such as those described in connection with "Picking the first seed" above before proceeding to the techniques of "Picking subsequent seeds."

[0113] In some embodiments, an abundant number of seeds may be initially selected. That is, some number of seeds greater than the number to be used by a clustering algorithm is initially selected. The collection of seeds is then culled, such as for example by the deletion / merger techniques described above. The initial abundant collection of seeds may be chosen by any suitable method, such as those described above for example, or by a combination of methods.

For example, some number of seeds may be chosen by the procedure described in connection with Fig. 5 above and then the collection of seeds may be augmented by selecting additional seeds at random. As another example, some number of seeds may be picked randomly, and then the collection of seeds augmented by the procedure described in connection with steps 504 to 508 of Fig. 5.

[0114] Generally, the seeds that are selected by the techniques disclosed herein will be communicated to a clustering algorithm for subsequent determination of a desirable clustering of given data. For example, a Lloyd-type iteration algorithm may take the selected seeds as a starting point for an iterative process that determines a clustering for the data.
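
For completeness, a generic Lloyd-type iteration of the kind that would consume these seeds is sketched below (Python with NumPy; the names, tolerance, and iteration cap are illustrative implementation choices, not part of the claimed method).

```python
import numpy as np

def lloyd_iteration(X, seeds, max_iters=100, tol=1e-9):
    """Standard Lloyd-type iteration started from the supplied seeds: assign
    each point to its nearest center, move each center to the centroid of its
    assigned points, and repeat until the centers stabilize."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(seeds, dtype=float).copy()
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        new_centers = np.vstack([X[assign == j].mean(axis=0) if np.any(assign == j)
                                 else centers[j] for j in range(len(centers))])
        moved = np.max(np.abs(new_centers - centers))
        centers = new_centers
        if moved < tol:                       # termination condition reached
            break
    return centers, assign
```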

CONCLUSION

[0115] The programming necessary to effectuate the processes performed in connection with the various embodiments of the present invention is straightforward and should be apparent to the relevant programming public. Accordingly, such programming is not attached hereto. Any particular programming, then, may be employed to effectuate the various embodiments of the present invention without departing from the spirit and scope thereof.

[0116] In the present invention, systems and methods are provided for performing Lloyd-type k-clustering that more accurately define the initial seed for each cluster prior to performing the iterative process of such Lloyd-type k-clustering. An initial procedure is performed to at least roughly identify the location of each true cluster based on distances between at least a sampling of all the points in the data set. Once the first seed μ_1 is picked, each successive seed μ_2, ..., μ_k is then picked proportional to the distances of the remaining points to the previously picked seeds. Thus, each point in effect competes to be the next seed with probability proportional to the distance from the point to its closest previously picked seed. It should be appreciated that changes could be made to the embodiments described above without departing from the inventive concepts thereof. It should be understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.