Title:
ACCELERATED K-MEANS CLUSTERING
Document Type and Number:
WIPO Patent Application WO/2017/176145
Kind Code:
A1
Abstract:
A device for iteratively clustering an input dataset, the device comprising a direction update unit configured to compute an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations, a step size update unit configured to determine an update step size factor based on a Tchebychev polynomial, and a center update unit configured to compute an updated cluster center based on the update direction and the update step size factor.

Inventors:
LEVIN MIKHAIL PETROVICH (CN)
Application Number:
PCT/RU2016/000192
Publication Date:
October 12, 2017
Filing Date:
April 05, 2016
Assignee:
HUAWEI TECH CO LTD (CN)
LEVIN MIKHAIL PETROVICH (CN)
International Classes:
G06F17/10; G06K9/62
Other References:
K A NAZEER ET AL: "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm", PROCEEDINGS OF THE WORLD CONGRESS ON ENGINEERING 2009 (WCE 2009), 1-3 JULY 2009, LONDON, UK, 1 July 2009 (2009-07-01), XP055331501
RAFAIL OSTROVSKY ET AL: "The effectiveness of lloyd-type methods for the k-means problem", JOURNAL OF THE ASSOCIATION FOR COMPUTING MACHINERY, ACM, NEW YORK, NY, US, vol. 59, no. 6, 9 January 2013 (2013-01-09), pages 1 - 22, XP058030420, ISSN: 0004-5411, DOI: 10.1145/2395116.2395117
Attorney, Agent or Firm:
LAW FIRM "GORODISSKY & PARTNERS" LTD. (RU)
Claims:
CLAIMS

1. A device (100) for iteratively clustering an input dataset, the device comprising:

a direction update unit (110) configured to compute an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations,

a step size update unit (120) configured to determine an update step size factor based on a Tchebychev polynomial, and

a center update unit (130) configured to compute an updated cluster center based on the update direction and the update step size factor.

2. The device (100) of claim 1, wherein the step size update unit (120) is configured to compute an update step size factor τ_i of an i-th iteration as

τ_i = τ_0 / (1 + ρ_0 · cos[(2m − 1)π/8]),

wherein i = 4q + m, m ∈ {0, 1, 2, 3}, ρ_0 = (λ_max − λ_min)/(λ_min + λ_max) and τ_0 = 2/(λ_min + λ_max), wherein λ_max and λ_min are estimates of maximum and minimum eigenvalues of an approximation of a cluster update operator A.

3. The device (100) of one of the previous claims, further comprising a table for storing a plurality of precomputed update step size factors, wherein the update step size unit (120) is configured to determine the update step size factor based on the plurality of precomputed update step size factors, in particular by looking up a precomputed update step size factor based on a current iteration number.

4. The device (100) of one of the previous claims, wherein the device is configured to compute an updated cluster center x_i of an i-th iteration as:

x_i = x_{i−1} + τ_i · [(1/K_{i−1}) Σ_{j=1..K_{i−1}} p_j^{(i−1)} − (1/K_{i−2}) Σ_{j=1..K_{i−2}} p_j^{(i−2)}],

wherein τ_i is an update step size factor in an i-th iteration, p_j^{(i−1)} is a j-th element of a cluster in an (i−1)th iteration, p_j^{(i−2)} is a j-th element of the cluster in an (i−2)nd iteration, and K_{i−1} and K_{i−2} are the numbers of cluster elements of the cluster in the (i−1)th and (i−2)nd iteration.

5. The device (100) of one of the previous claims, wherein the device is a field-programmable gate array.

6. An iterative method (200, 300) for clustering an input dataset, the method comprising:

computing (210) an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations,

computing (220) an update step size factor based on a Tchebychev polynomial, and

computing (230) an updated cluster center based on the update direction and the update step size factor.

7. The method (200, 300) of claim 6, wherein an update step size factor τ_i of an i-th iteration is computed as

τ_i = τ_0 / (1 + ρ_0 · cos[(2m − 1)π/8]),

wherein i = 4q + m, m ∈ {0, 1, 2, 3}, ρ_0 = (λ_max − λ_min)/(λ_min + λ_max) and τ_0 = 2/(λ_min + λ_max), wherein λ_max and λ_min are estimates of maximum and minimum eigenvalues of an approximation of a cluster update operator A.

8. The method (200, 300) of one of claims 6 and 7, wherein an updated cluster center x_i of an i-th iteration is computed as:

x_i = x_{i−1} + τ_i · [(1/K_{i−1}) Σ_{j=1..K_{i−1}} p_j^{(i−1)} − (1/K_{i−2}) Σ_{j=1..K_{i−2}} p_j^{(i−2)}],

wherein τ_i is an update step size factor in an i-th iteration, p_j^{(i−1)} is a j-th element of a cluster in an (i−1)th iteration, p_j^{(i−2)} is a j-th element of the cluster in an (i−2)nd iteration, and K_{i−1} and K_{i−2} are the numbers of cluster elements of the cluster in the (i−1)th and (i−2)nd iteration.

9. The method (200, 300) of one of the claims 6 to 8, wherein a plurality of update step size factors for a plurality of iterations has been predetermined based on a sample dataset that is independent of the input dataset.

10. The method (200, 300) of one of claims 6 to 8, wherein the update step size factors are pre-computed based on a sample dataset that is a subset of the input dataset, in particular a subset that is randomly selected from the input dataset.

11. The method (200, 300) of one of claims 9 and 10, wherein pre-computing the update step size factors comprises estimating a maximum eigenvalue λ_max and a minimum eigenvalue λ_min of an approximation of an update operator A on the predetermined dataset and/or the subset.

12. The method (200, 300) of claim 11, wherein the maximum eigenvalue λ_max and/or the minimum eigenvalue λ_min are computed by minimizing a number of iterations required for clustering the predetermined dataset and/or the sample dataset until a maximum difference between a cluster center of a current iteration and a previous iteration is below a predetermined sample dataset threshold.

13. The method (200, 300) of claim 12, wherein minimizing the number of iterations comprises a step of performing a coordinate descent minimization, in particular an alternating direction descent minimization.

14. The method (200, 300) of one of the claims 6 to 13, wherein the method is performed until a maximum difference between a cluster center of a current iteration and a previous iteration is below a predetermined input dataset threshold.

15. A computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of one of claims 6 to 14.

Description:
ACCELERATED K-MEANS CLUSTERING

TECHNICAL FIELD

The present invention relates to a device for iteratively clustering an input dataset and an iterative method for clustering an input dataset. The present invention also relates to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out such a method.

BACKGROUND

K-Means clustering technology is widely used in Computer Vision, Machine Learning, Genomics and Bioinformatics, Medical Cybernetics, and Financial Applications, e.g.

1) for a direct clustering of big data;

2) in computer graphics for color quantization (reducing the color palette of an image to a fixed number of colors k);

3) as a preliminary procedure for acceleration of other ML techniques;

4) as a preliminary procedure for acceleration of the Community Detection clustering method;

5) as a preliminary procedure for parallelization of a Support Vector clustering method.

In particular in Computer Vision, K-Means clustering technology is one of the most commonly used basic technologies.

One of the attractive points of K-Means clustering is the simplicity of this approach. This simplicity, however, comes with some weak points, which include:

1) It needs a lot of iterations and computer resources for handling big data;

2) Convergence to a local minimum.

The first above-mentioned weak point is a principal obstacle in K-Means clustering technologies, because many applications of Computer Vision have a so-called run-time restriction: the reaction time in these applications must be very small. Very often this reaction time is one of the principal points in such applications, and decreasing it is one of the main ways of improving technologies that use K-Means clustering. Thus, decreasing the processing time of clustering is an important and urgent problem in the usage of K-Means clustering technology.

Conventional K-Means clustering requires KN operations for clustering, wherein K is the number of clusters and N is the number of data points to be clustered. In the Lloyd version, K-Means comprises the following steps:

1) Set initial centers for each of the K data groups (so-called seeding);

2) Assign each data point to the closest group by calculating the distance from each data record to each center and allocating the data record to the nearest group;

3) Reevaluate each center for each data group;

4) Evaluate maximal absolute value of difference Δ between centers on current and previous iterations;

5) Repeat steps 2, 3, 4 until each group has a stable center and stable members;

6) Evaluate a sum of square error to evaluate the quality of the clustering result.
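The six steps above can be sketched in code as follows. This is a minimal NumPy sketch for illustration only; the function name `lloyd_kmeans` and the optional explicit seeding parameter are assumptions, not part of the application.

```python
import numpy as np

def lloyd_kmeans(data, k, eps=1e-7, max_iter=1000, init_centers=None, seed=0):
    """Plain Lloyd K-Means: seed centers (step 1), assign points (step 2),
    re-evaluate centers (step 3), and stop once the maximal center shift
    falls below eps (steps 4-5)."""
    rng = np.random.default_rng(seed)
    if init_centers is None:
        centers = data[rng.choice(len(data), size=k, replace=False)]  # step 1: random seeding
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        # Step 2: distance from every data record to every center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: each new center is the centroid of its group.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: maximal absolute difference between old and new centers.
        delta = np.abs(new_centers - centers).max()
        centers = new_centers
        if delta < eps:  # Step 5: groups are stable.
            break
    # Step 6: sum of squared errors as a clustering quality measure.
    sse = float(((data - centers[labels]) ** 2).sum())
    return centers, labels, sse
```

This plain version is the baseline that the accelerated iteration scheme described below improves upon.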

Evaluation of centers for each group of data in Lloyd's version is provided as follows:

Let any group k in an i-th iteration consist of K_i elements p_j^{(i)}, j = 1, 2, ..., K_i, for example p_j^{(i)} ∈ R^n.

Then, in the update step, the center of the k-th cluster of the (i+1)th iteration is calculated as the centroid of the elements of the k-th cluster of the i-th iteration:

x_{i+1} = (1/K_i) Σ_{j=1..K_i} p_j^{(i)}

Iterations are done until

max || x_i − x_{i−1} || < ε,

wherein x_i is a cluster center of the i-th iteration, ε is a data exactness, and || · || is the Euclidean norm in R^n space.

There have been many attempts to improve the speed of K-Means clustering technology. Mainly, these modifications are concerned with changing the data structure and using another type of norm in distance evaluation instead of the classical Euclidean norm. Some of the best-known attempts to accelerate K-Means clustering technology include:

1. Clustering by using kd-trees, proposed by Dan Pelleg and Andrew Moore in 1999. Originally, kd-trees were used to accelerate nearest-neighbor queries. This technique modified steps 4 and 5 in Lloyd's algorithm by decreasing the number of operations in those steps;

2. Effects of seeding: The quality of the clusters and the running time of the clustering depend on the way of seeding. A preclustering technique instead of random seeding should decrease the number of iterations and the running time. Other seeding algorithms have been considered in the literature. All these modifications concern stage 3 in the above-mentioned scheme of Lloyd's method;

3. Jenks in 1967 suggested using a modified function in the minimization problem and also suggested another kind of seeding. He obtained about 90% acceleration in comparison with Lloyd's method, but only in some cases.

4. In 2004, Ya Guan, Ali A. Ghorbani, and Nabil Belacel proposed the K-Means+ modification. It selects the number of clusters automatically based on initial data analysis.

Despite the prior art improvements, there is still a need for improving computational efficiency of K-means clustering.

SUMMARY OF THE INVENTION

The objective of the present invention is to provide a device for iteratively clustering an input dataset and an iterative method for clustering an input dataset which overcome one or more of the above-mentioned problems of the prior art.

A first aspect of the invention provides a device for iteratively clustering an input dataset, the device comprising:

- a direction update unit configured to compute an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations,

- a step size update unit configured to determine an update step size factor based on a Tchebychev polynomial, and

- a center update unit configured to compute an updated cluster center based on the update direction and the update step size factor.

Iterations in the Lloyd algorithm and its modifications as known in the prior art use techniques similar to the successive iteration techniques well known in Numerical Linear Algebra. Their convergence is often very slow and they need a lot of iterations to get the result. Experiments have shown that adapting the update step size based on Tchebychev polynomials yields a faster convergence compared to standard algorithms. Calculating the update step size factors may involve a small additional computational effort, but the performance gain through the smaller number of iterations until convergence has a larger effect.

Therefore, the device of the first aspect can use approximate multi-layer Tchebychev iteration techniques instead of simple iteration techniques. In every iteration of a K-Means clustering method, the cluster centers are updated. As a cluster center can be seen as a point in R^n, the update can be seen as moving the cluster center in a certain direction. Thus, each iteration performs an update into a certain direction. This is what can be referred to as the "update direction of a current iteration". For example, if a cluster center of a previous iteration is given by x_{i−1} and a cluster center of a current iteration is given by x_i, a vector that is pointing in the direction from x_{i−1} to x_i can be referred to as an update direction of the current iteration.

Adapting update step sizes using update step size factors that are based on Tchebychev polynomials has the advantage that computational time and computer resources needed for clustering of data in applications where K-Means clustering is used can be reduced.

In a first implementation of the device for iteratively clustering an input dataset according to the first aspect, the step size update unit is configured to compute an update step size factor τ_i of an i-th iteration as

τ_i = τ_0 / (1 + ρ_0 · cos[(2m − 1)π/8]),

wherein i = 4q + m, m ∈ {0, 1, 2, 3}, ρ_0 = (λ_max − λ_min)/(λ_min + λ_max) and τ_0 = 2/(λ_min + λ_max), wherein λ_max and λ_min are estimates of maximum and minimum eigenvalues of an approximation of a cluster update operator A. Experiments have shown that adapting the step sizes of a K-Means clustering algorithm with the above-defined update step size factors τ_i yields particularly low iteration numbers and thus a particularly efficient K-Means clustering.
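Under these definitions, the four factors depend only on the two eigenvalue estimates and can be tabulated once. A minimal sketch follows; the helper name `tchebychev_step_sizes` is illustrative, not from the application.

```python
import math

def tchebychev_step_sizes(lam_min, lam_max):
    """Tabulate tau_i = tau_0 / (1 + rho_0 * cos((2m - 1) * pi / 8))
    for m = 0, 1, 2, 3, from the eigenvalue estimates lam_min, lam_max."""
    tau0 = 2.0 / (lam_min + lam_max)                  # initial update step size factor
    rho0 = (lam_max - lam_min) / (lam_min + lam_max)  # eigenvalue spread ratio
    return [tau0 / (1.0 + rho0 * math.cos((2 * m - 1) * math.pi / 8))
            for m in range(4)]

# The factor for iteration i = 4q + m can then be looked up as taus[i % 4].
```

This is exactly the kind of precomputed table the second implementation below stores for lookup by iteration number.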

In a second implementation of the device for iteratively clustering an input dataset according to the first aspect, the device further comprises a table for storing a plurality of precomputed update step size factors, wherein the update step size unit is configured to determine the update step size factor based on the plurality of precomputed update step size factors, in particular by looking up a precomputed update step size factor based on a current iteration number.

Storing the update step size factors in a table has the advantage that during computation of clusters for a given dataset, the precomputed update step size factors can be used. In other words, the precomputed update step size factors only need to be calculated once and can then be re-used for different datasets.

In a third implementation of the device for iteratively clustering an input dataset according to the first aspect, the device is configured to compute an updated cluster center x_i of an i-th iteration as:

x_i = x_{i−1} + τ_i · [(1/K_{i−1}) Σ_{j=1..K_{i−1}} p_j^{(i−1)} − (1/K_{i−2}) Σ_{j=1..K_{i−2}} p_j^{(i−2)}],

wherein τ_i is an update step size factor in an i-th iteration, p_j^{(i−1)} is a j-th element of a cluster in an (i−1)th iteration, p_j^{(i−2)} is a j-th element of the cluster in an (i−2)nd iteration, and K_{i−1} and K_{i−2} are the numbers of cluster elements of the cluster in the (i−1)th and (i−2)nd iteration.

In the above equation, the expression (1/K_{i−1}) Σ_{j=1..K_{i−1}} p_j^{(i−1)} − (1/K_{i−2}) Σ_{j=1..K_{i−2}} p_j^{(i−2)} can be seen as a cluster update operator, since it computes an update from a cluster center x_{i−1} of a previous iteration to the cluster center of a current iteration.

In other words, this expression represents an update direction from a previous cluster center to a current cluster center. Thus, the scalar τ_i that is multiplied with the update direction can be seen as an update step size factor for adjusting the step size.

For example, the center update unit of the device of the first aspect can be configured to compute the updated cluster center as defined above.
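A single accelerated center update of this form might be sketched as follows, assuming the cluster members of the two previous iterations are available as arrays. All names are illustrative, not from the application.

```python
import numpy as np

def update_center(x_prev, members_prev, members_prev2, tau):
    """One accelerated update: x_i = x_{i-1} + tau_i * (centroid(i-1) - centroid(i-2)).
    The centroid difference plays the role of the update direction, and the
    scalar tau scales the step taken along it."""
    direction = np.mean(members_prev, axis=0) - np.mean(members_prev2, axis=0)
    return np.asarray(x_prev, dtype=float) + tau * direction
```

With tau = 1 and the Lloyd relation x_{i−1} = centroid(i−2), this reduces to the plain Lloyd centroid update; other values of tau lengthen or shorten the step.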

In a fourth implementation of the device for iteratively clustering an input dataset according to the first aspect, the device is a field-programmable gate array. Field-programmable gate arrays (FPGAs) are integrated circuits designed to be configured by a customer or a designer after manufacturing (hence "field-programmable"). Experiments have shown that they are particularly suitable for implementing a clustering device which determines update step size factors based on a Tchebychev polynomial. In particular, the update step size factors can be stored by the user on the FPGA.

A second aspect of the invention refers to an iterative method for clustering an input dataset, the method comprising:

computing an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations,

- computing an update step size factor based on a Tchebychev polynomial, and

computing an updated cluster center based on the update direction and the update step size factor.

The methods according to the second aspect of the invention can be performed by the device for iteratively clustering an input dataset according to the first aspect of the invention. Further features or implementations of the method according to the second aspect of the invention can perform the functionality of the device according to the first aspect of the invention and its different implementation forms.

In a first implementation of the iterative method for clustering an input dataset of the second aspect, an update step size factor τ_i of an i-th iteration is computed as

τ_i = τ_0 / (1 + ρ_0 · cos[(2m − 1)π/8]),

wherein i = 4q + m, m ∈ {0, 1, 2, 3}, ρ_0 = (λ_max − λ_min)/(λ_min + λ_max) and τ_0 = 2/(λ_min + λ_max), wherein λ_max and λ_min are estimates of maximum and minimum eigenvalues of an approximation of a cluster update operator A.

In a second implementation of the iterative method of the second aspect, an updated cluster center x_i of an i-th iteration is computed as:

x_i = x_{i−1} + τ_i · [(1/K_{i−1}) Σ_{j=1..K_{i−1}} p_j^{(i−1)} − (1/K_{i−2}) Σ_{j=1..K_{i−2}} p_j^{(i−2)}],

wherein τ_i is an update step size factor in an i-th iteration, p_j^{(i−1)} is a j-th element of a cluster in an (i−1)th iteration, p_j^{(i−2)} is a j-th element of the cluster in an (i−2)nd iteration, and K_{i−1} and K_{i−2} are the numbers of cluster elements of the cluster in the (i−1)th and (i−2)nd iteration.

In a third implementation of the iterative method of the second aspect, a plurality of update step size factors for a plurality of iterations has been predetermined based on a sample dataset that is independent of the input dataset. Experiments have shown that suitable update step size factors can be found independent of a specific input dataset. Thus, update step size factors can be found for one predetermined "reference" dataset and can then be used for different datasets. This has the advantage that no computation of the update step size factors is required when computing clusters for a new dataset. Thus, a computational effort for clustering a previously unseen input dataset can be reduced.

In a fourth implementation of the iterative method of the second aspect, the update step size factors are pre-computed based on a sample dataset that is a subset of the input dataset, in particular a subset that is randomly selected from the input dataset.

Precomputing the update step size factors based on a subset of the full dataset has the advantage that the update step size factors are adapted to the dataset, yet can be computed with a reduced computational effort.

In a fifth implementation of the method of the third or fourth implementation of the second aspect, pre-computing the update step size factors comprises estimating a maximum eigenvalue λ_max and a minimum eigenvalue λ_min of an approximation of an update operator A on the predetermined dataset and/or the subset.

Estimating (or calculating) the smallest and largest eigenvalues of an approximation of a cluster update operator A has the advantage that based on these eigenvalues suitable update step size factors can be computed.

In a sixth implementation of the iterative method of the fifth implementation of the second aspect, the maximum eigenvalue λ_max and/or the minimum eigenvalue λ_min are computed by minimizing a number of iterations required for clustering the predetermined dataset and/or the sample dataset until a maximum difference between a cluster center of a current iteration and a previous iteration is below a predetermined sample dataset threshold.

Experiments have shown that this approach yields good approximations of the minimum and maximum eigenvalues of the cluster update operator A and at the same time yields a low number of iterations when these eigenvalue estimates are used for determining update step size factors and using these for implementing k-means clustering.

In a seventh implementation of the iterative method of the sixth implementation of the second aspect, minimizing the number of iterations comprises a step of performing a coordinate descent minimization, in particular an alternating direction descent minimization.

This represents an efficient way of determining estimates of maximum and minimum eigenvalues that require a minimum number of iterations for clustering the sample dataset and/or the predetermined dataset.

In an eighth implementation of the iterative method of the second aspect, the method is performed until a maximum difference between a cluster center of a current iteration and a previous iteration is below a predetermined input dataset threshold.

This provides a particularly good stopping rule: further iterations would only yield minimal, if any, improvements of the found cluster centers. Thus, unnecessary iterations can be avoided and the overall computation time can be reduced.

A third aspect of the invention refers to a computer-readable storage medium storing program code, the program code comprising instructions for carrying out the method of the second aspect or one of the implementations of the second aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical features of embodiments of the present invention more clearly, the accompanying drawings provided for describing the embodiments are introduced briefly in the following. The accompanying drawings in the following description are merely some embodiments of the present invention, but modifications on these embodiments are possible without departing from the scope of the present invention as defined in the claims.

FIG. 1 is a block diagram illustrating a device for clustering an input dataset in accordance with an embodiment of the present invention,

FIG. 2 is a flow chart illustrating a method for clustering an input dataset in accordance with a further embodiment of the present invention,

FIG. 3 is a simplified flow chart of the iterative clustering method in accordance with a further embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

FIG. 1 shows a device 100 for iteratively clustering an input dataset.

The device 100 for iteratively clustering an input dataset comprises a direction update unit 110, a step size update unit 120 and a center update unit 130. The units 110-130 can e.g. be implemented on a processor (not shown in FIG. 1).

The direction update unit 110 is configured to compute an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations. A cluster centroid of a previous iteration may be defined as (1/K_{i−1}) Σ_j p_j^{(i−1)}, wherein p_j^{(i−1)} are the elements of a cluster of a previous iteration (wherein i is the index of the current iteration). Thus, the direction update unit 110 can compute the update direction, i.e., the direction from a previous cluster center to a current cluster center, based on cluster members of previous iterations.

The step size update unit 120 is configured to determine an update step size factor based on a Tchebychev polynomial. The update step size factor may be used to change a length of the update direction as computed by the direction update unit 110. The step size update unit 120 may be configured to repeat this for every cluster of the set of clusters.

The center update unit 130 is configured to compute an updated cluster center based on the update direction and the update step size factor. For example, the center update unit 130 may multiply an update vector as determined by the direction update unit 110 with a scalar update step size factor as determined by the step size update unit 120. The result of this multiplication may be added to a previous cluster center in order to obtain a new cluster center of a current iteration. The center update unit 130 may be configured to repeat this for every cluster of the set of clusters.

FIG. 2 shows an iterative method 200 for clustering an input dataset.

The method 200 comprises a first step of computing 210 an update direction of a current iteration based on one or more cluster centroids of one or more previous iterations. The method comprises a second step of computing 220 an update step size factor based on a Tchebychev polynomial.

The method comprises a third step of computing 230 an updated cluster center based on the update direction and the update step size factor.

FIG. 3 shows a simplified flow chart of an iterative clustering method 300 according to a further embodiment of the present invention. The method comprises a first step 310 of setting initial cluster centers. For example, the initial cluster centers can be randomly chosen from the dataset.

In a second step 320, iteration steps are tabulated. In particular, this can comprise determining update step size factors τ_1, τ_2, τ_3, τ_4. These update step size factors can later be used to determine iteration step sizes.

The update step size factors can be determined by a profiling method in a method step 315. This can be performed in advance and can be based on profile information that includes estimates of minimum and maximum eigenvalues obtained on the input dataset to be clustered, on a predetermined reference dataset and/or on a subset of the input dataset.

The results of the profiling in method step 315 can be stored in a table that can be accessed during the further iteration steps.

In a third step 330, each data point of the dataset is assigned to the closest cluster center. In a fourth step 340, cluster centers are re-evaluated using an iterative approach.

In detail, the fourth step 340 can be described starting from a modification of the formula for the evaluation of cluster centers in K-Means. Let us represent this formula in evaluation form as follows:

x_i = x_{i−1} + A(x_{i−1}, x_{i−2})    (1)

Formula (1) represents a successive iteration technique widely used in the numerical solution of Linear Algebra problems, wherein we refer to A as a cluster update operator. The deficiency of the successive iteration technique is its slow convergence and the big number of iterations needed to get a result. Now let us consider a four-layer Tchebychev iteration method for the evaluation of cluster centers. Let us denote λ_min and λ_max as the minimal and maximal eigenvalues (or estimates of the minimal and maximal eigenvalues) of an operator that approximates the cluster update operator A.

Let us evaluate the so-called "optimal" time step in the simple iteration scheme as follows: an initial update step size factor is given by τ_0 = 2/(λ_min + λ_max), and a value ρ_0 is defined by ρ_0 = (λ_max − λ_min)/(λ_min + λ_max).

We suggest that τ is changing in the iteration process with respect to the number of iterations, and we evaluate it on each 4 subsequent iterations as follows:

Let i = 4q + m, then:

τ_{4q+m} = τ_0 / (1 + ρ_0 · cos[(2m − 1)π/8]),  m = 0, 1, 2, 3    (2)

Here cos[(2m − 1)π/8] are the roots of the Tchebychev polynomial of the 4th order.

Let us note that in this process we cannot evaluate λ_min and λ_max directly. As a result, the evaluation of centers is provided by the following modification of (1):

x_i = x_{i−1} + τ_{4q+m} · A(x_{i−1}, x_{i−2}),  i = 4q + m,  m = 0, 1, 2, 3    (3)

Now let us consider some properties of the unknown eigenvalues λ_min and λ_max.

Unfortunately, the cluster update operator A or its linear approximation cannot simply be expressed in analytical form. Therefore, one cannot evaluate or estimate eigenvalues of the linear approximation directly. For estimation of these eigenvalues, let us use the following properties of the cluster update operator A:

Property 1: This operator should be compressible to provide convergence of the iteration approximation process described by formula (1).

Property 2: The number of iterations N_iter in iteration process (1) should be as small as possible.

For evaluation of estimates of the unknown eigenvalues λ_max and λ_min, the following technique can be used:

1. Choose any small subset d of the considered dataset D = {p_1, p_2, p_3, ..., p_K}. Here, K is the number of points in the considered dataset.

2. The volume of the subset d should be small enough to provide evaluations by the Lloyd method in a short time. In practice, it is possible to use a random choice to select d from D.

3. Consider an unrestricted minimization problem for λ_max and λ_min as follows:

Find N_iter(λ_max, λ_min) → min    (4)

4. For the solution of the minimization problem (4) with respect to the two variables λ_max and λ_min, let us use the well-known alternating-variable descent method.

We refer to the above steps 1 to 4 as profiling or a profile-guided technique.
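The alternating-variable descent of step 4 can be sketched as follows. Here `count_iterations` stands for a routine that clusters the sample subset d with the given eigenvalue estimates and returns the required number of iterations N_iter; that routine, the grid-based search, and all names are illustrative assumptions, not from the application.

```python
def profile_eigenvalues(count_iterations, lmax_grid, lmin_grid, sweeps=3):
    """Alternating-variable descent for problem (4): fix one eigenvalue
    estimate, pick the candidate value of the other that minimizes the
    iteration count, then alternate for a few sweeps."""
    lmax, lmin = lmax_grid[0], lmin_grid[0]
    for _ in range(sweeps):
        # Descend in lambda_max with lambda_min fixed.
        lmax = min(lmax_grid, key=lambda v: count_iterations(v, lmin))
        # Descend in lambda_min with lambda_max fixed.
        lmin = min(lmin_grid, key=lambda v: count_iterations(lmax, v))
    return lmax, lmin
```

Since N_iter is an integer-valued, non-smooth function of the two estimates, a derivative-free coordinate search of this kind is a natural fit for the minimization.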

By profiling we obtained λ_max = 1.001 and λ_min = −0.168.

Numerical experiments showed that it is possible to use these values also for other data sets. Thus, the eigenvalues A max and A min do not need to be newly estimated for each new dataset. Instead, the above eigenvalue estimates can be used as "reference eigenvalue estimates" for new datasets. This significantly reduces the computational effort.

In a fifth step 350, a maximum absolute value of a difference Δ between centers of current and previous iterations is evaluated.

In a sixth step 360, it is evaluated whether Δ < ε, i.e., whether the maximum absolute difference value is smaller than a predetermined threshold. If so, the method continues in a final step 370, wherein the clustering quality is evaluated, and the method terminates. Otherwise, the method continues in step 330, wherein the data points are assigned to the closest cluster centers.

The overhead of the proposed approach is very small in comparison with the allocation of data with respect to the computed cluster centers. It can even be neglected, because the time steps are tabulated in advance and only evaluations by formula (3) are needed (1 additional addition and 1 additional multiplication per iteration).

Now let us consider a performance check of the proposed approach on the well-known data file kmeans_train_3089 consisting of 3089 lines with 5 columns. The exactness of convergence was taken equal to 10^−7. The appropriate results are presented in the following table:

Table 1. Comparison of the proposed technology with basement and alternative solutions

As shown in Table 1, alternative solutions for the K-Means clustering problem include B.T. Polyak's 3-layer Iteration Method and the Successive Over-Relaxation Method with changing relaxation coefficients. These approaches were also implemented by us for comparison and showed less effectiveness than the proposed technology (see fourth row in Table 1). It is also important to mention that profiling for these alternative approaches is a more complicated problem and requires several times more preliminary profiling work to choose the coefficients in the iteration formulas. Table 1 shows that the presented method provides a significant acceleration effect compared to the standard iteration method and also compared to other iteration methods. Advantages of the proposed method and device for clustering include:

■ It essentially decreases the volume of computations in the K-Means Machine Learning clustering method (up to 3.5 times acceleration)

■ The technique can simply be incorporated into other known realizations of K-Means and some other Machine Learning clustering tools and essentially improve their performance

■ This technique should decrease the volume of energy and resources needed for the big data clustering and classification

■ In Computer Vision run-time technologies this technique should essentially decrease the time of response

Embodiments of the invention comprise a method for dataset clustering by using a profile-guided multi-layer iteration scheme in K-Means clustering of a dataset, comprising the following steps:

Step 1. Data Selection for profiling: extraction of a small size data set from the considered dataset, in particular by random choice selection;

Step 2. Profiling: performing K-Means clustering of the selected small size data set by a multi-layer iteration scheme (such as Tchebychev multi-layer scheme) in K-Means clustering to get estimates of minimal and maximal eigenvalues of a multi-layer iteration scheme operator;

Step 3. Evaluation of changing iteration steps: Usage of minimal and maximal eigenvalues obtained on Step 2 for evaluation of changing iteration steps for the multi-layer iteration scheme in K-Means clustering;

Step 4. Clustering of the real huge dataset: Usage of the evaluated iteration steps, in particular obtained in Step 3, in the multi-layer iteration scheme in K-Means clustering for the real huge dataset.

The foregoing descriptions are only implementation manners of the present invention; the scope of the present invention is not limited to this. Any variations or replacements can easily be made by a person skilled in the art. Therefore, the protection scope of the present invention should be subject to the protection scope of the attached claims.