Title:
DEVICE AND METHOD FOR CLUSTERING OF INPUT-DATA
Document Type and Number:
WIPO Patent Application WO/2019/219198
Kind Code:
A1
Abstract:
The present invention provides a device (100) for clustering of input data (101) being a data-set comprising data-points, wherein the device (100) comprises an auto-encoding unit (102), configured to, in a first operating phase of the device (100), reduce a dimension of the input data (101) and/or extract features relevant for clustering from the input data (101), thereby creating low dimensional data (103) and a clustering unit (104), configured to, in a second operating phase of the device (100), obtain at least one cluster (105), based on the low dimensional data (103), and associate each data-point in the low dimensional data (103) to one cluster of the at least one cluster (105), wherein the low dimensional data (103) is optimized, by the auto-encoding unit (102), for being reconstructed without loss.

Inventors:
TZOREFF ELAD (DE)
KOGAN OLGA (DE)
CHOUKROUN YONI (DE)
Application Number:
PCT/EP2018/062961
Publication Date:
November 21, 2019
Filing Date:
May 17, 2018
Assignee:
HUAWEI TECH CO LTD (CN)
TZOREFF ELAD (DE)
International Classes:
G06F17/30
Foreign References:
US6134541A2000-10-17
Other References:
BO YANG ET AL: "Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering", COPYRIGHT, 13 June 2017 (2017-06-13), XP055452200, Retrieved from the Internet [retrieved on 20180219]
ELIE ALJALBOUT ET AL: "Clustering with Deep Learning: Taxonomy and New Methods", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 January 2018 (2018-01-23), XP080854112
LEYLI-ABADI MILAD ET AL: "Denoising Autoencoder as an Effective Dimensionality Reduction and Clustering of Text Data", 23 April 2017, MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION - MICCAI 2015 : 18TH INTERNATIONAL CONFERENCE, MUNICH, GERMANY, OCTOBER 5-9, 2015; PROCEEDINGS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CH, ISBN: 978-3-642-38287-1, ISSN: 0302-9743, XP047410529
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
CLAIMS

1. A device (100) for clustering of input data (101) being a data-set comprising data-points, wherein the device (100) comprises:

an auto-encoding unit (102), configured to, in a first operating phase of the device (100), reduce a dimension of the input data (101) and/or extract features relevant for clustering from the input data (101), thereby creating low dimensional data (103); and

a clustering unit (104), configured to, in a second operating phase of the device (100), obtain at least one cluster (105), based on the low dimensional data (103), and associate each data-point in the low dimensional data (103) to one cluster of the at least one cluster (105);

wherein the low dimensional data (103) is optimized, by the auto-encoding unit (102), for being reconstructed without loss.

2. The device (100) according to claim 1, wherein the low dimensional data (103) comprises linear independent code rows, thereby minimizing reconstruction loss.

3. The device (100) according to any one of the preceding claims, wherein the dimensional reduction of the input data (101) includes applying a first function to the input data (101), which is configured to minimize pairwise similarity of data-points in the input data (101), to provide the low dimensional data (103).

4. The device (100) according to claim 3, wherein a similarity metric is applied by the first function to the data-points in the input data (101).

5. The device (100) according to claim 4, wherein the similarity metric applied by the first function is a cosine similarity.

6. The device (100) according to any one of the preceding claims, wherein the device (100) further comprises a decoder (201), configured to decode the low dimensional data (103) and compare it to the input data (101), to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit (102), to minimize reconstruction loss.

7. The device (100) according to any one of the preceding claims, wherein the clustering unit (104) is further configured to obtain a centroid parameter for each cluster (105).

8. The device (100) according to claim 7, wherein the clustering unit (104) is further configured to determine, to which cluster a data-point is assigned, based on a centroid parameter of a cluster.

9. The device (100) according to any one of the preceding claims, wherein the clustering unit (104) is further configured to apply a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.

10. The device (100) according to claim 9, wherein a similarity metric applied by the second function is a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.

11. The device (100) according to any one of the preceding claims, wherein the clustering unit (104) is further configured to minimize similarities of data-points that are associated to different clusters (105).

12. The device (100) according to any one of the preceding claims, wherein the clustering unit (104) is further configured to maximize similarities of data-points that are associated to a same cluster (105).

13. The device (100) according to any one of the preceding claims, wherein in the second phase, the device (100) is further configured to optimize the operating parameters of the auto-encoding unit (102) based on operating parameters of the clustering unit (104), and optimize the operating parameters of the clustering unit (104) based on the operating parameters of the auto-encoding unit (102).

14. The device (100) according to claim 13, wherein the device (100) is further configured to simultaneously optimize the operating parameters of the auto-encoding unit (102) and the operating parameters of the clustering unit (104).

15. A method (300) for clustering of input data (101) being a data-set comprising data-points, wherein the method (300) comprises the steps of:

in a first operating phase of the device (100), reducing (301), by an auto-encoding unit (102), a dimension of the input data (101) and/or extracting (301), by the auto-encoding unit (102), features relevant for clustering from the input data (101), thereby creating low dimensional data (103); and

in a second operating phase of the device (100), obtaining (302), by a clustering unit (104), at least one cluster (105), based on the low dimensional data (103), and associating (302), by the clustering unit (104), each data-point in the low dimensional data (103) to one cluster of the at least one cluster (105);

wherein the low dimensional data (103) is optimized, by the auto-encoding unit (102), for being reconstructed without loss.

Description:
DEVICE AND METHOD FOR CLUSTERING OF INPUT-DATA

TECHNICAL FIELD

The present invention relates to the field of machine learning and clustering, i.e. a process of discovering similarity structures in large data-sets. More specifically, the present invention relates to a device for clustering of input data and a corresponding method, wherein the device comprises an auto-encoding unit and a clustering unit.

BACKGROUND

Presently, clustering is one of the most fundamental unsupervised machine learning problems. Its main goal is to separate data-sets of input data into clusters of similar data-points. For example, clustering can be used to cluster users based on user behavior, e.g. for cybersecurity purposes, event clustering for IT operations, or clustering and anomaly detection for healthcare applications or industrial monitoring applications. Besides these applications, clustering is beneficial for multiple other fundamental tasks. For instance, it can serve for automatic data labeling for supervised learning and as a pre-processing step for data visualization and analysis. In the prior art, dimensionality reduction and feature extraction have been used along with clustering, to map input data into a feature space in which separation into clusters is easier with respect to a present problem's context. Using deep neural networks (DNNs), it is possible to learn non-linear mappings, allowing input data to be transformed into more clustering-friendly representations.

In the prior art, dimension reduction/feature selection and clustering are treated separately in a two-phase process, as e.g. illustrated in Fig. 7. First, a dimension of input data is reduced and informative features are extracted by means of an auto-encoder. Second, clustering is applied to these features. However, there is a natural tension between auto-encoder components and clustering components: the auto-encoder chooses features that are optimized for reconstruction, with low loss, of all variations in the input data, while clustering requires features that can reduce all data variations to a single template (i.e. a single class, or a single cluster). In many cases, the auto-encoder output (obtained in the first phase) loses features that are important for the clustering (which is done in the second phase). Once the information is lost, the overall clustering has bad accuracy. For example, as is now described in view of Fig. 8, important features that are essential are lost, e.g. to differentiate between "9" and "4", as shown in Fig. 8A when running an auto-encoder on the Modified National Institute of Standards and Technology (MNIST) database, i.e. data-set. The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. As a result, the reconstruction of centroids (cf. Fig. 8B) when running clustering on data created by a conventional auto-encoder shows that there are two centroids for "9" and no centroid for "4". T-Distributed Stochastic Neighbor Embedding (t-SNE) visualization demonstrates this mistake as well.

That is, in the prior art, there is a demand for clustering solutions with higher accuracy.

SUMMARY

In view of the above-mentioned problems and disadvantages, the present invention aims to improve the conventional clustering devices. The present invention has the objective to provide a device for clustering of input data. The device comprises an auto-encoding unit and a clustering unit. The auto-encoding unit employs an auto-encoding algorithm that reaches high separability of data (resulting in higher clustering accuracy) by optimizing data even before it is processed in a clustering step in the clustering unit. To this end, the auto-encoding unit receives input data (which can be regarded as a data-set comprising data-points) and provides optimized output data, being low dimensional output data, to the clustering unit. The auto-encoder in particular aims at subspace dimension maximization regularized by reconstruction loss. In other words, a dimension of the input data is reduced, based on a reconstruction loss parameter of the input data. The dimension is only reduced to such an amount that reconstruction loss in the reduced data is minimized and that the low dimensional data forwarded by the auto-encoding unit, to the clustering unit, is optimized for achieving high clustering accuracy.

In the clustering step, which is subsequently employed in the clustering unit, the encoder output is used to obtain at least one cluster, based on the low dimensional data, and associate each data-point in the low dimensional data to one cluster of the at least one cluster. In particular, in the second phase, in a two-step alternating maximization of clustering and encoder parameters (i.e. of operating parameters of the clustering unit and the auto-encoding unit), higher accuracy can be achieved compared to the results reached by prior art solutions. In a first optional step, a similarity maximization of data-points is achieved by subspace dimension maximization, i.e. by further maximizing the dimension of the low dimensional data output by the auto-encoding unit.

This step is in particular performed in a clustering phase by the clustering unit. In a second optional step, similarity maximization regularized by cross-cluster similarity minimization and within-cluster similarity maximization is achieved. In other words, similarities of data-points associated with different clusters are minimized, and similarities of data-points associated with a same cluster are maximized.

The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.

A first aspect of the present invention provides a device for clustering of input data being a data-set comprising data-points, wherein the device comprises an auto-encoding unit, configured to, in a first operating phase of the device, reduce a dimension of the input data and/or extract features relevant for clustering from the input data, thereby creating low dimensional data; and a clustering unit, configured to, in a second operating phase of the device, obtain at least one cluster, based on the low dimensional data, and configured to associate each data-point in the low dimensional data to one cluster of the at least one cluster; wherein the low dimensional data is optimized, by the auto-encoding unit, for being reconstructed without loss.

This is beneficial, since unsupervised deep learning with high accuracy can be used on different types of data (not necessarily visual data) to solve the problem of clustering and anomaly detection, e.g. in user behavior analytics for cybersecurity, events correlation for IT operations, anomaly detection for healthcare applications or industrial monitoring applications. Further, quick convergence of an auto-encoder algorithm (already after several epochs) can be achieved. Additionally, auto-encoder code is smaller compared to the prior art, which requires less memory for inference.

In an implementation form of the first aspect, the low dimensional data comprises linear independent code rows, thereby minimizing reconstruction loss. Linear independent code rows in the low dimensional data output by the auto-encoder ensure that reconstruction loss, which occurs when reconstructing the input data from the low dimensional data, is minimized. This also helps to increase clustering accuracy of the clustering unit.

In an implementation form of the first aspect, the dimensional reduction of the input data includes applying a first function to the input data, which is configured to minimize pairwise similarity of data-points in the input data, to provide the low dimensional data.

Minimizing pairwise similarity of data-points in the auto-encoding step helps to improve accuracy of clustering in the clustering unit, and overall clustering accuracy.

In an implementation form of the first aspect, a similarity metric is applied by the first function to the data-points in the input data.

Using a similarity metric by the first function ensures that minimizing the pairwise similarity can be precisely controlled according to the similarity metric.

In an implementation form of the first aspect, the similarity metric applied by the first function is a cosine similarity.

Using a cosine similarity metric by the first function ensures that an effective metric can be used to minimize pairwise similarity.

In an implementation form of the first aspect, the device further comprises a decoder, configured to decode the low dimensional data and compare it to the input data, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit, to minimize reconstruction loss.

Comparing decoding results of the decoder with the input data allows to determine if reconstruction loss is effectively minimized and adjust operating parameters accordingly.

In an implementation form of the first aspect, the clustering unit is further configured to obtain a centroid parameter for each cluster.

Obtaining a centroid parameter for each cluster ensures that efficiency of processing is increased, since, instead of properties of all data-points that are associated to a cluster, only the centroid parameter needs to be evaluated during operation of the device.

In an implementation form of the first aspect, the clustering unit is further configured to determine, to which cluster a data-point is assigned, based on a centroid parameter of the cluster.

This ensures that efficiency of clustering in the clustering unit can be further improved.

In an implementation form of the first aspect, the clustering unit is further configured to apply a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.

This ensures that accuracy of clustering can be further improved in the clustering unit by minimizing pairwise similarity of data-points also in the clustering-unit.

In an implementation form of the first aspect, a similarity metric applied by the second function is a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.

This ensures that accuracy of clustering can be further improved in the clustering unit by minimizing pairwise similarity of data-points also in the clustering-unit.

In an implementation form of the first aspect, the clustering unit is further configured to minimize similarities of data-points that are associated to different clusters.

This ensures that accuracy of clustering can be further improved in the clustering unit.

In an implementation form of the first aspect, the clustering unit is further configured to maximize similarities of data-points that are associated to a same cluster.

This ensures that accuracy of clustering can be further improved in the clustering unit.

In an implementation form of the first aspect, in the second phase, the device is further configured to optimize operating parameters of the auto-encoding unit based on operating parameters of the clustering unit, and optimize the operating parameters of the clustering unit based on the operating parameters of the auto-encoding unit.

This ensures that accuracy of clustering can be further improved jointly in the auto-encoding unit and in the clustering unit.

In an implementation form of the first aspect, the device is further configured to simultaneously optimize the operating parameters of the auto-encoding unit and the operating parameters of the clustering unit.

This ensures that accuracy of clustering can be further improved simultaneously in the auto-encoding unit and in the clustering unit.

A second aspect of the present invention provides a method for clustering of input data being a data-set comprising data-points, wherein the method comprises the steps of, in a first operating phase of the device, reducing, by an auto-encoding unit, a dimension of the input data and/or extracting, by the auto-encoding unit, features relevant for clustering from the input data, thereby creating low dimensional data; and, in a second operating phase of the device, obtaining, by a clustering unit, at least one cluster, based on the low dimensional data, and associating, by the clustering unit, each data-point in the low dimensional data to one cluster of the at least one cluster; wherein the low dimensional data is optimized, by the auto-encoding unit, for being reconstructed without loss.

In an implementation form of the second aspect, the low dimensional data comprises linear independent code rows, thereby minimizing reconstruction loss.

In an implementation form of the second aspect, the dimensional reduction of the input data includes applying a first function to the input data, which is configured to minimize pairwise similarity of data-points in the input data, to provide the low dimensional data.

In an implementation form of the second aspect, a similarity metric is applied by the first function to the data-points in the input data.

In an implementation form of the second aspect, the similarity metric applied by the first function is a cosine similarity.

In an implementation form of the second aspect, the method further comprises decoding, by a decoder, the low dimensional data, and comparing it to the input data, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit, to minimize reconstruction loss.

In an implementation form of the second aspect, the method further includes obtaining, by the clustering unit, a centroid parameter for each cluster.

In an implementation form of the second aspect, the method further includes determining, by the clustering unit, to which cluster a data-point is assigned, based on a centroid parameter of the cluster.

In an implementation form of the second aspect, the method further includes applying, by the clustering unit, a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points.

In an implementation form of the second aspect, a similarity metric applied by the second function is a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.

In an implementation form of the second aspect, the method further includes minimizing, by the clustering unit, similarities of data-points that are associated to different clusters.

In an implementation form of the second aspect, the method further includes, by the clustering unit, maximizing similarities of data-points that are associated to a same cluster.

In an implementation form of the second aspect, in the second phase, the method further includes optimizing operating parameters of the auto-encoding unit based on operating parameters of the clustering unit, and optimizing the operating parameters of the clustering unit based on the operating parameters of the auto-encoding unit.

In an implementation form of the second aspect, the method further includes simultaneously optimizing the operating parameters of the auto-encoding unit and the operating parameters of the clustering unit.

The method of the second aspect and its implementation forms include the same advantages as the device according to the first aspect and its implementation forms.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear to a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above-described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a schematic view of a clustering device according to an embodiment of the present invention.

FIG. 2 shows a schematic view of a clustering device according to an embodiment of the present invention in more detail.

FIG. 3 shows a schematic view of a method according to an embodiment of the present invention.

FIG. 4 shows a schematic view of an auto-encoder.

FIG. 5 shows a schematic view of algorithms implemented by the present invention.

FIG. 6 shows a schematic view of operating results of the clustering device according to the present invention, compared to the prior art.

FIG. 7 shows a schematic view of clustering according to the prior art.

FIG. 8 shows another schematic view of clustering according to the prior art.

DETAILED DESCRIPTION OF EMBODIMENTS

FIG. 1 shows a schematic view of a clustering device 100 according to an embodiment of the present invention. The device 100 is configured for clustering of input data 101 being a data-set comprising data-points. The device 100 comprises an auto-encoding unit 102 and a clustering unit 104.

The auto-encoding unit 102 is configured to, in a first operating phase of the device 100, reduce a dimension of the input data 101 and/or extract features relevant for clustering from the input data 101, thereby creating low dimensional data 103. The low dimensional data 103 specifically includes a data-set with data-points of reduced dimension.

In other words, the auto-encoding unit 102 allows extraction of a discriminative and informative latent space (i.e. features) that is clustering oriented, from the input data 101. That is, in the first operating phase, the input data 101 is reduced to a low dimensional space, providing good initial conditions for processing in the clustering unit 104. In particular, keeping linear independency between code rows in the low dimensional data 103 helps in preserving essential features for clustering in the clustering unit 104. Thus, the low dimensional data 103 optionally can comprise linear independent code rows, thereby minimizing reconstruction loss. In particular, the low dimensional data 103 includes a data-set with data-points of reduced dimension. Further, the first operating phase can also be called a learning phase of the auto-encoding unit 102.

The device 100 further includes the clustering unit 104, which is configured to, in a second operating phase of the device 100, obtain at least one cluster 105, based on the low dimensional data 103, and associate each data-point in the low dimensional data 103 to one cluster of the at least one cluster 105. The low dimensional data 103, that is processed in the clustering unit 104, is optimized, by the auto-encoding unit 102, for being reconstructed without loss. Thus, by optimizing the low dimensional data 103, already in the auto-encoding unit 102, the present invention provides an increased overall clustering accuracy, by means of the device 100.

Although there are three clusters 105 shown in Fig. 1, there can be an arbitrary number of clusters obtained by the device 100, as long as there is at least one cluster 105.

In other words, clustering in the clustering unit 104 relates to grouping data-points (obtained from the low dimensional data 103) with similar patterns, based on their latent space representation, obtained in the first phase in the auto-encoding unit 102, to obtain several centroids (which may also be called centroid parameters) that characterize the data-set, such that each data-point is represented by a single centroid to which it is assigned. In other words, the clustering unit 104 is further configured to obtain a centroid parameter for each cluster 105. In this manner, each new data-point can be compared exclusively to the centroids, to decide to which cluster it is assigned, instead of comparing it to the entire data-set. That is, the clustering unit 104 is further configured to determine, to which cluster a data-point is assigned, based on a centroid parameter of a cluster. This improves efficiency of classification of new data-points.
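For illustration only, the centroid-based assignment described above can be sketched as follows; cosine similarity is used as the comparison metric, in line with the rest of the description, while the array shapes and function names are illustrative assumptions and not part of the present disclosure.

```python
import numpy as np

def assign_to_centroids(z, centroids):
    """Assign each latent data-point to the centroid with the highest cosine similarity.

    z:         (n, d) latent representations of the data-points
    centroids: (k, d) one centroid parameter per cluster
    returns:   (n,) index of the assigned cluster for every data-point
    """
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    c_n = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = z_n @ c_n.T              # (n, k) pairwise cosine similarities
    return sim.argmax(axis=1)      # compare only against centroids, not the whole data-set
```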

The first and the second operating phase may also be regarded as training phases of the device 100. In particular, the first operating phase may be called a training phase of the auto-encoding unit 102, while the second phase may be called a clustering phase.

The training phase of the auto-encoding unit 102 is motivated by the need to provide the clustering phase with a good initial separation of data-points in the low dimensional data 103. In this phase, a latent space with discriminative attributes is learned by using a discriminative loss function. The discriminative loss function enables minimization of pairwise similarity between each pair of data-points in the data-set (which is comprised by the input data 101). In other words, the dimensional reduction of the input data 101 includes applying a first function to the input data 101, which is configured to minimize pairwise similarity of data-points in the input data 101, to provide the low dimensional data 103. The first function can in particular be the discriminative loss function. A similarity metric can be applied by the first function to the data-points in the input data, to precisely adjust an outcome of the first function. The similarity metric applied by the first function in particular can be a cosine similarity. As a result, in the first phase, operating parameters of the auto-encoding unit 102 are optimized.
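A minimal sketch of such a first (discriminative) loss function, taking the cosine similarity as the similarity metric as suggested above, is given below; the batch handling and the exact weighting are illustrative assumptions rather than the exact function of the invention.

```python
import torch

def pairwise_discriminative_loss(z):
    """First-phase discriminative loss sketch: mean absolute pairwise cosine similarity
    over a batch of latent representations z of shape (batch, d). Minimizing this value
    pushes data-points apart in the latent space before clustering."""
    z_n = torch.nn.functional.normalize(z, dim=1)
    cos = z_n @ z_n.t()                           # (batch, batch) cosine similarities
    off_diag = cos - torch.diag(torch.diag(cos))  # ignore the constant diagonal (== 1)
    return off_diag.abs().mean()
```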

Then, the clustering phase is initialized with the latent space obtained by the auto-encoding unit 102 in the previous phase. In particular, parameters of centroids are now jointly optimized together with the auto-encoder parameters (i.e. the operating parameters of the auto-encoding unit 102). In this phase, the latent space is improved to achieve a higher accuracy of the overall clustering task in the device 100, while optimizing for the best representative centroids of each cluster.

In particular, a metric chosen as an objective for the clustering task is the sum of cosine similarities between each data-point and the centroid to which the data-point is assigned.

In the second operating phase, the processing of the clustering unit 104 in particular can comprise the following two optional steps: In the first step, cross-cluster separation of data-points is maximized by controlling operating parameters of the auto-encoding unit 102 by the clustering unit 104. In the second step, cross-cluster separation and within-cluster similarity are maximized by controlling operating parameters of the clustering unit 104.

In other words, no matter whether the first or the second step is performed, this can be regarded as the clustering unit 104 being further configured to apply a second function, to minimize pairwise similarity of data-points, and to improve separability of the data-points. The second function can either be applied to data-points in the input data 101, or to data-points in the low dimensional data 103. In particular, when the second function relates to step one above, a similarity metric applied by the second function can be a same similarity metric as applied by the first function, in particular a cosine similarity of data-points.

The above described steps one and two can optionally comprise at least one of the following details: In the first step, the centroids and cluster assignments obtained during the optimization in the auto-encoding unit 102 are not yet trusted. Therefore, regularizing the clustering task with the pairwise discriminative loss function from the auto-encoder training phase continues. That is, no distinction is made between pairs of data-points that are considered similar (i.e. that are assigned to a same cluster), and the ones that are considered dissimilar (i.e. that are assigned to different clusters).

In the second step, the centroids are trusted enough to apply their assignments of data-points, so as to divide the discriminative loss function (i.e. the second function) into two objectives: minimizing pairwise similarities of "between cluster" data-points (i.e. data-points associated to different clusters), and maximizing pairwise similarities of "within cluster" data-points (i.e. data-points that are associated to a same cluster).

That is, the clustering unit 104 optionally can be further configured to minimize similarities of data-points that are associated to different clusters 105. The clustering unit 104 can also optionally be configured to maximize similarities of data-points that are associated to a same cluster 105.

Optionally, the device 100 can also be configured to perform a third operating phase, which can be called an inference phase. In this phase, each new data-point that arrives at the device 100 passes through the auto-encoding unit 102 in order to extract the new data-point's latent representation. Then, a cosine similarity between the latent representation of the new data-point and each cluster is calculated. The new data-point is then assigned to the cluster with the highest cosine similarity. If a low cosine similarity with respect to all clusters is observed (based on typical values of similarity obtained from the training set), the data-point shall be considered an anomaly.
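A sketch of this third (inference) phase, under the assumption that a trained encoder function, the centroid matrix and a similarity threshold derived from the training set are available, might look like the following; the function names and the threshold handling are illustrative assumptions.

```python
import numpy as np

ANOMALY_LABEL = -1  # returned when no cluster is sufficiently similar

def infer(x_new, encode, centroids, threshold):
    """Assign a new data-point to a cluster, or flag it as an anomaly.

    x_new:     raw new data-point
    encode:    trained encoder function mapping raw data to the latent space
    centroids: (k, d) centroid parameters from the clustering phase
    threshold: minimal acceptable cosine similarity (taken from the training set)
    """
    z = encode(x_new)
    z = z / np.linalg.norm(z)
    c_n = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c_n @ z                       # cosine similarity to every centroid
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else ANOMALY_LABEL
```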

FIG. 2 shows a schematic view of the clustering device 100 according to an embodiment of the present invention in more detail. The device 100 of Fig. 2 is based on the device 100 of Fig. 1 and therefore includes all of its functions and features. To this end, identical features are labelled with identical reference signs. All features that are now going to be described in view of Fig. 2 are optional features of the device 100.

As it is shown in Fig. 2, the device 100 optionally further can comprise a decoder 201, configured to decode the low dimensional data 103 and compare it to the input data 101, to measure a reconstruction loss and adjust operating parameters of the auto-encoding unit 102, to minimize reconstruction loss.
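The feedback loop sketched above (decode the low dimensional data, compare the result to the input, and use the measured reconstruction loss to adjust the operating parameters of the auto-encoding unit) can be illustrated as follows; the layer sizes and the squared-error distance are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Toy auto-encoding unit 102 plus decoder 201; layer sizes are illustrative."""
    def __init__(self, in_dim=784, code_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)          # low dimensional data 103
        x_hat = self.decoder(z)      # decoded (reconstructed) input
        return z, x_hat

def reconstruction_step(model, optimizer, x):
    """Measure the reconstruction loss and adjust the operating parameters accordingly."""
    _, x_hat = model(x)
    loss = ((x - x_hat) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```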

As indicated in Fig. 2 by the dashed double arrow connecting the auto-encoding unit 102 and the clustering unit 104, to further improve accuracy of clustering, the device 100 optionally enables joint learning of clustering and auto-encoder parameters (i.e. operating parameters of the clustering unit 104 and the auto-encoding unit 102). That is, in particular in the second phase, the device 100 is optionally further configured to optimize the operating parameters of the auto-encoding unit 102 based on operating parameters of the clustering unit 104, and optimize the operating parameters of the clustering unit 104 based on the operating parameters of the auto-encoding unit 102. Further optionally, the device 100 can be configured to simultaneously optimize the operating parameters of the auto-encoding unit 102 and the operating parameters of the clustering unit 104.

In the following section, further details of the functionality of the device 100 as described in Fig. 1 or 2 are described. In particular, implementation aspects of the processing that is performed in the first operating phase and the second operating phase of the device 100 are described.

During the first operating phase (i.e. the training phase of the auto-encoding unit 102), the following loss function is minimized with respect to the parameters $\theta_e, \theta_d$ of the auto-encoding unit 102. In this function, $\theta_e, \theta_d$ are the operating parameters of the encoder and the decoder, respectively. $x_i$ denotes a raw data representation. $z_i(\theta_e)$ denotes a latent space representation (i.e. the output of the auto-encoding unit 102). $\hat{x}_i(\theta_e, \theta_d)$ denotes a reconstructed data-point. $\operatorname{sim}(z_i, z_j)$ stands for the cosine similarity between pairs of latent space representations of data-points. $d(\cdot,\cdot)$ is a distance between a data-point and its reconstruction (which can e.g. be $L_2$, $L_1$ or any other distance measure). $\lambda$ is a regularization strength:

$$L(\theta_e, \theta_d) = \lambda \sum_i d\big(x_i, \hat{x}_i(\theta_e, \theta_d)\big) + \sum_{i \neq j} \big|\operatorname{sim}\big(z_i(\theta_e), z_j(\theta_e)\big)\big|$$

The first component of the loss function $L(\theta_e, \theta_d)$ stands for the reconstruction loss, whereas the second component $\sum_{i \neq j} |\operatorname{sim}(z_i, z_j)|$ of the loss function stands for the maximal separability (which is obtained by minimizing the pairwise cosine similarity among all data-points' latent representations).

The solution of the present invention maximizes the dimensionality of the encoder sub-space by minimizing an $L_1$ norm of the cosine similarities between all pairs of samples in a batch matrix (i.e. a matrix comprising input data 101 for the device 100). This encourages linear independency between matrix rows in the low dimensional data 103, and therefore maximizes the rank of the matrix.

Since the training phase of the auto-encoding unit 102 is unsupervised, within-cluster pairs, whose cosine similarity is to be maximized, cannot be distinguished from cross cluster pairs, whose cosine similarity is to be minimized.

However, assuming that there is no dominant cluster in the data-set, and considering that the batch size is much larger than the number of clusters, there are significantly fewer within-cluster pairs than cross-cluster pairs in each batch matrix. For example, considering the MNIST data-set (with 10 clusters) and a batch size of $1000 \gg 10$, there are approximately $10 \cdot \binom{100}{2} = 49{,}500$ within-cluster pairs, versus approximately $450{,}000$ cross-cluster pairs.

Moreover, since a sparse cosine similarities vector is a desired result (i.e. all cross-cluster cosine similarities are desired to be zero and all within-cluster cosine similarities are desired to be 1), the $L_1$ norm of the cosine similarities vector is to be minimized. This encourages sparsity of the vector, with less shrinkage of higher-magnitude entries of the vector (as opposed to, for example, $L_2$). Using this approach, an accuracy of over 85% can be reached on MNIST already in the first operating phase, which is better than the existing prior art results.

The clustering phase, as already mentioned, is a two-step process. In the first step, the following loss function is maximized with respect to the auto-encoder parameters $\theta_e, \theta_d$, the centroids $\{\mu_k\}_{k=1}^{K}$ and an assignment matrix $S \in \mathbb{R}^{B \times K}$:

$$L_1(\theta_e, \theta_d, \mu, S) = \sum_{i} \sum_{k=1}^{K} S_{ik}\, \operatorname{sim}(z_i, \mu_k) \;-\; \lambda_{rec}\, L_r(X, \hat{X}) \;-\; \lambda_d\, L_d(Z; \theta_e)$$

The loss function is composed of three components (from left to right): a cosine similarity between centroids and their assigned data-points, which is to be maximized; a reconstruction loss; and a separability term (which is obtained by minimizing the pairwise cosine similarity among all data-points' latent representations, similar to the term that is used in the training phase of the auto-encoding unit 102).

In the second step of the clustering phase, the centroids and assignments are trusted, and the separability term is divided into two terms: a within-cluster similarity, which is to be maximized, and a between-cluster similarity, which is to be minimized. That is, the loss function of the second step in the second phase becomes:

$$L_2(\theta_e, \theta_d, \mu, S) = \sum_{i} \sum_{k=1}^{K} S_{ik}\, \operatorname{sim}(z_i, \mu_k) \;-\; \lambda_{rec}\, L_r(X, \hat{X}) \;+\; \lambda_w\, L_w(Z; \theta_e, S) \;-\; \lambda_b\, \big\lVert L_b(Z; \theta_e, S) \big\rVert_\infty$$

The first and second left-most components of the loss function $L_2$ are identical to those of $L_1$. The third term denotes the within-cluster similarity. The fourth term is the between-cluster similarity, an infinity norm of which is to be minimized over all average pairwise similarities between data-points assigned to disjoint clusters. This is equivalent to minimizing the worst case, in which two clusters share the highest similarities among all clusters. $\lambda_{rec}, \lambda_w, \lambda_b$ denote the regularization strengths of the different components of the loss function.

FIG. 3 shows a schematic view of a method 300 according to an embodiment of the present invention. The method 300 is for operating a device 100 and thus enables clustering of input data 101.

The method 300 comprises a first step of, in a first operating phase of the device 100, reducing 301, by an auto-encoding unit 102 of the device 100, a dimension of the input data 101 and/or extracting 301, by the auto-encoding unit 102, features relevant for clustering from the input data 101, thereby creating low dimensional data 103.

The method 300 further comprises a second step of, in a second operating phase of the device 100, obtaining 302, by a clustering unit 104 of the device 100, at least one cluster 105, based on the low dimensional data 103, and associating 302, by the clustering unit 104, each data-point in the low dimensional data 103 to one cluster of the at least one cluster 105; wherein the low dimensional data 103 is optimized, by the auto-encoding unit 102, for being reconstructed without loss.

In the following section, the present invention is described in more detail, in particular in view of Fig. 4 and Fig. 5. The details presented in the following can be regarded as optional features, which can be arbitrarily combined with the features of the device 100 as described in Fig. 1 or in Fig. 2.

The present invention teaches a paradigm in which a latent space representation of a certain data-set is trained jointly with clustering parameters. This end-to-end approach achieves a clustering-oriented representation and therefore better clustering accuracy. In most prior art solutions, a conventional auto-encoding unit is first trained to minimize reconstruction loss and then applied as an initial condition to the joint optimization problem, in which both clustering and encoder parameters are jointly optimized to minimize the clustering error. However, although major attention is dedicated to the clustering phase, it can be observed that in most cases the improvement of the clustering phase over the initial phase amounts to a maximum of 15% to 20% of accuracy. This leads to the conclusion that the initial phase has a significant effect on the overall accuracy, and a focus shall therefore be put on this step.

In the following, it is described how a discriminative auto-encoder is obtained in the first operating phase. Let $D$ denote a data-set grouped into $K$ clusters $\{C_k\}_{k=1}^{K}$, and let $x_i \in \mathbb{R}^p$ be a data-point in the data-set with feature dimension $p$. Let $z_i = f(x_i; \theta_e) \in \mathbb{R}^d$ stand for the latent representation of $x_i$, where the parameters of the encoder are denoted by $\theta_e$, and $p \gg d$. Let $\hat{x}_i = g(z_i; \theta_d)$ stand for a reconstructed data-point, that is, the output of the decoder, where the parameters of the decoder are denoted by $\theta_d$.

According to the present invention, a family of pairwise discriminative functions of the form $L_d(z_i, z_j) : \mathbb{R}^{d \times d} \to \mathbb{R}$ is proposed, where $L_d(z_i, z_j) = \operatorname{sim}(z_i, z_j)$, and $\operatorname{sim}(\cdot)$ stands for any similarity measure between a pair of data-points. Note that desired properties for the similarity measure are $\operatorname{sim}(z_i, z_i) = 1$ and $0 \le \operatorname{sim}(z_i, z_j) \le 1$. When applied to the entire data-set, this results in

$$L_d(\theta_e) = \sum_{i \neq j} w_{ij}\, \operatorname{sim}(z_i, z_j) \qquad (1)$$

where $w_{ij}$ are related to similarities between raw data-points. When prior knowledge is not available, it is set $w_{ij} = |D|^{-2}$, where $|D|$ stands for the cardinality of the data-set. It is to be noted that the objective function in formula (1) above is sub-optimal, since it penalizes all similarities regardless of whether they belong to a same cluster or not. Obviously, if the assignments of each data-point were available, formula (1) would have been split into two terms: one for the minimization of the cross-cluster similarities, and one for the maximization of the within-cluster similarities, yielding the following objective function instead:

$$L_d(\theta_e) = \frac{1}{N_b} \sum_{(i,j)\ \text{between clusters}} \operatorname{sim}(z_i, z_j) \;-\; \frac{1}{N_w} \sum_{(i,j)\ \text{within clusters}} \operatorname{sim}(z_i, z_j) \qquad (2)$$

where $N_b, N_w$ stand for the number of between-cluster and within-cluster pairs, respectively. However, the justification of formula (1) arises from the following observation: considering a balanced data-set that is not dominated by a single or a few clusters, the number of pairs in $D$ is $\binom{|D|}{2}$, and the within-cluster and between-cluster pair cardinalities are approximately $N_w \approx \frac{1}{K}\binom{|D|}{2}$ and $N_b \approx \frac{K-1}{K}\binom{|D|}{2}$, respectively. Accordingly, there are many more dissimilar pairs than similar ones. It is to be noted that $N_b$ increases with both $|D|$ and $K$, and for $|D| \gg K$, the number of within-cluster pairs is approximately a $\frac{1}{K}$ fraction of all data pairs.

In light of formula (2), a k-nearest neighbors graph is created between the data-points based on their original representation. A fraction of the pairs with the largest similarities from the k-nearest neighbors graph is then applied to formula (1) as anchor pairs whose similarity is to be maximized. Defining $A$ as the set of anchor pairs, this yields

$$L_d(\theta_e) = \frac{1}{|D|^2} \sum_{i \neq j} \operatorname{sim}(z_i, z_j) \;-\; \frac{\alpha}{|A|} \sum_{(i,j) \in A} \operatorname{sim}(z_i, z_j) \qquad (4)$$

where $\alpha < 1$ is used to compensate for the non-confidence of the similarities among the anchors. Define $\tilde{z}_i = \frac{z_i}{\lVert z_i \rVert}$ as the normalized latent space representation. The similarity measure $\operatorname{sim}(z_i, z_j) = |\tilde{z}_i^{T} \tilde{z}_j|$ is applied, where $|\cdot|$ stands for the absolute value. It is to be noted that formula (4) is the $L_1$ norm of all pairwise cosine similarities. $L_1$ is taken instead of e.g. $L_2$ due to the desired sparsity of the similarities (only approximately a $\frac{1}{K}$ fraction of non-zero elements), which is encouraged by the $L_1$ norm.

Since the data-set cannot, in general, be maintained in main memory, and training is performed on batches $B$ of the data-set by stochastic gradient descent (SGD), formula (4) shall be approximated using the batch matrix of the latent representation $Z \in \mathbb{R}^{|B| \times d}$, where $|B|$ denotes the cardinality of the batch. Define $\tilde{Z}$ as the row-wise normalized batch matrix (its $i$-th row is the row vector $\tilde{z}_i^{T}$), and $C = \tilde{Z}\tilde{Z}^{T} \in \mathbb{R}^{|B| \times |B|}$ as the pairwise cosine similarity matrix, i.e. $C_{ij} = \tilde{z}_i^{T}\tilde{z}_j$. Furthermore, for each batch, a k-nearest neighbors graph is constructed and a set of anchor pairs $A_B$ is determined, yielding the following approximation to formula (4):

$$L_d(Z; \theta_e) = \frac{1}{|B|^2} \sum_{i,j} |C_{ij}| \;-\; \frac{\alpha}{|A_B|} \sum_{(i,j) \in A_B} C_{ij} \qquad (5)$$

It should be noted that the diagonal terms of $C$ are constant and equal to 1, and therefore do not affect the optimization. In order to avoid an arbitrary discrimination of the data-points that is not aligned with the actual separation, it is proposed to regularize formula (5) with the reconstruction loss, yielding

$$L(X; \theta_e, \theta_d) = L_d(Z; \theta_e) + \lambda\, L_r(X, \hat{X}) \qquad (6)$$

where $\lambda$ stands for the regularization strength, $L_r$ denotes the reconstruction loss, e.g. $L_r(X, \hat{X}) = \lVert X - \hat{X} \rVert_F^2$, $X$ stands for the raw input batch matrix, $\hat{X}$ for the reconstructed batch matrix, and $\lVert \cdot \rVert_F$ stands for the Frobenius norm.
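A sketch of how formulas (5) and (6) could be evaluated on a batch is given below; the anchor-pair selection by a k-nearest-neighbors graph is only stubbed (a list of index pairs is assumed), and all tensor shapes and default values are illustrative assumptions.

```python
import torch

def discriminative_loss(z, anchor_pairs, alpha=0.5):
    """Batch approximation in the spirit of formula (5): L1 norm of all pairwise
    cosine similarities, minus a weighted sum over the anchor pairs A_B (a list of
    (i, j) index tuples) whose similarity is to be maximized."""
    z_tilde = torch.nn.functional.normalize(z, dim=1)   # row-wise normalized batch matrix
    c = z_tilde @ z_tilde.t()                           # pairwise cosine similarity matrix C
    loss = c.abs().mean()                               # 1/|B|^2 * sum_ij |C_ij|
    if anchor_pairs:                                    # anchor pairs from a k-NN graph (not shown)
        idx_i, idx_j = zip(*anchor_pairs)
        loss = loss - alpha * c[list(idx_i), list(idx_j)].mean()
    return loss

def total_loss(x, x_hat, z, anchor_pairs, lam=0.1):
    """Formula (6) sketch: discriminative loss regularized by the reconstruction loss
    (squared Frobenius norm of X - X_hat), weighted by lambda."""
    rec = ((x - x_hat) ** 2).sum()                      # ||X - X_hat||_F^2
    return discriminative_loss(z, anchor_pairs) + lam * rec
```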

In the following, it is described how dissimilarity is maintained in the clustering phase (i.e. the second operating phase).

After obtaining the discriminative latent space in the first operating phase of the device 100, it is applied as an initialization to the clustering phase (i.e. the second operating phase of the device 100). In this step, both the encoder-decoder parameters $\theta_e, \theta_d$ and the centroids $\{\mu_k\}_{k=1}^{K} \subset \mathbb{R}^d$ are jointly optimized. They represent the new optimization variables for the clustering objective. A natural candidate for the objective function of the clustering phase is the cosine similarity between the learned centroids and each data-point, since the cosine similarity is applied to discriminate between pairs of data-points in the initial phase. Accordingly, the primary goal of the clustering phase is to maximize the following objective function:

$$L_c(\theta_e, \mu, S) = \sum_{i} \sum_{k=1}^{K} S_{ik}\, \operatorname{sim}(z_i, \mu_k) \qquad (7)$$

where $S$ stands for the assignment matrix and $S_{ik} \in \{0, 1\}$ are the hard decisions of the clustering procedure. The clustering phase is divided into two steps. In a first step, the clustering assignments are not yet trusted, and they continue to be regularized with formula (5) and the reconstruction loss. In a second step, the clustering assignments are trusted, and the discriminative objective is split into between-cluster and within-cluster terms, such that the assignments of each data-point to the associated cluster are derived from the assignment matrix $S$. Therefore, the optimization problem solved in the first step is given by

$$\max_{\theta_e, \theta_d, \mu, S} \; L_c(\theta_e, \mu, S) \;-\; \lambda_d\, L_d(Z; \theta_e) \;-\; \lambda_r\, L_r(X, \hat{X}) \qquad (8)$$

where $\lambda_d, \lambda_r$ stand for the regularization strengths of the discriminative and reconstruction losses, respectively. For the second step, a similarity measure between each pair of clusters is defined, yielding

$$L_b(\theta_e, S) = \max_{k \neq l} \; \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \big|\operatorname{sim}(z_i, z_j)\big| \qquad (9)$$

It should be noted that formula (9) penalizes the worst case, that is, the pair of clusters with the greatest similarity. In the same manner, this is done for the within-cluster objective:

$$L_w(\theta_e, S) = \min_{k} \; \frac{1}{|C_k|^2} \sum_{i \in C_k} \sum_{j \in C_k} \operatorname{sim}(z_i, z_j) \qquad (10)$$

It should be noted that in formula (10), the absolute value has been omitted, since a value of 1 for the cosine similarity between pairs of data-points is preferable over $-1$. The optimization problem in the second step becomes

$$\max_{\theta_e, \theta_d, \mu, S} \; L_c(\theta_e, \mu, S) \;+\; \lambda_w\, L_w(\theta_e, S) \;-\; \lambda_b\, L_b(\theta_e, S) \;-\; \lambda_r\, L_r(X, \hat{X}) \qquad (11)$$

where $\lambda_b, \lambda_w, \lambda_r$ stand for the regularization strengths of the between-cluster, within-cluster and reconstruction losses, respectively.
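The clustering-phase terms can be sketched as follows; the hard assignment matrix is represented as an index vector, and the exact averaging inside the between-cluster and within-cluster terms is an illustrative assumption consistent with formulas (7) to (11) above, not the exact implementation of the invention.

```python
import torch

def cluster_similarity(z, mu, assign):
    """Formula (7) sketch: sum of cosine similarities between each data-point and its
    assigned centroid; assign is a LongTensor of cluster indices per data-point."""
    z_n = torch.nn.functional.normalize(z, dim=1)
    mu_n = torch.nn.functional.normalize(mu, dim=1)
    return (z_n * mu_n[assign]).sum()

def between_within_terms(z, assign, k):
    """Worst-case between-cluster similarity (cf. formula (9)) and worst-case
    within-cluster similarity (cf. formula (10)); averaging choices are illustrative."""
    z_n = torch.nn.functional.normalize(z, dim=1)
    c = z_n @ z_n.t()
    between, within = [], []
    for a in range(k):
        ia = (assign == a).nonzero(as_tuple=True)[0]
        if len(ia) > 1:
            within.append(c[ia][:, ia].mean())
        for b in range(a + 1, k):
            ib = (assign == b).nonzero(as_tuple=True)[0]
            if len(ia) and len(ib):
                between.append(c[ia][:, ib].abs().mean())
    l_b = torch.stack(between).max() if between else torch.tensor(0.0)  # infinity norm over cluster pairs
    l_w = torch.stack(within).min() if within else torch.tensor(0.0)    # worst within-cluster similarity
    return l_b, l_w

def second_step_objective(z, x, x_hat, mu, assign, k, lam_b=1.0, lam_w=1.0, lam_r=0.5):
    """Formula (11) sketch: maximize centroid and within-cluster similarity while
    minimizing between-cluster similarity and the reconstruction loss."""
    l_b, l_w = between_within_terms(z, assign, k)
    rec = ((x - x_hat) ** 2).sum()
    return cluster_similarity(z, mu, assign) + lam_w * l_w - lam_b * l_b - lam_r * rec
```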

Now, in a specific example, a system architecture of the device 100 is described. The device 100 comprises two parts: an auto-encoder (or encoder, or auto-encoding unit 102) and clustering (or clustering unit 104). An auto-encoder network of the auto-encoder can be a fully convolutional neural network with alternating convolution with batch normalization and max pooling layers. The decoder applies upsampling to higher resolution, using resize with nearest-neighbor interpolation followed by convolution layers, also with layer-wise batch normalization. The auto-encoder architecture is depicted in Fig. 4.
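A compact sketch of an architecture of this kind (alternating convolution, batch normalization and max pooling in the encoder; nearest-neighbor upsampling followed by convolutions with layer-wise batch normalization in the decoder) is given below; the channel counts and the 28x28 input size are illustrative assumptions and not the exact network of Fig. 4.

```python
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Fully convolutional auto-encoder in the spirit of Fig. 4; sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),                                      # 7 -> 14
            nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),                                      # 14 -> 28
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent representation (low dimensional data 103)
        return z, self.decoder(z)    # reconstruction used for the loss of the decoder 201
```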

In the following, a training strategy for the auto-encoder is described.

Training the auto-encoder in the initial phase (i.e. the first operating phase) begins with the minimization of formula (6). The regularization strength of the discriminative loss is a hyper-parameter that is determined in the region $\lambda \in (0, 1]$. The value of $\lambda$ differs among different data-sets, such that data-sets that are more complex require more aggressive discrimination, while the strength of the reconstruction loss is kept constant. The training is done on large batches, to ensure $|B| \gg K$. As the training proceeds, $\lambda$ is increased in order to optimize for a discrimination in directions that enable a reasonable reconstruction. The training scheme for the auto-encoder is summarized by the algorithm shown in Fig. 5A.

In the clustering phase, the clustering variables $\{\mu_k\}_{k=1}^{K}$ and $S$ are jointly optimized with the auto-encoder parameters $\theta_e, \theta_d$. An alternating maximization scheme is applied, in which each set of variables is optimized while the other sets remain fixed. The optimization process begins with the initialization of $\{\mu_k\}_{k=1}^{K}$ by optimizing formula (7) based on the entire data-set $D$. Next, it is alternated over the maximization of the assignment matrix $S$, followed by the maximization of $\{\mu_k\}_{k=1}^{K}$, and finally the maximization over the auto-encoder parameters. The optimization procedure iterates until convergence. The clustering phase is divided into two stages (i.e. the two steps of the second phase), which differ from each other by the objective functions they aim to maximize.

The pseudo-code for the first stage (of the second phase) is summarized in the algorithm shown in Fig. 5B. In the first stage, formula (8) is optimized, while using a relatively large regularization strength for the discriminative loss, $\lambda_d \in [1, 5]$, and a lower regularization strength, $\lambda_r \in (0, 1]$, for the reconstruction loss. The while loop in the algorithm refers to the alternating maximization scheme, in which each set of parameters is maximized over several epochs, where the optimization is carried out using back-propagation for each batch $Z, X \in B$. Termination of the inner loop occurs either when the maximal number of iterations $N_t$ is exceeded, or when the clustering objective $L_c$ does not improve over consecutive iterations above a predefined tolerance $tol$. Note that $\lambda_d, \lambda_r, \lambda_{min}, \beta$ are hyper-parameters and are data-set dependent.
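The alternating maximization scheme of this first clustering stage can be summarized in the following sketch; it mirrors the described loop (assignments, then centroids, then encoder parameters, repeated until the objective stops improving), with all helper callables and hyper-parameter values taken as assumptions rather than the exact algorithm of Fig. 5B.

```python
def clustering_stage(batches, encode, update_centroids, update_encoder,
                     assign_to_centroids, clustering_objective,
                     centroids, max_iters=50, tol=1e-4):
    """Alternating maximization sketch for the first clustering stage.

    All callables (encode, update_centroids, update_encoder, assign_to_centroids,
    clustering_objective) are assumed to be provided by the surrounding training code.
    """
    prev = float("-inf")
    for _ in range(max_iters):                                    # outer while loop of the algorithm
        for x in batches:
            z = encode(x)
            assignments = assign_to_centroids(z, centroids)       # maximize over S
            centroids = update_centroids(z, assignments, centroids)  # maximize over the centroids
            update_encoder(x, assignments, centroids)             # maximize over encoder parameters
        current = clustering_objective(batches, centroids)
        if current - prev < tol:                                  # no improvement above the tolerance
            break
        prev = current
    return centroids
```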

The second stage is initialized with the parameters from the first stage, $\theta_e^{*}, \theta_d^{*}, \{\mu_k^{*}\}_{k=1}^{K}$. Then, formula (11) is optimized, wherein, similarly to the previous stage, the discriminative regularization strengths are set to a relatively high value, that is, $\lambda_b, \lambda_w \in [1, 5]$, while the regularization strength of the reconstruction loss remains unchanged. The process iterates until convergence. The maximization of each set of variables is carried out using back-propagation of large batches over several epochs. In both stages, large batches are used, as in the auto-encoder training phase, and for several epochs. The entire procedure of the clustering step is similar to the pseudo-code of the algorithm shown in Fig. 5B, but now with formula (11) and its associated hyper-parameters.

FIG. 6 shows a schematic view of operating results of the clustering device 100 according to the present invention, compared to the prior art. Assuming that an output of the auto-encoder phase can be evaluated, and a singular value histogram of the batch output can be calculated, it can be demonstrated how the device 100 performs when being run on an MNIST data-set. The solution according to the present invention demonstrates an easily identifiable signature of singular value decomposition values, as shown in Figs. 6B and 6C. The results of the present invention can be detected and are identifiable regardless of a layer size of the auto-encoding unit 102. In Fig. 6, "corr" is a hyper-parameter that is used by the auto-encoding algorithm to define the required strength of the linear independence of the data (a larger value means a stronger independence requirement). In Fig. 6A, all results are spread across many values, and it can be seen that the device according to the present invention is not used. In Figs. 6B and 6C, most of the values are grouped together, or even a significant peak (that is not zero) can be observed. It can thus easily be seen that the solution according to the present invention is used.
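The singular-value signature discussed in view of Fig. 6 can be checked with a short sketch like the following; it simply computes the singular values of a (row-wise normalized) batch of latent codes, which should concentrate around a common non-zero value when the discriminative auto-encoder is used. The normalization step and the bin count are illustrative assumptions.

```python
import numpy as np

def singular_value_histogram(z_batch, bins=50):
    """Singular values of a batch of latent codes; a concentrated, non-zero peak
    indicates near linear independence of the code rows (cf. Figs. 6B and 6C)."""
    z = z_batch / np.linalg.norm(z_batch, axis=1, keepdims=True)  # row-wise normalization (assumption)
    singular_values = np.linalg.svd(z, compute_uv=False)
    hist, edges = np.histogram(singular_values, bins=bins)
    return singular_values, hist, edges
```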

The present invention has been described in conjunction with various embodiments as examples, as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.