


Title:
GENERATING HIGH-DIMENSIONAL, HIGH UTILITY SYNTHETIC DATA
Document Type and Number:
WIPO Patent Application WO/2021/228404
Kind Code:
A1
Abstract:
In some examples, a computer-implemented method for generating high-dimensional, high utility synthetic data comprises generating a differentially privatised global model using a global model, the differentially privatised global model defining an autoencoder configured to map high-dimensional user data to a lower dimensional feature space, and iteratively refining the global model on the basis of multiple differentially privatised local models received from a network of user equipment defining a federated learning structure. The refinement process can proceed by broadcasting the differentially privatised global model to the network of user equipment as part of a refinement iteration, receiving updated versions of the multiple differentially privatised local models from the network of user equipment, and on the basis of a convergence threshold representing convergence of the differentially privatised global model to a selected measure of accuracy according to a loss function, using the differentially privatised global model to generate a set of synthetic data by selecting a set of random latent features using a predefined distribution as input to the differentially privatised global model whereby to generate a set of synthetic data as output of the differentially privatised global model. The generated synthetic data can be used for data mining and the building of machine learning models and so on.

Inventors:
JIANG XUE (DE)
ZHOU XUEBING (DE)
Application Number:
PCT/EP2020/063565
Publication Date:
November 18, 2021
Filing Date:
May 15, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
JIANG XUE (DE)
International Classes:
G06N3/04; G06F21/62; G06N3/08
Other References:
QINGRONG CHEN ET AL: "Differentially Private Data Generative Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 December 2018 (2018-12-06), XP080989669
BRENDAN MCMAHAN H ET AL: "Learning Differentially Private Recurrent Language Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 October 2017 (2017-10-19), XP081319236
ABAY NAZMIYE CEREN ET AL: "Privacy Preserving Synthetic Data Release Using Deep Learning", ADVANCES IN DATABASES AND INFORMATION SYSTEMS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, vol. 11051 Chap.31, no. 558, 18 January 2019 (2019-01-18), pages 510 - 526, XP047500612, ISBN: 978-3-319-10403-4, [retrieved on 20190118]
UTHAIPON TANTIPONGPIPAT ET AL: "Differentially Private Mixed-Type Data Generation For Unsupervised Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 December 2019 (2019-12-06), XP081546787
M. ABADI, A. CHU, I. GOODFELLOW, H. B. MCMAHAN, I. MIRONOV, K. TALWAR, L. ZHANG: "Deep learning with differential privacy", PROCEEDINGS OF THE 2016 ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, 2016, pages 308 - 318
Attorney, Agent or Firm:
KREUZ, Georg (DE)
Claims:
Claims

1. A computer-implemented method for generating high-dimensional, high utility synthetic data, the method comprising:
generating a differentially privatised global model using a global model, the differentially privatised global model defining an autoencoder configured to map high-dimensional user data to a lower dimensional feature space;
iteratively refining the global model on the basis of multiple differentially privatised local models received from a network of user equipment defining a federated learning structure, by:
broadcasting the differentially privatised global model to the network of user equipment as part of a refinement iteration;
receiving updated versions of the multiple differentially privatised local models from the network of user equipment; and
on the basis of a convergence threshold representing convergence of the differentially privatised global model to a selected measure of accuracy according to a loss function, using the differentially privatised global model to generate a set of synthetic data by:
selecting a set of random latent features using a predefined distribution as input to the differentially privatised global model whereby to generate a set of synthetic data as output of the differentially privatised global model.

2. The method as claimed in claim 1, further comprising: initialising the differentially privatised global model using initialisation data.

3. The method as claimed in claim 2, wherein the initialisation data comprises one or more of random data, data from a synthetic database, and data from a public database.

4. The method as claimed in claim 2 or 3, further comprising: evaluating the utility of an initialised differentially privatised global model using an attribute-wise evaluation method.

5. The method as claimed in claim 4, further comprising: generating a measure of the divergence between an attribute distribution of data generated using the initialised differentially privatised global model and an attribute distribution of real data.

6. The method as claimed in claim 2 or 3, further comprising: evaluating the utility of an initialised differentially privatised global model using a record-wise evaluation method.

7. The method as claimed in any preceding claim, further comprising: aggregating the multiple differentially privatised local models received from the network of user equipment whereby to generate a set of parameters; and using the parameters to update the global model.

8. The method as claimed in any preceding claim, wherein the differentially privatised global model is a generative autoencoder.

9. The method as claimed in claim 8, further comprising using a set of random latent features as input to a decoder of the generative autoencoder.

10. The method as claimed in any preceding claim, wherein the convergence threshold represents a training loss associated with the differentially privatised global model.

11. The method as claimed in any preceding claim, further comprising: tracking a cost for privatizing a local update at a user equipment using a local privacy monitor.

12. The method as claimed in any preceding claim, further comprising: tracking a cost for privatizing a global model using a server privacy monitor.

13. The method as claimed in claim 11 or 12, wherein a privacy monitor is used to control privacy loss.

14. The method as claimed in any preceding claim, further comprising: generating random data from a predefined distribution; and inputting the generated data to a decoder component of the autoencoder, whereby to generate the set of synthetic data.

15. User equipment forming a node in a federated learning framework, the user equipment comprising a processor coupled to a memory, the processor configured to: receive a first instantiation of a framework to generate synthetic data representing a user profile from a remote service; using local data, perform a modification of the first instantiation of the framework by adjusting a set of parameters defining the first instantiation of the framework whereby to generate an updated instantiation of the framework; differentially privatise the updated instantiation of the framework to form a privatised local framework; and provide the privatised local framework to the remote service.

16. User equipment as claimed in claim 15, wherein the received first instantiation of the framework defines a differentially privatised autoencoder.

Description:
GENERATING HIGH-DIMENSIONAL, HIGH UTILITY SYNTHETIC DATA

TECHNICAL FIELD

Aspects relate, in general, to a method for generating high-dimensional, high utility synthetic data, and more particularly, although not exclusively, to methods for generating such data in a federated learning structure in which differential privacy is applied at multiple stages of a training iteration.

BACKGROUND

Services for user equipment, such as mobile telephones and smart devices for example, are ubiquitous. Such services enable a plethora of bespoke recommendations and information to be provided to a user based on, for example, historic choices and/or data representing, e.g., a user profile such as age, sex, height, purchase history and so on. The more information that is available as a reference point for a user, the more accurate a tailored recommendation will be, which can increase the degree to which a user engages with a service, for example. Typically, the information available representing a user and/or their choices and preferences can be used as training data for a service that is underpinned by an artificial intelligence (AI) model used to generate a set of tailored responses to a query from the user or from a service being used.

To be of any value in terms of their accuracy and efficacy, such AI services require a large amount of data from user devices for model training. With the rapid development of network and computer technologies, a large amount and variety of multi-dimensional person-specific data is generated on local devices. This data can contain rich univariate and multivariate statistical information, which can be used to build high-accuracy AI services. However, since the data are generated based on, e.g., users' daily behaviours, direct collection may reveal sensitive information about individuals and lead to severe privacy problems.

To this end, local differential privacy (LDP) can be employed to enable privacy-preserving data collection. That is, the user data can be locally randomized before it is sent 'off-device' for the purposes of training, for example. Broadly speaking, LDP algorithms ensure that the server used to build a model implementing a service cannot see the original user data but is able to learn a population's overall statistics. However, LDP mechanisms only support the collection of low-dimensional data (of the order of around 10 dimensions, for example), which limits their usefulness and the utility of any information that may be garnered from a model trained using the data.

SUMMARY

According to a first aspect, there is provided a computer-implemented method for generating high-dimensional, high utility synthetic data, the method comprising generating a differentially privatised global model using a global model, the differentially privatised global model defining an autoencoder configured to map high-dimensional user data to a lower dimensional feature space. In an implementation of the first aspect, the autoencoder can comprise two components: an encoder for projecting high-dimensional data to low-dimensional data, and a decoder for projecting the lower dimensional data back to high-dimensional data. The distribution of features in the lower dimensional space can be forced to follow a predefined distribution, e.g., a standardised Gaussian distribution. Once the autoencoder is trained, the decoder component can be used for data generation.

According to the first aspect, the global model is iteratively refined on the basis of multiple differentially privatised local models received from a network of user equipment defining a federated learning structure, by broadcasting the differentially privatised global model to the network of user equipment as part of a refinement iteration, receiving updated versions of the multiple differentially privatised local models from the network of user equipment, and for instance, on the basis of a convergence threshold representing convergence of the differentially privatised global model to a selected measure of accuracy according to a loss function, using the differentially privatised global model to generate a set of synthetic data by selecting a set of random latent features using a predefined distribution as input to the differentially privatised global model whereby to generate a set of synthetic data as output of the differentially privatised global model.

From a privacy point of view, federated learning with differential privacy provides a high degree of protection for user data which is used to train a model that can be used to generate synthetic data. That is, training can proceed without collection of local (raw) data at a server. Furthermore, a differentially privatised model as provided herein supports the collection of high-dimensional data and the subsequent generation of high-dimensional synthetic data. In general, when considering high-dimensional data, as the number of categorical attributes increases linearly, the data domain (i.e. the number of possible combinations over all the attributes) increases exponentially. Directly randomizing the original data (where the data domain is proportional to either the length of anonymized data or the estimation error) results in significant communication cost and low data utility in the case of a large data domain. Thus, statistical information of the original data can be learnt using a model that is able to generate synthetic data without the need to directly randomize the original data. The method as provided herein can be applied to categorical, numerical and multimedia data, such as image and video data. That is, the method can be applied both on structured data (e.g. collecting [age, job, salary] from clients) and on unstructured data such as images, audio data and so on. In an example, pre-encoding and post-decoding can be used so that the autoencoder can be applied to categorical data (e.g. job).

The utility of the set of synthetic data can be evaluated using an attribute-wise evaluation method. The evaluation of utility can be used in a mechanism for pre-tuning a model in order to reduce the number of iterations made until model convergence. This can include generating a measure of the divergence between an attribute distribution of synthetic data and an attribute distribution of real data. The utility of the set of synthetic data can also be evaluated using a record-wise evaluation method, either in isolation or in combination with an attribute-wise evaluation. Record-wise evaluation can comprise comparing the outputs of a pair of frameworks, one trained with real data, such as real data obtained from a public database for example, and one trained with the synthetic data. In an example, a public database can be used to design the model structures and pre-tune a model. The pre-tuned model can be further trained according to the mechanism described herein using federated learning and differential privacy. Pre-tuning can reduce the number of iterations performed until model convergence. That is, the structure of a model can be designed around the structure of a public dataset. The public dataset can be used to simulate a data collection process and tune model parameters by evaluating the utility of synthetic data generated using the model. Also, the public dataset can be used for pre-training the autoencoder, which helps accelerate model convergence.

A framework according to an example that comprises a combination of an autoencoder, federated learning and differential privacy thus enables collection of high-dimensional data with strong privacy guarantees while preserving data utility.

In an implementation of the first aspect, the multiple differentially privatised local models received from the network of user equipment can be aggregated whereby to generate a set of parameters, and the parameters can be used to update the global model. The differentially privatised global model is, in an example, a generative autoencoder. The set of random latent features can be provided as input to a decoder of the autoencoder. An iterative refinement process can end at a point when a convergence threshold representing a training loss associated with the differentially privatised global model is met. In an example, the global model can be initialised using data from a public database or from a randomly generated synthetic database. As noted above, this provides a mechanism to pre-tune a model.

The latent features of the autoencoder can be made to follow a Gaussian or normal distribution, or any other suitable distribution, such as any other continuous probability distribution for example. Random Gaussian-distributed data can be generated and input to the decoder component of the autoencoder whereby to generate the set of synthetic data.

According to a second aspect, there is provided user equipment forming a node in a federated learning framework, the user equipment comprising a processor coupled to a memory, the processor configured to receive a first instantiation of a framework to generate synthetic data representing a user profile from a remote service, using local data, perform a modification of the first instantiation of the framework by adjusting a set of parameters defining the first instantiation of the framework whereby to generate an updated instantiation of the framework, differentially privatise the updated instantiation of the framework to form a privatised local framework, and provide the privatised local framework to the remote service. The received first instantiation of the framework can define a differentially privatised autoencoder. In an implementation of the second aspect, the framework can form a model that can be trained locally with user data on the user equipment devices. A trained model can be differentially privatised by way of the addition of noise, for example.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure, reference is now made, by way of example only, to the following descriptions taken in conjunction with the accompanying drawings, in which:

Figure 1 is a schematic representation of a method for generating high-dimensional, high utility synthetic data according to an example;

Figure 2 is a schematic representation of the method of figure 1, according to an example;

Figure 3 is a schematic representation of data pre-processing and data postprocessing according to an example; and

Figure 4 is a schematic representation of the distribution of attributes of exemplary data according to an example.

DESCRIPTION

Example embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein. Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.

The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.

In real-life scenarios, data collected from user equipment for the purposes of, e.g., training an AI service, comprises many dimensions. In some cases, such data can comprise in excess of 100 dimensions, which makes LDP approaches infeasible. For example, in the case of multidimensional data, each attribute can be separately randomised locally (i.e. on the user equipment) to obtain a univariate distribution on a server side where the data can be used for training purposes for example. However, since the attributes in multi-dimensional data are usually associated, separately applying LDP on individual dimensions will break possible multivariate correlations and cause a loss of information. If all of the attributes are considered as one in order to estimate a joint distribution, only low-dimensional data (such as data with a number of dimensions up to around 10, for example) can be considered since the domain size increases exponentially in relation to the number of dimensions. This not only leads to a significant rise in computational and communication costs but can also cause degradation in data utility.

According to an example, an autoencoder, such as a generative autoencoder for example, can be used to simulate the distribution of the high dimensional local data to enable reliable synthetic data to be generated. Federated learning and differential privacy can be employed to train the generative autoencoder in a distributed setting. Accordingly, data synthesis can be performed without collecting real local data, which is advantageous for both privacy protection and data utility.

In an example, a differentially privatised global model defining an autoencoder configured to map high-dimensional user data to a lower dimensional feature space can capture attribute distributions and correlations in high-dimensional data and be used to generate high-utility synthetic data. Such generated synthetic data preserves similar statistical properties to real data and can be scaled up to replace real data for data analysis and AI model training tasks. Furthermore, since generated data is fully synthetic and thus cannot be linked to any particular individual, it is no longer considered personal data. Re-identification attacks or attribute disclosure become almost impossible.

In an example, a local model on user equipment can be trained using a federated learning (FL) mechanism, where raw user data stays on a local device and is not accessible to a remote server. Differential privacy (DP) can be incorporated into the FL process to avoid revealing personal information via model updates during training.

In the context of the present system, FL can provide a decentralized learning mechanism formed by way of a network of user equipment defining a federated learning structure. By distributing training tasks to local user equipment under the coordination of a central server, FL can achieve computational efficiency and privacy benefits.

According to an example, a global model can be iteratively refined on the basis of multiple differentially privatised local models received at a service from a network of user equipment defining the federated learning structure. At each iteration, the service can, for example, randomly select a number of local user equipment clients and distribute a current differentially privatised global model to them. Each client can train the model with their local data and return the (trained) local model update to the service, which can execute on a remote server for example. On the server side, the local updates can be aggregated and the average used to update the global model. The server may broadcast the updated global model to local devices to initialize the next global round. Since only model parameters are exchanged during the training process, FL thus enables model training without the collection of raw local data.
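By way of illustration, one such round might be sketched as follows. This is a minimal sketch, assuming hypothetical helpers `local_train` and `privatise` that stand in for on-device training and the local differential privacy step described below; it is not a definitive implementation of the framework.

```python
# Minimal sketch of one global refinement round; local_train() and
# privatise() are hypothetical helpers, and model parameters are
# represented as flat numpy arrays for simplicity.
import random
import numpy as np

def federated_round(global_params, clients, sample_rate=0.1):
    """Sample clients, collect privatised local updates, average them."""
    n = max(1, int(sample_rate * len(clients)))
    selected = random.sample(clients, n)
    updates = []
    for client in selected:
        local_params = local_train(global_params, client)        # training stays on-device
        updates.append(privatise(local_params - global_params))  # local DP step
    # Server side: aggregate the privatised updates and refresh the global model.
    return global_params + np.mean(updates, axis=0)
```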

According to an example, a differentially privatised global model defines an autoencoder. An autoencoder is a type of neural network that is used to learn efficient and compressed feature representations in an unsupervised manner. An autoencoder comprises two main parts or components: an encoder $Q_\phi$ and a decoder $G_\theta$. The encoder compresses the original high-dimensional input $x \sim P_x$ into the low-dimensional latent feature $z = Q_\phi(x)$ and the decoder maps $z$ to the reconstructed output $x' = G_\theta(z)$, which is of the same shape as $x$. The goal of training is to find an optimized pair of encoder and decoder, which minimize the distance between $x$ and $x' = G_\theta(Q_\phi(x))$, namely:

$$L_{AE} = \mathbb{E}_{x \sim P_x}\left[c\left(x, G_\theta(Q_\phi(x))\right)\right]$$

where $c(\cdot,\cdot)$ is a metric for featuring the difference between two vectors. In an example, the mean squared error (MSE) can be used to measure the distance between numerical input vectors and the cross entropy (CE) for binary input vectors.
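A minimal PyTorch sketch of such an encoder/decoder pair and the loss above may help fix ideas; the layer widths, input dimension and latent dimension are illustrative assumptions, not values taken from the disclosure.

```python
# Sketch of an encoder Q_phi / decoder G_theta pair with a cross-entropy
# reconstruction loss; all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=128, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)      # z = Q_phi(x): low-dimensional latent feature
        return self.decoder(z)   # x' = G_theta(z): reconstruction

model = Autoencoder()
x = torch.rand(32, 128)          # a batch of (binary-encoded) inputs
loss = nn.functional.binary_cross_entropy(model(x), x)   # c(x, x') as CE
```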

Thus, a service executing on a server can generate synthetic data instead of directly collecting real user data, thereby addressing privacy issues in data collection. In an example, the autoencoder model can be trained under a federated learning framework, which enables user data to never leave user devices and provides strong privacy protection on user data.

During training, user devices and/or the server can apply local and server differential privacy to ensure that information from local training data cannot be inferred from either local model updates or global model parameters, which further strengthens privacy protection on user data. The structure of the model can be flexibly modified according to different data dimensions, so as to fit high-dimensional data, and generated synthetic data preserves high fidelity and utility and can be easily scaled up as offline datasets for subsequent AI model training tasks.

Figure 1 is a schematic representation of a method for generating high-dimensional, high utility synthetic data according to an example. In the example of figure 1, a cloud-based server 101 is provided. The server 101 can implement a service for a network of user equipment 103 defining a federated learning structure comprising k user equipment. For example, a service can include the provision of models for use by user equipment, which models can be geared to enable user equipment to provide tailored data and services to a user.

In the example of figure 1, a model 105, which represents the starting point of a method for generating high-dimensional, high utility synthetic data, can be pre-tuned or initialised (124) using initialisation data 126 that can be random data, data from a synthetic database or data from a public database, or a combination of these. The pre-tuned model can be further trained according to the mechanism described herein in which federated learning and differential privacy are utilised in order to update the model or model parameters. Pre-tuning can reduce the number of iterations performed until model convergence. That is, in an example, before the model 105 is sent to user equipment 103 forming the federated learning structure, the model 105 can be initialized. The initialised model can be trained using server 101, and the trained model can then be fine-tuned using the FL framework. The advantage is that the model may then be better initialized and the number of rounds of training using FL can be further reduced. In the example of figure 1, data utility evaluation 125 can be used to ensure the quality of a pre-trained model using evaluation metrics, such as attribute-wise and/or record-wise evaluation. Thus, utility evaluation 125 for pre-training and parameter tuning of the model can be performed.

Following an iteration within a federated learning framework according to an example, the model 105 can be updated using a global model 107. In the example of figure 1, the global model 107 stems from an aggregation of local model updates that have been generated using local data on user equipment 103, as will be explained in more detail below.

The global model 107 can be differentially privatised (108) at server 101 in order to provide the differentially privatised global model 105, although this may not occur in certain circumstances as described in more detail below. As noted above, the differentially privatised global model 105 defines an autoencoder configured to map high-dimensional user data to a lower dimensional feature space. The global model 107 is iteratively refined on the basis of multiple differentially privatised local models 109 received from the network of user equipment 103. The differentially privatised global model 105 is broadcast to the network of user equipment as part of a refinement iteration. The user equipment that the model is broadcast to may comprise all or a selected or random subset of the k user equipment 103. The broadcast model 105 is used in the refinement process in order to generate a differentially privatised local model 109 at each of the user equipment that has received the broadcast model. More particularly, each user equipment uses the global model 105 received from the server 101, which may be differentially privatised, with local (private) data 111 to generate a local model 113. The local model 113 is differentially privatised at each user equipment to form the differentially privatised local model 109. The updated and privatised local models are sent (115) to the server 101. As will be described in more detail below, on the basis of a convergence threshold representing convergence of the differentially privatised global model 105 to a selected measure of accuracy according to a loss function, the differentially privatised global model 105 (in the form of what is now termed the 'final model', M) can be used to generate a set of synthetic data 117. More particularly, a set of random latent features 119 can be selected using a predefined distribution 121 as input to the final model whereby to generate the set of synthetic data 117. In an implementation, a decoder component of the autoencoder is used to generate the set of synthetic data 117.

According to an example, a privacy monitor can be provided on both the local 103 and server 101 sides of the system depicted with reference to figure 1. A privacy monitor 150 can be used to track a cost for privatizing a local update. A privacy monitor 160 can be used to track an overall cost to privatise a global model. To calculate an overall privacy cost, one approach is to sum up the privacy costs of all the global iterations. In another example, the overall privacy cost can be determined from a subset of user equipment 103 that are randomly sampled at each global iteration. Accordingly, the overall privacy cost can be further reduced by a factor of q, where q is the sampling rate. When Gaussian noise, for example, is applied to a local update, an even smaller overall privacy cost can be achieved using the Moment Accountant algorithm. Thus, in an example, a privacy monitor can be used to control privacy loss. If a user is sampled too often and a privacy budget is not enough, the user equipment in question may be excluded from an iteration.
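As a rough illustration, a monitor could charge a per-step cost against a fixed budget by simple additive composition and exclude a participant whose budget is spent. This is a toy sketch; a practical monitor would use the tighter Moment Accountant bounds mentioned above.

```python
# Toy privacy monitor tracking cumulative privacy cost by simple additive
# composition; a real deployment would use the Moment Accountant instead.
class PrivacyMonitor:
    def __init__(self, budget_epsilon):
        self.budget = budget_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        """Record the cost of one privatisation step; refuse if over budget."""
        if self.spent + epsilon > self.budget:
            return False    # exclude this client/round from the iteration
        self.spent += epsilon
        return True
```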

Therefore, in an example, training is conducted under a federated learning framework, where the model is collaboratively trained by a central server 101 and a number of user devices 103. Training in the federated setting ensures raw data 111 on user devices 103 is never sent to the server 101, which effectively protects the data privacy. Furthermore, the user devices 103 use local differential privacy 104 to privatize the local model updates and the server 101 can use server differential privacy 108 to privatize the global model 107 so as to prevent the local or global model parameters revealing information relating to private training data.

Figure 2 is a schematic representation of the method of figure 1, according to an example. During each global training iteration, the server 101 broadcasts the current global model 105 to all the user devices (1). User devices 103 train the global model with local private data 111 and obtain local model parameters (2). A local differential privacy step 104 is applied by each user device in order to prevent local model parameters revealing local training data (3) and the privatized local model is returned 115 to the server (4). The server 101 then aggregates 123 all the received local models and updates the global model (5). Server differential privacy 108 may be used to privatize the global model parameters, so as to further prevent information of local data being inferred from the global model parameters (6). The global model 105 can be shared with the user devices 103 as the start of the next global training iteration (7), and (1)-(6) can repeat until the global model achieves a threshold degree of accuracy according to a loss function.

In an example, user devices 103 can send local model updates to the server (as opposed to local model parameters). In this case, a user device can privatize the local model updates using local differential privacy and send the privatized local model updates to the server. The server can then aggregate all the received local model updates, privatize the aggregated model updates using server differential privacy, and update the global model with privatized model updates.

The server 101 can use a decoder component of an autoencoder to generate the synthetic data 117. In an example, random latent features $z_{gen}$ are drawn 118 from a distribution 121, such as a Gaussian distribution for example. The latent features $z_{gen}$ 119 are fed into a decoder component of the final model, M, and the synthetic data 117 is produced as the output of the model.
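A sketch of this generation step follows, assuming the illustrative Autoencoder class defined in the earlier sketch stands in for the trained final model M.

```python
# Generation sketch: sample random latent features z_gen from the standard
# Gaussian prior and decode them. Autoencoder is the illustrative class
# from the earlier sketch, standing in for the trained final model M.
import torch

model = Autoencoder(input_dim=128, latent_dim=16)
z_gen = torch.randn(1000, 16)              # z_gen drawn from N(0, I)
with torch.no_grad():
    synthetic = model.decoder(z_gen)       # synthetic records as numeric vectors
```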

The server 101 can use different evaluation methods to investigate the utility of synthetic data. For example, attribute-wise evaluation and record-wise evaluation can be performed in block 125. In the case of attribute-wise evaluation, statistical properties of each attribute, such as mean, variance, maximum, minimum, percentile, etc., can be compared. More particularly, according to an example, the KL-divergence between the attribute distribution of real data and generated synthetic data can be calculated. According to an example, user equipment 103 can process the original categorical data in the form of the local private data 111 into numerical form, which can be used for training the generative model. The server 101 defines the structure of the generative model based on the dimension of local training data and initializes the model, which is then collaboratively trained between the clients and server under the differentially private federated framework described herein. Once the model is trained, the decoder component can be extracted for generating synthetic data, which can be converted back to categorical form and used for, e.g., data mining and the building of machine learning models and so on.

In an example, since the original data 111 is categorical, which means that it cannot be directly processed by the models, it is converted into numerical form. In an example, one-hot encoding can be used to encode each categorical attribute into a binary vector. Each entry in the binary vector stands for a unique attribute value and the entry of the given value is set to 1 while all the others are set to 0. Finally, the binary vectors can be concatenated into one vector as the input data for the generative model.
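For instance, the pre-encoding step might look as follows; the attribute domains here are made up for illustration.

```python
# One-hot pre-encoding sketch: each categorical attribute becomes a binary
# vector, and the vectors are concatenated. Domains are illustrative.
import numpy as np

domains = {"age": ["<30", "30-60", ">60"],
           "job": ["engineer", "teacher", "other"]}

def encode_record(record):
    parts = []
    for attr, values in domains.items():
        v = np.zeros(len(values))
        v[values.index(record[attr])] = 1.0   # entry of the given value set to 1
        parts.append(v)
    return np.concatenate(parts)               # one numeric input vector

x = encode_record({"age": "<30", "job": "teacher"})   # -> length-6 binary vector
```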

Figure 3 is a schematic representation of data pre-processing and data postprocessing according to an example. In the example of figure 3, 3a depicts the encoding of some original data (categorical) to binary vectors (numerical), and 3b depicts a reversion from the predicted vectors (numerical) to synthetic data (categorical).

According to an example, a generative model can be a Wasserstein Autoencoder (WAE), which provides better data synthesis capability in comparison to Variational Autoencoders and less training difficulty than Generative Adversarial Networks (GAN). A WAE preserves the typical encoder-decoder architecture of autoencoders, which compresses original high-dimensional inputs x to low-dimensional latent space features z and then reconstructs the latent features back to the input space x'. Other suitable autoencoders can be used.

In an example, in addition to the reconstruction cost $c(x, G_\theta(Q_\phi(x)))$, a regularizer term $D_z(q_z, p_z)$ is introduced to the objective function of the WAE, which measures the distance between the latent space distribution $q_z$ and a certain predefined distribution $p_z$. The goal of training is to find an optimal set of parameters for the encoder and decoder, which minimizes the distance between the inputs and outputs while restricting the latent distribution to follow the predefined distribution. The final objective function can thus be formulated as:

$$L_{WAE} = \mathbb{E}_{x \sim P_x}\left[c\left(x, G_\theta(Q_\phi(x))\right)\right] + \lambda \, D_z(q_z, p_z)$$

where $\lambda$ is a hyperparameter for balancing the two terms.

According to an example, a WAE model with fully connected hidden layers is utilised, and a ReLU activation is applied on the output of each hidden layer for better training performance. Moreover, since the inputs are binary vectors, a sigmoid activation can be used on the output layer, which restricts the output values to within [0, 1].

Cross entropy can be used to measure the reconstruction cost $c(x, G_\theta(z))$ and the maximum mean discrepancy (MMD) to measure the latent space distance $D_z(q_z, p_z)$, where $p_z$ follows a standard Gaussian distribution.
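A hedged sketch of this objective in PyTorch follows, using an RBF-kernel MMD estimate between encoded latents and samples from the Gaussian prior; the kernel, its bandwidth and the value of λ are illustrative assumptions rather than choices taken from the disclosure.

```python
# Sketch of the WAE objective: cross-entropy reconstruction plus an MMD
# penalty pulling the latent batch towards a standard Gaussian prior.
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd(z_q, z_p):
    """Biased MMD^2 estimate between latent codes z_q and prior samples z_p."""
    return (rbf_kernel(z_q, z_q).mean() + rbf_kernel(z_p, z_p).mean()
            - 2 * rbf_kernel(z_q, z_p).mean())

def wae_loss(x, x_rec, z_q, lam=10.0):
    z_p = torch.randn_like(z_q)       # samples from p_z = N(0, I)
    rec = torch.nn.functional.binary_cross_entropy(x_rec, x)
    return rec + lam * mmd(z_q, z_p)  # L_WAE = reconstruction + lambda * D_z
```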

Local differential privacy (DP) 104 can be applied to privatize the local update. The DP mechanism can comprise adding Gaussian or Laplacian noise, for example, to each dimension. In an example, the noise is calibrated according to a desired privacy guarantee. For example, given an $L_2$-sensitivity $\Delta$, the noise scale $\sigma$ for an $(\epsilon, \delta)$-DP Gaussian mechanism should satisfy $\sigma \geq (\Delta/\epsilon)\sqrt{2\ln(1.25/\delta)}$. The privatized local update can then be returned to the server. On the server side, all the local updates are aggregated in order to update the global model. Since the local updates satisfy DP, according to the post-processing property of DP, the updated global model also satisfies DP. Since local DP may be sufficient to protect both local updates and the global model, server DP (108) may not be used. However, in the case that weak local privacy is applied in order to improve model utility, server differential privacy 108 can be applied in order to ensure the privacy of the global model.
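The stated bound translates directly into a noise calibration; the parameter values below are illustrative.

```python
# Calibrating the Gaussian mechanism from the stated bound
# sigma >= (Delta / epsilon) * sqrt(2 * ln(1.25 / delta)).
import math

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    return (l2_sensitivity / epsilon) * math.sqrt(2 * math.log(1.25 / delta))

sigma = gaussian_sigma(l2_sensitivity=1.0, epsilon=1.0, delta=1e-5)  # ~4.84
```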

In an example, server 101 can randomly select some user equipment 103 at each iteration (e.g. 10% of all users, or 500 users, and so on). Since training is performed iteratively, each time user equipment is selected for training, information about that user can be revealed. How much information is revealed is controlled by $\epsilon$, which is the privacy parameter of the differential privacy process. An overall privacy cost can be calculated over all iterations. In order to achieve higher privacy protection, server 101 can delete all the received model updates before sending the new model to users for the next training iteration.

For example, in the process of iteratively training the model, at each global round $t$, the server 101 can randomly select $n = qN$ clients 103 (where $N$ is the total number of clients 103 and $q$ is a sampling rate) and distribute the current global model 105 ($M_t$) to them. Each user equipment client $i$ can train the global model for several steps of gradient descent using local data (111) $D_i$ and calculate the local update $\Delta_t^i$. The client 103 can then clip the local update with a clipping bound $S$ and add an amount of, e.g., Gaussian noise with variance $\sigma^2 S^2$. The noised local update $\tilde{\Delta}_t^i$ (109) is returned (115) to the server 101. On the server side, all the local updates are aggregated 123 and averaged as the global update 107 as:

$$\Delta_t = \frac{1}{n} \sum_{i=1}^{n} \tilde{\Delta}_t^i$$

which is used to update the global model. The updated global model $M_{t+1}$ can then be distributed to local clients to start the next iteration round. Since the sum of Gaussians is still Gaussian, the calculation of the global update can be further derived as:

$$\Delta_t = \frac{1}{n} \sum_{i=1}^{n} \left( \Delta_t^i + \mathcal{N}(0, \sigma^2 S^2) \right) = \frac{1}{n} \left( \sum_{i=1}^{n} \Delta_t^i + \mathcal{N}(0, n \sigma^2 S^2) \right)$$

That is, adding Gaussian noise with variance $\sigma^2 S^2$ on the individual local updates and then calculating the sum is equivalent to adding Gaussian noise with variance $n\sigma^2 S^2$ on the sum of local updates. Thus, the framework satisfies differential privacy. Given the sampling rate $q$, the number of global rounds $T$ and a failure probability $\delta$, a moment accountant mechanism, such as that described in, for example, M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308-318, can be used to keep track of the privacy loss $\epsilon$, as it provides tight privacy bounds for the Gaussian mechanism. The final model satisfies $(\epsilon, \delta)$-differential privacy.
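A sketch of the client-side clip-and-noise step and the server-side averaging described above follows; the dimensions and the values of $S$, $\sigma$ and $n$ are illustrative assumptions.

```python
# Sketch of the per-round DP steps: clip each local update to L2 norm S,
# add N(0, sigma^2 * S^2) noise, then average the n noised updates.
import numpy as np

def clip_and_noise(update, S, sigma, rng):
    """Clip a local update to L2 norm S, then add calibrated Gaussian noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, S / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * S, size=update.shape)

rng = np.random.default_rng(0)
local_updates = [rng.normal(size=10) for _ in range(5)]   # n = 5 clients
noised = [clip_and_noise(u, S=1.0, sigma=1.0, rng=rng) for u in local_updates]
global_update = np.mean(noised, axis=0)                    # averaged global update
```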

Once the (e.g. WAE) model has been trained, the server 101 can use the decoder part to generate synthetic data. Note that, in an example, the latent space features follow a standard Gaussian distribution $p_z$. Therefore, random latent features can be sampled from $p_z$ and fed into the decoder to obtain the predicted outputs.

Since the predicted outputs (i.e. the generated synthetic data) are numerical vectors, they can be converted back to categorical form (as shown in Figure 3b for example). That is, in an example, given a prediction vector, it is first split into pieces forming short vectors, each representing one categorical attribute. Then, for each short vector, the entry with the maximum value is chosen as the attribute value. Finally, all the categorical labels can be concatenated into one vector as the final synthetic data.
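For example, the post-decoding step can be sketched as follows, reusing the made-up attribute domains from the encoding example.

```python
# Post-decoding sketch: split a predicted vector into per-attribute
# segments and take the argmax entry of each as the attribute value.
import numpy as np

domains = {"age": ["<30", "30-60", ">60"],
           "job": ["engineer", "teacher", "other"]}

def decode_vector(y, domains):
    record, offset = {}, 0
    for attr, values in domains.items():
        segment = y[offset:offset + len(values)]
        record[attr] = values[int(np.argmax(segment))]   # max entry wins
        offset += len(values)
    return record

y = np.array([0.7, 0.2, 0.1, 0.1, 0.8, 0.1])
print(decode_vector(y, domains))   # {'age': '<30', 'job': 'teacher'}
```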

As noted above, utility of the generated synthetic data can be evaluated by statistical comparison and AI training performance. According to an example, in the case of statistical comparison the statistical properties between real data and synthetic data under different privacy levels can be compared. More particularly, univariate and multivariate distributions can be evaluated by diagram visualization and distance calculation.

For example, in the case of univariate distribution, the per-attribute frequency of real and synthetic data can be compared. In an example, categorical data is converted into binary form and the mean value of each dimension is calculated, which provides a measure for the frequency of certain attribute values. Bar charts can be used to visualize the frequency comparison of different datasets. That is, by plotting attribute values, the distribution of synthetic data versus the distribution of real data can be compared.

Figure 4 is a schematic representation of the distribution of attributes of exemplary data according to an example, as generated by way of per-attribute frequency of real data and generated synthetic data. The example of figure 4 represents a comparison between real and synthetic data in a pre-training scenario.

The frequency distance can be quantified using the Jensen-Shannon Divergence (JSD), for example, which is a symmetric and smoothed version of the Kullback-Leibler (KL) divergence and is a distance metric. The $D_{JSD}$ is bounded by $[0, 1]$, where zero means the two distributions are identical. The averaged JSD over all the attributes can be calculated according to:

$$\bar{D}_{JSD} = \frac{1}{d} \sum_{i=1}^{d} \frac{1}{2} \left[ D_{KL}(p_i \parallel m_i) + D_{KL}(q_i \parallel m_i) \right], \qquad D_{KL}(p_i \parallel m_i) = \sum_{k=1}^{|\Omega_i|} p_i(k) \log \frac{p_i(k)}{m_i(k)}$$

where $d$ is the total number of attributes, $p_i$ and $q_i$ are the $i$th-attribute distributions of real and synthetic data, $m_i = (p_i + q_i)/2$, $|\Omega_i|$ is the domain size of the $i$th attribute, and $D_{KL}$ is the KL divergence.
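As an illustration of this calculation, the sketch below computes the averaged JSD from per-attribute frequency vectors; the distributions are made up, and base-2 logarithms are assumed so that each per-attribute JSD stays within [0, 1].

```python
# Averaged JSD over attributes from per-attribute frequency vectors;
# log base 2 keeps each per-attribute JSD within [0, 1].
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def avg_jsd(ps, qs):
    total = 0.0
    for p, q in zip(ps, qs):
        m = (p + q) / 2
        total += 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return total / len(ps)

ps = [np.array([0.5, 0.3, 0.2]), np.array([0.6, 0.4])]      # real attribute dists
qs = [np.array([0.45, 0.35, 0.2]), np.array([0.55, 0.45])]  # synthetic dists
print(avg_jsd(ps, qs))   # 0 would mean identical distributions
```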

For the multivariate distribution, in an example, the correlation matrices of the real and synthetic data can be compared. Correlation Matrix Distance (CMD) can be used to measure the distance between the correlation matrices of real and synthetic data, such that:

$$D_{CMD} = 1 - \frac{\mathrm{tr}(R_{real} R_{syn})}{\|R_{real}\|_F \, \|R_{syn}\|_F}$$

where $R_{real}$ and $R_{syn}$ are the correlation matrices of real and synthetic data, $\mathrm{tr}(\cdot)$ is the matrix trace, and $\|\cdot\|_F$ is the Frobenius norm. $D_{CMD}$ is also bounded by $[0, 1]$, where zero means the two correlation matrices are identical. For each dataset, the $D_{CMD}$ can be calculated under different privacy levels and the results compared. Similarly, the $D_{CMD}$ between the real data and the non-private synthetic data can be calculated as a baseline.

The methods described herein enable synthetic data to be generated from a model that is trained using high-dimensional categorical data from user equipment, collected using a privacy-preserving framework for high-dimensional data collection. With the combination of federated learning, differential privacy, and a (generative) autoencoder, the framework is able to generate high-utility synthetic datasets without accessing real local data. The generated synthetic data preserves very similar statistical properties to the real data and can replace real data for data mining and model training tasks. For datasets also containing numerical variables, such numerical data can be converted into categorical data with histograms and used with the present framework. For collecting multimedia data such as images, the loss function can be changed to mean squared error, for instance, and the rest of the framework can remain unchanged.