Title:
SERVER EFFICIENT ENHANCEMENT OF PRIVACY IN FEDERATED LEARNING
Document Type and Number:
WIPO Patent Application WO/2021/247066
Kind Code:
A1
Abstract:
Techniques are disclosed that enable training a global model using gradients provided to a remote system by a set of client devices during a reporting window, where each client device randomly determines a reporting time in the reporting window to provide the gradient to the remote system. Various implementations include each client device determining a corresponding gradient by processing data using a local model stored locally at the client device, where the local model corresponds to the global model.

Inventors:
THAKKAR OM (US)
THAKURTA ABHRADEEP GUHA (US)
KAIROUZ PETER (US)
DE BALLE PIGEM BORJA (US)
MCMAHAN BRENDAN (US)
Application Number:
PCT/US2020/055906
Publication Date:
December 09, 2021
Filing Date:
October 16, 2020
Assignee:
GOOGLE LLC (US)
International Classes:
G06N20/00; G06F21/10
Other References:
ERLINGSSON ULFAR ET AL: "Amplification by Shuffling: From Local to Central Differential Privacy via Anonymity", PROCEEDINGS OF THE 2019 ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 6 January 2019 (2019-01-06), San Diego, USA, XP055778322, Retrieved from the Internet [retrieved on 20210222]
BALLE BORJA ET AL: "Privacy Amplification via Random Check-Ins", 13 July 2020 (2020-07-13), XP055778059, Retrieved from the Internet [retrieved on 20210219]
REZA SHOKRI ET AL: "Privacy-Preserving Deep Learning", PROCEEDINGS OF THE 22ND ACM SIGSAC CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, CCS '15, 1 January 2015 (2015-01-01), New York, New York, USA, pages 1310 - 1321, XP055394102, ISBN: 978-1-4503-3832-5, DOI: 10.1145/2810103.2813687
ANDREA BITTAU ET AL: "Prochlo: Strong Privacy for Analytics in the Crowd", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 October 2017 (2017-10-02), XP081147302, DOI: 10.1145/3132747.3132769
CHEN LINLIN ET AL: "Crowdlearning: Crowded Deep Learning with Data Privacy", 2018 15TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING (SECON), IEEE, 11 June 2018 (2018-06-11), pages 1 - 9, XP033367934, DOI: 10.1109/SAHCN.2018.8397100
Attorney, Agent or Firm:
HIGDON, Scott et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method implemented by one or more processors, the method comprising: selecting, at a remote system, a set of client devices, from a plurality of client devices; determining, at the remote system, a reporting window indicating a time frame for the set of client devices to provide one or more gradients, to update a global model; transmitting, by the remote system, to each client device in the set of client devices, the reporting window, wherein transmitting the reporting window causes each of the client devices to at least selectively determine a corresponding reporting time, within the reporting window, for transmitting a corresponding locally generated gradient to the remote system; receiving, in the reporting window, the corresponding locally generated gradients at the corresponding reporting times, wherein each of the corresponding locally generated gradients is generated by a corresponding one of the client devices based on processing, using a local model stored locally at the client device, data generated locally at the client device to generate a predicted output of the local model; and updating one or more portions of the global model, based on the received gradients.

2. The method of claim 1, further comprising: selecting, at the remote system, an additional set of additional client devices, from the plurality of client devices; determining, at the remote system, an additional reporting window indicating an additional time frame for the additional set of additional client devices to provide one or more additional gradients, to update the global model; transmitting, by the remote system, to each additional client device in the additional set of additional client devices, the additional reporting window, wherein transmitting the additional reporting window causes each of the additional client devices to at least selectively determine a corresponding additional reporting time, within the additional reporting window, for transmitting a corresponding additional locally generated gradient to the remote system; receiving, in the additional reporting window, the corresponding additional locally generated gradients at the corresponding additional reporting times, wherein each of the corresponding additional locally generated gradients is generated by a corresponding one of the additional client devices based on processing, using a local model stored locally at the additional client device, additional data generated locally at the additional client device to generate an additional predicted output of the local model; and updating one or more additional portions of the global model, based on the received additional gradients.

3. The method of any preceding claim, wherein processing, using the local model stored locally at the client device, data generated locally at the client device to generate the predicted output of the local model further comprises: generating the gradient based on the predicted output of the local model and ground truth data generated by the client device.

4. The method of claim 3, wherein the global model is a global automatic speech recognition ("ASR") model, the local model is a local ASR model, and wherein generating the gradient based on the predicted output of the local model comprises: processing audio data capturing a spoken utterance using the local ASR model to generate a predicted text representation of the spoken utterance; and generating the gradient based on the predicted text representation of the spoken utterance and a ground truth representation of the spoken utterance generated by the client device.

5. The method of any preceding claim, wherein each of the client devices at least selectively determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system comprises: for each of the client devices, randomly determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system.

6. The method of claim 5, wherein each of the client devices at least selectively determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system comprises: for each of the client devices: determining whether to transmit the corresponding locally generated gradient to the remote system; and in response to determining to transmit the corresponding locally generated gradient, transmitting the corresponding locally generated gradient to the remote system.

7. The method of claim 6, wherein determining whether to transmit the corresponding locally generated gradient to the remote system comprises: randomly determining whether to transmit the corresponding locally generated gradient to the remote system.

9. The method of claim 2, wherein at least one client device in the set of client devices, is in the additional set of additional client devices.

10. The method of any preceding claim, wherein receiving, in the reporting window, the corresponding locally generated gradients at the corresponding reporting times comprises receiving a plurality of corresponding locally generated gradients at the same reporting time.

11. The method of claim 10, wherein updating one or more portions of the global model, based on the received gradients comprises: determining an update gradient based on the plurality of corresponding locally generated gradients received at the same reporting time; and updating the one or more portions of the global model, based on the update gradient.

12. The method of claim 11, wherein determining the update gradient based on the plurality of corresponding locally generated gradients, received at the same reporting time, comprises: selecting the update gradient from the plurality of corresponding locally generated gradients received at the same reporting time.

13. The method of claim 12, wherein selecting the update gradient from the plurality of corresponding locally generated gradients, received at the same reporting time, comprises: randomly selecting the update gradient from the plurality of corresponding locally generated gradients, received at the same reporting time.

14. The method of claim 11, wherein determining the update gradient based on the plurality of corresponding locally generated gradients, received at the same reporting time, comprises: determining the update gradient based on an average of the plurality of corresponding locally generated gradients.

15. A method implemented by one or more processors, the method comprising: receiving, at a client device and from a remote system, a reporting window indicating a time frame for the client device to provide a gradient, to the remote system, to update one or more portions of a global model; processing locally generated data, using a local model, to generate predicted output of the local model; generating the gradient based on the predicted output of the local model; determining a reporting time, in the reporting window, to transmit the gradient to the remote system; and at the reporting time, transmitting the gradient to the remote system.

16. A computer program comprising instructions that when executed by one or more processors of a computing system, cause the computing system to perform the method of any preceding claim.

17. A computing system configured to perform the method of any one of claims 1 to 15.

18. A computer-readable storage medium storing instructions executable by one or more processors of a computing system to perform the method of any one of claims 1 to 15.

Description:
SERVER EFFICIENT ENHANCEMENT OF PRIVACY IN FEDERATED LEARNING

Background

[0001] Data used to train a global model can be distributed across many client devices. Federated learning techniques can train a global model using this distributed data. For example, each client device can generate a gradient by processing data using a local model stored locally at the client device. The global model can be trained using these gradients without needing the data used to generate the gradients. In other words, data used to train the global model can be kept locally on-device by transmitting gradients for use in updating the global model (and not transmitting the data itself).
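As a minimal sketch of this pattern (illustrative only; the linear model and the local_gradient helper are hypothetical, not part of this disclosure), each client computes a gradient on its own data and only the gradient leaves the device:

```python
import numpy as np

def local_gradient(theta, x, y):
    # Least-squares gradient computed on-device; the raw (x, y) never leaves.
    return (x @ theta - y) * x

theta = np.zeros(3)                                          # toy global model
client_data = [(np.random.randn(3), 1.0) for _ in range(5)]  # private, per-client

# Each client transmits only its gradient; the server averages and updates.
grads = [local_gradient(theta, x, y) for x, y in client_data]
theta -= 0.1 * np.mean(grads, axis=0)
```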

Summary

[0002] Techniques disclosed herein are directed towards training a global model, using data generated locally at a set of client devices (e.g., gradients generated locally at each client device using a local model stored locally at each corresponding client device), where the client devices provide the update data at a reporting time chosen by each client device. In other words, a global model can be updated based on randomness independently generated by each individual client device participating in the training process.

[0003] In some implementations, a remote system (e.g., a server) can select the set of client devices used to update the global model. For example, the remote system can randomly (or pseudo randomly) select a set of client devices. Additionally or alternatively, the remote system can determine a reporting window in which to receive updates (e.g., gradients) from the selected client devices. The remote system can transmit the reporting window to each of the selected client devices, and each of the client devices can determine a reporting time, in the reporting window, to provide an update to the remote system. For example, a remote system can select client devices A and B from a group of client devices A, B, C, and D for use in updating a global model. The remote system can determine a reporting window from 9:00am to 9:15am. Client device A can determine (e.g., randomly or pseudo randomly) a reporting time of 9:03am in the reporting window. At 9:03am, client device A can provide gradient A, generated by processing data using a corresponding local model stored locally at client device A, to the remote system. The remote system can use gradient A to update one or more portions of the global model. Similarly, client device B can determine (e.g., randomly or pseudo randomly) a reporting time of 9:10am in the reporting window. At 9:10am, client device B can transmit gradient B, generated by processing data using a corresponding local model stored locally at client device B, to the remote system. The remote system can use gradient B to update one or more portions of the global model.
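A minimal sketch of the example above, assuming an in-process stand-in for device selection and window assignment (all names are hypothetical):

```python
import random

devices = ["A", "B", "C", "D"]
selected = random.sample(devices, 2)      # e.g., client devices A and B
window = (9 * 60, 9 * 60 + 15)            # 9:00am to 9:15am, in minutes

# Each selected device independently draws its own reporting time.
reporting_times = {d: random.uniform(*window) for d in selected}
print(reporting_times)                    # e.g., {'A': 543.2, 'B': 550.7}
```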

[0004] Additionally or alternatively, in some implementations, at 9:03am, client device A can provide gradient A', generated by processing data using the global model, to the remote system. The remote system can use gradient A' to update one or more portions of the global model. Similarly, at 9:10am, client device B can transmit gradient B', generated by processing data using the global model, to the remote system. The remote system can use gradient B' to update one or more portions of the global model.

[0005] Additionally or alternatively, each client device selected in the set of client devices can determine whether to participate in training the global model. For example, a selected client device can determine (e.g., by a virtual coin flip) whether to participate in the round of training the global model. If the client device determines to participate, the client device can then determine a reporting time and transmit a locally generated gradient to the remote system at the reporting time. Additionally or alternatively, if the client device determines to not participate, the client device may not determine a reporting time and/or transmit a locally generated gradient to the remote system.
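A client-side sketch of this selective participation, assuming a participation probability p (the helper name is hypothetical):

```python
import random

def maybe_pick_reporting_time(window_start, window_end, p):
    """Return a random reporting time, or None when abstaining this round."""
    if random.random() >= p:              # virtual coin flip: do not participate
        return None
    return random.uniform(window_start, window_end)

t = maybe_pick_reporting_time(0.0, 900.0, p=0.5)
print("abstain" if t is None else f"report at t={t:.1f}s")
```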

[0006] In some implementations, a remote system can determine a first set of client devices with a corresponding first reporting window and a second set of client devices with a corresponding second reporting window. For example, a global model can be updated using data from client devices across the world. In some implementations, the first set of client devices can be selected based on the geographical location of the client devices (e.g., client devices physically located in the same city, state, time zone, country, continent, and/or additional or alternative location based group(s) of client devices). Additionally or alternatively, the reporting window can be determined based on device availability for the corresponding physical location (e.g., a reporting window in the middle of the night, when most client devices are connected but idle, and/or additional or alternative reporting window(s)). In some implementations, the remote system can determine the second set of client devices based on a second physical location. Similarly, the second reporting window can be determined based on device availability in the second physical location.

[0007] Accordingly, various implementations set forth techniques to ensure privacy in training a global model using decentralized data generated locally at many client devices. Classic techniques require significant server-side overhead to train a global model using decentralized data while maintaining the privacy of the data. In contrast, techniques disclosed herein orchestrate data privacy at the client device, and require no or minimal server-side orchestration to preserve data privacy while training the global model. As such, data privacy may be enhanced while reducing server-side resource use (e.g., processor cycles, memory, power consumption, etc.).

[0008] Additionally or alternatively, client devices can randomly determine the time, within a reporting window, to transmit a locally generated gradient to the server. Selection of the number of client devices and the size of the reporting window can ensure the server receives gradients at a fairly constant rate. In other words, allowing the client devices to randomly select a reporting time, in a large enough reporting window, can lead to an even distribution of gradients throughout the reporting window. This even distribution of gradients can ensure network resources are not overwhelmed while training the global model. For instance, the even distribution of gradients can ensure more even utilization of network bandwidth, preventing spikes in bandwidth utilization that leave the system unable to receive gradients; more even memory and/or processor usage, preventing spikes that leave the system (temporarily) unable to process additional gradients; and/or more even utilization of additional or alternative network resources. Additionally or alternatively, more even utilization of network resources can increase the number of gradients that can immediately be used to train the global model, and thus may limit the number of gradients that need to be queued for later training.

[0009] The above description is provided only as an overview of some implementations disclosed herein. These and other implementations of the technology are disclosed in additional detail below.

[0010] It should be appreciated that all combinations of the foregoing concepts and additional concepts described in greater detail herein are contemplated as being part of the subject matter disclosed herein. For example, all combinations of claimed subject matter appearing at the end of this disclosure are contemplated as being part of the subject matter disclosed herein.

Brief Description of the Drawings

[0011] FIG. 1 illustrates an example environment in which implementations described herein may be implemented.

[0012] FIG. 2 illustrates an example of training a global model in accordance with various implementations described herein.

[0013] FIG. 3 is a flowchart illustrating an example process of training a global model at a remote system in accordance with various implementations disclosed herein.

[0014] FIG. 4 is a flowchart illustrating an example process of transmitting a gradient from a client device to a remote system in accordance with various implementations disclosed herein.

[0015] FIG. 5 schematically depicts an example architecture of a computer system.

Detailed Description

[0016] Differentially Private Stochastic Gradient Descent (DP-SGD) may form a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling and privacy amplification by shuffling, may permit adding less noise in DP-SGD than naive schemes. A key assumption in both of these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted, constraints that may become prohibitive when the data is processed in a decentralized or distributed fashion.

[0017] Iterative methods, like DP-SGD, may be used in the setting of federated learning (FL), wherein the data is distributed among many devices (clients). In some implementations, the random check-in distributed protocol(s) may be utilized, which may rely only on randomized participation decisions made locally and independently by each client. Random check-ins can have privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, random check-ins may require neither server-initiated communication nor knowledge of the population size.

[0018] Privacy amplification via random check-ins is tailored for a distributed learning framework, and can have broader applicability beyond FL. In some implementations, privacy amplification by shuffling can be extended to incorporate (ε, δ)-DP local randomizers, and its guarantees can be improved. In practical regimes, this improvement can allow for similar privacy and utility using data from an order of magnitude fewer users.

[0019] Modern mobile devices and web services can benefit significantly from large-scale machine learning, often involving training on user (client) data. When such data is sensitive, steps must be taken to ensure privacy, and a formal guarantee of differential privacy (DP) may be the gold standard.

[0020] Other privacy-enhancing techniques can be combined with DP to obtain additional benefits. For example, cross-device federated learning (FL) may allow model training while keeping client data decentralized (each participating device keeps its own local dataset, and only sends model updates or gradients to the coordinating server). However, existing approaches to combining FL and DP make a number of assumptions that may be unrealistic in real-world FL deployments.

[0021] Attempts to combine FL and DP research have been made previously. However, these works and others in the area sidestep a critical issue: the DP guarantees require very specific sampling or shuffling schemes assuming, for example, that each client participates in each iteration with a fixed probability. While possible in theory, such schemes are incompatible with the practical constraints and design goals of cross-device FL protocols; to quote a comprehensive FL survey, "such a sampling procedure is nearly impossible in practice." The fundamental challenge is that clients decide when they will be available for training and when they will check in to the server, and by design the server cannot index specific clients. In fact, it may not even know the size of the participating population.

[0022] Implementations described herein target these challenges. One goal is to provide strong central DP guarantees for the final model released by FL-like protocols, under the assumption of a trusted orchestrating server. This may be accomplished by building upon recent work on amplification by shuffling and/or combining it with new analysis techniques targeting FL-specific challenges (e.g., client-initiated communications, non-addressable global population, and constrained client availability).

[0023] Some implementations include a privacy amplification analysis specifically tailored for distributed learning frameworks. In some implementations, this may include a novel technique, called random check-in, that relies on randomness independently generated by each individual client participating in the training procedure. It can be shown that distributed learning protocols based on random check-ins can attain privacy gains similar to privacy amplification by subsampling/shuffling while requiring minimal coordination from the server. While implementations disclosed herein are described with respect to distributed DP-SGD within the FL framework, it should be noted that the techniques described herein are broadly applicable to any distributed iterative method.

[0024] Some implementations described herein include the use of random check-ins, a privacy amplification technique for distributed systems with minimal server-side overhead. Some implementations can include formal privacy guarantees for the disclosed protocols. Additionally or alternatively, it can be shown that random check-ins can attain similar rates of privacy amplification as subsampling and shuffling while reducing the need for server-side orchestration. Furthermore, some implementations include utility guarantees in the convex case that can match the optimal privacy/accuracy trade-offs for DP-SGD in the central setting. Furthermore, as a byproduct, some implementations may improve privacy amplification by shuffling. In the case of ε₀-DP local randomizers, the dependency of the final central DP ε on ε₀ may be improved by a factor of O(e^{0.5ε₀}). Additionally or alternatively, implementations disclosed herein may extend the analysis to the case of (ε₀, δ₀)-DP local randomizers. This improvement may be crucial in practice, as it allows shuffling protocols based on a wider family of local randomizers, including Gaussian local randomizers.

[0025] To introduce the notion of privacy, neighboring data sets are defined. A pair of data sets D, D′ ∈ 𝒟ⁿ may be referred to as neighbors if D′ can be obtained from D by modifying one sample d_i ∈ D for some i ∈ [n].

[0026] In some implementations, differential privacy can be defined as follows: a randomized algorithm A: 𝒟ⁿ → 𝒮 is (ε, δ)-differentially private if, for any pair of neighboring data sets D, D′ ∈ 𝒟ⁿ, and for all events S ⊆ 𝒮 in the output range of A, we have Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ.

[0027] For meaningful central DP guarantees (i.e., when n > 1), ε can be assumed to be a small constant, and δ ≪ 1/n. The case δ = 0 is often referred to as pure DP (in which case, it can be written as ε-DP). Additionally or alternatively, the term approximate DP may be used when δ > 0. Adaptive differentially private mechanisms can occur naturally when constructing complex DP algorithms, e.g., DP-SGD. In addition to the dataset D, adaptive mechanisms also receive as input the output of other differentially private mechanisms. Formally, an adaptive mechanism A: 𝒮′ × 𝒟ⁿ → 𝒮 is (ε, δ)-DP if the mechanism A(s′, ·) is (ε, δ)-DP for every s′ ∈ 𝒮′. In some implementations, using n = 1 gives a local randomizer, which may provide a local DP guarantee. Local randomizers can be the building blocks of local DP protocols, where individuals privatize their data before sending it to an aggregator for analysis.

[0028] As an illustrative example, in some implementations, the distributed learning setup may involve n clients, where each client j ∈ [n] holds a data record d_j ∈ 𝒟, forming a distributed data set D = (d₁, ..., d_n). In some implementations, it can be assumed that a coordinating server wants to train the parameters θ ∈ Θ of a model by using the dataset D to perform stochastic gradient descent steps according to some loss function ℓ: 𝒟 × Θ → ℝ₊. The server's goal is to protect the privacy of all the individuals in D by providing strong DP guarantees against an adversary that can observe the final trained model as well as all the intermediate model parameters. In some implementations, it can be assumed that the server is trusted, all devices adhere to the prescribed protocol (i.e., there are no malicious users), and all server-client communications are privileged (i.e., they cannot be detected or eavesdropped by an external adversary).

[0029] The server can start with model parameters θ₁, and over a sequence of m time slots can produce a sequence of model parameters θ₂, ..., θ_{m+1}. The random check-ins technique can allow clients to independently decide when to offer their contributions for a model update. If and when a client's contribution is accepted by the server, she uses the current parameters θ and her data d to send a privatized gradient of the form A_ldp(∇_θ ℓ(d, θ)) to the server, where A_ldp is a DP local randomizer (e.g., performing gradient clipping and adding Gaussian noise).

[0030] The results of some implementations consider three different setups inspired by practical applications: (1) The server uses m ≪ n time slots, where at most one user's update is used in each slot, for a total of m/b minibatch SGD iterations. It can be assumed all n users are available for the duration of the protocol, but the server does not have enough bandwidth to process updates from every user; (2) The server uses m ≈ n/b time slots, and all n users are available for the duration of the protocol. On average, b users contribute updates to each time slot, and so m minibatch SGD steps may be taken; (3) As with (2), but each user is only available during a small window of time relative to the duration of the protocol.
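A sketch of one plausible local randomizer A_ldp of the kind described above, performing gradient clipping followed by Gaussian noise (the clip norm and noise scale are illustrative assumptions):

```python
import numpy as np

def gaussian_local_randomizer(grad, clip_norm=1.0, sigma=1.0, rng=None):
    # Clip to bound the gradient's sensitivity, then add Gaussian noise.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=sigma * clip_norm, size=grad.shape)

noisy = gaussian_local_randomizer(np.array([3.0, -4.0]))  # norm 5 -> clipped to 1
```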

[0031] In some implementations, random check-ins for privacy amplification can be used in the context of distributed learning. Consider the distributed learning setup described in Section 2 where each client is willing to participate in the training procedure as long as their data remains private. To boost the privacy guarantees provided by the local randomizer A Idp , clients can volunteer their updates at a random time slot of their choosing. This randomization has a similar effect on the uncertainty about the use of an individual's data on a particular update as the one provided by uniform subsampling or shuffling. Informally, random check-in can be expressed as a client in a distributed iterative learning framework randomizing their instant of participation, and determining with some probability whether to participate in the process at all.

[0032] In some implementations, random check-in can formally be defined as follows: let A be a distributed learning protocol with m check-in time slots. For a set R_j ⊆ [m] and probability p_j ∈ [0,1], client j performs an (R_j, p_j)-check-in in the protocol if, with probability p_j, she requests the server to participate in A at time step I ← u.a.r. R_j, and otherwise abstains from participating. If p_j = 1, it can alternatively be denoted as an R_j-check-in.

[0033] A distributed learning protocol based on random check-ins in accordance with some implementations is presented in Algorithm 1 (below). Client j independently decides in which of the possible time steps (if any) she is willing to participate by performing an (R_j, p_j)-check-in. R_j = [m] is set for all j ∈ [n], and all n clients are assumed to be available throughout the duration of the protocol. On the server side, at each time step i ∈ [m], a random client J_i among all the ones that checked in at time i is queried: this client receives the current model θ_i, locally computes a gradient update ∇_θ ℓ(d_{J_i}, θ_i) using their data d_{J_i}, and returns to the server a privatized version of the gradient obtained using a local randomizer A_ldp. Clients checked in at time i that are not selected do not participate in the training procedure. If at time i no client is available, the server adds a "dummy" gradient to update the model.

[0034] Algorithm 1 - Distributed DP-SGD with random check-ins (fixed window)

[0035] Algorithm 1 - Server-side protocol

[0036] Parameters: local randomizer A_ldp: Θ → Θ, total update steps m

[0037] Initialize model θ₁ ∈ ℝ^p

[0038] Initialize gradient accumulator g₁ ← 0^p

[0039] for i ∈ [m] do

[0040] S_i ← {j: User(j) checks in for index i}

[0041] if S_i is empty then

[0042] ĝ_i ← A_ldp(0^p) //Dummy gradient

[0043] else

[0044] Sample J_i u.a.r. ← S_i

[0045] Request User(J_i) for update to model θ_i

[0046] Receive ĝ_i from User(J_i)

[0047] (θ_{i+1}, g_{i+1}) ← ModelUpdate(θ_i, g_i + ĝ_i, i)

[0048] Output θ_{m+1}

[0049] Algorithm 1 - Client-side protocol for User(j)

[0050] Parameters: check-in window R_j, check-in probability p_j, loss function ℓ, local randomizer A_ldp

[0051] Private inputs: datapoint d_j ∈ 𝒟

[0052] if a p_j-biased coin returns heads then

[0053] Check in with the server at time I ← u.a.r. R_j

[0054] if receive request for update to model θ_I then

[0055] g ← ∇_θ ℓ(d_j; θ_I)

[0056] Send ĝ ← A_ldp(g) to the server

[0057] Algorithm 1 - ModelUpdate(θ, g, i)

[0058] Parameters: batch size b, learning rate η_i

[0059] if i mod b = 0 then

[0060] return (θ − η_i · g/b, 0^p) //Gradient descent step

[0061] else

[0062] return (θ, g) //Skip update

[0063] From a privacy standpoint, Algorithm 1 may share an important pattern with DP-SGD: each model update uses noisy gradients obtained from a random subset of the population. However, there are factors that can make the privacy analysis of random check-ins more challenging than the existing analyses based on subsampling and shuffling. First, unlike in the case of uniform sampling, where the randomness in each update is independent, here there is a correlation induced by the fact that clients that check in to one step cannot check in to a different step. Second, in shuffling there is also a similar correlation between updates, but there each update can be ensured to use the same number of datapoints, whereas here the server does not control the number of clients that will check in to each individual step. Nonetheless, the following result shows that random check-ins provide a factor of privacy amplification comparable to these techniques.

[0064] Theorem 3.2 (Amplification via random check-ins into a fixed window). Suppose A_ldp is an ε₀-DP local randomizer. Let A_fix: 𝒟ⁿ → Θ^m be the protocol from Algorithm 1 with check-in probability p_j = p₀ and check-in window R_j = [m] for each client j ∈ [n]. For any δ ∈ (0,1), algorithm A_fix satisfies central (ε, δ)-DP, where ε decreases with p₀; in particular, for p₀ = O(m/n), an amplification factor on the order of √m/n is obtained (see Remarks 1 and 2). Furthermore, if A_ldp is an (ε₀, δ₀)-DP local randomizer, an analogous approximate-DP guarantee holds.
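The control flow of Algorithm 1 can be simulated end-to-end; the sketch below is illustrative only (a toy quadratic loss, the Gaussian randomizer sketched earlier, and batch size b = 1 so every slot performs an update):

```python
import numpy as np

rng = np.random.default_rng(0)

def a_ldp(g, clip=1.0, sigma=0.5):
    # epsilon_0-DP local randomizer: clip, then add Gaussian noise.
    g = g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
    return g + rng.normal(scale=sigma * clip, size=g.shape)

n, m, dim = 1000, 200, 5
p0, eta = m / n, 0.1
data = rng.normal(size=(n, dim))          # one private datapoint per client
# toy loss l(d, theta) = ||theta - d||^2 / 2, so the gradient is theta - d

# Client side: flip a p0-biased coin, then check in u.a.r. into [m].
checkins = {}
for j in range(n):
    if rng.random() < p0:
        checkins.setdefault(int(rng.integers(m)), []).append(j)

# Server side: one update per slot, from one random checked-in client.
theta = np.zeros(dim)
for i in range(m):
    s_i = checkins.get(i, [])
    if not s_i:
        g_hat = a_ldp(np.zeros(dim))      # dummy gradient: noise only
    else:
        j = int(rng.choice(s_i))          # unselected check-ins are skipped
        g_hat = a_ldp(theta - data[j])
    theta -= eta * g_hat                  # ModelUpdate with b = 1

print("distance to mean:", np.linalg.norm(theta - data.mean(axis=0)))
```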

[0065] Remark 1 - In some implementations, privacy can be increased in the above statement by decreasing p₀. However, this may also increase the number of dummy updates, which suggests choosing p₀ = O(m/n). With such a choice, an amplification factor of √m/n can be obtained. Critically, however, exact knowledge of the population size is not required to have a precise DP guarantee above.

[0066] Remark 2 - At first look, the amplification factor of √m/n may appear stronger than the typical 1/√n factor obtained via uniform subsampling/shuffling. Note that one run of random check-ins may provide m updates (as opposed to n updates via the other two methods). When the server has sufficient capacity, m = n can be set to recover a 1/√n amplification. In some implementations, one advantage of random check-ins can be benefiting from amplification in terms of the full n even if only a much smaller number of updates are actually processed. In some implementations, random check-ins may be extended to recover the 1/√n amplification even when the server is rate limited (p₀ = m/n), by repeating the protocol A_fix adaptively n/m times to get the following corollary and applying advanced composition for DP.

[0067] Corollary 3.3 - For the algorithm A_fix: 𝒟ⁿ → Θ^m described in Theorem 3.2, suppose A_ldp is an ε₀-DP local randomizer with ε₀ and n such that n ≥ e^{ε₀}√(m log(1/δ)). Setting p₀ = m/n and running n/m repetitions of A_fix results in a total of n updates, and overall central (ε, δ)-DP with ε = O(e^{1.5ε₀}/√n) and δ ∈ (0,1), where O(·) hides polylog factors in 1/δ.

[0068] In some implementations, a utility analysis for random check-ins can be provided. First, a bound can be provided on the expected number of "dummy" updates during a run of the algorithm described in Theorem 3.2. The result is described below in Proposition 3.4.

[0069] Proposition 3.4 (Dummy updates in random check-ins with a fixed window). For the algorithm A_fix: 𝒟ⁿ → Θ^m described in Theorem 3.2, the expected number of dummy updates performed by the server is at most m(1 − p₀/m)ⁿ. For c > 0, if p₀ = cm/n, we get at most m/e^c expected dummy updates.
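A quick numeric check of this bound (illustrative values):

```python
import math

n, m, c = 1000, 200, 1.0
p0 = c * m / n                             # check-in probability
empty = m * (1 - p0 / m) ** n              # expected empty (dummy) slots
print(f"{empty:.1f} vs bound m/e^c = {m / math.e**c:.1f}")  # ~73.5 vs ~73.6
```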

[0070] Utility for Convex ERMs - We now instantiate our amplification theorem (Theorem 3.2) in the context of differentially private empirical risk minimization (ERM). For convex ERMs, it can be shown that DP-SGD in conjunction with the privacy amplification theorem (Theorem 3.2) may be capable of achieving the optimal privacy/accuracy trade-offs.

[0071] Theorem 3.5 (Utility guarantee). Suppose in the algorithm A_fix: 𝒟ⁿ → Θ^m described in Theorem 3.2 the loss ℓ: 𝒟 × Θ → ℝ₊ is L-Lipschitz and convex in its second parameter, and the model space Θ has dimension p and diameter R, i.e., sup_{θ,θ′∈Θ} ||θ − θ′|| ≤ R. Furthermore, let 𝒫 be a distribution on 𝒟, define the population risk L(𝒫; θ) = E_{d∼𝒫}[ℓ(d; θ)], and let θ* = argmin_{θ∈Θ} L(𝒫; θ). If A_ldp is a local randomizer that adds Gaussian noise with variance σ², and the learning rate for a model update at step i ∈ [m] is set to η_i = R(1 − 2e^{−np₀/m})/√((pσ² + L²)i), then the output θ_m of A_fix(D) on a dataset D containing n i.i.d. samples from 𝒫 satisfies an excess population risk bound of Õ(R√(pσ² + L²)/√m). In some implementations, Õ(·) hides a polylog factor in m.

[0072] Remark 3 - Note that, as m → n, it is easy to see for p₀ = Ω(m/n) that Theorem 3.5 achieves the optimal population risk trade-off.

[0073] This section presents two variants of the main protocol from the previous section. The first variant makes better use of the updates provided by each user at the expense of a small increase in the privacy cost. The second variant allows users to check in to a sliding window, to model the case where different users might be available during different time windows.

[0074] In some implementations, variant(s) of Algorithm 1 may be utilized which, at the expense of a mild increase in the privacy cost, remove the need for dummy updates and/or for discarding all but one of the clients checked in at every time step. The server-side protocol of this version is given in Algorithm 2 (the client-side protocol is identical to that of Algorithm 1). Note that in this version, if no client checked in at some step i ∈ [m], the server simply skips the update. Furthermore, if at some time i ∈ [m] multiple clients have checked in, the server requests gradients from all the clients, and performs a model update using the average of the submitted noisy gradients.

[0075] These changes may have the advantage of reducing the noise in the model coming from dummy updates, and increasing the algorithm's data efficiency by utilizing gradients provided by all available clients. The corresponding privacy analysis becomes more challenging because (1) the adversary gains information about the time steps where no clients checked in, and (2) the server uses the potentially non-private count |S_i| of clients checked in at time i when performing the model update. Nonetheless, it may be shown that the privacy guarantees of Algorithm 2 are similar to those of Algorithm 1 with an additional O(e^{ε₀/2}) factor, and the restriction of non-collusion among the participating clients. For simplicity, only the case where each client has check-in probability p_j = 1 is analyzed.

[0076] Algorithm 2 - A_avg Server-side protocol

[0077] Parameters: total update steps m

[0078] Initialize model θ₁ ∈ ℝ^p

[0079] for i ∈ [m] do

[0080] S_i ← {j: User(j) checks in for index i}

[0081] if S_i is empty then

[0082] θ_{i+1} ← θ_i //Skip update

[0083] else

[0084] g_i ← 0^p

[0085] for j ∈ S_i do

[0086] Request User(j) for update to model θ_i

[0087] Receive ĝ_{i,j} from User(j); g_i ← g_i + ĝ_{i,j}

[0088] θ_{i+1} ← θ_i − η_i · g_i/|S_i| //Averaged model update

[0090] Output θ_{m+1}
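A sketch of the A_avg server step (illustrative; it reuses the toy loss and randomizer assumptions from the Algorithm 1 simulation above):

```python
import numpy as np

rng = np.random.default_rng(1)

def a_ldp(g, clip=1.0, sigma=0.5):
    g = g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
    return g + rng.normal(scale=sigma * clip, size=g.shape)

def avg_server_step(theta, checked_in_data, eta=0.1):
    if not checked_in_data:                      # empty slot: skip the update
        return theta
    grads = [a_ldp(theta - d) for d in checked_in_data]
    return theta - eta * np.mean(grads, axis=0)  # averaged model update

theta = avg_server_step(np.zeros(3), [np.ones(3), 2 * np.ones(3)])
```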

[0091] Theorem 4.1 (Amplification via random check-ins with averaged updates). Suppose A_ldp is an ε₀-DP local randomizer. Let A_avg: 𝒟ⁿ → Θ^m be the protocol from Algorithm 2 performing m averaged model updates with check-in probability p_j = 1 and check-in window R_j = [m] for each user j ∈ [n]. Then algorithm A_avg satisfies central (ε, δ)-DP, with a guarantee matching that of Theorem 3.2 up to an additional O(e^{ε₀/2}) factor. A utility guarantee for A_avg can be provided in terms of the excess population risk for convex ERMs (similar to Theorem 3.5).

[0092] Theorem 4.2 (Utility guarantee of Algorithm 2). Suppose in the algorithm A_avg: 𝒟ⁿ → Θ^m described in Theorem 4.1 the loss ℓ is L-Lipschitz and convex in its second parameter, and the model space Θ has dimension p and diameter R. Furthermore, let 𝒫 be a distribution on 𝒟, define the population risk L(𝒫; θ) = E_{d∼𝒫}[ℓ(d; θ)], and let θ* = argmin_{θ∈Θ} L(𝒫; θ). If A_ldp is a local randomizer that adds Gaussian noise with variance σ², and the learning rate for a model update at step i ∈ [m] is set appropriately, then the output θ_m of A_avg(D) on a dataset D containing n i.i.d. samples from 𝒫 satisfies an excess population risk bound analogous to that of Theorem 3.5. Furthermore, if the loss ℓ is smooth in its second parameter and the step size is set accordingly, a corresponding bound for the smooth case can be obtained.

[0093] Comparison of the utility of Algorithm 2 to that of Algorithm 1: Recall that in A_fix a small fixed ε can be achieved by taking p₀ = O(m/n), in which case the excess risk bound of Theorem 3.5 applies; in A_avg a fixed small ε can instead be obtained through the choice of m, yielding the corresponding excess risk bounds of Theorem 4.2 in the convex and smooth cases. Thus, all the bounds recover the optimal population risk trade-offs as m → n; for m ≪ n and a non-smooth loss, A_fix provides a better trade-off than A_avg, while on smooth losses A_avg and A_fix are incomparable. Note that A_fix (with b = 1) will not attain a better bound on smooth losses because each update is based on a single data point. Setting b > 1 will reduce the number of updates to m/b for A_fix, whereas obtaining an excess risk bound for A_fix on smooth losses, where more than one data point is sampled at each time step, would require extending the privacy analysis to incorporate the change.

[0094] The second variant removes the need for all clients to be available throughout the training period. Instead, it can be assumed that the training period comprises n time steps, and each client j ∈ [n] is only available during a window of m time steps. Clients perform a random check-in to provide the server with an update during their window of availability. For simplicity, it can be assumed that clients wake up in order, one every time step, so client j ∈ [n] will perform a random check-in within the window R_j = {j, ..., j + m − 1}. The server will perform n − m + 1 updates starting at time m, to provide a warm-up period in which the first m clients perform their random check-ins.
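A sketch of this sliding-window schedule (0-indexed, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 50

# Client j is awake only during slots {j, ..., j + m - 1} and checks in
# uniformly at random within that window (p_j = 1).
checkins = {}
for j in range(n):
    t = j + int(rng.integers(m))
    checkins.setdefault(t, []).append(j)

# The server performs n - m + 1 updates, starting once the first m clients
# have all had a chance to check in.
update_slots = range(m - 1, n)
empty = sum(1 for t in update_slots if t not in checkins)
print(f"{len(update_slots)} updates, {empty} empty slots "
      f"(Proposition 4.4 bound: {(n - m + 1) / np.e:.0f})")
```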

[0095] Theorem 4.3 (Amplification via random check-ins with sliding windows). Suppose A_ldp is an ε₀-DP local randomizer. Let A_sldw: 𝒟ⁿ → Θ^{n−m+1} be the distributed algorithm performing n − m + 1 model updates with check-in probability p_j = 1 and check-in window R_j = {j, ..., j + m − 1} for each user j ∈ [n]. For any m ∈ [n], algorithm A_sldw is (ε, δ)-DP, with ε improving as the window size m increases (see Remark 4).

[0096] Remark 4 - Privacy in the statement above can always be increased by increasing m. However, that may also increase the number of clients who do not participate in training because their scheduled check-in time is before the process begins, or after it terminates. Moreover, the number of empty slots where the server introduces dummy updates will also increase, which should be minimized for good accuracy. Thus, m can introduce a trade-off between accuracy and privacy.

[0097] Proposition 4.4 (Dummy updates in random check-ins with sliding windows). For the algorithm A_sldw: 𝒟ⁿ → Θ^{n−m+1} described in Theorem 4.3, the expected number of dummy gradient updates performed by the server is at most (n − m + 1)/e.

[0098] In some implementations, an improvement on privacy amplification by shuffling can be provided. This can be obtained by tightening the analysis of amplification by swapping, a central component in the analysis of amplification by shuffling.

[0099] Theorem 5.1 (Amplification via Shuffling). Let A_ldp^(i), i ∈ [n], be a sequence of adaptive ε₀-DP local randomizers. Let A_sl: 𝒟ⁿ → 𝒮ⁿ be the algorithm that, given a dataset D = (d₁, ..., d_n) ∈ 𝒟ⁿ, samples a uniform random permutation π over [n], sequentially computes s_i = A_ldp^(i)(s_{1:i−1}, d_{π(i)}), and outputs s_{1:n}. For any δ ∈ (0,1), algorithm A_sl satisfies (ε, δ)-DP, with an ε improved by a factor of O(e^{0.5ε₀}) relative to prior analyses (cf. paragraph [0024]). Furthermore, if the local randomizers are (ε₀, δ₀)-DP, then A_sl satisfies an analogous (ε′, δ′)-DP guarantee.

[0100] For comparison, the guarantee of some existing techniques in the case δ₀ = 0 results in an ε larger by approximately this factor.

[0101] The rapid growth in connectivity and information sharing has been accelerating the adoption of tighter privacy regulations and better privacy-preserving technologies. Therefore, training machine learning models on decentralized data using mechanisms with formal guarantees of privacy is highly desirable. However, despite the rapid acceleration of research on both DP and FL, only a tiny fraction of production ML models are trained using either technology. Implementations described herein take an important step in addressing this gap.

[0102] For example, implementations disclosed herein highlight the fact that proving DP guarantees for distributed or decentralized systems can be substantially more challenging than for centralized systems, because in the distributed world it may become much harder to precisely control and characterize the randomness in the system, and this precise characterization and control of randomness is at the heart of DP guarantees. Specifically, production FL systems do not satisfy the assumptions that are typically made under state-of-the-art privacy accounting schemes, such as privacy amplification via subsampling. Without such accounting schemes, service providers cannot give DP statements with small ε's. Implementations disclosed herein, though largely theoretical in nature, propose a method shaped by the practical constraints of distributed systems that allows for rigorous privacy statements under realistic assumptions.

[0103] Turning now to the figures, FIG. 1 illustrates an example environment 100 in which implementations described herein may be implemented. Example environment 100 includes remote system 102 and client device 104. Remote system 102 (e.g., a server) is remote from one or more client devices 104. In some implementations, remote system 102 may include global model training engine 106, client device engine 108, reporting window engine 110, global model 112, and/or additional or alternative engine(s) or model(s) (not depicted). In some implementations, client device 104 may include reporting engine 114, gradient engine 116, local model 118, and/or additional or alternative engine(s) or model(s) (not depicted).

[0104] In some implementations, remote system 102 may communicate with one or more client devices 104 via one or more networks such as a local area network (LAN) and/or a wide area network (WAN) (e.g., the Internet). In some implementations, client device 104 may include user interface input/output devices, which may include, for example, a physical keyboard, a touch screen (e.g., implementing a virtual keyboard or other textual input mechanisms), a microphone, a camera, a display screen, and/or speaker(s). The user interface input/output devices may be incorporated with one or more client devices 104 of a user. For example, a mobile phone of the user may include the user interface input/output devices; a standalone digital assistant hardware device may include the user interface input/output devices; a first computing device may include the user interface input device(s) and a separate computing device may include the user interface output device(s); etc. In some implementations, all or aspects of client device 104 may be implemented on a computing system that also contains the user interface input/output devices. In some implementations, client device 104 may include an automated assistant (not depicted), and all or aspects of the automated assistant may be implemented on computing device(s) that are separate and remote from the client device that contains the user interface input/output devices (e.g., all or aspects may be implemented "in the cloud"). In some of those implementations, those aspects of the automated assistant may communicate with the client device via one or more networks such as a local area network (LAN) and/or a wide area network (WAN) (e.g., the Internet).

[0105] Some non-limiting examples of client device 104 include one or more of: a desktop computing device, a laptop computing device, a standalone hardware device at least in part dedicated to an automated assistant, a tablet computing device, a mobile phone computing device, a computing device of a vehicle (e.g., an in-vehicle communications system, an in-vehicle entertainment system, an in-vehicle navigation system), or a wearable apparatus of the user that includes a computing device (e.g., a watch of the user having a computing device, glasses of the user having a computing device, a virtual or augmented reality computing device). Additional and/or alternative computing systems may be provided. Client device 104 may include one or more memories for storage of data and software applications, one or more processors for accessing data and executing applications, and other components that facilitate communication over a network. The operations performed by client device 104 may be distributed across multiple computing devices.

[0106] As illustrated in FIG. 1, local model 118 can be a local model stored locally at client device 104 corresponding with global model 112. For example, global model 112 can be a global automatic speech recognition ("ASR") model used to generate a text representation of a spoken utterance, and local model 118 can be a corresponding ASR model stored locally at client device 104. Additionally or alternatively, global model 112 can be a global text prediction model used to predict one or more words while a user is typing, and local model 118 can be a corresponding local text prediction model stored locally at client device 104.

Additional or alternative global and corresponding local models may be utilized in accordance with techniques described herein.

[0107] Global model training engine 106 can be used to train global model 112. In some implementations, global model training engine 106 can process gradients received from one or more client devices 104 at a specific time step, and update one or more portions of global model 112 based on the received gradient(s). For example, in some implementations, remote system 102 can receive a gradient from a single client device at a time step. Global model training engine 106 can update one or more portions of global model 112 based on the received gradient. Additionally or alternatively, remote system 102 can receive multiple gradients from multiple client devices at a single time step (e.g., receive gradients from each of two client devices, three client devices, five client devices, 10 client devices, and/or additional or alternative number(s) of client devices). In some implementations, global model training engine 106 can select one of the received gradients (e.g., select the first received gradient, select the last received gradient, randomly (or pseudo randomly) select one of the received gradients, and/or select a received gradient using one or more additional or alternative processes) for use in updating one or more portions of global model 112. Additionally or alternatively, global model training engine 106 can update one or more portions of global model 112 based on more than one of the received gradients (e.g., average the gradients received for the time step, average the first three gradients received for the time step, etc.). Furthermore, in some implementations, global model training engine 106 can update one or more portions of global model 112 based on each of the gradients received for the time step (e.g., store the received gradients in a buffer and update portion(s) of global model 112 based on each of the received gradients).
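Sketches of these aggregation choices for gradients received at the same time step (helper names are hypothetical):

```python
import random
import numpy as np

def select_one(gradients):
    # Use a single randomly selected gradient for the update.
    return random.choice(gradients)

def average_all(gradients):
    # Use the mean of every gradient received for the time step.
    return np.mean(gradients, axis=0)

batch = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(select_one(batch), average_all(batch))
```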

[0108] Client device engine 108 can be used to select a set of client devices 104. In some implementations, client device engine 108 can select each available client device. In some implementations, client device engine 108 can select (e.g., randomly or pseudo randomly select) a set of the client devices (e.g., select a set of client devices from the available client devices). Additionally or alternatively, client device engine 108 can select a subset of the client devices based on the physical location of the devices, based on historic data indicating device availability, and/or based on additional or alternative characteristics of the device(s). In some implementations, client device engine 108 can determine the number of client device(s) selected (e.g., client device engine 108 can randomly or pseudo randomly determine the number of client devices to be selected). In some implementations, client device engine 108 can determine multiple sets of client devices. For example, client device engine 108 can determine two sets of client devices, three sets of client devices, five sets of client devices, ten sets of client devices, one hundred sets of client devices, and/or additional or alternative numbers of sets of client devices.

[0109] Reporting window engine 110, of remote system 102, can be used to determine the time frame for each selected client device to update remote system 102. For example, reporting window engine 110 can determine the size of the reporting window based on the number of client devices selected by client device engine 108 (e.g., selecting a reporting window sufficiently large for the selected number of client devices). Additionally or alternatively, the reporting window can be selected based on historical data indicating when the selected client devices are in communication with the remote system but are otherwise idle. For example, reporting window engine 110 can select a reporting window in the middle of the night, when devices are more likely to be idle.

[0110] In some implementations, reporting engine 114 of client device 104 can determine whether to provide a gradient to remote system 102 for use in updating global model 112 and/or determine a reporting time within a reporting window (e.g., a reporting window generated using reporting window engine 110) to provide the gradient. For example, reporting engine 114 can make a determination of whether to participate in the current round of training (e.g., randomly determining whether to participate or not). If reporting engine 114 determines to participate, reporting engine 114 can then randomly determine a reporting time in the reporting window for client device 104 to provide a gradient to the remote system 102. Conversely, if reporting engine 114 determines to not participate in the training, a reporting time may not be selected from the reporting window and/or a gradient may not be transmitted to remote system 102 in the reporting window.

[0111] In some implementations, gradient engine 116 can be used to generate a gradient to provide to remote system 102 for use in updating global model 112. In some implementations, gradient engine 116 can process data generated locally at client device 104, using local model 118, to generate output. Additionally or alternatively, gradient engine 116 can generate the gradient based on the generated output in a supervised and/or in an unsupervised manner.

For example, global model 112 and local model 118 can be a global ASR model and a corresponding local ASR model, respectively. Audio data capturing a spoken utterance, captured using a microphone of client device 104, can be processed using the local ASR model to generate a candidate text representation of the spoken utterance. In some implementations, client device 104 can prompt the user who spoke the utterance, asking whether the candidate text representation correctly captures the spoken utterance and, if not, for the user to correct the text representation. The gradient can be determined based on the difference between the candidate text representation of the spoken utterance and the corrected text representation of the spoken utterance. As another example, global model 112 and local model 118 may be predictive text models used to predict text based on user-provided input (e.g., used to predict the next word(s) while a user is typing). In some implementations, current text can be processed using the predictive text model to generate candidate next text. The system can determine whether the next text typed by the user matches the candidate next text. In some implementations, the gradient can be determined based on the difference between the next text typed by the user and the candidate next text. Additional or alternative techniques may be used by gradient engine 116 to generate a gradient at client device 104.
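A toy sketch of deriving a local gradient from a user correction, with a linear model standing in for the ASR or predictive text model (purely illustrative):

```python
import numpy as np

def correction_gradient(theta, features, corrected_label):
    # Squared-error gradient between the model's candidate output and the
    # user-corrected ground truth generated on-device.
    candidate = features @ theta
    return (candidate - corrected_label) * features

theta = np.array([0.5, -0.2])
g = correction_gradient(theta, np.array([1.0, 2.0]), corrected_label=1.0)
# Only g is transmitted to the remote system; the features and the user's
# correction remain on the client device.
```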

[0112] FIG. 2 illustrates an example 200 of updating a global model in accordance with implementations disclosed herein. In the illustrated example, at step 202, remote system 102 can select a set of client devices including client device A 104A and client device N 104N. In some implementations, remote system 102 can select the set of client devices using client device engine 108 of FIG. 1. At step 204, remote system 102 can determine a reporting window indicating a timeframe for client devices A and N to provide updates for the global model. In some implementations, remote system 102 can determine the reporting window using reporting window engine 110 of FIG. 1. Additionally or alternatively, at step 206, remote system 102 can transmit the reporting window to client device A 104A and client device N 104N.

[0113] At step 208, client device A 104A can determine a reporting time, in the reporting window received from remote system 102. In some implementations, client device A 104A can determine the reporting time using reporting engine 114 of FIG. 1. At step 210, client device A 104A can transmit gradient A to remote system 102. In some implementations, client device A 104A can determine gradient A using gradient engine 116. For example, gradient A can be generated by processing data using a local model, stored locally at client device A, corresponding with the global model. At step 212, remote system 102 can update one or more portions of the global model using gradient A received from client device A 104A. In some implementations, remote system 102 can update global model 112 with gradient A via global model training engine 106 of FIG. 1.

[0114] Similarly, at step 214, client device N 104N can determine a reporting time, in the reporting window received from remote system 102. In some implementations, client device N 104N can determine the reporting time using reporting engine 114 of FIG. 1. At step 216, client device N 104N can transmit gradient N to remote system 102. In some implementations, client device N 104N can determine gradient N using gradient engine 116. For example, gradient N can be generated by processing data using a local model, stored locally at client device N, corresponding to the global model. At step 218, remote system 102 can update one or more portions of the global model using gradient N received from client device N 104N. In some implementations, remote system 102 can update global model 112 with gradient N using global model training engine 106 of FIG. 1.

[0115] FIG. 2 is merely an illustrative example and is not meant to be limiting. For instance, remote system 102 can receive gradients from additional or alternative client device(s), the client devices can determine their reporting times concurrently (e.g., client device A can determine its corresponding reporting time while client device N is determining its corresponding reporting time), multiple client devices can select the same reporting time, etc.
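
For illustration only, the reporting window transmitted at step 206 might be represented as a simple message. The following Python sketch uses hypothetical field names that are not drawn from this disclosure:

import time
from dataclasses import dataclass

# Hypothetical representation of the reporting window transmitted at step
# 206; the boundaries are expressed as absolute times in seconds since the
# epoch.
@dataclass
class ReportingWindow:
    round_id: int       # which global model update round this window is for
    start_time: float   # seconds since the epoch
    end_time: float     # seconds since the epoch

# e.g., a one-hour window for the current round
window = ReportingWindow(round_id=1, start_time=time.time(),
                         end_time=time.time() + 3600.0)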

[0116] FIG. 3 is a flowchart illustrating a process 300 of training a global model using a remote system in accordance with implementations disclosed herein. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of remote system 102, client device 104, and/or computing system 510. Moreover, while operations of process 300 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added.

[0117] At block 302, the system selects, at the remote system, the set of client devices, from a plurality of client devices. In some implementations, the system can select the set of client devices using client device engine 108 of FIG. 1. Additionally or alternatively, in some implementations, the system can select multiple sets of client devices.

[0118] At block 304, the system determines, at the remote system, a reporting window indicating a time frame for the set of client devices to provide one or more gradients, to update a global model. In some implementations, the system can determine the reporting window using reporting window engine 110 of FIG. 1.

[0119] At block 306, the system transmits the reporting window to each of the client devices in the selected set of client devices.

[0120] At block 308, the system receives, at the remote system and at corresponding reporting times, locally generated gradients. In some implementations, the corresponding reporting times can be determined, by each client device, in the reporting window. In some implementations, each locally generated gradient can be generated by processing, using a local model stored locally at the corresponding client device, data generated locally at the corresponding client device. In some implementations, each client device can transmit the corresponding gradient to the remote system in accordance with process 400 of FIG. 4 described herein.

[0121] At block 310, the system updates one or more portions of the global model based on the received gradients.
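
The following self-contained Python sketch simulates one round of process 300 under simplifying assumptions: the global model and each gradient are plain lists of floats, and each selected client is simulated in-process by a function that returns a randomly chosen reporting time and a gradient. None of these names, defaults, or the toy gradient values are drawn from this disclosure:

import random

def simulate_client(model_size, window_start, window_end):
    # Each simulated client picks a uniformly random reporting time in the
    # window (cf. block 406 of process 400) and produces a toy gradient.
    reporting_time = random.uniform(window_start, window_end)
    gradient = [random.gauss(0.0, 1.0) for _ in range(model_size)]
    return reporting_time, gradient

def run_training_round(global_model, num_selected=100, window_length=3600.0,
                       learning_rate=0.1):
    window_start, window_end = 0.0, window_length      # block 304
    # Blocks 302/306/308: select clients, transmit the window, and collect
    # their reports; here the collection is simulated in-process.
    reports = [simulate_client(len(global_model), window_start, window_end)
               for _ in range(num_selected)]
    reports.sort(key=lambda report: report[0])  # gradients arrive in time order
    for _, gradient in reports:                        # block 310
        for i, g in enumerate(gradient):
            global_model[i] -= learning_rate * g
    return global_model

updated = run_training_round(global_model=[0.0] * 8, num_selected=5)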

[0122] FIG. 4 is a flowchart illustrating a process 400 of transmitting a gradient, from a client device to a remote system, for use in updating a global model in accordance with implementations disclosed herein. For convenience, the operations of the flowchart are described with reference to a system that performs the operations. This system may include various components of various computer systems, such as one or more components of remote system 102, client device 104, and/or computing system 510. Moreover, while operations of process 400 are shown in a particular order, this is not meant to be limiting. One or more operations may be reordered, omitted, and/or added.

[0123] At block 402, the system receives, at a client device from a remote system, a reporting window indicating a time frame for the client device to provide a gradient to update a global model.

[0124] At block 404, the system generates the gradient by processing data, generated locally at the client device, using a local model stored locally at the client device, where the local model corresponds to the global model. In some implementations, the system can generate the gradient using gradient engine 116 of FIG. 1.

[0125] At block 406, the system determines a reporting time in the reporting window. In some implementations, the system can randomly (or pseudo-randomly) select a reporting time in the reporting window. In some implementations, the system can determine the reporting time using reporting engine 114 of FIG. 1.

[0126] At block 408, the system transmits, at the reporting time, the generated gradient to the remote system.
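
A minimal client-side sketch of process 400 follows, assuming the window boundaries are absolute times in seconds since the epoch. The callables generate_gradient and send_to_remote_system are hypothetical stand-ins for gradient engine 116 and the network transport, and are not part of this disclosure:

import random
import time

def report_gradient(window_start, window_end,
                    generate_gradient, send_to_remote_system):
    gradient = generate_gradient()                              # block 404
    # Block 406: randomly select a reporting time within the window.
    reporting_time = random.uniform(window_start, window_end)
    # Wait until the selected reporting time before transmitting.
    time.sleep(max(0.0, reporting_time - time.time()))
    send_to_remote_system(gradient)                             # block 408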

[0127] FIG. 5 is a block diagram of an example computing device 510 that may optionally be utilized to perform one or more aspects of techniques described herein. Computing device 510 typically includes at least one processor 514 which communicates with a number of peripheral devices via bus subsystem 512. These peripheral devices may include a storage subsystem 524, including, for example, a memory subsystem 525 and a file storage subsystem 526, user interface output devices 520, user interface input devices 522, and a network interface subsystem 516. The input and output devices allow user interaction with computing device 510. Network interface subsystem 516 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.

[0128] User interface input devices 522 may include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices. In general, use of the term "input device" is intended to include all possible types of devices and ways to input information into computing device 510 or onto a communication network.

[0129] User interface output devices 520 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem may also provide non-visual display such as via audio output devices. In general, use of the term "output device" is intended to include all possible types of devices and ways to output information from computing device 510 to the user or to another machine or computing device.

[0130] Storage subsystem 524 stores programming and data constructs that provide the functionality of some or all of the modules described herein. For example, the storage subsystem 524 may include the logic to perform selected aspects of the processes of FIGS. 3, 4, and/or other processes described herein.

[0131] These software modules are generally executed by processor 514 alone or in combination with other processors. Memory 525 used in the storage subsystem 524 can include a number of memories including a main random access memory (RAM) 530 for storage of instructions and data during program execution and a read only memory (ROM) 532 in which fixed instructions are stored. A file storage subsystem 526 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations may be stored by file storage subsystem 526 in the storage subsystem 524, or in other machines accessible by the processor(s) 514.

[0132] Bus subsystem 512 provides a mechanism for letting the various components and subsystems of computing device 510 communicate with each other as intended. Although bus subsystem 512 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.

[0133] Computing device 510 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 510 depicted in FIG. 5 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 510 are possible having more or fewer components than the computing device depicted in FIG. 5.

[0134] In situations in which the systems described herein collect personal information about users (or as often referred to herein, "participants"), or may make use of personal information, the users may be provided with an opportunity to control whether programs or features collect user information (e.g., information about a user's social network, social actions or activities, profession, a user's preferences, or a user's current geographic location), or to control whether and/or how to receive content from the content server that may be more relevant to the user. Also, certain data may be treated in one or more ways before it is stored or used, so that personally identifiable information is removed. For example, a user's identity may be treated so that no personally identifiable information can be determined for the user, or a user's geographic location may be generalized where geographic location information is obtained (such as to a city, ZIP code, or state level), so that a particular geographic location of a user cannot be determined. Thus, the user may have control over how information is collected about the user and/or used.

[0135] In some implementations, a method implemented by one or more processors is provided, the method includes selecting, at a remote system, a set of client devices, from a plurality of client devices. In some implementations, the method includes determining, at the remote system, a reporting window indicating a time frame for the set of client devices to provide one or more gradients, to update a global model. In some implementations, the method includes transmitting, by the remote system, to each client device in the set of client devices, the reporting window, wherein transmitting the reporting window causes each of the client devices to at least selectively determine a corresponding reporting time, within the reporting window, for transmitting a corresponding locally generated gradient to the remote system. In some implementations, the method includes receiving, in the reporting window, the corresponding locally generated gradients at the corresponding reporting times, wherein each of the corresponding locally generated gradients is generated by a corresponding one of the client devices based on processing, using a local model stored locally at the client device, data generated locally at the client device to generate a predicted output of the local model. In some implementations, the method includes updating one or more portions of the global model, based on the received gradients.

[0136] These and other implementations of the technology can include one or more of the following features.

[0137] In some implementations, the method further includes selecting, at the remote system, an additional set of additional client devices, from the plurality of client devices. In some implementations, the method further includes determining, at the remote system, an additional reporting window indicating an additional time frame for the additional set of additional client devices to provide one or more additional gradients, to update the global model. In some implementations, the method further includes transmitting, by the remote system, to each additional client device in the additional set of additional client devices, the additional reporting window, wherein transmitting the additional reporting window causes each of the additional client devices to at least selectively determine a corresponding additional reporting time, within the additional reporting window, for transmitting a corresponding additional locally generated gradient to the remote system. In some implementations, the method further includes receiving, in the additional reporting window, the corresponding additional locally generated gradients at the corresponding additional reporting times, wherein each of the corresponding additional locally generated gradients is generated by a corresponding one of the additional client devices based on processing, using a local model stored locally at the additional client device, additional data generated locally at the additional client device to generate an additional predicted output of the local model. In some implementations, the method further includes updating one or more additional portions of the global model, based on the received additional gradients. In some versions of those implementations, at least one client device in the set of client devices, is in the additional set of additional client devices.

[0138] In some implementations, processing, using the local model stored locally at the client device, data generated locally at the client device to generate the predicted output of the local model further includes generating the gradient based on the predicted output of the local model and ground truth data generated by the client device. In some versions of those implementations, the global model is a global automatic speech recognition ("ASR") model, the local model is a local ASR model, and generating the gradient based on the predicted output of the local model includes processing audio data capturing a spoken utterance using the local ASR model to generate a predicted text representation of the spoken utterance. In some versions of those implementations, the method further includes generating the gradient based on the predicted text representation of the spoken utterance and a ground truth representation of the spoken utterance generated by the client device.

[0139] In some implementations, each of the client devices at least selectively determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system includes, for each of the client devices, randomly determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system. In some versions of those implementations, each of the client devices at least selectively determining the corresponding reporting time, within the reporting window, for transmitting the corresponding locally generated gradient to the remote system includes, for each of the client devices, determining whether to transmit the corresponding locally generated gradient to the remote system. In some versions of those implementations, in response to determining to transmit the corresponding locally generated gradient, the method further includes transmitting the corresponding locally generated gradient to the remote system. In some versions of those implementations, determining whether to transmit the corresponding locally generated gradient to the remote system includes randomly determining whether to transmit the corresponding locally generated gradient to the remote system.
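
A minimal sketch of this "at least selectively" determination follows, assuming a tunable participation probability p that is not specified in this disclosure: the client first randomly decides whether to transmit at all, and only then draws a uniformly random reporting time.

import random

def maybe_choose_reporting_time(window_start, window_end, p=0.5):
    # Randomly determine whether to transmit this round; p is an assumed
    # tunable, not taken from the disclosure.
    if random.random() >= p:
        return None  # randomly determined not to transmit
    # Otherwise, randomly determine the reporting time within the window.
    return random.uniform(window_start, window_end)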

[0140] In some implementations, receiving, in the reporting window, the corresponding locally generated gradients at the corresponding reporting times includes receiving a plurality of corresponding locally generated gradients at the same reporting time. In some versions of those implementations, updating one or more portions of the global model, based on the received gradients includes determining an update gradient based on the plurality of corresponding locally generated gradients received at the same reporting time. In some versions of those implementations, the method further includes updating the one or more portions of the global model, based on the update gradient. In some versions of those implementations, determining the update gradient based on the plurality of corresponding locally generated gradients, received at the same reporting time, includes selecting the update gradient from the plurality of corresponding locally generated gradients received at the same reporting time. In some versions of those implementations, selecting the update gradient from the plurality of corresponding locally generated gradients, received at the same reporting time, includes randomly selecting the update gradient from the plurality of corresponding locally generated gradients, received at the same reporting time. In some versions of those implementations, determining the update gradient based on the plurality of corresponding locally generated gradients, received at the same reporting time, includes determining the update gradient based on an average of the plurality of corresponding locally generated gradients.
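
For illustration, the two ways of determining the update gradient described above (random selection and averaging) could be sketched as follows, modeling gradients as equal-length lists of floats; the strategy switch is illustrative only and not drawn from this disclosure:

import random

def determine_update_gradient(same_time_gradients, strategy="average"):
    if strategy == "select":
        # Randomly select one of the gradients received at the same time.
        return random.choice(same_time_gradients)
    # Otherwise, average the gradients element-wise.
    n = len(same_time_gradients)
    return [sum(values) / n for values in zip(*same_time_gradients)]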

[0141] In some implementations, a method implemented by one or more processors is provided, the method includes receiving, at a client device and from a remote system, a reporting window indicating a time frame for the client device to provide a gradient, to the remote system, to update one or more portions of a global model. In some implementations, the method includes processing locally generated data, using a local model, to generate predicted output of the local model. In some implementations, the method includes generating the gradient based on the predicted output of the local model. In some implementations, the method includes determining a reporting time, in the reporting window, to transmit the gradient to the remote system. In some implementations, the method includes, at the reporting time, transmitting the gradient to the remote system.

[0142] In addition, some implementations include one or more processors (e.g., central processing unit(s) (CPU(s)), graphics processing unit(s) (GPU(s)), and/or tensor processing unit(s) (TPU(s))) of one or more computing devices, where the one or more processors are operable to execute instructions stored in associated memory, and where the instructions are configured to cause performance of any of the methods described herein. Some implementations also include one or more transitory or non-transitory computer readable storage media storing computer instructions executable by one or more processors to perform any of the methods described herein.