Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR UPDATING NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2022/219232
Kind Code:
A1
Abstract:
The embodiments relate to a method and technical equipment for weight update compression in a neural network. For a neural network topology comprising topology elements having weight updates, the method comprises determining (310) importance of a topology element, and performing weight update dropping (220) of topology elements having less importance; selecting (320) one or more quantization methods with different quantizers to be used alternatively; quantizing (102, 240, 330) the existing weight updates according to the selected quantization method; and coding (104, 250, 340) the quantized weight updates.

Inventors:
REZAZADEGAN TAVAKOLI HAMED (FI)
CRICRI FRANCESCO (FI)
ZHANG HONGLEI (FI)
AFRABANDPEY HOMAYUN (FI)
RANGU GOUTHAM (FI)
AKSU EMRE BARIS (FI)
HANNUKSELA MISKA MATIAS (FI)
Application Number:
PCT/FI2022/050214
Publication Date:
October 20, 2022
Filing Date:
April 04, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
G06N3/08; H03M7/30; G06N3/04; G06N20/00; H03M7/40; H03M7/48
Other References:
BASU, D. ET AL.: "Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations", IEEE JOURNAL ON SELECTED AREAS IN INFORMATION THEORY, vol. 1, no. 1, 6 April 2020 (2020-04-06), pages 217 - 226, XP011792643, Retrieved from the Internet [retrieved on 20220714], DOI: 10.1109/JSAIT.2020.2985917
FANG, J. ET AL.: "Accelerating Distributed Deep Learning Training with Gradient Compression", ARXIV:1808.04357V3 [CS.DC], 22 July 2019 (2019-07-22), pages 1 - 10, XP081179384, Retrieved from the Internet [retrieved on 20220714]
SATTLER, F. ET AL.: "Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication", ARXIV:1805.08768V1 [CS.LG], 22 May 2018 (2018-05-22), pages 1 - 12, XP080881177, Retrieved from the Internet [retrieved on 20220714]
MORIN, G. ET AL.: "Smart Ternary Quantization", ARXIV:1909.12205V1 [CS.LG], 26 September 2019 (2019-09-26), pages 1 - 8, XP081483772, Retrieved from the Internet [retrieved on 20220715]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:
1. An apparatus for weight update compression in a neural network, wherein the neural network topology comprises topology elements being associated with weight updates, wherein a weight update belongs to a grouping defined by a structure, the apparatus comprising:
- means for determining importance of a structure based on the neural network topology and weight updates, and sparsifying a structure in a weight update according to the determined importance;
- means for selecting one or more quantization method with different quantizers to be used alternatively;
- means for quantizing the weight-updates associated with the topology with sparsified structure according to the selected quantization method, and
- means for coding the quantized weight updates.

2. The apparatus according to claim 1, wherein the importance is determined by an importance function being one of the following: L1 magnitude, saliency, graph diffusion-based importance.

3. The apparatus according to claim 1 or 2, further comprising means for determining element importance of a topology element, and performing weight update dropping of topology elements according to the determined element importance.

4. The apparatus according to claim 3, wherein an importance metric is one of the following: magnitude of weight updates or a graph diffusion-based score; a metric derived from gradients computed during one or more training iterations; a metric derived from the activations that are output by topology elements after one or more training iterations; a metric derived from one or more of the previous options.

5. The apparatus according to claim 3 or 4, further comprising means for reducing values of weight updates with respect to the amount of dropped weight updates.

6. The apparatus according to any of the claims 3 to 5, further comprising means for performing weight update scaling for a topology element.

7. The apparatus according to any of the claims 3 to 6, wherein the determined element importance indicates topology elements whose weight updates are to be dropped or whose weight updates are not to be dropped.

8. The apparatus according to any of the claims 1 to 7, further comprising means for determining criteria for selecting said one or more quantization method.

9. A method for weight update compression in neural network, wherein the method comprises: for a neural network topology comprising topology elements being associated with weight updates, wherein a weight update belongs to a grouping defined by a structure:
- determining importance of a structure based on the neural network topology and weight updates, and sparsifying a structure in a weight update according to the determined importance;
- selecting one or more quantization method with different quantizers to be used alternatively;
- quantizing the weight-updates associated with the topology with sparsified structure according to the selected quantization method, and
- coding the quantized weight updates.

10. The method according to claim 9, an importance is determined by an importance function being one of the following: L1 magnitude, saliency, graph diffusion-based importance.

11. The method according to claim 9 or 10, further comprising determining importance of a topology element, and performing weight update dropping of topology elements according to the determined importance.

12. The method according to claim 11, wherein an importance metric is one of the following: magnitude of weight updates or a graph diffusion-based score; a metric derived from gradients computed during one or more training iterations; a metric derived from the activations that are output by topology elements after one or more training iterations; a metric derived from one or more of the previous options.

13. The method according to claim 11 or 12, further comprising reducing values of weight updates with respect to the amount of dropped weight updates.

14. The method according to any of the claims 11 to 13, further comprising performing weight update scaling for a topology element.

15. The method according to any of the claims 11 to 14, wherein the determined element importance indicates topology elements whose weight updates are to be dropped or whose weight updates are not to be dropped.

16. The method according to any of the claims 8 to 15, further comprising determining criteria for selecting said one or more quantization method.
17. An apparatus for weight update compression in a neural network, wherein the neural network topology comprises topology elements being associated with weight updates, wherein a weight update belongs to a grouping defined by a structure, the apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- determine importance of a structure based on the neural network topology and weight updates, and sparsifying a structure in a weight update according to the determined importance;
- select one or more quantization method with different quantizers to be used alternatively;
- quantize the weight-updates associated with the topology with sparsified structure according to the selected quantization method, and
- code the quantized weight updates.

18. The apparatus according to claim 17, wherein an importance is determined by an importance function being one of the following: L1 magnitude, saliency, graph diffusion-based importance.

19. The apparatus according to claim 17 or 18, further being caused to determine importance of a topology element, and perform weight update dropping of topology elements according to the determined importance.

20. A computer program product for weight update compression in a neural network, wherein the neural network topology comprises topology elements being associated with weight updates, wherein a weight update belongs to a grouping defined by a structure, the computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to
- determine importance of a structure based on the neural network topology and weight updates, and sparsifying a structure in a weight update according to the determined importance;
- select one or more quantization method with different quantizers to be used alternatively;
- quantize the weight-updates associated with the topology with sparsified structure according to the selected quantization method, and
- code the quantized weight updates.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR UPDATING NEURAL NETWORKS

The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Netherlands, Czech Republic, Finland, Spain, Italy.

Technical Field

The present solution generally relates to representations of compressed neural networks. In particular, the present solution relates to compression of weight updates.

Background

Artificial neural networks are used for a broad range of tasks in multimedia analysis and processing, media coding, data analytics and many other fields. Trained neural networks contain a large number of parameters and weights, resulting in a relatively large size. Therefore, the trained neural networks or their weight updates should be represented in a compressed form.

Summary

The scope of protection sought for various example embodiments of the invention is set out by the independent claims. The example embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various example embodiments of the invention. Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method for weight update compression in a neural network, wherein the method comprises, for a neural network topology comprising topology elements having weight updates: determining importance of a topology element, and performing weight update dropping of topology elements according to the determined importance; selecting one or more quantization methods with different quantizers to be used alternatively; quantizing the existing weight updates according to the selected quantization method, and coding the quantized weight updates.

According to a second aspect, there is provided an apparatus for weight update compression in a neural network, wherein the neural network topology comprises topology elements having weight updates, the apparatus comprising means for determining importance of a topology element, and performing weight update dropping of topology elements according to the determined importance; means for selecting one or more quantization methods with different quantizers to be used alternatively; means for quantizing the existing weight updates according to the selected quantization method, and means for coding the quantized weight updates.

According to a third aspect, there is provided an apparatus for weight update compression in a neural network, wherein the neural network topology comprises topology elements having weight updates, the apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine importance of a topology element, and perform weight update dropping of topology elements according to the determined importance; select one or more quantization methods with different quantizers to be used alternatively; quantize the existing weight updates according to the selected quantization method, and code the quantized weight updates.
According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to determine importance of a topology element, and perform weight update dropping of topology elements according to the determined importance; select one or more quantization methods with different quantizers to be used alternatively; quantize the existing weight updates according to the selected quantization method, and code the quantized weight updates.

According to an embodiment, an importance metric is used to determine which topology elements are less important.

According to an embodiment, the importance metric is one of the following: magnitude of weight updates; a graph diffusion-based score; a metric derived from gradients computed during one or more training iterations; a metric derived from the activations that are output by topology elements after one or more training iterations; a metric derived from one or more of the previous options.

According to an embodiment, the importance metric for a certain topology element is the average value of the gradients of the training loss function with respect to the weight of that topology element, where the average is computed over a certain number of training iterations and over the multiple gradients.

According to an embodiment, the importance metric for a certain topology element is the average value of the output from that topology element, where the average may be computed over the multiple output values.

According to an embodiment, the importance metric is a linear combination of one or more of the previous examples.

According to an embodiment, values of weight updates are reduced with respect to the amount of dropped weight updates.

According to an embodiment, weight update scaling is performed for a topology element.

According to an embodiment, the determined importance indicates topology elements whose weight updates are to be dropped or whose weight updates are not to be dropped.

According to an embodiment, criteria for selecting said one or more quantization methods are determined.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an example of NNR encoding pipelines;
Fig. 2 shows a general example of a system according to present embodiments;
Fig. 3 is a flowchart illustrating a method according to an embodiment; and
Fig. 4 shows an apparatus according to an example.

Description of Example Embodiments

The present embodiments provide a method and technical equipment for compressing weight updates. The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to “one embodiment” or “an embodiment” in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.
Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.

MPEG is currently pursuing standardization of representations for compressed neural networks in the standardization group called NNR (Neural Network Representation). The standardization effort has reached the Draft International Standard (DIS) stage.

A neural network (NN) is a known structure comprising several layers of successive computation. A layer may comprise one or more units or neurons performing an elementary computation. A unit may be connected to one or more other units, and the connection may have an associated weight. The weight, also called a “parameter” or a “weight parameter”, may be used for scaling the signal passing through the associated connection. A weight may be a learnable parameter, i.e., a value which can be learned from training data. A Graph Convolutional Network is a neural network working on input graphs.

A neural network is defined in terms of a topology. A topology defines the way the neurons are connected to each other. The topology can be defined at different granularities of elements, and a full model can be the biggest possible element in a topology. In its definition, a topology could consist of a set of topology elements. The topology could be layered, in which sequential processing, e.g., in a feed-forward fashion, happens. The topology could be recurrent or fully recurrent. It is also possible to have modular topologies where each module consists of another neural network that performs a distinct task. In this disclosure, the term “topology elements” refers to the components of the topology. In a layered topology, such components can be a convolution layer, a batch-norm layer, a linear or fully-connected layer, etc. For simplicity, the terms “convolution” and “convolution layer”, “batch-norm” and “batch-norm layer”, etc. are used interchangeably.

In order to configure a neural network to perform a task, an untrained neural network model has to go through a training phase. The training phase is the development phase, where the neural network learns to perform the final task. A training data set that is used to train the neural network is supposed to be representative of the data on which the neural network will be used. During training, the neural network uses the examples in the training data set to modify its learnable parameters (e.g., its connections’ weights) in order to achieve the desired task. Input to the neural network is the data, and the output of the neural network represents the desired task. The training may happen by minimizing or decreasing the output’s error, also referred to as the loss. In deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural network to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss. After training, the trained neural network model is applied to new data during an inference phase, in which the neural network performs the desired task for which it has been trained. The inference phase is also known as a testing phase. As a result of the inference phase, the neural network provides an output which is a result of the inference on the new data. Training can be performed in several ways. The main ones are supervised, unsupervised, and reinforcement training.
In supervised training, the neural network model is provided with input-output pairs, where the output may be a label. In unsupervised training, the neural network is provided only with input data (and also with output raw data in the case of self-supervised training). In reinforcement learning, the supervision is sparser and less precise; instead of input-output pairs, the neural network gets input data and, sometimes, delayed rewards in the form of scores (e.g., -1, 0, or +1). The learning is a result of the training algorithm, or of a meta-level neural network providing the training signal.

In general, the training algorithm comprises changing some properties of the neural network so that its output is as close as possible to a desired output. Training a neural network is an optimization process, but the final goal may be different from the goal of optimization. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This may be referred to as generalization. In practice, data may be split into at least two datasets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model.

NNR specifies a compressed representation of the parameters/weights of a trained neural network and a decoding process of the compressed representation. Figure 1 illustrates an example of NNR encoding pipelines for weight compression that can be assembled using various compression tools. From the group of parameter transformation tools, multiple tools can be applied in sequence. Parameter quantization can be applied to source models as well as to the outputs of transformation with parameter reduction methods. Entropy coding may be applied to the output of quantization. Raw outputs of earlier steps without applying entropy coding can be serialized if needed.

Figure 1 shows parameter reduction methods 101, comprising (parameter) pruning, (parameter) sparsification, decomposition and (weight) unification. In pruning, the number of parameters is reduced by eliminating a parameter or a group of parameters. The pruning results in a dense representation which has fewer parameters in comparison to the original model, e.g., by removing some redundant convolution filters from the layers. In sparsification, parameters or groups of parameters are processed to produce a sparse representation of a model, e.g., by replacing some weight values with zeros. In decomposition, a matrix decomposition operation is performed to change the structure of the weights of the model. In unification, parameters are processed to produce a group of similar parameters. Unification may not eliminate or constrain the weights to be zero. The entropy of the model parameters is, however, lowered by making them similar to each other. The different parameter reduction methods 101 can be combined or applied in sequence to produce a compact model. Parameter quantization methods 102 reduce the precision of the representation of parameters.
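As an informal illustration of what reducing parameter precision means in practice (not part of the NNR specification), a minimal Python sketch of uniform scalar quantization of a parameter tensor is given below; the bit width, step-size choice and function names are assumptions made only for this example.

import numpy as np

def uniform_quantize(params, bits=8):
    # Map values onto a uniform grid of signed integer levels; return levels and step size.
    # Illustrative sketch only, not the NNR-specified quantization procedure.
    qmax = 2 ** (bits - 1) - 1
    step = max(float(np.abs(params).max()), 1e-12) / qmax
    levels = np.clip(np.round(params / step), -qmax - 1, qmax).astype(np.int32)
    return levels, step

def uniform_dequantize(levels, step):
    # Reconstruct approximate parameter values from the integer levels and the step size.
    return levels.astype(np.float32) * step

# Example: quantize a random weight matrix to 8-bit precision and reconstruct it.
w = np.random.randn(4, 4).astype(np.float32)
levels, step = uniform_quantize(w, bits=8)
w_hat = uniform_dequantize(levels, step)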
If supported by the inference engine, the quantized representation can be used for more efficient inference. The parameter quantization methods may comprise uniform quantization, codebook quantization and dependent (scalar) quantization. Entropy coding methods 104 are configured to encode the results of the parameter quantization methods. An example of an entropy coding method is DeepCABAC.

The term “weight compression” relates to compression of neural network weights. Weight compression considers compression for deployment of a neural network, or making it efficient for inference time. Some of the techniques for weight compression use a loss function to impose sparsification, e.g., applying the L1-norm during training of a neural network. A known quantization approach for quantizing neural networks for efficient deployment is k-means, which can be used for tying the weights. Weight tying aims to group the learned weights during a training process to reduce the amount of information communicated during deployment. Weight tying cannot be applied to weight update compression since each grouping requires a training pass. This is not feasible/applicable for compression of weight updates, because weight update compression is concerned with the compression of information communicated during the training, and one cannot afford the cost of a training inside a training process.

Training of a neural network model is a time-consuming process, since neural networks comprise millions of parameters that should be trained over huge datasets. In order to reduce time, the training can be performed in a distributed manner. In distributed training, the training is performed by several processors. One of the important aspects in distributed training is incremental weight update compression. Such weight update compression has applications in various setups, including training of a neural network in a parallel setup (multiple GPUs on the same machine), distributed clusters, and lately the federated learning setup. The term “weight update” refers to the change in the neural network weights that can be caused by a learning algorithm. Weight updates are different from “gradient updates”, because weight updates are accumulative changes over several iterations. A gradient update is a special case of a weight update where only one iteration is considered. However, this description uses the terms “weight update” and “gradient update” interchangeably.

MPEG NNR has recently started studying incremental weight update compression. The activity contains two use cases: incremental weight update for a multimedia use case, where a single-client setup is studied; and a federated learning setup with two clients and a server. The intent of the CfP is to gather and standardize technologies suitable for efficient communication of weight updates in a distributed setting. Beyond NNR, federated learning and incremental weight updates have significant use in IoT (Internet of Things) and distributed intelligent machines. Thus, the activity relates to all the intelligent edge solutions that can receive a weight update in order to improve their working efficiency.

Various techniques for achieving weight update compression during distributed training exist in the technical field. Examples of such techniques comprise top-k sparsification, quantization, and the combination of these two. In the combined model, top-k sparsification preserves the top-k gradients given their magnitude, and zeros the rest.
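A minimal Python sketch of such magnitude-based top-k sparsification of a weight-update tensor is shown below; the tensor shape, the value of k, and the function name are illustrative assumptions rather than a normative procedure.

import numpy as np

def topk_sparsify(weight_update, k):
    # Keep the k largest-magnitude weight updates and set the rest to zero.
    # Ties at the threshold may keep slightly more than k entries; illustrative sketch only.
    flat = np.abs(weight_update).ravel()
    if k >= flat.size:
        return weight_update.copy()
    threshold = np.partition(flat, -k)[-k]
    mask = np.abs(weight_update) >= threshold
    return weight_update * mask

# Example: keep roughly 10% of the entries of a weight-update matrix.
dw = np.random.randn(64, 64).astype(np.float32)
dw_sparse = topk_sparsify(dw, k=int(0.1 * dw.size))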
Then, the remaining weight updates (i.e., the non-zero weight-update values) are quantized. The quantization can be achieved by using various techniques, e.g., sign-based quantization as in signSGD, binary and ternary quantization, or codebook-based quantization techniques. For example, the combined approach may combine top-k sparsification and ternary quantization to achieve a good compression ratio in a federated use case. Weight update compression may comprise a lossless compression step, for example by using an entropy codec such as an arithmetic codec. Lossless compression may be performed after any lossy compression step, such as after sparsification and/or quantization.

In the present embodiments concerning weight update compression, the weight updates are zeroed, which does not result in a sparse neural network but in a sparse weight update. This is different from weight compression, where neurons are zeroed to obtain a sparse neural network. In weight compression during training of a neural network, one may employ sparse variational dropout for training a sparse neural network combined with element sparsification, where the term “element” refers to the input data. The element dropping according to the present embodiments refers to dropping the gradients/weight updates of an element of the topology, not of the input.

The present embodiments provide a solution to compress weight updates. Thus, the present embodiments provide:
- a structured way of sparsification that may mitigate the risk of variance increase and may improve the performance efficiency of the compression of weight updates;
- incorporation of element dropping in addition to gradient/weight update dropping, where the element dropping is defined as a process of dropping gradient/weight updates corresponding to a topology element. That is, the weight updates corresponding to a topology element are set to zero;
- application of gradient/weight update scaling after gradient/weight update dropping to enhance the performance efficiency and convergence characteristics of the algorithm;
- application of a new selection scheme for quantization to allow better bandwidth utilization. This may be achieved by allowing switching between multiple algorithms with different quantization granularities and considering structures. Two quantization methods may be used to approximate information at coarse and finer levels. For example:
  o at a coarse level: a stochastic binary quantization algorithm to stochastically signal changes in one direction is used;
  o at a fine level: a stochastic ternary quantization algorithm to stochastically signal scales in both directions or one global scale value;
  o a stochastic binary-ternary quantization method is proposed;
- definition of high-level syntax and semantics which is relevant to the incremental weight update procedure described above and for communicating the weight updates between two communication-network-connected processing entities.

Generic framework and sparsification method

A framework for compressing neural network weight updates is proposed. Figure 2 illustrates an example of a system according to an example embodiment. As shown in Figure 2, the framework comprises a structured sparsification component 210, topology element dropping 220, optionally a weight update scaling 230, quantization 240, and encoding 250. The structured sparsification component 210 imposes sparsity inside a given structure of weight updates.
Examples of possible structures could be a specific block configuration inside the weight matrices, e.g., 8x8 blocks, a specific CTU-based folding inside the weight block, or a specific grouping of weights, e.g., choosing to group the convolutions based on their channels. Another structure configuration could be layer-based, where one or multiple layers are grouped together. The sparsification step may use the same structure for all the elements of the topology. In yet another embodiment, multiple structures may be used during the weight update sparsification.

To perform structured sparsification 210, an importance metric is used to determine which blocks or structures are less important. The importance metric can be a magnitude of weight updates; a graph diffusion-based score; a metric derived from gradients computed during one or more training iterations; a metric derived from the activations that are output by topology elements after one or more training iterations; or a metric derived from one or more of the previous options. For example, the importance metric for a certain topology element may be the average value of the gradients of the training loss function with respect to the weight of that topology element, where the average is computed over a certain number of training iterations and over the multiple gradients. As another example, the importance metric for a certain topology element may be the average value of the output from that topology element, where the average may be computed over the multiple output values. As yet another example, the importance metric may be a linear combination of one or more of the previous examples. Given a percentage of required sparsity or some hyper-parameter tuning, less important structures are set to zero.

In the neural network, the sparsification and pruning can be done at the level of topology elements, e.g., a convolution layer, a linear layer, etc. This is referred to as topology element dropping 220. In topology element dropping, the weight updates of one or more topology elements are eliminated, e.g., by setting the weight updates of a convolution layer to zero. In this process, some criterion is used to determine the element goodness. Examples of such criteria can be the average nonzero activations, the distance between average excitation and average inhibition in the network and the element, or some other factor. The element goodness is used to decide whether the weight updates for that topology element are going to be kept or set to zero. For example, if the element goodness for a certain topology element is below a threshold, the weight updates for that topology element may be set to zero. The dropping can be done with respect to the location of the element in the topology, e.g., the elements that are at the beginning of the neural architecture or intermediate elements can be more favorable for weight update dropping. Given the element goodness criteria, the topology element dropping step could be inactive during the training process from the beginning, or become inactive at some point during the training. It is appreciated that the order of steps 210 and 220 may change in various embodiments, and they may work independently of each other.

After the sparsification, weight update scaling 230 can optionally be performed. The concept of weight update scaling 230 is to reduce the values of the weight updates with respect to the amount of dropped weight updates.
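A minimal Python sketch of such weight update scaling is given below, following the example described later in this disclosure (dividing the remaining nonzero updates by the number of zeroed updates); the function name is an assumption made for illustration.

import numpy as np

def scale_weight_updates(sparse_update):
    # Reduce the surviving updates in proportion to how many updates were set to zero.
    # Follows the example given later in the description; illustrative sketch only.
    num_zeroed = int(np.count_nonzero(sparse_update == 0))
    if num_zeroed == 0:
        return sparse_update
    return sparse_update / num_zeroed

# Example: scale a sparsified weight-update vector before quantization.
dw_scaled = scale_weight_updates(np.array([0.0, 0.5, 0.0, -0.2], dtype=np.float32))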
Such scaling may have advantages in keeping the variance small and enhancing the convergence properties of a neural network during distributed training, on the same machine, over multiple clusters, or in a federated learning use case. This optional step is independent of the sparsification algorithm and can apply to other sparsification algorithms as well. The aforementioned steps (i.e., sparsification and weight update scaling) can be applied independently of the quantization techniques described below.

To perform quantization 240, a scheme allowing better bandwidth utilization is disclosed. This means that, whenever the bandwidth allows, the weight updates can be less compressed. This allows better learning for the algorithm. To achieve this, a stochastic mechanism is provided to switch between two quantization algorithms, for example binary weight update compression and ternary weight update compression, or any other quantization algorithms. The stochastic mechanism can be turned off to allow only one mode of working and a deterministic algorithm usage. According to another example, more than two algorithms to be chosen from can be allowed. The choice of algorithms can be based on weight update properties or on an external parameter such as the bandwidth, or a combination of both. According to an example, the choice of algorithm can be influenced by an expert to enforce a specific characteristic. In addition, according to an example, the choice of algorithm can be different for all or some participants, i.e., clients.

After quantization, an encoding 250 pipeline will encode the quantized weight updates. Various encoding mechanisms can apply, e.g., Run-Length Encoding (RLE), Golomb encoding, a combination of RLE and Golomb, or CABAC-based encodings. It is possible that multiple levels of quantization can be applied, i.e., a combination of quantization techniques, e.g., the bitmasks can be encoded with RLE or Golomb, whereas the other quantized data elements could be encoded using CABAC. For example, the scale variables can be quantized and later encoded by CABAC-based algorithms or other entropy coding mechanisms.

In the following section, an algorithm according to example embodiments is described, where some example implementation options are explained in a more detailed manner with respect to the system description of Figure 2. A neural network N consists of a topology T = {t_1, t_2, …, t_n}, where each member of T corresponds to a topology element. T = T_q ∪ T_nq, where T_q are the quantizable topology elements, e.g., convolutions, linear layers, etc., and T_nq are the non-quantizable elements of the topology, e.g., batch norms, biases of some specific topology elements, etc. For each element of the topology T within the neural network N, there is a weight W and a weight update DW. According to example embodiments, a structure, denoted by s, is provided, where s defines a grouping of elements inside W and consequently DW. Examples of such groupings can be a channel-wise grouping inside weights of convolutions, block-wise groupings inside weight tensors, kernel-wise grouping of the weights of the convolutions, CTU-based tile grouping of weights, etc. The largest grouping can be a model-wise grouping, which is equal to not having an explicit structure. An algorithm for structure-aware top-k sparsification and compression of weight updates is provided in the following.
Based on a topology T, weight updates DW, and a structure s, a structure-aware sparsification and quantization is employed:

Step 1: importance calculation and structure sparsification (fine sparsification)
• For t ∈ T_q
  • For s ∈ t
    • importance_s = F(DW_s)
    • DW_s′ = sparsify(DW_s, importance_s, percent)

Step 2: element dropping (coarse sparsification)
• For t ∈ T_q
  • importance_e = Element_goodness(DW_t′)
  • DW_t′ = drop_element(DW_t′, importance_e)

Step 3: (optional)
• Weight update scaling: scale the remaining weight updates with respect to the ratio of the weight updates that are set to zero.

Step 4: quantization, which will be executed if the original model is not a quantized model; details are given in the next section.

Step 5: coding

In the previous algorithm:
• F(.) is an importance function. Examples of importance measures can be L1 magnitude, saliency, graph diffusion-based importance, etc.
• Element_goodness(.) is a measure of element importance. Examples of such a measure can be the number of non-zeros, mean nonzero / total mean, average nonzero activations, distance between average excitation and average inhibition in the network and the element, etc.
• sparsify(.,.,.) performs structure sparsification. That is, it sets some values in the structure to zero by considering a target sparsification percent and the importance value of the weight updates, e.g., the least important weight updates are set to zero.
• drop_element(.,.) sets all the weight updates of a topology element to zero and prevents any update to it based on the element goodness, e.g., if the goodness is below a threshold or is insignificant.
• Weight update scaling adjusts the values of the weight updates by considering the amount of zeroed updates. An example of such an operation is dividing the remaining nonzero weight update values by the number of the weight updates that are set to zero.

Quantization

A quantization scheme can be used for better bandwidth utilization during weight update compression.

Generic Structured Stochastic Quantization for Better Bandwidth Utilization:
• Given a quantizable topology, T_q' and s'. It is to be noticed that T_q' and s' can be different from T_q and s, or the same, depending on the configuration provided by an expert or decided via hyper-parameter tuning.
• Scale = {} # set
• Gradient_direction = {} # set
• For t ∈ T_q'
  • For s' ∈ t
    • α = Algorithm_selection_criteria(DW_t', DW', number of rounds)
    • If rand() < α
      • Quantization_Algorithm_1()
    • Else
      • Quantization_Algorithm_2()

where
• Quantization_Algorithm_1() is any quantization algorithm suitable for the condition expected by Algorithm_selection_criteria(.), e.g., binary quantization, codebook quantization, etc.
• Quantization_Algorithm_2() is any quantization algorithm except the one in Quantization_Algorithm_1().
• Algorithm_selection_criteria(.) is a function that determines a good criterion for using a quantization algorithm. Example implementations of such an algorithm can be the nonzero ratio, entropy, relative entropy change, 1-nonzero ratio, number of communication rounds, or a combination of those, etc. According to an embodiment, the criteria can be fixed, or such that the choice of algorithms can be random, e.g., a constant value.
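To make the above steps concrete, the following Python sketch mirrors Steps 1–2 and the stochastic selection between two quantizers for a single round, under several assumptions made only for illustration: structures are non-overlapping row blocks of a 2-D weight-update matrix, F(.) is the L1 magnitude, Element_goodness(.) and the selection criterion are the nonzero ratio, and all function names, thresholds and the binary/ternary quantizers are simplified stand-ins rather than the normative algorithms.

import numpy as np

def l1_importance(block):
    # Importance function F(.): L1 magnitude of the weight updates in a structure.
    return float(np.abs(block).sum())

def sparsify_structures(dw, block_rows=8, percent=0.5):
    # Step 1: zero the least important row-block structures of a weight-update matrix.
    num_blocks = int(np.ceil(dw.shape[0] / block_rows))
    importances = [l1_importance(dw[i * block_rows:(i + 1) * block_rows]) for i in range(num_blocks)]
    out = dw.copy()
    for i in np.argsort(importances)[:int(percent * num_blocks)]:
        out[i * block_rows:(i + 1) * block_rows] = 0.0
    return out

def element_goodness(dw):
    # Element goodness: ratio of nonzero weight updates (one of the listed example measures).
    return np.count_nonzero(dw) / dw.size

def drop_element(dw, threshold=0.05):
    # Step 2: drop the whole topology element if its goodness is below a threshold.
    return np.zeros_like(dw) if element_goodness(dw) < threshold else dw

def binary_quantize(dw):
    # Coarse quantizer: keep only one update direction, signalled with a single scale.
    pos, neg = dw[dw > 0], dw[dw < 0]
    e_p = float(pos.mean()) if pos.size else 0.0
    e_n = float(-neg.mean()) if neg.size else 0.0
    out = np.zeros_like(dw)
    if e_p >= e_n:
        out[dw > 0] = e_p
    else:
        out[dw < 0] = -e_n
    return out

def ternary_quantize(dw):
    # Finer quantizer: separate positive and negative scales for the nonzero updates.
    out = np.zeros_like(dw)
    if np.any(dw > 0):
        out[dw > 0] = float(dw[dw > 0].mean())
    if np.any(dw < 0):
        out[dw < 0] = float(dw[dw < 0].mean())
    return out

def compress_weight_updates(updates, percent=0.5, seed=0):
    # Steps 1-4 for each quantizable topology element; entropy coding (Step 5) is omitted.
    rng = np.random.default_rng(seed)
    result = {}
    for name, dw in updates.items():
        dw = sparsify_structures(dw, percent=percent)   # Step 1: structure sparsification
        dw = drop_element(dw)                           # Step 2: element dropping
        alpha = element_goodness(dw)                    # selection criterion (nonzero ratio)
        result[name] = binary_quantize(dw) if rng.random() < alpha else ternary_quantize(dw)
    return result

# Example: two quantizable elements with random weight updates.
updates = {"conv1": np.random.randn(32, 16).astype(np.float32),
           "fc1": np.random.randn(64, 32).astype(np.float32)}
compressed = compress_weight_updates(updates)

In the actual scheme, the quantizer outputs would additionally be represented as scales, bitmasks and direction codes and passed to the entropy coding step described later.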
Stochastic binary-ternary quantization:
• Given a quantizable topology, T_q' and s';
• Scale = {} # set
• Gradient_change_Bitmask = {} # set
• Gradient_direction = {} # set
• For t ∈ T_q'
  • For s' ∈ t
    • α = Algorithm_selection_criteria(DW_t', DW', number of rounds)
    • If rand() < α [perform coarse approximation: e.g., binary quantization]
      • …
      • If rand() > direction_selection(Ep, En)
        • Scale = Scale ∪ Ep
        • Gradient_change_Bitmask = Gradient_change_Bitmask ∪ bitmask(W_s'+)
        • Gradient_direction = Gradient_direction ∪ [b00]
      • Else
        • Scale = Scale ∪ En
        • Gradient_change_Bitmask = Gradient_change_Bitmask ∪ bitmask(W_s'-)
        • Gradient_direction = Gradient_direction ∪ [b01]
    • Else [perform finer quantization: e.g., ternary quantization]
      • If rand() < γ
        • Sg = nonzero_mean(DW_s')
        • Scale = Scale ∪ Sg
        • Gradient_direction = Gradient_direction ∪ [b10]
      • Else
        • Scale = Scale ∪ Spg ∪ Sng
        • For …
          • if …
            • Gradient_direction = Gradient_direction ∪ [b01]
          • elseif …
            • Gradient_direction = Gradient_direction ∪ [b10]
          • else
            • Gradient_direction = Gradient_direction ∪ [b00]
• Send(Scale, Gradient_direction)

where
• Direction_selection(.,.) is a function that determines which gradient direction is chosen in the coarse binary approach. Examples of such a function can be E_p/(E_p + E_n), a random choice, etc.
• select_granularity(.,.) is a function that allows stochastic selection for sending scales. An example of such a function can be …
• Send(.,.,.) is a communication function that transfers the provided information.

High-level syntax and decoding procedure

The relevant syntax and semantics may consist of several components to reflect how the partitioning happens, what the encoding mechanism is, and what the components of the bitstream could be. The decoding steps with regard to each configuration are also disclosed in this section. The output of the present framework can be divided into header information and payload data. The header information can have the following elements as depicted in the following table:

structure_id: defines the type of the structure configuration that is used for the sparsification and quantization. As an alternative to a structure_id that represents the combined configuration of structures for quantization and sparsification, one could define two structure IDs: one for quantization and one for sparsification, e.g., sparse_structure_id and quantization_structure_id. The structure_id can be determined by some identifiers, e.g., as those defined below;

rle: if set to 1, a run-length encoding may be used for the bitmasks and some of the data structures, and taken into account during the decoding process;

position_encoding: In position encoding, the bitmasks are encoded by the number of zeros or ones that follow each other. When this bit is set, a position encoding mechanism is used.

The output of the framework can be considered as a payload that consists of weight update scales, gradient change and gradient directions. The syntax can have the following structure:

scales: is a list of weight update scales that are computed from nonzero locations of the weight update matrix.
If there is no partitioning or there is a deterministic configuration, there is only one element in the list;

size_flag: is a one-bit flag to allow accommodating various numbers of weight update scales efficiently;

weight_size_flag: is a flag to allow a variable-size num_weight_updates, to enable encoding large bitmasks of large weight updates of huge tensors;

num_weight_updates: the number of elements in the weight update tensor;

bitmask_value: indicates if there is a valid weight update and what the direction of the weight update could be. Thus, each weight update is encoded with 2 bits, wherein examples of bitmask values can be as presented below:

Alternatively, instead of a 2-bit bitmask_value, the syntax may comprise the following:

The semantics may be specified as follows:

Alternatively, one may first write one bitmask indicating the existence of a weight mask and one bitmask to indicate the direction. If there is a deterministic algorithm that allows only one direction of gradients, e.g., binary quantization, only one bitmask of one bit may be provided. In that case, the relevant syntax part can be as shown below:

weight_update_bitmask_value: corresponds to where there has been a change in the weight updates which is nonzero;

weight_update_direction_bitmask_value: indicates the direction of an existing weight update and is interpreted in conjunction with the weight_update_bitmask_value during decoding. weight_update_direction_bitmask_value has the value zero if there is no weight update or the direction is positive, and weight_update_direction_bitmask_value has the value one if it has the negative direction;

byte_alignment(): adjusts the length of the bitstream to be byte aligned, if required.

The embodiment on run-length encoding of the bitmasks is one way of encoding the information. In such a case, the bitstream may contain an RLE-encoded bitmask, wherein an example definition of rle_encoded_data() can be as follows:

count_runs: is the number of runs that exist in the data. It may have a 7-bit representation, and allow the partitioning of the runs into multiple ones if the size exceeds this limit. Alternatively, it is possible to allow larger size descriptors or a flagged mechanism to increase the number of runs, e.g., by introducing a size_run_flag and revising the definition as follows:

run_value: is the value that is repeated several times; in this case the value is 0 or 1;

run_length: is the number of times a value is repeated.

Alternatively, the count of consecutive zeros between each pair of bits equal to 1 is coded. The following syntax may be used:

run_length: is the number of times the value of 0 is repeated before the next value of 1.

In position encoding, according to another embodiment, the data can be encoded using a nonzero encoding location. In such an embodiment, the position of the first nonzero is encoded, and the number of zeros that follow each nonzero from then onward is indicated. For example, for the bitmask [0001000011010000], the encoding [3, 4, 0, 1, 4] is obtained. Depending on the mode of operation, the position encoding can be such that the process encodes the number of nonzeros that follow a zero, depending on the statistics of the weight update. In such a case, a single flag may be used to indicate the order. An example of such position-based encoding can be depicted as follows:

order_flag: if the value is 0, the number of zeros following a nonzero is encoded; and if the value is 1, the number of nonzeros following a zero is encoded;

size_flag: if set, a 16-bit representation for the list values is used;

len_encoded_list: the number of elements in the encoded list of positions;

list_value: is the position of zeros or nonzeros based on order_flag.

The encoding procedure for obtaining the list_value may have the following steps:

idx = 0
for (i = 0; i < len_bitmask; i++) {
    count = 0
    // Count the run of values that differ from (1 - order_flag); stop at the end of the bitmask.
    while (i < len_bitmask && bitmask[i] != (1 - order_flag)) {
        count++
        i++
    }
    encoded_bitmask[idx] = count
    idx++
}

According to another embodiment, it is possible that the position-encoded bitmask is further compressed via some other encoding mechanism such as CABAC, e.g., when the position-encoded bitmask consists of a list of integers depicting positions, and this list is further compressed by some means of compression like CABAC, for example. It needs to be understood that, instead of u(n), i.e., an unsigned integer using n bits, any other representation could be used for the syntax elements in the embodiments above, such as an unsigned integer Exp-Golomb coded syntax element or an arithmetic-coded syntax element.

Decoding procedure: At the decoder side, the decoder is configured to read the partition_id first to determine the level of topology element and the structures used during the quantization. It then parses the rest of the data and assigns the values according to the structures and topologies to the NN. For example, if a channel-wise structure is used, the scales are read and assigned to the correct channel in the correct topology element with respect to the order in which they were parsed during the encoding procedure. The encoded bitmasks are decoded according to their format. In the plain bitmask case, the encoding is straightforward and the values are parsed and assigned to the correct weight update on a row-major basis. For the RLE-encoded data, the RLE is first decompressed into a normal bitmask, and later treated as in the plain bitmask case. A similar working principle may hold for the position-encoded case: the position is read, and a zero or a one is put based on the order_flag value. Then one can pad with the required value until the next position is reached. Once the original bitmask has been reconstructed, it can be treated as the simple bitmask case.

As additional embodiments, the following are provided:
• Quantized models as input: The proposed scheme can be applied to quantized models. That is, the input model is quantized and pruning happens on the quantized weight updates. For such a case, the quantization step can be optional depending on the bandwidth requirements and the amount of update bit allocation.
• Loss functions: The input can be a model that is prepared with loss functions to promote sparsity. Or the method could be used in conjunction with a loss function that promotes sparsity of the weight updates.
• Random drop: It is possible that the weight update dropping in structure sparsification happens completely randomly. In such a case, the importance function F(.) can assign a random importance value to weight updates.
• Different structures for different steps of the algorithm: According to an embodiment, the structures that are used in each step can be different, e.g., there is a different structure for quantization and a different structure for weight update dropping.
As an example, a structure to promote channel-wise data sparsity can be used for quantization, and a block-wise structure can be used for weight update dropping.
• Other quantization algorithms: According to an embodiment, the binary or ternary quantization, or both, can be replaced with another quantization approach, e.g., a codebook-based quantization technique, a lattice-based quantization technique, or a training-based quantization.
• Embodiment on compression of bitmasks using RLE: According to yet another embodiment, the bitmasks can be further compressed using run-length encoding.
• Embodiment on position encoding of the bitmasks: According to yet another embodiment, the bitmasks may be compressed using position encoding. In such an encoding, only the locations of zeros or ones are communicated. The encoding can be further followed by another level of compression, e.g., running CABAC.

The method according to an embodiment is shown in Figure 3. The method generally comprises, for a neural network topology comprising topology elements having weight updates, determining 310 importance of a topology element and performing weight update dropping of topology elements according to the determined importance; selecting 320 one or more quantization methods with different quantizers to be used alternatively; quantizing 330 the existing weight updates according to the selected quantization method; and coding 340 the quantized weight updates. Each of the steps can be implemented by a respective module of a computer system.

An apparatus, according to an embodiment, for weight update compression in a neural network, wherein the neural network topology comprises topology elements having weight updates, comprises means for determining importance of a topology element and performing weight update dropping of topology elements according to the determined importance; means for selecting one or more quantization methods with different quantizers to be used alternatively; means for quantizing the existing weight updates according to the selected quantization method; and means for coding the quantized weight updates. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 3 according to various embodiments.

An apparatus according to an embodiment is illustrated in Figure 4. The apparatus is a user equipment for the purposes of the present embodiments. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94, and a communication interface 93. The apparatus according to an embodiment, shown in Figure 4, may also comprise a camera module 95. Alternatively, the apparatus may be configured to receive image and/or video data from an external camera device over a communication network. The memory 92 stores data, including computer program code, in the apparatus 90. The computer program code is configured to implement the method according to various embodiments by means of various computer modules. The camera module 95 or the communication interface 93 receives data, in the form of images or a video stream, to be processed by the processor 91. The communication interface 93 forwards processed data, i.e., the image file, for example to a display of another device, such as a virtual reality headset.
When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface. The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with others. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined. Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.