

Title:
METHOD, APPARATUS AND COMPUTER PROGRAM FOR TRAINING DEEP NETWORKS
Document Type and Number:
WIPO Patent Application WO/2020/239824
Kind Code:
A1
Abstract:
A method for training deep networks comprises calculating (201) an intermediate weight based on an existing quantized weight and an existing quantized residual parameter, and quantizing (202) the intermediate weight to derive an updated weight. The method further comprises generating (203) an updated residual parameter using the intermediate weight and the updated weight.

Inventors:
LOROCH DOMINIK MAREK (DE)
Application Number:
PCT/EP2020/064677
Publication Date:
December 03, 2020
Filing Date:
May 27, 2020
Assignee:
FRAUNHOFER GES FORSCHUNG (DE)
International Classes:
G06N3/063; G06N3/04; G06N3/08
Foreign References:
US 20180211166 A1, 2018-07-26
Other References:
H. LI ET AL.: "Training quantized nets: A deeper understanding", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017, pages 5811 - 5821
A. KRIZHEVSKY ET AL.: "Imagenet classification with deep convolutional neural networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2012, pages 1097 - 1105, XP055309176
A. KRIZHEVSKY; G. HINTON: "Learning multiple layers of features from tiny images", TECH. REP., 2009
J. DENG ET AL.: "ImageNet: A Large-Scale Hierarchical Image Database", CVPR09, 2009
S. ZHOU ET AL.: "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients", ARXIV, 2016
B. RECHT ET AL.: "Hogwild: A lock-free approach to parallelizing stochastic gradient descent", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2011, pages 693 - 701
Attorney, Agent or Firm:
2SPL PATENTANWÄLTE PARTG MBB (DE)
Claims:

What is claimed is:

1. A method (200) for training deep networks, comprising:

calculating (201) an intermediate weight based on an existing quantized weight and an existing quantized residual parameter;

quantizing (202) the intermediate weight to derive an updated weight; and generating (203) an updated residual parameter using the intermediate weight and the updated weight.

2. The method of claim 1, wherein the intermediate weight is quantized using logarithmic quantization.

3. The method of claims 1 and 2, further comprising:

setting updated weights below a threshold to zero.

4. The method of any one of claims 1 to 3, wherein generating (203) an updated residual parameter comprises:

scaling an intermediate residual parameter with the updated weight.

5. The method of claim 4, wherein scaling the intermediate residual parameter comprises performing a bit shift operation on the intermediate residual.

6. The method of claims 4 and 5, wherein generating (203) an updated residual parameter further comprises:

quantizing the scaled intermediate residual parameter to derive the updated residual parameter.

7. The method of claim 6, comprising:

using fixed point quantization to derive the updated residual parameter.

8. A processing circuit (300) for training deep networks, configured to:

calculate an intermediate weight based on an existing quantized weight and an existing quantized residual parameter;

quantize the intermediate weight to derive an updated weight; and

generate an updated residual parameter using the intermediate weight and the updated weight.

9. The processing circuit (300) of claim 8, further configured to:

set updated weights below a threshold to zero.

10. The processing circuit (300) of claims 8 or 9, further configured to:

scale an intermediate residual parameter with the updated weight.

11. The processing circuit (300) of claim 10, further configured to perform a bit shift operation on the intermediate residual parameter for scaling the intermediate residual parameter.

12. The processing circuit (300) of claims 10 or 11, further configured to:

quantize the scaled intermediate residual parameter to derive the residual parameter.

13. The processing circuit (300) of one of claims 9 to 12, configured to:

use fixed point quantization to derive the quantized updated residual parameter.

14. A computer program having program code for, when executed by a programmable processor, performing the method of one of claims 1 to 7.

Description:
Method, apparatus and computer program for training deep networks

Field

Examples relate to deep learning of neural networks.

Background

A quantized deep neural network (DNN) is a model of a deep neural network which uses quantization on one or both of its weights and activations. One method may be to store scalar values in a lower precision numerical format, which may require less memory (measured, e.g., in bits per stored value). Further, computation of a DNN's output may be accelerated when using quantized parameters in a lower precision format.

Quantization in DNNs has become an interesting topic for embedded systems as well as for high performance computing (HPC). Quantization is most often used to reduce the memory size of the weights, activations and gradients with different numerical formats, such as fixed point numbers, power of 2, ternary, binary and mixtures of different numerical precisions. A lot of work in this field considers how quantization can improve the design of hardware accelerator architectures for DNNs, reducing the computational complexity.

Some efforts have also been made to improve quantized DNN training, which concentrate on techniques that may be applied to the topology to improve the evaluation score of a quantized DNN. Some approaches switch between a low precision 'compression' phase and a high precision 'learning' phase during training. Nonetheless, optimization schemes for quantized training of DNNs rely on a high precision representation of the weights and, in many cases, of higher-order momentum terms introduced by update schemes.

In summary, quantization is conventionally perceived as a post-training step to reduce the model complexity and to make execution more efficient, while training of the DNN is performed using unquantized values, hence consuming a lot of memory and computational power. Therefore, there is a need for a deep learning method that provides a highly accurate trained network while consuming reasonable processing power.

Summary

An example relates to a method for training deep networks comprising the steps of calculating an intermediate weight based on an existing quantized weight and an existing quantized residual parameter, quantizing the intermediate weight to derive an updated weight, and generating an updated residual parameter using the intermediate weight and the updated weight.

By using a residual parameter that holds information on the error caused by quantizing a weight, deep learning may be performed with high accuracy while directly using quantized weights, which results in reduced memory requirements, accelerated computation and, consequently, accelerated learning.

In another example, the intermediate weight is quantized using logarithmic quantization.

Using quantization may enable representing weight values and residual values with a lower number of bits, while logarithmic quantization may enable an efficient hardware implementation of subsequent computation steps.

A further example further comprises the step of applying a thresholding mechanism after quantizing the intermediate weight. Using a threshold or a thresholding function (th_Δ) may, for example, serve to disregard values below the threshold, reducing the overall number of bits used for storing the updated weight. For the same reason, less processing power may be needed for processing the weights.

Another example further comprises scaling an intermediate residual parameter with the updated weight. By scaling an intermediate residual parameter, which may hold the residual in a high accuracy numerical format, with the updated weight, the resulting parameter may hold the information on the error in an efficient way at appropriate accuracy. More specifically, the same error may be stored with a lower number of bits.

A further example comprises the step of performing a bit shift operation on the intermediate residual parameter for scaling the intermediate residual parameter with the updated weight.

Bit shift operations for scaling are computationally inexpensive and can be performed directly in hardware; hence, less processing power may be needed compared to performing multiplications or divisions for scaling.

An additional example further comprises the step of quantizing the scaled intermediate parameter to derive the updated residual parameter, which may result in a further reduction of the memory requirements and further acceleration of computations.

An optional example further comprises the step of using fixed point quantization to derive the updated residual parameter.

Fixed point quantization may lead to a quantized updated residual parameter with a fixed bit number (bit width), which may result in a low quantization error, further enabling efficient hardware implementations. The quantization error may be dependent on the word width and may decrease with an increasing word width.

Another example relates to a processing circuit for training deep networks being configured to calculate an intermediate weight based on an existing quantized weight and an existing quantized residual parameter, quantize the intermediate weight to derive an updated weight, and generate an updated residual parameter using the intermediate weight and the updated weight.

An example is further configured to set updated weights below a threshold to zero.

A further example is further configured to scale an intermediate residual parameter with the updated weight. Another example is further configured to perform a bit shift operation on the updated residual parameter for scaling the updated residual parameter with the updated weight.

An optional example is further configured to quantize the scaled intermediate residual parameter to derive the updated residual parameter.

An additional example is further configured to use fixed point quantization to derive the quantized updated residual parameter.

Another example relates to a computer program having program code for, when executed by a programmable processor, performing the method of training deep networks.

Brief description of the Figures

Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which

Fig. 1 illustrates an exemplary schematic model of a deep neural network (DNN),

Fig. 2 shows a flowchart illustrating an embodiment of a method for training deep networks,

Fig. 3 illustrates an embodiment of a processing circuit for training deep networks,

Fig. 4 shows results of the application of an embodiment on a convex function of two parameters,

Fig. 5 illustrates the validation accuracy and a histogram of the final weights for training Cifarnet with an embodiment, and

Fig. 6 illustrates a loss landscape plotted along the principal components.

Detailed Description

Various examples will now be described more fully with reference to the accompanying drawings in which some examples are illustrated. In the figures, the thicknesses of lines, layers and/or regions may be exaggerated for clarity.

Accordingly, while further examples are capable of various modifications and alternative forms, some particular examples thereof are shown in the figures and will subsequently be described in detail. However, this detailed description does not limit further examples to the particular forms described. Further examples may cover all modifications, equivalents, and alternatives falling within the scope of the disclosure. Same or like numbers refer to like or similar elements throughout the description of the figures, which may be implemented identically or in modified form when compared to one another while providing for the same or a similar functionality.

It will be understood that when an element is referred to as being "connected" or "coupled" to another element, the elements may be directly connected or coupled via one or more intervening elements. If two elements A and B are combined using an "or", this is to be understood to disclose all possible combinations, i.e. only A, only B as well as A and B, if not explicitly or implicitly defined otherwise. An alternative wording for the same combinations is "at least one of A and B" or "A and/or B". The same applies, mutatis mutandis, for combinations of more than two elements.

The terminology used herein for the purpose of describing particular examples is not intended to be limiting for further examples. Whenever a singular form such as "a," "an" and "the" is used and using only a single element is neither explicitly nor implicitly defined as being mandatory, further examples may also use plural elements to implement the same functionality. Likewise, when a functionality is subsequently described as being implemented using multiple elements, further examples may implement the same functionality using a single element or processing entity. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used, specify the presence of the stated features, integers, steps, operations, processes, acts, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, processes, acts, elements, components and/or any group thereof. Unless otherwise defined, all terms (including technical and scientific terms) are used herein in their ordinary meaning of the art to which the examples belong.

Fig. 1 illustrates an exemplary topology of a Deep Neural Network 100, in particular an example of a deep neural network with an input layer 110, two hidden layers 120 and an output layer 130. The first layer is the input data, and the output layer predicts two response variables. The last node in each input layer (+1) represents a bias term.

Just as an example for a deep neural network, Fig. 1 illustrates a simple form of a fully connected (FC) network where each node of a layer is connected to each node of the preceding layers.

In a simple implementation, each node in the FC network may be defined to be an artificial neuron or perceptron, which receives input and changes its output (activation) according to the input and, optionally, internal parameters (weights). The output of a neuron of a preceding layer is typically multiplied with a weight before it serves as an input (net input) to a neuron of the subsequent layer. The weights and other parameters of a DNN can be found by training the network.

DNNs used in practice, however, do deviate significantly from the simple example in Fig. 1. For example, they are generally not fully connected, i.e. not every node of a layer is connected to each node of the preceding layer. Further, some transformations of nodes within the network may not depend on a trained parameter but may be static instead. Likewise, the topology may exhibit multiple variations, e.g. branches may be formed that do not reconnect to subsequent layers at all. Further, some layers may consist of only a single node while subsequent layers may again have multiple nodes.

In more general terms, Deep Neural Networks (DNNs) are networks of processing layers, which transform a tensor of input data into a sequence of hidden, intermediate representations of the input information to arrive at a resulting tensor of the output layer, which represents an estimation or prediction on the input. This way, DNNs represent universal function approximators, which can in principle fit any input distribution to any output distribution. In general, a layer represents any transformation from input to output. There are, for example, layers that represent a tensor-tensor multiplication (in the sense of Einstein notation: O_ij = A_ik · B_kj = Σ_k A_ik B_kj). A (not fully general) example is the convolution operation, where O may be the output (batch, output height, output width, output channels), I the input (batch, input height, input width, input channels) and W the filters/weights (filter height, filter width, input channel, output channel). In fact, this tensor-tensor multiplication layer type is a common case in DNNs.

Activation functions in the nodes of the hidden layers 120 of the DNN are examples for nonlinear, mostly parameter-free transformations. A modern example may be the rectifying linear unit (ReLU): f(x) = max(0, x ), where x is the output of a neuron. This ReLU function is also known as a ramp function.

While some nodes of the DNN do not use parameters to be trained, others do. In particular, some nodes use weights that are to be trained before the DNN is used.

If the desired output of a deep neural network given an input is known upfront for some input samples, the deep neural network may be trained by updating the weights in order to minimize the output error. The output known upfront may be called target or label. For minimizing the error efficiently, the weights may be modified or updated multiple times in an iterative algorithm. This type of learning is generally known as supervised learning.

Neural networks may, for example, be trained by stochastic gradient descent (SGD). During SGD, the parameters of each layer, for example the weights associated with a neuron and its threshold, are iteratively altered (using a cost or loss function C) until the training is finished. The loss function is an approximation of the quality of the quantity to be optimized, e.g. the accuracy in the case of image classification. Training may be based on the error between the expected output and the resulting output (activation) of the final layer. The parameters trained in this way are used during the inference phase, after the training phase.

When training a neural network, the gradient of a cost function C may be calculated with respect to the weights, dC/dW. This calculation may be performed by using the chain rule to deduce the gradient of interest (dC/dW = dC/dO · dO/dW) from the output O of a layer with respect to the weights.

The following algorithm describes a single training step to train a DNN for the general case:

{Compute the forward pass}
1:  for k = 1 to L do
2:      a_k ← forward_k(a_{k-1}, w_k)
3:  end for
{Compute the backward pass with previously computed a_k}
4:  g_{a_L} ← ∂C(a_L, a*)/∂a_L
5:  for k = L to 1 do
6:      g_{a_{k-1}} ← backward_input_k(g_{a_k}, w_k)
7:      g_{w_k} ← backward_weight_k(g_{a_k}, a_{k-1})
8:  end for
{Apply update based on computed gradients g_{w_k}}
9:  for k = 1 to L do
10:     w_k ← update(w_k, g_{w_k}, α_t)
11: end for

Training a DNN with L layers may, amongst others, use the function forward_k for the k-th layer, weights w_k, activations a_k and gradients g_k. The function forward_k is an abbreviation for the operation taking place when moving from layer k-1 to layer k, which uses the activations a_{k-1} and the weights w_k. C is a cost function used to estimate the quality of an output of the DNN and, optionally, of layers within the DNN. A training step may further require information on the inputs a_0, targets/labels a*, the previous weights w^t and a learning rate α_t.

Lines 1 to 3 of the previous algorithm describe the first phase of a training step. The first phase is to compute a forward pass from layer 1 to layer L, which corresponds to the forward propagation through the network. Forward propagation, in general, may be the process of obtaining the neural network output, layer by layer. By use of the forward function, an activation a_k of layer k may be obtained from the activation a_{k-1} of the previous layer k-1 and a weight w_k of layer k.

The second phase (lines 4 to 8 of the pseudo code) is to compute the backward pass from layer L to layer 1 using the previously computed activations a_k. The backward pass may correspond to backpropagation. Backpropagation is about determining how changing the weights may impact the overall cost of the perceptron or neural network. A gradient g_{a_L} may be determined based on the partial derivative of the cost function with respect to the activation a_L of layer L. The gradient g_{a_{k-1}} of the previous layer k-1 may be determined based on the gradient g_{a_k} of layer k and the weight w_k thereof.

The third phase of a training cycle is given in lines 9 to 11 and updates the weights based on the computed gradients g_{w_k} for the layers 1 to L. An updated weight may be determined based on the weight w_k, the gradient g_{w_k} thereof and the learning rate α_t.
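For illustration, the three phases may be sketched in Python/NumPy for a small stack of fully connected ReLU layers with a squared-error cost; the layer type, the cost function and the helper names (training_step, relu) are assumptions made only for this sketch and are not prescribed by the method described above.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def training_step(ws, a0, target, lr):
    # One training step for a stack of fully connected ReLU layers (illustrative sketch).
    # ws: list of weight matrices w_k, a0: input batch, target: desired output, lr: learning rate alpha_t.
    L = len(ws)
    # Phase 1: forward pass (lines 1-3)
    acts = [a0]
    for k in range(L):
        acts.append(relu(acts[k] @ ws[k]))
    # Phase 2: backward pass (lines 4-8), assuming the cost C = 0.5 * ||a_L - target||^2
    g_a = acts[L] - target                   # dC/da_L
    g_w = [None] * L
    for k in reversed(range(L)):
        g_pre = g_a * (acts[k + 1] > 0)      # backprop through the ReLU nonlinearity
        g_w[k] = acts[k].T @ g_pre           # backward_weight_k
        g_a = g_pre @ ws[k].T                # backward_input_k
    # Phase 3: weight update (lines 9-11)
    for k in range(L):
        ws[k] -= lr * g_w[k]
    return ws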

The previously described steps describe the usual procedure for training deep neural networks. The following pseudo code is a modification of the previous one which uses quantized parameters instead of full precision parameters to some extent. In particular, the weights w_k, activations a_k and gradients g_k are quantized using quantization functions q_w, q_a and q_g, respectively.

{Compute the forward pass}
1:  for k = 1 to L do
2:      w_k^q ← q_w(w_k)
3:      a_k ← forward_k(a_{k-1}, w_k^q)
4:      optional: a_k ← q_a(a_k)
5:  end for
{Compute the backward pass with previously computed a_k}
6:  g_{a_L} ← ∂C(a_L, a*)/∂a_L
7:  for k = L to 1 do
8:      optional: g_{a_k} ← q_g(g_{a_k})
9:      g_{a_{k-1}} ← backward_input_k(g_{a_k}, w_k^q)
10:     g_{w_k^q} ← backward_weight_k(g_{a_k}, a_{k-1})
11: end for
{Apply update based on computed gradients g_{w_k^q}}
12: for k = 1 to L do
13:     optional: g_{w_k^q} ← q_g(g_{w_k^q})
14:     w_k ← update(w_k, g_{w_k^q}, α_t)
15: end for

While quantized parameters are used in the modified algorithm given by the previous pseudo code, a full precision copy of the weights is still required in the learning phase of lines 12 to 15. The weights are only quantized in every iteration for the forward and backward pass.
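The update of lines 12 to 15 may be sketched as follows; q_w is only a placeholder quantizer and the variable names are illustrative, the point being that the full precision (latent) copy w_fp must persist between iterations.

import numpy as np

def q_w(w, f=8):
    # Placeholder fixed point weight quantizer with f fractional bits (illustration only).
    return np.round(w * 2.0 ** f) / 2.0 ** f

def latent_update(w_fp, grad, lr):
    # Lines 12-15: the update is applied to the full precision (latent) copy w_fp ...
    w_fp = w_fp - lr * grad
    # ... and the quantized weights are re-derived from it for the next forward/backward pass.
    w_q = q_w(w_fp)
    return w_fp, w_q

# The latent copy w_fp must be stored between iterations in addition to w_q.
w_fp = np.random.randn(64, 10).astype(np.float32)
w_fp, w_q = latent_update(w_fp, grad=np.zeros_like(w_fp), lr=0.1)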

In summary, conventional learning requires high accuracy weights and, hence, costly computations, which may be avoided using an embodiment as illustrated by means of Fig. 2. The embodiments described herein improve the learning phase, i.e. they refer to lines 12 to 15 of the previous code. The embodiments allow the use of quantized parameters already in the training phase.

Fig. 2 illustrates a flowchart of a method 200 for training deep networks comprising the steps of calculating 201 an intermediate weight based on an existing quantized weight and an existing quantized residual parameter, quantizing 202 the intermediate weight to derive an updated weight, and generating 203 an updated residual parameter using the intermediate weight and the updated weight.

In the following, a particular example as to how quantized parameters can be used in deep learning will be given.

The following practical implementation is based on the first two phases of the previously described method and modifies the update of phase 3, which now uses a quantized weight. In the following, the index t denotes the number of the iteration within the learning process, and the following considerations are to be understood to refer to the same layer k within a DNN. The method for training deep networks comprises calculating the intermediate weight ŵ_{t+1} based on the existing quantized weight w_t and an existing quantized residual parameter r_t. The existing quantized residual parameter r_t may be a low precision fixed point variable, which keeps track of the quantization error. The method for training deep networks further comprises quantizing the intermediate weight ŵ_{t+1} to derive the updated weight w_{t+1}. The updated weight, hence, is a quantized quantity.

Let f(x_t, w_t) be the cost function, which in this case may be the loss of the DNN. x_t may be an input sample and w_t is the quantized weight of the network at iteration t. One particular way to compute the updated weight w_{t+1} is given below.

ŵ_{t+1} = w_t + r_t - α_t ∇f(x_t, w_t)   (1)

w_{t+1} = th_Δ(q_w(ŵ_{t+1}))   (2)

First, the intermediate weight ŵ_{t+1} is calculated based on an existing quantized weight w_t and on an existing quantized residual parameter r_t. Then, the intermediate weight ŵ_{t+1} is quantized using the quantization function q_w to derive the updated weight w_{t+1}. The updated weight used in the update process, hence, is a quantized quantity. Equation (2) further includes an optional thresholding mechanism th_Δ for setting updated weights below a threshold to zero, which may serve to further reduce computational load by disregarding insignificant weights.

Further, the method comprises generating an updated residual parameter r_{t+1} using the intermediate weight ŵ_{t+1} and the updated weight w_{t+1}.

r_{t+1} = q_r((ŵ_{t+1} - w_{t+1}) / w_{t+1}) if w_{t+1} ≠ 0,
r_{t+1} = 0 else.   (3)

Equation (3) further implements an additional quantization q_r of an intermediate residual parameter (ŵ_{t+1} - w_{t+1}) to derive the updated residual parameter. The intermediate residual parameter (ŵ_{t+1} - w_{t+1}) may be scaled by the updated weight w_{t+1} as in equation (3), or it may be used unscaled. Alternatively, the intermediate residual parameter may likewise be used unquantized. Irrespective of the particular choice, q_w and q_r may be two different quantization functions for the weights and the residual parameter r_t, which may be seen as an auxiliary variable. The residual may have the same number of elements as the weight vector. In some embodiments, ∇f(x_t, w_t) may be replaced with a stochastic approximation as in SGD.
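A minimal Python/NumPy sketch of one update following equations (1) to (3) could look as follows; the helper names (log_quantize, threshold, q_r, resopt_update) and the default bit widths are illustrative assumptions, and the unscaled residual variant is shown for simplicity.

import numpy as np

def log_quantize(x):
    # Quantize to signed powers of two (logarithmic quantization, sketch).
    out = np.zeros_like(x)
    nz = x != 0
    out[nz] = np.sign(x[nz]) * 2.0 ** np.floor(np.log2(np.abs(x[nz])))
    return out

def threshold(x, delta):
    # th_Delta: set values with magnitude below delta to zero.
    return np.where(np.abs(x) < delta, 0.0, x)

def q_r(x, f=15):
    # Residual quantizer: nearest rounding to a grid of precision 2**-f (sketch).
    delta = 2.0 ** -f
    return np.round(x / delta) * delta

def resopt_update(w_q, r, grad, lr, delta=2.0 ** -15):
    # One update step (unscaled residual variant) following equations (1)-(3).
    w_tilde = w_q + r - lr * grad                        # (1) intermediate weight
    w_new = threshold(log_quantize(w_tilde), delta)      # (2) updated quantized weight
    r_new = np.where(w_new != 0.0, q_r(w_tilde - w_new), 0.0)   # (3) updated residual
    # In the scaled variant, (w_tilde - w_new) would additionally be divided by the
    # non-zero updated weight (a bit shift for power-of-two weights) before quantization.
    return w_new, r_new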

In principle, one may choose any quantization method for q_w. It may especially be beneficial to quantize weights to a logarithmic scale. The logarithmic quantization may be defined as

q_w(x) = sign(x) · 2^(⌊log2|x|⌋) for x ≠ 0, and q_w(0) = 0.   (4)

The hardware realization of this quantization scheme may be to mask away all bits below the most significant bit (negative numbers may require a slightly modified treatment). In contrast to the formal description of q_w in equation (4), the hardware realization of such masking is very simple. According to some embodiments, therefore, the intermediate weight is quantized using logarithmic quantization.
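The masking may be illustrated for a non-negative weight stored as a fixed point integer; the helper msb_mask and the concrete numbers are purely illustrative assumptions.

def msb_mask(x_int):
    # Keep only the most significant set bit of a non-negative integer,
    # i.e. quantize down to the nearest power of two (illustration only).
    if x_int <= 0:
        return 0
    return 1 << (x_int.bit_length() - 1)

# e.g. a weight of 0.37 stored as a 16-bit fixed point integer with 15 fractional bits:
raw = int(0.37 * 2 ** 15)         # 12124
assert msb_mask(raw) == 2 ** 13   # 8192, i.e. 2**13 * 2**-15 = 0.25, the power of two just below 0.37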

To deal with the case that the logarithmically quantized weights may be very close to zero, an additional thresholding mechanism may optionally be applied in combination with quantization, which may be described as th_Δ(x) = x if |x| ≥ Δ and th_Δ(x) = 0 otherwise, where Δ may be the precision of the residual rounding in equation (11). Ignoring small values may have the benefit that the representation of the quantized weights may be reduced to very few bits, enforcing a bound on the weights. In other words, embodiments may optionally further comprise applying a thresholding mechanism after quantizing the intermediate weight to reduce the number of bits consumed.

As illustrated by means of equation (3), the residual may be a quantized variable, too, and requires less memory than a high precision copy of the weights. The embodiments disclosed herein may be called residual optimization (ResOpt) after that auxiliary variable. The residuals may have some peculiar properties, which may allow reducing them to a low precision numerical format. For example, fixed point quantization may optionally be used for the residuals, which is parametrized by the word width w and the fractional width f. In simple terms, x may be an integer in two's complement, whereas the last f bits may be interpreted as having weight 2^(-i), where i may be the position after the fractional point. The smallest increment may be called the precision Δ = 2^(-f) of the fixed point number. Also, the number may be truncated to the maximum and minimum representable value, so there may be no overflow. The used rounding scheme may be nearest rounding. In other words, the updated residual parameter may be quantized so that a lower number of bits is used to represent it.
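Under this parametrization, a software model of such a residual quantizer may look as follows; the function name and the exact saturation bounds are illustrative assumptions.

import numpy as np

def fixed_point_quantize(x, w=16, f=15):
    # Fixed point quantization with word width w and fractional width f:
    # nearest rounding to a grid of precision delta = 2**-f, saturated to the
    # range representable by a w-bit two's complement integer (sketch).
    delta = 2.0 ** -f
    lo = -(2 ** (w - 1)) * delta        # most negative representable value
    hi = (2 ** (w - 1) - 1) * delta     # most positive representable value
    return np.clip(np.round(x / delta) * delta, lo, hi)

# With w = 16 and f = 15 the grid covers [-1, 1 - 2**-15] in steps of 2**-15,
# which is compatible with a residual bound by [0, 1).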

As indicated in equation (3), the method optionally further comprises the step of scaling the intermediate residual parameter (ŵ_{t+1} - w_{t+1}) with the updated weight w_{t+1}.

By scaling the intermediate residual parameter with the updated weight, the scaled intermediate residual parameter may store an error in an efficient way by normalizing it by the updated weight. In a further example, the method comprises the step of performing a bit shift operation on the updated residual parameter for scaling the updated residual parameter with the updated weight. Performing bit shift operations for scaling may be computationally inexpensive, and less processing power may be needed compared to multiplication operations.
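For instance, with the residual held in a fixed point format and the weight being a power of two, the scaling may reduce to a shift of the underlying integer, as the following toy example (with arbitrary numbers) suggests.

# Scaling by a power-of-two weight w = 2**k amounts to a bit shift of the
# fixed point integer representation (hypothetical 16-bit residual, f = 15):
f = 15
residual_int = int(0.375 * 2 ** f)   # fixed point representation of 0.375
k = -2                               # weight w = 2**-2 = 0.25
# dividing by w = 2**k is a left shift by -k bits (here: by 2 bits)
scaled_int = residual_int << (-k)
assert scaled_int / 2 ** f == 0.375 / 0.25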

Fig. 3 describes a processing circuit 300 for training deep networks being configured to calculate an intermediate weight based on an existing quantized weight and an existing quantized residual parameter, quantize the intermediate weight to derive an updated weight, and generate an updated residual parameter using the intermediate weight and the updated weight.

An example is further configured to apply a thresholding mechanism after quantizing an intermediate weight to reduce bits of a value thereof.

A further example is further configured to scale the updated residual parameter with the updated weight. Another example is further configured to perform a bit shift operation on the updated residual parameter for scaling the updated residual parameter with the updated weight.

An optional example is further configured to quantize the updated residual parameter to derive a quantized updated residual parameter.

An additional example is further configured to use fixed point quantization to derive the quantized updated residual parameter.

Another example relates to a computer program having program code for, when executed by a programmable processor, performing the method of training deep networks.

In other words, using deep neural networks (DNN) with quantized variables may make deep learning more efficient with respect to memory usage, computational speed and energy efficiency. Training DNNs with quantized variables may result in models with the same accuracy as their unquantized counterparts. Modern hardware accelerators may be harnessing this potential by enabling calculations in low precision formats. However, most of the low precision hardware acceleration may target the inference phase, which may numerically be more robust to low precision computations than the training phase. So far, only few approaches aim at enabling low precision during training. Existing methods may suffer from the need to keep at least one high precision copy of the DNN weights during gradient updates. In contrast, the residual optimization (ResOpt) scheme may facilitate training with fully quantized weights. It may also replace the multiplication with weights by bit shifts, leading to a major speed-up during training and inference. Thus, new insights may be provided on the nature of training with quantized variables, which may open up a new perspective on how to design hardware accelerators for deep neural networks.

A corresponding update scheme may not need a high precision representation of the weights to train a DNN. The key may be to use a low precision fixed point variable called the residual, which may be bound by [0, 1) and which may keep track of the quantization error. After the training, the residual may be discarded without any changes to the model. Additionally, it may use weights which are powers of 2, which may be beneficial for the hardware design of custom accelerators. The key point in this regard may be the use of a novel, bit-shift-based optimizer scheme for DNNs with logarithmic scale weights, which does not hold on to a set of latent, high precision weights during training.

The following paragraphs give an analysis of the convergence behavior of the suggested method and some insights to the loss landscape of DNNs with quantized variables.

The convergence analysis is based on (H. Li et al. “Training quantized nets: A deeper understanding,” in Advances in Neural Information Processing Systems, pp. 5811-5821, 2017), but takes the composition of two previously introduced quantization schemes into consideration.

Theorem 1: Suppose F(w) = Σ_i f_i(w) is a sum of convex, L-Lipschitz smooth functions with L2-Lipschitz smooth Hessians ∇²f_i, the domain has finite diameter D, the variance of the gradient is bounded, i.e. E‖∇f_i(·)‖² ≤ G², the coordinate values of the weights do not exceed w_max, the quantization step Δ for the residuals is a power of 2, and the learning rates are given by α_t. Then an upper bound on E[F(w̄(T))] - F(w*) holds for the ergodic estimate w̄(T) at time T generated by the unscaled ResOpt algorithm, where w* = argmin_w F(w) and d is the dimension of w.

A simple example of a convex function of two parameters may explain some peculiarities in quantized training methods, which may be seen in DNN training space as well. Fig. 4 shows two examples of poorly conditioned, convex functions of two parameters (401, 402). The ResOpt method may be used to find the optimum of these functions. The weights may be quantized logarithmically in ResOpt, which may be visualized by the grid lines (A) in the plots. In the rectangular regions framed by lines, the weights may be quantized to the same value, which may lie at the bottom left corner of a region. An arrow (B) in every corner may describe the gradient at that point. It is important to note that this gradient may be valid for all points in the region. Therefore, the trajectory may follow (B) in a straight line. When the trajectory crosses the boundary to another region, it may either bounce back or may continue crossing the new region. Yet even when bouncing back and forth, the overall trajectory may move along the boundary. Finally, the trajectory may reach the region which may contain the optimum (C). However, it may not stop, but may continue moving towards (D). It is important to note that in these examples, the learning rate may be chosen such that the trajectory may never skip over an entire cell. Nevertheless, the general idea of how quantized deep learning may work may remain the same.
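The described behavior may be reproduced with a small two-parameter experiment; the quadratic function, its conditioning and all hyper-parameters below are arbitrary assumptions and not the exact functions of Fig. 4.

import numpy as np

# Toy two-parameter illustration (assumed quadratic): f(w) = 0.5 * (w - c)^T A (w - c)
# with a poorly conditioned A. All values below are arbitrary.
A = np.diag([1.0, 25.0])
c = np.array([0.3, 0.11])
lr, delta, steps = 0.02, 2.0 ** -15, 200

def q_log(x):
    # power-of-two quantizer with threshold delta (cf. equations (2) and (4))
    return np.where(np.abs(x) < delta, 0.0,
                    np.sign(x) * 2.0 ** np.floor(np.log2(np.maximum(np.abs(x), delta))))

w = q_log(np.array([0.7, 0.6]))   # quantized starting point
r = np.zeros(2)                   # residual, initially zero
for _ in range(steps):
    grad = A @ (w - c)
    w_tilde = w + r - lr * grad                        # eq. (1)
    w_new = q_log(w_tilde)                             # eq. (2)
    r = np.where(w_new != 0.0, w_tilde - w_new, 0.0)   # eq. (3), unscaled variant
    w = w_new
print("final quantized point:", w)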

A circuit which may move bit patterns by a limited, but variable number of positions to the left or right by padding with zeros or ones may be called a bit shifter. Using bit shifters instead of multiplier or divider circuits may result in a large speedup if the hardware architecture can be customized, as may be the case in FPGAs. Due to techniques known as instruction level parallelism, a modern processor may easily return one multiplication result per clock cycle. So the time until an instruction finishes may not be a good measure of speed for modern processors. Speedup on processors may be gained by issuing multiple instances of the same instruction per cycle. In hardware, this may require several copies of the circuit which may carry out the instruction. Since the space on a chip may be limited, a smaller circuit may allow for more copies to fit into the chip, allowing for a higher degree of parallelism. A good proxy for the size of a digital circuit may be the number of logic gates it may require to be implemented in hardware.

Multiplication with a number which is a power of two, 2^k (with k ∈ Z), may be the same as shifting the binary representation of the other factor by |k| bits. A barrel shifter may be a circuit which may shift a bit pattern by a variable, but limited, number of bits in a single clock cycle. Therefore, when multiplying with power-of-2 numbers, it may be a substitution for a conventional multiplier circuit. Barrel shifters may be implemented with a number of gates in the order of O(n log n), with n the number of bits of an operand. A simple, one-directional realization may be built with 3n·log2(n) gates.

A conventional integer multiplier needs O(n^2) gates, and a concrete realization may be implemented with 6n^2 - 8n gates. Table 1 compares the two circuits for some operand sizes. The ratio of the gate numbers may be a good approximation for the expected speedup. Compared to floating point multiplier circuits, the speedup may be even more substantial, as such a circuit may include at least both a fixed point multiplier and a bit shifter (for normalization).

Table 1: Comparison of circuit size measured in the number of gates for shifter and multiplier. The bitwidth refers to the number of bits of an operand.

bitwidth (n)              4      8      16     32
shifter (3n·log2(n))      24     72     192    480
multiplier (6n^2 - 8n)    64     320    1408   5888
ratio                     2.7    4.4    7.3    12.3
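The entries of Table 1 follow directly from the two gate count formulas given above; a small script reproducing them:

import math

def shifter_gates(n):
    return 3 * n * math.log2(n)      # barrel shifter, O(n log n)

def multiplier_gates(n):
    return 6 * n ** 2 - 8 * n        # integer multiplier, O(n^2)

for n in (4, 8, 16, 32):
    s, m = shifter_gates(n), multiplier_gates(n)
    print(f"n={n:2d}  shifter={s:5.0f}  multiplier={m:5d}  ratio={m / s:.1f}")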

The topologies used in experiments may be Cifarnet ("Cifarnet code," 2016, accessed 2019-05-22, "https://github.com/tensorflow/models/blob/master/research/slim/nets/cifarnet.py"), which may be an AlexNet-like (A. Krizhevsky et al., "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, pp. 1097-1105, 2012) neural network with roughly 1.8 million weights, and AlexNet. They may be trained on Cifar10 (A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," tech. rep., Citeseer, 2009) and Imagenet (J. Deng et al., "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR09, 2009), respectively. The experiments may use the ResOpt variant with scaled residuals, if not mentioned otherwise. The experiments which may not use ResOpt use stochastic gradient descent (SGD) without momentum, because it may be the optimizer most similar to ResOpt. The baseline training may use 32-bit floating point numbers and the hyper-parameters may be the same as for the quantized training. The main metric may be the accuracy of the classifier evaluated on the validation dataset. All experiments may start from the same random initialization point. The model for Cifar10 may be trained for 200 epochs, using an initial learning rate of 0.1 and halving it every 40 epochs. AlexNet may be trained for 50 epochs, using an initial learning rate of 0.1 and halving it every 10 epochs. The framework used for all experiments may be TensorFlow, using quantization operations from the TensorQuant emulation toolbox.

Fig. 5a shows the validation accuracy 502 over the training steps 501 for different residual quantizers. The key may be to use a low precision fixed point variable called the residual, which may be bound by [0, 1), that may keep track of the quantization error. Therefore, the fractional part may be set to f = w - 1. The baseline accuracy may be at 90.5 % after 200 epochs. The training run with ResOpt may be about 3 % below the baseline. Notice that the gap between the baseline and ResOpt may become larger every time the learning rate is halved, for example around iteration 3 × 10^4. The baseline method may be benefiting from the learning rate schedule more than ResOpt. Down to 12 bits, the residual based method may result in similar accuracies. At 8 bits, there may be a sharp drop in accuracy. The accuracy values between 8 and 12 bits may be given in Table 2. A residual width of 11 bits may be sufficient to achieve a final accuracy of 87 %.

Table 2: Comparison of training with scaled residuals and unscaled residuals

word width w                  32     16     12     11     10     9      8      6
scaled res. accuracy (%)      87.2   87.7   87.3   86.9   85.4   82.4   78.2   60.2
unscaled res. accuracy (%)    87.4   85.4   67.1   52.6   38.6   27.4   12.8   10.6

The same table shows the difference to the unscaled version. For high bit widths, the scaled and unscaled variants may produce similar results. But going to 16 bits and below, the accuracy of the unscaled variant drops.

The weight histogram of all weights in the neural network trained with 16-bit residuals may be shown in Fig. 5b, which shows the count 504 over the value 503. The lower boundary at the value 2^(-15) may be enforced by the thresholding mechanism in eq. (12). There may be no weights greater than one, so all weights may be encoded in logarithmic scale with 5-bit signed integers.
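As a rough illustration of the storage cost implied by Fig. 5b, the exponents of power-of-two weight magnitudes bounded by [2^(-15), 1] fit into a 5-bit signed integer; the encoding below is an assumption for illustration (the sign of the weight itself would be stored separately).

import numpy as np

weights = np.array([2.0 ** -15, 2.0 ** -7, 0.5, 1.0])
exponents = np.log2(np.abs(weights)).astype(np.int8)     # exponents in [-15, 0]
assert exponents.min() >= -16 and exponents.max() <= 15  # representable in 5 bits (signed)
decoded = 2.0 ** exponents.astype(np.float64)
assert np.allclose(decoded, np.abs(weights))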

As mentioned in the methods, the convergence behavior of quantized training methods may be different from standard SGD. Fig. 6a and 6b show the loss landscape of a DNN trained on Cifar10. The plots may be centered on the final weights, marked by a red star (A). The directions in weight space 601, 602, 603 and 604 may be the principal components calculated from 25 snapshots of the weights during the entire training run. The weights at every point in the plot may always be quantized down to powers of 2 and the loss may be plotted on a logarithmic scale. The trajectory itself may not convey any insights, so it may not be present on the plots. The thresholds where the logarithmic quantization may jump from one value to the other may be visible as steps in the loss landscape (B). As described previously, the training trajectory may jump around a threshold, which may be the reason the thresholds become visible in the principal component analysis. These transitions may not always be visible, since they may be smoothed out due to the grid granularity in high dimensional spaces. Further, the trajectory may not end in a sink of the loss landscape (C). It gets stuck at an intersection between threshold lines (B), which may be similar to what may be observed in the toy example in Fig. 4. The final weights may be close to a region of relatively low loss (C).

Table 3: Comparison of validation accuracy (%) of Cifarnet with fixed point weights trained with a latent, high precision representation of the weights, without one, and trained with ResOpt.

weight width     32     16     12     8      6
latent           90.2   90.8   90.2   90.0   80.2
no latent        90.0   87.3   66.2   22.9   16.4
ResOpt           87.2   87.7   87.3   78.2   60.2

To further demonstrate the impact of the latent high precision weights, Table 3 shows training runs for update schemes with a latent copy, without a latent copy and with ResOpt. The latent and non-latent runs may use fixed point weights. The gradients may be accumulated in the form of fixed point residuals in the case of ResOpt, so comparing against fixed point quantization may be fair. The fractional part may again be f = w - 1, which may be a good choice for this problem. The fixed point training may be similar to the technique in DoReFa-Net (S. Zhou et al., "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients", arXiv preprint arXiv:1606.06160, 2016), but may use simple straight-through estimators for the gradients in the backward propagation. The latent fixed point training may do well throughout all the different weight widths.

In the non-latent fixed point training, the weights may be quantized right after the update is applied. The consequence may be that small incremental changes to the weights vanish in the quantization. Therefore, 12 bit training may already be failing for the non-latent representation. It may become evident that the latent representation may help a lot in training the quantized DNNs.

ResOpt may be the first method which omits the latent representation successfully, opening up new possibilities to implement the update step on custom hardware accelerators. It may achieve similar accuracies as other existing optimization schemes for quantized weights, but with a much simpler update scheme. The key to the ResOpt method may be the scaled residuals. They may store the error made by the quantization in an efficient way by normalizing it by the weight. An analysis of the convergence behavior of the proposed method was given above. ResOpt may further use power-of-2 weights, which may allow replacing multiplications with bit shifts. This, in combination with the lack of latent high precision weights, may open up a way to design asynchronous, distributed training on custom hardware accelerators such as FPGAs.

Customizable hardware may be designed such that it may handle quantized numbers very efficiently and speed up computation. Devices called field programmable gate arrays (FPGAs) may allow a custom hardware architecture to be implemented relatively quickly. The literature on the combination of FPGAs with DNNs currently focuses on accelerating the inference phase, but there is some literature about training on FPGAs as well. There are no strong technical bounds limiting the use of FPGAs as alternative accelerators for DNN training, similar to how graphics processing units (GPUs) are used.

Another important factor in the training of DNNs is how to train on distributed platforms. The most common approach may be to keep a central copy of the model weights on a parameter server and to broadcast the updated versions to workers. The gradients may be gathered by the parameter server from the workers and the updated model is again broadcasted. Such an update scheme is synchronous, since all workers have the same set of weights at all times.

Asynchronous approaches may allow the individual worker models to diverge to some degree. In an approach called the Hogwild! approach (B. Recht et al., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," Advances in neural information processing systems, pp. 693-701, 2011), a worker may send its updates to a small subset of other workers instead of to a central parameter server. This approach is much less deterministic, but has proven to converge in less time, since workers spend no time waiting for updates. An important difference is that the workers may need to apply their updates locally in the asynchronous scheme. A novel approach in HPC clusters may be to use special accelerator hardware for the workers. This could be a GPU cluster like the Oak Ridge Summit or an FPGA cluster like IBM's cloudFPGA platform. Combining asynchronous schemes with clusters built from special accelerator hardware, the entire update scheme may be representable in the numerical representation which the worker architecture supports.

The aspects and features mentioned and described together with one or more of the previously detailed examples and figures may as well be combined with one or more of the other examples in order to replace a like feature of the other example or in order to additionally introduce the feature to the other example.

Examples may further be or relate to a computer program having a program code for performing one or more of the above methods, when the computer program is executed on a computer or processor. Steps, operations or processes of various above-described methods may be performed by programmed computers or processors. Examples may also cover program storage devices such as digital data storage media, which are machine, processor or computer readable and encode machine-executable, processor-executable or computer-executable programs of instructions. The instructions perform or cause performing some or all of the acts of the above-described methods. The program storage devices may comprise or be, for instance, digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. Further examples may also cover computers, processors or control units programmed to perform the acts of the above-described methods or (field) programmable logic arrays ((F)PLAs) or (field) programmable gate arrays ((F)PGAs), programmed to perform the acts of the above-described methods.

The description and drawings merely illustrate the principles of the disclosure. Furthermore, all examples recited herein are principally intended expressly to be only for illustrative purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor(s) to furthering the art. All statements herein reciting principles, aspects, and examples of the disclosure, as well as specific examples thereof, are intended to encompass equivalents thereof.

A functional block denoted as "means for ..." performing a certain function may refer to a circuit that is configured to perform a certain function. Hence, a "means for s.th." may be implemented as a "means configured to or suited for s.th.", such as a device or a circuit configured to or suited for the respective task.

Functions of various elements shown in the figures, including any functional blocks labeled as "means", "means for providing a signal", "means for generating a signal", etc., may be implemented in the form of dedicated hardware, such as "a signal provider", "a signal processing unit", "a processor", "a controller", etc. as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which or all of which may be shared. However, the term "processor" or "controller" is by far not limited to hardware exclusively capable of executing software, but may include digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.

A block diagram may, for instance, illustrate a high-level circuit diagram implementing the principles of the disclosure. Similarly, a flow chart, a flow diagram, a state transition diagram, a pseudo code, and the like may represent various processes, operations or steps, which may, for instance, be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown. Methods disclosed in the specification or in the claims may be implemented by a device having means for performing each of the respective acts of these methods.

It is to be understood that the disclosure of multiple acts, processes, operations, steps or functions disclosed in the specification or claims may not be construed as to be within the specific order, unless explicitly or implicitly stated otherwise, for instance for technical reasons. Therefore, the disclosure of multiple acts or functions will not limit these to a particular order unless such acts or functions are not interchangeable for technical reasons. Furthermore, in some examples a single act, function, process, operation or step may include or may be broken into multiple sub-acts, -functions, -processes, -operations or -steps, respectively. Such sub acts may be included and part of the disclosure of this single act unless explicitly excluded.

Furthermore, the following claims are hereby incorporated into the detailed description, where each claim may stand on its own as a separate example. While each claim may stand on its own as a separate example, it is to be noted that - although a dependent claim may refer in the claims to a specific combination with one or more other claims - other examples may also include a combination of the dependent claim with the subject matter of each other dependent or independent claim. Such combinations are explicitly proposed herein unless it is stated that a specific combination is not intended. Furthermore, it is intended to include also features of a claim to any other independent claim even if this claim is not directly made dependent to the independent claim.