Title:
TRAINING BINARY DEEP NEURAL NETWORKS WITHOUT BATCH NORMALIZATION
Document Type and Number:
WIPO Patent Application WO/2023/217361
Kind Code:
A1
Abstract:
A device (800) for training a binary DNN by backpropagating information for updating weights of a prototype version of the DNN, configured to: generate (701) a pre-activation value (602) from an arithmetic component (601) of the prototype DNN; input (702) the pre-activation value (602) to a forward activation function (603); determine (703) a backpropagation signal (607) in dependence on an error between a final output of the prototype DNN and an expected output; input (704) the pre-activation value (602) and the backpropagation signal (607) to a backward activation function (610); and update (705) weights of the prototype DNN in dependence on the output of the backward function (610), wherein the backward function (610) forms an output being inversely proportional to a distance from an origin of the forward function (603) to the pre-activation value (602). The device can implement activation functions that are independently designed and work independently of BatchNorm.

Inventors:
NGUYEN VAN MINH (DE)
Application Number:
PCT/EP2022/062745
Publication Date:
November 16, 2023
Filing Date:
May 11, 2022
Assignee:
HUAWEI TECH CO LTD (CN)
NGUYEN VAN MINH (FR)
International Classes:
G06N3/04; G06N3/08
Other References:
CHEN TIANLONG ET AL: ""BNN - BN = ?": Training Binary Neural Networks without Batch Normalization", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 19 June 2021 (2021-06-19), pages 4614 - 4624, XP033967504, DOI: 10.1109/CVPRW53098.2021.00520
JIANG XINRUI ET AL: "Training Binary Neural Network without Batch Normalization for Image Super-Resolution", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 35, no. 2, 18 May 2021 (2021-05-18), pages 1700 - 1707, XP055983141, ISSN: 2159-5399, DOI: 10.1609/aaai.v35i2.16263
Attorney, Agent or Firm:
KREUZ, Georg M. (DE)
Claims:
CLAIMS

1. A device (800) for training a binary deep neural network by backpropagating information for updating weights of a prototype version of the binary deep neural network, the device being configured to: generate (701) a pre-activation value (602) from an arithmetic component (601) of the prototype version of the binary deep neural network; input (702) the pre-activation value (602) to a forward activation function (603); determine (703) a backpropagation signal (607) in dependence on an error between a final output of the prototype version of the binary deep neural network and an expected output; input (704) the pre-activation value (602) and the backpropagation signal (607) to a backward activation function (610); and update (705) one or more weights of the prototype version of the binary deep neural network in dependence on the output of the backward activation function (610), wherein the backward activation function (610) forms an output that is inversely proportional to a distance from an origin of the forward activation function (603) to the pre-activation value (602).

2. The device (800) as claimed in claim 1, wherein the pre-activation value (602) is a weighted sum of a standard real-valued neuron.

3. The device (800) as claimed in claim 1 or claim 2, wherein the backward activation function (610) has a metric of spreading defined by one or more parameters.

4. The device (800) as claimed in claim 3, wherein the device (800) is configured to determine the metric of spreading of the backward activation function (610) in dependence on a metric of spreading of a distribution of pre-activation values (602).

5. The device (800) as claimed in claim 4, wherein the metric of spreading of the backward activation function (610) is determined in dependence on a standard deviation of the distribution of pre-activation values (602).

6. The device (800) as claimed in claim 4 or claim 5, wherein the device (800) is configured to compute the one or more parameters in dependence on the distribution of pre-activation values (602) to form a pre-configured backward activation function (610).

7. The device (800) as claimed in claim 4 or claim 5, wherein the device (800) is configured to learn the one or more parameters.

8. The device (800) as claimed in any preceding claim, wherein the device (800) is configured to learn the one or more parameters during training of the binary deep neural network subject to the minimization of a pre-defined loss function.

9. The device (800) as claimed in any preceding claim, wherein the forward activation function (603) is a function having a binary output.

10. The device (800) as claimed in claim 9, wherein the forward activation function (603) is a Sign function.

11. The device (800) as claimed in any preceding claim, wherein the backward activation function (610) is a Tanh’ function, a Sigmoid’ function or a 1/x^y function.

12. The device (800) as claimed in any preceding claim, wherein the forward activation function (603) comprises one or more learnable parameters.

13. The device (800) as claimed in any preceding claim, wherein the device (800) is configured to pre-configure the forward activation function (603) and/or the backward activation function (610) in dependence on the pre-activation value (602).

14. A method (700) for training a binary deep neural network by backpropagating information for updating weights of a prototype version of the binary deep neural network, the method comprising: generating (701) a pre-activation value (602) from an arithmetic component (601) of the prototype version of the binary deep neural network; inputting (702) the pre-activation value (602) to a forward activation function (603); determining (703) a backpropagation signal (607) in dependence on an error between a final output of the prototype version of the binary deep neural network and an expected output; inputting (704) the pre-activation value (602) and the backpropagation signal (607) to a backward activation function (610); and updating (705) one or more weights of the prototype version of the binary deep neural network in dependence on the output of the backward activation function (610), wherein the backward activation function (610) forms an output that is inversely proportional to a distance from an origin of the forward activation function (603) to the pre-activation value (602).

15. A computer-readable storage medium (802) having stored thereon computer-readable instructions that, when executed at a computer system (801), cause the computer system to perform the method (700) of claim 14.

Description:
TRAINING BINARY DEEP NEURAL NETWORKS WITHOUT BATCH NORMALIZATION

FIELD OF THE INVENTION

This invention relates to machine learning models, in particular to the training of binary deep neural networks.

BACKGROUND

Deep learning has been the source of numerous successes in fields such as computer vision and natural language processing over the last decade. It now occupies a central place in the technological and social landscape, with applications extending far beyond computer science.

Deep learning uses deep neural networks (DNNs), which are complex non-linear systems loosely inspired by biological brains. They are able to manipulate high-dimensional objects, learn autonomously from given examples without being programmed with any task-specific rules, and achieve state-of-the-art performance.

Figure 1 shows the basic operation scheme of a DNN, which is usually split into two phases. In the training phase, the DNN, denoted ‘model’ in the figure, learns its own parameters. A starting version of the model to be trained, shown at 101, is trained using provided training data 102 to form a trained model 103. Then, in the inference phase, the DNN is used to output one or more predictions 104 on unseen input data 105.

A DNN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. The connections between artificial neurons have a weight (also called a parameter) that is adjusted as learning proceeds. The weight increases or decreases the strength of the signal at a connection. As schematically illustrated in Figure 2, a real-valued neuron performs a weighted sum of its inputs x1, x2, ..., xm with its own weights w1, w2, ..., wm in an arithmetic layer 201. The resulting sum is known as pre-activation data, which is then passed through an activation function 202 to provide a final output y, which can also be referred to as activation data.

Conceptually, the activation function 202 is used to mimic a biological synapse, which only fires an informative output when its signal is sufficiently strong. It is a simple yet critical component thanks to which artificial neural networks are non-linear systems able to learn and generalize. It is essentially a non-linear function such as a rectified linear unit (ReLU), Sigmoid or Tanh function for real-valued networks, and a Sign function for binary neural networks. The Sign function is shown in Figure 3. It is a step function with a binary output, for example taking the values -1 and +1.
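For illustration, a minimal NumPy sketch of this forward computation is given below; the function names are illustrative and are not taken from the application.

```python
import numpy as np

def sign(z):
    # step activation with a binary output: +1 for non-negative input, -1 otherwise
    return np.where(z >= 0.0, 1.0, -1.0)

def neuron_forward(x, w, activation=sign):
    pre_activation = np.dot(w, x)      # arithmetic layer 201: weighted sum of the inputs
    return activation(pre_activation)  # activation function 202 applied to the pre-activation value
```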

Meanwhile, deep neural networks are known to be very intensive in memory and computation. On one hand, a DNN is composed of a very large number of parameters, reaching hundreds of millions in today’s applications, which requires a significant memory footprint to represent the model. On top of that, the training phase requires a large amount of training data, as well as many temporary variables used to optimize the model parameters, such as gradient information. As a result, training a DNN requires dedicated, powerful infrastructure, which limits the potential of artificial intelligence.

One promising approach for alleviating this so-called memory wall issue is to design DNNs with binary parameters and binary activations, meaning that the model parameters and activation data are each represented by a binary number consuming only 1 bit, instead of a floating-point number of 32 bits. This can reduce the model memory footprint by 32 times for the same network size. However, the binary constraint restricts the activation function to the step-like family, such as the Sign function shown in Figure 3.
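As a rough illustration of the memory argument (the sizes and the bit-packing scheme below are illustrative only and are not taken from the application):

```python
import numpy as np

w_float = np.random.randn(1_000_000).astype(np.float32)  # 32-bit weights: about 4 MB
w_packed = np.packbits(w_float >= 0)                      # 1 bit per binary weight: about 125 kB
print(w_float.nbytes, w_packed.nbytes)                    # 4000000 vs 125000 bytes, a 32x reduction
```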

A step-like activation function has zero derivative everywhere except at the origin, hence annihilating the backward information flow. This can lead to insufficient training performance.

The main line of thinking in existing approaches is to develop an approximation of the step function, such that the issue of ‘almost everywhere zero derivative’ can be somewhat alleviated.

Figure 4(a) shows a scaled Sign function. Two approximations to this function are shown in Figure 4(b): HardTanh (solid line) and ApproxSign (dashed line). Figure 4(c) shows the respective derivatives of these functions.

However, these existing approaches are tied to the forward point of view, in that they try to find a differentiable approximation of the Sign function. As a result, the backward activation function is merely a consequence of differentiation, and important features required in the backward direction are missing.

In particular, backward information is completely annihilated outside of the [-1, +1] region. Therefore, an existing design must be coupled with a batch normalization layer (BatchNorm, see https://en.wikipedia.org/wiki/Batch_normalization) in front of the binary activation function, so that the BatchNorm squeezes the data tightly into the [-1, +1] region. This adds further computational complexity and increases memory requirements for the training process.
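For instance, the backward rule implied by a HardTanh-style surrogate is simply an indicator of the [-1, +1] region, so pre-activations outside that region contribute nothing to the backward pass (a minimal NumPy sketch, not the exact prior-art code):

```python
import numpy as np

def hardtanh_backward(pre_activation, grad_out):
    # the derivative of HardTanh is 1 inside [-1, +1] and 0 elsewhere,
    # so backward information is annihilated for pre-activations outside that region
    inside = (np.abs(pre_activation) <= 1.0).astype(grad_out.dtype)
    return grad_out * inside
```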

It is desirable to develop a method which overcomes such problems.

SUMMARY

According to a first aspect, there is provided a device for training a binary deep neural network by backpropagating information for updating weights of a prototype version of the binary deep neural network, the device being configured to: generate a pre-activation value from an arithmetic component of the prototype version of the binary deep neural network; input the pre-activation value to a forward activation function; determine a backpropagation signal in dependence on an error between a final output of the prototype version of the binary deep neural network and an expected output; input the pre-activation value and the backpropagation signal to a backward activation function; and update one or more weights of the prototype version of the binary deep neural network in dependence on the output of the backward activation function, wherein the backward activation function forms an output that is inversely proportional to a distance from an origin of the forward activation function to the pre-activation value.

As a result, the device can implement backward and forward activation functions that are independently designed, that work independently of BatchNorm and that can be trained to achieve high accuracy during inference.

The arithmetic component may be a layer of the binary deep neural network. The pre-activation value may be a weighted sum of a standard real-valued neuron. A real-valued neuron performs a weighted sum of its inputs with its own weights in an arithmetic layer. Thus, the technique is compatible with common network architectures.

The backward activation function may have a metric of spreading defined by one or more parameters. One such metric may be the standard deviation. This may be a suitable metric to quantify the spread of the function.

The device may be configured to determine the metric of spreading of the backward activation function in dependence on a metric of spreading of a distribution of pre-activation values. This may allow pre-activation spreading to be taken into account and allow information from a distribution of pre-activation values to be effectively used in the backpropagation of information to optimize the binary deep neural network during training. The metric of spreading of the backward activation function may be determined in dependence on a standard deviation of the distribution of pre-activation values. This may be a convenient implementation that can match the metric of spreading of the pre-activation data to the backward activation function.

The device may be configured to compute the one or more parameters in dependence on the distribution of pre-activation values to form a pre-configured backward activation function. This may allow the device to take the spread of pre-activation data into account when configuring the backward activation function.

The device may be configured to learn the one or more parameters. This may allow the device to optimize the parameters.

The device may be configured to learn the one or more parameters during training of the binary deep neural network subject to the minimization of a pre-defined loss function. This may allow the parameters of the metric of spreading of the backward activation function to be optimized during training of the binary deep neural network, such that the binary deep neural network used during inference achieves good performance.

The forward activation function may be a function having a binary output. This may allow the output of the arithmetic component to contribute to the final output of the binary DNN when its signal is sufficiently strong.

The forward activation function may be a Sign function. This may be a suitable function that satisfies the binary output criterion.

The backward activation function may be a Tanh’ function, a Sigmoid’ function or a 1/x^y function. Such functions can form an output that is inversely proportional to a distance from an origin of the forward activation function to the pre-activation value.

The forward activation function may comprise one or more learnable parameters. This may allow the device to learn the forward activation function.

The device may be configured to pre-configure the forward activation function and/or the backward activation function in dependence on the pre-activation value. This may allow the forward and/or backward activation functions to be adapted in dependence on the pre-activation data.

According to a second aspect, there is provided a method for training a binary deep neural network by backpropagating information for updating weights of a prototype version of the binary deep neural network, the method comprising: generating a pre-activation value from an arithmetic component of the prototype version of the binary deep neural network; inputting the pre-activation value to a forward activation function; determining a backpropagation signal in dependence on an error between a final output of the prototype version of the binary deep neural network and an expected output; inputting the pre-activation value and the backpropagation signal to a backward activation function; and updating one or more weights of the prototype version of the binary deep neural network in dependence on the output of the backward activation function, wherein the backward activation function forms an output that is inversely proportional to a distance from an origin of the forward activation function to the pre-activation value.

The method can implement backward and forward activation functions that are independently designed, that work independently of BatchNorm and that can be trained to achieve high accuracy during inference.

According to a further aspect, there is provided a computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the method as set forth above. The computer system may comprise one or more processors. The computer readable storage medium may be a non-transitory computer readable storage medium.

BRIEF DESCRIPTION OF THE FIGURES

The present disclosure will now be described by way of example with reference to the accompanying drawings.

In the drawings:

Figure 1 schematically illustrates the basic operation scheme of a DNN.

Figure 2 schematically illustrates the basic components of an artificial neuron.

Figure 3 shows a Sign function.

Figures 4(a)-(c) schematically illustrate examples of existing approaches for approximating a Sign forward activation function and their associated derivatives.

Figure 5 schematically illustrates an exemplary distribution of pre-activation data.

Figure 6 schematically illustrates the architecture of an embodiment of the present invention.

Figure 7 shows a flowchart for an example of a method for training a binary DNN comprising multiple binary weights in accordance with embodiments of the present invention.

Figure 8 shows an example of a device for training a binary DNN.

Figure 9 schematically illustrates the architecture of state-of-the-art prior techniques compared to the architecture of embodiments of the present invention.

Figure 10(a) shows training and test accuracies for a prior method with BatchNorm removed.

Figure 10(b) shows training and test accuracies on the same task for an embodiment of the present invention.

DETAILED DESCRIPTION

Embodiments of the present invention use a binary activation function satisfying the design constraint of binary neural networks, while overcoming several limitations of state-of-the-art step activation functions.

As illustrated in Figure 5, the pre-activation output of neurons can reach a large range far from 0. Figure 5 shows a distribution of pre-activation data of a binary neuron without BatchNorm. In this example, a binary neuron of input size 200 has pre-activation outputs spreading over [-200, +200] along the x-axis.
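A small sketch reproducing this kind of spread (the sample sizes and random draws below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 1.0], size=(10_000, 200))  # binary inputs, input size 200
w = rng.choice([-1.0, 1.0], size=200)            # binary weights
pre = x @ w                                      # pre-activation values lie in [-200, +200]
print(pre.min(), pre.max(), pre.std())           # spread far wider than [-1, +1]
```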

In order to take into account information from this wide distribution of pre-activation values during backpropagation in the training process, a complementary backward activation function should be used.

Embodiments of the present invention use a forward activation function which satisfies the binary output criterion, meaning that it belongs to the step function family. Furthermore, the backward activation function has an output that is inversely proportional to the distance from the origin of the forward function to the input pre-activation value.

Advantageously, spreading of the backward activation function can also be matched to that of the pre-activation data distribution, for example by matching the standard deviations, as will be described in more detail herein.

An exemplary architecture 600 for the training of a binary DNN is schematically illustrated in Figure 6. A binary DNN comprises multiple binary weights. A weight is a parameter within a neural network that transforms input data within the network’s layers. A neural network comprises a series of nodes, or neurons. Within each neuron is a set of inputs, a set of weights and a bias value. As an input enters the node, it is multiplied by its associated weight value. The node may optionally add a bias before passing the data to the next layer. The resulting output is either observed or passed on to the next layer of the network.

Weights are learnable parameters within the network. A DNN may randomize the weights before learning initially begins. As training continues, the weights are adjusted toward the desired values to give the “correct” output.

The terms ‘weight’ and ‘parameter’ are used interchangeably herein, having the same meaning. Each binary weight has only two possible values. This can allow the model weights to be represented by a binary number, consuming only 1 bit instead of a floating-point number of 32 bits. This may reduce not only the model footprint, but also training memory and computation complexity.

The DNN may be trained using sets of training data, the sets comprising input data and respective expected outputs.

In general, the training process for the DNN follows a standard iterative process, which iterates loops of a forward pass followed by a backpropagation pass.

In the forward pass, training data is injected into a prototype version of the DNN (i.e. the current version of the DNN in that iteration of the training process) for evaluating the error between the output produced by the prototype version of the DNN and the true data labels. In the backward pass, the error is propagated from the output layer throughout the network, back to the input layer. This may be done in dependence on an optimization target, such as minimizing a predetermined loss function.

A backpropagation signal is determined in dependence on the error between the final output of the prototype version of the binary deep neural network and an expected output (i.e. the true data labels for the training data set).
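For orientation only, one such training iteration can be sketched as follows (a generic PyTorch-style sketch; the model, loss function and optimizer are assumed to be given, and none of the names come from the application):

```python
def train_step(model, x, y_true, loss_fn, optimizer):
    y_pred = model(x)               # forward pass through the prototype version of the network
    loss = loss_fn(y_pred, y_true)  # error between the final output and the expected output
    optimizer.zero_grad()
    loss.backward()                 # backward pass: the backpropagation signal flows back through every layer
    optimizer.step()                # update the weights of the prototype network
    return loss.item()
```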

In Figure 6, the arithmetic component 601, which in this example is an arithmetic layer, is a part of a binary deep neural network which computes a pre-activation value 602, for instance the weighted sum of a standard real-valued neuron. The binary deep neural network can comprise a plurality of arithmetic components which output respective pre-activation data, and a plurality of subsequent activation functions, into which the respective pre-activation data is input.

In this example, in the forward pass, the forward activation function is determined by the blocks FORWARD ACTIVATION SPECIFICATION 604 and PRE-CONFIGURED OR ONLINE CONTROL 605.

In the ‘online control’ mode, the forward activation function can comprise one or more learnable parameters. This may allow the device to learn the forward activation function in dependence on the pre-activation data.

Alternatively, the forward activation function may be pre-configured in dependence on the pre-activation data 602.

Then, the pre-activation data 602 is passed through the FORWARD ACTIVATION block 603 (i.e. the pre-activation data is input to the forward activation function) to provide a post-activation output 606.

The post-activation output 606 may be the final output of the prototype version of the binary DNN, or there may be one or more subsequent arithmetic components and activation blocks in the network before the final output of the prototype version of the binary DNN is formed.

Backpropagation signal 607 is formed in dependence on the error between the final output of the prototype version of the binary deep neural network and an expected output (i.e. the true data labels for the training data set).

In the backward direction, the forward pre-activation data 602 is taken into account by the backward PRE-CONFIGURED OR ONLINE CONTROL block 608, which in combination with the BACKWARD ACTIVATION SPECIFICATION 609 determines the backward activation function used in the BACKWARD ACTIVATION block 610.

Then, the BACKWARD ACTIVATION block 610 acts on the pre-activation data (i.e. the pre-activation data is input to the backward activation function) to provide backward information to its upstream components. The backpropagation signal 607 is also input to the backward activation function.

In dependence on the output of the backward activation function, weights of the prototype version of the binary deep neural network can be updated.

The updated version of the binary deep neural network can be used in the subsequent iteration of the training process, for example until a predetermined level of convergence is reached. The final trained binary DNN can then be used in the inference phase.

As mentioned above, the backward activation function forms an output that is inversely proportional to a distance from an origin of the forward activation function to the pre-activation value. For example, the backward activation function may be a 1/x^y, Tanh’(x) or Sigmoid’(x) function. Herein, “origin” means the threshold which is used in the feed-forward threshold activation function (the x value at which the “step” occurs in the step function).
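The three candidate backward kernels can be sketched as functions of the distance d from the forward function’s origin to the pre-activation value (a minimal NumPy sketch; the exponent y and the small constant eps in the 1/x^y variant are illustrative choices, not values specified in the application):

```python
import numpy as np

def tanh_prime(d):
    # Tanh'(d) = 1 - tanh(d)^2: largest at d = 0, decaying as |d| grows
    return 1.0 - np.tanh(d) ** 2

def sigmoid_prime(d):
    # Sigmoid'(d) = s(d) * (1 - s(d))
    s = 1.0 / (1.0 + np.exp(-d))
    return s * (1.0 - s)

def inverse_power(d, y=2.0, eps=1e-6):
    # 1/x^y kernel, guarded near zero to avoid division by zero
    return 1.0 / (np.abs(d) + eps) ** y
```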

In one implementation, a metric of spreading of the backward activation function can be matched to the pre-activation data distribution. One such metric is the standard deviation. This can be done in two ways: firstly, by analytically matching and computing the parameters of that metric of the backward activation function beforehand; secondly, by learning the parameters which determine that metric of the backward activation function.

Alternatively, the parameters of the backward activation function can be learned directly from the DNN training process, i.e., subject to minimizing a pre-defined loss function. In this option, the learnable parameters are not restricted to the determination of the matching metric, but they are function parameters learnt directly to minimize the loss function (i.e. the optimization target of the model training process).

In one particular embodiment, the components used are as follows. In the forward direction, the forward activation function is specified as Sign(x) with no control. In the backward direction, the backward activation function is specified as a Tanh’(x) function and is controlled by being pre-configured to pre-activation spreading.

In this example, the metric of spreading of the backward activation function is matched to the spread of the pre-activation data by standard deviation. A spreading factor A = π/(2√3R) is used, where R is the range of the arithmetic layer. The backward activation is given by Tanh’(Ax). The factor A can be pre-configured or can be learned.

Therefore, in embodiments of the present invention, the binary activation components comprise a forward activation function and a backward activation function, which can be independently designed. The backward activation function can be independent of or dependent on the forward activation function.

The forward activation function is preferably subject to satisfying the binary output criterion. The forward activation function can optionally be y-scaled and/or x-shifted, and can additionally be controlled by a pre-configured or online procedure, as described above.

The backward activation function is a function which is inversely proportional to the distance between the pre-activation value and the origin of the forward activation function. It can be further controlled to match the spreading of the pre-activation data. The control procedure can be analytically pre-configured or carried out by online adaptation (i.e. by learning the parameters of a metric of spreading of the backward activation function).
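As an illustration of the particular embodiment above (Sign forward, Tanh’(Ax) backward, with A pre-configured from the range R of the arithmetic layer), a minimal PyTorch-style sketch is given below; the class names are illustrative and the spreading factor is computed from R as in the example above.

```python
import math
import torch

class SignTanhPrime(torch.autograd.Function):
    """Forward: Sign step function. Backward: Tanh'(Ax) applied to the backpropagation signal."""

    @staticmethod
    def forward(ctx, x, A):
        ctx.save_for_backward(x)
        ctx.A = A
        # forward activation: binary output (+1 / -1)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # backward activation Tanh'(Ax) = 1 - tanh(Ax)^2: large near the forward function's
        # origin and decaying with the distance from the origin to the pre-activation value
        local = 1.0 - torch.tanh(ctx.A * x) ** 2
        return grad_output * local, None  # no gradient for the pre-configured factor A


class BinaryAct(torch.nn.Module):
    def __init__(self, R):
        super().__init__()
        # spreading factor matched to the pre-activation spread via the range R (pre-configured variant)
        self.A = math.pi / (2.0 * math.sqrt(3.0) * R)

    def forward(self, x):
        return SignTanhPrime.apply(x, self.A)
```

In use, a layer’s pre-activation tensor is simply passed through BinaryAct(R); in the learned variant, A would instead be treated as a trainable parameter.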

Figure 7 shows an example of a method for training a binary deep neural network by backpropagating information for updating weights of a prototype version of the binary deep neural network in accordance with embodiments of the present invention. At step 701, the method comprises generating a pre-activation value from an arithmetic component of the prototype version of the binary deep neural network. At step 702, the method comprises inputting the pre-activation value to a forward activation function. At step 703, the method comprises determining a backpropagation signal in dependence on an error between a final output of the prototype version of the binary deep neural network and an expected output. At step 704, the method comprises inputting the pre-activation value and the backpropagation signal to a backward activation function. As described above, the backward activation function forms an output that is inversely proportional to a distance from an origin of the forward activation function to the pre-activation value. At step 705, the method comprises updating one or more weights of the prototype version of the binary deep neural network in dependence on the output of the backward activation function.

Figure 8 shows an example of a device configured to implement the above methods.

The device 800 may comprise at least one processor, such as processor 801, and at least one memory, such as memory 802. The memory stores, in a non-transient way, code that is executable by the processor(s) to implement the device in the manner described herein.

The device 800 may also be configured to implement a binary deep neural network trained according to the method 700, with optional additional features as described with respect to the above embodiments. This may allow the device to perform inference using the binary DNN for a variety of tasks.

In contrast to existing solutions, which choose a fixed function to approximate the Sign function so that the backward function is directly imposed by the differentiation rule, the approach described herein allows the backward and forward activation functions to be independently designed. Existing solutions ignore the role of pre-activation spreading, whereas embodiments of the solution described herein can effectively take the wide spread of pre-activation values into account.

Embodiments of the present invention can also operate independently of BatchNorm, whereas current state-of-the-art techniques are strictly tied to it.

Figure 9 schematically illustrates the architecture required by state-of-the-art techniques, compared with that of embodiments of the present invention. In state-of-the-art techniques, the network comprises multiple arithmetic components 901, 904 and their respective activation functions 903, 906. It is also necessary to apply BatchNorms 902, 905 between the arithmetic components and the activation functions to scale the pre-activation data so that it is concentrated inside a [-1, 1] region. In embodiments of the present invention, the method can achieve good results independently of BatchNorm, and the network architecture can comprise multiple arithmetic components 951, 953 and their respective activation functions 952, 954, without the need for BatchNorm.
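In code terms, the two stacks of Figure 9 can be sketched as follows, using the illustrative BinaryAct module from the earlier sketch (layer sizes are arbitrary):

```python
import torch.nn as nn

# state of the art: BatchNorm must sit between the arithmetic layer and the binary activation
baseline_block = nn.Sequential(nn.Linear(200, 200), nn.BatchNorm1d(200), BinaryAct(R=200.0))

# embodiments described herein: the activation pair acts directly on the raw pre-activation values
proposed_block = nn.Sequential(nn.Linear(200, 200), BinaryAct(R=200.0))
```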

Figure 10(a) demonstrates this effect. An existing solution is taken which provides approximately 90% test accuracy on the CIFAR10 (Canadian Institute for Advanced Research) image dataset with the VGG (Visual Geometry Group) Small architecture. When the BatchNorm is removed, this existing solution obtains a little over 30% test accuracy, as shown in Figure 10(a). This demonstrates that this existing solution cannot work effectively without BatchNorm, which is very intensive in memory and computation and is incompatible with a binary DNN.

Figure 10(b) shows the results for an embodiment of the present invention, which achieved 98.84% training accuracy and 89.08% test accuracy on the same task.

Therefore, embodiments of the present invention can implement backward and forward activation functions that can be independently designed, that work independently of BatchNorm and that can be trained to achieve high accuracy during inference.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.