

Title:
METHOD AND SYSTEM FOR QUANTIZING A NEURAL NETWORK
Document Type and Number:
WIPO Patent Application WO/2022/103291
Kind Code:
A1
Abstract:
A method and computing system for determining mixed-precision quantization parameters to quantize a neural network are provided. The method comprises determining a vector of quantization parameters on the basis of a size of the weight vectors of the neural network, and, for each one of multiple training vectors of a training dataset, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters, and modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function. Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.

Inventors:
CHIKIN VLADIMIR MAXIMOVICH (CN)
SOLODSKIKH KIRILL IGOREVICH (CN)
TELEGINA ANNA DMITRIEVNA (CN)
Application Number:
PCT/RU2020/000601
Publication Date:
May 19, 2022
Filing Date:
November 13, 2020
Assignee:
HUAWEI TECH CO LTD (CN)
CHIKIN VLADIMIR MAXIMOVICH (CN)
International Classes:
G06N3/08
Other References:
AHMED T ELTHAKEB ET AL: "WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 29 February 2020 (2020-02-29), XP081651983
AHMED T ELTHAKEB ET AL: "SinReQ: Generalized Sinusoidal Regularization for Low-Bitwidth Deep Quantized Training", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 May 2019 (2019-05-04), XP081542586
MAXIM NAUMOV ET AL: "On Periodic Functions as Regularizers for Quantization of Neural Networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 24 November 2018 (2018-11-24), XP081040812
STEFAN UHLICH ET AL: "Mixed Precision DNNs: All you need is a good parametrization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 May 2019 (2019-05-27), XP081663581
STEVEN K ESSER ET AL: "Learned Step Size Quantization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 May 2020 (2020-05-07), XP081663151
Attorney, Agent or Firm:
LAW FIRM "GORODISSKY & PARTNERS" LTD. (RU)
Claims:
CLAIMS

1. A method for determining mixed-precision quantization parameters to quantize a neural network, the neural network comprising a plurality of layers, each layer being associated with a weight vector comprising multiple floating point data values, wherein the neural network is trained on a training dataset and the weight vectors are selected to minimise a first loss function associated to the neural network, the method comprising: determining a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset: evaluating a second loss function on the basis of the training vector and the vector of quantization parameters; and modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function, wherein each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.

2. The method of claim 1, wherein the second loss function comprises the first loss function, a first regularization component and a second regularization component.

3. The method of claim 2 wherein the first regularization component is selected to constrain the quantized weight vector for each layer to a pre-determined range of values.

4. The method of claims 1 to 3 wherein the second regularization component is selected to constrain quantized input data values to each layer of the corresponding quantized neural network to a pre-determined range of values.

5. The method of claims 2 to 4, wherein the first and/or second regularization components comprise functions that depend continuously on the data values of the vector of quantization parameters.

6. The method of claims 1 to 5, wherein modifying the weight vectors and the vector of quantization parameters comprises determining a local minimum of the second loss function.

7. The method of claim 6, wherein the local minimum is determined according to a gradient descent method.

8. The method of claims 1 to 7, comprising: accessing a validation dataset; and evaluating, for each one of multiple validation vectors in the validation dataset, the quantized neural network on the basis of the quantization parameters.

9. The method of claims 1 to 8, wherein determining the vector of quantization parameters on the basis of a size of the weight vectors comprises: determining a sum of the sizes of the multiple floating point data values of the weight vectors; and generating a parameter surface on the basis of the determination.

10. The method of claim 9, wherein the parameter surface comprises a portion of an ellipsoidal surface.

11. The method of any one of claims 2 to 5, comprising generating a set of size parameters; and selecting the first regularization component to train the quantization parameters to the set of size parameters.

12. The method of any one of claims 1 to 11 , wherein the second regularization component is selected to constrain the quantized input data values to each layer of the corresponding quantized neural network to the set of bit-width parameters associated to the first regularization component.

13. The method of any one of claims 1 to 11 , wherein the second regularization component is selected to constrain the quantized input data values to each layer of the corresponding quantized neural network to a pre-determined bit-width.

14. The method of any one of claims 1 to 11 , comprising determining a further vector of quantization parameters on the basis of a size of the input data values of each layer of the neural network, evaluating the second loss function on the basis of the further vector and modifying the further vector on the basis of the evaluation; and selecting the second regularization component to train the input quantization parameters to the set of input size parameters.

15. A method for performing inference based on an input to a neural network, the method comprising: determining quantization parameters for the neural network according to the method of any one of claims 1 to 14; determining a quantized neural network corresponding to the neural network on the basis of the quantization parameters; evaluating the quantized neural network on the basis of the input to infer an output.

16. A method for operating a neural network, the method comprising: obtaining data to be processed by the neural network; and processing the data by the neural network, wherein the neural network is configured by quantization parameters obtained or obtainable according to the method of any one of claims 1 to 14.

17. A computer program to perform the method of any one of claims 1 to 16.

18. A non-transitory computer readable medium comprising instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 16.

19. A computing system, comprising one or more processors configured to perform the method according to any one of claims 1 to 16.

20. A computing system, for determining mixed-precision quantization parameters to quantize a neural network comprising a plurality of layers, each layer being associated with a weight vector comprising multiple floating point data values, wherein the neural network is trained on a training dataset and the weight vectors are selected to minimise a first loss function associated to the neural network, the computing system comprising: at least one processor; and at least one memory including program code which when executed by the at least one processor provides instructions to: determine a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset: evaluate a second loss function on the basis of the training vector and the vector of quantization parameters; and modify the weight vectors and the vector of quantization parameters to minimize an output of the second loss function

wherein each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.


Description:
METHOD AND SYSTEM FOR QUANTIZING A NEURAL NETWORK

TECHNICAL FIELD

The present disclosure relates to a system and method for determining mixed-precision quantization parameters for a neural network.

BACKGROUND

Neural networks have delivered impressive results across a wide range of applications in recent years. This has led to widespread adoption across many different hardware platforms including mobile devices and embedded devices. In these types of devices hardware constraints may limit the usefulness of neural networks where high accuracy cannot be achieved efficiently.

Quantization methods may reduce the memory footprint and inference time of neural networks. Quantization compresses the data in a neural network from large floating point representations to smaller fixed-point representations. Lower bit-width quantization permits greater compression and acceleration. However, lowering the bit-width too far may reduce accuracy to an unacceptable degree.

SUMMARY

It is an object of the invention to provide a method for determining mixed-precision quantization parameters of a neural network that improves the performance of the neural network, e.g. with regard to the quality and efficiency of the neural network, e.g. by reducing memory footprint and the complexity of operations through optimized quantization parameters.

The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect, a method for determining mixed-precision quantization parameters to quantize a neural network is provided. The neural network comprises a plurality of layers. Each layer is associated with a weight vector comprising multiple floating point data values. The neural network is trained on a training dataset and the weight vectors are selected to minimise a first loss function associated to the neural network. The method comprises determining a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters and modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function. Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.

The method according to the first aspect provides a general purpose method for determining quantization parameters to quantize a neural network to mixed-precision. The method may be, for example, implemented or executed by one or more processors. The data used for training (and later for inference, also referred to as operation) may be pictures, e.g. still picture or video pictures, or respective picture data, audio data, any other measured or captured physical data or numerical data.

In a first implementation form the second loss function comprises the first loss function, a first regularization component and a second regularization component.

In a second implementation form the first regularization component is selected to constrain the quantized weight vector for each layer to a pre-determined range of values.

In a third implementation form the second regularization component is selected to constrain quantized input data values to each layer of the corresponding quantized neural network to a pre-determined range of values.

In a fourth implementation form the first and/or second regularization components comprise functions that depend continuously on the data values of the vector of quantization parameters.

In a fifth implementation form modifying the weight vectors and the vector of quantization parameters comprises determining a local minimum of the second loss function.

In a sixth implementation form the local minimum is determined according to a gradient descent method.

In a seventh implementation form the method comprises accessing a validation dataset; and evaluating, for each one of multiple validation vectors in the validation dataset, the quantized neural network on the basis of the quantization parameters.

In an eighth implementation form determining the vector of quantization parameters on the basis of a size of the weight vectors comprises: determining a sum of the sizes of the multiple floating point data values of the weight vectors; and generating a parameter surface on the basis of the determination.

In a ninth implementation form the parameter surface comprises a portion of an ellipsoidal surface.

In a tenth implementation form the method comprises generating a set of size parameters; and selecting the first regularization component to train the quantization parameters to the set of size parameters.

In an eleventh implementation form the second regularization component is selected to constrain the quantized input data values to each layer of the corresponding quantized neural network to the set of bit-width parameters associated to the first regularization component.

In a twelfth implementation form the method comprises quantizing input data values to each layer of the corresponding quantized neural network to a pre-determined bit-width.

In a thirteenth implementation form the method comprises determining a further vector of quantization parameters on the basis of a size of the input data values of each layer of the neural network, evaluating the second loss function on the basis of the further vector and modifying the further vector on the basis of the evaluation.

In a fourteenth implementation form the method comprises performing inference based on an input to a neural network. The fourteenth implementation form comprises determining quantization parameters for the neural network according to the method according to the first aspect, determining a quantized neural network corresponding to the neural network on the basis of the quantization parameters and evaluating the quantized neural network on the basis of the input to infer an output.

According to a second aspect, a method for operating a neural network is provided. The method comprises: obtaining data to be processed by the neural network; and processing the data by the neural network, wherein the neural network is configured by quantization parameters obtained or obtainable according to any of the methods described above and herein.

The processing of the neural network may comprise processing of pictures or other data for signal enhancement (e.g. picture enhancement, e.g. for super resolution), denoising (e.g. still or video picture denoising), speech and audio processing (e.g. natural language processing, NLP) or other purposes.

According to a third aspect a computer program to perform the method according to the first or second aspect is provided.

According to a fourth aspect a non-transitory computer readable medium is provided. The non-transitory computer readable medium comprises instructions that, when executed by a processor, cause the processor to perform the method according to the first or second aspect.

According to a fifth aspect a computing system for determining mixed-precision quantization parameters to quantize a neural network is provided. The system comprises at least one processor and at least one memory including program code which, when executed by the at least one processor, provides instructions to determine a vector of quantization parameters on the basis of a size of the weight vectors; and, for each one of multiple training vectors of the training dataset: evaluate a second loss function on the basis of the training vector and the vector of quantization parameters; and modify the weight vectors and the vector of quantization parameters to minimize an output of the second loss function. Each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present disclosure, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

Figure 1 shows a schematic diagram of a neural network, according to an example.

Figure 2 shows a diagram of a portion of a neural network, according to an example.

Figure 3 shows a graph of a regularizer function, according to an example.

Figure 4 shows a diagram of a parameter surface, according to an example.

Figure 5 shows a flow diagram of a method for determining quantization parameters, according to an example.

Figure 6 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision.

Figure 7 shows a diagram of mixed precision bit-widths for layers of a neural network, according to an example.

Figure 8 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision.

Figure 9 shows a diagram of mixed-precision bit-widths for layers of a neural network, according to an example.

Figure 10 shows a table comparing quantization methods for quantization of MobileNet_v2 on Imagenet.

Figure 11 shows a table of distribution of bit-widths, according to an example.

Figure 12 is a block diagram of a computing system that may be used for implementing the devices and methods disclosed herein.

DETAILED DESCRIPTION

Example embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein.

Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.

The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof.

Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.

Quantization methods may reduce the memory footprint and inference time in neural networks. A neural network may comprise several computing blocks of different sizes, each of which can be quantized into any bit-width. By quantizing different blocks of a neural network into different bit-widths, neural networks with different degrees of compression, acceleration and quality may be achieved.
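By way of a simplified illustration only, the sketch below shows uniform quantization of a tensor of 32-bit floating point weights onto a small signed integer grid together with a per-tensor scale factor. The function names and the per-tensor scaling scheme are illustrative assumptions and do not form part of the embodiments described below.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bit_width: int):
    """Illustrative per-tensor uniform quantization to a signed integer grid."""
    levels = 2 ** bit_width
    scale = np.max(np.abs(w)) / (levels / 2 - 1)       # per-tensor scale factor
    q = np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1)
    return q.astype(np.int32), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024).astype(np.float32)           # 32-bit weights
q8, s8 = quantize_uniform(w, bit_width=8)               # 8-bit grid: 256 levels
q4, s4 = quantize_uniform(w, bit_width=4)               # 4-bit grid: 16 levels
print(np.abs(w - dequantize(q8, s8)).max())             # small reconstruction error
print(np.abs(w - dequantize(q4, s4)).max())             # larger reconstruction error
```

Lower bit-widths reduce the stored size of a block but increase the reconstruction error, which is the trade-off that mixed-precision quantization balances across the blocks of a model.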

For example, an inference model represented by a neural network may comprise three blocks that fill 20%, 30% and 50% of the model size respectively. All weights of the full precision model use a 32-bit floating point representation. The accuracy of the full precision model may be equal to 99% for a compression ratio of 1; the accuracy of the full 8-bit model, with all weights quantized to 8 bits, may be equal to 98% with a compression ratio of 4; the accuracy of the model in which the first block (20%) is quantized to a 4-bit representation and the rest to an 8-bit representation may be equal to 97% with a compression ratio of 4.44; the accuracy of the model in which the second block (30%) is quantized to a 4-bit representation and the rest to 8 bits may be equal to 65% with a compression ratio of 4.71; and the accuracy of the model in which the third block (50%) is quantized to 4 bits and the rest to 8 bits may be equal to 55% with a compression ratio of 5.33. A user that requires accuracy higher than 65% may select the quantized model accordingly from among a number of quantized models for the neural network to achieve the desired accuracy.
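The compression ratios in this example follow directly from the block fractions and bit-widths; the short sketch below, with illustrative names only, reproduces that arithmetic.

```python
# Blocks fill 20%, 30% and 50% of the model; the full precision model is 32-bit.
fractions = [0.20, 0.30, 0.50]

def compression_ratio(bit_widths):
    # Average bits per weight after quantization, weighted by block size.
    quantized_bits = sum(f * b for f, b in zip(fractions, bit_widths))
    return 32.0 / quantized_bits

print(round(compression_ratio([8, 8, 8]), 2))   # 4.0  (all blocks 8-bit)
print(round(compression_ratio([4, 8, 8]), 2))   # 4.44 (first block 4-bit)
print(round(compression_ratio([8, 4, 8]), 2))   # 4.71 (second block 4-bit)
print(round(compression_ratio([8, 8, 4]), 2))   # 5.33 (third block 4-bit)
```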

Mixed precision quantization therefore permits flexible control between quality and compression. Some methods produce models with different compression and quality, among which a user may have to manually select the final result. In contrast, the methods and systems described herein allow the user to set the required compression ratio of the model. The methods described herein make use of a family of functions φ that depend smoothly on a set of parameters that determine the number of their minima. Such functions may be used as regularizers for training quantized models. In particular, in the methods and systems described herein, the solution of the problem of minimizing the loss of a model in a region Ω of limited quantization errors of weights and activations,

E_{x~D} [ Loss(x, W) ] → min,

may be reduced to solving the problem

E_{x~D} [ Loss(x, W) ] + λ_w · Σ_i φ(Ŵ_i) + λ_a · Σ_i φ(Â_i) → min    (1)

in the domain of definition of the parameters (W, s_w, s_a), for some numbers λ_w and λ_a.

In equation (1), D is the input data distribution; Ŵ_i = W_i / s_{w_i} and Â_i = A_i / s_{a_i} are the quantized weights and quantized inputs to the layers of the neural network, and s_{w_i} and s_{a_i} are the corresponding scale factors of the weights and inputs.

Figure 1 is a simplified schematic diagram of a neural network architecture 100, according to an example. The neural network architecture 100 may be used to train a modified neural network that is constructed from a neural network. Each layer of the neural network may comprise a number of nodes which are interconnected with nodes of the preceding and subsequent layer of the neural network. The nodes may be associated with data values referred to herein as weights. Each layer may be associated with a weight vector comprising the vector of weights for the nodes of the layer. The weights of the neural network are specified by floating-point data values.

In Figure 1 , training data from a training data set 110 is fed into the neural network architecture 100 as input data 120. The neural network architecture 100 comprises a modified neural network 130, which is generated from the underlying neural network, according to the methods described herein. Once trained, the modified neural network 130 comprises a plurality of blocks of layers, where blocks are quantized to a specific size. In examples, references to the “size” of a data value herein may refer to the bit-width of a data value. The “model size” may refer to the sum of the bit-widths of weight vectors of a neural network. To construct the modified neural network 130, a user specifies a required model size.

According to examples, training the modified neural network 130 comprises initialising a set of trainable variables. These trainable variables optimize the sizes of the quantized weights for the model size specified by the user. The neural network architecture 100 comprises a regularization computation 140. The regularization computation 140 is an accumulation of computations from each layer of the modified neural network 130. The computations are generated from evaluating functions in the family of regularizer functions φ previously described. In Figure 1 the block 150 comprises an output of the modified neural network 130. Block 160 comprises a loss function computation that is determined from the output 150 of the modified neural network 130 and the regularization computation 140. During training, the parameters of the modified neural network 130, including the size parameters for quantizing the weights, are updated on the basis of the computation represented by block 160.

The data used for training (and later for inference, also referred to as operation) may be pictures, e.g. still picture or video pictures, or respective picture data, audio data, any other measured or captured physical data or numerical data.

Figure 2 is a simplified schematic diagram 200 of an intermediate layer of the modified neural network 130 shown in Figure 1. The block 210 shown in Figure 2 comprises input values to an intermediate layer 220 of the modified neural network 130. The block 230 comprises a quantization layer that is applied to the input values 210. Block 240 comprises a local computation of the regularizer functions φ, which may be generated on the basis of the quantization parameters, the input data and the weight vector for the intermediate layer. The block 250 comprises a global accumulator of regularizer terms which are input to the loss function computation 160 in Figure 1.

Figure 3 shows a graph of a regularizer function φ, according to an example. The family of regularizer functions φ(x, t) is defined in such a way that the functions depend smoothly on a parameter t and at each moment have 2⌊t⌋ roots (where ⌊·⌋ denotes the floor), which are global minima of the functions. Functions from this class are constructed in such a way that if φ(x, t) is close to 0, then the components of x are close to a grid of integers from the segment [−t, t − 1] in the case of φ_int, or alternatively from the segment [0, 2t − 1] in the case of φ_uint. According to examples described herein, the loss function computation 160 for the neural network architecture 100 comprises minimising the following loss function:

L_Q = E_{x~D} [ Loss(x, W) ] + λ_w · Σ_i φ(Ŵ_i, t_{w_i}) + λ_a · Σ_i φ(Â_i, t_{a_i})    (2)

In equation (2), the variables t_{w_i} and t_{a_i} define the sizes of the weights and inputs for the different quantized layers of the modified neural network 130, where each weight vector and input vector is quantized to a grid of integers consisting of 2·[t] (rounded) elements. For example, if t = 128, then the corresponding vector is quantized to a grid of integers consisting of 256 elements, that is, to 8 bits.
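Purely as an illustration of the properties described above, the following sketch uses an assumed sin²-based penalty whose minima lie on the integer grid inside the relevant segment, and assembles a loss of the general shape of equation (2); the specific penalty terms, the function names and the per-layer bookkeeping are illustrative assumptions rather than the claimed formulation.

```python
import numpy as np

def phi_int(x: np.ndarray, t: float) -> np.ndarray:
    """Illustrative regularizer: zero exactly when the components of x lie on
    the integer grid inside [-t, t - 1]; grows quadratically outside it."""
    inside = np.sin(np.pi * x) ** 2                     # minima at every integer
    below = np.maximum(-t - x, 0.0) ** 2                # penalty left of -t
    above = np.maximum(x - (t - 1), 0.0) ** 2           # penalty right of t - 1
    return inside + below + above

def phi_uint(x: np.ndarray, t: float) -> np.ndarray:
    """Unsigned variant with minima on the integer grid inside [0, 2t - 1]."""
    return (np.sin(np.pi * x) ** 2
            + np.maximum(-x, 0.0) ** 2
            + np.maximum(x - (2 * t - 1), 0.0) ** 2)

def loss_q(task_loss, scaled_weights, t_w, scaled_acts, t_a, lam_w, lam_a):
    """Loss of the general shape of equation (2): task loss plus weight and
    activation regularizers with (possibly trainable) size parameters t."""
    reg_w = sum(phi_int(w, t).mean() for w, t in zip(scaled_weights, t_w))
    reg_a = sum(phi_uint(a, t).mean() for a, t in zip(scaled_acts, t_a))
    return task_loss + lam_w * reg_w + lam_a * reg_a
```

As t grows, more integers fall inside the segment, so the number of minima of the penalty increases smoothly with t, which is the property the family φ(x, t) is described as having.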

Considering these variables as independent trainable variables, there are no constraints on the layers' bit-widths. In particular, the bit-widths of all layers may become large. Given a neural network comprising n model blocks of sizes k_i having bit-widths b_i, the size of the quantized portion of the model is:

Σ_{i=1}^{n} k_i · b_i.

n − 1 independent variables θ_i ∈ [0, 1] may be defined that parameterize the first quadrant (x_i > 0) of the surface of the ellipsoid given by the equation

Σ_{i=1}^{n} k_i · x_i² = C    (3)

Figure 4 shows a diagram of a quadrant 400 of an ellipsoid parameterized by the variables θ_i, which are related to the variables x_i through the parameterization of the ellipsoid surface. According to examples described herein, a quantized model is trained by minimizing the loss L_Q given by equation (2) relative to the variables (W, s, θ), and then quantizing the scaled weights of each layer of the neural network to a grid of integers consisting of 2·⌊t_i⌋ = 2·⌊2^(x_i² − 1)⌋ elements on each validation. If x_i² ∈ ℤ, then the size of the quantization grid is an exact power of two. The values x_i² therefore act as continuous analogues of the bit-widths of the weights. Equality in equation (3) is always satisfied, and therefore quantized models of approximately the same size are trained.
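As an illustration of how such a parameterization guarantees the size constraint, the sketch below uses an assumed normalized spherical-style parameterization: n − 1 free variables θ_i ∈ [0, 1] are mapped to a point on the first quadrant of the ellipsoid of equation (3), and the squared coordinates are returned as continuous bit-width analogues. The particular mapping, names and numerical values are assumptions for illustration only.

```python
import numpy as np

def bitwidths_on_ellipsoid(theta: np.ndarray, k: np.ndarray, C: float) -> np.ndarray:
    """Map n-1 free variables theta in [0, 1] to a point x on the first quadrant
    of the ellipsoid sum_i k_i * x_i**2 = C, and return x**2, the continuous
    bit-width analogues (assumed spherical-style parameterization)."""
    angles = theta * (np.pi / 2)              # restrict to the first quadrant
    y = np.ones(len(theta) + 1)
    for i, a in enumerate(angles):            # unit-sphere coordinates, y_i >= 0
        y[i] *= np.cos(a)
        y[i + 1:] *= np.sin(a)
    x = y * np.sqrt(C / k)                    # rescale axes so sum k_i x_i^2 = C
    return x ** 2

k = np.array([1000.0, 4000.0, 2000.0])        # block sizes (number of weights)
C = float((k * 6).sum())                      # target model size: 6 bits per weight on average
b = bitwidths_on_ellipsoid(np.array([0.5, 0.5]), k, C)
print(b, (k * b).sum(), C)                    # the constraint holds for any theta
```

Because the constraint of equation (3) is satisfied by construction for every value of the free variables, gradient-based training can adjust the continuous bit-widths freely while the total model size stays fixed at the value chosen by the user.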

Many existing and future processors only support arithmetic at certain fixed bit-widths. For example, many GPUs support 4 and 8-bit calculations but do not support 5, 6, and 7-bit arithmetic. In examples described herein, regularizer functions for specific bit-width parameters may be defined that provide stabilization of bit-width values on a pre-determined fixed set.

In order for the size values to stabilize at integers, a sinusoidal regularizer defined by sin²(πx_i²) may be added to the loss function in equation (2). In the case where a hardware-specific set of bit-widths is desired for a quantized model, a special regularizer function may be added to L_Q to contract the bit-width values of the weights and activations to the specific set. For example, a user or other entity may define a set of required bit-width values such as {4, 8, 16}. As a result, the bit-widths of the weights of the layers of the resulting model are equal to 4, 8 and 16 bits only.
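As an illustration, the sketch below shows the sin²(πx²) term, which vanishes exactly when the continuous bit-width x² is an integer, together with one assumed way of building a smooth penalty whose zeros are restricted to a hardware-specific set such as {4, 8, 16}; the product-of-quadratics construction is an illustrative assumption, not the claimed regularizer.

```python
import numpy as np

def integer_bitwidth_reg(x: np.ndarray) -> np.ndarray:
    """Vanishes exactly when the continuous bit-width x**2 is an integer."""
    return np.sin(np.pi * x ** 2) ** 2

def allowed_set_reg(x: np.ndarray, allowed=(4, 8, 16)) -> np.ndarray:
    """Illustrative smooth penalty that is zero only when x**2 equals one of
    the allowed bit-widths (e.g. a hardware-supported set)."""
    b = x ** 2
    penalties = np.stack([(b - a) ** 2 for a in allowed])
    return np.prod(penalties, axis=0)         # zero iff b hits a value in the set

x = np.sqrt(np.array([4.0, 5.3, 8.0, 16.0]))
print(integer_bitwidth_reg(x))                # ~0 for the integer bit-widths 4, 8, 16
print(allowed_set_reg(x))                     # zero only at 4, 8 and 16
```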

Figure 5 is a block diagram of a method 500 for determining mixed-precision quantization parameters to quantize a neural network, according to an example. The method 500 may be used with the other methods and examples described herein. The method 500 may be used with any neural network comprising a plurality of layers, where each layer is associated with a weight vector comprising multiple floating point data values and the weight vectors are selected to minimize a first loss function associated to the neural network.

At block 510, the method 500 comprises determining a vector of quantization parameters on the basis of a size of the weight vectors. According to examples described herein, the vector of quantization parameters comprises the vector of trainable variables previously defined. In particular, according to examples described herein, each one of the quantization parameters of the vector of quantization parameters constrains the size of a quantized weight vector for a layer of a quantized neural network corresponding to the weight vector for the respective layer of the neural network. Determining the vector of quantization parameters on the basis of a size of the weight vectors may comprise determining a sum of the sizes of the multiple floating point data values of the weight vectors and generating a parameter surface, such as the ellipsoidal parameter surface 400 shown in Figure 4, on the basis of the determination.

At block 520 the method 500 comprises, for each training vector in the training data set, evaluating a second loss function on the basis of the training vector and the vector of quantization parameters. According to examples described herein, the second loss function may be the loss function defined in equation (2). In particular, the second loss function may comprise the first loss function associated to the neural network, a first regularization component and a second regularization component.

According to examples, the first regularization component may be selected to constrain the quantized weight vector for each layer to a pre-determined range of values. The second regularization component may also be selected to constrain quantized input data values to each layer of the corresponding quantized neural network to a pre-determined range of values. Furthermore, in examples, the first and/or second regularization components comprise functions that depend continuously on the data values of the vector of quantization parameters. These properties are achieved according to the functions φ(x, t) previously defined.

At block 530, the method 500 comprises modifying the weight vectors and the vector of quantization parameters to minimize an output of the second loss function. According to examples, minimizing an output of the second loss function may comprise determining a local minimum of the second loss function. This may be performed using a gradient descent method.
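A minimal sketch of blocks 510 to 530 as a training loop is given below, assuming a PyTorch-style setup in which one size parameter t and one scale s are trained per weight tensor, a torch-compatible regularizer phi of the kind sketched earlier is available, and the activation regularization term is omitted for brevity; the function and parameter names are illustrative assumptions.

```python
import torch

def train_quantization_parameters(model, phi, train_loader, task_loss_fn,
                                  t_w, s_w, lam_w=0.01, lr=1e-3, epochs=1):
    """Blocks 520 and 530: for each training vector, evaluate the combined
    loss and update both the weights and the quantization parameters."""
    weights = list(model.parameters())
    # t_w and s_w are lists of trainable tensors (requires_grad=True),
    # one per weight tensor of the model.
    opt = torch.optim.SGD(weights + list(t_w) + list(s_w), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            task_loss = task_loss_fn(model(x), y)            # first loss function
            reg_w = sum(phi(w / s, t).mean()                 # weight regularizer
                        for w, s, t in zip(weights, s_w, t_w))
            loss = task_loss + lam_w * reg_w                 # second loss function
            loss.backward()
            opt.step()                                       # gradient descent step
    return t_w, s_w
```

After training, each weight tensor would be quantized to a grid of 2·⌊t⌋ integers using its trained scale factor, as described above.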

The method 500 may further comprise accessing a validation dataset and evaluating, for each one of multiple validation vectors in the validation dataset, the quantized neural network on the basis of the quantization parameters.

The method 500 may further comprise applying quantization to the input vectors (also known as activations) of the respective layers of the neural network. The method of determining quantization parameters for activations may be similar to the method 500 for determining quantization parameters for the weights. In particular, the method 500 may further comprise determining a further vector of quantization parameters on the basis of a size of the input data values of each layer of the neural network, evaluating the second loss function on the basis of the further vector and modifying the further vector on the basis of the evaluation. Alternatively, the inputs may be quantized using the same quantization parameters as the weights, or to a pre-determined bit-width.

Figure 6 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision. All weights are quantized except the first and last layers, so that 0.91% of the model remains 32-bit. The table shows a comparison of different quantization methods with non-quantized activations, including SinReQ and quantization using a smooth regularizer with fixed bit-width (QSin). In this case the model accuracy using the methods described herein is 91.83%.

Figure 7 shows a diagram of mixed-precision bit-widths for layers of ResNet-20, quantized using the mixed-precision quantization method described herein. The majority of layers are quantized to 4 bits.

Figure 8 shows a table showing a comparison of quantization methods for quantization of ResNet-20 with mixed-precision with 4-bit activations. In the table shown in Figure 8, the method described herein is compared with DoReFa, PACT, SinReQ and QSin. The method described herein demonstrates superior accuracy for mixed-precision quantized models over quantized 4-bit models of the same total model size. Full precision model accuracy is 91.73%.

Figure 9 shows a diagram of mixed precision bit-widths for layers of ResNet-20, with 4-bit quantization of activations.

Figure 10 shows a table comparing quantization methods for quantization of MobileNet_v2 on Imagenet. Weights of all model layers are quantized and 1% of the model remains 32-bit, namely biases and batch norms. Activations are quantized to 8 bits. The method is compared with two quantization methods: TensorFlow 8-bit quantization using the straight-through estimator, and a mixed precision DNN method. Full precision model accuracy is 71.88%.

Figure 11 shows a table of distributions of bit-widths, using the method described herein.

The methods and systems described provide a general approach to achieving mixed-precision quantization of any neural network architecture, independent of layer types, activation functions or network topology. Furthermore, the method described shows improved results in classification, regression and image enhancement tasks. The method is memory efficient: full precision weights are used in both forward and backward propagation during training, without rounded weights, and there is no need to store multiple instances of models with different bit-widths. Model training may be performed using gradient descent, giving an improved convergence rate. The method provides the ability to explicitly set constraints on the overall model size. Furthermore, the model may be trained to constrain the quantization parameters to a specific set of bit-widths, such as special hardware-specific bit-widths.

Figure 12 is a block diagram of a computing system 1200 that may be used for implementing the methods disclosed herein.

Specific devices may utilize all of the components shown or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a device may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The computing system 1200 includes a processing unit 1202. The processing unit includes a central processing unit (CPU) 1214, memory 1208, and may further include a mass storage device 1204, a video adapter 1210, and an I/O interface 1212 connected to a bus 1220.

The bus 1220 may be one or more of any type of several bus architectures including a memory bus or memory controller, a peripheral bus, or a video bus. The CPU 1214 may comprise any type of electronic data processor. The memory 1208 may comprise any type of non-transitory system memory such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous DRAM (SDRAM), read-only memory (ROM), or a combination thereof. In an embodiment, the memory 1208 may include ROM for use at boot-up, and DRAM for program and data storage for use while executing programs.

The mass storage 1204 may comprise any type of non-transitory storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via the bus 1220. The mass storage 1204 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, or an optical disk drive.

The video adapter 1210 and the I/O interface 1212 provide interfaces to couple external input and output devices to the processing unit 1202. As illustrated, examples of input and output devices include a display 1218 coupled to the video adapter 1210 and a mouse, keyboard, or printer 1216 coupled to the I/O interface 1212. Other devices may be coupled to the processing unit 1202, and additional or fewer interface cards may be utilized. For example, a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for an external device. The processing unit 1202 also includes one or more network interfaces 1206, which may comprise wired links, such as an Ethernet cable, or wireless links to access nodes or different networks. The network interfaces 1206 allow the processing unit 1202 to communicate with remote units via the networks. For example, the network interfaces 1206 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, the processing unit 1202 is coupled to a local-area network 1222 or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, or remote storage facilities.

It should be appreciated that one or more steps of the embodiment methods provided herein may be performed by corresponding units or modules. For example, a signal may be transmitted by a transmitting unit or a transmitting module. The respective units or modules may be hardware, software, or a combination thereof. For instance, one or more of the units or modules may be an integrated circuit, such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).

Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

The present inventions can be embodied in other specific apparatus and/or methods. The described embodiments are to be considered in all respects as illustrative and not restrictive. In particular, the scope of the invention is indicated by the appended claims rather than by the description and figures herein. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.